Multi-modal large model retrieval method and apparatus for field of security checks

By integrating multimodal heterogeneous data in the security inspection field through a multimodal large model retrieval method, the problem of insufficient flexibility and intelligence of traditional retrieval methods is solved. This enables efficient, accurate retrieval and in-depth utilization of massive security inspection data, improving the flexibility and identification accuracy of the security inspection process.

WO2026129607A1 · PCT designated stage · Publication Date: 2026-06-25 · NUCTECH CO LTD +1

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: NUCTECH CO LTD
Filing Date: 2025-06-27
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively integrate multimodal heterogeneous data in the security inspection field, failing to meet the complex and ever-changing user query needs. Traditional retrieval methods lack flexibility and intelligence, failing to fully utilize valuable unstructured data resources.

Method used

Employing a multimodal large-model retrieval method, combining image segmentation, object detection expert models, and large models, this method generates accurate cargo identification results through image patch extraction, database retrieval, and intent recognition. This overcomes the limitations of image recognition range and provides an intuitive interactive interface.

Benefits of technology

It enables efficient and accurate retrieval and in-depth utilization of massive amounts of multimodal heterogeneous security inspection data, enhancing the flexibility and intelligence of the security inspection process, meeting diverse user needs, and improving identification accuracy and work efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

Provided in the present disclosure are a multi-modal large model retrieval method and apparatus for the field of security checks, an electronic device, a storage medium, and a program product. The method comprises: receiving a user input comprising a target text and a target image, and processing the user input on the basis of a retrieval enhancement process, comprising: first retrieving a first image block to be recognized in the target image from a security check database to obtain a first retrieval result, the first retrieval result comprising at least one first security check image and first text record data of an image block in the first security check image successfully matched with the first image block; then using a first prompt word to instruct a large model to determine the type of goods in the first image block on the basis of the first text record data in the first retrieval result, so as to obtain a first goods recognition result; and then using a preset second prompt to instruct the large model to generate first reply content for the target text in light of the first goods recognition result.


Description

Multimodal large model retrieval method and device for security inspection field

[0001] This disclosure claims priority to Chinese Patent Application No. 202411855183.5, filed on December 16, 2024, the contents of which are incorporated herein by reference.

Technical Field

[0002] This disclosure relates to the field of security inspection, specifically to a multimodal large-scale model retrieval method, device, electronic device, storage medium, and program product for the security inspection field.

Background Technology

[0003] With the increasingly severe global security situation and the ever-increasing demands for security checks, security work is facing unprecedented challenges. The amount of data in security check scenarios is increasing dramatically, and the data types are becoming increasingly diverse, including multiple modalities such as scanned images of items, text reports, video surveillance, and biometrics. This massive, heterogeneous, and multimodal data contains rich security clues and is a valuable resource for data mining and intelligent analysis. At the same time, multimodal and large-scale model technologies have developed rapidly in recent years, opening up new avenues for the comprehensive processing of massive structured and unstructured data. Multimodal technology, with its excellent feature fusion and alignment capabilities across different modalities, can effectively improve the accuracy of joint retrieval of multimodal data, while large-scale models, with their superior semantic understanding capabilities, demonstrate enormous potential in accurately capturing and interpreting users' diverse and complex search needs.

[0004] However, current technologies still face many challenges in optimizing for the specific application scenario of security inspection. For example, security inspections involve numerous X-ray images, each with a large and complex set of data records, and often involve various declaration forms. Furthermore, the requirements may differ across security inspection scenarios. Some scenarios may only require determining the presence of prohibited items, while others (such as customs) require further verification of consistency with declared goods. Moreover, the types of prohibited items may vary depending on the specific security requirements. Due to these complexities and unique characteristics, there is currently a lack of a seamless solution that can effectively integrate various heterogeneous security inspection data, thereby maximizing the use of accumulated security data to facilitate the security inspection process.

Summary of the Invention

[0005] In view of this, this disclosure provides a multimodal large-scale model retrieval method, device, electronic device, medium, and program product for the security inspection field. With the help of the large-scale model, the diverse and personalized needs of users in the security inspection field can be captured, and with the help of multimodal technology, various heterogeneous security inspection data can be retrieved according to those needs. The large amount of multimodal and heterogeneous data generated during security inspection can thus be fully utilized to provide strong security clues for the security inspection process.

[0006] A first aspect of this disclosure provides a multimodal large-scale model retrieval method for the security inspection field. The method includes: receiving user input, the user input including target text and target image; and processing the user input according to a retrieval enhancement process.

[0007] The step of processing the user input according to the retrieval enhancement process includes: extracting image blocks to be identified from the target image to obtain at least one first image block; retrieving the first image block from the security inspection database to obtain a first retrieval result, the first retrieval result including at least one first security inspection image and first text record data of image blocks in the first security inspection image that successfully match the first image block, wherein the security inspection database is a database formed based on historical security inspection data, the historical security inspection data including at least security inspection images and text record data of security inspection images; inputting the first text record data from the first retrieval result into a large model, and using a preset first prompt word to prompt the large model to determine the type of goods in the first image block based on the input first text record data to obtain a first goods identification result; obtaining an overall goods identification result based on the first goods identification result corresponding to at least one first image block; inputting the overall goods identification result and the target text into the large model, and using a preset second prompt word to prompt the large model to generate a first response content for the target text based on the overall goods identification result, and outputting the first response content.

[0008] According to one embodiment of this disclosure, processing the user input according to the retrieval enhancement process further includes: identifying the first image block using at least one object detection expert model to obtain a second goods identification result, wherein the object detection expert model is a machine learning model based on image recognition. Correspondingly, obtaining the overall goods identification result based on the first goods identification result corresponding to at least one first image block further includes: obtaining the overall goods identification result based on the first goods identification result corresponding to at least one first image block and the second goods identification result.

[0009] According to one embodiment of this disclosure, the at least one object detection expert model includes at least one contraband detection model, wherein each contraband detection model is used to detect one type of contraband.

[0010] According to an embodiment of this disclosure, the step of extracting an image block to be identified from the target image to obtain at least one first image block includes: determining a region to be identified in the target image based on the content of the target text, wherein the region to be identified includes the entire target image or a user-specified region in the target image; and segmenting at least one first image block from the region to be identified based on image texture features.

[0011] According to one embodiment of this disclosure, the method further includes: identifying the user intent of the target text using an intent recognition model; and determining the user intent category based on the user intent and a preset intent classification. Wherein, when the user intent category is a retrieval enhancement-related intent category, the user input is processed according to a retrieval enhancement process, wherein the intent in the retrieval enhancement-related intent category includes recognizing goods in an image.

[0012] According to one embodiment of this disclosure, the method further includes: when the user intent category is a general retrieval intent category, processing the user input according to a general retrieval process, wherein the intent in the general retrieval intent category includes retrieving data from the security inspection database. The step of processing the user input according to the general retrieval process includes: extracting an image block to be retrieved from the target image to obtain at least one second image block; retrieving the second image block from the security inspection database to obtain a second retrieval result, the second retrieval result including at least one second security inspection image, and tag information and second text record data of an image block in the second security inspection image that successfully matches the second image block; and using a preset third prompt word to prompt the large model to organize the second retrieval result according to the target text before outputting it.
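Purely as an illustration, the intent-based routing of paragraphs [0011] and [0012] might be sketched as follows. The disclosure does not specify the intent recognition model; the keyword lists below are a hypothetical stand-in for it, and the names `IntentCategory`, `classify_intent`, and `route` are assumptions introduced only for this sketch.

```python
from enum import Enum


class IntentCategory(Enum):
    RETRIEVAL_ENHANCEMENT = "retrieval_enhancement"  # e.g. recognizing goods in an image
    GENERAL_RETRIEVAL = "general_retrieval"          # e.g. retrieving data from the database
    OTHER = "other"


# Hypothetical keyword hints standing in for a trained intent recognition model.
_ENHANCEMENT_HINTS = ("what is", "identify", "recognize", "contraband", "prohibited")
_GENERAL_HINTS = ("find", "search", "retrieve", "similar", "history")


def classify_intent(target_text: str) -> IntentCategory:
    """Map the target text to one of the preset intent categories."""
    text = target_text.lower()
    if any(hint in text for hint in _ENHANCEMENT_HINTS):
        return IntentCategory.RETRIEVAL_ENHANCEMENT
    if any(hint in text for hint in _GENERAL_HINTS):
        return IntentCategory.GENERAL_RETRIEVAL
    return IntentCategory.OTHER


def route(target_text: str) -> str:
    """Return the name of the process that should handle the user input."""
    category = classify_intent(target_text)
    if category is IntentCategory.RETRIEVAL_ENHANCEMENT:
        return "retrieval_enhancement_process"
    if category is IntentCategory.GENERAL_RETRIEVAL:
        return "general_retrieval_process"
    return "fallback"
```

In this sketch, a question about what an image contains is sent to the retrieval enhancement process, while a request to look up historical records is sent to the general retrieval process.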

[0013] According to one embodiment of this disclosure, the step of retrieving the second image block in the security inspection database to obtain the second retrieval result further includes: preprocessing the target text to determine retrieval scope information; and retrieving the second image block in the security inspection database according to the retrieval scope information to obtain the second retrieval result.

[0014] According to one embodiment of this disclosure, the security inspection database includes a structured database and an unstructured database, wherein the unstructured database includes at least one of the following: a vector database or a graph database.

[0015] According to one embodiment of this disclosure, the target image includes a perspective image, and the security inspection image includes a perspective image.

[0016] A second aspect of this disclosure provides a multimodal large-scale model retrieval device for the security inspection field. The device includes: a text-image interaction module, a central scheduling module, an image segmentation expert module, a retrieval device, and a post-retrieval processing module.

[0017] The text-image interaction module is used to receive user input, which includes target text and target image.

[0018] The central scheduling module is used to process the user input according to the retrieval enhancement process, wherein the central scheduling module calls the image segmentation expert module, the retrieval device, and the post-retrieval processing module to process the user input.

[0019] Specifically, the image segmentation expert module is used to extract the image block to be identified from the target image to obtain at least one first image block.

[0020] The retrieval device is used to retrieve the first image block in the security inspection database to obtain a first retrieval result. The first retrieval result includes at least one first security inspection image and first text record data of the image block in the first security inspection image that successfully matches the first image block. The security inspection database is a database formed based on historical security inspection data, which includes at least security inspection images and text record data of the security inspection images.

[0021] The post-retrieval processing module is used to: input the first text record data in the first retrieval result into the large model, and use a preset first prompt word to prompt the large model to determine the type of goods in the first image block according to the input first text record data to obtain a first goods identification result, and obtain an overall goods identification result based on the first goods identification result corresponding to at least one first image block; and input the overall goods identification result and the target text into the large model, and use a preset second prompt word to prompt the large model to generate a first response content for the target text according to the overall goods identification result.

[0022] The text-image interaction module is also used to output the first response content.

[0023] According to one embodiment of this disclosure, the apparatus further includes various expert modules, wherein at least one object detection expert model is deployed in each of the various expert modules, and the object detection expert model is a machine learning model based on image recognition. The central scheduling module is further configured to invoke the various expert modules. The various expert modules are configured to use at least one object detection expert model to identify the first image block to obtain a second goods identification result. The post-retrieval processing module is further configured to: obtain the overall goods identification result based on the first goods identification result corresponding to at least one first image block and the second goods identification result.

[0024] According to one embodiment of this disclosure, the central scheduling module is further configured to: identify the user intent of the target text using an intent recognition model; determine the user intent category based on the user intent and a preset intent classification; and process the user input according to a retrieval enhancement process when the user intent category is a retrieval enhancement-related intent category, wherein the intent in the retrieval enhancement-related intent category includes recognizing goods in an image.

[0025] According to one embodiment of this disclosure, the central scheduling module is further configured to: when the user intent category is a general retrieval intent category, process the user input according to a general retrieval process, wherein the intent in the general retrieval intent category includes retrieving data from the security check database. Specifically, the central scheduling module invokes the image segmentation expert module, the retrieval device, and the post-retrieval processing module to process the user input according to the general retrieval process.

[0026] Specifically, the image segmentation expert module is used to: extract the image block to be retrieved from the target image to obtain at least one second image block.

[0027] The retrieval device is used to retrieve the second image block in the security inspection database to obtain a second retrieval result. The second retrieval result includes at least one second security inspection image, as well as the tag information and second text record data of the image block in the second security inspection image that successfully matches the second image block.

[0028] The post-retrieval processing module is used to prompt the large model, using a preset third prompt word, to organize the second retrieval result according to the target text before it is output.

[0029] According to an embodiment of this disclosure, the central scheduling module processes the user input according to a general retrieval process and also calls the retrieval preprocessing module. The retrieval preprocessing module is used to: preprocess the target text and determine retrieval scope information. Correspondingly, the retrieval device is specifically used to: retrieve the second image block from the security check database according to the retrieval scope information to obtain the second retrieval result.

[0030] A third aspect of this disclosure provides an electronic device. The electronic device includes one or more processors and a memory. The memory is used to store one or more computer programs. The one or more processors execute the one or more computer programs to implement the steps of the method described above.

[0031] A fourth aspect of this disclosure provides a computer-readable storage medium having a computer program or instructions stored thereon, which, when executed by a processor, implement the steps of the above-described method.

[0032] A fifth aspect of this disclosure provides a computer program product, including a computer program or instructions. When executed by a processor, the computer program or instructions implement the steps of the method described above.

Attached Figure Description

[0033] The above and other objects, features and advantages of this disclosure will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:

[0034] Figure 1 schematically illustrates an application scenario of a multimodal large model retrieval method and apparatus for the security inspection field according to an embodiment of the present disclosure;

[0035] Figure 2 schematically illustrates a flowchart of a multimodal large model retrieval method for the security inspection field according to an embodiment of this disclosure;

[0036] Figure 3 schematically illustrates a flowchart of a multimodal large model retrieval method for the security inspection field according to another embodiment of this disclosure;

[0037] Figure 4 schematically illustrates a flowchart of a multimodal large model retrieval method for the security inspection field according to another embodiment of the present disclosure;

[0038] Figure 5 schematically illustrates a general retrieval process in one embodiment of this disclosure;

[0039] Figure 6 schematically shows a block diagram of a multimodal large model retrieval device for the security inspection field according to an embodiment of the present disclosure;

[0040] Figure 7 schematically shows a block diagram of a multimodal large model retrieval device for the security inspection field according to another embodiment of the present disclosure;

[0041] Figures 8A and 8B schematically illustrate the graphical user interface in the embodiments of this disclosure;

[0042] Figure 9 schematically illustrates an example of feature extraction in the field of customs inspection;

[0043] Figure 10 schematically illustrates a target image input by a user in an example of a multimodal large model retrieval device for the security inspection field applying embodiments of the present disclosure, wherein a retrieval enhancement processing flow is applied in this example;

[0044] Figures 11A and 11B schematically illustrate examples of the two sets of first retrieval results obtained by the retrieval device in the example shown in Figure 10;

[0045] Figure 12 schematically illustrates a target image input by a user in another example of a multimodal large model retrieval device for the security inspection field that applies embodiments of the present disclosure, wherein a general retrieval process is applied in this example;

[0046] Figure 13 schematically illustrates an example of the second retrieval result obtained by the retrieval device in the example shown in Figure 12;

[0047] Figure 14 schematically illustrates an example of the final result output to the user in the instance shown in Figure 12;

[0048] Figure 15 schematically illustrates an electronic device suitable for implementing the multimodal large model retrieval method for the security inspection field according to embodiments of the present disclosure.

Detailed Implementation

[0049] The embodiments of the present disclosure will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of the disclosure. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of the present disclosure for ease of explanation. However, it will be apparent that one or more embodiments may be practiced without these specific details. Furthermore, descriptions of well-known structures and techniques are omitted in the following description to avoid unnecessarily obscuring the concepts of the present disclosure.

[0050] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

[0051] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.

[0052] Figure 1 schematically illustrates an application scenario 100 of the multimodal large model retrieval method and apparatus for the security inspection field according to an embodiment of the present disclosure.

[0053] As shown in Figure 1, the application scenario 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, a server 105, and a security inspection database 106. The network 104 serves as a medium to provide a communication link between the terminal devices 101, 102, and 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables, etc.

[0054] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications, such as browsers, large-scale model chat client applications, and social software, can be installed on terminal devices 101, 102, and 103. Specifically, the multimodal large-scale model retrieval device for the security inspection field according to this embodiment can provide a graphical and textual interactive interface. This graphical and textual interactive interface can be presented to the user on terminal devices 101, 102, and 103 through a browser or application client.

[0055] Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.

[0056] Server 105 can be a server that provides various services. Server 105 contains a large model.

[0057] Server 105 can communicate with security inspection database 106. Server 105 may be equipped with a multimodal large-scale model retrieval device for the security inspection field, as described in this embodiment. Through this device, server 105 can analyze and process input information sent to server 105 by users via terminal devices 101, 102, and 103 according to the multimodal large-scale model retrieval method for the security inspection field described in this embodiment, and then feed the processing results back to the terminal devices. The multimodal large-scale model retrieval device for the security inspection field, as described in this embodiment, is built based on multimodal and large-scale model technologies, enabling efficient, accurate retrieval and in-depth utilization of massive amounts of multimodal heterogeneous security inspection data.

[0058] While server 105 processes user input, it can retrieve historical security inspection data from the security inspection database 106 as needed to help the large model generate better responses. The data in the security inspection database 106 effectively augments the knowledge of the large model. Specifically, the generation and understanding capabilities of the large model are used to organize and optimize the retrieval results, while multimodal retrieval supplies knowledge augmentation to the large model, filling its knowledge gaps and improving its overall ability to respond to user input.

[0059] It should be noted that the application scenario shown in Figure 1 is merely exemplary and does not constitute any limitation on the application scenario, architecture, and environment of the embodiments of this disclosure.

[0060] Figure 2 schematically illustrates a flowchart of a multimodal large model retrieval method for the security inspection field according to an embodiment of the present disclosure.

[0061] As shown in Figure 2, the method may include operation S210 and operation S220.

[0062] First, in operation S210, user input is received, which includes target text and target image. For example, through a graphic interface, the user can upload an image and input their requirements via text. The target image can be a visible light image, a perspective view scanned during security checks, or an image from video data. The target text can be text-formatted data or voice data that can be converted into text information.

[0063] Next, in operation S220, the user input is processed according to the retrieval enhancement process.

[0064] Operation S220 may specifically include operations S221 to S226.

[0065] In operation S221, an image block to be identified is extracted from the target image to obtain at least one first image block.

[0066] In one embodiment, the region to be identified in the target image can be determined based on the content of the target text. This region may be the entire target image or a user-specified region within the target image. Specifically, the target text is analyzed to determine whether it refers to a specified region; if so, image blocks are extracted from that specified region; otherwise, image blocks are extracted from the entire image.

[0067] When extracting image blocks, at least one first image block is segmented from the region to be identified based on image texture features.
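The disclosure does not state which texture features or segmentation algorithm are used. As a deliberately simplified stand-in, the sketch below splits a grayscale region into fixed-size tiles and keeps the tiles whose gray-level variance reaches a threshold, treating variance as a crude texture measure. The function name `texture_blocks`, the tile size, and the threshold are assumptions made only for this illustration.

```python
from statistics import pvariance


def texture_blocks(image, tile=2, min_variance=1.0):
    """Return (row, col, size) for tiles whose gray-level variance meets the threshold.

    `image` is a list of equal-length rows of gray values. Variance is used
    here only as a crude proxy for "image texture features".
    """
    blocks = []
    for r in range(0, len(image) - tile + 1, tile):
        for c in range(0, len(image[0]) - tile + 1, tile):
            pixels = [image[r + i][c + j] for i in range(tile) for j in range(tile)]
            if pvariance(pixels) >= min_variance:
                blocks.append((r, c, tile))
    return blocks
```

Flat background tiles are discarded, and only textured tiles are passed on as candidate first image blocks. A practical system would more likely use a learned segmentation model, which this sketch does not attempt to reproduce.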

[0068] In operation S222, the first image block is retrieved from the security inspection database to obtain a first retrieval result. The first retrieval result includes at least one first security inspection image and first text record data of the image block in the first security inspection image that successfully matches the first image block. The security inspection database is a database based on historical security inspection data, which includes at least security inspection images and text record data of the security inspection images. The text record data in the security inspection database may include cargo declaration information corresponding to the security inspection image and / or security inspection process record information of the cargo in the security inspection image, etc. The first text record data in the first retrieval result comes from the text record data of the first security inspection image, and may be extracted from that text record data based on the content of interest to the user in the target text (such as the cargo type or name).
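One common way to match an image block against historical image blocks is nearest-neighbor search over feature vectors. The disclosure does not specify the matching method, so the following is only a sketch under that assumption: the feature vectors are taken as given, cosine similarity with a threshold decides a "successful match", and `HistoricalBlock`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class HistoricalBlock:
    image_id: str         # identifier of the historical security inspection image
    feature: list[float]  # feature vector of one image block in that image
    text_record: str      # text record data, e.g. the declared goods name


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_feature, database, threshold=0.9, top_k=5):
    """Return up to top_k historical blocks whose similarity meets the threshold."""
    scored = [(cosine(query_feature, block.feature), block) for block in database]
    matches = [pair for pair in scored if pair[0] >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in matches[:top_k]]
```

The text records of the returned blocks play the role of the first text record data. In a deployment with a large database, a vector database would typically replace the linear scan shown here.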

[0069] In operation S223, the first text record data from the first retrieval result is input into the large model, and a preset first prompt word is used to prompt the large model to determine the type of goods in the first image block based on the input first text record data, so as to obtain the first goods identification result. Any large model may be used.

[0070] The reason for using a large model to organize the first retrieval result is that the first retrieval result may contain many security inspection images for the same first image block, and the goods in the image blocks that match the first image block may not be completely consistent (for example, some are films, some are plastics, etc.). In this case, the large model is used to organize and statistically analyze the goods information matched for the same first image block according to the first prompt word, and then filter out the goods with high credibility (such as selecting the one or several with the highest frequency) as the goods identification result for the first image block, thereby improving the credibility and readability of the first goods identification result.
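In the disclosure this organizing step is performed by the large model under the first prompt word. The deterministic counting below is not that mechanism; it is only a sketch that makes the stated selection rule ("the one or several with the highest frequency") concrete. The function name and the lowercase and whitespace normalization are assumptions.

```python
from collections import Counter


def most_credible_goods(text_records, top_n=1):
    """Pick the top_n most frequent goods names among the matched text records."""
    # Normalize case and whitespace so that "Film" and " film " count together.
    counts = Counter(r.strip().lower() for r in text_records if r.strip())
    return [name for name, _ in counts.most_common(top_n)]
```

For matched records such as "film", "plastic", "Film", and " film ", the sketch selects "film" as the goods identification result for that image block.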

[0071] In operation S224, based on the first goods identification result corresponding to at least one of the first image blocks, an overall goods identification result is obtained. For example, the first goods identification results corresponding to all the first image blocks are summarized. In other embodiments, goods of the same type can also be merged, sorted, etc., to form the overall goods identification result.
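The merging and sorting mentioned for operation S224 could, for example, group image blocks by goods type and order the types by how many blocks they appear in. This ordering is an assumption chosen for illustration, as is the function name `merge_results`.

```python
def merge_results(per_block):
    """Group block ids by goods type; sort by block count (descending), then by name.

    `per_block` maps a block id to the list of goods types identified for it.
    """
    by_type = {}
    for block_id, goods in per_block.items():
        for goods_type in goods:
            by_type.setdefault(goods_type, []).append(block_id)
    return sorted(by_type.items(), key=lambda item: (-len(item[1]), item[0]))
```

The result is a compact overall list in which each goods type appears once together with the image blocks that support it.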

[0072] In operation S225, the overall goods identification result and the target text are input into the large model, and a preset second prompt word is used to prompt the large model to generate a first response content for the target text based on the overall goods identification result. The large model can combine the overall goods identification result with the user's question in the target text to answer that question, such as whether there are prohibited items or whether the goods are consistent with the declared cargo.

[0073] In operation S226, the first response content is output.

[0074] The method in this embodiment uses a large model twice. The first time is in operation S223, where the large model is used to organize the first retrieval result and filter out goods with high credibility (e.g., selecting the one or more most frequent items) as the goods identification result for the first image block. The second time is in operation S225, where the large model answers the user's question. Thus, on the one hand, by organizing the initial results retrieved from the security inspection database using the large model, it becomes possible to provide usable and reliable security clues from historical security inspection data; on the other hand, by having the large model answer the user's question based on the retrieval results, the method can meet diverse user needs in security inspection and improve the user experience.
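The two uses of the large model can be sketched as a small pipeline. The prompt wording below is invented for illustration (the disclosure does not give the text of the first or second prompt word), and the large model is represented by any callable that maps a prompt string to a reply string, so no particular model API is assumed.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, reply out

# Illustrative prompt templates; the actual preset prompt words are not disclosed.
FIRST_PROMPT = (
    "The following historical text records were matched to one image block.\n"
    "Records:\n{records}\n"
    "State the most likely type of goods in this image block."
)
SECOND_PROMPT = (
    "Overall goods identification result:\n{overall}\n\n"
    "User question: {question}\n"
    "Answer the question based on the result above."
)


def answer(target_text: str, records_per_block: Dict[str, List[str]], llm: LLM) -> str:
    """Run the two large-model calls: per-block identification, then the reply."""
    per_block = {}
    for block_id, records in records_per_block.items():
        # First use (operation S223): organize the retrieved records for one block.
        per_block[block_id] = llm(FIRST_PROMPT.format(records="\n".join(records)))
    # Operation S224: summarize the per-block results into an overall result.
    overall = "\n".join(f"{block_id}: {goods}" for block_id, goods in per_block.items())
    # Second use (operation S225): answer the user's question.
    return llm(SECOND_PROMPT.format(overall=overall, question=target_text))
```

Passing the model as a callable keeps the sketch independent of any specific large model, which matches the statement that any large model may be used.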

[0075] The method of this disclosure retrieves security clues from historical security inspection data in a security inspection database. This allows for a more comprehensive determination of the types of goods in the image or image block to be identified, by fully referencing historical data. Compared to methods that rely solely on image recognition, this disclosure can identify a wider range of goods. This is because image recognition models often depend on the types of items selected during training and annotation. If the model encounters goods types it has not learned, it cannot identify them, thus limiting the recognition range.

[0076] The method disclosed in this embodiment searches for similar items from historical security inspection data in a security inspection database. When the amount of data in the security inspection database accumulates sufficiently, it can greatly improve the comprehensiveness of cargo identification. Specifically, with the increasingly severe global security situation and the continuous improvement of security inspection requirements, the amount of data in security inspection scenarios has increased dramatically, and the data types have become increasingly diversified, including multiple modalities such as item scan images, text reports, video surveillance, and biometrics. This massive, heterogeneous, and multimodal data contains rich security clues and is a valuable resource for data mining and intelligent analysis.

[0077] Figure 3 schematically illustrates a flowchart of a multimodal large model retrieval method for the security inspection field according to another embodiment of this disclosure.

[0078] As shown in Figure 3, the method may include operation S210 and operation S220. Operation S220 may include operation S2221 in addition to operations S221 to S226, and operation S224 is specifically operation S2241.

[0079] Operation S2221 can be performed in parallel with operations S222 and S223. In operation S2221, at least one object detection expert model can be used to identify the first image patch to obtain a second cargo identification result, wherein the object detection expert model is a machine learning model based on image recognition. Specifically, the object detection expert model identifies the type of cargo in the image through image recognition. This at least one object detection expert model may include, but is not limited to, detection models for various prohibited items (such as cigarettes, controlled knives, and liquid fuels).

[0080] Accordingly, in operation S2241, the overall cargo identification result is obtained based on the first cargo identification result and the second cargo identification result corresponding to at least one first image patch. In some embodiments, the first cargo identification result and the second cargo identification result for the same first image patch may be merged, wherein cargo types (such as product names) that are inconsistent between the two are all retained, and the frequency of each product name may be recorded for reference by the large model when it subsequently generates the response content. In other embodiments, non-conflicting information in the first cargo identification result and the second cargo identification result for the same first image patch may be retained, while in the case of conflict, the information in the second cargo identification result prevails. For example, if the object detection expert model identifies a certain image patch as cigarettes, but the search result identifies it as a chair, the identification result of the object detection expert model may be used.

[0081] This allows the first cargo identification result obtained from operation S223 and the second cargo identification result obtained from operation S2221 to be combined.
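The two merging strategies of operation S2241 can be illustrated with a minimal Python sketch. The function names and the representation of identification results as lists of product names are assumptions made for illustration only; in particular, the sketch treats any difference between the two results as a conflict, which is one possible reading of the disclosure rather than a definition given in it.

```python
from collections import Counter


def merge_keep_all(retrieved, detected):
    """First strategy: retain every product name from both sources and
    record how often each one occurs, for later reference by the large model."""
    return Counter(retrieved) + Counter(detected)


def merge_expert_prevails(retrieved, detected):
    """Second strategy (assumed reading): when the object detection expert
    model produced a result for the patch, that result prevails; otherwise
    the retrieval-based result is kept."""
    return list(detected) if detected else list(retrieved)


# Retrieval suggests "chair" and "film"; the expert model reports "cigarettes".
merged_counts = merge_keep_all(["chair", "chair", "film"], ["cigarettes"])
final_names = merge_expert_prevails(["chair", "chair", "film"], ["cigarettes"])
```

With the first strategy all three names survive together with their frequencies; with the second, only the expert model's result is kept for the conflicting patch.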

[0082] Next, in operation S225, the second prompt word is used to prompt the large model to generate a first response content for the target text based on the overall cargo recognition result, and in operation S226, the first response content is output.

[0083] Thus, this embodiment of the disclosure can combine security clue retrieval from the security inspection database with image recognition machine learning algorithms. By retrieving from the security inspection database, the limitation of the recognition range in image recognition can be compensated for, thereby improving the comprehensiveness and accuracy of security inspection cargo identification.

[0084] As can be seen, the method of this disclosure, based on cutting-edge multimodal large model retrieval technology, provides a novel means for efficient, accurate, and in-depth retrieval and utilization of massive amounts of multimodal heterogeneous security inspection data, effectively revolutionizing traditional retrieval methods in the security inspection field. Traditional retrieval methods generally rely on preset, static retrieval rules and focus mainly on simple matching of structured data; they lack effective means of correlated retrieval and analysis for complex unstructured image data. More importantly, when security personnel raise complex and varied queries, especially queries that do not precisely match a structured database (such as queries based on fuzzy descriptions, contextual understanding, or cross-modal associations), traditional retrieval methods often fail to capture the user's true intent. They lack flexibility and intelligence and cannot effectively handle the diversity and uncertainty of user input. As a result, the search results are unsatisfactory or fail to meet the user's actual needs at all, and a large amount of valuable unstructured data is left unused and unexplored.

[0085] Figure 4 schematically illustrates a flowchart of a multimodal large model retrieval method for the security inspection field according to an embodiment of the present disclosure.

[0086] As shown in Figure 4, the method may include operation S210, operation S420 to operation S440, operation S220 and operation S450.

[0087] First, in operation S210, user input is received, which includes target text and target image.

[0088] Next, in operation S420, the user intent of the target text is identified using an intent recognition model.

[0089] In operation S430, the user intent category is determined based on the user intent and a preset intent classification. For example, user intents can be divided into categories including, but not limited to, the following main categories: general retrieval intent, retrieval enhancement-related intent, and general intent. The actual classification can be further refined according to system requirements. A general retrieval intent is the user's intention for the large model to retrieve data from the security inspection database. A retrieval enhancement-related intent includes the intent to identify goods in an image. A general intent is, for example, a conventional question-and-answer intent, that is, a situation in which the large model can answer without relying on the security inspection database.

[0090] Next, based on the user intent category determined in operation S430, the corresponding processing flow is used to process the user input.

[0091] Specifically, when the user intent category is a retrieval enhancement-related intent, operation S220 is executed to process the user input according to the retrieval enhancement process. The specific implementation of operation S220 is given in the preceding description and is not repeated here.

[0092] When the user intent category is a general retrieval intent, operation S440 is executed to process the user input according to the general retrieval process. The general retrieval process is described in detail with reference to Figure 5 below.

[0093] When the user intent category is a category other than the retrieval enhancement-related intent and the general retrieval intent (such as a general intent), operation S450 is executed to process the user input according to another processing flow, such as a question-and-answer dialogue flow.
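The routing among operations S220, S440, and S450 can be sketched as a simple dispatch on the intent category. The enum, the function names, and the placeholder return values below are hypothetical; each flow function merely stands in for the corresponding processing flow and does not implement it.

```python
from enum import Enum


class IntentCategory(Enum):
    GENERAL_RETRIEVAL = "general retrieval intent"
    RETRIEVAL_ENHANCEMENT = "retrieval enhancement-related intent"
    GENERAL = "general intent"


def retrieval_enhancement_flow(text, image):
    return "S220"  # placeholder for the retrieval enhancement process


def general_retrieval_flow(text, image):
    return "S440"  # placeholder for the general retrieval process


def other_flow(text, image):
    return "S450"  # placeholder, e.g. a question-and-answer dialogue flow


def route(category, text, image):
    """Select the processing flow according to the user intent category."""
    if category is IntentCategory.RETRIEVAL_ENHANCEMENT:
        return retrieval_enhancement_flow(text, image)
    if category is IntentCategory.GENERAL_RETRIEVAL:
        return general_retrieval_flow(text, image)
    # Any remaining category (such as a general intent) uses the other flow.
    return other_flow(text, image)
```

Keeping the fallback branch last means that any category added later is handled by the other processing flow until a dedicated flow is provided for it.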

[0094] Figure 5 schematically illustrates a general retrieval process according to an embodiment of the present disclosure.

[0095] As shown in Figure 5, the specific process of processing user input according to the general retrieval process in operation S440 may include operations S441 to S443.

[0096] First, in operation S441, the image block to be retrieved is extracted from the target image to obtain at least one second image block.

[0097] Next, in operation S442, the second image block is retrieved from the security inspection database to obtain a second search result. The second search result includes at least one second security inspection image, as well as the tag information and second text record data of the image block in the second security inspection image that successfully matches the second image block.

[0098] Then, in operation S443, a preset third prompt word is used to prompt the large model to organize and output the second search results according to the target text. For example, the items in the second text record data are filtered according to the items that the target text requires to be output, and the retrieved second security inspection images containing the tag information, together with the filtered text record data, are then categorized, sorted, and output.

[0099] As can be seen, the method of this embodiment of the disclosure can select the flow for processing user input according to user intent, and can thus meet the diverse needs of users through a single system.

[0100] Figure 6 schematically shows a block diagram of a multimodal large model retrieval device 600 for the security inspection field according to an embodiment of the present disclosure.

[0101] As shown in Figure 6, the device 600 may include a text-image interaction module 610, a central scheduling module 620, an image segmentation expert module 630, a retrieval unit 640, and a retrieval post-processing module 650.

[0102] The text-image interaction module 610 is used to receive user input, which includes target text and a target image. In one embodiment, the text-image interaction module 610 can perform the operation S210 described above.

[0103] The central scheduling module 620 is used to process the user input according to the retrieval enhancement process, wherein the central scheduling module 620 calls the image segmentation expert module 630, the retrieval unit 640 and the retrieval post-processing module 650 to process the user input.

[0104] Specifically, the image segmentation expert module 630 is used to extract the image patch to be identified from the target image to obtain at least one first image patch. In one embodiment, the image segmentation expert module 630 can perform the operation S221 described above.

[0105] The retrieval unit 640 is configured to: retrieve the first image block from the security inspection database to obtain a first retrieval result, wherein the first retrieval result includes at least one first security inspection image and first text record data of an image block in the first security inspection image that successfully matches the first image block; wherein the security inspection database is a database formed based on historical security inspection data, and the historical security inspection data includes at least security inspection images and text record data of the security inspection images. In one embodiment, the retrieval unit 640 may perform the operation S222 described above.

[0106] The retrieval post-processing module 650 is configured to: input the first text record data from the first retrieval result into the large model, and use a preset first prompt word to prompt the large model to determine the type of goods in the first image block based on the input first text record data, so as to obtain a first goods identification result; obtain an overall goods identification result based on the first goods identification result corresponding to at least one first image block; and input the overall goods identification result and the target text into the large model, and use a preset second prompt word to prompt the large model to generate a first response content for the target text based on the overall goods identification result. In one embodiment, the retrieval post-processing module 650 may perform the operations S223, S224, and S225 described above.

[0107] The text-image interaction module 610 is also used to output the first response content. In one embodiment, the text-image interaction module 610 can perform the operation S226 described above.

[0108] The device 600 can implement the method described with reference to Figures 2 to 4, which can be referred to in the previous description and will not be repeated here.

[0109] The multimodal large model retrieval method and apparatus of this disclosure revolutionize traditional retrieval methods. Through an interactive user interface, they intelligently analyze diverse user needs and plan the optimal retrieval path accordingly. Specifically, their advantages include:

[0110] - Efficient multimodal data processing and fusion: Utilizing advanced multimodal data processing technology, it achieves deep fusion of different modal information such as text, images, and videos in security inspection data, providing comprehensive security clue analysis.

[0111] - Precise and intelligent retrieval: Intelligently analyzes user needs and provides accurate search results, ensuring that security personnel can quickly retrieve the information they need from massive amounts of data, thereby improving work efficiency.

[0112] - Large model knowledge enhancement: Combining the generation and understanding capabilities of large models, the search results are intelligently organized and optimized, and multimodal large model retrieval technology is used to enhance the knowledge of large models, fill their knowledge gaps, and improve their overall answering ability.

[0113] - User-friendly interactive design: The design features an intuitive and convenient interactive interface, enabling security personnel to easily input query requests, view search results, and enjoy a smooth operating experience.

[0114] Figure 7 schematically shows a structural block diagram of a multimodal large model retrieval device 700 for the security inspection field according to an embodiment of the present disclosure.

[0115] As shown in Figure 7, the device 700 may include a text-image interaction module 610, a central scheduling module 620, an image segmentation expert module 630, a retrieval preprocessing module 760, a retrieval unit 640, a retrieval post-processing module 650, various expert modules 770, a generator 780, and a data management module 790.

[0116] The device 700 integrates multimodal large-model retrieval technology and large-model technology, and introduces a Retrieval-Augmented Generation (RAG) mechanism. It can deeply mine and efficiently integrate complex multimodal information from various security inspection scenarios, including but not limited to scanned images of container vehicles and luggage, radioactive detection information, declaration information, visible light images, video images, voice, and various associated databases. Through efficient cross-modal retrieval capabilities, it not only achieves rapid location and extraction of these information resources, but also seamlessly integrates this valuable data as a core reference into other non-retrieval functional modules, significantly improving the accuracy and efficiency of these modules when handling complex tasks. The specific descriptions of each module of the device 700 are as follows.

[0117] The data management module 790 is responsible for providing comprehensive, accurate, and diverse data sources to support the various functions of the retrieval system. These data sources include, but are not limited to, various types of structured, semi-structured, and unstructured data such as security inspection cargo information, regulations, historical inspection records, filing information, scanned images of container vehicles and luggage, and radioactive material detection results. To reduce the overhead of repeated searches, the data management module may also include the management of historical search records.

[0118] The main functions of the data management module 790 include data source integration, data storage, data preprocessing, and maintenance of the security inspection database.

[0119] Data source integration: Data is extracted from multiple heterogeneous data sources, cleaned, transformed, and loaded, and integrated into a unified data management platform. This process ensures data integrity, consistency, and availability.

[0120] Data storage: A hybrid storage strategy is adopted based on data type and access requirements. Structured data is stored in a relational database for easy SQL queries; unstructured data (such as documents, images, and audio) is stored in a distributed file system or object storage to support efficient data access and expansion.

[0121] Data preprocessing: The integrated data is preprocessed, including data cleaning (noise removal and error correction) and data formatting (unified encoding and format) to improve data quality and retrieval efficiency.

[0122] Security inspection database: The security inspection database includes vector databases, structured databases, graph databases, etc.

[0123] The text-image interaction module 610 can receive user input such as images, text, and voice through an interactive interface, and output the retrieved images, text, and voice. The interactive interface can be as shown in Figure 8A or 8B. The interface shown in Figure 8A uses image and keyword input, while the interface shown in Figure 8B uses dialogue input.

[0124] Users can search with an entire image or with a specific region of an image, input text for retrieval, or combine image input with questions for retrieval. The system supports outputting one or more images, images with specific regions labeled, various related information, plain text or voice, tables, etc.

[0125] The central scheduling module 620 is the core hub of the device 700 and can interact with other modules.

[0126] The central scheduling module 620 uses natural language processing (NLP) technology, such as text classification and named entity recognition, or it can directly use large models to achieve accurate parsing and flexible classification of user intents. Based on the user's specific needs and the complexity of the needs, it divides user intents and customizes differentiated processing flows and routing strategies for each intent.

[0127] User intent classification: Based on a deep understanding of user input, user intent is classified into categories including, but not limited to, the following main categories: general retrieval intent, retrieval enhancement-related intent, and general intent. The actual classification can be further refined according to system requirements.

[0128] General retrieval intent: For direct query requests submitted by users through diverse means such as images and text, the central scheduling module 620 quickly identifies them and routes them to the retrieval preprocessing module 760, and the retrieval unit 640 then uses efficient search algorithms to accurately extract relevant information from massive databases to meet the user's immediate needs.

[0129] Retrieval enhancement-related intent: This category encompasses a range of query scenarios requiring complex processing and in-depth analysis. When a user's question demands a high level of expertise in the security inspection field or requires substantial support from security inspection knowledge, or when the various expert modules (such as object detection models, image description models, and risk analysis models) and the generator (such as a general-purpose large language model) cannot directly provide a satisfactory answer, the central scheduling module 620 categorizes the query as a retrieval enhancement-related intent.

[0130] General intent: When a user question can be answered directly and accurately without relying on the data provided by the data management module 790, the central scheduling module 620 can classify it as a general intent. The processing flow for this type of intent is simple, and internal resources (such as the various expert modules 770 and the generator 780, including large models) can be invoked directly to provide the answer.

[0131] The central scheduling module 620 routes user input to different processing flows based on user intent. Specifically, after identifying the user intent, the central scheduling module 620 can intelligently plan and execute the optimal processing path according to different intent categories.

[0132] For a general retrieval intent: The central scheduling module 620 can preprocess the user query before retrieval (for example, query optimization and keyword expansion) and then use efficient search algorithms to retrieve relevant information from databases or other information sources. For complex queries that require step-by-step execution, the central scheduling module 620 plans detailed execution steps and ensures that each step returns valuable information to support subsequent operations.

[0133] For a retrieval enhancement-related intent: The central scheduling module 620 initiates the retrieval enhancement process. This process includes multiple steps, such as retrieving relevant data from multiple information sources according to the planned path, integrating the relevant information, calling expert modules to assist in judgment, and performing multiple rounds of evaluation and optimization, so as to provide a more comprehensive, in-depth, and accurate answer.

[0134] For general intent: the central scheduling module 620 will directly utilize resources other than the data management module 790 (such as various expert modules, large models in the generator, etc.) to answer, ensuring the simplicity and efficiency of the processing flow.

[0135] The retrieval preprocessing module 760 is an important component of retrieval. It is responsible for preprocessing user queries or retrieval requests to ensure that the query statements can be processed accurately and efficiently by the retrieval unit 640. This includes deep parsing of user queries, using technologies such as thesauri, machine learning models, knowledge graphs, or large models to enrich and expand the query statements, and optimizing and reconstructing the query statements based on the enhancement results.

[0136] One function of the retrieval preprocessing module 760 is query enhancement. Through strategies such as keyword expansion, synonym replacement, pronoun replacement, and concept association, it enriches the query content, broadens the search scope, and increases the likelihood of discovering relevant information, providing users with more comprehensive search results. For example, when a user queries for images declared as fruit, the module first replaces "customs" in the query with the customs district on the declaration form, and then expands "fruit" to include specific fruit categories such as apples, pears, bananas, and pineapples.
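The keyword expansion in the fruit example can be sketched as follows. The expansion table and the function name are hypothetical; in practice the table could come from a thesaurus, a knowledge graph, or a large model, as mentioned above.

```python
# Hypothetical expansion table mapping a category term to specific members.
CATEGORY_EXPANSIONS = {
    "fruit": ["apples", "pears", "bananas", "pineapples"],
}


def expand_query(terms, expansions=CATEGORY_EXPANSIONS):
    """Keep each original term and append its specific categories,
    preserving order and skipping duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *expansions.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


query_terms = expand_query(["fruit", "image"])
```

The original term is kept alongside its expansions so that records declared under the general category are still matched.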

[0137] The second function of the retrieval preprocessing module 760 is query construction: it transforms the enhanced query intent into a query statement that the retrieval unit 640 can execute. This module can parse query requirements from different modalities, including visible light images, perspective images, videos, audio, and text. When extracting keywords or phrases, it classifies them according to the different retrieval methods of the retrieval unit 640. For example, it identifies whether image vector retrieval needs to be performed and which image vector retrieval method to use, whether to return images as part of the result set, which text queries need to be split off for vector retrieval, and which can be directly converted into SQL query statements for structured query execution.
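As a minimal sketch of this kind of query construction, the following Python code splits query conditions into a parameterized SQL statement for structured fields and a list of free-text values left for vector retrieval. The table name `declarations`, the column names, and the function name are assumptions for illustration and do not appear in the disclosure.

```python
import sqlite3

# Hypothetical whitelist of columns that exist in the structured database.
STRUCTURED_FIELDS = {"declared_goods", "customs_district"}


def build_query_plan(conditions):
    """Split (field, value) conditions into a parameterized SQL query for
    structured fields and a list of texts to be sent to vector retrieval."""
    where, params, vector_texts = [], [], []
    for field, value in conditions:
        if field in STRUCTURED_FIELDS:
            where.append(f"{field} = ?")  # field comes from the whitelist only
            params.append(value)
        else:
            vector_texts.append(value)
    sql = "SELECT image_number FROM declarations"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params, vector_texts


sql, params, vector_texts = build_query_plan(
    [("declared_goods", "apples"), ("description", "dense stacked boxes")]
)
```

Only whitelisted column names are interpolated into the statement, and all values are bound as parameters, so user-supplied text never becomes part of the SQL string.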

[0138] The retrieval unit 640 is designed as a multi-source data retrieval engine for security inspection applications. It integrates various database resources, including vector databases and structured databases, and incorporates multiple retrieval strategies, including but not limited to precise keyword matching, vector similarity calculation, and graph-based semantic association analysis. It performs refined retrieval operations based on user queries and supports multi-condition joint searches in which each search term can be assigned a different weight. Especially for security inspection-specific data types (such as cargo manifests and container scan images), the system supports highly customized retrieval needs, effectively improving retrieval efficiency and effectiveness.

[0139] The core modules of the retrieval unit 640 include a feature extraction module and an indexing module.
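One simple way to realize a multi-condition joint search with per-term weights is a normalized weighted sum of per-condition match scores. The sketch below assumes that each candidate already carries a score in [0, 1] for each condition; the scoring formula and all names are illustrative assumptions, not the method prescribed by the disclosure.

```python
def weighted_score(term_scores, weights):
    """Weighted average of per-condition scores; missing scores count as 0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(w * term_scores.get(term, 0.0) for term, w in weights.items())
    return weighted_sum / total_weight


def rank(candidates, weights):
    """Return candidate IDs ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda cid: weighted_score(candidates[cid], weights),
        reverse=True,
    )


candidates = {
    "A": {"keyword": 1.0, "vector": 0.2},  # exact keyword hit, weak visual match
    "B": {"keyword": 0.0, "vector": 0.9},  # no keyword hit, strong visual match
}
ranking = rank(candidates, {"keyword": 1, "vector": 3})
```

Changing the weights changes which condition dominates: with a higher weight on the vector condition, candidate B is ranked ahead of candidate A.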

[0140] The feature extraction module in the retrieval unit 640 integrates diverse feature extraction models to address the complex needs of different information sources. Figure 9 schematically illustrates an example of feature extraction in the customs inspection field. The feature extraction module is explained below with reference to Figure 9.

[0141] Referring to Figure 9, for text data, to address the complex needs of specific domains, such as the recognition of professional terminology and the parsing of policy clauses in security inspection documents, a text feature extraction model customized for security inspection scenarios is used based on natural language processing technology to extract features such as keywords, phrases, and sentences, ensuring accurate capture of key information in the text. For image and audio data, deep learning models are used to extract visual and audio features and convert them into vector form. To handle complex images in specific domains, such as the large number of container vehicle scan images, luggage X-ray images, and CT images in customs inspection, dedicated feature extraction models such as cargo feature extraction models, vehicle front feature extraction models, and chassis feature extraction models are used to extract corresponding image features according to business needs. Through multimodal learning frameworks, such as cross-modal pre-trained models, data from different modalities can also be processed simultaneously, allowing these features to be represented in a unified feature space, facilitating subsequent multimodal large-scale model retrieval. For all data with the same serial number in the security inspection scenario, its related feature vectors and structured data are associated through the serial number.

[0142] The indexing module in the retrieval unit 640 transforms multimodal data from the data source into an efficient index structure for rapid response to retrieval requests. For text data, an inverted index is used, associating keywords with document IDs. For image and video data, vector indexing is employed, constructing an approximate nearest neighbor search index in the vector space for the vector features extracted by the feature extraction module. By combining multimodal features, a hybrid index library can be built, enabling efficient organization and retrieval of multimodal data within a unified framework. The index is updated regularly to reflect changes in the data source. An incremental indexing mechanism is implemented that updates the index only for newly added or modified data, reducing index building time and resource consumption. Index performance is monitored, and the index structure and query algorithms are optimized to ensure the accuracy and efficiency of the index.

[0143] The retrieval post-processing module 650 post-processes the results returned by the retrieval unit 640 to improve the quality of the search results and the user experience, including sorting, deduplication, and adding prompts. The retrieval post-processing module 650 can use machine learning algorithms tailored to security inspection tasks, information retrieval evaluation metrics, and user feedback data. Where the result of a single query is unsatisfactory, the system can perform multiple iterative searches and further refine and optimize the search results. The functions of this module include, but are not limited to, the following: result integration, ranking optimization, and adding prompts.

[0144] Result integration: The retrieval post-processing module 650 integrates search results from different modalities (such as text, images, audio, and video) to form a comprehensive, multi-dimensional result set. This helps users understand the query content from multiple perspectives, improving the comprehensiveness and accuracy of the search. Result integration also includes removing duplicate and irrelevant results to reduce noise and ensure that the returned result set is of high quality.
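The inverted index for text data and its incremental update can be sketched as follows. The class and method names are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting for brevity; a real system would use a tokenizer suited to security inspection documents.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # keyword -> set of document IDs
        self._doc_terms = {}               # document ID -> keywords indexed for it

    def upsert(self, doc_id, text):
        """Incremental update: (re)index only the given new or modified document."""
        for term in self._doc_terms.get(doc_id, ()):
            self._postings[term].discard(doc_id)  # drop stale postings
        terms = set(text.lower().split())
        for term in terms:
            self._postings[term].add(doc_id)
        self._doc_terms[doc_id] = terms

    def search(self, query):
        """Return the IDs of documents that contain every keyword in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.upsert("d1", "plastic film rolls")
index.upsert("d2", "wooden chair")
```

Because `upsert` touches only the postings of the affected document, a modified record can be re-indexed without rebuilding the whole index.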

[0145] Ranking optimization: The retrieval post-processing module 650 uses a refined relevance calculation mechanism to rank the search results according to how well they match the user's query intent. Specifically, for different types of query intents, different models are used. For image description-based retrieval enhancement questions such as "What's in the image?", the module trains an image-text matching model to calculate the relevance between the user's input image and the retrieved text, and re-ranks the search results accordingly. For general retrieval needs such as "Retrieving fruit-related documents", the module trains a semantic matching model to calculate the relevance between the retrieved documents and the user's question. Through this differentiated relevance calculation, the module ensures that the most relevant and valuable results are ranked first, improving user search efficiency and satisfaction.

[0146] Adding prompts: For retrieval enhancement-related intents, the retrieval post-processing module 650 analyzes the query content and adds necessary prompts or contextual information to enrich the semantic expression of the search results and improve the quality of the input to the generator model.

[0147] Generator 780 generates answers or summaries that meet user needs based on search results. It can use a large model to combine different user intentions with appropriate prompts to generate final conclusions based on the search results.

[0148] The following two examples illustrate the specific use of device 700 in the field of security inspection.

[0149] Figure 10 schematically illustrates a target image input by a user in an example of a multimodal large-scale model retrieval device for the security inspection field, applying embodiments of the present disclosure. This example represents a retrieval enhancement requirement in a customs image review scenario, and a retrieval enhancement processing flow is applied.

[0150] In this example, users can input text and images through a graphical interface. The text input is: "Does the cargo in the container match the customs declaration information?" The declared cargo is: film. The image input is: the image shown in Figure 10.

[0151] The retrieval source is the security inspection database in data management module 790. The vector database uses FAISS to manage container perspective images and their regional features, as well as the text features of the corresponding information. The structured database uses MySQL to manage declaration information. A feature extraction module extracts features from both the perspective images and the text information. Each perspective image is linked to the declaration information in the structured database via an image number.
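The link between region feature vectors and declaration information via the image number can be illustrated with a minimal sketch. It deliberately replaces FAISS with a brute-force cosine similarity search over Python lists and replaces MySQL with an in-memory dictionary, so that the example stays self-contained; the data layout and all names are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_patches(query_vec, patch_index, declarations, top_k=3):
    """patch_index: list of (image_number, region_id, feature_vector).
    declarations: image_number -> declaration record (stand-in for the
    structured database). Returns the top_k most similar regions together
    with the declaration linked through the image number."""
    scored = sorted(patch_index, key=lambda e: cosine(query_vec, e[2]), reverse=True)
    return [(img, region, declarations.get(img)) for img, region, _ in scored[:top_k]]


patch_index = [
    ("IMG1", 0, [1.0, 0.0]),
    ("IMG2", 0, [0.0, 1.0]),
    ("IMG3", 1, [0.9, 0.1]),
]
declarations = {
    "IMG1": {"product_name": "film"},
    "IMG3": {"product_name": "plastic"},
}
hits = search_patches([1.0, 0.0], patch_index, declarations, top_k=2)
```

In a deployment along the lines described above, the sorted scan would be replaced by a query against the vector index, while the lookup by image number would become a query against the structured database.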

[0152] The intent recognition model in the central scheduling module 620 outputs the user intent as a retrieval enhancement-related intent category. Based on the input image, the process is determined as follows: The central scheduling module 620 calls the image segmentation expert module to divide the image into different texture regions, extracts features from each texture region (i.e., image patch), and calls the retrieval unit 640 to retrieve similar images corresponding to each texture region. Then, the post-retrieval processing module 650 uses the text descriptions of similar images to complete the product name category of the current image. Simultaneously, the central scheduling module 620 also calls various expert modules 770 (e.g., the contraband detection expert module) to provide reference conclusions, and inputs the retrieval enhancement conclusions and the conclusions from the contraband detection model into the generator 780, which generates the final conclusion based on the user's question. The retrieval enhancement process is explained below.

[0153] Retrieval unit 640: The retrieval unit uses the feature extraction module for container vehicle images to extract image features, including but not limited to local features of the different texture regions of the cargo in the input image. It then performs vector retrieval with the extracted features to obtain cargo images with similar local features (i.e., the first security inspection images), and returns an image list with the corresponding declaration information. For each texture region of the input image, it returns the similar images and their text records, such as product name and tax code. Figures 11A and 11B show examples of the retrieval results returned for two different regions in Figure 10.
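
The per-region retrieval described above can be sketched as follows. This is a minimal illustration only: a brute-force cosine similarity search stands in for the FAISS vector database, a dictionary stands in for the MySQL declaration table, and the feature vectors, image numbers, and tax codes (`INDEX`, `DECLARATIONS`, `TAX-A`, and so on) are hypothetical placeholders rather than data from the disclosure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical vector index: one entry per stored texture region.
INDEX = [
    {"image_no": "IMG001", "feature": [0.9, 0.1, 0.0]},
    {"image_no": "IMG002", "feature": [0.8, 0.2, 0.1]},
    {"image_no": "IMG003", "feature": [0.0, 0.9, 0.4]},
]

# Hypothetical declaration records, linked to images by image number.
DECLARATIONS = {
    "IMG001": {"product_name": "film", "tax_code": "TAX-A"},
    "IMG002": {"product_name": "plastic", "tax_code": "TAX-B"},
    "IMG003": {"product_name": "plastic molds", "tax_code": "TAX-C"},
}

def retrieve_per_region(region_features, top_k=2):
    """Return, for each texture region, the top_k similar images with their records."""
    results = []
    for feat in region_features:
        ranked = sorted(INDEX, key=lambda e: cosine(feat, e["feature"]), reverse=True)
        hits = [
            {"image_no": e["image_no"], **DECLARATIONS[e["image_no"]]}
            for e in ranked[:top_k]
        ]
        results.append(hits)
    return results

# Two texture regions extracted from the input image (made-up features).
results = retrieve_per_region([[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]])
```

Each element of `results` corresponds to one texture region, which mirrors the grouping of product names used in the later merging step.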

[0154] Next, the retrieval post-processing module 650 performs its processing. For retrieval enhancement intents, the retrieval post-processing module 650 first processes the retrieval results provided by the retrieval unit 640; the processed retrieval enhancement results are then merged with the recognition results of the various expert modules 770.

[0155] The retrieval post-processing module 650 first processes the retrieval results. In this example, the retrieval returned multiple sets of product names, as shown in Figures 11A and 11B, with each set corresponding to one texture region in the image. The retrieval post-processing module 650 first determines the product name for each texture region, and then outputs all product names for the entire image. Specifically, either a dedicated algorithm or a generative model used directly can merge, deduplicate, and reorder all returned product names.

[0156] If generator 780 and a preset first prompt word are used to merge the retrieval results, an example of the first prompt word is as follows:

[0157] "You are a customs broker. Based on the customs tariff's product name classification method, you need to classify and statistically analyze product names. The method is as follows: First, merge similar product names according to the customs tariff. Then, count the frequency of each product name and rank them by frequency, outputting the top k product names. Several groups of product names will be given below. First, perform classification and statistical analysis on each group (k=1), and merge the classification and statistical results of each group. Then, combine the classification and statistical results into a new group, and perform classification and statistical analysis on the newly formed group (k=5). If fewer than 5 product name categories are returned, output all product name categories.

[0158] Group 1: list of goods names

[0159] Group 2: list of goods names

[0160] ...

[0161] Group n: list of goods names

[0162] Output product name:"

[0163] After the first prompt word above is combined with the two sets of retrieval results in Figures 11A and 11B, the lists of goods names are instantiated as follows:

[0164] Group 1: Film, film, plastic, film

[0165] Group 2: Plastic molds, plastic molds, polyethylene molds, chairs
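
As a rough illustration of how such a prompt might be assembled from the per-region retrieval results, the sketch below appends one "Group i" line per texture region to the instruction text. The function name `build_first_prompt` is hypothetical, and the instruction text is abbreviated here with "..." purely to keep the example short.

```python
def build_first_prompt(instructions, groups):
    """Assemble the first prompt: instructions, one line per group, then the output cue."""
    lines = [instructions]
    for i, names in enumerate(groups, start=1):
        lines.append(f"Group {i}: {', '.join(names)}")
    lines.append("Output product name:")
    return "\n".join(lines)

# Abbreviated instruction text; the full wording is given in the example prompt above.
instructions = "You are a customs broker. ..."

groups = [
    ["Film", "film", "plastic", "film"],
    ["Plastic molds", "plastic molds", "polyethylene molds", "chairs"],
]

prompt = build_first_prompt(instructions, groups)
print(prompt)
```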

[0166] The first prompt word and the retrieval results of Figures 11A and 11B are provided to the generator 780. The generator 780 first counts the product names matched for each image block and keeps the most frequent one. It then combines the names kept from all image blocks into a new group and selects up to five of the most frequent names as the product recognition result. The output is: film, plastic mold.

[0167] In this way, the large model in Generator 780 can be used to organize and statistically analyze the matched goods information according to the first prompt word, and filter out the goods names with high credibility as the first goods identification result.
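
Where a dedicated algorithm is used instead of the generative model, the two-stage frequency filtering (k=1 per group, then k=5 over the combined group) could look roughly like the sketch below. It is an assumption-laden simplification: names are only lower-cased and trimmed, and the tariff-based merging of similar names (for example, folding "polyethylene molds" into "plastic molds") is not implemented.

```python
from collections import Counter

def normalize(name):
    """Crude normalization; real tariff-based merging is out of scope for this sketch."""
    return name.strip().lower()

def top_k_names(names, k):
    """Return the k most frequent normalized names, most frequent first."""
    counts = Counter(normalize(n) for n in names)
    return [name for name, _ in counts.most_common(k)]

def merge_groups(groups, per_group_k=1, final_k=5):
    """Keep the top name(s) of each group, then rank the combined winners."""
    winners = []
    for group in groups:
        winners.extend(top_k_names(group, per_group_k))
    return top_k_names(winners, final_k)

groups = [
    ["Film", "film", "plastic", "film"],
    ["Plastic molds", "plastic molds", "polyethylene molds", "chairs"],
]
merged = merge_groups(groups)
print(merged)  # -> ['film', 'plastic molds']
```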

[0168] Next, the post-retrieval processing module 650 can merge the processed retrieval results (i.e., the first cargo identification result) with the results of the contraband detection expert module, and then generate a response message for the user through the generator 780.

[0169] For example, suppose the contraband detection expert module detects cigarettes in Figure 10. A second prompt word is then used to integrate the output of the contraband detection expert module with the list of goods names retrieved by the retrieval unit 640 and processed by the generator 780. An example of the second prompt word is as follows:

[0170] "The cargo in the container has a similar texture to the following cargo: list of retrieved cargo names

[0171] The container contained the following prohibited items: results from the contraband detection expert module

[0172] Goods to be declared: list of goods to be declared

[0173] Based on the above information, determine what goods are inside the container and answer the user's question.

[0174] User question: Do the goods in the container match the customs declaration information?

[0175] Output:"

[0176] In this example, the second prompt word mentioned above is instantiated as follows:

[0177] "The cargo in the container has a similar texture to the following cargo: film, plastic mold.

[0178] The container contained the following prohibited items: cigarettes.

[0179] Goods to be declared: film."
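
The instantiation of the second prompt word can be sketched as a simple template fill. The template text follows the example above; the names `SECOND_PROMPT_TEMPLATE` and `build_second_prompt`, and the "none" fallback for an empty contraband list, are assumptions made for illustration.

```python
SECOND_PROMPT_TEMPLATE = (
    "The cargo in the container has a similar texture to the following cargo: {retrieved}.\n"
    "The container contained the following prohibited items: {contraband}.\n"
    "Goods to be declared: {declared}.\n"
    "Based on the above information, determine what goods are inside the container "
    "and answer the user's question.\n"
    "User question: {question}\n"
    "Output:"
)

def build_second_prompt(retrieved, contraband, declared, question):
    """Fill the second prompt with merged names, expert results, and declared goods."""
    return SECOND_PROMPT_TEMPLATE.format(
        retrieved=", ".join(retrieved),
        contraband=", ".join(contraband) or "none",  # assumed fallback when nothing is detected
        declared=", ".join(declared),
        question=question,
    )

second_prompt = build_second_prompt(
    retrieved=["film", "plastic mold"],
    contraband=["cigarettes"],
    declared=["film"],
    question="Do the goods in the container match the customs declaration information?",
)
print(second_prompt)
```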

[0180] The generator will ultimately output the response, which could be, for example:

[0181] "The container does contain goods (film) that match the customs declaration information, but there may also be inconsistent goods (cigarettes, plastic molds)."

[0182] In this example, the generator is called twice: the first time to integrate the search results, and the second time to answer the user's question. Based on the generator selected for this example, the output results of the two calls are as follows:

[0183] First call (integrating the search results): product names: film, plastic mold

[0184] Second call (answering the user's question): The container does contain goods (film) that match the customs declaration information, but there may also be goods that do not match (cigarettes, plastic molds).
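
The two generator calls can be pictured as a small orchestration function. The sketch below assumes the generator is available as a plain callable that maps a prompt string to a response string; `answer_with_retrieval` and `fake_generate` are hypothetical names, and the stub generator merely records its prompts so that the call order is visible.

```python
from typing import Callable, Dict, List

def answer_with_retrieval(
    generate: Callable[[str], str],
    first_prompt: str,
    build_second_prompt: Callable[[str], str],
) -> Dict[str, str]:
    """Call the generator twice: once to merge names, once to answer the user."""
    merged_names = generate(first_prompt)                      # first call
    final_answer = generate(build_second_prompt(merged_names)) # second call
    return {"merged_names": merged_names, "answer": final_answer}

# Stub generator standing in for the large model; it only records the prompts.
calls: List[str] = []

def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return "film, plastic mold" if len(calls) == 1 else "stub answer"

result = answer_with_retrieval(
    fake_generate,
    "FIRST PROMPT",
    lambda names: f"Similar cargo: {names}. Prohibited items: cigarettes. Declared: film.",
)
```

Passing the generator in as a parameter keeps the flow independent of any particular model and makes the second prompt visibly depend on the first call's output.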

[0185] It should be noted that, in the retrieval enhancement process, if the user does not explicitly request the retrieval results (as in this example), the retrieval results shown in Figures 11A and 11B are not provided to the user through the graphic interface.

[0186] Figure 12 schematically illustrates a target image input by a user in another example of a multimodal large-scale model retrieval device for the security inspection field, which applies an embodiment of the present disclosure, wherein a general retrieval process is applied in this example.

[0187] This example demonstrates an image-text retrieval scenario within a customs image review process. User input is received through an interactive image-text interface. The text input is "Retrieve goods in the box exported from this port last month," and the image input is the image shown in Figure 12.

[0188] The retrieval source is the security inspection database in data management module 790. The vector database uses FAISS to manage container perspective images and their regional features, as well as the text features of the corresponding information. The structured database uses MySQL to manage declaration information, and a feature extraction module is used to extract features from both the perspective images and the text information. Each perspective image is bound to the declaration information in the structured database through an image number.

[0189] The intent recognition function in the central scheduling module 620 outputs a general search intent, such as a text and image search, and then executes a general search process based on the input text, image, and specified box.

[0190] The preprocessing module 610 can preprocess the input text and determine the search scope information. Specifically, it can use a query enhancement function to resolve context-dependent references in the query statement, such as replacing "this port" with "Port A" and "last month" with "June 2024". A Text2SQL model is then used to split the user input into two parts: a query against the vector database and a query against the SQL database. It identifies that the query statement asks for images similar to the container cargo, and converts the port and time conditions in the query statement into an SQL query statement. The converted SQL query statement is as follows:

[0191] SELECT *

[0192] FROM history table

[0193] WHERE Export Port = 'Port A'

[0194] AND Export time BETWEEN '2024-06-01' AND '2024-06-30'
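
The resolution of "last month" into a concrete date range and the construction of the scope query can be sketched as follows. This is only an illustration under stated assumptions: the identifiers `history_table`, `export_port`, and `export_time` are hypothetical renderings of the translated table and column names, the port is assumed to be known from the session context, and the `%s` placeholder style is the one used by common MySQL drivers (other drivers may differ).

```python
from datetime import date, timedelta

def last_month_range(today):
    """Return the first and last day of the month preceding `today`."""
    first_of_this_month = today.replace(day=1)
    last_day_prev = first_of_this_month - timedelta(days=1)
    return last_day_prev.replace(day=1), last_day_prev

def build_scope_query(port, today):
    """Build a parameterized scope query instead of interpolating values into SQL."""
    start, end = last_month_range(today)
    sql = (
        "SELECT * FROM history_table "
        "WHERE export_port = %s AND export_time BETWEEN %s AND %s"
    )
    return sql, (port, start.isoformat(), end.isoformat())

# Assuming the query was issued on 15 July 2024 at Port A.
sql, params = build_scope_query("Port A", date(2024, 7, 15))
print(params)  # -> ('Port A', '2024-06-01', '2024-06-30')
```

Keeping the values as parameters rather than concatenating them into the statement avoids quoting mistakes in the generated SQL.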

[0195] The retrieval unit 640 uses the feature extraction module for container vehicle images to extract features from the input image frame, performs feature vector retrieval to obtain cargo images with similar local features, and returns an image list with the corresponding declaration information. The data returned by the vector database is then further filtered with the SQL statement generated by the pre-retrieval processing module 760, so that only records meeting the conditions are kept. In this example, the filtered data contains only the cargo name, export time, and port name. Actual data can include various related information associated with the image, such as the cargo HS code, cargo name, cargo specifications, country of origin, company, and import / export port. Figure 13 shows the top four cargo images returned by the retrieval unit 640 and their corresponding text records.
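
One way to picture this two-step filtering is shown below: image numbers returned by the vector search are passed to a structured query that applies the port and time conditions. In this sketch an in-memory SQLite database stands in for MySQL, and the table `history`, its columns, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history ("
    "image_no TEXT PRIMARY KEY, cargo_name TEXT, export_port TEXT, export_time TEXT)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?)",
    [
        ("IMG001", "film", "Port A", "2024-06-03"),
        ("IMG002", "film", "Port B", "2024-06-10"),
        ("IMG003", "plastic mold", "Port A", "2024-05-20"),
        ("IMG004", "film", "Port A", "2024-06-28"),
    ],
)

def filter_candidates(conn, image_nos, port, start, end):
    """Keep only vector-search hits that satisfy the port and time conditions."""
    if not image_nos:
        return []
    placeholders = ", ".join("?" for _ in image_nos)
    sql = (
        "SELECT image_no, cargo_name, export_time, export_port FROM history "
        f"WHERE image_no IN ({placeholders}) "
        "AND export_port = ? AND export_time BETWEEN ? AND ?"
    )
    rows = {row[0]: row for row in conn.execute(sql, [*image_nos, port, start, end])}
    # Preserve the similarity order produced by the vector search.
    return [rows[n] for n in image_nos if n in rows]

# Image numbers as they might come back from the vector search, most similar first.
candidates = ["IMG004", "IMG002", "IMG003", "IMG001"]
kept = filter_candidates(conn, candidates, "Port A", "2024-06-01", "2024-06-30")
```

Re-ordering the SQL rows by the candidate list keeps the similarity ranking, which the SQL query on its own would not guarantee.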

[0196] The post-retrieval processing module 650 interacts with the generator 780 to perform operations such as merging and reordering the retrieval results, ensuring that the returned results meet the user's query requirements.

[0197] In this example, a third prompt word is used to merge the search results. Following customs tariff classification requirements, the returned product names are categorized, and the frequency of each category is calculated. The categories are then reordered by frequency, and the top three product names are further evaluated with an image-text matching model to determine their relevance. Finally, the highest-ranked category is output as the recommended product name for the user's search. The output content can include text and images. For example, the output text could be "The goods in the box may be film; a list of similar images is shown below," and the output image is shown in Figure 14. The combined output text and image are then returned to the user through the interactive image-text interface.
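
The shortlist-then-rerank step can be sketched as below. The image-text matching model is represented by a plain scoring callable, and the scores, names, and the function `recommend_name` are hypothetical; tariff-based categorization is again reduced to simple lower-casing.

```python
from collections import Counter

def recommend_name(names, match_score, top_n=3):
    """Shortlist the top_n most frequent names, then pick the best image-text match."""
    counts = Counter(n.strip().lower() for n in names)
    shortlist = [name for name, _ in counts.most_common(top_n)]
    return max(shortlist, key=match_score)

# Made-up relevance scores standing in for an image-text matching model.
scores = {"film": 0.8, "plastic mold": 0.6, "paper": 0.3, "chair": 0.99}

names = ["film", "Film", "plastic mold", "paper", "film", "chair"]
best = recommend_name(names, lambda name: scores.get(name, 0.0))
print(best)  # -> film
```

Note that "chair" has the highest made-up score but is never considered, because it does not reach the frequency shortlist.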

[0198] According to the embodiments of this disclosure, the system process of device 700 is divided into two core stages: data import and user retrieval. The two work closely together to support an efficient and intelligent information processing system.

[0199] Introduction to the data import process.

[0200] Data import is the foundation for building an information retrieval system. It aims to integrate diverse data sources (including unstructured data such as security inspection images, document text, visible light images, and videos) and transform them into a searchable format. This process includes both batch and single data import modes to accommodate data processing needs of different scales.

[0201] Data processing: The data management module cleans, standardizes, and classifies the collected data to ensure data quality and consistency.

[0202] Feature extraction: Using various feature extraction modules in the retrieval tool, key information is extracted from data such as images and text to generate feature vectors or keywords for retrieval.

[0203] Data storage: The processed data and its characteristic information are stored in the corresponding database, supporting fast retrieval and efficient access. Simultaneously, an incremental update strategy is implemented to ensure the timeliness and accuracy of the database content.
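
The combination of single import, batch import, and incremental update might be sketched as follows. This is an assumed design, not the disclosed one: records are keyed by image number, a SHA-256 digest of the image bytes is used to skip unchanged items, and `InspectionStore` and the stub feature extractor are hypothetical.

```python
import hashlib

class InspectionStore:
    """In-memory stand-in for the vector and structured databases."""

    def __init__(self):
        self.records = {}  # image_no -> {"digest", "features", "declaration"}

    def import_one(self, image_no, image_bytes, declaration, extract_features):
        """Import a single item; return True only if something was added or updated."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        existing = self.records.get(image_no)
        if existing and existing["digest"] == digest and existing["declaration"] == declaration:
            return False  # unchanged: skip re-extraction (incremental update)
        self.records[image_no] = {
            "digest": digest,
            "features": extract_features(image_bytes),
            "declaration": declaration,
        }
        return True

    def import_batch(self, items, extract_features):
        """Import many (image_no, image_bytes, declaration) items; return the change count."""
        return sum(self.import_one(*item, extract_features) for item in items)

def fake_extractor(image_bytes):
    """Stub feature extractor; a real system would call an image model here."""
    return [len(image_bytes)]

store = InspectionStore()
batch = [
    ("IMG001", b"raw-bytes-1", {"cargo_name": "film"}),
    ("IMG002", b"raw-bytes-22", {"cargo_name": "plastic mold"}),
]
first_pass = store.import_batch(batch, fake_extractor)
second_pass = store.import_batch(batch, fake_extractor)
```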

[0204] User search process: The user search process is a direct reflection of the system's value. Through a series of refined operations, it transforms users' query needs into accurate answers or suggestions.

[0205] Introduction to the user search process.

[0206] Graphical and textual interaction: Users can input their queries in the form of images or text through a user-friendly graphical and textual interface, and the system will ultimately provide the answer in the form of images or text.

[0207] Central Scheduling: The central scheduling module intelligently analyzes the user's input intent and allocates the optimal execution path for different types of queries (such as retrieval questions, retrieval enhancement questions, etc.).

[0208] Pre-processing: For queries that require retrieval or retrieval enhancement, the pre-processing module expands, rewrites, and completes the query request to generate more accurate query statements that are better suited for the retrieval unit to process.

[0209] Retrieval execution: The retrieval unit performs retrieval operations across multiple data sources based on the processed query. For vector retrieval requests, query features are extracted first, and efficient vector retrieval algorithms are used to quickly locate relevant information.

[0210] Post-search processing and enhancement: The post-processing module integrates and sorts the search results. For general search intents, the output of this module can be directly used as the user's final output. For search enhancement intents, this module also needs to add necessary prompts to provide high-quality input to the generator.

[0211] Response generation: The generator is based on a deep learning model and combines prompts, query results, and the user's original request to generate natural, fluent responses or suggestions that meet the user's needs.
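
The routing performed by the central scheduling module can be illustrated with a small dispatcher. Everything here is a simplifying assumption: a keyword rule stands in for the intent recognition model, and the handlers only report which steps would run for each intent category.

```python
def classify_intent(text):
    """Hypothetical keyword rule standing in for the intent recognition model."""
    lowered = text.strip().lower()
    if lowered.startswith(("retrieve", "search", "find")):
        return "general_retrieval"
    return "retrieval_enhancement"

def handle_general(text, image):
    # For general retrieval, the post-processing output is returned directly.
    return {"flow": "general_retrieval",
            "steps": ["pre-processing", "retrieval", "post-processing"]}

def handle_enhancement(text, image):
    # For retrieval enhancement, the generator produces the final response.
    return {"flow": "retrieval_enhancement",
            "steps": ["pre-processing", "retrieval", "post-processing", "response generation"]}

HANDLERS = {
    "general_retrieval": handle_general,
    "retrieval_enhancement": handle_enhancement,
}

def dispatch(text, image=None):
    """Route a user request to the processing flow for its intent category."""
    return HANDLERS[classify_intent(text)](text, image)

general = dispatch("Retrieve goods in the box exported from this port last month")
enhanced = dispatch("Does the cargo in the container match the customs declaration information?")
```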

[0212] In the above process, the system records user queries and their search results as feedback data to continuously optimize various models within the retrieval system. Simultaneously, for new data not included in the search source, the system will perform necessary data import operations to enrich data resources and improve search performance.

[0213] Through the above process, device 700 not only achieves efficient and accurate information retrieval and response generation, but also continuously improves its intelligence level and user experience through a continuous feedback and update mechanism.

[0214] The multimodal large model retrieval method and apparatus of this disclosure bring significant technological advancements and operational optimizations to the field of security inspection. The specific beneficial effects can be summarized as follows:

[0215] 1. Optimize user experience and improve work efficiency.

[0216] This disclosure innovatively introduces a graphical and text-based interactive method, enabling security personnel to retrieve information and issue commands simply through natural language dialogue, eliminating the need for complex database queries. This intuitive and convenient interaction method significantly lowers the operational threshold, substantially improves work efficiency, and reduces the workload of security personnel.

[0217] 2. Enhance data processing capabilities and broaden the scope of information utilization.

[0218] This disclosure provides powerful data processing capabilities, enabling efficient handling of massive amounts of structured and unstructured data. In particular, for unstructured data (such as images and videos) that was previously difficult to utilize effectively, the system achieves efficient integration and utilization through multimodal large-scale model retrieval technology, providing richer and more comprehensive information support for security inspection work. Combined with the deep learning capabilities of the large-scale model, this system can further uncover the relationships and patterns behind the data, providing security personnel with more in-depth and accurate retrieval results. This deep retrieval capability helps to discover potential risk points and improve the accuracy of inspection work.

[0219] 3. Make up for the knowledge deficiencies of large models and solve the problem of local knowledge base.

[0220] For scenarios where large models are used as generators, this disclosure employs retrieval enhancement technology to effectively supplement the large models with massive amounts of security inspection data. Particularly for complex scenarios such as perspective images, the system can provide necessary background information and reference data for the current image by retrieving and analyzing historical image content, thereby improving the intelligence and practicality of the large model in security inspections.

[0221] 4. Enhance comprehensive risk identification capabilities

[0222] Through rapid retrieval and retrieval enhancement technologies, the embodiments of this disclosure can quickly identify relationships and potential risks associated with the current goods, helping security personnel to prevent, promptly detect, and handle high-risk situations. The system can also perform intelligent analysis based on real-time data and historical cases, providing more accurate and well-founded decision support for security inspection work, thereby further improving inspection efficiency and security.

[0223] In summary, the multimodal large model retrieval method and apparatus of this disclosure have demonstrated significant beneficial effects in the field of security inspection. They not only optimize the interactive experience and improve work efficiency, but also enhance data processing capabilities and broaden the scope of information utilization. At the same time, they make up for the knowledge deficiencies of large models, enhance the level of intelligence, and ultimately improve risk identification capabilities and regulatory effectiveness.

[0224] Figure 15 schematically illustrates an electronic device 900 suitable for implementing the multimodal large model retrieval method for the security inspection field according to embodiments of the present disclosure.

[0225] As shown in FIG. 15, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage portion 908 into a random access memory (RAM) 903. The processor 901 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and / or an associated chipset and / or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 901 may also include onboard memory for caching purposes. The processor 901 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of the present disclosure.

[0226] RAM 903 stores various programs and data required for the operation of electronic device 900. Processor 901, ROM 902, and RAM 903 are interconnected via bus 904. Processor 901 performs various operations of the method flow according to embodiments of the present disclosure by executing programs in ROM 902 and / or RAM 903. It should be noted that the programs may also be stored in one or more memories other than ROM 902 and RAM 903. Processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in said one or more memories.

[0227] According to embodiments of this disclosure, the electronic device 900 may further include an input / output (I / O) interface 905, which is also connected to a bus 904. The electronic device 900 may also include one or more of the following components connected to the I / O interface 905: an input section 906 including a keyboard, mouse, etc.; an output section 907 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN card, modem, etc. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I / O interface 905 as needed. A removable medium 911, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 910 as needed so that computer programs read from it can be installed into the storage section 908 as needed.

[0228] This disclosure also provides a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs that, when executed, implement the method according to the embodiments of this disclosure.

[0229] According to embodiments of this disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, such as including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this disclosure, the computer-readable storage medium may include ROM 902 and / or RAM 903 and / or one or more memories other than ROM 902 and RAM 903 described above.

[0230] Embodiments of this disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code is used to cause the computer system to implement the methods provided in the embodiments of this disclosure.

[0231] When the computer program is executed by the processor 901, it performs the functions defined in the system / apparatus of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, apparatuses, modules, units, etc., described above can be implemented by computer program modules.

[0232] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and downloaded and installed via the communication section 909, and / or installed from a removable medium 911. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.

[0233] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 909, and / or installed from the removable medium 911. When the computer program is executed by the processor 901, it performs the functions defined in the system of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.

[0234] According to embodiments of this disclosure, program code for executing the computer programs provided in embodiments of this disclosure can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages. The program code can execute entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0235] The above one or more embodiments have the following advantages or beneficial effects: the large model is used twice. The first time, the large model is used to organize the first search results. The second time, the large model is used to answer user questions. In this way, on the one hand, the large model is used to organize the initial results retrieved from the security inspection database, so that usable and reliable security clues can be provided by utilizing historical security inspection data. On the other hand, the large model can provide feedback to users based on the search results by answering user questions, which can meet the diverse user needs in security inspection and improve the user experience.

[0236] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0237] Those skilled in the art will understand that the features described in the various embodiments and / or claims of this disclosure can be combined and / or integrated in various ways, even if such combinations or integrations are not explicitly described in this disclosure. In particular, the features described in the various embodiments and / or claims of this disclosure can be combined and / or integrated in various ways without departing from the spirit and teachings of this disclosure. All such combinations and / or integrations fall within the scope of this disclosure.

[0238] The embodiments of this disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of this disclosure. Although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be used advantageously in combination. The scope of this disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of this disclosure, and all such substitutions and modifications should fall within the scope of this disclosure.

Claims

1. A multimodal large model retrieval method for the security inspection field, comprising: receiving user input, the user input including target text and a target image; and processing the user input according to a retrieval enhancement process; wherein processing the user input according to the retrieval enhancement process includes: extracting image blocks to be identified from the target image to obtain at least one first image block; retrieving the first image block in a security inspection database to obtain a first retrieval result, wherein the first retrieval result includes at least one first security inspection image and first text record data of the image block in the first security inspection image that successfully matches the first image block, and the security inspection database is a database formed based on historical security inspection data, which includes at least security inspection images and text record data of the security inspection images; inputting the first text record data in the first retrieval result into a large model, and using a preset first prompt word to prompt the large model to determine the type of goods in the first image block based on the input first text record data, so as to obtain a first cargo identification result; obtaining an overall cargo identification result based on the first cargo identification result corresponding to at least one of the first image blocks; inputting the overall cargo identification result and the target text into the large model, and using a preset second prompt word to prompt the large model to generate first response content for the target text based on the overall cargo identification result; and outputting the first response content.

2. The method according to claim 1, wherein processing the user input according to the retrieval enhancement process further includes: using at least one object detection expert model to identify the first image block to obtain a second cargo identification result, wherein the object detection expert model is a machine learning model based on image recognition; and wherein obtaining the overall cargo identification result based on the first cargo identification result corresponding to at least one of the first image blocks includes: obtaining the overall cargo identification result based on the first cargo identification result and the second cargo identification result corresponding to at least one of the first image blocks.

3. The method of claim 2, wherein the at least one object detection expert model includes at least one contraband detection model, and each contraband detection model is used to detect one type of contraband.

4. The method of claim 1, wherein extracting the image blocks to be identified from the target image to obtain at least one first image block includes: determining, based on the content of the target text, a region to be identified in the target image, the region to be identified including the entire target image or a user-specified region in the target image; and segmenting at least one first image block from the region to be identified based on image texture features.

5. The method according to claim 1, further comprising: identifying the user intent of the target text using an intent recognition model; determining a user intent category based on the user intent and a preset intent classification; and when the user intent category is a retrieval enhancement-related intent category, processing the user input according to the retrieval enhancement process, wherein the intents in the retrieval enhancement-related intent category include identifying goods in an image.

6. The method according to claim 5, further comprising: when the user intent category is a general retrieval intent category, processing the user input according to a general retrieval process, wherein the intents in the general retrieval intent category include retrieving data from the security inspection database; wherein processing the user input according to the general retrieval process includes: extracting image blocks to be retrieved from the target image to obtain at least one second image block; retrieving the second image block in the security inspection database to obtain a second retrieval result, wherein the second retrieval result includes at least one second security inspection image, as well as tag information and second text record data of the image block in the second security inspection image that successfully matches the second image block; and using a preset third prompt word to prompt the large model to organize the second retrieval result according to the target text, and outputting the organized result.

7. The method of claim 6, wherein the step of retrieving the second image block in the security inspection database to obtain the second retrieval result further includes: preprocessing the target text to determine retrieval scope information; and retrieving the second image block in the security inspection database according to the retrieval scope information to obtain the second retrieval result.

8. The method of claim 1, wherein the security inspection database includes a structured database and an unstructured database, and the unstructured database includes at least one of a vector database and a graph database.

9. The method of claim 1, wherein the target image includes a perspective image, and the security inspection image includes a perspective image.

10. A multimodal large model retrieval device for the security inspection field, comprising: a graphic interaction module for receiving user input, the user input including target text and a target image; a central scheduling module for processing the user input according to a retrieval enhancement process, wherein the central scheduling module processes the user input by calling an image segmentation expert module, a retrieval unit, and a retrieval post-processing module; the image segmentation expert module, for extracting image blocks to be identified from the target image to obtain at least one first image block; the retrieval unit, for retrieving the first image block in a security inspection database to obtain a first retrieval result, wherein the first retrieval result includes at least one first security inspection image and first text record data of the image block in the first security inspection image that successfully matches the first image block, and the security inspection database is a database formed from historical security inspection data, which includes at least security inspection images and text record data of the security inspection images; and the retrieval post-processing module, for: inputting the first text record data in the first retrieval result into a large model, and using a preset first prompt word to prompt the large model to determine the type of goods in the first image block based on the input first text record data, so as to obtain a first cargo recognition result; obtaining an overall cargo recognition result based on the first cargo recognition result corresponding to at least one of the first image blocks; and inputting the overall cargo recognition result and the target text into the large model, and using a preset second prompt word to prompt the large model to generate first response content for the target text based on the overall cargo recognition result; wherein the graphic interaction module is also used to output the first response content.
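
The data flow among the modules of claim 10 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed device: the prompt templates `FIRST_PROMPT` and `SECOND_PROMPT`, the function `retrieval_enhanced_answer`, and the injected callables `segment`, `retrieve`, and `llm` (standing in for the image segmentation expert module, the retrieval unit, and the large model) are hypothetical, as is aggregating the per-block results by joining the distinct cargo types.

```python
from typing import Callable, List

# Hypothetical prompt templates standing in for the preset prompt words.
FIRST_PROMPT = ("Based on these historical inspection records, "
                "name the cargo type:\n{records}")
SECOND_PROMPT = ("Recognized cargo: {cargo}\n"
                 "User question: {question}\n"
                 "Answer the question.")


def retrieval_enhanced_answer(
    target_text: str,
    target_image: str,
    segment: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Segment, retrieve per block, recognize per block, aggregate, then answer."""
    per_block_results = []
    for block in segment(target_image):
        # Text record data of historical image blocks matching this block.
        records = retrieve(block)
        # First prompt: ask the large model for this block's cargo type.
        per_block_results.append(
            llm(FIRST_PROMPT.format(records="\n".join(records))))
    # Assumed aggregation: the sorted set of distinct per-block cargo types.
    overall = ", ".join(sorted(set(per_block_results)))
    # Second prompt: answer the target text given the overall result.
    return llm(SECOND_PROMPT.format(cargo=overall, question=target_text))
```

Passing the three components in as callables mirrors the central scheduling module calling separate modules, and lets each one be replaced by a stub when exercising the flow.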

11. An electronic device, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 9.

12. A computer-readable storage medium having stored thereon a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.

13. A computer program product comprising a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.