Consignment inspection method, computer device, and storage medium

By processing the declaration information and physical images of consigned items using a multimodal large model, the problem of low efficiency in traditional manual inspection is solved, intelligent inspection is achieved, inspection efficiency and accuracy are improved, and the cost of manual inspection is reduced.

WO2026138214A1 · PCT designated stage · Publication Date: 2026-07-02 · SF TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SF TECH CO LTD
Filing Date: 2025-11-11
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Traditional express delivery inspection relies on manual inspection, which results in high workload and low efficiency, making it difficult to meet the needs of the significant increase in cargo flow and the diversification of trade forms in global trade.

Method used

A multimodal large model is used to process the declaration information and physical images of the consigned items. Through image segmentation and guide word optimization, combined with the multimodal large model, information comparison is performed to automatically identify the name, specifications, and quantity of the consigned items and generate inspection results.

Benefits of technology

It improves inspection efficiency, reduces the time and subjective factors of manual inspection, lowers the risk of oversight, reduces the cost of manual inspection, and is adaptable to large-scale inspection tasks.



Abstract

The present application discloses a consignment inspection method, a computer device, and a storage medium, which can improve the efficiency of inspection. The consignment inspection method comprises: acquiring declaration information and a physical image of a consignment; preprocessing the physical image to obtain a preprocessed image; respectively determining a prompt of the physical image and a prompt of the preprocessed image; and inputting the declaration information, the physical image, the preprocessed image, and the prompts into a multi-modal large model to obtain an inspection result outputted by the multi-modal large model.

Description

A method for inspecting mailed items, computer equipment, and storage medium

Technical Field

[0001] This application relates to the field of inspection technology, specifically to a method for inspecting consigned items, computer equipment, and storage medium.

Background Technology

[0002] Inspection of express delivery items is a crucial aspect of express delivery supervision. Traditional express delivery inspection mainly relies on manual inspection, which requires staff to check each item one by one, resulting in high workload and low efficiency. Especially against the backdrop of rapid growth in global trade, the volume of goods has increased significantly, and trade forms have become increasingly diversified and complex, making traditional manual inspection methods inadequate for current needs.

Summary of the Invention

[0003] To address the aforementioned technical problems, this application is proposed. Embodiments of this application provide a method for inspecting consigned items, a computer device, and a storage medium, which can improve inspection efficiency.

[0004] According to a first aspect of this application, a method for inspecting consigned goods is provided, comprising: obtaining declaration information and physical images of the consigned goods; preprocessing the physical images to obtain preprocessed images; determining guide words for the physical images and preprocessed images respectively; inputting the declaration information, physical images, preprocessed images and guide words into a multimodal large model to obtain the inspection results output by the multimodal large model.

[0005] As one possible implementation, preprocessing the physical image to obtain a preprocessed image includes: segmenting the physical image to obtain a preprocessed image in which the consigned items are marked.

[0006] As one possible implementation, segmenting the physical image to obtain a preprocessed image in which the consigned items are marked includes: inputting the physical image into an image segmentation model, wherein the physical image includes the object to be segmented; segmenting, by the image segmentation model and based on preset prompts, the object in the physical image; and obtaining the segmentation result output by the image segmentation model, wherein the segmentation result includes the preprocessed image in which the object is marked.

[0007] As one possible approach, the segmentation results are represented by outlining the contours of the object using different colors or lines.

[0008] As one possible implementation, determining guide words for the physical image and the preprocessed image respectively includes: based on the declaration information, determining a first guide word corresponding to the physical image and a second guide word corresponding to the preprocessed image; wherein the guide words include a role, context, task, format, and example, and the guide words are used to indicate to the multimodal large model the difference between the physical image and the preprocessed image.

[0009] As one possible implementation, determining, based on the declaration information, the first guide word corresponding to the physical image and the second guide word corresponding to the preprocessed image includes: based on the second guide word, setting the role of the preprocessed image as a counting expert, wherein the counting expert indicates that the preprocessed image is used to calibrate the quantity recognition function of the multimodal large model; and based on the first guide word, setting the role of the physical image as an auditing expert, wherein the auditing expert indicates that information recognition is performed by combining the preprocessed image and the declaration information.

[0010] As one possible implementation, the declaration information includes name, specifications, and quantity. Inputting the declaration information, physical image, preprocessed image, and guide words into the multimodal large model to obtain the inspection result output by the multimodal large model includes: inputting the declaration information, physical image, preprocessed image, and guide words into the multimodal large model to obtain the item parameters of the consigned item output by the model, the item parameters including one or more of the item name, item specifications, and item quantity; comparing the item name, item specifications, and item quantity output by the multimodal large model with the name, specifications, and quantity in the declaration information; and, if any of the compared pairs (name vs. item name, specifications vs. item specifications, quantity vs. item quantity) is inconsistent, the inspection result output by the multimodal large model is "inspection failed".

[0011] As one possible implementation, outputting an inspection result of "inspection failed" when any of the compared pairs (name vs. item name, specifications vs. item specifications, quantity vs. item quantity) is inconsistent includes: when a name in the declaration information does not exist in the physical image, or when the physical image contains an item name not included in the declaration information, determining that the name is inconsistent with the item name; when a specification in the declaration information does not exist in the physical image, or when the physical image contains a specification not included in the declaration information, determining that the specifications are inconsistent with the item specifications; and when the quantity in the declaration information is inconsistent with the quantity in the preprocessed image, determining that the quantity is inconsistent with the item quantity.

[0012] As one possible implementation, it also includes: performing structured processing on the inspection results to obtain structured inspection results; displaying the structured inspection results; and inputting the structured inspection results into a data analysis tool for data processing.

[0013] As one possible implementation, after obtaining the declaration information and physical image of the consigned item, the consigned item inspection method also includes: pre-cleaning the declaration information to obtain the declaration text; performing keyword recognition on the declaration text to obtain one or more of the name, specifications, and quantity; and performing structured processing on the extracted name, specifications, and quantity to obtain structured name information, structured specifications information, and structured quantity information.

[0014] As one possible implementation, inputting the declaration information, physical image, preprocessed image, and guide words into the multimodal large model to obtain the inspection result output by the multimodal large model includes: inputting the structured name information, structured specification information, structured quantity information, physical image, preprocessed image, and guide words into the multimodal large model to obtain the inspection result output by the multimodal large model.

[0015] As one possible implementation, the multimodal large model is a model that integrates data of multiple modalities for comprehensive understanding and reasoning, wherein the data of multiple modalities includes at least two of text, images, audio, and video.

[0016] According to a second aspect of this application, a computer device is provided, comprising: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the consignment inspection method of the first aspect or any implementation thereof.

[0017] According to a third aspect of this application, a computer-readable storage medium is provided, the storage medium storing a computer program for performing the consignment inspection method of the first aspect or any implementation thereof.

[0018] According to a fourth aspect of this application, a computer program product is provided, including program code for performing a method as described in the first aspect or any implementation thereof.

[0019] The consignment inspection method, computer device, and storage medium provided in this application preprocess images of consigned items to make the large model's recognition more targeted and accurate. This achieves intelligent inspection, improves inspection efficiency, shortens inspection time, reduces the omissions that arise because manual inspection is easily affected by subjective factors, lowers the cost of manual inspection, and makes it possible to handle large-scale inspection tasks.

Attached Figure Description

[0020] The above and other objects, features, and advantages of this application will become more apparent from the more detailed description of the embodiments of this application in conjunction with the accompanying drawings. The drawings are provided to further illustrate the embodiments of this application and form part of the specification. They are used together with the embodiments of this application to explain this application and do not constitute a limitation thereof. In the drawings, the same reference numerals generally represent the same components or steps.

[0021] Figure 1 is a flowchart illustrating a method for inspecting consigned items provided in an exemplary embodiment of this application.

[0022] Figure 2 is a schematic diagram of the structure of a consignment inspection device provided in an exemplary embodiment of this application.

[0023] Figure 3 is a structural diagram of an electronic device provided in an exemplary embodiment of this application.

Detailed Implementation

[0024] Hereinafter, exemplary embodiments according to this application will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this application, and not all embodiments of this application. It should be understood that this application is not limited to the exemplary embodiments described herein.

[0025] Application Overview

[0026] Express delivery inspection is a crucial step in ensuring package security. It relies primarily on manual inspection, which is time-consuming, especially when package volumes are high; long queues and inspection times reduce overall service efficiency. For example, customs inspections primarily target cross-border shipments, such as goods purchased overseas and shipped into China. During a customs inspection, not every package is inspected: assuming a container holds 100 packages, 10 might be selected. Traditionally, these 10 packages are opened and inspected by hand. First, the purchaser or consignor fills out a declaration form that includes the name, specifications, and quantity of the goods, usually in text format. After receiving this information, customs officials open the packages and compare the contents with the written information, checking for discrepancies in name, specifications, and quantity. If any discrepancy is found, the shipment is intercepted. Manual inspection involves a large amount of information comparison, which consumes the time and energy of operators and leaves the inspection susceptible to subjective factors and the risk of oversight. To address this issue, this application proposes using a multimodal large model to identify the name, specifications, model, and quantity of all consigned items in the image, and then comparing that information with the name, specifications, model, and quantity the user filled in on the declaration. If any comparison fails, an interception mark is returned. After receiving the interception mark, the operator intercepts the consigned item. This greatly shortens inspection time, improves customs clearance efficiency, and reduces the impact of human factors.

[0027] Exemplary methods

[0028] To address the problems associated with manual inspection, a new method for inspecting consigned items is proposed below, which can improve inspection efficiency. Figure 1 is a flowchart illustrating the method for inspecting consigned items provided in an exemplary embodiment of this application. Taking Figure 1 as an example, the declaration information and physical image of the consigned item are first obtained (see S100 in Figure 1); these are the basic content for information comparison. The physical image, which is taken by the operator at the express delivery inspection station, is then preprocessed to obtain a preprocessed image (see S200 in Figure 1). To reduce the recognition error rate of the large model, the physical image is preprocessed before being input into the large model so that the consigned item stands out. Next, guide words are determined for the physical image and the preprocessed image respectively (see S300 in Figure 1). Because both images are input into the large model for inspection, each needs its own guide word to give the model additional information about the image content, context, or expected output direction, thereby optimizing the model's performance and output quality. Finally, the declaration information, physical image, preprocessed image, and guide words are input into the multimodal large model to obtain the inspection result output by the multimodal large model (see S400 in Figure 1). Operators only need to intercept or release the consigned goods according to the inspection result, which improves overall inspection and customs clearance efficiency and reduces the trade costs of enterprises.

[0029] The following section, in conjunction with Figure 1, provides a more detailed description of the consignment inspection method provided in the embodiments of this application.
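The order of steps S100 to S400 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `InspectionInput`, `inspect_consignment`, `preprocess`, `build_guide_words`, and `call_model` are hypothetical, and the three callables are placeholders for whatever segmentation routine, guide word builder, and multimodal model client a deployment would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class InspectionInput:
    declaration: str       # free-text declaration filled in by the user (S100)
    physical_image: bytes  # photo taken at the inspection station (S100)


def inspect_consignment(
    item: InspectionInput,
    preprocess: Callable[[bytes], bytes],
    build_guide_words: Callable[[str], Dict[str, str]],
    call_model: Callable[[str, bytes, bytes, Dict[str, str]], str],
) -> str:
    """Run S200 to S400 in order; every callable is a hypothetical stand-in."""
    # S200: preprocess the physical image (for example, by segmentation).
    preprocessed = preprocess(item.physical_image)
    # S300: derive one guide word per image from the declaration.
    guide_words = build_guide_words(item.declaration)
    # S400: pass everything to the multimodal large model and return its result.
    return call_model(
        item.declaration, item.physical_image, preprocessed, guide_words
    )
```

Passing the three stages in as callables keeps the sketch independent of any particular model or segmentation library.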

[0030] In S100, the declaration information and physical images of the consigned items are obtained. The declaration information is filled out by the user, including the name, specifications, and quantity of the consigned items, usually in text format. The physical images of the consigned items are taken by the operators at the express delivery inspection station. For example, after the package is opened, the items inside are placed on a special tray. Multiple items can be placed on the special tray. To improve the recognition efficiency and accuracy of the large model, the items can be placed upright with gaps, and clear images must be taken.

[0031] In addition to the name, specifications, and quantity of the consignment, the declaration information often contains a lot of irrelevant text. Therefore, after S100, the declaration information can be extracted and simplified through natural language processing. First, the declaration information is pre-cleaned (for example, by text cleaning, word segmentation, and stop word removal): irrelevant characters are removed, the text is split into meaningful words or phrases, and common words that do not help information extraction (such as "de" and "le") are removed, yielding the declaration text. Next, the key entities in the declaration text are identified and the sentence structure is analyzed to find key components such as the subject, predicate, and object and the relationships between them. Keyword recognition is then performed on the declaration text to obtain one or more of the name, specifications, and quantity. Finally, the extracted name, specifications, and quantity are structured (for example, as JSON or a table) to obtain structured name information, structured specification information, and structured quantity information. In this case, S400 can be adjusted to: input the structured name information, structured specification information, structured quantity information, physical image, preprocessed image, and guide words into the multimodal large model to obtain the inspection result output by the multimodal large model.
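As a rough illustration of the pre-cleaning, keyword recognition, and structuring steps, the sketch below uses only regular expressions from the Python standard library. It assumes an English declaration with labelled fields such as "Name:" and "Qty:"; the stop word list, the field patterns, and the function names are illustrative assumptions, and a real system would more likely use a proper word segmenter and entity recognizer.

```python
import json
import re

# Illustrative stop words only; a real list depends on the declaration language.
STOP_WORDS = {"please", "note", "kindly"}

# Hypothetical field labels; the source does not specify a declaration layout.
FIELD_PATTERNS = {
    "name": re.compile(r"name\s*[:：]\s*([^;,\n]+)", re.IGNORECASE),
    "specification": re.compile(
        r"spec(?:ification)?s?\s*[:：]\s*([^;,\n]+)", re.IGNORECASE
    ),
    "quantity": re.compile(r"(?:quantity|qty)\s*[:：]\s*(\d+)", re.IGNORECASE),
}


def pre_clean(text: str) -> str:
    """Replace irrelevant characters with spaces, then drop stop words."""
    text = re.sub(r"[^\w\s:：;,.\-]", " ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


def extract_fields(text: str) -> dict:
    """Return whichever of name, specification, and quantity can be found."""
    cleaned = pre_clean(text)
    fields = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cleaned)
        if match:
            value = match.group(1).strip()
            fields[key] = int(value) if key == "quantity" else value
    return fields


if __name__ == "__main__":
    raw = "***Please note*** Name: ceramic mug; Specification: 350 ml; Qty: 5 !!!"
    print(json.dumps(extract_fields(raw), ensure_ascii=False))
```

The resulting dictionary serializes directly to JSON, which is one of the structured forms the paragraph above mentions.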

[0032] In S200, the physical image is preprocessed to obtain a preprocessed image. Multimodal large models have a known weakness when recognizing images: quantity recognition is not accurate enough. For example, if 5 packaging boxes are placed in the picture, the large model may recognize 3, 4, or 6. Quantity recognition matters for express delivery inspection. To compensate for this weakness, the physical image can be preprocessed so that, in the preprocessed image, the edges of the objects are clear and the objects themselves are prominent, which helps the multimodal large model recognize the quantity more reliably.

[0033] One possible approach to preprocessing the physical image is to segment it to obtain a preprocessed image in which the objects are marked. This can be done with SAM (Segment Anything Model, an open-source image segmentation model from Meta), which segments the objects in the image so that each object is effectively marked. SAM can segment objects in an image without requiring annotations for that image. One preprocessing procedure is therefore as follows: select the physical image to be segmented, making sure that it is clear and contains the object to be segmented, and input it into the image segmentation model; based on preset prompts, the image segmentation model segments the object in the physical image; then obtain the segmentation result output by the image segmentation model. The segmentation result includes a preprocessed image in which the object is marked, for example with the object outline delineated in different colors or lines.
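The marking step can be pictured with a toy example. The sketch below does not call SAM; it assumes a segmentation model has already produced a binary foreground mask (1 for object pixels, 0 for background) and shows, in plain Python, how separate objects could be grouped and how the outline pixels that would be drawn in a distinct color could be found. The function names and the 4-connectivity choice are assumptions made for illustration.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def _neighbours(y: int, x: int) -> Tuple[Pixel, ...]:
    """The four direct neighbours of a pixel."""
    return ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))


def label_objects(mask: List[List[int]]) -> List[Set[Pixel]]:
    """Group foreground pixels (value 1) into 4-connected objects."""
    rows, cols = len(mask), len(mask[0])
    seen: Set[Pixel] = set()
    objects: List[Set[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or (r, c) in seen:
                continue
            component: Set[Pixel] = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                component.add((y, x))
                for ny, nx in _neighbours(y, x):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append(component)
    return objects


def outline(component: Set[Pixel]) -> Set[Pixel]:
    """Object pixels that touch at least one pixel outside the object."""
    return {
        (y, x)
        for (y, x) in component
        if any(n not in component for n in _neighbours(y, x))
    }
```

In a real pipeline each outline would be drawn onto a copy of the physical image, one color per object, to produce the preprocessed image.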

[0034] In S300, guide words are defined for both the physical image and the preprocessed image. A guide word states what type of content the model is expected to generate and specifies how, or by whom, that content will be viewed. This explicitness narrows the scope of the content generated by the model, so the results align better with the requirements. Guide words can also significantly improve the relevance, accuracy, and quality of the model's output. Given specific details, background information, and formatting guidelines, the model can understand the task requirements more accurately and adapt better to different scenarios and user needs.

[0035] Based on the simplified declaration information, guide words can be designed that include a role, context, task, format, and example. For example, based on the declaration information, a first guide word is determined for the physical image and a second guide word for the preprocessed image; the guide words indicate to the multimodal large model the difference between the physical image and the preprocessed image. Although the consigned items shown in the preprocessed image and the physical image are the same, the preprocessed image emphasizes the consigned items, which better helps the multimodal large model identify the number of items. In other words, setting different guide words for the preprocessed image and the physical image clearly indicates the key information in each image and helps the multimodal large model understand the image content more accurately. Specific guide words highlight the important elements in the image and reduce the possibility of misidentification or omission. In addition, guide words help optimize the processing flow of the multimodal large model and improve recognition efficiency: by providing clear recognition targets, they guide the model to locate the key areas of the image quickly and avoid unnecessary calculation and analysis. This saves time, reduces the resource consumption of the multimodal large model, and improves overall performance.

[0036] Because the physical image and the preprocessed image serve different tasks, different roles can be assigned to them. As one implementation, based on the second guide word, the role of the preprocessed image is set as a counting expert, which indicates that the preprocessed image is used to calibrate the quantity recognition function of the multimodal large model. Based on the first guide word, the role of the physical image is set as an auditing expert, which indicates that information recognition is performed by combining the preprocessed image and the declaration information. The preprocessed image primarily assists the multimodal large model in quantity recognition, so setting its role as a counting expert helps calibrate that function and makes the count as accurate as possible. The physical image, on the other hand, needs to be combined with the preprocessed image to perform, in sequence, item name and specification recognition, quantity recognition, content translation, and limiting of the scope of specification and model recognition. Therefore, the role of the physical image can be set as an auditing expert that combines the preprocessed image and the declaration information to determine, item by item, whether each item is included in the declaration information.

[0037] Understandably, the roles need not be called "counting expert" and "auditing expert"; different names can be assigned to achieve the same purpose. Alternatively, roles can be freely named to achieve other purposes, and various guide words can be set as needed to help the multimodal large model complete express delivery inspection.
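The five-part guide word and the two roles described above can be sketched as a small data structure. The field wording, the dictionary keys `physical` and `preprocessed`, and the names `GuideWord` and `build_guide_words` are illustrative assumptions; the source only specifies that a guide word contains a role, context, task, format, and example, and that the two images receive the auditing expert and counting expert roles respectively.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GuideWord:
    role: str
    context: str
    task: str
    output_format: str
    example: str

    def render(self) -> str:
        """Render the five parts as one prompt string, one part per line."""
        return "\n".join(
            [
                f"Role: {self.role}",
                f"Context: {self.context}",
                f"Task: {self.task}",
                f"Format: {self.output_format}",
                f"Example: {self.example}",
            ]
        )


def build_guide_words(declaration: Dict[str, object]) -> Dict[str, GuideWord]:
    """Build the first (physical image) and second (preprocessed image) guide words."""
    declared = (
        f"name={declaration['name']}, "
        f"specification={declaration['specification']}, "
        f"quantity={declaration['quantity']}"
    )
    first = GuideWord(
        role="auditing expert",
        context=f"Photo of the opened consignment. Declared: {declared}.",
        task="Identify each item and decide whether the declaration covers it.",
        output_format="[Name]: ..., [Specification]: ..., [Number]: ..., "
        "[Included]: Yes / No, [Description]: ...",
        example="[Name]: ceramic mug, [Specification]: 350 ml, [Number]: 5, "
        "[Included]: Yes, [Description]: matches",
    )
    second = GuideWord(
        role="counting expert",
        context="The same photo with every segmented object outlined.",
        task="Count the outlined objects.",
        output_format="A single integer.",
        example="5",
    )
    return {"physical": first, "preprocessed": second}
```

Calling `build_guide_words(...)["preprocessed"].render()` yields a five-line prompt whose first line is `Role: counting expert`.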

[0038] In S400, the declaration information, physical image, preprocessed image, and guide words are input into the multimodal large model to obtain the inspection result output by the multimodal large model. The multimodal large model, also referred to as a multimodal large language model (MLLM), is an important concept in the field of artificial intelligence: a large model that can integrate data of multiple modalities, such as text, images, audio, and video, and perform comprehensive understanding and reasoning. Compared with traditional single-modal models, which process only one type of information such as text, images, or audio, multimodal models have stronger information processing capabilities and a wider range of application scenarios. Using a multimodal large model for recognition offers several advantages. First, most items can be identified without training the model. Second, the model can combine the guide words with the preprocessed image to perform targeted recognition, which yields highly accurate inspection results and improves inspection efficiency.

[0039] For example, the declaration information includes name, specifications, and quantity. The declaration information, physical images, pre-processed images, and guiding words are input into the multimodal large model to obtain the item parameters of the consigned item output by the multimodal large model. The item parameters include one or more of the item name, item specifications, and item quantity. The item name, item specifications, and item quantity of the consigned item output by the multimodal large model are compared with the name, specifications, and quantity in the declaration information. If any of the compared information is inconsistent, the multimodal large model outputs an inspection result of "inspection failed".
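The comparison in this paragraph can be written out as a short sketch. It assumes a single declared item and a single recognized item, normalizes text by lowercasing and collapsing whitespace (an assumption, since the source does not define how equality is judged), and uses the hypothetical names `ItemParams`, `compare`, and `inspection_result`. The string "inspection passed" is likewise an assumption; the source only names the failing result.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ItemParams:
    name: str
    specification: str
    quantity: int


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def compare(declared: ItemParams, recognized: ItemParams) -> List[str]:
    """Return the names of the fields that do not match."""
    mismatches = []
    if _normalize(declared.name) != _normalize(recognized.name):
        mismatches.append("name")
    if _normalize(declared.specification) != _normalize(recognized.specification):
        mismatches.append("specification")
    if declared.quantity != recognized.quantity:
        mismatches.append("quantity")
    return mismatches


def inspection_result(declared: ItemParams, recognized: ItemParams) -> str:
    """Any single mismatch is enough for the inspection to fail."""
    if compare(declared, recognized):
        return "inspection failed"
    return "inspection passed"
```

Returning the list of mismatched fields, rather than only a pass or fail flag, makes it easy to fill in the description field of a structured result.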

[0040] In some embodiments, the inspection results output by the multimodal large model can be structured and returned. The structured data is presented in a clear and organized manner, which makes it easy for operators to quickly understand and grasp the key information points, thereby further improving inspection efficiency. The structured data is easy to import into data analysis tools for further mining, analysis and visualization. The structured data is easy to transmit and share between different systems, platforms and departments, which helps to break down information silos, realize information interconnection and interoperability, and improve the overall business collaboration efficiency.

[0041] For example, a possible structured template for the verification results output by a multimodal large model is: [Name]: xxx, [Specification]: xxx, [Number]: xxx, [Included]: Yes / No, [Description]: xxx. If it cannot be identified, it is marked as "Unidentified" and the reason for not being included is described in Chinese.
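One way to turn a line in this template into structured data is sketched below. The regular expression, the output keys, and the function name `parse_result` are assumptions made for illustration; the sketch also assumes the model emits exactly one template line per item and that field values never contain the sequence `, [Word]:`.

```python
import json
import re

# Matches "[Field]: value" pairs; a value ends at the next ", [Field]:" or at end of line.
FIELD_RE = re.compile(r"\[(\w+)\]:\s*(.*?)(?=,\s*\[\w+\]:|$)")


def parse_result(line: str) -> dict:
    """Parse one template line into a dictionary with typed values."""
    raw = dict(FIELD_RE.findall(line.strip()))
    number = raw.get("Number", "")
    return {
        "name": raw.get("Name", "Unidentified"),
        "specification": raw.get("Specification", "Unidentified"),
        # "Unidentified" or any other non-numeric value becomes None.
        "number": int(number) if number.isdigit() else None,
        "included": raw.get("Included", "").strip().lower() == "yes",
        "description": raw.get("Description", ""),
    }


if __name__ == "__main__":
    line = (
        "[Name]: ceramic mug, [Specification]: 350 ml, [Number]: 5, "
        "[Included]: Yes, [Description]: matches the declaration"
    )
    print(json.dumps(parse_result(line), ensure_ascii=False))
```

The dictionary form can be displayed to operators or serialized to JSON for import into a data analysis tool.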

[0042] Furthermore, the conditions for release and non-release can be set according to needs. For example, in customs inspection, which is a relatively strict checkpoint, the example above sets the requirement that the name, specifications, and quantity must match for release. If the name in the declaration information does not exist in the physical image, or if the physical image contains an item name not included in the declaration information, the name is determined to be inconsistent with the item name. Similarly, if the specifications in the declaration information do not exist in the physical image, or if the physical image contains a specification not included in the declaration information, the specifications are determined to be inconsistent with the item specifications. If the quantity in the declaration information does not match the quantity in the pre-processed image, the quantity is determined to be inconsistent. All these inconsistencies result in inspection failure, and therefore, the item is not released during customs inspection. In contrast, in regular express delivery inspection, it is usually only necessary to determine whether the item is ineligible for mailing; therefore, it can be set to release the item even if the quantity is inconsistent.
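The configurable release conditions can be sketched as sets of blocking fields. The policy names `CUSTOMS_POLICY` and `EXPRESS_POLICY`, and the choice to let the express policy ignore only quantity, are assumptions that follow the example in this paragraph rather than a definitive rule set.

```python
from typing import FrozenSet, Iterable, Set

# Customs: any mismatch in name, specification, or quantity blocks release.
CUSTOMS_POLICY: FrozenSet[str] = frozenset({"name", "specification", "quantity"})
# Regular express (assumed): a quantity mismatch alone does not block release.
EXPRESS_POLICY: FrozenSet[str] = frozenset({"name", "specification"})


def name_mismatch(declared: Iterable[str], recognized: Iterable[str]) -> bool:
    """True if a declared name is missing from the image or an undeclared name appears."""
    declared_set, recognized_set = set(declared), set(recognized)
    missing_from_image = declared_set - recognized_set
    undeclared_in_image = recognized_set - declared_set
    return bool(missing_from_image) or bool(undeclared_in_image)


def should_release(mismatches: Set[str], blocking_fields: FrozenSet[str]) -> bool:
    """Release only if none of the mismatched fields is blocking under the policy."""
    return not (mismatches & blocking_fields)
```

For example, `should_release({"quantity"}, EXPRESS_POLICY)` is true while `should_release({"quantity"}, CUSTOMS_POLICY)` is false, which mirrors the contrast between the two checkpoints.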

[0043] As mentioned above, operators can place items on a dedicated tray for photography. To help center the item on the tray, a mat is usually placed on it with markings at the four corners and the center that indicate the tray's center and boundaries, and operators use these markings to position the item. When the physical image is taken, the markings on the tray also appear in the image. During image recognition, the multimodal large model might mistake these markings (for example, white ones) for an object, such as a pencil or a piece of chalk. To mitigate the impact of the markings on the recognition results, environmental factors in the image can be identified in advance. The multimodal large model can then be set either not to recognize these environmental factors or to recognize them but not output them as a result. For example, a guide word can be preset in the multimodal large model stating that the background is black and that there are white markings at the four corners and the center, which prevents the model from reporting the markings as items.
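Both options described here, telling the model about the markings up front or filtering them out afterwards, can be sketched in a few lines. The wording of `ENVIRONMENT_NOTE`, the function names, and the assumption that recognized items arrive as dictionaries with a `name` key are all illustrative.

```python
from typing import Dict, Iterable, List

# Assumed wording; the source only says the guide word describes a black
# background with white markings at the four corners and the center.
ENVIRONMENT_NOTE = (
    "The tray has a black background with white positioning marks at the "
    "four corners and the center. These marks are not consigned items; "
    "do not report them."
)


def with_environment_note(guide_word: str) -> str:
    """Option 1: append the environment description to an existing guide word."""
    return f"{guide_word}\n{ENVIRONMENT_NOTE}"


def drop_environment_items(
    recognized: List[Dict[str, str]], ignored_labels: Iterable[str]
) -> List[Dict[str, str]]:
    """Option 2: keep recognition as is, but drop items whose name is ignored."""
    ignored = {label.lower() for label in ignored_labels}
    return [item for item in recognized if item["name"].lower() not in ignored]
```

The two options can be combined: the note reduces false detections, and the filter catches any that still slip through.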

[0044] Exemplary device

[0045] Figure 2 is a schematic diagram of the structure of a consignment inspection device provided in an exemplary embodiment of this application. As shown in Figure 2, the consignment inspection device 2 includes: an acquisition module 21, used to acquire the declaration information and physical image of the consigned item; a preprocessing module 22, used to preprocess the physical image to obtain a preprocessed image; a determination module 23, used to determine the guiding words for the physical image and the preprocessed image respectively; and a processing module 24, used to input the declaration information, physical image, preprocessed image and guiding words into a multimodal large model to obtain the inspection result output by the multimodal large model.

[0046] The consignment inspection device provided in this application preprocesses images of consigned items to make the large model's recognition more targeted and accurate. This achieves intelligent inspection, improves inspection efficiency, shortens inspection time, reduces the omissions that arise because manual inspection is easily affected by subjective factors, lowers the cost of manual inspection, and makes it possible to handle large-scale inspection tasks.

[0047] As one possible implementation, the preprocessing module 22 can be configured to segment the entity image to obtain a preprocessed image of the marked consignment.

[0048] As one possible implementation, the preprocessing module 22 can also be configured to: input an image segmentation model; wherein the entity image includes the object to be segmented; based on preset prompts, the image segmentation model segments the object in the entity image; obtain the segmentation result output by the image segmentation model; wherein the segmentation result includes a preprocessed image of the object marked.

[0049] As one possible implementation, the determination module 23 can be configured to: determine, based on the declaration information, a first guiding word corresponding to the physical image and a second guiding word corresponding to the preprocessed image; wherein each guiding word includes a role, a context, a task, a format, and an example, and the guiding words are used to indicate to the multimodal large model the difference between the physical image and the preprocessed image.

[0050] As one possible implementation, the determination module 23 can also be configured to: set the role of the preprocessed image as a counting expert based on the second guiding word, wherein the counting expert indicates that the preprocessed image is used to calibrate the quantity recognition function of the multimodal large model; and set the role of the physical image as an auditing expert based on the first guiding word, wherein the auditing expert indicates that information recognition is performed by combining the preprocessed image and the declaration information.
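As a non-limiting illustration, the two guiding words might be represented as in the following Python sketch. The five fields mirror the role, context, task, format, and example named above; the class name `GuidingWord`, the function `build_guiding_words`, the dictionary keys of the declaration, and all literal wording are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidingWord:
    """One guiding word with the five parts named in the description."""
    role: str
    context: str
    task: str
    output_format: str
    example: str

    def render(self) -> str:
        """Serialize the guiding word as plain text for the model input."""
        return "\n".join(
            [
                f"Role: {self.role}",
                f"Context: {self.context}",
                f"Task: {self.task}",
                f"Format: {self.output_format}",
                f"Example: {self.example}",
            ]
        )


def build_guiding_words(declaration: dict) -> tuple[GuidingWord, GuidingWord]:
    """Return (first, second): the guiding words for the physical image
    and for the preprocessed image, both derived from the declaration."""
    declared = (
        f"name={declaration['name']}, "
        f"specification={declaration['specification']}, "
        f"quantity={declaration['quantity']}"
    )
    first = GuidingWord(
        role="auditing expert",
        context=f"This is the original physical image. Declared: {declared}.",
        task=(
            "Identify the item name and specifications, combining the "
            "preprocessed image and the declaration information."
        ),
        output_format='JSON with keys "name", "specification", "quantity"',
        example='{"name": "pen", "specification": "0.5 mm", "quantity": 12}',
    )
    second = GuidingWord(
        role="counting expert",
        context="This is the preprocessed image; each item is outlined.",
        task="Count the outlined items to calibrate the quantity.",
        output_format='JSON with key "quantity"',
        example='{"quantity": 12}',
    )
    return first, second
```

Keeping the two guiding words as separate objects makes explicit which image each one accompanies, which is the distinction the guiding words are meant to convey to the multimodal large model.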

[0051] As one possible implementation, the declaration information includes a name, specifications, and a quantity; the processing module 24 can be configured to: input the declaration information, physical image, preprocessed image, and guiding words into the multimodal large model to obtain the item parameters of the consigned item output by the multimodal large model, the item parameters including one or more of item name, item specifications, and item quantity; and compare the item name, item specifications, and item quantity output by the multimodal large model with the name, specifications, and quantity in the declaration information; wherein, when any of the comparisons is inconsistent, the multimodal large model outputs an inspection result of "inspection failed".

[0052] As one possible implementation, the processing module 24 can also be configured to: determine that the name is inconsistent with the item name when the name in the declaration information does not exist in the physical image, or when the physical image contains an item name not included in the declaration information; determine that the specifications are inconsistent with the item specifications when the specifications in the declaration information do not exist in the physical image, or when the physical image contains specifications not included in the declaration information; and determine that the quantity is inconsistent with the item quantity when the quantity in the declaration information is inconsistent with the quantity in the preprocessed image.
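As a non-limiting illustration, the three consistency rules above might be expressed as in the following Python sketch. It assumes that names and specifications have already been normalized so that exact set comparison is meaningful; the names `Parameters`, `find_inconsistencies`, and `inspection_result`, and the "inspection passed" label for the consistent case, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameters:
    """Names, specifications, and quantity, either declared or recognized."""
    names: frozenset[str]
    specifications: frozenset[str]
    quantity: int


def find_inconsistencies(declared: Parameters, recognized: Parameters) -> list[str]:
    """Return the fields whose declared and recognized values disagree.

    Names and specifications are inconsistent if either side contains an
    entry the other lacks; quantity is compared against the count taken
    from the preprocessed image.
    """
    problems = []
    if declared.names != recognized.names:
        problems.append("name")
    if declared.specifications != recognized.specifications:
        problems.append("specification")
    if declared.quantity != recognized.quantity:
        problems.append("quantity")
    return problems


def inspection_result(declared: Parameters, recognized: Parameters) -> str:
    """Fail the inspection as soon as any single comparison is inconsistent."""
    if find_inconsistencies(declared, recognized):
        return "inspection failed"
    return "inspection passed"  # label assumed for this example


declared = Parameters(frozenset({"ballpoint pen"}), frozenset({"0.5 mm"}), 12)
recognized = Parameters(frozenset({"ballpoint pen"}), frozenset({"0.5 mm"}), 11)
result = inspection_result(declared, recognized)
```

In this sketch the comparison runs outside the model for clarity; in the embodiment described above, the multimodal large model itself outputs the inspection result.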

[0053] As one possible implementation, the consignment inspection device 2 can also be configured to: pre-clean the declaration information to obtain declaration text; perform keyword recognition on the declaration text to obtain one or more of the name, specifications, and quantity; and perform structured processing on the extracted name, specifications, and quantity to obtain structured name information, structured specification information, and structured quantity information; wherein the processing module 24 can be configured to: input the structured name information, structured specification information, structured quantity information, physical image, preprocessed image, and guiding words into the multimodal large model to obtain the inspection result output by the multimodal large model.
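As a non-limiting illustration, the pre-cleaning, keyword recognition, and structured processing steps might be sketched in Python as follows. The declaration layout (`Name: ...; Spec: ...; Qty: ...`), the regular expressions, and the function names `pre_clean`, `extract_fields`, and `structure` are assumptions made for this example; the application does not specify a declaration format.

```python
import re

# Hypothetical keyword patterns for a "key: value; key: value" declaration.
FIELD_PATTERNS = {
    "name": re.compile(r"name\s*:\s*([^;]+)", re.IGNORECASE),
    "specification": re.compile(r"spec(?:ification)?s?\s*:\s*([^;]+)", re.IGNORECASE),
    "quantity": re.compile(r"(?:qty|quantity)\s*:\s*(\d+)", re.IGNORECASE),
}


def pre_clean(raw: str) -> str:
    """Normalize full-width punctuation and collapse runs of whitespace."""
    text = raw.replace("\uff1a", ":").replace("\uff1b", ";")
    return re.sub(r"\s+", " ", text).strip()


def extract_fields(text: str) -> dict[str, str]:
    """Keyword recognition: pick out whichever fields are present."""
    fields = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[key] = match.group(1).strip()
    return fields


def structure(fields: dict[str, str]) -> dict[str, object]:
    """Structured processing: convert the quantity to an integer."""
    structured: dict[str, object] = dict(fields)
    if "quantity" in fields:
        structured["quantity"] = int(fields["quantity"])
    return structured


raw_declaration = "  Name\uff1aballpoint pen\uff1b\n Spec: 0.5 mm ; Qty:  12 "
structured = structure(extract_fields(pre_clean(raw_declaration)))
```

Fields that are absent from the declaration are simply omitted from the result, which matches the "one or more of" wording above.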

[0054] Exemplary electronic devices

[0055] An electronic device includes: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the consignment inspection method described in the embodiments of this application.

[0056] Hereinafter, an electronic device according to an embodiment of the present application will be described with reference to Figure 3. The electronic device may be either or both of a first device and a second device, or a standalone device independent of them, which may communicate with the first device and the second device to receive acquired input signals from them.

[0057] Figure 3 illustrates a block diagram of an electronic device according to an embodiment of this application.

[0058] As shown in Figure 3, the electronic device 10 includes one or more processors 11 and memory 12.

[0059] The processor 11 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.

[0060] The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the consignment inspection method of the various embodiments of this application described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored in the computer-readable storage medium.

[0061] In one example, the electronic device 10 may also include an input device 13 and an output device 14, which are interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0062] When the electronic device is a standalone device, the input device 13 can be a communication network connector for receiving the collected input signals from the first device and the second device.

[0063] In addition, the input device 13 may also include, for example, a keyboard, a mouse, etc.

[0064] The output device 14 can output various information to the outside, including determined distance information, direction information, etc. The output device 14 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0065] Of course, for simplicity, Figure 3 only shows some of the components of the electronic device 10 that are relevant to this application, omitting components such as buses, input / output interfaces, etc. In addition, the electronic device 10 may include any other suitable components depending on the specific application.

[0066] Computer program products can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0067] A computer-readable storage medium stores a computer program for executing the consignment inspection method of the embodiments provided in this application.

[0068] Computer-readable storage media may take the form of any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may, for example, include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0069] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this application to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations thereof.

Claims

1. A method for inspecting consigned items, characterized in that the method comprises: obtaining declaration information and a physical image of a consigned item; preprocessing the physical image to obtain a preprocessed image; determining guiding words for the physical image and for the preprocessed image respectively; and inputting the declaration information, the physical image, the preprocessed image, and the guiding words into a multimodal large model to obtain an inspection result output by the multimodal large model.

2. The method for inspecting consigned items according to claim 1, characterized in that preprocessing the physical image to obtain a preprocessed image comprises: segmenting the physical image to obtain the preprocessed image in which the consigned item is marked.

3. The method for inspecting consigned items according to claim 2, characterized in that segmenting the physical image to obtain the preprocessed image in which the consigned item is marked comprises: inputting the physical image into an image segmentation large model, wherein the physical image includes an object to be segmented; segmenting, by the image segmentation large model and based on preset prompts, the object in the physical image; and obtaining a segmentation result output by the image segmentation large model, wherein the segmentation result includes the preprocessed image in which the object is marked.

4. The method for inspecting consigned items according to claim 3, characterized in that the segmentation result is represented by outlining the contours of the object using different colors or lines.

5. The method for inspecting consigned items according to any one of claims 1 to 4, characterized in that determining guiding words for the physical image and for the preprocessed image respectively comprises: determining, based on the declaration information, a first guiding word corresponding to the physical image and a second guiding word corresponding to the preprocessed image; wherein each guiding word includes a role, a context, a task, a format, and an example, and the guiding words are used to indicate to the multimodal large model the difference between the physical image and the preprocessed image.

6. The method for inspecting consigned items according to claim 5, characterized in that determining, based on the declaration information, the first guiding word corresponding to the physical image and the second guiding word corresponding to the preprocessed image comprises: setting, based on the second guiding word, the role of the preprocessed image as a counting expert, wherein the counting expert indicates that the preprocessed image is used to calibrate the quantity recognition function of the multimodal large model; and setting, based on the first guiding word, the role of the physical image as an auditing expert, wherein the auditing expert indicates that information recognition is performed by combining the preprocessed image and the declaration information.

7. The method for inspecting consigned items according to any one of claims 1 to 6, characterized in that the declaration information includes a name, specifications, and a quantity; and inputting the declaration information, the physical image, the preprocessed image, and the guiding words into the multimodal large model to obtain the inspection result output by the multimodal large model comprises: inputting the declaration information, the physical image, the preprocessed image, and the guiding words into the multimodal large model to obtain item parameters of the consigned item output by the multimodal large model, the item parameters including one or more of an item name, item specifications, and an item quantity; comparing the item name, item specifications, and item quantity output by the multimodal large model with the name, specifications, and quantity in the declaration information; and when any one of the comparisons (name versus item name, specifications versus item specifications, or quantity versus item quantity) is inconsistent, outputting, by the multimodal large model, an inspection result indicating that the inspection failed.

8. The method for inspecting consigned items according to claim 7, characterized in that, when any one of the comparisons (name versus item name, specifications versus item specifications, or quantity versus item quantity) is inconsistent, outputting, by the multimodal large model, an inspection result indicating that the inspection failed comprises: determining that the name is inconsistent with the item name when the name in the declaration information does not exist in the physical image, or when the physical image contains an item name that is not included in the declaration information; determining that the specifications are inconsistent with the item specifications when the specifications in the declaration information do not exist in the physical image, or when the physical image contains specifications that are not included in the declaration information; and determining that the quantity is inconsistent with the item quantity when the quantity in the declaration information is inconsistent with the quantity in the preprocessed image.

9. The method for inspecting consigned items according to claim 8, characterized in that the method further comprises: performing structured processing on the inspection result to obtain a structured inspection result; and displaying the structured inspection result and inputting the structured inspection result into a data analysis tool for data processing.

10. The method for inspecting consigned items according to any one of claims 1 to 9, characterized in that, after obtaining the declaration information and the physical image of the consigned item, the method further comprises: pre-cleaning the declaration information to obtain declaration text; performing keyword recognition on the declaration text to obtain one or more of the name, specifications, and quantity; and performing structured processing on the extracted name, specifications, and quantity to obtain structured name information, structured specification information, and structured quantity information.

11. The method for inspecting consigned items according to claim 10, characterized in that inputting the declaration information, the physical image, the preprocessed image, and the guiding words into the multimodal large model to obtain the inspection result output by the multimodal large model comprises: inputting the structured name information, the structured specification information, the structured quantity information, the physical image, the preprocessed image, and the guiding words into the multimodal large model to obtain the inspection result output by the multimodal large model.

12. The method for inspecting consigned items according to any one of claims 1 to 11, characterized in that the multimodal large model is a model that integrates data of multiple modalities and performs comprehensive understanding and reasoning, wherein the data of multiple modalities includes at least two of text, images, audio, and video.

13. A computer device, characterized in that the computer device comprises: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to implement the method for inspecting consigned items according to any one of claims 1 to 12.

14. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the method for inspecting consigned items according to any one of claims 1 to 12.

15. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for inspecting consigned items according to any one of claims 1 to 12.