Entity object traversal recognition detection method and apparatus based on unmanned aerial vehicle device

By using drones to collect cargo information along a preset flight path and comparing it with a database, the invention addresses the accuracy and timeliness problems of inventory counting in warehouse management and enables efficient inventory management.

CN120688985B · Active · Publication Date: 2026-06-23 · ALADDIN UAV (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ALADDIN UAV (SHENZHEN) CO LTD
Filing Date: 2025-06-25
Publication Date: 2026-06-23

IPC: G06Q10/087; G06F18/213; G06F18/241; G06N3/0464; G06N3/08

AI Tagging

Application Domain

Neural learning methods

Technology Topics

Uncrewed vehicle Engineering

Abstract

The application relates to an entity object traversal recognition and detection method and device based on unmanned aerial vehicle (UAV) equipment. The method comprises: obtaining cargo information collected by the UAV equipment in a cargo storage area, where the cargo information represents cargo storage-related information obtained by sequentially collecting data at each storage location in the cargo storage area according to the preset flight trajectory of the UAV equipment; determining, according to the entity feature category in the cargo information, the cargo identification algorithm matching the cargo information, and identifying the cargo feature information corresponding to the cargo information based on the cargo identification algorithm, where the cargo feature information includes the product model, storage quantity and storage location of the corresponding cargo; and comparing the cargo feature information with inventory information in a preset database to obtain the inventory analysis result of the cargo storage area. The method enables systematic analysis of the cargo storage state and helps to improve the accuracy of inventory management and the timeliness of information updates.

Description

Technical Field

[0001] This application relates to the field of warehouse management technology, and in particular to a method and apparatus for entity object traversal recognition and detection based on unmanned aerial vehicle (UAV) equipment.

Background Technology

[0002] In the field of warehouse management technology, inventory counting is usually carried out through manual inspection. However, this method has disadvantages such as high labor costs, limited coverage, high operational risks, and unstable identification accuracy, resulting in problems such as delayed inventory information updates, lengthy inventory counting cycles, and frequent errors. Especially in large logistics warehouses or high-bay racking areas, traditional methods struggle to achieve comprehensive and timely identification and often require work at height or collaboration across multiple job roles, which seriously restricts the accuracy and efficiency of inventory management.

Summary of the Invention

[0003] Therefore, it is necessary to provide a method, device, computer equipment, and computer-readable storage medium for entity object traversal recognition and detection based on UAV equipment to address the above-mentioned technical problems. This will enable systematic analysis of the storage status of goods and help improve the accuracy of inventory management and the real-time nature of information updates.

[0004] Firstly, this application provides a method for entity object traversal recognition and detection based on unmanned aerial vehicle (UAV) equipment, including:

[0005] Obtaining cargo information collected by the drone equipment in the cargo storage area. The cargo information refers to the cargo storage-related information collected by the drone equipment in sequence at each storage location in the cargo storage area according to the preset flight trajectory.

[0006] Based on the entity feature category in the cargo information, determining the cargo identification algorithm matching the cargo information, and identifying the cargo feature information corresponding to the cargo information based on the cargo identification algorithm. The cargo feature information includes the product model, storage quantity, and storage location of the corresponding cargo.

[0007] Comparing the cargo feature information with the inventory information in a preset database to obtain the inventory analysis results of the cargo storage area.

[0008] Secondly, this application also provides an entity object traversal recognition and detection device based on unmanned aerial vehicle (UAV) equipment, comprising:

[0009] The acquisition module is used to acquire cargo information collected by the drone equipment in the cargo storage area. The cargo information refers to cargo storage-related information collected by the drone equipment in sequence at each storage location in the cargo storage area according to a preset flight trajectory.

[0010] The identification module is used to determine the cargo identification algorithm matching the cargo information based on the entity feature category in the cargo information, and to identify the cargo feature information corresponding to the cargo information based on the cargo identification algorithm. The cargo feature information includes the product model, storage quantity and storage location of the corresponding cargo.

[0011] The comparison module is used to compare the cargo feature information with the inventory information in a preset database to obtain the inventory analysis results of the cargo storage area.

[0012] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the above steps.

[0013] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the above steps.

[0014] The aforementioned entity object traversal recognition and detection method, device, computer equipment, and computer-readable storage medium based on UAV equipment firstly collect cargo information from each storage location in the cargo storage area sequentially according to a preset flight trajectory, ensuring the coverage and order of the collection process and establishing a stable data foundation for subsequent recognition. Secondly, a matching cargo recognition algorithm is determined based on the entity feature categories in the cargo information, thereby achieving targeted information extraction to obtain cargo feature information and improving the accuracy and adaptability of feature recognition. Thirdly, the cargo feature information is compared with inventory information in a preset database to verify the correspondence between the recognition results and the inventory registration data, outputting quantifiable inventory analysis results. Based on this, through the entire process of data collection, feature recognition, and inventory comparison processing, a systematic analysis of the cargo storage status is achieved, which helps to improve the accuracy of inventory management and the timeliness of information updates.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart of an entity object traversal recognition and detection method based on a drone device in one embodiment;

[0017] Figure 2 is a schematic diagram of the structure of a drone device in one embodiment;

[0018] Figure 3 is a schematic diagram of the process of identifying cargo feature information using a text detection and recognition algorithm based on relation modeling in one embodiment;

[0019] Figure 4 is a flowchart of a text detection and recognition algorithm based on relation modeling in one embodiment;

[0020] Figure 5 is a flowchart of the process of obtaining text detection results using a text detection and recognition algorithm based on relation modeling in one embodiment;

[0021] Figure 6 is a schematic diagram of the structure of a text detection network using Mamba fusion and multi-path sampling in one embodiment;

[0022] Figure 7 is a schematic diagram of the structure of a multi-path sampling layer in a text detection network in one embodiment;

[0023] Figure 8 is a schematic diagram of the feature fusion layer in a text detection network in one embodiment;

[0024] Figure 9 is a flowchart of the process of obtaining text recognition results using a text detection and recognition algorithm based on relation modeling in one embodiment;

[0025] Figure 10 is a schematic diagram of the structure of a convolutional-deconvolutional text recognition network based on global relation modeling in one embodiment;

[0026] Figure 11 is a schematic diagram of the process of identifying cargo feature information based on a hybrid attention and temporal fusion enhancement algorithm in one embodiment;

[0027] Figure 12 is a flowchart of the process of obtaining feature information at different scales using an algorithm based on hybrid attention and temporal fusion enhancement in one embodiment;

[0028] Figure 13 is a schematic diagram of the structure of a cargo recognition and detection algorithm based on hybrid attention and Transformer fusion enhancement in one embodiment;

[0029] Figure 14 is a schematic diagram of the ACK module in a cargo recognition and detection algorithm based on hybrid attention and Transformer fusion enhancement in one embodiment;

[0030] Figure 15 is a flowchart of the process of obtaining fused features at different scales using an algorithm based on hybrid attention and temporal fusion enhancement in one embodiment;

[0031] Figure 16 is a schematic diagram of the structure of the multi-scale temporal feature fusion layer in a cargo recognition and detection algorithm based on hybrid attention and Transformer fusion enhancement in one embodiment;

[0032] Figure 17 is a schematic diagram of the process of identifying cargo feature information based on an identifier recognition algorithm in one embodiment;

[0033] Figure 18 is a structural block diagram of an entity object traversal recognition and detection device based on a drone device in one embodiment.

Detailed Implementation

[0034] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0035] In one embodiment, as shown in Figure 1, a method for entity object traversal recognition and detection based on a drone device is provided. This embodiment illustrates the application of this method to a server. It can be understood that this method can also be applied to a terminal, or to a system including a terminal and a server, and implemented through the interaction between the terminal and the server. In this embodiment, the method includes the following steps S101 to S103.

[0036] Step S101: Obtain cargo information collected by the drone equipment in the cargo storage area. The cargo information refers to cargo storage-related information collected by the drone equipment in sequence at each storage location in the cargo storage area according to the preset flight trajectory.

[0037] Figure 2 shows a schematic diagram of a drone device, which includes a camera 1 and an RFID reader 2 installed on the bottom of the drone device. The camera 1 can be a regular RGB camera used to capture RGB images, or an RGB-D camera used to capture RGB images and depth images simultaneously, in order to collect cargo information in the form of text, images, QR code labels, etc. The RFID reader 2 is used to read radio frequency signals from RFID tags in order to collect cargo information in the form of RFID tags.

[0038] Furthermore, the drone equipment in Figure 2 is a quadcopter, with four propeller arms 3 extending to the four corners of the body. Each propeller arm 3 is equipped with a motor 4 and a propeller 5 at its end, which enables the drone equipment to hover, hold a precise position, and turn in space. It can fly at low altitude above or beside the cargo, and can move around the cargo and photograph it from multiple perspectives.

[0039] The cargo storage area refers to the physical space used to store different types of goods, such as a warehouse or shopping mall with multi-layer shelves. A storage location refers to a specific cargo storage unit in the cargo storage area with a unique space number or geometric coordinate identifier, such as the storage location corresponding to the 5th cell of the 3rd layer of a shelf.

[0040] Cargo information refers to the set of data directly related to cargo that the drone equipment collects along its flight path through shooting, scanning, or other sensing methods. This cargo information can cover data types such as text, images, identification codes, and empty locations.

[0041] The flight trajectory refers to the pre-planned three-dimensional path that the drone equipment follows when operating in the cargo storage area. It guides the drone equipment to visit all storage locations in sequence and collect data from them without repetition or omission. An example is a flight path that starts from the area entrance and passes through each row of shelves from top to bottom. Based on the flight trajectory, information such as the product model, location, and count of each item of cargo can be obtained in real time and in an orderly manner, and an information model can be created from it.

[0042] For example, firstly, the cargo storage area can be spatially divided and marked using static surveying or existing layout drawings, thereby forming basic environmental data for drone flight trajectory planning to clarify the spatial location of each cargo storage unit. Subsequently, a set of preset flight trajectories is constructed for the aforementioned cargo storage area. These trajectories are required to cover all cargo storage units, ensuring that the drone can sequentially collect data from each storage location during its operation. Furthermore, during actual operation, the drone operates stably according to the preset flight trajectory, while controlling its flight altitude, turning, and hovering time to perform a complete scan of each storage location, ensuring that the cargo status from each perspective is completely collected and recorded as cargo information.
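The coverage requirement in the preceding paragraphs (every storage location visited once, in a fixed order) can be illustrated with a minimal sketch. It assumes a regular grid of shelves, layers, and cells and a serpentine sweep; the names `StorageLocation` and `plan_trajectory` and the grid layout are hypothetical and not taken from the application, which does not specify a planning algorithm.

```python
from typing import Iterator, NamedTuple


class StorageLocation(NamedTuple):
    """Hypothetical storage location identifier: shelf, layer, cell."""
    shelf: int
    layer: int
    cell: int


def plan_trajectory(shelves: int, layers: int, cells: int) -> Iterator[StorageLocation]:
    """Yield every storage location exactly once, shelf by shelf, top layer first."""
    for shelf in range(1, shelves + 1):
        for step, layer in enumerate(range(layers, 0, -1)):
            # Reverse the sweep direction on every other layer (serpentine path),
            # so the drone does not fly back along a row without collecting data.
            if step % 2 == 0:
                cell_order = range(1, cells + 1)
            else:
                cell_order = range(cells, 0, -1)
            for cell in cell_order:
                yield StorageLocation(shelf, layer, cell)


waypoints = list(plan_trajectory(shelves=2, layers=3, cells=4))
```

Because each location is produced exactly once, `len(waypoints)` equals `shelves * layers * cells`, which gives a simple check that the planned path neither repeats nor omits a storage location.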

[0043] Step S102: Based on the entity feature category in the cargo information, determine the cargo identification algorithm matching the cargo information, and identify the cargo feature information corresponding to the cargo information based on the cargo identification algorithm. The cargo feature information includes the product model, storage quantity and storage location of the corresponding cargo.

[0044] The entity feature category is the classification identifier that reflects the data type in the cargo information. It is used to determine which type of cargo recognition algorithm should be selected to identify and process the cargo information. It may include the text feature category corresponding to the text data type, the image feature category corresponding to the image data type, the identifier feature category corresponding to the identification code data type, etc.

[0045] The cargo identification algorithm refers to a set of rule-based processing logic or an operation flow that performs identification for a given entity feature category, so that cargo feature fields with business significance, namely cargo feature information such as product model, storage quantity and storage location, are extracted from the cargo information in a way that adapts to its data type.

[0046] For example, firstly, based on the collected cargo information, the entity feature categories contained in the cargo information are identified. This involves analyzing the format of the cargo information to determine if it includes any content available for feature extraction, such as structural features like images, text, or identification codes. This process identifies the entity feature categories corresponding to the cargo information. Secondly, after identifying the entity feature categories, a corresponding cargo recognition algorithm is applied to identify the cargo feature information. Different cargo recognition algorithms have different processing logics for different entity feature categories. Thirdly, during the recognition process, the corresponding cargo recognition algorithm processes the feature content in the cargo information according to its processing logic, thereby extracting feature information such as the product model, corresponding storage quantity, and storage location of each cargo. This feature information is expressed as structured fields and uniformly organized into the cargo feature information for the corresponding cargo.
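The category-based selection described above amounts to a dispatch from entity feature category to recognition routine. The sketch below is illustrative only: the enum, the function names, and the placeholder recognizers (which return empty records) are assumptions, and the actual recognition networks are not implemented here.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class EntityFeatureCategory(Enum):
    """Categories named in the description: text, image, identification code."""
    TEXT = "text"
    IMAGE = "image"
    IDENTIFIER = "identifier"


CargoFeatures = Dict[str, Optional[object]]


def _empty_features() -> CargoFeatures:
    # Structured fields that every recognizer is expected to fill in.
    return {"product_model": None, "storage_quantity": None, "storage_location": None}


def recognize_text(raw: bytes) -> CargoFeatures:
    """Placeholder for the text detection and recognition path."""
    return _empty_features()


def recognize_image(raw: bytes) -> CargoFeatures:
    """Placeholder for the image-based recognition path."""
    return _empty_features()


def recognize_identifier(raw: bytes) -> CargoFeatures:
    """Placeholder for the identification code recognition path."""
    return _empty_features()


ALGORITHMS: Dict[EntityFeatureCategory, Callable[[bytes], CargoFeatures]] = {
    EntityFeatureCategory.TEXT: recognize_text,
    EntityFeatureCategory.IMAGE: recognize_image,
    EntityFeatureCategory.IDENTIFIER: recognize_identifier,
}


def select_algorithm(category: EntityFeatureCategory) -> Callable[[bytes], CargoFeatures]:
    """Return the recognizer registered for the given entity feature category."""
    return ALGORITHMS[category]


recognizer = select_algorithm(EntityFeatureCategory.TEXT)
features = recognizer(b"raw label image bytes")
```

Keeping the mapping in one table means a new data type only needs a new enum member and one registered function, while every recognizer returns the same structured fields for the later comparison step.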

[0047] Step S103: Compare the cargo feature information with the inventory information in the preset database to obtain the inventory analysis results of the cargo storage area.

[0048] The preset database refers to a pre-established and continuously maintained data storage space used to store the structured data set of various goods' storage status. For example, it can represent the product information database in a warehouse management system, which includes inventory information such as product model, packaging specifications, registered storage location, and inventory registration time.

[0049] The inventory analysis results represent the data analysis output formed by comparing the identified goods feature information with the existing inventory information in the database. It is used to display the differences, changes or consistency between the current actual identification results and the system registered data. For example, it can represent comparison results such as "Product A inventory is consistent, no difference", "Product B has a location deviation", "Product C is a newly added unregistered item", and is presented in the form of a table or list.

[0050] For example, firstly, before comparing the cargo feature information with the inventory information in the database, it is necessary to ensure that each parameter field in the cargo feature information is comparable, that is, its format, unit, and numbering system must all conform to the consistency standards defined in the database. Secondly, during the comparison, the product model field is used as the main index field and matched sequentially against the inventory information registered in the database. Through one-to-one correspondence of field values, it is determined whether each item of goods has a matching inventory item in the inventory information. If it does, the identified storage quantity is compared with the storage quantity registered in the database to determine whether stock has been added, is missing, or has otherwise changed in quantity. If it does not, the item is recorded as a newly added inventory item and flagged separately. Furthermore, the identified storage location must also be checked against the storage location registered in the database to determine whether goods have been misplaced or whether the records are inconsistent or incorrect. Ultimately, all comparison results are compiled into a unified inventory analysis result. This result includes the matching status of inventory items and the quantity differences of goods, as well as information such as the accuracy of cargo information identification, the completeness of inventory information, and the storage locations of suspicious or abnormal items. This information serves as the basis for subsequent inventory management operations and is used to trigger inventory correction, replenishment warnings, or manual review.
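The comparison logic of step S103 can be sketched as follows. This is a simplified illustration under stated assumptions: one record per product model, string-valued storage locations, and plain-text findings. `CargoRecord`, `compare_inventory`, and the sample data are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CargoRecord:
    product_model: str
    storage_quantity: int
    storage_location: str


def compare_inventory(identified: List[CargoRecord],
                      registered: Dict[str, CargoRecord]) -> List[str]:
    """Compare identified cargo with registered inventory, keyed by product model."""
    findings: List[str] = []
    seen = set()
    for item in identified:
        seen.add(item.product_model)
        entry = registered.get(item.product_model)
        if entry is None:
            # No matching inventory item: record as newly added.
            findings.append(f"{item.product_model}: newly added unregistered item")
            continue
        issues = []
        diff = item.storage_quantity - entry.storage_quantity
        if diff != 0:
            issues.append(f"quantity differs by {diff:+d}")
        if item.storage_location != entry.storage_location:
            issues.append("location deviation")
        status = "; ".join(issues) if issues else "consistent, no difference"
        findings.append(f"{item.product_model}: {status}")
    # Registered items the drone never identified are reported as missing.
    for model in sorted(registered.keys() - seen):
        findings.append(f"{model}: registered but not found")
    return findings


# Hypothetical sample data for illustration only.
registered = {
    "A": CargoRecord("A", 10, "S3-L3-C5"),
    "B": CargoRecord("B", 4, "S1-L1-C1"),
    "D": CargoRecord("D", 2, "S2-L1-C1"),
}
identified = [
    CargoRecord("A", 10, "S3-L3-C5"),
    CargoRecord("B", 4, "S1-L2-C1"),
    CargoRecord("C", 7, "S1-L1-C2"),
]
report = compare_inventory(identified, registered)
```

With this sample data the report contains one line per product: A is consistent, B has a location deviation, C is newly added, and D is registered but was not found.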

[0051] In the aforementioned entity object traversal recognition and detection method based on UAV equipment, firstly, the UAV equipment sequentially collects cargo information from various storage locations in the cargo storage area according to a preset flight trajectory, thereby ensuring the coverage and order of the collection process and establishing a stable data foundation for subsequent recognition. Secondly, a matching cargo recognition algorithm is determined based on the entity feature categories in the cargo information, thereby achieving targeted information extraction to obtain cargo feature information and improving the accuracy and adaptability of feature recognition. Thirdly, the cargo feature information is compared with the inventory information in a preset database, thereby achieving the correspondence verification between the recognition results and the inventory registration data, and outputting quantifiable inventory analysis results. Based on this, through the entire process of data collection, feature recognition, and inventory comparison processing, a systematic analysis of the cargo storage status is achieved, which helps to improve the accuracy of inventory management and the real-time nature of information updates.

[0052] In one exemplary embodiment, as shown in Figure 3, determining the cargo identification algorithm matching the cargo information based on the entity feature category in the cargo information, and identifying the cargo feature information corresponding to the cargo information based on the cargo identification algorithm, includes steps S201 to S203.

[0053] Step S201: If the entity feature category in the cargo information is a text feature category, then the preset text detection and recognition algorithm based on relation modeling will be used as the cargo recognition algorithm for cargo information matching.

[0054] The text feature category represents the visual data type in the cargo information that consists of characters, words, or symbols, such as printed text or handwritten marks that appear on cargo labels or cargo surfaces.

[0055] The text detection and recognition algorithm based on relation modeling is a joint recognition process that considers both the spatial arrangement of characters and their semantic and structural associations. It is used to improve recognition accuracy against complex backgrounds, for curved text arrangements, or in multilingual text scenarios.

[0056] For example, by analyzing factors such as pixel arrangement, boundary structure, and color density in the captured cargo information, it can be determined whether the cargo information contains character components. If character components are found, the cargo information is considered to belong to the text feature category, meaning it contains identifiable character sequences or label text. A text detection and recognition algorithm based on relation modeling can then be used as the matching cargo recognition algorithm to enable more structured recognition of the data in the text region. Furthermore, as shown in Figure 4, this type of algorithm includes two interrelated processing flows. One is the text region detection flow, which locates the text region in the input image using the text detection network and obtains the text detection result. The other is the text content recognition flow, which recognizes the character sequence or label text in the located text region using the text recognition network and obtains the text recognition result.

[0057] Step S202: Based on the text detection network in the cargo recognition algorithm, multi-scale feature extraction and fusion processing are performed on the cargo information to obtain the text detection result corresponding to the cargo information.

[0058] The text detection network refers to a neural network structure used to identify text regions in cargo information, in order to accurately locate and separate regions containing text content from complex backgrounds, i.e., output text detection results, which may include structured fields such as detection boxes, tilt angles, and detection confidence scores for several text regions.

[0059] For example, in a text detection network, image data from cargo information is used as input, and a multi-scale feature extraction and fusion mechanism is used to analyze and process the text regions that may exist in the image data. In this process, firstly, multi-layer feature extraction is performed on the original image data through convolution operations. Each convolutional layer can extract information such as texture, edges, and local contrast at different scales, forming a series of feature map sets with different receptive fields and resolutions. Since the size, arrangement, image quality, and format of the text corresponding to cargo information vary in practical applications, a comprehensive analysis of the image data is required through multi-scale processing to enhance the perception ability of text regions with small characters, diagonal text, or complex backgrounds.

[0060] Furthermore, after acquiring these multi-scale feature maps, upsampling is performed to align the size of the feature maps at each scale, ensuring that the features from different convolutional layers have a uniform spatial distribution. This preserves the key structures in each layer's information and improves overall expressive power. Based on the upsampling process, feature fusion is performed to integrate multi-layer features in a unified scale space. During the fusion process, local structural information and overall spatial correlation are considered simultaneously to form a unified fused feature map. Subsequently, by detecting the spatial distribution and feature intensity information of the fused feature map, potential text regions are identified in the cargo information. These text regions are output as bounding boxes, representing the specific location and extent of the text regions in the cargo information, thus constituting the final text detection result.
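The align, fuse, and locate sequence described above can be illustrated on tiny response maps. This is a sketch only: nearest-neighbour upsampling, element-wise averaging, and a fixed threshold with a single bounding box are stand-ins chosen for brevity, not the sampling, fusion, or detection head of the described network, and all names and values are hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Repeat every value `factor` times along both axes to enlarge the map."""
    return [[value for value in row for _ in range(factor)]
            for row in grid for _ in range(factor)]


def fuse_mean(grids: List[Grid]) -> Grid:
    """Fuse same-sized maps by element-wise averaging (a stand-in for learned fusion)."""
    rows, cols = len(grids[0]), len(grids[0][0])
    return [[sum(g[r][c] for g in grids) / len(grids) for c in range(cols)]
            for r in range(rows)]


def bounding_box(grid: Grid, threshold: float) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of all cells at or above the threshold."""
    hits = [(r, c) for r, row in enumerate(grid)
            for c, value in enumerate(row) if value >= threshold]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical response maps: a 4x4 shallow map and a 2x2 deep map.
shallow = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 0.0]]
deep = [[0.0, 0.0],
        [0.0, 1.0]]

fused = fuse_mean([shallow, upsample_nearest(deep, 2)])
box = bounding_box(fused, threshold=0.75)
```

Here the deep map is first brought to the shallow map's resolution, the two are fused, and the cells with a strong fused response are enclosed in one box; a real detection head would separate multiple regions and regress box coordinates and confidence scores.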

[0061] Step S203: Based on the text recognition network in the cargo recognition algorithm, the text detection result is parsed to obtain the text recognition result corresponding to the text detection result, and the text recognition result is used as the cargo feature information corresponding to the cargo information.

[0062] The text recognition network refers to the neural network structure that parses the content of the text detection results, converting the information in each text region into a specific text sequence. Its output, the text recognition result, may include text fields such as product model, storage quantity and storage location.

[0063] For example, in a text recognition network, the text detection result is used as input, and multi-stage recognition and parsing processing is performed on the text regions contained therein to convert the text regions in image form into text sequences with clear semantics. In this process, the text regions are first uniformly formatted to adapt to the input requirements of subsequent recognition structures. Then, a processing path for feature extraction and feature modeling is constructed to achieve in-depth parsing of the text region content. This processing path extracts local image features while introducing sequence modeling-based structural logic to capture the sequential relationships and semantic coherence between characters. Based on this, not only can low-level image details such as character edges and strokes be obtained, but global associations between characters can also be established, thereby improving the parsing ability for complex text arrangements. Furthermore, to further enhance recognition stability, a processing path for feature restoration and feature modeling is also constructed. That is, while maintaining structural modeling capabilities, some original image features in the text region are restored to ensure the consistency of character morphology and spatial structure. Based on this, the feature results generated by the two processing paths are fused in a unified space to form the final output features, which include the image morphology, arrangement structure, and contextual semantic relationships of the characters. Finally, the output feature is subjected to character-by-character recognition. The recognition result of each character is output through a structured classification and decoding process, and these characters are combined into a complete text string according to the recognition order, which serves as the text recognition result contained in the text region.

[0064] In this embodiment, firstly, if the entity feature category in the cargo information is a text feature category, a text detection and recognition algorithm based on relation modeling is matched to achieve targeted adaptation between the recognition algorithm and the information type, thereby improving the accuracy and efficiency of subsequent processing. Secondly, multi-scale feature extraction and fusion processing are performed based on the text detection network to obtain text detection results, thereby enhancing the detection capability of text regions of different sizes and complex backgrounds, and improving the completeness and accuracy of text region extraction. Thirdly, the text detection results are parsed and processed by the text recognition network to obtain text recognition results, thereby achieving structured restoration of the text content and generating cargo feature information that can be directly used for business comparison. Based on this, through the collaborative processing of entity feature category judgment, text region localization, and text content recognition, high-precision recognition of text content in cargo information is achieved.

[0065] In one exemplary embodiment, the text detection network includes convolutional layers, multi-path sampling layers, feature fusion layers, and a detection head structure; such as Figure 5 As shown, based on the text detection network in the cargo recognition algorithm, multi-scale feature extraction and fusion processing are performed on cargo information to obtain the text detection result corresponding to the cargo information, including steps S301 to S303.

[0066] Here, the convolutional layers represent network structure units used to extract local features from the input image, thereby establishing feature map representations of the image at different scales; the multi-path sampling layers represent network structure units used to perform size restoration processing on feature maps at different scales, thereby improving the spatial resolution of deep feature maps to match that of shallow feature maps; the feature fusion layer represents a network structure unit used to fuse feature maps from different scales and semantic levels, so as to enhance the overall semantic expressiveness of the image while preserving local details; the detection head structure represents a network structure module used to perform region discrimination and boundary regression tasks based on the fused feature maps, so as to locate, classify, or predict the confidence of possible text regions in the image.

[0067] Step S301: Based on the convolutional layer, perform multi-scale feature extraction processing on the cargo information to obtain feature maps of different scales.

[0068] Step S302: Combining the linear interpolation and deconvolution processing logic corresponding to the multi-path sampling layer and the global information modeling and splicing processing logic corresponding to the feature fusion layer, the feature maps of different scales are jointly upsampled and fused to obtain the fused feature map.

[0069] The multi-path sampling layer, which combines linear interpolation and deconvolution, represents a processing mechanism that combines two different image enhancement methods to restore the size of the feature map during upsampling. Linear interpolation is used to smoothly fill in the values of newly added pixels in the feature map, making them consistent with the original pixel distribution. Deconvolution is used to expand the spatial structure of the feature map and enhance the expressive power of edges and patterns, so as to preserve key structural information in the image while improving the spatial resolution of the feature map.

[0070] The global information modeling and splicing processing logic corresponding to the feature fusion layer represents a processing mechanism that performs both information splicing and context relationship modeling in the fusion processing of multi-scale feature maps. The splicing operation is used to retain the original feature content from different scales, while the global information modeling is used to establish spatial dependencies and cross-regional connections between features, in order to generate a more comprehensive and spatially consistent fused feature map.

[0071] Step S303: Based on the detection head structure, perform detection analysis on the fused feature map to obtain text region detection information based on cargo information, and use the text region detection information as the text detection result corresponding to the cargo information.

[0072] The text region detection information refers to a set of spatial localization data output by the detection head structure after detecting and analyzing the fused feature map. It is used to identify the specific location and structural features of regions in the input image that may contain text. For example, each piece of detection information may include the coordinates, size, orientation angle, and confidence score of whether the region is a text region, corresponding to a rectangular bounding box that marks a certain region.
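As an illustration of the text region detection information described above, the following sketch stores each detection as a rotated rectangle with a confidence score and filters out low-confidence regions. The field names, the 0.5 threshold, and the sample values are assumptions made for the example only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRegion:
    """One candidate text region, stored as a rotated rectangle."""
    center_x: float    # horizontal centre of the box, in pixels
    center_y: float    # vertical centre of the box, in pixels
    width: float       # box width, in pixels
    height: float      # box height, in pixels
    angle_deg: float   # orientation angle of the box, in degrees
    confidence: float  # score in [0, 1] that the region contains text


def keep_confident(regions: List[TextRegion], threshold: float = 0.5) -> List[TextRegion]:
    """Drop regions whose text confidence is below the threshold."""
    return [region for region in regions if region.confidence >= threshold]


# Hypothetical detections: one confident shelf label and one weak candidate.
regions = [
    TextRegion(120.0, 40.0, 200.0, 32.0, 0.0, 0.93),
    TextRegion(300.0, 210.0, 80.0, 20.0, 12.5, 0.31),
]
kept = keep_confident(regions)
```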

[0073] For example, in order to solve multiple problems such as inaccurate text region localization and complex background interference in the text detection process, and to ensure accurate extraction of text information, the text detection and recognition algorithm based on relation modeling in this embodiment is a text detection and recognition algorithm based on Mamba fusion and global relation modeling, and the text detection network therein is a text detection network based on Mamba fusion and multi-path sampling.

[0074] As shown in Figure 6, in the text detection network, the input image is sequentially input into Conv1 (the first convolutional layer), Conv2 (the second convolutional layer), Conv3 (the third convolutional layer), and Conv4 (the fourth convolutional layer) for layer-by-layer feature extraction, resulting in feature maps at different levels. The low-level feature maps mainly correspond to features such as edges and textures, while the high-level feature maps mainly correspond to semantic features.

[0075] The feature map output by Conv4 is upsampled by Multi-upsample1 (the first multi-upsample layer). Mamba Fusion1 (the first feature fusion layer) then fuses the upsampled feature map with the feature map output from Conv4, and the fused feature map is upsampled by Multi-upsample2 (the second multi-upsample layer). Mamba Fusion2 (the second feature fusion layer) fuses the result with the feature map output by Conv2, and the fused feature map is upsampled by Multi-upsample3 (the third multi-upsample layer). Mamba Fusion3 (the third feature fusion layer) then fuses the features output by Multi-upsample3 with a further feature map to obtain the fused features. Finally, the detection head structure detects and analyzes the fused features: it generates a probability map and a threshold map, which are fused to calculate an approximate binary map. This enables accurate detection of text regions and yields the text detection result.
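The fusion of a probability map and a threshold map into an approximate binary map can be sketched as follows. The text does not give the fusion formula; the sketch assumes the differentiable binarization form B = 1 / (1 + exp(-k (P - T))) with an amplification factor k = 50, which is a common choice in text detectors and is used here purely as an assumption. The function name and sample maps are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def approximate_binary_map(prob_map: Grid, thresh_map: Grid, k: float = 50.0) -> Grid:
    """Fuse a probability map P and a threshold map T into a soft binary map.

    Assumed formula: B = 1 / (1 + exp(-k * (P - T))).
    Values of P and T are expected to lie in [0, 1].
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(prob_row, thresh_row)]
        for prob_row, thresh_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 maps: only the top-left pixel clearly exceeds its threshold.
binary = approximate_binary_map(
    [[0.9, 0.1], [0.3, 0.3]],
    [[0.3, 0.3], [0.3, 0.3]],
)
```

Pixels whose probability is well above the local threshold approach 1, pixels well below it approach 0, and pixels exactly at the threshold map to 0.5.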

[0076] As shown in Figure 7, the multi-path sampling layer uses two different upsampling strategies, that is, two different upsampling operations are designed: an upsampling operation based on linear interpolation and an upsampling operation based on deconvolution. The input features are passed through each of the two operations separately for upsampling; the features obtained from the two upsampling operations are then added together and averaged to obtain the final upsampled output.

[0077] Furthermore, the processing logic of the above multi-path sampling layer is given by equation (1):

[0078] (1)

[0079] In equation (1), the upsampling operation based on linear interpolation can be implemented by linear interpolation or bilinear interpolation; the upsampling operation based on deconvolution can be implemented by a deconvolution with optimized parameters; the function composition symbol indicates that the functions are executed in sequence; the processing logic of equation (1) corresponds to the combined linear interpolation and deconvolution processing logic of the multi-path sampling layer.
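The multi-path sampling logic (upsample along two paths, add, then average) can be sketched on a one-dimensional signal. This is a simplified illustration under stated assumptions: a fixed upsampling factor of 2, a hand-set two-tap deconvolution kernel in place of learned parameters, and hypothetical function names.

```python
from typing import List


def upsample_linear(x: List[float]) -> List[float]:
    """Double the length by inserting the midpoint between neighbours."""
    out: List[float] = []
    for i, value in enumerate(x):
        nxt = x[i + 1] if i + 1 < len(x) else value  # repeat the last sample at the edge
        out.append(value)
        out.append((value + nxt) / 2.0)
    return out


def upsample_deconv(x: List[float], kernel: List[float], stride: int = 2) -> List[float]:
    """Transposed (de)convolution: scatter each input sample through the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def multi_path_upsample(x: List[float], kernel: List[float]) -> List[float]:
    """Average the interpolation path and the deconvolution path element-wise."""
    path_a = upsample_linear(x)
    path_b = upsample_deconv(x, kernel)
    if len(path_a) != len(path_b):
        raise ValueError("both paths must produce the same output length")
    return [(a + b) / 2.0 for a, b in zip(path_a, path_b)]


# Linear path gives [1, 2, 3, 3]; deconvolution path gives [1, 1, 3, 3].
upsampled = multi_path_upsample([1.0, 3.0], kernel=[1.0, 1.0])
```

In a real network the kernel would be learned and the operations would act on two-dimensional feature maps; the averaging step is the same.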

[0080] As shown in Figure 8, in the feature fusion layer, a new MambaFusion network structure layer is designed using Mamba and Non-local techniques. On the one hand, the two input features of the MambaFusion network structure layer are each input into Mamba for global information modeling, yielding their respective global information enhancement features; these global information enhancement features are then input into Concat1 (the first concatenation layer) for feature concatenation to obtain the first concatenated feature, and the first concatenated feature is input into the Non-local module for global information fusion modeling to obtain the first fused feature.

[0081] On the other hand, the two input features are input into Concat2 (the second concatenation layer) for feature concatenation to obtain the second concatenated feature; the second concatenated feature and the first fused feature are then added together and averaged to obtain the final fused output.

[0082] Furthermore, the processing logic of the aforementioned feature fusion layer is given by equations (2) to (4):

[0083] (2)

[0084] (3)

[0085] (4)

[0086] In equations (2) to (4), one symbol represents the first fused feature output by Non-local, and another represents the second concatenated feature output by Concat2; the function composition symbol indicates that the functions are executed in sequence; the processing logic of equations (2) to (4) corresponds to the combined global information modeling and concatenation processing logic of the feature fusion layer.
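The fusion logic described for the feature fusion layer can be written compactly as below. This is a reading of the prose description, not the equations of the application: the symbols X_a and X_b (the two input features), F_nl (the first fused feature), F_cat (the second concatenated feature), and Y (the fused output) are assumed names introduced only for this sketch.

```latex
\begin{aligned}
F_{\mathrm{nl}}  &= \bigl(\mathrm{NonLocal} \circ \mathrm{Concat}_1\bigr)\bigl(\mathrm{Mamba}(X_a),\ \mathrm{Mamba}(X_b)\bigr) \\
F_{\mathrm{cat}} &= \mathrm{Concat}_2\bigl(X_a,\ X_b\bigr) \\
Y                &= \tfrac{1}{2}\,\bigl(F_{\mathrm{nl}} + F_{\mathrm{cat}}\bigr)
\end{aligned}
```

Under this reading, the addition requires F_nl and F_cat to have the same shape, which holds if the Non-local step preserves the channel count of the concatenated input.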

[0087] In this embodiment, firstly, multi-scale feature extraction processing is performed on the cargo information using convolutional layers to obtain feature maps that express the structure and contour features of text regions under different receptive fields, thereby enhancing the recognition support for characters of different sizes. Secondly, upsampling and fusion processing is performed on the multi-scale feature maps using multi-path sampling layers and feature fusion layers to obtain fused feature maps, thereby improving the uniformity of feature maps in terms of spatial structure and semantic information. Thirdly, the fused feature maps are detected and analyzed using the detection head structure to obtain text region detection information, thereby accurately extracting the spatial position of text regions in the image. Based on this, accurate and efficient recognition and localization of text regions in the image are achieved.

[0088] In one exemplary embodiment, the text recognition network includes convolutional layers, deconvolutional layers, temporal modeling layers, and a detection head structure. As shown in Figure 9, parsing the text detection results based on the text recognition network in the cargo recognition algorithm, to obtain the text recognition results corresponding to the text detection results, includes steps S401 to S404.

[0089] Among them, the convolutional layer represents a network structure unit used to extract local features from the input data, so as to establish the feature representation of the original data at different scales; the deconvolutional layer represents a network structure unit used to restore the spatial resolution of features and enhance the detailed representation of data, so as to gradually restore low-resolution features to a higher-resolution form; the temporal modeling layer represents a network structure unit used to capture the contextual relationship of characters in the sequence, so as to establish semantic and sequential associations between characters; the detection head structure represents a network structure module that performs content recognition and sequence generation on the received features, so as to recognize and combine the text content in the text region and output the corresponding structured text recognition results.

[0090] Step S401: Combining the convolution processing logic corresponding to the convolutional layer and the global relation modeling processing logic corresponding to the temporal modeling layer, determine the first comprehensive processing logic that combines feature extraction and feature modeling.

[0091] The convolutional processing logic corresponding to the convolutional layer represents the processing mechanism used to extract local image features from the input text region. That is, it captures character edges, contours and texture structures through a sliding window to construct a local spatial representation of the character.

[0092] The global relation modeling processing logic corresponding to the temporal modeling layer represents a processing mechanism used to establish the contextual semantic associations formed by the arrangement order of characters in the text region, so as to express the global sequence relationship and logical order between characters.

[0093] The first integrated processing logic represents a feature processing path jointly constructed based on convolution processing logic and global relation modeling logic, which is used to simultaneously complete the spatial feature extraction and sequence semantic modeling of text regions, thereby outputting character features with temporal structure.

[0094] Step S402: Combining the deconvolution processing logic corresponding to the deconvolution layer and the global relation modeling processing logic corresponding to the temporal modeling layer, determine the second comprehensive processing logic that combines feature restoration and feature modeling.

[0095] Among them, the deconvolution processing logic corresponding to the deconvolution layer represents a processing mechanism used to restore abstract features to a higher resolution expression, so as to emphasize the reconstruction of data details and the restoration of character structure, thereby supplementing character edge information and repairing visual details lost during compression.

[0096] The second integrated processing logic represents the feature reconstruction path constructed based on the deconvolution processing logic and the global relation modeling logic. It is used to maintain the semantic coherence between character sequences while restoring the data space structure, thereby outputting restored features with high resolution and semantic structure.

[0097] Step S403: Combining the first and second integrated processing logics, the text detection results are processed by integrating feature extraction, modeling, and restoration to obtain output features.

[0098] Step S404: Based on the detection head structure, the output features are detected and analyzed to obtain the text recognition result corresponding to the text detection result.

[0099] For example, as shown in Figure 10, the text recognition network is a convolutional-deconvolutional text recognition network based on global relation modeling, capable of accurately recognizing text in scenarios with complex backgrounds, distortions, and multilingual mixing. Specifically, in the text recognition network, the text detection result is sequentially input into Conv Block1 (the first convolutional block) and Conv Block2 (the second convolutional block) for convolution processing; the resulting features are input into Transformer Block1 (i.e., the first temporal modeling module) for feature modeling processing; the modeled features are input into Conv Block3 (the third convolutional block) for convolution processing; the resulting features are input into Deconv Block1 (the first deconvolution block) for deconvolution processing; the resulting features are input into Transformer Block2 (the second temporal modeling module) for feature modeling processing; the modeled features are sequentially input into Deconv Block2 (the second deconvolution block) and Deconv Block3 (the third deconvolution block) for deconvolution processing; finally, the resulting features are input into the detection head structure for detection and analysis to obtain the final text recognition result.

[0100] Furthermore, the processing logic of the above text recognition network is given by equation (5):

[0101] (5)

[0102] In equation (5), the symbols represent, in order: the convolution operations performed sequentially by Conv Block1 and Conv Block2; the global relation modeling operation performed by Transformer Block1; the convolution operation performed by Conv Block3; the deconvolution operation performed by Deconv Block1; the global relation modeling operation performed by Transformer Block2; and the deconvolution operations performed sequentially by Deconv Block2 and Deconv Block3. The function composition symbol indicates that the functions are executed in sequence.

[0103] Furthermore, the combined processing logic of the first three operations (those of Conv Block1 and Conv Block2, Transformer Block1, and Conv Block3) corresponds to the first comprehensive processing logic in this embodiment; the combined processing logic of the last three operations (those of Deconv Block1, Transformer Block2, and Deconv Block2 and Deconv Block3) corresponds to the second comprehensive processing logic in this embodiment.
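The sequential execution expressed by the function composition symbol can be sketched with plain function composition. The stage functions below are hypothetical stand-ins that only record their names, and the grouping into a first and a second processing logic follows the order in which the operations are listed for equation (5); none of this is a real network implementation.

```python
from functools import reduce
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]


def compose_in_order(*stages: Stage) -> Stage:
    """Return a function that applies the stages from left to right."""
    def run(value: List[str]) -> List[str]:
        return reduce(lambda current, stage: stage(current), stages, value)
    return run


def make_stage(name: str) -> Stage:
    """Hypothetical stand-in: appends its name instead of transforming features."""
    return lambda trace: trace + [name]


# Feature extraction and modeling path (convolution plus temporal modeling).
first_logic = compose_in_order(
    make_stage("Conv Block1+2"), make_stage("Transformer Block1"), make_stage("Conv Block3")
)
# Feature restoration and modeling path (deconvolution plus temporal modeling).
second_logic = compose_in_order(
    make_stage("Deconv Block1"), make_stage("Transformer Block2"), make_stage("Deconv Block2+3")
)
recognizer = compose_in_order(first_logic, second_logic, make_stage("Detection head"))
trace = recognizer([])
```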

[0104] In this embodiment, firstly, based on the first integrated processing logic constructed by the convolutional layer and the temporal modeling layer, local structural features of characters are extracted and semantic relationships between characters are established, enhancing the sequence consistency of the basic expression for recognition. Secondly, based on the second integrated processing logic constructed by the deconvolutional layer and the temporal modeling layer, the detail restoration of characters and the enhancement of semantic continuity are achieved, improving the discriminative ability of the restored features. Thirdly, by jointly applying the first and second integrated processing logics, feature extraction, modeling, and restoration are jointly achieved, obtaining unified output features with spatial details and semantic expression. Fourthly, the output features are detected and analyzed based on the detection head structure, thereby restoring the character information in the image into a structured text recognition result. Based on this, accurate restoration and sequential recognition of character content in text images are achieved, ensuring the integrity and reliability of the text recognition result.

[0105] In one exemplary embodiment, as shown in Figure 11, determining the cargo recognition algorithm matched with the cargo information according to the entity feature category in the cargo information, and identifying the cargo feature information corresponding to the cargo information based on the cargo recognition algorithm, includes steps S501 to S504.

[0106] Step S501: If the entity feature category in the cargo information is an image feature category, the preset algorithm based on hybrid attention and temporal fusion enhancement is used as the cargo recognition algorithm matched with the cargo information.

[0107] Among them, the image feature category represents the visual data type identified from the cargo information that is composed of visual content such as image structure, texture, and outline, such as visual forms such as cargo outer packaging images, appearance shape, and structural outline.

[0108] Among them, the algorithm based on hybrid attention and temporal fusion enhancement represents a joint recognition processing flow that combines spatial attention mechanism and multi-scale temporal information modeling method, which is used to simultaneously enhance the responsiveness of key regions in the image and the structural consistency between multi-layer features.

[0109] For example, if features such as object structure, edges, and color distribution are detected prominently in the cargo information, the cargo information is considered to belong to the image feature category, meaning that an identifiable object exists in the image. An algorithm based on hybrid attention and temporal fusion enhancement can then be used as the matching cargo recognition algorithm to enable more structured identification of the image data. This type of algorithm is built upon deep visual perception mechanisms and incorporates attention mechanisms and temporal modeling logic to enhance the perception of the interrelationships between the spatial location and local details of target objects in the image.
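The matching between entity feature category and recognition algorithm can be pictured as a simple dispatch table. The sketch below is illustrative only: the enum, the two placeholder functions, and their return strings are hypothetical and merely name the two algorithm families described in the embodiments.

```python
from enum import Enum
from typing import Callable, Dict


class EntityFeatureCategory(Enum):
    TEXT = "text"
    IMAGE = "image"


def recognize_text(cargo_info: bytes) -> str:
    # Placeholder for the text detection and recognition algorithm.
    return "text detection and recognition based on relation modeling"


def recognize_image(cargo_info: bytes) -> str:
    # Placeholder for the image-based cargo recognition algorithm.
    return "hybrid attention and temporal fusion enhancement"


ALGORITHMS: Dict[EntityFeatureCategory, Callable[[bytes], str]] = {
    EntityFeatureCategory.TEXT: recognize_text,
    EntityFeatureCategory.IMAGE: recognize_image,
}


def match_algorithm(category: EntityFeatureCategory) -> Callable[[bytes], str]:
    """Return the recognition routine matched to the entity feature category."""
    return ALGORITHMS[category]


selected = match_algorithm(EntityFeatureCategory.IMAGE)
description = selected(b"raw image bytes")
```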

[0110] Step S502: Based on the backbone structure in the cargo recognition algorithm, feature extraction processing is performed on the cargo information to obtain feature information at different scales.

[0111] The backbone structure represents the network structure modules used to extract multi-scale semantic features, in order to extract low- and mid-level semantic features such as texture, edge, and shape from the input image.

[0112] For example, based on the backbone network in the cargo recognition algorithm structure, multi-stage feature extraction processing is performed on the input cargo information. Specifically, the cargo information is fed into the convolutional units in the backbone structure layer by layer. The resolution, receptive field, and response distribution of the feature map output by each layer are retained and recorded. In this way, feature information containing different semantic levels such as structure, texture, edge, and shape is extracted layer by layer from the original image to form a stable feature representation under multi-scale distribution. Among them, the backbone structure extracts and compresses the local response in the image by using continuous feature transformation, thereby forming a layer-by-layer abstract representation of the characteristics of the object in the image.

[0113] Step S503: Based on the neck structure in the cargo recognition algorithm, feature information at different scales is fused to obtain fused features at different scales.

[0114] The neck structure represents a network structure module used to integrate feature information at different scales, in order to fuse low-level detailed information with high-level semantic information and generate unified and more expressive fused features at different scales.

[0115] For example, based on the neck network in the cargo recognition algorithm structure, feature information at different scales is fused. This involves uniformly modeling and fusing feature information at different scales according to dimensions such as feature level, spatial size, and response intensity, to form fused features that include both low-level details and high-level semantics. Furthermore, during the fusion process, a structural enhancement mechanism can be introduced to strengthen the distinguishing ability between the target region and the background, thereby improving the accuracy of the fused features' response to the target region. Ultimately, the fused features obtained at different scales possess stronger target representation capabilities, higher contextual modeling capabilities, and semantic coverage at each scale.

[0116] Step S504: Based on the detection head structure in the cargo recognition algorithm, the fusion features at different scales are detected and analyzed to obtain the image recognition result corresponding to the cargo information, and the image recognition result is used as the cargo feature information corresponding to the cargo information.

[0117] The detection head structure represents a network structure module used to perform region discrimination and boundary regression tasks based on fused features at different scales, in order to locate, classify, or predict the confidence of cargo targets in images.

[0118] Here, the image recognition result represents the image description data output after the detection head structure detects and analyzes the fused features at different scales. It is used to characterize the specific recognized objects in the image and their location, category, and confidence level in the image.

[0119] For example, based on the detection head structure in the cargo recognition algorithm, fusion features at different scales are detected and analyzed to achieve target region recognition and boundary determination, thereby generating a complete image recognition result and extracting cargo feature information. Specifically, firstly, sliding region analysis is performed on the fusion features, calculating the response intensity of each region in the image block by block in the feature dimension to determine whether the corresponding region contains the target to be identified; then, by classifying and regressing the feature distribution in the target region containing the target to be identified, the recognition information and positioning information of the cargo target in each target region are output, such as the product type, confidence score, and bounding box position of the cargo target, thereby constituting the image recognition result.
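How an image recognition result of this kind could be turned into a storage quantity per product type is sketched below. The record layout, the 0.5 confidence cut-off, and the sample detections are assumptions made for the example; the embodiment does not prescribe them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    product_type: str                        # predicted category of the cargo target
    confidence: float                        # classification confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def count_goods(detections: List[Detection], min_confidence: float = 0.5) -> Counter:
    """Count detected goods per product type, ignoring low-confidence boxes."""
    return Counter(
        d.product_type for d in detections if d.confidence >= min_confidence
    )


# Hypothetical detections for one storage position.
detections = [
    Detection("carton-A", 0.92, (10.0, 10.0, 50.0, 60.0)),
    Detection("carton-A", 0.88, (55.0, 10.0, 95.0, 60.0)),
    Detection("carton-B", 0.95, (100.0, 10.0, 140.0, 60.0)),
    Detection("carton-A", 0.20, (12.0, 65.0, 48.0, 90.0)),  # dropped: low confidence
]
counts = count_goods(detections)
```

The resulting counts can then be compared against the inventory information for the same storage position.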

[0120] In actual warehousing scenarios, goods are often stored under complex conditions such as dense stacking, partial occlusion, and structural overlap. Traditional goods identification and counting methods often fail to distinguish individual goods accurately when occlusion occurs, which easily leads to omissions or duplicates in counting. This embodiment aims to improve the ability to accurately identify and count goods under occlusion conditions. Specifically, by constructing a recognition framework with global contextual understanding capabilities, even if parts of the surface of the goods are not visible, effective reconstruction and reasoning can be performed based on their edges, texture extensions, or spatial relationships. In this way, the structure within the invisible areas is restored, so that the category and quantity of occluded goods are identified accurately, with stronger occlusion robustness and counting accuracy.

[0121] In this embodiment, firstly, if the entity feature category in the cargo information is an image feature category, an algorithm based on hybrid attention and temporal fusion enhancement is matched to achieve targeted adaptation of the recognition algorithm to the information type, thereby improving the accuracy and efficiency of subsequent processing. Secondly, feature information at different scales is extracted based on the backbone structure to enhance the ability to express multi-level visual details and semantic structures. Thirdly, feature information at different scales is fused based on the neck structure to obtain fused features at different scales, thereby improving the consistency of feature expression and the completeness of the response of the target region. Fourthly, the image recognition result is output based on the detection head structure to achieve structured detection and localization of cargo targets in the image. Based on this, stable recognition and standardized result generation of image-based cargo information are achieved, improving the accuracy of image recognition and the versatility of the system.

[0122] In one exemplary embodiment, the backbone structure includes convolutional optimization layers and attention enhancement layers. As shown in Figure 12, performing feature extraction processing on the cargo information based on the backbone structure in the cargo recognition algorithm, to obtain feature information at different scales, includes steps S601 to S604.

[0123] Among them, the convolution optimization layer represents a network structure unit used for efficient feature extraction of image information, which improves extraction accuracy and computational efficiency by adjusting the convolution structure or designing convolution parameters; the attention enhancement layer represents a network structure unit used to improve the expression quality of feature map channels, which combines multiple attention modeling mechanisms to adjust the channel response intensity, so as to strengthen key feature channels and suppress redundant information, thereby improving the expression stability of the target region.

[0124] Step S601: Based on the current convolutional optimization layer, perform optimized feature extraction processing on the cargo information to obtain the feature information output by the current convolutional optimization layer.

[0125] Step S602: Based on the channel attention modeling module, self-attention modeling module and C3K modeling module in the current attention enhancement layer, the feature information output by the current convolutional optimization layer is comprehensively processed by combining data connection, channel dimension transformation and channel dimension segmentation to obtain the enhanced feature information output by the current attention enhancement layer.

[0126] Among them, the channel attention modeling module represents the modeling sub-module in the attention enhancement layer that constructs attention weights based on the statistical characteristics and response distribution between feature map channels, so as to selectively enhance channels with discriminative capabilities; the self-attention modeling module represents the modeling sub-module in the attention enhancement layer that models the contextual dependencies between channels, so as to realize the dynamic interaction of information between different channels through internal feature similarity calculation; the C3K modeling module represents the modeling sub-module in the attention enhancement layer that performs feature alignment and completion by fusing convolutional feature paths and contextual modeling paths, so as to comprehensively improve the spatial details of feature maps and the consistency of structural expression between channels.

[0127] Step S603: Input the enhanced feature information output by the current attention enhancement layer into the next convolutional optimization layer for processing to obtain the feature information output by the next convolutional optimization layer. Input the feature information output by the next convolutional optimization layer into the next attention enhancement layer for processing to obtain the enhanced feature information output by the next attention enhancement layer, until all feature information output by the convolutional optimization layers is traversed.

[0128] Step S604: Use the feature information output by different convolutional optimization layers as feature information of different scales output by the backbone structure.
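Steps S601 to S604 amount to alternating convolutional optimization layers and attention enhancement layers while collecting the output of every convolutional optimization layer. A minimal sketch of that control flow follows; the layers are hypothetical stand-ins (a "convolution" that halves an integer resolution and an "attention" step that leaves it unchanged), chosen only so the loop can run.

```python
from typing import Callable, List

Layer = Callable[[int], int]


def run_backbone(image: int, conv_layers: List[Layer], attention_layers: List[Layer]) -> List[int]:
    """Alternate conv-optimization and attention-enhancement layers.

    Returns the output of every conv-optimization layer, in order.
    """
    if len(attention_layers) != len(conv_layers) - 1:
        raise ValueError("expected one attention layer between each pair of conv layers")
    outputs: List[int] = []
    features = image
    for index, conv in enumerate(conv_layers):
        features = conv(features)          # S601 / S603: convolutional optimization layer
        outputs.append(features)           # S604: keep this layer's output
        if index < len(attention_layers):
            features = attention_layers[index](features)  # S602 / S603: attention enhancement
    return outputs


def halve(resolution: int) -> int:
    return resolution // 2                 # stand-in for a strided convolution


def identity(resolution: int) -> int:
    return resolution                      # stand-in for attention enhancement


# Four conv layers and three attention layers, starting from a 64-pixel-wide input.
scales = run_backbone(64, [halve] * 4, [identity] * 3)
```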

[0129] For example, as shown in Figure 13, the algorithm based on hybrid attention and temporal fusion enhancement in this embodiment is a cargo recognition and detection algorithm based on hybrid attention and Transformer fusion enhancement. The input of the algorithm can be a single image or a panoramic image synthesized from multiple viewpoints, so as to reduce or eliminate the influence of occlusion. Specifically, the input image is input into the Backbone part (i.e., the backbone structure) and processed as follows: the input image is sequentially fed into Layer1 (the first convolutional optimization layer), ACK1 (the first attention enhancement layer), and Layer2 (the second convolutional optimization layer) for convolution, attention enhancement, and convolution processing, respectively, to obtain the output features of Layer2; the output features of Layer2 are sequentially fed into ACK2 (the second attention enhancement layer) and Layer3 (the third convolutional optimization layer) for attention enhancement and convolution processing, respectively, to obtain the output features of Layer3; the output features of Layer3 are sequentially fed into ACK3 (the third attention enhancement layer) and Layer4 (the fourth convolutional optimization layer) for attention enhancement and convolution processing, respectively, to obtain the output features of Layer4. Finally, the output features of Layer2, Layer3, and Layer4 are used as inputs to the Neck part (i.e., the neck structure).

[0130] Furthermore, in the Neck part, the high-level features input from the Backbone part are fused with the low-level features, so that the high resolution of the low-level features and the rich semantic information of the high-level features are used together, which supports the back-end detection head part (i.e., the detection head structure) in making independent predictions on multi-scale features. As shown in Figure 13, in the detection head part, the One2Many Head and One2One Head of YoLov11 (an "end-to-end" single-stage object detection model) are used to detect features at different scales, obtaining the detection output for the input image and thus the image recognition result.

[0131] Furthermore, as shown in Figure 14, the ACK module (i.e., the attention enhancement layer) in this cargo recognition and detection algorithm is a C3K network structure with channel attention and self-attention enhancement, implemented on the basis of the original C3K module (i.e., a convolutional network in YoLov11 that introduces an attention mechanism). It includes a channel attention modeling module, a self-attention modeling module, and a C3K modeling module. Specifically, in the attention enhancement layer, a 1x1 Conv (i.e., a convolutional layer with a 1x1 kernel) performs a feature channel dimension transformation on the input features, and the output features of the 1x1 Conv are then split by Split into three parts.

[0132] In the self-attention modeling module, the corresponding split features are sequentially input into Multi-head Attention1 (the first multi-head attention layer) and Multi-head Attention2 (the second multi-head attention layer) for self-attention enhancement processing, yielding the output features of the self-attention modeling module.

[0133] In the channel attention modeling module, the corresponding split features and the output features of Multi-head Attention1 are concatenated through Concat1 (the first connection layer), and the output features of Concat1 are then input into ChannelAttention (the channel attention layer) for channel attention enhancement, yielding the output features of the channel attention modeling module.

[0134] In the C3K modeling module, the corresponding split features are sequentially fed into C3K modules at different layers for attention enhancement processing, and the output features of the C3K modules at the different layers are each used as output features of the C3K modeling module.

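[0135] Based on this, the output features of the channel attention modeling module, the self-attention modeling module, and the C3K modeling module are concatenated using Concat2 (the second connection layer), and the channel dimension is then restored using a 1x1 Conv to obtain the output features of the ACK module.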

[0136] Furthermore, the processing logic of the ACK module can be referred to equations (5) to (6):

[0137] (5)

[0138] (6)

[0139] In equations (5) to (6), the symbols denote, respectively: the output features of the channel attention modeling module; the output features of the self-attention modeling module; the output features of the C3K modeling module; the connection operation performed by Concat2; the channel dimension transformation operation performed by the 1x1 Conv; and the splitting operation performed by Split along the channel dimension.
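The data flow of the ACK module described above can be sketched in Python as follows. This is only an illustration of the split, concatenate, and restore structure: features are reduced to plain lists with one value per channel, every learned operation is passed in as a hypothetical stand-in callable, and the order in which the three split groups are assigned to the three modeling modules is an assumption.

```python
from typing import Callable, List, Sequence

Channels = List[float]  # stand-in: one value per feature channel
Op = Callable[[Channels], Channels]


def split(x: Channels, parts: int) -> List[Channels]:
    """Split a channel list into equal consecutive groups (Split along the channel dimension)."""
    if len(x) % parts != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(x) // parts
    return [x[i * size:(i + 1) * size] for i in range(parts)]


def concat(*groups: Channels) -> Channels:
    """Concatenate channel groups (Concat1 / Concat2)."""
    return [value for group in groups for value in group]


def ack_module(x: Channels, conv_in: Op, mha1: Op, mha2: Op,
               channel_attention: Op, c3k_blocks: Sequence[Op], conv_out: Op) -> Channels:
    # 1x1 Conv, then Split into three groups (assignment order is assumed).
    x_channel, x_self, x_c3k = split(conv_in(x), 3)
    # Self-attention branch: Multi-head Attention1 followed by Multi-head Attention2.
    attn1 = mha1(x_self)
    self_out = mha2(attn1)
    # Channel attention branch: Concat1 with the Attention1 output, then ChannelAttention.
    channel_out = channel_attention(concat(x_channel, attn1))
    # C3K branch: sequential blocks, keeping the output of every block.
    c3k_outs = []
    current = x_c3k
    for block in c3k_blocks:
        current = block(current)
        c3k_outs.append(current)
    # Concat2 over all branch outputs, then the restoring 1x1 Conv.
    return conv_out(concat(channel_out, self_out, *c3k_outs))


identity: Op = lambda v: list(v)
double: Op = lambda v: [2 * value for value in v]
result = ack_module([1, 2, 3, 4, 5, 6], identity, double, identity,
                    identity, [double, double], identity)
print(result)
```

With identity and doubling stand-ins, the output makes it easy to see which branch contributed which channels.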

[0140] Furthermore, the convolutional optimization layer in this cargo recognition and detection algorithm is a convolutional network layer composed of deformable convolution and depthwise separable convolution. It can extract the shape features of irregular objects and reduce the time cost of the model in the inference stage by taking advantage of the characteristics of fewer parameters and less computation.
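The parameter saving attributed to depthwise separable convolution can be made concrete with a short calculation. The sketch below counts weights only (biases excluded) and does not model the offset-prediction parameters that a deformable convolution adds; the function names and the example sizes are illustrative assumptions.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution: every output channel sees every input channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable)
```

For a 3x3 kernel with 64 input and 128 output channels, the standard layer has 73728 weights and the depthwise separable layer has 8768.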

[0141] In this embodiment, firstly, optimized feature extraction processing is performed on cargo information using the convolutional optimization layer, thereby improving the clarity and stability of feature information in local representations such as edges and textures. Secondly, the feature information is enhanced using the multi-module modeling mechanism in the attention enhancement layer, thereby strengthening the responsiveness of key channels and improving the discriminativeness of features and semantic coordination between channels. Thirdly, the multi-scale construction and semantic hierarchical expression of the feature extraction process are realized through the layer-by-layer iterative processing mechanism of the convolutional optimization layer and the attention enhancement layer. Fourthly, the feature information output by each convolutional optimization layer is uniformly summarized and processed to obtain multi-scale feature information covering different receptive fields and semantic levels. Based on this, through the structural design combining multi-layer convolutional optimization extraction and attention enhancement, deep feature modeling and multi-scale expression of cargo images are realized.

[0142] In one exemplary embodiment, the neck structure includes a multi-scale temporal feature fusion layer and an intermediate layer, wherein the intermediate layer includes at least one of an upsampling layer, a convolutional optimization layer, and an attention enhancement layer. As shown in Figure 15, fusing the feature information at different scales based on the neck structure in the cargo recognition algorithm to obtain fused features at different scales includes steps S701 to S704.

[0143] Here, the multi-scale temporal feature fusion layer represents a network structure unit that models the temporal relationship of feature information at different scales and performs joint fusion processing. It establishes cross-scale and cross-time contextual relationships between multiple sets of feature information with inconsistent spatial scales, thereby improving the continuity and semantic expressive ability of the fused features.

[0144] The intermediate layer represents a network structural unit or a set of multiple network structural units used for intermediate processing of feature information at different scales. This allows for structural adjustment, dimensional reconstruction, or response enhancement of the input feature information before multi-scale feature fusion, thereby improving the adaptability and expression quality of subsequent fusion processing.

[0145] The upsampling layer represents a network structure unit used to perform size restoration processing on feature information, thereby improving the spatial resolution of deep feature maps to match that of shallow feature maps. The convolution optimization layer represents a network structure unit used for efficient feature extraction of image information. It improves extraction accuracy and computational efficiency by adjusting the convolution structure or designing convolution parameters. The attention enhancement layer represents a network structure unit used to improve the expression quality of feature map channels. It combines multiple attention modeling mechanisms to adjust the channel response intensity to strengthen key feature channels and suppress redundant information, thereby improving the expression stability of the target region.
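As an illustration of the size restoration performed by an upsampling layer, the following Python sketch enlarges a small single-channel feature map by an integer factor. Nearest-neighbor interpolation is an assumption chosen for simplicity; this application does not state which interpolation the upsampling layers use, and `upsample_nearest` is an illustrative name.

```python
from typing import List

FeatureMap = List[List[float]]  # single-channel map: rows x columns


def upsample_nearest(feature_map: FeatureMap, factor: int) -> FeatureMap:
    """Repeat every value `factor` times along both spatial axes."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    upsampled: FeatureMap = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(factor)]
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


deep = [[1, 2], [3, 4]]          # low-resolution deep feature map
restored = upsample_nearest(deep, 2)
for row in restored:
    print(row)
```

After upsampling, the 2x2 deep map has the same 4x4 spatial size as a shallower map, so the two can be combined position by position.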

[0146] Step S701: Determine the feature information of the first target scale from the feature information of different scales, and perform parsing processing on the feature information of the first target scale based on the current intermediate layer to obtain the intermediate feature information output by the current intermediate layer.

[0147] Step S702: Determine the feature information of the second target scale from the feature information of different scales, and input the intermediate feature information and the feature information of the second target scale into the current multi-scale temporal feature fusion layer for comprehensive processing of combined feature splicing and feature similarity calculation to obtain the fused feature output by the current multi-scale temporal feature fusion layer.

[0148] Step S703: Input the fused features output by the current multi-scale temporal feature fusion layer into the next intermediate layer for processing to obtain the intermediate feature information output by the next intermediate layer. Input the intermediate feature information output by the next intermediate layer into the next multi-scale temporal feature fusion layer for processing to obtain the fused features output by the next multi-scale temporal feature fusion layer. Repeat this process until all fused features output by the multi-scale temporal feature fusion layers have been traversed.

[0149] Step S704: Determine the fusion features of different scales output by the neck structure from the fusion features output by all multi-scale temporal feature fusion layers.
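Steps S701 to S704 can be read as a loop that alternates intermediate layers and fusion layers. The Python sketch below shows that control flow with string stand-ins for features. How each fusion layer obtains its second input is not fixed by the steps, so here the caller supplies one partner feature per fusion layer; this choice, together with the names `run_neck`, `partner_features`, and `output_indices`, is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

Intermediate = Callable[[str], str]
Fusion = Callable[[str, str], str]


def run_neck(first_scale_feature: str,
             partner_features: Sequence[str],
             intermediate_layers: Sequence[Intermediate],
             fusion_layers: Sequence[Fusion],
             output_indices: Sequence[int]) -> List[str]:
    """Alternate intermediate layers and fusion layers, then select the neck outputs."""
    if not (len(partner_features) == len(intermediate_layers) == len(fusion_layers)):
        raise ValueError("expected one partner and one intermediate layer per fusion layer")
    fused_features: List[str] = []
    current = first_scale_feature                            # S701: first target scale
    for intermediate, fuse, partner in zip(intermediate_layers, fusion_layers, partner_features):
        intermediate_feature = intermediate(current)         # S701 / S703: intermediate layer
        current = fuse(intermediate_feature, partner)        # S702 / S703: fusion layer
        fused_features.append(current)
    return [fused_features[i] for i in output_indices]       # S704: pick the neck outputs


mids = [lambda f: f"mid1({f})", lambda f: f"mid2({f})"]
fusions = [lambda a, b: f"fuse1({a}, {b})", lambda a, b: f"fuse2({a}, {b})"]
neck_outputs = run_neck("scale_a", ["scale_b", "scale_c"], mids, fusions, [0, 1])
for feature in neck_outputs:
    print(feature)
```

The printed strings show the nesting of operations, so the order in which intermediate and fusion layers are applied can be checked at a glance.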

[0150] For example, as shown in Figure 13, the three output features of the Backbone section are used as inputs to the Neck section. Specifically, one backbone output feature is sequentially input into ACK4 (the fourth attention enhancement layer), Upsample1 (the first upsampling layer), and ACK5 (the fifth attention enhancement layer) for attention enhancement, upsampling, and attention enhancement processing, respectively. The output features of ACK5 and a backbone output feature of another scale then undergo multi-scale temporal feature fusion through Fusion1 (the first multi-scale temporal feature fusion layer) to obtain the output features of Fusion1. The output features of Fusion1 are then input into Upsample2 (the second upsampling layer) for upsampling. Finally, the output features of Upsample2 are fused with the output features of ACK5 through Fusion2 (the second multi-scale temporal feature fusion layer) to obtain the output features of Fusion2. The output features of Fusion2 are fed into Layer5 (the fifth convolutional optimization layer) for convolution processing, and the output features of Layer5 and the output features of Fusion1 are then fused through Fusion3 (the third multi-scale temporal feature fusion layer) to obtain the output features of Fusion3. The output features of Fusion3 are sequentially fed into Layer6 (the sixth convolutional optimization layer) and ACK6 (the sixth attention enhancement layer) for convolution and attention enhancement, and the output features of ACK6 and ACK4 are then fused through Fusion4 (the fourth multi-scale temporal feature fusion layer) to obtain the output features of Fusion4. Ultimately, the output features of Fusion2, Fusion3, and Fusion4 are used as inputs to the detection head.

[0151] Furthermore, as shown in Figure 16, the multi-scale temporal feature fusion layer in this cargo recognition and detection algorithm is a Transformer-based multi-scale feature fusion network structure. Specifically, two input features from different levels are concatenated by a Concat layer (i.e., a connection layer) to obtain concatenated features. A 1x1 Conv (i.e., a 1x1 convolutional layer) then reduces the channel dimension of the concatenated features, resulting in transformed features. The similarity between a pair of transformed features (i.e., the similarity between the two input features of different levels) is calculated using a multiplication operation, and this similarity is then multiplied by the original transformed features. The result is processed by a 1x1 Conv to restore the channel dimension, resulting in restored features. Finally, the restored features are fused with the original concatenated features using an addition operation to obtain the output features of the multi-scale temporal feature fusion layer.

[0152] Furthermore, the processing logic of the multi-scale temporal feature fusion layer can be referenced in equation (7):

[0153] (7)

[0154] In equation (7), the symbols denote, respectively: the concatenation operation performed by Concat; the channel dimension transformation operation performed by the 1x1 Conv, where, to enhance the feature representation capability of this multi-scale temporal feature fusion layer, the two 1x1 Convs in Figure 15 do not share parameters with each other; the multiplication operation; and the addition operation.
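The sequence of operations in the multi-scale temporal feature fusion layer (concatenate, reduce channels, compute similarity by multiplication, weight the transformed features, restore channels, add the residual) can be sketched in plain Python. Several points are assumptions: features are flattened to matrices with one row per spatial position, each 1x1 Conv is modeled as a matrix multiplication over the channel axis without bias, the similarity is taken as the product of the transformed features with their own transpose, and no normalization is applied because none is described.

```python
from typing import List

Matrix = List[List[float]]  # rows = spatial positions, columns = channels


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; a 1x1 Conv over channels reduces to this on flattened features."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def fuse(f_a: Matrix, f_b: Matrix, w_reduce: Matrix, w_restore: Matrix) -> Matrix:
    # Concat: join the two inputs along the channel axis.
    concatenated = [row_a + row_b for row_a, row_b in zip(f_a, f_b)]
    # First 1x1 Conv: reduce the channel dimension.
    transformed = matmul(concatenated, w_reduce)
    # Similarity by multiplication (position-by-position affinity, assumed form).
    similarity = matmul(transformed, transpose(transformed))
    # Multiply the similarity with the transformed features.
    weighted = matmul(similarity, transformed)
    # Second 1x1 Conv (separate weights): restore the channel dimension.
    restored = matmul(weighted, w_restore)
    # Addition with the original concatenated features.
    return [[c + r for c, r in zip(row_c, row_r)]
            for row_c, row_r in zip(concatenated, restored)]


f_a = [[1.0], [2.0]]        # two positions, one channel
f_b = [[3.0], [4.0]]        # two positions, one channel
w_reduce = [[1.0], [0.0]]   # 2 channels -> 1 channel
w_restore = [[1.0, 0.0]]    # 1 channel -> 2 channels
fused = fuse(f_a, f_b, w_reduce, w_restore)
print(fused)
```

Setting `w_restore` to all zeros returns the concatenated features unchanged, which shows the role of the final addition as a residual path.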

[0155] In this embodiment, firstly, a first target scale is selected from feature information at different scales and input into an intermediate layer for parsing to obtain intermediate feature information, thereby achieving preprocessing and structural optimization of the original feature information. Secondly, a second target scale is selected from feature information at different scales, and the intermediate feature information and the feature information at the second target scale are processed by feature splicing and similarity analysis according to the multi-scale temporal feature fusion layer to obtain fused features, thereby improving the temporal consistency expression capability between cross-scale features. Furthermore, according to the iterative processing logic of the multi-scale temporal feature fusion layers and intermediate layers at each level, the progressive enhancement and temporal depth expansion of multi-layer fused features are realized. Furthermore, based on the fused features output by all multi-scale temporal feature fusion layers, the coordinated output and unified expression of multi-scale information are ensured. Based on this, the structural consistency enhancement and temporal expression optimization of feature information at different scales are achieved, thereby improving the integrity and robustness of image feature fusion processing.

[0156] In one exemplary embodiment, as shown in Figure 17, determining the cargo identification algorithm that matches the cargo information based on the entity feature category in the cargo information, and identifying the cargo feature information corresponding to the cargo information based on the cargo identification algorithm, includes steps S801 to S803.

[0157] Step S801: If the entity feature category in the cargo information is an identifier feature category, then the preset identifier recognition algorithm is used as the cargo recognition algorithm for matching cargo information.

[0158] Here, the identification feature category represents the category of data in the goods information that uniquely identifies the goods, and it is used to quickly locate and confirm the identity of the goods. For example, visually encoded data such as barcodes and QR codes presented in the form of images, or radio frequency data presented in the form of a radio frequency response, belong to the identification feature category.

[0159] The identifier recognition algorithm refers to the logical processing flow designed for the identifier feature category to identify, extract, and parse identifier information. It transforms visually encoded data or radio frequency data into structured identifier fields with business significance.

[0160] For example, firstly, if the cargo information contains coded structures, identification tags, or other uniquely identifying data, the entity feature category in the cargo information is classified as an identification feature category. Secondly, after determining the identification feature category, to ensure consistency between subsequent processing methods and data types, a processing logic designed for identification data is selected as the cargo identification algorithm to parse various types of identification information, including but not limited to graphic codes in visual form or radio frequency data in electronic form. Furthermore, this matching process is not a simple static rule search, but rather a joint judgment combining preliminary analysis of the data structure and device configuration status to ensure that the identification recognition algorithm has sufficient adaptability and processing coverage.

[0161] Step S802: Based on the identification recognition module configured in the UAV equipment that matches the identification feature category, the identification information in the cargo information is identified. When the identification recognition module represents a visual coding recognition module, the identification information represents either a barcode or a QR code. When the identification recognition module represents a radio frequency identification module, the identification information represents a radio frequency identification.

[0162] The identification module represents a set of functional components configured in the drone equipment to acquire and identify the identification information carried by the cargo; the visual coding identification module represents a collection component for collecting visual coding data attached to the surface of the cargo, such as a camera or image sensing device; and the radio frequency identification module represents a wireless communication component for receiving radio frequency signals emitted by the cargo and extracting the radio frequency data therefrom, such as a radio frequency receiving device.

[0163] The identification information refers to the coded data content extracted by the identification module to represent the identity of the goods; barcode and QR code identification represent coded data content that expresses the identity of the goods in a visual coding form, and RFID identification represents coded data content that expresses the identity of the goods in a radio frequency response form.

[0164] For example, based on the category of identification data in the cargo information, the identification recognition module configured in the drone equipment is adaptively invoked to acquire the identification information carried by the cargo. Specifically, an appropriate acquisition channel is automatically selected according to how the identification features are presented, so that the identification information can be extracted without human intervention. On the one hand, when the cargo information carries identification content in the form of visual encoding, the visual encoding recognition module scans the cargo surface or the collected cargo information to extract the corresponding barcode or QR code. On the other hand, when the cargo information carries identification content in the form of a radio frequency response, the radio frequency identification module receives radio frequency data from the RFID electronic tag on the cargo surface, or from the collected cargo information, to obtain the corresponding RFID identification.

[0165] Step S803: Based on the cargo identification algorithm, the identification information is parsed to obtain the cargo feature information corresponding to the cargo information.

[0166] For example, for visually encoded identification information, the information is input into image decoding logic. By locating the encoded region in the image, restoring the pattern structure, and performing format correction, the content carried in the barcode or QR code is restored. For radio frequency (RF) response identification information, the received RF data is processed by RF decoding logic to perform protocol identification, displacement verification, and field extraction. The electronically encoded value stored in the RF tag is extracted and the fields are decomposed according to a preset protocol structure. Based on this, the decoded data content is further mapped to fields according to a unified format template, thereby obtaining standardized and comparable cargo feature information. Thus, the structure of the original identification content is restored through image decoding logic or RF decoding logic, transforming the encoded information in the physical carrier into logical fields that the system can recognize. This achieves a key processing step in the transition from the perception layer to the data layer.
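The final mapping step, in which decoded content is mapped to fields according to a unified format template, can be sketched as follows. The payload layout (`key=value` pairs separated by semicolons), the short keys, and the name `map_fields` are all hypothetical; this application does not specify the encoding of the barcode, QR code, or RFID payload.

```python
from typing import Dict

# Hypothetical template: payload key -> standardized cargo feature field.
FIELD_TEMPLATE = {
    "model": "product_model",
    "qty": "storage_quantity",
    "loc": "storage_location",
}


def map_fields(decoded: str) -> Dict[str, object]:
    """Map a decoded payload such as 'model=AB-12;qty=40;loc=A-03-2' to standardized fields."""
    record: Dict[str, object] = {}
    for pair in decoded.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing separator
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"malformed field: {pair!r}")
        field = FIELD_TEMPLATE.get(key.strip())
        if field is not None:  # keys outside the template are ignored
            record[field] = value.strip()
    missing = sorted(set(FIELD_TEMPLATE.values()) - set(record))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    record["storage_quantity"] = int(record["storage_quantity"])
    return record


cargo_feature = map_fields("model=AB-12;qty=40;loc=A-03-2")
print(cargo_feature)
```

Because both decoding paths end in the same template, a record obtained from a QR code and one obtained from an RFID tag have the same fields and can be compared directly.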

[0167] In this embodiment, firstly, if the entity feature category in the cargo information is an identification feature category, an identification recognition algorithm is matched to ensure that the information recognition path is compatible with the data type. Secondly, the visual encoding recognition module or radio frequency identification module is called according to the identification feature category to identify the identification information, thereby achieving targeted collection of multiple identification forms and enhancing the adaptability to diverse cargo identifications. Thirdly, the identification information is parsed according to image decoding logic or radio frequency decoding logic to achieve accurate conversion of encoded data into cargo feature information. Based on this, a stable identification process for different identification features is constructed, realizing efficient identification and standardized parsing of cargo identification information.

[0168] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0169] Based on the same inventive concept, this application also provides an entity object traversal recognition and detection device for implementing the aforementioned entity object traversal recognition and detection method for UAV devices. The solution provided by this device is similar to the implementation described in the above method. Therefore, the specific limitations in one or more embodiments of the entity object traversal recognition and detection device for UAV devices provided below can be found in the limitations of the entity object traversal recognition and detection method for UAV devices described above, and will not be repeated here.

[0170] In one exemplary embodiment, as shown in Figure 18, an entity object traversal recognition and detection device based on a drone is provided, including: an acquisition module 101, a recognition module 102, and a comparison module 103, wherein:

[0171] The acquisition module 101 is used to acquire cargo information collected by the drone equipment in the cargo storage area. The cargo information represents cargo storage-related information collected by the drone equipment in sequence at each storage location in the cargo storage area according to the preset flight trajectory.

[0172] The identification module 102 is used to determine the cargo identification algorithm that matches the cargo information based on the entity feature category in the cargo information, and to identify the cargo feature information corresponding to the cargo information based on the cargo identification algorithm. The cargo feature information includes the product model, storage quantity and storage location of the corresponding cargo.

[0173] The comparison module 103 is used to compare the cargo feature information with the inventory information in the preset database to obtain the inventory analysis results of the cargo storage area.
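The comparison performed by the comparison module can be illustrated with a minimal Python sketch. The record type, the use of the storage location as the lookup key, and the wording of the findings are assumptions made for illustration; this application does not prescribe the database schema or the form of the inventory analysis result.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CargoFeature:
    product_model: str
    storage_quantity: int
    storage_location: str


def compare_with_inventory(observed: List[CargoFeature],
                           inventory: Dict[str, CargoFeature]) -> List[str]:
    """Compare recognized cargo features with inventory records keyed by storage location."""
    findings: List[str] = []
    seen_locations = set()
    for item in observed:
        seen_locations.add(item.storage_location)
        expected = inventory.get(item.storage_location)
        if expected is None:
            findings.append(f"{item.storage_location}: unrecorded cargo {item.product_model}")
        elif expected.product_model != item.product_model:
            findings.append(f"{item.storage_location}: model mismatch "
                            f"(expected {expected.product_model}, observed {item.product_model})")
        elif expected.storage_quantity != item.storage_quantity:
            findings.append(f"{item.storage_location}: quantity mismatch for {item.product_model} "
                            f"(expected {expected.storage_quantity}, observed {item.storage_quantity})")
    # Locations present in the database but never observed during the flight.
    for location in sorted(set(inventory) - seen_locations):
        findings.append(f"{location}: no cargo observed")
    return findings


observed = [CargoFeature("AB-12", 40, "A-01"),
            CargoFeature("CD-34", 8, "A-02"),
            CargoFeature("ZZ-99", 1, "A-04")]
inventory = {"A-01": CargoFeature("AB-12", 40, "A-01"),
             "A-02": CargoFeature("CD-34", 10, "A-02"),
             "A-03": CargoFeature("EF-56", 5, "A-03")}
findings = compare_with_inventory(observed, inventory)
for line in findings:
    print(line)
```

An empty list of findings would indicate that every observed storage location matches its inventory record.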

[0174] In an exemplary embodiment, the identification module 102 includes a first identification unit, which is configured to: if the entity feature category in the cargo information is a text feature category, use a preset text detection and recognition algorithm based on relation modeling as the cargo identification algorithm for matching the cargo information; perform multi-scale feature extraction and fusion processing on the cargo information based on the text detection network in the cargo identification algorithm to obtain the text detection result corresponding to the cargo information; parse the text detection result based on the text recognition network in the cargo identification algorithm to obtain the text recognition result corresponding to the text detection result, and use the text recognition result as the cargo feature information corresponding to the cargo information.

[0175] In an exemplary embodiment, the first recognition unit is further configured to: perform multi-scale feature extraction processing on the cargo information based on the convolutional layer to obtain feature maps of different scales; combine the linear interpolation and deconvolution processing logic corresponding to the multi-path sampling layer and the global information modeling and splicing processing logic corresponding to the feature fusion layer to jointly perform upsampling and fusion processing on the feature maps of different scales to obtain fused feature maps; and perform detection analysis on the fused feature maps based on the detection head structure to obtain text region detection information based on the cargo information, and use the text region detection information as the text detection result corresponding to the cargo information.

[0176] In an exemplary embodiment, the first recognition unit is further configured to: determine a first comprehensive processing logic combining feature extraction and feature modeling by combining the convolution processing logic corresponding to the convolutional layer and the global relation modeling processing logic corresponding to the temporal modeling layer; determine a second comprehensive processing logic combining feature restoration and feature modeling by combining the deconvolution processing logic corresponding to the deconvolutional layer and the global relation modeling processing logic corresponding to the temporal modeling layer; perform comprehensive processing combining feature extraction, modeling, and restoration on the text detection result by combining the first comprehensive processing logic and the second comprehensive processing logic to obtain output features; and perform detection analysis on the output features based on the detection head structure to obtain the text recognition result corresponding to the text detection result.

[0177] In an exemplary embodiment, the recognition module 102 includes a second recognition unit, which is configured to: if the entity feature category in the cargo information is an image feature category, use a preset algorithm based on hybrid attention and temporal fusion enhancement as the cargo recognition algorithm for cargo information matching; perform feature extraction processing on the cargo information based on the backbone structure in the cargo recognition algorithm to obtain feature information at different scales; perform fusion processing on the feature information at different scales based on the neck structure in the cargo recognition algorithm to obtain fused features at different scales; and perform detection and analysis on the fused features at different scales based on the detection head structure in the cargo recognition algorithm to obtain the image recognition result corresponding to the cargo information, and use the image recognition result as the cargo feature information corresponding to the cargo information.

[0178] In an exemplary embodiment, the second recognition unit is further configured to: perform optimized feature extraction processing on the cargo information based on the current convolutional optimization layer to obtain the feature information output by the current convolutional optimization layer; perform comprehensive processing on the feature information output by the current convolutional optimization layer based on the channel attention modeling module, self-attention modeling module, and C3K modeling module in the current attention enhancement layer, combining data connection, channel dimension transformation, and channel dimension segmentation to obtain the enhanced feature information output by the current attention enhancement layer; input the enhanced feature information output by the current attention enhancement layer into the next convolutional optimization layer for processing to obtain the feature information output by the next convolutional optimization layer; input the feature information output by the next convolutional optimization layer into the next attention enhancement layer for processing to obtain the enhanced feature information output by the next attention enhancement layer, until all feature information output by all convolutional optimization layers is traversed; and use the feature information output by different convolutional optimization layers as feature information of different scales output by the backbone structure.

[0179] In an exemplary embodiment, the second identification unit is further configured to: determine feature information of a first target scale from feature information of different scales; parse and process the feature information of the first target scale based on the current intermediate layer to obtain intermediate feature information output by the current intermediate layer; determine feature information of a second target scale from feature information of different scales; input the intermediate feature information and the feature information of the second target scale into the current multi-scale temporal feature fusion layer for comprehensive processing of combined feature splicing and feature similarity calculation to obtain the fused feature output by the current multi-scale temporal feature fusion layer; input the fused feature output by the current multi-scale temporal feature fusion layer into the next intermediate layer for processing to obtain intermediate feature information output by the next intermediate layer; input the intermediate feature information output by the next intermediate layer into the next multi-scale temporal feature fusion layer for processing to obtain the fused feature output by the next multi-scale temporal feature fusion layer, until all fused features output by all multi-scale temporal feature fusion layers are traversed; and determine the fused features of different scales output by the neck structure from the fused features output by all multi-scale temporal feature fusion layers.

[0180] In an exemplary embodiment, the identification module 102 includes a third identification unit, which is configured to: if the entity feature category in the cargo information is an identifier feature category, use a preset identifier recognition algorithm as the cargo identification algorithm for matching the cargo information; identify the identification information in the cargo information based on the identifier recognition module that is configured in the UAV device and matches the identifier feature category, wherein when the identifier recognition module is a visual coding recognition module, the identification information is either a barcode identification or a QR code identification, and when the identifier recognition module is a radio frequency identification module, the identification information is a radio frequency identification; and parse the identification information based on the cargo identification algorithm to obtain the cargo feature information corresponding to the cargo information.
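The division of work among the first, second, and third identification units amounts to a dispatch on the entity feature category. The Python sketch below shows one way to express that selection; the category keys and the function name `select_algorithm` are illustrative assumptions, and the returned strings merely name the algorithm families described in this application.

```python
from typing import Dict

# Category keys are hypothetical labels for the three entity feature categories.
ALGORITHM_BY_CATEGORY: Dict[str, str] = {
    "text": "text detection and recognition algorithm based on relation modeling",
    "image": "algorithm based on hybrid attention and temporal fusion enhancement",
    "identifier": "identifier recognition algorithm",
}


def select_algorithm(entity_feature_category: str) -> str:
    """Return the cargo identification algorithm that matches the entity feature category."""
    try:
        return ALGORITHM_BY_CATEGORY[entity_feature_category]
    except KeyError:
        raise ValueError(f"unsupported entity feature category: {entity_feature_category!r}") from None


for category in ("text", "image", "identifier"):
    print(category, "->", select_algorithm(category))
```

Keeping the mapping in a single table makes it straightforward to see that each category is handled by exactly one algorithm.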

[0181] The various modules in the aforementioned entity object traversal recognition and detection device based on UAV equipment can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in the processor of a computer device in hardware form or independent of it, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0182] In one exemplary embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of any of the above embodiments.

[0183] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the above embodiments.

[0184] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0185] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0186] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A method for entity object traversal recognition and detection based on unmanned aerial vehicle (UAV) equipment, characterized in that the method comprises:
obtaining cargo information collected by the UAV equipment in a cargo storage area, wherein the cargo information represents cargo-storage-related information obtained by the UAV equipment collecting at each storage location in the cargo storage area in sequence according to a preset flight trajectory;
determining, based on the entity feature category in the cargo information, a cargo recognition algorithm matching the cargo information, and identifying, based on the cargo recognition algorithm, cargo feature information corresponding to the cargo information, wherein the cargo feature information includes the product model, storage quantity, and storage location of the corresponding cargo; and
comparing the cargo feature information with inventory information in a preset database to obtain an inventory analysis result for the cargo storage area;
wherein determining the cargo recognition algorithm matching the cargo information based on the entity feature category in the cargo information comprises: if the entity feature category in the cargo information is a text feature category, using a preset text detection and recognition algorithm based on relation modeling as the cargo recognition algorithm matching the cargo information; if the entity feature category in the cargo information is an image feature category, using a preset algorithm based on hybrid attention and temporal fusion enhancement as the cargo recognition algorithm matching the cargo information; and if the entity feature category in the cargo information is an identifier feature category, using a preset identifier recognition algorithm as the cargo recognition algorithm matching the cargo information;
wherein, when the entity feature category in the cargo information is the image feature category, identifying the cargo feature information corresponding to the cargo information based on the cargo recognition algorithm comprises: performing feature extraction on the cargo information based on the backbone structure of the cargo recognition algorithm to obtain feature information at different scales; fusing the feature information at different scales based on the neck structure of the cargo recognition algorithm to obtain fused features at different scales; and detecting and analyzing the fused features at different scales based on the detection head structure of the cargo recognition algorithm to obtain an image recognition result corresponding to the cargo information, and using the image recognition result as the cargo feature information corresponding to the cargo information;
wherein the backbone structure includes a convolutional optimization layer and an attention enhancement layer, and performing feature extraction on the cargo information based on the backbone structure of the cargo recognition algorithm to obtain feature information at different scales comprises:
performing optimized feature extraction on the cargo information based on the current convolutional optimization layer to obtain the feature information output by the current convolutional optimization layer;
comprehensively processing the feature information output by the current convolutional optimization layer, based on the channel attention modeling module, the self-attention modeling module, and the C3K modeling module in the current attention enhancement layer and by combining channel dimension transformation and channel dimension segmentation, to obtain the enhanced feature information output by the current attention enhancement layer;
inputting the enhanced feature information output by the current attention enhancement layer into the next convolutional optimization layer for processing to obtain the feature information output by the next convolutional optimization layer, and inputting the feature information output by the next convolutional optimization layer into the next attention enhancement layer for processing to obtain the enhanced feature information output by the next attention enhancement layer, until the feature information output by all convolutional optimization layers has been traversed; and
using the feature information output by the different convolutional optimization layers as the feature information at different scales output by the backbone structure.
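
By way of illustration and not limitation, the category-based selection and the inventory comparison recited in claim 1 can be sketched as follows. The names `CargoFeature`, `ALGORITHMS`, `select_algorithm`, and `analyze_inventory`, the dictionary-shaped samples, and the keying of the inventory by storage location are all assumptions made for this sketch; the three recognizer functions are placeholders that read pre-labelled values and do not implement the preset algorithms themselves.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CargoFeature:
    """Cargo feature information: product model, storage quantity, storage location."""
    product_model: str
    storage_quantity: int
    storage_location: str


def _from_payload(payload: dict) -> CargoFeature:
    # Placeholder: a real recognizer would derive these values from sensor data.
    return CargoFeature(payload["model"], payload["qty"], payload["slot"])


# Hypothetical stand-ins for the three preset algorithms named in the claim.
def recognize_text(payload: dict) -> CargoFeature:
    return _from_payload(payload)


def recognize_image(payload: dict) -> CargoFeature:
    return _from_payload(payload)


def recognize_identifier(payload: dict) -> CargoFeature:
    return _from_payload(payload)


ALGORITHMS: Dict[str, Callable[[dict], CargoFeature]] = {
    "text": recognize_text,
    "image": recognize_image,
    "identifier": recognize_identifier,
}


def select_algorithm(category: str) -> Callable[[dict], CargoFeature]:
    """Pick the recognizer that matches the entity feature category."""
    if category not in ALGORITHMS:
        raise ValueError(f"unsupported entity feature category: {category!r}")
    return ALGORITHMS[category]


Record = Tuple[str, int]  # (product model, storage quantity)


def analyze_inventory(
    samples: List[dict], inventory: Dict[str, Record]
) -> List[Tuple[str, Optional[Record], Record]]:
    """Return (location, expected, observed) for every mismatch with the database."""
    discrepancies = []
    for sample in samples:
        feature = select_algorithm(sample["category"])(sample["payload"])
        expected = inventory.get(feature.storage_location)
        observed = (feature.product_model, feature.storage_quantity)
        if expected != observed:
            discrepancies.append((feature.storage_location, expected, observed))
    return discrepancies
```

In this sketch a storage location missing from the database is reported with an expected value of `None`, which is one possible way to surface unregistered stock.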

2. The method according to claim 1, characterized in that, when the entity feature category in the cargo information is the text feature category, identifying the cargo feature information corresponding to the cargo information based on the cargo recognition algorithm comprises:
performing multi-scale feature extraction and fusion on the cargo information based on the text detection network in the cargo recognition algorithm to obtain a text detection result corresponding to the cargo information; and
parsing the text detection result based on the text recognition network in the cargo recognition algorithm to obtain a text recognition result corresponding to the text detection result, and using the text recognition result as the cargo feature information corresponding to the cargo information.

3. The method according to claim 2, characterized in that the text detection network includes convolutional layers, multi-path sampling layers, feature fusion layers, and a detection head structure, and performing multi-scale feature extraction and fusion on the cargo information based on the text detection network in the cargo recognition algorithm to obtain the text detection result corresponding to the cargo information comprises:
performing multi-scale feature extraction on the cargo information based on the convolutional layers to obtain feature maps at different scales;
jointly upsampling and fusing the feature maps at different scales, by combining the linear interpolation and deconvolution processing logic of the multi-path sampling layers with the global information modeling and splicing processing logic of the feature fusion layers, to obtain a fused feature map; and
analyzing the fused feature map based on the detection head structure to obtain text region detection information for the cargo information, and using the text region detection information as the text detection result corresponding to the cargo information.
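
By way of illustration and not limitation, the two upsampling paths of claim 3 (linear interpolation and deconvolution) followed by splicing can be sketched on one-dimensional feature maps. The function names, the fixed 2x factor, the edge handling, and the kernel values are assumptions of this sketch; the global information modeling step of the feature fusion layers is omitted, and a real network would operate on learned two-dimensional feature maps.

```python
from typing import List, Sequence


def upsample_linear(x: Sequence[float]) -> List[float]:
    """Double the length by linear interpolation (the last sample is replicated)."""
    out: List[float] = []
    for i, v in enumerate(x):
        nxt = x[min(i + 1, len(x) - 1)]
        out.extend([v, (v + nxt) / 2])
    return out


def upsample_deconv(x: Sequence[float], kernel: Sequence[float], stride: int = 2) -> List[float]:
    """1-D transposed convolution: each input sample scatters a scaled kernel copy."""
    if not x:
        return []
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += v * w
    return out


def fuse(fine: Sequence[float], coarse: Sequence[float], kernel: Sequence[float]) -> List[List[float]]:
    """Upsample the coarse map along both paths, then splice all maps as channels."""
    path_a = upsample_linear(coarse)
    path_b = upsample_deconv(coarse, kernel)
    if not (len(path_a) == len(path_b) == len(fine)):
        raise ValueError("upsampled maps must match the fine map length")
    return [list(fine), path_a, path_b]
```

With a stride of 2 and a kernel of length 2, the transposed convolution produces exactly twice as many samples as its input, so both paths line up with a fine map that is twice as long as the coarse one.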

4. The method according to claim 2, characterized in that the text recognition network includes convolutional layers, deconvolutional layers, temporal modeling layers, and a detection head structure, and parsing the text detection result based on the text recognition network in the cargo recognition algorithm to obtain the text recognition result corresponding to the text detection result comprises:
determining a first comprehensive processing logic, which combines feature extraction and feature modeling, from the convolution processing logic of the convolutional layers and the global relation modeling processing logic of the temporal modeling layers;
determining a second comprehensive processing logic, which combines feature restoration and feature modeling, from the deconvolution processing logic of the deconvolutional layers and the global relation modeling processing logic of the temporal modeling layers;
applying the first comprehensive processing logic and the second comprehensive processing logic together to the text detection result, so that the text detection result undergoes comprehensive processing including feature extraction, modeling, and reconstruction, to obtain output features; and
detecting and analyzing the output features based on the detection head structure to obtain the text recognition result corresponding to the text detection result.

5. The method according to claim 1, characterized in that the neck structure includes a multi-scale temporal feature fusion layer and an intermediate layer, wherein the intermediate layer includes at least one upsampling layer, a convolutional optimization layer, and an attention enhancement layer, and fusing the feature information at different scales based on the neck structure of the cargo recognition algorithm to obtain fused features at different scales comprises:
determining feature information of a first target scale from the feature information at different scales, and parsing the feature information of the first target scale based on the current intermediate layer to obtain the intermediate feature information output by the current intermediate layer;
determining feature information of a second target scale from the feature information at different scales, and inputting the intermediate feature information and the feature information of the second target scale into the current multi-scale temporal feature fusion layer for comprehensive processing that combines feature splicing and feature similarity calculation, to obtain the fused feature output by the current multi-scale temporal feature fusion layer;
inputting the fused feature output by the current multi-scale temporal feature fusion layer into the next intermediate layer for processing to obtain the intermediate feature information output by the next intermediate layer, and inputting the intermediate feature information output by the next intermediate layer into the next multi-scale temporal feature fusion layer for processing to obtain the fused feature output by the next multi-scale temporal feature fusion layer, until the fused features output by all multi-scale temporal feature fusion layers have been traversed; and
determining the fused features at different scales output by the neck structure from the fused features output by all multi-scale temporal feature fusion layers.
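
By way of illustration and not limitation, the alternation of intermediate layers and fusion layers in claim 5 can be sketched as a loop over scales ordered from coarse to fine. Everything concrete here is an assumption of the sketch: features are lists of channels, the intermediate layer is reduced to nearest-neighbour 2x upsampling, and the fusion layer splices channels after weighting the second input by a cosine similarity. The claim states only that splicing and similarity calculation are combined, not how, and the temporal aspect of the fusion layer is not modeled.

```python
import math
from typing import List

Feature = List[List[float]]  # channels x positions


def intermediate_layer(feat: Feature) -> Feature:
    """Hypothetical stand-in: nearest-neighbour 2x upsampling of every channel."""
    return [[v for v in channel for _ in range(2)] for channel in feat]


def mean_profile(feat: Feature) -> List[float]:
    """Average across channels, giving one value per position."""
    return [sum(column) / len(column) for column in zip(*feat)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fusion_layer(inter: Feature, lateral: Feature) -> Feature:
    """Splice channels; the lateral input is scaled by its similarity to the other."""
    s = cosine(mean_profile(inter), mean_profile(lateral))
    return inter + [[s * v for v in channel] for channel in lateral]


def neck(scales: List[Feature]) -> List[Feature]:
    """Return one fused feature per fusion layer; scales run from coarse to fine."""
    fused_outputs: List[Feature] = []
    current = scales[0]
    for lateral in scales[1:]:
        current = fusion_layer(intermediate_layer(current), lateral)
        fused_outputs.append(current)
    return fused_outputs
```

Each pass adds the lateral channels to the running feature and doubles its length, so the outputs collected in `fused_outputs` sit at successively finer scales.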

6. The method according to claim 1, characterized in that, when the entity feature category in the cargo information is the identifier feature category, identifying the cargo feature information corresponding to the cargo information based on the cargo recognition algorithm comprises:
identifying the identifier information in the cargo information based on an identifier recognition module that is configured in the UAV equipment and matches the identifier feature category, wherein, when the identifier recognition module is a visual code recognition module, the identifier information is either a barcode or a QR code, and, when the identifier recognition module is a radio frequency identification module, the identifier information is a radio frequency identifier; and
parsing the identifier information based on the cargo recognition algorithm to obtain the cargo feature information corresponding to the cargo information.
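
By way of illustration and not limitation, the pairing of recognition modules with identifier types in claim 6 can be sketched as a lookup followed by parsing. The `Module` enumeration, the kind labels, and the `MODEL|QTY|SLOT` payload layout are hypothetical; the claims do not specify how cargo feature information is encoded in a barcode, QR code, or radio frequency tag.

```python
from enum import Enum
from typing import Dict, Set


class Module(Enum):
    """Identifier recognition modules that may be fitted to the UAV equipment."""
    VISUAL_CODE = "visual_code"
    RFID = "rfid"


# Which identifier kinds each module is able to read, following claim 6.
ALLOWED_KINDS: Dict[Module, Set[str]] = {
    Module.VISUAL_CODE: {"barcode", "qr_code"},
    Module.RFID: {"rfid_tag"},
}


def parse_identifier(module: Module, kind: str, payload: str) -> dict:
    """Validate the module/kind pairing, then parse a 'MODEL|QTY|SLOT' payload."""
    if kind not in ALLOWED_KINDS[module]:
        raise ValueError(f"{module.value} module cannot read {kind!r}")
    model, quantity, slot = payload.split("|")  # assumed layout, not from the claims
    return {
        "product_model": model,
        "storage_quantity": int(quantity),
        "storage_location": slot,
    }
```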

7. A device for entity object traversal recognition and detection based on unmanned aerial vehicle (UAV) equipment, characterized in that the device comprises:
an acquisition module, used to obtain cargo information collected by the UAV equipment in a cargo storage area, wherein the cargo information represents cargo-storage-related information obtained by the UAV equipment collecting at each storage location in the cargo storage area in sequence according to a preset flight trajectory;
a recognition module, used to determine, based on the entity feature category in the cargo information, a cargo recognition algorithm matching the cargo information, and to identify, based on the cargo recognition algorithm, cargo feature information corresponding to the cargo information, wherein the cargo feature information includes the product model, storage quantity, and storage location of the corresponding cargo; and
a comparison module, used to compare the cargo feature information with inventory information in a preset database to obtain an inventory analysis result for the cargo storage area;
wherein the recognition module includes a first recognition unit, a second recognition unit, and a third recognition unit; the first recognition unit is used to use a preset text detection and recognition algorithm based on relation modeling as the cargo recognition algorithm matching the cargo information when the entity feature category in the cargo information is a text feature category; the second recognition unit is used to use a preset algorithm based on hybrid attention and temporal fusion enhancement as the cargo recognition algorithm matching the cargo information when the entity feature category in the cargo information is an image feature category; and the third recognition unit is used to use a preset identifier recognition algorithm as the cargo recognition algorithm matching the cargo information when the entity feature category in the cargo information is an identifier feature category;
wherein the second recognition unit is further configured to: perform feature extraction on the cargo information based on the backbone structure of the cargo recognition algorithm to obtain feature information at different scales; fuse the feature information at different scales based on the neck structure of the cargo recognition algorithm to obtain fused features at different scales; and detect and analyze the fused features at different scales based on the detection head structure of the cargo recognition algorithm to obtain an image recognition result corresponding to the cargo information, and use the image recognition result as the cargo feature information corresponding to the cargo information;
wherein the backbone structure includes a convolutional optimization layer and an attention enhancement layer, and the second recognition unit is further configured to: perform optimized feature extraction on the cargo information based on the current convolutional optimization layer to obtain the feature information output by the current convolutional optimization layer; comprehensively process the feature information output by the current convolutional optimization layer, based on the channel attention modeling module, the self-attention modeling module, and the C3K modeling module in the current attention enhancement layer and by combining channel dimension transformation and channel dimension segmentation, to obtain the enhanced feature information output by the current attention enhancement layer; input the enhanced feature information output by the current attention enhancement layer into the next convolutional optimization layer for processing to obtain the feature information output by the next convolutional optimization layer; input the feature information output by the next convolutional optimization layer into the next attention enhancement layer for processing to obtain the enhanced feature information output by the next attention enhancement layer, until the feature information output by all convolutional optimization layers has been traversed; and use the feature information output by the different convolutional optimization layers as the feature information at different scales output by the backbone structure.
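
By way of illustration and not limitation, the backbone traversal recited in claims 1 and 7 can be sketched as a loop that alternates the two layer types and keeps the output of every convolutional optimization layer. The function `run_backbone` and the two toy layers are assumptions of this sketch: `downsample` stands in for a convolutional optimization layer and `gate` for an attention enhancement layer, and neither reproduces the channel attention, self-attention, or C3K modules.

```python
import math
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def run_backbone(
    x: List[float], conv_layers: Sequence[Layer], attn_layers: Sequence[Layer]
) -> List[List[float]]:
    """Alternate conv and attention layers; collect every conv-layer output."""
    if len(conv_layers) != len(attn_layers):
        raise ValueError("each convolutional layer needs a paired attention layer")
    multi_scale: List[List[float]] = []
    current = x
    for conv, attn in zip(conv_layers, attn_layers):
        conv_out = conv(current)
        multi_scale.append(conv_out)  # conv outputs are the multi-scale features
        current = attn(conv_out)      # enhanced features feed the next conv layer
    return multi_scale


def downsample(x: List[float]) -> List[float]:
    """Toy convolutional optimization layer: stride-2 mean of adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def gate(x: List[float]) -> List[float]:
    """Toy attention enhancement layer: rescale by a sigmoid of the global mean."""
    g = 1 / (1 + math.exp(-sum(x) / len(x)))
    return [g * v for v in x]
```

Calling `run_backbone(signal, [downsample] * 3, [gate] * 3)` on a signal of length 8 returns three feature lists of lengths 4, 2, and 1, one per scale.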

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 6 are implemented.