Target object recognition method, device, equipment, storage medium and program product

By dividing the image to be identified into multiple regions and using deep neural networks for feature extraction and object recognition, the problems of insufficient feature extraction and high latency in multi-device recognition in complex scenes are solved, achieving efficient and accurate object recognition.

CN122244458A · Pending · Publication Date: 2026-06-19 · GUANGZHOU POWER SUPPLY BUREAU GUANGDONG POWER GRID CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU POWER SUPPLY BUREAU GUANGDONG POWER GRID CO LTD
Filing Date
2026-02-25
Publication Date
2026-06-19

Abstract

This application relates to a method, apparatus, device, storage medium, and program product for identifying target objects. The method includes: acquiring an image to be identified; the image to be identified includes at least one target object in a target region; dividing the image to be identified into at least one image region based on the location of each target object in the image; each image region includes at least one target object; for any image region, performing feature extraction on the image region to obtain target feature extraction results corresponding to each target object in the image region; and determining the object identification result of each target object in the image region based on the target feature extraction results corresponding to the image region. This method can improve the accuracy of object identification.

Description

Technical Field

[0001] This application relates to the field of target recognition technology, and in particular to a method, apparatus, device, storage medium, and program product for identifying target objects.

Background Technology

[0002] In complex scenarios such as industrial sites and smart cities, the need to identify multiple devices simultaneously is becoming increasingly urgent. When device types are diverse, appearance features are similar, or background interference is severe (such as changes in lighting or occlusion), identification accuracy drops significantly.

[0003] Although some studies have introduced deep learning technology, problems remain, such as insufficient feature extraction, high model inference latency, and difficulty achieving real-time system response. These problems make efficient deployment in real-world scenarios difficult and urgently need to be solved.

Summary of the Invention

[0004] Therefore, it is necessary to provide a method, apparatus, device, storage medium, and program product for identifying target objects that can improve the accuracy of object recognition, in order to address the aforementioned technical problems.

[0005] In a first aspect, this application provides a method for identifying a target object, including:

[0006] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0007] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0008] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0009] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0010] In one embodiment, before dividing the image to be identified into at least one image region based on the location of each target object in the image to be identified, the method further includes:

[0011] The image to be identified is input into the pre-sensing network of the target object recognition model to obtain the position prediction result of each target object;

[0012] Based on each position prediction result, the region where each target object is located in the image to be identified is determined.

[0013] In one embodiment, the target object recognition model further includes a feature extraction network and a feature encoding network corresponding to each image region;

[0014] Accordingly, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region, including:

[0015] The image region is input into the feature extraction network to obtain the initial feature extraction results;

[0016] The initial feature extraction results are input into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0017] In one embodiment, the feature extraction network includes a first perceptual unit and a second perceptual unit;

[0018] The first perceptual unit is used to extract the first dimension features; the second perceptual unit is used to extract the second dimension features.

[0019] In one embodiment, based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined, including:

[0020] The target feature extraction results are input into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region; the target detection results include at least one of the target object's location information, device shape information, and device category information;

[0021] Based on the target detection results of each target object, the object recognition results of each target object in the image region are determined.

[0022] In one embodiment, the target detection network includes a decoding network, a location regression network, and an edge alignment network; the target feature extraction results are input into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region, including:

[0023] The target feature extraction results are input into the decoding network to obtain the initial bounding boxes of each target object in the image region;

[0024] Each initial bounding box is input into the location regression network, and the boundaries of each initial bounding box are optimized to obtain the target bounding boxes of each target object in the image region.

[0025] Each target bounding box is input into the edge alignment network to obtain the target detection results of each target object in the image region.

[0026] In a second aspect, this application also provides a target object identification device, comprising:

[0027] An image acquisition module is used to acquire an image to be recognized; the image to be recognized includes at least one target object in the target region;

[0028] The image segmentation module is used to divide the image to be identified into at least one image region based on the region where each target object is located in the image to be identified; each image region includes at least one target object.

[0029] The feature extraction module is used to extract features from any image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0030] The object recognition module is used to determine the object recognition result of each target object in the image region based on the target feature extraction results corresponding to the image region.

[0031] In a third aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0032] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0033] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0034] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0035] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0036] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the following steps:

[0037] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0038] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0039] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0040] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0041] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps:

[0042] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0043] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0044] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0045] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0046] The aforementioned target object identification method, apparatus, device, storage medium, and program product acquire an image to be identified; the image to be identified includes at least one target object in a target region; based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object; for any image region, feature extraction is performed on the image region to obtain the target feature extraction result corresponding to each target object in the image region; based on the target feature extraction result corresponding to the image region, the object identification result of each target object in the image region is determined. In this process, since the image to be identified includes at least one target object, dividing the image to be identified into at least one image region before object identification makes the subsequent object identification process more targeted, thereby improving the accuracy of object identification.

Attached Figure Description

[0047] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0048] Figure 1 is a flowchart of a target object identification method in one embodiment;

[0049] Figure 2 is a flowchart of the steps for determining the target feature extraction result in one embodiment;

[0050] Figure 3 is a flowchart of the steps for determining the object recognition result in one embodiment;

[0051] Figure 4 is a flowchart of the target object identification method in another embodiment;

[0052] Figure 5 is a structural block diagram of a target object identification device in one embodiment;

[0053] Figure 6 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0055] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and/or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.

[0056] Before introducing the embodiments of this application, it should be noted that in the prior art, recognition systems for multi-device environments typically rely on traditional serial processing architectures. This architecture often struggles to meet real-time requirements in complex scenarios where multiple devices appear simultaneously, resulting in low recognition efficiency and slow response times. Furthermore, existing systems often employ shallow models in the feature extraction and classification stages, leading to difficulties in guaranteeing recognition accuracy when dealing with diverse device types, similar appearances, or severe background interference. Some research has attempted to introduce deep learning techniques to improve recognition performance; however, these still face problems such as insufficient feature extraction, high model inference latency, and difficulty in real-time system response, making efficient deployment and stable operation in practical applications challenging. Therefore, there is an urgent need for a multi-device recognition system that supports parallel processing, possesses efficient feature extraction capabilities, and is suitable for real-time scenarios to improve overall recognition efficiency and accuracy, meeting the device recognition needs in complex environments.

[0057] In one exemplary embodiment, as shown in Figure 1, a method for identifying a target object is provided, including the following steps:

[0058] S110, acquire the image to be recognized.

[0059] The image to be identified includes at least one target object in the target region. A target object can be understood as a target device.

[0060] In one alternative implementation, the image to be identified can be acquired by a data acquisition device connected to the execution entity and pre-stored in a corresponding database. When there is a need to identify a target object, the image to be identified can be retrieved from the corresponding database.

[0061] In another optional implementation, the executing entity has a built-in front-end acquisition device for acquiring the image to be identified. The front-end acquisition device is used to acquire image or video information from multiple devices in the target area in real time. Accordingly, the image to be identified can be an image acquired by the front-end acquisition device, or a frame from a video.

[0062] For example, multi-source image acquisition devices, including high-definition industrial cameras, thermal imagers, and depth cameras, can be deployed to construct a multimodal fusion acquisition system. The acquisition coordination module can uniformly control parameters such as frame rate, exposure time, and resolution of different acquisition devices to ensure the temporal and spatial synchronization of multi-channel image data, providing fundamental support for subsequent fusion and processing.

[0063] To reduce redundant images and improve system response efficiency, an edge computing node-based image acquisition task scheduling strategy can be employed. This strategy dynamically allocates acquisition frequency and shooting angle based on device density, movement frequency, and historical recognition load in the target area, achieving adaptive resource scheduling and load balancing. This strategy is based on a sliding window prediction model, automatically adjusting the priority and execution cycle of acquisition tasks.
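The sliding-window idea in this paragraph can be illustrated with a minimal sketch. It is not the application's scheduler: the class name `SlidingWindowScheduler`, the single combined load score in [0, 1], and the linear mapping from predicted load to acquisition period are all assumptions made for illustration.

```python
from collections import deque


class SlidingWindowScheduler:
    """Hypothetical scheduler: predicts load as the mean of recent samples."""

    def __init__(self, window=5, min_period_s=0.2, max_period_s=5.0):
        # Only the most recent `window` load samples are kept.
        self.samples = deque(maxlen=window)
        self.min_period_s = min_period_s
        self.max_period_s = max_period_s

    def observe(self, load):
        # `load` is an assumed combined score in [0, 1], e.g. derived from
        # device density, movement frequency, and historical recognition load.
        self.samples.append(min(max(load, 0.0), 1.0))

    def predicted_load(self):
        # Sliding-window mean; 0.0 when nothing has been observed yet.
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def next_period_s(self):
        # Higher predicted load gives a shorter period (more frequent capture).
        load = self.predicted_load()
        return self.max_period_s - load * (self.max_period_s - self.min_period_s)
```

A busy area (load near 1) is captured roughly every `min_period_s` seconds, and an idle one every `max_period_s` seconds. Shooting-angle allocation and task priority are left out of this sketch.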

[0064] Furthermore, in order to improve the efficiency and accuracy of subsequent recognition, the collected data needs to be preprocessed, such as image enhancement, size normalization, and noise filtering, to eliminate background interference and standardize the input data format, ensuring that the data quality meets the processing requirements of deep neural networks.

[0065] Optionally, adaptive histogram equalization can be applied to the acquired images to improve image discernibility in low-light or backlit scenes; multi-scale Retinex algorithm can be combined to enhance the contrast of detailed areas; and Fourier domain denoising methods can be applied to suppress artifacts such as periodic interference stripes to ensure that key equipment feature information is preserved and enhanced.

[0066] Optionally, images of different resolutions can be uniformly adjusted to the input size required by deep neural networks (e.g., 640×640 pixels), converted to a unified color space, and further subjected to channel normalization operations (e.g., Z-score normalization) to eliminate imaging biases from different devices. All images are organized and stored in tensor format and marked with timestamps and device origin information for easy traceability in subsequent processing.
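As a rough illustration of the size and channel normalization described above, the following sketch uses only the Python standard library and treats an image as nested lists. The function names, the nearest-neighbour resampling, and the `eps` guard are assumptions; a production system would more likely use an array or imaging library.

```python
from statistics import fmean, pstdev


def resize_nearest(rows, out_h, out_w):
    """Nearest-neighbour resize of a 2-D grid (list of rows) to out_h x out_w."""
    in_h, in_w = len(rows), len(rows[0])
    return [
        [rows[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def zscore_channel(values, eps=1e-6):
    """Z-score one channel: subtract the mean, divide by the standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    # eps avoids division by zero for a constant channel (assumed guard).
    return [(v - mean) / (std + eps) for v in values]


def normalize_pixels(pixels):
    """Z-score each channel of a flat list of (r, g, b) pixel tuples."""
    channels = list(zip(*pixels))
    normalized = [zscore_channel(channel) for channel in channels]
    return list(zip(*normalized))
```

For the 640×640 example size mentioned above, `resize_nearest(rows, 640, 640)` would be applied before `normalize_pixels`.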

[0067] Optionally, an Adaptive Gaussian Mixture Model (AGMM) can be used to construct a background model of the target region, model and identify static background regions in consecutive frames, remove dynamic interference (such as people moving or light flickering) in real time, and retain only the effective foreground region related to the device; for detected occluded or partially missing images, an image inpainting network can be invoked to perform structured filling.

[0068] Accordingly, in this embodiment, the image to be identified can be a preprocessed image, and the preprocessing method can include at least one of the above.

[0069] S120, based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region.

[0070] Each image region includes at least one target object.

[0071] In one alternative implementation, the location information of each target object can be extracted, and the image to be identified can be divided into at least one image region based on the location information of each target object.
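One way to picture location-based division is the sketch below, which pads each target's bounding box and merges overlapping boxes so that every resulting region contains at least one target. The `(x1, y1, x2, y2)` box format, the `margin` parameter, and the merge rule are illustrative assumptions, not the procedure claimed in the application.

```python
def expand(box, margin, width, height):
    """Pad a box by `margin` pixels and clip it to the image bounds."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(width, x2 + margin), min(height, y2 + margin))


def overlaps(a, b):
    """True when two (x1, y1, x2, y2) boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest box that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def divide_into_regions(boxes, width, height, margin=16):
    """Group target boxes into image regions, each holding at least one target."""
    regions = []
    for box in boxes:
        current = expand(box, margin, width, height)
        merged = True
        while merged:
            merged = False
            for other in regions:
                if overlaps(current, other):
                    # Absorb the overlapping region, then rescan from the start.
                    regions.remove(other)
                    current = union(current, other)
                    merged = True
                    break
        regions.append(current)
    return regions
```

Nearby devices end up in one region while isolated devices get their own. A variant that targets a preset number of regions would need an additional grouping step.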

[0072] In one alternative implementation, the image to be identified can be input into the pre-sensing network of the target object recognition model to obtain the position prediction result of each target object; based on each position prediction result, the region where each target object is located in the image to be identified can be determined.

[0073] The pre-sensing network is used to perform preliminary partitioning of the image to be identified, resulting in multiple candidate target regions, that is, multiple image regions.

[0074] For example, the image to be identified is input into the pre-sensing network of the target object recognition model so that the pre-sensing network can perform preliminary partitioning of the image, generate multiple candidate target regions (region proposals), and adjust the region boundaries using a context-sensitive enhancement strategy. This ensures that each candidate region covers an independent device target as fully as possible, thereby improving the regional relevance and completeness of subsequent feature extraction.

[0075] The target object recognition model is a pre-trained deep neural network model with a network structure that has good multi-scale perception capabilities (such as Swin Transformer or an improved ResNeSt). The key parameters in the model (such as convolution kernel size, stride, and activation function type) have been pre-adapted and initialized based on the target object recognition requirements to ensure the model's generalization performance and processing efficiency in the current target object recognition task.

[0076] For example, in this embodiment, the image to be identified can be divided into a preset number of image regions according to that preset number and the region where each target object is located in the image to be identified, with each image region including at least one target object.

[0077] S130, For any image region, perform feature extraction on the image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0078] The target feature extraction result is used to represent the features of each target object in the corresponding image region, such as at least one of the following: device type features, device shape features, etc.

[0079] For example, features can be extracted from each image region based on the feature extraction network in the target object recognition model to obtain the target feature extraction results corresponding to each target object in each image region.

[0080] S140, Based on the target feature extraction results corresponding to the image region, determine the object recognition results of each target object in the image region.

[0081] The object recognition result is used to describe the basic information of the corresponding object. For example, when the target object is a device, the object recognition result can be used to represent at least one of the device type, device appearance, etc. of the corresponding device.

[0082] For example, in this embodiment, the target feature extraction results corresponding to each image region can be input into the target detection network in the target object recognition model to determine the object recognition results of each target object in the corresponding image region.

[0083] In the aforementioned method for identifying target objects, the following steps are taken: An image to be identified is acquired; the image to be identified includes at least one target object in a target region; based on the location of each target object in the image to be identified, the image is divided into at least one image region; each image region includes at least one target object; for any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region; based on the target feature extraction results corresponding to the image region, the object identification result of each target object in the image region is determined. In this process, since the image to be identified includes at least one target object, dividing the image to be identified into at least one image region before object identification makes the subsequent object identification process more targeted, thereby improving the accuracy of object identification.
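The control flow of steps S110 to S140 can be summarized in a short skeleton. The four callables (`locate`, `divide`, `extract`, `detect`) are hypothetical placeholders for the pre-sensing, division, feature extraction, and detection stages; their names and signatures are assumptions made only to show how the stages connect.

```python
def recognize_targets(image, locate, divide, extract, detect):
    """Run the four stages in order and collect per-target results.

    locate(image)          -> target locations
    divide(image, boxes)   -> image regions, each containing >= 1 target (S120)
    extract(region)        -> target feature extraction result (S130)
    detect(features)       -> list of recognition results for that region (S140)
    """
    boxes = locate(image)
    regions = divide(image, boxes)
    results = []
    for region in regions:
        features = extract(region)
        results.extend(detect(features))
    return results
```

Because each region is handled independently inside the loop, the per-region work can later be dispatched in parallel without changing the surrounding flow.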

[0084] Based on the technical solutions of the above embodiments, this application also provides an optional embodiment. In this optional embodiment, the target object recognition model further includes a feature extraction network and a feature encoding network corresponding to each image region. In this case, the process of extracting features from the image region to obtain the target feature extraction results corresponding to each target object in the image region is refined.

[0085] Referring to Figure 2, the steps for determining the target feature extraction results include:

[0086] S210, input the image region into the feature extraction network to obtain the initial feature extraction result.

[0087] In one optional implementation, the feature extraction network includes a first perceptual unit and a second perceptual unit; wherein the first perceptual unit is used to extract features in a first dimension; and the second perceptual unit is used to extract features in a second dimension.

[0088] The first perceptual unit can be understood as a shallow perceptual unit, used to extract shallow features (i.e., first-dimensional features), such as object edges and object textures. The second perceptual unit can be understood as a deep perceptual unit, used to extract deep features (i.e., second-dimensional features), such as capturing structural and semantic features.

[0089] For example, in this embodiment, the image region is sequentially processed by a shallow perceptual unit (for extracting low-level features such as edges and textures) and a deep perceptual unit (for capturing structural and semantic features) to obtain initial feature extraction results. Furthermore, a multi-scale convolution block is employed, including atrous convolution and variable receptive field fusion techniques, to adapt to the spatial distribution characteristics of targets on devices of different scales.
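To make the atrous (dilated) convolution mentioned above concrete, here is a one-dimensional sketch in plain Python. It follows the deep-learning convention of not flipping the kernel and uses no padding; the function name and the 1-D simplification are assumptions for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D atrous convolution (cross-correlation, no kernel flip)."""
    # Receptive field of one output sample: grows with dilation, while the
    # number of kernel weights stays the same.
    span = (len(kernel) - 1) * dilation + 1
    output = []
    for start in range(len(signal) - span + 1):
        output.append(
            sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        )
    return output
```

With a 3-tap kernel, `dilation=1` covers 3 input samples per output and `dilation=2` covers 5. Running several dilation rates side by side and fusing the outputs is one common way to obtain the variable receptive field the paragraph refers to.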

[0090] In another alternative implementation, to improve the accuracy of feature extraction, the target object recognition model also includes a multi-head attention module to weight the feature map from both spatial and channel dimensions.

[0091] In this embodiment, based on the features extracted by convolution, a multi-head attention module (such as multi-head self-attention MHSA or parallel channel attention PCA) is added to perform weighted processing on the feature map from the spatial dimension and the channel dimension respectively, thereby improving the response intensity of key regions and suppressing background redundant response, thus enhancing the model's ability to distinguish fine-grained device differences.

[0092] S220, the initial feature extraction results are input into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0093] In this embodiment, different image regions correspond to different feature encoding networks, which can process the initial feature extraction results of each image region in parallel to obtain the target feature extraction results corresponding to each target object in the image region.

[0094] For example, feature encoding networks can introduce graph computation parallel scheduling mechanisms (such as using TensorRT or OpenVINO parallel inference engines).

[0095] In this process, a regional pipeline scheduling method is adopted to divide the candidate regions into several groups of parallel computing units, which run simultaneously on the cores of multi-core central processing units (CPUs) or graphics processing units (GPUs), reducing the overall encoding latency.
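A minimal sketch of per-region parallel encoding, assuming one encoder callable per region, is shown below. It uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function name and the dictionary-based interface are hypothetical, and this is not a TensorRT or OpenVINO integration.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_regions_parallel(initial_features, encoders, max_workers=4):
    """Apply each region's own encoder to that region's initial features.

    initial_features: dict mapping region id -> initial feature extraction result
    encoders:         dict mapping region id -> callable (assumed interface)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every region first so the encoders can run concurrently.
        futures = {
            region_id: pool.submit(encoders[region_id], features)
            for region_id, features in initial_features.items()
        }
        # Collect results; an exception raised by an encoder re-raises here.
        return {region_id: future.result() for region_id, future in futures.items()}
```

In CPython, threads give a real speed-up only when the encoder releases the global interpreter lock, as native inference runtimes typically do. For pure-Python encoders a process pool would be the closer analogue.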

[0096] Based on the technical solutions of the above embodiments, this application also provides an optional embodiment. In this optional embodiment, the process of determining the object recognition results of each target object in the image region based on the target feature extraction results corresponding to the image region is described in detail.

[0097] Referring to Figure 3, the steps for determining the object recognition result include:

[0098] S310, the target feature extraction results are input into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region.

[0099] The target detection results include at least one of the following: the target object's location information, the device's shape information, and the device's category information.

[0100] The object detection network can be an anchor-free structure (such as CenterNet or DETR series) to reduce the dependence on prior box settings and improve detection robustness in complex backgrounds or device pose change scenarios.

[0101] In one optional implementation, the target detection network includes a decoding network, a location regression network, and an edge alignment network. Accordingly, the target feature extraction results can be input into the decoding network to obtain the initial bounding boxes of each target object in the image region. The initial bounding boxes are then input into the location regression network to perform boundary optimization on each initial bounding box, thereby obtaining the target bounding boxes of each target object in the image region. Finally, the target bounding boxes are input into the edge alignment network to obtain the target detection results of each target object in the image region.

[0102] Specifically, in this embodiment, a decoding network is used to decode the target feature extraction results to generate a set of candidate bounding boxes. Each bounding box contains location coordinates, a confidence score, and a class prediction. To improve localization accuracy, non-maximum suppression (NMS) and an IoU adaptive filtering mechanism are introduced to automatically eliminate redundant detection results and ensure that the overlap between retained boxes stays below a set threshold (e.g., 0.5), preventing the same target from being detected repeatedly. Furthermore, a multi-scale location regression module is used to optimize the boundaries of the candidate boxes after initial screening. This module combines a feature pyramid structure and a dual-branch regression network to perform accurate boundary regression for targets of different scales. At the same time, a gradient-guided shape optimization strategy (e.g., Gradient Alignment Refinement, GAR) is introduced to dynamically fine-tune the details of the detection boxes, improving the detection accuracy for small target devices or edge targets.
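The redundancy-elimination step can be sketched as standard greedy non-maximum suppression. This is the textbook algorithm rather than the "IoU adaptive filtering mechanism" named above, whose details are not given; the `(x1, y1, x2, y2)` box format and the class-agnostic handling are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union_area = area_a + area_b - inter
    return inter / union_area if union_area > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs.

    Boxes are visited from highest to lowest score; a box is kept only if its
    IoU with every already-kept box is below `iou_threshold`.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

With the example threshold of 0.5, two boxes that cover largely the same device collapse to the higher-scoring one, while a box elsewhere in the region is untouched.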

[0103] To ensure the extracted region better matches the actual device shape and prevent background interference, a corresponding instance mask can be generated for each bounding box to more finely define the true outline of the device target. A high-resolution binary mask is generated using a graph cut-optimized edge alignment network (such as RefineMask), and boundary alignment is performed using a fully connected conditional random field (DenseCRF).

[0104] S320, based on the target detection results of each target object, determine the object recognition result of each target object in the image region.

[0105] In the above embodiments, the target detection network includes a decoding network, a location regression network, and an edge alignment network. Based on the above networks, the target feature extraction results are processed in sequence to make the target detection results more accurate, thereby improving the accuracy of the object recognition results.

[0106] Furthermore, to improve the accuracy of object recognition results, each segmented device region can be classified, and the category label of each device target can be determined from the confidence score output by the classification network. To improve recognition stability, the system introduces a confidence threshold filtering mechanism that outputs a recognition result only when it falls within the preset confidence interval and marks low-confidence results as pending review to support subsequent optimization.

[0107] For example, a dedicated classification network can be built based on ResNet, SE-ResNet, or Transformer architecture, and a multi-task learning mechanism can be introduced. On the basis of the main classification task of the target object recognition model, branches for inter-category similarity estimation and attribute prediction are added to identify and determine the target object from multiple dimensions.

[0108] In addition, to meet the input requirements of the classification network, normalization operations such as size scaling, brightness standardization, and channel rearrangement can be performed on the image to be recognized to enhance classification robustness. The normalized image patch is then input into the classification network, which outputs a category probability distribution vector via the Softmax function. The category with the maximum probability is taken as the target category, and this maximum probability is recorded as the confidence score. Classification results are then filtered against preset confidence thresholds (e.g., a high-confidence interval of ≥ 0.85, a low-confidence interval of ≤ 0.60, and a suspicious middle interval between the two). For example, when the predicted confidence falls within the high-confidence interval, the category is directly output as the final recognition result; if it falls within the middle interval, the target device is marked as "requiring manual review" and recorded in the review task pool; if it falls within the low-confidence interval, the recognition is deemed unreliable, the result is temporarily withheld, and the object recognition process is triggered again.
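The three-way confidence filtering described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function names `softmax` and `triage`, the action strings, and the example labels are assumptions; only the thresholds 0.85 and 0.60 come from the text.

```python
import math
from typing import List, Sequence, Tuple

HIGH_THRESHOLD = 0.85  # example high-confidence bound from the text
LOW_THRESHOLD = 0.60   # example low-confidence bound from the text


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw classifier outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float, str]:
    """Return (label, confidence, action) for one classified image patch.

    The action names are hypothetical placeholders for the three outcomes:
    direct output, manual review, and re-recognition.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence >= HIGH_THRESHOLD:
        action = "output"
    elif confidence <= LOW_THRESHOLD:
        action = "re_recognize"
    else:
        action = "manual_review"
    return labels[best], confidence, action
```

A caller would route `"manual_review"` results to the review task pool and `"re_recognize"` results back into the recognition process.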

[0109] In addition, to provide a basis for subsequent model fine-tuning, data sampling, and label enhancement, and to achieve closed-loop optimization, the classification label, confidence score, region, classification timestamp, etc. of each target object (such as a target device) are encapsulated into a structured recognition result unit and stored in the database for subsequent model training.

[0110] Based on the technical solutions of the above embodiments, this application also provides an optional embodiment. In this optional embodiment, the method for identifying the target object provided by this application is described in detail.

[0111] Referring to Figure 4, the target object identification method includes:

[0112] S401: Obtain the image to be identified;

[0113] The image to be identified includes at least one target object in the target region;

[0114] S402: Input the image to be identified into the pre-sensory network of the target object recognition model to obtain the position prediction result of each target object;

[0115] S403: Based on each position prediction result, determine the region where each target object is located in the image to be identified;

[0116] S404: Based on the region where each target object is located in the image to be identified, divide the image to be identified into at least one image region;

[0117] Each image region includes at least one target object;

[0118] S405: For any image region, input the image region into the feature extraction network to obtain the initial feature extraction result;

[0119] The feature extraction network includes a first perceptual unit and a second perceptual unit; the first perceptual unit is used to extract features in the first dimension; and the second perceptual unit is used to extract features in the second dimension;

[0120] S406: Input the initial feature extraction result into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0121] S407: Input the target feature extraction results into the decoding network to obtain the initial bounding box of each target object in the image region;

[0122] S408: Input each initial bounding box into the location regression network and perform boundary optimization on each initial bounding box to obtain the target bounding box of each target object in the image region;

[0123] S409: Input each target bounding box into the edge alignment network to obtain the target detection results of each target object in the image region;

[0124] The target detection results include at least one of the following: the target object's location information, the device's shape information, and the device's category information;

[0125] S410: Based on the target detection results of each target object, determine the object recognition result of each target object in the image region.
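The data flow of steps S401 to S410 can be summarized in a short sketch. Everything here is a hypothetical scaffold: the `RecognitionModel` container, its field names, and the `identify` function are assumptions introduced for illustration, and each callable merely stands in for the corresponding network named in the steps.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RecognitionModel:
    """Hypothetical container; each field stands in for one stage of the method."""
    pre_sensory: Callable[[Any], List[Any]]        # S402: image -> position predictions
    locate: Callable[[Any, List[Any]], List[Any]]  # S403: -> regions where objects lie
    split: Callable[[Any, List[Any]], List[Any]]   # S404: -> image regions
    extract: Callable[[Any], Any]                  # S405: feature extraction network
    encode: Callable[[int, Any], Any]              # S406: per-region feature encoding
    decode: Callable[[Any], List[Any]]             # S407: -> initial bounding boxes
    regress: Callable[[List[Any]], List[Any]]      # S408: -> target bounding boxes
    align: Callable[[List[Any]], List[Any]]        # S409: -> target detection results
    recognize: Callable[[List[Any]], List[Any]]    # S410: -> object recognition results


def identify(image: Any, model: RecognitionModel) -> List[Any]:
    """Run the stages in the order given by S401 to S410."""
    positions = model.pre_sensory(image)
    object_regions = model.locate(image, positions)
    image_regions = model.split(image, object_regions)
    results: List[Any] = []
    for index, region in enumerate(image_regions):
        initial_features = model.extract(region)
        # The region index selects the encoding network for this region.
        target_features = model.encode(index, initial_features)
        initial_boxes = model.decode(target_features)
        target_boxes = model.regress(initial_boxes)
        detections = model.align(target_boxes)
        results.extend(model.recognize(detections))
    return results
```

Passing the region index to `encode` is one way to reflect that each image region has its own feature encoding network; other dispatch schemes would serve equally well.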

[0126] It should be noted that the final recognition results can be output in a structured manner for use in upper-level systems for device monitoring, management decisions, or control execution. Simultaneously, the system retains all intermediate features and judgment results, and feeds back performance metrics such as recognition accuracy and processing time during operation to the model optimization module. This enables dynamic parameter adjustment and iterative model training, enhancing the system's adaptability in diverse scenarios.

[0127] For example, the target identification results for each device target, including classification labels, confidence scores, spatial locations, processing timestamps, and other identification information, are encapsulated into structured result units (such as JSON or Protobuf format) and published to upper-layer systems through a unified interface service (such as RESTful API or message queue MQTT). This supports rapid integration with heterogeneous systems such as device management platforms, real-time monitoring systems, and intelligent scheduling systems, meeting the needs of various application scenarios.
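As an illustration of such a structured result unit, the following sketch serializes one recognition result to JSON. The `RecognitionResult` class, its field names, and the sample values are assumptions made for this example; the text only specifies that the unit carries a classification label, confidence score, spatial location, and processing timestamp.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class RecognitionResult:
    """Hypothetical structured result unit for one device target."""
    label: str                               # classification label
    confidence: float                        # confidence score
    bbox: Tuple[float, float, float, float]  # spatial location, assumed (x1, y1, x2, y2)
    processed_at: str                        # processing timestamp, assumed ISO 8601


def to_json(result: RecognitionResult) -> str:
    """Serialize a result unit to a JSON string with stable key order."""
    return json.dumps(asdict(result), sort_keys=True)
```

The resulting string could then be used as the body of a REST response or as an MQTT message payload, depending on the interface service chosen.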

[0128] For example, intermediate data such as feature vectors, bounding box information, attention heatmaps, and confidence evaluation results generated during the recognition process are cached in a local high-speed storage module (such as an NVMe SSD cache) and transmitted to a backend database or distributed object storage system (such as HDFS or Amazon S3) through an asynchronous write mechanism to support subsequent model review, visualization analysis, and error diagnosis.
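The asynchronous write mechanism can be sketched with a bounded in-memory queue drained by a background thread. This is a simplified stand-in under stated assumptions: the `AsyncWriter` class and the `sink` callable are hypothetical, and a real deployment would replace `sink` with a client for the chosen database or object store.

```python
import json
import queue
import threading
from typing import Any, Callable, Dict, Optional


class AsyncWriter:
    """Buffer records locally and hand them to a slow sink on a worker thread."""

    def __init__(self, sink: Callable[[str], None], max_pending: int = 1024) -> None:
        # The bounded queue plays the role of the local high-speed cache.
        self._queue: "queue.Queue[Optional[Dict[str, Any]]]" = queue.Queue(maxsize=max_pending)
        self._sink = sink
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, record: Dict[str, Any]) -> None:
        """Enqueue a record; blocks only if the buffer is full."""
        self._queue.put(record)

    def close(self) -> None:
        """Flush all pending records and stop the worker."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._thread.join()

    def _drain(self) -> None:
        while True:
            record = self._queue.get()
            if record is None:
                break
            self._sink(json.dumps(record))
```

With this arrangement the recognition path returns as soon as a record is queued, and the slower backend write happens off the critical path.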

[0129] For example, performance metrics for each round of recognition tasks are collected in real time, including image processing time, target recognition latency, classification accuracy, bounding box regression error, etc. The recognition results are then automatically assigned trust labels ("Trust Labeling") based on confidence level, historical label consistency, and operation context, in order to build a high-quality feedback dataset, avoid full manual review, and improve data utilization efficiency.

[0130] For example, the collected feedback data is managed hierarchically according to its confidence level. High-confidence samples can be directly added to the incremental training set, while low-confidence samples are designated as "suspected anomalies" for manual annotation before being incorporated into the model optimization process. To improve training efficiency, a selection mechanism based on sample contribution (such as K-Center Sampling or Gradient Matching) is introduced, prioritizing samples with ambiguous boundaries and unstable confidence levels for model fine-tuning.
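As one possible reading of the K-Center Sampling example named above, the following sketch uses the common greedy farthest-point heuristic over sample feature vectors. The function name `k_center_greedy`, the Euclidean distance, and the choice of starting index are assumptions; the text does not specify how the sampling is carried out.

```python
import math
from typing import List, Sequence


def k_center_greedy(features: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Pick up to k sample indices that spread out over feature space.

    Each round selects the sample farthest from all samples chosen so far.
    """
    if not features or k <= 0:
        return []
    selected = [first]
    # Distance from every sample to its nearest selected sample.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < min(k, len(features)):
        nxt = max(range(len(features)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected
```

Samples selected this way cover the feature space more evenly than random draws, which is one reason such schemes are used to prioritize fine-tuning data.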

[0131] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.

[0132] Based on the same inventive concept, this application also provides a target object identification device for implementing the target object identification method described above. The solution provided by this device is similar to the implementation described in the above method; therefore, the specific limitations in one or more target object identification device embodiments provided below can be found in the limitations of the target object identification method described above, and will not be repeated here.

[0133] In one exemplary embodiment, as shown in Figure 5, a target object recognition device is provided, comprising: an image acquisition module 510, an image segmentation module 520, a feature extraction module 530, and an object recognition module 540, wherein:

[0134] Image acquisition module 510 is used to acquire an image to be recognized; the image to be recognized includes at least one target object in the target region.

[0135] The image segmentation module 520 is used to divide the image to be identified into at least one image region based on the region where each target object is located in the image to be identified; each image region includes at least one target object.

[0136] The feature extraction module 530 is used to extract features from any image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0137] The object recognition module 540 is used to determine the object recognition result of each target object in the image region based on the target feature extraction result corresponding to the image region.

[0138] In one embodiment, the target object recognition device further includes a region determination module, comprising a position prediction unit for inputting the image to be recognized into the pre-sensory network of the target object recognition model to obtain the position prediction result of each target object; and a region determination unit for determining the region where each target object is located in the image to be recognized based on each position prediction result.

[0139] In one embodiment, the target object recognition model further includes a feature extraction network and a feature encoding network corresponding to each image region; correspondingly, the feature extraction module 530 includes a first extraction unit for inputting the image region into the feature extraction network to obtain an initial feature extraction result; and a second extraction unit for inputting the initial feature extraction result into the feature encoding network corresponding to the image region to obtain the target feature extraction result corresponding to each target object in the image region.

[0140] In one embodiment, the feature extraction network includes a first perceptual unit and a second perceptual unit; wherein the first perceptual unit is used to extract features in a first dimension; and the second perceptual unit is used to extract features in a second dimension.

[0141] In one embodiment, the object recognition module 540 includes a target detection unit, which is used to input the target feature extraction result into the target detection network of the target object recognition model to obtain the target detection result of each target object in the image region; the target detection result includes at least one of the target object's location information, device shape information and device category information; and the object recognition unit is used to determine the object recognition result of each target object in the image region based on the target detection result of each target object.

[0142] In one embodiment, the target detection network includes a decoding network, a location regression network, and an edge alignment network; the target detection unit includes a first detection subunit, used to input the target feature extraction result into the decoding network to obtain the initial bounding box of each target object in the image region; a second detection subunit, used to input each initial bounding box into the location regression network to perform boundary optimization on each initial bounding box to obtain the target bounding box of each target object in the image region; and a third detection subunit, used to input each target bounding box into the edge alignment network to obtain the target detection result of each target object in the image region.

[0143] Each module in the aforementioned target object identification device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0144] In one exemplary embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 6. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When executed by the processor, the computer program implements a method for identifying a target object. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0145] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0146] In one exemplary embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0147] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0148] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0149] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0150] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0151] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0152] The image to be recognized is input into the pre-sensory network of the target object recognition model to obtain the position prediction results of each target object;

[0153] Based on each position prediction result, the region where each target object is located in the image to be identified is determined.

[0154] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0155] The image region is input into the feature extraction network to obtain the initial feature extraction results;

[0156] The initial feature extraction results are input into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0157] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0158] The target feature extraction results are input into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region; the target detection results include at least one of the target object's location information, device shape information, and device category information;

[0159] Based on the target detection results of each target object, the object recognition results of each target object in the image region are determined.

[0160] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0161] The target feature extraction results are input into the decoding network to obtain the initial bounding boxes of each target object in the image region;

[0162] Each initial bounding box is input into the location regression network, and the boundaries of each initial bounding box are optimized to obtain the target bounding boxes of each target object in the image region.

[0163] Each target bounding box is input into the edge alignment network to obtain the target detection results of each target object in the image region.

[0164] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor:

[0165] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0166] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0167] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0168] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0169] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0170] The image to be recognized is input into the pre-sensory network of the target object recognition model to obtain the position prediction results of each target object;

[0171] Based on each position prediction result, the region where each target object is located in the image to be identified is determined.

[0172] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0173] The image region is input into the feature extraction network to obtain the initial feature extraction results;

[0174] The initial feature extraction results are input into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0175] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0176] The target feature extraction results are input into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region; the target detection results include at least one of the target object's location information, device shape information, and device category information;

[0177] Based on the target detection results of each target object, the object recognition results of each target object in the image region are determined.

[0178] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0179] The target feature extraction results are input into the decoding network to obtain the initial bounding boxes of each target object in the image region;

[0180] Each initial bounding box is input into the location regression network, and the boundaries of each initial bounding box are optimized to obtain the target bounding boxes of each target object in the image region.

[0181] Each target bounding box is input into the edge alignment network to obtain the target detection results of each target object in the image region.

[0182] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, performs the following steps:

[0183] Acquire the image to be identified; the image to be identified includes at least one target object in the target region;

[0184] Based on the location of each target object in the image to be identified, the image to be identified is divided into at least one image region; each image region includes at least one target object.

[0185] For any image region, feature extraction is performed on the image region to obtain the target feature extraction results corresponding to each target object in the image region;

[0186] Based on the target feature extraction results corresponding to the image region, the object recognition results of each target object in the image region are determined.

[0187] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0188] The image to be recognized is input into the pre-sensory network of the target object recognition model to obtain the position prediction results of each target object;

[0189] Based on each position prediction result, the region where each target object is located in the image to be identified is determined.

[0190] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0191] The image region is input into the feature extraction network to obtain the initial feature extraction results;

[0192] The initial feature extraction results are input into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region.

[0193] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0194] The target feature extraction results are input into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region; the target detection results include at least one of the target object's location information, device shape information, and device category information;

[0195] Based on the target detection results of each target object, the object recognition results of each target object in the image region are determined.

[0196] In one embodiment, when the computer program is executed by a processor, it also performs the following steps:

[0197] The target feature extraction results are input into the decoding network to obtain the initial bounding boxes of each target object in the image region;

[0198] Each initial bounding box is input into the location regression network, and the boundaries of each initial bounding box are optimized to obtain the target bounding boxes of each target object in the image region.

[0199] Each target bounding box is input into the edge alignment network to obtain the target detection results of each target object in the image region.

[0200] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0201] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0202] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0203] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A method for identifying a target object, characterized in that the method includes: acquiring an image to be identified, wherein the image to be identified includes at least one target object in a target region; dividing the image to be identified into at least one image region based on the region where each target object is located in the image to be identified, wherein each image region includes at least one target object; for any of the image regions, performing feature extraction on the image region to obtain the target feature extraction results corresponding to each target object in the image region; and determining the object recognition result of each target object in the image region based on the target feature extraction results corresponding to the image region.

2. The method according to claim 1, characterized in that, before dividing the image to be identified into at least one image region based on the region where each target object is located in the image to be identified, the method further includes: inputting the image to be identified into the pre-sensory network of the target object recognition model to obtain the position prediction result of each target object; and determining, based on each position prediction result, the region where each target object is located in the image to be identified.

3. The method according to claim 2, characterized in that the target object recognition model further includes a feature extraction network and a feature encoding network corresponding to each image region; accordingly, performing feature extraction on the image region to obtain the target feature extraction results corresponding to each target object in the image region includes: inputting the image region into the feature extraction network to obtain an initial feature extraction result; and inputting the initial feature extraction result into the feature encoding network corresponding to the image region to obtain the target feature extraction results corresponding to each target object in the image region.

4. The method according to claim 3, characterized in that the feature extraction network includes a first perceptual unit and a second perceptual unit; the first perceptual unit is used to extract features in a first dimension; and the second perceptual unit is used to extract features in a second dimension.

5. The method according to claim 1, characterized in that determining the object recognition result of each target object in the image region based on the target feature extraction results corresponding to the image region includes: inputting the target feature extraction results into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region, wherein the target detection results include at least one of the target object's location information, device shape information, and device category information; and determining the object recognition result of each target object in the image region based on the target detection results of each target object.

6. The method according to claim 5, characterized in that the target detection network includes a decoding network, a location regression network, and an edge alignment network; inputting the target feature extraction results into the target detection network of the target object recognition model to obtain the target detection results of each target object in the image region includes: inputting the target feature extraction results into the decoding network to obtain the initial bounding box of each target object in the image region; inputting each initial bounding box into the location regression network and performing boundary optimization on each initial bounding box to obtain the target bounding box of each target object in the image region; and inputting each target bounding box into the edge alignment network to obtain the target detection results of each target object in the image region.

7. A target object identification device, characterized in that the device includes: an image acquisition module, used to acquire an image to be identified, wherein the image to be identified includes at least one target object in a target region; an image segmentation module, used to divide the image to be identified into at least one image region based on the region where each target object is located in the image to be identified, wherein each image region includes at least one target object; a feature extraction module, used to perform feature extraction on any of the image regions to obtain the target feature extraction results corresponding to each target object in the image region; and an object recognition module, used to determine the object recognition result of each target object in the image region based on the target feature extraction results corresponding to the image region.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1-6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-6.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-6.