Depth recognition model training method, image depth recognition method, and related device
By using an instance segmentation network to identify static and dynamic objects in vehicle images, generating target images, and adjusting the depth recognition network, the problem of inaccurate depth recognition in vehicle images is solved, improving the model's recognition accuracy and training speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HON HAI PRECISION INDUSTRY CO LTD
- Filing Date
- 2022-07-04
- Publication Date
- 2026-06-12
AI Technical Summary
Existing vehicle image depth recognition models cannot accurately identify the true distance between the vehicle and objects or obstacles in the surrounding environment due to the presence of dynamic objects, which affects driving safety.
The image is segmented by an instance segmentation network to identify static and dynamic objects. The target dynamic object is selected and the dynamic pose matrix is calculated to generate the target image. The depth recognition network is adjusted to reduce the influence of dynamic objects and improve the recognition accuracy.
It effectively reduces the impact of dynamic objects on the training accuracy of the depth recognition model, thereby improving the recognition accuracy and training speed of the depth recognition model.
Figure CN117409389B_ABST
Description
Technical Field
[0001] This invention relates to the field of image processing, and more particularly to a method for training a depth recognition model, an image depth recognition method, and related equipment.
Background Technology
[0002] Current methods for depth recognition of in-vehicle images employ training images to train deep learning networks. However, because the training images include dynamic objects, these objects can cause the trained depth recognition model to fail to accurately identify the depth information in the in-vehicle images. Consequently, it becomes difficult to determine the true distance between the vehicle and various objects or obstacles in the surrounding environment, thus affecting driving safety.
Summary of the Invention
[0003] In view of the above, it is necessary to provide a depth recognition model training method, an image depth recognition method, and related equipment to solve the technical problem of inaccurate depth information recognition of vehicle images.
[0004] This application provides a depth recognition model training method, comprising: acquiring a first image and a second image; performing instance segmentation on the first image based on an instance segmentation network to obtain a first static object, multiple first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image; performing instance segmentation on the second image based on the instance segmentation network to obtain a second static object and multiple second dynamic objects corresponding to the second image; selecting multiple target dynamic objects from the multiple first dynamic objects based on the number of pixels of each first dynamic object and a preset position; selecting multiple feature dynamic objects from the multiple second dynamic objects based on the number of pixels of each second dynamic object and the preset position; identifying whether each target dynamic object has a corresponding feature dynamic object, and determining the target dynamic objects and feature dynamic objects that have a corresponding relationship as recognition objects; identifying the object state of the target dynamic object in each recognition object based on the dynamic pose matrix corresponding to the recognition object, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix; generating a target image based on the object state, the first dynamic position, and the first image, and generating a target projection image based on the object state, the first dynamic position, and the initial projection image corresponding to the first image; and adjusting the acquired depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image, to obtain a depth recognition model.
[0005] According to an optional embodiment of this application, the instance segmentation network includes a feature extraction layer, a classification layer, and a mapping layer. The step of performing instance segmentation on the first image based on the instance segmentation network to obtain a first static object, multiple first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image includes: standardizing the first image to obtain a standardized image; extracting features from the standardized image based on the feature extraction layer to obtain an initial feature map; segmenting the standardized image based on the ratio (multiple) between the size of the initial feature map and the size of the standardized image and on the convolution stride in the feature extraction layer, to obtain a rectangular region corresponding to each pixel in the initial feature map; classifying the initial feature map based on the classification layer to obtain a predicted probability that each pixel in the initial feature map belongs to a first preset category; determining the pixels in the initial feature map whose predicted probabilities are greater than a preset threshold as multiple target pixels, and determining the multiple rectangular regions corresponding to the multiple target pixels as multiple feature regions; mapping each feature region to the initial feature map based on the mapping layer to obtain the mapping region corresponding to each feature region in the initial feature map; dividing the multiple mapping regions according to a preset number to obtain multiple partitioned regions corresponding to each mapping region; determining the center pixel of each partitioned region and calculating the pixel value of the center pixel; pooling the multiple pixel values corresponding to the multiple center pixels to obtain the mapping probability value corresponding to each mapping region; restoring the multiple mapping regions and concatenating the restored mapping regions to obtain a target feature map; and generating, based on the target feature map, the mapping probability value, the restored multiple mapping regions, and a second preset category, the first static object corresponding to the first image, the multiple first dynamic objects, and the first dynamic position of each first dynamic object.
[0006] According to an optional embodiment of this application, the step of generating a first static object, a plurality of first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image based on the target feature map, the mapping probability value, the restored plurality of mapping regions, and the second preset category includes: classifying each pixel of the target feature map according to the mapping probability value and the second preset category to obtain the pixel category of each pixel in the restored mapping region; determining the region formed by a plurality of pixels corresponding to the same pixel category in the restored mapping region as a first object; obtaining the pixel coordinates of all pixels in the first object and determining the pixel coordinates as the first position corresponding to the first object; dividing the plurality of first objects into the plurality of first dynamic objects and the first static objects according to a preset rule, and determining the first position corresponding to each first dynamic object as the first dynamic position.
[0007] According to an optional embodiment of this application, the step of selecting multiple target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and a preset position includes: counting the number of pixels contained in each first dynamic object, sorting the plurality of first dynamic objects according to the number of pixels, and selecting the first dynamic objects that fall at the preset position after sorting as the multiple target dynamic objects.
[0008] According to an optional embodiment of this application, identifying whether each target dynamic object has a corresponding feature dynamic object includes: obtaining multiple target element information of each target dynamic object; obtaining the feature element information corresponding to each target element information from the feature dynamic objects of the same category; performing matching processing on each target element information and the corresponding feature element information to obtain a matching value between the target dynamic object and the feature dynamic object of the same category; and, if the matching value is within a preset range, determining that the target dynamic object has a corresponding feature dynamic object.
[0009] According to an optional embodiment of this application, the step of identifying the object state of the target dynamic object in the identification object based on the dynamic pose matrix corresponding to the identification object, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix includes: subtracting each matrix element in the static pose matrix from the corresponding matrix element in the dynamic pose matrix corresponding to the identification object to obtain a pose difference; taking the absolute value of the pose difference to obtain the pose absolute value in the static pose matrix; arranging the pose absolute values according to the element position of each pose absolute value in the static pose matrix to obtain a pose absolute value matrix; comparing each pose absolute value in the pose absolute value matrix with the corresponding pose threshold in the preset threshold matrix; if there is at least one pose absolute value in the pose absolute value matrix that is greater than the corresponding pose threshold, then the object state of the target dynamic object in the identification object is determined to be moving; or, if all pose absolute values in the pose absolute value matrix are less than or equal to the corresponding threshold, then the object state of the target dynamic object in the identification object is determined to be stationary.
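The comparison described in [0009] can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function names `pose_abs_matrix` and `object_state`, the nested-list matrix representation, and the example values are all assumptions made for the sketch.

```python
from typing import List

Matrix = List[List[float]]


def pose_abs_matrix(dynamic_pose: Matrix, static_pose: Matrix) -> Matrix:
    """Element-wise |dynamic - static|, keeping each element at its matrix position."""
    return [
        [abs(dyn - sta) for dyn, sta in zip(dyn_row, sta_row)]
        for dyn_row, sta_row in zip(dynamic_pose, static_pose)
    ]


def object_state(dynamic_pose: Matrix, static_pose: Matrix, thresholds: Matrix) -> str:
    """Return "moving" if at least one pose absolute value exceeds its threshold."""
    abs_matrix = pose_abs_matrix(dynamic_pose, static_pose)
    exceeds = any(
        value > limit
        for abs_row, thr_row in zip(abs_matrix, thresholds)
        for value, limit in zip(abs_row, thr_row)
    )
    return "moving" if exceeds else "stationary"


if __name__ == "__main__":
    # Hypothetical 2x2 matrices, chosen only to exercise both branches.
    static = [[1.0, 0.0], [0.0, 1.0]]
    limits = [[0.1, 0.1], [0.1, 0.1]]
    print(object_state([[1.05, 0.0], [0.0, 1.0]], static, limits))
    print(object_state([[1.0, 0.3], [0.0, 1.0]], static, limits))
```

A single element over its threshold is enough to mark the object as moving, which matches the "at least one" wording of the paragraph.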
[0010] According to an optional embodiment of this application, generating a target image based on the object state, the first dynamic position, and the first image includes: if the object state of any target dynamic object in the identified objects is moving, then performing masking processing on the target dynamic object in the first image based on the first dynamic position of the target dynamic object to obtain the target image; or, if the object state of all target dynamic objects in the identified objects is stationary, then determining the first image as the target image.
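As a rough sketch of the masking rule in [0010], the following Python function copies the first image and overwrites the pixels of every moving target dynamic object. The function name `build_target_image`, the dictionary-based inputs, and the mask value of 0 are assumptions for illustration; the text does not specify how masked pixels are encoded.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]
Position = List[Tuple[int, int]]  # (row, column) pixel coordinates of one object


def build_target_image(
    first_image: Image,
    states: Dict[str, str],
    positions: Dict[str, Position],
    mask_value: int = 0,  # assumed encoding for masked pixels
) -> Image:
    """Mask moving objects at their first dynamic positions; keep everything else."""
    target = [row[:] for row in first_image]  # copy so the first image is untouched
    for name, state in states.items():
        if state == "moving":
            for row, col in positions[name]:
                target[row][col] = mask_value
    return target


if __name__ == "__main__":
    image = [[5, 6], [7, 8]]
    states = {"car": "moving", "bicycle": "stationary"}
    positions = {"car": [(0, 1)], "bicycle": [(1, 0)]}
    print(build_target_image(image, states, positions))
```

When every object is stationary the function returns an unmodified copy, which corresponds to using the first image itself as the target image.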
[0011] According to an optional embodiment of this application, adjusting the depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image to obtain a depth recognition model includes: calculating the depth loss value of the depth recognition network based on the gradient error and the photometric error, adjusting the depth recognition network based on the depth loss value until the depth loss value is reduced to the minimum, and obtaining the depth recognition model.
[0012] This application provides an image depth recognition method, which includes: acquiring an image to be recognized, inputting the image to be recognized into a depth recognition model, obtaining a target depth image of the image to be recognized and depth information of the image to be recognized, wherein the depth recognition model is obtained by executing the depth recognition model training method.
[0013] This application provides an electronic device, the electronic device comprising:
[0014] a memory storing at least one instruction; and
[0015] a processor that executes the at least one instruction to implement the depth recognition model training method or the image depth recognition method.
[0016] This application provides a computer-readable storage medium storing at least one instruction, which is executed by a processor in an electronic device to implement the depth recognition model training method or the image depth recognition method.
[0017] As can be seen from the above technical solution, this application performs instance segmentation on the first image to obtain a first static object, multiple first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image. Multiple target dynamic objects are then selected from the multiple first dynamic objects based on the number of pixels of each first dynamic object and a preset position. Because this reduces the number of first dynamic objects to be processed, the training speed of the depth recognition network is improved. The application identifies whether each target dynamic object has a corresponding feature dynamic object, which makes it possible to select, in the second image, the feature dynamic object that is the same object as each target dynamic object. By calculating the dynamic pose matrix of each target dynamic object and its matching feature dynamic object, and comparing the dynamic pose matrix with the preset threshold matrix, it can be determined whether each target dynamic object in the first image is moving. The target image is generated based on the object state of the target dynamic object in the recognition object, the first dynamic position, and the first image; based on the first dynamic position, moving target dynamic objects in the first image can be filtered out. Since the positional change of a moving target dynamic object changes the depth values of the corresponding pixels in the initial depth image, filtering out moving target dynamic objects from the target image keeps those depth values out of the loss calculation, thus avoiding the influence of moving target dynamic objects on the loss value. The target image retains stationary target dynamic objects, preserving more image information from the first image. Therefore, the depth recognition model trained using the target image avoids the impact of moving target dynamic objects on training accuracy, thereby improving the recognition accuracy of the depth recognition model.
Attached Figure Description
[0018] Figure 1 is an application environment diagram provided by an embodiment of this application.
[0019] Figure 2 is a flowchart of a depth recognition model training method provided by an embodiment of this application.
[0020] Figure 3 is a schematic diagram of the pixel coordinate system and the camera coordinate system provided by an embodiment of this application.
[0021] Figure 4 is a flowchart of the image depth recognition method provided by an embodiment of this application.
[0022] Figure 5 is a schematic structural diagram of the electronic device provided by an embodiment of this application.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this application clearer, the application will be described in detail below with reference to the accompanying drawings and specific embodiments.
[0024] Figure 1 is an application environment diagram provided by an embodiment of this application. The depth recognition model training method and the image depth recognition method can be applied to one or more electronic devices 1. The electronic device 1 communicates with the imaging device 2, which can be a monocular camera or another device capable of capturing images.
[0025] The electronic device 1 is a device capable of automatically calculating parameter values and / or processing information according to pre-set or stored instructions. Its hardware includes, but is not limited to: microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
[0026] The electronic device 1 can be any electronic product that can interact with the user, such as a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, interactive network television (IPTV), smart wearable device, etc.
[0027] The electronic device 1 may further include network devices and / or user devices. The network devices include, but are not limited to, a single network server, a server group consisting of multiple network servers, or a cloud based on cloud computing consisting of a large number of hosts or network servers.
[0028] The network in which the electronic device 1 is located includes, but is not limited to: the Internet, wide area network, metropolitan area network, local area network, virtual private network (VPN), etc.
[0029] Figure 2 is a flowchart of a depth recognition model training method provided by an embodiment of this application. Depending on actual requirements, the order of the steps in the flowchart can be adjusted, and some steps can be omitted. The method is executed by an electronic device, such as the electronic device 1 shown in Figure 1.
[0030] 101. Obtain the first image and the second image.
[0031] In at least one embodiment of this application, the first image and the second image are adjacent frames of three-primary-color (Red, Green, Blue; RGB) images, and the second image is generated later than the first image. The first image and the second image may contain initial objects such as vehicles, the ground, pedestrians, the sky, and trees, and the two images contain the same initial objects.
[0032] In at least one embodiment of this application, the electronic device acquires the first image and the second image as follows:
[0033] The electronic device controls the shooting device to capture the target scene to obtain the first image, and captures the target scene again after a preset time interval to obtain the second image.
[0034] The shooting device can be a monocular camera, and the target scene can include vehicles, the ground, pedestrians, and other target objects. It is understood that the preset time interval is very short, for example, 10 ms.
[0035] 102. The first image is segmented based on the instance segmentation network to obtain the first static object, multiple first dynamic objects and the first dynamic position of each first dynamic object corresponding to the first image. The second image is then segmented based on the instance segmentation network to obtain the second static object and multiple second dynamic objects corresponding to the second image.
[0036] In at least one embodiment of this application, the first dynamic object and the second dynamic object refer to objects that can move, such as pedestrians and vehicles. The first static object and the second static object refer to objects that cannot move, such as trees and the ground.
[0037] In at least one embodiment of this application, the instance segmentation network includes a feature extraction layer, a classification layer, and a mapping layer. The electronic device performs instance segmentation on the first image based on the instance segmentation network to obtain a first static object, multiple first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image, including:
[0038] The electronic device standardizes the first image to obtain a standardized image, and then extracts features from the standardized image based on the feature extraction layer to obtain an initial feature map. The electronic device segments the standardized image based on the ratio (multiple) between the size of the initial feature map and the size of the standardized image and on the convolution stride in the feature extraction layer, to obtain a rectangular region corresponding to each pixel in the initial feature map. The electronic device classifies the initial feature map based on the classification layer to obtain the predicted probability that each pixel in the initial feature map belongs to a first preset category. It then determines the pixels whose predicted probabilities are greater than a preset threshold as multiple target pixels, and determines the multiple rectangular regions corresponding to the multiple target pixels as multiple feature regions. Based on the mapping layer, the electronic device maps each feature region onto the initial feature map to obtain the mapping region corresponding to each feature region in the initial feature map. The electronic device divides the multiple mapping regions based on a preset number to obtain multiple partitioned regions corresponding to each mapping region, determines the center pixel of each partitioned region, and calculates the pixel value of the center pixel. It then pools the multiple pixel values corresponding to the multiple center pixels to obtain a mapping probability value corresponding to each mapping region. The electronic device restores the multiple mapping regions and concatenates the restored mapping regions to obtain a target feature map. Finally, the electronic device generates the first static object corresponding to the first image, the multiple first dynamic objects, and the first dynamic position of each first dynamic object based on the target feature map, the mapping probability value, the restored multiple mapping regions, and a second preset category.
[0039] The standardization process includes cropping, and the standardized image is typically square. The feature extraction layer includes convolutional layers, batch normalization layers, pooling layers, and the like. For example, the feature extraction layer can be a VGG network with the fully connected layers removed. The pixel value of the center pixel is calculated using bilinear interpolation, which is an existing technique and is not described in detail here. The mapping layer can be an ROI Align layer.
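Since the text leaves bilinear interpolation to existing practice, a minimal sketch may help show what "calculating the pixel value of the center pixel" involves when the center falls between grid points. The function name `bilinear_sample`, the nested-list feature map, and the clamping of coordinates to the map border are assumptions of this sketch.

```python
import math
from typing import List


def bilinear_sample(feature_map: List[List[float]], y: float, x: float) -> float:
    """Interpolate a value at fractional (y, x) from the four surrounding pixels."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp to the valid range (an assumption; border handling is not specified).
    y = min(max(y, 0.0), height - 1)
    x = min(max(x, 0.0), width - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


if __name__ == "__main__":
    fmap = [[0.0, 2.0], [4.0, 6.0]]
    print(bilinear_sample(fmap, 0.5, 0.5))  # center of the four pixels
```

Sampling at the exact center of four pixels returns their mean, which is a quick sanity check for the weights.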
[0040] The first preset category can be customized. For example, the first preset category can be foreground or background. The classification layer can be a fully connected layer and a softmax layer. The preset threshold can be set by the user, and this application does not limit it. The preset number can be set by the user, and this application does not limit it. The second preset category can be set by the user according to the target objects appearing in the target scene, and this application does not limit it. For example, the second preset category may include, but is not limited to: cars, buses, roads, pedestrians, streetlights, sky, and buildings, etc.
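The selection of target pixels by predicted probability can be sketched as follows. This assumes a two-class (foreground, background) first preset category and per-pixel logits as input; the names `softmax` and `select_target_pixels`, the `foreground_index` parameter, and the example threshold are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def select_target_pixels(
    logit_map: List[List[List[float]]],
    threshold: float,
    foreground_index: int = 0,  # assumed position of the first preset category
) -> List[Tuple[int, int]]:
    """Return (row, col) of pixels whose predicted probability exceeds the threshold."""
    targets = []
    for row_idx, row in enumerate(logit_map):
        for col_idx, logits in enumerate(row):
            if softmax(logits)[foreground_index] > threshold:
                targets.append((row_idx, col_idx))
    return targets


if __name__ == "__main__":
    # One row, two pixels: the first favours foreground, the second background.
    logit_map = [[[2.0, 0.0], [0.0, 2.0]]]
    print(select_target_pixels(logit_map, threshold=0.5))
```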
[0041] In this embodiment, the instance segmentation network further includes a fully convolutional neural network, which is used to restore the multiple mapping regions.
[0042] Specifically, the electronic device segments the standardized image based on the ratio between the size of the initial feature map and the size of the standardized image, and the convolution stride in the feature extraction layer, to obtain a rectangular region corresponding to each pixel in the initial feature map, including:
[0043] The electronic device uses the product of the ratio and the convolution stride as both the width and the height to segment the standardized image, obtaining a rectangular region corresponding to each pixel in the initial feature map.
[0044] For example, if the size of the standardized image is 800*800, the size of the initial feature map is 32*32, and the convolution stride is 4, then the ratio between the size of the standardized image and the size of the initial feature map is 800/32 = 25, and the product of the ratio and the convolution stride is 100. The electronic device therefore divides the standardized image into rectangular regions of size 100*100, with 8 regions along each side.
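The arithmetic of this example can be reproduced with a short Python sketch. It only mirrors the worked numbers (region side = size ratio times convolution stride) and tiles a square image; the function name `rectangular_regions` and the `(top, left, bottom, right)` tuple layout are assumptions.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def rectangular_regions(image_size: int, feature_size: int, stride: int) -> Tuple[int, List[Region]]:
    """Tile a square image with squares whose side is (image/feature ratio) * stride."""
    ratio = image_size // feature_size
    side = ratio * stride
    regions = []
    for top in range(0, image_size, side):
        for left in range(0, image_size, side):
            bottom = min(top + side, image_size)
            right = min(left + side, image_size)
            regions.append((top, left, bottom, right))
    return side, regions


if __name__ == "__main__":
    side, regions = rectangular_regions(image_size=800, feature_size=32, stride=4)
    print(side, len(regions), regions[0], regions[-1])
```

With the numbers from the example, the side length is 100 and the tiling has 8 regions along each side of the 800*800 image.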
[0045] Specifically, the preset quantity includes a first preset quantity and a second preset quantity. The electronic device divides the multiple mapping regions based on the preset quantity to obtain multiple partitioned regions corresponding to each mapping region, including:
[0046] The electronic device divides each mapping region based on the first preset number to obtain multiple intermediate regions corresponding to each mapping region. Furthermore, the electronic device divides each intermediate region based on the second preset number to obtain multiple partitioned regions corresponding to each mapping region.
[0047] The first and second preset quantities can be set independently, and this application does not impose any restrictions on them. For example, the first preset quantity can be 7*7, and the second preset quantity can be 2*2. In that case, when the size of a mapping region is 14*14, the mapping region is divided into 7*7 intermediate regions, each 2*2 in size, and each intermediate region is further divided into 2*2 partitioned regions, each 1*1 in size.
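The two-level division can be sketched as computing the center coordinate of every partitioned region inside one mapping region. This is an illustrative sketch under stated assumptions: the function name `partition_centers`, the `(top, left, height, width)` description of a mapping region, and the row-major ordering of the result are not taken from the text.

```python
from typing import List, Tuple


def partition_centers(
    top: float,
    left: float,
    height: float,
    width: float,
    first: Tuple[int, int] = (7, 7),   # first preset quantity from the example
    second: Tuple[int, int] = (2, 2),  # second preset quantity from the example
) -> List[Tuple[float, float]]:
    """Centers (y, x) of all partitioned regions after the two-level division."""
    rows = first[0] * second[0]
    cols = first[1] * second[1]
    cell_h = height / rows
    cell_w = width / cols
    return [
        (top + (i + 0.5) * cell_h, left + (j + 0.5) * cell_w)
        for i in range(rows)
        for j in range(cols)
    ]


if __name__ == "__main__":
    centers = partition_centers(0.0, 0.0, 14.0, 14.0)
    print(len(centers), centers[0], centers[-1])
```

These centers are generally fractional coordinates, which is why the pixel value at each center is obtained by interpolation before the values are pooled into one mapping probability value per region.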
[0048] In this embodiment, the instance segmentation network also outputs the location of the first static object, the location of the second static object, the category of each target dynamic object, the category of the first static object, the category of the second static object, and the category of each feature dynamic object.
[0049] Through the above implementation method, the first image and the second image are segmented based on the instance segmentation network, and each initial object in the first image and the second image can be distinguished according to its position, so that each initial object can be processed based on its position.
[0050] Specifically, the electronic device generates a first static object corresponding to the first image, a plurality of first dynamic objects, and a first dynamic position of each first dynamic object based on the target feature map, the mapping probability value, the restored plurality of mapping regions, and the second preset category, including:
[0051] The electronic device classifies each pixel of the target feature map according to the mapping probability value and the second preset category to obtain the pixel category of each pixel in the restored mapping region. Further, the electronic device determines the region formed by multiple pixels corresponding to the same pixel category in the restored mapping region as a first object. Further still, the electronic device obtains the pixel coordinates of all pixels in the first object and determines the pixel coordinates as the first position corresponding to the first object. Further still, the electronic device divides the multiple first objects into multiple first dynamic objects and first static objects according to preset rules, and determines the first position corresponding to each first dynamic object as the first dynamic position.
[0052] The preset rules define movable initial objects, such as vehicles, people, or animals, as the multiple first dynamic objects, and immovable initial objects, such as plants or fixed objects, as the first static objects. For example, movable pedestrians, cats, dogs, bicycles, and cars are defined as the multiple first dynamic objects, while immovable initial objects, such as trees, streetlights, and buildings, are defined as the first static objects.
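A simplified sketch of grouping pixels into first objects and splitting them by the preset rule is shown below. It groups pixels purely by category name within one region, and the set `DYNAMIC_CATEGORIES` is drawn from the examples in the text; the function name `split_objects` and the input layout are assumptions.

```python
from typing import Dict, List, Tuple

Position = List[Tuple[int, int]]

# Movable categories taken from the examples in the text (an illustrative rule set).
DYNAMIC_CATEGORIES = {"pedestrian", "cat", "dog", "bicycle", "car"}


def split_objects(
    pixel_categories: List[List[str]],
) -> Tuple[Dict[str, Position], Dict[str, Position]]:
    """Group pixels of the same category into objects, then split dynamic from static."""
    positions: Dict[str, Position] = {}
    for row_idx, row in enumerate(pixel_categories):
        for col_idx, category in enumerate(row):
            # The pixel coordinates of an object serve as its position.
            positions.setdefault(category, []).append((row_idx, col_idx))
    dynamic = {c: p for c, p in positions.items() if c in DYNAMIC_CATEGORIES}
    static = {c: p for c, p in positions.items() if c not in DYNAMIC_CATEGORIES}
    return dynamic, static


if __name__ == "__main__":
    labels = [["car", "tree"], ["car", "sky"]]
    dynamic, static = split_objects(labels)
    print(dynamic)
    print(static)
```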
[0053] In this embodiment, the division method of the plurality of second dynamic objects is basically the same as the division method of the plurality of first dynamic objects, and the division method of the second static objects is basically the same as the division method of the first static objects, so this application will not elaborate further here.
[0054] 103. Select multiple target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and the preset position, and select multiple feature dynamic objects from the plurality of second dynamic objects based on the number of pixels of each second dynamic object and the preset position.
[0055] In at least one embodiment of this application, the electronic device selects a plurality of target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and a preset position, including:
[0056] The electronic device counts the number of pixels in each first dynamic object and sorts the plurality of first dynamic objects according to the number of pixels. Further, the electronic device selects the first dynamic object whose number of pixels after sorting is at the preset position as the plurality of target dynamic objects.
[0057] The preset positions can be set by the user. For example, the preset positions can be the first five.
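The selection by pixel count can be sketched as a sort followed by a slice. The text does not state the sort direction, so this sketch assumes descending order (largest objects first); the function name `select_target_objects` and the `top_k` parameter are hypothetical.

```python
from typing import Dict, List


def select_target_objects(pixel_counts: Dict[str, int], top_k: int = 5) -> List[str]:
    """Keep the objects ranked in the first top_k places by pixel count."""
    # Descending order is an assumption; the source only says "sorting".
    ranked = sorted(pixel_counts, key=pixel_counts.get, reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    counts = {"car_a": 1200, "pedestrian": 300, "car_b": 4500, "bicycle": 150}
    print(select_target_objects(counts, top_k=2))
```

Limiting the selection to a fixed number of objects bounds the work done in the later matching and pose steps, which is the stated reason for the speed-up.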
[0058] In this embodiment, the selection method of the plurality of feature dynamic objects is basically the same as the selection method of the plurality of target dynamic objects, therefore, this application will not elaborate on it here.
[0059] In at least one embodiment of this application, the generation process of the second static object is substantially the same as that of the first static object, and the generation process of the second dynamic objects is substantially the same as that of the first dynamic objects; therefore, this application will not elaborate further here.
[0060] Through the above implementation, the multiple target dynamic objects and the multiple feature dynamic objects are selected based on the number of pixels and the preset position. Because fewer first dynamic objects need to be processed, the training speed of the depth recognition network can be improved.
[0061] 104. Identify whether each target dynamic object has a corresponding feature dynamic object, and determine the target dynamic objects and feature dynamic objects that have a corresponding relationship as identification objects.
[0062] In at least one embodiment of this application, the electronic device identifies whether each target dynamic object has a corresponding feature dynamic object, including:
[0063] The electronic device acquires multiple target element information for each target dynamic object, and acquires the feature element information corresponding to each target element information from the feature dynamic objects of the same category. Further, the electronic device performs matching processing on each target element information and the corresponding feature element information to obtain the matching value between the target dynamic object and the feature dynamic object of the same category. If the matching value is within a preset range, the electronic device determines that the target dynamic object has a corresponding feature dynamic object.
[0064] The electronic device acquires the multiple target element information and the feature element information corresponding to each target element information based on a target tracking algorithm. Target tracking algorithms are existing technology and are not described in detail here. The preset range (interval) can be set independently, and this application does not impose any restrictions on it.
[0065] In this embodiment, the multiple target element information can be parameters of the features of the target dynamic object, and the multiple feature element information can be parameters of the features of dynamic objects of the same category. For example, when the target dynamic object is a car, the multiple target element information can be the car's size, texture, position, and outline, etc. Since the parameters of each target element information and its corresponding feature element information are different, the matching processing method is also different. The matching processing method includes subtraction, addition, weighted operations, etc. For example, both the target dynamic object in the first image and the feature dynamic object in the second image are cars. The car in the first image is 4.8 meters long and 1.65 meters wide, while the car in the second image is 4.7 meters long and 1.6 meters wide. Subtracting the length of the car in the first image (4.8 meters) from the length of the car in the second image (4.7 meters) yields a first matching value of 0.1 meters and a corresponding second matching value of 0.05 meters. When the first matching value corresponds to a first preset interval of [0, 0.12] and the second matching value corresponds to a second preset interval of [0, 0.07], since the first matching value is within the first preset interval and the second matching value is within the second preset interval, the car in the second image and the car in the first image are the same car.
[0066] Through the above implementation, multiple pieces of target element information of each target dynamic object and the feature element information corresponding to each piece of target element information in the feature dynamic object of the same category are obtained. Restricting the comparison to feature dynamic objects of the same category makes it faster to determine whether a feature dynamic object and a target dynamic object are the same object. By selecting multiple pieces of target element information and matching each one with the corresponding feature element information, the features of the target dynamic object and of the feature dynamic object of the same category are compared more comprehensively, which allows for reasonable errors and improves matching accuracy.
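As a non-limiting illustration, the subtraction-based matching described above can be sketched as follows. The element names (`length_m`, `width_m`), the function names, and the dictionary layout are hypothetical; the numeric values and preset intervals are taken from the car example, and only the subtraction variant of the matching processing is shown.

```python
# Hypothetical element names; values and intervals follow the car example above.
def matching_values(target_elements, feature_elements):
    """Absolute difference for each piece of element information."""
    return {name: abs(target_elements[name] - feature_elements[name])
            for name in target_elements}


def has_corresponding_object(target_elements, feature_elements, preset_intervals):
    """True when every matching value lies inside its preset interval."""
    values = matching_values(target_elements, feature_elements)
    return all(preset_intervals[name][0] <= value <= preset_intervals[name][1]
               for name, value in values.items())


car_in_first_image = {"length_m": 4.8, "width_m": 1.65}
car_in_second_image = {"length_m": 4.7, "width_m": 1.6}
preset_intervals = {"length_m": (0.0, 0.12), "width_m": (0.0, 0.07)}

same_car = has_corresponding_object(
    car_in_first_image, car_in_second_image, preset_intervals)
print(same_car)
```

A pair is accepted only when all matching values fall inside their intervals, so a single element that differs too much (for example, a much shorter vehicle) is enough to reject the correspondence.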
[0067] 105. Based on the dynamic pose matrix corresponding to the identified object, the static pose matrix corresponding to the first static object and the second static object, and the preset threshold matrix, identify the object state of the target dynamic object in the identified object.
[0068] In at least one embodiment of this application, the dynamic pose matrix refers to the transformation relationship from camera coordinates to world coordinates of the pixel corresponding to the recognition object, the camera coordinates of the pixel corresponding to the recognition object refer to the coordinates of each pixel in the camera coordinate system, and the static pose matrix refers to the transformation relationship from camera coordinates to world coordinates of the first static object and the second static object.
[0069] Figure 3 is a schematic diagram of the pixel coordinate system and the camera coordinate system provided in an embodiment of this application. The electronic device constructs the pixel coordinate system with the pixel Ouv in the first row and first column of the first image as the origin, the line along the first row of pixels as the u-axis, and the line along the first column of pixels as the v-axis. Furthermore, the electronic device constructs the camera coordinate system with the optical center OXY of the monocular camera as the origin, the optical axis of the monocular camera as the Z-axis, a line parallel to the u-axis of the pixel coordinate system as the X-axis, and a line parallel to the v-axis of the pixel coordinate system as the Y-axis.
[0070] In at least one embodiment of this application, the electronic device identifies the object state of the target dynamic object in the identified object based on the dynamic pose matrix corresponding to the identified object, the static pose matrices corresponding to the first static object and the second static object, and a preset threshold matrix, including:
[0071] The electronic device subtracts each element of the static pose matrix from the corresponding element of the dynamic pose matrix of the identified object to obtain multiple pose differences. Further, the electronic device takes the absolute value of each pose difference to obtain the absolute pose value for the corresponding element position. Further, the electronic device arranges the absolute pose values according to their element positions in the static pose matrix to obtain a pose absolute value matrix. Further, the electronic device compares each absolute pose value in the pose absolute value matrix with the corresponding pose threshold in the preset threshold matrix. If at least one absolute pose value in the pose absolute value matrix is greater than the corresponding pose threshold, the electronic device determines that the target dynamic object in the identified object is in a moving state; or, if all absolute pose values in the pose absolute value matrix are less than or equal to the corresponding pose thresholds, the electronic device determines that the target dynamic object in the identified object is in a stationary state.
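The comparison described above can be sketched as follows, as an illustration only. The matrices are plain nested lists, and the example pose values and the uniform threshold of 0.05 are hypothetical; this application does not specify particular threshold values.

```python
def object_state(dynamic_pose, static_pose, threshold_matrix):
    """Return 'moving' if any |dynamic - static| element exceeds its threshold."""
    # Pose absolute value matrix, arranged by element position.
    abs_matrix = [[abs(d - s) for d, s in zip(d_row, s_row)]
                  for d_row, s_row in zip(dynamic_pose, static_pose)]
    exceeds = any(a > th
                  for a_row, th_row in zip(abs_matrix, threshold_matrix)
                  for a, th in zip(a_row, th_row))
    return "moving" if exceeds else "stationary"


# Hypothetical 4x4 pose matrices and thresholds for illustration.
static_pose = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0],
               [0.0, 0.0, 0.0, 1.0]]
parked_car_pose = [[1.0, 0.0, 0.0, 0.01],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.02],
                   [0.0, 0.0, 0.0, 1.0]]
driving_car_pose = [[1.0, 0.0, 0.0, 0.4],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.6],
                    [0.0, 0.0, 0.0, 1.0]]
thresholds = [[0.05] * 4 for _ in range(4)]

print(object_state(parked_car_pose, static_pose, thresholds))
print(object_state(driving_car_pose, static_pose, thresholds))
```

Because the static pose matrix reflects only the motion of the camera, an object whose pose matrix stays within the thresholds of the static pose matrix is treated as not moving on its own.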
[0072] Specifically, the dynamic pose matrix is generated as follows:
[0073] The electronic device determines the pixel corresponding to the target dynamic object in the first image as the first pixel and the pixel corresponding to the feature dynamic object in the second image as the second pixel. Further, the electronic device obtains the first homogeneous coordinate matrix of the first pixel and the second homogeneous coordinate matrix of the second pixel, and obtains the inverse matrix of the intrinsic parameter matrix of the capturing device that captured the first and second images. Further, the electronic device calculates the first camera coordinates of the first pixel based on the first homogeneous coordinate matrix and the inverse matrix of the intrinsic parameter matrix, and calculates the second camera coordinates of the second pixel based on the second homogeneous coordinate matrix and the inverse matrix of the intrinsic parameter matrix. Further, the electronic device performs a calculation on the first camera coordinates and the second camera coordinates based on a preset epipolar constraint relationship to obtain a rotation matrix and a translation matrix. Further, the electronic device concatenates the rotation matrix and the translation matrix to obtain the target pose matrix.
[0074] Wherein, the first homogeneous coordinate matrix of the first pixel is a matrix with one more dimension than the pixel coordinate matrix, and the element value of the extra dimension is 1. The pixel coordinate matrix is a matrix generated based on the first pixel coordinates of the first pixel, where the first pixel coordinates refer to the coordinates of the first pixel in the pixel coordinate system. For example, if the first pixel coordinates of the first pixel in the pixel coordinate system are (u, v), then the pixel coordinate matrix of the first pixel is $[u, v]^{T}$, and the homogeneous coordinate matrix of the pixel is $[u, v, 1]^{T}$. The first homogeneous coordinate matrix is multiplied by the inverse of the intrinsic parameter matrix to obtain the first camera coordinates of the first pixel, and the second homogeneous coordinate matrix is multiplied by the inverse of the intrinsic parameter matrix to obtain the second camera coordinates of the second pixel.
[0075] The generation method of the second homogeneous coordinate matrix is basically the same as that of the first homogeneous coordinate matrix, and will not be described in detail here.
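As an illustration of the conversion from pixel coordinates to camera coordinates, the following sketch assumes a zero-skew pinhole intrinsic parameter matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], whose inverse has a simple closed form. The focal lengths, principal point, and pixel coordinates are hypothetical values, and the function names are not part of this application.

```python
def invert_intrinsics(fx, fy, cx, cy):
    """Closed-form inverse of K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def homogeneous(u, v):
    """Homogeneous coordinate matrix [u, v, 1]^T of a pixel."""
    return [float(u), float(v), 1.0]


def camera_coordinates(k_inv, pixel_h):
    """Multiply the inverse intrinsic matrix by the homogeneous pixel coordinates."""
    return [sum(k_inv[i][j] * pixel_h[j] for j in range(3)) for i in range(3)]


# Hypothetical intrinsics and pixel position.
K_INV = invert_intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
first_camera = camera_coordinates(K_INV, homogeneous(420, 290))
print(first_camera)
```

The result is a point on the normalized image plane (third component equal to 1); the same function applies unchanged to the second pixel.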
[0076] The dynamic pose matrix can be represented as:
[0077] $pose = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$;
[0078] Wherein, pose is the dynamic pose matrix, which is a 4x4 matrix; R is the rotation matrix, which is a 3x3 matrix; t is the translation matrix, which is a 3x1 matrix; and 0 denotes a 1x3 row of zeros.
[0079] The formulas for calculating the translation matrix and the rotation matrix are as follows:
[0080] $K^{-1}p_1\,(t \times R)\,(K^{-1}p_2)^{T} = 0$;
[0081] Wherein, $K^{-1}p_1$ represents the first camera coordinates, $K^{-1}p_2$ represents the second camera coordinates, $p_1$ represents the first homogeneous coordinate matrix, $p_2$ represents the second homogeneous coordinate matrix, $K^{-1}$ represents the inverse of the intrinsic parameter matrix, t represents the translation matrix, and R represents the rotation matrix.
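The following sketch illustrates the epipolar constraint and the concatenation of the rotation and translation into a 4x4 pose matrix. It makes two assumptions that this application does not state: the camera coordinates are treated as 3x1 column vectors, so the constraint is read as x1^T (t x R) x2 = 0 with x = K^-1 p, and the pose is taken to map points from the second camera frame to the first (X1 = R X2 + t). The rotation angle, translation, and 3D point are hypothetical, and the sketch only verifies the constraint for a known pose; it does not solve for R and t.

```python
import math


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to x equals t x x."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_vec(a, x):
    return [sum(a[i][k] * x[k] for k in range(len(x))) for i in range(len(a))]


def make_pose(rotation, translation):
    """Concatenate a 3x3 rotation and 3x1 translation into a 4x4 pose matrix."""
    rows = [rotation[i] + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


# Hypothetical motion: 5 degree rotation about the Y axis plus a translation.
angle = math.radians(5.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [0.5, 0.0, 0.1]

X2 = [1.0, -0.5, 8.0]                               # point in the second camera frame
X1 = [r + ti for r, ti in zip(mat_vec(R, X2), t)]   # same point in the first frame
x1 = [c / X1[2] for c in X1]                        # normalized camera coordinates
x2 = [c / X2[2] for c in X2]

E = mat_mul(skew(t), R)                             # the (t x R) term
residual = sum(a * b for a, b in zip(x1, mat_vec(E, x2)))
pose = make_pose(R, t)
print(residual)
```

In practice the rotation and translation are unknown and are estimated from many pixel correspondences that satisfy this constraint; the residual above is zero up to floating-point rounding because the correspondence was generated from the same pose.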
[0082] In this embodiment, the method of generating the static pose matrix is basically the same as that of generating the dynamic pose matrix, so this application will not elaborate on it here.
[0083] Through the above implementation, when there are multiple recognition objects, there are also multiple dynamic pose matrices. Since each dynamic pose matrix corresponds to one target dynamic object in the first image, the object state of the corresponding target dynamic object in the first image can be determined from each dynamic pose matrix, thereby distinguishing the object states of the multiple target dynamic objects.
[0084] 106. Generate a target image based on the object state, the first dynamic position, and the first image, and generate a target projection image based on the object state, the first dynamic position, and the initial projection image corresponding to the first image.
[0085] In at least one embodiment of this application, the target image refers to an image generated after processing the target dynamic object in the first image based on the first dynamic position and the object state.
[0086] In at least one embodiment of this application, the initial projected image represents an image of the transformation process, wherein the transformation process refers to the transformation process between the pixel coordinates of a pixel in the first image and the corresponding pixel coordinates in the second image.
[0087] In at least one embodiment of this application, the electronic device generating the target image based on the object state, the first dynamic position, and the first image includes:
[0088] If any target dynamic object in the identified objects is in a moving state, the electronic device performs masking processing on the target dynamic object in the first image based on the first dynamic position of the target dynamic object to obtain the target image; or, if all target dynamic objects in the identified objects are in a stationary state, the electronic device determines the first image as the target image.
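The target image generation described above can be sketched as follows. This is an illustration under stated assumptions: the image is a nested list of grayscale values, each first dynamic position is represented as a set of (row, column) pixel indices, and masking is implemented by setting the masked pixels to 0. This application does not specify the mask value or these data structures.

```python
def generate_target_image(first_image, identified_objects):
    """Mask moving objects; return the first image unchanged if none are moving."""
    moving = [obj for obj in identified_objects if obj["state"] == "moving"]
    if not moving:
        return first_image          # all stationary: the first image is the target image
    target = [row[:] for row in first_image]   # copy so the first image is preserved
    for obj in moving:
        for r, c in obj["position"]:
            target[r][c] = 0        # assumed mask value
    return target


# Hypothetical 3x4 image with one moving and one stationary object.
first_image = [[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]]
identified_objects = [
    {"state": "moving", "position": {(0, 1), (1, 1)}},
    {"state": "stationary", "position": {(2, 3)}},
]
target_image = generate_target_image(first_image, identified_objects)
print(target_image)
```

Only the pixels of the moving object are masked; the stationary object at (2, 3) keeps its original value, which preserves more of the image information in the first image.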
[0089] Specifically, the method for generating the initial projected image includes:
[0090] The electronic device acquires an initial depth image of the first image, acquires the target homogeneous coordinate matrix of each pixel in the first image, and acquires the depth value of each pixel in the first image from the initial depth image. Further, the electronic device calculates the projection coordinates of each pixel in the first image based on the target pose matrix, the target homogeneous coordinate matrix of each pixel, and the depth value of each pixel. Furthermore, the electronic device arranges each pixel according to the projection coordinates of each pixel to obtain the initial projection image.
[0091] The electronic device inputs the first image into the depth recognition network to obtain the initial depth image, where the depth value refers to the pixel value of each pixel in the initial depth image.
[0092] Specifically, the formula for calculating the projection coordinates of each pixel in the initial projected image is as follows:
[0093] $P = K \cdot pose \cdot Z \cdot K^{-1} \cdot H$;
[0094] Where P represents the projection coordinates of each pixel, K represents the intrinsic parameter matrix of the imaging device, pose represents the target pose matrix, $K^{-1}$ represents the inverse matrix of K, H represents the target homogeneous coordinate matrix of each pixel in the first image, and Z represents the depth value of the corresponding pixel in the initial depth image.
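A minimal sketch of the projection formula follows. Because K is 3x3 and the pose matrix is 4x4, the sketch assumes the usual handling of the dimensions: the back-projected point Z * K^-1 * H is extended with a 1 before the pose is applied, the first three components are kept, and the result of multiplying by K is divided by its third component to obtain pixel coordinates. These steps, the zero-skew intrinsics, and all numeric values are assumptions for illustration.

```python
def mat_vec(m, x):
    return [sum(m[i][j] * x[j] for j in range(len(x))) for i in range(len(m))]


def project_pixel(u, v, depth, k, k_inv, pose):
    """Projection coordinates of pixel (u, v) with depth value Z = depth."""
    h = [float(u), float(v), 1.0]                    # H: homogeneous pixel coordinates
    cam = [depth * c for c in mat_vec(k_inv, h)]     # Z * K^-1 * H: 3D point
    moved = mat_vec(pose, cam + [1.0])[:3]           # apply the 4x4 pose matrix
    proj = mat_vec(k, moved)                         # back to the image plane
    return proj[0] / proj[2], proj[1] / proj[2]


# Hypothetical intrinsics (fx = fy = 500, cx = 320, cy = 240) and poses.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
K_INV = [[1 / 500.0, 0.0, -320.0 / 500.0],
         [0.0, 1 / 500.0, -240.0 / 500.0],
         [0.0, 0.0, 1.0]]
IDENTITY_POSE = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
SHIFT_POSE = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

unchanged = project_pixel(400, 300, 10.0, K, K_INV, IDENTITY_POSE)
shifted = project_pixel(400, 300, 10.0, K, K_INV, SHIFT_POSE)
print(unchanged, shifted)
```

With an identity pose each pixel projects onto itself, while a sideways translation of the camera moves the projection horizontally by an amount that depends on the depth value, which is what ties the projection image to the predicted depth.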
[0095] In this embodiment, the target projection image includes multiple projection objects corresponding to the multiple target dynamic objects. The method of generating the target projection image based on the multiple projection objects is basically the same as the method of generating the target image, so this application will not elaborate on it.
[0096] Through the above implementation, when the object state of the target dynamic object in the identification object is moving, the target dynamic object can be accurately masked in the first image according to the first dynamic position of the target dynamic object, which avoids the influence of the moving dynamic object on the calculation of the loss value. When the object state of the target dynamic object in the identification object is stationary, the target dynamic object is retained in the first image, which retains more image information of the first image.
[0097] 107. Based on the gradient error between the initial depth image and the target image and the photometric error between the target projection image and the target image, the acquired depth recognition network is adjusted to obtain a depth recognition model.
[0098] In at least one embodiment of this application, the depth recognition model refers to a model generated after adjusting the depth recognition network.
[0099] In at least one embodiment of this application, the electronic device adjusts the depth recognition network based on the gradient error between the initial depth image and the target image and the photometric error between the target projection image and the target image to obtain a depth recognition model, including:
[0100] The electronic device calculates the depth loss value of the depth recognition network based on the gradient error and the photometric error. Furthermore, the electronic device adjusts the depth recognition network based on the depth loss value until the depth loss value is reduced to the minimum, thereby obtaining the depth recognition model.
[0101] The depth recognition network can be a deep neural network, and the depth recognition network can be obtained from a database on the Internet.
[0102] Specifically, the formula for calculating the depth loss value is as follows:
[0103] Lc = Lt + Ls;
[0104] Where Lc represents the depth loss value, Lt represents the photometric error, and Ls represents the gradient error.
[0105] The formula for calculating the photometric error is as follows:
[0106]
[0107] Where Lt represents the photometric error, α is a preset balance parameter, typically set to 0.85, SSIM(x, y) represents the structural similarity index between the target projection image and the target image, $\|x_i - y_i\|$ represents the grayscale difference between the target projection image and the target image, $x_i$ represents the pixel value of the i-th pixel in the target projection image, and $y_i$ represents the pixel value of the pixel corresponding to the i-th pixel in the target image. The calculation method of the structural similarity index is existing technology and is not described in detail here.
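For illustration, a photometric error built from the quantities defined above can be sketched as follows. The weighted form α(1 − SSIM(x, y))/2 + (1 − α)|x − y| used here is an assumption: it is a form widely used in self-supervised depth estimation and is consistent with the symbols defined above, but the formula is not reproduced in this text. The structural similarity is also simplified to whole-image statistics rather than local windows, and pixel values are assumed to lie in [0, 1].

```python
def mean(values):
    return sum(values) / len(values)


def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed from whole-image statistics (values in [0, 1])."""
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))


def photometric_error(projected, target, alpha=0.85):
    """Assumed form: alpha * (1 - SSIM) / 2 + (1 - alpha) * mean |x_i - y_i|."""
    l1 = mean([abs(a - b) for a, b in zip(projected, target)])
    return alpha * (1.0 - global_ssim(projected, target)) / 2.0 + (1.0 - alpha) * l1


# Hypothetical flattened grayscale images.
target_pixels = [0.10, 0.40, 0.35, 0.80, 0.55, 0.20]
projected_pixels = [0.12, 0.35, 0.40, 0.70, 0.60, 0.25]

print(photometric_error(target_pixels, target_pixels))
print(photometric_error(projected_pixels, target_pixels))
```

The error is zero, up to rounding, when the target projection image equals the target image and grows as the two images diverge in structure or brightness.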
[0108] The formula for calculating the gradient error is:
[0109]
[0110] Where Ls represents the gradient error, x represents the initial depth image, y represents the target image, D(u, v) represents the pixel coordinates of the i-th pixel in the initial depth image, and I(u, v) represents the pixel coordinates of the i-th pixel in the target image.
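The gradient error and the combined depth loss value Lc = Lt + Ls can be illustrated with the sketch below. The gradient error here assumes an edge-aware smoothness form, in which each depth gradient of D is weighted by exp(−|gradient of I|); this is a common choice in self-supervised depth estimation and is an assumption, since the formula is not reproduced in this text. Forward differences, averaging over all neighbor pairs, and the example images are likewise illustrative.

```python
import math


def gradient_error(depth, image):
    """Assumed edge-aware form: mean of |dD| * exp(-|dI|) over neighbor pairs."""
    rows, cols = len(depth), len(depth[0])
    terms = []
    for v in range(rows):
        for u in range(cols):
            if u + 1 < cols:   # horizontal neighbor
                d_depth = abs(depth[v][u + 1] - depth[v][u])
                d_image = abs(image[v][u + 1] - image[v][u])
                terms.append(d_depth * math.exp(-d_image))
            if v + 1 < rows:   # vertical neighbor
                d_depth = abs(depth[v + 1][u] - depth[v][u])
                d_image = abs(image[v + 1][u] - image[v][u])
                terms.append(d_depth * math.exp(-d_image))
    return sum(terms) / len(terms)


def depth_loss(photometric_error_value, gradient_error_value):
    """Lc = Lt + Ls."""
    return photometric_error_value + gradient_error_value


# Hypothetical 2x2 depth and grayscale images.
depth_with_edge = [[1.0, 3.0], [1.0, 3.0]]
image_flat = [[0.5, 0.5], [0.5, 0.5]]
image_with_edge = [[0.0, 1.0], [0.0, 1.0]]

ls_flat = gradient_error(depth_with_edge, image_flat)
ls_edge = gradient_error(depth_with_edge, image_with_edge)
print(ls_flat, ls_edge, depth_loss(0.2, ls_edge))
```

Under this assumed form, a depth discontinuity that coincides with an edge in the target image is penalized less than the same discontinuity in a uniform region, which discourages noisy depth while still allowing sharp object boundaries.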
[0111] Through the above implementation, the accuracy of the depth recognition model can be improved because the influence of moving dynamic objects on the calculation of the loss value of the depth recognition network is avoided.
[0112] Figure 4 is a flowchart of the image depth recognition method provided in an embodiment of this application.
[0113] 108. Obtain the image to be recognized.
[0114] In at least one embodiment of this application, the image to be identified refers to an image for which depth information needs to be identified.
[0115] In at least one embodiment of this application, the electronic device obtains the image to be identified from a preset database, which may be a KITTI database, a Cityscapes database, a vKITTI database, etc.
[0116] 109. The image to be identified is input into the depth recognition model to obtain the target depth image of the image to be identified and the depth information of the image to be identified. The depth recognition model is obtained by performing the depth recognition model training method as described above.
[0117] In at least one embodiment of this application, the target depth image refers to an image containing depth information of each pixel in the image to be identified, and the depth information of each pixel in the image to be identified refers to the distance between the object to be identified corresponding to each pixel in the image to be identified and the shooting device that captured the image to be identified.
[0118] In at least one embodiment of this application, the method of generating the target depth image is basically the same as the method of generating the initial depth image, so this application will not elaborate further.
[0119] In at least one embodiment of this application, the electronic device acquires the pixel value of each pixel in the target depth image as the depth information of the corresponding pixel in the image to be identified.
[0120] Through the above implementation, the accuracy of the depth recognition model is improved, thereby improving the accuracy of depth recognition for the image to be recognized.
[0121] As can be seen from the above technical solution, this application performs instance segmentation on the first image to obtain a first static object, multiple first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image. Based on the number of pixels and a preset position of each first dynamic object, multiple target dynamic objects are selected from the multiple first dynamic objects. This reduces the number of first dynamic objects to be processed, thus improving the training speed of the depth recognition network. By identifying whether each target dynamic object has a corresponding feature dynamic object, the feature dynamic object in the second image that is the same object as each target dynamic object can be selected. By calculating the dynamic pose matrix of each target dynamic object and its corresponding feature dynamic object, and comparing the dynamic pose matrix with the preset threshold matrix, it can be determined whether each target dynamic object in the first image is moving. The target image is generated based on the object state of the target dynamic object in the recognition object, the first dynamic position, and the first image. Based on the first dynamic position, moving target dynamic objects in the first image can be filtered out to generate the target image. Since the positional change of a moving target dynamic object causes a change in the depth value of the corresponding pixel in the initial depth image, filtering out moving target dynamic objects in the target image prevents that depth value from being used in the loss calculation, thus avoiding the influence of moving target dynamic objects on the loss calculation. The target image retains stationary target dynamic objects, preserving more image information from the first image.
Therefore, the depth recognition model trained using the target image avoids the impact of moving target dynamic objects on the training accuracy of the depth recognition model, thereby improving the recognition accuracy of the depth recognition model.
[0122] Figure 5 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.
[0123] In one embodiment of this application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and executable on the processor 13, such as an image depth recognition program and a depth recognition model training program.
[0124] Those skilled in the art will understand that the schematic diagram is merely an example of electronic device 1 and does not constitute a limitation on electronic device 1. It may include more or fewer components than shown in the diagram, or combine certain components, or different components. For example, electronic device 1 may also include input / output devices, network access devices, buses, etc.
[0125] The processor 13 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1, connecting various parts of the electronic device 1 through various interfaces and lines, and acquiring the operating system and installed applications and program code of the electronic device 1. For example, the processor 13 can acquire the first image captured by the imaging device 2 through an interface.
[0126] The processor 13 acquires the operating system and various installed applications of the electronic device 1. The processor 13 acquires these applications to implement the steps in the aforementioned embodiments of the depth recognition model training method and the image depth recognition method, for example, the steps shown in Figure 2 and Figure 4.
[0127] For example, the computer program may be divided into one or more modules / units, which are stored in the memory 12 and retrieved by the processor 13 to complete this application. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the process of retrieving the computer program from the electronic device 1.
[0128] The memory 12 can be used to store the computer programs and / or modules. The processor 13 implements various functions of the electronic device 1 by running or retrieving the computer programs and / or modules stored in the memory 12, and by calling the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the electronic device, etc. In addition, the memory 12 may include non-volatile memory, such as hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.
[0129] The memory 12 can be the external memory and / or internal memory of the electronic device 1. Furthermore, the memory 12 can be a physical memory, such as a memory module, a TF card (Trans-flash Card), etc.
[0130] If the modules / units integrated in the electronic device 1 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when the computer program is acquired by a processor, it can implement the steps of the various method embodiments described above.
[0131] The computer program includes computer program code, which may be in the form of source code, object code, accessible file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, and read-only memory (ROM).
[0132] With reference to Figure 2, the memory 12 in the electronic device 1 stores multiple instructions to implement a depth recognition model training method. The processor 13 can acquire the multiple instructions to: acquire a first image and a second image; perform instance segmentation on the first image based on an instance segmentation network to obtain a first static object, multiple first dynamic objects, and a first dynamic position of each first dynamic object corresponding to the first image, and perform instance segmentation on the second image based on the instance segmentation network to obtain a second static object and multiple second dynamic objects corresponding to the second image; select multiple target dynamic objects from the multiple first dynamic objects based on the number of pixels of each first dynamic object and a preset position, and select multiple feature dynamic objects from the multiple second dynamic objects based on the number of pixels of each second dynamic object and the preset position;
identify whether each target dynamic object has a corresponding feature dynamic object, and determine the target dynamic objects and feature dynamic objects with corresponding relationships as recognition objects; identify the object state of the target dynamic object in the recognition objects based on the dynamic pose matrix corresponding to the recognition object, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix; generate a target image based on the object state, the first dynamic position, and the first image, and generate a target projection image based on the object state, the first dynamic position, and the initial projection image corresponding to the first image; and adjust the acquired depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image, and the photometric error between the target projection image and the target image, to obtain a depth recognition model.
[0133] With reference to Figure 4, the memory 12 in the electronic device 1 stores multiple instructions to implement an image depth recognition method. The processor 13 can acquire the multiple instructions to: acquire an image to be recognized; and input the image to be recognized into a depth recognition model to obtain the target depth image of the image to be recognized and the depth information of the image to be recognized.
[0134] Specifically, for the processor 13's implementation of the above instructions, refer to the descriptions of the relevant steps in the embodiments corresponding to Figure 2 and Figure 4, which are not repeated here.
[0135] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0136] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0137] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0138] Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of this application is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within this application. No reference signs in the claims should be construed as limiting the scope of the claims.
[0139] Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices described in this application may also be implemented by a single unit or device through software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any specific order.
[0140] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit it. Although this application has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of this application without departing from the spirit and scope of the technical solutions of this application.
Claims
1. A deep recognition model training method, applied to electronic devices, characterized in that, The deep recognition model training method includes: Acquire a first image and a second image; the first image and the second image are adjacent frames and contain the same initial object; The first image is segmented based on the instance segmentation network to obtain the first static object, multiple first dynamic objects and the first dynamic position of each first dynamic object corresponding to the first image. The second image is then segmented based on the instance segmentation network to obtain the second static object and multiple second dynamic objects corresponding to the second image. Multiple target dynamic objects are selected from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and the preset position, and multiple feature dynamic objects are selected from the plurality of second dynamic objects based on the number of pixels of each second dynamic object and the preset position. 
Identify whether each target dynamic object has a corresponding feature dynamic object, and determine the target dynamic objects and feature dynamic objects that have a corresponding relationship as identification objects; Based on the dynamic pose matrix corresponding to the identified object, the static pose matrix corresponding to the first static object and the second static object, and the preset threshold matrix, the object state of the target dynamic object in the identified object is identified; Generating a target image based on the object state, the first dynamic position, and the first image includes: if the object state of any target dynamic object in the identified object is moving, performing masking processing on the target dynamic object in the first image based on the first dynamic position of the target dynamic object to obtain the target image; or, if the object state of all target dynamic objects in the identified object is stationary, determining the first image as the target image. Generate a target projection image based on the object state, the first dynamic position, and the initial projection image corresponding to the first image; Based on the gradient error between the initial depth image corresponding to the first image and the target image, and the photometric error between the target projection image and the target image, the acquired depth recognition network is adjusted to obtain a depth recognition model.
2. The depth recognition model training method as described in claim 1, characterized in that the instance segmentation network includes a feature extraction layer, a classification layer, and a mapping layer, and the step of performing instance segmentation on the first image based on the instance segmentation network to obtain the first static object, the multiple first dynamic objects, and the first dynamic position of each first dynamic object corresponding to the first image includes:
standardizing the first image to obtain a standardized image;
extracting features from the standardized image based on the feature extraction layer to obtain an initial feature map;
segmenting the standardized image, based on the ratio between the size of the initial feature map and the size of the standardized image and on the convolution stride in the feature extraction layer, to obtain a rectangular region corresponding to each pixel in the initial feature map;
classifying the initial feature map based on the classification layer to obtain the predicted probability that each pixel in the initial feature map belongs to a first preset category;
determining each pixel in the initial feature map whose predicted probability is greater than a preset threshold as a target pixel, and determining the multiple rectangular regions corresponding to the multiple target pixels as multiple feature regions;
mapping each feature region to the initial feature map based on the mapping layer to obtain the mapping region corresponding to each feature region in the initial feature map;
dividing each mapping region based on a preset number to obtain multiple divided regions corresponding to each mapping region;
determining the center pixel in each divided region and calculating the pixel value of the center pixel;
pooling the pixel values of the multiple center pixels to obtain the mapping probability value corresponding to each mapping region;
restoring the multiple mapping regions and concatenating the restored mapping regions to obtain a target feature map;
generating the first static object, the multiple first dynamic objects, and the first dynamic position of each first dynamic object corresponding to the first image based on the target feature map, the mapping probability value, the restored multiple mapping regions, and a second preset category.
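The divide, sample, and pool steps of claim 2 resemble region-of-interest pooling. The sketch below is one possible reading under stated assumptions, not the claimed implementation: the preset number is taken as a `bins` x `bins` grid, the center pixel value is read by nearest-neighbor lookup (the claim does not say how the value is calculated), and the pooling is max pooling (the claim does not name the pooling type). The function name `pooled_region_value` is hypothetical.

```python
def pooled_region_value(feature_map, region, bins=2):
    """Pool center-pixel samples of a mapping region into a single value.

    feature_map: 2D list of numbers.
    region: (top, left, bottom, right) in feature-map coordinates,
            with bottom and right exclusive.
    bins: assumed meaning of the "preset number": a bins x bins grid.
    """
    top, left, bottom, right = region
    bin_h = (bottom - top) / bins
    bin_w = (right - left) / bins
    samples = []
    for i in range(bins):
        for j in range(bins):
            # Center of the (i, j) divided region.
            cy = top + (i + 0.5) * bin_h
            cx = left + (j + 0.5) * bin_w
            # Nearest-neighbor lookup, clamped to the map (an assumption).
            r = min(int(cy), len(feature_map) - 1)
            c = min(int(cx), len(feature_map[0]) - 1)
            samples.append(feature_map[r][c])
    # Max pooling over the sampled center values (an assumption).
    return max(samples)


fm = [[r * 4 + c for c in range(4)] for r in range(4)]
print(pooled_region_value(fm, (0, 0, 4, 4)))
```

Swapping the nearest-neighbor lookup for bilinear interpolation, or `max` for an average, changes only the two lines marked as assumptions.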
3. The depth recognition model training method as described in claim 2, characterized in that the step of generating the first static object, the multiple first dynamic objects, and the first dynamic position of each first dynamic object corresponding to the first image based on the target feature map, the mapping probability value, the restored multiple mapping regions, and the second preset category includes:
classifying each pixel of the target feature map based on the mapping probability value and the second preset category to obtain the pixel category of each pixel in the restored mapping regions;
determining each region formed by multiple pixels of the same pixel category in the restored mapping regions as a first object;
obtaining the pixel coordinates of all pixels in each first object, and determining the pixel coordinates as the first position corresponding to that first object;
dividing the multiple first objects into the multiple first dynamic objects and the first static object according to preset rules, and determining the first position corresponding to each first dynamic object as the first dynamic position.
4. The depth recognition model training method as described in claim 1, characterized in that the step of selecting multiple target dynamic objects from the multiple first dynamic objects based on the number of pixels of each first dynamic object and the preset position includes:
counting the number of pixels in each first dynamic object;
sorting the multiple first dynamic objects according to the number of pixels;
selecting the first dynamic objects whose pixel counts fall at the preset position in the sorted order as the multiple target dynamic objects.
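The count, sort, and select steps of claim 4 can be sketched in a few lines of Python. The sketch assumes each dynamic object is represented by its list of pixel coordinates, that sorting is in descending order of pixel count, and that the preset position is a list of 0-based ranks. The claim fixes none of these details, and the name `select_by_pixel_count` is hypothetical.

```python
def select_by_pixel_count(objects, preset_positions):
    """Pick the objects whose pixel-count rank is in preset_positions.

    objects: dict mapping object name -> list of (row, col) pixels.
    preset_positions: 0-based ranks in descending pixel-count order
                      (both the direction and the indexing are assumptions).
    """
    # Rank object names from largest to smallest pixel count.
    ranked = sorted(objects, key=lambda name: len(objects[name]), reverse=True)
    # Ignore ranks beyond the number of available objects.
    return [ranked[p] for p in preset_positions if p < len(ranked)]


objs = {
    "car": [(0, 0)] * 5,
    "person": [(0, 0)] * 2,
    "truck": [(0, 0)] * 9,
}
print(select_by_pixel_count(objs, [0, 1]))
```

With ranks `[0, 1]`, the two largest objects are kept, which matches the intuition that large nearby objects affect depth estimation the most.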
5. The depth recognition model training method as described in claim 1, characterized in that the step of identifying whether each target dynamic object has a corresponding feature dynamic object includes:
obtaining multiple pieces of target element information for each target dynamic object, and obtaining the feature element information corresponding to each piece of target element information in the feature dynamic objects of the same category;
matching each piece of target element information with its corresponding feature element information to obtain the matching value between the target dynamic object and the feature dynamic object of the same category;
if the matching value is within a preset range, determining that a corresponding feature dynamic object exists for the target dynamic object.
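The matching logic of claim 5 can be illustrated as follows. The claim does not say what the element information contains or how the matching value is computed, so this sketch assumes numeric attributes keyed by name and uses the mean absolute difference as the matching value. The attribute names, the metric, and the preset range are all hypothetical.

```python
def matching_value(target_info, feature_info):
    """Mean absolute difference over shared element keys (an assumed metric)."""
    keys = target_info.keys() & feature_info.keys()
    if not keys:
        raise ValueError("no shared element information to compare")
    return sum(abs(target_info[k] - feature_info[k]) for k in keys) / len(keys)


def has_corresponding_object(target, features, preset_range=(0.0, 0.25)):
    """Return True if a same-category feature object matches within the range."""
    low, high = preset_range
    for feature in features:
        # Only objects of the same category are compared, as in the claim.
        if feature["category"] != target["category"]:
            continue
        if low <= matching_value(target["info"], feature["info"]) <= high:
            return True
    return False


target = {"category": "car", "info": {"area": 1.0, "aspect": 2.0}}
features = [
    {"category": "person", "info": {"area": 1.0, "aspect": 2.0}},
    {"category": "car", "info": {"area": 1.25, "aspect": 2.0}},
]
print(has_corresponding_object(target, features))
```

A target dynamic object that finds such a match, together with the matched feature dynamic object, would form an identification object in the sense of claim 1.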
6. The depth recognition model training method as described in claim 1, characterized in that the step of identifying the object state of the target dynamic object in the identification object based on the dynamic pose matrix corresponding to the identification object, the static pose matrix corresponding to the first static object and the second static object, and the preset threshold matrix includes:
subtracting the corresponding element in the dynamic pose matrix of the identification object from each element in the static pose matrix to obtain a pose difference;
taking the absolute value of each pose difference to obtain the pose absolute value for the corresponding element of the static pose matrix;
arranging the pose absolute values according to the element position of each pose absolute value in the static pose matrix to obtain a pose absolute value matrix;
comparing each pose absolute value in the pose absolute value matrix with the corresponding pose threshold in the preset threshold matrix;
if at least one pose absolute value in the pose absolute value matrix is greater than the corresponding pose threshold, determining that the object state of the target dynamic object in the identification object is moving; or
if all pose absolute values in the pose absolute value matrix are less than or equal to the corresponding pose thresholds, determining that the object state of the target dynamic object in the identification object is stationary.
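The element-wise comparison of claim 6 maps directly onto a short function. This is a minimal sketch that assumes all three matrices are nested lists of the same shape. The 2 x 2 size and the numeric values in the usage example are illustrative only, and the name `object_state` is hypothetical.

```python
def object_state(dynamic_pose, static_pose, thresholds):
    """Classify an object as "moving" or "stationary" per claim 6.

    Each argument is a matrix given as a list of rows; all three must
    have the same shape.
    """
    for s_row, d_row, t_row in zip(static_pose, dynamic_pose, thresholds):
        for s, d, t in zip(s_row, d_row, t_row):
            # Pose absolute value: |static element - dynamic element|.
            if abs(s - d) > t:
                # One element above its threshold is enough.
                return "moving"
    # Every pose absolute value is <= its threshold.
    return "stationary"


static = [[1.0, 0.0], [0.0, 1.0]]
dynamic = [[1.0, 0.5], [0.0, 1.0]]
limits = [[0.25, 0.25], [0.25, 0.25]]
print(object_state(dynamic, static, limits))
```

The comparison is strict: a pose absolute value exactly equal to its threshold still counts as stationary, matching the "less than or equal to" wording of the claim.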
7. The depth recognition model training method as described in claim 1, characterized in that the step of adjusting the depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image, and the photometric error between the target projection image and the target image, to obtain the depth recognition model includes:
calculating the depth loss value of the depth recognition network based on the gradient error and the photometric error;
adjusting the depth recognition network based on the depth loss value until the depth loss value reaches its minimum, thereby obtaining the depth recognition model.
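The training loop implied by claim 7 can be sketched as below. The claim does not give the formula that combines the two errors or the criterion for deciding that the loss has reached its minimum, so this sketch assumes a weighted sum and stops when the loss no longer improves by more than a tolerance. `depth_loss`, `train_until_minimum`, `step_fn`, the weights, and the tolerance are all hypothetical.

```python
def depth_loss(gradient_error, photometric_error,
               gradient_weight=1.0, photometric_weight=1.0):
    """Combine the two errors as a weighted sum (an assumed formula)."""
    return gradient_weight * gradient_error + photometric_weight * photometric_error


def train_until_minimum(step_fn, max_steps=1000, tolerance=1e-6):
    """Call step_fn until the loss stops improving; return the best loss.

    step_fn: callable that performs one network update and returns
             (gradient_error, photometric_error) for that step.
    """
    best = float("inf")
    for _ in range(max_steps):
        loss = depth_loss(*step_fn())
        if best - loss < tolerance:
            # No meaningful improvement: treat the loss as minimized.
            break
        best = loss
    return best


# Stand-in for a real training step: replay a fixed sequence of errors.
errors = iter([(4.0, 2.0), (2.0, 1.0), (1.0, 0.5), (1.0, 0.5), (0.1, 0.1)])
print(train_until_minimum(lambda: next(errors)))
```

In a real setup `step_fn` would run a forward pass, compute both errors against the target image, and apply one optimizer update to the depth recognition network.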
8. An image depth recognition method, characterized in that the image depth recognition method includes:
acquiring an image to be recognized;
inputting the image to be recognized into a depth recognition model to obtain a target depth image of the image to be recognized and depth information of the image to be recognized, wherein the depth recognition model is obtained by performing the depth recognition model training method as described in any one of claims 1 to 7.
9. An electronic device, characterized in that the electronic device includes:
a memory storing at least one instruction; and
a processor that executes the at least one instruction to implement the depth recognition model training method as described in any one of claims 1 to 7, or the image depth recognition method as described in claim 8.