Depth recognition model training method, image depth recognition method, and related device

By calculating the projection slope of a test object to generate a threshold interval, identifying the ground type, and adjusting the ground plane region, the method prevents uphill and downhill terrain from degrading the training accuracy of the depth recognition network and improves the recognition accuracy of the depth recognition model.

CN117542005B · Active · Publication Date: 2026-06-26 · HON HAI PRECISION INDUSTRY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HON HAI PRECISION INDUSTRY CO LTD
Filing Date
2022-07-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, sloped terrain regions in the training images reduce the training accuracy of depth recognition networks.

Method used

By calculating the pixel coordinates of the test object in the image, a test projection slope is generated, and a threshold interval is generated based on multiple projection slopes. The ground type is identified, the ground plane area is adjusted, the target height loss of the depth recognition network is generated, and the depth recognition network is adjusted to improve the recognition accuracy.

Benefits of technology

It improves the recognition accuracy of the depth recognition model, avoids the influence of pixels in uphill and downhill terrain on the target height loss, and enhances the accuracy of image recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to image processing and provides a depth recognition model training method, an image depth recognition method and related equipment. In the application, a test projection slope is calculated according to the coordinates of pixel points of a test object in an acquired test image, a threshold interval is generated according to a plurality of test projection slopes, a ground type is recognized according to an initial projection slope of an initial object in an acquired first image and the threshold interval, an initial ground plane region in the first image is adjusted according to the ground type and the pixel coordinates of the initial object, a target ground plane region is obtained, a target height loss is generated based on a preset depth recognition network, a shooting device, an initial depth image corresponding to the first image and the target ground plane region, the depth recognition network is adjusted based on the target height loss and a depth loss, a depth recognition model is obtained, a to-be-recognized image is input into the depth recognition model, and depth information is obtained. The application can improve the depth recognition accuracy of an image.

Description

Technical Field

[0001] This invention relates to the field of image processing, and more particularly to a method for training a depth recognition model, an image depth recognition method, and related equipment.

Background Technology

[0002] In current methods for depth recognition of vehicle images, the training accuracy of the depth recognition network is affected by terrain features such as slopes and inclines, resulting in low recognition accuracy of the trained model. Therefore, improving image recognition accuracy has become a pressing technical problem.

Summary of the Invention

[0003] In view of the above, it is necessary to provide a depth recognition model training method, an image depth recognition method, and related equipment to solve the technical problem of low accuracy in image depth recognition.

[0004] This application provides a depth recognition model training method, which includes: determining a test object from an acquired test image, and acquiring a first image and a second image obtained by the imaging device after capturing the initial object; calculating the test projection slope of the test object based on the coordinates of the pixel points of the test object in the test image; generating a threshold interval based on multiple test projection slopes; identifying the ground type corresponding to the location of the initial object based on the initial projection slope of the initial object in the first image and the threshold interval; adjusting the initial ground plane region in the first image based on the ground type and the pixel coordinates of the initial object to obtain a target ground plane region in the first image; generating a target height loss of the depth recognition network based on a preset depth recognition network, the imaging device, the initial depth image corresponding to the first image, and the target ground plane region; and adjusting the depth recognition network based on the target height loss and the depth loss generated based on the first image and the second image to obtain a depth recognition model.

[0005] According to an optional embodiment of this application, the step of calculating the test projection slope of the test object based on the coordinates of the pixels of the test object in the test image includes: obtaining the x-coordinate and y-coordinate values of each pixel in the test object; calculating the average x-coordinate of the multiple x-coordinate values and the average y-coordinate of the multiple y-coordinate values; calculating the x-coordinate difference between the x-coordinate of each pixel and the average x-coordinate, and the y-coordinate difference between the y-coordinate of each pixel and the average y-coordinate; counting the number of pixels in the test object; generating a covariance matrix according to a preset rule, the number of pixels, the multiple x-coordinate differences, and the multiple y-coordinate differences; performing singular value decomposition on the covariance matrix to obtain eigenvectors; determining the ratio of the first vector element to the second vector element of each eigenvector as a projection slope; and selecting the test projection slope from the projection slopes.

[0006] According to an optional embodiment of this application, generating a covariance matrix based on a preset rule, the number of pixels, multiple horizontal coordinate differences, and multiple vertical coordinate differences includes: calculating the horizontal coordinate variance based on the multiple horizontal coordinate differences and the number of pixels, calculating the vertical coordinate variance based on the multiple vertical coordinate differences and the number of pixels, calculating the covariance based on the number of pixels, the multiple horizontal coordinate differences, and the multiple vertical coordinate differences, and arranging the covariance, the horizontal coordinate variance, and the vertical coordinate variance according to the preset rule to obtain the covariance matrix.

[0007] According to an optional embodiment of this application, generating a threshold interval based on multiple test projection slopes includes: calculating the projection average and projection standard deviation of the multiple test projection slopes, calculating a configuration value based on the projection standard deviation, determining the difference between the projection average and the configuration value as a minimum threshold, determining the sum of the projection average and the configuration value as a maximum threshold, and determining the interval formed by the minimum threshold and the maximum threshold as the threshold interval.

[0008] According to an optional embodiment of this application, identifying the ground type corresponding to the location of the initial object based on the initial projection slope of the initial object in the first image and the threshold interval includes: if the initial projection slope is within the threshold interval, determining the ground type as flat ground; or if the initial projection slope is outside the threshold interval, determining the ground type as uphill or downhill.

[0009] According to an optional embodiment of this application, adjusting the initial ground plane region in the first image based on the ground type and the pixel coordinates of the initial object to obtain the target ground plane region in the first image includes: identifying a feature ground plane region corresponding to any initial object in the initial ground plane region based on the pixel coordinates of any initial object; if the ground type corresponding to any initial object is flat, determining the feature ground plane region as the target ground plane region; or, if the ground type corresponding to any initial object is uphill or downhill, performing masking processing on the feature ground plane region in the initial ground plane region to obtain the target ground plane region.

[0010] According to an optional embodiment of this application, generating the target height loss of the depth recognition network based on a preset depth recognition network, the shooting device, the initial depth image corresponding to the first image, and the target ground plane region includes: obtaining the real-world height from the optical center of the shooting device to the target ground plane region; constructing a camera coordinate system based on the first image and the shooting device; calculating the projected height based on the coordinates of each ground pixel in the target ground plane region in the first image in the camera coordinate system; and calculating the target height loss based on the pixel coordinates of the pixels in the initial depth image, the projected height, and the real-world height.

[0011] According to an optional embodiment of this application, the step of calculating the projection height based on the coordinates of each ground pixel in the target ground plane region in the camera coordinate system includes: obtaining the coordinates of any ground pixel in the target ground plane region in the camera coordinate system; calculating a unit normal vector based on the coordinates of the any ground pixel; determining the vector formed by taking the optical center of the shooting device as the starting point and each ground pixel as the ending point as the target vector of the ground pixel; calculating the projection distance corresponding to each ground pixel based on the target vector of each ground pixel and the unit normal vector; and performing a weighted average calculation on the projection distances corresponding to all ground pixels to obtain the projection height.

[0012] This application provides an image depth recognition method, which includes: acquiring an image to be recognized, inputting the image to be recognized into a depth recognition model, obtaining a target depth image of the image to be recognized and depth information of the image to be recognized, wherein the depth recognition model is obtained by performing the depth recognition model training method as described above.

[0013] This application provides an electronic device, the electronic device comprising:

[0014] Memory, storing at least one instruction; and

[0015] The processor executes at least one instruction to implement the depth recognition model training method or the image depth recognition method.

[0016] This application provides a computer-readable storage medium storing at least one instruction, which is executed by a processor in an electronic device to implement the depth recognition model training method and the image depth recognition method.

[0017] In summary, in this application, the test projection slope of the test object is calculated based on the coordinates of the pixel points of the test object in the test image, and a threshold interval is generated based on multiple test projection slopes. Since the ground type corresponding to the location of the test object is flat, the threshold interval provides a reference range for the initial projection slope of the initial object. Generating the interval from multiple test projection slopes prevents an extreme value of any single test projection slope from dominating, which improves the rationality of the threshold interval. Based on the initial projection slope of the initial object in the first image and the threshold interval, it is identified whether the ground type corresponding to the location of the initial object is uphill or downhill. Then, based on the ground type and the pixel coordinates of the initial object, the initial ground plane region in the first image is adjusted, which filters out the regions corresponding to initial objects that are on uphill or downhill ground, so that the target ground plane region does not contain uphill or downhill regions. Since the target height loss of the depth recognition network is calculated using the target ground plane region, the influence of pixel value changes in uphill and downhill ground planes on the target height loss is avoided. Therefore, the recognition accuracy of the trained depth recognition model is higher, thereby improving the recognition accuracy of the image.

Attached Figure Description

[0018] Figure 1 is an application environment diagram provided by an embodiment of this application.

[0019] Figure 2 is a flowchart of a depth recognition model training method provided in an embodiment of this application.

[0020] Figure 3 is a schematic diagram of the pixel coordinate system and camera coordinate system provided in an embodiment of this application.

[0021] Figure 4 is a flowchart of the image depth recognition method provided in an embodiment of this application.

[0022] Figure 5 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this application clearer, the application will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0024] Figure 1 shows the application environment provided by an embodiment of this application. The depth recognition model training method and image depth recognition method provided in this application can be applied to one or more electronic devices 1. The electronic device 1 communicates with a shooting device 2, which can be a monocular camera or another device capable of shooting. The electronic device 1 and shooting device 2 shown in Figure 1 are for illustrative purposes only.

[0025] The electronic device 1 is a device capable of automatically calculating parameter values and/or processing information according to pre-set or stored instructions. Its hardware includes, but is not limited to: microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

[0026] The electronic device 1 can be any electronic product that can interact with the user, such as a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, interactive network television (IPTV), smart wearable device, etc.

[0027] The electronic device 1 may further include network devices and/or user devices. The network devices include, but are not limited to, a single network server, a server group consisting of multiple network servers, or a cloud-computing-based cloud consisting of a large number of hosts or network servers.

[0028] The network in which the electronic device 1 is located includes, but is not limited to: the Internet, wide area network, metropolitan area network, local area network, virtual private network (VPN), etc.

[0029] Figure 2 is a flowchart of a depth recognition model training method provided in an embodiment of this application. The order of the steps in the flowchart can be adjusted according to actual detection requirements, and some steps can be omitted. The execution subject of the method is an electronic device, such as the electronic device 1 shown in Figure 1.

[0030] 101. Determine the test object from the acquired test images, and acquire the first image and the second image obtained by the shooting device after shooting the initial object.

[0031] In at least one embodiment of this application, the shooting device may be a monocular camera, the first image and the second image are adjacent frames of red-green-blue (RGB) three-primary-color images, and the second image is generated later than the first image.

[0032] In at least one embodiment of this application, the electronic device determines the test object from the acquired test image by:

[0033] The electronic device acquires an instance segmentation network and acquires a test image. Further, the electronic device uses the instance segmentation network to perform instance segmentation on the test image to obtain the test object.

[0034] The instance segmentation network can be Mask R-CNN, YOLACT, PolarMask, etc. Since instance segmentation networks are existing technologies, they will not be described in detail here. The test image includes a horizontal ground plane and the test object, where the test object refers to an object in the test image that is on the horizontal ground plane. For example, the test object can be a vehicle on the horizontal ground plane.

[0035] In at least one embodiment of this application, the electronic device acquiring a first image and a second image obtained after the shooting device takes a picture of the initial object includes:

[0036] The electronic device controls the shooting device to capture multiple initial objects to obtain the first image, and then captures the multiple initial objects again after a preset time interval to obtain the second image.

[0037] The shooting device can be a monocular camera, and the multiple initial objects can be vehicles, ground, pedestrians, sky, and trees. It is understood that the preset time is very short, for example, 10 ms.

[0038] In this embodiment, the multiple initial objects are photographed after a preset time interval to obtain the second image. Since the preset time is very short, the distance that the initial objects can move within the short preset time is small. Therefore, the second image and the first image can contain more identical initial objects.

[0039] 102. Calculate the test projection slope of the test object based on the coordinates of the pixel points of the test object in the test image.

[0040] In at least one embodiment of this application, the test projection slope refers to the degree of inclination of the position of the test object relative to the horizontal ground plane.

[0041] In at least one embodiment of this application, the electronic device calculates the test projection slope of the test object based on the coordinates of the pixel points of the test object in the test image, including:

[0042] The electronic device acquires the x-coordinate and y-coordinate values of each pixel in the test object. The electronic device then calculates the average x-coordinate of the multiple x-coordinate values and the average y-coordinate of the multiple y-coordinate values, and calculates the x-coordinate difference between the x-coordinate of each pixel and the average x-coordinate and the y-coordinate difference between the y-coordinate of each pixel and the average y-coordinate. The electronic device counts the number of pixels in the test object. Further, the electronic device generates a covariance matrix based on a preset rule, the number of pixels, the multiple x-coordinate differences, and the multiple y-coordinate differences, and performs singular value decomposition on the covariance matrix to obtain eigenvectors. Finally, the electronic device determines the ratio of the first element to the second element of each eigenvector as a projection slope, and selects the test projection slope from the projection slopes.

[0043] Wherein, the horizontal coordinate value and the vertical coordinate value refer to the coordinate value of each pixel in the test image in the pixel coordinate system corresponding to the test image. The construction process of the pixel coordinate system corresponding to the test image is basically the same as the construction process of the pixel coordinate system corresponding to the first image below, so this application will not repeat it here.

[0044] Specifically, the electronic device generates a covariance matrix based on preset rules, the number of pixels, multiple horizontal coordinate differences, and multiple vertical coordinate differences, including:

[0045] The electronic device calculates the variance of the horizontal coordinates based on the plurality of horizontal coordinate differences and the number of pixels, and calculates the variance of the vertical coordinates based on the plurality of vertical coordinate differences and the number of pixels. Further, the electronic device calculates the covariance based on the number of pixels, the plurality of horizontal coordinate differences, and the plurality of vertical coordinate differences. Even further, the electronic device arranges the covariance, the horizontal coordinate variance, and the vertical coordinate variance according to the preset rules to obtain the covariance matrix.

[0046] The preset rule includes using the variance values of the horizontal and vertical coordinates as matrix elements on the main diagonal and the covariance value as the matrix elements on the secondary diagonal.

[0047] In this embodiment, since there are multiple eigenvectors, there are also multiple projection slopes, some greater than zero and some less than zero. Since the coordinate value of each pixel in the test image is greater than zero in the pixel coordinate system corresponding to the test image, the electronic device selects a projection slope that is greater than zero as the test projection slope.

[0048] For example, the test object has a total of 5 pixels, and the pixel coordinates of the 5 pixels are shown in Table 1:

[0049] Table 1. Pixel coordinates

[0050]

[0051] Specifically, the formula for calculating the variance of the abscissa is as follows:

[0052]

[0053] Where Var(x) represents the variance of the horizontal coordinate, n represents the number of pixels, and $x_i$ represents the i-th horizontal coordinate difference. The formula for the variance of the vertical coordinate has the same form and is not repeated here. According to the above formula, the variance of the horizontal coordinate is 0.24 and the variance of the vertical coordinate is 0.56.

[0054] The formula for calculating the covariance value is as follows:

[0055]

[0056] Wherein, cov(x, y) represents the covariance value, n represents the number of pixels, $x_i$ represents the i-th horizontal coordinate difference, and $y_i$ represents the i-th vertical coordinate difference. According to the above formula, the covariance value is 0.12. Taking the variance values of the horizontal coordinate (0.24) and the vertical coordinate (0.56) as the elements on the main diagonal, and the covariance value 0.12 as the elements on the secondary diagonal, the covariance matrix is obtained: $Q = \begin{pmatrix} 0.24 & 0.12 \\ 0.12 & 0.56 \end{pmatrix}$. Performing singular value decomposition on the covariance matrix Q yields a first eigenvector and a second eigenvector. The first ratio, of the first vector element to the second vector element of the first eigenvector, is approximately +0.33, and the second ratio, of the first vector element to the second vector element of the second eigenvector, is approximately -3. Since the coordinate values of all 5 pixels are greater than zero, the first ratio of +0.33 is selected as the test projection slope of the test object.
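The worked example above can be checked numerically with a minimal sketch, given here under stated assumptions rather than as the applicant's implementation. Table 1 is not reproduced in this text, so `HYPOTHETICAL_PIXELS` is an invented set of five pixel coordinates chosen only because it yields the stated values 0.24, 0.56, and 0.12. The function names, and the choice to divide by n rather than n - 1, are likewise assumptions.

```python
import numpy as np


def projection_slopes(pixels):
    """Return the covariance matrix Q and one slope candidate per eigenvector."""
    pts = np.asarray(pixels, dtype=float)
    n = pts.shape[0]                      # number of pixels in the object
    dx = pts[:, 0] - pts[:, 0].mean()     # horizontal coordinate differences
    dy = pts[:, 1] - pts[:, 1].mean()     # vertical coordinate differences
    var_x = np.sum(dx * dx) / n           # assumption: normalize by n
    var_y = np.sum(dy * dy) / n
    cov_xy = np.sum(dx * dy) / n
    # Variances on the main diagonal, covariance on the secondary diagonal.
    q = np.array([[var_x, cov_xy], [cov_xy, var_y]])
    # Q is symmetric positive semi-definite, so the columns of U from the
    # singular value decomposition are its eigenvectors.
    u, _, _ = np.linalg.svd(q)
    # Ratio of first to second element; this is infinite (with a NumPy
    # warning) if the second element is exactly zero.
    slopes = [float(u[0, k] / u[1, k]) for k in range(2)]
    return q, slopes


def select_test_slope(slopes):
    """Pick the positive slope candidate, or None if there is none."""
    positive = [s for s in slopes if s > 0]
    return positive[0] if positive else None


# Hypothetical coordinates, not the values of Table 1.
HYPOTHETICAL_PIXELS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]
q, slopes = projection_slopes(HYPOTHETICAL_PIXELS)
print(q)
print(slopes, select_test_slope(slopes))
```

With these assumed coordinates the two slope candidates are about +0.33 and -3, and the positive one is selected, which matches the example in the text.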

[0057] In this embodiment, since the test projection slope refers to the degree of inclination of the position of the test object relative to the horizontal ground plane, the ground type of the initial object can be preliminarily determined based on the test projection slope.

[0058] 103. Generate a threshold interval based on multiple test projection slopes.

[0059] In at least one embodiment of this application, the threshold interval refers to the range of the initial projection slope of a test object on a horizontal ground plane.

[0060] In at least one embodiment of this application, the electronic device generates a threshold range based on a plurality of test projection slopes, including:

[0061] The electronic device calculates the average projection value and standard deviation of the multiple test projection slopes, and calculates a configuration value based on the standard deviation of the projection. Further, the electronic device determines the difference between the average projection value and the configuration value as a minimum threshold, and determines the sum of the average projection value and the configuration value as a maximum threshold. Even further, the electronic device determines the interval formed by the minimum threshold and the maximum threshold as the threshold interval.

[0062] The configuration value can be a multiple of the projection standard deviation. For example, the configuration value can be twice the projection standard deviation.

[0063] Through the above implementation method, the minimum threshold and the maximum threshold are generated based on the average projection value and the configuration value, and the interval formed by the minimum threshold and the maximum threshold is determined as the threshold interval. This can expand the threshold interval, thereby improving the fault tolerance of the threshold interval. Since the average projection value and the configuration value can reduce the error of the multiple test projection slopes, the rationality of the threshold interval can be improved.
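The threshold interval computation can be sketched as follows. This is an illustration only: the function name `threshold_interval` and the parameter `k` are hypothetical, the default `k = 2.0` follows the "twice the projection standard deviation" example, and the use of the population standard deviation is an assumption because the text does not say which form is used.

```python
import statistics


def threshold_interval(test_slopes, k=2.0):
    """Return (minimum threshold, maximum threshold) for a list of slopes."""
    mean = statistics.fmean(test_slopes)    # projection average
    std = statistics.pstdev(test_slopes)    # assumption: population std
    config = k * std                        # configuration value
    return mean - config, mean + config


# Hypothetical test projection slopes measured on flat ground.
low, high = threshold_interval([0.30, 0.34, 0.32, 0.36, 0.28])
print(low, high)
```

A larger `k` widens the interval, which makes it more tolerant of noisy slope estimates but also more likely to accept a gentle slope as flat.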

[0064] 104. Based on the initial projection slope of the initial object in the first image and the threshold range, identify the ground type corresponding to the location of the initial object.

[0065] In at least one embodiment of this application, the ground type includes at least flat ground and slopes, wherein flat ground refers to a horizontal ground plane and slopes refer to ground planes that have an angle of inclination with respect to the horizontal ground plane.

[0066] In at least one embodiment of this application, the electronic device identifies the ground type corresponding to the location of the initial object based on the initial projection slope of the initial object in the first image and the threshold interval, including:

[0067] If the initial projection slope is within the threshold range, the electronic device determines that the ground type is flat; or, if the initial projection slope is outside the threshold range, the electronic device determines that the ground type is uphill or downhill.

[0068] Through the above implementation, whether the initial object is located on an uphill or downhill slope is identified based on the initial projection slope of the initial object in the first image and the threshold interval. Since the threshold interval has higher fault tolerance and rationality, the ground type of the initial object can be accurately determined.
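A minimal sketch of this decision is given below. The labels `"flat"` and `"slope"` and the function name are hypothetical, and treating the interval endpoints as inclusive is an assumption, since the text only distinguishes "within" from "outside" the interval.

```python
def classify_ground(initial_slope, interval):
    """Label the ground under an object from its initial projection slope."""
    minimum, maximum = interval
    # Assumption: the endpoints count as inside the interval.
    if minimum <= initial_slope <= maximum:
        return "flat"
    return "slope"  # uphill or downhill; the text does not separate the two


print(classify_ground(0.33, (0.26, 0.38)))
print(classify_ground(0.90, (0.26, 0.38)))
```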

[0069] 105. Adjust the initial ground plane region in the first image according to the ground type and the pixel coordinates of the initial object to obtain the target ground plane region in the first image.

[0070] Figure 3 is a schematic diagram of the pixel coordinate system and camera coordinate system provided in an embodiment of this application. The electronic device takes the pixel O_uv in the first row and first column of the first image as the origin and constructs a pixel coordinate system with the line along the first row of pixels as the u-axis and the line along the first column of pixels as the v-axis. For example, the first pixel in the upper left corner can be used as the origin. Furthermore, the electronic device takes the optical center O_XY of the monocular camera as the origin and constructs the camera coordinate system with the optical axis of the monocular camera as the X-axis, the line parallel to the u-axis of the pixel coordinate system of the first image as the Y-axis, and the line parallel to the v-axis of the pixel coordinate system of the first image as the Z-axis.

[0071] In at least one embodiment of this application, the initial ground plane region refers to the ground plane region generated after segmenting the first image using a ground plane segmentation network, which can be obtained from an internet database. For example, the ground plane segmentation network can be a high-resolution network (High-Resolution Net v2, HRNetv2).

[0072] In at least one embodiment of this application, the electronic device adjusts the initial ground plane region in the first image according to the ground type and the pixel coordinates of the initial object to obtain the target ground plane region in the first image, including:

[0073] The electronic device identifies the feature ground plane region corresponding to any initial object in the initial ground plane region based on the pixel coordinates of that initial object. If the ground type corresponding to the initial object is flat, the electronic device determines the feature ground plane region as the target ground plane region. Alternatively, if the ground type corresponding to the initial object is uphill or downhill, the electronic device performs masking processing on the feature ground plane region in the initial ground plane region to obtain the target ground plane region.

[0074] Specifically, the electronic device identifies the feature ground plane region corresponding to any initial object in the initial ground plane region based on the pixel coordinates of that initial object, including:

[0075] The electronic device acquires the ground plane coordinates of each ground plane pixel in the initial ground plane region in the pixel coordinate system of the first image, and acquires the initial pixel coordinates of each initial pixel of any initial object in the pixel coordinate system of the first image. Further, the electronic device calculates the pixel distance between each initial pixel coordinate and the origin of the pixel coordinate system of the first image, and determines the initial pixel coordinates with the largest pixel distance. Further, the electronic device determines the region formed by the ground plane pixels whose ground plane coordinates lie between the origin and those farthest initial pixel coordinates as the feature ground plane region.

[0076] In this embodiment, the region formed by the ground plane pixels whose ground plane coordinates lie between the origin and the farthest initial pixel coordinates is defined as the feature ground plane region, which makes it possible to accurately locate the part of the ground plane in the first image that is on an uphill or downhill slope.
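One possible reading of this step is sketched below with boolean masks. Several points are assumptions and not statements of the method: "between the origin and the farthest pixel" is interpreted as the axis-aligned rectangle from the origin to that pixel, the per-object results are combined by removing every sloped object's feature region from the initial ground mask, and all function names and the `"flat"`/`"slope"` labels are hypothetical.

```python
import numpy as np


def feature_ground_region(ground_mask, object_mask):
    """Ground pixels lying between the origin and the object's farthest pixel."""
    vs, us = np.nonzero(object_mask)           # rows are v, columns are u
    if vs.size == 0:
        return np.zeros_like(ground_mask, dtype=bool)
    farthest = np.argmax(np.hypot(us, vs))     # largest distance to (0, 0)
    u_far, v_far = us[farthest], vs[farthest]
    box = np.zeros_like(ground_mask, dtype=bool)
    box[: v_far + 1, : u_far + 1] = True       # assumption: rectangular extent
    return ground_mask & box


def adjust_ground_region(ground_mask, object_masks, ground_types):
    """Mask out the feature regions of objects standing on sloped ground."""
    target = ground_mask.copy()
    for object_mask, ground_type in zip(object_masks, ground_types):
        if ground_type != "flat":
            target &= ~feature_ground_region(ground_mask, object_mask)
    return target


# Tiny hypothetical example: a 4x4 image whose two bottom rows are ground.
ground = np.zeros((4, 4), dtype=bool)
ground[2:, :] = True
vehicle = np.zeros((4, 4), dtype=bool)
vehicle[1, 1] = vehicle[2, 2] = True
print(adjust_ground_region(ground, [vehicle], ["slope"]).astype(int))
```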

[0077] 106. Based on the preset depth recognition network, the shooting device, the initial depth image corresponding to the first image, and the target ground plane region, generate the target height loss of the depth recognition network.

[0078] In at least one embodiment of this application, the depth recognition network refers to a network capable of recognizing depth information in an image.

[0079] In at least one embodiment of this application, the initial depth image refers to an image containing the depth information of the first image, wherein the depth information refers to the distance between the initial object corresponding to each pixel in the first image and the shooting device that captured the first image.

[0080] In at least one embodiment of this application, the target height loss refers to the difference between the predicted height and the real-world height, the predicted height refers to the distance between each pixel in the first image predicted by the depth recognition network and the shooting device, and the real-world height refers to the distance between the initial object corresponding to the pixel in the first image and the shooting device in reality.

[0081] In at least one embodiment of this application, the electronic device generates the target height loss of the depth recognition network based on the preset depth recognition network, the shooting device, the initial depth image corresponding to the first image, and the target ground plane region, including:

[0082] The electronic device acquires the real-world height from the optical center of the shooting device to the target ground plane region. Further, the electronic device constructs a camera coordinate system based on the first image and the shooting device. Further still, the electronic device calculates the projected height based on the coordinates of each ground pixel in the target ground plane region in the first image in the camera coordinate system. Further still, the electronic device calculates the target height loss based on the pixel coordinates of the pixels in the initial depth image, the projected height, and the real-world height.

[0083] The electronic device inputs the first image into the depth recognition network to obtain the initial depth image.

[0084] Specifically, the electronic device calculates the projection height in the camera coordinate system based on the coordinates of each ground pixel in the target ground plane region of the first image, including:

[0085] The electronic device acquires the coordinates of any ground pixel in the target ground plane area in the camera coordinate system. Based on the coordinates of any ground pixel, the electronic device calculates a unit normal vector. Further, the electronic device determines the target vector of the ground pixel as the vector formed by the optical center of the shooting device as the starting point and each ground pixel as the ending point. Based on the target vector of each ground pixel and the unit normal vector, the electronic device calculates the projection distance corresponding to each ground pixel. Further, the electronic device performs a weighted average calculation on the projection distances corresponding to all ground pixels to obtain the projection height.

[0086] The formula for calculating the unit normal vector is as follows:

[0087] $N_t = (P_t P_t^T)^{-1} P_t$;

[0088] Where $N_t$ refers to the unit normal vector, $P_t$ refers to the coordinates of any ground pixel in the target ground plane region within the camera coordinate system, and $P_t^T$ refers to the target vector.

[0089] In this embodiment, the projection height refers to the weighted average of multiple projection distances between each pixel in the first image and the shooting device. Since the coordinates of all pixels in the ground plane area are included in the calculation, the projection height can be made more accurate.
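
The projection-height computation of paragraphs [0085] to [0089] can be sketched as follows. The sketch is an assumption-laden illustration: it obtains the normal with an ordinary least-squares plane fit (solving `points @ n = 1`), which is a common technique but is not confirmed to be identical to the formula in [0087]; the function name `projection_height` and the optional `weights` argument are hypothetical.

```python
import numpy as np

def projection_height(points, weights=None):
    """Estimate the camera height above the ground plane.

    points  : (n, 3) ground-pixel coordinates in the camera coordinate system
              (optical center at the origin, so each point is also the
              target vector from the optical center to that ground pixel).
    weights : optional per-pixel weights for the weighted average.
    """
    pts = np.asarray(points, dtype=float)
    # Assumed plane model n . p = 1, solved in the least-squares sense.
    n, *_ = np.linalg.lstsq(pts, np.ones(len(pts)), rcond=None)
    unit_normal = n / np.linalg.norm(n)
    # Projection distance of each target vector onto the unit normal.
    distances = np.abs(pts @ unit_normal)
    return float(np.average(distances, weights=weights)), unit_normal

# Synthetic ground plane 1.5 units below the optical center (y axis down).
ground = [(-1.0, 1.5, 2.0), (1.0, 1.5, 2.0), (0.0, 1.5, 5.0), (2.0, 1.5, 8.0)]
height, normal = projection_height(ground)
```

For this synthetic plane the estimated height is 1.5 and the unit normal points along the y axis.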

[0090] Specifically, the electronic device calculates the target height loss based on the pixel coordinates of pixels in the initial depth image, the projected height, and the real-world height, including:

[0091] The electronic device calculates the height ratio of the real-world height to the projected height. Further, the electronic device multiplies the height ratio by the pixel coordinates of each pixel in the initial depth image to obtain the depth coordinates corresponding to each pixel. Further still, the electronic device generates a first height loss based on the pixel coordinates and corresponding depth coordinates of each pixel in the initial depth image. The electronic device multiplies the translation matrix by the height ratio to obtain a multiplication matrix. Further still, the electronic device generates a second height loss based on the multiplication matrix and the translation matrix. Further still, the electronic device generates the target height loss based on the first height loss and the second height loss.

[0092] Specifically, the formula for calculating the first height loss is as follows:

[0093]

[0094] Where $L_d$ refers to the first height loss, $n$ is the total number of pixels in the initial depth image, $i$ denotes the i-th pixel in the initial depth image, $D_i^t(u,v)$ refers to the depth coordinates corresponding to the i-th pixel in the initial depth image, and $D_i(u,v)$ refers to the pixel coordinates of the i-th pixel in the initial depth image.

[0095] Specifically, the formula for calculating the second height loss is as follows:

[0096] $L_{ts} = |t_s - t|$;

[0097] Where $L_{ts}$ refers to the second height loss, $t_s$ refers to the multiplication matrix, and $t$ refers to the translation matrix.

[0098] The electronic device performs a weighted average calculation on the first height loss and the second height loss to obtain the target height loss.

[0099] Through the above implementation method, the target height loss is calculated based on the pixel coordinates of the pixels in the initial depth image, the projected height, and the real-world height. Since the projected height is more accurate, the target height loss can be reduced more quickly.
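
The steps of paragraphs [0091] to [0098] can be sketched as follows. Because the formula for the first height loss in [0093] is not reproduced in this text, the sketch assumes a mean absolute difference between the scaled and unscaled depth values; that choice, the L1 reading of $|t_s - t|$, the function name `target_height_loss`, and the weights `w_first` and `w_second` are all assumptions.

```python
import numpy as np

def target_height_loss(depth, translation, real_height, projected_height,
                       w_first=0.5, w_second=0.5):
    """Combine the first and second height losses by weighted average."""
    depth = np.asarray(depth, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Height ratio of the real-world height to the projected height.
    ratio = real_height / projected_height
    # Depth coordinates: each value of the initial depth image times the ratio.
    scaled_depth = ratio * depth
    # ASSUMED form of the first height loss (mean absolute difference).
    first = np.mean(np.abs(scaled_depth - depth))
    # Second height loss |t_s - t|, with t_s the scaled translation.
    scaled_translation = ratio * translation
    second = np.sum(np.abs(scaled_translation - translation))
    return float((w_first * first + w_second * second) / (w_first + w_second))

depth_map = np.ones((2, 2))
t = np.array([1.0, 0.0, 0.0])
loss_matched = target_height_loss(depth_map, t, real_height=1.5, projected_height=1.5)
loss_scaled = target_height_loss(depth_map, t, real_height=3.0, projected_height=1.5)
```

When the projected height already equals the real-world height, the ratio is 1 and both terms vanish; any scale mismatch makes both terms positive.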

[0100] 107. Based on the target height loss and the depth loss generated based on the first image and the second image, adjust the depth recognition network to obtain a depth recognition model.

[0101] In at least one embodiment of this application, the depth loss includes photometric loss and gradient loss.

[0102] In at least one embodiment of this application, the depth recognition model refers to the model obtained after adjusting the depth recognition network.

[0103] In at least one embodiment of this application, the electronic device adjusts the depth recognition network based on the target height loss and the depth loss generated based on the first image and the second image to obtain a depth recognition model, including:

[0104] The electronic device calculates the overall loss of the depth recognition network based on the depth loss and the target height loss. Furthermore, the electronic device adjusts the depth recognition network based on the overall loss until the overall loss is reduced to the minimum, thereby obtaining the depth recognition model.

[0105] Specifically, the electronic device performs a weighted average calculation on the depth loss and the target height loss to obtain the overall loss.

[0106] In this embodiment, the overall loss includes the depth loss and the target height loss. Since the depth loss can more accurately reflect the difference between the first image and the second image, adjusting the depth network based on the overall loss can improve the learning ability of the depth network and make the recognition accuracy of the depth recognition model higher.

[0107] Specifically, the electronic device calculates the gradient loss between the initial depth image and the first image, and calculates the photometric loss between the projected image of the first image and the first image. Further, the electronic device performs a weighted average operation on the gradient loss and the photometric loss to obtain the depth loss.
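
The two weighted averages described in paragraphs [0105] and [0107] can be sketched as follows. The text does not give the weights, so the equal weights used here, along with the names `weighted_average` and `overall_loss`, are placeholders.

```python
def weighted_average(values, weights):
    """Weighted average of scalar loss terms."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(values, weights)) / total

def overall_loss(gradient_loss, photometric_loss, target_height_loss,
                 depth_weights=(0.5, 0.5), overall_weights=(0.5, 0.5)):
    # Depth loss: weighted average of gradient loss and photometric loss.
    depth_loss = weighted_average((gradient_loss, photometric_loss), depth_weights)
    # Overall loss: weighted average of depth loss and target height loss.
    return weighted_average((depth_loss, target_height_loss), overall_weights)

example = overall_loss(gradient_loss=0.2, photometric_loss=0.4, target_height_loss=0.1)
```

In a training loop, the network parameters would be updated to reduce this overall loss until it stops decreasing.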

[0108] The electronic device generates a projected image of the first image based on the first image, the initial depth image corresponding to the first image, and a pose matrix corresponding to both the first image and the second image. The generation of the pose matrix is prior art and is not elaborated upon here.

[0109] Specifically, the formula for calculating the photometric loss is:

[0110] $L_t = \alpha\,\mathrm{SSIM}(x,y) + (1-\alpha)\lVert x_i - y_i \rVert$;

[0111] Where $L_t$ represents the photometric loss; $\alpha$ is a preset balance parameter, typically set to 0.85; $\mathrm{SSIM}(x,y)$ represents the structural similarity index between the projected image and the first image; $\lVert x_i - y_i \rVert$ represents the grayscale difference between the projected image and the first image; $x_i$ represents the pixel value of the i-th pixel in the projected image; and $y_i$ represents the pixel value of the pixel in the first image corresponding to the i-th pixel. The calculation of the structural similarity index is prior art and is not described in detail here.
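
The photometric loss of paragraph [0110] can be sketched as follows. The sketch follows the formula as written; for brevity it computes a single-window (global) SSIM rather than the usual sliding-window version and uses the mean absolute grayscale difference for the norm term, both of which are simplifying assumptions. The names `global_ssim` and `photometric_loss` are hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Structural similarity computed over the whole image as one window."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def photometric_loss(projected, first, alpha=0.85):
    """alpha * SSIM(x, y) + (1 - alpha) * mean |x_i - y_i|, as in [0110]."""
    x = np.asarray(projected, dtype=float)
    y = np.asarray(first, dtype=float)
    grayscale_diff = np.abs(x - y).mean()
    return float(alpha * global_ssim(x, y) + (1 - alpha) * grayscale_diff)

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
loss_identical = photometric_loss(image, image)
```

Taken literally, the formula gives identical images an SSIM of 1 and therefore a value of alpha rather than zero. Many self-supervised depth implementations use (1 - SSIM) / 2 in place of SSIM so that a perfect match yields zero loss; whether that is intended here cannot be determined from this text.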

[0112] Specifically, the formula for calculating the gradient loss is as follows:

[0113]

[0114] Where Ls represents the gradient loss, x represents the initial depth image, y represents the first image, D(u, v) represents the pixel coordinates of the i-th pixel in the initial depth image, and I(u, v) represents the pixel coordinates of the i-th pixel in the first image.

[0115] In this embodiment, since the depth loss includes the changes in luminance and gradient of each pixel in the first image to the corresponding pixel in the second image, the depth loss can more accurately reflect the differences between the first image and the second image.

[0116] Figure 4 is a flowchart of the image depth recognition method provided in an embodiment of this application.

[0117] The order of the steps in the flowchart can be adjusted according to actual requirements, and some steps can be omitted. The method is executed by an electronic device, for example the electronic device 1 shown in Figure 1.

[0118] 108. Obtain the image to be recognized.

[0119] In at least one embodiment of this application, the image to be identified refers to an image for which depth information needs to be identified.

[0120] In at least one embodiment of this application, the electronic device acquires the image to be identified by:

[0121] The electronic device retrieves the image to be identified from a preset database.

[0122] The preset database can be the KITTI database, the Cityscapes database, the vKITTI database, etc. The depth recognition network can be a deep neural network, which can be obtained from Internet databases.

[0123] 109. The image to be identified is input into the depth recognition model to obtain the target depth image of the image to be identified and the depth information of the image to be identified. The depth recognition model is obtained by performing the depth recognition model training method as described above.

[0124] In at least one embodiment of this application, the target depth image refers to an image containing depth information of each pixel in the image to be identified, and the depth information of each pixel in the image to be identified refers to the distance between the object to be identified corresponding to each pixel in the image to be identified and the shooting device that captured the image to be identified.

[0125] In at least one embodiment of this application, the method of generating the target depth image is basically the same as the method of generating the initial depth image, so this application will not elaborate further.

[0126] In at least one embodiment of this application, the electronic device acquires the pixel value of each pixel in the target depth image as the depth information of the corresponding pixel in the image to be identified.
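
The inference step of paragraphs [0123] to [0126] can be sketched as follows. The trained model is represented by an arbitrary callable, since its interface is not specified here; `recognize_depth`, `depth_at`, and the constant-depth stub model are hypothetical stand-ins.

```python
import numpy as np

def recognize_depth(model, image):
    """Run the depth recognition model and return the target depth image."""
    target_depth = np.asarray(model(image), dtype=float)
    if target_depth.shape != image.shape[:2]:
        raise ValueError("depth image must match the input image size")
    return target_depth

def depth_at(target_depth, u, v):
    """Depth information of pixel (u, v): the pixel value of the depth image."""
    return float(target_depth[v, u])

# Stub standing in for a trained model: predicts 7.5 units everywhere.
stub_model = lambda img: np.full(img.shape[:2], 7.5)
image_to_recognize = np.zeros((3, 4, 3))
target_depth_image = recognize_depth(stub_model, image_to_recognize)
pixel_depth = depth_at(target_depth_image, u=2, v=1)
```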

[0127] Through the above implementation, the accuracy of the depth recognition model is improved, which in turn improves the accuracy of depth recognition for the image to be recognized.

[0128] In summary, in this application, the test projection slope of the test object is calculated based on the coordinates of the pixel points of the test object in the test image, and a threshold interval is generated based on multiple test projection slopes. Since the ground type corresponding to the location of the test object is flat, the threshold interval provides a reference range for the initial projection slope of the initial object. By generating the threshold interval, the occurrence of extreme values of a single test projection slope is avoided, thus improving the rationality of the threshold interval. Based on the initial projection slope of the initial object in the first image and the threshold interval, it is identified whether the ground type corresponding to the location of the initial object is uphill or downhill. Then, based on the ground type and the pixel coordinates of the initial object, the initial ground plane region in the first image is adjusted, which can filter out the regions corresponding to the initial objects that are uphill or downhill, so that the target ground plane region does not contain uphill or downhill regions. Since the target height loss of the depth recognition network is calculated using the target ground plane region, the influence of pixel value changes in uphill and downhill ground planes on the target height loss can be avoided. Therefore, the recognition accuracy of the trained depth recognition model is higher, thereby improving the recognition accuracy of the image.

[0129] Figure 5 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.

[0130] In one embodiment of this application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and executable on the processor 13, such as an image depth recognition program and a depth recognition model training program.

[0131] Those skilled in the art will understand that the schematic diagram is merely an example of electronic device 1 and does not constitute a limitation on electronic device 1. It may include more or fewer components than shown in the diagram, or combine certain components, or different components. For example, electronic device 1 may also include input / output devices, network access devices, buses, etc.

[0132] The processor 13 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1, connecting various parts of the electronic device 1 through various interfaces and lines, and acquiring the operating system and installed applications and program code of the electronic device 1. For example, the processor 13 can acquire the first image captured by the imaging device 2 through an interface.

[0133] The processor 13 acquires the operating system and various installed applications of the electronic device 1. The processor 13 acquires these applications to implement the steps in the aforementioned embodiments of the depth recognition model training method and the image depth recognition method, for example the steps shown in Figure 2 and Figure 4.

[0134] For example, the computer program may be divided into one or more modules / units, which are stored in the memory 12 and retrieved by the processor 13 to complete this application. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the process of retrieving the computer program from the electronic device 1.

[0135] The memory 12 can be used to store the computer programs and / or modules. The processor 13 implements various functions of the electronic device 1 by running or retrieving the computer programs and / or modules stored in the memory 12, and by calling the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the electronic device, etc. In addition, the memory 12 may include non-volatile memory, such as hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.

[0136] The memory 12 can be the external memory and / or internal memory of the electronic device 1. Furthermore, the memory 12 can be a physical memory, such as a memory module, a TF card (Trans-flash Card), etc.

[0137] If the modules / units integrated in the electronic device 1 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when the computer program is acquired by a processor, it can implement the steps of the various method embodiments described above.

[0138] The computer program includes computer program code, which may be in the form of source code, object code, accessible file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, and read-only memory (ROM).

[0139] With reference to Figure 2, the memory 12 in the electronic device 1 stores multiple instructions to implement a depth recognition model training method. The processor 13 can acquire the multiple instructions to: determine a test object from the acquired test image, and acquire a first image and a second image obtained by the shooting device after capturing the initial object; calculate the test projection slope of the test object based on the coordinates of the pixel points of the test object in the test image; generate a threshold interval based on the multiple test projection slopes; identify the ground type corresponding to the location of the initial object based on the initial projection slope of the initial object in the first image and the threshold interval; adjust the initial ground plane region in the first image based on the ground type and the pixel coordinates of the initial object to obtain the target ground plane region in the first image; generate the target height loss of the depth recognition network based on the preset depth recognition network, the shooting device, the initial depth image corresponding to the first image, and the target ground plane region; and adjust the depth recognition network based on the target height loss and the depth loss generated based on the first image and the second image to obtain a depth recognition model.

[0140] With reference to Figure 4, the memory 12 in the electronic device 1 stores multiple instructions to implement an image depth recognition method. The processor 13 can acquire the multiple instructions to: acquire an image to be recognized, and input the image to be recognized into a depth recognition model to obtain the target depth image of the image to be recognized and the depth information of the image to be recognized.

[0141] Specifically, for how the processor 13 implements the above instructions, refer to the descriptions of the relevant steps in the embodiments corresponding to Figure 2 and Figure 4, which are not repeated here.

[0142] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.

[0143] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0144] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.

[0145] Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of this application is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within this application. No appended diagram markings in the claims should be construed as limiting the scope of the claims.

[0146] Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices described in this application may also be implemented by a single unit or device through software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any specific order.

[0147] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit it. Although this application has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of this application without departing from the spirit and scope of the technical solutions of this application.

Claims

1. A deep recognition model training method, applied to an electronic device, wherein the electronic device communicates with a shooting device, characterized in that, The deep recognition model training method includes: The test object is determined from the acquired test images, and the first image and the second image obtained by the shooting device after taking a picture of the initial object are acquired. The test projection slope of the test object is calculated based on the coordinates of the pixels of the test object in the test image. This includes: obtaining the x-coordinate and y-coordinate values of each pixel in the test object; calculating the average x-coordinate of multiple x-coordinate values and the average y-coordinate of multiple y-coordinate values; calculating the x-coordinate difference between the x-coordinate of each pixel and the average x-coordinate and the y-coordinate difference between the y-coordinate of each pixel and the average y-coordinate; counting the number of pixels in all pixels of the test object; generating a covariance matrix according to a preset rule, the number of pixels, multiple x-coordinate differences, and multiple y-coordinate differences; performing singular value decomposition on the covariance matrix to obtain a feature vector; determining the ratio of the first vector element to the second vector element of the feature vector as the projection slope; and selecting the test projection slope from the projection slopes.
A threshold range is generated based on multiple test projection slopes; Based on the initial projection slope of the initial object in the first image and the threshold range, the ground type corresponding to the location of the initial object is identified; The initial ground plane region in the first image is adjusted according to the ground type and the pixel coordinates of the initial object to obtain the target ground plane region in the first image; Based on the preset depth recognition network, the shooting device, the initial depth image corresponding to the first image, and the target ground plane region, the target height loss of the depth recognition network is generated; The depth recognition network is adjusted based on the target height loss and the depth loss generated based on the first image and the second image to obtain a depth recognition model.

2. The deep recognition model training method as described in claim 1, characterized in that, The step of generating a covariance matrix based on preset rules, the number of pixels, multiple horizontal coordinate differences, and multiple vertical coordinate differences includes: The variance of the horizontal coordinate is calculated based on the multiple differences in the horizontal coordinates and the number of pixels, and the variance of the vertical coordinate is calculated based on the multiple differences in the vertical coordinates and the number of pixels. The covariance value is calculated based on the number of pixels, the multiple differences in the horizontal coordinates, and the multiple differences in the vertical coordinates. The covariance values, the variance values of the horizontal axis, and the variance values of the vertical axis are arranged according to the preset rules to obtain the covariance matrix.

3. The deep recognition model training method as described in claim 1, characterized in that, The step of generating a threshold range based on multiple test projection slopes includes: Calculate the average projection value and standard deviation of the multiple test projection slopes, and calculate the configuration value based on the standard deviation of the projection. The difference between the average projection value and the configuration value is determined as the minimum threshold, and the sum of the average projection value and the standard deviation of the projection is determined as the maximum threshold; The interval formed by the minimum threshold and the maximum threshold is defined as the threshold interval.

4. The deep recognition model training method as described in claim 1, characterized in that, The step of identifying the ground type corresponding to the location of the initial object based on the initial projection slope of the initial object in the first image and the threshold range includes: If the initial projection slope is within the threshold range, the ground type is determined to be flat; or If the initial projection slope is outside the threshold range, the ground type is determined to be uphill or downhill.

5. The deep recognition model training method according to any one of claims 1 to 4, characterized in that, The step of adjusting the initial ground plane region in the first image according to the ground type and the pixel coordinates of the initial object to obtain the target ground plane region in the first image includes: Identify the feature ground plane region corresponding to any initial object in the initial ground plane region based on the pixel coordinates of any initial object; If the ground type corresponding to any of the initial objects is flat, the characteristic ground plane region is determined as the target ground plane region; or If the ground type corresponding to any initial object is uphill or downhill, the feature ground plane region is masked in the initial ground plane region to obtain the target ground plane region.

6. The deep recognition model training method as described in claim 1, characterized in that, The generation of the target height loss of the depth recognition network based on the preset depth recognition network, the shooting device, the initial depth image corresponding to the first image, and the target ground plane region includes: The first image is input into the depth recognition network to obtain the initial depth image; Obtain the real-world height from the optical center of the imaging device to the target ground plane region; A camera coordinate system is constructed based on the first image and the shooting device; Calculate the projection height based on the coordinates of each ground pixel in the target ground plane region in the first image in the camera coordinate system; The target height loss is calculated based on the pixel coordinates of the pixels in the initial depth image, the projected height, and the real-world height.

7. The deep recognition model training method as described in claim 6, characterized in that, The step of calculating the projection height based on the coordinates of each ground pixel in the target ground plane region of the first image in the camera coordinate system includes: Obtain the coordinates of any ground pixel in the target ground plane region in the camera coordinate system; Calculate the unit normal vector based on the coordinates of any ground pixel; The vector formed by taking the optical center of the shooting device as the starting point and each ground pixel as the ending point is determined as the target vector of that ground pixel. Calculate the projection distance corresponding to each ground pixel based on the target vector of each ground pixel and the unit normal vector; The projection height is obtained by performing a weighted average calculation on the projection distances corresponding to all ground pixels.

8. An image depth recognition method, characterized in that, The image depth recognition method includes: Acquire the image to be recognized; The image to be identified is input into a depth recognition model to obtain a target depth image of the image to be identified and depth information of the image to be identified. The depth recognition model is obtained by performing the depth recognition model training method as described in any one of claims 1 to 7.

9. An electronic device, characterized in that, The electronic device includes: Memory, storing at least one instruction; and The processor executes the at least one instruction to implement the depth recognition model training method as described in any one of claims 1 to 7, or the image depth recognition method as described in claim 8.
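
The projection-slope, threshold-interval, and ground-type steps recited in claims 1 to 4 can be sketched as follows. This is an illustrative reading only: the 2x2 layout of the covariance matrix stands in for the unspecified "preset rule", the configuration value is assumed to be a hypothetical factor `k` times the standard deviation, and the function names are not taken from the patent.

```python
import numpy as np

def projection_slope(pixels):
    """Slope of an object's pixels from the SVD of their covariance matrix."""
    pts = np.asarray(pixels, dtype=float)        # (n, 2): x and y coordinates
    n = len(pts)
    dx = pts[:, 0] - pts[:, 0].mean()            # x-coordinate differences
    dy = pts[:, 1] - pts[:, 1].mean()            # y-coordinate differences
    cov = np.array([[dx @ dx, dx @ dy],
                    [dx @ dy, dy @ dy]]) / n     # assumed arrangement
    u, _, _ = np.linalg.svd(cov)
    principal = u[:, 0]                          # dominant direction
    # Ratio of first to second element; undefined if the second element is 0.
    return principal[0] / principal[1]

def threshold_interval(test_slopes, k=1.0):
    """Interval built from the mean and standard deviation of test slopes."""
    mean = float(np.mean(test_slopes))
    std = float(np.std(test_slopes))
    config = k * std                             # assumed configuration value
    return mean - config, mean + std             # bounds as worded in claim 3

def ground_type(initial_slope, interval):
    low, high = interval
    return "flat" if low <= initial_slope <= high else "uphill_or_downhill"

slope = projection_slope([(0, 0), (2, 1), (4, 2), (6, 3)])
interval = threshold_interval([1.0, 2.0, 3.0])
label_inside = ground_type(2.0, interval)
label_outside = ground_type(5.0, interval)
```

For pixels lying on the line x = 2y, the dominant direction is proportional to (2, 1), so the ratio of its first to second element is 2 regardless of the sign the decomposition returns.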