A multi-target tracking method, system, storage medium and terminal for an autonomous driving scene

By performing 3D object detection on images in autonomous driving scenarios and calculating the intersection-over-union (IoU) of bird's-eye-view bounding boxes, the method addresses dynamic-object interference, supports accurate self-localization and motion consistency, and helps avoid loop closure errors and trajectory drift.

CN117132620B · Active · Publication Date: 2026-06-16 · UNIV OF ELECTRONICS SCI & TECH OF CHINA +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA
Filing Date: 2023-08-28
Publication Date: 2026-06-16


Abstract

The application discloses a multi-target tracking method, system, storage medium and terminal for autonomous driving scenarios, belonging to the field of autonomous driving. The method includes: performing three-dimensional target detection on each input image frame; projecting all current three-dimensional target bounding boxes onto a bird's-eye view to obtain detection boxes; determining whether historically generated objects are within the field of view, and projecting the objects within the field of view onto the bird's-eye view to obtain target boxes; calculating the intersection-over-union (IoU) of the detection boxes and the target boxes; determining from the IoU result whether each object is successfully tracked; and removing the feature points within successfully tracked dynamic objects. By projecting the three-dimensional detection results onto the bird's-eye view, the application establishes a matching relationship between currently detected objects and historically generated objects and tracks multiple targets according to that relationship. By removing the feature points within dynamic bounding boxes, pose estimation is not disturbed by dynamic objects and is more accurate.

Description

Technical Field

[0001] This invention relates to the field of autonomous driving, and in particular to a multi-target tracking method, system, storage medium, and terminal for autonomous driving scenarios.

Background Technology

[0002] Simultaneous Localization and Mapping (SLAM) aims to enable robots to localize themselves in unknown environments using sensor observation data, and simultaneously build incremental maps based on their localization. SLAM is a hot research topic in the field of intelligent transportation and is crucial for autonomous path planning by unmanned vehicles. Visual SLAM algorithms, which use low-cost cameras as the primary sensor, have been widely applied.

[0003] Visual SLAM systems consist of a front-end and a back-end. The front-end estimates robot motion using camera observation data, while the back-end performs loop closure detection and global optimization of the motion estimated by the front-end. Visual SLAM front-ends are divided into direct methods and feature-point methods. Direct methods operate directly on image pixels: they estimate how pixels change between images under the assumption of invariant grayscale values, and from that estimate the robot's own motion. While direct methods save the time spent computing keypoints and descriptors, their grayscale-invariance assumption often does not hold in practice, especially for autonomous vehicles operating outdoors, which inevitably face varying lighting conditions. Feature-point methods do not use image pixel information directly but extract keypoints from the image and calculate their descriptors. The descriptors are used to determine the matching relationships between keypoints, and then algorithms such as epipolar geometry and PnP are used to estimate the robot's own motion. Feature-point methods are generally robust to changes in scale, lighting, and viewpoint, and achieve high accuracy.

[0004] Visual SLAM systems estimate their own motion trajectories by matching pixels across frames. These systems typically assume a static environment, meaning the robot treats everything around it as static. However, in real life, dynamic objects are ubiquitous and can interfere with the camera's self-localization, leading to loop closure errors, trajectory drift, and even tracking loss. Furthermore, many applications require motion information extraction to aid subsequent decision-making. For example, autonomous driving needs to identify surrounding vehicles and predict their potential trajectories.

[0005] Many dynamic SLAM methods obtain a prior target mask through semantic segmentation, then detect and remove dynamic feature points through motion consistency checks. However, such methods are not applicable to some special motions. For example, the RGB image used in existing dynamic visual SLAM is a front view, and when a moving target moves along the camera's optical axis, the commonly used motion consistency check fails. The reason is that dynamic SLAM methods rely on point reprojection, which is sensitive to pixel coordinates but not to depth. Therefore, when the target moves along the camera's optical axis, that is, when its depth relative to the camera changes, the motion consistency check fails.

Summary of the Invention

[0006] The purpose of this invention is to overcome the problems of existing dynamic SLAM methods and provide a multi-target tracking method, system, storage medium and terminal for autonomous driving scenarios.

[0007] The objective of this invention is achieved through the following technical solution:

[0008] Firstly, a multi-target tracking method for autonomous driving scenarios is provided, including the following steps:

[0009] S1. Perform 3D target detection on each frame of the input image;

[0010] S2. Project all current 3D target bounding boxes onto the bird's-eye view to obtain the detection boxes;

[0011] S3. Determine whether the historically generated objects are within the field of view, and project the objects within the field of view onto the bird's-eye view to obtain the target bounding box;

[0012] S4. Calculate the intersection-over-union (IoU) between the detection box and the target box;

[0013] S5. Determine whether the object has been successfully tracked based on the calculated IoU;

[0014] S6. Remove feature points within successfully tracked dynamic objects.

[0015] In some possible embodiments, a multi-target tracking method for an autonomous driving scenario is provided, wherein the historically generated object is obtained from a historically generated object database, and the information of the object includes the object ID, object speed, and world coordinates of the object.

[0016] In some possible embodiments, a multi-target tracking method for an autonomous driving scenario is provided, wherein projecting all current 3D target bounding boxes onto a bird's-eye view includes:

[0017] Transform the world coordinates of the object to the coordinates in the current camera coordinate system;

[0018] The transformed coordinates are projected onto the bird's-eye view to obtain the homogeneous pixel coordinates of the object.

[0019] In some possible embodiments, a multi-target tracking method for an autonomous driving scenario is provided, wherein determining whether a historically generated object is within the field of view includes:

[0020] Determine whether the homogeneous pixel coordinates are within a certain range. If they are within the corresponding range, the object is in the field of view.

[0021] In some possible embodiments, a multi-target tracking method for an autonomous driving scenario is provided, wherein the object within the field of view is projected onto a bird's-eye view, including:

[0022] The current bird's-eye view coordinates of the object are estimated based on a constant velocity motion model.

[0023] In some possible embodiments, a multi-target tracking method for an autonomous driving scenario is provided, wherein step S5 includes:

[0024] Objects with an intersection-over-union ratio greater than a threshold are considered successfully tracked;

[0025] Objects with an intersection-over-union ratio of 0 are considered new objects;

[0026] Objects with an intersection-over-union ratio between 0 and the threshold are considered unmatchable.

[0027] In some possible embodiments, a multi-target tracking method for an autonomous driving scenario is provided, wherein the removal of feature points within successfully tracked dynamic objects includes:

[0028] All extracted feature points are projected onto the bird's-eye view, and feature points located within the bounding box of dynamic objects are removed.

[0029] Secondly, a multi-target tracking system for autonomous driving scenarios is provided, including:

[0030] The 3D object detection module is configured to perform 3D object detection on each frame of the input image and project all current 3D object bounding boxes onto the bird's-eye view to obtain the detection bounding boxes.

[0031] The target bounding box generation module is configured to determine whether historically generated objects are within the field of view, and then project objects within the field of view onto the bird's-eye view to obtain the target bounding box.

[0032] The intersection-over-union (IoU) calculation module is configured to calculate the IoU between the detection box and the target box;

[0033] The feature point removal module is configured to determine whether an object has been successfully tracked based on the calculated IoU, and to remove feature points within the successfully tracked dynamic object.

[0034] Thirdly, a computer storage medium is provided, on which computer instructions are stored, wherein the computer instructions, when executed, perform the relevant steps in any one of the multi-target tracking methods for an autonomous driving scenario.

[0035] Fourthly, a terminal is provided, including a memory and a processor, wherein the memory stores computer instructions that can be executed on the processor, and the processor executes the relevant steps in any one of the multi-target tracking methods for an autonomous driving scenario when executing the computer instructions.

[0036] It should be further noted that the technical features corresponding to the above options can be combined or substituted to form new technical solutions if there is no conflict.

[0037] Compared with the prior art, the beneficial effects of the present invention are:

[0038] This invention projects the 3D target detection results onto a bird's-eye view to establish a matching relationship between the currently detected objects and historically generated objects, and tracks multiple targets based on this matching relationship. In addition, it projects feature points onto the bird's-eye view and removes the feature points within dynamic bounding boxes, so that the camera's self-localization is not affected by dynamic objects and a more accurate pose is estimated, avoiding loop closure errors, trajectory drift, or even tracking loss. At the same time, it calculates the intersection-over-union (IoU) of the detection box and the target box to ensure motion consistency when the target moves along the camera's optical axis or performs other special movements.

Brief Description of the Drawings

[0039] Figure 1 is a flowchart illustrating a multi-target tracking method for an autonomous driving scenario, as shown in an embodiment of the present invention;

[0040] Figure 2 shows the specific judgment process for target tracking in an embodiment of the present invention;

[0041] Figure 3 is a diagram illustrating the tracking effect in an embodiment of the present invention;

[0042] Figure 4 is a schematic diagram illustrating the dynamic feature point removal effect in an embodiment of the present invention.

Detailed Implementation

[0043] The technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0044] Furthermore, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.

[0045] Referring to Figure 1, in one exemplary embodiment, a multi-target tracking method for an autonomous driving scenario is provided, comprising the following steps:

[0046] S1. Perform 3D target detection on each frame of the input image;

[0047] S2. Project all current 3D target bounding boxes onto the bird's-eye view to obtain the detection boxes;

[0048] S3. Determine whether the historically generated objects are within the field of view, and project the objects within the field of view onto the bird's-eye view to obtain the target bounding box;

[0049] S4. Calculate the intersection-over-union (IoU) between the detection box and the target box;

[0050] S5. Determine whether the object has been successfully tracked based on the calculated IoU;

[0051] S6. Remove feature points within successfully tracked dynamic objects.

[0052] Specifically, this invention performs 3D object detection on each image frame, and then projects the 3D object detection results onto a bird's-eye view to obtain 2D detection boxes. Additionally, historically generated objects in the field of view are projected onto the bird's-eye view to obtain target boxes, and then the intersection-over-union (IoU) of the two boxes is calculated. Based on the IoU result, it is determined whether a detected object is a new object; new objects are added to the object database, and objects with an IoU greater than a threshold are tracked.
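The IoU calculation between a detection box and a target box can be sketched as follows. This is a minimal illustration only: it assumes both bird's-eye-view boxes are axis-aligned rectangles in pixel coordinates, which the text does not state (rotated boxes would need a polygon intersection instead). The `Box` layout and the function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in bird's-eye-view pixel coordinates (assumed layout)."""
    u_min: float
    v_min: float
    u_max: float
    v_max: float


def area(b: Box) -> float:
    # A box whose max edge lies before its min edge has zero area.
    return max(0.0, b.u_max - b.u_min) * max(0.0, b.v_max - b.v_min)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    overlap = Box(
        max(a.u_min, b.u_min),
        max(a.v_min, b.v_min),
        min(a.u_max, b.u_max),
        min(a.v_max, b.v_max),
    )
    inter_area = area(overlap)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0
```

Two 2x2 boxes offset by one pixel in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7, while disjoint boxes give 0.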

[0053] Furthermore, the historically generated object is obtained from the historically generated object database, and the information of the object includes the object ID, object velocity, and object world coordinates.
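One possible shape for the historically generated object database is sketched below. The text only specifies that each record holds an object ID, a velocity, and world coordinates. The class names, the linear-only velocity, and the finite-difference velocity update are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class TrackedObject:
    object_id: int
    velocity: Vec3    # linear velocity only (simplifying assumption)
    world_xyz: Vec3   # last known position in world coordinates


@dataclass
class ObjectDatabase:
    objects: Dict[int, TrackedObject] = field(default_factory=dict)
    next_id: int = 0

    def add_new(self, world_xyz: Vec3) -> TrackedObject:
        """Register a new object with its velocity initialized to zero."""
        obj = TrackedObject(self.next_id, (0.0, 0.0, 0.0), world_xyz)
        self.objects[obj.object_id] = obj
        self.next_id += 1
        return obj

    def update(self, object_id: int, world_xyz: Vec3, dt: float) -> None:
        """Update a tracked object; velocity is a finite difference (assumption)."""
        obj = self.objects[object_id]
        obj.velocity = tuple(
            (new - old) / dt for new, old in zip(world_xyz, obj.world_xyz)
        )
        obj.world_xyz = world_xyz
```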

[0054] Furthermore, the projection of all current 3D target bounding boxes onto the bird's-eye view includes:

[0055] Transform the world coordinates of the object to the coordinates in the current camera coordinate system;

[0056] The transformed coordinates are projected onto the bird's-eye view to obtain the homogeneous pixel coordinates of the object. Specifically, the object's coordinates $P_w$ in the world coordinate system are transformed to the current camera coordinate system, giving coordinates $P_c$; the conversion is $P_c = T_{cw} P_w$, where $T_{cw}$ is the transformation matrix from world coordinates to the current camera coordinates. Projecting $P_c$ onto the bird's-eye view yields the homogeneous pixel coordinates of the object, $(u, v, 1)^T = K_{bev} P_c$, where $K_{bev}$ represents the intrinsic parameter matrix of the bird's-eye view.
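The two-stage projection (world to camera, then camera to bird's-eye-view pixels) can be sketched in plain Python as follows. The helper names are hypothetical, and dividing by the third homogeneous component so that the result has the form (u, v, 1) is an assumption; the text only gives the product of the bird's-eye-view intrinsic matrix and the camera-frame coordinates.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def world_to_camera(T_cw: Matrix, P_w: Sequence[float]) -> List[float]:
    """P_c = T_cw P_w, with T_cw a 4x4 homogeneous transform."""
    x, y, z, w = mat_vec(T_cw, [P_w[0], P_w[1], P_w[2], 1.0])
    return [x / w, y / w, z / w]


def camera_to_bev_pixel(K_bev: Matrix, P_c: Sequence[float]) -> Tuple[float, float]:
    """(u, v, 1)^T proportional to K_bev P_c, normalized by the third component."""
    u, v, s = mat_vec(K_bev, P_c)
    if s == 0:
        raise ValueError("point cannot be normalized (third component is zero)")
    return u / s, v / s


# Illustrative values only: a pure translation and a made-up K_bev.
T_cw = [[1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]
K_bev = [[10.0, 0.0, 304.0],
         [0.0, 10.0, 304.0],
         [0.0, 0.0, 1.0]]

P_c = world_to_camera(T_cw, [1.0, 3.0, 1.0])
u, v = camera_to_bev_pixel(K_bev, P_c)
```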

[0057] Furthermore, determining whether a historically generated object is within the field of view includes:

[0058] It is determined whether the homogeneous pixel coordinates are within a certain interval; if so, the object is in the field of view. Specifically, the interval is selected as 0-608: if both u and v of the pixel coordinates (u, v) lie within 0-608, the object is in the field of view.
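The field-of-view test reduces to a range check on both pixel coordinates. In the minimal sketch below, the function name is hypothetical, and treating the endpoints 0 and 608 as inside the interval is an assumption, since the text does not say whether the bounds are inclusive.

```python
BEV_MIN = 0.0    # lower bound of the interval given in the text
BEV_MAX = 608.0  # upper bound of the interval given in the text


def in_field_of_view(u: float, v: float,
                     lo: float = BEV_MIN, hi: float = BEV_MAX) -> bool:
    """Return True when both bird's-eye-view pixel coordinates lie in [lo, hi]."""
    return lo <= u <= hi and lo <= v <= hi
```

An object is kept as a matching candidate only when this check passes; objects outside the interval are skipped for the current frame.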

[0059] Furthermore, the projection of objects within the field of view onto the bird's-eye view includes:

[0060] For a target within the field of view, since its world coordinates represent its position at the previous moment, its current bird's-eye view coordinates are estimated based on the constant velocity motion model. The transformation process is as follows:

[0061]

[0062] where the quantities in the expression are: the spatial coordinates of object l at time t-1; the homogeneous pixel coordinates of the object in the bird's-eye view; $T_{cw}$, the transformation matrix from world coordinates to the current camera coordinates; $T_{bc}$, the transformation matrix from the camera view to the bird's-eye view; and $K_{bev}$, the intrinsic parameter matrix of the bird's-eye view. The motion matrix of the object is expressed as:

[0063]

[0064] where Exp denotes the exponential map, and the two remaining quantities denote the linear velocity and the angular velocity of the object.
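Because the equations themselves are not reproduced in the text, the following is only one plausible reading of the constant-velocity model, not the patent's exact formula. It assumes the motion matrix is the SE(3) exponential map of the object's linear and angular velocity scaled by the frame interval, and that this matrix is applied to the previous world position before the projection through the world-to-camera transform, the camera-to-bird's-eye-view transform, and the bird's-eye-view intrinsic matrix. All names, and the choice to express the velocity in the world frame, are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def hat(w: Sequence[float]) -> Matrix:
    """Skew-symmetric matrix of a 3-vector."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]


def matmul3(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def combine(c1: float, W: Matrix, c2: float, W2: Matrix) -> Matrix:
    """Return I + c1 * W + c2 * W2 for 3x3 matrices."""
    return [[(1.0 if i == j else 0.0) + c1 * W[i][j] + c2 * W2[i][j]
             for j in range(3)] for i in range(3)]


def se3_exp(v: Sequence[float], omega: Sequence[float], dt: float) -> Matrix:
    """4x4 motion matrix Exp([v, omega] * dt) via the closed-form SE(3) map."""
    rho = [x * dt for x in v]
    phi = [w * dt for w in omega]
    theta = math.sqrt(sum(p * p for p in phi))
    W = hat(phi)
    W2 = matmul3(W, W)
    if theta < 1e-9:
        # Small-angle limits of the series coefficients.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    R = combine(a, W, b, W2)   # rotation (Rodrigues formula)
    V = combine(b, W, c, W2)   # couples rotation into the translation
    t = [sum(V[i][k] * rho[k] for k in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def predict_world_position(X_prev: Sequence[float], v: Sequence[float],
                           omega: Sequence[float], dt: float) -> List[float]:
    """Apply the motion matrix to the position at time t-1 (assumed model)."""
    H = se3_exp(v, omega, dt)
    p = [X_prev[0], X_prev[1], X_prev[2], 1.0]
    x, y, z, _ = [sum(H[i][k] * p[k] for k in range(4)) for i in range(4)]
    return [x, y, z]
```

With zero angular velocity the model reduces to a straight-line update: a point at (0, 0, 5) moving at 1 unit per second along x for 2 seconds is predicted at (2, 0, 5). The predicted position would then be projected to the bird's-eye view to form the target box.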

[0065] Furthermore, referring to Figure 2 (target box A in the figure corresponds to the detection box described above, and target box B corresponds to the target box described above), step S5 includes:

[0066] Objects with an intersection-over-union ratio (IoU) greater than a threshold (selected according to actual needs) are considered successfully tracked. The target's velocity and world coordinates are updated. The tracking effect is shown in Figure 3, where lines of the same shade represent the trajectory of the same object;

[0067] An object with an IoU of 0 is considered a new object. Its velocity is initialized to 0, it is added to the database of historically generated objects, and its current world coordinates are recorded;

[0068] Objects with an IoU between 0 and the threshold are considered unmatchable.
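The three-way decision of step S5 can be sketched as below. The text does not say how a detection is compared against several target boxes, so taking the best (maximum) IoU is an assumption, as is treating an IoU exactly equal to the threshold as unmatchable. The enum and function names are hypothetical, and the threshold value in the usage note is illustrative only.

```python
from enum import Enum
from typing import Sequence


class MatchStatus(Enum):
    TRACKED = "tracked"        # IoU above the threshold
    NEW = "new"                # IoU of 0 with every target box
    UNMATCHED = "unmatched"    # IoU between 0 and the threshold


def classify(ious: Sequence[float], threshold: float) -> MatchStatus:
    """Classify one detection from its IoUs against all target boxes."""
    best = max(ious, default=0.0)  # no target boxes at all counts as IoU 0
    if best > threshold:
        return MatchStatus.TRACKED
    if best == 0.0:
        return MatchStatus.NEW
    return MatchStatus.UNMATCHED
```

For example, with a threshold of 0.3, IoUs of [0.6, 0.1] give a tracked object, [0.0, 0.0] give a new object, and [0.2] gives an unmatchable one.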

[0069] Furthermore, the removal of feature points within successfully tracked dynamic objects includes:

[0070] To ensure the stability of camera tracking, feature points extracted from dynamic objects need to be removed. All extracted feature points are projected onto the bird's-eye view, and feature points located within the bounding box of a dynamic object are removed. The result is shown in Figure 4, where moving vehicles and other moving objects are the dynamic objects within the field of view.
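The removal step amounts to a point-in-box filter in the bird's-eye view. The sketch below assumes the feature points have already been projected to bird's-eye-view pixel coordinates and that the dynamic boxes are axis-aligned tuples of the form (u_min, v_min, u_max, v_max). Both the layout and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max), assumed


def inside(p: Point, b: Box) -> bool:
    """True when point p lies within box b, borders included."""
    return b[0] <= p[0] <= b[2] and b[1] <= p[1] <= b[3]


def remove_dynamic_points(points_bev: Sequence[Point],
                          dynamic_boxes: Sequence[Box]) -> List[Point]:
    """Keep only feature points that fall outside every dynamic-object box."""
    return [p for p in points_bev
            if not any(inside(p, b) for b in dynamic_boxes)]


static_points = remove_dynamic_points(
    [(10.0, 10.0), (50.0, 50.0), (100.0, 20.0)],
    [(40.0, 40.0, 60.0, 60.0)],
)
```

Only the remaining points would then be passed on to pose estimation.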

[0071] In another exemplary embodiment, a multi-target tracking system for an autonomous driving scenario is provided, comprising:

[0072] The 3D object detection module is configured to perform 3D object detection on each frame of the input image and project all current 3D object bounding boxes onto the bird's-eye view to obtain the detection bounding boxes.

[0073] The target bounding box generation module is configured to determine whether historically generated objects are within the field of view, and then project objects within the field of view onto the bird's-eye view to obtain the target bounding box.

[0074] The intersection-over-union (IoU) calculation module is configured to calculate the IoU between the detection box and the target box;

[0075] The feature point removal module is configured to determine whether an object has been successfully tracked based on the calculated IoU, and to remove feature points within the successfully tracked dynamic object.

[0076] In another exemplary embodiment, the present invention provides a computer storage medium storing computer instructions thereon, which, when executed, perform relevant steps in the multi-target tracking method for an autonomous driving scenario.

[0077] Based on this understanding, the technical solution of this embodiment, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0078] In another exemplary embodiment, the present invention provides a terminal including a memory and a processor, wherein the memory stores computer instructions that can be executed on the processor, and the processor executes relevant steps in the multi-target tracking method for an autonomous driving scenario when executing the computer instructions.

[0079] The processor may be a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement the present invention.

[0080] The embodiments of the subject matter and functional operation described in this specification can be implemented in: tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or combinations thereof. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by a data processing device or for controlling the operation of a data processing device. Alternatively or additionally, the program instructions may be encoded on artificially generated propagation signals, such as machine-generated electrical, optical, or electromagnetic signals, which are generated to encode information and transmit it to a suitable receiving device for execution by the data processing device.

[0081] The processing and logic flow described in this specification can be executed by one or more programmable computers that execute one or more computer programs to perform corresponding functions by operating on input data and generating output. The processing and logic flow can also be executed by dedicated logic circuitry—such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits), and the device can also be implemented as dedicated logic circuitry.

[0082] Suitable processors for executing computer programs include, for example, general-purpose and / or special-purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit receives instructions and data from read-only memory and / or random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include one or more mass storage devices for storing data, such as disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from or transfer data to them, or both. However, a computer is not required to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.

[0083] While this specification contains numerous specific implementation details, these should not be construed as limiting the scope of any invention or the scope of the claims, but rather are primarily intended to describe features of specific embodiments of a particular invention. Certain features described in the various embodiments herein may also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment may also be implemented separately in various embodiments or in any suitable sub-combination. Furthermore, while features may function in certain combinations as described above and even initially claimed in this way, one or more features from a claimed combination may be removed from that combination in some cases, and a claimed combination may refer to a sub-combination or a variation thereof.

[0084] Similarly, although the operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring these operations to be performed in the specific order shown or sequentially, or requiring all illustrated operations to be performed to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0085] The above detailed embodiments are a description of the present invention. It should not be considered that the specific embodiments of the present invention are limited to these descriptions. For those skilled in the art, several simple deductions and substitutions can be made without departing from the concept of the present invention, and all of these should be considered to fall within the protection scope of the present invention.

Claims

1. A multi-target tracking method for autonomous driving scenarios, characterized in that it includes the following steps:
S1. Perform 3D target detection on each frame of the input image;
S2. Project all current 3D target bounding boxes onto the bird's-eye view to obtain the detection boxes; the projecting of all current 3D target bounding boxes onto the bird's-eye view includes: transforming the world coordinates of the object to coordinates in the current camera coordinate system; and projecting the transformed coordinates onto the bird's-eye view to obtain the homogeneous pixel coordinates of the object, $(u, v, 1)^T = K_{bev} P_c$, where $K_{bev}$ represents the intrinsic parameter matrix of the bird's-eye view and $P_c$ represents the transformed coordinates;
S3. Determine whether the historically generated objects are within the field of view, and project the objects within the field of view onto the bird's-eye view to obtain the target bounding box; the determining of whether the historically generated objects are within the field of view includes: determining whether the homogeneous pixel coordinates are within a certain range, and if they are within that range, the object is in the field of view; the range is 0-608, and when u and v are both within the range, the historically generated object is within the field of view; the projecting of objects within the field of view onto the bird's-eye view includes: estimating the current bird's-eye view coordinates of the object based on a constant-velocity motion model, wherein the transformation is expressed in terms of the spatial coordinates of object l at time t-1, the homogeneous pixel coordinates of the object in the bird's-eye view, the transformation matrix $T_{cw}$ from world coordinates to the current camera coordinates, the transformation matrix $T_{bc}$ from the camera view to the bird's-eye view, and the motion matrix of the object, which is expressed using the exponential map Exp, the linear velocity, and the angular velocity;
S4. Calculate the intersection-over-union (IoU) between the detection box and the target box;
S5. Determine whether the object has been successfully tracked based on the calculated IoU; step S5 includes: objects with an IoU greater than a threshold are considered successfully tracked; objects with an IoU of 0 are considered new objects; objects with an IoU between 0 and the threshold are considered unmatchable;
S6. Remove feature points within successfully tracked dynamic objects; the removal of feature points within successfully tracked dynamic objects includes: projecting all extracted feature points onto the bird's-eye view, and removing feature points located within the bounding box of dynamic objects.

2. The multi-target tracking method for an autonomous driving scenario according to claim 1, characterized in that the historically generated object is obtained from the historically generated object database, and the information of the object includes the object ID, object velocity, and object world coordinates.

3. A multi-target tracking system for autonomous driving scenarios, characterized in that it includes:
a 3D target detection module, configured to perform 3D target detection on each frame of the input image, and then project all current 3D target bounding boxes onto the bird's-eye view to obtain the detection boxes; the projecting of all current 3D target bounding boxes onto the bird's-eye view includes: transforming the world coordinates of the object to coordinates in the current camera coordinate system; and projecting the transformed coordinates onto the bird's-eye view to obtain the homogeneous pixel coordinates of the object, $(u, v, 1)^T = K_{bev} P_c$, where $K_{bev}$ represents the intrinsic parameter matrix of the bird's-eye view and $P_c$ represents the transformed coordinates;
a target bounding box generation module, configured to determine whether historically generated objects are within the field of view, and to project objects within the field of view onto the bird's-eye view to obtain target bounding boxes; the determining of whether historically generated objects are within the field of view includes: determining whether the homogeneous pixel coordinates are within a certain range, and if they are within that range, the object is in the field of view; the range is 0-608, and when u and v are both within the range, the historically generated object is within the field of view; the projecting of objects within the field of view onto the bird's-eye view includes: estimating the current bird's-eye view coordinates of the object based on a constant-velocity motion model, wherein the transformation is expressed in terms of the spatial coordinates of object l at time t-1, the homogeneous pixel coordinates of the object in the bird's-eye view, the transformation matrix $T_{cw}$ from world coordinates to the current camera coordinates, the transformation matrix $T_{bc}$ from the camera view to the bird's-eye view, and the motion matrix of the object, which is expressed using the exponential map Exp, the linear velocity, and the angular velocity;
an intersection-over-union (IoU) calculation module, configured to calculate the IoU between the detection box and the target box;
a feature point removal module, configured to determine whether an object has been successfully tracked based on the calculated IoU, and to remove feature points within successfully tracked dynamic objects; the determining of whether an object has been successfully tracked based on the calculated IoU includes: objects with an IoU greater than a threshold are considered successfully tracked; objects with an IoU of 0 are considered new objects; objects with an IoU between 0 and the threshold are considered unmatchable; the removal of feature points within successfully tracked dynamic objects includes: projecting all extracted feature points onto the bird's-eye view, and removing feature points located within the bounding box of dynamic objects.

4. A computer storage medium storing computer instructions thereon, characterized in that, when the computer instructions are executed, they perform the relevant steps in the multi-target tracking method for an autonomous driving scenario as described in any one of claims 1-2.

5. A terminal, comprising a memory and a processor, wherein the memory stores computer instructions executable by the processor, characterized in that, when the processor executes the computer instructions, it performs the relevant steps in the multi-target tracking method for an autonomous driving scenario as described in any one of claims 1-2.