System and method for 3D object detection and tracking with monocular surveillance cameras
By using a monocular surveillance camera and computing devices to generate two-dimensional and three-dimensional bird's-eye view images of vehicles, the problems of high sensor cost and large computing resource requirements in vehicle-road cooperative systems are solved, enabling efficient information sharing and collaborative operation, and improving the safety and efficiency of the transportation system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JINGDONG TECH HLDG CO LTD
- Filing Date
- 2021-07-15
- Publication Date
- 2026-06-16
Figure CN116601667B_ABST
Abstract
Description
[0001] Cross-references
[0002] References are cited and discussed in the description of this disclosure, which may include patents, patent applications, and various publications. The citation and / or discussion of such references is provided solely to clarify the description of this disclosure and is not an admission that any such reference is “prior art” to the disclosure described herein. All references cited in the “References” section or discussed in this specification are incorporated herein by reference in their entirety, to the same extent as if each reference were individually incorporated by reference.
Technical Field
[0003] This disclosure generally relates to object detection and tracking, and more specifically to vehicle-road cooperative systems that use monocular surveillance cameras to detect and track vehicles.
Background Technology
[0004] The background description provided herein is intended to provide a general overview of the context of this disclosure. Within the scope of this background description, the work of the currently named inventors and descriptions that might not have been considered prior art at the time of filing are not, expressly or impliedly, acknowledged as prior art to this disclosure.
[0005] Intelligent transport systems (ITS) are transportation systems that integrate advanced information, communication, computer, sensor, and control technologies applied to the transportation field to improve safety, sustainability, efficiency, and comfort. As an integrated system of people, roads, and vehicles, it provides drivers with road information and convenient services, reduces traffic congestion, and increases road capacity.
[0006] As an advanced stage of intelligent transportation systems, the Cooperative Vehicle Infrastructure System (CVIS) uses wireless communication and sensor detection technologies to acquire vehicle and road information, enabling interaction and data sharing between vehicles and infrastructure. This system addresses the problem of intelligent communication and coordination between vehicles and infrastructure, allowing for more efficient use of system resources, safer road traffic, and reduced traffic congestion. It can interpret the intentions of traffic participants with high precision and significantly improve the perception of autonomous vehicles. Sensors such as cameras, radar, and light detection and ranging (LiDAR) can be installed on vehicles and on streetlight poles, which evolve into integrated signal poles, integrated traffic poles, and integrated electronic alarm poles. Simultaneous perception at the vehicle and road terminals minimizes blind spots and provides early warning of collisions that would otherwise go unseen.
[0007] CVIS involves technologies such as intelligent vehicle systems, intelligent road testing, and vehicle-to-everything (V2X). Autonomous driving is one of the main applications of V2X communication, which can have a primary impact on people's lifestyles. V2X communication overcomes two limitations of existing autonomous vehicles that rely solely on a perception subsystem composed of onboard sensors: (1) the limited sensing range of onboard sensors only allows the detection of adjacent vehicles; and (2) vehicles cannot cooperate to efficiently perform highly complex maneuvers. These shortcomings can be overcome because V2X achieves two key features in autonomous vehicles: (1) cooperative sensing, i.e., increasing the sensing range through the mutual exchange of sensing data; and (2) cooperative maneuvering, i.e., a group of autonomous vehicles cooperating in driving according to a common centralized or decentralized decision-making strategy. To ensure safety and improve efficiency, real-time alerts from trusted sources are sent to drivers and pedestrians, providing information about road hazards, congestion, and the presence of emergency vehicles.
[0008] As shown in Figure 1, V2X is mainly deployed in four operating modes: (1) vehicle-to-vehicle (V2V), (2) vehicle-to-infrastructure (V2I), (3) vehicle-to-pedestrian (V2P), and (4) vehicle-to-network (V2N). V2I can provide information to vehicles, such as available parking spaces, traffic congestion, and road conditions. V2I application information is generated by locally available application servers and transmitted through remote switching units (RSUs), which are roadside fixed units that act as transceivers. To improve the efficiency and accuracy of V2I applications, it is recommended to use different types of sensors, such as high-resolution cameras and ultra-high frequency (UHF) radio waves. However, using multiple sensors is costly and requires substantial computational resources to integrate the information from these sensors.
[0009] Therefore, there is a need in this field to address the aforementioned defects and shortcomings.
Summary of the Invention
[0010] In some aspects, this disclosure relates to a system. In some embodiments, the system includes: a camera; and a computing device in communication with the camera. The computing device is configured to:
[0011] receive a plurality of video frames from the camera;
[0012] detect objects from the plurality of video frames, wherein each object detected in each video frame is represented by a detection vector, the detection vector including a first dimension and a second dimension, the first dimension representing two-dimensional (2D) parameters of the object, and the second dimension representing three-dimensional (3D) parameters of the object;
[0013] track the object in the plurality of video frames based on the detection vectors of the object in the plurality of video frames to obtain a trajectory of the object, wherein the loss minimized during object tracking is calculated based on the first and second dimensions of the detection vectors; and
[0014] convert the plurality of video frames into a bird's-eye view, wherein the bird's-eye view of the plurality of video frames includes the trajectory of the object.
[0015] In some embodiments, the object is a vehicle, and the system further includes a server computing device. The server computing device is configured to receive the bird's-eye view of the plurality of video frames from the computing device, and to use the received bird's-eye view of the plurality of video frames to perform at least one of cooperative maneuvering and cooperative risk warning.
[0016] In some embodiments, the computing device, the server computing device, and the vehicle communicate via a fifth-generation mobile network.
[0017] In some embodiments, the 2D parameters of the object include the position and size of the 2D bounding box surrounding the object in a corresponding video frame of the plurality of video frames, and the 3D parameters of the object include the vertices, center point, and orientation of the 3D bounding box surrounding the object in a corresponding video frame of the plurality of video frames.
[0018] In some embodiments, the computing device is configured to simultaneously detect the 2D and 3D parameters of the object using a single-shot 3D object detector.
[0019] In some embodiments, the computing device is configured to detect 2D parameters of the object using a 2D object detector and to detect 3D parameters of the object using a 3D object detector, the 2D detector detecting the 2D parameters of the object from video frames, and the 3D object detector detecting the 3D parameters of the object from the 2D parameters.
[0020] In some embodiments, the computing device is configured to use a graph convolutional long short-term memory (GC-LSTM) network to track the object, the GC-LSTM network employing a Siamese graph convolutional network (GCN) to associate the object's identifiers in the plurality of video frames based on the object's 2D parameters, the object's 3D parameters, and the object's visual features.
[0021] In some embodiments, the computing device is configured to use a Kalman filter and a Hungarian algorithm to track the object, wherein the Kalman filter is used to optimize the 2D and 3D parameters of the detected object, and the Hungarian algorithm is used to associate the identifiers of the object in the plurality of video frames.
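The identifier association step can be illustrated with a minimal Python sketch. It is not the implementation of this disclosure: the cost function (a weighted sum of the distance between 2D box centers and the distance between 3D box centers), the field names c2d and c3d, and the weights w2d and w3d are assumptions chosen only to show a matching cost that depends on both the 2D part and the 3D part of the detection vector. The hungarian function is a standard potentials-based form of the Hungarian algorithm and expects no more rows than columns.

```python
import math


def combined_cost(track, det, w2d=1.0, w3d=1.0):
    """Assumed cost: weighted 2D-center distance plus 3D-center distance."""
    return (w2d * math.dist(track["c2d"], det["c2d"])
            + w3d * math.dist(track["c3d"], det["c3d"]))


def hungarian(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m.

    Returns a list of (row, column) pairs, one pair per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j] = row (1-based) matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:       # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def associate(tracks, detections):
    """Map each track index to a detection index with minimum total cost."""
    if not tracks or not detections:
        return {}
    cost = [[combined_cost(t, d) for d in detections] for t in tracks]
    if len(tracks) <= len(detections):
        return dict(hungarian(cost))
    transposed = [list(col) for col in zip(*cost)]
    return {t: d for d, t in hungarian(transposed)}


tracks = [{"c2d": (10, 10), "c3d": (0, 0, 5)},
          {"c2d": (200, 40), "c3d": (3, 0, 20)}]
detections = [{"c2d": (202, 41), "c3d": (3.1, 0, 19.5)},
              {"c2d": (12, 11), "c3d": (0.1, 0, 5.2)},
              {"c2d": (400, 300), "c3d": (-8, 0, 40)}]
matches = associate(tracks, detections)
```

In this example the third detection stays unmatched and would start a new trajectory. A practical tracker would typically also reject matches whose cost exceeds a gating threshold, which this sketch omits.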
[0022] In some embodiments, the camera includes a monocular surveillance camera, and the computing device is further configured to calibrate the monocular surveillance camera. In some embodiments, the camera and the computing device are mounted on a traffic pole.
[0023] In some embodiments, the computing device is configured to detect and track the object using a built-in chip.
[0024] In some embodiments, this disclosure relates to a method. In some embodiments, the method includes:
[0025] The computing device receives multiple video frames captured by the camera;
[0026] The computing device detects objects from the plurality of video frames, wherein the objects detected in each video frame are represented by a detection vector, the detection vector including a first dimension and a second dimension, the first dimension representing the two-dimensional 2D parameters of the object, and the second dimension representing the three-dimensional 3D parameters of the object;
[0027] The computing device tracks the object across the plurality of video frames based on the detection vectors of the object to obtain a trajectory of the object, wherein the loss minimized during object tracking is calculated based on the first and second dimensions of the detection vectors; and
[0028] The computing device converts the plurality of video frames into a bird's-eye view, wherein the bird's-eye view of the plurality of video frames includes the trajectory of the object.
[0029] In some embodiments, the object is a vehicle, and the method further includes:
[0030] The server computing device receives the bird's-eye view of the plurality of video frames from the computing device; and
[0031] The server computing device uses the received bird's-eye view of the plurality of video frames to perform at least one of cooperative maneuvering and cooperative risk warning.
[0032] In some embodiments, the 2D parameters of the object include the position and size of the 2D bounding box surrounding the object in a corresponding video frame of the plurality of video frames, and the 3D parameters of the object include the vertices, center point, and orientation of the 3D bounding box surrounding the object in a corresponding video frame of the plurality of video frames.
[0033] In some embodiments, a single-shot 3D object detector is used to detect both the 2D and 3D parameters of the object simultaneously.
[0034] In some embodiments, a 2D object detector and a 3D object detector are used to detect the 2D parameters and 3D parameters of the object, respectively. The 2D detector detects the 2D parameters of the object from video frames, and the 3D object detector detects the 3D parameters of the object from the 2D parameters.
[0035] In some embodiments, a graph convolutional long short-term memory (GC-LSTM) network is used to track the object. The GC-LSTM network employs a Siamese graph convolutional network (GCN) to associate the object's identifiers across multiple video frames based on the object's 2D parameters, 3D parameters, and visual features.
[0036] In some embodiments, a Kalman filter and a Hungarian algorithm are used to track the object, wherein the Kalman filter is used to optimize the 2D and 3D parameters of the detected object, and the Hungarian algorithm is used to associate the object's identifiers in the plurality of video frames.
[0037] In some embodiments, the camera includes a monocular surveillance camera.
[0038] In some embodiments, this disclosure relates to a non-transitory computer-readable medium storing computer-executable code. The computer-executable code, when executed at a processor of a computing device, is configured to perform the methods described above.
[0039] These and other aspects of this disclosure will become apparent from the following description of preferred embodiments in conjunction with the accompanying drawings and descriptions, although variations and modifications therein may be made without departing from the spirit and scope of the novel concept of this disclosure.
Brief Description of the Drawings
[0040] The accompanying drawings illustrate one or more embodiments of this disclosure and, together with the written description, serve to explain the principles of this disclosure. Where possible, the same reference numerals are used throughout the drawings to refer to the same or similar elements of the embodiments, wherein:
[0041] Figure 1 schematically depicts vehicle-to-everything interaction in a 5G network environment.
[0042] Figure 2 schematically depicts the data flow of a vehicle-road cooperative system according to certain embodiments of the present disclosure.
[0043] Figure 3 schematically depicts a computing device for a vision-centric 3D object detection and tracking system according to certain embodiments of the present disclosure.
[0044] Figure 4 schematically depicts the graph convolution long short-term memory (GC-LSTM) network for 3D object tracking according to certain embodiments of the present disclosure.
[0045] Figure 5 schematically depicts the conversion of a camera view into a bird's-eye view according to certain embodiments of the present disclosure.
[0046] Figure 6 schematically depicts the process of 2D and 3D detection and tracking for cooperative maneuvering and cooperative risk warning according to certain embodiments of the present disclosure.
Detailed Description
[0047] The present disclosure is described in more detail in the following examples, which are intended to be illustrative only, as many modifications and variations therein will be apparent to those skilled in the art. Various embodiments of the present disclosure are now described in detail. Referring to the accompanying drawings, like numerals indicate like components throughout the views. Unless the context clearly dictates otherwise, the terms “a,” “an,” and “the” as used herein and throughout the claims include plural references. Furthermore, as used in the description and claims of this disclosure, unless the context clearly dictates otherwise, “in” includes both “in” and “on”. Headings or subheadings may be used in the specification for the reader's convenience, without affecting the scope of the present disclosure. In addition, some terms used in this specification are given more specific definitions below.
[0048] The terms used in this specification generally have their ordinary meaning in the art, in the context of this disclosure, and in the specific context in which each term is used. Certain terms used to describe this disclosure are discussed below or elsewhere in the specification to provide practitioners with additional guidance regarding the description of this disclosure. It will be understood that the same thing can be expressed in more than one way. Therefore, alternative language and synonyms may be used for any one or more terms discussed herein, and have no particular significance in whether a term is elaborated or discussed herein. This disclosure provides synonyms for certain terms. The use of one or more synonyms does not preclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any terms discussed herein, is merely illustrative and in no way limits the scope or meaning of this disclosure or any exemplary terms. Likewise, this disclosure is not limited to the various embodiments given in this specification.
[0049] It should be understood that when an element is referred to as being “on” another element, the element may be directly on the other element, or an intermediate element may exist between the two elements. Conversely, when an element is referred to as being “directly” on another element, there is no intermediate element. As used herein, the term “and / or” includes any and all combinations of one or more of the associated listed items.
[0050] It should be understood that although the terms first, second, third, etc., may be used herein to describe various elements, components, regions, layers, and / or parts, these elements, components, regions, layers, and / or parts should not be limited by these terms. These terms are used only to distinguish one element, component, region, layer, or part from another. Therefore, the first element, component, region, layer, or part discussed below may be referred to as the second element, component, region, layer, or part without departing from the teachings of this disclosure.
[0051] Furthermore, relative terms such as “lower” or “bottom” and “upper” or “top” may be used herein to describe the relationship between one element and another, as illustrated in the figures. It should be understood that relative terms are intended to cover different orientations of the device other than those depicted in the figures. For example, if a device in one of the figures is flipped, an element described as being “below” the other elements will be oriented “above” the other elements. Thus, the exemplary term “lower” can include both “down” and “up” orientations, depending on the specific orientation of the figure. Similarly, if a device in one of the figures is flipped, an element described as being “below” or “under” the other elements will be oriented “above” the other elements. Thus, the exemplary term “below” or “under” can include both “up” and “down” orientations.
[0052] Unless otherwise defined, all terms used in this disclosure (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. It should also be understood that terms such as those defined in common dictionaries should be interpreted as having the same meaning as they have in the context of the relevant technology and this disclosure, and, unless expressly defined herein, should not be interpreted as having an idealized or overly formal meaning.
[0053] As used herein, “around,” “about,” “substantially,” or “approximately” shall generally mean within 20%, preferably within 10%, and more preferably within 5% of a given value or range. The values given herein are approximate, meaning that the terms “around,” “about,” “substantially,” or “approximately” can be inferred unless explicitly stated otherwise.
[0054] As used herein, "multiple" means two or more.
[0055] As used herein, the terms “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” etc., should be understood as open-ended, meaning including but not limited to.
[0056] As used herein, the phrase "at least one of A, B, and C" should be interpreted to mean a logical (A or B or C), using a non-exclusive logical OR. It should be understood that one or more steps within the method may be performed in a different order (or simultaneously) without altering the principles of this disclosure.
[0057] As described herein, the term "module" may refer to or include application-specific integrated circuits (ASICs); electronic circuitry; combinational logic circuitry; field-programmable gate arrays (FPGAs); processors (shared, dedicated, or grouped) that execute code; other suitable hardware components that provide the described functionality; or some or all of the above, such as in a system-on-a-chip. The term "module" may include memory (shared, dedicated, or grouped) that stores code executed by a processor.
[0058] The term "code" as used herein can include software, firmware, and / or microcode, and can refer to programs, routines, functions, classes, and / or objects. The term "shared" as used above means that some or all of the code from multiple modules can be executed using a single (shared) processor. Furthermore, some or all of the code from multiple modules can be stored in a single (shared) memory. The term "group" as used above means that some or all of the code from a single module can be executed using a group of processors. Furthermore, a group of memories can be used to store some or all of the code from a single module.
[0059] As described herein, the term "interface" generally refers to a communication tool or device used at the interaction point between components to perform data communication between components. Generally, interfaces can be applied at both the hardware and software levels, and can be unidirectional or bidirectional. Examples of physical hardware interfaces can include electrical connectors, buses, ports, cables, terminals, and other I / O devices or components. Components communicating with the interface can be, for example, multiple components of a computer system or peripheral devices.
[0060] This disclosure relates to computer systems. As shown in the accompanying drawings, computer components may include physical hardware components indicated by solid lines and virtual software components indicated by dashed lines. Those skilled in the art will understand that, unless otherwise stated, these computer components may be implemented as software, firmware, or hardware components, or combinations thereof, but are not limited to these forms.
[0061] The apparatus, systems, and methods described herein can be implemented by one or more computer programs executed by one or more processors. The computer program includes processor-executable instructions stored on a non-transitory tangible computer-readable medium. The computer program may also include stored data. Non-limiting examples of non-transitory tangible computer-readable media are non-volatile memory, magnetic storage, and optical storage.
[0062] This disclosure will now be described more fully below with reference to the accompanying drawings, in which embodiments of the disclosure are illustrated. However, this disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
[0063] The combination of autonomous vehicles and V2I technology enables two key cooperative features: sensing and maneuvering. Among the many important pieces of information, a vehicle's 3D extent and trajectory are crucial cues for predicting the vehicle's future position (sensing) and planning future motion based on these predictions (maneuvering). In some aspects, this disclosure provides a system that utilizes a static monocular surveillance camera and a corresponding locally available application server to detect and track multiple vehicles, mapping these vehicles to the camera coordinate system for transmission via a local RSU. The positions and identities of these vehicles are displayed in a bird's-eye view for cooperative sensing and maneuvering.
[0064] Figure 2 schematically depicts the data flow of a vehicle-to-infrastructure (V2I) system according to certain embodiments of this disclosure. As shown in Figure 2, the system includes a master server 210, multiple remote switching units (RSUs) 230, multiple local cameras 240, and multiple local application servers 250. Each local application server 250 includes a calibration module 256. The calibration module 256 calibrates the local camera 240, instructs the local camera 240 to capture video of the environment, receives multiple video frames from the local camera 240, and prepares and provides the received video frames to a 3D object detection module 258. The 3D object detection module 258 receives the calibrated video frames and detects objects, such as vehicles, in 2D and 3D in the video. Then, a 3D object tracking module 262 tracks the objects in the video. A bird's-eye view transformer 267 converts the view of the video with objects into a bird's-eye view. The bird's-eye view from each local application server 250 is then transmitted to the master server 210 via the RSUs 230. When bird's-eye views in different camera coordinate systems are available, the master server 210 combines these bird's-eye views to obtain a global bird's-eye view within a defined area covered by the local cameras 240. Using the global bird's-eye view of the area and the objects detected in it, the master server 210 can perform cooperative maneuvering 270 and cooperative risk warning 290. These maneuvers and risk warnings can be relayed back to the object via the master server 210, the local RSU 230, and the local application server 250, which can directly instruct the object's actions or communicate with the object. In some embodiments, the master server 210 can also communicate with the object in other ways. In some embodiments, the object is a vehicle or an autonomous vehicle.
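The step in which the master server combines bird's-eye views from different camera coordinate systems can be sketched as follows. This Python sketch is illustrative only and assumes that each local bird's-eye view is a planar map related to the world frame by a 2D rigid transform (tx, ty, theta) known from calibration. The function names, the pose format, and the dictionary layout are hypothetical and do not come from this disclosure.

```python
import math


def local_to_world(point, pose):
    """Rotate a local (x, y) point by theta, then translate by (tx, ty)."""
    x, y = point
    tx, ty, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty)


def merge_bird_eye_views(local_views, camera_poses):
    """Express every local trajectory in the shared world frame.

    local_views:  {camera_id: {object_id: [(x, y), ...]}}
    camera_poses: {camera_id: (tx, ty, theta)}  (assumed calibration output)
    Trajectories are keyed by (camera_id, object_id); fusing identities
    of one vehicle seen by several cameras is out of scope for this sketch.
    """
    merged = {}
    for camera_id, tracks in local_views.items():
        pose = camera_poses[camera_id]
        for object_id, trajectory in tracks.items():
            merged[(camera_id, object_id)] = [
                local_to_world(p, pose) for p in trajectory
            ]
    return merged


local_views = {"cam_a": {7: [(1.0, 0.0)]}, "cam_b": {7: [(1.0, 0.0)]}}
camera_poses = {"cam_a": (0.0, 0.0, 0.0), "cam_b": (10.0, 0.0, math.pi / 2)}
global_view = merge_bird_eye_views(local_views, camera_poses)
```

Here the same local point lands at about (1, 0) for cam_a and about (10, 1) for cam_b, because cam_b is assumed to be offset by 10 units and rotated by 90 degrees relative to the world frame.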
[0065] Each local camera 240 can be mounted on a traffic pole. A corresponding local application server 250 can be mounted on or near the traffic pole. In some embodiments, the local application server 250 can also be located away from the traffic pole, provided that communication between the local camera 240 and the remotely placed local application server 250 is efficient, and communication from the local application server 250 and / or the master server 210 to vehicles near the traffic pole is efficient. In some embodiments, the functionality of the local application server 250 can also be integrated into the master server 210.
[0066] In some embodiments, the 3D object detection module 258 is implemented using a two-dimensional object detection (2DOD) model 259 and a three-dimensional object detection (3DOD) model 260. The combination of the 2DOD model 259 and the 3DOD model 260 is a top-down approach. In some embodiments, after detecting the 2D bounding boxes of objects from the video frames captured by the local camera 240, this disclosure uses these 2D detection results to reduce the search space while regressing the 3D bounding boxes. Alternatively, the 3D object detection module 258 is implemented using a single-shot 3D object detection (3DOD) module 261. This is a bottom-up approach in which joint 2D and 3D object detection is performed in one shot. In some embodiments, the 3D object detection module 258 may provide both the top-down and bottom-up approaches described above, together with a mechanism to select one of them based on certain criteria (e.g., the location of the local camera 240, the quality of the video captured by the local camera 240, the distance between the local camera 240 and other adjacent local cameras 240, the computing power of the application server 250, and the requirements of specific tasks such as cooperative maneuvering and cooperative risk warning). In some embodiments, the 3D object detection module 258 may include only the single-shot 3DOD module 261, or only the 2DOD model 259 and the 3DOD model 260. In some embodiments, the input to the 3D object detection module 258 is a series of video frames, and its output is the video frames, the 2D bounding boxes of the objects in the video frames, and the 3D object information. In some embodiments, the 2D and 3D information of each detected object is combined to form a vector for that object.
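One possible layout of the combined per-object vector is sketched below in Python. The class name Detection, its field names, and the ordering of the entries are hypothetical. The sketch assumes that the 2D part holds the position and size of the 2D bounding box and that the 3D part holds the eight vertices, the center point, and a single yaw angle as the orientation of the 3D bounding box. It illustrates the idea of a vector with a 2D part and a 3D part and is not the format used by the disclosed system.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Detection:
    # 2D parameters: bounding-box position and size in pixels.
    box2d: Tuple[float, float, float, float]  # (x, y, width, height)
    # 3D parameters: eight box vertices, center point, and orientation.
    vertices3d: List[Point3D]
    center3d: Point3D
    yaw: float  # heading in radians (assumed single-angle orientation)

    def to_vector(self) -> List[float]:
        """Concatenate the 2D part and the 3D part into one flat vector."""
        part_2d = list(self.box2d)
        part_3d = [c for vertex in self.vertices3d for c in vertex]
        part_3d += list(self.center3d) + [self.yaw]
        return part_2d + part_3d


# Example: a unit cube as the 3D box, purely for illustration.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
det = Detection(box2d=(100.0, 50.0, 40.0, 30.0), vertices3d=cube,
                center3d=(0.5, 0.5, 0.5), yaw=0.0)
vec = det.to_vector()
```

With this layout the vector has 4 entries for the 2D part followed by 28 entries for the 3D part (24 vertex coordinates, 3 center coordinates, and 1 angle), so a tracker can weight the two parts separately when computing its loss.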
[0067] In some embodiments, the 3D object tracking module 262 is executed using a 3D Kalman filter 263 and a Hungarian algorithm 264. The Kalman filter 263 is used to smooth the trajectories of 2D and 3D objects, while the Hungarian algorithm 264 is used to associate the identifiers of objects across multiple video frames. The Kalman filter 263 and the Hungarian algorithm 264 can be executed serially or in parallel. Alternatively, the 3D object tracking module 262 is executed using a GC-LSTM module 265. In some embodiments, the GC-LSTM module 265 performs online tracking, which has the side effect of smoothing the trajectory. Furthermore, the GC-LSTM module 265 uses a Siamese Graph Convolutional Network (GCN) to associate object IDs, such as vehicle IDs. In some embodiments, the 3D object tracking module 262 may provide both of the aforementioned tracking methods and provide a mechanism to select one of the methods based on certain criteria (e.g., the position of the local camera 240, the quality of the video captured by the local camera 240, the number of objects or vehicles in the video, the distance between the local camera 240 and other adjacent local cameras 240, and the computing power of the application server 250). In some embodiments, the 3D object tracking module 262 may include only the GC-LSTM module 265, or only the 3D Kalman filter 263 and the Hungarian algorithm 264.
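The smoothing role of the Kalman filter can be illustrated with a minimal Python sketch for a single coordinate of a trajectory. It assumes a constant-velocity motion model with state [position, velocity] and a position-only measurement. The class name and the noise settings init_var, process_var, and meas_var are hypothetical values chosen for illustration. The 3D Kalman filter 263 would track the full set of 2D and 3D parameters, whose exact state layout is not specified here.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, position=0.0, velocity=0.0,
                 init_var=100.0, process_var=0.01, meas_var=1.0):
        self.p = position
        self.v = velocity
        # 2x2 state covariance, stored as four scalars.
        self.p00, self.p01, self.p10, self.p11 = init_var, 0.0, 0.0, init_var
        self.q = process_var  # assumed process noise on both variances
        self.r = meas_var     # assumed measurement noise variance

    def predict(self, dt=1.0):
        """x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        a00 = self.p00 + dt * self.p10
        a01 = self.p01 + dt * self.p11
        self.p00 = a00 + dt * a01 + self.q
        self.p01 = a01
        self.p10 = self.p10 + dt * self.p11
        self.p11 = self.p11 + self.q

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        y = z - self.p                 # innovation
        s = self.p00 + self.r          # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s
        self.p += k0 * y
        self.v += k1 * y
        p00, p01 = self.p00, self.p01
        self.p00 = (1.0 - k0) * p00    # P = (I - K H) P
        self.p01 = (1.0 - k0) * p01
        self.p10 -= k1 * p00
        self.p11 -= k1 * p01
        return self.p


def smooth(measurements, dt=1.0):
    """Filter a sequence of positions; returns the estimates and the filter."""
    kf = ConstantVelocityKalman1D(position=measurements[0])
    estimates = [measurements[0]]
    for z in measurements[1:]:
        kf.predict(dt)
        estimates.append(kf.update(z))
    return estimates, kf


# A vehicle moving 2 units per frame along one axis.
positions = [2.0 * k for k in range(21)]
estimates, kf = smooth(positions)
```

After these 21 frames the velocity estimate kf.v is close to 2. A later outlier measurement is pulled toward the predicted position instead of being accepted as is, which is the smoothing effect described above.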
[0068] The bird's-eye view transformer 267 can convert multiple video frames with detected and tracked objects from the camera coordinate system into a bird's-eye view. Local bird's-eye views from multiple local application servers 250 are transmitted to the master server 210 via their respective RSUs 230. The master server 210 combines the bird's-eye views into a global bird's-eye view in the world coordinate system. The master server 210 can utilize the global bird's-eye view, preferably combined with other relevant information, to achieve cooperative maneuvering and cooperative risk warning. In some embodiments, the bird's-eye view transformer 267 may also be a component of the master server 210, rather than a component of the local application servers 250.
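One common way to map image points of a fixed camera onto a ground-plane bird's-eye view is a planar homography. The Python sketch below assumes that such a 3x3 matrix H is available from camera calibration and that each tracked object is represented by the pixel where it touches the ground. Whether the bird's-eye view transformer 267 works this way is not stated in this disclosure, and the matrix values below are hypothetical.

```python
def apply_homography(H, pixel):
    """Map an image pixel (u, v) to ground-plane coordinates (x, y)."""
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity")
    return (x / w, y / w)


def trajectory_to_bev(H, trajectory):
    """Convert a trajectory of ground-contact pixels to bird's-eye view."""
    return [apply_homography(H, pixel) for pixel in trajectory]


# Hypothetical calibration result with a perspective term in the last row.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.001, 1.0]]
bev_track = trajectory_to_bev(H, [(100.0, 1000.0), (100.0, 0.0)])
```

With this matrix the pixel (100, 1000) maps to about (50, 500) and the pixel (100, 0) maps to (100, 0). The division by w is what makes the mapping a perspective transform rather than an affine one.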
[0069] Figure 3 schematically depicts a cooperative vehicle infrastructure system (CVIS) according to certain embodiments of the present disclosure. In some embodiments, the CVIS 300 includes a plurality of computing devices 350. The computing devices 350 may be server computers, clusters, cloud computers, general-purpose computers, headless computers, or dedicated computers that provide object detection and object tracking for multiple video frames and provide a bird's-eye view of the objects in a camera coordinate system. In some embodiments, each computing device 350 may correspond to one of the application servers 250 shown in Figure 2. As shown in Figure 3, the computing device 350 may include, but is not limited to, a processor 351, a memory 352, and a storage device 353. In some embodiments, the computing device 350 may include other hardware and software components (not shown) to perform its respective tasks. Examples of these hardware and software components may include, but are not limited to, other required memory, interfaces, buses, input / output (I / O) modules or devices, network interfaces, and peripheral devices.
[0070] Processor 351 may be a central processing unit (CPU) configured to control the operation of computing device 350. Processor 351 may execute an operating system (OS) or other applications of computing device 350. In some embodiments, computing device 350 may have more than one CPU as a processor, such as two CPUs, four CPUs, eight CPUs, or any suitable number of CPUs.
[0071] Memory 352 may be volatile memory, such as random access memory (RAM), for storing data and information during operation of computing device 350. In some embodiments, memory 352 may be an array of volatile memory. In some embodiments, computing device 350 may operate on more than one memory 352.
[0072] In some embodiments, the computing device 350 may also include a graphics card to assist the processor 351 and memory 352 in image processing and display.
[0073] Storage device 353 is a non-volatile data storage medium used to store the operating system (not shown) and other applications of computing device 350. Examples of storage device 353 may include non-volatile memory such as flash memory, memory card, USB drive, solid-state drive, floppy disk, optical drive, or any other type of data storage device. In some embodiments, computing device 350 may have multiple storage devices 353, which may be the same storage device or different types of storage devices, and applications of computing device 350 may be stored in one or more storage devices 353 of computing device 350.
[0074] In this embodiment, processor 351, memory 352, and storage device 353 are components of computing device 350 (e.g., a server computing device). In other embodiments, computing device 350 may be a distributed computing device, where processor 351, memory 352, and storage device 353 are shared resources from multiple computers in a predefined area.
[0075] In addition, storage device 353 includes a local camera application 354. Local camera application 354 includes a calibration module 356, a 3D object detection module 358, a 3D object tracking module 362, a bird's-eye view transformer 367, and an optional user interface 368. In some embodiments, storage device 353 includes other applications or modules required for the local camera application 354 to run. It should be noted that modules 356, 358, 362, 367, and 368 are each implemented as computer-executable code or instructions, data tables, or databases, and together they form an application. In some embodiments, each module may also include submodules. Alternatively, some modules may be combined into a stack. In other embodiments, some modules may be implemented as circuits rather than executable code, which can significantly improve computational speed.
[0076] The calibration module 356 is configured to calibrate the local camera 240: it receives video from the local camera 240, calibrates the received video, and sends the calibrated video to the 3D object detection module 358. Calibration is required for the local camera 240 on the traffic pole. However, since the camera is static, its intrinsic and extrinsic parameters are stable, so the calibration process for each camera only needs to be performed once. In some embodiments, the calibration process acquires a series of chessboard images (e.g., an 8-column, 6-row chessboard) and then detects the chessboard corners. The calibration process then iterates to find the precise sub-pixel positions of the corners or radial saddle points. Based on this information, the embodiment calibrates the camera and derives the camera matrix, distortion coefficients, and translation and rotation vectors. The embodiment then calculates the projection matrix and rotation matrix. In some embodiments, calibration is performed by the calibration module 356. In some embodiments, calibration parameters can also be loaded into the camera 240 to calibrate the video output by the camera 240, which can then be directly input into the 3D object detection module 358. In some embodiments, calibration parameters are provided to the 3D object detection module 358, and the camera 240 sends the captured video directly to the 3D object detection module 358, allowing the 3D object detection module 358 to process the video using the calibration parameters. In some embodiments, the calibration module 356 is configured to store the calibration parameters, such as the projection and rotation matrices, in the local camera application 354, and the stored calibration parameters are available for use by the modules of the local camera application 354.
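By way of non-limiting illustration, the step of calculating the rotation matrix and projection matrix from the calibrated camera matrix and the rotation and translation vectors may be sketched as follows. The sketch assumes the rotation vector uses the axis-angle convention and converts it with the Rodrigues formula; the function names and all numeric values are hypothetical and are not taken from this disclosure.

```python
import numpy as np

def rodrigues(rvec):
    """Convert an axis-angle rotation vector (radians) into a 3x3 rotation matrix."""
    rvec = np.asarray(rvec, dtype=float)
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    skew = np.array([[0.0, -k[2], k[1]],
                     [k[2], 0.0, -k[0]],
                     [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * skew + (1.0 - np.cos(theta)) * (skew @ skew)

def projection_matrix(camera_matrix, rvec, tvec):
    """Build the 3x4 projection matrix P = K [R | t]."""
    rotation = rodrigues(rvec)
    translation = np.asarray(tvec, dtype=float).reshape(3, 1)
    return camera_matrix @ np.hstack([rotation, translation])

def project(P, point_3d):
    """Project one 3D point to pixel coordinates (lens distortion is ignored here)."""
    u, v, w = P @ np.append(np.asarray(point_3d, dtype=float), 1.0)
    return np.array([u / w, v / w])

# Hypothetical calibration results (assumed values for illustration only).
camera_matrix = np.array([[1000.0, 0.0, 640.0],
                          [0.0, 1000.0, 360.0],
                          [0.0, 0.0, 1.0]])
rvec = np.zeros(3)
tvec = np.zeros(3)

P = projection_matrix(camera_matrix, rvec, tvec)
pixel = project(P, [1.0, 0.5, 10.0])
```

Because the camera is static, matrices computed in this way may be stored once and reused for every subsequent frame.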
[0077] The 3D object detection module 358 is configured to detect 2D and 3D objects from multiple video frames after receiving video from the camera 240 or calibrated video from the calibration module 356, and to send the 2D and 3D object detection results to the 3D object tracking module 362. In some embodiments, as shown in Figure 3, this disclosure provides two detector options: 1) a top-down detector, including a 2D object detector and a subsequent 3D object regressor, namely the 2D Object Detection (2DOD) model 359 and the 3D Object Detection (3DOD) model 360; and 2) a bottom-up detector that detects both 2D and 3D objects at once, namely the single-shot 3D Object Detection (3DOD) module 361. The first detector is general, i.e., the 2D object detector can be freely replaced to adopt the latest state-of-the-art methods. Because it is top-down, it tends to have better accuracy but higher complexity in the speed-accuracy trade-off. The second detector is bottom-up, and its computational cost is not proportional to the number of objects. In some embodiments, the 3D object detection module 358 also includes a decision mechanism to select one of the top-down and bottom-up methods. For example, the 3D object detection module 358 may choose the top-down method when high accuracy is required and the computing power of the computing device 350 is sufficient, or when the video does not contain many objects. When there are a large number of objects in the video, the 3D object detection module 358 can choose the bottom-up method. In some embodiments, the use of the top-down or bottom-up method is predetermined based on the characteristics of the environment, such that the 3D object detection module 358 of the local computing device 350 may include only the top-down method or only the bottom-up method.
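As a minimal sketch of one possible decision mechanism, the selection rule described above may be expressed as follows. The function name, its parameters, and the object-count threshold are hypothetical; this disclosure does not specify particular values.

```python
def choose_detector(num_objects, high_accuracy_required, compute_sufficient,
                    max_objects_for_top_down=10):
    """Return "top_down" or "bottom_up" following the rule described in the text.

    max_objects_for_top_down is an assumed threshold, not a value from the disclosure.
    """
    # Top-down: high accuracy is required and computing power is sufficient.
    if high_accuracy_required and compute_sufficient:
        return "top_down"
    # Top-down is also acceptable when the video does not contain many objects.
    if num_objects <= max_objects_for_top_down:
        return "top_down"
    # Many objects: the bottom-up cost does not grow with the object count.
    return "bottom_up"

selected = choose_detector(num_objects=35, high_accuracy_required=False,
                           compute_sufficient=True)
```

In a deployment where the route is predetermined by the environment, such a function would simply be replaced by a fixed configuration value.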
[0078] The top-down 3D object detector includes the 2DOD model 359 and the 3DOD model 360. The 3DOD model 360 includes a 3D object regressor and a final box regressor. The 2DOD model 359 is configured to detect 2D objects in the image coordinate system after receiving multiple video frames or video images from the camera 240 or the calibration module 356. Regions in the image that tightly surround an object are defined as regions of interest (ROIs), each ROI being a cropped area in the image surrounding, for example, a vehicle. The 2DOD model 359 is also configured to send the ROIs to the 3D object regressor. Upon receiving an ROI, the 3D object regressor regresses the cropped RGB pixels to, for example, eight cuboid vertices (in the image coordinate system) representing the vehicle, the vehicle's dimensions (length, width, height), and its depth / distance from the camera 240, where the vehicle is treated as a cuboid. This information, along with the camera matrix, is fed into the final box regressor and regressed into fine-grained vehicle dimensions (height, width, length), the vehicle center in the camera coordinate system (x, y, z), and the rotation angle θ about the y-axis (i.e., the yaw axis).
[0079] In some embodiments, the 2DOD model 359 includes at least one of two alternative and readily available 2D object detectors: YOLOv3 and CenterNet. The 2D object detectors can be top-down or bottom-up, as long as they provide accurate 2D object regions. Therefore, the 2DOD model 359 is general and model-based.
[0080] In some embodiments, the 3D object regressor of the 3DOD model 360 is implemented using a pre-trained ResNet34 backbone (with fully connected layers removed) and three fully connected layers following it. The number of channels can be reduced from 512 to 256, 128, and finally to 20. These 20 channels represent eight points (2×8=16 channels), a coarse vehicle size (3 channels), and a coarse depth (1 channel).
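For illustration only, the channel reduction from 512 to 256, 128, and 20 and the interpretation of the 20 output channels may be sketched with NumPy as below. The random weights stand in for trained parameters, the ReLU activations between layers are an assumption, and the ordering of the channels within the output vector (keypoints first, then size, then depth) is likewise assumed.

```python
import numpy as np

def split_regressor_output(vec20):
    """Split a 20-channel output into 8 image points, coarse size, and coarse depth."""
    vec20 = np.asarray(vec20, dtype=float)
    if vec20.shape != (20,):
        raise ValueError("expected a vector with 20 channels")
    keypoints = vec20[:16].reshape(8, 2)   # 2 x 8 = 16 channels: (u, v) per vertex
    coarse_size = vec20[16:19]             # 3 channels: coarse vehicle size
    coarse_depth = float(vec20[19])        # 1 channel: coarse depth
    return keypoints, coarse_size, coarse_depth

# Three fully connected layers: 512 -> 256 -> 128 -> 20.
layer_widths = [512, 256, 128, 20]
rng = np.random.default_rng(0)
x = rng.standard_normal(512)               # stand-in for pooled backbone features
num_layers = len(layer_widths) - 1
for i in range(num_layers):
    weights = rng.standard_normal((layer_widths[i + 1], layer_widths[i])) * 0.01
    x = weights @ x
    if i < num_layers - 1:
        x = np.maximum(x, 0.0)             # assumed ReLU on hidden layers only

keypoints, coarse_size, coarse_depth = split_regressor_output(x)
```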
[0081] In some embodiments, the final bounding box regressor is not a deep neural network (DNN), but rather an optimization algorithm. The final bounding box regressor geometrically estimates the 3D box parameters (x, y, z, h, w, l, θ) by minimizing the difference between the regressed pixel coordinates and the pixel coordinates obtained by projecting the estimated 3D box onto the image plane. In this minimization process, the regressed 3D box size and distance are used for initialization and regularization. This nonlinear least squares problem is solved using the least squares method in SciPy's optimization module, which implements the Trust Region Reflective algorithm.
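A minimal sketch of such a geometric fit is given below, using `scipy.optimize.least_squares` with `method="trf"`. The corner ordering, the camera-axis convention (y as the vertical axis), the regularization weight, the initialization of the box center by back-projecting the mean pixel at the coarse depth, and all numeric values are assumptions made for illustration. Since a monocular projection cannot distinguish a scaled box at a scaled distance, the coarse size and depth terms are what fix the overall scale in this sketch.

```python
import numpy as np
from scipy.optimize import least_squares

# Assumed vertex ordering: sign pattern along (length, height, width).
SIGNS = np.array([[1, 1, 1], [1, 1, -1], [-1, 1, -1], [-1, 1, 1],
                  [1, -1, 1], [1, -1, -1], [-1, -1, -1], [-1, -1, 1]], dtype=float)

def box_corners(params):
    """Eight cuboid vertices in camera coordinates for (x, y, z, h, w, l, theta)."""
    x, y, z, h, w, l, theta = params
    local = SIGNS * np.array([l, h, w]) / 2.0
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return local @ rot_y.T + np.array([x, y, z])

def project(camera_matrix, points):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = points @ camera_matrix.T
    return uvw[:, :2] / uvw[:, 2:3]

def residuals(params, camera_matrix, pixels, coarse_size, coarse_depth, reg):
    """Reprojection error plus regularization toward the coarse size and depth."""
    reprojection = (project(camera_matrix, box_corners(params)) - pixels).ravel()
    prior = reg * np.append(params[3:6] - coarse_size, params[2] - coarse_depth)
    return np.concatenate([reprojection, prior])

def fit_box(camera_matrix, pixels, coarse_size, coarse_depth, reg=0.1):
    """Estimate (x, y, z, h, w, l, theta); coarse_size is ordered (h, w, l)."""
    mean_px = np.append(pixels.mean(axis=0), 1.0)
    center0 = coarse_depth * np.linalg.solve(camera_matrix, mean_px)
    x0 = np.concatenate([center0, coarse_size, [0.0]])
    return least_squares(residuals, x0, method="trf",
                         args=(camera_matrix, pixels, coarse_size, coarse_depth, reg))

# Synthetic example with hypothetical values.
camera_matrix = np.array([[1000.0, 0.0, 640.0],
                          [0.0, 1000.0, 360.0],
                          [0.0, 0.0, 1.0]])
true_box = np.array([1.0, 1.5, 20.0, 1.5, 1.8, 4.5, 0.3])
pixels = project(camera_matrix, box_corners(true_box))
result = fit_box(camera_matrix, pixels,
                 coarse_size=np.array([1.6, 1.7, 4.2]), coarse_depth=19.0)
max_error_px = np.abs(project(camera_matrix, box_corners(result.x)) - pixels).max()
```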
[0082] The single-shot 3DOD module 361 is a single-stage, bottom-up 3D object detector. The single-shot 3DOD module 361 is configured to jointly detect 2D and 3D bounding boxes from multiple video frames after receiving video from a local camera 240 or from the calibration module 356. In some embodiments, this disclosure takes the entire input video frame and regresses the centers of objects and their corresponding attributes, including 2D object information such as width, height, and offset (to compensate for the discretization error introduced by downsampling) and 3D object information such as vehicle size, distance / depth, and orientation θ about the yaw axis. This process can be implemented using a fully convolutional network, specifically a backbone pre-trained on ImageNet (two alternatives: Hourglass and deformable ResNet) followed by a 3D object detection head. This head consists of 3×3 convolutional layers followed by pointwise convolutional layers. The number of output channels is the same as the number of attributes to be estimated. The output resolution is 1 / 4 of the input resolution; that is, when the input frame is resized to 256×256, the output resolution is 64×64. Therefore, the output tensor size is 64×64×18. In the bottom-up single-shot 3DOD module, the 18 channels include: (1) three channels for the center keypoints of each category: car, truck, pedestrian / bicycle; (2) two channels for the 2D bounding box: width, height; (3) two channels for the center keypoint offset (restoring the keypoints from the 64×64 map to their original positions in the 256×256 map): Δu, Δv; (4) one channel for the depth estimate of each object; (5) four channels for rotation: two for bin classification and two for intra-bin regression; (6) three channels for the 3D object shape: (w, h, l); (7) one channel for recording metrics; (8) one channel for the offset mask; (9) one channel for the rotation mask. This disclosure uses the following steps to decode the tensor:
[0083] (1) Local peaks in these heatmaps are found by applying 3×3 max pooling to the heatmaps respectively.
[0084] (2) Based on the heatmap response, find the top K peaks of each of these heatmaps.
[0085] (3) Flatten the 2D image coordinate system into 1D and find the indices of the top K peaks.
[0086] (4) Take these peaks and organize them into the final output tensor to provide convenient information about the 3D object, such as center coordinates, dimensions, depth, rotation, score, vehicle type, etc.
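The decoding steps above can be sketched as follows for a 64×64×18 output tensor. The position of each attribute within the 18 channels follows the order in which the channels are listed, which is an assumption, and the 3×3 max pooling is written directly in NumPy rather than with a deep learning framework. Rotation decoding and the mask channels are omitted for brevity.

```python
import numpy as np

def local_peaks(heatmap):
    """Step (1): keep values equal to the maximum of their 3x3 neighborhood."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    pooled = np.max(windows, axis=0)
    return np.where(heatmap == pooled, heatmap, 0.0)

def decode(output, k=10, num_classes=3, stride=4):
    """Steps (2)-(4): take the top-K peaks per class and gather their attributes.

    Assumed channel layout: 0-2 class heatmaps, 3-4 2D width/height,
    5-6 center offset, 7 depth, 12-14 3D shape (w, h, l).
    """
    _, width, _ = output.shape
    detections = []
    for cls in range(num_classes):
        flat = local_peaks(output[:, :, cls]).ravel()   # flatten 2D -> 1D
        for idx in np.argsort(flat)[::-1][:k]:          # indices of the top K
            if flat[idx] <= 0.0:
                continue                                # not a real peak
            row, col = divmod(int(idx), width)
            du, dv = output[row, col, 5:7]
            detections.append({
                "class_id": cls,
                "score": float(flat[idx]),
                "center_px": ((col + du) * stride, (row + dv) * stride),
                "box_2d_wh": tuple(output[row, col, 3:5]),
                "depth": float(output[row, col, 7]),
                "size_whl": tuple(output[row, col, 12:15]),
            })
    return detections

# Synthetic tensor with two objects (hypothetical values).
output = np.zeros((64, 64, 18))
output[10, 20, 0] = 0.9          # car center
output[10, 21, 0] = 0.5          # weaker neighbor, suppressed by max pooling
output[40, 5, 1] = 0.8           # truck center
output[10, 20, 5:7] = (0.25, 0.5)
output[10, 20, 7] = 15.0
detections = decode(output, k=5)
```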
[0087] In some embodiments, this disclosure uses Apache MXNet to implement the single-shot 3D object detector because MXNet provides both imperative and symbolic modes. Imperative mode is easy to use during development and debugging because it allows access to intermediate tensor results. Symbolic mode is faster during inference. The exported symbol file and trained weights can be reused across different target platforms and are easy to deploy. In some embodiments, the symbolic inference portion is implemented in C++, which is convenient for various platforms.
[0088] The 3D object tracking module 362, after receiving 2D and 3D object detection results from the 3D object detection module 358, tracks the detected objects across multiple video frames and sends the tracking results to the bird's-eye view transformer 367. The tracking results include the trajectories of the detected objects. For example, as shown in Figure 3, the 3D object tracking module 362 can perform tracking using the 3D Kalman filter module 363 and the Hungarian algorithm module 364, or alternatively using the GC-LSTM module 365 and the Siamese graph convolution (SGC) module 366.
[0089] When the 3D Kalman filter module 363 and the Hungarian algorithm module 364 are used for tracking, the 3D Kalman filter module 363 is configured to smooth the trajectory using a Kalman filter, and the Hungarian algorithm module 364 is configured to update the identity of the object using a Hungarian algorithm. In some embodiments, this disclosure applies a Kalman filter to both 2D and 3D detection results. During data association, this disclosure applies the Hungarian algorithm to associate identities, that is, assigning correctly detected measurements to the predicted trajectory. This step is described in detail below:
[0090] (1) If no previous track is available, a track is created. In the system of this disclosure, a track is an entity comprising several attributes, including: a) prediction vectors representing 2D and 3D positions, b) a unique track ID, c) a Kalman filter instance of the 2D and 3D objects, and d) the trajectory history of the track, represented by a list of prediction vectors. In the case of a Kalman filter instance, it tracks the estimated state of the system and the variance or uncertainty of the estimate. In this case, the state vector is the 2D / 3D object position, represented by coordinate vectors.
[0091] (2) The prediction results of the Kalman filter are calculated based on the tracking history, and then the cost between the prediction results and the current detection results of the detection module is calculated. The cost is defined as the weighted sum of the Euclidean distance between the detection results and the prediction results in 2D and 3D coordinates.
[0092] (3) Use the Hungarian algorithm to associate the correct tracking instances (detected measurements, trajectory history) with the prediction results based on cost.
[0093] (4) Create new tracking entities for unassigned (meaning new) detection results.
[0094] (5) For each track, if the cost of the association assigned to that track is higher than a threshold, the track is marked as unassigned. Unassigned tracks are kept in memory for a period of time in case they reappear soon. Unassigned tracks are deleted if the objects in these tracks are not detected in a certain number of frames.
[0095] (6) Update the state of the Kalman filter for each tracking instance based on the assigned 2D and 3D prediction results, so that the Kalman filter can maintain accurate prediction capability under the latest input.
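As a hedged illustration of steps (2) through (5), the cost computation and identity association may be sketched with `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm. The weights, the cost threshold, and the example coordinates are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def association_cost(pred_2d, pred_3d, det_2d, det_3d, w_2d=0.5, w_3d=0.5):
    """Weighted sum of 2D and 3D Euclidean distances, tracks x detections."""
    d2 = np.linalg.norm(pred_2d[:, None, :] - det_2d[None, :, :], axis=2)
    d3 = np.linalg.norm(pred_3d[:, None, :] - det_3d[None, :, :], axis=2)
    return w_2d * d2 + w_3d * d3

def associate(cost, max_cost):
    """Return matches, unassigned track indices, and new detection indices."""
    rows, cols = linear_sum_assignment(cost)
    # Step (5): an association whose cost exceeds the threshold is rejected.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_tracks = {r for r, _ in matches}
    matched_dets = {c for _, c in matches}
    unassigned_tracks = [r for r in range(cost.shape[0]) if r not in matched_tracks]
    # Step (4): unmatched detections become new tracks.
    new_detections = [c for c in range(cost.shape[1]) if c not in matched_dets]
    return matches, unassigned_tracks, new_detections

# Two predicted tracks and three current detections (hypothetical values).
pred_2d = np.array([[100.0, 100.0], [300.0, 200.0]])
pred_3d = np.array([[0.0, 0.0, 20.0], [5.0, 0.0, 30.0]])
det_2d = np.array([[302.0, 198.0], [101.0, 99.0], [600.0, 400.0]])
det_3d = np.array([[5.1, 0.0, 30.2], [0.1, 0.0, 19.9], [-8.0, 0.0, 50.0]])

cost = association_cost(pred_2d, pred_3d, det_2d, det_3d)
matches, unassigned_tracks, new_detections = associate(cost, max_cost=10.0)
```

In this example, track 0 is associated with detection 1, track 1 with detection 0, and detection 2 is left over as a candidate for a new track.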
[0096] In some embodiments, the 3D object tracking using the Kalman filter and Hungarian algorithm described above is a lightweight option because the Kalman filter only considers positional history and not global visual cues. Therefore, the 3D object tracking module 362 also provides a heavier tracking module—the GC-LSTM module 365—which is computationally more demanding but more accurate. Depending on the computing power of the computing device 350, the 3D object tracking can be easily switched between the Kalman filter / Hungarian algorithm route and the GC-LSTM route. The GC-LSTM route fully considers occlusion issues in the video.
[0097] The GC-LSTM module 365 is a bottom-up 3D object tracker. In some embodiments, the GC-LSTM module 365 treats objects as points, representing 3D objects as center points in an image coordinate system, with additional properties associated with these center points, such as 3D object size, depth, and rotation about the yaw axis. In some embodiments, as shown in Figure 4, the GC-LSTM module 365 is configured to treat each vehicle as a set of keypoints. Specifically, the GC-LSTM module 365 treats nine points in the camera coordinate system (the vehicle center plus the eight vertices of the 3D bounding box) as the object. The GC-LSTM module 365 is configured to extract lightweight low-level features around the vehicle center keypoint, such as local binary patterns (LBP), histogram of oriented gradients (HOG), color histograms, or combinations thereof. Each keypoint is represented by 3D coordinates and is optionally concatenated with some local visual features of the center point. As shown in Figure 4, these keypoints are input into the GC-LSTM model, which considers spatiotemporal cues to address occlusion and smooth the tracking trajectory. Specifically, the time series of vehicle 3D detection bounding boxes $G_{t-T}, G_{t-T+1}, \ldots, G_{t-1}$ is converted into the corresponding vector representations $A_{t-T}, A_{t-T+1}, \ldots, A_{t-1}$. The vector representations are fed into the GC-LSTM model, which then outputs the prediction of the 3D bounding box at time t: $P_t$.
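For illustration, the conversion of one 3D detection into the nine keypoints and a flat vector representation may be sketched as follows. The vertex ordering, the axis convention (yaw about the camera y-axis), and the placeholder visual feature are assumptions; the GC-LSTM model itself is not reproduced here.

```python
import numpy as np

# Assumed vertex ordering: sign pattern along (length, height, width).
SIGNS = np.array([[1, 1, 1], [1, 1, -1], [-1, 1, -1], [-1, 1, 1],
                  [1, -1, 1], [1, -1, -1], [-1, -1, -1], [-1, -1, 1]], dtype=float)

def vehicle_keypoints(center, size_hwl, yaw):
    """Return a (9, 3) array: the vehicle center followed by 8 box vertices."""
    center = np.asarray(center, dtype=float)
    h, w, l = size_hwl
    local = SIGNS * np.array([l, h, w]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    vertices = local @ rot_y.T + center
    return np.vstack([center, vertices])

def to_vector(keypoints, visual_feature=None):
    """Flatten the keypoints and optionally append center-point visual features."""
    flat = keypoints.ravel()
    if visual_feature is None:
        return flat
    return np.concatenate([flat, np.asarray(visual_feature, dtype=float)])

# Hypothetical detection and a placeholder 8-bin color histogram.
keypoints = vehicle_keypoints(center=[1.0, 1.5, 20.0], size_hwl=(1.5, 1.8, 4.5), yaw=0.3)
histogram = np.full(8, 1.0 / 8.0)
vector = to_vector(keypoints, histogram)
```

A sequence of such vectors, one per past frame, would then form the input time series for the tracker.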
[0098] In some embodiments, when a 3D detection deviates from the corresponding GC-LSTM prediction of a previous frame, the GC-LSTM module 365 considers the tracked vehicle lost and performs data association to link the detection to the tracking history. In some embodiments, the GC-LSTM module 365 also includes a data association model, such as LightTrack-GCN, to perform the data association. For example, the GC-LSTM module 365 can treat the data association process as a re-identification (Re-ID) problem and use a Siamese GCN network to classify whether the 3D detection matches the 3D trajectory prediction. To improve LightTrack-GCN, in some embodiments, the GC-LSTM module 365 is configured to feed keypoint coordinates and local visual features as input to the GCN network so that the Siamese GCN network classifies pairs with spatial layout and visual features. Considering a 3D vehicle, keypoints encode its orientation, size, and position, while visual features can encode color, texture, and other vehicle appearance patterns.
[0099] In some embodiments, the GC-LSTM route further includes an SGC module 366, which is used for object re-identification when the GC-LSTM module 365 detects tracking deviations. In some embodiments, the SGC module 366 is LightTrack as described in Reference 13, the entire contents of which are incorporated herein by reference.
[0100] In some embodiments, the 3D object tracking module 362 further includes a decision mechanism to select between performing 3D object tracking using the 3D Kalman filter module 363 and the Hungarian algorithm module 364, or using the GC-LSTM module 365. Referring back to Figure 3, after tracking, the 3D object tracking module 362 is also configured to send the tracking results to the bird's-eye view transformer 367.
[0101] The bird's-eye view transformer 367 is configured to convert information into a bird's-eye view (or top view) in the camera coordinate system after receiving multiple video frames, detected 2D and 3D objects, and the trajectories of the 3D objects. Figure 5 schematically illustrates the transformation from a camera image view to a bird's-eye view. The bird's-eye view stores global spatial information of the traffic scene and can be used for a variety of applications, including: (1) cooperative maneuvering of autonomous vehicles; and (2) cooperative risk warning.
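A minimal sketch of placing a tracked vehicle on a bird's-eye view grid is given below. It assumes that the ground plane is spanned by the x (lateral) and z (forward) axes of the coordinate system in which the 3D boxes are expressed, so that the vertical coordinate is simply dropped. The grid extent, the resolution, and the function names are hypothetical.

```python
import math

def to_bev_pixel(x, z, x_min=-40.0, z_max=80.0, pixels_per_meter=10.0,
                 width=800, height=800):
    """Map a ground-plane point (x, z) in meters to a (row, col) grid cell.

    Returns None when the point falls outside the grid.
    """
    col = math.floor((x - x_min) * pixels_per_meter)
    row = math.floor((z_max - z) * pixels_per_meter)   # far objects at the top
    if 0 <= col < width and 0 <= row < height:
        return row, col
    return None

def footprint(x, z, width_m, length_m, yaw):
    """Four ground-plane corners of a vehicle rotated by yaw about the vertical axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sign_l, sign_w in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
        dl = sign_l * length_m / 2.0
        dw = sign_w * width_m / 2.0
        corners.append((x + c * dl + s * dw, z - s * dl + c * dw))
    return corners

# Hypothetical vehicle 20 m in front of the camera.
cell = to_bev_pixel(0.0, 20.0)
corners = footprint(0.0, 20.0, width_m=2.0, length_m=4.0, yaw=0.0)
cells = [to_bev_pixel(cx, cz) for cx, cz in corners]
```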
[0102] In some embodiments, the user interface 368 is configured to provide a user interface or graphical user interface in the computing device 350. In some embodiments, a user or administrator of the system is able to configure parameters for the computing device 350.
[0103] In some embodiments, the local camera application 354 may also include a database that can be configured to store at least one of the following: calibration parameters of the camera 240, multiple captured video frames, and detection and tracking results of the multiple video frames. However, the local camera application 354 preferably loads the video to be processed, calibration parameters, bounding boxes, etc., into memory 352 for rapid processing.
[0104] In the above embodiments, each of the 3D object detection module 358 and the 3D object tracking module 362 includes two distinct routes for object detection and object tracking, as well as a mechanism for alternatively selecting one of these two routes. In some embodiments, each of the 3D object detection module 358 and the 3D object tracking module 362 includes only one of the two routes. For example, the 3D object detection module 358 includes only a single-shot 3DOD module 361, and the 3D object tracking module 362 includes only a GC-LSTM module 365.
[0105] Referring back to Figure 2, when a bird's-eye view from each local camera 240 is available, the RSUs 230 send the bird's-eye views from the different local application servers 250 to the master server 210. The master server 210 is configured to map the bird's-eye views in the camera coordinate systems to the world coordinate system and combine them to obtain a global bird's-eye view in the world coordinate system. Each local application server 250 corresponds to a different camera located on a different traffic pole or shooting from a different angle. A vehicle may therefore be perceived by multiple cameras (which extend each other's coverage but also overlap to prevent blind spots), resulting in observations in multiple camera coordinate systems. The master server 210 is configured to map the vehicle from the individual camera coordinate systems to a unified world coordinate system.
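By way of example, the mapping from individual camera coordinate systems to a unified world coordinate system may be sketched as a rigid transform per camera. The convention p_world = R p_camera + t, the distance-based merging of overlapping observations, the merge radius, and the camera poses are all assumptions made for illustration; this disclosure does not specify how overlapping observations are combined.

```python
import numpy as np

def camera_to_world(points_cam, rotation, translation):
    """Apply p_world = R @ p_cam + t to an (N, 3) array of points."""
    return np.asarray(points_cam, dtype=float) @ rotation.T + translation

def merge_views(world_views, merge_radius=1.0):
    """Combine per-camera points, keeping one point per vehicle in overlap areas."""
    merged = []
    for points in world_views:
        for p in points:
            # Keep the first observation; skip later ones within merge_radius.
            if all(np.linalg.norm(p - q) > merge_radius for q in merged):
                merged.append(p)
    return np.array(merged)

# Camera A at the world origin; camera B 60 m ahead, facing back toward A.
rot_a, t_a = np.eye(3), np.zeros(3)
rot_b, t_b = np.diag([-1.0, 1.0, -1.0]), np.array([0.0, 0.0, 60.0])

seen_by_a = np.array([[10.2, 0.0, 19.9]])                     # one vehicle
seen_by_b = np.array([[-10.0, 0.0, 40.0], [5.0, 0.0, 10.0]])  # same vehicle + another

world_a = camera_to_world(seen_by_a, rot_a, t_a)
world_b = camera_to_world(seen_by_b, rot_b, t_b)
global_view = merge_views([world_a, world_b])
```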
[0106] Once the information in camera coordinates is transmitted from the local RSUs 230 to the master server 210, world coordinates can be derived and synchronized. The master server 210 can also perform more advanced cooperative maneuvering 270 and cooperative risk warning 290 because it has advanced sensing and awareness of the overall traffic in the area. In some embodiments, dedicated cooperative maneuvering servers and dedicated cooperative risk warning servers may work with the master server 210 to perform the corresponding functions.
[0107] In some embodiments, the world coordinate system is established after the distance to the traffic pole camera is measured (the camera's rotation matrix is also obtained during camera calibration). Because the traffic pole camera is static and only needs to be installed once, the world coordinate system is very stable and reliable.
[0108] In some embodiments, the local camera application 354 may also include a scheduler to schedule the processing of multiple video frames or video images. For example, the scheduler may determine multiple keyframes from a time series of multiple video frames for object detection and tracking, the scheduler may define a sliding window of multiple video images for batch processing, and the scheduler may load multiple video frames into memory 352 in real time. In some embodiments, the scheduler may also be configured to maintain certain inputs and outputs of the video frame processing steps in memory 352. Input and output information may include object or target IDs, optionally bounding box IDs, points or vertices of 2D or 3D boxes, and vectors representing 2D and 3D detection results of objects. In some embodiments, the scheduler may also store this information in storage device 353. In some embodiments, the scheduler is also configured to invoke modules of the local camera application 354 to perform their respective functions at different times.
[0109] Figure 6 schematically depicts a process for 3D object detection and tracking according to certain embodiments of the present disclosure. In some embodiments, the detection and tracking process is performed by a computing device, such as the computing device 350 shown in Figure 3 (or the application server 250 shown in Figure 2), and specifically by the local camera application 354. It should be noted that, unless otherwise stated in this disclosure, the steps of the object detection and tracking process or method may be arranged in different orders, and are therefore not limited to the order shown in Figure 6.
[0110] As shown in Figure 6, in step 602, the calibration module 356 calibrates the local camera 240. In some embodiments, the local camera 240 is a monocular surveillance camera. The local camera 240 is mounted on a fixed structure, such as a traffic pole. When there are many traffic poles in a defined area, each traffic pole may be equipped with a local camera 240 and a corresponding computing device 350 for processing images from the local camera 240. The following process applies to one local camera 240 mounted on a traffic pole and one corresponding computing device 350 (or local application server 250). However, multiple local cameras 240 may be mounted on the same traffic pole facing different directions, and the corresponding computing devices 350 need not be mounted on the same traffic pole. These different camera and computing device arrangements can be accommodated by the current process with slight variations in the system. In some embodiments, the calibration module 356 uses a chessboard to perform calibration of the local camera 240. By capturing chessboard images at different locations in the camera view, the parameters of the local camera 240 can be determined. Calibration may only need to be performed once after the local camera 240 is mounted, and the calibrated camera parameters can be stored as a data file in the local camera application 354. In some embodiments, calibration may also be performed at predetermined time intervals, such as once or twice a year, to compensate for changes in the local camera 240 and the environment. In other embodiments, calibration may be performed after a change in the traffic pole or the local environment.
[0111] In step 604, the local camera 240 captures video of the environment and provides the video to the 3D object detection module 358. The video may be calibrated before being input to the 3D object detection module 358, or calibrated before or during object detection. In some embodiments, the video is captured in real time, and multiple frames of the video are input to the 3D object detection module 358 consecutively or in batches. In some embodiments, the local camera application 354 may have a scheduler to schedule the processing of multiple video frames. For example, the scheduler may use a sliding window to process multiple video frames in batches, each batch may include, for example, three to five frames, and adjacent batches of video frames may have overlapping frames. For example, the first batch of frames is frames 0, 1, and 2, the second batch of frames is frames 1, 2, and 3, and the third batch of frames is frames 2, 3, and 4. In some embodiments, the scheduler may also select multiple keyframes from the video frames, providing only the keyframes for object detection and tracking.
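The overlapping sliding-window batching described above can be sketched as a small generator. The function name and the default window size of three frames are illustrative assumptions; the generator accepts any iterable of frames, so it can also be fed from a live stream.

```python
from collections import deque

def sliding_batches(frames, window=3):
    """Yield overlapping batches of `window` consecutive frames."""
    buffer = deque(maxlen=window)
    for frame in frames:
        buffer.append(frame)          # the oldest frame drops out automatically
        if len(buffer) == window:
            yield list(buffer)

# Frame indices stand in for decoded video frames.
batches = list(sliding_batches(range(5), window=3))
# batches == [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```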
[0112] In step 606, after receiving multiple video frames, the 3D object detection module 358 detects 2D and 3D objects from the multiple video frames and sends the detected 2D and 3D object parameters to the 3D object tracking module 362. In some embodiments, the 3D object detection module 358 uses either a top-down or bottom-up approach for detection. In some embodiments, the 3D object detection module 358 selects between using a top-down or bottom-up approach. This selection can be determined by factors such as computational resources and the required detection accuracy. In some embodiments, a bottom-up approach is the preferred detection method, especially when there are many objects in the multiple video frames. In some embodiments, the 3D object detection module 358 may include only one of the two approaches.
[0113] When using a top-down approach, after receiving a video frame, the 2DOD model 359 detects 2D objects in the frame. YOLOv3 or CenterNet can be used to perform 2D object detection. Since this disclosure may only require detecting vehicles in multiple frames, the 2D object detection parameters can be configured to suit the vehicle detection task, thereby enabling more efficient 2D object detection. Detected 2D objects are represented by bounding boxes. Bounding box parameters may include the position and size of the bounding box in an image coordinate system defined by pixels. After 2D object detection, 2D bounding boxes are cropped from the frame, and the 3DOD model 360 performs 3D object detection on each cropped bounding box. In some embodiments, 3D detection is performed via a neural network used for each 2D bounding box to obtain corresponding 3D information about the object. In some embodiments, the neural network includes a pre-trained ResNet34 backbone with several fully connected layers. Detected 3D objects may be in the form of 3D boxes, represented by the 8 vertices of the 3D box, the center point, and the yaw angle of the 3D box. Because 3D object detection is based on 2D bounding boxes, it is fast and reliable.
[0114] When using a bottom-up approach, after receiving video frames, the single-shot 3DOD module 361 simultaneously regresses 2D and 3D object information from the RGB frames. In some embodiments, the single-shot 3DOD module 361 includes a pre-trained backbone and a subsequent 3D object detection head. The backbone may be Hourglass or a deformable ResNet. The obtained detections may include 2D object information, such as the width, height, and offset of the 2D bounding box, and 3D object information, such as vehicle dimensions, distance / depth, and orientation on the yaw axis of the 3D box.
[0115] As mentioned earlier, after obtaining 2D object information and 3D object information from the top-down or bottom-up route, the 3D object detection module 358 will also send the detected 2D object information and 3D object information to the 3D object tracking module 362.
[0116] In step 608, after receiving 2D and 3D information from the 3D object detection module 358, the 3D object tracking module 362 tracks the object based on the 2D and 3D information in multiple frames and provides the tracked object to the bird's-eye view transformer 367. Tracking is more accurate by using both 2D and 3D detection information simultaneously. In some embodiments, the 3D object tracking module 362 uses one of two paths for tracking. The two paths are the 3D Kalman filter and Hungarian algorithm path and the GC-LSTM path. In some embodiments, the 3D object tracking module 362 selects between using the two paths. In some embodiments, the GC-LSTM path is preferred, as it efficiently provides more accurate tracking. In some embodiments, the 3D object tracking module 362 may include only one of the two paths.
[0117] When using a 3D Kalman filter and Hungarian algorithm path, the 3D Kalman filter module 363 smooths the object's trajectory, and the Hungarian algorithm module 364 updates the object's identity. For example, multiple sequential frames 0, 1, and 2 serve as input to the 3D Kalman filter module 363, with 2D and 3D detections from multiple frames represented by vectors. Frame 2 is the current frame, and frames 0 and 1 are previous frames. Each detected object corresponds to a vector, thus including both 2D and 3D information about the object. The 3D Kalman filter module 363 can detect noise in the scaling dimension of the vector and smooth the vector corresponding to the detected object. When a 2D or 3D detection in the vector is inaccurate, the 3D Kalman filter module 363 can correct the inaccuracy using the corresponding 3D or 2D detection in the same vector. After smoothing the object's trajectory in frames 0, 1, and 2, the 3D Kalman filter module 363 provides a prediction of the object in the current frame 2. The prediction can still be represented as a vector, and the predicted 2D bounding box position and 3D point position and orientation indicated by the prediction vector are more accurate than those indicated in the input vector. This is because the prediction considers and compensates for the order changes of the object in frames 0, 1, and 2, and takes into account the noise in frames 0, 1, and 2. The object-smoothed prediction vector is then applied to the Hungarian algorithm module 364, which identifies or reassigns the object's identifier by: matching objects in different frames, calculating the cost of the match, and if the cost is less than a threshold, keeping the same ID; if the cost is greater than the threshold, changing the ID or assigning a new ID to the object. 
This process continues, for example, processing frames 1, 2, and 3 to give the prediction of the object in frame 3, and processing frames 2, 3, and 4 to give the prediction of the object in frame 4. In some embodiments, the number of frames for each prediction may be more or less than the above three frames. In the above embodiments, the 3D Kalman filter module 363 and the Hungarian algorithm module 364 process multiple frames serially. In other embodiments, the 3D Kalman filter module 363 and the Hungarian algorithm module 364 can also process multiple frames in parallel, which can make the prediction process faster. However, serial processing is preferred because it is faster and more reliable when the Hungarian algorithm module 364 performs re-identification using object trajectories smoothed by the Kalman filter module 363.
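As a minimal sketch of the smoothing and prediction performed on a track, a linear Kalman filter over 3D position may be written as follows. The constant-velocity motion model, the noise variances, and the class name are assumptions; this disclosure states only that the state is tracked together with its uncertainty. The same structure applies to the 2D position by changing the dimension of the input.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter with state [position, velocity]; only position is measured."""

    def __init__(self, initial_position, dt=1.0, process_var=1e-2, meas_var=1.0):
        initial_position = np.asarray(initial_position, dtype=float)
        self.n = n = initial_position.size
        self.x = np.concatenate([initial_position, np.zeros(n)])
        self.P = np.eye(2 * n) * 10.0                 # initial uncertainty
        self.F = np.eye(2 * n)                        # state transition
        self.F[:n, n:] = dt * np.eye(n)
        self.H = np.hstack([np.eye(n), np.zeros((n, n))])
        self.Q = process_var * np.eye(2 * n)          # process noise
        self.R = meas_var * np.eye(n)                 # measurement noise

    def predict(self):
        """Propagate the state one frame ahead and return the predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:self.n].copy()

    def update(self, measurement):
        """Correct the state with a detection and return the smoothed position."""
        innovation = np.asarray(measurement, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        gain = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + gain @ innovation
        self.P = (np.eye(2 * self.n) - gain @ self.H) @ self.P
        return self.x[:self.n].copy()

# Noisy detections of a vehicle moving 2 m per frame along z (synthetic data).
rng = np.random.default_rng(0)
truth = np.array([[0.0, 0.0, 20.0 + 2.0 * k] for k in range(10)])
detections = truth + rng.normal(0.0, 0.3, size=truth.shape)

kalman = ConstantVelocityKalman(detections[0])
smoothed = [detections[0]]
for detection in detections[1:]:
    kalman.predict()
    smoothed.append(kalman.update(detection))
next_position = kalman.predict()
```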
[0118] When the GC-LSTM and SGC path is used, the GC-LSTM module 365 smooths the detection vectors and updates the object IDs simultaneously, while the SGC module 366 re-identifies the object when the predicted track deviates from the 3D detection. The input to the GC-LSTM includes not only the 2D and 3D detection information but also the object's visual attributes, such as visual attributes around the object's center point. These visual attributes can be extracted using local binary patterns (LBP), histograms of oriented gradients (HOG), or color histograms. In other words, the input to the GC-LSTM is a graph that includes the object's feature points and visual attributes, and the output is a predicted graph that includes the object's position and orientation and, optionally, its visual attributes. The GC-LSTM infers the smoothed position of a vehicle from its location history. It functions similarly to a Kalman filter but is more robust to occlusion because it considers spatiotemporal cues. In some embodiments, re-identification (Re-ID) is required when the tracking of the GC-LSTM module 365 deviates from the 3D detection; the Re-ID is performed by the SGC module 366. In one example, if there are 5 cars in each of frames 0, 1, and 2, and the 5 cars in each frame are extracted into 5 images, then the 5 images from each of frames 0, 1, and 2 are used as input to the GC-LSTM, and the output is the 5 predicted images in the current frame (frame 2). Although the GC-LSTM may require more computational resources than a Kalman filter alone, its cost can be lower than that of a Kalman filter combined with the Hungarian algorithm. The GC-LSTM path may therefore be both more accurate and more efficient. In some embodiments, the video frames used for detection and tracking are sequential frames. In some embodiments, detection and tracking may also be performed on multiple frames at specific time intervals, or only keyframes may be used.
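As a non-limiting illustration of how a visual attribute could be attached to a detection vector before it enters the GC-LSTM, the following Python sketch computes a basic 8-neighbor LBP histogram over a grayscale patch and concatenates it with the detection vector to form one node feature. It does not implement the GC-LSTM or the SGC module. The function names, the clockwise bit order, and the choice of a plain 256-bin histogram are assumptions made for illustration only.

```python
def lbp_code(patch, r, c):
    """8-neighbor LBP code of pixel (r, c) in a 2D grayscale list of lists."""
    center = patch[r][c]
    # Neighbors visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if patch[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(patch):
    """Normalized 256-bin histogram of LBP codes over the interior pixels."""
    hist = [0] * 256
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            hist[lbp_code(patch, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256


def node_features(detection_vector, patch):
    """Concatenate 2D/3D detection values with the visual descriptor."""
    return list(detection_vector) + lbp_histogram(patch)


# Illustrative 4x4 patch around an object's center point.
patch = [[10, 10, 10, 10],
         [10, 50, 60, 10],
         [10, 40, 30, 10],
         [10, 10, 10, 10]]
features = node_features((100.0, 50.0, 2.0, 30.0), patch)
print(len(features))
```

Each tracked object would contribute one such feature list per frame, and the lists from successive frames would form the nodes of the spatiotemporal graph that the paragraph above describes.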
[0119] As described above, after obtaining the tracking result from either of the two paths, the 3D object tracking module 362 also sends the multiple video frames, the predicted 2D bounding boxes and 3D object boxes, and the trajectory of the object to the bird's-eye view transformer 367.
[0120] In step 610, after receiving the multiple video frames, the detection information, and the tracking information, the bird's-eye view transformer 367 converts the multiple video frames, the detection information, and the tracking information into a bird's-eye view and sends the bird's-eye view to the main server 210 through the RUS 230.
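One common way to obtain a bird's-eye view from a calibrated monocular camera is to map ground-contact points in the image onto the road plane with a 3x3 homography. The following Python sketch illustrates that idea only; whether the bird's-eye view transformer 367 uses a homography is an assumption, as are the flat-road model, the function names, and the matrix values, which stand in for the result of a real camera calibration.

```python
def apply_homography(H, u, v):
    """Map image pixel (u, v) to ground-plane coordinates using homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to infinity (it lies on the horizon line)")
    return (x / w, y / w)


def trajectory_to_bev(H, pixel_track):
    """Convert a per-frame list of ground-contact pixels into a BEV trajectory."""
    return [apply_homography(H, u, v) for (u, v) in pixel_track]


# Hypothetical calibration result: scaling plus a mild perspective term.
H = [[0.05, 0.0, -10.0],
     [0.0, 0.10, -5.0],
     [0.0, 0.001, 1.0]]

# Bottom-center pixels of one vehicle's bounding box in frames 0, 1, 2.
pixel_track = [(400.0, 600.0), (410.0, 580.0), (420.0, 560.0)]
bev_track = trajectory_to_bev(H, pixel_track)
print(bev_track)
```

The bottom center of the 2D bounding box (or of the 3D box footprint) is used here because it is the point most likely to lie on the road plane, which is the only plane on which the homography is valid.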
[0121] In step 612, upon receiving bird's-eye views from different computing devices 350 (or application servers 250), the main server 210 combines these bird's-eye views into a global bird's-eye view in the world coordinate system. In some embodiments, a bird's-eye view transformer 267 may also be located in the main server 210; it converts the views, detection information, and tracking information of multiple frames, expressed in the different camera coordinate systems of the different application servers 250, into a global bird's-eye view in the world coordinate system. In some embodiments, overlapping bird's-eye views corresponding to different cameras help improve the global bird's-eye view of the object.
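The combination performed in step 612 can be sketched, under stated assumptions, as a rigid 2D transform of each camera's bird's-eye view into world coordinates followed by fusion of points that overlap. In the Python sketch below, the per-camera pose (yaw, tx, ty), the greedy averaging rule, and the merge radius are hypothetical choices for illustration and are not taken from the disclosure.

```python
import math


def to_world(point, yaw, tx, ty):
    """Rotate a local BEV point by yaw (radians) and translate it by (tx, ty)."""
    x, y = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y + tx, s * x + c * y + ty)


def merge_views(views, merge_radius=1.0):
    """Fuse per-camera BEV points into one list of world-frame points.

    views is a list of (points, (yaw, tx, ty)) tuples, one per camera.
    Points from different cameras that fall within merge_radius of an
    existing fused point are averaged with it (greedy, order dependent).
    """
    merged = []  # each entry holds [sum_x, sum_y, count]
    for points, (yaw, tx, ty) in views:
        for p in points:
            wx, wy = to_world(p, yaw, tx, ty)
            for m in merged:
                cx, cy = m[0] / m[2], m[1] / m[2]
                if math.hypot(wx - cx, wy - cy) <= merge_radius:
                    m[0] += wx
                    m[1] += wy
                    m[2] += 1
                    break
            else:
                merged.append([wx, wy, 1])
    return [(m[0] / m[2], m[1] / m[2]) for m in merged]


# Camera A is aligned with the world frame; camera B is rotated by 90 degrees.
view_a = ([(10.0, 0.0)], (0.0, 0.0, 0.0))
view_b = ([(5.0, 0.0), (20.0, 0.0)], (math.pi / 2, 10.4, -5.0))
global_bev = merge_views([view_a, view_b])
print(global_bev)
```

Here the vehicle seen by both cameras is fused into a single averaged position, which is one simple way in which overlapping views can refine the global estimate, while the vehicle seen only by camera B is kept as a separate point.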
[0122] In step 614, when the global bird's-eye view is available, the main server 210 can perform real-time cooperative manipulation by controlling the autonomous vehicles in the area and / or perform cooperative risk warning by sending warning messages to the vehicles in the area.
[0123] In some aspects, this disclosure relates to a non-transitory computer-readable medium for storing computer-executable code. In some embodiments, the computer-executable code may be software stored in the storage device 353 as described above. When executed, the computer-executable code may perform the methods described above.
[0124] Some embodiments of this disclosure have, in particular, the following novel advantages:
[0125] (1) This disclosure provides a unique 3D object detection and tracking system with a monocular surveillance camera for a cooperative vehicle infrastructure system (CVIS), also referred to as a vehicle-road cooperative system.
[0126] (2) This disclosure provides for the first time a CVIS based on a traffic pole and a monocular camera for tracking vehicles in 3D. With the low cost and wide availability of monocular surveillance cameras and optional 5G networks, the CVIS is designed to be reliable and cost-effective.
[0127] (3) This disclosure is the first to use GC-LSTM for 3D object tracking.
[0128] (4) This disclosure is the first to use a Siamese (twin) GCN for data association in 3D object tracking.
[0129] (5) The 3D object detection and tracking system disclosed herein is also general in its sub-modules: the 2D object detection, 3D object detection, and multi-target tracking sub-modules are all replaceable and upgradeable.
[0130] (6) The 3D object detection and tracking system disclosed herein can also be used as a subsystem or service of a third-party vehicle-road cooperative system.
[0131] The foregoing description of exemplary embodiments of this disclosure is presented for illustrative and descriptive purposes only and is not intended to be exhaustive or to limit this disclosure to the precise form disclosed. Many modifications and variations are possible in accordance with the foregoing teachings.
[0132] The embodiments were chosen and described to explain the principles of this disclosure and its practical application, thereby enabling others skilled in the art to utilize this disclosure and various embodiments, as well as various modifications suitable for the particular intended use. Alternative embodiments will become apparent to those skilled in the art to which this disclosure pertains without departing from the spirit and scope of this disclosure. Therefore, the scope of this disclosure is defined by the appended claims rather than the foregoing description and the exemplary embodiments described therein.
[0133] References (cited in this disclosure in whole or in part):
[0134] 1. Sheng-hai An, Byung-Hyug Lee, and Dong-Ryeol Shin, A survey of intelligent transportation systems, 2011 Third International Conference on Computational Intelligence, Communication Systems and Networks, IEEE, 2011.
[0135] 2. Ling Sun, Yameng Li, and Jian Gao, Architecture and application research of cooperative intelligent transport systems, Procedia Engineering, 2016, 137: 747-753.
[0136] 3. Cooperative vehicle infrastructure system (CVIS) and vehicle to everything (V2X) industry report, 2018.
[0137] 4. Shangguan Wei, Yu Du, and Linguo Chai, Interactive perception-based multiple object tracking via CVIS and AV, IEEE, 2019, 7: 121907-121921.
[0138] 5. Tong He and Stefano Soatto, Mono3D++: monocular 3D vehicle detection with two-scale 3D hypotheses and task priors, AAAI, 2018, 8409-8416.
[0139] 6. Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun, 3D object proposals for accurate object class detection, NIPS, 2015.
[0140] 7. Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka, 3D bounding box estimation using deep learning and geometry, CVPR, 2017, 5632-5640.
[0141] 8. Erik Linder-Norén and Fredrik Gustafsson, Automotive 3D object detection without target domain annotations, Master of Science Thesis, 2018.
[0142] 9. Joseph Redmon and Ali Farhadi, YOLOv3: An incremental improvement, 2018, arXiv:1804.02767.
[0143] 10. Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl, Objects as Points, 2019, arXiv:1904.07850.
[0144] 11. Hou-Ning Hu, et al., Joint monocular 3D vehicle detection and tracking, Proceedings of the IEEE ICCV, 2019, 5390-5399.
[0145] 12. Jason Ku, Alex D. Pon, and Steven L. Waslander, Monocular 3D object detection leveraging accurate proposals and shape reconstruction, Proceedings of the IEEE CVPR, 2019, arXiv:1904.01690.
[0146] 13. Guanghan Ning and Heng Huang, LightTrack: a generic framework for online top-down human pose tracking, 2019, arXiv:1905.02822.
Claims
1. A system for 3D object detection and tracking using a monocular surveillance camera, comprising: a camera; and a computing device in communication with the camera, wherein the computing device is configured to: receive a plurality of video frames from the camera; detect objects from the plurality of video frames, wherein each object detected in each video frame is represented by a detection vector, the detection vector including a first dimension and a second dimension, the first dimension representing two-dimensional (2D) parameters of the object, and the second dimension representing three-dimensional (3D) parameters of the object; track the object in the plurality of video frames based on the detection vectors of the object in the plurality of video frames to obtain a trajectory of the object, wherein loss minimization in object tracking is calculated based on the first and second dimensions of the detection vectors; and convert the plurality of video frames into a bird's-eye view, wherein the bird's-eye view of the plurality of video frames includes the trajectory of the object, the 2D parameters of the object include the position and size of a 2D bounding box surrounding the object in a corresponding video frame of the plurality of video frames, and the 3D parameters of the object include the vertices, center point, and orientation of a 3D bounding box surrounding the object in a corresponding video frame of the plurality of video frames; wherein the computing device is configured to detect the 2D parameters of the object using a 2D object detector and to detect the 3D parameters of the object using a 3D object detector, the 2D object detector detecting the 2D parameters of the object from the video frames and the 3D object detector detecting the 3D parameters of the object from the 2D parameters.
2. The system according to claim 1, wherein the object is a vehicle, and the system further includes a server computing device configured to receive the bird's-eye view of the plurality of video frames from the computing device and to use the received bird's-eye view of the plurality of video frames to perform at least one of cooperative manipulation and cooperative risk warning.
3. The system according to claim 2, wherein the computing device, the server computing device, and the vehicle communicate via a fifth-generation mobile network.
4. The system according to claim 1, wherein the computing device is configured to simultaneously detect the 2D and 3D parameters of the object using a single-shot 3D object detector.
5. The system according to claim 1, wherein the computing device is configured to use a graph convolutional long short-term memory (GC-LSTM) network to track the object, wherein the GC-LSTM network employs a twin graph convolutional network (GCN) to associate the object's identifiers in the plurality of video frames based on the object's 2D parameters, the object's 3D parameters, and the object's visual features.
6. The system according to claim 1, wherein the computing device is configured to use a Kalman filter and a Hungarian algorithm to track the object, wherein the Kalman filter is used to optimize the 2D and 3D parameters of the detected object, and the Hungarian algorithm is used to associate the identifiers of the object in the plurality of video frames.
7. The system according to claim 1, wherein the computing device is further configured to calibrate the monocular surveillance camera.
8. The system according to claim 1, wherein the computing device is configured to detect and track the object using a built-in chip.
9. A method for 3D object detection and tracking using a monocular surveillance camera, comprising: receiving, by a computing device, a plurality of video frames captured by the camera; detecting, by the computing device, objects from the plurality of video frames, wherein each object detected in each video frame is represented by a detection vector, the detection vector including a first dimension and a second dimension, the first dimension representing two-dimensional (2D) parameters of the object, and the second dimension representing three-dimensional (3D) parameters of the object; tracking, by the computing device, the object in the plurality of video frames based on the detection vectors of the object to obtain a trajectory of the object, wherein loss minimization in object tracking is calculated based on the first and second dimensions of the detection vectors; and converting, by the computing device, the plurality of video frames into a bird's-eye view, wherein the bird's-eye view of the plurality of video frames includes the trajectory of the object; wherein the 2D parameters of the object include the position and size of a 2D bounding box surrounding the object in a corresponding video frame of the plurality of video frames, and the 3D parameters of the object include the vertices, center point, and orientation of a 3D bounding box surrounding the object in a corresponding video frame of the plurality of video frames; and wherein the 2D parameters and the 3D parameters of the object are detected using a 2D object detector and a 3D object detector, respectively, the 2D object detector detecting the 2D parameters of the object from the video frames and the 3D object detector detecting the 3D parameters of the object from the 2D parameters.
10. The method according to claim 9, wherein the object is a vehicle, and the method further includes: receiving, by a server computing device, the bird's-eye view of the plurality of video frames from the computing device; and using, by the server computing device, the received bird's-eye view of the plurality of video frames to perform at least one of cooperative manipulation and cooperative risk warning.
11. The method according to claim 9, wherein the 2D and 3D parameters of the object are detected simultaneously using a single-shot 3D object detector.
12. The method according to claim 9, wherein the object is tracked using a graph convolutional long short-term memory (GC-LSTM) network, which employs a twin graph convolutional network (GCN) to associate the object's identifiers in the plurality of video frames based on the object's 2D parameters, 3D parameters, and visual features.
13. The method according to claim 9, wherein the object is tracked using a Kalman filter and a Hungarian algorithm, the Kalman filter being used to optimize the 2D and 3D parameters of the detected object, and the Hungarian algorithm being used to associate the object's identifiers in the plurality of video frames.
14. A non-transitory computer-readable medium storing computer-executable code, wherein the computer-executable code is configured, when executed at a processor of a computing device, to: receive a plurality of video frames from a camera; detect objects from the plurality of video frames, wherein each object detected in each video frame is represented by a detection vector, the detection vector including a first dimension and a second dimension, the first dimension representing two-dimensional (2D) parameters of the object, and the second dimension representing three-dimensional (3D) parameters of the object; track the object in the plurality of video frames based on the detection vectors of the object in the plurality of video frames to obtain a trajectory of the object, wherein loss minimization in object tracking is calculated based on the first and second dimensions of the detection vectors; and convert the plurality of video frames into a bird's-eye view, wherein the bird's-eye view of the plurality of video frames includes the trajectory of the object, the 2D parameters of the object include the position and size of a 2D bounding box surrounding the object in a corresponding video frame of the plurality of video frames, and the 3D parameters of the object include the vertices, center point, and orientation of a 3D bounding box surrounding the object in a corresponding video frame of the plurality of video frames; wherein the computer-executable code is configured to detect the 2D parameters of the object using a 2D object detector and to detect the 3D parameters of the object using a 3D object detector, the 2D object detector detecting the 2D parameters of the object from the video frames and the 3D object detector detecting the 3D parameters of the object from the 2D parameters.