A monocular camera-based real-time twin method and system for road vehicles

By combining target detection, target tracking, and depth estimation technologies, this system utilizes a monocular camera to perceive changes in road vehicles in real time and maps them onto a virtual twin scene. This addresses the shortcomings of existing systems in dynamic vehicle twinning and enables accurate real-time twinning of multiple target vehicles.

CN120260000B · Active · Publication Date: 2026-06-30 · SHANDONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG UNIV
Filing Date
2025-03-26
Publication Date
2026-06-30

Abstract

This invention proposes a real-time road vehicle twinning method and system based on a monocular camera. Combining target detection, target tracking, and depth estimation techniques, it can perceive changes in multiple target vehicles in dynamic scenes in real time and accurately map their positions and features to a virtual twin scene, overcoming the shortcomings of existing road twinning methods in supporting dynamic vehicle twinning. A large visual question-answering model is used to extract vehicle features, reducing reliance on specific task data and scenes, and supporting generative reasoning and multi-turn interaction. This allows for dynamic question adjustment and rapid adaptation to new scenarios.
Description

Technical Field

[0001] This invention belongs to the field of digital twin technology, and in particular relates to a method and system for real-time twinning of road vehicles based on a monocular camera.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] Digital twins, serving as a crucial bridge connecting the physical and virtual worlds, have seen widespread application in recent years in fields such as industrial manufacturing, smart cities, and intelligent transportation. By synchronizing the state of physical entities into a virtual environment in real time, digital twins enable dynamic monitoring, prediction, and optimization of physical objects and systems. Among these applications, twins are particularly prevalent in localized road scenarios, such as traffic flow monitoring, traffic congestion analysis, and road planning, significantly improving road management efficiency and traffic safety.

[0004] However, in real-world applications, most roads are only equipped with fixed-position surveillance cameras, lacking expensive sensing devices like LiDAR. This means they can only provide single-view monocular video stream data to reflect real-time changes in road traffic. Consequently, most current road twin systems primarily focus on modeling and managing static scenes, lacking sufficient support for twinning real-time moving vehicles, and thus failing to fully meet real-time requirements. Therefore, how to achieve real-time twinning of vehicles in a road scene with only monocular video stream data has become a pressing problem for road twin systems.

Summary of the Invention

[0005] To overcome the shortcomings of the prior art, this invention provides a real-time road vehicle twinning method and system based on a monocular camera. By combining target detection, target tracking and depth estimation technologies, it can perceive changes in multiple target vehicles in dynamic scenes in real time and accurately map their positions and features to the virtual twin scene, thus making up for the deficiencies of existing road twinning scenes in supporting dynamic vehicle twinning.

[0006] To achieve the above objectives, the present invention adopts the following technical solution:

[0007] In a first aspect, the present invention provides a real-time twinning method for road vehicles based on a monocular camera, comprising:

[0008] A target detection network is used to detect vehicles in the video stream and obtain rectangular bounding boxes for the detected vehicles; wherein, the video stream is captured by a monocular camera;

[0009] The color and category of vehicles in the video stream are identified based on a large visual question answering model, and the target tracker is updated based on the rectangular bounding box of the detected vehicle obtained from the target detection, and the updated tracking ID is used as the detected vehicle ID.

[0010] The depth map of the vehicle in the video stream is calculated using a depth estimation model, and the depth information of the detected vehicle is obtained by combining it with the rectangular bounding box of the detected vehicle.

[0011] Based on the depth information of the detected vehicle and the geographic coordinates of the monocular camera, the true geographic coordinates of the detected vehicle are obtained.

[0012] The real-time acquired vehicle ID, color, category, and actual geographic coordinates of the detected vehicle are updated in the road twin scene.

[0013] In a second aspect, the present invention provides a real-time road vehicle twin system based on a monocular camera, comprising:

[0014] The target detection module is configured to: use a target detection network to detect vehicles in the video stream and obtain the rectangular bounding boxes and outlines of the detected vehicles; wherein the video stream is captured by a monocular camera;

[0015] The target tracking module is configured to: identify the color and category of vehicles in the video stream based on a large visual question answering model, update the target tracker based on the rectangular bounding box of the detected vehicle obtained from target detection, and use the updated tracking ID as the detected vehicle ID;

[0016] The depth estimation module is configured to: calculate the depth map of the vehicle in the video stream using a depth estimation model, and combine it with the rectangular bounding box of the detected vehicle to obtain the depth information of the detected vehicle;

[0017] The coordinate transformation module is configured to obtain the true geographical coordinates of the detected vehicle based on the depth information of the detected vehicle and the geographical coordinates of the monocular camera.

[0018] The real-time twin module is configured to update the road twin scene with the real-time acquired vehicle ID, color, category, and real geographic coordinates of the detected vehicle.

[0019] In a third aspect, the present invention provides an electronic device including a memory and a processor, and computer instructions stored in the memory and running on the processor, wherein the computer instructions, when executed by the processor, perform the method described in the first aspect.

[0020] In a fourth aspect, the present invention provides a computer-readable storage medium for storing computer instructions, which, when executed by a processor, perform the method described in the first aspect.

[0021] In a fifth aspect, the present invention provides a computer program product, including a computer program that, when executed by a processor, implements the method described in the first aspect.

[0022] The above one or more technical solutions have the following beneficial effects:

[0023] This invention combines target detection, target tracking, and depth estimation technologies to perceive changes in multiple target vehicles in dynamic scenes in real time and accurately map their positions and features to a virtual twin scene, thus overcoming the shortcomings of existing road twin scenes in supporting dynamic vehicle twins. A large visual question-answering model is used to extract vehicle features, reducing reliance on specific task data and scenes, and supporting generative reasoning and multi-turn interaction. This allows for dynamic adjustment of questions and rapid adaptation to new scenarios.

[0024] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0025] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0026] Figure 1 is a flowchart of the real-time road vehicle twinning method based on a monocular camera in Embodiment 1 of the present invention;

[0027] Figure 2 is a network structure diagram of the FC3k2, FC3k, FasterNet Block, and PConv modules in Embodiment 1 of the present invention;

[0028] Figure 3 is a network structure diagram of the OAA Head module in Embodiment 1 of the present invention.

Detailed Implementation

[0029] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0030] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.

[0031] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0032] Embodiment 1

[0033] This embodiment discloses a real-time road vehicle twinning method based on a monocular camera, including:

[0034] A target detection network is used to detect vehicles in the video stream and obtain rectangular bounding boxes for the detected vehicles; wherein, the video stream is captured by a monocular camera;

[0035] The color and category of vehicles in the video stream are identified based on a large visual question answering model, and the target tracker is updated based on the rectangular bounding box of the detected vehicle obtained from the target detection, and the updated tracking ID is used as the detected vehicle ID.

[0036] The depth map of the vehicle in the video stream is calculated using a depth estimation model, and the depth information of the detected vehicle is obtained by combining it with the rectangular bounding box of the detected vehicle.

[0037] Based on the depth information of the detected vehicle and the geographic coordinates of the monocular camera, the true geographic coordinates of the detected vehicle are obtained.

[0038] The real-time acquired vehicle ID, color, category, and actual geographic coordinates of the detected vehicle are updated in the road twin scene.

[0039] This embodiment combines target detection, target tracking, and depth estimation technologies to perceive changes in multiple target vehicles in dynamic scenes in real time and accurately map their positions and features to a virtual twin scene, thus overcoming the shortcomings of existing road twin scenes in supporting dynamic vehicle twins. It employs a large visual question-answering model to extract vehicle features, reducing reliance on specific task data and scenes, and supports generative reasoning and multi-turn interaction, enabling dynamic adjustment of questions and rapid adaptation to new scenarios.

[0040] The real-time twinning method for road vehicles based on a monocular camera in this embodiment is described in detail below with reference to Figure 1.

[0041] Step 1: Use an object detection network to detect vehicles in the video stream and obtain the rectangular bounding boxes of the detected vehicles.

[0042] In this embodiment, an improved YOLOv11 network is used as the target detection network. The FC3k2 module improves inference speed through partial convolution and point convolution. The OAA detection head uses a hybrid attention mechanism to improve detection accuracy for occluded targets. The DAS loss function optimizes bounding-box regression, thereby improving detection accuracy for medium-quality targets.

[0043] The improved YOLOv11 network structure is described in detail below.

[0044] Step 11: Construct the FC3k2 module, i.e., Faster C3k2, to replace the C3k2 module in the YOLOv11 network, thereby improving the overall network's computation speed.

[0045] The FC3k2 in this embodiment is an improvement on the C3k2 module. Specifically, it introduces the FasterNet Block module with PConv (partial convolution) to replace the Bottleneck module based on traditional convolution in the C3k2 module.

[0046] The FasterNet Block module consists of several core components designed to improve inference speed while maintaining high accuracy by optimizing computational efficiency and memory access. First, the input feature map undergoes a standard convolution operation for initial processing. Then, a PConv operation is introduced, which performs convolution computation only on a subset of the input feature map channels, reducing redundant computation and memory bandwidth consumption, thus significantly improving computational efficiency. Next, PWConv (point convolution) is used to fuse information between the feature map channels, further enhancing feature interactions between channels. To avoid information loss, the FasterNet Block uses residual connections, adding the input feature map to the output after convolution processing to ensure uninterrupted information flow. Furthermore, after each convolutional layer, batch normalization (BN) and ReLU activation functions are used to accelerate training and improve network stability.

[0047] Because PConv replaces traditional convolution operations, the FC3k2 module achieves higher floating-point operations per second (FLOPS) while reducing computational and memory bandwidth requirements, maintaining high precision and accelerating inference. In a traditional convolution, every channel of the input feature map is convolved with the kernel, which slides across the entire feature map. For each convolution kernel, all pixels in every channel of the input feature map must be accessed, and the results must be stored at the corresponding positions in the output feature map. This leads to computational redundancy, especially when some channels contain highly repetitive information. Because all information in every channel must be read and processed, the redundant computation increases both the number of floating-point operations (FLOPs) and the number of memory accesses. For larger input feature maps and multiple convolution kernels, and especially with many channels, this generates a large number of memory reads and writes. Such frequent memory access raises the demand for memory bandwidth and may create memory access bottlenecks that reduce overall computational efficiency. By performing convolution on only a subset of the input feature map's channels, PConv effectively alleviates these performance problems.

[0048] First, because PConv performs convolution on only a subset of channels, it reduces redundant computation and lowers the computational cost. In PConv, only some of the input feature map's channels participate in the convolution, while the other channels remain unchanged. Assuming the feature map size is $w \times h$, the total number of channels is $c$, the convolution kernel size is $k$, and the number of channels requiring convolution is $c_p$, PConv's FLOPs are only $h \times w \times k^2 \times c_p^2$. Typically a partial ratio $r = c_p / c$ is taken, so PConv's FLOPs are only $r^2$ times those of traditional convolution, $h \times w \times k^2 \times c^2$ (for example, 1/16 when $r = 1/4$). Second, PConv significantly reduces memory access: because only a portion of the channels is selected for computation, access to all channels is avoided. By reducing the amount of data required for each convolution operation, PConv lowers memory bandwidth consumption, eases the memory access burden, and reduces the latency caused by frequent memory accesses.
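The saving described above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a stride-1 convolution with equal input and output channel counts; the function names and the feature map sizes are illustrative and do not come from the patent.

```python
def conv_flops(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate count of a k x k, stride-1 convolution on an h x w map."""
    return h * w * k * k * c_in * c_out


def pconv_flops(h: int, w: int, k: int, c: int, ratio: float) -> int:
    """PConv convolves only c_p = ratio * c channels; the others pass through unchanged."""
    c_p = int(c * ratio)
    return conv_flops(h, w, k, c_p, c_p)


# Illustrative sizes (assumed, not taken from the patent).
h, w, k, c = 80, 80, 3, 256
full = conv_flops(h, w, k, c, c)
partial = pconv_flops(h, w, k, c, ratio=0.25)
print(full // partial)  # 16: with a partial ratio of 1/4 the cost drops to 1/16
```

The count ignores the pointwise convolution that follows PConv in a FasterNet Block, so it compares only the spatial convolution step.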

[0049] Therefore, by introducing the FC3k2 module to replace the C3k2 module in the original YOLOv11 network, the inference speed of object detection can be improved.

[0050] The SPPF module enhances the multi-scale information of feature maps using feature pyramid technology. The module processes the input feature map using multiple pooling layers of different sizes, such as 1x1, 2x2, and 3x3, and finally concatenates these pooling results together.

[0051] The PSABlock module incorporates an attention mechanism, which enhances the model's focus on important features by weighting the feature maps. The module first extracts features through a single convolutional layer, then applies an attention mechanism to weight the feature maps, and finally outputs the enhanced feature maps through a single convolutional layer.

[0052] The C2PSA (C2P Spatial Attention) module extracts features from the input image through convolutional layers and enhances the spatial attention of these features through the PSABlock module. The module first extracts features through a single convolutional layer, then feeds these features into multiple PSABlock modules for spatial attention processing, and finally outputs the enhanced feature map through a single convolutional layer.

[0053] The backbone of the YOLOv11 network has 11 layers, in the following order: one Conv layer, two two-layer Conv-FC3k2 (c3k=False) combined modules, two two-layer Conv-FC3k2 (c3k=True) combined modules, one SPPF layer, and one C2PSA layer, with output channel numbers of 64, 128, 256, 256, 512, 512, 512, 1024, 1024, 1024, and 1024, respectively.

[0054] The YOLOv11 network's Neck network is used for feature fusion and upsampling. The Neck network has three branches and a total of nine layers, in the following order: one Upsample layer, one C3k2 (c3k=False) layer, one Upsample layer, one C3k2 (c3k=False) layer, one Conv layer, one C3k2 (c3k=False) layer, one Conv layer, and one C3k2 (c3k=True) layer. The three branches ultimately output three feature maps with sizes of 80*80*64, 40*40*128, and 20*20*256 respectively.

[0055] In this embodiment, an OAA Head (Occlusion-Aware Attention Head) module is constructed to replace the Head module in YOLOv11, thereby improving the overall network's resistance to occlusion.

[0056] Specifically, the OAA Head module includes two processing branches, one for classification and one for regression. The first branch includes a Conv module, an OAA module, and a Conv2d module connected in sequence; the second branch includes a DWConv module, a Conv module, an OAA module, and a Conv2d module connected in sequence. The outputs of the first and second branches are concatenated to obtain the output of the OAA Head module.

[0057] The first part of the OAA module is a depthwise separable convolution with residual connections. This depthwise separable convolution is performed layer-by-layer, meaning it separates convolutions by channel. While depthwise separable convolution can learn the importance of different channels and reduce the number of parameters, it ignores the information relationships between channels. To compensate for this loss, the outputs of convolutions at different depths are then combined using PWConv (point convolution). Next, referencing the classic channel attention mechanism SENet, average pooling is used to aggregate the feature maps along the channel direction to obtain a global description of the channels. Then, a two-layer fully connected network is used to fuse the information of each channel, and the two fully connected layers learn multi-channel relationships through non-linear transformations. Finally, the output of the OAA module is used as attention weights and multiplied with the original features input to the OAA module. The key to the OAA module's effectiveness in solving the object occlusion problem lies in its introduction of a hybrid attention mechanism, which enhances the model's ability to learn features from occluded regions. In the case of object occlusion, the occluded object prevents some features of the target from being extracted normally, thus affecting the overall object detection accuracy.

[0058] The traditional convolutional networks used in the original detector head of the YOLOv11 network struggle to handle features lost in occluded regions. The OAA module, however, introduces depthwise separable convolution (DWConv) and fully connected transformations to extract features from occluded regions from multiple dimensions. Specifically, the depthwise separable convolution in OAA independently convolves the spatial location of each channel, constructing a spatial attention distribution and enabling spatial recalibration. The fully connected transformation learns the complex relationships between channels, providing a concrete implementation of the channel attention mechanism and helping to capture important features between channels.
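The channel-attention half of the OAA module (average pooling, two fully connected layers, then reweighting of the input features) can be sketched as follows. This is only an illustration under stated assumptions: the weights are placeholders, the ReLU and sigmoid activations follow the SENet design that the text references, and the depthwise separable convolution that provides the spatial attention is omitted.

```python
import math


def global_avg_pool(feat):
    """feat: list of C channels, each an H x W list of floats. Returns C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]


def linear(x, weight, bias):
    """Fully connected layer; weight is laid out as [out][in]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def channel_attention(feat, w1, b1, w2, b2):
    """SE-style recalibration: pool, FC + ReLU, FC + sigmoid, then scale each channel."""
    z = global_avg_pool(feat)
    hidden = [max(0.0, v) for v in linear(z, w1, b1)]
    weights = [1.0 / (1.0 + math.exp(-v)) for v in linear(hidden, w2, b2)]
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feat)]
    return scaled, weights


# Two 2 x 2 channels and placeholder (untrained) weights, for illustration only.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, weights = channel_attention(
    feat, w1=[[0.5, -0.5]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0])
print(weights)
```

In a trained network the two fully connected layers would learn which channels carry useful evidence when part of the vehicle is occluded; here they only demonstrate the data flow.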

[0059] This embodiment designs a DAS (Dynamic-Attentive-Scale) loss function to replace the CIoU loss function in the original YOLOv11 network, thereby improving the overall detection accuracy of the network.

[0060] The formula for the DAS loss function is:

[0061]

[0062]

[0063] where λ is a hyperparameter controlling the behavior of the non-monotonic attention mechanism; I represents the intersection between the predicted box and the ground truth box; U represents the union between the predicted box and the ground truth box; dw1 represents the horizontal distance between the left edge of the predicted box and the left edge of the target box; dw2 represents the horizontal distance between the right edge of the predicted box and the right edge of the target box; dh1 represents the vertical distance between the top edge of the predicted box and the top edge of the target box; dh2 represents the vertical distance between the bottom edge of the anchor box and the bottom edge of the target box; $w_{gt}$ and $h_{gt}$ represent the width and height of the target bounding box, respectively; and $w_{pred}$ and $h_{pred}$ represent the width and height of the anchor box, respectively.
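The geometric quantities named above can be computed directly from two boxes. The sketch below covers only those inputs (I, U, the four edge distances, and the box sizes); it does not implement the DAS loss itself, and the `Box` type and `box_terms` name are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


def box_terms(pred: Box, gt: Box) -> dict:
    """Intersection, union, edge distances, and sizes for a predicted and a target box."""
    iw = max(0.0, min(pred.x2, gt.x2) - max(pred.x1, gt.x1))
    ih = max(0.0, min(pred.y2, gt.y2) - max(pred.y1, gt.y1))
    inter = iw * ih
    area_pred = (pred.x2 - pred.x1) * (pred.y2 - pred.y1)
    area_gt = (gt.x2 - gt.x1) * (gt.y2 - gt.y1)
    union = area_pred + area_gt - inter
    return {
        "I": inter,
        "U": union,
        "iou": inter / union if union > 0 else 0.0,
        "dw1": abs(pred.x1 - gt.x1),  # left edges
        "dw2": abs(pred.x2 - gt.x2),  # right edges
        "dh1": abs(pred.y1 - gt.y1),  # top edges
        "dh2": abs(pred.y2 - gt.y2),  # bottom edges
        "w_gt": gt.x2 - gt.x1,
        "h_gt": gt.y2 - gt.y1,
        "w_pred": pred.x2 - pred.x1,
        "h_pred": pred.y2 - pred.y1,
    }


terms = box_terms(Box(0, 0, 4, 4), Box(2, 2, 6, 6))
print(terms["I"], terms["U"])  # 4 28
```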

[0064] The DAS loss introduces a penalty factor that adapts to the size of the target bounding box, preventing unnecessary expansion of the anchor box during regression and thus accelerating convergence. Simultaneously, the DAS loss dynamically adjusts the gradient based on the quality of the anchor box through a gradient adjustment mechanism, making the regression more stable, especially when dealing with anchor boxes of unbalanced quality, effectively improving the optimization speed of medium-quality anchor boxes. Furthermore, the DAS loss introduces a non-monotonic attention mechanism to enhance attention to medium-quality anchor boxes, improving the network's adaptability. Finally, a scale-matching penalty term is added to the loss to guide the regression process to consider the consistency of aspect ratio and area ratio, ensuring that anchor boxes of different sizes receive reasonable gradients and focus intensity, avoiding gradient imbalance between small and large targets. Therefore, introducing the DAS loss to measure the similarity between the predicted bounding box and the ground truth target bounding box can improve the overall target detection accuracy of the network.

[0065] Load weights pre-trained on the COCO dataset, input RGB three-channel frame images captured from the monocular camera video stream, and, after computation by the trained object detection network, output the bounding box information and outline information of each vehicle in the image.

[0066] Step 2: Identify the color, category, and appearance features of vehicles in the video stream based on the large visual question answering model, update the target tracker based on the bounding box of the detected vehicle obtained from the target detection, and use the updated tracking ID as the detected vehicle ID.

[0067] Create a deep learning tracker and a large visual question answering model to obtain the tracking ID and feature information such as color and category of the detected vehicle.

[0068] Step 2 involves the following specific steps:

[0069] Step 21: Initialize the deep learning target tracker, update the target tracker with the vehicle detection bounding box information, and use the updated tracking ID as the vehicle ID.

[0070] Step 22: For vehicle IDs that have not appeared before, use the corresponding vehicle outline information to crop the transparent background image matrix of the target vehicle from the original frame image.

[0071] Step 23: Create and initialize the large visual image question answering model.

[0072] Predefine a candidate set F for appearance features:

[0073] $F = \{f_1, f_2, \ldots, f_m\}$

[0074] where $f_k$ denotes a candidate color or category appearance feature.

[0075] Design the question template:

[0076] Is this vehicle [X]?

[0077] The placeholder [X] represents the color or category feature description.

[0078] Each feature $f_k$ in the candidate feature set $F$ is filled into the placeholder [X] of the question template in turn to form the candidate question set $Q_{set}$:

[0079] $Q_{set} = \{\, Q_k = \text{"Is this vehicle } f_k\text{?"} \mid k = 1, 2, \ldots, m \,\}$

[0080] Each candidate question $Q_k$ is input, together with the target vehicle image $i$, into the visual question answering model, and the confidence score $S_k$ of the target vehicle for each question is calculated:

[0081] $S_k = M_{VQA}(Q_k, i)$

[0082] where $M_{VQA}$ denotes the visual question answering model.

[0083] The feature with the highest confidence score among the candidate features is selected as the final recognition result.
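The template filling and highest-score selection of Step 23 can be sketched as below. The scoring callable stands in for the visual question answering model; `fake_vqa_score` is a hypothetical stub used only so the example runs, and a real system would replace it with a model call that returns a confidence for each question.

```python
from typing import Callable, Sequence


def recognize_feature(image, candidates: Sequence[str],
                      vqa_score: Callable[[str, object], float]) -> str:
    """Fill each candidate into the question template and keep the best-scoring one."""
    questions = [f"Is this vehicle {f}?" for f in candidates]
    scores = [vqa_score(q, image) for q in questions]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


def fake_vqa_score(question: str, image) -> float:
    """Stand-in for the VQA model: high confidence when the question matches the stub image."""
    return 0.9 if image["color"] in question else 0.1


colors = ["red", "white", "black"]
print(recognize_feature({"color": "white"}, colors, fake_vqa_score))  # white
```

Because the candidate set is just a list of strings, adapting to a new scenario amounts to editing the list or the template rather than retraining a classifier.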

[0084] Step 24: Based on the target detection results of the image, extract a local image of each detected vehicle to obtain the target set:

[0085] $T = \{t_1, t_2, \ldots, t_n\}$

[0086] where $t_i$ denotes the local image of the i-th target vehicle.

[0087] A local image of the target vehicle is input into a large visual question answering model, and the extracted color and appearance features are associated with and cached with the corresponding vehicle's tracking ID.

[0088] Step 25: For vehicle IDs that have appeared before, directly use the cached vehicle feature extraction results.
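Steps 22 to 25 amount to running the feature extractor once per tracking ID and reusing the result afterwards. A minimal sketch of such a cache follows; the `FeatureCache` class and the `extract` callable are hypothetical names, and the extractor is assumed to wrap the visual question answering model.

```python
class FeatureCache:
    """Caches vehicle features by tracking ID so the VQA model runs once per vehicle."""

    def __init__(self, extract):
        self._extract = extract  # called only for tracking IDs not seen before
        self._by_id = {}

    def get(self, track_id, crop):
        if track_id not in self._by_id:
            self._by_id[track_id] = self._extract(crop)
        return self._by_id[track_id]


calls = []


def stub_extract(crop):
    """Stand-in extractor that records each invocation."""
    calls.append(crop)
    return {"color": "red", "category": "car"}


cache = FeatureCache(stub_extract)
cache.get(7, "crop-frame-1")
cache.get(7, "crop-frame-2")  # served from the cache, extractor not called again
print(len(calls))  # 1
```

A production version would also need a policy for evicting IDs of vehicles that have left the scene, which the text does not describe.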

[0089] Step 3: Calculate the depth map of the vehicle in the video stream using the depth estimation model, and combine it with the rectangular bounding box of the detected vehicle to obtain the depth information of the detected vehicle.

[0090] The depth map of each video frame is calculated in real time and combined with the vehicle pixel coordinates to calculate the vehicle depth.

[0091] Step 3 involves the following specific steps:

[0092] Step 31: Create and initialize a monocular absolute depth estimation deep learning model.

[0093] Step 32: Input the RGB three-channel frame image extracted from the video stream of the monocular camera, and calculate the depth map through the absolute depth estimation model.

[0094] Step 33: Combine the depth map and the vehicle pixel coordinates obtained from object detection to obtain the depth information of the detected vehicle.

[0095] Step 34: Use a mean smoother to smooth the latest estimated depth value using historical data.
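Steps 33 and 34 can be sketched as reading a depth value for each bounding box and averaging it over recent frames. The text does not state how the depth inside a box is reduced to one number, so taking the median over the box is an assumption here, as are the window length and all names.

```python
from collections import defaultdict, deque
from statistics import median


def box_depth(depth_map, box):
    """Median depth inside box = (x1, y1, x2, y2), in pixels, right and bottom exclusive."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)


class MeanSmoother:
    """Per-vehicle moving average over the last `window` depth estimates."""

    def __init__(self, window: int = 5):
        self._history = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id, depth: float) -> float:
        history = self._history[track_id]
        history.append(depth)
        return sum(history) / len(history)


depth_map = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0],
             [7.0, 8.0, 9.0]]
print(box_depth(depth_map, (0, 0, 2, 2)))  # 3.0

smoother = MeanSmoother(window=3)
for d in (10.0, 20.0, 30.0, 40.0):
    smoothed = smoother.update(track_id=1, depth=d)
print(smoothed)  # 30.0, the mean of the last three values
```

A longer window gives a steadier position at the cost of lagging behind a vehicle that is accelerating or braking.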

[0096] Step 4, Vehicle spatial coordinate transformation calculation: Based on the vehicle pixel coordinates, vehicle depth and camera information, calculate the vehicle's true geographic coordinates.

[0097] Step 4 involves the following specific steps:

[0098] Step 41: Use the target camera to take multiple pictures of the chessboard, ensuring that the chessboard posture and camera angle are different for each shot, and perform grayscale processing on each image.

[0099] Step 42: Use the cornerSubPix function to perform sub-pixel level fine-tuning of the detected corner coordinates to improve the accuracy of corner positioning.

[0100] Step 43: Define the chessboard plane as the XY plane of the world coordinate system, Z=0, and construct the true 3D coordinates of each corner point of the chessboard in the world coordinate system. For an m×n chessboard (m rows and n columns), the coordinates of each corner point can be represented as:

[0101] $(x, y, z) = (i \cdot d, \; j \cdot d, \; 0)$

[0102] Where d is the side length of each cell in the chessboard, and i and j are the row and column numbers of the corner points.
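The world coordinates of the chessboard corners in Step 43 can be generated as follows, taking i as the row index, j as the column index, and d as the cell side length. Starting both indices at 0 is an assumption, and the function name is illustrative.

```python
def chessboard_object_points(rows: int, cols: int, d: float):
    """World coordinates (i*d, j*d, 0) of each corner on the Z = 0 chessboard plane."""
    return [(i * d, j * d, 0.0) for i in range(rows) for j in range(cols)]


points = chessboard_object_points(rows=2, cols=3, d=25.0)
print(points[0], points[-1])  # (0.0, 0.0, 0.0) (25.0, 50.0, 0.0)
```

The same list is reused for every calibration image, since the board geometry does not change between shots; only the detected pixel coordinates differ.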

[0103] Step 44: Call the OpenCV function `calibrateCamera` to calculate the camera's intrinsic parameter matrix K and distortion coefficients dst based on the corner data (the correspondence between pixel coordinates and real-world coordinates) from multiple calibration images.

[0104] $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$

[0105] dst = [k1, k2, p1, p2, k3]

[0106] where $f_x$ and $f_y$ are the focal lengths, in pixels, along the image width and height; $(c_x, c_y)$ is the position of the principal point (the intersection of the optical axis and the image plane) in the image coordinate system; the fixed value 1 is used for transformations in the homogeneous coordinate system and normalizes the linear projection matrix; k1, k2, and k3 are radial distortion coefficients; and p1 and p2 are tangential distortion coefficients.

[0107] Step 45: Optimize the original intrinsic parameter matrix K based on the distortion coefficient dst, and use OpenCV to calculate the new intrinsic parameter matrix K′.

[0108]

[0109] Step 46: Combine the vehicle tracking ID, vehicle pixel coordinates, vehicle depth, and camera-corrected intrinsic parameters to obtain the vehicle's spatial coordinates in the camera coordinate system.
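Step 46 corresponds to inverting the pinhole projection. The sketch below assumes that the pixel coordinates are already undistorted, that the estimated depth is measured along the optical axis, and that the camera frame has x to the right, y downward, and z forward; none of these conventions is stated in the text, and the intrinsic values are made up for the example.

```python
def pixel_to_camera(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float):
    """Back-project pixel (u, v) at the given depth into camera coordinates (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Illustrative intrinsics for a 1920 x 1080 image (assumed values).
fx = fy = 1000.0
cx, cy = 960.0, 540.0
print(pixel_to_camera(960.0, 540.0, 10.0, fx, fy, cx, cy))   # (0.0, 0.0, 10.0)
print(pixel_to_camera(1960.0, 540.0, 10.0, fx, fy, cx, cy))  # (10.0, 0.0, 10.0)
```

If the depth model instead reports distance along the viewing ray, the depth would first have to be converted to a z value before applying these formulas.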

[0110] Step 47: Combine the vehicle's spatial coordinates in the camera coordinate system, the camera's actual geographical location, and the camera's orientation angle to obtain the vehicle's true geographical coordinates.
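One way to realize Step 47 is to rotate the vehicle's ground-plane offset by the camera heading and add it to the camera's latitude and longitude. The sketch below is an approximation under several assumptions not found in the text: the heading is the azimuth of the optical axis measured clockwise from north, camera pitch and height are ignored, and a spherical small-offset formula converts metres to degrees.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def camera_to_geographic(x_cam: float, z_cam: float,
                         cam_lat: float, cam_lon: float, heading_deg: float):
    """Convert an offset (x_cam right, z_cam forward, in metres) to latitude and longitude."""
    h = math.radians(heading_deg)
    east = z_cam * math.sin(h) + x_cam * math.cos(h)
    north = z_cam * math.cos(h) - x_cam * math.sin(h)
    lat = cam_lat + math.degrees(north / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat))))
    return lat, lon


# A vehicle 100 m straight ahead of a north-facing camera (illustrative coordinates).
lat, lon = camera_to_geographic(0.0, 100.0, 36.0, 117.0, heading_deg=0.0)
print(lat, lon)
```

For a camera that is tilted toward the road, the camera-frame point would first need to be rotated by the pitch angle before this planar step; over the short ranges a roadside camera covers, the spherical approximation error is small.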

[0111] Step 5: Vehicle twinning and scene update. Based on the vehicle ID, vehicle type, vehicle color, and real-world geographic coordinates obtained from detection and calculation, the corresponding vehicle is twinned in real time, and the road twinning scene is updated.

[0112] Step 5 involves the following specific steps:

[0113] Step 51: Based on the preset vehicle types in the target detection dataset, construct twin models representing different types of vehicles and put them into the model pool for unified management.

[0114] Step 52: Use WebGL to build a road twin scene, which includes a road model and a camera model.

[0115] Step 53: Use vehicle category information to retrieve the corresponding vehicle model from the model pool.

[0116] Step 54: Use the vehicle color information to adjust the texture of the vehicle model so that its color matches the color in the real scene.

[0117] Step 55: Calculate the vehicle's orientation based on its historical location and apply it to the vehicle model.

[0118] Step 56: Using the vehicle's real geographic coordinates, add the vehicle to the appropriate location in the twin scene. If a vehicle with the same tracking ID already exists in the twin scene, use interpolation animation to smoothly move the vehicle from the old location to the new location.
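Steps 55 and 56 can be illustrated with two small helpers: one derives a heading from two consecutive positions, and one moves a vehicle between its old and new positions. The sketch assumes positions are expressed as local (east, north) offsets in metres and uses linear interpolation, which is only one possible choice of interpolation animation; an actual scene would run the equivalent logic inside the WebGL render loop.

```python
import math


def heading_deg(prev, curr) -> float:
    """Bearing from prev to curr, clockwise from north, in [0, 360)."""
    d_east = curr[0] - prev[0]
    d_north = curr[1] - prev[1]
    return math.degrees(math.atan2(d_east, d_north)) % 360.0


def lerp_position(old, new, t: float):
    """Linear interpolation between two positions; t is clamped to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return tuple(a + (b - a) * t for a, b in zip(old, new))


old_pos, new_pos = (0.0, 0.0), (10.0, 20.0)
print(heading_deg(old_pos, new_pos))
print(lerp_position(old_pos, new_pos, 0.5))  # (5.0, 10.0)
```

When a vehicle is stationary the two positions coincide and the heading is undefined, so a renderer would normally keep the previous orientation in that case.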

[0119] Embodiment 2

[0120] The purpose of this embodiment is to provide a real-time road vehicle twin system based on a monocular camera, including:

[0121] The target detection module is configured to: use a target detection network to detect vehicles in the video stream and obtain the rectangular bounding boxes and outlines of the detected vehicles; wherein the video stream is captured by a monocular camera;

[0122] The target tracking module is configured to: identify the color and category of vehicles in the video stream based on a large visual question-answering model, update the target tracker based on the rectangular bounding box of the detected vehicle obtained from target detection, and use the updated tracking ID as the detected vehicle ID;

[0123] The depth estimation module is configured to: calculate the depth map of the vehicle in the video stream using a depth estimation model, and combine it with the rectangular bounding box of the detected vehicle to obtain the depth information of the detected vehicle;

[0124] The coordinate transformation module is configured to: obtain the true geographical coordinates of the detected vehicle based on the depth information of the detected vehicle and the geographical coordinates of the monocular camera;

[0125] The real-time twin module is configured to update the road twin scene with the real-time acquired vehicle ID, color, category, and real geographic coordinates of the detected vehicle.
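The data flow between the five modules can be sketched as a single per-frame function. This is only an illustration of how the modules connect: each module is passed in as a plain callable, and the names `VehicleState`, `process_frame`, `detect`, `track`, `describe`, `estimate_depth`, and `to_geo` are hypothetical. The vehicle pixel is taken as the center of the bottom edge of the bounding box, as recited in claim 2.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class VehicleState:
    vehicle_id: int
    color: str
    category: str
    lat: float
    lon: float


def process_frame(frame, detect, track, describe, estimate_depth, to_geo,
                  scene: Dict[int, VehicleState]) -> Dict[int, VehicleState]:
    """Run one frame through the modules and update the twin scene.

    detect(frame)         -> list of (x1, y1, x2, y2) boxes in pixels
    track(boxes)          -> list of tracking IDs, one per box
    describe(frame, box)  -> (color, category), e.g. from a VQA model
    estimate_depth(frame) -> 2-D depth map indexed as depth_map[row][col]
    to_geo(u, v, depth)   -> (lat, lon) in degrees
    """
    boxes = detect(frame)                     # target detection module
    ids = track(boxes)                        # target tracking module
    depth_map = estimate_depth(frame)         # depth estimation module
    for vehicle_id, box in zip(ids, boxes):
        color, category = describe(frame, box)
        # Center of the bottom edge of the box is used as the vehicle pixel.
        u = (box[0] + box[2]) / 2
        v = box[3]
        # Clamp indices so a box touching the image border stays in range.
        row = min(int(v), len(depth_map) - 1)
        col = min(int(u), len(depth_map[0]) - 1)
        depth = depth_map[row][col]
        lat, lon = to_geo(u, v, depth)        # coordinate transformation module
        # Real-time twin module: insert or overwrite the entry for this ID.
        scene[vehicle_id] = VehicleState(vehicle_id, color, category, lat, lon)
    return scene
```

Passing the modules as callables keeps the sketch independent of any particular detector, tracker, or depth model.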

[0126] In further embodiments, the following is also provided:

[0127] An electronic device includes a memory and a processor, as well as computer instructions stored in the memory and executable on the processor. When executed by the processor, the computer instructions perform the method described in Embodiment 1. For brevity, further details are omitted here.

[0128] It should be understood that in this embodiment, the processor can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0129] Memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of memory may also include non-volatile random access memory. For example, memory may also store information about the device type.

[0130] A computer-readable storage medium for storing computer instructions, which, when executed by a processor, perform the method described in Embodiment 1.

[0131] The method in Embodiment 1 can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor. The software modules can reside in readily available storage media in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory; the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method. To avoid repetition, a detailed description is not provided here.

[0132] A computer program product includes a computer program that, when executed by a processor, implements the method described in Embodiment 1.

[0133] The present invention also provides at least one computer program product tangibly stored on a non-transitory computer-readable storage medium. The computer program product includes computer-executable instructions, such as instructions included in program modules, which execute in a device on a target real or virtual processor to perform the processes/methods described above. Typically, program modules include routines, programs, libraries, objects, classes, components, data structures, etc., that perform specific tasks or implement specific abstract data types. In various embodiments, the functionality of program modules can be combined or divided among program modules as needed. The machine-executable instructions for the program modules can execute within a local or distributed device. In a distributed device, the program modules can reside in both local and remote storage media.

[0134] The computer program code used to implement the methods of the present invention may be written in one or more programming languages. This computer program code may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the computer or other programmable data processing device, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a computer, partially on a computer, as a stand-alone software package, partially on a computer and partially on a remote computer, or entirely on a remote computer or server.

[0135] In the context of this invention, computer program code or related data may be carried by any suitable carrier to enable a device, apparatus, or processor to perform the various processes and operations described above. Examples of carriers include signals, computer-readable media, and the like. Examples of signals may include electrical, optical, radio, sound, or other forms of propagation signals, such as carrier waves, infrared signals, etc.

[0136] Those skilled in the art will recognize that the units and algorithm steps described in conjunction with the embodiments herein can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0137] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that any modifications or variations they can make without creative effort on the basis of the technical solutions of the present invention remain within the scope of protection of the present invention.

Claims

1. A method for real-time twinning of road vehicles based on a monocular camera, characterized in that it comprises:
using an improved YOLOv11 network as the object detection network to detect vehicles in a video stream and obtain rectangular bounding boxes of the detected vehicles, wherein the video stream is captured by a monocular camera; the YOLOv11 network is improved as follows: the C3k2 module in the YOLOv11 network is replaced by an FC3k2 module, wherein the FC3k2 module adopts the FasterNet Block module with partial convolution to perform convolution on only some channels of the input feature map; and the Head module in the YOLOv11 network is replaced by an OAA Head module, wherein the OAA Head module introduces depthwise separable convolution and fully connected transformation to extract features of occluded regions from multiple dimensions;
identifying the color and category of the vehicles in the video stream based on a visual question-answering model, specifically: designing a question template and a candidate feature set, wherein the question template includes placeholders for descriptions of vehicle color or vehicle category features, and the candidate feature set consists of predefined candidate appearance features for vehicle color or vehicle category; filling the placeholders in the question template with each feature from the candidate feature set to form a candidate question set; inputting each candidate question together with an image of a vehicle in the video stream into the visual question-answering model; calculating the confidence score of the target vehicle for each question; and selecting the candidate feature with the highest confidence score as the final recognition result;
updating the target tracker based on the bounding box of the detected vehicle obtained from object detection, and using the updated tracking ID as the detected vehicle ID;
calculating a depth map of the vehicles in the video stream using a depth estimation model, and obtaining the depth information of the detected vehicle by combining the depth map with the rectangular bounding box of the detected vehicle;
obtaining the true geographic coordinates of the detected vehicle based on the depth information of the detected vehicle and the geographic coordinates of the monocular camera, specifically: capturing multiple chessboard images with the monocular camera, with a different chessboard pose and camera angle for each capture; setting the chessboard plane as the world coordinate system and obtaining the corner points of the chessboard using a camera calibration function; solving the camera's intrinsic parameter matrix and distortion coefficients from the corner point data of the multiple calibration images, and optimizing the intrinsic parameter matrix based on the distortion coefficients; and obtaining the true geographic coordinates of the vehicle based on the optimized intrinsic parameter matrix, combined with the vehicle tracking ID, vehicle pixel coordinates, and vehicle depth; and
updating the road twin scene with the vehicle ID, color, category, and true geographic coordinates of the detected vehicle acquired in real time.

2. The method for real-time twinning of road vehicles based on a monocular camera as described in claim 1, characterized in that calculating the depth map of the vehicles in the video stream using a depth estimation model and obtaining the depth information of the detected vehicle by combining the depth map with the bounding box of the detected vehicle specifically comprises: calculating a depth map of the vehicles in the video stream using the depth estimation model; calculating the pixel coordinates of the center point of the bottom edge of the rectangular bounding box of the detected vehicle, and using them as the pixel coordinates of the detected vehicle; and obtaining the depth information of the detected vehicle by combining the pixel coordinates of the detected vehicle with the depth map.

3. The method for real-time twinning of road vehicles based on a monocular camera as described in claim 1, characterized in that, in the training of the object detection network, the DAS function is used as the loss function. The DAS loss function is as follows: ; wherein the terms denote, in order: a hyperparameter that controls the behavior of the non-monotonic attention mechanism; the intersection of the predicted bounding box and the ground-truth bounding box; the union of the predicted bounding box and the ground-truth bounding box; the horizontal distance between the left edge of the predicted bounding box and the left edge of the target bounding box; the horizontal distance between the right edge of the predicted bounding box and the right edge of the target bounding box; the vertical distance between the top edge of the predicted bounding box and the top edge of the target bounding box; the vertical distance between the bottom edge of the anchor box and the bottom edge of the target bounding box; the width and height of the target bounding box, respectively; and the width and height of the anchor box, respectively.

4. A real-time road vehicle twin system based on a monocular camera, characterized in that it comprises:
an object detection module configured to: use an improved YOLOv11 network as the object detection network to detect vehicles in a video stream and obtain the bounding boxes and contours of the detected vehicles, wherein the video stream is captured by a monocular camera; the YOLOv11 network is improved as follows: the C3k2 module in the YOLOv11 network is replaced by an FC3k2 module, wherein the FC3k2 module adopts the FasterNet Block module with partial convolution to perform convolution on only some channels of the input feature map; and the Head module in the YOLOv11 network is replaced by an OAA Head module, wherein the OAA Head module introduces depthwise separable convolution and fully connected transformation to extract features of occluded regions from multiple dimensions;
a target tracking module configured to: identify the color and category of the vehicles in the video stream based on a visual question-answering model, specifically: designing a question template and a candidate feature set, wherein the question template includes placeholders for descriptions of vehicle color or vehicle category features, and the candidate feature set consists of predefined candidate appearance features for vehicle color or vehicle category; filling the placeholders in the question template with each feature from the candidate feature set to form a candidate question set; inputting each candidate question together with an image of a vehicle in the video stream into the visual question-answering model; calculating the confidence score of the target vehicle for each question; and selecting the candidate feature with the highest confidence score as the final recognition result; and to update the target tracker based on the bounding box of the detected vehicle obtained from object detection, and use the updated tracking ID as the detected vehicle ID;
a depth estimation module configured to: calculate a depth map of the vehicles in the video stream using a depth estimation model, and combine it with the rectangular bounding box of the detected vehicle to obtain the depth information of the detected vehicle;
a coordinate transformation module configured to: obtain the true geographic coordinates of the detected vehicle based on the depth information of the detected vehicle and the geographic coordinates of the monocular camera, specifically: capturing multiple chessboard images with the monocular camera, with a different chessboard pose and camera angle for each capture; setting the chessboard plane as the world coordinate system and obtaining the corner points of the chessboard using a camera calibration function; solving the camera's intrinsic parameter matrix and distortion coefficients from the corner point data of the multiple calibration images, and optimizing the intrinsic parameter matrix based on the distortion coefficients; and obtaining the true geographic coordinates of the vehicle based on the optimized intrinsic parameter matrix, combined with the vehicle tracking ID, vehicle pixel coordinates, and vehicle depth; and
a real-time twin module configured to: update the road twin scene with the vehicle ID, color, category, and true geographic coordinates of the detected vehicle acquired in real time.

5. An electronic device, characterized in that it includes a memory and a processor, as well as computer instructions stored in the memory and executable on the processor, which, when executed by the processor, perform the method according to any one of claims 1-3.

6. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, perform the method according to any one of claims 1-3.

7. A computer program product, characterized in that it includes a computer program which, when executed by a processor, implements the method according to any one of claims 1-3.