Skeleton similarity-based cross-lens multi-target tracking method, device and medium
By combining a skeleton similarity-based multi-target tracking method with head localization and skeleton pose extraction, the problems of occlusion and appearance changes in cross-camera tracking are solved, improving the accuracy and robustness of tracking, adapting to different camera perspectives, and achieving real-time tracking.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2023-06-08
- Publication Date
- 2026-06-23
Figure CN116703985B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of cross-camera multi-target tracking, and in particular to a cross-camera multi-target tracking method, apparatus, and medium based on skeleton similarity.
Background Technology
[0002] Multi-object tracking (MOT) across cameras refers to tracking multiple objects across consecutive frames from different camera views. It is a crucial task in computer vision, encompassing several sub-tasks such as object detection, object re-identification, trajectory association, and trajectory prediction. The goal of this technology is to link the trajectories of each object across different views, identifying and tracking multiple objects as they move from one camera to another, and connecting these objects to maintain their identity and trajectory. It has wide applications in video surveillance, traffic management, intelligent security, and autonomous driving.
[0003] Cross-camera target tracking methods fall into two main categories: non-overlapping region tracking and overlapping region tracking. Non-overlapping region tracking refers to the situation where the shooting areas of adjacent cameras do not overlap. The main technical challenge lies in the camera association model. For example, the SIFT features of targets appearing in adjacent cameras are matched against those of the original target. If a match succeeds, the target ID from the original camera is assigned to the target in the new camera. However, this method requires prior knowledge of the real-world spatial topology of the cameras. Overlapping region tracking applies when the same target appears simultaneously in multiple cameras. In this case, image fusion is the primary consideration. By calculating the field-of-view boundary of each camera in the overlapping region, and using the homography between the fields of view of different cameras, the consistency of the target between adjacent cameras is established.
[0004] Currently popular cross-camera MOT methods mainly include: deep learning-based algorithms, spatiotemporal relationship-based algorithms, and trajectory fusion-based algorithms. The main shortcomings of existing cross-camera MOT technologies include:
[0005] 1. Occlusion: Occlusion occurs when a target is partially or completely hidden by another target or an object in the scene. Because different cameras capture different parts of the scene, cross-camera tracking algorithms struggle to handle occlusion problems.
[0006] 2. Appearance Changes: Due to changes in light, direction, and distance, the target's appearance will change. Because of these appearance changes, cross-camera tracking algorithms may fail to identify the target in new camera views.
[0007] 3. Camera calibration: Inaccurate or inconsistent camera calibration can lead to errors in estimating the position and orientation of the camera, which can affect the accuracy of cross-camera tracking.
[0008] 4. Different viewpoints: Different camera viewpoints can capture targets from different angles, making matching challenging. A target may look completely different in one camera view than in another.
[0009] 5. Large camera networks: As the number of cameras in a network increases, the complexity of cross-camera tracking also increases, leading to increased computational and memory requirements.
[0010] 6. Real-time tracking: Real-time tracking requires tracking the target in near real-time across multiple camera views, which can be a challenge due to the high computational and memory requirements of cross-camera tracking algorithms.
[0011] Although significant progress has been made in cross-camera MOT in recent years, it still falls short for real-time tracking in large-scale surveillance systems and struggles with complex scenarios.
Summary of the Invention
[0012] The purpose of this invention is to provide a cross-camera multi-target tracking method, device, and medium based on skeleton similarity, which addresses the problems of changes in appearance and perspective caused by human posture or other factors under different camera viewpoints, as well as the problems of different resolutions, angles, and lighting conditions of different cameras, thereby improving the accuracy and robustness of tracking.
[0013] The objective of this invention can be achieved through the following technical solutions:
[0014] A cross-camera multi-target tracking method based on skeleton similarity includes the following steps:
[0015] S1. Read in multiple camera data streams, store two of the video streams into the storage queue at the same time, and extract the storage queue frame by frame to obtain the first and second images captured by the two cameras at the same time.
[0016] S2. Matching the ID of the target in the same area: Head localization is performed on the first and second images respectively to obtain appearance features and center coordinates, and skeleton extraction is performed to obtain the pose of the joint skeleton. The appearance features and center coordinates are used as input to the target tracking model to perform matching of the ID of the target in the same area.
[0017] S3. Cross-regional target ID matching: When the first image is detected to have entered the area overlapping with the second image, the skeleton pose similarity is calculated from the key point skeleton poses in the two images, and the Euclidean distance between the center coordinates is calculated from the center coordinates in the two images. During area switching, cross-regional target ID matching is achieved by a linear weighted combination of the skeleton pose similarity and the Euclidean distance between the center coordinates.
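As an illustration of step S1 only, the following minimal sketch pairs frames from two streams through a bounded queue. The function name `paired_frames`, the `max_queue` parameter, and pairing by position rather than by capture timestamp are assumptions made for this example and are not specified in the source.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def paired_frames(
    stream0: Iterable[Frame],
    stream1: Iterable[Frame],
    max_queue: int = 64,
) -> Iterator[Tuple[Frame, Frame]]:
    """Yield (frame0, frame1) pairs taken at the same position in two streams.

    Pairs are buffered in a bounded queue and consumed frame by frame.
    Pairing by position is an assumption; a deployed system would more
    likely pair frames by capture timestamp.
    """
    queue: Deque[Tuple[Frame, Frame]] = deque(maxlen=max_queue)
    for pair in zip(stream0, stream1):  # stops when either stream ends
        queue.append(pair)
        # Drained immediately in this single-threaded sketch; a threaded
        # reader would fill the queue while a separate consumer drains it.
        while queue:
            yield queue.popleft()


if __name__ == "__main__":
    camera1 = ["c1_f0", "c1_f1", "c1_f2"]
    camera2 = ["c2_f0", "c2_f1"]
    for frame0, frame1 in paired_frames(camera1, camera2):
        print(frame0, frame1)
```

Any iterable of frames can stand in for a camera stream here, which keeps the pairing logic independent of the video decoding library.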
[0018] Step S2 includes the following steps:
[0019] S21. Pre-train the head localization network Yolov5 using a public dataset. Feed the first and second images into the head localization network Yolov5 and the skeleton pose extraction network Alphapose, respectively, to obtain the head localization bounding box and the skeleton pose coordinate information of 18 key points.
[0020] S22. Based on the head localization bounding box, use the pre-trained ResNet50 network to extract appearance features, and use the center position of the bounding box as prior knowledge to input the target tracking model DeepSort to perform Kalman filtering to predict the trajectory.
[0021] S23. Calculate the distance between the appearance features of the preceding and following frames using the cosine distance function, calculate the distance between the predicted and actual positions using Mahalanobis distance, and assign the same ID to the same target person using the threshold method.
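The two distances named in step S23 can be sketched as follows. This is a simplified illustration, not the DeepSort implementation: the position is reduced to a 2-D point, and the function names and the thresholds `max_cosine` and `max_mahalanobis` are hypothetical values chosen for the example.

```python
import math
from typing import Sequence, Tuple

Cov2 = Tuple[Tuple[float, float], Tuple[float, float]]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero feature as dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def mahalanobis_2d(
    measured: Tuple[float, float],
    predicted: Tuple[float, float],
    cov: Cov2,
) -> float:
    """Mahalanobis distance between a measured and a predicted 2-D position."""
    dx = measured[0] - predicted[0]
    dy = measured[1] - predicted[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0 or a <= 0.0:
        raise ValueError("covariance must be positive definite")
    # Closed-form inverse of the 2x2 covariance, applied as d^T * inv(cov) * d.
    inv00, inv01, inv10, inv11 = d / det, -b / det, -c / det, a / det
    squared = dx * (inv00 * dx + inv01 * dy) + dy * (inv10 * dx + inv11 * dy)
    return math.sqrt(squared)


def same_target(
    feat_prev: Sequence[float],
    feat_curr: Sequence[float],
    measured: Tuple[float, float],
    predicted: Tuple[float, float],
    cov: Cov2,
    max_cosine: float = 0.2,       # hypothetical appearance threshold
    max_mahalanobis: float = 3.0,  # hypothetical motion threshold
) -> bool:
    """Threshold method: both distances must fall below their limits."""
    return (
        cosine_distance(feat_prev, feat_curr) <= max_cosine
        and mahalanobis_2d(measured, predicted, cov) <= max_mahalanobis
    )
```

Requiring both gates to pass is one possible reading of the threshold method; the source does not state how the two distances are combined.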
[0022] The determination of the head positioning bounding box includes the following steps:
[0023] The acquired raw images are cropped and pixel values are normalized to generate an image input tensor of a preset size;
[0024] The input image is convolved using a YOLOv5 network model with pre-trained head target detection parameters. Regression is performed to obtain candidate target regions and confidence factors. High-confidence head target regions are selected by setting a threshold.
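The normalization and confidence filtering described above can be sketched without the detector itself. The `HeadCandidate` type, the division by 255, and the default threshold of 0.5 are assumptions for illustration; the source only states that a threshold is set.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class HeadCandidate:
    """A candidate head region as a detector might return it (hypothetical type)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    confidence: float

    @property
    def center(self) -> Tuple[float, float]:
        x1, y1, x2, y2 = self.box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def normalize_pixels(row: Sequence[int]) -> List[float]:
    """Map 8-bit pixel values to [0, 1] (assumed normalization scheme)."""
    return [value / 255.0 for value in row]


def select_heads(
    candidates: Sequence[HeadCandidate],
    threshold: float = 0.5,  # hypothetical value
) -> List[HeadCandidate]:
    """Keep candidates whose confidence is at least the threshold."""
    return [c for c in candidates if c.confidence >= threshold]


if __name__ == "__main__":
    found = [
        HeadCandidate((10.0, 20.0, 20.0, 30.0), 0.91),
        HeadCandidate((50.0, 60.0, 58.0, 70.0), 0.12),
    ]
    for head in select_heads(found):
        print(head.center, head.confidence)
```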
[0025] The extraction of the joint skeleton attitude coordinate information includes the following steps:
[0026] The skeleton extraction framework based on the Alphapose skeleton pose extraction network uses a top-down strategy and a convolutional neural network algorithm to detect each person in the image.
[0027] Regression prediction of joint position information is performed on the detected region of each person to extract the skeleton pose information of 18 joints for each person.
[0028] Step S3 includes the following steps:
[0029] S31. Using the skeleton pose coordinates of all relevant nodes as input to the skeleton similarity algorithm, calculate the skeleton pose similarity of the target characters in the two images pairwise.
[0030] S32. Use Euclidean distance to perform pairwise matching calculations on the center coordinates of the target figures in the two images to obtain the positional similarity.
[0031] S33. Linearly weight the skeleton similarity and positional proximity to achieve cross-regional target ID matching.
[0032] The skeleton similarity algorithm includes the following steps:
[0033] S311, Skeleton Deformation: Calculate the scaling factor for the length of skeleton line segments corresponding to the template skeleton kps_t and the skeleton to be detected kps_a. Based on the connection path r between joints i, the joint coordinates of the skeleton to be detected kps_a are transformed sequentially to obtain the scale-normalized skeleton to be detected kps_a′.
[0034] S312, Skeleton Translation: Select a central reference point k_b, calculate the coordinate offset T between the center reference points of the two skeletons, and translate the skeleton to be detected kps_a′ to obtain the skeleton to be detected kps_a″ with its center reference coordinates aligned:
[0035] T=kps_t(k_b)-kps_a′(k_b)
[0036] kps_a″=kps_a′+T
[0037] S313. Skeleton Similarity Calculation: Calculate the sum Δ_B of the skeleton segment distances between the template skeleton kps_t and the skeleton to be detected kps_a″. Δ_B is a number greater than or equal to 0; the larger the value, the lower the similarity of the corresponding actions, and the smaller the value, the higher the similarity. Normalize the output to the interval [0,1], and assign different weights W_i to the different skeleton segments R_i after normalization to obtain the weighted skeleton pose similarity S:
[0038]
[0039] Where ε is the preset error factor.
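Steps S311 to S313 can be sketched as below. The deformation and translation follow the text directly. The scoring in `similarity` is an assumption, because the formula for S is not reproduced in the text above: each segment distance is divided by twice the template segment length plus ε so that it falls in [0,1], and the weighted average of one minus that value is returned. All function names and the representation of a skeleton as a joint-indexed dictionary are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
Skeleton = Dict[int, Point]
Edge = Tuple[int, int]  # (parent joint, child joint), listed from the root outward


def _sub(p: Point, q: Point) -> Point:
    return (p[0] - q[0], p[1] - q[1])


def _norm(v: Point) -> float:
    return math.hypot(v[0], v[1])


def deform(kps_t: Skeleton, kps_a: Skeleton, edges: Sequence[Edge], root: int) -> Skeleton:
    """S311: rescale each segment of kps_a to the template's segment length."""
    out: Skeleton = {root: kps_a[root]}
    for parent, child in edges:  # parents must appear before their children
        seg_a = _sub(kps_a[child], kps_a[parent])
        len_a = _norm(seg_a)
        len_t = _norm(_sub(kps_t[child], kps_t[parent]))
        scale = len_t / len_a if len_a > 0.0 else 0.0
        px, py = out[parent]
        out[child] = (px + seg_a[0] * scale, py + seg_a[1] * scale)
    return out


def translate(kps_t: Skeleton, kps_a1: Skeleton, k_b: int) -> Skeleton:
    """S312: shift kps_a' so that reference joint k_b coincides with the template's."""
    tx, ty = _sub(kps_t[k_b], kps_a1[k_b])
    return {j: (x + tx, y + ty) for j, (x, y) in kps_a1.items()}


def similarity(
    kps_t: Skeleton,
    kps_a2: Skeleton,
    edges: Sequence[Edge],
    weights: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """S313 under an assumed scoring rule (not the source's formula)."""
    total, weight_sum = 0.0, 0.0
    for (parent, child), w in zip(edges, weights):
        seg_t = _sub(kps_t[child], kps_t[parent])
        seg_a = _sub(kps_a2[child], kps_a2[parent])
        delta = _norm(_sub(seg_t, seg_a))             # segment distance, >= 0
        normalized = delta / (2.0 * _norm(seg_t) + eps)  # falls in [0, 1)
        total += w * (1.0 - normalized)
        weight_sum += w
    return total / weight_sum if weight_sum > 0.0 else 0.0


def skeleton_similarity(
    kps_t: Skeleton,
    kps_a: Skeleton,
    edges: Sequence[Edge],
    weights: Sequence[float],
    k_b: int,
    eps: float = 1e-6,
) -> float:
    """Run deformation, translation, and scoring in sequence."""
    kps_a1 = deform(kps_t, kps_a, edges, root=edges[0][0])
    kps_a2 = translate(kps_t, kps_a1, k_b)
    return similarity(kps_t, kps_a2, edges, weights, eps)
```

With this rule, a skeleton that differs from the template only by scale and position scores 1, which matches the stated purpose of the deformation and translation steps.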
[0040] The method for linearly weighting skeleton similarity and positional proximity is as follows:
[0041] A=(1-S)+γ×L
[0042] Where A is the weighted similarity, S is the skeleton pose similarity, L is the positional similarity, and γ is the weight parameter.
[0043] The cross-regional target ID matching includes the following steps:
[0044] The DeepSort target tracking model is used for initial tracking and matching. The rectangular region corresponding to the head target region is used as the input tensor, and convolutional feature extraction is performed to obtain a feature vector of a preset size.
[0045] The Kalman filter algorithm is used to predict the position of the center coordinate of the target area in the next time step, and the Hungarian algorithm is used to match the feature vectors of the previous time step and the current time step to obtain the corresponding target identity ID.
[0046] Calculate the Euclidean distance between the center position of the target human's clavicle and the center position of the target human's face, and perform secondary target matching between the previous frame and the current frame to prevent the target identity ID from being lost.
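The Kalman prediction of the center coordinate mentioned above can be illustrated with a constant-velocity model. This is a simplified sketch with a four-element state `[x, y, vx, vy]`; DeepSORT implementations typically track a larger state that also includes the box aspect ratio and height. The time step `dt` and process noise `q` are hypothetical values.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def predict(
    state: List[float],
    cov: Matrix,
    dt: float = 1.0,
    q: float = 1e-2,
) -> Tuple[List[float], Matrix]:
    """Kalman prediction step: x' = F x, P' = F P F^T + Q.

    state is [x, y, vx, vy]; Q is taken as q times the identity (assumption).
    """
    f: Matrix = [
        [1.0, 0.0, dt, 0.0],
        [0.0, 1.0, 0.0, dt],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    new_state = [sum(f[i][k] * state[k] for k in range(4)) for i in range(4)]
    new_cov = matmul(matmul(f, cov), transpose(f))
    for i in range(4):
        new_cov[i][i] += q  # add process noise on the diagonal
    return new_state, new_cov


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    next_state, next_cov = predict([10.0, 20.0, 2.0, -1.0], identity)
    print(next_state)
```

The predicted center is what the Mahalanobis gate compares against the newly detected center; the update step that folds the measurement back in is omitted here.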
[0047] A cross-camera multi-target tracking device based on skeleton similarity includes a memory, a processor, and a program stored in the memory, wherein the processor executes the program to implement the method described above.
[0048] A storage medium having a program stored thereon, which, when executed, implements the method described above.
[0049] Compared with the prior art, the present invention has the following beneficial effects:
[0050] 1. This invention proposes a cross-camera multi-target tracking method based on skeleton similarity, which realizes automated pedestrian target tracking in a computer-assisted manner.
[0051] 2. This invention combines the target's external space with the angular relationships formed by the joints of each skeleton and coordinate information to calculate the skeleton similarity, thereby obtaining more accurate human skeleton similarity parameter values. This not only considers skeleton deformation but also the impact of limb angle changes on target tracking.
[0052] 3. Compared with traditional appearance-based tracking algorithms, tracking with target skeleton information is better able to handle issues such as changes in target appearance and occlusion, improving tracking accuracy and robustness. This invention determines whether two detections are the same target by calculating the similarity of the target skeleton under different cameras, achieving cross-camera tracking and better adapting to the problem of multi-target tracking across cameras.
[0053] 4. By using a deep learning model to automatically extract target skeleton information, the workload of manually labeling data is reduced, and tracking efficiency is improved. This invention can also perform tasks such as target pose estimation and action recognition, increasing the diversity and application scope of tracking.
Attached Figure Description
[0054] Figure 1 is a schematic diagram of the cross-camera multi-target tracking framework of the present invention;
[0055] Figure 2 is a flowchart of the method of the present invention;
[0056] Figure 3 is a flowchart of the cross-camera multi-target tracking method in a preferred embodiment;
[0057] Figure 4 is a flowchart of the same-region target ID matching and tracking process in one embodiment;
[0058] Figure 5 is a flowchart of cross-regional target ID matching and tracking in one embodiment;
[0059] Figure 6 is a schematic diagram of the hardware structure in one embodiment.
Detailed Implementation
[0060] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0061] This embodiment proposes a cross-camera multi-target tracking method based on skeleton similarity. This method addresses the problem of matching multiple targets in two cameras. Since the monitoring area cannot be fully captured by a single camera, it involves the fusion of camera surveillance footage from two areas. The problems mainly include two aspects: 1) the target matching problem of tracking the same person in the same camera's footage; 2) the problem of matching the person target from one camera to the next in cross-camera area footage.
[0062] To address the aforementioned problems, this invention employs a multi-region image combination method that combines appearance feature extraction, head position localization and detection, Hungarian algorithm matching, skeleton similarity calculation, Kalman filter prediction, and pose similarity estimation to achieve target tracking across multiple cameras and views. The framework is shown in Figure 1.
[0063] As shown in Figure 2, frame0 and frame1 are each fed into the same-region target ID matching module. After head localization and skeleton extraction, the appearance features and center coordinates are fed into the target tracking model DeepSort for same-region target ID matching. Finally, when frame0 is detected to have entered the overlapping area with frame1, the cross-region target ID matching module is invoked. The cross-region target ID is obtained by linearly weighting the skeleton pose similarity and the Euclidean distance of the center position coordinates.
[0064] Specifically, as shown in Figure 3, the method includes the following steps:
[0065] S1. Read in multiple camera data streams (which capture multiple target people), simultaneously store two of the video streams into the storage queue, and extract the storage queue frame by frame to obtain the frame0 captured by camera 1 and the frame1 captured by camera 2 at the same time.
[0066] S2. Frame0 and frame1 are respectively fed into the same-region target ID matching module for same-region target ID matching. Specifically, head localization is performed on frame0 and frame1 to obtain appearance features and center coordinates, and skeleton extraction is performed to obtain joint skeleton pose. The appearance features and center coordinates are used as input to the target tracking model DeepSort for same-region target ID matching.
[0067] In one embodiment, step S2 specifically includes the following steps:
[0068] S21. Pre-train the head localization network Yolov5 using the Wider Face public dataset. Feed frame0 and frame1 into the head localization network Yolov5 and the skeleton pose extraction network Alphapose, respectively, to obtain the head localization bounding box HeadBox and the skeleton pose coordinate information Bone of 18 joints.
[0069] In one embodiment, determining the head positioning bounding box includes the following steps:
[0070] The acquired original image is cropped and its pixel values are normalized to generate a 640*640 image input tensor.
[0071] The input image is convolved using a Yolov5 network model with pre-trained head target detection parameters. Regression is performed to obtain candidate target regions and confidence factors. High-confidence head target regions are selected by setting a threshold.
[0072] In one embodiment, the extraction of joint skeleton pose coordinate information includes the following steps:
[0073] The skeleton extraction framework based on the Alphapose skeleton pose extraction network uses a top-down strategy and a convolutional neural network algorithm to detect each person in the image.
[0074] Regression prediction of joint position information is performed on the detected region of each person to extract the skeleton pose information of 18 joints for each person.
[0075] S22. Based on the head localization bounding box, use the pre-trained ResNet50 network to extract appearance features, and use the center position of the bounding box as prior knowledge to input the target tracking model DeepSort to perform Kalman filtering to predict the trajectory.
[0076] S23. Calculate the distance between the appearance features of the preceding and following frames using the cosine distance function, calculate the distance between the predicted and actual positions using Mahalanobis distance, and assign the same ID to the same target person using the threshold method.
[0077] In a preferred embodiment, as shown in Figure 4, S2 specifically includes the following steps:
[0078] In step 201, the acquired original image is cropped and the pixel values are normalized to generate a 640*640 image input tensor. The input image is then convolved using a YOLOv5 network model with pre-trained head target detection parameters to regress and obtain the candidate target head localization bounding box and confidence factor.
[0079] In step 202, the Alphapose model reads the video frame image storage sequence, extracts the skeleton pose, obtains the skeleton pose coordinate information of the target person's 18 joints (Bone), and determines the human skeleton matrix composed of each limb vector.
[0080] In step 203, a threshold C is set, and it is determined whether the confidence factor of each candidate head target region is greater than or equal to C. If so, the target region is retained; otherwise, it is discarded. After all candidate target regions have been evaluated, the regions that meet the threshold are retained, and the process proceeds to step 204.
[0081] In step 204, given that the shooting area is a rectangle, the coordinates of its four corner points are pre-marked. The center point coordinates (x, y) of each target area are calculated to determine whether the point lies within the closed region of the rectangle. If it does, proceed to step 205; if not, discard the target area. This continues until all target areas have been evaluated.
[0082] In step 205, the head position of the target is located based on the head localization bounding box obtained from the pre-trained target detection network model YOLOv5.
[0083] In step 206, based on the head localization bounding box, the appearance features of the target region are extracted using the pre-trained network model ResNet50. The rectangular region corresponding to the head target region is used as the input tensor, and the convolutional feature extraction yields a 751*1 feature vector.
[0084] In step 207, the extracted feature vectors are passed to the DeepSort algorithm, and the Kalman filter algorithm predicts the position of the center coordinates of the target area at the next time step.
[0085] In step 208, the Hungarian algorithm is used to perform correlation matching on the feature vectors of the previous moment and the current moment. If the identity does not exist, a new ID is generated; if the identity exists, the corresponding target identity ID is returned.
[0086] In step 209, it is determined whether the identity ID of the current target area exists in the previous frame. If it exists, proceed to step 211; otherwise, proceed to step 210 to calculate the distance.
[0087] In step 210, the cosine distance function calculates the distance between the appearance features of the preceding and following frames, and the Mahalanobis distance is used to calculate the distance between the predicted position and the actual position.
[0088] In step 211, the same target person is assigned the same ID by a threshold method, the ID sequence is output, and the tracking of targets in the same area ends.
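The region check in step 204 can be sketched as a point-in-convex-quadrilateral test on the center of each head bounding box. The function names are hypothetical, and the four corner points are assumed to be listed in order around the rectangle (clockwise or counter-clockwise).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_center(box: Box) -> Point:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def inside_convex_quad(point: Point, corners: Sequence[Point]) -> bool:
    """Return True if point lies inside or on the boundary of a convex polygon.

    The point is inside when it is on the same side of every edge, which is
    checked through the sign of a 2-D cross product per edge.
    """
    px, py = point
    sign = 0
    count = len(corners)
    for i in range(count):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % count]
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross == 0:
            continue  # collinear with this edge; the other edges decide
        current = 1 if cross > 0 else -1
        if sign == 0:
            sign = current
        elif current != sign:
            return False
    return True


def keep_boxes_in_area(boxes: Sequence[Box], corners: Sequence[Point]) -> List[Box]:
    """Discard boxes whose center falls outside the pre-marked shooting area."""
    return [b for b in boxes if inside_convex_quad(box_center(b), corners)]
```

The cross-product form also covers a shooting area that appears as a skewed quadrilateral in the image, which an axis-aligned bounds check would not.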
[0089] S3. Cross-region target ID matching: When frame0 is detected to have entered the overlapping area with frame1, the cross-region target ID matching module is called. The skeleton pose similarity is calculated based on the key point skeleton pose of the two frames, and the Euclidean distance between the center coordinates is calculated based on the center coordinates of the two frames. The cross-region target ID matching is achieved by using a linear weighted combination of skeleton pose similarity and center coordinate Euclidean distance during area switching.
[0090] In one embodiment, as shown in Figure 5, step S3 specifically includes the following steps:
[0091] S31. Using the skeleton pose coordinates of all relevant nodes as input to the skeleton similarity algorithm, calculate the skeleton pose similarity S for each pair of target characters in the two images.
[0092] S311, Skeleton Deformation: Calculate the scaling factor for the length of skeleton line segments corresponding to the template skeleton kps_t and the skeleton to be detected kps_a. Based on the connection path r between joints i, the joint coordinates of the skeleton to be detected kps_a are transformed sequentially to obtain the scale-normalized skeleton to be detected kps_a′.
[0093] S312, Skeleton Translation: Select a central reference point k_b, calculate the coordinate offset T between the center reference points of the two skeletons, and translate the skeleton to be detected kps_a′ to obtain the skeleton to be detected kps_a″ with its center reference coordinates aligned:
[0094] T=kps_t(k_b)-kps_a′(k_b) (1)
[0095] kps_a″=kps_a′+T (2)
[0096] S313. Skeleton Similarity Calculation: Calculate the sum Δ_B of the skeleton segment distances between the template skeleton kps_t and the skeleton to be detected kps_a″. Δ_B is a number greater than or equal to 0; the larger the value, the lower the similarity of the corresponding actions, and the smaller the value, the higher the similarity. Normalize the output to the interval [0,1], and assign different weights W_i to the different skeleton segments R_i after normalization to obtain the weighted skeleton pose similarity S:
[0097]
[0098] Where ε is the preset error factor.
[0099] S32. Use Euclidean distance to perform pairwise matching calculations on the center coordinates of the faces of the target characters in the two images to obtain the positional similarity L.
[0100] S33. Using equation (4), the skeleton similarity and positional similarity are linearly weighted to achieve cross-regional target ID matching.
[0101] A=(1-S)+γ×L (4)
[0102] Where A is the weighted similarity, S is the skeleton pose similarity, L is the positional similarity, and γ is the weight parameter.
[0103] The matching pair with the highest weighted similarity is used as the target matching pair in the preceding and following regions. The identity ID of the preceding region is assigned to the matching target human body in the new region to achieve cross-region ID matching.
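A minimal sketch of the weighted matching and ID transfer follows. Two assumptions are made and should be read as such. First, because A adds (1-S) to a distance term, a smaller A corresponds to a more similar skeleton and a closer position, so the sketch treats A as a score to minimize; the text describes selecting the pair with the highest weighted similarity, so the selection direction here is an interpretation. Second, the pairing is found by brute-force search over permutations, a stand-in for the Hungarian algorithm that is practical only for a handful of targets. The function names and the default γ are hypothetical, and the center coordinates of both images are assumed to be expressed in a comparable coordinate frame.

```python
import math
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def weighted_score(s: float, l: float, gamma: float) -> float:
    """A = (1 - S) + gamma * L."""
    return (1.0 - s) + gamma * l


def match_ids(
    prev_ids: Sequence[int],
    prev_centers: Sequence[Point],
    new_centers: Sequence[Point],
    similarity: Sequence[Sequence[float]],  # similarity[i][j] = S for prev i, new j
    gamma: float = 0.01,                    # hypothetical weight
) -> Dict[int, int]:
    """Return {index of target in new region: identity ID from previous region}."""
    n_prev, n_new = len(prev_centers), len(new_centers)
    score = [
        [
            weighted_score(
                similarity[i][j],
                math.dist(prev_centers[i], new_centers[j]),
                gamma,
            )
            for j in range(n_new)
        ]
        for i in range(n_prev)
    ]
    # Enumerate one-to-one pairings of the smaller set into the larger one.
    if n_prev <= n_new:
        pairings = (
            list(zip(range(n_prev), p)) for p in permutations(range(n_new), n_prev)
        )
    else:
        pairings = (
            list(zip(p, range(n_new))) for p in permutations(range(n_prev), n_new)
        )
    best_pairs: List[Tuple[int, int]] = []
    best_total = math.inf
    for pairs in pairings:
        total = sum(score[i][j] for i, j in pairs)
        if total < best_total:
            best_total, best_pairs = total, pairs
    return {j: prev_ids[i] for i, j in best_pairs}


if __name__ == "__main__":
    result = match_ids(
        prev_ids=[7, 9],
        prev_centers=[(0.0, 0.0), (10.0, 0.0)],
        new_centers=[(10.0, 1.0), (0.0, 1.0)],
        similarity=[[0.2, 0.9], [0.8, 0.1]],
    )
    print(result)
```

In the example, the target at index 1 of the new region inherits ID 7 and the target at index 0 inherits ID 9, since those pairings combine high skeleton similarity with small center distance.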
[0104] In another embodiment, the matching pair with the highest skeleton similarity can first be assigned as the target matching pair in the preceding and following regions, and the identity ID of the preceding region can be assigned to the matching target human body in the new region. Then, the Euclidean distance between the center position of the clavicle of the target human body and the center position of the target face is calculated for secondary matching to obtain the matching relationship of multiple target human bodies in the target region at the current time, and the corresponding identity ID is assigned to each target.
[0105] Specifically, it includes the following steps:
[0106] The DeepSort target tracking model is used for initial tracking and matching. The rectangular region corresponding to the head target region is used as the input tensor, and convolutional feature extraction is performed to obtain a 751*1 feature vector.
[0107] The Kalman filter algorithm is used to predict the position of the center coordinate of the target area in the next time step, and the Hungarian algorithm is used to match the feature vectors of the previous time step and the current time step to obtain the corresponding target identity ID.
[0108] Calculate the Euclidean distance between the center position of the target human's clavicle and the center position of the target human's face, and perform secondary target matching between the previous frame and the current frame to prevent the target identity ID from being lost.
[0109] In one embodiment, as shown in Figure 6, cameras were set up in two areas of a student military training ground to capture real-time images from dual video streams. The information from the multiple camera areas was combined, and the Hungarian algorithm was used for cross-camera matching. Then the skeleton pose similarity and the center point coordinate distance were calculated, and the closest match after linear weighting was taken for target ID matching. The cross-camera multi-target tracking device used in this embodiment includes:
[0110] a. Camera 1, used to capture training images of the first region.
[0111] b. Camera 2, used to capture training images of the second region.
[0112] c. The cross-camera multi-target tracking module includes a same-area target tracking module and a cross-area target matching and tracking module. When a student leaves the detection area corresponding to camera 1, the system needs to switch to the detection area corresponding to camera 2. During the area switching process, the identity of the same student must be matched in both images.
[0113] c1. The same-region target tracking module includes a head localization and skeleton extraction module, a feature extraction and target tracking module, and a matching module, among which,
[0114] The head localization and skeleton extraction module first uses an object detection network model to locate the bounding box of the human head, and then uses a pose estimation network model to extract the location information of key points of the human skeleton.
[0115] The feature extraction and target tracking module uses ResNet50 to extract appearance features, then uses the DeepSort model to perform appearance feature matching, the Kalman filter algorithm to predict the position of the next frame, and the Hungarian algorithm to associate and match feature vectors.
[0116] The matching module calculates the distance between corresponding key points based on the last frame of the previous region and the current frame of the next region, and calculates the skeleton similarity S. It then assigns the matching pair with the highest similarity as the target matching pair in the previous and next regions, thereby achieving cross-region head target tracking.
[0117] c2. The cross-regional target matching and tracking module includes a skeleton similarity calculation module, a positional similarity calculation module, and a cross-regional matching module, among which,
[0118] The skeleton similarity calculation module uses the skeleton pose coordinate information of all relevant nodes as input to the skeleton similarity algorithm, and calculates the skeleton pose similarity of the target characters in the two images pairwise.
[0119] The positional similarity calculation module uses Euclidean distance to perform pairwise matching calculations on the center coordinates of the faces of target people in two images to obtain the positional similarity.
[0120] The cross-region matching module achieves cross-region ID matching based on skeleton pose similarity and position proximity.
[0121] When applying the above method to mobile devices, considering issues such as communication transmission speed, latency, and bandwidth, this embodiment employs 5G communication technology. 5G networks offer faster speeds, reaching up to 10Gbps; lower latency, below 1 millisecond; more connections, supporting up to 1 million devices per square kilometer; and higher bandwidth, exceeding 20Gbps. In project implementation, network protocol security must be considered in addition to network speed. This embodiment uses the IPv6 Internet protocol, which supports IPSec and provides better security. IPv6 has a larger address space: with an address length of 128 bits, it offers a 2^96-fold increase in address space over the 32-bit addresses of IPv4, providing more network addresses for current and future Internet applications. It also allows smaller routing tables, improving router forwarding efficiency, and provides better QoS support through flow labels.
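The address-space figure can be checked with a short calculation. The variable names are illustrative; the only inputs are the 128-bit IPv6 and 32-bit IPv4 address lengths.

```python
ipv4_bits = 32
ipv6_bits = 128

ipv4_addresses = 2 ** ipv4_bits
ipv6_addresses = 2 ** ipv6_bits

# Ratio of the two address spaces: 2^(128 - 32) = 2^96.
ratio = ipv6_addresses // ipv4_addresses

print(ratio == 2 ** 96)  # True
print(ratio)             # 79228162514264337593543950336
```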
[0122] This embodiment primarily focuses on tracking multiple targets monitored by different outdoor cameras. It utilizes IPv6 and 5G communication technologies. Compared to 4G, this embodiment employs a 5G wireless communication network on the mobile terminal in practical applications, meeting the rapid proliferation of smart terminals and users' high demands for mobile internet speeds. Furthermore, this embodiment uses the more secure IPv6 protocol to address the issue of IP resource scarcity.
[0123] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0124] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. A cross-camera multi-target tracking method based on skeleton similarity, characterized in that it comprises the following steps:
S1. Read in multiple camera data streams, store two of the video streams in a storage queue at the same time, and extract frames from the storage queue one by one to obtain a first image and a second image captured by the two cameras at the same moment.
S2. Target ID matching within the same region: perform head localization on the first image and the second image respectively to obtain appearance features and the center coordinates of the head localization bounding box, and perform skeleton extraction to obtain the joint skeleton pose; use the appearance features and the center coordinates of the head localization bounding box as the input of a target tracking model to match target IDs within the same region.
S3. Cross-region target ID matching: when a target in the first image enters the region overlapping with the second image, calculate the skeleton pose similarity from the keypoint skeleton poses in the two images, and calculate the Euclidean distance between the center coordinates of the head localization bounding boxes in the two images; when the region is switched, match target IDs across regions using a linear weighted combination of the skeleton pose similarity and the center-coordinate Euclidean distance.
2. The cross-camera multi-target tracking method based on skeleton similarity according to claim 1, characterized in that step S2 includes the following steps:
S21. Pre-train the head localization network YOLOv5 on a public dataset. Feed each of the first and second images into the head localization network YOLOv5 and the skeleton pose extraction network Alphapose to obtain, respectively, the head localization bounding box and the skeleton pose coordinates of 18 keypoints.
S22. Based on the head localization bounding box, use the pre-trained ResNet50 network to extract appearance features, and input the center coordinates of the head localization bounding box as prior knowledge into the target tracking model DeepSort, which performs Kalman filtering to predict the trajectory.
S23. Calculate the distance between the appearance features of consecutive frames using the cosine distance function, calculate the distance between the predicted and actual positions using the Mahalanobis distance, and assign the same ID to the same target person by thresholding.
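The two distances named in step S23 can be illustrated as follows. This is a minimal pure-Python sketch under stated assumptions, not the DeepSort implementation: the function names, the dictionary layout, the 2x2 covariance, and both threshold values are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mahalanobis_2d(predicted, measured, cov):
    """Mahalanobis distance between a predicted and a measured 2D center,
    given the 2x2 covariance of the prediction."""
    dx = measured[0] - predicted[0]
    dy = measured[1] - predicted[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

def same_target(track, detection, max_cos=0.2, max_maha=3.0):
    """Threshold test: both distances must be small (thresholds are assumed values)."""
    d_app = cosine_distance(track["feature"], detection["feature"])
    d_pos = mahalanobis_2d(track["predicted_center"], detection["center"], track["cov"])
    return d_app <= max_cos and d_pos <= max_maha

# Hypothetical track from the previous frame and detection in the current frame
track = {"feature": [1.0, 0.0], "predicted_center": (100.0, 50.0),
         "cov": ((4.0, 0.0), (0.0, 4.0))}
detection = {"feature": [0.9, 0.1], "center": (102.0, 51.0)}
print(same_target(track, detection))
```

In this sketch a detection keeps the track's ID only when it passes both tests, so a similar-looking person far from the predicted position is rejected by the position gate.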
3. The cross-camera multi-target tracking method based on skeleton similarity according to claim 2, characterized in that the determination of the head localization bounding box includes the following steps:
The acquired raw images are cropped and their pixel values are normalized to generate an image input tensor of a preset size;
The input image is convolved using a YOLOv5 network model with pre-trained head target detection parameters, regression is performed to obtain candidate target regions and their confidence scores, and high-confidence head target regions are selected by applying a threshold.
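The pixel normalization and confidence thresholding described in this claim can be sketched as below. The network itself is not reproduced; the candidate list stands in for its output, and the function names, the dictionary keys, and the 0.5 threshold are assumptions made for illustration.

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values into the range [0, 1]."""
    return [[value / 255.0 for value in row] for row in image]

def select_heads(candidates, conf_threshold=0.5):
    """Keep only candidate regions whose confidence reaches the threshold."""
    return [c for c in candidates if c["confidence"] >= conf_threshold]

def box_center(box):
    """Center (x, y) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

# Hypothetical detector output: candidate regions with confidence scores
candidates = [
    {"box": (10, 20, 50, 60), "confidence": 0.92},
    {"box": (200, 40, 230, 75), "confidence": 0.31},
    {"box": (120, 30, 160, 70), "confidence": 0.78},
]
heads = select_heads(candidates)
centers = [box_center(h["box"]) for h in heads]
print(centers)
```

The resulting centers are the values that the later steps use as head localization center coordinates.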
4. The cross-camera multi-target tracking method based on skeleton similarity according to claim 2, characterized in that the extraction of the joint skeleton pose coordinate information includes the following steps:
The skeleton extraction framework based on the Alphapose skeleton pose extraction network uses a top-down strategy and a convolutional neural network algorithm to detect each person in the image;
Regression prediction of joint positions is performed on the detected region of each person to extract the skeleton pose information of 18 joints for each person.
5. The cross-camera multi-target tracking method based on skeleton similarity according to claim 1, characterized in that step S3 includes the following steps:
S31. Using the skeleton pose coordinates of all joints as input to the skeleton similarity algorithm, calculate pairwise the skeleton pose similarity of the target persons in the two images.
S32. Using the Euclidean distance, perform pairwise matching calculations on the center coordinates of the head localization bounding boxes in the two images to obtain the positional proximity.
S33. Linearly weight the skeleton similarity and the positional proximity to achieve cross-region target ID matching.
6. The cross-camera multi-target tracking method based on skeleton similarity according to claim 5, characterized in that the skeleton similarity algorithm includes the following steps:
S311. Skeleton deformation: calculate the scaling factor between corresponding skeleton segment lengths of the template skeleton kps_t and the skeleton to be tested kps_a; following the connection paths r between the key points i, sequentially adjust the joint coordinates of the skeleton to be tested kps_a to obtain the scale-normalized skeleton to be tested kps_a′;
S312. Skeleton translation: select a central reference point, calculate the coordinate offset T between the central reference points of the two skeletons, and translate the skeleton to be tested kps_a′ by this offset to obtain the skeleton to be tested kps_a″ aligned with the central reference coordinates;
S313. Skeleton similarity calculation: calculate the sum of the distances between corresponding skeleton line segments of the template skeleton kps_t and the skeleton to be tested kps_a″; this sum is a number greater than or equal to 0, where a larger value indicates lower similarity of the corresponding actions and a smaller value indicates higher similarity; normalize the result to the interval [0, 1], assign different weights to the different normalized skeleton line segments, and obtain the weighted skeleton pose similarity S, the calculation of which includes a preset error factor.
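Steps S311 to S313 can be illustrated with the sketch below. It is not the claimed formula, which is not reproduced in this text: the 5-keypoint skeleton (the claims use 18 keypoints), the choice of the neck as central reference point, the segment distance (sum of the endpoint distances), the normalization exp(-d / eps), and the equal default weights are all assumptions made only for illustration.

```python
import math

# Hypothetical skeleton: 0 neck, 1 head, 2 left elbow, 3 left wrist, 4 pelvis.
# Segments are (parent, child) pairs listed from the reference point outward.
SEGMENTS = [(0, 1), (0, 2), (2, 3), (0, 4)]
REF = 0  # assumed central reference point (neck)

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def deform(kps_t, kps_a):
    """S311: rescale each segment of kps_a to the template's segment length,
    walking the connection paths so that children follow their parents."""
    out = {REF: kps_a[REF]}
    for parent, child in SEGMENTS:
        len_t = _dist(kps_t[parent], kps_t[child])
        len_a = _dist(kps_a[parent], kps_a[child])
        k = len_t / len_a if len_a > 0 else 1.0  # per-segment scaling factor
        px, py = out[parent]
        out[child] = (px + k * (kps_a[child][0] - kps_a[parent][0]),
                      py + k * (kps_a[child][1] - kps_a[parent][1]))
    return [out[i] for i in range(len(kps_a))]

def translate(kps_t, kps_a1):
    """S312: shift kps_a' so that its reference point coincides with the template's."""
    tx = kps_t[REF][0] - kps_a1[REF][0]
    ty = kps_t[REF][1] - kps_a1[REF][1]
    return [(x + tx, y + ty) for x, y in kps_a1]

def skeleton_similarity(kps_t, kps_a, weights=None, eps=10.0):
    """S313: weighted sum of per-segment similarities, each in (0, 1]."""
    kps_a2 = translate(kps_t, deform(kps_t, kps_a))
    if weights is None:
        weights = [1.0 / len(SEGMENTS)] * len(SEGMENTS)
    total = 0.0
    for w, (p, c) in zip(weights, SEGMENTS):
        d = _dist(kps_t[p], kps_a2[p]) + _dist(kps_t[c], kps_a2[c])
        total += w * math.exp(-d / eps)  # assumed normalization with factor eps
    return total

template = [(0, 0), (0, -10), (-10, 5), (-15, 15), (0, 30)]
# Same pose, twice as large and shifted: the similarity should be 1
scaled = [(2 * x + 100, 2 * y + 50) for x, y in template]
# Different pose: the wrist is raised above the elbow
raised = [(0, 0), (0, -10), (-10, 5), (-15, -5), (0, 30)]
print(skeleton_similarity(template, scaled), skeleton_similarity(template, raised))
```

Because segments are rescaled before the comparison, the score in this sketch depends on limb directions rather than on how large or where the person appears in each camera view.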
7. The cross-camera multi-target tracking method based on skeleton similarity according to claim 5, characterized in that the skeleton similarity and the positional proximity are linearly weighted as follows: the weighted similarity A is the sum of the skeleton pose similarity and the positional proximity, each multiplied by its own weight parameter.
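The linear weighting and the resulting cross-region ID assignment can be sketched as follows. The symbols alpha and beta, the conversion from Euclidean distance to proximity, the greedy assignment, and all numeric values are assumptions; the sketch also assumes that the head centers of both views are already expressed in a common coordinate frame.

```python
import math

def proximity(c1, c2, scale=50.0):
    """Map the Euclidean distance of two head centers to (0, 1]; 1 means identical."""
    return 1.0 / (1.0 + math.dist(c1, c2) / scale)

def weighted_similarity(pose_sim, prox, alpha=0.6, beta=0.4):
    """A = alpha * S + beta * P (alpha and beta are hypothetical weights)."""
    return alpha * pose_sim + beta * prox

def match_ids(targets_a, targets_b, pose_sim, min_score=0.5):
    """Greedily give each target in view B the ID of its best match in view A."""
    scored = sorted(
        ((weighted_similarity(pose_sim[(ia, ib)], proximity(ca, cb)), ia, ib)
         for ia, ca in targets_a.items()
         for ib, cb in targets_b.items()),
        reverse=True,
    )
    used_a, used_b, result = set(), set(), {}
    for score, ia, ib in scored:
        if score < min_score:
            break  # remaining pairs are too dissimilar to share an ID
        if ia in used_a or ib in used_b:
            continue
        result[ib] = ia
        used_a.add(ia)
        used_b.add(ib)
    return result

# Hypothetical head centers: tracked IDs in view A, unlabeled detections in view B
targets_a = {1: (100.0, 200.0), 2: (300.0, 220.0)}
targets_b = {"b0": (305.0, 225.0), "b1": (98.0, 198.0)}
# Hypothetical pairwise skeleton pose similarities S
pose_sim = {(1, "b0"): 0.30, (1, "b1"): 0.90, (2, "b0"): 0.85, (2, "b1"): 0.40}
print(match_ids(targets_a, targets_b, pose_sim))
```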
8. The cross-camera multi-target tracking method based on skeleton similarity according to claim 1, characterized in that the cross-region target ID matching includes the following steps:
The DeepSort target tracking model is used for initial tracking and matching: the rectangular region corresponding to the head target region is used as the input tensor, and convolutional feature extraction is performed to obtain a feature vector of a preset size;
The Kalman filter algorithm is used to predict the position of the center coordinates of the target region at the next time step, and the Hungarian algorithm is used to match the feature vectors of the previous time step and the current time step to obtain the corresponding target identity ID;
The Euclidean distance between the center position of the target person's clavicle and the center position of the target person's face is calculated, and secondary target matching is performed between the previous frame and the current frame to prevent the target identity ID from being lost.
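The Hungarian matching step can be illustrated with a compact implementation of the standard potential-based algorithm for the assignment problem. This is a sketch rather than the DeepSort code: the function name is hypothetical, the cost values are invented, and the cost matrix is assumed to hold feature distances with no more rows (previous tracks) than columns (current detections).

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.
    Returns a dict mapping each row index to its assigned column index."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row currently assigned to column j (1-based, 0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the assignments along the augmenting path
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}

# Hypothetical feature distances: rows are tracks from the previous time step,
# columns are detections at the current time step
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
]
assignment = hungarian(cost)
print(assignment)
```

Unlike a greedy choice per row, the assignment minimizes the total cost, so one track may receive its second-best detection when that lowers the overall sum.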
9. A cross-camera multi-target tracking device based on skeleton similarity, comprising a memory, a processor, and a program stored in the memory, characterized in that, when the processor executes the program, the method according to any one of claims 1-8 is implemented.
10. A storage medium having a program stored thereon, characterized in that, when the program is executed, the method according to any one of claims 1-8 is implemented.