Cross-camera target person tracking method and device, storage medium and computer device
By employing a cross-camera tracking method based on multi-dimensional feature collaboration and spatiotemporal constraints, the problem of insufficient continuity and robustness in cross-camera tracking is solved, and stable human tracking is achieved in complex scenes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DONGGUAN ZKTECO ELECTRONICS TECH
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-19
[Figure CN122244765A_ABST: abstract drawing]
Abstract
Description
Technical Field
[0001] This application relates to the field of person tracking technology, and in particular to a method, apparatus, storage medium and computer device for cross-camera target person tracking.
Background Technology
[0002] In the field of intelligent video surveillance, person tracking is a key technology for achieving continuous target localization and cross-scene association. Single-camera tracking, combined with target detection and Kalman filtering, can maintain a stable trajectory within a single field of view, but the target is lost once it leaves the field of view. Face recognition tracking is highly accurate when acquiring clear frontal images, but it heavily relies on facial visibility and is ineffective when encountering occlusions, side profiles, or heads tilted down. Pedestrian re-identification technology performs cross-camera matching based on appearance features such as clothing, but in scenes with significant differences in lighting and viewing angles, the appearance of the same target changes significantly, leading to insufficient matching stability.
[0003] Therefore, the core problem facing existing technologies lies in the difficulty of simultaneously achieving continuity and robustness in tracking. A single feature dimension has limited anti-interference capabilities in complex real-world scenarios, while blind spots between cameras and environmental differences further exacerbate the difficulty of resuming identity tracking after target interruption, making it highly susceptible to losing identity association when the target disappears for extended periods or experiences drastic changes in perspective. How to enhance identity association capabilities across scenarios to achieve stable person tracking remains a pressing technical challenge that needs to be addressed.
Summary of the Invention
[0004] The purpose of this application is to at least address one of the aforementioned technical deficiencies, particularly the technical deficiency in existing technologies regarding how to enhance identity association capabilities across different scenarios to achieve stable person tracking.
[0005] In a first aspect, this application provides a method for tracking a target person across cameras, the method comprising:
[0006] When the first camera detects the appearance of a target person, it extracts the target person's identity features, which include appearance features, gait features, body shape features, and movement pattern features.
[0007] When the first camera detects that the target person has disappeared, a second camera is predicted from among the downstream cameras of the first camera, and the spatiotemporal constraint matching probability is determined.
[0008] Among the initial people detected by the second camera, candidate people are selected, and based on identity features, the similarity between each candidate person and the target person in terms of appearance, gait, body shape and movement is calculated.
[0009] For each candidate, the costume change result is identified based on the appearance similarity of the candidate. Based on the costume change result, the weights of the corresponding appearance similarity, gait similarity, body shape similarity, and motion similarity are adjusted. A matching score is then calculated as the weighted sum of these similarities, with the spatiotemporal constraint matching probability applied as a penalty coefficient, and the tracking result of the target person is confirmed based on the matching scores.
[0010] In one embodiment, the process of extracting motion pattern features includes:
[0011] Based on the appearance features, gait features, and body shape features, the target person is matched against historical records to determine whether the target person is appearing for the first time.
[0012] If so, extract the instantaneous movement speed of the target person as a feature of the movement pattern;
[0013] Otherwise, based on the matching results, the target person's turning preferences and dwelling patterns are obtained from the historical records, and the instantaneous movement speed, turning preferences, and dwelling patterns are used as movement pattern features.
[0014] In one embodiment, the step of predicting the second camera among the downstream cameras of the first camera and determining the spatiotemporal constraint matching probability when the first camera detects the disappearance of the target person includes:
[0015] When the first camera detects that the target person has disappeared, it records the time of the disappearance and determines the direction of the target person's movement based on their identity characteristics.
[0016] Obtain the topology graph of the first camera, which includes the reachable paths, shortest transit time, and longest transit time between the first camera and each downstream camera.
[0017] Based on each reachable path and direction of movement, candidate cameras are selected from each downstream camera;
[0018] For each candidate camera, the time interval between the time at which an initial person is detected by that candidate camera and the time at which the target person disappeared from the first camera is determined. If this time interval is not less than the shortest transit time, the candidate camera is used as the second camera, and the spatiotemporal constraint matching probability of the second camera is calculated based on the time interval, the shortest transit time and the longest transit time.
[0019] In one embodiment, the step of screening candidates from the initial people detected by the second camera includes:
[0020] Obtain the target height from the target person's physical characteristics;
[0021] If the height difference between the initial person detected by the second camera and the target height is less than a preset height difference threshold, then the initial person is considered a candidate.
[0022] In one embodiment, appearance similarity includes global similarity and multiple local similarities. The step of identifying the costume change result based on the corresponding appearance similarity includes:
[0023] The candidate person is determined to have changed clothes if all of the following conditions hold: the global similarity of the candidate person is less than the global similarity threshold; at least two of the corresponding local similarities are greater than the first local similarity threshold; and at least one of the corresponding local similarities is less than the second local similarity threshold. The first local similarity threshold is greater than the second local similarity threshold.
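The three conditions of this embodiment can be sketched as a single predicate. This is only an illustrative sketch: the function name `has_changed_clothes` and the default threshold values (0.5, 0.7, 0.3) are assumptions chosen for the example and are not specified in this application.

```python
def has_changed_clothes(global_sim, local_sims,
                        global_thr=0.5,        # hypothetical global similarity threshold
                        first_local_thr=0.7,   # hypothetical first local threshold
                        second_local_thr=0.3): # hypothetical second local threshold
    """Return True if the candidate is judged to have changed clothes.

    global_sim: global appearance similarity of the candidate.
    local_sims: iterable of local (per body part) appearance similarities.
    """
    # The first local threshold must be greater than the second one.
    if first_local_thr <= second_local_thr:
        raise ValueError("first_local_thr must exceed second_local_thr")

    local_sims = list(local_sims)
    # Parts that still match well (for example shoes and backpack).
    high_count = sum(1 for s in local_sims if s > first_local_thr)
    # Parts that no longer match (for example a replaced jacket).
    low_count = sum(1 for s in local_sims if s < second_local_thr)

    return global_sim < global_thr and high_count >= 2 and low_count >= 1
```

For example, a candidate whose upper body similarity collapses while head, shoes and backpack still match would satisfy all three conditions, whereas a candidate with a high global similarity would not.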
[0024] In one embodiment, the step of adjusting the weights for the corresponding appearance similarity, gait similarity, body shape similarity, and movement similarity based on the costume change result includes:
[0025] If it is determined that the candidate has changed clothes, the initial weight of appearance similarity is reduced, while the initial weights of gait similarity, body shape similarity, and movement similarity are increased. The initial weight of appearance similarity is obtained by adjusting the basic weights of appearance features based on the lighting quality, and the initial weight of gait similarity is obtained by adjusting the basic weights of gait features based on the distance between the target person and the first camera.
[0026] In one embodiment, the step of confirming the tracking results of the target person based on each matching score includes:
[0027] If there is a matching score that exceeds the preset matching threshold, the matching score range is determined. The upper limit of the matching score range is the maximum matching score, and the lower limit is the difference between the maximum matching score and the preset difference.
[0028] Each candidate whose matching score falls within the matching score range is taken as a person to be verified. If two or more persons to be verified remain, the matching scores of the persons to be verified continue to be calculated. When only one person to be verified remains, it is confirmed that the target person has been tracked.
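The confirmation logic of this embodiment can be illustrated with the following sketch. The names `persons_to_verify` and `confirm_target`, and the default values for the matching threshold and the preset difference, are hypothetical and serve only to make the example runnable.

```python
def persons_to_verify(scores, match_threshold=0.6, preset_difference=0.05):
    """Return the candidates whose score lies in the matching score range.

    scores: dict mapping candidate id -> matching score.
    The threshold and difference defaults are illustrative assumptions.
    """
    if not scores:
        return []
    best = max(scores.values())
    # No score exceeds the preset matching threshold: nothing to verify.
    if best <= match_threshold:
        return []
    lower = best - preset_difference  # lower limit of the matching score range
    return [cid for cid, s in scores.items() if lower <= s <= best]


def confirm_target(scores, match_threshold=0.6, preset_difference=0.05):
    """Return the confirmed candidate id, or None if verification must continue."""
    pending = persons_to_verify(scores, match_threshold, preset_difference)
    if len(pending) == 1:
        return pending[0]
    # Zero candidates, or several close candidates: keep observing and rescoring.
    return None
```

When `confirm_target` returns `None` because several candidates are close to the maximum score, the caller would recompute the scores on later frames and call it again.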
[0029] In a second aspect, this application provides a cross-camera target person tracking device, the device comprising:
[0030] The identity feature extraction module is used to extract the identity features of the target person when the first camera detects the appearance of the target person. The identity features include appearance features, gait features, body shape features and movement pattern features.
[0031] The spatiotemporal constraint matching probability determination module is used to predict the second camera among the downstream cameras of the first camera when the first camera detects the disappearance of the target person, and to determine the spatiotemporal constraint matching probability.
[0032] The similarity calculation module is used to filter candidates from the initial people detected by the second camera, and calculate the appearance similarity, gait similarity, body shape similarity and movement similarity between each candidate and the target person based on identity features.
[0033] The tracking result confirmation module is used to identify the costume change result for each candidate based on the appearance similarity of the candidate, and adjust the weights of the corresponding appearance similarity, gait similarity, body shape similarity and movement similarity based on the costume change result. By weighting and using the spatiotemporal constraint matching probability as a penalty coefficient, the matching score is calculated, and the tracking result of the target person is confirmed based on each matching score.
[0034] In a third aspect, this application provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of any of the cross-camera target person tracking methods described in the above embodiments.
[0035] In a fourth aspect, this application provides a computer device, including: one or more processors, and a memory;
[0036] The memory stores computer-readable instructions that, when executed by one or more processors, perform the steps of any of the cross-camera target person tracking methods described in the above embodiments.
[0037] As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
[0038] The cross-camera target person tracking method, apparatus, storage medium, and computer equipment provided in this application first predict downstream cameras and matching probabilities through spatiotemporal constraints after the target disappears, narrowing the matching range and reducing the risk of global mismatches. During the candidate matching process, for possible changes in the target's appearance, features such as gait and body shape, which are not easily affected by the environment, are used in conjunction with appearance features to determine whether the target has changed clothes. The weight of each feature in the matching score is dynamically adjusted according to the recognition results, so that stable features can still be relied upon first for identity continuation when the target's appearance changes. At the same time, the spatiotemporal constraint probability is incorporated as a penalty coefficient into the comprehensive score to further filter unreasonable spatiotemporal associations. Thus, this application achieves synergistic complementarity and adaptive fusion of multi-dimensional features, significantly improving the accuracy of identity continuation after target interruption in cross-camera scenarios, and ensuring the continuity and stability of person tracking in complex scenes.
Attached Figure Description
[0039] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0040] Figure 1 is a flowchart of the cross-camera target person tracking method provided in an embodiment of this application;
[0041] Figure 2 is a schematic diagram of the cross-camera target person tracking device provided in an embodiment of this application;
[0042] Figure 3 is a schematic diagram of the internal structure of a computer device provided in an embodiment of this application.
Detailed Implementation
[0043] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0044] This application provides a method for tracking a target person across multiple cameras. The following embodiments illustrate this method using a computer device as an example. It is understood that the computer device can be any device with data processing capabilities, including but not limited to a single server, server cluster, personal laptop, desktop computer, etc. As shown in Figure 1, the method may include the following steps:
[0045] S101: When the first camera detects the appearance of a target person, it extracts the target person's identity features, including appearance features, gait features, body shape features, and movement pattern features.
[0046] The first camera is a front-end device deployed in the monitoring scene to capture video footage, and it has an independent field of view. The target person is a specific individual to be tracked, who is identified and located in the video footage by a bounding box. Identity features are a set of biometric and behavioral characteristics used to uniquely or highly distinguish the target person, including appearance features, gait features, body shape features, and movement pattern features. Appearance features refer to visual information such as the color, texture, and style of clothing, and items carried by the target person. Gait features refer to the dynamic behavioral patterns of the target person, such as posture, arm swing amplitude, stride frequency, and stride length. Body shape features refer to static physical structure information such as height, shoulder width, and body proportions. Movement pattern features refer to the spatiotemporal behavioral patterns of the target person when moving within the monitoring scene, such as changes in speed, acceleration, directional preferences, and path selection.
[0047] In a real-world surveillance network, multiple cameras cover different areas. When a camera (the first camera) detects a target person within its field of view, it locates the target person in the video feed. From the initial moment of detection, the system begins extracting multi-dimensional identity features from that person.
[0048] For appearance feature extraction, a ResNet50-based backbone network is used to extract global features. The 2048-dimensional features are reduced to 512 dimensions through principal component analysis to capture salient information such as overall contour, clothing color, and backpack. Simultaneously, the human body is semantically segmented into six parts: head, upper body, lower body, shoes, backpack or handbag, and held items. A 128-dimensional local feature is extracted independently for each part, and the six local features together form a 768-dimensional local feature vector. Furthermore, 256-bin histograms of the RGB color space and histograms of the HSV color space are extracted, together with the dominant color tones of the upper and lower garments.
[0049] During gait feature extraction, 17 key points of the human body are detected using the OpenPose or AlphaPose algorithm. Gait cycle segmentation is achieved by detecting the complete cycle from left foot landing to right foot landing and back to left foot landing in consecutive video frames. Stride length, stride frequency, and swing amplitude are calculated based on the key point coordinates, and the patterns of center of gravity changes and joint angle changes are analyzed. The human body contours of all frames within a gait cycle are superimposed to generate a 128-pixel by 128-pixel gait energy map, and a 256-dimensional gait feature vector is extracted using a convolutional neural network.
[0050] When extracting body shape features, the actual height is estimated based on parameters such as the camera's installation height and tilt angle, combined with the pixel height of the human body in the image and a scaling factor. Shoulder width is calculated from the distance between the key points on the left and right shoulders, and waist circumference is estimated from the distance between the key points on the left and right hips. Then, the ratios of shoulder width to height, waist circumference to height, and head size to height are calculated, and the body type is classified into categories such as thin, standard, fat, and stocky.
[0051] During motion pattern feature extraction, the target's movement speed is recorded in real time, and the average walking speed, speed standard deviation, and acceleration are calculated. Motion trajectory data of the target is collected in different time periods and scenarios. Multiple time series are segmented using a sliding time window to calculate parameters such as turning radius, turning angle, and direction change frequency. Turning behavior is classified using K-means or DBSCAN clustering algorithms. The turning sequence is modeled based on a Markov model or Hidden Markov Model, transforming it into a feature vector containing turning radius distribution, angle preference weights, and hesitation frequency. The target's movement speed is monitored; when the speed is below 0.3 meters per second and lasts for more than 3 seconds, it is considered a dwelling event. The coordinates and duration of the dwelling location are recorded, and the average dwelling duration and dwelling frequency are calculated. Cluster analysis is performed on the dwelling locations to determine location preferences.
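The dwelling event rule stated above (speed below 0.3 meters per second lasting for more than 3 seconds) can be sketched as follows. The input format, a time-ordered list of `(timestamp_seconds, speed_m_per_s)` samples, and the function names are assumptions made for illustration; they are not defined in this application.

```python
def detect_dwell_events(samples, speed_threshold=0.3, min_duration=3.0):
    """Return a list of (start_time, end_time) dwelling events.

    samples: time-ordered list of (timestamp_seconds, speed_m_per_s).
    A dwelling event is a run of consecutive samples whose speed stays
    below speed_threshold and whose duration exceeds min_duration.
    """
    events = []
    start = None   # timestamp of the first slow sample in the current run
    last = None    # timestamp of the most recent slow sample in the run
    for t, speed in samples:
        if speed < speed_threshold:
            if start is None:
                start = t
            last = t
        else:
            if start is not None and last - start > min_duration:
                events.append((start, last))
            start = None
    # Close a run that is still open at the end of the track.
    if start is not None and last - start > min_duration:
        events.append((start, last))
    return events


def mean_dwell_duration(events):
    """Average dwelling duration in seconds, or 0.0 if there are no events."""
    if not events:
        return 0.0
    return sum(end - begin for begin, end in events) / len(events)
```

The dwelling frequency and the clustering of dwelling locations described in the paragraph would be computed from the same event list together with the recorded coordinates, which are omitted here for brevity.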
[0052] By simultaneously extracting four dimensions of identity features—appearance, gait, body shape, and movement patterns—from the initial stage of target detection by the first camera, a multi-dimensional identity representation foundation is constructed for the target. Compared to relying solely on single appearance features, this method collects stable features such as gait and body shape that are not easily changed by the environment or behavior as soon as the target appears. These features are independent of the target's clothing and can still serve as reliable matching criteria even if the target's appearance changes subsequently. Simultaneously, the accumulation of movement pattern features provides information on the target's unique behavioral patterns, enabling identity recognition to expand from static appearance comparison to dynamic behavior verification. The joint extraction and storage of multi-dimensional features provides a highly redundant and complementary feature pool for subsequent identity continuation in cross-camera tracking, fundamentally enhancing the reliability and robustness of identity association in complex scenarios.
[0053] S102: When the first camera detects that the target person has disappeared, predict the second camera among the downstream cameras of the first camera and determine the spatiotemporal constraint matching probability.
[0054] Downstream cameras refer to those spatially connected to the first camera and likely to be entered by the target person after leaving the first camera's field of view. These cameras are typically pre-determined based on the topology of the surveillance network and the travel paths within the scene. The second camera is the next camera predicted from the downstream cameras as the most likely next location for the target person to appear. The spatiotemporal constraint matching probability is a quantifiable value used to measure the likelihood that a target person will appear at a specific downstream camera within a specific time window after disappearing from the first camera. This probability comprehensively considers factors such as the spatial distance between cameras, travel time, historical trajectory statistics, and path constraints within the scene.
[0055] In the actual deployment of a surveillance network, the coverage areas of each camera are not isolated; there are clear connectivity and path reachability between the cameras. When the first camera detects the disappearance of a target person, it first retrieves a pre-constructed camera topology map. This topology map records the spatial locations of all cameras in the surveillance network and the physical connectivity paths between them. Based on this topology, all downstream cameras that have a direct connection to the first camera are identified, and these cameras are used as candidate prediction targets.
[0056] For each downstream camera, the spatiotemporal constraint matching probability is calculated. The actual physical distance between the first and downstream cameras is obtained, and combined with the pedestrian speed range in the scene, the time interval required for the target person to reach the downstream camera after disappearing from the first camera is estimated. A historical trajectory database is retrieved, which records the statistical distribution of past pedestrian travel times from the first camera to each downstream camera, including the fastest travel time, slowest travel time, and the most common travel time interval. Based on the movement speed of the target person's last trajectory segment in the first camera, the travel time interval is individually adjusted: if the target person was running before disappearing, the travel time interval is shortened; if they were walking slowly, it is appropriately extended. Path constraints in the scene are also considered, such as whether there are necessary corridors, stairs, or elevators, and whether these paths incur additional time consumption. The above physical distance, historical statistics, current movement state, and path constraints are fused together to calculate a spatiotemporal constraint matching probability between 0 and 1, reflecting the likelihood that the target person will appear at the downstream camera within a reasonable timeframe.
[0057] After calculating the probability of all downstream cameras, one or more downstream cameras with the highest probability value or exceeding a preset threshold are identified as the second camera and used as the key targets for subsequent matching and tracking. For downstream cameras with extremely low spatiotemporal constraint matching probability, their weight will be significantly reduced or they will be directly excluded in the subsequent matching process, thereby reducing unnecessary computational overhead.
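This application does not give a closed-form expression for the spatiotemporal constraint matching probability, so the following is only one possible sketch. It assumes, purely for illustration, that the probability is 0 when the interval is shorter than the shortest transit time, 1 when the interval lies between the shortest and longest transit times, and decays exponentially once the longest transit time is exceeded. The function names, the decay shape, and the `min_probability` cut-off are all hypothetical.

```python
import math


def spatiotemporal_probability(interval, t_min, t_max):
    """Illustrative probability in [0, 1] that a transit of `interval` seconds is plausible.

    t_min / t_max: shortest and longest transit times from the topology graph.
    The piecewise shape below is an assumption, not the formula of this application.
    """
    if interval < t_min:
        return 0.0            # physically too fast to be the same person
    if interval <= t_max:
        return 1.0            # inside the expected transit window
    # Later than expected: plausibility decays with the overshoot.
    return math.exp(-(interval - t_max) / t_max)


def select_second_cameras(intervals, topology, min_probability=0.2):
    """Keep downstream cameras whose probability reaches min_probability.

    intervals: dict camera_id -> seconds between disappearance and detection.
    topology:  dict camera_id -> (t_min, t_max).
    Returns dict camera_id -> probability for the retained cameras.
    """
    selected = {}
    for cam, interval in intervals.items():
        t_min, t_max = topology[cam]
        p = spatiotemporal_probability(interval, t_min, t_max)
        if p >= min_probability:
            selected[cam] = p
    return selected
```

A fuller implementation would also fold in the historical travel time distribution and the target's last observed speed described in the preceding paragraphs, for example by shifting `t_min` and `t_max` before the probability is evaluated.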
[0058] By predicting the second camera from downstream cameras and determining the spatiotemporal constraint matching probability in advance when the target person disappears, the scope of cross-camera identity association is narrowed from all downstream cameras to a finite set of candidates most likely to appear in time and space, effectively reducing the computational cost of global matching. Simultaneously, introducing spatiotemporal constraint matching probability allows for a quantitative evaluation of the spatiotemporal rationality of the match based on factors such as physical distance, historical traffic statistics, and current motion state. In subsequent association processes, priority is given to cameras with the highest spatiotemporal probability, avoiding invalid matches with spatiotemporally unreasonable cameras, thereby improving the accuracy and efficiency of cross-camera identity association.
[0059] S103: Among the initial people detected by the second camera, candidate people are selected, and based on identity features, the similarity of appearance, gait, body shape and movement between each candidate person and the target person is calculated respectively.
[0060] The initial subjects refer to all pedestrians detected by the second camera within its field of view. Candidate subjects are selected from the initial subjects who have a high degree of spatiotemporal matching with the target subject, and are used as the objects for subsequent identity similarity calculations. Appearance similarity measures the degree of matching between the candidate and target subjects in visual appearance information such as clothing color, texture, style, and carried items. Gait similarity measures the consistency of dynamic behavioral patterns such as walking posture, stride length, stride frequency, and body sway amplitude. Body shape similarity measures the closeness of static physical structure such as height, shoulder width ratio, and waist circumference ratio. Motion similarity measures the degree of conformity of spatiotemporal behavioral patterns such as movement speed, acceleration, turning habits, and dwelling patterns.
[0061] After identifying the second camera, the video stream currently captured by the second camera is acquired. A target detection algorithm is used to identify all pedestrians within the camera's field of view, and these pedestrians are selected as initial subjects. For each initial subject, a preliminary screening is performed. Specifically, the time window for the target subject's appearance at the second camera is calculated based on the time the target subject disappears from the first camera and the predicted travel time interval from the first camera to the second camera. Individuals from the initial subjects whose detection time falls within this time window are retained, while those whose detection time significantly deviates from the time window are removed, thus forming a candidate subject set.
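The time window screening described above can be sketched as follows. The function name, the `(person_id, first_seen_time)` input format, and the `tolerance` parameter (used here to model what counts as a significant deviation from the window) are assumptions for illustration.

```python
def screen_candidates(detections, disappear_time, t_min, t_max, tolerance=0.0):
    """Return ids of initial persons whose detection time fits the expected window.

    detections:     list of (person_id, first_seen_time_seconds) from the second camera.
    disappear_time: time at which the target left the first camera.
    t_min, t_max:   predicted shortest and longest travel times to the second camera.
    tolerance:      hypothetical slack, in seconds, added on both sides of the window.
    """
    earliest = disappear_time + t_min - tolerance
    latest = disappear_time + t_max + tolerance
    return [pid for pid, seen in detections if earliest <= seen <= latest]
```

With a disappearance time of 100 s and a predicted travel time of 10 s to 30 s, only persons first detected between 110 s and 130 s would be retained as candidates when `tolerance` is 0.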
[0062] In calculating appearance similarity, a 512-dimensional global appearance feature vector of the candidate is extracted, and its cosine similarity with the pre-stored global feature vector of the target person is calculated to obtain the global similarity. Simultaneously, a 128-dimensional feature vector is extracted from the candidate for each of six local regions (head, upper body, lower body, shoes, backpack or handbag, and held items), and the cosine similarity between each of these vectors and the corresponding local feature vector of the target person is calculated to obtain six local similarities. Both the global and the local similarities form part of the overall appearance similarity.
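The global and local appearance similarities can be sketched with a plain cosine similarity. The dictionary layout of the feature records and the part names in `PARTS` are assumptions made for the example; the vectors in practice would be 512-dimensional (global) and 128-dimensional (local).

```python
import math

# Hypothetical keys for the six local regions named in the text.
PARTS = ("head", "upper_body", "lower_body", "shoes", "bag", "held_item")


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def appearance_similarities(target, candidate):
    """Return (global_similarity, {part: local_similarity}).

    target / candidate: dicts with a "global" vector and a "local" dict
    mapping each name in PARTS to that part's feature vector.
    """
    global_sim = cosine_similarity(target["global"], candidate["global"])
    local_sims = {
        part: cosine_similarity(target["local"][part], candidate["local"][part])
        for part in PARTS
    }
    return global_sim, local_sims
```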
[0063] When calculating gait similarity, the gait energy map of the candidate walking continuously within the field of view of the second camera is obtained, and a 256-dimensional gait feature vector is extracted from it by a convolutional neural network. The Euclidean distance between the candidate's and the target person's 256-dimensional gait feature vectors is calculated and divided by the maximum distance between gait features of the same person in the dataset. The gait similarity is 1 minus this ratio.
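The gait similarity can be sketched as 1 minus the normalized Euclidean distance. The clamp to 0 for distances larger than the normalization constant is an added assumption, since this application does not state how such cases are handled; the function and parameter names are likewise illustrative.

```python
import math


def gait_similarity(candidate_vec, target_vec, max_same_person_distance):
    """Gait similarity = 1 - d / d_max, clamped to be non-negative.

    candidate_vec, target_vec: gait feature vectors (256-dimensional in the text).
    max_same_person_distance:  maximum distance between gait features of the
                               same person in the dataset (d_max), must be > 0.
    """
    if max_same_person_distance <= 0:
        raise ValueError("max_same_person_distance must be positive")
    distance = math.dist(candidate_vec, target_vec)  # Euclidean distance
    # Clamp is an assumption: distances beyond d_max are treated as similarity 0.
    return max(0.0, 1.0 - distance / max_same_person_distance)
```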
[0064] When calculating body shape similarity, for both the candidate and the target, the following calculations are performed: height similarity = 1 minus height difference divided by 30 cm; shoulder width similarity = 1 minus shoulder width difference divided by 10 cm; waist circumference similarity = 1 minus waist circumference difference divided by 15 cm; and head similarity = 1 minus head size difference divided by 5 cm. Then, the similarity is weighted and summed according to the following weights: height similarity weight 0.4, shoulder width similarity weight 0.3, waist circumference similarity weight 0.2, and head similarity weight 0.1, to obtain the body shape similarity score.
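The body shape similarity described above can be sketched directly from the stated normalization constants (30 cm, 10 cm, 15 cm, 5 cm) and weights (0.4, 0.3, 0.2, 0.1). The dictionary keys and the clamp of each term to 0 are assumptions added for the example.

```python
# Normalization constants in centimeters, as given in the text.
SCALES_CM = {"height": 30.0, "shoulder_width": 10.0, "waist": 15.0, "head": 5.0}
# Weights of the four terms, as given in the text.
WEIGHTS = {"height": 0.4, "shoulder_width": 0.3, "waist": 0.2, "head": 0.1}


def body_shape_similarity(target, candidate):
    """Weighted body shape similarity between two measurement dicts (values in cm).

    Each term is 1 - |difference| / scale. Clamping negative terms to 0 is an
    assumption; the text does not say how differences above the scale are handled.
    """
    total = 0.0
    for key, scale in SCALES_CM.items():
        diff = abs(target[key] - candidate[key])
        term = max(0.0, 1.0 - diff / scale)
        total += WEIGHTS[key] * term
    return total
```

For a target of height 175 cm, shoulder width 45 cm, waist 80 cm and head size 23 cm, and a candidate measured at 172 cm, 44 cm, 83 cm and 23 cm, the four terms are 0.9, 0.9, 0.8 and 1.0, giving 0.4 × 0.9 + 0.3 × 0.9 + 0.2 × 0.8 + 0.1 × 1.0 = 0.89.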
[0065] In calculating motion similarity, the candidate's movement speed time series, turning radius sequence, turning angle preference distribution, hesitation feature parameters, and dwelling event data within the second camera's field of view are recorded. Correlation analysis is performed between the candidate's speed time series and the target's speed time series to obtain speed pattern similarity. Turning radius similarity, turning angle preference similarity, and hesitation feature similarity are calculated and weighted at 0.4, 0.4, and 0.2 respectively to obtain turning habit similarity. Dwelling duration similarity, dwelling frequency similarity, and position preference similarity are calculated and weighted at 0.4, 0.3, and 0.3 respectively to obtain dwelling pattern similarity. Finally, a weighted sum is calculated using a weight of 0.5 for speed pattern similarity, 0.3 for turning habit similarity, and 0.2 for dwelling pattern similarity to obtain motion similarity.
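The two-level weighting of motion similarity can be sketched as follows. The weights are those stated in the text; the function signature and dictionary keys are assumptions, and the individual sub-similarities are taken as already computed values in the range 0 to 1.

```python
def motion_similarity(speed_sim, turning, dwelling):
    """Combine sub-similarities into a single motion similarity.

    speed_sim: speed pattern similarity.
    turning:   dict with "radius", "angle" and "hesitation" similarities.
    dwelling:  dict with "duration", "frequency" and "position" similarities.
    """
    # Turning habit similarity: weights 0.4, 0.4, 0.2.
    turning_sim = (0.4 * turning["radius"]
                   + 0.4 * turning["angle"]
                   + 0.2 * turning["hesitation"])
    # Dwelling pattern similarity: weights 0.4, 0.3, 0.3.
    dwelling_sim = (0.4 * dwelling["duration"]
                    + 0.3 * dwelling["frequency"]
                    + 0.3 * dwelling["position"])
    # Final motion similarity: weights 0.5, 0.3, 0.2.
    return 0.5 * speed_sim + 0.3 * turning_sim + 0.2 * dwelling_sim
```

For instance, with a speed pattern similarity of 0.8, turning similarities of 0.5, 1.0 and 0.5, and dwelling similarities of 1.0, 0.5 and 0.5, both intermediate scores equal 0.7 and the motion similarity is 0.5 × 0.8 + 0.3 × 0.7 + 0.2 × 0.7 = 0.75.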
[0066] By filtering candidate individuals from the initial persons detected by the second camera based on spatiotemporal constraints, the scope of identity matching is effectively narrowed, reducing the computational load of subsequent similarity calculations. Simultaneously, the similarity between each candidate and the target person is calculated across four dimensions: appearance, gait, body shape, and movement patterns. This establishes a multi-dimensional foundation for identity comparison, so that subsequent identity matching no longer relies solely on appearance features. Instead, it incorporates gait and body shape features, which are less affected by clothing and occlusion, as well as movement patterns reflecting individual behavioral habits. This provides rich and complementary similarity data for comprehensively judging identity consistency, enhancing the accuracy and robustness of cross-camera identity association.
[0067] S104: For each candidate, identify the costume change result based on the appearance similarity of the candidate, and adjust the weights of the corresponding appearance similarity, gait similarity, body shape similarity and motion similarity based on the costume change result. Calculate the matching score by weighting and using the spatiotemporal constraint matching probability as a penalty coefficient, and confirm the tracking result of the target person based on each matching score.
[0068] The costume change result is a judgment, made on the basis of appearance similarity, of whether the target person has changed their clothing. It includes situations such as no change, a changed top, or changed pants. The penalty coefficient is a multiplicative factor that applies the spatiotemporal constraint matching probability directly to the matching score, and is used to reduce the score of spatiotemporally unreasonable matches. The matching score is a comprehensive score obtained by multiplying the weighted sum of the multi-dimensional similarities by the spatiotemporal constraint matching probability, and is used to measure the overall degree of matching between the candidate and the target person. The tracking result is the final identity association result confirmed from all candidates based on the matching scores.
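The matching score defined above, a weighted sum of the four similarities multiplied by the spatiotemporal constraint matching probability, can be sketched as follows. The dictionary keys are hypothetical, and dividing by the weight total (so that unnormalized weights still give a score in the range 0 to 1) is an added assumption.

```python
def matching_score(similarities, weights, st_probability):
    """Matching score = st_probability * weighted sum of similarities.

    similarities:   dict with "appearance", "gait", "body_shape", "motion".
    weights:        dict with the same keys; need not sum exactly to 1.
    st_probability: spatiotemporal constraint matching probability in [0, 1],
                    used as a multiplicative penalty coefficient.
    """
    if set(similarities) != set(weights):
        raise ValueError("similarities and weights must have the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by total_weight is an assumption made for robustness.
    weighted = sum(weights[k] * similarities[k] for k in weights) / total_weight
    return st_probability * weighted
```

With similarities of 0.6, 0.8, 0.9 and 0.7, weights of 0.4, 0.3, 0.2 and 0.1, and a spatiotemporal probability of 0.5, the weighted sum is 0.73 and the matching score is 0.365.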
[0069] In one implementation, a probabilistic model is used to identify clothing changes and dynamically adjust the weights. A clothing change probability prediction model can be pre-trained, taking appearance similarity as input and outputting a clothing change probability between 0 and 1. This model can be trained using logistic regression or a lightweight neural network, and the training data should contain a large number of positive and negative sample pairs with and without clothing changes. During the identification phase, the weights are adjusted continuously, rather than switched discretely, according to the clothing change probability output by the model. A higher clothing change probability results in a lower weight coefficient for appearance similarity and higher weight coefficients for gait similarity, body shape similarity, and movement similarity. This continuous adjustment avoids the abrupt changes caused by threshold switching, allowing the matching score to smoothly reflect the degree of clothing change.
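The continuous adjustment can be illustrated with a minimal sketch. It assumes a simple linear blend between a "no change" weight profile and a "full change" weight profile; the two profiles, their numeric values, and the function name `continuous_weights` are hypothetical and are not given in this application, which only states that the appearance weight falls and the other weights rise as the clothing change probability grows.

```python
from typing import Dict

# Hypothetical weight profiles (assumed values, each profile sums to 1).
W_NO_CHANGE = {"appearance": 0.4, "gait": 0.25, "body_shape": 0.2, "motion": 0.15}
W_FULL_CHANGE = {"appearance": 0.1, "gait": 0.4, "body_shape": 0.3, "motion": 0.2}


def continuous_weights(p_change: float) -> Dict[str, float]:
    """Blend the two profiles linearly by the clothing change probability."""
    if not 0.0 <= p_change <= 1.0:
        raise ValueError("p_change must lie in [0, 1]")
    blended = {
        k: (1.0 - p_change) * W_NO_CHANGE[k] + p_change * W_FULL_CHANGE[k]
        for k in W_NO_CHANGE
    }
    total = sum(blended.values())  # renormalize to guard against rounding drift
    return {k: v / total for k, v in blended.items()}
```

Because the blend is linear in the probability, a small change in the model output produces a proportionally small change in every weight, which is the smoothness property the paragraph describes.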
[0070] Another implementation is a hierarchical weighting strategy based on multi-level clothing change discrimination. A more granular clothing change type discrimination is designed, employing differentiated weight allocations for different clothing change scenarios. When only the upper garment is changed, the similarity of the lower body, shoes, and backpack remains highly valuable, while the appearance similarity is only moderately reduced. When the entire outfit is changed, the weight of appearance similarity is significantly reduced, while the weights of gait and body shape features are greatly increased. When no clothing change is detected but partial occlusion exists, the original weights are maintained, but the weight of the unoccluded parts of the local features is appropriately increased. By refining the clothing change type and matching the corresponding weighting strategy, the weight adjustments more accurately reflect changes in feature effectiveness in the actual scene, further improving matching accuracy.
[0071] The weighted score is multiplied by the spatiotemporal constraint matching probability to obtain the final matching score of the candidate. The spatiotemporal constraint matching probability serves as a penalty coefficient. When the time when the candidate appears in the second camera deviates significantly from the prediction time window, this probability value is low. Even if the multidimensional similarity weighted score is high, the final matching score will be significantly lowered, thus eliminating mismatches that are spatiotemporally unreasonable but visually similar.
[0072] After calculating the matching scores for all candidate individuals, the candidate with the highest score is associated with the target person to confirm that the candidate is the target person's tracking result. If the matching scores of all candidate individuals are lower than a preset threshold, it is determined that no matching target person has been found in the current second camera, and the system will wait for the next round of prediction or expand the search range.
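The scoring and selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the feature keys, the function names `matching_score` and `pick_target`, and the candidate identifiers are hypothetical, and a score equal to the threshold is treated as a match because the text only rejects scores lower than the threshold.

```python
from typing import Dict, Optional

FEATURES = ("appearance", "gait", "body_shape", "motion")


def matching_score(sims: Dict[str, float], weights: Dict[str, float], st_prob: float) -> float:
    """Weighted similarity sum multiplied by the spatiotemporal matching probability."""
    weighted = sum(weights[f] * sims[f] for f in FEATURES)
    return weighted * st_prob


def pick_target(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring candidate id, or None if every score is below the threshold."""
    if not scores:
        return None
    best_id = max(scores, key=scores.get)
    return best_id if scores[best_id] >= threshold else None
```

For example, a candidate whose four similarities are all 0.9 but whose spatiotemporal probability is 0.2 scores about 0.18, and therefore loses to a candidate with similarities of 0.7 and a probability of 1.0.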
[0073] Furthermore, when a target person has been missing for more than 5 minutes, the search area is expanded to increase the probability of retrieval. The search radius is expanded from one-hop cameras to three-hop cameras, where one hop refers to a camera directly adjacent to the current camera, and three hops refer to a camera that requires traversing three paths to reach; the time window is expanded from 5 minutes to 15 minutes; and the matching threshold is reduced from 0.8 to 0.6. If no match is found after expanding the search, the target person is marked as missing and awaits manual intervention.
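The expanded search can be sketched as a breadth-first traversal of the camera adjacency graph. The adjacency dictionary, the camera names, and the function names are hypothetical; the hop counts, time windows, and thresholds are the values stated in the paragraph above.

```python
from collections import deque
from typing import Dict, List, Set

# Normal and expanded search settings, using the values stated in the text.
NORMAL = {"max_hops": 1, "window_min": 5, "threshold": 0.8}
EXPANDED = {"max_hops": 3, "window_min": 15, "threshold": 0.6}


def cameras_within_hops(adj: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Breadth-first search for cameras reachable in at most max_hops hops."""
    seen = {start}
    reached: Set[str] = set()
    frontier = deque([(start, 0)])
    while frontier:
        cam, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(cam, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.add(nxt)
                frontier.append((nxt, hops + 1))
    return reached


def search_settings(missing_minutes: float) -> Dict[str, float]:
    """Use the expanded settings once the target has been missing for more than 5 minutes."""
    return EXPANDED if missing_minutes > 5 else NORMAL
```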
[0074] By recognizing clothing changes based on appearance similarity and dynamically adjusting the weights, the system automatically shifts the matching focus from volatile appearance features to stable features unaffected by clothing, such as gait and body shape, when a target changes clothes. This avoids reliance on outdated features after a sudden change in appearance and reduces the risk of mismatches in clothing-change scenarios. In addition, because the spatiotemporal constraint matching probability is incorporated into the matching score as a penalty coefficient, spatiotemporally unreasonable matches are significantly downweighted even when appearance similarity is high, which further filters out false associations from a spatiotemporal perspective. The joint evaluation of weighted scores and spatiotemporal penalties achieves adaptive fusion of multi-dimensional features together with verification of spatiotemporal plausibility, improving the accuracy of identity continuation after target interruption in cross-camera scenarios and ensuring the continuity and stability of person tracking in complex real-world scenarios such as clothing changes and occlusion.
[0075] In the above embodiments, after the target disappears, the downstream camera and matching probability are predicted through spatiotemporal constraints to narrow the matching range and reduce the risk of global mismatches. During the candidate matching process, for possible changes in the target's appearance, features such as gait and body shape, which are not easily affected by the environment, are used in conjunction with appearance features to determine the clothing change situation. The weight of each feature in the matching score is dynamically adjusted according to the recognition results. Thus, even when the target's appearance changes, stable features can still be relied upon first for identity continuation. At the same time, the spatiotemporal constraint probability is incorporated as a penalty coefficient into the comprehensive score to further filter unreasonable spatiotemporal associations. Therefore, this application achieves the synergistic complementarity and adaptive fusion of multi-dimensional features, significantly improving the accuracy of identity continuation after target interruption across cameras, and ensuring the continuity and stability of person tracking in complex scenarios.
[0076] In one embodiment, the process of extracting motion pattern features includes:
[0077] Based on the appearance features, gait features, and body shape features, the historical records are used to determine whether the target person is appearing for the first time.
[0078] If so, extract the instantaneous movement speed of the target person as a feature of the movement pattern;
[0079] Otherwise, based on the matching results, the target person's turning preferences and dwelling patterns are obtained from the historical records, and the instantaneous movement speed, turning preferences, and dwelling patterns are used as movement pattern features.
[0080] The historical record is a database storing the identity features and movement behavior data of previously tracked targets. Each target corresponds to one record, including appearance features, gait features, body shape features, and accumulated movement patterns. "First appearance" means that the currently detected target has no matching record in the historical record, i.e., the system has never tracked this individual before. "Instantaneous movement speed" refers to the sequence of instantaneous speeds of the target within the current camera's field of view, measured from entry (or from the moment of first detection) until disappearance, and includes the average speed, the range of speed variation, and acceleration characteristics. "Turning preference" refers to the turning habits the target has exhibited over multiple movements, extracted from the historical record, including turning radius distribution, preferred turning angles, and hesitation characteristics. "Dwelling pattern" refers to the dwelling behavior the target has exhibited in past monitoring scenarios, extracted from the historical record, including average dwelling duration, dwelling frequency, and preferred dwelling locations.
[0081] When the first camera detects a target person, it extracts the target person's appearance, gait, and body shape features. These three types of features are matched against the features of all target persons stored in the historical record, and the overall similarity of the feature vectors is calculated to determine whether the target is appearing for the first time. If the overall similarity is below a preset threshold, the target person is determined to be appearing for the first time. In this case, only the target person's instantaneous movement speed within the current camera's field of view is extracted as the motion pattern feature, including the average speed after entering the field of view, the speed change sequence, and acceleration features. This instantaneous movement speed, along with the appearance, gait, and body shape features, is stored in the historical record, creating a new profile for the target person.
[0082] If the overall similarity exceeds a preset threshold, it is determined that the target person already has a file in the historical records. The historical records are then used to retrieve the target person's accumulated turning preferences and dwelling patterns. Turning preferences include the target's statistical distribution of turning radii, preferred turning angle ranges, and frequency of hesitation during turns in the historical trajectory. Dwelling patterns include the target's average dwell time per dwell, number of dwell times per unit time, and preferred dwelling positions relative to the scene center or boundary, statistically analyzed in the historical records. Simultaneously, the current instantaneous movement speed is extracted and combined with the turning preferences and dwelling patterns retrieved from the historical records to form the target person's movement pattern characteristics for this tracking. These movement pattern characteristics, along with appearance, gait, and body shape characteristics, are used for subsequent cross-camera identity matching.
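The two branches above can be sketched as follows. This is a minimal illustration under stated assumptions: the overall similarity is taken to be the cosine similarity of a single concatenated identity vector, the threshold value 0.8 is a placeholder, and the record layout and function names are hypothetical. A similarity exactly equal to the threshold is treated as an existing profile, which the text does not specify.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def motion_features(identity_vec: List[float], speed_stats: dict,
                    history: Dict[str, dict], threshold: float = 0.8) -> dict:
    """Build motion pattern features, adding historical habits for known targets."""
    best_id, best_sim = None, -1.0
    for pid, rec in history.items():
        sim = cosine(identity_vec, rec["identity"])
        if sim > best_sim:
            best_id, best_sim = pid, sim
    if best_id is None or best_sim < threshold:
        # First appearance: only the instantaneous movement speed is available.
        return {"speed": speed_stats}
    rec = history[best_id]
    # Known target: combine current speed with accumulated habits.
    return {"speed": speed_stats, "turning": rec["turning"], "dwelling": rec["dwelling"]}
```

In a fuller system, the first-appearance branch would also write a new profile into the historical record, as the preceding paragraph describes.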
[0083] By determining whether a target is appearing for the first time based on historical records and employing differentiated motion pattern feature extraction strategies, only instantaneous motion speed is extracted when the target appears for the first time. This avoids ineffective complex feature extraction due to a lack of historical data and improves the efficiency of feature initialization. When the target is not appearing for the first time, stable behavioral features accumulated over a long period, such as turning preferences and dwell patterns, are obtained from historical records. This means that motion pattern features are no longer limited to the instantaneous speed of a single appearance, but rather incorporate the unique behavioral habits formed by an individual over a long period of time. This enhances the stability and distinguishability of motion pattern features and provides a more reliable behavioral dimension basis for subsequent cross-camera identity association.
[0084] In one embodiment, the step of predicting the second camera among the downstream cameras of the first camera and determining the spatiotemporal constraint matching probability when the first camera detects the disappearance of the target person includes:
[0085] When the first camera detects that the target person has disappeared, it records the time of the disappearance and determines the direction of the target person's movement based on their identity characteristics.
[0086] Obtain the topology graph of the first camera, which includes the reachable paths, shortest travel time, and longest travel time between the first camera and each downstream camera.
[0087] Based on each reachable path and direction of movement, candidate cameras are selected from each downstream camera;
[0088] For each candidate camera, if the time interval between the appearance time of an initial person detected by the candidate camera and the disappearance time of the target person is not less than the shortest travel time, the candidate camera is used as the second camera, and the spatiotemporal constraint matching probability of the second camera is calculated based on that time interval, the shortest travel time, and the longest travel time.
[0089] The disappearance time refers to the moment of the last frame in which the target person was detected in the field of view of the first camera, and marks the point at which tracking was interrupted. The direction of movement is determined from the trajectory of the target person before disappearing and is used to predict the likely exit location. The topology graph is a pre-constructed description of the monitoring network structure, recording the spatial connectivity, physical distance, and travel time information between cameras. A reachable path is an actual route from the first camera to a downstream camera. The shortest travel time and the longest travel time are the minimum and maximum times required for a pedestrian to travel along the reachable path from the first camera to the downstream camera, respectively, determined from historical statistical data and path length. Candidate cameras are the downstream cameras initially selected based on the direction of movement and the reachable paths, i.e., the cameras that may capture the target person. The appearance time refers to the moment an initial person is first detected in the field of view of a candidate camera. The time interval is the difference between the appearance time of the initial person and the disappearance time of the target person.
[0090] When the first camera detects that a target person has disappeared, it records the timestamp of the last frame in which the target person appears in its field of view as the disappearance time. At the same time, it determines the direction of movement based on the trajectory data of the target person before disappearing. Specifically, by analyzing the movement pattern features, it fits a movement direction vector to determine through which exit area the target left the field of view of the first camera.
[0091] Obtain a pre-constructed camera topology graph, which records the reachable paths between the first camera and each downstream camera, as well as the shortest and longest travel times for each reachable path. The shortest travel time is calculated from the physical length of the path and the upper limit of a pedestrian's normal walking speed, since the fastest plausible walking speed yields the shortest time. The longest travel time is determined from the path length and the upper limit of possible pedestrian behaviors such as stopping and detouring, combined with historical statistics, or it can be determined from the shortest travel time plus a preset buffer time.
[0092] The downstream cameras are initially screened based on the target person's movement direction. When the movement direction points towards an exit area, the downstream cameras most strongly connected to that exit area are selected as candidate cameras, while cameras lying in the opposite direction or unrelated to the movement direction are excluded. The reachable paths of each candidate camera are then checked, and the set of cameras whose paths are consistent with the movement direction is retained.
[0093] For each candidate camera, the spatiotemporal constraint matching probability is calculated as follows. Obtain the disappearance time of the target person at the first camera and the appearance time of the initial person at the candidate camera, and calculate the time interval between the two. Obtain the shortest travel time corresponding to the candidate camera from the topology graph, and set the longest travel time to the shortest travel time plus a 5-minute buffer, to accommodate brief stops or slow movement of the target person along the path. The time interval is then evaluated in three cases. If the time interval is less than the shortest travel time, the initial person is deemed physically unreachable, and the matching probability is set to 0. If the time interval is greater than the longest travel time, the target person may have been delayed, and the matching probability is multiplied by 0.3 to reduce the confidence level. If the time interval lies between the shortest and longest travel times, the matching probability is calculated by linear decay: 1 minus the ratio of (time interval minus shortest travel time) to (longest travel time minus shortest travel time). The closer the time interval is to the shortest travel time, the closer the probability is to 1; the closer it is to the longest travel time, the closer the probability is to 0.
[0094] By recording the disappearance time, determining the movement direction based on identity features, and filtering with the reachable paths and travel time ranges in the topology graph, the scope of subsequent cross-camera matching is narrowed from all downstream cameras to the candidate cameras reachable in the movement direction. Furthermore, a shortest travel time constraint is introduced in the time dimension, retaining only initial persons whose time intervals are physically feasible and eliminating spatiotemporally unreasonable candidates. The spatiotemporal constraint matching probability, calculated from the time interval and the travel time range, provides a quantified spatiotemporal plausibility weight for the subsequent comprehensive matching score. This allows priority to be given to the cameras and persons most plausible in space and time, thereby improving the accuracy and computational efficiency of cross-camera identity association.
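The three cases can be sketched as a single function. The text does not state which value the 0.3 factor is applied to when the time interval exceeds the longest time, so the `late_base` parameter (defaulting to 1.0) is an assumption introduced only to make the sketch runnable; the function and parameter names are likewise hypothetical. Times are in seconds, and the 300-second default corresponds to the 5-minute buffer.

```python
def spatiotemporal_probability(interval_s: float, t_min_s: float,
                               buffer_s: float = 300.0,
                               late_base: float = 1.0,
                               late_factor: float = 0.3) -> float:
    """Spatiotemporal constraint matching probability for one candidate."""
    if buffer_s <= 0:
        raise ValueError("buffer_s must be positive")
    t_max_s = t_min_s + buffer_s  # longest time = shortest time + buffer
    if interval_s < t_min_s:
        return 0.0  # physically unreachable
    if interval_s > t_max_s:
        # ASSUMPTION: the base value scaled by 0.3 is not specified in the text.
        return late_base * late_factor
    # Linear decay from 1 at t_min_s to 0 at t_max_s.
    return 1.0 - (interval_s - t_min_s) / (t_max_s - t_min_s)
```

With a shortest time of 60 seconds, an interval of 210 seconds falls halfway through the window and gives a probability of 0.5.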
[0095] In one embodiment, the step of screening candidates from the initial people detected by the second camera includes:
[0096] Obtain the target height from the target person's body shape features;
[0097] If the height difference between the initial person detected by the second camera and the target height is less than a preset height difference threshold, then the initial person is considered a candidate.
[0098] The target height is the actual physical height value extracted and estimated from the body shape characteristics of the target person. The preset height difference threshold is a pre-set value used to determine whether the heights of two pedestrians are close enough to be considered the same person; it is usually determined based on the error range of height estimation.
[0099] The target height is obtained from the body shape features of the target person. Specifically, when the target person appears in the first camera, the actual physical height is calculated through perspective transformation from the camera's intrinsic and extrinsic parameters, such as installation height and pitch angle, combined with the target person's pixel height in the image, and this value is stored as the target height.
[0100] Once the second camera is identified, all initial persons detected within its field of view are acquired. For each initial person, the same height estimation method is used to calculate the actual height from the intrinsic and extrinsic parameters of the second camera and the person's pixel height in the image. The height of each initial person is compared with the target height, and the absolute difference between the two is calculated. If the difference is less than a preset height difference threshold, the initial person is considered a candidate and retained for the subsequent multidimensional similarity calculations; if the difference is greater than or equal to the preset height difference threshold, the initial person is excluded directly, and the subsequent similarity calculations for appearance, gait, and so on are not performed.
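The screening step can be sketched as a simple filter. The heights are assumed to have already been estimated in metres; the function name, the detection identifiers, and the 0.05 m default threshold are hypothetical, since the application only says the threshold is set according to the error range of height estimation.

```python
from typing import Dict, List


def screen_by_height(target_height_m: float, detections: Dict[str, float],
                     max_diff_m: float = 0.05) -> List[str]:
    """Keep only initial persons whose estimated height is close to the target height."""
    return [
        pid for pid, height_m in detections.items()
        if abs(height_m - target_height_m) < max_diff_m  # strict "less than", as in the text
    ]
```

Only the identifiers returned by this filter proceed to the appearance, gait, body shape, and motion similarity calculations.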
[0101] By using height as a stable feature that is not easily affected by clothing for initial screening, a large number of initial individuals with significant height differences can be quickly eliminated before entering the complex multi-dimensional similarity calculation, effectively reducing the size of the candidate set. Height, as a hard constraint, has low computational cost and high execution efficiency in its screening process, significantly reducing the computational overhead of subsequent multi-dimensional feature similarity calculations such as appearance, gait, and body shape. Furthermore, due to the short-term stability of height, this screening method can improve the overall efficiency of cross-camera identity association without sacrificing accuracy.
[0102] In one embodiment, appearance similarity includes a global similarity and multiple local similarities. The step of identifying the clothing change result based on the corresponding appearance similarity includes:
[0103] If the global similarity of the candidate is less than the global similarity threshold, at least two of its local similarities are greater than the first local similarity threshold, and at least one of its local similarities is less than the second local similarity threshold, the candidate is determined to have changed clothes. The first local similarity threshold is greater than the second local similarity threshold.
[0104] Global similarity is the degree of matching between the candidate and the target in overall appearance features, calculated as the cosine similarity of the 512-dimensional global feature vectors extracted by ResNet50. Local similarity is the degree of matching between the candidate and the target in semantic body parts, covering six local regions: head, upper body, lower body, shoes, backpack or handbag, and held items. For each region, cosine similarity is calculated from a 128-dimensional local feature vector. The global similarity threshold is a preset value used to determine whether the overall appearance has changed significantly; a global similarity below this threshold indicates a significant difference between the target's overall appearance and the historical record. The first local similarity threshold is a higher preset value used to determine whether a local feature remains highly stable; a local similarity above this threshold indicates no significant change in that local region. The second local similarity threshold is a lower preset value used to determine whether a local feature has changed significantly; a local similarity below this threshold indicates that the local region may have been replaced or occluded. The first local similarity threshold is greater than the second local similarity threshold.
[0105] For each candidate, a pre-calculated global similarity and six local similarities (head, upper body, lower body, shoes, backpack, and held item) are obtained. First, the global similarity is compared with a preset global similarity threshold, typically set to 0.6, to determine if the overall appearance of the target has changed significantly.
[0106] If the global similarity is below the global similarity threshold, the distribution of the local similarities is analyzed further. The six local similarities are traversed, and two counts are taken: the number of local similarities above the first local similarity threshold and the number below the second local similarity threshold. The first local similarity threshold is set to 0.85; a local region above it is highly consistent with the historical record and represents a stable feature not easily affected by a clothing change. The second local similarity threshold is set to 0.5; a local region below it differs significantly from the historical record and has likely been replaced or occluded.
[0107] A candidate is deemed to have changed clothes when all three of the following conditions are met: the global similarity is below 0.6, indicating a significant change in overall appearance; at least two local similarities are above 0.85, indicating that multiple stable local features, such as shoes, backpack, or lower body, remain unchanged; and at least one local similarity is below 0.5, indicating that at least one variable local feature, such as the upper body or trousers, has changed significantly. If the three conditions are not all met, it is determined that no clothing change has occurred or that the type of clothing change is unclear.
[0108] It can be understood that when a target changes clothes, the overall appearance changes significantly, while parts that are not easily changed, such as shoes and backpacks, remain highly similar, and the parts that were changed show significant differences. Based on this characteristic, joint verification of multiple local features makes it possible to distinguish an appearance change caused by changing clothes from a completely different passerby.
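The three-condition rule can be sketched as follows, using the threshold values 0.6, 0.85, and 0.5 given above. The function name and the region keys are hypothetical labels for the six local regions.

```python
from typing import Dict


def clothing_changed(global_sim: float, local_sims: Dict[str, float],
                     g_thr: float = 0.6, hi_thr: float = 0.85,
                     lo_thr: float = 0.5) -> bool:
    """Apply the global/local rule: low global, >=2 stable parts, >=1 changed part."""
    stable = sum(1 for s in local_sims.values() if s > hi_thr)
    changed = sum(1 for s in local_sims.values() if s < lo_thr)
    return global_sim < g_thr and stable >= 2 and changed >= 1
```

A candidate with low similarity in every region fails the "at least two stable parts" condition and is therefore treated as a different person rather than as the target in new clothes.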
[0109] By combining global similarity with multiple local similarities for joint judgment, it is possible to accurately identify whether a target has changed their clothing. When the global similarity is low but multiple stable local features still maintain high similarity, it can effectively distinguish between an overall appearance change caused by clothing change and a completely different passerby, avoiding the incorrect exclusion of the target due to overall appearance change. At the same time, by setting at least one local feature below a low threshold, it ensures that the judgment is based on real changes in body parts rather than accidental local differences. This clothing change recognition mechanism provides a reliable basis for subsequent dynamic adjustment of the weights of each feature dimension, thereby automatically shifting the matching focus to stable features such as gait and body shape that are not affected by clothing in clothing change scenarios, thus effectively reducing the risk of mismatches caused by clothing change.
[0110] In one embodiment, the step of adjusting the weights for the corresponding appearance similarity, gait similarity, body shape similarity, and movement similarity based on the clothing change result includes:
[0111] If it is determined that the candidate has changed clothes, the initial weight of appearance similarity is reduced, while the initial weights of gait similarity, body shape similarity, and movement similarity are increased. The initial weight of appearance similarity is obtained by adjusting the base weight of the appearance features according to the lighting quality, and the initial weight of gait similarity is obtained by adjusting the base weight of the gait features according to the distance between the target person and the first camera.
[0112] Lighting quality is a parameter used to quantify the lighting conditions of the current monitoring scene, reflecting the influence of factors such as light brightness, uniformity, and contrast on appearance feature extraction. Higher lighting quality results in higher reliability of appearance features. The base weight of appearance features is a preset baseline weight value for appearance similarity without considering environmental factors. The initial weight of appearance similarity is the actual weight obtained after adjusting the base weight based on lighting quality. Higher lighting quality results in an initial weight closer to the base weight, while lower lighting quality results in a correspondingly lower initial weight. Distance refers to the physical straight-line distance or longitudinal distance along the field of view between the target person and the first camera. The closer the distance, the higher the resolution of the target in the image, and the clearer the gait details. The base weight of gait features is a preset baseline weight value for gait similarity without considering distance factors. The initial weight of gait similarity is the actual weight obtained after adjusting the base weight based on distance. Closer distances result in an initial weight closer to the base weight, while farther distances result in a correspondingly lower initial weight.
[0113] Before starting the matching calculation, the initial weights for each similarity metric are determined based on the current environmental conditions and the target's state. For appearance similarity, the lighting quality of the first camera when the target appears is evaluated. The lighting quality is categorized into three levels: excellent, good, and poor, by analyzing the average image brightness, contrast, and uniformity of illumination. When the lighting quality is excellent, the base weight of the appearance features remains unchanged as the initial weight; when the lighting quality is good, the base weight of the appearance features is multiplied by 0.8 as the initial weight; and when the lighting quality is poor, the base weight of the appearance features is multiplied by 0.6 as the initial weight. For gait similarity, the distance between the target and the camera in the first camera view is obtained. When the distance is less than 5 meters, the base weight of the gait features remains unchanged as the initial weight; when the distance is between 5 and 10 meters, the base weight of the gait features is multiplied by 0.9 as the initial weight; and when the distance is greater than 10 meters, the base weight of the gait features is multiplied by 0.7 as the initial weight. The initial weights for body shape similarity and motion similarity remain unchanged and are unaffected by lighting and distance.
[0114] Once it's determined that a candidate has changed clothes, the initial weights are further adjusted. For appearance similarity, its initial weight is reduced by multiplying it by 0.4 before normalization with other similarity weights, significantly decreasing the contribution of appearance similarity in the clothing-changing scenario. For gait similarity, its initial weight is increased by multiplying it by 1.3 before normalization. For body shape similarity and movement similarity, their initial weights are increased by multiplying them by 1.2 and 1.1 respectively before normalization. All adjusted weights are normalized to ensure a sum of 1, which is used as the final comprehensive matching weight.
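The two-layer adjustment can be sketched as follows, using the multipliers stated in the two preceding paragraphs. The base weights passed in are hypothetical, since the application does not give numeric base weights, and normalizing in the no-change case is an assumption made so that both branches return weights that sum to 1.

```python
from typing import Dict

LIGHTING_FACTOR = {"excellent": 1.0, "good": 0.8, "poor": 0.6}
CHANGE_FACTOR = {"appearance": 0.4, "gait": 1.3, "body_shape": 1.2, "motion": 1.1}


def initial_weights(base: Dict[str, float], lighting: str, distance_m: float) -> Dict[str, float]:
    """Scale appearance by lighting quality and gait by camera distance."""
    if distance_m < 5:
        gait_factor = 1.0
    elif distance_m <= 10:
        gait_factor = 0.9
    else:
        gait_factor = 0.7
    w = dict(base)
    w["appearance"] *= LIGHTING_FACTOR[lighting]
    w["gait"] *= gait_factor
    return w  # body_shape and motion are unaffected by lighting and distance


def final_weights(initial: Dict[str, float], changed: bool) -> Dict[str, float]:
    """Apply the clothing change multipliers (if any), then normalize to sum to 1."""
    scaled = {k: v * (CHANGE_FACTOR[k] if changed else 1.0) for k, v in initial.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}
```

With assumed base weights of 0.4, 0.25, 0.2, and 0.15 under excellent lighting at close range, a detected clothing change scales them to 0.16, 0.325, 0.24, and 0.165, so the normalized appearance weight drops to roughly 0.18.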
[0115] By dynamically adjusting the initial weight of appearance similarity based on lighting quality, the system proactively reduces reliance on appearance features in poor lighting conditions, preventing feature distortion caused by lighting from affecting matching accuracy. Similarly, by dynamically adjusting the initial weight of gait similarity based on distance, the weight of gait features is appropriately reduced when the target is far away and gait details are blurred, while maintaining a high weight when the target is close and gait features are clear, ensuring that the use of gait features matches their actual reliability. Furthermore, when a change of clothing is detected, the weight of appearance similarity is further reduced while the weights of gait, body shape, and motion similarity are increased. This achieves a two-layer dynamic weight adjustment mechanism combining environmental and scene adaptation, making the weight allocation of each feature dimension more closely aligned with feature reliability under actual working conditions. This significantly improves the robustness and accuracy of cross-camera identity association in complex real-world scenarios.
[0116] In one embodiment, the step of confirming the tracking results of the target person based on each matching score includes:
[0117] If there is a matching score that exceeds the preset matching threshold, the matching score range is determined. The upper limit of the matching score range is the maximum matching score, and the lower limit is the difference between the maximum matching score and the preset difference.
[0118] The candidate whose matching score falls within the matching score range is taken as the person to be verified. If there are more than two people to be verified, the matching score of the person to be verified is continuously calculated. When there is only one person to be verified, it is confirmed that the target person has been tracked.
[0119] The preset matching threshold is a pre-defined value used to determine whether the matching score between a candidate and the target reaches a credible level. Only when the matching score exceeds this threshold is the candidate considered qualified to become a tracking target. The matching score range is an interval whose upper end is the highest matching score. Its upper limit is the maximum value among all matching scores, and its lower limit is the maximum value minus a preset difference; the range is used to select the candidates whose scores are close to the highest score. Persons to be verified are the candidates whose matching scores fall within the matching score range. Their scores are very close to one another, so further differentiation is required. Continuous calculation refers to iteratively updating the matching scores of the persons to be verified: the multidimensional similarity is recalculated using image information from subsequent frames, and the score gap gradually widens as new data accumulates.
[0120] After calculating the matching scores for all candidates, candidates whose matching scores exceed a preset matching threshold are first selected. If at least one candidate exceeds this threshold, a matching score range is determined. The highest matching score is used as the upper limit of the range, and the result obtained by subtracting a preset difference from this maximum value is used as the lower limit of the range. The preset difference is usually set to 0.1 to control the leniency of the candidate selection.
[0121] All candidates whose matching scores fall within this range are identified as individuals to be verified. If there is only one individual to be verified, that candidate is directly confirmed as the target individual. If there are two or more individuals to be verified, it indicates that there are multiple highly similar candidates that are difficult to distinguish, and a continuous calculation mechanism is initiated. Image information of these individuals to be verified is continuously acquired from subsequent video frames of the second camera. Gait features, motion pattern features, and other features that become more stable over time are re-extracted, and the matching scores of each individual to be verified are dynamically updated. As the number of frames increases and the trajectory lengthens, the differences between different candidates in gait details, movement habits, etc., gradually become apparent, and the gap between matching scores gradually widens. When the continuous calculation reaches a certain point and the number of individuals to be verified decreases to one, this unique candidate is confirmed as the target individual, and the identity association is completed.
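The selection and progressive confirmation described in paragraphs [0120] and [0121] can be sketched as follows. This is an illustrative sketch only: the function names, the use of one score dictionary per update round, and the inclusive lower bound of the range are assumptions, and the default preset difference of 0.1 is the typical value given in the text.

```python
def select_to_verify(scores, match_threshold, preset_difference=0.1):
    """Return the ids of candidates whose score lies in
    [max_score - preset_difference, max_score].
    Returns an empty list if no score exceeds match_threshold."""
    if not scores:
        return []
    best = max(scores.values())
    if best <= match_threshold:
        return []
    lower = best - preset_difference
    return sorted(cid for cid, s in scores.items() if s >= lower)


def confirm_target(score_rounds, match_threshold, preset_difference=0.1):
    """Consume successive score dictionaries (one per update round) until
    exactly one person to be verified remains. Returns that id, or None if
    no candidate qualifies or the data runs out while still ambiguous."""
    to_verify = None
    for scores in score_rounds:
        if to_verify is not None:
            # Only persons already under verification are re-scored.
            scores = {cid: s for cid, s in scores.items() if cid in to_verify}
        to_verify = select_to_verify(scores, match_threshold, preset_difference)
        if len(to_verify) == 1:
            return to_verify[0]
        if not to_verify:
            return None
    return None


# Hypothetical scores: A and B are close in round 1 and separate in round 2.
rounds = [
    {"A": 0.85, "B": 0.80, "C": 0.60},
    {"A": 0.90, "B": 0.72, "C": 0.95},
]
result = confirm_target(rounds, match_threshold=0.7)
```

In this example, candidate C falls outside the range in the first round and is therefore not reconsidered in the second round, even though its later score is high; whether a dropped candidate may re-enter is not stated in the text and is treated here as an assumption.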
[0122] By setting a matching score range and continuously verifying candidates with scores close to the highest score, incorrect matching decisions are avoided when multiple highly similar candidates are directly matched. When multiple candidates have similar scores, instead of hastily determining the result, the matching score is dynamically updated using continuously collected data. This allows dimensions such as gait features and motion pattern features, which require time to accumulate and fully manifest, to gradually take effect, thereby differentiating candidates. This progressive confirmation mechanism effectively reduces the risk of erroneous associations due to insufficient information in a single match, significantly improving the accuracy and reliability of cross-camera identity association.
[0123] The cross-camera target person tracking device provided in the embodiments of this application is described below. The device described below and the cross-camera target person tracking method described above may be referred to in correspondence with each other. As shown in Figure 2, this application provides a cross-camera target person tracking device, the device comprising:
[0124] The identity feature extraction module 201 is used to extract the identity features of the target person when the first camera detects the appearance of the target person. The identity features include appearance features, gait features, body shape features and movement pattern features.
[0125] The spatiotemporal constraint matching probability determination module 202 is used to predict the second camera among the downstream cameras of the first camera when the first camera detects the disappearance of the target person, and determine the spatiotemporal constraint matching probability.
[0126] The similarity calculation module 203 is used to filter candidates from the initial people detected by the second camera, and calculate the appearance similarity, gait similarity, body shape similarity and motion similarity between each candidate and the target person based on identity features.
[0127] The tracking result confirmation module 204 is used to identify, for each candidate, the costume change result based on the appearance similarity of the candidate, and to adjust the weights of the corresponding appearance similarity, gait similarity, body shape similarity and motion similarity based on the costume change result. The matching score is calculated by weighting the similarities and using the spatiotemporal constraint matching probability as a penalty coefficient, so that the tracking result of the target person is confirmed based on each matching score.
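One possible reading of the score calculation performed by module 204 is sketched below. It assumes that the "penalty coefficient" multiplies the weighted sum of the four similarities; the text does not state the exact formula, so this form, the function name and the example values are assumptions.

```python
def matching_score(similarities, weights, st_probability):
    """Weighted sum of the per-dimension similarities, scaled by the
    spatiotemporal constraint matching probability (assumed multiplicative)."""
    if set(similarities) != set(weights):
        raise ValueError("similarities and weights must cover the same dimensions")
    if not 0.0 <= st_probability <= 1.0:
        raise ValueError("st_probability must lie in [0, 1]")
    weighted = sum(weights[k] * similarities[k] for k in weights)
    return st_probability * weighted


# Hypothetical inputs: equal weights and a spatiotemporal probability of 0.8.
sims = {"appearance": 0.5, "gait": 0.8, "body_shape": 0.9, "motion": 0.7}
wts = {"appearance": 0.25, "gait": 0.25, "body_shape": 0.25, "motion": 0.25}
score = matching_score(sims, wts, st_probability=0.8)
```

Under this reading, a candidate that is visually similar but implausible in time and place is scored lower than the weighted similarity alone would suggest.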
[0128] In one embodiment, the identity feature extraction module 201 includes:
[0129] The historical record matching unit is used to determine whether the target person is appearing for the first time by matching historical records based on appearance features, gait features, and body shape features.
[0130] The first motion pattern feature extraction unit is used to extract the instantaneous motion speed of the target person as the motion pattern feature if the target person is appearing for the first time.
[0131] The second motion pattern feature extraction unit is used, if the target person is not appearing for the first time, to obtain the target person's turning preferences and dwelling patterns from the historical records based on the matching results, and to use the instantaneous motion speed, turning preferences, and dwelling patterns as the motion pattern features.
[0132] In one embodiment, the spatiotemporal constraint matching probability determination module 202 includes:
[0133] The motion direction determination unit is used to record the disappearance time of the target person when the first camera detects that the target person has disappeared, and to determine the motion direction of the target person based on the identity features.
[0134] The topology graph acquisition unit is used to acquire the topology graph of the first camera. The topology graph includes the reachable paths, shortest traversal time, and longest traversal time between the first camera and each downstream camera.
[0135] The candidate camera selection unit is used to select candidate cameras from among the downstream cameras based on each reachable path and direction of motion;
[0136] The spatiotemporal constraint matching probability calculation unit is used, for each candidate camera, to take the candidate camera as the second camera if the time interval between the appearance time of an initial person detected by the candidate camera and the disappearance time is not less than the shortest traversal time, and to calculate the spatiotemporal constraint matching probability of the second camera according to the corresponding time interval, the shortest traversal time and the longest traversal time.
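The gating step performed by this unit can be sketched as follows. The rule that an interval shorter than the shortest traversal time rules out the camera comes from the text. The probability function itself is not given in this section, so the form used here (1.0 inside the traversal window, exponential decay beyond the longest traversal time, with a hypothetical `decay` parameter) is purely an assumption for illustration.

```python
import math


def spatiotemporal_probability(interval, t_min, t_max, decay=0.1):
    """Return None if the candidate camera is ruled out, otherwise a
    probability in (0, 1]. Times are in seconds.

    interval: appearance time at the candidate camera minus disappearance time
    t_min, t_max: shortest and longest traversal times from the topology graph
    decay: hypothetical decay rate for intervals beyond t_max (assumption)
    """
    if t_min > t_max:
        raise ValueError("t_min must not exceed t_max")
    if interval < t_min:
        return None  # arrival faster than the shortest traversal time
    if interval <= t_max:
        return 1.0  # assumed: full plausibility inside the window
    return math.exp(-decay * (interval - t_max))  # assumed decay beyond t_max


too_fast = spatiotemporal_probability(3.0, t_min=5.0, t_max=20.0)
in_window = spatiotemporal_probability(10.0, t_min=5.0, t_max=20.0)
late = spatiotemporal_probability(30.0, t_min=5.0, t_max=20.0)
```

Any monotone function of the interval that penalizes late arrivals could replace the assumed decay without changing the gating logic.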
[0137] In one embodiment, the similarity calculation module 203 includes:
[0138] The target height acquisition unit is used to obtain the target height from the body shape features of the target person.
[0139] The candidate determination unit is used to identify the initial person as a candidate if the height difference between the initial person's height detected by the second camera and the target height is less than a preset height difference threshold.
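The height-based screening performed by these two units can be sketched as below. The record layout, the centimetre unit and the 5 cm threshold are hypothetical; the text only specifies that the height difference must be less than a preset threshold.

```python
def filter_candidates(initial_people, target_height, max_height_diff):
    """Keep the initial people whose height differs from the target height
    by less than the preset height difference threshold."""
    return [
        person
        for person in initial_people
        if abs(person["height"] - target_height) < max_height_diff
    ]


# Hypothetical detections from the second camera (heights in cm).
people = [
    {"id": "p1", "height": 172.0},
    {"id": "p2", "height": 181.0},
    {"id": "p3", "height": 168.5},
]
candidates = filter_candidates(people, target_height=170.0, max_height_diff=5.0)
candidate_ids = [p["id"] for p in candidates]
```

Screening on height first reduces the number of people for whom the more expensive multidimensional similarities have to be computed.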
[0140] In one embodiment, the tracking result confirmation module 204 includes:
[0141] The costume change result determination unit is used to determine that the candidate character has changed costumes when the global similarity corresponding to the candidate character is less than the global similarity threshold, at least two corresponding local similarities are greater than the first local similarity threshold, and at least one corresponding local similarity is less than the second local similarity threshold. The first local similarity threshold is greater than the second local similarity threshold.
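The three-part condition applied by this unit can be expressed as a short predicate. The logic follows the text; the function name, parameter names and the example threshold values are assumptions.

```python
def has_changed_clothes(global_sim, local_sims, global_thr, local_hi_thr, local_lo_thr):
    """Return True if the candidate is judged to have changed clothes:
    low global similarity, at least two local similarities above the first
    (higher) threshold, and at least one below the second (lower) threshold."""
    if local_hi_thr <= local_lo_thr:
        raise ValueError("first local threshold must exceed the second")
    high_count = sum(1 for s in local_sims if s > local_hi_thr)
    low_count = sum(1 for s in local_sims if s < local_lo_thr)
    return global_sim < global_thr and high_count >= 2 and low_count >= 1


# Hypothetical values: two local regions still match well, one no longer does.
changed = has_changed_clothes(
    global_sim=0.45,
    local_sims=[0.90, 0.85, 0.20],
    global_thr=0.6,
    local_hi_thr=0.8,
    local_lo_thr=0.4,
)
```

The combination distinguishes a partial change of clothing, where some regions remain highly similar, from a different person, for whom most local similarities would be low.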
[0142] In one embodiment, the tracking result confirmation module 204 includes:
[0143] The weight adjustment unit is used to reduce the initial weight of appearance similarity and increase the initial weights of gait similarity, body shape similarity and motion similarity if it is determined that the candidate has changed clothes. The initial weight of appearance similarity is obtained by adjusting the basic weight of appearance features according to the lighting quality, and the initial weight of gait similarity is obtained by adjusting the basic weight of gait features according to the distance between the target person and the first camera.
[0144] In one embodiment, the tracking result confirmation module 204 includes:
[0145] The matching score range determination unit is used to determine the matching score range if there is a matching score that exceeds the preset matching threshold. The upper limit of the matching score range is the maximum matching score, and the lower limit is the difference between the maximum matching score and the preset difference.
[0146] The tracking result confirmation unit is used to take the candidate person corresponding to the matching score within the matching score range as the person to be verified. If there are no less than two people to be verified, the matching score of the person to be verified is continuously calculated, and when there is only one person to be verified, it is confirmed that the target person has been tracked.
[0147] In one embodiment, this application also provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the cross-camera target person tracking method as described in any of the above embodiments.
[0148] In one embodiment, this application also provides a computer device storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the cross-camera target person tracking method as described in any of the above embodiments.
[0149] By way of example, Figure 3 is a schematic diagram of the internal structure of a computer device 300 provided in an embodiment of this application. The computer device 300 can be provided as a server. Referring to Figure 3, the computer device 300 includes a processing component 302, which further includes one or more processors, and memory resources represented by memory 301 for storing instructions, such as applications, that can be executed by the processing component 302. The applications stored in memory 301 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 302 is configured to execute instructions to perform the cross-camera target person tracking method of any of the above embodiments.
[0150] The computer device 300 may also include a power supply component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input/output (I/O) interface 305. The computer device 300 may operate on an operating system stored in memory 301, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
[0151] Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0152] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element. In this document, "a," "an," "the," and "its" may also include plural forms unless the context clearly indicates otherwise. "Multiple" refers to at least two, such as 2, 3, 5, or 8. "And/or" includes any and all combinations of the related listed items.
[0153] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.
[0154] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method for tracking a target person across multiple cameras, characterized in that, The method includes: When the first camera detects the appearance of a target person, it extracts the identity features of the target person, which include appearance features, gait features, body shape features, and movement pattern features. When the first camera detects that the target person has disappeared, the second camera is predicted among the downstream cameras of the first camera, and the spatiotemporal constraint matching probability is determined. Among the initial individuals detected by the second camera, candidate individuals are selected, and based on the identity features, the similarity in appearance, gait, body shape, and movement between each candidate individual and the target individual is calculated. For each candidate, the costume change result is identified based on the appearance similarity of the candidate, and the weights of the corresponding appearance similarity, gait similarity, body shape similarity and motion similarity are adjusted based on the costume change result. The matching score is calculated by weighting and using the spatiotemporal constraint matching probability as a penalty coefficient, so as to confirm the tracking result of the target person based on each matching score.
2. The cross-camera target person tracking method according to claim 1, characterized in that, The process of extracting the motion pattern features includes: Based on the appearance features, gait features, and body shape features, historical record matching is used to determine whether the target person is appearing for the first time. If so, the instantaneous movement speed of the target person is extracted as the movement pattern feature; Otherwise, based on the matching results, the target person's turning preferences and dwelling patterns are obtained from the historical records, and the instantaneous movement speed, turning preferences, and dwelling patterns are used as the movement pattern features.
3. The cross-camera target person tracking method according to claim 1, characterized in that the step of predicting the second camera among the downstream cameras of the first camera and determining the spatiotemporal constraint matching probability when the first camera detects the disappearance of the target person includes: when the first camera detects that the target person has disappeared, recording the disappearance time of the target person and determining the direction of movement of the target person based on the identity features; obtaining the topology graph of the first camera, the topology graph including the reachable path, shortest transit time and longest transit time between the first camera and each of the downstream cameras; selecting candidate cameras from the downstream cameras based on each reachable path and the direction of movement; and, for each candidate camera, if the time interval between the appearance time of an initial person detected by the candidate camera and the disappearance time is not less than the shortest transit time, taking the candidate camera as the second camera and calculating the spatiotemporal constraint matching probability of the second camera based on the corresponding time interval, the shortest transit time and the longest transit time.
4. The cross-camera target person tracking method according to claim 1, characterized in that the step of screening candidates from the initial people detected by the second camera includes: obtaining the target height from the body shape features of the target person; and, if the difference between the height of an initial person detected by the second camera and the target height is less than a preset height difference threshold, taking the initial person as a candidate.
5. The cross-camera target person tracking method according to claim 1, characterized in that, The appearance similarity includes global similarity and multiple local similarities. The step of identifying the costume change result based on the corresponding appearance similarity includes: If the global similarity of the candidate character is less than the global similarity threshold, at least two corresponding local similarities are greater than the first local similarity threshold, and at least one corresponding local similarity is less than the second local similarity threshold, then the candidate character is determined to have changed clothes, and the first local similarity threshold is greater than the second local similarity threshold.
6. The cross-camera target person tracking method according to claim 1, characterized in that, The step of adjusting the weights for appearance similarity, gait similarity, body shape similarity, and movement similarity based on the change of clothes includes: If it is determined that the candidate has changed clothes, the initial weight of the appearance similarity is reduced, and the initial weights of the gait similarity, body shape similarity, and movement similarity are increased. The initial weight of the appearance similarity is obtained by adjusting the basic weight of the appearance features based on the lighting quality, and the initial weight of the gait similarity is obtained by adjusting the basic weight of the gait features based on the distance between the target person and the first camera.
7. The cross-camera target person tracking method according to claim 1, characterized in that, The step of confirming the tracking result of the target person based on each of the matching scores includes: If there is a matching score that exceeds the preset matching threshold, then the matching score range is determined. The upper limit of the matching score range is the maximum matching score, and the lower limit is the difference between the maximum matching score and the preset difference. The candidate whose matching score falls within the specified matching score range is taken as the person to be verified. If there are at least two people to be verified, the matching score of the person to be verified is continuously calculated. When there is only one person to be verified, it is confirmed that the target person has been tracked.
8. A cross-camera target person tracking device, characterized in that, The device includes: The identity feature extraction module is used to extract the identity features of the target person when the first camera detects the appearance of the target person. The identity features include appearance features, gait features, body shape features and movement pattern features. The spatiotemporal constraint matching probability determination module is used to predict the second camera among each downstream camera of the first camera and determine the spatiotemporal constraint matching probability when the first camera detects the disappearance of the target person. The similarity calculation module is used to filter candidates from the initial people detected by the second camera, and calculate the appearance similarity, gait similarity, body shape similarity and movement similarity between each candidate and the target person based on the identity features. The tracking result confirmation module is used to identify the costume change result for each candidate based on the appearance similarity of the candidate, and adjust the weights of the corresponding appearance similarity, gait similarity, body shape similarity and movement similarity based on the costume change result. By weighting and using the spatiotemporal constraint matching probability as a penalty coefficient, a matching score is calculated to confirm the tracking result of the target person based on each matching score.
9. A storage medium, characterized in that: The storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the cross-camera target person tracking method as described in any one of claims 1 to 7.
10. A computer device, characterized in that it comprises: one or more processors and a memory; the memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the cross-camera target person tracking method as described in any one of claims 1 to 7.