Deep learning-based pedestrian intent recognition control system for autonomous vehicles
The deep learning-based pedestrian intent recognition control system for autonomous vehicles solves the problems of identity jumps and unstable intent prediction caused by multi-camera perspective switching and occlusion. It achieves cross-camera identity consistency mapping and trajectory stability, thereby improving the control stability and safety of the autonomous driving system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- TIANJIN UNIV OF TECH & EDUCATION (TEACHER DEV CENT OF CHINA VOCATIONAL TRAINING & GUIDANCE)
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-26
AI Technical Summary
In autonomous driving systems, the switching and occlusion of multiple camera perspectives can lead to changes in pedestrian identity and unstable intention prediction, affecting the continuity and accuracy of control strategies.
A deep learning-based pedestrian intent recognition control system for autonomous vehicles is adopted. The system acquires image sequences through a multi-camera perception unit, extracts appearance feature vectors through a pedestrian detection feature extraction unit, performs cross-camera identity consistency mapping through a collaborative embedding learning unit, performs identity association and trajectory maintenance through a pedestrian tracking and identity management unit, performs temporal modeling through a pedestrian intent prediction unit, and generates vehicle control commands through a control decision unit.
The system improves the continuity of cross-camera identity association and the stability of trajectory maintenance, better expresses the temporal behavior patterns that indicate a pedestrian's intention to cross the street, reduces mismatches and drift in complex scenarios, and improves the stability and safety of the control output.
Figure CN122275868A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent connected vehicles and autonomous driving technology, and in particular to a pedestrian intention recognition control system for autonomous vehicles based on deep learning.
Background Technology
[0002] As autonomous driving systems evolve to higher levels, the need for vehicle environment understanding has expanded from static target detection to the prediction of the behavior and future intentions of dynamic traffic participants. Pedestrian crossing intention prediction can be summarized as: after detecting a pedestrian, determining whether they will enter the lane area within the next few seconds, thereby providing a basis for vehicle planning and control to reduce collision risk and unnecessary sudden deceleration.
[0003] In urban road scenarios (such as unsignalized intersections, unprotected left / right turns, etc.), a single vehicle camera is prone to blind spots or short-term occlusion due to its installation location and field of view. To expand the perception range, engineering practices and publicly available datasets generally adopt multiple cameras to cover the vehicle's circumferential field of view. Some research has also further explored the introduction and fusion of multi-camera inputs in intent prediction tasks to enhance context awareness. However, when pedestrians move between different camera fields of view, especially when passing through areas with little overlap between cameras or when they are obscured by vehicles or other road users, the system needs to complete target identity association and continuous tracking across cameras. Existing cross-camera association usually combines appearance feature similarity with geometric / spatiotemporal constraints for matching, but under conditions such as similar pedestrian appearance, significant differences in viewing angle, sudden changes in illumination, reduced imaging resolution, or severe occlusion, identity switching or trajectory breakage may still occur.
[0004] In recent years, methods have been developed to enhance the robustness of cross-camera matching through deep learning re-identification or joint detection and re-identification. However, in many systems, the learning of association features is still largely independent of the downstream intention prediction and control strategies. As a result, the learned representations are biased towards appearance discrimination and make insufficient use of the temporal behavioral cues that are highly related to the intention to cross the street (such as starting to move, acceleration, orientation change, and change of position relative to the curb / lane). Consequently, in complex interaction scenarios, prediction fluctuations caused by unstable association remain a problem.
Summary of the Invention
[0005] In view of the aforementioned existing problems, the present invention is proposed.
[0006] This invention provides a deep learning-based pedestrian intent recognition control system for autonomous vehicles, to solve the problem that camera switching and occlusion in a multi-camera setup cause pedestrian identity changes and unstable intent prediction, which in turn affect vehicle control.
[0007] To solve the above-mentioned technical problems, the present invention provides the following technical solution: a deep learning-based pedestrian intent recognition control system for autonomous vehicles, comprising: a multi-camera perception unit, used to simultaneously acquire image sequences of the vehicle's surrounding environment; a pedestrian detection feature extraction unit, used to detect pedestrian targets in the image sequences and output the corresponding appearance feature vectors; a collaborative embedding learning unit, which is a deep neural network used to perform shared space mapping on appearance feature vectors from different cameras and output a sequence of pedestrian embedding vectors with cross-camera identity consistency; a pedestrian tracking and identity management unit, used to perform cross-camera identity association and maintain trajectories based on the pedestrian embedding vector sequence; a pedestrian intent prediction unit, used to perform time-series modeling on the pedestrian embedding vector sequence corresponding to a trajectory and output the street crossing intent result; and a control decision unit, used to generate control commands for vehicle deceleration, braking, or maintaining driving based on the street crossing intent result and output them to the vehicle controller.
[0008] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the multi-camera perception unit includes at least forward, backward, left and right cameras, and writes a timestamp for each frame of image based on a unified time reference to achieve synchronization alignment.
[0009] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the pedestrian detection feature extraction unit uses a shared backbone network to extract features from images from each camera, and outputs a bounding box, confidence score, and appearance feature vector for each pedestrian target.
[0010] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the collaborative embedding learning unit is trained end-to-end with a composite training objective during the training phase. The composite training objective includes at least: an identity consistency training term for enhancing cross-camera identity consistency, an intention prediction training term for enhancing pedestrian crossing intention discrimination, and a temporal smoothing training term for constraining embedding changes at continuous times. The collaborative embedding learning unit weights and fuses identity consistency training terms, intent prediction training terms, and temporal smoothing training terms during the training phase to form a composite training objective, and superimposes the gradients of the three types of training terms at the shared network parameters and feeds them back to the shared parameters of the collaborative embedding learning unit; the weighting coefficients are set as fixed coefficients during the system configuration phase, or are obtained as learnable weights and normalized during the training phase.
[0011] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the identity consistency training term is based on metric learning of sample pairs of the same pedestrian and different pedestrians, so that the embedding distance of the same pedestrian tends to be smaller and the embedding distance of different pedestrians tends to be larger.
[0012] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the collaborative embedding learning unit includes a feature projection subnetwork, a feature fusion subnetwork, and a consistency coding subnetwork. The feature fusion subnetwork performs weighted fusion of features from different camera sources based on attention weights, and then the consistency coding subnetwork outputs the pedestrian embedding vector.
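The attention-weighted fusion in the feature fusion subnetwork can be illustrated with a minimal sketch. The text does not say how the attention weights are computed; the dot product against a `query` vector below is an assumption made only for illustration, and `attention_fuse` is a hypothetical name.

```python
import math

def attention_fuse(features, query):
    """Fuse per-camera feature vectors with softmax attention weights.

    features: list of equal-length vectors, one per camera source.
    query:    scoring vector (hypothetical; stands in for learned parameters).
    """
    # Score each camera feature against the query (assumed scoring rule).
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Numerically stable softmax over the camera sources.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the camera features, dimension by dimension.
    dim = len(features[0])
    fused = [sum(w * feat[i] for w, feat in zip(weights, features))
             for i in range(dim)]
    return fused, weights
```

With a zero query every camera receives the same weight, so the fused vector is the plain average of the inputs.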
[0013] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the pedestrian tracking and identity management unit maintains a trajectory database. When a pedestrian switches between cameras or reappears after occlusion, the unit determines the association between the currently detected pedestrian embedding vector and the embedding vectors stored in the trajectory database. It forms an appearance cost from the appearance similarity between the stored pedestrian embedding vectors and the currently detected pedestrian embedding vectors. During the association determination, the unit screens candidate trajectories with two gates: a time gate on the trajectory disconnection duration, and a spatial consistency gate evaluated in the vehicle coordinate system using the camera calibration parameters. For candidates that pass the gates, the appearance cost, spatial cost, and time cost are fused to construct the association cost matrix, which is solved for the optimal match between trajectories and detections. After a successful match, the pedestrian embedding vector stored for the matched trajectory is updated using a sliding update method, and the update magnitude is adjusted according to the appearance similarity and the disconnection duration.
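The gating, cost fusion, and matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the gate limits (`max_gap_s`, `max_dist_m`), the fusion weights, the `unmatched_cost`, and the direction in which `update_embedding` scales its step are all hypothetical, and an exhaustive search stands in for a proper assignment solver so that the example needs only the standard library.

```python
import math
from itertools import permutations

def fused_cost(track, det, max_gap_s=1.0, max_dist_m=2.0,
               w_app=0.6, w_space=0.3, w_time=0.1):
    """Return the fused association cost, or None if a gate rejects the pair."""
    gap = det["t"] - track["t"]                  # disconnection duration
    dist = math.dist(track["xy"], det["xy"])     # vehicle-frame distance
    if not (0.0 <= gap <= max_gap_s) or dist > max_dist_m:
        return None                              # time or spatial gate failed
    # Appearance cost: 1 - cosine similarity (embeddings assumed unit length).
    app = 1.0 - sum(a * b for a, b in zip(track["emb"], det["emb"]))
    return w_app * app + w_space * dist / max_dist_m + w_time * gap / max_gap_s

def match(tracks, dets, unmatched_cost=1.0):
    """Exhaustive optimal assignment; only practical for a handful of targets."""
    n = len(tracks)
    cost = [[fused_cost(t, d) for d in dets] for t in tracks]
    slots = list(range(len(dets))) + [None] * n  # None leaves a track unmatched
    best, best_total = [], float("inf")
    for perm in set(permutations(slots, n)):
        total, pairs = 0.0, []
        for ti, di in enumerate(perm):
            if di is None:
                total += unmatched_cost
            elif cost[ti][di] is None:
                break                            # gated pair: assignment invalid
            else:
                total += cost[ti][di]
                pairs.append((ti, di))
        else:
            if total < best_total:
                best, best_total = pairs, total
    return best

def update_embedding(track_emb, det_emb, similarity, gap_s,
                     base_alpha=0.2, max_gap_s=1.0):
    """Sliding update whose step shrinks for low similarity and long gaps
    (the direction of this adjustment is an assumption)."""
    alpha = base_alpha * max(similarity, 0.0) * (1.0 - 0.5 * min(gap_s / max_gap_s, 1.0))
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(track_emb, det_emb)]
    norm = math.sqrt(sum(x * x for x in mixed)) or 1.0
    return [x / norm for x in mixed]
```

In a real system the exhaustive search would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm; the gating and cost fusion would stay the same.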
[0014] As a preferred embodiment of the deep learning-based pedestrian intention recognition control system for autonomous vehicles described in this invention, the pedestrian intention prediction unit includes a temporal encoder and a classifier. The temporal encoder performs temporal modeling on the pedestrian embedding vectors of the same trajectory at multiple consecutive times, and the classifier outputs the pedestrian intention category or its probability.
[0015] As a preferred embodiment of the deep learning-based autonomous vehicle pedestrian intention recognition control system of the present invention, it further includes a spatiotemporal context perception unit, which is used to acquire the vehicle's own motion state information and road structure semantic information, and then fuse the information with the pedestrian embedding vector sequence and input it into the pedestrian intention prediction unit.
[0016] As a preferred embodiment of the deep learning-based autonomous vehicle pedestrian intention recognition control system of the present invention, the control decision unit determines the risk level of entering the lane based on the pedestrian crossing intention result and the relative position relationship of the pedestrian, and selects to output a deceleration command, a braking command or a driving-keeping command accordingly.
[0017] Through the above technical solution, the present invention can achieve at least the following beneficial effects: To address cross-camera identity association errors, pedestrian identity jumps, and trajectory breaks caused by multi-camera perspective switching and reappearance after occlusion, this invention maps appearance features from different cameras to a shared embedding space and introduces identity consistency constraints and temporal stability constraints during the training phase. This enables the same pedestrian to maintain a more consistent embedding representation across viewpoints and over time, thereby improving the continuity of cross-camera association and the stability of trajectory maintenance.
[0018] To address the problem that existing appearance re-identification features are decoupled from intent prediction tasks, resulting in the learned representation being insensitive to dynamic behavioral cues related to crossing intent, this invention introduces intent discrimination training signals simultaneously during the collaborative embedding learning stage. This enables the embedded representation to maintain its identity discrimination capability while enhancing its ability to express temporal behavioral patterns required for crossing intent discrimination, thereby providing a more stable and discriminative input sequence for subsequent temporal coding and intent classification.
[0019] To address the cross-matching and propagation of misassociations that easily occur in crowded multi-target scenarios when relying on appearance similarity alone, this invention introduces time gating and camera-calibration-based spatial consistency gating in the vehicle coordinate system at the identity association stage. Implausible candidate associations are eliminated in advance, and appearance, spatial, and temporal factors are integrated into a unified association cost. Global matching then yields a more consistent association result, thereby reducing mismatches and drift propagation in complex interaction scenarios.
[0020] To address the issue that intent prediction is prone to short-term jitter when there is occlusion, sudden change in viewpoint, or lack of information, which can lead to frequent switching of control strategies or excessive conservatism, this invention provides a probability / confidence representation at the intent output end. In the control decision-making process, the intent result is integrated with the relative position relationship, relative speed, and vehicle motion state to form a risk level, which is then mapped to control actions such as deceleration, braking, or maintaining driving. This allows the control output and intent judgment to form a stable closed loop, improving the safety and smoothness of the human-vehicle interaction process.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation on the scope of this application.
[0022] Figure 1 is a framework diagram of the pedestrian intent recognition control system for autonomous vehicles in the embodiment.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0024] All terms used in this application (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.
[0025] Example 1: As shown in Figure 1, this embodiment proposes a deep learning-based pedestrian intent recognition control system for autonomous vehicles, including: a multi-camera perception unit, used to simultaneously acquire image sequences of the vehicle's surrounding environment. The multi-camera perception unit includes camera components covering the front, rear, left, and right sides of the vehicle, respectively. These camera components are connected to the onboard computing platform via a high-speed data link. The multi-camera perception unit records the camera identifier and frame number for each captured image frame and writes the corresponding timestamp into the image metadata, forming a multi-camera image sequence arranged chronologically. It packages the multi-camera image frames within the same time-aligned group into a synchronized frame group and outputs it to the pedestrian detection feature extraction unit.
[0026] In this embodiment, a synchronization frame group refers to a set of frames within the same time alignment group, consisting of at least one frame each from the forward, backward, left, and right cameras, along with corresponding timestamps, camera identifiers, and synchronization status markers. Specifically, the synchronization frame group is formed on the vehicle computing platform side through a buffer queue. The buffer queue length, as an implementation parameter, is 30 frames by default, with an adjustable range of 10 to 120 frames. When any camera frame arrives, it is written to the corresponding queue according to its timestamp and attempts to pair it with the frame whose timestamp is closest to that of other camera queues. If the pairing is successful, a synchronization frame group is formed and output. Furthermore, the output order of the synchronization frame group is based on the increasing timestamp. During output, the frame number and camera identifier of each camera are carried together to ensure that subsequent pedestrian detection results can be traced back to the corresponding camera view and time position.
[0027] The pedestrian detection feature extraction unit is used to detect pedestrian targets in the image sequence and output the corresponding appearance feature vector; The collaborative embedding learning unit is a deep neural network used to perform shared space mapping on appearance feature vectors from different cameras and output a sequence of pedestrian embedding vectors with cross-camera identity consistency. The pedestrian tracking and identity management unit is used to perform cross-camera identity association and maintain trajectory based on pedestrian embedding vector sequences; The pedestrian tracking and identity management unit maintains a trajectory database. Each trajectory in the database includes a trajectory identifier, a recent update timestamp, a recent bounding box, an embedding vector cache, and a trajectory status flag. The trajectory initialization rule is as follows: when a currently detected pedestrian target cannot match any historical trajectory under a preset association gating condition, a new trajectory is created and a new trajectory identifier is assigned. The trajectory update rule is as follows: when the current pedestrian target satisfies the association determination with a historical trajectory, the current bounding box and the current embedding vector are written to the trajectory, and the trajectory's recent update timestamp is refreshed. The trajectory termination rule is as follows: when a trajectory fails to receive a matching update for multiple consecutive time steps and exceeds a preset loss retention threshold, the trajectory status is marked as terminated and removed from the candidate set participating in matching.
[0028] In this embodiment, the preset loss retention threshold is set to 1.0 second by default, with an adjustable range of 0.5 to 5.0 seconds. This threshold is based on the statistical percentile of the occlusion duration and the camera's field-of-view switching time. The threshold should not be too short to avoid frequent trajectory termination due to short-term occlusion and the introduction of identity jumps, nor should it be too long to avoid invalid trajectories occupying the candidate set, leading to an increased probability of incorrect associations. Furthermore, the trajectory status label includes at least three states: active, occlusion-retained, and terminated. The active state participates in the association solution, the occlusion-retained state only participates in the association solution for time-gated connections, and the terminated state does not participate in the candidate set.
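The three trajectory states and the loss retention threshold can be sketched as a small state machine. The class and method names (`Track`, `on_match`, `on_miss`) are hypothetical; only the state names and the 1.0-second default come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum

class TrackState(Enum):
    ACTIVE = "active"
    OCCLUSION_RETAINED = "occlusion-retained"
    TERMINATED = "terminated"

LOSS_RETENTION_S = 1.0  # default from the text; adjustable 0.5 to 5.0 s

@dataclass
class Track:
    track_id: int
    last_update: float            # recent update timestamp, in seconds
    bbox: tuple                   # recent bounding box
    embeddings: list = field(default_factory=list)  # embedding vector cache
    state: TrackState = TrackState.ACTIVE

    def on_match(self, now, bbox, embedding):
        """Update rule: write the new observation and refresh the timestamp."""
        self.last_update = now
        self.bbox = bbox
        self.embeddings.append(embedding)
        self.state = TrackState.ACTIVE

    def on_miss(self, now):
        """No matching detection at this time step."""
        if self.state is TrackState.TERMINATED:
            return
        if now - self.last_update > LOSS_RETENTION_S:
            self.state = TrackState.TERMINATED          # leaves the candidate set
        else:
            self.state = TrackState.OCCLUSION_RETAINED  # kept for time-gated matching
```

A track that misses a few frames stays in the occlusion-retained state and can be revived by `on_match`; once the gap exceeds the threshold it is terminated and never revived.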
[0029] The pedestrian intent prediction unit is used to perform temporal modeling on the pedestrian embedding vector sequence corresponding to the trajectory and output the pedestrian crossing intent result. The control decision unit is used to generate control commands for vehicle deceleration, braking, or maintaining driving based on the pedestrian crossing intent result and output them to the vehicle controller. The control decision unit receives the pedestrian crossing intent result, the relative position of the pedestrian, and the vehicle motion state, calculates the risk of the pedestrian entering the lane, and generates the corresponding control action. The risk is determined by the probability of pedestrian crossing intention, the relative distance between the pedestrian and the lane boundary, the relative speed between the pedestrian and the vehicle, and the vehicle's current braking capability, and it is mapped to one of several discrete risk levels using a preset risk grading rule. The control action selection rule is as follows: a braking command is output when the risk level reaches a preset braking level; a deceleration command is output when the risk level reaches a preset deceleration level but not the braking level; and a maintain driving command is output when the risk level is lower than the preset deceleration level. The control command output includes a target longitudinal acceleration or target speed request and a control mode identifier, and is output to the vehicle controller at a fixed refresh cycle.
[0030] Furthermore, the fixed refresh cycle, as an implementation parameter, defaults to 50 milliseconds, with an adjustable range of 10–100 milliseconds. The refresh cycle setting is based on the matching relationship between the vehicle controller's execution cycle and the onboard computing platform's scheduling cycle. The refresh cycle should not exceed 100 milliseconds to avoid control response lag, nor should it be less than 10 milliseconds to avoid excessive bus load. The control mode identifier is used to distinguish between three modes: deceleration control, braking control, and maintaining driving control. It is output to the vehicle controller along with the control command, and the vehicle controller selects the corresponding longitudinal control strategy accordingly and executes the target longitudinal acceleration or target speed request. In the lane entry risk level classification rules, the threshold for the probability of crossing the street to participate in the classification is set as an implementation parameter with a default value of 0.6 for deceleration level and 0.8 for braking level, with adjustable ranges of 0.5–0.8 and 0.7–0.95, respectively. The threshold setting is based on the trade-off between safety boundary and comfort on the validation set, and the braking level threshold is always not lower than the deceleration level threshold to maintain the monotonicity of the classification.
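The threshold-based grading can be sketched as below. This simplified illustration grades on the crossing probability alone and leaves out the distance, relative speed, and braking capability factors; the acceleration values in `HYPOTHETICAL_ACCEL_MPS2` are placeholders, not values from the text.

```python
from enum import Enum

class ControlMode(Enum):
    MAINTAIN = 0
    DECELERATE = 1
    BRAKE = 2

# Placeholder target accelerations per mode (hypothetical, for illustration only).
HYPOTHETICAL_ACCEL_MPS2 = {
    ControlMode.MAINTAIN: 0.0,
    ControlMode.DECELERATE: -1.5,
    ControlMode.BRAKE: -4.0,
}

def select_mode(p_cross, decel_thr=0.6, brake_thr=0.8):
    """Map a crossing probability to a control mode using the default thresholds."""
    if not (0.5 <= decel_thr <= 0.8 and 0.7 <= brake_thr <= 0.95):
        raise ValueError("threshold outside the adjustable range")
    if brake_thr < decel_thr:
        # Keeps the grading monotonic, as the text requires.
        raise ValueError("braking threshold must not be below deceleration threshold")
    if p_cross >= brake_thr:
        return ControlMode.BRAKE
    if p_cross >= decel_thr:
        return ControlMode.DECELERATE
    return ControlMode.MAINTAIN

def make_command(p_cross):
    """Build the command sent each refresh cycle: mode identifier plus target."""
    mode = select_mode(p_cross)
    return {"mode": mode, "target_accel_mps2": HYPOTHETICAL_ACCEL_MPS2[mode]}
```

Rejecting a configuration where the braking threshold is below the deceleration threshold at configuration time avoids a grading in which a higher probability yields a milder action.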
[0031] In this embodiment, the multi-camera sensing unit includes at least front, rear, left and right cameras, and writes a timestamp for each frame of image based on a unified time reference to achieve synchronization alignment; A unified time base is provided by the vehicle-mounted computing platform, and this time base is synchronized with the acquisition threads of each camera. The synchronization alignment process pairs multi-camera image frames according to their timestamps. If any camera is missing from a synchronization frame group, it is filled in using adjacent timestamp image frames, and the completion status is marked in the metadata. Image frames exceeding a preset time deviation threshold are determined to be out of sync and removed from the current synchronization frame group. The output of the synchronization frame group includes the image frame, timestamp, camera identifier, and synchronization status marker for each camera.
[0032] Specifically, the timestamp is generated using a unified time base on the vehicle computing platform and recorded with millisecond-level precision. The timestamp is written to both the image metadata and the synchronization frame group metadata. Further, the preset time deviation threshold is an implementation parameter, defaulting to 30 milliseconds, with an adjustable range of 10–80 milliseconds. This threshold is set based on the camera's output frame rate and the statistical percentile of frame arrival jitter under vehicle bus load conditions. The threshold is not set to 0, so that minor jitter does not prevent the synchronization frame group from forming, nor is it set too high, so that frames from different time slices are not mispaired, which could introduce motion blur and incorrect associations. Optionally, when a camera is missing from the current synchronization frame group and completion is triggered, the synchronization status flag marks the position of the filled-in frame as completed. In subsequent training and evaluation, this flag allows completed frames to be ignored or down-weighted in a controlled manner. Down-weighting is implemented through the weighting coefficients described above.
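The buffer-queue pairing and the time deviation threshold can be sketched as follows. This is one possible reading of the text: a camera whose nearest frame lies outside the threshold is treated as missing for that group and filled in with that adjacent frame, flagged as completed. The function names and the status strings are hypothetical.

```python
from collections import deque

CAMERAS = ("front", "rear", "left", "right")
QUEUE_LEN = 30          # default buffer length from the text (10 to 120 frames)
MAX_DEVIATION_MS = 30   # default time deviation threshold (10 to 80 ms)

def make_queues():
    """One bounded buffer per camera; the oldest frames drop out automatically."""
    return {cam: deque(maxlen=QUEUE_LEN) for cam in CAMERAS}

def push_frame(queues, cam, timestamp_ms, frame_no, image=None):
    queues[cam].append({"cam": cam, "t": timestamp_ms,
                        "frame_no": frame_no, "image": image})

def build_group(queues, ref_t):
    """Pair, for each camera, the buffered frame closest in time to ref_t."""
    group = {}
    for cam in CAMERAS:
        if not queues[cam]:
            group[cam] = {"frame": None, "status": "missing"}
            continue
        nearest = min(queues[cam], key=lambda f: abs(f["t"] - ref_t))
        within = abs(nearest["t"] - ref_t) <= MAX_DEVIATION_MS
        # Out-of-threshold frames are not true members of the group; the slot
        # is filled from the adjacent timestamp and flagged for down-weighting.
        group[cam] = {"frame": nearest,
                      "status": "synced" if within else "completed"}
    return group
```

Downstream stages can read the `status` field to ignore or down-weight completed frames, in line with the synchronization status marker described above.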
[0033] In this embodiment, the pedestrian detection feature extraction unit uses a shared backbone network to extract features from the images of each camera and outputs bounding boxes, confidence scores, and appearance feature vectors for each pedestrian target. The pedestrian detection feature extraction unit performs pedestrian target detection on images from each camera in the synchronized frame group, obtaining the bounding boxes and confidence scores of the pedestrian targets, and extracting the region features corresponding to the bounding boxes during the forward pass of the same network. The appearance feature vector is obtained by feature aggregation and projection from the region features, and the appearance feature vector is represented with fixed dimensions and normalized. In the output results, the pedestrian detection feature extraction unit adds camera identifiers, timestamps, and bounding box coordinates to each pedestrian target for subsequent cross-camera association and temporal modeling.
[0034] For example, fixed dimension means that the appearance feature vector and the pedestrian embedding vector maintain the same dimension under the same system configuration. The dimension size is set to 256 by default as an implementation parameter, and the adjustable range is 128 to 512. Too small a dimension will reduce appearance distinguishability, while too large a dimension will increase computational and storage overhead and amplify noise sensitivity. Furthermore, normalization refers to normalizing the vector magnitude to stabilize the similarity measure. The normalization method is consistent with the subsequent similarity calculation. When cosine similarity is used to form the appearance cost, the normalized vector can be directly used for similarity calculation and reduce the scale shift caused by exposure differences between different cameras.
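The effect of normalization on the similarity measure can be shown with a short sketch. The helper names are hypothetical; the point is that once vectors are scaled to unit length, cosine similarity reduces to a dot product and a uniform change in feature scale no longer changes the appearance cost.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def appearance_cost(a, b):
    """1 - cosine similarity; for unit vectors the cosine is the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

# The same direction at two different scales gives a near-zero cost.
dim_view = l2_normalize([3.0, 4.0])
bright_view = l2_normalize([30.0, 40.0])
same_cost = appearance_cost(dim_view, bright_view)

# Orthogonal directions give a cost of 1.
other_cost = appearance_cost(l2_normalize([1.0, 0.0]), l2_normalize([0.0, 2.0]))
```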
[0035] Furthermore, the bounding box coordinates are recorded using the image pixel coordinate system, with the origin at the top left corner of the image and the coordinate axes along the horizontal and vertical directions of the image, respectively. The bounding box is represented by the coordinates of the top left and bottom right corners, along with its width and height for subsequent projection and gating calculations. Optionally, when the pedestrian detection confidence is lower than a preset detection confidence threshold, the detection result is marked as low confidence and does not enter the trajectory initialization process. The detection confidence threshold, as an implementation parameter, defaults to 0.4 and is adjustable within a range of 0.2 to 0.7. Its setting is based on the trade-off between the false positive rate and the false negative rate on the validation set.
[0036] In this embodiment, the collaborative embedding learning unit is trained end-to-end with a composite training objective during the training phase. The composite training objective includes at least: an identity consistency training term for enhancing cross-camera identity consistency, an intent prediction training term for enhancing street crossing intent discrimination, and a temporal smoothing training term for constraining embedding changes at continuous time intervals. The training data consists of image sequences simultaneously acquired by multiple cameras, including pedestrian identity labeling and pedestrian crossing intent labeling. Identity labeling establishes an identity label based on the sample attribution of the same pedestrian target at different times and from different camera views; pedestrian crossing intent labeling establishes an intent label based on whether the pedestrian target enters the lane area within the prediction time range. The identity consistency training term constructs same-identity sample pairs and different-identity sample pairs based on the identity labels, driving the same-identity embedding vectors output by the collaborative embedding learning unit to be closer together in the shared space and different-identity embedding vectors to be further apart. The intent prediction training term calculates the difference between the intent prediction unit output and the true label based on the intent label and backpropagates it to the collaborative embedding learning unit, enabling the pedestrian embedding vector sequence to effectively support intent discrimination. The temporal smoothing training term constrains the variation amplitude of pedestrian embedding vectors at adjacent times within the same trajectory; when the variation amplitude exceeds a preset smoothing constraint threshold, a penalty is applied to suppress embedding jitter. 
The training terms of the composite training objective are combined according to preset weights and trained end-to-end.
[0037] Specifically, the magnitude of change refers to a magnitude measure of the difference between pedestrian embedding vectors at adjacent time points. This measure is consistent with the distance measure described above and is calculated on normalized vectors. The preset smoothing constraint threshold, as an implementation parameter, defaults to 0.2, with an adjustable range of 0.05 to 0.5. The threshold is not set to 0, so that normal viewpoint changes are not continuously penalized, nor is it set close to 1.0, so that the smoothing constraint does not lose its drift suppression effect. Furthermore, when there are missing frames or occlusions in the trajectory, the smoothing penalty coefficient at the corresponding time point can be reduced to 0.2 to 0.8 times its value according to the synchronization status flag, to reduce the interference on training of non-realistic jumps caused by frame completion. The coefficient is an implementation parameter and is applied through the weighting coefficients described above.
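A minimal sketch of the temporal smoothing term, assuming a hinge form (no penalty below the threshold, linear penalty above it) and Euclidean distance between normalized embeddings. Both choices are assumptions; the text only states that a penalty applies once the variation exceeds the threshold. `step_weights` is a hypothetical parameter carrying the per-step reduction for completed or occluded frames.

```python
import math

def smoothing_penalty(embeddings, threshold=0.2, step_weights=None):
    """Mean hinge penalty on embedding change between adjacent time steps.

    embeddings:   sequence of unit-length vectors from one trajectory.
    threshold:    smoothing constraint threshold (default 0.2 from the text).
    step_weights: optional per-step coefficients, e.g. 0.2 to 0.8 where the
                  later frame was filled in by completion.
    """
    steps = len(embeddings) - 1
    if steps < 1:
        return 0.0
    if step_weights is None:
        step_weights = [1.0] * steps
    total = 0.0
    for i in range(steps):
        change = math.dist(embeddings[i], embeddings[i + 1])
        total += step_weights[i] * max(0.0, change - threshold)
    return total / steps
```

Small changes from normal viewpoint variation fall under the threshold and cost nothing, while an abrupt jump is penalized in proportion to how far it exceeds the threshold.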
[0038] Specifically, the prediction time range, as an implementation parameter, defaults to 2.0 seconds, with an adjustable range of 1.0 to 5.0 seconds. This setting is based on the lead time required for vehicle braking and deceleration control in low- and medium-speed urban road scenarios. The range is not set too short, which would capture only crossing behavior that has already occurred and lose the value of early prediction, nor too long, which would introduce long-term uncertainty and increase label noise. Furthermore, the lane area, lane boundary, and curb are all determined by the road structure semantic information described above. The lane boundary is defined by the lane line or road boundary semantic output, the curb by the curb semantic output, and the lane area is the area enclosed by the lane boundaries and located within the passable range in the vehicle's direction of travel. Entry into the lane area is determined by the pedestrian target's spatial position crossing the curb or lane boundary over time. The spatial position is obtained by mapping the center pixel of the bounding box's bottom edge to a ground point in the vehicle coordinate system using the camera calibration parameters. If the mapping fails, the sample is marked as unusable for intent supervision.
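As a rough illustration of how such an intent label could be derived, the sketch below assumes the lane area is a straight band in the lateral coordinate of the vehicle frame. The function name `crossing_label`, the `lane_y` band, and the input layout are hypothetical; the 2.0 second horizon and the "unusable when mapping fails" rule follow the text.

```python
def crossing_label(future_points, t_now, horizon=2.0, lane_y=(-1.75, 1.75)):
    """Return 1 (enters lane), 0 (does not enter), or None (unusable sample).

    future_points: list of (timestamp, ground_point); ground_point is an
    (x, y) position in the vehicle frame, or None when the mapping failed.
    lane_y: hypothetical lateral extent of the lane area, in meters.
    """
    window = [(t, p) for t, p in future_points if t_now < t <= t_now + horizon]
    if any(p is None for _, p in window):
        return None  # mapping failed: exclude from intent supervision
    inside = any(lane_y[0] <= p[1] <= lane_y[1] for _, p in window)
    return 1 if inside else 0
```

A real lane area would come from the road structure semantic output rather than a fixed band; the band only keeps the example self-contained.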
[0039] Furthermore, the preset weights correspond to the weighting coefficient system described above. As implementation parameters, the default weights are 0.4 for the identity consistency training term, 0.4 for the intent prediction training term, and 0.2 for the temporal smoothing training term, each adjustable from 0.1 to 0.8, and the three sum to 1.0 after normalization. This default configuration gives the identity and intent supervision signals similar strength while preserving a basic constraint on temporal stability. Optionally, when learnable weights are used and normalized, their initial values are set equal for all three terms by default to avoid bias in the early stages of training, and the normalized result is limited to 0.05–0.90 to prevent any training term weight from collapsing to 0 and disabling the corresponding constraint. Similarly, the margin threshold in the identity consistency training term is set to 0.3 by default, with an adjustable range of 0.1 to 0.8. The margin threshold should not be set to 0, which would make positive and negative samples difficult to separate, nor too high, which would destabilize the training gradients. The focusing parameter of the focal loss in the intent prediction training term is set to 2.0 by default, with an adjustable range of 1.0 to 5.0. It should not be set below 1.0, which would place insufficient emphasis on difficult samples, nor too high, which would over-focus on them and slow convergence.
[0040] During the training phase, the collaborative embedding learning unit weights and fuses identity consistency training terms, intent prediction training terms, and temporal smoothing training terms to form a composite training objective. The gradients of the three types of training terms are superimposed at the shared network parameters and fed back to the shared parameters of the collaborative embedding learning unit. The weighting coefficients are set as fixed coefficients during the system configuration phase, or are obtained as learnable weights and normalized during the training phase, so as to adjust the proportion of the influence of the three types of training terms on the end-to-end training process.
[0041] In one implementation, during the training phase, the collaborative embedding learning unit generates a pedestrian embedding through a forward link of appearance features → shared space mapping → cross-camera fusion → consistency encoding. Three types of training signals (identity consistency, intent prediction, and temporal smoothing) are then applied simultaneously to this embedding to form the composite training objective. The specific training method is as follows: when pedestrian target $i$ is detected by camera $c$ at time $t$, the pedestrian detection feature extraction unit outputs an appearance feature vector, and the collaborative embedding learning unit obtains the pedestrian embedding vector according to the following formula:
$$z_{c,i}^{t} = P\left(f_{c,i}^{t};\theta_P\right),\qquad u_i^{t} = \sum_{c=1}^{C}\alpha_{c,i}^{t}\, z_{c,i}^{t},\qquad e_i^{t} = E\left(u_i^{t};\theta_E\right) \tag{1}$$
In equation (1), $f_{c,i}^{t}$ denotes the appearance feature vector of target $i$ from camera $c$ at time $t$; $P(\cdot)$ denotes the feature projection subnetwork and $\theta_P$ its network parameters, which are shared across cameras; $z_{c,i}^{t}$ denotes the shared-space feature after projection; $C$ denotes the number of cameras; $\alpha_{c,i}^{t}$ denotes the attention weights output by the feature fusion subnetwork; $u_i^{t}$ denotes the fused feature; $E(\cdot)$ denotes the consistency coding subnetwork and $\theta_E$ its network parameters; $e_i^{t}$ denotes the pedestrian embedding vector with consistent identity across cameras.
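The forward link in equation (1) can be sketched in plain Python as below. This is a toy stand-in under stated assumptions: the projection and encoder are single linear maps (`W_proj`, `W_enc`), the attention logits come from a hypothetical scoring vector `q_attn`, and the output is L2-normalized. None of these names or layer choices appear in the source, which leaves the subnetwork architectures open.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def l2_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]

def embed(features_per_camera, W_proj, q_attn, W_enc):
    """Project each camera view, fuse with attention weights, then encode.

    features_per_camera: appearance vectors of one target at one time,
    one per camera. W_proj is shared by all cameras.
    Returns (embedding, attention_weights).
    """
    z = [matvec(W_proj, f) for f in features_per_camera]  # shared space
    alpha = softmax([sum(q * v for q, v in zip(q_attn, zc)) for zc in z])
    dim = len(z[0])
    u = [sum(a * zc[k] for a, zc in zip(alpha, z)) for k in range(dim)]
    return l2_normalize(matvec(W_enc, u)), alpha
```

Because `W_proj` is applied to every view, features from different cameras land in one space before the attention weights are computed, which is the property the shared projection is meant to provide.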
[0042] In the same batch of data, the composite training objective is constructed as a weighted sum:
$$L = \lambda_{id} L_{id} + \lambda_{int} L_{int} + \lambda_{sm} L_{sm} \tag{2}$$
In equation (2), $L$ denotes the total loss; $L_{id}$ denotes the identity consistency training term; $L_{int}$ denotes the intent prediction training term; $L_{sm}$ denotes the temporal smoothing training term; $\lambda_{id}$, $\lambda_{int}$, and $\lambda_{sm}$ denote the weighting coefficients of the three losses.
[0043] S1, Constructing the identity consistency training term: The identity consistency training term is built on the metric learning constraint that embeddings of the same pedestrian lie closer together and embeddings of different pedestrians lie further apart. It is applied to triplets with a margin constraint:
$$L_{id} = \frac{1}{N}\sum_{n=1}^{N}\max\left(0,\; d\left(e_{a_n}, e_{a_n^{+}}\right) - d\left(e_{a_n}, e_{a_n^{-}}\right) + m\right) \tag{3}$$
In equation (3), $N$ denotes the number of triplet samples; $a_n$ denotes the time and target index of the $n$-th anchor sample; $a_n^{+}$ denotes a positive sample belonging to the same pedestrian as the anchor sample (which may come from a different camera or from a different time after occlusion); $a_n^{-}$ denotes a negative sample belonging to a different pedestrian; $d(\cdot,\cdot)$ denotes the embedding distance metric; $m$ denotes the margin threshold, which bounds the distance difference between negative and positive samples.
[0044] In one type of implementation, the distance metric is the Euclidean distance:
$$d(x, y) = \left\|x - y\right\|_2 \tag{4}$$
In equation (4), $x$ and $y$ denote any two embedding vectors; $\|\cdot\|_2$ denotes the L2 norm.
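A minimal sketch of the triplet margin term with the Euclidean distance, assuming triplets are already mined and passed in as `(anchor, positive, negative)` embedding tuples. The function names are illustrative; the default margin 0.3 is the value stated earlier in the description.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_loss(triplets, margin=0.3):
    """Mean of max(0, d(anchor, pos) - d(anchor, neg) + margin) over triplets."""
    losses = [
        max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
        for a, p, n in triplets
    ]
    return sum(losses) / len(losses)
```

A triplet whose negative is already further away than the positive by at least the margin contributes zero, so the gradient concentrates on triplets that still violate the constraint.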
[0045] S2, Constructing the intent prediction training term: The intent prediction training term is supervised on the crossing intent probability output by the pedestrian intent prediction unit. The temporal encoder and classifier model the embedding sequence within the same trajectory window and output the class probabilities:
$$h_i = G\left(\{e_i^{t}\}_{t\in W_i};\theta_G\right),\qquad p_i = \mathrm{softmax}\left(H\left(h_i;\theta_H\right)\right) \tag{5}$$
In equation (5), $W_i$ denotes the set of time-series window indices of target $i$; $G(\cdot)$ denotes the temporal encoder and $\theta_G$ its network parameters; $h_i$ denotes the temporal encoder output for target $i$; $H(\cdot)$ denotes the classifier and $\theta_H$ its network parameters; $p_i$ denotes the predicted probability vector of target $i$ over the intent categories; $\mathrm{softmax}(\cdot)$ denotes the normalization function.
[0046] When the training data is class-imbalanced, the intent prediction training term uses focal loss to suppress the gradient share of easily classified samples:
$$L_{int} = -\frac{1}{|\mathcal{S}_{int}|}\sum_{i\in\mathcal{S}_{int}}\sum_{k=1}^{K} y_{i,k}\left(1 - p_{i,k}\right)^{\gamma}\log p_{i,k} \tag{6}$$
In equation (6), $\mathcal{S}_{int}$ denotes the set of targets participating in intent supervision; $|\mathcal{S}_{int}|$ denotes the number of targets in the set; $K$ denotes the number of intent categories; $y_{i,k}$ denotes the label indicator of target $i$ for category $k$; $p_{i,k}$ denotes the predicted probability of category $k$ in $p_i$; $\gamma$ denotes the focusing parameter.
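The focal loss in equation (6) can be sketched as follows, assuming one-hot labels given as class indices. The function name and the `eps` clamp are illustrative additions; the default focusing parameter 2.0 is the value stated in the description.

```python
import math

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Mean focal loss.

    probs: list of per-target class probability vectors.
    labels: list of true class indices (one-hot labels in index form).
    gamma=0 reduces the expression to ordinary cross-entropy.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = min(max(p[y], eps), 1.0)  # clamp to keep log() finite
        total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / len(probs)
```

With `gamma=2.0`, a sample predicted at 0.9 for its true class is down-weighted by a factor of 0.01, while a sample at 0.5 keeps a factor of 0.25, so harder samples dominate the gradient.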
[0047] S3, Constructing the temporal smoothing training term: The temporal smoothing training term limits the embedding change between adjacent time steps to suppress embedding drift caused by occlusion and sudden viewpoint changes during tracking. One implementation uses an L2-norm penalty on adjacent differences:
$$L_{sm} = \frac{1}{|\mathcal{S}_{sm}|}\sum_{i\in\mathcal{S}_{sm}}\frac{1}{T_i - 1}\sum_{t=t_i^{0}+1}^{t_i^{0}+T_i-1}\left\|e_i^{t} - e_i^{t-1}\right\|_2^2 \tag{7}$$
In equation (7), $\mathcal{S}_{sm}$ denotes the set of targets participating in the temporal smoothing constraint; $|\mathcal{S}_{sm}|$ denotes the number of targets in the set; $T_i$ denotes the window length of target $i$; $t_i^{0}$ denotes the earliest time within the window; $\|\cdot\|_2^2$ denotes the squared L2 norm, which characterizes the magnitude of the difference between adjacent embeddings.
[0048] In another type of implementation, $L_{sm}$ is replaced with a second-order difference penalty that limits the variation in embedding velocity and corresponds to a more stable trajectory identity representation:
$$L_{sm}^{(2)} = \frac{1}{|\mathcal{S}_{sm}^{(2)}|}\sum_{i\in\mathcal{S}_{sm}^{(2)}}\frac{1}{T_i - 2}\sum_{t=t_i^{0}+1}^{t_i^{0}+T_i-2}\left\|e_i^{t+1} - 2e_i^{t} + e_i^{t-1}\right\|_2^2 \tag{8}$$
In equation (8), $\mathcal{S}_{sm}^{(2)}$ denotes the set of targets participating in the second-order constraint; $|\mathcal{S}_{sm}^{(2)}|$ denotes the number of targets in the set; the remaining symbols have the same meanings as in the preceding time window definition.
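Both smoothing variants can be sketched as below, assuming each track is a time-ordered list of embedding vectors. The function names are illustrative; tracks too short to form a difference are skipped, which is an assumption about edge handling that the text does not address.

```python
def first_order_smoothness(tracks):
    """Mean over tracks of the mean squared L2 norm of adjacent differences."""
    per_track = []
    for seq in tracks:
        if len(seq) < 2:
            continue  # no adjacent pair to compare
        diffs = [
            sum((a - b) ** 2 for a, b in zip(seq[t], seq[t - 1]))
            for t in range(1, len(seq))
        ]
        per_track.append(sum(diffs) / len(diffs))
    return sum(per_track) / len(per_track) if per_track else 0.0

def second_order_smoothness(tracks):
    """Mean over tracks of the mean squared L2 norm of second differences."""
    per_track = []
    for seq in tracks:
        if len(seq) < 3:
            continue  # a second difference needs three consecutive steps
        terms = [
            sum((n - 2 * c + p) ** 2
                for p, c, n in zip(seq[t - 1], seq[t], seq[t + 1]))
            for t in range(1, len(seq) - 1)
        ]
        per_track.append(sum(terms) / len(terms))
    return sum(per_track) / len(per_track) if per_track else 0.0
```

The two penalties behave differently on steady drift: a track moving at constant velocity in embedding space is penalized by the first-order term but not by the second-order term, which only reacts to changes in that velocity.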
[0049] S4, Defining the joint optimization of the three losses and the gradient backpropagation path: Joint optimization obtains $L_{id}$, $L_{int}$, and $L_{sm}$ simultaneously in a single forward computation. The total loss $L$ is then backpropagated to update the parameters of the projection, fusion, and encoding subnetworks and of the temporal encoder and classifier synchronously. For any parameter $\theta \in \Theta$, the gradient of the total loss is the linear superposition of the three gradients:
$$\nabla_{\theta} L = \lambda_{id}\nabla_{\theta} L_{id} + \lambda_{int}\nabla_{\theta} L_{int} + \lambda_{sm}\nabla_{\theta} L_{sm} \tag{9}$$
In equation (9), $\Theta$ denotes the set of network parameters to be updated; $\theta$ denotes any one of them; $\nabla_{\theta} L_{\ast}$ denotes the gradient of loss $L_{\ast}$ with respect to parameter $\theta$.
[0050] When $\theta = \theta_P$ or $\theta = \theta_E$, $L_{id}$ and $L_{sm}$ are backpropagated directly through the embedding $e_i^{t}$; $L_{int}$ is backpropagated via classifier → temporal encoder → embedding sequence $\{e_i^{t}\}$ to $\theta_E$ and further to $\theta_P$. Therefore, the same set of shared parameters is simultaneously subject to the joint constraints of cross-camera identity consistency, intent discrimination, and temporal stability.
[0051] Regarding weight settings, one implementation takes $\lambda_{id}$, $\lambda_{int}$, and $\lambda_{sm}$ as fixed hyperparameters; another makes the weights learnable and uses normalization to avoid scale drift:
$$\lambda_{k} = \frac{\exp(s_{k})}{\exp(s_{id}) + \exp(s_{int}) + \exp(s_{sm})},\qquad k\in\{id,\ int,\ sm\} \tag{10}$$
In equation (10), $s_{id}$, $s_{int}$, and $s_{sm}$ denote the learnable scalars corresponding to the three loss weights, used to adaptively adjust the proportion of the three constraints during training; $\exp(\cdot)$ denotes exponentiation.
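The two weighting options can be sketched as follows. The fixed defaults 0.4, 0.4, and 0.2 come from the description; the function names are illustrative, and the 0.05 to 0.90 limit on normalized weights mentioned earlier is deliberately left out because the text does not say how the weights are re-normalized after clamping.

```python
import math

def normalized_weights(raw):
    """Softmax over learnable scalars, so the three weights sum to 1."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def total_loss(l_id, l_int, l_sm, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the identity, intent, and smoothing terms."""
    w_id, w_int, w_sm = weights
    return w_id * l_id + w_int * l_int + w_sm * l_sm
```

Initializing the learnable scalars to equal values, as the description suggests, yields weights of one third each, so no term dominates at the start of training.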
[0052] At each moment, association operates on the multi-camera detection set and the set of active trajectories in the trajectory library. For each trajectory, the trajectory library maintains a representative embedding, the most recent successful association time, and a position prediction with position uncertainty in the vehicle coordinate system, supported by camera calibration. For each detection, the detection embedding is output, and the center pixel of the bottom edge of the detection box is mapped through calibration to a ground point in the vehicle coordinate system. The appearance consistency part uses cosine similarity to measure the closeness between the trajectory embedding and the detection embedding, as shown in Equation (11), and converts the similarity into an appearance cost for fusion with the other costs, as shown in Equation (12). The gating part restricts candidate matches to a reasonable time span and spatial range: temporal gating computes the trajectory disconnection duration and compares it with a threshold to form a pass decision, as shown in Equations (13) and (14); spatial gating computes the Mahalanobis distance between the trajectory position prediction and the detected ground point in the vehicle coordinate system and compares it with a threshold to form a pass decision, as shown in Equations (15) and (16). For candidate pairs that pass the gating, a comprehensive association cost is generated by fusing the appearance, spatial, and temporal costs, as shown in Equation (17). For candidate pairs that fail the gating, a penalty cost that prohibits matching is set, as shown in Equation (18). The association step casts the cost matrix as a bipartite matching minimization problem and obtains the optimal matching mapping, as shown in Equation (19). After a successful match, the trajectory's representative embedding is updated by sliding fusion with the current detection embedding to suppress drift when a pedestrian reappears after occlusion, as shown in Equation (20). The update step size is computed adaptively in the association stage from the appearance similarity and the disconnection duration, as shown in Equation (21). The upper and lower limits of the step size and the similarity reference threshold are preset by technicians in the system configuration stage.
[0053] Specifically, this composite training objective uses the pedestrian embedding as a common target, unifying cross-camera identity consistency, pedestrian crossing intent discrimination, and temporal stability of the embedding in a single end-to-end training iteration. The identity consistency training term uses metric learning to shorten the embedding distance between samples of the same pedestrian and widen the distance between different pedestrians, making identity association more stable during cross-camera switching and reappearance after occlusion. The intent prediction training term models consecutive embeddings with a temporal encoder and outputs intent class probabilities through a classifier. During training, a loss form that accounts for class imbalance can be used to prevent the learning signal of minority-class samples from being overwhelmed by the majority class. The temporal smoothing training term constrains the first-order or second-order differences between time steps, suppressing embedding drift caused by viewpoint changes and short-term occlusion and thereby reducing the risk of misjudgment due to trajectory breaks. During joint optimization, the gradients of the three training terms are superimposed and backpropagated at the shared network parameters, so that the projection, fusion, and consistency coding subnetworks receive constraints on identity, intent, and temporal continuity in the same optimization process, improving the stability of the intent output on which the overall closed-loop control depends.
[0054] In this embodiment, the identity consistency training term is based on the sample pairs of the same pedestrian and different pedestrians for metric learning, so that the embedding distance of the same pedestrian tends to be smaller and the embedding distance of different pedestrians tends to be larger. Same-identity sample pairs consist of pedestrian samples under the same identity label but from different cameras or different timestamps; different-identity sample pairs consist of pedestrian samples under different identity labels. During identity consistency training, a penalty is applied when the embedding distance of a same-identity sample pair is greater than a preset same-identity distance threshold; a penalty is also applied when the embedding distance of a different-identity sample pair is less than a preset different-identity distance threshold. The gradients generated by the identity consistency training terms and the gradients generated by the intent prediction training terms are backpropagated together to the projection layer, fusion layer, and encoding layer parameters of the collaborative embedding learning unit.
[0055] In this embodiment, the collaborative embedding learning unit includes a feature projection subnetwork, a feature fusion subnetwork, and a consistency coding subnetwork. The feature fusion subnetwork performs weighted fusion of features from different camera sources based on attention weights, and then the consistency coding subnetwork outputs the pedestrian embedding vector. The feature projection subnetwork receives appearance feature vectors from different pedestrian targets and different cameras, and maps them to a shared feature space through a projection layer with shared parameters. The feature fusion subnetwork uses the shared features of the same pedestrian target under different cameras as fusion input, generates importance coefficients of features from each viewpoint based on attention weights, and performs weighted aggregation. The attention weights are adaptively calculated by the fusion subnetwork according to the input features. The consistency coding subnetwork encodes the fused features, outputs pedestrian embedding vectors, and retains the correspondence between camera identifiers and timestamps in the output to form a sequence of pedestrian embedding vectors.
[0056] In this embodiment, the shared feature space refers to the unified representation space output by the feature projection sub-network. Appearance feature vectors from different cameras fall into this space after being mapped by shared parameters, enabling the subsequent feature fusion sub-network to calculate attention weights and complete cross-view aggregation on the same scale. Furthermore, the pedestrian embedding vector refers to the vector representation obtained by the consistency coding sub-network for encoding the fused features. Its dimension is consistent with the dimension of the appearance feature vector during the system configuration phase, and it also undergoes normalization processing so that the identity consistency training term, the temporal smoothing training term, and the appearance similarity measure share the same distance metric.
[0057] In this embodiment, when a pedestrian switches between cameras or reappears due to occlusion, the pedestrian tracking and identity management unit makes an association determination based on the similarity between the current pedestrian embedding vector and the embedding vector in the trajectory library, and performs gating screening by combining time window constraints and spatial consistency constraints under camera calibration parameters. Similarity calculation uses the normalized embedding vector as a metric to obtain the similarity score between the current pedestrian target and each candidate trajectory. The time window constraint rule is as follows: only trajectories whose difference between the most recently updated timestamp and the current timestamp is less than a preset time window threshold are included in the candidate trajectory set; trajectories exceeding the time window threshold are removed from the current matching. The spatial consistency constraint rule is as follows: based on camera calibration parameters, the spatial position corresponding to the current bounding box is mapped to the vehicle coordinate system, and the spatial difference between the current bounding box and the nearest spatial position of the candidate trajectory is calculated; when the spatial difference exceeds a preset spatial gating threshold, the candidate trajectory is removed from the current matching. The association determination rule is as follows: when the maximum similarity score is less than a preset similarity threshold, it is determined to be unrelated and a new trajectory is created; when multiple candidate trajectories meet the similarity threshold, the candidate trajectory with the highest similarity score is selected as the matching result, and the remaining candidate trajectories are excluded.
[0058] Specifically, the spatial position mapping approximates the pedestrian's foot point by the center pixel of the bounding box's bottom edge and, under the ground plane assumption combined with the camera calibration parameters, obtains the ground point position in the vehicle coordinate system, in meters. Optionally, when the road structure semantic information indicates that the pixel falls in a non-ground area, or the mapped position exceeds the perceptible range around the vehicle, the candidate pair is directly treated as failing the spatial gating. The spatial gating threshold, as an implementation parameter, defaults to 3.0 meters when the Euclidean distance in the vehicle coordinate system is used, with an adjustable range of 1.0 to 8.0 meters; when the Mahalanobis distance is used, it defaults to 9.0, with an adjustable range of 4.0 to 16.0 (the Mahalanobis distance is normalized by the position covariance and is therefore dimensionless). The threshold is set according to trajectory position uncertainty statistics and reachability constraints, and is not set too small, so that cross-frame displacements at normal walking speeds are not falsely rejected.
[0059] Camera calibration parameters include the intrinsic and extrinsic parameters of each camera, as well as their installation pose information relative to the vehicle coordinate system. These calibration parameters are generated during the vehicle assembly calibration process and stored on the onboard computing platform. During operation, they are read by the pedestrian tracking and identity management unit and used for spatial consistency gating calculations. The spatial mapping process transforms the detection results of each camera based on its extrinsic parameters to generate a positional representation in a unified vehicle coordinate system, which is maintained in the trajectory database along with the trajectory status.
[0060] The trajectory database is maintained by the pedestrian tracking and identity management unit. The pedestrian tracking and identity management unit forms an appearance cost based on the appearance similarity between the pedestrian embedding vectors stored in the trajectory database and the currently detected pedestrian embedding vectors. During the association determination, the unit combines the time gating of trajectory disconnection duration and the spatial consistency gating of vehicle coordinate system based on camera calibration to screen candidate trajectories. After the gating is passed, the appearance cost, spatial cost and time cost are fused to construct the association cost matrix and solve for the optimal match between the trajectory and the detection. After a successful match, the pedestrian embedding vector corresponding to the trajectory in the trajectory database is updated using a sliding update method, and the update magnitude is adjusted according to the appearance similarity and disconnection duration to reduce the impact of trajectory embedding drift on identity association when occlusion reappears.
[0061] Specifically, the sliding update method refers to a weighted fusion update using pedestrian embedding vectors stored in historical trajectories and currently detected pedestrian embedding vectors. The update step size increases with appearance similarity and decreases with the duration of disconnection. The minimum update step size is set to 0.05 by default, with an adjustable range of 0.01 to 0.20; the maximum update step size is set to 0.50 by default, with an adjustable range of 0.30 to 0.80. The maximum value is not set close to 1.0 to avoid trajectory embedding drift caused by single-frame noise, and the minimum value is not set to 0 to prevent trajectory embedding from failing to adapt to appearance changes. Furthermore, when the appearance similarity is below a preset similarity threshold, the update step size automatically converges to the minimum value to suppress the contamination of the trajectory database by erroneous associations.
[0062] Furthermore, the association cost matrix refers to a two-dimensional cost array whose rows correspond to the set of active trajectories and whose columns correspond to the current detection set, so that the number of rows equals the number of active trajectories and the number of columns equals the number of current detections. The matrix elements are the fused comprehensive association costs. The appearance cost is obtained by converting the similarity of the normalized pedestrian embedding vectors, the spatial cost is the spatial consistency distance or its normalized value, and the temporal cost is the ratio of the disconnection duration to the maximum allowable disconnection duration threshold. The fusion weights, as implementation parameters, default to 0.5 for appearance, 0.3 for spatial, and 0.2 for temporal, with adjustable ranges of 0.2–0.7, 0.1–0.6, and 0.05–0.4, respectively. The three sum to 1.0 after normalization. The default configuration prioritizes appearance consistency and relies on spatial and temporal gating to suppress mismatches. For candidate pairs that fail the gating, the penalty cost that prohibits matching defaults to 5.0 times the upper quantile of the comprehensive association cost of the candidate pairs that pass the gating, with an adjustable range of 2.0 to 10.0 times, so that the optimization is not forced to select out-of-gate candidates when no feasible match exists.
[0063] In one implementation, in an association determination scenario, the multi-camera detection set at time $t$ is denoted $\mathcal{D}_t$, and the set of active trajectories in the trajectory library is denoted $\mathcal{T}_t$. For any trajectory $j\in\mathcal{T}_t$, the trajectory library maintains its representative embedding $\bar{e}_j^{t-1}$, its most recent successful association time $t_j^{last}$, and its camera-calibration-based position prediction in the vehicle coordinate system $\hat{x}_j^{t}$ with position covariance $\Sigma_j^{t}$. For any detection $\ell\in\mathcal{D}_t$, the collaborative embedding learning unit outputs the detection embedding $e_\ell^{t}$, and the calibration parameters are used to map the center pixel of the bottom edge of the detection box to a ground point $x_\ell^{t}$ in the vehicle coordinate system.
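[0064] When the appearance similarity between trajectory $j$ and detection $\ell$ is calculated, cosine similarity can be used:
$$s_{j,\ell}^{t} = \frac{\left(\bar{e}_j^{t-1}\right)^{\top} e_\ell^{t}}{\left\|\bar{e}_j^{t-1}\right\|_2\,\left\|e_\ell^{t}\right\|_2} \tag{11}$$
In equation (11), $s_{j,\ell}^{t}$ denotes the appearance similarity between trajectory $j$ and detection $\ell$ at time $t$; $(\cdot)^{\top}$ denotes the transpose; $\|\cdot\|_2$ denotes the L2 norm; $\bar{e}_j^{t-1}$ denotes the representative embedding of trajectory $j$ held in the trajectory library at time $t$, that is, the value from its previous update; $e_\ell^{t}$ denotes the embedding of detection $\ell$ at time $t$.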
[0065] To facilitate superposition with the other costs, the appearance cost is defined as:
$$c_{j,\ell}^{app} = 1 - s_{j,\ell}^{t} \tag{12}$$
In equation (12), $c_{j,\ell}^{app}$ denotes the appearance cost.
[0066] The gating process consists of temporal gating and spatial gating. Temporal gating limits the candidate set according to how long the trajectory has been disconnected, defined as:
$$\Delta_j^{t} = t - t_j^{last} \tag{13}$$
In equation (13), $\Delta_j^{t}$ denotes the time interval, at time $t$, since the most recent update of trajectory $j$; $t_j^{last}$ denotes the most recent successful association time of trajectory $j$.
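[0067] Based on this interval, a temporal gating indicator is constructed:
$$g_{j}^{time} = \mathbb{1}\left[\Delta_j^{t} \le \tau_{max}\right] \tag{14}$$
In equation (14), $g_{j}^{time}$ indicates whether the temporal gate allows the candidate to pass; $\mathbb{1}[\cdot]$ denotes the indicator function; $\tau_{max}$ denotes the maximum allowed disconnection duration.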
[0068] Spatial gating uses the camera calibration parameters to project the detection into the vehicle coordinate system for distance filtering. For trajectory $j$ and detection $\ell$, the Mahalanobis distance is calculated:
$$d_{j,\ell}^{sp} = \sqrt{\left(x_\ell^{t} - \hat{x}_j^{t}\right)^{\top}\left(\Sigma_j^{t}\right)^{-1}\left(x_\ell^{t} - \hat{x}_j^{t}\right)} \tag{15}$$
In equation (15), $d_{j,\ell}^{sp}$ denotes the spatial consistency distance; $x_\ell^{t}$ denotes the two-dimensional or three-dimensional position vector of detection $\ell$ projected onto the ground in the vehicle coordinate system; $\hat{x}_j^{t}$ denotes the position prediction of trajectory $j$ in the vehicle coordinate system; $\Sigma_j^{t}$ denotes the position uncertainty covariance matrix of trajectory $j$; $(\Sigma_j^{t})^{-1}$ denotes its inverse.
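[0069] Based on this distance, a spatial gating indicator is constructed:
$$g_{j,\ell}^{sp} = \mathbb{1}\left[d_{j,\ell}^{sp} \le \tau_{sp}\right] \tag{16}$$
In equation (16), $g_{j,\ell}^{sp}$ indicates whether the spatial gate allows the candidate to pass; $\tau_{sp}$ denotes the spatial gating threshold.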
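[0070] After gating is completed, the association cost integrates the appearance, spatial, and temporal factors into a single scalar, which can be written as:
$$c_{j,\ell} = w_{app}\, c_{j,\ell}^{app} + w_{sp}\, d_{j,\ell}^{sp} + w_{time}\,\frac{\Delta_j^{t}}{\tau_{max}} \tag{17}$$
In equation (17), $c_{j,\ell}$ denotes the comprehensive association cost; $w_{app}$ denotes the appearance cost weight; $w_{sp}$ denotes the spatial cost weight; $w_{time}$ denotes the temporal cost weight.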
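[0071] When the gating results are incorporated into the cost matrix, candidates that fail the gating can be assigned a penalty constant:
$$\tilde{c}_{j,\ell} = \begin{cases} c_{j,\ell}, & g_{j}^{time} \wedge g_{j,\ell}^{sp} = 1 \\ c_{pen}, & \text{otherwise} \end{cases} \tag{18}$$
In equation (18), $\tilde{c}_{j,\ell}$ denotes the gated cost; $\wedge$ denotes the logical AND operation; $c_{pen}$ denotes the penalty constant that prohibits matching.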
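The chain from similarity to gated cost can be sketched for two-dimensional ground positions as below. The dictionary layout, the function names, the value `tau_max=1.0`, and the fixed `penalty` constant are assumptions made for illustration (the description derives the penalty from a cost quantile instead); the fusion weights 0.5, 0.3, 0.2 and the Mahalanobis threshold 9.0 are the stated defaults.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance for 2-D positions; cov is [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Apply the closed-form inverse of the 2x2 covariance to (dx, dy).
    ix = (d * dx - b * dy) / det
    iy = (-c * dx + a * dy) / det
    return math.sqrt(dx * ix + dy * iy)

def gated_cost(track, det, t_now, tau_max=1.0, tau_sp=9.0,
               w=(0.5, 0.3, 0.2), penalty=1e3):
    """Fused association cost, or `penalty` when either gate rejects the pair."""
    dt = t_now - track["t_last"]
    d_sp = mahalanobis_2d(det["ground"], track["pos"], track["cov"])
    if dt > tau_max or d_sp > tau_sp:
        return penalty
    c_app = 1.0 - cosine(track["emb"], det["emb"])
    return w[0] * c_app + w[1] * d_sp + w[2] * (dt / tau_max)
```

Evaluating the gates before the appearance term means that pairs outside the allowed time span or spatial range never compete on appearance alone.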
[0072] The association problem can be transformed into a bipartite matching problem that minimizes the total cost. In the form corresponding to the Hungarian algorithm, this involves finding the mapping $\pi$ from trajectories to detections that minimizes the objective function:
$$\pi_t^{*} = \arg\min_{\pi}\sum_{j\in\mathcal{T}_t}\tilde{c}_{j,\pi(j)} \tag{19}$$
In equation (19), $\pi_t^{*}$ denotes the optimal matching mapping at time $t$; $\pi(j)$ denotes the detection index assigned to trajectory $j$; $\arg\min$ denotes the argument at which the minimum is attained.
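To illustrate the minimization itself, the sketch below solves the matching by exhaustive search over permutations. This is a stand-in for the Hungarian algorithm that is only practical for a handful of trajectories and detections. The function name and the convention of padding the matrix with `penalty` entries are assumptions; pairs whose cost reaches the penalty are reported as unmatched.

```python
from itertools import permutations

def best_assignment(cost, penalty):
    """Exhaustive minimum-cost matching for a small cost matrix.

    Rows are trajectories, columns are detections. Returns {row: col},
    dropping pairs whose cost is at or above `penalty`.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    size = max(n_rows, n_cols)
    # Pad to a square matrix so surplus rows or columns map to dummies.
    padded = [
        [cost[r][c] if r < n_rows and c < n_cols else penalty
         for c in range(size)]
        for r in range(size)
    ]
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(size)):
        total = sum(padded[r][perm[r]] for r in range(size))
        if total < best_total:
            best_total, best_perm = total, perm
    return {
        r: best_perm[r]
        for r in range(n_rows)
        if best_perm[r] < n_cols and cost[r][best_perm[r]] < penalty
    }
```

For the matrix `[[0.1, 0.2], [0.15, 0.9]]`, a greedy choice of the cheapest pair first would cost 1.0 in total, whereas the global optimum pairs trajectory 0 with detection 1 and trajectory 1 with detection 0 for 0.35. In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation loop.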
[0073] For trajectories that have not been assigned a valid detection, the trajectory library can simply update their loss count and retain the values from the previous time step, including the representative embedding $\bar{e}_j^{t-1}$. This provides continuity for reappearance after occlusion.
[0074] When trajectory $j$ is associated with detection $\ell$ at time $t$, the representative embedding uses a sliding update with an adaptive step size to reduce the embedding drift caused by reappearance after occlusion:
$$\bar{e}_j^{t} = \left(1 - \eta_j^{t}\right)\bar{e}_j^{t-1} + \eta_j^{t}\, e_\ell^{t} \tag{20}$$
In equation (20), $\eta_j^{t}$ denotes the embedding update step size of trajectory $j$ at time $t$; $\bar{e}_j^{t-1}$ denotes the representative embedding from the previous time step; $e_\ell^{t}$ denotes the embedding of the currently matched detection.
[0075] The step size can be adaptively adjusted based on appearance similarity and disconnection duration:
$$\eta_j^{t} = \eta_{min} + \left(\eta_{max} - \eta_{min}\right)\,\sigma\left(\kappa\left(s_{j,\ell}^{t} - s_0\right)\right)\exp\left(-\frac{\Delta_j^{t}}{\tau_d}\right) \tag{21}$$
In equation (21), $\eta_{min}$ denotes the minimum update step size; $\eta_{max}$ denotes the maximum update step size; $\sigma(\cdot)$ denotes the sigmoid function; $s_0$ denotes the reference threshold for appearance similarity; $\kappa$ denotes the similarity scaling coefficient; $\exp(\cdot)$ denotes exponentiation; $\tau_d$ denotes the decay time constant for disconnection.
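A sketch of the adaptive step size and the sliding update follows. The step limits 0.05 and 0.50 are the defaults stated in the description; the reference threshold `s0`, the scaling coefficient `kappa`, and the decay constant `tau_d` are hypothetical values chosen only to make the example concrete, as are the function names.

```python
import math

def update_step(sim, dt, eta_min=0.05, eta_max=0.50,
                s0=0.6, kappa=10.0, tau_d=1.0):
    """Step size that grows with similarity and decays with disconnection time."""
    gate = 1.0 / (1.0 + math.exp(-kappa * (sim - s0)))  # sigmoid
    return eta_min + (eta_max - eta_min) * gate * math.exp(-dt / tau_d)

def sliding_update(track_emb, det_emb, eta):
    """Blend the stored embedding with the matched detection embedding."""
    return [(1.0 - eta) * a + eta * b for a, b in zip(track_emb, det_emb)]
```

With these example values, a match at similarity 0.9 right after the last update moves the stored embedding by roughly 0.48 of the way toward the detection, while the same match after five seconds of disconnection stays close to the minimum step of 0.05.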
[0076] Under this update, reappearance matches with low similarity or a long disconnection duration are applied in small steps, which limits how far short-term noise can skew the representative embedding.
[0077] In each training batch, based on the pedestrian detection results and corresponding time information in the multi-camera images, the pedestrian detection feature extraction unit outputs appearance features, and the collaborative embedding learning unit sequentially completes shared space mapping, cross-camera fusion, and consistency encoding to obtain pedestrian embeddings for cross-camera identity alignment, as shown in Equation (1). Subsequently, in the same batch, the three types of training signals (identity consistency, intent prediction, and temporal smoothing) are unified into a composite training objective, and the three losses are aggregated into a total loss by weighted summation, as shown in Equation (2). The identity consistency training term forms a margin loss from the triplet metric constraint and adopts the Euclidean distance metric, as shown in Equations (3) and (4); the intent prediction training term outputs the street crossing intent probability through the temporal encoder and classifier and introduces focal loss when the classes are imbalanced, as shown in Equations (5) and (6); the temporal smoothing training term penalizes the adjacent difference or the second-order difference to limit non-stationary drift of the embedding over time, as shown in Equation (7) or (8). During joint optimization, the gradient of the total loss is backpropagated to the parameters of the projection, fusion, and consistency coding subnetworks and of the temporal encoder and classifier as a weighted superposition of the three gradients, as shown in Equation (9), so that the shared parameters are simultaneously constrained by identity, intent, and temporal continuity. The weight coefficients of the three losses are preset by technicians during the system configuration phase, or are obtained adaptively as learnable scalars in normalized form during the training phase, as shown in Equation (10).
[0078] Specifically, this association mechanism places appearance consistency and spatiotemporal consistency in the same solution chain. The appearance part uses the similarity relationship of embedded vectors to form a measurable association basis, distinguishing the same pedestrian from different pedestrians. The gating process restricts candidate pairs to a reasonable time span and spatial range. The time constraint suppresses false associations caused by long periods of disconnection, while the spatial constraint uses the vehicle coordinate relationship after camera calibration to exclude candidates that cross lanes or are impossible to reach. Cost fusion unifies appearance, spatial, and temporal factors into a single cost, facilitating global matching for solution and reducing mutual contention and mismatches caused by local greed in multi-target scenarios. After association is completed, the trajectory representation embedding is updated using a sliding method, and the update amplitude is adjusted according to similarity and the degree of disconnection, ensuring that occlusion reappearance is not significantly skewed by a single observation, which is beneficial for maintaining the continuity of subsequent cross-camera switching and re-identification.
[0079] In this embodiment, the pedestrian intent prediction unit includes a temporal encoder and a classifier. The temporal encoder performs temporal modeling on the pedestrian embedding vectors of the same trajectory at multiple consecutive time points, and the classifier outputs the pedestrian crossing intent category or its probability. The pedestrian crossing intent categories include at least two types: entering the lane and not entering the lane. Entering the lane means that the pedestrian crosses the curb or lane boundary and enters the vehicle driving lane area within the prediction time range. The input sequence consists of the embedding vectors of the same trajectory at multiple consecutive time points, and its length is determined by a preset time window. When occlusion or missing observations leave gaps in the input sequence, the most recent valid embedding vector from the trajectory library is used to fill the gaps, and the missing positions are marked in the sequence. The classifier output includes the probability distribution over the intent categories together with a confidence label. When the probability of the most probable category is lower than the preset confidence threshold, the output is marked as uncertain and the stable output state of the previous time point is maintained.
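The gap-filling step can be sketched as below. The function name `fill_gaps`, the use of `None` to represent a missing observation, and the boolean mask layout are assumptions for illustration; `last_valid` stands for the most recent valid embedding held in the trajectory library before the window starts.

```python
def fill_gaps(window, last_valid):
    """Fill missing embeddings with the most recent valid one and mark them.

    window: list of embedding vectors, with None where the pedestrian
            was occluded or not detected.
    last_valid: most recent valid embedding from the trajectory library.
    Returns (filled_sequence, missing_mask), where missing_mask[i] is True
    for positions that were filled rather than observed.
    """
    filled, missing_mask = [], []
    for emb in window:
        if emb is None:
            filled.append(list(last_valid))
            missing_mask.append(True)
        else:
            last_valid = emb  # later gaps reuse the newest observation
            filled.append(list(emb))
            missing_mask.append(False)
    return filled, missing_mask
```

The mask travels with the sequence so that the temporal encoder can treat filled positions differently from observed ones.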
[0080] Furthermore, the preset time window, as an implementation parameter, defaults to 1.0 second, with an adjustable range of 0.5 to 3.0 seconds. Sampling within the time window yields the sequence of pedestrian embedding vectors at multiple consecutive time points, with the synchronization frame group as the time base. Optionally, when changes in the synchronization frame rate cause the number of samples within the window to fluctuate, the time span of the window remains unchanged and the sequence is resampled at equal intervals, so that the number of samples input to the temporal encoder stays stable. Likewise, when an occlusion gap is filled with the most recent valid embedding vector from the trajectory library, a missing-position marker is written at the same time so that the temporal encoder can reduce the attention weight of the missing positions during training. This reduction uses the attention weighting mechanism already described above and does not introduce any new module.
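A minimal sketch of the fixed-span, equal-interval resampling follows. The paragraph does not state the resampling rule, so nearest-timestamp selection is an assumption, as are the function name and the default of 10 samples per window.

```python
def resample_window(samples, t_end, window_s=1.0, num_steps=10):
    """Resample a trajectory to a fixed number of equally spaced time points.

    samples: list of (timestamp_s, embedding) pairs from the synchronization
             frame groups, at whatever frame rate they arrived.
    Returns num_steps embeddings covering (t_end - window_s, t_end].
    """
    if not samples:
        return []
    step = window_s / num_steps
    targets = [t_end - window_s + step * (k + 1) for k in range(num_steps)]
    # For each target time, pick the sample whose timestamp is closest.
    return [min(samples, key=lambda s: abs(s[0] - t))[1] for t in targets]
```

With this rule, a 20 Hz stream and a 5 Hz stream both produce the same number of encoder inputs for a 1.0 second window; only the amount of repetition or skipping differs.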
[0081] Specifically, the confidence threshold defaults to 0.6, with an adjustable range of 0.5 to 0.8. The threshold is chosen from the trade-off between precision and recall on the validation set. It is not set below 0.5, to avoid a large number of low-confidence predictions being treated as definite results in the control decision, nor is it set above 0.9, to avoid too few definite outputs and a prolonged state of uncertainty. Furthermore, maintaining the stable output state of the previous time step means that, while consecutive time steps continue to be judged uncertain, the intent category output from the previous time step remains unchanged, while the intent probability continues to be transmitted to the control decision unit as auxiliary information for the risk level calculation.
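The thresholding and hold behavior can be sketched as a small stateful helper. The class name, the category labels, and the choice of "not_entering" as the state before any definite output are assumptions; only the 0.6 default comes from the text.

```python
class IntentStabilizer:
    """Holds the last definite intent category while outputs are uncertain."""

    def __init__(self, threshold=0.6, initial="not_entering"):
        self.threshold = threshold
        self.stable = initial  # assumed state before any definite output

    def step(self, probs):
        """probs: dict mapping intent category to probability.

        Returns (category, uncertain, probs). The probabilities are passed
        through unchanged so the control decision unit can still use them.
        """
        top = max(probs, key=probs.get)
        uncertain = probs[top] < self.threshold
        if not uncertain:
            self.stable = top
        return self.stable, uncertain, probs
```

For example, a frame with probabilities 0.55 and 0.45 is flagged uncertain and reports whatever category the last definite frame produced.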
[0082] In this embodiment, the control decision unit determines the risk level of entering the lane based on the pedestrian crossing intent and the relative position of the pedestrian, and accordingly outputs a deceleration command, a braking command, or a maintain-driving command. When the control decision unit outputs a deceleration command, the vehicle controller performs longitudinal deceleration control while maintaining the vehicle's lateral control strategy. When it outputs a braking command, the vehicle controller enters emergency braking mode and limits the target speed request to a preset safe lower speed limit. When it outputs a maintain-driving command, the vehicle controller maintains the current cruise or following control state. If the pedestrian crossing intent remains in the uncertain state for longer than a preset uncertainty holding threshold, the control decision unit raises the risk level to a preset conservative level and outputs a deceleration command until the pedestrian crossing intent returns to a definite state.
[0083] Specifically, the preset uncertainty holding threshold defaults to 0.5 seconds, with an adjustable range of 0.2 to 2.0 seconds. The threshold is derived from a statistical percentile of the duration of intent probability fluctuations caused by short-term pedestrian occlusion and sudden viewpoint changes. It should not be too short, so that short-term fluctuations do not frequently trigger the conservative level, nor too long, so that prolonged uncertainty does not increase risk. The duration of the uncertain state is accumulated over consecutive synchronization frame groups. If the intent output recovers to a definite state at any point during the interval, the accumulated duration is reset to zero and the risk level mapping resumes based on the definite output.
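The uncertainty timer and command selection can be sketched as below. The source does not define how intent and relative position map to a risk level, so the distance thresholds in `risk_level`, the three-level scale, and the rule that a braking-level risk is never lowered to the conservative level are hypothetical; only the 0.5 second default and the reset-on-definite behavior come from the text.

```python
COMMANDS = {0: "maintain", 1: "decelerate", 2: "brake"}
CONSERVATIVE_LEVEL = 1  # assumed to correspond to the deceleration command

def risk_level(intent, distance_m, brake_dist_m=8.0, slow_dist_m=20.0):
    """Hypothetical mapping from intent and pedestrian distance to a risk level."""
    if intent != "entering":
        return 0
    if distance_m <= brake_dist_m:
        return 2
    return 1 if distance_m <= slow_dist_m else 0

class ControlDecision:
    def __init__(self, hold_threshold_s=0.5):
        self.hold_threshold_s = hold_threshold_s
        self.uncertain_for_s = 0.0

    def step(self, intent, uncertain, distance_m, frame_dt_s):
        """Process one synchronization frame group and return a command."""
        if uncertain:
            self.uncertain_for_s += frame_dt_s  # accumulate over consecutive frames
        else:
            self.uncertain_for_s = 0.0  # a definite output resets the timer
        level = risk_level(intent, distance_m)
        if self.uncertain_for_s > self.hold_threshold_s:
            level = max(level, CONSERVATIVE_LEVEL)
        return COMMANDS[level]
```

With a 0.25 second frame interval, the first two uncertain frames leave the command unchanged and the third raises it to deceleration, since the accumulated 0.75 seconds then exceeds the 0.5 second threshold.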
[0084] Embodiment 2: Based on Embodiment 1, this embodiment further includes a spatiotemporal context perception unit, which acquires vehicle motion state information and road structure semantic information, fuses them with the pedestrian embedding vector sequence, and inputs the result into the pedestrian intent prediction unit. The vehicle motion state information includes vehicle speed, longitudinal acceleration, lateral acceleration, yaw rate, steering angle, and a braking status indicator. It is output by the vehicle chassis controller over the onboard bus and carries a timestamp. The road structure semantic information includes lane boundaries, pedestrian crossings, stop lines, curbs, and intersection boundaries. It is obtained by the onboard perception module from camera images or from the onboard map interface and is aligned with the vehicle coordinate system. The fusion process first aligns the vehicle state and road structure semantic information with the pedestrian embedding vector sequence by timestamp in the temporal dimension, and then concatenates them in the feature dimension to form the joint input to the temporal encoder.
[0085] In this embodiment, the timestamps of the vehicle motion state information and the timestamps of the multiple cameras use the same unified time base. If the vehicle state arrives at a higher frequency than the synchronization frame groups are output, the vehicle state sample whose timestamp is closest to each synchronization frame group is selected as the alignment result. If the vehicle state arrives at a lower frequency than the synchronization frame groups are output, the vehicle state of the previous moment is retained for the missing moments and marked as retained. After the road structure semantic information is aligned with the vehicle coordinate system, it is updated with a fixed refresh cycle. As an implementation parameter, the refresh cycle defaults to 200 milliseconds, with an adjustable range of 50 to 500 milliseconds. When no update arrives within a refresh interval, the road structure semantic information of the previous cycle is retained and marked as retained, so that the spatiotemporal context input to the intent prediction unit is continuous in time.
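The timestamp alignment and feature concatenation of this embodiment can be sketched as follows. The `tolerance_s` parameter, which decides when the closest sample counts as fresh rather than retained, is an assumption, as are the function names and the choice to append the retained marker as one extra feature.

```python
def align_states(frame_times, states, tolerance_s):
    """Align vehicle states to synchronization frame group timestamps.

    states: list of (timestamp_s, value) pairs sorted by time.
    Returns one (value, retained) pair per frame time. A frame uses the
    closest sample when it lies within tolerance_s; otherwise the previous
    state is retained and flagged.
    """
    aligned = []
    for t in frame_times:
        nearest = min(states, key=lambda s: abs(s[0] - t))
        if abs(nearest[0] - t) <= tolerance_s:
            aligned.append((nearest[1], False))
        else:
            earlier = [s for s in states if s[0] <= t]
            value = earlier[-1][1] if earlier else None  # None: nothing to retain yet
            aligned.append((value, True))
    return aligned

def fuse_features(embedding, vehicle_state, road_features, retained):
    """Concatenate along the feature dimension, appending the retained marker."""
    return (list(embedding) + list(vehicle_state) + list(road_features)
            + [1.0 if retained else 0.0])
```

Applying `fuse_features` at every time step of the window yields the joint input sequence for the temporal encoder.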
[0086] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
[0087] Furthermore, those skilled in the art will understand that although some embodiments herein include certain features included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of this application and form different embodiments. For example, all the embodiments above can be used in any combination. The information disclosed in this background section is intended only to enhance the understanding of the general background of this application and should not be construed as an admission or in any way implying that such information constitutes prior art known to those skilled in the art.
Claims
1. A deep learning-based pedestrian intent recognition control system for autonomous vehicles, characterized in that it comprises: a multi-camera sensing unit, used to simultaneously acquire image sequences of the vehicle's surrounding environment; a pedestrian detection feature extraction unit, used to detect pedestrian targets in the image sequences and output the corresponding appearance feature vectors; a collaborative embedding learning unit, which is a deep neural network used to perform shared space mapping on appearance feature vectors from different cameras and output a sequence of pedestrian embedding vectors with cross-camera identity consistency; a pedestrian tracking and identity management unit, used to perform cross-camera identity association and maintain trajectories based on the pedestrian embedding vector sequence; a pedestrian intent prediction unit, used to perform time-series modeling on the pedestrian embedding vector sequence corresponding to a trajectory and output the street crossing intent result; and a control decision unit, used to generate control commands for vehicle deceleration, braking, or maintaining driving based on the pedestrian crossing intent result and output them to the vehicle controller.
2. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, The multi-camera sensing unit includes at least front, rear, left and right cameras, and writes a timestamp for each frame of image based on a unified time reference to achieve synchronization alignment.
3. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, The pedestrian detection feature extraction unit uses a shared backbone network to extract features from images from each camera and outputs bounding boxes, confidence scores, and appearance feature vectors for each pedestrian target.
4. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, The collaborative embedding learning unit is trained end-to-end with a composite training objective during the training phase. The composite training objective includes at least: an identity consistency training term to enhance cross-camera identity consistency, an intent prediction training term to enhance street crossing intent discrimination, and a temporal smoothing training term to constrain embedding changes at continuous time intervals. During the training phase, the collaborative embedding learning unit weights and fuses identity consistency training terms, intent prediction training terms, and temporal smoothing training terms to form a composite training objective, and superimposes the gradients of the three types of training terms at the shared network parameters and feeds them back to the shared parameters of the collaborative embedding learning unit; the weighting coefficients are set to fixed coefficients during the system configuration phase.
5. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 4, characterized in that, The identity consistency training term is based on metric learning of sample pairs of the same pedestrian and different pedestrians, so that the embedding distance of the same pedestrian tends to be smaller and the embedding distance of different pedestrians tends to be larger.
6. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, The collaborative embedding learning unit includes a feature projection subnetwork, a feature fusion subnetwork, and a consistency coding subnetwork. The feature fusion subnetwork performs weighted fusion of features from different camera sources based on attention weights, and then the consistency coding subnetwork outputs the pedestrian embedding vector.
7. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, When a pedestrian switches between cameras or reappears after occlusion, the pedestrian tracking and identity management unit makes an association determination based on the similarity between the current pedestrian embedding vector and the embedding vectors in the trajectory library, and performs gating screening in combination with time window constraints and spatial consistency constraints under the camera calibration parameters. The trajectory library is maintained by the pedestrian tracking and identity management unit. The pedestrian tracking and identity management unit forms an appearance cost based on the appearance similarity between the pedestrian embedding vectors stored in the trajectory library and the currently detected pedestrian embedding vectors. During the association determination, the unit combines time gating on the trajectory disconnection duration with spatial consistency gating in the vehicle coordinate system based on camera calibration to screen candidate trajectories. For candidates that pass the gating, the appearance cost, spatial cost, and time cost are fused to construct the association cost matrix, which is solved for the optimal match between trajectories and detections. After a successful match, the pedestrian embedding vector corresponding to the trajectory in the trajectory library is updated using a sliding update method, and the update magnitude is adjusted according to the appearance similarity and the disconnection duration.
8. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, The pedestrian intention prediction unit includes a temporal encoder and a classifier. The temporal encoder performs temporal modeling on the pedestrian embedding vectors of the same trajectory at multiple consecutive times, and the classifier outputs the pedestrian intention category or its probability.
9. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, It also includes a spatiotemporal context awareness unit, which is used to acquire vehicle motion state information and road structure semantic information, and then fuse the information with the pedestrian embedding vector sequence and input it into the pedestrian intention prediction unit.
10. The deep learning-based pedestrian intent recognition control system for autonomous vehicles according to claim 1, characterized in that, The control decision unit determines the risk level of entering the lane based on the pedestrian crossing intent result and the relative position of the pedestrian, and accordingly outputs a deceleration command, a braking command, or a maintain-driving command.