A pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism

Mechanical state data from the quay crane is used to construct an occlusion mask and a risk field that correct the spatiotemporal attention mechanism. This addresses the pedestrian detection and trajectory prediction errors caused by metal occlusion and yields stable, continuous trajectory prediction in complex environments.

CN122244803A · Pending · Publication Date: 2026-06-19 · JIANGSU XIONGFENG INTELLIGENT BRAIN ROBOT TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: JIANGSU XIONGFENG INTELLIGENT BRAIN ROBOT TECHNOLOGY CO LTD
Filing Date: 2026-04-28
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

When applied to container terminal quay crane scenarios, existing spatiotemporal attention mechanisms produce pedestrian detection and trajectory prediction errors because of interference from metal obstructions, leading to serious false alarms or missed alarms. They cannot reliably distinguish pedestrians from quay crane structures, and the optical flow estimation network fails.

Method used

The mechanical state data of the quay crane is used to construct a dynamic occlusion mask, from which an attention decay weight field is generated to correct the feature similarity calculation of the spatiotemporal attention mechanism. Trajectory prediction is combined with a mechanical motion risk field, and the network is optimized with a multi-task loss function.

Benefits of technology

Under occlusion by the spreader and under changing lighting, the method improves the stability of pedestrian detection and the continuity of trajectory prediction, reduces the background noise response, and enhances the model's adaptability in complex scenes.

✦ Generated by Eureka AI based on patent content.

Figure CN122244803A_ABST

Abstract

This invention relates to the field of pedestrian trajectory prediction technology and discloses a pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism. The method includes acquiring video images of the quay crane operation area and synchronous mechanical status data; projecting the mechanical status data onto an image plane to generate a dynamic occlusion mask; constructing an attention decay weight field based on the dynamic occlusion mask; using the attention decay weight field to correct the query and key feature similarity calculation of the spatiotemporal attention mechanism; and performing pedestrian bounding box detection and future trajectory sequence prediction based on the corrected spatiotemporal attention features. This invention, by introducing the mechanical motion parameters of the quay crane PLC to construct an occlusion thermal mask and guide the spatiotemporal attention calculation, can force the model to focus on the effective passable area under quay crane occlusion conditions.

Description

Technical Field

[0001] This invention relates to the field of pedestrian trajectory prediction technology, specifically to a pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism.

Background Technology

[0002] With the continued growth of global trade, automated container terminals face increasingly stringent requirements for operational efficiency and safety management. In the quay crane area, the core operational unit of a container terminal, numerous trucks, personnel, and machines operate together in the landside interaction zone, so unauthorized intrusion or abnormal loitering by pedestrians can easily lead to serious mechanical injury accidents. To achieve 24/7 safety monitoring of the quay crane operation area, pedestrian detection and future trajectory prediction based on computer vision have become a focus of attention in both industry and academia. Existing mainstream solutions typically employ a deep-learning model architecture built on a spatiotemporal joint attention mechanism. A spatiotemporal attention mechanism is the ability of a neural network to learn automatically how to allocate different weights when processing video sequences, giving higher attention to important features both spatially (different pixel regions within the same frame) and temporally (the same target across different frames), thereby improving the model's accuracy in perceiving moving targets.

[0003] However, the aforementioned general spatiotemporal attention mechanisms face performance challenges due to the unique environmental characteristics when applied to the specific industrial scenario of container terminal quay cranes. Specifically, the core source of interference in this scenario stems from the massive metal mechanical structure of the quay crane itself. The quay crane mainly consists of a large-span metal truss beam, a trolley moving at high speed along the beam's track, and spreaders connected to the trolley for grabbing containers. During loading and unloading operations, especially when the spreaders are vertically lowering containers, the metal projection, the spreader body, and the wire ropes create large-area, irregularly shaped, and rapidly moving physical obstructions in the monitoring camera's field of view. This type of occlusion differs fundamentally from vehicle occlusion in ordinary urban road scenes: First, the occluder is a metal component with a highly reflective surface, which can cause local overexposure or specular reflection artifacts in the image; Second, the edges of the occluded area have strong optical shadow gradient changes, which are easily misjudged as moving edges by the feature extraction layer of the deep learning model; Third, the descent motion of the quay crane has a clear physical acceleration law, but its deformation scale is large in the two-dimensional image plane projection, causing the data-driven optical flow estimation network to fail.

[0004] Under such strong occlusion interference, existing spatiotemporal attention mechanisms suffer from severe erroneous associations and feature degradation. On the one hand, at the spatial attention level, occluded regions carry significant high-frequency edge energy in the image, so when the query and key vectors of a general self-attention mechanism are compared by inner-product similarity, high-weight associations are incorrectly established between background pixels belonging to the quay crane's metal structure and foreground pixels belonging to pedestrians. The model therefore spends considerable computation on invalid regions occluded by the spreader and neglects unoccluded, passable ground areas. On the other hand, at the temporal attention level, when a pedestrian reappears after being briefly and completely occluded by the spreader, existing models, which lack any understanding of the physical cause of the occlusion, often fail to establish long-range, temporally consistent associations across the occlusion, leading to identity switches or sudden divergence of the predicted trajectory. Trajectory prediction here refers to inferring a pedestrian's likely movement path over the next few seconds from the sequence of their movement coordinates over the past few seconds, using a recurrent neural network or a Transformer decoder. If the input coordinates in occluded frames are noisy because of detection drift, the predicted trajectory deviates from the actual safe passage, directly causing serious false alarms or missed alarms in the safety warning system. The following solution is therefore proposed to address these problems.

Summary of the Invention

[0005] To address the aforementioned technical problems, this invention provides a pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism, comprising the following steps: collecting video images and synchronous mechanical status data of the quay crane operation area; projecting the mechanical status data onto the image plane to generate a dynamic occlusion mask; constructing an attention decay weight field based on the dynamic occlusion mask; using the attention decay weight field to correct the query and key feature similarity calculation of the spatiotemporal attention mechanism; and performing pedestrian bounding box detection and future trajectory sequence prediction based on the corrected spatiotemporal attention features.

[0006] Preferably, the mechanical status data includes the position of the quay crane trolley, the height of the spreader off the ground, and the pitch angle of the main beam; the method also includes a step of aligning the video images with the mechanical status data using time-stamp linear interpolation.

[0007] Preferably, projecting the mechanical status data onto the image plane to generate the dynamic occlusion mask specifically includes: calculating, based on the pre-calibrated 3D geometric model of the quay crane and the camera intrinsic parameter matrix, the 2D projected bounding box region of the spreader and wire ropes on the image plane; and mapping the 2D projected bounding box region to the size of the visual feature map to generate a binarized dynamic occlusion mask.

[0008] Preferably, constructing the attention decay weight field based on the dynamic occlusion mask further includes: calculating the current speed of the spreader and, when the spreader is determined to be descending, performing a morphological dilation on the edge of the dynamic occlusion mask, where the size of the dilation kernel is positively correlated with the absolute value of the spreader speed.

[0009] Preferably, using the attention decay weight field to correct the query and key feature similarity calculation of the spatiotemporal attention mechanism specifically includes: multiplying the query vector and the key vector element-wise by the spatial position weights of the attention decay weight field at their respective positions, and then performing the inner product, so that feature vectors located in the occluded projection area contribute zero to the similarity.

[0010] Preferably, the corrected query-key feature similarity calculation follows the formula below:

$$\mathrm{Sim}'(q_i, k_j) = (w_i \odot q_i)^{\top} (w_j \odot k_j)$$

In the formula, $\mathrm{Sim}'$ is the corrected similarity, $q_i$ is the query vector at position $i$, $k_j$ is the key vector at position $j$, $w_i$ and $w_j$ are the values of the attention decay weight field at the corresponding positions, and $\odot$ denotes element-wise multiplication.

[0011] Preferably, predicting future trajectory sequences based on the corrected spatiotemporal attention features further includes: obtaining the historical position sequence of pedestrians in the world coordinate system; constructing a risk field penalty matrix in the bird's-eye view from the mechanical status data; and modulating the position of the Gaussian distribution mean of the future trajectory points output by the decoder along the negative gradient direction of the risk field penalty matrix to generate the final predicted trajectory sequence.

[0012] Preferably, the position modulation of the Gaussian distribution mean during trajectory prediction follows the formula below:

$$\hat{\mu}_t = \mu_t - \lambda \, \nabla R(\mu_t)$$

In the formula, $\mu_t$ is the original predicted mean output by the decoder, $R$ is the risk field penalty matrix, $\nabla$ is the gradient operator, $\lambda$ is an adjustment factor positively correlated with the descent speed of the spreader, and $\hat{\mu}_t$ is the final predicted position mean after correction.

[0013] Preferably, the network training process of the method is optimized using a multi-task loss function, which includes a regularization loss term for constraining the feature responses of occluded regions; the regularization loss term penalizes high activation responses of visual feature maps within occluded regions based on a dynamic occlusion mask.

[0014] The present invention has the following beneficial effects: By introducing quay crane mechanical state data, the invention constructs an occlusion prior mask and applies physical constraints to correct the spatiotemporal attention calculation. This allows the network to focus on effective ground areas even under spreader occlusion or strong shadow interference, reducing the background noise response during feature extraction. The method links trajectory prediction to the mechanical motion risk field, so that the predicted path avoids dangerous areas where occlusion is imminent, improving the continuity and scene adaptability of the trajectory output. The occlusion region regularization loss promotes low activation of visual features in occluded areas, improving the stability of detection results. The overall solution does not rely on purely data-driven approaches and adapts well to structural occlusion and lighting changes in quay crane operation scenarios.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart of the pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism according to the present invention.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] As shown in Figure 1, this invention is a pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism. The method includes the following steps: collecting video images and synchronous mechanical status data of the quay crane operation area; projecting the mechanical status data onto the image plane to generate a dynamic occlusion mask; constructing an attention decay weight field based on the dynamic occlusion mask; using the attention decay weight field to correct the query and key feature similarity calculation of the spatiotemporal attention mechanism; and performing pedestrian bounding box detection and future trajectory sequence prediction based on the corrected spatiotemporal attention features.

[0019] The mechanical status data includes the position of the quay crane trolley, the height of the spreader off the ground, and the pitch angle of the main beam; the method also includes a step of aligning the video images with the mechanical status data by time-stamp linear interpolation.

[0020] Projecting the mechanical status data onto the image plane to generate the dynamic occlusion mask specifically includes: calculating, based on the pre-calibrated 3D geometric model of the quay crane and the camera intrinsic parameter matrix, the 2D projected bounding box region of the spreader and wire ropes on the image plane; and mapping the 2D projected bounding box region to the size of the visual feature map to generate a binarized dynamic occlusion mask.

[0021] Constructing the attention decay weight field based on the dynamic occlusion mask further includes: calculating the current speed of the spreader and, when the spreader is determined to be descending, performing a morphological dilation on the edge of the dynamic occlusion mask, where the size of the dilation kernel is positively correlated with the absolute value of the spreader speed.

[0022] Using the attention decay weight field to correct the query and key feature similarity calculation of the spatiotemporal attention mechanism specifically includes: multiplying the query vector and the key vector element-wise by the spatial position weights of the attention decay weight field at their respective positions, and then performing the inner product, so that feature vectors located in the occluded projection area contribute zero to the similarity.

[0023] The corrected query-key feature similarity calculation follows the formula below:

$$\mathrm{Sim}'(q_i, k_j) = (w_i \odot q_i)^{\top} (w_j \odot k_j)$$

In the formula, $\mathrm{Sim}'$ is the corrected similarity, $q_i$ is the query vector at position $i$, $k_j$ is the key vector at position $j$, $w_i$ and $w_j$ are the values of the attention decay weight field at the corresponding positions, and $\odot$ denotes element-wise multiplication.

[0024] Predicting future trajectory sequences based on the corrected spatiotemporal attention features further includes: obtaining the historical position sequence of pedestrians in the world coordinate system; constructing a risk field penalty matrix in the bird's-eye view from the mechanical status data; and modulating the position of the Gaussian distribution mean of the future trajectory points output by the decoder along the negative gradient direction of the risk field penalty matrix to generate the final predicted trajectory sequence.

[0025] The position modulation of the Gaussian distribution mean during trajectory prediction follows the formula below:

$$\hat{\mu}_t = \mu_t - \lambda \, \nabla R(\mu_t)$$

In the formula, $\mu_t$ is the original predicted mean output by the decoder, $R$ is the risk field penalty matrix, $\nabla$ is the gradient operator, $\lambda$ is an adjustment factor positively correlated with the descent speed of the spreader, and $\hat{\mu}_t$ is the final predicted position mean after correction.

[0026] The network training process of the method is optimized using a multi-task loss function, which includes a regularization loss term for constraining the feature responses of occluded regions. The regularization loss term penalizes high activation responses of visual feature maps within occluded regions based on a dynamic occlusion mask.

[0027] One specific application of this embodiment is as follows.

Step S1: Synchronous acquisition and spatiotemporal alignment preprocessing of multi-source heterogeneous data.

Step S11, Data acquisition source definition: This solution is deployed on the landside gantry of the container terminal quay crane or on top of the electrical room, and collects the following two synchronous data streams.

Visual data stream: RGB video sequences are acquired by two wide-angle surveillance cameras facing the landside interaction area. The image acquired at time $t$ is defined as $I_t \in \mathbb{R}^{H \times W \times 3}$, where $H$ is the height, $W$ is the width, and 3 denotes the three RGB channels.

[0028] Mechanical status data stream: Mechanical motion parameters are read in real time from the main control PLC of the quay crane via the OPC UA protocol. The parameters collected at each time are the position of the trolley on the bridge, the height of the spreader above the ground, and the main beam pitch angle.

[0029] Step S12, Spatiotemporal alignment and projection transformation preprocessing: Because the visual data and the PLC data are sampled at different frequencies (approximately 25 Hz for video and approximately 50 Hz for the PLC), timestamp-based linear interpolation is required so that each video frame has a one-to-one corresponding mechanical state vector.
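The timestamp alignment in step S12 can be illustrated with a minimal sketch in plain Python. The function name `interpolate_state`, the tuple layout of the state vector, and the sample values are assumptions made for illustration; the patent text does not specify an implementation.

```python
from bisect import bisect_left

def interpolate_state(plc_times, plc_states, frame_time):
    """Linearly interpolate a PLC state vector at a video-frame timestamp.

    plc_times  : sorted list of PLC sample timestamps (seconds)
    plc_states : list of tuples (trolley position, spreader height, pitch angle)
    frame_time : timestamp of the video frame (seconds)
    """
    # Clamp to the first/last sample when the frame lies outside the PLC record.
    if frame_time <= plc_times[0]:
        return tuple(plc_states[0])
    if frame_time >= plc_times[-1]:
        return tuple(plc_states[-1])
    hi = bisect_left(plc_times, frame_time)   # first sample at or after the frame
    lo = hi - 1
    t0, t1 = plc_times[lo], plc_times[hi]
    alpha = (frame_time - t0) / (t1 - t0)     # fractional position between samples
    return tuple(a + alpha * (b - a) for a, b in zip(plc_states[lo], plc_states[hi]))

# Hypothetical PLC samples at 50 Hz (every 0.02 s); the frame falls between two of them.
plc_times = [0.00, 0.02, 0.04, 0.06]
plc_states = [(10.0, 20.0, 0.0), (10.2, 19.8, 0.0), (10.4, 19.6, 0.0), (10.6, 19.4, 0.0)]
state = interpolate_state(plc_times, plc_states, 0.03)
```

Here the frame at 0.03 s lies halfway between the samples at 0.02 s and 0.04 s, so each component of `state` is the midpoint of the two neighbouring PLC readings.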

[0030] Using a pre-calibrated 3D geometric model of the quay crane and the camera's intrinsic and extrinsic parameter matrices, the machine coordinates are transformed to the image plane coordinate system. Specifically, for dynamic occluders such as the spreader and wire ropes, their 2D projected bounding box regions on the image plane are calculated. These regions are used in step S2 to generate the prior mask.
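As an illustration of the projection step, the sketch below projects the corners of an occluder into the image with a standard pinhole camera model and takes their axis-aligned bounding box. The camera matrices, the corner coordinates, and the function names are hypothetical, and lens distortion is ignored; the patent does not state which projection model is used.

```python
def project_point(K, R, t, X):
    """Project a 3D world point X to pixel coordinates with a pinhole camera.

    K : 3x3 intrinsic matrix, R : 3x3 rotation, t : length-3 translation
    (world -> camera). Lens distortion is ignored in this sketch.
    """
    # Camera-frame coordinates: Xc = R @ X + t
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division of K @ Xc
    u = sum(K[0][c] * Xc[c] for c in range(3)) / Xc[2]
    v = sum(K[1][c] * Xc[c] for c in range(3)) / Xc[2]
    return u, v

def projected_bbox(K, R, t, corners):
    """Axis-aligned box (u_min, v_min, u_max, v_max) around the projected corners."""
    us, vs = zip(*(project_point(K, R, t, X) for X in corners))
    return min(us), min(vs), max(us), max(vs)

# Hypothetical calibration: 1920x1080 camera, identity pose.
K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
# Hypothetical spreader outline, 20 m in front of the camera.
corners = [(-3.0, -1.0, 20.0), (3.0, -1.0, 20.0), (3.0, 1.0, 20.0), (-3.0, 1.0, 20.0)]
bbox = projected_bbox(K, R, t, corners)   # (810.0, 490.0, 1110.0, 590.0)
```

In a real deployment the corner coordinates would come from the crane's 3D geometric model evaluated at the current mechanical state, which is outside the scope of this sketch.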

[0031] Step S13, Scene ground plane prior calibration: Open-field images are acquired and the equation of the effective passable ground plane under the quay crane is calibrated:

$$n^{\top} X + d = 0$$

In the formula, $n$ is a $3 \times 1$ unit normal vector representing the orientation of the calibrated passable ground plane in the 3D world coordinate system; $^{\top}$ denotes the vector transpose; $X$ is a $3 \times 1$ coordinate vector representing the coordinates of any point in the 3D world coordinate system; and $d$ is a scalar constant representing the intercept of the ground plane in 3D space, that is, the signed distance from the plane to the origin of the world coordinate system. This plane constrains the search range of pedestrian foot points in 3D space. In subsequent steps, the constraint is used to filter out false alarms caused by birds in the air or metallic reflections.
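A small sketch of how the ground plane constraint might be used to reject candidates that are not on the ground. The plane is written as n·X + d = 0 with a unit normal, as described above; the tolerance `tol` and the example coordinates are assumptions, not values from the patent.

```python
def signed_distance(n, d, X):
    """Signed distance from point X to the plane n.X + d = 0 (n is a unit vector)."""
    return sum(ni * xi for ni, xi in zip(n, X)) + d

def on_ground(n, d, X, tol=0.3):
    """True if X lies within tol (metres, assumed) of the calibrated ground plane."""
    return abs(signed_distance(n, d, X)) <= tol

# Assumed world frame in which the ground is the plane z = 0.
n, d = (0.0, 0.0, 1.0), 0.0
foot_point = (4.0, 2.0, 0.05)   # plausible pedestrian foot point
bird = (4.0, 2.0, 6.0)          # airborne candidate that should be rejected

keep_foot = on_ground(n, d, foot_point)   # True
keep_bird = on_ground(n, d, bird)         # False
```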

[0032] Step S2: Construct an occlusion-aware, spatiotemporally decoupled attention network guided by mechanical motion constraints. This scheme uses a dual-stream encoder-decoder architecture whose core consists of the guided occlusion mask generation module introduced in step S22 and the spatiotemporally decoupled attention correction mechanism in step S24.

[0033] Step S21, Visual feature pyramid extraction: For the input single-frame image $I_t$, multi-scale visual feature maps $F_t \in \mathbb{R}^{C \times H' \times W'}$ are extracted using a feature pyramid network based on the ResNet-50 backbone, where $C$ is the number of feature channels and $H' \times W'$ is the spatial size after downsampling.

[0034] Step S22, Generation of a 2D occlusion thermal mask based on mechanical kinematics: Unlike existing technologies that rely on neural networks to learn occlusion attributes, this step derives the occlusion directly from a known physical kinematics model. Specifically:

Projection remapping: Based on the dynamic occlusion region obtained in step S12, a binary occlusion mask $M_t$ is generated, in which pixels belonging to the projection areas of the spreader, wire ropes, and quay crane structure are marked as 1 (occluded) and the remaining areas are marked as 0 (visible).
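The projection remapping can be sketched as rasterising an image-space bounding box onto the feature-map grid. The function name `bbox_to_mask`, the outward rounding rule, and the sizes used are assumptions for illustration.

```python
import math

def bbox_to_mask(bbox, image_size, feat_size):
    """Rasterise a pixel-space box into a binary mask at feature-map resolution.

    bbox       : (u_min, v_min, u_max, v_max) in image pixels
    image_size : (height, width) of the image
    feat_size  : (height, width) of the feature map
    Returns a list of rows with 1 = occluded, 0 = visible.
    """
    img_h, img_w = image_size
    fh, fw = feat_size
    sx, sy = fw / img_w, fh / img_h
    u0, v0, u1, v1 = bbox
    # Round outwards so that partially covered cells count as occluded.
    c0, c1 = max(0, math.floor(u0 * sx)), min(fw, math.ceil(u1 * sx))
    r0, r1 = max(0, math.floor(v0 * sy)), min(fh, math.ceil(v1 * sy))
    return [[1 if (r0 <= r < r1 and c0 <= c < c1) else 0 for c in range(fw)]
            for r in range(fh)]

# Hypothetical 64x128 image downsampled by 8 to an 8x16 feature map.
mask = bbox_to_mask((40, 16, 72, 40), image_size=(64, 128), feat_size=(8, 16))
occluded_cells = sum(map(sum, mask))   # 12 cells: rows 2-4, columns 5-8
```

Rounding outwards is a conservative choice: a cell that is only partly covered by the occluder is still treated as occluded.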

[0035] Risk attenuation field construction: Constraints on the motion trend of the quay crane machinery are introduced. The spreader speed $v_t$ at the current moment is calculated. If $v_t$ is negative (descent) and its absolute value exceeds a threshold, the state is classified as a high-risk operation. In this state, Gaussian diffusion is applied to the edges of the occluded area to generate the attention decay weight field $W_t$:

$$W_t(x, y) = 1 - \mathrm{Dilate}(M_t)(x, y)$$

In the formula, $W_t$ is the attention decay weight matrix at time $t$, with the same $H' \times W'$ size as the feature map; $x$ is the horizontal coordinate index in the image feature map coordinate system; $y$ is the vertical coordinate index in the image feature map coordinate system; $(x, y)$ is a specific 2D coordinate position in the image feature map coordinate system; $\mathrm{Dilate}(\cdot)$ is the image morphological dilation function; and $M_t$ is the binary occlusion mask matrix at time $t$, of the same size. This weight field explicitly tells the network that spatial locations directly below the descent path of the spreader, even if not occluded in the current frame, should not be treated as high-probability points when predicting the future trajectory of pedestrians, because of their extremely high risk.
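The following sketch shows one way a speed-dependent dilation could produce a weight field that is 0 inside the grown occlusion region and 1 elsewhere. It assumes the form W = 1 - Dilate(M), a square structuring element, and hypothetical parameters `speed_threshold` and `cells_per_mps`; the Gaussian smoothing of the edges mentioned in the text is omitted for brevity.

```python
def dilate(mask, radius):
    """Binary dilation with a (2*radius + 1) square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out

def decay_weight_field(mask, spreader_speed, speed_threshold=0.2, cells_per_mps=2.0):
    """Return W = 1 - Dilate(M); the mask grows only while the spreader descends.

    spreader_speed is negative during descent. The kernel radius increases with
    |speed| (assumed linear relation, in feature-map cells per m/s).
    """
    descending = spreader_speed < 0 and abs(spreader_speed) > speed_threshold
    radius = max(1, round(cells_per_mps * abs(spreader_speed))) if descending else 0
    grown = dilate(mask, radius)
    return [[1 - v for v in row] for row in grown]

# 5x5 mask with a single occluded cell in the centre.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
w_down = decay_weight_field(mask, spreader_speed=-0.5)   # descending: 3x3 block suppressed
w_up = decay_weight_field(mask, spreader_speed=0.5)      # hoisting: only the mask itself
```

With a descent speed of 0.5 m/s the radius is one cell, so nine cells are suppressed instead of one.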

[0036] Step S23, Temporal motion feature encoding: The visual feature maps of consecutive frames are concatenated along the channel dimension and fed into a 3D convolutional encoder to extract a spatiotemporal tensor containing local motion information.

[0037] Step S24, Occlusion-aware spatiotemporally decoupled attention calculation and correction:

Decoupled attention initialization: A multi-head self-attention mechanism is constructed, and the spatial attention matrix and the temporal attention matrix are calculated separately. The spatial attention matrix reflects the correlation strength between pixels within the current frame.

[0038] Physical prior injection: This scheme does not simply scale the attention weights. Instead, it uses the attention decay weight field generated in step S22 to perform spatially selective masking on the query and key feature vectors.

[0039] Let the similarity function for spatial attention be the inner product of the query and key vectors:

$$\mathrm{Sim}(q_i, k_j) = q_i^{\top} k_j$$

The corrected similarity is:

$$\mathrm{Sim}'(q_i, k_j) = \mathrm{Sim}(w_i \odot q_i,\; w_j \odot k_j)$$

In the formula, $\mathrm{Sim}'(\cdot)$ is the output of the similarity function after the physical prior injection correction of this method; $\mathrm{Sim}(\cdot)$ is the basic function used in the multi-head self-attention mechanism to calculate the similarity between a query vector and a key vector; $q_i$ is the query vector at image spatial position $i$ in the multi-head self-attention mechanism; $k_j$ is the key vector at image spatial position $j$; $w_i$ is the scalar weighting factor extracted from the attention decay weight field at position $i$; $w_j$ is the scalar weighting factor extracted from the attention decay weight field at position $j$; and $\odot$ denotes element-wise multiplication. This operation forces feature vectors located in occluded projection areas, or in areas suppressed because of the mechanical movement trend, to contribute zero to the inner product, thereby severing the erroneous association between background noise and foreground pedestrians.
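A minimal sketch of the corrected similarity for a single query, in plain Python. The scaling by the square root of the vector length and the exclusion of zero-weight keys from the softmax are assumptions added here (a zero logit would otherwise still receive probability mass); neither is stated in the patent text.

```python
import math

def softmax(logits):
    """Numerically stable softmax; -inf logits receive exactly zero weight."""
    peak = max(logits)
    if peak == -math.inf:               # every key is masked out
        return [0.0] * len(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def corrected_similarity(q, k, w_q, w_k):
    """Inner product of query and key after scaling each by its position weight."""
    return sum((w_q * a) * (w_k * b) for a, b in zip(q, k))

def attention_row(q, w_q, keys, key_weights):
    """Attention weights of one query over all keys using the corrected similarity."""
    scale = math.sqrt(len(q))           # assumed scaled dot-product convention
    logits = [corrected_similarity(q, k, w_q, w) / scale if w > 0 else -math.inf
              for k, w in zip(keys, key_weights)]
    return softmax(logits)

# The second key lies in the occluded region (weight 0) and is identical to the first.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
row = attention_row([1.0, 0.0], 1.0, keys, key_weights=[1.0, 0.0, 1.0])
```

Although the first two keys are identical, only the unoccluded one receives attention, which is the behaviour the physical prior is meant to enforce.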

[0040] Temporal feature recalibration: For the trajectory prediction branch, temporal attention is used to capture historical trajectory points. At this point, a physical space constraint regularization term is introduced: for each predicted trajectory point, the reprojection error to its projection on the calibrated ground plane is calculated, and if the error exceeds a threshold, the weight of that node in temporal attention propagation is reduced.

[0041] Step S3: Forward inference of the joint detection head and trajectory prediction head.

Step S31, Pedestrian bounding box detection branch: The visual feature map corrected in step S24 is fed into the detection head. The detection head consists of a classification branch and a regression branch and outputs the set of pedestrian bounding boxes for the current frame. Because the physical prior has suppressed spurious responses in the spreader area, the detection results show significantly less jitter at the occlusion boundary.

[0042] Step S32, Trajectory prediction branch with physical constraint enhancement:

Historical trajectory encoding: For each tracked pedestrian target, its position sequence in the world coordinate system over the past frames is extracted.

[0043] Predictive decoding: An LSTM decoder predicts the distribution of trajectory points over the future frames. At each decoding step, the mechanical motion risk field generated in step S22 is projected onto the bird's-eye view (BEV) as a penalty matrix, and the mean $\mu_t$ of the Gaussian distribution output by the decoder is modulated:

$$\hat{\mu}_t = \mu_t - \lambda \, \nabla R(\mu_t)$$

In the formula, $\hat{\mu}_t$ is the mean coordinate vector of the Gaussian distribution of the trajectory point at prediction time $t$ after correction by the physical risk field; $\mu_t$ is the uncorrected raw mean coordinate vector of the Gaussian distribution output by the trajectory prediction decoder at prediction time $t$; $\lambda$ is a non-negative scalar adjustment factor that is positively correlated with the descent speed of the spreader; $\nabla$ is the gradient operator; $R(\cdot)$ is the function defining the mechanical motion risk field in the BEV coordinate system; and $\nabla R(\mu_t)$ is the gradient vector of the risk field function $R$ at the point $\mu_t$. The formula means that if the original mean $\mu_t$ output by the decoder falls into a high-risk area that the spreader is about to pass through, the predicted point is pushed along the negative gradient direction of the risk field to a nearby safe area, such as the edge of the lane.
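The modulation rule, corrected mean = raw mean minus an adjustment factor times the risk gradient, can be tried out on a toy risk field. In the sketch below an analytic Gaussian bump stands in for the BEV penalty matrix, the gradient is taken by central differences, and the linear form of `adjustment_factor` with its `gain` is an assumption; the patent only states that the factor is positively correlated with the descent speed.

```python
import math

def gaussian_risk(x, y, cx=0.0, cy=0.0, sigma=2.0):
    """Toy BEV risk field: a Gaussian bump centred under the spreader (assumed shape)."""
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

def risk_gradient(risk, x, y, eps=1e-3):
    """Central-difference gradient of a scalar risk function at (x, y)."""
    gx = (risk(x + eps, y) - risk(x - eps, y)) / (2 * eps)
    gy = (risk(x, y + eps) - risk(x, y - eps)) / (2 * eps)
    return gx, gy

def adjustment_factor(descent_speed, gain=1.5):
    """Non-negative factor that grows with the downward speed magnitude (>= 0)."""
    return gain * max(0.0, descent_speed)

def modulate_mean(mu, risk, lam):
    """Shift a predicted mean along the negative risk gradient: mu - lam * grad R(mu)."""
    gx, gy = risk_gradient(risk, *mu)
    return mu[0] - lam * gx, mu[1] - lam * gy

mu = (1.0, 0.0)                                   # raw decoder mean, 1 m from the risk centre
lam = adjustment_factor(descent_speed=1.0)        # spreader descending at 1 m/s
mu_hat = modulate_mean(mu, gaussian_risk, lam)    # pushed away from the centre along +x
```

When the spreader is not descending, `adjustment_factor` returns 0 and the decoder output is left unchanged.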

[0044] The final output is a smooth future trajectory sequence.

[0045] Step S4: Offline training and convergence based on a multi-task loss function. This invention trains the aforementioned network by end-to-end supervised learning. The total loss function consists of three parts, which ensures that all data collected and generated in steps S1 to S3 are effectively used.

Step S41, Detection loss: The ground-truth bounding boxes annotated in step S1 are compared with the predictions from step S31 to compute the loss, which comprises Focal Loss for classification and CIoU Loss for bounding box regression.
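For reference, the classification part of the detection loss can be sketched with the standard binary focal loss, FL = -alpha_t (1 - p_t)^gamma log(p_t). The defaults alpha = 0.25 and gamma = 2 are the commonly used values and are an assumption here; the patent does not give its hyperparameters, and the CIoU term is not shown.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p : predicted probability of the positive (pedestrian) class, 0 < p < 1
    y : ground-truth label, 1 for pedestrian and 0 for background
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident, correct prediction: strongly down-weighted
hard = focal_loss(0.1, 1)   # confident, wrong prediction: dominates the loss
```

The factor (1 - p_t)^gamma shrinks the contribution of easy examples, which matters when background cells vastly outnumber pedestrians.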

[0046] Step S42, Trajectory prediction regression loss: The real future pedestrian trajectory sequences recorded in step S1 are compared with the predictions from step S32 to compute the mean squared error loss:

$$L_{\mathrm{traj}} = \frac{1}{T} \sum_{t=1}^{T} \lVert \hat{y}_t - y_t \rVert_2^2$$

In the formula, $L_{\mathrm{traj}}$ is the regression loss function value for the trajectory prediction task; $T$ is the total number of future time frames to be predicted; $t$ is the index variable for future time steps; $\hat{y}_t$ is the model-predicted 2D position coordinate vector of the pedestrian at time $t$ in the world coordinate system or the bird's-eye view coordinate system; and $y_t$ is the ground-truth 2D position coordinate vector of the pedestrian at time $t$ obtained from data collection and annotation.

Step S43, Regularization loss based on occlusion region suppression: To explicitly constrain the network to learn to ignore mechanical occlusion, the occlusion mask $M_t$ generated in step S22 is used to construct a regularization loss. This loss penalizes high activation responses of the feature map that fall within the occluded region:

$$L_{\mathrm{reg}} = \frac{1}{N} \sum_{(x, y)} M_t(x, y) \, \lVert F_t(x, y) \rVert_2^2$$

In the formula, $L_{\mathrm{reg}}$ is the value of the regularization loss function; $N$ is the normalization factor; $F_t$ is the multi-scale visual feature map tensor extracted by the visual feature pyramid network at time $t$; and $F_t(x, y)$ is the feature channel vector of the extracted feature map at spatial coordinates $(x, y)$. This loss forces the feature extractor to output a response close to zero at pixel locations occluded by the spreader, which, combined with the physical masking of the attention mechanism, doubly guarantees the robustness of the method.
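The two losses in steps S42 and S43 can be sketched in plain Python on nested lists. Two choices below are assumptions rather than statements from the patent: the regularizer uses the squared L2 norm of each feature vector, and the normalization factor is taken to be the number of occluded cells.

```python
def trajectory_mse(pred, target):
    """Mean over future steps of the squared Euclidean distance between 2D points."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target)) / len(pred)

def occlusion_reg_loss(features, mask):
    """Mean squared L2 norm of the feature vectors at occluded cells (mask == 1).

    features : features[r][c] is the channel vector at cell (r, c)
    mask     : mask[r][c] is 1 for occluded cells and 0 otherwise
    """
    total, count = 0.0, 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                total += sum(v * v for v in vec)
                count += 1
    return total / count if count else 0.0

pred = [(0.0, 0.0), (1.0, 1.0)]
target = [(0.0, 1.0), (1.0, 3.0)]
l_traj = trajectory_mse(pred, target)            # (1 + 4) / 2 = 2.5

features = [[[1.0, 2.0], [0.0, 0.0]],
            [[3.0, 4.0], [1.0, 1.0]]]
mask = [[1, 0],
        [1, 0]]
l_reg = occlusion_reg_loss(features, mask)       # (5 + 25) / 2 = 15.0
```

Only the two occluded cells in the left column contribute to `l_reg`; activations in visible cells are not penalized.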

[0047] Step S44, Joint optimization: The Adam optimizer is used to minimize the total loss, with backpropagation updating the network parameters described in step S2 until the model converges.
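To make the joint optimization concrete, the sketch below combines the three losses as a weighted sum and applies the standard Adam update rule to a single scalar parameter. The loss weights `w_traj` and `w_reg` are hypothetical, since the patent does not state how the three terms are weighted, and the toy quadratic objective stands in for the real network loss.

```python
import math

def total_loss(l_det, l_traj, l_reg, w_traj=1.0, w_reg=0.1):
    """Weighted sum of the three loss terms (weights are assumed, not from the patent)."""
    return l_det + w_traj * l_traj + w_reg * l_reg

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds t, m and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3).
x = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
history = []
for _ in range(200):
    grad = 2.0 * (x - 3.0)
    x = adam_step(x, grad, state, lr=0.1)
    history.append(x)
```

In practice a deep learning framework would supply both the gradients and the optimizer; the scalar version is shown only to make the update rule explicit.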

[0048] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A pedestrian detection and trajectory prediction method based on a spatiotemporal joint attention mechanism, characterized in that the method includes the following steps: collecting video images and synchronous mechanical status data of the quay crane operation area; projecting the mechanical status data onto the image plane to generate a dynamic occlusion mask; constructing an attention decay weight field based on the dynamic occlusion mask; using the attention decay weight field to correct the query and key feature similarity calculation of the spatiotemporal attention mechanism; and performing pedestrian bounding box detection and future trajectory sequence prediction based on the corrected spatiotemporal attention features.

2. The method of claim 1, wherein the mechanical state data includes the position of the quay crane trolley, the height of the spreader above the ground, and the pitch angle of the main beam; and the method further comprises a step of aligning the video images with the mechanical state data by timestamp-based linear interpolation.

3. The method of claim 1, wherein the step of projecting the mechanical state data onto the image plane to generate a dynamic occlusion mask comprises: calculating, based on the pre-calibrated 3D geometric model of the quay crane and the camera intrinsic parameter matrix, the two-dimensional projected bounding box region of the lifting equipment and wire rope on the image plane; and mapping the two-dimensional projected bounding box region to the size of the visual feature map to generate the binarized dynamic occlusion mask.

4. The method of claim 1, wherein constructing the attention decay weight field based on the dynamic occlusion mask further comprises: calculating the current speed of the spreader and, when the spreader is determined to be descending, performing a morphological dilation operation on the edge of the dynamic occlusion mask, wherein the size of the dilation kernel is positively correlated with the absolute value of the spreader speed.

5. The method of claim 1, wherein correcting the query and key feature similarity calculation of the spatiotemporal attention mechanism using the attention decay weight field comprises: multiplying the query vector and the key vector element-wise by the spatial position weights of the attention decay weight field at their respective positions, and then performing the inner product operation, so that the similarity contribution of feature vectors located in the occluded projection area is zero.

6. The method of claim 5, wherein the corrected query-key feature similarity calculation follows the formula: $S(i,j) = \left( W(i) \odot Q_i \right)^{\top} \left( W(j) \odot K_j \right)$, wherein $i$ is the position of the query vector $Q_i$, $j$ is the position of the key vector $K_j$, $W(\cdot)$ is the value of the attention decay weight field at the corresponding position, and $\odot$ denotes element-wise multiplication.

7. The method of claim 1, wherein predicting the future trajectory sequence based on the corrected spatiotemporal attention features further comprises: obtaining the historical position sequence of the pedestrian in the world coordinate system; constructing a bird's-eye-view risk field penalty matrix from the mechanical state data; and modulating the position of the Gaussian distribution mean of the future trajectory points output by the decoder along the negative gradient direction of the risk field penalty matrix to generate the final predicted trajectory sequence.

8. The method of claim 7, wherein the position modulation of the Gaussian distribution mean during trajectory prediction follows the formula: $\mu'_t = \mu_t - \lambda \nabla R(\mu_t)$, wherein $\mu_t$ is the original prediction mean output by the decoder, $R$ is the risk field penalty matrix, $\nabla$ is the gradient operator, $\lambda$ is the adjustment factor positively correlated with the lowering speed of the spreader, and $\mu'_t$ is the final corrected prediction position mean.

9. The pedestrian detection and trajectory prediction method based on a spatio-temporal joint attention mechanism according to claim 1, characterized in that the network training process of the method is optimized using a multi-task loss function, which includes a regularization loss term for constraining the feature responses of occluded regions; based on the dynamic occlusion mask, the regularization loss term penalizes high activation responses of visual feature maps within occluded regions.
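For illustration only, the operations recited in claims 4 to 8 might be sketched in NumPy as follows. The sketch is not part of the claims and rests on several assumptions: a square dilation kernel, a negative speed value meaning that the spreader is descending, a binary weight field (0 inside the mask, 1 elsewhere), a finite-difference gradient of the risk matrix, and a pixel-to-speed scale `px_per_mps`. All function and parameter names are hypothetical.

```python
import numpy as np


def dilate(mask, radius):
    """Binary dilation with a (2r+1) x (2r+1) square kernel, NumPy only."""
    if radius <= 0:
        return mask.astype(bool)
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius)   # pads with False
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def decay_weights(mask, spreader_speed, px_per_mps=1.0):
    """Weight field: 0 inside the (possibly dilated) mask, 1 elsewhere.

    The mask is dilated only while descending (speed < 0, assumed convention),
    with a kernel radius that grows with the absolute speed.
    """
    radius = 0
    if spreader_speed < 0:
        radius = int(np.ceil(px_per_mps * abs(spreader_speed)))
    return 1.0 - dilate(mask, radius).astype(float)


def corrected_similarity(q, k, w):
    """q, k: (N, C) features at N flattened positions; w: (N,) weights.

    Each query and key is scaled by its positional weight before the
    inner product, so occluded positions contribute zero similarity.
    """
    return (q * w[:, None]) @ (k * w[:, None]).T


def modulate_mean(mu, risk, lam):
    """Shift a bird's-eye-view mean (x, y) along the negative risk gradient."""
    gy, gx = np.gradient(risk)   # derivatives along rows (y) and columns (x)
    ix = int(np.clip(np.rint(mu[0]), 0, risk.shape[1] - 1))
    iy = int(np.clip(np.rint(mu[1]), 0, risk.shape[0] - 1))
    return np.array([mu[0] - lam * gx[iy, ix], mu[1] - lam * gy[iy, ix]])


# Toy example: a 5x5 feature map with one occluded cell at the center.
mask = np.zeros((5, 5))
mask[2, 2] = 1
w = decay_weights(mask, spreader_speed=-1.0)   # descending: 3x3 block zeroed
q = k = np.ones((25, 4))
sim = corrected_similarity(q, k, w.ravel())

# Risk that increases along x pushes the predicted mean toward smaller x.
risk = np.tile(np.arange(5.0), (5, 1))
new_mu = modulate_mean(np.array([2.0, 2.0]), risk, lam=0.5)
```

In the toy example, dilation widens the single occluded cell to a 3x3 block, every similarity involving a position inside that block is zero, and the predicted mean moves from (2.0, 2.0) to (1.5, 2.0), away from the region of increasing risk.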