A method and device for high-altitude non-cooperative target intention recognition and behavior prediction

By preprocessing and multimodal residual analysis of high-altitude observation data, combined with extended Kalman filter and encoder-decoder models, the real-time and accuracy problems in high-altitude non-cooperative target monitoring and prediction are solved, enabling in-depth interpretation and efficient prediction of target behavior.

CN122200537A · Pending · Publication Date: 2026-06-12 · 深圳市斯贝达电子有限公司


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: 深圳市斯贝达电子有限公司
Filing Date: 2026-03-07
Publication Date: 2026-06-12


IPC: G06V20/52; G06V10/764; G06V10/77; G06V10/72; G06V10/80; G06V10/82; G06N3/0464; G06N3/045

AI Tagging

Application Domain

Character and pattern recognition; Biological models


AI Technical Summary

⚠Technical Problem

Existing technologies suffer from high data transmission latency and poor real-time performance in monitoring and predicting non-cooperative targets at high altitudes. Furthermore, traditional prediction methods exhibit decreased accuracy during target maneuvers and struggle to integrate multimodal information to improve identification accuracy.

⚗Method used

By preprocessing high-altitude observation data, a standardized input tensor and synchronous metadata are constructed. The image and metadata processing is accelerated by using a neural network processing unit and a central processing unit. Combined with an extended Kalman filter and multimodal residual analysis, a real-time state vector is constructed to determine the probability distribution of intent. A temporal prediction model is then constructed using an encoder-decoder system to predict behavior.

🎯Benefits of technology

It enables a deeper understanding of the behavior of non-cooperative targets at high altitudes, improves the real-time performance and accuracy of predictions, and detects abnormal maneuvering behavior promptly, yielding more forward-looking prediction results.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of image processing, in particular to a high-altitude non-cooperative target intention recognition and behavior prediction method and device, the method comprising the following steps: acquiring high-altitude observation data at a current moment, preprocessing the high-altitude observation data to obtain a standardized input tensor and synchronous metadata; calling a preset target detection model according to the standardized input tensor to obtain a preliminary detection result, constructing a real-time state vector of a target based on the preliminary detection result and the synchronous metadata; calculating a multi-modal residual error vector based on the real-time state vector, judging whether a change trend of the multi-modal residual error vector meets a preset trigger condition of a maneuvering event; if the change trend meets the trigger condition, determining an intention probability distribution vector of the target based on the real-time state vector; and determining a behavior prediction result of the target based on the intention probability distribution vector. The application has the effect of improving the accuracy of target intention recognition and behavior prediction.


Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to a method and device for recognizing the intent and predicting the behavior of non-cooperative targets at high altitudes.

Background Technology

[0002] Currently, with the increasing number of high-altitude targets such as drones and near-space vehicles, effective monitoring, intent identification, and behavior prediction of high-altitude non-cooperative targets have become key technological aspects in fields such as airspace security, national defense early warning, and air traffic management. High-altitude non-cooperative targets typically exhibit high maneuverability and behavioral uncertainty; therefore, rapid and accurate prediction of their behavior is a prerequisite for making effective response decisions.

[0003] Existing monitoring and prediction technologies for high-altitude targets typically rely on large ground-based computing centers to perform offline or near-offline analysis of sensor data. For example, large volumes of observation data acquired by front-end sensors such as radar and electro-optical pods are transmitted back to a back-end server, where complex models process the data. These solutions have the following drawbacks. First, the latency introduced by data transmission makes them unable to meet real-time requirements in highly dynamic adversarial scenarios. Second, traditional prediction methods often rely on a single dynamic model, such as Kalman filtering, so prediction accuracy drops sharply when targets perform tactical maneuvers such as evasion or deception. Third, they struggle to integrate multimodal information, such as the target's visual features, to improve recognition accuracy. Therefore, a method for recognizing the intent and predicting the behavior of high-altitude non-cooperative targets is urgently needed.

Summary of the Invention

[0004] To improve the accuracy of target intent recognition and behavior prediction, this application provides a method and device for high-altitude non-cooperative target intent recognition and behavior prediction.

[0005] The above-mentioned objective of this application is achieved through the following technical solution: a method for recognizing the intent and predicting the behavior of non-cooperative targets at high altitudes, the method comprising:
acquiring the high-altitude observation data at the current moment, and preprocessing the high-altitude observation data to obtain a standardized input tensor and synchronization metadata;
invoking a preset target detection model based on the standardized input tensor to obtain preliminary detection results, and constructing a real-time state vector of the target based on the preliminary detection results and the synchronization metadata;
calculating a multimodal residual vector based on the real-time state vector, and determining whether the changing trend of the multimodal residual vector meets the triggering condition of a preset maneuvering event;
if the changing trend satisfies the triggering condition, determining the intent probability distribution vector of the target based on the real-time state vector;
determining the behavior prediction result of the target based on the intent probability distribution vector.

[0006] By employing the aforementioned technical solution, a standardized input tensor and synchronization metadata are obtained by preprocessing the high-altitude observation data, and a real-time state vector of the target is constructed from them, providing a comprehensive quantitative description for subsequent precise analysis. Calculating the multimodal residual vector and determining whether its changing trend meets the triggering condition quantifies the difference between observation and prediction in both the dynamic and visual dimensions, enabling timely detection of abnormal maneuvering behavior. After a maneuver is triggered, the target's intent probability distribution vector is determined based on the real-time state vector, mapping the low-level state vector to a high-level intent space and giving a deeper understanding of the purpose behind the target's behavior. Determining the final behavior prediction result from this intent probability distribution vector injects the intent into the prediction model as high-level prior knowledge, which yields more forward-looking predictions and improves the system's overall ability to understand and predict the behavior of non-cooperative high-altitude targets.

[0007] In a preferred embodiment, this application can be further configured such that the preprocessing of the high-altitude observation data to obtain a standardized input tensor and synchronization metadata specifically includes:
obtaining image data and metadata from the high-altitude observation data;
calling the neural network processing unit to perform standardized image operations on the image data to obtain the standardized input tensor, wherein the standardized image operations include at least decoding, scaling, and normalization;
calling the central processing unit to perform structured processing on the metadata to obtain the synchronization metadata.

[0008] By adopting the above technical solution, the acquisition of high-altitude observation data is refined into the acquisition of image data and metadata, which are processed by the neural network processing unit and the central processing unit respectively. The neural network processing unit can accelerate the image processing task with high parallelism, and the central processing unit can handle the complex metadata parsing. Thus, while ensuring data quality, the real-time performance and efficiency of data preprocessing are greatly improved.

[0009] In a preferred embodiment, this application can be further configured such that invoking a preset target detection model based on the standardized input tensor to obtain preliminary detection results, and constructing a real-time state vector of the target based on the preliminary detection results and the synchronization metadata, specifically includes:
obtaining the target detection model constructed based on the attention mechanism, inputting the standardized input tensor into the target detection model, and performing target localization inference through the target detection model;
obtaining the preliminary detection results from the target detection model, wherein the preliminary detection results include at least the target bounding box coordinates, multimodal visual feature data, recognition confidence, and signal-to-noise ratio of the target;
obtaining the sensor attitude data and corresponding timestamp information from the synchronization metadata, and transforming the target bounding box in the preliminary detection results from the two-dimensional image coordinate system to the three-dimensional world coordinate system to obtain the dynamic information of the target;
constructing the real-time state vector based on the dynamic information and the multimodal visual feature data.

[0010] By adopting the above technical solution and using a target detection model based on an attention mechanism, it is possible to intelligently focus on key areas in the image, thereby effectively capturing weak targets even in low signal-to-noise ratio environments. By obtaining preliminary detection results containing multimodal visual feature data, it is possible to extract visual fingerprints describing the deep essential attributes such as shape, texture, and infrared radiation from the target's pixel set. By transforming the target bounding box to a three-dimensional world coordinate system to obtain dynamic information, it is possible to achieve accurate back projection of the target state from the two-dimensional image plane to the three-dimensional physical space. Finally, by fusing dynamic information and multimodal visual feature data to construct a real-time state vector, it provides information density and richness for subsequent analysis.

[0011] In a preferred embodiment, this application can be further configured such that the calculation of the multimodal residual vector based on the real-time state vector specifically includes:
obtaining the historical state vector of the previous moment, wherein the historical state vector includes historical multimodal visual feature data;
using an extended Kalman filter to perform state prediction on the historical state vector to obtain a dynamic prediction value;
comparing the dynamic prediction value with the dynamic information in the real-time state vector to obtain the dynamic residual, and calculating the distance between the multimodal visual feature data in the real-time state vector and the historical multimodal visual feature data to obtain the visual residual;
performing weighted fusion of the dynamic residual and the visual residual to obtain the multimodal residual vector.

[0012] By adopting the above technical solution and using the extended Kalman filter for state prediction, it is possible to make theoretical predictions of the inertial behavior of the target based on the laws of physical kinematics, providing a benchmark for anomaly detection. By calculating the dynamic residual and visual residual separately and performing weighted fusion, it is possible to quantitatively evaluate the abnormal deviation of the target from two independent dimensions of motion and form, making the detection more comprehensive.
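The residual computation described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the constant-acceleration motion model stands in for the prediction step of an extended Kalman filter (covariance propagation is omitted), the cosine distance is one possible choice of "distance" between feature vectors, and the function names and the weights 0.7 and 0.3 are assumptions introduced only for illustration.

```python
import math

def predict_constant_acceleration(state, dt):
    """Propagate [x, y, z, vx, vy, vz, ax, ay, az] forward by dt seconds.

    Assumed motion model; a full EKF would also propagate a covariance matrix.
    """
    pos, vel, acc = state[0:3], state[3:6], state[6:9]
    new_pos = [p + v * dt + 0.5 * a * dt * dt for p, v, a in zip(pos, vel, acc)]
    new_vel = [v + a * dt for v, a in zip(vel, acc)]
    return new_pos + new_vel + list(acc)

def dynamic_residual(predicted, observed):
    """Euclidean norm of the difference between observed and predicted dynamics."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)))

def visual_residual(current_feat, previous_feat):
    """Cosine distance between feature vectors (0.0 means same direction)."""
    dot = sum(a * b for a, b in zip(current_feat, previous_feat))
    na = math.sqrt(sum(a * a for a in current_feat))
    nb = math.sqrt(sum(b * b for b in previous_feat))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumed convention: treat an empty feature as maximally distant
    return 1.0 - dot / (na * nb)

def fused_residual(dyn_res, vis_res, w_dyn=0.7, w_vis=0.3):
    """Weighted two-component residual vector; the weights are illustrative."""
    return [w_dyn * dyn_res, w_vis * vis_res]
```

A large first component with a small second component would suggest a change in motion without a change in appearance, which is the kind of separation the two residual dimensions are meant to provide.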

[0013] In a preferred embodiment, this application can be further configured such that determining whether the changing trend of the multimodal residual vector satisfies the triggering condition of a preset maneuvering event specifically includes:
calculating the cumulative statistic at the current moment using a cumulative sum (CUSUM) test algorithm, based on the multimodal residual vector and preset drift parameters;
determining the determination threshold for the preset maneuvering event based on the signal-to-noise ratio of the target, and determining whether the cumulative statistic exceeds the determination threshold;
if the cumulative statistic exceeds the determination threshold, determining that the changing trend of the multimodal residual vector meets the triggering condition.

[0014] By adopting the above technical solution and using the cumulative sum (CUSUM) test algorithm, a sustained growth trend in the residual can be amplified and confirmed, thereby clearly distinguishing real maneuvering behavior from isolated random noise. By dynamically determining the determination threshold based on the signal-to-noise ratio, the detection system can automatically adjust its sensitivity under different observation conditions, thereby reducing the false alarm rate while maintaining a high detection rate.
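As a rough illustration of this trigger logic, the sketch below runs a one-sided CUSUM over a scalar residual magnitude. The source does not give the formula that maps signal-to-noise ratio to the threshold, so `threshold_from_snr` and its constants are purely hypothetical; it assumes that noisier observations (lower SNR) should receive a higher threshold.

```python
def cusum_update(prev_stat, residual, drift):
    """One-sided CUSUM step: accumulate the residual's excess over the drift."""
    return max(0.0, prev_stat + residual - drift)

def threshold_from_snr(snr_db, base=8.0, gain=0.25, floor=2.0):
    """Hypothetical rule: lower SNR gives a higher (less sensitive) threshold."""
    return max(floor, base - gain * snr_db)

def detect_maneuver(residuals, drift, snr_db):
    """Return the index at which the statistic first exceeds the threshold, else None."""
    stat = 0.0
    threshold = threshold_from_snr(snr_db)
    for i, r in enumerate(residuals):
        stat = cusum_update(stat, r, drift)
        if stat > threshold:
            return i
    return None

# Three quiet frames followed by a sustained residual increase.
frames = [0.1, 0.2, 0.1, 2.0, 2.0, 2.0, 2.0]
print(detect_maneuver(frames, drift=0.5, snr_db=13.7))
```

Because the statistic is reset to zero whenever the residual stays below the drift, an isolated spike decays away, while a sustained rise accumulates until it crosses the threshold.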

[0015] In a preferred embodiment, this application can be further configured such that determining the intent probability distribution vector of the target based on the real-time state vector specifically includes:
calculating the similarity between the dynamic information and multimodal visual feature data in the real-time state vector and a preset multimodal intent template library, and calculating the initial likelihood vector of the real-time state vector under each preset intent;
using the initial likelihood vector as the current observation evidence and the intent probability distribution vector of the previous time step as the prior probability, calculating the posterior intent probability distribution vector of the current time step, and using the posterior intent probability distribution vector as the intent probability distribution vector of the target.

[0016] By adopting the above technical solution, similarity calculation between the real-time state vector and the multimodal intent template library can be performed, and the degree of conformity between the current target state and the known typical intent pattern can be quickly evaluated based on template matching. This provides preliminary quantitative evidence for intent recognition. By using the intent probability distribution of the previous time step as the prior probability and combining it with the current observation evidence for Bayesian update, the intent judgment result can be effectively smoothed and corrected, avoiding drastic jumps in intent judgment caused by single-frame observation noise. This makes the final intent recognition result more stable and reliable.
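The Bayesian update step can be sketched in a few lines of Python. This is a minimal sketch, assuming the likelihood vector has already been produced by template matching; the function name and the three-intent example are hypothetical.

```python
def bayes_update(prior, likelihood):
    """Posterior over intents: elementwise prior * likelihood, renormalized."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        # No intent explains the observation; keep the prior (assumed fallback).
        return list(prior)
    return [u / total for u in unnormalized]

# Hypothetical example with three intents.
prior = [0.6, 0.3, 0.1]
likelihood = [0.2, 0.7, 0.1]
posterior = bayes_update(prior, likelihood)
print(posterior)
```

Feeding each posterior back in as the next prior is what smooths the estimate: a single noisy frame shifts the distribution, but does not overturn accumulated evidence.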

[0017] In a preferred embodiment, this application can be further configured such that determining the behavior prediction result of the target based on the intent probability distribution vector specifically includes:
determining the final intent label based on the intent probability distribution vector;
constructing a prediction input sequence by appending the final intent label, in encoded form, to the end of a sequence of multiple historical state vectors within a preset length;
inputting the prediction input sequence into a preset temporal prediction model based on an encoder and decoder, and obtaining the behavior prediction result of the target from the temporal prediction model, wherein the behavior prediction result includes at least a high-resolution predicted future trajectory and a warning of future maneuvering behavior.

[0018] By adopting the above technical solution, the final intent label is determined based on the intent probability distribution vector, avoiding prediction uncertainty caused by fluctuations in intent probability. By encoding the intent label and attaching it to the historical state vector sequence to construct the prediction input sequence, heterogeneous fusion of high-level semantic information and low-level temporal data is achieved, enabling the prediction model to understand the target's motivation. By using a temporal prediction model built based on encoder and decoder, the context vector that compresses historical information and intent can be gradually unfolded into a serialized prediction of future states, thereby generating more forward-looking prediction trajectories and maneuver warnings, significantly improving the accuracy and practical value of behavior prediction.
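One way to picture the sequence construction is the sketch below. The source says only that the label is appended "in coded form", so the one-hot encoding, the zero-padding to the state dimension, and the function names are all assumptions; the encoder-decoder model itself is not shown.

```python
def one_hot(index, num_classes):
    """One-hot encode an intent index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def build_prediction_sequence(history, intent_index, num_intents):
    """Append an encoded intent token to a list of historical state vectors.

    The token is zero-padded to the state dimension so that every element of
    the sequence has the same length (an assumed encoding choice).
    """
    dim = len(history[0])
    if num_intents > dim:
        raise ValueError("state dimension too small to hold the intent code")
    token = one_hot(intent_index, num_intents) + [0.0] * (dim - num_intents)
    return [list(vec) for vec in history] + [token]

# Three toy 5-dimensional state vectors and intent 1 of 3.
history = [[0.1] * 5, [0.2] * 5, [0.3] * 5]
sequence = build_prediction_sequence(history, intent_index=1, num_intents=3)
print(len(sequence), sequence[-1])
```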

[0019] In a preferred embodiment, this application can be further configured such that determining the final intent label based on the intent probability distribution vector specifically includes:
continuously monitoring the intent probability distribution vector;
if the probability value of a single intent in the intent probability distribution vector exceeds a preset threshold more than a preset number of times within a preset time period, determining that single intent as the final intent label.

[0020] By adopting the above technical solution, and by continuously monitoring the intent probability distribution vector, and setting a time consistency judgment mechanism that determines a single intent as the final intent label only when the probability value of a single intent continuously and stably exceeds the confidence threshold within a preset time period, the instantaneous jump in intent probability caused by single-frame observation noise or when the target is in a brief ambiguous state can be effectively filtered out. This avoids frequent jitter in intent decision-making and significantly improves the decision confidence and time stability of the final intent label.
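The time-consistency rule can be sketched as a sliding-window counter. In this sketch the "preset time period" is approximated by a fixed number of recent frames, and the class name and parameter values are hypothetical.

```python
from collections import deque

class IntentLatch:
    """Confirm an intent only after repeated high-probability observations."""

    def __init__(self, window, threshold, min_count):
        self.history = deque(maxlen=window)  # most recent probability vectors
        self.threshold = threshold
        self.min_count = min_count

    def update(self, probs):
        """Add one probability vector; return the confirmed intent index or None."""
        self.history.append(list(probs))
        for intent in range(len(probs)):
            hits = sum(1 for p in self.history if p[intent] > self.threshold)
            if hits > self.min_count:
                return intent
        return None

latch = IntentLatch(window=5, threshold=0.8, min_count=2)
for frame in ([0.9, 0.1], [0.9, 0.1], [0.9, 0.1]):
    label = latch.update(frame)
print(label)
```

With these settings a single frame above the threshold never produces a label; at least three such frames inside the five-frame window are required.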

[0021] The second objective of this invention is achieved through the following technical solution: An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the above-described method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes.

[0022] In summary, this application includes at least one of the following beneficial technical effects:
1. Preprocessing the high-altitude observation data yields a standardized input tensor and synchronization metadata, from which a real-time state vector of the target is constructed, providing a comprehensive quantitative description for subsequent accurate analysis. Calculating the multimodal residual vector and judging whether its changing trend meets the triggering condition quantifies the difference between observation and prediction in both the dynamic and visual dimensions, so that abnormal maneuvering behavior of the target is detected promptly. After a maneuver is triggered, the target's intent probability distribution vector is determined based on the real-time state vector, mapping the low-level state vector to a high-level intent space and interpreting the purpose behind the target's behavior. Determining the final behavior prediction result from this intent probability distribution vector injects the intent into the prediction model as high-level prior knowledge, which yields more forward-looking predictions and improves the system's overall ability to understand and predict the behavior of non-cooperative high-altitude targets.

Attached Figure Description

[0023] Figure 1 is a flowchart of the method for recognizing the intent and predicting the behavior of non-cooperative targets at high altitudes in one embodiment of this application.
Figure 2 is a flowchart of step S10 of the method in one embodiment of this application.
Figure 3 is a flowchart of step S20 of the method in one embodiment of this application.
Figure 4 is a flowchart of step S30 of the method in one embodiment of this application.
Figure 5 is another flowchart of step S30 of the method in one embodiment of this application.
Figure 6 is a flowchart of step S40 of the method in one embodiment of this application.
Figure 7 is a flowchart of step S50 of the method in one embodiment of this application.
Figure 8 is a flowchart of step S51 of the method in one embodiment of this application.
Figure 9 is a schematic diagram of the internal structure of an electronic device according to an embodiment of this application.

Detailed Implementation

[0024] The following embodiments will help those skilled in the art to further understand the function of this application, but do not limit this application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of this application. These all fall within the protection scope of this application.

[0025] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0026] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

[0027] The present application will be further described in detail below with reference to the accompanying drawings.

[0028] In one embodiment, as shown in Figure 1, this application discloses a method for recognizing the intent and predicting the behavior of non-cooperative targets at high altitudes, which specifically includes the following steps: S10: Obtain the high-altitude observation data at the current moment, preprocess the high-altitude observation data, and obtain the standardized input tensor and synchronization metadata.

[0029] Specifically, as shown in Figure 2, step S10 includes: S11: Obtain image data and metadata from the high-altitude observation data.

[0030] Specifically, high-altitude observation data typically comes from raw and unstructured data from sensors such as electro-optical pods. Directly inputting this data into subsequent intelligent models leads to low processing efficiency and accuracy. Therefore, preprocessing is necessary to transform this data into a unified format that subsequent algorithm modules can process efficiently. This provides a clean, well-organized, and synchronized data foundation for subsequent target detection and state estimation. For example, an integrated electro-optical pod sends 30 data packets per second through a network port. Each data packet contains both H.265-encoded infrared image frames and metadata recording the pod's azimuth, elevation, GPS coordinates, and a precise atomic clock timestamp.

[0031] S12: By calling the neural network processing unit, image data is standardized to obtain a standardized input tensor. The standardized image operations include at least decoding, scaling and normalization.

[0032] Specifically, image processing is accelerated by the Neural Processing Unit (NPU) in the edge chip, which is designed for massively parallel floating-point operations. For example, the H.265 bitstream is first decoded in real time into a 1920x1080 raw RGB image. Then, to match the fixed input size of the subsequent detection model, the image is scaled to 640x640 using bicubic interpolation, which preserves image quality well. Finally, the RGB value of each pixel is linearly mapped from the integer range [0, 255] to the floating-point range [-1, 1]. This normalization helps prevent vanishing or exploding gradients during neural network training and keeps inference inputs in the same range the model was trained on, thereby improving the stability and convergence speed of the model.
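The linear mapping from [0, 255] to [-1, 1] amounts to dividing by 127.5 and subtracting 1. The pure-Python sketch below shows only that normalization step; decoding and bicubic scaling are omitted, the nested-list image layout and function names are assumptions, and a real deployment would run this on the NPU rather than in Python.

```python
def normalize_pixel(value):
    """Map a channel value from [0, 255] to [-1.0, 1.0]."""
    return value / 127.5 - 1.0

def normalize_image(image):
    """Normalize an image stored as nested lists: image[row][col][channel]."""
    return [[[normalize_pixel(c) for c in pixel] for pixel in row] for row in image]

# A 1x2 RGB image: one black pixel and one white pixel.
tiny = [[[0, 0, 0], [255, 255, 255]]]
print(normalize_image(tiny))
```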

[0033] S13: By calling the central processing unit, the metadata is processed in a structured manner to obtain synchronized metadata.

[0034] Specifically, unlike the NPU, the Central Processing Unit (CPU) is better suited to complex logical judgments and serial parsing tasks. Therefore, the CPU parses the metadata. For example, the CPU reads the binary metadata stream and parses it byte by byte according to a preset communication protocol such as the MAVLink protocol, transforming the unstructured byte stream into a structure containing explicit fields, such as {timestamp:1668576000.123456,azimuth:135.7°,elevation:45.2°}. Because each structured record carries a timestamp, the synchronization metadata can be accurately associated with other data from the same moment.
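To make the parsing step concrete, the sketch below unpacks a fixed-layout binary record with Python's `struct` module. The 16-byte layout (little-endian double timestamp followed by two 32-bit floats) is a hypothetical format chosen for illustration; it is not the MAVLink wire format and not the layout used by the pod in the source.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: little-endian double (timestamp), float (azimuth), float (elevation).
PACKET_FORMAT = "<dff"

@dataclass
class SyncMetadata:
    timestamp: float
    azimuth_deg: float
    elevation_deg: float

def parse_metadata(payload: bytes) -> SyncMetadata:
    """Turn a raw byte record into a structure with explicit fields."""
    expected = struct.calcsize(PACKET_FORMAT)
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    timestamp, azimuth, elevation = struct.unpack(PACKET_FORMAT, payload)
    return SyncMetadata(timestamp, azimuth, elevation)

raw = struct.pack(PACKET_FORMAT, 1668576000.123456, 135.7, 45.2)
print(parse_metadata(raw))
```

Storing the timestamp as a 64-bit double keeps microsecond resolution at this magnitude, whereas a 32-bit float would not; the angles tolerate 32-bit precision.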

[0035] S20: Based on the standardized input tensor, call the preset target detection model to obtain preliminary detection results. Based on the preliminary detection results and synchronous metadata, construct the real-time state vector of the target.

[0036] Specifically, as shown in Figure 3, step S20 includes: S21: Obtain the target detection model built based on the attention mechanism, input the standardized input tensor into the target detection model, and perform target localization inference through the target detection model.

[0037] Specifically, this object detection model based on the attention mechanism shows advantages over traditional convolutional models when dealing with high-altitude scenes. For example, when there are clouds, ground noise and a very dark target in the observed image, the self-attention module inside the model will calculate the correlation between each pixel region in the image and all other regions. This allows the model to learn that the pixels in the "target" region usually have similar motion patterns or texture features, while having a low correlation with the background region. Therefore, the model will allocate more weights and computational resources to these highly correlated target regions, thus achieving effective focusing and accurate localization of the target even in environments with extremely low signal-to-noise ratios.

[0038] More specifically, the object detection model based on the attention mechanism is constructed using the following steps: First, a convolutional neural network (CNN) is selected as the basic backbone network, such as ResNet-50 or EfficientNet, to extract hierarchical basic visual features from the input normalized image tensor. Second, attention modules are embedded in parallel or in series after multiple key feature layers of this backbone network. For example, channel attention modules can learn the importance of different feature channels and perform weight recalibration, and spatial attention modules can identify which regions on the feature map are more critical for the object detection task. Third, the multi-scale feature maps enhanced by the attention modules are fed into a feature pyramid network for fusion to generate fused features that are robust to targets of different sizes. Finally, two parallel prediction heads are connected to the upper layer of the fused feature maps. One head is responsible for regressing the bounding box of each potential target, and the other head is responsible for classifying the target and outputting its recognition confidence. The entire model is obtained through end-to-end deep learning training on a labeled dataset containing a large number of high-altitude target samples such as drones. During training, the loss function simultaneously optimizes the localization accuracy and classification accuracy, enabling the model to accurately locate and identify non-cooperative targets in complex high-altitude backgrounds.

[0039] S22: Obtain preliminary detection results from the target detection model, wherein the preliminary detection results include at least the target bounding box coordinates, multimodal visual feature data, recognition confidence, and the target signal-to-noise ratio.

[0040] Specifically, the preliminary detection results are a structured data report that records the model's analysis of the target in the current frame. For example, the target bounding box coordinates are [301.5, 452.0, 315.8, 460.3], providing a localization box accurate to the sub-pixel level. The multimodal visual feature data is a 1024-dimensional feature vector output by the penultimate layer of the model. This high-dimensional vector is a deep abstraction of the original pixel information. Different segments of the vector may encode visual fingerprint information from several modalities, such as the target's shape and contour, the infrared radiation intensity of the exhaust plume, and even the relative angle between the wing and the fuselage. The recognition confidence score is 0.92, representing the probability that the model considers the detection box to be a real target. The signal-to-noise ratio is 13.7 dB, which quantifies how far the target signal strength exceeds the background clutter level.

[0041] S23: Obtain sensor attitude data and corresponding timestamp information from the synchronization metadata, transform the target bounding box in the preliminary detection results from the two-dimensional image coordinate system to the three-dimensional world coordinate system, and obtain the target's dynamic information.

[0042] Specifically, firstly, based on the timestamp information, the sensor attitude data that completely corresponds to the current image frame is found from the synchronization metadata, such as azimuth angle 120.34 degrees and pitch angle -30.56 degrees. Then, combined with pre-calibrated camera intrinsic parameters such as focal length and principal point position, a transformation matrix that accurately describes the camera's current position and orientation in three-dimensional space can be constructed. Through this matrix, the two-dimensional center pixel coordinates [308.65, 456.15] of the target bounding box can be converted into a three-dimensional ray vector pointing from the camera's optical center to the target. Finally, by fusing the instantaneous distance value returned by the laser rangefinder or by performing triangulation through continuous multi-frame observations, the precise position of the target on this ray can be determined, thereby resolving its three-dimensional position, velocity, acceleration, and other dynamic information in world coordinate systems such as WGS-84.
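
The coordinate transformation described above can be illustrated with a minimal sketch. It assumes a distortion-free pinhole camera, a camera frame with x to the right, y downward, and z forward, a local East-North-Up frame, azimuth measured clockwise from north, and pitch positive upward. None of these conventions are stated in this application, and the function names are hypothetical. The final conversion to WGS-84 is not shown.

```python
import math

def pixel_to_camera_ray(u, v, fx, fy, cx, cy):
    """Unit ray through pixel (u, v) in the camera frame (x right, y down, z forward)."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)

def camera_ray_to_enu(ray, azimuth_deg, pitch_deg):
    """Rotate a camera-frame ray into a local East-North-Up frame.

    Assumed convention: azimuth clockwise from north, pitch positive upward.
    """
    az, p = math.radians(azimuth_deg), math.radians(pitch_deg)
    # Camera axes expressed in ENU coordinates.
    forward = (math.sin(az) * math.cos(p), math.cos(az) * math.cos(p), math.sin(p))
    right = (math.cos(az), -math.sin(az), 0.0)
    down = (math.sin(p) * math.sin(az), math.sin(p) * math.cos(az), -math.cos(p))
    x, y, z = ray
    return tuple(x * r + y * d + z * f for r, d, f in zip(right, down, forward))

def locate_target(camera_enu, ray_enu, range_m):
    """Place the target on the ray at the range reported by the rangefinder."""
    return tuple(c + range_m * r for c, r in zip(camera_enu, ray_enu))
```

Velocity and acceleration could then be estimated by differencing successive positions over the frame interval; that step is left out of the sketch.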

[0043] S24: Construct a real-time state vector based on dynamic information and multimodal visual feature data.

[0044] Specifically, the target's three-dimensional position $(x, y, z)$, three-dimensional velocity $(v_x, v_y, v_z)$, and three-dimensional acceleration $(a_x, a_y, a_z)$ are extracted from the dynamic information; these nine independent floating-point values form a dynamic sub-vector. Simultaneously, the 1024-dimensional multimodal visual feature data vector generated by the target detection model is subjected to L2 norm normalization to eliminate the influence of feature amplitude fluctuations under different observation conditions, resulting in a standardized visual sub-vector. Finally, this nine-dimensional dynamic sub-vector and the 1024-dimensional standardized visual sub-vector are concatenated in a predefined order to form a composite vector with a total dimension of 1033. This composite vector is the real-time state vector, which not only unifies the scale of data from different sources numerically, but also logically couples the target's external motion state, i.e., dynamic information, with its internal inherent attributes, i.e., visual features. High-level intent recognition and behavior prediction thus receive a single input that carries both kinds of information.
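
As a minimal sketch of step S24, the construction of the 1033-dimensional real-time state vector can be written as follows. The function name and the ordering (position, velocity, acceleration, then visual features) are assumptions; the application only states that the order is predefined.

```python
import numpy as np

def build_state_vector(position, velocity, acceleration, visual_features):
    """Concatenate 9 dynamic values with the L2-normalized visual feature vector."""
    dynamic = np.concatenate([position, velocity, acceleration]).astype(np.float64)
    features = np.asarray(visual_features, dtype=np.float64)
    norm = np.linalg.norm(features)
    # Guard against an all-zero feature vector (assumed handling, not from the source).
    visual = features / norm if norm > 0 else features
    return np.concatenate([dynamic, visual])
```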

[0045] S30: Based on the real-time state vector, calculate the multimodal residual vector and determine whether the changing trend of the multimodal residual vector meets the triggering conditions of the preset maneuver event.

[0046] Specifically, as shown in Figure 4, step S30 includes: S31: Obtain the historical state vector of the previous moment. The historical state vector includes historical multimodal visual feature data.

[0047] Specifically, at the start of the current k-th frame processing flow, the historical state vector that was finally determined and stored at the end of the (k-1)-th frame processing flow is read from a circular buffer or state register. The format of this vector is completely consistent with the real-time state vector constructed in the aforementioned steps. It is a 1033-dimensional composite vector that encapsulates all the information about the target dynamics and visual characteristics at the previous observation time, serving as the benchmark and starting point for all subsequent prediction and comparison calculations.

[0048] S32: By using an extended Kalman filter, the historical state vector is used to predict the state and obtain the dynamic prediction value.

[0049] Specifically, this step uses the physical motion model built into the extended Kalman filter (EKF) to estimate the theoretical state of the target in the absence of external interference. First, the 9-dimensional dynamic information in the historical state vector of the previous time step k-1, obtained in the above step, is defined as the posterior state estimate at time k-1, that is, $x_{k-1} = [p_x, p_y, p_z, v_x, v_y, v_z, a_x, a_y, a_z]$, where p, v, and a represent the position, velocity, and acceleration components, respectively. Next, a state transition matrix F, constructed from a near-constant-acceleration motion model, is defined to describe how the system state evolves over time. Its specific form is $F = \begin{bmatrix} I & \Delta t\, I & 0.5\, \Delta t^2\, I \\ 0 & I & \Delta t\, I \\ 0 & 0 & I \end{bmatrix}$, where I is the 3×3 identity matrix, 0 is the 3×3 zero matrix, and Δt is the time interval from time k-1 to the current time k. Finally, the prediction step of the extended Kalman filter is executed to calculate the prior state estimate at the current time k, i.e., the dynamic prediction value $x_{k|k-1} = F\, x_{k-1}$. The nine components of $x_{k|k-1}$ are purely theoretical predictions of the target's dynamic state at the current time k, namely its position, velocity, and acceleration. The prediction is based entirely on the target's state at time k-1 and the laws of kinematics, and does not contain any actual observation information from the current time.
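
The prediction step can be sketched as below. Only the state propagation described in this paragraph is shown; covariance propagation, which a full Kalman filter would also perform, is not described here and is omitted. The function names are hypothetical.

```python
import numpy as np

def transition_matrix(dt):
    """9x9 near-constant-acceleration transition matrix F for time step dt."""
    I = np.eye(3)
    Z = np.zeros((3, 3))
    return np.block([
        [I, dt * I, 0.5 * dt ** 2 * I],
        [Z, I, dt * I],
        [Z, Z, I],
    ])

def predict_state(x_prev, dt):
    """Prior estimate x_{k|k-1} = F x_{k-1} for state [p, v, a]."""
    return transition_matrix(dt) @ np.asarray(x_prev, dtype=np.float64)
```

For example, a target at the origin with velocity 1 and acceleration 2 along the x-axis is predicted, after Δt = 1, at position 2 with velocity 3 along that axis.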

[0050] S33: Compare the predicted dynamic values with the dynamic information in the real-time state vector to obtain the dynamic residuals, and calculate the distance between the multimodal visual feature data in the real-time state vector and the historical multimodal visual feature data to obtain the visual residuals.

[0051] Specifically, for the dynamic residual, the 9-dimensional dynamic information in the real-time state vector at time k is first defined as the observation vector $z_k$. Then, the dynamic prediction value $x_{k|k-1}$ calculated in the above step is subtracted from the observation vector $z_k$ to yield a 9-dimensional dynamic residual vector, calculated as $y_{k,\mathrm{dyn}} = z_k - x_{k|k-1}$. Each component of this vector represents the difference between the target's actual state in the corresponding motion dimension, such as its velocity along the x-axis, and its inertial prediction. For the visual residual, the 1024-dimensional visual feature vector, denoted $v_k$, is first extracted from the current real-time state vector, and the corresponding visual feature vector, denoted $v_{k-1}$, is extracted from the historical state vector of the previous time step. Then, the semantic difference between these two high-dimensional vectors is quantified by the cosine distance between them: visual residual $= 1 - \dfrac{v_k \cdot v_{k-1}}{\|v_k\| \, \|v_{k-1}\|}$, where · denotes the vector dot product and $\|\cdot\|$ denotes the L2 norm of a vector, i.e., its Euclidean length. The result is a scalar between 0 and 2. The closer the value is to 0, the more stable the visual features; the closer the value is to 2, the more drastic the visual change.
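
A minimal sketch of the two residuals in step S33 follows. The function names are hypothetical, and the visual feature vectors are assumed to be nonzero.

```python
import numpy as np

def dynamic_residual(z_k, x_pred):
    """Dynamic residual: observation minus prediction, component by component."""
    return np.asarray(z_k, dtype=np.float64) - np.asarray(x_pred, dtype=np.float64)

def visual_residual(v_k, v_prev):
    """Cosine distance between current and previous visual features, in [0, 2]."""
    v_k = np.asarray(v_k, dtype=np.float64)
    v_prev = np.asarray(v_prev, dtype=np.float64)
    cos_sim = float(v_k @ v_prev) / (np.linalg.norm(v_k) * np.linalg.norm(v_prev))
    return 1.0 - cos_sim
```

Identical feature vectors give a residual of 0, orthogonal vectors give 1, and opposite vectors give 2.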

[0052] S34: Weighted fusion of dynamic residuals and visual residuals to obtain multimodal residual vectors.

[0053] Specifically, in order to transform the 9-dimensional dynamic residual vector $y_{k,\mathrm{dyn}}$ into a scalar that represents the overall maneuver amplitude, its L2 norm is calculated, i.e., $\|y_{k,\mathrm{dyn}}\|_2 = \sqrt{y_1^2 + y_2^2 + \dots + y_9^2}$, which gives a non-negative dynamic residual amplitude. Then, this dynamic residual amplitude is linearly combined with the visual residual scalar calculated in the previous step, using preset weighting coefficients $w_d$ and $w_v$, to form the final multimodal residual value R, calculated as $R = w_d \times \|y_{k,\mathrm{dyn}}\|_2 + w_v \times \text{visual residual}$. This fused scalar R serves as a comprehensive anomaly score that grows with the degree to which the target's current behavior deviates from its stable flight pattern, providing a direct and quantitative basis for subsequent maneuver event judgment.
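
The weighted fusion can be sketched in a few lines. The default weights below are placeholders chosen for illustration only; the application does not give values for the two coefficients.

```python
import math

def fuse_residuals(y_dyn, visual_res, w_d=0.5, w_v=0.5):
    """Multimodal residual R = w_d * ||y_dyn||_2 + w_v * visual_res."""
    amplitude = math.sqrt(sum(c * c for c in y_dyn))
    return w_d * amplitude + w_v * visual_res
```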

[0054] More specifically, as shown in Figure 5, step S30 further includes: S35: Based on the multimodal residual vector and a preset drift parameter, the cumulative statistic at the current time is calculated through the cumulative sum (CUSUM) test algorithm.

[0055] Specifically, a statistical process control method that is highly sensitive to changes in a time series, namely the cumulative sum (CUSUM) algorithm, is used to amplify and confirm a persistent growth trend in the residuals, thereby distinguishing real maneuvering behavior from isolated random noise. First, the calculated multimodal residual value R is defined as the input $R_k$ of the CUSUM algorithm at the current time k. Second, a parameter μ, known as the drift parameter or reference value, is set; it represents the expected mean of the multimodal residual when the system is in a steady state, or the upper limit of tolerable normal fluctuation. Then, the cumulative statistic $S_k$ is updated using the CUSUM recursion, which is specifically designed to detect upward shifts in the mean: $S_k = \max(0, S_{k-1} + (R_k - \mu))$, where $S_{k-1}$ is the cumulative statistic from the previous time step and the initial value $S_0$ is set to 0. The core of this formula is that $S_k$ increases only when the current residual $R_k$ exceeds the normal fluctuation level μ; otherwise the increment is negative, $S_k$ decreases, and the max(0, ...) function prevents it from falling below 0. This mechanism ensures that $S_k$ accumulates only those deviations that are persistent and exceed expectations, making it a robust indicator of the cumulative effect of abnormal target behavior.
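
The recursion can be illustrated with a short sketch. The function names are hypothetical, and the residual values in the usage note are made up purely to show the behavior.

```python
def cusum_update(s_prev, r_k, mu):
    """One-sided CUSUM step: S_k = max(0, S_{k-1} + (R_k - mu))."""
    return max(0.0, s_prev + (r_k - mu))

def run_cusum(residuals, mu):
    """Run the recursion over a residual sequence, starting from S_0 = 0."""
    s = 0.0
    history = []
    for r in residuals:
        s = cusum_update(s, r, mu)
        history.append(s)
    return history
```

With μ = 0.5 and residuals 0.25, 0.25, 1.0, 1.5, 0.25, the statistic stays at 0 for the first two frames, rises to 0.5 and then 1.5 while the residual exceeds μ, and falls back to 1.25 on the last frame rather than dropping straight to 0.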

[0056] S36: Based on the target's signal-to-noise ratio, determine the preset threshold for judging maneuvering events, and determine whether the cumulative statistics exceed the threshold.

[0057] Specifically, to ensure the detection system maintains suitable sensitivity and reliability under different observation conditions, an adaptive threshold strategy is adopted. This strategy dynamically links the decision threshold to the signal quality of the target. First, the signal-to-noise ratio of the target in the current frame image, $\mathrm{SNR}_k$, is obtained from the preliminary detection results. Second, the decision threshold at the current moment, $H_k$, is calculated using a preset functional relationship. This function is designed as a decreasing function of the signal-to-noise ratio: a higher SNR indicates more reliable data and gives a lower decision threshold, making the system more sensitive to minor maneuvers, and vice versa. A typical implementation is $H_k = H_{\mathrm{base}} \times (1 + \beta / \mathrm{SNR}_k)$, where $H_{\mathrm{base}}$ is the baseline threshold under ideal observation conditions and β is an adjustment factor that controls how strongly the threshold varies with the signal-to-noise ratio. Finally, the cumulative statistic $S_k$ calculated at the current time is compared with the dynamically calculated decision threshold $H_k$, i.e., it is judged whether $S_k$ exceeds $H_k$.

[0058] S37: If the cumulative statistic exceeds the judgment threshold, the changing trend of the multimodal residual vector is determined to meet the triggering condition.

[0059] Specifically, when the condition $S_k > H_k$ is met, the system declares that a valid maneuver event has been detected. The reliability of this conclusion stems from the rigor of the entire judgment chain: it is based not on a single instantaneous residual pulse, but on the accumulation, captured by the statistic $S_k$, of residuals that have continuously exceeded normal fluctuation levels over a period of time, and the strength of this evidence exceeds the threshold $H_k$ set dynamically according to the current signal quality. Therefore, determining that the triggering condition is met means that the system judges, with high confidence, that the continuous change of the observed multimodal residual is caused not by random noise but by a genuine, nonlinear change in the target's state, thus generating a reliable triggering signal for subsequent advanced processing steps such as maneuver type identification.
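
Steps S36 and S37 can be sketched together as follows. The application does not say whether the signal-to-noise ratio enters the formula in decibels or as a linear ratio; the sketch simply requires a positive value. The function names and the numbers in the usage note are hypothetical.

```python
def adaptive_threshold(h_base, beta, snr):
    """Decision threshold H_k = H_base * (1 + beta / SNR_k); lower for higher SNR."""
    if snr <= 0:
        raise ValueError("snr must be positive for this threshold function")
    return h_base * (1.0 + beta / snr)

def maneuver_triggered(s_k, h_base, beta, snr):
    """Triggering condition: cumulative statistic S_k strictly exceeds H_k."""
    return s_k > adaptive_threshold(h_base, beta, snr)
```

With a baseline of 2.0 and β = 4.0, an SNR of 8 gives a threshold of 3.0 and an SNR of 16 gives 2.5, so a cumulative statistic of 2.8 triggers only in the higher-SNR case.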

[0060] S40: If the trend of change meets the triggering condition, then the target's intention probability distribution vector is determined based on the real-time state vector.

[0061] Specifically, as shown in Figure 6, step S40 includes: S41: Calculate the similarity between the dynamic information and multimodal visual feature data in the real-time state vector and the preset multimodal intent template library to obtain the initial likelihood vector of the real-time state vector under each preset intent.

[0062] Specifically, a pre-built multimodal intent template library is constructed, storing N typical intents such as "cruise," "reconnaissance," "attack," and "evasion." Each intent is represented by a template vector, the structure of which is consistent with the real-time state vector, containing typical dynamic and visual features representing that intent. During calculation, on the one hand, the Mahalanobis distance is calculated between the 9-dimensional dynamic sub-vector in the current real-time state vector and the dynamic part of each intent template in the library, resulting in an N-dimensional dynamic similarity score vector. On the other hand, the cosine similarity is calculated between the 1024-dimensional visual sub-vector in the real-time state vector and the visual part of each intent template in the library, resulting in an N-dimensional visual similarity score vector. Finally, these two N-dimensional score vectors are weighted and fused, for example, the initial likelihood vector = α × dynamic similarity score vector + (1-α) × visual similarity score vector, and then normalized using the Softmax function to obtain an N-dimensional initial likelihood vector P. The i-th element of this vector represents the probability that the target intent is the i-th type under the current observed state.
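
A minimal sketch of step S41 follows. Since the Mahalanobis distance is a distance rather than a similarity, the sketch assumes the dynamic similarity score is its negative, so that closer templates score higher; the application does not state this conversion. The shared inverse covariance matrix `cov_inv`, the function names, and the default α are likewise assumptions.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def initial_likelihood(dyn, vis, tmpl_dyn, tmpl_vis, cov_inv, alpha=0.5):
    """Fuse dynamic and visual similarity to N intent templates into a likelihood.

    dyn: dynamic sub-vector, vis: visual sub-vector,
    tmpl_dyn / tmpl_vis: one template row per intent, cov_inv: inverse covariance.
    """
    diff = tmpl_dyn - dyn
    maha_sq = np.einsum("ni,ij,nj->n", diff, cov_inv, diff)
    dyn_score = -np.sqrt(np.maximum(maha_sq, 0.0))  # assumed: similarity = -distance
    vis_score = (tmpl_vis @ vis) / (np.linalg.norm(tmpl_vis, axis=1) * np.linalg.norm(vis))
    return softmax(alpha * dyn_score + (1.0 - alpha) * vis_score)
```

The two score vectors have different natural ranges (the distance is unbounded, the cosine similarity lies in [-1, 1]), so in practice α or a rescaling step would need tuning.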

[0063] S42: Using the initial likelihood vector as the current observation evidence and the intention probability distribution vector of the previous time step as the prior probability, calculate the posterior intention probability distribution vector of the current time step, and use the posterior intention probability distribution vector as the intention probability distribution vector of the target.

[0064] Specifically, the idea of Bayesian filtering is introduced to fuse historical information, smoothing and correcting the intent judgment. First, the N-dimensional initial likelihood vector P calculated in the above step is used as the likelihood term in the Bayesian formula, representing the evidence from the latest observation. Second, the N-dimensional intent probability distribution vector $P_{k-1}$, finally determined at the previous time k-1, is obtained from the state storage and used as the prior probability at the current moment, representing the intent prediction based on all historical information. Then, the discrete form of the Bayesian update rule is applied to calculate the posterior probability at the current moment element by element: $P_{k,i} \propto P_i \times P_{k-1,i}$, where ∝ denotes proportionality. After this is calculated for all i, a global normalization is performed so that the elements of the vector sum to 1, yielding the final N-dimensional vector $P_k$. This vector is the posterior intent probability distribution vector at the current moment; it reflects the latest observational evidence while inheriting the continuity of historical judgments, making the intent recognition results more stable and reliable.
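
The discrete Bayesian update can be sketched as below. Falling back to the prior when every product is zero is an assumption added to avoid division by zero; the application does not address that case.

```python
def bayes_update(likelihood, prior):
    """Posterior intent distribution: elementwise likelihood * prior, then normalize."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    if total == 0:
        return list(prior)  # assumed fallback, not described in the source
    return [u / total for u in unnormalized]
```

For example, a likelihood of [0.5, 0.25, 0.25] combined with an identical prior yields a posterior of [2/3, 1/6, 1/6]: agreement between the new evidence and the history sharpens the distribution.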

[0065] S50: Determine the target's behavior prediction result based on the intent probability distribution vector.

[0066] Specifically, as shown in Figure 7, step S50 includes: S51: Determine the final intent label based on the intent probability distribution vector.

[0067] Specifically, as shown in Figure 8, step S51 includes: S511: Continuously monitor the intent probability distribution vector.

[0068] Specifically, at each time k, the system receives the latest intent probability distribution vector $P_k$ generated by the aforementioned steps and stores it in a fixed-length sliding window buffer of length M. This buffer operates on a first-in, first-out (FIFO) basis and always holds the sequence of intent probability distribution vectors from the most recent M time moments, $\{P_{k-M+1}, P_{k-M+2}, \dots, P_k\}$. At each new moment, the system analyzes all vectors within this window in real time: for each of the N intents in turn, it takes the probability value of that intent in each frame within the window and counts the number of frames in which that value exceeds a preset confidence threshold $\theta_{\mathrm{conf}}$, for example 0.8. This yields a statistical profile that reflects the sustained strength of each intent over a short period of time.

[0069] S512: If the probability value of a single intent in the intent probability distribution vector exceeds the preset threshold more than the preset number of times within a preset time period, then the single intent is determined as the final intent label.

[0070] Specifically, this judgment criterion ensures that the system locks onto an intent only if it exhibits sufficiently strong and sustained dominance. The judgment logic is as follows: iterate through the N intents, and for the i-th intent, obtain from the sliding window statistics of the previous step the number of frames, $\mathrm{Count}_i$, in which its probability value exceeds the confidence threshold $\theta_{\mathrm{conf}}$. Then, compare $\mathrm{Count}_i$ with a preset count threshold $T_{\mathrm{count}}$. For example, M = 10 and $T_{\mathrm{count}}$ = 7 mean that the intent must be dominant in more than 7 of the 10 frames. If there exists an intent i that satisfies $\mathrm{Count}_i > T_{\mathrm{count}}$, and the probability value of this intent is still the maximum among all intents at the current time k, then the system takes the category number i of this intent as the final intent label $L_{\mathrm{final}}$; for example, 0 indicates cruising and 1 indicates an attack. This time-consistency-based judgment mechanism avoids misjudgments caused by single-frame anomalies, ensuring the reliability and temporal stability of the final intent label.
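
Steps S511 and S512 can be sketched with a fixed-length FIFO buffer. The class name is hypothetical, and the comparison uses the strict form Count_i > T_count given in the formula. Because the label must also have the maximum probability in the current frame, only that intent needs to be counted.

```python
from collections import deque

class IntentLocker:
    """Sliding-window check that an intent is dominant for enough recent frames."""

    def __init__(self, window=10, theta_conf=0.8, t_count=7):
        self.buffer = deque(maxlen=window)  # FIFO: oldest frame drops out automatically
        self.theta_conf = theta_conf
        self.t_count = t_count

    def update(self, p_k):
        """Store the newest distribution and return the locked intent index or None."""
        p = list(p_k)
        self.buffer.append(p)
        best = max(range(len(p)), key=p.__getitem__)
        count = sum(1 for frame in self.buffer if frame[best] > self.theta_conf)
        return best if count > self.t_count else None
```

With the defaults above, feeding the same confident distribution repeatedly returns None for the first seven frames and the intent index from the eighth frame onward.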

[0071] S52: Construct a prediction sequence by appending the final intent label to the end of the prediction input sequence in encoded form. The prediction input sequence includes multiple historical state vectors within a preset length.

[0072] Specifically, first, the real-time state vectors of the most recent L moments (e.g., L=20) are extracted from the state storage module in reverse chronological order to form a two-dimensional tensor with dimensions [L, 1033], representing the complete motion and visual evolution history of the target over a past period. Second, the final intent label $L_{\mathrm{final}}$ determined in the above steps is converted into an N-dimensional sparse vector through one-hot encoding; for example, if N=5 and $L_{\mathrm{final}}$=1, the encoding result is [0,1,0,0,0]. Finally, this N-dimensional intent encoding vector is used as a global context feature and expanded into a two-dimensional tensor of [L,N] through a broadcast mechanism. It is then concatenated with the historical state tensor of [L,1033] along the feature dimension to generate a prediction input sequence tensor of [L,1033+N]. This design enables the prediction model to know the target's intent label when decoding the state at each future moment, thereby guiding the predicted trajectory to evolve in a direction consistent with the intent.
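
The construction of the prediction input tensor can be sketched as follows. The function name is hypothetical; the history array is assumed to already hold the L most recent state vectors, one per row.

```python
import numpy as np

def build_prediction_input(history, final_label, num_intents):
    """Append a broadcast one-hot intent code to each row of the [L, 1033] history."""
    history = np.asarray(history, dtype=np.float64)
    one_hot = np.zeros(num_intents)
    one_hot[final_label] = 1.0
    # Repeat the same intent code at every time step: shape [L, N].
    tiled = np.broadcast_to(one_hot, (history.shape[0], num_intents))
    return np.concatenate([history, tiled], axis=1)
```

With L = 20 and N = 5 the result has shape [20, 1038], and the last five columns of every row hold the same one-hot code.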

[0073] S53: Input the predicted input sequence into a preset temporal prediction model based on encoder and decoder, and obtain the target's behavior prediction results from the temporal prediction model. The behavior prediction results include at least the prediction data of future high-resolution predicted trajectory and future maneuver behavior warning.

[0074] Specifically, this time-series prediction model adopts the classic encoder-decoder architecture, which is well suited to sequence-to-sequence mapping tasks, that is, mapping a historical state sequence to a future state sequence. The entire prediction process can be divided into three core stages: the encoding stage, the context propagation stage, and the decoding stage. In the encoding stage, the encoder receives the prediction input sequence with dimensions [L, 1033+N] constructed in the previous step, where L is the number of historical time steps, such as L=20, and 1033+N is the state feature dimension at each time step. The encoder is usually built from multiple stacked layers, for example three layers of bidirectional long short-term memory (Bi-LSTM) units. Each Bi-LSTM layer contains a forward LSTM and a backward LSTM. The forward LSTM scans from time 1 to time L, capturing forward temporal dependencies, while the backward LSTM scans from time L to time 1, capturing backward temporal dependencies. At each time step t, the forward and backward LSTMs each output a hidden state vector of dimension $d_h$, for example 256; these two vectors are merged by concatenation into a single bidirectional hidden state $h_t$ of dimension $2 \times d_h$. After traversing all L time steps, the encoder extracts the bidirectional hidden state $h_L$ of the last time step L and the corresponding cell state $c_L$. These are treated as a condensed representation of the entire historical sequence, i.e., the context vector $C = (h_L, c_L)$. This context vector can be understood as an information bottleneck, which forces the model to compress all historical information into a fixed-dimensional vector, thereby learning to extract the most critical temporal patterns and semantic features.

[0075] Furthermore, during the context propagation phase, the context vector $C = (h_L, c_L)$ generated by the encoder is passed to the decoder as its initial hidden state and initial cell state. This establishes an information bridge between the encoder and decoder, allowing the decoder to know the target's complete past behavioral history and its identified intent information when it begins generating the future sequence. Furthermore, to enhance the decoder's ability to focus on key historical information, many advanced implementations introduce an attention mechanism. That is, when generating the prediction for each future time step, the decoder relies not only on the initial context vector C but also on a dynamically weighted sum of the bidirectional hidden states $\{h_1, h_2, \dots, h_L\}$ generated by the encoder at all L time steps, with the weights determined by the similarity between the current decoder hidden state and each encoder hidden state. Through this dynamic attention, the decoder can selectively look back at different parts of the historical sequence when predicting different future moments.
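
The attention step can be illustrated with a minimal sketch. It assumes dot-product similarity followed by a softmax, and that the decoder and encoder hidden states have the same dimension; the application only says the weights are determined by similarity, so both choices are assumptions.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weighted sum of encoder hidden states, weighted by dot-product similarity."""
    encoder_states = np.asarray(encoder_states, dtype=np.float64)       # [L, d]
    scores = encoder_states @ np.asarray(decoder_state, dtype=np.float64)  # [L]
    weights = np.exp(scores - scores.max())  # stable softmax over the L time steps
    weights /= weights.sum()
    return weights @ encoder_states, weights
```

The returned weights sum to 1, and the context vector is dominated by the encoder time steps most similar to the current decoder state.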

[0076] Furthermore, in the decoding stage, the decoder starts from the initial state C received from the encoder and generates, step by step and autoregressively, a sequence of predicted states for the next T time points, such as T=30. Autoregression means that when generating the predicted state for the t-th future time point, the predicted state generated for the previous time point t-1 is used as the input for the current time point. The decoder is also composed of multiple layers of unidirectional LSTM units. In the first decoding step, which generates the prediction for future time point 1, the input to the decoder is a special initial marker vector, typically a zero vector of dimension 1033+N or a learnable vector. Based on this input vector and the initial state C, the decoder's LSTM unit outputs a hidden state vector $s_1$. Then, $s_1$ is fed into two parallel fully connected layers. The first fully connected layer has an output dimension of 1033 and, after a linear transformation, directly outputs the predicted state vector $\mathrm{pred}_1$ for future time 1. The first 9 dimensions of this vector represent the predicted dynamic information, namely the three-dimensional position, velocity, and acceleration, while the last 1024 dimensions represent the predicted visual feature evolution. The second fully connected layer has an output dimension of 1 and, after a sigmoid activation function, outputs a maneuver warning probability $p_{\mathrm{maneuver},1}$ between 0 and 1. When $p_{\mathrm{maneuver},1}$ is greater than 0.5, the system determines that the target may maneuver at future time 1. In the second decoding step, the decoder takes the predicted state $\mathrm{pred}_1$ generated in the previous step as input, combines it with the hidden state $s_1$ and the cell state from the previous step, and recursively generates $s_2$, $\mathrm{pred}_2$, and $p_{\mathrm{maneuver},2}$. This process continues until predictions have been generated for all T future time points.

[0077] More specifically, after all T decoding steps are completed, the system post-processes and encapsulates the output. First, from the T predicted state vectors $\{\mathrm{pred}_1, \mathrm{pred}_2, \dots, \mathrm{pred}_T\}$, the first 9 dimensions of dynamic information are extracted from each vector, in particular the three-dimensional position components (x, y, z). These T three-dimensional coordinate points are connected in chronological order to form a continuous curve in three-dimensional space, which is the future high-resolution predicted trajectory. Second, the T maneuver warning probabilities $\{p_{\mathrm{maneuver},1}, p_{\mathrm{maneuver},2}, \dots, p_{\mathrm{maneuver},T}\}$ are organized into a one-dimensional time series, where each element corresponds to the maneuver risk score at one future time. When the probability value first exceeds 0.5 at some time, the system generates a structured warning message, such as {"Warning Time": 15th frame in the future, "Warning Probability": 0.73, "Warning Type": "Possible evasive maneuver"}. Finally, the system encapsulates the predicted trajectory curve (e.g., a T×3 coordinate matrix), the maneuver warning time series (e.g., a probability array of length T), and any additional information such as the prediction confidence interval into a structured and serializable data structure, defined as the behavior prediction result, and returns it to the upper-level decision-making and planning module through a standard interface.
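
The post-processing can be sketched as below. The dictionary keys are hypothetical stand-ins for the structured message fields, and the warning type field is omitted because the application does not describe how it is derived.

```python
import numpy as np

def package_prediction(pred_states, maneuver_probs, threshold=0.5):
    """Extract the T x 3 trajectory and the first warning from the decoder outputs."""
    pred_states = np.asarray(pred_states, dtype=np.float64)      # [T, 1033]
    maneuver_probs = np.asarray(maneuver_probs, dtype=np.float64)  # [T]
    trajectory = pred_states[:, :3]  # position components (x, y, z)
    warning = None
    over = np.flatnonzero(maneuver_probs > threshold)
    if over.size:
        t = int(over[0])
        # Future frames are numbered from 1, so index t maps to frame t + 1.
        warning = {"warning_time": t + 1, "warning_probability": float(maneuver_probs[t])}
    return {"trajectory": trajectory, "maneuver_probabilities": maneuver_probs, "warning": warning}
```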

[0078] In one embodiment, an electronic device is provided, which may be a server, and its internal structure may be as shown in Figure 9. The electronic device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system, computer programs, and database. The internal memory provides the environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database stores data such as high-altitude observation data, real-time state vectors, and intent probability distribution vectors. The network interface communicates with external terminals via a network. When executed by the processor, the computer program implements a method for recognizing and predicting the intent and behavior of non-cooperative targets at high altitudes.

[0079] In one embodiment, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps: Acquire the high-altitude observation data at the current moment, and preprocess the high-altitude observation data to obtain the standardized input tensor and synchronization metadata; Based on the standardized input tensor, a preset target detection model is invoked to obtain preliminary detection results, and based on the preliminary detection results and the synchronization metadata, a real-time state vector of the target is constructed; Based on the real-time state vector, the multimodal residual vector is calculated, and it is determined whether the changing trend of the multimodal residual vector meets the triggering conditions of the preset maneuver event; If the changing trend meets the triggering conditions, the target's intent probability distribution vector is determined based on the real-time state vector; Based on the intent probability distribution vector, the target's behavior prediction result is determined.

[0080] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

[0081] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0082] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes, characterized in that the method includes: Acquire the high-altitude observation data at the current moment, and preprocess the high-altitude observation data to obtain a standardized input tensor and synchronization metadata; Based on the standardized input tensor, a preset target detection model is invoked to obtain preliminary detection results, and based on the preliminary detection results and the synchronization metadata, a real-time state vector of the target is constructed; Based on the real-time state vector, a multimodal residual vector is calculated, and it is determined whether the changing trend of the multimodal residual vector meets the triggering conditions of a preset maneuver event; If the changing trend satisfies the triggering conditions, the intent probability distribution vector of the target is determined based on the real-time state vector; Based on the intent probability distribution vector, the behavior prediction result of the target is determined.

2. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to claim 1, characterized in that preprocessing the high-altitude observation data to obtain the standardized input tensor and the synchronization metadata specifically includes: obtaining image data and metadata from the high-altitude observation data; calling a neural network processing unit to perform standardized image operations on the image data to obtain the standardized input tensor, wherein the standardized image operations include at least decoding, scaling, and normalization; and calling a central processing unit to perform structured processing on the metadata to obtain the synchronization metadata.

3. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to claim 2, characterized in that invoking the preset target detection model based on the standardized input tensor to obtain the preliminary detection results, and constructing the real-time state vector of the target based on the preliminary detection results and the synchronization metadata, specifically includes: obtaining the target detection model, which is constructed based on an attention mechanism, inputting the standardized input tensor into the target detection model, and performing target localization inference through the target detection model; obtaining the preliminary detection results from the target detection model, wherein the preliminary detection results include at least the target bounding box coordinates, multimodal visual feature data, recognition confidence, and signal-to-noise ratio of the target; obtaining sensor attitude data and corresponding timestamp information from the synchronization metadata, and transforming the target bounding box in the preliminary detection results from the two-dimensional image coordinate system to the three-dimensional world coordinate system to obtain dynamic information of the target; and constructing the real-time state vector based on the dynamic information and the multimodal visual feature data.

4. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to claim 3, characterized in that calculating the multimodal residual vector based on the real-time state vector specifically includes: obtaining the historical state vector of the previous moment, wherein the historical state vector includes historical multimodal visual feature data; performing state prediction on the historical state vector using an extended Kalman filter to obtain a dynamic prediction value; comparing the dynamic prediction value with the dynamic information in the real-time state vector to obtain a dynamic residual, and calculating the distance between the multimodal visual feature data in the real-time state vector and the historical multimodal visual feature data to obtain a visual residual; and performing weighted fusion of the dynamic residual and the visual residual to obtain the multimodal residual vector.
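For illustration only, the residual computation recited in claim 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the state layout `[x, y, z, vx, vy, vz]`, the constant-velocity motion model (for which the extended Kalman filter prediction reduces to a linear one), the omission of covariance propagation, the Euclidean feature distance, the weights `w_dyn` and `w_vis`, and the reading of "weighted fusion" as a weighted concatenation are all hypothetical choices that the claims do not specify.

```python
import math


def predict_constant_velocity(state, dt):
    """State-prediction step for [x, y, z, vx, vy, vz].

    Assumption: constant-velocity motion. Covariance propagation,
    which a full (extended) Kalman filter also performs, is omitted.
    """
    x, y, z, vx, vy, vz = state
    return [x + vx * dt, y + vy * dt, z + vz * dt, vx, vy, vz]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def multimodal_residual(prev_dyn, prev_feat, cur_dyn, cur_feat, dt,
                        w_dyn=0.7, w_vis=0.3):
    """Return the weighted dynamic residual followed by the weighted visual residual."""
    predicted = predict_constant_velocity(prev_dyn, dt)
    # Dynamic residual: observed dynamics minus predicted dynamics.
    dyn_res = [c - p for c, p in zip(cur_dyn, predicted)]
    # Visual residual: distance between current and historical features.
    vis_res = euclidean(cur_feat, prev_feat)
    return [w_dyn * r for r in dyn_res] + [w_vis * vis_res]
```

A small residual indicates that the target is following the predicted motion and looks the same as before; a growing residual in either part is what the maneuver test of claim 5 would accumulate.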

5. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to claim 3, characterized in that determining whether the change trend of the multimodal residual vector satisfies the trigger condition of the preset maneuver event specifically includes: calculating the cumulative statistic at the current moment using a cumulative sum (CUSUM) test algorithm, based on the multimodal residual vector and a preset drift parameter; determining the determination threshold of the preset maneuver event based on the signal-to-noise ratio of the target, and determining whether the cumulative statistic exceeds the determination threshold; and if the cumulative statistic exceeds the determination threshold, determining that the change trend of the multimodal residual vector satisfies the trigger condition.
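For illustration only, the maneuver test recited in claim 5 can be sketched with a one-sided cumulative sum (CUSUM) statistic. This is a minimal sketch under stated assumptions: the residual vector is assumed to have been reduced to a scalar norm per time step, and the rule in `threshold_from_snr` (raising the threshold when the signal-to-noise ratio is low), together with its parameters `base`, `low_snr_db`, and `penalty`, is hypothetical. The claims state only that the threshold depends on the signal-to-noise ratio.

```python
def cusum_update(prev_stat, residual_norm, drift):
    """One-sided CUSUM recursion: S_k = max(0, S_{k-1} + r_k - drift)."""
    return max(0.0, prev_stat + residual_norm - drift)


def threshold_from_snr(snr_db, base=5.0, low_snr_db=10.0, penalty=0.5):
    """Hypothetical rule: a noisier target (low SNR) gets a higher threshold."""
    deficit = max(0.0, low_snr_db - snr_db)
    return base + penalty * deficit


def detect_maneuver(residual_norms, drift, snr_db):
    """Return the index of the first step whose statistic exceeds the threshold, else None."""
    stat = 0.0
    threshold = threshold_from_snr(snr_db)
    for i, r in enumerate(residual_norms):
        stat = cusum_update(stat, r, drift)
        if stat > threshold:
            return i
    return None
```

The drift parameter absorbs residuals that stay at the noise level, so the statistic grows only when residuals remain elevated over several steps.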

6. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to claim 1, characterized in that determining the intent probability distribution vector of the target based on the real-time state vector specifically includes: calculating the similarity between the dynamic information and multimodal visual feature data in the real-time state vector and a preset multimodal intent template library, to obtain an initial likelihood vector of the real-time state vector under each preset intent; using the initial likelihood vector as the current observation evidence and the intent probability distribution vector of the previous time step as the prior probability, calculating the posterior intent probability distribution vector of the current time step; and using the posterior intent probability distribution vector as the intent probability distribution vector of the target.
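For illustration only, the recursive update recited in claim 6 can be sketched as a discrete Bayes step. This is a minimal sketch under stated assumptions: the template library is modeled as a dictionary with one prototype vector per intent, and the mapping from similarity to likelihood (`exp(-distance / scale)`) is a hypothetical choice. The claims do not specify the similarity measure.

```python
import math


def template_likelihoods(state, templates, scale=1.0):
    """Map distance to each intent template into a likelihood in (0, 1].

    Assumption: likelihood = exp(-Euclidean distance / scale).
    """
    out = {}
    for intent, proto in templates.items():
        dist = math.sqrt(sum((s - t) ** 2 for s, t in zip(state, proto)))
        out[intent] = math.exp(-dist / scale)
    return out


def bayes_update(prior, likelihood):
    """Posterior is proportional to prior times likelihood, renormalized to sum to 1."""
    unnorm = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(unnorm.values())
    if total == 0.0:
        # No evidence supports any intent; keep the prior unchanged.
        return dict(prior)
    return {k: v / total for k, v in unnorm.items()}
```

Calling `bayes_update` once per time step, with the previous posterior as the new prior, gives the recursion described in the claim.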

7. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to any one of claims 1-6, characterized in that determining the behavior prediction result of the target based on the intent probability distribution vector specifically includes: determining a final intent label based on the intent probability distribution vector; constructing a prediction sequence by appending the final intent label, in coded form, to the end of a prediction input sequence, wherein the prediction input sequence includes multiple historical state vectors within a preset length; and inputting the prediction input sequence into a preset temporal prediction model based on an encoder and a decoder, and obtaining the behavior prediction result of the target from the temporal prediction model, wherein the behavior prediction result includes at least prediction data of a future high-resolution predicted trajectory and a future maneuver behavior warning.
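For illustration only, the sequence construction recited in claim 7 can be sketched as follows. This is a minimal sketch under stated assumptions: the "coded form" is taken to be a one-hot code, zero-padded to the state dimension so that every element of the sequence has the same length, and the function and parameter names are hypothetical. The encoder-decoder temporal prediction model itself is not shown.

```python
def one_hot(label, labels):
    """One-hot code of `label` over the ordered list `labels`."""
    if label not in labels:
        raise ValueError(f"unknown intent label: {label!r}")
    return [1.0 if name == label else 0.0 for name in labels]


def build_prediction_sequence(history, final_label, labels, window):
    """Take the last `window` state vectors and append the coded intent label."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]
    if not recent:
        raise ValueError("history is empty")
    dim = len(recent[0])
    code = one_hot(final_label, labels)
    if len(code) > dim:
        raise ValueError("state dimension is too small for the intent code")
    # Assumption: zero-pad the code so it matches the state dimension.
    token = code + [0.0] * (dim - len(code))
    return [list(v) for v in recent] + [token]
```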

8. The method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to claim 7, characterized in that determining the final intent label based on the intent probability distribution vector specifically includes: continuously monitoring the intent probability distribution vector; and if the probability value of a single intent in the intent probability distribution vector exceeds a preset threshold more than a preset number of times within a preset time period, determining that single intent as the final intent label.
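For illustration only, the persistence rule recited in claim 8 can be sketched with a sliding time window. This is a minimal sketch under stated assumptions: the class name `IntentLabelMonitor`, its parameters, and the tie-break (preferring the intent with the higher current probability when several qualify) are hypothetical and not taken from the claims.

```python
from collections import deque


class IntentLabelMonitor:
    """Confirm an intent once it exceeds `threshold` more than `min_count`
    times within the most recent `period` time units."""

    def __init__(self, threshold, min_count, period):
        self.threshold = threshold
        self.min_count = min_count
        self.period = period
        self._hits = {}  # intent -> deque of timestamps of threshold crossings

    def update(self, timestamp, distribution):
        """Feed one distribution (intent -> probability); return a confirmed label or None."""
        confirmed = None
        for intent, prob in distribution.items():
            hits = self._hits.setdefault(intent, deque())
            if prob > self.threshold:
                hits.append(timestamp)
            # Drop crossings that fell out of the time window.
            while hits and timestamp - hits[0] > self.period:
                hits.popleft()
            if len(hits) > self.min_count:
                # Assumed tie-break: keep the intent with the higher probability.
                if confirmed is None or prob > distribution[confirmed]:
                    confirmed = intent
        return confirmed
```

Requiring repeated crossings within the window keeps a single noisy posterior from fixing the label that is then passed to the prediction model.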

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method for identifying the intent and predicting the behavior of non-cooperative targets at high altitudes according to any one of claims 1 to 8.