A method for joint control of UAV trajectory and data acquisition optimized for data freshness

By combining gridded synchronous observation with action masking, the method addresses insufficient information freshness in UAV swarm intelligence perception systems and achieves efficient, stable, and executable joint control of trajectory and data acquisition in dynamic mission environments.

CN122308404A · Pending · Publication Date: 2026-06-30 · JIANGSU VOCATIONAL COLLEGE OF BUSINESS +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGSU VOCATIONAL COLLEGE OF BUSINESS
Filing Date
2026-04-14
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing UAV swarm intelligence perception systems struggle to achieve real-time optimization of UAV trajectories and data acquisition strategies in dynamic mission environments, resulting in insufficient information freshness. Furthermore, existing methods suffer from problems in practical deployment, such as overly idealistic information assumptions, ineffective utilization of spatial structures, and passive handling of hard constraints leading to uncontrollable action space.

Method used

Gridded synchronous observations are used to generate a spatial reward value prediction prior. A spatial reward value prediction module and an action mask are constructed and combined with a policy network for decision-making, so that the UAV focuses on high-value areas while satisfying hard constraints, yielding executable action decisions.

Benefits of technology

It reduces long-term weighted AoI more efficiently (that is, it improves information freshness), enhances the feasibility and stability of action decisions, reduces energy consumption, and achieves efficient optimization under multiple constraints.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a joint control method for UAV trajectory and data acquisition optimized for data freshness, comprising the following steps: acquiring gridded observation information of the monitoring area and generating a spatial reward value prediction prior that represents each grid's potential contribution to reducing the long-term weighted age of information (AoI); constructing a binary action mask from system hard constraints such as UAV state, speed reachability, and energy budget to form a set of state-related actionable actions; fusing the spatial prior into the policy network, sampling and outputting movement and acquisition decisions only from the set of actionable actions, and controlling the UAV to perform parallel uplink acquisition; determining the effective update status of nodes based on transmission feasibility, refreshing a node's AoI only when the upload condition is met and otherwise incrementing it, and synchronously updating the cache, backhaul flag, and energy; and calculating an immediate reward that includes a weighted average AoI term and an energy consumption penalty to optimize the policy network. This invention can achieve faster convergence, lower long-term average AoI, and more stable online control in real-world scenarios with multiple, strongly coupled constraints, and has significant advantages in engineering applications.

Description

Technical Field

[0001] This invention relates to a method for joint control of UAV trajectory and data acquisition, and more particularly to a method for joint control of UAV trajectory and data acquisition based on data freshness.

Background Technology

[0002] With the rapid development of the "low-altitude economy," drones, as low-altitude intelligent carriers, have been widely used in urban governance, emergency inspections, environmental monitoring, and public safety. In these application scenarios, business needs exhibit characteristics of "on-demand initiation, rapid response, and continuous updates," meaning that data requesters need to obtain the latest status of a specific area or target at any given time. To address this, drone swarm intelligence sensing systems have emerged. Their core objective is no longer merely to complete a single data collection, but to ensure that the data acquired by the platform has sufficient "freshness" to reflect the real-time status of the monitored objects.

[0003] To quantify data freshness, academia and industry have introduced the Age of Information (AoI) metric. AoI is defined as the time elapsed since the data was generated; a lower AoI indicates fresher information. In a typical UAV swarm intelligence sensing system architecture, the platform layer receives sensing requests and generates acquisition tasks. Ground base stations act as communication intermediaries, forwarding task instructions and receiving data feedback. UAVs move within the monitoring area, performing periodic acquisition and data backhaul at multiple task points, forming a closed loop of "request-dispatch-execution-feedback-redispatch." However, UAVs are constrained by engineering limits such as flight speed, endurance, communication coverage, cache capacity, and parallel acquisition capability. How to jointly optimize the UAV's movement trajectory and acquisition strategy in a dynamically changing task environment so as to minimize the overall system AoI has therefore become both a hot topic and a challenge in current technical research.

[0004] Existing technologies for AoI optimization in UAV data acquisition mainly include the following types of solutions. 1. Iterative optimization schemes for minimizing AoI. These approaches typically model UAV data collection as a path planning or vehicle routing problem subject to constraints such as energy. Near-optimal solutions are obtained by alternately optimizing variables such as hovering position, visiting order, and task allocation. Representative techniques include joint optimization methods that minimize the AoI of data collection with multi-UAV assistance, and energy-efficient data collection methods that incorporate AoI thresholds and energy consumption to output a path and speed configuration. While these approaches take AoI as the core objective and account for engineering constraints such as energy, they lean toward offline or semi-offline solution paradigms. When faced with dynamic task arrival, time-varying priorities, and continuous platform updates, they often require frequent recalculation, which limits real-time performance and scalability. Furthermore, these methods lack up-front guarantees on state-related action feasibility (such as resource criticality, backhaul triggers, and reachability changes), which easily leads to conflicts between policy executability and system rules.

[0005] 2. Online decision-making solutions based on deep reinforcement learning with freshness as the core reward. This type of approach models the UAV control process as a Markov Decision Process (MDP) by defining states, actions, and reward functions. It uses AoI or other freshness metrics as the primary optimization driver and prioritizes important tasks through weights. Representative techniques include path planning methods that use deep reinforcement learning with weighted rewards to balance information freshness and importance; methods that use clustering to determine hovering points and combine them with a Double Deep Q-Network (DDQN) for dynamic adjustment; and methods that construct MDPs based on dynamic node priorities to solve the joint optimization of task completion time and energy consumption. While this approach emphasizes online strategies and dynamic adaptation, it still has significant shortcomings. First, some solutions rely on structural simplifications such as clustering or hovering points, resulting in lower exploration efficiency when facing more complex constraint couplings and large action spaces. Second, hard constraints are often handled through reward penalties or after-the-fact selection, which easily generates a large number of invalid samples during training, and the executability and stability of the strategy are insufficient when rules such as backhaul triggers and cache limits exist. Third, insufficient attention to the consistency between "update success determination" and "AoI evolution" can easily cause the optimization objective to deviate from truly effective updates.

[0006] 3. Constraint-enhanced or structured reinforcement learning solutions. These approaches further introduce constraint handling or structured strategies into reinforcement learning frameworks to improve stability under complex constraints. For example, based on a hierarchical proximal policy optimization (PPO) architecture, they combine constrained MDPs with multi-objective optimization and embed AoI metrics into rewards or constraints to support dynamic collaboration. These approaches focus on stable optimization under multiple constraints and attempt to reduce decision-making difficulty through a structured framework. However, their emphasis often leans towards group coverage or business-specific multi-objective goals, rather than directly addressing continuous freshness maintenance in the closed loop of "platform dispatch - collection - feedback." Furthermore, in many implementations constraint handling still relies primarily on penalty terms or relaxation methods, making it difficult to achieve strict, state-dependent feasibility guarantees at the action sampling level, and explicit guidance from the spatial value structure is often insufficient.

[0007] In summary, existing technologies typically exhibit the following common drawbacks when deployed in real-world scenarios. Information assumptions are too idealistic: many methods implicitly assume that UAVs know the fine-grained state of each node, link quality, or task changes in real time (i.e., global, instantaneous, and accurate information). This assumption is often hard to satisfy in actual crowdsensing and incurs a large communication overhead.

[0008] Spatial structure is not effectively utilized: when AoI is combined with task weights, spatially uneven "high-demand areas / hotspot clusters" naturally form. However, existing methods mostly treat the grid action space as an approximately uniform object of exploration, so UAVs tend to detour, explore repeatedly, and converge slowly on high-value areas during early training and online operation, significantly reducing sample efficiency and real-time decision quality.

[0009] Hard constraints are handled passively, leaving the action space uncontrolled: when hard constraints such as maximum speed, communication coverage, MU-MIMO parallelism, cache backhaul rules, and battery energy budget are considered, the action space is essentially a "state-dependent set of feasible options." If naive reinforcement learning or simple greedy strategies are still used, actions that are unreachable, out of bounds, in violation of forced backhaul, or infeasible in terms of parallel access or rate will occur frequently during training and execution, leading to wasted exploration, training oscillations, and even unexecutable strategies.

Summary of the Invention

[0010] To address the aforementioned technical deficiencies, the purpose of this invention is to provide a new technical solution that enables stable decision-making through gridded synchronous observation without relying on real-time global information at the node level; it can construct spatial reward value prediction to guide UAVs to focus on high-value areas more quickly; and it can "front-load" hard constraints such as mobility, communication, cache backhaul, and energy into the construction and sampling process of action sets, ensuring that decisions in each time slot are executable and exploration is more efficient, ultimately achieving the comprehensive goals of continuous reduction of long-term weighted AoI and controllable energy consumption.

[0011] The specific solution is as follows: a UAV trajectory and data acquisition joint control method optimized for data freshness, including the following steps:
Step S1: Obtain gridded observation information of the monitoring area. The gridded observation information includes at least the weighted AoI distribution and node density distribution within each grid, and the UAV's own state information. The UAV's own state information includes its current location, remaining energy, cache usage, and backhaul flag.
Step S2: Based on the gridded observation information, generate a spatial reward value prediction prior. The spatial reward value prediction prior is a grid-level value vector or heatmap characterizing the potential contribution of each candidate grid to reducing long-term weighted AoI.
Step S3: Based on the UAV's own state information and system hard constraints, construct a corresponding binary action mask to form a set of state-related actionable actions for the current time slot. The system hard constraints include at least two of the following: speed reachability, communication coverage, parallel access limit, cache capacity and backhaul rules, uplink rate feasibility, and energy budget constraints. Under the backhaul rule constraint, when the cache usage exceeds a preset threshold, the binary action mask retains only actions that point within the communication coverage area of the ground base station or that meet specific backhaul path requirements, forcing the UAV to perform a data backhaul task.

[0012] Step S4: Use the spatial reward value prediction prior as an auxiliary input feature or fusion representation of the policy network, and combine it with the set of state-related actionable actions to output the movement and acquisition decision actions of the current time slot through the policy network, wherein the policy network selects only from the set of state-related actionable actions when sampling or selecting actions.

[0013] Step S5: Control the UAV to execute the movement and data acquisition decision actions. After reaching the target location, perform multi-user multiple-input multiple-output (MU-MIMO) parallel uplink data acquisition scheduling within the communication coverage area.
Step S6: Based on the transmission feasibility judgment condition, determine the effective update status of each acquisition node. If a node completes an effective data upload within the time slot, refresh the node's AoI; otherwise, increment the node's AoI by one time slot. Update the cache usage, check whether a backhaul is required, and update the energy budget.
Step S7: Update the UAV status information and the global system status, and calculate the immediate reward based on the updated status to optimize the policy network. The immediate reward includes at least a weighted average AoI penalty term and an energy consumption penalty term.
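The resource bookkeeping at the end of Step S6 (cache usage, backhaul check, energy budget) can be illustrated with a minimal Python sketch. The names `UavState` and `update_resources`, the flag `offloaded`, and the rule that the backhaul flag is raised once cache usage reaches a threshold are illustrative assumptions, not definitions taken from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UavState:
    """Hypothetical container for the UAV's resource state in one time slot."""
    energy: float         # remaining energy budget
    cache_used: float     # data currently held in the on-board cache
    backhaul_flag: bool   # True when the UAV must return data before collecting more


def update_resources(state: UavState, collected: float, e_fly: float, e_comm: float,
                     cache_capacity: float, cache_threshold: float,
                     offloaded: bool = False) -> UavState:
    """Apply one slot of cache, backhaul-flag and energy bookkeeping (assumed rules)."""
    if offloaded:
        # Assumption: a completed backhaul empties the cache.
        cache = 0.0
    else:
        # The cache can never hold more than its capacity.
        cache = min(state.cache_used + collected, cache_capacity)
    # Flight and communication both draw from the same energy budget.
    energy = max(state.energy - e_fly - e_comm, 0.0)
    return UavState(energy=energy, cache_used=cache,
                    backhaul_flag=cache >= cache_threshold)
```

For example, starting from `UavState(100.0, 6.0, False)` and collecting 3 units with a threshold of 8 raises the backhaul flag, and a later slot with `offloaded=True` clears it again.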

[0014] In the above scheme, the spatial reward value prediction prior generated in step S2 includes at least two of the following: elements reflecting the intensity of freshness demand within the grid or its coverage neighborhood, elements reflecting the grid's potential parallel acquisition capability, and spatial aggregation structure elements reflecting the distribution of hotspot clusters at the coverage scale. The spatial reward value prediction prior adopts a normalized fusion form, and its specific expression is:

$$\mathrm{SRVP}_g(t) = \mathrm{Norm}\big(\alpha D_g(t) + \beta P_g(t) + \gamma C_g(t)\big),$$

where $D_g(t)$ is the demand strength of the grid, $P_g(t)$ is the parallel potential of the grid, $C_g(t)$ is the clustering strength of the grid, $\alpha$, $\beta$, $\gamma$ are the non-negative fusion weights, $g$ is the grid index, and $t$ is the time slot. The demand strength and parallel potential are given by

$$D_g(t) = \sum_{i \in \mathcal{N}_g} w_i A_i(t), \qquad P_g(t) = \min\big(|\mathcal{N}_g|, M\big),$$

where $\mathcal{N}_g$ is the set of candidate nodes in the coverage neighborhood, $w_i$ is the node weight, $A_i(t)$ is the AoI of node $i$, and $M$ is the upper limit for parallel access.
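A minimal Python sketch of the normalized fusion described in this paragraph is given below. The text does not specify the normalization, so the sketch assumes each component is first scaled to [0, 1] (demand and clustering by their maximum, parallel potential by the limit M) and the fused values are then divided by their sum. The function names and the dictionary-based data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def _scale(values: Dict[int, float]) -> Dict[int, float]:
    """Divide by the maximum so a component lies in [0, 1]; all-zero input stays zero."""
    top = max(values.values(), default=0.0)
    return {g: (v / top if top > 0 else 0.0) for g, v in values.items()}


def srvp_prior(nodes_by_grid: Dict[int, List[Tuple[float, float]]],
               cluster: Dict[int, float], m_parallel: int,
               alpha: float = 1.0, beta: float = 1.0,
               gamma: float = 1.0) -> Dict[int, float]:
    """Fuse demand, parallel potential and clustering strength into a prior that sums to 1.

    nodes_by_grid maps a grid index to the (weight, AoI) pairs of the candidate
    nodes in its coverage neighborhood; cluster maps a grid index to its
    clustering strength, computed elsewhere.
    """
    grids = list(nodes_by_grid)
    if not grids:
        return {}
    # Demand strength: weighted AoI summed over the candidate nodes of each grid.
    demand = _scale({g: sum(w * a for w, a in nodes_by_grid[g]) for g in grids})
    # Parallel potential: candidate count capped at M, scaled by M (assumed scaling).
    parallel = {g: min(len(nodes_by_grid[g]), m_parallel) / m_parallel for g in grids}
    clustering = _scale({g: cluster.get(g, 0.0) for g in grids})
    raw = {g: alpha * demand[g] + beta * parallel[g] + gamma * clustering[g]
           for g in grids}
    total = sum(raw.values())
    if total <= 0:
        # No demand anywhere: fall back to a uniform prior.
        return {g: 1.0 / len(grids) for g in grids}
    return {g: v / total for g, v in raw.items()}
```

With equal fusion weights, a grid that holds old, heavily weighted nodes and enough candidates to fill all M parallel links receives the largest share of the prior.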

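[0015] In the above scheme, the mask processing expression in step S3 is:

$$\tilde{\pi}(a \mid s_t) = \frac{m_t(a)\,\pi_\theta(a \mid s_t)}{\sum_{a'} m_t(a')\,\pi_\theta(a' \mid s_t)},$$

where $\pi_\theta(a \mid s_t)$ is the output of the policy network for action $a$, $m_t(a) \in \{0, 1\}$ is the binary action mask, and $s_t$ is the current system state.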
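The masking step, in which infeasible actions receive zero probability and the remaining actions are renormalized, can be sketched in Python as follows. The sketch assumes the policy network emits unnormalized scores (logits) and applies a softmax restricted to the feasible actions; the function name `masked_policy` is illustrative.

```python
import math
from typing import List


def masked_policy(logits: List[float], mask: List[int]) -> List[float]:
    """Return action probabilities that are zero wherever mask is 0 and sum to 1 elsewhere."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    if not any(mask):
        raise ValueError("mask leaves no feasible action")
    # Subtract the largest feasible logit for numerical stability.
    top = max(l for l, m in zip(logits, mask) if m)
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

An action can then be drawn with `random.choices(range(len(p)), weights=p)`, which never returns a masked index because its weight is exactly zero.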

[0016] In the above scheme, in step S5, under the premise of satisfying the antenna parallel limit and buffer capacity, a feasible subset is selected from the candidate nodes, and the combination expected to bring the maximum weighted AoI reduction is given priority. If the buffer full forced backhaul rule is triggered during the acquisition process, the backhaul action is executed first until the buffer is released.
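One simple way to realize this selection is a greedy rule, sketched below in Python: candidates are ranked by weight times AoI and taken in that order while the parallel limit and the free buffer space allow. Greedy selection, the `Node` fields, and the per-node `packet_size` are assumptions made for illustration; the text only requires that the chosen subset be feasible and favor the largest expected weighted AoI reduction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical ground node as seen by the scheduler."""
    node_id: int
    weight: float       # task priority weight
    aoi: float          # current age of information
    packet_size: float  # buffer space the node's status packet would occupy


def schedule_nodes(candidates: List[Node], m_parallel: int,
                   buffer_free: float) -> List[Node]:
    """Pick at most m_parallel nodes that fit in the free buffer, highest weighted AoI first."""
    chosen: List[Node] = []
    ranked = sorted(candidates, key=lambda n: n.weight * n.aoi, reverse=True)
    for node in ranked:
        if len(chosen) == m_parallel:
            break  # antenna parallel limit reached
        if node.packet_size <= buffer_free:
            chosen.append(node)
            buffer_free -= node.packet_size
    return chosen
```

When all packets have the same size this greedy rule picks the best feasible subset; with unequal sizes it is only a heuristic, since the exact problem is a knapsack variant.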

[0017] In the above scheme, the transmission feasibility determination condition in step S6 is:

$$R_i(t)\,\tau \ge D,$$

where $R_i(t)$ is the node's effective uplink rate, $\tau$ is the communication duration, and $D$ is the size of the status packet.
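The link between this feasibility condition and AoI evolution can be sketched in Python as follows. The sketch assumes that a refreshed node's AoI is reset to a configurable `reset_value` (one slot by default), since the text does not state the reset value; all function and parameter names are illustrative.

```python
from typing import Dict


def is_effective_update(rate: float, duration: float, packet_size: float) -> bool:
    """True when the node can upload one full status packet within the communication duration."""
    return rate * duration >= packet_size


def step_aoi(aoi: Dict[str, float], served_rates: Dict[str, float],
             duration: float, packet_size: float,
             slot_len: float = 1.0, reset_value: float = 1.0) -> Dict[str, float]:
    """Advance every node's AoI by one slot, refreshing only nodes with an effective update.

    served_rates holds the effective uplink rate of each scheduled node;
    nodes that were not scheduled are absent and simply age.
    """
    updated = {}
    for node, age in aoi.items():
        rate = served_rates.get(node)
        if rate is not None and is_effective_update(rate, duration, packet_size):
            updated[node] = reset_value      # effective update: AoI is refreshed
        else:
            updated[node] = age + slot_len   # no update or failed upload: AoI grows
    return updated
```

A scheduled node whose rate is too low to finish the packet ages exactly like an unscheduled node, which keeps the reward signal tied to updates that actually completed.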

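[0018] In the above scheme, the expression for calculating the immediate reward in step S7 is:

$$r_t = -\Big(\bar{A}_w(t) + \lambda\,\big(E^{\mathrm{fly}}_t + E^{\mathrm{comm}}_t\big)\Big),$$

where $\bar{A}_w(t)$ is the weighted average AoI of the time slot, $\lambda$ is the coefficient balancing freshness and energy consumption, $E^{\mathrm{fly}}_t$ is the flight energy consumption, and $E^{\mathrm{comm}}_t$ is the communication energy consumption.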
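A minimal Python sketch of such a reward is shown below. It assumes the reward is the negative sum of the weighted average AoI and an energy penalty scaled by a balancing coefficient, and that the weighted average is normalized by the total weight; both choices and all names are assumptions made for illustration.

```python
from typing import Sequence


def weighted_average_aoi(weights: Sequence[float], aois: Sequence[float]) -> float:
    """Weighted mean of node AoI values, normalized by the total weight (assumed form)."""
    if len(weights) != len(aois):
        raise ValueError("weights and aois must have the same length")
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("total weight must be positive")
    return sum(w * a for w, a in zip(weights, aois)) / total_w


def immediate_reward(weights: Sequence[float], aois: Sequence[float],
                     e_fly: float, e_comm: float, lam: float) -> float:
    """Negative penalty: weighted average AoI plus lam times total energy spent in the slot."""
    return -(weighted_average_aoi(weights, aois) + lam * (e_fly + e_comm))
```

For instance, two nodes with weights 1 and 3 and AoI 4 and 8 give a weighted average of 7; with 12 units of energy and a coefficient of 0.5 the reward is -13.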

[0019] In the above scheme, the gridded observation information obtained in step S1 is generated by the control center periodically aggregating task and node information. The control center synchronizes the weighted AoI heatmap, node density map and MU-MIMO opportunity map to the UAV in the form of a fixed-size multi-channel grid map. The UAV does not need to know the real-time global fine state of each node.
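The aggregation performed by the control center can be sketched in Python as below. The sketch builds the three channels per cell from a flat node list and, as a simplification, derives the MU-MIMO opportunity value from the node count of the cell itself rather than from a coverage neighborhood. The function name, the tuple layout `(x, y, weight, aoi)`, and the square monitoring area are assumptions.

```python
from typing import Iterable, List, Tuple

Grid = List[List[float]]


def build_grid_maps(nodes: Iterable[Tuple[float, float, float, float]],
                    grid_size: int, cell_len: float,
                    m_parallel: int) -> Tuple[Grid, Grid, Grid]:
    """Aggregate (x, y, weight, aoi) nodes into three G x G channels.

    Returns (weighted AoI heatmap, node density map, MU-MIMO opportunity map).
    Coordinates are assumed to lie in [0, grid_size * cell_len].
    """
    aoi_map = [[0.0] * grid_size for _ in range(grid_size)]
    density = [[0.0] * grid_size for _ in range(grid_size)]
    for x, y, weight, aoi in nodes:
        # Clamp so that points on the far boundary fall into the last cell.
        col = min(max(int(x // cell_len), 0), grid_size - 1)
        row = min(max(int(y // cell_len), 0), grid_size - 1)
        aoi_map[row][col] += weight * aoi
        density[row][col] += 1.0
    # Opportunity: fraction of the M parallel links the cell could keep busy.
    mimo = [[min(n, m_parallel) / m_parallel for n in row] for row in density]
    return aoi_map, density, mimo
```

Because the output size depends only on the grid size and not on the number of nodes, the synchronization payload stays fixed as the node population grows.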

[0020] This invention has outstanding comprehensive advantages in terms of convergence efficiency, long-term freshness, operability, and consistency of engineering closed-loop. These advantages are mainly reflected in the following aspects.

[0021] (1) Timeliness and convergence efficiency: Introducing a spatial reward value prediction heatmap during the policy learning phase explicitly represents the potential returns in the grid space. This allows the policy to focus earlier on regions with high AoI demand and high weight, reducing redundant detours and low-return visits during the exploration phase. Consequently, it exhibits a faster convergence trend during training and achieves a lower average AoI level in the stable phase. This mechanism is particularly suitable for scenarios with a large number of nodes, uneven spatial distribution, or obvious hotspots, and can achieve better long-term freshness performance with the same training budget.

[0022] (2) Feasibility and stability: In each time slot, a set / mask of state-related actionable actions is constructed based on hard constraints such as speed reachability, coverage relationship, parallel acquisition limit, cache usage / backhaul rules, and energy budget. In the policy output stage, infeasible actions are eliminated and the actionable actions are renormalized, thereby ensuring that the output decision satisfies the engineering constraints by construction and significantly reducing the probability of execution failure caused by out-of-bounds moves, unreachable targets, and resource overruns.

[0023] (3) Fidelity of the reward and consistency of the closed loop: The determination of effective updates is strictly linked to AoI evolution. The corresponding node's AoI is refreshed only when the threshold condition of completing an effective data upload within the time slot is met; otherwise, it is incremented according to the rules, and the result is fed back into the next time slot's observation and decision-making. This ensures that the reward signal used for policy optimization is consistent with the actual benefit of communication updates, improving the reliability and interpretability of long-term freshness optimization.

[0024] (4) Resource efficiency and engineering feasibility: By incorporating cache usage, backhaul triggering, and energy budget into the construction of action constraints and the closed loop of state updates, the behavior of UAVs remains controllable during execution. While meeting resource constraints, the probability of effective updates per time slot and the regional service efficiency are improved, thereby enhancing overall energy efficiency and the stability of engineering deployment.

[0025] In summary, the "spatial reward value prediction module" improves how quickly the policy focuses on high-value areas, the "action constraint embedding" ensures that actions are executable and reduces ineffective exploration, and the closed-loop consistency of "effective update judgment - AoI evolution - state feedback" ensures that the optimization benefits are real and reliable. The invention can achieve faster convergence, lower long-term average AoI, and more stable online control in real-world scenarios with multiple, strongly coupled constraints, demonstrating significant advantages in engineering applications.

Attached Figure Description

[0026] Figure 1 is a schematic diagram of the architecture of an unmanned aerial vehicle (UAV) swarm intelligence perception system.

[0027] Figure 2 is a diagram of the joint control technology architecture for UAV trajectory and data acquisition optimized for data freshness.

[0028] Figure 3 is a flowchart of the decision-making process for data collection by UAVs.

Detailed Implementation

[0029] The system architecture of this solution can be composed of "platform / control center - base station / ground network - UAV - data acquisition task point or ground sensor node". The platform is responsible for receiving service requests, generating tasks, and dispatching them; the base station acts as an information intermediary, forwarding task information and relaying collected data back; the UAV acts as the executor, moving within the monitoring area and collecting and transmitting data. The method operates on a discrete time slot decision cycle. Within each time slot it completes observation construction, decision output, movement execution, data acquisition on arrival, and status update, and then enters the next time slot, thus forming a continuously iterating closed-loop control process. For consistency, the terminology used in this document is explained as follows:
Data freshness / information freshness describes whether the information obtained by the platform reflects the "current state." A commonly used quantitative indicator is the Age of Information (AoI), which represents the time elapsed from the last effective update of a target point to the current moment.
Weighted AoI is a metric that reflects differences in task urgency / importance: the AoI of each target point is weighted according to weights determined by business priority, risk level, or task urgency.
An effective update is an update event in which the data upload is completed within a time slot, the transmission threshold is met, and the platform recognizes the event as a "refresh." The AoI of the corresponding target point is refreshed only when the effective update criteria are met; otherwise, it increments according to the time slot progression rule.
Grid observation refers to the state input used for decision-making, which includes at least AoI and its weight distribution, target point / node spatial density or task distribution, and the UAV's own state, such as position, remaining energy, cache usage, and backhaul flag.

[0030] Spatial Reward Value Prediction (SRVP) is a grid-level value heatmap / prior map generated based on grid observations. It is used to characterize the potential contribution of each grid to reducing long-term weighted AoI in the current time slot. The prior can be used as policy input features or fusion representations to guide decision-making.

[0031] The action set / action mask is a set or binary mask obtained by filtering the action space based on hard constraints. It is used to eliminate unexecutable actions and renormalize the actionable actions during the decision-making stage. Hard constraints include at least two or more of the following constraints: mobility reachability (speed / time slot displacement limit), coverage relationship (coverage radius / serviceability relationship), parallel acquisition limit, cache capacity and occupancy, backhaul trigger and backhaul area rules, and energy budget.

[0032] Multi-user Multiple-input Multiple-output (MU-MIMO): Uplink parallel access technology that enables UAVs to receive data from multiple nodes in parallel within the same time slot.

[0033] Gridded time slot control: The region is discretized into a G×G grid, and the target grid is output with time slots as the decision period, and movement and acquisition are performed.
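As a small illustration of this discretization, the Python sketch below maps a continuous position to a target grid index and back to the cell center. Row-major indexing, a square cell of side `cell_len`, and the function names are assumptions chosen for illustration.

```python
from typing import Tuple


def cell_of(x: float, y: float, cell_len: float, grid_size: int) -> int:
    """Return the row-major index of the G x G cell containing (x, y)."""
    # Clamp so that points on or beyond the boundary map to an edge cell.
    col = min(max(int(x // cell_len), 0), grid_size - 1)
    row = min(max(int(y // cell_len), 0), grid_size - 1)
    return row * grid_size + col


def center_of(index: int, cell_len: float, grid_size: int) -> Tuple[float, float]:
    """Return the (x, y) center of the cell with the given row-major index."""
    row, col = divmod(index, grid_size)
    return ((col + 0.5) * cell_len, (row + 0.5) * cell_len)
```

With this layout the policy only has to output one integer in the range 0 to G*G - 1 per time slot, and the flight controller converts it to a waypoint with `center_of`.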

[0034] This solution can be applied to business products or platforms that require reducing the age of information, such as emergency rescue monitoring, urban environmental monitoring, traffic incident inspection, and industrial park inspection / security. Examples include drone inspection platforms, emergency command systems, urban sensing platforms, and industrial IoT monitoring systems.

[0035] The working principle of this solution is described below.

[0036] As shown in Figure 2, this invention identifies two common problems and proposes corresponding collaborative technical solutions, yielding a UAV trajectory-acquisition joint control scheme for long-term optimization of data freshness. The core of the scheme lies in the collaborative integration of the Spatial Reward Value Prediction (SRVP) module and the action constraint embedding (Mask) module. This enables the system to achieve both "rapid focusing on high-value areas" and "executable actions under multiple constraints" within a large monitoring area. Corresponding technical elements are proposed for the two key problems. For the redundant detours and ineffective exploration caused by the vast grid action space and the lack of an expressed spatial value structure, the SRVP module is introduced to construct a spatial reward value prediction heatmap from grid observations, giving the policy structured guidance on "where it is more worthwhile to serve." For the unexecutable actions and optimization instability caused by the strong coupling of constraints such as speed reachability, coverage, parallel acquisition, cache backhaul, and energy budget, the Mask module is introduced to construct a state-related set of actionable actions / mask. Combined with effective update judgment, unexecutable actions are eliminated in advance during the decision-making stage, ensuring that the output actions meet hard constraints and that the update benefits are real. The spatial prior, as a prior input to the policy, influences the distribution of action preferences. The action mask, as a constraint embedding, performs feasibility screening of the action space. Both are computed in parallel within the same time slot and act together in the policy output stage, ultimately forming a movement and acquisition decision that is both "high-value oriented" and "executable".

[0037] The following section explains the parallel generation of the spatial reward value prediction module and the action constraint mask module in the same time slot, their combined effect in the unified strategy decision-making stage, and the principle process of forming a closed-loop optimization from the perspective of working mechanism. This clarifies the necessity and internal logic of the two modules working together to achieve long-term data freshness optimization.

[0038] First, a spatial reward value prediction is generated based on gridded observation information synchronously constructed by the control center or on the UAV. This prior outputs the relative benefit indication of each candidate grid at the grid granularity, which is used to guide the strategy to focus on high-value areas more quickly. The construction of the spatial reward value prediction integrates at least two of the following elements: First, elements reflecting the "freshness demand intensity" within the grid or its coverage neighborhood, such as the aggregation result of node weights and AoI; second, elements reflecting the grid's potential "parallel acquisition capability", such as the parallel access potential determined by the number of candidate nodes in the coverage neighborhood and the upper limit of UAV parallel access; and third, elements reflecting the "spatial clustering structure" of hotspot cluster distribution at the coverage scale, such as the clustering strength obtained by aggregating demand intensity / density in the neighborhood corresponding to the coverage radius.

[0039] For ease of formal description, the demand intensity and parallel potential of the grid $g$ can be written as:

$$D_g(t) = \sum_{i \in \mathcal{N}_g} w_i A_i(t), \qquad P_g(t) = \min\big(|\mathcal{N}_g|, M\big),$$

and the spatial reward value prediction is written in a normalized fusion form:

$$\mathrm{SRVP}_g(t) = \mathrm{Norm}\big(\alpha D_g(t) + \beta P_g(t) + \gamma C_g(t)\big),$$

where $\mathcal{N}_g$ is the set of candidate nodes in the coverage neighborhood, $M$ is the upper limit for parallel access, $C_g(t)$ is the cluster strength, and $\alpha$, $\beta$, $\gamma$ are non-negative fusion weights. The spatial reward value prediction, after normalization, serves as an auxiliary input or fusion feature of the policy network, enabling the policy to prioritize regions with higher expected returns among equally feasible candidate moves, thereby reducing blind exploration and improving convergence speed and online decision stability.
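The cluster strength is described only qualitatively, as an aggregation of demand intensity or density over the neighborhood that corresponds to the coverage radius. One possible reading is sketched below in Python: each cell's cluster strength is the sum of the demand map over a square window of `radius` cells around it. The square window, the plain sum, and the function name are assumptions made for illustration.

```python
from typing import List


def cluster_strength(demand: List[List[float]], radius: int) -> List[List[float]]:
    """Sum the demand map over a (2 * radius + 1)-wide square window around each cell."""
    rows = len(demand)
    cols = len(demand[0]) if rows else 0
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            # The window is clipped at the borders of the monitoring area.
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    total += demand[rr][cc]
            out[r][c] = total
    return out
```

A cell with modest demand of its own but several demanding neighbors then scores higher than an isolated hotspot of the same size, which is the effect the clustering term is meant to capture.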

[0040] Furthermore, to ensure the engineering feasibility of decisions and align the policy learning process with real-world executable behavior, hard constraints such as speed reachability, communication coverage, parallel access limits, cache capacity and backhaul rules, and energy budget are embedded into the action space up front. This constructs a set of state-dependent feasible actions and applies mask constraints to the policy output. Constraint embedding is the second key branch generated in parallel with spatial reward value prediction within the same time slot: constraint embedding addresses "which actions are allowed to be executed in the current state," while spatial reward value prediction addresses "which of the allowed actions are more conducive to long-term freshness." Specifically, in each time slot, a feasibility assessment of all grid actions is performed based on state information such as the current UAV position, remaining energy, cache usage, backhaul flag, and coverage relationships, yielding a binary mask. For infeasible actions, the probability is set to zero, and the remaining actions are renormalized before sampling or selection. This masking process can be expressed as follows:

$$\tilde{\pi}(a \mid s_t) = \frac{m_t(a)\,\pi_\theta(a \mid s_t)}{\sum_{a'} m_t(a')\,\pi_\theta(a' \mid s_t)},$$

where $\pi_\theta(a \mid s_t)$ is the output of the policy network for action $a$, $m_t(a)$ is the binary action mask, and $s_t$ is the current system state. The masking method can also hard-code the backhaul rules: when the cache usage reaches a threshold or a forced backhaul condition is triggered, the mask retains only actions within the backhaul area or those that meet the backhaul path constraints, so that in this state the policy can only output movement decisions that satisfy the backhaul requirements, thereby improving deployment safety and stability.
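How such a mask might be assembled from individual constraints is sketched below in Python for three of them: speed reachability, energy budget, and the forced backhaul rule. The linear energy model (`energy_per_m`), the circular base-station coverage area, the row-major cell indexing, and every parameter name are assumptions made for illustration; coverage, parallel access, and rate feasibility checks are omitted.

```python
import math
from typing import List, Tuple


def build_action_mask(pos: Tuple[float, float], grid_size: int, cell_len: float,
                      v_max: float, slot_len: float,
                      energy_left: float, energy_per_m: float,
                      cache_used: float, cache_threshold: float,
                      bs_pos: Tuple[float, float], bs_radius: float) -> List[int]:
    """Return a 0/1 mask over the G * G target cells for the current time slot."""
    must_backhaul = cache_used >= cache_threshold
    mask = []
    for index in range(grid_size * grid_size):
        row, col = divmod(index, grid_size)
        cx, cy = (col + 0.5) * cell_len, (row + 0.5) * cell_len
        dist = math.hypot(cx - pos[0], cy - pos[1])
        # Speed reachability: the cell center must be reachable within one slot.
        reachable = dist <= v_max * slot_len
        # Energy budget: the flight must not cost more than the remaining energy.
        affordable = dist * energy_per_m <= energy_left
        # Backhaul rule: once triggered, only cells inside base-station coverage remain.
        in_backhaul_area = math.hypot(cx - bs_pos[0], cy - bs_pos[1]) <= bs_radius
        allowed = reachable and affordable and (in_backhaul_area or not must_backhaul)
        mask.append(1 if allowed else 0)
    return mask
```

If the base station is not reachable within a single slot, this simple version returns an all-zero mask in backhaul mode; a fuller implementation would instead keep the cells that lie on a valid backhaul path, as the text allows.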

[0041] After the UAV reaches the target grid or target location, deterministic MU-MIMO scheduling is used to select a set of nodes within the coverage area for parallel uplink data acquisition. The success of an update is tied to the feasibility of transmission so that AoI evolution is consistent with the actual communication outcome. Specifically, a selected node is considered to have completed a valid update, and its AoI is refreshed, only if it can complete the upload of a fixed-size data packet within the time slot under the current parallel access and link conditions; otherwise, the update is considered a failure, and the AoI increments according to the rules. The success of the update can be determined using the following threshold condition:

$$R_i(t)\,\tau \ge D,$$

where $R_i(t)$ is the node's effective uplink rate, $\tau$ is the communication duration, and $D$ is the status packet size. By coordinating the spatial reward value prediction module, action masking, and the update success determination mechanism, a closed-loop control chain of "observation synchronization - prior guidance - constraint embedding - decision execution - feasible update - state iteration" is formed. This ensures that the policy output focuses on high-value areas while guaranteeing that each time slot's action is executable and the update reward is real. In the long run, this can stably reduce the weighted average AoI while respecting system resource constraints such as cache and energy.

[0042] As explained in the preceding principle section, the spatial reward value prediction module and the actionable constraint embedding module are generated in parallel within the same time slot and converge at the unified policy decision point, forming a collaborative closed loop between value guidance and actionable constraints. To illustrate how this mechanism is implemented in a practical system, the following takes a single time slot as the basic decision cycle and, with reference to the information flow and decision closed loop shown in Figure 2, describes the execution process of UAV data acquisition and decision-making: "observation construction - prior generation - mask construction - unified policy decision-making - execution and acquisition - effective update judgment - state and reward feedback". Step 1, Information Acquisition and Observation Construction: The control center aggregates task and node information into a fixed-size multi-channel grid map and periodically synchronizes it to the UAV terminal. The map typically includes a weighted AoI heatmap (aggregated per cell), a node density map, and a "MU-MIMO opportunity map" reflecting the potential for parallel access. At the same time, the UAV locally appends its own low-dimensional state, such as current location / grid index, remaining energy, and cache usage, to form the observation. This mechanism avoids reliance on real-time global per-node information, keeping engineering overhead more controllable.

[0043] Step 2, Spatial Reward Value Prediction: Based on the synchronized grid map, the spatial reward value prediction (SRVP) module outputs a grid value vector with one entry per grid. The SRVP means "the relative benefit / utility prior of moving to each grid"; it indicates high-value areas but does not directly replace the final action. SRVP constructs the prior value of each grid as a fusion of three factors: AoI demand, represented by the aggregated value of that grid in the weighted AoI heatmap (high-AoI / high-weight areas are prioritized); parallel access potential, which roughly describes the relationship between the number of nodes that can be served in parallel within the coverage area and the upper limit M on the antenna count; and a coverage-scale clustering term, in which the clustering structure of demand / density is extracted at the neighborhood scale corresponding to the coverage radius, to help identify "hotspot clusters" rather than single points. The three factors are then fused and normalized to obtain the SRVP, which forms a spatial prior field and serves as an auxiliary input to the policy network, enabling the policy to focus on high-value areas more quickly in a large action space.
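
The three-factor fusion can be sketched in Python as follows. This is only one possible reading: the weighted-sum fusion, the min-max normalization, the `min(n, M) / M` form of the parallel potential, and all names and numbers are assumptions for illustration, since the paragraph does not fix these details.

```python
def min_max(values):
    """Scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def srvp_prior(weighted_aoi, node_count, cluster, m_antennas,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Fuse demand, parallel potential and clustering into one prior per grid."""
    demand = min_max(weighted_aoi)
    # Assumed form: share of the antenna limit M that the grid could use.
    parallel = [min(n, m_antennas) / m_antennas for n in node_count]
    clustering = min_max(cluster)
    fused = [
        alpha * d + beta * p + gamma * c
        for d, p, c in zip(demand, parallel, clustering)
    ]
    return min_max(fused)


# Hypothetical example with three grids and an antenna limit of M = 4.
prior = srvp_prior(
    weighted_aoi=[0.0, 6.0, 12.0],
    node_count=[0, 2, 8],
    cluster=[1.0, 3.0, 2.0],
    m_antennas=4,
)
```

With equal weights the third grid receives the highest prior because it leads on demand and parallel potential, while the second grid is lifted by its clustering term.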

[0044] Step 3, Feasible Action Set: For the set of grid actions, a binary mask is constructed based on the system's hard constraints. If action a violates any hard constraint in the current state, its mask entry is set to 0 and it is eliminated during sampling. The constraints include: single-slot reachability limited by maximum speed, joint limits of communication coverage and parallel access / caching, uplink rate feasibility, forced backhaul rules triggered by a full cache, and remaining battery constraints. The key basis for the uplink feasibility and parallelism constraints is that "the achievable rate is jointly determined by the number of parallel nodes K and link fading / distance," and update validity is determined by the condition of "completing the upload of a fixed-length packet within a single time slot." The energy constraint requires that "flight energy consumption + communication energy consumption" not exceed the battery budget; it further narrows the actionable space and prevents the policy from becoming "unexecutable due to energy depletion" in later stages.
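
A minimal Python sketch of mask construction from three of these constraints (single-slot reachability, energy budget, forced backhaul) is given below. The linear flight energy model, the parameter names, and the example geometry are assumptions for illustration only; coverage, parallel access, and rate feasibility checks are omitted for brevity.

```python
import math


def build_action_mask(uav_xy, centers, max_step, energy_left, energy_per_m,
                      comm_energy, cache_used, cache_cap, backhaul_cells):
    """Return a binary mask over grid actions (1 = feasible, 0 = infeasible)."""
    force_backhaul = cache_used >= cache_cap  # "cache full" rule
    mask = []
    for idx, (cx, cy) in enumerate(centers):
        dist = math.hypot(cx - uav_xy[0], cy - uav_xy[1])
        reachable = dist <= max_step  # max speed times slot length
        # Assumed linear model: flight energy grows with distance flown.
        affordable = dist * energy_per_m + comm_energy <= energy_left
        backhaul_ok = (not force_backhaul) or (idx in backhaul_cells)
        mask.append(1 if (reachable and affordable and backhaul_ok) else 0)
    return mask


# Hypothetical example: four grid centers on a line, UAV at the second one.
centers = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
common = dict(uav_xy=(10.0, 0.0), centers=centers, max_step=10.0,
              energy_left=50.0, energy_per_m=2.0, comm_energy=5.0,
              cache_cap=10, backhaul_cells={0})
mask_normal = build_action_mask(cache_used=3, **common)
mask_full_cache = build_action_mask(cache_used=10, **common)
```

With spare cache the UAV may move to any reachable grid; once the cache is full, only the backhaul grid remains selectable.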

[0045] Step 4, Policy Decision-Making by Integrating the Prior and the Mask: The policy network integrates the original observations and the SRVP prior, outputs a preference value for each grid action, and applies a masked Softmax to remove infeasible actions before sampling, concentrating the policy distribution on "executable and high-value" regions. This mechanism embeds hard constraints directly into the action sampling process, reducing invalid exploration and improving convergence stability.

[0046] Step 5, Deterministic Scheduling and Update Judgment After Arrival: After the UAV moves to the center of the target grid, it performs MU-MIMO uplink acquisition within the coverage area. The scheduler, while meeting the antenna parallelism limit and buffer capacity, selects a feasible subset from the candidate nodes, prioritizing combinations expected to bring the maximum weighted AoI decrease. This scheduling itself is not a learning action, but it determines which nodes can meet the rate feasibility requirement and achieve a "successful update," thus affecting state transitions and rewards.
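
One simple way to realize such a deterministic scheduler is a greedy ranking, sketched below in Python. The greedy rule (rank by weight times AoI, keep the top entries) is a stand-in chosen for illustration; the paragraph does not specify the selection algorithm, and the names and numbers are hypothetical.

```python
def schedule_nodes(candidates, m_antennas, free_buffer_packets):
    """Pick nodes for parallel uplink, largest expected weighted AoI drop first.

    candidates: list of (node_id, weight, aoi) tuples inside the coverage area.
    m_antennas: antenna parallelism limit M.
    free_buffer_packets: packets that still fit in the UAV buffer (assumed unit).
    """
    limit = max(min(m_antennas, free_buffer_packets), 0)
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return [node_id for node_id, _, _ in ranked[:limit]]


# Hypothetical example: four candidate nodes, M = 2, room for three packets.
candidates = [("n1", 1.0, 5), ("n2", 2.0, 4), ("n3", 0.5, 20), ("n4", 1.0, 3)]
selected = schedule_nodes(candidates, m_antennas=2, free_buffer_packets=3)
```

Here the antenna limit is the binding constraint, so the two nodes with the largest weight-AoI products are served. A fuller scheduler would also re-check rate feasibility for the chosen group size.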

[0047] Step 6, AoI, Cache, and Energy Status Update: For each node, if the upload succeeds in the current time slot, its AoI is refreshed; otherwise it increments with each time slot. This update rule forms the basis of the system's freshness evolution and, together with the rate feasibility condition, determines whether an update is effective. At the same time, cache occupancy is updated (for example, increased on data acquisition and cleared on backhaul), and the "cache full → forced backhaul" rule is checked. Energy consumption is accumulated from flight propulsion and communication energy, is kept within the battery budget, and is carried forward to construct the feasible actions of the next time slot.
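
The per-slot bookkeeping in this step can be sketched in Python as below. The refresh value of 1 slot, the cache measured in packets, and all names are assumptions for illustration; the text only states that AoI is refreshed on success and incremented otherwise.

```python
def update_state(aoi, served_ok, cache_used, cache_cap, did_backhaul,
                 energy_used, e_fly, e_comm, energy_budget):
    """Advance AoI, cache and energy bookkeeping by one time slot."""
    # Assumption: a refreshed node restarts at AoI = 1, others age by one slot.
    new_aoi = {n: (1 if n in served_ok else a + 1) for n, a in aoi.items()}
    if did_backhaul:
        cache = 0  # backhaul clears the cache
    else:
        cache = min(cache_used + len(served_ok), cache_cap)
    must_backhaul = cache >= cache_cap  # "cache full" rule for the next slot
    energy = energy_used + e_fly + e_comm
    within_budget = energy <= energy_budget
    return new_aoi, cache, must_backhaul, energy, within_budget


# Hypothetical example: one successful upload fills the last free cache slot.
state = update_state(
    aoi={"n1": 5, "n2": 4, "n3": 20}, served_ok={"n3"},
    cache_used=9, cache_cap=10, did_backhaul=False,
    energy_used=100.0, e_fly=12.5, e_comm=2.5, energy_budget=200.0,
)
new_aoi, cache, must_backhaul, energy, within_budget = state
```

The returned `must_backhaul` flag and remaining energy are the quantities the next slot's mask construction would consume.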

[0048] Step 7, Rewards and Policy Learning / Optimization: To align with the long-term weighted AoI minimization objective, an immediate reward of the form "freshness penalty + energy consumption penalty" is adopted, for example $r_t = -\bar{A}_t - \lambda\,(E^{\mathrm{fly}}_t + E^{\mathrm{comm}}_t)$, where $\bar{A}_t$ is the weighted average AoI in time slot t, $\lambda$ is the coefficient that balances freshness against energy consumption, $E^{\mathrm{fly}}_t$ is the flight energy consumption, and $E^{\mathrm{comm}}_t$ is the communication energy consumption. This reward and the action masking mechanism work together to drive the policy to steadily improve long-term freshness while meeting the hard constraints.
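
A small Python sketch of a reward of this "freshness penalty + energy consumption penalty" form is given below. Normalizing the weighted AoI by the sum of weights, the coefficient name `lam`, and the example values are assumptions for illustration.

```python
def immediate_reward(aoi, weights, e_fly, e_comm, lam):
    """Negative weighted average AoI minus a scaled energy penalty."""
    total_weight = sum(weights[n] for n in aoi)
    # Assumption: the weighted average is normalized by the sum of weights.
    avg_aoi = sum(weights[n] * aoi[n] for n in aoi) / total_weight
    return -(avg_aoi + lam * (e_fly + e_comm))


# Hypothetical example: three nodes, 15 energy units spent in the slot.
reward = immediate_reward(
    aoi={"n1": 6, "n2": 5, "n3": 1},
    weights={"n1": 1.0, "n2": 2.0, "n3": 1.0},
    e_fly=12.5, e_comm=2.5, lam=0.1,
)
```

In this example the weighted average AoI is 4.25 and the scaled energy term is 1.5, so a larger `lam` shifts the policy toward saving energy at the cost of freshness.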

[0049] This solution can also employ various alternative implementation methods to adapt to different deployment conditions and broaden the scope of protection.

[0050] In terms of spatial discretization and motion control, the gridding of the monitoring area can be divided using different resolutions or unequal spacing: finer grids are used in hotspot areas, and coarser grids are used in low-demand areas to reduce computational overhead while maintaining guidance accuracy. Motion actions can also be changed from "selecting the target grid center" to "selecting any feasible point or heading angle and step size within the target grid," or a hierarchical decision-making process can be adopted, such as first selecting a cluster of regions and then selecting specific locations within those regions, to improve scalability in large-scale scenarios. For continuous spatial scenarios, this invention can also replace grid actions with continuous trajectory control, such as outputting a two-dimensional displacement vector or heading and velocity, while maintaining the "prior guidance-constraint embedding" mechanism unchanged.

[0051] For the generation of the spatial reward value prediction, the construction of the prior is not limited to a single network or a single statistical method. It can be implemented with convolution, attention mechanisms, or graph structure aggregation, or with explicit, interpretable statistical mappings. In addition to weighted AoI demand intensity, parallel acquisition potential, and hotspot clustering structure, the fused prior can also incorporate geographical obstacles / no-fly zones, link occlusion risks, historical update success rates, and task arrival frequencies, thereby enhancing the guidance effect in different scenarios. The prior can also be used in alternative ways: as an input feature to the policy network, as an action bias term, or as a basis for pre-screening candidate actions, such as a prior-driven Top-K candidate set, to flexibly balance computation and performance.
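
The prior-driven Top-K candidate set mentioned above can be sketched in Python as follows. The function name and the choice to filter by the feasibility mask before ranking are assumptions for illustration.

```python
def top_k_candidates(prior, mask, k):
    """Return the indices of the k feasible grid actions with the highest prior."""
    feasible = [i for i, m in enumerate(mask) if m]
    ranked = sorted(feasible, key=lambda i: prior[i], reverse=True)
    return ranked[:k]


# Hypothetical example: grid 2 has the best prior but is infeasible.
candidate_set = top_k_candidates(
    prior=[0.0, 0.8, 1.0, 0.3],
    mask=[1, 1, 0, 1],
    k=2,
)
```

The policy network would then only need to score the grids in `candidate_set`, which trades a little optimality for a much smaller action set in large maps.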

[0052] Regarding the embedding of actionable constraints, masking can take two forms: "hard masking" or "soft masking." Hard masking directly eliminates infeasible actions and renormalizes the remainder; soft masking applies attenuation weights or penalties to borderline actions, so that the policy maintains feasibility while retaining a degree of robustness. The set of constraints on which the mask is based can be chosen according to system conditions by enabling two or more of the following: speed reachability, coverage radius, parallel access limit, buffer capacity, backhaul triggering and backhaul area, energy budget, etc. For the backhaul mechanism, "forced backhaul when the buffer is full" can be replaced by "threshold backhaul," "periodic backhaul," or "predictive backhaul" (backhauling in advance based on future demand and energy capacity), while the policy output still satisfies the backhaul rules through the actionable set constraints.
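
The difference between the two forms can be illustrated with a short Python sketch of soft masking layered on a hard mask. The multiplicative attenuation, the weight values, and the names are assumptions for illustration; the text only says that borderline actions receive attenuation weights or penalties.

```python
def soft_mask(probs, hard_mask, attenuation):
    """Drop infeasible actions, down-weight borderline ones, then renormalize.

    hard_mask:   1 = feasible, 0 = infeasible (removed entirely).
    attenuation: weight in (0, 1] per action; 1.0 means no penalty (assumed form).
    """
    weighted = [p * m * w for p, m, w in zip(probs, hard_mask, attenuation)]
    total = sum(weighted)
    if total <= 0.0:
        raise ValueError("no probability mass left after masking")
    return [x / total for x in weighted]


# Hypothetical example: action 1 is borderline, action 2 is infeasible.
soft_probs = soft_mask(
    probs=[0.25, 0.25, 0.5],
    hard_mask=[1, 1, 0],
    attenuation=[1.0, 0.5, 1.0],
)
```

The borderline action stays selectable but at half the relative weight of the unconstrained one, whereas the infeasible action is removed outright.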

Claims

1. A method for joint control of UAV trajectory and data acquisition optimized for data freshness, characterized in that it includes the following steps: Step S1: Obtain gridded observation information of the monitoring area, wherein the gridded observation information includes at least the weighted AoI distribution and node density distribution within each grid and the UAV's own state information, and the UAV's own state information includes its current position, remaining energy, cache usage, and backhaul flag; Step S2: Based on the gridded observation information, generate a spatial reward value prediction prior, wherein the spatial reward value prediction prior is a grid-level value vector or heatmap characterizing the potential contribution of each candidate grid to reducing the long-term weighted AoI; Step S3: Based on the UAV's own state information and system hard constraints, construct a corresponding binary action mask to form a set of state-related actionable actions for the current time slot, wherein the system hard constraints include at least two of the following: speed reachability, communication coverage, parallel access limit, cache capacity and backhaul rules, uplink rate feasibility, and energy budget constraints; Step S4: Use the spatial reward value prediction prior as an auxiliary input feature or fusion representation of the policy network, and combine it with the set of state-related actionable actions to output the movement and acquisition decision action for the current time slot through the policy network, wherein the policy network selects only from the set of state-related actionable actions when sampling or selecting actions; Step S5: Control the UAV to execute the movement and acquisition decision action, and after it reaches the target location, execute multi-user multiple-input multiple-output (MU-MIMO) parallel uplink acquisition scheduling within the communication coverage area; Step S6: Based on the transmission feasibility determination condition, determine the effective update status of each acquisition node; if a node meets the condition of completing a valid data upload within the time slot, refresh the AoI of the node, and otherwise increment the AoI of the node with the time slot; update the cache occupancy, check whether backhaul is required, and update the energy budget; Step S7: Update the UAV state information and the global system state, and calculate the immediate reward based on the updated state to optimize the policy network, wherein the immediate reward includes at least a weighted average AoI penalty term and an energy consumption penalty term.

2. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that the spatial reward value prediction prior generated in step S2 is built from at least two of the following: elements reflecting the intensity of freshness demand within the grid or its covered neighborhood, elements reflecting the grid's potential parallel acquisition capability, and spatial aggregation structure elements reflecting the distribution of hotspot clusters at the coverage scale.

3. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 2, characterized in that the spatial reward value prediction prior adopts a normalized fusion form, with the specific expression $V_g(t) = \mathrm{Norm}\big(\alpha\,\phi^{\mathrm{AoI}}_g(t) + \beta\,\phi^{\mathrm{par}}_g(t) + \gamma\,\phi^{\mathrm{clu}}_g(t)\big)$, wherein $\phi^{\mathrm{AoI}}_g(t)$ is the demand intensity of the grid, $\phi^{\mathrm{par}}_g(t)$ is the parallel potential of the grid, $\phi^{\mathrm{clu}}_g(t)$ is the clustering intensity of the grid, $\alpha$, $\beta$, $\gamma$ are non-negative fusion weights, g denotes the grid index, and t denotes the time slot.

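4. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 3, characterized in that the demand intensity of the grid is $\sum_{i \in \mathcal{N}_g} w_i A_i(t)$ and the parallel potential of the grid is determined by the number of candidate nodes $|\mathcal{N}_g|$ relative to $M$, wherein $\mathcal{N}_g$ is the set of candidate nodes within the coverage neighborhood, $w_i$ is a node weight, $A_i(t)$ is a node AoI, and $M$ is the parallel access limit.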

5. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that the masking expression in step S3 is $\tilde{\pi}(a \mid s_t) = \dfrac{m(a, s_t)\,\pi(a \mid s_t)}{\sum_{a'} m(a', s_t)\,\pi(a' \mid s_t)}$, where $\pi(a \mid s_t)$ is the output of the policy network for action a, $m(a, s_t)$ is the binary action mask, and $s_t$ is the current system state.

6. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that, In step S5, under the premise of satisfying the antenna parallel limit and buffer capacity, a feasible subset is selected from the candidate nodes, and the combination that is expected to bring the maximum weighted AoI reduction is given priority.

7. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that the transmission feasibility determination condition in step S6 is $R_i(t)\,\tau \ge D$, where $R_i(t)$ is the node's effective uplink rate, $\tau$ is the communication duration, and $D$ is the size of the status packet.

8. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that the expression for calculating the immediate reward in step S7 is $r_t = -\bar{A}_t - \lambda\,(E^{\mathrm{fly}}_t + E^{\mathrm{comm}}_t)$, where $\bar{A}_t$ is the weighted average AoI in time slot t, $\lambda$ is the coefficient balancing freshness and energy consumption, $E^{\mathrm{fly}}_t$ is the flight energy consumption, and $E^{\mathrm{comm}}_t$ is the communication energy consumption.

9. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that, In step S1, the gridded observation information is obtained by the control center periodically aggregating task and node information. The control center synchronizes the weighted AoI heatmap, node density map and MU-MIMO opportunity map to the UAV in the form of a fixed-size multi-channel grid map.

10. The UAV trajectory and data acquisition joint control method optimized for data freshness according to claim 1, characterized in that the spatial reward value prediction prior is generated using convolution, an attention mechanism, or graph structure aggregation.