A space debris laser ranging method based on Markov decision-making
By constructing a Markov decision model and combining it with reinforcement learning algorithms, the planning challenges of multi-station collaboration and dynamic environments in distributed space debris laser ranging were solved, achieving efficient and intelligent laser ranging task planning and improving the success rate and data quality of laser ranging.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YUNNAN OBSERVATORY CHINESE ACADEMY OF SCIENCES
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
AI Technical Summary
In distributed space debris laser ranging scenarios, the distribution of stations in different locations introduces more variables and constraints, resulting in poor performance of existing laser ranging technologies, especially in multi-station collaboration and dynamic environments where planning challenges are encountered.
A Markov decision-making model is constructed and combined with a reinforcement learning algorithm. By maximizing the expected value of the accumulated future reward, the state-action value function is iteratively updated to generate the optimal ranging task planning strategy, which guides the observation tasks of various surface stations in the distributed space debris laser ranging network.
It improved the success rate and data quality of laser ranging, optimized resource utilization, enhanced the adaptability and reliability of the system, reduced observation failures, and achieved efficient collaborative observation and intelligent resource allocation.
Figure CN121978698B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of laser ranging technology, and in particular to a laser ranging method for space debris based on Markov decision-making.
Background Technology
[0002] With the increasing amount of space debris, Debris Laser Ranging (DLR) technology, as an extension of Satellite Laser Ranging (SLR) technology, is of great significance for ensuring the safety of the space environment. An important direction in DLR research is to coordinate the observation resources of multiple ground stations to construct a distributed network. Specifically, a laser is emitted from one station, while multiple other stations receive the echo signals from the space debris, as shown in Figure 1. This method, known as Distributed Space Debris Laser Ranging (Distributed-DLR), has two main advantages: first, it uses multiple telescopes to increase the effective receiving area of the echo signal; second, it can acquire multi-dimensional ranging data from different directions, thereby improving the accuracy of orbit determination for non-cooperative targets.
[0003] In Distributed-DLR observation missions, the main challenges lie in real-time target selection and multi-station collaboration. For traditional single-station observations, a station simply needs to generate and execute its own observation task plan. However, in multi-station scenarios, due to the different geographical locations of the stations, the transit conditions of the same target vary, and some targets may appear simultaneously, thus requiring the determination of which is the "best" target. Furthermore, local weather conditions can affect station operating conditions, introducing more uncertainty. In existing Distributed-DLR scenarios, the stations are distributed across different locations, introducing even more variables and constraints, making the problem extremely complex, and traditional laser ranging methods are ineffective.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a space debris laser ranging method based on Markov decision-making. This method solves the technical problem that in existing Distributed-DLR scenarios, the distribution of stations in different locations introduces more variables and constraints, leading to poor performance of existing laser ranging technologies.
[0005] To address the aforementioned technical problems, this invention provides the following technical solution: a space debris laser ranging method based on Markov decision-making, applied to a distributed space debris laser ranging network, wherein the distributed space debris laser ranging network consists of several ground stations performing space debris laser ranging, and the method includes the following steps:
[0006] The time for observing space debris is divided into several discrete time slots;
[0007] A Markov decision model for laser ranging of space debris is constructed, wherein the Markov decision model includes the state space, action space, state transition probability and reward function for each time slot;
[0008] The Markov decision model is trained using a reinforcement learning algorithm. The state-action value function is iteratively updated until convergence by maximizing the expected value of the accumulated future reward, so as to obtain the optimal ranging task planning strategy.
[0009] Based on the obtained optimal task planning strategy, the observation task sequence of each ground station in the distributed space debris laser ranging network is generated.
[0010] Preferably, the state space includes the observation state of each station in several time slots. The observation state includes the ranging result of the target debris and environmental factors that are relatively independent of the ranging result. The ranging result includes ranging success and ranging failure, which are represented by two different parameters.
[0011] Preferably, the environmental factors include a set of meteorological parameters, a set of station parameters, and a set of target parameters expressed in numerical form.
[0012] Preferably, the target parameter set includes observational values that are positively correlated with the target debris priority, size, and orbital altitude, as well as the observation completion rate and remaining visibility time of the target debris.
[0013] Preferably, the action space is a set of actions that each ground station can select for each time slot, and the action set includes continuing to observe the current target debris and switching to observe a new target debris that is currently visible.
[0014] Preferably, the state transition probability is the ranging success probability at the next moment calculated based on the environmental factors at the previous moment.
[0015] Preferably, the cumulative future reward is obtained by introducing a discount factor to adjust the instantaneous reward of the distributed space debris laser ranging network for each time slot from the current moment to the end of the observation, and by calculating the sum of the adjusted instantaneous rewards for each time slot.
[0016] Preferably, the instantaneous reward for each time slot of the distributed space debris laser ranging network is calculated based on the rewards of each station. The reward for each station is calculated based on whether the ranging is successful, as follows:
[0017] If the ranging is successful, the station's reward is positively correlated with the observation value in the target parameter set and the meteorological parameter set corresponding to that station;
[0018] If the ranging fails, the station's reward is the preset failure reward, which is less than the minimum reward for a successful ranging.
[0019] Preferably, the reward calculation for the station incorporates a penalty factor that is positively correlated with the reward, the penalty factor being proportional to the effective observation duration of the station from the current moment to the end of the observation.
[0020] Preferably, the reward calculation for the station incorporates a reward factor that is positively correlated with the reward, which takes effect when the available observation time of the target debris reaches an available time threshold.
[0021] By employing the above technical solution, the present invention provides a space debris laser ranging method based on Markov decision-making, which has at least the following beneficial effects:
[0022] 1. This invention divides the observation time into discrete time slots and constructs a Markov decision model, which is then trained using reinforcement learning algorithms to obtain the optimal task planning strategy. This enables efficient collaborative observation of a distributed space debris laser ranging network, significantly improving the overall efficiency and resource utilization of the observation task. It overcomes the planning challenges faced by traditional methods in multi-station collaboration and dynamic environments. At the same time, by maximizing the expected value of accumulated future rewards, the global optimality of the strategy is ensured, enabling the network to adaptively respond to weather changes and target priority fluctuations, thereby improving the success rate and data quality of laser ranging.
[0023] 2. This invention, through state-space design, includes ranging results and independent environmental factors, such as meteorological parameters, station parameters, and target parameters, which comprehensively reflect the dynamic state of the observation system. This makes the Markov decision model closer to actual conditions, improves the accuracy and robustness of task planning, and enables scientific decision-making even in complex environments. It reduces observation failures caused by environmental uncertainties, optimizes resource allocation, and improves overall observation efficiency.
[0024] 3. This invention allows each ground station to flexibly choose to continue observation or switch targets through action space design. It combines state transition probability with the calculation of ranging success probability based on environmental factors, enabling the system to respond in real time to environmental changes and target visibility, reducing observation interruption time, improving the continuity and integrity of data acquisition, and ensuring efficient operation of multi-station collaboration by optimizing action selection through reinforcement learning.
[0025] 4. This invention designs a reward function that positively correlates the reward for successful ranging with the observation value and meteorological parameters. It also introduces a failure penalty, a penalty factor, and a reward factor to guide the reinforcement learning agent to prioritize high-value targets and emergency situations. This achieves intelligent allocation of observation resources, maximizes cumulative rewards, improves the observation completion rate of key targets, avoids resource waste, and enhances the overall network performance.
[0026] 5. This invention effectively addresses the multivariable and complex constraints in distributed networks by combining Markov decision processes with reinforcement learning, solves the collaborative challenges brought about by station distribution, improves the adaptability and reliability of laser ranging, provides an automated planning scheme for space debris monitoring, reduces the need for manual intervention, and promotes the intelligent and efficient management of space environment safety.
Description of the Attached Figures
[0027] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0028] Figure 1 is a schematic diagram illustrating a laser emitted from one station while several other stations receive echo signals from space debris;
[0029] Figure 2 is a schematic diagram of the residual plot, with the vertical axis representing the difference between the observed and calculated values;
[0030] Figure 3 compares the simulated observation process of a target on a certain day with the actual observation process in this invention;
[0031] Figure 4 is a schematic diagram illustrating how the Q-learning reward curve increases with the number of episodes in this invention;
[0032] Figure 5 is a schematic diagram of the weather changes at the stations simulated by this invention;
[0033] Figure 6 is a schematic diagram illustrating the change of the reward curve over time in the execution plan of this invention;
[0034] Figure 7 is a global demonstration of the simulation plan of this invention.
Detailed Implementation
[0035] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. This will allow for a full understanding of how the present application uses technical means to solve technical problems and achieve technical effects, and to facilitate its implementation.
[0036] With the increasing number of space debris targets and the development of Distributed Debris Laser Ranging (Distributed-DLR) technology, stations located at different sites introduce more variables and constraints. Traditional single-station scheduling methods struggle to cope with the complexities of multi-station collaboration, dynamic weather conditions, and target priority changes in future distributed networks, resulting in poor performance of existing laser ranging technologies. This invention proposes a space debris laser ranging method based on Markov decision-making, integrating the Markov Decision Process (MDP) framework with Reinforcement Learning (RL) algorithms. The observation process is modeled as an MDP, and reinforcement learning is used to maximize the expected value of the accumulated future reward of the entire distributed space debris laser ranging network, thereby planning the network's observation of space debris. This method is applied to a distributed space debris laser ranging network, which consists of several ground stations performing space debris laser ranging. As shown in Figure 1, a laser is emitted from one station, while several other stations receive echo signals from space debris. The method includes the following steps:
[0037] S1. First, in an observation experiment, the time for observing space debris needs to be divided into several discrete time slots, so that the state of the next time slot can be predicted by the Markov decision model based on the state of the distributed space debris laser ranging network at the current moment.
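The slot division in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `TimeSlot` class and `make_time_slots` function are hypothetical names, and the 180-second slot length and 4-hour session are borrowed from the simulation settings described later in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeSlot:
    index: int      # slot number n, starting at 1
    start_s: float  # slot start, in seconds from the beginning of the session
    end_s: float    # slot end, in seconds from the beginning of the session


def make_time_slots(session_s: float, slot_s: float = 180.0) -> list[TimeSlot]:
    """Split a session into consecutive slots of slot_s seconds.

    The last slot is shortened when the session length is not a
    multiple of slot_s (an assumption; the source does not say).
    """
    if session_s <= 0 or slot_s <= 0:
        raise ValueError("session_s and slot_s must be positive")
    slots = []
    start = 0.0
    index = 1
    while start < session_s:
        end = min(start + slot_s, session_s)
        slots.append(TimeSlot(index, start, end))
        start = end
        index += 1
    return slots


slots = make_time_slots(4 * 3600)  # 4-hour session with 180 s slots
print(len(slots), slots[0], slots[-1])
```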
[0038] The typical DLR observation process is as follows: prepare target trajectory predictions; track the target, measure distances, and collect echo points; and perform data processing. Ideally, all planned targets should be observed. Their echo points are the product of the observations. Figure 2 gives an example.
[0039] For subsequent data processing, as described in the International Laser Ranging Service (ILRS) workflow documents, a "standard point" is obtained by statistically processing echo points collected over a period of time. The accuracy of the standard point is a key indicator for evaluating the quality of laser ranging observation data. Therefore, for each target, generally, the more echo points there are, the higher the quality of the standard point. Since DLR originates from SLR, it is necessary to review the SLR equations. The SLR equations describe the energy transmission and attenuation relationship throughout the entire process of the laser signal being emitted from the ground station, reflected by the target, and returning to the receiving system. Its core lies in estimating the average number of echo photons or electrons detected at the receiver. The equation is as follows:
[0040] $$n_0 = \frac{16\,\lambda\,\eta_q\,E_t\,A_s\,A_r\,K_t\,K_r\,T^2\,\alpha}{\pi^2\,h\,c\,\theta_t^2\,\theta_s^2\,R^4}$$
[0041] In the above formula, $n_0$ represents the average number of photoelectrons in the laser echo signal received by the system. Planck's constant $h$ and the speed of light $c$ are both constants. The remaining parameters in the equation can be divided into the following three categories:
[0042] 1. Station parameters: laser wavelength $\lambda$; laser single-pulse energy $E_t$; laser beam divergence angle $\theta_t$; detector efficiency $\eta_q$; effective receiving area of the system $A_r$; efficiency of the transmitting optical system $K_t$; efficiency of the receiving optical system $K_r$.
2. Target parameters: satellite reflector area $A_s$; reflector divergence angle $\theta_s$; instantaneous distance from the target to the ranging station $R$.
3. Meteorological parameters: atmospheric transmittance $T$; atmospheric attenuation factor $\alpha$. These are clearly affected by the weather.
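To make the dependence on the three parameter groups concrete, the sketch below evaluates the mean photoelectron count numerically. It assumes the commonly used form of the laser ranging link equation, n0 = 16 λ η E A_s A_r K_t K_r T² α / (π² h c θ_t² θ_s² R⁴), whose parameter list matches the one above; the function names, argument names, and the `EXAMPLE` values are hypothetical and are not taken from the patent. The conversion to a detection probability assumes Poisson photon statistics with a single-photoelectron threshold, which is also an assumption.

```python
import math

PLANCK_H = 6.62607015e-34  # Planck constant, J*s
LIGHT_C = 299_792_458.0    # speed of light, m/s


def mean_photoelectrons(*, wavelength_m, pulse_energy_j, beam_div_rad,
                        detector_eff, recv_area_m2, tx_eff, rx_eff,
                        target_area_m2, target_div_rad, range_m,
                        atm_transmittance, atm_attenuation):
    """Mean number of echo photoelectrons per pulse (assumed link equation)."""
    numerator = (16.0 * wavelength_m * detector_eff * pulse_energy_j
                 * target_area_m2 * recv_area_m2 * tx_eff * rx_eff
                 * atm_transmittance ** 2 * atm_attenuation)
    denominator = (math.pi ** 2 * PLANCK_H * LIGHT_C
                   * beam_div_rad ** 2 * target_div_rad ** 2 * range_m ** 4)
    return numerator / denominator


def detection_probability(n0):
    """Probability of at least one photoelectron, assuming Poisson statistics."""
    return 1.0 - math.exp(-n0)


# Hypothetical illustrative values only; not measured station data.
EXAMPLE = dict(
    wavelength_m=532e-9, pulse_energy_j=1.0, beam_div_rad=5e-5,
    detector_eff=0.2, recv_area_m2=1.0, tx_eff=0.6, rx_eff=0.6,
    target_area_m2=1.0, target_div_rad=math.pi, range_m=1.0e6,
    atm_transmittance=0.6, atm_attenuation=0.5,
)

n0 = mean_photoelectrons(**EXAMPLE)
print(n0, detection_probability(n0))
```

The fourth-power dependence on range and the squared dependence on atmospheric transmittance are what make the success probability so sensitive to target altitude and weather.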
[0043] Therefore, ranging outputs are affected not only by observation duration but also by weather, station performance, and target characteristics. To ensure observation quality, important targets are typically given more observation time, and all planned targets should be observed. In DLR, observations may fail, as even the best stations may not receive echoes. The success rate of a ranging measurement within a given time period is related to weather, station performance, and target parameters. For example, a sudden cloud obscuring the view, or changes in the atmospheric light path that alter the measured distance in real time, can affect the echo detection probability. Furthermore, it should be noted that for experienced observers or when using intelligent search algorithms, if a target echo has already been detected, the probability of detecting another echo within a short period is high. This dependence of the next state on the current one can be described by a Markov chain and, once decisions and rewards are added, by a Markov Decision Process (MDP), a stochastic mathematical model used to describe decision-making scenarios with state transitions and reward mechanisms.
[0044] S2. Therefore, a Markov decision model for laser ranging of space debris is constructed. By introducing a Markov chain, the probabilistic characteristics of the system's transitions between different states can be described, thus making the future observation success rate "predictable." The Markov decision model includes the state space, action space, state transition probabilities, and reward function for each time slot. The state space includes the observation state of each station in several time slots, thereby predicting the observation state of the next time slot based on the observation state of each time slot. For an observation experiment containing N time slots, the state sequence S of the observation experiment can be represented as:
[0045] $$S = \{s_1, s_2, \dots, s_N\}$$
[0046] where $s_n$ denotes the observation state in the $n$-th time slot of the observation experiment. It includes the ranging result $r_n$ of the target debris and the environmental factors $e_n$, which are relatively independent of the ranging result, that is, $e_n$ is not affected by $r_n$. The observation state $s_n$ can be represented as:
[0047] $$s_n = (r_n, e_n)$$
[0048] The ranging result $r_n$ is either a success or a failure, represented by two different values, such as 1 for successful ranging and 0 for failed ranging. The ranging result in the current state affects the probability of successful ranging of the same target in the next state. The environmental factors $e_n$ include a meteorological parameter set $W_n$, a station parameter set $P_n$, and a target parameter set $D_n$, all expressed numerically, so that $e_n = (W_n, P_n, D_n)$. For a distributed space debris laser ranging network containing $J$ ground stations, the meteorological parameter set $W_n$ is the set of meteorological parameters of the ground stations at time $n$, which can be represented as $W_n = \{w_n^1, w_n^2, \dots, w_n^J\}$, where $w_n^j$ is the meteorological parameter of the $j$-th ground station; it encodes conditions such as sunny, cloudy, or rainy as a value in [0, 1]. Likewise, $P_n$ is the set of station parameters of each ground station at time $n$, which can be represented as $P_n = \{p_n^1, p_n^2, \dots, p_n^J\}$. The target parameter set $D_n$ includes the observation value, which is positively correlated with the target debris priority, size, and orbital altitude, as well as the observation completion rate and remaining visibility time of the target debris. It may also include dynamic information such as start and end times to provide more comprehensive information for subsequent decision-making.
[0049] In each time slot, the ground station needs to decide whether to continue observing the current target or switch to a new target. All possible actions constitute the action space. Therefore, the action space is the set of actions that each ground station can choose in each time slot, that is, the decisions that the system can make. The action set includes continuing to observe the current target debris and switching to observe a new target debris that is currently visible. Taking a certain ground station as an example, if $a_n = 0$ indicates that the current target will continue to be observed and $a_n = k$ indicates switching to observe target $k$, then the action space $A_n$ of its $n$-th time slot can be represented as $A_n = \{0\} \cup \{k \mid \text{target } k \text{ is visible in time slot } n\}$.
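The state and action definitions above can be sketched as plain data structures. This is only an illustration under stated assumptions: the class and function names are hypothetical, the action code 0 for "continue" follows the example in the text, target identifiers are assumed to be positive integers, and dropping the "continue" action when the current target is no longer visible is a design choice of this sketch rather than something the source specifies.

```python
from dataclasses import dataclass, field

CONTINUE = 0  # assumed action code: keep observing the current target


@dataclass
class Environment:
    weather: list[float]          # one value in [0, 1] per station
    station_params: list[float]   # one numeric parameter per station (placeholder)
    target_params: dict[int, dict[str, float]] = field(default_factory=dict)


@dataclass
class ObservationState:
    ranging_result: int   # 1 = ranging succeeded, 0 = ranging failed
    environment: Environment


def available_actions(current_target, visible_targets):
    """Return the action set of one station for one time slot.

    CONTINUE is offered only while the current target is still visible;
    every other visible target id is offered as a switch action.
    """
    actions = []
    if current_target is not None and current_target in visible_targets:
        actions.append(CONTINUE)
    actions.extend(sorted(t for t in visible_targets if t != current_target))
    return actions


env = Environment(weather=[0.9, 0.4], station_params=[1.0, 0.8],
                  target_params={3: {"observation_value": 5.0}})
state = ObservationState(ranging_result=1, environment=env)
print(state.ranging_result, available_actions(3, {1, 3, 5}))
```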
[0050] The state transition probability is the probability of successful ranging at the next moment calculated based on the environmental factors at the previous moment. For example, if no echo is detected at the current moment (state), the agent decides to switch targets (action), and the new target is successfully detected (new state). The state transition probability is the probability of successful ranging, which is mainly determined by environmental factors and can be estimated based on the meteorological parameter set, target parameter set, and station parameter set.
[0051] S3. The Markov decision model is trained using a reinforcement learning algorithm. The state-action value function is iteratively updated until convergence by maximizing the expected value of accumulated future rewards to obtain the optimal ranging task planning policy. Reinforcement learning (RL) is a machine learning method where an agent learns optimal decisions through continuous interaction with the environment. It is based on a trial-and-error mechanism: the agent performs actions, resulting in state changes and receiving rewards to guide the learning process. In particular, RL uses the MDP framework to model scenarios where future states depend only on the current state and actions. Algorithms such as Q-learning help the agent estimate action values through iterative updates, while policies such as ε-greedy strike a balance between exploring new actions and utilizing known rewards. Q-learning is a model-free reinforcement learning algorithm designed to find the optimal action selection policy for a given environment. It estimates the expected future reward for each state-action pair by learning a Q-value function. Through continuous exploration and utilization, taking the Q-learning algorithm as an example, the Q-value is updated based on the observed rewards and the maximum future Q-value (i.e., the expected value of accumulated future rewards), eventually converging to the optimal policy.
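As an illustration of the update described above, the following is a minimal tabular Q-learning sketch with ε-greedy action selection. It shows the standard textbook update Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') − Q(s,a)]; the function names, the dictionary-based Q-table, and the default learning rate and discount factor are assumptions made for this example, not values from the patent.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q-value."""
    # Value of the best action in the next state (0 if the episode ended).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)   # unseen state-action pairs start at 0
rng = random.Random(0)

# One illustrative transition: a successful ranging in state "s0".
action = epsilon_greedy(q_table, "s0", [0, 1], epsilon=0.1, rng=rng)
print(action, q_update(q_table, "s0", action, 1.0, "s1", [0, 1]))
```

In a full training loop these two functions would be called once per time slot, with one episode corresponding to one simulated observation session.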
[0052] The cumulative future reward is obtained by introducing a discount factor to adjust the instantaneous reward of the distributed space debris laser ranging network for each time slot from the current moment to the end of the observation, and then calculating the sum of the adjusted instantaneous rewards for each time slot. Its expression can be represented as:
[0053] $$G_n = \sum_{k=0}^{N-n} \gamma^{k} R_{n+k}$$
[0054] where $G_n$ represents the cumulative future reward calculated at time $n$, $R_{n+k}$ is the instantaneous reward of the network in time slot $n+k$, and $N$ is the number of time slots. The discount factor $\gamma$ is a parameter used to discount future rewards, with a value range of [0, 1]. It determines how strongly the agent weighs immediate rewards against future ones: if $\gamma$ is close to 0, the agent focuses more on immediate rewards; if $\gamma$ is close to 1, the agent prioritizes future rewards. In DLR observations, once sufficient echo data has been collected, the reward for continued observation should decrease. This formula prevents the agent from focusing solely on a single high-value target, thus encouraging it to switch to other targets and optimizing the global observation strategy. The instantaneous reward for the $n$-th time slot is the reward obtained for the action taken in that slot. In the distributed space debris laser ranging network, the instantaneous reward for each time slot is calculated based on the rewards of each station. For example, the average reward of the stations in each time slot can be used as the instantaneous reward for that time slot, aiming to ensure that all stations participate in observation. A station's reward can be calculated based on the success of the ranging operation. If the ranging is successful, the station's reward is positively correlated with the observation value in the target parameter set and the meteorological parameter set corresponding to that station. If the ranging fails, the station's reward is a preset failure reward, which is less than the minimum reward for successful ranging, for example, set to 0. Furthermore, the reward can be adjusted based on the completion status of the current observation state through reward or penalty factors. For example, switching to a new target requires time, which can be represented by a penalty factor equal to the percentage of effective observation time. That is, the station's reward calculation can also incorporate a penalty factor that is positively correlated with the reward and proportional to the effective observation time from the current moment to the end of the observation period. Furthermore, the reward calculation for each station introduces a positively correlated reward factor. This reward factor takes effect when the available observation time for the target debris reaches a certain threshold. For example, when the available observation time reaches the threshold and the target is marked as urgent (e.g., only 10 minutes remain), the reward factor takes effect. At this time, the reward can be multiplied by this factor, thus making the agent more inclined to choose urgent targets. Specifically, based on the above, the immediate reward of the $j$-th station can be fully represented as:
[0055]
[0056]
[0057]
[0058] In the above formulas, the defined quantities are, in order: the reward of the j-th station; the penalty factor and the reward factor; the time required for the station to switch to the new target; the time from the current moment until the end of the observation of the new target after switching; the percentage of effective observation time; and the meteorological parameter of the j-th station. The observation value of target i is positively correlated with three parameters: target debris priority, size, and orbital altitude. The influence of each factor can be adjusted by introducing weight parameters. For example, in modeling, the target debris size can be divided into small and large categories, with different weights set around a threshold of 10 m². Alternatively, the orbit can be divided into near, medium, and far categories, with corresponding altitude ranges of less than 1000 km, [1000, 2000] km, and greater than 2000 km, respectively, and different weights can be assigned to calculate the observation value. High-value targets usually correspond to longer observation times. For example, in 2017 Kunming Station successfully ranged a piece of debris at an altitude of 2000 km with a size of only 0.045 m², which demonstrated the station's performance. Generally speaking, the higher the value of the target, the lower the probability of successful ranging, and therefore a longer observation time is required.
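The reward structure described in the preceding paragraphs can be sketched as below. This is one possible reading, not the patent's formula: the multiplicative combination of observation value, weather parameter, penalty factor, and reward factor is an assumption, as are the function names, the default urgency multiplier of 1.5, and the failure reward of 0 (the source only gives 0 as an example). The slot reward uses the station average mentioned in the text, and the return is the discounted sum of slot rewards.

```python
def effective_time_fraction(remaining_s, switch_s):
    """Penalty factor: share of the remaining time left after switching targets."""
    if remaining_s <= 0:
        return 0.0
    return max(0.0, (remaining_s - switch_s) / remaining_s)


def station_reward(success, observation_value, weather, *, remaining_s,
                   switch_s=0.0, urgent=False, urgency_bonus=1.5,
                   failure_reward=0.0):
    """Reward of one station in one slot (assumed multiplicative form)."""
    if not success:
        return failure_reward
    reward = observation_value * weather
    reward *= effective_time_fraction(remaining_s, switch_s)
    if urgent:
        reward *= urgency_bonus  # reward factor for urgent targets
    return reward


def slot_reward(station_rewards):
    """Instantaneous network reward: average over all stations."""
    return sum(station_rewards) / len(station_rewards)


def discounted_return(slot_rewards, gamma):
    """Cumulative future reward: sum of gamma**k * R_(n+k) over the remaining slots."""
    total = 0.0
    for reward in reversed(slot_rewards):
        total = reward + gamma * total
    return total


rewards = [
    station_reward(True, 10.0, 0.5, remaining_s=400, switch_s=40),
    station_reward(False, 8.0, 0.9, remaining_s=600),
]
print(slot_reward(rewards), discounted_return([1.0, 1.0, 1.0], 0.5))
```

With a 40-second switching time, the penalty factor shrinks the reward most for targets whose remaining arc is short, which discourages switching to a target that is about to set.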
[0059] The goal of the strategy is to find, for a given state, the action in the action space that maximizes the Q-value. For the distributed network observation considered here, it is necessary to calculate the expected observation reward based on the ranging success probability and to concentrate all of the action selection probability on the action with the highest expected cumulative future reward. This is a deterministic strategy: the state-action value function is iteratively updated until convergence by maximizing the expected value of the cumulative future reward, and the optimal ranging task planning strategy is then read from the converged values.
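A deterministic strategy of this kind can be read directly from a learned Q-table. The sketch below is a minimal illustration; the function name, the dictionary layout keyed by (state, action), and the example values are hypothetical.

```python
def greedy_policy(q, actions_by_state):
    """Map every state to the action with the largest Q-value.

    Unseen state-action pairs are treated as having a Q-value of 0,
    and ties are resolved in favour of the first listed action.
    """
    policy = {}
    for state, actions in actions_by_state.items():
        if actions:
            policy[state] = max(actions, key=lambda a: q.get((state, a), 0.0))
    return policy


# Hypothetical learned values: action 0 = continue, action 2 = switch to target 2.
q_values = {("s0", 0): 1.0, ("s0", 2): 3.0, ("s1", 0): 0.5}
print(greedy_policy(q_values, {"s0": [0, 2], "s1": [0, 1]}))
```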
[0060] S4. Based on the obtained optimal task planning strategy, generate the observation task sequence for each ground station in the distributed space debris laser ranging network. Then, each ground station in the distributed space debris laser ranging network can observe the corresponding target debris according to its observation task sequence.
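One way to turn per-slot decisions into an observation task sequence for a single station is to merge consecutive slots spent on the same target into one task. The sketch below assumes the policy has already been rolled out into a list with one target identifier per slot (`None` for an idle slot); the function name, the tuple layout, and the target labels are hypothetical, and the 180-second slot length is taken from the simulation settings in this document.

```python
def build_task_sequence(targets_per_slot, slot_s=180.0):
    """Merge consecutive slots on the same target into (target, start_s, end_s) tasks."""
    tasks = []
    for n, target in enumerate(targets_per_slot):
        start, end = n * slot_s, (n + 1) * slot_s
        if target is None:
            continue  # idle slot: nothing to schedule
        if tasks and tasks[-1][0] == target and tasks[-1][2] == start:
            # Same target as the previous slot: extend the running task.
            tasks[-1] = (target, tasks[-1][1], end)
        else:
            tasks.append((target, start, end))
    return tasks


plan = build_task_sequence(["A", "A", "B", None, "B"])
for target, start, end in plan:
    print(f"{target}: {start:.0f} s to {end:.0f} s")
```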
[0061] Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0062] We conducted simulation verification using the methods described above, as detailed below:
[0063] Since a complete Distributed-DLR network has not yet been established, the simulation is based on previous DLR data from independent stations and conventional SLR observation data. The focus is on: 1) the effectiveness of the target observation value (OV) evaluation rules; 2) the comparison between task scheduling simulation and actual observations in a single-station conventional SLR scenario; and 3) virtual task generation and simulation for multi-station, multi-target scenarios. The completion rate of the observation plan and the time allocation for high-value targets are the main evaluation indicators.
[0064] In the verification process, target information from some typical satellites and debris was introduced, and several simulated targets were generated. Their corresponding observation values (OVs) were calculated, as shown in Table 1 below, which lists the observation value of some targets during a simulated observation:
[0065]
[0066] By adjusting the weight parameters, the value of ranging targets can be easily classified, and the results show that debris targets generally have higher value. The agent then uses the observation value (OV) in the policy function for decision-making. Table 2 below shows the simulation parameter settings during the simulation process:
[0067]
[0068] Since there is currently no regularly operating DLR network, and the frequency of DLR experiments is much lower than that of SLR experiments, the mission plan of the SLR station is used to verify the scheduling method, focusing on LEO targets at altitudes similar to debris orbits. Figure 3 presents a comparison between the actual observation progress and the simulation task plan during the period from 12:30 to 15:30 UTC on June 6, 2025. The figure shows only 3 hours of the 9-hour observation shift. It can be seen from the figure that the method of the present invention covers essentially all targets.
[0069] The weather was clear that day, and the simulation plan fully covered the predicted LEO targets, which enjoyed higher priority during the process; while GEO or MEO targets were not focused on and were only used to fill gaps between the main targets. Because the target switching configuration was 40 seconds and the time slot length was 180 seconds, the observation time distribution in the simulation plan was more even and concentrated on LEO targets (such as Lares-2), thus helping observers make better decisions. In the initial 3 hours of observation, all predicted targets were included in the simulation plan, with 87.7% of the time allocated to LEO targets; however, the actual observations missed 3 of the 18 cooperative targets.
[0070] However, judging solely by total benefit is insufficient for the following reasons: 1) in conventional SLR missions the success rate is typically 100% except during extreme weather, which makes it difficult to demonstrate the advantages of the MDP method; 2) these targets all carry corner reflectors and have nearly identical OVs, so the size weight has minimal impact; 3) observers also need to attend to GEO targets and ground calibration targets, which are set as low priority in the simulation. Overall, the comparison with real-world scenarios demonstrates the feasibility of this method, but conventional single-station SLR observation is not its ideal application scenario.
[0071] The core of this study is Distributed-DLR observation, which features multiple stations, multiple targets, and dynamic environments, and can fully explore the potential of the MDP method. By introducing a random factor into the success probability, the observation process becomes closer to reality, thus demonstrating the advantages of MDP in dealing with random events.
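As a minimal illustration of how a random factor in the success probability enters the reward, the following Python sketch samples ranging success per station and accumulates discounted rewards over time slots. The product form `ov * weather_factor`, the failure reward of -1.0, the summation over stations, and all function and field names are assumptions made for illustration only; the specification requires only that the success reward be positively correlated with the observation value and the meteorological parameters, and that the failure reward be smaller than any success reward.

```python
import random

FAILURE_REWARD = -1.0  # assumed value; must stay below every success reward


def station_reward(ov, weather_factor, p_success, rng):
    """Sample one station's reward for one time slot.

    ov and weather_factor are assumed positive, so a success reward
    is always larger than FAILURE_REWARD.
    """
    success = rng.random() < p_success  # random factor in the outcome
    if success:
        return ov * weather_factor      # assumed product form
    return FAILURE_REWARD


def network_reward(stations, rng):
    """Instantaneous network reward, assumed here to be the sum over stations."""
    return sum(
        station_reward(s["ov"], s["weather"], s["p"], rng) for s in stations
    )


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last slot backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    stations = [
        {"ov": 3.0, "weather": 0.9, "p": 0.8},  # hypothetical station states
        {"ov": 3.0, "weather": 0.4, "p": 0.3},
    ]
    slot_rewards = [network_reward(stations, rng) for _ in range(5)]
    print(discounted_return(slot_rewards, gamma=0.95))
```

Because each slot's outcome is sampled, repeated runs with different seeds give different returns, which is the kind of uncertainty the expected accumulated reward is meant to average over.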
[0072] As shown in Figure 5, during a 4-hour observation period (one episode in Q-learning), four stations were located at different sites. Weather data was generated based on local climate characteristics or historical records and changed every 30 minutes. Continuous time was discretized into 180-second time slots; switching targets yielded no gain for the following 40 seconds, because telescope pointing and signal identification take time. 40–60 debris targets were generated based on the orbital characteristics of real debris, ranging in size from 1–20 m², with orbital altitudes from 500–2500 km. Each target arc lasted approximately 5–15 minutes; the start and end times of the same arc at different stations were generated by random increments and decrements. "Target clusters" were introduced to simulate the simultaneous appearance of multiple targets, with each cluster randomly containing 35 targets. In addition, GEO targets and a small number of cooperative SLR satellites were generated to fill gaps, with their observation values (OVs) being much lower than those of the debris targets.
[0073] The Q-learning hyperparameter configurations are shown in Table 3 below. Training was performed on a laptop equipped with an Nvidia RTX 4060 and took several minutes, depending on the hyperparameter settings. As shown in Figure 4, the total reward gradually increases and converges as the number of episodes increases; training ends when ε drops to 0.05, and the output Q-table can be used to generate task plans.
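The time discretization described above can be checked with a short Python sketch. The durations are taken from the text; the constant and function names are hypothetical.

```python
EPISODE_S = 4 * 3600   # 4-hour observation period (one episode)
SLOT_S = 180           # time slot length in seconds
SWITCH_S = 40          # seconds without gain after switching targets
WEATHER_S = 30 * 60    # weather changes every 30 minutes

N_SLOTS = EPISODE_S // SLOT_S             # slots per episode
SLOTS_PER_WEATHER = WEATHER_S // SLOT_S   # slots sharing one weather state


def usable_seconds(switched):
    """Seconds of a slot that can yield gain, given whether the target changed."""
    return SLOT_S - SWITCH_S if switched else SLOT_S


def weather_index(slot):
    """Index of the weather interval that a given slot falls into."""
    return slot // SLOTS_PER_WEATHER


if __name__ == "__main__":
    print(N_SLOTS, SLOTS_PER_WEATHER)               # 80 10
    print(usable_seconds(True) / SLOT_S)            # fraction left after a switch
    print(weather_index(0), weather_index(N_SLOTS - 1))  # 0 7
```

With these settings an episode has 80 decision points per station, and a slot that begins with a target switch keeps 140 of its 180 seconds, which is why frequent switching is discouraged by the reward.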
[0074]
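The training loop described above (tabular Q-learning with ε-greedy exploration that stops once ε reaches 0.05) can be sketched in Python as follows. This is a single-station toy under stated assumptions, not the configuration of Table 3: the three target values, the shortened 20-slot episode, the learning rate, discount factor, decay rate, success probability, and failure reward are all hypothetical. Only the 180-second slot, the 40-second switching cost, and the ε threshold of 0.05 come from the text.

```python
import random
from collections import defaultdict

SLOT_S, SWITCH_S = 180, 40           # from the text
EPS_END = 0.05                       # from the text
TARGET_OV = [1.0, 3.0, 0.5]          # hypothetical observation values
N_SLOTS = 20                         # shortened episode for the sketch
ALPHA, GAMMA = 0.1, 0.95             # assumed learning rate and discount factor
EPS_START, EPS_DECAY = 1.0, 0.99     # assumed exploration schedule
P_SUCCESS, FAIL_REWARD = 0.8, -0.1   # assumed success probability and failure reward
ACTIONS = range(len(TARGET_OV))      # action = index of the target to observe


def step(state, action, rng):
    """Apply an action in state (slot, current_target); return (reward, next_state)."""
    slot, current = state
    usable = SLOT_S - (SWITCH_S if action != current else 0)
    if rng.random() < P_SUCCESS:
        reward = TARGET_OV[action] * usable / SLOT_S
    else:
        reward = FAIL_REWARD
    return reward, (slot + 1, action)


def train(seed=0):
    """Run Q-learning until epsilon decays to EPS_END; return (Q-table, episodes)."""
    rng = random.Random(seed)
    q = defaultdict(float)
    eps, episodes = EPS_START, 0
    while eps > EPS_END:
        state = (0, 0)
        for _ in range(N_SLOTS):
            if rng.random() < eps:                       # explore
                action = rng.randrange(len(TARGET_OV))
            else:                                        # exploit
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            reward, nxt = step(state, action, rng)
            terminal = nxt[0] == N_SLOTS
            best_next = 0.0 if terminal else max(q[(nxt, a)] for a in ACTIONS)
            td_target = reward + GAMMA * best_next
            q[(state, action)] += ALPHA * (td_target - q[(state, action)])
            state = nxt
        eps *= EPS_DECAY
        episodes += 1
    return q, episodes


def greedy_plan(q):
    """Read a task plan (one target index per slot) out of the trained Q-table."""
    state, plan = (0, 0), []
    for _ in range(N_SLOTS):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        plan.append(action)
        state = (state[0] + 1, action)
    return plan


if __name__ == "__main__":
    q_table, n_episodes = train()
    print(n_episodes, greedy_plan(q_table))
```

In this toy setting the greedy plan is expected to settle mostly on the highest-value target, accepting one 40-second switching cost at the start; a multi-station version would extend the state with weather and per-station visibility, which this sketch omits.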
[0075] In the 4-hour observation simulation, the weather conditions are as shown in Figure 5. The recommended observation plan is partially shown in Table 4 below, the reward estimate is shown in Figure 6, and the overall arrangement is shown in Figure 7. When the weather is sunny or cloudy at all stations, the total reward increases rapidly; when rainfall occurs at some locations, the immediate reward decreases. At 21:30, the laser emission station experienced rain; as recommended, the entire network ceased observations and the total reward remained unchanged. Overall, during the 3.5-hour effective observation period, 82% of the time was allocated to DLR targets, with the remainder allocated to GEO or SAT targets; the specific proportion varies with the predicted targets. The generated plan is reasonable and efficient and can guide the operation of the distributed network. Table 4 below shows an example of the output of the simulated observation plan:
[0076]
[0077] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. Since the above embodiments are substantially similar to the method embodiments, their descriptions are relatively simple; relevant parts can be referred to the descriptions of the method embodiments.
[0078] The above embodiments provide a detailed description of the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A Markov decision-based space debris laser ranging method applied to a distributed space debris laser ranging network composed of a plurality of ground stations for space debris laser ranging, characterized in that the method includes the following steps: The time for observing space debris is divided into several discrete time slots; A Markov decision model for laser ranging of space debris is constructed, wherein the Markov decision model includes the state space, action space, state transition probability and reward function for each time slot; The Markov decision model is trained using a reinforcement learning algorithm. The state-action value function is iteratively updated until convergence by maximizing the expected value of the accumulated future reward to obtain the optimal ranging task planning strategy. The accumulated future reward is obtained by introducing a discount factor to adjust the instantaneous reward of the distributed space debris laser ranging network for each time slot from the current moment to the end of the observation, and calculating the sum of the adjusted instantaneous rewards for each time slot. The instantaneous reward for each time slot of the distributed space debris laser ranging network is calculated based on the rewards of each station. The station's reward is calculated based on whether the ranging is successful, specifically as follows: if the ranging is successful, the station's reward is positively correlated with the observation value in the target parameter set and with the meteorological parameter set corresponding to that station; if the ranging fails, the station's reward is a preset failure reward, the value of which is less than the minimum reward for successful ranging. The station's reward calculation introduces a penalty factor that is positively correlated with the station's reward, and the penalty factor is proportional to the effective observation time of the station from the current moment to the end of the observation.
Based on the obtained optimal ranging task planning strategy, the observation task sequence of each ground station in the distributed space debris laser ranging network is generated.
2. The space debris laser ranging method according to claim 1, characterized in that the state space includes the observation state of each station in several time slots. The observation state includes the ranging result of the target debris and environmental factors that are relatively independent of the ranging result. The ranging result includes ranging success and ranging failure, which are represented by two different parameters.
3. The space debris laser ranging method according to claim 2, characterized in that the environmental factors include a set of meteorological parameters, a set of station parameters, and a set of target parameters, all expressed numerically.
4. The space debris laser ranging method according to claim 3, characterized in that the target parameter set includes an observation value that is positively correlated with the target debris priority, size, and orbital altitude, as well as the observation completion rate and remaining visibility time of the target debris.
5. The space debris laser ranging method according to claim 1, characterized in that the action space is a set of actions that each ground station can choose for each time slot, and the action set includes continuing to observe the current target debris and switching to observe a new target debris that is currently visible.
6. The space debris laser ranging method according to claim 2, characterized in that the state transition probability is the probability of successful ranging at the next moment, calculated based on the environmental factors at the previous moment.
7. The space debris laser ranging method according to claim 1, characterized in that the reward calculation for the station incorporates a positively correlated reward factor, which takes effect when the available observation time for the target debris reaches an available time threshold.