A Voronoi Boundary-Based DQN Parameter Optimization System and Method
This DQN parameter optimization system is based on Voronoi boundaries and combined with a self-feedback mechanism. It optimizes the handover process in ultra-dense networks and mitigates the quality-of-service degradation caused by frequent handovers. The result is more efficient handover management and higher data rates.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INNER MONGOLIA UNIVERSITY
- Filing Date
- 2025-05-16
- Publication Date
- 2026-06-19
AI Technical Summary
In ultra-dense network environments, signal strength-based handover mechanisms lead to frequent handovers, reducing the quality of service for users. Furthermore, traditional methods require complex signaling interactions, increasing handover costs.
A Voronoi boundary-based deep Q-network (DQN) parameter optimization system is adopted. It uses a path loss module, a user movement module, a user handover module and a reinforcement learning module, combined with a self-feedback mechanism. Together these optimize the handover process, reduce the user handover rate and improve the data rate.
Reduces unnecessary information exchange, simplifies the handover process, lowers the user handover rate, improves the user data rate, and accelerates network convergence in complex environments.
Figure CN120498574B_ABST
Description
Technical Field
[0001] This invention relates to the field of reinforcement learning-based handover technology for wireless communication. In particular, it relates to a DQN parameter optimization system and method based on Voronoi boundaries.
Background Technology
[0002] With the widespread deployment of millimeter-wave networks, ultra-dense networks are developing rapidly. Base stations in ultra-dense networks are deployed at high density, so mobile users must frequently hand over between base stations while moving. This makes handover management a critical challenge. The signal strength-based handover mechanism is currently in wide use, and practical systems typically execute it through the A3 handover trigger event defined in the 3GPP standard. In ultra-dense networks, this method causes frequent handovers that degrade users' quality of service, for example by reducing their data rates.
[0003] To address this, numerous studies have applied machine learning to optimize the handover parameters of this mechanism, aiming to reduce handover frequency and improve network performance. However, traditional signal strength-based handover requires frequent and complex signaling between the base station and the user. This signaling reports critical information such as the network channel state, the user's received power from the serving cell and neighboring cells, and the level of interference from surrounding equipment. Such signaling not only reduces system efficiency but also increases handover costs. The neural networks introduced in recent years have reduced the handover rate to some extent. Even so, existing research has not changed the traditional handover scheme itself.
[0004] In view of this, the present invention proposes a DQN parameter optimization system and method based on Voronoi boundaries.
Summary of the Invention
[0005] The purpose of this invention is to provide a DQN parameter optimization system and method based on Voronoi boundaries to solve the problems mentioned in the background art.
[0006] To achieve the above-mentioned objectives, the present invention provides the following technical solution:
[0007] A DQN parameter optimization system based on Voronoi boundaries includes the following modules:
[0008] Path loss module: used to complete channel modeling and describe the power attenuation law of the signal during propagation;
[0009] User movement module: uses a random walk model. The model assumes that the user's current and future positions are random, and that the user's speed and direction of movement are also random and independent of past and future moments;
[0010] User handover module: based on the Voronoi model, transforms the problem of finding the optimal handover boundary into that of determining the handover region. The areas in which users experience high-frequency handovers during the handover process are defined as handover regions. The handover process is divided into a handover preparation stage, a candidate base station selection stage, and a handover execution stage;
[0011] Reinforcement learning module: combines reinforcement learning algorithms with the Voronoi model to reduce the user handover rate and increase the user data rate. It also introduces a self-feedback mechanism that corrects the action output of the DQN network and accelerates network convergence.
[0012] A DQN parameter optimization method based on Voronoi boundaries includes the following steps:
[0013] S1. Construct a path loss model, complete channel modeling, and describe the power attenuation law of the signal during propagation;
[0014] S2. Select the random walk model as the user movement model. The model assumes that the user's current and future positions are random, and that the user's speed and direction of movement are also random and independent of past and future moments;
[0015] S3. Construct a user handover model based on Voronoi boundaries. Transform the problem of finding the optimal handover boundary into that of determining the handover region, and define the areas in which users experience high-frequency handovers as handover regions. Divide the handover process into a handover preparation stage, a candidate base station selection stage, and a handover execution stage;
[0016] S4. Combine reinforcement learning algorithms with the Voronoi model to construct a DQN-based reinforcement learning handover framework that reduces the user handover rate and improves the user data rate. Introduce a self-feedback mechanism to correct the action output of the DQN network and accelerate network convergence.
[0017] Preferably, the handover process division in S3 specifically includes the following:
[0018] S3.1 Handover Preparation Phase:
[0019] During a random walk, the system continuously acquires the user's location information. When the user approaches the handover boundary, the system enters the handover preparation phase to prepare for the subsequent handover.
[0020] The optimal handover boundary is expressed using the path loss model. In this model, received power decays with distance according to the path loss exponent η, as follows:
[0021]
[0022] Here, $P_T$ denotes the base station transmit power; $u$ denotes the user's initial location; $x_s$ denotes the location of the serving base station; $x_i$ denotes the locations of the other base stations; $\eta$ denotes the path loss factor; and $\Phi$ denotes a uniform random point process.
[0023] Simplifying the above formula, the optimal handover boundary is:
[0024] $$\Gamma(x_s, x_i) = \{(x, y) \mid (x - x_s)^2 + (y - y_s)^2 = (x - x_i)^2 + (y - y_i)^2\}$$
[0025] The handover preparation stage is entered when the distance d from the user's position to the boundary $\Gamma(x_s, x_i)$ is less than or equal to a critical value D. D is the handover distance threshold.
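The simplification from the equal-power condition to $\Gamma(x_s, x_i)$ can be written out explicitly. The derivation below assumes, as the symbol list in [0022] suggests, that the boundary is the set of points $u = (x, y)$ at which the user receives equal path-loss power from $x_s$ and $x_i$. Shadow fading is ignored. Under that assumption the boundary is the perpendicular bisector of the two stations. The distance $d$ used in the threshold test then follows from the point-to-line formula.

```latex
\begin{aligned}
P_T \lVert u - x_s \rVert^{-\eta} = P_T \lVert u - x_i \rVert^{-\eta}
  &\iff \lVert u - x_s \rVert = \lVert u - x_i \rVert \qquad (\eta > 0) \\
  &\iff (x - x_s)^2 + (y - y_s)^2 = (x - x_i)^2 + (y - y_i)^2 \\
  &\iff 2(x_i - x_s)\,x + 2(y_i - y_s)\,y = x_i^2 + y_i^2 - x_s^2 - y_s^2, \\[4pt]
d(u) &= \frac{\bigl| 2(x_i - x_s)\,x + 2(y_i - y_s)\,y - (x_i^2 + y_i^2 - x_s^2 - y_s^2) \bigr|}
             {2\sqrt{(x_i - x_s)^2 + (y_i - y_s)^2}} .
\end{aligned}
```

In this form, the handover preparation test is simply $d(u) \le D$.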
[0026] S3.2 Candidate Base Station Selection Stage:
[0027] The reliability of the handover process is enhanced by using the time-to-trigger (TTT) condition of the A3 handover event. Based on the A3 trigger event, the concept of a handover region is proposed. The handover region is the area enclosed between the handover threshold D, at which the user enters handover preparation, and the handover boundary. It is expressed as:
[0028]
[0029] Here, (a) is the representation in a Cartesian coordinate system and (b) is the representation in a polar coordinate system. The symbols are defined as follows:
- r denotes the distance from the pole (the user) to the boundary $\Gamma(x_s, x_i)$, on which this distance is 0;
- the polar axis is parallel to the handover boundary;
- θ denotes the angle between the polar axis and the user's direction of motion, ranging over $[0, \pi]$;
- $r_s$ and $\theta_s$ denote the distance and angle between the user and the serving base station;
- $r_i$ and $\theta_i$ denote the distance and angle between the user and the target base station.

[0030] During user movement, the system calculates the handover region between the user and multiple base stations and filters potential target base stations based on the TTT condition. Within the specified TTT period, it uses a distance vector sum algorithm to calculate the vector distance between the user and each candidate base station, and accumulates these distances. The formula is as follows:
[0031]
[0032] Here, $d_i(t)$ denotes the distance d between the user and base station i at time t, where i is the base station index. $\operatorname{sign}(\alpha)v'$ denotes the perpendicular component of the user's motion toward the target base station, with $v' = v \cdot \sin\theta$. L denotes the sum of the vector distances between the user and a base station within the time window. If this sum remains positive, the base station is considered a candidate base station for handover. If the sum becomes negative at any moment, the base station is disqualified as a candidate.
[0033] S3.3 Handover Execution Phase:
[0034] The system selects the base station with the maximum sum of vector distances from the candidate base stations as the optimal target base station, as expressed by the formula:
[0035]
[0036] Where I represents the optimal target base station.
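In symbols, the selection rule of S3.2 and S3.3 can be restated as follows. $L_i(t)$ is the accumulated vector distance for base station $i$ up to time $t$; its exact expression is the formula in [0031]. The candidate set $\mathcal{C}$ and the window $[0, \mathrm{TTT}]$ are notation introduced here only for this restatement. A station stays in $\mathcal{C}$ as long as its running sum never turns negative:

```latex
\mathcal{C} = \{\, i : L_i(t) \ge 0 \ \ \forall\, t \in [0, \mathrm{TTT}] \,\}, \qquad
I = \arg\max_{i \in \mathcal{C}} L_i(\mathrm{TTT}).
```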
[0037] Preferably, the DQN reinforcement learning handover framework includes:
[0038] Communication environment module: responsible for recording and processing key information of the user during the movement process. The key information includes: the user's instantaneous movement speed, instantaneous movement direction, currently connected serving base station, distance between the user and the serving base station and the distance to other base stations, transmission power of each base station, channel propagation loss and real-time data rate received by the user.
[0039] Handover module: responsible for receiving the user and base station information sent by the communication environment module, processing it, and selecting the best target base station. It also computes the state information learned by the intelligent agent module.
[0040] Intelligent agent module: responsible for receiving state information and immediate rewards and passing them to the decision module for decision-making. It also receives the latest action output returned by the decision module and outputs that action to the communication environment module.
[0041] Decision module: Responsible for integrating and calculating the state information and switching decision information from the communication environment module, switching module and intelligent agent module, storing the current state information and its corresponding action value, and making decisions based on the instant reward function.
[0042] Preferably, the decision module incorporates a self-feedback mechanism that corrects the action D and the time constraint TTT, specifically including the following:
[0043] A simplified A3 handover event mechanism is proposed. It introduces a hysteresis factor, the handover margin (HOM), into the standardized A3 handover trigger event. The system considers the handover successful when two conditions hold: the target base station's reference signal received power (RSRP) is greater than or equal to the serving base station's RSRP, and the HOM margin is satisfied. The formula is as follows:
[0044] $$\mathrm{RSRP}_T > \mathrm{RSRP}_S + \mathrm{HOM}$$
[0045] The simplified A3 handover event is evaluated, and the lowest HOM value that satisfies the condition is output.
- If the output value is positive, the current handover meets the condition and the output action does not need to be adjusted.
- If the output value is negative, the RSRP of the target base station is lower than that of the serving base station, so the handover is not ideal. The system immediately adjusts D and TTT using the correction function and feeds the corrected action back to the agent module. This optimizes the handover decision and improves the handover success rate.

The correction function is expressed as:
[0046]
[0047] Here, α is the correction factor, a constant that controls the degree of weighting. W ∈ [0,1] is the weight factor output by the correction function. The self-feedback mechanism corrects the actions D and TTT according to W.
[0048] The present invention further claims a computer device comprising a processor and a memory. The memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above-described Voronoi boundary-based DQN parameter optimization method.
[0049] The present invention further claims a computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set. The instruction, program, code set, or instruction set is loaded and executed by a processor to implement the above-described Voronoi boundary-based DQN parameter optimization method.
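The self-feedback step can be sketched as below. The margin check follows [0043] to [0045]. The correction function of [0046] is not reproduced in this text, so `placeholder_weight` is a hypothetical stand-in. It returns $W \in (0, 1]$ using a constant $\alpha$. Scaling both D and TTT by $W$ is likewise an assumption about how $W$ is applied, not the patent's stated rule.

```python
import math
from typing import Callable, Tuple

Action = Tuple[float, float]  # (D, TTT) as output by the DQN agent


def placeholder_weight(margin_db: float, alpha: float = 0.1) -> float:
    """HYPOTHETICAL stand-in for the correction function of [0046].

    Maps a non-positive margin to a weight W in (0, 1]; alpha plays the role
    of the constant correction factor. The patent's own formula is not shown.
    """
    return math.exp(alpha * min(margin_db, 0.0))


def self_feedback(action: Action, rsrp_t_dbm: float, rsrp_s_dbm: float,
                  weight_fn: Callable[[float], float] = placeholder_weight) -> Action:
    """Check the simplified A3 condition and correct (D, TTT) if it fails."""
    # HOM value at which RSRP_T >= RSRP_S + HOM becomes tight.
    margin = rsrp_t_dbm - rsrp_s_dbm
    if margin >= 0:
        # Target is at least as strong as the serving cell: keep the action.
        return action
    w = min(max(weight_fn(margin), 0.0), 1.0)  # enforce W in [0, 1]
    d, ttt = action
    # HYPOTHETICAL: scale both parameters by W.
    return (w * d, w * ttt)
```

Passing a different `weight_fn` swaps in another correction rule without touching the margin check.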
[0050] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0051] This invention continuously acquires user location information to determine whether to initiate a handover, which reduces unnecessary information exchange and simplifies the handover process. To further alleviate frequent user handovers, this invention combines the Voronoi model with a deep Q-network (DQN) to construct a novel reinforcement learning-driven handover framework. This framework adaptively optimizes the Voronoi handover boundary to suit complex communication environments, reducing the user handover rate and increasing the user data rate. In addition, this invention introduces a self-feedback mechanism into the framework. The mechanism corrects the action output of the DQN network and accelerates network convergence.
Attached Figure Description
[0052] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings involved in the embodiments are now briefly described. Obviously, the drawings in the following description are merely illustrative of some embodiments of the present invention. For those skilled in the art, other forms of drawings can be constructed based on these drawings without creative effort.
[0053] Figure 1 is a schematic diagram of user movement in Embodiment 1 of the present invention;
[0054] Figure 2 is a schematic diagram of the handover process in Embodiment 1 of the present invention;
[0055] Figure 3 is a schematic diagram of the DQN framework structure in Embodiment 1 of the present invention;
[0056] Figures 4 to 6 are graphs of the network performance results in Embodiment 2 of the present invention.
Detailed Implementation
[0057] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.
[0058] Embodiment 1:
[0059] This invention proposes a DQN parameter optimization system and method based on Voronoi boundaries, specifically including:
[0060] 1. Path loss model
[0061] This example sets up a single-layer ultra-dense network. Base stations are distributed according to a uniform random point process Φ with intensity u. The received power $P_R$ of the mobile device is given by:
[0062] $$P_R(d) = P_T \cdot |d|^{-\eta} \cdot l(|d|)$$
[0063] Here, $P_T$ denotes the base station transmit power and $d$ the distance between the transmitter and the receiver. $\eta$ denotes the path loss factor, set to $\eta \ge 4$. $l$ denotes the shadow-fading model, where $\ln l(|d|)$ follows a normal distribution with mean 1/2 and variance $\sigma^2$, i.e., log-normal shadow fading.
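A minimal Python sketch of the path loss model follows. It evaluates $P_R(d)$ in linear units and draws the log-normal shadowing term with `random.Random.gauss`. The function names, the default $\eta = 4$, and the `mu`/`sigma` parameters (log-domain mean and standard deviation of the shadowing) are illustrative assumptions. They let the mean given in [0063] be passed in directly.

```python
import math
import random
from typing import Optional


def received_power(p_t: float, d: float, eta: float = 4.0, mu: float = 0.0,
                   sigma: float = 0.0, rng: Optional[random.Random] = None) -> float:
    """P_R(d) = P_T * |d|**(-eta) * l(|d|), with ln l(|d|) ~ N(mu, sigma**2).

    Linear units (e.g. watts); sigma = 0 with mu = 0 disables shadowing.
    """
    if d == 0:
        raise ValueError("transmitter-receiver distance must be non-zero")
    rng = rng or random.Random()
    shadowing = math.exp(rng.gauss(mu, sigma))  # log-normal shadow fading sample
    return p_t * abs(d) ** (-eta) * shadowing


def to_dbm(p_watts: float) -> float:
    """Convert a linear power in watts to dBm."""
    return 10.0 * math.log10(p_watts * 1000.0)
```

For example, with shadowing disabled, `received_power(1.0, 2.0)` gives $1 \cdot 2^{-4} = 0.0625$.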
[0064] 2. User movement model
[0065] This example sets the user movement model as a random walk model, which assumes that the user's current and future positions are random, that is, the user's speed and direction of movement are random and independent of historical and future moments.
[0066] Assuming the user's initial position is u(x,y), at each time t, the user's position is updated by a combination of velocity and direction:
[0067] $$u(t+1) = u(t) + v \cdot \Delta t$$
[0068] Where u(t) represents the current user position, u(t+1) represents the user position at the next moment, v represents the user's velocity, and Δt is the time step.
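The random walk update can be sketched as below. The speed distribution (uniform on [0, v_max]), the uniform heading on [0, 2π), and all parameter values are assumptions for illustration. The key property is that each step's speed and direction are drawn afresh, independent of earlier and later steps.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def random_walk(start: Point, steps: int, dt: float = 1.0,
                v_max: float = 1.5, seed: int = 0) -> List[Point]:
    """Return positions u(0), ..., u(steps) of a 2-D random walk.

    Each step applies u(t+1) = u(t) + v * dt, where v has a speed drawn
    uniformly from [0, v_max] and a heading drawn uniformly from
    [0, 2*pi), independently of all other steps.
    """
    rng = random.Random(seed)
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        speed = rng.uniform(0.0, v_max)
        heading = rng.uniform(0.0, 2.0 * math.pi)
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        path.append((x, y))
    return path
```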
[0069] 3. User handover model based on Voronoi boundaries
[0070] The Voronoi model is widely used in spatial partitioning analysis. It divides space into regions such that every point in a region is closer to that region's generating point than to any other generating point. The region generated by each point is called its Voronoi cell, and two adjacent Voronoi cells share a common boundary. Irregular Voronoi cells model the signal coverage areas of base stations. The boundary between the coverage areas of adjacent base stations is treated as the users' handover boundary. The base station whose cell contains the user is called the serving base station $B_S$. The base station to which the user hands over is called the target base station $B_T$.
[0071] Based on the Voronoi model, the handover boundary is the set of points of equal received power. That is, on this boundary the user receives equal signal power from the two base stations.
[0072]
[0073] The location of the serving base station is denoted as $x_s(x_s, y_s)$. The locations of the other base stations are denoted as $x_i(x_i, y_i)$, where $x_i \in \Phi \setminus \{x_s\}$.
[0074] The serving base station initially connected to by the user satisfies the formula:
[0075]
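The association formula in [0075] is not shown in this text. Assume it selects the base station that delivers the highest received power under the path loss model without shadowing. Then the serving station is simply the nearest one, i.e. the generating point of the Voronoi cell that contains the user. The hypothetical helper below illustrates that assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def serving_station(user: Point, stations: Sequence[Point],
                    p_t: float = 1.0, eta: float = 4.0) -> int:
    """Index of the base station giving the highest path-loss received power.

    Without shadowing this is the nearest station, so the user is served by
    the station whose Voronoi cell contains it.
    """
    def power(bs: Point) -> float:
        d = max(math.dist(user, bs), 1e-9)  # avoid division by zero at a site
        return p_t * d ** (-eta)

    return max(range(len(stations)), key=lambda k: power(stations[k]))
```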
[0076] Because the actual handover boundary fluctuates continuously over time, accurately determining the optimal handover boundary for a user becomes extremely complex. To address this challenge, this invention proposes transforming the problem into solving for the handover region. Specifically, the handover region is defined as the area where a user may experience high-frequency handovers during the handover process, i.e., an area with significant uncertainty in user signal coverage. This invention divides the handover process into a handover preparation phase, a candidate base station selection phase, and a handover execution phase, specifically including the following:
[0077] (1) Handover preparation phase
[0078] During a random walk, the system continuously acquires the user's location information. When the user approaches the handover boundary, the system enters the handover preparation phase to prepare for the subsequent handover.
[0079] First, the optimal handover boundary is expressed using the path loss model. In this model, received power decays with distance according to the path loss exponent η, as follows:
[0080]
[0081] Simplifying the above formula, the optimal handover boundary is:
[0082] $$\Gamma(x_s, x_i) = \{(x, y) \mid (x - x_s)^2 + (y - y_s)^2 = (x - x_i)^2 + (y - y_i)^2\}$$
[0083] The handover preparation stage is entered when the distance d from the user's position to the boundary $\Gamma(x_s, x_i)$ is less than or equal to a critical value D. D is called the handover distance threshold.
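The handover preparation test can be sketched in a few lines. Expanding the squared-distance equation of $\Gamma(x_s, x_i)$ gives the line $2(x_i - x_s)x + 2(y_i - y_s)y = x_i^2 + y_i^2 - x_s^2 - y_s^2$. The distance $d$ from the user to the boundary is then the usual point-to-line distance, and preparation starts when $d \le D$. The function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def distance_to_boundary(user: Point, bs_s: Point, bs_i: Point) -> float:
    """Distance from the user to Gamma(x_s, x_i), the perpendicular bisector
    of the serving station bs_s and the neighbouring station bs_i."""
    (x, y), (xs, ys), (xi, yi) = user, bs_s, bs_i
    a, b = 2.0 * (xi - xs), 2.0 * (yi - ys)
    c = xi ** 2 + yi ** 2 - xs ** 2 - ys ** 2
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("the two base stations must be at different sites")
    return abs(a * x + b * y - c) / norm


def in_preparation(user: Point, bs_s: Point, bs_i: Point, D: float) -> bool:
    """Handover preparation starts when d <= D (the distance threshold)."""
    return distance_to_boundary(user, bs_s, bs_i) <= D
```

With stations at (0, 0) and (10, 0), the boundary is the line x = 5. A user at (3, 4) is therefore at distance 2 and enters preparation for any D ≥ 2.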
[0084] (2) Candidate base station selection stage
[0085] Because users' random walk speed and direction of movement are highly random, this uncertainty makes relying solely on the handover preparation phase insufficient to ensure a smooth handover. Specifically, during movement, users may be unable to accurately predict the handover timing in advance due to factors such as signal fluctuations and interference, leading to delays or failures during the handover process. Therefore, simply relying on users approaching the handover boundary and entering the handover preparation state is insufficient to guarantee the stability and success rate of the handover.
[0086] To improve the stability and reliability of the handover process, this invention employs the time-to-trigger (TTT) condition of the A3 handover event. The TTT condition imposes a time threshold on handover between base stations: a change in the user's signal quality must persist for a certain period before it counts. This prevents frequent handovers caused by short-term signal fluctuations. Specifically, a handover can be performed only after the user's received signal quality has continuously exceeded the set threshold $M_S$ (+Hys) for longer than the preset TTT value. This condition helps avoid invalid handovers caused by momentary signal fading or interference, thereby enhancing handover reliability.
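The TTT check can be sketched as follows. This minimal illustration assumes a sampled measurement sequence and the simplified entering condition $M_n > M_S + \mathrm{Hys}$, where $M_n$ (neighbor measurement) is notation introduced here. All cell-specific offsets of the full 3GPP A3 condition are set to zero, and TTT is expressed as a number of consecutive samples. Any sample that violates the condition resets the timer.

```python
from typing import Sequence


def a3_triggered(m_n: Sequence[float], m_s: Sequence[float],
                 hys_db: float, ttt_samples: int) -> bool:
    """Return True once M_n > M_S + Hys has held for ttt_samples in a row.

    m_n and m_s are per-sample measurements (e.g. RSRP in dBm) of the
    neighbour and serving cells; ttt_samples is TTT divided by the sampling
    period. Cell-specific offsets of the full A3 condition are set to zero.
    """
    run = 0
    for mn, ms in zip(m_n, m_s):
        run = run + 1 if mn > ms + hys_db else 0  # any dip resets the timer
        if run >= ttt_samples:
            return True
    return False
```

For example, with a 10 ms measurement period, a TTT of 40 ms corresponds to `ttt_samples=4`.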
[0087] Based on the A3 handover triggering event, this invention defines a candidate base station selection phase to further optimize the handover process. To better explain the candidate base station selection phase, the concept of a handover area is first proposed. The area enclosed by the handover threshold D and the handover boundary when a user enters handover preparation is called the handover area, i.e.:
[0088]
[0089] Here, (a) is the representation in a Cartesian coordinate system and (b) is the representation in a polar coordinate system. The symbols are defined as follows:
- r denotes the distance from the pole (the user) to the boundary $\Gamma(x_s, x_i)$, on which this distance is 0;
- the polar axis is parallel to the handover boundary;
- θ denotes the angle between the polar axis and the user's direction of motion, ranging over $[0, \pi]$;
- $r_s$ and $\theta_s$ denote the distance and angle between the user and the serving base station;
- $r_i$ and $\theta_i$ denote the distance and angle between the user and the target base station.

[0090] Figure 1 illustrates user movement. Its elements are:
- the blue curve: the handover boundary;
- the black triangles: base stations;
- the red dot: the pole;
- the gray curve: the polar axis;
- the black curve: the x-axis;
- the red arrow: the user's direction of movement.

The solid line shows the case in which the system determines the current candidate base station to be base station 10. The dashed line shows the case in which it is base station 3. Taking the solid line as an example:
- the user's polar axis is parallel to the handover boundary;
- the x-axis is perpendicular to the handover boundary;
- the angle between the polar axis and the direction of movement is θ;
- the angle between the x-axis and the direction of movement is α;
- r is the distance from the user to the handover boundary.
[0091] During this stage, the system calculates the handover area between the user and multiple base stations and filters potential target base stations based on TTT conditions. Specifically, this invention uses a distance vector sum algorithm to determine candidate base stations. In this process, within a specified TTT time, the system calculates the vector distance between the user and each candidate base station and sums these distances. The formula is as follows:
[0092]
[0093] Here, $d_i(t)$ denotes the distance d between the user and base station i at time t, where i is the base station index. $\operatorname{sign}(\alpha)v'$ denotes the perpendicular component of the user's motion toward the target base station, where $v' = v \cdot \sin\theta$. L denotes the sum of the vector distances between the user and a particular base station within the time window. If the sum of the vector distances between the user and a base station is always positive, that base station is considered a candidate base station for handover.
[0094] Specifically, during the time-to-trigger (TTT) period, the random movement of users causes the candidate base stations to be updated continuously. Both the number and the identities of the candidate base stations may therefore change at any moment. If, at any moment, the cumulative vector distance L between a user and a base station becomes negative, that base station is disqualified as a candidate. If L remains positive, the base station remains a candidate. Through this dynamic update mechanism, the system can flexibly select the most suitable target base station based on the user's real-time movement and signal conditions, ensuring the stability and accuracy of the handover process.
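The candidate filtering and target selection can be sketched as follows. The exact expression for $L$ in [0092] is not reproduced here, so the per-step term used below is an assumption. It is the component of each user displacement along the direction toward base station $i$, which is positive when the user approaches $i$. The sketch works in three steps:
- the running sum is accumulated over positions sampled during the TTT window;
- a station is dropped the first time its sum turns negative;
- the surviving station with the largest sum is returned as the target $I$.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]


def select_target(track: Sequence[Point], stations: Dict[int, Point],
                  candidates: Sequence[int]) -> Optional[int]:
    """Filter candidates over one TTT window and return the arg-max of L.

    track holds the user's sampled positions during the TTT window.
    HYPOTHETICAL per-step term: the component of each displacement along the
    direction from the user to station i (positive when approaching i).
    """
    totals = {i: 0.0 for i in candidates}  # running sum L_i
    alive = set(candidates)                # current candidate set
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        for i in sorted(alive):
            bx, by = stations[i]
            dist = math.hypot(bx - x0, by - y0)
            if dist == 0.0:
                continue  # user sits on the site; direction undefined
            totals[i] += (dx * (bx - x0) + dy * (by - y0)) / dist
            if totals[i] < 0.0:
                alive.discard(i)  # negative sum at any moment: disqualified
    if not alive:
        return None
    # Execution phase: I = arg max of L_i over the surviving candidates.
    return max(alive, key=lambda i: totals[i])
```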
[0095] The purpose of this phase is to continuously monitor and evaluate the signal quality of multiple base stations to ensure that the selected target base station has sufficient stability and can provide reliable coverage during actual handover, thereby reducing the probability of handover failure and improving system performance. By introducing TTT conditions and a candidate base station selection phase, the accuracy and stability of the handover process are effectively improved, unnecessary handover operations are reduced, and the overall system performance is optimized.
[0096] (3) Handover execution phase
[0097] The system selects a single optimal base station from the candidates as the target base station, so that the user's data rate is optimized during the handover process. That is, the candidate base station with the maximum vector distance sum is selected as the optimal target base station, expressed as:
[0098]
[0099] In this way, the system can select the most suitable target base station based on the vector distance information between the user and each candidate base station, thereby ensuring a smooth handover process, optimizing the user experience, and maximizing the stability and reliability of the data rate.
[0100] Figure 2 details the handover process. In this diagram:
- yellow dots represent the user's starting position;
- green dots represent the user's ending position;
- the red curve represents the user's movement trajectory;
- solid blue lines represent handover boundaries;
- black dashed lines indicate the user entering the handover preparation phase;
- red dashed lines mark the start of the candidate base station selection phase;
- green dashed lines mark the start of the handover execution phase.

D denotes the handover distance threshold, defined as the distance between the black dashed line and the solid blue line. TTT denotes the handover time threshold, defined as the time period between the red dashed line and the green dashed line. $S(D, \mathrm{TTT})$ denotes the handover region. $d_{10}(0)$ denotes the distance between the user and the handover boundary at the initial moment, and $d_{10}(\mathrm{TTT})$ denotes that distance at time TTT.
[0101] Starting from the initial position, the user moves along the red trajectory to the black dot and enters the handover preparation phase. This area is shown in yellow in the diagram and changes as time advances. When the system receives the signal that the user has entered the handover preparation phase, a timer starts and monitors the time-to-trigger (TTT) period. It records the vector distance $d_i(t)$ between the user and the handover boundary: if the angle θ is positive, the distance is positive; if θ is negative, the distance is negative. In the diagram this process runs from $d_i(0)$ at the initial time to $d_i(\mathrm{TTT})$ at the termination time, and the red and green areas represent the timer's duration. When the user reaches the green dot, the TTT ends and the system enters the handover execution phase. In this phase, the system selects the candidate base station with the maximum distance vector sum as the target base station and completes the handover with the user via signaling messages. The user then disconnects from the serving base station and establishes a new connection with the target base station, completing the handover process.
[0102] In summary, this invention addresses two key handover parameters of the A3 event and reconstructs the handover process based on the Voronoi model. On this basis, it proposes a novel user location-based handover model. The traditional A3 event relies on signal strength. This model instead determines the handover entry condition through the handover control parameter D and delineates the handover region by incorporating the time-to-trigger (TTT) delay. The scheme also refines the candidate base station selection mechanism and the optimal base station selection strategy, improving the accuracy of handover decisions and network performance.
[0103] 4. Reinforcement Learning-Based Handover Framework Model
[0104] Reinforcement learning (RL) is a machine learning method based on interaction between an agent and its environment. The agent refines its decision-making strategy from the feedback (rewards or penalties) it receives during trial and error, so as to maximize long-term reward. RL is well suited to complex optimization problems such as the Voronoi diagram-based handover parameter optimization problem proposed in this invention. In ultra-dense network (UDN) environments, users hand over frequently to maintain stable data rates. The handover process is affected by multiple factors, including user speed, direction of movement, received power, and data rate, and these factors increase the complexity of handover decisions. Optimizing handover strategies in such dynamic environments has therefore become a pressing challenge.
[0105] To address this challenge, this invention combines reinforcement learning (RL) with the Voronoi model. The adaptive capability of RL is used to optimize key control parameters of the handover process, such as the handover time threshold (TTT) and the handover distance threshold (D). The Voronoi model divides the environment into multiple non-overlapping regions, which simplifies the decision space and makes handover decision execution more efficient. Combining RL with the Voronoi model improves the optimization of the handover process while reducing system complexity, providing an effective strategy for handling dynamic changes in UDN environments. By optimizing handover parameters in real time, the proposed model significantly improves the accuracy of handover decisions and overall system performance, as detailed below.
[0106] DQN Reinforcement Learning Handover Framework
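To make the action space concrete, the sketch below shows generic DQN building blocks applied to the two parameters named above. The actions are a discrete grid of (D, TTT) pairs, chosen by ε-greedy selection, and values are trained toward the standard one-step target $r + \gamma \max_{a'} Q_{\text{target}}(s', a')$. The grid values, γ and ε are hypothetical. The Q-network itself, experience replay and the reward defined by the communication environment module are omitted, so this is not the patent's implementation.

```python
import itertools
import random
from typing import List, Sequence, Tuple

# HYPOTHETICAL discretisation of the two controlled handover parameters.
D_LEVELS = [10.0, 20.0, 30.0, 40.0]     # handover distance threshold D (m)
TTT_LEVELS = [0.04, 0.08, 0.16, 0.32]   # time-to-trigger TTT (s)
ACTIONS: List[Tuple[float, float]] = list(itertools.product(D_LEVELS, TTT_LEVELS))


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Explore with probability epsilon, otherwise take the arg-max Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward: float, next_q_target: Sequence[float],
              gamma: float = 0.9, done: bool = False) -> float:
    """One-step DQN target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)
```

In use, the agent would map the chosen index back to parameters with `d, ttt = ACTIONS[epsilon_greedy(q, 0.1, rng)]`.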
[0107] This invention proposes a Deep Q-Network (DQN) handover framework that combines Reinforcement Learning (RL) with handover. This framework continuously explores and learns the optimal strategy through the adaptive capabilities of RL to address complex handover optimization problems.
[0108] The DQN handover framework is divided into four main parts by function, as shown in Figure 3. Through this structure, the agent can make accurate decisions from environmental state information during learning, optimizing the handover process in real time. The four parts are as follows:
[0109] (1) Communication Environment Module
[0110] The communication environment module plays a central role in the communication network. It integrates a large amount of complex network information and records and processes key information about the user during movement. Its main functions include real-time tracking of the user's instantaneous speed, instantaneous direction of movement, currently connected serving base station, distance to the serving base station, and distances to other base stations. The module also monitors and records the transmit power of each base station, the channel propagation loss, the real-time data rate received by the user, and network performance metrics such as SINR. Specifically, the communication environment module has the following four core functions.
[0111] 1) Provide the handover module with user and base station information, namely the user's speed, direction, and location and the locations of all base stations, i.e., $[v, \mathrm{Dir}, (x, y)_{\text{user}}, (x_i, y_i)_{\text{bs}}, i \in \Phi]$.
[0112] 2) Receive the target base station information sent by the handover module and forward the key information of the target base station to the decision module for the handover decision. The handover decision information comprises the serving base station RSRP and the target base station RSRP, i.e., $[\mathrm{RSRP}_S, \mathrm{RSRP}_T]$.
[0113] 3) Receive the actions sent by the agent module, update the state information according to the latest action, and transmit the new user and base station information to the handover module.
[0114] 4) Feed the immediate reward generated by the latest action back to the agent module. The reward function R is defined below. It uses a cumulative reward mechanism, with the handover rate $\lambda_{HO}$ and the data rate as performance indicators:
[0115]
[0116] where $\lambda_{HO}$ is the average user handover rate in handovers per second, $\lambda_{HO} = HO_{count} / time$; $HO_{count}$ is the number of handovers performed by the user, and $time$ is the time experienced by the user, in seconds.
[0117] The handover control parameters directly affect the user's handover rate and average data rate. Higher values of D and TTT lead to frequent handovers, which degrade the user experience; lower values of D and TTT cause handover failures or even disconnections, which reduce the handover success rate. The reward function includes both the user's average data rate and the handover rate, mainly so that the average data rate increases while the average handover rate decreases. It also balances the two terms, preventing the network from failing to converge and the handover rate from collapsing toward zero.
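The handover rate $\lambda_{HO} = HO_{count} / time$ can be computed from a sampled trace of serving base station indices. The sketch below assumes that a handover is counted whenever the serving index changes between consecutive samples; `handover_rate` and the trace are illustrative. The reward function R itself is not reproduced, because its formula is not given in this text.

```python
def handover_rate(serving_trace, time_s):
    """Average handover rate lambda_HO = HO_count / time, in handovers per second.

    Assumption: one handover is counted for each change of serving
    base station between consecutive samples of the trace.
    """
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    ho_count = sum(1 for prev, cur in zip(serving_trace, serving_trace[1:])
                   if cur != prev)
    return ho_count / time_s

# Serving base station index at each sampling step (hypothetical trace).
trace = [0, 0, 1, 1, 1, 2, 2, 1]
print(handover_rate(trace, time_s=8.0))  # 0.375
```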
(2) Handover Module
[0119] Based on the handover model described above, the handover module selects the optimal target base station. The handover module has the following functions:
[0120] 1) Receive the user and base station information sent by the communication environment module, process it, and select the best target base station.
[0121] 2) Compute the state information F learned by the agent. The state information selected in this invention consists of the following quantities. v is the user's speed, which reflects the instantaneous speed of the user's movement within the network and is an indispensable factor in handover decisions. Dir is the user's direction of movement, which shapes the user's future trajectory and thus affects handover decisions and the selection of target base stations. L is the number of candidate base stations the user may access during handover, which provides relative location information about these candidates for the handover decision. I and $BS_T$ are the vector distance to the target base station and the target base station number, respectively, and indicate the base station to which the user is handing over. $BS_S$ is the currently connected serving base station, which represents the user's current service state. The handover module sends this state information F to the agent.
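Because the formula for F is not reproduced in this text, the container below only groups the quantities named in [0121] into a flat vector that a Q-network could consume. The field order, the units in the comments, and the name `HandoverState` are assumptions made for illustration.

```python
from dataclasses import dataclass, astuple

@dataclass
class HandoverState:
    """Hypothetical container for the state F = (v, Dir, L, I, BS_T, BS_S)."""
    v: float          # user speed (assumed m/s)
    direction: float  # Dir, heading angle (assumed radians)
    L: float          # candidate base station quantity, as defined in [0121]
    I: float          # vector distance to the target base station
    bs_target: int    # BS_T, target base station number
    bs_serving: int   # BS_S, serving base station number

    def to_vector(self):
        """Flatten the state into a list of floats for a Q-network input."""
        return [float(x) for x in astuple(self)]

state = HandoverState(v=12.0, direction=0.5, L=3, I=42.0, bs_target=2, bs_serving=0)
print(state.to_vector())  # [12.0, 0.5, 3.0, 42.0, 2.0, 0.0]
```

In practice base station numbers would often be one-hot encoded rather than passed as raw floats; the flat form is kept here only for brevity.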
[0122] (3) Intelligent Agent Module
[0123] The agent module is responsible for receiving state information and immediate rewards, and outputting the latest actions to the environment to achieve a continuous exploration and learning process. The agent module has the following functions:
[0124] 1) Receive the state information sent by the handover module and pass the current state to the decision module for decision-making.
[0125] 2) Receive the latest action output returned by the decision module and apply the action output to the communication environment.
[0126] 3) Receive real-time rewards generated by the communication environment and send the real-time rewards back to the decision-making module.
[0127] (4) Decision Module
[0128] The decision module plays a central role in the entire system. It integrates and computes the state and handover decision information from the handover module, the agent module, and the communication environment module; stores the current state information and the corresponding action values; and makes decisions based on the immediate reward function. Its core functions are as follows:
[0129] 1) To further improve the accuracy and robustness of handover decisions, this invention enhances the decision module with a self-feedback function that corrects the actions D and TTT. Specifically, a simplified version of the standardized A3 event mechanism is introduced by adding a hysteresis margin (HOM). The decision module thus accounts for the effect of the user's received power on the handover: when the reference signal received power (RSRP) of the target base station exceeds the RSRP of the serving base station by the HOM margin, the system considers the handover successful, that is:
[0130] $\mathrm{RSRP}_T > \mathrm{RSRP}_S + \mathrm{HOM}$
[0131] Its main purpose is to safeguard the user handover success rate by adding a second check based on received power. The system first evaluates the simplified A3 handover event and outputs the lowest HOM value that satisfies the condition. A positive output means the current handover meets the condition and the output action needs no adjustment. A negative output means the RSRP of the target base station is lower than that of the serving base station, so the handover is not ideal; the system then immediately adjusts D and TTT and feeds the corrected action back to the agent, optimizing the handover decision and improving the handover success rate. The correction function is given below:
[0132]
[0133] Here, α is a constant correction factor that controls the degree of weighting. The self-feedback mechanism corrects the actions D and TTT according to the weighting factor W output by the correction function, as defined below, where W∈[0,1]. Correcting D and TTT in this way adjusts the handover decision process and makes the system more stable in complex network environments. In addition, the correction mechanism avoids handover failures caused by network instability or signal fluctuations and accelerates the convergence of the network loss function.
[0134] 2) Based on the received state information, the decision module outputs actions using the ε-greedy strategy: with probability ε it selects a random action (exploration), and with probability 1-ε it selects the current optimal action (exploitation). The output action is the pair (D, TTT) of handover distance threshold D and handover time threshold TTT.
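The two decision module functions above can be sketched together: the simplified A3 check that flags when the self-feedback correction is needed, and ε-greedy selection over a discrete grid of (D, TTT) actions. This is a minimal sketch under stated assumptions: RSRP is derived from an assumed power-law path loss $P_T d^{-\eta}$ in dB, the "lowest HOM" output is taken to be $\mathrm{RSRP}_T - \mathrm{RSRP}_S$, and the action grid values and all function names are hypothetical. The correction function involving α and W is not reproduced in this text, so the sketch stops at flagging that a correction is needed.

```python
import math
import random

def rsrp_dbm(p_t_dbm, distance_m, eta):
    """Received power under an assumed power-law path loss P_T * d**(-eta), in dBm."""
    return p_t_dbm - 10.0 * eta * math.log10(distance_m)

def a3_margin(rsrp_target, rsrp_serving):
    """HOM at which RSRP_T = RSRP_S + HOM; its sign drives the self-feedback step."""
    return rsrp_target - rsrp_serving

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action index with probability epsilon, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# Hypothetical discrete action grid: (D in metres, TTT in milliseconds).
ACTIONS = [(d, ttt) for d in (5.0, 10.0, 20.0) for ttt in (40, 80, 160)]

serving = rsrp_dbm(30.0, 120.0, eta=3.0)
target = rsrp_dbm(30.0, 80.0, eta=3.0)
margin = a3_margin(target, serving)
needs_correction = margin < 0  # negative: target weaker, so D and TTT would be corrected

q_values = [0.1] * len(ACTIONS)
q_values[4] = 0.9
print(ACTIONS[epsilon_greedy(q_values, epsilon=0.0)])  # (10.0, 80)
print(round(margin, 2), needs_correction)  # 5.28 False
```

A discrete grid is used because a DQN outputs one Q-value per action; continuous (D, TTT) values would need a different agent.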
[0135] In the DQN handover framework, the agent continuously learns from the state information F, exploring and adjusting its decision policy to obtain the handover distance threshold D and handover time threshold TTT best suited to the current communication environment. The framework also introduces a self-feedback mechanism that safeguards the user handover success rate. By dynamically adjusting the policy, this mechanism improves the stability and accuracy of the handover process. Together, these optimizations reduce, to a certain extent, handover failures caused by network fluctuations or changes in signal strength, improve overall system performance and user experience, accelerate network convergence, and increase the reward obtained, further enhancing system performance.
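The text does not spell out the DQN training loss, so the sketch below shows the standard DQN bootstrap target $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$ and the squared temporal-difference loss that a framework of this kind would typically minimize. It is a reference sketch of generic DQN, not the inventors' implementation; the discount factor γ = 0.9 and the function names are assumptions.

```python
def dqn_targets(rewards, next_q_target, dones, gamma=0.9):
    """Standard DQN target y = r + gamma * max_a' Q_target(s', a').

    rewards:       one immediate reward per transition
    next_q_target: target-network Q-values for each next state (one list per transition)
    dones:         True where the episode ended, so no bootstrap term is added
    """
    return [r + gamma * (0.0 if done else max(q_next))
            for r, q_next, done in zip(rewards, next_q_target, dones)]

def td_loss(q_pred, actions, targets):
    """Mean squared error between Q(s, a) for the taken actions and the targets."""
    errors = [(y - q[a]) ** 2 for q, a, y in zip(q_pred, actions, targets)]
    return sum(errors) / len(errors)

targets = dqn_targets([1.0, 0.5], [[0.2, 0.8], [0.4, 0.1]], [False, True])
loss = td_loss([[1.0, 1.5], [0.5, 0.0]], actions=[1, 0], targets=targets)
print([round(y, 4) for y in targets], round(loss, 4))  # [1.72, 0.5] 0.0242
```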
[0136] Example 2:
[0137] Example 2 is based on Example 1, with the following difference: the content described in Example 1 is simulated on the Python platform. The simulation pseudocode is as follows:
[0138]
[0139]
[0140]
[0141] A comparative experiment was designed to compare the DQN-based handover scheme proposed in this invention with an A3 event-based handover scheme and a scheme with fixed handover control parameters. The results are shown in Figures 4 to 6. They show that this invention uses the DQN network to reduce the handover rate and increase the user data rate, and that the self-feedback mechanism of the DQN network speeds up convergence of the network loss and optimizes the agent's action output.
[0142] It should be noted that in this invention patent, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations.
[0143] The above are merely preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A method for DQN parameter optimization based on Voronoi boundaries, characterized in that the method is implemented based on a DQN parameter optimization system, which includes the following modules: Path loss module: used to complete channel modeling and describe the power attenuation law of the signal during propagation; User movement module: uses a random walk model, assuming that the user's current and future positions are random and that the user's speed and direction of movement are also random and independent of past and future moments; User handover module: based on the Voronoi model, transforms the problem of solving the optimal handover boundary into solving the handover region; the high-frequency handover areas experienced by users during the handover process are defined as handover regions, and the handover process is divided into a handover preparation stage, a candidate base station selection stage, and a handover execution stage; Reinforcement learning module: combines reinforcement learning algorithms with the Voronoi model to reduce the user handover rate and increase the user data rate, and introduces a self-feedback mechanism to correct the action output of the DQN network and accelerate network convergence. The method includes the following steps: S1. Construct a path loss model, complete channel modeling, and describe the power attenuation law of the signal during propagation; S2. Select the random walk model as the user movement model, assuming that the user's current and future positions are random and that the user's speed and direction of movement are also random and independent of past and future moments; S3. Construct a user handover model based on Voronoi boundaries, transforming the problem of finding the optimal handover boundary into solving the handover region.
Define the high-frequency handover areas experienced by users during the handover process as the handover region, and divide the handover process into a handover preparation phase, a candidate base station selection phase, and a handover execution phase, specifically as follows: S3.1 Handover preparation phase: During the random walk, the system continuously acquires the user's location. When the user approaches the handover boundary, the system enters the handover preparation phase to prepare for the subsequent handover. The optimal handover boundary is expressed by a path loss model that decays with distance according to the path loss factor η, as follows: where $P_T$ is the base station transmit power; u is the user's initial location; $x_s$ is the location of the serving base station; $x_i$ are the locations of the other base stations; η is the path loss factor; and Φ is a uniformly random point process. Simplifying the above formula gives the optimal handover boundary: When the distance d between the user's movement trajectory and the boundary is less than or equal to a critical value D, the handover preparation phase begins, where D is the handover distance threshold. S3.2 Candidate base station selection phase: The time constraint TTT of the A3 handover event is used to enhance the reliability of the handover process. Based on the A3 handover trigger event, the concept of a handover region is proposed: when the user enters handover preparation, the area enclosed by the handover distance threshold D and the handover boundary is called the handover region, expressed by the formula: where (a) is the representation in a Cartesian coordinate system and (b) is the representation in a polar coordinate system; r is the distance from the user, taken as the pole, to the boundary; the polar axis is parallel to the handover boundary; θ is the angle between the polar axis and the direction of target motion, ranging from ...; $r_s$ and $\theta_s$ are the distance and angle between the user and the serving base station; $r_i$ and $\theta_i$ are the distance and angle between the user and the target base station. During user movement, the system calculates the handover region between the user and multiple base stations, uses the TTT condition to filter potential target base stations, and applies a vector-distance-sum algorithm: within the TTT time window, the vector distance between the user and each candidate base station is calculated and these distances are summed, as follows: where $d_i(t)$ is the distance between the user and base station i at time t, d denoting distance and i the base station number; a further term represents the vertical component of the user's movement toward the target base station; L is the sum of the vector distances for the base station within the time window; and α is the correction factor, a constant used to control the degree of weighting. If the accumulated vector-distance sum L between the user and a base station remains positive, the base station is considered a candidate base station for handover; if L becomes negative at any moment, the base station is disqualified as a candidate.
S3.3 Handover execution phase: The system selects, from the candidate base stations, the base station with the maximum vector-distance sum L as the optimal target base station, as expressed by the formula: where I represents the optimal target base station; S4. Combine reinforcement learning algorithms with the Voronoi model to construct a DQN handover framework for the reinforcement learning network, reducing the user handover rate and improving the user data rate; and introduce a self-feedback mechanism to correct the action output of the DQN network and accelerate network convergence.
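Steps S3.1 to S3.3 can be illustrated with a short sketch. Because the boundary and vector-distance formulas are not reproduced in this text, it relies on two stated assumptions: with equal transmit power the optimal boundary between $x_s$ and $x_i$ is their perpendicular bisector (the Voronoi edge), and the per-step "vertical component" is the projection of the user's step onto the unit vector pointing toward each base station. The correction factor α is omitted, and all names and coordinates are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def bisector_distance(u, x_s, x_i):
    """Distance from u to the perpendicular bisector of x_s and x_i (equal-power boundary)."""
    return abs(sq_dist(u, x_s) - sq_dist(u, x_i)) / (2.0 * math.dist(x_s, x_i))

def toward_component(p_prev, p_cur, bs):
    """Projection of the step p_prev -> p_cur onto the unit vector from p_prev to bs."""
    ux, uy = bs[0] - p_prev[0], bs[1] - p_prev[1]
    norm = math.hypot(ux, uy)
    if norm == 0.0:
        return 0.0
    dx, dy = p_cur[0] - p_prev[0], p_cur[1] - p_prev[1]
    return (dx * ux + dy * uy) / norm

def select_target(window, base_stations, serving):
    """Accumulate L for each non-serving BS over the TTT window; return the argmax or None."""
    sums = {}
    for i, bs in enumerate(base_stations):
        if i == serving:
            continue
        running = 0.0
        for p_prev, p_cur in zip(window, window[1:]):
            running += toward_component(p_prev, p_cur, bs)
            if running <= 0.0:
                break  # sum turned non-positive: disqualified as a candidate
        else:
            if running > 0.0:
                sums[i] = running  # sum stayed positive for the whole window
    return max(sums, key=sums.get) if sums else None

base_stations = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
window = [(40.0, 10.0), (42.0, 10.0), (44.0, 10.0), (46.0, 10.0)]  # positions within TTT
D = 5.0
d = bisector_distance(window[-1], base_stations[0], base_stations[1])
print(d, d <= D)  # 4.0 True
print(select_target(window, base_stations, serving=0))  # 1
```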
2. The method of claim 1, wherein, The DQN switching framework for the reinforcement learning network includes: Communication environment module: responsible for recording and processing key information of the user during the movement process. The key information includes: the user's instantaneous movement speed, instantaneous movement direction, currently connected serving base station, distance between the user and the serving base station and the distance to other base stations, transmission power of each base station, channel propagation loss and real-time data rate received by the user. The switching module is responsible for receiving user and base station information sent by the communication environment module, processing the information, and selecting the best target base station; it is also responsible for calculating the state information learned by the agent module. The intelligent agent module is responsible for receiving status information and immediate rewards, feeding them back to the decision-making module for decision-making; it is also responsible for receiving the latest action output returned by the decision-making module and outputting the action to the communication environment module. Decision module: Responsible for integrating and calculating the state information and switching decision information from the communication environment module, switching module and intelligent agent module, storing the current state information and its corresponding action value, and making decisions based on the instant reward function.
3. The method of claim 2, wherein the decision module incorporates a self-feedback mechanism for correcting the action D and the time constraint TTT, specifically comprising: a simplified A3 handover event mechanism, obtained by introducing a hysteresis margin HOM into the standardized A3 handover trigger event mechanism. When the reference signal received power RSRP of the target base station exceeds the RSRP of the serving base station by the HOM margin, the system considers the handover successful, as expressed by the formula: $\mathrm{RSRP}_T > \mathrm{RSRP}_S + \mathrm{HOM}$. The simplified A3 handover event is calculated and the minimum HOM value satisfying the condition is output. If the output value is positive, the current handover condition is met and no adjustment of the output action is required. If the output value is negative, the RSRP of the target base station is lower than the RSRP of the serving base station, indicating that the handover is not ideal; the system immediately adjusts the D and TTT (time-to-trigger) values based on the correction function and feeds the corresponding actions back to the agent module to optimize the handover decision and improve the handover success rate. The correction function is expressed as: where α is the correction factor, a constant used to control the degree of weighting, and W is the weighting factor output by the correction function. The self-feedback mechanism corrects the actions D and TTT based on W, where W∈[0,1].
4. A computer device, comprising: The computer device includes a processor and a memory, wherein the memory stores at least one instruction, at least one program, code set, or instruction set, and the instruction, program, code set, or instruction set is loaded and executed by the processor to implement the Voronoi boundary-based DQN parameter optimization method as described in any one of claims 1-3.
5. A computer readable storage medium, characterized in that, The computer-readable storage medium stores at least one instruction, at least one program, code set, or instruction set, which is loaded and executed by a processor to implement the Voronoi boundary-based DQN parameter optimization method as described in any one of claims 1-3.