Method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning
By constructing a multi-dimensional site selection feature space and reward function through deep reinforcement learning, the problems of high computational complexity and model instability in the site selection of elderly care facilities are solved. This enables efficient and stable decision-making for the layout of elderly care facilities, outputs a Pareto optimal solution set, and supports scientific facility configuration.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIANGSU INST OF URBAN PLANNING & DESIGN
- Filing Date
- 2026-02-14
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for selecting sites for elderly care facilities suffer from high computational complexity, unstable models, and an inability to fully quantify influencing factors, making it difficult to output Pareto optimal solutions under multiple real-world constraints.
A deep reinforcement learning-based approach is adopted to construct a site selection feature space that includes multiple dimensions such as transportation, land use, supporting facilities, and economy. By defining the state space, action mask, and reward function, the PPO algorithm is used for training to achieve stable convergence and efficient optimization. The reward function and state transition mechanism are designed, and an intelligent agent interaction environment is constructed for model training.
It achieves stable convergence and efficient optimization in high-dimensional states, outputs Pareto optimal solution sets, and provides scientific, data-driven decision support for the layout of elderly care facilities, taking into account both social and economic benefits.
Figure CN122242834A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent site selection and resource allocation technology for public service facilities, and in particular to a method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning.
Background Technology
[0002] In urban and rural planning and design, policy formulation, and other fields, the site selection of elderly care facilities is one of the core components. Currently, the era of stock renewal places more refined and economical demands on the site selection of elderly care facilities. The location, land area, expected service population, and estimated investment budget of elderly care facilities in existing spaces determine whether the allocation of elderly care facilities is economical, efficient, and equitable.
[0003] Site selection for elderly care facilities involves complex and dynamically changing factors such as cost, demand, and supply. Traditional methods for site selection for elderly care facilities include network analysis, Voronoi diagrams, genetic algorithms, and ant colony optimization.
[0004] Network analysis algorithms can reflect real traffic conditions, but cannot reflect factors other than traffic.
[0005] Voronoi diagrams cannot reflect the influence of roads, land use conditions, and similar factors on the layout.
[0006] Genetic algorithms and ant colony algorithms suffer from the "curse of dimensionality" when faced with large-scale site selection problems (such as choosing from hundreds of candidate sites), resulting in exponential growth in computational complexity and difficulty in finding optimal solutions, thus limiting their practicality.
[0007] Deep reinforcement learning algorithms are also used in the existing technology. Although such algorithms are capable of multi-objective collaborative optimization and high-dimensional state processing, they place high demands on the design of the state space, reward function, action space, and state transition function. If these are not designed properly, the model becomes unstable and training fails to converge.
Summary of the Invention
[0008] In view of the shortcomings of the prior art, the purpose of this invention is to provide a method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning, so as to solve one or more problems in the prior art.
[0009] To achieve the above objective, the technical solution of the present invention is as follows:
[0010] A method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning, the optimization method comprising the following steps:
[0011] S1. Collect multi-dimensional data, perform data processing and population forecasting, and form a multi-dimensional database at the plot scale that includes plot-scale elderly population forecast data at the end of the planning period.
[0012] S2. Based on the predicted elderly population data at the plot scale at the end of the planning period and the current data of elderly care facilities, analyze the gap in elderly care services and generate a map of the gap in elderly care facilities.
[0013] S3. Construct a target indicator system for the site selection of elderly care facilities. The indicator system includes the current facilities, population distribution, transportation conditions, land use conditions, supporting facilities, and economic cost criteria.
[0014] S4. Construct a demand state space based on the map of the shortage of elderly care facilities, and construct a site selection feature space based on the existing updatable spatial data in the map and the site selection target index system; combine deep reinforcement learning algorithms to construct an intelligent agent interaction environment that includes state space, action space, action mask, reward function and state transition function, and train the model of the intelligent agent for site selection of elderly care facilities.
[0015] S5. Load the trained model and simulate the site selection decision based on the final state space of the target area during the planning period. Output the spatial site selection scheme for elderly care facilities under preset constraints.
[0016] Furthermore, step S2, analyzing the gap in elderly care services, includes the following steps:
[0017] Based on location data of elderly care facilities and residential communities, the service coverage of each elderly care facility is defined by walking distance.
[0018] The gap in elderly care services is then determined. If a residential community is located outside the service coverage of elderly care facilities, the number of elderly people in that community at the end of the planning period is counted as its service gap. If a residential community is located within the service coverage of elderly care facilities, the number of elderly people in each such community at the end of the planning period is aggregated for further calculation.
[0019] The number of beds per capita is calculated from the number of beds in elderly care facilities within reach and the number of elderly people at the end of the planning period. If the number of beds per capita is greater than or equal to the national standard, there is considered to be no shortage of elderly care services. If the number of beds per capita is less than the national standard, the shortage of elderly care services within the reach of the facilities is calculated.
[0020] By summarizing the gaps in elderly care services both within and outside the reachable area, a distribution map of the elderly care service gaps is obtained.
[0021] Furthermore, the formula for calculating the gap in elderly care services within the reach of the aforementioned elderly care facilities is as follows:
[0022]
[0023] In the formula: $G_i$ is the shortage of elderly care services in the $i$-th residential community; $P_i$ is the number of elderly people in the $i$-th residential community; $B_{std}$ is the number of beds per capita in elderly care facilities as determined by national standards; $B_i$ is the number of beds per capita in elderly care facilities for the $i$-th residential community.
[0024] Furthermore, the indicators at each criterion level of the target indicator system for the site selection of elderly care facilities constructed in step S3 are as follows:
[0025] The current facility criterion layer includes indicators such as the number of current elderly care facilities.
[0026] The population distribution criterion layer includes indicators such as the elderly care demand gap at the end of the planning period.
[0027] The traffic condition criterion layer includes indicators for road traffic conditions and traffic facility conditions.
[0028] The supporting facilities criterion layer includes indicators such as the number of medical facilities and the number of emergency medical facilities.
[0029] The land use condition criterion layer includes indicators such as topographic conditions and the number of NIMBY (Not In My Backyard) facilities.
[0030] The economic cost criterion layer includes indicators such as land prices and consumption levels.
[0031] Furthermore, in step S4, the action space is defined as a one-dimensional discrete vector, the dimension of which corresponds to the number of facility candidate points, and each action corresponds to selecting or not selecting the candidate plot; the action mask generates a Boolean matrix based on the Euclidean distance threshold between the demand space and the facility candidate points, which is used to dynamically constrain invalid site selection actions.
[0032] Furthermore, the reward function described in step S4 is based on a multi-objective optimization mechanism, and its calculation formula is shown below:
[0033]
[0034] In the formula: $r_t$ is the reward for executing action $a_t$ in state $s_t$ at the $t$-th time step, where action $a_t$ matches demand space $i$ to facility candidate point $j$; $d_{ij}$ is the Euclidean distance between demand space $i$ and facility candidate point $j$ matched by action $a_t$; $g_i(s_t)$ is the gap in elderly care services of demand space $i$ in state $s_t$; $m_{ij}$ is the action constraint value determined by the Euclidean distance between demand space $i$ and facility candidate point $j$; $c_j$ is the economic cost of selecting facility candidate point $j$; $f_j$ is the supporting facilities incentive for selecting facility candidate point $j$; $u_j$ is the construction suitability incentive for selecting facility candidate point $j$; $p_j$ is the NIMBY (Not In My Backyard) penalty for selecting facility candidate point $j$.
[0035] The formula for calculating long-term cumulative rewards is as follows:
[0036] $$G = \sum_{t=0}^{T} \gamma^{t} r_t$$
[0037] In the formula: $G$ is the long-term cumulative reward; $t$ is the decision step, up to the final step $T$; $\gamma$ is the discount factor; $r_t$ is the reward for the action taken at the $t$-th decision step.
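As a rough illustration of the long-term cumulative reward, the following sketch sums the per-step rewards, leaving the first reward undiscounted and multiplying each later reward by one more factor of the discount factor. The function name `discounted_return` and the sample values are illustrative assumptions and are not taken from the patent text.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the decision steps t = 0, 1, ..."""
    total = 0.0
    # Iterate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards for three decision steps, discount factor 0.5:
# 1.0 + 0.5 * 2.0 + 0.25 * 4.0 = 3.0
example = discounted_return([1.0, 2.0, 4.0], gamma=0.5)
```

A discount factor close to 1 weights late matches almost as heavily as early ones, while a smaller value makes the agent favor immediate reward.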
[0038] Furthermore, in step S4, the model adopts the PPO deep reinforcement learning model structure: a policy network is constructed to output the action probability distribution, and a value network is constructed to evaluate the state value. Model training includes an interaction sequence collection stage and a network parameter update stage.
[0039] Furthermore, the policy network formula is shown below:
[0040]
[0041] In the formula: $\pi_\theta(a \mid s)$ is the probability that the policy network selects action $a$ in state $s$; $\theta$ denotes the network weight parameters; $f_\theta$ is the policy network mapping function composed of a multi-layer neural network.
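[0042] The training objective of the policy network is to minimize the policy network loss function, which is based on the clipped advantage objective of the PPO algorithm. The calculation formula is shown below:
[0043] $$L^{CLIP}(\theta) = -\mathbb{E}_t\left[\min\left(\rho_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$
[0044] In the formula: $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the importance sampling ratio; $\hat{A}_t$ is the advantage function; $\operatorname{clip}(\cdot)$ is the clipping function, which limits the ratio to the interval $[1-\epsilon, 1+\epsilon]$.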
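The clipped policy loss of PPO can be sketched in a few lines. The version below is a minimal illustration that assumes the standard PPO form (the negative mean of the clipped surrogate) and a clipping range of 0.2, which is a common default rather than a value stated in this document. The function name `ppo_clip_loss` is hypothetical.

```python
def ppo_clip_loss(ratios, advantages, epsilon=0.2):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        # Keep the importance sampling ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * advantage, clipped * advantage))
    # The surrogate is maximized, so the loss to minimize is its negative.
    return -sum(terms) / len(terms)


# Hypothetical batch of two samples: one with a positive advantage and a
# ratio above the clip range, one with a negative advantage and a low ratio.
loss = ppo_clip_loss([1.5, 0.5], [1.0, -1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far from the one that collected the data, which is the usual explanation for PPO's training stability.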
[0045] Furthermore, the objective of the value network is to minimize the value loss function, calculated as follows:
[0046] $$L(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - G_t\right)^2\right]$$
[0047] In the formula: $\phi$ denotes the value network parameters; $G_t$ is the actual long-term cumulative reward; $V_\phi(s_t)$ is the expected value of the long-term cumulative reward.
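A minimal sketch of the value loss follows, assuming it takes the usual mean squared error form between the value network's predictions and the observed cumulative rewards. The function name `value_loss` and the sample numbers are illustrative assumptions.

```python
def value_loss(predicted_values, actual_returns):
    """Mean squared error between predicted state values and observed returns."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, actual_returns)]
    return sum(errors) / len(errors)


# Hypothetical predictions and returns for two states: (1 + 4) / 2 = 2.5
example_loss = value_loss([1.0, 3.0], [2.0, 1.0])
```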
[0048] Furthermore, in step S5, the preset constraints include facility size constraints and economic feasibility constraints; the site selection decision simulation process is based on Markov decision process and environmental interaction.
[0049] Compared with the prior art, the beneficial technical effects of the present invention are as follows:
[0050] (i) This invention constructs a site selection feature space that includes multiple dimensions such as transportation, land use, supporting facilities, and economy, and comprehensively quantifies the actual influencing factors such as road conditions, land use attributes, and NIMBY facilities, thus avoiding the one-sidedness of single-dimensional analysis.
[0051] (ii) This invention transforms static site selection into a sequential decision-making process, defines a state space, action mask, and reward function, and uses the PPO algorithm for training, thereby achieving stable convergence and efficient optimization in high-dimensional states and overcoming the limitation of computational complexity.
[0052] (iii) This invention ensures the training stability and convergence of the deep reinforcement learning model through the designed reward function and state transition mechanism, solves the model oscillation problem caused by unreasonable design, and can output a Pareto optimal solution set under multiple real-world constraints, providing data-driven scientific decision support for the layout of elderly care facilities that takes into account both social and economic benefits.
Attached Figure Description
[0053] Figure 1 is a flowchart of the method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning, according to an embodiment of the present invention.
[0054] Figure 2 is a schematic diagram of the elderly care service gap identification method used in the optimization method, according to an embodiment of the present invention.
[0055] Figure 3 is a schematic diagram of the target indicator system for the site selection of elderly care facilities used in the optimization method, according to an embodiment of the present invention.
[0056] Figure 4 is a schematic diagram of the state space construction method of the optimization method, according to an embodiment of the present invention.
[0057] Figure 5 is a schematic diagram of the model training process of the optimization method, according to an embodiment of the present invention.
[0058] Figure 6 shows the site selection decision simulation results of the optimization method, according to an embodiment of the present invention (left: market entity criterion; right: government entity criterion).
Detailed Implementation
[0059] To make the objectives, technical solutions, and advantages of this invention clearer, the proposed method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning is described in detail below with reference to the accompanying drawings and specific embodiments. The advantages and features of this invention will become clearer from the following description. It should be noted that the accompanying drawings are in a very simplified form and use imprecise proportions, and serve only to illustrate the embodiments of this invention conveniently and clearly. Please refer to the accompanying drawings for a clearer understanding of the objectives, features, and advantages of this invention. It should be understood that the structures, proportions, sizes, etc., depicted in the accompanying drawings are provided only for illustration, so that those skilled in the art can understand and read them. They are not intended to limit the conditions under which this invention can be implemented and therefore have no substantial technical significance. Any modification of structure, change in proportion, or adjustment of size that does not affect the effects and objectives achieved by this invention should still fall within the scope of the technical content disclosed in this invention.
[0060] Referring to Figure 1, a method for optimizing the site selection scheme of existing space for elderly care facilities based on deep reinforcement learning comprises the following steps:
[0061] S1. Collect multivariate data, perform data processing and population forecasting, and form a multivariate database at the plot scale that includes predicted data on the elderly population at the plot scale at the end of the planning period, as detailed below:
[0062] S11. Data Collection: Collect basic data such as population census data, land use status data, and inefficient land use patch data. Combine this with the latest internet map data to collect big data on population profiles, including statistical characteristics of the population within each residential unit, such as population distribution, age characteristics, economic capacity, and travel habits. Simultaneously, collect information on existing elderly care facilities, including attributes such as the location, scale, and service capacity of various facilities, including elderly care institutions. Through data collection, establish a raw dataset covering multiple dimensions such as population, land, and facilities to provide basic data support for subsequent analysis.
[0063] S12. Conduct verification and correction of the collected population data to ensure its accuracy and reliability. Using official census and statistical data as a reference standard, process the population profile big data to accurately reflect the actual population situation and ensure consistency with official statistical data in terms of trends and total numbers.
[0064] The processing steps mainly cover data cleaning, data association, and data correction. Data cleaning aims to remove duplicate and erroneous data from the population profile big data and repair incomplete data. Data association involves matching the population profile big data with corresponding geospatial information and geographical units in official statistical data to ensure data comparison and analysis are conducted at the same spatial scale. Data correction involves constructing regression models between official statistical data and population profile big data, and using the model's predicted values to correct the population profile big data, thereby making the population profile big data better match the trends presented by the statistical data.
[0065] The processed population profile big data is compared with statistical data in multiple dimensions, such as comparing the data distribution in different regions and time periods, in order to ensure the consistency and rationality of the data.
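The data correction step in S12 can be sketched as a simple regression. The example below assumes a one-variable ordinary least squares fit in which official counts are regressed on big-data counts for geographic units where both are available. The patent does not specify the regression form, so the linear model, the function names `fit_linear` and `correct_counts`, and the sample figures are all assumptions.

```python
def fit_linear(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def correct_counts(profile_counts, intercept, slope):
    """Replace raw big-data counts with the regression's predicted values."""
    return [intercept + slope * count for count in profile_counts]


# Hypothetical units where both sources are available.
profile = [100.0, 200.0, 300.0]    # population profile big data
official = [150.0, 250.0, 350.0]   # official statistics for the same units
intercept, slope = fit_linear(profile, official)
corrected = correct_counts([120.0], intercept, slope)
```

In practice the fit would be checked against the multi-dimensional comparison described above before the corrected values are stored.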
[0066] S13. Predict the size of the elderly population at the end of the planning period using a plot-level population forecasting method. The known cohort-component method is used to predict the number of elderly people at the plot level at the end of the planning period. Iterative calculations are performed by inputting base-year plot-level population data, age structure, fertility rate, mortality rate, and migration rate. The model outputs the overall population size and the number of people by age group, including the number of elderly people.
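One projection period of a cohort-based forecast of this kind can be sketched as follows. This is a deliberately simplified single-sex version with equal-width age groups. The function name, the parameter layout, and the sample rates are hypothetical, and a real forecast would repeat this step until the end of the planning period.

```python
def project_one_period(population, survival, fertility, net_migration):
    """Advance an age-grouped population by one projection period.

    population[k]    people in age group k at the start of the period
    survival[k]      share of group k surviving into group k + 1
                     (the last entry applies to the open-ended oldest group)
    fertility[k]     births per person in group k over the period
    net_migration[k] net migrants added to group k at the end of the period
    """
    last = len(population) - 1
    births = sum(p * f for p, f in zip(population, fertility))
    projected = [births]
    for k in range(1, last + 1):
        projected.append(population[k - 1] * survival[k - 1])
    # Survivors of the open-ended oldest group remain in that group.
    projected[last] += population[last] * survival[last]
    return [p + m for p, m in zip(projected, net_migration)]


# Hypothetical plot with three age groups; the last group is the elderly.
result = project_one_period(
    population=[100.0, 80.0, 60.0],
    survival=[0.9, 0.8, 0.5],
    fertility=[0.0, 0.25, 0.0],
    net_migration=[0.0, 0.0, 0.0],
)
elderly = result[-1]
```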
[0067] S14. The processed data is stored in the database to form a basic geographic, population distribution, public facilities, and socio-economic database at the plot scale, so as to be efficiently accessed and analyzed in the subsequent model training and site selection decision-making process.
[0068] S2. Based on the projected elderly population data at the plot scale at the end of the planning period and the current data on elderly care facilities, analyze the gap in elderly care services and generate a map of the gap in elderly care facilities.
[0069] Referring to Figure 2, analyzing the gap in elderly care services includes the following steps:
[0070] Based on the location data of elderly care facilities and residential communities, the service coverage of each elderly care facility is defined by walking distance. In this embodiment, the service coverage of an elderly care facility is defined as the residential communities that can be reached within a 15-minute walk from the facility.
[0071] To determine the gap in elderly care services, if a residential community is located outside the service coverage of elderly care facilities, the number of elderly people in each residential community outside the reach of elderly care facilities at the end of the planning period is calculated as the service gap; if a residential community is located within the service coverage of elderly care facilities, the number of elderly people in each residential community within the reach of elderly care facilities at the end of the planning period is aggregated for further calculation.
[0072] The number of beds per capita in each elderly care facility is calculated based on the number of beds in the facilities within the reach of the facilities and the number of elderly people at the end of the planning period. If the number of beds per capita is greater than or equal to the national standard, it is considered that there is no shortage of elderly care services. If the number of beds per capita is less than the national standard, then the shortage of elderly care services within the reach of the facilities is calculated.
[0073] By summarizing the gaps in elderly care services both within and outside the reachable area, a distribution map of these gaps is obtained, providing a data foundation for the subsequent demand status space.
[0074] Furthermore, the formula for calculating the gap in elderly care services within the reach of the aforementioned elderly care facilities is shown in Formula 1 below:
[0075] (1)
[0076] In the formula: $G_i$ is the shortage of elderly care services in the $i$-th residential community; $P_i$ is the number of elderly people in the $i$-th residential community; $B_{std}$ is the number of beds per capita in elderly care facilities as determined by national standards; $B_i$ is the number of beds per capita in elderly care facilities for the $i$-th residential community, that is, the ratio of the number of beds in elderly care facilities within the service coverage area to the number of elderly people in the residential community.
[0077] By calculating the gaps in elderly care services in each residential community, a distribution map of these gaps is obtained, providing a data foundation for the demand state space of the subsequent intelligent agent environment.
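The gap rules of step S2 can be sketched per residential community. For communities inside the coverage area, the sketch assumes that the gap equals the number of elderly people multiplied by the shortfall of per-capita beds against the national standard. This is one plausible reading of Formula 1, not a confirmed one. The function name and the standard of 0.04 beds per person used in the example are hypothetical.

```python
def service_gap(elderly, beds_in_reach, covered, standard_beds_per_capita):
    """Service gap of one residential community (assumed form, see lead-in)."""
    if not covered:
        # Outside every facility's walking range: all elderly residents count.
        return float(elderly)
    beds_per_capita = beds_in_reach / elderly if elderly else float("inf")
    if beds_per_capita >= standard_beds_per_capita:
        return 0.0  # supply meets the standard, so there is no gap
    # ASSUMPTION: gap = elderly population x per-capita bed shortfall.
    return elderly * (standard_beds_per_capita - beds_per_capita)


# Hypothetical communities, with an illustrative standard of 0.04 beds/person.
uncovered_gap = service_gap(200, 0, covered=False, standard_beds_per_capita=0.04)
adequate_gap = service_gap(100, 5, covered=True, standard_beds_per_capita=0.04)
short_gap = service_gap(100, 2, covered=True, standard_beds_per_capita=0.04)
```

Note that under this assumption the uncovered case is measured in people and the covered case in beds, so a real implementation would need to convert both to a common unit before mapping them.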
[0078] S3. Construct a target indicator system for the site selection of elderly care facilities. The indicator system includes the current facilities, population distribution, transportation conditions, land use conditions, supporting facilities, and economic cost criteria.
[0079] Referring to Figure 3, relevant policy documents, technical specifications, and research findings concerning the site selection of elderly care facilities are reviewed, and rigid constraint indicators such as spatial configuration standards, safety protection requirements, and service function configurations are extracted. Marginal indicators whose frequency falls below a critical threshold are eliminated based on the principles of salience and universality, and semantic similarity algorithms are used to eliminate redundant synonymous indicators. This yields the six criterion layers for elderly care site selection used in this embodiment. The indicators for each criterion layer are as follows:
[0080] The current facility criteria layer includes indicators such as the number of current elderly care facilities.
[0081] The population distribution criteria layer includes indicators such as the number of elderly care demand gaps at the end of the planning period.
[0082] The traffic condition criteria layer includes indicators for road traffic conditions and traffic facility conditions.
[0083] The supporting facilities criteria layer includes indicators such as the number of medical facilities and the number of emergency medical facilities.
[0084] The land use criteria layer includes indicators such as topographical conditions and the number of NIMBY (Not In My Backyard) facilities.
[0085] The economic cost criterion layer includes indicators such as land prices and consumption levels.
[0086] By establishing various indicators for the above six criteria layers, a complete target indicator system for the site selection of elderly care facilities is constructed, providing a theoretically complete measurement benchmark for the subsequent definition of the state space of the intelligent agent environment.
[0087] S4. Construct a demand state space based on the gap map of elderly care facilities, and construct a site selection feature space based on the inefficient map patch data and the site selection target index system. Combine with deep reinforcement learning algorithm, construct an intelligent agent interaction environment including state space, action space, action mask, reward function and state transition function, and train the model of the intelligent agent for elderly care facility site selection.
[0088] Specifically, this embodiment constructs an intelligent site selection decision model for elderly care facilities based on a reinforcement learning framework. By modeling the site selection decision process as a Markov decision process, the traditional static site selection problem is transformed into a sequential decision problem. This allows the reinforcement learning agent to learn the optimal facility layout strategy through multiple interactions, achieving multi-objective collaborative optimization such as maximizing demand coverage and minimizing costs. In this embodiment, for a planning scenario with n residential communities with elderly care needs and m alternative facility locations within a region, the following is defined:
[0089] Decision Sequence: Each decision cycle contains n decision steps, and each decision step performs a location matching action once.
[0090] Action Space: Each action matches an uncovered demand point with a facility alternative point, indicating that the elderly care services for that demand point are covered by the selected facility.
[0091] State transition and termination: After each matching, the remaining elderly care demand gap for each demand point is automatically updated, and the system enters the next state. When all demand points are matched or a preset number of steps are reached, the current site selection decision cycle ends, forming a complete facility-demand matching scheme.
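The decision sequence, state transition, and termination rules above can be sketched as a small episode object. This is a toy illustration only: it assumes that a match covers the whole remaining gap of the demand point, and the class name `SitingEpisode` and its attributes are hypothetical.

```python
class SitingEpisode:
    """Toy episode: one demand point is matched to one candidate per step."""

    def __init__(self, gaps, max_steps):
        self.remaining = list(gaps)   # unmet demand per demand point
        self.matches = {}             # demand index -> candidate index
        self.max_steps = max_steps
        self.steps = 0

    def step(self, demand, candidate):
        """Apply one matching action and report whether the episode ended."""
        if demand in self.matches:
            raise ValueError("demand point already matched")
        self.matches[demand] = candidate
        self.remaining[demand] = 0.0  # ASSUMPTION: a match covers the whole gap
        self.steps += 1
        return self.done()

    def done(self):
        """End when every demand point is matched or the step budget is spent."""
        all_matched = len(self.matches) == len(self.remaining)
        return all_matched or self.steps >= self.max_steps


# Hypothetical region with two demand points served by candidate site 2.
episode = SitingEpisode([5.0, 3.0], max_steps=10)
finished_after_first = episode.step(demand=0, candidate=2)
finished_after_second = episode.step(demand=1, candidate=2)
```

A full environment would also return the reward and the updated state features at each step. Those parts are left out here to keep the transition and termination logic visible.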
[0092] To implement the model, the agent interaction environment, model structure, and training environment parameter configuration were designed and constructed, as follows:
[0093] For the design of intelligent agent interaction environments:
[0094] Referring to Figure 4, environment construction includes defining the state space, the action space and action mask, and the reward function. In this embodiment, a custom simulation environment is defined, with key components including the state space, action space, and reward function. The state space is structurally decomposed into a site selection feature space and a demand space.
[0095] Furthermore, constructing the site selection feature space involves two steps: identifying potential facility sites and constructing the features of these sites. First, based on the inefficient land use patch data from the special plan for inefficient land use, plots that are smaller than the minimum land size for nursing homes or that are currently designated as roads, railways, or municipal utilities are eliminated, leaving the remaining inefficient plots as potential nursing home sites. Second, based on the target indicator system for the site selection of elderly care facilities constructed in step S3, the features of the potential sites are constructed. These features cover the transportation conditions, supporting facilities, land use conditions, economic costs, and spatial coordinates of each potential site. Transportation conditions include two features: the number of roads and transportation facilities. The road quantity feature counts the number of adjacent roads on which motor vehicle entrances and exits can be set up at each potential site. The transportation facility feature counts the number of subway stations and bus stops within walking distance of each potential site. Supporting facilities include two features: medical facilities and emergency medical facilities. The medical facility feature counts the number of general hospitals and community health service centers within motor vehicle reach of each potential site. The emergency medical facility feature counts the number of emergency medical facilities within motor vehicle reach of each potential site. Land use conditions include two features: NIMBY (Not In My Backyard) facilities and terrain conditions. The NIMBY facility feature is obtained by drawing a buffer zone around each potential site and counting the NIMBY facilities within it, such as garbage and sanitation facilities, power supply facilities, sewage facilities, rainwater facilities, gas facilities, railway land, industrial land, warehousing land, and funeral facilities. The terrain condition feature uses a DEM (Digital Elevation Model) raster to calculate the average slope of the land at each potential site. Economic costs include two features: land price and consumption level. Land prices are obtained from the transfer prices of residential communities through internet channels, and the nominal land prices from different years are standardized using a real estate price index. Consumption level is based on the consumption level field of the population profile big data, with the average consumption level within the buffer zone of each potential site used as the evaluation indicator.
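The first step, screening inefficient plots into potential facility sites, might look like the sketch below. It assumes that a plot is dropped if it is too small or if its current use is a road, railway, or municipal utility. The dictionary keys, the land use labels, and the minimum area in the example are hypothetical.

```python
EXCLUDED_USES = {"road", "railway", "municipal_utility"}  # hypothetical labels


def filter_candidates(plots, min_area):
    """Keep inefficient-land plots that could host an elderly care facility."""
    return [
        plot for plot in plots
        if plot["area"] >= min_area and plot["use"] not in EXCLUDED_USES
    ]


# Hypothetical inefficient-land patches (area in square metres).
plots = [
    {"id": 1, "area": 5000, "use": "industrial"},
    {"id": 2, "area": 800, "use": "industrial"},   # too small
    {"id": 3, "area": 6000, "use": "road"},        # excluded land use
]
candidates = filter_candidates(plots, min_area=3000)
```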
[0096] Furthermore, based on the distribution map of the elderly care gap at the end of the planning period obtained in step S2, a demand state space is constructed for the elderly care demand gap of each residential community. The site selection feature space and the demand space are used to calculate the service coverage relationship through the known Euclidean distance to determine the service accessibility, that is, the Boolean value of the accessibility from the facility candidate point to the demand point, so as to construct a Boolean matrix. This Boolean matrix is an important part of the state space and helps the agent determine which facility candidate points can be used to match specific demand points in the current state.
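The Boolean accessibility matrix can be sketched with plain Euclidean distances. In this minimal example, the function name `reachability_matrix`, the coordinates, and the distance threshold are illustrative assumptions. Rows correspond to demand points and columns to facility candidate points.

```python
import math


def reachability_matrix(demand_xy, candidate_xy, threshold):
    """reach[i][j] is True when candidate j lies within threshold of demand i."""
    return [
        [math.dist(d, c) <= threshold for c in candidate_xy]
        for d in demand_xy
    ]


# Hypothetical planar coordinates in the same unit as the threshold.
demand_points = [(0.0, 0.0), (10.0, 0.0)]
candidate_points = [(3.0, 4.0), (10.0, 1.0)]
reach = reachability_matrix(demand_points, candidate_points, threshold=5.5)
```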
[0097] Regarding the action space, it is defined as a one-dimensional discrete vector whose dimension corresponds to the number of facility candidate points. Each action corresponds to selecting or not selecting a candidate plot. In this embodiment, 0 means not selecting the plot and 1 means selecting the plot.
[0098] Regarding the action mask, a Boolean matrix is generated from the Euclidean distance threshold between the demand space and the facility candidate points and is used to dynamically constrain invalid site selection actions. During the agent's site selection decision-making process, a demand space may only be matched to facility candidate points within its Euclidean distance range; candidate points outside this range are invalid selections. Combined with the state transition function, this establishes a causal chain of "agent decision-making - environmental feedback". After each site selection decision is completed, the elderly care demand value of the demand space is updated according to the elderly care demand covered within the facility's coverage area. The environment is thus updated iteratively through the facility layout state, providing a real-time interactive environment for subsequent decisions.
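The action mask and the state transition can be sketched as below. This is an illustration under stated assumptions: the patent does not specify how covered demand is deducted, so the sketch assumes each selected facility has a fixed `capacity` (a hypothetical parameter) that is consumed by reachable demand points in index order.

```python
def valid_actions(reach, demand_idx):
    """Indices of candidate points that pass the action mask for one demand point."""
    return [j for j, ok in enumerate(reach[demand_idx]) if ok]

def transition(gaps, reach, candidate_idx, capacity):
    """Return updated demand gaps after siting a facility at `candidate_idx`.
    Assumption: the facility absorbs up to `capacity` of the reachable gap."""
    new_gaps = list(gaps)
    remaining = capacity
    for i, gap in enumerate(gaps):
        if reach[i][candidate_idx] and remaining > 0:
            served = min(gap, remaining)
            new_gaps[i] = gap - served
            remaining -= served
    return new_gaps

# Hypothetical accessibility matrix (3 demand points x 2 candidate points) and gaps
reach = [[True, False], [True, True], [False, True]]
gaps = [30, 50, 20]

mask_for_demand_1 = valid_actions(reach, 1)
next_gaps = transition(gaps, reach, candidate_idx=0, capacity=60)
```

Other deduction rules (for example proportional allocation) would fit the same interface.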
[0099] Regarding the reward function, it is based on a multi-objective optimization mechanism, including the fairness criterion from the government's perspective and the efficiency criterion of market entities. The calculation formula is shown in Equation 2 below:
[0100] (2)
[0101] In the formula, the reward is the value obtained by executing the action in the current state at a given time step, where the action matches a demand space to a facility candidate point. The distance term is the Euclidean distance between the matched demand space and facility candidate point, calculated from the coordinates of the demand space and the coordinates of the facility candidate point in the site selection feature space. The gap term is the elderly care service gap of the demand space in the current state, as identified in step S2. The action constraint value is determined by the Euclidean distance between the demand space and the facility candidate point: it is 0 when the distance exceeds the threshold specified in the elderly care facility regulations and 1 otherwise, and it is used to constrain invalid site selection actions. The economic cost of the candidate point is a function of land price and consumption level: the higher the surrounding land price and consumption level, the greater the economic cost and the smaller the reward. The supporting facility incentive of the candidate point is a function of the distance to nearby medical and emergency facilities: the closer these facilities are, the greater the incentive. The construction suitability bonus of the candidate point is a function of land use conditions: the more complex the terrain, the smaller the bonus. The NIMBY penalty of the candidate point is a function of the distance to NIMBY facilities: the smaller the distance, the smaller the reward.
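Because the exact form of Equation 2 is not reproduced in this text, the following is only an illustrative sketch of how the listed terms could be combined. The additive form, the `gap / (1 + distance)` coverage term, and all names are assumptions; only the direction of each effect (cost and NIMBY penalty lower the reward, supporting facilities and suitability raise it, the constraint value zeroes invalid actions) is taken from the description.

```python
def site_reward(distance, gap, mask, cost, support, suitability, nimby_penalty):
    """Illustrative reward for matching one demand space to one candidate point.
    The combination below is an assumed form, not the patent's Equation 2."""
    if mask == 0:
        return 0.0  # invalid action: distance exceeds the regulatory threshold
    coverage = gap / (1.0 + distance)  # assumed: larger gap and shorter distance pay more
    return coverage - cost + support + suitability - nimby_penalty

# Hypothetical, pre-normalized indicator values
base = dict(distance=200.0, gap=50.0, mask=1, support=0.4, suitability=0.3, nimby_penalty=0.1)
cheap_site = site_reward(cost=0.2, **base)
costly_site = site_reward(cost=0.8, **base)
```

In practice each term would be normalized or weighted so that no single indicator dominates.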
[0102] Furthermore, the long-term cumulative reward calculation formula is shown in Equation 3 below:
[0103] $$G = \sum_{t=1}^{n} \gamma^{t-1} r_t$$ (3)
[0104] In the formula: $G$ is the long-term cumulative reward, that is, the sum of the rewards accumulated over multiple decision steps; $t$ is the decision step, ranging from 1 to $n$; $\gamma$ is the discount factor, with a value range of 0 to 1, where a value of 1 means that the reward does not decay over time (in this embodiment, $\gamma$ is set to 1); $r_t$ is the reward for the action taken at the $t$-th decision step.
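The long-term cumulative reward can be computed as in the sketch below, which assumes that the reward of the first decision step is not discounted. The function name is hypothetical.

```python
def cumulative_reward(rewards, gamma=1.0):
    """Discounted sum of per-step rewards; gamma=1.0 means no decay over time."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

undiscounted = cumulative_reward([1.0, 2.0, 3.0])          # gamma = 1, as in the embodiment
discounted = cumulative_reward([1.0, 2.0, 3.0], gamma=0.5)  # 1 + 0.5*2 + 0.25*3
```

With the discount factor set to 1, as in this embodiment, every decision step contributes equally to the total.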
[0105] Regarding model structure and training environment parameter configuration:
[0106] The model adopts the PPO deep reinforcement learning structure: a policy network outputs the action probability distribution, a value network evaluates the state value, and the Adam optimizer is used for the policy gradient update. The policy network formula is shown in Equation 4 below:
[0107] $$\pi_\theta(a \mid s) = \mathrm{Softmax}\big(f_\theta(s)\big)_a$$ (4)
[0108] In the formula: $\pi_\theta(a \mid s)$ is the probability that the policy network selects action $a$ in state $s$; $\theta$ denotes the network weight parameters; $f_\theta$ is the mapping function of the policy network, composed of a multi-layer neural network. Since the site selection action is discrete, Softmax is used to transform the neural network output into the probability of each action.
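A minimal sketch of the Softmax step is given below, combined with the action mask so that invalid candidate points receive zero probability. Applying the mask inside the Softmax is an assumption about how the two pieces fit together, and the logits stand in for the output of the policy network.

```python
import math

def masked_softmax(logits, mask):
    """Turn network outputs into action probabilities, giving masked actions zero probability."""
    valid = [l for l, m in zip(logits, mask) if m]
    if not valid:
        raise ValueError("at least one action must be valid")
    peak = max(valid)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate points; the second is outside the distance threshold
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

An action can then be sampled from `probs`, for example with `random.choices(range(len(probs)), weights=probs)`.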
[0109] The training objective of the policy network is to minimize the policy network loss function, which is based on the clipped advantage function in the PPO algorithm. The calculation formula is shown in Equation 5 below:
[0110] $$L^{\mathrm{policy}}(\theta) = -\mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,A_t,\ \mathrm{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big]$$ (5)
[0111] In the formula: $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the importance sampling ratio, that is, the ratio of the probability of action $a_t$ in state $s_t$ under the current policy network to that under the old policy network before the update; $A_t$ is the advantage function, representing how much better than average it is to take action $a_t$ in state $s_t$; $\mathrm{clip}$ is the clipping function, which restricts the range of variation of $\rho_t(\theta)$ to $[1-\epsilon,\,1+\epsilon]$, where $\epsilon$ is the clipping range. This limits the step size of policy updates and prevents them from being too aggressive.
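The clipped policy loss can be sketched as follows in the standard PPO form. The clipping range of 0.2 is a commonly used default and an assumption here, not a value given in this embodiment; log-probabilities are used to form the importance sampling ratio.

```python
import math

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Negative mean of the clipped surrogate objective (to be minimized)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)                    # importance sampling ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)     # pessimistic (clipped) objective
    return -total / len(advantages)

# Unchanged policy: ratio is 1, so the loss is minus the mean advantage
loss_same = ppo_clip_loss([0.0], [0.0], [2.0])
# Probability doubled with a positive advantage: the gain is capped at 1 + eps
loss_capped = ppo_clip_loss([math.log(2.0)], [0.0], [1.0])
```

The second call shows the effect of clipping: the ratio of about 2 is treated as 1.2, so a single update cannot claim more than a 20% improvement.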
[0112] Furthermore, the value network takes the state $s_t$ as input and outputs the expected value of the long-term cumulative reward. This value is used to calculate the advantage function in the policy network, and the advantage function provides the direction for policy network updates. The formula is shown in Equation 6 below:
[0113] $$A_t = G_t - V_\phi(s_t)$$ (6)
[0114] In the formula: $G_t$ is the actual long-term cumulative reward, calculated using the reward function of Equation 2; $V_\phi(s_t)$ is the expected value of the long-term cumulative reward output by the value network.
[0115] The objective of the value network is to minimize the value loss function, as shown in Equation 7 below:
[0116] $$L^{\mathrm{value}}(\phi) = \mathbb{E}_t\Big[\big(G_t - V_\phi(s_t)\big)^2\Big]$$ (7)
[0117] In the formula: $\phi$ denotes the parameters of the value network; $G_t$ is the actual long-term cumulative reward, calculated using the reward function; $V_\phi(s_t)$ is the expected value of the long-term cumulative reward.
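The advantage and the value loss can be illustrated with the short sketch below. It assumes the advantage is the actual cumulative reward minus the value network's estimate and that the value loss is the mean squared difference between the two; the function names and sample numbers are hypothetical.

```python
def advantages(returns, values):
    """Advantage per step: actual cumulative reward minus the estimated state value."""
    return [g - v for g, v in zip(returns, values)]

def value_loss(returns, values):
    """Mean squared error between actual cumulative rewards and estimated state values."""
    return sum((g - v) ** 2 for g, v in zip(returns, values)) / len(returns)

returns = [3.0, 1.0]    # actual long-term cumulative rewards
estimates = [2.0, 2.0]  # value network outputs for the same states

adv = advantages(returns, estimates)
loss = value_loss(returns, estimates)
```

A positive advantage means the action did better than the value network expected, so the policy update raises its probability; a negative one lowers it.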
[0118] Based on the agent interaction environment, policy network, and value network algorithms described above, the interaction environment, network structure, and training parameters are customized through a deep reinforcement learning library. Reasonable values are set for the learning rate, the number of training rounds and iterations, and the time-step threshold, and the core parameters of the algorithm network, such as the optimizer momentum coefficient and the policy gradient clipping range, are adjusted together to ensure the efficiency and stability of subsequent model training.
[0119] Furthermore, model training includes two phases:
[0120] The first stage, the interactive collection stage, freezes network parameters: the initial policy network samples action sequences, calculates environmental rewards in real time and records state-action-reward sequence tuples, and triggers a stage transition when the accumulated time step reaches the time step threshold.
[0121] The second stage, the network parameter update stage, is based on the collected trajectory data. First, the state value function is estimated using the value network, and the action advantage is quantified using generalized advantage estimation. Then, the objective function, which includes the policy loss and the value function loss, is calculated. Finally, the network parameters are updated through gradient backpropagation and gradient clipping.
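Generalized advantage estimation over one collected trajectory can be sketched as follows. The smoothing parameter `lam=0.95` is a common default and an assumption here; `last_value` is the value estimate after the final step, taken as 0 for a finished decision sequence.

```python
def gae(rewards, values, gamma=1.0, lam=0.95, last_value=0.0):
    """Generalized advantage estimation, computed backwards over a trajectory."""
    result = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # exponentially weighted sum
        result[t] = running
        next_value = values[t]
    return result

# With lam = 1 the estimate reduces to "reward-to-go minus value estimate"
adv = gae([1.0, 1.0], [0.5, 0.5], gamma=1.0, lam=1.0)
```

Lower values of `lam` rely more on the value network and reduce variance at the cost of some bias.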
[0122] The process is executed cyclically until the preset number of rounds is reached or the convergence condition is met. During this process, importance sampling and trust region optimization are used to ensure training stability. After training is complete, the model parameters are saved.
[0123] The model training process is shown in Figure 5:
[0124] a. After initializing the state space, action space, and reward function, the environment enters the interaction collection sequence phase.
[0125] b. Input the current state into the policy network, and output the sampled action to the reward function and the state transition function through the policy network, and output the reward and the next state respectively.
[0126] c. Record the current state, action, reward, and next state to form a trajectory data.
[0127] d. Determine whether to end sequence collection. If not, update the state to the policy network and repeat the process until the decision sequence ends. If yes, proceed to the network parameter update stage.
[0128] e. Using the collected trajectory data, input the states into the value network, which outputs the state values; the rewards are input at the same time.
[0129] f. Calculate the value loss function and the policy loss function by combining the output state values and the input rewards.
[0130] g. Update the network parameters using the value loss function and the policy loss function, and determine whether to end the network update. If not, re-enter the states and rewards from the trajectory data and loop until the preset round limit is reached or the convergence condition is met. If so, training ends, and the final policy network and value network parameters are saved.
S5. Load the trained model, simulate site selection decisions based on the state space of the target area at the end of the planning period, and output the spatial site selection scheme for elderly care facilities under preset constraints.
[0131] Based on the regional data used for site selection decision-making, a state space at the end of the planning period is constructed, including a demand space and a site selection feature space. The model network and the trained model parameters are then loaded, the state space is input into the site selection agent, and the site selection decision is simulated based on action weights. A dual real-world constraint mechanism is incorporated into the simulation process:
[0132] Facility size constraints limit decision-making boundaries by pre-setting physical carrying capacity thresholds for individual institutions (including maximum land area and maximum service population).
[0133] Economic feasibility constraints are imposed, and a total investment budget is set to simulate resource scarcity conditions.
[0134] The agent interacts with the environment through a multi-round Markov decision process within a constrained framework. After each decision step selects a new facility location, the model calculates in real time the change in demand coverage, the increase in cost consumption, and the degree of constraint violation, and updates the state space and reward feedback accordingly.
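The dual constraint mechanism can be illustrated with the simplified sketch below. It replaces the trained agent with fixed per-site `weight` values standing in for action weights and selects sites greedily, so it only shows how the facility size and budget constraints bound the decision; all field names and numbers are hypothetical.

```python
def simulate_selection(candidates, budget, max_area, max_population):
    """Select sites in descending action weight, subject to size and budget constraints."""
    chosen, spent, served = [], 0.0, 0
    for site in sorted(candidates, key=lambda s: s["weight"], reverse=True):
        if site["area"] > max_area:
            continue  # violates the per-facility land-area cap
        if spent + site["cost"] > budget:
            continue  # would exceed the total investment budget
        chosen.append(site["id"])
        spent += site["cost"]
        served += min(site["demand"], max_population)  # per-facility service-population cap
    return chosen, spent, served

candidates = [
    {"id": "A", "weight": 0.9, "area": 800, "cost": 50.0, "demand": 300},
    {"id": "B", "weight": 0.8, "area": 3000, "cost": 30.0, "demand": 200},
    {"id": "C", "weight": 0.6, "area": 900, "cost": 60.0, "demand": 150},
    {"id": "D", "weight": 0.5, "area": 500, "cost": 40.0, "demand": 500},
]
chosen, spent, served = simulate_selection(candidates, budget=100.0, max_area=2000, max_population=400)
```

In the full method the weights would be recomputed by the policy network after every state update rather than fixed in advance.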
[0135] Finally, through iterative optimization, the optimal solution set under multiple objectives is output, as shown in Figure 6. The simulation results of the site selection decision demonstrate that, while satisfying all rigid constraints, the proposed scheme achieves synergistic optimization of the government-led social benefit goals and the market-driven operational efficiency goals, providing data-driven decision support for the layout planning of elderly care facilities.
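Extracting a multi-objective optimal (non-dominated) solution set from simulated schemes can be sketched as follows. The two objectives used here, demand coverage to maximize and cost to minimize, and the sample schemes are assumptions chosen to mirror the social benefit and economic goals; the filtering rule is the standard Pareto dominance test.

```python
def pareto_front(solutions):
    """Keep schemes that no other scheme beats on coverage and cost simultaneously."""
    front = []
    for a in solutions:
        dominated = any(
            b["coverage"] >= a["coverage"] and b["cost"] <= a["cost"]
            and (b["coverage"] > a["coverage"] or b["cost"] < a["cost"])
            for b in solutions
        )
        if not dominated:
            front.append(a)
    return front

# Hypothetical simulated site selection schemes
schemes = [
    {"name": "S1", "coverage": 0.90, "cost": 120.0},
    {"name": "S2", "coverage": 0.80, "cost": 90.0},
    {"name": "S3", "coverage": 0.75, "cost": 100.0},  # dominated by S2
    {"name": "S4", "coverage": 0.60, "cost": 60.0},
]
front_names = [s["name"] for s in pareto_front(schemes)]
```

Each remaining scheme represents a different trade-off between coverage and cost, leaving the final choice to the planner.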
[0136] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0137] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent should be determined by the appended claims.
Claims
1. A method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning, characterized in that the optimization method includes the following steps:
S1. Collect multi-dimensional data, perform data processing and population forecasting, and form a plot-scale multi-dimensional database that includes plot-scale elderly population forecast data at the end of the planning period.
S2. Based on the plot-scale elderly population forecast data at the end of the planning period and the current data of elderly care facilities, analyze the gap in elderly care services and generate a map of the gap in elderly care facilities.
S3. Construct a target indicator system for the site selection of elderly care facilities, the indicator system including the current facilities, population distribution, transportation conditions, land use conditions, supporting facilities, and economic cost criteria.
S4. Construct a demand state space based on the map of the gap in elderly care facilities, and construct a site selection feature space based on the existing updatable spatial data in the map and the site selection target indicator system; combine deep reinforcement learning algorithms to construct an agent interaction environment that includes a state space, action space, action mask, reward function, and state transition function, and train the agent model for the site selection of elderly care facilities.
S5. Load the trained model, simulate the site selection decision based on the state space of the target area at the end of the planning period, and output the spatial site selection scheme for elderly care facilities under preset constraints.
2. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 1, characterized in that analyzing the gap in elderly care services in step S2 includes the following steps: Based on location data of elderly care facilities and residential communities, the service coverage of elderly care facilities is defined by walking distance. To determine the gap in elderly care services, if a residential community is located outside the service coverage of elderly care facilities, the number of elderly people in that residential community at the end of the planning period is counted as the service gap; if a residential community is located within the service coverage of elderly care facilities, the number of elderly people in each such residential community at the end of the planning period is aggregated for further calculation. The number of beds per capita is calculated from the number of beds in the elderly care facilities within reach and the number of elderly people at the end of the planning period. If the number of beds per capita is greater than or equal to the national standard, there is considered to be no gap in elderly care services; if it is less than the national standard, the gap in elderly care services within the reach of the facilities is calculated. By summarizing the gaps in elderly care services both within and outside the reachable area, a distribution map of the elderly care service gaps is obtained.
3. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 2, characterized in that the formula for calculating the gap in elderly care services within the reach of the aforementioned elderly care facilities is as follows: In the formula, the terms are: the gap in elderly care services of a given residential community; the number of elderly people in that residential community; the number of beds per capita in elderly care facilities specified by the national standard; and the number of beds per capita of the corresponding elderly care facility.
4. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 1, characterized in that the indicators at each criterion layer of the target indicator system for the site selection of elderly care facilities constructed in step S3 are as follows: the current facility criterion layer includes the number of current elderly care facilities; the population distribution criterion layer includes the elderly care demand gap at the end of the planning period; the traffic condition criterion layer includes road traffic conditions and traffic facility conditions; the supporting facilities criterion layer includes the number of medical facilities and the number of emergency medical facilities; the land use condition criterion layer includes topographic conditions and the number of NIMBY (Not In My Backyard) facilities; the economic cost criterion layer includes land prices and consumption levels.
5. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 1, characterized in that: In step S4, the action space is defined as a one-dimensional discrete vector, the dimension of which corresponds to the number of facility candidate points. Each action corresponds to selecting or not selecting the candidate plot. The action mask generates a Boolean matrix based on the Euclidean distance threshold between the demand space and the facility candidate points, which is used to dynamically constrain invalid site selection actions.
6. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 1, characterized in that the reward function described in step S4 is based on a multi-objective optimization mechanism, and its calculation formula is shown below: In the formula, the terms are: the reward for executing the action in the current state at a given time step, where the action matches a demand space to a facility candidate point; the Euclidean distance between the matched demand space and facility candidate point; the gap in elderly care services of the demand space in the current state; the action constraint value determined by the Euclidean distance between the demand space and the facility candidate point; the economic cost of the candidate point; the supporting facility incentive of the candidate point; the construction suitability incentive of the candidate point; and the NIMBY (Not In My Backyard) penalty of the candidate point. The formula for calculating the long-term cumulative reward is as follows: $$G = \sum_{t=1}^{n} \gamma^{t-1} r_t$$ In the formula: $G$ is the long-term cumulative reward; $t$ is the decision step; $\gamma$ is the discount factor; $r_t$ is the reward for the action taken at the $t$-th decision step.
7. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 1, characterized in that, in step S4, the model adopts the PPO deep reinforcement learning structure, in which a policy network outputs the action probability distribution and a value network evaluates the state value; the model training includes an interaction sequence collection stage and a network parameter update stage.
8. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 7, characterized in that the policy network formula is shown below: $$\pi_\theta(a \mid s) = \mathrm{Softmax}\big(f_\theta(s)\big)_a$$ In the formula: $\pi_\theta(a \mid s)$ is the probability that the policy network selects action $a$ in state $s$; $\theta$ denotes the network weight parameters; $f_\theta$ is the mapping function of the policy network, composed of a multi-layer neural network. The training objective of the policy network is to minimize the policy network loss function, which is based on the clipped advantage function in the PPO algorithm. The calculation formula is shown below: $$L^{\mathrm{policy}}(\theta) = -\mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,A_t,\ \mathrm{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big]$$ In the formula: $\rho_t(\theta)$ is the importance sampling ratio; $A_t$ is the advantage function; $\mathrm{clip}$ is the clipping function.
9. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 8, characterized in that the objective of the value network is to minimize the value loss function, as shown in the following formula: $$L^{\mathrm{value}}(\phi) = \mathbb{E}_t\Big[\big(G_t - V_\phi(s_t)\big)^2\Big]$$ In the formula: $\phi$ denotes the value network parameters; $G_t$ is the actual long-term cumulative reward; $V_\phi(s_t)$ is the expected value of the long-term cumulative reward.
10. The method for optimizing the site selection scheme of existing elderly care facilities based on deep reinforcement learning as described in claim 1, characterized in that: In step S5, the preset constraints include facility size constraints and economic feasibility constraints; the site selection decision simulation process is based on Markov decision process and environmental interaction.