A dynamic scheduling method for collection and distribution vehicles in a shared mode
By adopting a shared-mode dynamic scheduling method for collection and distribution vehicles, and utilizing iterative learning of Q-value tables and the Interp-Q algorithm to optimize vehicle task allocation, the congestion and carbon emission problems in port collection and distribution operations have been solved, improving efficiency and adaptability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- DALIAN UNIV OF TECH
- Filing Date
- 2023-03-13
- Publication Date
- 2026-06-30
AI Technical Summary
In port collection and distribution operations, container transport vehicles are prone to congestion and uneven resource allocation when they arrive at the terminal during peak periods, resulting in low efficiency and increased carbon emissions. Existing scheduling methods are not adaptable enough to dynamic environments.
A shared-mode dynamic scheduling method for collection and distribution vehicles is proposed. It utilizes the Interp-Q algorithm, which is updated and improved through iterative learning of the Q-value table, combined with the vehicle state influence coefficient and neighborhood interpolation method, to dynamically allocate vehicle tasks and optimize vehicle assignment and task allocation.
It improves the sharing and utilization rate of vehicles and the efficiency of box delivery and retrieval, reduces fleet operating costs and carbon emissions, and enhances the dynamic adaptability and scheduling accuracy of vehicle dispatching.
Figure CN116562534B_ABST
Description
Technical Field
[0001] This invention relates to the technical field of port collection and distribution operations, and specifically to a dynamic scheduling method for collection and distribution vehicle operations in a shared mode.
Background Technology
[0002] In port collection and distribution operations, the arrival of a large number of container trucks at the terminal during peak periods can easily lead to congestion at the terminal gates and on the collection and distribution roads, as well as uneven distribution of resources within the terminal. This reduces collection and distribution efficiency, increases carbon emissions, and raises the operating costs of the collection and distribution fleet. Therefore, how to dynamically allocate transport tasks to collection and distribution vehicles so as to improve operational efficiency and reduce fleet costs has become an urgent problem to be solved.
[0003] Vehicle dispatching refers to the scheduling of container transport vehicles between freight stations and container terminals, where these vehicles act as horizontal transport equipment for containers, performing port entry and exit tasks. The goal is to minimize the total operating cost of the fleet. Currently, there are two main methods:
[0004] (1) Rule-based manual dispatching. In this method, the fleet's transport tasks fall into two main categories: delivering export containers from the port yard to the terminal's export container area, and picking up import containers from the terminal's import container area and carrying them to the port yard. After completing a transport task, the vehicle requests a new task instruction from the fleet dispatch center. The dispatcher issues a new instruction according to specific task assignment rules, and the vehicle executes it; this repeats until all tasks are completed. The disadvantage is that, when assigning tasks, it is difficult for the dispatcher to weigh information such as the operating status of each terminal gate, the road conditions between each freight station and each terminal, and the operating status of the equipment in the terminal yard. A task assigned according to a single rule is not optimal under all working conditions, which limits the method. In addition, because the dispatcher issues instructions frequently, operating errors or incorrect instructions can easily occur.
[0005] (2) Scheduling methods based on heuristic algorithms. This type of method establishes a scheduling model for vehicle container delivery and pickup and solves the model with a heuristic algorithm. For example, the literature "Fan Houming, Guo Zhenfeng, Li Yang. Multi-terminal scheduling problem of container delivery trucks considering carbon emissions and reservation mechanism [J]. Journal of Tongji University (Natural Science Edition), 2018, 46(9):1242-1246" constructs a single-yard, multi-terminal container delivery vehicle scheduling model with the goals of minimizing the number of container delivery vehicles called by a single external yard and minimizing the total carbon emissions of the multi-terminal scheduling scheme. An improved ant colony algorithm is designed to solve the model and obtain the external yard's multi-terminal container delivery vehicle scheme by time period. The paper "Zhang He, Xing Jianghao, Yan Jianxin, et al. Dual-objective optimization model for container delivery and pickup by off-port trucks based on reservation strategy [J]. Journal of Central China Normal University (Natural Science Edition), 2020, 54(3):486-489" aims to minimize a dual-objective cost value corresponding to the total operation time of all collection and distribution vehicles and the number of vehicles called. It establishes a reservation optimization model for container delivery and pickup and solves the model with a purpose-designed improved genetic algorithm. Scheduling methods based on heuristic algorithms mostly produce a global optimization result, that is, an optimized operation sequence for static vehicle scheduling. However, the vehicle scheduling process is highly dynamic. When dynamic factors such as delays in port entry and exit operations or traffic congestion change, the previously optimized operation sequence no longer applies, the actual scheduling accuracy drops, and rescheduling may even be required.
Summary of the Invention
[0006] The purpose of this invention is to provide a dynamic scheduling method for collection and distribution vehicles in a shared mode, which improves the sharing utilization rate of vehicles and the efficiency of box delivery and retrieval, reduces fleet operation costs and carbon emissions, realizes dynamic scheduling of vehicles for box delivery and retrieval in a dynamic operation environment, and improves the adaptability of vehicle scheduling.
[0007] To achieve the above objectives, the technical solution of this application is: a dynamic scheduling method for collection and distribution vehicle operations in a shared mode, comprising: Step 1, iteratively learning and updating a Q-value table using collection and distribution tasks of different scales to obtain a learned Q-value table; Step 2, using the learned Q-value table to simulate vehicle assignment and thus determine the actual vehicle assignment scheme (i.e., the number of owned and hired vehicles to be dispatched), and then dynamically allocating the collection and distribution tasks to be performed by the vehicles.
[0008] Furthermore, step 1 is implemented as follows: let $Q(s_t, a_t)$ denote the cumulative reward value of the state-action pair $(s_t, a_t)$ for container truck dispatching operations outside the port; the Q-value table is the table consisting of all possible boundary state-action pairs $(s_t, a_t)$ and their corresponding Q-values. During the learning phase, the Q-value table obtains the learning state and immediate reward information through feedback from the vehicle's actions, and is then continuously learned and updated, gradually approaching a stable optimal value. The process is shown in Figure 4, and a multi-round incremental learning approach can be used:
[0009] Step 1: Initialize the Q-value table to 0; initialize Interp-Q algorithm parameters: α0, γ, ε0, etc.
[0010] Step 2: Initialize environmental parameters: information such as vehicles, task sequences, docks, and offshore storage yards;
[0011] Step 3: Idle vehicles select an action based on their current status to perform the corresponding container transport task;
[0012] Step 4: Obtain the updated Q value based on the immediate reward and the indirect Q value update method;
[0013] Step 5: Determine if all tasks in the task sequence have been completed. If yes, proceed to Step 6; otherwise, proceed to Step 3.
[0014] Step 6: Determine whether the termination criterion is met, i.e., whether the maximum number of learning attempts has been reached. If yes, proceed to Step 7; otherwise, proceed to Step 2.
[0015] Step 7: The learning phase ends, and the Q-value table is output.
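The seven learning steps above can be sketched as a nested loop. This is a minimal illustration, not the patented implementation: `ToyEnv` is a hypothetical stand-in for the port environment, ε is held constant for brevity, and the direct Q-learning update is used in place of the indirect (interpolated) update described later.

```python
import random
from collections import defaultdict

class ToyEnv:
    """Hypothetical stand-in for the port environment: a queue of n_tasks tasks."""
    def __init__(self, n_tasks):
        self.n_tasks = n_tasks

    def reset(self):
        self.remaining = self.n_tasks
        return self.remaining                  # state: number of remaining tasks

    def step(self, action):
        # Action 1 plays the role of a "good" rule (reward 1), action 0 a "poor" one.
        self.remaining -= 1
        return self.remaining, float(action), self.remaining == 0

def learn(task_scales, rounds_per_scale, n_actions=2,
          alpha=0.35, gamma=0.85, eps=0.4, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)                     # Step 1: Q-value table starts at 0
    for n_tasks in task_scales:                # incremental learning: q carries over
        env = ToyEnv(n_tasks)
        for _ in range(rounds_per_scale):      # Step 6: fixed number of learning rounds
            s = env.reset()                    # Step 2: initialise the environment
            done = False
            while not done:                    # Step 5: until all tasks are completed
                if rng.random() < eps:         # Step 3: explore ...
                    a = rng.randrange(n_actions)
                else:                          # ... or exploit the current table
                    a = max(range(n_actions), key=lambda x: q[(s, x)])
                s2, r, done = env.step(a)
                best_next = 0.0 if done else max(q[(s2, x)] for x in range(n_actions))
                q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # Step 4
                s = s2
    return q                                   # Step 7: output the learned table
```

Passing several task scales, for example `learn([3, 5], 200)`, reuses the table learned on the smaller instance as the starting table for the larger one.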
[0016] Furthermore, the specific implementation steps of step 2 are as follows:
[0017] Step 1: Simulate vehicle assignment and determine the actual vehicle assignment plan based on the simulation results. The process is shown in Figure 5;
[0018] Step 1-1: Load the Q-value table output during the learning phase;
[0019] Step 1-2: Initialize environment parameters and select the vehicle assignment scheme to be tried, i.e. the number of owned and hired vehicles to be dispatched.
[0020] Step 1-3: Assign the best task to the vehicle based on the learned Q-value table;
[0021] Step 1-4: The vehicle completes the task and the vehicle operation cost is recorded; determine whether all tasks have been completed. If yes, proceed to Step 1-5; otherwise, proceed to Step 1-3.
[0022] Step 1-5: Output the total cost of the vehicle assignment scheme; determine whether all vehicle assignment schemes to be tried have been explored. If so, select the vehicle assignment scheme with the lowest total cost and end; otherwise, go to Step 1-2.
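Steps 1-2 to 1-5 amount to an exhaustive search over candidate vehicle assignment schemes. The sketch below assumes a caller-supplied `simulate_cost` function that runs the Q-table-driven simulation and returns a total cost; `toy_cost` is a made-up cost function used only to make the example runnable.

```python
def select_assignment(candidates, simulate_cost):
    """Try every (owned, hired) scheme and keep the one with the lowest total cost."""
    best, best_cost = None, float("inf")
    for owned, hired in candidates:            # Step 1-2: pick a scheme to try
        cost = simulate_cost(owned, hired)     # Steps 1-3 and 1-4: simulate and cost it
        if cost < best_cost:                   # Step 1-5: track the cheapest scheme
            best, best_cost = (owned, hired), cost
    return best, best_cost

def toy_cost(owned, hired):
    # Hypothetical: per-vehicle cost plus a delay penalty that shrinks with fleet size.
    return 100 * owned + 160 * hired + 90000 / (owned + hired)

# Own vehicles first (up to 30), then hired vehicles on top of the full own fleet.
candidates = [(n, 0) for n in range(20, 31)] + [(30, h) for h in range(1, 6)]
best, best_cost = select_assignment(candidates, toy_cost)
```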
[0023] Step 2: Dynamic task allocation. The process is shown in Figure 6;
[0024] Step 2-1: The scheduling center loads the Q-value table output during the learning phase;
[0025] Step 2-2: Initialize environmental parameters and start port collection and distribution operations using the vehicles selected in Step 1;
[0026] Step 2-3: The vehicle sends a task request to the dispatch center; the dispatch center assigns the best task to the vehicle based on its status.
[0027] Step 2-4: The vehicle completes the task and sends detailed work information (work time, fuel consumption, etc.) to the dispatch center; the dispatch center evaluates the task to obtain the immediate reward, and adaptively updates the Q value using the indirect Q value update method;
[0028] Step 2-5: Determine if all tasks have been completed. If yes, output the total cost and end the scheduling; otherwise, go to Step 2-3.
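The request-and-assign cycle of Steps 2-3 to 2-5 can be pictured as an event loop in which the next vehicle to become free asks for a task. In this sketch, tasks are reduced to durations in minutes, `choose` is a hypothetical callback standing in for the Q-table lookup, and `report` is a hypothetical hook where the reward evaluation and Q-value update of Step 2-4 would go.

```python
import heapq

def dispatch(n_vehicles, tasks, choose, report=None):
    """tasks: list of durations (min); choose(pending, now) -> index of task to assign."""
    pending = list(tasks)
    idle = [(0.0, v) for v in range(n_vehicles)]   # (time the vehicle is free, vehicle id)
    heapq.heapify(idle)
    log = []
    while pending:                                  # Step 2-5: until all tasks are done
        now, v = heapq.heappop(idle)                # Step 2-3: earliest free vehicle asks
        dur = pending.pop(choose(pending, now))     # dispatch center picks a task
        log.append((v, now, dur))
        if report is not None:                      # Step 2-4: feedback to the center
            report(v, now, dur)
        heapq.heappush(idle, (now + dur, v))
    makespan = max(t for t, _ in idle)
    return log, makespan

# First-come-first-served stand-in policy: always take the first pending task.
log, makespan = dispatch(2, [30, 10, 20], choose=lambda pending, now: 0)
```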
[0029] Furthermore, each state s in the vehicle scheduling state set S is described by six dimensions based on factors such as task volume, time, and location, expressed as:
[0030] s=(p1,p2,p3,p4,p5,p6) (1)
[0031] In the formula: p1 is the difference between the remaining port arrival and port departure tasks in the next N hours; p2 is the sum of the remaining port arrival and port departure tasks in the next N hours; N can be determined according to the port arrival and departure task cycle, and it is recommended to take 0.05 to 0.1 times the port arrival and departure task cycle time. p3 represents the remaining task quantity at the vehicle's current location and time (the task quantity within the scheduled time period, starting from the vehicle's current location). For example, if the vehicle's current location is Terminal 1, and Terminal 1 has 6 remaining port clearance tasks with task quantities of 2, 10, 18, 6, 15, and 17 respectively, and the current time is within the scheduled time period of the first 4 tasks, then the value of p3 is 2 + 10 + 18 + 6. p4 represents the proportion of the maximum remaining task quantity among the remaining tasks at the current time to the total remaining tasks. For example, if there are 5 unfinished tasks at the current time with task quantities of 10, 15, 12, 16, and 20 respectively, then the value of p4 is 20 / (10 + 15 + 12 + 16 + 20). p5 represents the number of other operating vehicles with the current location as their destination at the current time. For example, if the vehicle is currently at Terminal 1, and 5 other operating vehicles are heading towards Terminal 1, then the value of p5 is 5. p6 represents the task quantity that has timed out at the current time. These 6 dimensions are discrete values and independent of each other.
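The two worked examples for p3 and p4 can be checked with a few lines of code. The function names and the boolean `in_window` flags (whether the current time lies within a task's scheduled period) are illustrative choices, not part of the source.

```python
def p3(task_quantities, in_window):
    """Remaining quantity of tasks at the vehicle's location whose slot contains 'now'."""
    return sum(q for q, active in zip(task_quantities, in_window) if active)

def p4(remaining_quantities):
    """Share of the largest remaining task in the total remaining quantity."""
    return max(remaining_quantities) / sum(remaining_quantities)

# Terminal 1 example: six remaining tasks, the first four are within their slot.
p3_value = p3([2, 10, 18, 6, 15, 17], [True, True, True, True, False, False])

# Five unfinished tasks: the largest (20) over the total (73).
p4_value = p4([10, 15, 12, 16, 20])
```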
[0032] Furthermore, the vehicle task selection strategies include: 1) selecting the task with the most urgent reservation time, where urgency is measured by the interval between the reservation start time and the current time (the shorter the interval, the higher the urgency); 2) selecting the task whose starting point is closest to the vehicle's location; 3) selecting the task with the shortest travel time from the starting point to the destination; 4) selecting the task with the largest remaining container quantity. Combining these four task selection strategies and removing unreasonable combinations, the task selection rules (actions) for vehicle dispatching are designed as follows:
[0033] 1) Among the tasks with the most urgent reservation times, select the task whose starting point is closest to the vehicle's location;
[0034] 2) Among the tasks with the most urgent reservation times, select the task with the shortest travel time from the starting point to the destination;
[0035] 3) Among the tasks with the most urgent reservation times, select the task with the largest remaining container quantity;
[0036] 4) Among the tasks with the shortest travel time from the starting point to the destination, select the task whose starting point is closest to the vehicle's location;
[0037] 5) Among the tasks whose starting point is closest to the vehicle's location, select the task with the shortest travel time from the starting point to the destination;
[0038] 6) Among the tasks whose starting point is closest to the vehicle's location, select the task with the largest remaining container quantity.
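One way to read the six actions is as (primary, secondary) pairs of the four basic strategies: the primary criterion shortlists tasks and the secondary criterion picks one. The sketch below follows that reading; the `Task` fields and the shortlist size `k` are assumptions made for illustration, since the source does not say how the shortlist is formed.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    slack_min: float     # minutes until the reservation starts (smaller = more urgent)
    dist_km: float       # distance from the vehicle to the task's starting point
    travel_min: float    # travel time from starting point to destination
    boxes_left: int      # remaining container quantity

# Each criterion is a sort key where a smaller value means "better".
CRITERIA = {
    "urgent": lambda t: t.slack_min,
    "near": lambda t: t.dist_km,
    "short": lambda t: t.travel_min,
    "most_boxes": lambda t: -t.boxes_left,
}

# Actions 1) to 6) from the text as (primary, secondary) criteria.
ACTIONS = [
    ("urgent", "near"), ("urgent", "short"), ("urgent", "most_boxes"),
    ("short", "near"), ("near", "short"), ("near", "most_boxes"),
]

def select(tasks, action, k=2):
    """Shortlist the k best tasks by the primary criterion, then apply the secondary."""
    primary, secondary = ACTIONS[action]
    shortlist = sorted(tasks, key=CRITERIA[primary])[:k]
    return min(shortlist, key=CRITERIA[secondary])

tasks = [
    Task("A", slack_min=10, dist_km=5, travel_min=30, boxes_left=4),
    Task("B", slack_min=20, dist_km=2, travel_min=40, boxes_left=9),
    Task("C", slack_min=60, dist_km=1, travel_min=15, boxes_left=2),
]
chosen = [select(tasks, i).name for i in range(6)]
```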
[0039] Furthermore, the impact of a vehicle completing a transportation task is mainly reflected in: 1) altering the overall characteristics of the remaining tasks; and 2) the success or failure of this task affecting the task decisions of subsequent vehicles. Therefore, the immediate reward needs to evaluate both the impact of this task on the entire task sequence and the completion status of this task. The immediate reward consists of two parts, the task reward and the time reward, and its formula is as follows:
[0040] $r = r_d + r_t$ (2)
[0041] $r_d = \lambda_1 r_1 + \lambda_2 r_2$ (3)
[0042] $r_t = -(\omega_1 T_e + \omega_2 T_u + \omega_3 T_c)$ (4)
[0043] Where $r$ is the immediate reward, $r_d$ is the task reward, and $r_t$ is the time reward; $r_1$ is the task urgency reward, which is positive if the task reduces urgency and negative otherwise, with urgency measured by the time remaining until the task's scheduled start time at the port; $r_2$ is the task balance reward, which is positive if the task reduces the absolute value of the difference between the volumes of port arrival and port departure tasks, and negative otherwise; $\lambda_1$ and $\lambda_2$ are the task reward component coefficients; $T_e$, $T_u$, and $T_c$ are the vehicle's idle time, idling time, and deviation from the scheduled time period during the task, all in minutes; $\omega_1$, $\omega_2$, and $\omega_3$ are the time reward component coefficients, all in min$^{-1}$. The emphasis placed on each reward component can be adjusted through the value of its coefficient.
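As a quick numerical check of formulas (2) to (4), read here as r = r_d + r_t with r_t = -(ω1·T_e + ω2·T_u + ω3·T_c), the reward can be computed as follows. The coefficient values are placeholders chosen for the example; the source does not give them.

```python
def immediate_reward(r1, r2, t_e, t_u, t_c, lam=(1.0, 1.0), omega=(0.1, 0.1, 0.1)):
    """Immediate reward r = r_d + r_t; times in minutes, omega in 1/min."""
    r_d = lam[0] * r1 + lam[1] * r2                               # task reward, Eq. (3)
    r_t = -(omega[0] * t_e + omega[1] * t_u + omega[2] * t_c)     # time reward, Eq. (4)
    return r_d + r_t                                              # Eq. (2)

# Task reduced urgency (+1) but worsened balance (-1); 10 min T_e, 5 min T_u, on time.
r = immediate_reward(1, -1, t_e=10, t_u=5, t_c=0, lam=(2.0, 1.0), omega=(0.5, 0.2, 1.0))
```

With these placeholder coefficients the task reward is 2 - 1 = 1 and the time penalty is 5 + 1 + 0 = 6, so r is -5.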
[0044] Furthermore, traditional Q-learning action exploration strategies generally use an ε-greedy exploration strategy. During the learning process, the vehicle randomly selects actions with probability ε (ε∈[0,1]) and chooses the optimal action (the action that maximizes the Q-value) with probability (1-ε). However, as the learning process progresses, the higher probability of random action selection in the later stages is detrimental to convergence. To address this issue, this invention employs a method where ε and α decrease continuously with the number of learning iterations to ensure convergence. In the early stages of learning, the vehicle tends to randomly select actions to fully explore the state space, and the acceptance rate of new attempts is relatively high, allowing for rapid iteration and updating of the Q-value. As the number of learning iterations increases, the probability of the vehicle selecting the optimal action continuously increases, and the proportion of fully learned Q-values retained becomes larger, ensuring both the convergence effect and convergence speed of the algorithm. The improved ε-greedy exploration strategy is as follows:
[0045]
[0046]
[0047]
[0048] Where τ is the number of learning iterations; ε0 and α0 are the initial values of the probability ε of randomly selecting an action and the learning rate α, respectively; ζ1 and ζ2 are decay coefficients, which are determined based on the total number of learning iterations.
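The exact decay formulas are not recoverable from this text, so the sketch below assumes an exponential decay, ε(τ) = ε0·exp(-ζ1·τ) and α(τ) = α0·exp(-ζ2·τ), purely to illustrate the idea of exploration and learning rate shrinking with the number of learning iterations. The function names and the decay form are assumptions, not the patent's formulas.

```python
import math
import random

def decayed(v0, zeta, tau):
    """Assumed exponential decay of an initial value v0 after tau learning iterations."""
    return v0 * math.exp(-zeta * tau)

def choose_action(q_row, eps, rng):
    """Epsilon-greedy: random action with probability eps, otherwise the best action."""
    if rng.random() < eps:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)

rng = random.Random(0)
eps0, alpha0 = 0.4, 0.35                      # initial values used in Example 1
eps_early = decayed(eps0, 0.01, tau=10)       # still fairly exploratory
eps_late = decayed(eps0, 0.01, tau=500)       # mostly greedy
greedy_pick = choose_action([0.1, 0.9, 0.3], eps=0.0, rng=rng)
```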
[0049] Furthermore, based on the continuity and state independence characteristics of the problem, the following two assumptions are made: I) The Q-values of adjacent states are more strongly correlated, and the influence coefficient of a state on the Q-values of its neighborhood is characterized by a function Γ; the corresponding curve is called the state influence coefficient curve. II) Different state dimensions are independent of each other, that is, the multidimensional state influence coefficient can be represented by the product of the per-dimension state influence coefficients. Based on these assumptions, a neighborhood interpolation formula is obtained; let $s_t$ and $Q(s_t, a_t)$ denote the N-dimensional state and the Q-value of the corresponding state-action pair, respectively.
[0050] 1) When N = 1 (see Figure 2): the two neighborhood level values of the state dimension define the two boundary states of $s_t$. Based on Assumption I, the function Γ gives the influence coefficients of the two boundary states on $s_t$; since the influence of a boundary state on $s_t$ is negatively correlated with its distance from $s_t$, a bell curve is used as the state influence coefficient curve, giving the influence coefficients of the two boundary states on $s_t$:
[0051]
[0052]
[0053] The two influence coefficients above are normalized:
[0054]
[0055]
[0056] Then $Q(s_t, a_t)$ is obtained by:
[0057]
[0058] where the Q-values of the two boundary states are the corresponding values in the Q-value table;
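The one-dimensional case can be sketched as a weighted average of the two neighboring table entries. The specific bell curve is not recoverable from this text, so a Gaussian, Γ(d) = exp(-d²/(2σ²)), is assumed here; `sigma` and the function names are illustrative, and the state value is assumed to lie within the outermost levels.

```python
import bisect
import math

def bell(d, sigma):
    """Assumed Gaussian bell curve for the state influence coefficient."""
    return math.exp(-(d * d) / (2 * sigma * sigma))

def interp_1d(x, levels, q_values, sigma=5.0):
    """Interpolate Q at state value x from the Q-values stored at sorted level values."""
    hi = bisect.bisect_left(levels, x)        # first level >= x
    if levels[hi] == x:                       # x sits exactly on a stored level
        return q_values[hi]
    lo = hi - 1
    w_lo = bell(x - levels[lo], sigma)        # influence of the lower boundary state
    w_hi = bell(levels[hi] - x, sigma)        # influence of the upper boundary state
    total = w_lo + w_hi                       # normalisation
    return (w_lo * q_values[lo] + w_hi * q_values[hi]) / total

mid = interp_1d(5.0, [0.0, 10.0], [2.0, 4.0])     # equidistant from both levels
near_low = interp_1d(2.0, [0.0, 10.0], [2.0, 4.0])
```

A state halfway between two levels receives the plain average of their Q-values, while a state closer to one level is pulled toward that level's Q-value.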
[0059] 2) When N = 2 (see Figure 3): each of the two dimensions of $s_t$ has two neighborhood level values, and the boundary states of $s_t$ are the four combinations of these level values.
[0060] The function Γ gives, for each dimension, the influence coefficients of its two neighborhood level values on the corresponding component of $s_t$. According to Assumption II, the influence coefficient of each of the four boundary states on $s_t$ is the product of its two per-dimension coefficients. Normalizing these four influence coefficients yields the standard influence coefficients δ1, δ2, δ3, and δ4. Then $Q(s_t, a_t)$ is obtained by:
[0061]
[0062] 3) When N > 2: the one-dimensional and two-dimensional interpolation methods extend to the multi-dimensional case. The influence coefficients for each dimension are obtained from the neighborhood levels and the function Γ. $s_t$ has a total of $2^N$ boundary states (where N is the number of dimensions), each with its own influence coefficient. After normalizing the influence coefficients using equation (14), the N-dimensional interpolation of $Q(s_t, a_t)$ is obtained as shown in equation (15):
[0063]
[0064]
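Under Assumption II, the N-dimensional case can be sketched by multiplying per-dimension coefficients over all combinations of neighboring levels and normalizing. As before, the Gaussian bell curve, `sigma`, and the table layout (keys of the form `(level_indices, action)`) are assumptions for illustration.

```python
import bisect
import itertools
import math

def bell(d, sigma):
    """Assumed Gaussian bell curve for the state influence coefficient."""
    return math.exp(-(d * d) / (2 * sigma * sigma))

def neighbours(x, levels, sigma):
    """Return [(level_index, coefficient)] for one state dimension."""
    hi = bisect.bisect_left(levels, x)
    if levels[hi] == x:                       # exactly on a level: a single neighbour
        return [(hi, 1.0)]
    lo = hi - 1
    return [(lo, bell(x - levels[lo], sigma)), (hi, bell(levels[hi] - x, sigma))]

def interp_nd(state, levels, q_table, action, sigma=5.0):
    """Interpolate Q(state, action) from up to 2**N boundary-state entries."""
    per_dim = [neighbours(x, lv, sigma) for x, lv in zip(state, levels)]
    num = den = 0.0
    for combo in itertools.product(*per_dim):     # every boundary state
        idx = tuple(i for i, _ in combo)
        w = math.prod(c for _, c in combo)        # product of per-dimension coefficients
        num += w * q_table[(idx, action)]
        den += w                                  # normalisation term
    return num / den

levels = [[0.0, 10.0], [0.0, 10.0]]
q_table = {((0, 0), 0): 0.0, ((0, 1), 0): 2.0, ((1, 0), 0): 4.0, ((1, 1), 0): 6.0}
centre = interp_nd((5.0, 5.0), levels, q_table, action=0)
corner = interp_nd((0.0, 10.0), levels, q_table, action=0)
```

Only the level combinations need to be stored, which is how the table stays small while any state in the covered range can still be evaluated.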
[0065] The Q-value update formula for traditional Q-learning is:
[0066] $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$ (16)
[0067] When the Q-value table of the Interp-Q algorithm contains the entry $Q(s_t, a_t)$, it is updated directly; otherwise, learning proceeds by updating the Q-values of the boundary (neighborhood) states of $s_t$. Let $\rho_t = \alpha\left[r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$; then the indirect update formula for the Q-value of a boundary (neighborhood) state is:
[0068]
[0069] In $\rho_t$, the values of $\max_a Q(s_{t+1}, a)$ and $Q(s_t, a_t)$ are obtained through the interpolation strategy; $r(s_t, a_t)$ is the immediate reward obtained after the vehicle in state $s_t$ completes action $a_t$, calculated according to formula (2); α is the learning factor, α∈(0,1]: the larger α is, the smaller the proportion of the old Q-value retained and the larger the proportion of the new attempt accepted; γ is the discount factor, γ∈(0,1]: the closer γ is to 0, the more the vehicle favors the immediate reward, and the closer γ is to 1, the more the vehicle considers the long-term reward.
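The indirect update formula itself is missing from this text. One plausible reading, assumed here and not confirmed by the source, is that each boundary state receives a share of the increment ρ_t proportional to its normalized influence coefficient δ_i, that is, Q(s_i, a_t) ← Q(s_i, a_t) + δ_i·ρ_t. The sketch below illustrates that reading only.

```python
def td_increment(alpha, r, gamma, max_next_q, q_now):
    """rho_t = alpha * [r + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]."""
    return alpha * (r + gamma * max_next_q - q_now)

def indirect_update(q_table, boundary, action, rho):
    """Assumed rule: spread rho over boundary states in proportion to delta.

    boundary: list of (state_index, delta) pairs whose deltas sum to 1.
    """
    for idx, delta in boundary:
        key = (idx, action)
        q_table[key] = q_table.get(key, 0.0) + delta * rho

q_table = {}
rho = td_increment(alpha=0.5, r=1.0, gamma=1.0, max_next_q=2.0, q_now=1.0)
indirect_update(q_table, [((0,), 0.25), ((1,), 0.75)], action=0, rho=rho)
```

In a full implementation, `max_next_q` and `q_now` would themselves come from the neighborhood interpolation rather than from direct table lookups.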
[0070] This invention, by employing the above technical solutions, achieves the following technical effects: It solves the problem of coordinated scheduling of container terminal transport vehicles for delivery and retrieval, improving vehicle sharing and utilization rates and delivery / retrieval efficiency, while reducing fleet operating costs and carbon emissions. It enables dynamic scheduling of vehicle delivery and retrieval in dynamic operating environments such as terminal or yard delays and traffic congestion, improving the dynamic adaptability of vehicle scheduling. By using state influence coefficients and neighborhood interpolation, the size of the Q-value table is effectively compressed without reducing the state space size, ensuring the integrity of the state space. Left and right boundaries are set for the level division of each state dimension, ensuring that the level division covers the value range of the state dimension. Sufficient learning of the level division guarantees sufficient learning of the state space, and the multi-round incremental learning method reduces learning time and improves the generalization of the Q-value table.
Attached Figure Description
[0071] Figure 1 is a schematic diagram of the reservation process for port collection and distribution operations;
[0072] Figure 2 is a schematic diagram of one-dimensional interpolation;
[0073] Figure 3 is a schematic diagram of two-dimensional interpolation;
[0074] Figure 4 is a flowchart of the learning phase;
[0075] Figure 5 is a flowchart of simulated vehicle assignment;
[0076] Figure 6 is a flowchart of dynamic task allocation.
Detailed Implementation
[0077] The embodiments of the present invention are implemented under the premise of the technical solution of the present invention, and detailed implementation methods and specific operation processes are given. However, the protection scope of the present invention is not limited to the following embodiments.
[0078] Sharing resources and tasks among transport fleets helps to integrate transport capacity and coordinate operations, thereby improving the overall efficiency of transport. The main sharing includes: (1) vehicle sharing among multiple terminals and off-port yards. Vehicles can undertake transport tasks at any terminal or off-port yard within the scope of port operations; (2) vehicle sharing between port entry and exit tasks. There are no restrictions on the type of transport tasks performed by the vehicles; (3) sharing port entry and exit tasks between outsourced vehicles and owned vehicles. When the fleet's own vehicle capacity is insufficient, it can outsource vehicles to jointly perform port entry and exit tasks.
[0079] The basic reservation process for port collection and distribution operations is as follows: The terminal divides the arrival and departure times into several reservation slots and sets a maximum acceptable reservation quota for each slot; the collection and distribution fleet selects the reservation slot and container volume based on the reservation information published by the terminal and its own capacity; the terminal confirms the reservation. Vehicles must arrive at the port according to the reservation. This process is shown in Figure 1.
[0080] The operation of collection and distribution vehicles is affected by a variety of dynamic factors, such as real-time road conditions affecting vehicle travel time, dock and off-port storage yard operations affecting vehicle loading and unloading time, and dock gate operations affecting vehicle gate opening time. As vehicle operations change dynamically, subsequent task assignments need to be dynamically optimized. The actual scheduling of collection and distribution vehicles is a real-time task assignment process.
[0081] Considering the dynamic factors of port entry and exit reservations and operations, this invention decomposes the port entry and exit vehicle scheduling problem in a shared mode into two sub-problems: vehicle assignment and dynamic task allocation. Its main features are: (1) Port entry and exit vehicles are shared among multiple terminals, multiple yards, and multiple tasks. Vehicles are not limited to fixed routes and can continuously undertake port entry and exit tasks; (2) Port entry and exit vehicles are shared, and the vehicles that can be assigned include vehicles owned by the fleet and vehicles hired from outside. When the fleet's own vehicles cannot meet the task requirements, vehicles from other fleets can be hired to assist in the execution of the task; (3) Dynamic scheduling (allocation) of vehicle tasks is adopted. After a vehicle completes a task, a new task is assigned in real time based on the dynamic information of the vehicle and the task.
[0082] This invention designs an improved Interp-Q algorithm based on the Q-learning algorithm and proposes a dynamic scheduling method for shared-mode collection and distribution vehicles. It mainly includes: a vehicle scheduling state set, an action set, an immediate reward, an action exploration strategy, and an Interp-Q value update strategy. Solving the shared-mode collection and distribution vehicle scheduling problem using the Interp-Q algorithm is divided into a learning phase and an application phase: the learning phase iteratively updates the Q-value table; the application phase uses the learned Q-value table to first select a vehicle assignment scheme (the number of owned and hired vehicles to be dispatched), and then dynamically allocates the transportation tasks of the vehicles.
[0083] The large number of state variables and the resulting large state space make traditional Q-learning algorithms difficult to implement. Therefore, this invention proposes a state influence coefficient and neighborhood interpolation method, where Q-values are obtained through interpolation over multiple state dimensions. Each state dimension is divided into several levels with a certain degree of discrimination; the level values are learned, and intermediate values are obtained through interpolation. The Q-value table uses level values for each state dimension, effectively compressing the Q-value table size without reducing the state space size, thus ensuring the integrity of the state space. Furthermore, traditional Q-learning cannot guarantee that all states are fully learned; insufficiently learned and explored states can easily lead to significant decision biases. This invention ensures that the level division covers the range of state dimension values, and sufficient learning of the level division guarantees sufficient learning of the state space. To reduce learning time and improve the generalization of the Q-value table, this invention employs a multi-round incremental learning method: the Q-value table is trained on multiple examples of different sizes, and the Q-value table obtained from previous examples serves as the initial Q-value table for subsequent examples.
[0084] Example 1
[0085] Based on the scheduling data between a container terminal and a freight station in a certain port, the technical solution of this invention is implemented, and its beneficial effects are analyzed.
[0086] A certain collection and distribution fleet owns 30 vehicles. Six collection and distribution task examples are generated as follows: the task cycle is 24 hours (0-24 hours); the starting and ending points of the tasks are randomly selected from the off-port storage yard; the task reservation time slots are randomly selected from the reservation cycle, ranging from 6 to 12 hours; the container volume within the reservation time slot is allocated proportionally to the time slot length; and the total task volume of each task sequence is randomly generated from the interval [E-5, E+5]. For the six examples, E takes values of 750, 800, 850, 900, 950, and 1000, respectively. The fleet prioritizes using its own vehicles, only hiring external vehicles when its own vehicles cannot meet the task requirements.
[0087] The vehicle's fuel consumption is 0.36 L / km under heavy load and 0.24 L / km under no load, and its idling fuel consumption is 0.05 L / min. The fuel price is 7.6 yuan / L, the carbon emission coefficient is 1.6 kg / L, the unit cost of carbon emission control is 0.25 yuan / kg, the hourly wage for owned vehicle drivers is 15 yuan / h, and the hourly wage for outsourced vehicle drivers is 20 yuan / h. The average service time at the terminal gate is 2 minutes, the average driving time of vehicles within the terminal is 3 minutes, and the average service time of the terminal cranes is 3 minutes. The average driving time of vehicles in the off-port storage yard is 2.5 minutes, and the average service time of the off-port storage yard cranes is 3 minutes. In the Interp-Q algorithm, α0, γ, and ε0 are 0.35, 0.85, and 0.4, respectively.
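To show how the parameters above might enter a per-trip cost, the sketch below combines fuel, carbon, and driver wage into one figure. The source does not state the cost formula, so the additive form and the function signature are assumptions; only the constants come from the example.

```python
FUEL_LOADED_L_PER_KM = 0.36
FUEL_EMPTY_L_PER_KM = 0.24
FUEL_IDLE_L_PER_MIN = 0.05
FUEL_PRICE_YUAN_PER_L = 7.6
CARBON_KG_PER_L = 1.6
CARBON_COST_YUAN_PER_KG = 0.25
WAGE_OWNED_YUAN_PER_H = 15
WAGE_HIRED_YUAN_PER_H = 20

def trip_cost(loaded_km, empty_km, idle_min, hours, hired=False):
    """Return (carbon emissions in kg, total cost in yuan) for one trip (assumed form)."""
    fuel = (FUEL_LOADED_L_PER_KM * loaded_km
            + FUEL_EMPTY_L_PER_KM * empty_km
            + FUEL_IDLE_L_PER_MIN * idle_min)
    carbon = CARBON_KG_PER_L * fuel
    wage = (WAGE_HIRED_YUAN_PER_H if hired else WAGE_OWNED_YUAN_PER_H) * hours
    cost = FUEL_PRICE_YUAN_PER_L * fuel + CARBON_COST_YUAN_PER_KG * carbon + wage
    return carbon, cost

# 10 km loaded, 5 km empty, 20 min idling, 1 hour of driver time, owned vehicle.
carbon, cost = trip_cost(10, 5, 20, 1.0)
```

For this trip the fuel use is 3.6 + 1.2 + 1.0 = 5.8 L, giving 9.28 kg of carbon and a cost of 44.08 + 2.32 + 15 = 61.40 yuan under the assumed additive form.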
[0088] The Interp-Q algorithm proposed in this invention was compared with a rule-based scheduling method (i.e., assigning transportation tasks to vehicles using specific rules during the collection and distribution process, with the six scheduling rules being the aforementioned task selection rules) and a traditional Q-learning algorithm. The results are as follows:
[0089] Table 1. Two-stage scheduling results
[0090]
[0091] Table 2. Rule-based scheduling results (carbon emissions / kg, total cost / yuan)
[0092]
[0093] Table 3 compares the optimization level of the Interp-Q algorithm with that of rule-based scheduling methods.
[0094]
[0095] Table 4 shows the optimization level of the Interp-Q algorithm compared to the traditional Q-learning algorithm.
[0096]
[0097] The comparative results show that the Interp-Q algorithm outperforms the rule-based scheduling method and the traditional Q-learning algorithm across examples of different scales. Relative to these two methods, it reduces variable carbon emissions by 27.80% and 36.55%, and variable costs by 25.55% and 13.11%, respectively.
[0098] The embodiments of the present invention are preferred for implementation but are not intended to limit the invention in any way. The technical features or combinations of technical features described in the embodiments of the present invention should not be considered isolated; they can be combined with each other to achieve better technical effects. The scope of the preferred embodiments of the present invention may also include other implementations, and this should be understood by those skilled in the art to which the embodiments of the invention pertain.
Claims
1. A dynamic scheduling method for collection and distribution vehicle operations in a shared mode, characterized in that, include: Step 1: Iteratively learn and update the Q-value table using port collection and distribution tasks of different scales to obtain the learned Q-value table; Step 2: Use the learned Q-value table to simulate vehicle assignment and determine the actual vehicle assignment scheme, and then dynamically allocate the port collection and distribution tasks to be performed by the vehicles. In step 1, set Status and action pairs for outbound container truck dispatching operations The cumulative reward value, Q-value table for all possible boundary state-action pairs A table consisting of the corresponding Q values; during the learning phase, the Q value table obtains the learning status and immediate feedback information of the Q values through the feedback of the vehicle's actions, and then updates it using a multi-round incremental learning method; The specific implementation steps for step 2 are as follows: Step 1: Simulate vehicle assignment and determine the actual vehicle assignment plan based on the simulation results; Step 1-1: Load the Q-value table output during the learning phase; Step 1-2: Initialize environment parameters and select the vehicle assignment scheme to be tried, i.e. the number of owned and hired vehicles to be dispatched. Steps 1-3: Assign the best task to the vehicle based on the learned Q-value table; Steps 1-4: The vehicle completes the task and the vehicle operation cost is recorded; determine whether all tasks have been completed. If yes, proceed to Step 1-5; otherwise, proceed to Step 1-3. Step 1-5: Output the total cost of the vehicle assignment scheme; determine whether all vehicle assignment schemes to be tried have been explored. If so, select the vehicle assignment scheme with the lowest total cost and end; otherwise, go to Step 1-2. 
Step 2: Dynamic task allocation;
Step 2-1: the dispatch center loads the Q-value table output by the learning phase;
Step 2-2: initialize the environment parameters and start port collection and distribution operations with the vehicles selected in Step 1;
Step 2-3: a vehicle sends a task request to the dispatch center, and the dispatch center assigns the best task to the vehicle according to its state;
Step 2-4: the vehicle completes the task and sends the detailed operation information to the dispatch center; the dispatch center evaluates the task to obtain the immediate reward and adaptively updates the Q values using the indirect Q-value update method;
Step 2-5: determine whether all tasks have been completed; if yes, output the total cost and end the scheduling; otherwise, return to Step 2-3.
Based on the continuity of the problem and the independence of the state dimensions, the following two assumptions are made: I) the Q values of adjacent states are strongly correlated, and a function is used to characterize the coefficient with which a state influences the Q values in its neighborhood, this function being called the state influence coefficient curve; II) different state dimensions are independent of one another, i.e., the multidimensional state influence coefficient is the product of the per-dimension state influence coefficients. The neighborhood interpolation formulas for a multidimensional state and the Q value of its state-action pair follow from these assumptions:
1) When the state is one-dimensional: the state has two neighborhood level values, which characterize its two boundary states. Based on assumption I, a bell-shaped curve is used as the state influence coefficient curve, giving the influence coefficients of the two boundary states on the state as formulas (8) and (9). The two influence coefficients are normalized by formulas (10) and (11), and the Q value of the state is then obtained by formula (12), in which the Q values of the two boundary states are the corresponding values in the Q-value table.
2) When the state is two-dimensional: each of the two dimensions has two neighborhood level values, so the state has four boundary states, one for each combination of level values. The state influence function gives the influence coefficients of the two level values of each dimension on that dimension of the state. According to assumption II, the influence coefficients of the four boundary states on the state are the products of the corresponding per-dimension coefficients. Normalizing these four influence coefficients gives the standard influence coefficients, and the Q value of the state is then obtained by formula (13).
3) When the state has more dimensions: the influence coefficients for each dimension are obtained from the neighborhood levels and the state influence function; the set of boundary states and the set of influence coefficients of the boundary states are formed from all combinations of level values; after the influence coefficients are normalized using formula (14), the interpolated Q value of the state is given by formula (15).
The Q-value update formula of traditional Q-learning is formula (16). In the Interp-Q algorithm, if the Q-value table contains the corresponding entry, that entry is updated directly; otherwise, learning proceeds by updating the Q values of the boundary states, and the indirect update formula for the Q value of a boundary state is formula (17), in which: the Q values of states that have no entry in the table are obtained through the interpolation strategy; the immediate reward is the reward obtained after the vehicle completes the action in the given state; the learning rate determines how the update is weighted, a larger value meaning that a smaller proportion of the old Q value is retained and a larger proportion of the result of the new attempt is accepted; and the discount factor lies between 0 and 1, a value closer to 0 meaning that the vehicle gives priority to immediate rewards and a value closer to 1 meaning that the vehicle gives priority to long-term rewards.
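As an illustrative, non-limiting sketch of the neighborhood interpolation and indirect update described in claim 1, the following Python fragment may help. Formulas (8) to (17) are not reproduced in this text, so everything specific here is an assumption: the bell curve is taken to be a Gaussian, `width` is a hypothetical shape parameter, the helper names are invented for illustration, and the indirect update simply distributes the temporal-difference error over the boundary states in proportion to their normalized influence coefficients. For reference, the textbook Q-learning update is Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') − Q(s,a)]; the claimed formula (16) may be written differently.

```python
import itertools
import math


def bell(distance, width=1.0):
    """Assumed bell-shaped influence curve (a Gaussian), not the claimed formula."""
    return math.exp(-((distance / width) ** 2))


def bracket(value, levels):
    """Return the neighborhood level values just below and just above `value`.

    `levels` must be sorted in ascending order.
    """
    lower = max((lv for lv in levels if lv <= value), default=levels[0])
    upper = min((lv for lv in levels if lv >= value), default=levels[-1])
    return lower, upper


def boundary_weights(state, levels, width=1.0):
    """Normalized influence coefficient of every boundary state on `state`."""
    per_dim = []
    for x, dim_levels in zip(state, levels):
        # A set collapses the two neighbors into one when x lies on a level.
        neighbors = sorted(set(bracket(x, dim_levels)))
        per_dim.append([(lv, bell(abs(x - lv), width)) for lv in neighbors])
    weights = {}
    for combo in itertools.product(*per_dim):
        boundary = tuple(lv for lv, _ in combo)
        # Assumption II: the multidimensional coefficient is the product
        # of the per-dimension coefficients.
        weights[boundary] = math.prod(w for _, w in combo)
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def interp_q(q_table, state, action, levels, width=1.0):
    """Interpolate Q(state, action) from the boundary-state entries."""
    weights = boundary_weights(state, levels, width)
    return sum(w * q_table.get((b, action), 0.0) for b, w in weights.items())


def indirect_update(q_table, state, action, target, levels, alpha=0.1, width=1.0):
    """Assumed indirect rule: move each boundary Q value toward `target`,
    scaled by that boundary state's normalized influence coefficient."""
    error = target - interp_q(q_table, state, action, levels, width)
    for b, w in boundary_weights(state, levels, width).items():
        q_table[(b, action)] = q_table.get((b, action), 0.0) + alpha * w * error


levels = [[0, 1, 2, 3], [0, 1, 2, 3]]
q = {((2, 1), 0): 10.0, ((3, 1), 0): 20.0}
midpoint_q = interp_q(q, (2.5, 1), 0, levels)
```

In this sketch a state lying exactly on a level in some dimension contributes a single neighbor for that dimension, so a state that is itself a boundary state reads its table entry unchanged.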
2. The method for dynamic scheduling of collection and distribution vehicles in a shared mode according to claim 1, characterized in that the specific implementation steps of Step 1 are as follows:
Step 1: initialize the Q-value table to 0;
Step 2: initialize the environment parameters;
Step 3: an idle vehicle selects an action based on its current state and performs the corresponding container transport task;
Step 4: obtain the updated Q value from the immediate reward using the indirect Q-value update method;
Step 5: determine whether all tasks in the task sequence have been completed; if yes, proceed to Step 6; otherwise, return to Step 3;
Step 6: determine whether the termination criterion is met, i.e., whether the maximum number of learning iterations has been reached; if yes, proceed to Step 7; otherwise, return to Step 2;
Step 7: the learning phase ends, and the Q-value table is output.
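The learning loop of claim 2 can be sketched, purely for illustration, as follows. The `ToyPort` environment, its reward values, and all parameter defaults are hypothetical stand-ins, and the update shown is plain one-step Q-learning rather than the indirect update of the claimed method; only the control flow of Steps 1 to 7 is meant to correspond to the claim.

```python
import random
from collections import defaultdict


class ToyPort:
    """Hypothetical stand-in for the port environment: a short task queue."""

    def __init__(self, n_tasks=5):
        self.remaining = n_tasks

    def state(self):
        return self.remaining

    def step(self, action):
        # Assumed rewards: action 0 is a "good" rule, action 1 a "poor" one.
        self.remaining -= 1
        reward = 1.0 if action == 0 else -1.0
        return reward, self.remaining == 0


def learn(episodes=200, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    actions = (0, 1)
    q = defaultdict(float)                 # Step 1: Q-value table starts at 0
    for _ in range(episodes):              # Step 6: stop at the iteration limit
        env = ToyPort()                    # Step 2: initialize the environment
        done = False
        while not done:                    # Step 5: loop until all tasks are done
            s = env.state()
            if rng.random() < epsilon:     # Step 3: choose an action (explore)
                a = rng.choice(actions)
            else:                          # Step 3: choose an action (exploit)
                a = max(actions, key=lambda x: q[(s, x)])
            r, done = env.step(a)
            s_next = env.state()
            best_next = 0.0 if done else max(q[(s_next, x)] for x in actions)
            # Step 4: plain Q-learning update, used here in place of the
            # indirect update, which is not reproduced in this sketch.
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q                               # Step 7: output the Q-value table


q_table = learn()
```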
3. The method for dynamic scheduling of collection and distribution vehicles in a shared mode according to claim 1 or 2, characterized in that each state in the vehicle dispatching state set is expressed by formula (1), whose six dimensions are: the difference between the remaining port arrival tasks and the remaining port departure tasks within the next N hours; the sum of the remaining port arrival and port departure tasks within the next N hours, where N is determined from the port collection and distribution task cycle and is taken as 0.05 to 0.1 times the cycle time; the remaining workload at the vehicle's current location at the current time; the ratio of the largest remaining task quantity to the total remaining task quantity at the current time; the number of other vehicles operating at the current time whose destination is the current location; and the quantity of tasks currently overdue. The six dimensions take discrete values and are independent of one another.
4. The method for dynamic scheduling of collection and distribution vehicles in a shared mode according to claim 1 or 2, characterized in that the task selection rules for vehicle dispatching are as follows:
1) among the tasks with the most urgent appointment time, select the task whose starting point is closest to the vehicle's location;
2) among the tasks with the most urgent appointment time, select the task with the shortest travel time from starting point to destination;
3) among the tasks with the most urgent appointment time, select the task with the largest remaining box quantity;
4) among the tasks whose starting point is closest to the vehicle's location, select the task with the most urgent appointment time;
5) among the tasks whose starting point is closest to the vehicle's location, select the task with the shortest travel time from starting point to destination;
6) among the tasks whose starting point is closest to the vehicle's location, select the task with the largest remaining box quantity.
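One possible reading of the six task selection rules of claim 4 is as lexicographic sort keys: a primary criterion, with a secondary criterion breaking ties. The sketch below follows that reading. The `Task` fields, the sample data, and the use of exact ties on the primary criterion are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    appointment: float      # appointment time in minutes; smaller = more urgent
    pickup_distance: float  # distance from the vehicle to the task's starting point
    travel_time: float      # travel time from starting point to destination
    remaining_boxes: int    # boxes still to be moved for this task


# Each rule maps a task to a (primary, secondary) key; the minimum key wins.
# "Largest remaining box quantity" is expressed by negating the count.
RULES = {
    1: lambda t: (t.appointment, t.pickup_distance),
    2: lambda t: (t.appointment, t.travel_time),
    3: lambda t: (t.appointment, -t.remaining_boxes),
    4: lambda t: (t.pickup_distance, t.appointment),
    5: lambda t: (t.pickup_distance, t.travel_time),
    6: lambda t: (t.pickup_distance, -t.remaining_boxes),
}


def select_task(tasks, rule):
    """Pick one task from `tasks` according to rule number 1..6."""
    return min(tasks, key=RULES[rule])


tasks = [
    Task("A", appointment=10, pickup_distance=5, travel_time=30, remaining_boxes=2),
    Task("B", appointment=10, pickup_distance=3, travel_time=40, remaining_boxes=8),
    Task("C", appointment=20, pickup_distance=1, travel_time=10, remaining_boxes=9),
    Task("D", appointment=30, pickup_distance=1, travel_time=5, remaining_boxes=12),
]
chosen = {rule: select_task(tasks, rule).name for rule in RULES}
```

In a Q-learning setting each rule number would play the role of an action, so the Q-value table decides which rule to apply in a given state rather than which individual task to pick.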
5. The method for dynamic scheduling of collection and distribution vehicles in a shared mode according to claim 1 or 2, characterized in that the immediate reward consists of two parts, a task reward and a time reward, and is given by formulas (2), (3) and (4), whose terms are: the immediate reward, the task reward and the time reward; the task urgency feedback value, which is positive if performing the task reduces task urgency and negative otherwise, urgency being measured by the length of time remaining until the task's scheduled start time at the port; the task balance feedback value, which is positive if performing the task reduces the absolute value of the difference between the port arrival and port departure task volumes and negative otherwise; the task reward component coefficients; the vehicle's idle time, idling time, and deviation from the scheduled time during the task, all in minutes; and the time reward coefficients, all in min⁻¹. The emphasis placed on each reward item can be adjusted by changing the value of its coefficient.
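Because formulas (2) to (4) are not reproduced in this text, the following is only a guess at their general shape, offered as a sketch: the immediate reward is taken to be the sum of a task reward (a weighted sum of the two feedback values) and a time reward (a negatively weighted sum of the three time quantities). The additive form, the negative sign on the time terms, the function names, and all default coefficient values are assumptions.

```python
def task_reward(urgency_feedback, balance_feedback, k_urgency=1.0, k_balance=1.0):
    """Assumed form: weighted sum of the urgency and balance feedback values."""
    return k_urgency * urgency_feedback + k_balance * balance_feedback


def time_reward(idle_min, idling_min, deviation_min,
                c_idle=0.01, c_idling=0.01, c_deviation=0.01):
    """Assumed form: a penalty, with coefficients in 1/min so the result is unitless."""
    return -(c_idle * idle_min + c_idling * idling_min + c_deviation * deviation_min)


def immediate_reward(urgency_feedback, balance_feedback,
                     idle_min, idling_min, deviation_min, **coefficients):
    """Assumed form: task reward plus time reward."""
    task_keys = {"k_urgency", "k_balance"}
    task_kw = {k: v for k, v in coefficients.items() if k in task_keys}
    time_kw = {k: v for k, v in coefficients.items() if k not in task_keys}
    return (task_reward(urgency_feedback, balance_feedback, **task_kw)
            + time_reward(idle_min, idling_min, deviation_min, **time_kw))


# Urgency improved (+1), balance worsened (-1), 10 min idle, 20 min idling.
example = immediate_reward(+1, -1, 10, 20, 0, k_balance=0.5)
```

Raising one coefficient relative to the others shifts the emphasis of the reward toward that item, which is the adjustment the claim describes.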
6. The method for dynamic scheduling of collection and distribution vehicles in a shared mode according to claim 2, characterized in that action selection uses an improved exploration strategy given by formulas (5), (6) and (7), whose parameters are: the number of learning iterations performed; the initial values of the probability of randomly selecting an action and of the learning rate, respectively; and the attenuation coefficients, whose values are determined from the total number of learning iterations.
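A decaying exploration schedule of the kind claim 6 describes might look like the sketch below. Formulas (5) to (7) are not reproduced in this text, so the exponential decay form, the helper names, and the numeric values are assumptions chosen for illustration; only the idea that both the random-action probability and the learning rate start from initial values and attenuate with the learning iteration count comes from the claim.

```python
import random


def decayed(initial, decay, iteration):
    """Assumed schedule: the initial value attenuated geometrically per iteration."""
    return initial * decay ** iteration


def select_action(q_row, epsilon, rng=random):
    """With probability `epsilon` pick a random action, otherwise the best-valued one.

    `q_row` maps each available action to its current Q value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)


# Hypothetical initial values and attenuation coefficients.
epsilon_0, alpha_0 = 0.5, 0.3
epsilon_decay, alpha_decay = 0.9, 0.95

schedule = [
    (decayed(epsilon_0, epsilon_decay, k), decayed(alpha_0, alpha_decay, k))
    for k in range(5)
]
greedy_choice = select_action({"rule_1": 1.0, "rule_2": 2.0}, epsilon=0.0)
```

Early iterations therefore explore widely and accept new results readily, while later iterations mostly exploit the learned Q values and change them slowly.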