A service function chaining orchestration and scheduling method, device and medium
By optimizing service function chain orchestration and scheduling through safe reinforcement learning, using local network state together with reward and cost function constraints, the method removes the need to acquire real-time global state in large-scale networks, provides deterministic end-to-end latency and jitter guarantees, and improves network resource utilization and execution efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG LAB
- Filing Date
- 2023-08-16
- Publication Date
- 2026-06-23
AI Technical Summary
Existing service function chain orchestration and scheduling schemes are difficult to effectively acquire real-time global network feature state information when the network state changes frequently or on a large scale. Furthermore, deep reinforcement learning methods suffer from a single reward mechanism and insufficient constraint guidance, resulting in latency constraints being satisfied only during the execution phase of the neural network and not during the training process, which affects execution timeliness and resource utilization.
By employing a safe reinforcement learning approach, service function chain orchestration and scheduling are optimized using only local network feature states. Reward and cost functions are set as constraints, surrogate upper and lower bound functions are constructed, and policy parameters are updated to generate the optimal orchestration and scheduling strategy, thereby achieving deterministic guarantees for end-to-end latency and jitter performance.
It improves network resource utilization, meets the deterministic end-to-end latency and jitter requirements of services, and enhances the execution timeliness of service function chains.
Figure CN117439892B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer network technology, and in particular relates to a service function chain orchestration and scheduling method, device and medium.
Background Technology
[0002] Emerging industrial scenarios such as advanced manufacturing and multi-modal transportation require network service capabilities to rapidly upgrade from the traditional "best-effort" consumer-grade to deterministic, low-latency, and even low-jitter industrial-grade capabilities to meet the complex and diverse real-time communication and control needs of various latency-sensitive services in these scenarios. Service function chains (SFCs) are a flexible and efficient next-generation network service technology and a research hotspot and challenge in fields such as 5G, software-defined networks, and network function virtualization. By generating, deploying, connecting, and scheduling virtual network functions in a streamlined process at the network edge or within the network, SFCs can provide flexible, on-demand network services for diverse services, effectively reducing capital and operating expenditures while enabling function decentralization and on-network processing. To improve the latency performance and resource efficiency of SFCs, recent research has proposed intelligent solutions based on machine learning methods to address the orchestration and scheduling problems of SFCs.
[0003] However, existing mainstream solutions still have the following three key shortcomings: 1) Existing solutions usually require real-time perception of global network feature state information to make decisions. When the network scale is large or the magnitude or frequency of network state changes is high, it is difficult or too costly to obtain real-time information on the entire network state; 2) The deep reinforcement learning methods used in existing solutions usually have problems such as a single reward mechanism and insufficient constraint-oriented optimization, making it difficult to achieve deterministic end-to-end latency and jitter performance guarantees while optimizing resource efficiency according to the specific service quality requirements of the service function chain business; 3) Existing solutions usually can only meet latency constraints during the execution phase of the neural network, but cannot achieve strong guarantees of latency constraints during the training process, resulting in the neural network being deployed only after sufficient training, and poor execution timeliness.
Summary of the Invention
[0004] The purpose of this invention is to address the shortcomings of existing technologies by providing a service function chain orchestration and scheduling method, apparatus, and medium to realize a constraint-oriented, intelligent, and controllable new service function chain optimization mechanism, thereby improving network bandwidth resource utilization while meeting the end-to-end latency and jitter performance requirements of service determinism.
[0005] The objective of this invention is achieved through the following technical solution: a service function chain orchestration and scheduling method, comprising the following steps:
[0006] (1) Initialize the parameters of the service function chain orchestration and scheduling strategy, wherein the strategy is represented as a function $\pi_{\theta_k}$ that maps the set of local network states to the set of service function chain orchestration and scheduling actions, $k$ is the round, and $\theta_k \in \Theta$ is the strategy parameter, where $\Theta$ is the parameter set of the service function chain orchestration and scheduling strategy;
[0007] (2) Set the maximum number of training rounds for the safe reinforcement learning agent;
[0008] (3) In each round, the agent first sets the number of trajectories to be sampled under the service function chain orchestration and scheduling strategy currently followed in that round, and then generates that number of service function chain orchestration and scheduling trajectories;
[0009] (4) The agent obtains the advantage function based on the generated service function chain orchestration and scheduling trajectory;
[0010] (5) Construct surrogate upper and lower bound functions based on the aforementioned advantage function;
[0011] (6) Update the parameters of the current service function chain orchestration and scheduling strategy based on the surrogate upper and lower bound functions, and generate the strategy to be followed in the next round;
[0012] (7) Repeat steps (3) to (6) until the maximum number of rounds set for training is reached, and the optimal service function chain orchestration and scheduling strategy is generated.
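Steps (1) to (7) can be sketched as a training loop. The following is a minimal Python skeleton under stated assumptions: `sample_trajectory`, `estimate_advantages`, `build_surrogates`, and `update_policy` are hypothetical callables standing in for steps (3) to (6), and none of these names come from the patent text.

```python
from typing import Any, Callable, List


def train(init_params: Any,
          max_rounds: int,
          trajectories_per_round: int,
          sample_trajectory: Callable[[Any], list],
          estimate_advantages: Callable[[List[list]], Any],
          build_surrogates: Callable[[Any], Any],
          update_policy: Callable[[Any, Any], Any]) -> Any:
    """Skeleton of steps (1)-(7); every callable is a hypothetical placeholder."""
    params = init_params                          # step (1): parameters for round k = 0
    for _ in range(max_rounds):                   # step (2): bounded number of rounds
        # step (3): sample the configured number of trajectories under the current strategy
        batch = [sample_trajectory(params) for _ in range(trajectories_per_round)]
        advantages = estimate_advantages(batch)       # step (4)
        surrogates = build_surrogates(advantages)     # step (5)
        params = update_policy(params, surrogates)    # step (6)
    return params                                 # step (7): strategy after the last round
```

Passing the four stages in as callables keeps the loop itself independent of how the advantage function and the surrogate bounds are actually computed.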
[0013] Furthermore, the initial strategy is denoted $\pi_{\theta_0}$, that is, round $k=0$; the local network feature state includes the service function position of the data packet on its service function chain, its node position in the physical network, and the remaining available resources of the physical network nodes and physical network links at the current moment; the service function chain orchestration and scheduling action includes the service function deployment position and processing bandwidth set for the data packet, as well as the transmission path and transmission bandwidth.
[0014] Furthermore, in step (3), generating a specified number of service function chain orchestration and scheduling trajectories includes the following sub-steps:
[0015] (3.1) The safe reinforcement learning agent acquires the local network feature state at the current moment;
[0016] (3.2) Based on the local network feature state, decide and execute a service function chain orchestration and scheduling action according to the currently followed strategy;
[0017] (3.3) After the service function chain orchestration and scheduling action has been executed, obtain the reward value and cost value at the current moment, together with the local network feature state observed at the next step;
[0018] (3.4) Based on the observed local network feature state, obtain the next service function chain orchestration and scheduling action. Steps (3.2) to (3.3) are repeated until the data packet reaches the destination node of its service function chain. At this point, the training round ends, and the entire trajectory is generated and recorded. The data packet is the object of service function chain orchestration and scheduling, and also the object controlled by the safe reinforcement learning agent. One complete service function chain orchestration and scheduling trajectory corresponds to the data packet travelling from its source node to its destination node under the control of the safe reinforcement learning agent.
[0019] The reward value is used to characterize the real network node resources and real network link resources consumed by the service function chain orchestration and scheduling actions. After each complete trajectory is generated, the sum of the reward values is used to characterize the end-to-end resource consumption of the service function chain. The cost value is used to characterize the processing delay, transmission delay, and propagation delay spent by the service function chain orchestration and scheduling actions. After each complete trajectory is generated, the sum of the cost values is used to characterize the end-to-end delay of the service function chain.
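As a small illustration of how per-step values roll up into end-to-end figures, the sketch below sums the reward and cost values of one trajectory. It assumes each step is stored as a `(reward, cost)` pair and that rewards are written as negative resource consumption, following the sign convention given in the detailed description; the function name and the sample numbers are hypothetical.

```python
from typing import List, Tuple


def end_to_end_totals(steps: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Sum per-step (reward, cost) pairs of one complete trajectory."""
    total_reward = sum(reward for reward, _ in steps)   # end-to-end resource consumption (negated)
    total_cost = sum(cost for _, cost in steps)         # end-to-end delay
    return total_reward, total_cost


# Hypothetical trajectory with three orchestration and scheduling actions.
steps = [(-3.0, 0.5), (-2.0, 0.25), (-1.0, 0.25)]
total_reward, total_cost = end_to_end_totals(steps)
```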
[0020] Furthermore, the local network characteristic state is represented as follows:
[0021]
[0022]
[0023] where the two position terms represent, respectively, the current position on its service function chain of the data packet managed by the safe reinforcement learning agent (the service function $f_{i'}$ it currently occupies) and its position in the physical network; the two capacity terms represent the remaining available resource capacity of physical network node $v_i$ and of physical network link $e_{i,j}$ at the current time; $M$ represents the number of service functions in the service function chain, and $N$ represents the number of physical network nodes.
[0024] The service function chain orchestration and scheduling actions are represented as follows:
[0025] $a = \big(x(f_{i'}, v),\ y(l_{i'-1,i'}, e_{i,j}),\ \lambda(f_{i'}),\ \lambda(l_{i'-1,i'})\big)$
[0026] where
[0027] $x(f_{i'}, v) = [x(f_{i'}, v_i)]_{1 \le i \le N}$
[0028] $y(l_{i'-1,i'}, e) = [y(l_{i'-1,i'}, e_{i,j})]_{1 \le i,j \le N}$
[0029] $x(f_{i'}, v)$ and $y(l_{i'-1,i'}, e_{i,j})$ respectively represent the deployment decisions of the safe reinforcement learning agent for service function $f_{i'}$ and virtual link $l_{i'-1,i'}$; $\lambda(f_{i'})$ and $\lambda(l_{i'-1,i'})$ respectively represent the processing bandwidth allocated to the data packet at service function $f_{i'}$ and the transmission bandwidth allocated to it on virtual link $l_{i'-1,i'}$;
[0030] The reward value follows a function.
[0031]
[0032] where $s$ is the unknown latent state corresponding to the local network feature state $o$ at the current time and belongs to the set of all latent states; the formula is defined over the set of physical nodes and the set of physical links of the physical network; $\mu(\lambda(f_{i'}))$ is the physical network node resources required when the data packet consumes the processing bandwidth $\lambda(f_{i'})$ allocated at service function $f_{i'}$; and $\mu(\lambda(l_{i'-1,i'}))$ is the physical network link resources required when the data packet consumes the transmission bandwidth $\lambda(l_{i'-1,i'})$ allocated on virtual link $l_{i'-1,i'}$;
[0033] The cost follows a function
[0034]
[0035] where $d(f_{i'}) = b / \lambda(f_{i'})$ is the processing delay spent by the data packet at service function $f_{i'}$, and $b$ represents the size of the data packet; $d(l_{i'-1,i'}) = b / \lambda(l_{i'-1,i'})$ is the transmission delay spent by the data packet on each physical network link $e_{i,j}$ corresponding to virtual link $l_{i'-1,i'}$; and $d(e_{i,j})$ is the propagation delay incurred by the data packet on physical network link $e_{i,j}$.
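The delay terms can be worked through numerically. The sketch below is an assumption about how the three terms combine: the processing delay $b/\lambda(f_{i'})$ is counted once, and for every physical link on the chosen path the transmission delay $b/\lambda(l_{i'-1,i'})$ and that link's propagation delay are added. The function name, argument names, and sample values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Link = Tuple[int, int]  # (i, j) index pair of a physical link e_{i,j}


def step_cost(packet_size: float,
              processing_bw: float,
              transmission_bw: float,
              path: Iterable[Link],
              propagation_delay: Dict[Link, float]) -> float:
    """Delay of one action: processing + per-link (transmission + propagation)."""
    processing = packet_size / processing_bw           # d(f) = b / lambda(f)
    transmission = packet_size / transmission_bw       # d(l) = b / lambda(l), per physical link
    total = processing
    for link in path:
        total += transmission + propagation_delay[link]  # add d(l) + d(e_{i,j})
    return total


# Hypothetical numbers: b = 1000, lambda(f) = 500, lambda(l) = 250, two-hop path.
cost = step_cost(1000.0, 500.0, 250.0,
                 path=[(0, 1), (1, 2)],
                 propagation_delay={(0, 1): 0.5, (1, 2): 1.5})
```

With these values the processing delay is 2.0, each hop adds 4.0 of transmission delay plus its propagation delay, and `cost` evaluates to 12.0.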
[0036] Furthermore, the resources consumed by the physical network nodes cannot exceed the remaining available resource capacity of the physical network nodes at the current moment corresponding to the deployment location of the service function;
[0037] The consumed physical network link resources cannot exceed the remaining available resource capacity of the physical network link corresponding to the transmission path at the current time.
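These two capacity constraints amount to a feasibility check on each action. A minimal sketch, assuming the required node and link resources have already been computed and that the remaining capacities of the links on the chosen transmission path are supplied as a list; all names are hypothetical.

```python
from typing import Iterable


def respects_capacity(node_demand: float,
                      node_remaining: float,
                      link_demand: float,
                      path_link_remaining: Iterable[float]) -> bool:
    """True if the action fits within remaining node and link capacity."""
    if node_demand > node_remaining:       # node hosting the service function is too full
        return False
    # every physical link on the transmission path must have enough capacity left
    return all(link_demand <= remaining for remaining in path_link_remaining)
```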
[0038] Furthermore, step (4) includes the following sub-steps:
[0039] (4.1) The safe reinforcement learning agent obtains the state value based on the generated trajectories;
[0040] The state value is calculated in the following way:
[0041]
[0042] where $\gamma$ is the discount factor, $\gamma^{i'-1}$ is the discount that decreases progressively as service function chain orchestration and scheduling proceeds, and $r_{i'} = R(s_{i'}, a_{i'})$ is the reward value obtained when the agent takes action $a_{i'}$ in latent state $s_{i'}$;
[0043] (4.2) Construct an action value function based on the state value;
[0044] The action value function is constructed in the following form:
[0045]
[0046] where $\tau$ represents a trajectory, and the averaging operator denotes the mean taken over trajectories;
[0047] (4.3) Construct an advantage function based on the state value and action value functions;
[0048]
[0049] Furthermore, the surrogate upper bound function and lower bound function can be constructed in the following way:
[0050] where $\epsilon$ represents the highest cumulative value of the average advantage function that can be generated across all trajectories when action $a$ is performed while following an arbitrary generated service function chain orchestration and scheduling strategy $\pi_\theta$, and is specifically calculated as follows:
[0051]
[0052] The divergence term represents the KL divergence between an arbitrary generated service function chain orchestration and scheduling strategy $\pi_\theta$ and the current strategy $\pi_{\theta_k}$.
[0053] Furthermore, the service function chain orchestration and scheduling strategy followed in the next round, $\pi_{\theta_{k+1}}$, is generated by solving the following constrained optimization problem:
[0054]
[0055]
[0056]
[0057] where the objective term is the cumulative reward that the safe reinforcement learning agent can generate when performing service function chain orchestration and scheduling under the currently followed strategy $\pi_{\theta_k}$, corresponding to the amount of physical network resources consumed by the agent; the constraint term is the cumulative cost generated when performing service function chain orchestration and scheduling under the currently followed strategy $\pi_{\theta_k}$, corresponding to the agent's accumulated end-to-end latency, where $\xi_{i'} = C(s_{i'}, a_{i'})$ is the cost value incurred when the agent takes action $a_{i'}$ in latent state $s_{i'}$; the constraint also uses the lower bound of end-to-end latency set for the service corresponding to the service function chain, which is expressed in terms of the upper bound of end-to-end jitter required by the service and the upper bound of end-to-end latency required by the service.
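The role of the latency bounds can be illustrated with a simple check on one trajectory. This sketch assumes the constraint requires the accumulated cost (end-to-end latency) to lie between a lower and an upper bound that are supplied directly; how the lower bound is derived from the jitter bound is not modeled here, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def within_latency_bounds(step_costs: Sequence[float],
                          lower_bound: float,
                          upper_bound: float) -> bool:
    """True if the summed per-step cost stays inside [lower_bound, upper_bound]."""
    end_to_end_latency = sum(step_costs)
    return lower_bound <= end_to_end_latency <= upper_bound
```

A trajectory whose latency is too small violates the bound just as one that is too large does, which is how a two-sided latency window limits jitter as well as delay.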
[0058] A service function chain orchestration and scheduling apparatus includes one or more processors for implementing the aforementioned service function chain orchestration and scheduling method.
[0059] A computer-readable storage medium having a program stored thereon, which, when executed by a processor, is used to implement the above-described service function chain orchestration and scheduling method.
[0060] The beneficial effects of this invention are: 1) The service function chain orchestration and scheduling method designed in this invention can optimize service function chain orchestration and scheduling actions based on locally observed network feature state information, thereby avoiding the overhead caused by real-time global network feature state information perception. 2) While defining a reward function based on network resource consumption, this invention also adds a cost function oriented towards end-to-end latency and jitter performance indicators as a constraint condition, which can realize a constraint-oriented service function chain optimization mechanism, provide differentiated and deterministic end-to-end service quality assurance, and improve network resource utilization. 3) This invention can achieve strong latency constraint guarantees during the training process of neural networks, allowing service function chain orchestration and scheduling strategies to be deployed during training, increasing the timeliness of strategy execution.
Attached Figure Description
[0061] Figure 1 is a flowchart illustrating the overall process of a service function chain orchestration and scheduling method according to an embodiment of the present invention;
[0062] Figure 2 is a flowchart illustrating the generation of service function chain orchestration and scheduling trajectories in an embodiment of the present invention;
[0063] Figure 3 is a flowchart illustrating the process of obtaining the advantage function according to an embodiment of the present invention;
[0064] Figure 4 is a block diagram of a service function chain orchestration and scheduling device according to an embodiment of the present invention.
Detailed Implementation
[0065] The present invention will now be described in detail with reference to the accompanying drawings. Unless otherwise specified, the features of the following embodiments and implementations can be combined with each other.
[0066] The present invention provides a service function chain orchestration and scheduling method. Figure 1 shows an overall flowchart of the method according to an embodiment of the present invention. The method is used to train a safe reinforcement learning agent (the data packet agent) to optimize service function chain orchestration and scheduling strategies, and includes the following steps:
[0067] Step 1: Initialize the parameters of the service function chain orchestration and scheduling strategy;
[0068] Step 2: Set the maximum number of training rounds for the safe reinforcement learning agent;
[0069] Step 3: In each round, the safe reinforcement learning agent first sets the number of trajectories to be sampled under the service function chain orchestration and scheduling strategy currently followed in that round, and then generates that number of service function chain orchestration and scheduling trajectories;
[0070] Step 4: The safe reinforcement learning agent obtains the advantage function based on the generated service function chain orchestration and scheduling trajectories;
[0071] Step 5: Construct surrogate upper and lower bound functions based on the aforementioned advantage function;
[0072] Step 6: Update the parameters of the current service function chain orchestration and scheduling strategy based on the surrogate upper and lower bound functions, and generate the strategy to be followed in the next round;
[0073] Step 7: Repeat steps 3 through 6 until the maximum number of training iterations is reached.
[0074] In the specific implementation of step 1, the parameters of the service function chain orchestration and scheduling strategy are initialized;
[0075] Specifically, the service function chain orchestration and scheduling strategy is represented as a function $\pi_{\theta_k}$ from the set of all observable local network states to the set of all generable service function chain orchestration and scheduling actions. It is a stochastic memoryless strategy with parameters $\theta_k \in \Theta$ that, based on the observed local network state $o$, outputs a service function chain orchestration and scheduling action $a$. The initial strategy can be denoted $\pi_{\theta_0}$, that is, the round number $k = 0$, and its parameters can be randomly determined; $\Theta$ is the parameter set of the service function chain orchestration and scheduling strategy.
[0076] In the specific implementation of step 2, the maximum number of training rounds for the safe reinforcement learning agent is set:
[0077] Specifically, the maximum number of training rounds is set to K. During implementation, it can be adjusted according to characteristics such as network size, network load, or degree of change in state. Alternatively, it can be simply set to K=∞ to support continuous training and learning of the agent. The latter may cause unnecessary training resource overhead when the network size is small or the network tends to be static.
[0078] In the specific implementation of step 3, in each round, the safe reinforcement learning agent first sets the number of trajectories to be sampled under the service function chain orchestration and scheduling strategy currently followed in that round, and then generates that number of trajectories:
[0079] Specifically, the number of trajectories to be sampled is set to L. In practice, L≥1 is set, meaning that at least one trajectory is sampled according to the current service function chain orchestration and scheduling strategy, in order to improve the trajectory sampling accuracy and the accuracy of the advantage function described in step 4.
[0080] Specifically, as shown in Figure 2, the generation of each trajectory in step 3 may include the following sub-steps:
[0081] Step 3.1: The safe reinforcement learning agent acquires the observed local network feature state at the current moment;
[0082] Specifically, the observed local network feature state can be represented as:
[0083]
[0084]
[0085] where the first position indicator gives the position on its service function chain of the data packet controlled by the safe reinforcement learning agent: when the data packet is currently at service function $f_{i'}$ (assuming the service function chain contains $M$ service functions, including the source and destination nodes), the indicator marks $f_{i'}$, and it must not mark any other $f_{j'}$ ($j' \ne i'$, $1 \le j' \le M$). The second position indicator gives the position of the data packet in the physical network: when the data packet is located at physical network node $v_i$ (assuming the physical network contains $N$ physical network nodes), the indicator marks $v_i$, and it must not mark any other $v_j$ ($j \ne i$, $1 \le j \le N$). In addition, the two capacity terms represent the remaining available resource capacity of physical network node $v_i$ and physical network link $e_{i,j}$ at the current time, and corresponding terms give the upper limit of available resources of physical network node $v_i$ and physical network link $e_{i,j}$. Therefore, the observed local network feature state includes the service function position of the data packet on its service function chain, its node position in the physical network, and the remaining available resources of the physical network nodes and physical network links at the current moment. Initially, in the observed local network characteristic state
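One way to picture this observation is as a flat feature vector. The sketch below assumes the two position indicators are one-hot vectors of length $M$ and $N$, that indices are 0-based, and that remaining link capacities are given as an $N \times N$ matrix; the function name, argument names, and layout are hypothetical choices, not taken from the patent.

```python
from typing import List, Sequence


def encode_observation(chain_index: int,
                       num_functions: int,
                       node_index: int,
                       num_nodes: int,
                       node_capacity: Sequence[float],
                       link_capacity: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate position indicators and remaining capacities into one vector."""
    # one-hot position on the service function chain (length M)
    chain_pos = [1 if j == chain_index else 0 for j in range(num_functions)]
    # one-hot position in the physical network (length N)
    node_pos = [1 if j == node_index else 0 for j in range(num_nodes)]
    # remaining link capacities, flattened row by row (length N * N)
    flat_links = [c for row in link_capacity for c in row]
    return chain_pos + node_pos + list(node_capacity) + flat_links


# Hypothetical case: M = 3, N = 2, packet at the second function, on node 0.
observation = encode_observation(1, 3, 0, 2, [5.0, 7.0], [[0.0, 4.0], [4.0, 0.0]])
```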
[0086] Step 3.2: Based on the local network feature state, decide and execute the service function chain orchestration and scheduling action according to the currently followed strategy;
[0087] Specifically, the service function chain orchestration and scheduling action can be represented as
[0088] $a = \big(x(f_{i'}, v),\ y(l_{i'-1,i'}, e_{i,j}),\ \lambda(f_{i'}),\ \lambda(l_{i'-1,i'})\big)$,
[0089] where
[0090] $x(f_{i'}, v) = [x(f_{i'}, v_i)]_{1 \le i \le N}$,
[0091] $y(l_{i'-1,i'}, e) = [y(l_{i'-1,i'}, e_{i,j})]_{1 \le i,j \le N}$.
[0092] Here, $x(f_{i'}, v)$, with $x(f_{i'}, v_i) \in \{0,1\}$, represents the deployment decision of the safe reinforcement learning agent for service function $f_{i'}$: when the agent decides that service function $f_{i'}$ of the data packet is hosted by physical network node $v_i$, $x(f_{i'}, v_i) = 1$, and for all other $v_j$ ($j \ne i$, $1 \le j \le N$), $x(f_{i'}, v_j) = 0$ must hold. $y(l_{i'-1,i'}, e_{i,j})$, with $y(l_{i'-1,i'}, e_{i,j}) \in \{0,1\}$, represents the deployment decision of the agent for virtual link $l_{i'-1,i'}$: when the agent decides that virtual link $l_{i'-1,i'}$ of the data packet (the link connecting service functions $f_{i'-1}$ and $f_{i'}$) includes physical network link $e_{i,j}$, $y(l_{i'-1,i'}, e_{i,j}) = 1$, and for all other $e_{m,n}$ ($m \ne i$ or $n \ne j$, $1 \le m,n \le N$), $y(l_{i'-1,i'}, e_{m,n}) = 0$ must hold. $\lambda(f_{i'})$ and $\lambda(l_{i'-1,i'})$ respectively represent the processing bandwidth allocated to the data packet at service function $f_{i'}$ and the transmission bandwidth allocated to it on virtual link $l_{i'-1,i'}$. Therefore, the service function chain orchestration and scheduling action includes the service function deployment location and processing bandwidth set for the data packet, as well as the transmission path and transmission bandwidth.
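The four components of an action can be held in a small record. The sketch below assumes 0-based node indices, a one-hot list for $x$, and an $N \times N$ binary matrix for $y$ in which an entry is 1 when the corresponding physical link is part of the virtual link; the `Action` class and `make_action` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Action:
    x: List[int]            # x(f_i', v_i): one-hot over the N physical nodes
    y: List[List[int]]      # y(l_{i'-1,i'}, e_{i,j}): N x N binary matrix
    processing_bw: float    # lambda(f_i')
    transmission_bw: float  # lambda(l_{i'-1,i'})


def make_action(host_node: int,
                path_links: Iterable[Tuple[int, int]],
                num_nodes: int,
                processing_bw: float,
                transmission_bw: float) -> Action:
    """Build an action from a host node and the physical links of the chosen path."""
    x = [1 if i == host_node else 0 for i in range(num_nodes)]
    y = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in path_links:
        y[i][j] = 1          # mark physical link e_{i,j} as used by the virtual link
    return Action(x, y, processing_bw, transmission_bw)


# Hypothetical example: host on node 2, path 0 -> 1 -> 2 in a 3-node network.
action = make_action(2, [(0, 1), (1, 2)], 3, 10.0, 20.0)
```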
[0093] It should be noted that, for simplicity, the processing and transmission bandwidth allocated to any data packet is "short-term fixed," meaning the allocated bandwidth can change at the earliest when the data packet enters the next service function or virtual link. Furthermore, the safe reinforcement learning agent only decides and begins executing a service function chain orchestration and scheduling action after the data packet has finished processing in the previous service function.
[0094] Furthermore, the physical network node resources required when the data packet consumes the processing bandwidth $\lambda(f_{i'})$ allocated at service function $f_{i'}$ are $\mu(\lambda(f_{i'}))$, and they cannot exceed the remaining available resource capacity, at the current time, of the physical network node corresponding to the set service function deployment location; the physical network link resources required when the data packet consumes the transmission bandwidth $\lambda(l_{i'-1,i'})$ allocated on virtual link $l_{i'-1,i'}$ are $\mu(\lambda(l_{i'-1,i'}))$, and they cannot exceed the remaining available resource capacity, at the current time, of the physical network link corresponding to the set transmission path.
[0095] Step 3.3: After the service function chain orchestration and scheduling action has been executed, obtain the reward value and cost value at the current moment, together with the local network feature state observed at the next step;
[0096] The reward value represents the real-world network resources consumed by the service function chain orchestration and scheduling actions. The sum of the reward values obtained after each complete trajectory is generated represents the end-to-end resource consumption of the service function chain, enabling optimization of end-to-end resources. The cost value represents the processing latency, transmission latency, and propagation latency incurred by the service function chain orchestration and scheduling actions. The sum of the cost values obtained after each complete trajectory is generated represents the end-to-end latency of the service function chain, enabling constraints on end-to-end latency (and end-to-end jitter) performance.
[0097] Specifically, the reward value follows a function.
[0098]
[0099] where $s$ is the unknown latent state corresponding to the local network feature state $o$ observed at the current moment and belongs to the set of all latent states; the formula is defined over the set of physical nodes and the set of physical links of the physical network; $\mu(\lambda(f_{i'}))$ is the physical network node resources required when the data packet consumes the processing bandwidth $\lambda(f_{i'})$ allocated at service function $f_{i'}$; $\mu(\lambda(l_{i'-1,i'}))$ is the physical network link resources required when the data packet consumes the transmission bandwidth $\lambda(l_{i'-1,i'})$ allocated on virtual link $l_{i'-1,i'}$; and $a$ represents the service function chain orchestration and scheduling action. The reward value is the immediate reward feedback obtained after executing the service function chain orchestration and scheduling action, defined as the negative of the sum of the physical network node and link resources consumed at each step of the safe reinforcement learning agent's action. This definition is consistent with the optimization objective of this method, namely, minimizing the network resources consumed across all trajectories by the service function chain corresponding to the service.
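A numeric sketch of this reward may help. It assumes the node consumes $\mu(\lambda(f_{i'}))$ once and that every physical link on the path consumes $\mu(\lambda(l_{i'-1,i'}))$; because the resource mapping $\mu(\cdot)$ is not given in closed form, the hypothetical `resource_of` argument defaults to the identity. All names and numbers are illustrative.

```python
from typing import Callable


def step_reward(processing_bw: float,
                transmission_bw: float,
                num_path_links: int,
                resource_of: Callable[[float], float] = lambda bw: bw) -> float:
    """Negative sum of node and link resources consumed by one action."""
    node_resources = resource_of(processing_bw)                     # mu(lambda(f))
    link_resources = num_path_links * resource_of(transmission_bw)  # mu(lambda(l)) per link
    return -(node_resources + link_resources)


# Hypothetical action: lambda(f) = 10, lambda(l) = 20, path of two physical links.
reward = step_reward(10.0, 20.0, 2)
```

With the identity mapping this gives a reward of -50.0, so an action that reserves less bandwidth or uses a shorter path scores higher.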
[0100] Specifically, the cost value follows a function.
[0101]
[0102] where $d(f_{i'}) = b / \lambda(f_{i'})$ is the processing delay spent by the data packet at service function $f_{i'}$, and $b$ represents the size of the data packet; $d(l_{i'-1,i'}) = b / \lambda(l_{i'-1,i'})$ is the transmission delay spent by the data packet on each physical network link $e_{i,j}$ corresponding to virtual link $l_{i'-1,i'}$; and $d(e_{i,j})$ is the propagation delay incurred by the data packet on physical network link $e_{i,j}$. The cost value is the immediate cost feedback obtained after executing the service function chain orchestration and scheduling action, defined as the sum of the processing, transmission and propagation delays actually experienced at each step of the safe reinforcement learning agent's action. This definition is consistent with the constraint objective of the method, that is, to meet the end-to-end latency and jitter constraints of the service corresponding to the service function chain.
[0103] Specifically, the local network feature state observed in the next step has the same definition and representation as the local network feature state observed in the previous step, so it will not be described in detail further.
[0104] Step 3.4: Based on the observed local network characteristics, obtain the next service function chain orchestration and scheduling actions. Repeat steps (3.2) to (3.3) until the data packet reaches the destination node of its service function chain. At this point, the training round ends, and the entire trajectory is generated and recorded.
[0105] Obtaining the next service function chain orchestration and scheduling action from the observed local network feature state can be implemented using a deep neural network model. This model comprises an input layer, hidden layers, and an output layer: the input layer takes the observed local network state $o$ as input, the output layer outputs the service function chain orchestration and scheduling action corresponding to the local network state $o$, and the hidden layers serve as transition layers connecting the input layer and the output layer.
[0106] In practical implementation, the deep neural network model can be constructed as a fully connected neural network. Furthermore, to increase flexibility and customizability, the parameterized strategy used to generate service function chain orchestration and scheduling actions can also be constructed using other methods, such as linear methods or memory-based or kernel-based functions. However, the input and output of any such method must satisfy the following condition: the input is the observed local network state $o$, and the output is the service function chain orchestration and scheduling action corresponding to the local network state $o$.
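A fully connected model of this kind can be sketched with the standard library alone. The example below is a simplified assumption: it maps an observation vector to a probability distribution over candidate host nodes only, leaving the path and bandwidth outputs aside, and uses one `tanh` hidden layer with small random initial weights. The layer sizes, activation, and function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Randomly initialise one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer,
          activation: Optional[Callable[[float], float]] = None) -> List[float]:
    """Apply weights and biases, then an optional element-wise activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def node_placement_probs(observation: List[float],
                         hidden_layer: Layer, output_layer: Layer) -> List[float]:
    """Input layer -> hidden layer -> output layer, as probabilities over nodes."""
    hidden = dense(observation, hidden_layer, math.tanh)
    return softmax(dense(hidden, output_layer))


rng = random.Random(0)
hidden_layer = init_layer(4, 8, rng)   # 4 observation features, 8 hidden units
output_layer = init_layer(8, 3, rng)   # 3 candidate physical nodes
probs = node_placement_probs([1.0, 0.0, 0.5, 0.2], hidden_layer, output_layer)
```

Sampling the host node from `probs` rather than taking the maximum gives the stochastic behaviour the strategy is described as having.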
[0107] Specifically, steps 3.1 to 3.3 constitute one training step of the security reinforcement learning agent. The event information recorded at each step has the form $(o_{i'}, a_{i'}, r_{i'}, \xi_{i'})$, where $r_{i'} = R(s_{i'}, a_{i'})$ is the reward value obtained when the agent takes action $a_{i'}$ in latent state $s_{i'}$, and $\xi_{i'} = C(s_{i'}, a_{i'})$ is the cost value incurred when the agent takes action $a_{i'}$ in latent state $s_{i'}$. When the data packet has traversed all service functions in its service function chain, including the destination node, one round of training is complete, and the trajectory containing the entire event sequence is recorded in the form $\tau = ((o_1, a_1, r_1, \xi_1), \ldots, (o_{i'}, a_{i'}, r_{i'}, \xi_{i'}), \ldots, (o_M, a_M, r_M, \xi_M))$ with $1 \le i' \le M$. The trajectory is a finite random trajectory generated under the currently followed strategy, and corresponds to one complete end-to-end orchestration and scheduling process of the data packet within its service function chain. All trajectories generated by the security reinforcement learning agent are recorded together in the experience pool.
[0108] In this context, data packets are the objects of service function chain orchestration and scheduling, and also the objects managed and controlled by the security reinforcement learning agent. A trajectory is the complete service function chain orchestration and scheduling path that the security reinforcement learning agent completes for a data packet from its source node to its destination node.
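The event, trajectory, and experience pool described above can be sketched as follows. This is an illustrative outline only: the `Event` and `ExperiencePool` classes, the `rollout` helper, and the toy three-function chain in `toy_step` are hypothetical names and data invented for the example, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    obs: tuple      # o_i': observed local network feature state
    action: int     # a_i': orchestration and scheduling action
    reward: float   # r_i': negative resource consumption
    cost: float     # xi_i': latency incurred by the action

@dataclass
class ExperiencePool:
    trajectories: List[List[Event]] = field(default_factory=list)

    def add(self, trajectory):
        self.trajectories.append(trajectory)

def rollout(policy, step, initial_obs, max_steps=100):
    """Generate one trajectory: repeat observe, act, record until the destination is reached."""
    trajectory = []
    obs = initial_obs
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, cost, done = step(obs, action)
        trajectory.append(Event(obs, action, reward, cost))
        obs = next_obs
        if done:
            break
    return trajectory

def toy_step(obs, action):
    """Toy chain of 3 service functions; action 0 is cheap but slow, action 1 costly but fast."""
    position = obs[0]
    resources, latency = ((1.0, 3.0), (2.0, 1.0))[action]
    next_position = position + 1
    return (next_position,), -resources, latency, next_position == 3

def toy_policy(obs):
    """Hypothetical deterministic policy: alternate between the two actions."""
    return obs[0] % 2

pool = ExperiencePool()
pool.add(rollout(toy_policy, toy_step, (0,)))

trajectory = pool.trajectories[0]
total_reward = sum(e.reward for e in trajectory)  # end-to-end resource consumption (negative)
total_cost = sum(e.cost for e in trajectory)      # end-to-end latency
```

Summing `reward` and `cost` over one trajectory gives the end-to-end resource consumption (as a negative number) and the end-to-end latency of that packet.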
[0109] In the specific implementation of step 4, the security reinforcement learning agent obtains an advantage function based on the generated trajectory:
[0110] Specifically, as shown in Figure 3, the advantage function is obtained in step 4 through the following sub-steps:
[0111] Step 4.1: The security reinforcement learning agent obtains the state value based on the generated trajectory;
[0112] Specifically, the state value can be calculated in the following way:
[0113]
[0114] Where $r_{i'} = R(s_{i'}, a_{i'})$ is the reward value obtained when the agent takes action $a_{i'}$ in latent state $s_{i'}$; $\gamma \in [0,1]$ is a discount factor used to weigh short-term against long-term rewards or costs; and $\gamma^{i'-1}$ is the discount term that decreases progressively as service function chain orchestration and scheduling progresses, that is, for any $i' \le j'$, $\gamma^{i'-1} \ge \gamma^{j'-1}$. Initially, $i' = 1$ and $\gamma^{i'-1} = \gamma^{0} = 1$; at the same time, $o_{i'-1} = o_0$ and $a_{i'-1} = a_0$ can be randomly initialized to any value. In this invention, the discount factor is fixed at $\gamma = 1$. This setting meets the reward and constraint modeling requirements of the service function chain orchestration and scheduling method: for a given triple $(o_{i'-1}, a_{i'-1}, o_{i'})$, the state value represents the (negative) real network resource consumption or the relative latency accumulated by the security reinforcement learning agent. When the same triple $(o_{i'-1}, a_{i'-1}, o_{i'})$ occurs several times in one trajectory or across multiple trajectories, the average of the corresponding state values can be taken. The expectation in the formula denotes this averaging operation, taken over the random trajectories $\tau$ generated when the agent follows the policy.
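The averaging over repeated triples can be sketched as below. The formula image for the state value is not reproduced in this text, so the sketch assumes a common Monte Carlo form: the value attached to a triple $(o_{i'-1}, a_{i'-1}, o_{i'})$ is the discounted sum of rewards from step $i'$ to the end of the trajectory, averaged over every occurrence of that triple. The function name `state_values` and the sample data are hypothetical.

```python
from collections import defaultdict

def state_values(trajectories, gamma=1.0, o0=None, a0=None):
    """Average reward-to-go per triple (o_prev, a_prev, o).

    Each trajectory is a list of (o, a, r, xi) tuples. The reward-to-go form
    is an assumption of this sketch; gamma defaults to 1 as in the text.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for traj in trajectories:
        # Reward-to-go, computed backwards from the last step.
        returns = [0.0] * len(traj)
        g = 0.0
        for i in range(len(traj) - 1, -1, -1):
            g = traj[i][2] + gamma * g
            returns[i] = g
        # o_0 and a_0 are arbitrary initial placeholders, as in the text.
        prev_o, prev_a = o0, a0
        for (o, a, _r, _xi), g_i in zip(traj, returns):
            key = (prev_o, prev_a, o)
            sums[key] += g_i
            counts[key] += 1
            prev_o, prev_a = o, a
    return {key: sums[key] / counts[key] for key in sums}

# Two hypothetical trajectories over nodes "n0" -> "n1".
t1 = [("n0", 0, -1.0, 3.0), ("n1", 1, -2.0, 1.0)]
t2 = [("n0", 0, -1.0, 3.0), ("n1", 0, -1.0, 3.0)]
values = state_values([t1, t2])
```

Passing the cost values instead of the rewards (index 3 instead of 2) would give the analogous latency-based value under the same assumption.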
[0115] Step 4.2: Construct an action value function based on the state value;
[0116] Specifically, the action value function can be constructed in the following form:
[0117]
[0118] Where the expectation denotes the averaging operation, taken under the assumption that the agent first performs action a according to an arbitrarily generated service function chain orchestration and scheduling strategy and then continues to follow the currently applied strategy for all subsequent decisions. The action value function therefore represents, for a given triple $(o_{i'}, a, o_{i'+1})$, the average value obtained by performing action a under any other arbitrarily generated service function chain orchestration and scheduling strategy and then making subsequent decisions according to the currently followed strategy.
[0119] Step 4.3: Construct the advantage function based on the state value and action value functions.
[0120] Specifically, the advantage function can be constructed in the following form:
[0121]
[0122] It is worth noting that the advantage function constructed above from the quintuple $(o_{i'-1}, a_{i'-1}, o_{i'}, a, o_{i'+1})$ reduces the amount of network feature state information that must be acquired, avoiding the overhead of perceiving global network feature state information in real time.
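To make the role of the quintuple concrete, the sketch below computes an advantage from purely local quantities. The formula images are not reproduced in this text, so two standard forms are assumed here: a one-step action value $Q = r + \gamma V$ and an advantage equal to the action value minus the state value. The helper names and the numeric values are hypothetical.

```python
def action_value(v, reward, gamma, triple):
    """One-step lookahead (assumed form): reward of action a plus the value of (o, a, o_next)."""
    return reward + gamma * v[triple]

def advantage(v, q_value, prev_triple):
    """Advantage as action value minus state value of (o_prev, a_prev, o) (assumed form)."""
    return q_value - v[prev_triple]

# Hypothetical state values keyed by triples (o_prev, a_prev, o).
v = {
    (None, None, "n0"): -2.75,
    ("n0", 0, "n1"): -1.5,
    ("n0", 1, "n2"): -1.0,
}

# Quintuple (o_prev, a_prev, o, a, o_next) = (None, None, "n0", a, o_next)
q_cheap = action_value(v, -1.0, 1.0, ("n0", 0, "n1"))  # action 0 leads to n1
q_fast = action_value(v, -2.0, 1.0, ("n0", 1, "n2"))   # action 1 leads to n2
adv_cheap = advantage(v, q_cheap, (None, None, "n0"))
adv_fast = advantage(v, q_fast, (None, None, "n0"))
```

Only the previous observation, the current observation, and the next observation enter the computation, which is what lets the agent work without a global network view.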
[0123] In the specific implementation of step 5, surrogate upper and lower bound functions are constructed based on the advantage function:
[0124] Specifically, the surrogate upper bound function and lower bound function can be constructed in the following way:
[0125] Upper bound function:
[0126]
[0127] Lower bound function:
[0128]
[0129] Where $\epsilon$ represents the highest cumulative value of the average advantage function that can be generated across all trajectories when action a is performed under the arbitrarily generated service function chain orchestration and scheduling strategy $\pi_\theta$. It is calculated as follows:
[0130]
[0131] Meanwhile, the divergence term in the bound functions denotes the Kullback-Leibler (KL) divergence between the arbitrarily generated service function chain orchestration and scheduling strategy $\pi_\theta$ and the current strategy. KL divergence is generally used to quantify the difference between probability distributions; here it measures the difference between two service function chain orchestration and scheduling strategies.
[0132] In the specific implementation of step 6, the parameters of the current service function chain orchestration and scheduling strategy are updated based on the surrogate upper and lower bound functions to generate the strategy to be followed in the next round:
[0133] Specifically, the service function chain orchestration and scheduling strategy to be followed in the next round can be generated by solving the following constrained optimization problem:
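For strategies that output a probability distribution over a discrete action set, the KL divergence can be computed as in the sketch below. The discrete action set, the direction of the divergence (candidate strategy relative to current strategy), and the averaging over sampled observations in `mean_kl` are assumptions of this example, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same action set."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(candidate_policy, current_policy, observations):
    """Average KL divergence between two strategies over sampled local observations."""
    total = sum(kl_divergence(candidate_policy(o), current_policy(o)) for o in observations)
    return total / len(observations)

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.5, 0.5], [0.9, 0.1])

# Hypothetical strategies: each maps an observation to action probabilities.
candidate = lambda o: [0.5, 0.5]
current = lambda o: [0.9, 0.1]
avg = mean_kl(candidate, current, ["n0", "n1"])
```

A value of zero means the two strategies choose actions identically on the sampled observations; larger values mean the candidate strategy has moved further from the current one.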
[0134]
[0135]
[0136]
[0137] Where the reward term denotes the reward that can be generated when the security reinforcement learning agent performs service function chain orchestration and scheduling under the currently followed strategy; it corresponds to the negative of the real network resources (including real network node resources and real network link resources) consumed cumulatively by the agent. The cost term denotes the cost return generated when service function chain orchestration and scheduling is performed under the currently followed strategy; it corresponds to the end-to-end latency accumulated by the agent, where $\xi_{i'} = C(s_{i'}, a_{i'})$ is the cost value incurred when the agent takes action $a_{i'}$ in latent state $s_{i'}$. Both the reward and the cost return can be calculated from the sampled trajectories according to these expressions.
[0138] In the constrained optimization problem, a lower bound on end-to-end latency is set for the service corresponding to the service function chain, based on the upper bound of end-to-end jitter required by the service and the upper bound of end-to-end latency required by the service. The latency lower bound and the latency upper bound together constitute the constraint window of the relative latency cost return. It should be noted that, in addition to the end-to-end jitter upper bound, a latency lower bound is also set and configured here. The main consideration is as follows: the optimization objective of the method designed in this invention is to minimize the network resources consumed by the service function chain while guaranteeing the specific end-to-end latency and jitter performance of the service, and an optimization mechanism that makes data packets reach the destination node within the constraint window as far as possible is a sufficient condition for achieving this objective.
[0139] In practice, the constrained optimization problem can be solved with an existing constrained optimization solver or, for simplicity, solved quickly with methods such as linear approximation within a trust region.
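The statement that both quantities can be calculated from sampled trajectories can be illustrated as follows. This is a sketch under the assumption that each quantity is estimated as the average, over sampled trajectories, of the discounted per-step sum; the function name `estimate_returns` and the sample data are hypothetical.

```python
def estimate_returns(trajectories, gamma=1.0):
    """Estimate the reward return and the cost return from sampled trajectories.

    Each trajectory is a list of (o, a, r, xi) tuples. With gamma = 1 the
    estimates are plain averages of end-to-end resource use (negative)
    and end-to-end latency.
    """
    total_reward = 0.0
    total_cost = 0.0
    for traj in trajectories:
        for i, (_o, _a, r, xi) in enumerate(traj):
            total_reward += (gamma ** i) * r
            total_cost += (gamma ** i) * xi
    n = len(trajectories)
    return total_reward / n, total_cost / n

t1 = [("n0", 0, -1.0, 3.0), ("n1", 1, -2.0, 1.0)]
t2 = [("n0", 0, -1.0, 3.0), ("n1", 0, -1.0, 3.0)]
reward_return, cost_return = estimate_returns([t1, t2])
```

Here the first trajectory consumes 3 resource units with latency 4, and the second consumes 2 units with latency 6, so the estimates are their averages.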
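A constraint window of this kind can be checked per trajectory as sketched below. The exact expression for the latency lower bound is not reproduced in this text; the sketch assumes it equals the latency upper bound minus the jitter upper bound. The names `latency_window`, `in_window`, `t_max`, and `j_max`, and the numeric values, are hypothetical.

```python
def latency_window(t_max, j_max):
    """Return (lower, upper) latency bounds; lower = t_max - j_max is an assumption."""
    if j_max < 0 or j_max > t_max:
        raise ValueError("jitter bound must lie between 0 and the latency bound")
    return t_max - j_max, t_max

def in_window(end_to_end_latency, t_max, j_max):
    """True if a packet's accumulated latency falls inside the constraint window."""
    lower, upper = latency_window(t_max, j_max)
    return lower <= end_to_end_latency <= upper

def window_hit_rate(latencies, t_max, j_max):
    """Fraction of sampled trajectories whose end-to-end latency is inside the window."""
    hits = sum(1 for latency in latencies if in_window(latency, t_max, j_max))
    return hits / len(latencies)

# Hypothetical service: latency at most 10 ms, jitter at most 2 ms.
window = latency_window(10.0, 2.0)
checks = [in_window(x, 10.0, 2.0) for x in (7.0, 9.0, 10.0, 11.0)]
rate = window_hit_rate([7.0, 9.0, 10.0, 11.0], 10.0, 2.0)
```

Under this assumption, any two packets that arrive inside the window differ in latency by at most the jitter bound, which is why bounding latency from below as well as above controls jitter.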
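The linear approximation within a trust region can be illustrated with the simplified step below. It is a sketch under strong assumptions, not the solver used by the invention: the objective is replaced by its gradient `grad`, the KL divergence is replaced by a diagonal quadratic form `curvature_diag`, and the latency constraints are ignored entirely. All names and numbers are hypothetical.

```python
import math

def trust_region_step(grad, curvature_diag, delta):
    """Maximise grad . d subject to 0.5 * sum(h_i * d_i ** 2) <= delta.

    Closed-form solution for a linear objective and a diagonal quadratic
    trust region: d_i = sqrt(2 * delta / s) * g_i / h_i,
    where s = sum(g_i ** 2 / h_i).
    """
    s = sum(g * g / h for g, h in zip(grad, curvature_diag))
    if s == 0.0:
        return [0.0] * len(grad)  # zero gradient: stay at the current parameters
    scale = math.sqrt(2.0 * delta / s)
    return [scale * g / h for g, h in zip(grad, curvature_diag)]

theta = [0.0, 0.0]                       # current strategy parameters (hypothetical)
step = trust_region_step([3.0, 4.0], [1.0, 1.0], 0.5)
theta_next = [t + d for t, d in zip(theta, step)]

# The step sits exactly on the trust-region boundary.
used_budget = 0.5 * sum(h * d * d for h, d in zip([1.0, 1.0], step))
```

A full solver would additionally linearize the latency window constraints and, if the proposed step violates them, project or shrink the step before accepting it.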
[0140] In the specific implementation of step 7, steps 3 to 6 are repeated until the maximum number of training rounds is reached:
[0141] Specifically, steps 3 to 6 constitute one round of training. Through repeated iterative search of the strategy, the security reinforcement learning agent can be fully trained and generate the optimal service function chain orchestration and scheduling strategy.
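The overall iteration of steps 3 to 6 can be outlined as below. This is a structural sketch only: the four callables are placeholders for the sampling, advantage, surrogate bound, and update procedures, and the toy stand-ins used to exercise the loop are hypothetical and carry no meaning beyond showing the control flow.

```python
def train(theta, max_rounds, trajectories_per_round,
          sample, compute_advantages, build_surrogates, update):
    """Repeat steps 3 to 6 until the maximum number of training rounds is reached."""
    history = []
    for _round in range(max_rounds):
        # Step 3: generate the specified number of trajectories with the current strategy.
        trajectories = [sample(theta) for _ in range(trajectories_per_round)]
        # Step 4: obtain the advantage function from the trajectories.
        advantages = compute_advantages(trajectories)
        # Step 5: construct the surrogate upper and lower bound functions.
        bounds = build_surrogates(advantages, theta)
        # Step 6: update the strategy parameters for the next round.
        theta = update(theta, bounds)
        history.append(theta)
    return theta, history

# Toy stand-ins that only exercise the loop structure.
final_theta, history = train(
    theta=0,
    max_rounds=3,
    trajectories_per_round=2,
    sample=lambda theta: [theta],
    compute_advantages=lambda trajectories: len(trajectories),
    build_surrogates=lambda advantages, theta: advantages,
    update=lambda theta, bounds: theta + bounds,
)
```

In a real implementation `sample` would run one end-to-end packet rollout and `update` would solve the constrained optimization problem of step 6.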
[0142] This invention employs a secure reinforcement learning framework that assigns each service function chain a series of secure reinforcement learning agents guided by both reward and constraint functions. Each agent dynamically plans and controls, in real time, the service function deployment location, transmission path, processing bandwidth, and transmission bandwidth of each data packet, based on the specific quality-of-service requirements of the service corresponding to the service function chain and the local network state observed at the current moment. This constraint-oriented optimization mechanism improves network bandwidth resource utilization while meeting deterministic end-to-end latency and jitter performance requirements, providing deterministic and efficient network services for latency-sensitive services in industrial scenarios.
[0143] Corresponding to the aforementioned embodiment of a service function chain orchestration and scheduling method, the present invention also provides an embodiment of a service function chain orchestration and scheduling apparatus.
[0144] Referring to Figure 4, the present invention provides a service function chain orchestration and scheduling apparatus, including one or more processors configured to implement the service function chain orchestration and scheduling method of the above embodiments.
[0145] An embodiment of the service function chain orchestration and scheduling device of the present invention can be applied to any device with data processing capabilities, such as a computer or other similar device. The device embodiment can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical device it is formed by the processor of the data processing device on which it resides loading the corresponding computer program instructions from non-volatile memory into memory for execution. From a hardware perspective, Figure 4 shows a hardware structure diagram of a device with data processing capabilities on which the service function chain orchestration and scheduling device of the present invention is located. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 4, the data processing device in the embodiment may also include other hardware depending on its actual function, which will not be described in detail here.
[0146] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0147] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0148] This invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements a service function chain orchestration and scheduling method as described in the above embodiments.
[0149] The computer-readable storage medium can be an internal storage unit of any data processing device described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of such a data processing device, such as a plug-in hard disk, Smart Media Card (SMC), SD card, or flash card equipped on the device. Furthermore, the computer-readable storage medium can include both the internal storage units and the external storage devices of a data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.
[0150] The above embodiments are only used to illustrate the design concept and features of the present invention, and their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly. The protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made based on the principles and design ideas disclosed in the present invention are within the protection scope of the present invention.
Claims
1. A service function chain orchestration and scheduling method, characterized in that it includes the following steps:
(1) Initialize the parameters of the service function chain orchestration and scheduling strategy, wherein the strategy is represented as a parameterized function from the local network feature state set to the service function chain orchestration and scheduling action set, k is the round, and the strategy parameters belong to the parameter set of service function chain orchestration and scheduling strategies;
(2) Set the maximum number of training rounds for the security reinforcement learning agent;
(3) In each round, the agent first sets the number of trajectories to be sampled in that round and, according to the current service function chain orchestration and scheduling strategy, generates the specified number of service function chain orchestration and scheduling trajectories;
(4) The agent obtains the advantage function based on the generated service function chain orchestration and scheduling trajectories;
(5) Construct surrogate upper and lower bound functions based on the advantage function;
(6) Update the parameters of the current service function chain orchestration and scheduling strategy based on the surrogate upper and lower bound functions, and generate the strategy to be followed in the next round;
(7) Repeat steps (3) to (6) until the maximum number of training rounds is reached, to generate the optimal service function chain orchestration and scheduling strategy;
In step (3), generating the specified number of service function chain orchestration and scheduling trajectories includes the following sub-steps:
(3.1) The security reinforcement learning agent acquires the local network feature state at the current moment;
(3.2) Based on the local network feature state, the agent decides on and executes a service function chain orchestration and scheduling action according to the currently followed strategy;
(3.3) After the service function chain orchestration and scheduling action has been executed, the agent obtains the reward value and cost value at the current moment and the local network feature state observed at the next step;
(3.4) Based on the observed local network feature state, the next service function chain orchestration and scheduling action is obtained, and steps (3.2) to (3.3) are repeated until the data packet reaches the destination node of its service function chain, at which point the training round ends and the entire trajectory is generated and recorded;
The reward value characterizes the real network node resources and real network link resources consumed by the service function chain orchestration and scheduling action; after each complete trajectory is generated, the sum of the reward values obtained characterizes the end-to-end resource consumption of the service function chain. The cost value characterizes the processing latency, transmission latency, and propagation latency incurred by the service function chain orchestration and scheduling action; after each complete trajectory is generated, the sum of the cost values obtained characterizes the end-to-end latency of the service function chain.
2. The service function chain orchestration and scheduling method according to claim 1, characterized in that the initialized strategy is represented as the strategy of the initial round; the local network feature state includes the service function position of the data packet on its service function chain, its node position in the real network, and the remaining available resource status of the real network nodes and real network links at the current moment; and the service function chain orchestration and scheduling action includes the service function deployment position and processing bandwidth set for the data packet, as well as the transmission path and transmission bandwidth.
3. The service function chain orchestration and scheduling method according to claim 1, characterized in that the local network feature state is represented by the following components: the current position of the data packet managed by the security reinforcement learning agent on its service function chain and its current position in the real network; and the remaining available resource capacity, at the current time, of the real network nodes and real network links of the security reinforcement learning agent; wherein the representation further refers to the service functions in the service function chain, the number of service functions, and the number of real network nodes;
The service function chain orchestration and scheduling action is represented by the following components: the deployment decisions of the security reinforcement learning agent for the service function and for the virtual link; and the processing bandwidth allocated to the data packet at the service function and the transmission bandwidth allocated to it on the virtual link;
The reward value follows a reward function defined in terms of: the set of all latent states; the unknown latent state corresponding to the local network feature state at the current moment; the set of physical nodes of the real network; the set of physical links of the real network; the amount of real network node resources required when the data packet is allocated its processing bandwidth at the service function; and the amount of real network link resources required when the data packet is allocated its transmission bandwidth on the virtual link;
The cost value follows a cost function defined in terms of: the processing latency incurred by the data packet at the service function; the size of the data packet; the transmission latency spent by the data packet on each real network link corresponding to the virtual link; and the propagation latency incurred by the data packet on the real network link.
4. The service function chain orchestration and scheduling method according to claim 1, characterized in that the consumed real network node resources cannot exceed the remaining available resource capacity, at the current moment, of the real network node corresponding to the service function deployment location; and the consumed real network link resources cannot exceed the remaining available resource capacity, at the current moment, of the real network link corresponding to the transmission path.
5. The service function chain orchestration and scheduling method according to claim 3, characterized in that step (4) includes the following sub-steps: (4.1) The security reinforcement learning agent obtains the state value based on the generated trajectory; the state value is calculated from the discount factor $\gamma$, the discount term $\gamma^{i'-1}$ that decreases progressively as service function chain orchestration and scheduling progresses, and the reward value $r_{i'}$ obtained when the agent takes action $a_{i'}$ in latent state $s_{i'}$; (4.2) Construct an action value function based on the state value; in the action value function, $\tau$ represents the trajectory and the expectation denotes an averaging operation; (4.3) Construct an advantage function based on the state value and action value functions.
6. The service function chain orchestration and scheduling method according to claim 5, characterized in that the surrogate upper and lower bound functions are constructed from the following quantities: $\epsilon$, the highest cumulative value of the average advantage function that can be generated among all trajectories when action a is executed under any generated service function chain orchestration and scheduling strategy $\pi_\theta$; and the KL divergence between the arbitrarily generated service function chain orchestration and scheduling strategy $\pi_\theta$ and the current strategy.
7. The service function chain orchestration and scheduling method according to claim 6, characterized in that the service function chain orchestration and scheduling strategy to be followed in the next round is generated by solving a constrained optimization problem, in which: the reward term denotes the reward that can be generated when the security reinforcement learning agent performs service function chain orchestration and scheduling under the currently followed strategy, and corresponds to the amount of real network resources consumed by the agent; the cost term denotes the cost return generated when service function chain orchestration and scheduling is performed under the currently followed strategy, and corresponds to the end-to-end latency accumulated by the agent, where $\xi_{i'}$ is the cost value incurred when the agent takes action $a_{i'}$ in latent state $s_{i'}$; and a lower bound on end-to-end latency is set for the service corresponding to the service function chain, based on the upper bound of end-to-end jitter required by the service and the upper bound of end-to-end latency required by the service.
8. A service function chain orchestration and scheduling device, characterized in that it includes one or more processors for implementing the service function chain orchestration and scheduling method according to any one of claims 1-7.
9. A computer-readable storage medium having a program stored thereon, characterized in that, when executed by a processor, the program implements the service function chain orchestration and scheduling method according to any one of claims 1-7.