Method, system, device and storage medium for rate control based on reinforcement learning

By constructing an agent that interacts with the encoder in video encoding, employing a gating mechanism to calculate reward values, and utilizing rate-distortion curves and bit-rate-distortion gain to optimize the policy network, the problems of low sample efficiency and high learning difficulty in reinforcement learning are solved, achieving efficient bit rate control and improved video encoding quality.

CN122205083A · Pending · Publication Date: 2026-06-12 · UNIV OF SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
UNIV OF SCI & TECH OF CHINA
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing reinforcement learning methods for bitrate control in video coding suffer from low sample efficiency and high learning difficulty for agents. The self-competition mechanism cannot accurately quantify RD performance, resulting in a complex mapping relationship between coding state and QP.

Method used

By constructing an agent that interacts with the encoder, a gating mechanism is used to calculate the reward value. The rate-distortion curve is calculated using the actual bit rate and video quality, and the bit rate-distortion gain is used as the final reward value. This optimizes the policy network and the value network, improving sample efficiency and performance.

Benefits of technology

It significantly improves the sample efficiency and final performance of reinforcement learning-optimized bitrate control strategies, and enhances the bitrate accuracy of the encoder and video quality.


Figure CN122205083A_ABST
Patent Text Reader

Abstract

The application discloses a bitrate control method, system, device and storage medium based on reinforcement learning. In these solutions, a reward function is designed that accurately quantifies the performance of the current agent's bitrate control strategy: the rate-distortion curve of the agent is calculated from the actual bitrate and the video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated relative to the rate-distortion curve of the encoder. The agent can thus quantify the RD performance of its bitrate control strategy during reinforcement learning, which significantly improves the sample efficiency and final performance of reinforcement learning applied to bitrate control. After training is completed, the agent can better guide the encoder, so that bitrate accuracy is improved while video quality is ensured.

Description

Technical Field

[0001] This invention relates to the field of video coding technology, and in particular to a bitrate control method, system, device and storage medium based on reinforcement learning.

Background Technology

[0002] In video encoding, users typically set a target bitrate for each video segment and require the compressed video sequence to be coded at that target bitrate. To achieve this, traditional encoders set a quantization parameter (QP) for each frame to control the bits consumed by that frame. Generally, the smaller the QP of a frame, the more bits it consumes and the less distortion is produced after encoding. Bitrate control algorithms need to achieve the best trade-off between bitrate and distortion (RD) while meeting the user-specified bitrate limit, that is, to maximize RD performance. Formally, the bitrate control problem can be expressed as follows: $\min_{\{QP_i\}} D \quad \text{s.t.} \quad R \le R_T$; where $D$ represents the distortion, $R$ represents the actual bit rate during encoding, $R_T$ is the target bitrate, and s.t. denotes the constraint to be met.

[0003] This is a constrained optimization problem in which the parameters to be optimized are the QPs of the frames. By introducing a Lagrange multiplier $\lambda$, the above problem can be transformed into an unconstrained optimization problem: $\min_{\{QP_i\}} J, \quad J = D + \lambda R$; The above equation is also called the rate-distortion function, where $J$ is the Lagrange cost.
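As a small illustration of the Lagrange cost $J = D + \lambda R$, the sketch below picks the QP with the lowest cost from a list of candidates. The function name `best_qp` and the candidate numbers are hypothetical and are not taken from the application; they only show how the chosen QP shifts as $\lambda$ changes.

```python
def best_qp(candidates, lam):
    """Return the QP minimizing J = D + lam * R.

    candidates: list of (qp, distortion, rate) tuples (illustrative values).
    lam: Lagrange multiplier trading rate against distortion.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]

# Hypothetical measurements: lower QP gives less distortion but more bits.
candidates = [(22, 1.0, 900.0), (27, 2.0, 500.0), (32, 4.0, 300.0)]

low_lambda_choice = best_qp(candidates, 0.001)   # rate is cheap
mid_lambda_choice = best_qp(candidates, 0.005)
high_lambda_choice = best_qp(candidates, 0.02)   # rate is expensive
```

A larger multiplier penalizes rate more heavily and pushes the choice toward a higher QP, which is why a suitable multiplier depends on the content and the target bitrate.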

[0004] In video coding, the RD performance of a coding method can be measured by its RD curve, as shown in Figure 1. In practice, each point on the curve comes from a different set of coding parameters. By comparing the RD curves of a new method and a baseline method under different coding parameter configurations, the coding performance of the two methods can be compared directly. When comparing the objective quality of two coding methods, distortion is usually represented by the Peak Signal-to-Noise Ratio (PSNR). Meanwhile, Reference 1 (Li B, Li H, Li L, Zhang J. Lambda domain rate control algorithm for high efficiency video coding[J]. IEEE Transactions on Image Processing, 2014, 23(9): 3841-3854.) points out that the RD curve of a traditional encoder can usually be fitted well by a straight line in the plane whose vertical axis is the PSNR and whose horizontal axis is the logarithm of r, where r is the bit rate.
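The straight-line fit mentioned above can be sketched with an ordinary least-squares fit of PSNR against the logarithm of the bit rate. This is a minimal sketch: the function name `fit_rd_line`, the base-10 logarithm, and the sample RD points are assumptions made for illustration.

```python
import math

def fit_rd_line(bitrates_kbps, psnrs_db):
    """Least-squares fit of PSNR = a + b * log10(R); returns (a, b)."""
    xs = [math.log10(r) for r in bitrates_kbps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(psnrs_db) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, psnrs_db))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical RD points that lie exactly on PSNR = 20 + 5 * log10(R).
rates = [100.0, 1000.0, 10000.0]
psnrs = [30.0, 35.0, 40.0]
a, b = fit_rd_line(rates, psnrs)
```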

[0005] In video coding, the difference in RD performance between two methods is usually measured by the BD-rate (Bjøntegaard delta rate, i.e., the bitrate-distortion gain). A negative BD-rate indicates that the new method has better RD performance than the baseline method; conversely, a positive BD-rate indicates that the coding quality of the new method is inferior to that of the baseline method. Reference 2 (Herglotz C, Och H, Meyer A, Ramasubbu G, Eichermüller L, Kränzler M, Brand F, Fischer K, Nguyen D T, Regensky A, et al. The Bjøntegaard bible: Why your way of comparing video codecs may be wrong[J]. IEEE Transactions on Image Processing, 2024, 33: 987-1001.) proposes a method for calculating BD-rate, which, after simplification, is expressed as the following formula: $\text{BD-rate} = 10^{\frac{1}{D_H - D_L}\int_{D_L}^{D_H}\left(r_A(D) - r_B(D)\right)\,dD} - 1$; In this formula, A and B represent the new method and the baseline method, respectively; $D_L$ and $D_H$ represent the minimum and maximum values of the vertical axis in the PSNR overlap region of the new method and the baseline method; $r_A = \log_{10} R_A$ and $r_B = \log_{10} R_B$ are the logarithmically transformed bitrates of the two methods; and $R_A$, $R_B$ are the bitrates of the two methods.
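A BD-rate of this kind can be sketched numerically as follows, assuming the common Bjøntegaard form: the average gap between the two log10-bitrate curves over the shared PSNR interval, mapped back through a power of ten. The function `bd_rate`, the two linear curves and the interval are hypothetical illustrations, not the applicant's code.

```python
def bd_rate(r_a, r_b, d_low, d_high, steps=1000):
    """Approximate 10 ** mean(r_a(D) - r_b(D)) - 1 over [d_low, d_high].

    r_a and r_b map a PSNR value D to log10(bitrate) for methods A and B.
    """
    width = (d_high - d_low) / steps
    total = 0.0
    for i in range(steps):
        d = d_low + (i + 0.5) * width          # midpoint rule
        total += (r_a(d) - r_b(d)) * width
    mean_gap = total / (d_high - d_low)
    return 10.0 ** mean_gap - 1.0

# Hypothetical linear RD curves PSNR = a + b * log10(R), inverted to r(D).
baseline = lambda d: (d - 20.0) / 5.0          # method B
agent = lambda d: (d - 20.5) / 5.0             # method A: 0.5 dB better at equal rate
saving = bd_rate(agent, baseline, 30.0, 40.0)
```

With these illustrative curves the result is negative, meaning method A needs less bitrate than method B for the same PSNR.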

[0006] When performing rate control, the optimal QP (quantization parameter) of the current frame is influenced by the QPs of previously encoded frames. The QPs of different frames therefore influence each other, and the rate control problem is formally a sequential decision problem. In control theory, sequential decision problems can generally be modeled as a Markov decision process (MDP). An MDP typically consists of five elements: state, action, reward function, transition probabilities, and reward discount factor. Specifically, in the rate control problem, the state defines the encoding state of the current frame during rate control, such as the remaining target bits at the current frame and the texture features of the current frame; the action is the frame-level QP; the reward function measures the quality of the encoding; the transition probabilities express the transition relationship between the encoding states of the current and next frames; and the reward discount factor expresses the degree to which future rewards affect the current encoding performance. Reinforcement learning, as a common method for solving MDPs, can naturally be applied to rate control.

[0007] As shown in Figure 2, a reinforcement learning scheme typically consists of two parts: an agent and an environment. When reinforcement learning is used to solve the bitrate control problem, the environment mainly consists of the encoder, while the agent is the control unit that determines the QP according to a predefined algorithm. During the encoding process, the agent decides the QP of each frame based on the encoding state of the current frame. After receiving the QP for the current frame, the encoder encodes the current frame. After encoding is complete, the environment returns a reward to the agent, which measures the agent's RD performance in encoding the current frame.

[0008] The goal of reinforcement learning is to maximize the agent's cumulative reward by optimizing the agent. During optimization, the agent's decisions are made to maximize the reward function. Since the reward function is the agent's only optimization signal, its quality determines the effectiveness of the rate control strategy ultimately learned by the agent.

[0009] One term in the above BD-rate formula cannot be calculated directly, and no existing work uses this formula or a derivation of it as a reward function. Furthermore, the Lagrange multiplier $\lambda$ in the aforementioned rate-distortion function depends on the video content and the target bitrate, and is difficult to determine before the agent has been optimized with reinforcement learning. Therefore, the rate-distortion function cannot be used directly as the reward function for reinforcement learning. To address this issue, Reference 3 (Mandhane A, Zhernov A, Rauh M, Gu C, Wang M, Xue F, Shang W, Pang D, Claus R, Chiang CH, et al. MuZero with self-competition for rate control in VP9 video compression [EB/OL]. arXiv:2202.06626, 2022.02.14) proposes a new reward function: the self-competition mechanism. The self-competition mechanism guides the agent to improve its bitrate control strategy by comparing the RD performance of the current version of the agent with that of historical versions. Specifically, if the agent improves its performance in satisfying the bitrate constraint during the current encoding, it receives a reward of +1; otherwise, it receives a reward of -1, thus encouraging the agent to evolve towards satisfying the bitrate constraint and improving the objective quality of video encoding. However, the reward design based on the self-competition mechanism has the following drawbacks: (1) Provided the bitrate constraint is satisfied, no matter how far the agent's objective quality exceeds its historical average, the self-competition mechanism only feeds back +1 or -1. The agent therefore cannot know the magnitude of the improvement in video coding quality, which makes it difficult for the agent to learn the relative merits of different bitrate control strategies from the feedback signal. If sample efficiency is measured by the number of training steps required to reach a given RD performance, training the agent with the self-competition mechanism often suffers from low sample efficiency.

[0010] (2) The agent needs to learn the mapping from the encoding state and the QP of the current frame to the reward. Since the self-competition mechanism has only two reward values, +1 and -1, the same reward value may correspond to many very different encoding states and QPs. This leads to a complex relationship between the encoding state, the QP and the reward, making this mapping difficult for the agent to learn and, in turn, making the agent difficult to train with reinforcement learning algorithms.

[0011] In view of this, the present invention is hereby proposed.

Summary of the Invention

[0012] The purpose of this invention is to provide a bitrate control method, system, device, and storage medium based on reinforcement learning, which can significantly improve the sample efficiency and final performance of problems using reinforcement learning to optimize bitrate control strategies.

[0013] The objective of this invention is achieved through the following technical solution: A reinforcement learning-based rate control method includes: The process involves constructing an agent and interacting with an encoder to complete agent training. This includes: using the agent to determine quantization parameters based on encoding features to guide the encoder's encoding process; determining whether the agent can meet the bitrate constraint requirements based on the actual bitrate and the target bitrate; and using a gating mechanism to calculate the final reward value. If the bitrate constraint requirements are met, the agent's rate-distortion curve is calculated using the actual bitrate and video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated as the final reward value based on the encoder's rate-distortion curve. Otherwise, the final reward value is calculated using one or more preset hyperparameters. The encoding features, quantization parameters, and reward value are used as interaction trajectories, where the reward value of the last frame is the final reward value, and the reward value of the remaining frames is 0. When the number of interaction trajectories reaches a set requirement, a batch of interaction trajectories is sampled, the loss function of the reinforcement learning algorithm is calculated, and the agent's parameters are updated. The trained agent determines the quantization parameters for the next frame based on the input encoded features and feeds them back to the encoder.

[0014] A bitrate control system based on reinforcement learning, comprising: an agent construction and training unit, used to construct an agent and interact with the encoder to complete agent training. This includes: using the agent to determine quantization parameters based on encoding features to guide the encoder's encoding process; determining whether the agent can meet the bitrate constraint requirements based on the actual bitrate and the target bitrate; and calculating the final reward value using a gating mechanism. If the bitrate constraint requirements are met, the agent's rate-distortion curve is calculated using the actual bitrate and video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated as the final reward value based on the encoder's rate-distortion curve. Otherwise, the final reward value is calculated using one or more preset hyperparameters. The encoding features, quantization parameters, and reward value are used as interaction trajectories, where the reward value of the last frame is the final reward value, and the reward values of the remaining frames are 0. When the number of interaction trajectories reaches a set requirement, a batch of interaction trajectories is sampled, the loss function of the reinforcement learning algorithm is calculated, and the agent's parameters are updated. A bitrate control unit, used to determine the quantization parameters of the next frame from the input encoding features by means of the trained agent and to feed them back to the encoder.

[0015] A processing device includes: one or more processors; and a memory for storing one or more programs; When the one or more programs are executed by the one or more processors, the one or more processors implement the aforementioned method.

[0016] A readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned method.

[0017] As can be seen from the technical solution provided by the present invention, a reward function that can accurately quantify the performance of the current agent's bitrate control strategy is designed. That is, the rate-distortion curve of the agent is calculated by using the actual bitrate and video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated based on the encoder's rate-distortion curve. This allows the agent to quantify the RD performance of its bitrate control strategy during reinforcement learning, which significantly improves the sample efficiency and final performance of reinforcement learning applied to bitrate control. After training, the agent can better guide the encoder to perform encoding work, thereby improving bitrate accuracy while ensuring video quality.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a schematic diagram of the RD curve of video encoding, provided as background technology of this invention.

[0020] Figure 2 is the overall block diagram for solving the bitrate control problem using reinforcement learning, provided as background technology of this invention.

[0021] Figure 3 is a schematic diagram of a reinforcement learning-based rate control method provided in an embodiment of the present invention.

[0022] Figure 4 is a schematic diagram illustrating the derivation of the reward function provided in an embodiment of the present invention.

[0023] Figure 5 is a schematic diagram of the network structure of Example 1 provided in an embodiment of the present invention.

[0024] Figure 6 is a schematic diagram of the network structure of Example 2 provided in an embodiment of the present invention.

[0025] Figure 7 is a schematic diagram of a reinforcement learning-based bitrate control system provided in an embodiment of the present invention.

[0026] Figure 8 is a schematic diagram of a processing device provided in an embodiment of the present invention.

Detailed Implementation

[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.

[0028] First, the following explanations are provided for the terms that may be used in this article: The terms "comprising," "including," "containing," "having," or other similar semantic descriptions should be interpreted as non-exclusive inclusion. For example, including a technical feature element (such as raw material, component, ingredient, carrier, dosage form, material, size, part, component, mechanism, device, step, process, method, reaction conditions, processing conditions, parameter, algorithm, signal, data, product or article of manufacture, etc.) should be interpreted as including not only the expressly listed technical feature element, but also other technical feature elements that are not expressly listed and are well-known in the art.

[0029] The term "composed of" excludes any technical features not expressly listed. When used in a claim, it closes the claim to exclude all technical features other than those expressly listed, except for associated conventional impurities. If the term appears only in a clause of a claim, it limits the claim to the elements expressly listed in that clause; elements recited in other clauses are not excluded from the overall claim.

[0030] The following provides a detailed description of a reinforcement learning-based rate control method, system, device, and storage medium provided by this invention. Contents not described in detail in the embodiments of this invention are prior art known to those skilled in the art. Unless otherwise specified in the embodiments of this invention, conditions are performed according to conventional conditions or manufacturer recommendations. Reagents or instruments used in the embodiments of this invention, unless otherwise specified, are all commercially available products.

[0031] Example 1. As shown in Figure 3, the overall process of the reinforcement learning-based rate control method provided in this embodiment of the invention mainly includes the following steps: Step 1: Build an agent and interact with the encoder to complete agent training.

[0032] In this embodiment of the invention, the agent mainly includes a policy network and a value network. The training process is as follows: the agent makes decisions on quantization parameters based on encoding features to guide the encoder's encoding process. When the actual bitrate after encoding meets the bitrate constraint requirements, the rate-distortion curve of the agent is calculated based on the actual bitrate after encoding and the video quality. The bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated as the final reward value based on the encoder's rate-distortion curve. The encoding features, quantization parameters, and reward value are used as interaction trajectories, where the reward value of the last frame is the final reward value, and the reward value of the remaining frames is 0. When the number of interaction trajectories reaches the set requirement, a batch of interaction trajectories is sampled, the loss function of the reinforcement learning algorithm is calculated, and the parameters of the agent are updated. After training is completed, the policy network is retained.

[0033] (1.1) Randomly sample video sequences and their corresponding target bitrates from the pre-prepared training dataset.

[0034] (1.2) The agent makes decisions based on the coding features of each frame and outputs the corresponding quantization parameters. The encoder then uses the quantization parameters to encode the corresponding video frames, repeating this process until the encoder completes the encoding of the last frame and determines the actual bitrate and video quality of the encoded video sequence. Based on the actual bitrate and target bitrate, it is determined whether the agent can meet the bitrate constraint requirements. A gating mechanism is used to calculate the final reward value: if the bitrate constraint requirements are met, the rate-distortion curve of the agent is calculated using the actual bitrate and video quality, and the bit rate-distortion gain reflecting the agent's bitrate control strategy is calculated as the final reward value based on the encoder's rate-distortion curve; otherwise, the final reward value is calculated using one or more preset hyperparameters.

[0035] Furthermore, the encoded features of each frame, the quantization parameters output by the agent, and the reward value are stored as an interaction trajectory in the experience buffer. Specifically, for the first frame, the corresponding quantization parameters can be initialized through the bitrate control algorithm built into the encoder to complete the encoding of the first frame. The agent uses the encoded features of the first frame to predict the quantization parameters of the second frame, and then hands them over to the encoder to encode the second frame. This process is repeated until the encoder completes the encoding of the last frame. That is, in this process, the agent is responsible for predicting the quantization parameters of all video frames except the first frame. The reward value of the second frame to the penultimate frame is 0, and the reward value of the last frame is the final reward value.
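The interaction described in (1.2) can be sketched as a rollout in which only the last step carries a non-zero reward. Everything named here is hypothetical: `toy_encode` is a stand-in for a real encoder, `policy` stands for the policy network, and `final_reward_fn` stands for the gated reward; none of them reproduce the applicant's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    features: List[float]   # encoding features the agent observed
    qp: int                 # quantization parameter chosen by the agent
    reward: float           # 0 for every step except the last one

def toy_encode(frame_index, qp):
    """Hypothetical stand-in for an encoder: returns (bits, psnr_db, features)."""
    bits = 5000.0 / (1 + qp)            # fewer bits at higher QP
    psnr = 50.0 - 0.4 * qp              # lower quality at higher QP
    return bits, psnr, [bits, psnr, float(frame_index)]

def rollout(num_frames, policy, initial_qp, final_reward_fn):
    steps, psnrs = [], []
    # First frame: QP comes from the encoder's built-in rate control.
    bits, psnr, features = toy_encode(0, initial_qp)
    total_bits = bits
    psnrs.append(psnr)
    for i in range(1, num_frames):
        qp = policy(features)           # agent predicts the QP of frame i
        steps.append(Step(features, qp, 0.0))
        bits, psnr, features = toy_encode(i, qp)
        total_bits += bits
        psnrs.append(psnr)
    if steps:
        # Sparse reward: only the last agent decision gets the final reward.
        steps[-1].reward = final_reward_fn(total_bits, sum(psnrs) / len(psnrs))
    return steps

traj = rollout(5, policy=lambda f: 30, initial_qp=28,
               final_reward_fn=lambda bits, psnr: 1.0)
```

In a training setup along these lines, each returned list would be appended to the experience buffer as one interaction trajectory.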

[0036] (1.3) When the number of interaction trajectories in the experience buffer reaches the set requirement, a batch of interaction trajectories are sampled from them, the loss function of the reinforcement learning algorithm is calculated, and the parameters of the network are updated using the gradient descent method.

[0037] For example, the reinforcement learning algorithm can be the PPO algorithm (proximal policy optimization algorithm). Of course, in practical applications, users can adjust it to other reinforcement learning algorithms according to the situation or experience. This invention does not impose specific limitations.
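For readers unfamiliar with PPO, the sketch below shows the standard clipped surrogate policy loss in plain Python. It covers only the policy term (the value loss and any entropy bonus are omitted); the function name `ppo_clip_loss` and the clipping range 0.2 are illustrative assumptions, not values stated in the application.

```python
import math

def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (policy term only).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)     # negated because optimizers minimize

unchanged_policy_loss = ppo_clip_loss([0.0, 0.0], [0.0, 0.0], [1.0, -1.0])
clipped_loss = ppo_clip_loss([0.5], [0.0], [1.0])
```

In the second call the probability ratio exceeds 1 + clip_eps with a positive advantage, so the clipped term is the one that counts.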

[0038] In this embodiment of the invention, whether the agent meets the bitrate constraint is determined from the actual bitrate and the target bitrate, and a gating mechanism is used to calculate the final reward value, which is expressed as follows: […]; where Reward is the final reward value; h and […] are two preset hyperparameters; […] is the historical exponential moving average of the agent's bitrate excess as observed in the current round; the maximum is taken over […] and the bitrate excess […] observed during the current round of encoding; […] indicates that the encoded video sequence S meets the bitrate constraint requirement; […] is the historical record of the bitrate constraint; max denotes taking the maximum of the values in parentheses; k denotes the current round, i.e., the k-th time the video sequence is encoded at the target bitrate […]; […] is the slope of the encoder's rate-distortion curve; c is the coefficient of the direct proportional relationship between the slope b of the agent's rate-distortion curve and the slope of the encoder's rate-distortion curve; the rate-distortion curve is the curve in a plane coordinate system whose vertical axis is the peak signal-to-noise ratio (PSNR) and whose horizontal axis is the logarithm of the bit rate; […] is the video quality, i.e., the PSNR produced when the video sequence S is encoded at the target bitrate […]; […] and […] are the minimum and maximum values of the vertical axis in the PSNR overlap region of the rate-distortion curves of the agent and the encoder, and […] is their difference; and […], […] are the minimum and maximum horizontal-axis values of the encoder's rate-distortion curve over that PSNR overlap region.

[0039] In this embodiment of the invention, […] can be represented as: […]; where […] is the bitrate excess observed during the current round of encoding; […] is the actual bitrate of the current round; […] and […] are the historical exponential moving averages of the agent's bitrate excess observed in the current round and the previous round, respectively; and […] is the weight. When k-1=0, […].
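An exponential moving average of the bitrate excess can be sketched as below. The exact formula is not recoverable from the text, so every detail is an assumption made for illustration: the excess is taken as the non-negative overshoot above the target, the weight is applied to the previous average, and the average starts at 0.

```python
def update_ema_excess(prev_ema, actual_bitrate, target_bitrate, weight=0.9):
    """Return (excess, new_ema) for one encoding round.

    Assumed form: excess = max(0, actual - target) and
    new_ema = weight * prev_ema + (1 - weight) * excess.
    """
    excess = max(0.0, actual_bitrate - target_bitrate)
    new_ema = weight * prev_ema + (1.0 - weight) * excess
    return excess, new_ema

# Three hypothetical rounds at a 1000 kb/s target, starting from ema = 0.
ema = 0.0
history = []
for actual in [1200.0, 1100.0, 1000.0]:
    excess, ema = update_ema_excess(ema, actual, 1000.0, weight=0.5)
    history.append((excess, ema))
```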

[0040] During the training process described above, the parameters of the policy network and the value network are updated. The specific methods can be referred to conventional techniques, and will not be elaborated in this invention. After training is completed, the policy network is retained.

[0041] Step 2: The trained agent determines the quantization parameters for the next frame based on the input encoding features and feeds them back to the encoder.

[0042] As mentioned earlier, after training is complete, only the policy network in the agent is retained. The policy network makes decisions on quantization parameters based on the encoding features, which in turn guides the encoder to encode the corresponding video frames. This process is repeated until the encoder completes the encoding of all video frames.

[0043] The above-mentioned solution provided by the embodiments of the present invention designs a reward function that can accurately quantify the performance of the current agent's bitrate control strategy. That is, the rate-distortion curve of the agent is calculated by using the actual bitrate and video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated based on the rate-distortion curve of the encoder. This allows the agent to quantify the RD performance of its bitrate control strategy during reinforcement learning, which significantly improves the sample efficiency and final performance of using reinforcement learning algorithms to optimize bitrate control strategy problems. After training, it can better guide the encoder to perform encoding work, thereby improving bitrate accuracy while ensuring video quality.

[0044] To more clearly demonstrate the technical solution and its effects provided by the present invention, the method provided by the embodiments of the present invention will be described in detail below with reference to specific examples.

[0045] I. Detailed introduction of the plan.

[0046] 1. Derivation of the principle of the reward function that directly reflects the BD-rate.

[0047] As described in the background art, the encoder's RD curve can be well fitted by a straight line in the plane whose horizontal axis is the logarithm of the bitrate and whose vertical axis is the PSNR. As shown in Figure 4, in this embodiment of the invention the agent is taken as the new method A (referred to as agent A) and the encoder as the baseline method B. The RD curves of the two in this plane are then: $D = a_A + b_A r_A$; $D = a_B + b_B r_B$; where $D$ is the PSNR, $a_A$ and $b_A$ are the intercept and slope of agent A's RD curve in this plane, and $a_B$ and $b_B$ are the intercept and slope of baseline method B's RD curve. As in the background art, the logarithmically transformed bitrates of the two methods are defined as $r_A = \log_{10} R_A$ and $r_B = \log_{10} R_B$, where $R_A$ and $R_B$ are the bitrates of the two methods.

[0048] The BD-rate is calculated over the interval $[D_L, D_H]$, the PSNR overlap region of the two curves. To calculate the BD-rate with the formula provided in Reference 2, the two linear equations above are inverted: $r_A = \frac{D - a_A}{b_A}$; $r_B = \frac{D - a_B}{b_B}$; therefore: $r_A - r_B = \frac{D - a_A}{b_A} - \frac{D - a_B}{b_B}$, with variable $D \in [D_L, D_H]$.

[0049] Let […] be an intermediate parameter denoting the midpoint of […] and […]. Then: […]; where […] and […] are the horizontal-axis values on the encoder's rate-distortion curve corresponding to […] and […], formally: […]; […].

[0050] Combining this with the formula from Reference 2 above gives: […].
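One property that makes a midpoint useful here: when both RD curves are straight lines, the gap between the two log-bitrate curves is itself linear in PSNR, so its average over an interval equals its value at the interval's midpoint. The sketch below checks this numerically. The intercepts, slopes, interval and function names are hypothetical values chosen for illustration.

```python
def mean_gap_closed_form(a_a, b_a, a_b, b_b, d_low, d_high):
    """Average of r_A(D) - r_B(D) over [d_low, d_high], evaluated at the midpoint.

    Each curve is PSNR = a + b * r, so r(D) = (D - a) / b.
    """
    d_mid = 0.5 * (d_low + d_high)
    return (d_mid - a_a) / b_a - (d_mid - a_b) / b_b

def mean_gap_numeric(a_a, b_a, a_b, b_b, d_low, d_high, steps=2000):
    """Same average computed by numerical integration (midpoint rule)."""
    width = (d_high - d_low) / steps
    total = 0.0
    for i in range(steps):
        d = d_low + (i + 0.5) * width
        total += ((d - a_a) / b_a - (d - a_b) / b_b) * width
    return total / (d_high - d_low)

closed = mean_gap_closed_form(21.0, 6.0, 20.0, 5.0, 30.0, 40.0)
numeric = mean_gap_numeric(21.0, 6.0, 20.0, 5.0, 30.0, 40.0)
```

The two values agree up to floating-point rounding, so under the linear assumption no numerical integration is needed to evaluate the averaged gap.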

[0051] Setting the intermediate parameters […] and […], the relationship between […] and […] can be deduced: […];

[0052] where […] and […] are the minimum and maximum horizontal-axis values of the encoder's rate-distortion curve over the PSNR overlap region of the rate-distortion curves of the agent and the encoder.

[0053] The BD-rate relationship is therefore obtained: […].

[0054] Assume that the slopes […] and […] satisfy a direct proportional relationship, that is: […]; where c is the coefficient of the direct proportional relationship between […] and […]. The above BD-rate relationship can then be further transformed into: […].

[0055] In the above BD-rate transformation formula, the slope […] can be calculated, before reinforcement learning is used to optimize the agent's bitrate control strategy, with the following formula: […].

[0056] Suppose the PSNR produced when agent A encodes the video sequence S at the target bitrate […] is […]. If the bitrate R produced by agent A after encoding the video sequence S is approximately equal to the target bitrate […], that is: […], where […] is a threshold that can be set in practice (for example, […] kb/s, i.e., kilobits per second), then […] can be calculated using the following formula: […];

[0057] The above BD-rate transformation formula establishes the relationship between the BD-rate and the result of agent A encoding the video sequence S (expressed by […]), and its value decreases monotonically as the BD-rate increases. It can therefore be used as the reward function for training the agent, and it quantifies the agent's BD-rate when encoding the video sequence S during training, thereby overcoming the inability of the self-competition mechanism in Reference 3 to quantify the agent's RD performance.

[0058] 2. Calculation of reward function based on gating mechanism.

[0059] When optimizing the agent's bitrate control strategy, the reward function given by the above BD-rate transformation formula can be calculated only if the bitrate R produced by agent A after encoding the video sequence S is approximately equal to the target bitrate. However, if the target bitrate specified by the user for encoding the video sequence S is small, R usually differs significantly from it, so the reward function given by the BD-rate transformation formula cannot be calculated and therefore cannot be applied directly in reinforcement learning algorithms that optimize bitrate control.

[0060] To address this issue, this invention introduces a gating mechanism. Specifically, if the agent meets the bitrate constraint requirement when encoding the video sequence S in the current round (i.e., ...), the reward is defined according to the above BD-rate transformation formula. Conversely, if the agent cannot meet the bitrate constraint requirement when encoding the video sequence S in the current round (i.e., ...), there are two cases: if the agent improves its performance in satisfying the bitrate constraint during the current round of encoding, a reward equal to a first preset value is assigned; otherwise, a reward of value h is assigned. Both values are hyperparameters. The above BD-rate transformation, combined with the gating mechanism, forms the final reward function, which can be formalized as: .

[0061] For example, it can be set , .
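The gating logic described above can be sketched as follows. This is a minimal illustration, not the patent's formula, which is not reproduced in this text. Three things are assumptions: measuring the bitrate excess as an absolute difference, judging "improvement" by comparing the current excess with a previous one, and all of the names (`bd_reward_fn`, `prev_excess`, `improve_reward`). The hyperparameter values in the usage line are likewise made up.

```python
def gated_reward(rate, target_rate, threshold, bd_reward_fn,
                 prev_excess, improve_reward, h):
    """Choose between a BD-rate-based reward and fixed fallback rewards.

    bd_reward_fn is a caller-supplied callable standing in for the
    BD-rate transformation formula (hypothetical placeholder).
    """
    excess = abs(rate - target_rate)  # assumed definition of the excess
    if excess <= threshold:
        # Bitrate constraint met: the BD-rate-based reward is computable.
        return bd_reward_fn()
    if excess < prev_excess:
        # Constraint missed, but closer to the target than before.
        return improve_reward
    # Constraint missed and no improvement.
    return h


# Hypothetical values: the constraint is missed, but the agent improved.
print(gated_reward(1040.0, 1000.0, 10.0, lambda: 0.3, 80.0, -0.5, -1.0))
```

Passing the BD-rate term as a callable keeps the gate itself independent of how that term is computed, so it is only evaluated when the bitrate constraint is met.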

[0062] 3. Agent training and bitrate control.

[0063] As previously introduced, the agent consists of a policy network and a value network. The reward value is calculated based on the aforementioned reward function, thereby constructing the interaction trajectory. Then, the policy network and value network are optimized by calculating the reinforcement learning loss function. After training, the policy network is retained for subsequent bitrate control. The specific training process can be found in the previous introduction and will not be repeated here.

[0064] II. Example Introduction.

[0065] 1. Example 1.

[0066] (1) The configuration in Example 1.

[0067] Example 1 primarily addresses the bitrate control problem in two-pass encoding with x265 (an open-source high-efficiency video encoder), using the PPO algorithm to optimize the bitrate control strategy. The features taken from x265 include quantization parameters, the number of encoded bits, buffer state information, and frame type information.

[0068] The network structure of the feature extraction module in Example 1 is shown in Figure 5. In Figure 5, the Broadcast mechanism expands the input by copying it along a specified dimension so that the concatenation operation can be carried out correctly. The Transformer Block uses the TrXL-I architecture from Reference 4 (Parisotto E, Song F, Rae J, Pascanu R, Gulcehre C, Jayakumar S, Jaderberg M, Kaufman RL, Clark A, Noury S, et al. Stabilizing transformers for reinforcement learning[C]. International Conference on Machine Learning, 2020: 7487-7498.), which is not elaborated here. MLP (Multilayer Perceptron) refers to a fully connected neural network. The two input feature sequences are those from the first and second encoding passes, respectively. T is the number of frames in the encoded video sequence S, t is the index of the current frame, and the symbol ⊕ denotes concatenation.

[0069] The encoding features output by the feature extraction module are fed into the value network and the policy network, which consist of two consecutive fully connected layers with 64 neurons each and which output the predicted value and the quantization parameter (QP), respectively. In Example 1, the QP range is [0, 51]. Optimization uses the AdamW optimizer (adaptive moment estimation with decoupled weight decay).

[0070] (2) The process in Example 1.

[0071] Step 1: Initialize the policy network parameters, the value network parameters, and the experience buffer, and prepare a training dataset containing video sequences and their target bitrates. In Example 1, low-resolution sequences from BVI-DVC are used, supplemented with sequences downloaded from the official Vimeo website. At the same time, an encoder that supports two-pass encoding mode is enabled.

[0072] Step 2: While the training process has not yet converged, randomly select a video sequence and its corresponding target bitrate from the training dataset.

[0073] Step 3: Use the encoder to perform the first encoding pass on the selected video sequence at the target bitrate, obtaining the first-pass feature sequence that serves as the statistical information for two-pass encoding.

[0074] Step 4: After the first encoding pass is complete, invoke the interactive encoding process to encode the video sequence a second time. During this pass, the reinforcement learning policy network decides the quantization parameter of each frame, which yields an interaction trajectory.

[0075] Step 5: Store the obtained interaction trajectory in the experience buffer.

[0076] Step 6: After a predetermined number of video encoding trajectories have been collected in the experience buffer, use PPO to update the parameters of the policy network and the value network, with a loss comprising the clipped policy objective, the value function loss, and the entropy regularization term.

[0077] Step 7: After the parameter update, clear the experience buffer and repeat Steps 2 through 6 until the policy network training converges, obtaining the final trained policy network.
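The three loss terms named in Step 6 can be sketched in plain Python as the usual PPO objective: a clipped policy term, a squared-error value term, and an entropy bonus. This is the generic textbook form, not code from the patent. The coefficient defaults (`clip_eps`, `value_coef`, `entropy_coef`) are common illustrative choices and are assumptions; the source does not state its values.

```python
import math


def ppo_loss(new_logp, old_logp, advantages, values, returns, entropies,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Mean PPO loss over a batch of per-frame samples (lists of floats)."""
    n = len(new_logp)
    policy_term = value_term = entropy_term = 0.0
    for lp_new, lp_old, adv, v, ret, ent in zip(
            new_logp, old_logp, advantages, values, returns, entropies):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Clipped surrogate: take the more pessimistic of the two estimates.
        policy_term += min(ratio * adv, clipped * adv)
        value_term += (v - ret) ** 2
        entropy_term += ent
    # Maximise the surrogate and the entropy, minimise the value error.
    return (-policy_term + value_coef * value_term
            - entropy_coef * entropy_term) / n


# Hypothetical one-sample batch with an unchanged policy (ratio = 1).
print(ppo_loss([0.0], [0.0], [1.0], [0.0], [1.0], [0.0]))
```

In a real training setup these sums would be tensor operations differentiated with respect to the network parameters and stepped with the AdamW optimizer mentioned above. The scalar version here only shows how the three terms combine.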

[0078] The second encoding process involved in step four above is as follows:

[0079] Step (1): During interactive encoding, the quantization parameter of the first frame is first initialized using the encoder's built-in two-pass rate control algorithm.

[0080] Step (2): Starting from the first frame, traverse the frames of the video sequence in encoding order and encode the t-th frame with the current quantization parameter to obtain the encoding state information of the t-th frame in the second encoding pass, where t=2,…,T.

[0081] Step (3): The statistical information obtained from the first encoding pass and the current encoding state information from the second pass are fused by the feature extraction module to obtain the encoding features corresponding to frame t.

[0082] Step (4): The encoding features are input to the policy network, which outputs the quantization parameter for the next frame.

[0083] Step (5): While encoding frames up to frame T-1, the immediate reward value is set to 0, and for each frame t the encoding features, the action output by the policy network, and the reward value are stored in the trajectory cache.

[0084] Step (6): After the last frame (the T-th frame) of the video sequence is encoded, the actual coding bitrate R and the corresponding PSNR of the entire video sequence are obtained.

[0085] Step (7): Based on the target bitrate of the video sequence, the actual bitrate R, and the reconstructed video quality (PSNR), the final reward value is calculated using the final reward function given earlier.

[0086] Step (8): Store the encoding features of the last frame and the final reward value in the trajectory cache, forming a complete interaction trajectory.
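Steps (1) through (8) can be sketched as a single collection loop in which every frame except the last receives a reward of 0 and the last frame receives the final reward. The encoder, feature extractor, policy, and reward function are passed in as callables. Their names and signatures (`encode_frame`, `extract_features`, `policy`, `final_reward_fn`) are hypothetical stand-ins, not an x265 API.

```python
def collect_trajectory(first_pass_stats, init_qp, encode_frame,
                       extract_features, policy, final_reward_fn):
    """Run the second pass frame by frame.

    Returns a list of (features, action, reward) tuples, one per frame.
    """
    num_frames = len(first_pass_stats)
    trajectory = []
    states = []
    qp = init_qp  # first-frame QP from the encoder's own rate control
    for t in range(num_frames):
        state = encode_frame(t, qp)  # encoding state after coding frame t
        states.append(state)
        feat = extract_features(first_pass_stats, states)
        if t < num_frames - 1:
            qp = policy(feat)  # QP chosen for the next frame
            trajectory.append((feat, qp, 0.0))  # intermediate reward is 0
        else:
            # Last frame: no further action, reward is the final reward.
            trajectory.append((feat, None, final_reward_fn(states)))
    return trajectory


# Toy stand-ins so the sketch runs without a real encoder.
demo = collect_trajectory(
    [1, 2, 3], 30,
    lambda t, qp: qp,              # "state" is just the QP used
    lambda stats, st: len(st),     # "features" is the frame count so far
    lambda feat: 30 + feat,        # toy policy
    lambda st: 1.5,                # toy final reward
)
print(demo)
```

Because the only non-zero reward arrives at the end of the sequence, the value network has to propagate that terminal signal back to earlier frames.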

[0087] 2. Example 2.

[0088] Unlike Example 1, which uses a Transformer-based feature extraction module, Example 2 uses a feature extractor composed solely of a linear fully connected neural network. The network structure of this feature extractor is shown in Figure 6.

[0089] The other encoding configurations, training methods, and processes in Example 2 are exactly the same as in Example 1, so they will not be repeated here.

[0090] It is worth noting that the two examples here demonstrate optimizing the bitrate control problem of x265 two-pass encoding with the PPO algorithm. However, as long as the assumption in the preceding derivation holds (namely, that the slopes of the agent's and the encoder's rate-distortion curves satisfy a direct proportional relationship), the reward function proposed in this invention can be applied to different encoders, network structures, feature settings, optimizer settings, hyperparameter settings, encoding configurations, datasets, and reinforcement learning algorithms. It is not limited to the bitrate control problem of x265 two-pass encoding, to specific network structures or hyperparameter settings, or to the PPO algorithm, all of which are settings unrelated to the reward function.

[0091] III. Effects Description

[0092] Example 1 above was tested on the class B, class C, and class D standard test sequences provided by the Common Test Conditions (CTC). Because the training set consists of low-resolution videos 65 frames long, the class B, C, and D sequences were first downsampled to 416×240, and the first 65 frames of each sequence were tested.

[0093] 1. Bitrate accuracy.

[0094] As shown in Table 1, compared with x265's default two-pass rate control algorithm, the bitrate control strategy obtained by training the agent as in Example 1 improves the rate accuracy, achieving a rate accuracy of 1.10%.

[0095] Table 1: Comparison Results of Bitrate and Accuracy

[0096] 2. Objective quality.

[0097] The BD-rate test results are shown in Table 2. Compared with x265's default two-pass rate control algorithm, the bitrate control strategy obtained by training the agent as in Example 1 saves approximately 2.49% of the bitrate on the YUV components over the class B, class C, and class D standard CTC test sequences.

[0098] Table 2: BD-rate test results

[0099] Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by using software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, mobile hard drive, etc.), including several instructions to cause a computer device (such as a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0100] Example 2 This invention also provides a bitrate control system based on reinforcement learning, which is mainly used to implement the method provided in the foregoing embodiment. As shown in Figure 7, the system mainly includes: an agent construction and training unit, used to construct an agent that interacts with the encoder to complete agent training, which includes: using the agent to determine quantization parameters based on encoding features to guide the encoder's encoding process; determining, from the actual bitrate after encoding and the target bitrate, whether the agent meets the bitrate constraint requirement, and calculating the final reward value with a gating mechanism, wherein, if the bitrate constraint requirement is met, the agent's rate-distortion curve is calculated from the actual bitrate and the video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated relative to the encoder's rate-distortion curve and used as the final reward value, and otherwise the final reward value is calculated from one or more preset hyperparameters; taking the encoding features, quantization parameters, and reward values as an interaction trajectory, in which the reward value of the last frame is the final reward value and the reward values of the remaining frames are 0; and, when the number of interaction trajectories reaches a set requirement, sampling a batch of interaction trajectories, calculating the loss function of the reinforcement learning algorithm, and updating the agent's parameters; and a bitrate control unit, used to have the trained agent determine the quantization parameter of the next frame from the input encoding features and feed it back to the encoder.

[0101] In this embodiment of the invention, the step of using the agent to determine quantization parameters based on encoding features to guide the encoder's encoding process includes: randomly sampling a video sequence and its corresponding target bitrate from a pre-prepared training dataset; the encoder encodes the first frame of the video sequence and outputs the encoding features to the agent; the agent determines the quantization parameter of the second frame, and the encoder uses it to encode the second frame; this process is repeated until the encoder completes the encoding of the last frame.

[0102] In this embodiment of the invention, the agent includes: a policy network and a value network: during training, the parameters of the policy network and the value network are updated, and after training is completed, the policy network is retained.

[0103] In this embodiment of the invention, the step of determining whether the agent can meet the bitrate constraint requirement based on the actual bitrate after encoding and the target bitrate, and using a gating mechanism to calculate the final reward value, is expressed as follows: ; where Reward is the final reward value; h is one of two preset hyperparameters; the formula uses the maximum of the agent's historical exponential-moving-average bitrate excess and the bitrate excess observed during the current round of encoding, a condition indicating that the encoding of the video sequence S meets the bitrate constraint requirement, and a term for the historical bitrate-constraint performance; max denotes taking the maximum of the values in parentheses; k denotes the current round, i.e., the k-th time the video sequence is encoded at the target bitrate; b is the slope of the agent's rate-distortion curve, and c is the coefficient of the direct proportional relationship between b and the slope of the encoder's rate-distortion curve; a rate-distortion curve is a curve in a plane coordinate system whose vertical axis is the peak signal-to-noise ratio (PSNR) and whose horizontal axis is the logarithm of the bitrate; the video quality is the PSNR produced when encoding the video sequence S at the target bitrate; the vertical-axis bounds are the minimum and maximum values on the vertical axis within the region where the PSNR ranges of the agent's and the encoder's rate-distortion curves overlap, and their difference also appears in the formula; and the horizontal-axis bounds are the minimum and maximum values of the encoder's rate-distortion curve on the horizontal axis within that overlap region.
Considering that the main technical details involved in the above system have been described in detail in previous embodiments, they will not be repeated here.
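The symbol definitions above mention a historical exponential moving average (EMA) of the bitrate excess and the maximum of that average and the current round's excess. One possible way to maintain such a quantity per video sequence and target bitrate is sketched below. The decay factor, the update order, and the class and method names are all assumptions for illustration, since the exact update rule is not reproduced in this text.

```python
class ExcessHistory:
    """Track an exponential moving average of the bitrate excess."""

    def __init__(self, decay=0.9):
        self.decay = decay  # hypothetical smoothing factor
        self.ema = None     # no history before the first round

    def reference(self, current_excess):
        """Return max(historical EMA excess, current-round excess)."""
        if self.ema is None:
            return current_excess
        return max(self.ema, current_excess)

    def update(self, current_excess):
        """Fold the current round's excess into the moving average."""
        if self.ema is None:
            self.ema = current_excess
        else:
            self.ema = (self.decay * self.ema
                        + (1.0 - self.decay) * current_excess)


# Hypothetical rounds k = 1, 2 with excesses of 10 and 20 kb/s.
history = ExcessHistory(decay=0.5)
history.update(10.0)
history.update(20.0)
print(history.ema, history.reference(12.0))
```

Smoothing over rounds makes the "did the agent improve" test less sensitive to a single unusually good or bad encode than a comparison with the previous round alone.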

[0104] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional modules is used as an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the system can be divided into different functional modules to complete all or part of the functions described above.

[0105] Example 3 The present invention also provides a processing device. As shown in Figure 8, it mainly includes: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method provided in the foregoing embodiments.

[0106] Furthermore, the processing device also includes at least one input device and at least one output device; in the processing device, the processor, memory, input device, and output device are connected via a bus.

[0107] In this embodiment of the invention, the specific types of the memory, input device, and output device are not limited. For example, the input device can be a touchscreen, an image acquisition device, physical buttons, or a mouse; the output device can be a display terminal; and the memory can be random access memory (RAM) or non-volatile memory, such as disk storage.

[0108] Example 4 The present invention also provides a readable storage medium storing a computer program that, when executed by a processor, implements the method provided in the foregoing embodiments.

[0109] In this embodiment of the invention, the readable storage medium is a computer-readable storage medium and can be disposed in the aforementioned processing device, for example, as a memory in the processing device. Furthermore, the readable storage medium can also be any medium capable of storing program code, such as a USB flash drive, portable hard drive, read-only memory (ROM), magnetic disk, or optical disk.

[0110] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims. The information disclosed in the background section is intended only to enhance the understanding of the overall background technology of the present invention and should not be construed as an admission or implication in any way that such information constitutes prior art known to those skilled in the art.

Claims

1. A rate control method based on reinforcement learning, characterized in that it comprises: constructing an agent that interacts with an encoder to complete agent training, which includes: using the agent to determine quantization parameters based on encoding features to guide the encoder's encoding process; determining, from the actual bitrate after encoding and the target bitrate, whether the agent meets the bitrate constraint requirement, and calculating the final reward value with a gating mechanism, wherein, if the bitrate constraint requirement is met, the agent's rate-distortion curve is calculated from the actual bitrate and the video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated relative to the encoder's rate-distortion curve and used as the final reward value, and otherwise the final reward value is calculated from one or more preset hyperparameters; taking the encoding features, quantization parameters, and reward values as an interaction trajectory, in which the reward value of the last frame is the final reward value and the reward values of the remaining frames are 0; and, when the number of interaction trajectories reaches a set requirement, sampling a batch of interaction trajectories, calculating the loss function of the reinforcement learning algorithm, and updating the agent's parameters; and having the trained agent determine the quantization parameter of the next frame from the input encoding features and feed it back to the encoder.

2. The bitrate control method based on reinforcement learning according to claim 1, characterized in that, The step of using an intelligent agent to determine quantization parameters based on encoded features to guide the encoder's encoding process includes: Randomly sample video sequences and their corresponding target bitrates from a pre-prepared training dataset; The encoder encodes the first frame of the video sequence and outputs the encoded features to the agent. The agent then determines the quantization parameters of the second frame, and the encoder uses the quantization parameters of the second frame to encode the second frame. This process is repeated until the encoder completes the encoding of the last frame.

3. The bitrate control method based on reinforcement learning according to claim 1, characterized in that, The agent includes: a policy network and a value network: during training, the parameters of the policy network and the value network are updated, and after training is completed, the policy network is retained.

4. A rate control method based on reinforcement learning according to any one of claims 1 to 3, characterized in that the step of determining whether the agent can meet the bitrate constraint requirement based on the actual bitrate after encoding and the target bitrate, and using a gating mechanism to calculate the final reward value, is expressed as follows: ; where Reward is the final reward value; h is one of two preset hyperparameters; the formula uses the maximum of the agent's historical exponential-moving-average bitrate excess and the bitrate excess observed during the current round of encoding, a condition indicating that the encoding of the video sequence S meets the bitrate constraint requirement, and a term for the historical bitrate-constraint performance; max denotes taking the maximum of the values in parentheses; k denotes the current round, i.e., the k-th time the video sequence is encoded at the target bitrate; b is the slope of the agent's rate-distortion curve, and c is the coefficient of the direct proportional relationship between b and the slope of the encoder's rate-distortion curve; a rate-distortion curve is a curve in a plane coordinate system whose vertical axis is the peak signal-to-noise ratio (PSNR) and whose horizontal axis is the logarithm of the bitrate; the video quality is the PSNR produced when encoding the video sequence S at the target bitrate; the vertical-axis bounds are the minimum and maximum values on the vertical axis within the region where the PSNR ranges of the agent's and the encoder's rate-distortion curves overlap, and their difference also appears in the formula; and the horizontal-axis bounds are the minimum and maximum values of the encoder's rate-distortion curve on the horizontal axis within that overlap region.

5. A bitrate control system based on reinforcement learning, characterized in that it comprises: an agent construction and training unit, used to construct an agent that interacts with the encoder to complete agent training, which includes: using the agent to determine quantization parameters based on encoding features to guide the encoder's encoding process; determining, from the actual bitrate after encoding and the target bitrate, whether the agent meets the bitrate constraint requirement, and calculating the final reward value with a gating mechanism, wherein, if the bitrate constraint requirement is met, the agent's rate-distortion curve is calculated from the actual bitrate and the video quality, and the bitrate-distortion gain reflecting the agent's bitrate control strategy is calculated relative to the encoder's rate-distortion curve and used as the final reward value, and otherwise the final reward value is calculated from one or more preset hyperparameters; taking the encoding features, quantization parameters, and reward values as an interaction trajectory, in which the reward value of the last frame is the final reward value and the reward values of the remaining frames are 0; and, when the number of interaction trajectories reaches a set requirement, sampling a batch of interaction trajectories, calculating the loss function of the reinforcement learning algorithm, and updating the agent's parameters; and a bitrate control unit, used to have the trained agent determine the quantization parameter of the next frame from the input encoding features and feed it back to the encoder.

6. A bitrate control system based on reinforcement learning according to claim 5, characterized in that, The step of using an intelligent agent to determine quantization parameters based on encoded features to guide the encoder's encoding process includes: Randomly sample video sequences and their corresponding target bitrates from a pre-prepared training dataset; The encoder encodes the first frame of the video sequence and outputs the encoded features to the agent. The agent then determines the quantization parameters of the second frame, and the encoder uses the quantization parameters of the second frame to encode the second frame. This process is repeated until the encoder completes the encoding of the last frame.

7. A bitrate control system based on reinforcement learning according to claim 5, characterized in that, The agent includes: a policy network and a value network: during training, the parameters of the policy network and the value network are updated, and after training is completed, the policy network is retained.

8. A bitrate control system based on reinforcement learning according to any one of claims 5 to 7, characterized in that the step of determining whether the agent can meet the bitrate constraint requirement based on the actual bitrate after encoding and the target bitrate, and using a gating mechanism to calculate the final reward value, is expressed as follows: ; where Reward is the final reward value; h is one of two preset hyperparameters; the formula uses the maximum of the agent's historical exponential-moving-average bitrate excess and the bitrate excess observed during the current round of encoding, a condition indicating that the encoding of the video sequence S meets the bitrate constraint requirement, and a term for the historical bitrate-constraint performance; max denotes taking the maximum of the values in parentheses; k denotes the current round, i.e., the k-th time the video sequence is encoded at the target bitrate; b is the slope of the agent's rate-distortion curve, and c is the coefficient of the direct proportional relationship between b and the slope of the encoder's rate-distortion curve; a rate-distortion curve is a curve in a plane coordinate system whose vertical axis is the peak signal-to-noise ratio (PSNR) and whose horizontal axis is the logarithm of the bitrate; the video quality is the PSNR produced when encoding the video sequence S at the target bitrate; the vertical-axis bounds are the minimum and maximum values on the vertical axis within the region where the PSNR ranges of the agent's and the encoder's rate-distortion curves overlap, and their difference also appears in the formula; and the horizontal-axis bounds are the minimum and maximum values of the encoder's rate-distortion curve on the horizontal axis within that overlap region.

9. A processing device, characterized in that it comprises: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 4.

10. A readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 4.