Off-line and on-line hybrid learning method and bandwidth estimation method for real-time communication
By combining offline and online learning methods to train a bandwidth estimator and fusing network layer and application layer data, the problem of inaccurate bandwidth estimation in real-time communication systems under complex network environments is solved, achieving more efficient bandwidth utilization and stable audio and video transmission.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-26
AI Technical Summary
Existing real-time communication systems struggle to accurately perceive changes in network status under complex and dynamic network environments, leading to inaccurate bandwidth estimation and impacting the quality of real-time audio and video transmission. Furthermore, existing methods struggle to balance network environment adaptability with user experience.
We employ an offline policy learning method based on behavior cloning and an online policy optimization method based on reinforcement learning. By combining network layer quality data and application layer user experience data, we train a bandwidth estimator through a hybrid learning framework to achieve cross-layer fusion and dynamic adjustment.
It improves the accuracy and stability of bandwidth estimation, enabling a better balance between throughput and latency in complex network environments, enhancing user experience, and reducing video stuttering and latency.
Figure CN122293531A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of network transmission control and machine learning technology, and in particular to an offline-online hybrid learning method and bandwidth estimation method for a real-time communication bandwidth estimator.
Background Technology
[0002] With the continuous development of internet and multimedia communication technologies, real-time communication (RTC) technology has gradually become a crucial foundational technology supporting real-time interactive internet services, and is widely used in applications such as online video conferencing, cloud gaming, augmented reality (AR), remote collaboration, and telemedicine. In these applications, systems typically need to achieve low latency, high stability, and high-quality real-time audio and video transmission in complex network environments, thus placing high demands on network transmission performance and real-time interactive capabilities.
[0003] WebRTC (Web Real-Time Communication), as the current mainstream real-time communication technology standard, integrates audio and video codec modules and real-time transmission protocol stacks into browsers and terminal devices, enabling real-time audio and video communication between terminals without additional plugins, thus promoting the development of cross-platform real-time communication applications. However, in actual deployment environments, the transmission performance of real-time media streams is still limited by the underlying network environment of the Internet, such as network bandwidth fluctuations, latency variations, and packet loss.
[0004] In the WebRTC system architecture, the Bandwidth Estimator (BWE) is a crucial module for achieving adaptive media transmission. The BWE monitors and analyzes network state information to estimate the available network bandwidth and dynamically adjusts the video encoding rate and data packet transmission rate accordingly to improve link resource utilization efficiency and ensure real-time communication quality. However, the internet typically employs a "best-effort" transmission mode, and end-user network access environments are highly heterogeneous, including cellular mobile networks (4G / 5G), wireless local area networks (Wi-Fi), satellite links, and wired broadband networks. In these complex network environments, network state parameters such as available bandwidth, round-trip time (RTT), and packet loss rate often exhibit significant dynamic changes.
[0005] During real-time media transmission, if the data transmission rate at the sending end exceeds the actual carrying capacity of the network link, it may cause the buffer queues in intermediate network nodes (such as routers or switches) to continuously accumulate, resulting in a "buffer bloat" phenomenon. When this phenomenon occurs, the queuing time of data packets in the network device queue will increase significantly, leading to a continuous increase in end-to-end transmission latency, and may cause problems such as video stuttering and audio-video desynchronization, thereby affecting the user's real-time interactive experience.
[0006] Therefore, in complex and dynamically changing network environments, designing a bandwidth estimation method that can accurately sense changes in network status and achieve an effective balance between throughput utilization and transmission latency has become one of the key technical issues in real-time communication systems.
[0007] With the development of real-time communication technology, real-time audio and video services place higher demands on network congestion control and bandwidth estimation. Early internet congestion control primarily relied on TCP algorithms driven by packet loss or delay signals, such as the loss-based TCP Reno and the delay-based TCP Vegas. These methods adjust the transmission rate by detecting network packet loss or latency changes. To achieve stable audio and video transmission in complex network environments, researchers have proposed various congestion control and bandwidth estimation algorithms. These methods typically determine network status by monitoring congestion signals such as network latency, throughput changes, and packet loss rate, and dynamically adjust the media stream transmission rate at the sending end to achieve adaptive bandwidth control. However, these methods suffer from limited adaptability to complex network environments and insufficient model generalization ability, thus affecting the accuracy of bandwidth prediction.
Summary of the Invention
[0008] In view of this, embodiments of the present invention provide an offline-to-online hybrid learning method and a bandwidth estimation method for real-time communication bandwidth estimators, in order to eliminate or improve one or more defects existing in the prior art.
[0009] One aspect of the present invention provides an offline-to-online hybrid learning method for a bandwidth estimator for real-time communication, the method comprising the following steps: An offline policy learning mechanism based on behavior cloning is adopted. A preset neural network model is trained offline based on an expert decision dataset to train the neural network model into an initial bandwidth estimator that can output offline bandwidth prediction results based on the input network layer quality data. The expert decision dataset includes multiple network layer quality data and corresponding bandwidths collected in real-time communication based on a heuristic congestion control algorithm. An online policy optimization mechanism based on reinforcement learning is adopted to optimize the initial bandwidth estimator online according to real-time network layer quality data, so as to train the initial bandwidth estimator into a target bandwidth estimator that can output the target bandwidth prediction result based on real-time network layer quality data.
[0010] In some embodiments of the present invention, the method further includes: An online policy optimization mechanism based on reinforcement learning is adopted to optimize the initial bandwidth estimator online based on real-time network layer quality data and application layer user experience data, so as to train the initial bandwidth estimator into a target bandwidth estimator that can output the target bandwidth prediction result based on real-time network layer quality data and application layer user experience data.
[0011] In some embodiments of the present invention, employing the online policy optimization mechanism based on reinforcement learning to optimize the initial bandwidth estimator online based on real-time network layer quality data and application layer user experience data, so as to train the initial bandwidth estimator into a target bandwidth estimator capable of outputting target bandwidth prediction results based on real-time network layer quality data and application layer user experience data, includes: using the initial bandwidth estimator as a policy network and obtaining the state for each cycle, the state including network layer quality data and application layer user experience data; using the policy network to output the online bandwidth prediction result based on the state corresponding to each cycle; using a value network to determine a corresponding multi-objective reward based on the online bandwidth prediction result, where the multi-objective reward is determined from a throughput gain term, a queuing delay penalty term, a packet loss penalty term, and a rate fluctuation penalty term at the network layer, together with a user experience penalty term at the application layer, and the application-layer user experience penalty term is determined from a video playback stuttering penalty and an end-to-end playback latency penalty; and outputting, as the target bandwidth prediction result, the online bandwidth prediction result obtained when the multi-objective reward is maximized, thereby obtaining the target bandwidth estimator.
[0012] In some embodiments of the present invention, the calculation formula for the multi-objective reward is as follows: $$r_t = T_t - \alpha D_t - \beta L_t - \gamma J_t - \delta Q_t, \qquad Q_t = \mu_1 S_t + \mu_2 E_t$$ In the formula, $r_t$ denotes the multi-objective reward in the $t$-th cycle, $T_t$ denotes throughput, $D_t$ denotes queuing delay, $L_t$ denotes packet loss, $J_t$ denotes rate fluctuation, and $Q_t$ denotes the application layer user experience penalty; $\alpha$, $\beta$, $\gamma$, and $\delta$ denote weighting coefficients; $S_t$ denotes video playback stuttering, $E_t$ denotes end-to-end playback delay, and $\mu_1$ and $\mu_2$ denote weighting coefficients.
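As an illustration of the reward described above (a throughput gain minus network-layer penalties and an application-layer penalty built from stuttering and playback delay), the following is a minimal Python sketch. The class name `RewardWeights`, the function names, the default weight values, and the assumption that all inputs are pre-normalized are hypothetical; the specification does not disclose concrete coefficients.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Hypothetical values; the specification does not disclose the coefficients.
    delay: float = 0.5           # queuing delay penalty weight
    loss: float = 1.0            # packet loss penalty weight
    fluctuation: float = 0.2     # rate fluctuation penalty weight
    qoe: float = 1.0             # application-layer penalty weight
    stall: float = 0.7           # stuttering weight inside the QoE penalty
    playback_delay: float = 0.3  # playback delay weight inside the QoE penalty


def qoe_penalty(stall, playback_delay, w):
    """Application-layer penalty: weighted stuttering plus playback delay."""
    return w.stall * stall + w.playback_delay * playback_delay


def multi_objective_reward(throughput, queuing_delay, loss, fluctuation,
                           stall=0.0, playback_delay=0.0, w=None):
    """Throughput gain minus network-layer and application-layer penalties.

    All inputs are assumed to be normalized to comparable ranges beforehand.
    """
    w = w or RewardWeights()
    return (throughput
            - w.delay * queuing_delay
            - w.loss * loss
            - w.fluctuation * fluctuation
            - w.qoe * qoe_penalty(stall, playback_delay, w))
```

Leaving `stall` and `playback_delay` at zero reduces the function to the network-layer-only reward used in the embodiment without application layer data.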
[0013] In some embodiments of the present invention, an online policy optimization mechanism based on reinforcement learning is employed to optimize the initial bandwidth estimator online according to real-time network layer quality data, so as to train the initial bandwidth estimator into a target bandwidth estimator capable of outputting target bandwidth prediction results based on real-time network layer quality data, including: The initial bandwidth estimator is used as a policy network to obtain the state for each cycle, and the state includes network layer quality data. Using the policy network, the corresponding online bandwidth prediction results are output based on the state of each cycle; Using a value network, a corresponding multi-objective reward is determined based on the online bandwidth prediction results. The multi-objective reward is determined by a throughput gain item, a queuing delay penalty item, a packet loss penalty item, and a rate fluctuation penalty item based on the network layer. The online bandwidth prediction result when the multi-objective reward is maximized is output as the target bandwidth prediction result, and the target bandwidth estimator is obtained.
[0014] In some embodiments of the present invention, an improved Proximal Policy Optimization (PPO) algorithm is used to perform online policy optimization on the policy network so as to limit the deviation of the online policy from the baseline policy. The optimization objective function is: $$J(\theta) = L^{\mathrm{CLIP}}(\theta) - c_1 L^{\mathrm{VF}}(\theta) + c_2 H(\pi_\theta) - \eta \, D_{\mathrm{KL}}(\pi_{\mathrm{BC}} \,\|\, \pi_\theta)$$ In the formula, $L^{\mathrm{CLIP}}(\theta)$ denotes the update target of the policy network, $L^{\mathrm{VF}}(\theta)$ denotes the loss function of the value network, $H(\pi_\theta)$ denotes the policy entropy term used to encourage exploration, $c_1$ and $c_2$ denote weighting coefficients, $\eta$ denotes the weighting coefficient of the KL divergence constraint term, $D_{\mathrm{KL}}$ denotes the KL divergence, $\pi_{\mathrm{BC}}$ denotes the policy network of the initial bandwidth estimator, which serves as the baseline policy, and $\pi_\theta$ denotes the online policy, i.e. the target bandwidth estimator. The symbol $\|$ indicates the directionality of the KL divergence: $\pi_{\mathrm{BC}}$ is the reference distribution against which $\pi_\theta$ is compared.
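The structure of such an objective can be sketched in Python for the simple case of one-dimensional Gaussian policies, where the KL divergence has a closed form. This is an illustrative sketch under stated assumptions, not the patented implementation: the Gaussian policy form, the dictionary keys, the clipping range, and all default coefficients are hypothetical, and the signs follow the standard PPO convention of maximizing the objective.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log density of a 1-D Gaussian at x."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2 * math.pi))


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for 1-D Gaussians, with p as the reference distribution."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2 * std_q ** 2)
            - 0.5)


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian."""
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)


def ppo_kl_objective(batch, c1=0.5, c2=0.01, kl_weight=0.1, clip_eps=0.2):
    """Mean objective (to be maximized) over a batch of transitions.

    Each transition is a dict with the keys: action, advantage, value_pred,
    value_target, old_log_prob, new_policy (mean, std), bc_policy (mean, std).
    """
    total = 0.0
    for tr in batch:
        new_mean, new_std = tr["new_policy"]
        bc_mean, bc_std = tr["bc_policy"]
        log_prob = gaussian_log_prob(tr["action"], new_mean, new_std)
        ratio = math.exp(log_prob - tr["old_log_prob"])
        clipped = max(1 - clip_eps, min(1 + clip_eps, ratio))
        # Clipped surrogate: pessimistic choice between the two estimates.
        surrogate = min(ratio * tr["advantage"], clipped * tr["advantage"])
        value_loss = (tr["value_pred"] - tr["value_target"]) ** 2
        # Penalize drift of the online policy away from the cloned baseline.
        kl = gaussian_kl(bc_mean, bc_std, new_mean, new_std)
        total += (surrogate - c1 * value_loss
                  + c2 * gaussian_entropy(new_std) - kl_weight * kl)
    return total / len(batch)
```

When the online policy coincides with the baseline, the KL term vanishes and the expression reduces to the usual clipped PPO objective; the further the online policy drifts, the more the last term lowers the objective.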
[0015] In some embodiments of the present invention, during the optimization process, the weight coefficient of the KL divergence constraint term is dynamically adjusted based on network stability. When the network stability is high, the weight coefficient of the KL divergence constraint term is decreased; when the network stability is low or a decrease in application layer user experience data is detected, the weight coefficient of the KL divergence constraint term is increased. The network stability is determined based on the round-trip delay change rate, bandwidth change rate, packet loss change rate, and user experience data change rate.
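One way to picture this adjustment rule is the Python sketch below. The stability score (an unweighted mean of absolute change rates mapped into the range (0, 1]), the thresholds, the multiplicative step, and the clamping bounds are all assumptions chosen for illustration; the specification only states which quantities feed the stability measure and in which direction the coefficient moves.

```python
def network_stability(rtt_change, bw_change, loss_change, qoe_change):
    """Map four relative change rates to a stability score in (0, 1].

    Assumption: volatility is the unweighted mean of the absolute change
    rates, and a score of 1.0 means nothing changed.
    """
    volatility = (abs(rtt_change) + abs(bw_change)
                  + abs(loss_change) + abs(qoe_change)) / 4
    return 1.0 / (1.0 + volatility)


def update_kl_weight(kl_weight, stability, qoe_dropped,
                     high=0.8, low=0.5, factor=1.5,
                     min_weight=0.01, max_weight=10.0):
    """Raise the KL weight when conditions are unstable, lower it when stable."""
    if qoe_dropped or stability < low:
        kl_weight *= factor   # stay close to the baseline policy
    elif stability > high:
        kl_weight /= factor   # allow more deviation from the baseline
    return min(max_weight, max(min_weight, kl_weight))
```

Between the two thresholds the coefficient is left unchanged, which avoids oscillating when stability hovers near a single cut-off.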
[0016] In some embodiments of the present invention, the application layer user experience data includes the jitter buffer occupancy at the receiving end and the client video rendering frame rate.
[0017] Another aspect of the present invention provides a bandwidth estimation method for real-time communication, the method comprising the following steps: Obtain network layer quality data for real-time communication; The network layer quality data is input into the target bandwidth estimator so that the target bandwidth estimator outputs a target bandwidth prediction result. The target bandwidth estimator is trained in advance using the aforementioned offline-online hybrid learning method for the bandwidth estimator of real-time communication.
[0018] In some embodiments of the present invention, the method further includes: Obtain application-layer user experience data for real-time communication; The network layer quality data and the application layer user experience data are input into the target bandwidth estimator so that the target bandwidth estimator outputs the target bandwidth prediction result accordingly.
[0019] Another aspect of the present invention provides an electronic device comprising: a computer device including a processor and a memory, the memory storing computer instructions, the processor executing the computer instructions stored in the memory, wherein when the computer instructions are executed by the processor, the device implements the steps of the aforementioned offline-to-online hybrid learning method for a bandwidth estimator for real-time communication, or implements the steps of the aforementioned bandwidth estimation method for real-time communication.
[0020] Another aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned offline-to-online hybrid learning method for a bandwidth estimator for real-time communication, or the steps of the aforementioned bandwidth estimation method for real-time communication.
[0021] Another aspect of the present invention provides a computer program product, including computer instructions that, when executed by a processor, implement the steps of the aforementioned offline-to-online hybrid learning method for a bandwidth estimator for real-time communication, or the steps of the aforementioned bandwidth estimation method for real-time communication.
[0022] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the description, or may be learned by practice of the invention. The objects and other advantages of the invention can be realized and obtained by means of the structures specifically pointed out in the description and drawings.
[0023] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Description of the Drawings
[0024] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, are not intended to limit the scope of the invention. The components in the drawings are not drawn to scale but are merely illustrative of the principles of the invention. For ease of illustration and description of certain parts of the invention, corresponding portions in the drawings may be enlarged, i.e., may appear larger relative to other components in an exemplary device actually manufactured according to the invention. In the drawings:
Figure 1 is a flowchart of the offline-online hybrid learning method for a real-time communication bandwidth estimator in one embodiment of the present invention.
Figure 2 is a flowchart of the offline-online hybrid learning method for a real-time communication bandwidth estimator in another embodiment of the present invention.
Figure 3 is a schematic diagram of the specific process of the offline-online hybrid learning method for a real-time communication bandwidth estimator in another embodiment of the present invention.
Figure 4 is a flowchart of a bandwidth estimation method for real-time communication in one embodiment of the present invention.
Figure 5 is a flowchart of a bandwidth estimation method for real-time communication in another embodiment of the present invention.
Detailed Implementation
[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0026] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0027] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0028] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.
[0029] In the following description, embodiments of the invention will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0030] The more typical congestion control and bandwidth estimation schemes in the existing technology mainly include congestion control algorithms based on heuristic rules, congestion control algorithms based on online reinforcement learning, and data-driven methods based on offline reinforcement learning.
[0031] In the field of heuristic congestion control, Google's GCC (Google Congestion Control) algorithm combines delay-based detection with packet loss-based control. Its core mechanism involves the receiver using a Trendline filter to perform linear regression on the arrival-time offsets between data packet groups, calculating the delay trend, and comparing it with a dynamically adaptive threshold to determine whether the link is overloaded, normal, or underloaded. Based on this feedback, the transmitter uses an AIMD-like rate adjustment mechanism, probing with additive increases under normal conditions and applying a multiplicative decrease during overload, while simultaneously adjusting the available bandwidth estimate in a closed-loop manner using packet loss rate thresholds.
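The delay-trend detection and AIMD-like adjustment described above can be approximated by the simplified Python sketch below. It is not the WebRTC implementation: the fixed threshold (GCC adapts it dynamically), the additive step, the decrease factor, and the choice to hold the rate when the link is underloaded are assumptions made for illustration, and the loss-based controller is omitted.

```python
def trendline_slope(arrival_times_ms, delay_variations_ms):
    """Least-squares slope of accumulated delay variation over arrival time."""
    accumulated, total = [], 0.0
    for d in delay_variations_ms:
        total += d
        accumulated.append(total)
    n = len(arrival_times_ms)
    mean_x = sum(arrival_times_ms) / n
    mean_y = sum(accumulated) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(arrival_times_ms, accumulated))
    den = sum((x - mean_x) ** 2 for x in arrival_times_ms)
    return num / den if den else 0.0


def classify(slope, threshold=0.05):
    """Compare the delay trend with a (here fixed) threshold."""
    if slope > threshold:
        return "overuse"
    if slope < -threshold:
        return "underuse"
    return "normal"


def next_rate(rate_kbps, state, step_kbps=50.0, decrease=0.85):
    """AIMD-like update: additive increase, multiplicative decrease."""
    if state == "overuse":
        return rate_kbps * decrease
    if state == "normal":
        return rate_kbps + step_kbps
    return rate_kbps  # underuse: hold the rate while queues drain
```

For example, packet groups whose delay variation grows by 1 ms every 10 ms yield a slope of 0.1, which this sketch classifies as overuse and answers by cutting a 1000 kbps rate to 850 kbps.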
[0032] In reinforcement learning algorithms, OnRL employs online reinforcement learning based on an Actor-Critic architecture to achieve dynamic control. This method models the network environment as a continuous state space, collecting metrics such as transmission rate, instantaneous throughput, RTT, and their statistical characteristics in real time. The agent updates the Actor network through policy gradients, outputting continuous rate adjustment actions, and updates the policy based on a reward function composed of throughput and latency, thereby learning rate adjustment strategies under different network conditions.
[0033] The Aurora algorithm utilizes deep reinforcement learning (DRL) to address nonlinear congestion problems in complex network environments. This method uses statistical features such as transmission rate variations, latency information, and packet loss rate as state inputs, and employs the Proximal Policy Optimization (PPO) algorithm for training in a simulation environment. By maximizing the long-term reward function, it achieves a tradeoff between throughput and latency, thereby forming an end-to-end rate control strategy.
[0034] In offline reinforcement learning, Schaferct utilizes historical network trajectory data for model training. Researchers extract state-action-performance triples from large-scale communication sessions and employ the Implicit Q-Learning (IQL) offline reinforcement learning algorithm to learn bandwidth prediction strategies without interacting with the real environment. The model takes multidimensional network observation features as input and directly outputs predicted bandwidth, thereby estimating available bandwidth.
[0035] The FARC algorithm is also based on an offline reinforcement learning framework and uses an Actor-Critic structure for bandwidth prediction. The Actor network outputs bandwidth predictions based on network observations, while the Critic network predicts the audio / video quality (QoE) for the corresponding state and action, serving as an evaluation signal for policy optimization. To avoid overly aggressive predictions, FARC introduces a conservative factor to constrain the policy during training, thereby reducing overshoot in bandwidth estimation and improving stability.
[0036] Furthermore, HRCC employs a hybrid architecture combining rule-guided and data-driven approaches. This method embeds a deep learning bandwidth estimation module into the traditional WebRTC congestion control framework. The receiver predicts the available bandwidth of the link by extracting features such as latency variations, reception rate, and packet loss rate, and feeds this information back to the sender for rate adjustment. This mechanism utilizes rule-based logic to ensure stability in the initial stages of the connection, while simultaneously improving bandwidth awareness accuracy during stable transmission phases through a data model.
[0037] Although the congestion control and bandwidth estimation algorithms mentioned above have been extensively studied in real-time communication systems, they still have some shortcomings in complex dynamic network environments, mainly in the following aspects:
1) Heuristic rule-based congestion control algorithms have limited adaptability to complex network environments. These algorithms typically rely on manually designed threshold rules and fixed rate adjustment logic, such as judging network congestion status based on latency trends or packet loss rates and adjusting the transmission rate accordingly. When significant changes occur in the network environment, such as random packet loss in wireless links, rapid bandwidth fluctuations, or deep buffer queues in links, fixed rules struggle to accurately reflect the actual network state. This can easily lead to conservative bandwidth estimations or untimely responses to changes in network congestion, thus affecting link resource utilization efficiency.
[0038] 2) Reinforcement learning-based congestion control algorithms also have certain limitations in practical applications. Online reinforcement learning methods require continuous interaction with the network environment during operation, constantly updating model parameters through policy exploration. However, in real-time audio and video communication scenarios, policy exploration can lead to significant fluctuations in the transmission rate, causing video stuttering or increased latency, negatively impacting system stability and user experience. Offline reinforcement learning methods typically rely on pre-collected large-scale network trajectory data for model training. While this type of method avoids the risks associated with online exploration, the limited distribution of training data means that when the actual network environment differs from the training data, the model may exhibit insufficient generalization ability, thus affecting the accuracy of bandwidth prediction.
[0039] 3) Most existing bandwidth estimation algorithms primarily rely on network layer Quality of Service (QoS) metrics, such as throughput, latency, and packet loss rate, while paying insufficient attention to application layer Quality of Experience (QoE) metrics. In real-time audio and video communication systems, user experience is not only related to network layer performance but is also affected by factors such as playback stuttering, end-to-end latency, and frame rate stability. If bandwidth control is based solely on QoS metrics, it may be difficult to accurately reflect the actual user experience, thus making it challenging to achieve a good balance between bandwidth utilization and user experience.
[0040] Therefore, addressing the shortcomings of heuristic congestion control algorithms in adapting to complex network environments, and the potential instability of reinforcement learning methods during online exploration, as well as the insufficient generalization ability of offline-trained models in real or new network environments, this invention proposes a hybrid online and offline learning method for real-time communication bandwidth estimators. This method enhances the perception and adaptability to complex dynamic network states while ensuring system stability, thereby improving the bandwidth estimation performance of the model. Furthermore, this invention introduces application-layer user experience metrics, enabling bandwidth prediction or estimation to consider not only network-layer QoS metrics but also user QoE, achieving a more reasonable balance between throughput, latency, and playback stability. This invention enables more stable and efficient bandwidth estimation and rate control in complex heterogeneous network environments, improving the accuracy of bandwidth prediction and the overall service quality of real-time communication systems.
[0041] Figure 1 is a flowchart of an offline-to-online hybrid learning method for a real-time communication bandwidth estimator in one embodiment of the present invention. As shown in Figure 1, the method includes the following steps: Step S110: An offline policy learning mechanism based on behavior cloning is adopted to train a preset neural network model offline according to an expert decision dataset, so as to train the neural network model into an initial bandwidth estimator that can output offline bandwidth prediction results based on the input network layer quality data. The expert decision dataset includes multiple network layer quality data and corresponding bandwidths collected in real-time communication based on a heuristic congestion control algorithm.
[0042] The bandwidth estimation method of this invention is implemented based on a two-stage offline-online hybrid learning framework, which includes an offline behavior cloning policy construction stage and an online policy optimization stage. In the first stage, to construct a stable and safe prior policy, this invention adopts an offline policy learning mechanism based on behavior cloning (BC). First, the classic heuristic congestion control algorithm GCC is used as the expert policy. This algorithm is run in a real-time communication network simulation environment, and a large amount of network operation trajectory data is collected to form the expert decision dataset. The trajectory data comprise multiple network layer quality data that change in real time with network operation and the bandwidths that the heuristic congestion control algorithm derives from them; each network layer quality datum and its corresponding bandwidth form one training sample. The model input in this stage contains only network layer quality data (network layer quality index information), which describes the real-time transmission status of the network link. Specifically, the collected QoS information (network layer quality data) includes basic network characteristics and statistical characteristics within a certain time window. The basic network characteristics include one-way latency, packet loss rate, receive rate, historical bandwidth estimates, and link utilization, while the statistical characteristics within the time window include dynamic characteristics such as latency gradient, latency trend, and bandwidth change rate. Together these form a network layer quality state vector.
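To make the composition of such a state vector concrete, the following Python sketch assembles one from a short window of measurements. The function name, the feature order, and the exact definitions of link utilization, latency gradient, latency trend, and bandwidth change rate are assumptions; the specification names these features without defining how they are computed.

```python
def qos_state_vector(delays_ms, loss_rate, recv_rates_kbps, bw_estimates_kbps):
    """Build one network-layer quality state vector from a window of samples.

    Every list is ordered oldest to newest and needs at least two entries.
    """
    delay_now = delays_ms[-1]
    rate_now = recv_rates_kbps[-1]
    bw_prev, bw_now = bw_estimates_kbps[-2], bw_estimates_kbps[-1]
    return [
        delay_now,                        # one-way latency
        loss_rate,                        # packet loss rate
        rate_now,                         # receive rate
        bw_now,                           # most recent historical estimate
        rate_now / bw_now,                # link utilization (assumed definition)
        delay_now - delays_ms[-2],        # latency gradient (last difference)
        (delay_now - delays_ms[0]) / (len(delays_ms) - 1),  # latency trend
        (bw_now - bw_prev) / bw_prev,     # bandwidth change rate
    ]
```

A sequence of such vectors, one per control cycle, would form the time series consumed by a recurrent model.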
[0043] Secondly, regarding the model structure, a recurrent neural network (RNN) is preset as the underlying neural network. The RNN models the time series of network layer quality states and learns the bandwidth control behavior of the expert policy under different network quality states. During offline training, the model outputs the predicted bandwidth $\hat{b}_i$, and the bandwidth $b_i^{*}$ output by the expert policy serves as the supervisory signal. To improve the prediction stability of the model across different bandwidth ranges, a combined loss function $L(\theta)$ is used for offline optimization. In this loss function, $\theta$ denotes the model parameters, $N$ denotes the number of training samples in the expert decision dataset, $w_i$ denotes a weight coefficient dynamically adjusted based on the network quality state (the network layer quality data in the $i$-th training sample), $b_i^{*}$ denotes the output bandwidth of the expert policy (the bandwidth in the $i$-th training sample), $\hat{b}_i$ denotes the offline bandwidth prediction result, and $\lambda_1$ and $\lambda_2$ denote weighting parameters.
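The exact form of the combined loss is not recoverable from this text. Purely as an illustration of how a per-sample weight and two weighting parameters might combine, the Python sketch below assumes a squared-error term plus a relative-error term, which is one common way to keep errors comparable across low and high bandwidth ranges. The function name `bc_loss`, both error terms, and the parameter names `lam_abs` and `lam_rel` are hypothetical.

```python
def bc_loss(predicted, expert, sample_weights, lam_abs=1.0, lam_rel=1.0):
    """Hypothetical combined behavior-cloning loss, averaged over N samples.

    Assumed form: w_i * (lam_abs * squared error + lam_rel * relative error).
    """
    total = 0.0
    for b_hat, b_star, w in zip(predicted, expert, sample_weights):
        abs_term = (b_hat - b_star) ** 2
        # Relative error keeps low-bandwidth samples from being ignored.
        rel_term = abs(b_hat - b_star) / max(b_star, 1e-9)
        total += w * (lam_abs * abs_term + lam_rel * rel_term)
    return total / len(expert)
```

With bandwidths expressed in Mbps, predicting 2.0 against an expert value of 1.0 at unit weight gives a loss of 2.0 under these assumptions, and halving the sample weight halves the loss.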
[0044] Through the aforementioned offline training process, the neural network can offline mimic and learn the congestion detection logic and rate adjustment mechanism in the GCC algorithm, internalizing them as the initial weights of the deep neural network. This forms a stable control prior strategy in the model parameters of the initial bandwidth estimator, fundamentally eliminating the risks of cold start and blind trial and error in the early stages of online deployment, and providing a safe initialization strategy for the subsequent online optimization stage.
[0045] Step S120: An online policy optimization mechanism based on reinforcement learning is adopted to optimize the initial bandwidth estimator online according to real-time network layer quality data, so as to train the initial bandwidth estimator into a target bandwidth estimator that can output the target bandwidth prediction result based on real-time network layer quality data.
[0046] In the second stage, an online reinforcement learning optimization mechanism is introduced based on the strategy obtained in the offline training stage, enabling the model to adaptively adjust according to the real-time network environment.
[0047] In some embodiments, adopting the online policy optimization mechanism based on reinforcement learning to optimize the initial bandwidth estimator online according to real-time network layer quality data, so as to train the initial bandwidth estimator into a target bandwidth estimator capable of outputting target bandwidth prediction results based on real-time network layer quality data, includes: using the initial bandwidth estimator as a policy network and obtaining the state of each cycle, the state including network layer quality data; using the policy network to output the corresponding online bandwidth prediction result based on the state of each cycle; using a value network to determine a corresponding multi-objective reward based on the online bandwidth prediction result, the multi-objective reward being determined by a network-layer throughput gain term, queuing delay penalty term, packet loss penalty term, and rate fluctuation penalty term; and outputting the online bandwidth prediction result at which the multi-objective reward is maximized as the target bandwidth prediction result, thereby obtaining the target bandwidth estimator.
[0048] In the online policy optimization process, the bandwidth control problem is modeled as a continuous Markov Decision Process (MDP). During each control cycle, the system collects the current network layer state. The initial bandwidth estimator obtained in the previous stage serves as the policy network of this stage's reinforcement learning model and, based on the current network layer state, outputs the online bandwidth prediction result, which serves as the action. This action corresponds to the packet sending rate at the sender. Subsequently, the value network in the reinforcement learning model calculates the reward value based on the impact of this action on network performance, and the policy is updated using a reinforcement learning algorithm. In this embodiment, the multi-objective reward is optimized solely based on network layer QoS information, and its calculation formula is as follows: $$r_t = w_1 T_t - w_2 D_t - w_3 L_t - w_4 \Delta R_t$$ In the formula, $r_t$ denotes the multi-objective reward of the $t$-th cycle, $T_t$ denotes throughput, $D_t$ denotes queuing delay, $L_t$ denotes packet loss, $\Delta R_t$ denotes rate fluctuation, and $w_1$, $w_2$, $w_3$, and $w_4$ denote the weighting coefficients.
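As a minimal sketch of the network-layer reward described above, the following assumes a weighted sum in which throughput is rewarded and queuing delay, packet loss, and rate fluctuation are penalized. The field names, units, and default weights are illustrative assumptions, not values from the specification.

```python
from dataclasses import dataclass


@dataclass
class QosSample:
    """Network-layer measurements for one control cycle (hypothetical units)."""
    throughput_mbps: float
    queuing_delay_ms: float
    loss_rate: float          # fraction in [0, 1]
    rate_change_mbps: float   # |rate_t - rate_{t-1}|


def qos_reward(sample, weights=(1.0, 0.01, 10.0, 0.1)):
    """Throughput gain minus delay, loss, and rate fluctuation penalties."""
    w_thr, w_delay, w_loss, w_fluct = weights  # assumed example weights
    return (w_thr * sample.throughput_mbps
            - w_delay * sample.queuing_delay_ms
            - w_loss * sample.loss_rate
            - w_fluct * sample.rate_change_mbps)


calm = QosSample(throughput_mbps=2.0, queuing_delay_ms=10.0, loss_rate=0.0, rate_change_mbps=0.1)
congested = QosSample(throughput_mbps=2.0, queuing_delay_ms=200.0, loss_rate=0.05, rate_change_mbps=1.0)
```

With the same throughput, the congested sample scores lower than the calm one, which is the behavior the penalty terms are meant to produce.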
[0049] Figure 2 is a flowchart of an offline-online hybrid learning method for a real-time communication bandwidth estimator in another embodiment of the present invention. As shown in Figure 2, the method includes the following steps: Step S210: An offline policy learning mechanism based on behavior cloning is adopted to train a preset neural network model offline according to an expert decision dataset, so as to train the neural network model into an initial bandwidth estimator that can output offline bandwidth prediction results based on the input network layer quality data. The expert decision dataset includes multiple network layer quality data and corresponding bandwidths collected in real-time communication based on a heuristic congestion control algorithm. Step S220: An online policy optimization mechanism based on reinforcement learning is adopted to optimize the initial bandwidth estimator online according to real-time network layer quality data and application layer user experience data, so as to train the initial bandwidth estimator into a target bandwidth estimator that can output the target bandwidth prediction result based on real-time network layer quality data and application layer user experience data.
[0050] In the second stage of this embodiment, to achieve cross-layer experience perception, the present invention adds application-layer Quality of Experience (QoE) metrics to the original network-layer QoS state information, thereby expanding the original network-layer state space. Specifically, application-layer user experience data is appended to the original network-layer QoS state vector. This application-layer QoE data includes features such as receiver jitter buffer occupancy and client video rendering frame rate. Jointly modeling these QoE metrics with the original QoS metrics yields an extended state vector and thus achieves cross-layer fusion of network-layer and application-layer information.
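The extended state vector can be sketched as a simple concatenation of normalized QoS and QoE features. The feature names and normalization scales below are assumptions chosen for illustration; the specification names delay, packet loss rate, receive rate, jitter buffer occupancy, and rendering frame rate but does not prescribe an encoding.

```python
# (feature name, normalization scale) pairs; names and scales are hypothetical
QOS_FEATURES = (("delay_ms", 1000.0), ("loss_rate", 1.0), ("recv_rate_mbps", 10.0))
QOE_FEATURES = (("jitter_buffer_ms", 1000.0), ("render_fps", 60.0))


def build_extended_state(qos, qoe):
    """Append normalized application-layer QoE features to the network-layer QoS state."""
    state = [qos[name] / scale for name, scale in QOS_FEATURES]
    state += [qoe[name] / scale for name, scale in QOE_FEATURES]
    return state


extended = build_extended_state(
    {"delay_ms": 50.0, "loss_rate": 0.01, "recv_rate_mbps": 2.5},
    {"jitter_buffer_ms": 120.0, "render_fps": 30.0},
)
```

Keeping the QoS features first means a policy trained only on network-layer data sees the same leading inputs after the state is extended.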
[0051] In the traditional standard WebRTC system architecture, the bandwidth estimator typically obtains only network layer transmission information, that is, QoS information such as latency, packet loss rate, and receive rate, and cannot directly obtain QoE information such as the application layer playback status. To achieve QoE-aware bandwidth control, this invention extends the upper-layer media processing module of the bandwidth estimator in the standard WebRTC architecture. As shown in Figure 3, on the basis of the original architecture, application layer user experience information such as video rendering frame rate and receiver jitter buffer occupancy is extracted on the client player side and fed back to the bandwidth estimator through the control interface, thereby realizing cross-layer fusion of application layer QoE information and network layer QoS information. The initial model, base model, and fine-tuned model in Figure 3 are the preset neural network model, the initial bandwidth estimator, and the target bandwidth estimator, respectively.
[0052] In some embodiments, adopting the online policy optimization mechanism based on reinforcement learning to optimize the initial bandwidth estimator online based on real-time network layer quality data and application layer user experience data, so as to train the initial bandwidth estimator into a target bandwidth estimator capable of outputting target bandwidth prediction results based on real-time network layer quality data and application layer user experience data, includes: using the initial bandwidth estimator as a policy network and obtaining the state of each cycle, the state including network layer quality data and application layer user experience data; using the policy network to output the corresponding online bandwidth prediction result based on the state of each cycle; using a value network to determine a corresponding multi-objective reward based on the online bandwidth prediction result, the multi-objective reward being determined by the network-layer throughput gain term, queuing delay penalty term, packet loss penalty term, and rate fluctuation penalty term together with an application-layer user experience penalty term, the application-layer user experience penalty term being determined by a video playback stuttering penalty and an end-to-end playback latency penalty; and outputting the online bandwidth prediction result at which the multi-objective reward is maximized as the target bandwidth prediction result, thereby obtaining the target bandwidth estimator.
[0053] In this embodiment, the bandwidth control problem is likewise modeled as a continuous Markov decision process during online policy optimization. In each cycle, the system collects the current extended state; the initial bandwidth estimator obtained in the previous stage serves as the policy network of this stage's reinforcement learning model and outputs the online bandwidth prediction result, which serves as the action. This action corresponds to the packet sending rate at the sender. Subsequently, the value network in the reinforcement learning model calculates the reward value based on the impact of this action on network performance, and the policy is updated using a reinforcement learning algorithm.
[0054] To enable the bandwidth control policy to consider both network performance and application-layer user experience, this invention introduces a joint QoS and QoE optimization mechanism into the reward function design. Specifically, in each decision cycle, a multi-objective reward function is constructed from throughput, queuing delay, packet loss rate, and video playback experience metrics. The multi-objective reward is calculated as follows: $$r_t = w_1 T_t - w_2 D_t - w_3 L_t - w_4 \Delta R_t - w_5 Q_t, \qquad Q_t = \lambda_1 S_t + \lambda_2 P_t$$ In the formula, $r_t$ denotes the multi-objective reward of the $t$-th cycle, $T_t$ denotes throughput, $D_t$ denotes queuing delay, $L_t$ denotes packet loss, $\Delta R_t$ denotes rate fluctuation, $Q_t$ denotes the application layer user experience penalty, and $w_1$, $w_2$, $w_3$, $w_4$, and $w_5$ denote the weighting coefficients; $S_t$ denotes video playback stuttering, $P_t$ denotes end-to-end playback delay, and $\lambda_1$ and $\lambda_2$ denote the weighting coefficients. The receiver's jitter buffer occupancy reflects its buffer state, and changes in this buffer typically affect end-to-end playback latency. Therefore, receiver jitter buffer occupancy is correlated with the end-to-end playback latency penalty. The client-side video rendering frame rate reflects the smoothness of video playback. When network conditions deteriorate and bandwidth becomes insufficient, a drop in rendering frame rate is often accompanied by video stuttering. Therefore, the client-side video rendering frame rate is correlated with the video playback stuttering penalty. With this reward function design, the bandwidth control policy can simultaneously consider network layer quality of service (QoS) and application layer quality of experience (QoE).
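A minimal sketch of the joint QoS and QoE reward follows, assuming the application-layer penalty is a weighted sum of a stuttering measure and an end-to-end playback delay, subtracted from the network-layer reward. All parameter names, units, and default weights are illustrative assumptions.

```python
def qoe_penalty(stall, playback_delay_s, lam=(1.0, 0.5)):
    """Application-layer penalty: weighted stuttering plus weighted playback delay."""
    lam_stall, lam_delay = lam  # assumed example weights
    return lam_stall * stall + lam_delay * playback_delay_s


def cross_layer_reward(throughput_mbps, queuing_delay_ms, loss_rate, rate_change_mbps,
                       stall, playback_delay_s,
                       weights=(1.0, 0.01, 10.0, 0.1, 1.0), lam=(1.0, 0.5)):
    """Throughput gain minus network-layer penalties minus the QoE penalty."""
    w_thr, w_delay, w_loss, w_fluct, w_qoe = weights
    return (w_thr * throughput_mbps
            - w_delay * queuing_delay_ms
            - w_loss * loss_rate
            - w_fluct * rate_change_mbps
            - w_qoe * qoe_penalty(stall, playback_delay_s, lam))


smooth = cross_layer_reward(2.0, 10.0, 0.0, 0.1, stall=0.0, playback_delay_s=0.2)
stalled = cross_layer_reward(2.0, 10.0, 0.0, 0.1, stall=1.0, playback_delay_s=0.2)
```

Here `stall` is passed in directly; how it is derived from the rendering frame rate (for example, by thresholding) is left open, since the specification only states that the two are correlated.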
[0055] In designing the aforementioned multi-objective reward function, this invention also incorporates a long-term penalty mechanism. Specifically, since the feedback cycle of application-layer QoE information is typically long—for example, QoE metrics such as video rendering frame rate and jitter buffer occupancy are usually fed back to the bandwidth estimator at a cycle of approximately 1 second—while network-layer QoS information (such as latency, packet loss rate, and reception rate) is typically updated and acquired at a cycle of approximately 200 ms, if an instantaneous penalty is applied to a QoE drop event only in a single cycle, the bandwidth control strategy may again generate unreasonable rate adjustment behavior within a short period, leading to repeated video stuttering or playback quality fluctuations. To address this issue, this invention introduces a long-term penalty mechanism into the multi-objective reward function. When a QoE drop is detected (e.g., video stuttering or increased end-to-end playback latency), the corresponding penalty (i.e., the application-layer user experience penalty) is applied continuously for multiple subsequent cycles. This ensures that the bandwidth control strategy remains sensitive to this adverse state for a period of time, thereby preventing the strategy from recovering to an excessively high transmission rate too quickly and improving the playback stability and user experience of the real-time communication system in dynamic network environments.
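The long-term penalty mechanism can be sketched as a small stateful helper that keeps applying a QoE penalty for several cycles after a drop is detected. The class name, the `hold_cycles` parameter, and the choice to hold the largest recent penalty are assumptions for illustration; the specification only states that the penalty persists over multiple subsequent cycles.

```python
class LongTermPenalty:
    """Keep applying a QoE penalty for `hold_cycles` cycles after a QoE drop."""

    def __init__(self, hold_cycles=5):
        self.hold_cycles = hold_cycles
        self._remaining = 0   # cycles left during which the held penalty still applies
        self._held = 0.0      # penalty value being held

    def update(self, instant_penalty):
        """Return the penalty to apply this cycle, given the instantaneous QoE penalty."""
        if instant_penalty > 0.0:
            # New drop: hold the larger of the new and the still-active penalty
            active = self._held if self._remaining > 0 else 0.0
            self._held = max(instant_penalty, active)
            self._remaining = self.hold_cycles
            return self._held
        if self._remaining > 0:
            self._remaining -= 1
            return self._held
        self._held = 0.0
        return 0.0


tracker = LongTermPenalty(hold_cycles=3)
applied = [tracker.update(p) for p in [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
```

In this sketch a single stutter event at the second cycle keeps the penalty active for three further cycles, so the policy is discouraged from raising the rate again immediately.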
[0056] In some embodiments, an improved proximal policy optimization (PPO) algorithm is used to perform online policy optimization on the policy network so as to limit the deviation of the online policy from the baseline policy. The optimization objective function is: $$J(\theta) = L^{\mathrm{CLIP}}(\theta) - c_1 L^{\mathrm{VF}}(\theta) + c_2 H[\pi_\theta] - \beta\, D_{\mathrm{KL}}\left(\pi_{\mathrm{base}} \,\|\, \pi_\theta\right)$$ In the formula, $L^{\mathrm{CLIP}}(\theta)$ denotes the update objective of the policy network, $L^{\mathrm{VF}}(\theta)$ denotes the loss function of the value network, $H[\pi_\theta]$ denotes the policy entropy term used to encourage exploration, $c_1$ and $c_2$ denote weighting coefficients, $\beta$ denotes the weighting coefficient of the KL divergence constraint term, $D_{\mathrm{KL}}$ denotes the KL divergence, $\pi_{\mathrm{base}}$ denotes the policy network of the initial bandwidth estimator, which serves as the baseline policy, and $\pi_\theta$ denotes the online policy of the target bandwidth estimator. The symbol $\|$ expresses the directionality of the KL divergence: $\pi_{\mathrm{base}}$ is taken as the reference distribution against which $\pi_\theta$ is measured. That is, $D_{\mathrm{KL}}(\pi_{\mathrm{base}} \,\|\, \pi_\theta)$ uses the baseline policy $\pi_{\mathrm{base}}$ as the reference distribution to measure the degree of deviation of the online policy $\pi_\theta$ from the baseline policy.
[0057] In traditional PPO algorithms, KL divergence is typically used to constrain the difference between the current policy and the policy of the previous round. However, in the improved PPO algorithm of this invention, the KL divergence constraint term is used to constrain the difference between the current online policy and the baseline policy obtained during offline training. By introducing this KL divergence constraint term into the optimization objective, the deviation of the online policy from the offline expert policy can be limited, thereby avoiding system instability caused by drastic policy changes during reinforcement learning.
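To make the baseline-anchored KL constraint concrete, the sketch below evaluates a per-sample PPO-style objective with an added KL penalty toward the offline baseline policy. It assumes one-dimensional Gaussian policies, the standard clipped surrogate, and a maximization sign convention; the coefficient names `c1`, `c2`, `beta` and their defaults are illustrative, and the exact form used by the invention may differ.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) between univariate Gaussians P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_anchored_objective(ratio, advantage, value_loss, entropy, kl_to_baseline,
                          c1=0.5, c2=0.01, beta=1.0):
    """Objective to maximize: surrogate - c1*value loss + c2*entropy - beta*KL to baseline."""
    return (clipped_surrogate(ratio, advantage)
            - c1 * value_loss
            + c2 * entropy
            - beta * kl_to_baseline)


# Baseline (offline) policy N(0, 1) as reference, online policy N(1, 1)
kl = gaussian_kl(0.0, 1.0, 1.0, 1.0)
score = kl_anchored_objective(ratio=1.5, advantage=1.0, value_loss=0.2, entropy=1.0, kl_to_baseline=kl)
```

The only change relative to conventional PPO in this sketch is the policy the KL term is measured against: the fixed offline baseline rather than the previous iteration's policy.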
[0058] In some embodiments, during the optimization process, the weight coefficient of the KL divergence constraint term is dynamically adjusted based on network stability. When the network stability is high, the weight coefficient of the KL divergence constraint term is decreased; when the network stability is low or a decrease in application layer user experience data is detected, the weight coefficient of the KL divergence constraint term is increased. The network stability is determined based on the round-trip delay change rate, bandwidth change rate, packet loss change rate, and user experience data change rate.
[0059] In the improved proximal policy optimization algorithm, this invention also introduces a dynamic KL divergence constraint mechanism: during the online policy update process, the KL divergence constraint weight is dynamically adjusted according to the degree of change in the current network state. A network stability assessment mechanism is adopted to comprehensively evaluate the degree of change in network state; that is, a network stability index is constructed to reflect the degree of fluctuation of the current network link state. The network stability index is calculated from multi-dimensional features, namely the rate of change of the network round-trip delay, the rate of change of bandwidth, the rate of change of packet loss, and the rate of change of application layer user experience (QoE) data, each weighted by its own weighting coefficient.
[0060] When network state changes are small and links are stable (i.e., network stability is high), the KL divergence constraint weights are appropriately reduced to allow the online strategy to explore more fully within a certain range, thereby improving its adaptability to complex network environments. Conversely, when network state fluctuations are large or a decrease in the QoE index is detected (i.e., network stability is low or the QoE index is low), the KL divergence constraint weights are increased to limit the deviation of the online strategy from the baseline strategy obtained during offline training, thus preventing the reinforcement learning strategy from exhibiting overly aggressive bandwidth adjustment behavior. Through this dynamic KL divergence constraint mechanism, the bandwidth control strategy can improve its adaptability to complex network environments while maintaining stability, thereby reducing bandwidth estimation overshoot and improving the overall transmission stability of the real-time communication system in dynamic network environments.
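The dynamic KL weighting can be sketched as follows. The functional form of the stability index (an inverse of one plus a weighted sum of change rates), the thresholds, the multiplicative update factors, and the clamping bounds are all assumptions made for illustration; the specification states only which inputs are used and in which direction the weight moves.

```python
def network_stability(d_rtt, d_bw, d_loss, d_qoe, weights=(1.0, 1.0, 1.0, 1.0)):
    """Map weighted change rates to a stability score in (0, 1]; 1.0 means no change.

    Assumed form: 1 / (1 + weighted sum of absolute change rates).
    """
    fluctuation = sum(w * abs(x) for w, x in zip(weights, (d_rtt, d_bw, d_loss, d_qoe)))
    return 1.0 / (1.0 + fluctuation)


def adjust_kl_weight(beta, stability, qoe_dropped,
                     low=0.4, high=0.8, down=0.8, up=1.5,
                     beta_min=0.01, beta_max=10.0):
    """Raise the KL weight when the network is unstable or QoE drops; lower it when stable."""
    if qoe_dropped or stability <= low:
        beta *= up      # tighten the constraint toward the baseline policy
    elif stability >= high:
        beta *= down    # relax the constraint to allow more exploration
    return min(beta_max, max(beta_min, beta))


stable_score = network_stability(0.0, 0.0, 0.0, 0.0)
unstable_score = network_stability(0.5, 0.5, 0.5, 0.5)
```

A QoE drop takes priority over a high stability score in this sketch, matching the statement that the weight is increased when stability is low or a QoE decrease is detected.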
[0061] The second stage of this invention is an online proximal policy optimization and fine-tuning process based on the cross-layer fusion of network quality and application layer user experience, combined with controlled exploration. The system places the initial bandwidth estimator in a dynamic, real-time closed loop with the network: it receives QoS and QoE information from the network, returns the estimated bandwidth to the network, and fine-tunes its own parameters, ultimately yielding the target bandwidth estimator. Application layer QoE and network layer QoS metrics are fused in the extended state space. A PPO algorithm with a KL divergence constraint limits policy drift, and a multi-objective reward function containing a long-term penalty mechanism drives the model from rule imitation toward experience optimization.
[0062] By combining the two stages of offline policy learning based on behavior cloning and online policy optimization based on reinforcement learning, this invention can improve the adaptability of bandwidth estimation to complex dynamic network environments while enhancing the stability of system bandwidth estimation, and further realize bandwidth control policy optimization oriented towards user experience.
[0063] To address the shortcomings of existing real-time communication bandwidth estimation methods, this invention proposes a bandwidth estimation method for real-time communication in complex dynamic network environments, thereby improving the accuracy of bandwidth prediction and bandwidth utilization efficiency during real-time audio and video transmission.
[0064] Figure 4 is a flowchart of a real-time communication bandwidth estimation method in one embodiment of the present invention. As shown in Figure 4, the method includes the following steps: Step S410: Obtain network layer quality data for real-time communication; Step S420: Input the network layer quality data into the target bandwidth estimator so that the target bandwidth estimator outputs the target bandwidth prediction result. The target bandwidth estimator is trained in advance using the offline-online hybrid learning method for a real-time communication bandwidth estimator as described above.
[0065] To further improve the user experience in real-time communication systems, Figure 5 is a flowchart of a real-time communication bandwidth estimation method according to another embodiment of the present invention. As shown in Figure 5, the method includes the following steps: Step S510: Obtain network layer quality data and application layer user experience data for real-time communication; Step S520: Input the network layer quality data and the application layer user experience data into the target bandwidth estimator so that the target bandwidth estimator outputs the target bandwidth prediction result. The target bandwidth estimator is trained in advance using the offline-online hybrid learning method for a real-time communication bandwidth estimator as described above.
[0066] In summary, the offline-to-online hybrid learning method and bandwidth estimation method for real-time communication bandwidth estimator of this invention can significantly improve the overall transmission performance of real-time communication systems in complex and dynamic network environments. This invention can more accurately perceive changes in network bandwidth, thereby improving link resource utilization while ensuring network stability, enabling the system to make fuller use of available bandwidth. Simultaneously, in situations of network congestion or significant bandwidth fluctuations, this invention can adjust the transmission rate in a timely manner, effectively reducing network queuing delay and its fluctuations, minimizing the accumulation of end-to-end delays, and improving the smoothness of interaction during real-time communication.
[0067] Furthermore, this invention can maintain good playback stability while increasing throughput, thereby reducing video playback stuttering and image quality fluctuations. In complex heterogeneous network environments, such as scenarios with frequent bandwidth changes in mobile or wireless networks, this invention can still maintain stable rate control behavior, achieving a more reasonable balance between bandwidth utilization efficiency, transmission latency, and user experience, thereby improving the overall service quality and user experience of real-time audio and video communication systems.
[0068] Corresponding to the above method, the present invention also provides an electronic device, which includes a computer device, the computer device including a processor and a memory, the memory storing computer instructions, the processor executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the device implements the steps of the aforementioned method.
[0069] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned method. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.
[0070] This invention also provides a computer program product, including computer instructions that, when executed by a processor, implement the steps of the aforementioned method.
[0071] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0072] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0073] In this invention, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0074] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An offline-to-online hybrid learning method for a bandwidth estimator in real-time communication, characterized in that, The method includes: An offline policy learning mechanism based on behavior cloning is adopted. A preset neural network model is trained offline based on an expert decision dataset to train the neural network model into an initial bandwidth estimator that can output offline bandwidth prediction results based on the input network layer quality data. The expert decision dataset includes multiple network layer quality data and corresponding bandwidths collected in real-time communication based on a heuristic congestion control algorithm. An online policy optimization mechanism based on reinforcement learning is adopted to optimize the initial bandwidth estimator online according to real-time network layer quality data, so as to train the initial bandwidth estimator into a target bandwidth estimator that can output the target bandwidth prediction result based on real-time network layer quality data.
2. The method according to claim 1, characterized in that, The method further includes: An online policy optimization mechanism based on reinforcement learning is adopted to optimize the initial bandwidth estimator online based on real-time network layer quality data and application layer user experience data, so as to train the initial bandwidth estimator into a target bandwidth estimator that can output the target bandwidth prediction result based on real-time network layer quality data and application layer user experience data.
3. The method according to claim 2, characterized in that, An online policy optimization mechanism based on reinforcement learning is employed to optimize the initial bandwidth estimator online based on real-time network layer quality data and application layer user experience data. This trains the initial bandwidth estimator into a target bandwidth estimator capable of outputting target bandwidth prediction results based on real-time network layer quality data and application layer user experience data. The mechanism includes: The initial bandwidth estimator is used as a policy network to obtain the state for each period, which includes network layer quality data and application layer user experience data. Using the aforementioned policy network, the bandwidth prediction results are output online based on the state corresponding to each cycle; Using a value network, a corresponding multi-objective reward is determined based on the online bandwidth prediction results. The multi-objective reward is determined by considering throughput gain, queuing delay penalty, packet loss penalty, and rate fluctuation penalty at the network layer, as well as user experience penalty at the application layer. The user experience penalty at the application layer is determined by considering video playback stuttering penalty and end-to-end playback latency penalty. The online bandwidth prediction result when the multi-objective reward is maximized is output as the target bandwidth prediction result, and the target bandwidth estimator is obtained.
4. The method according to claim 3, characterized in that the formula for calculating the multi-objective reward is as follows: $$r_t = w_1 T_t - w_2 D_t - w_3 L_t - w_4 \Delta R_t - w_5 Q_t, \qquad Q_t = \lambda_1 S_t + \lambda_2 P_t$$ In the formula, $r_t$ denotes the multi-objective reward of the $t$-th cycle, $T_t$ denotes throughput, $D_t$ denotes queuing delay, $L_t$ denotes packet loss, $\Delta R_t$ denotes rate fluctuation, $Q_t$ denotes the application layer user experience penalty, and $w_1$, $w_2$, $w_3$, $w_4$, and $w_5$ denote the weighting coefficients; $S_t$ denotes video playback stuttering, $P_t$ denotes end-to-end playback delay, and $\lambda_1$ and $\lambda_2$ denote the weighting coefficients.
5. The method according to claim 1, characterized in that, An online policy optimization mechanism based on reinforcement learning is employed to optimize the initial bandwidth estimator online based on real-time network layer quality data, thereby training the initial bandwidth estimator into a target bandwidth estimator capable of outputting target bandwidth prediction results based on real-time network layer quality data. This includes: The initial bandwidth estimator is used as a policy network to obtain the state for each cycle, and the state includes network layer quality data. Using the policy network, the corresponding online bandwidth prediction results are output based on the state of each cycle; Using a value network, a corresponding multi-objective reward is determined based on the online bandwidth prediction results. The multi-objective reward is determined by a throughput gain item, a queuing delay penalty item, a packet loss penalty item, and a rate fluctuation penalty item based on the network layer. The online bandwidth prediction result when the multi-objective reward is maximized is output as the target bandwidth prediction result, and the target bandwidth estimator is obtained.
6. The method according to claim 3 or 5, characterized in that an improved proximal policy optimization algorithm is used to perform online policy optimization on the policy network so as to limit the deviation of the online policy from the baseline policy. The optimization objective function is: $$J(\theta) = L^{\mathrm{CLIP}}(\theta) - c_1 L^{\mathrm{VF}}(\theta) + c_2 H[\pi_\theta] - \beta\, D_{\mathrm{KL}}\left(\pi_{\mathrm{base}} \,\|\, \pi_\theta\right)$$ In the formula, $L^{\mathrm{CLIP}}(\theta)$ denotes the update objective of the policy network, $L^{\mathrm{VF}}(\theta)$ denotes the loss function of the value network, $H[\pi_\theta]$ denotes the policy entropy term used to encourage exploration, $c_1$ and $c_2$ denote weighting coefficients, $\beta$ denotes the weighting coefficient of the KL divergence constraint term, $D_{\mathrm{KL}}$ denotes the KL divergence, $\pi_{\mathrm{base}}$ denotes the policy network of the initial bandwidth estimator, which serves as the baseline policy, and $\pi_\theta$ denotes the online policy of the target bandwidth estimator. The symbol $\|$ expresses the directionality of the KL divergence: $\pi_{\mathrm{base}}$ is taken as the reference distribution against which $\pi_\theta$ is measured.
7. The method according to claim 6, characterized in that, During the optimization process, the weight coefficient of the KL divergence constraint term is dynamically adjusted based on network stability. When the network stability is high, the weight coefficient of the KL divergence constraint term is decreased; when the network stability is low or a decrease in application layer user experience data is detected, the weight coefficient of the KL divergence constraint term is increased. The network stability is determined based on the round-trip delay change rate, bandwidth change rate, packet loss change rate, and user experience data change rate.
8. The method according to claim 2, characterized in that the application layer user experience data includes the receiver jitter buffer occupancy and the client video rendering frame rate.
9. A bandwidth estimation method for real-time communication, characterized in that, The method includes: Obtain network layer quality data for real-time communication; The network layer quality data is input into the target bandwidth estimator so that the target bandwidth estimator outputs a target bandwidth prediction result, wherein the target bandwidth estimator is trained in advance by the offline-online hybrid learning method of the bandwidth estimator for real-time communication as described in any one of claims 1 to 8.
10. The method according to claim 9, characterized in that, The method further includes: Obtain application-layer user experience data for real-time communication; The network layer quality data and the application layer user experience data are input into the target bandwidth estimator so that the target bandwidth estimator outputs the target bandwidth prediction result accordingly.