A video group call jitter buffer learning optimization method and system

By collecting and analyzing global features, network features, and playback feedback features in video group call scenarios, the buffering time of video group calls is optimized, solving the problems of insufficient multi-stream collaboration and role adaptation in existing technologies, and achieving a dynamic balance between smoothness and latency.

CN122247975A · Pending · Publication Date: 2026-06-19 · FUJIAN BEIFENG COMM TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
FUJIAN BEIFENG COMM TECH CO LTD
Filing Date
2026-05-19
Publication Date
2026-06-19

Figure CN122247975A_ABST

Abstract

This invention discloses a video group call jitter buffer learning optimization method and system, specifically relating to the field of real-time audio and video communication technology. This invention achieves refined adaptation of single-stream buffer duration, integrating multi-dimensional network timing characteristics such as one-way transmission delay variance and packet loss rate. It calculates the baseline buffer duration through dynamic basic additional coefficients and comprehensive correction terms, and also sets hard boundary constraints based on the video frame rate, ensuring that the single-stream buffer duration accurately matches the real-time network state. This guarantees basic jitter resistance while avoiding delay redundancy caused by excessively long buffer durations, effectively solving the problem of insufficient single-stream buffer adaptability under different network environments.

Description

Technical Field

[0001] This invention relates to the field of real-time audio and video communication technology, and more specifically, to a video group call jitter buffer learning optimization method and system.

Background Technology

[0002] With the rapid development of real-time audio and video communication technology, video group calling has been widely used in many core scenarios such as enterprise remote conferencing, online education classrooms, emergency command and dispatch, and remote medical consultations. It has become a core technology carrier for cross-regional collaborative interaction. In the video group calling scenario, the receiving end needs to simultaneously receive, decode, and render multiple concurrent video streams. However, the unreliability of IP networks can cause problems such as transmission jitter, out-of-order data packets, random packet loss, and bandwidth fluctuations, which can directly lead to stuttering, screen tearing, and audio-visual asynchrony in video playback, seriously affecting the user's interactive experience.

[0003] Jitter buffering technology is the core means to solve the above problems. The setting of buffer duration is the core of jitter buffering technology: if the buffer duration is too short, it cannot effectively offset network jitter and is prone to causing stuttering and screen tearing; if the buffer duration is too long, it will lead to a significant increase in end-to-end latency and destroy the real-time interactivity of video group calls.

[0004] Existing jitter buffer optimization methods have significant technical shortcomings in multi-stream concurrent video group call scenarios, specifically in the following aspects:

(1) Single-stream independent optimization lacks global multi-stream coordination. Most existing solutions calculate and set the buffer duration of each video stream independently, without considering the hardware resource constraints of the receiving end or the resource competition among multiple streams in a video group call. As the number of group call terminals increases, so does the number of video streams the receiving end must process simultaneously. Buffer durations set independently per stream can easily push total buffer memory usage beyond the resource limit of the receiving end, starving the decoding and rendering modules and increasing system processing latency. Overall playback quality declines, and the approach cannot adapt to large-scale concurrent video group calls.

(2) No differentiated strategy based on the role attributes of the video streams. Video group calls have a clear role hierarchy: user attention, smoothness requirements, and latency sensitivity differ significantly among the streams of the speaker, participants, and observers. Existing technologies generally apply uniform buffer duration calculation and adjustment rules and cannot optimize for the core needs of each role. Either prioritizing the smoothness of the speaker stream drives up buffer latency across all streams and reduces interactivity, or controlling overall latency leaves the core speaker stream with insufficient anti-jitter capability, causing stuttering and screen tearing. A graded balance between smoothness and latency cannot be achieved.

(3) The baseline buffer duration is calculated from a single dimension and adapts poorly to the network. Existing technologies do not jointly consider multiple network timing features such as one-way transmission delay variance, packet out-of-order rate, real-time packet loss rate, and bandwidth estimate. As a result, the baseline buffer duration cannot accurately match complex and changing network environments: in weak networks with strong jitter, high packet loss, and heavy reordering, the buffer duration is too short to offset network fluctuations and playback smoothness cannot be guaranteed; in good network conditions there is significant latency redundancy, and fine-grained dynamic adjustment of the buffer duration is not possible.

(4) There is no closed-loop learning and optimization mechanism based on playback quality feedback. Existing technologies do not feed the actual playback quality measured at the receiving end back into the optimization to correct the buffer duration in real time. Buffer adjustments therefore lag behind changes in network and playback quality, the buffer strategy cannot be iterated and optimized continuously, and it is difficult to maintain the optimal balance between smoothness and latency over the long term in dynamic network and business scenarios.

[0005] To address this, a video group call jitter buffer learning optimization method and system are proposed.

Summary of the Invention

[0006] To overcome the above-mentioned defects of the prior art, embodiments of the present invention provide a video group call jitter buffer learning optimization method and system.

[0007] To achieve the above objectives, the present invention provides the following technical solution. A video group call jitter buffer learning optimization method includes:

Time-series data acquisition: collect the global features of the video group call scenario, the network timing features of each video stream within the group call, and the feedback features from the playback end, to construct a time-series feature dataset;

Single-stream baseline buffer optimization: extract the network timing features of each video stream as the core driver, and apply the preset baseline buffer duration calculation logic to obtain the baseline buffer duration of each single stream;

Multi-stream collaborative prediction: extract the global features of the group call scenario as the core driver, apply differentiated correction to the baseline buffer duration of each single stream, perform collaborative resource allocation, and determine the single-stream buffer duration after resource collaborative allocation;

Closed-loop learning optimization: extract the feedback features from the playback end as the core driver, and perform real-time closed-loop correction on the single-stream buffer duration after resource collaborative allocation.

[0008] Specifically, the global features of a group call scenario include the number of group call terminals and, for each video stream, the role type, encoding bitrate, frame rate, and resolution. The network timing features include the RTP packet arrival time interval, one-way transmission delay variance, RFC 3550 jitter value, packet loss rate, out-of-order rate, and real-time bandwidth estimate. The playback feedback features include the number of stutters, stutter duration, end-to-end average latency, and number of screen glitches.

[0009] Specifically, the logic for calculating the baseline buffer duration is as follows. The baseline buffer duration of a single stream is obtained from a formula with three terms: the RFC 3550 jitter value within the current sliding window; a basic additional coefficient that is dynamically adjusted based on the network state; and a comprehensive correction term output from the superposition of multi-dimensional network features.

[0010] Specifically, the basic additional coefficient and the comprehensive correction term are obtained as follows. The one-way transmission delay variance, out-of-order rate, and packet loss rate are extracted from the network timing features of each video stream; after normalization, they are weighted and fused to output a network state coefficient. The basic additional coefficient is then obtained by converting the network state coefficient through a preset mapping rule between the network state coefficient and the basic additional coefficient. The comprehensive correction term is obtained by combining the out-of-order rate, packet loss rate, real-time bandwidth estimate, and RTP packet arrival time interval.

[0011] Specifically, the differentiated correction of the single-stream baseline buffer duration is as follows. The role type of each video stream is extracted, and a latency weight is preset for each role type; the role types include speaker, participant, and observer. Based on the matched latency weight, the baseline buffer duration of a single stream is corrected by formula to obtain the priority-corrected buffer duration, where the formula takes the extracted latency weight as its parameter.

[0012] Specifically, the single-stream buffer duration after resource collaborative allocation is determined as follows. The bitrate, frame rate, and resolution of each single stream are mapped to a resource occupancy coefficient. The upper limit of total buffer resources at the receiving end is determined based on the number of group call terminals. The maximum allowable buffer duration of each single stream is calculated from three quantities: the upper limit of total buffer resources at the receiving end, the resource occupancy coefficient of single stream i, and the sum of the resource occupancy coefficients of all single streams. The single-stream buffer duration after resource collaborative allocation is then determined so that it does not exceed this maximum allowable buffer duration.

[0013] Specifically, the buffer duration of a single stream after resource collaborative allocation is corrected in real time using a closed-loop mechanism. A set of reference coefficients is set for different role types. Based on the stuttering severity coefficient, latency redundancy coefficient and decoding anomaly coefficient in the sliding window, the quality constraint coefficient is output after weighted fusion processing in combination with the corresponding set of reference coefficients. The reference coefficient set includes the severe stuttering threshold coefficient, the latency redundancy threshold coefficient, and the decoding anomaly threshold coefficient; The quality constraint coefficient is compared with the preset reference coefficient range. Based on the comparison results, closed-loop correction is selectively triggered, and the buffer duration of the single-path flow after resource collaborative allocation is corrected in real time.

[0014] Specifically, the stutter severity coefficient, latency redundancy coefficient, and decoding anomaly coefficient are obtained as follows. The number of stutters and the cumulative stutter duration within the sliding window are weighted and fused to obtain the stutter severity coefficient. The end-to-end average latency within the sliding window is taken as the numerator and the preset minimum latency of the scenario target as the denominator to calculate the latency redundancy coefficient. The number of screen glitches within the sliding window is counted and converted into the decoding anomaly coefficient using a pre-built mapping rule between the number of screen glitches and the decoding anomaly coefficient.

[0015] Specifically, closed-loop correction is selectively triggered based on the comparison result:

If the quality constraint coefficient is below the reference coefficient range and the role type is speaker, closed-loop correction is not triggered.

If the quality constraint coefficient is below the reference coefficient range and the role type is observer or participant, the absolute difference between the quality constraint coefficient and the lowest value of the reference coefficient range is calculated as the adjustment difference. The mapping rule for the stream's role type is retrieved from the pre-built database, the adjustment difference is converted into an adjustment buffer coefficient, and this coefficient is multiplied by the single-stream buffer duration after resource collaborative allocation to give the buffer duration after closed-loop correction.

If the quality constraint coefficient is above the reference coefficient range and the role type is observer, closed-loop correction is not triggered.

If the quality constraint coefficient is above the reference coefficient range and the role type is speaker or participant, the difference between the quality constraint coefficient and the maximum value of the reference coefficient range is calculated as the upward adjustment difference. The mapping rule for the stream's role type is retrieved from the pre-built database, the upward adjustment difference is converted into an upward adjustment buffer coefficient, and this coefficient is multiplied by the single-stream buffer duration after resource collaborative allocation to give the buffer duration after closed-loop correction.
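The four trigger cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not give the mapping rules stored in the pre-built database, so the linear mapping controlled by the hypothetical `gain` parameter, the 0.5 floor on the shrink factor, and the reading that a below-range result shortens the buffer are all assumptions, as is the function name.

```python
def closed_loop_correct(role, q, ref_low, ref_high, buffer_ms, gain=0.5):
    """Return the buffer duration after selective closed-loop correction.

    role      -- "speaker", "participant", or "observer"
    q         -- quality constraint coefficient for the sliding window
    ref_low   -- lowest value of the role's reference coefficient range
    ref_high  -- highest value of the role's reference coefficient range
    buffer_ms -- single-stream buffer duration after resource allocation
    gain      -- hypothetical slope of the difference-to-coefficient mapping
    """
    if q < ref_low:
        # Below the range: speaker streams are never adjusted.
        if role == "speaker":
            return buffer_ms
        diff = abs(q - ref_low)
        # Assumed mapping: a shrink factor, floored at 0.5.
        coeff = max(0.5, 1.0 - gain * diff)
        return buffer_ms * coeff
    if q > ref_high:
        # Above the range: observer streams are never adjusted.
        if role == "observer":
            return buffer_ms
        diff = q - ref_high
        # Assumed mapping: a growth factor above 1.
        coeff = 1.0 + gain * diff
        return buffer_ms * coeff
    # Inside the reference range: no correction.
    return buffer_ms
```

With a reference range of 0.3 to 0.7, a speaker stream scoring 0.1 keeps its buffer unchanged, while a participant stream with the same score has its buffer shortened.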

[0016] A video group call jitter buffer learning optimization system includes: The feature acquisition module is used to acquire global features of the video group call scenario, network timing features of each video stream within the group call, and feedback features from the playback end. The single-stream baseline calculation module is used to calculate the baseline buffer duration for each video stream based on network timing characteristics; The multi-stream collaborative optimization module is used to differentiate and collaboratively allocate resources based on the global characteristics of the group call scenario, and output the buffer duration of the single stream after resource collaborative allocation. The closed-loop feedback correction module is used to perform real-time closed-loop correction of the single-stream buffer duration after resource collaborative allocation based on the feedback characteristics of the playback end.

[0017] The technical effects and advantages of this invention are as follows: (1) This invention achieves fine-grained adaptation of single-stream buffer duration, integrates multi-dimensional network timing characteristics such as one-way transmission delay variance and packet loss rate, calculates the benchmark buffer duration through dynamic basic additional coefficients and comprehensive correction terms, and sets hard boundary constraints based on video frame rate, so that the single-stream buffer duration accurately matches the real-time network status, which not only ensures basic anti-jitter capability, but also avoids delay redundancy caused by excessively high buffer duration, effectively solving the problem of insufficient adaptability of single-stream buffer in different network environments; (2) This invention realizes the resource optimization allocation of multi-stream collaboration. Combining the role classification and resource constraint characteristics of video group calls, it sets differentiated delay weights and buffer upper and lower limits according to the roles of speaker, participant, and observer. At the same time, it maps bit rate, frame rate, etc. into resource occupancy coefficients, dynamically determines the upper limit of total buffer resources based on the number of group call terminals, allocates the maximum allowable buffer time of a single channel according to the coefficient, and sets a high-quality stream buffer bottom line. This satisfies the smoothness requirements of the core stream and avoids the overload of total buffer resources, adapting to large-scale multi-channel concurrent group call scenarios. 
(3) This invention constructs a real-time closed-loop optimization mechanism driven by playback feedback. It extracts playback feedback features such as stuttering and screen glitches and converts them into quantitative coefficients, sets a dedicated reference coefficient set for each role and calculates quality constraint coefficients, and selectively triggers buffer duration correction for streams of different roles based on the comparison between the coefficients and the reference ranges. It also sets smooth adjustment rules that protect the playback quality of the core stream while compressing the latency of non-core streams in a timely manner, continuously balancing smoothness and real-time interactivity across the entire scenario.

Attached Figure Description

[0018] Figure 1 is a flowchart of a video group call jitter buffer learning optimization method according to the present invention; Figure 2 is a schematic diagram of a video group call jitter buffer learning optimization system according to the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] Example 1

[0021] As shown in Figure 1, a video group call jitter buffer learning optimization method is as follows:

Time-series data acquisition: collect the global features of the video group call scenario, the network timing features of each video stream within the group call, and the feedback features from the playback end, to construct a time-series feature dataset. The global features of a group call scenario include the number of group call terminals and, for each video stream, the role type, encoding bitrate, frame rate, and resolution. The network timing features, collected over a fixed sliding window, include the RTP packet arrival time interval, one-way transmission delay variance, RFC 3550 jitter value, packet loss rate, out-of-order rate, and real-time bandwidth estimate. The playback feedback features include the number of stutters within the sliding window, stutter duration, average end-to-end latency, and number of screen glitches.

Single-stream baseline buffer optimization: extract the network timing features of each video stream as the core driver, and apply the preset baseline buffer duration calculation logic to obtain the baseline buffer duration of each single stream. Specifically, the baseline buffer duration of a single stream is obtained from a formula with three terms: the RFC 3550 jitter value within the current sliding window; a basic additional coefficient that is dynamically adjusted based on the network state; and a comprehensive correction term output from the superposition of multi-dimensional network features.

The one-way transmission delay variance, out-of-order rate, and packet loss rate are extracted from the network timing features of each video stream. After normalization, they are weighted and fused to output the network state coefficient. The calculation process is as follows: the normalized one-way transmission delay variance, out-of-order rate, and packet loss rate are combined, by formula, with the standard one-way transmission delay variance, standard out-of-order rate, and standard packet loss rate from a predefined standard network state dataset, using preset weighting coefficients whose sum is one, to obtain the network state coefficient.

[0022] The lower the network state coefficient, the more stable the network; conversely, the higher the value, the greater the network fluctuation.

[0023] The basic additional coefficient is obtained by converting the network state coefficient through a preset mapping rule between the two. That is, the range of the network state coefficient is divided in advance into intervals, and each interval corresponds to one basic additional coefficient. The value of the basic additional coefficient is limited to 1.5 to 3.5; the initial settings are made by technical personnel and can be dynamically adjusted later. The larger the network state coefficient, the closer the matched value is to 3.5: because network fluctuation is large, the multiplier must be increased to give priority to preventing stutter.
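A minimal Python sketch of this two-stage step is given below. The fusion formula is not reproduced in the text, so the weighted sum of ratios to the standard values is an assumption, as are the weights, the interval thresholds in `THRESHOLDS`, and the intermediate values in `ALPHAS`; only the 1.5 to 3.5 range comes from the description.

```python
from bisect import bisect_right


def network_state_coefficient(variance, ooo_rate, loss_rate, standard,
                              weights=(0.4, 0.3, 0.3)):
    """Fuse three network features into one coefficient (assumed form).

    standard -- (standard variance, standard out-of-order rate,
                 standard packet loss rate) from the reference dataset
    weights  -- hypothetical weights; they must sum to one
    """
    ratios = (variance / standard[0],
              ooo_rate / standard[1],
              loss_rate / standard[2])
    return sum(w * r for w, r in zip(weights, ratios))


# Hypothetical interval boundaries for the network state coefficient.
THRESHOLDS = [0.5, 1.0, 1.5, 2.0]
# One basic additional coefficient per interval, within 1.5 to 3.5.
ALPHAS = [1.5, 2.0, 2.5, 3.0, 3.5]


def basic_additional_coefficient(state):
    """Look up the interval containing `state` and return its coefficient."""
    return ALPHAS[bisect_right(THRESHOLDS, state)]
```

A stream whose features equal the standard values gets a network state coefficient close to 1 under these weights; a coefficient of 3.0 falls in the last interval and maps to 3.5.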

[0024] The comprehensive correction term is obtained by combining the out-of-order rate, packet loss rate, real-time bandwidth estimate, and RTP packet arrival time interval. The calculation process is as follows. Preset correction rules are used to obtain the correction term corresponding to each of the four features:

Out-of-order correction term = out-of-order rate × 100;
Packet loss correction term = packet loss rate × 150;
Bandwidth correction term and arrival interval correction term: each obtained from its own preset rule.

Hard boundary constraints are then applied to each correction term:

Out-of-order correction term ≤ 50 ms (an out-of-order rate above 50% is treated as extreme reordering and triggers the fallback);
Packet loss correction term ≤ 75 ms (a packet loss rate above 50% is treated as extreme packet loss and triggers the fallback);
Bandwidth correction term ≤ 80 ms (a bandwidth of 0 indicates an extremely weak network and triggers the fallback), retaining the upper limit of the original formula;
Arrival interval correction term: constrained by the frame interval. For example, at 25 fps (40 ms frame interval), three times the frame interval is taken as the upper limit (exceeding this is treated as completely out-of-order packet arrival and triggers the fallback).
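The per-feature terms and their hard caps can be sketched in Python as below. The out-of-order and packet loss rules and all four caps follow the description; the rules that produce the raw bandwidth and arrival interval terms are not reproduced in the text, so those two functions take an already computed raw value as input. All function names are hypothetical.

```python
def capped(value_ms, cap_ms):
    """Clamp a correction term to the range [0, cap_ms]."""
    return max(0.0, min(value_ms, cap_ms))


def out_of_order_term(rate):
    """Out-of-order rate (0..1) times 100, capped at 50 ms."""
    return capped(rate * 100.0, 50.0)


def packet_loss_term(rate):
    """Packet loss rate (0..1) times 150, capped at 75 ms."""
    return capped(rate * 150.0, 75.0)


def bandwidth_term(raw_ms):
    """Cap an externally computed bandwidth term at 80 ms."""
    return capped(raw_ms, 80.0)


def arrival_interval_term(raw_ms, fps=25.0):
    """Cap an externally computed term at three frame intervals."""
    return capped(raw_ms, 3.0 * 1000.0 / fps)
```

At 25 fps the arrival interval cap works out to 120 ms, so a raw value of 500 ms is clamped to 120 ms.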

[0025] The four correction terms, after the above constraints, are normalized and substituted into a weighted-sum formula to obtain the comprehensive correction term; the weight coefficients corresponding to the four correction terms sum to one. The calculated comprehensive correction term is then mapped back to the millisecond level to give the final comprehensive correction term. The upper limit for this inverse normalization (that is, the maximum effective value of the comprehensive correction term) is determined with reference to the baseline range of the video group call buffer duration (single-stream baseline buffer ≤ 500 ms): 100 ms is taken as the upper limit (determined by technical personnel after actual measurement, and dynamically adjustable later). Multiplying the calculated comprehensive correction term by this upper limit (set to 100) yields the final comprehensive correction term.
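As an illustration of the normalize, weight, and map-back sequence, the Python sketch below assumes that each constrained term is normalized by dividing it by its own cap; the text does not state the normalization method, so that choice and the function name are assumptions. The 100 ms inverse normalization limit is taken from the description.

```python
def composite_correction_ms(terms_ms, caps_ms, weights, upper_ms=100.0):
    """Combine four capped correction terms into one value in milliseconds.

    terms_ms -- the four correction terms after hard boundary constraints
    caps_ms  -- the cap of each term, used here for normalization (assumed)
    weights  -- four weights that must sum to one
    upper_ms -- inverse normalization upper limit (100 ms in the text)
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    # Normalize each term to the range [0, 1].
    normalized = [t / c for t, c in zip(terms_ms, caps_ms)]
    # Weighted sum stays within [0, 1] because the weights sum to one.
    score = sum(w * n for w, n in zip(weights, normalized))
    # Map the unitless score back to milliseconds.
    return score * upper_ms
```

Under this assumption the result never exceeds 100 ms, which it reaches only when every term sits at its cap.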

[0026] Additional explanation: boundary constraints on the single-stream baseline buffer duration. To avoid ineffective adjustments to the buffer duration, hard boundary constraints are set based on the video frame rate, as follows. Minimum buffer duration: no less than the playback duration of one frame (40 ms at 25 fps), to ensure basic anti-jitter capability. Maximum buffer duration: initially limited to no more than 500 ms, to avoid excessive end-to-end latency.
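The boundary constraints can be sketched in Python as follows. The baseline formula itself is not reproduced in the text; the sketch assumes the form "basic additional coefficient times jitter plus comprehensive correction term", which is consistent with the description of the coefficient as a multiplier but remains an assumption. The function and parameter names are hypothetical.

```python
def baseline_buffer_ms(jitter_ms, alpha, correction_ms,
                       fps=25.0, max_ms=500.0):
    """Return the clamped single-stream baseline buffer duration.

    jitter_ms     -- RFC 3550 jitter value in the current sliding window
    alpha         -- basic additional coefficient (1.5 to 3.5)
    correction_ms -- comprehensive correction term in milliseconds
    fps           -- video frame rate, which sets the lower bound
    max_ms        -- initial upper bound (500 ms in the text)
    """
    # Lower bound: the playback duration of one frame.
    min_ms = 1000.0 / fps
    # Assumed baseline form: multiplier on jitter plus additive correction.
    raw = alpha * jitter_ms + correction_ms
    return max(min_ms, min(raw, max_ms))
```

For example, 30 ms of jitter with a coefficient of 2.0 and a 20 ms correction gives 80 ms, which lies inside the 40 ms to 500 ms bounds at 25 fps and is returned unchanged.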

[0027] Multi-stream collaborative prediction: extract the global features of the group call scenario as the core driver, apply differentiated correction to the baseline buffer duration of each single stream, perform collaborative resource allocation, and determine the single-stream buffer duration after resource collaborative allocation. Specifically: extract the role type of each video stream and preset the latency weight corresponding to each role type. The role types include speaker, participant, and observer. The setting criterion is, for example, that the speaker role is given a lower latency weight, because its optimization goal is to ensure smooth playback.

[0028] Example: speaker, latency weight = 0.2; participant, latency weight = 0.32; observer, latency weight = 0.41.

[0029] Based on the matched latency weight, the baseline buffer duration of a single stream is corrected by formula to obtain the priority-corrected buffer duration. The formula uses the extracted latency weight β together with a smoothness guarantee coefficient: the smaller the latency weight β, the higher the smoothness guarantee coefficient and the greater the increase in buffer duration, which matches the smoothness requirements of high-priority streams.

[0030] Additional notes: the priority-corrected buffer duration is further constrained by the preset upper and lower buffer limits of the corresponding role type. Example rules are as follows. Speaker: the maximum buffer limit is raised to 1000 ms and the minimum buffer limit is raised to 80 ms, enhancing anti-jitter capability under extreme network conditions. Participant: the maximum buffer limit is kept at 500 ms and the minimum buffer limit at 40 ms, following the standard balancing strategy. Observer: the maximum buffer limit is compressed to 300 ms and the minimum buffer limit is 20 ms, strictly controlling latency and resource consumption.
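The role-specific limits in the example rules can be expressed as a small lookup and clamp, sketched below in Python. The limit values come from the example rules; the names `ROLE_LIMITS_MS` and `clamp_for_role` are hypothetical, and the priority-corrected duration is taken as an input because its formula is not reproduced in the text.

```python
# (minimum, maximum) buffer limits in milliseconds, per role type.
ROLE_LIMITS_MS = {
    "speaker": (80.0, 1000.0),
    "participant": (40.0, 500.0),
    "observer": (20.0, 300.0),
}


def clamp_for_role(role, corrected_ms):
    """Clamp a priority-corrected buffer duration to the role's limits."""
    low, high = ROLE_LIMITS_MS[role]
    return max(low, min(corrected_ms, high))
```

A speaker stream is therefore never buffered for less than 80 ms, while an observer stream is never buffered for more than 300 ms.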

[0031] The bitrate, frame rate, and resolution of a single stream are mapped to a resource occupancy coefficient. The coefficients are set according to preset rules; for example, the resolution scaling coefficients are 1080P = 1, 720P = 0.5, and 480P = 0.25.

[0032] The upper limit of the total buffer resources at the receiving end is determined based on the number of group call terminals; The number of terminals is negatively correlated with the maximum buffer duration of a single stream. The more terminals there are, the more video streams the receiver needs to decode simultaneously, and the memory occupied by the decoding and rendering modules will increase linearly. Therefore, the proportion of memory reserved for jitter buffer must be dynamically reduced as the number of terminals increases, thus achieving the enhanced constraint that "the more terminals there are, the stricter the upper limit of total buffer resources".

[0033] Specifically: the upper limit of basic buffer resources is determined from the real-time available memory at the receiving end. It is obtained by multiplying the real-time available memory at the receiving end by a preset basic allocation ratio. The basic allocation ratio is set to 20%, and the remaining 80% of memory is reserved for core incompressible modules such as decoding and rendering, to ensure system stability.

[0034] The current number of group call terminals is obtained and the corresponding scenario type is determined; a buffer memory allocation ratio is set for each scenario type. That is, the number of group call terminals is divided in advance into ranges, and each range corresponds to a scenario type, such as small group calls (less than 5) and medium group calls (6-15). The more group call terminals there are, the lower the corresponding buffer memory allocation ratio. The buffer memory allocation ratio is limited to 3%-20%; its specific setting and dynamic adjustment are made by technical personnel. The upper limit of basic buffer resources is multiplied by the buffer memory allocation ratio to obtain the upper limit of total buffer resources at the receiving end.

To avoid over-compressing total buffer resources when there are many terminals, which would push the buffer duration of the high-priority speaker stream below its minimum anti-jitter requirement, a minimum allowable value for the upper limit of total buffer resources at the receiving end is calculated in reverse from the minimum buffer limit of the high-priority stream. The formula involves the minimum anti-jitter buffer duration of the high-priority speaker stream (80 ms by default), the resource occupancy coefficient of the high-priority speaker stream, and the sum of the resource occupancy coefficients of all streams. This ensures that even with a large number of terminals, the maximum allowable buffer duration of the high-priority stream does not fall below the 80 ms anti-jitter threshold, so the playback experience of the core stream is not sacrificed to control resources.
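A Python sketch of the limit calculation follows. The 20% basic allocation ratio and the 3% to 20% bounds come from the description; the three scenario tiers, their boundaries, and their ratios are illustrative assumptions (the text names only small and medium group calls). The reverse-calculated floor is passed in as `floor_bytes` because its formula is not reproduced in the text.

```python
def total_buffer_limit(available_bytes, n_terminals,
                       floor_bytes=0.0, base_ratio=0.20):
    """Return the upper limit of total buffer resources in bytes.

    available_bytes -- real-time available memory at the receiving end
    n_terminals     -- current number of group call terminals
    floor_bytes     -- minimum allowable limit protecting the speaker stream
    base_ratio      -- basic allocation ratio (20% in the text)
    """
    # Upper limit of basic buffer resources.
    base = available_bytes * base_ratio
    # Hypothetical tiers: more terminals means a smaller ratio.
    if n_terminals <= 5:
        ratio = 0.20
    elif n_terminals <= 15:
        ratio = 0.10
    else:
        ratio = 0.03
    # Never drop below the floor derived from the speaker stream.
    return max(base * ratio, floor_bytes)
```

With these assumed tiers the limit shrinks as terminals are added, until the floor takes over and holds it steady.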

[0035] The maximum allowable buffer duration of a single stream is calculated by combining the stream's resource occupancy coefficient with the upper limit of total buffer resources at the receiving end. The formula involves the upper limit of total buffer resources at the receiving end, the resource occupancy coefficient of single stream i, and the sum of the resource occupancy coefficients of all single streams. The higher the resource occupancy coefficient of a single stream, the longer its maximum allowable buffer duration, so that resources are allocated according to the coding load of the stream; the more streams in the group call, the larger the sum of the resource occupancy coefficients and the lower the maximum allowable buffer duration of each stream, which avoids overall resource overload. Because the buffer memory usage of a single stream is linearly and positively correlated with its buffer duration, the maximum allowable buffer duration is obtained by converting memory to time.

[0036] The single-stream buffer duration after resource collaborative allocation is then determined so that the buffer duration of a single stream does not exceed the maximum allowable value allocated from system resources.
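The proportional allocation and final cap can be sketched in Python as below. The allocation formula and the memory-to-time conversion are not reproduced in the text, so this sketch assumes that each stream receives a share of the total limit proportional to its resource occupancy coefficient, and that the share is converted to time through a hypothetical per-stream `bytes_per_ms` rate.

```python
def allocate_buffers(total_bytes, coeffs, bytes_per_ms, corrected_ms):
    """Cap each stream's buffer duration by its share of total resources.

    total_bytes  -- upper limit of total buffer resources
    coeffs       -- resource occupancy coefficient of each stream
    bytes_per_ms -- assumed memory cost of one buffered millisecond
    corrected_ms -- priority-corrected buffer duration of each stream
    """
    total_coeff = sum(coeffs)
    result = []
    for coeff, rate, wanted_ms in zip(coeffs, bytes_per_ms, corrected_ms):
        # Memory share proportional to the stream's coefficient.
        share_bytes = total_bytes * coeff / total_coeff
        # Convert the memory share into a maximum duration.
        max_ms = share_bytes / rate
        # The final duration never exceeds the allocated maximum.
        result.append(min(wanted_ms, max_ms))
    return result
```

Streams that ask for less than their allocated maximum keep their requested duration; only the over-budget streams are cut back.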

[0037] Closed-loop learning optimization: extract the feedback features from the playback end as the core driver, and perform real-time closed-loop correction on the single-stream buffer duration after resource collaborative allocation. Specifically: the number of stutters and the cumulative stutter duration within the sliding window are weighted and fused to obtain the stutter severity coefficient. The calculation process is as follows: the cumulative stutter duration is divided by the total duration of the sliding window to obtain the stutter duration ratio; the number of stutters and the stutter duration ratio are then combined by formula, using preset weighting coefficients whose sum is one, to give the stutter severity coefficient.
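One possible reading of this fusion is sketched in Python below. The formula is not reproduced in the text, so the weighted sum is an assumption, and so is scaling the stutter count by a hypothetical `max_count` to keep both inputs on a comparable 0 to 1 scale.

```python
def stutter_severity(count, stutter_ms, window_ms,
                     weights=(0.5, 0.5), max_count=10):
    """Fuse stutter count and stutter duration ratio (assumed form).

    count      -- number of stutters in the sliding window
    stutter_ms -- cumulative stutter duration in the window
    window_ms  -- total duration of the sliding window
    weights    -- (count weight, duration weight), summing to one
    max_count  -- hypothetical count at which the count term saturates
    """
    w_count, w_duration = weights
    # Scale the count into [0, 1]; this scaling is an assumption.
    count_ratio = min(count / max_count, 1.0)
    # Share of the window spent stuttering.
    duration_ratio = stutter_ms / window_ms
    return w_count * count_ratio + w_duration * duration_ratio
```

A window with no stutters scores 0, and the score rises with either more frequent or longer stutters.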

[0038] A higher stuttering severity coefficient indicates more severe stuttering, requiring an increase in buffer duration to restore jitter resistance.
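A minimal sketch of the stuttering severity calculation follows. The weighted fusion is assumed to be a plain weighted sum, and the stutter count is assumed to be normalized against a hypothetical ceiling (`max_count`) so that both inputs are dimensionless; neither detail is given in the source text, and the default weights are placeholders.

```python
def stutter_severity(stutter_count, stutter_ms, window_ms,
                     max_count=10, w_count=0.5, w_duration=0.5):
    """Fuse stutter count and stutter duration percentage into one coefficient."""
    if window_ms <= 0:
        raise ValueError("sliding window duration must be positive")
    if abs(w_count + w_duration - 1.0) > 1e-9:
        raise ValueError("weighting coefficients must sum to one")

    # Share of the sliding window spent stalled, clamped to [0, 1].
    duration_ratio = min(stutter_ms / window_ms, 1.0)
    # Assumption: the count is normalized against a hypothetical ceiling.
    count_norm = min(stutter_count / max_count, 1.0)
    return w_count * count_norm + w_duration * duration_ratio


if __name__ == "__main__":
    # Two stutters totalling 500 ms inside a 5 s window.
    print(stutter_severity(2, 500.0, 5000.0))
```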

[0039] The end-to-end average latency within the sliding window is taken as the numerator and the preset minimum latency of the scene target as the denominator to calculate the latency redundancy coefficient. (The preset minimum latency of the scene target is 100 ms, which is the fixed latency of acquisition + encoding + transmission + decoding + rendering.) A ratio greater than 1 indicates latency redundancy, so there is room to reduce the buffer and lower the latency; a ratio less than 1 indicates that the latency has fallen below the lower limit and the buffer needs to be increased.
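The ratio described above can be written in a few lines. The 100 ms target comes from the text; the function name and the use of a simple arithmetic mean over the window's latency samples are assumptions.

```python
TARGET_MIN_LATENCY_MS = 100.0  # preset minimum latency of the scene target (from the text)


def latency_redundancy(latencies_ms, target_ms=TARGET_MIN_LATENCY_MS):
    """Return average end-to-end latency in the window divided by the target minimum."""
    if not latencies_ms:
        raise ValueError("the sliding window contains no latency samples")
    average_ms = sum(latencies_ms) / len(latencies_ms)
    return average_ms / target_ms


if __name__ == "__main__":
    ratio = latency_redundancy([140.0, 160.0, 150.0])
    # A ratio above 1 means there is room to shrink the buffer.
    print(ratio, "reduce buffer" if ratio > 1 else "increase buffer")
```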

[0040] The number of screen glitches within the sliding window is counted and converted into the decoding anomaly coefficient using a pre-constructed mapping rule between the number of screen glitches and the decoding anomaly coefficient. The rule divides the number of screen glitches into intervals, each corresponding to one decoding anomaly coefficient. The decoding anomaly coefficient is limited to the range 0 to 1, excluding 0; the more screen glitches there are, the closer the matched coefficient is to 1.

[0041] A higher decoding anomaly coefficient indicates more frequent screen glitches, requiring an increase in buffer duration to provide a longer out-of-order waiting window and more retransmission margin.
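The interval-based mapping from glitch count to decoding anomaly coefficient can be sketched as a sorted-boundary lookup. The interval edges and coefficient values below are hypothetical placeholders; the source only states that the coefficient lies in (0, 1] and grows with the glitch count.

```python
from bisect import bisect_left

# Hypothetical interval upper bounds (inclusive) for the glitch count.
GLITCH_UPPER_BOUNDS = [0, 2, 5, 10]
# One coefficient per interval, plus one for counts above the last bound.
ANOMALY_COEFFS = [0.1, 0.3, 0.6, 0.8, 1.0]


def decoding_anomaly_coefficient(glitch_count):
    """Map a screen-glitch count in the sliding window to a coefficient in (0, 1]."""
    if glitch_count < 0:
        raise ValueError("glitch count cannot be negative")
    # bisect_left returns the index of the first bound >= glitch_count,
    # so each bound is the inclusive upper edge of its interval.
    return ANOMALY_COEFFS[bisect_left(GLITCH_UPPER_BOUNDS, glitch_count)]


if __name__ == "__main__":
    for count in (0, 1, 5, 6, 25):
        print(count, decoding_anomaly_coefficient(count))
```

A lookup table keeps the rule easy to tune per deployment, which matches the text's description of a pre-constructed mapping rather than a closed-form function.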

[0042] A set of reference coefficients is set for each role type. Based on the stuttering severity coefficient, latency redundancy coefficient and decoding anomaly coefficient in the sliding window, the quality constraint coefficient is output after comprehensive processing in combination with the corresponding set of reference coefficients. The reference coefficient set includes the severe stuttering threshold coefficient, the latency redundancy threshold coefficient, and the decoding anomaly threshold coefficient. The reference coefficient sets are set according to the following criteria, for example: Main speaker (high priority): the threshold coefficients for severe stuttering and latency redundancy are set low (more sensitive); the threshold coefficient for decoding anomalies is set high (not sensitive to latency).

[0043] Participant (medium priority): the threshold coefficients for severe stuttering, latency redundancy, and decoding anomalies are set to moderate levels to balance smoothness and latency.

[0044] Observer (low priority): the threshold coefficients for severe stuttering and latency redundancy are set high (not sensitive); the threshold coefficient for decoding anomalies is set low (sensitive to latency).

[0045] The calculation process is as follows: the quality constraint coefficient is calculated by a formula whose terms are the threshold coefficients for severe stuttering, latency redundancy, and decoding anomalies; three preset weighting coefficients whose sum is one; and the stuttering severity coefficient defined above together with the latency redundancy coefficient and the decoding anomaly coefficient.

[0046] The quality constraint coefficient is compared with the preset reference coefficient range. A quality constraint coefficient below the reference coefficient range indicates "excess" quality: the current playback quality is very good (almost no stuttering, no screen glitches), but the cost may be a longer buffer duration and higher end-to-end latency, which is a state of "quality redundancy". If the coefficient is below the reference coefficient range and the role type is the main speaker, closed-loop correction is not triggered: because the main speaker is the core stream, the priority is to ensure "absolute smoothness", and even if there is some redundancy the buffer is not reduced, to prevent immediate stuttering when the network fluctuates.

[0047] If the coefficient is below the reference coefficient range and the role type is an observer or participant, the absolute difference between the quality constraint coefficient and the lowest value of the reference coefficient range is calculated as the downward adjustment difference. The mapping rule for the role type in question is extracted from the pre-built database, the downward adjustment difference is converted into a downward adjustment buffer coefficient, and this coefficient is multiplied by the single-stream buffer duration after resource collaborative allocation to give the buffer duration after closed-loop correction. The buffer is reduced for participants and observers because these two roles are more sensitive to latency (or have a slightly higher tolerance for reduced smoothness); since quality is in excess, the buffer duration can be compressed through the downward adjustment buffer coefficient to reduce end-to-end latency and improve interactivity. Mapping rule example: if the current role type is observer, the difference ranges of the downward adjustment difference for the observer role are extracted, each difference range corresponding to one downward adjustment buffer coefficient; the downward adjustment buffer coefficient of the observer role can be limited to the range 0.8-1.0, and the larger the downward adjustment difference, the closer the matched coefficient is to 0.8.

[0048] A quality constraint coefficient above the reference coefficient range indicates "insufficient" quality: the current playback quality has deteriorated (more stuttering, frequent screen glitches, or latency so low that the buffer cannot absorb jitter), which is a "quality warning" state. If the coefficient is above the reference coefficient range and the role type is an observer, closed-loop correction is not triggered: because the observer has the lowest priority, "low latency" and "saving resources" take precedence, and even if the quality is slightly worse the stream does not occupy more buffer resources.

[0049] If the coefficient is above the reference coefficient range and the role type is a main speaker or a participant, the difference between the quality constraint coefficient and the maximum value of the reference coefficient range is calculated as the upward adjustment difference. The mapping rule for the role type in question is extracted from the pre-built database, the upward adjustment difference is converted into an upward adjustment buffer coefficient, and this coefficient is multiplied by the single-stream buffer duration after resource collaborative allocation to give the buffer duration after closed-loop correction. The buffer is increased for main speakers and participants because these two roles are crucial, especially the main speaker, for which smoothness must be prioritized; raising the buffer duration through the upward adjustment buffer coefficient improves resistance to jitter and packet loss, and playback quality is restored even at the cost of some latency. Mapping rule example: if the current role type is main speaker, the difference ranges of the upward adjustment difference for the main speaker role are extracted, each difference range corresponding to one upward adjustment buffer coefficient; the upward adjustment buffer coefficient of the main speaker role is set relatively large and can be limited to the range 1.1-1.5.
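The role-dependent triggering logic of paragraphs [0046] to [0049] can be sketched as follows. The quality constraint coefficient is taken as an input because its formula is not reproduced in the text. The rule tables stand in for the pre-built database; their difference bounds and the participant coefficients are hypothetical, while the observer range 0.8-1.0 and the main speaker range 1.1-1.5 follow the text.

```python
from bisect import bisect_left

# Hypothetical rule tables: (difference upper bounds, buffer coefficients).
DOWN_RULES = {  # applied when quality is in excess; main speaker is absent on purpose
    "observer": ([0.1, 0.3], [1.0, 0.9, 0.8]),
    "participant": ([0.1, 0.3], [1.0, 0.95, 0.9]),
}
UP_RULES = {  # applied when quality is insufficient; observer is absent on purpose
    "main_speaker": ([0.1, 0.3], [1.1, 1.3, 1.5]),
    "participant": ([0.1, 0.3], [1.05, 1.1, 1.2]),
}


def closed_loop_correct(buffer_ms, quality, ref_range, role):
    """Return the buffer duration after selective closed-loop correction."""
    low, high = ref_range
    if quality < low:
        rules, diff = DOWN_RULES.get(role), low - quality
    elif quality > high:
        rules, diff = UP_RULES.get(role), quality - high
    else:
        return buffer_ms  # inside the reference range: nothing to correct
    if rules is None:
        return buffer_ms  # correction is not triggered for this role
    bounds, coeffs = rules
    return buffer_ms * coeffs[bisect_left(bounds, diff)]


if __name__ == "__main__":
    print(closed_loop_correct(100.0, 0.0, (0.4, 0.6), "observer"))
    print(closed_loop_correct(100.0, 1.0, (0.4, 0.6), "main_speaker"))
```

Leaving a role out of a table is one simple way to encode "correction is not triggered", so the asymmetry between main speaker and observer lives in data rather than in branching code.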

[0050] Additional note: adjustments to the receiver buffer level follow smoothing rules. A single adjustment shall not exceed 20% of the current buffer duration; the maximum adjustment step is ≤30 ms for high-priority streams and ≤20 ms for medium- and low-priority streams.
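The smoothing rules can be sketched as a clamp on each adjustment. The text gives two limits (20% of the current buffer duration, and a 30 ms or 20 ms step cap by priority) without saying how they combine; this sketch assumes both apply and the tighter one wins. The function and parameter names are hypothetical.

```python
def smooth_adjustment(current_ms, target_ms, high_priority):
    """Move the buffer toward target_ms without exceeding the smoothing limits."""
    max_step_ms = 30.0 if high_priority else 20.0
    # Assumption: both limits apply, so the tighter one bounds the step.
    limit_ms = min(0.2 * current_ms, max_step_ms)
    delta_ms = target_ms - current_ms
    # Clamp the requested change to [-limit_ms, +limit_ms].
    delta_ms = max(-limit_ms, min(limit_ms, delta_ms))
    return current_ms + delta_ms


if __name__ == "__main__":
    # High-priority stream at 200 ms asked to jump to 300 ms: capped at +30 ms.
    print(smooth_adjustment(200.0, 300.0, high_priority=True))
    # Low-priority stream at 200 ms asked to drop to 100 ms: capped at -20 ms.
    print(smooth_adjustment(200.0, 100.0, high_priority=False))
```

Repeated calls converge on the target over several windows, which avoids abrupt playback-delay changes.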

[0051] Example 2

[0052] As shown in Figure 2, based on the video group call jitter buffer learning optimization method provided in Embodiment 1 of this application, Embodiment 2 of this application proposes a video group call jitter buffer learning optimization system. Embodiment 2 is merely a preferred embodiment of Embodiment 1, and the implementation of Embodiment 2 does not affect the separate implementation of Embodiment 1.

[0053] Specifically, Embodiment 2 of this application provides a video group call jitter buffer learning optimization system, comprising: a feature acquisition module, used to acquire global features of the video group call scenario, network timing features of each video stream within the group call, and feedback features from the playback end; a single-stream baseline calculation module, used to calculate the baseline buffer duration for each video stream based on the network timing features; a multi-stream collaborative optimization module, used to perform differentiated correction and collaborative resource allocation based on the global features of the group call scenario, and to output the single-stream buffer duration after resource collaborative allocation; and a closed-loop feedback correction module, used to perform real-time closed-loop correction of the single-stream buffer duration after resource collaborative allocation based on the feedback features from the playback end. The above formulas are all calculated on dimensionless quantities; the quantities can be made dimensionless by various methods such as standardization, which are not elaborated here. The formulas were derived from software simulation on a large amount of collected data, and the preset parameters in the formulas can be set by those skilled in the art according to the actual situation.

[0054] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium can be a solid-state drive.

[0055] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0056] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0057] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0058] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0059] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0060] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0061] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A video group call jitter buffer learning optimization method, characterized in that it comprises the following steps: Time-series data acquisition: collecting global features of the video group call scenario, network timing features of each video stream within the group call, and feedback features from the playback end to construct a time-series feature dataset; Single-stream baseline buffer optimization: extracting the network timing features of each video stream as the core driver, and obtaining the baseline buffer duration of a single stream using preset baseline buffer duration calculation logic; Multi-stream collaborative prediction: extracting the global features of the group call scenario as the core driver, and, after differentiated correction of the baseline buffer duration of a single stream and collaborative resource allocation, determining the single-stream buffer duration after resource collaborative allocation; Closed-loop learning optimization: extracting the feedback features from the playback end as the core driver, and performing real-time closed-loop correction on the single-stream buffer duration after resource collaborative allocation.

2. The video group call jitter buffer learning optimization method according to claim 1, characterized in that: The global features of a group call scenario include the number of group call terminals, the role type of each video stream, the encoding bitrate, the frame rate, and the resolution. Network timing characteristics include RTP packet arrival time interval, one-way transmission delay variance, RFC3550 standard jitter value, packet loss rate, out-of-order rate, and real-time bandwidth estimate. Feedback characteristics include the number of stutters, stutter duration, end-to-end average latency, and number of screen glitches.

3. The video group call jitter buffer learning optimization method according to claim 2, characterized in that: the baseline buffer duration calculation logic is as follows: the baseline buffer duration of the single stream is obtained by a formula whose terms are the RFC 3550 standard jitter value in the current sliding window, a basic additional coefficient that is dynamically adjusted based on network conditions, and a comprehensive correction term output from the superposition of multi-dimensional network features.

4. The video group call jitter buffer learning optimization method according to claim 3, characterized in that: the logic for obtaining the basic additional coefficient and the comprehensive correction term is as follows: the one-way transmission delay variance, the out-of-order rate and the packet loss rate are extracted from the network timing features of each video stream, normalized, and subjected to weighted fusion processing to output a network state coefficient; the network state coefficient is converted into the basic additional coefficient using a mapping rule constructed between the network state coefficient and the basic additional coefficient; and the comprehensive correction term is obtained by combining the out-of-order rate, the packet loss rate, the real-time bandwidth estimate, and the RTP packet arrival time interval.

5. The video group call jitter buffer learning optimization method according to claim 2, characterized in that: the differentiated correction of the baseline buffer duration of a single stream is as follows: the role type of each video stream is extracted, and the latency weight corresponding to each role type is preset; the role types include main speaker, participant, and observer; based on the matched latency weight, the baseline buffer duration of the single stream is corrected by a formula to obtain the priority-corrected buffer duration, where the weight term in the formula is the extracted latency weight.

6. The video group call jitter buffer learning optimization method according to claim 5, characterized in that: the single-stream buffer duration after resource collaborative allocation is determined as follows: the bitrate, frame rate, and resolution of a single stream are mapped to a resource occupancy coefficient; the upper limit of the total buffer resources at the receiving end is determined based on the number of terminals in the group call; the maximum allowable buffer duration for a single stream is calculated by a formula combining the resource occupancy coefficient of the single stream and the upper limit of the total buffer resources at the receiving end, the terms of the formula being the upper limit of the total buffer resources available at the receiving end, the resource occupancy coefficient of single stream i, and the sum of the resource occupancy coefficients of all single streams; and the single-stream buffer duration after resource collaborative allocation is determined from the priority-corrected buffer duration and the maximum allowable buffer duration.

7. The video group call jitter buffer learning optimization method according to claim 6, characterized in that: Real-time closed-loop correction of the buffer duration of a single stream after resource collaborative allocation; A set of reference coefficients is set for different role types. Based on the stuttering severity coefficient, latency redundancy coefficient and decoding anomaly coefficient in the sliding window, the quality constraint coefficient is output after weighted fusion processing in combination with the corresponding set of reference coefficients. The reference coefficient set includes the severe stuttering threshold coefficient, the latency redundancy threshold coefficient, and the decoding anomaly threshold coefficient; The quality constraint coefficient is compared with the preset reference coefficient range. Based on the comparison results, closed-loop correction is selectively triggered, and the buffer duration of the single-path flow after resource collaborative allocation is corrected in real time.

8. The video group call jitter buffer learning optimization method according to claim 7, characterized in that: The logic for obtaining the stuttering severity coefficient, latency redundancy coefficient, and decoding anomaly coefficient; After weighted fusion of the number of stutters and the cumulative duration of stutters within the sliding window, a stutter severity coefficient is obtained; The end-to-end average latency within the sliding window is extracted as the numerator, and the preset minimum latency of the scene target is used as the denominator to calculate the latency redundancy coefficient. The number of screen glitches within the sliding window is counted, and the number of screen glitches is converted into the decoding anomaly coefficient using a pre-built mapping rule of screen glitches and decoding anomaly coefficients.

9. The video group call jitter buffer learning optimization method according to claim 7, characterized in that: closed-loop correction is selectively triggered based on the comparison results as follows: if the quality constraint coefficient is below the reference coefficient range and the role type is the main speaker, closed-loop correction is not triggered; if the quality constraint coefficient is below the reference coefficient range and the role type is an observer or participant, the absolute difference between the quality constraint coefficient and the lowest value of the reference coefficient range is calculated as the downward adjustment difference, the mapping rule for the role type in question is extracted from the pre-built database, the downward adjustment difference is converted into a downward adjustment buffer coefficient, and this coefficient is multiplied by the single-stream buffer duration after resource collaborative allocation to give the buffer duration after closed-loop correction; if the quality constraint coefficient is above the reference coefficient range and the role type is an observer, closed-loop correction is not triggered; if the quality constraint coefficient is above the reference coefficient range and the role type is a main speaker or a participant, the difference between the quality constraint coefficient and the maximum value of the reference coefficient range is calculated as the upward adjustment difference, the mapping rule for the role type in question is extracted from the pre-built database, the upward adjustment difference is converted into an upward adjustment buffer coefficient, and this coefficient is multiplied by the single-stream buffer duration after resource collaborative allocation to give the buffer duration after closed-loop correction.

10. A video group call jitter buffer learning optimization system, applied to the video group call jitter buffer learning optimization method according to any one of claims 1-9, characterized in that it comprises: a feature acquisition module, used to acquire global features of the video group call scenario, network timing features of each video stream within the group call, and feedback features from the playback end; a single-stream baseline calculation module, used to calculate the baseline buffer duration for each video stream based on the network timing features; a multi-stream collaborative optimization module, used to perform differentiated correction and collaborative resource allocation based on the global features of the group call scenario, and to output the single-stream buffer duration after resource collaborative allocation; and a closed-loop feedback correction module, used to perform real-time closed-loop correction of the single-stream buffer duration after resource collaborative allocation based on the feedback features from the playback end.