Multimedia resource playing method

By building a virtual dual-stream merging mechanism in the player kernel and utilizing fragment indexing and data packet integration strategies, the audio-visual synchronization problem of traditional players during bitrate switching is solved, achieving seamless switching and smooth playback.

CN122269072A · Pending · Publication Date: 2026-06-23 · SHUXING TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHUXING TECH (BEIJING) CO LTD
Filing Date: 2026-03-26
Publication Date: 2026-06-23

Application Information

Patent Timeline

26 Mar 2026: Application
23 Jun 2026: Publication (CN122269072A)

IPC: H04N21/439; H04N21/44; H04N21/4402; H04N21/845

AI Tagging

Application Domain

Selective content distribution

Technology Topics

Data pack · Audio frequency

AI Technical Summary

Technical Problem

Traditional media player architectures struggle to ensure audio-visual synchronization during bitrate switching, leading to issues like screen tearing, rewinding, or stuttering.

Method used

By constructing a virtual dual-stream merging mechanism in the player kernel, the target audio and video segments are determined using the segment index, and then merged into a data packet sequence according to the data packet integration strategy, thereby merging the audio and video streams and ensuring synchronous playback.

Benefits of technology

It achieves synchronized audio and video playback and smooth playback, avoiding stuttering caused by triggered events and improving the user experience.

✦ Generated by Eureka AI based on patent content.

Abstract

The embodiment of the present specification provides a multimedia resource playing method, wherein the multimedia resource playing method is applied to a player kernel, and the method comprises the following steps: in response to a trigger event for a multimedia resource, determining a target audio segment in an audio file contained in the multimedia resource according to a segment index of the multimedia resource, and determining a target video segment in a video file contained in the multimedia resource; integrating at least one audio data packet in the target audio segment and at least one video data packet in the target video segment into a data packet sequence according to a data packet integration strategy; and performing a playing task of the multimedia resource according to the data packet sequence.

Description

Technical Field

[0001] The embodiments in this specification relate to the field of multimedia resource processing technology, and in particular to multimedia resource playback methods.

Background Technology

[0002] In complex mobile network environments and with diverse services, media players need to frequently switch between different bitrates or audio tracks. Traditional player architectures struggle to guarantee audio-visual synchronization during bitrate switching in video playback. Ensuring precise alignment of the old and new streams on the timeline and handling hardware decoder context resets during stream switching are key challenges in player kernel implementation. Existing players often experience screen tearing, rewinding, or stuttering during transitions due to improper handling. Therefore, a more effective multimedia resource playback method is urgently needed to address these issues.

Summary of the Invention

[0003] In view of the above, embodiments of this specification provide a method for playing multimedia resources. One or more embodiments of this specification also relate to a multimedia resource playback device, a computing device, a computer-readable storage medium, and a computer program product, to address the technical deficiencies existing in the prior art.

[0004] According to a first aspect of the embodiments of this specification, a multimedia resource playback method is provided, applied to a player kernel, including: in response to a triggering event for a multimedia resource, determining a target audio segment in the audio file contained in the multimedia resource based on the segment index of the multimedia resource, and determining a target video segment in the video file contained in the multimedia resource; integrating, according to a data packet integration strategy, at least one audio data packet in the target audio segment and at least one video data packet in the target video segment into a data packet sequence; and executing the playback task of the multimedia resource according to the data packet sequence.

[0005] Optionally, determining the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource based on the segment index of the multimedia resource includes: determining an audio segment index and a video segment index based on the segment index of the multimedia resource; and determining the target audio segment in the audio file contained in the multimedia resource according to the audio segment index, and determining the target video segment in the video file contained in the multimedia resource according to the video segment index.

[0006] Optionally, the construction of the audio segment index and the video segment index includes: receiving the audio resource address and the video resource address of the multimedia resource, and initializing an audio stream manager and a video stream manager; using the audio stream manager to determine the audio file based on the audio resource address, determining audio file index information based on the audio file, and constructing the audio segment index based on the audio file index information; and using the video stream manager to determine the video file based on the video resource address, determining video file index information based on the video file, and constructing the video segment index based on the video file index information.

[0007] Optionally, the triggering event of the multimedia resource includes: a trigger operation submitted by the user for the multimedia resource, which is received and used as the trigger event; or a stream switching event associated with the multimedia resource, which is used as the trigger event.

[0008] Optionally, the step of integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy includes: determining an audio data packet to be integrated from the at least one audio data packet in the target audio segment, and determining a video data packet to be integrated from the at least one video data packet in the target video segment; comparing, according to the data packet integration strategy, the timestamps of the audio data packet to be integrated and the video data packet to be integrated, and determining from the two a target data packet and a data packet to be compared; and, in the case where the data packet to be compared is of the audio type, determining a first video data packet from the at least one video data packet, using the first video data packet as the video data packet to be integrated, and repeating the timestamp comparison between it and the audio data packet to be integrated until the data packet sequence is obtained.
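The timestamp comparison described in [0008] resembles a two-way merge of two time-ordered queues. The following is a minimal Python sketch under stated assumptions, not the patent's implementation: the `Packet` type, the `merge_packets` name, the millisecond time base shared by both streams, and the rule that audio goes first on a tie are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Packet:
    kind: str  # "audio" or "video" (hypothetical field)
    pts: int   # timestamp in milliseconds, assumed to share one time base


def merge_packets(audio: Iterable[Packet], video: Iterable[Packet]) -> List[Packet]:
    """Interleave two time-ordered packet streams into one sequence."""
    audio_it, video_it = iter(audio), iter(video)
    a = next(audio_it, None)  # audio packet to be integrated
    v = next(video_it, None)  # video packet to be integrated
    sequence: List[Packet] = []
    while a is not None and v is not None:
        if a.pts <= v.pts:
            # Audio is the target packet; video stays as the packet to be compared.
            sequence.append(a)
            a = next(audio_it, None)
        else:
            # Video is the target packet; audio stays as the packet to be compared.
            sequence.append(v)
            v = next(video_it, None)
    # One stream is exhausted: append whatever remains of the other.
    while a is not None:
        sequence.append(a)
        a = next(audio_it, None)
    while v is not None:
        sequence.append(v)
        v = next(video_it, None)
    return sequence
```

With audio timestamps 0, 20, 40 and video timestamps 0, 33, this sketch yields audio 0, video 0, audio 20, video 33, audio 40.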

[0009] Optionally, after determining the target data packet and the data packet to be compared from the audio data packet to be integrated and the video data packet to be integrated, the method further includes: storing the target data packet in the data packet sequence and determining the stream index of the target data packet; and generating a logical index for the target data packet based on the stream index.

[0010] Optionally, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the method further includes: receiving a switching instruction for the data packet sequence, and parsing the switching instruction to obtain switching information; determining, based on the switching information and the segment index, the target audio file and the target video file of the multimedia resource, and using the target audio file as the audio file and the target video file as the video file; and performing again the step of determining the target audio segment in the audio file contained in the multimedia resource based on the segment index of the multimedia resource, and determining the target video segment in the video file contained in the multimedia resource.

[0011] Optionally, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the method further includes: determining a target video frame in the target video segment, and encapsulating the decoding parameters of the target video segment into the video frame data structure corresponding to the target video frame.

[0012] Optionally, the step of performing the playback task of the multimedia resource according to the data packet sequence includes: generating a rendering event based on the data packet sequence, and adding the rendering event to the event queue; and, when the rendering event is executed by the rendering thread, executing the playback task associated with the multimedia resource corresponding to the rendering event, and updating the video icon on the playback page.
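The event-queue handoff in [0012] can be illustrated with a producer that queues rendering events and a rendering thread that consumes them in order. This is a hedged sketch only: the event dictionary layout, the `None` sentinel, and the function names `render_loop` and `play` are assumptions, and appending to a list stands in for the real playback task and page update.

```python
import queue
import threading
from typing import List


def render_loop(events: queue.Queue, shown: List[int]) -> None:
    """Rendering thread body: execute rendering events in FIFO order."""
    while True:
        event = events.get()
        if event is None:           # hypothetical sentinel: no more events
            break
        shown.append(event["pts"])  # stand-in for the real playback task


def play(pts_sequence: List[int]) -> List[int]:
    """Queue one rendering event per packet timestamp and run the render thread."""
    events: queue.Queue = queue.Queue()
    shown: List[int] = []
    worker = threading.Thread(target=render_loop, args=(events, shown))
    worker.start()
    for pts in pts_sequence:
        events.put({"type": "render", "pts": pts})
    events.put(None)                # tell the rendering thread to stop
    worker.join()
    return shown
```

Because `queue.Queue` is a thread-safe FIFO, the rendering thread sees the events in the order the data packet sequence produced them.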

[0013] According to a second aspect of the embodiments of this specification, a multimedia resource playback device is provided, applied to a player kernel, comprising: a determination module configured to, in response to a triggering event for a multimedia resource, determine a target audio segment in an audio file contained in the multimedia resource based on a segment index of the multimedia resource, and determine a target video segment in a video file contained in the multimedia resource; an integration module configured to integrate at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to a data packet integration strategy; and an execution module configured to perform the playback task of the multimedia resource according to the data packet sequence.

[0014] According to a third aspect of the embodiments of this specification, a computing device is provided, comprising a memory and a processor, wherein the memory is used to store computer programs / instructions, and the processor is used to execute the computer programs / instructions, which, when executed by the processor, implement the steps of the multimedia resource playback method described above.

[0015] According to a fourth aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores a computer program / instructions, which, when executed by a processor, implement the steps of the multimedia resource playback method described above.

[0016] According to a fifth aspect of the embodiments of this specification, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the multimedia resource playback method described above.

[0017] According to a sixth aspect of the embodiments of this specification, a method for storing a bitstream is provided, comprising storing the bitstream in a storage medium, wherein the bitstream is generated by the multimedia resource playback method described above.

[0018] According to a seventh aspect of the embodiments of this specification, a method for transmitting a bitstream is provided, comprising transmitting the bitstream, the bitstream being generated by the multimedia resource playback method described above.

[0019] According to an eighth aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores a bitstream thereon, the bitstream being generated by the multimedia resource playback method described above.

[0020] This specification provides, according to one embodiment, a multimedia resource playback method applied to a player kernel. In response to a trigger event for a multimedia resource, a target audio segment is determined in the audio file contained in the multimedia resource, and a target video segment is determined in the video file contained in the multimedia resource, based on the segment index of the multimedia resource. According to a data packet integration strategy, at least one audio data packet from the target audio segment and at least one video data packet from the target video segment are integrated into a data packet sequence, thereby merging the audio stream corresponding to the target audio segment and the video stream corresponding to the target video segment into a single data stream. The multimedia resource playback task is then executed according to the data packet sequence. Because the target audio and video segments are integrated in this way, the downstream decoding and rendering modules can play the separately delivered audio and video streams smoothly without modification, while avoiding the audio-visual desynchronization and playback stuttering caused by trigger events.

Attached Figure Description

[0021] Figure 1 is a flowchart of a multimedia resource playback method provided in one embodiment of this specification;
Figure 2 is a file structure analysis diagram of a multimedia resource playback method provided in one embodiment of this specification;
Figure 3 is a schematic diagram of data packet reading for a multimedia resource playback method provided in one embodiment of this specification;
Figure 4 is a schematic diagram of the decision callback process of a multimedia resource playback method provided in one embodiment of this specification;
Figure 5 is a schematic diagram of the decision module switching data packets in a multimedia resource playback method provided in one embodiment of this specification;
Figure 6 is a timing flowchart of a multimedia resource playback method provided in one embodiment of this specification;
Figure 7 is a flowchart of the player kernel adaptation and seamless switching of a multimedia resource playback method provided in one embodiment of this specification;
Figure 8 is a schematic structural diagram of a multimedia resource playback device provided in one embodiment of this specification;
Figure 9 is a structural block diagram of a computing device provided in one embodiment of this specification.

Detailed Implementation

[0022] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.

[0023] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.

[0024] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."

[0025] Furthermore, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this specification are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0026] In one or more embodiments of this specification, a large model refers to a deep learning model with a large number of model parameters, typically containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even tens of trillions of model parameters. A large model can also be called a foundation model. It is pre-trained using large-scale unlabeled corpora to produce a pre-trained model with hundreds of millions of parameters. Such models can adapt to a wide range of downstream tasks and have good generalization ability. Examples include Large Language Models (LLMs) and multi-modal pre-training models.

[0027] In practical applications, large models only require a small number of samples to fine-tune the pre-trained model before they can be applied to different tasks. Large models can be widely used in fields such as Natural Language Processing (NLP) and Computer Vision. Specifically, they can be applied to computer vision tasks such as Visual Question Answering (VQA), Image Captioning (IC), and Image Generation, as well as NLP tasks such as text-based sentiment classification, text summarization, and machine translation. The main application scenarios for large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.

[0028] First, the terms and concepts used in one or more embodiments of this specification will be explained.

[0029] The player core (or playback engine) is the core software module of a multimedia player. It is responsible for acquiring data from the media source, parsing the encapsulation format, decoding audio and video streams, synchronizing audio and video, rendering output, and managing playback states (such as pause, fast forward, and switching). It is the foundational engine for the entire player's functionality and typically provides abstract interfaces to upper-layer applications (such as the UI and control logic).

[0030] FMP4Extractor (Core Decapsulator): The core control class responsible for coordinating the flow of Video and Audio. It exposes a unified readPacket interface externally, maintains the dual-stream state internally, and holds an FMP4ABRCallback callback object for triggering and executing switching logic.

[0031] FMP4Stream (Single Stream Manager): Independently manages the lifecycle of a single track. Each FMP4Stream internally maintains the track's URL, a list of Fragment indices, and the current read progress (CurrentFragment Index).

[0032] IOAdapter (Data Adaptation Layer): Responsible for underlying IO interactions. Based on AVIOContext's custom callbacks, it converts the logical offset calculated by the upper layer into an actual HTTP Range request, achieving accurate data download.

[0033] FMP4ABRCallback (Decision Callback Interface): Defines a standard protocol for the lower layer to "query" the upper layer. This interface inverts control of the switching decision, handing it to the upper-layer business module.

[0034] AVPacket: An audio / video data packet, a basic data structure in FFmpeg used to encapsulate a compressed audio or video frame. It contains the actual encoded data, as well as metadata such as timestamps, stream indexes, and flags.

[0035] FFmpeg: An open-source audio and video processing framework, a powerful cross-platform open-source multimedia framework that can be used to record, convert, and stream audio and video. It includes several core libraries and command-line tools.

[0036] Side Data: Side data / auxiliary metadata refers to additional control information or auxiliary metadata attached to the AVPacket. It is not part of the main compression data but is crucial for the correct processing of the frame. It is stored in the AVPacket.side_data array. Each Side Data item contains: type and data content, used to pass frame-level dynamic configuration information, such as new codec parameters, display matrices, encryption information, etc.

[0037] An IDR frame (Instantaneous Decoding Refresh frame) is a special type of I-frame (Intra-coded frame) in video coding standards such as H.264 / H.265. Its characteristics include: decoding does not depend on any previous frames, and it forcibly clears the reference frame list for subsequent frames, ensuring independent decoding from this frame onwards.

[0038] FormatDesc (format descriptor): describes the encoding format information of the media stream, such as encoding type, resolution, color space, key parameter set, and pixel format.

[0039] VTB Session (VideoToolbox Decoding Session): A hardware-accelerated video decoder instance responsible for decoding compressed video frames into raw pixels.

[0040] AVFrame: In FFmpeg, AVFrame is the standard structure representing raw audio and video data. It is not compressed data, but rather "raw data" that can be directly rendered or played. For video, AVFrame contains: pixel data, image width and height, timestamp, and metadata such as pixel format, color space, and frame type.

[0041] SPS (Sequence Parameter Set): Contains video sequence-level parameters such as profile, level, width and height, frame rate, color format, etc.

[0042] PPS (Picture Parameter Set): Contains image-level parameters, such as slice group and entropy coding mode.

[0043] PTS (Presentation Time Stamp): Represents the absolute time point at which a video frame should be displayed on the playback timeline, measured in microseconds or as an integer count based on a time base. PTS is the core basis for audio-visual synchronization.

[0044] NewIndex: An integer index value used to uniquely identify the target resolution level or media stream version. For example, NewIndex=0 corresponds to 480p, NewIndex=1 corresponds to 720p, and NewIndex=2 corresponds to 1080p; the specific mapping relationship is predefined by the player configuration table.

[0045] SwitchEventQueue: A thread-safe first-in-first-out (FIFO) event queue used to store resolution switching events to be processed. Each event contains the fields {PTS, NewIndex}.
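As a hedged illustration of the SwitchEventQueue and NewIndex definitions, the sketch below uses Python's `queue.Queue`, which is a thread-safe FIFO, to hold events with the fields {PTS, NewIndex}. The `SwitchEvent` class, the `LEVELS` table (copied from the example mapping in [0044]), and the `describe` and `drain` helpers are hypothetical names chosen for illustration.

```python
import queue
from dataclasses import dataclass
from typing import List

# Example mapping from [0044]; a real player would read it from its configuration table.
LEVELS = {0: "480p", 1: "720p", 2: "1080p"}


@dataclass(frozen=True)
class SwitchEvent:
    pts: int        # presentation timestamp attached to the switch event
    new_index: int  # NewIndex: target resolution level


def describe(event: SwitchEvent) -> str:
    """Render a switch event as text using the level table."""
    return f"switch to {LEVELS[event.new_index]} at PTS {event.pts}"


def drain(events: queue.Queue) -> List[str]:
    """Pop every pending switch event in first-in-first-out order."""
    out: List[str] = []
    while True:
        try:
            out.append(describe(events.get_nowait()))
        except queue.Empty:
            return out
```

Events come back in the order they were queued, which is the FIFO behavior the definition calls for.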

[0046] This specification provides a method for playing multimedia resources. It also relates to a multimedia resource playback device, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.

[0047] Existing mainstream open-source media player kernels (such as FFmpeg-based playback frameworks, early versions of ExoPlayer, and the libvlc core of VLC) generally adopt a highly integrated and functionally coupled architecture. This type of architecture typically tightly binds core functional modules such as media source acquisition, container parsing, decoding scheduling, rendering output, and playback state control, lacking clear interface abstraction and module boundary isolation. Specifically, existing media player kernel architectures usually contain a single data processing pipeline that reads media data from the network or local storage, and then sequentially performs steps such as protocol parsing, encapsulation format demultiplexing, audio and video decoding, and rendering output. The entire process is driven by a centralized state machine, with each stage directly calling other stages through internal hard-coded logic, making it difficult to dynamically replace or extend specific components at runtime.

[0048] Crucially, this architecture was not designed with a pluggable access mechanism for adaptive bitrate strategies in mind, nor was it specifically optimized for seamless multi-stream switching scenarios. When switching between different bitrate streams or different content sources, the kernel lacks a unified reference system for managing the timelines of the old and new streams, and it also lacks a secure mechanism for saving and restoring the hardware decoder context. Therefore, when a user triggers operations such as resolution switching or channel jumping, the existing player kernel usually needs to completely destroy the current playback session and rebuild a new decoding and rendering pipeline. This process leads to playback interruptions, screen tearing, audio interruptions, or playback position reversals, severely impacting the user experience. These defects essentially stem from its closed architectural design and strong coupling between modules, making it difficult to implement advanced playback functions (such as second-level seamless switching and intelligent ABR control) without modifying the kernel source code.

[0049] This specification provides, according to one embodiment, a multimedia resource playback method applied to a player kernel. In response to a trigger event for a multimedia resource, a target audio segment is determined in the audio file contained in the multimedia resource, and a target video segment is determined in the video file contained in the multimedia resource, based on the segment index of the multimedia resource. According to a data packet integration strategy, at least one audio data packet from the target audio segment and at least one video data packet from the target video segment are integrated into a data packet sequence, thereby merging the audio stream corresponding to the target audio segment and the video stream corresponding to the target video segment into a single data stream. The multimedia resource playback task is then executed according to the data packet sequence. Because the target audio and video segments are integrated in this way, the downstream decoding and rendering modules can play the separately delivered audio and video streams smoothly without modification, while avoiding the audio-visual desynchronization and playback stuttering caused by trigger events.

[0050] The player kernel provided in one or more embodiments of this specification mainly includes core components such as a core decapsulator, a single-stream manager, a data adaptation layer, and a decision callback interface. By constructing a "virtual dual-stream merging mechanism" within the core decapsulator, it directly parses binary indexes to achieve precise reading, and designs the kernel as a "passive executor," achieving complete decoupling from the ABR strategy through a callback mechanism. This solves the problem of poor player kernel scalability, achieves separation of business strategies and underlying execution, solves the problem of synchronization and seamless connection of separated streams on the client, and solves the hardware compatibility problem during stream switching through the Side Data mechanism.
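The "passive executor" idea in [0050] can be sketched as a kernel object that never decides on its own and only asks an injected callback whether to switch. This is an illustrative sketch under assumptions: the callback signature (current level and PTS in, new level or `None` out), the `Extractor` class, and the choice to query at fragment boundaries are hypothetical and are not taken from the patent text.

```python
from typing import Callable, Optional

# Hypothetical callback shape: (current_level, pts) -> new level, or None to stay.
ABRCallback = Callable[[int, int], Optional[int]]


class Extractor:
    """Kernel-side executor that holds no ABR policy of its own."""

    def __init__(self, callback: ABRCallback, level: int = 0) -> None:
        self.callback = callback
        self.level = level

    def on_fragment_boundary(self, pts: int) -> int:
        """Query the upper layer and apply whatever it decides."""
        decision = self.callback(self.level, pts)
        if decision is not None:
            self.level = decision
        return self.level
```

Swapping the ABR strategy then means passing a different callable, with no change to the `Extractor` code, which is the decoupling the paragraph describes.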

[0051] Referring to Figure 1, Figure 1 shows a flowchart of a multimedia resource playback method according to an embodiment of this specification. The multimedia resource playback method is applied to a player kernel and specifically includes the following steps.

[0052] Step 102: In response to a triggering event for a multimedia resource, determine a target audio segment in the audio file contained in the multimedia resource and a target video segment in the video file contained in the multimedia resource according to the segment index of the multimedia resource.

[0053] The multimedia resource playback method provided in this embodiment can be applied to any multimedia resource playback scenario, such as audio and video on-demand scenarios, live streaming scenarios, virtual reality and augmented reality video playback scenarios, in-vehicle video playback scenarios, etc. Audio and video playback can be achieved using the multimedia resource playback method provided in this embodiment, thereby achieving seamless switching when changing video resolution, avoiding audio and video desynchronization, and improving the smoothness of multimedia resource playback. This embodiment uses multimedia resource on-demand (video on demand) as an example to illustrate the multimedia resource playback method. Descriptions for other scenarios can refer to the same or corresponding descriptions in this embodiment, and this embodiment does not impose any limitations.

[0054] Specifically, the multimedia resource playback method is applied to the player kernel, which is the core software module of the multimedia player. It is responsible for acquiring data from the media source, parsing the encapsulation format, decoding audio and video streams, synchronizing audio and video, rendering output, and managing playback status. The player kernel provided in one or more embodiments of this specification mainly includes core components such as a core decapsulator, a single-stream manager, a data adaptation layer, and a decision callback interface. The core decapsulator is the core control class, responsible for coordinating the flow of Video and Audio. The single-stream manager independently manages the lifecycle of a single track. The data adaptation layer is responsible for the underlying IO interaction. The decision callback interface defines the standard protocol for the lower layer to "query" the upper layer; it inverts control of the switching decision, handing it to the upper-layer business module. A triggering event can be an event that affects the playback resolution of the multimedia resource during playback, and can correspond to a playback resolution switching operation. A triggering event can be actively triggered by the user playing the multimedia resource, or passively triggered by external factors such as network conditions during playback.

[0055] The segment index of a multimedia resource can be constructed, when playback begins, from the audio and video files corresponding to the multimedia resource. For each resource segment, the segment index records, without limitation, the absolute file offset, the data size, the resolution, and the start timestamp. The audio and video files included in the multimedia resource are downloaded from the server during playback. An audio file contains multiple audio segments; the target audio segment is the one selected from these segments by means of the segment index, and it corresponds to the resolution associated with the trigger event. Similarly, a video file contains multiple video segments; the target video segment is the one selected from these segments by means of the segment index, and it corresponds to the resolution associated with the trigger event.

[0056] Based on this, in multimedia resource playback scenarios, trigger events for a multimedia resource can be received at any time and responded to. Specifically, in response to a trigger event for a multimedia resource, the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource are determined based on the segment index of the multimedia resource. The trigger event can also be a playback resolution switching operation for the multimedia resource. Based on the trigger event, a switching decision structure can be determined, which contains the target resolution level to be switched to.

[0057] In practical applications, when a trigger event is received for a multimedia resource, it indicates that a resolution level switch has occurred during the playback of the multimedia resource. The target video segment corresponding to the upcoming resolution level can be determined from the video file contained in the multimedia resource based on the segment index. If the resolution level switch does not affect audio playback, the target audio segment determined from the audio file contained in the multimedia resource can be the audio segment whose audio stream was read before the trigger event was received. That is, the read video segment switches to the new video stream as the resolution level changes, while the audio stream remains unchanged.

[0058] Furthermore, considering that the audio and video files contained in the multimedia resource are located at physically separate addresses, the segment index of the multimedia resource includes an audio segment index and a video segment index in order to improve the efficiency of reading the target video and audio segments. The target video segment can therefore be determined in the video file based on the video segment index, and the target audio segment can be determined in the audio file based on the audio segment index. The specific implementation is as follows: the audio segment index and the video segment index are determined from the segment index of the multimedia resource; the target audio segment is determined in the audio file contained in the multimedia resource according to the audio segment index, and the target video segment is determined in the video file contained in the multimedia resource according to the video segment index.

[0059] Specifically, the segment index of the multimedia resource includes an audio segment index and a video segment index. The audio segment index contains metadata for each audio segment, recording the absolute file offset, data size, and start timestamp of each audio segment in the audio file. The video segment index contains metadata for each video segment, recording the absolute file offset, data size, and start timestamp of each video segment in the video file.
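The index layout described above can be pictured as two lists of per-segment records. The following is a minimal sketch only: the names `SegmentEntry`, `SegmentIndex`, and `byte_range`, the use of seconds for the start timestamp, and all sample values are assumptions made for illustration and do not come from this specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SegmentEntry:
    """Metadata for one segment (hypothetical type name)."""
    offset: int       # absolute file offset, in bytes
    size: int         # data size, in bytes
    start_pts: float  # start timestamp; seconds is an assumed unit


@dataclass
class SegmentIndex:
    """Separate audio and video indexes for one multimedia resource."""
    audio: List[SegmentEntry]
    video: List[SegmentEntry]


def byte_range(entry: SegmentEntry) -> Tuple[int, int]:
    """Inclusive (first_byte, last_byte) of the segment inside its file."""
    return entry.offset, entry.offset + entry.size - 1


# Illustrative values only.
index = SegmentIndex(
    audio=[SegmentEntry(800, 4000, 0.0), SegmentEntry(4800, 4100, 2.0)],
    video=[SegmentEntry(1200, 90000, 0.0), SegmentEntry(91200, 88000, 2.0)],
)
```

Keeping the offset and size together is what allows a segment to be fetched with a single byte-range read.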

[0060] Based on this, the multimedia resource's segment index includes audio segment indexes and video segment indexes. After determining the audio and video segment indexes according to the multimedia resource's segment index, the target audio segment can be determined in the audio file contained in the multimedia resource based on the audio segment index, and the target video segment can be determined in the video file contained in the multimedia resource based on the video segment index. The video segment index may contain video resolution information for each video segment, and the target video segment corresponding to the target resolution can be determined based on the target resolution to be switched and the video segment index.

[0061] For example, the player kernel mainly includes core components such as a core decapsulator, a single-stream manager, a data adaptation layer, and a decision callback interface. When a multimedia resource (a video whose audio and video are played in synchronization) is played, the initial video and audio segments can be determined in the video and audio files respectively based on the initial playback resolution, and the video and audio are played in synchronization. When the user adjusts the video resolution, or when external environmental factors (such as the network) cause a resolution switch on the playback device, the target resolution to be switched to is determined. Based on the target resolution, the video segment index corresponding to the video file and the audio segment index corresponding to the audio file are searched to determine the target video and audio segments. When the user switches from a low resolution (e.g., 480p) to a high resolution (e.g., 1080p), the current playback time point is determined (e.g., playback is currently at 32.5 seconds). The video segment index of the target resolution is searched for the segment whose PTS (presentation timestamp) is closest to, but less than or equal to, the current playback time (i.e., the switch is "anchored" to the same content time point), and loading and playback begin from that video segment, ensuring a seamless switch.
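The anchoring lookup in this example amounts to a search over sorted start timestamps. The sketch below is illustrative: the function name `find_anchor_segment`, the use of seconds, and the sample timestamps are assumptions; binary search is one reasonable way to do the lookup, and the specification does not state which search is used.

```python
from bisect import bisect_right
from typing import List


def find_anchor_segment(start_pts: List[float], current_time: float) -> int:
    """Return the position of the segment whose start PTS is closest to,
    but not greater than, current_time. start_pts must be sorted ascending.
    Falls back to the first segment if current_time precedes all of them."""
    i = bisect_right(start_pts, current_time) - 1
    return max(i, 0)


# Hypothetical start timestamps (seconds) of the 1080p video segments.
pts_1080p = [0.0, 10.0, 20.0, 30.0, 40.0]
anchor = find_anchor_segment(pts_1080p, 32.5)  # playback is at 32.5 s
```

With these sample values the lookup selects the segment starting at 30.0 seconds, so the new stream begins at or before the content time currently on screen.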

[0062] In summary, the target audio segment is determined in the audio file contained in the multimedia resource based on the audio segment index, and the target video segment is determined in the video file contained in the multimedia resource based on the video segment index, thus achieving accurate extraction of the target video segment and the target audio segment.

[0063] Furthermore, considering that the audio and video files contained in multimedia resources are separate resource streams, it is necessary to build separate fragment indexes for the audio and video contained in the multimedia resources. The specific implementation is as follows: The system receives the audio resource address and video resource address of the multimedia resource, and initializes the audio stream manager and video stream manager; it uses the audio stream manager to determine the audio file based on the audio resource address, and determines the audio file index information based on the audio file, and constructs the audio segment index based on the audio file index information; it uses the video stream manager to determine the video file based on the video resource address, and determines the video file index information based on the video file, and constructs the video segment index based on the video file index information.

[0064] Specifically, the audio resource address is used to download audio files, and the video resource address is used to download video files. The open interface of the core decapsulator can be called, passing in the physically separated audio and video resource addresses. Both the audio stream manager and the video stream manager are single-stream managers, used for downloading audio and video files respectively, and for building audio and video segment indexes respectively. The audio file index information is determined by the segment index box (sidx box) in the file header of the audio file. This information includes, but is not limited to, the audio identifier corresponding to the audio file, the initialization segment (moov), the byte offset, the reference list (each entry describes a segment), and at least one audio segment. Similarly, the video file index information is determined by the segment index box in the file header of the video file. This information includes, but is not limited to, the video identifier corresponding to the video file, the initialization segment, the byte offset, the reference list (each entry describes a segment), and at least one video segment.

[0065] Based on this, upon receiving the audio and video resource addresses of multimedia resources, the player kernel can call the open interface of the core decapsulator, passing in the physically separated audio and video resource addresses. Internally, the player kernel initializes the audio stream manager and video stream manager. The audio stream manager uses the audio resource address to determine the audio file and its audio file header segment index box. By parsing the reference list in the audio file header segment index box, it determines the audio file index information. Then, based on the audio file index information, it calculates the absolute file offset, data size, and start timestamp of each audio segment, thus constructing an audio segment index. Similarly, the video stream manager uses the video resource address to determine the video file and its video file header segment index box. By parsing the reference list in the video file header segment index box, it determines the video file index information. Then, based on the video file index information, it calculates the absolute file offset, data size, and start timestamp of each video segment, thus constructing a video segment index. Both the video segment index and the audio segment index are used during multimedia resource playback when there is a change in video resolution, enabling the lookup of audio and video segments.
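The calculation of offsets and timestamps from the reference list can be sketched as a running sum. This is a sketch under stated assumptions: it presumes the sidx fields have already been parsed, that each reference entry carries a size in bytes and a duration in timescale units (as in the ISO base media file format), and that the first offset is counted from the first byte after the sidx box. The names `SegmentEntry` and `build_segment_index` and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SegmentEntry:
    offset: int       # absolute file offset, in bytes
    size: int         # data size, in bytes
    start_pts: float  # start timestamp, in seconds


def build_segment_index(
    sidx_end: int,                       # file position right after the sidx box
    first_offset: int,                   # sidx first_offset field
    timescale: int,                      # ticks per second
    earliest_pts: int,                   # earliest presentation time, in ticks
    references: List[Tuple[int, int]],   # (referenced_size, duration_ticks)
) -> List[SegmentEntry]:
    """Accumulate sizes and durations into absolute offsets and start times."""
    entries = []
    offset = sidx_end + first_offset
    ticks = earliest_pts
    for size, duration in references:
        entries.append(SegmentEntry(offset, size, ticks / timescale))
        offset += size      # next segment starts where this one ends
        ticks += duration   # next segment starts when this one ends
    return entries


video_index = build_segment_index(
    sidx_end=1200, first_offset=0, timescale=90000, earliest_pts=0,
    references=[(50000, 180000), (48000, 180000), (52000, 180000)],
)
```

Because every offset is derived locally from the binary header, no server-side text playlist is needed to locate a segment.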

[0066] In practical applications, the player kernel calls the open interface of the core decapsulator, passing in the physically separated audio and video resource addresses. Internally, the player kernel initializes the audio stream manager and video stream manager, building video and audio segment indices respectively. The audio and video stream managers natively integrate binary file header index parsing capabilities, building segment indices locally rather than relying on a text list sent from the server. This achieves byte-level precise control, reduces network interaction overhead, and improves playback and transition speeds.

[0067] The player kernel initializes the audio and video stream managers through the core decapsulator. Initializing a single-stream manager essentially establishes a complete "data pipeline" for a physically separate media stream (audio or video), from the network URL to the decodable compressed frame, and prepares metadata, timing information, and buffering mechanisms. Once the stream managers for the video and audio files are initialized, the core decapsulator has two independent but synchronized data streams.

[0068] The file structure parsing logic for the video file and the audio file is shown in Figure 2, which takes video file structure parsing as an example. The video file (video.m4s) contains an initialization segment (moov), a segment index, video segment 1, video segment 2, and other video segments. The player kernel calls the open interface of the core decapsulator to download the segment index of the video file via HTTP GET and parses it to generate a segment index table, i.e., the video segment index. The video segment index contains information such as the start timestamp (PTS) and the absolute file offset of each video segment. Based on the video segment index, the target video segment is then downloaded and its data packets are read.
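Parsing the downloaded segment index can be sketched with the standard `struct` module. The field layout follows the `sidx` box of the ISO base media file format; the function names `parse_sidx` and `make_sidx` are hypothetical, the sample box is synthetic, and boxes using the 64-bit extended size are not handled. This is an illustration, not the parser of the player kernel described here.

```python
import struct
from typing import List, Tuple


def parse_sidx(box: bytes) -> Tuple[int, int, int, List[Tuple[int, int]]]:
    """Parse one complete 'sidx' box.
    Returns (timescale, earliest_pts, first_offset, [(size, duration), ...])."""
    _size, box_type = struct.unpack_from(">I4s", box, 0)
    if box_type != b"sidx":
        raise ValueError("not a sidx box")
    version = box[8]                      # followed by 3 bytes of flags
    _reference_id, timescale = struct.unpack_from(">II", box, 12)
    pos = 20
    if version == 0:                      # 32-bit time and offset fields
        earliest_pts, first_offset = struct.unpack_from(">II", box, pos)
        pos += 8
    else:                                 # 64-bit time and offset fields
        earliest_pts, first_offset = struct.unpack_from(">QQ", box, pos)
        pos += 16
    _reserved, count = struct.unpack_from(">HH", box, pos)
    pos += 4
    refs = []
    for _ in range(count):
        word, duration, _sap = struct.unpack_from(">III", box, pos)
        pos += 12
        refs.append((word & 0x7FFFFFFF, duration))  # drop reference_type bit
    return timescale, earliest_pts, first_offset, refs


def make_sidx(timescale: int, earliest: int, first_offset: int,
              refs: List[Tuple[int, int]]) -> bytes:
    """Build a synthetic version-0 sidx box, for demonstration only."""
    body = struct.pack(">B3xIIIIHH", 0, 1, timescale, earliest,
                       first_offset, 0, len(refs))
    for size, duration in refs:
        body += struct.pack(">III", size, duration, 0x90000000)
    return struct.pack(">I4s", 8 + len(body), b"sidx") + body


sample = make_sidx(90000, 0, 0, [(50000, 180000), (48000, 180000)])
parsed = parse_sidx(sample)
```

The parsed reference list is the input from which absolute offsets and start timestamps are accumulated into the segment index table.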

[0069] In summary, the absolute file offset, data size, and start timestamp of each audio segment are calculated based on the audio file index information, thereby constructing an audio segment index; similarly, the absolute file offset, data size, and start timestamp of each video segment are calculated based on the video file index information, thereby constructing a video segment index. This initialization of the segment indexes facilitates subsequent searches for video and audio segments based on the video and audio segment indexes, respectively.

[0070] Furthermore, to improve the user experience for multimedia resource viewers and adapt to changes in the external environment such as the network during multimedia resource playback, multiple selectable resolutions are provided. When switching resolutions for multimedia resources, a trigger event is generated for the multimedia resource, as implemented below: The system receives a trigger operation submitted by the user for the multimedia resource and uses the trigger operation as the trigger event; or, it uses the stream switching event associated with the multimedia resource as the trigger event.

[0071] Specifically, the user can be a video viewer on the client where the player's kernel resides. The user can play the video (including audio synchronized with the video) corresponding to the multimedia resource through their mobile terminal. That is, the mobile terminal has a video-on-demand application installed, which allows for on-demand playback of multimedia resources. During playback, the user can trigger actions on the multimedia resource to meet their viewing needs. These trigger actions can be clicks, drags, or selections made by the user on the multimedia resource's playback page, where the target control can be used to switch the video's resolution. The associated streaming switching event for the multimedia resource can be a video resolution switch caused by external environmental factors such as network conditions during playback. In other words, in a weak network environment, the resolution is switched to a lower level to ensure smooth playback of the multimedia resource.

[0072] Based on this, the system receives trigger operations submitted by users through the multimedia resource playback page, treating these operations as trigger events. These trigger operations can be implemented by touching a target control on the multimedia resource playback page. Alternatively, the system can use the associated multimedia resource stream switching event as a trigger event. This stream switching event could be a passive switching of the multimedia resource's resolution, i.e., automatically switching the playback resolution of the multimedia resource based on external environmental factors such as network conditions.

[0073] Continuing with the previous example, in a video-on-demand scenario, the multimedia resource could be a movie requested by the user. During movie playback, the user can switch the video playback resolution using a resolution switching control provided on the movie's playback page. Clicking the resolution switching control and selecting the desired resolution completes the triggering operation, thus confirming the trigger event. If network fluctuations occur during movie playback, i.e., if the network speed is slow, the resolution can be switched to adapt to the network environment to ensure smooth playback; the resulting stream switching event is the trigger event.

[0074] In summary, triggering events can be determined based on user actions during multimedia resource playback, or automatically based on multimedia resource stream switching events, thus meeting diverse business needs for multimedia resource playback.

[0075] Step 104: According to the data packet integration strategy, integrate at least one audio data packet in the target audio segment and at least one video data packet in the target video segment into a data packet sequence.

[0076] Specifically, in response to a triggering event for a multimedia resource, after determining the target audio segment in the audio file and the target video segment in the video file contained in the multimedia resource based on the segment index of the multimedia resource, at least one audio data packet from the target audio segment and at least one video data packet from the target video segment can be integrated into a data packet sequence according to a data packet integration strategy. The data packet integration strategy is used to integrate the target audio and video segments, that is, to integrate the two streams of the target audio and video segments into a single stream, which is the data packet sequence. The data packet sequence contains audio and video data packets arranged in the order of their timestamps; that is, the audio data packets in the target audio segment and the video data packets in the target video segment are arranged in the order of their timestamps.

[0077] Based on this, in response to the triggering event for the multimedia resource, after determining the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource according to the segment index of the multimedia resource, at least one audio data packet in the target audio segment and at least one video data packet in the target video segment are integrated into a data packet sequence according to the data packet integration strategy. That is, the audio data packets in the target audio segment and the video data packets in the target video segment are arranged in order according to the timestamp of the data packets to obtain the data packet sequence.

[0078] In practical applications, at least one audio data packet in the target audio segment is arranged in ascending order of its decoding timestamp. Similarly, at least one video data packet in the target video segment is also arranged in ascending order of its decoding timestamp. Integrating at least one audio data packet and at least one video data packet according to the data packet integration strategy means integrating the dual-stream target audio and target video segments into a single-stream data packet sequence. Therefore, the data packets in this sequence are also arranged according to their decoding timestamps. The data packet sequence contains interleaved audio and video data packets; for example, the data packet sequence might contain audio data packet 1, video data packet 1, audio data packet 2, audio data packet 3, and video data packet 2 in sequence.

[0079] Furthermore, considering that the video resource address of the video file and the audio file address of the audio file are physically separate, the video stream of the video file and the audio stream of the audio file are separate streams. To reduce the differences between the physical files and enable downstream decoding and rendering modules to support the playback of separate streams, at least one audio data packet in the target audio segment and at least one video data packet in the target video segment can be virtually merged at the data packet level to construct a data packet sequence. The specific implementation is as follows: An audio data packet to be integrated is determined from at least one audio data packet in the target audio segment, and a video data packet to be integrated is determined from at least one video data packet in the target video segment. The timestamps of the audio data packets to be integrated and the video data packets to be integrated are compared according to the data packet integration strategy to determine a target data packet and a data packet to be compared. If the data packet to be compared corresponds to an audio type, a first video data packet is determined from the at least one video data packet, and the first video data packet is used as the video data packet to be integrated. Its timestamp is compared with the audio data packet to be integrated until the data packet sequence is obtained.

[0080] Specifically, the at least one audio data packet in the target audio segment can be arranged according to decoding timestamp. The audio data packet to be integrated can therefore be the audio data packet with the smallest decoding timestamp among the at least one audio data packet, that is, the first audio data packet in the target audio segment. Likewise, the at least one video data packet in the target video segment can be arranged according to decoding timestamp, and the video data packet to be integrated can be the video data packet with the smallest decoding timestamp among the at least one video data packet, that is, the first video data packet in the target video segment. The data packet integration strategy can be a data packet comparison and arrangement strategy: the decoding timestamps of the audio data packet and the video data packet are compared, the data packet with the smaller decoding timestamp is used as the target data packet, and the data packet with the larger decoding timestamp is used as the data packet to be compared. The data packet to be compared is then compared with the next data packet to be integrated, which has the same data packet type as the target data packet. This realizes competitive merging of the two streams (audio stream and video stream) and yields a data packet sequence in which the at least one audio data packet and the at least one video data packet are arranged in ascending order of decoding timestamp.

[0081] Based on this, at least one audio data packet in the target audio segment is arranged in ascending order of its decoding timestamp. The audio data packet to be integrated is determined from the at least one audio data packet in the target audio segment according to its decoding timestamp, or the first audio data packet in the sequence of at least one audio data packet arranged in ascending order of its decoding timestamp is selected as the audio data packet to be integrated. Similarly, at least one video data packet in the target video segment is arranged in ascending order of its decoding timestamp. The video data packet to be integrated is determined from the at least one video data packet in the target video segment according to its decoding timestamp, or the first video data packet in the sequence of at least one video data packet arranged in ascending order of its decoding timestamp is selected as the video data packet to be integrated.

[0082] By comparing the timestamps of the audio data packet to be integrated and the video data packet to be integrated according to the data packet integration strategy, two things are determined: the target data packet, which can be output to the data packet sequence, and the data packet to be compared, which takes part in the next round of comparison. If the data packet to be compared is of the audio type, a first video data packet is determined from the at least one video data packet, that is, the video data packet following the previous video data packet to be integrated is selected. The first video data packet is used as the video data packet to be integrated, and its timestamp is compared with that of the audio data packet to be integrated. If the data packet to be compared is of the video type, a first audio data packet is determined from the at least one audio data packet, that is, the audio data packet following the previous audio data packet to be integrated is selected. The first audio data packet is used as the audio data packet to be integrated, and its timestamp is compared with that of the video data packet to be integrated. This continues until every one of the at least one audio data packet and the at least one video data packet has been compared, which completes the competitive merging of the audio and video streams and yields the data packet sequence.

[0083] In practical applications, the core decapsulator employs a double-buffering mechanism, pre-reading one data packet from each of the two single-stream managers (audio stream manager and video stream manager). Specifically, it reads one audio data packet (the audio data packet to be merged) and one video data packet (the video data packet to be merged). This allows for the sorting of decoding timestamps: comparing the decoding timestamps of the video and audio data packets to be merged, outputting the data packet with the smaller timestamp (e.g., the video data packet), and shifting the cursor of the corresponding stream (video stream) forward while keeping the cursor of the audio stream unchanged. This process continues until the audio data packets in the target audio segment and the video data packets in the target video segment are sorted according to their decoding timestamps, thus obtaining the data packet sequence. Internally, the core decapsulator instantiates two single-stream managers and uses the DTS contention algorithm to logically merge the physically separated IO channels into a monotonically increasing single data stream. This achieves architectural uniformity and low coupling. It shields the differences in physical files from external components, allowing downstream decoding and rendering modules to support separate stream playback without any modifications.
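The pre-read and compare loop described above can be sketched as a two-way merge. This is a minimal illustration, assuming each stream already yields packets in ascending DTS order. The names `Packet` and `merge_by_dts`, the DTS values, and the choice to emit the audio packet first when two DTS values are equal are assumptions not stated in the specification.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Packet:
    kind: str  # "audio" or "video"
    name: str
    dts: int   # decoding timestamp


def merge_by_dts(audio: Iterable[Packet],
                 video: Iterable[Packet]) -> Iterator[Packet]:
    """Merge two DTS-sorted streams into one DTS-sorted sequence."""
    a_iter, v_iter = iter(audio), iter(video)
    # Pre-read one packet from each stream (the two "buffers").
    a: Optional[Packet] = next(a_iter, None)
    v: Optional[Packet] = next(v_iter, None)
    while a is not None or v is not None:
        # Output the packet with the smaller DTS and advance only that stream.
        if v is None or (a is not None and a.dts <= v.dts):
            yield a
            a = next(a_iter, None)
        else:
            yield v
            v = next(v_iter, None)


audio_packets = [Packet("audio", "A1", 1000), Packet("audio", "A2", 1040),
                 Packet("audio", "A3", 1060)]
video_packets = [Packet("video", "V1", 1020), Packet("video", "V2", 1080)]
sequence = [p.name for p in merge_by_dts(audio_packets, video_packets)]
```

Because only the cursor of the stream that produced the output packet moves, the merged sequence has non-decreasing DTS values while each file is still read independently.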

[0084] In a specific implementation, the data packet reading logic for the target audio segment and the target video segment is shown in Figure 3. The data packet reading logic includes a parsing module, a segmentation and sorting module, and an output module. Data packets are read separately from the video stream (target video segment) and the audio stream (target audio segment): an audio data packet A (DTS: 1000) and a video data packet V (DTS: 1020). The DTS values of the two data packets are compared, and the data packet with the smaller DTS (here, audio data packet A) is output, determining the logical index Index=0. Video data packet V is then compared by DTS with the next audio data packet read from the audio stream, and so on until all data packets in both the video and audio streams have been compared.

[0085] Continuing with the previous example, suppose the target audio segment contains audio data packet 1, audio data packet 2, and audio data packet 3, and the target video segment contains video data packet 1 and video data packet 2. The audio data packets in the target audio segment are arranged in ascending order of their decoding timestamps, as are the video data packets in the target video segment. Audio data packet 1 is selected as the audio data packet to be integrated, and video data packet 1 is selected as the video data packet to be integrated. The decoding timestamps of audio data packet 1 and video data packet 1 are compared. If the decoding timestamp of audio data packet 1 is less than that of video data packet 1, audio data packet 1 is output to the data packet sequence as the target data packet, and video data packet 1 is used as the data packet to be compared. Audio data packet 2 is then selected from the target audio segment and compared with video data packet 1, and so on, until the data packet sequence is obtained. In the resulting data packet sequence, the data packets are arranged in the order audio data packet 1, video data packet 1, audio data packet 2, audio data packet 3, video data packet 2.

[0086] In summary, by comparing the timestamps of the audio data packets to be integrated and the video data packets to be integrated according to the data packet integration strategy, until the competitive merging of at least one audio data packet and at least one video data packet is completed, a data packet sequence is obtained, realizing dual-stream virtual merging. This shields the differences between physical files from the outside world, allowing the downstream decoding and rendering modules to support split-stream playback without any modifications.

[0087] Furthermore, once the target data packet and the data packet to be compared have been identified from the audio data packet to be integrated and the video data packet to be integrated, the target data packet has completed merging and can be stored in the data packet sequence. Because the target data packet carries a stream index from its own resource segment, the stream index needs to be mapped to a logical index after the target data packet is stored in the data packet sequence. The specific implementation is as follows: the target data packet is stored in the data packet sequence, and the stream index of the target data packet is determined; a logical index of the target data packet is generated based on the stream index.

[0088] Based on this, the target data packet is stored in the data packet sequence, and its stream index is determined. The stream index of the data packet is the index that the target data packet has in the corresponding resource fragment. A logical index for the target data packet is generated based on the stream index, which facilitates the subsequent rendering and playback of multimedia resources based on the data packet sequence.

[0089] Continuing with the previous example, audio data packet 1 is selected as the audio data packet to be integrated, and video data packet 1 is selected as the video data packet to be integrated. Audio data packet 1 and video data packet 1 are compared along the decoding timestamp dimension. If the decoding timestamp of audio data packet 1 is determined to be less than that of video data packet 1, then audio data packet 1 is output as the target data packet to the data packet sequence, and video data packet 1 is used as the data packet to be compared. Before outputting the target data packet, the audio stream index of audio data packet 1 is mapped to a logical index that the core decapsulator can expose (e.g., Video=0, Audio=1), facilitating subsequent rendering and playback of multimedia resources based on the data packet sequence.
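The mapping from a per-file stream index to an exposed logical index can be sketched as a small lookup table. The names `Packet`, `LOGICAL_INDEX`, and `to_logical` are hypothetical, and the assumption that each separate file numbers its single track as stream 0 is made here only for illustration; the logical values Video=0 and Audio=1 are the ones given in the example above.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Packet:
    kind: str          # "audio" or "video"
    stream_index: int  # index inside the packet's own file, then logical index
    dts: int


# (source kind, stream index in its own file) -> logical index exposed
# by the core decapsulator. Assumes one track per file, numbered 0.
LOGICAL_INDEX: Dict[Tuple[str, int], int] = {
    ("video", 0): 0,
    ("audio", 0): 1,
}


def to_logical(packet: Packet) -> Packet:
    """Return a copy of the packet carrying its logical index."""
    key = (packet.kind, packet.stream_index)
    return replace(packet, stream_index=LOGICAL_INDEX[key])


audio_packet_1 = Packet("audio", 0, 1000)
exposed = to_logical(audio_packet_1)
```

Without this step both files would report the same local index, and a downstream decoder could not tell audio packets from video packets in the merged sequence.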

[0090] In summary, by determining the stream index of the target data packet and generating a logical index of the target data packet based on the stream index, the stream index is mapped to the logical index, which facilitates subsequent multimedia resource rendering and playback based on the data packet sequence.

[0091] Furthermore, once the data packet sequence corresponding to the target audio and video segments has been constructed, the target audio and video segments have been read completely and the next audio and video segments need to be read. Upon receiving a switching instruction for the data packet sequence, the target video and audio files to be read can be determined based on the switching instruction, and the target data packet sequence is then constructed for the target video and audio files. The specific implementation is as follows: a switching instruction for the data packet sequence is received and parsed to obtain switching information; the target audio file and target video file of the multimedia resource are determined based on the switching information and the segment index; the target audio file is used as the audio file and the target video file as the video file; and the step of determining, according to the segment index of the multimedia resource, the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource is executed.

[0092] Specifically, the switching command can be a resolution switching command submitted by the user (video viewer) through the playback page of the mobile terminal. Alternatively, it can be a resolution switching command automatically triggered by environmental factors (such as network issues affecting video playback smoothness). The switching information can be a switching decision structure, which includes, but is not limited to, the target resolution level, the level index, and the decision information for whether to switch. The target audio and video files of the multimedia resource are determined based on the level index in the switching information and the target resolution level. The target resolution level can be represented in the form of a target resource address.

[0093] Based on this, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence, playback of the multimedia resource composed of the target audio segment and the target video segment can be performed. Upon receiving a switching instruction for the data packet sequence, the switching instruction is parsed to obtain switching information including the target resolution and its index. Based on the switching information and the segment index, the target audio file and target video file of the multimedia resource corresponding to the switching instruction are determined, and the target audio file is treated as the audio file and the target video file as the video file. The steps of determining the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource based on the segment index are executed. This achieves the construction of the target data packet sequence corresponding to the switching instruction based on the target audio file and the target video file, enabling the multimedia resource corresponding to the target data packet sequence to be played at the target resolution.

[0094] In practical applications, once at least one audio data packet from the target audio segment and at least one video data packet from the target video segment have been integrated into a data packet sequence, the audio frames in the target audio segment and the video frames in the target video segment have been read completely, which triggers a checkpoint. At this point, a passive, seamless switching phase for multimedia resource playback begins. The core decapsulator calls the decision callback interface, passing in parameters such as the current level index and stream type, and queries the upper-layer business (which could be the ABR decision module) for the next audio and video segment reading operation. The upper-layer business returns a switching decision structure (FMP4SwitchDecision) based on an operation triggered by the user (a level switching operation) or an automatic level adjustment performed by the user terminal. This structure includes `target_url` (which level to switch to), `target_index` (the level index), and `should_switch` (whether to switch levels). If `should_switch` is true, the player kernel keeps the audio stream unchanged, updates the video stream URL, and downloads the new stream's segment index. In the segment index of the new stream, it finds the first segment whose PTS is greater than or equal to the current ending PTS and redirects the read cursor to that position to achieve seamless segment switching.
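
The boundary query and cursor redirection described here can be sketched as follows. This is a minimal illustration, not the kernel's actual code: the field names follow the structure named in the text, while `first_fragment_at_or_after`, `on_fragment_boundary`, and the representation of the segment index as a sorted list of PTS values are assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FMP4SwitchDecision:
    target_url: str      # which level to switch to
    target_index: int    # level index of the target stream
    should_switch: bool  # whether to switch at this boundary


def first_fragment_at_or_after(fragment_pts: List[int], end_pts: int) -> Optional[int]:
    """Position of the first fragment whose PTS >= end_pts, or None if none exists.

    Assumes fragment_pts is sorted in ascending order.
    """
    pos = bisect_left(fragment_pts, end_pts)
    return pos if pos < len(fragment_pts) else None


def on_fragment_boundary(decision: FMP4SwitchDecision,
                         new_stream_pts: List[int],
                         end_pts: int) -> Optional[int]:
    """Return the new read-cursor position, or None when no redirect happens."""
    if not decision.should_switch:
        return None  # keep reading the current stream
    return first_fragment_at_or_after(new_stream_pts, end_pts)
```

A binary search is used because a segment index is naturally ordered by time; a linear scan would give the same result.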

[0095] The player kernel has stripped away all bandwidth detection and decision-making logic, degenerating into a pure "executor." It only "queries" the upper layer for the next operation at the segment boundary through a decision callback interface. This completely decouples the underlying mechanism from the upper-layer strategy. Business users can freely customize complex switching logic (such as forced VIP switching, weak network protection, and manual switching) without intruding on or modifying the kernel code, thus improving the player's versatility.

[0096] The decision callback process is shown in Figure 4. As shown in Figure 4, the decision callback involves the ABR decision module, the core decapsulator, and the audio stream. The core decapsulator reads the target video segment and determines whether reading has finished. If not, it continues reading the current video segment; if reading has finished, it calls the ABR decision module, executes the ABR algorithm to trigger a callback, and returns the target index and URL. It then determines whether a bitrate switch is needed. If no switch is needed, it continues reading the current video segment; if a switch is needed, it executes a seamless switching process: downloading a new stream, performing timeline alignment, resetting the state, injecting PPS / SPS, and using the new stream to resume reading video data packets. The audio stream continues to be read during this process.

[0097] Following the previous example, after the data packet sequences corresponding to the target audio segment and the target video segment are generated, if the user switches from low definition (e.g., 480p) to high definition (e.g., 1080p), the current playback time point is determined (e.g., currently playing for 32.5 seconds). The video segment with the closest PTS but less than or equal to the current playback time is found in the video segment index of the target definition (i.e., "anchored" to the same content time point), and loading and playback start from that video segment to ensure seamless / smooth switching.
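
The anchoring step in this example can be sketched as a search for the last segment whose PTS does not exceed the current playback time. The 4-second segment duration, the use of seconds as the unit, and the name `anchor_fragment` are assumptions for illustration only.

```python
from bisect import bisect_right
from typing import List


def anchor_fragment(fragment_start_pts: List[float], playback_time: float) -> int:
    """Index of the segment with the largest start PTS <= playback_time.

    Assumes fragment_start_pts is sorted ascending. If playback_time precedes
    every segment, the first segment is returned.
    """
    pos = bisect_right(fragment_start_pts, playback_time) - 1
    return max(pos, 0)


# Hypothetical 1080p segment index with one segment every 4 seconds.
starts_1080p = [0.0, 4.0, 8.0, 12.0, 16.0, 20.0, 24.0, 28.0, 32.0, 36.0]
segment = anchor_fragment(starts_1080p, 32.5)  # segment starting at 32.0 s
```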

[0098] In summary, based on the switching instructions for the data packet sequence, the target audio file and target video file to be played are determined, the target data packet sequence is constructed, and seamless switching between the data packet sequence and the target data packet sequence is achieved, ensuring the smooth playback of multimedia resources.

[0099] Furthermore, considering the potential decoder incompatibility issue when switching to target audio and target video segments for multimedia resource playback—that is, the decoder context not matching the current stream parameters—it is necessary to actively notify and reset the decoder. The specific implementation is as follows: A target video frame is determined in the target video segment, and the decoding parameters of the target video segment are encapsulated into the video frame data structure corresponding to the target video frame.

[0100] Specifically, the target video frame is the IDR frame of the target video segment, i.e., the instantaneous decoding refresh frame. The decoding parameters of the target video segment include, but are not limited to, SPS or PPS. SPS contains video sequence-level parameters such as profile, level, width and height, frame rate, and chroma format. PPS contains picture-level parameters such as slice group and entropy coding mode. The video frame data structure corresponding to the target video frame can be AVPacket, the structure that holds a unit of compressed video or audio data. Encapsulating the decoding parameters of the target video segment into the video frame data structure corresponding to the target video frame can be done by encapsulating decoding parameters such as SPS or PPS into the side data of the AVPacket. This drives a smooth reset of the downstream hardware decoding module, enabling continuous playback of multimedia resources and seamless transitions during bitrate or resolution changes.

[0101] Based on this, the target video frame is determined within the target video segment, and the decoding parameters for the target video segment are also determined. These decoding parameters are encapsulated into the video frame data structure corresponding to the target video frame, enabling the decoder to recognize the switching between the target audio segment and the target video segment, and subsequently play multimedia resources based on the target video segment and the target audio segment.

[0102] In practical applications, the decision module's enqueuing logic for switched data packets is shown in Figure 5. As shown in Figure 5, this logic is predicated on fMP4 fragment alignment and PTS alignment. Upon reading the keyframe data packet of the new stream, i.e., the IDR frame of the target video fragment, the module obtains the current PTS and the maximum read PTS. If the current PTS is greater than the maximum read PTS (Case A, time advances normally), continuous reading of the video stream is performed, and the read video stream is added to the event queue, completing the video fragment switching. If the current PTS is less than or equal to the maximum read PTS, a rollback occurs. If a keyframe with the same PTS exists in the data packet queue (Case B, keyframe in the data packet queue), it indicates that the keyframe corresponding to the original resolution has not been consumed. The data packet is determined based on the PTS and added to the event queue, completing the video fragment switching. If no keyframe with the same PTS exists in the data packet queue (Case C, keyframe has been consumed, fragment discarded), it indicates that the keyframe corresponding to the original resolution has been consumed, so the entire fragment is discarded, and the module waits for the next keyframe before re-executing the decision.
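
The three cases can be summarized in a short decision function. This is a sketch under stated assumptions: the names `SwitchAction` and `decide_enqueue`, and the representation of the data packet queue as `(pts, is_keyframe)` pairs, are hypothetical.

```python
from enum import Enum
from typing import Iterable, Tuple


class SwitchAction(Enum):
    CONTINUE = "A"        # time advances normally: keep reading and enqueue
    REPLACE_QUEUED = "B"  # matching keyframe still queued: switch at that PTS
    DROP_FRAGMENT = "C"   # keyframe already consumed: discard, wait for next


def decide_enqueue(current_pts: int,
                   max_read_pts: int,
                   queued: Iterable[Tuple[int, bool]]) -> SwitchAction:
    """Classify the first keyframe of the new stream into Case A, B, or C.

    queued holds (pts, is_keyframe) pairs for packets not yet consumed.
    """
    if current_pts > max_read_pts:
        return SwitchAction.CONTINUE
    # Rollback: the new keyframe overlaps content that was already read.
    if any(pts == current_pts and is_key for pts, is_key in queued):
        return SwitchAction.REPLACE_QUEUED
    return SwitchAction.DROP_FRAGMENT
```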

[0103] Continuing the example, during video playback or streaming, when a bitrate or resolution switch occurs, such as switching from 1080p to 720p, a new video stream, i.e., the target video segment, is loaded. However, in the MP4 container format, SPS and PPS are typically stored only in the extra data at the beginning of the file and are not repeatedly embedded in every IDR frame. An IDR frame is a special type of I-frame (keyframe) that can be decoded independently without referencing any previous frames. Therefore, when switching to a new stream, the first IDR frame may not contain SPS / PPS information. Downstream hardware decoders, such as Android's MediaCodec, iOS's VideoToolbox, and Intel's VAAPI, must know the SPS / PPS during initialization to correctly allocate internal buffers, parse syntax, and output the correct screen size. If the decoder still uses the parameters of the old stream (e.g., the 1080p context) to decode 720p frames, it will lead to problems such as decoding failure, screen tearing, green screen, and crashes. To address the aforementioned issues, an "in-band parameter update" mechanism is employed. "In-band" refers to embedding control information within the normal data stream (such as video frames) for transmission, rather than transmitting it through a separate channel. Therefore, in the first AVPacket of the new stream (target video segment) after switching, the SPS / PPS of the new stream are encapsulated into the Side Data field of that AVPacket. Even if the IDR frame itself does not contain SPS / PPS, the decoder can obtain complete parameter information from the Side Data.

[0104] In practice, when the downstream hardware decoding module receives this AVPacket with Side Data, it performs the following operations. First, it checks the Side Data type, specifically whether Side Data of type AV_PKT_DATA_NEW_EXTRADATA exists (new extra data used to notify the decoder that the encoding / decoding parameters have changed). Second, it triggers a decoder reset, destroying the decoder context currently in use, because the old context was created for the old resolution (e.g., 1080p). Third, it recreates a hardware decoder instance matching the new stream parameters (e.g., 720p) using the new SPS / PPS carried in the Side Data. Finally, it decodes the current frame (from the target video segment) with the newly created 720p decoder, outputting a picture with the correct size and content. This process is called "decoder adaptation," and it allows the decoder to adapt dynamically to the encoding parameters of the new stream. In addition to rebuilding the decoder, state synchronization is also required. Timestamps are aligned so that the new stream's PTS / DTS (presentation / decoding timestamps) are synchronized with the playback clock. The buffer is cleared, discarding any undecoded or unrendered frames remaining from the old stream to prevent image corruption. The rendering pipeline is reconfigured, instructing the graphics system (such as OpenGL) to adjust the output texture size to match 720p. These operations are typically coordinated by the player framework or media engine to ensure a smooth, stutter-free transition from "switching" to "normal playback of the new stream." Since only IDR frames are independently decodable, if new parameters are attached to non-IDR frames (such as P-frames or B-frames), the decoder may fail to decode correctly due to the lack of a reference frame, even with the new SPS / PPS. Therefore, new parameters can be safely injected, and the decoder rebuilt, only when the first IDR frame of the new stream arrives.
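
The reset-on-side-data behavior can be modeled without any real decoder. The sketch below is a plain Python simulation: `Packet`, `DecoderSession`, and `VideoDecodePath` are hypothetical stand-ins, no FFmpeg or hardware decoder API is called, and the side data type name is used only as a dictionary key.

```python
from dataclasses import dataclass, field
from typing import Dict

NEW_EXTRADATA = "AV_PKT_DATA_NEW_EXTRADATA"  # dictionary key only, not an API call


@dataclass
class Packet:
    pts: int
    is_idr: bool
    side_data: Dict[str, bytes] = field(default_factory=dict)


class DecoderSession:
    """Stand-in for a hardware decoder context bound to one parameter set."""

    def __init__(self, extradata: bytes) -> None:
        self.extradata = extradata


class VideoDecodePath:
    def __init__(self, extradata: bytes) -> None:
        self.session = DecoderSession(extradata)
        self.resets = 0

    def submit(self, pkt: Packet) -> bytes:
        """Return the parameter set that would decode this packet."""
        new_extradata = pkt.side_data.get(NEW_EXTRADATA)
        # Rebuild only on an IDR frame; other frames still need old references.
        if new_extradata is not None and pkt.is_idr:
            self.session = DecoderSession(new_extradata)  # old context dropped
            self.resets += 1
        return self.session.extradata
```

In this model a non-IDR packet that carries new parameters is decoded with the existing session, which mirrors the constraint that injection is only safe at the first IDR frame of the new stream.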

[0105] In summary, by encapsulating the decoding parameters of the target video segment into the video frame data structure corresponding to the target video frame, the decoder can be reset, thus avoiding potential decoder incompatibility issues when switching to the target audio segment and the target video segment for multimedia resource playback.

[0106] Step 106: Execute the playback task of the multimedia resource according to the data packet sequence.

[0107] Specifically, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the aforementioned data packet integration strategy, the multimedia resource playback task can be executed according to the data packet sequence. The multimedia resource playback task is the playback task of the audio and video segments corresponding to the data packet sequence. Executing the multimedia resource playback task involves sending the data packet sequence into the rendering thread, drawing it frame by frame, displaying the playback screen frame by frame on the playback page, and simultaneously playing the audio corresponding to the screen to achieve audio-visual synchronization.

[0108] Based on this, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the multimedia resource playback task is executed according to the data packet sequence to realize the playback of the audio and video content corresponding to the data packet sequence.

[0109] In practical applications, the data packet sequence is composed of audio and video data packets interleaved in time (which can be decoding timestamps). Through demultiplexing, synchronous decoding, and frame-by-frame precise rendering mechanisms, smooth and synchronized playback of audio and video content is achieved. Each data packet in the sequence carries the following metadata: media type identifier (audio / video); presentation time stamp (PTS); decoding time stamp (DTS); keyframe flag (e.g., AV_PKT_FLAG_KEY); and can also include side data to transmit codec parameter changes (e.g., SPS / PPS updates). All data packets in the sequence are arranged in ascending DTS order, forming a time-ordered data stream, which serves as the input source for the playback engine.

[0110] After determining the data packet sequence, the playback context is initialized, that is, playback context objects are created and initialized, including but not limited to: audio decoder instances and video decoder instances, audio renderer, video renderer, audio and video synchronization clock (based on the system monotonic clock), and decoding frame buffer queues (for audio frames AVFrame and video frames AVFrame respectively). Data packets in the data packet sequence are read sequentially, that is, the next data packet to be processed is retrieved from the data packet sequence one by one until the sequence ends or a stop command is received. Based on the media type (audio or video), the data packet is distributed to the corresponding decoding channel. For the current data packet, if the media type identifier corresponds to audio, it is sent to the audio decoding channel; if the media type identifier corresponds to video, it is sent to the video decoding channel.

[0111] Before decoding, the data packet is checked for side data. If it exists, the codec's private parameters (such as SPS / PPS for H.264 / H.265) are extracted, triggering a dynamic reconfiguration process for the corresponding decoder. This includes: refreshing the decoder's internal state; reallocating the frame buffer based on the new parameters; and updating output attributes such as color format and resolution. The data packet is then submitted to the decoder for decoding, outputting the corresponding audio or video frames and storing them in their respective frame buffer queues. Independent audio and video rendering threads are started, working collaboratively. For example, the audio rendering thread retrieves audio frames from the audio frame queue in DTS / PTS order and sends them to the audio hardware output, updating the master clock to the PTS of the current audio frame in real time. The video rendering thread retrieves the next video frame to be displayed from the video frame queue; calculates the difference Δt between the frame's PTS and the current master synchronization clock; if Δt > 0, it waits for Δt time before rendering; if Δt ≤ 0 (i.e., expired), it discards the frame (frame skipping) to avoid lag; and performs GPU rendering to draw the video frame corresponding to it. If the current video data packet is a keyframe and carries a resolution switching event, a UI state update (such as a resolution icon change) is triggered at the actual rendering time of that frame to ensure strict alignment between the user interface and the screen content.
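
The video rendering thread's timing rule can be written as a small function. This is a sketch that assumes timestamps in microseconds; the name `schedule_video_frame` and the returned tuples are illustrative, and the treatment of Δt = 0 as expired follows the text above.

```python
from typing import Tuple


def schedule_video_frame(frame_pts_us: int, master_clock_us: int) -> Tuple[str, int]:
    """Decide what to do with the next video frame.

    Returns ("wait_then_render", delay_us) when the frame is early, and
    ("drop", 0) when it has expired relative to the master clock.
    """
    delta_us = frame_pts_us - master_clock_us
    if delta_us > 0:
        return ("wait_then_render", delta_us)
    return ("drop", 0)  # delta <= 0: skip the frame to avoid lag
```

Because the audio rendering thread advances the master clock to the PTS of the audio frame being played, this rule keeps the picture following the sound rather than the other way around.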

[0112] Furthermore, after determining the data packet sequence, the playback task corresponding to the multimedia resource can be executed based on the data packet sequence to achieve audio and video playback. The specific implementation is as follows: A rendering event is generated based on the data packet sequence, and the rendering event is added to the event queue; when the rendering event is executed through the rendering thread, the playback task associated with the multimedia resource corresponding to the rendering event is executed, and the video icon on the playback page is updated.

[0113] Specifically, a rendering event refers to a structured event object containing PTS and NewIndex, used to represent a scheduling instruction that "the UI resolution should be updated at a specified display time." The event queue is a first-in, first-out (FIFO) queue used to temporarily store rendering events generated by the player kernel, for the rendering thread to consume in sequence. The playback page is the video playback page displayed on the user's terminal; the video icon can be an icon on the video playback page indicating the resolution of the currently playing content, such as 720P, 1080P, etc.

[0114] Based on this, rendering events are generated according to the data packet sequence and added to the event queue. When the rendering event is executed by the rendering thread, the playback task of the associated multimedia resource corresponding to the rendering event is executed. In its main rendering loop, the rendering thread renders each video frame to be processed in the data packet sequence. Once it is determined that the current video frame is the first frame or a subsequent frame of the target resolution stream, it indicates that the new resolution image content has officially entered the rendering process. At this time, the rendering thread executes the rendering event, completes the drawing operation of this video frame, and updates the video icon on the playback page.

[0115] In practical applications, when the player kernel receives a resolution switching request initiated by the user, it first switches to the corresponding target media stream (the target audio and target video corresponding to the multimedia resource) and parses the display timestamp carried by the first renderable video frame in the stream (usually an Instantaneous Decoding Refresh (IDR) frame), denoted as the first PTS. Simultaneously, it determines the corresponding NewIndex based on the target resolution level. The player kernel constructs a rendering event, which includes the first PTS and NewIndex, and adds it to the event queue. In its main rendering loop, the rendering thread performs the following operations on each video frame to be processed in the data packet sequence: it obtains the display timestamp of the current video frame, denoted as the second PTS; it checks whether the event queue is empty; and if the queue is not empty, it reads the rendering event at the head of the queue and compares the second PTS with the first PTS recorded in that event. When the second PTS is greater than or equal to the first PTS, the current video frame is the first frame of the target resolution stream or one of its subsequent frames, indicating that the new resolution content has officially entered the rendering process. At this point, the rendering thread executes the rendering event, that is, completes the drawing operation of this video frame. As part of the playback task, a state update instruction carrying the NewIndex is sent to the user interface thread. The user interface thread queries the preset resolution label mapping table based on the NewIndex and updates the video resolution icon on the playback page (for example, changing the original "720P" to "1080P"). After completing the above operations, the rendering event is removed from the event queue to prevent duplicate processing.
By binding the rendering event to the PTS of the video frame, a causal link of "image change → UI change" is established. The player kernel is responsible for event generation, the rendering thread is responsible for event triggering, and the UI thread is responsible for state presentation. The three are loosely coupled and communicate through the event queue. This not only ensures that the UI icon update and the actual image resolution change are precisely aligned in time, but also effectively avoids UI false alarms caused by network latency, decoding frame loss, or frame skipping.

[0116] Continuing with the previous example, in video playback scenarios that support multi-bitrate adaptive switching or manual resolution switching, when a user selects a new resolution level, the player kernel needs to load a new media stream with the corresponding bitrate, i.e., a data packet sequence. However, the first frame of the new stream (usually an IDR frame) has network transmission and decoding latency. If the UI updates the resolution icon immediately when the switching command is issued, it will result in an inconsistent experience where "the UI displays 1080P, but the picture is still 720P." Therefore, it is necessary to ensure that UI state changes and actual picture content changes are precisely aligned in time. By using an event queue mechanism, the resolution switching event is bound to the rendering time of a specific video frame, thereby achieving frame-level synchronization between UI icon changes and picture resolution changes. When a user is watching a video, they click the "Switch to 1080P" button at 2.8 seconds into the playback. The player kernel switches to the 1080P bitrate stream and detects that the PTS of the first IDR frame of this stream is 5,200,000 microseconds (i.e., 5.2 seconds). The kernel then generates a rendering event {PTS=5,200,000,NewIndex=2} and pushes it into the event queue. When the rendering thread processes a video frame with PTS of 5,200,000 in a subsequent loop, the condition is met, and the rendering of that frame is completed. The UI thread is then notified to update the resolution icon to "1080P". Thus, the user interface state and the actual screen content are synchronized at the frame level.
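
The event queue mechanism in this example can be sketched as follows. The event fields mirror the PTS and NewIndex named in the text; the function name `render_frame` and the contents of `RESOLUTION_LABELS` (including which index maps to which label) are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class RenderEvent:
    pts: int        # first PTS: display time of the new stream's first IDR frame
    new_index: int  # NewIndex: level index to show in the UI


# Hypothetical resolution label mapping table.
RESOLUTION_LABELS = {0: "480P", 1: "720P", 2: "1080P"}


def render_frame(frame_pts: int, events: Deque[RenderEvent]) -> Optional[str]:
    """Render one frame; return the new UI label if an event fires, else None."""
    if events and frame_pts >= events[0].pts:
        event = events.popleft()  # removed so it cannot fire twice
        return RESOLUTION_LABELS[event.new_index]
    return None


# The user clicks at 2.8 s; the first 1080P IDR frame has PTS 5,200,000 us.
events: Deque[RenderEvent] = deque([RenderEvent(pts=5_200_000, new_index=2)])
labels = [render_frame(pts, events) for pts in (2_800_000, 5_160_000, 5_200_000, 5_240_000)]
```

Only the frame at 5,200,000 microseconds produces a label change, so the icon is updated at the moment the new picture is drawn rather than when the button is clicked.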

[0117] In summary, when rendering events are executed through the rendering thread, the playback task of the associated multimedia resource corresponding to the rendering event is executed, and the video icon on the playback page is updated to realize the resolution icon change, ensuring that the user interface and the screen content are strictly aligned.

[0118] This specification provides a multimedia resource playback method according to one embodiment, applied to a player kernel. In response to a trigger event for a multimedia resource, a target audio segment is determined in the audio file contained in the multimedia resource, and a target video segment is determined in the video file contained in the multimedia resource, based on the segment index of the multimedia resource. Following a data packet integration strategy, at least one audio data packet from the target audio segment and at least one video data packet from the target video segment are integrated into a data packet sequence, thereby merging the audio stream corresponding to the target audio segment and the video stream corresponding to the target video segment into a single data stream. The multimedia resource playback task is executed according to the data packet sequence. The integration of the target audio and video segments allows downstream decoding and rendering modules to achieve smooth playback of the separate audio and video streams without modification, while avoiding audio-visual asynchrony and playback stuttering caused by trigger events.

[0119] The multimedia resource playback method is further explained below with reference to Figure 6, taking the application of the method provided in this specification in a video-on-demand scenario as an example. Figure 6 is a timing flowchart illustrating a multimedia resource playback method according to an embodiment of this specification.

[0120] The multimedia resource playback method is applied to the player kernel, which belongs to the client corresponding to the content application platform. The client is the terminal device held by the user browsing the content application platform for video-on-demand. In this embodiment, the player kernel specifically refers to the player kernel architecture. The server specifically refers to the server that stores the video-on-demand content, i.e., the multimedia resource server. As shown in Figure 6: The client, in response to a trigger event for multimedia resources, sends a request to the server to retrieve the multimedia resources.

[0121] On the server side, in response to a multimedia resource retrieval request, the multimedia resource is sent to the client.

[0122] The client determines the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource based on the segment index of the multimedia resource.

[0123] The client, according to the data packet integration strategy, integrates at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence.

[0124] The client executes multimedia resource playback tasks according to the data packet sequence.

[0125] In practical applications, the process of player kernel adaptation and seamless switching is shown in Figure 7. As shown in Figure 7, the resource thread reads the data packet and determines if it is the first keyframe after the switch. If so, it constructs a switch event (switch queue) containing the current timestamp and fragment index, and sends the data packet to the decoding queue. If not, it directly sends the data packet to the decoding queue. After sending the data packet to the decoding queue, it retrieves the data packet and determines if it contains side data. If so, it updates the format descriptor, resets the VideoToolbox decoding session, executes decoding to output the standard structure (AVFrame) of the raw audio and video data, and performs a rendering loop to obtain the standard structure (Frame). It checks the head of the queue to determine if the fragment PTS is greater than the PTS of the switch event. If not, it directly renders the video frame. If so, it pops the queue, executes a callback to notify the UI, and renders the video frame.

[0126] Furthermore, considering that the audio and video files contained in the multimedia resources are determined by physically separated addresses, in order to improve the reading efficiency of the target video and audio segments, the segment index of the multimedia resource includes audio and video segment indices. Therefore, the target video segment can be determined in the video file based on the video segment index, and the target audio segment can be determined in the audio file based on the audio segment index. The specific implementation is as follows: The audio fragment index and video fragment index are determined according to the fragment index of the multimedia resource; the target audio fragment is determined in the audio file contained in the multimedia resource according to the audio fragment index, and the target video fragment is determined in the video file contained in the multimedia resource according to the video fragment index.

[0127] Furthermore, considering that the audio and video files contained in multimedia resources are separate resource streams, it is necessary to build separate fragment indexes for the audio and video contained in the multimedia resources. The specific implementation is as follows: The system receives the audio resource address and video resource address of the multimedia resource, and initializes the audio stream manager and video stream manager; it uses the audio stream manager to determine the audio file based on the audio resource address, and determines the audio file index information based on the audio file, and constructs the audio segment index based on the audio file index information; it uses the video stream manager to determine the video file based on the video resource address, and determines the video file index information based on the video file, and constructs the video segment index based on the video file index information.

[0128] Furthermore, to improve the user experience for multimedia resource viewers and adapt to changes in the external environment such as the network during multimedia resource playback, multiple selectable resolutions are provided. When switching resolutions for multimedia resources, a trigger event is generated for the multimedia resource, as implemented below: The system receives a trigger operation submitted by the user for the multimedia resource and uses the trigger operation as the trigger event; or, it uses the stream switching event associated with the multimedia resource as the trigger event.

[0129] Furthermore, considering that the video resource address of the video file and the audio file address of the audio file are physically separate, the video stream of the video file and the audio stream of the audio file are separate streams. To reduce the differences between the physical files and enable downstream decoding and rendering modules to support the playback of separate streams, at least one audio data packet in the target audio segment and at least one video data packet in the target video segment can be virtually merged at the data packet level to construct a data packet sequence. The specific implementation is as follows: An audio data packet to be integrated is determined from at least one audio data packet in the target audio segment, and a video data packet to be integrated is determined from at least one video data packet in the target video segment. The timestamps of the audio data packets to be integrated and the video data packets to be integrated are compared according to the data packet integration strategy to determine a target data packet and a data packet to be compared. If the data packet to be compared corresponds to an audio type, a first video data packet is determined from the at least one video data packet, and the first video data packet is used as the video data packet to be integrated. Its timestamp is compared with the audio data packet to be integrated until the data packet sequence is obtained.
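
The packet-level virtual merge can be sketched as a two-way merge by timestamp. This is an illustration under stated assumptions: the `Packet` fields, the name `merge_by_timestamp`, the use of DTS as the compared timestamp, and the rule that audio goes first on equal timestamps are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    media_type: str  # "audio" or "video"
    dts: int         # timestamp used for ordering (assumed to be DTS)


def merge_by_timestamp(audio: List[Packet], video: List[Packet]) -> List[Packet]:
    """Interleave two timestamp-sorted packet lists into one sequence.

    At each step the candidate with the smaller timestamp becomes the target
    packet and is appended; the other stays as the packet to be compared.
    """
    merged: List[Packet] = []
    a = v = 0
    while a < len(audio) and v < len(video):
        if audio[a].dts <= video[v].dts:  # tie-break (audio first) is assumed
            merged.append(audio[a])
            a += 1
        else:
            merged.append(video[v])
            v += 1
    merged.extend(audio[a:])  # one list is exhausted; append the remainder
    merged.extend(video[v:])
    return merged
```

Downstream modules then see one ordered sequence and do not need to know that the packets came from two physically separate files.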

[0130] Furthermore, once the target data packet and the data packet to be compared have been identified from the audio data packet and the video data packet to be integrated, the target data packet has completed merging and can be stored in the data packet sequence. Because the target data packet carries a stream index from its own resource segment, that stream index needs to be mapped to a logical index after the target data packet is stored in the data packet sequence. The specific implementation is as follows: the target data packet is stored in the data packet sequence, and the stream index of the target data packet is determined; a logical index of the target data packet is generated based on the stream index.
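One way to picture the index mapping is sketched below. The specification does not say how the logical index is derived, so the scheme here is an assumption: each (source file, stream index) pair is given the next unused integer the first time it is seen. The names `LogicalIndexMap` and `store` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    source: str              # "audio" or "video": which physical file it came from
    stream_index: int        # index of the stream inside that file
    timestamp: float
    logical_index: int = -1  # index inside the merged sequence, -1 until mapped


class LogicalIndexMap:
    """Assigns one stable logical index per (source, stream_index) pair."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, int], int] = {}

    def logical_index(self, source: str, stream_index: int) -> int:
        key = (source, stream_index)
        if key not in self._table:
            # Assumption: logical indices are handed out in order of first use.
            self._table[key] = len(self._table)
        return self._table[key]


def store(sequence: List[Packet], index_map: LogicalIndexMap, packet: Packet) -> Packet:
    """Store the target packet in the sequence with its logical index attached."""
    mapped = replace(
        packet,
        logical_index=index_map.logical_index(packet.source, packet.stream_index),
    )
    sequence.append(mapped)
    return mapped
```

The mapping matters because the audio file and the video file may each number their only stream as 0; without it, a downstream module reading the merged sequence could not tell the two apart.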

[0131] Furthermore, once the data packet sequence corresponding to the target audio segment and the target video segment has been constructed, those segments have been read completely and the next audio and video segments need to be read. When a switching instruction for the data packet sequence is received, the target video file and target audio file to be read can be determined from the switching instruction, and the data packet sequence is then constructed for those target files. The specific implementation is as follows: a switching instruction for the data packet sequence is received and parsed to obtain switching information; the target audio file and target video file of the multimedia resource are determined based on the switching information and the segment index, the target audio file is used as the audio file, and the target video file is used as the video file; and the steps of determining the target audio segment in the audio file contained in the multimedia resource and determining the target video segment in the video file contained in the multimedia resource according to the segment index of the multimedia resource are executed again.
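The switching flow above might look like the following sketch. Everything concrete in it is assumed for illustration: the instruction format (`quality=...;position=...`), the `Rendition` and `Segment` types, and the choice to locate the segment whose start time is at or before the playback position. The specification only states that switching information and the segment index together select the target files and segments.

```python
import bisect
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    start: float   # start time in seconds (assumed unit)
    offset: int    # byte offset inside the file
    size: int      # byte length


@dataclass
class Rendition:
    audio_file: str
    video_file: str
    audio_segments: List[Segment]   # audio segment index, sorted by start
    video_segments: List[Segment]   # video segment index, sorted by start


def parse_switch_instruction(instruction: str) -> Dict[str, str]:
    """Parse a hypothetical 'key=value;key=value' switching instruction."""
    return dict(item.split("=", 1) for item in instruction.split(";") if item)


def locate(segments: List[Segment], position: float) -> Segment:
    """Return the segment whose start time is the latest one <= position."""
    starts = [s.start for s in segments]
    i = bisect.bisect_right(starts, position) - 1
    return segments[max(i, 0)]


def switch(renditions: Dict[str, Rendition],
           instruction: str) -> Tuple[str, Segment, str, Segment]:
    info = parse_switch_instruction(instruction)    # switching information
    target = renditions[info["quality"]]            # target audio/video files
    position = float(info["position"])
    return (target.audio_file, locate(target.audio_segments, position),
            target.video_file, locate(target.video_segments, position))
```

For example, with segments starting at 0, 2 and 4 seconds, a call such as `switch(renditions, "quality=720p;position=4.5")` selects the 720p files and the segment that starts at 4 seconds in each of them.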

[0132] Furthermore, considering the potential decoder incompatibility issue when switching to target audio and target video segments for multimedia resource playback—that is, the decoder context not matching the current stream parameters—it is necessary to actively notify and reset the decoder. The specific implementation is as follows: A target video frame is determined in the target video segment, and the decoding parameters of the target video segment are encapsulated into the video frame data structure corresponding to the target video frame.
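A minimal sketch of carrying decoding parameters inside a frame structure follows. It assumes that the target video frame is the first keyframe of the target video segment and that the decoder resets itself whenever a frame carries parameters that differ from its current context. The specification fixes neither point, and all type and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DecodingParams:
    codec: str
    width: int
    height: int
    extradata: bytes = b""   # e.g. codec configuration bytes


@dataclass
class VideoFrame:
    timestamp: float
    data: bytes
    is_keyframe: bool = False
    decoding_params: Optional[DecodingParams] = None   # set only on the target frame


def tag_target_frame(frames: List[VideoFrame], params: DecodingParams) -> VideoFrame:
    """Attach the segment's decoding parameters to its first keyframe (assumed target)."""
    for frame in frames:
        if frame.is_keyframe:
            frame.decoding_params = params
            return frame
    raise ValueError("segment contains no keyframe")


class Decoder:
    """Stand-in decoder that only tracks its context and how often it was reset."""

    def __init__(self) -> None:
        self.params: Optional[DecodingParams] = None
        self.resets = 0

    def feed(self, frame: VideoFrame) -> None:
        carried = frame.decoding_params
        if carried is not None and carried != self.params:
            self.params = carried   # rebuild the context for the new stream
            self.resets += 1
        # Actual decoding of frame.data is omitted in this sketch.
```

Because the parameters travel with the frame, the decoder learns about the new stream at exactly the frame where it starts, without a separate signalling channel.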

[0133] Furthermore, after determining the data packet sequence, the playback task corresponding to the multimedia resource can be executed based on the data packet sequence to achieve audio and video playback. The specific implementation is as follows: A rendering event is generated based on the data packet sequence, and the rendering event is added to the event queue; when the rendering event is executed through the rendering thread, the playback task associated with the multimedia resource corresponding to the rendering event is executed, and the video icon on the playback page is updated.
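The event-queue hand-off to the rendering thread can be sketched with the Python standard library. The `RenderEvent` type, the `None` sentinel used to stop the thread, and the `play` callback (which stands in for both the playback task and the page update) are assumptions made for illustration.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RenderEvent:
    packets: List[str]   # the slice of the data packet sequence to render


def render_loop(events: queue.Queue, play: Callable[[List[str]], None]) -> None:
    """Body of the rendering thread: execute events in the order they were queued."""
    while True:
        event = events.get()
        if event is None:        # sentinel: no more events
            break
        play(event.packets)      # playback task associated with this event


def run_playback(batches: List[List[str]]) -> List[str]:
    events: queue.Queue = queue.Queue()
    played: List[str] = []
    worker = threading.Thread(target=render_loop, args=(events, played.extend))
    worker.start()
    for batch in batches:
        events.put(RenderEvent(batch))   # add the rendering event to the queue
    events.put(None)
    worker.join()
    return played
```

A single consumer thread reading from a FIFO queue keeps the rendering order identical to the order in which events were generated, which is what the merged sequence relies on.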

[0134] This specification provides a multimedia resource playback method according to one embodiment, applied to a player kernel. In response to a trigger event for a multimedia resource, a target audio segment is determined in the audio file contained in the multimedia resource, and a target video segment is determined in the video file contained in the multimedia resource, based on the segment index of the multimedia resource. Following a data packet integration strategy, at least one audio data packet from the target audio segment and at least one video data packet from the target video segment are integrated into a data packet sequence, thereby merging the audio stream corresponding to the target audio segment and the video stream corresponding to the target video segment into a single stream. The playback task of the multimedia resource is then executed according to the data packet sequence. Because the target audio and video segments are integrated in this way, downstream decoding and rendering modules can play the separate audio and video streams smoothly without modification, and the audio-visual desynchronization and playback stuttering that a trigger event would otherwise cause are avoided.

[0135] Corresponding to the above method embodiments, this specification also provides embodiments of a multimedia resource playback device. Figure 8 shows a schematic diagram of a multimedia resource playback device according to one embodiment of this specification. As shown in Figure 8, the multimedia resource playback device is applied to the player kernel and includes: a determination module 802, configured to, in response to a triggering event for a multimedia resource, determine a target audio segment in an audio file contained in the multimedia resource based on a segment index of the multimedia resource, and determine a target video segment in a video file contained in the multimedia resource; an integration module 804, configured to integrate at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to a data packet integration strategy; and an execution module 806, configured to perform the playback task of the multimedia resource according to the data packet sequence.

[0136] In an optional embodiment, determining the target audio segment in the audio file contained in the multimedia resource and the target video segment in the video file contained in the multimedia resource based on the segment index of the multimedia resource includes: determining the audio segment index and the video segment index based on the segment index of the multimedia resource; and determining the target audio segment in the audio file contained in the multimedia resource according to the audio segment index, and the target video segment in the video file contained in the multimedia resource according to the video segment index.

[0137] In an optional embodiment, the construction of the audio segment index and the video segment index includes: receiving the audio resource address and the video resource address of the multimedia resource, and initializing an audio stream manager and a video stream manager; using the audio stream manager to determine the audio file based on the audio resource address, determining audio file index information based on the audio file, and constructing the audio segment index based on the audio file index information; and using the video stream manager to determine the video file based on the video resource address, determining video file index information based on the video file, and constructing the video segment index based on the video file index information.
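The index construction can be sketched as below. The sketch assumes that the file index information is a list of (duration, byte size) pairs, one per segment, and that a caller-supplied `read_index_info` function fetches it from a resource address. Both are hypothetical stand-ins, since the specification does not describe the format of the index information or how the file is fetched.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed shape of the file index information: one (duration_s, size_bytes) per segment.
IndexInfo = List[Tuple[float, int]]


@dataclass(frozen=True)
class SegmentEntry:
    start: float   # start time of the segment in seconds
    offset: int    # byte offset of the segment inside the file
    size: int      # byte length of the segment


class StreamManager:
    """Owns one physical file (audio or video) and its segment index."""

    def __init__(self, address: str, read_index_info: Callable[[str], IndexInfo]) -> None:
        self.address = address
        self._read_index_info = read_index_info
        self.segment_index: List[SegmentEntry] = []

    def build_index(self) -> List[SegmentEntry]:
        info = self._read_index_info(self.address)   # file index information
        start, offset = 0.0, 0
        entries: List[SegmentEntry] = []
        for duration, size in info:
            entries.append(SegmentEntry(start, offset, size))
            start += duration    # running time of the next segment
            offset += size       # running byte position of the next segment
        self.segment_index = entries
        return entries
```

An audio stream manager and a video stream manager would each be one `StreamManager` instance built from its own resource address, so the two segment indexes stay independent even when the files differ in segment count or duration.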

[0138] In an optional embodiment, the triggering event of the multimedia resource includes: The system receives a trigger operation submitted by the user for the multimedia resource and uses the trigger operation as the trigger event; or, it uses the stream switching event associated with the multimedia resource as the trigger event.

[0139] In an optional embodiment, the step of integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy includes: determining the audio data packet to be integrated from the at least one audio data packet in the target audio segment, and determining the video data packet to be integrated from the at least one video data packet in the target video segment; comparing, according to the data packet integration strategy, the timestamps of the audio data packet to be integrated and the video data packet to be integrated, so as to determine the target data packet and the data packet to be compared from among them; and, in the case where the data packet to be compared is of the audio type, determining a first video data packet from the at least one video data packet, using the first video data packet as the video data packet to be integrated, and comparing its timestamp with that of the audio data packet to be integrated, until the data packet sequence is obtained.

[0140] In an optional embodiment, after determining the target data packet and the data packet to be compared from the audio data packet to be integrated and the video data packet to be integrated, the method further includes: storing the target data packet in the data packet sequence, and determining the stream index of the target data packet; and generating a logical index for the target data packet based on the stream index.

[0141] In an optional embodiment, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the method further includes: receiving a switching instruction for the data packet sequence, and parsing the switching instruction to obtain switching information; determining the target audio file and the target video file of the multimedia resource based on the switching information and the segment index, using the target audio file as the audio file, and using the target video file as the video file; and performing the steps of determining the target audio segment in the audio file contained in the multimedia resource based on the segment index of the multimedia resource, and determining the target video segment in the video file contained in the multimedia resource.

[0142] In an optional embodiment, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the method further includes: A target video frame is determined in the target video segment, and the decoding parameters of the target video segment are encapsulated into the video frame data structure corresponding to the target video frame.

[0143] In an optional embodiment, performing the playback task of the multimedia resource according to the data packet sequence includes: A rendering event is generated based on the data packet sequence, and the rendering event is added to the event queue; When the rendering event is executed through the rendering thread, the playback task associated with the multimedia resource corresponding to the rendering event is executed, and the video icon on the playback page is updated.

[0144] This specification provides a multimedia resource playback device according to one embodiment, applied to a player kernel. In response to a trigger event for a multimedia resource, a target audio segment is determined in the audio file contained in the multimedia resource, and a target video segment is determined in the video file contained in the multimedia resource, based on the segment index of the multimedia resource. Following a data packet integration strategy, at least one audio data packet from the target audio segment and at least one video data packet from the target video segment are integrated into a data packet sequence, thereby merging the audio stream corresponding to the target audio segment and the video stream corresponding to the target video segment into a single stream. The playback task of the multimedia resource is then executed according to the data packet sequence. Because the target audio and video segments are integrated in this way, downstream decoding and rendering modules can play the separate audio and video streams smoothly without modification, and the audio-visual desynchronization and playback stuttering that a trigger event would otherwise cause are avoided.

[0145] The above is an illustrative scheme of a multimedia resource playback device according to this embodiment. It should be noted that the technical solution of this multimedia resource playback device and the technical solution of the multimedia resource playback method described above belong to the same concept. For details not described in detail in the technical solution of the multimedia resource playback device, please refer to the description of the technical solution of the multimedia resource playback method described above.

[0146] Figure 9 shows a structural block diagram of a computing device 900 according to one embodiment of this specification. The components of the computing device 900 include, but are not limited to, a memory 910 and a processor 920. The processor 920 is connected to the memory 910 via a bus 930, and a database 950 is used to store data.

[0147] The computing device 900 also includes an access device 940, which enables the computing device 900 to communicate via one or more networks 960. Examples of these networks include Public Switched Telephone Network (PSTN), Local Area Network (LAN), Wide Area Network (WAN), Personal Area Network (PAN), or combinations of communication networks such as the Internet. The access device 940 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Wi-MAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.

[0148] In one embodiment of this specification, the above-described components of the computing device 900, as well as other components not shown in Figure 9, can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 9 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.

[0149] The computing device 900 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 900 can also be a mobile or stationary server.

[0150] The processor 920 is configured to execute a computer program / instructions that, when executed by the processor, implement the steps of the multimedia resource playback method described above.

[0151] The embodiments in this specification are described in a progressive manner. For parts that are identical or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the computing device embodiments are substantially similar to the multimedia resource playback method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the multimedia resource playback method embodiments.

[0152] An embodiment of this specification also provides a computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the multimedia resource playback method described above.

[0153] The embodiments in this specification are described in a progressive manner. For parts that are identical or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the computer-readable storage medium embodiments are substantially similar to the multimedia resource playback method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the multimedia resource playback method embodiments.

[0154] An embodiment of this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the multimedia resource playback method described above.

[0155] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product and the technical solution of the multimedia resource playback method described above belong to the same concept. For details not described in detail in the technical solution of the computer program product, please refer to the description of the technical solution of the multimedia resource playback method described above.

[0156] An embodiment of this specification also provides a method for storing a bitstream, comprising storing the bitstream in a storage medium, the bitstream being generated by the multimedia resource playback method described above.

[0157] The above is an illustrative scheme of a method for storing bitstreams according to this embodiment. It should be noted that the technical solution of this method belongs to the same concept as the aforementioned multimedia resource playback method. Details not described in detail in the technical solution of the bitstream storage method can be found in the description of the aforementioned multimedia resource playback method.

[0158] An embodiment of this specification also provides a method for transmitting a bit stream, including transmitting a bit stream generated by the multimedia resource playback method described above.

[0159] The above is an illustrative scheme of a method for transmitting bit streams according to this embodiment. It should be noted that the technical solution of this method belongs to the same concept as the aforementioned multimedia resource playback method. Details not described in detail in the technical solution of the bit stream transmission method can be found in the description of the aforementioned multimedia resource playback method.

[0160] An embodiment of this specification also provides a computer-readable storage medium storing a bitstream generated by the multimedia resource playback method described above.

[0161] The above is an illustrative embodiment of a computer-readable storage medium. It should be noted that the technical solution of this computer-readable storage medium and the technical solution of the aforementioned multimedia resource playback method belong to the same concept. Details not described in detail in the technical solution of the computer-readable storage medium can be found in the description of the technical solution of the aforementioned multimedia resource playback method.

[0162] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0163] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0164] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.

[0165] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0166] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

1. A method for playing multimedia resources, characterized in that the method is applied to a player kernel and comprises: in response to a triggering event for a multimedia resource, determining, based on a segment index of the multimedia resource, a target audio segment in an audio file contained in the multimedia resource and a target video segment in a video file contained in the multimedia resource; integrating, according to a data packet integration strategy, at least one audio data packet in the target audio segment and at least one video data packet in the target video segment into a data packet sequence; and executing a playback task of the multimedia resource according to the data packet sequence.

2. The multimedia resource playback method according to claim 1, characterized in that the step of determining the target audio segment in the audio file contained in the multimedia resource based on the segment index of the multimedia resource, and determining the target video segment in the video file contained in the multimedia resource, comprises: determining an audio segment index and a video segment index based on the segment index of the multimedia resource; and determining the target audio segment in the audio file contained in the multimedia resource according to the audio segment index, and determining the target video segment in the video file contained in the multimedia resource according to the video segment index.

3. The multimedia resource playback method according to claim 2, characterized in that the construction of the audio segment index and the video segment index comprises: receiving the audio resource address and the video resource address of the multimedia resource, and initializing an audio stream manager and a video stream manager; using the audio stream manager to determine the audio file based on the audio resource address, determining audio file index information based on the audio file, and constructing the audio segment index based on the audio file index information; and using the video stream manager to determine the video file based on the video resource address, determining video file index information based on the video file, and constructing the video segment index based on the video file index information.

4. The multimedia resource playback method according to claim 1, characterized in that the triggering event for the multimedia resource comprises: receiving a trigger operation submitted by the user for the multimedia resource and using the trigger operation as the trigger event; or, using a stream switching event associated with the multimedia resource as the trigger event.

5. The multimedia resource playback method according to claim 1, characterized in that the step of integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy comprises: determining the audio data packet to be integrated from the at least one audio data packet in the target audio segment, and determining the video data packet to be integrated from the at least one video data packet in the target video segment; comparing, according to the data packet integration strategy, the timestamps of the audio data packet to be integrated and the video data packet to be integrated, so as to determine the target data packet and the data packet to be compared from among them; and, in the case where the data packet to be compared is of the audio type, determining a first video data packet from the at least one video data packet, using the first video data packet as the video data packet to be integrated, and comparing its timestamp with that of the audio data packet to be integrated, until the data packet sequence is obtained.

6. The multimedia resource playback method according to claim 5, characterized in that, after determining the target data packet and the data packet to be compared from the audio data packet to be integrated and the video data packet to be integrated, the method further comprises: storing the target data packet in the data packet sequence, and determining the stream index of the target data packet; and generating a logical index for the target data packet based on the stream index.

7. The multimedia resource playback method according to claim 1, characterized in that, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the method further comprises: receiving a switching instruction for the data packet sequence, and parsing the switching instruction to obtain switching information; determining the target audio file and the target video file of the multimedia resource based on the switching information and the segment index, using the target audio file as the audio file, and using the target video file as the video file; and performing the steps of determining the target audio segment in the audio file contained in the multimedia resource based on the segment index of the multimedia resource, and determining the target video segment in the video file contained in the multimedia resource.

8. The multimedia resource playback method according to claim 1, characterized in that, after integrating at least one audio data packet from the target audio segment and at least one video data packet from the target video segment into a data packet sequence according to the data packet integration strategy, the method further comprises: determining a target video frame in the target video segment, and encapsulating the decoding parameters of the target video segment into the video frame data structure corresponding to the target video frame.

9. The multimedia resource playback method according to claim 1, characterized in that the step of executing the playback task of the multimedia resource according to the data packet sequence comprises: generating a rendering event based on the data packet sequence, and adding the rendering event to the event queue; and, when the rendering event is executed through the rendering thread, executing the playback task associated with the multimedia resource corresponding to the rendering event, and updating the video icon on the playback page.

10. A computing device, characterized in that it comprises a memory and a processor; the memory is used to store a computer program / instructions, and the processor is used to execute the computer program / instructions, which, when executed by the processor, implement the steps of the multimedia resource playback method according to any one of claims 1 to 9.

11. A computer-readable storage medium storing a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the multimedia resource playback method according to any one of claims 1 to 9.

12. A computer program product comprising a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the multimedia resource playback method according to any one of claims 1 to 9.

13. A method for storing a bitstream, comprising storing the bitstream in a storage medium, characterized in that the bitstream is generated by the multimedia resource playback method according to any one of claims 1 to 9.

14. A method for transmitting a bitstream, comprising transmitting the bitstream, characterized in that the bitstream is generated by the multimedia resource playback method according to any one of claims 1 to 9.

15. A computer-readable storage medium storing a bitstream thereon, characterized in that the bitstream is generated by the multimedia resource playback method according to any one of claims 1 to 9.