A dual-instance processing method for the same sound source
By employing a dual-instance processing method, the main playback instance quickly initializes the interface using placeholder peak values, while the auxiliary parsing instance performs audio decoding and peak extraction in the background. Combined with high-precision clock compensation and state snapshot technology, this solves the problems of first-screen blocking during audio file loading and strong coupling between playback and visualization in existing technologies, achieving an instant response and structured recording and browsing experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAMEN XINGZONG DIGITAL TECH CO LTD
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
Figure CN122240053A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the fields of audio processing and front-end interaction technology, specifically to a dual-instance processing method for the same audio source.
Background Technology
[0002] In current web applications, such as call center agent workstations and customer service quality control systems, users frequently need to listen to long audio files containing multiple stages, such as: voice navigation in Interactive Voice Response (IVR) systems, agent A's call, transfer waiting, and agent B's call. Existing front-end audio waveform visualization solutions typically face the following problems: First-screen blocking and long waiting times: The entire audio file must be downloaded and decoded locally before the player can be rendered and the user can be allowed to click play. For large WAV files, this can result in a wait of several seconds or even tens of seconds.
Summary of the Invention
[0003] The purpose of this application is to provide a dual-instance processing method for the same sound source. The specific technical solution adopted is as follows. Firstly, a dual-instance processing method for the same sound source is provided, the method comprising: generating placeholder peak data based on the proportion of each segment's duration in the target audio source to the total duration, and initializing the front-end waveform interface based on the placeholder peak data; loading the placeholder peak data using the main playback instance, rendering and displaying the user interface, and displaying on the user interface a control that can perform at least one of the following functions: play, pause, drag; while the main playback instance renders the user interface, loading the target network resource address of the target audio source using the auxiliary parsing instance, so as to perform audio decoding and target peak data extraction on the target audio source based on the target network resource address; and re-instantiating the main playback instance using the target peak data, and playing the decoded audio data as the target instance.
[0004] Secondly, a dual-instance processing device for the same sound source is provided, the device comprising: An initialization module is used to generate placeholder peak data based on the proportion of the duration of each segment in the target audio source to the total duration, so as to initialize the front-end waveform interface based on the placeholder peak data. The interface rendering module is used to load the placeholder peak data using the main playback instance, render and display the user interface, and display a control on the user interface that can perform at least one of the following functions: play, pause, drag; The audio decoding module is used to load the target network resource address of the target audio source using an auxiliary parsing instance while the main playback instance renders the user interface, so as to perform audio decoding and target peak data extraction on the target audio source based on the target network resource address; The instance playback module is used to re-instantiate the main playback instance using the target peak data, and to play the decoded audio data as the target instance.
[0005] Thirdly, an electronic device is provided, comprising: a memory and at least one processor, wherein the memory stores instructions; the at least one processor invokes the instructions in the memory to cause the electronic device to execute the aforementioned dual-instance processing method for the same sound source.
[0006] Fourthly, a computer program product is provided, comprising: computer program code, which, when run on a computer, causes the computer to perform the methods described in the first aspect or any possible implementation thereof.
[0007] Fifthly, a computer-readable storage medium is provided that stores computer program code, which, when executed on a computer, causes the computer to perform the methods described in the first aspect or any possible implementation thereof.
[0008] This application has the following beneficial effects: The main playback instance focuses on user interface interaction, while the auxiliary decoding instance handles time-consuming decoding and peak extraction, avoiding main thread blocking and ensuring smooth page operation (e.g., smooth progress bar dragging). Placeholder waveforms are displayed quickly, seamlessly replaced with real data, and playback control responds in real time, meeting users' expectations for "instant feedback." The establishment of playability is completely decoupled from the generation of real waveform data, allowing users to immediately click play upon opening the page, significantly reducing waiting time. Hidden auxiliary parsing instances focus on background peak extraction, preventing complex audio decoding from significantly impacting front-end interface responsiveness. Through three strategies (thread isolation, data decoupling, and copy transmission), the performance bottleneck of loading large amounts of audio is systematically solved, making it suitable for high-concurrency scenarios requiring real-time interaction, such as online audio editors and speech recognition systems.
Attached Figure Description
[0009] To more clearly illustrate the technical solutions and advantages in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0010] Figure 1 is a schematic diagram illustrating the implementation flow of a dual-instance processing method for the same sound source provided in an embodiment of this application; Figure 2 is a schematic diagram illustrating the implementation process of a dual-instance processing and asynchronous waveform reconstruction method based on the same recording source, provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of a dual-instance processing device for the same sound source provided in an embodiment of this application; Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.
Detailed Implementation
[0011] To further illustrate the technical means and effects adopted by this application to achieve the intended inventive purpose, the following, in conjunction with the accompanying drawings and preferred embodiments, details the specific implementation, structure, features, and effects of a dual-instance processing method for the same sound source proposed in this application. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, specific features, structures, or characteristics in one or more embodiments can be combined in any suitable manner.
[0012] In the description of the embodiments of this application, unless otherwise stated, " / " means "or". For example, A / B can mean A or B. The "and / or" in the text is merely a description of the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of this application, "multiple" means two or more.
[0013] Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be construed as implying or suggesting relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
[0014] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.
[0015] The following is a standardized explanation of the terminology: Main playback instance: The audio processing instance responsible for real user interaction and front-end interface presentation.
[0016] Auxiliary parsing instance: An audio processing instance responsible for background loading, decoding, and peak extraction.
[0017] Placeholder peak: Simulated peak data used to initialize the waveform interface before the actual peak is generated.
[0018] Skeleton waveform: A placeholder peak that conveys the length of the time axis and a weak amplitude shape, used to reduce visual jumps when the real waveform is backfilled.
[0019] State snapshot: The current playback context recorded by the main playback instance before reconstruction, including time, playback status, speed, and volume.
[0020] Offset compensation time: The target playback time of the new instance, calculated by combining the media playback time with a high-precision clock.
[0021] Segmented Area: A visually interactive area mapped onto the waveform timeline based on the recording process stage.
[0022] This application provides a dual-instance processing method for the same sound source which, as shown in Figure 1, can be achieved through the following steps: Step S110: Generate placeholder peak data based on the proportion of each segment duration in the target audio source to the total duration, and initialize the front-end waveform interface based on the placeholder peak data. Here, users can trigger access to the recording playback details page (front-end interface). The front-end interface obtains the Uniform Resource Locator (URL), total duration, and segment information for the recording.
[0023] During implementation, the data initialization and placeholder rendering module can be used to obtain the recording file URL, total duration, and duration information of each segment. Based on the total duration and duration information of each segment, a placeholder peak array can be generated proportionally for quickly initializing the front-end waveform interface.
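The proportional allocation described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the `Segment` shape, the function name `buildPlaceholderPeaks`, and the choice to alternate two flat amplitudes so that segment boundaries stay visible are all assumptions made for the example.

```typescript
// Hypothetical shape of the segment info returned with the recording metadata.
interface Segment {
  label: string;
  durationSec: number;
}

/**
 * Splits `totalBars` placeholder bars across the segments in proportion to
 * durationSec / totalDurationSec. Segments alternate between two flat
 * amplitudes (an assumption of this sketch) so boundaries are visible.
 */
function buildPlaceholderPeaks(
  segments: Segment[],
  totalDurationSec: number,
  totalBars: number,
  lowAmp = 0.125,
  highAmp = 0.25,
): { peaks: Float32Array; barsPerSegment: number[] } {
  const peaks = new Float32Array(Math.max(0, totalBars)).fill(lowAmp);
  const barsPerSegment: number[] = [];
  if (totalDurationSec <= 0 || totalBars <= 0) {
    return { peaks, barsPerSegment: segments.map(() => 0) };
  }
  let elapsedSec = 0;
  let prevEdge = 0;
  segments.forEach((seg, i) => {
    elapsedSec += seg.durationSec;
    // Round the cumulative edge, not each share, so rounding errors do not add up.
    const rawEdge = Math.round((elapsedSec / totalDurationSec) * totalBars);
    const edge = Math.min(totalBars, Math.max(prevEdge, rawEdge));
    peaks.fill(i % 2 === 0 ? lowAmp : highAmp, prevEdge, edge);
    barsPerSegment.push(edge - prevEdge);
    prevEdge = edge;
  });
  return { peaks, barsPerSegment };
}
```

Rounding cumulative edges keeps the bar counts summing to at most `totalBars` even when individual shares are fractional.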
[0024] Step S120: Load the placeholder peak data using the main playback instance, render and display the user interface, and display a control on the user interface that can perform at least one of the following functions: play, pause, drag; Here, the system provides a dual-instance scheduling module: The main playback instance is responsible for the foreground user interface (UI) display and user interaction. It initially loads placeholder peak data, binds native media elements, and provides immediate play, pause, and drag-and-drop functionality. Users can use the main playback instance to click play, adjust playback speed, and drag the progress bar.
[0025] Auxiliary Instance: Hides in the background, loads the same recording source URL, silently performs audio decoding and true peak extraction, and does not participate in UI rendering.
[0026] Step S130: While the main playback instance is rendering the user interface, the auxiliary parsing instance loads the target network resource address (URL) of the target audio source, so as to perform audio decoding and target peak data extraction on the target audio source based on that address.
[0027] Step S140: Re-instantiate the main playback instance using the target peak data, and use the decoded audio data as the target instance for playback.
[0028] During implementation, the waveform asynchronous reconstruction and backfilling module can be used to receive the real peak data (target peak data) exported by the auxiliary instance and destroy the old main playback instance.
[0029] The target instance is then played on the front end, where the target instance is the audio data obtained by decoding the target audio source.
[0030] In this embodiment, the main playback instance focuses on interface interaction, while the auxiliary decoding instance handles time-consuming decoding and peak extraction, avoiding main thread blocking and ensuring smooth page flow (e.g., smooth progress bar dragging). Placeholder waveforms are displayed quickly, and real data is seamlessly replaced, with playback control responding in real time, meeting users' expectations for "instant feedback." The establishment of playability is completely decoupled from the generation of real waveform data, allowing users to immediately click play upon opening the page, significantly reducing waiting time. The hidden auxiliary parsing instance focuses on background peak extraction, preventing complex audio decoding from significantly impacting the front-end interface response. Through three strategies—thread isolation, data decoupling, and copy transmission—the performance bottleneck of loading large amounts of audio is systematically solved, making it suitable for high-concurrency scenarios requiring real-time interaction, such as online audio editors and speech recognition systems.
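Steps S120 to S140 can be outlined as a small orchestration routine. This is a sketch under stated assumptions: `WaveformPlayer`, `DualInstanceDeps`, and `loadWithDualInstances` are hypothetical names, and the actual player and decoder are injected rather than tied to any particular waveform library.

```typescript
// Hypothetical minimal player surface; a real waveform library would sit behind it.
interface WaveformPlayer {
  destroy(): void;
}

// Hypothetical dependencies, injected so the flow is independent of any library.
interface DualInstanceDeps {
  // Creates a visible main playback instance from a peak array.
  createMainInstance(peaks: Float32Array): WaveformPlayer;
  // Auxiliary instance: loads the same URL off-screen and resolves with real peaks.
  parseInBackground(url: string): Promise<Float32Array>;
}

async function loadWithDualInstances(
  url: string,
  placeholderPeaks: Float32Array,
  deps: DualInstanceDeps,
): Promise<WaveformPlayer> {
  // S120: the main instance is interactive immediately, using placeholder peaks.
  let main = deps.createMainInstance(placeholderPeaks);
  // S130: background decoding starts without blocking the main instance.
  const realPeaks = await deps.parseInBackground(url);
  // S140: destroy the old instance and rebuild it with the real peaks.
  main.destroy();
  main = deps.createMainInstance(realPeaks);
  return main;
}
```

State restoration after the rebuild is deliberately left out here, since it is a separate concern from the load sequence.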
[0031] This application also provides a method for determining the playback starting point of a target instance, which can be achieved through the following steps: Step A: In response to the callback trigger of the auxiliary parsing instance, record the current media playback timestamp and use the system's high-precision clock to record the first moment. During implementation, a high-precision clock compensation and state injection module can be used to introduce a monotonic high-precision clock that participates in playback offset compensation during the destruction and reconstruction of the main playback instance. The media playback timestamp represents the audio/video playback progress, i.e., the current playback position in seconds, such as 15.032 seconds. At the instant the auxiliary instance callback is triggered, the system records the current media playback timestamp t_media and the first moment, T_start, read from the system's high-precision clock.
[0032] Step B: When the target instance has finished loading and reached a playable state, record the second moment using the system's high-precision clock; When the new instance is mounted and reaches a playable state, the second moment, T_end, is recorded using the system's high-precision clock.
[0033] Step C: Determine the playback starting point of the target instance based on the current media playback timestamp and the time difference between the first and second moments.
[0034] During implementation, the target playback starting point after compensation can be calculated using the following formula (1):
$$t_{target} = t_{media} + (T_{end} - T_{start}) \tag{1}$$
where t_media is the media playback timestamp, T_start is the moment recorded by the system's high-precision clock at the instant the auxiliary instance callback is triggered (first moment), and T_end is the moment recorded by the system's high-precision clock when the new instance completes mounting and reaches a playable state (second moment).
[0035] In this embodiment, a monotonic high-precision clock and system time difference are introduced into the instance reconstruction link, so that the new instance is automatically aligned to the theoretical continuous playback position after reconstruction, effectively avoiding repeated playback or stuttering.
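The compensation in Steps A to C can be sketched as a small pure function. The sketch assumes the clock readings come from a monotonic millisecond clock such as `performance.now()` while the media timestamp is in seconds; the `wasPlaying` flag and the clamping to the recording duration are assumptions added for the example, and no playback-rate scaling is applied because the text does not describe one.

```typescript
/**
 * Computes the compensated start time for the rebuilt instance.
 *
 * tMediaSec  - media playback timestamp (seconds) when the auxiliary callback fired
 * tStartMs   - monotonic clock reading (milliseconds) at that same instant
 * tEndMs     - monotonic clock reading (milliseconds) when the new instance is playable
 * wasPlaying - assumption of this sketch: only advance the position if audio was playing
 * durationSec - assumption of this sketch: clamp so the target never passes the end
 */
function compensatedStartTime(
  tMediaSec: number,
  tStartMs: number,
  tEndMs: number,
  wasPlaying = true,
  durationSec = Infinity,
): number {
  // Convert the clock difference to seconds so it matches the media timestamp.
  const elapsedSec = Math.max(0, tEndMs - tStartMs) / 1000;
  const target = wasPlaying ? tMediaSec + elapsedSec : tMediaSec;
  return Math.min(Math.max(0, target), durationSec);
}
```

For instance, a rebuild that takes 250 ms while playback sits at 15.032 s yields a target of roughly 15.282 s, so the new instance resumes where uninterrupted playback would have been.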
[0036] Before the auxiliary parsing instance completes audio decoding and sends the decoded audio source data back to the main playback instance, this application embodiment provides a method for obtaining the state context, which can be implemented through the following process: The main playback instance captures and maintains its current state context in real time, wherein the current state context includes playback status, playback rate, and volume.
[0037] During implementation, the state snapshot and persistence module can be used to capture and save the current state context of the main playback instance in real time before the auxiliary instance completes parsing and prepares to send back data. This includes the current playback time (currentTime), playback status (isPlaying), playback speed (playbackRate), and volume (volume).
[0038] Correspondingly, step S140 above, "execute playback using the decoded audio data as the target instance," can be achieved through the following steps: Step 141: Inject the current state context and the playback start point of the target instance into the target instance so that the target instance is aligned to the continuous playback position; During implementation, a high-precision clock compensation and state injection module can be used to introduce a playback starting point determined by playing offset compensation based on a monotonic high-precision clock during the destruction and reconstruction of the main playback instance.
[0039] The system injects t_target together with isPlaying, playbackRate, and volume into the new instance, so that the new instance is automatically aligned to the theoretical continuous playback position after reconstruction, instead of simply writing back the old time point.
[0040] After the new instance is created, the system performs state injection in the following order: first set peak data, then set target time, then restore speed and volume, and finally decide whether to automatically continue playback based on isPlaying. This reduces the sense of audio-visual disconnect caused by instance reconstruction.
[0041] Step 142: Play the target instance of the completed parameter injection.
[0042] In this embodiment, a monotonic high-precision clock is introduced into the instance reconstruction link. Combined with the state snapshot (current state context), the target time after offset compensation is formed, realizing the continuous recovery of playback progress, playback status, speed, and volume. During the process of real waveform backfilling and main playback instance reconstruction, the user's playback time, playback status, volume, and speed are accurately maintained, eliminating visual and auditory interruptions.
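The snapshot fields and the injection order described above (peaks, then target time, then speed and volume, then the decision to resume) can be sketched as follows. `InjectablePlayer` and its method names are hypothetical stand-ins for whatever player API is actually used.

```typescript
// Fields captured from the main playback instance before it is destroyed.
interface PlaybackSnapshot {
  currentTime: number;
  isPlaying: boolean;
  playbackRate: number;
  volume: number;
}

// Hypothetical minimal surface of the rebuilt player instance.
interface InjectablePlayer {
  setPeaks(peaks: Float32Array): void;
  seekTo(seconds: number): void;
  setPlaybackRate(rate: number): void;
  setVolume(volume: number): void;
  play(): void;
}

/** Restores state in the order given in the text. */
function injectState(
  player: InjectablePlayer,
  realPeaks: Float32Array,
  targetTimeSec: number,
  snapshot: PlaybackSnapshot,
): void {
  player.setPeaks(realPeaks); // 1. peak data
  player.seekTo(targetTimeSec); // 2. compensated target time
  player.setPlaybackRate(snapshot.playbackRate); // 3. speed
  player.setVolume(snapshot.volume); //    and volume
  if (snapshot.isPlaying) {
    player.play(); // 4. resume only if playback was active
  }
}
```

Seeking before calling `play()` means the first audible frame already comes from the compensated position rather than from time zero.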
[0043] In some embodiments, after "the auxiliary parsing instance performs audio decoding and target peak data extraction on the target audio source using the target network resource address" in step S130, this application embodiment also provides an off-screen parsing and memory reclamation method, which can be implemented through the following steps: Step A: Save the target peak data as an array of 32-bit single-precision floating-point numbers or an equivalent contiguous memory structure. During implementation, the generated target peak data is stored as an array of 32-bit single-precision floating-point numbers (Float32Array) or an equivalent contiguous memory structure.
[0044] Step B: Send the target peak data back to the main thread through a transferable object or an equivalent ownership transfer mechanism; During implementation, the peak array is sent back to the main thread through transferable objects or an equivalent ownership transfer mechanism to avoid the extra memory usage caused by copying large arrays between threads.
[0045] Step C: After the target peak data is successfully exported, actively release the audio cache object, the context of decoding the target sound source, and the intermediate cache object in the auxiliary parsing instance.
[0046] During implementation, after the peak data is successfully exported, the audio buffer object, decoding context and intermediate buffer object held in the auxiliary parsing instance are actively released, so that the browser's garbage collection mechanism can reclaim memory in a timely manner and reduce the risk of crash in multi-window concurrent playback scenarios.
[0047] In this embodiment of the application, for large-capacity audio files, the auxiliary parsing instance can run in a Web Worker or an equivalent background execution context. After completing decoding and peak extraction, the peak array is sent back to the main thread through a message mechanism, so that decoding and peak extraction do not continuously block the main thread.
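Steps A and B can be illustrated with a peak reducer and a message builder. The bucketing strategy (maximum absolute sample per bucket) and the names `extractPeaks` and `buildPeakMessage` are assumptions of this sketch; the text only requires a contiguous 32-bit float structure and an ownership-transfer mechanism.

```typescript
/** Reduces decoded PCM samples to `bucketCount` peaks (max absolute value per bucket). */
function extractPeaks(samples: Float32Array, bucketCount: number): Float32Array {
  const peaks = new Float32Array(Math.max(0, bucketCount));
  if (samples.length === 0 || bucketCount <= 0) return peaks;
  const bucketSize = samples.length / bucketCount;
  for (let b = 0; b < bucketCount; b++) {
    const start = Math.floor(b * bucketSize);
    // Every bucket covers at least one sample, even when there are more buckets than samples.
    const end = Math.max(start + 1, Math.floor((b + 1) * bucketSize));
    let max = 0;
    for (let i = start; i < end && i < samples.length; i++) {
      const v = Math.abs(samples[i]);
      if (v > max) max = v;
    }
    peaks[b] = max;
  }
  return peaks;
}

/**
 * Builds the message and the transfer list for sending peaks to the main thread.
 * Listing peaks.buffer in the transfer list moves ownership instead of copying.
 */
function buildPeakMessage(peaks: Float32Array): {
  message: { type: "peaks"; peaks: Float32Array };
  transfer: ArrayBufferLike[];
} {
  return { message: { type: "peaks", peaks }, transfer: [peaks.buffer] };
}
```

Inside a Web Worker, the result would typically be sent with `postMessage(message, transfer)`. After the transfer the worker's copy of the buffer is detached, and the worker can then drop its references to the decoded audio and intermediate buffers, as Step C describes.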
[0048] In some embodiments, the "initializing the front-end waveform interface based on the placeholder peak data" in step S110 above can be achieved through the following steps: Step A: Based on the placeholder peak data and the canvas width of the waveform interface, estimate the number of pixels to be drawn in the waveform, and initialize the front-end waveform interface based on the number of pixels. During implementation, the sampling rate f_s obtained from the metadata of the recording file, the total duration D, and the current waveform canvas width W can be used to estimate the number of pixels required for waveform drawing, and the front-end waveform interface is initialized based on the number of pixels.
[0049] Step B: Generate a pseudo-random sequence that can form fluctuations based on a preset amplitude range, to simulate the visual characteristics of silent background noise or a low-energy range. Step C: Use the pseudo-random sequence as the initial placeholder peak value of the main playback instance.
[0050] During implementation, this pseudo-random sequence is used as the initial placeholder peak of the main playback instance, so that users can still perceive the time axis length and the main outline of the waveform before the actual peak has been extracted.
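Steps A to C can be sketched as follows. The text does not give the exact estimate, so the rule used here (as many bars as fit the canvas at a fixed bar width, capped by the number of samples) is an assumption, as are the function names, the default amplitude range, and the use of the small mulberry32 generator to keep the skeleton stable across re-renders.

```typescript
/** Small deterministic PRNG (mulberry32); returns values in [0, 1). */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Assumed estimate: bars that fit the canvas, never more than the sample count. */
function estimateBarCount(
  sampleRateHz: number, // f_s
  durationSec: number, // D
  canvasWidthPx: number, // W
  barWidthPx = 2, // hypothetical bar width
): number {
  const totalSamples = Math.floor(sampleRateHz * durationSec);
  const barsThatFit = Math.floor(canvasWidthPx / barWidthPx);
  return Math.max(1, Math.min(barsThatFit, totalSamples));
}

/** Low-amplitude pseudo-random peaks that mimic quiet background noise. */
function generateSkeletonPeaks(
  count: number,
  minAmp = 0.02,
  maxAmp = 0.12,
  seed = 1,
): Float32Array {
  const rand = mulberry32(seed);
  const peaks = new Float32Array(Math.max(0, count));
  for (let i = 0; i < peaks.length; i++) {
    peaks[i] = minAmp + rand() * (maxAmp - minAmp);
  }
  return peaks;
}
```

A fixed seed is a design choice for this sketch: it keeps the skeleton from flickering if the placeholder is regenerated, for example after a window resize.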
[0051] In some embodiments, when the main playback instance is initialized, the placeholder peak can be a skeleton waveform with animation effects to alleviate the visual waiting feeling during the generation of the real waveform.
[0052] In this embodiment, a skeleton waveform generated based on recording metadata and a pseudo-random amplitude sequence is used as a waiting state, and a gradual transition is performed during the real peak backfilling to improve visual continuity. Instead of using straight lines, a skeleton waveform is generated, which effectively reduces the visual abruptness when switching between the placeholder waveform and the real waveform.
[0053] In some embodiments, the step S140 above, "re-instantiating the main playback instance using the target peak data," can be achieved through the following process: The main playback instance is re-instantiated by performing a transparency gradient switch by requesting an animation frame or an equivalent frame-by-frame rendering mechanism to switch the placeholder peak data to the target peak data.
[0054] Here, the requestAnimationFrame (rAF) is a native Application Programming Interface (API) provided by the browser for high-performance frame-by-frame animation rendering. It is a core mechanism of modern web animation, offering more precise frame synchronization, lower power consumption, and better performance optimization. The equivalent frame-by-frame rendering mechanism refers to simulating frame-by-frame animation effects when the browser's native requestAnimationFrame cannot be used. Its core goal is to complete state updates and redraws within each frame (approximately 16.67ms) to maintain animation smoothness.
[0055] In this embodiment of the application, when the real peak is backfilled, the transparency gradient switch is executed by requesting animation frames or an equivalent frame-by-frame rendering mechanism, so that the skeleton waveform fades out smoothly and the real waveform fades in smoothly, reducing visual jumps.
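The fade-out and fade-in can be sketched with a linear opacity ramp driven by an injected frame scheduler. Linear easing, the function names, and the scheduler parameter are assumptions of this sketch; injecting the scheduler lets the same code run on `requestAnimationFrame` or on an equivalent frame-by-frame mechanism.

```typescript
// Anything that calls back once per frame with a timestamp in milliseconds.
type FrameScheduler = (callback: (timestampMs: number) => void) => void;

interface CrossfadeOpacities {
  skeleton: number; // opacity of the placeholder waveform
  real: number; // opacity of the real waveform
}

/** Linear crossfade: skeleton goes 1 -> 0 while real goes 0 -> 1. */
function crossfadeOpacities(elapsedMs: number, durationMs: number): CrossfadeOpacities {
  const p = durationMs <= 0 ? 1 : Math.min(1, Math.max(0, elapsedMs / durationMs));
  return { skeleton: 1 - p, real: p };
}

/** Applies the crossfade once per scheduled frame until the real waveform is opaque. */
function runCrossfade(
  schedule: FrameScheduler,
  durationMs: number,
  apply: (opacities: CrossfadeOpacities) => void,
  onDone?: () => void,
): void {
  let startMs: number | undefined;
  const step = (nowMs: number): void => {
    if (startMs === undefined) startMs = nowMs;
    const opacities = crossfadeOpacities(nowMs - startMs, durationMs);
    apply(opacities);
    if (opacities.real < 1) {
      schedule(step);
    } else if (onDone) {
      onDone();
    }
  };
  schedule(step);
}
```

In a browser, the scheduler could be `(cb) => requestAnimationFrame(cb)`, with `apply` writing the two opacities to the skeleton and real waveform layers.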
[0056] This application also provides a method for determining the effective duration of the last segment, which can be implemented through the following process: If the total duration of the target audio source is not equal to the sum of the durations of each segment, the duration of the last segment is corrected based on the difference between the total duration and the sum of the durations of the other segments except the last segment, and the corrected duration is determined as the effective duration of the last segment.
[0057] To address the potential slight error between the sum of the segmented durations asynchronously returned by the server and the total recording duration, this application further introduces a dynamic tail duration calibration mechanism.
[0058] If the total duration of the target audio source is not equal to the sum of the durations of each segment, then the sum of the durations of all segments except the last can be subtracted from the total duration to obtain the effective duration of the last segment. Let the durations of the segments be d_1, d_2, ..., d_n and the total recording duration be D_total. The effective duration of the last segment is then dynamically calculated using the following formula (2):
$$d_n^{eff} = D_{total} - \sum_{i=1}^{n-1} d_i \tag{2}$$
This calculation method allows the system to automatically eliminate accumulated errors, ensuring that the last area is aligned with the actual end of the recording.
[0059] In this embodiment, business logic segments are combined with the physical waveform timeline. When the recorded segment durations do not add up to the total duration, the duration of the last segment is dynamically calculated by subtracting the cumulative duration of the preceding segments from the total recording duration, so as to ensure that the segmented area is precisely aligned with the waveform timeline.
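The tail calibration can be sketched in a few lines. The function name and the clamp to zero (for the case where the preceding segments already exceed the total) are assumptions of this example.

```typescript
/**
 * Returns the segment durations with the last one replaced by
 * totalDurationSec minus the sum of all preceding durations.
 */
function calibrateSegmentDurations(
  durationsSec: number[],
  totalDurationSec: number,
): number[] {
  if (durationsSec.length === 0) return [];
  const head = durationsSec.slice(0, -1);
  const headSum = head.reduce((sum, d) => sum + d, 0);
  // Assumption of this sketch: never report a negative tail duration.
  const lastEffective = Math.max(0, totalDurationSec - headSum);
  return head.concat(lastEffective);
}
```

For example, segments of 12 s, 45.5 s, and 30 s against a total of 88.2 s give a corrected last segment of about 30.7 s, so the final region ends exactly at the end of the recording.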
[0060] This application also provides a method for displaying target segments, which can be implemented through the following process: The main playback instance establishes highlight detection logic based on playback time intervals. The highlight detection logic is used to trigger the injection of style variables or equivalent rendering update of the user interface when the current playback time of the main playback instance enters any target segment interval, so as to achieve the highlight display of the target segment.
[0061] In this embodiment, the front-end (main playback instance) establishes highlight detection logic based on playback time intervals. When the currentTime of the main playback instance enters any segment interval [T_start, T_end], the UI-layer style variable injection or equivalent rendering update is triggered, so that segmented area highlighting is achieved without frequently performing additional Document Object Model (DOM) reflow operations for each area.
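The interval check can be sketched as a lookup plus a tracker that only fires when the active segment changes, which is one way to avoid repeated UI updates on every progress tick. The names and the half-open interval convention are assumptions of this sketch.

```typescript
interface TimeRange {
  startSec: number;
  endSec: number;
}

/** Converts per-segment durations into consecutive time ranges starting at 0. */
function toRanges(durationsSec: number[]): TimeRange[] {
  let cursor = 0;
  return durationsSec.map((d) => {
    const range = { startSec: cursor, endSec: cursor + d };
    cursor += d;
    return range;
  });
}

/** Index of the range containing the time (start inclusive, end exclusive), or -1. */
function findActiveSegment(ranges: TimeRange[], currentTimeSec: number): number {
  for (let i = 0; i < ranges.length; i++) {
    if (currentTimeSec >= ranges[i].startSec && currentTimeSec < ranges[i].endSec) {
      return i;
    }
  }
  return -1;
}

/** Returns a progress handler that calls onChange only when the active segment changes. */
function createHighlightTracker(
  ranges: TimeRange[],
  onChange: (activeIndex: number) => void,
): (currentTimeSec: number) => void {
  let active = -1;
  return (currentTimeSec: number): void => {
    const next = findActiveSegment(ranges, currentTimeSec);
    if (next !== active) {
      active = next;
      onChange(next);
    }
  };
}
```

In a browser, `onChange` might set a single CSS custom property on the waveform container with `element.style.setProperty(...)`, leaving the stylesheet to decide which region is highlighted.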
[0062] This application also provides a method for identifying a target interactive area, which can be implemented through the following steps: Step A: Generate multiple interactive regions on the waveform diagram based on the duration of each segment in the target audio source; Step B: Listen for the playback progress event of the target instance; Step C: If the playback progress event determines that the target interaction area has been entered, identify the target interaction area.
[0063] During implementation, the segmented linkage and highlighting module can be utilized: based on the duration of each recording segment, multiple interactive areas are generated and superimposed on the waveform.
[0064] Listen for playback progress events on the main playback instance. When the progress enters a certain area, the current stage is automatically highlighted.
[0065] In this embodiment, waveforms are bound to segmented business data to generate interactive segmented regions that support stage-by-stage highlighting, significantly improving quality inspection and playback efficiency.
[0066] This application also provides a method for jumping to a clicked location for playback, which can be achieved through the following steps: Step A: Listen for click events in the interactive areas; Step B: Based on the click location that triggers the click event in an area, obtain the cumulative duration of the segments preceding the click location, and determine the absolute time point corresponding to the click location based on the cumulative duration; Step C: Based on the absolute time point, drive the target instance to jump to the corresponding click position for playback.
[0067] During implementation, the system listens for click events in the interactive areas, calculates the absolute time point based on the cumulative duration of the preceding segments, and drives the main playback instance to jump to the corresponding position.
[0068] The real waveform and the segment-highlighted interface are then displayed to the user.
[0069] In this embodiment, waveforms are bound to segmented business data to generate interactive segmented areas, supporting precise click-to-jump functionality and significantly improving quality inspection and playback efficiency.
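The mapping from a clicked region to an absolute time point can be sketched as follows. The function name and the optional `fractionWithinSegment` parameter (0 jumps to the start of the clicked region) are assumptions of this example; the text itself only specifies using the cumulative duration of the preceding segments.

```typescript
/**
 * Maps a click inside segment `segmentIndex` to an absolute time in seconds.
 * fractionWithinSegment is the horizontal click position inside the region,
 * from 0 (left edge) to 1 (right edge); it is an assumption of this sketch.
 */
function clickToAbsoluteTime(
  durationsSec: number[],
  segmentIndex: number,
  fractionWithinSegment = 0,
): number {
  if (
    !Number.isInteger(segmentIndex) ||
    segmentIndex < 0 ||
    segmentIndex >= durationsSec.length
  ) {
    throw new RangeError("segmentIndex is outside the segment list");
  }
  // Cumulative duration of all segments before the clicked one.
  let precedingSec = 0;
  for (let i = 0; i < segmentIndex; i++) {
    precedingSec += durationsSec[i];
  }
  const fraction = Math.min(1, Math.max(0, fractionWithinSegment));
  return precedingSec + fraction * durationsSec[segmentIndex];
}
```

The returned value would then be handed to the player's seek call, for example to skip the IVR stage and start at the first agent segment.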
[0070] In current web applications, such as call center agent workstations and customer service quality inspection systems, users frequently need to listen to long audio files containing multiple stages, such as: IVR voice navigation, Agent A's call, transfer waiting, and Agent B's call. Existing front-end audio waveform visualization solutions typically face the following problems: 1. First-screen blocking and long waiting time: The player must wait for the entire audio file to download and complete local decoding before extracting the complete waveform peaks and rendering the player to allow the user to click play. For large WAV files, this can result in a wait of several seconds or even tens of seconds.
[0071] 2. Playback and visualization are strongly coupled: if only the native <audio> tag is used to start playback first, users will not be able to see the waveform and will find it difficult to perceive the progress and structure; if waveform rendering is forcibly inserted in the middle, it will often cause the current playback to be interrupted and the progress to be lost.
[0072] 3. Lack of structured segmented navigation: Existing waveform displays usually only reflect volume fluctuations and cannot intuitively express the boundaries of different business stages in a call recording. Users find it difficult to quickly locate specific stages, such as skipping the IVR and directly listening to the agent's call.
[0073] The embodiments of this application aim to overcome the defect of strong coupling between "playback preparation" and "waveform generation" in existing front-end recording and parsing schemes. Figure 2 is a schematic diagram of the implementation process of a dual-instance processing and asynchronous waveform reconstruction method based on the same recording source. The following steps can be carried out by the core modules of the system (data initialization and placeholder rendering module, dual-instance scheduling module, state snapshot and persistence module, high-precision clock compensation and state injection module, waveform asynchronous reconstruction and backfilling module, segmented linkage and highlighting module): Step 1: Pass in the placeholder peaks and URL for rapid initialization.
[0074] Here, users can trigger access to the recording playback details page (front-end interface), where the front-end interface obtains the recording URL, total duration, and segment information.
[0075] During implementation, the data initialization and placeholder rendering module can be used to obtain the recording file URL, total duration, and duration information of each segment. Based on the total duration and duration information of each segment, a placeholder peak array can be generated proportionally for quickly initializing the front-end waveform interface.
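The proportional generation of the placeholder array can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the names `RecordingSegment` and `buildPlaceholderPeaks`, the caller-chosen point count, and the flat amplitude of 0.05 are all hypothetical; the source only states that the array is generated in proportion to the segment durations.

```typescript
// Hypothetical shape of one recording segment (names are illustrative).
interface RecordingSegment {
  id: string;
  durationSec: number;
}

// Builds a flat placeholder peak array and records, for each segment, the
// point index at which it starts. Each segment's share of the point axis is
// proportional to its share of the total duration.
function buildPlaceholderPeaks(
  segments: RecordingSegment[],
  totalSec: number,
  pointCount: number,
  amplitude = 0.05, // assumed flat level for the placeholder waveform
): { peaks: Float32Array; segmentStartIndex: number[] } {
  if (totalSec <= 0 || pointCount <= 0) {
    throw new RangeError("totalSec and pointCount must be positive");
  }
  const peaks = new Float32Array(pointCount).fill(amplitude);
  const segmentStartIndex: number[] = [];
  let elapsedSec = 0;
  for (const segment of segments) {
    // Map the segment's start time onto the point axis.
    const start = Math.round((elapsedSec / totalSec) * pointCount);
    segmentStartIndex.push(Math.min(pointCount - 1, start));
    elapsedSec += segment.durationSec;
  }
  return { peaks, segmentStartIndex };
}
```

Because the array depends only on durations, it can be built before any audio byte has been downloaded, which is what lets the main playback instance render immediately.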
[0076] Step 2: Pass in the recording URL for background loading.
[0077] During implementation, the front-end interface passes in the recording URL, which is loaded by the background auxiliary parsing instance. The auxiliary parsing instance performs local decoding and extracts the real peak values.
[0078] The system provides a dual-instance scheduling module: The main playback instance is responsible for the front-end UI display and user interaction. It initially loads placeholder peak data, binds native media elements, and provides immediate play, pause, and drag (seek) functionality. Users can use the main playback instance to click play, adjust playback speed, and drag the progress bar.
[0079] Auxiliary parsing instance: hidden in the background, it loads the same recording source URL, silently performs audio decoding and real peak extraction, and does not participate in UI rendering.
[0080] Step 3: After parsing is complete, the actual peak data is sent back.
[0081] During implementation, once the auxiliary parsing instance has completed parsing the audio data, the obtained actual peak data will be sent back to the front-end interface.
[0082] Step 4: Trigger a state snapshot.
[0083] During implementation, the state snapshot and persistence module can be used to capture and save the current state context of the main playback instance in real time before the auxiliary instance completes parsing and prepares to send back data. This includes the current playback time (currentTime), playback status (isPlaying), playback speed (playbackRate), and volume (volume).
[0084] Step 5: Destroy the old instance and rebuild with the real peak value.
[0085] During implementation, the waveform asynchronous reconstruction and backfill module can be used to receive the real peak data exported by the auxiliary instance, destroy the old main playback instance, re-instantiate the main playback instance using the real peak data, and immediately inject the previously saved state snapshot and offset compensation results to resume playback.
[0086] Step 6: Inject the state snapshot and resume playback.
[0087] During implementation, the high-precision clock compensation and state injection module can be used to introduce a monotonic high-precision clock into playback offset compensation while the main playback instance is destroyed and rebuilt. The media playback timestamp is the playback progress of the audio/video, i.e., the current playback position in seconds, such as 15.032 seconds. At the moment the auxiliary instance callback is triggered, the system records the current media playback timestamp $t_{media}$ and the system high-precision clock reading $T_{start}$. When the new instance is mounted and reaches the playable state, it records $T_{end}$. On this basis, the compensated target playback starting point is calculated using the following formula (1):

$$t_{target} = t_{media} + (T_{end} - T_{start}) \tag{1}$$

where $t_{media}$ is the media playback timestamp, $T_{start}$ is the system high-precision clock time at the instant the auxiliary instance callback is triggered, and $T_{end}$ is the system time recorded when the new instance has been successfully mounted and reaches a playable state.
[0088] The system injects $t_{target}$, `isPlaying`, `playbackRate`, and `volume` into the new instance, so that after reconstruction the new instance automatically aligns to the theoretical continuous playback position instead of simply writing back the old time point. Writing back the old time point means assigning $t_{media}$ directly to the current time (currentTime) of the new instance, that is, without any compensation, simply assigning the playback position recorded before the old instance was destroyed to the new instance. The problem with this approach is that destroying the main instance, initializing the new instance, mounting it, and reaching a playable state takes a certain amount of time (tens to hundreds of milliseconds). If the old time point is written back directly, the playback progress jumps backward, and the user perceives obvious repeated playback or stuttering.
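Formula (1) can be sketched as a small pure function. This is an illustration under stated assumptions rather than the application's code: the name `compensatedTarget` is hypothetical, the media timestamp is assumed to be in seconds while the monotonic clock readings are assumed to be in milliseconds (as returned by `performance.now()` in browsers), and the clamping to the recording duration is an added safeguard not described in the source.

```typescript
// Compensated seek target following formula (1):
//   t_target = t_media + (T_end - T_start)
// Assumption: tMediaSec is in seconds; tStartMs and tEndMs are monotonic
// clock readings in milliseconds, so the rebuild gap is converted to seconds.
function compensatedTarget(
  tMediaSec: number,
  tStartMs: number,
  tEndMs: number,
  durationSec: number,
): number {
  // A monotonic clock never runs backward; the max() only guards bad input.
  const rebuildGapSec = Math.max(0, tEndMs - tStartMs) / 1000;
  // Clamp so the target never points past the end of the recording.
  return Math.min(durationSec, tMediaSec + rebuildGapSec);
}
```

For example, with a snapshot at 15.032 s and a rebuild that takes 120 ms, the new instance would be aligned to about 15.152 s rather than 15.032 s. Formula (1) adds the clock gap directly; the source does not say whether the gap is scaled by the playback rate, so no scaling is applied here.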
[0089] Step 7: Draw the interactive area based on the segmentation information.
[0090] During implementation, the segmented linkage and highlighting module can be used: based on the duration of each recording segment, multiple interactive areas are overlaid on the waveform.
[0091] Listen for playback progress events on the main playback instance. When the progress enters a certain area, the current stage is automatically highlighted.
[0092] Listen for click events on each area, calculate the absolute time point from the cumulative duration of the preceding segments, and drive the main playback instance to jump to the corresponding position.
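The area construction, progress-based highlighting, and click-to-jump logic can be sketched with three small functions. The names `SegmentRegion`, `buildRegions`, `regionAt`, and `seekTargetFor` are hypothetical, and treating each area as a half-open interval (start included, end excluded) is an assumption made so that a boundary time belongs to exactly one area.

```typescript
// One interactive area on the waveform time axis (illustrative shape).
interface SegmentRegion {
  id: string;
  startSec: number;
  endSec: number;
}

// Lays the segments end to end: each area starts at the cumulative duration
// of all preceding segments.
function buildRegions(segments: { id: string; durationSec: number }[]): SegmentRegion[] {
  const regions: SegmentRegion[] = [];
  let cursorSec = 0;
  for (const segment of segments) {
    regions.push({ id: segment.id, startSec: cursorSec, endSec: cursorSec + segment.durationSec });
    cursorSec += segment.durationSec;
  }
  return regions;
}

// Returns the area containing the given playback time, if any. A progress
// event handler would highlight the returned area.
function regionAt(regions: SegmentRegion[], timeSec: number): SegmentRegion | undefined {
  return regions.find((r) => timeSec >= r.startSec && timeSec < r.endSec);
}

// Returns the absolute time to jump to when an area is clicked: the start of
// that area, i.e. the cumulative duration of the preceding segments.
function seekTargetFor(regions: SegmentRegion[], regionId: string): number | undefined {
  return regions.find((r) => r.id === regionId)?.startSec;
}
```

With segments of 20 s, 40 s, and 1740 s, a playback time of 21 s falls in the second area, and clicking the third area yields a jump target of 60 s.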
[0093] Displays real waveforms and segmented highlighted interfaces to users.
[0094] The method provided in this application embodiment works in concert with two audio processing instances, a primary and a secondary one, to ensure instant playback for the user with zero waiting time. At the same time, it asynchronously generates a real waveform in the background and uses state snapshot technology to achieve seamless waveform backfilling and continuous restoration of the playback state after generation. In addition, it combines recording segment information to achieve intelligent linkage and highlighting of waveform areas.
[0095] The beneficial effects are as follows: 1. Lightning-fast first-screen playback response: Completely decouples the establishment of playability from the generation of real waveform data, allowing users to click to play immediately upon opening the page, significantly reducing waiting time.
[0096] 2. Seamless interactive experience: During the process of real waveform backfilling and main playback instance reconstruction, the user's playback time, playback status, volume and speed are accurately maintained, eliminating visual and auditory interruptions.
[0097] 3. Improve structured browsing efficiency: Bind waveforms to business segment data to generate interactive segmented areas, supporting stage-by-stage highlighting and precise click-to-jump, significantly improving quality inspection and playback efficiency.
[0098] 4. Reduce main thread blocking: The hidden auxiliary parsing instance focuses on background peak extraction, avoiding the significant impact of complex audio decoding on the front-end UI response.
[0099] This application provides a key sub-solution 1: state snapshot data structure and seamless injection.
[0100] To enable cross-instance switching, in a preferred implementation, state snapshots are stored as structured objects containing at least the following fields: { "mediaTime": 15.032, "monotonicStart": 24123.818, "isPlaying": true, "playbackRate": 1.5, "volume": 0.8, "currentRegionId": "queue" }. Among them, `mediaTime` records the media playback time before the old instance was destroyed, `monotonicStart` is used to compute the time difference against the high-precision clock when the instance reconstruction is completed, and `currentRegionId` is used to restore the highlight state of the current segment region after waveform reconstruction. After the new instance is created, the system performs state injection in the following order: first set the peak data, then set the target time, then restore the speed and volume, and finally decide whether to automatically continue playback based on `isPlaying`, thereby reducing the sense of audio-visual disconnect caused by instance reconstruction.
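The snapshot structure and the injection order can be sketched as follows. The field names of `PlaybackSnapshot` come from the text; everything else is an assumption for illustration: `RebuildablePlayer` is a hypothetical minimal player surface, not the API of any real waveform library, and `injectSnapshot` is a hypothetical helper name.

```typescript
// Snapshot fields as listed in the text.
interface PlaybackSnapshot {
  mediaTime: number;
  monotonicStart: number;
  isPlaying: boolean;
  playbackRate: number;
  volume: number;
  currentRegionId: string;
}

// Hypothetical minimal surface of a rebuilt main playback instance.
interface RebuildablePlayer {
  setPeaks(peaks: Float32Array): void;
  seekTo(timeSec: number): void;
  setPlaybackRate(rate: number): void;
  setVolume(volume: number): void;
  play(): void;
}

// Applies the snapshot in the order given in the text:
// peaks, target time, speed and volume, then optional resume.
function injectSnapshot(
  player: RebuildablePlayer,
  realPeaks: Float32Array,
  snapshot: PlaybackSnapshot,
  targetTimeSec: number,
): void {
  player.setPeaks(realPeaks);
  player.seekTo(targetTimeSec);
  player.setPlaybackRate(snapshot.playbackRate);
  player.setVolume(snapshot.volume);
  if (snapshot.isPlaying) {
    player.play();
  }
}
```

Resuming last means the instance only becomes audible once position, speed, and volume already match the snapshot, which is the point of the ordering.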
[0101] This application provides a key sub-solution 2: off-screen parsing and memory reclamation.
[0102] For large audio files, in a preferred implementation, the auxiliary parsing instance runs in an independent execution context, such as a Web Worker or an equivalent background parsing thread, to avoid continuous blocking of the main thread by decoding and peak extraction. This implementation includes: 1. Perform audio decoding and peak extraction on the same recording source within the background context; 2. Save the generated peak array (data) as a Float32Array or an equivalent contiguous memory structure; 3. Use Transferable Objects or an equivalent ownership transfer mechanism to send the peak array back to the main thread to avoid the extra memory usage caused by copying large arrays between threads; 4. After the peak data is successfully exported, actively release the AudioBuffer, decoding context and intermediate cache objects held in the auxiliary parsing instance, so that the browser's garbage collection mechanism can reclaim memory in a timely manner and reduce the risk of crash in multi-window concurrent playback scenarios.
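The peak extraction step in the background context can be sketched as below. This is an illustrative reduction, not the application's algorithm: the name `extractPeaks` is hypothetical, and taking the maximum absolute sample value per bucket is one common way to compute waveform peaks that the source does not specify.

```typescript
// Reduces decoded PCM samples to `bucketCount` peaks, one per bucket, using
// the maximum absolute sample value in each bucket. The result is a
// Float32Array, i.e. a contiguous buffer suitable for ownership transfer.
function extractPeaks(samples: Float32Array, bucketCount: number): Float32Array {
  if (bucketCount <= 0) {
    throw new RangeError("bucketCount must be positive");
  }
  const peaks = new Float32Array(bucketCount);
  const bucketSize = samples.length / bucketCount;
  for (let b = 0; b < bucketCount; b++) {
    const from = Math.floor(b * bucketSize);
    // Every bucket covers at least one sample position.
    const to = Math.max(from + 1, Math.floor((b + 1) * bucketSize));
    let peak = 0;
    for (let i = from; i < to && i < samples.length; i++) {
      const magnitude = Math.abs(samples[i]);
      if (magnitude > peak) peak = magnitude;
    }
    peaks[b] = peak;
  }
  return peaks;
}

// Inside a Web Worker, the result could then be handed to the main thread
// without copying by listing its buffer as a transferable:
//   self.postMessage(peaks, [peaks.buffer]);
// After the transfer the worker's view is detached, so the worker no longer
// holds the peak memory.
```

Dropping the references to the decoded samples after this call is what allows the garbage collector to reclaim the much larger decode buffer.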
[0103] This application provides a key sub-solution 3: skeleton waveform prediction rendering.
[0104] To reduce the visual abruptness when switching between the placeholder waveform and the real waveform, in a preferred embodiment a skeleton waveform is generated instead of a straight line. Specifically, this includes: 1. Based on the sampling rate $f_s$ in the audio file metadata, the total duration $D$, and the current waveform canvas width $W$, estimate the number of pixel points required to draw the waveform; 2. Generate a pseudo-random sequence with slight fluctuations according to a preset amplitude range to simulate the visual characteristics of silent background noise or a low-energy range; 3. Use this pseudo-random sequence as the initial placeholder peaks of the main playback instance so that users can still perceive the time axis length and the main outline of the waveform before the real peaks have been extracted; 4. When the real peaks are backfilled, gradually switch the transparency through requestAnimationFrame or an equivalent frame-by-frame rendering mechanism so that the skeleton waveform fades out smoothly and the real waveform fades in smoothly, reducing abrupt visual changes.
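The four steps above can be sketched with three pure functions. All names are hypothetical, and several details are assumptions the source leaves open: the point count is taken as one point per canvas pixel capped by the sample count $f_s \cdot D$, the pseudo-random sequence uses a simple seeded linear congruential generator so that the skeleton is reproducible, and the cross-fade is linear.

```typescript
// Step 1: estimate the number of points to draw. Assumption: one point per
// canvas pixel, never more than the number of samples (f_s * D).
function estimatePointCount(sampleRateHz: number, durationSec: number, canvasWidthPx: number): number {
  const totalSamples = Math.floor(sampleRateHz * durationSec);
  return Math.max(1, Math.min(Math.floor(canvasWidthPx), totalSamples));
}

// Steps 2 and 3: deterministic pseudo-random peaks within [minAmp, maxAmp].
// Uses a 32-bit linear congruential generator (Numerical Recipes constants).
function buildSkeletonPeaks(pointCount: number, minAmp: number, maxAmp: number, seed = 1): Float32Array {
  const peaks = new Float32Array(pointCount);
  let state = seed >>> 0;
  for (let i = 0; i < pointCount; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    const unit = state / 4294967296; // in [0, 1)
    peaks[i] = minAmp + unit * (maxAmp - minAmp);
  }
  return peaks;
}

// Step 4: opacities for a linear cross-fade. A requestAnimationFrame callback
// would call this once per frame with the time elapsed since the backfill.
function crossFadeOpacity(elapsedMs: number, fadeMs: number): { skeleton: number; real: number } {
  const progress = Math.min(1, Math.max(0, elapsedMs / fadeMs));
  return { skeleton: 1 - progress, real: progress };
}
```

Seeding the generator keeps the skeleton stable across re-renders of the same recording, so the placeholder does not flicker when the canvas is redrawn.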
[0105] This application provides a key sub-solution 4: dynamic tail duration calibration and segmented linkage protocol.
[0106] To address the slight error that may exist between the sum of the segment durations asynchronously returned by the server and the total recording duration, this application further introduces a dynamic tail duration calibration mechanism. Let the durations of the segments be $d_1, d_2, \ldots, d_n$ and the total recording duration be $D_{total}$. The effective duration of the last segment, written here as $d_n'$, is then dynamically calculated using the following formula (2):

$$d_n' = D_{total} - \sum_{i=1}^{n-1} d_i \tag{2}$$

This calculation automatically eliminates accumulated errors, ensuring that the last area aligns with the actual end of the recording. In addition, the front end establishes highlight detection logic based on playback time intervals: when the currentTime of the main playback instance enters any segment interval $[T_{start}, T_{end}]$, a UI-layer style variable injection or equivalent rendering update is triggered, so that the segment area is highlighted without frequent additional DOM reflow operations on each area.
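The tail calibration of formula (2) amounts to replacing the last reported duration with whatever remains of the total. A minimal sketch follows; the function name `calibrateSegmentDurations` is hypothetical, and clamping the result at zero is an added assumption for the case where the earlier segments already exceed the total, which the source does not discuss.

```typescript
// Applies formula (2): the last segment's effective duration is the total
// recording duration minus the sum of all preceding segment durations.
function calibrateSegmentDurations(durationsSec: number[], totalSec: number): number[] {
  if (durationsSec.length === 0) {
    return [];
  }
  const preceding = durationsSec.slice(0, -1);
  const precedingSum = preceding.reduce((sum, d) => sum + d, 0);
  // Assumption: never return a negative duration if earlier segments overrun.
  const effectiveLast = Math.max(0, totalSec - precedingSum);
  return [...preceding, effectiveLast];
}
```

For instance, if the server reports 20 s, 40 s, and 1739.4 s for a recording whose total duration is 1800 s, the last segment is corrected to 1740 s so that the final area ends exactly at the end of the waveform.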
[0107] This application provides an embodiment of a call center quality inspection recording playback scenario as follows: Scenario description: The quality inspector plays a 30-minute WAV recording that includes "IVR navigation (20 seconds) -> waiting in line (40 seconds) -> agent answering (29 minutes)".
[0108] The implementation steps are as follows: 1. Lightning-fast First Screen: The front-end generates a flat placeholder waveform of matching length based on the total 30-minute duration the instant the page loads. Once the main playback instance is mounted, quality inspectors can directly click the play button without waiting for file download and decoding, and the sound will immediately start playing from 0 seconds.
[0109] 2. Background parsing: Simultaneously, the auxiliary parsing instance silently downloads the WAV file in the background and performs decoding calculations.
[0110] 3. State Snapshot and Offset Compensation: Assume that the background waveform parsing completes at the 15th second of playback. The system immediately records the current state: currentTime=15s, isPlaying=true, playbackRate=1.5x, and simultaneously records the high-precision clock reading $T_{start}$. After the new instance is mounted, it records $T_{end}$. The compensated playback time is calculated using formula (1) above: $t_{target} = t_{media} + (T_{end} - T_{start})$.
[0111] 4. Seamless backfill: The system replaces the skeleton waveform or placeholder waveform with the real waveform, aligns the new instance directly to $t_{target}$, sets the playback speed to 1.5x, and continues playing. The quality inspector hears no noticeable interruption, and the waveform transitions smoothly to the real volume fluctuation graph.
[0112] 5. Segmented Calibration and Linkage: After rendering the actual waveform, the system covers the IVR area from 0 to 20 seconds, the queuing area from 20 to 60 seconds, and the call area after 60 seconds, based on the segmented data. When there is an error between the sum of the segmented durations returned by the server and the total recording duration, the system automatically performs end calibration on the last segment according to the total duration. When the playback progress reaches 21 seconds, the queuing area is automatically highlighted; if the quality inspector directly clicks on the call area, the player automatically jumps to the 60-second mark to start playback.
[0113] This method, through the collaborative work of two audio processing instances, ensures instant playback with zero waiting time for the user while asynchronously generating a real waveform in the background. After generation, it uses state snapshot technology to achieve seamless waveform backfilling and continuous restoration of the playback state. At the same time, it combines recording segmentation information to achieve intelligent linkage and highlighting of waveform areas.
[0114] The embodiments of this application provide the following technical solutions and solve the technical problems of the prior art as follows: 1. Dual-Instance Decoupled Architecture: Existing solutions typically use only a single audio instance, resulting in a strong dependency between "waiting for waveform parsing" and "allowing users to play". This patent proposes a master-slave dual-instance architecture, where the master instance quickly establishes a playable interface using placeholder peaks, while the slave instance completes the actual waveform parsing in parallel, completely breaking this dependency.
[0115] 2. Seamless cross-instance state injection mechanism based on high-precision clock compensation: Existing technologies typically only restore the old time point when updating waveform data, which is insufficient to offset the initialization time consumed during instance destruction and reconstruction. This patent introduces a monotonic high-precision clock into the instance reconstruction link, combining it with state snapshots to form the target time after offset compensation, thereby achieving continuous recovery of playback progress, playback status, speed, and volume. This is one of the core protection points of this invention.
[0116] 3. Predictive smooth transition mechanism from skeleton waveform to real waveform: Existing placeholder schemes typically use only straight lines as the waiting state, which can easily cause obvious visual jumps during switching. This scheme uses a skeleton waveform generated based on recording metadata and pseudo-random amplitude sequences as the waiting state, and performs a gradual transition when filling in the real peak, improving visual continuity.
[0117] 4. Time-axis driven segmented area supplementation and linkage: This solution combines business logic segmentation with the physical waveform time axis, and dynamically supplements the duration of the last segment by subtracting the cumulative duration of the preceding segment from the total recording duration when the recording segment duration is incomplete, ensuring that the segmented area is accurately aligned with the waveform time axis.
[0118] This application also provides the following implementation methods and corresponding technical effects: 1. Off-screen parsing threaded implementation: The auxiliary parsing instance can run in a Web Worker or an equivalent background execution context. After completing decoding and peak extraction, the peak array is sent back to the main thread through a message mechanism to further reduce the risk of main thread blocking.
[0119] 2. Zero-copy memory reclamation implementation: When the peak array is sent back, an ownership transfer mechanism is used to avoid copying large arrays; after the peak is exported, the auxiliary parsing instance actively releases the decoding buffer and intermediate objects to reduce the memory peak in long recording, multi-tab and multi-window scenarios.
[0120] 3. Skeleton Waveform Animation Implementation: When the main playback instance is initialized, the placeholder peak can use a skeleton waveform with animation effects to alleviate the visual waiting feeling during the generation of the real waveform.
[0121] 4. Permission Downgrade Mode: When the system detects that the current user only has structure viewing permission and not recording playback permission, it only initializes the auxiliary parsing instance and renders the waveform and segmented areas, while disabling the audio output and interactive events of the main playback instance to achieve "visible waveform but not audible" security control.
[0122] This application provides a dual-instance processing device for the same sound source. Referring to Figure 3, the device 300 includes: The initialization module 310 is used to generate placeholder peak data based on the proportion of the duration of each segment in the target audio source to the total duration, so as to initialize the front-end waveform interface based on the placeholder peak data. The interface rendering module 320 is used to load the placeholder peak data using the main playback instance, render and display the user interface, and display a control on the user interface that can perform at least one of the following functions: play, pause, drag; The audio decoding module 330 is used to load the target network resource address of the target audio source using an auxiliary parsing instance while the main playback instance renders the user interface, so as to perform audio decoding and target peak data extraction on the target audio source based on the target network resource address; The instance playback module 340 is used to re-instantiate the main playback instance using the target peak data, and to play the decoded audio data as the target instance.
[0123] Figure 4 is a schematic diagram of the structure of a computer device provided in an embodiment of this application. As shown in Figure 4, the computer device 400 includes: a memory 401, a processor 402, and a computer program 403 stored in the memory 401 and running on the processor 402, wherein when the processor 402 executes the computer program 403, the computer device can execute any of the aforementioned dual-instance processing methods for the same sound source.
[0124] Furthermore, this application also protects a control device, which may include a memory and a processor. The memory stores executable program code, and the processor is used to call and execute the executable program code to perform a dual-instance processing method for the same sound source provided in this application. This application can divide the control device into functional modules based on the above method example. For example, each module can correspond to a specific function, or two or more functions can be integrated into a single processing module. The integrated module can be implemented in hardware. It should be noted that the module division in this application is illustrative and only represents a logical functional division; other division methods may exist in actual implementation. It should also be noted that all relevant content of each step involved in the above method embodiment can be referenced to the functional description of the corresponding functional module, and will not be repeated here. It should be understood that the control device provided in this application is used to execute the above-mentioned dual-instance processing method for the same sound source, and therefore can achieve the same effect as the above-described implementation method. When using integrated units, the control device may include a processing module and a storage module. When the control device is applied to a block device, the processing module can be used to control and manage the actions of the block device. The storage module can be used to support block devices in executing mutual program code, etc. The processing module can be a processor or controller, which can implement or execute various exemplary logic blocks, modules, and circuits described in conjunction with the disclosure of this application. 
The processor can also be a combination of functions that implement computing capabilities, such as a combination of one or more microprocessors, a combination of digital signal processing (DSP) and microprocessors, etc., and the storage module can be a memory.
[0125] Furthermore, the control device provided in the embodiments of this application may specifically be a chip, component, or module. The chip may include a connected processor and a memory. The memory stores instructions, and when the processor calls and executes the instructions, the chip can execute the dual-instance processing method for the same sound source provided in the above embodiments. The embodiments of this application also provide a computer-readable storage medium storing computer program code. When the computer program code is run on a computer, it causes the computer to execute the aforementioned method steps to implement the dual-instance processing method for the same sound source provided in the above embodiments.
[0126] This application also provides a computer program product. When the computer program product is run on a computer, it causes the computer to execute the aforementioned related steps to achieve the dual-instance processing method for the same sound source provided in the above embodiments. The control device, computer-readable storage medium, computer program product, or chip provided in this application are all used to execute the corresponding methods provided above. Therefore, the beneficial effects they achieve can be referred to the beneficial effects in the corresponding methods provided above, and will not be repeated here. Through the description of the above embodiments, those skilled in the art can understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the control device can be divided into different functional modules to complete all or part of the functions described above. In the embodiments provided in this application, it should be understood that the disclosed control device and method can be implemented in other ways. For example, the control device embodiments described above are merely illustrative. For example, the division of modules or units is merely a logical functional division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or integrated into another control device, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interface, control device or unit, and can be electrical, mechanical or other forms.
[0127] It should be noted that the order of the embodiments described above is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. The processes depicted in the accompanying drawings do not necessarily require a specific or sequential order to achieve the desired results. In some embodiments, multiple task processing and parallel processing are possible or may be advantageous. The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. The above content is only a specific implementation of this application, but the protection scope of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the protection scope of this application.
Claims
1. A dual-instance processing method for the same sound source, characterized in that, The method includes: Based on the proportion of each segment duration in the target audio source to the total duration, placeholder peak data is generated, and the front-end waveform interface is initialized based on the placeholder peak data. The placeholder peak data is loaded using the main playback instance, and the user interface is rendered and displayed. On the user interface, a control that can perform at least one of the following functions is displayed: play, pause, drag. While the main playback instance renders the user interface, the auxiliary parsing instance loads the target network resource address of the target audio source, so as to perform audio decoding and target peak data extraction on the target audio source based on the target network resource address; The main playback instance is re-instantiated using the target peak data, and the decoded audio data is used as the target instance for playback.
2. The method as described in claim 1, characterized in that, The method further includes: In response to the callback trigger of the auxiliary parsing instance, the current media playback timestamp is recorded, and the first moment is recorded using the system's high-precision clock; When the target instance completes loading and reaches a playable state, the second moment is recorded using the system's high-precision clock. The playback starting point of the target instance is determined based on the current media playback timestamp and the time difference between the first and second moments.
3. The method as described in claim 2, characterized in that, Before the auxiliary parsing instance completes audio decoding and sends the decoded audio source data back to the main playback instance, the method further includes: The main playback instance captures and maintains its current state context in real time, wherein the current state context includes playback state, playback rate, and volume; Correspondingly, the execution involves playing the decoded audio data as the target instance, including: The current state context and the playback start point of the target instance are injected into the target instance so that the target instance is aligned to the continuous playback position; Playback completes on the target instance of the injected parameters.
4. The method as described in claim 1, characterized in that, After the auxiliary parsing instance performs audio decoding and target peak data extraction on the target audio source using the target network resource address, the method further includes: The target peak data is stored as a 32-bit single-precision floating-point number or an equivalent contiguous memory structure. The target peak data is transmitted back to the main thread via a transferable object or an equivalent ownership transfer mechanism. After the target peak data is successfully exported, the audio cache object, the context for decoding the target sound source, and the intermediate cache object in the auxiliary parsing instance are actively released.
5. The method as described in claim 1, characterized in that, The initialization of the front-end waveform interface based on the placeholder peak data includes: Based on the placeholder peak data and the canvas width of the waveform interface, the number of pixels to be drawn in the waveform is estimated, and the front-end waveform interface is initialized based on the number of pixels. Based on a preset amplitude range, a pseudo-random sequence capable of forming fluctuations is generated to simulate the visual characteristics of silent background noise or a low-energy range. The pseudo-random sequence is used as the initial placeholder peak value of the main playback instance.
6. The method as described in claim 1, characterized in that, The step of re-instantiating the main playback instance using the target peak data includes: The main playback instance is re-instantiated by using requestAnimationFrame or an equivalent frame-by-frame rendering mechanism to perform a gradual transparency switch, thereby switching the placeholder peak data to the target peak data.
7. The method according to any one of claims 1 to 6, characterized in that, The method further includes: If the total duration of the target audio source is not equal to the sum of the durations of each segment, the duration of the last segment is corrected based on the difference between the total duration and the sum of the durations of the other segments except the last segment, and the corrected duration is determined as the effective duration of the last segment.
8. The method according to any one of claims 1 to 6, characterized in that, The method further includes: The main playback instance establishes highlight detection logic based on playback time intervals. The highlight detection logic is used to trigger the injection of style variables or equivalent rendering update of the user interface when the current playback time of the main playback instance enters any target segment interval, so as to achieve the highlight display of the target segment.
9. The method according to any one of claims 1 to 6, characterized in that, The method further includes: Based on the duration of each segment in the target audio source, multiple interactive regions are generated on the waveform diagram; Listen for playback progress events of the target instance; If the playback progress event determines that the target interaction area has been entered, the target interaction area is identified.
10. The method according to any one of claims 1 to 6, characterized in that, The method further includes: Listen for click events in the designated area; Based on the click location that triggers the click event in the area, the cumulative duration of the preceding segment corresponding to the click location is obtained, and the absolute time point corresponding to the click location is determined based on the cumulative duration. Based on the absolute time point, the target instance is redirected to the corresponding clicked position for playback.