An offline continuous speech transcription and dynamic replacement method and electronic device

By employing asynchronous dual-track concurrency and multi-level dictionary intervention, the contradiction between low latency and high accuracy in edge-side offline speech recognition technology is resolved, achieving a stable, privacy-preserving, and efficient voice interaction experience in a network-free environment.

CN122201305A · Pending · Publication Date: 2026-06-12 · 王奕川


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: 王奕川
Filing Date: 2026-03-04
Publication Date: 2026-06-12


IPC: G10L15/26; G10L15/16; G10L21/0208; G10L15/22

AI Tagging

Application Domain

Speech recognition


Figure CN122201305A_ABST

Abstract

The application discloses an offline continuous speech transcription and dynamic replacement method and an electronic device, aiming to solve the inherent contradiction that edge-side speech recognition cannot achieve both low latency and high accuracy. The method runs in a purely local environment, dynamically slices the audio stream, and distributes the slices to two asynchronous tracks: the fast processing track generates an initial draft and outputs it to the display area first, eliminating visual delay; the deep inference track runs in parallel and generates a high-precision final text after permission-tiered dictionary post-processing and adjacent-slice boundary deduplication. The system then uses timestamps and screen coordinates as anchors and calls low-level instructions to seamlessly replace the draft in the display area with the final text, assisted by dynamic user interface (UI) linkage based on the computing power state. The application breaks through the edge-side computing power bottleneck, allowing extremely low visual delay and extremely high result accuracy to coexist in the interactive experience.


Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and human-computer interaction technology, and in particular to an offline continuous speech transcription and dynamic replacement method and electronic device in edge computing scenarios.

Background Technology

[0002] With the development of artificial intelligence technology, voice input has become an important human-computer interaction method for smart terminals (such as smartphones, computers, and wearable devices). Currently, mainstream continuous speech-to-text technologies mainly fall into two categories: "cloud-based recognition solutions" and "device-based offline recognition solutions." However, both of these approaches face pressing technical bottlenecks in practical applications.

[0003] Limitations of cloud-based recognition solutions: They are highly dependent on stable network connections and cannot work in weak or no network environments; moreover, user voice data must be continuously uploaded to cloud servers, posing potential risks of data leakage and privacy security, making it difficult to meet the needs of highly private scenarios.

[0004] The inherent contradiction of edge-side offline recognition solutions: In a completely offline local computing environment, there is an irreconcilable contradiction between "low latency (real-time on-screen display)" and "high accuracy" in speech recognition. Traditional lightweight offline models have streaming output capabilities, but their recognition accuracy is limited; while edge-side deep neural network models have high accuracy, their inference time is long, resulting in long periods without text output or severe lag in visual feedback on the interface.

[0005] Long audio memory overflow and the defects of hard segmentation during splicing: When processing long audio, large edge models are prone to memory overflow (OOM) and system process blocking due to feature matrix expansion. To prevent crashes, the industry typically performs forced segmentation of the audio stream. However, traditional segmentation methods inevitably sever semantics, resulting in a large number of repeated "echo" characters or incorrect sentence breaks when splicing the transcription results of adjacent segments, severely disrupting the smoothness of continuous output.

[0006] In summary, how to overcome the bottleneck of edge computing power in a network-free edge environment, provide users with low-latency visual feedback, ensure high accuracy of the final text, and effectively eliminate redundant errors and interface stagnation caused by slicing and splicing, is a core technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0007] The purpose of this invention is to provide an offline continuous speech transcription and dynamic replacement method and electronic device, which aims to solve the contradiction between "low-latency real-time feedback" and "high-precision context correction" in existing edge-side offline speech recognition technology, thereby providing a smooth and high-precision voice interaction experience in a purely local computing environment.

[0008] To achieve the above objectives, this invention provides an offline continuous speech transcription and dynamic replacement method, applied to a terminal with local computing capabilities. The method operates independently in a local environment without requiring data communication with an external server. The method includes:

[0009] Audio Acquisition and Segmentation: In response to voice input commands, acquire audio stream data locally and segment it into multiple audio segments based on dynamic threshold conditions.

[0010] Asynchronous dual-track concurrency (fast track and slow track): The audio stream data is input to a locally deployed first speech processing module to generate a first transcribed text with a first response delay and quickly output to the terminal's display area; at the same time, the data is input to a locally deployed second speech processing module to generate a high-precision second transcribed text with a second response delay greater than the first response delay.

[0011] Multi-level dictionary post-processing intervention: After slow track output, the local preset basic dictionary and user-defined dictionary are called for comparison. If the comparison matches, the dictionary mapping and forced replacement are performed according to the permission weight.

[0012] Boundary deduplication stitching: Extract the first and last character sequences of the transcribed text corresponding to adjacent audio slices, calculate the overlap, and remove redundant overlapping characters from the final continuous transcribed text.

[0013] Low-level instruction-oriented replacement: Using anchoring mechanisms such as timestamps, the portion of the display area occupied by the first transcribed text is replaced with the second transcribed text, either by a full low-level overwrite or by differential erasure.

[0014] Interactive computing power status: Real-time monitoring of backlog in the background, dynamically outputting backlog prompts or progress information on the terminal interface; and outputting a status indicator indicating completion after all slices have been processed, completing a seamless visual loop.
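As one illustration of the steps above, the multi-level dictionary intervention could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `DictionaryTier` type, the integer `weight` field, the rule that the higher weight wins a conflict, and the sample entries are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DictionaryTier:
    """One dictionary level; a larger weight means a higher permission (assumed)."""
    name: str
    weight: int
    mapping: Dict[str, str]  # recognized form -> forced replacement


def apply_dictionaries(text: str, tiers: List[DictionaryTier]) -> str:
    """Replace matched phrases; on conflict the highest-weight tier wins."""
    resolved: Dict[str, str] = {}
    # Visit tiers from low to high weight so later updates override earlier ones.
    for tier in sorted(tiers, key=lambda t: t.weight):
        resolved.update(tier.mapping)
    if not resolved:
        return text
    # Longest phrases first, so a short entry cannot split a longer one.
    phrases = sorted(resolved, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    # A single pass avoids re-matching text that was just inserted.
    return pattern.sub(lambda m: resolved[m.group(0)], text)


# Hypothetical entries: a preset basic dictionary and a user-defined one.
base = DictionaryTier("base", 1, {"cooper netties": "Kubernetes", "tensor flow": "TensorFlow"})
user = DictionaryTier("user", 10, {"tensor flow": "TensorFlow Lite"})
print(apply_dictionaries("deploy cooper netties on tensor flow", [base, user]))
# prints: deploy Kubernetes on TensorFlow Lite
```

Resolving conflicts before substitution keeps the result independent of the order in which the tiers are passed in.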

[0015] Furthermore, the present invention also provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores computer instructions executable by the at least one processor, and when the computer instructions are executed, the electronic device performs the aforementioned offline continuous speech transcription and dynamic replacement method. This electronic device may be a general-purpose intelligent mobile terminal (reusing a local CPU/NPU), or an intelligent microphone peripheral with an independent processing chip and physical privacy isolation features.

[0016] Beneficial effects

[0017] Compared with the prior art, the method and electronic device provided in the embodiments of the present invention have the following significant advantages:

[0018] Breaking through the inherent contradiction between "computing power and user experience," this approach reconstructs the interaction paradigm: through an "asynchronous dual-track transcription and dynamic visual replacement" mechanism, a small model provides a low-latency "streaming draft" for on-screen display, while a large model performs "asynchronous precise post-processing" in the background and executes seamless replacement. This cleverly utilizes the time difference in human eye reading to achieve a smooth interactive experience with "extremely low visual latency and extremely high result accuracy."

[0019] Eliminates long-duration speech OOM crashes on the device side and enables sustained long-duration operation: Through a dynamic dual-threshold slicing mechanism and lean memory relay scheduling, background memory usage is largely decoupled from the total recording time, enabling the terminal device to stably support uninterrupted transcription for up to several hours in a completely offline state.

[0020] Extremely high data privacy and 24 / 7 availability: The entire process is completed in a closed loop within a local physical sandbox, effectively avoiding the risk of cloud leakage of sensitive voice data and ensuring that the device can still operate normally in environments without or with weak network coverage.

[0021] Effectively eliminates the "slicing echo" caused by slicing: Introduces FuzzyMatch adjacent slice boundary deduplication logic to automatically remove redundant and overlapping characters, effectively bridging the semantic gaps caused by VAD or forced segmentation.

[0022] Waterfall-style dictionary intervention with hierarchical permissions: It innovatively constructs a hierarchical intervention mechanism of a basic whitelist and VIP hot words, which compensates for the recognition bias of quantized large models on proper nouns and prevents rare-word dictionary entries from causing large-scale misreplacement of high-frequency words.

[0023] Provides a dynamic interactive closed loop with state awareness: Through the computing power dashboard and progress prompts that are linked to hardware and software, the computing latency of the underlying black box is transformed into flexible and transparent UI visual feedback, which effectively avoids user misoperation and excessive waiting caused by unknown state.

Attached Figure Description

[0024] The above aspects and advantages of the present invention will become apparent and readily understood from the description of specific embodiments taken in conjunction with the following drawings, in which:

[0025] Figure 1 is a schematic diagram of the macroscopic architecture and core data flow of an offline continuous speech transcription and dynamic replacement method provided by an embodiment of the present invention. The diagram shows the entire execution logic, from audio acquisition, through central processing unit distribution and asynchronous dual-track parallel processing (resulting in processing latency differences), to the final triggering of the underlying full replacement instruction, so that the first draft in the display area makes a visually smooth transition to the final text.

Detailed Implementation

[0026] To ensure that the objectives, technical solutions, beneficial effects, and advantages of this invention are more clearly and thoroughly understood and implemented, the specific embodiments of this invention will be described in detail below with reference to the accompanying drawings. It should be understood that the accompanying drawings are only for illustrating exemplary embodiments of this invention and are not intended to limit it in any way. In all the drawings, the same reference numerals generally denote the same or similar components for ease of reading and understanding. It should also be understood that the logical flow or module architecture shown in the drawings is only a preferred complete link, and each sub-step or sub-module can be independently extracted or recombined according to different application scenarios to constitute an independent technical solution.

[0027] Those skilled in the art should understand that the specific embodiments described herein are intended to provide a detailed explanation of the core architecture of the present invention. The fundamental protected object of the present invention is a macro-architecture of an "asynchronous dual-track processing and dynamic visual replacement" system based on processing latency differences and memory space scheduling. Therefore, without departing from the core architecture and spirit of the present invention, any modifications, combinations, partial substitutions, or equivalent substitutions made by those skilled in the art to any specific implementation methods mentioned herein (e.g., but not limited to, replacing different segmentation algorithms, using different frameworks for the first and second speech processing modules (such as acoustic or language models), replacing the matching algorithm that extracts first and last characters for concatenation and deduplication, and using different underlying UI replacement and scheduling APIs) should all be considered to fall within the protection scope of the macro-architecture of the present invention.

[0028] It is particularly important to state that, while this invention is inclusive of the aforementioned macro-architecture, in subsequent embodiments, the specific and preferred algorithmic logics listed in this architecture (e.g., specific overlap matching algorithms for adjacent slices, dynamic weight intervention logic for multi-level dictionaries, etc.), as key technical nodes supporting the smooth interactive experience of this architecture, also possess independent inventiveness and technical value. The applicant reserves the right to apply for these specific core algorithms and processing logics as independent technical solutions.

[0029] Furthermore, the methods and systems provided in this invention are widely applicable to terminal devices with local computing capabilities. Typical current implementations include, but are not limited to, smartphones, computer devices, and microphone audio peripherals with highly integrated independent computing chips. It should be understood that with the rapid evolution of edge computing and smart hardware, various forms of smart electronic devices will inevitably emerge in the future. Regardless of how the physical appearance of these devices changes, what underlying operating system they run, or what unknown interaction scenarios they are applied to, as long as the core logic of "asynchronous dual-track transcription and dynamic visual replacement" described in this invention is deployed and executed internally, they should fall within the protection scope of this invention. Simultaneously, independent memory isolation and chip scheduling mechanisms customized for specific hardware entities (such as purely hardware-closed-loop smart microphone peripherals) also constitute independently protectable physical device solutions.

[0030] Explanation of the universality of the system architecture:

[0031] It should be noted that the core innovation of the asynchronous dual-track architecture and dynamic replacement logic proposed in this invention is essentially a "mechanism for alternating screen display and smooth visual replacement based on differences in processing latency." Therefore, this core architecture itself possesses independent universality that completely transcends the physical distribution of underlying hardware and the location of computing power deployment.

[0032] It should be understood that the core idea of this invention lies in its task scheduling and UI presentation logic in the time dimension, rather than the computing power storage point in the spatial dimension. Regardless of the deployment form of the application or system implementing this architecture—whether it is "both fast and slow tracks are deployed on local terminals," "the fast track is deployed locally while the slow track relies on the cloud," or "both fast and slow tracks are fully deployed on cloud servers (only sending streaming drafts and delayed targeted replacement instructions to local terminals)"—as long as it fully executes the time-difference scheduling workflow of "high-speed initial transcription on screen → asynchronous high-precision transcription inference → differential comparison and dynamic UI replacement" in macro-interaction, this working method essentially uses the core interaction architecture pioneered by this invention. The applicant reserves the right to apply for this asynchronous alternating interaction architecture that transcends physical deployment location limitations as an independent technical solution.

[0033] Explanation of the boundaries of the preferred embodiments:

[0034] Although the aforementioned generalized interaction architecture is fully compatible with and covers cloud or edge-cloud collaborative computing power deployment, considering the stringent requirements for absolute data security in specific high-end business application scenarios (such as confidential meetings, extreme personal privacy protection, and trade secret negotiations), any form of network connection and audio transmission node constitutes a potential data leakage exposure and cannot meet the extremely high level of physical isolation requirements.

[0035] Therefore, in order to resolve the inherent contradiction that pure edge devices cannot run large models persistently under the bottleneck of limited memory (RAM) and computing power, in a preferred embodiment of the present invention (i.e. the core protection scenario of the current claims), the method and system are strictly limited to running independently in a "pure local environment without any data communication interaction with external servers".

[0036] Throughout the entire lifecycle of this preferred embodiment, the system is completely and physically (or software-wise) cut off from data interaction with the external wide area network in any operating state. This means that the target system of this invention inherently abandons all hybrid mechanisms that rely on cloud computing power. All audio stream acquisition, dual-track model concurrent inference, multi-level dictionary post-processing, and even the destruction of segments of tens or hundreds of minutes of continuous speech and seamless UI replacement are completely and forcibly closed-loop and physically isolated within the local sandbox of the terminal device. This preferred embodiment aims to overcome the rigid dependence of existing "cloud / offline hybrid dual-mode" or "pure cloud" technologies on network communication links, and strives to establish a secure and reliable interactive experience in a completely offline, purely local environment.

[0037] Based on the above generalized interaction architecture principles and the strict definition of the preferred pure local implementation, the following sections will describe in detail the specific implementation process of the system of the present invention in a purely offline isolated environment:

[0038] Section 1 Audio Acquisition

[0039] 1.1 Triggering of Audio Acquisition and Basic Digitalization Principles

[0040] In human-computer interaction, the primary prerequisite for converting natural speech into text that electronic devices can process is converting analog sound wave physical quantities into a continuous digital stream that computers can recognize. Specifically, when a terminal receives a user's voice input command (e.g., by clicking the microphone icon on the screen UI, triggering a specific physical hardware button, or activating it via an offline wake word at the device's underlying level), the system responds by calling the terminal's local microphone array to continuously capture external analog sound wave signals.

[0041] Subsequently, the analog-to-digital converter (ADC) at the terminal's underlying layer performs high-frequency sampling, quantization, and encoding of the analog signal. In this embodiment, to achieve a better balance between the computational overhead and acoustic feature retention of the offline speech processing module, the system preferably uses a sampling rate of 16000Hz (16kHz), a bit depth of 16-bit, and a mono format to convert the analog audio into a pulse code modulation (PCM) digital audio stream. The 16kHz sampling rate can fully cover the core frequency bands of human speech, and the data volume is moderate, making it the optimal input format for most edge-side deep neural networks to extract acoustic features.
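The data volume implied by this format can be checked with simple arithmetic. The sketch below uses only the parameters stated above (16 kHz, 16-bit, mono); the function name `pcm_bytes` and the example durations are illustrative choices, not part of the original text.

```python
SAMPLE_RATE_HZ = 16_000  # samples per second
BIT_DEPTH = 16           # bits per sample
CHANNELS = 1             # mono


def pcm_bytes(duration_s: float) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(duration_s * SAMPLE_RATE_HZ * (BIT_DEPTH // 8) * CHANNELS)


print(pcm_bytes(1))           # 32000 bytes per second
print(pcm_bytes(10))          # 320000 bytes for a 10-second stretch
print(pcm_bytes(3600) / 1e6)  # 115.2 (megabytes for one hour)
```

At roughly 32 kB per second, the raw audio itself is small; the memory pressure discussed later comes mainly from model state rather than from the PCM stream.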

[0042] 1.2 Audio Stream Noise Reduction and Purification Preprocessing Strategies

[0043] Considering that users often find themselves in complex environments with high noise or strong reverberation, such as subways, streets, and conference rooms, necessary noise reduction and echo cancellation (AEC) preprocessing must be performed on the acquired digital audio stream to ensure the recognition accuracy of the subsequent asynchronous dual-track transcription engine. This invention provides an adaptive dual-track noise reduction strategy, allowing the system to flexibly schedule the following processing mechanisms based on the target terminal's hardware computing power level and underlying access permissions:

[0044] Strategy 1: Software-level noise reduction based on the operating system (OS). When the terminal is a conventional computing device, this system directly calls the standard audio interface (API) encapsulated by the terminal's underlying operating system. For example, in a development architecture based on Android or other general open-source systems, the system loads native software acoustic preprocessing modules from the OS (such as the system-level NoiseSuppression (NS) noise reduction algorithm and Acoustic Echo Canceler (AEC) echo cancellation algorithm) to acquire clean audio stream data filtered in real time by software algorithms with extremely low latency.

[0045] Strategy 2: Hardware-level noise reduction with direct access to the terminal's underlying layer (bypass-OS strategy). With the rapid iteration of edge smart hardware, some current high-performance smart terminals not only incorporate high-specification multi-microphone arrays (for sound source localization and beamforming) but also dedicated digital signal processors (DSPs) or audio-specific NPUs for processing acoustic signals. Therefore, in a preferred embodiment, if this invention is deployed on the aforementioned top-tier hardware terminal, the system's audio acquisition module will skip the conventional system application layer software noise reduction encapsulation and directly acquire the high-fidelity audio stream with noise reduction completed at the hardware physical layer by calling the underlying hardware audio interface (e.g., Hardware Abstraction Layer (HAL) or low-latency audio channel). This strategy can acquire clean speech data with a high signal-to-noise ratio, effectively freeing up the terminal's main CPU's computing resources and reserving computing power for subsequent offline inference of large deep neural network models.

[0046] Through the aforementioned acquisition and purification preprocessing mechanisms, the system establishes a continuous PCM digital audio stream channel with an extremely high signal-to-noise ratio in the terminal's local memory. This channel continuously feeds high-quality acoustic data into the memory buffer, providing a high-quality data foundation for the subsequent dynamic audio slicing and asynchronous dual-track transcription engine.

[0047] Section 2 Dynamic Slicing and Asynchronous Dual-Track Distribution Strategy for Continuous Audio

[0048] 2.1 Technical limitations of existing continuous speech transcription technology

[0049] Before delving into the core scheduling mechanism of this invention, it is necessary to objectively analyze the current state of the industry and its physical limitations in existing continuous speech transcription technologies. Currently, mainstream voice input systems in the industry all face significant technical bottlenecks when processing long, continuous speech passages:

[0050] On the one hand, in the "cloud-based large model" solution, the continuous uploading of audio streams is highly dependent on network stability. To ensure the availability of the cloud-based high-concurrency architecture, a time-limited circuit breaker mechanism is usually enforced at the interface level, which caps the content a user can output continuously in one session at 2 to 5 minutes.

[0051] On the other hand, in the few attempts to localize the model in "device-side offline solutions," the duration of a single continuous recognition session is often conservatively limited to around 2 minutes. Once this threshold is exceeded, the system will forcibly cut off the recording or cause recognition to stall. This widespread time limitation makes it impossible for existing technologies to meet the needs of long, high-frequency output scenarios such as government and enterprise meeting minutes, in-depth interviews, and lengthy business presentations, which often last for tens of minutes or even hours.

[0052] 2.2 Analysis of Memory Overflow Issues in End-Side Long Audio Inference

[0053] The reason why the aforementioned offline model on the device cannot perform continuous voice output for a long time (such as more than one hour) is mainly due to the risk of underlying physical memory (RAM) overflow and the increased processing latency.

[0054] The computational complexity of modern high-precision speech recognition models (especially large deep neural networks with attention mechanisms) increases non-linearly (e.g., quadratically) with the increase of the dimension of the input audio. If an unprocessed, extremely long audio stream is continuously fed into such a model, the system needs to maintain an extremely large acoustic feature matrix and historical context state in memory.

[0055] As recording time progresses, the model's inference speed decreases significantly, leading to increasingly long on-screen delays for the final text and severely impacting the user's interactive experience. Furthermore, when the amount of audio data exceeds the physical memory limit of edge devices such as smartphones, a serious out-of-memory (OOM) condition inevitably results. This causes the input method process to terminate abnormally, and may even cause the entire underlying operating system process to block (or become unresponsive).
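The scale of the problem can be illustrated with rough arithmetic. This is only a sketch under assumptions that are not in the original text: the cost is taken to be the number of frame pairs in a full self-attention matrix (the quadratic case mentioned above), and the frame rate of 100 acoustic frames per second is a hypothetical value.

```python
FRAMES_PER_SECOND = 100  # assumption: one acoustic frame every 10 ms


def attention_pairs(duration_s: int) -> int:
    """Number of frame pairs a full self-attention layer would compare."""
    frames = duration_s * FRAMES_PER_SECOND
    return frames * frames


ONE_HOUR_S = 3600
SLICE_S = 10  # a slice length in the range discussed for forced segmentation

whole = attention_pairs(ONE_HOUR_S)
sliced = (ONE_HOUR_S // SLICE_S) * attention_pairs(SLICE_S)

print(whole)            # 129600000000
print(sliced)           # 360000000
print(whole // sliced)  # 360
```

Under these assumptions, processing one hour as a single input costs 360 times more than processing the same hour as 10-second slices, and the cost of the sliced approach grows only linearly with recording time.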

[0056] 2.3 Optimal Implementation Mechanism for Dynamic Audio Slicing

[0057] To effectively address the aforementioned technical bottlenecks, this invention introduces a crucial dynamic audio segmentation preprocessing mechanism at the system front end. While acquiring the purified audio stream, the system injects it into a low-overhead ring buffer and employs a "dual threshold monitoring" strategy to trigger the segmentation operation.

[0058] Condition 1: Soft segmentation based on semantic continuity. The system invokes an ultra-low-power voice activity detection (VAD) algorithm to perform real-time frame-level scanning of the audio stream. When the duration of a user's speech pause reaches a first preset time threshold (preferably 500 to 800 milliseconds), the system determines that the current sentence has temporarily ended and performs segmentation at this point of silence. This mechanism maximizes the integrity of natural sentences.

[0059] Condition 2: Forced hard segmentation based on physical memory protection. When a user is continuously outputting at an extremely high speaking speed, or when a noisy environment causes the VAD to malfunction, the system calculates the cumulative duration of the currently unsegmented audio in real time. Once the second preset time threshold is reached (e.g., 8 to 10 seconds based on NPU computing power), the system immediately performs "forced hard segmentation" regardless of whether the user is speaking. This is an effective physical safety net to prevent OOM (Out of Memory) risks.
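The two conditions above can be sketched as a small state machine. This is a minimal sketch, not the patented implementation: the class name `DualThresholdSlicer`, the 20 ms frame length, and the per-frame `is_speech` flag (standing in for the output of a VAD algorithm) are assumptions; the 600 ms and 10 s defaults are picked from the ranges given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DualThresholdSlicer:
    frame_ms: int = 20           # assumed frame length
    silence_ms: int = 600        # first threshold (500 to 800 ms in the text)
    max_slice_ms: int = 10_000   # second threshold (8 to 10 s in the text)
    _frames: List[bytes] = field(default_factory=list)
    _silence_run_ms: int = 0
    _has_speech: bool = False

    def push(self, frame: bytes, is_speech: bool) -> Optional[Tuple[str, bytes]]:
        """Feed one frame; return (cut_kind, audio) when a slice is closed."""
        self._frames.append(frame)
        if is_speech:
            self._has_speech = True
            self._silence_run_ms = 0
        else:
            self._silence_run_ms += self.frame_ms
        # Condition 1: soft cut once a pause after speech is long enough.
        if self._has_speech and self._silence_run_ms >= self.silence_ms:
            return self._cut("soft")
        # Condition 2: hard cut once the unsegmented audio is too long.
        if len(self._frames) * self.frame_ms >= self.max_slice_ms:
            return self._cut("hard")
        return None

    def _cut(self, kind: str) -> Tuple[str, bytes]:
        audio = b"".join(self._frames)
        self._frames.clear()
        self._silence_run_ms = 0
        self._has_speech = False
        return kind, audio


slicer = DualThresholdSlicer()
frame = b"\x00" * 640  # 20 ms of 16 kHz, 16-bit mono PCM
for _ in range(50):    # one second of speech
    slicer.push(frame, True)
results = [slicer.push(frame, False) for _ in range(30)]  # 600 ms of silence
print(results[-1][0])  # soft
```

Checking the soft condition before the hard one means that a pause arriving exactly at the length limit is still treated as a natural sentence boundary.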

[0060] It should be noted that although the above slicing logic effectively avoids the risk of memory overflow, forced segmentation inevitably cuts off a complete word at the physical level, causing redundant "echoes" with overlapping beginnings and ends when adjacent slices are recognized separately. To address this side effect, this invention has designed a rigorous boundary deduplication (FuzzyMatch) logic, which will be detailed in subsequent chapters.
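As a rough preview of the idea, boundary deduplication can be sketched as a search for the longest tail of the previous slice's text that reappears, approximately, at the head of the next slice's text. Everything specific here is an assumption made for illustration: the function name `dedup_boundary`, the position-by-position similarity measure, and the `max_overlap`, `min_overlap`, and `min_ratio` values. The actual FuzzyMatch logic may differ.

```python
def dedup_boundary(prev_text: str, next_text: str,
                   max_overlap: int = 20, min_overlap: int = 3,
                   min_ratio: float = 0.8) -> str:
    """Return next_text without a head that echoes the tail of prev_text."""
    limit = min(max_overlap, len(prev_text), len(next_text))
    # Try the longest candidate overlap first.
    for size in range(limit, min_overlap - 1, -1):
        tail = prev_text[-size:]
        head = next_text[:size]
        # Position-by-position similarity tolerates a few misrecognized
        # characters while rejecting candidates that are merely shifted.
        same = sum(1 for a, b in zip(tail, head) if a == b)
        if same / size >= min_ratio:
            return next_text[size:]
    return next_text


prev = "please open the meeting notes"
nxt = "meeting notes and share them"
print(prev + dedup_boundary(prev, nxt))
# prints: please open the meeting notes and share them
```

A minimum overlap length is used so that a single coincidental character match at the boundary is not mistaken for an echo.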

[0061] 2.4 Asynchronous Dual-Track Concurrency and State-Driven Dynamic Memory Lifecycle Management

[0062] After an audio data block (audio slice) is generated, the system immediately duplicates the slice data and distributes the two copies concurrently to the first speech processing module (fast track) and the second speech processing module (slow track).
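The concurrent distribution can be sketched with Python's `asyncio`. This is only an illustration under assumptions: the two coroutines use `asyncio.sleep` as stand-ins for real model inference, the delays are arbitrary, and the `display` dictionary and `events` list are hypothetical substitutes for the terminal's display area. A real deployment would more likely run the models in separate threads or on separate hardware units.

```python
import asyncio
from typing import Dict, List, Tuple


async def fast_track(slice_id: int, audio: bytes) -> str:
    await asyncio.sleep(0.01)  # stand-in for the lightweight streaming model
    return f"draft-{slice_id}"


async def slow_track(slice_id: int, audio: bytes) -> str:
    await asyncio.sleep(0.05)  # stand-in for the high-precision model
    return f"final-{slice_id}"


async def process_slice(slice_id: int, audio: bytes,
                        display: Dict[int, str],
                        events: List[Tuple[str, int]]) -> None:
    # Both tracks receive the same slice and start at the same time.
    fast = asyncio.create_task(fast_track(slice_id, audio))
    slow = asyncio.create_task(slow_track(slice_id, audio))
    display[slice_id] = await fast  # the draft reaches the display first
    events.append(("draft", slice_id))
    display[slice_id] = await slow  # the final text replaces the draft
    events.append(("replace", slice_id))


async def main() -> Tuple[Dict[int, str], List[Tuple[str, int]]]:
    display: Dict[int, str] = {}
    events: List[Tuple[str, int]] = []
    await asyncio.gather(*(process_slice(i, b"", display, events) for i in range(2)))
    return display, events


if __name__ == "__main__":
    final_display, log = asyncio.run(main())
    print(final_display)
    print(log)
```

Because each slice awaits its own fast result before its slow result, the draft for a slice is always shown before that slice's replacement, regardless of how the slices interleave.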

[0063] To achieve uninterrupted operation for several hours, this invention innovatively designs a "simplified memory lifecycle management strategy based on UI state drive":

[0064] Delayed destruction of the first-level transcription buffer: After the fast track model completes streaming decoding and outputs a temporary draft, the draft text is immediately output to the display area. At this time, the audio data of this segment in the first-level buffer is not released immediately; instead, the system will trigger an instruction at the lowest level to completely release all residual buffer data of this segment in the first-level transcription workspace the moment the high-precision text of the second-level transcription model (slow track) is successfully output to the display area and the draft is smoothly replaced. This also preserves the data foundation for inference degradation and fault tolerance in extreme cases.

[0065] The relay release of the second-level transcription cache: For the cached data of the second-level transcription model (slow track), the system will not immediately release its memory after it completes high-precision inference and successfully performs replacement in the display area. This is because the system needs to retain its tail feature data for boundary deduplication comparison with the head of the next audio slice. Therefore, the system will only completely release all deep inference caches of the previous slice in the second-level transcription space after the second-level transcription content of the next adjacent audio segment has been successfully replaced in the display area.

[0066] Through the aforementioned sophisticated "relay-style" memory release strategy, the system only needs to maintain a very small amount of active audio slice data (e.g., no more than 3 to 5) in its background resident memory at any given time. This architectural design, which transforms the space complexity from linear superposition to constant-level overhead, effectively eliminates the risk of OutOfMemoryError (OOM) during long-term operation of large models on the edge. Furthermore, in actual commercial deployments, even when applications carrying this invention's architecture are downloaded and installed through standard application distribution platforms (such as general app stores) and run on real commercial smart devices constrained by the strict background memory management mechanisms of the terminal operating system (OS), they still effectively achieve high-precision continuous speech inference and content output with millisecond-level response for more than 2 hours, without any memory overflow or forced termination (kill) by the underlying system. This fully verifies the extremely high engineering robustness and industrial practical value of this slice and memory scheduling architecture.
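The relay-style release described above can be sketched as simple bookkeeping. This is a minimal sketch under assumptions: the class `RelayMemory`, its method names, and the use of plain dictionaries as stand-ins for the first-level buffer and second-level cache are all hypothetical, and slices are assumed to be finalized in order.

```python
from typing import Dict


class RelayMemory:
    """Tracks which slice data is still resident in memory."""

    def __init__(self) -> None:
        self.fast_buffers: Dict[int, bytes] = {}  # first-level (fast track)
        self.slow_caches: Dict[int, str] = {}     # second-level (slow track)

    def on_slice_created(self, slice_id: int, audio: bytes) -> None:
        self.fast_buffers[slice_id] = audio

    def on_final_displayed(self, slice_id: int, final_text: str) -> None:
        # The draft has been replaced, so this slice's fast buffer can go.
        self.fast_buffers.pop(slice_id, None)
        # Keep this slice's result for deduplication against the next slice.
        self.slow_caches[slice_id] = final_text
        # The previous slice has now served its purpose and is released.
        self.slow_caches.pop(slice_id - 1, None)

    def resident(self) -> int:
        return len(self.fast_buffers) + len(self.slow_caches)


mem = RelayMemory()
for i in range(1000):
    mem.on_slice_created(i, b"audio")
    if i >= 1:
        # The slow track is assumed to lag one slice behind acquisition.
        mem.on_final_displayed(i - 1, f"text {i - 1}")
print(mem.resident())  # 2
```

In this simulation the number of resident entries stays constant no matter how many slices pass through, which is the property the relay strategy relies on.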

[0067] It should be understood that the specific parameters regarding dynamic audio slicing (such as a mute threshold of 500 to 800 milliseconds and a forced slicing threshold of 8 to 10 seconds) and the specific cache relay clearing timing for first and second level transcription are merely optimal embodiments adapted to the computing power and memory specifications of current mainstream smartphones. Those skilled in the art can flexibly adjust the above-mentioned slicing time thresholds and lifecycle release nodes according to the specific hardware ecosystem conditions of the target device (such as different RAM capacities, different NPU / CPU computing power levels) and specific application scenarios. This underlying scheduling logic based on "threshold slicing to prevent overflow" and "alternating relay memory release," as an inseparable core component of the macro-asynchronous dual-track architecture of this invention, should effectively fall within the protection scope of this invention.

[0068] Section 3 Streaming Response and Hardware-Level Asynchronous Concurrency Scheduling of the Fast Track

[0069] After the system completes the dynamic slicing of continuous audio through the aforementioned mechanism, the sliced data will be synchronously distributed to the dual-track processing pipeline. To ensure the efficient operation of dual-track concurrency under limited computing power on the edge, this section will elaborate on the model selection, streaming response mechanism, and crucial cross-hardware collaborative scheduling strategy of the first speech processing module (fast track).

[0070] 3.1 Asynchronous Concurrency and Thread Scheduling Strategies Across Hardware

[0071] Running two speech processing modules in parallel on edge devices (especially smartphones) can easily lead to contention for underlying computing resources and severe thermal throttling. To address this engineering challenge, this system introduces a deep hardware thread scheduling mechanism when distributing audio slices. Implementers can flexibly adopt either of the following two concurrency strategies based on the hardware ecosystem of the target device:

[0072] Strategy 1: Global concurrency strategy based on hardware degradation and fallback (compatibility solution).

[0073] When issuing computing tasks, the system prioritizes hardware calls as follows: NPU (Neural Processing Unit) → GPU (Graphics Processing Unit) → CPU (Central Processing Unit). When a task is issued, the system prioritizes requesting the most efficient NPU or GPU computing power; if the device does not support it or the hardware is occupied, it automatically degrades to the CPU for execution. The advantage of this strategy is its good compatibility with conventional computing devices. However, in practice, implementers should note that on low-end models with extremely limited NPU / GPU computing power, forcing dual-track concurrent computation to be performed by a single-core or multi-core CPU may slow down system response.
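The NPU → GPU → CPU fallback order of Strategy 1 can be sketched as follows. This is a minimal illustration only: `BackendUnavailable` and `run_with_fallback` are hypothetical names, and each backend is represented by a plain callable rather than a real vendor runtime.

```python
from typing import Callable, Sequence


class BackendUnavailable(Exception):
    """Raised by a backend that is unsupported or currently occupied."""


def run_with_fallback(task, backends: Sequence[tuple[str, Callable]]):
    """Try each (name, runner) pair in priority order; return (name, result)."""
    for name, run in backends:
        try:
            return name, run(task)
        except BackendUnavailable:
            continue  # degrade to the next backend in the list
    raise RuntimeError("no backend could run the task")
```

A caller would pass the backends in priority order, for example `[("npu", run_on_npu), ("gpu", run_on_gpu), ("cpu", run_on_cpu)]`, where the three runner functions are supplied by the implementer.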

[0074] Strategy 2: Hardware decoupling-based "CPU+NPU" dual-track dedicated line scheduling strategy (preferred embodiment).

[0075] For smart terminals that are commonly equipped with NPUs, this invention provides a physical isolation strategy for decoupling computing power. Specifically, the system forcibly allocates the streaming decoding thread of the first voice processing module (fast track) to the central processing unit (CPU) of the mobile phone for execution; at the same time, it forcibly binds the deep inference thread of the second voice processing module (slow track) to the neural network processing unit (NPU) for execution.

[0076] The beneficial effects of this strategy are as follows: the first speech processing module has a small number of model parameters, enabling low-latency response using CPU computation, while the neural network model of the second speech processing module is well matched to the tensor operation characteristics of the NPU. This effectively prevents the fast and slow threads from suspending and preempting each other within the same computing unit, significantly reducing the overall peak power consumption and heat generation of the chip. More importantly, it puts to use the NPU computing power that is usually idle in smart terminals (this computing power is usually in a low-load state in normal text interaction scenarios), achieving peak-shifting utilization of hardware resources.
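The dedicated-lane idea of Strategy 2 can be illustrated with two independent single-worker executors, so that work on one track never queues behind work on the other. This sketch is an assumption-laden stand-in: Python threads cannot bind work to an NPU, so `slow_lane` merely represents the NPU-bound thread, and `dispatch`, `fast_decode` and `slow_infer` are hypothetical names.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# One worker per lane: a stand-in for "fast track on CPU, slow track on NPU".
fast_lane = ThreadPoolExecutor(max_workers=1, thread_name_prefix="fast-cpu")
slow_lane = ThreadPoolExecutor(max_workers=1, thread_name_prefix="slow-npu")


def dispatch(slice_id, audio, fast_decode, slow_infer) -> tuple[Future, Future]:
    """Send the same audio slice to both tracks at once."""
    draft = fast_lane.submit(fast_decode, slice_id, audio)
    final = slow_lane.submit(slow_infer, slice_id, audio)
    return draft, final
```

The draft future normally resolves first and drives the on-screen draft; the final future resolves later and drives the replacement.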

[0077] It should be understood that the above hardware scheduling strategy is intended to illustrate the system-level optimization potential of the asynchronous dual-track architecture of this invention. In practical applications, implementers may use different hardware binding combinations based on the underlying APIs provided by different chip manufacturers' platforms (e.g., binding the slow track to a specific DSP, or using a heterogeneous computing framework for dynamic computing power allocation). These are all equivalent means to achieve the "dual-track concurrency" goal in the macro-architecture of this invention, and should all fall within the protection scope of this invention.

[0078] 3.2 Model Selection and Ultra-Fast Response Mission of the First Speech Processing Module

[0079] Having clarified the underlying hardware scheduling, the primary function of the first speech processing module (fast track) is to achieve "extremely low visual latency feedback." In terms of model selection, this system preferentially adopts an acoustic architecture based on probabilistic statistical models or lightweight neural network models. Specifically, this includes, but is not limited to, traditional Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and lightweight, pruned pure self-attention streaming decoding network models (such as miniature Transformer or Conformer architectures).

[0080] For example, implementers can use lightweight streaming speech recognition frameworks commonly found in the open-source community (such as a lightweight branch of VOSK). These models undergo deep parameter compression (their physical file size is typically only around 40MB), resulting in extremely low computational complexity. When sliced data enters this module (such as the CPU-dedicated thread mentioned above), it can utilize streaming decoding algorithms to achieve incremental character output with extremely low latency (i.e., streaming on-screen display). In actual testing, its on-screen latency can be controlled to the tens to hundreds of milliseconds level. This near-zero-latency "initial transcribed text (draft)" is quickly output to the terminal display area, effectively avoiding unresponsive interfaces during long audio input.

[0081] It should be understood that the "40MB file size" or "open-source lightweight framework" mentioned in this description are merely examples to illustrate the advantages of the fast track. In the future, as algorithms evolve, if implementers adopt any other acoustic model (or cross-language micro-model) with an extremely small parameter count that is capable of generating initial transcribed text with extremely low latency, serving as a forerunner that fills the visual gap, all such implementations should fall within the protection scope of the "one fast, one slow" dual-track interactive architecture of this invention.

[0082] 3.3 Architectural fault tolerance for key defects and targeted intervention with high-frequency word library

[0083] Although the first speech processing module has a significant speed advantage, its lack of deep contextual understanding results in limited recognition accuracy when faced with homophones or complex sentences. To ensure a good interactive experience despite this limitation, this invention adopts a dual strategy:

[0084] Strategy 1: Architecture-level Exemption (positioned as a visual draft). In the macro-architecture of this invention, the text output by this module is explicitly marked as a "temporary visual draft" at the system's underlying level. The system does not require it to have the accuracy of a final usable level, because after providing initial visual feedback, this text will eventually be smoothly replaced at the UI level by high-precision text from the subsequent second speech processing module.

[0085] Strategy Two: Targeted Efficiency Improvement with a Miniature Dedicated Dictionary. To prevent a poor interactive experience due to an excessively high error rate in the initial draft, this system incorporates a miniature local dedicated dictionary into the first speech processing module. In a preferred embodiment, this dictionary is strictly limited to a very small size (e.g., less than 2000 words) to ensure it does not slow down the streaming decoding speed. These words are "core high-frequency words" extracted in advance using a large-scale language model (LLM) on an external high-performance server based on the distillation of high-frequency human dialogue corpus. During transcription in the fast track, this dictionary is hot-loaded as a weight enhancement node in the decoding graph. It can effectively correct common homophone or high-frequency similar-looking word errors that are prone to occur in lightweight acoustic models without increasing the computational burden, thus raising its visual accuracy to an acceptable draft level.
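The effect of the miniature dictionary can be illustrated, in simplified form, as a score bonus applied when choosing among candidate words. This is only a sketch of the general idea: the description places the boost inside the decoding graph, whereas the hypothetical `pick_word` function below rescores an already produced candidate list, and the bonus value is an arbitrary assumption.

```python
def pick_word(candidates: dict[str, float], boosted: frozenset[str],
              bonus: float = 2.0) -> str:
    """Return the candidate with the best score after boosting dictionary words.

    `candidates` maps each candidate word to a log-score from the fast-track
    model (higher is better); `boosted` is the miniature high-frequency
    dictionary. Membership tests on a frozenset are constant time on average.
    """
    return max(
        candidates,
        key=lambda w: candidates[w] + (bonus if w in boosted else 0.0),
    )
```

For example, with scores `{"their": -1.0, "there": -1.5}` and `"there"` in the dictionary, the boosted candidate wins; with an empty dictionary the acoustic score alone decides.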

[0086] It should be understood that the vocabulary limit (e.g., 2000 words), extraction source (e.g., LLM distillation), or specific weighting algorithm of the aforementioned micro-lexicon are merely preferred implementation methods for further optimizing the visual experience of the fast track. Implementers who change the size of the lexicon, adopt proprietary draft dictionaries for different industry sectors, or eliminate the micro-lexicon entirely and rely solely on architecture-level replacement to mask the defects of the small model do not deviate from the protection scope of the core architecture of this invention.

[0087] Section 4 Slow-Track Deep Inference and Adaptive Scheduling Strategies for Hardware Resources

[0088] While the first speech processing module (fast track) streams the temporary draft with extremely low latency, the second speech processing module (slow track) in the system background is simultaneously performing high-intensity deep semantic analysis on the same audio slice. This section will elaborate on the selection logic of the slow track module, the physical source of its latency, and the adaptive scheduling and handover mechanism of computing power for different hardware platforms.

[0089] 4.1 Selection of the Second Speech Processing Module and Deep Semantic Analysis

[0090] The main function of the second speech processing module is to provide a high-level transcription accuracy that is significantly better than that of the first speech processing module, so as to effectively correct homophones, contextual errors and proper noun recognition anomalies in the fast track draft.

[0091] Currently, edge deep neural network frameworks that meet this condition include, but are not limited to: large open-source models based on the Transformer architecture (such as Paraformer series non-autoregressive models, Whisper series edge models, etc.).

[0092] As a preferred embodiment of this application, in order to extract audio features more accurately, the system preferably employs a deep neural network model that incorporates a complex multi-head attention mechanism. This mechanism can accurately capture the contextual dependencies between words in a long audio segment and is specifically designed for global deep semantic modeling, thereby achieving extremely high-precision deep semantic decoding.

[0093] 4.2 The physical cause and objective phenomenon of "second response delay"

[0094] An objectively existing system feature in the macro-architecture of this invention is that the "second response delay" of the final text output by the second speech processing module is greater than the "first response delay" of the first module.

[0095] This time lag typically has underlying physical and mathematical basis. To achieve high-precision semantic understanding, the number of model parameters and physical size of the second speech processing module are often significantly larger than those of the first acoustic model. The large parameter matrix implies a non-linearly increasing amount of floating-point operations (FLOPs). Therefore, even if the system utilizes the terminal's high-performance computing units, the time required for the second speech processing module to complete a full slice of deep inference is greater than the streaming output time of the first speech processing module.

[0096] 4.3 Adaptive Matching and Quantization Strategy for Terminal Hardware Computing Power

[0097] Since this invention can be applied to a wide range of terminal devices, the loading strategy of the second speech processing module is not static, but adapts to the physical memory and computing power limitations of the target device:

[0098] For computing devices with ample power supply and large memory capacity (such as computers equipped with high-performance graphics cards or processors), relatively low-complexity speech-to-text tasks pose no difficulty. In this case, the system can directly load unquantized, full-precision (such as FP32 or FP16) deep neural network models to obtain high transcription accuracy without worrying about thread blocking.

[0099] However, when this method is applied to mobile terminals such as smartphones, forcibly loading a full-precision large model will quickly fill the processing threads of the mobile CPU / NPU with its massive parameter matrix, leading to the exhaustion of underlying computing resources and causing a severe thermal throttling effect. Therefore, in smartphone applications, the system preferably introduces a large model version with deep quantization (such as INT8 quantization), and it is recommended to strictly control its physical file size to within 150MB. This adaptive quantization strategy can significantly reduce memory consumption and thread load while retaining the attention mechanism features of the large model that are fully sufficient for speech recognition. For edge devices such as smart microphones, further deep pruning and adaptation are performed based on their built-in DSP computing power and extremely limited RAM space.

[0100] 4.4 Heterogeneous Computing Power Handover: A Preferred Implementation from Tensor Computation to Logic Control

[0101] In a preferred embodiment of the present invention, when performing intensive inference of the second speech processing module, a cross-chip heterogeneous computing power handover mechanism is designed.

[0102] In actual engineering implementation, dedicated acceleration chips such as NPUs or GPUs in smart terminals are mainly responsible for performing the extremely intensive multiply-accumulate operations (MACs) in large models. The main processing stage of the NPU is complete the moment the slow-track model, with the help of NPU computing power, finishes decoding acoustic features into characters and outputs the "high-precision original second transcribed text".

[0103] Since the NPU is not adept at performing complex string comparisons, hash searches, and memory logic jumps, the underlying system architecture triggers a "hardware computing power handover": the system quickly sends the raw text string output by the NPU back and writes it into the active memory space of the terminal's main control CPU. Subsequently, all text intervention actions (including dictionary error correction, VIP hot word replacement), and boundary deduplication and splicing logic for adjacent audio slices (FuzzyMatch) are handled by the CPU, which excels at logic control. This heterogeneous collaborative mechanism, where the NPU handles high-density tensor operations and the CPU handles complex logic jumps and string processing, effectively optimizes the chip's execution efficiency.

[0104] It should be understood that the physical causes of the "second response delay" described in this section, the quantization adaptation rules for different terminals (such as limiting it to within 150MB on mobile devices), and the heterogeneous computing power scheduling mechanism of "handing over to CPU processing after NPU inference" listed in the preferred embodiment are merely exemplary descriptions to illustrate how the present invention achieves high energy efficiency under specific hardware constraints.

[0105] In practical applications, implementers can employ other equivalent mechanisms. For example, two models with similar parameter counts but different algorithmic complexities can be used to generate differentiated processing latency; or in a highly integrated single-chip architecture (or a future new processor architecture), inference and logic post-processing can be integrated into the same computing unit. The core idea of this invention lies in the macroscopic interactive closed loop of "dual-track alternation and dynamic interface replacement based on processing latency differences." Therefore, regardless of the quantization compression standard used by the terminal device, the specific model file format loaded, or the specific task allocation and computing power scheduling of its underlying concurrent threads, memory, and arithmetic units (such as CPU, GPU, and NPU), as long as it serves this macroscopic dual-track interactive framework, it effectively falls within the protection scope of this invention.

[0106] Section 5 Memory-Level Waterfall-Style Text Post-Processing and Slice Boundary Smoothing Stitching Strategy

[0107] After the second speech processing module (slow track) completes deep inference with the help of the NPU and hands it over to the main control CPU, the CPU memory retrieves a high-precision original transcribed text (defined as: Level 1 draft). However, due to the domain-specific accuracy decay caused by the deep quantization of the edge model and the physical breaks caused by forced audio slicing, this Level 1 draft may still have local flaws. This section will take "all-English speech input and English text output" as the preferred implementation scenario and elaborate on how the CPU can perform multi-level dynamic intervention and seamless stitching through a "waterfall-style draft evolution" pipeline in a very short time.

[0108] 5.1 The Physical Inevitability of Quantization Loss and the Injection of Two-Dimensional Prompts

[0109] In edge computing scenarios, deep quantization (e.g., INT8) of large, full-precision models of hundreds of megabytes effectively improves inference speed, but may lead to a decrease in recognition rate for certain long-tail words. To mitigate this issue at the source of inference, the system first employs a prompt injection strategy:

[0110] Static scene prompts: When a user initiates full English transcription, the system first extracts the most frequent core English words (e.g., 100 to 200 words) based on the pre-defined vertical application scenario (such as Technology, Medical, Legal, etc.) and injects them as context prompts into the slow-track decoder. This effectively guides the large model towards the semantic features of specific domains without increasing the NPU's computational burden.

[0111] Fast Track Draft Feedback (Dynamic Prompts): Given the system's asynchronous concurrent architecture, when the slow track begins inference, the temporary text generated by the first speech processing module (fast track) remains in the cache. The system synchronously injects (or inputs) this fast track draft as dynamic prompts into the slow track's large model. Although the fast track draft has limited precision, it retains the macro-context and pronunciation outline of the current segment, which can further enhance the decoding direction of the large model, achieving deep coupling and mutual enhancement of the two track data.
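The two prompt sources described above (static scene words and the fast-track draft) can be combined into a single context string before it is handed to the slow-track decoder. The following sketch shows only the string assembly; `build_prompt` is a hypothetical helper, the separator format is an assumption, and how the resulting prompt is passed to a given model depends on that model's own interface.

```python
def build_prompt(scene_words: list[str], fast_draft: str,
                 max_scene_words: int = 200) -> str:
    """Join static scene words and the dynamic fast-track draft into one prompt."""
    parts = []
    if scene_words:
        # Static prompt: the most frequent domain words, capped in number.
        parts.append(", ".join(scene_words[:max_scene_words]))
    if fast_draft.strip():
        # Dynamic prompt: the imperfect draft still carries the rough context.
        parts.append(fast_draft.strip())
    return ". ".join(parts)
```

The cap of 200 words mirrors the upper end of the 100 to 200 word range given for static scene prompts.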

[0112] 5.2 Basic screening of whitelist based on hash matching and OOV anomaly labeling (generating secondary draft)

[0113] After the NPU outputs the "Level 1 Draft", the main control CPU can perform multiple rounds of pipelined post-processing on it in a very short time because the memory footprint of the plain-text data stream is extremely small. The system first introduces a "local basic dictionary whitelist mechanism".

[0114] The whitelist is a large-scale local static English dictionary, covering tens of thousands of common high-frequency English words and basic grammatical words (such as we, are, beautiful, etc.). The system uses English space features for fast word segmentation at the underlying level and employs a hash search algorithm to quickly compare the words in the first-level draft with the whitelist.

[0115] The primary function of this step is not error correction or replacement, but rather anomaly screening and labeling: when a transcribed word is detected that does not match the whitelist (e.g., suspected misidentification or the obscure proper noun "aegix"), the system implicitly assigns an OOV (Out-of-Vocabulary) low-level label to that word in memory. Regular standard words that match the whitelist remain in their normal state. After this efficient hash screening, the system outputs a "secondary draft" with an OOV classification label in CPU memory, providing a clear intervention indicator for subsequent algorithms.
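The whitelist screening of this step can be sketched as space-based tokenization followed by a hash-set membership test. The `Token` type, the `tag_oov` function and the punctuation-stripping rule are assumptions for illustration; the whitelist is modeled as a `frozenset`, whose lookups are hash based.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str          # the word exactly as transcribed
    oov: bool = False  # True when the word is not in the whitelist


def tag_oov(level1_draft: str, whitelist: frozenset[str]) -> list[Token]:
    """Split on whitespace and label words missing from the whitelist.

    No word is corrected or replaced here; the output is the labeled
    "secondary draft" consumed by later stages.
    """
    tokens = []
    for raw in level1_draft.split():
        key = raw.strip(".,!?;:\"'").lower()
        tokens.append(Token(raw, oov=bool(key) and key not in whitelist))
    return tokens
```

For the input `"We are at aegix."` with a whitelist containing `we`, `are` and `at`, only the last token is labeled OOV.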

[0116] 5.3 User hot keyword intervention with first priority (generating three-level drafts)

[0117] After obtaining the second-level draft, the system enters a stage with higher intervention weight: VIP user-defined English hot words intervention. The original intention of this level is to give user-specific words (such as specific company names or internal codes) a higher priority in matching and replacement, alleviating the technical limitation of large models easily correcting them to standard English.

[0118] Intervention Scope and Logic (Full Scan and Focused Attention): In this stage, the VIP hot word intervention mechanism ignores whether words in the secondary draft have OOV tags and performs a comprehensive scan of the entire text. Although the system prioritizes comparing abnormal words carrying OOV tags, the main technical advantage of this stage is that hot words take priority even over standard English words that passed the whitelist.

[0119] The system implements a dynamic adjustment strategy for fault tolerance based on a preset weight allocation logic (e.g., using the length of English characters in the user's hot keywords as a core weight factor).

[0120] Short-word filtering mechanism (hot words of 3 or fewer characters; guards against mistaken intervention): If the length of a hot word added by the user is less than or equal to 3 English letters (such as and, go, ORG), the system prevents the intervention logic of that hot word from taking effect. Very short words have a high probability of collision during fault-tolerant comparison (for example, it is very easy to mistakenly replace the correct "orc" with the hot word "ORG"). Restricting such very short hot words is the core safety anchor that ensures the engineering robustness of the system.

[0121] Strict Matching Mode (Hot Words of 4 to 5 Characters): When the hot word is 4 to 5 characters long, the system uses "strict mapping logic". Case normalization or format replacement is only performed when the text fragment in the secondary draft completely matches the target hot word.

[0122] Low tolerance and strong coverage mode (hot words of 6 to 10 characters): When the length of a hot word is greater than 5 and less than or equal to 10 characters, the system calculates the Levenshtein edit distance between the corresponding word in the secondary draft and the target hot word in real time. When this edit distance is less than the set tolerance threshold, the replacement mechanism is triggered.

[0123] Example Demonstration (Prioritizing Replacement of Standard Words): If a user adds the hot word "Wonderfud" (a specific brand name), the large model might misidentify it as the standard English word "Wonderful" in the first-level draft. In the whitelist screening of Section 5.2, "Wonderful" is on the whitelist and is not tagged as OOV. However, when the VIP mechanism intervenes, the system finds that the edit distance between "Wonderful" in the second-level draft and the hot word "Wonderfud" is within a very small threshold. At this point, the priority mechanism takes effect, and the system disregards the conventional status of "Wonderful" as standard English and directly replaces it with "Wonderfud".

[0124] Highly fault-tolerant mode (long hot words / short phrases with more than 10 characters): As long as the similarity (or subsequence overlap rate) between the character sequence of the second-level draft and the target long hot word reaches more than 45%, the system determines that the model may have made a recognition error and directly replaces the segment with the target long hot word.

[0125] After a global traversal and strong intervention of VIP hot keywords, the system generates a highly personalized "level three draft".
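The four length-based modes of this section can be sketched as one matching function. Several details are assumptions for illustration: the tolerance value of 2, the case-insensitive comparison, the use of `difflib.SequenceMatcher.ratio()` as a stand-in for the unspecified similarity measure, and token-by-token comparison (a long hot phrase would in practice be compared against a sliding window of several words).

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def hot_word_matches(fragment: str, hot: str, tolerance: int = 2) -> bool:
    f, h = fragment.lower(), hot.lower()
    n = len(hot)
    if n <= 3:       # short-word filter: never intervene
        return False
    if n <= 5:       # strict mode: exact match only
        return f == h
    if n <= 10:      # low-tolerance mode: edit distance below threshold
        return levenshtein(f, h) < tolerance
    # Highly fault-tolerant mode: similarity of at least 45%.
    return SequenceMatcher(None, f, h).ratio() >= 0.45


def apply_hot_words(words: list[str], hot_words: list[str]) -> list[str]:
    """Full scan: every word is checked, whether or not it carries an OOV tag."""
    return [next((h for h in hot_words if hot_word_matches(w, h)), w)
            for w in words]
```

With the hot word "Wonderfud", the standard word "Wonderful" (edit distance 1) is replaced, while the hot word "ORG" never replaces "orc" because it falls under the short-word filter.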

[0126] 5.4 Targeted fallback of vertical domain-specific thesaurus (generating level 4 drafts)

[0127] After obtaining the third-level draft, the system triggers a fallback of the third-level vertical domain thesaurus (such as large-scale customized dictionaries for medical, legal, etc.).

[0128] To effectively conserve the computing power of mobile CPUs and avoid widespread misjudgments of frequently used English terms by specialized and obscure word libraries, the intervention logic of vertical word libraries adopts a "targeted sweeping" mechanism (computing power pruning):

[0129] Instead of a full scan against the vertical thesaurus, a targeted scan is performed only on isolated words in the Level 3 draft that still carry OOV (Out-of-Vocabulary) abnormal tags and have not been processed by VIP hot words. The system calls a fuzzy matching algorithm to perform in-depth, industry-level inference and repair only on these remaining untouched isolated words. After the error correction is completed, the system generates a more accurate and logically coherent "Level 4 draft" in memory.
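The "targeted sweeping" logic can be sketched as a filter that sends only still-OOV, untouched words to the fuzzy matcher. The tuple layout, the `vertical_fallback` name and the 0.8 cutoff are assumptions for this example; `difflib.get_close_matches` from the Python standard library stands in for the unspecified fuzzy matching algorithm.

```python
from difflib import get_close_matches


def vertical_fallback(tokens: list[tuple[str, bool, bool]],
                      domain_terms: list[str],
                      cutoff: float = 0.8) -> list[str]:
    """tokens: (text, is_oov, handled_by_vip) triples from the Level 3 draft."""
    out = []
    for text, is_oov, handled_by_vip in tokens:
        if is_oov and not handled_by_vip:
            # Only the remaining anomalies are compared with the large
            # domain dictionary, which keeps CPU cost low.
            hit = get_close_matches(text.lower(), domain_terms, n=1, cutoff=cutoff)
            out.append(hit[0] if hit else text)
        else:
            out.append(text)  # whitelisted or VIP-handled words pass through
    return out
```

Words that passed the whitelist are never compared with the domain dictionary, which is what prevents obscure domain terms from overwriting common English words.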

[0130] 5.5 Eliminating the side effects of physical hard segmentation: FuzzyMatch boundary deduplication and stitching of adjacent slices

[0131] After completing the aforementioned "waterfall-style tagging and directional post-processing," the Level 4 draft faces its final architectural challenge: "slice echo." Because the front end performs forced hard segmentation of the audio, a complete English word or phrase may be split between two adjacent audio slices. When the slices are recognized independently, this causes repeated English characters to appear at the slice seams (for example, the previous segment ends with "I will go to New," and the next segment begins with "New York today").

[0132] Before performing the screen replacement, this system forces the intervention of boundary deduplication logic for the level 4 draft: the system extracts the "tail character sequence" of the final text corresponding to the adjacent previous audio slice from memory, and the "head character sequence" of the current slice (level 4 draft); then calculates the character overlap between the two.

[0133] Implementer's Note: In specific project implementations, different string matching algorithms can be flexibly used to calculate the overlap, such as, but not limited to: Longest Common Subsequence (LCS) algorithm, Levenshtein edit distance algorithm, or N-gram-based fuzzy matching algorithm. When the overlap meets the preset conditions (such as finding "New" as an overlap feature), the system will automatically remove redundant and repeated characters from the current fourth-level draft header character sequence, complete the smooth splicing of the slices (effectively spliced to "I will go to New York today"), and thus output the final draft text that can be used for screen replacement.
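One simple way to realize the boundary deduplication is to look for the longest run of words that ends the previous slice and also begins the current one. The sketch below uses exact, case-insensitive word comparison as a simplified stand-in for the LCS, edit-distance or N-gram matching named above; `dedupe_head` and the five-word window are assumptions.

```python
def dedupe_head(prev_tail: str, current: str, max_words: int = 5) -> str:
    """Drop words at the head of `current` that repeat the tail of `prev_tail`."""
    prev_words = prev_tail.split()
    cur_words = current.split()
    limit = min(max_words, len(prev_words), len(cur_words))
    # Try the longest candidate overlap first.
    for k in range(limit, 0, -1):
        tail = [w.lower() for w in prev_words[-k:]]
        head = [w.lower() for w in cur_words[:k]]
        if tail == head:
            return " ".join(cur_words[k:])
    return current  # no overlap found; keep the slice unchanged
```

Applied to the example in the text, `dedupe_head("I will go to New", "New York today")` leaves `"York today"`, so joining the two slices yields "I will go to New York today".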

[0134] 5.6 Unified Description of Core Intervention Architecture and Algorithm Parameters

[0135] It should be understood that the waterfall-style draft evolution pipeline described in detail in this section, namely "whitelist anomaly labeling → VIP hot word indiscriminate strong coverage → vertical thesaurus based on OOV tag targeted fallback", as well as the specific character length safety bolt, edit distance calculation rules and LCS splicing algorithm, is merely a preferred embodiment provided by the applicant in a specific environment (such as English context) in order to achieve efficient interaction and computing power balance.

[0136] Given the diverse implementation paths of text post-processing algorithms and memory scheduling techniques, implementers can make adaptive adjustments in practical applications based on different target language architectures (such as Chinese, Japanese, and other continuous character languages that do not require space separation) and different underlying operating system ecosystems. For example, the underlying word segmentation rules (using Jieba segmentation instead of space segmentation), word length calculation standards (replacing the number of English letters with the number of Chinese characters), and the specific data structure of out-of-vocabulary (OOV) tags can be replaced with equivalent ones.

[0137] The core technology of this invention lies in "addressing the flaws caused by asynchronous dual-track concurrency by performing waterfall-style post-processing in CPU memory using a permission-based dictionary mechanism and stitching the boundaries of the slice breakpoints." Therefore, any equivalent substitution, parameter fine-tuning, or language switching of the above-mentioned tagging logic and algorithm details, as long as it macroscopically serves the asynchronous dual-track dynamic replacement workflow of this invention, should effectively fall within the protection scope of the core architecture of this invention.

[0138] Section 6 Timestamp-based Coordinate Anchoring and Smooth UI Sub-layer Replacement Mechanism

[0139] After completing the aforementioned multi-level post-processing and boundary deduplication stitching in the main control CPU memory, the system obtains a coherent "final draft text". This section will explain in detail how the Input Method Service bridges the time difference between the asynchronous dual tracks, accurately locks and effectively erases the historical draft generated by the first speech processing module (fast track) in the terminal display area, and finally completes a smooth replacement at the visual level.

[0140] 6.1 Asynchronous Time Difference Challenges and the "Timestamp-Index" Two-Way Anchoring Mechanism

[0141] Because of the objective physical time difference between the fast and slow track execution (i.e., the second response delay is significantly greater than the first response delay), the system faces the technical challenge that when the slow track finalizes the text and prepares to display it on the screen, the draft text generated by the fast track has already settled at a certain position on the screen. In order to accurately locate and erase the old draft in the complex text editing flow without affecting other text of the user, the system must introduce a global tracking coordinate system.

[0142] This embodiment innovatively adopts a "two-way anchoring mechanism between global timestamp and screen character index":

[0143] Identifier Distribution: As early as when dynamic audio slicing is executed at the system front end, the underlying architecture generates a globally unique initial timestamp (e.g., Timestamp_10:00:00.000) or a unique hash ID for that specific audio slice. The data pointer of the slice, along with this timestamp, is synchronously and concurrently distributed to the fast track and slow track modules.

[0144] Quick Track Coordinate Registration: When the first module (Quick Track) completes inference and outputs its generated "Level 1 Draft Text" to the terminal screen display area, the input method's underlying service will precisely listen to and record the exact character position range (i.e., the start and end character indices, defined as the range [Index_start, Index_end]) occupied by this draft text within the current operating system's text box. The system's underlying layer then strongly binds this coordinate range to Timestamp_10:00:00.000 and stores it in a dynamic mapping table in the background memory.

[0145] 6.2 Targeted wake-up and precise coordinate retrieval

[0146] A few seconds later (e.g., 10:00:05), the second module (slow track) generates the "final draft text" for that slice after NPU inference and multi-level CPU post-processing. At this point, the replacement scheduling module triggers the UI replacement action with its initially bound Timestamp_10:00:00.000. The system queries the memory mapping table, using this timestamp primary key, to retrieve the precise coordinate range [Index_start, Index_end] currently occupied by the old draft corresponding to that slice on the screen UI, providing the target coordinates for the underlying replacement instruction.
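The registration and retrieval steps of Sections 6.1 and 6.2 can be illustrated with a minimal sketch. The class and method names (`AnchorTable`, `register`, `lookup`, `replace`) are hypothetical, the end index is treated as exclusive, and the shifting of later anchors after a replacement is an assumption of this sketch rather than something the embodiment specifies.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class DraftAnchor:
    """Character range occupied by one fast-track draft in the text box."""
    index_start: int
    index_end: int  # exclusive end index (assumption of this sketch)


class AnchorTable:
    """Hypothetical in-memory mapping table: slice timestamp -> screen range."""

    def __init__(self) -> None:
        self._table: Dict[str, DraftAnchor] = {}

    def register(self, timestamp: str, index_start: int, index_end: int) -> None:
        # Called when the fast track writes its draft to the display area.
        self._table[timestamp] = DraftAnchor(index_start, index_end)

    def lookup(self, timestamp: str) -> Optional[Tuple[int, int]]:
        # Called when the slow track holds the final text for the same slice.
        anchor = self._table.get(timestamp)
        if anchor is None:
            return None
        return anchor.index_start, anchor.index_end

    def replace(self, screen_text: str, timestamp: str, final_text: str) -> str:
        """Swap the draft for the final text and keep later anchors valid."""
        located = self.lookup(timestamp)
        if located is None:
            return screen_text
        start, end = located
        delta = len(final_text) - (end - start)
        new_text = screen_text[:start] + final_text + screen_text[end:]
        # Assumption: drafts located after the replaced range shift by delta.
        for key, anchor in self._table.items():
            if key != timestamp and anchor.index_start >= end:
                anchor.index_start += delta
                anchor.index_end += delta
        self._table[timestamp] = DraftAnchor(start, start + len(final_text))
        return new_text
```

In this sketch a plain string stands in for the text box; on a real terminal the range would come from the input method service and the write would go through the platform's text editing interface.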

[0147] 6.3 Performing Text Replacement: Scheduling of Underlying UI Commands and Smooth Visual Transition

[0148] After locking onto the precise coordinates, the system sends a replacement command to the underlying operating system. To adapt to terminal devices with varying performance and improve visual smoothness, this invention preferably provides two underlying API scheduling execution methods:

[0149] Method 1: Full region coverage command based on underlying API (high compatibility mode).

[0150] The system calls the underlying text editing APIs. On Android, for example, it can call `InputConnection.setComposingRegion` to define the region and then call the `commitText` interface, or alternatively call the `deleteSurroundingText` interface. In effect, the system sends a silent deletion command for the specific [Index_start, Index_end] coordinate range, deleting the draft characters within the range, and then writes the "final draft text" from the second speech processing module to that location.

[0151] Method 2: Based on memory Diff comparison difference erasure instructions (anti-flicker mode).

[0152] To avoid the slight flickering or repainting overhead that might occur when deleting long text entirely, a preferred embodiment involves the system performing a difference comparison algorithm in the background memory before sending UI commands. This algorithm compares the "old draft text" and the "final draft text." The system only extracts the differences where character changes actually occur, and then sends targeted backspace deletion and new character insertion commands to the underlying operating system for the specific erroneous characters.
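As a rough illustration of the difference-based mode, the following sketch uses Python's standard `difflib.SequenceMatcher` to extract only the character ranges that change. The embodiment does not name a specific diff algorithm, so this choice is an assumption, and the helper names `diff_edit_ops` and `apply_edit_ops` are hypothetical.

```python
import difflib
from typing import List, Tuple

# (kind, start, end, replacement): indices refer to the old draft text.
EditOp = Tuple[str, int, int, str]


def diff_edit_ops(old_draft: str, final_text: str) -> List[EditOp]:
    """Return only the ranges of the old draft that actually change."""
    matcher = difflib.SequenceMatcher(a=old_draft, b=final_text, autojunk=False)
    ops: List[EditOp] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # 'replace', 'delete' or 'insert' on old_draft[i1:i2]
            ops.append((tag, i1, i2, final_text[j1:j2]))
    return ops


def apply_edit_ops(old_draft: str, ops: List[EditOp]) -> str:
    """Apply the edits right to left so earlier indices stay valid."""
    text = old_draft
    for _tag, start, end, replacement in reversed(ops):
        text = text[:start] + replacement + text[end:]
    return text
```

Each returned operation would map to one targeted deletion plus insertion on the display area; unchanged characters are never touched, which is what avoids the full repaint.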

[0153] Through the aforementioned timestamp-based coordinate guidance and underlying API scheduling, the system completes text replacement. Visually, this process is characterized by a smooth transition from the initial draft to high-precision text, effectively achieving low visual latency display updates and forming a closed loop of the asynchronous dual-track dynamic replacement workflow of this invention.

[0154] 6.4 Dynamic Dashboard and UI Status Linkage Interaction Mechanism Based on Asynchronous Computing Load

[0155] Because this invention employs an asynchronous dual-track architecture, the deep inference time of the second speech processing module (slow track) is objectively longer than the user's pronunciation time. When the user outputs long passages at extremely high speeds, the "audio slices to be processed" that have not yet completed high-precision post-processing in the system's backend will inevitably experience queue backlog. To improve the interactive experience and ensure the integrity of the final output text, this invention introduces a "dynamic computing power status dashboard" and a "dynamic progress feedback and status indicator mechanism" at the input method UI layer.

[0156] Threshold-triggered dynamic computing power dashboard:

[0157] During continuous recording by the user, the system backend monitors the "number of slices to be processed (i.e., the backlog)" in the slow-track queue in real time. The system sets a computing power safety threshold (e.g., 5 slices).

[0158] When the number of slices to be processed is less than 5, it means that the current terminal has sufficient computing power and the slow track replacement speed can keep up with the fast track output. At this time, the UI interface does not display any additional computing power prompts.

[0159] If the user speaks too quickly, causing the number of unprocessed slices to reach or exceed the threshold (e.g., 5 or more), the system determines that the computing power is under high load and immediately triggers a "computing power dashboard" in a specific area of the UI (e.g., the top center of the screen), displaying the number of currently unprocessed slices. This provides visual feedback that alerts the user to the high load.
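The threshold rule can be sketched as a single function. The function name, the message wording, and the use of `None` for "no dashboard" are assumptions made for illustration; the threshold value 5 is the example given in the text.

```python
from typing import Optional

BACKLOG_THRESHOLD = 5  # example safety threshold from the text


def dashboard_message(pending_slices: int,
                      threshold: int = BACKLOG_THRESHOLD) -> Optional[str]:
    """Return the dashboard text, or None when no prompt should be shown."""
    if pending_slices < threshold:
        # Slow track keeps up with the fast track: show nothing.
        return None
    # Backlog reached the threshold: show the number of unprocessed slices.
    return f"Slices pending: {pending_slices}"
```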

[0160] Dynamic progress feedback and "DONE" status indication mechanism after microphone interruption:

[0161] When the user clicks to end microphone recording, although the initial draft of the fast track has been fully displayed, there are usually still a few slices (e.g., 2) in the slow track memory that are undergoing in-depth post-processing and FuzzyMatch stitching.

[0162] If the system provides no notification, the user may proceed with further interaction (such as clicking send, save, or line break in the external host software) before the background post-processing is complete, thus retaining or outputting a flawed draft. Therefore, this invention dynamically outputs processing progress information (e.g., a countdown or display of "Number of slices remaining for finishing: 2... 1...") in a prominent location on the terminal interface (such as the upper right corner) the moment the user ends audio recording, prompting the user to pause their operation.

[0163] Only when the background engine reports that all slices have completed the final UI replacement does the dashboard flash a status indicator (such as "DONE" or a specific highlighted icon), indicating that the highest-precision text is ready.
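The finishing sequence after the microphone is closed can be sketched as a generator that drains the slow-track queue. The function name and the `process_slice` callback are hypothetical stand-ins for the real post-processing and UI replacement step.

```python
from collections import deque
from typing import Callable, Deque, Iterator


def finishing_progress(pending: Deque[str],
                       process_slice: Callable[[str], None]) -> Iterator[str]:
    """Yield one progress message per remaining slice, then a final status."""
    while pending:
        yield f"Number of slices remaining for finishing: {len(pending)}"
        # Hypothetical hook: post-process the slice and replace its draft.
        process_slice(pending.popleft())
    # Emitted only after every slice has completed its final replacement.
    yield "DONE"
```

A UI layer would render each yielded string in turn, so the "DONE" indicator cannot appear while any slice is still queued.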

[0164] It should be understood that the display position of the aforementioned computing power dashboard (e.g., centered or upper right corner), the specific number for threshold triggering (e.g., 5 slices), and the "DONE" text and countdown UI animation at the end of finishing processing are merely preferred interactive embodiments provided in conjunction with the asynchronous dual-track architecture. Implementers can design variations of UI elements according to the specific visual style of the app. The underlying interactive logic of this invention is: "By monitoring the backlog of slice processing queues in the background asynchronous slow track in real time, dynamic visual warnings are issued on the UI side, and after the recording ends, dynamic progress prompts guide the user to pause interaction until the slow track queue is cleared before outputting the final status indicator." Any interactive method that drives the front-end UI progress prompts and status linkage based on the status of the background processing queue should fall within the protection scope of this invention.

[0165] It should be understood that the "timestamp-index mapping table" recording mechanism, coordinate range retrieval logic, and the listed Android platform InputConnection related API interfaces or Diff difference backspace mechanism described in detail in this section are merely a specific engineering embodiment provided to clearly illustrate how the present invention achieves underlying UI replacement.

[0166] Implementers should understand that different smart terminal operating systems have drastically different underlying text interaction protocols (e.g., the UITextDocumentProxy interface system in iOS / iPadOS platforms, the TSF (TextServices Framework) text service framework in Windows platforms, etc.). The core idea of this invention is: "By distributing tasks with unique identifiers (such as timestamps) at the front end, recording the dynamic position characteristics of the fast track on the screen, and using the identifier to address and execute text replacement at the corresponding position after the slow track post-processing is completed," which is a macroscopic UI state synchronization and addressing replacement logic. Therefore, regardless of the operating system platform or the specific name of the underlying framework API used by the implementer in the specific engineering development to perform the action of "deleting old words and writing new words," as long as it follows the core framework of "tracking and replacement under asynchronous time difference" described in this invention, it should fall within the protection scope of this invention.

[0167] Section 7 Macro-level Expansion Architecture of Offline Workstations and Offline Brains

[0168] The "asynchronous dual-track concurrency and memory-level post-processing replacement" mechanism detailed in Sections 1 to 6 above primarily focuses on interactive scenarios involving real-time microphone pickup (here collectively referred to as the "1.0 real-time transcription architecture"). When a user invokes this 1.0 architecture in a third-party typesetting or note-taking software, as transcription and dynamic replacement continue, the user will eventually generate a standard data object containing long text on their local device.

[0169] To comprehensively improve the complete ecosystem of offline voice processing, this invention further discloses, based on the 1.0 architecture, a macro-level extension scheme of "offline workstation (2.0 architecture)" for pre-recorded multimedia files and "offline brain (3.0 architecture)" for global post-processing of long texts.

[0170] 7.1 Offline Workstation: Asynchronous Dual-Track Batch Processing Mechanism for Multimedia Files (2.0 Architecture)

[0171] In real-world business or productivity scenarios, users often need to process pre-recorded long audio or video files. This invention provides a purely local offline workstation mode:

[0172] File import and digital parsing: The system acquires pre-recorded local multimedia files (covering various audio and video container formats), strips the video track at the underlying level, and extracts the pure digital audio stream.

[0173] Asynchronous batch processing and "global prior cue word" injection: For ultra-large audio streams, the system initiates dual-track concurrent batch processing to alleviate the computational bottleneck of a single model. First, the first speech processing module, with extremely low computational overhead, performs a rapid global scan of the file slices to generate a "global initial rough draft". Subsequently, the system uses this global draft as a high-confidence contextual cue word and injects it in batches into the context encoder or decoder of the second speech processing module.

[0174] Waterfall-style post-processing and document output: After receiving prompts, the second speech processing module performs deep reasoning and uses the aforementioned waterfall-style post-processing and boundary deduplication logic. Finally, the system automatically formats and exports the processed text into a standard text format for users to view offline.
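The two-pass batch flow above can be sketched as follows. The model callables are stubs supplied by the caller, and the sliding `prompt_window` used to batch the global draft into per-slice prompts is an assumption of this sketch; the text only states that the draft is injected in batches.

```python
from typing import Callable, List, Sequence

FastModel = Callable[[bytes], str]       # audio slice -> rough draft
SlowModel = Callable[[bytes, str], str]  # (audio slice, prompt) -> refined text


def transcribe_file(slices: Sequence[bytes],
                    fast_model: FastModel,
                    slow_model: SlowModel,
                    prompt_window: int = 3) -> List[str]:
    """Run a fast global scan, then feed nearby drafts to the slow model."""
    # Pass 1: the fast track produces the global initial rough draft.
    drafts = [fast_model(audio) for audio in slices]

    # Pass 2: the slow track receives neighbouring drafts as a prior cue.
    results: List[str] = []
    for i, audio in enumerate(slices):
        lo = max(0, i - prompt_window)
        hi = min(len(drafts), i + prompt_window + 1)
        prompt = " ".join(drafts[lo:hi])
        results.append(slow_model(audio, prompt))
    return results
```

The waterfall post-processing and boundary deduplication would then run on `results` before export.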

[0175] 7.2 Offline Brain: Introducing Global Deep Post-processing with an Edge-based Third Language Model (3.0 Architecture)

[0176] Because the dynamic slicing mechanism may still cause slight fragmentation of the context of long and complex sentences in a very few extreme edge scenarios, and the generated text document may lack sufficient logical coherence in paragraphs, this invention further introduces an "offline brain" architecture.

[0177] The architecture deploys an additional, independent third language model on the terminal (a large text-only language model that is specifically responsible for natural language logic processing and does not participate in the front-end speech acoustic decoding).

[0178] Global error correction and reconstruction compatible with multi-source input: The system allows text generated by the 1.0 real-time architecture or documents exported from the 2.0 offline workstation to be used as complete contextual input to the third language model. Based on the document's global macro-context, the model semantically corrects local spelling errors and word order incoherence caused by segmentation. Simultaneously, it can perform intelligent paragraph reconstruction and document summarization based on natural semantic transitions, aligning with original timestamps, and ultimately outputting a standard document with proper formatting.

[0179] It should be understood that the 2.0 offline workstation batch processing architecture (especially the global prior prompt word injection mechanism based on the fast track feeding back to the slow track) and the 3.0 offline brain architecture (introducing an independent end-side text model for global semantic reconstruction) disclosed in this section constitute an independent invention concept that complements the 1.0 real-time transcription architecture but also has independent technical value.

[0180] In practical applications, implementers may integrate the aforementioned real-time transcription, batch file processing, and global typesetting into a single system workflow, or deploy them as independent functional modules on different smart terminals (such as independent voice recorder hardware and post-processing software). All of these are equivalent replacements of the technology disclosed herein.

[0181] Section 8 Electronic Devices and Multidimensional Physical Implementation Forms Supporting Offline Dual-Track Architecture

[0182] To effectively implement the macro-architecture of the methods and systems described in Sections 1 through 7, namely "1.0 Real-time Transcription," "2.0 Offline Workstation," and "3.0 Offline Brain," this invention also provides a highly integrated electronic device and a corresponding computer-readable storage medium. This section will describe the core hardware base, software deployment form, and highly innovative physical entity embodiment of this electronic device.

[0183] 8.1 Core Hardware Foundation and Instruction Execution Basis of Electronic Devices

[0184] The electronic device of the present invention, at the hardware architecture level, effectively includes:

[0185] At least one processor: This processor can be a central processing unit (CPU), graphics processing unit (GPU), neural network processing unit (NPU), digital signal processor (DSP), or a heterogeneous collaborative array of the above processors. Its core function is to provide sufficient local computing power to drive the offline inference of the aforementioned speech processing modules at each level (fast and slow tracks) and the edge-side third language model.

[0186] A memory communicatively connected to the at least one processor: this memory includes, but is not limited to, random access memory (RAM) and non-volatile storage media (ROM / Flash). The memory stores computer instructions executable by the processor. When the computer instructions are retrieved and executed by the processor, the electronic device is able to fully execute all core steps under the aforementioned 1.0, 2.0, or 3.0 architecture, such as audio slicing, dual-track scheduling, multi-level dictionary post-processing, timestamp anchoring, and dynamic UI replacement, within a local controlled environment.

[0187] 8.2 Software Implementation Mode: Standalone Application Distribution and Deep System-Level Integration

[0188] In actual commercial deployments, the computer instruction set carried by this electronic device can exist in an extremely flexible software product form:

[0189] Standalone Application (App) Form: This system can be packaged into a standard format mobile or desktop application. Users can download, install, and run it on their existing electronic devices through common digital application distribution platforms (such as major official app stores or web servers). In this form, the system operates as an independent process, utilizing the microphone and local underlying computing power at the operating system's application layer.

[0190] Deep Integration with Native OS Services: In a more in-depth embodiment, the core logic of this invention can be directly compiled and "deeply integrated" into the underlying operating system (OS) of an electronic device (e.g., as the system's default underlying global input method core framework, or a kernel-level voice processing daemon). In this form, the invention obtains extremely high memory keep-alive priority and hardware scheduling privileges, enabling a seamless global replacement experience in any third-party host application (such as memos, communication software, or document editors).

[0191] 8.3 Reusable deployment of general-purpose mobile terminals (such as smartphones and tablets)

[0192] When the electronic device is a general-purpose mobile terminal such as a smartphone, tablet, or portable personal computer (PC), this invention, as a pure software-level system logic, does not rely on any additional physical acceleration peripherals. The system directly reuses the high-performance SoC (System-on-a-Chip) currently mounted on the general-purpose terminal. Through the aforementioned "CPU+NPU heterogeneous concurrent scheduling strategy," the existing computing resources of the terminal are effectively scheduled and released, thereby completing dual-track parallel transcription and dynamic replacement.

[0193] 8.4 Independent Edge Computing Nodes: Physical Isolation of Smart Microphone Audio Peripherals

[0194] Considering the high requirements for absolute data security in specific business scenarios, this invention provides a highly targeted proprietary hardware implementation – a smart microphone audio peripheral that integrates an independent computing chip.

[0195] In this form, the smart device is no longer a traditional "passive microphone," but rather an edge computing node with independent computing power. Its main workflow is as follows:

[0196] Hardware-level privacy physical isolation: This audio peripheral independently encapsulates a microprocessor (CPU / DSP / NPU) and built-in RAM. When the user speaks, it independently completes dynamic audio segmentation, dual-track speech transcription inference, and deep post-processing and boundary splicing of multi-level dictionaries within its own hardware-isolated environment.

[0197] One-way transmission of text data: This peripheral blocks the transmission of the original audio stream or acoustic features to the host device (such as a mobile phone or computer). It only transmits the final "finalized text characters" unidirectionally to the display area of the host device via common low-level transmission protocols such as Bluetooth, USB, or HID (Human Interface Device).

[0198] Offloading computing power and high security: Under this architecture, the CPU / RAM load on the host device (phone / PC) is effectively reduced, and because the host cannot obtain the user's original voice data over the physical link, the design provides a high level of privacy isolation.

[0199] 8.5 Cross-Device Extensions and Declaration of Computer-Readable Storage Media

[0200] It should be understood that the general-purpose smartphones, tablets, PCs, and smart microphone peripherals with independent computing power listed in this section are merely preferred physical forms to illustrate the deployment flexibility of the electronic devices of the present invention.

[0201] With the rapid development of edge computing, the Internet of Things (IoT), and wearable hardware, implementers may deploy the instruction set and architecture described in this invention in other physical carriers such as smartwatches, AR / VR spatial computing glasses, and smart vehicle cockpit hosts, or execute the various steps of this method through distributed hardware collaboration. All of these should fall within the protection scope of this invention.

[0202] In addition, the present invention also provides a computer-readable storage medium (such as optical disc, hard disk, USB flash drive, network storage space, etc.) containing instructions, which, when running on a computer or smart terminal, causes the device to execute the offline continuous speech transcription and dynamic replacement method described above.

[0203] Example 1: Preferred Engineering Example for an All-English Interactive Scenario

[0204] To enable those skilled in the art to more intuitively understand the flow mechanism of the macro-architecture of this invention in actual engineering, the following detailed process demonstration of the aforementioned "asynchronous dual-track and memory waterfall post-processing replacement" is presented in conjunction with a specific scenario of offline all-English input.

[0205] Setting application scenarios and core parameters

[0206] Target terminal and environment: A general-purpose Android smart terminal equipped with a neural network processing unit (NPU) in a local environment that is completely offline.

[0207] Model selection: The first speech processing module (fast track) adopts the open-source VOSK lightweight streaming acoustic model; the second speech processing module (slow track) adopts the open-source Whisper (WH) end-side large model based on INT8 deep quantization.

[0208] The dictionary settings: The local basic whitelist includes a large number of high-frequency and commonly used English words such as Today, I, am, going, to, University, listen, professor, lecture; the user-defined hot word library includes the exclusive word Harvard.

[0209] User speech behavior: Click the microphone and say the following in English: "Today I am going to Harvard University, (accompanied by a natural pause of about 600 milliseconds) to listen to the professor's lecture."

[0210] Full process execution steps demonstration

[0211] Phase 1: Dynamic Slicing and Global Timestamp Distribution (Frontend Processing)

[0212] When the user taps the microphone to start speaking, the system continuously acquires a clean audio digital stream. When the user says "University" and pauses, the terminal's underlying VAD (Acoustic Endpoint Detection) algorithm detects a silence interval that has reached a preset threshold (e.g., 600 milliseconds). At this point, the system performs "soft segmentation," dividing the continuous long sentence into two independent audio slices:

[0213] The first slice, Slice_1, mainly contains the audio of the first half of the sentence: "Today I am going to Harvard University". The system generates a global timestamp, Timestamp_T0, and assigns it to this slice.

[0214] The second slice, Slice_2, follows. Because of the tail smoothing delay set by the Acoustic Endpoint Detection (VAD) algorithm and the redundant design of the audio buffer, it contains residual audio from the end of the first slice as well as the second half of the sentence: "university to listen to the professor's lecture". The system assigns it a global timestamp Timestamp_T1.
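The "soft segmentation" in this phase can be sketched over per-frame speech flags. The function name, the 20 ms frame length, and the 15-second maximum slice duration are hypothetical values chosen for illustration; only the 600 ms pause threshold comes from the example. The sketch cuts on either a long enough pause or a maximum slice duration, and it does not model the buffer overlap that causes "university" to reappear in Slice_2.

```python
from typing import List, Sequence, Tuple


def slice_frames(is_speech: Sequence[bool],
                 frame_ms: int = 20,
                 pause_ms: int = 600,
                 max_slice_ms: int = 15000) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) ranges, end exclusive."""
    slices: List[Tuple[int, int]] = []
    start = 0
    silence_ms = 0
    has_speech = False
    for i, speech in enumerate(is_speech):
        if speech:
            has_speech = True
            silence_ms = 0
        else:
            silence_ms += frame_ms
        duration_ms = (i + 1 - start) * frame_ms
        # Trigger 1: a pause after speech reached the preset threshold.
        pause_hit = has_speech and silence_ms >= pause_ms
        # Trigger 2: the current slice reached its maximum duration.
        length_hit = duration_ms >= max_slice_ms
        if pause_hit or length_hit:
            if has_speech:  # drop stretches that contain no speech at all
                slices.append((start, i + 1))
            start = i + 1
            silence_ms = 0
            has_speech = False
    # Flush the trailing, still-open slice if it contains speech.
    if start < len(is_speech) and has_speech:
        slices.append((start, len(is_speech)))
    return slices
```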

[0215] Phase Two: Dual-track flow and first full-screen display of the first segment (Slice_1)

[0216] Fast track coordinate anchoring: The system distributes Slice_1 to the fast track. Due to a lack of context, the VOSK model's high-speed streaming decoding misidentifies Harvard as Harbor.

[0217] At this point, the display area first outputs a temporary draft: "Today I am going to harbor university".

[0218] The underlying input method service accurately records the coordinate range of this draft on the screen and associates it with Timestamp_T0.

[0219] Slow-track waterfall post-processing: Meanwhile, the slow-track Whisper module calls the NPU to perform deep inference on Slice_1. Because a natural pause is detected at the end of the sentence, the Whisper module triggers punctuation prediction and outputs a first-level draft containing a quantization-induced error: "Today I am going to Harbard University." (Note: a period is automatically generated at the end). Subsequently, the CPU intervenes to perform waterfall post-processing: the misspelled word "Harbard" is checked against the basic whitelist, is not found, and is tagged as OOV (out of vocabulary); the user-defined hot word mechanism then triggers a first-level intervention weight, calculates the edit distance in real time, and replaces "Harbard" with the user's custom hot word "Harvard".
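The whitelist check and hot word replacement can be sketched with a plain Levenshtein distance. The function names, the tokenizing regular expression, and the `max_distance` cutoff of 2 are assumptions of this sketch; the text states only that an edit distance is calculated, not which variant or threshold is used.

```python
import re
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_with_hotwords(text: str,
                          whitelist: Iterable[str],
                          hotwords: List[str],
                          max_distance: int = 2) -> str:
    """Replace out-of-vocabulary words with the closest user hot word."""
    known = {w.lower() for w in whitelist}

    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        if word.lower() in known:
            return word  # found in the basic whitelist: leave untouched
        # Word is OOV: look for the nearest hot word within the cutoff.
        best = min(hotwords,
                   key=lambda h: edit_distance(word.lower(), h.lower()),
                   default=None)
        if best is not None and edit_distance(word.lower(), best.lower()) <= max_distance:
            return best
        return word

    return re.sub(r"[A-Za-z']+", fix, text)
```

With the dictionary settings of this example, "Harbard" is one substitution away from "Harvard" and is therefore replaced, while punctuation and whitelisted words pass through unchanged.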

[0220] Judgment and First Full Replacement: Since Slice_1 is the "starting segment (Number One)" of this voice input, and there is no historical audio preceding it, the system directly skips the boundary deduplication (FuzzyMatch) logic and determines it as the final text "Today I am going to Harvard University.".

[0221] The replacement scheduling center uses Timestamp_T0 to lock the screen coordinates and sends a full coverage command to the underlying layer.

[0222] The display area then changes as follows: the original initial draft is completely deleted, and the latest output content is smoothly updated to: "Today I am going to Harvard University."

[0223] Phase 3: Handling and splicing the second segment (Slice_2)

[0224] Second fast-track display: Within seconds of the slow track finishing Slice_1, the fast track also completes streaming decoding of Slice_2, and the initial draft of the second segment, "university to lisent to the professors lecture" (containing typical fast-track misspellings and missing punctuation), is automatically appended to the end of the first sentence on the screen.

[0225] After this append, the display area as a whole reads: "Today I am going to Harvard University. university to lisent to the professors lecture".

[0226] The underlying layer also records the independent coordinate range of the second draft and binds it to Timestamp_T1.

[0227] Slow-track inference and FuzzyMatch deduplication and concatenation: The slow track then performs inference on Slice_2, generating the correctly spelled text: "university to listen to the professor's lecture." Before outputting it to the display area, the system extracts the tail features from the cache of the previous segment (Slice_1) and compares them with the head features of the current segment (Slice_2). The system identifies the redundant word "university" at the junction and removes it from the beginning of Slice_2. The final concatenated version of Slice_2 is then generated: "to listen to the professor's lecture."
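The boundary comparison can be sketched as a word-level overlap check between the tail of the previous final text and the head of the current one. This simplification uses exact matching after case and punctuation normalization, whereas the embodiment describes a fuzzy match; the function names and the `max_overlap` limit of 5 words are hypothetical.

```python
import re


def _norm(word: str) -> str:
    """Lowercase a token and strip punctuation other than apostrophes."""
    return re.sub(r"[^\w']", "", word).lower()


def dedupe_boundary(prev_final: str, current: str, max_overlap: int = 5) -> str:
    """Drop leading words of `current` that repeat the tail of `prev_final`."""
    prev_words = [w for w in (_norm(t) for t in prev_final.split()) if w]
    cur_tokens = current.split()
    cur_words = [_norm(t) for t in cur_tokens]
    # Try the longest candidate overlap first, then shrink.
    longest = min(max_overlap, len(prev_words), len(cur_words))
    for n in range(longest, 0, -1):
        if prev_words[-n:] == cur_words[:n]:
            return " ".join(cur_tokens[n:])
    return current  # no overlap found: keep the slice unchanged
```

Applied to the two slices of this example, the repeated "university" at the head of Slice_2 is removed and the remainder is returned for display.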

[0228] Relay full replacement: The system carries Timestamp_T1 to lock the coordinates of the second initial draft on the display area, and performs a second low-level full silent deletion and new text writing.

[0229] Phase Four: Summary of Visual Effects and Display Area Presentation

[0230] Through the aforementioned timing relay, users can observe a smooth visual replacement process within the terminal's display area:

[0231] The screen first displays the initial draft: "Today I am going to harbor university".

[0232] The draft is quickly and completely replaced with a high-precision final version that includes a period: "Today I am going to Harvard University."

[0233] The screen then appends a second initial draft after the period: "university to lisent..."

[0234] The draft of the second half of the sentence is completely overwritten and replaced with the final, stitched text: "to listen to the professor's lecture."

[0235] Finally, based on the actual slice time series, large model punctuation prediction, and full replacement mechanism, the final stable display of complete text within the display area is:

[0236] "Today I am going to Harvard University. to listen to the professor's lecture."

[0237] This walkthrough objectively and completely reflects the real engineering behavior of the offline dual-track dynamic replacement architecture when an end-side large model is combined with physical slicing.

[0238] Special Note on Engineering Phenomena in Examples:

[0239] In the final text presented above ("Today I am going to Harvard University. to listen to the professor's lecture."), the extra period in the sentence is an objective punctuation feature automatically generated by the second speech processing module when it performs independent reasoning on the first slice (Slice_1) and captures speech pauses. This is the objective engineering status under the combination of the large-scale end-side model and physical slice processing.

[0240] To address this issue, implementers could certainly introduce a cross-slice punctuation smoothing algorithm to eliminate it before the finalized version of the next audio segment (Slice_2) is displayed. However, this would significantly increase the complexity of the system's low-level control and cross-slice scheduling, for the following reasons:

[0241] First, the types of punctuation marks in natural language are complex, and the standards for judging their legality are extremely high;

[0242] Secondly, the text and punctuation mark in the first segment have already been completely replaced on the interface and are stably residing in the terminal display area. If the symbol is to be deleted across segments, the system must reverse-search and locate the precise screen coordinate range of the symbol at the end of the first segment, and issue an additional low-level targeted deletion command to the operating system. This will significantly increase the computational overhead and UI redrawing burden during the rapid input process.

[0243] Thirdly (core risk), due to limitations in edge computing power, the current model cannot fully and accurately distinguish whether the punctuation mark originates from a user's actual speech pause (i.e., a legitimate sentence segment with actual semantic transition logic) or a redundant symbol that the model infers and adds on its own due to the physical segmentation of audio.

[0244] In summary, to mitigate the engineering risks of "accidentally deleting valid punctuation" and ensure the simplicity and robustness of the overall UI replacement framework, this embodiment objectively presents this phenomenon. It is recommended that implementers, during specific commercial deployments, conduct comprehensive consideration and in-depth testing, taking into account specific language habits and the model's punctuation prediction tendencies, to properly address such minor visual imperfections caused by physical slicing.

[0245] The embodiments described above are merely several exemplary implementations of the technical concept of this invention, providing specific and detailed descriptions intended to fully disclose the invention, and not to limit the scope of protection of the invention in any way. The core concept of this invention has broad applicability. Those skilled in the art, based on an understanding of the core concept and basic principles of this invention, can make various equivalent substitutions, modifications, combinations, improvements, or functional extensions to the technical solutions described herein. Any modification, variation, or equivalent implementation that incorporates the core technical features of this invention and does not depart from the spirit and scope of protection of this invention should be considered to fall within the scope of protection claimed by this invention. Therefore, the actual scope of protection of this patent will ultimately be defined by the appended claims and their equivalents.

Claims

1. An offline continuous speech transcription and dynamic replacement method, applied to a terminal with local computing capabilities, wherein the method operates independently in a local environment without data communication with an external server, characterized in that the method includes the following steps: a. Audio acquisition: in response to a voice input command, acquiring audio stream data locally on the terminal; b. First-level transcription: inputting the audio stream data to a locally deployed first speech processing module to generate a first transcribed text with a first response delay, and outputting the first transcribed text to the display area of the terminal; c. Second-level transcription: inputting the audio stream data to a locally deployed second speech processing module to generate a second transcribed text with a second response delay, wherein the second response delay is greater than the first response delay; d. Comparison and replacement: based on the second transcribed text, comparing and replacing the portion of the display area corresponding to the first transcribed text, so as to output the final continuous transcribed text.
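As a non-limiting illustration of steps b to d, the dual-track flow can be sketched as follows. The two coroutines are hypothetical stand-ins for the first and second speech processing modules (their names, delays, and returned strings are assumptions for illustration only), and the display area is modeled as a plain list. A real deployment would likely run the models in worker threads or processes rather than simulate latency with a sleep.

```python
import asyncio
from typing import List, Tuple


async def fast_track(audio: bytes) -> str:
    # Hypothetical first speech processing module: short delay, rough draft.
    await asyncio.sleep(0.01)
    return "helo world"


async def deep_track(audio: bytes) -> str:
    # Hypothetical second speech processing module: longer delay, final text.
    await asyncio.sleep(0.05)
    return "hello world"


async def transcribe(audio: bytes, display: List[str], log: List[str]) -> None:
    # Both tracks start on the same audio at the same time.
    fast = asyncio.create_task(fast_track(audio))
    deep = asyncio.create_task(deep_track(audio))

    display.append(await fast)      # step b: the draft is shown first
    slot = len(display) - 1
    log.append(display[slot])

    display[slot] = await deep      # step d: the draft is replaced in place
    log.append(display[slot])


def demo() -> Tuple[List[str], List[str]]:
    display: List[str] = []
    log: List[str] = []
    asyncio.run(transcribe(b"\x00" * 320, display, log))
    return display, log
```

The `log` list records what the display slot contained over time, which makes the draft-then-replace order easy to check.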

2. The method according to claim 1, characterized in that step a specifically includes: dividing the acquired audio stream data into multiple audio slices.

3. The method according to claim 2, characterized in that the triggering conditions for dividing the audio stream data into multiple audio slices include at least one of the following: a first trigger condition: the duration of a detected voice pause reaches a first preset time threshold; a second trigger condition: the duration of the current audio slice reaches a second preset time threshold.
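As a non-limiting illustration, the two trigger conditions can be sketched over a sequence of per-frame voice activity flags. The function name, the frame length, and the default threshold values are assumptions chosen for illustration, and the voice activity flags are assumed to come from some upstream detector that is not shown.

```python
from typing import List


def slice_frames(voiced: List[bool], frame_ms: int = 20,
                 pause_ms: int = 400, max_slice_ms: int = 10_000) -> List[range]:
    """Return one frame-index range per audio slice."""
    slices: List[range] = []
    start = 0        # first frame of the slice being built
    silence_ms = 0   # length of the current run of unvoiced frames
    for i, is_voiced in enumerate(voiced):
        silence_ms = 0 if is_voiced else silence_ms + frame_ms
        length_ms = (i - start + 1) * frame_ms
        pause_hit = silence_ms >= pause_ms       # first trigger condition
        length_hit = length_ms >= max_slice_ms   # second trigger condition
        if pause_hit or length_hit:
            slices.append(range(start, i + 1))
            start, silence_ms = i + 1, 0
    if start < len(voiced):
        slices.append(range(start, len(voiced)))  # flush the remainder
    return slices
```

This sketch also cuts slices inside long stretches of pure silence; a practical implementation would probably discard slices that contain no voiced frames.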

4. The method according to claim 1, characterized in that the first speech processing module is an acoustic feature extraction model based on a probabilistic statistical model or a lightweight neural network, wherein the lightweight neural network includes a recurrent neural network, a convolutional neural network, or a streaming decoding network model containing a self-attention mechanism; and the second speech processing module is a neural network model that performs global deep semantic modeling and contains an attention mechanism.

5. The method according to claim 1, characterized in that, after step c and before step d, the method further includes a step of mapping the second transcribed text against a lexicon: calling the locally deployed preset basic dictionary and user-defined dictionary; comparing the second transcribed text generated in step c with the preset basic dictionary and the user-defined dictionary character by character or word by word; and, if a match is found, replacing the corresponding character or word sequences in the second transcribed text with the matched words according to preset weight allocation logic, the replaced text being used as the final second transcribed text for performing step d.
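As a non-limiting illustration, the lexicon mapping step can be sketched as follows. The claim does not specify the weight allocation logic, so this sketch assumes that each dictionary maps a misrecognized string to its preferred form and that the user-defined dictionary carries a higher weight than the basic one; the function name and the weight values are hypothetical.

```python
from typing import Dict, Tuple


def apply_lexicons(text: str, basic: Dict[str, str], user: Dict[str, str],
                   basic_weight: int = 1, user_weight: int = 2) -> str:
    # Merge both dictionaries, keeping the highest-weight entry for each key.
    merged: Dict[str, Tuple[int, str]] = {}
    for weight, lexicon in ((basic_weight, basic), (user_weight, user)):
        for wrong, right in lexicon.items():
            if not wrong:
                continue  # an empty key would match everywhere
            if wrong not in merged or weight > merged[wrong][0]:
                merged[wrong] = (weight, right)
    # Replace longer keys first so a short key cannot split a longer match.
    for wrong in sorted(merged, key=len, reverse=True):
        text = text.replace(wrong, merged[wrong][1])
    return text
```

Because replacements are applied one after another, the output of one replacement could in principle be matched by a later key. A single-pass matcher would avoid this, at the cost of a longer example.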

6. The method according to claim 1, characterized in that, in step d, the operation of replacing the corresponding portion of the display area is implemented through at least one of the following methods: a first method, full coverage: sending a delete or clear command to the display area of the terminal to delete all character sequences corresponding to the first transcribed text, and outputting the second transcribed text; a second method, differential erasure: locating the differing character segments between the first transcribed text and the second transcribed text, and performing targeted deletion and new character writing only on the differing character segments.
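As a non-limiting illustration of differential erasure, the sketch below finds a single differing span by trimming the common prefix and common suffix of the two texts. This is one simple way to locate the difference and is an assumption made for illustration; the function names are hypothetical, and a finer-grained diff could produce several smaller spans instead of one.

```python
from typing import Tuple


def diff_span(draft: str, final: str) -> Tuple[int, int, str]:
    """Return (offset, delete_count, insert_text) that turns draft into final."""
    limit = min(len(draft), len(final))
    prefix = 0
    while prefix < limit and draft[prefix] == final[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and draft[len(draft) - 1 - suffix] == final[len(final) - 1 - suffix]):
        suffix += 1
    delete_count = len(draft) - prefix - suffix
    insert_text = final[prefix:len(final) - suffix]
    return prefix, delete_count, insert_text


def apply_span(draft: str, span: Tuple[int, int, str]) -> str:
    # Stand-in for the targeted delete-and-write commands sent to the display.
    offset, delete_count, insert_text = span
    return draft[:offset] + insert_text + draft[offset + delete_count:]
```

For the draft "the whether is nice" and the final text "the weather is nice", only two characters are deleted and two are written, instead of the whole line being cleared.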

7. The method according to claim 2, characterized in that the process of generating the final continuous transcribed text further includes a boundary deduplication step for adjacent audio slices: extracting the tail character sequence of the transcribed text corresponding to the adjacent previous audio slice, and the head character sequence of the transcribed text corresponding to the current audio slice; calculating the character overlap between the tail character sequence and the head character sequence; and, when the character overlap meets a preset overlap condition, removing the redundant overlapping character sequence from the final continuous transcribed text.
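As a non-limiting illustration, boundary deduplication can be sketched as a search for the longest tail of the previous text that equals the head of the current text. The window size and the minimum overlap length are assumed values standing in for the preset overlap condition, and exact character equality is used here, although a similarity measure could be substituted.

```python
def merge_slices(previous: str, current: str,
                 window: int = 12, min_overlap: int = 2) -> str:
    """Join two adjacent slice texts, dropping a duplicated boundary run."""
    tail = previous[-window:] if window > 0 else ""
    head = current[:window]
    # Try the longest candidate overlap first, down to the minimum length.
    for size in range(min(len(tail), len(head)), min_overlap - 1, -1):
        if size > 0 and tail[-size:] == head[:size]:
            return previous + current[size:]
    return previous + current
```

The minimum overlap length keeps a coincidental single-character match, such as "cat" followed by "table", from being treated as a duplicate.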

8. The method according to claim 2, characterized in that the method further includes a user interface interaction step based on the computing load status: monitoring in real time the number of audio slices to be processed in the second speech processing module; when the number of audio slices to be processed exceeds a preset queue backlog threshold, triggering a backlog prompt message and outputting it on the terminal interface; in response to a voice input end command, if there are still audio slices to be processed in the second speech processing module, dynamically outputting processing progress prompt information on the terminal interface; and, once all audio slices to be processed have completed the corresponding comparison and replacement operations, outputting on the terminal interface a status indicator showing that processing is complete.
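As a non-limiting illustration, the interface linkage can be reduced to a small decision function over the queue length and the input state. The status names, the function name, and the default threshold are assumptions for illustration; the claim itself only requires that the corresponding prompts be output.

```python
from enum import Enum


class UiStatus(Enum):
    NORMAL = "normal"        # no prompt needed
    BACKLOG = "backlog"      # backlog prompt message
    FINISHING = "finishing"  # processing progress prompt after input ends
    DONE = "done"            # processing-complete indicator


def ui_status(pending: int, input_ended: bool,
              backlog_threshold: int = 3) -> UiStatus:
    """Map the second module's queue length to a UI state."""
    if input_ended:
        return UiStatus.FINISHING if pending > 0 else UiStatus.DONE
    if pending > backlog_threshold:
        return UiStatus.BACKLOG
    return UiStatus.NORMAL
```

A caller would evaluate this function whenever a slice enters or leaves the second module's queue and update the interface only when the returned state changes.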

9. The method according to claim 1, characterized in that the terminal with local computing capabilities includes at least one of the following devices: a smartphone, a computer device, or a microphone audio peripheral with an independent processing chip.

10. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores computer instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the electronic device to perform the method of any one of claims 1 to 9.