Low-latency dialogue generation method and related device thereof

By introducing static interval rules and phoneme mapping tables during the dialogue generation process, the text and phoneme sequence processing is optimized, solving the problems of dialogue generation latency and stability in existing technologies, and achieving low-latency and efficient dialogue generation.

CN122201251A · Pending · Publication Date: 2026-06-12 · SHENZHEN STORYTELLING TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENZHEN STORYTELLING TECH CO LTD
Filing Date: 2026-03-06
Publication Date: 2026-06-12

Application Information

Patent Timeline

06 Mar 2026: Application
12 Jun 2026: Publication (CN122201251A)

IPC: G10L13/08

AI Tagging

Application Domain

Speech synthesis

Abstract

The application provides a low-latency dialogue generation method and a related device. The method includes: obtaining input request information; determining corresponding target text information based on the input request information; performing interval insertion processing on a sequence corresponding to the target text information based on a preset static interval rule to obtain a text optimization sequence; converting the text optimization sequence into a phoneme sequence; performing mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result. Introducing a static interval rule and a phoneme mapping table into the dialogue generation process provides structural control and deterministic mapping, which reduces real-time computation overhead and dialogue generation delay and improves the efficiency and stability of dialogue generation.

Description

Technical Field

[0001] This invention relates to the field of intelligent speech generation, and more particularly to a low-latency dialogue generation method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the development of artificial intelligence technology, dialogue generation technology has been widely applied in intelligent voice assistants, human-computer interaction devices, smart homes, and various embedded terminals. These devices typically need to generate and output dialogue content in real time based on user-input text or voice requests to achieve a natural interactive experience. Especially in application scenarios such as real-time dialogue and continuous interaction, the response speed and stability of dialogue generation become crucial factors affecting user experience.

[0003] Existing dialogue generation solutions typically rely on complex text generation and speech synthesis models, requiring extensive real-time inference and computation during the generation process. This leads to high dependence on system computing power and storage resources, making it difficult to operate efficiently on resource-constrained terminal devices. Furthermore, current technologies mostly employ dynamic generation methods for text-to-speech processing, lacking effective control over the generated sequence structure. This can easily introduce unnecessary computational redundancy, resulting in increased dialogue generation latency. Moreover, in continuous dialogue or multi-turn interaction scenarios, existing technologies struggle to deterministically schedule the generation process, impacting the real-time performance and consistency of dialogue output.

[0004] To address the aforementioned issues, this application proposes a low-latency dialogue generation method. By introducing static interval rules into the dialogue generation process, the sequence corresponding to the target text is structurally optimized. Furthermore, a pre-defined phoneme mapping table is used to map the phoneme sequence, transforming the real-time computation process into a controllable sequence processing and table lookup mapping process. This reduces the computational complexity and response latency in the dialogue generation process. This application aims to solve the problems of real-time performance, stability, and terminal adaptability in existing dialogue generation technologies, and improve the application effect of dialogue generation in real-time interaction and resource-constrained scenarios.

Summary of the Invention

[0005] This invention provides a low-latency dialogue generation method to solve the problems of existing dialogue generation methods in terms of real-time performance, stability, and terminal adaptability.

[0006] In a first aspect, the present invention provides a low-latency dialogue generation method, the method comprising the following steps: obtaining input request information; determining corresponding target text information based on the input request information; performing interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; converting the optimized text sequence into a phoneme sequence; mapping the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result.

[0007] Optionally, obtaining the input request information includes: obtaining an input signal used to trigger dialogue generation; and determining the input request information based on the input signal, wherein the input signal includes a text input signal, a voice input signal, and/or a command input signal.

[0008] Optionally, determining the corresponding target text information based on the input request information includes: parsing the input request information to determine at least one keyword; determining the text length and/or number of sentences of the target text information based on the keyword; and generating the target text information based on the determined text length and/or number of sentences.

[0009] Optionally, performing interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence includes: determining at least one interval insertion position in the sequence corresponding to the target text information; inserting an interval marker at the interval insertion position according to the preset static interval rule to obtain a candidate text optimization sequence; and determining priorities based on the text elements in the candidate text optimization sequence and sorting the candidate text optimization sequence according to the priorities to obtain the optimized text sequence.

[0010] Optionally, converting the optimized text sequence into a phoneme sequence includes: performing phonetic analysis on the optimized text sequence to determine the phoneme corresponding to each text element in the optimized text sequence; and generating the phoneme sequence corresponding to the optimized text sequence based on the order of the text elements in the optimized text sequence.

[0011] Optionally, mapping the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence includes: determining at least one phoneme to be mapped based on the order relationship in the phoneme sequence; querying the preset phoneme mapping table for the speech unit corresponding to the phoneme to be mapped; and arranging the obtained speech units in the order of the phoneme sequence to obtain the speech unit sequence.

[0012] Optionally, performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result includes: determining the synthesized segment corresponding to at least one speech unit in the speech unit sequence; splicing and integrating the synthesized segments in the order of the speech unit sequence to generate a dialogue synthesis result; and outputting the dialogue synthesis result to obtain the dialogue output result.

[0013] In a second aspect, the present invention also provides a low-latency dialogue generation apparatus, comprising: a first acquisition module, used to acquire input request information; a first determining module, used to determine the corresponding target text information based on the input request information; a first processing module, used to perform interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; a first conversion module, used to convert the optimized text sequence into a phoneme sequence; a second processing module, used to perform mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and a first synthesis module, used to perform dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result.

[0014] In a third aspect, the present invention provides an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the low-latency dialogue generation method provided by the present invention.

[0015] In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the low-latency dialogue generation method provided by the present invention.

[0016] This invention acquires input request information; determines corresponding target text information based on the input request information; performs interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; converts the optimized text sequence into a phoneme sequence; performs mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and performs dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result. By introducing static interval rules and a phoneme mapping table into the dialogue generation process, structured control and deterministic mapping of the dialogue generation process are achieved, reducing real-time computation overhead, thereby reducing dialogue generation latency and improving dialogue generation efficiency and stability.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of a low-latency dialogue generation method provided in an embodiment of the present invention; Figure 2 is a schematic diagram of a low-latency dialogue generation device provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] As shown in Figure 1, which is a flowchart of a low-latency dialogue generation method provided by an embodiment of the present invention, the low-latency dialogue generation method includes the following steps:

101. Obtain input request information.

[0021] In this embodiment of the invention, the low-latency dialogue generation method described above can be applied to a low-latency dialogue generation platform. The platform provides functions such as processing, sending and receiving, and storing low-latency dialogue generation data. It can be built on a server or server cluster, where the server or server cluster can be any electronic device capable of processing low-latency dialogue generation data.

[0022] The aforementioned request information can refer to input data used to trigger a dialogue generation process. Generally, it can originate from user text input, text information converted from voice input, or instruction information triggered by preset interaction logic. In practical applications, request information is used to characterize the triggering conditions and basic requirements for dialogue generation.

[0023] In one possible embodiment, the aforementioned low-latency dialogue generation platform can generate corresponding input request information after detecting a user wake-up command, key operation, or external event trigger signal.

[0024] 102. Based on the input request information, determine the corresponding target text information.

[0025] In this embodiment of the invention, the aforementioned target text information may refer to text content determined based on request information and used for subsequent dialogue generation processing. The target text information may be complete dialogue response text, or text content that has undergone length control, structural constraints, or scene adaptation, to balance expressive effect and processing efficiency.

[0026] When determining the target text information, the aforementioned low-latency dialogue generation platform can parse the input request information, such as extracting keywords and identifying interaction intents, and generate the target text information in conjunction with the current interaction scenario. In one implementation, the platform limits the text length or number of sentences of the target text information based on the number of keywords or the complexity of the interaction to avoid generating excessively long text, thereby reducing subsequent processing overhead.

[0027] In another implementation, the aforementioned low-latency dialogue generation platform can differentiate generation strategies based on dialogue type, such as generating short text containing feedback words in dialogue response scenarios and generating narrative text in storytelling scenarios.

[0028] 103. Based on the preset static interval rules, the sequence corresponding to the target text information is subjected to interval insertion processing to obtain the optimized text sequence.

[0029] In this embodiment of the invention, the aforementioned preset static interval rule may refer to a set of rules predetermined before the dialogue is generated, which is used to limit the position and manner of the interval insertion in the corresponding text sequence.

[0030] Specifically, the aforementioned preset static interval rules can remain fixed during the generation process to reduce the uncertainty caused by dynamic calculations and to provide structured control over the generation rhythm.

[0031] The aforementioned sequence can refer to a sequential data structure obtained from parsing target text information, used to represent the arrangement order of the constituent elements in the text. Generally, the above sequence can be composed of characters, terms, sub-words, or other text elements, and serves as the basis for performing interval insertion processing and subsequent transformation processing.

[0032] The aforementioned interval insertion process refers to the process of inserting interval markers at at least one position in a sequence according to a preset static interval rule. This process can optimize the text structure and enable better rhythm control in the subsequent generation process.

[0033] The aforementioned optimized text sequence can refer to a sequence structure formed by performing interval insertion processing and priority adjustment on the original sequence. This optimized text sequence is structurally more suitable for performing pronunciation parsing and synthesis processing, helping to reduce the complexity of subsequent processing.

[0034] In one possible embodiment, when performing interval insertion processing, the aforementioned low-latency dialogue generation platform parses the target text information into a sequential form with an ordered relationship and determines the appropriate position for inserting the interval based on preset static interval rules. For example, interval markers are inserted at the beginning of sentences, between sentences, or at semantic boundaries to control the subsequent generation rhythm.
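The interval insertion in step 103 can be illustrated with a minimal sketch. The marker token `<gap>`, the rule names, and the whitespace tokenization below are assumptions made for illustration; the application does not specify them.

```python
import re

# Hypothetical marker and rule table, fixed before any dialogue is generated.
INTERVAL_MARKER = "<gap>"
STATIC_INTERVAL_RULES = {
    "sentence_start": True,     # insert a marker before the first sentence
    "between_sentences": True,  # insert a marker between consecutive sentences
}


def split_sentences(text):
    """Split after sentence-final punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def insert_intervals(text, rules=STATIC_INTERVAL_RULES):
    """Return the element sequence with interval markers added per the fixed rules."""
    sequence = []
    for index, sentence in enumerate(split_sentences(text)):
        if index == 0 and rules.get("sentence_start"):
            sequence.append(INTERVAL_MARKER)
        elif index > 0 and rules.get("between_sentences"):
            sequence.append(INTERVAL_MARKER)
        sequence.extend(sentence.split())  # whitespace tokens stand in for text elements
    return sequence
```

Because the rule table is a constant, no per-request decision about pause placement is computed, which matches the "static" property the text describes.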

[0035] In another possible implementation, the aforementioned low-latency dialogue generation platform can also prioritize text elements in the text optimization sequence to distinguish between key and non-key information, thereby providing a scheduling basis for subsequent processing.

[0036] 104. Convert the optimized text sequence into a phoneme sequence.

[0037] In this embodiment of the invention, the aforementioned phoneme sequence may refer to the phoneme arrangement result obtained by phoneme analysis of the optimized text sequence. It is understood that the aforementioned phoneme sequence inherits the order relationship in the optimized text sequence and is used to characterize the structured representation of the target text at the phonetic level.

[0038] During the conversion process, the aforementioned low-latency dialogue generation platform performs phonetic parsing on the optimized text sequence, mapping text elements to corresponding phonemes while maintaining the order of the optimized text sequence. This ensures that the generated phoneme sequence is structurally consistent with the optimized text sequence. In one possible embodiment, the interval markers included in the optimized text sequence can be used to generate pause phonemes or control markers during the conversion process to reflect rhythmic changes in subsequent synthesis stages.
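A minimal sketch of this conversion follows, assuming a tiny hypothetical lexicon, a `<gap>` interval marker, and a `SIL` pause phoneme. None of these names or symbols come from the application, and a real system would use a full pronunciation dictionary or a grapheme-to-phoneme component.

```python
# Hypothetical pronunciation lexicon with illustrative phoneme symbols.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
INTERVAL_MARKER = "<gap>"   # assumed marker inserted in step 103
PAUSE_PHONEME = "SIL"       # assumed pause phoneme
UNKNOWN_PHONEME = "UNK"     # assumed fallback for out-of-lexicon elements


def to_phonemes(sequence):
    """Convert text elements to phonemes in order; markers become pause phonemes."""
    phonemes = []
    for element in sequence:
        if element == INTERVAL_MARKER:
            phonemes.append(PAUSE_PHONEME)
        else:
            phonemes.extend(LEXICON.get(element.lower(), [UNKNOWN_PHONEME]))
    return phonemes
```

The loop never reorders elements, so the phoneme sequence stays structurally consistent with the optimized text sequence.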

[0039] 105. Based on the preset phoneme mapping table, the phoneme sequence is mapped to obtain the speech unit sequence.

[0040] In this embodiment of the invention, the aforementioned preset phoneme mapping table can refer to a pre-established and stored mapping relationship table used to describe the correspondence between phonemes and speech units. The aforementioned preset phoneme mapping table can support the mapping processing of phonemes to speech units through a lookup table, thereby reducing real-time computational overhead.

[0041] In one possible embodiment, during the mapping process, the aforementioned low-latency dialogue generation platform can query the phoneme mapping table one by one according to the order relationship in the phoneme sequence to determine the corresponding speech unit, and then combine the queried speech units in order to form a speech unit sequence.
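The table lookup in step 105 might look like the sketch below. The `SpeechUnit` fields and the table contents are hypothetical placeholders, since the application does not define what a speech unit contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechUnit:
    unit_id: int       # hypothetical identifier of a stored unit
    duration_ms: int   # hypothetical nominal duration


# Hypothetical preset mapping table, built once and only read at run time.
PHONEME_MAP = {
    "SIL": SpeechUnit(0, 120),
    "HH": SpeechUnit(1, 60),
    "AH": SpeechUnit(2, 80),
}


def map_phonemes(phonemes, table=PHONEME_MAP):
    """Look up each phoneme in order and return the matching speech units."""
    units = []
    for phoneme in phonemes:
        unit = table.get(phoneme)
        if unit is None:
            raise KeyError(f"phoneme {phoneme!r} is not in the mapping table")
        units.append(unit)
    return units
```

Each lookup is a constant-time dictionary access, so the cost of this step grows linearly with the length of the phoneme sequence and involves no model inference.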

[0042] 106. Perform dialogue synthesis processing based on speech unit sequences to generate dialogue output results.

[0043] In this embodiment of the invention, the above-mentioned speech unit sequence may refer to the speech unit arrangement result obtained by mapping the phoneme sequence, which can maintain the order relationship of the phoneme sequence and be used to drive subsequent dialogue synthesis processing.

[0044] The aforementioned dialogue output results can refer to the dialogue content results finally generated by the aforementioned low-latency dialogue generation platform. Generally, they can be presented in the form of voice or provided in the form of synthesized results that can be used for voice output, and are suitable for real-time interaction and multi-turn dialogue application scenarios.

[0045] When performing dialogue synthesis processing, the aforementioned low-latency dialogue generation platform can obtain the synthesized segments corresponding to each speech unit in the speech unit sequence, and splice and integrate the synthesized segments according to the order of the speech unit sequence to generate a complete dialogue synthesis result.
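Splicing can be sketched as ordered concatenation of pre-stored segments. The segment bank below is a hypothetical stand-in: real segments would be audio samples, and short byte strings are used here only to keep the example self-contained.

```python
# Hypothetical bank of pre-rendered segments keyed by speech unit ID.
SEGMENT_BANK = {
    0: b"\x00\x00",  # stands in for silence
    1: b"\x01\x02",
    2: b"\x03\x04",
}


def splice_segments(unit_ids, bank=SEGMENT_BANK):
    """Concatenate the stored segment for each unit, preserving sequence order."""
    return b"".join(bank[unit_id] for unit_id in unit_ids)
```

A production system would likely also smooth the joins between segments; that is outside what this paragraph describes and is not shown.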

[0046] Specifically, based on the priority information determined in the text optimization sequence, higher-priority synthesized segments can be generated or output first, thereby improving response speed in real-time interactive scenarios. The generated dialogue output can be provided in speech form or as a synthesized result that can be used for speech output.

[0047] In this embodiment of the invention, input request information is acquired; based on the input request information, the corresponding target text information is determined; based on a preset static interval rule, the sequence corresponding to the target text information is subjected to interval insertion processing to obtain an optimized text sequence; the optimized text sequence is converted into a phoneme sequence; based on a preset phoneme mapping table, the phoneme sequence is mapped to obtain a speech unit sequence; and based on the speech unit sequence, dialogue synthesis processing is performed to generate a dialogue output result. By introducing static interval rules and a phoneme mapping table in the dialogue generation process, structured control and deterministic mapping of the dialogue generation process are achieved, reducing real-time computational overhead, thereby reducing dialogue generation latency and improving dialogue generation efficiency and stability.
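Steps 101 to 106 can be chained as in the following end-to-end sketch. Every table, symbol, and the canned-reply lookup used for step 102 is a hypothetical placeholder chosen to keep the example small; it is not the application's implementation.

```python
INTERVAL_MARKER = "<gap>"
CANNED_REPLIES = {"weather": "Sunny today"}          # stands in for text generation
LEXICON = {"sunny": ["S", "AH", "N", "IY"], "today": ["T", "AH", "D", "EY"]}
PHONEME_TO_UNIT = {"SIL": 0, "S": 1, "AH": 2, "N": 3, "IY": 4, "T": 5, "D": 6, "EY": 7}
UNIT_SEGMENTS = {unit: bytes([unit]) for unit in PHONEME_TO_UNIT.values()}


def generate_dialogue(request):
    """Run the six steps on one request and return the spliced output bytes."""
    # Steps 101-102: take the request and pick the target text by keyword.
    lowered = request.lower()
    target = next((reply for key, reply in CANNED_REPLIES.items() if key in lowered), "")
    # Step 103: static rule, one marker at the start of the reply.
    optimized = [INTERVAL_MARKER] + target.split() if target else []
    # Step 104: text elements to phonemes, markers to a pause phoneme.
    phonemes = []
    for element in optimized:
        phonemes += ["SIL"] if element == INTERVAL_MARKER else LEXICON[element.lower()]
    # Step 105: table lookup from phonemes to speech units.
    units = [PHONEME_TO_UNIT[phoneme] for phoneme in phonemes]
    # Step 106: splice the stored segment of each unit in order.
    return b"".join(UNIT_SEGMENTS[unit] for unit in units)
```

After step 102, every stage is a fixed rule or a dictionary lookup, which is the property the paragraph credits for the reduced real-time computation.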

[0048] Optionally, in the step of obtaining input request information, an input signal used to trigger dialogue generation may also be obtained; and the input request information may be determined based on the input signal.

[0049] In this embodiment of the invention, the aforementioned input signal may refer to trigger data used to indicate the need for dialogue generation, wherein the aforementioned input signal may include, but is not limited to, one or more of text input signals, voice input signals, and command input signals.

[0050] The aforementioned text input signals can be used to directly express the user's dialogue needs; the aforementioned voice input signals can be used to support natural voice interaction scenarios; and the aforementioned command input signals can be used to trigger preset dialogue logic or functional responses. By uniformly acquiring and parsing the input signals, the aforementioned low-latency dialogue generation platform can generate input request information in a consistent manner under different interaction scenarios, avoiding the need to design independent processing flows for different input methods. This simplifies the overall dialogue generation architecture, reduces processing complexity, and helps to shorten the response time of dialogue generation, thereby improving the real-time interactive experience.

[0051] During operation, the aforementioned low-latency dialogue generation platform can acquire corresponding input signals by monitoring user interaction behaviors or external triggering events, and determine the input request information based on the input signals. In practical applications, upon detecting an input signal, the platform generates input request information matching the current interaction scenario based on the type and content of the input signal, thereby triggering the subsequent dialogue generation process.

[0052] For example, when a user inputs text content, the aforementioned low-latency dialogue generation platform can directly generate input request information based on the text content; when a user issues a voice command, the aforementioned low-latency dialogue generation platform can generate input request information based on the text information obtained by converting the voice input signal; in specific application scenarios, the aforementioned low-latency dialogue generation platform can also generate input request information based on a preset command input signal to trigger specific types of dialogue generation tasks.
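The three cases above can be sketched as one normalization function that turns any input signal into the same request structure. The `InputRequest` type, the command table, and the `transcribe` callback are hypothetical; speech recognition itself is assumed to be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputRequest:
    source: str  # "text", "voice", or "command"
    text: str    # normalized request text


# Hypothetical preset commands and the request text each one triggers.
COMMAND_PROMPTS = {"PLAY_STORY": "Tell me a story."}


def build_request(kind, payload, transcribe=None):
    """Build a uniform InputRequest from a text, voice, or command signal."""
    if kind == "text":
        return InputRequest("text", payload.strip())
    if kind == "voice":
        if transcribe is None:
            raise ValueError("a transcribe callback is required for voice input")
        return InputRequest("voice", transcribe(payload).strip())
    if kind == "command":
        return InputRequest("command", COMMAND_PROMPTS[payload])
    raise ValueError(f"unsupported input signal type: {kind!r}")
```

Downstream steps then depend only on `InputRequest.text`, so no separate processing flow is needed per input method.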

[0053] By introducing input signals as the trigger for dialogue generation, the aforementioned low-latency dialogue generation platform can adapt to various interaction methods, improving the flexibility and applicability of the dialogue generation process.

[0054] Optionally, the step of determining the corresponding target text information based on the input request information further includes parsing the input request information to determine at least one keyword; determining the text length and / or number of sentences of the target text information based on the keyword; and generating the target text information based on the text length and / or number of sentences of the target text information.

[0055] In this embodiment of the invention, the aforementioned keywords may refer to core terms determined from the input request information that characterize the dialogue topic or interaction intent. They can be used to distinguish different dialogue scenarios, such as dialogue response scenarios or storytelling scenarios, and serve as an important basis for determining the scale and structure of text generation.

[0056] The aforementioned text length can refer to the number of characters, terms, or text size contained in the target text information. It can be used to constrain the overall size of the target text information to avoid generating excessively long texts that would increase the computational burden.

[0057] The number of sentences mentioned above can refer to the number of sentences or semantic units contained in the target text information. It can be used to control the structural complexity of the target text information so that the generated text content matches the current interactive scenario.

[0058] In one possible embodiment, the aforementioned low-latency dialogue generation platform can perform semantic and structural analysis on the input request information during operation, extract content features that reflect the dialogue intent, and determine at least one keyword based on those content features. It then determines the text length and number of sentences of the target text information based on the keyword, and generates the target text information based on the determined text length and number of sentences, thereby reducing the computational complexity of the text generation stage while preserving the expressive effect of the dialogue.

[0059] For example, in a dialogue interaction scenario, the input request information corresponds to the user's inquiry, such as "How is the weather today?" After parsing the input request information, the aforementioned low-latency dialogue generation platform can determine "weather" as a keyword, and based on this keyword, determine the target text information to use a shorter text length and fewer sentences, generating target text information that includes feedback words and brief explanations.

[0060] In another example, a storytelling scenario, the input request information corresponds to a story playback request triggered by the user. The aforementioned low-latency dialogue generation platform can parse the input request information and determine "story" as a keyword, thereby determining that the target text information adopts a relatively long text length and multiple sentences, and generating target text information with a narrative structure.

[0061] Through the above methods and steps, the low-latency dialogue generation platform can adaptively control the scale of text generation in different application scenarios, thereby improving the real-time performance and stability of dialogue generation.
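The keyword-to-scale decision can be sketched as a lookup from keywords to a length budget. The budget values, the keyword table, and the rule of taking the largest matched budget are illustrative assumptions; the application gives no concrete numbers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBudget:
    max_chars: int
    max_sentences: int


# Hypothetical budgets: short replies for queries, longer text for stories.
KEYWORD_BUDGETS = {
    "weather": TextBudget(80, 2),
    "story": TextBudget(600, 12),
}
DEFAULT_BUDGET = TextBudget(120, 3)


def extract_keywords(request_text):
    """Return the known keywords found in the request, in order of appearance."""
    words = re.findall(r"[a-z]+", request_text.lower())
    return [word for word in words if word in KEYWORD_BUDGETS]


def budget_for(request_text):
    """Pick the length budget for a request; the largest matched budget wins."""
    keywords = extract_keywords(request_text)
    if not keywords:
        return DEFAULT_BUDGET
    return max((KEYWORD_BUDGETS[k] for k in keywords), key=lambda b: b.max_chars)
```

The resulting budget would then be passed to whatever component produces the target text, capping its length before any speech processing starts.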

[0062] Optionally, in the step of performing interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain a text optimization sequence, the method further includes determining at least one interval insertion position in the sequence corresponding to the target text information; inserting interval markers at the interval insertion positions according to the preset static interval rule to obtain a candidate text optimization sequence; determining the corresponding priority based on the text elements in the candidate text optimization sequence, and sorting the candidate text optimization sequence according to the priority to obtain a text optimization sequence.

[0063] In this embodiment of the invention, the above-mentioned interval insertion position may refer to the position in the sequence corresponding to the target text information that is suitable for inserting the interval mark. It can be determined according to the text structure, semantic boundary or preset rules, and is used to provide positional basis for subsequent rhythm control.

[0064] The aforementioned interval markers can refer to control identifiers inserted into the sequence to indicate positions in the text sequence where pauses or rhythm adjustments are needed. They can take the form of placeholders, control identifiers, or other markers used to distinguish ordinary text elements. It should be noted that these interval markers do not participate in the semantic expression of the text; they are used to assist in structured control during subsequent processing. By inserting interval markers, the aforementioned low-latency dialogue generation platform can constrain the rhythm and structure of subsequent processing without altering the text semantics, thereby reducing the computational overhead of dynamic decision-making.

[0065] The aforementioned candidate text optimization sequence can refer to the intermediate sequence result formed after inserting interval markers in the sequence corresponding to the target text information, which is used as the basis for priority determination and sorting.

[0066] The aforementioned text elements can refer to the basic units that constitute the candidate text optimization sequence, and may include, but are not limited to, characters, terms, subwords, or other unit forms used to represent text content.

[0067] The aforementioned priority can refer to the identification information used to characterize the importance of different text elements in processing, to distinguish between key text elements and non-key text elements, thereby guiding the determination of the subsequent processing order.

[0068] In one possible embodiment, the aforementioned low-latency dialogue generation platform, during operation, can analyze the corresponding sequence of target text information based on text structure features, semantic boundaries, or preset rules to determine the appropriate position for inserting an interval. For example, in a dialogue response scenario, the interval insertion position can be determined at the beginning of a sentence or before a response word; in a storytelling scenario, the interval insertion position can be determined at the semantic paragraph boundary or between sentences.

[0069] After determining the interval insertion position, the aforementioned low-latency dialogue generation platform inserts interval markers at the corresponding interval insertion positions according to preset static interval rules, thereby obtaining the candidate text optimization sequence.
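The insertion step of paragraphs [0068] and [0069] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the sequence is a list of string tokens, uses a hypothetical placeholder `<gap>` as the interval marker, and models the preset static interval rule as a fixed set of tokens after which a marker is inserted. None of these names appear in the application.

```python
from typing import List, Set

GAP = "<gap>"  # hypothetical interval marker; carries no semantic content

# Hypothetical static rule: insert a marker after any of these tokens.
# The set is fixed before generation, so no runtime decision is computed.
BREAK_AFTER: Set[str] = {",", ".", "!", "?"}


def insert_intervals(tokens: List[str], break_after: Set[str] = BREAK_AFTER) -> List[str]:
    """Return a candidate sequence with interval markers inserted by rule."""
    out: List[str] = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        is_last = i == len(tokens) - 1
        # No marker after the final token: there is nothing left to pause before.
        if tok in break_after and not is_last:
            out.append(GAP)
    return out


candidate = insert_intervals(["Okay", ",", "the", "light", "is", "on", "."])
```

Because the rule is a constant set membership test, the cost per token is fixed, which is consistent with the stated aim of avoiding dynamic decision-making.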

[0070] After generating the candidate text optimization sequence, the aforementioned low-latency dialogue generation platform determines the corresponding priority based on the text elements in the candidate text optimization sequence. When determining priorities, the platform can base its judgment on the importance of text elements, semantic weight, or interaction requirements. For example, in a dialogue response scenario, text elements directly related to the user's question can be assigned higher priority, while text elements used for supplementary explanations can be assigned lower priority; in a storytelling scenario, text elements used to advance the plot can be assigned higher priority, while text elements used for background descriptions can be assigned relatively lower priority.

[0071] After determining the priority of text elements, the aforementioned low-latency dialogue generation platform sorts the candidate text optimization sequence according to the priority to obtain the text optimization sequence.
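One possible reading of the priority-based sorting in paragraphs [0070] and [0071] is sketched below. It assumes, purely for illustration, that the elements being ordered are clause-level segments rather than single characters, and that priority is an integer where a smaller value means more important. The `Segment` type and the priority scale are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    text: str
    priority: int  # hypothetical scale: 0 = key content, larger = less urgent


def order_by_priority(candidate: List[Segment]) -> List[Segment]:
    # sorted() is stable, so segments with equal priority keep their
    # original relative order.
    return sorted(candidate, key=lambda s: s.priority)


candidate = [
    Segment("By the way, it may rain tonight.", 1),
    Segment("It is 25 degrees now.", 0),
    Segment("Take an umbrella.", 1),
]
ordered = order_by_priority(candidate)
```

A stable sort is used here so that reordering moves key content forward without scrambling the order of the remaining segments.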

[0072] By introducing priority-based sorting, the aforementioned low-latency dialogue generation platform can prioritize key text content in real-time interactive scenarios, thereby improving the response speed and stability of dialogue generation under resource-constrained conditions.

[0073] Optionally, the step of converting the text optimization sequence into a phoneme sequence may further include performing pronunciation analysis on the text optimization sequence to determine the phonemes corresponding to each text element in the text optimization sequence; and generating a phoneme sequence corresponding to the text optimization sequence based on the order relationship of each text element in the text optimization sequence.

[0074] In this embodiment of the invention, the above-mentioned pronunciation analysis (also referred to as pronunciation parsing) may refer to the process of converting text elements in the text optimization sequence into pronunciation-level representations, so as to determine the pronunciation form of the text elements when expressed in speech.

[0075] The aforementioned phonemes can refer to the smallest or basic units used to describe speech pronunciation, and are used to characterize the structural composition of text at the pronunciation level.

[0076] The aforementioned order relationship can refer to the arrangement relationship between each text element in the text optimization sequence, which is used to constrain the arrangement order of phonemes in the phoneme sequence, so that the generated phoneme sequence is consistent with the text optimization sequence in structure, thereby ensuring the consistency of the dialogue synthesis result in semantics and rhythm.

[0077] In one possible embodiment, the low-latency dialogue generation platform, when converting the optimized text sequence into a phoneme sequence, can also perform pronunciation parsing on the optimized text sequence to determine the phonemes corresponding to each text element in the optimized text sequence. It can parse each text element in the optimized text sequence one by one according to preset pronunciation rules or pronunciation models, converting the text-level representation into a pronunciation-level representation. In practical applications, the low-latency dialogue generation platform can determine the corresponding phonemes based on the type of text element, linguistic features, or contextual relationships, thereby forming a pronunciation structure that matches the text content.

[0078] The aforementioned pronunciation model refers to a model structure used to map text elements to corresponding phonetic representations. It can parse the pronunciation of text elements in an optimized text sequence based on linguistic features, contextual relationships, or statistical patterns to determine the corresponding phonemes. This pronunciation model can be deployed as a lightweight model within the aforementioned low-latency dialogue generation platform to reduce computational complexity while maintaining pronunciation accuracy. In practical applications, this pronunciation model can be adapted for dialogue response scenarios or storytelling scenarios, maintaining a balance between naturalness and real-time performance in the generated phoneme sequences.

[0079] The aforementioned preset pronunciation rules refer to a set of rules predetermined before dialogue generation. These rules define the correspondence between text elements and phonemes, and may include character-phoneme mapping, pronunciation patterns of word combinations, and pronunciation conventions in specific contexts. Pronunciation is resolved through rule matching, thereby reducing reliance on real-time reasoning. In this embodiment, the low-latency dialogue generation platform can prioritize phoneme determination based on the preset pronunciation rules, only introducing a pronunciation model for resolution when the rules cannot cover all phonemes, further reducing overall processing latency.
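The rules-first strategy of paragraph [0079], with the pronunciation model used only when no rule matches, can be sketched as follows. The rule table, the phoneme symbols, and the `fallback_model` function are all assumptions made for illustration; the fallback here is a toy stand-in, not a real pronunciation model.

```python
from typing import Callable, Dict, List

# Hypothetical preset pronunciation rules: text element -> phonemes.
RULES: Dict[str, List[str]] = {
    "ok": ["OW", "K", "EY"],
    "light": ["L", "AY", "T"],
}


def fallback_model(element: str) -> List[str]:
    """Stand-in for a lightweight pronunciation model (assumption).

    It simply spells the element letter by letter.
    """
    return [ch.upper() for ch in element if ch.isalpha()]


def to_phonemes(
    elements: List[str],
    rules: Dict[str, List[str]] = RULES,
    model: Callable[[str], List[str]] = fallback_model,
) -> List[str]:
    phonemes: List[str] = []
    for element in elements:  # iterate in text order, so no re-sorting is needed
        hit = rules.get(element.lower())
        # Rule lookup first; the model is called only on a miss.
        phonemes.extend(hit if hit is not None else model(element))
    return phonemes


phonemes = to_phonemes(["OK", "light", "on"])
```

Iterating over the elements in their existing order also illustrates how the phoneme sequence can inherit the order of the text optimization sequence without a separate sorting pass.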

[0080] After completing pronunciation analysis, the aforementioned low-latency dialogue generation platform generates a phoneme sequence corresponding to the optimized text sequence based on the sequential relationship of each text element in the optimized text sequence. Specifically, when generating the phoneme sequence, the consistency of the phoneme order with the corresponding text element is maintained, ensuring that the phoneme sequence accurately reflects the structure of the optimized text sequence. For example, in a dialogue response scenario, when the feedback word in the optimized text sequence is at the beginning of a sentence, the corresponding phoneme is also prioritized in the phoneme sequence; in a storytelling scenario, when the narrative text in the optimized text sequence is arranged in paragraph order, the corresponding generated phoneme sequence also maintains this arrangement.

[0081] By inheriting the order relationship in the text optimization sequence through the above methods and steps, the low-latency dialogue generation platform can avoid introducing additional sorting calculations in the phoneme generation stage, thereby helping to reduce processing latency.

[0082] Optionally, the step of mapping the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence further includes determining at least one phoneme to be mapped based on the order relationship in the phoneme sequence; querying the speech unit corresponding to the phoneme to be mapped based on the preset phoneme mapping table; and arranging the queried speech units according to the order of the phoneme sequence to obtain a speech unit sequence.

[0083] In this embodiment of the invention, the aforementioned phonemes to be mapped may refer to phonemes selected from the phoneme sequence for performing phoneme-to-speech unit mapping processing, used as input objects for querying a preset phoneme mapping table, and determined sequentially according to the order relationship or priority rules in the phoneme sequence.

[0084] The above query can refer to the process of obtaining the correspondence between phonemes and speech units based on a preset phoneme mapping table. It can be completed by looking up the table and is used to quickly determine the speech unit corresponding to a phoneme without introducing complex calculations.

[0085] The aforementioned speech unit can refer to the basic pronunciation unit used in dialogue synthesis processing. It can be used to represent pronunciation segments that can be directly spliced or combined, and is the basic element for constructing speech unit sequences and performing dialogue synthesis processing.

[0086] In this embodiment, the low-latency dialogue generation platform can select phonemes as phonemes to be mapped in sequence according to the arrangement order of phonemes in the phoneme sequence, and perform mapping processing on the selected phonemes to be mapped. Specifically, phonemes located at the beginning of the phoneme sequence or with higher processing priority can be selected as phonemes to be mapped, thereby generating key pronunciation content in real-time dialogue scenarios.

[0087] After determining the phonemes to be mapped, the aforementioned low-latency dialogue generation platform can obtain the correspondence between phonemes and speech units directly by table lookup, without complex real-time calculation or inference. It can restrict the query to phonemes that have not yet been mapped, thereby avoiding repeated queries and further reducing processing overhead.

[0088] After completing the query, the low-latency dialogue generation platform arranges the retrieved speech units according to the order in the phoneme sequence, forming a speech unit sequence. By maintaining the consistency of the speech unit sequence with the phoneme sequence in order, the platform ensures the stability of pronunciation structure and rhythm control in subsequent dialogue synthesis processing. Through this order-based mapping method, the platform limits the phoneme mapping process to a deterministic lookup table operation, thereby reducing response latency in the mapping stage and improving overall dialogue generation efficiency.
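The lookup-based mapping of paragraphs [0086] to [0088] can be sketched as follows. The table contents and the speech unit identifiers such as `u_017` are hypothetical, and the per-call cache is one assumed way to realize "query only phonemes that have not yet been mapped".

```python
from typing import Dict, List

# Hypothetical preset phoneme mapping table: phoneme -> speech unit identifier.
PHONEME_TABLE: Dict[str, str] = {"L": "u_017", "AY": "u_003", "T": "u_029"}


def map_phonemes(phonemes: List[str], table: Dict[str, str] = PHONEME_TABLE) -> List[str]:
    resolved: Dict[str, str] = {}  # phonemes already queried in this call
    units: List[str] = []
    for p in phonemes:  # walk the phoneme sequence in order
        if p not in resolved:
            # Plain dictionary lookup; an unknown phoneme raises KeyError.
            resolved[p] = table[p]
        units.append(resolved[p])  # output order follows the phoneme order
    return units


units = map_phonemes(["L", "AY", "T", "L", "AY", "T"])
```

In this sketch the cache saves no real work, because the table is itself an in-memory dictionary. It would matter only if the mapping table lived in slower storage, which the application does not specify.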

[0089] Optionally, the step of performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result further includes determining the synthesis segment corresponding to at least one speech unit in the speech unit sequence; splicing and integrating the synthesis segments according to the order of the speech unit sequence to generate a dialogue synthesis result; and outputting the dialogue synthesis result to obtain the dialogue output result.

[0090] In this embodiment of the invention, the above-mentioned synthesized segment may refer to the basic pronunciation content corresponding to the speech unit that can be directly used for splicing or combining. It may exist in the form of audio segments or in the form of pronunciation data used to quickly generate audio, and is the basic unit for performing dialogue synthesis processing.

[0091] The above-mentioned splicing and integration can refer to the process of combining and connecting multiple synthesized segments according to the sequential relationship in the speech unit sequence.

[0092] The aforementioned dialogue synthesis result can refer to the overall pronunciation result obtained by splicing and integrating multiple synthesized segments, which can be used as an intermediate or final form of dialogue output and can reflect the complete expression of the target text information at the speech level.

[0093] In one possible embodiment, the aforementioned low-latency dialogue generation platform can obtain the synthesized segment corresponding to a speech unit based on the type, pronunciation features, or storage format of the speech unit. Specifically, it can prioritize obtaining the synthesized segment corresponding to speech units located at the beginning of the speech unit sequence or with higher processing priority, thereby generating key pronunciation content in advance in real-time dialogue scenarios.

[0094] After determining the synthesized segments, the aforementioned low-latency dialogue generation platform can continuously combine multiple synthesized segments according to the sequential relationship in the speech unit sequence, and adjust the connection relationship between adjacent synthesized segments when necessary to ensure the naturalness of the dialogue synthesis result in terms of rhythm and continuity. For example, during the splicing and integration process, pauses corresponding to the interval markers in the optimized text sequence can be maintained, so that the generated dialogue synthesis result is consistent with the text structure in terms of expression rhythm.
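The splicing step of paragraphs [0093] and [0094] can be sketched as follows, including the pause that corresponds to an interval marker. The sketch assumes that synthesized segments are stored as lists of PCM samples, that the marker `<gap>` survives into the speech unit sequence, and that each marker maps to a fixed 200 ms of silence at 16 kHz. All of these values and names are illustrative assumptions.

```python
from typing import Dict, List

GAP = "<gap>"          # hypothetical interval marker carried into the unit sequence
SAMPLE_RATE = 16_000   # assumed sample rate in Hz
GAP_MS = 200           # assumed fixed pause length per marker

# Hypothetical store of pre-synthesized segments: speech unit -> PCM samples.
SEGMENTS: Dict[str, List[int]] = {
    "u_017": [10, 20, 30],
    "u_003": [40, 50],
}


def splice(units: List[str], segments: Dict[str, List[int]] = SEGMENTS) -> List[int]:
    silence = [0] * (SAMPLE_RATE * GAP_MS // 1000)
    audio: List[int] = []
    for unit in units:  # concatenate strictly in sequence order
        audio.extend(silence if unit == GAP else segments[unit])
    return audio


audio = splice(["u_017", GAP, "u_003"])
```

The sketch performs plain concatenation only. The adjustment of the connection between adjacent segments mentioned in paragraph [0094], for example a short cross-fade, is omitted.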

[0095] After the splicing and integration are completed, the aforementioned low-latency dialogue generation platform outputs the dialogue synthesis result, thus obtaining the dialogue output result. The dialogue output result can be output directly as speech in real-time human-computer dialogue scenarios, or it can be provided as a synthesized result for subsequent playback or caching. In continuous dialogue or multi-turn interaction scenarios, by sequentially executing dialogue synthesis processing based on speech unit sequences, the aforementioned low-latency dialogue generation platform can reduce repetitive synthesis operations and improve the overall response speed and stability of dialogue generation.

[0096] More specifically, two scenario examples illustrate the method. In the first, the aforementioned low-latency dialogue generation platform is applied to a human-computer dialogue scenario using an AI voice device. During operation, the AI voice device generates input request information after detecting an input signal that triggers dialogue generation. For example, a user might wake up the device by voice and issue a command such as "How's the weather today?", "Turn on the living room lights", or "Set a ten-minute timer for me", or the user might input text in a companion application, such as "Play some light music" or "Turn the volume up to thirty". The aforementioned low-latency dialogue generation platform determines the target text information based on the input request information and uses keywords to constrain the generation scale. For example, it generates short replies containing feedback words around keywords such as "weather", "lights", "timer", and "volume", controlling the text length and the number of sentences so that the target text information is better suited to rapid generation and synthesis. The platform then performs structured processing on the sequence corresponding to the target text information based on preset static interval rules, for example by inserting interval markers after feedback words or at semantic boundaries, and completes a deterministic mapping from phonemes to speech units based on a phoneme mapping table to reduce real-time inference overhead.

[0097] In this human-computer dialogue scenario, the aforementioned low-latency dialogue generation platform generates phoneme sequences and speech unit sequences according to the optimized text order, and performs dialogue synthesis processing based on the speech unit sequences to form dialogue output results, such as quickly outputting "Okay, I've turned on the living room light for you," "The temperature is 25 degrees Celsius now, perfect for going out," and "The timer has started; I'll remind you in ten minutes." When there are continuous interactions, such as the user adding questions like "What about tomorrow?" or "Turn off the bedroom light again," the aforementioned low-latency dialogue generation platform can shorten the generation path while maintaining the consistency of structured sequence processing, enabling AI voice devices to maintain low response latency and a stable output rhythm in multi-turn dialogues, thereby improving the real-time interactive experience.

[0098] In another embodiment, the aforementioned low-latency dialogue generation platform is applied to storytelling scenarios using AI voice devices. After receiving an input request to trigger storytelling, the AI voice device generates target text information. For example, a user might issue a voice command such as "Tell me a bedtime story," "Tell me a story about a little rabbit," or "Continue the previous story," or the user might trigger "Play the fairy tale: The Brave Boat" through text input. When determining the target text information, the low-latency dialogue generation platform can generate narrative text based on keywords such as "bedtime story," "little rabbit," and "brave," setting a relatively long text length and a large number of sentences to meet the needs of continuous storytelling. Simultaneously, it limits the scale of single generation through generation constraints to avoid stuttering caused by long text reasoning on resource-constrained terminals. Subsequently, the platform inserts interval markers into the sequence corresponding to the target text information based on preset static interval rules, such as inserting intervals at paragraph boundaries and plot transitions to reflect the storytelling rhythm and pauses. The optimized text sequence is then converted into a phoneme sequence, and a speech unit sequence is obtained based on a phoneme mapping table.
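The keyword-driven control of text length and sentence count, short for command replies and longer for storytelling, can be sketched as a lookup from keyword to generation budget. The `Budget` type, the keyword table, and every number in it are hypothetical and chosen only to show the contrast between the two scenarios.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Budget:
    max_chars: int
    max_sentences: int


# Hypothetical keyword table; the numbers are illustrative, not from the application.
BUDGETS: Dict[str, Budget] = {
    "weather": Budget(40, 1),
    "timer": Budget(40, 1),
    "story": Budget(400, 8),
}
DEFAULT = Budget(60, 2)


def budget_for(request: str) -> Budget:
    # Crude tokenization: lower-case, drop sentence punctuation, split on spaces.
    words = request.lower().replace("?", " ").replace(".", " ").split()
    matches = [BUDGETS[w] for w in words if w in BUDGETS]
    if not matches:
        return DEFAULT
    # If several keywords match, take the largest budget.
    return max(matches, key=lambda b: b.max_chars)


short = budget_for("How's the weather today?")
long_ = budget_for("Tell me a bedtime story")
```

The returned budget would then be passed to whatever component generates the target text information, which is outside the scope of this sketch.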

[0099] In this storytelling scenario, the aforementioned low-latency dialogue generation platform performs dialogue synthesis processing based on speech unit sequences, sequentially splicing and integrating synthesized segments to form a coherent dialogue synthesis result and output it, such as "Once upon a time, there was a little rabbit who lived in a small house by the forest..." "Suddenly, a gust of wind came from afar, and the little rabbit mustered up its courage and went out..." When the user inserts control commands during the narration, such as "pause," "continue," or "slow down," the aforementioned low-latency dialogue generation platform can dynamically adjust the target text information generation scale and rhythm marker density based on the input request information, and complete mapping and synthesis without relying on heavy real-time reasoning, thereby maintaining the fluency, stability, and interactivity of storytelling on resource-constrained devices.

[0100] As shown in Figure 2, this embodiment of the invention also provides a low-latency dialogue generation apparatus 200, which includes: a first acquisition module 201, used to acquire input request information; a first determining module 202, used to determine the corresponding target text information based on the input request information; a first processing module 203, used to perform interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; a first conversion module 204, used to convert the optimized text sequence into a phoneme sequence; a second processing module 205, used to perform mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and a first synthesis module 206, used to perform dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result.
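The data flow through the six modules of apparatus 200 can be sketched as a chain of functions. Each function body below is a toy placeholder and an assumption; only the order of the stages and the type of data passed between them follow the description.

```python
from functools import reduce
from typing import Any, Callable, List, Sequence

GAP = "<gap>"  # hypothetical interval marker


def acquire(signal: str) -> str:                  # module 201
    return signal.strip()


def determine_text(request: str) -> str:          # module 202
    # Toy keyword rule; a real module would generate the reply text.
    return "Okay, light on" if "light" in request else "Okay"


def insert_intervals(text: str) -> List[str]:     # module 203
    tokens: List[str] = []
    for word in text.split():
        tokens.append(word.rstrip(","))
        if word.endswith(","):
            tokens.append(GAP)  # static rule: pause after a comma
    return tokens


def to_phonemes(tokens: List[str]) -> List[str]:  # module 204
    # Toy conversion: one upper-case "phoneme" per token.
    return [t if t == GAP else t.upper() for t in tokens]


def map_units(phonemes: List[str]) -> List[str]:  # module 205
    return [p if p == GAP else f"unit:{p}" for p in phonemes]


def synthesize(units: List[str]) -> str:          # module 206
    return " ".join(units)


def run_pipeline(stages: Sequence[Callable[[Any], Any]], signal: Any) -> Any:
    # Feed the output of each stage into the next, in order.
    return reduce(lambda value, stage: stage(value), stages, signal)


STAGES = [acquire, determine_text, insert_intervals, to_phonemes, map_units, synthesize]
result = run_pipeline(STAGES, "  turn on the living room light  ")
```

Keeping each module as a separate function with a single input and output mirrors the one-directional flow of the apparatus, in which every stage consumes only the result of the stage before it.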

[0101] Optionally, the first acquisition module 201 mentioned above includes: The first acquisition submodule is used to acquire the input signal used to trigger dialogue generation; The second acquisition submodule is used to determine the input request information based on the input signal, wherein the input signal includes text input signal, voice input signal and / or command input signal.

[0102] Optionally, the first determining module 202 mentioned above includes: The first determining submodule is used to parse the input request information and determine at least one keyword; The second determining submodule is used to determine the text length and / or number of sentences of the target text information based on the keywords; The third determining submodule is used to generate target text information based on the text length and / or number of sentences of the target text information.

[0103] Optionally, the first processing module 203 mentioned above includes: a first processing submodule, used to determine at least one interval insertion position in the sequence corresponding to the target text information; a second processing submodule, used to insert interval markers at the interval insertion positions according to the preset static interval rules to obtain the candidate text optimization sequence; and a third processing submodule, used to determine the corresponding priority based on the text elements in the candidate text optimization sequence, and to sort the candidate text optimization sequence according to the priority to obtain the text optimization sequence.

[0104] Optionally, the first conversion module 204 mentioned above includes: The first conversion submodule is used to perform pronunciation parsing on the text optimization sequence and determine the phonemes corresponding to each text element in the text optimization sequence; The second conversion submodule is used to generate a phoneme sequence corresponding to the text optimization sequence based on the order relationship of each text element in the text optimization sequence.

[0105] Optionally, the second processing module 205 mentioned above includes: The fourth processing submodule is used to determine at least one phoneme to be mapped based on the order relationship in the phoneme sequence; The fifth processing submodule is used to query the speech unit corresponding to the phoneme to be mapped based on the preset phoneme mapping table; The sixth processing submodule is used to arrange the queried speech units in the order of the phoneme sequence to obtain the speech unit sequence.

[0106] Optionally, the first synthesis module 206 mentioned above includes: The first synthesis submodule is used to determine the synthesized segment corresponding to at least one speech unit in the speech unit sequence; The second synthesis submodule is used to splice and integrate the synthesized segments according to the order of the speech unit sequence to generate a dialogue synthesis result; The third synthesis submodule is used to output the dialogue synthesis result, thereby obtaining the dialogue output result.

[0107] As shown in Figure 3, this embodiment of the invention also provides an electronic device 300, including a processor that can execute any of the above-described low-latency dialogue generation methods.

[0108] Specifically, the electronic device 300 includes a processor 301 and a memory 302, as well as a computer program stored in the memory 302 and capable of running on the processor 301. The processor 301 executes the computer program of the low-latency dialogue generation method stored in the memory 302 and performs the following steps: obtaining input request information; determining the corresponding target text information based on the input request information; performing interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; converting the optimized text sequence into a phoneme sequence; performing mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result.

[0109] Optionally, the processor 301 executes the process of obtaining the input request information, including: Obtain the input signal used to trigger dialogue generation; The input request information is determined based on the input signal, which includes text input signal, voice input signal and / or command input signal.

[0110] Optionally, the processor 301 performs the step of determining the corresponding target text information based on the input request information, including: The input request information is parsed to determine at least one keyword; Based on the keywords, determine the text length and / or number of sentences of the target text information; The target text information is generated based on the text length and / or number of sentences of the target text information.

[0111] Optionally, the processor 301 executes the interval insertion processing on the sequence corresponding to the target text information based on the preset static interval rule to obtain the optimized text sequence, including: Determine at least one interval insertion position in the sequence corresponding to the target text information; According to the preset static interval rule, an interval marker is inserted at the interval insertion position to obtain the candidate text optimization sequence; Based on the text elements in the candidate text optimization sequence, the corresponding priorities are determined, and the candidate text optimization sequence is sorted according to the priorities to obtain the text optimization sequence.

[0112] Optionally, the step performed by the processor 301 of converting the optimized text sequence into a phoneme sequence includes: performing pronunciation analysis on the optimized text sequence to determine the phonemes corresponding to each text element in the optimized text sequence; and generating, based on the order relationship of each text element in the optimized text sequence, a phoneme sequence corresponding to the optimized text sequence.

[0113] Optionally, the processor 301 executes the mapping process based on the preset phoneme mapping table to obtain a speech unit sequence, including: Based on the order relationship in the phoneme sequence, at least one phoneme to be mapped is determined; Based on the preset phoneme mapping table, query the speech unit corresponding to the phoneme to be mapped; The obtained speech units are arranged in the order of the phoneme sequence to obtain the speech unit sequence.

[0114] Optionally, the processor 301 performs the dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result, including: Determine the synthesized segment corresponding to at least one speech unit in the speech unit sequence; The synthesized segments are spliced and integrated according to the order of the speech unit sequence to generate a dialogue synthesis result; Output the synthesized dialogue result to obtain the dialogue output result.

[0115] This invention also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the various processes of the low-latency dialogue generation method or the application-side low-latency dialogue generation method provided in this invention, and achieves the same technical effect. To avoid repetition, it will not be described again here.

[0116] Those skilled in the art will understand that implementing all or part of the processes in the above embodiments can be done by a computer program instructing related hardware, and can be stored in a computer-readable storage medium. When executed, the program can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.

[0117] The above description discloses only preferred embodiments of the present invention and should not be construed as limiting the scope of the present invention. Therefore, equivalent variations made in accordance with the claims of the present invention are still within the scope of the present invention.

Claims

1. A low-latency dialogue generation method, characterized in that it comprises: obtaining input request information; determining the corresponding target text information based on the input request information; performing interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; converting the optimized text sequence into a phoneme sequence; performing mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result.

2. The low-latency dialogue generation method as described in claim 1, characterized in that the step of obtaining input request information includes: obtaining an input signal used to trigger dialogue generation; and determining the input request information based on the input signal, wherein the input signal includes a text input signal, a voice input signal and/or a command input signal.

3. The low-latency dialogue generation method as described in claim 1, characterized in that the step of determining the corresponding target text information based on the input request information includes: parsing the input request information to determine at least one keyword; determining the text length and/or number of sentences of the target text information based on the keyword; and generating the target text information based on the text length and/or number of sentences of the target text information.

4. The low-latency dialogue generation method as described in claim 1, characterized in that the step of performing interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence includes: determining at least one interval insertion position in the sequence corresponding to the target text information; inserting an interval marker at the interval insertion position according to the preset static interval rule to obtain a candidate text optimization sequence; and determining the corresponding priorities based on the text elements in the candidate text optimization sequence, and sorting the candidate text optimization sequence according to the priorities to obtain the optimized text sequence.

5. The low-latency dialogue generation method as described in claim 4, characterized in that the step of converting the optimized text sequence into a phoneme sequence includes: performing pronunciation analysis on the optimized text sequence to determine the phonemes corresponding to each text element in the optimized text sequence; and generating, based on the order relationship of each text element in the optimized text sequence, a phoneme sequence corresponding to the optimized text sequence.

6. The low-latency dialogue generation method as described in claim 1, characterized in that the step of mapping the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence includes: determining at least one phoneme to be mapped based on the order relationship in the phoneme sequence; querying the speech unit corresponding to the phoneme to be mapped based on the preset phoneme mapping table; and arranging the queried speech units in the order of the phoneme sequence to obtain the speech unit sequence.

7. The low-latency dialogue generation method as described in claim 1, characterized in that the step of performing dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result includes: determining the synthesized segment corresponding to at least one speech unit in the speech unit sequence; splicing and integrating the synthesized segments according to the order of the speech unit sequence to generate a dialogue synthesis result; and outputting the dialogue synthesis result to obtain the dialogue output result.

8. A low-latency dialogue generation apparatus, characterized in that it comprises: a first acquisition module, used to acquire input request information; a first determining module, used to determine the corresponding target text information based on the input request information; a first processing module, used to perform interval insertion processing on the sequence corresponding to the target text information based on a preset static interval rule to obtain an optimized text sequence; a first conversion module, used to convert the optimized text sequence into a phoneme sequence; a second processing module, used to perform mapping processing on the phoneme sequence based on a preset phoneme mapping table to obtain a speech unit sequence; and a first synthesis module, used to perform dialogue synthesis processing based on the speech unit sequence to generate a dialogue output result.

9. An electronic device, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the low-latency dialogue generation method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the low-latency dialogue generation method as described in any one of claims 1 to 7.