A personalized speech generation system and method based on multivariate parameters
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING QIYU INFORMATION TECH CO LTD
- Filing Date
- 2026-01-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing single-variable TTS methods cannot meet the diverse needs of users in voice interaction, resulting in wasted resources and difficulty in supporting real-time synthesis in high-concurrency scenarios.
A personalized voice generation system based on multivariate parameters is adopted, including a list distribution module, a task distribution module, an elastic scheduling module, and a post-processing module. By generating a parameter mapping table and dynamically scheduling computing resources, personalized voice files are generated.
It enables differentiated access to user needs, optimizes the utilization of computing resources, reduces costs, and supports stable operation under high concurrency.
Figure CN121506091B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of speech synthesis technology, and more specifically, to a personalized speech generation system and method based on multivariate parameters.
Background Technology
[0002] Large-scale intelligent voice interaction technology, by integrating artificial intelligence, cloud computing, big data, and communication technologies, can achieve near-human "listening and speaking" capabilities, and is therefore widely used in scenarios such as intelligent telemarketing, voice chat, and business guidance. It typically uses a single-variable TTS (Text-to-Speech) method to generate audio files during the interaction. However, this single-variable text-to-speech synthesis method only generates audio files with fixed timbre, speed, and intonation using a single speech model, which cannot meet the needs of voice interaction, as evidenced by:
[0003] 1. The single-variable text-to-speech synthesis method directly schedules based on pre-synthesized audio files, does not support dynamic adjustment of audio files, and uses the same audio file to reach all users, which cannot meet the business's need for differentiated user outreach.
[0004] 2. In large-scale outbound call scenarios, single-variable text-to-speech synthesis methods often rely on repeated scheduling of fixed models, which cannot make full use of hardware resources such as GPUs, resulting in a waste of computing resources and making it difficult to support the real-time synthesis requirements in high-concurrency scenarios.
Summary of the Invention
[0005] In view of this, the main objective of the present invention is to propose a personalized speech generation system and method based on multivariate parameters, in order to at least partially solve at least one of the above-mentioned technical problems.
[0006] To address the aforementioned technical problems, the first aspect of this invention proposes a personalized speech generation system based on multivariate parameters, comprising:
[0007] The list distribution module is used to distribute user lists to the task distribution module in batches according to the priority of the user list queue.
[0008] The task distribution module is used to receive a user list and obtain the corresponding script template; generate a parameter mapping table based on the multivariate parameters and dynamic values of the multivariate parameters in the script template; and synthesize and distribute script tasks in batches based on the parameter mapping table and the script template.
[0009] The elastic scheduling module is used to receive the script tasks issued by the task distribution module in batches, determine in real time, according to the synthesis index, the computing power resource index value required for each batch of script tasks to be synthesized, schedule target computing power resources according to the computing power resource index value and the computing power resource investment, perform speech synthesis inference, and output audio files.
[0010] The post-processing module is used to post-process the audio file to generate a personalized voice file.
[0011] According to a preferred embodiment of the present invention, the elastic scheduling module includes:
[0012] The configuration unit is used to configure the priority of each computing resource according to the computing resource investment;
[0013] The selection unit is used to select and schedule, according to the priority of the computing resources, target computing resources that meet the computing resource index value required for each batch of script tasks to be synthesized.
[0014] According to a preferred embodiment of the present invention, the elastic scheduling module further includes:
[0015] The second determining unit is used to determine the stress test indicators of multiple computing resources when the target computing resources selected by the selecting unit are external cloud resources.
[0016] The sub-selection unit is used to select and schedule the final target computing resources from the selected external cloud resources according to the computing power resource throughput based on the stress test indicators.
[0017] According to a preferred embodiment of the present invention, the system further includes:
[0018] The link tracing module is used to obtain at least one of the following in real time: user list distribution progress information, speech synthesis progress information, and inference progress information;
[0019] The elastic scheduling module determines the synthesis index based on the list distribution progress information and the speech synthesis progress information in the link tracing module.
[0020] According to a preferred embodiment of the present invention, the elastic scheduling module further includes:
[0021] The computing power shutdown unit is used to obtain speech synthesis progress information and inference progress information from the link tracing module, and determine and shut down redundant computing power resources in real time based on the speech synthesis progress information and inference progress information.
[0022] According to a preferred embodiment of the present invention, the task distribution module includes:
[0023] The parsing unit is used to parse and decompose the multivariate parameters in the script template;
[0024] The acquisition unit is used to dynamically acquire the value of the multivariate parameter corresponding to each user based on the user list;
[0025] The matching unit is used to match the multivariate parameters with the multivariate parameter values corresponding to each user to generate a parameter mapping table.
[0026] To address the aforementioned technical problems, a second aspect of the present invention provides a personalized speech generation method based on multivariate parameters, comprising:
[0027] User lists are distributed in batches according to the priority of the user list queue;
[0028] Obtain the script templates corresponding to the user list in batches; generate a parameter mapping table based on the multivariate parameters and dynamic values of the multivariate parameters in the script templates; and synthesize and distribute script tasks in batches based on the parameter mapping table and the script templates.
[0029] The computing power resource index value required for each batch of script tasks to be synthesized is determined in real time based on the synthesis index. Target computing power resources are scheduled according to the computing power resource index value and the computing power resource investment, speech synthesis inference is performed, and audio files are output.
[0030] The audio file is post-processed to generate a personalized voice file.
[0031] Scheduling the target computing power resources according to the computing power resource index value and the computing power resource investment includes:
[0032] The priority of each computing resource is configured according to the computing resource investment.
[0033] Target computing resources that meet the aforementioned computing resource index value are selected and scheduled according to the priority of the computing resources.
[0034] According to a preferred embodiment of the present invention, if the selected target computing power resource is an external cloud resource, the method further includes:
[0035] Determine the stress test indicators for multiple computing resources;
[0036] Based on the stress test indicators, the final target computing resources are selected and scheduled from the chosen external cloud resources according to the throughput of computing resources.
[0037] According to a preferred embodiment of the present invention, the method further includes:
[0038] Real-time acquisition of at least one of the following: user list distribution progress information, script synthesis progress information, and inference progress information;
[0039] The synthesis indicators are determined based on the progress information of the list distribution and the progress information of the script synthesis.
[0040] According to a preferred embodiment of the present invention, the method further includes:
[0041] Based on the speech synthesis progress information and inference progress information, redundant computing resources are determined and shut down in real time.
[0042] According to a preferred embodiment of the present invention, the step of generating a parameter mapping table based on the multivariate parameters in the script template and the dynamic values of the multivariate parameters includes:
[0043] Analyze and break down the multivariate parameters in the script template;
[0044] The values of the multivariate parameters corresponding to each user are dynamically obtained based on the user list;
[0045] A parameter mapping table is generated by matching the multivariate parameters with the values of the multivariate parameters corresponding to each user.
[0046] To solve the above-mentioned technical problems, a third aspect of the present invention provides an electronic device, comprising:
[0047] Processor; and
[0048] A memory storing computer-executable instructions, which, when executed, cause the processor to perform the method according to any one of the preceding claims.
[0049] To solve the above-mentioned technical problems, the fourth aspect of the present invention provides a computer program product, including a computer program, characterized in that the computer program, when executed by a processor, implements the method described in any of the above-mentioned embodiments.
[0050] In summary, this invention generates a parameter mapping table based on the multivariate parameters in the script template and their dynamic values. This allows for the dynamic batch generation of millions of script tasks based on the parameter mapping table and its corresponding batch of script templates, meeting the business's need for differentiated user outreach. Simultaneously, it dynamically schedules target computing resources for inference based on the required computing resource indicators and the computing resource investment for each batch of script tasks to be synthesized, outputting audio files. This achieves dynamic allocation and maximized utilization of computing resources, reducing computing costs by over 30% and supporting stable operation under high concurrency.
Attached Figure Description
[0051] To make the technical problems solved by this invention, the technical means adopted, and the technical effects achieved clearer, specific embodiments of this invention will be described in detail below with reference to the accompanying drawings. However, it should be noted that the drawings described below are merely drawings of exemplary embodiments of this invention. Those skilled in the art can obtain drawings of other embodiments based on these drawings without any creative effort.
[0052] Figure 1 is a schematic diagram of the structural framework of a personalized speech generation system based on multivariate parameters provided in an embodiment of the present invention;
[0053] Figure 2 is a flowchart of a personalized speech generation method based on multivariate parameters provided in an embodiment of the present invention;
[0054] Figure 3 is a schematic diagram of the workflow of the personalized speech generation system based on multivariate parameters provided in an embodiment of the present invention;
[0055] Figure 4 is a structural block diagram of an exemplary embodiment of an electronic device according to the present invention.
Detailed Implementation
[0056] Subject to the inventive concept, the structures, performance, effects or other features described in a particular embodiment may be combined in any suitable manner with one or more other embodiments.
[0057] In the description of specific embodiments, detailed descriptions of structures, performance, effects, or other features are provided to enable those skilled in the art to fully understand the embodiments. However, it is not excluded that those skilled in the art can implement the present invention under certain circumstances with technical solutions that do not contain the above-described structures, performance, effects, or other features. The figures in the accompanying drawings are merely illustrative examples and do not imply that the solutions of the present invention must include all the contents, operations, and steps shown in the figures, nor do they imply that they must be performed in the order shown in the figures.
[0058] Refer to Figure 1, a schematic diagram of the structure of a personalized speech generation system based on multivariate parameters provided in an embodiment of the present invention. As shown in Figure 1, the system includes:
[0059] The list distribution module 11 is used to distribute the user list to the task distribution module 12 in batches according to the priority of the user list queue.
[0060] The task distribution module 12 is used to receive a user list and obtain a script template corresponding to the user list; generate a parameter mapping table based on the multivariate parameters and dynamic values of the multivariate parameters in the script template; and synthesize and distribute script tasks in batches based on the parameter mapping table and the script template.
[0061] The elastic scheduling module 13 is used to receive the script tasks issued by the task distribution module 12 in batches, determine in real time, according to the synthesis index, the computing power resource index value required for each batch of script tasks to be synthesized, schedule target computing power resources according to the computing power resource index value and the computing power resource investment, perform TTS inference, and output audio files.
[0062] The post-processing module 14 is used to post-process the audio file to generate a personalized voice file.
[0063] In one example, the synthesis metrics include: total number of script tasks, total number of synthesized script tasks, and expected completion time. The computing power resource metric is QPS (Queries Per Second). The elastic scheduling module then determines the required QPS for each batch of script tasks to be synthesized using the following formula:
[0064]
[0065] Where: taskTotal is the total number of script tasks in each batch, inferFinalTaskTotal is the total number of synthesized script tasks in each batch, and needTime is the expected completion time for each batch, which is determined by the following formula:
[0066]
[0067] Where: batchStartTimeStamp is the start time of each batch, and T is the expected completion time of each batch.
[0068] Furthermore, the elastic scheduling module 13 includes:
[0069] The configuration unit is used to configure the priority of each computing resource according to the computing resource investment;
[0070] The selection unit is used to select and schedule, according to the priority of the computing resources, target computing resources that meet the computing resource index value required for each batch of script tasks to be synthesized.
[0071] The second determining unit is used to determine the stress test indicators of multiple computing resources when the target computing resources selected by the selecting unit are external cloud resources.
[0072] The sub-selection unit is used to select and schedule the final target computing resources from the selected external cloud resources according to the computing power resource throughput based on the stress test indicators.
[0073] In one embodiment, the system further includes:
[0074] Link tracing module 15 is used to obtain at least one of the following in real time: user list distribution progress information, speech synthesis progress information, and inference progress information;
[0075] The elastic scheduling module 13 determines the synthesis index based on the list distribution progress information and the script synthesis progress information in the link tracing module 15.
[0076] The elastic scheduling module 13 also includes:
[0077] The computing power shutdown unit is used to obtain speech synthesis progress information and inference progress information from the link tracing module 15, and determine and shut down redundant computing power resources in real time based on the speech synthesis progress information and inference progress information.
[0078] According to a preferred embodiment of the present invention, the task distribution module 12 includes:
[0079] The parsing unit is used to parse and decompose the multivariate parameters in the script template;
[0080] The acquisition unit is used to dynamically acquire the value of the multivariate parameter corresponding to each user based on the user list;
[0081] The matching unit is used to match the multivariate parameters with the multivariate parameter values corresponding to each user to generate a parameter mapping table.
[0082] The modules of the aforementioned personalized speech generation system based on multivariate parameters (list distribution module, task distribution module, elastic scheduling module, and link tracing module) can be decoupled and can collaborate through a microservice architecture, achieving efficient distribution of personalized speech tasks, dynamic scheduling of resources, and full-link traceability.
[0083] Based on the personalized speech generation system based on multivariate parameters shown in Figure 1, the present invention also provides a personalized speech generation method based on multivariate parameters. As shown in Figure 2, the method includes:
[0084] S1. Distribute user lists in batches according to the priority of the user list queue;
[0085] In this embodiment, the user list is a list generated from multiple actual users who need to be reached via voice. In one example, the user list can be distributed at a certain frequency (i.e., at preset time intervals), in which case the user list includes the users who need to be reached within that preset time interval. For example, as shown in Figure 3, the list distribution module 11 can pre-store the list of users to be dialed that day in a first message queue (e.g., Kafka), thus obtaining a user list queue. A user list queue can contain: list information 1, list information 2, ..., list information n. Each user list queue is marked with a corresponding batch number, and this batch number is stored in a second message queue (e.g., Kafka), thus obtaining a list batch number queue. The list distribution module 11 can retrieve the user list from the list batch number queue at a fixed time each day (e.g., before the business dialing task starts) and distribute the user list to the task distribution module 12 in batches according to the batch numbers in that queue.
[0086] Furthermore, user lists can be prioritized based on business characteristics, the relevance of users to the business, and the urgency of users' business transactions. User lists with the same priority are stored in the corresponding user list queue, and user lists are distributed to the task distribution module 12 in batches according to the priority of the user list queue. For example, for telemarketing, the priority of user lists can be divided and marked according to the number of times a user has purchased telemarketing products in the past; for example, the higher the number of past purchases, the higher the user priority. For pay-later services (i.e., users can use products for free for a period of time before paying), the priority of user lists can be divided and marked according to the length of the remaining payment time; for example, the shorter the remaining payment time, the higher the priority. In addition, the priority of user lists can also be divided and marked according to the number of times users have viewed or favorited products. In this way, after the task distribution module 12 receives user lists in batches, it selects the user list queue according to the order of priority from high to low and obtains the user lists to synthesize the sales script tasks, thereby taking into account both user characteristics and business urgency in the synthesis of sales script tasks and improving outreach efficiency.
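The priority-ordered distribution described above can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the function name `distribute_by_priority`, the `batch_size` parameter, and the convention that a larger number means a higher priority are all assumptions introduced here, and a real deployment would read from message queues such as Kafka rather than Python lists.

```python
from collections import defaultdict


def distribute_by_priority(user_lists, batch_size):
    """Yield (priority, batch) pairs, highest-priority queue first.

    user_lists: iterable of (priority, user_list) pairs. A larger number
    means a higher priority (an assumption of this sketch).
    """
    queues = defaultdict(list)
    for priority, user_list in user_lists:
        # User lists with the same priority share one user list queue.
        queues[priority].append(user_list)

    # Walk the queues from high to low priority and cut each into batches.
    for priority in sorted(queues, reverse=True):
        queue = queues[priority]
        for start in range(0, len(queue), batch_size):
            yield priority, queue[start:start + batch_size]


batches = list(distribute_by_priority(
    [(1, "list-a"), (3, "list-b"), (3, "list-c"), (2, "list-d"), (3, "list-e")],
    batch_size=2,
))
```

How the priority number is derived (past purchases, remaining payment time, views or favorites) is left outside the sketch, since the text treats it as a business decision.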
[0087] S2. Obtain the script templates corresponding to the user list in batches; generate a parameter mapping table based on the multivariate parameters and dynamic values of the multivariate parameters in the script templates; synthesize and distribute script tasks in batches based on the parameter mapping table and the script templates.
[0088] As shown in Figure 3, after receiving the user list, the task distribution module 12 calls the script system to obtain the corresponding script templates. In this embodiment, user lists from the same day can correspond to the same script template or different script templates. For example, user lists from the same batch can correspond to the same script template, or user lists received within a preset time period can correspond to the same script template, and so on. In this way, different batches of users can use different script templates to generate audio files in real time, meeting the need for personalized user outreach.
[0089] Furthermore, the start time, end time, and distribution status of each batch of user lists in the batch number queue can also be recorded and sent to the link tracing module 15 in real time to record the distribution progress information of each batch of user lists.
[0090] In another example, the user list can be received in real time, which is then the list of users to be reached in real time. In this case, the list distribution module 11 distributes the list of users to be reached to the task distribution module 12 in real time, and the task distribution module 12 receives the user list in real time and calls the script system to obtain the corresponding script templates for the user list.
[0091] In this embodiment, the script template includes fixed text and multiple predefined replaceable variables. Each variable can be broken down into variable parameters and corresponding variable parameter values, with variable parameters being occupied by variable placeholders. The variable parameters can be set according to business requirements. For example, in product recommendations, user name, product name, and product price can be used as variable parameters, with their corresponding values being variable parameter values.
[0092] As shown in Figure 3, the task distribution module 12 first parses and decomposes the variable placeholders in the script template to obtain the multivariate parameters. Then, it dynamically retrieves the multivariate parameter values for each user from the operations center based on the user list. For example, if the user list stores each user's user ID and the operations center stores different parameter values for different users, the module can dynamically retrieve that user's multivariate parameter values for the current script template based on the user ID. Next, it matches the multivariate parameters with the corresponding multivariate parameter values for each user, keyed by user ID, to generate a parameter mapping table. Finally, it synthesizes the script tasks based on the parameter mapping table and the script template, for example by replacing the multivariate parameters in the script template with each user's multivariate parameter values. Taking the script template "Dear {username}, I recommend a {productname} to you" as an example, username and productname are the variable parameters, and each user has different values for them. A parameter mapping table is generated by matching the variable parameters username and productname with each user's username and productname values. The username and productname values corresponding to the same user ID in the parameter mapping table are then used to replace username and productname in the script template, synthesizing a different script task for each user. For example, it can synthesize "Dear Ms. Wang, I recommend a facial cleanser to you," or "Dear Mr. Li, I recommend a mobile phone to you." Thus, if different user lists use different script templates, different user script tasks can be generated; if different batches of user lists use different script templates, different batches of user script tasks can be generated; and so on.
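The parse, acquire, match, and synthesize steps above can be sketched in a few lines. This is an illustrative sketch only: the function names, the `PARAMETER_STORE` dictionary standing in for the operations center, and the user IDs `u1` and `u2` are hypothetical, while the template and the example values come from the paragraph above.

```python
import string

TEMPLATE = "Dear {username}, I recommend a {productname} to you"

# Hypothetical stand-in for the operations center: parameter values per user ID.
PARAMETER_STORE = {
    "u1": {"username": "Ms. Wang", "productname": "facial cleanser"},
    "u2": {"username": "Mr. Li", "productname": "mobile phone"},
}


def parse_parameters(template):
    """Parse the template and return its placeholder names in order."""
    names = []
    for _, field, _, _ in string.Formatter().parse(template):
        if field and field not in names:
            names.append(field)
    return names


def build_mapping_table(parameters, user_ids, parameter_store):
    """Match each parameter with each user's value, keyed by user ID."""
    return {
        user_id: {name: parameter_store[user_id][name] for name in parameters}
        for user_id in user_ids
    }


def synthesize_script_tasks(template, mapping_table):
    """Fill the template once per user to obtain that user's script task."""
    return {
        user_id: template.format(**values)
        for user_id, values in mapping_table.items()
    }


parameters = parse_parameters(TEMPLATE)
mapping_table = build_mapping_table(parameters, ["u1", "u2"], PARAMETER_STORE)
tasks = synthesize_script_tasks(TEMPLATE, mapping_table)
```

Keeping the mapping table as a separate step, rather than formatting directly from the store, mirrors the text: the table holds only the parameters the current template needs, so the same store can serve different templates.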
[0093] Furthermore, once the script tasks have been synthesized, call information tailored to different users is obtained. As shown in Figure 3, call information 1, call information 2, ..., call information n can be stored together in a database queue (such as the TTS-Batch-Task-Queue in Redis) to obtain a call queue. Each call queue constitutes one batch, and script tasks are distributed to the elastic scheduling module 13 in batches. Simultaneously, script synthesis progress information (total number of script tasks, total number of synthesized script tasks, etc.) can be reported to the link tracing module 15 in real time. Furthermore, the task distribution module 12 also verifies the synthesized script tasks and writes back any abnormal script tasks.
[0094] S3. Determine in real time, based on the synthesis index, the computing power resource index value required for each batch of script tasks to be synthesized; schedule target computing power resources according to the computing power resource index value and the computing power resource investment; perform speech synthesis inference; and output audio files.
[0095] Computing power resources refer to the computing power used to process data and perform computational tasks, including hardware devices, software systems, and network resources. Computing power resources can be categorized by hardware type, such as CPU, GPU, and FPGA. In practical applications, appropriate computing power resources can be selected based on the type of task to be processed. This embodiment processes the inference task of a speech synthesis model; therefore, GPUs can be used as the computing power resource.
[0096] To determine in real time the computing resource index value required for each batch of script tasks to be synthesized, the synthesis index of each batch can first be obtained, and the required computing resource index value can then be determined from it. Specifically, the synthesis index is chosen to suit the computing resource index. In this embodiment, the computing resource index can be QPS (Queries Per Second), which measures the throughput of GPU online services. Correspondingly, the synthesis index may include: the total number of script tasks, the total number of synthesized script tasks, and the expected completion time. Optionally, the synthesis index can be determined based on the script synthesis progress information in the link tracing module 15.
[0097] As shown in Figure 3, after the script tasks synthesized in step S2 enter the call queue, each call queue forms one batch, and tasks are distributed by batch. The elastic scheduling module 13 can obtain, in real time from the script synthesis progress information of the link tracing module 15, the total number of script tasks in the current batch, the total number of synthesized script tasks in the current batch, and the expected completion time of the current batch. The QPS required for the current batch of script tasks to be synthesized is then determined by the following formula:
[0098] (1)
[0099] In the above formulas: taskTotal represents the total number of script tasks in the current batch, which can be obtained by pre-calculating the total number of script tasks in each call queue; inferFinalTaskTotal represents the total number of synthesized script tasks in the current batch, which can be determined based on the script synthesis information reported in real time by the link tracing module 15; and needTime represents the expected completion time for the current batch.
[0100] (2)
[0101] Where: batchStartTimeStamp is the timestamp of the start of the current batch, and T is the expected completion time of the current batch (for example, T=3600 means the expected completion time is 1 hour). T can be configured according to actual business needs. For example, the expected completion time of each batch can be configured according to the priority of each batch of call tasks. For instance, if the user list has been assigned to the corresponding user list queue according to priority in step S1, and the call tasks are synthesized in batches according to the priority of the user list queue, then the synthesized call tasks are also queued in batches according to the priority of the user list, and the call tasks in the same batch (i.e., in the same call queue) have the same priority; then the priority of the call tasks in the previous batch is higher than the priority of the call tasks in the next batch; therefore, the expected completion time of the call tasks in the previous batch can be configured to be less than the expected completion time of the call tasks in the next batch. For cases where there are many call tasks of the same priority and multiple queues are needed for storage, queue priorities can be marked, and the expected completion time of the corresponding batch can be configured to increase sequentially according to the queue priority from high to low. In addition, in step S2, the priority of the script task can be marked according to the priority of the user list during the script task synthesis. The script task enters the call queue with the corresponding priority mark according to its marked priority. The expected completion time of the corresponding batch is configured to increase sequentially according to the order of call queue priority from high to low.
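The bodies of formulas (1) and (2) are not reproduced in the text, so the sketch below is only one reading that is consistent with the variable definitions: the required QPS is taken as the number of tasks still to be synthesized divided by the time remaining, and needTime as the batch deadline (batchStartTimeStamp + T) minus the current time. The `now` argument, the rounding up, and the one-second floor on the remaining time are assumptions of this sketch, not statements from the source.

```python
import math
import time


def need_time(batch_start_timestamp, expected_duration, now=None):
    """Seconds left until the batch's expected completion time.

    Assumed reading of formula (2): deadline = batchStartTimeStamp + T.
    """
    if now is None:
        now = time.time()
    return batch_start_timestamp + expected_duration - now


def required_qps(task_total, infer_final_task_total, remaining_seconds):
    """QPS needed to finish the remaining tasks in the remaining time.

    Assumed reading of formula (1):
    (taskTotal - inferFinalTaskTotal) / needTime, rounded up.
    """
    remaining_tasks = max(task_total - infer_final_task_total, 0)
    if remaining_tasks == 0:
        return 0
    # Assumption: if the deadline has passed, plan for at least one second.
    remaining_seconds = max(remaining_seconds, 1.0)
    return math.ceil(remaining_tasks / remaining_seconds)


# A batch started at t=1000 s with T=3600 s, observed at t=2800 s.
seconds_left = need_time(1000, 3600, now=2800)
qps = required_qps(1_000_000, 100_000, seconds_left)
```

Under this reading, a batch that falls behind automatically demands a higher QPS the next time the value is recomputed, which matches the real-time adjustment the text describes.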
[0102] Furthermore, when the elastic scheduling module 13 receives multiple batches of script tasks simultaneously, it can adjust the synthesis metrics of each batch of script tasks based on the script synthesis information in the link tracing module 15. For example, it can retrieve the total number of script tasks, the total number of synthesized script tasks, and the expected completion time of these multiple batches of tasks from the link tracing module 15 in real time for adjustment, and adjust the QPS value required for each batch of script tasks to be synthesized in real time according to the above formula, so as to ensure the dynamic accuracy of the computational power assessment required for script tasks and the rationality of scheduling.
[0103] In this embodiment, to ensure both the effectiveness of speech synthesis inference and the economic efficiency of computing resource investment, target computing resources can be selected and scheduled based on the computing resource indicators required for the batch of script tasks to be synthesized and on the computing resource investment. For example, the priority of each computing resource is first configured according to the computing resource investment; target computing resources that meet the computing resource indicators required for the batch are then selected and scheduled according to those priorities. As shown in Figure 3, a computing power pool for speech synthesis tasks can be constructed, and the priority of each computing resource within the pool is configured in ascending order of computing power investment, so that the resource with the lowest investment has the highest priority. For example, if the computing resources in the pool are divided into internal local computing resources, internal cloud computing resources, and external cloud computing resources, their priorities can be configured from highest to lowest in that order. Here, internal local computing resources refer to computing resources available locally within the enterprise, internal cloud computing resources refer to computing resources available on the enterprise's internal cloud platform, and external cloud computing resources refer to computing resources available on a cloud platform external to the enterprise. As shown in Figure 3, the elastic scheduling module 13 finds all batches to be synthesized based on the script task synthesis progress information reported by the link tracing module 15. According to the priority of the candidate computing resources, it first selects and starts internal local computing resources that meet the above QPS value. When the internal local computing resources cannot meet the above QPS value, internal cloud computing resources that meet it are started. Similarly, when neither the internal local nor the internal cloud computing resources can meet the above QPS value, external cloud computing resources that meet it are started.
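The tiered fallback described above can be sketched as follows, purely as an example. The `ResourceTier` structure, the numeric priority convention (a lower number is tried first), and the capacity figures are assumptions introduced for illustration and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceTier:
    name: str
    priority: int        # lower number = tried first (assumed convention)
    available_qps: int   # QPS this tier can currently provide (placeholder values)


def select_tier(tiers: list, required_qps: int) -> Optional[ResourceTier]:
    """Return the highest-priority tier that can meet the required QPS, or None."""
    for tier in sorted(tiers, key=lambda t: t.priority):
        if tier.available_qps >= required_qps:
            return tier
    return None


# Hypothetical pool; the order in the list does not matter, only the priority.
POOL = [
    ResourceTier("external_cloud", priority=3, available_qps=5000),
    ResourceTier("internal_local", priority=1, available_qps=200),
    ResourceTier("internal_cloud", priority=2, available_qps=800),
]
```

With this pool, a batch needing 150 QPS stays on internal local resources, one needing 500 QPS falls through to the internal cloud, and only a batch above 800 QPS reaches the external cloud.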
[0104] Furthermore, if the selected target computing resources are external cloud resources, load testing metrics can be used to weigh the computing resource investment against the inference performance of each computing resource, so as to further reduce the investment. For example, the load testing metrics may include the computing resource investment and/or the computing resource load testing ratio, where the load testing ratio is the ratio between the computing resources actually consumed by the system during load testing and the theoretical maximum computing demand simulated by the load testing requests; it is a key indicator of the efficiency and cost of computing resource utilization. This step can therefore determine the load testing metrics of multiple computing resources and, based on those metrics, select and schedule the final target computing resources from the selected external cloud resources according to computing resource throughput. Taking GPUs as an example, performance load testing and comparative analysis of different types of GPUs can be performed in advance to determine the QPS value of each type of GPU in the speech synthesis model inference scenario, generating a GPU-QPS mapping table. The investment costs of the various types of GPU cards can also be obtained in advance. Based on the GPU-QPS mapping table and the investment costs of the GPU cards, a GPU computing power investment comparison table can be compiled, including resource type, number of Triton instances, number of characters, QPS, and investment (such as the price per hour per card and/or the price per single speech task). In practice, the number of GPUs available for performance load testing and comparison is limited. Therefore, after the GPU computing power investment comparison table is obtained, the list of available single-card GPU resources can be traversed and the GPU card types that appear in the comparison table selected. The selected GPU card types are then sorted in descending order of GPU throughput according to the comparison table. Based on this sorting, the final target computing resources are selected to start the speech synthesis model for inference on the batch of script tasks to be synthesized. For example, the speech synthesis inference service of a single A30 GPU instance can be started; or the speech synthesis inference service of a single T4 GPU instance can be started.
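As a minimal sketch of this selection step, the following example filters the available card types against a comparison table and sorts them by throughput. The `GpuProfile` structure and the function names are hypothetical, and the QPS and price figures are made-up placeholders, not load testing results from this embodiment.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    gpu_type: str
    qps: float             # load-tested throughput per card (placeholder value)
    price_per_hour: float  # investment per card-hour (placeholder value)


# Hypothetical comparison table; the numbers are illustrative, not measured.
COMPARISON_TABLE = {
    "A30": GpuProfile("A30", qps=40.0, price_per_hour=8.0),
    "T4": GpuProfile("T4", qps=15.0, price_per_hour=3.0),
}


def rank_candidates(table: dict, available_types: list) -> list:
    """Keep available card types that were load-tested, highest throughput first."""
    tested = [table[t] for t in dict.fromkeys(available_types) if t in table]
    return sorted(tested, key=lambda g: g.qps, reverse=True)


def cards_needed(profile: GpuProfile, required_qps: float) -> int:
    """Number of cards of one type needed to reach the required QPS."""
    return math.ceil(required_qps / profile.qps)
```

Card types that are available but absent from the table (for example because they were never load-tested) are skipped rather than guessed at, which mirrors the traversal described above.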
[0105] In this embodiment, the speech synthesis inference service needs to load the speech synthesis model file when it starts up. However, speech synthesis model files are usually quite large, and loading them at startup is time-consuming, especially when the file has to be loaded from a different data center; this can seriously slow the startup of the inference service. Therefore, in this embodiment, the speech synthesis model file is preloaded into the corresponding GPU data center. In this way, when the speech synthesis inference service starts up, inference can begin directly without spending a long time loading the model file.
[0106] As shown in Figure 3, during the speech synthesis model inference process, each speech synthesis inference task instance pulls an inference task from the Redis TTS-Batch-Task-Queue for inference. After inference is completed, the inference result (i.e., the synthesized audio file) is stored in the audio queue for post-processing. At the same time, the inference result can also be reported to the task distribution module 12 and the link tracing module 15.
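The pull, infer, enqueue, and report cycle of one inference task instance can be sketched as follows. To keep the example self-contained, in-process `queue.Queue` objects stand in for the Redis TTS-Batch-Task-Queue and the audio queue, and `synthesize` and `report` are hypothetical callables standing in for the speech synthesis model and the reporting to the task distribution and link tracing modules.

```python
import queue
from typing import Callable


def run_worker(task_queue: queue.Queue, audio_queue: queue.Queue,
               synthesize: Callable[[str], bytes],
               report: Callable[[str], None]) -> int:
    """Drain the task queue: infer each task, enqueue the audio, report progress.

    Returns the number of tasks this worker completed.
    """
    done = 0
    while True:
        try:
            task_id, text = task_queue.get_nowait()
        except queue.Empty:
            return done  # no more tasks: the instance can be reclaimed
        audio = synthesize(text)           # speech synthesis inference (stub)
        audio_queue.put((task_id, audio))  # hand over to post-processing
        report(task_id)                    # progress reporting (stub)
        done += 1
```

In a deployment along the lines of the embodiment, the two queues would be backed by Redis so that several inference instances can pull from the same task queue concurrently.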
[0107] The task distribution module 12 can push the inference results to the cloud for easy retrieval and use by the call center platform.
[0108] The link tracing module 15 can record the synthesized script tasks, which helps the elastic scheduling module 13 correct the task progress and allows the synthesis link to be displayed in a report.
[0109] After the speech synthesis inference task is completed, the elastic scheduling module 13 can promptly reclaim and destroy the occupied GPU resources, avoiding long-term idleness and waste of resources and thereby improving overall computing power utilization and system stability. As shown in Figure 3, the progress of all script tasks, including script synthesis progress information and inference progress information, is reported to the link tracing module 15 in real time through the task distribution module 12. The elastic scheduling module 13 obtains the script synthesis progress information and inference progress information in real time; evaluates the inference progress of each batch of script tasks based on this information; determines the QPS value of the computing power resources required for the remaining script tasks in each batch according to formulas (1) and (2); and shuts down redundant computing power resources in real time, so that the resource allocation stays consistent with actual needs. Once the entire inference process has finished, the elastic scheduling module 13 shuts down all running GPU instances together, realizing the dynamic release of computing power and efficient recycling of resources. Here, redundant computing power resources refer to the computing power resources that are in excess after a period of inference, especially as the inference nears its end.
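As an illustrative sketch only, identifying redundant instances can be reduced to comparing the number of running instances with the number still needed for the remaining QPS. The example assumes that all running instances have the same per-instance throughput and that the QPS required for the remaining tasks has already been computed; the function name and instance identifiers are hypothetical.

```python
import math


def instances_to_stop(running: list, per_instance_qps: float,
                      required_qps: float) -> list:
    """Return the running instances that exceed what the remaining tasks require.

    The first `needed` instances are kept; the rest are treated as redundant.
    """
    needed = math.ceil(required_qps / per_instance_qps) if required_qps > 0 else 0
    return running[needed:]
```

When the required QPS drops to zero at the end of a batch, every running instance is returned, which corresponds to the final shutdown of all GPU instances described above.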
[0110] S4. Post-process the audio file to generate a personalized voice file.
[0111] For example, the audio file output in step S3 can undergo volume normalization, noise reduction, equalization, audio format conversion, and the like, to improve the quality and fluency of the audio file. Further, as shown in Figure 3, according to actual business needs, the audio files can be transferred to a specified address or other resources, passed to other APIs for processing or analysis, transmitted to NAS (Network Attached Storage) for multi-device sharing and backup, or transmitted to Redis in the outbound call system application, where the call content and the audio file address are stored as key-value pairs to facilitate outbound call system scheduling.
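Two of these post-processing steps can be sketched as follows, by way of example only. Peak normalization is used here as one simple form of volume normalization (the embodiment does not specify which method is used), samples are assumed to be floating-point values in the range -1.0 to 1.0, and a plain dictionary stands in for the Redis key-value store; all names are hypothetical.

```python
def normalize_peak(samples: list, target_peak: float = 0.9) -> list:
    """Scale samples so the loudest one reaches target_peak (peak normalization)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty audio: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def register_audio(index: dict, call_content: str, audio_address: str) -> None:
    """Store the call content and the audio file address as a key-value pair."""
    index[call_content] = audio_address
```

The outbound call system can then look up the audio address by call content when it schedules a call.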
[0112] In summary, the personalized speech generation system and method based on multivariate parameters provided by this invention have at least the following advantages compared to existing technologies:
[0113] 1. This invention generates a parameter mapping table based on the multivariate parameters in the script template and the values of the multivariate parameters obtained from the user list. Based on the parameter mapping table and its corresponding batch of script templates, millions of script tasks can be dynamically generated in batches, meeting the business needs for differentiated user outreach.
[0114] 2. This invention dynamically selects and schedules target computing resources in real time based on the computing resource indicators required for each batch of script tasks to be synthesized and on the available computing resources, starts the speech synthesis model to perform inference on the batch, and outputs audio files. This achieves dynamic and precise scheduling of computing resources, reduces computing costs by more than 30%, and supports stable operation under high concurrency. Batch generation of highly personalized scripts improves telemarketing coverage and success rate.
[0115] 3. This invention reports the progress of task batches to the link tracing module in real time, realizing full traceability from task issuance through script synthesis and computing resource scheduling to speech synthesis inference.
[0116] 4. This invention uses real-time data from the link tracing module to calculate the supply and demand of computing resources and dynamically destroys redundant computing resources, thus saving computing resources while ensuring stable operation of the system under high concurrency.
[0117] 5. This invention distributes user lists in batches according to the priority of the user list queue, synthesizes script tasks, and performs inference by elastically scheduling computing resources, thereby prioritizing the synthesis of voice files for target users (such as high-quality users or urgent users) and improving reach efficiency.
[0118] 6. This invention preloads the speech synthesis model file in the data center, so that the speech synthesis model does not need to be loaded again when the inference service starts, thereby greatly improving the startup speed of computing instances.
[0119] 7. This invention provides a multivariate speech synthesis architecture with scalability and modular design, which overcomes the shortcomings of existing systems that require large-scale modification when expanding multivariate synthesis, reduces R&D and deployment costs, and shortens the launch cycle.
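The priority-based batch distribution of user lists mentioned in advantage 5 above can be sketched as follows, purely for illustration. The `ListDistributor` class, the convention that a lower priority number is distributed first, and the batch size are assumptions and are not specified by the embodiment.

```python
import heapq
import itertools
from typing import Iterator


class ListDistributor:
    """Distributes queued user lists in priority order, split into batches."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, priority: int, user_list: list) -> None:
        """Queue a user list; a lower priority number is distributed first."""
        heapq.heappush(self._heap, (priority, next(self._counter), user_list))

    def batches(self, batch_size: int) -> Iterator[list]:
        """Yield batches of users, exhausting higher-priority lists first."""
        while self._heap:
            _, _, users = heapq.heappop(self._heap)
            for i in range(0, len(users), batch_size):
                yield users[i:i + batch_size]
```

A list submitted later with a higher priority is still distributed before any lower-priority list that remains in the queue, which is what allows voice files for target users to be synthesized first.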
[0120] Those skilled in the art will understand that the modules in the above system embodiments can be distributed in the device as described, or they can be modified accordingly and distributed in one or more devices different from the above embodiments. The modules in the above embodiments can be combined into one module, or they can be further divided into multiple sub-modules.
[0121] The following describes embodiments of the electronic device of the present invention, which can be considered as implementations of the physical form of the methods and apparatus embodiments of the present invention described above. Details described in the embodiments of the electronic device of the present invention should be considered as supplements to the methods or apparatus embodiments described above; details not disclosed in the embodiments of the electronic device of the present invention can be implemented with reference to the methods or systems embodiments described above.
[0122] Figure 4 is a structural block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in Figure 4 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.
[0123] As shown in Figure 4, the electronic device 400 of this exemplary embodiment takes the form of a general-purpose data processing device. The components of the electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, a bus 430 connecting different electronic device components (including the storage unit 420 and the processing unit 410), a display unit 440, etc.
[0124] The storage unit 420 stores a computer-readable program, which may be source code or read-only program code. The program can be executed by the processing unit 410, causing the processing unit 410 to perform the steps of various embodiments of the present invention. For example, the processing unit 410 can perform the steps shown in Figure 2.
[0125] Bus 430 can represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor bus, or a local bus using any of a variety of bus structures.
[0126] Electronic device 400 can also communicate with one or more external devices 100 (e.g., keyboard, monitor, network device, Bluetooth device, etc.), enabling users to interact with electronic device 400 via these external devices 100, and/or enabling electronic device 400 to communicate with one or more other data processing devices (e.g., router, modem, etc.). This communication can be performed via the input/output (I/O) interface 450. Electronic device 400 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network) via network adapter 460. Network adapter 460 can communicate with other modules of electronic device 400 via bus 430.
[0127] This invention also provides a computer program product, including a computer program that, when executed by a processor, implements the method described in any of the above-mentioned embodiments. Specifically: user lists are distributed in batches according to the priority of the user list queue; script templates corresponding to the user lists are obtained in batches; a parameter mapping table is generated based on the multivariate parameters in the script templates and their dynamic values; script tasks are synthesized and distributed in batches based on the parameter mapping table and the script templates; the computing resource indicator value required for each batch of script tasks to be synthesized is determined in real time according to the synthesis metrics; target computing resources are scheduled according to the computing resource indicator value to perform speech synthesis inference, and an audio file is output; and the audio file is post-processed to generate a personalized voice file.
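The parameter mapping step in the method recited above can be sketched as follows, as an example under stated assumptions. The `{name}` placeholder syntax for the replaceable variable parameters, the function names, and the sample template are all hypothetical; the embodiment does not specify a placeholder format.

```python
from string import Formatter


def parse_parameters(template: str) -> list:
    """Extract the replaceable variable names from a template, in order, de-duplicated."""
    names = [field for _, field, _, _ in Formatter().parse(template) if field]
    return list(dict.fromkeys(names))


def build_mapping_table(template: str, users: list) -> list:
    """One row per user, holding only the values the template actually needs."""
    params = parse_parameters(template)
    return [{name: user[name] for name in params} for user in users]


def synthesize_script_tasks(template: str, table: list) -> list:
    """Fill the template with each row of the mapping table to get the script text."""
    return [template.format(**row) for row in table]
```

Because the mapping table keeps only the parameters that appear in the template, unrelated fields in the user list (such as a phone number) are not carried into the script task.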
[0128] In summary, the present invention can be implemented by methods, systems, electronic devices, or computer program products that execute computer programs. In practice, some or all of the functions of the present invention can be implemented using general-purpose data processing devices such as microprocessors or digital signal processors (DSPs).
[0129] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the present invention is not inherently related to any specific computer, virtual device, or electronic device, and various general-purpose devices can also implement the present invention. The above descriptions are merely specific embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A personalized speech generation system based on multivariate parameters, characterized in that it comprises: a list distribution module, configured to distribute user lists to a task distribution module in batches according to the priority of the user list queue; the task distribution module, configured to receive the user lists in batches and obtain the script template corresponding to each user list, wherein user lists in the same batch correspond to the same script template, or user lists received within a preset time period correspond to the same script template, and the script template contains fixed text and multiple predefined replaceable variable parameters and their corresponding values; to generate a parameter mapping table based on the multivariate parameters in the script template and their dynamic values; and to synthesize and distribute script tasks in batches based on the parameter mapping table and the script template; an elastic scheduling module, configured to receive the script tasks from the task distribution module in batches, determine in real time, based on synthesis metrics, the computing resource indicator required for each batch of script tasks to be synthesized, and schedule target computing resources according to the computing resource indicator to perform speech synthesis inference, so as to achieve dynamic allocation and maximized utilization of computing resources, and output audio files; wherein the computing resource indicator is queries per second (QPS), determined from taskTotal, inferFinalTaskTotal, and needTime, where taskTotal represents the total number of script tasks in the current batch, inferFinalTaskTotal represents the total number of synthesized script tasks in the current batch, and needTime represents the expected completion time for the current batch; and needTime is determined from batchStartTimeStamp and T, where batchStartTimeStamp is the timestamp of the start of the current batch and T is the expected completion time of the current batch; and a post-processing module, configured to post-process the audio files to generate personalized voice files.
2. The system according to claim 1, characterized in that the elastic scheduling module comprises: a configuration unit, configured to configure the priority of each computing resource according to the computing resource investment; and a selection unit, configured to select and schedule, according to the priority of the computing resources, target computing resources that meet the computing resource indicator value required for the batch of script tasks to be synthesized.
3. The system according to claim 2, characterized in that the elastic scheduling module further comprises: a second determining unit, configured to determine the load testing metrics of multiple computing resources when the target computing resources selected by the selection unit are external cloud resources; and a sub-selection unit, configured to select and schedule, based on the load testing metrics, the final target computing resources from the selected external cloud resources according to computing resource throughput.
4. The system according to claim 1, characterized in that the system further comprises: a link tracing module, configured to obtain at least one of the following in real time: user list distribution progress information, script synthesis progress information, and inference progress information; wherein the elastic scheduling module determines the synthesis metrics based on the list distribution progress information and the script synthesis progress information in the link tracing module.
5. The system according to claim 4, characterized in that the elastic scheduling module further comprises: a computing power shutdown unit, configured to obtain the script synthesis progress information and the inference progress information from the link tracing module, and to determine and shut down redundant computing power resources in real time based on the script synthesis progress information and the inference progress information.
6. The system according to claim 1, characterized in that the task distribution module comprises: a parsing unit, configured to parse and decompose the multivariate parameters in the script template; an acquisition unit, configured to dynamically acquire, based on the user list, the values of the multivariate parameters corresponding to each user; and a matching unit, configured to match the multivariate parameters with the values of the multivariate parameters corresponding to each user to generate the parameter mapping table.
7. A personalized speech generation method based on multivariate parameters, characterized in that it comprises: distributing user lists in batches according to the priority of the user list queue; obtaining, in batches, the script templates corresponding to the user lists, wherein user lists in the same batch correspond to the same script template, or user lists received within a preset time period correspond to the same script template, and the script template contains fixed text and multiple predefined replaceable variable parameters and their corresponding values; generating a parameter mapping table based on the multivariate parameters in the script template and their dynamic values; synthesizing and distributing script tasks in batches based on the parameter mapping table and the script template; determining in real time, based on synthesis metrics, the computing resource indicator required for each batch of script tasks to be synthesized, and scheduling target computing resources according to the computing resource indicator to perform speech synthesis inference, so as to achieve dynamic allocation and maximized utilization of computing resources, and outputting audio files; wherein the computing resource indicator is queries per second (QPS), determined from taskTotal, inferFinalTaskTotal, and needTime, where taskTotal represents the total number of script tasks in the current batch, inferFinalTaskTotal represents the total number of synthesized script tasks in the current batch, and needTime represents the expected completion time for the current batch; and needTime is determined from batchStartTimeStamp and T, where batchStartTimeStamp is the timestamp of the start of the current batch and T is the expected completion time of the current batch; and post-processing the audio files to generate personalized voice files.
8. The method according to claim 7, characterized in that scheduling target computing resources according to the computing resource indicator value comprises: configuring the priority of each computing resource according to the computing resource investment; and selecting and scheduling, according to the priority of the computing resources, target computing resources that meet the computing resource indicator value.
9. The method according to claim 8, characterized in that, if the selected target computing resources are external cloud resources, the method further comprises: determining the load testing metrics of multiple computing resources; and, based on the load testing metrics, selecting and scheduling the final target computing resources from the selected external cloud resources according to computing resource throughput.
10. The method according to claim 7, characterized in that the method further comprises: obtaining at least one of the following in real time: user list distribution progress information, script synthesis progress information, and inference progress information; and determining the synthesis metrics based on the list distribution progress information and the script synthesis progress information.
11. The method according to claim 10, characterized in that the method further comprises: determining and shutting down redundant computing resources in real time based on the script synthesis progress information and the inference progress information.
12. The method according to claim 7, characterized in that generating a parameter mapping table based on the multivariate parameters in the script template and their dynamic values comprises: parsing and decomposing the multivariate parameters in the script template; dynamically obtaining, based on the user list, the values of the multivariate parameters corresponding to each user; and matching the multivariate parameters with the values of the multivariate parameters corresponding to each user to generate the parameter mapping table.
13. An electronic device, characterized in that it comprises: a processor; and a memory storing computer-executable instructions which, when executed, cause the processor to perform the method according to any one of claims 7 to 12.
14. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method according to any one of claims 7 to 12.