Content generation method and apparatus, device and storage medium

By acquiring configuration information and automatically generating target text and audio data using target models, the problem of cumbersome and time-consuming traditional content production processes is solved, enabling fast, professional, and personalized content generation.

WO2026129106A1 · PCT designated stage · Publication Date: 2026-06-25 · BEIJING ZITIAO NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date: 2024-12-16
Publication Date: 2026-06-25

IPC: G10L13/033

AI Tagging

Application Domain

Speech synthesis

Technology Topics

Engineering Audio frequency

AI Technical Summary

Technical Problem

Traditional content production processes are cumbersome and time-consuming, making it difficult to meet the needs of rapid creation. Furthermore, the style and timeliness of content generation are limited, affecting the diversity and appeal of creations.

Method used

By acquiring configuration information, reference information associated with the topic is generated, and target text and audio data are automatically generated using the target model, supporting user intervention and customized adjustments.

Benefits of technology

It achieves a fully automated process from task submission to audio data generation, saving time and costs, and generating professional and attractive content to meet different creative needs.

✦ Generated by Eureka AI based on patent content.

Abstract

A content generation method and apparatus, a device and a storage medium. The method comprises: acquiring configuration information for a content generation task (210), the configuration information at least indicating a theme of content to be generated and an audio configuration of said content; on the basis of the configuration information, acquiring reference information associated with the theme (220); at least on the basis of the reference information, generating a target text matching the theme (230); and, on the basis of the audio configuration and the target text, generating audio data corresponding to the target text (240). In this way, integrating a target model and an engineering link can automatically complete the generation of a target text and audio data, thereby remarkably reducing the time and cost for content production.

Description

Methods, apparatus, devices and storage media for content generation

Technical Field

[0001] The exemplary embodiments disclosed herein generally relate to the field of computers, and more particularly to methods, apparatus, devices and computer-readable storage media for content generation.

Background Technology

[0002] With the rapid development of the internet and information technology, various types of content creation and media production activities have also shown a rapid growth trend. For example, podcasts, as a form of online audio program, involve multiple stages in their production and distribution process, such as material collection, recording, editing, and synthesis. These stages often rely on the creator's personal abilities and a large amount of manual work.

Summary of the Invention

[0003] In a first aspect of this disclosure, a method for content generation is provided. The method includes: obtaining configuration information for a content generation task, the configuration information indicating at least a topic and audio configuration of the content to be generated; obtaining reference information associated with the topic based on the configuration information; generating target text matching the topic based at least on the reference information; and generating audio data corresponding to the target text based on the audio configuration and the target text.

[0004] In a second aspect of this disclosure, a content generation apparatus is provided. The apparatus includes: a configuration information acquisition module configured to acquire configuration information for a content generation task, the configuration information indicating at least a topic of the content to be generated and an audio configuration of the content to be generated; a reference information acquisition module configured to acquire reference information associated with the topic based on the configuration information; a generation module configured to generate target text matching the topic, at least based on the reference information; and the generation module further configured to generate audio data corresponding to the target text based on the audio configuration and the target text.

[0005] In a third aspect of this disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. When executed by the at least one processing unit, the instructions cause the device to perform the method of the first aspect.

[0006] In a fourth aspect of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that can be executed by a processor to implement the method of the first aspect.

[0007] It should be understood that the content described in this Summary section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0008] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

[0009] Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure may be implemented;

[0010] Figure 2 illustrates a flowchart of an example process for generating content according to some embodiments of this disclosure;

[0011] Figure 3 illustrates a schematic diagram of the content generation process according to some embodiments of the present disclosure;

[0012] Figure 4A illustrates a schematic diagram of a process for generating target text according to some embodiments of the present disclosure;

[0013] Figure 4B shows a schematic diagram of the sound effect processing procedure according to some embodiments of the present disclosure;

[0014] Figure 4C illustrates a schematic diagram of the process of performing a podcast generation task according to some embodiments of the present disclosure;

[0015] Figure 5 shows a schematic structural block diagram of an example apparatus for content generation according to some embodiments of the present disclosure; and

[0016] Figure 6 shows a block diagram of an electronic device capable of implementing several embodiments of the present disclosure.

Detailed Implementation

[0017] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0018] It should be noted that the headings of any section / subsection provided herein are not limiting. Various embodiments are described throughout this document, and embodiments of any type may be included under any section / subsection. Furthermore, embodiments described in any section / subsection may be combined in any way with any other embodiments described in the same section / subsection and / or different sections / subsections.

[0019] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0020] The embodiments of this disclosure may involve user data, data acquisition, and / or use. All of these aspects comply with applicable laws, regulations, and relevant provisions. In the embodiments of this disclosure, all data collection, acquisition, processing, manipulation, forwarding, and use are conducted with the user's knowledge and confirmation. Accordingly, in implementing the embodiments of this disclosure, the type, scope of use, and usage scenarios of any data or information that may be involved should be communicated to the user and their authorization obtained in accordance with relevant laws and regulations through appropriate means. The specific methods of notification and / or authorization may vary depending on the actual situation and application scenario, and the scope of this disclosure is not limited in this respect.

[0021] In this specification and the embodiments, any processing of personal information will be carried out only under the premise of legality (such as obtaining the consent of the personal information subject, or being necessary for the performance of a contract), and will only be carried out within the scope stipulated or agreed upon. A user's refusal to process personal information other than that necessary for basic functions will not affect the user's use of basic functions.

[0022] In embodiments of this disclosure, the target model can employ any suitable algorithm or operation to implement the described functionality. In some embodiments, the target model may include any appropriate machine learning model. In some embodiments, one or more target models may be constructed based on a language model (LM), such as a large language model (LLM). The machine learning model used may be a content-generative model capable of generating corresponding outputs based on model inputs. In some embodiments, the machine learning model may be a multimodal model capable of receiving textual modal model inputs (e.g., natural language and / or machine language) and / or non-textual modal model inputs (e.g., images, speech, video, etc.), and capable of generating the desired output based on the model inputs and prompts.

[0023] Example Environment

[0024] Figure 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In this example environment 100, an application 120 is installed on a terminal device 110. A user 140 can interact with the application 120 via the terminal device 110 and / or an attached device of the terminal device 110.

[0025] In some embodiments, application 120 may be a content sharing application, a content editing application, a content creation application, etc. Application 120 can provide user 140 with various services related to content generation, such as functions including text generation, editing, and publishing, and audio content generation and processing.

[0026] In environment 100 of Figure 1, if application 120 is active, terminal device 110 can display the interface 150 of application 120. Interface 150 may include various interfaces provided by application 120, such as a content generation configuration interface, a generation status interface, a result display interface, etc. Application 120 can provide content editing and generation functions (e.g., application 120 can be a podcast generation platform) to support the submission of content generation tasks and the generation, editing, and publishing of content within application 120.

[0027] In some embodiments, terminal device 110 communicates with server 130 to provide services to application 120. Terminal device 110 can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. In some embodiments, terminal device 110 can also support any type of user-facing interface (such as "wearable" circuitry). Server 130 can be various types of computing systems / servers capable of providing computing power, including but not limited to mainframes, edge computing nodes, computing devices in cloud environments, etc.

[0028] It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of this disclosure.

[0029] As mentioned above, with the rapid development of the internet and information technology, various types of content creation and media production activities have also shown a rapid growth trend. For example, podcasts, as a widely popular and increasingly mainstream media form, are becoming an important channel for individuals and businesses to convey information and share viewpoints. With the rapid development of artificial intelligence (AI) and natural language processing technologies, generative AI has already demonstrated strong potential in the creation of text, images, audio, and other content. However, the traditional content production process still faces a series of problems and challenges. For example, in related technologies, the creation and media production process of podcasts and other content involves multiple manual operation steps. These steps not only require creators to have certain technical skills and creative experience, but may also be limited by resources and time. Furthermore, due to the long content production cycle, the produced content often suffers from poor timeliness. Several typical scenarios are described below.

[0030] A typical scenario involves efficiency issues in the content creation process. Specifically, when creating media content such as podcasts, users typically need to go through multiple stages, including theme planning, scriptwriting, recording, and editing. For inexperienced users, these stages often require a significant investment of time and energy, and even experienced users struggle to avoid the high costs of material gathering and content creation. This cumbersome process significantly reduces users' enthusiasm for attempting content creation and also constitutes a certain productivity bottleneck.

[0031] Another typical scenario involves the problem of homogenized content creation styles. Specifically, when creating media content, users' output language style, presentation format, and vocal characteristics are often limited by their personal abilities and experience. For example, a host accustomed to creating news podcasts may find it difficult to produce entertainment or educational podcasts; and for non-native speakers, creating podcast content that is fluent and conforms to the conventions of the target language requires additional language learning and adjustment costs. These limitations directly affect the diversity and creativity of content production.

[0032] Another typical scenario involves the timeliness of media content production. Specifically, when trending social topics emerge, users typically need to quickly complete tasks such as material gathering, scriptwriting, and audio recording. However, due to the long cycle of traditional content production, these trending topics may have already lost their timeliness, missing the optimal window for dissemination. This lag is particularly pronounced in today's rapidly changing landscape of trending information, negatively impacting the appeal and dissemination effectiveness of media content.

[0033] In view of this, embodiments of the present disclosure propose a content generation scheme. The scheme includes: obtaining configuration information for a content generation task, the configuration information indicating at least the topic of the content to be generated and the audio configuration of the content to be generated; obtaining reference information associated with the topic based on the configuration information; generating target text matching the topic based at least on the reference information; and generating audio data corresponding to the target text based on the audio configuration and the target text.

[0034] In this way, the embodiments of this disclosure can achieve a fully automated process from task submission to the generation of text and audio data, greatly saving time and costs in content generation. Furthermore, in some embodiments, through the application of models and user intervention configuration, different user creative needs are met, ensuring the professionalism and appeal of the generated content.

[0035] The following section provides a detailed description of various example implementations of this scheme, with reference to the accompanying drawings.

[0036] Figure 2 shows a flowchart of an example process 200 for content generation according to some embodiments of the present disclosure. Process 200 may be implemented at server 130. Process 200 will now be described with reference to Figure 1.

[0037] As shown in Figure 2, in box 210, server 130 obtains configuration information for the content generation task. The configuration information indicates at least the topic and the audio configuration of the content to be generated. The configuration information is the basic information defining the generated content and output format, and it specifies the input content and topic of the generation task. The configuration information can also define audio configuration details of the generated content, such as audio format, language, duration, and other characteristics.

[0038] As an example, user 140 can submit a content generation task to server 130 by calling the content generation service's interface through a mobile application (or mini-program), PC application, or third-party tool platform. In response to the content generation task submitted by user 140, server 130 processes the specific business logic, completing various tasks and operations during the content generation process. First, server 130 obtains the configuration information for the content generation task.

[0039] In box 220, server 130 obtains reference information associated with the topic based on configuration information. In some embodiments, after obtaining the configuration information, server 130 can obtain associated reference information based on the topic of the generated content. For example, server 130 can determine user input associated with the topic based on the configuration information and use the user input as reference information. Alternatively, server 130 can further search for and supplement material associated with the topic as reference information based on the configuration information to ensure the richness and accuracy of the generated content.

[0040] In box 230, server 130 generates target text that matches the topic, at least based on reference information. As an example, if the input provided by user 140 is sufficiently specific and complete, server 130 can directly use this input as the core reference information to generate the corresponding target text content (such as a podcast script or broadcast transcript). Alternatively, server 130 can further search for and supplement relevant materials related to the topic as reference information based on configuration information to generate the corresponding target text content.

[0041] In box 240, server 130 generates audio data corresponding to the target text based on audio configuration and target text. In some embodiments, server 130 can generate audio data corresponding to the target text using text-to-speech (TTS) technology based on audio configuration (such as timbre, tone, and speech rate) and target text (e.g., a generated podcast script). In this process, server 130 converts each part of the target text into speech and adjusts parameters such as sound quality, timbre, and tone according to the audio configuration, ultimately generating complete audio data.
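The four boxes of process 200 can be read as a simple pipeline. The following is a minimal sketch of that reading, not the implementation described in the disclosure: the class names, field names, and the three stand-in callables are all hypothetical, and no real model or TTS engine is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AudioConfig:
    # Hypothetical fields; the text mentions timbre, tone, and speech rate.
    timbre: str = "neutral"
    speech_rate: float = 1.0


@dataclass
class TaskConfig:
    # Box 210: the configuration indicates at least a topic and an audio configuration.
    topic: str
    audio: AudioConfig = field(default_factory=AudioConfig)


def run_content_task(
    config: TaskConfig,
    fetch_reference: Callable[[str], List[str]],
    write_text: Callable[[str, List[str]], str],
    synthesize: Callable[[str, AudioConfig], bytes],
) -> Tuple[str, bytes]:
    """Chain boxes 220 to 240 for a configuration obtained in box 210."""
    reference = fetch_reference(config.topic)          # box 220
    target_text = write_text(config.topic, reference)  # box 230
    audio = synthesize(target_text, config.audio)      # box 240
    return target_text, audio


# Stand-in callables (assumptions) so the sketch runs on its own.
def fake_fetch(topic: str) -> List[str]:
    return [f"note about {topic}"]


def fake_write(topic: str, reference: List[str]) -> str:
    return f"Script on {topic}: " + "; ".join(reference)


def fake_tts(text: str, audio: AudioConfig) -> bytes:
    return text.encode("utf-8")


text, audio = run_content_task(
    TaskConfig(topic="smart homes"), fake_fetch, fake_write, fake_tts
)
```

Passing the three stages in as callables keeps the sketch independent of any particular model or speech engine.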

[0042] In this disclosed embodiment, a fully automated process from task submission to text and audio data generation is achieved, significantly saving time and cost in content generation. As can be seen from the further embodiments described below, the application of models and user intervention meet diverse user creative needs, ensuring the professionalism and appeal of the generated content.

[0043] In this embodiment of the disclosure, server 130 can divide multiple nodes in the content generation process into pluggable or removable functional modules or node units that can be enabled on demand. Such node units may include, for example, determining search terms, determining reference information, determining target text, determining audio data, etc. User 140 can choose to automate the generation process, manually intervene in certain key node units, or selectively enable or skip some nodes, thereby ensuring content generation efficiency while adapting to diverse content generation needs.
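One way to picture the pluggable node units is an ordered list of named steps, each of which can run automatically, be skipped, or be replaced by a user-supplied result. The sketch below assumes that model; the node names, the state dictionary, and `run_pipeline` are illustrative and do not come from the disclosure.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

State = Dict[str, object]
Node = Callable[[State], State]


def determine_search_terms(state: State) -> State:
    return {**state, "keywords": [state["topic"]]}


def determine_reference(state: State) -> State:
    keywords = state.get("keywords", [])
    return {**state, "reference": [f"material for {k}" for k in keywords]}


def determine_target_text(state: State) -> State:
    # Fall back to the topic itself when no reference was gathered.
    reference = state.get("reference") or [state["topic"]]
    return {**state, "text": " / ".join(reference)}


PIPELINE: List[Tuple[str, Node]] = [
    ("search_terms", determine_search_terms),
    ("reference", determine_reference),
    ("target_text", determine_target_text),
]


def run_pipeline(
    state: State,
    skip: Iterable[str] = (),
    manual: Optional[Dict[str, State]] = None,
) -> State:
    """Run nodes in order; skipped nodes are ignored, manual entries replace a node's output."""
    skipped = set(skip)
    manual = manual or {}
    for name, node in PIPELINE:
        if name in skipped:
            continue
        if name in manual:
            state = {**state, **manual[name]}  # user intervention at this node
        else:
            state = node(state)
    return state


auto = run_pipeline({"topic": "smart homes"})
edited = run_pipeline(
    {"topic": "smart homes"},
    manual={"search_terms": {"keywords": ["smart home trends"]}},
)
no_search = run_pipeline({"topic": "smart homes"}, skip=["search_terms", "reference"])
```

The three calls correspond to fully automatic generation, manual intervention at one node, and skipping the search-related nodes.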

[0044] As an example, Figure 3 illustrates a schematic diagram of a content generation process 300 according to some embodiments of the present disclosure. Process 300 may be implemented at server 130.

[0045] In box 311, server 130 obtains configuration information 312 for the content generation task. As shown in Figure 3, configuration information 312 may include user input. In some embodiments, user 140 may only provide initial input, and server 130 may determine other required configuration information based on built-in logic (such as default configuration information) or with the help of a target model after validating the user input, thereby executing the corresponding content generation task. If user 140 has more specific needs, the configuration information can also be flexibly adjusted (such as specifying a role name, selecting multilingual support, adjusting the search strategy, adjusting audio configuration, etc.). This allows content generation to meet both the simple needs of users without configuration and the diverse needs of users with customized requirements.
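Filling in missing configuration from built-in defaults can be sketched as a recursive merge in which user-supplied values override defaults. The keys and default values below are assumptions chosen for illustration; the disclosure does not enumerate them.

```python
import copy
from typing import Any, Dict

# Hypothetical built-in defaults.
DEFAULT_CONFIG: Dict[str, Any] = {
    "language": "en",
    "search": {"enabled": True, "max_articles": 10},
    "audio": {"timbre": "neutral", "speech_rate": 1.0},
}


def _merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults updated with overrides, merging nested dictionaries."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(user_input: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the initial input, then fill every unspecified setting from defaults."""
    if not str(user_input.get("topic", "")).strip():
        raise ValueError("a topic is required")
    # Deep copy so callers cannot mutate the shared defaults.
    return _merge(copy.deepcopy(DEFAULT_CONFIG), user_input)


minimal = resolve_config({"topic": "smart homes"})
custom = resolve_config({"topic": "smart homes", "audio": {"speech_rate": 1.2}})
```

A user who supplies only a topic gets the defaults unchanged, while a user who overrides a single audio setting keeps the remaining defaults.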

[0046] Referring to Figure 3, configuration information 312 may also include input type, i.e., the type of initial input provided by the user. Input type may include topic keywords, such as user 140 directly providing a clear podcast topic or core concept. Alternatively, input type may include long text, such as user 140 uploading a complete transcript or lengthy text content as reference information for the generated content.

[0047] In some embodiments, the input type may also include a web link, such as a web Uniform Resource Locator (web URL). In response to determining that the configuration information includes link information indicating the topic, server 130 may utilize a data acquisition function block to acquire information from the data object indicated by the link information as at least a portion of the reference information.

[0048] For example, server 130 receives a web link submitted by a user and can use methods provided by data acquisition tools or other plug-in tools to access the specified webpage and obtain its content. Then, server 130 can use natural language processing technology to analyze the webpage content and extract core information as at least some of the reference information.
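The three input types described above (topic keywords, long text, web link) suggest a small dispatch step. The sketch below is one possible way to do it: the length threshold and function names are assumptions, and page fetching is passed in as a callable so no network access is implied.

```python
from typing import Callable, List
from urllib.parse import urlparse

LONG_TEXT_THRESHOLD = 200  # hypothetical cut-off, in characters


def classify_input(user_input: str) -> str:
    """Label the initial input as a web link, long text, or topic keywords."""
    text = user_input.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc and " " not in text:
        return "web_link"
    if len(text) >= LONG_TEXT_THRESHOLD:
        return "long_text"
    return "topic_keywords"


def initial_reference(user_input: str, fetch_page: Callable[[str], str]) -> List[str]:
    """Return the reference material that the input itself provides."""
    kind = classify_input(user_input)
    if kind == "web_link":
        return [fetch_page(user_input.strip())]  # content of the linked data object
    if kind == "long_text":
        return [user_input.strip()]              # the text is used directly
    return []  # keywords alone: material is gathered by a later search step
```

For example, `initial_reference("https://example.com/post", fetch)` calls `fetch` with the URL, whereas a short phrase such as "smart home trends" yields an empty list and leaves the material to the search step.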

[0049] In this embodiment, by providing multiple content input methods and automatically selecting the corresponding processing flow based on the type of user input, the generated content can be ensured to be accurate and rich. Simultaneously, this design significantly improves the adaptability of the content generation service and the user experience. Users can automatically complete all content production tasks by inputting only a single topic, achieving one-click content production. Users can personalize content generation tasks according to their own needs, ensuring that the generated results highly match the target content.

[0050] Server 130 can further search for and supplement information related to the topic based on the configuration information as reference information to ensure the richness and accuracy of the generated content. Continuing to refer to Figure 3, in box 313, server 130 determines at least one keyword based on user input related to the topic in the configuration information, using the first target model.

[0051] In some embodiments, server 130 may also determine one or more candidate keywords (which can also be understood as search terms or query questions) based on user input and using a first target model. Then, it may use one or more candidate keywords to obtain reference information (which can also be understood as materials for generating target text) from one or more data sources 314 for generating the corresponding text.

[0052] As an example, server 130 determines the topic of the content to be generated as "the future development of smart home technology" based on user input. Then, server 130 can use a target model, such as a large language model, to generate multiple keywords from the input content for obtaining at least some of the reference information, for example "What are the application scenarios of smart home technology?", "Smart home development trends", "Smart home market forecast for the next five years", and "Latest smart home technology innovation cases".

[0053] In some embodiments, server 130 obtains user feedback on one or more candidate keywords. Based on the user feedback on the one or more candidate keywords, at least one keyword is determined from the one or more candidate keywords. As an example, after determining one or more candidate keywords using a first target model, server 130 may present the candidate keywords determined by the first target model to user 140. Server 130 then receives user feedback on the candidate keywords. For example, based on user 140's feedback operations such as adding, deleting, or modifying candidate keywords determined by the target model, server 130 determines at least one keyword for determining reference information.
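The add, delete, and modify feedback operations can be sketched as a pure function over the candidate list. The function name and argument shapes are hypothetical; the candidate keywords would come from the first target model, which is not called here.

```python
from typing import Dict, Iterable, List, Optional


def apply_feedback(
    candidates: List[str],
    added: Iterable[str] = (),
    deleted: Iterable[str] = (),
    modified: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Turn model-proposed candidates into the final keyword list."""
    modified = modified or {}
    removed = set(deleted)
    result = []
    for keyword in candidates:
        if keyword in removed:
            continue
        result.append(modified.get(keyword, keyword))  # apply the user's rewording, if any
    result.extend(added)
    # Drop duplicates while keeping the first occurrence of each keyword.
    return list(dict.fromkeys(result))


candidates = [
    "Smart home development trends",
    "Smart home market forecast for the next five years",
    "Latest smart home technology innovation cases",
]
final_keywords = apply_feedback(
    candidates,
    added=["smart home privacy"],
    deleted=[candidates[2]],
    modified={candidates[0]: "Smart home adoption trends"},
)
```

With no feedback at all, the function returns the candidates unchanged, which corresponds to the fully automatic path.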

[0054] Referring again to Figure 3. In box 315, server 130 retrieves information matching at least one keyword from one or more data sources 314 as at least a portion of the reference information. In some embodiments, data source 314 may include, for example, an internal knowledge base (e.g., an entity database, a news database) and / or an external search engine. Server 130 may use the interface of a search plugin provided in the plugin tool to send a query request to an external search engine (e.g., a webpage, other applications) or an internal knowledge base. Server 130 may then receive results related to the query returned by the search engine or knowledge base, such as text, images, audio, video, etc., as at least a portion of the reference information.

[0055] In some embodiments, server 130 may also use a target model to filter and sort the information that matches at least one keyword, extracting content highly relevant to the topic as at least a portion of the reference information. Then, server 130 may integrate the retrieved information and user input into structured material, which will serve as reference information for subsequent content generation tasks.
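Retrieval, filtering, and integration into structured material can be sketched as follows. This is an illustration under stated assumptions: data sources are modeled as plain callables, and a simple word-overlap score stands in for the target model that the disclosure uses for filtering and sorting.

```python
from typing import Callable, Dict, List

DataSource = Callable[[str], List[str]]  # keyword -> matching passages


def retrieve(keywords: List[str], sources: List[DataSource]) -> List[str]:
    """Query every source with every keyword and drop duplicate passages."""
    hits: List[str] = []
    for keyword in keywords:
        for source in sources:
            hits.extend(source(keyword))
    return list(dict.fromkeys(hits))


def relevance(topic: str, passage: str) -> int:
    # Stand-in for model-based scoring: count words shared with the topic.
    return len(set(topic.lower().split()) & set(passage.lower().split()))


def build_reference(
    topic: str,
    user_input: str,
    keywords: List[str],
    sources: List[DataSource],
    top_k: int = 3,
) -> Dict[str, object]:
    """Filter and sort retrieved passages, then bundle them with the user input."""
    ranked = sorted(
        retrieve(keywords, sources),
        key=lambda passage: relevance(topic, passage),
        reverse=True,
    )
    kept = [p for p in ranked if relevance(topic, p) > 0][:top_k]
    return {"topic": topic, "user_input": user_input, "materials": kept}


# Hypothetical sources standing in for an internal knowledge base and a search engine.
def knowledge_base(keyword: str) -> List[str]:
    return ["smart home sensors lower energy use", "weather report"]


def search_engine(keyword: str) -> List[str]:
    return ["smart home market grows", "smart home sensors lower energy use"]


reference = build_reference(
    "smart home", "smart home", ["smart home trends"], [knowledge_base, search_engine]
)
```

The returned dictionary is one possible shape for the "structured material" that feeds text generation.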

[0056] Referring again to Figure 3, configuration information 312 may also include a search strategy. The search strategy refers to the policy configuration by which server 130 obtains information matching at least one keyword from one or more data sources. For example, the search strategy may indicate whether to enable a search (e.g., the search can be disabled when the user's input is sufficiently specific and complete, thus directly using the user's input as reference information), automatically enable a search when the user's input is less than a specified number of characters, limit the number of reference articles found (e.g., limit the number of reference articles searched to a maximum of 10), and the number of keywords to be obtained from the data sources.
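The search strategy reads naturally as a small configuration object. The sketch below is one interpretation of the options listed above: the field names and the character threshold are hypothetical, the limit of 10 articles is the example given in the text, and the keyword cap is read here as a limit on how many keywords are used to query the data sources.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SearchStrategy:
    enabled: bool = True                # explicit on/off switch for searching
    auto_enable_below_chars: int = 100  # hypothetical threshold for short inputs
    max_articles: int = 10              # example limit mentioned in the text
    max_keywords: int = 4               # hypothetical cap on keywords queried


def should_search(strategy: SearchStrategy, user_input: str) -> bool:
    """Search when enabled, or when the input is too short to serve as reference alone."""
    if strategy.enabled:
        return True
    return len(user_input.strip()) < strategy.auto_enable_below_chars


def apply_limits(
    strategy: SearchStrategy, keywords: List[str], articles: List[str]
) -> Tuple[List[str], List[str]]:
    """Trim keywords and retrieved articles to the configured maxima."""
    return keywords[: strategy.max_keywords], articles[: strategy.max_articles]
```

With searching switched off, a long and complete user input is used directly as reference information, while a short input still triggers a search.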

[0057] In this embodiment of the disclosure, by determining at least one keyword based on user input and obtaining at least a portion of the extended information matching the keyword as reference information, it is possible to comprehensively acquire topic-related information, thereby improving the breadth and depth of content generation. Simultaneously, the keywords generated using the target model can flexibly adapt to different task requirements, enhancing the adaptability and intelligence level for complex content generation scenarios, thus providing users with more accurate and high-quality service support.

[0058] Referring again to Figure 3, in box 316, server 130 can generate target text that matches the topic based on reference information. In some embodiments, after obtaining the reference information, server 130 can use a second target model to generate target text that matches the topic. The second target model can be the same as or different from the first target model used to generate candidate keywords.

[0059] Server 130 can call a second target model to generate the complete target text. For example, when the content to be generated is shorter than the maximum token length of the large language model, such as in the generation of short podcasts, server 130 can call the large language model all at once to generate the complete podcast script (i.e., the target text) including the opening, main content, and closing. This method offers faster generation speed while reducing resource consumption and the overhead of multiple calls, meeting users' demands for high efficiency. Furthermore, it supports streaming output, allowing for real-time output and presentation of the script content, enhancing the user experience.

[0060] If the content to be generated exceeds the maximum token length that the target model can generate in a single operation, the server 130 can also adopt a segmented generation strategy. Figure 4A shows a schematic diagram of a process 400A for generating target text according to some embodiments of this disclosure. Process 400A can be implemented at the server 130.
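
The choice between the single-call strategy and the segmented strategy can be sketched as a simple comparison. The function name and the two mode labels below are hypothetical, and treating a script that exactly equals the limit as fitting in one call is an assumption.

```python
def choose_generation_mode(required_tokens: int, max_tokens_per_call: int) -> str:
    """Pick a generation strategy from the estimated script length.

    `required_tokens` is an estimate of the length of the script to be
    generated; `max_tokens_per_call` is the longest output the model can
    produce in a single call. Both values are supplied by the caller.
    """
    if required_tokens < 0 or max_tokens_per_call <= 0:
        raise ValueError("token counts must be non-negative and the limit positive")
    # Assumption: a script that exactly fits the limit is generated in one call.
    if required_tokens <= max_tokens_per_call:
        return "single_call"
    return "segmented"
```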

[0061] In box 411, server 130 determines multiple subtopics associated with the main topic based on reference information. Subtopics are refinements or extensions of the content topic determined based on user input, ensuring more comprehensive content coverage. For example, for podcast generation tasks, subtopics can also be understood as multiple topics or questions associated with the main topic determined based on reference information (materials).

[0062] As an example, for podcast-type generation tasks, subtopics can take many forms. For instance, a subtopic might be defined as a broad topic, around which the target text (script segment) narrates or discusses the topic. Server 130 will expand the content depth of the subtopic, generating more detailed explanations, viewpoints, or examples by incorporating reference information. Alternatively, a subtopic might be defined as a specific question, with the script segment aiming to answer it. Server 130 will directly generate the answer to the question based on reference information, potentially supplementing it with background information and related content.

[0063] In some embodiments, server 130 may utilize a third target model to determine one or more candidate subtopics based on reference information. The third target model may be the same as or different from the second target model used to generate the target text. As an example, for the topic "future technology trends," server 130 may determine at least one keyword, invoke a search engine, and retrieve relevant articles, news reports, research literature, and other materials as reference information. Alternatively, server 130 may also include long text or link-type content contained in user input as partial reference information. Then, server 130 may refine the topic using the target model based on the reference information, automatically generating subtopics, such as subtopic 1 "Future Applications of Artificial Intelligence," subtopic 2 "Potential Breakthroughs in Quantum Computing Technology," and subtopic 3 "The Impact of Robotics and Automation on Society," etc.

[0064] In some embodiments, server 130 may also obtain user feedback on one or more candidate subtopics. Then, based on the user feedback on the one or more candidate subtopics, server 130 determines multiple subtopics from the one or more candidate subtopics. As an example, after determining one or more candidate subtopics using the third target model, server 130 may present the candidate subtopics to user 140. Server 130 then receives user feedback on the candidate subtopics. For example, user 140 manually adds "future trends of blockchain technology" as a new subtopic, deletes subtopic 2 "potential breakthroughs in quantum computing technology," or rewrites subtopic 3 as a question, "How will robots change daily human life?".
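
The add, delete, and modify feedback described above can be illustrated with a short sketch. The operation format (dictionaries with `op`, `target`, and `value` keys) is a hypothetical encoding chosen for illustration and is not defined in this disclosure.

```python
def apply_subtopic_feedback(candidates: list[str], feedback: list[dict]) -> list[str]:
    """Apply add / delete / modify operations to a list of candidate subtopics.

    Each feedback item is a dict such as:
      {"op": "add", "value": "..."}
      {"op": "delete", "target": "..."}
      {"op": "modify", "target": "...", "value": "..."}
    """
    result = list(candidates)  # leave the caller's list untouched
    for item in feedback:
        kind = item["op"]
        if kind == "add":
            result.append(item["value"])
        elif kind == "delete":
            result.remove(item["target"])
        elif kind == "modify":
            result[result.index(item["target"])] = item["value"]
        else:
            raise ValueError(f"unknown feedback operation: {kind}")
    return result
```

Applied to the three candidate subtopics of the "future technology trends" example, the three feedback operations above would leave the AI subtopic, the robot question, and the newly added blockchain subtopic.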

[0065] In some embodiments, in response to user feedback on adjustments to subtopics, server 130 may perform supplementary searches to determine reference information related to the multiple subtopics adjusted by user 140 for generating target text. Furthermore, server 130 may generate multiple text segments corresponding to the multiple subtopics, respectively, based on the reference information related to the multiple subtopics adjusted by the user.

[0066] Referring again to Figure 4A, in box 412, server 130 uses the second target model to generate multiple text segments corresponding to multiple sub-topics. Server 130 can use multiple target model nodes to generate text segments for each sub-topic in parallel, thereby processing multiple sub-topics simultaneously and shortening the generation time. However, since each sub-topic is generated independently, the content may lack logical connection, or similar or repetitive information may appear in text segments corresponding to different sub-topics. Therefore, server 130 can also adopt a sequential generation method to generate multiple text segments corresponding to multiple sub-topics.

[0067] In some embodiments, server 130 uses the second target model to generate a first text segment corresponding to a first subtopic among the multiple subtopics, based on the first subtopic and the reference information. Then, server 130 uses the second target model to generate a second text segment corresponding to a second subtopic that follows the first subtopic, based on the second subtopic and the first text segment. Server 130 generates the text segments in the order of the subtopics. Each time it generates the text segment for the next subtopic, it supplies the previous segment as context input, so that the model takes the preceding content into account, which improves the coherence of the target text.
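
A minimal sketch of this sequential strategy follows. The `generate` callable stands in for a call to the second target model, and the prompt layout is hypothetical. Passing the reference information to every call, rather than only to the first, is an assumption made for simplicity.

```python
from typing import Callable


def generate_sequentially(
    subtopics: list[str],
    reference: str,
    generate: Callable[[str], str],
) -> list[str]:
    """Generate one text segment per subtopic, in order.

    `generate` is a placeholder for the model call: it takes a prompt
    string and returns the generated text.
    """
    segments: list[str] = []
    previous = ""
    for subtopic in subtopics:
        prompt = (
            f"Reference material:\n{reference}\n\n"
            f"Previous segment:\n{previous or '(none)'}\n\n"
            f"Write the next script segment on: {subtopic}"
        )
        segment = generate(prompt)
        segments.append(segment)
        previous = segment  # becomes the context for the next subtopic
    return segments
```

Only the immediately preceding segment is carried forward here, which keeps each prompt short; carrying a running summary of all earlier segments would be an alternative design.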

[0068] In some embodiments, server 130 may also invoke a large language model specifically to generate transitional phrases (e.g., connecting words) between script segments, further improving the connection and logical flow between subtopics.

[0069] In this embodiment, multiple sub-topics are determined based on reference information, and corresponding text segments are generated for each sub-topic. By using the content of the previous text segment as context input, not only are the technical limitations of generating long content resolved, but the coherence and logical consistency between different text segments are also significantly enhanced. This allows for the generation of complete and fluent target text while maintaining a smooth transition of content and overall semantic consistency.

[0070] In some embodiments, server 130 may also utilize a target model such as a large language model to generate introductory text in the target text. As an example, for podcast content, server 130 uses a large language model to generate opening and closing text. The opening may include, for example, a brief introduction to the topic, an introductory statement to spark interest, such as a story, anecdote, or a question. The closing may include, for example, a summary of the podcast content, a call to action or suggestions for the next step, gratitude to the listener, and guidance to the next episode.

[0071] Referring again to Figure 4A, in box 413, server 130 generates target text by combining multiple text segments. By concatenating the text segments generated using the second target model, server 130 obtains the target text used to generate audio data.

[0072] Referring back to Figure 3, in some embodiments, configuration information 312 may also include role information. For example, for podcast content, role information includes the names of participants (such as the names of hosts and guests) and role attributes (such as including a guiding host role or a discussion-oriented guest role).

[0073] In some embodiments, configuration information 312 may also indicate the language of the generated content. For example, a user may specify the language of the generated content (such as Chinese or English), or server 130 may automatically determine the language of the generated content based on the original content.

[0074] When generating target text, server 130 can input configuration information into the large language model using a structured prompt (e.g., roleA.Name = "XX", roleA.role = "Host"). Accordingly, the large language model can generate dialogue content that matches the role attributes based on these role settings.
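
The structured prompt in the example above can be produced by a small helper. This is a sketch only: the function name, the nested-dictionary input, and the `language` line are hypothetical, while the `roleA.Name = "XX"` line format follows the example in the preceding paragraph.

```python
def build_role_prompt(roles: dict[str, dict[str, str]], language: str) -> str:
    """Render role and language settings as `key = "value"` lines."""
    lines = [f'language = "{language}"']
    for role_key, attributes in roles.items():
        for attribute, value in attributes.items():
            lines.append(f'{role_key}.{attribute} = "{value}"')
    return "\n".join(lines)
```

For instance, `build_role_prompt({"roleA": {"Name": "XX", "role": "Host"}}, "English")` yields three lines, the last two being `roleA.Name = "XX"` and `roleA.role = "Host"`.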

[0075] In some embodiments, the configuration information also indicates the style of the target text. Server 130 can generate the target text using a second target model based on the prompt word template corresponding to the style indicated by the configuration information. Style determines aspects such as the language expression, structural arrangement, and interactive methods of the generated content. Different styles of text content require different prompt word templates to guide the model in generating the corresponding text. For example, for podcast-type content, several common styles can be included: crosstalk style, talk show style, and solo broadcast style.

[0076] As an example, for the solo broadcast style, prompt word templates can guide large language models to generate monologue-style text that is natural in tone, fluent in language, and clear in logic. For example, such text can be used to report concisely and directly on a given topic, to transform an article into an easy-to-follow learning guide, or to explain user-input content in chronological order.

[0077] In some embodiments, server 130 may also allow user 140 to define custom prompt word templates for different styles. A prompt word template mainly consists of a series of prompt words and structured text, and is used to guide the model to generate target text of the corresponding style. Within each style of text content, users can customize the structure, language style, dialogue interaction, and so on, of the generated content.
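
The selection of a prompt word template by style, with user-defined templates taking precedence, might look like the following sketch. The template wording, the style keys, and the `{topic}` placeholder are all hypothetical.

```python
from typing import Optional

# Hypothetical built-in templates, one per style named in the text.
DEFAULT_TEMPLATES = {
    "crosstalk": "Write a two-person crosstalk script about {topic}.",
    "talk_show": "Write a talk-show conversation between a host and a guest about {topic}.",
    "solo_broadcast": "Write a single-speaker monologue that explains {topic} clearly.",
}


def render_prompt(
    style: str,
    topic: str,
    user_templates: Optional[dict[str, str]] = None,
) -> str:
    """Pick the template for `style`, preferring a user-defined one."""
    templates = {**DEFAULT_TEMPLATES, **(user_templates or {})}
    if style not in templates:
        raise KeyError(f"no prompt word template for style: {style}")
    return templates[style].format(topic=topic)
```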

[0078] Referring again to Figure 3, in box 317, server 130 corrects the target text. Specifically, server 130 uses a second target model to generate candidate text based on reference information. The generated text may contain linguistic inaccuracies, incoherent expressions, or structural problems. At this point, server 130 can use a fourth target model, different from the second target model, to correct the candidate text and obtain corrected candidate text.

[0079] The text correction process may include: detecting and correcting grammatical errors, spelling errors, or sentence structure problems; adjusting logical relationships in the text; adjusting the tone, rhythm, or dialogue style of the text according to the style set by the user (such as solo broadcasting, talk show, etc.); ensuring that the content of the generated text does not deviate from the theme required by the user; and correcting factual errors or inconsistent parts of the text.

[0080] In some embodiments, server 130 may also obtain user feedback on the corrected candidate text. Then, server 130 determines the target text based on the user feedback and the corrected candidate text. As an example, server 130 may present the corrected candidate text to user 140 after correcting the candidate text using a fourth target model. Server 130 then receives user feedback on the corrected candidate text. For example, server 130 determines the target text for generating audio data based on user 140's feedback operations such as adding, deleting, or modifying candidate text.

[0081] In this embodiment, the post-generation text correction process not only improves the accuracy, fluency, and coherence of the generated content but also ensures that the generated text conforms to the user-defined style and tone. Furthermore, by supporting both automatic model correction and manual user correction, it provides users with the ability to make fine-tuning adjustments, allowing for personalized modifications to specific content or tone. The combination of these two methods guarantees the basic quality of the system-generated content while retaining room for flexible adjustments.

[0082] Referring again to Figure 3, in box 318, server 130 uses the fifth target model to determine one or more candidate timbres for the audio data. In some embodiments, after generating the target text, server 130 can continue to generate audio data corresponding to the target text. In generating audio data, in addition to converting text to audio, server 130 also supports modifying the timbre of the audio. Timbre refers to the characteristics of a sound, such as the voices of different characters, different pitches, speech rates, emotional nuances, etc.

[0083] In some embodiments, server 130 obtains user feedback on one or more candidate timbres. Then, based on the user feedback on the one or more candidate timbres, server 130 determines the target timbre for the audio data. Finally, server 130 generates audio data with the target timbre.

[0084] As an example, after determining one or more candidate timbres for the target text using a fifth target model, server 130 can present audition options for these candidate timbres to user 140. User 140 can provide feedback on the candidate timbres based on personal preferences or content needs, such as selecting the most suitable timbre or uploading a custom timbre. Then, based on user 140's feedback, server 130 determines the target timbre for generating audio data. Finally, server 130 generates audio data that matches the target text based on the target timbre.
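
How the target timbre follows from the candidates and the user feedback can be sketched as below. The precedence rule (an uploaded custom timbre wins over a selection) and the fallback to the first candidate when no feedback is given are assumptions, as are all names.

```python
from typing import Optional


def resolve_target_timbre(
    candidates: list[str],
    selected: Optional[str] = None,
    uploaded: Optional[str] = None,
) -> str:
    """Determine the target timbre from candidates and user feedback."""
    if uploaded is not None:
        return uploaded  # assumption: a custom upload overrides any selection
    if selected is not None:
        if selected not in candidates:
            raise ValueError(f"selected timbre is not a candidate: {selected}")
        return selected
    if not candidates:
        raise ValueError("no candidate timbres and no user feedback")
    return candidates[0]  # assumption: default to the first candidate
```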

[0085] In this embodiment, by utilizing a target model to determine the timbre of the audio and supporting user modification of the timbre, personalized, flexible, and refined audio adjustment functions are provided, enabling the generated audio content to better suit the user's creative needs. In this way, users can not only generate audio from text but also personalize the timbre according to style, character, and audience needs, greatly enhancing the appeal of the content and the user experience.

[0086] In some embodiments, server 130 can also utilize the target model for audio processing, such as adding interlude audio effects and mixing music and text content. Interlude audio effects refer to audio effects added between different segments, primarily used to distinguish between segments and enhance the smoothness of transitions. In media content, audio effects can not only enhance entertainment value but also help clarify the structure of the content, making it easier for listeners to understand content transitions.

[0087] As an example, Figure 4B illustrates a schematic diagram of a sound effects processing procedure 400B according to some embodiments of the present disclosure. As shown in Figure 4B, server 130 can use the target model to generate key dialogue switching sound effects 421. Appropriate sound effects, such as light transition sounds or tense situational sound effects, enhance the listening experience at key moments when the speaking character or the emotion changes.

[0088] Server 130 can also use the target model to generate opening sound effects 422 and ending sound effects 424 to remind listeners of the start and end of the content. Server 130 can also generate sound effects used when there are significant changes between topics or segments, such as subtopic switching sound effects 423, to guide listeners to switch to the next subtopic or topic.

[0089] The purpose of audio mixing is to combine background music with content (such as a presenter's speech or story) to make the overall audio more layered and emotionally resonant. For example, mixing music with content (such as captivating quotes or opening introductions) at the beginning engages the listener at the outset and quickly draws them into the rhythm. Mixing music with content at the end makes the conclusion feel natural and leaves a lingering impression.

[0090] Referring back to Figure 3, in some embodiments, configuration information 312 may also indicate the audio type, i.e., the form of the generated audio data, such as a podcast, which is suitable for multi-person conversations and long-form content, or a broadcast, which is suitable for briefings or solo presentations.

[0091] In some embodiments, configuration information 312 can indicate the audio duration, i.e., the duration of the generated audio data, thereby facilitating control over content granularity and user experience. In some embodiments, configuration information 312 can indicate the output method, such as generating and providing a complete audio file, or outputting TTS streaming script data, supporting user-managed audio generation. In box 319, server 130 generates audio data corresponding to the target text based on configuration information 312.

[0092] In some embodiments, configuration information 312 can indicate intervention options, namely, the intervention mode of user 140 during the execution of the content generation task and the modules requiring user feedback, thereby enabling modularity and flexibility of the generation task. User 140 can selectively enable or skip parts of the process. For example, if the user only needs to quickly generate podcast-type content, they can choose to skip the user feedback stage. Intervention options may include, for example, single-node control, where user 140 can manually adjust only at a specific stage; and full-process intervention, where user 140 can adjust and customize at multiple stages of the content generation task.
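
One way to express these intervention options is a function that reports whether a given stage should pause for user feedback. The mode labels and stage names below are hypothetical; the four stages are the ones for which user feedback is described in this disclosure.

```python
from typing import Optional

# Stages at which the disclosure describes user feedback (names are hypothetical).
STAGES = ("keywords", "subtopics", "text_correction", "timbre")


def needs_user_feedback(stage: str, mode: str, target_stage: Optional[str] = None) -> bool:
    """Return True if the task should pause for user feedback at `stage`."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    if mode == "skip":            # quick generation without any user feedback
        return False
    if mode == "single_node":     # feedback only at one specific stage
        return stage == target_stage
    if mode == "full_process":    # feedback at every supported stage
        return True
    raise ValueError(f"unknown intervention mode: {mode}")
```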

[0093] To ensure smooth execution of each stage from task submission to final content generation and to support flexible recovery from interruptions during content generation tasks, server 130 also supports a task management mechanism. Task management not only handles the lifecycle of content generation tasks but also supports operations such as resuming from a breakpoint and restarting a task, to address any interruptions or failures that may occur during the generation process.

[0094] As an example, the entity design corresponding to a content generation task can include three structures: a task container, a specific execution workflow, and atomic nodes within the workflow. The task container can include fields such as: a task identifier, a workflow object associated with the task, and the task's current status (e.g., pending, in progress, completed). Through the task container, server 130 can manage the execution of multiple generation tasks, ensuring that tasks are processed according to a predetermined workflow and that their status is updated at different stages.

[0095] Workflows can include fields such as: workflow identifier and a directed acyclic graph (used to represent the execution order of tasks, i.e., the relationships and execution order of nodes). Workflows define the complete execution path of content generation tasks and manage the execution order of multiple nodes. They structure the task execution process, describing the entire execution flow from start to finish and ensuring that each execution operation of the task is performed in a logical order.

[0096] A node is an atomic execution unit in a workflow, representing a specific operation of a task. It can include fields such as: node identifier, the capability or task corresponding to the node (e.g., data processing, model invocation), the current state of the node (e.g., pending execution, in execution, requiring manual intervention, execution failed, execution successfully completed), and a reference to the next node (which defines the execution order and dependencies between nodes).
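
The three structures described above (task container, workflow, and node) can be sketched as Python dataclasses. The class and field names are hypothetical renderings of the fields listed in the text, and storing the graph through each node's single next-node reference is a simplifying assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


class NodeStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    NEEDS_INTERVENTION = "needs_intervention"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


@dataclass
class Node:
    node_id: str
    capability: str                       # e.g. "data_processing", "model_invocation"
    status: NodeStatus = NodeStatus.PENDING
    next_node_id: Optional[str] = None    # reference to the next node, if any


@dataclass
class Workflow:
    workflow_id: str
    nodes: dict[str, Node] = field(default_factory=dict)

    def edges(self) -> list[tuple[str, str]]:
        """Directed edges implied by each node's next-node reference."""
        return [
            (node.node_id, node.next_node_id)
            for node in self.nodes.values()
            if node.next_node_id is not None
        ]


@dataclass
class Task:
    task_id: str
    workflow: Workflow
    status: TaskStatus = TaskStatus.PENDING
```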

[0097] In some embodiments, in response to obtaining configuration information, server 130 generates a task identifier corresponding to the content generation task. During the execution of the content generation task, server 130 stores the corresponding processing results and corresponding node configurations of the multiple processing nodes included in the content generation task.

[0098] As an example, after obtaining the configuration information, server 130 can generate a unique task identifier to identify the current content generation task, thereby tracking and managing the task's lifecycle. During the execution of the content generation task, server 130 can execute the task node by node. The processing and generation results of each node are stored. This includes, for example, node configuration, i.e., the specific configuration used by each node (such as input data, configuration parameters, etc.), and node processing results, i.e., the output results after the node execution (e.g., generated text, audio, etc.). This information is saved during task execution for later backtracking, recovery, or re-execution when needed.

[0099] In some embodiments, during the execution of a content generation task, server 130 receives a user request. The user request instructs that the content generation task be re-executed starting from the first node. In response to the user request, server 130 retrieves the stored first processing result and first node configuration corresponding to the first node based on the task identifier. Then, server 130 re-executes the processing nodes following the first node based on the first processing result and the first node configuration.

[0100] As an example, when user 140 issues a re-execution request, or when a node is determined to have failed during task execution, server 130 can use the task identifier to retrieve the configuration and processing results that were stored during task execution for the relevant node (e.g., the first node), that is, the node that user 140 indicated should be regenerated or the node that failed. Based on this information, server 130 can then re-execute the task starting from that node. This mechanism ensures that even if a node fails or is interrupted during task execution, server 130 can still recover from that node and continue subsequent operations without having to re-execute the entire task.
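
The store-and-resume mechanism can be illustrated with an in-memory sketch. Everything here is hypothetical: the `TaskStore` class, the linear pipeline of `(node_id, config, function)` steps, and the use of `uuid4` for task identifiers. A real workflow may be a general directed acyclic graph rather than a linear list.

```python
import uuid
from typing import Any, Callable

# One step: (node_id, node_config, function taking (input, config) -> output)
PipelineStep = tuple[str, dict[str, Any], Callable[[Any, dict[str, Any]], Any]]


class TaskStore:
    """In-memory stand-in for per-task storage of node configs and results."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, dict[str, Any]]] = {}

    def new_task(self) -> str:
        task_id = uuid.uuid4().hex  # unique task identifier
        self._data[task_id] = {}
        return task_id

    def save(self, task_id: str, node_id: str, config: dict[str, Any], result: Any) -> None:
        self._data[task_id][node_id] = {"config": config, "result": result}

    def load(self, task_id: str, node_id: str) -> dict[str, Any]:
        return self._data[task_id][node_id]


def _run_steps(store: TaskStore, task_id: str, steps: list[PipelineStep], current: Any) -> Any:
    for node_id, config, fn in steps:
        current = fn(current, config)
        store.save(task_id, node_id, config, current)  # persist after every node
    return current


def run_pipeline(store: TaskStore, task_id: str, pipeline: list[PipelineStep], initial: Any) -> Any:
    """Execute every node in order, storing each config and result."""
    return _run_steps(store, task_id, pipeline, initial)


def resume_after(store: TaskStore, task_id: str, pipeline: list[PipelineStep], node_id: str) -> Any:
    """Re-execute only the nodes that follow `node_id`, using its stored result."""
    ids = [step[0] for step in pipeline]
    start = ids.index(node_id) + 1
    stored = store.load(task_id, node_id)
    return _run_steps(store, task_id, pipeline[start:], stored["result"])
```

Because each node's result is saved as soon as the node finishes, resuming after a given node does not repeat the work of that node or of any earlier node.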

[0101] In some embodiments, in response to determining that the complete task execution has failed or needs to be restarted, the server 130 can regenerate the task identifier by calling the task submission interface, indicating that the task needs to be re-executed from the beginning.

[0102] In some embodiments, server 130 can also ensure that the execution order of each node is correct by checking whether the workflow constitutes a reasonable directed acyclic graph, thereby avoiding circular dependencies or deadlocks. If the workflow does not meet the requirements of a directed acyclic graph, the generation task will fail to execute and an error message will be displayed.
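
A check that a workflow forms a directed acyclic graph can be sketched with Kahn's algorithm, a standard topological-sort technique. The disclosure does not state which algorithm is used, so this choice and the function name are assumptions.

```python
from collections import deque


def is_dag(nodes: list[str], edges: list[tuple[str, str]]) -> bool:
    """Return True if the directed graph has no cycle (Kahn's algorithm).

    Every edge endpoint must appear in `nodes`; otherwise a KeyError is raised.
    """
    indegree = {node: 0 for node in nodes}
    successors: dict[str, list[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from nodes with no prerequisites and peel the graph layer by layer.
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)
```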

[0103] In this embodiment, by storing configuration and processing results for each task and node, task execution can be flexibly resumed regardless of whether interruptions or failures occur during task execution. This mechanism improves the reliability and adaptability of task management and can effectively cope with various emergencies (such as failed nodes or manual intervention during execution). If a task fails midway or the user needs to modify a certain step, it can be re-executed from the failed node or the specified node, without having to regenerate the entire task. This saves time and computing resources and improves the efficiency of content generation.

[0104] The following embodiments of this disclosure illustrate the content generation method described above with reference to examples. Specifically, client 110 serves as the terminal device that submits the content generation task and communicates with server 130; the script and audio of the podcast content are used as the target text and corresponding audio data, respectively; one or more target models 132 are used as the various target models in the content generation task; and one or more plug-in systems 131 are used as the various plug-ins or third-party tools in the content generation task.

[0105] As an example, Figure 4C illustrates a schematic diagram of a process 400C for performing a podcast generation task according to some embodiments of this disclosure.

[0106] In box 431, user 140 submits a podcast generation task request through client 110. The task request may include the following information: input content (web link or long text) and other configuration information specified by user 140. In box 441, server 130 receives the task request, creates the corresponding podcast generation task, and assigns a task ID.

[0107] In box 442, server 130 determines the type of user input. If the user input includes link information, then box 442-1 is executed. In box 442-1, server 130 uses the plug-in tools included in plug-in system 131 to obtain the webpage content indicated by the link information and converts it into processable text.

[0108] In box 443, server 130 determines keywords for further searching to supplement information related to the topic. In box 443-1, server 130 may automatically determine one or more candidate keywords based on the first target model 132.

[0109] In box 444, server 130 retrieves information from a data source that matches at least one keyword, as at least a portion of the reference information. The data source may, for example, include an internal knowledge base (e.g., an entity database, a news database) and / or an external search engine.

[0110] In box 445, server 130 determines whether manual intervention is configured for the keyword selection process. If manual intervention is configured, then box 443-2 is executed. In box 443-2, client 110 provides user feedback on candidate keywords; for example, user 140 can manually add, delete, or adjust candidate keywords. After receiving user feedback, server 130 re-executes box 444 to obtain information matching the target keywords.

[0111] In box 451, server 130 determines multiple subtopics (i.e., topics) associated with the main topic based on reference information. In box 451-1, server 130 may utilize third target model 132 to determine one or more subtopics and the reference information content corresponding to each subtopic based on the reference information.

[0112] In box 452, server 130 concatenates the corresponding reference information content according to the logic of the subtopics to form a coherent material base. In box 453, server 130 can use the second target model 132 to generate text segments corresponding to each subtopic, based on the prompt word template corresponding to the style indicated by the configuration information.

[0113] In box 454, server 130 uses target model 132 to generate connecting text, such as opening remarks, transitions, and closing remarks. In box 455, server 130 concatenates the various text segments, opening remarks, closing remarks, and transitions to generate a complete podcast script.
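
The concatenation in box 455 can be sketched as follows. The function name, the requirement of exactly one transition between each pair of adjacent segments, and the blank-line separator are assumptions made for illustration.

```python
def assemble_script(
    opening: str,
    segments: list[str],
    transitions: list[str],
    closing: str,
) -> str:
    """Join opening, segments (with transitions between them), and closing."""
    expected = max(len(segments) - 1, 0)
    if len(transitions) != expected:
        raise ValueError(f"expected {expected} transitions, got {len(transitions)}")
    parts = [opening]
    for index, segment in enumerate(segments):
        if index > 0:
            parts.append(transitions[index - 1])  # bridge from the previous segment
        parts.append(segment)
    parts.append(closing)
    return "\n\n".join(parts)
```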

[0114] In box 456, server 130 can correct the generated complete podcast script, i.e., the candidate text. In box 456-1, server 130 can use a fourth target model, different from the second target model, to correct the candidate text, improving language fluency and semantic accuracy, to obtain corrected candidate text.

[0115] In box 457, server 130 determines whether manual intervention for the text proofreading process is configured. If manual intervention is configured, then box 456-2 is executed. In box 456-2, client 110 provides user feedback on the candidate text; for example, user 140 can modify and fine-tune the complete podcast script. After receiving user feedback, server 130 executes box 458.

[0116] In box 458, server 130 can determine the timbre of the audio data to be generated. In box 458-1, server 130 can use the fifth target model 132 to determine one or more candidate timbres of the audio data.

[0117] In box 459, server 130 determines whether manual intervention for the timbre determination process is configured. If manual intervention is configured, then box 458-2 is executed. In box 458-2, client 110 provides user feedback on candidate timbres; for example, user 140 can select a target timbre from the candidate timbres or upload the target timbre of the audio data. After receiving user feedback, server 130 executes box 460 to determine the audio data output method. If the output method is determined to be TTS data, then box 432 is executed. In box 432, the client receives the TTS streaming output data from server 130. If the output method is determined to be complete audio data, then box 461 is executed.

[0118] In box 461, server 130 generates complete audio data corresponding to the corrected podcast script. In box 462, server 130 uploads the audio data. In box 463, server 130 uses a plug-in system 131, such as an audio storage module, to store the generated audio data. In box 464, server 130 determines the link corresponding to the audio data and sends it to client 110.

[0119] In this embodiment, modular task management and a flexible content generation process significantly improve the efficiency and quality of podcast content creation. This approach intelligently generates target text based on user input and produces customized audio content using text-to-speech technology. Furthermore, by combining automatic generation, user-defined configurations, and a refined task management mechanism, the personalization, accuracy, and stability of content generation are greatly enhanced. This allows for meeting the diverse content creation needs of users while reducing the burden of manual intervention, providing an efficient and scalable content creation platform.

[0120] Example devices and equipment

[0121] Embodiments of this disclosure also provide corresponding apparatus for implementing the methods or processes described above. Figure 5 shows a schematic structural block diagram of an example apparatus 500 for content generation according to certain embodiments of this disclosure. Apparatus 500 may be implemented as or included in server 130. The various modules / components in apparatus 500 may be implemented by hardware, software, firmware, or any combination thereof.

[0122] As shown in Figure 5, apparatus 500 includes a configuration information acquisition module 510, configured to acquire configuration information for a content generation task, wherein the configuration information at least indicates the topic of the content to be generated and the audio configuration of the content to be generated. Apparatus 500 also includes a reference information acquisition module 520, configured to acquire reference information associated with the topic based on the configuration information. Apparatus 500 further includes a generation module 530, configured to generate target text matching the topic, at least based on the reference information. The generation module 530 is also configured to generate audio data corresponding to the target text based on the audio configuration and the target text.

[0123] In some embodiments, the reference information acquisition module 520 is further configured to determine at least one keyword based on user input associated with the topic in the configuration information using a first target model; and to acquire information matching the at least one keyword from one or more data sources as at least a part of the reference information.

[0124] In some embodiments, the reference information acquisition module 520 is further configured to determine one or more candidate keywords based on user input and using a first target model; acquire user feedback on the one or more candidate keywords; and determine at least one keyword from the one or more candidate keywords based on the user feedback on the one or more candidate keywords.

[0125] In some embodiments, the generation module 530 is further configured to determine multiple subtopics associated with a topic based on reference information; generate multiple text segments corresponding to the multiple subtopics respectively using a second target model; and generate target text by combining the multiple text segments.

[0126] In some embodiments, the generation module 530 is further configured to utilize a third target model to determine one or more candidate subtopics based on reference information; obtain user feedback on the one or more candidate subtopics; and determine multiple subtopics from the one or more candidate subtopics based on the user feedback on the one or more candidate subtopics.

[0127] In some embodiments, the generation module 530 is further configured to generate, using the second target model, a first text segment corresponding to a first subtopic among the multiple subtopics, based on the first subtopic and the reference information; and to generate, using the second target model, a second text segment corresponding to a second subtopic that follows the first subtopic among the multiple subtopics, based on the second subtopic and the first text segment.
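A minimal sketch of this chained generation is given below, in which each segment after the first is conditioned on the segment before it and the segments are then combined into the target text. The `SegmentModel` signature and `toy_model` are hypothetical stand-ins for the second target model; passing the reference information on every call is a simplification chosen for the example.

```python
from typing import Callable, Optional

# Hypothetical signature for the second target model:
# (subtopic, reference, previous_segment) -> text segment
SegmentModel = Callable[[str, str, Optional[str]], str]


def generate_target_text(subtopics: list[str], reference: str,
                         model: SegmentModel) -> str:
    """Generate one segment per subtopic, feeding each segment into the next call."""
    segments: list[str] = []
    previous: Optional[str] = None  # no earlier segment exists for the first subtopic
    for subtopic in subtopics:
        segment = model(subtopic, reference, previous)
        segments.append(segment)
        previous = segment
    # Combine the segments into the target text.
    return "\n\n".join(segments)


def toy_model(subtopic: str, reference: str, previous: Optional[str]) -> str:
    """Toy model: uses the reference for the first segment, the prior segment after."""
    if previous is None:
        return f"{subtopic} (from: {reference})"
    return f"{subtopic} (continuing after: {previous.split(' (')[0]})"


result = generate_target_text(["Origins", "Brewing"], "notes", toy_model)
```

Conditioning on the previous segment is what lets a later segment stay consistent with, and avoid repeating, the earlier one.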

[0128] In some embodiments, the reference information acquisition module 520 is further configured to, in response to determining that the configuration information includes link information indicating the topic, use a data acquisition function block to acquire information from the data object indicated by the link information as at least a portion of the reference information.
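One way to picture this branch: check whether the configuration carries a link and, if so, hand it to a fetching function that plays the role of the data acquisition function block. The `"link"` key, the `is_link` check, and the injected `fetch` callable are assumptions for illustration; the sketch deliberately performs no network access.

```python
from typing import Callable
from urllib.parse import urlparse


def is_link(value: str) -> bool:
    """Treat a value as link information if it parses as an http(s) URL with a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def acquire_reference(config: dict, fetch: Callable[[str], str]) -> list[str]:
    """Return reference information gathered from the linked data object, if any."""
    reference: list[str] = []
    link = config.get("link")  # hypothetical key name
    if link and is_link(link):
        # `fetch` stands in for the data acquisition function block.
        reference.append(fetch(link))
    return reference


def fake_fetch(url: str) -> str:
    """Placeholder fetcher used so the example runs offline."""
    return f"content of {url}"


with_link = acquire_reference({"link": "https://example.com/article"}, fake_fetch)
without_link = acquire_reference({"topic": "tea"}, fake_fetch)
```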

[0129] In some embodiments, the configuration information also indicates the style of the target text. The generation module 530 is further configured to generate the target text using the second target model based on a prompt word template corresponding to the indicated style.
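Style-dependent templates could be kept in a simple lookup table, as in the sketch below. The style names, the template wording, and the fallback to a default style are all invented for the example; the disclosure does not specify them.

```python
# Hypothetical templates keyed by style; wording is illustrative only.
PROMPT_TEMPLATES = {
    "news": "Write a concise news-style script about {topic}. Use these notes:\n{reference}",
    "story": "Tell an engaging story about {topic}. Draw on these notes:\n{reference}",
}
DEFAULT_STYLE = "news"


def build_prompt(style: str, topic: str, reference: str) -> str:
    """Fill the template for the indicated style, falling back to the default style."""
    template = PROMPT_TEMPLATES.get(style, PROMPT_TEMPLATES[DEFAULT_STYLE])
    return template.format(topic=topic, reference=reference)


story_prompt = build_prompt("story", "tea", "Tea originated in China.")
fallback_prompt = build_prompt("unknown-style", "tea", "Tea originated in China.")
```

The resulting prompt string would then be passed to the second target model; that call is omitted here because no model interface is defined in the text.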

[0130] In some embodiments, the generation module 530 is further configured to generate candidate text based on reference information using a second target model; to correct the candidate text using a fourth target model different from the second target model to obtain corrected candidate text; to obtain user feedback on the corrected candidate text; and to determine target text based on the user feedback and the corrected candidate text.
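The generate, correct, and review sequence can be sketched as below. The three callables are hypothetical stand-ins for the second target model, the fourth target model, and the user feedback step; treating feedback as "either accept or supply revised text" is one possible interpretation, assumed here for simplicity.

```python
from typing import Callable, Optional


def produce_target_text(reference: str,
                        generate: Callable[[str], str],
                        correct: Callable[[str], str],
                        review: Callable[[str], Optional[str]]) -> str:
    """Generate candidate text, correct it with a separate model, then apply feedback."""
    candidate = generate(reference)   # role of the second target model
    corrected = correct(candidate)    # role of the fourth target model
    revised = review(corrected)       # user feedback: None means "accept as is"
    return corrected if revised is None else revised


def toy_generate(reference: str) -> str:
    return reference + " teh end"


def toy_correct(text: str) -> str:
    return text.replace("teh", "the")


accepted = produce_target_text("notes", toy_generate, toy_correct, lambda t: None)
edited = produce_target_text("notes", toy_generate, toy_correct, lambda t: t.upper())
```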

[0131] In some embodiments, the generation module 530 is further configured to determine one or more candidate timbres of the audio data using a fifth target model; obtain user feedback on the one or more candidate timbres; determine the target timbre of the audio data based on the user feedback on the one or more candidate timbres; and generate audio data with the target timbre.
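The timbre selection step might be sketched as follows. Ranking by tag overlap is a toy substitute for the fifth target model, and the `Timbre` type, the tag vocabulary, and the fallback to the top-ranked candidate are assumptions; the audio synthesis itself is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timbre:
    name: str
    tags: frozenset


def propose_timbres(audio_config: dict, catalog: list[Timbre], limit: int = 3) -> list[Timbre]:
    """Toy stand-in for the fifth target model: rank timbres by tag overlap."""
    wanted = set(audio_config.get("tags", []))
    ranked = sorted(catalog, key=lambda t: len(wanted & t.tags), reverse=True)
    return ranked[:limit]


def choose_timbre(candidates: list[Timbre], picked_name: str) -> Timbre:
    """Apply user feedback; fall back to the top-ranked candidate if the pick is unknown."""
    for timbre in candidates:
        if timbre.name == picked_name:
            return timbre
    return candidates[0]


catalog = [
    Timbre("bright", frozenset({"young", "energetic"})),
    Timbre("warm", frozenset({"calm", "mature"})),
    Timbre("deep", frozenset({"mature", "serious"})),
]
candidates = propose_timbres({"tags": ["calm", "mature"]}, catalog, limit=2)
target = choose_timbre(candidates, "deep")
```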

[0132] In some embodiments, the apparatus 500 further includes a task management module configured to generate a task identifier corresponding to the content generation task in response to obtaining configuration information; and to store the corresponding processing results and corresponding node configurations of the multiple processing nodes included in the content generation task during the execution of the content generation task.

[0133] In some embodiments, the task management module is further configured to, during the execution of a content generation task, receive a user request instructing the content generation task to be re-executed starting from the first node; in response to the user request, obtain, based on the task identifier, the stored first processing result and first node configuration corresponding to the first node; and, based on the first processing result and first node configuration, re-execute the processing nodes after the first node.

[0134] Figure 6 shows a block diagram of an electronic device capable of implementing several embodiments of the present disclosure.

[0135] As shown in Figure 6, the electronic device 600 is in the form of a general-purpose electronic device. Components of the electronic device 600 may include, but are not limited to, one or more processors or processing units 610, memory 620, storage device 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660. The processing unit 610 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 620. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of the electronic device 600.

[0136] Electronic device 600 typically includes multiple computer storage media. Such media can be any available media accessible to electronic device 600, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 620 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 630 can be removable or non-removable media and can include machine-readable media, such as flash drives, disks, or any other media that can be used to store information and / or data and can be accessed within electronic device 600.

[0137] Electronic device 600 may further include additional removable / non-removable, volatile / non-volatile storage media. Although not shown in Figure 6, disk drives for reading from or writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks may be provided. In these cases, each drive may be connected to a bus (not shown) via one or more data media interfaces. Memory 620 may include computer program product 625 having one or more program modules configured to perform various methods or actions of various embodiments of the present disclosure.

[0138] The communication unit 640 enables communication with other electronic devices via a communication medium. Additionally, the functionality of the components of the electronic device 600 can be implemented using a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the electronic device 600 can operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.

[0139] Input device 650 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 660 can be one or more output devices, such as a monitor, speaker, printer, etc. As needed, electronic device 600 can also communicate via communication unit 640 with one or more external devices (not shown), such as storage devices and display devices; with one or more devices that enable a user to interact with electronic device 600; or with any device (e.g., a network card, a modem, etc.) that enables electronic device 600 to communicate with one or more other electronic devices. Such communication can be performed via an input / output (I / O) interface (not shown).

[0140] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above.

[0141] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0142] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0143] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions that execute on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more blocks of a flowchart and / or block diagram.

[0144] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than that indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0145] Various implementations of this disclosure have been described above. These descriptions are exemplary rather than exhaustive, and this disclosure is not limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles of the implementations, their practical applications, or their technical improvements over technologies found in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.

Claims

1. A content generation method, comprising: obtaining configuration information for a content generation task, wherein the configuration information at least indicates a topic of content to be generated and an audio configuration of the content to be generated; obtaining, based on the configuration information, reference information associated with the topic; generating, based at least on the reference information, target text that matches the topic; and generating, based on the audio configuration and the target text, audio data corresponding to the target text.

2. The method according to claim 1, wherein obtaining the reference information comprises: determining, using a first target model, at least one keyword based on user input associated with the topic in the configuration information; and obtaining information matching the at least one keyword from one or more data sources as at least part of the reference information.

3. The method according to claim 2, wherein determining the at least one keyword comprises: determining, using the first target model, one or more candidate keywords based on the user input; obtaining user feedback regarding the one or more candidate keywords; and determining, based on the user feedback regarding the one or more candidate keywords, the at least one keyword from the one or more candidate keywords.

4. The method according to claim 1, wherein generating the target text comprises: determining, based on the reference information, a plurality of subtopics associated with the topic; generating, using a second target model, a plurality of text segments respectively corresponding to the plurality of subtopics; and generating the target text by combining the plurality of text segments.

5. The method according to claim 4, wherein determining the plurality of subtopics associated with the topic comprises: determining, using a third target model, one or more candidate subtopics based on the reference information; obtaining user feedback regarding the one or more candidate subtopics; and determining, based on the user feedback regarding the one or more candidate subtopics, the plurality of subtopics from the one or more candidate subtopics.

6. The method according to claim 4, wherein generating the plurality of text segments comprises: generating, using the second target model, a first text segment corresponding to a first subtopic among the plurality of subtopics, based on the first subtopic and the reference information; and generating, using the second target model, a second text segment corresponding to a second subtopic that follows the first subtopic among the plurality of subtopics, based on the second subtopic and the first text segment.

7. The method according to claim 1, further comprising: in response to determining that the configuration information includes link information indicating the topic, acquiring, using a data acquisition function block, information from a data object indicated by the link information as at least a portion of the reference information.

8. The method according to claim 1, wherein the configuration information further indicates a style of the target text, and wherein generating the target text further comprises: generating the target text using a second target model based on a prompt word template corresponding to the indicated style.

9. The method according to claim 1, wherein generating the target text comprises: generating, using a second target model, candidate text based on the reference information; correcting the candidate text using a fourth target model that is different from the second target model to obtain corrected candidate text; obtaining user feedback on the corrected candidate text; and determining the target text based on the user feedback and the corrected candidate text.

10. The method according to claim 1, wherein generating the audio data comprises: determining, using a fifth target model, one or more candidate timbres of the audio data; obtaining user feedback regarding the one or more candidate timbres; determining a target timbre of the audio data based on the user feedback regarding the one or more candidate timbres; and generating the audio data having the target timbre.

11. The method according to claim 1, further comprising: in response to obtaining the configuration information, generating a task identifier corresponding to the content generation task; and during execution of the content generation task, storing respective processing results and respective node configurations of a plurality of processing nodes included in the content generation task.

12. The method according to claim 11, further comprising: during execution of the content generation task, receiving a user request instructing that the content generation task be re-executed starting from a first node; in response to the user request, obtaining, based on the task identifier, a stored first processing result and a stored first node configuration corresponding to the first node; and re-executing, based on the first processing result and the first node configuration, the processing nodes after the first node.

13. A content generation apparatus, comprising: a configuration information acquisition module configured to acquire configuration information for a content generation task, wherein the configuration information at least indicates a topic of content to be generated and an audio configuration of the content to be generated; a reference information acquisition module configured to acquire, based on the configuration information, reference information associated with the topic; and a generation module configured to generate, based at least on the reference information, target text that matches the topic, wherein the generation module is further configured to generate, based on the audio configuration and the target text, audio data corresponding to the target text.

14. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the electronic device to perform the method according to any one of claims 1 to 12.

15. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 12.