Digital human generation method and device, electronic equipment, storage medium and program product
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN DIANMAO TECH CO LTD
- Filing Date
- 2026-03-27
- Publication Date
- 2026-06-30
AI Technical Summary
In existing digital human generation technologies, the data from each generation stage are independent of each other, resulting in a lack of correlation between the generated digital human content and user information, and an inability to reflect the user's true personalized information.
The method interacts with a target user through an artificial intelligence interaction model to obtain text interaction data, analyzes that data to extract visual feature information and generate image data, generates video data by combining the image data with scene description information, and constructs intelligent agent configuration data. Finally, the user's real feature data is fused with these data to generate a personalized digital human.
This achieves a deep correlation between the generated digital human content and the user's personal characteristics, keeps the visual style of images and videos consistent, avoids stylistic fragmentation between content generated at different stages, and enhances the personalization of digital human content.
Figure CN122309010A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, specifically to digital human generation methods, apparatus, electronic devices, storage media, and program products.
Background Technology
[0002] Current digital human generation technologies typically use only user-uploaded static photos and audio recordings as input, directly driving the digital human generation service to complete video synthesis. The input materials are independent of each other, with no content connection or inheritance relationship. This approach results in a lack of personalization in the generated digital human at the content level. Its visual content, dialogue style, and behavioral configuration all come from external preset or random parameters, creating a clear disconnect from the user's own creative expression, interests, and personal context, failing to reflect the user's true personalized information.
[0003] Therefore, there is an urgent need for a digital human generation method to solve the problems of independent data in each generation stage and lack of correlation between digital human content and user information in related technologies.
Summary of the Invention
[0004] This application provides a digital human generation method, apparatus, electronic device, storage medium, and program product to solve the problems in related technologies where data in each generation stage is independent and digital human content lacks correlation with user information.
[0005] In a first aspect, embodiments of this application provide a method for generating a digital human, the method comprising: interacting with a target user through an artificial intelligence interaction model to obtain text interaction data; analyzing the text interaction data to extract visual feature information, and generating image data based on the visual feature information; generating video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data; extracting configuration reference information from the text interaction data and combining it with the target user's setting instructions to construct intelligent agent configuration data; and acquiring real feature data of the target user, and aggregating and fusing the real feature data with the text interaction data, image data, video data, and intelligent agent configuration data to generate a digital human corresponding to the target user.
[0006] In one optional implementation, acquiring the text interaction data, generating the image data, generating the video data, and constructing the intelligent agent configuration data are processing stages executed sequentially; the method further includes: after the corresponding data is acquired, generated, or constructed in the current processing stage, associating that data with a unique data identifier and storing it, wherein the unique data identifier includes a combination of a step identifier, a course identifier, and a semester identifier; and when executing the next processing stage, retrieving the data stored in the previous processing stage based on the unique data identifier and using it as the data input for the next processing stage.
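The store-then-retrieve behaviour described above can be illustrated with a minimal sketch. The class names `StageKey` and `StageStore`, the in-memory dictionary, and the assumption that step identifiers are consecutive integers are all hypothetical and are not stated in this application; a real server would use persistent storage.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class StageKey:
    """Unique data identifier: the (step, course, semester) triple."""
    step_id: int
    course_id: int
    term_id: int


class StageStore:
    """Hypothetical in-memory stand-in for the server-side storage."""

    def __init__(self) -> None:
        self._data: Dict[StageKey, Any] = {}

    def save(self, key: StageKey, payload: Any) -> None:
        # Associate the stage output with its unique identifier and store it.
        self._data[key] = payload

    def load_previous(self, key: StageKey) -> Any:
        # Retrieve the output of the preceding stage in the same course and
        # semester. Assumes consecutive integer step ids (not from the source).
        previous = StageKey(key.step_id - 1, key.course_id, key.term_id)
        return self._data[previous]


store = StageStore()
store.save(StageKey(1, 10, 2026), {"dialogue": ["I imagine a forest at dusk."]})
stage_input = store.load_previous(StageKey(2, 10, 2026))
```

Because the course and semester identifiers are part of the key, a lookup for a different course raises `KeyError` in this sketch rather than returning another course's data.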
[0007] In one optional implementation, analyzing the text interaction data to extract visual feature information and generating image data based on the visual feature information includes: extracting features from the text interaction data using a semantic analysis model, with character features and color style elements extracted as the visual feature information; combining the visual feature information into a first prompt and displaying it in the interface; receiving the target user's instruction to modify the first prompt, and modifying the first prompt based on the instruction to obtain a second prompt; and generating the image data using an image generation model based on the second prompt.
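A minimal sketch of the two-prompt flow follows. The semantic analysis model and the image generation model are not shown; the `VisualFeatures` structure, the comma-joined prompt format, and the modelling of the user's modification instruction as replacement text are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VisualFeatures:
    # Hypothetical container for the output of the semantic analysis model.
    character_features: List[str] = field(default_factory=list)
    color_style: List[str] = field(default_factory=list)


def build_first_prompt(features: VisualFeatures) -> str:
    # Combine the extracted elements into one comma-separated prompt.
    return ", ".join(features.character_features + features.color_style)


def apply_user_edit(first_prompt: str, edited_text: Optional[str]) -> str:
    # The modification instruction is modelled as replacement text (assumption).
    # If the user submits nothing, the first prompt is reused unchanged.
    if edited_text is None or not edited_text.strip():
        return first_prompt
    return edited_text.strip()


features = VisualFeatures(["young explorer", "red scarf"], ["warm watercolor"])
first_prompt = build_first_prompt(features)
second_prompt = apply_user_edit(first_prompt, "young explorer, blue scarf, warm watercolor")
```

The second prompt, not the first, is what would be passed to the image generation model.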
[0008] In one optional implementation, generating video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data includes: parsing the text interaction data using a semantic analysis model to extract the environmental scene as the scene description information; using the image data as a reference frame and combining it with the scene description information and with parameter settings received from the target user regarding video duration, aspect ratio, or motion style, to form video generation request parameters; and generating the video data based on the video generation request parameters.
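The assembly of the video generation request parameters might look like the sketch below. All field names, default values, and the set of allowed aspect ratios are hypothetical; this application names the parameters (reference frame, scene description, duration, aspect ratio, motion style) but not their encoding.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict

ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}  # assumed set, not from the source


@dataclass
class VideoGenerationRequest:
    reference_image_url: str    # image data from the preceding stage (reference frame)
    scene_description: str      # extracted from the text interaction data
    duration_seconds: int = 5   # user-adjustable; default is an assumption
    aspect_ratio: str = "16:9"  # user-adjustable; default is an assumption
    motion_style: str = "gentle"

    def to_params(self) -> Dict[str, Any]:
        # Validate the user-supplied settings before building the request.
        if self.duration_seconds <= 0:
            raise ValueError("duration_seconds must be positive")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")
        return asdict(self)


request = VideoGenerationRequest(
    reference_image_url="https://example.com/stage2/image.png",
    scene_description="leaves drifting across a forest path at dusk",
    duration_seconds=8,
    aspect_ratio="9:16",
)
params = request.to_params()
```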
[0009] In one optional implementation, extracting configuration reference information from the text interaction data and combining it with the target user's setting instructions to construct intelligent agent configuration data includes: analyzing the text interaction data through a semantic analysis model to generate personalized self-introduction text, which is pre-filled as the configuration reference information; receiving basic configuration instructions from the target user to determine the name and avatar of the intelligent agent; receiving interaction setting instructions to determine the personality settings, the opening remarks for the first round of dialogue, and the list of guiding questions; and integrating them to generate the intelligent agent configuration data.
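As an illustrative sketch, the integration of the pre-filled self-introduction with the two groups of user instructions could be expressed as follows. The `AgentConfig` fields and dictionary keys are hypothetical, and allowing the user to overwrite the pre-filled text is an assumption drawn from its role as reference information.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AgentConfig:
    name: str
    avatar_url: str
    self_introduction: str
    personality: str
    opening_remark: str
    guiding_questions: List[str] = field(default_factory=list)


def build_agent_config(
    prefilled_intro: str,
    basic: Dict[str, Any],
    interaction: Dict[str, Any],
) -> AgentConfig:
    return AgentConfig(
        name=basic["name"],
        avatar_url=basic["avatar_url"],
        # Assumption: a user-edited introduction takes precedence over the
        # text pre-filled from the interaction data.
        self_introduction=basic.get("self_introduction") or prefilled_intro,
        personality=interaction["personality"],
        opening_remark=interaction["opening_remark"],
        guiding_questions=list(interaction.get("guiding_questions", [])),
    )


config = build_agent_config(
    prefilled_intro="I love stories set in forests at dusk.",
    basic={"name": "Momo", "avatar_url": "https://example.com/avatar.png"},
    interaction={
        "personality": "curious and warm",
        "opening_remark": "Hi! Want to hear about my forest?",
        "guiding_questions": ["What lives in the forest?"],
    },
)
```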
[0010] In one optional implementation, obtaining the target user's real feature data includes: obtaining facial image data uploaded by the target user and extracting facial features from the facial image data; obtaining voice features either from an online recording of the target user or by receiving input text to be read aloud and synthesizing it through text-to-speech conversion; and combining the facial features and the voice features to form the real feature data.
[0011] In a second aspect, embodiments of this application provide a digital human generation device, the device comprising: an acquisition module, used to interact with a target user through an artificial intelligence interaction model and acquire text interaction data; an image generation module, used to analyze the text interaction data to extract visual feature information and generate image data based on the visual feature information; a video generation module, used to generate video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data; an agent configuration module, used to extract configuration reference information from the text interaction data and, in conjunction with the target user's setting instructions, construct intelligent agent configuration data; and a generation module, used to acquire real feature data of the target user, and to aggregate and fuse the real feature data with the text interaction data, image data, video data, and intelligent agent configuration data to generate a digital human corresponding to the target user.
[0012] In a third aspect, embodiments of this application provide an electronic device, including a memory and a processor that are communicatively connected to each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the digital human generation method described in the first aspect or any corresponding embodiment.
[0013] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer instructions that cause a computer to execute the digital human generation method described in the first aspect or any corresponding embodiment.
[0014] In a fifth aspect, embodiments of this application provide a computer program product, including computer instructions for causing a computer to execute the digital human generation method described in the first aspect or any corresponding embodiment.
[0015] The digital human generation method provided in this application interacts with the target user through an artificial intelligence interaction model and obtains text interaction data, ensuring that the subsequent generation processes all have content based on the user themselves, rather than relying on external preset parameters. Visual feature information is extracted from the text interaction data to generate image data, ensuring that the image content directly corresponds to the user's interactive expression. Furthermore, this image data is used as a visual reference and combined with scene description information from the text interaction data to generate video data, ensuring consistency in visual style between the video and the image, avoiding stylistic disjointness between the generated content at each stage. Configuration reference information is extracted from the text interaction data and combined with user setting instructions to construct intelligent agent configuration data, giving the intelligent agent's behavioral characteristics personalized support from the interactive content. Finally, by fusing the target user's real feature data with the aforementioned text interaction data, image data, video data, and intelligent agent configuration data, a deep correlation between the digital human generation content and the user's personal characteristics is achieved, solving the technical problem in related technologies where the input materials at each stage are isolated, leading to insufficient personalization of the digital human content.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the specific embodiments of this application or the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0017] Figure 1 is a schematic diagram of an application scenario according to an embodiment of this application;
Figure 2 is a first flowchart of the digital human generation method according to an embodiment of this application;
Figure 3 is a second flowchart of the digital human generation method according to an embodiment of this application;
Figure 4 is a schematic diagram of the cross-step data inheritance chain in the progressive AI creation teaching system according to an embodiment of this application;
Figure 5 is a schematic diagram of the complete process from the AI dialogue stage to the AI image generation stage according to an embodiment of this application;
Figure 6 is a schematic diagram of the complete process from the AI video generation stage to the AI agent stage according to an embodiment of this application;
Figure 7 is a schematic diagram of the complete process from the AI digital human stage to the video check-in stage according to an embodiment of this application;
Figure 8 is a schematic diagram of the state transitions of a digital human generation task according to an embodiment of this application;
Figure 9 is a schematic diagram of a closed-loop teaching evaluation system according to an embodiment of this application;
Figure 10 is a structural block diagram of a digital human generation device according to an embodiment of this application;
Figure 11 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0019] It should be noted that the information (including but not limited to user input information, such as information entered by the user into input boxes), data (including but not limited to data used for analysis, stored data, and displayed data, such as context code, all code of the current project, the service pressure corresponding to operations performed on all code of the current project, and the code development status of the current project), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of related data must comply with relevant laws, regulations, and standards. For example, the context code, operations performed on all code of the current project, the corresponding service pressure, and the code development status involved in this application were all obtained with full authorization.
[0020] The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0021] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained.
[0022] (1) AI (Artificial Intelligence): refers to the technology of simulating human intelligent behavior by computer systems.
[0023] (2) LLM (Large Language Model): refers to a language model that has been trained on a large-scale corpus and has the ability to understand and generate natural language.
[0024] (3) Prompt: refers to the text description input to the image generation or video generation model, which is used to guide the model to generate visual content that meets expectations.
[0025] (4) URL (Uniform Resource Locator): refers to the access address of a resource on the Internet.
[0026] (5) stepId (step identifier): refers to the unique number used to identify the current processing stage. Together with courseId and termId, it forms a triple and serves as the storage index key for the data produced by each step.
[0027] (6) courseId (course identifier): refers to the unique number used to identify the current course. It is used in combination with stepId and termId to ensure the isolation of data between different courses.
[0028] (7) termId (semester identifier): refers to a unique number used to identify the current semester. It is used in combination with stepId and courseId to ensure accurate differentiation of data between different semesters.
[0029] (8) agentId (agent identifier): refers to the globally unique identifier generated by the server after the agent is configured, which is used to accurately reference the complete configuration data of the agent in subsequent steps.
[0030] (9) getDrawProperties (get drawing properties interface): refers to the unified data query interface provided by the server.
[0031] (10) voiceUrl (sound resource address): refers to the storage address of the recording file on the server after the online recording is completed, and is used to input the voice features in the digital human generation task.
[0032] (11) TTS (Text-To-Speech): refers to the speech synthesis technology that converts input text content into speech audio.
[0033] (12) aiType (AI generation type parameter): refers to the parameter used to specify the generation mode in the image generation request. aiType=1 indicates text-based image generation mode (generating images based on text descriptions), and aiType=2 indicates image-based image generation mode (generating images based on reference images through style transfer).
[0034] (13) VideoGenMode (video generation mode parameter): refers to the parameter used to specify the generation mode in the video generation request. It supports two modes: txt2Video (generating video based on text description) and img2Video (generating video with the preceding image as the reference frame).
[0035] (14) txt2Video (text-generated video): refers to a mode that drives video generation solely based on text descriptions.
[0036] (15) img2Video (image to video): refers to the mode of using a specified image as a reference frame and combining text description to drive video generation, which can make the generated video consistent with the reference image in visual style.
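The two mode parameters defined in items (12) to (15) can be modelled as enumerations, as in the hedged sketch below. The values `1`, `2`, `txt2Video`, and `img2Video` come from the definitions above; the enum member names and the rule of selecting the image-based mode whenever a reference image is available are assumptions for illustration.

```python
from enum import Enum, IntEnum
from typing import Optional


class AiType(IntEnum):
    TEXT_TO_IMAGE = 1   # aiType=1: generate images from a text description
    IMAGE_TO_IMAGE = 2  # aiType=2: generate images from a reference image


class VideoGenMode(str, Enum):
    TXT2VIDEO = "txt2Video"  # driven solely by a text description
    IMG2VIDEO = "img2Video"  # driven by a reference frame plus text


def choose_ai_type(reference_image_url: Optional[str]) -> AiType:
    # Assumed selection rule: use the image-based mode when a reference exists.
    return AiType.IMAGE_TO_IMAGE if reference_image_url else AiType.TEXT_TO_IMAGE


def choose_video_mode(reference_image_url: Optional[str]) -> VideoGenMode:
    # Assumed selection rule: img2Video keeps the video consistent with the image.
    return VideoGenMode.IMG2VIDEO if reference_image_url else VideoGenMode.TXT2VIDEO


image_mode = choose_ai_type(None)
video_mode = choose_video_mode("https://example.com/stage2/image.png")
```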
[0037] As one optional application scenario in the embodiments of this application, as shown in Figure 1, the digital human generation method can run on a system that includes at least one terminal device and at least one server. The system illustrated in Figure 1 includes a computer 101, a mobile terminal 102, and a server 103; terminal devices such as the computer 101 and the mobile terminal 102 are connected to the server 103 through a network 110.
[0038] The terminal device can be a smartphone, tablet, laptop, desktop computer, smart TV, or smart wearable device. The terminal device is primarily responsible for presenting the interactive interface, receiving input information and setting instructions from the target user, and initiating data requests and task submissions to the server 103. The server 103 can be a standalone physical server, a server cluster, a distributed system, or a cloud server providing cloud services. It is primarily responsible for performing processing tasks such as text analysis, image generation, video generation, intelligent agent configuration and construction, and digital human fusion generation. The network 110 can be a wired or wireless network, examples of which include, but are not limited to, the Internet, corporate intranets, local area networks, wide area networks, mobile communication networks, and combinations thereof.
[0039] It should be noted that Figure 1 is merely an example of an application scenario and does not limit the scope of protection of this application.
[0040] In existing digital human generation schemes, image generation, video generation, and intelligent agent configuration are usually independent of each other. The data generated in each step cannot be systematically referenced in subsequent generation processes, and the final digital human content lacks an intrinsic connection with the user's personal information. To address the technical problems of independent data in each generation step and insufficient correlation between digital human content and user information, this application provides a digital human generation method to achieve the technical effect of generating digital humans that are deeply associated with the user's personal information.
[0041] According to an embodiment of this application, a method for generating a digital human is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0042] This embodiment provides a digital human generation method that can be applied to servers in the above system, such as independent servers, cloud servers, or distributed server clusters. Terminal devices (such as mobile phones, tablets, computers, etc.) are responsible for presenting the interactive interface and transmitting user input data to the server. The server is responsible for executing data analysis and content generation tasks at each stage and finally returning the generation results to the terminal device for display. Figure 2 is a flowchart of a digital human generation method according to an embodiment of this application. As shown in Figure 2, the process includes the following steps:
Step S201: Interact with the target user through an artificial intelligence interaction model to obtain text interaction data.
[0043] In this step, the target user refers to the specific individual user currently using the terminal device to participate in the interaction, and the data generated by their interaction behavior will be used throughout the subsequent generation stages.
[0044] Text interaction data refers to the interactive content generated in the interaction process of this step and recorded in text form, including the user's input, responses, and complete dialogue records formed by the back-and-forth communication with the artificial intelligence interaction model.
[0045] The artificial intelligence interaction model is a model capable of natural language understanding and generation. It proactively poses questions or guiding remarks to the target user, understands and analyzes the user's responses, and then generates content for the next round of interaction. For example, the artificial intelligence interaction model can be preconfigured with dialogue scripts related to a specific topic. Through multiple rounds of interaction with the target user, it gradually guides the user to express ideas related to visual scenes, character images, emotional styles, etc., until the interactive content covers the predetermined topic range and the current stage of the interaction ends.
[0046] Acquiring text interaction data refers to recording the complete dialogue content generated during the interaction in a structured form after the interaction is completed, and persistently storing it on the server for subsequent stages to access. For example, the server can bind the text interaction data with a unique index that identifies the task and store it in the database, ensuring that subsequent steps can accurately and completely retrieve the data produced in this step when needed.
[0047] Step S202: Analyze the text interaction data to extract visual feature information, and generate image data based on the visual feature information.
[0048] In this step, visual feature information refers to descriptive information extracted from text interaction data that reflects the visual presentation of the picture, such as scene environment, character appearance features, color style, emotional atmosphere, and other elements.
[0049] Image data refers to specific image files or corresponding resource identifiers generated by an image generation service based on extracted visual feature information.
[0050] Analyzing the text interaction data to extract visual feature information refers to the server performing semantic parsing on the text interaction data stored in step S201, identifying and integrating descriptive content related to image generation, and forming a structured visual feature representation. For example, the server can call a large language model to perform summary analysis on the complete dialogue record, extracting information such as scenes, roles, and styles mentioned by the user in the interaction into prompt words that can be used by the image generation service.
[0051] Generating image data based on the aforementioned visual feature information means using the extracted visual feature information as input parameters for an image generation request, calling an image generation service to create the image, and persistently storing the generated result for reference in subsequent steps.
[0052] Step S203: Using image data as a visual reference and combining it with scene description information extracted from text interaction data, video data is generated.
[0053] In this step, scene description information refers to semantic information extracted from text interaction data that reflects the dynamic content of the scene, such as motion trends, scene atmosphere, and action descriptions. This complements the visual feature information in step S202, corresponding to the static visual style of the image and the dynamic scene content of the video, respectively. Video data refers to the video file or corresponding resource identifier ultimately generated by the video generation service.
[0054] Using the image data as a visual reference means using the image data generated in step S202 as a reference input for the video generation request, so that the generated video maintains visual consistency with the preceding images in terms of visual style, color tone, and subject image. Generating video data by combining the scene description information extracted from the text interaction data means simultaneously using the scene description information extracted from the text interaction data as a semantic guide for the video content, which, together with the image data, constitutes the complete parameters of the video generation request, is submitted to the video generation service, and the generated video data is persistently stored on the server.
[0055] Step S204: Extract configuration reference information from the text interaction data and combine it with the target user's setting instructions to construct the intelligent agent's configuration data.
[0056] In this step, the configuration reference information refers to the content extracted from the text interaction data that can be used to assist the agent in personalized configuration. For example, descriptive information that reflects the personal characteristics of the target user can be used to pre-fill the relevant configuration items of the agent and provide a reference basis for the user's manual settings in the future.
[0057] Setting instructions refer to the specific settings that the target user inputs on the terminal device for each configuration item of the intelligent agent, such as the intelligent agent's name, avatar, personality, and opening remarks.
[0058] Intelligent agent configuration data refers to a structured data set that fully describes the attributes of an intelligent agent after integrating configuration reference information and user setting instructions, including all personality attributes and interaction behavior parameters of the intelligent agent.
[0059] Constructing the intelligent agent configuration data refers to the process by which the server, after receiving the setting instructions submitted by the target user, combines them with the configuration reference information extracted from the text interaction data to form complete intelligent agent configuration data, persistently stores that data, and generates a globally unique identifier corresponding to the intelligent agent for subsequent steps to aggregate and reference.
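The persist-and-identify step can be sketched as follows. Using a random UUID as the globally unique identifier (the agentId) and a dictionary as the registry are assumptions; this application does not specify how the identifier is generated or where the configuration is stored.

```python
import uuid
from typing import Any, Dict


def register_agent(config: Dict[str, Any], registry: Dict[str, Dict[str, Any]]) -> str:
    # Generate an identifier for the agent (a random UUID is an assumption).
    agent_id = uuid.uuid4().hex
    # Store a copy so later changes to the caller's dict do not alter the record.
    registry[agent_id] = dict(config)
    return agent_id


registry: Dict[str, Dict[str, Any]] = {}
agent_id = register_agent({"name": "Momo", "personality": "curious and warm"}, registry)
stored_config = registry[agent_id]
```

Subsequent steps would then reference the complete configuration through `agent_id` alone.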
[0060] Step S205: Obtain the real feature data of the target user, and aggregate and fuse the real feature data with text interaction data, image data, video data and intelligent agent configuration data to generate a digital human corresponding to the target user.
[0061] In this step, real feature data refers to data that can reflect the physiological characteristics of the target user, such as the target user's facial image information and voice information.
[0062] A digital human is a digital virtual image that is generated based on the real characteristics of the target user and the creative data from previous stages, and possesses the personal attributes of the target user.
[0063] Obtaining the target user's real feature data means collecting or receiving facial images and audio information uploaded by the target user through a terminal device and transmitting it to the server as input for this step.
[0064] Data aggregation and fusion refers to the server integrating the text interaction data, image data, video data, and agent configuration data generated in steps S201 to S204 with the real feature data obtained in this step. This integration forms the complete input parameters for the digital human generation task, which are then submitted to the digital human generation service to complete the creation of the digital human. After generation, the corresponding video data of the digital human is persistently stored on the server and can be displayed and used on terminal devices.
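A minimal sketch of assembling the input parameters of the digital human generation task is given below. The key names and the completeness check are hypothetical; the digital human generation service itself is not shown.

```python
from typing import Any, Dict

# Hypothetical key names for the outputs of steps S201 to S204 plus this step.
REQUIRED_KEYS = ("text_interaction", "image", "video", "agent_config", "real_features")


def build_digital_human_task(
    stage_outputs: Dict[str, Any],
    real_features: Dict[str, Any],
) -> Dict[str, Any]:
    # Merge the earlier stage outputs with the real feature data.
    task = dict(stage_outputs)
    task["real_features"] = real_features
    # Refuse to submit a task if any stage output is missing.
    missing = [key for key in REQUIRED_KEYS if task.get(key) is None]
    if missing:
        raise ValueError(f"missing stage data: {', '.join(missing)}")
    return task


task = build_digital_human_task(
    {
        "text_interaction": ["I imagine a forest at dusk."],
        "image": "https://example.com/stage2/image.png",
        "video": "https://example.com/stage3/video.mp4",
        "agent_config": {"name": "Momo"},
    },
    {"face_image": "https://example.com/face.png", "voiceUrl": "https://example.com/voice.wav"},
)
```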
[0065] In summary, the digital human generation method provided in this application interacts with the target user through an artificial intelligence interaction model and acquires text interaction data, ensuring that the subsequent generation processes all have content based on the user themselves, rather than relying on external preset parameters. Visual feature information is extracted from the text interaction data to generate image data, ensuring that the image content directly corresponds to the user's interactive expression. Furthermore, this image data is used as a visual reference and combined with scene description information from the text interaction data to generate video data, ensuring consistency in visual style between the video and the image, avoiding stylistic disjointness between the generated content at each stage. Configuration reference information is extracted from the text interaction data and combined with user setting instructions to construct intelligent agent configuration data, giving the intelligent agent's behavioral characteristics personalized support from the interactive content. Finally, by fusing the target user's real feature data with the aforementioned text interaction data, image data, video data, and intelligent agent configuration data, a deep correlation between the digital human generation content and the user's personal characteristics is achieved, solving the technical problem in related technologies where the input materials at each stage are isolated, leading to insufficient personalization of the digital human content.
[0066] This embodiment provides a digital human generation method that can be applied to servers in the above system, such as independent servers, cloud servers, or distributed server clusters. Terminal devices (such as mobile phones, tablets, computers, etc.) are responsible for presenting the interactive interface and transmitting user input data to the server. The server is responsible for executing data analysis and content generation tasks at each stage and finally returning the generation results to the terminal device for display. Figure 3 is a flowchart of a digital human generation method according to an embodiment of this application. As shown in Figure 3, the process includes the following steps:
Step S301: Interact with the target user through an artificial intelligence interaction model to obtain text interaction data.
[0067] To ensure that the text interaction data acquired in this step has more semantic information for use in subsequent stages, in one optional implementation, the AI interaction model, based on a preset guided dialogue script, issues guiding questions to the target user to guide the user to output creative content; after the preset interaction conditions are met, the interaction ends and the creative content is treated as text interaction data.
[0068] Specifically, a guided dialogue script refers to a pre-set dialogue framework designed to guide users in expressing specific topics. It includes several structured guiding questions, typically revolving around themes related to visual scenes, character portrayals, and emotional styles. For example, a guided dialogue script might pre-set guiding questions such as "What is your favorite story scene like?" or "What are the characteristics of the protagonist you imagine?" An AI interaction model then sequentially poses these questions to the target user according to the script's logic, performs semantic understanding of the user's responses, and generates more targeted follow-up questions to gradually guide the user to output creative content related to the topic.
[0069] Creative content refers to the personalized ideas and descriptions related to visual creation expressed by the target users during the above guidance process, such as scene environment, character characteristics, color preferences, storyline, etc. This content will be referenced in subsequent processing stages.
[0070] Preset interaction conditions refer to pre-defined conditions used to determine whether the current round of interaction has covered the predetermined topics. Examples include reaching a preset threshold for the number of dialogue rounds, or the AI interaction model determining that the target user has expressed their views on all preset topics. Once these conditions are met, the AI interaction model ends the current stage of interaction and integrates all text content generated during this interaction into structured text interaction data.
[0071] It is worth noting that in some implementation scenarios, the end of text interaction data acquisition can also be actively triggered by the user, for example by clicking an operation button such as "End Conversation" in the interaction interface, so as to flexibly adapt to the interaction rhythm of different users.
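As a non-limiting illustration, the guided dialogue and its ending conditions described above can be sketched in Python as follows. The names `run_guided_dialogue`, `DialogueResult`, and the `"/end"` command (standing in for the "End Conversation" button) are assumptions made for this sketch and do not appear in the application. The semantic understanding and follow-up question generation performed by the AI interaction model are not modeled; a non-empty answer is simply treated as covering the current topic.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple

END_COMMAND = "/end"  # assumed stand-in for the "End Conversation" button


@dataclass
class DialogueResult:
    turns: List[dict] = field(default_factory=list)
    covered_topics: Set[str] = field(default_factory=set)
    end_reason: str = ""


def run_guided_dialogue(
    script: List[Tuple[str, str]],
    answers: Iterable[str],
    max_rounds: int = 10,
) -> DialogueResult:
    """Ask scripted questions until a preset interaction condition is met."""
    result = DialogueResult()
    pending = list(script)  # (topic, question) pairs not yet covered
    replies = iter(answers)
    while True:
        if not pending:
            result.end_reason = "topics_covered"
            break
        if len(result.turns) >= max_rounds:
            result.end_reason = "max_rounds"
            break
        topic, question = pending[0]
        reply = next(replies, END_COMMAND)  # no more input is treated as ending
        if reply == END_COMMAND:
            result.end_reason = "user_ended"
            break
        result.turns.append({"topic": topic, "question": question, "answer": reply})
        if reply.strip():  # a non-empty answer counts as covering the topic
            pending.pop(0)
            result.covered_topics.add(topic)
    return result
```

The returned `turns` list plays the role of the structured text interaction data, and `end_reason` records which of the three ending conditions was met.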
[0072] Furthermore, each subsequent processing stage needs to reference the data generated in this step and in later stages. To achieve accurate association and automatic transfer of data between processing stages, in one optional implementation, acquiring text interaction data, generating image data, generating video data, and constructing the intelligent agent configuration data are processing stages executed sequentially. After the corresponding data is acquired, generated, or constructed in any current processing stage, that data is associated with a unique data identifier and stored, wherein the unique data identifier includes a combination of a step identifier, a course identifier, and a semester identifier. When the next processing stage is executed, the data associated and stored in the previous processing stage is retrieved based on the unique data identifier and used as the data input for the next processing stage.
[0073] A unique data identifier is a composite identifier used to precisely locate data produced at a specific processing stage at the data storage level. It is composed of three dimensions: step identifier, course identifier, and semester identifier. These three elements together determine "which user, in which semester, in which course, and at which processing stage" generated the data, thus ensuring that data from different users, different courses, and different semesters is not confused. For example, after a target user completes text interaction data acquisition in a specific course of a specific semester, this text interaction data is associated with a unique data identifier composed of the step identifier "step_01", the course identifier "course_A", and the semester identifier "term_2024", and stored in the database. When entering the image data generation stage, the aforementioned text interaction data is retrieved from the database based on the same unique data identifier and used as the data input for this stage.
[0074] Through the above mechanism, the output data of each processing stage is stored with a unique data identifier as an index. Subsequent stages automatically retrieve the output of the previous stage during initialization, without requiring any manual operation from the user, thus achieving seamless automatic inheritance of data across stages.
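As a non-limiting illustration, the association, storage, and retrieval of stage data by a unique data identifier can be sketched in Python as follows. The class names `DataKey` and `StageStore` are assumptions made for this sketch, and an in-memory dictionary stands in for the database mentioned above.

```python
from typing import Any, Dict, NamedTuple, Optional


class DataKey(NamedTuple):
    """Composite identifier: step, course, and semester identifiers."""
    step_id: str
    course_id: str
    term_id: str


class StageStore:
    """In-memory stand-in for the database that holds stage outputs."""

    def __init__(self) -> None:
        self._records: Dict[DataKey, Any] = {}

    def save(self, key: DataKey, data: Any) -> None:
        # Saving under an existing key replaces the earlier record.
        self._records[key] = data

    def load(self, key: DataKey) -> Optional[Any]:
        # Returns None when nothing was stored under this key.
        return self._records.get(key)


if __name__ == "__main__":
    store = StageStore()
    key = DataKey("step_01", "course_A", "term_2024")
    store.save(key, {"turns": ["I imagine a mysterious underwater world"]})
    # The image generation stage retrieves the dialogue data by the same key.
    print(store.load(key))
```

Because the key is a tuple of all three identifiers, a record stored for one course or semester is never returned for another.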
[0075] Step S302: Analyze the text interaction data to extract visual feature information, and generate image data based on the visual feature information.
[0076] To make the image data generation process more personalized and retain the user's creative control, in one optional implementation, the image data generation specifically includes the following processes: extracting features from text interaction data using a semantic analysis model, extracting character features and color style elements as visual feature information; combining the visual feature information into a first prompt word and pre-filling and displaying it on the interface; receiving modification instructions from the target user for the first prompt word, and modifying the first prompt word based on the modification instructions to obtain a second prompt word; and generating image data based on the second prompt word using an image generation model.
[0077] Semantic analysis models are models with semantic understanding and information extraction capabilities, such as large language models, which can perform summary analysis on input text content to identify and extract key elements related to the visual representation of images.
[0078] Character characteristics refer to descriptive information about the main character's image in text interaction data, such as the character's physical outline, clothing style, facial features, etc.
[0079] Color style elements refer to descriptive information in text interaction data that involves the overall tone, color scheme, or visual style of the picture, such as "bright, warm colors" or "calm and deep blue-gray tones".
[0080] The two types of information mentioned above are automatically extracted from the text interaction data by the semantic analysis model and combined into a structured first prompt word. The first prompt word refers to the draft image generation prompt word automatically generated based on the text interaction data, which is pre-filled and displayed in the prompt word input box of the image generation interface for the target user to view and refer to.
[0081] A modification command refers to an operation command issued by a target user after viewing the first prompt word, which adjusts the content of the prompt word. This includes deleting or replacing a keyword in the prompt word, or adding supplementary information to the prompt word. For example, after viewing the first prompt word "deep sea scene, blue tone, glowing fish", the target user adds keywords such as "treasure, mystery" to obtain the modified second prompt word "deep sea scene, blue tone, glowing fish, treasure, mystery".
[0082] Based on the second prompt word, combined with parameters such as the image style, aspect ratio, and generation model selected by the target user in the interface, the image generation model is invoked to generate image data. In another implementation, the target user can also choose not to modify the first prompt word at all, and directly submit the generation request using the first prompt word as the second prompt word, to meet the operational preferences of different users.
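As a non-limiting illustration, the formation of the first prompt word and its modification into the second prompt word can be sketched in Python as follows. The function names `build_first_prompt` and `apply_modifications`, and the representation of a prompt word as a comma-separated keyword list, are assumptions made for this sketch. The extraction of elements by the semantic analysis model is not modeled; the extracted elements are passed in directly.

```python
from typing import Dict, Iterable, List, Optional


def build_first_prompt(*element_groups: Iterable[str]) -> str:
    """Combine extracted elements into a draft prompt, dropping duplicates."""
    keywords: List[str] = []
    for group in element_groups:
        for keyword in group:
            if keyword not in keywords:
                keywords.append(keyword)
    return ", ".join(keywords)


def apply_modifications(
    prompt: str,
    add: Iterable[str] = (),
    remove: Iterable[str] = (),
    replace: Optional[Dict[str, str]] = None,
) -> str:
    """Apply the user's delete, replace, and add operations to a prompt."""
    replace = replace or {}
    removed = set(remove)
    result: List[str] = []
    for keyword in (k.strip() for k in prompt.split(",")):
        if not keyword or keyword in removed:
            continue
        keyword = replace.get(keyword, keyword)
        if keyword not in result:
            result.append(keyword)
    for keyword in add:
        if keyword not in result:
            result.append(keyword)
    return ", ".join(result)
```

Calling `apply_modifications` with no modifications returns the first prompt word unchanged, which corresponds to the case where the target user submits the first prompt word directly as the second prompt word.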
[0083] In addition, the image generation model can support a variety of generation modes, such as text-to-image generation based on plain text descriptions, and image generation based on style transfer of reference images uploaded by the target user, to adapt to different creative needs and usage scenarios.
[0084] Step S303: Using image data as a visual reference and combining it with scene description information extracted from text interaction data, video data is generated.
[0085] To ensure that the generated video data maintains visual style consistency with the preceding image data while fully reflecting the personalized preferences of the target user, one optional implementation method for generating video data includes: parsing text interaction data using a semantic analysis model to extract the environmental scene as scene description information; using image data as reference frames, combining the scene description information with the parameter settings received from the target user regarding video duration, aspect ratio, or motion style to form video generation request parameters; and generating video data based on the video generation request parameters.
[0086] A reference frame is a reference image used during video generation to constrain the visual style and subject image of the video. The image data generated in step S302 is used as a reference frame in the video generation request to ensure that the generated video maintains visual consistency with the generated image in terms of tone, style, and subject image. For example, if a dark blue image of a school of deep-sea fish is generated in step S302, the video generated using this image as a reference frame will maintain visual consistency with the image, presenting a coherent visual aesthetic.
[0087] The video generation request parameters refer to a complete request data set formed by integrating scene description information, reference frames, and user parameter settings. After this data set is submitted to the video generation service, the service asynchronously executes the video generation task. Parameters such as video duration, aspect ratio, and motion style are selected by the target user in the video generation interface. For example, the target user can choose a 5-second duration, a 16:9 aspect ratio, and a "dreamy floating" motion style.
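As a non-limiting illustration, the assembly of video generation request parameters can be sketched in Python as follows. The field names, the function name `build_video_request`, and the validation rules are assumptions made for this sketch; the actual request format of the video generation service is not specified in the application.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class VideoGenRequest:
    scene_description: str
    reference_frame_url: Optional[str]  # image data used as the reference frame
    duration_seconds: int
    aspect_ratio: str
    motion_style: str


def build_video_request(
    scene_description: str,
    reference_frame_url: Optional[str],
    duration_seconds: int,
    aspect_ratio: str,
    motion_style: str,
) -> dict:
    """Merge scene description, reference frame, and user settings."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    width, _, height = aspect_ratio.partition(":")
    if not (width.isdigit() and height.isdigit() and int(width) > 0 and int(height) > 0):
        raise ValueError("aspect ratio must look like '16:9'")
    request = VideoGenRequest(
        scene_description, reference_frame_url,
        duration_seconds, aspect_ratio, motion_style,
    )
    return asdict(request)
```

For the example above, the call would be `build_video_request("deep-sea exploration", image_url, 5, "16:9", "dreamy floating")`, where `image_url` is the stored image data from step S302.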
[0088] In addition, the video generation service also supports multiple generation modes, including a mode that generates video based solely on text descriptions, and a mode that uses preceding images as reference frames to assist in video generation. The appropriate generation mode can be flexibly selected according to the actual input.
[0089] Furthermore, to address situations where users are dissatisfied with the preceding image data during the video generation stage, in one optional implementation, if a redo instruction is received from the target user to return to the image generation step during the video data generation process, the unique data identifier remains unchanged, and the image data generation processing stage is re-executed; the newly generated image data is used to overwrite the original image data, and when re-entering the video generation step, the latest image data is obtained as a visual reference.
[0090] A redo instruction is an operation command initiated by the target user during the video generation stage, requesting to return to the image generation stage for reprocessing. Upon receiving a redo instruction, the system re-enters the image generation stage with the same unique data identifier as the current task, without creating a new identifier. The target user can modify the original initial prompt and regenerate the image data. The newly generated image data will update the original image data stored on the server by overwriting it, ensuring the uniqueness and consistency of data at each processing stage. Once the target user is satisfied with the newly generated image data, the system re-enters the video generation stage. Based on the unique data identifier, the latest stored image data is automatically retrieved as a reference frame, and the entire generation process restarts from the video generation stage without requiring the user to manually specify reference materials. Through this mechanism, the target user can flexibly adjust the creative results at any stage without affecting the overall processing flow identifier system, ensuring the correctability of data at each processing stage and the stability of the overall process.
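As a non-limiting illustration, the redo mechanism can be sketched in Python as follows. The function names and the `generate_image` callable, which stands in for the image generation model, are assumptions made for this sketch, and a dictionary stands in for server-side storage.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]  # (step identifier, course identifier, semester identifier)


def run_image_stage(
    store: Dict[Key, str],
    key: Key,
    prompt: str,
    generate_image: Callable[[str], str],
) -> str:
    """Generate image data and store it under the given identifier.

    On a redo the same key is passed again, so the new image data
    overwrites the original record instead of creating a second one.
    """
    store[key] = generate_image(prompt)
    return store[key]


def reference_frame_for_video(store: Dict[Key, str], image_key: Key) -> str:
    """Return the latest stored image data for use as the reference frame."""
    return store[image_key]
```

Since the identifier is unchanged across redos, the video generation stage needs no knowledge of how many times the image was regenerated; it always reads the single current record.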
[0091] Step S304: Extract configuration reference information from the text interaction data and combine it with the target user's setting instructions to construct the configuration data of the intelligent agent.
[0092] To make the configuration process of the intelligent agent more efficient and personalized, in one optional implementation, the construction of the intelligent agent configuration data specifically includes: analyzing text interaction data through a semantic analysis model to generate personalized self-introduction text as configuration reference information for pre-filling and display; receiving basic configuration instructions from the target user to determine the name and avatar of the intelligent agent, and receiving interaction setting instructions to determine personality settings, opening remarks for the first round of dialogue, and a list of guiding questions, and integrating them to generate intelligent agent configuration data.
[0093] Personalized self-introduction text refers to text content automatically generated by a semantic analysis model based on the user's personal characteristics and interests reflected in text interaction data, used to describe the agent's own attributes. For example, if the target user repeatedly mentions topics such as "adventure" and "mysterious ocean" in the text interaction data, the semantic analysis model can generate a self-introduction text such as "I am a digital partner who loves exploring the unknown world and enjoys taking you to discover the mystery and wonder of the deep sea," and pre-fill it into the self-introduction input box in the configuration interface for the target user to view, modify, or use directly.
[0094] Basic configuration commands refer to the operation commands issued by the target user in the first stage (creation stage) of agent configuration, which are mainly used to determine the name and avatar of the agent.
[0095] Interaction setting instructions refer to the operation commands issued by the target user in the second stage (setting stage) of agent configuration. These instructions are used to determine interactive behavior parameters such as the agent's personality settings, opening remarks for the first round of dialogue, and a list of guiding questions. After receiving these two types of configuration commands, the system integrates them with configuration reference information to form complete agent configuration data, which is then submitted to the server for storage, generating a globally unique agent identifier for reference in subsequent digital human generation stages. In some implementations, the system can also periodically save the entered configuration content during the configuration process and allow the target user to continue from the last saved point after an interruption, thereby improving the reliability of the configuration process and the user experience.
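As a non-limiting illustration, the integration of the configuration reference information with the two types of configuration instructions can be sketched in Python as follows. The class name `AgentConfig`, the dictionary keys, and the use of `uuid.uuid4` to produce the agent identifier are assumptions made for this sketch; the application states only that a globally unique agent identifier is generated when the configuration data is stored.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AgentConfig:
    agent_id: str
    name: str
    avatar_url: str
    self_introduction: str
    personality: str
    opening_remark: str
    guiding_questions: List[str]


def build_agent_config(
    reference_intro: str,
    basic: Dict[str, str],
    interaction: Dict[str, object],
    edited_intro: Optional[str] = None,
) -> AgentConfig:
    """Merge the pre-filled self-introduction with the user's instructions."""
    # The user may keep the pre-filled text or replace it with an edited one.
    intro = edited_intro if edited_intro is not None else reference_intro
    return AgentConfig(
        agent_id=uuid.uuid4().hex,  # assumed way of producing a unique identifier
        name=basic["name"],
        avatar_url=basic["avatar_url"],
        self_introduction=intro,
        personality=str(interaction["personality"]),
        opening_remark=str(interaction["opening_remark"]),
        guiding_questions=list(interaction["guiding_questions"]),
    )
```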
[0096] Step S305: Obtain the real feature data of the target user, and aggregate and fuse the real feature data with text interaction data, image data, video data and intelligent agent configuration data to generate a digital human corresponding to the target user.
[0097] To ensure that the digital human accurately reflects the personal characteristics of the target user, in one optional implementation, the acquisition of real feature data specifically includes: acquiring facial image data uploaded by the target user and extracting image features from the facial image data; obtaining voice features by acquiring online recordings of the target user or receiving input text for reading and synthesizing it through text-to-speech; and combining the image features and voice features into real feature data.
[0098] Facial image data refers to image files containing facial information uploaded by a target user through a terminal device.
[0099] Image features refer to the feature data extracted from facial image data that drives the facial expressions of a digital human. This extraction process typically requires multi-level security reviews to ensure the compliance of the image content. Once the review is passed, a unique identifier for the image features is returned for use by the digital human generation task. If the image fails the security review, the target user will be prompted to re-upload a compliant image.
[0100] Voice features refer to the sound data used to drive the voice output of a digital human. These can be obtained in two ways: first, the target user can record online using the microphone of their terminal device, directly acquiring the recording file as the voice feature; second, the target user can input text content and select a voice style in the interface, and the text can be converted into speech data using text-to-speech synthesis technology, which will then serve as the voice feature. During the generation or reception of voice data, sensitive information detection is typically performed on the input content to ensure the compliance of the voice data.
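As a non-limiting illustration, the combination of image features and voice features into real feature data, with the two ways of obtaining voice features, can be sketched in Python as follows. The names `RealFeatureData` and `build_real_feature_data`, and the `synthesize` callable standing in for text-to-speech synthesis, are assumptions made for this sketch. The security review and sensitive information detection described above are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RealFeatureData:
    image_feature_id: str  # identifier returned after the image review passes
    voice_source: str      # "recording" or "tts"
    voice_data: bytes


def build_real_feature_data(
    image_feature_id: str,
    recording: Optional[bytes] = None,
    tts_text: Optional[str] = None,
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> RealFeatureData:
    """Combine image features with voice features from one of two sources."""
    if recording is not None:
        # First way: an online recording is used directly as the voice feature.
        return RealFeatureData(image_feature_id, "recording", recording)
    if tts_text and synthesize is not None:
        # Second way: input text is converted to speech data.
        return RealFeatureData(image_feature_id, "tts", synthesize(tts_text))
    raise ValueError("either a recording or text plus a synthesizer is required")
```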
[0101] Data aggregation and fusion refers to integrating the text interaction data, image data, video data, and intelligent agent configuration data generated in steps S301 to S304 with the real feature data obtained in this step into complete digital human generation task parameters, and submitting them to the digital human generation service.
[0102] After receiving the above parameters, the digital human generation service executes the generation task asynchronously. The task execution status is queried continuously through a polling mechanism. Upon successful completion, the video resource identifier corresponding to the digital human is obtained and persistently stored on the server. If an abnormality occurs during task execution, such as generation failure, timeout, or rejection by a compliance check, a corresponding processing prompt can be displayed to the target user based on the type of exception. For example, for retryable exceptions, an entry point for regeneration is displayed; for compliance issues, the target user is guided to return to the corresponding stage, make modifications, and resubmit.
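As a non-limiting illustration, the polling of an asynchronous generation task and the handling of its outcomes can be sketched in Python as follows. The function name `poll_task`, the state names (`"running"`, `"succeeded"`, `"failed"`, `"rejected"`), and the shape of the returned dictionaries are assumptions made for this sketch; the status format of the digital human generation service is not specified in the application.

```python
import time
from typing import Callable, Dict


def poll_task(
    get_status: Callable[[], Dict[str, str]],
    interval_seconds: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Query the task status repeatedly until it finishes or times out."""
    for _ in range(max_attempts):
        status = get_status()
        state = status["state"]
        if state == "succeeded":
            return {"outcome": "done", "video_resource_id": status["video_resource_id"]}
        if state == "failed":
            # Retryable exception: the interface can offer regeneration.
            return {"outcome": "retry", "reason": "failed"}
        if state == "rejected":
            # Compliance issue: the user returns to an earlier stage to revise.
            return {"outcome": "revise", "reason": "compliance"}
        sleep(interval_seconds)
    return {"outcome": "retry", "reason": "timeout"}
```

The `sleep` parameter is injectable so that the loop can be exercised without real waiting.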
[0103] Step S306: Deploy the digital human to a preset application interface for continuous interactive use by the target user.
[0104] Deploying to a preset application interface refers to binding the digital human's video resources and agent configuration to an application interface frequently used by the target user. This allows the target user to interact with their digital human at any time during daily use of the application after creation, not just during course learning. For example, the digital human can be deployed to the application's homepage, allowing users to converse with their digital human each time they enter the application. The digital human interacts with the user in a personalized manner based on the personality settings, opening remarks, and list of guiding questions in the agent configuration data. Through this deployment mechanism, the digital human created by the target user across the processing stages is truly integrated into their daily usage scenarios, forming a personal digital avatar that supports sustained interaction.
[0105] In addition, to form a complete learning evaluation chain, in one optional implementation, after deploying the digital human to the application interface for continuous interaction by the target user, it also includes: receiving and storing operation explanation videos recorded or uploaded by the target user; receiving voice or text comments on the operation explanation videos and displaying them to the target user.
[0106] Operation explanation videos refer to video content recorded by the target user to explain how they used the relevant tools in each processing stage of the creative process. The target user can record on a desktop device using the screen recording function of the terminal device, or upload a pre-recorded video file via a mobile device. After the operation explanation videos submitted by the target user are received, they are stored on the server for evaluators to view.
[0107] Voice or text comments refer to the feedback given by an evaluator in the form of voice or text after watching an operation explanation video. The feedback is stored and then displayed to the target user, who can view the comments and understand the direction for improvement in the creation process.
[0108] Through the above mechanism, a complete evaluation loop is formed, from the creation, deployment and use of digital humans to operation explanation and feedback review, providing effective evaluation support for the digital human generation method in application scenarios such as teaching.
[0109] In summary, the digital human generation method provided in this application interacts with the target user through an artificial intelligence interaction model and obtains text interaction data. This establishes a unified content source for subsequent generation tasks, ensuring that the visual feature information relied upon for image generation, the scene description information relied upon for video generation, and the configuration reference information relied upon for agent configuration all originate from the target user's authentic expression. This fundamentally solves the problem in related technologies where the input materials at each stage are independent and the generated content is disconnected from the user's personalized information. In the image generation stage, visual feature information is extracted from the text interaction data through a semantic analysis model and the first prompt word is displayed in a pre-filled manner. The target user can modify this to obtain the second prompt word. This mechanism lowers the threshold for prompt word writing while retaining the target user's complete control over the created content. In the video generation stage, image data is used as a reference frame input, ensuring the consistency of visual style between the video and the image and avoiding content fragmentation across stages. The rollback and redo mechanism achieves flexible adjustment of the created content by keeping the unique data identifier unchanged and overwriting the old record with new data, while maintaining the integrity of the data link.
[0110] In terms of cross-stage data management, a unique data identifier is a triplet combination of step identifier, course identifier, and semester identifier. This enables precise indexing and automatic transfer of outputs from each stage, allowing subsequent steps to access creative data from all previous stages without manual intervention from the target user during initialization. It also ensures strict isolation of data records between different users, courses, and semesters. In the digital human generation stage, the target user's real-life image and voice characteristics, along with text interaction data, image data, video data, and agent configuration data, are integrated to ensure that the final digital human visually originates from the target user and is deeply connected to the user's personalized expression throughout the creative process. After the digital human is deployed to daily application interfaces, the target user can continuously interact with it. Combined with the submission of instructional videos and feedback mechanisms, a closed loop of continuous use, evaluation, and improvement of creative outcomes is further formed, ensuring that learning outcomes retain long-term application value even after the course ends.
[0111] To better illustrate the digital human generation method of this application, a preferred embodiment will be provided below. This embodiment is intended to describe the implementation process of this application in detail, but is not intended to limit the scope of protection of this disclosure.
[0112] With the rapid popularization of artificial intelligence technology, how to systematically guide users to understand and master various types of AI tools has become an urgent problem to be solved in the field of AI education. Existing AI content generation products usually provide independent services with a single tool as the core, such as providing image generation or digital human synthesis functions alone. There is a lack of content connection and data transfer mechanisms between the tools, making it difficult for users to establish a systematic understanding that "multiple types of AI tools can be used in combination". In addition, there is a significant disconnect between the content generated by existing products and the user's own personalized information. The generated results rely on system preset parameters rather than the user's actual expression, resulting in the final output lacking personal meaning for the user, and the learning outcomes cannot be extended to daily scenarios outside the course.
[0113] In this preferred embodiment, a user-oriented progressive AI creation teaching system is used to describe in detail the complete implementation process of the method in a real-world scenario. In this system, the user (i.e., the target user, hereinafter referred to as "student") sequentially goes through five processing stages: AI dialogue, AI image generation, AI video generation, AI agent configuration, and AI digital human generation. The output of each stage serves as input material for the subsequent stages, ultimately generating a personalized digital human deeply associated with the student's personal information. This is supplemented by video check-ins and teacher comments to form a complete teaching evaluation loop.
[0114] In this embodiment, the progressive AI creative teaching system is supported by a front-end terminal device (such as a computer or mobile device used by students) and a back-end server. The front-end is responsible for presenting the interactive interface at each stage, receiving student input, and submitting data requests to the server; the server is responsible for executing the data processing and generation tasks at each stage, and returning the output data of the previous steps to the front-end through a unified getDrawProperties interface (i.e., the interface for obtaining drawing properties), thereby realizing automatic data inheritance between each step.
[0115] Figure 4 is a schematic diagram of the cross-step data inheritance chain in this progressive AI creation teaching system. As shown in Figure 4, the structured dialogue record output by the AI dialogue flows along the left branch to AI image generation, as the material basis for text-to-image prompts, and along the right branch to the AI agent, as a personalized reference for self-introduction templates; the personalized image URLs produced by AI image generation further flow to AI video generation, serving as reference frame inputs for image-to-video creation; all outputs ultimately converge on the AI digital human, where they are combined with the student's real photo and voice to generate a personalized digital human video for each student. The above chain clearly demonstrates the mandatory dependencies and transmission paths of data between each step in this solution: the output of each step is not only the result of the current stage but also the raw material for the next stage of creation.
[0116] The server uses a combination of stepId, courseId, and termId as a unique data identifier to accurately identify and persistently store each student's creative output in each course step, ensuring that data from different students, different courses, and different semesters are isolated from each other and do not interfere with each other.
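As a non-limiting illustration, the retrieval of preceding outputs keyed by the stepId, courseId, and termId combination can be sketched in Python as follows. The function below is a hypothetical stand-in that borrows the name of the getDrawProperties interface; its signature, the step identifiers in `STEP_ORDER`, and the dictionary used as storage are assumptions made for this sketch, since the application does not specify the interface's parameters or return format.

```python
from typing import Any, Dict, Tuple

# Assumed step identifiers, listed in execution order.
STEP_ORDER = ["dialogue", "image", "video", "agent", "digital_human"]

Key = Tuple[str, str, str]  # (stepId, courseId, termId)


def get_draw_properties(
    store: Dict[Key, Any],
    step_id: str,
    course_id: str,
    term_id: str,
) -> Dict[str, Any]:
    """Return the stored outputs of all steps that precede step_id."""
    earlier_steps = STEP_ORDER[: STEP_ORDER.index(step_id)]
    return {
        step: store[(step, course_id, term_id)]
        for step in earlier_steps
        if (step, course_id, term_id) in store
    }
```

Records stored for a different course or semester are never returned, because the lookup always uses the full three-part key.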
[0117] The specific workflow of this progressive AI creative teaching system includes the following steps: Step 1: AI Dialogue Phase.
[0118] Once a student enters the AI dialogue interface, the front end automatically loads a pre-made guided dialogue script. The artificial intelligence interaction model (i.e., LLM, Large Language Model) then proactively poses the first guiding question to the student, thus initiating the interaction process for this stage.
[0119] Figure 5 is a diagram illustrating the complete process from AI dialogue to AI image generation. As shown in Figure 5, the interactive process proceeds in a cyclical manner: the student inputs answers or ideas, and the LLM performs semantic analysis on the answers and generates the next guiding question based on the analysis results. This process repeats until the topic range set in the pre-made script is fully covered. The purpose of this cycle is to guide students to gradually output creative content related to the course theme, such as a student's statement "I imagine a mysterious underwater world full of treasures," together with descriptions of the main character's image and color scheme. This content constitutes the core semantic material for the subsequent image and video generation stages.
[0120] Once the LLM determines that the topic coverage is complete, the front end displays a prompt card saying "The dialogue is complete, and your creative content has been recorded," and activates the "Proceed to the next step" button. After the student actively clicks this button, the system exits the dialogue loop and transmits the complete dialogue record generated by this interaction to the server in a structured form. The record is persistently stored in the database with a triple (stepId + courseId + termId) as the key for automatic reference in subsequent steps.
[0121] Step two, AI-generated image stage.
[0122] After students enter the AI-generated image interface, a status message "X creative content items have been imported from the AI dialogue" is displayed at the top of the interface, informing students that the previous dialogue data has been automatically inherited to this stage through the getDrawProperties interface, and no manual operation is required.
[0123] Continuing to refer to Figure 5, upon entering the image generation interface, the server uses the LLM to perform summary analysis on the preceding dialogue records, extracting key elements such as scene descriptions, character traits, color styles, and emotional atmosphere. These elements are then combined into structured image generation prompts, which are pre-filled and displayed in the input boxes of the image generation interface. Students can directly view these keywords and modify, delete, or add prompts on the interface according to their creative intentions. After confirming the prompt content, students further select parameters such as image style, aspect ratio, and generation model on the interface and submit a generation request.
[0124] The system calls an image generation service (Volcano Engine Text-to-Image / Image-to-Image Service or Minimax Text-to-Image Service) to perform the image generation task, generating personalized images for students and persistently storing the image URLs on the server with the same triples as keys. Image generation supports two modes: aiType=1 (text-to-image mode, generating images based solely on text descriptions) and aiType=2 (image-to-image mode, generating images based on style transfer from reference images uploaded by students). After generation, the front end displays the generated images for students to view and confirm. If students are not satisfied with the images, they can return for regeneration. The system re-executes the generation with the same triples, without creating new identifiers, and the newly generated image overwrites the original record.
[0125] Step 3: AI-generated video stage.
[0126] Figure 6 is a schematic diagram illustrating the complete process from AI-generated video to AI intelligent agent. As shown in Figure 6, after the student confirms entry into the AI-generated video stage, the system automatically integrates the preceding inputs: it uses an LLM to extract scene and dynamic descriptions from the AI dialogue record (e.g., "deep-sea exploration," "glowing fish," "camera zooming in"), and simultaneously obtains the image URL generated and stored in step two as a video reference frame (i.e., the reference image input in img2Video mode). The system also receives the student's parameter selections for video duration, aspect ratio, and motion style on the interface. These three types of inputs are combined into video generation request parameters and submitted to the video generation service, which asynchronously generates a personalized video. Once generated, the video URL is persistently stored on the server using a triple as the key.
[0127] Video generation supports two modes, identified by the VideoGenMode parameter: txt2Video (text-to-video mode, generating video based solely on text descriptions) and img2Video (image-to-video mode, using preceding images as reference frames for auxiliary generation). The system selects the appropriate mode based on the actual input to ensure the generated video maintains visual consistency with the preceding images. After generation, the video is displayed on the front end for students to review and confirm.
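A minimal sketch of the mode selection and request assembly follows. Only the `VideoGenMode` parameter and its two values come from the description above; the function name and the other field names (`description`, `durationSeconds`, `aspectRatio`, `motionStyle`, `referenceImageUrl`) are assumptions made for illustration.

```python
from typing import Optional


def build_video_request(
    description: str,
    reference_image_url: Optional[str],
    duration_seconds: int,
    aspect_ratio: str,
    motion_style: str,
) -> dict:
    """Combine the text, image, and parameter inputs into one request payload."""
    if not description.strip():
        raise ValueError("a scene description is required in both modes")
    # Use the stored image as a reference frame whenever one exists;
    # otherwise fall back to text-only generation.
    mode = "img2Video" if reference_image_url else "txt2Video"
    request = {
        "VideoGenMode": mode,
        "description": description.strip(),
        "durationSeconds": duration_seconds,
        "aspectRatio": aspect_ratio,
        "motionStyle": motion_style,
    }
    if mode == "img2Video":
        request["referenceImageUrl"] = reference_image_url
    return request
```

Preferring img2Video whenever a stored image is available is one way to read "selects the appropriate mode based on the actual input"; the actual selection rule may differ.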
[0128] Furthermore, if students are not satisfied with the previous images at this stage, they can click "Back" to return to the AI image generation step and regenerate. The system keeps the triplet unchanged and re-executes the image generation process, with the new image overwriting the original stored records. After the student confirms their satisfaction with the new image, when they re-enter the video generation stage, the system automatically retrieves the latest image URL as a reference frame, and the overall generation process restarts from the video generation stage.
[0129] Step 4: AI agent configuration phase.
[0130] Continuing to refer to Figure 6, after the student confirms entry into the AI agent configuration phase, the server performs static mapping based on the courseId (course identifier) and returns the corresponding material package for this course. At the same time, an LLM analyzes the preceding AI dialogue records and generates personalized pre-filled self-introduction text based on the student's personal characteristics reflected in the dialogue. This text is displayed in the configuration interface for the student to refer to or use directly.
[0131] Agent configuration is carried out in two sequential phases. CREATE phase: students complete the basic attribute configuration of the agent, including the agent's name, avatar, and self-introduction. SETTING phase: students configure the agent's interactive behaviors, including personality traits, opening remarks for the first round of dialogue, a list of guiding questions, and a personalized background image. The system automatically saves the configuration every two minutes during this phase, allowing students to resume configuration from where they left off (i.e., breakpoint resume configuration) to ensure the reliability of the configuration process.
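The two-phase configuration with periodic saving and resumption might look like the sketch below. The class names, field names, and the requirement that all CREATE fields be filled before moving on are assumptions. Timestamps are passed in explicitly so the two-minute interval can be exercised without a real timer, and the sketch does not restrict autosaving to a particular phase.

```python
import copy
from dataclasses import asdict, dataclass, field
from typing import Optional

AUTOSAVE_INTERVAL_S = 120  # "every two minutes", per the description


@dataclass
class AgentDraft:
    phase: str = "CREATE"
    # CREATE phase attributes
    name: str = ""
    avatar: str = ""
    self_introduction: str = ""
    # SETTING phase attributes
    personality: str = ""
    opening_remarks: str = ""
    guiding_questions: list[str] = field(default_factory=list)
    background_image: str = ""


class DraftSession:
    """Holds the draft being edited plus the last autosaved snapshot."""

    def __init__(self, start_time: float) -> None:
        self.draft = AgentDraft()
        self._last_saved_at = start_time
        self._snapshot: Optional[dict] = None

    def advance_to_setting(self) -> None:
        d = self.draft
        if not (d.name and d.avatar and d.self_introduction):
            raise ValueError("CREATE phase attributes are incomplete")
        d.phase = "SETTING"

    def maybe_autosave(self, now: float) -> bool:
        """Snapshot the draft if at least one interval has elapsed."""
        if now - self._last_saved_at < AUTOSAVE_INTERVAL_S:
            return False
        self._snapshot = asdict(self.draft)
        self._last_saved_at = now
        return True

    def resume(self) -> AgentDraft:
        """Rebuild the draft from the last snapshot (breakpoint resume)."""
        if self._snapshot is None:
            return AgentDraft()
        return AgentDraft(**copy.deepcopy(self._snapshot))
```

Edits made after the last snapshot are lost on resume in this sketch, which is the usual trade-off of interval-based saving.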
[0132] After both phases are completed, the student submits the complete configuration to the server interface (/classroom/agent/save). The server returns a globally unique agentId (agent identifier) and persistently stores the complete agent configuration data in the database. Once the configuration is complete, the front end switches to the CHAT page (dialogue page), where students can immediately begin interacting with the agent they created to verify the configuration effect.
[0133] Step 5: AI Digital Human Generation Stage.
[0134] Figure 7 is a schematic diagram illustrating the complete process from AI digital human to video check-in. As shown in Figure 7, after students enter the AI digital human step, the system first displays a secondary confirmation pop-up for privacy authorization. Students can proceed to the next step only after they explicitly confirm the authorization; if students refuse to authorize, they cannot continue using this function.
[0135] Step a, Upload a profile photo. Students upload a real photo of themselves, and the system initiates a multi-level security review process for the uploaded photo. If the review is successful, the system obtains the corresponding image ID for the photo, which is used as the image input parameter for the digital human generation task; if the review fails, the system prompts the student to re-upload a compliant photo that meets the requirements.
[0136] Step b, Sound Acquisition. The system provides two sound acquisition methods. One method allows students to record online using their terminal device's microphone, and the system obtains the corresponding voiceUrl (sound resource address) for the recording file. The other method allows students to input text to be read aloud and select an AI voice. After sensitive word detection, the system synthesizes the sound using TTS (text-to-speech) technology. If the sensitive word detection fails, the student is prompted to modify the input text and resubmit. Both methods use the student's own voice as the source.
[0137] Step c, submit the digital human generation task. The system will combine the image ID obtained in step a, the voice data collected in step b, the generated text interaction data (dialogue records), image data (image URLs), video data (video URLs), and agent configuration data (agentId) into complete digital human generation task parameters, and submit them to the digital human generation service, which will then execute the generation task asynchronously.
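As a rough sketch of how the task parameters could be assembled, the example below gathers the six inputs listed above and refuses to build a task if any upstream output is missing. The `DigitalHumanTask` type, its field names, and the validation rule are assumptions introduced for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DigitalHumanTask:
    image_id: str            # from step a (reviewed photo)
    voice_url: str           # from step b (recording or synthesized audio)
    dialogue_records: tuple  # text interaction data
    image_url: str           # image data
    video_url: str           # video data
    agent_id: str            # agent configuration data


def build_task(**parts) -> DigitalHumanTask:
    """Assemble the task, rejecting it if any preceding stage output is absent."""
    missing = [f.name for f in fields(DigitalHumanTask) if not parts.get(f.name)]
    if missing:
        raise ValueError("missing upstream data: " + ", ".join(missing))
    return DigitalHumanTask(**parts)
```

Checking every field before submission reflects the idea that each stage depends on the outputs of the stages before it.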
[0138] Figure 8 is a schematic diagram illustrating the state transitions of a digital human generation task. As shown in Figure 8, after a task is submitted, it passes through three states in sequence: UNSTART, PROCESSING, and GENERATING. It then enters a final state, which falls into one of the following categories. If the task completes successfully, the state transitions to SUCCESS, and the server returns the videoUrl (video resource address) of the digital human video. The system then deploys the digital human video to the application homepage. If the task encounters a retryable exception such as FAILED, QUEUE_TIMEOUT, GENERATE_TIMEOUT, or LIMIT, the front end displays a "Regenerate" button, allowing students to resubmit the task with the same parameters. If the task is rejected for compliance reasons such as VENDOR_RISK_REJECT or MANUAL_RISK_REJECT, the system guides students to return to the corresponding steps (such as re-uploading a photo or re-recording) to modify the materials and resubmit.
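The handling of task states can be summarized in a small classification sketch. The state names are those listed above; the `next_action` function and its return labels ("poll", "deploy", "regenerate", "revise_materials") are hypothetical names chosen for illustration.

```python
from enum import Enum


class TaskState(Enum):
    UNSTART = "UNSTART"
    PROCESSING = "PROCESSING"
    GENERATING = "GENERATING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    QUEUE_TIMEOUT = "QUEUE_TIMEOUT"
    GENERATE_TIMEOUT = "GENERATE_TIMEOUT"
    LIMIT = "LIMIT"
    VENDOR_RISK_REJECT = "VENDOR_RISK_REJECT"
    MANUAL_RISK_REJECT = "MANUAL_RISK_REJECT"


IN_PROGRESS = frozenset({TaskState.UNSTART, TaskState.PROCESSING, TaskState.GENERATING})
RETRYABLE = frozenset({
    TaskState.FAILED, TaskState.QUEUE_TIMEOUT,
    TaskState.GENERATE_TIMEOUT, TaskState.LIMIT,
})
COMPLIANCE_REJECTED = frozenset({TaskState.VENDOR_RISK_REJECT, TaskState.MANUAL_RISK_REJECT})


def next_action(state: TaskState) -> str:
    """Map a task state to what the front end should do next."""
    if state in IN_PROGRESS:
        return "poll"               # task still running; keep checking
    if state is TaskState.SUCCESS:
        return "deploy"             # show videoUrl and deploy to the homepage
    if state in RETRYABLE:
        return "regenerate"         # resubmit with the same parameters
    if state in COMPLIANCE_REJECTED:
        return "revise_materials"   # return to photo upload or recording
    raise ValueError(f"unhandled state: {state}")
```

Separating retryable failures from compliance rejections matters because only the former can be resubmitted unchanged; the latter require new materials.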
[0139] After the digital human video is successfully generated, the system persistently stores the video URL on the server with a triple as the key. The digital human video is also saved to the student's personal portfolio and deployed to the application homepage. Students can interact with their digital human at any time when using the application.
[0140] This system also includes a closed loop for video check-in and teaching evaluation.
[0141] Figure 9 is a schematic diagram of the closed-loop teaching evaluation system in this implementation. As shown in Figure 9, after students complete the creation of their AI digital humans, they enter the video check-in stage: students record or upload a video explaining their AI operation, detailing the complete process of using AI tools for creation at each stage. Recording supports both desktop screen recording and mobile video upload, with both ends working collaboratively via bridged communication to adapt to different device usage scenarios. After the video is submitted to the server and saved, teachers can view the video content through the platform and provide voice or text feedback; students can view the feedback on the platform to understand areas for improvement in their creative process. After the digital human is deployed on the application's homepage, students can continue to interact with their AI counterpart in daily scenarios; for further iteration and optimization, they can return to the first step and start a new round of the creative process. Thus, a complete teaching evaluation loop is formed, from AI tool learning, digital human creation, and deployment and use to operation demonstration and teacher evaluation.
[0142] In summary, this application's embodiments link five processing stages—AI dialogue, AI image generation, AI video generation, AI agent configuration, and AI digital human generation—into a complete chain with mandatory data dependencies. A triplet of step identifiers, course identifiers, and semester identifiers is used as an index to achieve persistent storage and automatic cross-stage transfer of output from each stage. This allows each subsequent step to access the content created in all previous stages without manual user intervention, fundamentally solving the problems of content fragmentation and data isolation between different types of AI tools. Combined with video check-in and teacher feedback mechanisms, students receive targeted feedback after completing their AI tool creations, forming a complete closed loop from creation to evaluation. The digital human is permanently installed on the application's homepage for continuous daily use, extending learning outcomes beyond course boundaries to everyday scenarios, effectively solving the problem in traditional teaching products where learning outcomes become invalid after the course ends.
[0143] This embodiment also provides a digital human generation device for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and/or hardware that performs a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
[0144] This embodiment provides a digital human generation device which, as shown in Figure 10, includes: an acquisition module 1001, used to interact with the target user through an artificial intelligence interaction model and acquire text interaction data; an image generation module 1002, used to analyze the text interaction data to extract visual feature information and generate image data based on the visual feature information; a video generation module 1003, used to generate video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data; an intelligent agent configuration module 1004, used to extract configuration reference information from the text interaction data and, in combination with the target user's setting instructions, construct the intelligent agent's configuration data; and a generation module 1005, used to acquire the real feature data of the target user, and to aggregate and fuse the real feature data with the text interaction data, image data, video data, and intelligent agent configuration data to generate a digital human corresponding to the target user.
[0145] In one optional implementation, acquiring text interaction data, generating image data, generating video data, and constructing configuration data for the intelligent agent are each one of several processing stages executed sequentially; the acquisition module 1001 is further configured to: after the corresponding data is acquired, generated, or constructed in any current processing stage, associate the data obtained in the current processing stage with a unique data identifier and store it, wherein the unique data identifier includes a combination of a step identifier, a course identifier, and a semester identifier; and when executing the next processing stage, retrieve, based on the unique data identifier, the data associated and stored in the previous processing stage, and use it as the data input for the next processing stage.
[0146] In one optional implementation, the image generation module 1002 is configured to: extract features from the text interaction data using a semantic analysis model, with character features and color style elements extracted as the visual feature information; combine the visual feature information into a first prompt word and pre-fill and display it in the interface; receive the target user's instruction to modify the first prompt word, and modify the first prompt word based on the instruction to obtain a second prompt word; and generate image data using an image generation model based on the second prompt word.
[0147] In one optional implementation, the video generation module 1003 is configured to: parse the text interaction data using a semantic analysis model to extract the environmental scene as the scene description information; use the image data as a reference frame, combined with the scene description information and the parameter settings received from the target user regarding video duration, aspect ratio, or motion style, to form the video generation request parameters; and generate video data based on the video generation request parameters.
[0148] In one optional implementation, the intelligent agent configuration module 1004 is configured to: analyze the text interaction data using a semantic analysis model to generate personalized self-introduction text, which is pre-filled and displayed as the configuration reference information; and receive basic configuration instructions from the target user to determine the name and avatar of the agent, receive interaction setting instructions to determine the personality settings, the opening remarks for the first round of dialogue, and the list of guiding questions, and integrate them to generate the agent configuration data.
[0149] In one optional implementation, the generation module 1005 is configured to: obtain facial image data uploaded by the target user and extract image features from the facial image data; obtain voice features by acquiring an online recording of the target user, or by receiving input text to be read aloud and synthesizing it through text-to-speech conversion; and combine the image features and the voice features to form the real feature data.
[0150] The digital human generation apparatus provided in this application can execute the digital human generation method provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the method. Further functional descriptions of the various modules and units described above are the same as in the corresponding embodiments described above, and will not be repeated here.
[0151] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0152] Referring specifically to Figure 11, the diagram illustrates a structure suitable for implementing the electronic device described in the embodiments of this application. The electronic device may include a processor (e.g., a central processing unit, graphics processor, etc.) 1101, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 1102 or a program loaded from memory device 1108 into random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the electronic device. The processor 1101, ROM 1102, and RAM 1103 are interconnected via a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
[0153] Typically, the following devices can be connected to the I/O interface 1105: input devices 1106 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 1107 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; memory devices 1108 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1109. The communication device 1109 allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 11 shows an electronic device with various devices, it should be understood that not all of the devices shown are required to be implemented or present; more or fewer devices may alternatively be implemented or present.
[0154] Specifically, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 1109, or installed from memory device 1108, or installed from ROM 1102. When the computer program is executed by processor 1101, it performs the functions defined in the digital human generation method of embodiments of this application.
[0155] The electronic device shown in Figure 11 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.
[0156] This application also provides a computer-readable storage medium. The methods described in this application can be implemented in hardware or firmware, or implemented as recordable on a storage medium, or implemented as computer code downloaded over a network and originally stored on a remote storage medium or a non-transitory machine-readable storage medium and then stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that computers, processors, microprocessor controllers, or programmable hardware include storage components capable of storing or receiving software or computer code. When the software or computer code is accessed and executed by the computer, processor, or hardware, the digital human generation method shown in the above embodiments is implemented.
[0157] A portion of this application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and / or technical solutions according to this application through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, etc. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.
[0158] Although embodiments of this application have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of this application, and all such modifications and variations fall within the scope defined by the appended claims.
Claims
1. A method for generating a digital human, characterized in that the method includes: interacting with a target user through an artificial intelligence interaction model to obtain text interaction data; analyzing the text interaction data to extract visual feature information, and generating image data based on the visual feature information; generating video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data; extracting configuration reference information from the text interaction data and, in combination with the setting instructions of the target user, constructing the configuration data of an intelligent agent; and acquiring the real feature data of the target user, and aggregating and fusing the real feature data with the text interaction data, the image data, the video data, and the intelligent agent configuration data to generate a digital human corresponding to the target user.
2. The method according to claim 1, characterized in that acquiring the text interaction data, generating the image data, generating the video data, and constructing the configuration data of the intelligent agent are each one of multiple processing stages executed sequentially; the method further includes: after the corresponding data is acquired, generated, or constructed in any current processing stage, associating the data obtained in the current processing stage with a unique data identifier and storing it, wherein the unique data identifier includes a combination of a step identifier, a course identifier, and a semester identifier; and when executing the next processing stage, obtaining, based on the unique data identifier, the data associated and stored in the previous processing stage, and using it as the data input for the next processing stage.
3. The method according to claim 1, characterized in that the step of analyzing the text interaction data to extract visual feature information and generating image data based on the visual feature information includes: extracting features from the text interaction data through a semantic analysis model, with character features and color style elements extracted as the visual feature information; combining the visual feature information into a first prompt word, which is pre-filled and displayed in the interface; receiving the target user's modification instruction for the first prompt word, and modifying the first prompt word based on the modification instruction to obtain a second prompt word; and generating the image data based on the second prompt word using an image generation model.
4. The method according to claim 1, characterized in that the step of generating video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data includes: parsing the text interaction data using a semantic analysis model to extract the environmental scene as the scene description information; using the image data as a reference frame, combined with the scene description information and the parameter settings received from the target user regarding video duration, aspect ratio, or motion style, to form video generation request parameters; and generating the video data based on the video generation request parameters.
5. The method according to claim 1, characterized in that the step of extracting configuration reference information from the text interaction data and constructing the configuration data of the intelligent agent in combination with the target user's setting instructions includes: analyzing the text interaction data using a semantic analysis model to generate personalized self-introduction text, which is pre-filled and displayed as the configuration reference information; and receiving basic configuration instructions from the target user to determine the name and avatar of the intelligent agent, receiving interaction setting instructions to determine personality settings, opening remarks for the first round of dialogue, and a list of guiding questions, and merging these to generate the intelligent agent configuration data.
6. The method according to claim 1, characterized in that acquiring the real feature data of the target user includes: obtaining facial image data uploaded by the target user, and extracting image features from the facial image data; obtaining voice features by acquiring an online recording of the target user, or by receiving input text to be read aloud and synthesizing it via text-to-speech; and combining the image features and the voice features to form the real feature data.
7. A digital human generation device, characterized in that the device includes: an acquisition module, used to interact with a target user through an artificial intelligence interaction model and acquire text interaction data; an image generation module, used to analyze the text interaction data to extract visual feature information and generate image data based on the visual feature information; a video generation module, used to generate video data by using the image data as a visual reference and combining it with scene description information extracted from the text interaction data; an intelligent agent configuration module, used to extract configuration reference information from the text interaction data and, in combination with the setting instructions of the target user, construct the configuration data of an intelligent agent; and a generation module, used to acquire the real feature data of the target user, and to aggregate and fuse the real feature data with the text interaction data, the image data, the video data, and the intelligent agent configuration data to generate a digital human corresponding to the target user.
8. An electronic device, characterized in that it includes: a memory and a processor communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the digital human generation method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a computer to perform the digital human generation method according to any one of claims 1 to 6.
10. A computer program product, characterized in that it includes computer instructions for causing a computer to perform the digital human generation method according to any one of claims 1 to 6.