Method and related products for digital avatar perceiving self-appearance, clothing and scene

By verifying and normalizing the legality of digital avatar content and combining it with a multimodal analysis model, the system automatically constructs and stores the persona data of digital avatars, solving the problems of complexity and inconsistency in persona construction in existing technologies, and achieving efficient and accurate digital avatar image management.

CN121745149B · Active · Publication Date: 2026-06-30 · LIANGSHENG DIGITAL CREATIVE DESIGN (HANGZHOU) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: LIANGSHENG DIGITAL CREATIVE DESIGN (HANGZHOU) CO LTD
Filing Date: 2026-02-28
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing digital avatar technology, the persona construction process suffers from problems such as complex adaptation of multiple content formats, reliance on manual intervention, unstructured results, and weak correlation, resulting in low analysis efficiency and inconsistent digital avatar appearances.

Method used

By acquiring digital clone content, performing legality verification and normalization, constructing structured prompts and constraints using a multimodal analysis model, and automatically analyzing and storing the set data, the system achieves automated construction and strong binding of appearance, clothing, and scene.

Benefits of technology

It enables unified processing of multi-format content, and the automated process improves the efficiency and accuracy of persona creation, ensures the consistency and security of digital avatar images, and reduces development and maintenance costs.



Abstract

This invention discloses a method and related products for a digital clone to perceive its own appearance, clothing, and scene. The method includes: acquiring digital clone content associated with a unique digital clone identifier; performing legality verification and normalization processing on the digital clone content to obtain standardized materials; constructing structured prompts and structured constraints based on a preset set of appearance-clothing-scene fields, and inferring from the standardized materials using a multimodal analysis model to obtain candidate structured results; performing syntax verification, field verification, and value range consistency verification on the candidate structured results to obtain a target structured result; and storing the target structured result in association with the digital clone's unique identifier in an appearance-clothing-scene setting library, and writing a version number and update timestamp. Through the above technical solution, unified processing of multi-format input and field-level structured output are achieved, reducing the uncertainty caused by manual intervention and secondary parsing.

Description

Technical Field

[0001] This invention relates to the fields of computer vision, multimodal machine learning, and virtual digital avatar technology, specifically to a method and related products for a digital avatar to perceive its own appearance, clothing, and scene.

Background Technology

[0002] With the rapid development of digital avatar technology, digital clones, as a core application form of personalized digital avatars (also known as virtual avatars), have been widely used in social, office, and entertainment scenarios. The persona information of a digital clone is a key factor in its personalization and image consistency and directly affects the user's interactive experience. Here, persona information refers to the character settings, including physical characteristics, clothing style, and the scene in which the character appears.

[0003] In existing technologies, the creation of digital avatar personas faces several problems. First, adapting to the various content input formats is complex. User-uploaded or system-generated digital avatar content comes in diverse forms, including single images, multiple images, and videos. Different formats typically require separate analysis processes, resulting in high development and maintenance costs and low analysis efficiency. Second, the persona creation process relies on manual intervention. Most existing systems require users to manually provide persona information (e.g., appearance description, clothing type, scene description) or manually select analysis materials, which makes the process cumbersome and error-prone. Furthermore, user descriptions may be inaccurate, leading to discrepancies between the recorded persona information and the actual appearance of the digital avatar. Third, persona analysis results are poorly structured. Existing multimodal analysis models often return unstructured natural language descriptions, which must go through secondary parsing before they can be stored in the appearance-clothing-scene setting library. This increases the risk of parsing errors and prevents the results from being used directly for subsequent content generation or real-time interactive calls for the digital avatar. Finally, the connection between persona information and the digital clone is weak. Existing solutions lack a strong binding mechanism between the two, so the wrong persona preset is easily called when generating digital clone content or interacting with the digital clone, resulting in inconsistencies in the digital clone's appearance over time.

Summary of the Invention

[0004] This invention provides a method and related products for a digital clone to perceive its own appearance, clothing and scene, in order to overcome the shortcomings of digital clones in the process of character creation, such as complex multi-format content adaptation, reliance on manual intervention, unstructured results and weak correlation.

[0005] A method for a digital avatar to perceive its own appearance, clothing, and environment includes:

[0006] Obtain the digital clone content associated with the unique identifier of the digital clone, the digital clone content including original content in multiple modalities, the original content including at least visual content, and also including at least one of audio content and text content;

[0007] The digital clone content is subjected to legality verification and normalization processing to determine standardized materials. The standardized materials include at least a standard image determined based on the visual content, and at least one of standard audio determined based on the audio content and standard text determined based on the text content.

[0008] Based on a preset set of appearance-clothing-scene fields, structured prompts and structured constraints are constructed. A multimodal analysis model is called to reason about the standardized materials to determine candidate structured results that represent the set data. The candidate structured results are matched with the structured prompts and structured constraints.

[0009] Perform syntax validation, field validation, and value range consistency validation on the candidate structured results to determine the target structured results that pass the validation;

[0010] The target structured result is associated with the unique identifier of the digital clone and stored in the appearance-clothing-scene setting library, and a version number and update timestamp are written to the target structured result;

[0011] After detecting that the target structured result has been successfully updated, the temporary files generated during the normalization process are either securely deleted or encrypted and then retained; if the target structured result update fails or the process is abnormally terminated, the temporary files are retained; and log summary information of this construction is recorded.

[0012] A system for a digital avatar to perceive its own appearance, clothing, and environment includes:

[0013] The content acquisition module is used to acquire digital clone content associated with the unique identifier of the digital clone. The digital clone content includes original content in multiple modalities. The original content includes at least visual content, and also includes at least one of audio content and text content.

[0014] The normalization module is used to perform legality verification and normalization processing on the digital clone content, and determine standardized materials. The standardized materials include at least a standard image determined based on the visual content, and at least one of standard audio determined based on the audio content and standard text determined based on the text content.

[0015] The multimodal reasoning module is used to construct structured prompts and structured constraints based on a preset set of appearance-clothing-scene fields, call a multimodal analysis model to reason about the standardized materials, and determine candidate structured results that represent the set data. The candidate structured results are matched with the structured prompts and structured constraints.

[0016] The verification and completion module is used to perform syntax verification, field verification, and value range consistency verification on the candidate structured results, and determine the target structured result that passes the verification.

[0017] The versioned storage module is used to associate the target structured result with the unique identifier of the digital clone and store it in the appearance-clothing-scene setting library, and write a version number and update timestamp to the target structured result;

[0018] The security cleanup module is used to securely delete or encrypt and retain temporary files generated during the normalization process after detecting that the target structured result has been successfully updated; if the target structured result update fails or the process is abnormally terminated, the temporary files will not be deleted; and log summary information of this construction will be recorded.

[0019] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the aforementioned method for a digital clone to perceive its own appearance, clothing, and scene.

[0020] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned method for a digital clone to perceive its own appearance, clothing, and scene.

[0021] This invention provides a method and related products for digital avatars to perceive their own appearance, clothing, and environment. It can uniformly normalize digital avatar content in various formats, automatically identifying, analyzing, and storing the avatar's appearance features, clothing style, and environment information without human intervention. Furthermore, it strongly binds the generated avatar preset to the digital avatar's unique identifier. Through this method, the invention significantly improves the efficiency and accuracy of avatar construction, ensuring consistency in the use of digital avatar images.

Attached Figure Description

[0022] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart of a method for a digital clone to perceive its own appearance, clothing, and scene in an embodiment of the present invention;

[0024] Figure 2 is a flowchart of step S102 in Figure 1;

[0025] Figure 3 is a flowchart of step S103 in Figure 1;

[0026] Figure 4 is a flowchart of step S302 in Figure 3;

[0027] Figure 5 is a flowchart of step S104 in Figure 1;

[0028] Figure 6 is a flowchart of step S105 in Figure 1;

[0029] Figure 7 is a schematic diagram of a system for a digital clone to perceive its own appearance, clothing, and scene in an embodiment of the present invention.

Detailed Implementation

[0030] To make the technical problems solved, the technical solutions, and the beneficial effects of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0031] This invention provides a method for a digital avatar to perceive its own appearance, clothing, and environment, which enables the digital avatar to perceive these attributes and construct a persona from them. The method is applicable to mobile phones, computers, or other electronic devices. As shown in Figure 1, the method includes:

[0032] S101: Obtain the digital clone content associated with the unique identifier of the digital clone. The digital clone content includes original content in multiple modalities. The original content includes at least visual content, and also includes at least one of audio content and text content.

[0033] S102: Perform legality verification and normalization processing on the digital clone content, and determine standardized materials. The standardized materials include at least a standard image determined based on visual content, and at least one of standard audio determined based on audio content and standard text determined based on text content.

[0034] S103: Based on the preset set of appearance-clothing-scene fields, construct structured prompts and structured constraints, and call the multimodal analysis model to reason about the standardized materials and determine the candidate structured results that represent the setting data, where the candidate structured results match the structured prompts and structured constraints;

[0035] S104: Perform syntax validation, field validation, and value range consistency validation on the candidate structured results to determine the target structured results that pass the validation;

[0036] S105: Associate the target structured result with the unique identifier of the digital clone and store it in the appearance-clothing-scene setting library, and write the version number and update timestamp to the target structured result;

[0037] S106: After detecting that the target structured result has been successfully updated, perform secure deletion or encryption and retention of temporary files generated during the normalization process; when detecting that the target structured result has failed to be updated or the process has been abnormally terminated, retain the temporary files; and record the log summary information of this construction.

[0038] Digital avatars (also known as digital doubles or digital doppelgängers) are virtual agents created in digital space through high-precision digital replication of a specific real individual using technologies such as computer graphics, artificial intelligence, motion capture, and audio synthesis. They are digital doppelgängers with a physical appearance, the ability to speak, and the capacity to interact. A unique identifier for a digital avatar is used to uniquely identify that avatar. Digital avatar content refers to content products produced using various digital avatars as the main body.

[0039] In this embodiment, the "appearance-clothing-scene field set" refers to a predefined set of structured fields for setting data for digital avatars, used to limit the output boundaries of the construction results and ensure consistency with database fields. The appearance field includes at least one or more of face shape, facial features, default expression, hairstyle, and makeup; the clothing field includes at least one or more of clothing category, style, main color, and design details; and the scene field includes at least one or more of scene type, scene features, and atmosphere. In an optional embodiment, the field set can also be expanded to include voice and text fields, but the expanded fields do not change the constraint relationship of the appearance-clothing-scene field as a mandatory core field.

[0040] As an example, in step S101, the electronic device can receive a build request triggered by the user and, in response to the build request, obtain the unique identifier and content of the digital clone input by the user. Alternatively, the electronic device can monitor a pre-set content generation model in real time, and automatically trigger a build request process when it detects that the content generation model has completed generating the digital clone content corresponding to a certain digital clone, in order to obtain parameters such as the unique identifier and storage path of the digital clone content. In this example, the digital clone content obtained by the electronic device includes at least visual content such as a single image, multiple images, or videos, so as to perform appearance and clothing analysis based on visual content. It may also include at least one of audio content and text content, so that it can combine visual content for scene analysis.

[0041] As an example, in step S102, after acquiring the original content in multiple modalities, the electronic device applies the legality verification rules for each modality to the original content of that modality and keeps only the original content that passes. It then applies the normalization rules for each modality to the content that passed verification, producing the standard content for that modality. The standardized materials that are finally input into the multimodal analysis model are determined from the standard content of all modalities. Converting the original content of every modality into standard content ensures that the subsequent multimodal analysis is feasible. In this example, the standardized materials include at least a standard image determined based on visual content, and at least one of standard audio determined based on audio content and standard text determined based on text content. These standardized materials can be input into the multimodal analysis model to determine the corresponding appearance, clothing, scene, and other setting data.

[0042] The Appearance-Clothing-Scene field set is a pre-defined set of fields that control the output of the multimodal analysis model. Specifically, it needs to include one or more fields corresponding to the attributes of Appearance, Clothing, and Scene. Structured prompts are standardized prompts generated based on the pre-defined Appearance-Clothing-Scene field set, used to instruct the multimodal analysis model to output analysis results containing all target fields. Structured constraints are mainly used to ensure that the candidate structured results output by the multimodal analysis model conform to preset format and content requirements. As an example, structured constraints can be, but are not limited to, JSON Schema constraints. JSON Schema constraints are a standardized mechanism for defining and validating JSON data structures. It uses a set of JSON format rules to precisely define and enforce the structure, data types, field value ranges, required attributes, and optional attributes of JSON data.
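The relationship between the field set and a JSON Schema constraint can be illustrated with a minimal sketch. The three field names, the enumerated values, and the helper `build_schema` are assumptions made for illustration; the patent text does not fix a concrete schema.

```python
import json

# Hypothetical field set: names and enumerated values are illustrative only.
FIELD_SET = {
    "appearance_face_shape": {"type": "string", "enum": ["oval", "round"]},
    "clothing_style": {"type": "string"},
    "scene_type": {"type": "string"},
}


def build_schema(field_set):
    """Build a JSON Schema object that requires every field in field_set
    and rejects any field outside it."""
    return {
        "type": "object",
        "properties": dict(field_set),
        "required": sorted(field_set),
        "additionalProperties": False,
    }


schema = build_schema(FIELD_SET)
print(json.dumps(schema, indent=2))
```

Deriving the schema from the same field set that drives the prompt keeps the two in step, so a field added to the set becomes both requested and required.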

[0043] As an example, in step S103, the electronic device can construct corresponding structured prompts and structured constraints based on a preset set of appearance-clothing-scene fields. The structured prompts guide the model to output in a specified format, explicitly requiring the model to generate JSON results containing the preset fields (such as appearance, clothing, and scene) and avoiding redundant information. For example, the structured prompts might explicitly instruct: "Please output fields such as appearance_face_shape and clothing_style in pure JSON format." Structured constraints are used to verify the compliance of the model's output. By defining rules such as field types, value ranges, and required attributes, they ensure that the results meet the storage requirements of the appearance-clothing-scene setting library. For example, they verify whether the appearance_face_shape field belongs to a preset enumerated value (oval face, round face, etc.). Next, the structured prompts and structured constraints are injected into the multimodal analysis model, enabling the model to perform multimodal analysis on the standardized materials and output candidate structured results that match the structured prompts and structured constraints. As a result, the candidate structured results correspond one-to-one with the preset appearance-clothing-scene field set, which avoids redundant information and ensures the accuracy and efficiency of subsequent data processing.
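As a rough illustration of how a structured prompt could be generated from the field set, consider the sketch below. The prompt wording, the field set contents, and the function `build_prompt` are assumptions; only the field names `appearance_face_shape` and `clothing_style` and the oval/round enumeration come from the text above.

```python
# Hypothetical field set; a missing "enum" key means free-form text is accepted.
FIELD_SET = {
    "appearance_face_shape": {"enum": ["oval", "round"]},
    "clothing_style": {},
    "scene_type": {},
}


def build_prompt(field_set):
    """Render one instruction line per field, listing allowed values where defined."""
    lines = [
        "Output a single JSON object and nothing else.",
        "Include exactly these fields:",
    ]
    for name, spec in field_set.items():
        allowed = spec.get("enum")
        hint = f" (one of: {', '.join(allowed)})" if allowed else ""
        lines.append(f"- {name}{hint}")
    return "\n".join(lines)


prompt = build_prompt(FIELD_SET)
print(prompt)
```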

[0044] As an example, in step S104, the electronic device needs to perform syntax validation on the candidate structured results. Specifically, this verifies whether the candidate structured results are in a valid format, including checking syntax rules such as bracket matching, quotation mark usage, and comma separation. For example, it checks for missing closing brackets, strings not enclosed in double quotes, and missing commas between key-value pairs to filter out incorrectly formatted data and ensure that the data type conforms to structured constraints. If the syntax validation passes, field validation is then performed on the candidate structured results. Field validation specifically verifies that the candidate structured results output by the multimodal analysis model conform to the preset appearance-clothing-scene field set. This verifies the completeness, format, and value range validity of the fields, ensuring the standardization and usability of the persona data. After the field validation passes, value range consistency validation is also performed on the candidate structured results to ensure that the candidate structured results output by the multimodal analysis model are consistent in logic and business rules. By verifying the rationality, relevance, and compliance of field values, the accuracy and usability of the persona data are ensured. In this example, only when syntax validation, field validation, and value range consistency validation all pass will the candidate structured result be determined as the validated target structured result.
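The three validation stages can be sketched as a single function that stops at the first failing stage. This is a simplified illustration under stated assumptions: the field names and permitted values are hypothetical, and cross-field business rules, which the text also places under value range consistency validation, are left out.

```python
import json

# Hypothetical field set: None means any non-empty string is accepted.
ALLOWED_VALUES = {
    "appearance_face_shape": {"oval", "round"},
    "clothing_style": None,
    "scene_type": {"indoor", "outdoor"},
}


def validate(raw):
    """Return (result, errors); result is None unless all three stages pass."""
    # Stage 1, syntax: the raw model output must parse as a JSON object.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"syntax: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["syntax: top-level value is not an object"]

    # Stage 2, fields: every expected field is present, no unknown field
    # appears, and each value is a non-empty string.
    errors = [f"field: missing {k}" for k in ALLOWED_VALUES if k not in data]
    errors += [f"field: unexpected {k}" for k in data if k not in ALLOWED_VALUES]
    errors += [
        f"field: {k} must be a non-empty string"
        for k in ALLOWED_VALUES
        if k in data and not (isinstance(data[k], str) and data[k])
    ]
    if errors:
        return None, errors

    # Stage 3, value range: enumerated fields must use a permitted value.
    errors = [
        f"range: {k}={data[k]!r} not permitted"
        for k, allowed in ALLOWED_VALUES.items()
        if allowed is not None and data[k] not in allowed
    ]
    return (None, errors) if errors else (data, [])
```

Returning the error list rather than raising lets a caller decide whether to retry, log, or abort.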

[0045] As an example, in step S105, after determining the target structured result, the electronic device can associate the target structured result with the unique identifier of the digital clone and store it in the appearance-clothing-scene setting library. This library records the appearance, clothing, and scene settings corresponding to the digital clone. Associating the target structured result with the unique identifier of the digital clone allows for quick retrieval of the corresponding target structured result based on that identifier. In this example, when storing the target structured result with the unique identifier of the digital clone in the appearance-clothing-scene setting library, the version number and update timestamp corresponding to the target structured result are also written. This allows for subsequent updates of the appearance, clothing, and scene settings of the digital clone corresponding to the unique identifier based on the version number and update timestamp, improving update speed and ensuring update effectiveness.
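A minimal in-memory sketch of versioned storage keyed by the unique identifier is shown below. The class `SettingLibrary`, its method names, the integer version scheme, and the ISO 8601 UTC timestamp are all assumptions; the text only requires that a version number and update timestamp be written alongside the result.

```python
from datetime import datetime, timezone


class SettingLibrary:
    """Toy stand-in for the appearance-clothing-scene setting library."""

    def __init__(self):
        self._records = {}  # clone_id -> list of records, oldest first

    def save(self, clone_id, result):
        """Append a new version for clone_id and return the stored record."""
        history = self._records.setdefault(clone_id, [])
        record = {
            "clone_id": clone_id,
            "version": len(history) + 1,
            "updated_at": datetime.now(timezone.utc).isoformat(),
            "data": dict(result),
        }
        history.append(record)
        return record

    def latest(self, clone_id):
        """Return the newest record for clone_id, or None if none exists."""
        history = self._records.get(clone_id)
        return history[-1] if history else None


library = SettingLibrary()
library.save("clone-001", {"clothing_style": "casual"})
library.save("clone-001", {"clothing_style": "formal"})
print(library.latest("clone-001"))
```

Keeping earlier versions instead of overwriting them is one way to make rollback and auditing possible.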

[0046] As an example, in step S106, after detecting that the target structured result has been successfully written to the database, the electronic device immediately securely deletes, or encrypts and retains, the standard images, standard audio, standard text, and other temporary files stored during the normalization phase. This frees up storage space and prevents the information leakage or interference with subsequent processing that undeleted temporary files could cause. If the electronic device detects that the write to the database has failed or that the process has stopped abnormally, it may keep the temporary files so that developers can use them to troubleshoot the problem. The electronic device must also record log summary information for this build, including the unique identifier of the digital clone, the version number, the update timestamp, the build result, the quality score of the standard image used, and any abnormal situations that occurred during the build. This provides a basis for subsequent auditing and problem investigation and keeps the system clean and secure during long-term operation.
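The cleanup rule (delete on success, retain on failure, always log a summary) might look like the sketch below. It is an illustration only: `os.remove` performs ordinary deletion rather than a secure wipe, the encrypt-and-retain branch is omitted, and the function name, logger name, and summary keys are hypothetical.

```python
import logging
import os
import tempfile

log = logging.getLogger("persona_build")  # hypothetical logger name


def finalize_build(temp_paths, update_ok, summary):
    """Delete temporary files after a successful update and keep them otherwise.

    Returns the list of paths that still exist on disk afterwards.
    """
    if update_ok:
        for path in temp_paths:
            if os.path.exists(path):
                os.remove(path)  # ordinary deletion, not a secure wipe
    log.info("build summary: %s", {**summary, "update_ok": update_ok})
    return [p for p in temp_paths if os.path.exists(p)]


# Usage: a failed update keeps the file for debugging, a successful one removes it.
fd, tmp = tempfile.mkstemp(suffix=".png")
os.close(fd)
kept = finalize_build([tmp], update_ok=False, summary={"clone_id": "clone-001"})
remaining = finalize_build([tmp], update_ok=True, summary={"clone_id": "clone-001", "version": 1})
```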

[0047] The method for digital clones to perceive their own appearance, clothing, and scene provided in this application embodiment has the following beneficial effects:

[0048] Multi-format input normalization and optimization: Provides unified input processing rules, converting all image data, including single images, multiple images, and videos, into a single standard image format for processing. This eliminates the need to develop multiple analysis workflows for different formats, reducing development and maintenance costs. Simultaneously, a quality scoring and optimization mechanism is introduced, automatically selecting the most suitable standard image to represent the digital avatar's appearance, clothing, and scene. This avoids issues such as occlusion, blurring, or unrepresentative scenes caused by defaulting to the first frame or the first image, improving analysis accuracy. Furthermore, only the single optimal image needs to be analyzed at any given time, significantly reducing the need for frame-by-frame processing of entire videos or multiple images, substantially improving analysis efficiency and reducing computational consumption.

[0049] Fully automated process: From generating digital clone content to storing the target structured results in the database, the entire process requires no manual intervention, achieving complete automation of the setup and construction process. This avoids the inefficiency and error risks caused by manually filling in setting information or manually selecting analysis materials in existing technologies, significantly improving the speed and accuracy of setup and construction.

[0050] The analysis results are structured and validated in a closed loop: Based on a preset set of appearance-clothing-scene fields, structured prompts and constraints are constructed. This forces the multimodal analysis model to directly output candidate structured results that correspond one-to-one with the structured prompts and constraints, enabling data integration into the database without secondary parsing. This avoids errors caused by text parsing and improves the accuracy and standardization of writing setting information into the appearance-clothing-scene setting database. Simultaneously, syntax validation, field validation, and value range consistency validation are introduced. If a validation fails, a completion prompt is generated based on the missing field set for retrying, forming a validation and completion closed loop that ensures the completeness and accuracy of the output results.
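The validation and completion loop described above can be sketched as follows. Several details are assumptions rather than statements from the patent: the required field names, the wording of the completion prompt, the choice to merge partial results across attempts, and the attempt limit. `call_model` is a hypothetical callable that takes a prompt and returns raw text.

```python
import json

REQUIRED = ("appearance_face_shape", "clothing_style", "scene_type")  # hypothetical


def parse_fields(raw):
    """Return the known fields found in raw, or {} if raw is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {k: v for k, v in data.items() if k in REQUIRED}


def build_with_retry(call_model, base_prompt, max_attempts=3):
    """Call the model until every required field is present, asking only for
    the still-missing fields on each retry."""
    result, prompt, missing = {}, base_prompt, list(REQUIRED)
    for _ in range(max_attempts):
        result.update(parse_fields(call_model(prompt)))
        missing = [f for f in REQUIRED if f not in result]
        if not missing:
            return result
        prompt = f"{base_prompt}\nReturn only these missing fields as JSON: {', '.join(missing)}"
    raise ValueError(f"fields still missing after {max_attempts} attempts: {missing}")


# Usage with a fake model that omits one field on its first answer.
answers = iter([
    '{"appearance_face_shape": "oval", "clothing_style": "casual"}',
    '{"scene_type": "indoor"}',
])
seen_prompts = []


def fake_model(prompt):
    seen_prompts.append(prompt)
    return next(answers)


complete = build_with_retry(fake_model, "Describe the avatar as JSON.")
```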

[0051] The system features strong binding and version management between settings and digital avatars: A unique identifier, version number, and update timestamp for each digital avatar tightly link the structured results representing the settings data to the corresponding digital avatar. This ensures accurate matching of the correct settings data during subsequent digital avatar content generation or real-time interactive calls, guaranteeing consistency in digital avatar appearance and preventing errors caused by identifier confusion. It also supports overwriting and updating basic default settings and adding and saving user-defined settings. Users can save multiple sets of settings as needed, balancing the requirements for image uniformity and personalized expansion. Furthermore, the settings data is traceable, supporting rollback and auditing.

[0052] Scalable model adaptation: By introducing a model adaptation layer and encapsulating common interfaces, the system can quickly switch between different multimodal analysis models without modifying the overall workflow. This means that, based on actual application needs and policy compliance requirements, domestic models (such as Tongyi Qianwen Multimodal Edition and Wenxin Yiyan Multimodal Edition) or foreign models (such as GPT-4V) can be selected for analysis, achieving flexible, model-independent expansion and improving the system's adaptability and maintainability.
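One way to picture a model adaptation layer is a common interface plus a registry of backends, sketched below. The interface `MultimodalModel`, its `analyze` signature, the registry, and the `EchoModel` stand-in are all hypothetical; a real adapter would wrap a vendor SDK call, which is not shown here.

```python
from typing import Protocol


class MultimodalModel(Protocol):
    """Common interface that every backend adapter is assumed to satisfy."""

    def analyze(self, prompt: str, image_path: str) -> str: ...


class EchoModel:
    """Stand-in backend for demonstration; returns a fixed JSON string."""

    def analyze(self, prompt: str, image_path: str) -> str:
        return '{"scene_type": "indoor"}'


# Registry mapping a configuration name to an adapter class.
ADAPTERS = {"echo": EchoModel}


def get_model(name: str) -> MultimodalModel:
    """Look up and instantiate the adapter registered under name."""
    try:
        return ADAPTERS[name]()
    except KeyError:
        raise ValueError(f"no adapter registered for {name!r}") from None


model = get_model("echo")
print(model.analyze("Describe the avatar as JSON.", "standard.png"))
```

Because the rest of the workflow depends only on `analyze`, switching backends becomes a configuration change rather than a code change.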

[0053] Comprehensive security controls: This invention incorporates a security control module into the process, providing functions such as permission verification, exception handling, and content review. Permission control ensures the legitimacy of setup requests, guaranteeing data security by allowing each user to operate only their own digital clone data. Automatic retrying for exceptions and logging mechanisms enhance system robustness. Content security review prevents harmful or erroneous information from entering the setup database. Furthermore, after successful data entry, temporary files are securely deleted or encrypted and then stored to prevent the leakage of sensitive information. These measures further improve the reliability and security compliance of the system.

[0054] In one embodiment, as shown in Figure 2, step S102, which performs legality verification and normalization on the digital clone content to determine standardized materials, includes:

[0055] S201: Based on the legal formats corresponding to different modalities, perform legality checks on the original content corresponding to multiple modalities respectively, and determine the original content corresponding to multiple modalities that passes the legality checks;

[0056] S202: Normalize the original content corresponding to multiple modalities that have passed the legality check to determine standardized materials, including:

[0057] When the original content is visual content, image filtering is performed on the visual content to determine candidate images. The candidate images are then normalized to determine the standard image. The candidate images are determined as follows: when the visual content is a single original image, the original image is determined as the candidate image; when the visual content is multiple original images, the original image with the highest quality score among the multiple original images is determined as the candidate image; when the visual content is an original video formed based on multiple original images, multiple frames of original images are extracted from the original video according to a preset sampling step size to form an original image set, and the original image with the highest quality score in the original image set is determined as the candidate image.

[0058] When the original content is audio, audio stream extraction and format conversion are performed to determine the standard audio.

[0059] When the original content is text, the text content is encoded and converted to determine the standard text.
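The candidate image rule for the three visual cases (single image, multiple images, sampled video frames) can be sketched as one function. The quality scoring function is not specified in this passage, so it is passed in as a hypothetical callable; the `is_video` flag and the default sampling step are likewise assumptions.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_candidate(
    frames: Sequence[T],
    quality: Callable[[T], float],
    is_video: bool = False,
    step: int = 10,
) -> T:
    """Pick the candidate image.

    A single image is returned as-is. Multiple images are ranked by quality.
    Video frames are first sampled every `step` frames and then ranked.
    """
    if not frames:
        raise ValueError("no visual content supplied")
    if step < 1:
        raise ValueError("step must be at least 1")
    if is_video:
        frames = frames[::step]
    if len(frames) == 1:
        return frames[0]
    return max(frames, key=quality)


# Usage with integers standing in for images and a made-up quality score.
best_image = select_candidate([3, 9, 4], quality=float)
best_frame = select_candidate(list(range(25)), quality=lambda f: -abs(f - 12), is_video=True)
```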

[0060] As an example, in step S201, after acquiring the original content corresponding to multiple modalities, the electronic device needs to perform format verification on the original content corresponding to different modalities based on the legal formats corresponding to those modalities. This is used to filter out invalid content with non-legal formats and determine the original content corresponding to the multiple modalities that has passed the legality verification. For example, legal formats for images include JPG, PNG, and WebP; legal formats for videos include MP4, MOV, and AVI; and legal formats for audio include MP3 and WAV. Other formats will be considered unsupported and skipped, thus determining the original content corresponding to the legal formats.

[0061] In this example, a list of supported file formats is predefined for each media type. If the input content does not meet these format requirements, it will be considered invalid and filtered out, and will not proceed to the next step. For example, images only accept predefined formats (such as JPG, PNG, WebP, etc.), videos only accept MP4, MOV, AVI, etc., audio requires supported formats such as MP3, WAV, etc., and text needs to be in a specified encoding format (such as UTF-8 plain text). If the input file format is not in the supported list, the system will determine that it "failed the filter" during the format verification stage, thereby triggering exception handling or directly terminating the persona construction process. Only content that passes all format verifications will proceed to the standardization conversion step.
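The format check in step S201 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the format is judged from the file extension (the text does not say how the format is detected), and the names `LEGAL_FORMATS`, `is_legal`, `is_legal_text`, and `filter_legal` are hypothetical. The format lists are the ones given as examples above.

```python
from pathlib import Path

# Legal formats per modality, taken from the examples in the text.
LEGAL_FORMATS = {
    "image": {"jpg", "png", "webp"},
    "video": {"mp4", "mov", "avi"},
    "audio": {"mp3", "wav"},
}


def is_legal(modality: str, filename: str) -> bool:
    """Return True when the file extension is in the modality's legal list.

    Assumption: the format is inferred from the extension only.
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    return ext in LEGAL_FORMATS.get(modality, set())


def is_legal_text(raw: bytes) -> bool:
    """Text is treated as legal when it decodes as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def filter_legal(items):
    """Split (modality, filename) pairs into accepted and skipped lists."""
    accepted, skipped = [], []
    for modality, filename in items:
        target = accepted if is_legal(modality, filename) else skipped
        target.append((modality, filename))
    return accepted, skipped
```

Content in the skipped list would either be dropped or trigger exception handling, depending on which of the two behaviors described above is configured.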

[0062] As an example, in step S202, the electronic device normalizes the original content of each modality that has passed the legality verification and determines the standardized materials. Each modality's original content is processed according to the normalization rules for that modality to determine the corresponding standardized material.

[0063] When the original content is visual content, such as a single original image, multiple original images, or original video, it is necessary to first screen the visual content to determine candidate images; then, normalize the candidate images to determine the standard image.

[0064] In this example, the visual content is filtered to determine the candidate image in the following three cases: (1) When the visual content includes a single original image, that original image is determined as the candidate image. (2) When the visual content includes multiple original images, a preset quality scoring algorithm can be used to extract features from, analyze, and score the multiple original images, determining the quality score of each original image. The quality scores are sorted, and the original image with the highest score is determined as the candidate image. (3) When the visual content includes an original video, multiple original images are extracted from the original video according to the preset sampling step size to form an original image set. Feature extraction, analysis, and quality scoring are performed on the original images in the set to determine the quality score of each; the quality scores are sorted, and the original image with the highest score is determined as the candidate image.

[0065] When the original content is audio, such as a video's built-in audio track or a separate audio file, it is necessary to extract the audio stream and convert it to a standard format (such as WAV or MP3) for temporary storage to determine the standard audio. This ensures that it meets the requirements for subsequent multimodal analysis.

[0066] When the original content is text, it needs to be encoded and standardized, for example, uniformly encoded in UTF-8, to determine the standard text. This ensures it meets the requirements for subsequent multimodal analysis. In this example, the processed standard image, standard audio, and standard text together serve as the standardized materials, ensuring the feasibility of subsequent multimodal analysis.

[0067] In one embodiment, the quality score is determined by the following formula:

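[0068] $$Q_i = w_1 F_i + w_2 B_i + w_3 C_i + w_4 S_i - w_5 O_i$$

[0069] where $Q_i$ is the quality score of the $i$-th original image; $F_i$ is the face presence indicator, $B_i$ is the human body coverage indicator, $C_i$ is the face clarity indicator, $S_i$ is the scene recognizability indicator, $O_i$ is the occlusion penalty, and $w_1, \dots, w_5$ are preset weights satisfying a preset normalization condition (for example, $w_k \ge 0$ and $\sum_{k=1}^{5} w_k = 1$).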

[0070] As an example, when the format of the visual content is valid, it is uniformly converted into a standard image according to predetermined rules. To avoid problems such as occlusion, blurring, or unrepresentative scenes caused by defaulting to the first frame or the first image, this invention introduces a quality-score-based image selection mechanism during the normalization stage. Assume that the visual content corresponding to the digital clone includes the original image set $X = \{x_1, \dots, x_n\}$. Specifically, when the input is the original video, $X$ consists of multiple original images obtained by sampling at a preset step size; when the input is multiple original images, $X$ consists of those independent original images. For each original image $x_i \in X$, the quality score $Q_i$ is calculated, and $x^* = \arg\max_{x_i \in X} Q_i$ is chosen as the standard image.

[0071] The score and selection source corresponding to $x^*$ are recorded in the log for traceability. A format validation function $V(\cdot)$ is also introduced to represent filtering of the input content format: $V = 1$ if and only if the input belongs to the set of formats supported by the system; otherwise $V = 0$ and the input is determined to be invalid. The normalization process can be expressed as: the standard image is obtained only when $V = 1$; otherwise, the process terminates or returns an error code and records the reason.

[0072] Specifically, the selection of the standard image by the quality scoring function is defined as follows. When the input is a single original image, let the candidate set be $X = \{x_1\}$ and take $x^* = x_1$. When the input consists of multiple original images, let the candidate set be $X = \{x_1, \dots, x_n\}$ and take $x^* = \arg\max_{x_i \in X} Q_i$. When the input is the original video, a set of candidate frames $X = \{x_1, \dots, x_m\}$ is extracted from the original video according to a preset sampling step size, and $x^* = \arg\max_{x_i \in X} Q_i$ is taken, where $\arg\max$ denotes the input sample that maximizes the objective function (the quality scoring function $Q$). $Q$ is constructed based on the joint representation target of appearance, clothing, and scene, and includes at least five components: face presence, body coverage, face clarity, scene recognizability, and occlusion. The preferred definition is $Q_i = w_1 F_i + w_2 B_i + w_3 C_i + w_4 S_i - w_5 O_i$, where:

$F_i$ (face presence) can be obtained by weighting the face detection confidence score and the face region sharpness after normalization;

$B_i$ (body coverage) can be obtained by normalizing the area ratio of the human body detection box and the visibility rate of key points;

$C_i$ (face clarity) can be obtained by normalizing the Laplacian variance or another sharpness metric;

$S_i$ (scene recognizability) can be obtained from characteristics such as scene classification confidence or image information entropy;

$O_i$ (occlusion) can be estimated from the proportion of the face and upper body that is obscured;

the weights $w_1, \dots, w_5$ are configurable parameters satisfying a preset normalization condition (for example, $w_k \ge 0$ and $\sum_{k=1}^{5} w_k = 1$), and are used to balance the contribution of appearance, clothing, and scene to "representativeness".
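The candidate-image selection described above can be illustrated with a short sketch. It assumes the quality score is a weighted sum of the four positive indicators minus a weighted occlusion penalty, and that each indicator has already been normalized to [0, 1] by upstream detectors that are not shown. The weight values and all names (`Indicators`, `WEIGHTS`, `quality_score`, `sample_frame_indices`, `select_candidate`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Indicators:
    """Per-image indicators, each assumed to be pre-normalized to [0, 1]."""
    face_presence: float
    body_coverage: float
    face_clarity: float
    scene_recognizability: float
    occlusion: float


# Hypothetical weights; the text only says they are preset and configurable.
WEIGHTS = (0.3, 0.2, 0.2, 0.2, 0.1)


def quality_score(ind: Indicators, w=WEIGHTS) -> float:
    """Weighted sum of the positive indicators minus the occlusion penalty."""
    return (w[0] * ind.face_presence
            + w[1] * ind.body_coverage
            + w[2] * ind.face_clarity
            + w[3] * ind.scene_recognizability
            - w[4] * ind.occlusion)


def sample_frame_indices(total_frames: int, step: int) -> list:
    """Frame indices taken from a video at a preset sampling step."""
    return list(range(0, total_frames, step))


def select_candidate(indicators: list):
    """Return (index, score) of the highest-scoring image.

    A single-image input is handled by the same code path: the only
    element is trivially the maximum.
    """
    scores = [quality_score(ind) for ind in indicators]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Returning the score together with the index makes it straightforward to write both into the log for traceability.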

[0073] Furthermore, the electronic device can also perform size and color space normalization on the standard image to adapt it to the input interface of the subsequent model. The normalization parameters are configurable: for example, the image can be scaled to a preset pixel size; the color space can be RGB or another preset space; the image encoding format can be JPG, PNG, WebP, etc., where the compression quality is a configurable parameter when lossy encoding is used (e.g., 75% as the default value), and lossless compression is retained when lossless encoding is used. The selection of these parameters does not limit the scope of protection of this invention and can be adjusted adaptively according to computing power, bandwidth, and model interface requirements.

[0074] In one embodiment, standardized materials include standard images, standard audio, and standard text;

[0075] As shown in Figure 3, in step S103, invoking the multimodal analysis model to perform inference on the standardized materials to determine candidate structured results representing the setting data includes:

[0076] S301: Perform structured analysis on standard images, standard audio, and standard text respectively to obtain visual features, audio features, and text features;

[0077] S302: Call the multimodal analysis model to reason about visual features, audio features and text features, and determine candidate structured results. The candidate structured results are used to characterize the appearance, clothing and scene settings corresponding to the digital clone.

[0078] As an example, in step S301, the electronic device can perform structured processing on the standard image to obtain visual features that meet the input interface requirements of the multimodal analysis model. In this example, the standard image can be serialized for this purpose: serialization includes converting the image into a binary byte stream and optionally encoding it into a Base64 string. The Base64 string carries the image content and serves as one of the model's input parameters; it is not a semantic feature vector of the image. In one embodiment, the visual features are extracted by a visual encoder within the multimodal analysis model and used to generate structured outputs with the appearance-clothing-scene fields. The electronic device also applies an audio recognition model to the standard audio for feature extraction, determining the audio features of the speaker, and encodes the standard text to determine the text features. In this example, structured analysis is performed separately on the standard image, standard audio, and standard text to determine the corresponding visual, audio, and text features, so that they meet the input interface requirements of the multimodal analysis model and can be input into it for analysis and processing.
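The image serialization described above (byte stream, then optional Base64 string) can be sketched with the Python standard library. The function names are hypothetical, and how the resulting string is placed into a model request depends on the provider.

```python
import base64


def serialize_image(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII Base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def deserialize_image(payload: str) -> bytes:
    """Inverse of serialize_image, useful for checking the round trip."""
    return base64.b64decode(payload)
```

The string produced here is only a transport encoding of the image content; any semantic feature extraction still happens inside the model's visual encoder.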

[0079] In this example, standard audio can be transformed into speech feature vectors or directly obtained speech attributes using a speech recognition model or a speaker feature extraction model. For example, features such as the speaker's gender, timbre, tone, speech rate, and accent type can be extracted from the audio, and this information can be organized into structured audio features (such as voice_gender, voice_timbre, voice_accent, etc.). Standard text can be semantically parsed using natural language processing or a large language model to extract the implicit data from the text descriptions of the digital avatar (such as personality, character background, etc.). For example, the personality traits (persona_personality) and background story (persona_background) can be analyzed, and corresponding structured fields can be generated. In short, standard audio and standard text are not directly fed into the visual model in Base64 format like standard images. Instead, structured features are extracted separately using specialized speech and text analysis algorithms, and then fused by the multimodal analysis module.

[0080] As an example, in step S302, the electronic device injects structured prompts and structured constraints into the multimodal analysis model, enabling the multimodal analysis model (which can be replaced by visual recognition or multimodal analysis models from different providers as needed) to perform multimodal analysis on the visual features, audio features, and text features and to output candidate structured results that match the structured prompts and structured constraints. As a result, the candidate structured results correspond one-to-one with the preset appearance-clothing-scene field set, avoiding redundant information and ensuring the accuracy and efficiency of subsequent data processing. In this example, the structured prompts and constraints are generated based on the preset appearance-clothing-scene field set and instruct the multimodal analysis model to output candidate structured results in the format they define (e.g., plain JSON) that include all fields in the preset appearance-clothing-scene field set. The candidate structured results represent the appearance, clothing, and scene settings of the digital clone.
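One way to derive a structured prompt and a structured constraint from a preset field set is sketched below. The field names in `FIELD_SET` are hypothetical (the text does not enumerate the real fields), the prompt wording is illustrative, and the constraint is expressed in a JSON-Schema-like shape as an assumption about what a "structured constraint" might look like.

```python
import json

# Hypothetical field set mapping each field to its category.
FIELD_SET = {
    "face_shape": "appearance",
    "hair_color": "appearance",
    "clothing_style": "clothing",
    "scene_type": "scene",
}


def build_structured_prompt(fields: dict) -> str:
    """Ask the model for plain JSON containing exactly the preset keys."""
    template = {name: "" for name in fields}
    return (
        "Describe the digital clone's appearance, clothing and scene.\n"
        "Return plain JSON only, with exactly these keys and no extra text:\n"
        + json.dumps(template, indent=2)
    )


def build_structured_constraints(fields: dict) -> dict:
    """JSON-Schema-style constraint: every field required, no extras."""
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in fields},
        "required": list(fields),
        "additionalProperties": False,
    }
```

Because both artifacts are generated from the same field set, adding or removing a field changes the prompt and the constraint together, which helps keep the model output in one-to-one correspondence with the preset fields.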

[0081] In one embodiment, as shown in Figure 4, step S302 of calling the multimodal analysis model to perform inference on the visual features, audio features, and text features to determine candidate structured results includes:

[0082] S401: Call the multimodal analysis model to reason about visual features, audio features and text features, determine visual setting data, audio setting data and text setting data respectively, and determine whether there is duplicate setting data among visual setting data, audio setting data and text setting data;

[0083] S402: When there is no duplicate setting data, the visual setting data, audio setting data and text setting data are concatenated to determine the candidate structured result;

[0084] S403: When duplicate setting data exists, a decision rule based on priority and consistency verification is adopted to fuse visual setting data, audio setting data and text setting data to determine candidate structured results.

[0085] Specifically, visual setting data refers to the setting data determined by the multimodal analysis model through inference of visual features, which can be at least one setting data corresponding to appearance, clothing, and scene. Audio setting data refers to the setting data determined by the multimodal analysis model through inference of audio features, which can be at least one setting data corresponding to appearance, clothing, and scene. Text setting data refers to the setting data determined by the multimodal analysis model through inference of text features, which can be at least one setting data corresponding to appearance, clothing, and scene.

[0086] As an example, in step S401, the electronic device calls a multimodal analysis model to infer visual features, audio features, and text features, and determines visual setting data, audio setting data, and text setting data respectively. Each of these setting data can include at least one setting data corresponding to appearance, clothing, and scene. Specifically, each can include at least one setting data from all fields in the appearance-clothing-scene field set, so that the same field may output one or more field contents. When outputting one field content, it is determined that there is no duplicate setting data; if multiple field contents are output, it is determined that there is duplicate setting data.

[0087] As an example, in step S402, when there is no duplicate setting data among the visual setting data, audio setting data, and text setting data, the electronic device directly concatenates them to determine the corresponding candidate structured result. For example, suppose the visual setting data output by the visual modality is $R_v$, the audio setting data output by the audio modality is $R_a$, and the text setting data output by the text modality is $R_t$. Basic fusion then adopts key merging: $R = R_v \cup R_a \cup R_t$. In this example, the multimodal analysis model analyzes the visual features to determine the corresponding visual setting data. The audio features of the speaker, such as timbre, pitch, speech rate, intonation, and accent type, are extracted, analyzed, and converted into corresponding structured attributes (e.g., the voice_timbre, voice_pitch, and voice_accent fields) to determine the audio setting data. The text features describing the personality or background of the digital avatar are analyzed to extract key information and form structured results (such as the persona_personality and persona_role fields), thus determining the text setting data. The analysis results from the multiple modalities are fused by a structured engine into a unified candidate structured result. The field names output by different modalities are distinct and pre-planned, so they can be merged directly; if there are overlapping fields, the values are determined according to priority rules (e.g., actual image recognition results take precedence over text descriptions) to ensure the consistency and accuracy of the setting information.

[0088] As an example, in step S403, when duplicate setting data exists among the visual setting data, audio setting data, and text setting data, the electronic device can determine that different modalities give conflicting values for the same field. In this case, a decision rule based on priority and consistency verification is used to determine the final setting data. In this example, the default priority is visual modality > audio modality > text modality. Furthermore, when a lower-priority modal value is equivalent to a higher-priority modal value under a preset synonym normalization rule, the two are considered consistent. When the conflict cannot be resolved, the higher-priority modal value is written and the conflict is recorded in the conflict log for subsequent auditing or manual review.
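The fusion in steps S402 and S403 can be sketched as a priority-ordered key merge. The priority order is the default stated above; the synonym table, the shape of the conflict log entries, and the names `PRIORITY`, `SYNONYMS`, `normalize`, and `fuse` are hypothetical.

```python
# Default priority from the text: visual > audio > text.
PRIORITY = ("visual", "audio", "text")

# Hypothetical synonym table used to normalize values before comparison.
SYNONYMS = {"brunette": "brown", "dark brown": "brown"}


def normalize(value: str) -> str:
    """Lower-case, trim, and map known synonyms to a canonical value."""
    v = value.strip().lower()
    return SYNONYMS.get(v, v)


def fuse(results: dict):
    """Merge per-modality dicts and return (fused, conflict_log).

    Modalities are visited from highest to lowest priority, so the first
    value written for a key is always the highest-priority one.
    """
    fused, source, conflicts = {}, {}, []
    for modality in PRIORITY:
        for key, value in results.get(modality, {}).items():
            if key not in fused:
                fused[key], source[key] = value, modality
            elif normalize(value) != normalize(fused[key]):
                # Unresolved conflict: keep the higher-priority value, log it.
                conflicts.append({
                    "field": key,
                    "kept": source[key], "kept_value": fused[key],
                    "dropped": modality, "dropped_value": value,
                })
    return fused, conflicts
```

When no key appears in more than one modality, the function reduces to the plain union of step S402 and the conflict log stays empty.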

[0089] For example, electronic devices can invoke multimodal analysis models to infer visual, audio, and textual features. Specifically, this can be achieved by calling intelligent analysis models corresponding to different modalities to extract features and perform structured analysis on the content of each modality. For image analysis, image recognition models or multimodal analysis models are used to identify and extract appearance, clothing, and scene features from temporarily stored standard images. First, the image recognition model or multimodal analysis model serializes the standard image to adapt it to the model's input format. Then, based on all preset appearance-clothing-scene fields, structured prompts and structured constraints are constructed, instructing the model to output results in a predefined format (pure JSON structure) containing all preset appearance-clothing-scene fields. After processing, the model returns the analysis results, for example:

[0090]

[0091] For the image modality, the feature extraction mapping of the image recognition model is defined as $f_v$: the standard image $x^*$ is mapped to the high-dimensional visual feature $z_v = f_v(x^*)$. This visual feature $z_v$ includes deep feature representations of the digital avatar's appearance, clothing, and scene. Similarly, the feature mapping of the audio analysis model is defined as $f_a$, which maps the standard audio $a$ to the audio feature $z_a = f_a(a)$, and the feature mapping of the text analysis model is $f_t$, which maps the standard text $t$ to the text feature $z_t = f_t(t)$. The multimodal analysis model can be viewed as a function $g$ that jointly decodes these modal features. Its input can be a single-modal or multi-modal feature set, and its output is the corresponding setting data. When there is only image input, the decoding function degenerates into pure visual decoding, i.e., $g(z_v)$ outputs the setting data corresponding to the image; when image, audio, and text features are all available, the decoding function $g(z_v, z_a, z_t)$ comprehensively uses the features of every modality to infer the full attributes of the digital clone. The mapping from the multimodal feature space to the setting data space can be represented as $g: \mathcal{Z}_v \times \mathcal{Z}_a \times \mathcal{Z}_t \to \mathcal{S}$, where $\mathcal{S}$ denotes the setting data space (the set consisting of all appearance-clothing-scene fields and their possible values). For example, in the case of image input only, the setting data output by $g(z_v)$ is a vector consisting of each preset appearance, clothing, and scene field and its corresponding value; when audio or text modalities also exist, $g$ integrates the multimodal features to jointly determine the corresponding attribute values, improving the accuracy and completeness of the recognition results. Therefore, the final candidate structured result $R$ can be viewed as a setting data vector $R = (r_1, r_2, \dots, r_K)$, where $r_k$ denotes the value the system assigns to the $k$-th of the $K$ preset attribute fields. These attribute values can include discrete classification results (e.g., a face shape field value of "round face") as well as continuous feature values or text descriptions. The system performs the necessary type adaptation for different types of fields during data entry (e.g., storing long text in a TEXT type field).

[0092] In multimodal result fusion, $R_v$, $R_a$, and $R_t$ denote the sets of setting data output by visual analysis, speech analysis, and text analysis, respectively. Each set contains several key-value pairs in the format "attribute name: attribute value". The combined result is denoted as the set $R$. The fusion strategy employs a key-value pair union, which can be formally represented as $R = R_v \cup R_a \cup R_t$. In other words, $R$ contains all attribute keys and their corresponding values from every modal output. When the attribute names output by different modalities differ, a simple union integrates them, including all fields without omission. Because the design phase avoids using the same field names for different modal outputs wherever possible, in most cases $R$ is simply a set union.

[0093] The term "pairwise merging" refers to merging the result sets of any two modalities first and then merging that result with the third set, if necessary. This is equivalent to merging all three sets at once, because the union operation is associative. Therefore, regardless of the order, pairwise merging ultimately yields the same complete set $R$. The key lies in handling overlaps: if $R_v$, $R_a$, and $R_t$ contain identical attribute keys (for example, both image analysis and text analysis give the attribute "hair color"), only one instance of that key is retained in $R$. It is then necessary to determine which source's value to use, and this is where the priority rules come into play. For example, if the image analysis result is preset to have the highest priority, the final "hair color" is taken from the image analysis value, and the values of this field from other modalities are ignored.

[0094] The advantages of key-value pair merging strategies are: they can retain and fuse as much information as possible from image, sound, and text modalities, forming a unified set of data. During the merging process, pre-defined priority rules effectively resolve conflicts when different modalities describe the same attribute inconsistently, ensuring the final output persona information is consistent, accurate, and reliable. Since the attribute sets output by each modality are pre-planned in the design (most field names do not conflict), the merging process mainly involves set union operations, resulting in low complexity and simple implementation. Furthermore, this key-value-based merging method has excellent scalability: if a new modality is added in the future, it can be considered as adding a new set to participate in the union, without disrupting the existing process. In summary, the key-value pair merging strategy, combined with priority decision-making, ensures both comprehensive information fusion and consistent authority in the results, making it efficient and robust for multimodal persona construction.

[0095] In one embodiment, as shown in Figure 5, step S104 of performing syntax validation, field validation, and value range consistency validation on the candidate structured results to determine the target structured results that pass the validation includes:

[0096] S501: Perform syntax validation, field validation, and value range consistency validation on the candidate structured results;

[0097] S502: When the candidate structured result meets the verification pass conditions, the candidate structured result is determined as the target structured result; the verification pass conditions include that the format of the candidate structured result is the format specified in the structured constraints, the fields of the candidate structured result include all fields in the appearance-clothing-scene field set, and the value of each field of the candidate structured result conforms to the preset format and value range constraints;

[0098] S503: Update the number of verification failures when the candidate structured result does not meet the verification pass conditions;

[0099] S504: If the number of validation failures is less than the preset number, an exception handling mechanism is executed. Based on the set of missing fields, a completion prompt is generated for retry. The syntax validation, field validation, and value range consistency validation of the candidate structured results are performed repeatedly.

[0100] S505: If the number of verification failures is not less than the preset number, record the abnormal situation and terminate the process.

[0101] As an example, in step S501, the electronic device performs syntax validation, field validation, and value range consistency validation on the candidate structured results. Specifically, it checks whether the format of the candidate structured result conforms to the JSON format, whether its fields contain all fields in the appearance-clothing-scene field set, and whether the value of each field conforms to the preset format and value range constraints, so as to evaluate whether the candidate structured result meets the verification pass conditions.

[0102] As an example, in step S502, the electronic device determines that the candidate structured result meets the verification pass conditions, and identifies it as the target structured result, only if all three of the following hold: the format of the candidate structured result is the format specified in the structured constraints (for example, when the specified format is JSON, the verification confirms that there are no JSON syntax errors and the format is correct); the fields of the candidate structured result cover all fields (required fields) in the appearance-clothing-scene field set; and the value of each field conforms to the preset format and value range constraints.

[0103] As an example, in step S503, if at least one of the following occurs: the format of the candidate structured result does not conform to the JSON format, its fields do not include all fields in the appearance-clothing-scene field set, or the value of some field does not conform to the preset format and value range constraints (e.g., there are JSON syntax errors, missing fields, format abnormalities, or value range mismatches), the electronic device determines that the verification pass conditions are not met. In this case, the verification failure count N is updated, i.e., N = N + 1, and then compared with the preset count to assess whether a retry is necessary. The preset count is the number of retries allowed in advance; for example, it can be set to 3.

[0104] As an example, in step S504, if the number of verification failures is less than the preset number, the electronic device can execute an exception handling mechanism: it generates a completion prompt based on the missing field set, retries, and repeats syntax validation, field validation, and value range consistency validation on the candidate structured results. That is, it repeats step S501, automatically generating completion prompts from the missing field set and re-calling the model for up to three retries until a complete result is obtained or the retry limit is reached. If the validation requirements are still not met after multiple retries, the exception is recorded and the process is terminated to avoid writing incomplete or erroneous data. In this example, the format of the candidate structured results is the format defined in the structured constraints, the required field set in the appearance-clothing-scene field set is $K_{\text{req}}$, and the set of fields output in the candidate structured result is $K_{\text{out}}$. When $K_{\text{req}} \subseteq K_{\text{out}}$ and each field value meets its preset format (e.g., enumeration, regular expression, length), the candidate structured result is deemed to meet the verification pass conditions; otherwise, the missing field set $K_{\text{miss}} = K_{\text{req}} \setminus K_{\text{out}}$ is generated and the completion loop is started: a completion prompt is constructed that requires the model to output only the JSON key-value pairs corresponding to $K_{\text{miss}}$; syntax validation, field validation, and value range consistency validation are re-executed on the completed result, the original result is updated by key merging, and the merged result is validated again. If the verification pass conditions still cannot be satisfied after the preset number of attempts, the system marks the construction as failed and refuses to store the result, preventing incomplete data from polluting the appearance-clothing-scene setting library.

[0105] In this example, syntax validation, field validation, and value range consistency checks are performed on the candidate structured results to ensure that the JSON contains all predefined fields, is correctly formatted, and conforms to value range constraints. If JSON syntax errors, missing fields, abnormal formats, or mismatched value ranges are found, exception handling or retry mechanisms are triggered (e.g., the model analysis can be retried within a limited number of attempts based on the missing field set, with a maximum of 3 retries) to accommodate the differences in API interface formats among different multimodal analysis models.
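Steps S501 to S505 can be illustrated with the following sketch. It follows the failure-count logic of S503 to S505 literally (stop once the count is no longer below the preset number), treats fields with out-of-range values as still missing, and uses `call_model` as a hypothetical stand-in for the real model call. The required fields and the enumeration in `ALLOWED` are likewise hypothetical.

```python
import json

REQUIRED = {"face_shape", "hair_color", "scene_type"}   # hypothetical fields
ALLOWED = {"face_shape": {"round", "oval", "square"}}    # hypothetical enumeration
MAX_FAILURES = 3                                        # preset count from the text


def parse_json_object(raw: str):
    """Syntax validation: return a dict, or None if raw is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def valid_items(data: dict) -> dict:
    """Keep required keys whose values are strings within the allowed range."""
    return {
        k: v for k, v in data.items()
        if k in REQUIRED and isinstance(v, str)
        and (k not in ALLOWED or v in ALLOWED[k])
    }


def build_with_retry(call_model, first_prompt: str):
    """Return the complete result, or None after MAX_FAILURES failed checks.

    call_model(prompt) -> str is a hypothetical callable wrapping the model.
    """
    result, prompt, failures = {}, first_prompt, 0
    while True:
        data = parse_json_object(call_model(prompt))
        if data is not None:
            result.update(valid_items(data))        # key merging
        missing = REQUIRED - result.keys()          # missing field set
        if not missing:
            return result
        failures += 1
        if failures >= MAX_FAILURES:
            return None                             # log and refuse to store
        prompt = ("Output only a JSON object with these keys: "
                  + ", ".join(sorted(missing)))
```

Merging each completion into the earlier partial result means a retry only has to supply the fields that are still missing, which keeps the completion prompts short.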

[0106] This invention introduces a model adaptation layer in the multimodal inference module. The model adaptation layer encapsulates the differences in interfaces between different models by dividing the module into sub-units. The responsibilities of each sub-unit are as follows: **Model Registration Sub-unit:** This sub-unit registers and manages interfacing multimodal models and their interface information. It saves parameters such as API call entry points and authentication methods for different provider models through configuration files or a registration mechanism, allowing the system to easily switch between the multimodal models used without modifying the core code. **Parameter Conversion Sub-unit:** This sub-unit converts input data and parameters into the API call format required by the target model. Specifically, based on the interface requirements of the target model, it encapsulates standardized prompts, image data, etc., into a call request and sends it to the corresponding multimodal model service. **Result Standardization Sub-unit:** This sub-unit receives and processes the raw results returned by the multimodal model, parsing and converting them into structured persona attribute output in the system's predefined standard JSON format. In other words, regardless of the data format output by the underlying model, after processing by this sub-unit, the output results will be organized into a unified field format required by the appearance-clothing-scene setting library. Through the collaborative work of the above sub-units, the model adaptation layer shields the differences in interface protocols and data formats between different multimodal models, achieving a unified model access interface. Therefore, the system can seamlessly adapt to and replace multimodal models from different providers simply by changing the model configuration or API call parameters.

[0107] In this example, the model adaptation layer acts as an abstract encapsulation layer that shields the differences in calling methods among multimodal analysis models and provides a unified model access interface for the system. On one hand, the adaptation layer connects to multimodal analysis models from different providers based on configuration; for example, parameters such as the model's interface URL and authentication key are set through configuration files, allowing the main system to switch the underlying analysis model without modifying the code. On the other hand, the adaptation layer standardizes the input and output formats: it packages the standard images, standard audio, standard text, and prompts to be analyzed into requests according to the requirements of the multimodal analysis model, calls the corresponding model's API to obtain the analysis results, and then parses the data returned by the model into the system's predefined JSON structured format. With this design, regardless of the underlying multimodal analysis model (including mainstream large models that support visual input, or other image recognition algorithms), the system calls the model and parses its results in a consistent way, so the multimodal analysis model module can be replaced or upgraded without modifying other parts of the system. This decoupling gives the system good portability and scalability, and allows new models to be connected quickly as needed to improve analysis performance or add functions, which keeps the present invention technically current.
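As a non-limiting illustration, the three sub-units of the model adaptation layer could be arranged as in the following sketch. The class names `ModelEntry` and `ModelAdapter`, the `transport` callable, and the per-model `build_request` and `parse_response` hooks are hypothetical names chosen for this sketch and do not correspond to any particular provider's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelEntry:
    """Registration record for one multimodal model (hypothetical structure)."""
    endpoint: str                                  # API call entry point from configuration
    auth_key: str                                  # authentication credential
    build_request: Callable[[str, bytes], dict]    # parameter conversion hook
    parse_response: Callable[[dict], dict]         # result standardization hook


class ModelAdapter:
    """Unified access interface over the registered models."""

    def __init__(self, transport):
        # transport(endpoint, auth_key, request) -> raw response dict.
        # It is injected so the sketch stays independent of any HTTP library.
        self._transport = transport
        self._registry: Dict[str, ModelEntry] = {}

    def register(self, name: str, entry: ModelEntry) -> None:
        """Model registration sub-unit: store interface information by name."""
        self._registry[name] = entry

    def analyze(self, name: str, prompt: str, image: bytes) -> dict:
        entry = self._registry[name]
        request = entry.build_request(prompt, image)   # parameter conversion sub-unit
        raw = self._transport(entry.endpoint, entry.auth_key, request)
        return entry.parse_response(raw)               # result standardization sub-unit
```

Under this arrangement, switching providers amounts to registering a different `ModelEntry` and changing the name passed to `analyze`; the calling code and the downstream verification logic are unchanged.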

[0108] In one embodiment, as shown in Figure 6, step S105, in which the target structured result is associated with the unique identifier of the digital clone and stored in the appearance-clothing-scene setting library, includes:

[0109] S601: Extract the visual embedding vector corresponding to the current version from the standard image, and determine the visual embedding vector corresponding to the previous version of the unique identifier of the digital clone from the appearance-clothing-scene setting library;

[0110] S602: Calculate the distance between the visual embedding vector corresponding to the current version and the visual embedding vector corresponding to the previous version to determine the visual relative distance;

[0111] S603: When the visual relative distance is less than the preset relative distance, mark the target structured result as a stable appearance update and retain the field values of the previous version that are not covered by the current version;

[0112] S604: When the visual relative distance is not less than the preset relative distance, mark the target structured result as an appearance drift update, and use the target structured result to perform an overwrite update operation on the appearance fields.
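As a non-limiting illustration, steps S602 to S604 might be sketched as follows. The choice of cosine distance, the threshold value of 0.3, and the convention that appearance fields are those whose names start with `appearance_` are assumptions made for this sketch; the description does not fix the distance measure or the threshold at this point. Replacing the previous appearance fields wholesale in the drift branch is one possible reading of "overwrite update".

```python
import math

APPEARANCE_PREFIX = "appearance_"  # assumed naming convention for appearance fields


def cosine_distance(a, b):
    """Assumed distance measure: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embedding vectors must be non-zero")
    return 1.0 - dot / norm


def merge_versions(previous, current, prev_vec, cur_vec, threshold=0.3):
    """Return (update_label, merged_record) following S603 and S604."""
    distance = cosine_distance(cur_vec, prev_vec)
    merged = dict(previous)
    if distance < threshold:
        # S603: stable update, fields not covered by the current version are kept.
        merged.update(current)
        return "stable_appearance_update", merged
    # S604: drift update, only the appearance fields are overwritten.
    for key in [k for k in merged if k.startswith(APPEARANCE_PREFIX)]:
        del merged[key]
    merged.update(
        {k: v for k, v in current.items() if k.startswith(APPEARANCE_PREFIX)}
    )
    return "appearance_drift_update", merged
```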

[0113] In this example, the appearance-clothing-scene setting library is used to store structured setting data. The library uses a predefined table structure that includes a unique identifier field for the digital avatar (`user_avatar_id`), metadata fields, and setting data fields for the appearance, clothing, and scene dimensions. The metadata fields include at least: `persona_version` (setting version number), `preset_type` (marker distinguishing basic and custom types), `updated_at` (update timestamp), `source_modality` (marker of the modality combination used in this construction), `quality_score` (quality score of the selected standard image), and the optional `confidence_json` (field-level confidence set). The appearance dimension includes fields such as `appearance_face_shape`, `appearance_facial_features`, `appearance_default_expression`, `appearance_hair_style`, and `appearance_makeup`. The clothing dimension includes fields such as `clothing_category`, `clothing_style`, `clothing_color`, and `clothing_design`. The scene dimension includes fields such as `scene_type`, `scene_features`, and `scene_atmosphere`. When necessary, the library can also be extended with audio-related and text-related setting fields, such as `voice_gender`, `voice_accent`, and `persona_personality`, to store audio and text analysis results. The library stores the target structured results in association with the unique identifier of the digital avatar, achieving a strong binding between the settings and the digital avatar.
In this example, `persona_version` is generated incrementally for each unique identifier of a digital clone. When `preset_type` is the basic type and the consistency check determines an "appearance drift update", an overwrite update is performed and a new version is generated; when `preset_type` is the custom type, a record is added and an independent version is generated, so that multiple sets of settings can coexist for the same digital clone.
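As a non-limiting illustration, the table structure and the incremental version number might be expressed with SQLite as in the following sketch. The table name `persona_settings`, the column types, and the primary key are assumptions; only the column names are taken from the description above.

```python
import sqlite3

# Column names follow the description; types and constraints are assumed.
SCHEMA = """
CREATE TABLE persona_settings (
    user_avatar_id                TEXT    NOT NULL,
    persona_version               INTEGER NOT NULL,
    preset_type                   TEXT    NOT NULL CHECK (preset_type IN ('basic', 'custom')),
    updated_at                    TEXT    NOT NULL,
    source_modality               TEXT,
    quality_score                 REAL,
    confidence_json               TEXT,
    appearance_face_shape         TEXT,
    appearance_facial_features    TEXT,
    appearance_default_expression TEXT,
    appearance_hair_style         TEXT,
    appearance_makeup             TEXT,
    clothing_category             TEXT,
    clothing_style                TEXT,
    clothing_color                TEXT,
    clothing_design               TEXT,
    scene_type                    TEXT,
    scene_features                TEXT,
    scene_atmosphere              TEXT,
    PRIMARY KEY (user_avatar_id, persona_version)
)
"""


def next_version(conn, avatar_id):
    """Return the next persona_version for one digital clone, starting at 1."""
    row = conn.execute(
        "SELECT COALESCE(MAX(persona_version), 0) + 1 "
        "FROM persona_settings WHERE user_avatar_id = ?",
        (avatar_id,),
    ).fetchone()
    return row[0]
```

Because the version counter is scoped to `user_avatar_id`, basic and custom records of the same digital clone share one increasing sequence while different clones are numbered independently.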

[0114] As an example, in step S601, the electronic device automatically writes the target structured result into the appearance-clothing-scene setting library and updates or adds setting records according to a preset strategy. During this process, the electronic device performs consistency detection: it extracts the visual embedding vector corresponding to the current version from the standard image, and retrieves from the appearance-clothing-scene setting library the visual embedding vector that corresponds to the previous version and is associated with the unique identifier of the digital clone.

[0115] As an example, in step S602, the electronic device performs a distance calculation between the visual embedding vector corresponding to the current version and the visual embedding vector corresponding to the previous version to determine the visual relative distance between the two (for example, the visual relative distance is...). It then compares the visual relative distance with the preset relative distance and performs different steps based on the comparison result.

[0116] As an example, in step S603, when the visual relative distance is less than the preset relative distance, the electronic device marks the target structured result as a stable appearance update and preserves the previous version's values for the fields not covered by the current version. That is, when the target structured result updates only some fields (such as the clothing color changing from "black" to "white"), the uncovered fields (such as facial features like face shape and hairstyle) already passed validation in the previous version and remain valid, so retaining them ensures field integrity and saves computational resources.

[0117] As an example, in step S604, when the visual relative distance is not less than the preset relative distance, the electronic device marks the target structured result as an appearance drift update and uses the target structured result to overwrite the appearance fields. The update scope is strictly limited to the appearance fields, which are the only structured results that need to be corrected because of appearance drift. Other structured fields unrelated to appearance are left untouched, which avoids meaningless overwriting of valid data and reduces the computational redundancy and data tampering risk of the update operation.

[0118] In this example, when the target structured result needs to be stored in the appearance-clothing-scene setting library, it is necessary to query the appearance-clothing-scene setting library based on the unique identifier of the digital clone to see if a corresponding setting record already exists for that unique identifier. Different synchronization strategies are adopted depending on the type of setting.

[0119] (1) For the basic default settings (the system's basic image settings), it is necessary to determine whether the basic settings record for the digital clone already exists in the appearance-clothing-scene settings library. If the basic settings record for the digital clone already exists in the appearance-clothing-scene settings library, delete the old record or overwrite and update its field values, and then insert and write the new analysis results, version number, and update timestamp (keeping each digital clone with only one latest basic settings record). Specifically, write the field values, version number, and update timestamp obtained from the new analysis into the corresponding columns of the appearance-clothing-scene settings library to complete the insertion (or update) of the basic settings record. If there is no record with a unique identifier for the digital clone in the appearance-clothing-scene settings library, directly insert the new record, version number, and update timestamp.

[0120] (2) For custom settings (personalized image configurations saved by the user), there is no need to delete the original data. A new record is simply added with the unique identifier of the digital clone, the version number, and the update timestamp (a custom name can be specified for the record). Only the field values generated in this analysis are filled in; other fields not involved can be left blank or inherited from the basic settings.

[0121] The above write operations are completed within a transaction opened on the appearance-clothing-scene setting library, ensuring atomicity and data consistency throughout the process: if an error occurs during the write, the transaction is automatically rolled back to avoid incomplete data records. After the write is complete, the new or updated setting record stores the unique identifier of the digital clone, the version number, and the update timestamp, thereby strongly binding the setting data to the unique identifier of the digital clone and ensuring that subsequent lookups by that identifier always retrieve the matching image setting.
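As a non-limiting illustration, the transactional write for basic and custom settings might look like the following sketch, which uses the connection context manager of Python's `sqlite3` module to commit on success and roll back on error. The reduced table with a single `clothing_color` column, and the `NOT NULL` constraint used to provoke a failure, are assumptions made to keep the sketch short.

```python
import sqlite3
from datetime import datetime, timezone


def open_library():
    """Create a reduced in-memory setting table (assumed layout for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE persona_settings ("
        " user_avatar_id TEXT NOT NULL,"
        " preset_type TEXT NOT NULL,"
        " persona_version INTEGER NOT NULL,"
        " updated_at TEXT NOT NULL,"
        " clothing_color TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def write_settings(conn, avatar_id, preset_type, fields):
    """Write one setting record atomically and return its version number."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits if the block succeeds, rolls back if it raises
        version = conn.execute(
            "SELECT COALESCE(MAX(persona_version), 0) + 1 "
            "FROM persona_settings WHERE user_avatar_id = ?",
            (avatar_id,),
        ).fetchone()[0]
        if preset_type == "basic":
            # Keep only one latest basic record per digital clone.
            conn.execute(
                "DELETE FROM persona_settings "
                "WHERE user_avatar_id = ? AND preset_type = 'basic'",
                (avatar_id,),
            )
        conn.execute(
            "INSERT INTO persona_settings VALUES (?, ?, ?, ?, ?)",
            (avatar_id, preset_type, version, now, fields.get("clothing_color")),
        )
    return version
```

If the insert fails after the old basic record has been deleted, the rollback restores that record, so a failed write never leaves the digital clone without its basic settings.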

[0122] In this example, during synchronization, the field mapping unit ensures that each model output field is written to its corresponding field in the appearance-clothing-scene setting library. This mapping is a predefined one-to-one correspondence, ensuring that the keys in the JSON result are the same as, or directly convertible to, the column names in the appearance-clothing-scene setting library. The write operation is performed under transaction control to guarantee atomicity: first, a transaction is started on the appearance-clothing-scene setting library; then each field value, the version number, and the update timestamp are written to or updated in the table; finally, the transaction is committed. If any step fails, the transaction is rolled back to ensure the consistency and integrity of the data in the appearance-clothing-scene setting library. After the write is complete, the record contains the unique identifier of the digital clone, the version number, and the update timestamp, achieving a strong binding between the setting data and the digital clone. This strong binding ensures that, whether the digital clone generates new content or performs real-time interaction, the corresponding setting data can be accurately retrieved and applied through the ID, so that the appearance, clothing, and scene of the digital clone always remain consistent with the preset.

[0123] Field mapping transformation: To accurately store the target structured results output by the model into the appearance-clothing-scene setting library, a field mapping function is defined to convert model output fields into appearance-clothing-scene setting library fields. The target structured result contains a set of fields with a corresponding set of values, and the appearance-clothing-scene setting library has a predefined set of table fields. The mapping function maps each field in the output result to exactly one field in the appearance-clothing-scene setting library, and the two are semantically equivalent; the mapping relations include, for example, same-name mappings. In actual implementation, since the key names of the model output JSON have been standardized in the structured prompts to be consistent with the field names of the appearance-clothing-scene setting library, the mapping can be completed by directly matching field names: for each key-value pair of the output JSON, the value is assigned to the field of the same name during insertion and update operations in the appearance-clothing-scene setting library. Formally, the write operation of the setting data can be expressed as: for each output field, write its value into the mapped field of the setting record. The mapping function and the predefined structure ensure a seamless connection between the candidate structured results and the table structure of the appearance-clothing-scene setting library.

[0124] In one embodiment, to reduce secondary parsing and field mapping errors, this invention directly constrains, in the structured prompts, the key names of the model's output JSON to be consistent with the field names of the appearance-clothing-scene setting library. This allows the appearance-clothing-scene setting library to be written by direct assignment of fields with the same name. The mapping function is enabled only when the business requires a different naming system, in which case the output key names are converted to column names in the appearance-clothing-scene setting library. In that case, the mapping function must be given in the specification as a complete mapping table, and the mapped column names must correspond one-to-one in semantics.
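As a non-limiting illustration, same-name assignment with an optional mapping table might be sketched as follows. The function name `map_fields`, the `mapping_table` parameter, and the reduced column set are assumptions introduced for this sketch.

```python
# Subset of the library columns named in the description (reduced for brevity).
LIBRARY_COLUMNS = frozenset({
    "appearance_face_shape",
    "appearance_hair_style",
    "clothing_category",
    "clothing_color",
    "scene_type",
})


def map_fields(result, mapping_table=None):
    """Convert model output keys to library columns with a one-to-one mapping.

    Without a mapping table, keys are matched to columns of the same name.
    """
    mapping_table = mapping_table or {}
    row = {}
    for key, value in result.items():
        column = mapping_table.get(key, key)  # same-name mapping by default
        if column not in LIBRARY_COLUMNS:
            raise KeyError(f"no library column for output key {key!r}")
        if column in row:
            raise ValueError(f"mapping is not one-to-one: {column!r} assigned twice")
        row[column] = value
    return row
```

Rejecting unknown keys and duplicate targets at this point keeps a naming mistake in the prompt or the mapping table from silently dropping or overwriting a field.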

[0125] ID binding synchronization logic: Strong binding to the unique identifier of the digital clone is achieved by storing a unique identifier field in each setting record. The setting table of the appearance-clothing-scene setting library contains a corresponding field (e.g., `user_avatar_id`). Whenever a setting record is generated, its `user_avatar_id` is set to the unique identifier of the associated digital clone, ensuring that subsequent queries accurately locate the corresponding record by ID. To achieve overwrite updates of basic settings and incremental saving of custom settings, this invention manages multiple setting records by differentiating setting types. A save operation is defined whose parameters include the setting type (basic default or custom), the version number, and the update timestamp. The synchronization logic is as follows. If the setting type is the basic default, it must be ensured that each digital clone has at most one basic settings record; a constraint can be established so that the appearance-clothing-scene setting library never contains two or more records of the basic type for the same unique identifier. In implementation, when the save operation is called, the system first checks whether a basic settings record already exists for the digital clone: if it exists, the old record is deleted or its fields, version number, and update timestamp are updated directly, and the new analysis result record is then inserted and saved; if it does not exist, a new record with a version number and update timestamp is inserted directly. If the setting type is custom, each digital clone may have multiple custom setting records with different names, so existing records need not be deleted during synchronization; a new record carrying a version number, an update timestamp, and the custom setting type is simply inserted (optionally with a field identifying the setting name, to distinguish different custom images). Through the above strategy, the system ensures that each digital clone always has one and only one latest basic default setting bound to it, while multiple sets of custom settings can also be stored for that clone, achieving a strong binding and synchronous update of the setting data and the unique identifier of the digital clone.
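As a non-limiting illustration, the constraint that limits each digital clone to one basic settings record could be enforced by the storage layer itself, for instance with SQLite partial unique indexes as sketched below. The table and index names and the `setting_name` column are assumptions; the description states only that such a constraint can be established and that custom records may carry a name.

```python
import sqlite3


def open_library():
    """Create a reduced setting table with uniqueness constraints (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE persona_settings (
            user_avatar_id  TEXT    NOT NULL,
            preset_type     TEXT    NOT NULL CHECK (preset_type IN ('basic', 'custom')),
            setting_name    TEXT,
            persona_version INTEGER NOT NULL,
            updated_at      TEXT    NOT NULL
        );
        -- At most one basic record per digital clone.
        CREATE UNIQUE INDEX one_basic_per_avatar
            ON persona_settings (user_avatar_id)
            WHERE preset_type = 'basic';
        -- Custom records of one digital clone must have distinct names.
        CREATE UNIQUE INDEX custom_name_per_avatar
            ON persona_settings (user_avatar_id, setting_name)
            WHERE preset_type = 'custom';
        """
    )
    return conn
```

With the constraint in the table definition, a second basic record for the same identifier is rejected even if application code skips the existence check.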

[0126] Through the collaboration and step control of the above modules, the setting and construction process of this invention forms a closed-loop automatic processing flow: the trigger control module sequentially drives the content acquisition, normalization and frame selection, multimodal reasoning, verification and completion, versioned storage, and security cleanup modules to work in sequence. After the previous module successfully completes, it outputs data for use by the next module. The information flow is unidirectionally transmitted within the system until the process ends. The system is designed with strict judgment and rollback mechanisms at key nodes to ensure the robustness and reliability of the process: if the input format verification fails, the process is immediately terminated to avoid invalid analysis; if the candidate structured results are incomplete, the process is automatically retried or terminated to prevent erroneous data from being entered into the database; if the writing of the appearance-clothing-scene setting library fails, the process is rolled back to ensure data consistency; the cleanup operation is executed based on the synchronization results to avoid accidentally deleting debugging evidence. These security control logics and exception handling mechanisms ensure that even in the case of unsatisfactory input or abnormal analysis, the system can be orderly terminated or retried without generating erroneous setting data. The entire automatic construction process has good fault tolerance and reliability.
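As a non-limiting illustration, the one-way, stage-by-stage control flow described above might be sketched as follows. The function `run_pipeline` and the representation of each module as a named callable are assumptions for this sketch; retry and rollback inside an individual stage are left to that stage.

```python
def run_pipeline(stages, payload):
    """Run the stages in order, passing each output to the next stage.

    stages is a list of (name, callable) pairs. The run stops at the first
    stage that raises, so later stages never see incomplete data.
    Returns (final_output_or_None, log).
    """
    log = []
    for name, stage in stages:
        try:
            payload = stage(payload)
        except Exception as exc:  # any stage failure terminates the run in order
            log.append((name, f"failed: {exc}"))
            return None, log
        log.append((name, "ok"))
    return payload, log
```

The returned log records which stage ended the run, which a cleanup step can consult before deciding whether temporary files are deleted or retained.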

[0127] The method proposed in this invention for digital clones to perceive their own appearance, clothing, and scene has significant advantages over traditional setting and construction methods.

[0128] First, the settings for traditional digital clones usually require users to manually input or edit them, or to manually select and analyze materials. This process is not only cumbersome and inefficient, but also often inaccurate, which may result in the recorded settings not matching the actual digital clone image. In contrast, this invention can automatically extract settings from the digital clone content, greatly reducing human intervention, avoiding human subjective errors, and achieving efficient and accurate automated construction.

[0129] Secondly, for multi-source content input, existing solutions often require different processing flows for single images, multiple images, and videos, lacking a unified adaptation, resulting in high development and maintenance costs and slow processing speed. This invention, by determining standard images from visual content and processing audio and text content accordingly, can effectively simplify system complexity and improve processing efficiency, while the quality scoring and selection mechanism improves the accuracy of analysis.

[0130] Furthermore, traditional multimodal analysis models mostly output unstructured natural language descriptions, which must undergo secondary parsing before being stored in the appearance-clothing-scene setting library. This increases the risk of parsing errors, and the parsing process requires additional computing power and time, making it unsuitable for direct use in subsequent content generation or real-time interactive calls for digital avatars. This invention, through the dual constraints of structured prompts and structured limitations, forces the multimodal model to directly output JSON structured results that correspond one-to-one with database fields. This allows for seamless storage without secondary parsing, reducing parsing errors and improving data retrieval efficiency, ensuring that persona information can quickly support real-time interaction and content generation for digital avatars.

[0131] Finally, regarding the association between setting data and digital clones, existing solutions lack a strong binding mechanism and version management. Setting information is mostly stored in independent files or loose data tables and is not tied to a unique identifier of the digital clone. This leads to setting confusion and version errors during subsequent calls, so that the digital clone's appearance is inconsistent from one use to the next. This invention strongly associates the setting data with the digital clone through a triple binding of the digital clone's unique identifier, the version number, and the update timestamp. At the same time, it applies a differentiated strategy of "stable appearance update" and "appearance drift update" based on visual embedding distance detection. This ensures the traceability of the setting data, avoids invalid overwriting and data redundancy, and ensures that the correct version of the setting information is matched during subsequent calls, addressing the weak association between settings and digital clones.

[0132] In summary, compared with existing technologies, this invention has achieved breakthrough improvements in terms of process uniformity, automation level, data structuring level, association binding strength, and version management capabilities, enabling more efficient, accurate, and secure construction and management of appearance-clothing-scene setting data for digital clones.

[0133] This invention provides a system for a digital clone to perceive its own appearance, clothing, and scene. This system corresponds one-to-one with the method for a digital clone to perceive its own appearance, clothing, and scene in the above embodiments. As shown in Figure 7, the system includes:

[0134] The content acquisition module 701 is used to acquire digital clone content associated with the unique identifier of the digital clone. The digital clone content includes original content in multiple modalities. The original content includes at least visual content, and also includes at least one of audio content and text content.

[0135] The normalization module 702 is used to perform legality verification and normalization processing on the digital clone content, and determine standardized materials. The standardized materials include at least a standard image determined based on the visual content, and at least one of standard audio determined based on the audio content and standard text determined based on the text content.

[0136] The multimodal reasoning module 703 is used to construct structured prompts and structured constraints based on a preset set of appearance-clothing-scene fields, call a multimodal analysis model to reason about the standardized materials, and determine candidate structured results that represent the set data. The candidate structured results are matched with the structured prompts and structured constraints.

[0137] The verification and completion module 704 is used to perform syntax verification, field verification, and value range consistency verification on the candidate structured results, and determine the target structured result that passes the verification.

[0138] The versioned storage module 705 is used to associate the target structured result with the unique identifier of the digital clone and store it in the appearance-clothing-scene setting library, and write a version number and update timestamp to the target structured result.

[0139] The security cleanup module 706 is used to perform secure deletion or encryption and retention of temporary files generated during the normalization process after detecting that the target structured result has been successfully updated; if the target structured result update fails or the process is abnormally terminated, the temporary files are not deleted; and the log summary information of this construction is recorded.
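As a non-limiting illustration, the conditional cleanup performed by the security cleanup module might be sketched as follows. The function name `finalize_temp_files` is hypothetical, and plain file removal stands in for the secure deletion or encrypted retention named in the description, neither of which this sketch implements.

```python
from pathlib import Path


def finalize_temp_files(temp_files, update_succeeded, log):
    """Delete temporary files only after a successful update.

    Returns the list of files that were retained. Plain unlinking is used
    here as a stand-in for secure deletion (assumption of this sketch).
    """
    if not update_succeeded:
        # Failed or aborted runs keep their files for troubleshooting.
        log.append(f"retained {len(temp_files)} temporary file(s)")
        return list(temp_files)
    for path in temp_files:
        Path(path).unlink(missing_ok=True)  # requires Python 3.8 or later
    log.append(f"deleted {len(temp_files)} temporary file(s)")
    return []
```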

[0140] Specific limitations regarding the system for a digital avatar to perceive its own appearance, clothing, and environment can be found in the limitations of the methods for digital avatars to perceive their own appearance, clothing, and environment described above, and will not be repeated here. Each module in the aforementioned system for a digital avatar to perceive its own appearance, clothing, and environment can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.

[0141] This invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned method for a digital avatar to perceive its own appearance, clothing, and scene. This electronic device can be a smartphone, tablet computer, laptop computer, desktop computer, server, or other device with data processing capabilities. Its processor can be a CPU, GPU, FPGA, or other chip, and its memory can include RAM, ROM, solid-state drive, etc., supporting the storage and processing of multimodal data.

[0142] This invention provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the aforementioned method for a digital avatar to perceive its own appearance, clothing, and scene. The storage medium can be any medium capable of storing computer programs, such as a USB flash drive, external hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, or it can be a network storage device such as a server.

[0143] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A method for a digital avatar to perceive its own appearance, clothing, and scene, characterized in that it includes: obtaining digital clone content associated with the unique identifier of the digital clone, the digital clone content including original content in multiple modalities, the original content including at least visual content and also including at least one of audio content and text content; performing legality verification and normalization processing on the digital clone content to determine standardized materials, the standardized materials including at least a standard image determined based on the visual content, and at least one of standard audio determined based on the audio content and standard text determined based on the text content; constructing structured prompts and structured constraints based on a preset set of appearance-clothing-scene fields, the structured prompts being generated based on the preset set of appearance-clothing-scene fields; invoking a multimodal analysis model to perform inference on the standardized materials and determine candidate structured results representing the setting data, the candidate structured results matching the structured prompts and structured constraints; performing syntax verification, field verification, and value range consistency verification on the candidate structured results; when the candidate structured results meet the verification pass conditions, determining the candidate structured results as the target structured result; when the candidate structured results do not meet the verification pass conditions, updating the number of verification failures; if the number of verification failures is less than the preset number, executing an exception handling mechanism in which a completion prompt is generated based on the set of missing fields for retrying, a new candidate structured result is obtained, and the syntax verification, field verification, and value range consistency verification are performed again on the new candidate structured result; if the number of verification failures is not less than the preset number, recording the abnormal situation and terminating the process; extracting the visual embedding vector corresponding to the current version from the standard image, and determining the visual embedding vector corresponding to the previous version of the unique identifier of the digital clone from the appearance-clothing-scene setting library; calculating the distance between the visual embedding vector corresponding to the current version and the visual embedding vector corresponding to the previous version to determine the visual relative distance; when the visual relative distance is less than a preset relative distance, marking the target structured result as a stable appearance update, and using the current version to update some fields of the previous version while keeping the field values of the previous version that are not covered by the updated values of the current version; when the visual relative distance is not less than the preset relative distance, marking the target structured result as an appearance drift update, and using the target structured result to perform an overwrite update operation on the appearance fields; associating the target structured result with the unique identifier of the digital clone and storing it in the appearance-clothing-scene setting library, and writing a version number and update timestamp to the target structured result; after detecting that the target structured result has been successfully updated, securely deleting, or encrypting and retaining, the temporary files generated during the normalization process; retaining the temporary files when the target structured result update fails or the process is abnormally terminated; and recording the log summary information of this construction.

2. The method according to claim 1, characterized in that performing legality verification and normalization processing on the digital clone content to determine standardized materials includes: verifying, based on the legal formats corresponding to different modalities, the legality of the original content corresponding to the multiple modalities, to determine the original content that passes the legality verification; and normalizing the original content that passes the legality verification to determine the standardized materials, including: when the original content is visual content, performing image filtering on the visual content to determine a candidate image, and normalizing the candidate image to determine the standard image, wherein the candidate image is determined as follows: when the visual content is a single original image, the original image is determined as the candidate image; when the visual content is multiple original images, the original image with the highest quality score among the multiple original images is determined as the candidate image; when the visual content is an original video formed from multiple original images, multiple frames of original images are extracted from the original video according to a preset sampling step size to form an original image set, and the original image with the highest quality score in the original image set is determined as the candidate image; when the original content is audio content, extracting the audio content and converting its format to determine the standard audio; and when the original content is text content, performing encoding conversion on the text content to determine the standard text.

3. The method according to claim 2, characterized in that, the quality score is determined using the following formula: [formula not reproduced in the source text], where the terms of the formula are: the quality score of the i-th original image; a facial presence indicator; a human body coverage indicator; a facial clarity indicator; a scene recognizability indicator; an occlusion penalty; and preset weights satisfying a constraint [not reproduced in the source text].
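Because the formula itself is not reproduced in this text, the following is only one plausible reading of the listed terms, not the claimed formula. It assumes a weighted sum of the four indicators minus a weighted occlusion penalty, and it assumes the weights sum to 1; the class `ImageIndicators` and the function `quality_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ImageIndicators:
    face_presence: float          # facial presence indicator
    body_coverage: float          # human body coverage indicator
    face_clarity: float           # facial clarity indicator
    scene_recognizability: float  # scene recognizability indicator
    occlusion: float              # occlusion penalty term


def quality_score(ind: ImageIndicators, weights: Sequence[float]) -> float:
    """Assumed form: weighted indicators minus a weighted occlusion penalty."""
    if len(weights) != 5:
        raise ValueError("expected five weights")
    # Assumption: weights are normalized to sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    w1, w2, w3, w4, w5 = weights
    return (w1 * ind.face_presence
            + w2 * ind.body_coverage
            + w3 * ind.face_clarity
            + w4 * ind.scene_recognizability
            - w5 * ind.occlusion)
```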

4. The method according to claim 1, characterized in that, the standardized materials include standard images, standard audio, and standard text; the step of invoking a multimodal analysis model to perform inference on the standardized materials and determine candidate structured results representing the setting data includes: performing structured analysis on the standard images, standard audio, and standard text to obtain visual features, audio features, and text features; and invoking the multimodal analysis model to perform inference on the visual features, audio features, and text features to determine candidate structured results, where the candidate structured results characterize the appearance, clothing, and scene settings corresponding to the digital clone.

5. The method according to claim 4, characterized in that, the step of invoking a multimodal analysis model to perform inference on the visual features, audio features, and text features to determine candidate structured results includes: invoking the multimodal analysis model to perform inference on the visual features, audio features, and text features to determine visual setting data, audio setting data, and text setting data respectively, and determining whether there is duplicate setting data among the visual setting data, audio setting data, and text setting data; when there is no duplicate setting data, the visual setting data, the audio setting data, and the text setting data are concatenated to determine the candidate structured result; when duplicate setting data exists, a decision rule based on priority and consistency verification is applied to fuse the visual setting data, the audio setting data, and the text setting data to determine the candidate structured result.
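One way to picture the fusion step of claim 5 is the sketch below. The claim does not state the priority order or how conflicts are reported, so the order visual, text, audio and the returned list of conflicting fields are assumptions, and `fuse_setting_data` is a hypothetical name.

```python
from typing import Any, Dict, List, Sequence, Tuple


def fuse_setting_data(
    by_modality: Dict[str, Dict[str, Any]],
    priority: Sequence[str] = ("visual", "text", "audio"),  # assumed order
) -> Tuple[Dict[str, Any], List[str]]:
    """Merge per-modality setting data; return (fused result, conflicting fields)."""
    fused: Dict[str, Any] = {}
    conflicts: List[str] = []
    for modality in priority:
        for field, value in by_modality.get(modality, {}).items():
            if field not in fused:
                # First (highest-priority) modality to supply the field wins.
                fused[field] = value
            elif fused[field] != value and field not in conflicts:
                # Consistency check: a lower-priority modality disagrees.
                conflicts.append(field)
    return fused, conflicts
```

When no field appears in more than one modality, the loop reduces to plain concatenation, which matches the no-duplicate branch of the claim. Duplicates that agree are kept silently; only disagreeing values are reported.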

6. A system for a digital avatar to perceive its own appearance, clothing, and scene, characterized in that, the system comprises: a content acquisition module, used to acquire the digital clone content associated with the unique identifier of the digital clone, where the digital clone content includes original content in multiple modalities, and the original content includes at least visual content and also includes at least one of audio content and text content; a normalization module, used to perform legality verification and normalization processing on the digital clone content and to determine standardized materials, where the standardized materials include at least a standard image determined based on the visual content, and at least one of standard audio determined based on the audio content and standard text determined based on the text content; a multimodal reasoning module, used to construct structured prompts and structured constraints based on a preset set of appearance-clothing-scene fields, the structured prompts being generated based on the preset set of appearance-clothing-scene fields, and to invoke the multimodal analysis model to perform inference on the standardized materials and determine candidate structured results that represent the setting data, where the candidate structured results match the structured prompts and structured constraints; a validation and completion module, used to perform syntax validation, field validation, and value range consistency validation on the candidate structured result, where, when the candidate structured result meets the validation conditions, the candidate structured result is determined as the target structured result; when the candidate structured result does not meet the validation conditions, the validation failure count is updated; if the number of validation failures is less than the preset number, an exception handling mechanism is executed: based on the set of missing fields, a completion prompt is generated for retrying, a new candidate structured result is obtained, and the syntax validation, field validation, and value range consistency validation are repeated on the candidate structured result; if the number of validation failures is not less than the preset number, the abnormal situation is recorded and the process is terminated; a versioned storage module, used to extract the visual embedding vector corresponding to the current version from the standard image, to retrieve the visual embedding vector corresponding to the previous version for the unique identifier of the digital clone from the appearance-clothing-scene setting library, and to calculate the distance between the visual embedding vector corresponding to the current version and the visual embedding vector corresponding to the previous version to determine the visual relative distance, where, when the visual relative distance is less than the preset relative distance, the target structured result is marked as a stable appearance update, and the current version is used to update some fields of the previous version, keeping the previous version's values for fields that are not covered by the updated values of the current version; when the visual relative distance is not less than the preset relative distance, the target structured result is marked as an appearance drift update, and the target structured result is used to overwrite the appearance fields; the target structured result is stored in the appearance-clothing-scene setting library in association with the unique identifier of the digital clone, and the version number and update timestamp are written to the target structured result; and a security cleanup module, used to securely delete, or encrypt and retain, the temporary files generated during the normalization process after detecting that the target structured result has been successfully updated, to leave the temporary files undeleted if the update of the target structured result fails or the process terminates abnormally, and to record log summary information for the current build.
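The validate-and-retry behaviour of the validation and completion module can be sketched as below. This is an illustration under stated assumptions: the candidate result is assumed to be a JSON object, `infer` stands in for a call to the multimodal analysis model, and `validate`, `build_with_retries`, `BuildAborted`, and the prompt wording are all hypothetical.

```python
import json


class BuildAborted(Exception):
    """Raised when the failure count reaches the preset number."""


def validate(raw, required, allowed):
    """Return (parsed data or None, set of missing or out-of-range fields)."""
    try:
        data = json.loads(raw)            # syntax validation
    except json.JSONDecodeError:
        return None, set(required)
    if not isinstance(data, dict):
        return None, set(required)
    problems = {f for f in required if f not in data}          # field validation
    problems |= {f for f, values in allowed.items()            # value range check
                 if f in data and data[f] not in values}
    return data, problems


def build_with_retries(infer, required, allowed, max_failures):
    """Call `infer(prompt)` until validation passes or failures reach the limit."""
    prompt = "Return JSON with fields: " + ", ".join(sorted(required))
    failures = 0
    while True:
        data, problems = validate(infer(prompt), required, allowed)
        if data is not None and not problems:
            return data                   # becomes the target structured result
        failures += 1
        if failures >= max_failures:      # "not less than the preset number"
            raise BuildAborted(f"validation failed {failures} times")
        # Completion prompt built from the fields that are missing or invalid.
        prompt = "Complete or correct these fields: " + ", ".join(sorted(problems))
```

Here `allowed` maps a field name to a tuple or list of permitted values; a production system would also record the abnormal situation in a log before terminating.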

7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the method for a digital clone to perceive its own appearance, clothing, and scene as described in any one of claims 1-5.

8. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for a digital clone to perceive its own appearance, clothing, and scene as described in any one of claims 1-5.