A method and device for generating a voice based on a bipartite graph structure similarity

By constructing an appearance-voice bipartite graph and calculating the structural similarity coefficient, an automated and quantifiable mapping from appearance to voice in AI short dramas was achieved, solving the problems of low audio-visual matching and homogenized timbre, and improving generation efficiency and adaptability.

CN122201249A · Pending · Publication Date: 2026-06-12 · UNICOM WOYUEDU TECH CULTURE CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
UNICOM WOYUEDU TECH CULTURE CO LTD
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In the creation of AI short dramas derived from novel IPs, the existing technology cannot quantify the matching degree between character appearance and voice, resulting in low audio-visual matching degree, low generation efficiency and serious homogenization of timbre, which cannot meet the industrialized mass production needs of AI short dramas.

Method used

By constructing a multi-dimensional feature system that includes standardized appearance and voice features, an appearance-voice bipartite graph is established. The structural similarity coefficient between the target appearance vertex and the candidate voice vertex is calculated. Based on the similarity coefficient, priority sorting and weighted fusion are performed to generate exclusive voice data.

Benefits of technology

It achieves automated and quantifiable mapping from physical features to voice, improves the matching degree of sound and image and the generation efficiency, reduces production costs and the risk of homogenization of timbre, and is suitable for industrialized mass production scenarios.

Generated by Eureka AI based on patent content.


Abstract

The application discloses an appearance sound generation method and device based on bipartite graph structural similarity, belonging to the technical field of multimodal data processing and speech synthesis. Multi-source sample data is acquired and an appearance-sound bipartite graph is constructed. Appearance description data of a target person is received and mapped to a target appearance vertex, and candidate sound vertices are extracted. Based on the bipartite graph topology, the structural similarity coefficient between the target appearance vertex and each candidate sound vertex is calculated; the coefficient represents the ratio of the number of butterfly structures to the number of corrected wedge paths, and valid sound vertices are selected by sorting and filtering on this coefficient. With the normalized coefficients as mixing weights, the basic sound samples corresponding to the valid sound vertices are fused by weighting to generate exclusive sound data. The application solves the problem that it is difficult to accurately generate a matching sound from appearance, and realizes high-similarity cross-modal sound synthesis.

Description

Technical Field

[0001] This invention relates to the fields of multimodal data processing and speech synthesis technology, and in particular to a method, apparatus, and device for generating appearance-based voices based on bipartite graph structural similarity.

Background Technology

[0002] In the field of AI short drama creation derived from novel IPs, existing technologies typically employ a process of "textual description—visual presentation—audio matching." Specifically, for the appearance of characters clearly depicted in the novel, features can be extracted using image generation tools and transformed into a human image that matches the description; for the characters' voices, two main methods are relied upon: one is to manually set single-dimensional linear rules (such as "the taller the person, the lower the pitch"), and the other is for the production staff to select the timbre and conduct auditions entirely based on their subjective experience, and then call general text-to-speech (TTS) tools to generate the dubbing.

[0003] However, the above solution has the following technical drawbacks in practical applications: First, the matching degree between audio and visuals cannot be quantified and guaranteed. Current technology lacks objective and quantifiable standards for matching appearance and voice, often resulting in a sense of incongruity between the character's appearance and the voice acting, severely impacting the viewer's immersive experience and leading to a high rate of viewers abandoning the show. Second, the generation efficiency is low. The entire process of voice acting for a single character, from voice selection, audition, recording to post-production tuning, takes an average of over 2 hours and relies entirely on manual intervention, failing to meet the industrialized mass production demands of AI short dramas, which typically feature dozens of episodes per day and dozens of characters per drama. Third, production costs are high and voice homogenization is severe. Manual voice acting for a single character is prohibitively expensive for small and medium-sized teams; while general-purpose TTS tools rely on fixed voice libraries, resulting in a large number of short drama characters having identical voice acting, with a voice homogenization rate exceeding 90%, leading to extremely low content recognizability.

[0004] The root causes of these problems are twofold: First, novel texts inherently lack explicit descriptions of quantifiable vocal characteristics such as timbre and tone. Authors typically only hint at tone through dialogue, leaving AI voice generation without quantifiable input and reliant on subjective human design. Second, current technologies have not translated the natural statistical correlation between appearance and voice into an engineered solution suitable for AI short drama scenarios. They lack a positive correlation modeling technique for "multi-dimensional appearance features → exclusive voice features," making it impossible to achieve a precise and quantifiable mapping from appearance to voice. Therefore, a technological solution is urgently needed that can automatically and quantifiably generate exclusive voices that highly match a character's appearance.

Summary of the Invention

[0005] The present invention aims to at least partially solve one of the technical problems in the related art.

[0006] To address this, this invention proposes a method for generating appearance-voice based on bipartite graph structural similarity. By acquiring multi-source sample data, a multi-dimensional feature system is constructed, comprising standardized appearance feature sets and standardized voice feature sets. Samples with completely identical appearance features are aggregated into appearance vertices, and samples with completely identical voice features are aggregated into voice vertices. An appearance-voice bipartite graph is constructed based on the association relationships within the same subject. The method receives the appearance description data of the target person, maps it to target appearance vertices, and extracts directly connected candidate voice vertices to form a candidate voice vertex set. Based on the bipartite graph topology, the structural similarity coefficient between the target appearance vertices and each candidate voice vertex is calculated. This coefficient represents the ratio of the number of butterfly structures containing two-vertex connections to the number of wedge paths after invalid path correction. Valid voice vertex sets are obtained by sorting and filtering according to this coefficient. A speech synthesis tool is invoked to generate basic voice sample data based on the voice features corresponding to each valid voice vertex. The normalized structural similarity coefficient is used as the mixing weight to perform weighted fusion processing on the basic voice sample data, resulting in the unique voice data of the target person. This invention solves the problem that existing methods are unable to accurately generate matching voices based on physical features, and achieves cross-modal voice synthesis with high similarity and high naturalness.

[0007] Another objective of this invention is to provide an appearance-sound generation device based on bipartite graph structural similarity.

[0008] The third objective of this invention is to provide a computer device.

[0009] To achieve the above objectives, this invention proposes a method for generating appearance-based voices based on bipartite graph structural similarity, comprising: Acquire multi-source sample data, construct a multi-dimensional feature system containing standardized appearance feature set and standardized voice feature set, aggregate samples with completely consistent appearance features into appearance vertices and samples with completely consistent voice features into voice vertices based on multi-source sample data, and construct appearance-voice bipartite graph according to the correlation relationship of the same subject in the samples. Receive the target person's appearance description data to be generated as the voice, map it to the target appearance vertices in the appearance-voice bipartite graph, and extract all candidate voice vertices directly connected to the corresponding target appearance vertices to form a candidate voice vertex set; Based on the bipartite graph topology, the structural similarity coefficient between the target appearance vertex and each candidate sound vertex is calculated. The coefficient represents the ratio of the number of butterfly structures containing two vertices to the number of wedge paths after invalid path correction. The candidate sound vertices are prioritized and filtered according to the corresponding coefficients to obtain the set of valid sound vertices. The speech synthesis tool is invoked to generate basic sound sample data based on the sound features corresponding to each effective sound vertex. The normalized structural similarity coefficient is used as the mixing weight to perform weighted fusion processing on the basic sound sample data to obtain the exclusive voice data of the target person.

[0010] The appearance-sound generation method based on bipartite graph structural similarity according to an embodiment of the present invention may also have the following additional technical features: In one embodiment of the present invention, a multi-dimensional feature system is constructed, including: Determine the core feature set of appearance, which includes quantitative features and qualitative features. Quantitative features include height range and BMI classification of body type, age stage division, and qualitative features include regional attributes, facial temperament, body shape characteristics and gender. Determine the core feature set of the voice, including the gender attribute of the voice, timbre category, pitch frequency range, speech rate and word count range, and intensity level; Among them, the height range is divided into four levels: ≤155cm, 156-165cm, 166-175cm, and ≥176cm; the BMI classification is divided into four categories: underweight, normal, overweight, and obese; the tone frequency range is divided into three ranges: 250-500Hz high frequency, 150-250Hz mid frequency, and 80-150Hz low frequency; and the speech rate and word count range is divided into three ranges: ≤150 words / minute soothing, 150-220 words / minute moderate, and ≥220 words / minute rapid.

[0011] In one embodiment of the present invention, aggregating appearance vertices and sound vertices includes: Collect multi-source sample data containing appearance feature labels, voice feature labels, and voice samples; Samples with identical appearance features are grouped into one appearance group, and samples with identical voice features are grouped into one voice group. The minimum effective threshold for the number of samples in a single feature combination is set to 5. Appearance groups and voice groups with a sample size lower than the corresponding threshold are removed. Generate a unique appearance vertex for each retained valid appearance group, and a unique sound vertex for each retained valid sound group.

[0012] In one embodiment of the present invention, constructing an appearance-voice bipartite graph includes: Construct an undirected bipartite graph G=(U, L, E); where U is the set of appearance vertices and L is the set of sound vertices; Traverse all samples. When a sample corresponding to a certain appearance vertex and a certain sound vertex originates from the same subject, establish an edge E between them to form a network topology structure in which a single appearance vertex connects multiple sound vertices and a single sound vertex connects multiple appearance vertices. The initially established edges are not assigned weights.

[0013] In one embodiment of the present invention, calculating the structural similarity coefficient includes: Define a wedge structure as a three-vertex path starting from vertex u, passing through vertex v, and reaching vertex w. Define a butterfly structure as a four-vertex closed path containing edges (u,v), (u,x), (w,v), and (w,x). Obtain the edge between the target appearance vertex u and the candidate sound vertex v, and count the total number of butterfly structures containing this edge, |H(u,v)|; Calculate the total number of wedge paths starting from u and not passing through v, |∧u|, and the total number of wedge paths starting from v and not passing through u, |∧v|, and obtain the degree d[u] of vertex u and the degree d[v] of vertex v; Calculate the structural similarity coefficient σ(u,v) from these quantities, as the ratio of the butterfly count to the corrected wedge-path count,

[0014] where the value range of σ(u,v) is [0,1].

[0015] In one embodiment of the present invention, prioritizing and filtering candidate sound vertices includes: The candidate sound vertex set is sorted from high to low according to the corresponding structural similarity coefficient value to generate a matching priority sequence; When executing the filtering rules, if the preset matching threshold method is used, vertices whose σ value is below the set threshold are removed, the threshold being chosen in the range of 0.1 to 0.7; if the preset fixed retention number method is used, only the first N vertices after sorting are retained. The candidate sound vertices remaining after filtering are taken as the set of valid sound vertices.
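
The two filtering rules described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_valid_voices`, its parameters, and the choice to keep vertices whose σ equals the threshold are assumptions made for the example.

```python
def select_valid_voices(sigma_by_voice, threshold=None, top_n=None):
    """Sort candidate sound vertices by sigma (descending) and filter them.

    sigma_by_voice: dict mapping a candidate vertex id to its sigma value.
    threshold: if given, drop vertices with sigma below it (the text suggests 0.1 to 0.7).
    top_n: if given, keep only the first N vertices after sorting.
    """
    ranked = sorted(sigma_by_voice.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Assumption: a vertex exactly at the threshold is kept.
        ranked = [(voice, sigma) for voice, sigma in ranked if sigma >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```

Either rule can be used on its own, matching the two alternatives in the paragraph; passing both applies the threshold first and then truncates.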

[0016] In one embodiment of the present invention, generating a basic sound sample includes: Extract the full-dimensional standardized sound feature combination corresponding to each effective sound vertex and convert it into input parameters that can be recognized by speech synthesis tools; For the same dialogue of the target character, generate a pure base sound sample with no background noise for each valid sound vertex, and ensure that all samples have the same duration; All basic sound samples were uniformly preprocessed with a sampling rate of 44.1kHz and a bit depth of 16bit, and phase alignment was performed to obtain a set of basic sound samples with completely synchronized time series.

[0017] In one embodiment of the present invention, the weighted fusion process includes: Normalize the structural similarity coefficients corresponding to each valid sound vertex, and calculate the mixing weights with a sum of 1. The normalization formula is as follows:

$$w_i = \frac{\sigma_i}{\sum_{j=1}^{n} \sigma_j}$$

[0018] where $w_i$ is the normalized mixing weight of the i-th valid sound vertex, $\sigma_i$ is its structural similarity coefficient, and n is the total number of valid sound vertices; Based on the principle of linear superposition of digital audio, the basic sound samples are weighted, mixed, and synthesized using the synthesis formula:

$$S(t) = \sum_{i=1}^{n} w_i \, s_i(t)$$

[0019] where the output $S(t)$ is the final dedicated sound audio signal at time t, and $s_i(t)$ is the audio signal of the i-th sample at time t.
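
As a rough illustration of the normalization and linear superposition described in paragraphs [0017] to [0019], the sketch below mixes equal-length sample sequences in plain Python. The function names `mixing_weights` and `mix`, and the representation of audio as lists of floats, are assumptions for the example; real audio would already have been resampled and phase-aligned as described in paragraph [0016].

```python
def mixing_weights(sigmas):
    """Normalize structural similarity coefficients so the weights sum to 1."""
    total = sum(sigmas)
    if total <= 0:
        raise ValueError("at least one coefficient must be positive")
    return [sigma / total for sigma in sigmas]


def mix(samples, sigmas):
    """Weighted linear superposition of time-aligned, equal-length samples."""
    if len(samples) != len(sigmas):
        raise ValueError("one coefficient is required per sample")
    length = len(samples[0])
    if any(len(sample) != length for sample in samples):
        raise ValueError("samples must have identical length")
    weights = mixing_weights(sigmas)
    # Output value at index t is the weighted sum of every sample's value at t.
    return [sum(w * s[t] for w, s in zip(weights, samples)) for t in range(length)]
```

Because the weights sum to 1, the mixed signal stays within the amplitude range of the inputs, so no extra gain correction is needed in this simplified setting.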

[0020] To achieve the above objectives, another aspect of the present invention proposes an appearance-sound generation device based on bipartite graph structural similarity, comprising: The feature aggregation and mapping module is used to acquire multi-source sample data, construct a multi-dimensional feature system containing standardized appearance feature sets and standardized voice feature sets, aggregate samples with completely consistent appearance features into appearance vertices and samples with completely consistent voice features into voice vertices based on multi-source sample data, and construct an appearance-voice bipartite graph according to the correlation relationship of the same subject in the samples. The target mapping module is used to receive the appearance description data of the target person to be generated, map it into the target appearance vertices in the appearance-voice bipartite graph, and extract all candidate voice vertices directly connected to the corresponding target appearance vertices to form a candidate voice vertex set; The similarity filtering module is used to calculate the structural similarity coefficient between the target appearance vertex and each candidate sound vertex based on the bipartite graph topology. The coefficient represents the ratio of the number of butterfly structures containing two vertices to the number of wedge paths after invalid path correction. The candidate sound vertices are prioritized and filtered according to the corresponding coefficient to obtain the set of valid sound vertices. The audio mixing and synthesis module is used to call the speech synthesis tool to generate basic audio sample data based on the audio features corresponding to each effective audio vertex. The normalized structural similarity coefficient is used as the mixing weight to perform weighted fusion processing on the basic audio sample data to obtain the exclusive audio data of the target person.

[0021] This invention discloses a method and apparatus for generating appearance-sound based on bipartite graph structural similarity. It constructs a multi-dimensional feature system comprising standardized appearance and sound features, aggregating appearance vertices and sound vertices to establish an appearance-sound bipartite graph. Based on the bipartite graph topology, it calculates the structural similarity coefficient between target appearance vertices and candidate sound vertices. This coefficient represents the ratio of the number of butterfly structures to the number of wedge paths corrected for invalid paths. Using the normalized coefficient as mixing weights, it weights and fuses the basic sound samples corresponding to valid sound vertices to generate unique sound data. This effectively solves the problem of existing technologies struggling to accurately match sounds based on appearance features. It achieves a fully integrated solution from multi-source sample aggregation, bipartite graph modeling, structural similarity calculation to weighted mixing and synthesis, significantly improving the similarity and naturalness of cross-modal sound generation, and enhancing the method's adaptability and engineering practical value under conditions of limited samples and sparse features.

[0022] To achieve the above objectives, a third aspect of this application provides a computer device, including a processor and a memory; wherein the processor reads executable program code stored in the memory to run a program corresponding to the executable program code, for implementing an appearance and sound generation method based on bipartite graph structure similarity as described in the first aspect embodiment.

[0023] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0024] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein: Figure 1 is a flowchart of an appearance and sound generation method based on bipartite graph structural similarity according to an embodiment of the present invention; Figure 2(a) is an example of a bipartite graph G; Figure 2(b) shows the structural similarity of each edge of the bipartite graph G; Figure 3 is a schematic diagram of an appearance-sound generation device based on bipartite graph structural similarity according to an embodiment of the present invention; Figure 4 shows a computer device according to an embodiment of the present invention.

Detailed Implementation

[0025] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0026] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0027] The following description, with reference to the accompanying drawings, describes a method, apparatus, and device for generating appearance and sound based on bipartite graph structural similarity according to an embodiment of the present invention.

[0028] The core idea of this invention is to address the technical problem of difficulty in audio-visual matching caused by sufficient character appearance descriptions but lack of quantitative voice features in the creation of AI short dramas derived from novel IPs. This is achieved by constructing an appearance-voice generation method covering the entire process of "feature system construction - bipartite graph modeling - similarity calculation - weighted mixing," thus overcoming the shortcomings of existing technologies that rely on subjective manual matching, are inefficient, and produce homogenized voices. First, multi-source sample data is acquired to establish a standardized appearance and voice feature system. Samples with consistent features are aggregated into appearance vertices and voice vertices, and an appearance-voice bipartite graph is constructed based on the association relationships of the same subject. Second, the appearance description of the target character is received, mapped to target appearance vertices, and a set of directly connected candidate voice vertices is extracted. Third, based on the bipartite graph topology, the structural similarity coefficient between the target vertex and each candidate vertex is calculated. This coefficient represents the association strength through the ratio of the butterfly structure to the corrected wedge path, and effective voice vertices are obtained by sorting and filtering accordingly. Finally, a speech synthesis tool is called to generate basic samples, which are then weighted and fused using the normalized coefficients as weights to output a unique voice that matches the appearance.
By integrating the above-mentioned technical approach of "feature aggregation - graph structure similarity - weighted mixing", this invention achieves automated and quantifiable mapping from appearance to sound, significantly improving the matching degree of sound and image and the generation efficiency, and reducing the production cost and the risk of homogenization of timbre.

[0029] The following describes, with reference to the accompanying drawings, an appearance sound generation method and apparatus based on bipartite graph structural similarity according to an embodiment of the present invention.

[0030] The appearance-sound generation method based on bipartite graph structural similarity of the present invention includes the following steps:
S1. Obtain multi-source sample data, construct a multi-dimensional feature system containing a standardized appearance feature set and a standardized voice feature set, aggregate samples with completely consistent appearance features into appearance vertices and samples with completely consistent voice features into voice vertices based on the multi-source sample data, and construct an appearance-voice bipartite graph according to the correlation relationship of the same subject in the samples.
S2. Receive the appearance description data of the target person for whom a voice is to be generated, map it to the target appearance vertices in the appearance-voice bipartite graph, and extract all candidate voice vertices directly connected to the corresponding target appearance vertices to form a candidate voice vertex set.
S3. Based on the bipartite graph topology, calculate the structural similarity coefficient between the target appearance vertex and each candidate sound vertex. The coefficient represents the ratio of the number of butterfly structures containing the two vertices to the number of wedge paths after invalid path correction. Based on the corresponding coefficient, prioritize and filter the candidate sound vertices to obtain the set of valid sound vertices.
S4. Call the speech synthesis tool to generate basic sound sample data based on the sound features corresponding to each valid sound vertex. The normalized structural similarity coefficient is used as the mixing weight to perform weighted fusion processing on the basic sound sample data to obtain the exclusive voice data of the target person.

[0031] As shown in Figure 1, the core logic of this invention is "feature extraction - association modeling - similarity calculation - weighted synthesis", and the specific steps and key parameters are as follows: Step 1: Construct a multi-dimensional appearance-voice feature system.

[0032] Based on statistical correlation and domain knowledge, the core appearance features related to voice and the quantifiable core voice features are identified, providing a foundation for subsequent modeling. The core appearance features are divided into quantitative and qualitative features, covering factors with a proven relationship to voice. Quantitative features include body type (height range: ≤155cm / 156-165cm / 166-175cm / ≥176cm, BMI: underweight / normal / overweight / obese), and age (child / teenager / young adult / middle-aged / elderly). Qualitative features include regional attributes (South / North / Northwest / Southwest, etc.), facial temperament (gentle / resolute / agile / composed), body shape (upright / slender / robust / rounded), and gender (male / female). The core voice features focus on standardized and quantifiable parameters that can be directly executed by AI speech synthesis tools: voice gender attribute (male voice / female voice), timbre (soft / clear / sweet / mellow / rough / deep / hoarse), pitch (high frequency [250-500Hz] / mid frequency [150-250Hz] / low frequency [80-150Hz]), speech rate (slow [≤150 words / minute] / medium [150-220 words / minute] / rapid [≥220 words / minute]), and volume (soft / moderate / loud).
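
The quantitative levels listed above can be turned into fixed labels with simple threshold checks. The following is only a sketch: the function names and label strings are hypothetical, and because the stated ranges share their boundary values (for example 150 Hz and 250 Hz, or 150 and 220 words per minute), the tie-breaking at each boundary is an assumption made here.

```python
def height_level(cm):
    """Map a height in cm to one of the four height levels."""
    if cm <= 155:
        return "<=155cm"
    if cm <= 165:
        return "156-165cm"
    if cm <= 175:
        return "166-175cm"
    return ">=176cm"


def pitch_level(hz):
    """Map a pitch frequency in Hz to high / mid / low."""
    if not 80 <= hz <= 500:
        raise ValueError("pitch outside the 80-500 Hz range of the feature system")
    if hz >= 250:  # assumption: 250 Hz counts as high
        return "high"
    if hz >= 150:  # assumption: 150 Hz counts as mid
        return "mid"
    return "low"


def rate_level(words_per_minute):
    """Map a speech rate to slow / medium / rapid."""
    if words_per_minute <= 150:
        return "slow"
    if words_per_minute < 220:  # assumption: exactly 220 counts as rapid
        return "medium"
    return "rapid"
```

Qualitative features such as regional attribute or timbre are already categorical, so they can be used as labels directly.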

[0033] Step 2: Construct an appearance-voice association sample library and a bipartite graph model.

[0034] Sub-step 2.1: Sample Collection and Labeling: Collect samples from multiple sources, including film and television characters, real-person interviews, and publicly available voiceprints. Each sample must include "appearance feature label + voice feature label + voice sample". The sample size should be ≥100,000 to ensure robustness, covering people of different regions, ages, and body types. The sample size for single-class feature combinations should be ≥100 to ensure model robustness and avoid matching bias caused by missing samples or uneven distribution. This ensures that all character settings can stably output accurately matched voices. Feature Aggregation and Vertex Generation: Based on the standardized feature system in step one, perform feature aggregation and vertex generation on the samples. Group samples with completely identical appearance features into one appearance group, and samples with completely identical voice features into one voice group. Set 5 as the minimum effective sample size threshold, and directly remove niche groups with less than 5 samples (this threshold is used to filter out random samples without statistical representativeness to ensure the reliability of subsequent association calculations), retaining only effective groups with ≥5 samples. Generate a unique appearance vertex for each effective appearance group and a unique voice vertex for each effective voice group.
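
Feature aggregation with the minimum sample threshold can be illustrated as below. This is a sketch under stated assumptions: each sample is represented as a dict whose `"appearance"` and `"voice"` entries are tuples of feature levels, and `build_vertices` and `MIN_GROUP_SIZE` are names introduced only for this example.

```python
from collections import Counter

MIN_GROUP_SIZE = 5  # minimum effective sample size stated in the text


def build_vertices(samples, key):
    """Group samples by an exact feature tuple and keep only sufficiently large groups.

    Returns a dict mapping each retained feature tuple to a unique vertex id.
    """
    counts = Counter(sample[key] for sample in samples)
    kept = sorted(combo for combo, n in counts.items() if n >= MIN_GROUP_SIZE)
    return {combo: vertex_id for vertex_id, combo in enumerate(kept)}
```

Calling `build_vertices(samples, "appearance")` and `build_vertices(samples, "voice")` would give the two vertex sets; groups with fewer than 5 samples produce no vertex.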

[0035] Sub-step 2.2: Bipartite graph construction: Construct an undirected bipartite graph G=(U,L,E), where U is the set of appearance vertices and L is the set of sound vertices; when the sample corresponding to a certain appearance vertex and a certain sound vertex comes from the same subject, establish an edge E between them (without setting an initial weight, which will be confirmed later through structural similarity), forming a network structure of "single appearance vertex connected to multiple sound vertices, and single sound vertex connected to multiple appearance vertices".
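
One simple way to hold the unweighted bipartite graph G=(U,L,E) is a pair of adjacency-set dictionaries. The sketch below assumes, for illustration only, that each sample records the appearance label and voice label of one subject, and that vertices are identified by those labels; the function name is hypothetical.

```python
from collections import defaultdict


def build_bipartite_graph(samples, appearance_vertices, voice_vertices):
    """Return (adj_u, adj_l): adjacency sets for appearance and sound vertices."""
    adj_u = defaultdict(set)  # appearance vertex -> connected sound vertices
    adj_l = defaultdict(set)  # sound vertex -> connected appearance vertices
    for sample in samples:
        a = sample["appearance"]
        v = sample["voice"]
        # Link only vertices that survived the minimum-sample filtering.
        if a in appearance_vertices and v in voice_vertices:
            adj_u[a].add(v)
            adj_l[v].add(a)
    return dict(adj_u), dict(adj_l)
```

Using sets means repeated samples for the same pair create a single edge, which matches the statement that edges carry no initial weight.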

[0036] Step 3: Prioritize sounds based on bipartite graph structural similarity.

[0037] Sub-step 3.1: Target vertex matching: Obtain the appearance description of the person to be voiced, first decompose it into the standardized appearance feature fixed level determined in step one, and then perform a full-dimensional accurate comparison with the feature labels of all appearance vertices generated in step two, match the target appearance vertex A with completely consistent feature level, and finally extract the set of sound vertices {L1,L2,...,Ln} directly connected to A.
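
The exact, full-dimension lookup of the target appearance vertex can be sketched as a dictionary lookup on the adjacency structure. The function name and the behaviour when no vertex matches (returning no candidates) are assumptions; the text does not say how an unmatched description is handled.

```python
def candidate_voices(target_features, adj_u):
    """Find the appearance vertex with identical feature levels and its sound neighbours.

    target_features: tuple of standardized appearance feature levels.
    adj_u: dict mapping each appearance vertex (feature tuple) to its set of sound vertices.
    """
    if target_features not in adj_u:
        # Assumption: no exact match means no candidates are returned.
        return None, []
    return target_features, sorted(adj_u[target_features])
```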

[0038] Sub-step 3.2: Core Definition and Similarity Calculation: Calculate the structural similarity between the target appearance vertex and each of its directly connected sound vertices. (For industrial applications requiring long-term, large-scale processing of target inputs, the structural similarity of all vertex pairs corresponding to edges in the bipartite graph can be pre-calculated and stored. Subsequent batch processing can directly call the pre-calculated results without repeated calculations.) The following are the core definitions related to structural similarity: Wedge structure: In a bipartite graph, for vertices u, v, w, a wedge path is a three-vertex path starting from vertex u, passing through v, and reaching w. The set of wedge paths starting from u is denoted ∧u. Butterfly structure: In a bipartite graph with u∈U, v∈L, w∈U, x∈L, if there are edges (u,v), (u,x), (w,v), (w,x), then a butterfly structure (u,v,w,x) is formed, reflecting the strong association of "two appearance vertices sharing two sound vertices". Let H(e) denote the set of butterfly structures containing edge e; Structural neighborhood and degree: The structural neighborhood N[u] of vertex u is the set of directly connected vertices, and the degree d[u] is the number of vertices in N[u].

[0039] Structural similarity formula: σ(u,v) is the ratio of |H(u,v)| to the corrected number of wedge paths of u and v.

[0040] Where |H(u,v)| represents the total number of butterfly structures containing edge (u,v), |∧u| represents the total number of all wedge paths starting from vertex u, and d[v] represents the number of vertices directly connected to vertex v, i.e., the degree of vertex v.

[0041] From the perspective of bipartite graph topological logic, any butterfly structure containing edge (u,v) must be composed of the edge (u,v) itself and two wedge paths that originate from u and v respectively and do not pass through the other vertex. Figure 2(a) gives an example. In terms of the definition above, the butterfly structure (u,v,w,x) is composed of the edge (u,v) and the wedge paths (u,x,w) and (v,w,x).

[0042] However, not all wedge paths starting from u and v can be paired to form a butterfly structure containing the edge (u,v). For example, in Figure 2(a), some wedge paths starting from an endpoint of an edge cannot form any butterfly structure containing that edge. Furthermore, the edge (u,v) together with two wedge paths that start from u and v and pass through each other's vertex cannot form a butterfly structure. For example, the edge (u,v) with the wedge paths (u,v,w) and (v,u,x): because each path passes through the other endpoint, they cannot form a closed butterfly structure that meets the definition and are therefore invalid paths. These invalid wedge paths must be removed during the calculation.

[0043] Based on this, the physical meaning of the formula is as follows. As shown in Figure 2(b), the total number of butterfly structures containing the target edge is the numerator, and the total number of corrected wedge paths from the two vertices, after invalid paths are eliminated, is the denominator used for normalization. The output σ(u,v) is essentially the proportion of the valid wedge paths of u and v that can be paired to form butterfly structures containing the edge (u,v); the value range of σ is [0,1]. The closer σ is to 1, the higher this proportion, indicating stronger correlation and coupling between the two vertices and a higher appearance-voice matching degree. When σ(u,v)=1, all valid wedge paths of the two vertices can form butterfly structures containing the edge (u,v), and the two vertices are strongly bound together. For example, Figure 2(a) contains an edge for which any wedge path that starts from one endpoint without passing through the other can be paired with a wedge path that starts from the other endpoint without passing through the first, so that the pair forms a butterfly structure.
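The behaviour at the two ends of the value range can be checked with a short Python sketch. It assumes the structural similarity takes the form σ(u,v) = |H(u,v)| / √((|∧u| − d[v] + 1)(|∧v| − d[u] + 1)), which is consistent with the definitions and the worked numbers in this description but is a reconstruction rather than a quoted formula. The function name `structural_similarity`, the two small graphs, and the choice to return 0 when one side has no valid wedge path are illustrative assumptions.

```python
import math

def structural_similarity(u, v, u_adj, l_adj):
    """sigma(u, v) for appearance vertex u and sound vertex v joined by an edge.

    Assumed form: number of butterflies containing (u, v), divided by the
    geometric mean of the corrected wedge counts of u and v.
    u_adj maps appearance vertices to sound vertices, l_adj the reverse.
    """
    if v not in u_adj[u]:
        raise ValueError("u and v must be directly connected")
    # Butterflies containing (u, v): another appearance vertex w on v and
    # another sound vertex x on u such that the edge (w, x) exists.
    butterflies = sum(len((u_adj[w] & u_adj[u]) - {v}) for w in l_adj[v] - {u})
    # A wedge from u through middle vertex m ends at any neighbor of m except u.
    wedges_u = sum(len(l_adj[m]) - 1 for m in u_adj[u])
    wedges_v = sum(len(u_adj[m]) - 1 for m in l_adj[v])
    # Remove the invalid wedges that pass through the other endpoint.
    valid_u = wedges_u - (len(l_adj[v]) - 1)
    valid_v = wedges_v - (len(u_adj[u]) - 1)
    if valid_u == 0 or valid_v == 0:
        return 0.0  # assumption: no valid wedge on one side means no coupling
    return butterflies / math.sqrt(valid_u * valid_v)

# Two appearance vertices sharing both sound vertices: a single butterfly.
k22_u = {"u": {"v", "x"}, "w": {"v", "x"}}
k22_l = {"v": {"u", "w"}, "x": {"u", "w"}}
print(structural_similarity("u", "v", k22_u, k22_l))    # 1.0

# Two appearance vertices sharing one sound vertex: no butterfly at all.
path_u = {"u": {"v"}, "w": {"v"}}
path_l = {"v": {"u", "w"}}
print(structural_similarity("u", "v", path_u, path_l))  # 0.0
```

Because every butterfly containing (u,v) uses one distinct valid wedge path from each endpoint, the numerator can never exceed either factor under the square root, which keeps the result within [0,1].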

[0044] This formula allows for the quantifiable calculation of the true correlation strength between appearance vertices and sound vertices. Furthermore, the structural similarity value output by this formula can be directly used for subsequent priority ranking of sound vertices and weight allocation in the final mixing process.

[0045] Sub-step 3.3: Priority sorting: Sort {L1, L2, ..., Ln} according to σ value from high to low to determine the matching priority of each sound vertex. Then, according to the preset filtering rules, remove the sound vertices with low relevance after sorting, and only retain the sound vertices with high matching degree to enter the subsequent sound generation stage. The filtering rules can be selected from either of the following two: ① Preset matching degree threshold, remove vertices with σ value lower than the threshold. The threshold can be flexibly adjusted in the range of 0.1-0.7 according to the accuracy requirements of AI short drama dubbing; ② Preset fixed number of retention, only retain the first N vertices after sorting. N can be flexibly adjusted within a reasonable range.
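The sorting and the two filtering rules can be sketched as follows. The function name `select_sound_vertices`, its parameters, and the vertex labels are hypothetical; the similarity values are placeholders for whatever sub-step 3.2 produces.

```python
def select_sound_vertices(similarities, threshold=None, top_n=None):
    """Sort candidates by sigma (descending) and apply one filtering rule.

    similarities: dict mapping sound-vertex id -> sigma
    threshold:    rule 1, drop vertices whose sigma is below this value
    top_n:        rule 2, keep only the first N vertices after sorting
    Exactly one of the two rules must be supplied.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("choose exactly one filtering rule")
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        return [(vertex, s) for vertex, s in ranked if s >= threshold]
    return ranked[:top_n]

candidates = {"L1": 0.79, "L2": 0.38, "L3": 0.77}  # illustrative values
print(select_sound_vertices(candidates, threshold=0.5))
print(select_sound_vertices(candidates, top_n=1))
```

With a threshold of 0.5 the sketch keeps L1 and L3 in that order; with N=1 it keeps only L1.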

[0046] Step 4: Generating sound mixing based on similarity weights.

[0047] Single-voice sample generation: The AI speech synthesis tool is invoked to generate the basic voice samples S1, S2, ..., Sn corresponding to the sound vertices. Taking the sound vertices sorted by priority in step 3 as the sole basis, the full-dimensional standardized voice feature combination defined in step 1 is extracted for each vertex and converted into input parameters that the AI speech synthesis tool can recognize directly. For the same line of dialogue of the target character in the AI short drama, the tool generates one pure basic voice sample without background noise for each sound vertex, in one-to-one correspondence, and all samples have exactly the same duration. All samples are uniformly preprocessed to a sampling rate of 44.1 kHz and a bit depth of 16 bit and are phase-aligned, so that the audio specifications are fully consistent.
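A minimal sketch of the format side of this preprocessing, using only the Python standard library, is shown below. It covers equalizing sample length and storing audio as mono 44.1 kHz, 16-bit PCM; resampling and phase alignment are not shown. The helper names and the sine tones that stand in for speech synthesis output are assumptions made for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # Hz, as specified in the text
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16 bit

def fit_length(samples, n_frames):
    """Trim or zero-pad so the result has exactly n_frames frames."""
    return list(samples[:n_frames]) + [0.0] * max(0, n_frames - len(samples))

def to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def write_wav(target, samples):
    """Write mono 44.1 kHz / 16-bit audio to a path or a file-like object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(to_pcm16(samples))

# Two stand-in "voices" of different lengths (sine tones, not real TTS output).
tone_a = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(4410)]
tone_b = [math.sin(2 * math.pi * 330 * t / SAMPLE_RATE) for t in range(4000)]
n_frames = max(len(tone_a), len(tone_b))
aligned = [fit_length(tone_a, n_frames), fit_length(tone_b, n_frames)]

buffer = io.BytesIO()
write_wav(buffer, aligned[1])
```

Zero-padding the shorter sample is only one way to equalize duration; the text instead has the synthesis tool produce samples of identical duration in the first place.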

[0048] Step 5: Weighted mixing: The normalized structural similarity values are used as weights, and S1 to Sn are fused using an audio mixing algorithm.

[0049] Sub-step 5.1: Normalize the structural similarity coefficients of the sound vertices output in step 3 to obtain mixing weights that sum to 1. The normalization formula is:

$$w_i=\frac{\sigma_i}{\sum_{j=1}^{n}\sigma_j}$$

[0050] In the formula, $w_i$ is the normalized mixing weight of the i-th sound vertex, $\sigma_i$ is the structural similarity coefficient between the i-th sound vertex and the target appearance vertex, and n is the total number of effective sound vertices participating in the mixing; the weights satisfy $\sum_{i=1}^{n} w_i = 1$.
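The normalization can be sketched in a few lines of Python. The function name `mixing_weights` and the input values are illustrative and are not taken from the embodiment; rejecting a non-positive sum is an added safeguard, not something the text specifies.

```python
def mixing_weights(similarities):
    """Divide each similarity by the sum so that the weights add up to 1."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("at least one similarity must be positive")
    return [s / total for s in similarities]

weights = mixing_weights([0.8, 0.4, 0.8])  # illustrative sigma values
print(weights)
print(sum(weights))
```

A vertex with twice the structural similarity of another receives twice its share of the final mix.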

[0051] Sub-step 5.2: Based on the principle of linear superposition of digital audio, and using the normalized weights, perform weighted mixing and synthesis on the sound samples S1, S2, ..., Sn to generate the final exclusive voice. The synthesis formula is as follows:

$$S(t)=\sum_{i=1}^{n} w_i\,S_i(t)$$

[0052] In the formula, $S(t)$ is the final synthesized exclusive voice audio signal at time t, $w_i$ is the normalized mixing weight of the i-th sound vertex, and $S_i(t)$ is the audio signal at time t of the single-voice sample corresponding to the i-th sound vertex. Each signal is a time series of audio sampling points. All samples have been standardized in sampling rate and phase-aligned to ensure complete synchronization of the time series.
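The linear superposition can be sketched as a sample-by-sample weighted sum. The function name `weighted_mix` and the short numeric lists standing in for audio samples are hypothetical; the sketch assumes the inputs are already time-aligned and of equal length, as the text requires.

```python
def weighted_mix(samples, weights):
    """S(t) = sum_i w_i * S_i(t) for time-aligned, equal-length samples."""
    if len(samples) != len(weights):
        raise ValueError("one weight per sample is required")
    length = len(samples[0])
    if any(len(sample) != length for sample in samples):
        raise ValueError("samples must have identical length")
    return [
        sum(w * sample[t] for w, sample in zip(weights, samples))
        for t in range(length)
    ]

s1 = [0.0, 0.5, 1.0, 0.5]  # stand-in for sample S1
s2 = [1.0, 1.0, 1.0, 1.0]  # stand-in for sample S2
mixed = weighted_mix([s1, s2], [0.75, 0.25])
print(mixed)  # [0.25, 0.625, 1.0, 0.625]
```

Because the weights are non-negative and sum to 1, each output point lies between the smallest and largest input point at that instant, so the mix cannot exceed the amplitude range of its inputs.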

[0053] The method of this invention effectively solves the problems of audio-visual dissonance, low efficiency, and homogenized timbre caused by the lack of quantitative description and objective matching standards in existing technologies. It provides a fully automated pipeline from appearance feature input, through bipartite graph matching and structural similarity calculation, to weighted mixing output, which significantly improves the audio-visual matching accuracy and generation efficiency of AI short drama character dubbing, reduces production costs and the risk of timbre homogenization, and enhances the adaptability and engineering practical value of the method in industrialized mass production scenarios.

[0054] In addition, this embodiment takes AI short drama character dubbing generation as the scenario, uses 50 groups of qualifying matched female character samples, and takes the female lead's appearance features customized by a short drama platform as input, to verify and illustrate the entire process of the method of the present invention. First, the pre-parameter setup and sample aggregation (corresponding to steps 1 and 2) specifically includes: Constructing a multi-dimensional feature system: appearance features and voice features are divided strictly according to the settings of step 1 of this invention; Feature aggregation and vertex generation: following the principle of "complete consistency of all-dimensional features", the 50 groups of samples are aggregated and grouped, with the minimum effective sample size threshold for a single group set to 5. The initial aggregation yields 11 appearance groups and 9 voice groups; after invalid groups with fewer than 5 samples are removed, 6 effective appearance vertices and 4 effective voice vertices are obtained; Bipartite graph construction: a bipartite graph is constructed from the obtained appearance and voice vertices, as shown in Figure 2(a), with 16 effective associated edges between the vertices.
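The aggregation and graph construction described above can be sketched as follows, assuming each sample is reduced to a pair of hashable feature combinations for one subject. The function name `build_bipartite_graph`, the feature tuples, and the timbre labels are hypothetical illustrations, not data from the embodiment.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # minimum effective sample size per group, as in the text

def build_bipartite_graph(samples, min_size=MIN_GROUP_SIZE):
    """Aggregate samples into vertices and connect them by shared subject.

    samples: list of (appearance_features, voice_features), one per subject;
             samples with identical features fall into the same group.
    Returns {appearance_vertex: set(voice_vertices)} for valid vertices only.
    """
    appearance_groups = defaultdict(int)
    voice_groups = defaultdict(int)
    for appearance, voice in samples:
        appearance_groups[appearance] += 1
        voice_groups[voice] += 1

    # Groups below the threshold are discarded before vertices are created.
    valid_appearance = {k for k, n in appearance_groups.items() if n >= min_size}
    valid_voice = {k for k, n in voice_groups.items() if n >= min_size}

    graph = {appearance: set() for appearance in valid_appearance}
    for appearance, voice in samples:
        if appearance in valid_appearance and voice in valid_voice:
            graph[appearance].add(voice)  # unweighted edge, same subject
    return graph

tall = ("166-175cm", "normal", "young adult")   # illustrative appearance features
short = ("<=155cm", "normal", "young adult")
bright = ("female", "bright", "250-500Hz")      # illustrative voice features
warm = ("female", "warm", "150-250Hz")

samples = ([(tall, bright)] * 3 + [(tall, warm)] * 3
           + [(short, warm)] * 2 + [(short, bright)] * 2)
graph = build_bipartite_graph(samples)
print(graph)
```

Here the `short` appearance group has only 4 samples and is removed, while `tall`, `bright`, and `warm` each reach the threshold, leaving one appearance vertex connected to two voice vertices.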

[0055] Second, target matching and similarity calculation (corresponding to step 3), specifically including: Target vertex matching: obtain the appearance features of the input female lead and match the target appearance vertex according to the principle of complete feature matching; three sound vertices are directly connected to it. Structural similarity calculation: calculate the structural similarity between the target appearance vertex and each of the three sound vertices, giving 0.79, 0.38, and 0.77 respectively. The calculation process is illustrated using the sound vertex with structural similarity 0.38 as an example: 1. First, obtain the butterfly structures containing the edge between the two vertices; there are 2 in total. 2. Obtain the wedge paths starting from the target appearance vertex, 10 in total, and the wedge paths starting from the sound vertex, 5 in total; the degrees of the two vertices are 3 and 2 respectively. 3. Substitute these values into the structural similarity formula of step 3: σ = 2 / √((10 − 2 + 1) × (5 − 3 + 1)) = 2 / √27 ≈ 0.38.
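The counts and similarity values in this paragraph can be reproduced with a short Python sketch. The edge list below is hypothetical: it was chosen to have 6 appearance vertices, 4 sound vertices, and 16 edges like the embodiment, but it is not a transcription of Figure 2(a), and the vertex names are invented. The similarity is computed with the assumed form σ(u,v) = |H(u,v)| / √((|∧u| − d[v] + 1)(|∧v| − d[u] + 1)).

```python
import math

# Hypothetical graph: appearance vertices u, w, a, b, c, e and
# sound vertices v, x, y, z, with 16 edges in total.
EDGES = [
    ("u", "v"), ("u", "x"), ("u", "y"),
    ("w", "v"), ("w", "x"), ("w", "y"), ("w", "z"),
    ("a", "x"), ("a", "y"),
    ("b", "x"), ("b", "y"),
    ("c", "x"), ("c", "y"), ("c", "z"),
    ("e", "x"), ("e", "z"),
]

adj = {}
for appearance, sound in EDGES:
    adj.setdefault(appearance, set()).add(sound)
    adj.setdefault(sound, set()).add(appearance)

def wedge_count(start):
    """Number of wedge paths (start, mid, end) with end != start."""
    return sum(len(adj[mid]) - 1 for mid in adj[start])

def butterfly_count(u, v):
    """Number of butterflies (u, v, w, x) containing the edge (u, v)."""
    return sum(len((adj[u] & adj[w]) - {v}) for w in adj[v] - {u})

def sigma(u, v):
    """Assumed structural similarity with invalid wedge paths removed."""
    valid_u = wedge_count(u) - (len(adj[v]) - 1)
    valid_v = wedge_count(v) - (len(adj[u]) - 1)
    return butterfly_count(u, v) / math.sqrt(valid_u * valid_v)

# Counts for the edge (u, v): butterflies, wedges from u, wedges from v, degrees.
print(butterfly_count("u", "v"), wedge_count("u"), wedge_count("v"),
      len(adj["u"]), len(adj["v"]))                      # 2 10 5 3 2
print({s: round(sigma("u", s), 2) for s in ("x", "v", "y")})
# {'x': 0.79, 'v': 0.38, 'y': 0.77}
```

On this graph the edge (u,v) has 2 butterflies, 10 and 5 wedge paths, and degrees 3 and 2, and the three edges of u yield 0.79, 0.38, and 0.77, matching the figures reported in the embodiment.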

[0056] Priority sorting: The three sound vertices are sorted by structural similarity value from highest to lowest, giving the order 0.79, 0.77, 0.38. A threshold of 0.2 was set; none of the three structural similarity values is below the threshold, so all three vertices enter the subsequent sound generation process.

[0057] Third, sound mixing generation (corresponding to step 4), specifically including: Single-sound sample generation: For the same 10-second AI short drama dialogue, pure single-sound samples are generated and bound one-to-one for the three sound vertices. They are uniformly preprocessed to a sampling rate of 44.1kHz and a bit depth of 16bit to ensure complete synchronization of duration and phase.

[0058] Fourth, weighted mixing (corresponding to step 5), specifically including: Weight normalization: the three structural similarity values are normalized to obtain mixing weights of 0.41, 0.19, and 0.40 respectively; Mixing synthesis: based on the linear weighted mixing formula, the three single-voice samples are weighted and superimposed to output an exclusive voice audio that closely matches the target appearance.

[0059] Further analysis reveals that the core application scenario of this invention is the industrialized content production of AI short dramas. Currently, this scenario faces three major unresolved industry pain points: Low audio-visual matching: Existing technologies mostly rely on manual matching of timbre or fixed linear rules, depending on subjective human experience. This results in insufficient matching between the character's appearance and the voice acting, leading to poor audience immersion and a high rate of viewers abandoning the show; Extremely low production efficiency: The average time for voice acting for a single character exceeds 2 hours, from timbre selection, audition, recording to post-production sound mixing, which cannot meet the industrialized mass production needs of short dramas that require "dozens of episodes per day"; High cost and severe homogenization: The cost of manual voice acting for a single character exceeds 300 yuan, and the homogenization rate of general TTS fixed timbres exceeds 90%, resulting in low content recognizability and high copyright risks.

[0060] Moreover, this experimental solution can be directly applied to the industrialized mass dubbing production of AI short dramas. It only requires inputting a description of the character's appearance to automatically generate a highly matched and directly usable character dubbing, replacing the tedious process of manually selecting timbre, auditioning, and adjusting the sound.

[0061] In addition, the entire dubbing process for a single character in this experiment only took a few minutes, with a high degree of audio-visual matching. Compared with traditional manual dubbing, which takes more than 2 hours and costs more than 300 yuan per character, this solution improves the generation efficiency by more than 93%, reduces the production cost by 99%, and has no problem of homogenized timbre.

[0062] In summary, the application scenarios of the method in this embodiment of the invention achieve objective and accurate matching of appearance and voice through the quantitative calculation of similarity of bipartite graph butterfly structure, solving the pain points of existing technologies that rely on subjective human matching and sound-image inconsistency; at the same time, by generating exclusive timbre through weighted mixing, it not only ensures the matching degree, but also achieves the uniqueness of the character's voice, which is fully adapted to the industry demand for rapid mass production of AI short dramas.

[0063] To implement the above method, as shown in Figure 3, this embodiment also provides an appearance-sound generation device 10 based on bipartite graph structural similarity, the device 10 including: The feature aggregation and mapping module 100, which is used to acquire multi-source sample data, construct a multi-dimensional feature system containing a standardized appearance feature set and a standardized voice feature set, aggregate samples with completely consistent appearance features into appearance vertices and samples with completely consistent voice features into voice vertices based on the multi-source sample data, and construct an appearance-voice bipartite graph according to the correlation relationship of the same subject in the samples.

[0064] The target mapping module 200 is used to receive the appearance description data of the target person whose voice is to be generated, map it to a target appearance vertex in the appearance-voice bipartite graph, and extract all candidate sound vertices directly connected to that target appearance vertex to form a candidate sound vertex set.

[0065] The similarity filtering module 300 is used to calculate the structural similarity coefficient between the target appearance vertex and each candidate sound vertex based on the bipartite graph topology. The coefficient represents the ratio of the number of butterfly structures containing two vertices to the number of wedge paths after invalid path correction. The candidate sound vertices are prioritized and filtered according to the corresponding coefficient to obtain the set of valid sound vertices.

[0066] The mixing and synthesis module 400 is used to call the speech synthesis tool to generate basic sound sample data based on the sound features corresponding to each effective sound vertex. The normalized structural similarity coefficient is used as the mixing weight to perform weighted fusion processing on the basic sound sample data to obtain the exclusive sound data of the target person.

[0067] The device of this invention effectively solves the problems of audio-visual disharmony, low efficiency and homogenization of timbre caused by the lack of quantitative matching standards in the prior art. It realizes the fully automated generation from the input of appearance features to the output of exclusive voice, which significantly improves the audio-visual matching degree and generation efficiency, reduces production costs, and enhances its adaptability and engineering practical value in the industrialized mass production of AI short dramas.

[0068] To implement the methods of the above embodiments, the present invention also provides a computer device. As shown in Figure 4, the computer device 600 includes a memory 601 and a processor 602; the processor 602 reads the executable program code stored in the memory 601 and runs the program corresponding to that code, so as to implement the steps of the appearance and sound generation method based on bipartite graph structure similarity described above.

[0069] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0070] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

Claims

1. A method for generating appearance-based voices based on bipartite graph structural similarity, characterized in that the method includes: acquiring multi-source sample data, constructing a multi-dimensional feature system containing a standardized appearance feature set and a standardized voice feature set, aggregating samples with completely consistent appearance features into appearance vertices and samples with completely consistent voice features into voice vertices based on the multi-source sample data, and constructing an appearance-voice bipartite graph according to the correlation relationship of the same subject in the samples; receiving the appearance description data of the target person whose voice is to be generated, mapping it to a target appearance vertex in the appearance-voice bipartite graph, and extracting all candidate voice vertices directly connected to the target appearance vertex to form a candidate voice vertex set; calculating, based on the bipartite graph topology, the structural similarity coefficient between the target appearance vertex and each candidate sound vertex, the coefficient representing the ratio of the number of butterfly structures containing the two vertices to the number of wedge paths after invalid path correction, and prioritizing and filtering the candidate sound vertices according to the corresponding coefficients to obtain a set of valid sound vertices; and invoking a speech synthesis tool to generate basic sound sample data based on the sound features corresponding to each valid sound vertex, and using the normalized structural similarity coefficients as mixing weights to perform weighted fusion processing on the basic sound sample data to obtain the exclusive voice data of the target person.

2. The method as described in claim 1, characterized in that constructing the multi-dimensional feature system includes: determining the core feature set of appearance, which includes quantitative features and qualitative features, wherein the quantitative features include the height range, the BMI classification of body type, and the age stage division, and the qualitative features include regional attributes, facial temperament, body shape characteristics, and gender; and determining the core feature set of the voice, which includes the gender attribute of the voice, the timbre category, the pitch frequency range, the speech rate range, and the intensity level; wherein the height range is divided into four levels: ≤155cm, 156-165cm, 166-175cm, and ≥176cm; the BMI classification is divided into four categories: underweight, normal, overweight, and obese; the pitch frequency range is divided into three ranges: 250-500Hz high frequency, 150-250Hz mid frequency, and 80-150Hz low frequency; and the speech rate range is divided into three ranges: ≤150 words/minute soothing, 150-220 words/minute moderate, and ≥220 words/minute rapid.

3. The method as described in claim 1, characterized in that aggregating the appearance vertices and sound vertices includes: collecting multi-source sample data containing appearance feature labels, voice feature labels, and voice samples; grouping samples with identical appearance features into one appearance group and samples with identical voice features into one voice group; setting the minimum effective threshold for the number of samples in a single feature combination to 5, and removing appearance groups and voice groups whose sample size is lower than the threshold; and generating a unique appearance vertex for each retained valid appearance group and a unique sound vertex for each retained valid sound group.

4. The method as described in claim 1, characterized in that constructing the appearance-voice bipartite graph includes: constructing an undirected bipartite graph G=(U, L, E), where U is the set of appearance vertices, L is the set of sound vertices, and E is the set of edges; and traversing all samples and, when the samples corresponding to a certain appearance vertex and a certain sound vertex originate from the same subject, establishing an edge in E between the two vertices, so as to form a network topology in which a single appearance vertex connects to multiple sound vertices and a single sound vertex connects to multiple appearance vertices, wherein the initially established edges are not assigned weights.

5. The method as described in claim 1, characterized in that calculating the structural similarity coefficient includes: defining a wedge structure as a three-vertex path starting from vertex u, passing through vertex v, and reaching vertex w, and defining a butterfly structure as a four-vertex closed path containing the edges (u,v), (u,x), (w,v), and (w,x); obtaining the edge between the target appearance vertex u and the candidate sound vertex v, and counting the total number of butterfly structures containing this edge, |H(u,v)|; counting the total number of wedge paths starting from u, |∧u|, and the total number of wedge paths starting from v, |∧v|, and obtaining the degree d[u] of vertex u and the degree d[v] of vertex v, so as to determine the number of wedge paths starting from u that do not pass through v and the number of wedge paths starting from v that do not pass through u; and calculating the structural similarity coefficient σ(u,v) using the following formula: $\sigma(u,v)=\dfrac{|H(u,v)|}{\sqrt{(|\wedge_u|-d[v]+1)(|\wedge_v|-d[u]+1)}}$, wherein the value range of σ(u,v) is [0,1].

6. The method as described in claim 1, characterized in that prioritizing and filtering the candidate sound vertices includes: sorting the candidate sound vertex set from high to low according to the corresponding structural similarity coefficient values to generate a matching priority sequence; when executing the filtering rules, if the preset matching degree threshold method is used, removing vertices whose σ values are below the set threshold, the threshold being in the range of 0.1 to 0.7, and if the preset fixed retention number method is used, retaining only the first N vertices after sorting; and taking the candidate sound vertices remaining after filtering as the set of valid sound vertices.

7. The method as described in claim 1, characterized in that generating the basic sound samples includes: extracting the full-dimensional standardized sound feature combination corresponding to each valid sound vertex and converting it into input parameters that can be recognized by the speech synthesis tool; for the same dialogue of the target character, generating a pure basic sound sample with no background noise for each valid sound vertex, and ensuring that all samples have the same duration; and uniformly preprocessing all basic sound samples to a sampling rate of 44.1kHz and a bit depth of 16bit and performing phase alignment, to obtain a set of basic sound samples with completely synchronized time series.

8. The method as described in claim 1, characterized in that the weighted fusion processing includes: normalizing the structural similarity coefficients corresponding to each valid sound vertex to calculate mixing weights whose sum is 1, the normalization formula being $w_i=\sigma_i / \sum_{j=1}^{n}\sigma_j$, wherein $w_i$ is the normalized mixing weight of the i-th valid sound vertex, $\sigma_i$ is its structural similarity coefficient, and n is the total number of valid sound vertices; and, based on the principle of linear superposition of digital audio, performing weighted mixing and synthesis of the basic sound samples using the synthesis formula $S(t)=\sum_{i=1}^{n} w_i\,S_i(t)$, and outputting $S(t)$ as the final exclusive voice audio signal at time t, wherein $S_i(t)$ is the audio signal of the i-th sample at time t.

9. A device for generating appearance and sound based on bipartite graph structural similarity, characterized in that the device includes: a feature aggregation and mapping module, used to acquire multi-source sample data, construct a multi-dimensional feature system containing a standardized appearance feature set and a standardized voice feature set, aggregate samples with completely consistent appearance features into appearance vertices and samples with completely consistent voice features into voice vertices based on the multi-source sample data, and construct an appearance-voice bipartite graph according to the correlation relationship of the same subject in the samples; a target mapping module, used to receive the appearance description data of the target person whose voice is to be generated, map it to a target appearance vertex in the appearance-voice bipartite graph, and extract all candidate voice vertices directly connected to the target appearance vertex to form a candidate voice vertex set; a similarity filtering module, used to calculate, based on the bipartite graph topology, the structural similarity coefficient between the target appearance vertex and each candidate sound vertex, the coefficient representing the ratio of the number of butterfly structures containing the two vertices to the number of wedge paths after invalid path correction, and to prioritize and filter the candidate sound vertices according to the corresponding coefficients to obtain a set of valid sound vertices; and an audio mixing and synthesis module, used to invoke a speech synthesis tool to generate basic audio sample data based on the audio features corresponding to each valid sound vertex, and to use the normalized structural similarity coefficients as mixing weights to perform weighted fusion processing on the basic audio sample data to obtain the exclusive audio data of the target person.

10. An electronic device, characterized in that the device includes: a processor; and a memory storing executable instructions; wherein, when the processor executes the instructions, it implements the appearance and sound generation method based on bipartite graph structure similarity as described in any one of claims 1-8.