Neural network-based motion capture actor fitness evaluation method and system

By using a neural network-based approach, character descriptions are transformed into action style vectors. Combined with action structure generation and a two-branch neural network, the actor's action fit is evaluated, solving the quantitative problem of actor action evaluation in existing technologies and achieving efficient and objective actor-role matching evaluation.

CN121170906B · Active · Publication Date: 2026-06-26 · GUANGZHOU PANGU CULTURE COMM CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGZHOU PANGU CULTURE COMM CO LTD
Filing Date
2025-11-12
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies cannot effectively assess whether an actor's movements meet the requirements of a character's style. Without quantitative standards, the matching of actors and characters relies on human experience. Furthermore, existing methods struggle to model stylistic aspects of movement, such as rhythm, amplitude, and stability, in fine detail, resulting in evaluation bias.

Method used

A neural network-based approach is adopted, which maps character description text to a low-dimensional style vector space through a text encoder and a Transformer model. By combining a motion structure generation network and a two-branch neural network, style and structural features are extracted from the actor's motion capture data. A unified scoring model is used to calculate the motion fit between the actor and the character, and a learnable mapping matrix is introduced for style calibration.

Benefits of technology

It achieves a closed-loop process from character description to actor evaluation, improves the objectivity and interpretability of the evaluation, and quantifies the rationality and conformity of the actor's actions, thus meeting the actual needs of film and television production and virtual human performance.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a motion capture actor fitness evaluation method and system based on a neural network, which comprises the following steps: based on an input role description text, extracting a semantic representation of the text, and mapping the semantic representation to obtain a role style vector; based on the role style vector, converting the role style vector into target motion structure features through a motion structure generation network; based on motion capture data of an actor, simultaneously extracting an actor style vector and an actor structure feature from the motion capture data through a double-branch neural network; and based on the role style vector, the target motion structure features, the actor style vector and the actor structure feature, calculating a motion fitness score between the actor and the role through a unified scoring model for fusing style and structure.

Description

Technical Field

[0001] This invention belongs to the field of digital film and television, and in particular relates to a method and system for evaluating the suitability of motion capture actors based on neural networks.

Background Technology

[0002] With the widespread application of motion capture technology in film and television production, virtual human-driven production, and digital performance, the need to evaluate whether an actor is suitable for a specific role's movements has become increasingly prominent. Traditional motion capture systems can collect skeletal trajectory data with high precision, but lack the ability to understand the character's style. They can usually only judge whether the movement is completed, but cannot answer whether the actor's movement conforms to the character's style requirements. In actual film and television creation, directors often define the character's movement style through script descriptions, verbal instructions, etc., such as requiring movements to be "flamboyant and powerful" or "calm and restrained." However, existing technology cannot directly convert such textual descriptions into comparable numerical expressions, causing the matching of actors and roles to rely entirely on human experience, which is highly subjective and lacks quantitative standards. On the other hand, most existing motion recognition and classification methods are limited to identifying categories or reconstructing trajectories, and cannot perform fine-grained modeling of stylistic aspects such as the rhythm, amplitude, and stability of movements, thus failing to conduct structured comparisons between characters and actors. Furthermore, existing research often models style and structure separately, lacking a unified computational framework to measure whether an actor's stylistic performance and structural execution both meet the character's requirements, which can easily lead to evaluation bias. 
Therefore, existing technologies have prominent problems in assessing actor movement suitability, such as insufficient understanding of character style, limited expression of structural features, and lack of a mechanism for evaluating the integration of style and structure, making it difficult to meet the actual needs of digital film and television production and intelligent character performance.

Summary of the Invention

[0003] The purpose of this invention is to provide a method and system for evaluating the suitability of motion capture actors based on neural networks. The method converts a character description into comparable numerical style and structure representations and quantifies how well an actor's movements fit the character, addressing the problems described above.

[0004] To achieve the above objectives, a method for evaluating the fit of motion capture actors based on neural networks is provided in a first aspect of the present invention, the method comprising:

[0005] Based on the input character description text, the semantic representation of the text is extracted through a text encoder and a Transformer model, and the semantic representation is mapped to a low-dimensional style vector space using a linear mapping layer to obtain the character style vector.

[0006] Based on the character style vector, the character style vector is converted into target action structure features through an action structure generation network. The action structure generation network includes multiple fully connected layers and activation functions, and outputs multi-dimensional structural features representing action rhythm, amplitude, inertia, control stability, trunk stability and limb symmetry.

[0007] Based on the motion capture data of the actors, a dual-branch neural network is used to simultaneously extract the actor style vector and the actor structural features from the motion capture data. The dual-branch neural network includes a temporal structure branch and a semantic style branch. The temporal structure branch is used to extract the actor structural features, and the semantic style branch is used to extract the actor style vector. The dual-branch neural network enhances the correlation between the two branches through a style structure consistency regularization term.

[0008] Based on the character style vector, the target action structure features, the actor style vector, and the actor structure features, a unified scoring model that integrates style and structure is used to calculate the action fit score between the actor and the character. The unified scoring model uses a weighted combination of style bias and structure bias, and introduces a learnable mapping matrix to perform style calibration on the actor structure features in order to minimize the collaborative bias between style and structure.

[0009] Furthermore, the character style vector extraction steps specifically include: segmenting the input character description text to generate a word fragment sequence; mapping the word fragment sequence into a high-dimensional vector embedding sequence; inputting the vector embedding sequence into a multi-layer Transformer encoder for contextual semantic modeling, and extracting the context-related representation of the text through a multi-layer self-attention mechanism and a feedforward neural network; extracting the representation of the first special position token output by the encoder as the semantic summary vector of the entire text; projecting the semantic summary vector onto a low-dimensional style vector space through a linear mapping layer, and using an activation function to enhance the non-linear expressive power, finally obtaining the character style vector.

[0010] Furthermore, the training process for character style vectors uses expert-annotated weak labels as regression targets, including multiple dimensions such as action tension, rhythm speed, control precision, and body openness, and optimizes the projection process through the mean squared error loss function.

[0011] Furthermore, the action structure generation network is trained with a style-sensitive regularization term, which is calculated from the Euclidean distance between the structural prediction outputs of different samples in the training batch. Its discriminative power is controlled by an exponential function and a hyperparameter, so as to increase the structural feature spacing between different character styles.

[0012] Furthermore, the training data for the motion structure generation network comes from real actor performance clips captured by the motion capture system. Skeletal coordinates are mapped to motion feature parameters through the skeleton topology and normalized to the 0 to 1 range.

[0013] Furthermore, the temporal structure branch divides the joints into three categories: upper limbs, lower limbs, and trunk, and models them separately. It also combines the sliding window module to extract rhythm and control indicators.

[0014] Furthermore, the style structure consistency regularization term maps the actor's structural features to a 128-dimensional space and aligns them with the actor's style vector by cosine similarity, encouraging the higher-order representation of the structural features to be consistent in direction with the style representation.

[0015] Furthermore, the learnable mapping matrix can be subjected to a structural sparsity regularization term during training to automatically identify and ignore low-correlation structural dimensions, thereby improving the generalization ability of the scoring model.

[0016] Furthermore, the unified scoring model calculates the fit score in the following way: first, the squared Euclidean distance between the character style vector and the actor style vector is calculated as the style bias; then, the squared Euclidean distance between the target action structural features and the actor structural features after linear transformation by the learnable mapping matrix is calculated as the structural bias; next, the structural bias is weighted by a weighting coefficient; finally, the style bias is added to the weighted structural bias, and the fit score is obtained as 1 minus this sum.

[0017] A second aspect of the present invention provides a neural network-based motion capture actor fit evaluation system, the system comprising:

[0018] The semantic vector module is used to extract the semantic representation of the input character description text through a text encoder and a Transformer model, and then use a linear mapping layer to map the semantic representation to a low-dimensional style vector space to obtain the character style vector.

[0019] The structural feature module is used to convert the character style vector into target action structure features through an action structure generation network based on the character style vector. The action structure generation network includes multiple fully connected layers and activation functions, and outputs multi-dimensional structural features representing action rhythm, amplitude, inertia, control stability, trunk stability and limb symmetry.

[0020] A style modeling module is used to extract actor style vectors and actor structural features from actor motion capture data simultaneously using a dual-branch neural network. The dual-branch neural network includes a temporal structure branch and a semantic style branch. The temporal structure branch is used to extract the actor structural features, and the semantic style branch is used to extract the actor style vector. The dual-branch neural network enhances the correlation between the two branches through a style structure consistency regularization term.

[0021] The adaptation evaluation module is used to calculate the action adaptation score between the actor and the role based on the character style vector, the target action structure feature, the actor style vector, and the actor structure feature, through a unified scoring model that integrates style and structure. The unified scoring model uses a weighted combination of style deviation and structure deviation, and introduces a learnable mapping matrix to perform style calibration on the actor structure feature in order to minimize the collaborative deviation between style and structure.

[0022] The beneficial technical effects of the present invention are at least as follows:

[0023] To address the aforementioned issues, this invention provides a method and system for evaluating the fit of motion capture actors based on neural networks. First, it maps unstructured character descriptions to a motion style vector space through semantic modeling, bridging the director's intent with numerical expression. Second, it transforms style vectors into specific target action structural features using neural networks, allowing style semantics to be grounded in comparable physical indicators such as rhythm, amplitude, and stability. Third, it extracts both style vectors and structural features from the actor's motion capture data simultaneously through dual-branch modeling, and designs style and structure consistency constraints to ensure mutual correspondence between the two levels of expression. Finally, in the fit calculation stage, it proposes a scoring function based on minimizing style-structure co-existence deviation, introducing a learnable structural mapping matrix to model the asymmetric requirements of the character for different structural dimensions, thus judging not only whether the action is "like" but also whether it is "reasonable." Through these innovations, this invention constructs a complete closed-loop process from character description to actor scoring, improving the objectivity and interpretability of the evaluation while meeting the practical needs of film and television production and virtual human performance for actor motion fit.

Attached Figure Description

[0024] The present invention will be further described with reference to the accompanying drawings, but the embodiments in the drawings do not constitute any limitation on the present invention. For those skilled in the art, other drawings can be obtained based on the following drawings without creative effort.

[0025] Figure 1 is a flowchart of the motion capture actor suitability evaluation method based on neural networks according to the present invention.

[0026] Figure 2 is a framework diagram of the motion capture actor suitability evaluation system based on neural networks according to the present invention.

Detailed Implementation

[0027] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.

[0028] In one or more embodiments, as shown in Figure 1, a method for evaluating the fit of motion capture actors based on neural networks is disclosed, the method comprising the following:

[0029] S1: Based on the input character description text, the semantic representation of the text is extracted through a text encoder and a Transformer model, and the semantic representation is mapped to a low-dimensional style vector space using a linear mapping layer to obtain the character style vector;

[0030] Specifically, this step aims to transform the character's text description into a structured motion style vector to support subsequent target style modeling and motion matching evaluation. The input is the character description text, which is typically derived from director's settings, script passages, or acting instructions and is presented as natural language phrases or paragraphs. Examples include: "decisive actions, fast pace, strong physical tension" or "restrained demeanor, delicate steps, and a more introverted style." This text data is input into the system in UTF-8 encoding via a built-in text interface. The system first performs word segmentation using the standard BERT WordPiece tokenizer, encoding the original text into a token sequence, which is then mapped to a vector embedding sequence $E = [e_1, e_2, \dots, e_n]$, where each $e_i$ is the embedding vector of one word piece and $n$ is the number of tokens after word segmentation.

[0031] The entire text embedding sequence is input into a 12-layer Transformer encoder for context modeling, with each layer containing a multi-head attention mechanism and a feedforward sublayer. The model uses a BERT-Base configuration with 768 hidden dimensions and 12 attention heads. During encoding, the final representation of the first special token [CLS], denoted $h_{\mathrm{CLS}}$, is used as an aggregate representation of the entire text, serving as its semantic summary. To project this representation into a low-dimensional space for action style comparison, the system uses a trainable linear mapping matrix $W_s$ to map $h_{\mathrm{CLS}}$ to the style vector space. The final character style vector $s_c$ is defined as follows:

[0032] $s_c = \mathrm{ReLU}(W_s \, h_{\mathrm{CLS}})$;

[0033] where $s_c \in \mathbb{R}^{128}$ is the vector representing the character's action style and is the core input for subsequent style modeling and fit calculation; $W_s \in \mathbb{R}^{128 \times 768}$ is the style mapping matrix, which is obtained through supervised training during the model training phase; and $h_{\mathrm{CLS}}$ is the semantic vector generated by the BERT model from the input text, with a dimension of 768. The ReLU activation function introduces non-linearity to enhance expressive power, while ensuring the non-negativity of the style vector to facilitate subsequent vector space comparison with action data.
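The projection described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the names `project_style`, `h_cls`, and `W_s` are hypothetical, the 768-dimensional input is a random stand-in for a real BERT [CLS] vector, and the matrix is randomly initialized rather than trained.

```python
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def project_style(h_cls, W_s):
    """Character style vector: ReLU applied to the linear projection W_s @ h_cls."""
    return [max(0.0, v) for v in matvec(W_s, h_cls)]


# Hypothetical stand-ins: a random 768-d "[CLS]" vector and an untrained 128x768 matrix.
rng = random.Random(0)
h_cls = [rng.gauss(0.0, 1.0) for _ in range(768)]
W_s = [[rng.gauss(0.0, 0.02) for _ in range(768)] for _ in range(128)]

s_c = project_style(h_cls, W_s)
print(len(s_c), min(s_c) >= 0.0)  # 128 True
```

The ReLU guarantees that every component of the resulting 128-dimensional vector is non-negative, which is the property the text relies on for later comparison with action data.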

[0034] To enhance the semantic alignment capability of style mapping, weak labels annotated by experts are introduced as regression targets during the training phase to guide the model in mapping texts describing different styles to their relative positions in a predefined style space. The labels cover six dimensions, including "action tension," "pacing," "control precision," and "body openness." The system uses a mean squared error loss function to optimize this projection process, ensuring that the output style vectors are measurable and comparable. Taking the text "B: compact actions, even rhythm" as an example, the value of the final generated style vector $s_c$ in the "control precision" dimension is significantly higher than that in the "tension" dimension, reflecting the "stable and introverted" attributes in the character's style setting.

[0035] S2: Based on the character style vector, the character style vector is converted into target action structure features through an action structure generation network, wherein the action structure generation network includes multiple fully connected layers and activation functions, and outputs multi-dimensional structural features representing action rhythm, amplitude, inertia, control stability, trunk stability and limb symmetry.

[0036] Specifically, this step uses the character style vector $s_c$ output by the previous module to construct structured style metrics for the character at the action execution level, i.e., the target action structure features $f_c$. These structural features do not represent a complete sequence of actions, but rather express the character's "preferences" or "constraints" in specific action dimensions, such as tempo, amplitude, and control coherence. These structural indicators serve as the structural reference for the subsequent evaluation of the actor's action fit, where they are compared dimension by dimension with the corresponding structural features extracted from the actor's motion capture data. Therefore, this step is not only a mapping bridge from the character's semantic vector to the action-level expression, but also a key hub for the entire system to achieve style understanding and structural comparability.

[0037] The input is the character style vector $s_c$ output from the previous step. This vector, extracted by BERT-Base and the linear mapping module, has a stable dimensional structure and rich semantic style expression capabilities. To transform this vector into target structural features, we designed an action structure generation network, which consists of two parts: a main mapping network (the backbone of structure prediction) and a style preservation regularization term (a structural style stability control mechanism). The entire network adopts a three-layer fully connected structure. The first layer is the input transformation layer, which reduces the dimension from 128 to 64; the second layer is the feature combination layer, which keeps the dimension at 64; and the third layer is the output layer, with a dimension of 6, corresponding to six core structural dimensions: rhythm frequency, joint amplitude, movement inertia, control stability, trunk stability, and limb symmetry. These dimensions are derived from the language directors use to control performance style in film and television production and from statistical induction over actual Mocap data, which makes them practical from an engineering standpoint.

[0038] The main formula for structural prediction is as follows:

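[0039] $f_c = \sigma\big(W_3 \, \phi(W_2 \, \phi(W_1 s_c + b_1) + b_2) + b_3\big)$;

[0040] where $W_1$, $W_2$, and $W_3$ are the weight matrices of the three fully connected layers; $b_1$, $b_2$, and $b_3$ are the corresponding bias terms; $\phi$ is the hidden-layer activation function; and $\sigma$ is the Sigmoid activation function, which normalizes the output to the $[0, 1]$ interval. The output $f_c \in \mathbb{R}^{6}$ represents the target structural features, and the meanings of each dimension are as follows: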

[0041] First dimension: rhythm of movement per unit time (Hz);

[0042] Second dimension: angular variation of major joints (standard deviation);

[0043] Third dimension: rate of change of average acceleration (inertia);

[0044] Fourth dimension: standard deviation of velocity change between frames during action execution (control stability);

[0045] Fifth dimension: fluctuation range of trunk posture angle on the time axis (trunk stability);

[0046] Sixth dimension: coordination index of left and right limb movements (symmetry).
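The three-layer mapping from a 128-dimensional style vector to the six structural dimensions can be sketched as follows. This is a minimal illustration, not the patented implementation: the hidden activation is assumed to be ReLU (the text only says "activation function"), the weights are random and untrained, and the names `target_structure`, `init_layer`, and `DIMENSIONS` are hypothetical.

```python
import math
import random

DIMENSIONS = ["rhythm", "amplitude", "inertia",
              "control_stability", "trunk_stability", "limb_symmetry"]


def linear(W, b, x):
    """Fully connected layer: W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases for one layer."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, [0.0] * n_out


def target_structure(s_c, layers):
    """128 -> 64 -> 64 -> 6, with a Sigmoid output in [0, 1]."""
    (W1, b1), (W2, b2), (W3, b3) = layers
    h = relu(linear(W1, b1, s_c))   # assumed hidden activation: ReLU
    h = relu(linear(W2, b2, h))
    return [sigmoid(v) for v in linear(W3, b3, h)]


rng = random.Random(0)
layers = [init_layer(rng, 128, 64), init_layer(rng, 64, 64), init_layer(rng, 64, 6)]
s_c = [abs(rng.gauss(0.0, 1.0)) for _ in range(128)]  # stand-in non-negative style vector
f_c = target_structure(s_c, layers)
for name, value in zip(DIMENSIONS, f_c):
    print(f"{name}: {value:.3f}")
```

Because the output layer uses a Sigmoid, each of the six values lies in the 0 to 1 range and can be compared directly with normalized actor features.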

[0047] In actual training, we found that using only the standard MSE loss leads to structure prediction results that lack style discriminativeness, so that different style vectors are mapped to nearly identical structural outputs. Therefore, we introduce a style-sensitive regularization term, which is used to enhance the separability of the output structure between different style vectors and is specifically defined as follows:

[0048] $L_{\mathrm{sep}} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \exp\big(-\gamma \, \lVert f_c^{(i)} - f_c^{(j)} \rVert_2\big)$;

[0049] where $N$ is the number of samples in a training batch; $f_c^{(i)}$ and $f_c^{(j)}$ are the structural prediction outputs for the $i$-th and $j$-th samples; and $\gamma$ is a hyperparameter that controls the discriminative power, typically set to 3.0. Introducing this regularization term effectively increases the structural feature spacing between different character styles, making the network more sensitive to stylistic detail differences, thereby improving the discriminative power of subsequent action matching scoring.
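One plausible reading of this regularizer is a mean, over all ordered pairs of distinct samples in the batch, of an exponential of the negative scaled Euclidean distance between their structural outputs. The sketch below follows that reading; the pairwise averaging and the function name `style_separation_penalty` are assumptions, since the text only states that the term uses Euclidean distances, an exponential function, and a hyperparameter (typically 3.0).

```python
import math


def style_separation_penalty(outputs, gamma=3.0):
    """Mean of exp(-gamma * distance) over ordered pairs of distinct batch outputs.

    The penalty is close to 1 when predictions collapse together and
    approaches 0 as they spread apart, so minimizing it pushes the
    structural outputs of different samples away from each other.
    """
    n = len(outputs)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += math.exp(-gamma * math.dist(outputs[i], outputs[j]))
    return total / (n * (n - 1))


collapsed = [[0.5] * 6, [0.5] * 6]   # two samples mapped to the same structure
separated = [[0.1] * 6, [0.9] * 6]   # two clearly different structures
print(style_separation_penalty(collapsed))  # 1.0
print(style_separation_penalty(separated) < 0.01)  # True
```

In training this term would be added to the MSE loss with some weight; that weight is not given in the text and is left out here.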

[0050] During the training process, the model is trained using director-annotated style text and corresponding standard motion structure statistical features. For example, if a character is described as having "a distinct rhythm, large steps, and crisp turns," the system uses an expert-annotated rhythm frequency of 1.8 Hz, joint amplitude of 64°, average acceleration change rate of 0.95, and control stability of 0.87 as supervision targets. The training data comes from real actor performance clips collected by the Mocap system. Skeletal coordinates are mapped to motion feature parameters through the skeleton topology and then normalized to the $[0, 1]$ range, ensuring that each dimension of the target structural features has a unified comparison standard.

[0051] S3: Based on the actor's motion capture data, the actor's style vector and actor's structural features are extracted from the motion capture data simultaneously through a dual-branch neural network. The dual-branch neural network includes a temporal structure branch and a semantic style branch. The temporal structure branch is used to extract the actor's structural features, and the semantic style branch is used to extract the actor's style vector. The dual-branch neural network enhances the correlation between the two branches through a style structure consistency regularization term.

[0052] Specifically, this step aims to extract stylistic and structural characteristics from the actor's motion capture data, represented respectively as the actor style vector $s_a$ and the actor structural features $f_a$, which are then compared with the character style vector $s_c$ and the target structural features $f_c$ generated in the previous modules. In particular, this step uses a dual-branch neural network architecture suited to the characteristics of motion capture data, which is high-frequency, high-dimensional, and strongly coupled across body parts: a temporal structure branch is used to extract structural indicators, and a semantic style branch is used to model high-order motion style expressions. The two branches work together to output structured motion expression results.

[0053] The input is Mocap data of an actor performing a specified action, $X = \{x_t\}_{t=1}^{T}$ with $x_t \in \mathbb{R}^{J \times 3}$, where $x_t$ denotes the 3D coordinates of the skeleton in the $t$-th frame, $J$ is the number of skeletal nodes (e.g., 21), and $T$ is the number of frames (e.g., 120 frames over 2 seconds). Motion data comes from an optical capture device and has undergone standard preprocessing such as sampling rate normalization, coordinate system reconstruction, and keypoint topology alignment. The outputs of the previous steps, $s_c$ and $f_c$, do not directly participate in the current feature extraction process, but are used for dimension alignment and regularization design during network structure construction to ensure the consistency of variable structure and purpose throughout the system.

[0054] To adapt to the structural rhythmic patterns and style variations in motion data, we designed an architecture that combines a multi-scale temporal Transformer with a component-guided decoupled network. The network comprises two branches: a structure branch (extracting 6-dimensional structural features) and a style branch (extracting 128-dimensional style vectors). The structure branch receives the Mocap sequence $X$ and first performs a skeletal differential transformation on each frame, that is, $v_t = x_t - x_{t-1}$, to obtain the skeletal velocity tensor $V$. Then, a set of sliding window modules is used to extract rhythm and control indicators, and, combined with a component segmentation strategy, the joints are divided into three categories, "upper limbs, lower limbs, and trunk," which are modeled separately. The style branch maps $X$ to a 128-dimensional semantic space to obtain the actor style vector $s_a$ for comparison with the character style vector $s_c$.

[0055] We introduce a style-structure consistency regularization term to enhance the correlation between the two branches. Specifically, the actor structural features $f_a$ obtained in the structure branch are lifted to 128 dimensions through a linear layer and then aligned with the actor style vector $s_a$ by cosine similarity to ensure consistency in expression between the two branches. The regularization term is defined as follows:

[0056] $L_{\mathrm{cons}} = 1 - \cos(W_p f_a, \, s_a) = 1 - \dfrac{(W_p f_a)^{\top} s_a}{\lVert W_p f_a \rVert_2 \, \lVert s_a \rVert_2}$;

[0057] where $W_p \in \mathbb{R}^{128 \times 6}$ is a mapping matrix from structural feature vectors to the semantic space. This regularization term encourages the higher-order representations of structural features and the style representations to maintain consistency in direction, improving the alignment of the two branches. In film and television or digital human training scenarios, action style and structural indicators often need to jointly express the character's style. For example, "explosive action" is structurally characterized by dense rhythm, large amplitude, and high acceleration, and semantically corresponds to "flamboyant and decisive." If the two are expressed separately, scores become unstable. This regularization term therefore improves the synergy of the two expressions.
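The cosine-similarity alignment can be illustrated with a small sketch. It assumes the regularizer takes the common form of one minus the cosine similarity between the projected structural features and the style vector; the text confirms only that cosine similarity is used, so the exact form, the function name `consistency_loss`, and the toy dimensions (3 instead of 128, 2 instead of 6) are assumptions.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consistency_loss(f_a, s_a, W_p):
    """Assumed form: 1 - cos(W_p @ f_a, s_a); 0 when the directions agree."""
    return 1.0 - cosine(matvec(W_p, f_a), s_a)


# Toy sizes for readability: structure is 2-d, style space is 3-d.
W_p = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
f_a = [1.0, 2.0]
print(consistency_loss(f_a, [2.0, 4.0, 0.0], W_p) < 1e-9)  # True (same direction)
print(consistency_loss(f_a, [0.0, 0.0, 3.0], W_p))          # 1.0 (orthogonal)
```

Only direction matters in this loss: scaling either vector leaves the value unchanged, which matches the stated goal of keeping the two representations consistent in direction.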

[0058] The style branch extracts the actor style vector $s_a$. Its core structure is a 6-layer Transformer-based temporal modeling module. Each frame input is first encoded into a 128-dimensional vector through a linear transformation, then positional encoding is added, followed by a multi-head attention mechanism to model the temporal context, and finally global average pooling is used to generate $s_a$:

[0059] $s_a = \dfrac{1}{T} \sum_{t=1}^{T} h_t$;

[0060] In the formula, $h_t$ is the temporal context vector for the $t$-th frame output by the Transformer model, and $T$ is the number of frames. $s_a$ represents a high-level stylistic expression of the entire action sequence, with the same dimension as the character style vector $s_c$, so it can be used for subsequent style similarity calculations. Compared to traditional RNN structures, this model has stronger long-range dependency modeling capabilities and higher representation fidelity, making it particularly suitable for the non-uniform rhythmic motion segments present in Mocap data.
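The final pooling step is simple enough to show directly. In this sketch the per-frame context vectors are plain lists standing in for the Transformer outputs, and the function name `global_average_pool` is hypothetical; the Transformer itself is not reproduced.

```python
def global_average_pool(frame_vectors):
    """Average per-frame context vectors over time to get one sequence-level vector."""
    num_frames = len(frame_vectors)
    # zip(*...) groups the same dimension across all frames.
    return [sum(column) / num_frames for column in zip(*frame_vectors)]


# Three frames of toy 2-d context vectors (real ones would be 128-d).
frames = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]
print(global_average_pool(frames))  # [2.0, 4.0]
```

Because every frame contributes equally, the pooled vector has a fixed size regardless of clip length, which is what allows it to be compared with the character style vector.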

[0061] Each dimension of the actor structural features $f_a$ is calculated as follows: rhythm frequency is determined by detecting local extrema of movement velocity; control stability is calculated using the standard deviation of velocity change per unit time; inertia is estimated using the standard deviation of joint acceleration change; and symmetry is extracted using the cross-correlation function of left and right limb angle changes. During the training phase, this six-dimensional structural feature is fitted using Mocap data statistics as a benchmark and aligned with the target structural features $f_c$ to ensure strict consistency in dimension and meaning.
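The four indicators described above can be approximated on one-dimensional signals as follows. This is a simplified sketch under several assumptions: the input is a scalar speed curve and scalar limb-angle curves rather than full skeleton tensors, "local extrema" are counted as strict local maxima, symmetry is taken as the zero-lag normalized cross-correlation, and no normalization to the 0 to 1 range is applied. All function names are hypothetical.

```python
import math
import statistics


def diffs(signal):
    """Frame-to-frame differences of a 1-d signal."""
    return [b - a for a, b in zip(signal, signal[1:])]


def rhythm_frequency(speed, fps):
    """Strict local maxima of the speed curve per second of clip (Hz)."""
    peaks = sum(1 for i in range(1, len(speed) - 1)
                if speed[i - 1] < speed[i] > speed[i + 1])
    return peaks / (len(speed) / fps)


def control_stability(speed, fps):
    """Std of the speed change per unit time (lower means steadier control)."""
    return statistics.pstdev([d * fps for d in diffs(speed)])


def inertia(speed, fps):
    """Std of the change in acceleration."""
    accel = [d * fps for d in diffs(speed)]
    return statistics.pstdev([d * fps for d in diffs(accel)])


def symmetry(left_angles, right_angles):
    """Zero-lag normalized cross-correlation of left/right angle changes."""
    dl, dr = diffs(left_angles), diffs(right_angles)
    ml, mr = statistics.fmean(dl), statistics.fmean(dr)
    num = sum((a - ml) * (b - mr) for a, b in zip(dl, dr))
    den = math.sqrt(sum((a - ml) ** 2 for a in dl) * sum((b - mr) ** 2 for b in dr))
    return num / den if den > 0 else 0.0


# Synthetic 2-second clip at 60 fps whose speed oscillates at 2.5 Hz.
fps = 60
speed = [1.0 + 0.5 * math.sin(2 * math.pi * 2.5 * t / fps) for t in range(120)]
left = [math.sin(2 * math.pi * t / fps) for t in range(120)]
right = list(left)  # perfectly mirrored limbs

print(rhythm_frequency(speed, fps))       # 2.5
print(round(symmetry(left, right), 6))    # 1.0
print(control_stability([1.0] * 10, fps)) # 0.0
```

On real data each raw value would still need to be normalized before it could be compared with the target structural features.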

[0062] S4: Based on the character style vector, the target action structure feature, the actor style vector, and the actor structure feature, calculate the action fit score between the actor and the character through a unified scoring model that integrates style and structure. The unified scoring model uses a weighted combination of style bias and structure bias, and introduces a learnable mapping matrix to perform style calibration on the actor structure feature in order to minimize the collaborative bias between style and structure.

[0063] Specifically, this step completes the core function of the entire system: quantitatively comparing the actor's actual movement performance with the expected style of the target character, and outputting the final movement fit score. This step integrates the outputs of the previous three steps, namely the character style vector $s_c$, the target structural features $f_c$, the actor style vector $s_a$, and the actor structural features $f_a$. By constructing a unified scoring model that integrates semantic style and physical structure, the overall suitability of actors for their roles can be evaluated.

[0064] This step does not simply score style and structure separately and then weight them. Instead, it proposes a unified scoring mechanism based on minimizing the style-structure co-bias, constructing a joint scoring function to measure whether "structural performance reasonably supports style expression." This method nests the biases in the style space and the biases in the structure space within the same expression and uses a style-structure mapping matrix for nonlinear correlation modeling, thereby capturing the hidden causal relationship common in practical applications where "inconsistent style expression is caused by inadequate structural execution."

[0065] Among the inputs, the character style vector and the actor style vector are abstract representations of the target style and the actor's style, and come from the semantic vector extraction networks in steps one and three; the target structural feature and the actor structural feature express the match at the physical level, and are derived from the structural feature extraction modules in steps two and three. These four vectors together constitute the input feature set of this module, with no new variables, no preprocessing operations, perfectly aligned dimensions, and consistent semantic meaning.

[0066] The scoring function designed here is as follows:

[0067] $$S = 1 - \left( \lVert s_r - s_a \rVert_2^2 + \lambda \, \lVert f_r - M f_a \rVert_2^2 \right);$$

[0068] where $S$ is the final fit score; $s_r$ and $s_a$ are the character style vector and the actor style vector; $f_r$ and $f_a$ are the target action structure feature and the actor structural feature; $M$ is a learnable mapping matrix in the structural feature space, used for style calibration of the actor's structural features; and $\lambda$ is a weighting coefficient that controls the relative importance of style bias and structural bias, typically set between 0.5 and 2. The first term in the parentheses is the squared Euclidean distance between the style vectors, and the second term is a weighted contrast loss term at the structural level. The key point is the introduction of $M$: by applying a linear transformation to the structural features instead of comparing them directly, the system can learn whether different styles impose asymmetric requirements on structural matching.
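The scoring rule (style bias plus weighted structure bias, with the sum subtracted from 1) can be written as a short Python sketch. The function name `fit_score`, the argument names, and the default weighting coefficient of 1.0 are assumptions chosen for illustration; only the arithmetic follows the description.

```python
import numpy as np

def fit_score(char_style, actor_style, target_struct, actor_struct,
              mapping, weight=1.0):
    """Unified fit score: 1 - (style bias + weight * structure bias)."""
    # Style bias: squared Euclidean distance between the two style vectors.
    style_bias = np.sum((char_style - actor_style) ** 2)
    # Structure bias: squared distance after the actor's structural
    # features are calibrated by the learnable mapping matrix.
    calibrated = mapping @ actor_struct
    structure_bias = np.sum((target_struct - calibrated) ** 2)
    return float(1.0 - (style_bias + weight * structure_bias))

# A perfect match under an identity mapping gives the maximum score of 1.
style = np.array([0.2, -0.1])
struct = np.full(6, 0.5)
perfect = fit_score(style, style, struct, struct, np.eye(6))
```

Written this way, the score has an upper bound of 1 and becomes negative for large deviations; the description does not mention any clipping or rescaling, so none is applied here.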

[0069] For example, in film and television acting, a character with an "exuberant rhythm" does not necessarily require high-amplitude movements in all dimensions; such a character might prioritize sudden accelerations over large-amplitude movements. Conversely, a "stable and composed" character might require "low-frequency, even" movements at the structural level. This difference is modeled effectively by the mapped structural term in the scoring formula, so that the score reflects not only "how well it is done" but also "how reasonably it is done".

[0070] Furthermore, to improve the robustness of the scoring model to anomalous structural bias, a structural sparsity regularization term is applied to the learnable mapping matrix. Although this regularization term is not written explicitly in the formula, it participates in the optimization of the objective function during model training, controlling the complexity of the structural mapping and improving the consistency and generalization ability of the scoring. This design is particularly important in the practical application scenarios of this solution: for example, in digital human generation or stunt double training, certain structural dimensions are uncontrollable (such as height and frame length), and the scoring system should avoid treating these dimensions as key factors and penalizing them. By imposing sparsity constraints on the mapping matrix, the system can automatically identify and ignore these low-relevance structural dimensions.
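The specification does not give the exact form of the structural sparsity regularization term. One common choice, shown here purely as an assumption, is a column-wise group penalty: each column of the mapping matrix multiplies one actor structural dimension, so driving a whole column to zero removes that dimension's influence on the score. The function names `column_sparsity_penalty` and `shrink_columns` and the threshold value are hypothetical.

```python
import numpy as np

def column_sparsity_penalty(mapping):
    """Sum of the Euclidean norms of the columns (a group-lasso style term)."""
    return float(np.sum(np.linalg.norm(mapping, axis=0)))

def shrink_columns(mapping, threshold):
    """Group soft-thresholding: scale each column toward zero and
    remove columns whose norm is below the threshold."""
    norms = np.linalg.norm(mapping, axis=0)
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return mapping * scale  # broadcasts one scale factor per column

# Example: the sixth structural dimension carries almost no weight,
# so its column is zeroed while the others are only slightly shrunk.
mapping = np.eye(6)
mapping[5, 5] = 0.01
shrunk = shrink_columns(mapping, threshold=0.1)
```

In a training loop, the penalty would be added to the objective with a small coefficient, or the shrink step would be applied after each gradient update; either way, this is a sketch of one possible realization rather than the regularizer used in the invention.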

[0071] The final output is a single score. The results can be used in a variety of applications, such as stunt double selection, digital human-driven motion verification, and virtual performance review.

[0072] In one or more embodiments, as shown in Figure 2, a neural network-based motion capture actor fit evaluation system is disclosed, the system comprising:

[0073] The semantic vector module is used to extract the semantic representation of the input character description text through a text encoder and a Transformer model, and then use a linear mapping layer to map the semantic representation to a low-dimensional style vector space to obtain the character style vector.

[0074] The structural feature module is used to convert the character style vector into target action structure features through an action structure generation network based on the character style vector. The action structure generation network includes multiple fully connected layers and activation functions, and outputs multi-dimensional structural features representing action rhythm, amplitude, inertia, control stability, trunk stability and limb symmetry.

[0075] A style modeling module is used to extract actor style vectors and actor structural features from actor motion capture data simultaneously using a dual-branch neural network. The dual-branch neural network includes a temporal structure branch and a semantic style branch. The temporal structure branch is used to extract the actor structural features, and the semantic style branch is used to extract the actor style vector. The dual-branch neural network enhances the correlation between the two branches through a style structure consistency regularization term.

[0076] The adaptation evaluation module is used to calculate the action adaptation score between the actor and the role based on the character style vector, the target action structure feature, the actor style vector, and the actor structure feature, through a unified scoring model that integrates style and structure. The unified scoring model uses a weighted combination of style deviation and structure deviation, and introduces a learnable mapping matrix to perform style calibration on the actor structure feature in order to minimize the collaborative deviation between style and structure.
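As an illustration of the structural feature module listed above, the following Python sketch passes a character style vector through three fully connected layers and a final Sigmoid to obtain six structural values between 0 and 1. The 128-dimensional input, the hidden sizes 64 and 32, the ReLU hidden activation, and the random untrained weights are all assumptions made for the example; the specification states only that the network has multiple fully connected layers with activation functions and a Sigmoid-normalized output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_target_structure(style_vector, layers):
    """Three fully connected layers; hidden ReLU is an assumed activation."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = np.maximum(0.0, w1 @ style_vector + b1)
    h2 = np.maximum(0.0, w2 @ h1 + b2)
    # Sigmoid keeps every structural dimension strictly between 0 and 1.
    return sigmoid(w3 @ h2 + b3)

# Hypothetical layer sizes: 128 -> 64 -> 32 -> 6 (six structural dimensions).
rng = np.random.default_rng(0)
sizes = [128, 64, 32, 6]
layers = [(rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

target_structure = generate_target_structure(rng.normal(size=128), layers)
```

The six outputs would correspond, in order, to action rhythm, amplitude, inertia, control stability, trunk stability, and limb symmetry once the network has been trained.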

[0077] It is worth noting that the specific workflow of the motion capture actor suitability evaluation system based on neural networks provided in this embodiment of the invention is the same as that of the motion capture actor suitability evaluation method based on neural networks described in the above embodiments, and will not be repeated here.
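For the style structure consistency regularization term used by the style modeling module, a minimal Python sketch is given below. It follows the idea of mapping the actor structural features into the style space and comparing directions by cosine similarity; the projection matrix `projection`, the 128-dimensional style space used in the example, and the specific loss form (1 minus cosine similarity) are assumptions for illustration.

```python
import numpy as np

def consistency_regularizer(actor_struct, actor_style, projection, eps=1e-8):
    """1 - cosine similarity between the projected structural features
    and the style vector; small when the two point the same way."""
    projected = projection @ actor_struct
    cosine = projected @ actor_style / (
        np.linalg.norm(projected) * np.linalg.norm(actor_style) + eps)
    return float(1.0 - cosine)

# Hypothetical shapes: 6 structural dimensions projected to 128 dimensions.
rng = np.random.default_rng(1)
projection = rng.normal(size=(128, 6))
actor_struct = rng.random(6)

aligned = consistency_regularizer(actor_struct, projection @ actor_struct, projection)
opposed = consistency_regularizer(actor_struct, -(projection @ actor_struct), projection)
```

Here `aligned` is close to 0 and `opposed` is close to 2, the two ends of the range of this loss; adding the term to the training objective pushes the two branches toward directionally consistent representations.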

[0078] This invention also provides a neural network-based motion capture actor suitability evaluation device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements the steps described in the above embodiments of the neural network-based motion capture actor suitability evaluation method, for example, steps S1 to S4 shown in Figure 1; or, when the processor executes the computer program, it implements the functions of each module in the above system embodiments.

[0079] For example, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program in the neural network-based motion capture actor fit evaluation device.

[0080] The neural network-based motion capture actor suitability assessment device can be a desktop computer, laptop, handheld computer, or cloud server, etc. This device may include, but is not limited to, a processor and memory. Those skilled in the art will understand that the neural network-based motion capture actor suitability assessment device may also include input / output devices, network access devices, buses, etc.

[0081] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor. This processor is the control center of the neural network-based motion capture actor fit evaluation device, connecting all parts of the device via various interfaces and lines.

[0082] The memory can be used to store the computer programs and/or modules. The processor implements various functions of the neural network-based motion capture actor fit evaluation device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function, etc.; the data storage area may store data created during the operation of the device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.

[0083] The module integrated into the neural network-based motion capture actor suitability assessment device, if implemented as a software functional unit and sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.

[0084] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.

[0085] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.

Claims

1. A method for evaluating the suitability of motion capture actors based on neural networks, characterized in that the method includes:
based on the input character description text, extracting the semantic representation of the text using a text encoder and a Transformer model, and then using a linear mapping layer to map the semantic representation to a low-dimensional style vector space to obtain the character style vector, wherein the extraction steps specifically include: segmenting the input character description text to generate a sequence of word fragments; mapping the word fragment sequence to a high-dimensional vector embedding sequence; inputting the vector embedding sequence into a multi-layer Transformer encoder for contextual semantic modeling, and extracting context-related representations of the text through a multi-layer self-attention mechanism and a feedforward neural network; extracting the representation of the first special position token output by the encoder as the semantic summary vector of the entire text; and projecting the semantic summary vector onto the low-dimensional style vector space through the linear mapping layer, using an activation function to enhance non-linear expressive power, to finally obtain the character style vector;
based on the character style vector, converting the character style vector into target action structure features through an action structure generation network, wherein the action structure generation network includes multiple fully connected layers and activation functions and outputs multi-dimensional structural features representing action rhythm, amplitude, inertia, control stability, trunk stability, and limb symmetry; the target action structure features are obtained as follows: $f_r = \sigma\left(W_3\,\phi\left(W_2\,\phi\left(W_1 s_r + b_1\right) + b_2\right) + b_3\right)$; where $W_1$, $W_2$, and $W_3$ are the weight matrices of the three fully connected layers; $b_1$, $b_2$, and $b_3$ are the corresponding bias terms; $\phi$ is the activation function; $\sigma$ is the Sigmoid activation function, which normalizes the output to the 0-to-1 interval; $s_r$ is the character style vector; and the output $f_r$ is the target action structure feature, whose dimensions are as follows: the first dimension is the rhythm of the action per unit time; the second dimension is the range of angle changes of the major joints; the third dimension is the average rate of change of acceleration; the fourth dimension is the standard deviation of velocity changes between frames during the execution of the action; the fifth dimension is the fluctuation range of the trunk posture angle on the time axis; and the sixth dimension is the coordination index of the left and right limb movements;
based on the motion capture data of the actor, using a dual-branch neural network to simultaneously extract the actor style vector and the actor structural features from the motion capture data, wherein the dual-branch neural network includes a temporal structure branch and a semantic style branch, the temporal structure branch is used to extract the actor structural features, the semantic style branch is used to extract the actor style vector, and the dual-branch neural network enhances the correlation between the two branches through a style structure consistency regularization term;
based on the character style vector, the target action structure feature, the actor style vector, and the actor structure feature, calculating the action fit score between the actor and the character through a unified scoring model that integrates style and structure, wherein the unified scoring model uses a weighted combination of style bias and structure bias, and introduces a learnable mapping matrix to perform style calibration on the actor structure feature to minimize the collaborative bias between style and structure;
wherein the unified scoring model calculates the fit score as follows: first, the squared Euclidean distance between the character style vector and the actor style vector is calculated as the style bias; then, the squared Euclidean distance between the target action structure feature and the actor structure feature after linear transformation by the learnable mapping matrix is calculated as the structure bias; next, a weighting coefficient is used to weight the structure bias; finally, the style bias and the weighted structure bias are added together, and the sum is subtracted from 1 to obtain the fit score.

2. The method for evaluating the suitability of motion capture actors based on neural networks according to claim 1, characterized in that the training process of the character style vector uses weakly labeled data annotated by experts as the regression target, including multiple dimensions such as action tension, rhythm speed, control precision, and body openness, and optimizes the projection process through the mean squared error loss function.

3. The method for evaluating the suitability of motion capture actors based on neural networks according to claim 1, characterized in that the style-sensitive regularization term is calculated based on the Euclidean distance between the structural prediction outputs of different samples in the training batch, and its discriminative power is controlled by an exponential function and hyperparameters to increase the structural feature spacing between different character styles.

4. The method for evaluating the suitability of motion capture actors based on neural networks according to claim 1, characterized in that the training data for the action structure generation network comes from real actor performance clips collected by the motion capture system, and skeletal coordinates are mapped to motion feature parameters through the skeleton topology and normalized to the 0 to 1 range.

5. The method for evaluating the suitability of motion capture actors based on neural networks according to claim 1, characterized in that the temporal structure branch divides the joints into three categories, namely upper limbs, lower limbs, and trunk, models them separately, and combines a sliding window module to extract rhythm and control indicators.

6. The method for evaluating the suitability of motion capture actors based on neural networks according to claim 1, characterized in that the style structure consistency regularization term aligns the actor's structural features with the actor's style vector by calculating the cosine similarity after mapping the actor's structural features to a 128-dimensional space, encouraging the higher-order representation of the structural features to be consistent in direction with the style representation.

7. The method for evaluating the suitability of motion capture actors based on neural networks according to claim 1, characterized in that the learnable mapping matrix is subjected to a structural sparsity regularization term during training to automatically identify and ignore low-relevance structural dimensions, thereby improving the generalization ability of the scoring model.

8. A motion capture actor suitability evaluation system based on neural networks, characterized in that the system includes:
a semantic vector module, used to extract the semantic representation of the input character description text using a text encoder and a Transformer model, and then use a linear mapping layer to map the semantic representation to a low-dimensional style vector space to obtain the character style vector, wherein the extraction steps specifically include: segmenting the input character description text to generate a sequence of word fragments; mapping the word fragment sequence to a high-dimensional vector embedding sequence; inputting the vector embedding sequence into a multi-layer Transformer encoder for contextual semantic modeling, and extracting context-related representations of the text through a multi-layer self-attention mechanism and a feedforward neural network; extracting the representation of the first special position token output by the encoder as the semantic summary vector of the entire text; and projecting the semantic summary vector onto the low-dimensional style vector space through the linear mapping layer, using an activation function to enhance non-linear expressive power, to finally obtain the character style vector;
a structural feature module, used to convert the character style vector into target action structure features through an action structure generation network, based on the character style vector, wherein the action structure generation network includes multiple fully connected layers and activation functions and outputs multi-dimensional structural features representing action rhythm, amplitude, inertia, control stability, trunk stability, and limb symmetry; the target action structure features are obtained as follows: $f_r = \sigma\left(W_3\,\phi\left(W_2\,\phi\left(W_1 s_r + b_1\right) + b_2\right) + b_3\right)$; where $W_1$, $W_2$, and $W_3$ are the weight matrices of the three fully connected layers; $b_1$, $b_2$, and $b_3$ are the corresponding bias terms; $\phi$ is the activation function; $\sigma$ is the Sigmoid activation function, which normalizes the output to the 0-to-1 interval; $s_r$ is the character style vector; and the output $f_r$ is the target action structure feature, whose dimensions are as follows: the first dimension is the rhythm of the action per unit time; the second dimension is the range of angle changes of the major joints; the third dimension is the average rate of change of acceleration; the fourth dimension is the standard deviation of velocity changes between frames during the execution of the action; the fifth dimension is the fluctuation range of the trunk posture angle on the time axis; and the sixth dimension is the coordination index of the left and right limb movements;
a style modeling module, used to extract the actor style vector and the actor structural features from the actor's motion capture data simultaneously using a dual-branch neural network, wherein the dual-branch neural network includes a temporal structure branch and a semantic style branch, the temporal structure branch is used to extract the actor structural features, the semantic style branch is used to extract the actor style vector, and the dual-branch neural network enhances the correlation between the two branches through a style structure consistency regularization term;
a fit assessment module, used to calculate the action fit score between the actor and the character based on the character style vector, the target action structure feature, the actor style vector, and the actor structure feature, using a unified scoring model that integrates style and structure, wherein the unified scoring model uses a weighted combination of style bias and structure bias, and introduces a learnable mapping matrix to perform style calibration on the actor structure feature to minimize the collaborative bias between style and structure;
wherein the unified scoring model calculates the fit score as follows: first, the squared Euclidean distance between the character style vector and the actor style vector is calculated as the style bias; then, the squared Euclidean distance between the target action structure feature and the actor structure feature after linear transformation by the learnable mapping matrix is calculated as the structure bias; next, a weighting coefficient is used to weight the structure bias; finally, the style bias and the weighted structure bias are added together, and the sum is subtracted from 1 to obtain the fit score.