Encoder training method, apparatus, device, and storage medium
By extracting video, audio, and image features from the training set and constructing a loss function to train the encoder, the problem of the fake face recognition model being unable to recognize lip movements was solved, the accuracy of fake face recognition was improved, and the reliability of identity verification and the security of financial instruments were ensured.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-05-31
- Publication Date
- 2026-06-23
AI Technical Summary
Existing fake face recognition models cannot recognize lip movements, resulting in low accuracy in identifying fake face images. This undermines the reliability of identity verification through face recognition and affects the security of financial instruments.
An encoder training method is adopted, which divides the training set into training video subset, audio subset and image subset, and extracts video, audio and image features respectively. A loss function is constructed to train the video and image encoders. The judgment is made by combining lip movement and speech content, thereby improving the accuracy of fake face recognition.
By training the encoder using a loss function that combines video and audio features, it can make judgments based on lip movements and speech content while the user is speaking, thereby improving the accuracy of identifying fake faces, ensuring the reliability of identity verification, and enhancing the confidentiality of the financial instrument acquisition process.
Figure CN116740492B_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to an encoder training method, apparatus, device, and storage medium.
Background Technology
[0002] In the financial sector, financial instruments are highly confidential. Accessing them requires identity authentication to obtain corresponding permissions. Once these permissions are granted, the information on the financial instrument can be read and / or modified. Facial recognition is one of the most widely used methods for identity authentication. Deepfake technology uses facial videos obtained from networks or databases to generate forged facial images. Using these forged facial images, it's possible to bypass facial verification, gain the necessary permissions, and then access financial instruments, compromising their security. Current methods for identifying forged facial images involve extracting facial images from facial videos and using these extracted images to train a convolutional neural network, resulting in a forged facial recognition model.
[0003] However, a fake face recognition model trained only on face images cannot perform fine-grained face recognition, in particular recognition of lip movements, so its accuracy in recognizing fake face images is low. As a result, the reliability of identity verification through face recognition cannot be guaranteed, and the process of obtaining financial instruments is only weakly confidential.
Summary of the Invention
[0004] This application provides an encoder training method, apparatus, device, and storage medium, aiming to solve the problem that a fake face recognition model trained from face images cannot recognize the lip movements of a face, resulting in low accuracy in recognizing fake face images, thus failing to guarantee the reliability of identity verification through face recognition, and leading to weak confidentiality in the process of obtaining financial instruments.
[0005] To solve the above problems, this application adopts the following technical solution:
[0006] This application provides an encoder training method, including:
[0007] Obtain a training set, which includes a subset of training videos, a subset of training audio, and a subset of training images;
[0008] The training video subset is input into the video encoder for video representation to obtain video features;
[0009] The training audio subset is input into the audio encoder for audio representation to obtain audio features;
[0010] The subset of training images is input into an image encoder for image representation to obtain image features;
[0011] A first loss function is constructed based on the video features and the audio features, and a second loss function is constructed based on the image features and the audio features;
[0012] The video encoder is trained according to the first loss function to obtain a trained video encoder, and the image encoder is trained according to the second loss function to obtain a trained image encoder.
[0013] Preferably, the step of inputting the training video subset into the video encoder for video representation to obtain video features includes:
[0014] The training video subset is input into the three-dimensional convolutional layer of the video encoder to extract lip features and obtain a lip feature sequence.
[0015] The lip feature sequence is input into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain the encoded local motion features;
[0016] The encoded local motion features are input into the linear projection layer of the video encoder for feature transformation to obtain the lip feature sequence;
[0017] The lip feature sequence is input into the Transformer layer of the video encoder for characterization to obtain the video features.
[0018] Preferably, the step of inputting the training audio subset into the audio encoder for audio representation to obtain audio features includes:
[0019] The training audio subset is input into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors;
[0020] The audio vector is input into the Transformer layer of the audio encoder for characterization to obtain the audio features.
[0021] Preferably, the step of inputting the training image subset into the image encoder for image representation to obtain image features includes:
[0022] The subset of training images is input into the Transformer layer of the image encoder for characterization to obtain the image features.
[0023] Preferably, constructing the first loss function based on the video features and the audio features includes:
[0024] The first loss function is constructed according to the following formula:
[0025]
[0026] Here, $L_v$ is the first loss function, $z_v$ is the video feature, $z_a$ is the audio feature, $\tau_1$ is the first adjustment parameter, $\log$ is the logarithmic function, and $\exp$ is the exponential function.
[0027] Preferably, after training the image encoder according to the second loss function, the method further includes:
[0028] A third loss function is constructed based on the audio features and the video features;
[0029] A fourth loss function is constructed based on the audio features and the image features. The audio encoder is then trained using the third loss function and the fourth loss function to obtain the trained audio encoder.
[0030] Preferably, the step of training the video encoder according to the first loss function to obtain a trained video encoder, and training the image encoder according to the second loss function to obtain a trained image encoder, includes:
[0031] The first loss function value is calculated using the video features and the audio features;
[0032] The second loss function value is calculated using the image features and the audio features;
[0033] Backpropagation is performed based on the first loss function value to update the video encoding parameters of the video encoder, thereby obtaining the trained video encoder.
[0034] Backpropagation is performed based on the second loss function value to update the image encoding parameters of the image encoder, thereby obtaining the trained image encoder.
[0035] This application also provides an encoder training apparatus, comprising:
[0036] The training set acquisition module is used to acquire a training set, which includes a training video subset, a training audio subset, and a training image subset.
[0037] The video representation module is used to input the training video subset into the video encoder for video representation to obtain video features;
[0038] An audio representation module is used to input the training audio subset into an audio encoder for audio representation to obtain audio features;
[0039] The image representation module is used to input the subset of training images into the image encoder for image representation to obtain image features;
[0040] The loss function construction module is used to construct a first loss function based on the video features and the audio features, and to construct a second loss function based on the image features and the audio features;
[0041] The training module is used to train the video encoder according to the first loss function to obtain a trained video encoder, and to train the image encoder according to the second loss function to obtain a trained image encoder.
[0042] This application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the encoder training method described in any of the above claims.
[0043] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the encoder training method described in any of the preceding claims.
[0044] The encoder training method of this application divides the training set into a training video subset, a training audio subset, and a training image subset, enabling the encoding of video features, audio features, and image features respectively for each subset. A first loss function is constructed based on the video and audio features, allowing them to mutually guide and complement each other. A second loss function is constructed based on the image and audio features, also allowing them to mutually guide and complement each other. Training the video encoder using the first loss function allows for judgment based on the user's lip movements and speech content during speech. Training the image encoder using the second loss function allows for judgment based on the user's lip movements and lip shape during speech. Combining the trained video encoder and the trained image encoder for face recognition improves the accuracy of identifying forged faces, ensures the reliability of identity verification through face recognition, and enhances the confidentiality of the process of obtaining financial instruments.
Attached Figure Description
[0045] Figure 1 is a flowchart illustrating an encoder training method according to one embodiment;
[0046] Figure 2 is a schematic diagram illustrating the process of inputting the training video subset into the video encoder for video representation according to one embodiment;
[0047] Figure 3 is a schematic diagram illustrating the process of inputting the training audio subset into the audio encoder for audio representation according to one embodiment;
[0048] Figure 4 is a schematic diagram of the process of training an image encoder based on a second loss function according to one embodiment;
[0049] Figure 5 is a schematic diagram illustrating the process of training a video encoder and an image encoder according to one embodiment;
[0050] Figure 6 is a schematic block diagram of the structure of an encoder training device according to one embodiment;
[0051] Figure 7 is a schematic block diagram of the structure of a computer device according to one embodiment.
[0052] The purpose, features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0053] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0054] Those skilled in the art will understand that, unless explicitly stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in the specification of this application means the presence of features, integers, steps, operations, elements, units, cells, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, units, cells, components, and / or groups thereof. It should be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless couplings. The term “and / or” as used herein includes all or any of the units and all combinations thereof of one or more associated listed items.
[0055] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0056] Figure 1 is a flowchart illustrating the encoder training method proposed in this application. The encoder training method can be used in identity authentication processes within the financial sector. Specifically, when obtaining financial instruments, facial verification is required to obtain corresponding permissions. A first loss function, composed of video and audio features, is used to train the video representation module, enabling it to make judgments based on the user's lip movements and speech content while the user is speaking. A second loss function, composed of image and audio features, is used to train the image representation module, enabling it to make judgments based on the user's lip movements and lip shape while the user is speaking. Combining the trained video encoder and the trained image encoder for facial recognition improves the accuracy of identifying forged faces, thereby enhancing the security of obtaining financial instruments. When a forged face is identified, reading and modification permissions for the financial instruments are not granted, further improving the security of the financial system.
[0057] The encoder training method includes the following steps S1-S6:
[0058] S1: Obtain the training set, which includes a subset of training videos, a subset of training audio, and a subset of training images.
[0059] The training set includes multiple training videos, multiple training audio recordings, and multiple training images. All training videos form a subset, all training audio recordings form a subset, and all training images form a subset. The training videos and images include the user's face, and the training audio includes human voice. Both the face and voice can be used to identify the user, thereby detecting forged faces and ensuring strong confidentiality in the process of obtaining financial instruments.
[0060] Optionally, the training set can be downloaded from the network or from a database.
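As a minimal sketch of step S1, the training set can be pictured as one container holding the three subsets. The class name `TrainingSet`, the function `split_training_set`, and the rule of grouping files by extension are all assumptions made for illustration; the source does not say how the subsets are stored or separated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrainingSet:
    """Hypothetical container for the three subsets named in step S1."""
    videos: List[str] = field(default_factory=list)  # paths to face videos
    audios: List[str] = field(default_factory=list)  # paths to voice recordings
    images: List[str] = field(default_factory=list)  # paths to face images


def split_training_set(paths: List[str]) -> TrainingSet:
    """Group file paths into subsets by extension (an illustrative rule only)."""
    video_ext = (".mp4", ".avi")
    audio_ext = (".wav", ".flac")
    image_ext = (".jpg", ".png")
    training_set = TrainingSet()
    for path in paths:
        lower = path.lower()
        if lower.endswith(video_ext):
            training_set.videos.append(path)
        elif lower.endswith(audio_ext):
            training_set.audios.append(path)
        elif lower.endswith(image_ext):
            training_set.images.append(path)
        # Files of any other type are ignored in this sketch.
    return training_set
```

Each subset is then passed to its own encoder in steps S2 to S4.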
[0061] S2: Input the training video subset into the video encoder for video representation to obtain video features.
[0062] The training video subset is input into the three-dimensional convolutional layer of the video encoder to extract lip features and obtain a lip feature sequence.
[0063] The lip feature sequence is input into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain the encoded local motion features;
[0064] The encoded local motion features are input into the linear projection layer of the video encoder for feature transformation to obtain the lip feature sequence;
[0065] The lip feature sequence is input into the Transformer layer of the video encoder for characterization to obtain the video features.
[0066] Lip feature sequences can reflect a user's lip movements in a video. Encoding these lip feature sequences yields encoded local motion features. Based on these encoded local motion features, a lip feature sequence representing multiple lip features of the user can be obtained. Using a Transformer layer to represent the lip feature sequence yields video features representing global semantics. These video features include a textual description of the video.
[0067] S3: Input the training audio subset into the audio encoder for audio representation to obtain audio features.
[0068] The training audio subset is input into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors;
[0069] The audio vector is input into the Transformer layer of the audio encoder for characterization to obtain the audio features.
[0070] The audio vector transformation layer converts audio into corresponding audio vectors. The Transformer layer then represents these audio vectors to obtain the audio features, which include textual descriptions of the audio.
[0071] The lip features in the lip feature sequence can correspond to the audio features, thereby improving face recognition and preventing the use of fake faces to obtain financial instruments through face verification.
[0072] S4: Input the subset of training images into the image encoder for image representation to obtain image features.
[0073] The subset of training images is input into the Transformer layer of the image encoder for characterization to obtain the image features.
[0074] Image features include textual descriptions of the image.
[0075] S5: Construct a first loss function based on the video features and the audio features, and construct a second loss function based on the image features and the audio features.
[0076] The first loss function is constructed according to the following formula:
[0077]
[0078] Here, $L_v$ is the first loss function, $z_v$ is the video feature, $z_a$ is the audio feature, $\tau_1$ is the first adjustment parameter, $\log$ is the logarithmic function, and $\exp$ is the exponential function.
[0079] Preferably, the first adjustment parameter is set to 0.1.
[0080] The first loss function consists of video features and audio features, which guide and complement each other.
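The formula itself is not reproduced in this text, so the following is only a sketch under a stated assumption: that the first loss function is a contrastive (InfoNCE-style) loss in which each video feature should match the audio feature of the same sample and differ from the audio features of other samples in the batch. The function name `contrastive_loss`, the L2 normalization, and the batch averaging are assumptions; only the symbols $z_v$, $z_a$, the adjustment parameter of 0.1, log, and exp come from the description.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v: Sequence[float]) -> List[float]:
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(anchors, targets, tau: float = 0.1) -> float:
    """InfoNCE-style loss: anchor i should match target i against all targets."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / tau for t in targets]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch: two video features and their paired audio features.
z_v = [[1.0, 0.0], [0.0, 1.0]]
z_a = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(z_v, z_a, tau=0.1)
loss_swapped = contrastive_loss(z_v, list(reversed(z_a)), tau=0.1)
```

In this toy batch the matched pairing gives a much smaller loss than the swapped pairing, which is the sense in which the two modalities guide each other under this assumed form.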
[0081] Construct the second loss function according to the following formula:
[0082]
[0083] Here, $L_p$ is the second loss function, $z_p$ is the image feature, $z_a$ is the audio feature, $\tau_2$ is the second adjustment parameter, $\log$ is the logarithmic function, and $\exp$ is the exponential function.
[0084] Preferably, the second adjustment parameter is set to 0.1.
[0085] The second loss function consists of image features and audio features, which guide and complement each other.
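The formulas for the first and second loss functions are not reproduced in this text. Given the symbols that are listed (features $z_v$, $z_p$, $z_a$, adjustment parameters $\tau_1$, $\tau_2$, log, and exp), one plausible reading is a contrastive (InfoNCE-style) loss. The form below is an assumption for illustration, not the formula of the application; the batch size $N$ and the sample indices $i$, $j$ are introduced here and do not appear in the source.

```latex
L_v = -\frac{1}{N}\sum_{i=1}^{N}
      \log \frac{\exp\left(z_v^{i} \cdot z_a^{i} / \tau_1\right)}
                {\sum_{j=1}^{N} \exp\left(z_v^{i} \cdot z_a^{j} / \tau_1\right)},
\qquad
L_p = -\frac{1}{N}\sum_{i=1}^{N}
      \log \frac{\exp\left(z_p^{i} \cdot z_a^{i} / \tau_2\right)}
                {\sum_{j=1}^{N} \exp\left(z_p^{i} \cdot z_a^{j} / \tau_2\right)}
```

Under this reading, a smaller adjustment parameter such as 0.1 sharpens the contrast between the matching audio feature and the non-matching ones.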
[0086] S6: Train the video encoder according to the first loss function to obtain a trained video encoder, and train the image encoder according to the second loss function to obtain a trained image encoder.
[0087] The first loss function value is calculated using the video features and the audio features;
[0088] The second loss function value is calculated using the image features and the audio features;
[0089] Backpropagation is performed based on the first loss function value to update the video encoding parameters of the video encoder, thereby obtaining the trained video encoder.
[0090] Backpropagation is performed based on the second loss function value to update the image encoding parameters of the image encoder, thereby obtaining the trained image encoder.
[0091] Training the video encoder based on the first loss function can combine video and audio features to improve the accuracy of the video encoder in recognizing fake faces in videos.
[0092] The image encoder is trained based on the second loss function, which combines image features and audio features. This improves the accuracy of the image encoder in recognizing fake faces in images, thereby improving the accuracy of face recognition. Users who pass face recognition have sufficient authority to obtain financial instruments, ensuring the security of the process of obtaining financial instruments.
[0093] The encoder training method of this application divides the training set into a training video subset, a training audio subset, and a training image subset, enabling the encoding of video features, audio features, and image features respectively for each subset. A first loss function is constructed based on the video and audio features, allowing them to mutually guide and complement each other. A second loss function is constructed based on the image and audio features, also allowing them to mutually guide and complement each other. Training the video encoder using the first loss function allows for judgment based on the user's lip movements and speech content during speech. Training the image encoder using the second loss function allows for judgment based on the user's lip movements and lip shape during speech. Combining the trained video encoder and the trained image encoder for face recognition improves the accuracy of identifying forged faces, ensures the reliability of identity verification through face recognition, and enhances the confidentiality of the process of obtaining financial instruments.
[0094] In one embodiment, referring to Figure 2, step S2 of inputting the training video subset into the video encoder for video representation to obtain video features includes the following steps S21-S24:
[0095] S21: Input the training video subset into the three-dimensional convolutional layer of the video encoder to extract lip features and obtain a lip feature sequence.
[0096] The three-dimensional convolutional layer can extract the lip features of each training video in the training video subset, and sort all the lip features according to the order of the corresponding training video in the training video subset to obtain the lip feature sequence. The lip feature sequence reflects the local features of the lips in the training video.
[0097] S22: Input the lip feature sequence into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain encoded local motion features.
[0098] A two-dimensional residual layer sequentially downsamples each lip feature in the lip feature sequence to obtain a downsampled lip feature sequence. Each downsampled lip feature in the downsampled lip feature sequence is concatenated with its corresponding lip feature to obtain a concatenated feature. All concatenated features are then encoded to obtain the encoded local motion features.
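The downsample-and-concatenate part of step S22 can be sketched as follows. This is a simplified illustration: each lip feature is treated as a flat list of numbers, downsampling is assumed to be average pooling over groups of `factor` values, and the final encoding of the concatenated features is left out because the source does not specify it. The function names are hypothetical.

```python
from typing import List


def downsample(feature: List[float], factor: int = 2) -> List[float]:
    """Average-pool consecutive groups of `factor` values (assumed rule)."""
    pooled = []
    for start in range(0, len(feature), factor):
        group = feature[start:start + factor]
        pooled.append(sum(group) / len(group))
    return pooled


def concat_with_downsampled(sequence: List[List[float]],
                            factor: int = 2) -> List[List[float]]:
    """Concatenate each lip feature with its own downsampled version."""
    return [feature + downsample(feature, factor) for feature in sequence]


lip_sequence = [[1.0, 3.0, 5.0, 7.0]]
concatenated = concat_with_downsampled(lip_sequence)
```

The concatenated features would then be passed to whatever encoding the two-dimensional residual layer applies.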
[0099] S23: Input the encoded local motion features into the linear projection layer of the video encoder for feature transformation to obtain the lip feature sequence.
[0100] The linear projection layer uses linear projection to convert the encoded local motion features corresponding to each frame in the training video into a lip feature sequence.
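A linear projection maps each frame's feature vector through the same weight matrix and bias. The sketch below shows this with plain lists; the name `linear_project`, the toy dimensions, and the weight values are assumptions chosen only to make the arithmetic easy to follow.

```python
from typing import List


def linear_project(frame_features: List[List[float]],
                   weight: List[List[float]],
                   bias: List[float]) -> List[List[float]]:
    """Apply y = W x + b to every frame's feature vector."""
    projected = []
    for x in frame_features:
        y = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weight, bias)]
        projected.append(y)
    return projected


# Two frames with 3-dimensional features, projected to 2 dimensions.
frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
bias = [0.5, -1.0]
lip_feature_sequence = linear_project(frames, weight, bias)
```

Because the same weights are applied to every frame, the output keeps one vector per frame, which is what lets the Transformer layer in step S24 treat it as a sequence.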
[0101] S24: Input the lip feature sequence into the Transformer layer of the video encoder for characterization to obtain the video features.
[0102] The video features obtained through the Transformer layer representation can reflect the global features of the lips in the training video.
[0103] As described above, a subset of the training video is input into the video encoder for video representation to obtain video features. This includes inputting the training video subset into the 3D convolutional layer of the video encoder to extract lip features, resulting in a lip feature sequence. The lip feature sequence is then input into the 2D residual layer of the video encoder for local motion feature encoding, resulting in encoded local motion features. These encoded local motion features are then input into the linear projection layer of the video encoder for feature transformation, resulting in a lip feature sequence. Finally, the lip feature sequence is input into the Transformer layer of the video encoder for representation, resulting in video features. The lip feature sequence reflects the local features of the lips in the training video, and the video features obtained through the Transformer layer representation reflect the global features of the lips in the training video.
[0104] In one embodiment, referring to Figure 3, step S3 of inputting the training audio subset into the audio encoder for audio representation to obtain audio features further includes the following steps S31-S32:
[0105] S31: Input the training audio subset into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors.
[0106] Preferably, the audio vector conversion layer can use a trained wav2vec 2.0 network structure. The wav2vec 2.0 network structure to be trained is trained using a contrastive loss function. The trained wav2vec 2.0 network structure extracts unsupervised speech features from the training audio through a multi-layer convolutional neural network and converts the unsupervised speech features into audio vectors.
[0107] S32: Input the audio vector into the Transformer layer of the audio encoder for characterization to obtain the audio features.
[0108] The audio features include a textual description of the audio. Preferably, after obtaining the audio features, adaptive average pooling is performed on the audio features to unify the audio features to a fixed length.
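Adaptive average pooling to a fixed length can be sketched as below. The window rule used here (start at floor(i·n/m), end at ceil((i+1)·n/m) for input length n and output length m) is a common choice for adaptive pooling and is an assumption, since the source does not state which rule is used; the function name is hypothetical.

```python
from typing import List


def adaptive_avg_pool(sequence: List[List[float]],
                      output_len: int) -> List[List[float]]:
    """Average a variable-length sequence of vectors into output_len vectors."""
    n = len(sequence)  # assumes n >= 1
    dim = len(sequence[0])
    pooled = []
    for i in range(output_len):
        start = (i * n) // output_len
        end = ((i + 1) * n + output_len - 1) // output_len  # integer ceiling
        window = sequence[start:end]
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


short_clip = [[1.0], [2.0], [3.0], [4.0]]
long_clip = [[1.0], [2.0], [3.0], [4.0], [5.0]]
pooled_short = adaptive_avg_pool(short_clip, 2)
pooled_long = adaptive_avg_pool(long_clip, 2)
```

Both clips end up with two vectors, so audio of different durations yields features of the same length.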
[0109] As described above, a subset of training audio is input into the audio encoder for audio representation to obtain audio features. This includes inputting the training audio subset into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors. These audio vectors are then input into the Transformer layer of the audio encoder for representation to obtain audio features. The audio features include textual descriptions of the audio and are used to guide and supplement video and image features.
[0110] In one embodiment, referring to Figure 4, after step S6 of training the image encoder according to the second loss function, the method further includes the following steps S71'-S72':
[0111] S71': Construct a third loss function based on the audio features and the video features.
[0112] The formula for constructing the third loss function is as follows:
[0113]
[0114] Here, $L_{a1}$ is the third loss function, $z_v$ is the video feature, $z_a$ is the audio feature, $\tau_3$ is the third adjustment parameter, $\log$ is the logarithmic function, and $\exp$ is the exponential function.
[0115] Preferably, the third adjustment parameter is set to 0.1.
[0116] The third loss function includes video features and audio features, which guide and complement each other.
[0117] S72': Construct a fourth loss function based on the audio features and the image features, and train the audio encoder using the third loss function and the fourth loss function to obtain a trained audio encoder.
[0118] The formula for constructing the fourth loss function is as follows:
[0119]
[0120] Here, $L_{a2}$ is the fourth loss function, $z_p$ is the image feature, $z_a$ is the audio feature, $\tau_4$ is the fourth adjustment parameter, $\log$ is the logarithmic function, and $\exp$ is the exponential function.
[0121] Preferably, the fourth adjustment parameter is set to 0.1.
[0122] The fourth loss function includes image features and audio features, which guide and complement each other. Using the third loss function, an audio encoder can be trained based on the audio-to-video representation method; using the fourth loss function, an audio encoder can be trained based on the audio-to-image representation method. The trained audio encoder can assist the video encoder in determining whether a fake face exists in a video, and it can also assist the image encoder in determining whether a fake face exists in an image.
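A sketch of how the third and fourth loss functions might be combined to train the audio encoder is given below. Two things are assumptions: the contrastive (InfoNCE-style) form of each loss, with the audio feature as the anchor, and the choice to add the two losses together, since the source only says the audio encoder is trained "using" both. The names `info_nce` and `audio_encoder_loss` are hypothetical.

```python
import math


def info_nce(anchors, targets, tau):
    """Mean InfoNCE-style loss; row i of anchors is paired with row i of targets."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))  # assumes a non-zero vector
        return [x / norm for x in v]

    anchors = [unit(a) for a in anchors]
    targets = [unit(t) for t in targets]
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / tau for t in targets]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(anchors)


def audio_encoder_loss(z_a, z_v, z_p, tau3=0.1, tau4=0.1):
    """Assumed combination: audio-to-video loss plus audio-to-image loss."""
    l_a1 = info_nce(z_a, z_v, tau3)  # third loss: audio anchored against video
    l_a2 = info_nce(z_a, z_p, tau4)  # fourth loss: audio anchored against image
    return l_a1 + l_a2


z_a = [[1.0, 0.0], [0.0, 1.0]]
z_v = [[1.0, 0.0], [0.0, 1.0]]
z_p_aligned = [[1.0, 0.0], [0.0, 1.0]]
z_p_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = audio_encoder_loss(z_a, z_v, z_p_aligned)
loss_misaligned = audio_encoder_loss(z_a, z_v, z_p_swapped)
```

With this combination, the audio features are penalized when they disagree with either the video features or the image features of the same sample.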
[0123] As described above, after training the image encoder based on the second loss function, the process further includes constructing a third loss function based on audio and video features, constructing a fourth loss function based on audio and image features, and training the audio encoder using the third and fourth loss functions to obtain the trained audio encoder. The trained audio encoder can assist the video encoder in determining whether a fake face exists in a video, and it can also assist the image encoder in determining whether a fake face exists in an image.
[0124] In one embodiment, referring to Figure 5, step S6 of training the video encoder according to the first loss function to obtain a trained video encoder, and training the image encoder according to the second loss function to obtain a trained image encoder, includes the following steps S61-S64:
[0125] S61: Calculate the first loss function value using the video features and the audio features.
[0126] Both video and audio features are in vector form. The closer the product of video and audio features is to 1, the more similar the video and audio features are. The closer the product of video and audio features is to 0, the less similar the video and audio features are.
[0127] S62: Calculate the second loss function value using the image features and the audio features.
[0128] Both image features and audio features are in vector form. The closer the product of image features and audio features is to 1, the more similar the image features and audio features are. The closer the product of image features and audio features is to 0, the less similar the image features and audio features are.
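The "product" of two feature vectors described in steps S61 and S62 can be read as an inner product of length-normalized vectors, that is, cosine similarity. That reading is an assumption; the source does not say whether the features are normalized. A minimal sketch:

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of u and v after scaling both to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # assumes neither vector is all zeros


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, matching the description that a product near 1 means similar features and a product near 0 means dissimilar features.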
[0129] S63: Perform backpropagation based on the first loss function value to update the video encoding parameters of the video encoder, thereby obtaining the trained video encoder.
[0130] The process of training a video encoder involves multiple backpropagations. The greater the difference between the two first loss function values corresponding to two adjacent backpropagations, the faster the video encoding parameters of the video encoder are updated.
[0131] The more times the video encoding parameters are updated, the smaller the difference between the video encoding parameters and the preset video encoding parameters becomes. When the difference between the video encoding parameters and the preset video encoding parameters is less than the first preset difference, training stops, and the trained video encoder is obtained.
[0132] S64: Perform backpropagation based on the second loss function value to update the image encoding parameters of the image encoder, thereby obtaining the trained image encoder.
[0133] The process of training an image encoder involves multiple backpropagations. The greater the difference between the two second loss function values corresponding to two adjacent backpropagations, the faster the image encoding parameters of the image encoder are updated.
[0134] The more times the image encoding parameters are updated, the smaller the difference between the image encoding parameters and the preset image encoding parameters becomes. When the difference between the image encoding parameters and the preset image encoding parameters is less than the second preset difference, training stops, and the trained image encoder is obtained.
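The stopping rule in steps S63 and S64 can be sketched as a loop that keeps updating the parameters until they are within a preset difference of the preset parameters. Everything concrete here is an assumption for illustration: the function name, plain gradient descent as the update, the learning rate, the maximum absolute difference as the distance measure, and the toy quadratic loss standing in for the real loss.

```python
from typing import Callable, List, Tuple


def train_until_close(params: List[float],
                      grad_fn: Callable[[List[float]], List[float]],
                      preset_params: List[float],
                      preset_difference: float,
                      lr: float = 0.1,
                      max_steps: int = 1000) -> Tuple[List[float], int]:
    """Update params by gradient descent until they are close to preset_params."""
    for step in range(max_steps):
        difference = max(abs(p - q) for p, q in zip(params, preset_params))
        if difference < preset_difference:
            return params, step  # stopping criterion met
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, max_steps


# Toy stand-in for the loss: sum((p - target)^2), whose gradient is 2 * (p - target).
target = [1.0, 2.0]


def toy_grad(params: List[float]) -> List[float]:
    return [2.0 * (p - t) for p, t in zip(params, target)]


trained, steps = train_until_close([0.0, 4.0], toy_grad,
                                   preset_params=target,
                                   preset_difference=0.01)
```

In a real setting the gradients would come from backpropagating the first or second loss function value through the corresponding encoder.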
[0135] A trained video encoder is used to identify fake faces in videos, and a trained image encoder is used to identify fake faces in images.
[0136] As described above, a video encoder is trained based on a first loss function, and an image encoder is trained based on a second loss function. This includes calculating the first loss function value using video and audio features, and calculating the second loss function value using image and audio features. Backpropagation is performed based on the first loss function value to update the video encoding parameters of the video encoder, resulting in a trained video encoder. Backpropagation is performed based on the second loss function value to update the image encoding parameters of the image encoder, resulting in a trained image encoder. The trained video encoder is used to identify fake faces in videos, and the trained image encoder is used to identify fake faces in images.
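The stopping rule described for steps S63 and S64 can be sketched as follows. This is a toy illustration only: a quadratic loss with a hand-written gradient stands in for backpropagation through a real encoder, the "difference" is taken as the largest absolute gap between corresponding parameters, and `learning_rate` and `max_steps` are assumed values that do not appear in the text.

```python
def train_until_close(params, grad_fn, preset_params, preset_difference,
                      learning_rate=0.1, max_steps=10_000):
    """Repeat gradient updates until the parameters are within
    preset_difference of preset_params (largest absolute gap)."""
    for step in range(max_steps):
        gap = max(abs(p - q) for p, q in zip(params, preset_params))
        if gap < preset_difference:
            return params, step          # training stops here
        grads = grad_fn(params)          # stands in for one backpropagation
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params, max_steps


# Toy loss sum((p - t)^2), whose gradient is 2 * (p - t).
target = [0.5, -1.0]


def toy_gradient(params):
    return [2.0 * (p - t) for p, t in zip(params, target)]


final_params, steps_taken = train_until_close(
    [0.0, 0.0], toy_gradient, preset_params=target, preset_difference=1e-3)
```

The same loop applies to the image encoder with the second loss function value and the second preset difference.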
[0137] Referring to Figure 6, which is a schematic block diagram of an encoder training device according to this application, the device includes:
[0138] The training set acquisition module 10 is used to acquire a training set, which includes a training video subset, a training audio subset, and a training image subset.
[0139] The video representation module 20 is used to input the training video subset into the video encoder for video representation to obtain video features;
[0140] The audio representation module 30 is used to input the training audio subset into the audio encoder for audio representation to obtain audio features;
[0141] The image representation module 40 is used to input the training image subset into the image encoder for image representation to obtain image features;
[0142] The loss function construction module 50 is used to construct a first loss function based on the video features and the audio features, and to construct a second loss function based on the image features and the audio features;
[0143] The training module 60 is used to train the video encoder according to the first loss function to obtain a trained video encoder, and to train the image encoder according to the second loss function to obtain a trained image encoder.
[0144] The encoder training device described above is used to implement the encoder training method.
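The data flow between the modules listed above can be sketched as a simple composition. All class, field, and function names here are hypothetical; the encoders are identity stand-ins and `mean_squared_gap` is a toy loss, not the loss function of this application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = List[List[float]]


@dataclass
class EncoderTrainingDevice:
    """Hypothetical wiring of modules 10 to 50; module 60 would consume the loss values."""
    acquire_training_set: Callable[[], Dict[str, list]]   # module 10
    video_encoder: Callable[[list], Features]             # module 20
    audio_encoder: Callable[[list], Features]             # module 30
    image_encoder: Callable[[list], Features]             # module 40
    first_loss: Callable[[Features, Features], float]     # module 50
    second_loss: Callable[[Features, Features], float]    # module 50

    def loss_values(self) -> Tuple[float, float]:
        subsets = self.acquire_training_set()
        video_features = self.video_encoder(subsets["video"])
        audio_features = self.audio_encoder(subsets["audio"])
        image_features = self.image_encoder(subsets["image"])
        return (self.first_loss(video_features, audio_features),
                self.second_loss(image_features, audio_features))


def mean_squared_gap(a: Features, b: Features) -> float:
    """Toy stand-in for a loss function."""
    gaps = [(x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v)]
    return sum(gaps) / len(gaps)


device = EncoderTrainingDevice(
    acquire_training_set=lambda: {
        "video": [[1.0, 0.0]], "audio": [[1.0, 0.0]], "image": [[0.0, 1.0]]},
    video_encoder=lambda xs: xs,   # identity stand-ins for real encoders
    audio_encoder=lambda xs: xs,
    image_encoder=lambda xs: xs,
    first_loss=mean_squared_gap,
    second_loss=mean_squared_gap,
)
first_value, second_value = device.loss_values()
```

The audio features are computed once and shared by both loss functions, which mirrors the way module 50 pairs the audio features with the video features and with the image features.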
[0145] In one embodiment, the video representation module 20 further includes:
[0146] The lip feature extraction unit is used to input the training video subset into the three-dimensional convolutional layer of the video encoder, extract lip features, and obtain a lip feature sequence.
[0147] The local motion feature encoding unit is used to input the lip feature sequence into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain encoded local motion features;
[0148] The feature conversion unit is used to input the encoded local motion features into the linear projection layer of the video encoder for feature conversion to obtain the lip feature sequence;
[0149] The first representation unit is used to input the lip feature sequence into the Transformer layer of the video encoder for representation to obtain the video features.
[0150] In one embodiment, the audio representation module 30 further includes:
[0151] The vector transformation unit is used to input the training audio subset into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors;
[0152] The second representation unit is used to input the audio vector into the Transformer layer of the audio encoder for representation to obtain the audio features.
[0153] In one embodiment, the image representation module 40 further includes:
[0154] An image representation unit is used to input the subset of training images into the Transformer layer of the image encoder for representation to obtain the image features.
[0155] In one embodiment, the loss function construction module 50 further includes:
[0156] The first loss function construction unit is used to construct the first loss function according to the following formula:
[0157]
[0158] Where L_v is the first loss function, z_v is the video features, z_a is the audio features, τ1 is the first adjustment parameter, log is the logarithmic function, and exp is the exponential function.
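The formula itself is not reproduced in this text. The listed symbols (video features, audio features, an adjustment parameter, log, and exp) are consistent with a contrastive loss of the InfoNCE type, so the sketch below assumes that form; it is an illustration under that assumption, not the formula of this application. The function name `contrastive_loss`, the default `tau=0.07`, and the convention that matched video and audio features share the same index are all assumptions.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def contrastive_loss(video_feats, audio_feats, tau=0.07):
    """Assumed InfoNCE-style loss, averaged over the batch:
    -log( exp(z_v[i].z_a[i] / tau) / sum_j exp(z_v[i].z_a[j] / tau) )."""
    if len(video_feats) != len(audio_feats):
        raise ValueError("expected one audio feature per video feature")
    total = 0.0
    for i, z_v in enumerate(video_feats):
        logits = [dot(z_v, z_a) / tau for z_a in audio_feats]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(video_feats)


video = [[1.0, 0.0], [0.0, 1.0]]
matched_audio = [[1.0, 0.0], [0.0, 1.0]]   # audio i belongs to video i
swapped_audio = [[0.0, 1.0], [1.0, 0.0]]   # audio paired with the wrong video
loss_matched = contrastive_loss(video, matched_audio)
loss_swapped = contrastive_loss(video, swapped_audio)
```

With this form, correctly paired lip movement and speech give a small loss and mismatched pairs give a large one, which is the behavior the training relies on.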
[0159] In one embodiment, the encoder training device further includes:
[0160] The first audio encoder training module is used to construct a third loss function based on the audio features and the video features;
[0161] The second audio encoder training module is used to construct a fourth loss function based on the audio features and the image features, and to train the audio encoder using the third loss function and the fourth loss function to obtain a trained audio encoder.
[0162] In one embodiment, the training module 60 further includes:
[0163] The first loss function value calculation unit is used to calculate the first loss function value using the video features and the audio features;
[0164] The second loss function value calculation unit is used to calculate the second loss function value using the image features and the audio features;
[0165] The video encoding parameter update unit is used to perform backpropagation based on the first loss function value, update the video encoding parameters of the video encoder, and obtain the trained video encoder.
[0166] The image coding parameter update unit is used to perform backpropagation based on the second loss function value to update the image coding parameters of the image encoder, thereby obtaining the trained image encoder.
[0167] Referring to Figure 7, this application also provides a computer device, the internal structure of which may be as shown in Figure 7. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor is designed to provide computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores video features, audio features, image features, and the like. The network interface is used to communicate with external terminals via a network connection. Furthermore, the computer device may also include input devices and a display screen. When the computer program is executed by the processor, it implements an encoder training method, including the following steps:
[0168] Obtain a training set, which includes a subset of training videos, a subset of training audio, and a subset of training images;
[0169] The training video subset is input into the video encoder for video representation to obtain video features;
[0170] The training audio subset is input into the audio encoder for audio representation to obtain audio features;
[0171] The subset of training images is input into an image encoder for image representation to obtain image features;
[0172] A first loss function is constructed based on the video features and the audio features, and a second loss function is constructed based on the image features and the audio features;
[0173] The video encoder is trained according to the first loss function to obtain a trained video encoder, and the image encoder is trained according to the second loss function to obtain a trained image encoder.
[0174] In one embodiment, inputting the training video subset into a video encoder for video representation to obtain video features includes:
[0175] The training video subset is input into the three-dimensional convolutional layer of the video encoder to extract lip features and obtain a lip feature sequence.
[0176] The lip feature sequence is input into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain the encoded local motion features;
[0177] The encoded local motion features are input into the linear projection layer of the video encoder for feature transformation to obtain the lip feature sequence;
[0178] The lip feature sequence is input into the Transformer layer of the video encoder for characterization to obtain the video features.
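The local motion feature encoding step can be sketched as follows, using the detail given in the claims of this application: each lip feature is downsampled, joined with the original lip feature, and the joined features are then encoded. This is a toy illustration in which flat lists stand in for lip feature maps, downsampling is average pooling, the order of concatenation is downsampled first, and encoding is multiplication by a fixed matrix; all of these choices and all names are assumptions.

```python
def downsample(feature, factor=2):
    """Average non-overlapping windows of `factor` values (assumed pooling)."""
    return [sum(feature[i:i + factor]) / len(feature[i:i + factor])
            for i in range(0, len(feature), factor)]


def encode_local_motion(lip_sequence, weights):
    """Downsample each lip feature, join it with the original, then project."""
    encoded = []
    for feature in lip_sequence:
        joined = downsample(feature) + feature   # concatenated feature
        encoded.append([sum(w * x for w, x in zip(row, joined))
                        for row in weights])
    return encoded


lip_sequence = [[1.0, 3.0, 5.0, 7.0], [2.0, 2.0, 4.0, 4.0]]
# Hypothetical 2 x 6 encoding matrix: picks the first and last joined values.
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
encoded = encode_local_motion(lip_sequence, weights)
```

Joining the downsampled feature with the original keeps both a coarse and a fine view of each lip feature before encoding.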
[0179] In one embodiment, inputting the training audio subset into an audio encoder for audio representation to obtain audio features includes:
[0180] The training audio subset is input into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors;
[0181] The audio vector is input into the Transformer layer of the audio encoder for characterization to obtain the audio features.
[0182] In one embodiment, the step of inputting the subset of training images into an image encoder for image representation to obtain image features includes:
[0183] The subset of training images is input into the Transformer layer of the image encoder for characterization to obtain the image features.
[0184] In one embodiment, constructing the first loss function based on the video features and the audio features includes:
[0185] The first loss function is constructed according to the following formula:
[0186]
[0187] Where L_v is the first loss function, z_v is the video features, z_a is the audio features, τ1 is the first adjustment parameter, log is the logarithmic function, and exp is the exponential function.
[0188] In one embodiment, after training the image encoder according to the second loss function, the method further includes:
[0189] A third loss function is constructed based on the audio features and the video features;
[0190] A fourth loss function is constructed based on the audio features and the image features. The audio encoder is then trained using the third loss function and the fourth loss function to obtain the trained audio encoder.
[0191] In one embodiment, training the video encoder according to the first loss function to obtain a trained video encoder, and training the image encoder according to the second loss function to obtain a trained image encoder, includes:
[0192] The first loss function value is calculated using the video features and the audio features;
[0193] The second loss function value is calculated using the image features and the audio features;
[0194] Backpropagation is performed based on the first loss function value to update the video encoding parameters of the video encoder, thereby obtaining the trained video encoder.
[0195] Backpropagation is performed based on the second loss function value to update the image encoding parameters of the image encoder, thereby obtaining the trained image encoder.
[0196] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer equipment to which the present application is applied.
[0197] One embodiment of this application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements an encoder training method. It is understood that the computer-readable storage medium in this embodiment can be a volatile readable storage medium or a non-volatile readable storage medium.
[0198] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media provided in this application and in the embodiments may include non-volatile and / or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0199] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, apparatus, article, or method. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes that element.
[0200] The above description is only a preferred embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural changes made based on the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. An encoder training method, characterized in that it includes: Obtaining a training set, which includes a subset of training videos, a subset of training audio, and a subset of training images; The training video subset is input into the video encoder for video representation to obtain video features; The training audio subset is input into the audio encoder for audio representation to obtain audio features; The subset of training images is input into an image encoder for image representation to obtain image features; A first loss function is constructed based on the video features and the audio features, and a second loss function is constructed based on the image features and the audio features. The first loss function uses the video features and audio features as calculation elements, introduces a first adjustment parameter, and is constructed using a logarithmic function and an exponential function. The first adjustment parameter is configured with preferred values. The second loss function uses the image features and audio features as calculation elements, introduces a second adjustment parameter, and is also constructed using a logarithmic function and an exponential function. The second adjustment parameter is also configured with preferred values. The video encoder is trained according to the first loss function to obtain a trained video encoder, and the image encoder is trained according to the second loss function to obtain a trained image encoder. The step of inputting the training video subset into the video encoder for video representation to obtain video features includes: The training video subset is input into the three-dimensional convolutional layer of the video encoder to extract lip features and obtain a lip feature sequence. The lip feature sequence is input into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain encoded local motion features. 
The two-dimensional residual layer first downsamples each lip feature in the lip feature sequence to obtain a downsampled lip feature sequence, then concatenates each downsampled lip feature in the downsampled lip feature sequence with the corresponding original lip feature to form a concatenated feature, and finally encodes all concatenated features to obtain encoded local motion features. The encoded local motion features are input into the linear projection layer of the video encoder for feature transformation to obtain the lip feature sequence; The lip feature sequence is input into the Transformer layer of the video encoder for characterization to obtain the video features.
2. The encoder training method according to claim 1, characterized in that, The step of inputting the training audio subset into the audio encoder for audio representation to obtain audio features includes: The training audio subset is input into the audio vector transformation layer of the audio encoder for vector transformation to obtain audio vectors; The audio vector is input into the Transformer layer of the audio encoder for characterization to obtain the audio features.
3. The encoder training method according to claim 1, characterized in that, The step of inputting the subset of training images into the image encoder for image representation to obtain image features includes: The subset of training images is input into the Transformer layer of the image encoder for characterization to obtain the image features.
4. The encoder training method according to claim 1, characterized in that, The step of constructing a first loss function based on the video features and the audio features includes: The first loss function is constructed according to the following formula: ; wherein L_v is the first loss function, z_v is the video features, z_a is the audio features, τ1 is the first adjustment parameter, log is the logarithmic function, and exp is the exponential function.
5. The encoder training method according to claim 1, characterized in that, After training the image encoder according to the second loss function, the method further includes: A third loss function is constructed based on the audio features and the video features; A fourth loss function is constructed based on the audio features and the image features. The audio encoder is then trained using the third loss function and the fourth loss function to obtain the trained audio encoder.
6. The encoder training method according to claim 1, characterized in that, The step of training the video encoder according to the first loss function to obtain a trained video encoder, and training the image encoder according to the second loss function to obtain a trained image encoder, includes: The first loss function value is calculated using the video features and the audio features; The second loss function value is calculated using the image features and the audio features; Backpropagation is performed based on the first loss function value to update the video encoding parameters of the video encoder, thereby obtaining the trained video encoder. Backpropagation is performed based on the second loss function value to update the image encoding parameters of the image encoder, thereby obtaining the trained image encoder.
7. An encoder training device, characterized in that it includes: The training set acquisition module is used to acquire a training set, which includes a training video subset, a training audio subset, and a training image subset. The video representation module is used to input the training video subset into the video encoder for video representation to obtain video features; An audio representation module is used to input the training audio subset into an audio encoder for audio representation to obtain audio features; The image representation module is used to input the subset of training images into the image encoder for image representation to obtain image features; The loss function construction module is used to construct a first loss function based on the video features and the audio features, and to construct a second loss function based on the image features and the audio features. The first loss function uses the video features and audio features as calculation elements, introduces a first adjustment parameter, and is constructed using a logarithmic function and an exponential function. The first adjustment parameter is configured with preferred values. The second loss function uses the image features and audio features as calculation elements, introduces a second adjustment parameter, and is also constructed using a logarithmic function and an exponential function. The second adjustment parameter is also configured with preferred values. The training module is used to train the video encoder according to the first loss function to obtain a trained video encoder, and to train the image encoder according to the second loss function to obtain a trained image encoder. The step of inputting the training video subset into the video encoder for video representation to obtain video features includes: The training video subset is input into the three-dimensional convolutional layer of the video encoder to extract lip features and obtain a lip feature sequence. 
The lip feature sequence is input into the two-dimensional residual layer of the video encoder for local motion feature encoding to obtain encoded local motion features. The two-dimensional residual layer first downsamples each lip feature in the lip feature sequence to obtain a downsampled lip feature sequence, then concatenates each downsampled lip feature in the downsampled lip feature sequence with the corresponding original lip feature to form a concatenated feature, and finally encodes all concatenated features to obtain encoded local motion features. The encoded local motion features are input into the linear projection layer of the video encoder for feature transformation to obtain the lip feature sequence; The lip feature sequence is input into the Transformer layer of the video encoder for characterization to obtain the video features.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the encoder training method according to any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the encoder training method according to any one of claims 1 to 6.