Pose transformation data processing method and device, computer equipment and storage medium

A pose transformation and data processing technology, applied in the field of image processing, which solves problems such as the low image quality of pose-transformed results, the difficulty of obtaining satisfactory results, and the inability to handle occlusion, and achieves high robustness and accuracy

Active Publication Date: 2020-04-21
TENCENT TECH (SHENZHEN) CO LTD
13 Cites 11 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0003] In related technologies, deep learning can be used to solve the pose transformation problem: the features of a given image are warped to the target pose image through a spatial transformation module. Ho...

Method used

The above pose transformation data processing method proposes a new deep learning framework. It adopts a segmented three-dimensional voxel representation to eliminate the ambiguity introduced by two-dimensional representations; the representation contains the body shape of the source image and the target pose information, and can express not only the three-dimensional coordinates of the object after the pose transformation but also the category of each part, so that high-quality synthesis of local parts can subsequently be realized. Complete object parts are synthesized independently, which solves the occlusion problem between parts and ensures high-quality local synthesis results. By generating transformed images, the coarse-to-fine framework allows the final result to reach high-definition resolution, and the fused target pose image has very high robustness and accuracy.
In the embodiment of the present application, the stacked hourglass architecture is adopted as the three-dimensional voxel network: the source image and the target three-dimensional pose are separately encoded and their features extracted, the extracted features are merged and input into the stacked hourglass network, and the merged features are decoded to generate the three-dimensional segmented voxels, which improves the accuracy of three-dimensional segmented voxel generation.
Here, the hierarchical generation network merges and takes as input the component-layer data corresponding to an object part in the source image, the three-dimensional segmented voxels, the target two-dimensional pose, and the transformed image, and processes them through the network layers of the hierarchical generation network to synthesize a complete part image regardless of occlusion. The output of the hierarchical generation network can be the residual between the target part and the corresponding part of the transformed image, so that in the next step the target pose image can be obtained by fusing this residual with the transformed image. Because the hierarchical generation network synthesizes complete object parts, occlusion under the target pose is avoided; since complete parts are generated, the subsequent global fusion step can produce a more correct and realistic result image when occlusion occurs.
[0051] Specifically, the component-layer data of each object part is synthesized into a complete target part image regardless of occlusion. The part image can be the residual between the target part and the corresponding part of the transformed image, so that in the next step the target pose image can be obtained by fusing the residual with the transformed image. Synt...

Abstract

The invention relates to a pose transformation data processing method and device, computer equipment and a storage medium, and relates to artificial intelligence image processing technology. The method comprises the following steps: a source image and a target three-dimensional pose are acquired, and three-dimensional segmented voxels including voxel category information are obtained by reconstruction based on semantic segmentation; the three-dimensional segmented voxels are projected to obtain a corresponding target-pose two-dimensional segmentation map, and the objects in the target-pose two-dimensional segmentation map are labeled based on the category information to obtain component categories; a target two-dimensional pose corresponding to the target three-dimensional pose is obtained, and features of the source image, the target-pose two-dimensional segmentation map and the target two-dimensional pose are extracted to synthesize an intermediate-scale transformed image; the source image, the three-dimensional segmented voxels, the target two-dimensional pose and the transformed image are cropped to obtain component-layer data for each object part, and component synthesis is performed on the component-layer data of each object part to generate a part image; and the transformed image and the part images are fused to obtain a target pose image, thereby improving the quality of the pose-transformed image.


Examples

  • Experimental program(1)

Example Embodiment

[0025] In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.
[0026] It can be understood that the terms “first”, “second”, etc. used in this application can be used herein to describe various elements, but unless otherwise specified, these elements are not limited by these terms. These terms are only used to distinguish the first element from another element.
[0027] Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
[0028] Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology includes computer vision technology and machine learning/deep learning.
[0029] Computer Vision (CV) is a science that studies how to make machines "see". More specifically, it refers to using cameras and computers instead of human eyes to identify, track and measure targets, and to further process the resulting graphics so that the computer output becomes an image more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies, trying to establish artificial intelligence systems that can obtain information from images or multi-dimensional data.
[0030] Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
[0031] The solution of the present application relates to image processing technology based on artificial intelligence. The specific description is given by the following embodiments.
[0032] FIG. 1 is a diagram of an application environment for pose transformation data processing provided in some embodiments. As shown in FIG. 1, the application environment includes a terminal 110 and a server 120. The terminal 110 may send the source image and the target three-dimensional pose to the server 120. The server 120 can combine the source image and the target three-dimensional pose and, based on semantic segmentation, reconstruct three-dimensional segmented voxels including voxel category information; project the three-dimensional segmented voxels to obtain the corresponding target-pose two-dimensional segmentation map, and label the objects in the target-pose two-dimensional segmentation map based on the voxel category information to obtain the corresponding component categories; obtain the target two-dimensional pose corresponding to the target three-dimensional pose, and extract features of the source image, the target-pose two-dimensional segmentation map and the target two-dimensional pose to synthesize the intermediate-scale transformed image; crop the source image, the three-dimensional segmented voxels, the target two-dimensional pose and the transformed image to obtain the component-layer data corresponding to each object part, the object parts being determined according to the component categories; separately synthesize the component-layer data of each object part to generate the part image corresponding to each object part; and fuse the transformed image and the part images to obtain the target pose image. The server 120 returns the target pose image to the terminal 110, and the terminal 110 may display the target pose image.
[0033] In some embodiments, the terminal 110 may also obtain and display the target pose image from the source image and the target three-dimensional pose through the steps of the foregoing embodiment. An application program for pose transformation may be installed in the terminal 110, and a three-dimensional segmentation voxel module, an intermediate-scale module and a component image generation module may be deployed in the application program.
[0034] The server 120 may be an independent physical server, a server cluster composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud databases, cloud storage and CDN. The terminal 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited to these. The terminal 110 and the server 120 can be connected through a communication connection method such as a network, which is not limited in this application.
[0035] As shown in FIG. 2, in some embodiments a pose transformation data processing method is proposed. This embodiment mainly describes the method as applied to the server 120 or the terminal 110 in FIG. 1 above. The method can include the following steps:
[0036] Step S202: Obtain a source image and a three-dimensional pose of the target.
[0037] Specifically, the source image is the image before the pose transformation, and the target three-dimensional pose is a three-dimensional representation of the target pose, which describes the transformed target pose for the object matching the source image. There can be one or more target three-dimensional poses, meaning the object in the source image needs to be transformed into one or more poses. The object in the source image can be any person or thing that has a pose, such as a human body, plant or animal. There can also be one or more source images, meaning the objects in one or more source images need to be pose-transformed. A source image can include one or more objects whose poses are to be transformed.
[0038] In some embodiments, the target three-dimensional posture includes the positions of the three-dimensional human body joint points, which can be expressed by using a heat map and can be obtained through three-dimensional posture estimation technology.
[0039] In some embodiments, when the target three-dimensional pose includes multiple target poses, a target pose image is generated for each target three-dimensional pose, and the target pose images are combined to generate a target pose video. For example, a video in which the same person is transformed into multiple different poses can be generated.
[0040] Step S204: Combining the source image and the target's three-dimensional pose, and reconstructing based on semantic segmentation to obtain three-dimensional segmented voxels. The three-dimensional segmented voxels include voxel category information.
[0041] Specifically, semantic segmentation is a basic task in computer vision. In semantic segmentation, the visual input needs to be divided into different semantically interpretable categories, that is, the classification categories are meaningful in the real world. Semantic segmentation can use convolutional neural networks to assign a category label to each voxel. A convolutional layer can effectively capture local features in the image or in the target three-dimensional pose, and nesting many such modules together in a hierarchical manner allows a series of convolutions to capture the complex features of the image or of the target three-dimensional pose. A convolutional neural network can therefore be used to extract features of the source image and of the target 3D pose based on semantic segmentation, thereby reconstructing three-dimensional segmented voxels containing voxel category information, the category information of each voxel being obtained by labeling a uniformly segmented grid topology. The three-dimensional segmented voxels mark the category of each voxel in three-dimensional space and contain the object shape of the source image and the target pose information; they can not only represent the three-dimensional coordinates of the human body after pose transformation, but also identify the component category of each object part. The component categories of an object are divided according to the characteristics of the object. For example, when the object is a human body, the component categories can include face, left hand, right hand, left leg, right leg, torso and background.
[0042] In one embodiment, the source image and the target three-dimensional pose are input into a trained three-dimensional voxel network to obtain the output three-dimensional segmented voxels. The network structure of the 3D voxel network can be customized and can be a deep neural network.
[0043] Step S206: Project the three-dimensional segmented voxels to obtain a corresponding two-dimensional segmentation map of the target pose, and label the objects in the two-dimensional segmentation map of the target pose based on the category information of the voxel to obtain the corresponding component category.
[0044] Specifically, the three-dimensional segmented voxels are projected onto the horizontal plane of the image, that is, onto the x-y plane, to obtain a two-dimensional segmentation map. Since the three-dimensional segmented voxels are obtained based on the target three-dimensional pose and correspond to the target pose, the result is the corresponding target-pose two-dimensional segmentation map. Regions corresponding to voxels with the same category information are combined to obtain, on the two-dimensional segmentation map, the region corresponding to each component category of the object, so that each region is labeled with a different component category. Based on the two-dimensional segmentation map, the human pose transformation can be treated as an image-to-image conversion problem.
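As an illustration of this projection step, the following minimal NumPy sketch (an assumption for clarity, not the patent's exact implementation) projects an integer label volume of shape (H, W, D) onto the x-y plane by keeping, at each pixel, the label of the nearest non-background voxel along the depth axis; the category names and sizes are hypothetical placeholders.

```python
import numpy as np

# Hypothetical component categories; the embodiment uses 7 classes for a human body.
CATEGORIES = ["background", "face", "left_hand", "right_hand",
              "left_leg", "right_leg", "torso"]

def project_voxels_to_2d(voxel_labels: np.ndarray, background: int = 0) -> np.ndarray:
    """Project a (H, W, D) integer label volume onto the x-y plane.

    For each (x, y) position, the label of the nearest non-background voxel
    (smallest depth index) is kept, yielding a target-pose 2D segmentation map
    annotated with component categories.
    """
    H, W, D = voxel_labels.shape
    seg_2d = np.full((H, W), background, dtype=voxel_labels.dtype)
    occupied = voxel_labels != background          # (H, W, D) occupancy mask
    has_any = occupied.any(axis=2)                 # pixels covered by the object
    nearest = occupied.argmax(axis=2)              # first occupied depth index per pixel
    rows, cols = np.nonzero(has_any)
    seg_2d[rows, cols] = voxel_labels[rows, cols, nearest[rows, cols]]
    return seg_2d

# Usage: a toy label volume with the embodiment's 256*256*64 voxel resolution.
voxels = np.zeros((256, 256, 64), dtype=np.int64)
voxels[100:140, 100:140, 10:20] = CATEGORIES.index("torso")
print(project_voxels_to_2d(voxels).max())          # torso label where the block projects
```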
[0045] Step S208: Obtain the target two-dimensional posture corresponding to the target three-dimensional posture, extract the source image, the two-dimensional segmentation map of the target posture, and the feature of the target two-dimensional posture to synthesize an intermediate scale transformed image.
[0046] Here, the target two-dimensional pose is a two-dimensional representation of the target pose. It describes the transformed target pose for the object matching the source image and can be represented by a two-dimensional heat map. When there are multiple target 3D poses, there are the same number of corresponding target 2D poses. The target 2D pose and the target 3D pose represent the same pose, and the 2D pose can be generated by a 2D pose estimation algorithm. The intermediate-scale transformed image is a rough image whose resolution is smaller than that of the target image but whose pose is already the transformed pose. In the coarse-to-fine framework, the intermediate-scale transformed image is generated first and then combined with the subsequent part images, so that the final result reaches the high-definition resolution of the target image.
[0047] Specifically, when synthesizing the intermediate-scale transformed image, the target two-dimensional pose is introduced to enhance spatial attention on the target. Features of the source image, the target-pose 2D segmentation map and the target 2D pose are extracted; the input features can be learned through a deep neural network, and the texture of the source image can be transformed into a target-pose texture matching the target-pose 2D segmentation map and the target 2D pose, thereby synthesizing the intermediate-scale transformed image.
[0048] In one embodiment, the source image, the target-pose two-dimensional segmentation map and the target two-dimensional pose are input into a trained intermediate-scale generation network to obtain the output intermediate-scale transformed image. The network structure of the intermediate-scale generation network can be customized and can be a deep neural network.
[0049] In step S210, the source image, the three-dimensional segmented voxels, the target two-dimensional pose and the transformed image are respectively cropped to obtain the component-layer data corresponding to each object part, the object parts being determined according to the component categories. Part synthesis is performed on the component-layer data of each object part to generate the part image corresponding to each object part.
[0050] Here, the component-layer data is the data of the corresponding object part in each input image, and the object parts are the parts obtained based on semantic segmentation. To determine the object parts from the component categories, categories irrelevant to the object itself can be filtered out, such as the background category. For example, if the component categories include face, left hand, right hand, left leg, right leg, torso and background, the background category unrelated to the object body is filtered out, and the object parts are face, left hand, right hand, left leg, right leg and torso. It can be understood that, because each object part occupies a different proportion of the entire object, the image resolution of the component-layer data obtained by cropping may also differ. In one embodiment, the image resolution of the component-layer data is set according to the relative length of each object part within the entire object.
[0051] Specifically, part synthesis is performed on the component-layer data of each object part to synthesize a complete target part image regardless of occlusion. The part image can be the residual between the target part and the corresponding part of the transformed image, so that the target pose image can be obtained by fusing the residual with the transformed image in the next step. Synthesizing the complete object part corresponding to each object part avoids occlusion under the target pose. Since complete parts are generated, the subsequent global fusion step can produce a more accurate and realistic result image when occlusion occurs.
[0052] In one embodiment, through the trained hierarchical generation network corresponding to each object component, the component layer data is input to the hierarchical generation network matching the object component to obtain the output component image. Among them, the network structure of the hierarchical generation network corresponding to each object component can be self-defined, and the hierarchical generation network of each object component is independently trained and can be a deep neural network.
[0053] In step S212, the transformed image and the component image are merged to obtain the target posture image.
[0054] Specifically, the fusion method can be customized. The transformed image can be enlarged to a resolution matching the target pose image, and each part image is then superimposed onto the matching target part region in the enlarged transformed image to obtain the target pose image. Because each part image is generated independently, global consistency between them may not be guaranteed. In some embodiments, the target pose image is therefore further post-processed to obtain an accurate target pose image. For example, filtering is performed at the boundaries of each object part region, or a repaired image is obtained through a fine-scale generation network and fused with the target pose image to obtain an accurate target pose image.
[0055] The above pose transformation data processing method proposes a new deep learning framework that uses a segmented three-dimensional voxel representation to eliminate the ambiguity caused by two-dimensional representations. The representation contains the body shape of the source image and the target pose information; it not only represents the three-dimensional coordinates of the object after the pose transformation but also identifies each component category, so that subsequent high-quality synthesis of local parts can be realized. Complete object parts are synthesized separately, which solves the occlusion problem between parts and ensures high-quality local synthesis results. Through the generation of transformed images, the coarse-to-fine framework allows the final result to reach high-definition resolution, and the target pose image obtained by fusion has very high robustness and accuracy.
[0056] In some embodiments, step S204 includes: inputting the source image and the target three-dimensional pose into the three-dimensional voxel network; the three-dimensional voxel network encodes the source image and the target three-dimensional pose to obtain an encoding result, extracts features from the encoding result, and decodes and outputs the three-dimensional segmented voxels carrying the voxel category information.
[0057] Specifically, the 3D voxel network may be a deep neural network including convolutional layers, pooling layers, connection layers and other network structures. The source image and the target 3D pose are encoded through the connected network layers to obtain the encoding result; features are extracted from the encoding result, and the three-dimensional segmented voxels carrying the voxel category information are decoded and output. The input of the 3D voxel network is the source image and the target 3D pose, and the output is the 3D segmented voxels carrying the voxel category information. For model training, a supervised training method can be used to train the 3D voxel network. In one embodiment, the training data comes from annotated 3D human body models in a data set; the training data includes source training images, target training 3D poses and annotated three-dimensional human body voxels. In one embodiment, the source training image and the target training three-dimensional pose are input into the three-dimensional voxel network, and after being processed sequentially by each layer of the three-dimensional voxel network, the corresponding training three-dimensional segmented voxels are output. According to the difference between the training 3D segmented voxels and the annotated 3D voxels, the network parameters of the 3D voxel network are adjusted by back propagation to obtain the trained 3D voxel network.
[0058] In some embodiments, a cross-entropy loss function is used to describe the difference between the training 3D segmented voxels and the annotated 3D voxels, and the network parameters of the 3D voxel network are adjusted by back propagation to obtain the trained 3D voxel network. The cross-entropy loss function is defined as follows:
[0059] $\mathcal{L}_{CE} = -\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{k=1}^{D}\sum_{c=1}^{N} V^{gt}_{i,j,k,c}\,\log\big(\sigma(V_{i,j,k})_c\big)$
[0060] where $\mathcal{L}_{CE}$ is the cross-entropy loss; H, W and D are the height, width and depth; N is the number of component categories; i, j, k and c are the corresponding index variables; $V^{gt}$ is the correct annotation of the 3D voxels; $V$ is the 3D voxel annotation output by the network; and $\sigma$ is the softmax function.
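A minimal PyTorch sketch of such a voxel-wise cross-entropy loss (assuming the network outputs unnormalized class scores per voxel and the labels are integer category indices; tensor names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def voxel_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Voxel-wise cross-entropy.

    logits: (B, N, H, W, D) unnormalized scores for N component categories.
    labels: (B, H, W, D) integer ground-truth category per voxel.
    F.cross_entropy applies the softmax internally and averages over all voxels.
    """
    return F.cross_entropy(logits, labels)

# Usage with the embodiment's 7 categories (spatial sizes reduced for brevity):
logits = torch.randn(1, 7, 64, 64, 16, requires_grad=True)
labels = torch.randint(0, 7, (1, 64, 64, 16))
loss = voxel_cross_entropy(logits, labels)
loss.backward()
```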
[0061] In this embodiment, the three-dimensional segmented voxels carrying the category information of the voxels are directly obtained through the trained three-dimensional voxel network, so that the reconstruction of the three-dimensional segmented voxels is completed through the three-dimensional voxel network, which is efficient and accurate.
[0062] In some embodiments, the 3D voxel network encoding the source image and the target 3D pose to obtain the encoding result, extracting features from the encoding result, and decoding and outputting the 3D segmented voxels carrying the voxel category information includes: encoding the source image and extracting features to obtain a first feature; encoding the target 3D pose and extracting features to obtain a second feature; merging the first feature and the second feature to obtain a merged feature; inputting the merged feature into the stacked hourglass network; and decoding through the stacked hourglass network to obtain the three-dimensional segmented voxels.
[0063] Specifically, the source image and the target three-dimensional pose are separately encoded and their features extracted through several convolutional layers and pooling layers in the three-dimensional voxel network; the encoding and feature extraction methods can be the same or different. The first feature and the second feature are combined to generate a matrix that is input into the stacked hourglass network. In one embodiment, the stacked hourglass network includes two hourglass networks: the first hourglass network generates initial features, and the second hourglass network refines the initial features, from which the three-dimensional segmented voxels are decoded.
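The following is a much-simplified PyTorch sketch of that data flow: separate encoders for the source image and the 3D pose heat maps, feature merging, and two stacked refinement stages standing in for the hourglass modules. Layer sizes, channel counts and module names are assumptions for illustration, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class SimplifiedVoxelNet(nn.Module):
    """Toy stand-in for the 3D voxel network: two encoders, a feature merge,
    two stacked refinement stages, and a decoder of per-voxel category scores."""

    def __init__(self, num_classes: int = 7, num_joints: int = 17):
        super().__init__()
        self.image_enc = nn.Sequential(            # 2D features of the source image
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pose_enc = nn.Sequential(             # 3D features of the target pose
            nn.Conv3d(num_joints, 16, 3, padding=1), nn.ReLU())

        def stage():                               # simplified "hourglass" stage
            return nn.Sequential(
                nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv3d(32, 32, 3, padding=1), nn.ReLU())
        self.hourglass1, self.hourglass2 = stage(), stage()
        self.decoder = nn.Conv3d(32, num_classes, 1)   # per-voxel category scores

    def forward(self, image: torch.Tensor, pose3d: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); pose3d: (B, J, H, W, D) 3D joint heat maps.
        f_img = self.image_enc(image)                          # (B, 16, H, W)
        f_pose = self.pose_enc(pose3d)                         # (B, 16, H, W, D)
        depth = f_pose.shape[-1]
        f_img3d = f_img.unsqueeze(-1).expand(-1, -1, -1, -1, depth)
        merged = torch.cat([f_img3d, f_pose], dim=1)           # merge the two features
        refined = self.hourglass2(self.hourglass1(merged))     # coarse then refined
        return self.decoder(refined)                           # (B, classes, H, W, D)

# Usage with reduced spatial sizes:
net = SimplifiedVoxelNet()
scores = net(torch.randn(1, 3, 64, 64), torch.randn(1, 17, 64, 64, 16))
print(scores.shape)  # torch.Size([1, 7, 64, 64, 16])
```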
[0064] In the embodiment of this application, by adopting the stacked hourglass network as the architecture of the three-dimensional voxel network, the source image and the target three-dimensional pose are separately encoded and their features extracted, and the extracted features are then merged, input into the stacked hourglass network and decoded into the three-dimensional segmented voxels, which improves the accuracy of 3D segmented voxel generation.
[0065] In some embodiments, step S208 includes combining the source image, the target-pose two-dimensional segmentation map and the target two-dimensional pose into an input matrix, and inputting the input matrix into the intermediate-scale generation network; the intermediate-scale generation network performs feature extraction on the input matrix sequentially through a down-sampling layer, a residual block layer and an up-sampling layer to obtain the intermediate-scale transformed image.
[0066] Specifically, the intermediate-scale generation network performs feature extraction on the input matrix through the down-sampling layer, the residual block layer and the up-sampling layer in sequence, so as to output a low-resolution rough result image. The down-sampling layer reduces the feature size so that the network focuses on global semantic information and forms an overall understanding of the input image. The residual block layer makes the computed result a residual, reducing the amount of computation. The up-sampling layer combines the down-sampled information and the input information to restore detail and gradually recover image accuracy. In one embodiment, the intermediate-scale generation network is trained through a supervised training method, where the loss function can be customized as needed. In one embodiment, the network parameters of the intermediate-scale generation network are adjusted by back propagation of at least one of a perceptual loss, an adversarial loss and a feature matching loss to obtain the trained intermediate-scale generation network. The perceptual loss and the feature matching loss make the output image closer to the label image, while the adversarial loss ensures that the result image maintains the object-feature consistency of the source image.
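A hedged sketch of such a downsample / residual-block / upsample generator in PyTorch; the channel counts, block counts and the channel-wise packing of the three inputs are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)        # residual connection keeps computation light

class IntermediateScaleGenerator(nn.Module):
    """Down-sampling layers -> residual blocks -> up-sampling layers."""
    def __init__(self, in_ch: int, out_ch: int = 3, base: int = 32, n_res: int = 4):
        super().__init__()
        self.down = nn.Sequential(     # shrink features, capture global semantics
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU())
        self.res = nn.Sequential(*[ResidualBlock(base * 4) for _ in range(n_res)])
        self.up = nn.Sequential(       # restore detail and resolution
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base, out_ch, 7, padding=3), nn.Tanh())

    def forward(self, source, seg2d, pose2d):
        # Concatenate source image, target-pose 2D segmentation and target 2D pose
        # along the channel axis to form the input matrix.
        x = torch.cat([source, seg2d, pose2d], dim=1)
        return self.up(self.res(self.down(x)))

# Usage: 3-channel image, 7-channel one-hot segmentation, 17-channel pose heat map.
gen = IntermediateScaleGenerator(in_ch=3 + 7 + 17)
out = gen(torch.randn(1, 3, 256, 256), torch.randn(1, 7, 256, 256),
          torch.randn(1, 17, 256, 256))
print(out.shape)  # torch.Size([1, 3, 256, 256])
```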
[0067] In this embodiment, the intermediate-scale transformed image is directly obtained through the trained intermediate-scale generation network, so that the synthesis of the transformed image is completed through the intermediate-scale generation network, which is efficient and accurate.
[0068] In some embodiments, the training of the intermediate-scale generation network includes the following steps: obtain a first training sample, the first training sample including a source training image, a target-pose training two-dimensional segmentation map, a target training two-dimensional pose and a corresponding label transformation image; input the source training image, the target-pose training two-dimensional segmentation map and the target training two-dimensional pose into the intermediate-scale generation network, and after the layers of the intermediate-scale generation network process them in sequence, output the corresponding training transformation image; adjust the network parameters of the intermediate-scale generation network by back propagation according to the difference between the training transformation image and the label transformation image to obtain the trained intermediate-scale generation network.
[0069] Here, the first training sample includes the source training image, the target-pose training two-dimensional segmentation map, the target training two-dimensional pose and the corresponding label transformation image. The training transformation image is the image predicted by the intermediate-scale generation network from the features it extracts from the input source training image, target-pose training two-dimensional segmentation map and target training two-dimensional pose. The label transformation image is the ground-truth intermediate-scale image.
[0070] Specifically, after the first training sample is obtained, the source training image, the target-pose training two-dimensional segmentation map and the target training two-dimensional pose are input into the intermediate-scale generation network, and after each layer of the intermediate-scale generation network processes them in sequence, the training transformation image is output. A loss function is constructed according to the difference between the training transformation image and the label transformation image, back propagation is then performed in the direction of minimizing the loss function, and the network parameters of the intermediate-scale generation network are adjusted; training continues until the training end condition is satisfied or all training samples have been trained.
[0071] In one embodiment, as shown in FIG. 3, adjusting the network parameters of the intermediate-scale generation network according to the difference between the training transformation image and the label transformation image to obtain the trained intermediate-scale generation network includes:
[0072] In step S302, features of the label transformation image and of the training transformation image are respectively extracted through a pre-trained perceptual network to obtain feature maps, and the distance between the feature maps is calculated to obtain the perceptual loss.
[0073] Specifically, the pre-trained perceptual network is a network that has already been trained and can perceive image quality, such as a VGG-19 pre-trained network; the perceptual loss makes the transformed-image result closer to the correctly labeled image. In one embodiment, the perceptual loss function can be expressed as:
[0074] $\mathcal{L}_{P} = \sum_{l=1}^{L} \big\| \phi_l(I_{gt}) - \phi_l(\hat{I}) \big\|_1$
[0075] where $\phi_l$ denotes the features of the l-th layer of the pre-trained network, $I_{gt}$ is the label transformation image (the real target image), $\hat{I}$ is the synthesized training transformation image, and L is the total number of layers of the pre-trained network used to calculate the perceptual loss.
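A sketch of a VGG-19 perceptual loss of this form; the specific layer selection and the L1 distance are common choices assumed here, and the code assumes torchvision 0.13 or newer for the weights API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-19 feature maps of two images."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):      # selected VGG-19 layers (assumed)
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                      # the perceptual network is frozen
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss, x, y = 0.0, pred, target
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layer_ids:                    # compare selected feature maps
                loss = loss + F.l1_loss(x, y)
        return loss

# Usage:
perc = PerceptualLoss()
loss = perc(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```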
[0076] Step S304: Perform adversarial learning on the intermediate-scale generation network and the discriminant network according to the label transformation image, the source training image and the training transformation image to obtain the adversarial loss.
[0077] Specifically, the adversarial loss makes the transformed-image result closer to a real image. Adversarial learning lets two machine learning models play against each other in order to obtain the desired machine learning model. The intermediate-scale generation network and the discriminant network are used for adversarial learning: the goal of the intermediate-scale generation network is to obtain the desired output from its input, while the goal of the discriminant network is to distinguish the output of the generation network from real images as well as possible. The input of the discriminant network includes the output of the intermediate-scale generation network and real images. The two networks learn against each other and constantly adjust their parameters; the ultimate goal is for the intermediate-scale generation network to deceive the discriminant network so that it cannot judge whether the output of the generation network is real.
[0078] Here, the parameters of the discriminant network model are adjusted in the direction that makes its loss value smaller, so that its discriminative ability becomes stronger, while the parameters of the intermediate-scale generation network are adjusted in the direction that makes the loss value of the discriminant network model larger, so that the discriminant network model finds it as hard as possible to distinguish the output of the intermediate-scale generation network from real images. In adversarial learning, the model parameters can be adjusted multiple times.
[0079] In one embodiment, the adversarial loss function can be expressed as:
[0080] $\mathcal{L}_{adv} = \mathbb{E}\big[\log D(I_{s}, I_{gt})\big] + \mathbb{E}\big[\log\big(1 - D(I_{s}, \hat{I})\big)\big]$
[0081] where $D$ is the discriminant network (whose i-th layer features $D_i$ are also used in the feature matching loss below), $I_{gt}$ is the label transformation image, $\hat{I}$ is the synthesized training transformation image, and $I_{s}$ is the source training image.
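For illustration, a minimal GAN loss of this shape in PyTorch, with the discriminator conditioned on the source image by channel concatenation; this is an assumption consistent with the text above, not the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, source, real, fake):
    """The discriminator tries to score (source, real) as 1 and (source, fake) as 0."""
    real_logits = D(torch.cat([source, real], dim=1))
    fake_logits = D(torch.cat([source, fake.detach()], dim=1))
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_adv_loss(D, source, fake):
    """The generator tries to make the discriminator score (source, fake) as real."""
    fake_logits = D(torch.cat([source, fake], dim=1))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```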
[0082] Step S306: Calculate the feature distances between the label transformation image and the training transformation image at multiple different scales through multiple convolutional layers of different scales in the discriminant network, and aggregate the feature distances at the multiple scales to obtain the feature matching loss.
[0083] Specifically, the feature matching loss also makes the transformed-image result closer to the correctly labeled image. By measuring the feature distance between the label transformation image and the training transformation image at multiple different scales, the extracted features allow a high-quality transformed image to be synthesized more accurately.
[0084] In one embodiment, the feature matching loss function can be expressed as:
[0085] $\mathcal{L}_{FM} = \sum_{i=1}^{T} \big\| D_i(I_{s}, I_{gt}) - D_i(I_{s}, \hat{I}) \big\|_1$
[0086] where $D_i$ denotes the features of the i-th layer of the discriminant network, T is the total number of layers, $I_{gt}$ is the label transformation image, $\hat{I}$ is the synthesized training transformation image, and $I_{s}$ is the source training image.
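A sketch of a feature matching loss computed from a discriminator's intermediate activations; it assumes a hypothetical discriminator that exposes a list of per-layer features (for example via a method such as D.extract_features), which is an illustrative interface rather than a defined API.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(disc_features_real, disc_features_fake):
    """L1 distance between discriminator features of the label image and of the
    synthesized image, summed over the T layers/scales.

    Both arguments are lists of tensors, one per discriminator layer, computed on
    the (source, label) pair and the (source, synthesized) pair respectively.
    """
    loss = 0.0
    for f_real, f_fake in zip(disc_features_real, disc_features_fake):
        loss = loss + F.l1_loss(f_fake, f_real.detach())
    return loss
```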
[0087] In step S308, the target loss is determined according to the perceptual loss, the adversarial loss and the feature matching loss, and the network parameters of the intermediate-scale generation network are adjusted by back propagation according to the target loss to obtain the trained intermediate-scale generation network.
[0088] Specifically, the target loss is calculated according to the formula $\mathcal{L} = \mathcal{L}_{adv} + \lambda_{FM}\,\mathcal{L}_{FM} + \lambda_{P}\,\mathcal{L}_{P}$, where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{FM}$ is the feature matching loss, $\mathcal{L}_{P}$ is the perceptual loss, $\lambda_{FM}$ is the weight of the feature matching loss, and $\lambda_{P}$ is the weight of the perceptual loss. Back propagation is performed in the direction of minimizing the target loss, the network parameters of the intermediate-scale generation network are adjusted, and training continues until the training end condition is satisfied or all training samples have been trained. In one embodiment, $\lambda_{FM}$ and $\lambda_{P}$ are both 1.
[0089] In the embodiments of the present application, the target loss is obtained by weighting the perceptual loss, the adversarial loss and the feature matching loss, and combining these different types of losses to train the intermediate-scale generation network improves the quality of the transformed-image result.
[0090] In some embodiments, step S210 includes: obtaining a source two-dimensional segmentation map corresponding to the source image; segmenting the source image according to the source two-dimensional segmentation map to obtain the component-layer data of each object part in the source image; obtaining the cropping information corresponding to each object part based on the center position of each object part; and cropping the three-dimensional segmented voxels, the target two-dimensional pose and the transformed image respectively to obtain the component-layer data matching the corresponding cropping information.
[0091] Specifically, the source two-dimensional segmentation map is used to describe the object shape and initial posture information of the source image. It can not only represent the two-dimensional coordinates of the human body in the initial posture, but also includes different component categories for marking the object. In order to establish the object layer correspondence between the source image and the target image, the source 2D segmentation map includes component categories consistent with the 3D segmentation voxel.
[0092] In some embodiments, the source two-dimensional segmentation map corresponding to the source image is obtained through a component segmentation network, which is used to determine the two-dimensional segmentation map corresponding to an input image. To balance efficiency and performance, a U-Net architecture can be used as the component segmentation network. The component segmentation network can be trained by a supervised training method. The training sample data includes the source training image and a label two-dimensional segmentation map, where the label two-dimensional segmentation map can be obtained by projecting the three-dimensional segmented voxels corresponding to the source image onto the x-y plane, which yields the correct labels for the segmentation of two-dimensional object parts; a cross-entropy loss function can be used to train the component segmentation network. The component-layer data of the source image corresponding to each object part can be obtained by multiplying the source image and the source two-dimensional segmentation map pixel by pixel.
[0093] In some embodiments, a cross-entropy loss function is used to describe the difference between the training two-dimensional segmentation map and the label two-dimensional segmentation map, and the network parameters of the component segmentation network are adjusted by back propagation to obtain the trained component segmentation network. The cross-entropy loss function is defined as follows:
[0094] $\mathcal{L}_{CE} = -\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{c=1}^{N} S^{gt}_{i,j,c}\,\log\big(\sigma(S_{i,j})_c\big)$
[0095] where $\mathcal{L}_{CE}$ is the cross-entropy loss; H and W are the height and width; N is the number of component categories; i, j and c are the corresponding index variables; $S^{gt}$ is the label two-dimensional segmentation map; $S$ is the training two-dimensional segmentation map output by the network; and $\sigma$ is the softmax function.
[0096] The component-layer data of the source image for each object part is obtained by $I_s^m = I_s \odot S_s^m$, where $I_s^m$ represents the component-layer data of component category m corresponding to the source image, $I_s$ is the source image, m is the component category, and $S_s^m$ represents the region of the source 2D segmentation map whose component category is m.
[0097] The cropping information refers to the image resolution information and the cropping center position corresponding to each component layer. Each cropping center position lies at the center of the object part region in the image to be cropped. According to the cropping information, the cropping center positions in the three-dimensional segmented voxels, the target two-dimensional pose and the transformed image are obtained, and with each cropping center position as the center, cropping is performed according to the matching image resolution information to obtain the component-layer data corresponding to each object part.
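A sketch of the masking and center-cropping described above; the per-part crop sizes follow the 128/256/512 settings given later in the embodiment, while the mask-center heuristic, category indices and border handling are simplifying assumptions.

```python
import numpy as np

CROP_SIZE = {"face": 128, "left_hand": 256, "right_hand": 256,
             "left_leg": 512, "right_leg": 512, "torso": 512}   # illustrative sizes

def component_layer(image: np.ndarray, seg: np.ndarray, category: int) -> np.ndarray:
    """Pixel-wise product of the image with the region of one component category."""
    mask = (seg == category).astype(image.dtype)
    return image * mask[..., None]                 # (H, W, C) * (H, W, 1)

def crop_around_part(array: np.ndarray, seg: np.ndarray,
                     category: int, size: int) -> np.ndarray:
    """Crop a size x size window centered on the component region's center.

    Border handling (parts near the image edge) is omitted in this sketch.
    """
    ys, xs = np.nonzero(seg == category)
    cy, cx = int(ys.mean()), int(xs.mean())        # cropping center position
    half = size // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    return array[y0:y0 + size, x0:x0 + size]

# Usage: crop the face layer of a 1024*1024 source image (face category assumed = 1).
image = np.zeros((1024, 1024, 3), dtype=np.float32)
seg = np.zeros((1024, 1024), dtype=np.int64)
seg[100:180, 480:560] = 1
face_layer = crop_around_part(component_layer(image, seg, 1), seg, 1, CROP_SIZE["face"])
print(face_layer.shape)  # (128, 128, 3)
```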
[0098] In this embodiment, the source image is efficiently segmented through the source two-dimensional segmentation map, and other images are respectively cropped through the component category, and each component layer data is accurately and quickly obtained.
[0099] In some embodiments, step S210 includes: obtaining the hierarchical generation network corresponding to each object part; inputting the component-layer data into the hierarchical generation network of the matching object part; and the hierarchical generation network of each object part respectively outputting the component image matching that object part.
[0100] Here, the hierarchical generation network merges and takes as input the component-layer data of the corresponding object part cropped from the source image, the three-dimensional segmented voxels, the target two-dimensional pose and the transformed image, and processes it through each network layer of the hierarchical generation network to synthesize a complete part image regardless of occlusion. The output of the hierarchical generation network can be the residual between the target part and the corresponding part of the transformed image, so that in the next step the residual output of the hierarchical generation network can be fused with the transformed image to obtain the target pose image. Because the hierarchical generation network synthesizes complete object parts, occlusion under the target pose is avoided; since complete parts are generated, the subsequent global fusion step can produce a more accurate and realistic result image when occlusion occurs.
[0101] The hierarchical generation network of each object part is independently trained, so that the images of each object part can be synthesized independently without occlusion.
[0102] In some embodiments, a single level generation network is shared for object parts with symmetric features, for example, the left and right arms share an arm level generation network, and the left and right legs share a leg level generation network. It can be understood that when object parts with symmetric features share one hierarchical generation network, the input of the shared hierarchical generation network includes images corresponding to two object parts with symmetric characteristics. For example, using the symmetry of the human body, the input of the arm-level generation network and the leg-level generation network also includes the part image on the other side. This additional part image can provide more appearance information when the original part is occluded.
[0103] In one embodiment, the hierarchical generation network is trained through a supervised training method, where the loss function can be customized as needed. In one embodiment, the network parameters of the hierarchical generation network are adjusted by back propagation of at least one of a perceptual loss and an adversarial loss to obtain the trained hierarchical generation network. The perceptual loss makes the output image closer to the label image, and the adversarial loss is used to judge whether an object part is complete. The perceptual loss only measures the visible area, thereby eliminating the influence of the unknown, occluded area.
[0104] In this embodiment, the component image matching each object part is directly obtained through the trained hierarchical generation network, so that each component image is synthesized independently and without occlusion, which is efficient and accurate. The hierarchical generation networks process important object parts more accurately: they need both to preserve the texture details of the source pose and to synthesize the areas missing in the target pose.
[0105] In some embodiments, the training of the hierarchical generation network includes the following steps: obtain a second training sample, the second training sample including a source training component image, a training-component 3D segmented voxel, a target training-component 2D pose, a component transformation image and the corresponding label component image, where each sample in the second training sample corresponds to the current object component; input the source training component image, the training-component 3D segmented voxel, the target training-component 2D pose and the component transformation image into the hierarchical generation network corresponding to the current object component, and after the layers of the hierarchical generation network process them in sequence, output the corresponding training component image; and adjust the network parameters of the hierarchical generation network by back propagation according to the difference between the training component image and the label component image to obtain the trained hierarchical generation network corresponding to the current object component.
[0106] Wherein, the second training sample includes the source training component image, the three-dimensional segmentation voxel of the training component, the two-dimensional pose of the target training component, the component transformation image and the corresponding label component image, which all correspond to the same object component. The training component image is the image obtained by predicting the extracted features of the input data through the hierarchical generation network. The label part image is an unobstructed real image corresponding to the current target part.
[0107] Specifically, after the second training sample is obtained, the source training component image, the training-component three-dimensional segmented voxel, the target training-component two-dimensional pose and the component transformation image are input into the hierarchical generation network, and after the layers of the hierarchical generation network process them in sequence, the training component image is output. A loss function is constructed based on the difference between the training component image and the label component image, back propagation is then performed in the direction of minimizing the loss function, and the network parameters of the hierarchical generation network are adjusted; training continues until the training end condition is satisfied or all training samples have been trained.
[0108] In some embodiments, as shown in FIG. 4, adjusting the network parameters of the hierarchical generation network by back propagation according to the difference between the training component image and the label component image to obtain the trained hierarchical generation network corresponding to the current object component includes:
[0109] Step S402: According to the label component image, the source training component image and the training component image, perform adversarial learning on the hierarchical generation network corresponding to the current object component and its discriminant network to obtain the component adversarial loss. The label component image is an image in the training set that is unoccluded and corresponds to the current object component.
[0110] Specifically, the component adversarial loss is used to judge whether the target component is complete. In one embodiment, the component adversarial loss function can be expressed as:
[0111] $\mathcal{L}^{m}_{adv} = \mathbb{E}\big[\log D^{m}(I_s^{m}, I_{real}^{m})\big] + \mathbb{E}\big[\log\big(1 - D^{m}(I_s^{m}, \hat{I}^{m})\big)\big]$
[0112] where m refers to the current object component layer of the source image ($I_s^{m}$), $\hat{I}^{m}$ refers to the training component image output by the hierarchical generation network for the current object component, and $I_{real}^{m}$ refers to an unoccluded component layer of the current object component randomly selected from the training set.
[0113] In step S404, features of the label component image and of the training component image are respectively extracted through the pre-trained perceptual network to obtain feature maps, and the distance between the feature maps is calculated to obtain the component perceptual loss.
[0114] Specifically, the component perceptual loss makes the synthesis result closer to the correctly annotated image. The component perceptual loss only measures the visible area, thereby eliminating the influence of the unknown, occluded area.
[0115] In one embodiment, the component perceptual loss is calculated by the following formula:
[0116] $\mathcal{L}^{m}_{P} = \mathcal{L}_{P}\big(S_t^{m} \odot \hat{I}^{m},\; S_t^{m} \odot I_{gt}^{m}\big)$
[0117] where $\mathcal{L}_{P}$ represents the perceptual loss over the entire image, m represents the component category, $S_t^{m}$ represents the region of the target-pose two-dimensional segmentation whose component category is m, and $\mathcal{L}^{m}_{P}$ represents the component perceptual loss of component category m.
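A hedged sketch of restricting the perceptual loss to the visible component region, reusing a perceptual-loss module like the one sketched earlier; masking both images before comparing them is an assumed strategy consistent with the formula above.

```python
import torch

def component_perceptual_loss(perc_loss, pred_part, gt_part, part_mask):
    """Perceptual loss measured only where the target-pose 2D segmentation equals
    the current component category.

    perc_loss: a perceptual-loss module such as PerceptualLoss above.
    pred_part, gt_part: (B, 3, H, W) synthesized and ground-truth component images.
    part_mask: (B, 1, H, W) binary mask of the component's region.
    """
    return perc_loss(pred_part * part_mask, gt_part * part_mask)
```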
[0118] Step S406: Determine the target component loss according to the component adversarial loss and the component perceptual loss, and adjust the network parameters of the hierarchical generation network corresponding to the current object component by back propagation according to the target component loss to obtain the trained hierarchical generation network.
[0119] Specifically, the target component loss is calculated according to the formula $\mathcal{L}^{m} = \mathcal{L}^{m}_{adv} + \lambda_{P}\,\mathcal{L}^{m}_{P}$, where $\mathcal{L}^{m}_{adv}$ is the component adversarial loss, $\mathcal{L}^{m}_{P}$ is the component perceptual loss, and $\lambda_{P}$ is the weight of the component perceptual loss. Back propagation is performed in the direction of minimizing the target component loss, the network parameters of the hierarchical generation network are adjusted, and training continues until the training end condition is satisfied or all training samples have been trained. In one embodiment, $\lambda_{P}$ is 1.
[0120] In the embodiment of the present application, the target component loss is obtained by weighting the component adversarial loss and the component perceptual loss, and combining these different types of losses to train the hierarchical generation network improves the quality of the component-image result.
[0121] In some embodiments, step S212 includes: fusing the component images into the corresponding object component regions of the transformed image according to the two-dimensional segmentation map to obtain an initial global pose image; merging the source image, the target-pose two-dimensional segmentation map, the target two-dimensional pose and the initial global pose image and inputting them into the fine-scale generation network, which outputs a target pose residual image; and fusing the target pose residual image with the initial global pose image to obtain the target pose image.
[0122] Specifically, the initial global pose image and the target pose image have the same resolution: the transformed image is first scaled to the resolution of the target pose image, and each component image is then superimposed, according to the position of its component, onto the corresponding region of the scaled transformed image to obtain the initial global pose image. In one embodiment, the initial global pose image is obtained by the following formula:
[0123] $I_{init} = \mathcal{U}(I_{trans}) + \sum_{m=1}^{M} S_t^{m} \odot R^{m}$
[0124] where $\mathcal{U}(I_{trans})$ represents the transformed image scaled to the target resolution, $S_t^{m}$ represents the region of the target-pose two-dimensional segmentation whose component category is m, $I_{init}$ represents the initial global pose image, $R^{m}$ represents the component image (residual) of component m, and M represents the total number of target components.
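A sketch of this fusion step in PyTorch: the coarse transformed image is upsampled to the target resolution, and each component's residual image is added inside its region of the target-pose 2D segmentation. Tensor layouts, names and the assumption that the part images are already placed at full resolution are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_parts(transformed, part_residuals, part_masks, out_size=1024):
    """Build the initial global pose image.

    transformed:    (B, 3, h, w) intermediate-scale transformed image.
    part_residuals: list of (B, 3, H, W) residual component images at full
                    resolution (H = W = out_size).
    part_masks:     list of (B, 1, H, W) masks from the target-pose 2D segmentation.
    """
    init = F.interpolate(transformed, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    for residual, mask in zip(part_residuals, part_masks):
        init = init + residual * mask      # superimpose each component in its own region
    return init
```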
[0125] The fine-scale generation network is used to improve the initial results and ensure the synthesis of high-quality globally consistent pose transformation results. The input of the fine-scale generation network is the combination of the source image, the target pose 2D segmentation map, the target 2D pose and the initial global pose image, and the output is the residual between the target pose image and the initial global pose image. In one embodiment, the target pose image is obtained by the following formula:
[0126] $I_{target} = I_{init} + \Delta_{fine}$
[0127] where $I_{target}$ represents the target pose image, $\Delta_{fine}$ represents the output of the fine-scale generation network, and $I_{init}$ represents the initial global pose image.
[0128] In one embodiment, the fine-scale generation network is trained by a supervised training method; the training objective of the fine-scale generation network, the target loss, may be the weighted sum of the adversarial loss, the perceptual loss and the feature matching loss. In one embodiment, the target loss of the fine-scale generation network is calculated according to the formula $\mathcal{L}_{fine} = \mathcal{L}_{adv} + \lambda_{FM}\,\mathcal{L}_{FM} + \lambda_{P}\,\mathcal{L}_{P}$, where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{FM}$ is the feature matching loss, $\mathcal{L}_{P}$ is the perceptual loss, $\lambda_{FM}$ is the weight of the feature matching loss, and $\lambda_{P}$ is the weight of the perceptual loss. Back propagation is performed in the direction of minimizing the target loss, the network parameters of the fine-scale generation network are adjusted, and training continues until the training end condition is met or all training samples have been trained. In one embodiment, $\lambda_{FM}$ and $\lambda_{P}$ are both 1.
[0129] In the embodiment of the present application, a coarse-to-fine framework is used, and fine local texture details are maintained through the fine-scale generation network to synthesize a higher-resolution result image.
[0130] The following applies the method to human body pose transformation. As shown in FIG. 5, which illustrates the hierarchical end-to-end human body pose transformation network proposed by the embodiment of the present application, the pose transformation data processing method provided by the embodiment of the present application includes the following steps:
[0131] 1. Obtain the source image and the target three-dimensional pose, the source image is a 1024*1024 image including the human body, and the target three-dimensional pose is a 256*256*64 three-dimensional pose.
[0132] 2. When inputting the 3D voxel network, you can downsample the source image to get a 256*256 image, and then input the downsampled source image and target 3D pose into the 3D voxel network. It is understandable that you can also The original resolution source image is input into the 3D voxel network, and the 3D voxel network is down-sampled to obtain a 256*256 image. The 3D voxel network outputs 256*256*64 segmented 3D voxels, including voxel category information, and is divided into 7 component categories, namely face, left hand, right hand, left leg, right leg, torso, and background.
[0133] 3. Project the three-dimensional voxels onto the xy plane of the image to obtain the two-dimensional segmentation of the target pose. The target-pose two-dimensional segmentation map annotates the human body part categories of the target image; the human body part categories include face, left hand, right hand, left leg, right leg, and torso.
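A minimal NumPy sketch of this projection is given below, assuming the segmented voxels are stored as an (H, W, D) label array with depth increasing away from the camera; keeping the front-most non-background voxel per pixel is an assumption consistent with the description, not a quote of the original code:

```python
import numpy as np

def project_voxels_to_2d(seg_voxels, background_id=0):
    """Project labelled 3D voxels onto the image (x, y) plane.

    seg_voxels : (H, W, D) integer array of per-voxel part categories.
    For each pixel, the label of the closest non-background voxel along the
    depth axis is kept; pixels with no occupied voxel remain background.
    """
    h, w, d = seg_voxels.shape
    seg_2d = np.full((h, w), background_id, dtype=seg_voxels.dtype)
    occupied = seg_voxels != background_id            # (H, W, D) boolean occupancy
    has_any = occupied.any(axis=2)
    first_depth = occupied.argmax(axis=2)             # index of first occupied voxel per pixel
    rows, cols = np.nonzero(has_any)
    seg_2d[rows, cols] = seg_voxels[rows, cols, first_depth[rows, cols]]
    return seg_2d
```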
[0134] Specifically, based on the two-dimensional segmentation map, the human pose transformation problem can be regarded as an image-to-image conversion problem.
[0135] 4. Obtain the target two-dimensional pose corresponding to the target three-dimensional pose, which is a 256*256 two-dimensional image, and input the 256*256 source image, the 256*256 target two-dimensional pose, and the 256*256 two-dimensional segmentation map into the intermediate-scale generation network, which outputs a transformed image with a resolution of 512*512 corresponding to the target three-dimensional pose.
[0136] Specifically, a two-dimensional target pose is introduced to enhance the target's spatial attention.
[0137] 5. Determine the component layer data corresponding to each object component from the source image, the three-dimensional segmented voxels, the target two-dimensional pose, and the transformed image.
[0138] Specifically, some body parts, such as the human face, change greatly in visibility during the transformation process and contain important texture information. In order to better handle this, the semantic representation of the human body is additionally used to guide the independent synthesis of the various semantic parts of the body. Three component-based hierarchical generation networks are used to synthesize important human body components, including the face, arms, and legs, more accurately. These hierarchical generation networks can synthesize high-quality results, not only maintaining the texture details of the source image but also generating the areas occluded under the target pose. Thanks to the three-dimensional segmented voxel representation, even though the visibility of these important components may change completely under the target pose, the occlusion problem can still be handled correctly because the corresponding three-dimensional voxels are complete.
[0139] In order to establish the body-layer correspondence between the source image and the target image, consistent with the three-dimensional voxel representation, the source image is divided into the same 7 component categories. The two-dimensional component segmentation map corresponding to the source image is obtained through a component segmentation network, and each component layer of the source image is obtained by multiplying the source image and the two-dimensional component segmentation map pixel by pixel.
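The pixel-wise masking can be sketched as follows; the label encoding of the source segmentation map and the dictionary return type are assumptions for illustration:

```python
import numpy as np

def split_source_into_part_layers(source_image, source_seg, num_parts=7):
    """Obtain one image layer per part by pixel-wise masking.

    source_image : (H, W, 3) float array
    source_seg   : (H, W) integer array of part labels produced by the
                   component segmentation network (labels 0..num_parts-1)
    Returns a dict {part_id: (H, W, 3) layer that is zero outside that part}.
    """
    layers = {}
    for m in range(num_parts):
        mask = (source_seg == m).astype(source_image.dtype)[..., None]
        layers[m] = source_image * mask
    return layers
```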
[0140] According to the center position, these body parts are cropped from the corresponding 3D segmented voxels, target 2D pose, and transformed image. Since the final result has a resolution of 1024*1024, the crop resolutions of the face, arms, and legs are set to 128*128, 256*256, and 512*512 respectively, according to the relative size of each part layer in the whole body.
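A center-crop helper in this spirit might look like the sketch below; the zero padding at image borders and the per-part size table are illustrative assumptions:

```python
import numpy as np

# per-part crop sizes assumed from the description above
CROP_SIZE = {"face": 128, "left_arm": 256, "right_arm": 256,
             "left_leg": 512, "right_leg": 512}

def crop_around_center(image, center_xy, size):
    """Crop a size x size window centred on a part, zero-padding at borders.

    image     : (H, W, C) array (e.g. a pose map, voxel slice, or transformed image)
    center_xy : (cx, cy) part centre in pixel coordinates
    """
    h, w = image.shape[:2]
    half = size // 2
    cx, cy = int(center_xy[0]), int(center_xy[1])
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    # copy the valid region into the padded output window
    out[y0 - (cy - half):y1 - (cy - half),
        x0 - (cx - half):x1 - (cx - half)] = image[y0:y1, x0:x1]
    return out
```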
[0141] 6. Input the component layer data of each object component to the hierarchical generation network matching the object component, and obtain the component image corresponding to each object component.
[0142] Combine the cropped results and the component layers of the source image and input them into the corresponding hierarchical generation network to synthesize a complete target component image that ignores occlusion. Due to the symmetry of the human body, the left and right parts of the same kind share the same generator, so there are three hierarchical generation networks in total, synthesizing a face, two arms, and two legs. In addition, to make better use of the symmetry of the human body, the input of the arm generator and the leg generator also includes the source part image of the other side; this additional part image can provide more appearance information when the original part is occluded. The output of the hierarchical generation network is the residual between the target component and the corresponding component of the transformed image. As shown in Figure 6, the detailed structure of the face-level representation is illustrated: the face part of the source image, the face part of the three-dimensional segmented voxels, the face part of the target two-dimensional pose, and the face part of the transformed image are input into the face-level generation network to obtain a 128*128 face component image. As shown in Figure 7, from left to right are the real target image, the real target component image, the synthesis result without hierarchical representation, and the synthesis result of the hierarchical generation network of this scheme; it can be seen that the result of this scheme is of better quality and is a complete part image.
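How the opposite-side part might be appended to the generator input can be sketched as below; whether the other side is horizontally mirrored before concatenation is an assumption here (the text only states that the other-side source part image is included), and the channel-wise concatenation layout is likewise illustrative:

```python
import numpy as np

def arm_generator_input(arm_cues, other_arm_layer, mirror_other_side=True):
    """Assemble the input for the shared left/right arm generator.

    arm_cues        : (S, S, C) stack of cropped cues for the arm being synthesised
                      (source part layer, voxel slice, 2D pose, transformed image)
    other_arm_layer : (S, S, 3) source-image layer of the opposite arm, which can
                      supply extra appearance information when the original arm
                      is occluded
    """
    # mirroring the opposite side is an assumption, not stated in the source text
    extra = np.flip(other_arm_layer, axis=1) if mirror_other_side else other_arm_layer
    return np.concatenate([arm_cues, extra], axis=2)
```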
[0143] 7. Fuse the transformed image and the component image to obtain the target pose image.
[0144] Specifically, the 512*512 transformed image is up-sampled to a resolution of 1024*1024, and then the 128*128 face component image, the 256*256 left and right arm component images, and the 512*512 left and right leg component images are added to the corresponding positions of the up-sampled transformed image according to the two-dimensional segmentation map to obtain the initial global pose image.
[0145] The source image, the target-pose two-dimensional segmentation map, the target two-dimensional pose, and the initial global pose image are combined and input into the fine-scale generation network, which outputs the target pose residual image. It is understandable that the size of each image can be adjusted before input to keep the resolutions of the input images consistent.
[0146] The posture transformation data processing method provided in the embodiments of this application has been tested and achieves good human posture transformation results on two human body data sets, and the quantitative and qualitative comparison results surpass the three current best technical solutions. Thanks to the proposed hierarchical pose transformation synthesis framework, the three-dimensional segmented voxel representation, and the component-based hierarchical representation, the method can better handle occlusion, maintain fine local texture details, and synthesize higher-resolution result images. At the same time, it has good robustness and accuracy, and can also be applied to the synthesis of video action sequences; as shown in Figure 8, partial results of the present solution on video action sequences are illustrated.
[0147] As shown in Figure 9, the transformation results of the proposed scheme on part of the human body poses in the Human3.6M data set are presented, together with a comparison against the other three technical schemes (DSC, LW-GAN, Pix2pixHD).
[0148] As shown in Figure 10, the transformation results of the technical solution of this application on part of the human body poses in the constructed motion video data set are presented, together with a comparison against the other three technical solutions (DSC, LW-GAN, Pix2pixHD).
[0149] The human body posture transformation results generated by this application are compared with the three current best methods, namely Pix2pixHD, DSC, and LW-GAN, which are retrained on the data sets. Two quantitative indicators, structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS), are used to evaluate the quality of the images synthesized by each method. The comparison results are shown in Figure 11, which gives the evaluation results on the two data sets (the arrows indicate the direction of better values). It can be seen that the method of this application is significantly better than the other three methods.
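For reference, SSIM and LPIPS for a single image pair could be computed roughly as below; this sketch assumes a recent scikit-image and the third-party `lpips` package with its AlexNet backbone, and the pre-processing (value ranges, tensor layout) is an assumption rather than the exact protocol behind the reported numbers:

```python
import torch
import lpips                                        # pip install lpips
from skimage.metrics import structural_similarity

def evaluate_pair(generated, target):
    """Compute SSIM and LPIPS for one generated / ground-truth image pair.

    generated, target : (H, W, 3) float NumPy arrays in [0, 1]
    """
    ssim = structural_similarity(generated, target,
                                 channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    # in practice the LPIPS model would be created once and reused
    lpips_model = lpips.LPIPS(net="alex")
    dist = lpips_model(to_tensor(generated), to_tensor(target)).item()
    return ssim, dist   # higher SSIM and lower LPIPS indicate better quality
```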
[0150] In order to evaluate the importance of each part of this application, a basic model is constructed and the remaining parts are added one by one in order to conduct an ablation experiment. The input of the basic model, Baseline, is the source image and the target two-dimensional pose, and it includes the intermediate-scale generation network and the fine-scale generation network. On this basis, the three-dimensional voxel network is added, so that the two-dimensional segmentation map of the target pose obtained by projection is used as an input of the generation network; this model is denoted Baseline+V. Then the component generation networks are added, but trained only with the conventional adversarial loss, which may synthesize incomplete target component images; this model is denoted Baseline+V+PL-. Accordingly, the model of the complete method of this application is denoted Baseline+V+PL. Figure 12 shows the evaluation results of the images generated by each model, and it can be seen that the LPIPS indicator gradually improves as the various components are added.
[0151] The posture transformation data processing method provided by the embodiments of the present application can be used for video production, and can also be used in beauty cameras and the like. It can be deployed jointly with smart cameras on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), and can also run on the CPU of mobile devices such as mobile phones.
[0152] As shown in Figure 13, in some embodiments, a posture transformation data processing device is provided. The posture transformation data processing device may be integrated into the server 120 or the terminal 110 described above, and may specifically include:
[0153] The obtaining module 502 is used to obtain the source image and the target three-dimensional pose.
[0154] The three-dimensional segmentation voxel module 504 is configured to combine the source image and the target three-dimensional pose to obtain three-dimensional segmented voxels based on semantic segmentation reconstruction, and the three-dimensional segmented voxels include the category information of the voxels.
[0155] The projection module 506 is configured to project the three-dimensional segmented voxels to obtain a corresponding two-dimensional segmentation map of the target pose, and label the objects in the two-dimensional segmentation map of the target pose based on the category information of the voxel to obtain the corresponding component category.
[0156] The intermediate scale module 508 is configured to obtain the target two-dimensional pose corresponding to the target three-dimensional pose, extract the features of the source image, the target pose two-dimensional segmentation map, and the target two-dimensional pose to synthesize an intermediate-scale transformed image.
[0157] The component image generation module 510 is used to separately crop the source image, the three-dimensional segmented voxels, the target two-dimensional pose, and the transformed image to obtain the component layer data corresponding to each object component, the object component being determined according to the component category, and to synthesize the component layer data of each object component to generate the component image corresponding to each object component.
[0158] The fusion module 512 is used for fusing the transformed image and the component image to obtain the target posture image.
[0159] In some embodiments, the 3D segmentation voxel module 504 is also used to input the source image and the target 3D pose into the 3D voxel network; the 3D voxel network encodes the source image and the target 3D pose to obtain an encoding result, extracts features from the encoding result, and decodes and outputs the three-dimensional segmented voxels carrying voxel category information.
[0160] In some embodiments, the 3D segmentation voxel module 504 is also used to encode the source image and extract features to obtain a first feature, encode the target 3D pose and extract features to obtain a second feature, merge the first feature and the second feature to obtain a merged feature, input the merged feature into the stacked hourglass network, and obtain the three-dimensional segmented voxels through decoding by the stacked hourglass network.
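A schematic PyTorch front end for this encode-and-merge step is sketched below; the convolution configurations, channel counts, the treatment of the 3D pose as a channel-stacked 2D tensor, and the placeholder for the stacked hourglass backbone are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class VoxelNetFrontEnd(nn.Module):
    """Encode the source image and the target 3D pose separately, then merge.

    The stacked hourglass backbone is left as a pluggable placeholder;
    channel sizes are illustrative assumptions.
    """
    def __init__(self, pose_channels, feat_channels=64, hourglass=None):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, feat_channels, 7, stride=2, padding=3),
            nn.ReLU(inplace=True))
        self.pose_encoder = nn.Sequential(
            nn.Conv2d(pose_channels, feat_channels, 7, stride=2, padding=3),
            nn.ReLU(inplace=True))
        self.hourglass = hourglass if hourglass is not None else nn.Identity()

    def forward(self, source_image, target_pose):
        f_img = self.image_encoder(source_image)     # first feature
        f_pose = self.pose_encoder(target_pose)      # second feature
        merged = torch.cat([f_img, f_pose], dim=1)   # merged feature
        return self.hourglass(merged)                # decoded into segmented voxels
```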
[0161] In some embodiments, the intermediate scale module 508 is also used to merge the source image, the target-pose two-dimensional segmentation map, and the target two-dimensional pose to form an input matrix, and input the input matrix into the intermediate-scale generation network; the intermediate-scale generation network performs feature extraction on the input matrix sequentially through a down-sampling layer, a residual block layer, and an up-sampling layer to obtain the intermediate-scale transformed image.
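The down-sample / residual-block / up-sample structure could be realized along the following lines; layer counts, channel widths, normalization choices, and the extra up-sampling stage that doubles a 256*256 input to a 512*512 output are assumptions, not the original network definition:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.body(x)

class IntermediateScaleGenerator(nn.Module):
    """Down-sampling layers -> residual blocks -> up-sampling layers."""
    def __init__(self, in_channels, base=64, n_blocks=6):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_channels, base, 7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1),
            nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResidualBlock(base * 2) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, base, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, 3, 7, padding=3),
            nn.Tanh())

    def forward(self, x):
        # x: channel-wise concatenation of source image, 2D segmentation map, 2D pose
        return self.up(self.blocks(self.down(x)))
```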
[0162] In some embodiments, the device further includes:
[0163] The intermediate-scale generation network training module is used to obtain a first training sample, the first training sample including a source training image, a target-pose training two-dimensional segmentation map, a target training two-dimensional pose, and a corresponding label transformed image; to input the source training image, the target-pose training two-dimensional segmentation map, and the target training two-dimensional pose into the intermediate-scale generation network and, after sequential processing by the layers included in the intermediate-scale generation network, output the corresponding training transformed image; and to adjust the network parameters of the intermediate-scale generation network by back propagation according to the difference between the training transformed image and the label transformed image, so as to obtain the trained intermediate-scale generation network.
[0164] In some embodiments, the intermediate-scale generation network training module is also used to extract features of the label transformed image and the training transformed image through a pre-trained perceptual network to obtain feature maps, and calculate the distance between the feature maps to obtain the perceptual loss; to perform adversarial learning on the intermediate-scale generation network and the discriminant network according to the label transformed image, the source training image, and the training transformed image to obtain the adversarial loss; to calculate feature distances between the label transformed image and the training transformed image at multiple convolutional layers of different scales in the discriminant network, and aggregate the feature distances of the multiple different scales to obtain the feature matching loss; and to determine the target loss according to the perceptual loss, the adversarial loss, and the feature matching loss, and adjust the network parameters of the intermediate-scale generation network by back propagation according to the target loss to obtain the trained intermediate-scale generation network.
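The multi-scale feature matching term could be computed as in the sketch below, assuming the discriminator exposes its intermediate features per scale; the list-of-lists structure and the L1 distance are assumptions consistent with the description:

```python
import torch.nn.functional as F

def feature_matching_loss(disc_feats_real, disc_feats_fake):
    """Average L1 distance between discriminator features of the label (real)
    transformed image and the training (generated) transformed image,
    accumulated over the discriminator's scales and layers.

    disc_feats_real, disc_feats_fake : lists of lists of feature tensors,
    one inner list per discriminator scale.
    """
    loss, count = 0.0, 0
    for feats_r, feats_f in zip(disc_feats_real, disc_feats_fake):
        for fr, ff in zip(feats_r, feats_f):
            loss = loss + F.l1_loss(ff, fr.detach())
            count += 1
    return loss / max(count, 1)
```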
[0165] In some embodiments, the component image generation module 510 is further configured to obtain a source two-dimensional segmentation map corresponding to the source image, and segment the source image according to the source two-dimensional segmentation map to obtain the component layer data corresponding to each object component; and to crop the three-dimensional segmented voxels, the target two-dimensional pose, and the transformed image based on the center position of each object component according to the cropping information corresponding to that object component, so as to obtain component layer data matching the corresponding cropping information.
[0166] In some embodiments, the component image generation module 510 is also used to obtain the hierarchical generation network corresponding to each object component and input the component layer data into the hierarchical generation network of the matching object component; the hierarchical generation network of each object component outputs the component image matching the corresponding object component.
[0167] In some embodiments, the device further includes:
[0168] The hierarchical generation network training module is used to obtain a second training sample, the second training sample including a source training component image, training component three-dimensional segmented voxels, a target training component two-dimensional pose, a component transformed image, and a corresponding label component image, each sample in the second training sample corresponding to a current object component; to input the source training component image, the training component three-dimensional segmented voxels, the target training component two-dimensional pose, and the component transformed image into the hierarchical generation network corresponding to the current object component and, after sequential processing by the layers included in the hierarchical generation network, output the corresponding training component image; and to adjust the network parameters of the hierarchical generation network by back propagation according to the difference between the training component image and the label component image, so as to obtain the trained hierarchical generation network corresponding to the current object component.
[0169] In some embodiments, the hierarchical generation network training module is also used to perform adversarial learning on the hierarchical generation network and the discriminant network corresponding to the current object component according to the label component image, the source training component image, and the training component image to obtain the component adversarial loss, the label component image being the unoccluded image corresponding to the current object component in the training set; to extract features of the label component image and the training component image through a pre-trained perceptual network to obtain feature maps, and calculate the distance between the feature maps to obtain the component perceptual loss; to determine the target component loss according to the component adversarial loss and the component perceptual loss; and to adjust the network parameters of the hierarchical generation network corresponding to the current object component by back propagation according to the target component loss to obtain the trained hierarchical generation network.
[0170] In some embodiments, the fusion module 512 is further configured to fuse the component image into the corresponding object component area of the transformed image according to the two-dimensional segmentation map to obtain the initial global pose image; to combine the source image, the target-pose two-dimensional segmentation map, the target two-dimensional pose, and the initial global pose image and input them into the fine-scale generation network, which outputs the target pose residual image; and to fuse the target pose residual image with the initial global pose image to obtain the target pose image.
[0171] Figure 14 shows the internal structure diagram of the computer device in some embodiments. The computer device may specifically be the terminal 110 in Figure 1. As shown in Figure 14, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program; when the computer program is executed by the processor, the processor can implement the posture transformation data processing method. A computer program may also be stored in the internal memory; when this computer program is executed by the processor, the processor can execute the posture transformation data processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, may be a button, trackball, or touchpad provided on the housing of the computer device, or may be an external keyboard, touchpad, or mouse.
[0172] The computer device may also be the server 120 in Figure 1, and may include different components than those shown in the figure.
[0173] Those skilled in the art can understand that the structure shown in Figure 14 is only a block diagram of part of the structure related to the solution of this application, and does not constitute a limitation on the computer device to which the solution of this application is applied. The specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
[0174] In some embodiments, the posture transformation data processing device provided in this application can be implemented in the form of a computer program, and the computer program can run on the computer device shown in Figure 14. The memory of the computer device can store the various program modules that make up the posture transformation data processing device, for example, the acquisition module 502, the three-dimensional segmentation voxel module 504, the projection module 506, the intermediate scale module 508, the component image generation module 510, and the fusion module 512 shown in Figure 13. The computer program composed of these program modules causes the processor to execute the steps in the posture transformation data processing method of each embodiment of this application described in this specification.
[0175] For example, the computer device shown in Figure 14 can obtain the source image and the target three-dimensional pose through the acquisition module 502 in the posture transformation data processing device shown in Figure 13. The three-dimensional segmentation voxel module 504 combines the source image and the target three-dimensional pose and obtains three-dimensional segmented voxels based on semantic segmentation reconstruction, the three-dimensional segmented voxels including the category information of the voxels. The projection module 506 projects the three-dimensional segmented voxels to obtain the corresponding target-pose two-dimensional segmentation map, and labels the objects in the target-pose two-dimensional segmentation map based on the voxel category information to obtain the corresponding component categories. The intermediate scale module 508 obtains the target two-dimensional pose corresponding to the target three-dimensional pose, and extracts features of the source image, the target-pose two-dimensional segmentation map, and the target two-dimensional pose to synthesize the intermediate-scale transformed image. The component image generation module 510 crops the source image, the three-dimensional segmented voxels, the target two-dimensional pose, and the transformed image to obtain the component layer data corresponding to each object component, the object component being determined according to the component category, and synthesizes the component layer data of each object component to generate the component image corresponding to each object component. The fusion module 512 fuses the transformed image and the component image to obtain the target posture image.
[0176] In some embodiments, a computer device is provided, including a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the aforementioned posture transformation data processing method. Here, the steps of the posture transformation data processing method may be the steps in the posture transformation data processing method of each of the foregoing embodiments.
[0177] In some embodiments, a computer-readable storage medium is provided, which stores a computer program, and when the computer program is executed by a processor, the processor executes the steps of the aforementioned posture transformation data processing method. Here, the steps of the posture transformation data processing method may be the steps in the posture transformation data processing method of each of the foregoing embodiments.
[0178] It should be understood that, although the steps in the flowcharts of the embodiments of the present application are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and these steps may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, or with sub-steps or stages of other steps.
[0179] A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when the computer program is executed, it may include the procedures of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, the RAM may take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM).
[0180] The technical features of the above embodiments can be combined arbitrarily. In order to keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, all such combinations should be considered to be within the scope of this specification.
[0181] The above-mentioned embodiments only express several implementation manners of the present application, and their description is relatively specific and detailed, but they should not be understood as a limitation to the patent scope of the present application. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.