Methods, apparatuses, devices, and media for training a model

CN116071601B · Active · Publication Date: 2026-06-26 · KE COM (BEIJING) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
KE COM (BEIJING) TECHNOLOGY CO LTD
Filing Date
2023-02-20
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing pose recognition models suffer from poor training results due to the influence of noisy pseudo-labels during the training process, making it difficult to effectively utilize pseudo-labels generated by teacher models.

Method used

By predicting key points in unlabeled data using multiple teacher models, initial pseudo-labels are aggregated and processed to correct key point positions, generating more accurate target pseudo-labels. Furthermore, data augmentation and iterative training are combined to remove noisy pseudo-labels and improve the quality of model training.

Benefits of technology

This improves the training quality and accuracy of the pose recognition model, reduces the workload of labeling data, avoids the adverse effects of noisy pseudo-labels on the training process, and obtains a pose recognition model with higher accuracy.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present disclosure disclose a method, apparatus, device, and medium for training a model, wherein the method comprises: obtaining a first sample set, a second sample set, n teacher models, and a pose recognition model to be trained, wherein the first sample set comprises a plurality of first sample images labeled with sample labels, and the second sample set comprises a plurality of unlabeled second sample images; predicting key points in each second sample image by using the n teacher models respectively, and taking the n prediction results as pseudo-labels of the second sample image to obtain n initial pseudo-labels corresponding to each second sample image; aggregating the n initial pseudo-labels corresponding to each second sample image to obtain a target pseudo-label corresponding to each second sample image; and iteratively training the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-label corresponding to each second sample image to obtain a trained pose recognition model.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a method, apparatus, device, and medium for training a model.

Background Technology

[0002] Currently, pose recognition models are typically trained using a semi-supervised approach, employing a small amount of labeled data and a large amount of unlabeled data. To improve training effectiveness, a teacher model can be introduced during training. This teacher model processes the unlabeled data, and the processed results are used as pseudo-labels for the unlabeled data, thus guiding the training of the pose recognition model.

[0003] Because teacher models generate noisy pseudo-labels during the processing of unlabeled data, and it is difficult to distinguish between high-quality pseudo-labels and noisy pseudo-labels based solely on the confidence of key points, related technologies often ignore the adverse effects of noisy pseudo-labels on the training process when training pose recognition models, resulting in poor training performance.

Summary of the Invention

[0004] This disclosure provides a method, apparatus, electronic device, and storage medium for training a model.

[0005] One aspect of this disclosure provides a method for training a model, comprising: acquiring a first sample set, a second sample set, n teacher models, and a pose recognition model to be trained, wherein the first sample set includes multiple labeled first sample images, the second sample set includes multiple unlabeled second sample images, and n is a positive integer not less than 2; using the n teacher models to predict key points in each second sample image, and using the n prediction results as pseudo-labels for the second sample image to obtain n initial pseudo-labels corresponding to each second sample image; aggregating the n initial pseudo-labels corresponding to each second sample image, correcting the positions of key points in the n initial pseudo-labels to obtain target pseudo-labels corresponding to each second sample image; and iteratively training the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-labels corresponding to each second sample image to obtain a trained pose recognition model.

[0006] In some embodiments, key points in each second sample image are predicted using n teacher models, and the prediction results are used as pseudo-labels for the second sample image to obtain n initial pseudo-labels for each second sample image. This includes: performing a first type of data augmentation on each second sample image to obtain a third sample image corresponding to each second sample image; and inputting the third sample image corresponding to each second sample image into the n teacher models to obtain n initial pseudo-labels for each second sample image.

[0007] In some embodiments, the method further includes: performing a second type of data augmentation on each second sample image to obtain a fourth sample image corresponding to each second sample image, and determining the target pseudo-label corresponding to each second sample image as the target pseudo-label of each fourth sample image; and iteratively training the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-label corresponding to each second sample image to obtain the trained pose recognition model, including: inputting the first sample image into the pose recognition model to be trained, and determining a first loss value based on the sample label and the output result of the pose recognition model to be trained; inputting the fourth sample image into the pose recognition model to be trained, and determining a second loss value based on the target pseudo-label of the fourth sample image and the output result of the pose recognition model to be trained; adjusting the model parameters of the pose recognition model to be trained based on the first loss value and the second loss value; iteratively executing the steps of determining the first loss value, determining the second loss value, and adjusting the model parameters of the pose recognition model to be trained until a preset iteration termination condition is met to obtain the trained pose recognition model.

[0008] In some embodiments, the n initial pseudo-labels corresponding to each second sample image are aggregated to correct the positions of key points in the n initial pseudo-labels, thereby obtaining the target pseudo-label corresponding to each second sample image. This includes: obtaining n historical pseudo-labels obtained by n teacher models in the previous iteration; pairing the n initial pseudo-labels and n historical pseudo-labels corresponding to each second sample image to determine multiple label pairs, each label pair including an initial pseudo-label and a historical pseudo-label; determining key point pairs with matching relationships from the initial pseudo-labels and historical pseudo-labels included in the label pairs, and determining the pixel distance between the two key points in the key point pairs; for key points with the same label among the n initial pseudo-labels, determining the target key point pair with the smallest pixel distance from each key point pair containing the key point, and determining the position of the key point in the target key point pair as the target position of the key point; and determining the target pseudo-label corresponding to each second sample image based on each key point in the n initial pseudo-labels corresponding to each second sample image and its target position.
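The pairing and minimum-distance selection in paragraph [0008] can be sketched as follows. This is only an illustration under stated assumptions: each pseudo-label is reduced to a mapping from key point label to pixel coordinates (the source works with heatmaps), every initial pseudo-label is paired with every historical pseudo-label, and the position kept is the one from the initial pseudo-label in the winning pair. The function name `aggregate_by_history` and the sample coordinates are hypothetical.

```python
import math
from itertools import product

# Hypothetical simplification: a pseudo-label is {key point label: (x, y)}.
PseudoLabel = dict[str, tuple[float, float]]


def aggregate_by_history(initial: list[PseudoLabel],
                         historical: list[PseudoLabel]) -> PseudoLabel:
    """For each key point label, find the (initial, historical) key point pair
    with the smallest pixel distance and keep the initial label's position."""
    best: dict[str, tuple[float, tuple[float, float]]] = {}
    # Assumption: all n x n combinations form the label pairs.
    for init, hist in product(initial, historical):
        for name, (x, y) in init.items():
            if name not in hist:
                continue  # no matching key point in this label pair
            hx, hy = hist[name]
            dist = math.hypot(x - hx, y - hy)  # pixel distance within the pair
            if name not in best or dist < best[name][0]:
                best[name] = (dist, (x, y))
    return {name: pos for name, (_, pos) in best.items()}


initial = [{"head": (50.0, 20.0), "left_wrist": (30.0, 80.0)},
           {"head": (58.0, 26.0), "left_wrist": (31.0, 81.0)}]
historical = [{"head": (51.0, 20.0), "left_wrist": (40.0, 90.0)},
              {"head": (70.0, 30.0), "left_wrist": (31.0, 82.0)}]
target = aggregate_by_history(initial, historical)
print(target)
```

In this toy input the head position comes from the first teacher and the left wrist position from the second, so the target pseudo-label mixes key points from different teachers.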

[0009] In some embodiments, performing a second type of data augmentation on each second sample image to obtain a fourth sample image corresponding to each second sample image includes: performing an affine transformation on each second sample image to obtain a transformed image corresponding to each second sample image; and performing a masking process on the transformed image corresponding to each second sample image to obtain a fourth sample image corresponding to each second sample image.

[0010] In some embodiments, before aggregating the n initial pseudo-labels corresponding to each second sample image, the method further includes: inputting the first sample image into n teacher models respectively, and determining the third loss value of each teacher model in the n teacher models based on the sample labels and the output results of the n teacher models; sequentially using each teacher model in the n teacher models as a student model, and using the other n-1 teacher models as reference models to obtain n model combinations; for each model combination in the n model combinations, inputting the fourth sample image into the student model in the model combination to obtain the first prediction result of the fourth sample image corresponding to the model combination, and determining the n-1 initial pseudo-labels obtained when the reference model in the model combination processes the third sample image corresponding to the fourth sample image; determining the fourth loss value corresponding to the model combination based on the first prediction result of the fourth sample image corresponding to the model combination and the n-1 initial pseudo-labels; and adjusting the model parameters of each teacher model in the n teacher models based on the third loss value and the fourth loss value corresponding to each model combination in the n model combinations.
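The model combinations in paragraph [0010] can be illustrated with a short sketch. It assumes heatmaps are flattened to lists of floats, and that the fourth loss is the mean of the mean squared errors against the n-1 reference pseudo-labels; the source does not name a specific loss function, so both choices and the helper names are hypothetical.

```python
def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two flattened heatmaps."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def model_combinations(n: int) -> list[tuple[int, list[int]]]:
    """Each teacher index serves once as the student; the other n-1 teachers
    serve as reference models."""
    return [(s, [r for r in range(n) if r != s]) for s in range(n)]


def fourth_loss(student_pred: list[float],
                reference_labels: list[list[float]]) -> float:
    # Assumption: average the MSE over the n-1 reference pseudo-labels.
    total = sum(mse(student_pred, ref) for ref in reference_labels)
    return total / len(reference_labels)


combos = model_combinations(3)
print(combos)
loss = fourth_loss([0.0, 1.0], [[0.0, 1.0], [1.0, 1.0]])
print(loss)
```

Each teacher's parameters would then be adjusted from its third loss (on labeled data) together with the fourth loss of the combination in which it is the student.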

[0011] In some embodiments, a masking process is performed on the transformed image corresponding to each second sample image to obtain a fourth sample image corresponding to each second sample image, including: obtaining an image to be cropped; cropping the image region corresponding to the limb part from the image to be cropped to obtain a local image; determining the key point positions in the transformed image corresponding to the second sample image based on n initial pseudo-labels corresponding to the second sample image; and attaching the local image to the key point positions in the transformed image corresponding to the second sample image to obtain the fourth sample image corresponding to the second sample image.

[0012] This disclosure also provides a pose recognition method, including: acquiring a target image including a target object; and recognizing the pose of the target object in the target image using a trained pose recognition model obtained through the method for training a model in any of the above embodiments.

[0013] This disclosure also provides an apparatus for training a model, comprising: a data acquisition unit configured to acquire a first sample set, a second sample set, n teacher models, and a pose recognition model to be trained, wherein the first sample set includes multiple labeled first sample images, the second sample set includes multiple unlabeled second sample images, and n is a positive integer not less than 2; a label generation unit configured to use the n teacher models to predict key points in each second sample image, and use the n prediction results as pseudo-labels for the second sample image to obtain n initial pseudo-labels corresponding to each second sample image; a label aggregation unit configured to aggregate the n initial pseudo-labels corresponding to each second sample image, correct the positions of key points in the n initial pseudo-labels, and obtain target pseudo-labels corresponding to each second sample image; and an iterative training unit configured to iteratively train the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-labels corresponding to each second sample image to obtain a trained pose recognition model.

[0014] This disclosure also provides a pose recognition device, including: an image acquisition unit configured to acquire a target image including a target object; and a pose recognition unit configured to recognize the pose of the target object in the target image using a trained pose recognition model obtained by the method for training a model in any of the above embodiments.

[0015] Embodiments of this disclosure also provide an electronic device, including: a memory for storing a computer program product; and a processor for executing the computer program product stored in the memory, wherein when the computer program product is executed, it implements the method in any of the above embodiments.

[0016] Embodiments of this disclosure also provide a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the methods in any of the above embodiments.

[0017] The method, apparatus, device, and medium for training a model provided in this disclosure can aggregate the initial pseudo-labels output by n teacher models to correct the positions of key points in the initial pseudo-labels, thereby obtaining target pseudo-labels with higher accuracy and removing noisy pseudo-labels. This avoids the adverse effects of noisy pseudo-labels on model training, helps to improve the training quality of the pose recognition model, and yields a pose recognition model with higher accuracy.

[0018] The technical solutions of this disclosure will be further described in detail below with reference to the accompanying drawings and embodiments.

Attached Figure Description

[0019] The accompanying drawings, which form part of this specification, illustrate embodiments of this disclosure and, together with the description, serve to explain the principles of this disclosure.

[0020] This disclosure will become clearer with reference to the accompanying drawings and the following detailed description, wherein:

[0021] Figure 1 is a flowchart illustrating one embodiment of the method for training a model of this disclosure;

[0022] Figure 2 is a flowchart illustrating yet another embodiment of the method for training a model of this disclosure;

[0023] Figure 3 is a schematic diagram of the masking process in one embodiment of the method for training a model of this disclosure;

[0024] Figure 4 is a schematic diagram of the process for generating target pseudo-labels in one embodiment of the method for training a model of this disclosure;

[0025] Figure 5 is a flowchart illustrating yet another embodiment of the method for training a model of this disclosure;

[0026] Figure 6 is a schematic diagram of iterative training in one embodiment of the method for training a model of this disclosure;

[0027] Figure 7 is a schematic diagram of the structure of one embodiment of the apparatus for training a model of this disclosure;

[0028] Figure 8 is a schematic diagram of the structure of an application embodiment of the electronic device of this disclosure.

Detailed Implementation

[0029] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present disclosure.

[0030] Those skilled in the art will understand that the terms "first," "second," etc., in the embodiments of this disclosure are only used to distinguish different steps, devices, or modules, and do not represent any specific technical meaning, nor do they indicate a necessary logical order between them.

[0031] It should also be understood that in the embodiments disclosed herein, "a plurality of" may refer to two or more, and "at least one" may refer to one, two or more.

[0032] It should also be understood that any component, data or structure mentioned in the embodiments of this disclosure can generally be understood as one or more unless expressly defined or given to the contrary in the context.

[0033] Furthermore, the term "and / or" in this disclosure is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this disclosure generally indicates that the preceding and following related objects have an "or" relationship.

[0034] It should also be understood that the description of the various embodiments in this disclosure emphasizes the differences between them; for parts that are identical or similar, the embodiments may be referred to one another. For the sake of brevity, these parts are not described again in detail.

[0035] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0036] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.

[0037] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0038] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0039] The embodiments disclosed herein can be applied to electronic devices such as terminal devices, computer systems, and servers, and can operate together with a wide range of other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and / or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, etc.

[0040] Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Typically, program modules can include routines, programs, object programs, components, logic, data structures, etc., which perform specific tasks or implement specific abstract data types. Computer systems / servers can be implemented in distributed cloud computing environments, where tasks are executed by remote processing devices linked through communication networks. In distributed cloud computing environments, program modules can reside on local or remote computing system storage media, including storage devices.

[0041] Overview of this disclosure

[0042] In implementing this disclosure, the inventors discovered that when using a teacher model to generate pseudo-labels for unlabeled data, the teacher model inevitably produces noisy pseudo-labels due to its inherent accuracy limitations. If these noisy pseudo-labels are used to guide model training, the trained model will overfit to these labels, leading to a decrease in the accuracy of the trained model.

[0043] Exemplary methods

[0044] The method for training a model of this disclosure is described below by way of example with reference to Figure 1. Figure 1 shows a flowchart of one embodiment of the method. As shown in Figure 1, the process includes the following steps.

[0045] Step 110: Obtain the first sample set, the second sample set, n teacher models, and the pose recognition model to be trained.

[0046] The first sample set includes multiple labeled first sample images, and the second sample set includes multiple unlabeled second sample images, where n is a positive integer not less than 2.

[0047] In this embodiment, each teacher model can be a pre-trained pose recognition model, such as a ResNet, another convolutional neural network, or a recurrent neural network. n can be any positive integer not less than 2. The larger the value of n, the better the training effect of the pose recognition model, but the higher the hardware performance requirements during training. As an example, n can be 2, which balances training effect against hardware cost.

[0048] Typically, the number of first sample images can be less than the number of second sample images, which reduces the workload of labeling samples.

[0049] Step 120: Use n teacher models to predict the key points in each second sample image, and use the n prediction results as pseudo-labels for the second sample image to obtain n initial pseudo-labels for each second sample image.

[0050] In this embodiment, the teacher model can extract image features from the second sample image, predict the confidence level and type label of each pixel as a pose key point based on the image features, and form a key point heatmap based on the confidence level and type label of each pixel to obtain the prediction result of the teacher model, which is used as the initial pseudo label of the second sample image.
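To make the relationship between a key point heatmap and key point positions concrete, the sketch below reads the most confident pixel out of each heatmap. It assumes one 2D confidence map per type label, stored as nested lists; the source does not specify the heatmap layout, and `decode_heatmaps` is a hypothetical helper rather than part of the disclosed method.

```python
def decode_heatmaps(heatmaps: dict[str, list[list[float]]]
                    ) -> dict[str, tuple[int, int, float]]:
    """Return {type label: (x, y, confidence)} from the per-map maximum."""
    keypoints = {}
    for label, heatmap in heatmaps.items():
        # Scan every pixel and keep the one with the highest confidence.
        conf, x, y = max((value, col, row)
                         for row, line in enumerate(heatmap)
                         for col, value in enumerate(line))
        keypoints[label] = (x, y, conf)
    return keypoints


prediction = {"head": [[0.1, 0.2],
                       [0.9, 0.3]]}
decoded = decode_heatmaps(prediction)
print(decoded)
```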

[0051] Step 130: Aggregate the n initial pseudo-labels corresponding to each second sample image, correct the positions of key points in the n initial pseudo-labels, and obtain the target pseudo-label corresponding to each second sample image.

[0052] In this embodiment, by aggregating n initial pseudo-labels, the key points with the highest accuracy are selected from the n initial pseudo-labels, and target pseudo-labels are generated based on the selected key points.

[0053] As an example, the keypoints with the highest confidence can be selected from n initial pseudo-labels based on their type labels. Taking human pose recognition as an example, the confidence of the head keypoints among the n initial pseudo-labels can be determined first. Then, the head keypoints in the initial pseudo-labels with the highest confidence are selected as the head keypoints in the target pseudo-labels. The same selection process is then applied to each limb keypoint to select the limb keypoints with the highest confidence from the initial pseudo-labels, which are then used as the limb keypoints in the target pseudo-labels. Finally, a keypoint heatmap can be generated based on the type labels and confidence levels of the selected keypoints, thus obtaining the target pseudo-labels.
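The confidence-based selection in this example can be sketched as follows, assuming each initial pseudo-label has already been reduced to one `(x, y, confidence)` triple per type label. The function name and the sample values are hypothetical.

```python
def aggregate_by_confidence(labels):
    """For each type label, keep the key point with the highest confidence
    among the n initial pseudo-labels."""
    target = {}
    for label in labels:
        for name, (x, y, conf) in label.items():
            if name not in target or conf > target[name][2]:
                target[name] = (x, y, conf)
    return target


teacher_a = {"head": (50, 20, 0.9), "left_wrist": (30, 80, 0.4)}
teacher_b = {"head": (52, 21, 0.7), "left_wrist": (31, 81, 0.8)}
target_label = aggregate_by_confidence([teacher_a, teacher_b])
print(target_label)
```

A key point heatmap for the target pseudo-label could then be rendered from the selected positions and confidences.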

[0054] Step 140: Based on the first sample set, the second sample set, and the target pseudo-labels corresponding to each second sample image, iteratively train the pose recognition model to be trained to obtain the trained pose recognition model.

[0055] In this embodiment, the pose recognition model to be trained can be subjected to a small amount of supervised training using the first sample set, while the pose recognition model to be trained can be subjected to a large amount of semi-supervised training using the second sample images and their pseudo-labels in the second sample set, thereby obtaining the trained pose recognition model.

[0056] As an example, the iterative training process of the pose recognition model to be trained can include two stages. The first stage is to use the first sample image as input and the sample label as the expected output to iteratively train the pose recognition model to be trained in order to update the model parameters of the pose recognition model to be trained. In the second stage, the second sample image is used as input and the target pseudo-label is used as the expected output to iteratively train the pose recognition model to be trained again and update the model parameters of the pose recognition model to be trained again until the iteration termination condition is reached, and the trained pose recognition model is obtained.

[0057] The method for training a model provided in this embodiment can aggregate the initial pseudo-labels output by n teacher models to correct the positions of key points in the initial pseudo-labels, thereby obtaining target pseudo-labels with higher accuracy and removing noisy pseudo-labels. This avoids the adverse effects of noisy pseudo-labels on model training, helps to improve the training quality of the pose recognition model, and yields a pose recognition model with higher accuracy.

[0058] Next, refer to Figure 2, which shows a flowchart of yet another embodiment of the method for training a model of this disclosure. As shown in Figure 2, the process includes the following steps.

[0059] Step 210: Obtain the first sample set, the second sample set, n teacher models, and the pose recognition model to be trained.

[0060] Step 220: Perform the first type of data augmentation on each second sample image to obtain the third sample image corresponding to each second sample image.

[0061] In this embodiment, the first type of data augmentation processing refers to simple data augmentation processing, such as rotation with a small amplitude, translation with a small distance, etc. Specifically, it can be a rotation of less than 20° or a translation of less than 10 pixels.

[0062] Step 230: Input the third sample image corresponding to each second sample image into n teacher models respectively to obtain n initial pseudo-labels corresponding to each second sample image.

[0063] In this embodiment, after inputting the third sample image into the teacher model, the prediction result output by the teacher model can be used as the initial pseudo-label of the second sample image.

[0064] In this embodiment, a third sample image can be obtained by performing a first type of data augmentation on the second sample image. The teacher model predicts the key points of the third sample image and outputs the prediction result, which is then used as the initial pseudo-label for the second sample image. This helps improve the accuracy of the initial pseudo-label.

[0065] Step 240: Aggregate the n initial pseudo-labels corresponding to each second sample image, correct the positions of key points in the n initial pseudo-labels, and obtain the target pseudo-label corresponding to each second sample image.

[0066] Step 250: Perform a second type of data augmentation on each second sample image to obtain a fourth sample image corresponding to each second sample image, and determine the target pseudo-label corresponding to each second sample image as the target pseudo-label of each fourth sample image.

[0067] In this embodiment, the second type of data augmentation processing represents complex augmentation processing, which may include large-amplitude rotations or large-distance translations. Specifically, it may be a rotation of more than 20° or a translation of more than 10 pixels.
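The two augmentation types can be contrasted with a small parameter sampler. The 20° and 10 pixel thresholds come from the text; the upper bounds for the second type (45° and 30 pixels here) are hypothetical, since the source gives none, and `sample_augmentation` is an illustrative name.

```python
import random


def sample_augmentation(kind: str, rng: random.Random) -> dict[str, float]:
    """Sample rotation and translation magnitudes for one augmentation type."""
    if kind == "weak":      # first type: rotation < 20 deg, shift < 10 px
        angle_range, shift_range = (0.0, 20.0), (0.0, 10.0)
    elif kind == "strong":  # second type: rotation > 20 deg, shift > 10 px
        # Assumption: upper bounds are hypothetical; the source gives none.
        angle_range, shift_range = (20.0, 45.0), (10.0, 30.0)
    else:
        raise ValueError(f"unknown augmentation type: {kind}")
    sign = rng.choice((-1.0, 1.0))  # rotate in either direction
    return {"angle_deg": sign * rng.uniform(*angle_range),
            "shift_px": rng.uniform(*shift_range)}


rng = random.Random(0)
weak_params = sample_augmentation("weak", rng)      # for the third sample image
strong_params = sample_augmentation("strong", rng)  # for the fourth sample image
print(weak_params, strong_params)
```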

[0068] The resulting fourth sample image can be combined with the corresponding third sample image to form a sample image pair. Each sample image pair thus consists of the two sample images obtained by applying the simple augmentation and the complex augmentation, respectively, to the same second sample image.

[0069] Step 260: Input the first sample image into the pose recognition model to be trained, and determine the first loss value based on the sample label and the output of the pose recognition model to be trained.

[0070] Step 270: Input the fourth sample image into the pose recognition model to be trained, and determine the second loss value based on the target pseudo-label of the fourth sample image and the output result of the pose recognition model to be trained.

[0071] Step 280: Adjust the model parameters of the pose recognition model to be trained based on the first loss value and the second loss value.

[0072] The above steps 260 to 280 are executed iteratively until the preset iteration termination condition is met, and the trained pose recognition model is obtained.

[0073] As an example, the iteration termination condition can be the convergence of the loss function or reaching a preset number of iterations.
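The loop formed by steps 260 to 280 and the termination condition can be sketched with a deliberately tiny stand-in. The example below replaces the pose recognition model with a one-parameter model y = w·x and uses mean squared error for both losses; the loss function, the `pseudo_weight` factor, the learning rate, and the tolerance are all assumptions, as the source specifies none of them.

```python
def mse_and_grad(w, samples):
    """Loss and dLoss/dw for the toy model y = w * x."""
    n = len(samples)
    loss = sum((w * x - y) ** 2 for x, y in samples) / n
    grad = sum(2 * (w * x - y) * x for x, y in samples) / n
    return loss, grad


def train(labeled, pseudo_labeled, lr=0.01, pseudo_weight=1.0,
          tol=1e-9, max_iters=1000):
    """Iterate until the total loss converges or max_iters is reached."""
    w, prev_total = 0.0, float("inf")
    for step in range(1, max_iters + 1):
        first_loss, g1 = mse_and_grad(w, labeled)          # cf. step 260
        second_loss, g2 = mse_and_grad(w, pseudo_labeled)  # cf. step 270
        total = first_loss + pseudo_weight * second_loss
        w -= lr * (g1 + pseudo_weight * g2)                # cf. step 280
        if abs(prev_total - total) < tol:  # loss has converged
            break
        prev_total = total
    return w, step


labeled = [(1.0, 2.0), (2.0, 4.0)]          # inputs with sample labels
pseudo_labeled = [(3.0, 6.0), (4.0, 8.0)]   # inputs with target pseudo-labels
w, steps = train(labeled, pseudo_labeled)
print(w, steps)
```

Both termination conditions named in the text appear here: the loop exits early when the change in total loss falls below `tol`, and otherwise stops after `max_iters` iterations.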

[0074] In this embodiment, a first loss value can be determined based on the output result and sample label obtained by processing the first sample image by the pose recognition model to be trained. Then, the third sample image after the first data augmentation is input into the teacher model to obtain the initial pseudo label of the second sample image, and then the target pseudo label of the second sample image is obtained through aggregation processing. After that, the fourth sample image after the second data augmentation is input into the pose recognition model to be trained, and a second loss value is determined based on the target pseudo label and the output result of the pose recognition model to be trained. Then, the model parameters of the pose recognition model to be trained are adjusted based on the first loss value and the second loss value. The training process of the pose recognition model to be trained can be guided by the sample label of the first sample image and the pseudo label generated by the teacher model, which can reduce the workload of generating labeled image data and avoid the adverse effects of noisy pseudo labels on the training process.

[0075] In some optional embodiments of the above embodiments, step 250 can generate the fourth sample image in the following manner: perform affine transformation processing on each second sample image to obtain the transformed image corresponding to each second sample image; perform mask processing on the transformed image corresponding to each second sample image to obtain the fourth sample image corresponding to each second sample image.

[0076] In this embodiment, the affine transformation can include a significant rotation and / or a significant translation. Through masking, some or all key points in the affine-transformed image can be occluded to generate a fourth sample image. Subsequent training of the pose recognition model using this fourth sample image can improve training quality and yield a pose recognition model that can accurately identify poses in occluded images.

[0077] Furthermore, the masking of the transformed image corresponding to each second sample image can be performed through the process shown in Figure 3. As shown in Figure 3, the process may include the following steps.

[0078] Step 310: Obtain the image to be cropped.

[0079] In this embodiment, the image to be cropped can be a pre-determined image containing a recognition object. For example, when the pose recognition model to be trained is used to recognize human poses, the image to be cropped can be an image including a person.

[0080] Step 320: Crop out the image region corresponding to the limb part from the image to be cropped to obtain a local image.

[0081] As an example, an object detection model can be used to identify the limb parts of the image to be cropped, obtain the image region corresponding to the limb parts, and then extract the corresponding region from the image to be cropped to obtain a local image.

[0082] Step 330: Based on the n initial pseudo-labels corresponding to the second sample image, determine the key point positions in the transformed image corresponding to the second sample image.

[0083] In this embodiment, the initial pseudo-label is the prediction result obtained by the teacher model processing the third sample image. The third sample image is obtained from the second sample image through the first type of data augmentation processing (i.e., a small-amplitude affine transformation), whereas the transformed image is obtained from the second sample image through a larger-amplitude affine transformation. The initial pseudo-label can therefore be subjected to an affine transformation corresponding to the difference between that affine transformation and the first type of data augmentation processing, and the key point positions can be determined from the transformed initial pseudo-label.

[0084] As an example, the first type of data augmentation processing sequentially includes: rotating 15° clockwise and shifting 5 pixels to the right; the affine transformation performed on the second sample image may include rotating 40° clockwise and shifting 15 pixels to the right. Therefore, the difference between the two data augmentation processes can be represented as rotating 25° clockwise and shifting 10 pixels to the right. Then, the initial pseudo-labels can be processed as follows: rotating 25° clockwise and shifting 10 pixels to the right, to obtain the transformed initial pseudo-labels, from which the keypoint positions can be determined.
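The example above can be sketched in code as follows. This is only an illustration under stated assumptions: key points are stored as (x, y) pixel coordinates with the y axis pointing down, the rotation is taken about a hypothetical image centre (the text does not state the rotation centre), and the difference between the two augmentations is computed component by component, as in the example. The function name and the sample coordinates are made up.

```python
import numpy as np

def rotate_and_shift(points, angle_deg, shift_right, center):
    """Rotate (x, y) points clockwise (as seen in an image whose y axis points
    down) about `center`, then shift them right by `shift_right` pixels."""
    theta = np.deg2rad(angle_deg)
    # With the y axis pointing down, this matrix turns points clockwise on screen.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = np.asarray(points, dtype=float) - center
    return pts @ rot.T + center + np.array([shift_right, 0.0])

# Parameters taken from the example in the text.
first_aug = {"angle_deg": 15.0, "shift_right": 5.0}
second_affine = {"angle_deg": 40.0, "shift_right": 15.0}
# Component-wise difference: rotate 25 degrees clockwise, shift 10 pixels right.
diff = {key: second_affine[key] - first_aug[key] for key in first_aug}

center = np.array([128.0, 128.0])                                # hypothetical image centre
initial_keypoints = np.array([[138.0, 128.0], [128.0, 100.0]])   # made-up pseudo-label
transformed_keypoints = rotate_and_shift(initial_keypoints, center=center, **diff)
```

The rows of `transformed_keypoints` would then serve as the key point positions in the transformed image.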

[0085] Step 340: Attach the local image to the key point positions in the transformed image corresponding to the second sample image to obtain the fourth sample image corresponding to the second sample image.

[0086] In this embodiment, a local image can be attached to some or all of the key point positions in the transformed image corresponding to the second sample image to generate a fourth sample image with occluded areas.

[0087] In the implementation shown in Figure 3, the initial pseudo-labels generated by the teacher model can be used to determine the occlusion position, and then the local image in the image to be cropped can be pasted onto the corresponding position to obtain a more targeted fourth sample image, which helps to further improve the training quality.
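By way of illustration only, attaching a local image at a key point position (step 340) could look like the sketch below. It assumes that images are NumPy arrays indexed as [row, column], that the local image is centred on the key point, and that any part falling outside the image is clipped; none of these details is specified in the text, and `paste_patch` is a hypothetical helper.

```python
import numpy as np

def paste_patch(image, patch, center_xy):
    """Return a copy of `image` with `patch` pasted so that it is centred on
    `center_xy` = (x, y); parts falling outside the image are clipped."""
    out = image.copy()
    img_h, img_w = image.shape[:2]
    p_h, p_w = patch.shape[:2]
    x0 = int(round(center_xy[0])) - p_w // 2
    y0 = int(round(center_xy[1])) - p_h // 2
    # Intersect the patch rectangle with the image bounds.
    ix0, iy0 = max(x0, 0), max(y0, 0)
    ix1, iy1 = min(x0 + p_w, img_w), min(y0 + p_h, img_h)
    if ix0 >= ix1 or iy0 >= iy1:
        return out  # patch lies completely outside the image
    out[iy0:iy1, ix0:ix1] = patch[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0]
    return out
```

Calling the helper once per selected key point would occlude some or all key points of the transformed image.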

[0088] Next, refer to Figure 4, which shows a flowchart of generating target pseudo-labels in one embodiment of the method for training a model of this disclosure. As shown in Figure 4, the process includes the following steps.

[0089] Step 410: Obtain the n historical pseudo-labels obtained by the n teacher models in the previous iteration.

[0090] In this embodiment, one second sample image corresponds to one iteration (i.e., steps 260 to 270 above), and the historical pseudo-label represents the initial pseudo-label generated by the teacher model when processing the previous second sample image in the previous iteration.

[0091] Step 420: Pair the n initial pseudo-labels and n historical pseudo-labels corresponding to each second sample image to determine multiple label pairs.

[0092] Each label pair includes an initial pseudo-label and a historical pseudo-label.

[0093] As an example, with n=2, the two teacher models process the i-th second sample image in the t-th iteration to obtain two initial pseudo-labels, A and B; the two teacher models process the (i-1)-th second sample image in the (t-1)-th iteration to obtain two historical pseudo-labels, a and b. Thus, four label pairs can be determined: A and a, A and b, B and a, and B and b.
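The pairing in this example amounts to a Cartesian product of the initial and historical pseudo-labels. A minimal sketch, in which the strings merely stand in for real pseudo-labels:

```python
from itertools import product

initial_labels = ["A", "B"]      # initial pseudo-labels from the current iteration
historical_labels = ["a", "b"]   # historical pseudo-labels from the previous iteration

# Every initial pseudo-label is paired with every historical pseudo-label.
label_pairs = list(product(initial_labels, historical_labels))
```

For n teacher models this yields n × n label pairs.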

[0094] Step 430: Identify key point pairs with matching relationships from the initial pseudo-labels and historical pseudo-labels included in the label pair, and determine the pixel distance between the two key points in the key point pair.

[0095] In this embodiment, keypoint matching can be performed on the initial pseudo-label and historical pseudo-label in the label pair by the type label or position of the keypoint, to determine the keypoint pair with matching relationship, and to determine the pixel distance between the two keypoints in the keypoint pair.

[0096] Continuing with the example from step 420, suppose that in the label pair consisting of the initial pseudo-label A and the historical pseudo-label a, keypoint 1 in the initial pseudo-label A and keypoint 2 in the historical pseudo-label a form a keypoint pair. Here, the coordinates of keypoint 1 in the initial pseudo-label A are (x1, y1), and the coordinates of keypoint 2 in the historical pseudo-label a are (x2, y2). Then, the pixel distance between keypoint 1 and keypoint 2 can be expressed as...
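The expression itself is not reproduced in this text. Purely as an assumption, if the pixel distance is taken to be the ordinary Euclidean distance between the two coordinate pairs defined above, it would read:

```latex
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
```

Other distance measures between (x1, y1) and (x2, y2) would be equally compatible with the description.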

[0097] Step 440: For keypoints with the same label among the n initial pseudo-labels, determine the target keypoint pair with the smallest pixel distance from each keypoint pair containing the keypoint, and determine the position of the keypoint in the target keypoint pair as the target position of the keypoint.

[0098] In this embodiment, the target location of the key point represents the position of the key point in the target pseudo-label.

[0099] Continuing with the example from step 430, assume that in the label pair formed by the initial pseudo-label B and the historical pseudo-label a, keypoint 3 in the initial pseudo-label B has the same label as keypoint 1 in the initial pseudo-label A, and keypoint 3 and keypoint 2 form a keypoint pair. If the pixel distance between keypoint 3 and keypoint 2 is less than the pixel distance between keypoint 1 and keypoint 2, then the keypoint pair formed by keypoint 3 and keypoint 2 is the target keypoint pair, and the coordinates of keypoint 3 in the initial pseudo-label B are the target position of that keypoint. By performing step 440 on each keypoint in the initial pseudo-label A and the initial pseudo-label B, the target position of each keypoint in the target pseudo-label can be determined.
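Steps 430 and 440 can be illustrated with the following sketch. It assumes that each pseudo-label is a dictionary mapping a key point type to (x, y) coordinates, that key points are matched by type label, and that the pixel distance is Euclidean; the function name, the data layout, and the sample coordinates are hypothetical.

```python
import math

def select_target_positions(initial_labels, historical_labels):
    """For every key point type, pick the initial-label position whose distance
    to a same-type key point in any historical label is smallest."""
    best = {}  # key point type -> (distance, position)
    for init in initial_labels:
        for hist in historical_labels:
            for kp_type, (x1, y1) in init.items():
                if kp_type not in hist:
                    continue  # no matching key point in this label pair
                x2, y2 = hist[kp_type]
                dist = math.hypot(x1 - x2, y1 - y2)  # assumed Euclidean distance
                if kp_type not in best or dist < best[kp_type][0]:
                    best[kp_type] = (dist, (x1, y1))
    return {kp_type: pos for kp_type, (_, pos) in best.items()}

# Made-up coordinates mirroring the example: keypoint 1 in A, keypoint 3 in B,
# keypoint 2 in the historical pseudo-label a.
label_A = {"left_wrist": (10.0, 10.0)}
label_B = {"left_wrist": (12.0, 11.0)}
hist_a = {"left_wrist": (12.0, 12.0)}
target = select_target_positions([label_A, label_B], [hist_a])
```

Here the key point from label B is closer to the historical key point, so its coordinates become the target position.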

[0100] Step 450: Based on the key points and target positions in the n initial pseudo-labels corresponding to each second sample image, determine the target pseudo-label corresponding to each second sample image.

[0101] In this embodiment, the type and location of each key point contained in the target pseudo-label can be determined according to the type of each key point and its target location, thereby obtaining the target pseudo-label.

[0102] In the embodiment shown in Figure 4, the key point pair with the highest accuracy is selected from the key point pairs based on pixel distance, thereby determining the target position of the key point in the target pseudo-label. The inconsistency between the historical pseudo-label output by the teacher model in the previous iteration and the initial pseudo-label output in the current iteration can be used to correct the positions of the key points in the pseudo-label, which helps to improve the accuracy of the key point positions in the target pseudo-label.

[0103] On the basis of the embodiment shown in Figure 2, refer to Figure 5, which shows a flowchart of an embodiment of the method for training a model of this disclosure. The process shown in Figure 2 can further include the process shown in Figure 5. As shown in Figure 5, the process includes the following steps.

[0104] Step 510: Input the first sample image into each of the n teacher models, and determine the third loss value of each of the n teacher models based on the sample labels and the output results of the n teacher models.

[0105] Step 520: Take each of the n teacher models as a student model in turn, and take the other n-1 teacher models as reference models to obtain n model combinations.

[0106] Step 530: For each of the n model combinations, input the fourth sample image into the student model in that model combination to obtain the first prediction result of the fourth sample image corresponding to the model combination, and determine the n-1 initial pseudo-labels obtained by the reference model in that model combination when processing the third sample image corresponding to the fourth sample image.

[0107] Step 540: Based on the first prediction result of the model combination corresponding to the fourth sample image and n-1 initial pseudo-labels, determine the fourth loss value corresponding to the model combination.

[0108] In this embodiment, the fourth loss value corresponding to each model combination can be determined based on the difference between the first prediction result and the n-1 initial pseudo-labels. Step 540 can be performed on each model combination to obtain n fourth loss values.

[0109] As an example, we can first perform a weighted average of the n-1 initial pseudo-labels to obtain the processed pseudo-labels; then, based on the difference between the processed pseudo-labels and the first prediction result, we can determine the fourth loss value.
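A minimal sketch of this example follows. It assumes that the pseudo-labels and the prediction are arrays of identical shape (for instance heatmaps), that the weights default to a uniform average, and that the difference is measured by mean squared error; these choices and the function name are assumptions rather than details taken from the text.

```python
import numpy as np

def fourth_loss(student_pred, reference_labels, weights=None):
    """Return (loss, processed_label) for one model combination.

    reference_labels : the n-1 initial pseudo-labels from the reference models
    weights          : optional per-reference weights; uniform if None (assumed)
    """
    refs = np.stack(reference_labels)                       # shape: (n-1, ...)
    processed = np.average(refs, axis=0, weights=weights)   # weighted average
    loss = float(np.mean((student_pred - processed) ** 2))  # assumed MSE
    return loss, processed
```

With n=2 there is a single reference pseudo-label, so the processed pseudo-label is that label itself.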

[0110] Step 550: Based on the third loss value and the fourth loss value corresponding to each of the n model combinations, adjust the model parameters of each teacher model among the n teacher models.
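The model combinations used in steps 520 to 550 could be enumerated as in the sketch below, where each teacher model takes the student role once. The dictionary layout and the function name are illustrative assumptions.

```python
def build_model_combinations(teachers):
    """Return n combinations; in the i-th one, teacher i is the student model
    and the remaining n-1 teachers are the reference models."""
    return [
        {"student": teachers[i], "references": teachers[:i] + teachers[i + 1:]}
        for i in range(len(teachers))
    ]

# Placeholder names standing in for three teacher models.
combinations = build_model_combinations(["T1", "T2", "T3"])
```

Each combination yields one fourth loss value, which is then used together with the third loss value to adjust the corresponding teacher model.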

[0111] This embodiment is further described below with reference to Figure 6, which shows a schematic diagram of iterative training in one embodiment of the method for training a model. As shown in Figure 6, the process of training the pose recognition model in each iteration in this embodiment can include three stages. In the first stage, the first sample image 610 is input into the first teacher model 630, the second teacher model 640, and the pose recognition model 650 to be trained, respectively. Then, based on the prediction results and sample labels 660, the third loss value 1 corresponding to the first teacher model 630, the third loss value 2 corresponding to the second teacher model 640, and the first loss value 651 corresponding to the pose recognition model 650 to be trained are determined respectively. The parameters of each model can be adjusted according to the loss values using the backpropagation principle.

[0112] Next, the second stage begins. First, the second teacher model 640 is used as the student model. The third sample image 621, obtained by performing the first type of data augmentation on the second sample image 620, is input into the first teacher model 630 to obtain the first initial pseudo-label 631. The fourth sample image 622, obtained by performing the second type of data augmentation on the second sample image 620, is input into the second teacher model 640. Based on the prediction result of the second teacher model 640 and the first initial pseudo-label 631, the fourth loss value 1 corresponding to the second teacher model 640 is determined. Here, the model parameters of the second teacher model 640 can be adjusted based on the fourth loss value 1. Then, the first teacher model 630 is used as the student model. The third sample image 621 is input into the second teacher model 640 to obtain the second initial pseudo-label 641. The fourth sample image 622 is input into the first teacher model 630. Based on the prediction result of the first teacher model 630 and the second initial pseudo-label 641, the fourth loss value 2 corresponding to the first teacher model 630 is determined. Here, the model parameters of the first teacher model 630 can be adjusted based on the fourth loss value 2. Then, the first initial pseudo-label 631 and the second initial pseudo-label 641 can be aggregated to obtain the target pseudo-label 670.

[0113] Next, we move to the third stage. The fourth sample image 622 is input into the pose recognition model 650 to be trained, and the second loss value 652 is determined based on the prediction result and the target pseudo-label 670. Then, the model parameters of the pose recognition model 650 to be trained are adjusted based on the second loss value 652.

[0114] The above three stages are executed iteratively until the preset iteration termination condition is met, and the trained pose recognition model can be obtained.
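The iterative execution of the three stages can be outlined as below. This is a control-flow sketch only: the stage callables are hypothetical placeholders, and the termination condition (a maximum iteration count, optionally combined with a threshold on the second loss value) is an assumption, since the text only refers to a preset iteration termination condition.

```python
def train(run_stage_1, run_stage_2, run_stage_3,
          max_iterations=1000, loss_threshold=None):
    """Repeat the three stages until an assumed termination condition is met.

    run_stage_1() : supervised update of the teachers and the model to be trained
    run_stage_2() : mutual teacher update; returns the target pseudo-label
    run_stage_3(label) : update on the fourth sample image; returns the second loss
    """
    history = []
    for _ in range(max_iterations):
        run_stage_1()
        target_pseudo_label = run_stage_2()
        second_loss = run_stage_3(target_pseudo_label)
        history.append(second_loss)
        if loss_threshold is not None and second_loss < loss_threshold:
            break  # assumed stopping rule
    return history
```

The returned history of second loss values is included only to make the sketch easy to inspect.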

[0115] In this embodiment, n fourth loss values can be determined through interactive learning among n teacher models. Then, combined with the third loss value, the model parameters of each teacher model are optimized using the backpropagation principle. Optimizing each teacher model during the training of the pose recognition model helps to further improve the training effect of the pose recognition model.

[0116] In a specific example, the pose recognition model trained through any of the above embodiments can predict the pose of a target object based on an image. The target object includes, but is not limited to, people and animals. For instance, the pose recognition model can extract feature data from the input image and then predict the confidence level and type label of each pixel in the image as a keypoint based on the feature data. Here, keypoints can, for example, represent the skeletal joints of the target object. In this way, the pose of the target object can be represented by multiple keypoints. Afterward, the pose recognition model can generate and output a keypoint heatmap based on the confidence level and type label of each pixel. Through post-processing of the keypoint heatmap, the keypoints with the highest confidence levels can be selected, and the coordinates of each keypoint can be mapped to the input image, thus presenting the pose of the target object in the input image.
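The post-processing described above can be illustrated with the following sketch. It assumes one heatmap channel per key point type, a single target object, and a pixel-centre convention when scaling heatmap coordinates to the input image; the function name and these conventions are assumptions, not details from the text.

```python
import numpy as np

def decode_heatmaps(heatmaps, input_size):
    """Pick the highest-confidence pixel of each key point heatmap.

    heatmaps   : array of shape (num_types, H, W) holding per-pixel confidences
    input_size : (width, height) of the input image
    Returns a list of (x, y, confidence) in input-image coordinates.
    """
    num_types, hm_h, hm_w = heatmaps.shape
    in_w, in_h = input_size
    keypoints = []
    for k in range(num_types):
        flat_index = int(np.argmax(heatmaps[k]))
        row, col = divmod(flat_index, hm_w)
        confidence = float(heatmaps[k, row, col])
        # Scale heatmap coordinates back to input-image pixels.
        x = (col + 0.5) * in_w / hm_w
        y = (row + 0.5) * in_h / hm_h
        keypoints.append((x, y, confidence))
    return keypoints
```

Connecting the decoded key points according to the skeleton of the target object then presents its pose in the input image.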

[0117] Thanks to the training method disclosed herein, the training process of the pose recognition model requires only a small amount of labeled data together with a large amount of unlabeled data, which reduces the workload of producing labeled data; furthermore, aggregating the initial pseudo-labels of the teacher models removes the adverse effects of noisy labels on the training process, so that a pose recognition model with higher accuracy can be obtained.

[0118] Based on the model training method in any of the above embodiments, this disclosure also provides a pose recognition method, including: acquiring a target image including a target object; and recognizing the pose of the target object in the target image using a trained pose recognition model obtained by the model training method in any of the above embodiments.

[0119] In this embodiment, the target object can be a person or an animal, and each target image may include one or more target objects.

[0120] As an example, a trained pose recognition model can be obtained in advance using the model training method described in any of the above embodiments. It is understood that if the target object is a person, both the first and second sample images used during training will be images that include a person; if the target object is an animal, both the first and second sample images will be images that include an animal. Subsequently, images of the target object can be captured using a camera, and these captured images can be used as input data for the pose recognition model. The pose recognition model then performs feature extraction and feature mapping on the target images to identify the key points of the target object and their confidence levels, and outputs a heatmap, which represents the pose of the target object.

[0121] Because the pose recognition method in this embodiment uses a model obtained by the model training method described in the above embodiments, the training quality and accuracy of the pose recognition model are improved, which in turn improves the accuracy of pose recognition.

[0122] Exemplary device

[0123] The apparatus for training a model of this disclosure is described below by way of example with reference to Figure 7. As shown in Figure 7, the apparatus includes: a data acquisition unit 710, configured to acquire a first sample set, a second sample set, n teacher models, and a pose recognition model to be trained, wherein the first sample set includes multiple labeled first sample images, the second sample set includes multiple unlabeled second sample images, and n is a positive integer not less than 2; a label generation unit 720, configured to use the n teacher models to predict key points in each second sample image, and use the n prediction results as pseudo-labels for the second sample image to obtain n initial pseudo-labels corresponding to each second sample image; a label aggregation unit 730, configured to aggregate the n initial pseudo-labels corresponding to each second sample image, correct the positions of key points in the n initial pseudo-labels, and obtain target pseudo-labels corresponding to each second sample image; and an iterative training unit 740, configured to iteratively train the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-labels corresponding to each second sample image to obtain the trained pose recognition model.

[0124] In one embodiment, the label generation unit 720 further includes: a first enhancement module configured to perform a first type of data enhancement processing on each second sample image to obtain a third sample image corresponding to each second sample image; and a label generation module configured to input the third sample image corresponding to each second sample image into n teacher models respectively to obtain n initial pseudo-labels corresponding to each second sample image.

[0125] In one embodiment, the apparatus further includes: a second enhancement unit configured to perform a second type of data augmentation processing on each second sample image to obtain a fourth sample image corresponding to each second sample image, and to determine the target pseudo-label corresponding to each second sample image as the target pseudo-label of each fourth sample image; and the iterative training unit 740 further includes: a first module configured to input the first sample image into the pose recognition model to be trained, and to determine a first loss value based on the sample label and the output result of the pose recognition model to be trained; a second module configured to input the fourth sample image into the pose recognition model to be trained, and to determine a second loss value based on the target pseudo-label of the fourth sample image and the output result of the pose recognition model to be trained; an adjustment module configured to adjust the model parameters of the pose recognition model to be trained based on the first loss value and the second loss value; and an iteration module configured to iteratively execute the steps of determining the first loss value, determining the second loss value, and adjusting the model parameters of the pose recognition model to be trained until a preset iteration termination condition is met to obtain the trained pose recognition model.

[0126] In one embodiment, the label aggregation unit 730 further includes: an acquisition module configured to acquire n historical pseudo-labels obtained by n teacher models in the previous iteration; a combination module configured to pair the n initial pseudo-labels and n historical pseudo-labels corresponding to each second sample image to determine multiple label pairs, each label pair including an initial pseudo-label and a historical pseudo-label; a matching module configured to determine key point pairs with matching relationships from the initial pseudo-labels and historical pseudo-labels included in the label pairs, and determine the pixel distance between two key points in the key point pairs; a filtering module configured to, for key points with the same label among the n initial pseudo-labels, determine the target key point pair with the smallest pixel distance from each key point pair containing the key point, and determine the position of the key point in the target key point pair as the target position of the key point; and an aggregation module configured to determine the target pseudo-label corresponding to each second sample image based on each key point and its target position in the n initial pseudo-labels corresponding to each second sample image.

[0127] In one embodiment, the second enhancement unit further includes: a transformation module configured to perform affine transformation processing on each second sample image to obtain a transformed image corresponding to each second sample image; and a mask module configured to perform mask processing on the transformed image corresponding to each second sample image to obtain a fourth sample image corresponding to each second sample image.

[0128] In one embodiment, the apparatus further includes: a first input unit configured to input a first sample image into n teacher models respectively, and determine a third loss value for each of the n teacher models based on the sample labels and the output results of the n teacher models; a model combination unit configured to sequentially use each of the n teacher models as a student model and the other n-1 teacher models as reference models to obtain n model combinations; a second input unit configured to input a fourth sample image into the student model of each of the n model combinations, obtain a first prediction result of the fourth sample image corresponding to the model combination, and determine n-1 initial pseudo-labels obtained when the reference model in the model combination processes the third sample image corresponding to the fourth sample image; a loss determination unit configured to determine a fourth loss value corresponding to the model combination based on the first prediction result of the fourth sample image corresponding to the model combination and the n-1 initial pseudo-labels; and a model adjustment unit configured to adjust the model parameters of each of the n teacher models based on the third loss value and the fourth loss value corresponding to each of the n model combinations.

[0129] In one embodiment, the mask module is further configured to: acquire an image to be cropped; crop out the image region corresponding to the limb part from the image to be cropped to obtain a local image; determine the key point positions in the transformed image corresponding to the second sample image based on n initial pseudo-labels corresponding to the second sample image; and attach the local image to the key point positions in the transformed image corresponding to the second sample image to obtain a fourth sample image corresponding to the second sample image.

[0130] Furthermore, this disclosure also provides a pose recognition device, including: an image acquisition unit configured to acquire a target image including a target object; and a pose recognition unit configured to recognize the pose of the target object in the target image using a trained pose recognition model obtained by the model training method in any of the above embodiments.

[0131] Exemplary electronic devices

[0132] An electronic device according to embodiments of the present disclosure is described below with reference to Figure 8.

[0133] Figure 8 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

[0134] As shown in Figure 8, the electronic device includes one or more processors and memory.

[0135] A processor can be a central processing unit (CPU) or other form of processing unit with data processing and / or instruction execution capabilities, and can control other components in an electronic device to perform desired functions.

[0136] The memory can store one or more computer program products, and the memory can include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and / or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program products can be stored on the computer-readable storage medium, and the processor can run the computer program products to implement the training model methods and / or pose recognition methods of the various embodiments of this disclosure described above, and / or other desired functions.

[0137] In one example, the electronic device may also include input devices and output devices, which are interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0138] In addition, the input device may also include, for example, a keyboard, a mouse, etc.

[0139] This output device can output various information to the outside, including determined distance information, direction information, etc. The output device may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0140] Of course, for the sake of simplicity, Figure 8 shows only some of the components of the electronic device relevant to this disclosure, omitting components such as buses and input / output interfaces. In addition, the electronic device may include any other suitable components depending on the specific application.

[0141] In addition to the methods and apparatus described above, embodiments of this disclosure may also be computer program products comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the methods for training models and / or pose recognition methods according to various embodiments of this disclosure as described in the foregoing portions of this specification.

[0142] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this disclosure. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on a user's computing device, partially on a user's computing device, as a standalone software package, partially on a user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0143] Furthermore, embodiments of this disclosure may also be computer-readable storage media storing computer program instructions thereon, which, when executed by a processor, cause the processor to perform steps in the methods for training models and / or pose recognition methods according to various embodiments of this disclosure as described in the foregoing portion of this specification.

[0144] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may, for example, include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0145] The basic principles of this disclosure have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this disclosure are merely examples and not limitations, and should not be considered as essential features of each embodiment of this disclosure. Furthermore, the specific details disclosed above are for illustrative and facilitative purposes only, and are not limitations. These details do not limit the scope of this disclosure to the necessity of employing the aforementioned specific details for implementation.

[0146] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. For identical or similar parts, the embodiments can be referred to one another. Since the system embodiments largely correspond to the method embodiments, they are described relatively briefly; for relevant details, refer to the descriptions of the method embodiments.

[0147] The block diagrams of devices, apparatuses, equipment, and systems disclosed herein are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” “having,” etc., are open-ended terms meaning “including but not limited to,” and are used interchangeably with them. The terms “or” and “and” as used herein refer to the term “and / or,” and are used interchangeably with it unless the context clearly indicates otherwise. The term “such as” as used herein refers to the phrase “such as but not limited to,” and is used interchangeably with it.

[0148] The methods and apparatus of this disclosure may be implemented in many ways. For example, they may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of steps for the methods is for illustrative purposes only, and the steps of the methods of this disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, this disclosure may also be implemented as a program recorded on a recording medium, the program including machine-readable instructions for implementing the methods according to this disclosure. Thus, this disclosure also covers recording media storing programs for performing the methods according to this disclosure.

[0149] It should also be noted that in the apparatus, devices, and methods of this disclosure, the components or steps can be disassembled and / or recombined. These disassemblies and / or recombinations should be considered as equivalent solutions to this disclosure.

[0150] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of this disclosure. Therefore, this disclosure is not intended to be limited to the aspects shown herein, but rather to be carried out within the widest scope consistent with the principles and novel features disclosed herein.

[0151] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this disclosure to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.

Claims

1. A method for training a model, characterized in that the method includes: obtaining a first sample set, a second sample set, n teacher models, and a pose recognition model to be trained, wherein the first sample set includes multiple labeled first sample images, the second sample set includes multiple unlabeled second sample images, and n is a positive integer not less than 2; predicting the key points in each of the second sample images using the n teacher models respectively, and using the n prediction results as pseudo-labels for the second sample image to obtain n initial pseudo-labels corresponding to each second sample image; aggregating the n initial pseudo-labels corresponding to each second sample image and correcting the positions of key points in the n initial pseudo-labels to obtain the target pseudo-label corresponding to each second sample image, wherein the target position of a key point having the same label among the n initial pseudo-labels is determined based on the position of that key point in a target key point pair, the target key point pair is the key point pair with the smallest pixel distance among all key point pairs containing the key point, the key point pairs are pairs having a matching relationship within the label pairs obtained by pairing the n historical pseudo-labels with the n initial pseudo-labels, and the n historical pseudo-labels are obtained by the n teacher models in the previous iteration; and iteratively training the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-label corresponding to each second sample image to obtain the trained pose recognition model.

2. The method according to claim 1, characterized in that predicting key points in each second sample image using each of the n teacher models, and taking the n prediction results as pseudo-labels of the second sample image, to obtain n initial pseudo-labels corresponding to each second sample image, comprises: performing a first type of data augmentation on each second sample image to obtain a third sample image corresponding to each second sample image; and inputting the third sample image corresponding to each second sample image into the n teacher models to obtain the n initial pseudo-labels corresponding to each second sample image.
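As a non-limiting illustration, the pseudo-label generation of claim 2 may be sketched in Python as follows. The claim does not specify the first type of data augmentation or the form of a teacher model; the brightness shift `brighten`, the nested-list image format, and the treatment of each teacher as a callable are assumptions made only for this sketch.

```python
def brighten(image, delta=10):
    """Hypothetical first-type augmentation: shift intensities, clipped to 0..255."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def initial_pseudo_labels(second_images, teachers, augment=brighten):
    """Return, for every unlabeled image, a list of n initial pseudo-labels.

    second_images: list of 2-D lists of pixel intensities (assumed format).
    teachers: list of n callables, each mapping an image to a pseudo-label.
    """
    labels = []
    for image in second_images:
        third_image = augment(image)  # third sample image
        # One initial pseudo-label per teacher model.
        labels.append([teacher(third_image) for teacher in teachers])
    return labels
```

The result is indexed first by second sample image and then by teacher, so `labels[i][k]` is the initial pseudo-label that teacher k produced for image i.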

3. The method according to claim 2, characterized in that the method further comprises: performing a second type of data augmentation on each second sample image to obtain a fourth sample image corresponding to each second sample image, and determining the target pseudo-label corresponding to each second sample image as the target pseudo-label of the corresponding fourth sample image; and wherein iteratively training the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-label corresponding to each second sample image, to obtain the trained pose recognition model, comprises: inputting the first sample image into the pose recognition model to be trained, and determining a first loss value based on the sample label and the output result of the pose recognition model to be trained; inputting the fourth sample image into the pose recognition model to be trained, and determining a second loss value based on the target pseudo-label of the fourth sample image and the output result of the pose recognition model to be trained; adjusting model parameters of the pose recognition model to be trained based on the first loss value and the second loss value; and iteratively performing the steps of determining the first loss value, determining the second loss value, and adjusting the model parameters until a preset iteration termination condition is met, to obtain the trained pose recognition model.
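As a non-limiting illustration, the iterative training of claim 3 may be sketched in Python as follows. The claim does not specify the model, the loss functions, how the two loss values are combined, or the termination condition. The one-coordinate linear model, the mean squared error, the weighted sum controlled by `pseudo_weight`, plain gradient descent, and the loss threshold `tol` are all assumptions of this sketch.

```python
def predict(params, x):
    """Toy stand-in for the pose model: one key point coordinate from a scalar feature."""
    w, b = params
    return w * x + b


def mse_and_grad(params, batch):
    """Mean squared error over (feature, target) pairs and its gradient w.r.t. (w, b)."""
    n = len(batch)
    loss = gw = gb = 0.0
    for x, y in batch:
        err = predict(params, x) - y
        loss += err * err / n
        gw += 2 * err * x / n
        gb += 2 * err / n
    return loss, (gw, gb)


def train(labeled, pseudo_labeled, lr=0.05, pseudo_weight=1.0,
          max_iters=2000, tol=1e-8):
    """labeled: first sample images with sample labels;
    pseudo_labeled: fourth sample images with target pseudo-labels."""
    params = (0.0, 0.0)
    total = float("inf")
    for _ in range(max_iters):
        first_loss, g1 = mse_and_grad(params, labeled)
        second_loss, g2 = mse_and_grad(params, pseudo_labeled)
        total = first_loss + pseudo_weight * second_loss
        if total < tol:  # assumed preset iteration termination condition
            break
        # Adjust parameters using both loss values.
        params = tuple(p - lr * (a + pseudo_weight * c)
                       for p, a, c in zip(params, g1, g2))
    return params, total
```

For example, `train([(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)])` fits both the labeled and the pseudo-labeled pairs, since all four lie on the same line.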

4. The method according to claim 3, characterized in that aggregating the n initial pseudo-labels corresponding to each second sample image and correcting the positions of key points in the n initial pseudo-labels, to obtain the target pseudo-label corresponding to each second sample image, comprises: obtaining the n historical pseudo-labels obtained by the n teacher models in the previous iteration; for each second sample image, pairing the n initial pseudo-labels with the n historical pseudo-labels to determine a plurality of label pairs, each label pair including one initial pseudo-label and one historical pseudo-label; determining, from the initial pseudo-label and the historical pseudo-label included in each label pair, key point pairs having a matching relationship, and determining the pixel distance between the two key points in each key point pair; for the key points with the same label among the n initial pseudo-labels, determining, from the key point pairs containing the key point, the target key point pair with the smallest pixel distance, and determining the position of the key point in the target key point pair as the target position of the key point; and determining the target pseudo-label corresponding to each second sample image based on the key points in the n initial pseudo-labels corresponding to that second sample image and their target positions.
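As a non-limiting illustration, the aggregation of claim 4 may be sketched in Python as follows. Several points are assumptions of this sketch and are not fixed by the claim: every initial pseudo-label is paired with every historical pseudo-label, two key points "match" when they carry the same name, the pixel distance is Euclidean, and the target position is taken from the initial pseudo-label's key point in the closest pair.

```python
from itertools import product
from math import dist


def aggregate(initial, historical):
    """Build one target pseudo-label for a single second sample image.

    initial, historical: lists of n dicts mapping key point name -> (x, y) in pixels.
    """
    best = {}  # key point name -> (smallest distance so far, position)
    # Assumption: label pairs are all (initial, historical) combinations.
    for init_label, hist_label in product(initial, historical):
        # Assumption: key points with the same name form a matching key point pair.
        for name in init_label.keys() & hist_label.keys():
            d = dist(init_label[name], hist_label[name])
            if name not in best or d < best[name][0]:
                best[name] = (d, init_label[name])
    # Keep only the target position of each key point.
    return {name: pos for name, (_, pos) in best.items()}
```

Under these assumptions, a key point whose predicted position is stable between the previous and current iteration is preferred over one that moved, which is one way to read the noise-suppression intent of the aggregation step.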

5. The method according to claim 3 or 4, characterized in that performing the second type of data augmentation on each second sample image to obtain the fourth sample image corresponding to each second sample image comprises: performing an affine transformation on each second sample image to obtain a transformed image corresponding to each second sample image; and masking the transformed image corresponding to each second sample image to obtain the fourth sample image corresponding to each second sample image.
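As a non-limiting illustration, the second type of data augmentation of claim 5 may be sketched in Python as follows. The nested-list image format, nearest-neighbour sampling, the constant fill value, and the rectangular mask given as `(top, left, height, width)` are assumptions of this sketch; the claim only requires an affine transformation followed by masking.

```python
def affine_warp(image, matrix, fill=0):
    """Warp a 2-D list image with a 2x3 forward affine matrix (nearest neighbour).

    matrix = ((a, b, tx), (c, d, ty)) maps source (x, y) to
    (a*x + b*y + tx, c*x + d*y + ty).
    """
    (a, b, tx), (c, d, ty) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix must be invertible")
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output pixel back to its source location.
            sx = (d * (x - tx) - b * (y - ty)) / det
            sy = (-c * (x - tx) + a * (y - ty)) / det
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out


def mask_region(image, top, left, height, width, fill=0):
    """Return a copy of image with a rectangle overwritten by fill."""
    out = [row[:] for row in image]
    for y in range(max(top, 0), min(top + height, len(out))):
        for x in range(max(left, 0), min(left + width, len(out[0]))):
            out[y][x] = fill
    return out


def second_type_augmentation(image, matrix, mask_box):
    """Affine transformation followed by masking: the fourth sample image."""
    return mask_region(affine_warp(image, matrix), *mask_box)
```

If the target pseudo-label is reused for the fourth sample image, its key point coordinates would presumably need to be mapped through the same affine matrix; that step is omitted here.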

6. The method according to claim 5, characterized in that, before aggregating the n initial pseudo-labels corresponding to each second sample image, the method further comprises: inputting the first sample image into each of the n teacher models, and determining a third loss value of each teacher model based on the sample label and the output results of the n teacher models; taking each of the n teacher models in turn as a student model and the other n-1 teacher models as reference models, to obtain n model combinations; for each of the n model combinations, inputting the fourth sample image into the student model in the model combination to obtain a first prediction result of the fourth sample image corresponding to the model combination, determining the n-1 initial pseudo-labels obtained when the reference models in the model combination process the third sample image corresponding to the fourth sample image, and determining a fourth loss value corresponding to the model combination based on the first prediction result and the n-1 initial pseudo-labels; and adjusting the model parameters of each of the n teacher models based on the third loss value and the fourth loss value corresponding to each of the n model combinations.
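As a non-limiting illustration, the loss computation of claim 6 may be sketched in Python as follows. The squared-error loss, the averaging of the fourth loss over the n-1 reference models, and the representation of a prediction as a flat list of coordinates are assumptions of this sketch. The sketch also assumes that the third and fourth sample images share one coordinate frame, which the claim does not state.

```python
def squared_error(pred, target):
    """Sum of squared coordinate differences (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def teacher_losses(teachers, first_image, sample_label, third_image, fourth_image):
    """Return a list of (third_loss, fourth_loss), one entry per model combination.

    teachers: list of n >= 2 callables mapping an image to a list of coordinates.
    """
    # Initial pseudo-labels of all teachers on the third sample image.
    initial_pseudo = [teacher(third_image) for teacher in teachers]
    losses = []
    for i, student in enumerate(teachers):
        # Third loss: supervised error of this teacher on the labeled image.
        third = squared_error(student(first_image), sample_label)
        # The other n-1 teachers act as reference models.
        references = initial_pseudo[:i] + initial_pseudo[i + 1:]
        first_prediction = student(fourth_image)
        fourth = sum(squared_error(first_prediction, ref)
                     for ref in references) / len(references)
        losses.append((third, fourth))
    return losses
```

Each teacher's parameters would then be adjusted from its own pair of loss values; the optimizer is left out because the claim does not name one.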

7. The method according to claim 6, characterized in that masking the transformed image corresponding to each second sample image to obtain the fourth sample image corresponding to each second sample image comprises: obtaining an image to be cropped; cropping an image region corresponding to a limb part from the image to be cropped to obtain a local image; determining key point positions in the transformed image corresponding to the second sample image based on the n initial pseudo-labels corresponding to the second sample image; and pasting the local image at a key point position in the transformed image corresponding to the second sample image to obtain the fourth sample image corresponding to the second sample image.
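As a non-limiting illustration, the masking of claim 7 may be sketched in Python as follows. The claim does not say how the limb region is located or how the local image is aligned with the key point; the caller-supplied bounding box and the centering of the patch on the key point are assumptions of this sketch, as is the nested-list image format.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular local image (e.g. a limb region) out of a 2-D list image."""
    return [row[left:left + width] for row in image[top:top + height]]


def paste_centered(image, patch, keypoint_xy):
    """Return a copy of image with patch pasted so that it covers the key point.

    keypoint_xy: integer (x, y) pixel position; parts of the patch that fall
    outside the image are dropped.
    """
    out = [row[:] for row in image]
    patch_h, patch_w = len(patch), len(patch[0])
    kx, ky = keypoint_xy
    top, left = ky - patch_h // 2, kx - patch_w // 2
    for py in range(patch_h):
        for px in range(patch_w):
            y, x = top + py, left + px
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = patch[py][px]
    return out
```

Covering a key point with a limb patch taken from another image occludes it with realistic content rather than a blank block, which appears to be the purpose of this masking variant.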

8. A pose recognition method, characterized in that the method comprises: obtaining a target image including a target object; and identifying the pose of the target object in the target image using a pose recognition model trained by the method according to any one of claims 1 to 7.

9. An apparatus for training a model, characterized in that the apparatus comprises: a data acquisition unit configured to acquire a first sample set, a second sample set, n teacher models, and a pose recognition model to be trained, wherein the first sample set includes a plurality of labeled first sample images, the second sample set includes a plurality of unlabeled second sample images, and n is a positive integer not less than 2; a label generation unit configured to predict key points in each second sample image using each of the n teacher models, and to take the n prediction results as pseudo-labels of the second sample image, to obtain n initial pseudo-labels corresponding to each second sample image; a label aggregation unit configured to aggregate the n initial pseudo-labels corresponding to each second sample image and correct the positions of key points in the n initial pseudo-labels, to obtain a target pseudo-label corresponding to each second sample image, wherein the target position of the key points with the same label among the n initial pseudo-labels is determined based on the position of the key point in a target key point pair, the target key point pair is the key point pair with the smallest pixel distance among all key point pairs containing the key point, each key point pair consists of key points having a matching relationship within a label pair obtained by pairing n historical pseudo-labels with the n initial pseudo-labels, and the n historical pseudo-labels are obtained by the n teacher models in the previous iteration; and an iterative training unit configured to iteratively train the pose recognition model to be trained based on the first sample set, the second sample set, and the target pseudo-label corresponding to each second sample image, to obtain a trained pose recognition model.

10. A pose recognition device, characterized in that the device comprises: an image acquisition unit configured to acquire a target image including a target object; and a pose recognition unit configured to recognize the pose of the target object in the target image using a pose recognition model trained by the method according to any one of claims 1 to 7.

11. An electronic device, characterized in that the electronic device comprises: a memory configured to store a computer program product; and a processor configured to execute the computer program product stored in the memory, wherein the computer program product, when executed, implements the method according to any one of claims 1 to 8.

12. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 8.