Live image human key point detection method and device, equipment and medium

By generating corrected images through a boundary evaluation model and a filling network, the problem of inaccurate key point detection caused by incomplete human images is solved, and accurate extraction of human key points is achieved, thus improving the performance of downstream services.

CN115841695B · Active · Publication Date: 2026-06-16 · GUANGZHOU FANGSI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGZHOU FANGSI INFORMATION TECH CO LTD
Filing Date
2022-12-22
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies cannot effectively extract complete key information of the human body when dealing with incomplete human images, resulting in downstream businesses failing to obtain image processing results that meet expectations.

Method used

A boundary evaluation model is used to predict the cropping range of the image, and a filling network is used to fill the image to generate a corrected image. Then, a human keypoint detection model is used to extract keypoint information.

Benefits of technology

It enables accurate detection of incomplete human images, ensuring a good interactive experience for downstream services such as digital human live streaming.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN115841695B_ABST

Abstract

The application discloses a method, apparatus, device and medium for detecting human body key points in live images. The method comprises: obtaining an original image containing a human body image, where the human body image is incomplete because part of it extends beyond the boundary of the original image; using a prediction network of a boundary evaluation model to determine cropping range information describing the range that must be filled to accommodate the complete form of the human body image; using a filling network of the boundary evaluation model to fill image information into the corresponding region of the original image according to the cropping range information, thereby obtaining a corrected image; and using a human body key point detection model to determine the human body key point information of the corrected image. Based on the corrected image, the application can accurately extract all human body key point information corresponding to the human body image in the original image, so that image processing based on this information in scenarios such as network live broadcast delivers a good interactive experience.

Description

Technical Field

[0001] This application relates to the field of live streaming technology, and in particular to a method for detecting human key points in live streaming images, and the corresponding apparatus, computer equipment, and computer-readable storage medium.

Background Technology

[0002] When processing human images, it is often necessary to identify key points of the human body in order to correctly handle the relationship between different parts of the human body based on these key points. This helps to achieve 3D human body modeling or image correction. It can be seen that the correct acquisition of human body key points is a very basic technology.

[0003] In an exemplary live streaming scenario, if the key point information of the human body in the live video stream is correctly obtained, the human body's activity posture in the live video stream can be transferred to the 3D model of the digital human based on the correspondence of the key points. This improves the interactive experience and allows live streaming to have a wider range of applications, such as virtual teaching, virtual performance, virtual dialogue, etc., thereby achieving greater socio-economic benefits.

[0004] In practice, images used for extracting human body key points often have incomplete content because of the camera angle or other image-formatting operations. When the human body image is incomplete, traditional human body key point detection models cannot extract complete key point information, which makes it difficult for downstream services to obtain the expected image processing results. Therefore, providing the spatial redundancy needed to accommodate the complete human body shape in the image is of paramount importance for obtaining the human body key point information that drives various downstream services.

[0005] Traditional techniques for handling incomplete human images often employ a fixed-value approach to determine the filling area, expanding the region corresponding to the missing human keypoints in the image by a fixed value. This process neglects the semantic meaning provided by the human image, thus offering limited effectiveness in assisting human keypoint detection models to accurately predict keypoint information and failing to meet the required standards. Therefore, a more effective solution is needed.

Summary of the Invention

[0006] The primary objective of this application is to solve at least one of the aforementioned problems by providing a method for detecting human key points in live images, as well as corresponding apparatus, computer equipment, and computer-readable storage media.

[0007] To achieve the various objectives of this application, the following technical solution is adopted:

[0008] A method for detecting key points in the human body, proposed for one of the purposes of this application, includes the following steps:

[0009] Obtain an original image containing a human body image, wherein the human body image is incomplete because some parts extend beyond the boundaries of the original image;

[0010] Use the prediction network of a boundary evaluation model to determine the cropping range information of the range that must be filled to accommodate the complete form of the human body image;

[0011] Use the filling network of the boundary evaluation model to fill image information into the corresponding region of the original image according to the cropping range information, thereby obtaining a corrected image;

[0012] The human key point information of the corrected image is determined by using a human key point detection model.

[0013] Optionally, obtaining the original image containing the human body image includes:

[0014] Obtain image frames from a video stream generated by a live webcast;

[0015] Detect whether there is facial information in the image frame; if facial information is present, use the image frame as the target image.

[0016] Obtain the human body image from the target image, determine whether the human body image is in a complete form, and if it is incomplete, use the target image as the original image.
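The frame selection described in [0014] to [0016] can be sketched as a simple filter. This is a minimal illustration, not the applicant's implementation: `has_face` and `body_is_complete` are hypothetical callables standing in for a face detector and a completeness check, neither of which is specified in detail here.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def select_original_images(
    frames: Iterable[Frame],
    has_face: Callable[[Frame], bool],
    body_is_complete: Callable[[Frame], bool],
) -> Iterator[Frame]:
    """Yield only the frames that need boundary correction.

    Both callables are hypothetical placeholders for real detectors.
    """
    for frame in frames:
        # Frames without facial information are ignored.
        if not has_face(frame):
            continue
        # A target image is used as an original image only when
        # the human body image in it is incomplete.
        if not body_is_complete(frame):
            yield frame


# Toy usage: frames are dicts carrying precomputed flags.
frames = [
    {"id": 0, "face": False, "complete": False},
    {"id": 1, "face": True, "complete": True},
    {"id": 2, "face": True, "complete": False},
]
selected = list(
    select_original_images(
        frames,
        has_face=lambda f: f["face"],
        body_is_complete=lambda f: f["complete"],
    )
)
print([f["id"] for f in selected])
```

Passing the detectors in as callables keeps the sketch independent of any particular face or pose detection library.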

[0017] Optionally, before obtaining the raw image containing the human body image, the following steps are included:

[0018] The human body key point detection model is trained until convergence using the first sample image. During the training process, the human body key point information corresponding to the first sample image is used to supervise the human body key point information predicted by the human body key point detection model. The first sample image contains a human body image, and the human body image is in a complete form.

[0019] The boundary evaluation model is trained using the second sample image until convergence. During the training process, the cropping range information determined by comparing the second sample image with its original image is used to supervise the cropping range information predicted by the boundary evaluation model. The second sample image and its original image both contain the same human body image. The human body image in the second sample image is incomplete, while the human body image in its original image is complete.

[0020] Optionally, after training the boundary evaluation model using the second sample image until convergence, the process includes:

[0021] The boundary evaluation model is followed by the human key point detection model to form a joint model, and the joint model is trained by inputting the second sample image.

[0022] Calculate the similarity loss value of the human key point information predicted by the human key point detection model in the joint model based on the human key point information corresponding to the second sample image;

[0023] The similarity loss value of the cropping range information predicted by the prediction network of the boundary evaluation model in the joint model is calculated based on the cropping range information of the second sample image compared with its original image.

[0024] The individual similarity loss values are aggregated into the overall loss value of the joint model, and the joint model is then updated using gradients based on the overall loss value.
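The aggregation in [0022] to [0024] can be illustrated with a short sketch. The text does not name the similarity loss or the aggregation rule, so the mean squared error and the weighted sum below are assumptions, as are the names `mse`, `joint_loss`, `keypoint_weight` and `range_weight`.

```python
from typing import Sequence


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as an assumed similarity loss."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def joint_loss(
    pred_keypoints: Sequence[float],
    true_keypoints: Sequence[float],
    pred_ranges: Sequence[float],
    true_ranges: Sequence[float],
    keypoint_weight: float = 1.0,
    range_weight: float = 1.0,
) -> float:
    """Aggregate both losses into one value (assumed: weighted sum)."""
    keypoint_loss = mse(pred_keypoints, true_keypoints)
    range_loss = mse(pred_ranges, true_ranges)
    return keypoint_weight * keypoint_loss + range_weight * range_loss


# Toy values: flattened keypoint coordinates and [up, down, left, right] ratios.
total = joint_loss(
    pred_keypoints=[0.5, 0.5],
    true_keypoints=[0.5, 1.0],
    pred_ranges=[0.0, 0.5, 0.0, 0.0],
    true_ranges=[0.0, 0.25, 0.0, 0.0],
)
print(total)
```

In an actual training loop the overall value would be backpropagated through both sub-models by a deep learning framework; the sketch only shows how the two supervision signals combine.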

[0025] Optionally, training the boundary evaluation model to convergence using the second sample image includes:

[0026] The second sample image in the training dataset and the cropping range information determined by comparing it with the original image are obtained. The original image contains a human body image in a complete form. The second sample image is obtained by cropping a part of the original image so that the human body image has a non-complete form.

[0027] The second sample image is input into the boundary evaluation model, and its prediction network extracts image feature information from the second sample image to predict the cropping range information corresponding to the filling range required to accommodate the complete shape of the human body image.

[0028] Calculate the similarity loss value between the cropping range information corresponding to the second sample image and the cropping range information predicted by the prediction network, and perform gradient update on the boundary evaluation model based on the similarity loss value.

[0029] Optionally, before obtaining the second sample image and its corresponding second supervision image from the training dataset, the following steps are included:

[0030] Obtain a basic dataset containing multiple source images, each of which contains a human body image in complete form;

[0031] Use the human key point detection model to detect the human key point information in the source image and determine the outer bounding box surrounding all human key points;

[0032] Randomly generate a cropping ratio for at least one image edge and crop the source image according to the cropping ratio so that part of the region covered by the outer bounding box is cut away. The proportion of the cropped portion inside the bounding box to the total area covered by the bounding box is kept at or below a preset ratio, thereby obtaining a second sample image;

[0033] The cropping ratio is represented as the cropping range information of the second sample image, and the cropping range information and the second sample image are used to form a mapping relationship data, which is stored in the training dataset.
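The sample preparation in [0030] to [0033] can be sketched as rejection sampling over random cropping ratios. This is an illustration under stated assumptions: the ratios are taken relative to the image's own height and width, the order follows `[up, down, left, right]`, and the names `cropped_fraction_of_box`, `random_crop_ratios`, `max_box_loss` and `max_edge_ratio` are hypothetical.

```python
import random
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Ratios = Tuple[float, float, float, float]  # (up, down, left, right)


def cropped_fraction_of_box(width: float, height: float, box: Box, ratios: Ratios) -> float:
    """Fraction of the bounding-box area removed by the crop."""
    up, down, left, right = ratios
    # Region kept after cropping, in source-image pixel coordinates.
    keep_left, keep_right = left * width, width * (1.0 - right)
    keep_top, keep_bottom = up * height, height * (1.0 - down)
    b_left, b_top, b_right, b_bottom = box
    kept_w = max(0.0, min(b_right, keep_right) - max(b_left, keep_left))
    kept_h = max(0.0, min(b_bottom, keep_bottom) - max(b_top, keep_top))
    box_area = (b_right - b_left) * (b_bottom - b_top)
    return 1.0 - (kept_w * kept_h) / box_area


def random_crop_ratios(
    width: float,
    height: float,
    box: Box,
    max_box_loss: float = 0.3,
    max_edge_ratio: float = 0.4,
    rng: Optional[random.Random] = None,
    max_tries: int = 1000,
) -> Ratios:
    """Draw ratios until the crop cuts into the box without exceeding the limit."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        ratios = [0.0, 0.0, 0.0, 0.0]
        # Apply a random ratio to at least one edge.
        for edge in rng.sample(range(4), rng.randint(1, 4)):
            ratios[edge] = rng.uniform(0.0, max_edge_ratio)
        lost = cropped_fraction_of_box(width, height, box, tuple(ratios))
        if 0.0 < lost <= max_box_loss:
            return tuple(ratios)
    raise RuntimeError("no valid cropping ratios found")


box = (20.0, 20.0, 80.0, 180.0)
label = random_crop_ratios(100.0, 200.0, box, rng=random.Random(0))
print(label, cropped_fraction_of_box(100.0, 200.0, box, label))
```

The returned tuple is what would be stored as the cropping range information paired with the cropped sample.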

[0034] Optionally, after determining the human key point information of the corrected image using a human key point detection model, the process includes:

[0035] The key point information of the human body is applied to a preset three-dimensional model of a digital human, and the digital human is controlled to switch to the corresponding activity posture.

[0036] Render the 3D model to obtain the pose image of the digital human;

[0037] The posture image is then pushed to a live webcast room for display.

[0038] A human body key point detection device provided for one of the purposes of this application includes:

[0039] The image acquisition module is configured to acquire an original image containing a human body image, wherein the human body image is incomplete because some parts of it extend beyond the boundaries of the original image.

[0040] The range determination module is configured to use a prediction network of a boundary evaluation model to determine the cropping range information of the fill range required to accommodate the complete shape of the human body image;

[0041] The filling processing module is configured to use the filling network of the boundary evaluation model to fill the corresponding region of the original image with image information according to the cropping range information, so as to obtain the corrected image;

[0042] The key point determination module is configured to use a human key point detection model to determine the human key point information of the corrected image.

[0043] A computer device provided for one of the purposes of this application includes a central processing unit and a memory, the central processing unit being configured to invoke and run a computer program stored in the memory to perform the steps of the human body key point detection method described in this application.

[0044] A computer-readable storage medium is provided for another purpose of this application, which stores, in the form of computer-readable instructions, a computer program implemented according to the described human key point detection method, which, when invoked by a computer, performs the steps included in the method.

[0045] A computer program product provided for another purpose of this application includes a computer program / instructions that, when executed by a processor, implement the steps of the method described in any embodiment of this application.

[0046] A method for detecting human key points in live images, provided for one of the purposes of this application, includes the steps of the method for detecting human key points.

[0047] Compared to existing technologies, this application addresses original images containing incomplete human images by employing a boundary evaluation model. First, the prediction network within the model predicts the cropping range information required to restore the human image in the original image to its complete form. Then, the filling network in the boundary evaluation model fills the image boundaries based on the cropping range information, thereby expanding the features of the original image to obtain a corrected image. The expanded area in the corrected image can be determined according to the semantic features of the human image in the original image. This corrected image is then provided to a human keypoint detection model for further detection, containing sufficient information to understand the complete human image. This allows for the accurate extraction of all human keypoint information corresponding to the human image in the original image, providing accurate and effective basic data for various downstream services that rely on human keypoint information, ensuring a good interactive experience in scenarios such as live streaming based on digital humans.

Attached Figure Description

[0048] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:

[0049] Figure 1 is an exemplary network architecture used for the live streaming service in the live streaming scenario of this application;

[0050] Figure 2 is a flowchart illustrating a typical embodiment of the human body key point detection method of this application;

[0051] Figure 3 is a schematic diagram of the model architecture of the boundary evaluation model in this application;

[0052] Figure 4 shows comparative images used in this application to illustrate the area defined by the cropping range information, where the left side is the original image and the right side is the result of expanding the original image according to the cropping range information;

[0053] Figure 5 is a schematic diagram of the process for determining the original image from the video stream in an embodiment of this application;

[0054] Figure 6 is a flowchart illustrating the process of individually training the various models used in this application in an embodiment of this application;

[0055] Figure 7 is a schematic diagram illustrating the process of building the various models of this application into a joint model and training it in an embodiment of this application;

[0056] Figure 8 is a schematic diagram illustrating the process of independently training the boundary evaluation model of this application;

[0057] Figure 9 is a flowchart illustrating the process of preparing, from the source image, the second sample image and its cropping range information required for training the boundary evaluation model;

[0058] Figure 10 is a schematic diagram illustrating the process of applying human key point information to a digital human in an embodiment of this application;

[0059] Figure 11 is a schematic block diagram of the human body key point detection device of this application;

[0060] Figure 12 is a schematic diagram of the structure of a computer device used in this application.

Detailed Implementation

[0061] Referring to Figure 1, an exemplary application scenario of this application uses a network architecture that includes a terminal device 80, a media server 81, and an application server 82. The application server 82 can be used to deploy a live streaming service. The media server 81 can run a computer program product implemented according to the human keypoint detection method of this application. By running this product, it performs the various steps of the method, thereby identifying human keypoint information in image frames of the live video stream submitted by the broadcaster user. Then, based on the human keypoint information, it generates a live video stream corresponding to a digital human, which is pushed to the live streaming room to replace the broadcaster user's live video stream. The terminal device 80 allows broadcasters or viewers to log into the live streaming room supported by the live streaming service. The broadcaster user can record video through the camera unit in their terminal device 80 and submit it as a live video stream to the media server 81. The viewer user can receive and play the live video stream pushed by the media server 81 through their terminal device 80.

[0062] Specifically, when a broadcaster accesses the live streaming service provided by the application server 82 from their terminal device 80 and enters the corresponding live streaming room, they can enable the live recording function and start pushing the live video stream to the media server. The media server can perform human key point detection on the image frames in the live video stream, obtain the corresponding human key point information, and then use it to drive the digital human to generate corresponding replacement image frames. The various image frames of the digital human constitute the replacement live video stream, which is then pushed to the broadcaster's live streaming room by the media server so that each viewer can receive and play the live video stream containing the digital human's image. Since the digital human's movements are controlled by the human key point information of the human images in the image frames of the live video stream submitted by the broadcaster, it is actually using a digital human to replace the original broadcaster to conduct live streaming, providing a virtual live streaming service based on a digital human.

[0063] Similarly, other application scenarios based on digital humans, such as augmented reality, virtual reality, and 3D games, can all become application scenarios of this application. Computer program products implemented using the human body key point information detection method of this application can use the technical solution of this application to obtain human body key point information in images and generate corresponding images or animations.

[0064] In some applications involving static image processing, such as beautification and reshaping of human images, the human key point information detection method according to this application can be used to determine the human key point information from the original image, then model the corresponding three-dimensional human body based on the human key point information, and finally provide the user with corresponding beautification services based on the three-dimensional model to help achieve the purpose of image beautification.

[0065] The aforementioned human body key point information mainly includes skeletal joints distributed throughout the body, such as the head, torso, and limbs. By adjusting the displacement of these skeletal key points in the 3D image coordinate system, corresponding body parts of the 3D human model can be driven to produce corresponding motion effects, thus switching the 3D human model to a corresponding posture. The resulting posture image can then be obtained through image rendering. By progressively adjusting the positional information of the same set of skeletal key points at multiple moments, a set of image frames showing gradual motion of the 3D human model can be generated. These image frames can then be used to construct a video stream, visually presenting the human body's motion effects during playback.
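The gradual adjustment of skeletal key points over multiple moments described in [0065] can be illustrated with a small sketch. Linear interpolation between two poses is an assumption made here for illustration; the text does not say how intermediate positions are obtained, and `interpolate_pose` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the 3D image coordinate system


def interpolate_pose(
    start: Sequence[Point], end: Sequence[Point], steps: int
) -> List[List[Point]]:
    """Return steps + 1 poses moving linearly from start to end."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if len(start) != len(end):
        raise ValueError("both poses must contain the same key points")
    poses = []
    for i in range(steps + 1):
        t = i / steps  # 0.0 at the start pose, 1.0 at the end pose
        poses.append(
            [
                tuple(a + (b - a) * t for a, b in zip(p, q))
                for p, q in zip(start, end)
            ]
        )
    return poses


# One key point moving from the origin to (2, 4, 6) over two steps.
poses = interpolate_pose([(0.0, 0.0, 0.0)], [(2.0, 4.0, 6.0)], steps=2)
print(poses)
```

Each pose in the returned list would drive the 3D model once, and rendering the poses in order yields the image frames of the video stream.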

[0066] Based on the above exemplary scenarios and related principle descriptions, and referring to Figure 2, one embodiment of the human body key point detection method of this application includes the following steps:

[0067] Step S1100: Obtain an original image containing a human body image, wherein the human body image is incomplete because some parts of it extend beyond the boundaries of the original image.

[0068] Images from which key human body information needs to be extracted using the technical solution of this application can be considered as the original images of this application.

[0069] The original image typically contains a human figure. Furthermore, due to imaging, cropping, or other reasons, the head and / or individual limbs in the human figure may extend beyond the boundaries of the original image, resulting in the loss of relevant image information and rendering the human figure incomplete. It is precisely because the human figure in the original image is incomplete that the technical advantages of this application are highlighted. Through the processing of the technical solution in this application, even when presenting an original image with an incomplete human figure, all key human point information can be accurately extracted.

[0070] The type and source of the original image depend on the actual application scenario. For example, in image enhancement processing, the original image may be a static image specified by the user; in live streaming, the original image may be an image frame from the live video stream submitted by the broadcaster; in extracting key human body information from a preview video stream obtained from the camera unit of a terminal device, the original image may be an image frame from that preview video stream. Similarly, the original image can be determined as needed depending on the specific application scenario.

[0071] Step S1200: The prediction network of the boundary evaluation model is used to determine the cropping range information of the filling range required to accommodate the complete shape of the human body image;

[0072] This application proposes a boundary evaluation model. As shown in Figure 3, the boundary evaluation model is configured with a prediction network, which is a multi-task prediction network used to predict the cropping ratio of each edge of the original image. The cropping ratio of each edge defines the corresponding range of image information that should be filled in around the original image. Thus, the predicted cropping ratios of the edges together constitute the cropping range information corresponding to the range that should be filled in for the original image.

[0073] The ability of the boundary evaluation model to determine the cropping range information corresponding to the area to be filled in the original image is obtained through pre-training. By training it with corresponding training samples, it can predict the edges and their expansion ratios that should be added to the image area of the original image to achieve the goal of restoring the human image in the original image to its complete form. Each edge and its expansion ratio can constitute the corresponding cropping range information so that the corresponding image information can be filled in according to the cropping range information.

[0074] It's easy to understand that since the image has a rectangular planar structure, the cropping range information can be described by the top, bottom, left, and right sides and their corresponding expansion ratios. Of course, modifications can be made, such as omitting one side and using only the three sides and their expansion ratios as a feasible alternative description of the cropping range information.

[0075] As shown in Figure 3, the prediction network of the boundary evaluation model consists of an image feature extractor and a linear layer connected in series. The image feature extractor is composed of multiple stacked convolutional layers, which progressively extract deep semantic information, referred to as image feature information, from the input image, such as the original image or sample images during training. Then, the linear layer maps the image feature information to the categories corresponding to each edge of the image, and the values obtained for each category can be used as the corresponding expansion ratio. The linear layer can be a fully connected layer.
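The role of the linear layer in [0075] can be sketched in plain Python. This is only an illustration of the mapping from a feature vector to one value per edge: the feature vector is assumed to come from the convolutional extractor, the weights are toy values rather than trained parameters, and clamping negative outputs to zero is an assumption not stated in the text.

```python
from typing import Dict, Sequence

EDGES = ("up", "down", "left", "right")


def linear_head(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Dict[str, float]:
    """Map image feature information to one expansion ratio per edge."""
    if len(weights) != len(EDGES) or len(bias) != len(EDGES):
        raise ValueError("expected one weight row and one bias per edge")
    ratios = []
    for row, b in zip(weights, bias):
        value = sum(w * f for w, f in zip(row, features)) + b
        # Assumption: a negative output is treated as "no expansion".
        ratios.append(max(0.0, value))
    return dict(zip(EDGES, ratios))


# Toy feature vector and parameters (not trained values).
predicted = linear_head(
    features=[1.0, 2.0],
    weights=[[0.0, 0.0], [0.25, 0.125], [0.0, 0.0], [-1.0, 0.0]],
    bias=[0.0, 0.0, 0.0, 0.5],
)
print(predicted)
```

With these toy parameters only the bottom edge receives a non-zero ratio, which corresponds to an image where the legs are cut off.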

[0076] The convolutional layers in the image feature extractor can be ordinary convolutional layers (CNN, Convolutional Neural Network) or other advanced convolutional layers evolved from ordinary convolutional layers, such as residual convolutional layers (ResNet, Residual Network). It can be seen that, under the action of the prediction network of the boundary evaluation model, the corresponding cropping range information can be predicted for the original image. This information indicates the range that needs to be filled to accommodate the complete shape of the human body image in the original image, indicating the edges that need to be filled and their corresponding expansion ratios, so that the expansion area of each edge can be determined by the expansion ratio of each edge.

[0077] Step S1300: Using the filling network of the boundary evaluation model, fill the corresponding region of the original image with image information according to the cropping range information to obtain the corrected image;

[0078] The boundary evaluation model further includes a filling network, which determines the corresponding regions that need to be filled for each edge in the original image based on the cropping range information provided by the prediction network, and then fills the corresponding regions with image information to obtain the corrected image. In one embodiment, the filling network is a general description of an operation, which can directly assign values to the pixels in the corresponding regions using an assignment method, without relying on a mathematical model.

[0079] Since the cropping range information consists of descriptive information composed of edges and their expansion ratios, in one embodiment, by simply determining the expansion range corresponding to each edge based on the expansion ratio, the entire area that needs to be filled around the original image can be obtained, and the filling network can then uniformly fill the image information for all areas. In the example shown in Figure 4, the image on the left is the original image, and the image on the right is an expanded version relative to the original image. The blank areas are the corresponding areas that need to be filled with image information, determined according to the cropping range information. In effect, this expands the image to provide the redundant space corresponding to the leg area of the person in the image. The cropping range information can be represented as a vector. For example, based on the top, bottom, left, and right relationships, it can be represented in the following form:

[0080] [up,down,left,right]
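A minimal sketch of how the `[up,down,left,right]` vector could be turned into pixel margins follows. It assumes each ratio is relative to the original image's own height (for up and down) or width (for left and right); the text does not fix this reference, and `expansion_margins` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def expansion_margins(height: int, width: int, ratios: Sequence[float]) -> Dict[str, int]:
    """Convert [up, down, left, right] ratios into pixel margins and new size."""
    up, down, left, right = ratios
    if min(ratios) < 0:
        raise ValueError("expansion ratios must be non-negative")
    # Assumption: vertical ratios scale with height, horizontal with width.
    top_px = round(up * height)
    bottom_px = round(down * height)
    left_px = round(left * width)
    right_px = round(right * width)
    return {
        "top": top_px,
        "bottom": bottom_px,
        "left": left_px,
        "right": right_px,
        "new_height": height + top_px + bottom_px,
        "new_width": width + left_px + right_px,
    }


# A 400 x 300 image whose bottom and right edges need to be expanded.
margins = expansion_margins(400, 300, [0.0, 0.25, 0.0, 0.1])
print(margins)
```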

[0081] When filling the corresponding region represented by the cropping range information, in one embodiment, preset pixel values can be used to fill each pixel. For example, according to RGB representation, [128, 128, 128] can be used as the pixel value of each pixel in the corresponding region, or a fixed value determined by a ratio such as [103, 116, 122] can be used as the pixel value of each pixel. In another embodiment, the edge pixel values of the corresponding edge of the original image can be taken to fill each pixel in the region expanded by that edge. In other embodiments, other corresponding trained neural network models can be used to predict the corresponding pixel values of each edge, thereby enriching the image information of the expanded region in the corrected image. The essence of the above embodiments is to use relatively balanced color values to fill the expanded region defined by the cropping range information, so that the expanded image information in the obtained corrected image can maintain its balance with the original image in terms of color expression, thereby ensuring that the human key point detection model has stronger output robustness when predicting human key point information in the corrected image.
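The first two filling embodiments in [0081] can be sketched as follows, using nested lists of RGB tuples so that the example needs no external library. The function name `pad_image` and its `mode` parameter are hypothetical; a production system would more likely use an array library's padding routine.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def pad_image(
    image: Image,
    top: int,
    bottom: int,
    left: int,
    right: int,
    mode: str = "constant",
    value: Pixel = (128, 128, 128),
) -> Image:
    """Expand an image by the given pixel margins.

    mode="constant" fills new pixels with a preset value;
    mode="edge" repeats the nearest pixel of the original image.
    """
    if mode not in ("constant", "edge"):
        raise ValueError("mode must be 'constant' or 'edge'")
    height, width = len(image), len(image[0])
    padded = []
    for y in range(-top, height + bottom):
        src_y = min(max(y, 0), height - 1)  # nearest original row
        row = []
        for x in range(-left, width + right):
            src_x = min(max(x, 0), width - 1)  # nearest original column
            inside = 0 <= y < height and 0 <= x < width
            if inside or mode == "edge":
                row.append(image[src_y][src_x])
            else:
                row.append(value)
        padded.append(row)
    return padded


# A 1 x 2 image expanded by one pixel at the bottom and one on the right.
tiny = [[(10, 10, 10), (20, 20, 20)]]
constant_filled = pad_image(tiny, 0, 1, 0, 1, mode="constant")
edge_filled = pad_image(tiny, 0, 1, 0, 1, mode="edge")
print(constant_filled)
print(edge_filled)
```

The third embodiment, where a trained network predicts the new pixel values, is not covered by this sketch.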

[0082] Step S1400: Use a human key point detection model to determine the human key point information of the corrected image.

[0083] After obtaining the corrected image, input the corrected image into a human keypoint detection model that has been trained to convergence beforehand, and the human keypoint detection model can extract the human keypoint information in the corrected image.

[0084] The human keypoint detection model can be any existing model. For example, MoveNet, BlazePose, and PoseNet can be used. MoveNet is a fast and accurate pose detection model that can detect 17 keypoints on the human body and can run at 50+ fps on laptops and mobile phones. BlazePose (MediaPipe BlazePose) can detect 33 keypoints on the human body, and in addition to the 17 specific keypoints, it also provides additional keypoint detection for the face, hands, and feet. PoseNet can detect multiple poses, each containing 17 keypoints. Of course, the selection of the human keypoint detection model is not limited to these, and its selection does not affect the inventive spirit of this application.

[0085] As can be seen from the above embodiments, this application, for original images containing incomplete human body images, employs a boundary evaluation model. First, the prediction network within the model predicts the cropping range information required to restore the human body image in the original image to its complete form. Then, the filling network in the boundary evaluation model fills the image boundary with image information based on the cropping range information, thereby expanding the features of the original image to obtain a corrected image. The expanded area in the corrected image can be determined according to the semantic features of the human body image in the original image. This processed corrected image is then provided to a human keypoint detection model for further detection. It contains sufficient information corresponding to a fully formed human body image, allowing for accurate extraction of all human keypoint information corresponding to the human body image in the original image. This provides accurate and effective basic data for various downstream services that rely on human keypoint information, ensuring a good interactive experience in scenarios such as live streaming based on digital humans.

[0086] Based on any embodiment of this application, referring to Figure 5, obtaining the original image containing the human body image includes:

[0087] Step S1110: Obtain image frames from the video stream generated by the live web broadcast;

[0088] This embodiment is applicable to online live streaming scenarios based on digital humans for virtual live streaming. In such scenarios, a computer program product implemented according to the method of this application can be deployed to a media server for online live streaming and run. Then, the original image is obtained through the live video stream generated by the online live streaming, referred to as the video stream. The original image obtained from the live video stream of the online live streaming scenario is a live image, which can be processed according to the technical solution of this application.

[0089] Once a live streamer starts broadcasting, they can perform activities such as dancing, singing, giving speeches, or acting as a fitness instructor. The camera unit on their terminal device records the corresponding video stream and submits it to the media server.

[0090] After obtaining the video stream, the media server first decodes it to obtain the individual image frames in the video stream. The processing procedure of this embodiment can be applied to each image frame as needed.

[0091] Step S1120: Detect whether there is facial information in the image frame. If facial information is present, use the image frame as the target image.

[0092] To reduce unnecessary computation, a mature face detection model can be used to detect faces in the image frames. It's easy to understand that the presence of facial information in the image frame indicates human activity, thus necessitating further processing and designating the corresponding image frame as the target image. If the image frame does not contain facial information, i.e., no facial image exists, it can be ignored. For example, the face detection model can be implemented using the YOLO series of models.

[0093] Step S1130: Obtain the human body image in the target image, determine whether the human body image is in a complete form, and if it is incomplete, use the target image as the original image.

[0094] Given that the live streamer may be in motion during the live streaming process, parts of their body may sometimes be obscured or fall outside the camera's field of view. Whether this has happened can be determined by first analyzing the image frame. When it has, the corresponding image frame can be used as the original image for subsequent processing in this application.

[0095] Considering the influence of environmental factors such as lighting and props in the live streaming environment, the image content of the corresponding video stream frames is relatively complex. Therefore, a mature image segmentation model can be used to segment the target image first, and the human image in the target image can be obtained through image segmentation to remove the interference of other image content.

[0096] Based on the human body image, various methods can be used to detect whether the human body displayed in the human body image presents a complete or incomplete form. When it is incomplete, the corresponding target image can be determined as the original image of this application.

[0097] In one embodiment, a mature human key point detection model can be used to extract human key point information from the human image. Then, based on the integrity of the human key point information, if the information is intact, the human image in the target image is determined to be in a complete form; otherwise, it is determined to be in an incomplete form.
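One possible form of the integrity check described above is sketched below. The keypoint layout (one row of x, y, confidence score per keypoint), the function name, and the 0.3 score threshold are assumptions made for illustration; this application does not fix any of them.

```python
import numpy as np


def is_complete_form(keypoints: np.ndarray, width: int, height: int,
                     min_score: float = 0.3) -> bool:
    """Return True when every keypoint is confidently detected inside the frame.

    keypoints: K x 3 array of (x, y, score); this layout is an assumption.
    """
    x, y, score = keypoints[:, 0], keypoints[:, 1], keypoints[:, 2]
    inside = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    # A single missing or low-confidence keypoint marks the body as incomplete.
    return bool(np.all(inside & (score >= min_score)))
```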

[0098] In another embodiment, an image integrity classification model, pre-trained to convergence, can be used to determine whether the human image belongs to a complete or incomplete form. This image integrity classification model can be implemented using an image feature extractor followed by a classifier. The image feature extractor extracts deep semantics from the human image to obtain human image features, which are then mapped by the classifier to the complete and incomplete forms to obtain the corresponding classification probabilities. The form with the highest classification probability is taken as the result. In the training samples used for the image integrity classification model, complete human images can be used as positive samples, and incomplete human images as negative samples. Training the model with these positive and negative samples until convergence enables it to acquire the corresponding judgment ability.
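The final decision step of such a classifier can be illustrated as follows. The use of softmax to turn classifier outputs into probabilities, the class order in `LABELS`, and the function name are assumptions for this sketch; the application only states that the form with the highest probability is selected.

```python
import numpy as np

LABELS = ("complete", "incomplete")  # hypothetical class order


def classify_form(logits: np.ndarray) -> tuple:
    """Map two classifier logits to (label, probability) via softmax and argmax."""
    shifted = logits - np.max(logits)                  # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax probabilities
    idx = int(np.argmax(probs))                        # index of the most probable form
    return LABELS[idx], float(probs[idx])
```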

[0099] As can be seen from the above embodiments, by intelligently recognizing the image frames of the video stream, it is first determined whether there is facial information in them. If there is facial information, it is then determined whether the human body image is incomplete. The image frame corresponding to the incomplete form is then used as the original image suitable for processing in this application. This avoids performing the subsequent processing steps of this application for each image frame in the video stream, reducing the computing pressure on the media server. Furthermore, it can accurately identify the image frames in the video stream from which human body key point information needs to be extracted, enabling various downstream services based on human body key point information in the video stream to be executed efficiently and accurately.
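The two-level filtering of steps S1120 and S1130 can be sketched as a simple generator. The function name and the two predicate callables are hypothetical; in practice they would wrap a face detection model and one of the integrity checks described above.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def select_original_images(
    frames: Iterable[Frame],
    has_face: Callable[[Frame], bool],
    is_complete: Callable[[Frame], bool],
) -> Iterator[Frame]:
    """Yield only frames that contain a face and whose human body image is incomplete."""
    for frame in frames:
        if not has_face(frame):
            continue    # no facial information: ignore the frame (step S1120)
        if is_complete(frame):
            continue    # complete body: no boundary filling is needed (step S1130)
        yield frame     # incomplete body: use as the original image
```

Running the cheaper face check first means the integrity check is only evaluated on frames that actually show a person, which is the source of the reduced computing pressure mentioned above.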

[0100] Based on any embodiment of this application, referring to Figure 6, before obtaining the original image containing the human body image, the method includes:

[0101] Step S2100: Train the human key point detection model with the first sample image until convergence. During the training process, the human key point information corresponding to the first sample image is used to supervise the human key point information predicted by the human key point detection model. The first sample image contains a human image, and the human image is in a complete form.

[0102] Based on the established human keypoint detection model, it can be trained independently. When training the human keypoint detection model, corresponding training samples need to be prepared, and corresponding supervision label information needs to be provided for each training sample. Since the input of the human keypoint detection model is an image, and the output is human keypoint information, a batch of first sample images with complete human body shapes are selected as the training samples, and the human keypoint information pre-annotated for each first sample image is used as the supervision label information for the corresponding training sample.

[0103] During the iterative training of the human keypoint detection model, each iteration uses a first sample image as the input training sample. The human keypoint detection model outputs corresponding human keypoint information through inference. Then, using the supervision label information corresponding to the training sample, i.e., the corresponding human keypoint information, the loss value of the human keypoint information predicted by the model is calculated. When the loss value reaches a preset threshold, it indicates that the model has converged and training can be terminated. Otherwise, the model has not converged. The gradient of the human keypoint detection model is updated according to the loss value, and the next training sample is called to continue iterative training. This process continues until the human keypoint detection model is trained to a convergent state, at which point it can be put into online inference.

[0104] Step S2200: Train the boundary evaluation model using the second sample image until convergence. During the training process, the cropping range information determined by comparing the second sample image with its original image is used to supervise the cropping range information predicted by the boundary evaluation model. The second sample image and its original image both contain the same human body image. The human body image in the second sample image is incomplete, while the human body image in its original image is complete.

[0105] Based on the determined network architecture of the boundary evaluation model, it can also be trained. To prepare training samples for the boundary evaluation model, a batch of second sample images can be selected as training samples. Each second sample image originates from a corresponding original image. The second sample image can be obtained by cropping its corresponding original image on one side or multiple sides. Through cropping, a human body image with a complete shape in the original image becomes a human body image with a non-complete shape in its corresponding second sample image. Thus, the cropping range information of the second sample image relative to its original image can be determined. This cropping range information can be used as the corresponding supervision label information of the second sample image.

[0106] During the iterative training of the boundary evaluation model, a second sample image is used as a training sample and input into the prediction network of the boundary evaluation model in each iteration. The prediction network predicts the cropping range information of the region that the human body image in the second sample image lacks relative to its complete form. Then, the loss value of the prediction is calculated using the cropping range information corresponding to the training sample as the label. When the loss value reaches a preset threshold, it indicates that the model has converged and training can be terminated. Otherwise, the model has not converged: the gradient of the boundary evaluation model is updated according to the loss value, and the next training sample is called to continue iterative training. This process continues until the boundary evaluation model is trained to a convergent state, at which point it can be put into online inference.

[0107] As can be seen from the above embodiments, the human keypoint detection model and the boundary evaluation model can be trained separately to acquire their respective image processing capabilities. The boundary evaluation model can then use its learned capabilities to accurately expand the boundaries of the original image containing incomplete human images to obtain a corrected image. The human keypoint detection model can then accurately extract human keypoint information based on the corrected image. This ensures that even if the original image contains incomplete human images, all human keypoint information can still be accurately obtained, thus ensuring the stable operation of downstream services that rely on human keypoint information.

[0108] Based on any embodiment of this application, referring to Figure 7, after training the boundary evaluation model using the second sample image until convergence, the method includes:

[0109] Step S2300: Connect the boundary evaluation model to the human key point detection model to form a joint model, and use the second sample image as input to the joint model for training;

[0110] The boundary evaluation model and the human keypoint detection model can be further trained together. To this end, the boundary evaluation model is directly followed by the human keypoint detection model to build a joint model, so that the corrected image generated by the boundary evaluation model can be used as input information and directly provided to the human keypoint detection model to perform human keypoint information detection.

[0111] For the joint model, the training samples used as input and their corresponding supervision label information need to be selected accordingly. As training samples, the second sample images used when training the boundary evaluation model can be used as input. As supervision label information, two types are used. The first type is human key point information, used to calculate the loss value of the human key point information output by the joint model; it can be obtained by pre-annotating human key points for the second sample image. The second type is a supervision image, used to calculate the loss value of the corrected image produced by the boundary evaluation model in the joint model. As mentioned above, the second sample image is cropped from its corresponding supervision image. The human image in the supervision image is in a complete form, while the human image in the second sample image cropped from it is in a non-complete form.

[0112] Step S2400: Calculate the similarity loss value of the human key point information predicted by the human key point detection model in the joint model based on the human key point information corresponding to the second sample image;

[0113] When calculating the loss value corresponding to the human key point information output by the joint model, a similarity loss value based on the L1 norm can be determined from the human key point information labeled for the second sample image and the human key point information output by the joint model.

[0114] Step S2500: Calculate the similarity loss value of the cropping range information predicted by the prediction network of the boundary evaluation model in the joint model based on the cropping range information of the second sample image compared with its original image;

[0115] In the joint model, the prediction network of the boundary evaluation model predicts the corresponding cropping range information based on the second sample image. The cropping range information determined by comparing the second sample image with its original image serves as the label, and the similarity loss value between the two can be calculated according to the L1 norm.

[0116] Step S2600: Aggregate the similarity loss values into the overall loss value of the joint model, and perform gradient update on the joint model based on the overall loss value.

[0117] After determining the similarity loss value corresponding to the prediction network of the boundary evaluation model and the similarity loss value corresponding to the human key point detection model, the similarity loss values can be fused to form the overall loss value of the entire joint model. Whether to continue iterating is then decided based on the overall loss value.

[0118] Specifically, the overall loss value is compared with a target threshold used to determine whether the model has converged. When the overall loss value reaches the target threshold, it indicates that the joint model has reached convergence, and training can be terminated. When the target threshold is not reached, it indicates that the joint model has not yet converged. Gradient updates are then performed on the entire joint model based on the overall loss value. The weight parameters of each stage of the boundary evaluation model's prediction network and of the human keypoint detection model are corrected through backpropagation, bringing the entire joint model closer to convergence. Then, the next second sample image is called to continue iterative training of the entire joint model, and so on, until the entire joint model is trained to a converged state. At this point, the entire joint model can be used in the online inference stage.

[0119] In one embodiment, when fusing the similarity loss value of the prediction network of the boundary evaluation model and the similarity loss value of the human keypoint detection model, the average of the two can be used to determine the overall loss value; in another embodiment, the overall loss value can be obtained by applying smoothing weights to perform a weighted sum of the two similarity loss values.
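The two fusion options can be illustrated with a short sketch. The function names and the example weights are assumptions; the application does not specify the smoothing weights, only that either an average or a weighted sum may be used.

```python
import numpy as np


def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error (L1-norm based similarity loss) between prediction and label."""
    return float(np.mean(np.abs(pred - target)))


def overall_loss(range_loss: float, keypoint_loss: float,
                 weights: tuple = (0.5, 0.5)) -> float:
    """Weighted sum of the two similarity losses.

    The default weights (0.5, 0.5) reproduce the plain average; any other
    pair of weights is a hypothetical choice for the weighted-sum embodiment.
    """
    w_range, w_kpt = weights
    return w_range * range_loss + w_kpt * keypoint_loss
```

For example, `overall_loss(a, b)` gives the average of the two losses, while `overall_loss(a, b, (0.25, 0.75))` emphasizes the keypoint loss.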

[0120] As can be seen from the above embodiments, by further building the boundary evaluation model and the human key point detection model into a joint model and training it until convergence, and by integrating the similarity loss value of the output results of the two models during the training process to correlate and correct the weights of the joint model, the joint model learns the ability to accurately predict human key point information of incomplete human images from end to end. Its processing is smoother and the obtained human key point information is more accurate and reliable.

[0121] Based on any embodiment of this application, referring to Figure 8, training the boundary evaluation model to convergence using the second sample image includes:

[0122] Step S2210: Obtain the second sample image in the training dataset and the cropping range information determined by comparing it with the original image. The original image contains a human body image in a complete form. The second sample image is obtained by cropping a portion of the original image to make the human body image in a non-complete form.

[0123] The training dataset can be a pre-prepared, dedicated dataset for the boundary evaluation model, containing a large amount of sample data sufficient to train the prediction network of the boundary evaluation model to convergence. Each sample mainly includes a second sample image and the corresponding cropping range information. The cropping range information records which edges of the original image were cropped to produce the second sample image, together with the cropping ratio on each such edge. Based on the correspondence with the four edges (up, down, left, right) of the original image, the cropping range information can be represented as a vector of the form [up, down, left, right].

[0124] The original image contains a complete human body image. Correspondingly, the second sample image obtained by cropping the original image contains a non-complete human body image. The human body images in the second sample image and in the original image are essentially the same human body image.
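As an illustration of the [up, down, left, right] vector, the sketch below derives it from the pixel region kept by a crop. It assumes each ratio is the removed length divided by the corresponding dimension of the original image; the application does not state the exact normalization, and the function name is hypothetical.

```python
import numpy as np


def crop_range_vector(orig_h: int, orig_w: int,
                      top: int, bottom: int, left: int, right: int) -> np.ndarray:
    """Return [up, down, left, right] ratios for the kept region rows top:bottom, cols left:right."""
    return np.array([
        top / orig_h,                  # fraction removed from the upper edge
        (orig_h - bottom) / orig_h,    # fraction removed from the lower edge
        left / orig_w,                 # fraction removed from the left edge
        (orig_w - right) / orig_w,     # fraction removed from the right edge
    ])
```

Under this assumption, keeping rows 0 to 150 and columns 10 to 100 of a 200 x 100 original removes a quarter of the height at the bottom and a tenth of the width on the left.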

[0125] Step S2220: Input the second sample image into the boundary evaluation model. After the prediction network extracts the image feature information of the second sample image, it predicts the cropping range information corresponding to the filling range required to accommodate the complete shape of the human body image.

[0126] Based on the function of the prediction network of the boundary evaluation model, when the second sample image is input as a training sample into the prediction network, the network extracts the image feature information from the second sample image through convolutional layers, and then maps it to the category corresponding to each edge through linear layers. The expansion ratio corresponding to each category is calculated, and each edge together with its expansion ratio constitutes the cropping range information predicted by the prediction network. It should be noted that the expansion ratio is in fact the cropping ratio of the second sample image relative to its original image. The expansion ratio and the cropping ratio are essentially the same quantity, and both can be represented as cropping range information.
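To show how a predicted ratio could be turned into a concrete filling margin, the sketch below converts the four ratios into pixel counts. It assumes each ratio is expressed as a fraction of the original (uncropped) dimension, so the uncropped height is the current height divided by (1 - up - down); this normalization and the function name are assumptions, not details given in this application.

```python
def expansion_pixels(h: int, w: int, up: float, down: float,
                     left: float, right: float) -> tuple:
    """Convert predicted ratios into pixel margins to fill on each edge.

    h, w: size of the incomplete (cropped) image.
    Returns (top, bottom, left, right) margins in pixels.
    """
    if up + down >= 1 or left + right >= 1:
        raise ValueError("ratios on opposite edges must sum to less than 1")
    full_h = h / (1.0 - up - down)       # estimated height of the uncropped image
    full_w = w / (1.0 - left - right)    # estimated width of the uncropped image
    return (round(up * full_h), round(down * full_h),
            round(left * full_w), round(right * full_w))
```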

[0127] Step S2230: Calculate the similarity loss value between the cropping range information corresponding to the second sample image and the cropping range information predicted by the prediction network, and perform gradient update on the boundary evaluation model based on the similarity loss value.

[0128] After the prediction network predicts the cropping range information for the second sample image, the cropping range information in the sample data containing the second sample image provided by the training dataset, that is, the cropping range information corresponding to the second sample image, can be used as supervision label information to calculate the loss value of the cropping range information predicted by the prediction network, that is, the similarity loss value.

[0129] In this embodiment, the mean squared error formula is used to calculate the similarity loss value between two cropping ranges, as shown in the following example:

[0130] $$\mathrm{MSE}(y, y') = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y'_i\right)^2$$

[0131] where:

[0132] $y$ represents the cropping range information corresponding to the random cropping of the second sample image relative to its original image; $y'$ represents the prediction result of the prediction network of the boundary evaluation model; $n$ is the number of output values of the prediction network; and $i$ indexes the $i$-th value predicted by the network. The average is taken as the similarity loss value, giving the final loss $\mathrm{loss} = \mathrm{MSE}(y, y')$.
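A minimal numerical illustration of the mean squared error between a labeled and a predicted cropping range vector is given here. The function name and the example values are illustrative only.

```python
import numpy as np


def mse_loss(y: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of the squared differences over the n output values."""
    return float(np.mean((y - y_pred) ** 2))


# Label and prediction in [up, down, left, right] order (made-up values).
y = np.array([0.0, 0.2, 0.1, 0.0])
y_pred = np.array([0.0, 0.1, 0.1, 0.2])
loss = mse_loss(y, y_pred)   # (0.1**2 + 0.2**2) / 4
```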

[0133] As can be seen from the above embodiments, when calculating the loss value of the output result of the prediction network of the boundary evaluation model alone, the mean squared error formula is used. Since the second sample image used as the training sample is itself obtained by cropping the original image, the cropping ratio between the two can be determined in advance. Therefore, its supervision label information is accurate and reliable. The loss value calculated in this way is used to update the gradient of the prediction network, which can train the prediction network of the boundary evaluation model to convergence more efficiently and the training cost is also low. With the help of accurate supervision label information, the prediction ability of the prediction network is also quite good.

[0134] Based on any embodiment of this application, referring to Figure 9, before obtaining the second sample image from the training dataset and the cropping range information determined by comparing it with its original image, the method includes:

[0135] Step S3100: Obtain a basic dataset containing multiple source images, each of which contains a human body image in a complete form;

[0136] Automatically generating the sample data of the training dataset required for training the boundary evaluation model can greatly reduce the training cost of the boundary evaluation model. To this end, a basic dataset can be obtained from which to prepare the training dataset.

[0137] The basic dataset can be collected manually or from publicly available data. It contains a large number of source images that serve as original images. Second sample images can be prepared based on the source images to train the prediction network of the boundary evaluation model.

[0138] All selected source images contain human figures, and these human figures are complete in shape, with no instances of human figures extending beyond the image boundaries.

[0139] Step S3200: Use a human key point detection model to detect human key point information in the source image and determine the outer bounding box surrounding all human key points;

[0140] To prepare effective second sample images, for each source image, its effective human image range can be determined first. Specifically, a human keypoint detection model that has been trained to convergence can be used, such as the human keypoint detection model trained in this application. The source image is first input into the human keypoint detection model to identify the human keypoint information. Then, based on all the human keypoint information in the source image, an outer bounding box is determined so that the outer bounding box completely includes all the human keypoints in the human image. In actual processing, the outermost human keypoints of the source image can be used as the boundary, and a corresponding rectangular outer bounding box can be determined based on the coordinate information of these human keypoints.
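The rectangular outer bounding box can be obtained directly from the extreme keypoint coordinates, as in the sketch below. The K x 2 array layout of (x, y) coordinates and the function name are assumptions for illustration.

```python
import numpy as np


def keypoint_bounding_box(keypoints: np.ndarray) -> tuple:
    """Return (x_min, y_min, x_max, y_max) enclosing all K x 2 keypoints."""
    x_min, y_min = keypoints.min(axis=0)   # leftmost and topmost keypoints
    x_max, y_max = keypoints.max(axis=0)   # rightmost and bottommost keypoints
    return float(x_min), float(y_min), float(x_max), float(y_max)
```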

[0141] Step S3300: Randomly generate a cropping ratio for at least one image edge and crop the source image according to the cropping ratio, so that part of the region covered by the outer bounding box is cropped off while the proportion of the cropped part to the total area covered by the outer bounding box does not exceed a preset ratio, thereby obtaining a second sample image.

[0142] To generalize the features of the second sample images across the entire training dataset, a degree of randomness can be introduced when generating second sample images from each source image. To this end, for each source image, an arbitrary number of target edges are randomly selected from its four sides, and a corresponding cropping ratio is generated for each target edge. Furthermore, each edge can have multiple random cropping ratios, allowing multiple cropping operations on the same side of the source image to obtain multiple second sample images from the same source image.

[0143] For the same source image, the number of edges to be cropped can vary, and the cropping ratio corresponding to each edge can also vary. It is easy to understand that, based on the same source image, multiple second sample images can be prepared. These second sample images are obtained by cropping different edges with different numbers of edges and different cropping ratios on the same source image, thus possessing rich and diverse features. When such sample images are used to train the prediction network of the boundary evaluation model, it can be ensured that the prediction network can obtain diverse features and effectively generalize features during the training process.

[0144] It is important to note that, to ensure the boundary evaluation model has sufficient semantic information from the original image to predict the cropping ratio (expansion ratio) for each edge during training, the degree of damage that a randomly generated cropping ratio would cause to the outer bounding box of all human keypoints in the source image can be determined before each cropping operation. If the damaged proportion of the outer bounding box exceeds a preset ratio, such as 30%, this random cropping ratio is not used. Alternatively, the range from which the random cropping ratio is generated can be constrained in advance so that such rejected cases do not occur.

[0145] With the cropping range controlled as described above, cropping the source image to generate the corresponding second sample image ensures that a portion of the outer bounding box of the human body key points in the source image is cropped off, and that the proportion of this cropped portion to the total area covered by the outer bounding box does not exceed the preset ratio.

[0146] As can be seen from the above processing, when preparing second sample images based on source images, the image information of some human body key points in the source images is intentionally destroyed, so that the human body image in the second sample image is incomplete compared with that in the source image. Based on this principle, a massive number of second sample images can be prepared from the limited number of source images in the aforementioned basic dataset, with the source images serving as the original images. Furthermore, this process can be automated and is highly efficient and low-cost.
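One way to realize the constrained random cropping is rejection sampling, sketched below. The function names, the 0.5 upper bound on each ratio, the retry count, and the use of rejection sampling itself are assumptions; only the 30% damage limit is taken from the example given in this application.

```python
import random


def box_damage(box, h, w, up, down, left, right):
    """Fraction of the box area removed by cropping the given edge ratios."""
    x0, y0, x1, y1 = box
    # Region kept after cropping, in pixel coordinates of the source image.
    kx0, ky0 = left * w, up * h
    kx1, ky1 = w * (1 - right), h * (1 - down)
    # Intersection of the kept region with the bounding box.
    iw = max(0.0, min(x1, kx1) - max(x0, kx0))
    ih = max(0.0, min(y1, ky1) - max(y0, ky0))
    area = (x1 - x0) * (y1 - y0)
    return 1.0 - (iw * ih) / area


def sample_crop_ratios(box, h, w, max_damage=0.3, max_ratio=0.5, rng=None, tries=1000):
    """Rejection-sample [up, down, left, right] ratios that cut into the box
    without removing more than max_damage of its area."""
    rng = rng or random.Random()
    for _ in range(tries):
        edges = rng.sample(range(4), rng.randint(1, 4))   # random subset of the four edges
        ratios = [0.0] * 4
        for e in edges:
            ratios[e] = rng.uniform(0.0, max_ratio)
        damage = box_damage(box, h, w, *ratios)
        if 0.0 < damage <= max_damage:                    # must cut the box, but not too much
            return ratios
    raise RuntimeError("no valid cropping ratios found")
```

Calling `sample_crop_ratios` repeatedly on the same source image yields different edge subsets and ratios, which is how several second sample images can be derived from one source image.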

[0147] Step S3400: The cropping ratio is represented as the cropping range information of the second sample image, and the cropping range information and the second sample image are used to form a mapping relationship data, which is stored in the training dataset.

[0148] For each second sample image, although the cropping ratio of each side is randomly generated, it is fixed once applied. Therefore, each side and its corresponding cropping ratio can be represented as cropping range information, such as the corresponding vector form mentioned above, so that the model can directly call it for calculation.

[0149] After determining the cropping range information of each second sample image, the second sample image can be used as a training sample, and its cropping range information can be used as supervision label information. The two are constructed into mapping relationship data and stored in the training dataset of this application. This training dataset can then be used to train the prediction network of the boundary evaluation model.
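The construction of one item of mapping relationship data can be sketched as follows. The dictionary keys, the function name, and the convention that ratios are fractions of the source image dimensions are assumptions made for illustration.

```python
import numpy as np


def make_training_pair(source: np.ndarray, ratios):
    """Crop the source image by [up, down, left, right] ratios and pair it with its label."""
    h, w = source.shape[:2]
    up, down, left, right = ratios
    # Convert ratios to the pixel bounds of the region that is kept.
    top, bottom = int(round(up * h)), h - int(round(down * h))
    lft, rgt = int(round(left * w)), w - int(round(right * w))
    sample = source[top:bottom, lft:rgt].copy()    # the second sample image
    return {"image": sample, "crop_range": np.asarray(ratios, dtype=np.float32)}
```

A list of such dictionaries, one per generated second sample image, would then serve as the training dataset for the prediction network.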

[0150] As can be seen from the above embodiments, this application, based on a limited number of pre-collected source images in a basic dataset, can crop each source image according to randomly generated cropping range information, automatically generating a massive number of second sample images. Then, the second sample images and their corresponding cropping range information are used to construct the sample data required for the training dataset. This method has high preparation efficiency and low preparation cost, and can ensure generalized sample features. It also ensures that the boundary evaluation model can be quickly trained to convergence on a training dataset with diverse features and obtains stronger feature representation capabilities.

[0151] Based on any embodiment of this application, referring to Figure 10, after determining the human key point information of the corrected image using a human key point detection model, the method includes:

[0152] Step S4100: Apply the human body key point information to the preset three-dimensional model of the digital human, and control the digital human to switch to the corresponding activity posture;

[0153] In the application scenario of live streaming corresponding to this application, the human body activities provided in the video stream of the broadcaster can be converted into the human body activities of a digital human. The basis for this is the human body key point information in the image frames within the broadcaster's video stream. As mentioned above, the human body key point information in the image frames can be determined according to the various embodiments of this application above.

[0154] After obtaining the human body key point information of the human body image within the video stream, it can be applied to the preset digital human 3D model. The distribution of each skeletal key point in the human body key point information is used as the distribution of each skeletal key point in the digital human 3D model, thereby controlling the 3D model to switch to the corresponding activity posture.

[0155] Step S4200: Render the 3D model to obtain the pose image of the digital human;

[0156] After the 3D model of the digital human is controlled by the human body key point information of an image frame to switch to a corresponding active pose, the 3D model can be image rendered based on the active pose to generate the corresponding pose image of the digital human in that active pose.

[0157] It is easy to understand that for each image frame containing a human image in a video stream, a corresponding pose image can be generated. These pose images are organized in order according to their corresponding time sequence to form the corresponding animation stream of the digital human. The animation stream contains pose images corresponding to each moment.

[0158] Step S4300: Push the pose image to the live webcast room for display.

[0159] Once the pose image of the digital human or the animation stream containing the pose image is determined, the pose image or animation stream can be pushed to the live broadcast room of the anchor user. After the terminal device of the audience user in the live broadcast room receives the pose image or animation stream, it loads and displays it accordingly, so that the audience user can get the display effect of the pose image or animation stream of the digital human, thereby realizing the virtual live broadcast service of digital human.

[0160] It should be noted that the operation of extracting human key point information from the video stream of the broadcaster and generating corresponding pose images or animation streams can be implemented at any transmission node of the live video stream, such as the broadcaster's terminal device, media server, or viewer's terminal device.

[0161] As can be seen from the above embodiments, this application can accurately extract human key point information from human images with partial information loss, which can provide reliable basic data for downstream services. For example, in the digital human virtual live streaming service exemplified in this embodiment, the live performance of the anchor user can be transferred to the digital human based on the human key point information. Therefore, it has considerable practical significance and economic benefits.

[0162] As can be seen from the above description of various embodiments of the human key point detection method of this application, when the human images processed in the above embodiments are derived from live images, this actually constitutes a disclosure of various embodiments of the human key point detection method for live images of this application. The human key point detection method for live images of this application can be specifically used in online live streaming scenarios to provide corresponding services for live images in live video streams.

[0163] Referring to Figure 11, a human body key point detection device provided to meet one of the purposes of this application includes an image acquisition module 1100, a range determination module 1200, a filling processing module 1300, and a key point determination module 1400. The image acquisition module 1100 is configured to acquire an original image containing a human body image, wherein the human body image is incomplete because local areas extend beyond the boundaries of the original image. The range determination module 1200 is configured to use a prediction network of a boundary evaluation model to determine the cropping range information of the filling range required to accommodate the complete shape of the human body image. The filling processing module 1300 is configured to use the filling network of the boundary evaluation model to fill the corresponding areas of the original image with image information according to the cropping range information to obtain a corrected image. The key point determination module 1400 is configured to use a human body key point detection model to determine the human body key point information of the corrected image.
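As an illustration only, the cooperation of the four modules can be sketched as follows. The sketch assumes that the cropping range information is a per-edge ratio relative to the full, uncropped image; this is consistent with the cropping ratios used when constructing training samples, but the exact encoding is an assumption. The prediction network, the filling network, and the key point detection model are treated as caller-supplied functions. The names CropRange, padding_from_range, expand_canvas, and detect_keypoints_on_incomplete are hypothetical and do not appear in this application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]            # toy grayscale image: rows of pixel values
Pads = Tuple[int, int, int, int]   # pixels to add at (top, bottom, left, right)


@dataclass
class CropRange:
    """Assumed encoding: fraction of the full image cut off at each edge."""
    top: float = 0.0
    bottom: float = 0.0
    left: float = 0.0
    right: float = 0.0


def padding_from_range(height: int, width: int, r: CropRange) -> Pads:
    """Convert per-edge crop ratios into pixel padding for the cropped image."""
    keep_h = 1.0 - r.top - r.bottom
    keep_w = 1.0 - r.left - r.right
    if keep_h <= 0 or keep_w <= 0:
        raise ValueError("crop ratios must leave part of the image")
    full_h = height / keep_h   # estimated height of the uncropped image
    full_w = width / keep_w    # estimated width of the uncropped image
    return (round(r.top * full_h), round(r.bottom * full_h),
            round(r.left * full_w), round(r.right * full_w))


def expand_canvas(image: Image, pads: Pads, fill: int = 0) -> Image:
    """Surround the image with blank pixels that the filling network will paint."""
    top, bottom, left, right = pads
    new_width = len(image[0]) + left + right
    rows = [[fill] * new_width for _ in range(top)]
    rows += [[fill] * left + list(row) + [fill] * right for row in image]
    rows += [[fill] * new_width for _ in range(bottom)]
    return rows


def detect_keypoints_on_incomplete(
    image: Image,
    predict_range: Callable[[Image], CropRange],     # stands in for the prediction network
    fill_network: Callable[[Image, Pads], Image],    # stands in for the filling network
    detect_keypoints: Callable[[Image], list],       # stands in for the key point model
) -> list:
    """Range determination -> filling -> key point determination."""
    crop_range = predict_range(image)
    pads = padding_from_range(len(image), len(image[0]), crop_range)
    canvas = expand_canvas(image, pads)
    corrected = fill_network(canvas, pads)
    return detect_keypoints(corrected)
```

In this sketch the blank canvas only reserves space; generating plausible image content for the added region remains the job of the filling network.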

[0164] Based on any embodiment of this application, the image acquisition module 1100 includes: a video extraction unit, configured to acquire image frames from a video stream generated by a live web broadcast; a face detection unit, configured to detect whether face information exists in the image frame, and when face information exists, to use the image frame as a target image; and an image filtering unit, configured to acquire a human body image from the target image, determine whether the human body image is in a complete form, and when it is incomplete, to use the target image as the original image.
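The frame selection performed by these three units can be sketched as a simple filter. This is a minimal sketch, assuming that the face detector and the completeness check are available as caller-supplied predicates; the function name select_original_images and both predicate names are hypothetical. How frames with a complete human body image are handled is outside the sketch.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def select_original_images(
    frames: Iterable[Frame],
    has_face: Callable[[Frame], bool],          # hypothetical face detection predicate
    is_body_complete: Callable[[Frame], bool],  # hypothetical completeness check
) -> Iterator[Frame]:
    """Yield only frames that contain a face and an incomplete human body image."""
    for frame in frames:
        if not has_face(frame):
            continue  # no face information: the frame is not a target image
        if is_body_complete(frame):
            continue  # complete body: no correction needed, not an "original image"
        yield frame
```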

[0165] Based on any embodiment of this application, the human keypoint detection device of this application includes: a first training module, configured to train the human keypoint detection model to convergence using a first sample image, wherein during the training process, human keypoint information corresponding to the first sample image is used to supervise the human keypoint information predicted by the human keypoint detection model, wherein the first sample image contains a human image and the human image is in a complete form; a second training module, configured to train the boundary evaluation model to convergence using a second sample image, wherein during the training process, cropping range information determined by comparing the second sample image with its original image is used to supervise the cropping range information predicted by the boundary evaluation model, wherein the second sample image and its original image both contain the same human image, the human image in the second sample image is incomplete, while the human image in its original image is complete.

[0166] Based on any embodiment of this application, the human keypoint detection device of this application includes: a joint construction module, configured to connect the boundary evaluation model to the human keypoint detection model to form a joint model, and to train the joint model by inputting a second sample image; a first loss calculation module, configured to calculate the similarity loss value of the human keypoint information predicted by the human keypoint detection model in the joint model according to the human keypoint information corresponding to the second sample image; a second loss calculation module, configured to calculate the similarity loss value of the cropping range information predicted by the prediction network of the boundary evaluation model in the joint model according to the cropping range information of the second sample image compared with its original image; and a comprehensive loss calculation module, configured to aggregate the individual similarity loss values into the overall loss value of the joint model, and to perform a gradient update on the joint model according to the overall loss value.
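The aggregation of the two similarity loss values can be illustrated with a short sketch. It assumes mean squared error as the similarity loss and a weighted sum as the aggregation; neither choice is stated in this paragraph, and the weights w_keypoints and w_range as well as the function names are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as a stand-in similarity loss (assumption)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred_keypoints: Sequence[float],
    true_keypoints: Sequence[float],
    pred_range: Sequence[float],
    true_range: Sequence[float],
    w_keypoints: float = 1.0,   # hypothetical weight of the key point loss
    w_range: float = 1.0,       # hypothetical weight of the cropping range loss
) -> float:
    """Aggregate both similarity losses into one overall loss for the joint model."""
    keypoint_loss = mse(pred_keypoints, true_keypoints)
    range_loss = mse(pred_range, true_range)
    return w_keypoints * keypoint_loss + w_range * range_loss
```

With equal weights the overall loss is the plain sum of the two terms; in a real training loop the gradient of this value would be propagated through both models of the joint model.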

[0167] Based on any embodiment of this application, the second training module includes: a sample retrieval unit, configured to acquire a second sample image from the training dataset and cropping range information determined by comparison with its original image, wherein the original image contains a human body image in a complete form, and the second sample image is obtained by partially cropping the original image to give the human body image a non-complete form; a range prediction unit, configured to input the second sample image into the boundary evaluation model, wherein the prediction network extracts image feature information from the second sample image and predicts the cropping range information corresponding to the filling range required to accommodate the complete form of the human body image; and a gradient update unit, configured to calculate the similarity loss value between the cropping range information corresponding to the second sample image and the cropping range information predicted by the prediction network, and perform gradient update on the boundary evaluation model based on the similarity loss value.

[0168] Based on any embodiment of this application, the human body key point detection device of this application includes: a material acquisition module, configured to acquire a basic dataset containing multiple material images, wherein the material images contain human body images, and the human body images are in a complete form; an outer frame detection module, configured to use a human body key point detection model to detect human body key point information in the material images and determine an outer frame surrounding all human body key points; a cropping module, configured to randomly generate a cropping ratio corresponding to at least one image edge, crop the material images according to the cropping ratio so that part of the area covered by the outer frame is cropped away, and control the proportion of the cropped portion within the outer frame to the total area covered by the outer frame so that it does not exceed a preset ratio, thereby obtaining a second sample image; and a sample construction module, configured to represent the cropping ratio as cropping range information of the second sample image, construct mapping relationship data between the cropping range information and the second sample image, and store it in a training dataset.
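The sample construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: key points are (x, y) pixel coordinates, the cropping ratio of each edge is a fraction of the image width or height, and candidate ratios are redrawn until the cropped share of the outer frame is non-zero and within the preset ratio. The function names, the retry strategy, and the defaults max_fraction, max_ratio, and attempts are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]        # (x_min, y_min, x_max, y_max)
Ratios = Tuple[float, float, float, float]     # crop ratio at (top, bottom, left, right)


def outer_frame(keypoints: Sequence[Point]) -> Box:
    """Smallest axis-aligned box surrounding all key points."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))


def cropped_fraction(box: Box, width: float, height: float, ratios: Ratios) -> float:
    """Share of the outer frame's area that is removed by the crop."""
    top, bottom, left, right = ratios
    # Region of the material image that survives the crop.
    kx0, ky0 = left * width, top * height
    kx1, ky1 = width * (1.0 - right), height * (1.0 - bottom)
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        raise ValueError("outer frame must have positive area")
    kept_w = max(0.0, min(x1, kx1) - max(x0, kx0))
    kept_h = max(0.0, min(y1, ky1) - max(y0, ky0))
    return 1.0 - (kept_w * kept_h) / area


def sample_crop(box: Box, width: float, height: float,
                max_fraction: float = 0.5, max_ratio: float = 0.5,
                rng: Optional[random.Random] = None,
                attempts: int = 100) -> Optional[Ratios]:
    """Draw crop ratios for one or two edges until the preset limit is met."""
    rng = rng or random.Random()
    for _ in range(attempts):
        ratios = [0.0, 0.0, 0.0, 0.0]
        for edge in rng.sample(range(4), rng.randint(1, 2)):
            ratios[edge] = rng.uniform(0.0, max_ratio)
        if ratios[0] + ratios[1] >= 1.0 or ratios[2] + ratios[3] >= 1.0:
            continue  # nothing would be left of the image
        fraction = cropped_fraction(box, width, height, tuple(ratios))
        if 0.0 < fraction <= max_fraction:
            return tuple(ratios)  # stored as the sample's cropping range information
    return None  # no acceptable crop found within the attempt budget
```

The returned ratios play the role of the cropping range information; pairing them with the cropped image gives one entry of the mapping relationship data stored in the training dataset.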

[0169] Based on any embodiment of this application, the human body key point detection device of this application includes: an information application module, configured to apply the human body key point information to a preset three-dimensional model of a digital human and control the digital human to switch to a corresponding activity posture; a rendering processing module, configured to render the three-dimensional model to obtain a posture image of the digital human; and an image push module, configured to push the posture image to a live broadcast room for display.

[0170] To address the aforementioned technical problems, embodiments of this application also provide a computer device. Figure 12 shows the internal structure of the computer device. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected via a system bus. The computer-readable storage medium stores an operating system, a database, and computer-readable instructions. The database may store a sequence of control information. When the computer-readable instructions are executed by the processor, the processor can implement a method for detecting human key points. The processor of the computer device provides computing and control capabilities to support the operation of the entire computer device. The memory of the computer device may store computer-readable instructions. When the computer-readable instructions are executed by the processor, the processor can execute the human key point detection method of this application. The network interface of the computer device is used for communication with a terminal. Those skilled in the art will understand that the structure shown in Figure 12 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0171] In this embodiment, the processor is used to execute the specific functions of each module and its sub-modules shown in Figure 11, and the memory stores the program code and various data required to execute these modules or sub-modules. The network interface is used for data transmission between the user terminal and the server. In this embodiment, the memory stores the program code and data required to execute all modules and sub-modules of the human body key point detection device of this application, and the server can call this program code and data to execute the functions of all the sub-modules.

[0172] This application also provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the human key point detection method of any embodiment of this application.

[0173] This application also provides a computer program product, including a computer program / instructions that, when executed by one or more processors, implement the steps of the method described in any embodiment of this application.

[0174] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. The aforementioned storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.

[0175] In summary, this application can accurately extract all human key point information corresponding to the human body image in the original image based on the corrected image, which can provide accurate and effective basic data for various downstream services that rely on human key point information, and ensure that scenarios such as live streaming based on digital humans can obtain a good interactive experience.

Claims

1. A method for detecting key points on the human body, characterized in that it includes the following steps: training a human key point detection model to convergence using a first sample image, wherein during the training process the human key point information corresponding to the first sample image is used to supervise the human key point information predicted by the human key point detection model, the first sample image contains a human image, and the human image is in a complete form; training a boundary evaluation model to convergence using a second sample image, wherein during the training process the cropping range information determined by comparing the second sample image with its original image is used to supervise the cropping range information predicted by the boundary evaluation model, the second sample image and its original image both contain the same human body image, the human body image in the second sample image is incomplete, and the human body image in its original image is complete; obtaining an original image containing a human body image, wherein the human body image is incomplete because some parts extend beyond the boundaries of the original image; using the prediction network of the boundary evaluation model to determine the cropping range information of the filling range required to accommodate the complete shape of the human body image; using the filling network of the boundary evaluation model to fill in image information for the corresponding region of the original image according to the cropping range information, thereby obtaining a corrected image; and using the human key point detection model to determine the human key point information of the corrected image.

2. The method for detecting key human body points according to claim 1, characterized in that obtaining the original image containing the human body image includes: acquiring image frames from a video stream generated by a live web broadcast; detecting whether facial information is present in an image frame and, if facial information is present, using the image frame as a target image; and obtaining the human body image from the target image, determining whether the human body image is in a complete form, and, if it is incomplete, using the target image as the original image.

3. The method for detecting key human body points according to claim 1, characterized in that, after training the boundary evaluation model to convergence using the second sample image, the method includes: connecting the human key point detection model after the boundary evaluation model to form a joint model, and training the joint model by inputting the second sample image; calculating the similarity loss value of the human key point information predicted by the human key point detection model in the joint model based on the human key point information corresponding to the second sample image; calculating the similarity loss value of the cropping range information predicted by the prediction network of the boundary evaluation model in the joint model based on the cropping range information of the second sample image compared with its original image; and aggregating the individual similarity loss values into the overall loss value of the joint model, and performing a gradient update on the joint model based on the overall loss value.

4. The method for detecting key human body points according to claim 1, characterized in that training the boundary evaluation model to convergence using the second sample image includes: obtaining, from the training dataset, the second sample image and the cropping range information determined by comparing it with its original image, wherein the original image contains a human body image in a complete form and the second sample image is obtained by cropping part of the original image so that the human body image has an incomplete form; inputting the second sample image into the boundary evaluation model, whose prediction network extracts image feature information from the second sample image and predicts the cropping range information corresponding to the filling range required to accommodate the complete shape of the human body image; and calculating the similarity loss value between the cropping range information corresponding to the second sample image and the cropping range information predicted by the prediction network, and performing a gradient update on the boundary evaluation model based on the similarity loss value.

5. The method for detecting key human body points according to claim 4, characterized in that, before obtaining the second sample image and its corresponding second supervision image from the training dataset, the method includes: obtaining a basic dataset containing multiple material images, wherein the material images contain human body images in a complete form; using the human key point detection model to detect the human key point information in a material image and determining the outer frame surrounding all human key points; randomly generating a cropping ratio for at least one image edge, and cropping the material image according to the cropping ratio so that part of the area covered by the outer frame is cropped away, while controlling the proportion of the cropped portion within the outer frame to the total area covered by the outer frame so that it does not exceed a preset ratio, thereby obtaining a second sample image; and representing the cropping ratio as the cropping range information of the second sample image, constructing mapping relationship data between the cropping range information and the second sample image, and storing it in the training dataset.

6. The method for detecting key human body points according to any one of claims 1 to 5, characterized in that, after determining the human key point information of the corrected image using the human key point detection model, the method includes: applying the human key point information to a preset three-dimensional model of a digital human and controlling the digital human to switch to the corresponding activity posture; rendering the three-dimensional model to obtain a posture image of the digital human; and pushing the posture image to a live broadcast room for display.

7. A method for detecting human key points in live-streamed images, characterized in that it includes the steps of the human body key point detection method as described in any one of claims 1 to 6.

8. A human body key point detection device, characterized in that it includes: a first training module, configured to train a human key point detection model to convergence using a first sample image, wherein during the training process the human key point information corresponding to the first sample image is used to supervise the human key point information predicted by the human key point detection model, the first sample image contains a human image, and the human image is in a complete form; a second training module, configured to train a boundary evaluation model to convergence using a second sample image, wherein during the training process the cropping range information determined by comparing the second sample image with its original image is used to supervise the cropping range information predicted by the boundary evaluation model, the second sample image and its original image both contain the same human body image, the human body image in the second sample image is incomplete, and the human body image in its original image is complete; an image acquisition module, configured to acquire an original image containing a human body image, wherein the human body image is incomplete because some parts of it extend beyond the boundaries of the original image; a range determination module, configured to use the prediction network of the boundary evaluation model to determine the cropping range information of the filling range required to accommodate the complete shape of the human body image; a filling processing module, configured to use the filling network of the boundary evaluation model to fill the corresponding region of the original image with image information according to the cropping range information, so as to obtain a corrected image; and a key point determination module, configured to use the human key point detection model to determine the human key point information of the corrected image.

9. A computer device comprising a central processing unit and a memory, characterized in that the central processing unit is used to invoke and run a computer program stored in the memory to perform the steps of the method as described in any one of claims 1 to 6.

10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 6, which, when invoked by a computer, executes the steps included in the corresponding method.