Method and apparatus for matching human body in video stream

By dividing the image into partitions in the video stream and using feature extraction and classification models to calculate quality scores, the problem of low accuracy in human sequence matching in video streams is solved, and higher accuracy human sequence matching is achieved.

CN115187912B · Active · Publication Date: 2026-06-19 · BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD
Filing Date
2022-07-26
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The existing technology suffers from low accuracy in matching human body sequences in video streams.

Method used

By acquiring multiple video streams, human detection and tracking algorithms are used to process the video streams, human sequences are divided into image partitions, features are extracted using a human feature extraction model, category centers and quality scores are determined based on a classification model, target features are calculated, and human sequence matching is performed.

Benefits of technology

It improves the matching accuracy of human body sequences in video streams.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure relates to the field of video processing technology, and provides a method and apparatus for human body matching in video streams. The method includes: extracting a first feature corresponding to each human body image using a human body feature extraction model; determining a first category center corresponding to each human body image using a classification model based on the first feature; calculating a first quality score corresponding to each human body image using a quality score calculation model based on the first category center; determining a second feature, a second category center, and a second quality score corresponding to each image partition based on the first feature, first category center, and first quality score of each human body image in each image partition; determining a target feature corresponding to each human body sequence based on the second feature, second category center, and second quality score of each image partition in each human body sequence; and matching multiple human body sequences based on the target feature corresponding to each human body sequence.

Description

Technical Field

[0001] This disclosure relates to the field of video processing technology, and in particular to a method and apparatus for human body matching in a video stream.

Background Technology

[0002] In the field of surveillance, different video streams are acquired from different cameras. Human body sequences captured from these different video streams are matched to obtain the target's trajectory under different cameras. A single human body sequence contains multiple human images. Currently, a common approach is to extract general human features from the images in a sequence and then match sequences based on these features. However, human images generally exhibit three main poses: frontal, side, and back. The features of these three poses differ significantly. Because current technology does not consider the characteristics of human bodies in different poses and only uses general human features, the accuracy of human body sequence matching is low.

[0003] In realizing the present invention, the inventors discovered at least the following technical problem in the related technology: low accuracy of human body sequence matching in video streams.

Summary of the Invention

[0004] In view of this, embodiments of the present disclosure provide a method, apparatus, electronic device, and computer-readable storage medium for human body matching in video streams, in order to solve the problem of low accuracy of human body sequence matching in video streams in the prior art.

[0005] A first aspect of this disclosure provides a method for human body matching in video streams, comprising: acquiring multiple video streams and sequentially processing the multiple video streams using a human detection algorithm and a human tracking algorithm to obtain multiple human body sequences, wherein each human body sequence includes multiple human body images; dividing each human body sequence into a preset number of image partitions, wherein each image partition includes multiple human body images; extracting a first feature corresponding to each human body image using a human feature extraction model; determining a first category center corresponding to each human body image using a classification model based on the first feature corresponding to each human body image; calculating a first quality score corresponding to each human body image using a quality score calculation model based on the first category center corresponding to each human body image; determining a second feature, a second category center, and a second quality score corresponding to each image partition based on the first feature, the first category center, and the first quality score corresponding to each human body image in each image partition; determining a target feature corresponding to each human body sequence based on the second feature, the second category center, and the second quality score corresponding to each image partition in each human body sequence; and matching the multiple human body sequences based on the target feature corresponding to each human body sequence.

[0006] A second aspect of this disclosure provides a human body matching device in a video stream, comprising: an acquisition module configured to acquire multiple video streams and process the multiple video streams sequentially using a human detection algorithm and a human tracking algorithm to obtain multiple human body sequences, wherein each human body sequence includes multiple human body images; a segmentation module configured to divide each human body sequence into a preset number of image partitions, wherein each image partition includes multiple human body images; an extraction module configured to extract a first feature corresponding to each human body image using a human body feature extraction model; a first determination module configured to determine a first category center corresponding to each human body image using a classification model based on the first feature corresponding to each human body image; a calculation module configured to calculate a first quality score corresponding to each human body image using a quality score calculation model based on the first category center corresponding to each human body image; a second determination module configured to determine a second feature, a second category center, and a second quality score corresponding to each image partition based on the first feature, the first category center, and the first quality score corresponding to each human body image in each image partition; a third determination module configured to determine a target feature corresponding to each human body sequence based on the second feature, the second category center, and the second quality score corresponding to each image partition in each human body sequence; and a matching module configured to match the multiple human body sequences based on the target feature corresponding to each human body sequence.

[0007] A third aspect of this disclosure provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.

[0008] A fourth aspect of this disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.

[0009] The beneficial effects of the embodiments of this disclosure compared with the prior art are as follows: multiple video streams are acquired and processed sequentially using a human detection algorithm and a human tracking algorithm to obtain multiple human body sequences, where each human body sequence includes multiple human body images; each human body sequence is divided into a preset number of image partitions, where each image partition includes multiple human body images; a first feature corresponding to each human body image is extracted using a human body feature extraction model; based on the first feature corresponding to each human body image, a first category center corresponding to each human body image is determined using a classification model; based on the first category center corresponding to each human body image, a first quality score corresponding to each human body image is calculated using a quality score calculation model; based on the first feature, first category center, and first quality score corresponding to each human body image in each image partition, the second feature, second category center, and second quality score corresponding to each image partition are determined; based on the second feature, second category center, and second quality score corresponding to each image partition in each human body sequence, the target feature corresponding to each human body sequence is determined; and based on the target feature corresponding to each human body sequence, the multiple human body sequences are matched. Therefore, by adopting the above technical means, the problem of low matching accuracy of human body sequences in video streams in the existing technology can be solved, thereby improving the accuracy of matching human body sequences.

Description of the Drawings

[0010] To more clearly illustrate the technical solutions in the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0011] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure;

[0012] Figure 2 is a flowchart illustrating a human body matching method in a video stream provided in an embodiment of this disclosure;

[0013] Figure 3 is a schematic diagram of the structure of a human body matching device in a video stream provided in an embodiment of this disclosure;

[0014] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.

Detailed Implementation

[0015] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, so as to provide a thorough understanding of the embodiments of this disclosure. However, those skilled in the art will understand that this disclosure may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this disclosure with unnecessary detail.

[0016] A method and apparatus for human body matching in a video stream according to an embodiment of the present disclosure will now be described in detail with reference to the accompanying drawings.

[0017] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.

[0018] Terminal devices 101, 102, and 103 can be hardware or software. When terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with displays that support communication with server 104, including but not limited to smartphones, tablets, laptops, and desktop computers. When terminal devices 101, 102, and 103 are software, they can be installed in the aforementioned electronic devices. Terminal devices 101, 102, and 103 can be implemented as multiple software programs or software modules, or as a single software program or software module; this disclosure does not impose any limitations on this. Furthermore, various applications can be installed on terminal devices 101, 102, and 103, such as data processing applications, instant messaging tools, social platform software, search applications, shopping applications, etc.

[0019] Server 104 can be a server that provides various services, such as a backend server that receives requests sent by terminal devices with which it has established communication connections. This backend server can receive and analyze the requests sent by the terminal devices and generate processing results. Server 104 can be a single server, a server cluster consisting of several servers, or a cloud computing service center. This embodiment of the disclosure does not impose any limitations on these aspects.

[0020] It should be noted that server 104 can be either hardware or software. When server 104 is hardware, it can be various electronic devices that provide various services to terminal devices 101, 102, and 103. When server 104 is software, it can be multiple software programs or software modules that provide various services to terminal devices 101, 102, and 103, or it can be a single software program or software module that provides various services to terminal devices 101, 102, and 103. The embodiments of this disclosure do not impose any limitations on this.

[0021] Network 105 can be a wired network connected by coaxial cable, twisted pair, or optical fiber, or it can be a wireless network that interconnects various communication devices without wiring, such as Bluetooth, Near Field Communication (NFC), or infrared. The embodiments of this disclosure do not impose any limitations on this.

[0022] Users can establish a communication connection with server 104 via network 105 through terminal devices 101, 102, and 103 to receive or send information, etc. It should be noted that the specific types, quantities, and combinations of terminal devices 101, 102, and 103, server 104, and network 105 can be adjusted according to the actual needs of the application scenario, and this disclosure embodiment does not impose any limitations on this.

[0023] Figure 2 is a flowchart illustrating a human body matching method in a video stream provided in an embodiment of this disclosure. The method of Figure 2 may be executed by the computer or server of Figure 1, or by software on that computer or server. As shown in Figure 2, the human body matching method in a video stream includes:

[0024] S201: Acquire multiple video streams and process them sequentially using human detection and human tracking algorithms to obtain multiple human sequences, where each human sequence includes multiple human images;

[0025] The human detection algorithm can be YOLOX, and the human tracking algorithm can be ByteTrack. A video stream corresponds to one or more human sequences, which the human detection and human tracking algorithms find in that stream. A human sequence can be understood as a human trajectory, that is, multiple human images of the same tracked person.
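As a minimal sketch of how tracker output becomes human sequences, the snippet below groups per-frame records by track ID. The record layout `(frame_index, track_id, crop)` and the function name are assumptions made for illustration; they are not the YOLOX or ByteTrack API, and the crops are stand-in strings.

```python
from collections import defaultdict


def group_into_sequences(tracked_detections):
    """Group (frame_index, track_id, crop) records into per-track sequences.

    `tracked_detections` is a hypothetical flat list, as might be produced by
    running a detector and then a tracker over one video stream.
    """
    by_track = defaultdict(list)
    for frame_index, track_id, crop in tracked_detections:
        by_track[track_id].append((frame_index, crop))
    # Keep each sequence in temporal order and drop the frame index.
    return {
        track_id: [crop for _, crop in sorted(items, key=lambda item: item[0])]
        for track_id, items in by_track.items()
    }


# Stand-in data: two tracked people (IDs 7 and 9) in one stream.
detections = [(0, 7, "a0"), (0, 9, "b0"), (2, 7, "a2"), (1, 7, "a1")]
sequences = group_into_sequences(detections)
```

Each dictionary value plays the role of one human sequence, so a single stream can yield several sequences.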

[0026] S202, each human body sequence is divided into a preset number of image partitions, wherein each image partition includes multiple human body images;

[0027] An image partition is a small segment of a human body sequence.
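The text does not say how the partitions are cut. The following sketch assumes contiguous, near-equal segments in temporal order; the function name and the way the remainder is distributed are illustrative choices only.

```python
def split_into_partitions(sequence, num_partitions):
    """Split a sequence into `num_partitions` contiguous, near-equal segments."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    base, extra = divmod(len(sequence), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        # The first `extra` partitions take one additional image each.
        size = base + (1 if index < extra else 0)
        partitions.append(sequence[start:start + size])
        start += size
    return partitions


# A 10-image sequence split into a preset number of 4 partitions.
parts = split_into_partitions(list(range(10)), 4)
```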

[0028] S203, use the human body feature extraction model to extract the first feature corresponding to each human body image;

[0029] The human feature extraction model can be a common neural network model, such as ResNet. The human feature extraction model is used to extract image features of human images. In order to improve the accuracy of the human feature extraction model, this disclosure performs two training sessions on the human feature extraction model, namely the first training and the second training.

[0030] S204, Based on the first feature corresponding to each human body image, use a classification model to determine the first category center corresponding to each human body image;

[0031] The first, second, and third category centers have the same meaning, differing only in the objects they correspond to. A single target (a person) corresponds to multiple video streams, multiple human body sequences, multiple image partitions, and multiple human body images. Human body images of a target are generally categorized into three poses: front, side, and back (a single human body image represents one pose). The features of these three poses differ significantly, corresponding to the frontal, side, and back center of the human body, respectively. The features of the three poses are fused to obtain the fused feature, corresponding to the human body fusion center. Therefore, each target corresponds to four category centers: the frontal center, the side center, the back center, and the fused center. Each human body image corresponds to a category center that is either the frontal center, the side center, or the back center.

[0032] The classification model, which acts as a pose classifier, consists of two fully connected layers: the first layer has dimensions (512, 256), and the second layer has dimensions (256, 3). Its input is the 512-dimensional feature of a human image, and its output is the probability of the human image belonging to each of three categories (frontal center, side center, or back center). The category of the human image is the category center with the highest probability value. A category center can be understood as a vector.
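The two-layer classifier can be sketched in plain Python as follows. The layer sizes (512, 256) and (256, 3) come from the text; the ReLU between the layers, the softmax on the output, the random weights, and all names are assumptions, since the text specifies none of them.

```python
import math
import random

POSES = ("front", "side", "back")


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pose(feature, layer1, layer2):
    """Return (pose with the highest probability, the three probabilities)."""
    hidden = [max(0.0, v) for v in linear(feature, layer1)]  # ReLU: assumption
    probs = softmax(linear(hidden, layer2))
    return POSES[probs.index(max(probs))], probs


rng = random.Random(0)
layer1 = make_layer(512, 256, rng)   # (512, 256) as stated in the text
layer2 = make_layer(256, 3, rng)     # (256, 3) as stated in the text
feature = [rng.gauss(0.0, 1.0) for _ in range(512)]
pose, probs = classify_pose(feature, layer1, layer2)
```

With untrained weights the predicted pose is arbitrary; the point is only the shape of the computation, from a 512-dimensional feature to three pose probabilities.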

[0033] S205, Based on the first category center corresponding to each human body image, calculate the first quality score corresponding to each human body image using the quality score calculation model;

[0034] When calculating the first quality score of a human image, it is based on the first category center corresponding to the human image and the human fusion center of the target corresponding to the human image.

[0035] S206, based on the first feature, first category center and first quality score corresponding to each human body image in each image partition, determine the second feature, second category center and second quality score corresponding to each image partition;

[0036] S207, Based on the second feature, second category center and second quality score corresponding to each image partition in each human body sequence, determine the target feature corresponding to each human body sequence;

[0037] S208: Match multiple human sequences based on the target features corresponding to each human sequence.

[0038] According to the technical solution provided in this disclosure, multiple video streams are acquired, and human detection and human tracking algorithms are sequentially used to process the multiple video streams to obtain multiple human sequences, wherein each human sequence includes multiple human images; each human sequence is divided into a preset number of image partitions, wherein each image partition includes multiple human images; a first feature corresponding to each human image is extracted using a human feature extraction model; based on the first feature corresponding to each human image, a first category center corresponding to each human image is determined using a classification model; based on the first category center corresponding to each human image, a first quality score corresponding to each human image is calculated using a quality score calculation model; based on the first feature, first category center, and first quality score corresponding to each human image in each image partition, a second feature, second category center, and second quality score corresponding to each image partition are determined; based on the second feature, second category center, and second quality score corresponding to each image partition in each human sequence, a target feature corresponding to each human sequence is determined; based on the target feature corresponding to each human sequence, multiple human sequences are matched. Therefore, by adopting the above technical means, the problem of low matching accuracy of human sequences in video streams in the prior art can be solved, thereby improving the accuracy of matching human sequences.

[0039] Based on the first feature, first category center, and first quality score corresponding to each human image in each image partition, determine the second feature, second category center, and second quality score corresponding to each image partition, including: taking the first category center with the most human images in each image partition as the second category center corresponding to each image partition; taking the first feature corresponding to the human image with the highest first quality score among the multiple human images corresponding to the second category center corresponding to each image partition as the second feature corresponding to each image partition; and taking the first quality score corresponding to the second feature corresponding to each image partition as the second quality score corresponding to each image partition.

[0040] For example: An image partition contains 100 human body images. The most common pose among these 100 images is the frontal view of the human body. Therefore, the first category center with the most images is the frontal view center, and the second category center corresponding to this image partition is also the frontal view center. Human body images in this image partition that do not belong to the frontal view center are deleted. The first feature corresponding to the human body image with the highest first quality score among the remaining human body images (this is the same as the first feature corresponding to the human body image with the highest first quality score among the multiple human body images corresponding to the second category center of the image partition) is used as the second feature corresponding to this image partition. The first quality score corresponding to this second feature is used as the second quality score corresponding to this image partition.
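The partition-level selection described above can be sketched as follows. Representing each image as a `(feature, pose, quality)` tuple is an assumption for illustration, and the values are made up.

```python
from collections import Counter


def summarize_partition(images):
    """Reduce one partition to (second_feature, second_category, second_quality).

    `images` is a list of (feature, pose, quality) tuples, one per human image.
    """
    # The pose held by the most images becomes the second category center.
    counts = Counter(pose for _, pose, _ in images)
    dominant_pose = counts.most_common(1)[0][0]
    # Images of any other pose are discarded.
    kept = [image for image in images if image[1] == dominant_pose]
    # The kept image with the highest first quality score supplies the rest.
    best_feature, _, best_quality = max(kept, key=lambda image: image[2])
    return best_feature, dominant_pose, best_quality


partition = [
    ([1.0, 0.0], "front", 0.6),
    ([0.9, 0.1], "front", 0.8),
    ([0.0, 1.0], "side", 0.95),
]
summary = summarize_partition(partition)
```

Note that the side-view image has the highest score overall but is dropped, because the dominant pose is decided by image count before quality is considered.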

[0041] Based on the second feature, second category center, and second quality score corresponding to each image partition in each human body sequence, the target feature corresponding to each human body sequence is determined, including: in each human body sequence: image partitions corresponding to the same second category center are divided into the same category to obtain multiple categories; the second feature corresponding to the image partition in each category is weighted and averaged to obtain the target feature corresponding to each category, wherein the weight of each image partition in the weighted average calculation is determined by the second quality score of each image partition.

[0042] Each image partition corresponds to a second category center. For example, if a human body sequence contains multiple image partitions representing the center of the front, side, and back of the human body, then the sequence has three categories: the center of the front, the center of the side, and the center of the back. A weighted average is calculated on the second features corresponding to the image partitions within each category to obtain the target feature for each category. In the weighted average calculation, the higher the second quality score of an image partition, the greater its weight.
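A sketch of the per-category weighted average is given below. The text says only that a higher second quality score gives a greater weight; using the scores themselves as weights, normalized by their sum, is an assumption, as are the tuple layout and names.

```python
from collections import defaultdict


def sequence_target_features(partitions):
    """Return {category: quality-weighted mean of the second features}.

    `partitions` is a list of (second_feature, second_category, second_quality).
    """
    grouped = defaultdict(list)
    for feature, category, quality in partitions:
        grouped[category].append((feature, quality))

    targets = {}
    for category, items in grouped.items():
        total = sum(quality for _, quality in items)
        if total == 0:
            # Degenerate case: fall back to equal weights.
            items = [(feature, 1.0) for feature, _ in items]
            total = float(len(items))
        dim = len(items[0][0])
        targets[category] = [
            sum(feature[d] * quality for feature, quality in items) / total
            for d in range(dim)
        ]
    return targets


partitions = [
    ([1.0, 0.0], "front", 0.75),
    ([0.0, 1.0], "front", 0.25),
    ([0.5, 0.5], "side", 0.9),
]
targets = sequence_target_features(partitions)
```

Here the two frontal partitions blend into one frontal target feature weighted 3:1, while the single side partition passes through unchanged.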

[0043] Based on the target features corresponding to each human body sequence, multiple human body sequences are matched, including: calculating the similarity between target features corresponding to the same category in two human body sequences, wherein each human body sequence includes multiple categories; and identifying two human body sequences with a similarity greater than a preset threshold as human body sequences of the same target.

[0044] For example, if two human body sequences both have two categories, the side center and the back center, then the similarity is calculated between the two sequences' target features for the side center, and between the two sequences' target features for the back center. If either of these two similarity values is greater than a preset threshold, the two human body sequences can be identified as sequences of the same target. The similarity can be cosine similarity.
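The matching rule can be sketched as follows: compare target features category by category and accept the pair if any shared category exceeds the threshold. The threshold value 0.8 and the dictionary layout are assumptions; the text says only that the threshold is preset.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_target(targets_a, targets_b, threshold=0.8):
    """True if any category shared by both sequences exceeds the threshold.

    `targets_a` and `targets_b` map a category name to a target feature.
    """
    shared = targets_a.keys() & targets_b.keys()
    return any(
        cosine_similarity(targets_a[c], targets_b[c]) > threshold
        for c in shared
    )


seq_a = {"side": [1.0, 0.0], "back": [0.0, 1.0]}
seq_b = {"side": [0.9, 0.1], "back": [1.0, 0.0]}
seq_c = {"front": [1.0, 0.0]}

match_ab = same_target(seq_a, seq_b)  # side views agree, back views do not
match_ac = same_target(seq_a, seq_c)  # no category in common
```

Sequences with no category in common are never matched under this rule, which is one consequence of comparing only like poses.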

[0045] The human feature extraction model requires two rounds of training. The first training includes: acquiring a training dataset containing multiple targets, each with multiple human images. Each target corresponds to four third category centers: the frontal center, the side center, the back center, and the fusion center. The third category center for each human image of each target is either the frontal center, the side center, or the back center. A first similarity formula is constructed based on cosine similarity and a subtractive margin term, and a first loss function is constructed based on the first similarity formula. A second similarity formula is constructed based on cosine similarity and an additive margin term, and a second loss function is constructed based on the second similarity formula. The human feature extraction model is then trained for the first time using the first and second loss functions on the training dataset.

[0046] For example: If the training dataset has N targets, and each target corresponds to the following third category center: human front center, human side center, human back center, and human fusion center, then the training dataset has 4*N category center vectors.

[0047] Given a frontal human image whose corresponding category center is the human frontal center, the cosine similarity between this image and the i-th human frontal center (there are N human frontal centers) is:

[0048] First similarity formula

[0049] First loss function

[0050] The subtractive margin term m1 is typically taken as 0.35. The first loss function is computed between the human image and its corresponding category center. s is the output of the quality score calculation model, and cosθ_j is the cosine similarity between the human image and the j-th category center other than the human frontal center.
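The first similarity formula and first loss function do not appear in this text. As a hedged sketch only, the snippet below assumes the loss takes the widely used large-margin cosine softmax form, in which the margin m1 is subtracted from the cosine of the image's own category center before scaling by s. This form and the fixed scale s = 30.0 are assumptions; in the text s is the output of the quality score calculation model.

```python
import math


def margin_softmax_loss(cos_target, cos_others, s=30.0, m=0.35):
    """Assumed form: -log(e^{s(cos_t - m)} / (e^{s(cos_t - m)} + sum_j e^{s cos_j})).

    cos_target: cosine similarity to the image's own category center.
    cos_others: cosine similarities to the other category centers.
    """
    logits = [s * (cos_target - m)] + [s * c for c in cos_others]
    peak = max(logits)  # log-sum-exp with the max subtracted for stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


loss_with_margin = margin_softmax_loss(0.8, [0.1, 0.2], m=0.35)
loss_without_margin = margin_softmax_loss(0.8, [0.1, 0.2], m=0.0)
```

Under this assumed form, subtracting the margin raises the loss for the same similarities, which pushes the model to separate an image from the other centers by more than the margin.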

[0051] The cosine similarity between the human image and the i-th human fusion center (there are N human fusion centers) is as follows (the expression differs in meaning from the earlier one, but both are cosine similarities of the human image):

[0052] Second similarity formula

[0053] Second loss function

[0054] The additive margin term m2 is typically taken as 0.45. The second loss function is computed between the human image and its human fusion center. s is the output of the quality score calculation model, and cosθ_j is the cosine similarity between the human image and the j-th category center other than the human frontal center.

[0055] After the above training, the human feature extraction model can distinguish the differences between different targets in fused features or different pose features.

[0056] Before utilizing the quality score calculation model, the process includes: constructing a spatial quality regressor and a threshold quality regressor; and constructing a quality score calculation model based on the spatial quality regressor and the threshold quality regressor.

[0057] Before constructing the spatial quality regressor, the following calculation is required: q = (1 - 0.5 * (d1 + d2)) * d3; d1 is the distance between the first feature corresponding to the human image and the category center corresponding to the human image, d2 is the distance between the first feature corresponding to the human image and the human fusion center of the target corresponding to the human image, and d3 is the distance between the first feature corresponding to the human image and the nearest neighbor human fusion center. The nearest neighbor human fusion center is the human fusion center of the target most similar to the target corresponding to the human image. "Most similar" is calculated by the nearest neighbor algorithm.
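The label q can be computed directly from the three distances. The text does not name the distance measure; the sketch below assumes cosine distance (1 minus cosine similarity), and the vectors and function names are illustrative only.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; the choice of distance is an assumption."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def spatial_quality_label(feature, own_center, own_fusion_center,
                          nearest_other_fusion_center):
    """q = (1 - 0.5 * (d1 + d2)) * d3, as given in the text."""
    d1 = cosine_distance(feature, own_center)
    d2 = cosine_distance(feature, own_fusion_center)
    d3 = cosine_distance(feature, nearest_other_fusion_center)
    return (1.0 - 0.5 * (d1 + d2)) * d3


q = spatial_quality_label(
    feature=[1.0, 0.0],
    own_center=[1.0, 0.0],                   # d1 = 0
    own_fusion_center=[1.0, 1.0],            # d2 = 1 - 1/sqrt(2)
    nearest_other_fusion_center=[0.0, 1.0],  # d3 = 1
)
```

The score is high when the feature sits close to its own centers (small d1 and d2) and far from the most similar other target (large d3).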

[0058] The spatial quality regressor applies several rounds of convolution, batch normalization, and activation to the input features. For example, the input features pass through a 3x3 convolution with 2x downsampling and 128 channels, followed by batch normalization and tanh activation; then a 3x3 convolution with 2x downsampling and 32 channels, followed by batch normalization and Mish activation; and finally a 3x3 convolution with 2x downsampling and 1 channel, followed by average pooling and a sigmoid, to obtain the predicted q′.

[0059] The loss function for training the quality regressor is:

[0060] Loss = |q − q′|

[0061] The threshold quality regressor consists of multiple convolutional layers, a batch normalization layer, and an activation layer. For example, a threshold quality regressor might consist of three convolutional layers (Conv), a batch normalization layer (BN), and a PReLU activation layer. The convolutional layers all use 3x3 kernels, a downsampling rate of 2, and a padding of 1, with 128, 32, and 1 channels respectively. After the PReLU activation layer, max pooling and average pooling can be performed, followed by a sigmoid, ultimately yielding two threshold quality scores: α and β.
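To see how three such convolutions shrink a feature map, the sketch below applies the standard convolution output-size formula, floor((n + 2p − k) / s) + 1, reading the downsampling rate of 2 as a stride of 2. The 16x8 input size is a made-up example; the text does not state the input resolution.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Standard output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def regressor_shapes(height, width, channels=(128, 32, 1)):
    """Return (channels, height, width) after each of the three convolutions."""
    shapes = []
    for out_channels in channels:
        height = conv_output_size(height)
        width = conv_output_size(width)
        shapes.append((out_channels, height, width))
    return shapes


# Hypothetical 16x8 input feature map.
shapes = regressor_shapes(16, 8)
```

Each layer roughly halves both spatial dimensions, so the final single-channel map is small enough for pooling to reduce it to scalar scores.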

[0062] The final output of the quality score calculation model can be s = α * q′ + β, or the three values α, q′, and β.

[0063] After constructing the quality score calculation model based on the spatial quality regressor and the threshold quality regressor, the method further includes: updating the first similarity formula and the second similarity formula based on the output of the quality score calculation model to update the first loss function and the second loss function; performing a second training on the human feature extraction model based on the training dataset using the updated first loss function and the second loss function; and completing the training of the quality score calculation model while performing the second training on the human feature extraction model.

[0064] The updated first similarity formula is:

[0065]

[0066] m1 is changed to 0.45.

[0067] The updated second similarity formula is:

[0068]

[0069] m2 is changed to 0.55.

[0070] The human feature extraction model is trained a second time using the updated first and second loss functions.

[0071] While performing a second training of the human feature extraction model, the training of the quality score calculation model is also completed. It can be understood that the training of the quality score calculation model is to make the spatial quality regressor output the optimal q′, and the threshold quality regressor output the optimal α and β.

[0072] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.

[0073] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.

[0074] Figure 3 is a schematic diagram of a human body matching device in a video stream provided in an embodiment of this disclosure. As shown in Figure 3, the human body matching device in a video stream includes:

[0075] The acquisition module 301 is configured to acquire multiple video streams and process the multiple video streams sequentially using a human detection algorithm and a human tracking algorithm to obtain multiple human sequences, wherein each human sequence includes multiple human images;

[0076] The segmentation module 302 is configured to divide each human body sequence into a preset number of image partitions, wherein each image partition includes multiple human body images;

[0077] The extraction module 303 is configured to extract the first feature corresponding to each human image using a human feature extraction model;

[0078] The first determining module 304 is configured to determine the first category center corresponding to each human body image based on the first feature corresponding to each human body image using a classification model;

[0079] The calculation module 305 is configured to calculate the first quality score corresponding to each human body image based on the first category center corresponding to each human body image using a quality score calculation model.

[0080] The second determining module 306 is configured to determine the second feature, second category center and second quality score corresponding to each image partition based on the first feature, first category center and first quality score corresponding to each human image in each image partition;

[0081] The third determining module 307 is configured to determine the target features corresponding to each human body sequence based on the second feature, the second category center, and the second quality score corresponding to each image partition in each human body sequence.

[0082] The matching module 308 is configured to match multiple human sequences based on the target features corresponding to each human sequence.

[0083] According to the technical solution provided in this disclosure, multiple video streams are acquired, and human detection and human tracking algorithms are sequentially used to process the multiple video streams to obtain multiple human sequences, wherein each human sequence includes multiple human images; each human sequence is divided into a preset number of image partitions, wherein each image partition includes multiple human images; a first feature corresponding to each human image is extracted using a human feature extraction model; based on the first feature corresponding to each human image, a first category center corresponding to each human image is determined using a classification model; based on the first category center corresponding to each human image, a first quality score corresponding to each human image is calculated using a quality score calculation model; based on the first feature, first category center, and first quality score corresponding to each human image in each image partition, a second feature, second category center, and second quality score corresponding to each image partition are determined; based on the second feature, second category center, and second quality score corresponding to each image partition in each human sequence, a target feature corresponding to each human sequence is determined; based on the target feature corresponding to each human sequence, multiple human sequences are matched. Therefore, by adopting the above technical means, the problem of low matching accuracy of human sequences in video streams in the prior art can be solved, thereby improving the accuracy of matching human sequences.
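The partitioning step can be illustrated with a minimal sketch. The disclosure only states that each human body sequence is divided into a preset number of image partitions; the choice of contiguous, near-equal chunks and the name `split_into_partitions` are assumptions made here for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_partitions(sequence: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Split a human body sequence into contiguous partitions.

    Assumption: partitions are contiguous and as equal in size as possible;
    the disclosure does not specify the splitting strategy.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    base, extra = divmod(len(sequence), num_partitions)
    partitions: List[List[T]] = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional image.
        size = base + (1 if i < extra else 0)
        partitions.append(list(sequence[start:start + size]))
        start += size
    return partitions
```

With ten images and three partitions, this sketch yields partitions of four, three, and three images.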

[0084] Optionally, the second determining module 306 is further configured to take the first category center with the largest number of human images in each image partition as the second category center of each image partition; take the first feature corresponding to the human image with the highest first quality score among the multiple human images corresponding to the second category center of each image partition as the second feature of each image partition; and take the first quality score corresponding to the second feature of each image partition as the second quality score of each image partition.

[0085] For example: An image partition contains 100 human body images. The most common pose among these 100 images is the frontal view of the human body. Therefore, the first category center with the most images is the frontal view center, and the second category center corresponding to this image partition is also the frontal view center. Human body images in this image partition that do not belong to the frontal view center are deleted. The first feature corresponding to the human body image with the highest first quality score among the remaining human body images (this is the same as the first feature corresponding to the human body image with the highest first quality score among the multiple human body images corresponding to the second category center of the image partition) is used as the second feature corresponding to this image partition. The first quality score corresponding to this second feature is used as the second quality score corresponding to this image partition.
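The selection rule in the example above can be sketched as follows. The record type `ImageRecord`, the string labels for category centers, and the function name are hypothetical; tie handling is not addressed in the disclosure, so this sketch simply keeps the first center encountered among tied centers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImageRecord:
    feature: List[float]  # first feature of the human body image
    center: str           # first category center, e.g. "front", "side", "back"
    quality: float        # first quality score


def summarize_partition(images: List[ImageRecord]) -> Tuple[List[float], str, float]:
    """Return (second feature, second category center, second quality score)."""
    if not images:
        raise ValueError("an image partition must contain at least one image")
    # Second category center: the first category center with the most images.
    counts = Counter(img.center for img in images)
    second_center, _ = counts.most_common(1)[0]
    # Keep only images belonging to that center, then pick the best-quality one.
    candidates = [img for img in images if img.center == second_center]
    best = max(candidates, key=lambda img: img.quality)
    return best.feature, second_center, best.quality
```

Note that a side-view image with a higher quality score is ignored when the frontal view is the majority category of the partition, which matches the deletion step described in the example.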

[0086] Optionally, the third determining module 307 is further configured to: divide image partitions corresponding to the same second category center into the same category in each human body sequence, resulting in multiple categories; perform a weighted average operation on the second features corresponding to the image partitions in each category to obtain the target features corresponding to each category, wherein the weight of each image partition in the weighted average operation is determined by the second quality score of each image partition.

[0087] Each image partition corresponds to a second category center. For example, if a human body sequence contains multiple image partitions representing the center of the front, side, and back of the human body, then the sequence has three categories: the center of the front, the center of the side, and the center of the back. A weighted average is calculated on the second features corresponding to the image partitions within each category to obtain the target feature for each category. In the weighted average calculation, the higher the second quality score of an image partition, the greater its weight.
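A minimal sketch of this grouping and weighted averaging is given below. It assumes that the weight of each image partition is directly proportional to its second quality score; the disclosure only says the weight is determined by that score and grows with it. The tuple layout and function name are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (second feature, second category center, second quality score)
PartitionSummary = Tuple[List[float], str, float]


def target_features(partitions: List[PartitionSummary]) -> Dict[str, List[float]]:
    """Compute one target feature per category within a human body sequence."""
    grouped: Dict[str, List[Tuple[List[float], float]]] = defaultdict(list)
    for feature, center, quality in partitions:
        grouped[center].append((feature, quality))

    result: Dict[str, List[float]] = {}
    for center, items in grouped.items():
        total = sum(q for _, q in items)
        if total <= 0:
            raise ValueError("quality scores in a category must sum to a positive value")
        dim = len(items[0][0])
        # Weighted average, with weights proportional to quality (assumption).
        result[center] = [sum(f[k] * q for f, q in items) / total for k in range(dim)]
    return result
```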

[0088] Optionally, the matching module 308 is further configured to calculate the similarity between target features corresponding to the same category in two human body sequences, wherein each human body sequence includes multiple categories; and to determine two human body sequences with a corresponding similarity greater than a preset threshold as human body sequences of the same target.

[0089] For example, if two human body sequences both have two categories, the human side center and the human back center, then the similarity between the two sequences' target features for the human side center is calculated, and likewise the similarity between their target features for the human back center. If either of these two similarity values is greater than a preset threshold, the two human body sequences can be identified as sequences of the same target. The similarity can be cosine similarity.
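The matching rule can be sketched as follows, assuming each human body sequence is represented as a mapping from a category label to its target feature. The function names and the dictionary representation are illustrative; the rule that a single category exceeding the threshold is sufficient follows the example above.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_target(seq_a: Dict[str, List[float]],
                seq_b: Dict[str, List[float]],
                threshold: float) -> bool:
    """Two sequences match if any shared category exceeds the threshold."""
    shared = seq_a.keys() & seq_b.keys()
    return any(cosine_similarity(seq_a[c], seq_b[c]) > threshold for c in shared)
```

Sequences with no category in common are never matched by this sketch, since only target features of the same category are compared.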

[0090] Optionally, the extraction module 303 is further configured to acquire a training dataset, wherein the training dataset includes: multiple targets, each target having multiple human images of that target, each target corresponding to the following third category centers: human front center, human side center, human back center, and human fusion center, and the third category center corresponding to each human image of each target is the human front center, human side center, or human back center; constructing a first similarity formula based on cosine similarity and a subtractive margin term, and constructing a first loss function based on the first similarity formula; constructing a second similarity formula based on cosine similarity and an additive margin term, and constructing a second loss function based on the second similarity formula; and performing a first training on the human feature extraction model based on the training dataset using the first loss function and the second loss function.

[0091] For example: If the training dataset has N targets, and each target corresponds to the following third category center: human front center, human side center, human back center, and human fusion center, then the training dataset has 4*N category center vectors.

[0092] Given a frontal human image whose corresponding category center is the human front center, the cosine similarity between this image and the i-th human front center (there are N human front centers) is:

[0093] First similarity formula

[0094] First loss function

[0095] The subtractive margin term m1 is typically taken as 0.35. The first loss function is computed between the human image and its corresponding category center. s is the output of the quality score calculation model, and cosθ_j is the cosine similarity between the human image and the j-th category center other than the human front center.
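The first similarity formula and first loss function are not reproduced in this text. As an illustration only, the sketch below assumes the widely used large-margin cosine form, in which the target logit is s·(cosθ_i − m1), the other logits are s·cosθ_j, and the loss is the softmax cross-entropy over those logits. This form, the function name, and the treatment of s as a plain scalar argument are assumptions, not the formula of the disclosure.

```python
import math
from typing import Sequence


def margin_softmax_loss(cosines: Sequence[float], target: int,
                        s: float, m1: float = 0.35) -> float:
    """Cross-entropy with a subtractive cosine margin on the target class.

    cosines[j] is the cosine similarity between the image feature and the
    j-th category center; `target` is the index of the image's own center.
    """
    logits = [s * (c - m1) if j == target else s * c
              for j, c in enumerate(cosines)]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[target]
```

Under this assumed form, a larger margin lowers the target logit and therefore raises the loss for the same similarities, which pushes the feature closer to its own category center during training.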

[0096] The cosine similarity between the human image and the i-th human fusion center (there are N human fusion centers) is as follows (the expression has a different meaning here than above, but both are cosine similarities of the human image):

[0097] Second similarity formula

[0098] Second loss function

[0099] The additive margin term m2 is typically taken as 0.45. The second loss function is computed between the human image and its human fusion center. s is the output of the quality score calculation model, and cosθ_j is the cosine similarity between the human image and the j-th category center other than the human front center.

[0100] After the above training, the human feature extraction model can distinguish the differences between different targets in fused features or different pose features.

[0101] Optionally, the extraction module 303 is also configured to construct a spatial quality regressor and a threshold quality regressor; and to construct a quality score calculation model based on the spatial quality regressor and the threshold quality regressor.

[0102] Before constructing the spatial quality regressor, the following calculation is required: q = (1 - 0.5 * (d1 + d2)) * d3; d1 is the distance between the first feature corresponding to the human image and the category center corresponding to the human image, d2 is the distance between the first feature corresponding to the human image and the human fusion center of the target corresponding to the human image, and d3 is the distance between the first feature corresponding to the human image and the nearest neighbor human fusion center. The nearest neighbor human fusion center is the human fusion center of the target most similar to the target corresponding to the human image. "Most similar" is calculated by the nearest neighbor algorithm.
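The quality label q defined above can be computed directly from the three distances. In the sketch below the distances are passed in as numbers, because the distance metric is not specified in this passage; the function name is illustrative.

```python
def quality_label(d1: float, d2: float, d3: float) -> float:
    """q = (1 - 0.5 * (d1 + d2)) * d3.

    d1: distance from the first feature to its own category center
    d2: distance from the first feature to its own target's fusion center
    d3: distance from the first feature to the nearest neighbor fusion center
    """
    return (1.0 - 0.5 * (d1 + d2)) * d3
```

For d1 = 0.2, d2 = 0.4, and d3 = 0.8, q is approximately 0.56. The label rises when the feature lies close to its own centers (small d1 and d2) and far from the most similar other target (large d3).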

[0103] The spatial quality regressor applies several rounds of convolution, batch normalization, and activation to the input features. For example, the input features pass through a 3x3 convolution with a downsampling rate of 2 and 128 channels, followed by batch normalization and tanh activation; then a 3x3 convolution with a downsampling rate of 2 and 32 channels, followed by batch normalization and Mish activation; and finally a 3x3 convolution with a downsampling rate of 2 and 1 channel, followed by average pooling and a sigmoid, to obtain the predicted q′.

[0104] The loss function for training the quality regressor is:

[0105] Loss = |q - q′|

[0106] The threshold quality regressor consists of multiple convolutional layers, a batch normalization layer, and an activation layer. For example, a threshold quality regressor might consist of three convolutional layers (Conv), a batch normalization layer (BN), and a PReLU activation layer. The convolutional layers all have 3x3 kernels, a downsampling rate of 2, and a padding of 1, with 128, 32, and 1 channels respectively. After the PReLU activation layer, max pooling and average pooling can be performed, followed by a sigmoid, ultimately yielding two threshold quality scores: α and β.
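To see how quickly the three convolutional layers shrink the feature map, the sketch below applies the standard convolution output-size formula, floor((n + 2p − k) / s) + 1. It assumes that a "downsampling rate of 2" means a stride of 2; the 16x8 input size and the function names are hypothetical.

```python
from typing import List, Tuple


def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def regressor_shapes(height: int, width: int,
                     channels: Tuple[int, ...] = (128, 32, 1)) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each of the stacked convolutions."""
    shapes = []
    h, w = height, width
    for c in channels:
        h, w = conv_output_size(h), conv_output_size(w)
        shapes.append((c, h, w))
    return shapes


if __name__ == "__main__":
    # Hypothetical 16x8 input feature map.
    for shape in regressor_shapes(16, 8):
        print(shape)  # (128, 8, 4), then (32, 4, 2), then (1, 2, 1)
```

Each layer halves the spatial size under these assumptions, so after three layers a single-channel map remains that the pooling and sigmoid steps reduce to scalar scores.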

[0107] The final output of the quality score calculation model can be α*q′+β, or the three values α, q′, and β.

[0108] Optionally, the extraction module 303 is further configured to update the first similarity formula and the second similarity formula based on the output of the quality score calculation model, so as to update the first loss function and the second loss function; to perform a second training on the human feature extraction model based on the training dataset using the updated first loss function and the second loss function; and to complete the training of the quality score calculation model while performing the second training on the human feature extraction model.

[0109] The updated first similarity formula is:

[0110]

[0111] m1 is changed to 0.45.

[0112] The updated second similarity formula is:

[0113]

[0114] m2 is changed to 0.55.

[0115] The human feature extraction model is trained a second time using the updated first and second loss functions.

[0116] While performing a second training of the human feature extraction model, the training of the quality score calculation model is also completed. It can be understood that the training of the quality score calculation model is to make the spatial quality regressor output the optimal q′, and the threshold quality regressor output the optimal α and β.

[0117] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this disclosure.

[0118] Figure 4 is a schematic diagram of the electronic device 4 provided in an embodiment of this disclosure. As shown in Figure 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, it implements the steps in the various method embodiments described above. Alternatively, when the processor 401 executes the computer program 403, it implements the functions of each module/unit in the various device embodiments described above.

[0119] Electronic device 4 can be a desktop computer, laptop, handheld computer, cloud server, or other electronic device. Electronic device 4 may include, but is not limited to, processor 401 and memory 402. Those skilled in the art will understand that Figure 4 is merely an example of electronic device 4 and does not constitute a limitation on electronic device 4. It may include more or fewer components than shown, or different components.

[0120] The processor 401 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.

[0121] The memory 402 can be an internal storage unit of the electronic device 4, such as a hard disk or RAM of the electronic device 4. The memory 402 can also be an external storage device of the electronic device 4, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, Flash Card, etc., equipped on the electronic device 4. The memory 402 can also include both internal and external storage units of the electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.

[0122] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0123] If an integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program may include computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in a computer-readable medium may be appropriately added to or subtracted according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0124] The above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit it. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be included within the protection scope of this disclosure.

Claims

1. A method for human body matching in a video stream, characterized in that the method includes: acquiring multiple video streams, and processing the multiple video streams sequentially using a human detection algorithm and a human tracking algorithm to obtain multiple human body sequences, wherein each human body sequence includes multiple human body images; dividing each human body sequence into a preset number of image partitions, wherein each image partition includes multiple human body images; extracting the first feature corresponding to each human body image using a human feature extraction model; determining, based on the first feature corresponding to each human body image, the first category center corresponding to each human body image using a classification model, wherein the first category center is the human front center, the human side center, or the human back center; calculating, based on the first category center corresponding to each human body image, the first quality score corresponding to each human body image using a quality score calculation model; determining the second feature, second category center, and second quality score corresponding to each image partition based on the first feature, first category center, and first quality score corresponding to each human body image in each image partition; determining the target features corresponding to each human body sequence based on the second feature, second category center, and second quality score corresponding to each image partition in each human body sequence; and matching multiple human body sequences based on the target features corresponding to each human body sequence; wherein the step of determining the second feature, second category center, and second quality score corresponding to each image partition based on the first feature, first category center, and first quality score corresponding to each human body image in each image partition includes: taking the first category center with the most human body images in each image partition as the second category center of each image partition; taking the first feature corresponding to the human body image with the highest first quality score among the multiple human body images corresponding to the second category center of each image partition as the second feature of each image partition; and taking the first quality score corresponding to the second feature of each image partition as the second quality score of each image partition.

2. The method according to claim 1, characterized in that the step of determining the target features corresponding to each human body sequence based on the second feature, second category center, and second quality score corresponding to each image partition in each human body sequence includes, in each human body sequence: dividing image partitions corresponding to the same second category center into the same category, resulting in multiple categories; and performing a weighted average on the second features corresponding to the image partitions in each category to obtain the target features corresponding to each category, wherein the weight of each image partition in the weighted average is determined by the second quality score of each image partition.

3. The method of claim 1, wherein the matching of multiple human body sequences based on the target features corresponding to each human body sequence includes: calculating the similarity between target features of the same category in two human body sequences, wherein each human body sequence includes multiple categories; and identifying two human body sequences with a similarity greater than a preset threshold as human body sequences of the same target.

4. The method of claim 1, further including: obtaining a training dataset, wherein the training dataset includes multiple targets, each target has multiple human body images of that target, each target corresponds to the following four category centers: human front center, human side center, human back center, and human fusion center, the category center corresponding to each human body image of each target is the human front center, the human side center, or the human back center, and the human fusion center is the feature fusion of the human front center, the human side center, and the human back center; constructing a first similarity formula based on cosine similarity and a subtractive margin term, and constructing a first loss function based on the first similarity formula; constructing a second similarity formula based on cosine similarity and an additive margin term, and constructing a second loss function based on the second similarity formula; and performing a first training on the human feature extraction model based on the training dataset using the first loss function and the second loss function.

5. The method of claim 1, further including: constructing a spatial quality regressor and a threshold quality regressor; and constructing the quality score calculation model based on the spatial quality regressor and the threshold quality regressor.

6. The method of claim 5, wherein, after constructing the quality score calculation model based on the spatial quality regressor and the threshold quality regressor, the method further includes: updating the first similarity formula and the second similarity formula based on the output of the quality score calculation model, so as to update the first loss function and the second loss function; performing a second training on the human feature extraction model based on the training dataset using the updated first loss function and second loss function; and completing the training of the quality score calculation model while performing the second training on the human feature extraction model.

7. A human body matching device in a video stream, characterized in that the device includes: an acquisition module configured to acquire multiple video streams and process the multiple video streams sequentially using a human detection algorithm and a human tracking algorithm to obtain multiple human body sequences, wherein each human body sequence includes multiple human body images; a segmentation module configured to divide each human body sequence into a preset number of image partitions, wherein each image partition includes multiple human body images; an extraction module configured to extract the first feature corresponding to each human body image using a human feature extraction model; a first determining module configured to determine the first category center corresponding to each human body image based on the first feature corresponding to each human body image using a classification model, wherein the first category center is the human front center, the human side center, or the human back center; a calculation module configured to calculate the first quality score corresponding to each human body image based on the first category center corresponding to each human body image using a quality score calculation model; a second determining module configured to determine the second feature, second category center, and second quality score corresponding to each image partition based on the first feature, first category center, and first quality score corresponding to each human body image in each image partition; a third determining module configured to determine the target features corresponding to each human body sequence based on the second feature, second category center, and second quality score corresponding to each image partition in each human body sequence; and a matching module configured to match multiple human body sequences based on the target features corresponding to each human body sequence; wherein the second determining module is specifically configured to: take the first category center with the most human body images in each image partition as the second category center of each image partition; take the first feature corresponding to the human body image with the highest first quality score among the multiple human body images corresponding to the second category center of each image partition as the second feature of each image partition; and take the first quality score corresponding to the second feature of each image partition as the second quality score of each image partition.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.