Training method of pedestrian search model based on twin network and related equipment
A Siamese-network-based training method for a pedestrian search model extracts and fuses features through the network's lower and upper branches, addressing the low accuracy of pedestrian search across different scales and enabling efficient training and recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2023-02-01
- Publication Date
- 2026-06-12
AI Technical Summary
Existing pedestrian search technologies have low accuracy at different scales and require a lot of manpower for labeling, making it difficult to handle the problem of pedestrian scale changes.
A pedestrian search model training method using Siamese networks is adopted. Feature extraction is performed through the lower and upper branches of the Siamese network to obtain pedestrian feature vectors and multi-scale feature vectors. Feature fusion and clustering training are then performed to reduce sample labeling costs and increase sample data by utilizing scale transformation.
It improves the accuracy of pedestrian search at different scales, reduces sample labeling costs, and enhances the scalability of the model.
[Figure: abstract drawing, CN116259076B]
Description
Technical Field
[0001] This invention relates to the field of image recognition technology, and in particular to a training method, apparatus, electronic device, and storage medium for a pedestrian search model based on a Siamese network.
Background Technology
[0002] Pedestrian search has received increasing attention due to its widespread application in real-world environments. Pedestrian search aims to locate and retrieve a specific pedestrian from a set of scene images, and therefore can be viewed as a combination of pedestrian detection and pedestrian re-identification tasks.
[0003] Existing pedestrian search methods follow one of two approaches. The first completes pedestrian detection and pedestrian re-identification end to end. The second performs pedestrian detection first, crops and saves the pedestrians in the image according to the detection results, and then uses the cropped images to complete the pedestrian re-identification task.
[0004] However, both of the above methods are trained in a supervised manner, which requires labeling all pedestrians in a large number of images with bounding boxes and identity numbers. Labeling a massive number of pedestrian images to support supervised learning requires a lot of manpower, which severely limits the scalability of supervised pedestrian search methods. At the same time, these methods struggle with pedestrian scale changes; that is, the search accuracy for pedestrians with large scale changes needs to be improved.
Summary of the Invention
[0005] This invention provides a training method, apparatus, electronic device, and storage medium for a pedestrian search model based on twin networks, in order to solve the problem of low accuracy in pedestrian search at different scales in the prior art.
[0006] This invention provides a training method for a pedestrian search model based on Siamese networks, comprising:
[0007] Determine the training images;
[0008] Feature extraction is performed on the training image based on the lower branch of the Siamese network to obtain the feature vector of each pedestrian in the training image;
[0009] The training images are processed to obtain multi-scale pedestrian images, and features are extracted from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain the multi-scale feature vectors of each pedestrian in the training images.
[0010] The pedestrian search model is trained based on the feature vector and the multi-scale feature vector, and a trained pedestrian search model is obtained when the training is completed.
[0011] According to the present invention, a training method for a pedestrian search model based on a Siamese network is provided, wherein the step of extracting features from the training image based on the lower branches of the Siamese network to obtain feature vectors of each pedestrian in the training image includes:
[0012] Based on the pedestrian detection in the training image by the lower branch of the Siamese network, determine the pedestrian bounding box corresponding to each pedestrian in the training image;
[0013] The positions of pedestrians in the training image are determined based on the pedestrian bounding boxes, and features are extracted from the pedestrian positions based on the pedestrian bounding boxes to obtain the feature vectors of each pedestrian in the training image.
[0014] According to a training method for a pedestrian search model based on a Siamese network provided by the present invention, the training image is processed to obtain multi-scale pedestrian images, and features are extracted from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain multi-scale feature vectors of each pedestrian in the training image, including:
[0015] The training images are processed to obtain pedestrian images at multiple scales;
[0016] Based on the upper branch of the Siamese network, feature extraction is performed on the multi-scale pedestrian images to obtain the multi-scale feature vectors corresponding to each pedestrian.
[0017] The multi-scale feature vectors are fused to obtain the multi-scale features of each pedestrian in the training image.
[0018] According to the present invention, a training method for a pedestrian search model based on Siamese networks is provided, wherein processing the training images to obtain multi-scale pedestrian images includes:
[0019] The training images are cropped to obtain several pedestrian images, each containing one pedestrian.
[0020] The pedestrian image is masked, and the size of the masked pedestrian image is scaled to obtain multi-scale pedestrian images.
[0021] According to the present invention, a training method for a pedestrian search model based on a Siamese network is provided, the method further includes:
[0022] The feature vector and the multi-scale feature vector are blended to obtain a hybrid feature vector;
[0023] The mixed feature vectors are clustered to obtain several clusters and cluster labels, with one cluster corresponding to one cluster label.
[0024] According to the present invention, a method for training a pedestrian search model based on a Siamese network, after training the initial pedestrian search model based on the feature vector and the multi-scale features, further includes:
[0025] When it is determined that training is not yet complete, the feature vector and the multi-scale feature vector are updated;
[0026] The model is trained again based on the updated feature vector and the multi-scale feature vector until a trained pedestrian search model is obtained.
[0027] According to the present invention, a training method for a pedestrian search model based on a Siamese network is provided, the method further includes:
[0028] A target image is determined, and pedestrian recognition and feature extraction are performed on the target image to obtain the target features of each pedestrian in the target image;
[0029] The extracted target features are matched in the query set to determine the pedestrian retrieval result corresponding to the target image.
[0030] The present invention also provides a training device for a pedestrian search model based on a Siamese network, comprising:
[0031] The image determination module is used to determine the training images;
[0032] The first extraction module is used to extract features from the training image based on the lower branch of the Siamese network to obtain the feature vectors of each pedestrian in the training image.
[0033] The second extraction module is used to process the training image to obtain multi-scale pedestrian images, and to extract features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain the multi-scale feature vectors of each pedestrian in the training image.
[0034] The training optimization module is used to train the initial pedestrian search model based on the feature vector and the multi-scale features, and to obtain the trained pedestrian search model when the training is completed.
[0035] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the training method of the pedestrian search model based on twin networks as described above.
[0036] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method for a pedestrian search model based on a twin network as described above.
[0037] The present invention provides a training method, apparatus, electronic device, and storage medium for a pedestrian search model based on Siamese networks. During the training of the pedestrian search model, after the training image is determined, feature information is extracted by the upper and lower branches of the Siamese network. Specifically, the feature vectors of each pedestrian in the training image are obtained from the lower branch. After the training image is processed to obtain multi-scale pedestrian images, the multi-scale feature vectors of each pedestrian in the training image are obtained from the upper branch. Finally, the model is trained based on the obtained feature vectors and multi-scale feature vectors. This eliminates the need for manual sample labeling during training, reducing labeling costs. At the same time, scaling and other processing expand a small number of samples into more sample data, reducing sample requirements and improving the accuracy of pedestrian search at different scales.
Attached Figure Description
[0038] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0039] Figure 1 is a flowchart of the training method for the pedestrian search model based on Siamese networks provided by the present invention;
[0040] Figure 2 is a flowchart of the steps for obtaining feature vectors provided by the present invention;
[0041] Figure 3 is a flowchart of the steps for obtaining multi-scale feature vectors provided by the present invention;
[0042] Figure 4 is a block diagram of the Siamese-network-based pedestrian search system used for training provided by the present invention;
[0043] Figure 5 is a schematic diagram of the structure of the training device for the pedestrian search model based on Siamese networks provided by the present invention;
[0044] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0045] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0046] The training method for the pedestrian search model based on Siamese networks provided by this invention is described below with reference to Figures 1 to 4. Figure 1 is a flowchart of the training method. As shown in Figure 1, the method includes:
[0047] Step 101: Determine the training images.
[0048] When performing pedestrian search, a pre-trained relevant model can be used to process the data to complete the search. This relevant model can be obtained through pre-training and optimization. Specifically, during model training, the training images used for training are first determined, and these training images contain pedestrians.
[0049] For example, the obtained training image can be an RGB scene image, which carries information such as height, width, and channels. In this case, the RGB scene image used for training can be represented as $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C represent the height, width, and number of channels, respectively.
[0050] Step 102: Extract features from the training image based on the lower branch of the Siamese network to obtain the feature vectors of each pedestrian in the training image.
[0051] When developing the search model, an initial pedestrian search model is constructed. This model is based on a Siamese network, which consists of an upper branch and a lower branch. Both the upper and lower branches contain a backbone network and global pooling layers, while the lower branch also includes an object detection network. The backbone network in both the upper and lower branches can be constructed using ResNet50, and the object detection network in the lower branch consists of convolutional layers and fully connected layers.
[0052] In practical applications, during model training, image processing and parameter optimization are performed on the upper and lower branches to optimize and adjust the parameters of the entire model. However, when searching for pedestrians, the lower branch of the Siamese network is used for prediction and retrieval to complete the search for pedestrians.
[0053] Specifically, during training, features are first extracted from the determined training images based on the lower branches of the Siamese network to obtain the feature vectors of each pedestrian contained in the training images.
[0054] For example, when performing feature extraction to obtain the feature vectors of each pedestrian in the training image, it is first necessary to detect the pedestrians contained in the training image. After the positions of these pedestrians are determined, feature extraction is performed to obtain the feature vector of each pedestrian. The steps for obtaining the feature vector of each pedestrian are shown in Figure 2, which is a flowchart of the steps for obtaining feature vectors provided by the present invention; the steps include steps 201 to 202.
[0055] Step 201: Based on the pedestrian detection in the training image by the lower branch of the Siamese network, determine the pedestrian bounding box corresponding to each pedestrian in the training image.
[0056] Step 202: Determine the pedestrian positions in the training image based on the pedestrian bounding boxes, and extract features from the pedestrian positions based on the pedestrian bounding boxes to obtain the feature vectors of each pedestrian in the training image.
[0057] Specifically, when extracting features from the training image, pedestrian detection is first performed on the training image to determine the pedestrian bounding box corresponding to each pedestrian in the training image. Then, the pedestrian position corresponding to each pedestrian in the training image is determined based on the obtained pedestrian bounding box. Subsequently, features are extracted from the pedestrians corresponding to the pedestrian positions based on the obtained pedestrian bounding boxes to obtain the feature vector of each pedestrian in the training image.
[0058] For example, the training images used for training may contain many pedestrians; that is, a single training image may contain several pedestrians. When determining the feature vectors of pedestrians, pedestrian detection is used to identify the pedestrians in the training image, and then feature vectors representing each pedestrian are obtained through feature extraction. If the training image contains 50 pedestrians, then the number of feature vectors obtained will also be 50.
[0059] During feature extraction, the lower branch of the Siamese network is used for processing. At this time, the target detection network contained in the lower branch is used to detect pedestrians in the training image. Then, the feature vectors of each pedestrian in the training image are obtained based on the backbone network and the global pooling layer.
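The pooling step described above can be sketched as follows. This is a minimal illustration, assuming the detection network has already returned bounding boxes as (x1, y1, x2, y2) and that the backbone output is a small feature map held in nested lists; the names `global_average_pool` and `pedestrian_vectors` are hypothetical and do not come from the invention.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]
Box = Tuple[int, int, int, int]       # (x1, y1, x2, y2), with x2 and y2 exclusive


def global_average_pool(region: FeatureMap) -> List[float]:
    """Average every channel over all spatial positions of the region."""
    channels = len(region[0][0])
    count = len(region) * len(region[0])
    sums = [0.0] * channels
    for row in region:
        for cell in row:
            for c, value in enumerate(cell):
                sums[c] += value
    return [s / count for s in sums]


def pedestrian_vectors(fmap: FeatureMap, boxes: List[Box]) -> List[List[float]]:
    """Return one pooled feature vector per bounding box."""
    vectors = []
    for x1, y1, x2, y2 in boxes:
        # Crop the feature map to the box, then pool the crop to a vector.
        region = [row[x1:x2] for row in fmap[y1:y2]]
        vectors.append(global_average_pool(region))
    return vectors


# Toy 4x4 feature map with 2 channels: channel 0 = row index, channel 1 = column index.
fmap = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
boxes = [(0, 0, 2, 2), (2, 1, 4, 4)]  # assumed detector output
vectors = pedestrian_vectors(fmap, boxes)
```

The number of returned vectors equals the number of detected boxes, matching the statement that 50 detected pedestrians yield 50 feature vectors.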
[0060] Step 103: Process the training images to obtain multi-scale pedestrian images, and extract features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain the multi-scale feature vectors of each pedestrian in the training images.
[0061] When extracting features from the training image using the lower branch of the Siamese network, the upper branch of the Siamese network also performs corresponding processing. Specifically, when processing the upper branch of the Siamese network, the training image is first processed to obtain multi-scale pedestrian images corresponding to the training image. The multi-scale pedestrian images are a set of images of each pedestrian at different scales. Then, features are extracted from the obtained multi-scale pedestrian images using the upper branch of the Siamese network to obtain the multi-scale feature vectors corresponding to each pedestrian in the training image.
[0062] For example, the multi-scale pedestrian images are several images of the same pedestrian at different scales. If the training image contains 50 pedestrians, several images at different scales are obtained for each of them. With 3 scaling operations, the multi-scale pedestrian images of one pedestrian comprise 4 images (the original scale plus 3 scaled versions).
[0063] Then, features are extracted from the obtained multi-scale pedestrian images to obtain multi-scale feature vectors for pedestrians at multiple scales. In practice, one feature vector is extracted from each of the processed multi-scale pedestrian images. The feature vectors corresponding to the images of the same pedestrian at different scales are then processed to obtain the multi-scale feature vector of that pedestrian.
[0064] Furthermore, the process of processing the training images and then obtaining multi-scale feature vectors through feature extraction is shown in Figure 3, which is a flowchart of the steps for obtaining multi-scale feature vectors provided by the present invention; the steps include steps 301 to 303.
[0065] Step 301: Process the training images to obtain multi-scale pedestrian images;
[0066] Step 302: Extract features from multi-scale pedestrian images based on the upper branch of the Siamese network to obtain multi-scale feature vectors for each pedestrian.
[0067] Step 303: Perform feature fusion on the multi-scale feature vectors to obtain the multi-scale feature vectors of each pedestrian in the training image.
[0068] Specifically, when obtaining the multi-scale feature vectors of each pedestrian based on the training images, the training images are first processed accordingly to obtain multi-scale pedestrian images corresponding to the training images. Then, features are extracted from the multi-scale pedestrian images corresponding to each pedestrian to obtain the feature vectors corresponding to each pedestrian image. Finally, the feature vectors of the same pedestrian at different scales are fused to obtain the multi-scale feature vectors corresponding to that pedestrian.
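The fusion of per-scale vectors can be illustrated with a short sketch. The text does not specify the fusion operator, so the element-wise mean followed by L2 normalization used here is an assumption, and the function name `fuse_scales` is hypothetical.

```python
import math
from typing import List


def fuse_scales(scale_vectors: List[List[float]]) -> List[float]:
    """Fuse the vectors of one pedestrian at several scales into one vector.

    Assumption: fusion is an element-wise mean followed by L2 normalization.
    """
    dim = len(scale_vectors[0])
    mean = [sum(v[d] for v in scale_vectors) / len(scale_vectors) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean


# Two scales of the same pedestrian, each described by a 2-dimensional vector.
fused = fuse_scales([[6.0, 0.0], [0.0, 8.0]])
```

Other operators, such as concatenation or a learned weighting, would fit the same place in the pipeline.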
[0069] For example, before performing feature extraction to obtain multi-scale feature vectors for each pedestrian, the training image needs to be processed to obtain pedestrian images of each pedestrian at different scales. The processing of the training image includes: cropping the training image to obtain several pedestrian images, where each pedestrian image contains one pedestrian; masking the pedestrian images; and scaling the masked pedestrian images to obtain multi-scale pedestrian images.
[0070] In practical applications, image processing typically includes cropping, rotation, scaling, and masking. When processing training images, cropping is the first step to obtain several pedestrian images, each containing only one pedestrian. Then, the resulting pedestrian images are masked. Finally, the masked pedestrian images are scaled according to requirements to obtain pedestrian images at different scales. Thus, each cropped pedestrian image, after the above processing, will yield multiple pedestrian images at different scales.
[0071] It should be noted that when cropping the training image, it is necessary to determine the pedestrians and their positions contained in the training image. This can be determined by pre-marking, such as cropping all pedestrians in the input training image by marking the bounding box according to the ground truth, or by introducing a pedestrian detection module. There is no restriction here.
[0072] After cropping the training images, a masking process is first performed to remove interference information from the images. For example, a pedestrian parsing algorithm is used to distinguish between the foreground and background of pedestrians and remove background information interference so that the model can focus more on the foreground information of pedestrians.
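A minimal sketch of the masking step, assuming a pedestrian parsing algorithm has already produced a binary foreground mask for the cropped image (that algorithm is not shown, and `apply_foreground_mask` is a hypothetical helper name):

```python
from typing import List

Image = List[List[List[int]]]  # [row][col][channel], e.g. RGB values 0..255
Mask = List[List[int]]         # 1 = pedestrian foreground, 0 = background


def apply_foreground_mask(image: Image, mask: Mask) -> Image:
    """Zero out every background pixel and copy foreground pixels unchanged."""
    return [
        [list(pixel) if keep else [0] * len(pixel)
         for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


crop = [[[10, 20, 30], [40, 50, 60]],
        [[70, 80, 90], [15, 25, 35]]]
mask = [[1, 0],
        [0, 1]]  # assumed output of the parsing algorithm
masked = apply_foreground_mask(crop, mask)
```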
[0073] When performing scale transformation, all cropped pedestrian images can be scaled multiple times using a bilinear interpolation algorithm to obtain pedestrian images of the same person at the original scale and multiple pedestrian images at different scales. For example, performing three scaling operations will result in four pedestrian images. Of course, there is no limit to the number of scaling operations; it can be set according to actual needs.
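The scale transformation can be sketched as below for a single-channel image. The corner-aligned sampling convention and the scale factors 0.5, 1.5, and 2.0 are assumptions chosen for illustration; `resize_bilinear` and `multi_scale` are hypothetical names.

```python
from typing import List

Gray = List[List[float]]  # single-channel image, [row][col]


def resize_bilinear(img: Gray, out_h: int, out_w: int) -> Gray:
    """Resize with bilinear interpolation (corner-aligned sampling)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def multi_scale(img: Gray, factors: List[float]) -> List[Gray]:
    """Return the original image followed by one resized copy per factor."""
    h, w = len(img), len(img[0])
    scaled = [resize_bilinear(img, max(1, round(h * f)), max(1, round(w * f)))
              for f in factors]
    return [img] + scaled


img = [[0.0, 2.0],
       [4.0, 6.0]]
pyramid = multi_scale(img, [0.5, 1.5, 2.0])  # three scalings give four images
```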
[0074] It should also be noted that after cropping the training image, the order of masking and scaling operations can be changed. For example, masking can be performed first, followed by scaling, or scaling can be performed first, followed by masking. There are no specific restrictions.
[0075] Step 104: Train the pedestrian search model based on the feature vector and multi-scale feature vector, and obtain the trained pedestrian search model when the training is completed.
[0076] After obtaining the feature vectors and multi-scale feature vectors of each pedestrian in the training images, the model is trained and optimized. Specifically, training is performed based on the obtained feature vectors and multi-scale feature vectors, resulting in a trained pedestrian search model. To determine whether training is complete, a judgment can be made based on a set loss function. For example, if the loss value of the set loss function is less than a set threshold, training is considered complete; otherwise, training is considered incomplete.
[0077] In practice, during training, the scale invariance of features is learned by constraining the distance between the feature vector and the multi-scale feature vector, thereby enabling the adjustment and optimization of various parameters in the model.
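One way to express such a distance constraint is sketched below. The exact form of the constraint is not given in the text, so the mean squared Euclidean distance used here is an assumption, and `scale_consistency_loss` is a hypothetical name.

```python
from typing import List


def scale_consistency_loss(features: List[List[float]],
                           multi_scale_features: List[List[float]]) -> float:
    """Mean squared Euclidean distance between paired vectors.

    features[k] and multi_scale_features[k] are assumed to describe the
    same pedestrian, taken from the lower and upper branch respectively.
    """
    total = 0.0
    for f, g in zip(features, multi_scale_features):
        total += sum((a - b) ** 2 for a, b in zip(f, g))
    return total / len(features)


lower = [[1.0, 0.0], [0.0, 1.0]]
upper = [[1.0, 0.0], [0.0, 0.0]]
loss = scale_consistency_loss(lower, upper)
```

The loss is zero only when both branches produce the same vector for every pedestrian, which is the scale-invariant behavior the training aims for.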
[0078] For example, when determining whether training is complete based on the obtained feature vectors and multi-scale feature vectors, the loss value of the model at each stage of data processing is determined according to the constructed loss function, thereby determining whether training is complete. In the actual training process, different stages have different loss functions, specifically the foreground/background binary classification loss function $L_{cls}$, the pedestrian bounding box regression loss function $L_{loc}$, the multi-scale data augmentation loss function $L_{scale}$, and the contrastive loss function with pseudo-labels $L_{cluster}$. Summing these loss functions gives the loss function of the model during training.
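Written out, the summation takes the following form. The symbol $L$ for the total and the equal weighting of the four terms are assumptions, since the text only states that the loss functions are summed.

```latex
L = L_{cls} + L_{loc} + L_{scale} + L_{cluster}
```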
[0079] It should be noted that after obtaining the feature vectors and multi-scale feature vectors, each vector needs to be labeled, that is, the label corresponding to each vector needs to be determined to facilitate subsequent training processing. Determining the label of each vector includes: blending the feature vectors and multi-scale feature vectors to obtain a mixed feature vector; and clustering the mixed feature vectors to obtain several clusters and cluster labels, with each cluster corresponding to one cluster label.
[0080] Specifically, the feature vector is obtained based on the original training image and records the feature vector of each pedestrian. The multi-scale feature vector is obtained by image processing and feature extraction based on the original training image and also records the feature vector of each pedestrian. At this time, the two are mixed and the feature vector and multi-scale feature vector are placed in the same training set as the input for model training. This can highlight the features of pedestrians at different scales, so that pedestrian search can be completed more accurately in subsequent use.
[0081] When performing feature vector blending, since the same pedestrian corresponds to two vectors, vector fusion and selection can be performed. For example, if there are differences between the feature vector and the multi-scale feature vector of a pedestrian, the pedestrian's feature information can be determined through vector fusion. This feature information is represented by a feature vector. For instance, a new feature vector for the pedestrian can be determined by finding an intermediate value, or a new feature vector can be obtained through vector fusion. This ensures that in the resulting blended feature vector, each pedestrian corresponds to only one vector, which can be either the original feature vector or the multi-scale feature vector, or a new feature vector obtained through processing.
[0082] After the mixing process is completed and the mixed feature vector is obtained, clustering is performed. An unsupervised clustering algorithm can be used to cluster the vectors and obtain the label of each pedestrian. Multiple pedestrians can correspond to one label; that is, each cluster obtained by clustering corresponds to one cluster label.
[0083] For example, after clustering is completed and several clusters are obtained, these clusters can be further subdivided to make the clustering more accurate. For a given cluster, the similarity matrix between the pedestrians within that cluster is first obtained, and the classification result for each pedestrian is then obtained according to a threshold, where the classification result $y \in \mathbb{R}^{m \times m}$ can be as follows:
[0084]
[0085] Where i represents the i-th pedestrian, j represents the j-th pedestrian, $M_c$ represents the pedestrian features corresponding to cluster c, and m is the number of pedestrians in the cluster. y can be regarded as the adjacency matrix of an undirected graph. Pedestrians belonging to the same connected subgraph in y are regarded as the same sub-cluster, and the original cluster is further split according to the sub-cluster division, so that the obtained cluster labels are more accurate.
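The sub-cluster split can be sketched as a connected-components search over the thresholded similarity matrix. The similarity measure and threshold are assumptions here (cosine similarity and 0.8), and the function names are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def adjacency(features: List[List[float]], threshold: float) -> List[List[int]]:
    """y[i][j] = 1 when pedestrians i and j are more similar than the threshold."""
    m = len(features)
    return [[1 if cosine(features[i], features[j]) > threshold else 0
             for j in range(m)] for i in range(m)]


def split_into_subclusters(y: List[List[int]]) -> List[int]:
    """Label each pedestrian with the index of its connected component in y."""
    m = len(y)
    labels = [-1] * m
    current = 0
    for start in range(m):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one connected subgraph
            i = stack.pop()
            for j in range(m):
                if y[i][j] and labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Features of the m = 4 pedestrians that the first clustering put in cluster c.
cluster_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = adjacency(cluster_features, threshold=0.8)
sub_labels = split_into_subclusters(y)
```

In this toy cluster the first two and the last two pedestrians form separate connected subgraphs, so the cluster is split into two sub-clusters.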
[0086] Furthermore, after determining the labels of pedestrians and obtaining the feature vectors and multi-scale feature vectors, the correspondence between each feature vector and each multi-scale feature vector and the label is determined. Then, during training, the labels, feature vectors and multi-scale vectors are used together as training data to complete the training of the pedestrian search model.
[0087] During training, if training is deemed complete, a usable pedestrian search model is obtained. If training is deemed incomplete, further training is required until training ends. This continued training includes: updating the feature vectors and multi-scale features when training is deemed incomplete; and retraining based on the updated feature vectors and multi-scale features until a trained pedestrian search model is obtained upon completion.
[0088] Specifically, when it is determined that the training is not complete based on the feature vector and the multi-scale feature vector, further training is required. At this time, the feature vector and the multi-scale feature vector can be updated according to the set update momentum factor, and then the training can be carried out again based on the updated feature vector and the multi-scale feature vector until the training is completed and a trained pedestrian search model is obtained.
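A toy sketch of the momentum update combined with a loss-threshold stopping check. The update rule (a convex combination of stored and freshly extracted vectors), the momentum factor 0.5, the threshold 0.1, and the squared-distance stand-in for the loss are all assumptions for illustration; in real training the fresh features would be re-extracted by the model in every round.

```python
from typing import List


def momentum_update(stored: List[float], fresh: List[float],
                    momentum: float) -> List[float]:
    """Blend the stored vector with the fresh one: m*stored + (1-m)*fresh."""
    return [momentum * s + (1 - momentum) * f for s, f in zip(stored, fresh)]


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


stored = [1.0, 0.0]   # vector currently held in the feature storage unit
fresh = [0.0, 1.0]    # vector extracted in the current round (kept fixed here)
threshold = 0.1       # assumed loss threshold for "training complete"
rounds = 0

# Keep updating while the stand-in loss is not yet below the threshold.
while squared_distance(stored, fresh) >= threshold:
    stored = momentum_update(stored, fresh, momentum=0.5)
    rounds += 1
```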
[0089] Furthermore, after training is completed, the resulting pedestrian search model can be used to perform pedestrian search. The process of performing pedestrian search includes: determining the target image, performing pedestrian recognition and feature extraction on the target image to obtain the target features of each pedestrian in the target image; and matching the extracted target features in the query set to determine the pedestrian retrieval results corresponding to the target image.
[0090] When performing pedestrian search, the lower branch of the Siamese network is used for processing. By determining the target image, pedestrian recognition and feature extraction are performed on the target image to obtain the target features of the pedestrians contained in the target image. Then, based on the obtained target features, matching is performed in the corresponding query set to obtain the pedestrian retrieval results corresponding to the target image.
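The matching step can be sketched as a similarity ranking. Cosine similarity as the matching score and the identifiers `p1` to `p3` are assumptions; the text only states that the target features are matched in the query set.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def rank_matches(target: List[float], query_set: Dict[str, List[float]]) -> List[str]:
    """Return the identifiers in the query set, most similar to the target first."""
    return sorted(query_set,
                  key=lambda pid: cosine_similarity(target, query_set[pid]),
                  reverse=True)


# Hypothetical stored features and one target feature from the lower branch.
query_set = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
target_feature = [0.9, 0.1]
ranking = rank_matches(target_feature, query_set)
```

The first entry of `ranking` is the retrieval result for the target pedestrian; a similarity cutoff could be added to reject poor matches.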
[0091] As described above, training the system with a Siamese network yields a pedestrian search model that is the lower branch of the Siamese network. The training system is shown in Figure 4, which is a block diagram of the Siamese-network-based pedestrian search system used for training provided by the present invention. The system is divided into two branches, and a model that can be used for pedestrian search is obtained by training the two branches jointly.
[0092] Specifically, during training, for the input image, the upper and lower branches of the multi-scale Siamese network perform corresponding processing to obtain the features of pedestrians contained in the image. At the same time, the processing methods of the upper and lower branches are somewhat different.
[0093] When the lower branch is processing, it determines the pedestrians contained in the image through the backbone network and the detection network, and then obtains the features corresponding to each pedestrian through the pooling layer, which are then stored in the corresponding feature storage unit M2.
[0094] When the upper branch processes the image, it first performs cropping, scaling, and masking operations in the manner described above to obtain multi-scale pedestrian images. The backbone network is then used to obtain the features corresponding to each scale, and these features are fused to obtain the features corresponding to each pedestrian, which are stored in the corresponding feature storage unit M1.
[0095] As shown in the figure, the upper and lower branches will pass through the same backbone network and process the image based on the same backbone network. In the specific processing and construction, a backbone network can also be built for the upper and lower branches respectively, but the two backbone networks need to share parameters, that is, the network parameters in the two backbone networks must always be consistent.
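Parameter sharing between the two branches can be illustrated with a toy sketch in which both branches hold a reference to the same backbone object. The `Backbone` and `Branch` classes are hypothetical stand-ins (an element-wise scaling rather than ResNet50), used only to show that one parameter update is seen by both branches.

```python
from typing import List


class Backbone:
    """Toy backbone: multiplies each input element by a learned weight."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = list(weights)

    def forward(self, x: List[float]) -> List[float]:
        return [w * v for w, v in zip(self.weights, x)]


class Branch:
    """A branch that delegates feature extraction to its backbone."""

    def __init__(self, backbone: Backbone) -> None:
        self.backbone = backbone

    def extract(self, x: List[float]) -> List[float]:
        return self.backbone.forward(x)


shared = Backbone([1.0, 2.0])
upper = Branch(shared)  # both branches reference the same parameters
lower = Branch(shared)

shared.weights[0] = 3.0  # a single update is visible to both branches
upper_out = upper.extract([1.0, 1.0])
lower_out = lower.extract([1.0, 1.0])
```

If two separate backbone objects were built instead, their parameters would have to be synchronized after every update to stay consistent.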
[0096] During training, the features recorded in feature storage units M1 and M2 by the upper and lower branches, respectively, are used as training data. Specifically, the features stored in the two feature storage units are mixed. To avoid feature duplication, a clustering algorithm guided by multi-label classification can be used. The entire system is then optimized through iterative training, and the lower branch obtained after optimization can be used for pedestrian search. The specific training and data processing procedures are described above and are not repeated here.
[0097] In the method provided by this invention, when a pedestrian search model is trained, feature information is extracted by the upper and lower branches of the Siamese network after the training image is determined. Specifically, the lower branch is used to obtain the feature vectors of each pedestrian in the training image; after the training image is processed to obtain multi-scale pedestrian images, the upper branch is used to obtain the multi-scale feature vectors of each pedestrian in the training image. Finally, the model is trained based on the obtained feature vectors and multi-scale feature vectors. This method eliminates the need for manual labeling of samples during training, reducing labeling costs. It also augments a small number of samples through scaling and similar processing, which reduces the number of samples required and improves the accuracy of pedestrian search at different scales.
[0098] The training apparatus for the pedestrian search model based on Siamese networks provided by the present invention is described below. The training apparatus described below and the training method described above may be read with reference to each other.
[0099] Figure 5 is a schematic diagram of the structure of the training device for the pedestrian search model based on Siamese networks provided by the present invention. As shown in Figure 5, the training device 500 for the pedestrian search model based on Siamese networks includes:
[0100] Image determination module 501 is used to determine training images;
[0101] The first extraction module 502 is used to extract features from the training image based on the lower branch of the Siamese network to obtain the feature vectors of each pedestrian in the training image.
[0102] The second extraction module 503 is used to process the training image to obtain multi-scale pedestrian images, and to extract features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain the multi-scale feature vectors of each pedestrian in the training image.
[0103] Training and optimization module 504 is used to train the initial pedestrian search model based on feature vectors and multi-scale features, and obtain the trained pedestrian search model upon completion of training.
[0104] Based on the above embodiments, the first extraction module 502 is further configured to:
[0105] Based on the pedestrian detection in the training image by the lower branch of the Siamese network, determine the pedestrian bounding box corresponding to each pedestrian in the training image;
[0106] The positions of pedestrians in the training image are determined based on the pedestrian bounding boxes, and features are extracted from the pedestrian positions based on the pedestrian bounding boxes to obtain the feature vectors of each pedestrian in the training image.
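Extracting one feature vector per detected pedestrian from a bounding box can be sketched as below. The source mentions a pooling layer but does not specify it, so the average pooling over the box region used here is an assumption, and `roi_average_pool` is a hypothetical name.

```python
def roi_average_pool(feature_map, box):
    """Average the channel vectors of a feature map inside a bounding box.

    feature_map[y][x] is a list of channel values.
    box is (x1, y1, x2, y2) in feature-map cells, end-exclusive.
    Assumption: plain average pooling stands in for the pooling layer.
    """
    x1, y1, x2, y2 = box
    cells = [feature_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not cells:
        raise ValueError("empty bounding box")
    channels = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]


# A 2x2 feature map with two channels per cell.
fmap = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[5.0, 2.0], [7.0, 2.0]],
]
top_row_feature = roi_average_pool(fmap, (0, 0, 2, 1))
whole_map_feature = roi_average_pool(fmap, (0, 0, 2, 2))
```

Each bounding box produced by the detection step would yield one such vector, which is the per-pedestrian feature the lower branch stores.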
[0107] Based on the above embodiments, the second extraction module 503 is further configured to:
[0108] The training images are processed to obtain pedestrian images at multiple scales;
[0109] Based on the upper branch of the Siamese network, feature extraction is performed on multi-scale pedestrian images to obtain the multi-scale feature vectors corresponding to each pedestrian.
[0110] The feature vectors at multiple scales are fused to obtain the multi-scale features of each pedestrian in the training image.
[0111] Based on the above embodiments, the second extraction module 503 is further configured to:
[0112] The training images are cropped to obtain several pedestrian images, each containing one pedestrian.
[0113] The pedestrian images are masked, and the masked pedestrian images are scaled to obtain multi-scale pedestrian images.
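The crop, mask, and scale sequence can be sketched on a toy grayscale image stored as nested lists. The form of the mask is not specified in this passage, so the binary mask that zeroes out selected pixels is an assumption, as are the nearest-neighbour resizing, the scale factors, and all function names.

```python
def crop(image, box):
    """Cut out the region (x1, y1, x2, y2), end-exclusive, from a 2D image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def apply_mask(image, mask):
    """Keep pixels where mask is truthy and set the rest to 0 (assumed mask form)."""
    return [[p if m else 0 for p, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2D image to out_h x out_w."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def multi_scale(image, box, mask, scales=(0.5, 1.0, 2.0)):
    """Crop one pedestrian, mask it, then produce one image per scale factor."""
    person = apply_mask(crop(image, box), mask)
    h, w = len(person), len(person[0])
    return [resize_nearest(person, max(1, round(h * s)), max(1, round(w * s)))
            for s in scales]


# 4x4 toy image with pixel values 0..15; the pedestrian occupies the centre 2x2.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
pyramid = multi_scale(image, box=(1, 1, 3, 3), mask=[[1, 0], [1, 1]])
```

The order follows the text: the image is cropped to a single pedestrian, masked, and only then scaled, so every scale shares the same masked content.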
[0114] Based on the above embodiments, the training device for the pedestrian search model based on Siamese networks further includes a clustering module, used for:
[0115] The feature vector and the multi-scale feature vector are mixed to obtain mixed feature vectors;
[0116] Clustering is performed on the mixed feature vectors to obtain several clusters and cluster labels, with one cluster corresponding to one cluster label.
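Mixing the two sets of features and assigning cluster labels can be sketched as follows. The clustering used here is a simple stand-in: features whose cosine similarity exceeds a threshold are linked with union-find. It is not the multi-label-guided clustering of the invention, and the threshold value and function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(features, threshold=0.9):
    """Link features with similarity >= threshold; return one label per feature."""
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine(features[i], features[j]) >= threshold:
                parent[find(j)] = find(i)

    # Map each cluster root to a consecutive label 0, 1, 2, ...
    labels_by_root = {}
    return [labels_by_root.setdefault(find(i), len(labels_by_root))
            for i in range(len(features))]


# Toy contents of the two feature storage units for two pedestrians.
m2 = [[1.0, 0.0], [0.0, 1.0]]        # lower branch: per-pedestrian features
m1 = [[0.99, 0.05], [0.02, 0.98]]    # upper branch: fused multi-scale features
mixed = m2 + m1
labels = cluster(mixed)
```

Each distinct label plays the role of a cluster label, so features of the same pedestrian from both branches end up with the same training target without manual annotation.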
[0117] Based on the above embodiments, the training optimization module 504 is further configured to:
[0118] When it is determined that training is not yet complete, update the feature vector and multi-scale feature vector;
[0119] The model is trained again based on the updated feature vectors and multi-scale feature vectors until a trained pedestrian search model is obtained.
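The update-and-retrain cycle can be sketched as a control loop. Every callable passed in (`extract_lower`, `extract_upper`, `cluster`, `train_step`, `is_done`) is a hypothetical placeholder for the corresponding step described above, and the `max_rounds` safeguard is an assumption added so the sketch always terminates.

```python
def train_until_done(extract_lower, extract_upper, cluster, train_step,
                     is_done, max_rounds=10):
    """Repeat: refresh features, cluster, train; stop when is_done() is true.

    Returns the number of training rounds that were run.
    """
    rounds = 0
    while rounds < max_rounds:
        features = extract_lower()             # updated per-pedestrian features
        multi_scale_features = extract_upper() # updated multi-scale features
        mixed = features + multi_scale_features
        labels = cluster(mixed)                # one cluster label per feature
        train_step(mixed, labels)
        rounds += 1
        if is_done():
            break
    return rounds
```

A caller would supply real feature extractors and a convergence check; with stub callables the function simply counts rounds until the check succeeds or `max_rounds` is reached.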
[0120] Based on the above embodiments, the training device for the pedestrian search model based on Siamese networks further includes a pedestrian search module, used for:
[0121] The target image is determined, and pedestrian recognition and feature extraction are performed on the target image to obtain the target features of each pedestrian in the target image;
[0122] The extracted target features are matched against the query set to determine the pedestrian retrieval results corresponding to the target image.
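Matching extracted target features against a query set can be sketched as a nearest-neighbour lookup. The cosine similarity measure, the `min_score` threshold, the identifiers `person_a` and `person_b`, and the function names are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def match(target_features, query_set, min_score=0.5):
    """For each target feature, return the best-matching query id, or None.

    query_set maps a pedestrian id to its feature vector.
    A match is reported only if its similarity reaches min_score.
    """
    results = []
    for feature in target_features:
        best_id, best_score = None, min_score
        for pedestrian_id, query_feature in query_set.items():
            score = cosine(feature, query_feature)
            if score >= best_score:
                best_id, best_score = pedestrian_id, score
        results.append(best_id)
    return results


query_set = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
targets = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
retrieved = match(targets, query_set)
```

The third target has no sufficiently similar entry in the query set, so it is reported as unmatched rather than forced onto the nearest identity.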
[0123] The present invention provides a training device for a pedestrian search model based on Siamese networks. During the training of the pedestrian search model, after the training image is determined, feature information is extracted by the upper and lower branches of the Siamese network. Specifically, the feature vectors of each pedestrian in the training image are obtained from the lower branch; after the training image is processed to obtain multi-scale pedestrian images, the multi-scale feature vectors of each pedestrian in the training image are obtained from the upper branch. Finally, the model is trained based on the obtained feature vectors and multi-scale feature vectors. This eliminates the need for manual sample labeling during training, reducing labeling costs. It also augments a small number of samples through scaling and similar processing, which reduces the number of samples required and improves the accuracy of pedestrian search at different scales.
[0124] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 6, the electronic device may include a processor 610, a communication interface 620, a memory 630, and a communication bus 640. The processor 610, communication interface 620, and memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute a training method for a pedestrian search model based on a Siamese network. This method includes: determining a training image; extracting features from the training image based on the lower branch of the Siamese network to obtain feature vectors for each pedestrian in the training image; processing the training image to obtain multi-scale pedestrian images, and extracting features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain multi-scale feature vectors for each pedestrian in the training image; training based on the feature vectors and multi-scale feature vectors, and obtaining a trained pedestrian search model upon completion of training.
[0125] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0126] On the other hand, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, wherein when the program instructions are executed by a computer, the computer is able to execute the training method for the pedestrian search model based on the Siamese network provided by the above methods, the method comprising: determining a training image; extracting features from the training image according to the lower branch of the Siamese network to obtain feature vectors of each pedestrian in the training image; processing the training image to obtain multi-scale pedestrian images, and extracting features from the multi-scale pedestrian images according to the upper branch of the Siamese network to obtain multi-scale feature vectors of each pedestrian in the training image; training based on the feature vectors and the multi-scale feature vectors, and obtaining a trained pedestrian search model upon completion of training.
[0127] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the training method for a pedestrian search model based on a Siamese network provided by the methods described above. This method includes: determining a training image; extracting features from the training image based on the lower branch of the Siamese network to obtain feature vectors for each pedestrian in the training image; processing the training image to obtain multi-scale pedestrian images, and extracting features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain multi-scale feature vectors for each pedestrian in the training image; training based on the feature vectors and the multi-scale feature vectors, and obtaining a trained pedestrian search model upon completion of training.
[0128] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0129] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods of various embodiments or some parts of embodiments.
[0130] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A training method for a pedestrian search model based on Siamese networks, characterized in that it comprises: determining a training image; performing feature extraction on the training image based on the lower branch of the Siamese network to obtain the feature vector of each pedestrian in the training image; processing the training image to obtain multi-scale pedestrian images, and extracting features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain the multi-scale feature vector of each pedestrian in the training image; and training the pedestrian search model based on the feature vector and the multi-scale feature vector, and obtaining a trained pedestrian search model when the training is completed; wherein the training based on the feature vector and the multi-scale feature vector comprises: mixing the feature vector and the multi-scale feature vector to obtain mixed feature vectors; clustering the mixed feature vectors to obtain several clusters and cluster labels, with one cluster corresponding to one cluster label; and training based on the clusters and the cluster labels corresponding to the clusters.
2. The training method for the pedestrian search model based on Siamese networks according to claim 1, characterized in that, The step of extracting features from the training image based on the lower branch of the Siamese network to obtain the feature vector of each pedestrian in the training image includes: Based on the pedestrian detection in the training image by the lower branch of the Siamese network, determine the pedestrian bounding box corresponding to each pedestrian in the training image; The positions of pedestrians in the training image are determined based on the pedestrian bounding boxes, and features are extracted from the pedestrian positions based on the pedestrian bounding boxes to obtain the feature vectors of each pedestrian in the training image.
3. The training method for the pedestrian search model based on Siamese networks according to claim 1, characterized in that, The training images are processed to obtain multi-scale pedestrian images, and features are extracted from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain multi-scale feature vectors for each pedestrian in the training images, including: The training images are processed to obtain pedestrian images at multiple scales; Based on the upper branch of the Siamese network, feature extraction is performed on the multi-scale pedestrian images to obtain the multi-scale feature vectors corresponding to each pedestrian. The multi-scale feature vectors are fused to obtain the multi-scale features of each pedestrian in the training image.
4. The training method for the pedestrian search model based on Siamese networks according to claim 3, characterized in that, The process of processing the training images to obtain multi-scale pedestrian images includes: The training images are cropped to obtain several pedestrian images, each containing one pedestrian. The pedestrian image is masked, and the size of the masked pedestrian image is scaled to obtain multi-scale pedestrian images.
5. The training method for the pedestrian search model based on Siamese networks according to claim 1, characterized in that, After training the initial pedestrian search model based on the feature vector and the multi-scale features, the method further includes: When it is determined that training is not yet complete, the feature vector and the multi-scale feature vector are updated; The model is trained again based on the updated feature vector and the multi-scale feature vector until a trained pedestrian search model is obtained.
6. The training method for the pedestrian search model based on Siamese networks according to claim 1, characterized in that, The method further includes: A target image is determined, and pedestrian recognition and feature extraction are performed on the target image to obtain the target features of each pedestrian in the target image; The extracted target features are matched in the query set to determine the pedestrian retrieval result corresponding to the target image.
7. A training device for a pedestrian search model based on Siamese networks, characterized in that it comprises: an image determination module, used to determine a training image; a first extraction module, used to extract features from the training image based on the lower branch of the Siamese network to obtain the feature vector of each pedestrian in the training image; a second extraction module, used to process the training image to obtain multi-scale pedestrian images, and to extract features from the multi-scale pedestrian images based on the upper branch of the Siamese network to obtain the multi-scale feature vector of each pedestrian in the training image; and a training optimization module, used to train the initial pedestrian search model based on the feature vector and the multi-scale features, and to obtain the trained pedestrian search model upon completion of training, wherein the training comprises: mixing the feature vector and the multi-scale feature vector to obtain mixed feature vectors; clustering the mixed feature vectors to obtain several clusters and cluster labels, with one cluster corresponding to one cluster label; and training the pedestrian search model based on the clusters and their corresponding cluster labels, and obtaining a trained pedestrian search model upon completion of training.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the training method for the pedestrian search model based on Siamese networks as described in any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the training method for the pedestrian search model based on Siamese networks as described in any one of claims 1 to 6.