A top-mounted camera-based pedestrian gender recognition method and device
By employing a two-level attention calculation and a globally depthwise separable convolutional gender recognition method in the case of a top-mounted camera, combined with cross-entropy and center loss function, the problem of low accuracy and poor robustness of gender recognition under top-mounted camera conditions is solved, achieving efficient and accurate gender recognition results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAMEN MILESIGHT IOT CO LTD
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing pedestrian gender recognition technologies suffer from low accuracy and poor robustness in scenarios with top-mounted cameras. In particular, the variable posture of the head and shoulder area and the distortion of the camera imaging boundary lead to frequent misclassification of gender.
A pedestrian gender recognition method based on a top-mounted camera is adopted. Through two-level attention calculation and global depthwise separable convolution, multi-scale basic feature maps are extracted. Combined with the joint loss function of cross-entropy loss and center loss, the probability distribution of gender category is generated. Spatial weights are generated by using the distance between the pedestrian detection box position and the image center for multi-frame comprehensive decision-making.
It improves the accuracy and robustness of gender recognition, effectively overcomes the effects of occlusion, pose changes and boundary distortion, and ensures real-time, accurate and robust operation on embedded devices.
Figure CN122244908A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the fields of Internet of Things and intelligent video analytics, and specifically to a method and device for pedestrian gender recognition based on a top-mounted camera.
Background Technology
[0002] Pedestrian gender recognition is widely used in intelligent security, passenger flow analysis, and business intelligence. Most mainstream solutions rely on face detection and classify gender from features extracted from the face region. Such methods achieve high accuracy for frontal, close-up faces under good lighting. In practical deployments with top-mounted cameras, however, the camera is installed on the ceiling or at a high position and shoots downward. From this angle the face is easily occluded (for example by hair, hats, masks, or a lowered head), so face-detection-based gender recognition frequently fails. Some research has therefore proposed gender recognition based on head and shoulder features to avoid the face occlusion problem. With top-mounted cameras, these schemes still face the following technical challenges:
Variation in head and shoulder posture: Under a top-view camera, the head and shoulder area exhibits many posture changes (such as turning the head, bending over, or turning sideways), which makes the extracted head and shoulder features unstable and leads to gender misclassification.
[0003] Camera imaging boundary distortion: Images from a top-mounted camera are significantly distorted near the edges, which geometrically deforms the head and shoulder area of pedestrians located there and further degrades gender classification accuracy.
[0004] In summary, existing gender recognition technologies suffer from low accuracy and poor robustness in top-mounted camera scenarios. A gender recognition method is therefore needed that adapts to the top-down camera perspective and effectively overcomes the effects of occlusion, pose changes, and boundary distortion.
Summary of the Invention
[0005] To improve the accuracy and robustness of gender recognition, in a first aspect, embodiments of this application provide a pedestrian gender recognition method based on a top-mounted camera, the method comprising:
acquiring an indoor scene video stream captured by the top-mounted camera;
inputting image frames from the indoor scene video stream into a gender recognition model to extract basic feature maps at multiple different scales;
using the high-level feature map among the basic feature maps as the key vector and value vector, and the low-level feature map among the basic feature maps as the query vector, and performing two-level attention calculation based on the key vector, value vector, and query vector to obtain a first intermediate feature map, wherein the two-level attention calculation includes a coarse attention stage calculation and a sparse fine attention stage calculation;
performing a spatially position-wise depthwise convolution operation on the first intermediate feature map through global depthwise separable convolution, so as to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map;
activating the second intermediate feature map with an activation function and outputting the probability distribution over gender categories, wherein the gender categories include male, female, and unknown;
taking the gender category with the highest probability in the probability distribution as the predicted label of the current frame, and taking that probability as the predicted probability;
generating a spatial weight based on the distance between the pedestrian detection box position and the image center in the current frame, and fusing the spatial weight with the predicted probability by weighting to obtain the frame-level confidence of the current frame for the predicted label;
calculating, based on the frame-level confidences of multiple frames and their corresponding predicted labels, the temporal average confidence of each gender category, and taking the gender category with the highest temporal average confidence as the gender recognition result.
[0006] In one possible implementation, the step of using the high-level feature map among the basic feature maps as the key vector and value vector, and the low-level feature map among the basic feature maps as the query vector, and performing two-level attention calculation based on the key vector, value vector, and query vector to obtain a first intermediate feature map includes:
performing matrix multiplication on the query vector and the key vector to calculate the global attention score;
normalizing the global attention score to obtain the weight matrix;
weighting and summing the value vectors with the weight matrix to obtain the coarse attention matrix;
obtaining the similarity matrix from the average value of each column of the weight matrix;
selecting the top-k similarity vectors from the similarity matrix, and obtaining the corresponding sparse key vectors and sparse value vectors from the key vectors and value vectors according to the indices of the top-k similarity vectors;
performing matrix multiplication on the query vector and the sparse key vectors to calculate the fine attention score;
weighting and summing the sparse value vectors with the normalized fine attention score to obtain the first intermediate feature map.
[0007] In one possible implementation, the gender recognition model employs a joint loss function that combines cross-entropy loss and center loss. The cross-entropy loss is $L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$, where $C$ is the number of gender categories, $y_i$ indicates the actual gender category, and $\hat{y}_i$ is the predicted probability distribution.
[0008] The center loss is $L_{C} = \frac{1}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$, where $x_i$ is the i-th second intermediate feature map, $c_{y_i}$ is the center vector of category $y_i$, and $m$ is the total number of samples. The two losses are combined with a weight $\lambda$, which is dynamically adjusted during model training.
[0009] In one possible implementation, the spatial weight decreases as the distance between the pedestrian detection box position and the image center increases in the current frame.
[0010] In one possible implementation, the method further includes: The center vector is initialized as a high-dimensional vector generated using a standard normal distribution; During model training, the goal is to minimize the center loss, and the center vectors of each gender category are updated using the backpropagation mechanism.
[0011] In a second aspect, embodiments of this application provide a pedestrian gender recognition device based on a top-mounted camera, the device comprising:
an acquisition module, used to acquire the indoor scene video stream captured by the top-mounted camera;
a feature extraction module, used to input image frames from the indoor scene video stream into the gender recognition model and extract basic feature maps at multiple different scales;
an attention calculation module, used to take the high-level feature map among the basic feature maps as the key vector and value vector, and the low-level feature map among the basic feature maps as the query vector, and to perform two-level attention calculation based on the key vector, value vector, and query vector to obtain the first intermediate feature map, wherein the two-level attention calculation includes a coarse attention stage calculation and a sparse fine attention stage calculation;
a convolution module, used to perform a spatially position-wise depthwise convolution operation on the first intermediate feature map through global depthwise separable convolution, so as to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map;
an output module, used to activate the second intermediate feature map with an activation function and output the probability distribution over gender categories, wherein the gender categories include male, female, and unknown;
a prediction module, used to take the gender category with the highest probability in the probability distribution as the predicted label of the current frame, and to take that probability as the predicted probability;
a fusion module, used to generate a spatial weight based on the distance between the pedestrian detection box position and the image center in the current frame, and to fuse the spatial weight with the predicted probability to obtain the frame-level confidence of the current frame for the predicted label;
a decision module, used to calculate the temporal average confidence of each gender category based on the frame-level confidences of multiple frames and their corresponding predicted labels, and to take the gender category with the highest temporal average confidence as the gender recognition result.
[0012] In one possible implementation, the attention calculation module is specifically used for:
performing matrix multiplication on the query vector and the key vector to calculate the global attention score;
normalizing the global attention score to obtain the weight matrix;
weighting and summing the value vectors with the weight matrix to obtain the coarse attention matrix;
obtaining the similarity matrix from the average value of each column of the weight matrix;
selecting the top-k similarity vectors from the similarity matrix, and obtaining the corresponding sparse key vectors and sparse value vectors from the key vectors and value vectors according to the indices of the top-k similarity vectors;
performing matrix multiplication on the query vector and the sparse key vectors to calculate the fine attention score;
weighting and summing the sparse value vectors with the normalized fine attention score to obtain the first intermediate feature map.
[0013] In one possible implementation, the gender recognition model employs a joint loss function that combines cross-entropy loss and center loss. The cross-entropy loss is $L_{CE} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$, where $C$ is the number of gender categories, $y_i$ indicates the actual gender category, and $\hat{y}_i$ is the predicted probability distribution.
[0014] The center loss is $L_{C} = \frac{1}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$, where $x_i$ is the i-th second intermediate feature map, $c_{y_i}$ is the center vector of category $y_i$, and $m$ is the total number of samples. The two losses are combined with a weight $\lambda$, which is dynamically adjusted during model training.
[0015] In one possible implementation, the spatial weight decreases as the distance between the pedestrian detection box position and the image center increases in the current frame.
[0016] In a third aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the above pedestrian gender recognition methods based on a top-mounted camera.
[0017] The pedestrian gender recognition method based on a top-mounted camera provided in this application includes: acquiring an indoor scene video stream captured by the top-mounted camera; inputting image frames from the indoor scene video stream into a gender recognition model to extract multiple basic feature maps at different scales; using high-level feature maps from the basic feature maps as key vectors and value vectors, and low-level feature maps from the basic feature maps as query vectors; performing two-level attention calculations based on the key vectors, value vectors, and query vectors to obtain a first intermediate feature map, wherein the two-level attention calculations include a coarse attention stage calculation and a sparse fine attention stage calculation; performing a spatially position-wise depthwise convolution operation on the first intermediate feature map through global depthwise separable convolution to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map; activating the second intermediate feature map with an activation function and outputting the probability distribution over gender categories, which include male, female, and unknown; taking the gender category with the largest probability in the probability distribution as the predicted label of the current frame, and taking that probability as the predicted probability; generating spatial weights based on the distance between the pedestrian detection box position and the image center in the current frame, and fusing the spatial weights with the predicted probability to obtain the frame-level confidence of the current frame for the predicted label; and calculating, based on the frame-level confidences of multiple frames and their corresponding predicted labels, the temporal average confidence of each gender category, and taking the gender category with the largest temporal average confidence as the gender recognition result. The scheme of this application improves the accuracy and robustness of gender recognition.
Attached Figure Description
[0018] Figure 1 is a schematic flowchart of a pedestrian gender recognition method based on a top-mounted camera provided in an embodiment of this application;
Figure 2 is another flowchart of the pedestrian gender recognition method based on a top-mounted camera provided in an embodiment of this application;
Figure 3 is a schematic diagram of the PSTM module;
Figure 4 is a schematic diagram of the processing flow of the PSTM module;
Figure 5 is a schematic diagram showing the result of assigning weights to different spatial regions of a 7×7 feature map using the GDSC module.
Detailed Implementation
[0019] The present invention will be described in detail below through embodiments.
[0023] To improve the accuracy and robustness of gender recognition, in a first aspect and referring to Figure 1, this application provides a pedestrian gender recognition method based on a top-mounted camera, the method comprising:
S101, acquire the indoor scene video stream captured by the top-mounted camera.
[0024] Top-mounted cameras are generally top-mounted binocular cameras, and their mounting height needs to be between 1.9m and 3.5m.
[0025] S102, input the image frames of the indoor scene video stream into the gender recognition model, and extract multiple basic feature maps at different scales.
[0026] Considering the core requirements of user-friendliness and real-time performance for embedded device deployments, this application uses MobileNetV2 as the backbone network. After inputting image frames from the scene video stream into the gender recognition model, MobileNetV2 sequentially extracts five basic feature maps of different scales: a first basic feature map, a second basic feature map, a third basic feature map, a fourth basic feature map, and a fifth basic feature map. The first, second, third, fourth, and fifth basic feature maps have progressively decreasing resolutions.
[0027] S103, the high-level feature map in the basic feature map is used as the key vector and value vector, and the low-level feature map in the basic feature map is used as the query vector. Two-level attention calculation is performed based on the key vector, value vector and query vector to obtain the first intermediate feature map; wherein, the two-level attention calculation includes coarse attention stage calculation and sparse fine attention stage calculation.
[0028] High-level feature maps are deep feature maps among basic feature maps of different scales. Specifically, in this application, the fifth basic feature map is used as a high-level feature map. Low-level feature maps are shallow feature maps among basic feature maps of different scales. Specifically, in this application, the fourth basic feature map is used as a low-level feature map.
[0029] The specific process of two-level attention calculation includes:
performing matrix multiplication on the query vector and the key vector to calculate the global attention score;
normalizing the global attention score to obtain the weight matrix;
weighting and summing the value vectors with the weight matrix to obtain the coarse attention matrix;
obtaining the similarity matrix from the average value of each column of the weight matrix;
selecting the top-k similarity vectors from the similarity matrix, and obtaining the corresponding sparse key vectors and sparse value vectors from the key vectors and value vectors according to the indices of the top-k similarity vectors;
performing matrix multiplication on the query vector and the sparse key vectors to calculate the fine attention score;
weighting and summing the sparse value vectors with the normalized fine attention score to obtain the first intermediate feature map.
[0030] Coarse attention stage: In this stage, the high-level feature map serves as the keys (K) and values (V) of the attention mechanism, and the low-level feature map serves as the queries (Q). Attention weighting over Q, K, and V generates the coarse attention matrix $O_{coarse}$. This hierarchical attention strategy achieves cross-scale feature fusion efficiently and also significantly reduces the computational complexity of the model.
[0031] $O_{coarse} = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$, where $\sqrt{d}$ is the scaling factor and $d$ is the feature dimension.
[0032] Sparse fine attention stage: This stage builds on the coarse attention stage. It averages each column of the attention weight matrix to obtain the similarity matrix, then uses a top-k selection mechanism to pick the indices of the regions with high information density. $K_{fine}$ and $V_{fine}$ are gathered from the original K and V using these indices, and fine-grained attention weighting is then performed to produce the sparse fine attention matrix $O_{fine}$. This extracts key semantic features accurately while preserving the spatial details of the feature map.
[0033] $O_{fine} = \mathrm{softmax}\left(\frac{QK_{fine}^{T}}{\sqrt{d}}\right)V_{fine}$, where $\sqrt{d}$ is the scaling factor and $d$ is the feature dimension.
[0034] The following explains the specific implementation of each step. For convenience, specific numerical values are used in the example:
Low-level feature map: spatial size 14×14, number of positions 14×14 = 196;
High-level feature map: spatial size 7×7, number of positions 7×7 = 49;
The feature dimension is set to d = 128.
Step 1: Generate Q, K, V.
Q = low-level feature map, shape (196, 128);
K = high-level feature map, shape (49, 128);
V = high-level feature map, shape (49, 128).
Step 2: Coarse attention.
Calculate the attention score matrix $S = QK^{T}/\sqrt{d}$. Its shape is (196, 49), and each entry is the raw correlation between one low-level position and one high-level position.
Apply softmax to each row (each low-level position) so that every row sums to 1. This gives the attention weight matrix $A = \mathrm{softmax}(S)$, which still has shape (196, 49).
Use the weight matrix to form a weighted sum of V, giving the coarse attention output $O_{coarse} = AV$ with shape (196, 128). Each low-level position thus aggregates the global high-level semantics.
[0035] Step 3: Sparse fine attention. The goal of this step is to select the most important high-level positions and then perform a second attention calculation only on these positions.
[0036] The importance vector is calculated based on the weight matrix. The average of each column of the attention weight matrix is calculated (i.e., the average dependence of all low-level positions for each high-level position). The result is a vector of length 49, where each element represents the importance of the corresponding high-level position.
[0037] Select the k most important positions from the 49 high-level positions (let k = 10), and obtain their indices. These k indices point to the most important high-level positions.
[0038] Take the corresponding rows from the original K and V: $K_{fine}$ has shape (10, 128) and $V_{fine}$ has shape (10, 128).
[0039] Calculate the fine attention scores $QK_{fine}^{T}/\sqrt{d}$, with shape (196, 10); apply softmax to each row (each low-level position) and then use the result to form a weighted sum of $V_{fine}$. This yields the final output of sparse fine attention, with shape (196, 128). Although it has the same dimensions as the output of the coarse attention stage, it is recomputed from only the most important high-level positions and therefore focuses more on key information.
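The three steps of the worked example can be sketched in NumPy as follows. This is a minimal illustration, not the application's implementation: the random inputs stand in for real feature maps, the function names are hypothetical, and the scaling by the square root of the feature dimension is an assumption based on standard scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_attention(Q, K, V, k=10):
    """Coarse attention followed by sparse fine attention (illustrative sketch)."""
    d = Q.shape[-1]
    # Coarse stage: every low-level position attends to every high-level position.
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (196, 49), rows sum to 1
    o_coarse = weights @ V                             # (196, 128)
    # Importance of each high-level position: column-wise mean of the weights.
    importance = weights.mean(axis=0)                  # (49,)
    top_idx = np.argsort(importance)[::-1][:k]         # indices of the k largest
    # Fine stage: attention restricted to the selected high-level positions.
    K_fine, V_fine = K[top_idx], V[top_idx]            # (k, 128) each
    fine_weights = softmax(Q @ K_fine.T / np.sqrt(d), axis=-1)  # (196, k)
    o_fine = fine_weights @ V_fine                     # (196, 128)
    return o_coarse, o_fine, top_idx

# Random stand-ins for the 14x14 low-level and 7x7 high-level feature maps.
rng = np.random.default_rng(0)
Q = rng.standard_normal((196, 128))
K = rng.standard_normal((49, 128))
V = rng.standard_normal((49, 128))
O_coarse, O_fine, idx = two_level_attention(Q, K, V, k=10)
print(O_coarse.shape, O_fine.shape, idx.shape)
```

In this sketch the fine stage costs a 196×10 score matrix instead of 196×49, which is where the sparsity saves computation.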
[0040] S104, the first intermediate feature map is subjected to a spatially position-wise depth convolution operation through global depthwise separable convolution to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map.
[0041] While achieving the same feature aggregation effect as Global Average Pooling (GAP), the first intermediate feature map is compressed into a second intermediate feature map. In addition, by assigning differentiated weights to different spatial regions of the first intermediate feature map, the model can adaptively focus on and learn core semantic features.
[0042] Global depthwise separable convolution consists of two steps: depthwise convolution and pointwise convolution. The process is illustrated below with specific values:
Step 1: Depthwise convolution. The input feature map is 7×7×C (C channels, each a 7×7 plane). A separate 7×7 convolution kernel is applied to each channel. Because the kernel size equals the spatial size, a single convolution covers the entire 7×7 region. The kernel parameters of each channel are independent, so different channels can learn different spatial weight distributions.
[0043] Step 2: Pointwise convolution. The outputs of the C channels are linearly combined by a 1×1 convolution, keeping the output at 1×1×C (space compressed to 1×1, channel count unchanged). Since each channel has its own 7×7 kernel, different channels can focus on different spatial regions. For example, one channel might focus on the central area of the 7×7 map (to capture the face), another on the upper half (to capture hairstyle features), and a third on the left and right sides (to capture shoulder contours). Each channel can weight the 7×7 region differently, which is much more flexible than global average pooling, where every position in every channel gets the same weight.
[0044] Assume the input is a 7×7×256 feature map. Global Average Pooling (GAP) computes, for each channel, the arithmetic mean of its 49 values: all 49 positions have a fixed weight of 1/49, and the weight distribution is identical across channels. In contrast, the proposed scheme has a learnable 7×7 weight matrix (49 parameters) for each channel. Channel 1 might learn [0.01, 0.01, ..., 0.05 (higher at the center), ...], and Channel 2 might learn [0.02, 0.02, ..., 0.03 (higher in the upper half), ...]. Each channel can attend to the 49 positions to different degrees, and the weights are learned automatically through training.
[0045] In the scenario of a top-mounted camera, effective features for gender discrimination are distributed in different parts of the head and shoulder region. Hairstyle information is mainly on the top and sides of the head, facial contours are mainly in the central area, shoulder width is mainly on the left and right sides, and clothing features may be distributed throughout the upper body. If all channels use the same uniform weight (GAP), these differentiated spatial information will be mixed together. The solution in this application allows different channels to focus on different spatial regions, and then combines these channel-level spatial specializations through pointwise convolution to form a richer global feature representation.
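The relationship between GAP and the global depthwise separable convolution can be illustrated with a short NumPy sketch. The function names are hypothetical, bias terms are omitted, and the weights here are set by hand rather than learned. The sketch shows that GAP corresponds to the special case where every depthwise kernel is a uniform 1/49 and the pointwise convolution is the identity.

```python
import numpy as np

def global_average_pooling(x):
    # x: (H, W, C) -> (C,), every position weighted by 1 / (H * W).
    return x.mean(axis=(0, 1))

def global_depthwise_separable_conv(x, depthwise_kernels, pointwise_weights):
    """Illustrative sketch, no bias terms.

    x:                 (H, W, C) input feature map
    depthwise_kernels: (H, W, C), one H x W kernel per channel
    pointwise_weights: (C, C_out), the 1x1 convolution as a matrix
    """
    # Depthwise step: a kernel as large as the map yields one value per channel.
    per_channel = (x * depthwise_kernels).sum(axis=(0, 1))   # (C,)
    # Pointwise step: linear combination across channels.
    return per_channel @ pointwise_weights                   # (C_out,)

rng = np.random.default_rng(0)
H, W, C = 7, 7, 256
x = rng.standard_normal((H, W, C))

# Uniform kernels plus an identity pointwise step reproduce GAP exactly.
uniform_kernels = np.full((H, W, C), 1.0 / (H * W))
gdsc_out = global_depthwise_separable_conv(x, uniform_kernels, np.eye(C))
gap_out = global_average_pooling(x)
print(np.allclose(gdsc_out, gap_out))
```

With trained, non-uniform kernels each channel would instead emphasize its own spatial region, which is the flexibility described in the paragraphs above.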
[0046] S105, after activating the second intermediate feature map through an activation function, output the probability distribution of the gender category, wherein the gender category includes male, female, and unknown.
[0047] The activation function can be the softmax function.
[0048] S106, obtain the gender category with the largest probability value in the probability distribution as the predicted label of the current frame, and use this probability value as the predicted probability.
[0049] S107, generate spatial weights based on the distance between the pedestrian detection box position and the image center in the current frame, and fuse the spatial weights with the predicted probability to obtain the frame-level confidence of the current frame with respect to the predicted label.
[0050] The spatial weight decreases as the distance between the pedestrian detection box position and the image center in the current frame increases.
[0051] Because distortion at the camera imaging boundary can significantly interfere with the gender classification result, the position of the detected person in the image is also taken into account. When the person is near the center of the image, the image is less affected by distortion, so the corresponding spatial weight can be increased. When the person is near the edge of the image, the image is more affected by distortion, so the corresponding spatial weight can be decreased.
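One way such a spatial weight could be computed is sketched below. The text only requires that the weight decrease with distance from the image center, so the linear falloff, the `min_weight` floor, the multiplicative fusion, and all function names are assumptions made for illustration.

```python
import math

def spatial_weight(box, image_size, min_weight=0.5):
    """Weight in [min_weight, 1] that falls linearly with distance from center.

    box:        (x1, y1, x2, y2) pedestrian detection box in pixels
    image_size: (width, height) of the frame
    min_weight: hypothetical weight assigned at the image corner
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    box_cx, box_cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    dist = math.hypot(box_cx - width / 2.0, box_cy - height / 2.0)
    max_dist = math.hypot(width / 2.0, height / 2.0)   # center-to-corner distance
    ratio = min(dist / max_dist, 1.0)
    return 1.0 - (1.0 - min_weight) * ratio

def frame_confidence(pred_prob, box, image_size):
    # Assumed fusion rule: spatial weight multiplied by the predicted probability.
    return spatial_weight(box, image_size) * pred_prob

# Same predicted probability, one box at the center and one at the corner.
center_conf = frame_confidence(0.9, (300, 220, 340, 260), (640, 480))
edge_conf = frame_confidence(0.9, (0, 0, 40, 40), (640, 480))
print(center_conf, edge_conf)
```

A smoother falloff (for example Gaussian) or one calibrated to the lens distortion profile would satisfy the same monotonic requirement.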
[0052] S108: Based on the frame-level confidence scores of multiple frames and their corresponding predicted label categories, the temporal average confidence scores of each gender category are calculated, and the gender category with the highest temporal average confidence score is taken as the gender recognition result.
[0053] Referring to Table 1, the output of this model is the probability distribution of three categories: male, female, and unknown. First, top-1 inference is performed on the single-frame prediction results to obtain the predicted category label and corresponding probability for the current frame. Second, spatial weights based on the distance between the pedestrian detection box position and the image center are calculated and weighted with the predicted probabilities to obtain frame-level confidence. Finally, the temporal average confidence of the three categories over multiple consecutive frames is calculated, and the category with the largest mean is selected as the final gender determination.
[0054] male_avg = (0.544 + 0.711 + 0.837) / 3 ≈ 0.697; female_avg = 0.2709 / 1 = 0.2709; unknown_avg = 0.14 / 1 = 0.14; max(unknown_avg, female_avg, male_avg) = male_avg ≈ 0.697, so the gender recognition result is male.
[0055] Table 1 Multi-frame Integrated Decision Table
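The multi-frame decision can be sketched in plain Python as follows. The frame-level confidences are the five values used in the calculation above; the function name and the order of the frames are assumptions, since the table itself is not reproduced here.

```python
from collections import defaultdict

def temporal_decision(frames):
    """Average frame-level confidence per predicted label and pick the largest.

    frames: list of (predicted_label, frame_level_confidence) pairs
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, confidence in frames:
        sums[label] += confidence
        counts[label] += 1
    # Each category is averaged only over the frames that predicted it.
    averages = {label: sums[label] / counts[label] for label in sums}
    best = max(averages, key=averages.get)
    return best, averages

frames = [
    ("male", 0.544),
    ("male", 0.711),
    ("male", 0.837),
    ("female", 0.2709),
    ("unknown", 0.14),
]
result, averages = temporal_decision(frames)
print(result, averages)
```

Because each average is taken over the frames carrying that label, a single low-confidence female or unknown frame cannot outvote several confident male frames.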
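[0057] In one example, the gender recognition model employs a joint loss function that combines the cross-entropy loss $L_{CE}$ and the center loss $L_C$ through a weight $\lambda$. The cross-entropy loss is

$$L_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),$$

where $C$ is the number of gender categories, $y_i$ is the $i$-th entry of the one-hot label of the actual gender category, and $\hat{y}_i$ is the corresponding entry of the predicted probability distribution.
[0058] The center loss is

$$L_C = \frac{1}{2}\sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2,$$

where $x_i$ is the $i$-th second intermediate feature map, $c_{y_i}$ is the center vector of the category $y_i$ to which the $i$-th sample belongs, $m$ is the total number of samples, and $\lambda$ is the weight, which is dynamically adjusted during model training.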
[0059] To improve the overall performance of the gender recognition task, this application employs a joint loss function consisting of Cross Entropy Loss and Center Loss. Optimizing the two terms together makes the feature distribution of each gender category compact within a class and separated between classes, thereby enhancing the model's discriminative ability. The weights can be adjusted linearly with the training epoch: the center loss should not be weighted heavily in the early stages of training, because the initial feature distribution is chaotic and forcibly concentrating features around the class centers hinders classification convergence. A common approach is to start with a larger weight on the cross-entropy term, so that cross-entropy dominates and classification is learned first, and then to gradually decrease that weight while increasing the weight of the center loss to compact intra-class features. Alternatively, the weights can be adjusted dynamically based on the ratio of the loss values: the order-of-magnitude difference between the cross-entropy loss and the center loss is monitored in real time, and the weights are adjusted so that the two terms contribute equally. Another approach is to adjust based on validation set accuracy: the gender classification accuracy on the validation set is evaluated every epoch (or every few epochs); if the accuracy stagnates, the cross-entropy weight is reduced appropriately so that the center loss plays a greater role in compacting features; if the accuracy decreases, the cross-entropy weight is increased to return to cross-entropy dominance.
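As a rough illustration of the epoch-based linear adjustment described above, the sketch below lowers the cross-entropy weight linearly while raising the center-loss weight. The function names, the start and end values, and the convention that the two weights sum to one are assumptions chosen for the example and are not taken from the application.

```python
def linear_loss_weights(epoch, total_epochs, ce_start=1.0, ce_end=0.5):
    """Return (cross_entropy_weight, center_loss_weight) for a given epoch.

    The cross-entropy weight moves linearly from ce_start to ce_end;
    the center-loss weight is its complement (assumed convention).
    """
    if total_epochs < 2:
        progress = 1.0
    else:
        progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    ce_weight = ce_start + (ce_end - ce_start) * progress
    center_weight = 1.0 - ce_weight
    return ce_weight, center_weight


def joint_loss(ce_loss, center_loss, epoch, total_epochs):
    """Weighted sum of the two loss values for the current epoch."""
    ce_weight, center_weight = linear_loss_weights(epoch, total_epochs)
    return ce_weight * ce_loss + center_weight * center_loss


for epoch in (0, 5, 10):
    print(epoch, linear_loss_weights(epoch, total_epochs=11))
```

At epoch 0 only cross-entropy contributes, which matches the stated goal of learning classification before compacting the intra-class features.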
[0060] The gender category label is encoded as a one-hot vector whose entries are the $y_i$. During model training, the data categories are clearly defined; currently, there are three categories ["male", "female", "unknown"]. If the current image's category is "male", it is represented by the vector [1, 0, 0]; if the category is "female", it is represented by [0, 1, 0]. After the second intermediate feature map is obtained, a linear transformation is applied to obtain a three-dimensional vector $\hat{y}$ representing the model's predicted probabilities for each category.
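The one-hot encoding and the cross-entropy term can be sketched as follows. The toy 4-dimensional feature, the weight matrix, and the helper names are hypothetical, and the softmax normalization after the linear transformation is assumed here so that the three outputs form a probability distribution.

```python
import math

CLASSES = ["male", "female", "unknown"]


def one_hot(label):
    """Encode a category name as a one-hot vector over CLASSES."""
    return [1.0 if name == label else 0.0 for name in CLASSES]


def linear(features, weights, bias):
    """Map a feature vector to one logit per category."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


# Toy values standing in for the second intermediate feature map.
features = [0.5, -1.0, 2.0, 0.1]
weights = [[0.2, -0.1, 0.4, 0.0],
           [-0.3, 0.2, 0.1, 0.5],
           [0.0, 0.0, -0.2, 0.1]]
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(features, weights, bias))
loss = cross_entropy(one_hot("male"), probs)
print(probs, loss)
```

Because the target is one-hot, the loss reduces to the negative log-probability assigned to the true category.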
[0061] In one example, the method further includes: initializing the center vector as a high-dimensional vector generated from a standard normal distribution; and, during model training, updating the center vector of each gender category through the backpropagation mechanism with the goal of minimizing the center loss.
[0062] The current design uses three classes: "male", "female", and "unknown". Each class has a class center, which can be represented by a high-dimensional vector, resulting in three center vectors. At the beginning of model training, three learnable class centers are defined. If the second intermediate feature map is 1024-dimensional, then each class center is also 1024-dimensional. Each center is generally initialized, following common industry practice, as a 1024-dimensional vector drawn from a standard normal distribution. After the input image is processed by the model, a 1024-dimensional feature vector is generated. Since the category of the input image is known, the distance from each feature vector to the center of its own class can be calculated. The sum of these distances is used as the center loss. This loss is then minimized, and the class centers are updated through the backpropagation mechanism until convergence.
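The initialization and update of the class centers can be sketched in plain Python. The 4-dimensional toy features (in place of 1024 dimensions), the learning rate, the factor of one half in the loss, and the function names are assumptions for illustration; in a real training framework the same update would come from automatic differentiation.

```python
import random


def init_centers(num_classes, dim, seed=0):
    """Draw each class center from a standard normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]


def center_loss(features, labels, centers):
    """Half the summed squared distance from each feature to its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


def update_centers(features, labels, centers, lr=0.1):
    """One gradient-descent step on the centers.

    The gradient of the loss with respect to center j is the sum of
    (c_j - x_i) over the samples i whose label is j.
    """
    grads = [[0.0] * len(c) for c in centers]
    for x, y in zip(features, labels):
        for d, (xi, ci) in enumerate(zip(x, centers[y])):
            grads[y][d] += ci - xi
    return [[ci - lr * g for ci, g in zip(c, grad)]
            for c, grad in zip(centers, grads)]


# Labels: 0 = male, 1 = female, 2 = unknown (toy data).
features = [[1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
labels = [0, 0, 1]

centers = init_centers(num_classes=3, dim=4)
before = center_loss(features, labels, centers)
centers = update_centers(features, labels, centers)
after = center_loss(features, labels, centers)
print(before, after)
```

A center with no samples in the batch (here, "unknown") receives a zero gradient and stays where it is.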
[0063] Figure 2 is a flowchart of the gender recognition method proposed in this application. The model primarily comprises three core components: a backbone network, a PSTM (Pyramid Sparse Transformer Module), and a GDSC (Global Depthwise Separable Convolution) module. After an image is input into the gender recognition model, it passes sequentially through the backbone, PSTM, and GDSC modules. Finally, after normalization by a softmax activation function, the probability distribution over the gender categories is output. The PSTM module, with a hierarchical attention mechanism as its core component, achieves efficient cross-scale feature fusion. It mainly consists of two key stages: a coarse attention stage and a sparse fine attention stage. Figure 3 is a schematic diagram of the PSTM module. Figure 4 is a schematic diagram of the PSTM module's processing flow. The backbone network extracts features at multiple scales from the image. The shallow feature map is then used as the query vector, and the deep feature map is used as the key vector and value vector. First, coarse attention is calculated, and a similarity matrix is obtained from it. The top-k indices are selected from the similarity matrix, and the key vector and value vector are redefined based on these indices. Sparse fine attention is then calculated using the query vector and the redefined key vector and value vector, and finally the first intermediate feature map is output.
[0064] Figure 5 illustrates the weights that the GDSC module assigns to different spatial regions of the 7×7 feature map. As can be seen, the weights differ from location to location, which allows the module to adaptively focus on and learn core semantic features.
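One way to picture the position-wise weighting is a depthwise kernel that has the same spatial size as the feature map, so that every location gets its own learned weight and each channel collapses to a single value. The sketch below follows that reading; the helper names, the toy kernels, and the optional pointwise (1×1) channel-mixing step are assumptions, not details confirmed by the application.

```python
def global_depthwise_conv(feature_map, kernels):
    """Collapse each channel with a kernel as large as the map itself.

    feature_map, kernels: nested lists shaped [channels][height][width].
    Returns one value per channel.
    """
    out = []
    for channel, kernel in zip(feature_map, kernels):
        out.append(sum(v * w
                       for f_row, k_row in zip(channel, kernel)
                       for v, w in zip(f_row, k_row)))
    return out


def pointwise_conv(vector, weights):
    """1x1 convolution on a 1x1 map, i.e. a linear mix of the channels."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


SIZE = 7
# Two toy channels: all ones, except a strong response at the center.
channel = [[1.0] * SIZE for _ in range(SIZE)]
channel[3][3] = 3.0
feature_map = [channel, [row[:] for row in channel]]

# Channel 0 weights every position equally (behaves like average pooling);
# channel 1 attends only to the center position.
uniform = [[1.0 / (SIZE * SIZE)] * SIZE for _ in range(SIZE)]
center_only = [[0.0] * SIZE for _ in range(SIZE)]
center_only[3][3] = 1.0

pooled = global_depthwise_conv(feature_map, [uniform, center_only])
mixed = pointwise_conv(pooled, [[1.0, 1.0], [1.0, -1.0]])
print(pooled, mixed)
```

Unlike global average pooling, the two channels here summarize the same input differently because their spatial weights differ.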
[0065] This application's solution introduces a PSTM module and GDSC into the AI gender recognition network for top-mounted cameras, and designs a joint loss function composed of Cross Entropy Loss and Center Loss, making gender recognition by top-mounted cameras more accurate. Secondly, a multi-frame comprehensive decision-making mechanism that integrates recognition information and location information is introduced, further improving the robustness of gender recognition.
[0066] This application includes several effective improvements for AI gender recognition with top-mounted cameras: a PSTM module, a GDSC module, a joint loss function consisting of Cross Entropy Loss and Center Loss, and a multi-frame comprehensive decision-making mechanism that fuses recognition information and location information. These improvements enhance the accuracy of pedestrian gender recognition with top-mounted cameras and ensure that the algorithm runs in real time, accurately, and robustly on embedded devices.
[0067] In a second aspect, embodiments of this application provide a pedestrian gender recognition device based on a top-mounted camera, the device comprising: an acquisition module, used to acquire an indoor scene video stream captured by the top-mounted camera; a feature extraction module, used to input image frames from the indoor scene video stream into the gender recognition model and extract basic feature maps at multiple different scales; an attention calculation module, used to take the high-level feature map in the basic feature maps as the key vector and value vector and the low-level feature map in the basic feature maps as the query vector, and to perform two-level attention calculation based on the key vector, value vector, and query vector to obtain the first intermediate feature map, wherein the two-level attention calculation includes a coarse attention stage calculation and a sparse fine attention stage calculation; a convolution module, used to perform a spatially position-wise depthwise convolution operation on the first intermediate feature map through global depthwise separable convolution, so as to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map; an output module, used to activate the second intermediate feature map through an activation function and output the probability distribution of the gender category, wherein the gender category includes male, female, and unknown; a prediction module, used to take the gender category with the highest probability value in the probability distribution as the predicted label of the current frame, and to use this probability value as the predicted probability; a fusion module, used to generate spatial weights based on the distance between the pedestrian detection box position and the image center in the current frame, and to fuse the spatial weights with the predicted probability by weighting to obtain the frame-level confidence of the current frame with respect to the predicted label; and a decision module, used to calculate the temporal average confidence of each gender category based on the frame-level confidence of multiple frames and their corresponding predicted labels, and to take the gender category with the highest temporal average confidence as the gender recognition result.
[0068] In one possible implementation, the attention calculation module is specifically used for: performing matrix multiplication on the query vector and the key vector to calculate the global attention score; normalizing the global attention scores to obtain the weight matrix; weighting and summing the value vectors using the weight matrix to obtain the coarse attention matrix; obtaining the similarity matrix based on the average value of each column in the weight matrix; selecting the top-k similarity vectors from the similarity matrix, and obtaining the corresponding sparse key vectors and sparse value vectors from the key vectors and value vectors according to the indices of the top-k similarity vectors; performing matrix multiplication on the query vector and the sparse key vector to calculate the fine attention score; and weighting and summing the sparse value vectors using the normalized fine attention score to obtain the first intermediate feature map.
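The coarse and sparse fine attention stages listed above can be sketched in plain Python on small token matrices. The softmax normalization, the scaling by the square root of the feature dimension, and all function names are assumptions for illustration; the text only states that the scores are normalized and that the top-k columns are kept.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(scores):
    """Row-wise numerically stable softmax."""
    out = []
    for row in scores:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def two_level_attention(query, key, value, top_k):
    """query: [Nq][d] from low-level features; key, value: [Nk][d] from high-level."""
    scale = 1.0 / math.sqrt(len(query[0]))  # assumed scaling

    # Coarse stage: dense attention over all key positions.
    scores = [[s * scale for s in row] for row in matmul(query, transpose(key))]
    weights = softmax_rows(scores)
    coarse = matmul(weights, value)

    # Similarity: mean of each column of the weight matrix, one score per key.
    similarity = [sum(col) / len(weights) for col in zip(*weights)]
    ranked = sorted(range(len(similarity)), key=lambda j: similarity[j], reverse=True)
    keep = sorted(ranked[:top_k])

    # Fine stage: attention restricted to the selected key/value positions.
    sparse_key = [key[j] for j in keep]
    sparse_value = [value[j] for j in keep]
    fine_scores = [[s * scale for s in row]
                   for row in matmul(query, transpose(sparse_key))]
    fine = matmul(softmax_rows(fine_scores), sparse_value)
    return coarse, fine, keep


query = [[1.0, 0.0], [0.0, 1.0]]
key = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
value = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
coarse, fine, keep = two_level_attention(query, key, value, top_k=2)
print(keep, fine)
```

In this toy input the third key is dissimilar to both queries, so it is dropped before the fine stage and its value no longer leaks into the output. How the coarse attention matrix is combined with the fine output is not stated in the text, so the sketch simply returns both.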
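[0069] In one possible implementation, the gender recognition model employs a joint loss function that combines the cross-entropy loss $L_{CE}$ and the center loss $L_C$ through a weight $\lambda$. The cross-entropy loss is

$$L_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),$$

where $C$ is the number of gender categories, $y_i$ is the $i$-th entry of the one-hot label of the actual gender category, and $\hat{y}_i$ is the corresponding entry of the predicted probability distribution.
[0070] The center loss is

$$L_C = \frac{1}{2}\sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2,$$

where $x_i$ is the $i$-th second intermediate feature map, $c_{y_i}$ is the center vector of the category $y_i$ to which the $i$-th sample belongs, $m$ is the total number of samples, and $\lambda$ is the weight, which is dynamically adjusted during model training.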
[0071] In one possible implementation, the spatial weight decreases as the distance between the pedestrian detection box position and the image center increases in the current frame.
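A minimal sketch of a spatial weight that decreases with distance from the image center is given below. The Gaussian falloff, the normalization by the half-diagonal, the `sigma` parameter, and the `(x1, y1, x2, y2)` box format are assumptions; the text only requires that the weight decrease as the distance grows.

```python
import math


def spatial_weight(box, image_size, sigma=0.5):
    """Weight in (0, 1] that shrinks as the box center moves away from the image center.

    box: (x1, y1, x2, y2) in pixels; image_size: (width, height).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    dx = (x1 + x2) / 2.0 - width / 2.0
    dy = (y1 + y2) / 2.0 - height / 2.0
    # Normalized distance: 0 at the image center, 1 at a corner.
    half_diagonal = math.hypot(width, height) / 2.0
    distance = math.hypot(dx, dy) / half_diagonal
    return math.exp(-(distance * distance) / (2.0 * sigma * sigma))


image = (640, 480)
w_center = spatial_weight((270, 190, 370, 290), image)
w_middle = spatial_weight((400, 300, 500, 400), image)
w_corner = spatial_weight((560, 400, 640, 480), image)
print(w_center, w_middle, w_corner)
```

A smaller `sigma` penalizes boundary detections more strongly, which may suit lenses with heavier edge distortion.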
[0072] In a third aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the above-mentioned pedestrian gender recognition methods based on a top-mounted camera.
[0073] The above embodiments can be implemented, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)).
[0074] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0075] The embodiments in this specification are described in a related manner. For parts that are similar or identical between embodiments, reference can be made from one embodiment to another. Each embodiment focuses on its differences from the other embodiments. In particular, the device embodiments are described briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the descriptions of the method embodiments.
[0076] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes to the above embodiments within the scope of the present invention without departing from the principles and spirit of the present invention.
Claims
1. A method for pedestrian gender recognition based on a top-mounted camera, characterized in that the method includes: acquiring an indoor scene video stream captured by the top-mounted camera; inputting image frames from the indoor scene video stream into a gender recognition model to extract basic feature maps at multiple different scales; using the high-level feature map in the basic feature maps as the key vector and value vector, and the low-level feature map in the basic feature maps as the query vector, and performing two-level attention calculation based on the key vector, value vector, and query vector to obtain a first intermediate feature map, wherein the two-level attention calculation includes a coarse attention stage calculation and a sparse fine attention stage calculation; subjecting the first intermediate feature map to a spatially position-wise depthwise convolution operation by global depthwise separable convolution to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map; after activating the second intermediate feature map using an activation function, outputting the probability distribution of the gender category, wherein the gender category includes male, female, and unknown; taking the gender category with the highest probability value in the probability distribution as the predicted label of the current frame, and using this probability value as the predicted probability; generating spatial weights based on the distance between the pedestrian detection box position and the image center in the current frame, and fusing the spatial weights with the predicted probability by weighting to obtain the frame-level confidence of the current frame with respect to the predicted label; and calculating, based on the frame-level confidence of multiple frames and their corresponding predicted labels, the temporal average confidence of each gender category, and taking the gender category with the highest temporal average confidence as the gender recognition result.
2. The method according to claim 1, characterized in that using the high-level feature map in the basic feature maps as the key vector and value vector and the low-level feature map in the basic feature maps as the query vector, and performing two-level attention calculation based on the key vector, value vector, and query vector to obtain the first intermediate feature map, includes: performing matrix multiplication on the query vector and the key vector to calculate the global attention score; normalizing the global attention scores to obtain the weight matrix; weighting and summing the value vectors using the weight matrix to obtain the coarse attention matrix; obtaining the similarity matrix based on the average value of each column in the weight matrix; selecting the top-k similarity vectors from the similarity matrix, and obtaining the corresponding sparse key vectors and sparse value vectors from the key vectors and value vectors according to the indices of the top-k similarity vectors; performing matrix multiplication on the query vector and the sparse key vector to calculate the fine attention score; and weighting and summing the sparse value vectors using the normalized fine attention score to obtain the first intermediate feature map.
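3. The method according to claim 1, characterized in that the gender recognition model employs a joint loss function that combines the cross-entropy loss $L_{CE}$ and the center loss $L_C$ through a weight $\lambda$; the cross-entropy loss is $L_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$, where $C$ is the number of gender categories, $y_i$ is the $i$-th entry of the one-hot label of the actual gender category, and $\hat{y}_i$ is the corresponding entry of the predicted probability distribution; the center loss is $L_C = \frac{1}{2}\sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$, where $x_i$ is the $i$-th second intermediate feature map, $c_{y_i}$ is the center vector of the category $y_i$ to which the $i$-th sample belongs, $m$ is the total number of samples, and $\lambda$ is the weight, which is dynamically adjusted during model training.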
4. The method according to claim 1, characterized in that, The spatial weight decreases as the distance between the pedestrian detection box position and the image center in the current frame increases.
5. The method according to claim 3, characterized in that the method further includes: initializing the center vector as a high-dimensional vector generated from a standard normal distribution; and, during model training, updating the center vector of each gender category through the backpropagation mechanism with the goal of minimizing the center loss.
6. A pedestrian gender recognition device based on a top-mounted camera, characterized in that the device includes: an acquisition module, used to acquire an indoor scene video stream captured by the top-mounted camera; a feature extraction module, used to input image frames from the indoor scene video stream into the gender recognition model and extract basic feature maps at multiple different scales; an attention calculation module, used to take the high-level feature map in the basic feature maps as the key vector and value vector and the low-level feature map in the basic feature maps as the query vector, and to perform two-level attention calculation based on the key vector, value vector, and query vector to obtain the first intermediate feature map, wherein the two-level attention calculation includes a coarse attention stage calculation and a sparse fine attention stage calculation; a convolution module, used to perform a spatially position-wise depthwise convolution operation on the first intermediate feature map through global depthwise separable convolution, so as to assign differentiated weights to different spatial regions and compress the first intermediate feature map into a second intermediate feature map; an output module, used to activate the second intermediate feature map through an activation function and output the probability distribution of the gender category, wherein the gender category includes male, female, and unknown; a prediction module, used to take the gender category with the highest probability value in the probability distribution as the predicted label of the current frame, and to use this probability value as the predicted probability; a fusion module, used to generate spatial weights based on the distance between the pedestrian detection box position and the image center in the current frame, and to fuse the spatial weights with the predicted probability by weighting to obtain the frame-level confidence of the current frame with respect to the predicted label; and a decision module, used to calculate the temporal average confidence of each gender category based on the frame-level confidence of multiple frames and their corresponding predicted labels, and to take the gender category with the highest temporal average confidence as the gender recognition result.
7. The apparatus according to claim 6, characterized in that the attention calculation module is specifically used for: performing matrix multiplication on the query vector and the key vector to calculate the global attention score; normalizing the global attention scores to obtain the weight matrix; weighting and summing the value vectors using the weight matrix to obtain the coarse attention matrix; obtaining the similarity matrix based on the average value of each column in the weight matrix; selecting the top-k similarity vectors from the similarity matrix, and obtaining the corresponding sparse key vectors and sparse value vectors from the key vectors and value vectors according to the indices of the top-k similarity vectors; performing matrix multiplication on the query vector and the sparse key vector to calculate the fine attention score; and weighting and summing the sparse value vectors using the normalized fine attention score to obtain the first intermediate feature map.
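8. The apparatus according to claim 6, characterized in that the gender recognition model employs a joint loss function that combines the cross-entropy loss $L_{CE}$ and the center loss $L_C$ through a weight $\lambda$; the cross-entropy loss is $L_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$, where $C$ is the number of gender categories, $y_i$ is the $i$-th entry of the one-hot label of the actual gender category, and $\hat{y}_i$ is the corresponding entry of the predicted probability distribution; the center loss is $L_C = \frac{1}{2}\sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$, where $x_i$ is the $i$-th second intermediate feature map, $c_{y_i}$ is the center vector of the category $y_i$ to which the $i$-th sample belongs, $m$ is the total number of samples, and $\lambda$ is the weight, which is dynamically adjusted during model training.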
9. The apparatus according to claim 6, characterized in that, The spatial weight decreases as the distance between the pedestrian detection box position and the image center in the current frame increases.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method described in any one of claims 1-5.