A miner safety monitoring method and system based on cross-modal pedestrian re-identification

By combining a shared dual-stream ResNet50 backbone network with local and global feature modules, the problem of modal differences in cross-modal pedestrian re-identification in underground coal mines was solved, achieving more discriminative feature extraction and improved recognition performance.

CN120014550B · Active · Publication Date: 2026-06-19 · CHINA UNIV OF MINING & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA UNIV OF MINING & TECH
Filing Date: 2025-01-22
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In underground coal mines, traditional video surveillance cannot cope effectively with interference such as low light and dust. Cross-modal pedestrian re-identification therefore performs poorly, the extracted features are not discriminative enough, and accurate identification and rapid response are difficult to achieve.

Method used

A shared dual-stream ResNet50 backbone network extracts modality-specific and modality-shared features. A local feature module and a global feature module are combined: in the local module, SP_Net modules produce local feature vectors on which a cross-modal alignment loss is computed; in the global module, a bilinear fully connected layer and a global loss function comprising identity, triplet, and center losses are introduced to optimize the network parameters and improve recognition performance.

Benefits of technology

It significantly improves the performance of cross-modal pedestrian re-identification, eliminates modal differences, and enhances image recognition efficiency, model recognition ability, and robustness.

Generated by Eureka AI based on patent content.


Abstract

This invention discloses a miner safety monitoring method based on cross-modal pedestrian re-identification, comprising the following steps: S1: inputting the dataset into a shared dual-stream ResNet50 backbone network to obtain the preliminary feature maps of the two modalities; S2: inputting the feature maps of the two modalities into a local feature module, dividing each into three parts along its height, pairing the corresponding parts of the two modalities, inputting each pair of local feature maps into the same SP_Net module to obtain two local feature vectors, and computing a cross-modal alignment loss on them; S3: inputting the feature maps of the two modalities into a global feature module, passing them through a bilinear fully connected layer to obtain a global feature vector, and calculating the loss function; S4: adjusting the network parameters to optimize the network, and performing pedestrian matching on the cross-modal dataset to be identified. The invention can eliminate the differences between the two modalities, extract more discriminative features, improve image recognition efficiency, and enhance the performance of cross-modal pedestrian re-identification.

Description

Technical Field

[0001] This invention belongs to the field of image recognition technology, and specifically relates to a method and system for monitoring the safety of miners based on cross-modal pedestrian re-identification.

Background Technology

[0002] In the coal mining industry, miner safety monitoring is crucial to protecting miners' lives and keeping production stable. With the increasing depth of mining and the increasing complexity of the working environment, traditional monitoring methods face numerous challenges, including blind spots, difficulties in data collection, and severe environmental interference. In particular, the underground mining environment is typically characterized by low light levels, dust, and high personnel density, so traditional video surveillance-based safety management cannot meet the demands for accurate identification and rapid response.

[0003] In recent years, with the rapid development of computer vision and artificial intelligence technologies, cross-modal pedestrian re-identification technology has provided a new approach to solving the problem of mine safety monitoring. This technology can be used for rapid cross-view recognition and matching retrieval of underground coal mine workers, enabling accurate identification of a miner's location at a specific point in time. It can pinpoint the miner's location, such as at the coal face, pump room, substation, or tunneling face, thus determining the miner's regional location. This facilitates rapid personnel location based on the matched time and location in the event of a mine safety accident, providing precise location information for rescue operations.

[0004] One of the main challenges this technology faces in underground mine safety monitoring is the modal difference between visible light and infrared images. Within the same modality, factors such as human posture, viewing angle, lighting, low image resolution, occlusion, and background can cause the network to focus only on the global area while ignoring local areas. As a result, the feature extraction network cannot extract more discriminative features, leading to poor recognition performance.

Summary of the Invention

[0005] The purpose of this invention is to provide a method and system for monitoring the safety of miners based on cross-modal pedestrian re-identification, which can eliminate the differences between the two modalities, extract more discriminative features, improve image recognition efficiency, and improve the performance of cross-modal pedestrian re-identification.

[0006] To achieve the above objectives, this invention provides a miner safety monitoring method based on cross-modal pedestrian re-identification, comprising the following steps:

[0007] S1: Input the dataset into a shared dual-stream ResNet50 backbone network to obtain the initially extracted feature maps of the two modalities, which carry modality-specific and modality-shared information;

[0008] S2: Input the feature maps of the two modalities into the local feature module, divide each into three parts along its height, pair the corresponding parts of the two modalities, input each pair of local feature maps into the same SP_Net module to obtain two local feature vectors, and compute the cross-modal alignment loss on the two local feature vectors;

[0009] S3: Input the feature maps of the two modalities into the global feature module, pass them through a bilinear fully connected layer to obtain the global feature vector, and calculate the loss function;

[0010] S4: Adjust the network parameters to optimize the network and perform pedestrian matching on the cross-modal dataset to be identified.

[0011] As a further aspect of the present invention: the shared dual-stream ResNet50 backbone network in S1 includes five network layers, each of which is a residual structure block. The parameters of the first two network layers are independent and used to extract modality-specific features, while the parameters of the last three network layers are shared and used to extract modality-shared features.

[0012] As a further aspect of the present invention: the local feature module in S2 includes three independent SP_Net modules and three cross-modal alignment loss functions. The feature map output by S1 is uniformly divided into three local feature maps according to its height. The local feature maps corresponding to different modalities are combined into local feature map pairs. The local feature map pairs are input into the SP_Net module. The SP_Net module outputs two local feature vectors of different modalities. The cross-modal alignment loss function is calculated on the local feature vectors of these two different modalities. Finally, three cross-modal alignment loss functions are obtained, and they are summed to obtain the final local loss.

[0013] As a further aspect of this invention: the SP_Net module uses striped global average pooling to enlarge the receptive field of the convolutional neural network, and the process is as follows:

[0014] Global average pooling along the width direction is denoted $\mathrm{GAP}^{1\times W}$, global average pooling along the height direction is denoted $\mathrm{GAP}^{H\times 1}$, global average pooling is denoted GAP, the one-dimensional convolutional layer is denoted Conv, batch normalization is denoted BN, the input feature map is denoted $z$, a feature vector is denoted $f$, visible light is denoted RGB, and infrared is denoted IR.

[0015] The feature map $z$ is reduced in dimensionality by a one-dimensional convolution. The feature map obtained after the convolutional layer is then subjected to strip global average pooling in the two directions to obtain the feature vectors $f$:

[0016] $f^{H\times 1} = \mathrm{GAP}^{1\times W}(\mathrm{Conv}(z))$

[0017] $f^{1\times W} = \mathrm{GAP}^{H\times 1}(\mathrm{Conv}(z))$

[0018] The new feature map $z^{H\times W}$ is obtained by performing an inner product on the obtained feature vectors:

[0019] $z^{H\times W} = f^{H\times 1} \cdot f^{1\times W}$

[0020] The feature map $z^{H\times W}$ is raised in dimensionality by a one-dimensional convolution (Conv) and then passed through a sigmoid function to obtain a new feature matrix $S$:

[0021] $S = \mathrm{Sigmoid}(\mathrm{Conv}(z^{H\times W}))$

[0022] The feature matrix $S$ and the original feature map $z$ are multiplied element-wise to obtain a new feature matrix $Z^{H\times W}$:

[0023] $Z^{H\times W} = S \odot z$

[0024] GAP and BN are applied to the feature matrix $Z^{H\times W}$ to obtain the feature vector:

[0025] $f = \mathrm{BN}(\mathrm{GAP}(Z^{H\times W}))$

[0026] The cross-modal alignment loss is computed on the feature vectors $f_{RGB}$ and $f_{IR}$ of the two modalities:

[0027]

[0028] The final local loss is obtained by summing the cross-modal alignment losses of the three parts:

[0029] $L_{local} = L_{alignment1} + L_{alignment2} + L_{alignment3}$.

[0030] As a further aspect of the present invention: the global feature module in S3 includes a bilinear fully connected layer and a global loss function. The bilinear fully connected layer includes a fully connected layer with a bias term, a batch normalization layer, ReLU and Dropout activation functions after the batch normalization layer, and a fully connected layer without a bias term.

[0031] As a further aspect of the present invention: the global loss function includes an identity loss function, a triplet loss function, and a center loss function;

[0032] The formula for the identity loss function is:

[0033] $L_{ID} = -\sum_{i=1}^{N} y_i \log(p_i)$

[0034] where $N$ is the total number of categories, $y_i$ is the true label, and $p_i$ is the probability predicted by the model that the image belongs to the $i$-th class;

[0035] The formula for the triplet loss function is:

[0036] $L_{triplet} = \max(d_p - d_n + \alpha, 0)$

[0037] where $d_p$ is the distance between positive sample pairs, $d_n$ is the distance between negative sample pairs, and $\alpha$ is a hyperparameter representing the allowed distance difference;

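[0038] The formula for the center loss function is:

[0039] $L_{center} = \frac{1}{2}\sum_{i=1}^{B} \lVert f_i - c_{y_i} \rVert_2^2$

[0040] where $f_i$ is the deep feature of the $i$-th image in the mini-batch, $y_i$ is its label, $c_{y_i}$ is the center of the $y_i$-th class of deep features, and $B$ is the batch size;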

[0041] Therefore, the global loss function is:

[0042] $L = L_{ID} + L_{triplet} + \beta L_{center}$.

[0043] A miner safety monitoring system based on cross-modal pedestrian re-identification, based on the above-mentioned miner safety monitoring method, includes a shared dual-stream ResNet50 module, a global feature module, and a local feature module. The local feature module contains three independent SP_Nets, and the global feature module contains a bilinear fully connected layer and a global loss function. The global feature module and the local feature module are trained with joint loss.

[0044] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0045] This invention proposes a bilinear fully connected layer, which enables convergence of the cross-modal ID embedding and improves the discriminativeness of cross-modal features. A center loss function is introduced to address the issue that the triplet loss only considers the difference between the positive-pair and negative-pair distances while ignoring their absolute values. Three independent SP_Net modules are proposed, and striped global average pooling is used to expand the receptive field of the convolutional neural network, collecting richer contextual information and capturing long-range relationships in isolated regions. Furthermore, the inner product and the sigmoid function are used to help model high-level semantic long-range dependencies. In summary, the proposed global and local feature modules significantly improve the model's recognition ability and robustness, eliminate differences between the two modalities, extract more discriminative features, improve image recognition efficiency, and enhance the performance of cross-modal pedestrian re-identification.

Attached Figure Description

[0046] Figure 1 is a flowchart of the miner safety monitoring method based on cross-modal pedestrian re-identification according to the present invention.

[0047] Figure 2 is a diagram of the algorithm framework of the present invention.

[0048] Figure 3 is a flowchart of the miner safety monitoring system based on cross-modal pedestrian re-identification according to the present invention.

[0049] Figure 4 is a schematic diagram of the SP_Net (strip pooling network) module of the present invention.

[0050] Figure 5 is a schematic diagram illustrating the effect of the algorithm of the present invention.

Detailed Implementation

[0051] The present invention will be further illustrated by the following examples.

[0052] As shown in Figure 1, a miner safety monitoring method based on cross-modal pedestrian re-identification includes the following steps:

[0053] S1: Input the dataset into the shared dual-stream ResNet50 backbone network to obtain the initially extracted feature maps of the two modalities, which carry modality-specific and modality-shared information. Extracting both kinds of information with the shared dual-stream ResNet50 backbone network allows the network to extract more feature information from the different modalities;

[0054] S2: Input the feature maps of the two modalities into the local feature module, divide each into three parts along its height, pair the corresponding parts of the two modalities, input each pair of local feature maps into the same SP_Net module to obtain two local feature vectors, and compute the cross-modal alignment loss on the two local feature vectors;

[0055] S3: Input the feature maps of the two modalities into the global feature module, pass them through a bilinear fully connected layer to obtain the global feature vector, and calculate the loss function;

[0056] S4: Adjust the network parameters to optimize the network and perform pedestrian matching on the cross-modal dataset to be identified.
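The source does not state how the pedestrian matching in S4 is scored. As a minimal sketch, assuming each image has already been reduced to a feature vector and that gallery entries are ranked by Euclidean distance to the query (the function name, the distance metric, and the sample identities are illustrative assumptions, not part of the patent):

```python
import math

def rank_gallery(query, gallery):
    """Return gallery identities ordered from nearest to farthest from the query.

    query   -- feature vector of the image to identify (e.g. an infrared probe)
    gallery -- dict mapping an identity to its feature vector (e.g. visible-light images)
    Euclidean distance is an assumed metric; the source does not name one.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return sorted(gallery, key=lambda identity: distance(query, gallery[identity]))

# Hypothetical 2-D features for three miners
gallery = {"miner_A": [0.0, 0.0], "miner_B": [1.0, 1.0], "miner_C": [5.0, 5.0]}
ranking = rank_gallery([0.9, 1.2], gallery)
```

The first entry of `ranking` is the best match; in a monitoring setting it would be combined with the camera location and timestamp of the matched gallery image.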

[0057] As shown in Figure 2, the VisibleImages and InfraredImages of the two modalities in the dataset are input into the shared dual-stream ResNet50 backbone network, which outputs FeatureMaps. The resulting FeatureMaps are divided into three equal parts using a split function. The corresponding parts from the two modalities are then paired and input into the three independent SP_Net modules, and three alignment loss functions are used to train the local feature module. In the global feature module, the FeatureMaps are first processed through GAP (global average pooling) to obtain feature vectors. The feature vectors are then passed through an FC (fully connected) layer and a BN (batch normalization) layer, followed by ReLU and Dropout. Finally, a fully connected layer produces the final feature vector. Triplet loss, center loss, and ID loss are used to train the global feature module.

[0058] Furthermore, the shared dual-stream ResNet50 backbone network in S1 consists of five network layers, each of which is a residual structure block. The parameters of the first two network layers are independent and used to extract modality-specific features, while the parameters of the last three network layers are shared and used to extract modality-shared features. The backbone network outputs the visible light and infrared feature maps.
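The split between independent and shared parameters can be illustrated with a toy model. In this sketch each "stage" is only a scalar multiplier standing in for a residual block, and the weights are arbitrary; the class names and values are assumptions for illustration and do not reproduce ResNet50:

```python
class Stage:
    """Toy stand-in for one residual stage: multiplies every value by a weight."""
    def __init__(self, weight):
        self.weight = weight

    def __call__(self, values):
        return [self.weight * v for v in values]

class SharedDualStream:
    """Five stages: the first two are per-modality, the last three are shared."""
    def __init__(self):
        # Modality-specific stages (independent parameters per stream)
        self.specific = {
            "RGB": [Stage(1.0), Stage(2.0)],
            "IR": [Stage(0.5), Stage(3.0)],
        }
        # Modality-shared stages (one set of parameters used by both streams)
        self.shared = [Stage(1.5), Stage(1.0), Stage(2.0)]

    def forward(self, values, modality):
        for stage in self.specific[modality] + self.shared:
            values = stage(values)
        return values

net = SharedDualStream()
rgb_out = net.forward([1.0], "RGB")  # RGB-specific stages, then the shared stages
ir_out = net.forward([1.0], "IR")    # IR-specific stages, then the same shared stages
```

Because both streams end in the same three `shared` stage objects, any update to those parameters affects both modalities, which is what pushes the later layers toward modality-shared features.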

[0059] Furthermore, the local feature module in S2 includes three independent SP_Net modules and three cross-modal alignment loss functions. The SP_Net module is used to obtain more discriminative features. The feature map output from S1 is evenly divided into three local feature maps according to its height using the split function. The local feature maps corresponding to different modalities are combined into local feature map pairs. The local feature map pairs are input into the SP_Net module, which outputs two local feature vectors of different modalities. The cross-modal alignment loss function is calculated on these two local feature vectors of different modalities. Finally, three cross-modal alignment loss functions are obtained, which are summed to obtain the final local loss.
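The height-wise split and cross-modal pairing can be sketched as follows. The sketch assumes a single-channel feature map stored as a list of rows and a height divisible by three; the function names are hypothetical:

```python
def split_by_height(feature_map, parts=3):
    """Split an H x W feature map (list of rows) into equal horizontal stripes."""
    height = len(feature_map)
    if height % parts != 0:
        raise ValueError("height must be divisible by the number of parts")
    step = height // parts
    return [feature_map[i * step:(i + 1) * step] for i in range(parts)]

def pair_stripes(rgb_map, ir_map, parts=3):
    """Pair the k-th stripe of the visible-light map with the k-th stripe of the infrared map."""
    return list(zip(split_by_height(rgb_map, parts), split_by_height(ir_map, parts)))

# Hypothetical 6 x 2 feature maps for the two modalities
rgb = [[r, r] for r in range(6)]
ir = [[-r, -r] for r in range(6)]
pairs = pair_stripes(rgb, ir)
```

Each element of `pairs` would be passed to its own SP_Net module, giving three alignment losses, one per stripe.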

[0060] Furthermore, as shown in Figure 4, the SP_Net module uses striped global average pooling to enlarge the receptive field of the convolutional neural network, extracting features from the feature map in different directions. The feature vectors extracted in the two directions are multiplied (inner product) to obtain a feature matrix, which is then processed by the Sigmoid function. The resulting feature matrix is then multiplied element-wise with the original feature map, and the result is subjected to global average pooling and batch normalization. The SP_Net module therefore mainly consists of two striped global average pooling layers, one batch normalization layer, one one-dimensional convolutional layer, and one Sigmoid layer. To accelerate model convergence and reduce computational cost, a one-dimensional convolutional layer is used to reduce the dimensionality of the feature map. The process is as follows:

[0061] Global average pooling along the width direction is denoted $\mathrm{GAP}^{1\times W}$, global average pooling along the height direction is denoted $\mathrm{GAP}^{H\times 1}$, global average pooling is denoted GAP, the one-dimensional convolutional layer is denoted Conv, batch normalization is denoted BN, the input feature map is denoted $z$, a feature vector is denoted $f$, visible light is denoted RGB, and infrared is denoted IR.

[0062] The feature map $z$ is reduced in dimensionality by a one-dimensional convolution. The feature map obtained after the convolutional layer is then subjected to strip global average pooling in the two directions to obtain the feature vectors $f$:

[0063] $f^{H\times 1} = \mathrm{GAP}^{1\times W}(\mathrm{Conv}(z))$

[0064] $f^{1\times W} = \mathrm{GAP}^{H\times 1}(\mathrm{Conv}(z))$

[0065] The new feature map $z^{H\times W}$ is obtained by performing an inner product on the obtained feature vectors:

[0066] $z^{H\times W} = f^{H\times 1} \cdot f^{1\times W}$

[0067] The feature map $z^{H\times W}$ is raised in dimensionality by a one-dimensional convolution (Conv) and then passed through a sigmoid function to obtain a new feature matrix $S$:

[0068] $S = \mathrm{Sigmoid}(\mathrm{Conv}(z^{H\times W}))$

[0069] The feature matrix $S$ and the original feature map $z$ are multiplied element-wise to obtain a new feature matrix $Z^{H\times W}$:

[0070] $Z^{H\times W} = S \odot z$

[0071] GAP and BN are applied to the feature matrix $Z^{H\times W}$ to obtain the feature vector:

[0072] $f = \mathrm{BN}(\mathrm{GAP}(Z^{H\times W}))$

[0073] The cross-modal alignment loss is computed on the feature vectors $f_{RGB}$ and $f_{IR}$ of the two modalities:

[0074]

[0075] Since our local feature module has three independent SP_Net networks, we get three cross-modal alignment losses. Summing the three cross-modal alignment losses gives us the final local loss:

[0076] $L_{local} = L_{alignment1} + L_{alignment2} + L_{alignment3}$.
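The strip-pooling chain inside one SP_Net module can be sketched in plain Python. This is a simplified illustration under stated assumptions: the two Conv layers (dimensionality reduction and increase) and the final BN are omitted, so each channel is processed independently, and the function names are hypothetical:

```python
import math

def strip_attention(channel):
    """Gate one H x W channel with the sigmoid of its strip-pooled product map."""
    h, w = len(channel), len(channel[0])
    # GAP along the width: one value per row (H x 1)
    f_h = [sum(row) / w for row in channel]
    # GAP along the height: one value per column (1 x W)
    f_w = [sum(channel[i][j] for i in range(h)) / h for j in range(w)]
    # Product of the column vector and the row vector gives an H x W map; sigmoid gives S
    gate = [[1.0 / (1.0 + math.exp(-f_h[i] * f_w[j])) for j in range(w)] for i in range(h)]
    # Element-wise product of S with the original feature map gives Z
    return [[gate[i][j] * channel[i][j] for j in range(w)] for i in range(h)]

def sp_net_vector(feature_map):
    """feature_map: list of C channels, each H x W. Returns the C-dimensional vector GAP(Z)."""
    vector = []
    for channel in feature_map:
        gated = strip_attention(channel)
        count = len(gated) * len(gated[0])
        vector.append(sum(map(sum, gated)) / count)
    return vector

ones = [[1.0, 1.0], [1.0, 1.0]]
vector = sp_net_vector([ones])  # one channel, so a 1-dimensional feature vector
```

For the all-ones channel every strip average is 1, so the gate is sigmoid(1) everywhere and the pooled feature equals sigmoid(1). Running the same function on the visible-light and infrared stripes of a pair yields the two vectors on which the alignment loss is computed.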

[0077] Furthermore, the global feature module in S3 includes a bilinear fully connected layer and a global loss function. In this module, the feature maps first undergo global average pooling. The bilinear fully connected layer addresses the low discriminativeness of cross-modal feature representations, which is caused by vanishing gradients in the shared network and by the difficulty of converging the cross-modal ID embedding. It consists of a fully connected layer with a bias term, a batch normalization layer followed by ReLU and Dropout, which primarily prevent overfitting, and then a fully connected layer without a bias term, which outputs the feature vector. The global loss is then calculated using this feature vector.
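The layer order described above can be sketched as an inference-time forward pass. The sketch assumes batch normalization uses stored statistics without a learned scale or shift and treats Dropout as the identity, since it only acts during training; all weights and function names are hypothetical:

```python
import math

def linear(x, weight, bias=None):
    """Fully connected layer: one output per weight row, with an optional bias."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

def batch_norm_eval(x, mean, var, eps=1e-5):
    """Batch normalization at inference time, using stored mean and variance."""
    return [(v - m) / math.sqrt(s + eps) for v, m, s in zip(x, mean, var)]

def relu(x):
    return [max(v, 0.0) for v in x]

def bilinear_fc(x, w1, b1, mean, var, w2, eps=1e-5):
    """FC with bias -> BN -> ReLU -> Dropout (identity here) -> FC without bias."""
    hidden = relu(batch_norm_eval(linear(x, w1, b1), mean, var, eps))
    return linear(hidden, w2)

embedding = bilinear_fc(
    x=[1.0, 2.0],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.5, -3.0],
    mean=[0.0, 0.0], var=[1.0, 1.0],
    w2=[[2.0, 1.0]], eps=0.0,
)
```

Here the first layer gives `[1.5, -1.0]`, ReLU zeroes the negative entry, and the bias-free second layer maps the result to a single value.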

[0078] Furthermore, the global loss function includes the identity loss function, the triplet loss function, and the center loss function. The center loss function is introduced to address the problem that the triplet loss only considers the difference between the distances of positive and negative sample pairs, while ignoring the absolute values of those distances.

[0079] The formula for the identity loss function is:

[0080] $L_{ID} = -\sum_{i=1}^{N} y_i \log(p_i)$

[0081] where $N$ is the total number of categories, $y_i$ is the true label, and $p_i$ is the probability predicted by the model that the image belongs to the $i$-th class;
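Assuming the identity loss is the usual cross-entropy over the N identity classes with a one-hot true label, a minimal sketch is (the function name and the sample probabilities are illustrative):

```python
import math

def identity_loss(probabilities, label):
    """Cross-entropy for a one-hot label: -sum_i y_i*log(p_i) reduces to -log(p_label)."""
    return -math.log(probabilities[label])

# Three identity classes; the true identity is class 1, predicted with probability 0.5
loss = identity_loss([0.25, 0.5, 0.25], label=1)
```

The loss is zero only when the model assigns probability 1 to the true identity and grows as that probability shrinks.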

[0082] The formula for the triplet loss function is:

[0083] $L_{triplet} = \max(d_p - d_n + \alpha, 0)$

[0084] where $d_p$ is the distance between positive sample pairs, $d_n$ is the distance between negative sample pairs, and $\alpha$ is a hyperparameter representing the allowed distance difference;
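The triplet formula can be checked with a short worked example. The margin value 0.3 is an assumed default for illustration; the source does not give a value for α:

```python
def triplet_loss(d_p, d_n, alpha=0.3):
    """Hinge on the margin: max(d_p - d_n + alpha, 0)."""
    return max(d_p - d_n + alpha, 0.0)

# Negative pair is already farther than the positive pair by more than alpha
easy = triplet_loss(d_p=0.2, d_n=0.9)
# Negative pair is closer than the positive pair, so the triplet is penalized
hard = triplet_loss(d_p=0.8, d_n=0.5)
```

The first triplet contributes no loss, while the second contributes about 0.6. Note that the loss depends only on the gap between the two distances, not on how large each distance is.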

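[0085] The formula for the center loss function is:

[0086] $L_{center} = \frac{1}{2}\sum_{i=1}^{B} \lVert f_i - c_{y_i} \rVert_2^2$

[0087] where $f_i$ is the deep feature of the $i$-th image in the mini-batch, $y_i$ is its label, $c_{y_i}$ is the center of the $y_i$-th class of deep features, and $B$ is the batch size;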
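Assuming the center loss takes its commonly used form, half the summed squared distance between each deep feature in the mini-batch and the center of its class, a minimal sketch is (the function name and the sample values are illustrative, and the centers are given as fixed inputs rather than learned):

```python
def center_loss(features, labels, centers):
    """0.5 * sum over the mini-batch of the squared distance to the class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(feature, centers[label]))
    return 0.5 * total

# Hypothetical mini-batch of B = 2 features with labels 0 and 1
features = [[1.0, 2.0], [3.0, 3.0]]
labels = [0, 1]
centers = {0: [1.0, 1.0], 1: [1.0, 3.0]}
loss = center_loss(features, labels, centers)
```

The squared distances are 1 and 4, so the loss is 2.5. Unlike the triplet loss, this term penalizes the absolute distance of each feature from its class center.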

[0088] Therefore, the global loss function is:

[0089] $L = L_{ID} + L_{triplet} + \beta L_{center}$.
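The weighted sum can be illustrated numerically. The value of β below is a hypothetical choice, since the source does not state one, and the three loss values are placeholders:

```python
def global_loss(l_id, l_triplet, l_center, beta):
    """L = L_ID + L_triplet + beta * L_center."""
    return l_id + l_triplet + beta * l_center

# Placeholder loss values with an assumed small weight on the center term
total = global_loss(l_id=0.7, l_triplet=0.6, l_center=2.5, beta=0.0005)
```

With a small β the center term adds only about 0.00125 here, so it acts as a mild regularizer on top of the identity and triplet losses.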

[0090] The global and local losses are trained jointly, and the parameters are fine-tuned during training to bring the model to its optimal state. To demonstrate the performance of this invention, the algorithm was tested on a dataset; as shown in Figure 5, it can significantly improve image recognition capability.

[0091] As shown in Figure 3, a miner safety monitoring system based on cross-modal pedestrian re-identification is proposed. Based on the above-mentioned miner safety monitoring method, it includes a shared dual-stream ResNet50 module, a global feature module, and a local feature module. The local feature module contains three independent SP_Net modules, and the global feature module contains a bilinear fully connected layer and a global loss function. The global feature module and the local feature module are trained with joint loss.

Claims

1. A miner safety monitoring method based on cross-modal pedestrian re-identification, characterized in that it includes the following steps:

S1: Input the dataset into a shared dual-stream ResNet50 backbone network to obtain the initially extracted feature maps of the two modalities, which carry modality-specific and modality-shared information;

S2: Input the feature maps of the two modalities into the local feature module, divide each into three parts along its height, pair the corresponding parts of the two modalities, input each pair of local feature maps into the same SP_Net module to obtain two local feature vectors, and compute the cross-modal alignment loss on the two local feature vectors;

The local feature module consists of three independent SP_Net modules and three cross-modal alignment loss functions. The feature map output by S1 is uniformly divided into three local feature maps according to its height. The local feature maps corresponding to different modalities are combined into local feature map pairs. The local feature map pairs are input into the SP_Net module, which outputs two local feature vectors of different modalities. The cross-modal alignment loss function is calculated on the two local feature vectors of different modalities. Finally, three cross-modal alignment loss functions are obtained, which are summed to obtain the final local loss.

The SP_Net module uses striped global average pooling to enlarge the receptive field of the convolutional neural network, as follows: global average pooling along the width direction is denoted $\mathrm{GAP}^{1\times W}$, global average pooling along the height direction is denoted $\mathrm{GAP}^{H\times 1}$, global average pooling is denoted GAP, the one-dimensional convolutional layer is denoted Conv, batch normalization is denoted BN, the input feature map is denoted $z$, a feature vector is denoted $f$, visible light is denoted RGB, and infrared is denoted IR; the feature map $z$ is reduced in dimensionality by a one-dimensional convolution, and the feature map obtained after the convolutional layer is subjected to strip global average pooling in the two directions to obtain the feature vectors $f$: $f^{H\times 1} = \mathrm{GAP}^{1\times W}(\mathrm{Conv}(z))$; $f^{1\times W} = \mathrm{GAP}^{H\times 1}(\mathrm{Conv}(z))$; the obtained feature vectors are multiplied by an inner product to obtain a new feature map $z^{H\times W}$: $z^{H\times W} = f^{H\times 1} \cdot f^{1\times W}$; the feature map $z^{H\times W}$ is raised in dimensionality by a one-dimensional convolution and passed through a Sigmoid function to obtain a new feature matrix $S$: $S = \mathrm{Sigmoid}(\mathrm{Conv}(z^{H\times W}))$; the feature matrix $S$ is multiplied element-wise with the original feature map $z$ to obtain a new feature matrix $Z^{H\times W}$: $Z^{H\times W} = S \odot z$; GAP and BN are applied to the feature matrix $Z^{H\times W}$ to obtain the feature vector: $f = \mathrm{BN}(\mathrm{GAP}(Z^{H\times W}))$; the cross-modal alignment loss is computed on the feature vectors $f_{RGB}$ and $f_{IR}$ of the two modalities; the final local loss is obtained by summing the cross-modal alignment losses of the three parts: $L_{local} = L_{alignment1} + L_{alignment2} + L_{alignment3}$;

S3: Input the feature maps of the two modalities into the global feature module, pass them through a bilinear fully connected layer to obtain the global feature vector, and calculate the loss function; the global feature module includes a bilinear fully connected layer and a global loss function, and the bilinear fully connected layer consists of a fully connected layer with a bias term, a batch normalization layer, ReLU and Dropout activation functions after the batch normalization layer, and a fully connected layer without a bias term;

S4: Adjust the network parameters to optimize the network and perform pedestrian matching on the cross-modal dataset to be identified.

2. The miner safety monitoring method based on cross-modal pedestrian re-identification according to claim 1, characterized in that the shared dual-stream ResNet50 backbone network in S1 consists of five network layers, each of which is a residual structure block; the parameters of the first two network layers are independent and used to extract modality-specific features, while the parameters of the last three network layers are shared and used to extract modality-shared features.

3. The miner safety monitoring method based on cross-modal pedestrian re-identification according to claim 1, characterized in that the global loss function includes an identity loss function, a triplet loss function, and a center loss function; the formula for the identity loss function is: $L_{ID} = -\sum_{i=1}^{N} y_i \log(p_i)$, where $N$ is the total number of categories, $y_i$ is the true label, and $p_i$ is the probability predicted by the model that the image belongs to the $i$-th class; the formula for the triplet loss function is: $L_{triplet} = \max(d_p - d_n + \alpha, 0)$, where $d_p$ is the distance between positive sample pairs, $d_n$ is the distance between negative sample pairs, and $\alpha$ is a hyperparameter representing the allowed distance difference; the formula for the center loss function is: $L_{center} = \frac{1}{2}\sum_{i=1}^{B} \lVert f_i - c_{y_i} \rVert_2^2$, where $f_i$ is the deep feature of the $i$-th image in the mini-batch, $y_i$ is its label, $c_{y_i}$ is the center of the $y_i$-th class of deep features, and $B$ is the batch size; therefore, the global loss function is: $L = L_{ID} + L_{triplet} + \beta L_{center}$.

4. A miner safety monitoring system based on cross-modal pedestrian re-identification, characterized in that the system, based on the miner safety monitoring method according to any one of claims 1-3, includes a shared dual-stream ResNet50 module, a global feature module, and a local feature module, wherein the local feature module contains three independent SP_Net modules, the global feature module contains a bilinear fully connected layer and a global loss function, and the global feature module and the local feature module are trained with joint loss.

Citation Information

Patent Citations

  • Video pedestrian re-identification method based on channel attention mechanism and application

    CN112836646A

  • Near infrared-visible light cross-modal double-current pedestrian re-identification method and system

    CN114220124A