A dynamic footprint retrieval method based on multi-class feature fusion
By employing a dynamic footprint retrieval method that integrates multiple features, and utilizing convolutional neural networks and feature fusion modules, the problem that static footprint images cannot reflect behavioral information is solved, achieving more efficient dynamic footprint recognition and improving recognition success rate and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ANHUI UNIV
- Filing Date
- 2023-02-06
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, static footprint images can only reflect the overall pressure information of the sole of the foot and cannot reflect behavioral information. Furthermore, they require high image quality, resulting in poor recognition performance, especially when there are incomplete parts, making accurate recognition impossible.
A dynamic footprint retrieval method based on multi-class feature fusion is adopted. By collecting dynamic footprint data, training and testing sets are constructed. A multi-class feature fusion network model is used for training and testing, including a convolutional neural network, an appearance feature fusion module, and a frame-by-frame feature aggregation module. The long and short distance temporal relationships are constructed, and spatiotemporal features are fused for retrieval.
It improves the accuracy and robustness of footprint retrieval, effectively utilizes plantar pressure change information, reduces dependence on image quality, and achieves higher recognition success rate and stability.
Figure CN116089647B_ABST
Description
Technical Field
[0001] This invention belongs to the field of footprint retrieval, specifically a dynamic footprint retrieval method based on the fusion of multiple features.
Background Technology
[0002] In the early days, identity verification technologies were not mature enough. With the rise of technology, DNA testing and fingerprint recognition technologies developed rapidly and were quickly put into use. This allowed criminal investigators to obtain more information at crime scenes, enabling them to quickly identify suspects. Since the publication of the AlexNet convolutional neural network in 2012, computer vision technology has also experienced rapid development, bringing new technologies to identity verification—facial recognition. However, all of the above technologies have a drawback: they can be forged. Footprints, however, are different; they reflect a person's behavioral information. This information reflects a person's physiological activities and cannot be hidden. It is precisely this that has led to a growing number of people conducting footprint research.
[0003] Significant progress has been made in the research of static footprint images, but static footprint images have several drawbacks: 1. Static footprint images only contain overall pressure information of the sole of the foot and cannot reflect behavioral information; 2. High image quality is required, and if there are incomplete parts of the footprint, it cannot be accurately identified.
[0004] To address this, the present invention proposes a dynamic footprint retrieval method based on multi-class feature fusion.
Summary of the Invention
[0005] This invention aims to at least solve one of the technical problems existing in the prior art. To this end, this invention proposes a dynamic footprint retrieval method based on multi-class feature fusion. This method addresses the problem of how to mine the intrinsic relationships between footprint images and deeper information about the footprints based on footprint images, thereby improving footprint retrieval results.
[0006] To achieve the above objectives, according to an embodiment of the first aspect of the present invention, a dynamic footprint retrieval method based on multi-class feature fusion is proposed, comprising:
[0007] Collect dynamic footprint data to construct training and testing sets for the retrieval network;
[0008] The acquired training set data is input into a multi-class feature fusion network model for training;
[0009] The obtained test set is input into the trained multi-class feature fusion network model for testing.
[0010] Furthermore, the initial dataset D1 acquired is preprocessed as follows:
[0011] Denoising; examine the pixel values of each column of the image by column, identify noise, and remove it;
[0012] Cutting: First, determine the position of the heel and toe of each footprint in the image of the entire footprint, and mark them as P1 and P2 respectively; then calculate a rectangle based on P1 and P2, and cut the image of the entire footprint using the rectangle;
[0013] Centering; determine the top, bottom, left, and right outer keypoints of the footprints in the footprint image, namely P1, P2, P3, and P4; the four outer keypoints move outward simultaneously, and stop moving when the distance between P1 and P2 and between P3 and P4 is a preset number of pixels; take P1, P2, P3, and P4 at this time as the four boundaries of the footprint image to obtain the preprocessed dataset D2;
[0014] The preprocessed dataset D2 is divided into training and test sets in a 3:1 ratio, and the test set is divided into base database and retrieval database in a 2:1 ratio.
[0015] Furthermore, the multi-feature fusion network model includes a convolutional neural network, an appearance feature fusion module, and a frame-by-frame feature aggregation module. The convolutional neural network extracts input information to obtain frame-by-frame features, which are then processed by the appearance feature fusion module to obtain global appearance features. Subsequently, the time aggregation branch of the frame-by-frame feature aggregation module is used to construct long-distance temporal relationships, and the long-short distance fusion branch substitutes short-distance features into the calculation to obtain spatiotemporal features containing long- and short-distance time information. Finally, the appearance features and spatiotemporal features are fused to perform the retrieval task.
[0016] Furthermore, for each dynamic footprint in the training set, N frames of images are read sequentially. Before inputting the data into the convolutional neural network, a normalization preprocessing operation is performed. The normalized preprocessed data is then input into the convolutional neural network. The input data size is (B, N, 3, 224, 224), and the output F1 size is (B, N, H). Here, B represents the batch size, 3 represents the number of channels in the image, 224 represents the height or width of the image, and H represents the length of the one-dimensional vector after stretching the feature map.
[0017] Furthermore, global appearance features are obtained through the appearance feature fusion module, as follows:
[0018] The image features $x_i$ extracted by the convolutional neural network are concatenated to form a feature vector $X \in N^{B \times H'}$, that is, $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, where $x_i$ represents the features of the i-th frame image and $H'$ denotes the feature length after concatenation;
[0019] Set a trainable weight matrix P, and multiply the feature vector X with the weight matrix P to obtain the final integrated vector X1, i.e., X1 = P·X.
[0020] Furthermore, the connection between frames is established through the frame feature aggregation module, as follows:
[0021] The temporal aggregation branch constructs long-distance temporal relationships between frames, feeding the image features F1 extracted by the neural network into the temporal aggregation branch, i.e., $F_2 = \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(F_1)) \cdot F_1)$, where $F_1$ consists of n frame features; $F_2$ represents the long-range temporal feature calculated by the temporal aggregation branch after the frame features are processed, and its size is (B, 4096); $\mathrm{Conv}_{3\times3}$ denotes a 3×3 convolutional layer; $\mathrm{ReLU}(\cdot)$ is the rectified linear activation; MaxPool is a max pooling layer with a kernel size of 8.
[0022] Construct long- and short-distance fusion branches; fuse long- and short-distance time information to obtain the spatiotemporal feature S, whose size is (B, 4096); in the fusion formula, [·] represents the vector inner product operation, ||·|| represents the vector modulus operation, and Cat(·) is the vector concatenation operation.
[0023] Furthermore, the training process of the multi-class feature fusion network model is as follows:
[0024] After feature calculation, a loss function is calculated, and then the network parameters are optimized through backpropagation to train a multi-class feature fusion network model with the best retrieval performance; the formula is as follows:
[0025]
[0026]
[0027] $L = \lambda \cdot L_{center} + L_{cross}$
[0028] where $L_{center}$ is the center loss; $L_{cross}$ is the cross-entropy loss; $X_j$ denotes the output feature of the network, which the center loss compares with the center feature of the $p_j$-th class; $y^{(i)}$ denotes the true label of the current sample, which the cross-entropy loss compares with the predicted label of the current sample; based on experience, the value of $\lambda$ is set to one-thousandth (0.001).
[0029] Furthermore, the testing process for the multi-feature fusion network model is as follows:
[0030] The trained network extracts features from images in the retrieval database and then calculates the distance between the retrieved image and the features in the base database. The calculated distances are sorted in ascending order. If the ID of the retrieved image has a corresponding ID in the base database, the retrieval is successful; otherwise, the retrieval fails.
[0031] Compared with the prior art, the beneficial effects of the present invention are:
[0032] This invention uses dynamic footprint data for research and constructs a dynamic footprint retrieval model based on multi-class feature fusion, so that the retrieval of footprints is not limited to the content information of footprint images, but focuses on the information of changes in plantar pressure.
[0033] The appearance feature fusion module designed in this invention uses a trainable weight matrix. The weights during feature fusion are continuously optimized during network training to obtain a suitable weight, so that the fused appearance features have stronger expressive power.
[0034] The frame-by-frame feature aggregation module designed in this invention constructs the spatiotemporal information between frame-by-frame footprint images and calculates long-distance temporal information through a time aggregation branch. At the same time, it uses a long-short distance fusion branch to fuse long and short temporal information, so that the calculated features contain both long and short temporal information of the footprints and also solve the problem of spatial information redundancy in the fusion process.
Attached Figure Description
[0035] Figure 1 is a schematic diagram of the overall process of dynamic footprint retrieval according to the present invention;
[0036] Figure 2 is a schematic diagram of the structure of the multi-feature fusion network of the present invention;
[0037] Figure 3 is a schematic diagram of the appearance feature fusion module of the present invention;
[0038] Figure 4 is a schematic diagram of the frame feature aggregation module of the present invention.
Detailed Implementation
[0039] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0040] As shown in Figure 1, a dynamic footprint retrieval method based on multi-class feature fusion includes:
[0041] Step 1: Collect dynamic footprint data to construct the training and test sets for the retrieval network;
[0042] In an embodiment of the present invention, a foot pressure lap data acquisition system is used as the hardware device to acquire dynamic footprint data: the subject walks normally on the foot pressure lap data acquisition board while wearing shoes with different patterns, and after each lap is completed, the software system generates a file recording the data of that lap.
[0043] The footprint data was split into frames at 20 frames per second to obtain a dynamic footprint dataset D; the data was then grouped by the ID of each subject to construct an initial dataset D1; each frame of the initial dataset D1 was preprocessed by denoising, cutting, and centering to obtain the processed dataset D2. The specific preprocessing process is as follows:
[0044] Step S11: Denoising; examine the pixel values of the image column by column. If the blue channel accounts for the larger proportion of every pixel value in a column, that column is identified as noise and removed.
[0045] Step S12: Cutting; First, determine the position of the heel and toe of each footprint in the image of the entire footprint, and mark them as P1 and P2 respectively; then calculate a rectangle based on P1 and P2, and cut the image of the entire footprint using the rectangle;
[0046] Step S13: Centering; Determine the outer keypoints of the footprints in the footprint image (top, bottom, left, and right), namely P1, P2, P3, and P4; Move the four outer keypoints outward simultaneously. Stop moving when the distance between P1 and P2, and between P3 and P4, is 250 pixels; Use P1, P2, P3, and P4 at this point as the four boundaries of the footprint image. The pixel size of the footprint image is 250*250, thus obtaining the preprocessed dataset D2.
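The denoising and centering steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the `ratio` threshold of 0.5, zeroing a noisy column rather than deleting it, and placing the bounding box on a blank 250×250 canvas (equivalent to moving the four outer keypoints outward when the source image has enough margin) are all hypothetical choices. The heel/toe cutting of step S12 is omitted because the text does not say how the rectangle is computed from P1 and P2.

```python
import numpy as np


def remove_blue_columns(img, ratio=0.5):
    """Zero out columns in which every pixel is dominated by the blue channel.

    img: (H, W, 3) RGB array. `ratio` is a hypothetical threshold; the text
    only says the blue channel has "a larger proportion".
    """
    img = img.astype(np.float64)
    total = img.sum(axis=2) + 1e-9              # avoid division by zero
    blue_share = img[:, :, 2] / total           # (H, W) share of blue per pixel
    noisy = (blue_share > ratio).all(axis=0)    # one flag per column
    out = img.copy()
    out[:, noisy, :] = 0.0
    return out


def center_footprint(img, size=250):
    """Center the footprint's bounding box on a size x size canvas."""
    mask = img.sum(axis=2) > 0
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top, bottom = rows[0], rows[-1]             # outer keypoints P1, P2
    left, right = cols[0], cols[-1]             # outer keypoints P3, P4
    crop = img[top:bottom + 1, left:right + 1]
    h, w = crop.shape[:2]
    if h > size or w > size:
        raise ValueError("footprint larger than target size")
    canvas = np.zeros((size, size, img.shape[2]), dtype=img.dtype)
    y0, x0 = (size - h) // 2, (size - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = crop
    return canvas


# Toy frame: a footprint-like block plus one blue-dominated noise column.
frame = np.zeros((100, 80, 3), dtype=np.uint8)
frame[10:40, 20:35] = (200, 50, 10)
frame[:, 70] = (0, 0, 255)
clean = remove_blue_columns(frame)
centered = center_footprint(clean)
print(centered.shape)
```

In this toy case the 30×15 block ends up in the middle of the 250×250 canvas and the blue column is gone before the bounding box is measured, which is why denoising has to run first.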
[0047] The preprocessed dataset D2 was divided into a training set and a test set in a 3:1 ratio. The training set contained 150 classes and a total of 108,000 images, with 30 dynamic footprints per class. The test set contained 50 classes and a total of 36,000 images, with 30 dynamic footprints per class. The test set was divided into a base database and a retrieval database in a 2:1 ratio, with the base database containing 12,000 images and the retrieval database containing 24,000 images.
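The class and image counts above are consistent with splitting by identity and with 24 frames per dynamic footprint, since 108,000 / (150 × 30) = 24 and 36,000 / (50 × 30) = 24. A minimal sketch of such a split is shown below; the function name, the random shuffle, and the seed are assumptions, and the figure of 24 frames is derived from the stated totals rather than given in the text.

```python
import random


def split_identities(ids, train_parts=3, test_parts=1, seed=0):
    """Shuffle identity labels and split them train:test by the given parts."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * train_parts // (train_parts + test_parts)
    return ids[:n_train], ids[n_train:]


FOOTPRINTS_PER_ID = 30        # stated in the text
FRAMES_PER_FOOTPRINT = 24     # derived: 108000 / (150 * 30)

train_ids, test_ids = split_identities(range(200))
train_images = len(train_ids) * FOOTPRINTS_PER_ID * FRAMES_PER_FOOTPRINT
test_images = len(test_ids) * FOOTPRINTS_PER_ID * FRAMES_PER_FOOTPRINT
print(len(train_ids), train_images)   # 150 108000
print(len(test_ids), test_images)     # 50 36000
```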
[0048] Step 2: Input the acquired training set data into the multi-class feature fusion network model for training;
[0049] To perform dynamic footprint retrieval, this invention constructs a multi-class feature fusion network model. As shown in Figure 2, the multi-feature fusion network model includes a convolutional neural network, an appearance feature fusion module, and a frame-by-frame feature aggregation module. Specifically, the convolutional neural network extracts input information to obtain frame-by-frame features, which are then processed by the appearance feature fusion module to obtain global appearance features. Next, the temporal aggregation branch of the frame-by-frame feature aggregation module constructs long-distance temporal relationships, and the long-short-distance fusion branch incorporates short-distance features into the calculation to obtain spatiotemporal features containing both long and short-distance temporal information. Finally, the appearance features and spatiotemporal features are fused for the retrieval task. The specific process is as follows:
[0050] Step S21: Data reading; read N frames of images sequentially for each dynamic footprint in the training set. Perform normalization preprocessing before inputting the data into the convolutional neural network to reduce noise interference. The data input into the convolutional neural network has size (B, N, 3, 224, 224), where B is the batch size, N is the number of frames, 3 is the number of image channels, and 224 is the height and width of the image.
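The (B, N, 3, 224, 224) layout can be illustrated with a short sketch. The text does not say which normalization is used, so the per-channel mean and standard deviation below (the widely used ImageNet statistics) are an assumption, as are the function names and the premise that frames have already been resized to 224×224.

```python
import numpy as np

# Assumed per-channel statistics; the source does not specify them.
MEAN = np.asarray((0.485, 0.456, 0.406), dtype=np.float32)
STD = np.asarray((0.229, 0.224, 0.225), dtype=np.float32)


def normalize_clip(frames):
    """frames: (N, 224, 224, 3) uint8 -> (N, 3, 224, 224) float32."""
    x = frames.astype(np.float32) / 255.0
    x = (x - MEAN) / STD                      # broadcast over the channel axis
    return np.transpose(x, (0, 3, 1, 2))      # channels-last -> channels-first


def make_batch(clips):
    """Stack B normalized clips into one (B, N, 3, 224, 224) array."""
    return np.stack([normalize_clip(c) for c in clips])


rng = np.random.default_rng(0)
clips = [rng.integers(0, 256, size=(8, 224, 224, 3), dtype=np.uint8)
         for _ in range(2)]
batch = make_batch(clips)
print(batch.shape)   # (2, 8, 3, 224, 224)
```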
[0051] Step S22: Extract image features using a convolutional neural network; compared with face recognition images, footprint images are simpler and clearer, so a complex neural network is not well suited to extracting their features. This invention therefore modifies the AlexNet convolutional neural network, reducing the complexity of the network while extracting richer feature information;
[0052] The input data is of size (B, N, 3, 224, 224), and the output F1 is of size (B, N, H), where H represents the length of the one-dimensional vector after stretching the feature map.
[0053] Step S23: Obtain global appearance features through the appearance feature fusion module. As shown in Figure 3, a static footprint image is composed of multiple frames of dynamic footprints, but the frames differ in importance. Therefore, a new fusion method is designed that assigns an appropriate weight to the features of each frame and fuses the features according to those weights, which effectively integrates the appearance information of the frame-by-frame features. The specific process is as follows:
[0054] Step S231: The image features $x_i$ extracted by the convolutional neural network are concatenated to form a feature vector $X \in N^{B \times H'}$. The specific formula is as follows:
[0055] $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$
[0056] where $x_i$ represents the features of the i-th frame image and $H'$ denotes the feature length after concatenation;
[0057] Step S232: Set a trainable weight matrix P, and multiply the feature vector X with the weight matrix P to obtain the final integrated vector X1. The specific formula is as follows:
[0058] X1 = P·X
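One possible reading of the weighted fusion is sketched below. The text does not give the shape of P, so this sketch assumes one trainable scalar weight per frame and frame features stacked as a (B, N, H) array; the function name and the uniform initial weights are hypothetical. In a real model P would be updated by backpropagation; here it is a fixed array.

```python
import numpy as np


def fuse_appearance(F1, P):
    """Weighted sum over frames.

    F1: (B, N, H) frame-by-frame features.
    P:  (N,) one weight per frame (assumed shape).
    Returns the fused appearance feature of shape (B, H).
    """
    return np.einsum("n,bnh->bh", P, F1)


rng = np.random.default_rng(0)
F1 = rng.normal(size=(2, 8, 16))     # B=2 clips, N=8 frames, H=16 features
P = np.full(8, 1.0 / 8)              # uniform start; training would adjust it
X1 = fuse_appearance(F1, P)
print(X1.shape)                      # (2, 16)
```

With uniform weights the result equals the plain average over frames, so the learned weights can be understood as a trainable departure from simple averaging.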
[0059] Step S24: Establish the connection between frames through the frame feature aggregation module. As shown in Figure 4, the frame feature aggregation module of the present invention mainly consists of two branches: a time aggregation branch and a long-short distance fusion branch. The details are as follows:
[0060] Step S241: The temporal aggregation branch constructs the long-distance temporal relationship between the frames. The image features F1 extracted by the neural network are fed into the temporal aggregation branch. The specific formula is as follows:
[0061] $F_2 = \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(F_1)) \cdot F_1)$
[0062] where $F_1$ consists of n frame features; $F_2$ represents the long-range temporal feature calculated from the frame features by the temporal aggregation branch, and its size is (B, 4096); $\mathrm{Conv}_{3\times3}$ denotes a 3×3 convolutional layer; $\mathrm{ReLU}(\cdot)$ is the rectified linear activation; MaxPool is a max pooling layer with a kernel size of 8.
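A rough sketch of the temporal aggregation branch follows. Several points are assumptions rather than details from the text: F1 is treated as a single-channel (N, H) map per sample, the activation is taken to be ReLU, the product with F1 is taken to be element-wise, the max pooling runs along the frame axis, and N = 8 with H = 4096 is chosen so that a kernel of 8 yields the stated (B, 4096) output. The convolution kernel is a fixed array standing in for trained weights.

```python
import numpy as np


def conv3x3_same(x, kernel):
    """Zero-padded single-channel 3x3 cross-correlation.

    x: (B, N, H); kernel: (3, 3). Output has the same shape as x.
    """
    _, n, h = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[:, di:di + n, dj:dj + h]
    return out


def temporal_aggregation(F1, kernel, pool=8):
    """Sketch of F2 = MaxPool(ReLU(Conv3x3(F1)) * F1)."""
    gate = np.maximum(conv3x3_same(F1, kernel), 0.0)    # ReLU gate
    gated = gate * F1                                   # element-wise product
    b, n, h = gated.shape
    if n % pool != 0:
        raise ValueError("frame count must be a multiple of the pool size")
    pooled = gated.reshape(b, n // pool, pool, h).max(axis=2)
    return pooled.reshape(b, -1)


rng = np.random.default_rng(0)
F1 = rng.normal(size=(2, 8, 4096))
kernel = 0.1 * rng.normal(size=(3, 3))
F2 = temporal_aggregation(F1, kernel)
print(F2.shape)    # (2, 4096)
```

Because the gate is computed from a 3×3 neighbourhood spanning adjacent frames, each output value depends on more than one frame before pooling collapses the frame axis.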
[0063] Step S242: Construct the long-short distance fusion branch; the time aggregation branch mainly establishes global time information, but short-distance time information is also essential. Therefore, drawing on the idea of orthogonal fusion, long- and short-distance time information is fused so that the resulting features retain time information while redundant spatial information is reduced. The specific formula is as follows:
[0064]
[0065] Where S represents the spatiotemporal feature after the fusion of long and short distance time features, and its size is (B, 4096), [·] represents the vector inner product operation, ||·|| represents the vector modulus operation, and Cat(·) is the vector concatenation operation;
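The fusion formula itself is not reproduced in the text, so the sketch below shows only the generic orthogonal fusion pattern, which uses exactly the three operations named above: an inner product, a vector modulus, and a concatenation. Which feature is projected onto which, the function name, and the feature sizes are assumptions, and the sketch makes no attempt to match the stated (B, 4096) output size.

```python
import numpy as np


def orthogonal_fusion(long_feat, short_feat, eps=1e-12):
    """Concatenate long_feat with the part of short_feat orthogonal to it.

    long_feat, short_feat: (B, D). Returns an array of shape (B, 2 * D).
    """
    dot = np.sum(short_feat * long_feat, axis=1, keepdims=True)   # inner product
    norm_sq = np.sum(long_feat * long_feat, axis=1, keepdims=True)  # modulus squared
    projection = dot / (norm_sq + eps) * long_feat
    orthogonal = short_feat - projection       # drops what long_feat already carries
    return np.concatenate([long_feat, orthogonal], axis=1)        # Cat


rng = np.random.default_rng(0)
long_feat = rng.normal(size=(2, 16))
short_feat = rng.normal(size=(2, 16))
S = orthogonal_fusion(long_feat, short_feat)
print(S.shape)    # (2, 32)
```

Removing the projection before concatenating is what reduces redundancy: the second half of S carries only information that the long-distance feature does not already contain.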
[0066] Step S25: Training Process: After feature calculation, the loss function is calculated, and then the network parameters are optimized through backpropagation to train the multi-class feature fusion network model with the best retrieval performance; In this invention, the loss function consists of cross-entropy loss and center loss, and the specific formula is as follows:
[0067]
[0068]
[0069] $L = \lambda \cdot L_{center} + L_{cross}$
[0070] where $L_{center}$ is the center loss; $L_{cross}$ is the cross-entropy loss; $X_j$ denotes the output feature of the network, which the center loss compares with the center feature of the $p_j$-th class; $y^{(i)}$ denotes the true label of the current sample, which the cross-entropy loss compares with the predicted label of the current sample; based on experience, the value of $\lambda$ is set to 0.001.
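The combined loss can be sketched as follows. The individual loss formulas are not reproduced in the text, so this sketch assumes the commonly used forms: a softmax cross-entropy averaged over the batch, and a center loss equal to half the squared distance between each feature and its class center, also averaged over the batch. The function names and the fixed `centers` array are hypothetical; in training the centers would be learned alongside the network.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (B, C); labels: (B,) integer ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def center_loss(features, labels, centers):
    """Mean of 0.5 * ||x - c_label||^2. features: (B, D); centers: (C, D)."""
    diff = features - centers[labels]
    return 0.5 * np.sum(diff * diff, axis=1).mean()


def total_loss(features, logits, labels, centers, lam=0.001):
    """L = lam * L_center + L_cross, with lam = 0.001 as stated in the text."""
    return lam * center_loss(features, labels, centers) + cross_entropy(logits, labels)


labels = np.array([0, 2])
centers = np.zeros((4, 3))                 # 4 classes, 3-dimensional features
features = np.ones((2, 3))                 # each sample is 1.0 away per dimension
logits = np.zeros((2, 4))                  # uniform prediction over 4 classes
loss = total_loss(features, logits, labels, centers)
print(loss)
```

With λ = 0.001 the center term acts as a light regularizer that pulls features toward their class centers while the cross-entropy term dominates the gradient.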
[0071] Step 3: Input the obtained test set into the trained multi-class feature fusion network model for testing;
[0072] Testing Process: The trained network extracts features from images in the retrieval database and then calculates the distance (using Euclidean distance) between these features and the base database. The calculated distances are sorted in ascending order, with Rank 1 representing the retrieval result. If the retrieved image ID has a corresponding ID in the base database, the retrieval is successful; otherwise, the retrieval fails. The retrieval evaluation metrics for this invention are mAP, Rank 1, Rank 5, and Rank 10, with Rank 1 being the most important, representing the network's retrieval performance, while mAP represents the network's stability. This invention achieved the best retrieval performance on the footprint dataset, with an mAP value of 55.282% and a Rank 1 value of 85.388%.
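The ranking step and the Rank-k metric can be sketched as below. Features are plain arrays standing in for network outputs, and the function name and toy data are hypothetical; mAP is not computed here.

```python
import numpy as np


def rank_k_accuracy(query_feats, query_ids, base_feats, base_ids, ks=(1, 5, 10)):
    """Fraction of queries whose ID appears among the k nearest base entries.

    query_feats: (Q, D), base_feats: (G, D); IDs are integer arrays.
    """
    diff = query_feats[:, None, :] - base_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))        # (Q, G) Euclidean distances
    order = np.argsort(dist, axis=1)               # ascending: nearest first
    ranked_ids = base_ids[order]                   # (Q, G) IDs in ranked order
    hits = ranked_ids == query_ids[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}


base_feats = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
base_ids = np.array([1, 2, 3])
query_feats = np.array([[0.1, 0.0], [9.0, 1.0], [4.0, 5.5]])
query_ids = np.array([1, 2, 1])
acc = rank_k_accuracy(query_feats, query_ids, base_feats, base_ids, ks=(1, 2))
print(acc)
```

In this toy case the third query is nearest to an entry with a different ID, so it counts as a miss at Rank 1 but as a hit at Rank 2.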
[0073] The above formulas are all numerical calculations after removing dimensions. The formulas are obtained by software simulation based on a large amount of data and are closest to the real situation. The preset parameters and preset thresholds in the formulas are set by those skilled in the art according to the actual situation or obtained by simulation based on a large amount of data.
[0074] The above embodiments are only used to illustrate the technical methods of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical methods of the present invention without departing from the spirit and scope of the technical methods of the present invention.
Claims
1. A dynamic footprint retrieval method based on multi-class feature fusion, characterized in that it comprises: Collect dynamic footprint data to construct training and testing sets for the retrieval network; The acquired training set data is input into a multi-class feature fusion network model for training; The acquired test set is input into the trained multi-class feature fusion network model for testing; The multi-feature fusion network model includes a convolutional neural network, an appearance feature fusion module, and a frame-by-frame feature aggregation module. The convolutional neural network extracts the input information to obtain frame-by-frame features, which are then processed by the appearance feature fusion module to obtain global appearance features. The time aggregation branch of the frame-by-frame feature aggregation module is then used to construct long-distance temporal relationships. The long-short distance fusion branch substitutes short-distance features into the calculation to obtain spatiotemporal features containing long- and short-distance time information. Finally, the appearance features and spatiotemporal features are fused to perform the retrieval task. The global appearance features are obtained through the appearance feature fusion module, as follows: The image features $x_i$ extracted by the convolutional neural network are concatenated to form a feature vector $X \in N^{B \times H'}$, that is, $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, wherein $x_i$ is the feature of the i-th frame image, $H'$ denotes the feature length after concatenation, and B is the batch size. A trainable weight matrix P is set, and the feature vector X is multiplied by the weight matrix P to obtain a final integrated vector $X_1$, that is, $X_1 = P \cdot X$.
2. The dynamic footprint retrieval method based on multi-class feature fusion according to claim 1, characterized in that, The initial dataset D1 acquired through collection is preprocessed as follows: Denoising; examine the pixel values of each column of the image by column, identify noise, and remove it; Cutting: First, determine the position of the heel and toe of each footprint in the image of the entire footprint, and mark them as P1 and P2 respectively; then calculate a rectangle based on P1 and P2, and cut the image of the entire footprint using the rectangle; Centering; determine the top, bottom, left, and right outer keypoints of the footprints in the footprint image, namely P1, P2, P3, and P4; the four outer keypoints move outward simultaneously, and stop moving when the distance between P1 and P2 and between P3 and P4 is a preset number of pixels; take P1, P2, P3, and P4 at this time as the four boundaries of the footprint image to obtain the preprocessed dataset D2; The preprocessed dataset D2 is divided into training and test sets in a 3:1 ratio, and the test set is divided into base database and retrieval database in a 2:1 ratio.
3. The dynamic footprint retrieval method based on multi-class feature fusion according to claim 1, characterized in that, for each dynamic footprint in the training set, N frames of images are read sequentially. Normalization preprocessing is performed before inputting the data into the convolutional neural network. The normalized preprocessed data is then input into the convolutional neural network. The input data size is (B, N, 3, 224, 224), and the output F1 has size (B, N, H); where B is the batch size, 3 represents the number of channels in the image, 224 represents the height or width of the image, and H represents the length of the one-dimensional vector after stretching the feature map.
4. The dynamic footprint retrieval method based on multi-class feature fusion according to claim 1, characterized in that the connection between frames is established through the frame feature aggregation module, as follows: The temporal aggregation branch constructs long-distance temporal relationships between frames, feeding the image features $F_1$ extracted by the neural network into the temporal aggregation branch, i.e., $F_2 = \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(F_1)) \cdot F_1)$, where $F_1$ consists of n frame features; $F_2$ represents the long-range temporal feature calculated by the temporal aggregation branch after the frame features are processed, and its size is (B, 4096); $\mathrm{Conv}_{3\times3}$ denotes a 3×3 convolutional layer; $\mathrm{ReLU}(\cdot)$ is the rectified linear activation; MaxPool is a max pooling layer with a kernel size of 8; and B represents the batch size; Construct long- and short-distance fusion branches; fuse long- and short-distance time information to obtain S, where S represents the spatiotemporal feature after fusing long and short distance temporal features, and its size is (B, 4096); [·] represents the vector inner product operation, ||·|| represents the vector modulus operation, and Cat(·) is the vector concatenation operation.
5. The dynamic footprint retrieval method based on multi-class feature fusion according to claim 1, characterized in that the training process of the multi-class feature fusion network model is as follows: After feature calculation, a loss function is calculated, and then the network parameters are optimized through backpropagation to train a multi-class feature fusion network model with the best retrieval performance; the formula is as follows: $L = \lambda \cdot L_{center} + L_{cross}$, where $L_{center}$ is the center loss; $L_{cross}$ is the cross-entropy loss; $X_j$ denotes the output feature of the network, which the center loss compares with the center feature of the $p_j$-th class; $y^{(i)}$ denotes the true label of the current sample, which the cross-entropy loss compares with the predicted label of the current sample; based on experience, the value of $\lambda$ is set to one-thousandth.
6. The dynamic footprint retrieval method based on multi-class feature fusion according to claim 1, characterized in that, The testing process for the multi-feature fusion network model is as follows: The trained network extracts features from images in the retrieval database and then calculates the distance between the retrieved image and the features in the base database. The calculated distances are sorted in ascending order. If the ID of the retrieved image has a corresponding ID in the base database, the retrieval is successful; otherwise, the retrieval fails.