A face recognition method, system and storage medium
By performing tracking segmentation, feature extraction, and similarity calculation on the video stream, and combining dynamic angle function and modulus penalty function, the problem of low accuracy in face recognition under unconstrained environment is solved, and efficient real-time face recognition is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAMEN MEITUZHIJIA TECH
- Filing Date
- 2023-09-06
- Publication Date
- 2026-06-23
AI Technical Summary
Existing face recognition methods are not very accurate in unconstrained environments. In particular, the variability of face images in video streams increases the difficulty of recognition, making it difficult to distinguish between different frames of the same person or frames of different people.
By dividing video information into tracking segments, performing feature extraction and similarity calculation, and combining dynamic angle function and modulus penalty function, the Hungarian algorithm is used to update the face representation database, thereby improving the accuracy and efficiency of face recognition.
It effectively eliminates interference from low-quality face images, improves the accuracy and efficiency of face recognition, and achieves high-accuracy real-time face recognition under unconstrained video streams.
Figure CN117253271B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of facial recognition technology, and in particular to a facial recognition method, system, and storage medium.
Background Technology
[0002] Facial recognition is a technology that can identify or verify the identity of a subject in an image or video. When deployed in unconstrained environments, facial recognition is one of the most challenging biometric methods due to the high variability of facial images in the real world.
[0003] Typically, face recognition in video streams falls under the category of unconstrained face recognition problems. Within consecutive video frames, variations in facial images include head pose, age, occlusion, lighting conditions, and facial expressions. Traditional face recognition methods rely on a combination of manually designed features (such as edge and texture descriptors) and machine learning techniques (such as principal component analysis, linear discriminant analysis, or support vector machines). Recently, traditional face recognition methods have been largely superseded by deep learning methods based on convolutional neural networks (CNNs). The main advantage of deep learning methods is their ability to be trained on very large datasets, thereby learning the optimal features to represent that data.
[0004] However, in a video, due to factors such as facial image movement and blurring, occlusion, or changes in lighting conditions, the difficulty of facial recognition varies from frame to frame. As a result, the facial recognition results of the video stream often show that the same person in different frames is identified as multiple different people, or multiple different people are difficult to distinguish.
[0005] Therefore, existing facial recognition methods suffer from low accuracy.
Summary of the Invention
[0006] The main objective of this invention is to provide a face recognition method, system, and storage medium, aiming to solve the technical problem of low accuracy in existing face recognition methods.
[0007] To achieve the above objectives, the present invention provides a face recognition method, comprising the following steps: S1, acquiring video information and dividing it into several tracking segments, each tracking segment including several consecutive face images; S2, extracting features from each frame in each tracking segment to obtain face feature vectors, obtaining and comparing the corresponding moduli of all face feature vectors, and selecting the video frame corresponding to the maximum moduli as the face representation of the tracking segment; S3, acquiring a set of simultaneously existing tracking segments and calculating the first similarity between the face representations of each tracking segment in each tracking segment set; S4, performing face matching based on a preset face representation library and the first similarity and outputting the face recognition result.
[0008] Optionally, in step S1, the video information is obtained and divided into several tracking segments. Specifically, the tracking segments are divided according to the difference in face position between adjacent frames in the video information. Alternatively, adjacent frames in the video information are input into the scene switching detection network, the scene switching probability is output, and the tracking segments are divided according to the probability.
[0009] Optionally, the facial position difference is determined using the following formula:
[0010] $$\mathrm{IOU} = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$
[0011] Where IOU represents the degree of overlap between two face bounding boxes (the area of their intersection divided by the area of their union), A is the bounding box of one face, and B is the bounding box of another face. If the IOU is less than a preset threshold, a new tracking segment begins.
[0012] Optionally, in step S2, feature extraction is performed using a feature extraction network. The loss function of the feature extraction network is as follows:
[0013]
[0014] Where f(β) represents the dynamic included angle function, g(β) represents the modulus penalty function, and a, b, c, and e are all constant hyperparameters; β represents the modulus; j ranges over all people classified during the training phase, and k denotes the k-th person among them; θ_k is the angle between the vector w_k and the face feature vector x, where w_k represents the weight corresponding to the k-th person in the softmax layer; θ_j is the angle between the vector w_j and the face feature vector x, where w_j represents the weight corresponding to the j-th person in the softmax layer; and s represents the temperature coefficient.
[0015] Optionally, the set of simultaneously existing tracking segments is obtained based on the spatiotemporal information of the tracking segments, specifically determined according to the following preset logic: given tracking segments C1, C2, ..., Cn, if tracking segments appear simultaneously in a video frame at a certain moment, the corresponding tracking segments are saved into the same tracking segment set.
[0016] Optionally, before step S4, the following steps are also included: matching the preset first threshold with the first similarity; if the first similarity is greater than the first threshold, the corresponding tracking segment belongs to the same individual; obtaining all tracking segments with the first similarity greater than the first threshold, and merging the tracking segments belonging to the same individual in each set.
[0017] Optionally, after merging the tracking segments belonging to the same individual in each set, the following steps are also included: for each set obtained after merging, the Hungarian algorithm is used to perform bipartite graph matching with the preset face representation library, and the preset face representation library is updated in combination with the preset second threshold, wherein the first threshold is less than the second threshold.
[0018] Optionally, if the merged set contains a tracking segment that does not match the preset face representation library, the tracking segment is recorded as a new individual in the preset face feature library. If the merged set contains a tracking segment that matches a tracking segment in the preset face representation library, the similarity between the two is calculated: if the similarity is less than the second threshold, the tracking segment is recorded as a new individual in the preset face feature library; if the similarity is greater than the second threshold, the tracking segment is merged with the corresponding tracking segment in the preset face feature library.
[0019] Corresponding to the aforementioned face recognition method, this invention provides a face recognition system, comprising: a face detection module, used to acquire video information and divide it into several tracking segments, each tracking segment including several consecutive face images; a face representation module, used to extract features from each frame in each tracking segment to obtain a face feature vector, acquire the corresponding modulus of all face feature vectors and compare them, and select the video frame corresponding to the maximum modulus as the face representation of the tracking segment; a calculation module, used to acquire a set of simultaneously existing tracking segments and calculate a first similarity between the face representations of each tracking segment in each tracking segment set; and a face matching module, used to perform face matching based on a preset face representation library and the first similarity and output the face recognition result.
[0020] In addition, to achieve the above objectives, the present invention also provides a computer-readable storage medium storing a face recognition program, which, when executed by a processor, implements the steps of the face recognition method described above.
[0021] The beneficial effects of this invention are:
[0022] (1) Compared with the prior art, the present invention extracts features from each frame in each tracking segment to obtain face feature vectors, obtains the corresponding modulus of all face feature vectors and compares them, and selects the video frame corresponding to the maximum modulus as the face representation of the tracking segment. This can distinguish the recognition difficulty of face images. Combined with the first similarity, it can effectively eliminate the interference of low-quality face images in unconstrained environments on face recognition results and improve the accuracy of face recognition.
[0023] (2) Compared with the prior art, the present invention can improve the accuracy of tracking segment division and improve the efficiency of subsequent face recognition by dividing the tracking segments according to the face position difference or scene switching detection network.
[0024] (3) Compared with the prior art, the present invention introduces a dynamic angle function and a magnitude penalty function through the design of the loss function, thereby realizing the correlation between the magnitude of the face feature vector and the quality of the face image, which can avoid the interference of low-quality face images and improve the accuracy of face recognition.
[0025] (4) Compared with the prior art, the present invention, by linking the magnitude of the facial feature vector with the quality of its facial image and further combining it with spatiotemporal information, obtains a set of simultaneously existing tracking segments, which can achieve high-accuracy real-time facial recognition under unconstrained video streams.
[0026] (5) Compared with the prior art, the present invention can merge tracking segments belonging to the same individual in each set by matching the preset first threshold with the first similarity, distinguish different individuals in the tracking segments that have appeared at the same time, and then use the Hungarian algorithm to perform bipartite graph matching with the preset face representation library for each merged set. Combined with the preset second threshold, the preset face representation library can be updated, thereby improving the efficiency and accuracy of face recognition.
Attached Figure Description
[0027] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings:
[0028] Figure 1 is a simplified flowchart of an embodiment of the face recognition method of the present invention;
[0029] Figure 2 is a simplified structural diagram of an embodiment of the scene detection network of the present invention;
[0030] Figure 3 is a framework diagram of an embodiment of the face recognition system of the present invention.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0032] As shown in Figure 1, a face recognition method of the present invention includes the following steps: S1, acquiring video information and dividing it into several tracking segments, each tracking segment including several consecutive face images; S2, extracting features from each frame in each tracking segment to obtain face feature vectors, obtaining the corresponding modulus of all face feature vectors and comparing them, and selecting the video frame corresponding to the maximum modulus as the face representation of the tracking segment; S3, acquiring a set of simultaneously existing tracking segments and calculating the first similarity between the face representations of each tracking segment in each tracking segment set; S4, performing face matching based on a preset face representation library and the first similarity and outputting the face recognition result.
[0033] This invention extracts features from each frame in each tracking segment to obtain facial feature vectors, obtains and compares the corresponding moduli of all facial feature vectors, and selects the video frame with the maximum moduli as the facial representation of the tracking segment. This can distinguish the recognition difficulty of facial images, and combined with the first similarity, it can effectively eliminate the interference of low-quality facial images in unconstrained environments on the facial recognition results, thereby improving the accuracy of facial recognition.
[0034] Preferably, the first similarity is the cosine similarity, calculated between the face representations of each pair of tracking segments.
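The pairwise cosine similarity described in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_first_similarity(representations):
    """Map each unordered pair of segment ids to its cosine similarity.

    `representations` is a hypothetical dict: segment id -> feature vector.
    """
    ids = sorted(representations)
    return {
        (p, q): cosine_similarity(representations[p], representations[q])
        for i, p in enumerate(ids)
        for q in ids[i + 1:]
    }

# Toy face representations for three tracking segments of one set.
reps = {"C1": [1.0, 0.0], "C2": [2.0, 0.0], "C3": [0.0, 3.0]}
sims = pairwise_first_similarity(reps)
```

Because cosine similarity ignores vector length, C1 and C2 in the toy data are treated as identical in direction even though their moduli differ.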
[0035] It is understood that the tracking segment may contain partially occluded, blurred, or poorly lit images. Therefore, in the method described in this invention, the corresponding modulus of the face feature vector is used as the face quality index, and the video frame corresponding to the maximum modulus (which is also the frame with the best face quality) is selected as the face representation of the tracking segment.
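Selecting the frame with the largest feature-vector modulus as the segment's face representation can be sketched as below. The function names and the toy feature vectors are assumptions for illustration; in practice the vectors would come from the feature extraction network.

```python
import math

def modulus(vec):
    """Euclidean norm (modulus) of a feature vector."""
    return math.sqrt(sum(x * x for x in vec))

def select_representation(frame_features):
    """Return (frame_index, feature_vector) of the frame with the largest modulus.

    `frame_features` is a hypothetical list holding one feature vector per
    frame of a single tracking segment.
    """
    best = max(range(len(frame_features)),
               key=lambda i: modulus(frame_features[i]))
    return best, frame_features[best]

# Three frames of one tracking segment; the moduli are 0.5, 5.0 and 1.0.
segment = [[0.3, 0.4], [3.0, 4.0], [0.6, 0.8]]
idx, rep = select_representation(segment)
```

Here the second frame is chosen because its modulus, used as the face quality index, is the largest of the three.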
[0036] In this embodiment, step S1 involves acquiring video information and dividing it into several tracking segments. Specifically, the tracking segments are divided based on the difference in face position between adjacent frames in the video information. Alternatively, adjacent frames in the video information are input into a scene switching detection network, and the scene switching probability is output. The tracking segments are then divided based on the probability.
[0037] In this embodiment, the facial position difference is determined according to the following formula:
[0038] $$\mathrm{IOU} = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$
[0039] Where IOU represents the degree of overlap between two face bounding boxes (the area of their intersection divided by the area of their union), A is the bounding box of one face, and B is the bounding box of another face. If the IOU is less than a preset threshold, a new tracking segment begins.
[0040] Preferably, the preset threshold is 0.4.
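A minimal sketch of splitting tracking segments by face position difference follows, taking IOU to be the standard intersection-over-union of two boxes. The `(x1, y1, x2, y2)` box format, the function names, and the simplification to a single face per frame are assumptions for illustration only.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def split_into_segments(boxes, threshold=0.4):
    """Group consecutive frame indices into tracking segments.

    A new segment starts whenever the IoU between the face box of the
    previous frame and that of the current frame falls below `threshold`.
    Assumes one face box per frame (a simplification).
    """
    segments = []
    for i, box in enumerate(boxes):
        if i == 0 or iou(boxes[i - 1], box) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments

# The face jumps to a distant position between frame 1 and frame 2.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60), (51, 50, 61, 60)]
segments = split_into_segments(boxes)
```

With the preferred threshold of 0.4, the small shift between frames 0 and 1 stays in one segment, while the jump to a non-overlapping box opens a second segment.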
[0041] This invention improves the accuracy of tracking segment segmentation and the efficiency of subsequent face recognition by dividing the tracking segments according to the face position difference or scene switching detection network.
[0042] In this embodiment, a simplified diagram of the scene detection network structure is shown in Figure 2. Two adjacent input images are processed by the backbone and then fused by MTF (Merge Temporal Features), and the network ultimately outputs the probability of scene switching. Preferably, both the backbone and the head are composed of stacked depthwise separable convolutions.
[0043] In this embodiment, in step S2, feature extraction is performed using a feature extraction network. The loss function of the feature extraction network is as follows:
[0044]
[0045] Where f(β) represents the dynamic included angle function, g(β) represents the modulus penalty function, and a, b, c, and e are all constant hyperparameters; β represents the modulus; j ranges over all people classified during the training phase, and k denotes the k-th person among them; θ_k is the angle between the vector w_k and the face feature vector x, where w_k represents the weight corresponding to the k-th person in the softmax layer; θ_j is the angle between the vector w_j and the face feature vector x, where w_j represents the weight corresponding to the j-th person in the softmax layer; and s represents the temperature coefficient.
[0046] In this embodiment, softmax is used during the training phase for classification over on the order of a million identities (i.e., each image corresponds to one person out of a million people). Therefore, in the above formula, k is one person out of a million people during the training phase. After the model is trained, there is no limit to the number of people to be classified during the usage phase: the softmax layer is removed, and only the face representation obtained from the extracted face feature vectors is used to calculate the first similarity.
[0047] In this embodiment, s is a hyperparameter used to control the model's ability to distinguish negative samples, and is determined through debugging.
[0048] This invention introduces a dynamic angle function and a magnitude penalty function through the design of a loss function, thereby linking the magnitude of the face feature vector with the quality of the face image (using the magnitude of the face feature vector as a face quality index). This can avoid interference from low-quality face images and improve the accuracy of face recognition.
[0049] Typically, in a video, a face tracking segment belongs to only one person, and there may be multiple face tracking segments for the same person. However, at any given moment, there is a high probability that only one face tracking segment exists for the same person. Based on this prior knowledge, this invention improves the accuracy of face recognition by using a high threshold and a low threshold.
[0050] In this embodiment, the set of simultaneously existing tracking segments is obtained based on the spatiotemporal information of the tracking segments, and is specifically determined according to the following preset logic: given tracking segments C1, C2, ..., Cn, if tracking segments appear simultaneously in a video frame at a certain moment, the corresponding tracking segments are saved into the same tracking segment set. For easier understanding, the following example illustrates this:
[0051] If tracking segments C1 and C2 exist simultaneously in the video frame at time t, then save the set {C1, C2}, indicating that tracking segments C1 and C2 have appeared at the same time; if a new tracking segment C3 appears in the video frame at time t+1, then add it to the set {C1, C2} to obtain a new set {C1, C2, C3}; and so on.
[0052] Based on the aforementioned pre-defined logic, after the video ends, several sets will be obtained, each set representing the simultaneous occurrence of these tracked segments. It is understandable that each set of tracked segments may contain several tracked segments, and the same tracked segment may exist in multiple sets.
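One possible reading of this set-building logic is sketched below. It assumes each tracking segment is described by a hypothetical inclusive frame span, collects the segments visible in each frame, and keeps only the maximal sets, so that {C1, C2} is absorbed into {C1, C2, C3} as in the example above. The description does not spell out when a new set begins, so that part is an assumption.

```python
def cooccurrence_sets(spans):
    """Return the maximal sets of segments that share at least one frame.

    `spans` is a hypothetical dict: segment id -> (first_frame, last_frame),
    both inclusive.
    """
    if not spans:
        return []
    first = min(start for start, _ in spans.values())
    last = max(end for _, end in spans.values())
    per_frame = set()
    for t in range(first, last + 1):
        active = frozenset(k for k, (s, e) in spans.items() if s <= t <= e)
        if active:
            per_frame.add(active)
    # Drop any set that is strictly contained in a larger one.
    maximal = [a for a in per_frame if not any(a < b for b in per_frame)]
    return sorted(sorted(a) for a in maximal)

# C1 leaves before C4 arrives, so two overlapping sets result.
spans = {"C1": (0, 5), "C2": (0, 8), "C3": (3, 8), "C4": (7, 10)}
sets_ = cooccurrence_sets(spans)
```

Under this reading C2 and C3 each appear in both resulting sets, which is consistent with the statement that the same tracking segment may exist in multiple sets.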
[0053] This invention links the magnitude of the facial feature vector with the quality of the facial image, and further combines it with spatiotemporal information to obtain a set of simultaneously existing tracking segments, enabling high-accuracy real-time facial recognition under unconstrained video streams.
[0054] In this embodiment, tracking segments appearing at the same time are highly unlikely to belong to the same person. Therefore, a high threshold (i.e., the first threshold) is used to distinguish tracking segments that have appeared simultaneously (preferring them to belong to different people), while those that have not appeared simultaneously still require a low threshold (i.e., the second threshold). The results of the two different thresholds need to form the same preset face feature database, preferably using the Hungarian matching algorithm for merging. Therefore, before step S4, the following steps are also included: matching the preset first threshold with the first similarity; if the first similarity is greater than the first threshold, the corresponding tracking segment belongs to the same individual; obtaining all tracking segments with the first similarity greater than the first threshold, and merging the tracking segments belonging to the same individual in each set.
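The merging step before S4 can be sketched with a small union-find structure. Treating the merge as transitive (if C1 matches C2 and C2 matches C3, all three are grouped) is an assumption, as are the function names and toy vectors; the default of 0.4 is the preferred first threshold given in this description.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def merge_same_individual(representations, first_threshold=0.4):
    """Group the segments of one co-occurrence set by individual.

    Two segments are linked when the cosine similarity of their face
    representations exceeds `first_threshold`; linked segments are merged
    transitively (an assumption of this sketch).
    """
    ids = sorted(representations)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            sim = cosine_similarity(representations[a], representations[b])
            if sim > first_threshold:
                parent[find(b)] = find(a)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

reps = {"C1": [1.0, 0.0], "C2": [0.9, 0.1], "C3": [0.0, 1.0]}
groups = merge_same_individual(reps)
```

In the toy data C1 and C2 point in nearly the same direction and are merged, while C3 stays a separate individual.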
[0055] In this embodiment, after merging the tracking segments belonging to the same individual in each set, the following steps are also included: for each set obtained after merging, the Hungarian algorithm is used to perform bipartite graph matching with a preset face representation library, and the preset face representation library is updated in combination with a preset second threshold, wherein the first threshold is less than the second threshold.
[0056] In this embodiment, when using the Hungarian algorithm to perform bipartite graph matching with a preset face representation database, the cost matrix is calculated as follows:
[0057] $$M_{ij} = -\,\text{cosine similarity}(C_i, C_j)$$
[0058]
[0059] Where M represents the cost matrix and M_ij is its entry for the pair (C_i, C_j); cosine similarity(C_i, C_j) represents the cosine similarity; C_i represents a tracking segment within a set obtained after merging; and C_j represents a tracking segment from the preset face representation library. The cosine similarity is negated in the cost matrix because the Hungarian algorithm finds the minimum-cost matching, whereas the method described in this invention requires the matching with the maximum cosine similarity.
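The bipartite matching step can be sketched in plain Python as below. The sketch takes "inverted" to mean negated, so the cost is the negative cosine similarity, and uses a textbook potentials-based Hungarian (Kuhn-Munkres) routine. The function names and the toy similarity values are assumptions; a production system would more likely call an existing solver.

```python
def hungarian(cost):
    """Minimum-cost assignment for a matrix with rows <= columns.

    Returns a sorted list of (row, column) pairs, one pair per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j]: row matched to column j (1-based)
    way = [0] * (m + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])

def match_to_library(similarity):
    """Match merged segments (rows) to library entries (columns).

    `similarity[i][j]` is the cosine similarity between segment i and
    library entry j; the cost is its negation so that the minimum-cost
    assignment maximizes total similarity.
    """
    if not similarity or not similarity[0]:
        return []
    cost = [[-s for s in row] for row in similarity]
    if len(cost) <= len(cost[0]):
        return hungarian(cost)
    transposed = [list(col) for col in zip(*cost)]
    return sorted((i, j) for j, i in hungarian(transposed))

similarity = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.4]]
matches = match_to_library(similarity)
```

When there are more segments than library entries, the matrix is transposed before solving, so some segments are left unmatched and can be handled as new individuals.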
[0060] This invention uses a high threshold to match and merge previously co-occurring tracking segments, indicating that segments tend to belong to different individuals. Then, a Hungarian algorithm is used to match the segments against a facial feature database. During this process, tracking segments already identified as belonging to different individuals by the high threshold will not be merged.
[0061] Preferably, the first threshold is 0.4 and the second threshold is 0.25.
[0062] In this embodiment, if the merged set contains a tracking segment that does not match the preset face representation library, the tracking segment is recorded as a new individual in the preset face feature library. If the merged set contains a tracking segment that matches a tracking segment in the preset face representation library, the similarity between the two is calculated: if the similarity is less than the second threshold, the tracking segment is recorded as a new individual in the preset face feature library; if the similarity is greater than the second threshold, the tracking segment is merged with the corresponding tracking segment in the preset face feature library.
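The three update cases in this paragraph can be sketched as a single function. The library layout (a list of individuals, each holding segment ids rather than feature vectors), the argument names, and the handling of a similarity exactly equal to the threshold (treated here as a new individual) are assumptions; the default of 0.25 is the preferred second threshold given in this description.

```python
def update_library(library, segments, matches, similarity,
                   second_threshold=0.25):
    """Apply the update rules to `library` in place.

    library:    list of individuals, each a list of segment ids (hypothetical layout)
    segments:   segment ids of one merged set, in row order of `similarity`
    matches:    (segment_index, library_index) pairs from bipartite matching
    similarity: similarity[i][j] between segment i and library entry j
    Returns a dict: segment id -> index of the individual it was assigned to.
    """
    matched = dict(matches)
    assigned = {}
    for i, seg in enumerate(segments):
        j = matched.get(i)
        if j is not None and similarity[i][j] > second_threshold:
            # Matched and similar enough: merge into the existing individual.
            library[j].append(seg)
            assigned[seg] = j
        else:
            # Unmatched, or matched but not similar enough: new individual.
            library.append([seg])
            assigned[seg] = len(library) - 1
    return assigned

library = [["A1"], ["B1"]]
segments = ["C1", "C2", "C3"]
similarity = [[0.9, 0.1], [0.2, 0.2], [0.3, 0.3]]
matches = [(0, 0), (1, 1)]          # C3 was left unmatched
assigned = update_library(library, segments, matches, similarity)
```

In the toy data C1 is merged with the first individual, C2 is matched but falls below the threshold and becomes a new individual, and the unmatched C3 is also recorded as new.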
[0063] This invention, by matching data with a preset first threshold and a first similarity, can merge tracking segments belonging to the same individual in various sets, distinguish different individuals in tracking segments that have appeared simultaneously, and then perform bipartite graph matching between each merged set and a preset face representation database using the Hungarian algorithm. Combined with a preset second threshold, the preset face representation database can be updated, thereby improving the efficiency and accuracy of face recognition.
[0064] As shown in Figure 3, the present invention also provides a face recognition system, comprising: a face detection module 10, used to acquire video information and divide it into several tracking segments, each tracking segment including several consecutive face images; a face representation module 20, used to extract features from each frame in each tracking segment to obtain face feature vectors, acquire the corresponding modulus of all face feature vectors and compare them, and select the video frame corresponding to the maximum modulus as the face representation of the tracking segment; a calculation module 30, used to acquire a set of simultaneously existing tracking segments and calculate the first similarity between the face representations of each tracking segment in each tracking segment set; and a face matching module 40, used to perform face matching based on a preset face representation library and the first similarity and output the face recognition result.
[0065] In this embodiment, the face matching module 40 maintains a preset face representation library. When a new set of tracking segments is obtained based on spatiotemporal information, it is compared with the preset face representation library in real time. By using a high and low threshold matching method, each tracking segment can be matched with the corresponding individual, so that the face matching module 40 can output the matching result in real time at the granularity of the tracking segment set.
[0066] Preferably, the facial recognition system also includes a data matching module (not shown in Figure 3), used to match a preset first threshold with the first similarity. If the first similarity is greater than the first threshold, the corresponding tracking segments belong to the same individual. All tracking segments with a first similarity greater than the first threshold are obtained, and tracking segments belonging to the same individual in each set are merged.
[0067] The system also includes an update module (not shown in Figure 3), used to perform bipartite graph matching between each merged set and the preset face representation database using the Hungarian algorithm, and to update the preset face representation database in combination with the preset second threshold.
[0068] This invention also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the memory described in the above embodiments, or a standalone computer-readable storage medium not assembled into a device. The computer-readable storage medium stores at least one instruction, which is loaded and executed by a processor to implement the face recognition method shown in Figure 1. The computer-readable storage medium may be a read-only memory, a disk, or an optical disk, etc.
[0069] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the device embodiments, equipment embodiments, and storage medium embodiments, since they are basically similar to the method embodiments, the descriptions are relatively simple, and relevant parts can be referred to the descriptions of the method embodiments.
[0070] Furthermore, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0071] The foregoing description illustrates and describes preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept by means of the foregoing teachings or techniques or knowledge in related fields. Any modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.
Claims
1. A face recognition method, characterized in that it includes the following steps: S1. Acquire video information and divide it into several tracking segments, each tracking segment including several consecutive face images; S2. For each frame in each tracking segment, perform feature extraction to obtain a face feature vector; obtain and compare the corresponding moduli of all face feature vectors; select the video frame with the maximum modulus as the face representation of the tracking segment; feature extraction is performed through a feature extraction network, and the loss function of the feature extraction network is as follows: ; where f(β) represents the dynamic included angle function, g(β) represents the modulus penalty function, and a, b, c, and e are all constant hyperparameters; β represents the modulus; j ranges over all people classified during the training phase, and k denotes the k-th person among them; θ_k is the angle between the vector w_k and the face feature vector x, where w_k represents the weight corresponding to the k-th person in the softmax layer; θ_j is the angle between the vector w_j and the face feature vector x, where w_j represents the weight corresponding to the j-th person in the softmax layer; and s represents the temperature coefficient; S3. Obtain a set of simultaneously existing tracking segments and calculate the first similarity between the face representations of each tracking segment in each tracking segment set; perform data matching between the preset first threshold and the first similarity; if the first similarity is greater than the first threshold, the corresponding tracking segments belong to the same individual; obtain all tracking segments with the first similarity greater than the first threshold, and merge the tracking segments belonging to the same individual in each set; for each merged set, use the Hungarian algorithm to perform bipartite graph matching with the preset face representation library, and update the preset face representation library in combination with the preset second threshold; wherein the first threshold is less than the second threshold; the set of tracking segments existing simultaneously in S3 is obtained based on the spatiotemporal information of the tracking segments, and is specifically determined according to the following preset logic: given tracking segments C1, C2, ..., Cn, if tracking segments appear simultaneously in a video frame at a certain moment, the corresponding tracking segments are saved into the same tracking segment set; S4. Based on the preset face representation library and the first similarity, perform face matching and output the face recognition result.
2. The face recognition method according to claim 1, characterized in that: in step S1, video information is acquired and divided into several tracking segments; specifically, the tracking segments are divided based on the difference in face position between adjacent frames in the video information; alternatively, adjacent frames from the video information are input into a scene transition detection network, which outputs scene transition probabilities, and the tracking segments are divided based on these probabilities.
3. The face recognition method according to claim 2, characterized in that: the difference in face position is determined using the following formula: $$\mathrm{IOU} = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$; where IOU represents the degree of overlap between two face bounding boxes (the area of their intersection divided by the area of their union), A is the bounding box of one face, and B is the bounding box of another face; if the IOU is less than a preset threshold, a new tracking segment begins.
4. The face recognition method according to claim 1, characterized in that: if there is a tracking segment in the merged set that does not match the preset face representation library, the tracking segment is recorded as a new individual in the preset face feature library; if there is a tracking segment in the merged set that matches a tracking segment in the preset face representation library, the similarity between the two is calculated; if the similarity is less than the second threshold, the tracking segment is recorded as a new individual in the preset face feature library; if the similarity is greater than the second threshold, the tracking segment is merged with the corresponding tracking segment in the preset face feature library.
5. A face recognition system, using the face recognition method according to any one of claims 1-4, characterized in that it includes: a face detection module, used to acquire video information and divide it into several tracking segments, each tracking segment including several consecutive face images; a face representation module, used to extract features from each frame in each tracking segment to obtain face feature vectors, obtain the corresponding modulus of all face feature vectors and compare them, and select the video frame with the maximum modulus as the face representation of the tracking segment; a calculation module, used to obtain a set of simultaneously existing tracking segments and calculate the first similarity between the face representations of each tracking segment in each tracking segment set; and a face matching module, used to perform face matching based on a preset face representation library and the first similarity, and output the face recognition result.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a face recognition program, which, when executed by a processor, implements the steps of the face recognition method as described in any one of claims 1 to 4.