A method, device and storage medium for counting people flow
By judging the similarity between adjacent frames in video surveillance and performing face recognition, the problem of duplicate counting errors in people flow statistics is solved, and more efficient and accurate people flow statistics are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CORP LTD
- Filing Date
- 2022-12-30
- Publication Date
- 2026-06-23
AI Technical Summary
Existing intelligent analysis methods based on video surveillance are prone to errors in pedestrian flow statistics due to double counting.
By judging the similarity between adjacent frames in video data, face recognition is only performed when the similarity is less than a preset similarity, thus reducing repeated recognition. The Feature Pyramid Network (FPN) model and the Single Stage Headless (SSH) face detection model are used for target detection, followed by face alignment. Combined with dimensionality reduction processing, the data is stored in a face database for pedestrian flow calculation.
It improves the accuracy of people flow statistics, reduces resource consumption, improves recognition efficiency in complex scenarios, and avoids duplicate counting of the same face.
Figure CN116189044B_ABST
Description
Technical Field
[0001] This invention relates to the field of people flow statistics technology, and in particular to a method, apparatus, device and storage medium for people flow statistics.
Background Technology
[0002] With the popularization of artificial intelligence technology and video surveillance, intelligent analysis applications based on video surveillance are increasing and being widely used. As an intelligent analysis application of video surveillance, passenger flow statistics generally rely on counting the images to be detected multiple times within a certain period of time, which can lead to errors caused by repeated counting of passenger flow.
Summary of the Invention
[0003] In view of the above problems, embodiments of the present invention are proposed to provide a method, apparatus, device and storage medium for counting people flow to overcome or at least partially solve the above problems.
[0004] To address the aforementioned problems, this invention discloses a method for counting pedestrian traffic, the method comprising:
[0005] Obtain video data for a preset time period;
[0006] The video data is extracted into multiple frames according to a preset frame extraction rate.
[0007] The multiple frames of images are stored in the image database in chronological order.
[0008] Determine whether the similarity between two frames of images that are adjacent in storage order is less than a preset similarity.
[0009] If the similarity between two adjacent frames is less than the preset similarity, then face recognition is performed on the later image to obtain the corresponding number of faces;
[0010] The number of faces identified from multiple frames of images within the preset time period is summed to obtain the pedestrian flow corresponding to the video data.
[0011] Optionally, the step of performing face recognition on the later-ordered images to obtain the corresponding number of faces includes:
[0012] Extract the target detection boxes from the images that appear later in the sequence, and obtain a set of feature points for multiple faces based on the target detection boxes;
[0013] The feature point sets of the multiple faces are aligned using a preset affine transformation algorithm; the preset affine transformation algorithm includes the OpenCV affine transformation algorithm.
[0014] The dimensionality of the aligned feature point set of multiple faces is reduced to obtain low-dimensional vector data of multiple faces.
[0015] The low-dimensional vector data of the multiple faces are stored in a face database in chronological order. The number of faces in the later images is determined based on the number of low-dimensional vector data stored in the face database.
[0016] Optionally, storing the low-dimensional vector data of the multiple faces into a face database in chronological order includes:
[0017] The similarity between the low-dimensional vector data of the faces that appear later in the sequence and the low-dimensional vector data of the faces that appear earlier in the sequence is determined sequentially.
[0018] If the similarity is less than the vector similarity threshold, then the low-dimensional vector data of the face that appears later in the sequence is stored in the face database.
[0019] Optionally, extracting the target detection box from the later-ordered image includes:
[0020] The target detection boxes in the later-ordered images are extracted using the Feature Pyramid Network (FPN) model and the Single Stage Headless (SSH) face detection model.
[0021] Optionally, obtaining a set of feature points for multiple faces based on the target detection box includes:
[0022] Based on the target detection box, redundant bounding boxes in the target detection box are removed to obtain a set of feature points of multiple faces.
[0023] Optionally, determining whether the similarity between two sequentially adjacent frames is less than a preset similarity includes:
[0024] The similarity between the later-ordered image and the earlier-ordered image is determined according to a preset similarity algorithm; the preset similarity algorithm includes any one of the following: cosine similarity algorithm, hash value algorithm, or histogram algorithm;
[0025] Determine whether the similarity between the later image and the earlier image is less than the preset similarity.
[0026] Optionally, the method further includes:
[0027] If the similarity between two adjacent frames is greater than or equal to the preset similarity, the image that appears later in the sequence will be deleted.
[0028] The present invention also discloses a device for counting pedestrian traffic, the device comprising:
[0029] The acquisition module is used to acquire video data for a preset time period;
[0030] The frame extraction module is used to extract frames from the video data according to a preset frame extraction rate to obtain multiple frames of images.
[0031] The storage module is used to store the multiple frames of images into the image database in chronological order.
[0032] The judgment module is used to determine whether the similarity between two frames of images that are adjacent in storage order is less than a preset similarity.
[0033] The recognition module is used to perform face recognition on the later image if the similarity between two adjacent frames is less than the preset similarity, and obtain the corresponding number of faces.
[0034] The calculation module is used to add up the number of faces identified from multiple frames of images based on the preset time period to obtain the pedestrian flow corresponding to the video data.
[0035] Optionally, the identification module includes:
[0036] The extraction submodule is used to extract target detection boxes from the images that appear later in the sequence, and to obtain a set of feature points for multiple faces based on the target detection boxes.
[0037] The alignment submodule is used to perform face alignment processing on the feature point set of the multiple faces using a preset affine transformation algorithm; the preset affine transformation algorithm includes the OpenCV affine transformation algorithm.
[0038] The dimensionality reduction submodule is used to reduce the dimensionality of the aligned feature point set of multiple faces to obtain low-dimensional vector data of multiple faces.
[0039] The determination submodule is used to store the low-dimensional vector data of the multiple faces into the face database in chronological order, and determine the number of faces in the later images based on the number of low-dimensional vector data stored in the face database.
[0040] Optionally, the determination submodule includes:
[0041] The judgment unit is used to judge the similarity between the low-dimensional vector data of the face that appears later in the sequence and the low-dimensional vector data of the face that appears earlier in the sequence.
[0042] The storage unit is used to store the low-dimensional vector data of the face that appears later in the sequence into the face database if the similarity is less than the vector similarity threshold.
[0043] Optionally, the extraction submodule includes:
[0044] The extraction unit is used to extract the target detection boxes in the later-ordered images using the Feature Pyramid Network (FPN) model and the Single Stage Headless (SSH) face detection model.
[0045] Optionally, the extraction submodule includes:
[0046] The unit is used to remove redundant bounding boxes from the target detection box to obtain a set of feature points for multiple faces.
[0047] Optionally, the judgment module includes:
[0048] The similarity determination submodule is used to determine the similarity between the later image and the earlier image according to a preset similarity algorithm; the preset similarity algorithm includes any one of the following: cosine similarity algorithm, hash value algorithm, and histogram algorithm;
[0049] The judgment submodule is used to determine whether the similarity between the later image and the earlier image is less than the preset similarity.
[0050] Optionally, the device further includes:
[0051] The deletion module is used to delete the image that appears later in the sequence if the similarity between two adjacent frames is greater than or equal to the preset similarity.
[0052] The present invention also discloses an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the above-described method for counting pedestrian traffic.
[0053] The present invention also discloses a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the above-described method for counting pedestrian traffic.
[0054] The embodiments of the present invention have the following advantages:
[0055] This invention acquires video data over a preset time period; extracts frames from the video data according to a preset frame extraction rate to obtain multiple frames of images; stores these multiple frames in an image database in chronological order; determines whether the similarity between two frames that are adjacent in storage order is less than a preset similarity; if the similarity is less than the preset similarity, performs face recognition on the later frame to obtain the corresponding number of faces; and adds up the numbers of faces identified from the multiple frames over the preset time period to obtain the pedestrian flow corresponding to the video data. By determining the similarity between adjacent frames before recognizing them, and only performing face recognition on the later frame when the similarity is less than the preset similarity, this invention reduces redundant recognition of multiple frames and improves the accuracy of pedestrian flow statistics.
Attached Figure Description
[0056] Figure 1 is a flowchart illustrating the steps of a pedestrian flow statistics method provided in an embodiment of the present invention;
[0057] Figure 2 is a flowchart of another method for counting pedestrian traffic provided in an embodiment of the present invention;
[0058] Figure 3 is a structural block diagram of a people flow counting device provided in an embodiment of the present invention.
Detailed Implementation
[0059] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0060] With the popularization of artificial intelligence technology and video surveillance, intelligent analysis applications based on video surveillance are increasing and being widely used. Passenger flow statistics, as an intelligent analysis application of video surveillance, mainly involves acquiring target detection images of the target area, using deep learning methods to detect human bodies or head and shoulder targets in the target detection images, and then counting the number of human bodies in the target detection images to count the flow of people in the target area. However, existing statistical methods generally count the images to be detected multiple times within a certain period of time, which can lead to errors caused by repeated counting of people flow.
[0061] One of the core concepts of this invention is that, before recognizing an image, the similarity between two adjacent frames is determined, and face recognition is only performed on the later image when the similarity is less than a preset similarity. This reduces the repeated recognition of multiple frames and improves the accuracy of pedestrian flow statistics.
[0062] Referring to Figure 1, a flowchart of a method for counting pedestrian traffic according to an embodiment of the present invention is shown. The method may specifically include the following steps:
[0063] Step 101: Obtain video data for a preset time period.
[0064] In this embodiment of the invention, video data for a preset time period can be captured by a video shooting device, which can be a mobile terminal, a camera, or a surveillance camera. In one example, video data of a supermarket entrance from 10:00 AM to 12:00 PM can be captured by a surveillance camera.
[0065] Step 102: Extract frames from the video data according to the preset frame extraction rate to obtain multiple frames of images.
[0066] In this embodiment of the invention, since video data is composed of frames, frames can be extracted from the video data at a preset frame extraction rate to obtain multiple frames of images. The frame extraction rate is expressed in frames per second (frames/s). In one example, the total duration of the video data is 1200 s and the frame extraction rate is one frame every 2 s (0.5 frames/s), so 600 frames can be extracted.
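The arithmetic of this example can be checked with a short sketch. The function names are hypothetical, and the sampling interval is expressed as seconds per extracted frame, which is an assumption about how the rate would be configured.

```python
from typing import List


def extracted_frame_count(duration_s: float, seconds_per_frame: float) -> int:
    """Number of frames obtained when one frame is taken every `seconds_per_frame` seconds."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return int(duration_s // seconds_per_frame)


def sample_timestamps(duration_s: float, seconds_per_frame: float) -> List[float]:
    """Offsets (in seconds from the start of the video) at which frames are extracted."""
    count = extracted_frame_count(duration_s, seconds_per_frame)
    return [i * seconds_per_frame for i in range(count)]
```

Calling `extracted_frame_count(1200, 2)` reproduces the 600 frames of the example above.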
[0067] Step 103: Store multiple frames of images into the image database in chronological order.
[0068] In this embodiment of the invention, the generation time of each frame of the image is different. For example, if the generation time of image A is 10:10:50 and the generation time of image B is 10:10:52, then image A is stored in the image database first, and then image B is stored in the image database. Similarly, multiple frames of images can be stored in the image database in chronological order.
[0069] It should be noted that before storing multiple frames of images into the image database, the multiple frames of images can be preprocessed. Preprocessing includes at least one of the following methods: binarization, grayscale conversion, super-resolution, etc., to obtain preprocessed multiple frames of images. Then, the multiple frames of images are stored into the image database in chronological order.
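As an illustration of two of the preprocessing options named above, the following sketch converts an RGB image to grayscale and then binarizes it. It is only a sketch under assumptions: images are represented as nested lists of pixel values, the grayscale weights are the common ITU-R BT.601 luma coefficients, and the binarization threshold of 128 is a placeholder, none of which is specified in the source.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[int]]


def to_grayscale(image: RGBImage) -> GrayImage:
    """Convert each (R, G, B) pixel to a single luma value in 0..255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def binarize(image: GrayImage, threshold: int = 128) -> GrayImage:
    """Map every gray value to 255 (at or above threshold) or 0 (below it)."""
    return [[255 if value >= threshold else 0 for value in row] for row in image]
```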
[0070] Step 104: Determine whether the similarity between two frames of images that are adjacent in storage order is less than the preset similarity.
[0071] In this embodiment of the invention, the similarity between two frames of images that are adjacent in storage order can be determined. When image A is the first image stored, it can be stored directly, since the image database is empty at this time. Image B is the adjacent image stored after image A. The similarity between image B and image A can be determined so that image B is not recognized again when it is essentially the same as image A; this avoids distorting the recognition result and improves the efficiency of image recognition.
[0072] Step 105: If the similarity between two adjacent frames is less than a preset similarity, then perform face recognition on the later image to obtain the corresponding number of faces.
[0073] In this embodiment of the invention, a preset similarity can be set according to user needs to remove similar images from multiple frames. In one example, the preset similarity is 0.7. Image B is an adjacent image stored after image A. If the similarity between image B and image A is 0.6, which is less than 0.7, then face recognition can be performed on image B to obtain the number of faces in image B.
[0074] In one example, if image A is the first image stored, since there is no image in the image database at this time, face recognition can be performed directly on image A to obtain the number of faces in image A.
[0075] It should be noted that the preset similarity should not be set too high: a higher value lets more near-duplicate frames pass on to face recognition, which increases the time cost. The preset similarity can generally be set to a value between 0.6 and 0.7.
[0076] Step 106: Add together the numbers of faces identified from the multiple frames of images within the preset time period to obtain the pedestrian flow corresponding to the video data.
[0077] In this embodiment of the invention, if there are 60 frames of images, the numbers of faces identified in those of the 60 frames whose similarity to the preceding frame is less than the preset similarity can be summed to obtain the pedestrian flow of the video data.
[0078] This invention can reduce repeated recognition of multiple frames and improve the accuracy of pedestrian flow statistics by judging the similarity between two adjacent frames before recognizing the image. Only when the similarity is less than a preset similarity will the face recognition be performed on the later image.
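Steps 104 to 106 can be sketched as follows. This is a minimal illustration and not the patented implementation: `count_people_flow`, `similarity` and `count_faces` are hypothetical names, the similarity and face-counting functions are supplied by the caller, and comparing each frame with the last retained frame (rather than with a frame that has already been deleted) is an assumption about how the adjacency check behaves after a deletion.

```python
from typing import Callable, Iterable, Optional, TypeVar

Frame = TypeVar("Frame")


def count_people_flow(
    frames: Iterable[Frame],
    similarity: Callable[[Frame, Frame], float],
    count_faces: Callable[[Frame], int],
    preset_similarity: float = 0.7,
) -> int:
    """Sum face counts over frames that differ enough from the last retained frame."""
    total = 0
    last_kept: Optional[Frame] = None
    have_kept = False
    for frame in frames:  # frames are expected in chronological order
        if have_kept and similarity(last_kept, frame) >= preset_similarity:
            continue  # too similar to the previous frame: skip recognition
        total += count_faces(frame)  # first frame, or a sufficiently different one
        last_kept = frame
        have_kept = True
    return total
```

The first frame is always recognized because there is nothing to compare it with, which matches the handling of image A described above.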
[0079] Referring to Figure 2, a flowchart of another method for counting pedestrian traffic provided by an embodiment of the present invention is shown. This method may include the following steps:
[0080] Step 201: Obtain video data for a preset time period.
[0081] Step 202: Extract frames from the video data according to the preset frame extraction rate to obtain multiple frames of images.
[0082] Step 203: Store multiple frames of images into the image database in chronological order.
[0083] Step 204: Determine the similarity between the later image and the earlier image according to a preset similarity algorithm; the preset similarity algorithm includes any one of the following: cosine similarity algorithm, hash value algorithm, and histogram algorithm.
[0084] In this embodiment of the invention, the similarity between images that appear later in the sequence and images that appear earlier in the sequence can be determined by a preset similarity algorithm. The preset similarity algorithm can be any one of the cosine similarity algorithm, hash value algorithm, or histogram algorithm, and is not limited to any one of them.
[0085] In one example, image B is an adjacent image stored after image A. The preset similarity algorithm is the cosine similarity algorithm, which can calculate the similarity between image B and image A to be 0.6.
[0086] Step 205: Determine whether the similarity between the later image and the earlier image is less than a preset similarity.
[0087] Specifically, with a preset similarity of 0.65, if the similarity between image B and image A is 0.6, it can be determined that the similarity between image B and image A is less than 0.65.
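One of the listed options, cosine similarity, can be sketched on flattened pixel vectors as shown here. The sketch assumes both images have already been reduced to equal-length sequences of numeric pixel values (for example, grayscale images of the same size); the function names and the default threshold of 0.65, taken from the example above, are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def needs_recognition(
    earlier: Sequence[float], later: Sequence[float], preset_similarity: float = 0.65
) -> bool:
    """True when the later image differs enough from the earlier one to be recognized."""
    return cosine_similarity(earlier, later) < preset_similarity
```

A hash-based or histogram-based measure could be substituted for `cosine_similarity` without changing the threshold check.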
[0088] In one embodiment of the present invention, the method further includes:
[0089] If the similarity between two adjacent frames is greater than or equal to a preset similarity, the image that appears later in the sequence will be deleted.
[0090] In this embodiment of the invention, if image B is an adjacent image stored after image A, and the similarity between image B and image A is greater than or equal to a preset similarity, it means that image B and image A may be the same image. In this case, image B can be deleted and not recognized to avoid affecting the recognition result.
[0091] Step 206: If the similarity between two adjacent frames is less than the preset similarity, extract the target detection box in the later image and obtain a set of feature points of multiple faces based on the target detection box.
[0092] In this embodiment of the invention, when the similarity between image B and image A is determined to be 0.6, which is less than the preset similarity of 0.65, the target detection box in image B can be extracted. The target detection box can be learned from landmark points at the eyes, nose, eyebrows and both corners of the mouth, or from the 21 landmark points annotated in AFLW or the commonly used 68-point or 106-point landmark sets; this is not limited here. Then, through the target detection box, a set of feature points of multiple faces can be obtained. This set of feature points contains the feature points of each face in image B.
[0093] In one embodiment of the present invention, extracting the target detection box from the later-ordered image includes:
[0094] The Feature Pyramid Network (FPN) model and the Single Stage Headless (SSH) face detection model are used to extract the target detection boxes in the later-ordered images.
[0095] Currently, the structures used for multi-scale image detection include the following. Featurized image pyramid: this method first resizes the image to several different scales, then extracts features at each scale, and then makes predictions on the features of each scale separately. The advantage is that the features at every scale can contain rich semantic information; the disadvantage is that the time cost is too high.
[0096] Pyramid feature hierarchy: This is the multi-scale fusion method used by SSD, which extracts features of different scales from different layers of the network and then makes predictions on these features at different scales. The advantage of this method is that it does not require additional computation. However, the disadvantage is that the semantic information of some scale features is not very rich. In addition, SSD does not make enough use of low-level features, which the authors of FPN consider very helpful for small object detection.
[0097] Single feature map: This is used in SPPnet, Fast R-CNN, and Faster R-CNN, where predictions are made on the feature map of the last layer of the network. The advantage of this method is its fast computation speed, but the disadvantage is that the feature map of the last layer has low resolution and cannot accurately contain the object's position information.
[0098] In this embodiment of the invention, a multi-scale fusion object detection method using FPN+SSH can be employed. FPN (Feature Pyramid Network) incorporates a feature pyramid into image detection. The structure of the feature pyramid mainly includes three parts: the bottom-up pathway, in which the image is input into the backbone ConvNet to extract features; the top-down pathway, in which the feature map obtained from the high-level layer is upsampled and passed down, so that the rich semantic information of the high-level features spreads to the low-level features; and lateral connections. This improves the accuracy of image detection, especially in the detection of small objects. SSH (Single Stage Headless Face Detector) is a face detection algorithm proposed at ICCV 2017. It effectively improves the performance of face detection. The main improvements include multi-scale detection, the introduction of more contextual information, and the grouping and propagation of the loss function.
[0099] In one embodiment of the present invention, a set of feature points of multiple faces is obtained based on the target detection box, including:
[0100] Based on the target detection box, redundant bounding boxes are removed from the target detection box to obtain a set of feature points for multiple faces.
[0101] In this embodiment of the invention, the target detection box may contain redundant, i.e. repeated, bounding boxes. After the target detection box is obtained, redundant bounding boxes can be removed by non-maximum suppression, and low-scoring feature points, which do not carry facial feature point information, can also be removed, thereby obtaining the feature point sets of multiple faces.
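Non-maximum suppression with a score cut-off can be sketched as below. This is the standard greedy formulation, given as an assumption about what the embodiment intends; the box format `(x1, y1, x2, y2)`, the function names, and both threshold values are illustrative and not taken from the source.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,
    score_threshold: float = 0.3,
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Discard low-scoring detections, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```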
[0102] Step 207: Perform face alignment processing on the feature point set of multiple faces using a preset affine transformation algorithm; the preset affine transformation algorithm includes the OpenCV affine transformation algorithm.
[0103] In this embodiment of the invention, since the faces in each frame of the image are at different angles, the feature points of each face are also rotated. Calibrated feature point positions, i.e. the feature point positions of an upright face, can be preset. The affine matrix that maps the feature points of each face in the feature point set to the calibrated feature point positions is then calculated. Finally, each face is transformed and scaled to the same size using the affine matrix to achieve face alignment.
[0104] In one example, the OpenCV affine transformation algorithm can be used to map the coordinates of each face's feature points to fixed coordinates, achieving face alignment. This invention improves the model's recognition performance for faces at different angles by aligning and straightening multiple face feature point sets using an affine transformation algorithm, effectively solving the problem of high misclassification rates in traditional techniques for counting crowds in overhead video surveillance scenarios.
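The mapping of detected feature points onto fixed coordinates can be illustrated without any library dependency. The sketch below solves for the 2x3 affine matrix from three point correspondences using Cramer's rule; in OpenCV the corresponding calls would be `cv2.getAffineTransform` (matrix from three point pairs) and `cv2.warpAffine` (applying it to an image), but which functions the embodiment actually uses is not stated in the source. The helper names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def affine_from_three_points(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Return [[a, b, c], [d, e, f]] such that x' = a*x + b*y + c and y' = d*x + e*y + f."""
    if len(src) != 3 or len(dst) != 3:
        raise ValueError("exactly three point pairs are required")
    base = [[x, y, 1.0] for x, y in src]
    det = _det3(base)
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear")
    matrix: List[List[float]] = []
    for axis in (0, 1):  # solve once for x' and once for y'
        target = [p[axis] for p in dst]
        row = []
        for col in range(3):  # Cramer's rule: replace one column with the targets
            replaced = [r[:] for r in base]
            for i in range(3):
                replaced[i][col] = target[i]
            row.append(_det3(replaced) / det)
        matrix.append(row)
    return matrix


def apply_affine(matrix: Sequence[Sequence[float]], point: Point) -> Point:
    """Map one feature point through the affine matrix."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )
```

In a face alignment setting, `src` would hold three detected landmarks (for example both eyes and the nose tip) and `dst` the preset calibrated positions; that choice of landmarks is an assumption.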
[0105] OpenCV is a cross-platform, open-source computer vision and machine learning software library released under the Apache 2.0 license. It runs on Linux, Windows, Android, and Mac OS, and implements many common algorithms in image processing and computer vision.
[0106] OpenCV is written in C++ and provides interfaces for C++, Python, Java, and MATLAB, with further support for languages such as C#, Ch, Ruby, and Go. It is primarily geared towards real-time vision applications and uses MMX and SSE instructions when available.
[0107] Step 208: Dimensionality reduction is performed on the feature point set of the aligned multiple faces to obtain low-dimensional vector data of the multiple faces.
[0108] In this embodiment of the invention, dimensionality reduction can be performed on the feature point set of the aligned multiple faces. Dimensionality reduction is an operation that maps data from a high-dimensional space, such as a single face image, to a lower-dimensional representation, thereby obtaining low-dimensional vector data of multiple faces. In one example, FaceNet can be used to map each aligned face onto a 128-dimensional feature vector.
[0109] Step 209: Store the low-dimensional vector data of multiple faces into the face database in chronological order. Determine the number of faces in the later images based on the number of low-dimensional vector data stored in the face database.
[0110] In this embodiment of the invention, low-dimensional vector data of multiple faces generated at different times can be stored in a face database in chronological order. Then, the number of low-dimensional vector data points is calculated to determine the number of faces in the later-ordered images. In one example, if five low-dimensional vector data points are stored, it means that the number of faces in the later-ordered images is five.
[0111] In one embodiment of the present invention, storing low-dimensional vector data of multiple faces into a face database in chronological order includes: sequentially determining the similarity between the low-dimensional vector data of the face in the later sequence and the low-dimensional vector data of the face in the earlier sequence; if the similarity is less than the vector similarity threshold, then storing the low-dimensional vector data of the face in the later sequence into the face database.
[0112] In this embodiment of the invention, image B contains low-dimensional vector data of multiple faces, which can be stored in chronological order. For example, image B contains low-dimensional vector data 1, low-dimensional vector data 2, and low-dimensional vector data 3. Low-dimensional vector data 1 is stored before low-dimensional vector data 2, and low-dimensional vector data 2 is stored before low-dimensional vector data 3. It can be determined whether the similarity between low-dimensional vector data 1 and low-dimensional vector data 2 is less than the vector similarity threshold, and whether the similarity between low-dimensional vector data 2 and low-dimensional vector data 3 is less than the vector similarity threshold. If the similarity between low-dimensional vector data 1 and low-dimensional vector data 2 is greater than the vector similarity threshold, it means that the faces corresponding to low-dimensional vector data 1 and low-dimensional vector data 2 are relatively similar, and low-dimensional vector data 2 can be removed. If the similarity between low-dimensional vector data 2 and low-dimensional vector data 3 is less than the vector similarity threshold, it means that the faces corresponding to low-dimensional vector data 2 and low-dimensional vector data 3 are not similar, and low-dimensional vector data 3 can be stored in the face database.
[0113] If low-dimensional vector data 1 is the low-dimensional vector data of the first face to be stored, the face database is still empty, so low-dimensional vector data 1 can be stored in the face database directly.
[0114] In one example, the vector similarity threshold is 0.8. Low-dimensional vector data 1 is stored before low-dimensional vector data 2. If the similarity between low-dimensional vector data 1 and low-dimensional vector data 2 is 0.6, it means that the faces corresponding to low-dimensional vector data 1 and low-dimensional vector data 2 are not very similar, and low-dimensional vector data 2 can be directly stored in the face database. If the similarity between low-dimensional vector data 1 and low-dimensional vector data 2 is 0.9 > 0.8, it means that the faces corresponding to low-dimensional vector data 1 and low-dimensional vector data 2 are very similar, and low-dimensional vector data 2 can be deleted to avoid counting the same face repeatedly.
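[0115] It should be noted that a vector similarity threshold that is too small will affect the calculation accuracy, because different faces are then more likely to be treated as the same face and removed; the threshold is generally set to a value between 0.8 and 0.9.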
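The face database check can be sketched as follows. Several points are assumptions rather than statements of the source: the class and method names are hypothetical, cosine similarity is used as the vector similarity measure, a similarity exactly equal to the threshold is treated as a duplicate, and each new vector is compared with every stored vector, whereas the example above only compares vectors that are adjacent in storage order.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FaceDatabase:
    """In-memory store that keeps one vector per distinct face."""

    def __init__(self, vector_similarity_threshold: float = 0.8) -> None:
        self.threshold = vector_similarity_threshold
        self.vectors: List[List[float]] = []

    def add_if_new(self, vector: Sequence[float]) -> bool:
        """Store the vector unless it is too similar to one already stored."""
        for stored in self.vectors:
            if cosine_similarity(stored, vector) >= self.threshold:
                return False  # same face as an earlier entry: do not count again
        self.vectors.append(list(vector))
        return True

    def __len__(self) -> int:
        """Number of distinct faces stored so far."""
        return len(self.vectors)
```

With this structure, the face count used for the pedestrian flow is simply `len(db)` after all retained frames have been processed.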
[0116] This invention employs a dual deduplication method. The first stage, frame extraction followed by similarity threshold comparison between frames, helps reduce redundant calculations, lower resource consumption, and improve work efficiency. The second stage, similarity threshold comparison against the face database, enables the people flow to be calculated without duplicates, avoiding repeated counting of the same face and making the counting results more accurate.
[0117] Step 210: The number of faces identified from multiple frames of images within a preset time period is added together to obtain the pedestrian flow corresponding to the video data.
[0118] In this embodiment of the invention, the preset time period is from 10:00 AM to 12:00 PM at the entrance of a supermarket. The numbers of faces identified from the multiple frames of images within this time period can be added together to obtain the pedestrian flow of the video data from 10:00 AM to 12:00 PM. In one example, the video data includes 60 frames of images, of which 40 remain after the first round of deduplication. If the total number of faces identified in these 40 frames after the second round of deduplication is 1000, the pedestrian flow of the video data during this time period is 1000.
[0119] In one embodiment of the present invention, after determining that the flow of people during the preset time period of 10:00-12:00 is 1000 people, it can be displayed to users through a mobile terminal or a web interface.
[0120] This invention reduces redundant recognition of multiple frames and improves the accuracy of pedestrian flow statistics by determining the similarity between adjacent frames before image recognition. Only when the similarity is less than a preset threshold is the subsequent image used for face recognition. Furthermore, this invention eliminates already counted face data within a given time frame during image recognition, solving problems such as high resource consumption and difficulty in removing duplicate faces in complex video surveillance scenarios. Finally, this invention can recognize faces from different angles, improving recognition efficiency in complex scenes.
[0121] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.
[0122] Referring to Figure 3, a structural block diagram of a pedestrian flow counting device provided by an embodiment of the present invention is shown, which may specifically include the following modules:
[0123] The acquisition module 301 is used to acquire video data within a preset time period;
[0124] The frame extraction module 302 is used to extract frames from the video data according to a preset frame extraction rate to obtain multiple frames of images.
[0125] The storage module 303 is used to store the multiple frames of images into the image database in chronological order.
[0126] The judgment module 304 is used to determine whether the similarity between two adjacent frames of images stored in sequence is less than a preset similarity.
[0127] The recognition module 305 is used to perform face recognition on the image in the later sequence if the similarity between two adjacent frames is less than the preset similarity, and obtain the corresponding number of faces.
[0128] The calculation module 306 is used to add up the number of faces identified from multiple frames of images based on the preset time period to obtain the pedestrian flow corresponding to the video data.
[0129] This invention discloses a device for counting pedestrian traffic. Before recognizing an image, the device determines the similarity between two adjacent frames. Only when the similarity is less than a preset similarity is face recognition performed on the later image. This reduces the repeated recognition of multiple frames and improves the accuracy of pedestrian traffic statistics.
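How modules 301 to 306 might be chained can be sketched as below. This is not the device's actual implementation: sampling every `step`-th frame stands in for the preset frame extraction rate, the first frame is assumed to always be recognized because it has no predecessor, and `similarity` and `count_faces` are hypothetical callables supplied by the caller (the latter is expected to return the number of faces newly stored for a frame).

```python
from typing import Callable, List, Sequence


def extract_frames(video: Sequence, step: int) -> List:
    """Frame extraction module 302: keep every `step`-th frame (assumed rule)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(video[::step])


def count_pedestrian_flow(
    video: Sequence,
    step: int,
    preset_similarity: float,
    similarity: Callable[[object, object], float],
    count_faces: Callable[[object], int],
) -> int:
    """Run modules 302-306 over video data already acquired by module 301."""
    image_db = extract_frames(video, step)  # modules 302 and 303: ordered frames
    total = 0
    for index, frame in enumerate(image_db):
        if index == 0:
            # Assumption: the first frame has no predecessor, so recognize it.
            total += count_faces(frame)
            continue
        earlier = image_db[index - 1]
        if similarity(earlier, frame) < preset_similarity:  # judgment module 304
            total += count_faces(frame)                     # recognition module 305
    return total                                            # calculation module 306


# Toy data: each integer stands in for a frame and for its own face count.
video = [0, 0, 0, 5, 5, 9]
flow = count_pedestrian_flow(
    video,
    step=1,
    preset_similarity=0.9,
    similarity=lambda a, b: 1.0 if a == b else 0.0,
    count_faces=lambda frame: frame,
)
print(flow)  # 14
```

Passing the similarity and recognition steps in as callables keeps the control flow visible without committing to any particular detector or similarity algorithm.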
[0130] In one embodiment of the present invention, the recognition module 305 may include:
[0131] The extraction submodule is used to extract target detection boxes from the images that appear later in the sequence, and to obtain a set of feature points for multiple faces based on the target detection boxes.
[0132] The alignment submodule is used to perform face alignment processing on the feature point set of the multiple faces using a preset affine transformation algorithm; the preset affine transformation algorithm includes the OpenCV affine transformation algorithm.
[0133] The dimensionality reduction submodule is used to reduce the dimensionality of the aligned feature point set of multiple faces to obtain low-dimensional vector data of multiple faces.
[0134] The determination submodule is used to store the low-dimensional vector data of the multiple faces into the face database in chronological order, and determine the number of faces in the later images based on the number of low-dimensional vector data stored in the face database.
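The alignment step can be illustrated with a hand-written transform. The specification names the OpenCV affine transformation algorithm; the sketch below deliberately does not call OpenCV and instead computes a two-point similarity transform (rotation, uniform scale, and translation, a special case of an affine transform) using only the standard library. Treating the feature points as eye and nose landmarks, the target eye positions, and the function name `similarity_transform` are assumptions for illustration.

```python
def similarity_transform(src_left, src_right, dst_left, dst_right):
    """Return a function that maps (x, y) points so the source eyes land on the targets.

    Points are treated as complex numbers z = x + yj; the map is z -> a * z + b.
    """
    s1, s2 = complex(*src_left), complex(*src_right)
    d1, d2 = complex(*dst_left), complex(*dst_right)
    if s1 == s2:
        raise ValueError("the two source points must be distinct")
    a = (d2 - d1) / (s2 - s1)  # rotation and uniform scale
    b = d1 - a * s1            # translation

    def apply(point):
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return apply


# Hypothetical landmarks of one detected face: left eye, right eye, nose tip.
landmarks = [(30.0, 40.0), (70.0, 60.0), (50.0, 70.0)]

# Hypothetical canonical eye positions in the aligned face crop.
align = similarity_transform(landmarks[0], landmarks[1], (38.0, 52.0), (74.0, 52.0))
aligned = [align(p) for p in landmarks]
print(aligned[0], aligned[1])  # the eyes now lie on a horizontal line
```

After alignment the eyes sit at fixed positions regardless of head tilt, which is what allows faces captured at different angles to be compared once their feature points are reduced to low-dimensional vectors.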
[0135] In one embodiment of the present invention, the determination submodule may include:
[0136] The judgment unit is used to judge the similarity between the low-dimensional vector data of the face that appears later in the sequence and the low-dimensional vector data of the face that appears earlier in the sequence.
[0137] The storage unit is used to store the low-dimensional vector data of the face that appears later in the sequence into the face database if the similarity is less than the vector similarity threshold.
[0138] In one embodiment of the present invention, the extraction submodule may include:
[0139] The extraction unit is used to extract the target detection boxes in the later image using the Feature Pyramid Network (FPN) model and the Single Stage Headless (SSH) face detection model.
[0140] In one embodiment of the present invention, the judgment module 304 may include:
[0141] The similarity determination submodule is used to determine the similarity between the later image and the earlier image according to a preset similarity algorithm; the preset similarity algorithm includes any one of the following: cosine similarity algorithm, hash value algorithm, and histogram algorithm;
[0142] The judgment submodule is used to determine whether the similarity between the later image and the earlier image is less than the preset similarity.
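Two of the similarity algorithms listed above, the histogram algorithm and the hash value algorithm, can be sketched on tiny grayscale images as follows. The concrete formulas (histogram intersection, and an average hash compared bit by bit), the 16-bin setting, the preset similarity of 0.9, and all function names are assumptions; the specification does not fix them. A practical average hash would first downscale each image to a small fixed size, which is omitted here.

```python
def histogram(image, bins=16):
    """Normalized intensity histogram of a grayscale image (values 0-255)."""
    pixels = [p for row in image for p in row]
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // 256] += 1
    return [c / len(pixels) for c in counts]


def histogram_similarity(img_a, img_b, bins=16):
    """Histogram intersection in [0, 1]; 1.0 means identical histograms."""
    hist_a, hist_b = histogram(img_a, bins), histogram(img_b, bins)
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))


def average_hash(image):
    """Bit list: 1 where a pixel is at or above the image mean, else 0."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p >= mean else 0 for p in pixels]


def hash_similarity(img_a, img_b):
    """Fraction of matching hash bits; both images must have the same size."""
    hash_a, hash_b = average_hash(img_a), average_hash(img_b)
    if len(hash_a) != len(hash_b):
        raise ValueError("images must have the same number of pixels")
    return sum(1 for x, y in zip(hash_a, hash_b) if x == y) / len(hash_a)


def should_recognize(earlier, later, preset_similarity, similarity=histogram_similarity):
    """Judgment submodule: recognize the later image only if it differs enough."""
    return similarity(earlier, later) < preset_similarity


frame_a = [[10, 10], [200, 200]]
frame_b = [[12, 11], [198, 205]]    # nearly the same scene as frame_a
frame_c = [[200, 200], [200, 200]]  # a clearly different scene

print(should_recognize(frame_a, frame_b, 0.9))  # False: similarity is 1.0
print(should_recognize(frame_a, frame_c, 0.9))  # True: similarity is 0.5
```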
[0143] In one embodiment of the present invention, the apparatus further includes:
[0144] The deletion module is used to delete the image that appears later in the sequence if the similarity between two adjacent frames is greater than or equal to the preset similarity.
[0145] This invention reduces redundant recognition of multiple frames and improves the accuracy of pedestrian flow statistics by determining the similarity between adjacent frames before image recognition. Only when the similarity is less than a preset threshold is the subsequent image used for face recognition. Furthermore, this invention eliminates already counted face data within a given time frame during image recognition, solving problems such as high resource consumption and difficulty in removing duplicate faces in complex video surveillance scenarios. Finally, this invention can recognize faces from different angles, improving recognition efficiency in complex scenes.
[0146] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.
[0147] This invention also provides an electronic device, comprising:
[0148] a processor, a memory, and a computer program stored in the memory and capable of running on the processor. When the computer program is executed by the processor, it implements the various processes of the above-described method embodiment for counting pedestrian traffic and achieves the same technical effect. To avoid repetition, it will not be described again here.
[0149] This invention also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the various processes of the above-described method embodiment for counting pedestrian traffic and achieves the same technical effect. To avoid repetition, it will not be described again here.
[0150] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0151] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0152] Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0153] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0154] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0155] Although preferred embodiments of the present invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present invention.
[0156] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0157] The method, apparatus, device, and storage medium for counting pedestrian traffic provided by the present invention have been described in detail above. Specific examples have been used to illustrate the principles and implementation of the present invention, and the descriptions of the above embodiments are intended only to help in understanding the method and core ideas of the present invention. At the same time, those skilled in the art may, based on the ideas of the present invention, make changes to the specific implementation and scope of application. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A method for counting pedestrian traffic, characterized in that the method includes:
acquiring video data within a preset time period;
extracting frames from the video data according to a preset frame extraction rate to obtain multiple frames of images;
storing the multiple frames of images in an image database in chronological order;
determining whether the similarity between two adjacent frames of images stored in sequence is less than a preset similarity;
if the similarity between the two adjacent frames of images is less than the preset similarity, performing face recognition on the later image to obtain the corresponding number of faces;
adding together the numbers of faces identified from the multiple frames of images within the preset time period to obtain the pedestrian flow corresponding to the video data;
wherein the step of performing face recognition on the later image to obtain the corresponding number of faces includes:
extracting target detection boxes from the later image, and obtaining feature point sets of multiple faces based on the target detection boxes;
performing face alignment on the feature point sets of the multiple faces using a preset affine transformation algorithm, the preset affine transformation algorithm including the OpenCV affine transformation algorithm;
reducing the dimensionality of the aligned feature point sets of the multiple faces to obtain low-dimensional vector data of the multiple faces;
storing the low-dimensional vector data of the multiple faces in a face database in chronological order, and determining the number of faces in the later image based on the number of low-dimensional vector data stored in the face database;
wherein the step of storing the low-dimensional vector data of the multiple faces in the face database in chronological order includes:
sequentially determining the similarity between the low-dimensional vector data of a later face and the low-dimensional vector data of an earlier face;
if the similarity is less than a vector similarity threshold, storing the low-dimensional vector data of the later face in the face database;
and the method further includes:
if the similarity between the two adjacent frames of images is greater than or equal to the preset similarity, deleting the later image.
2. The method according to claim 1, characterized in that extracting the target detection boxes from the later image includes: extracting the target detection boxes in the later image using the Feature Pyramid Network (FPN) model and the Single Stage Headless (SSH) face detection model.
3. The method according to claim 1, characterized in that the step of obtaining feature point sets of multiple faces based on the target detection boxes includes: removing redundant boxes from the target detection boxes to obtain the feature point sets of the multiple faces.
4. The method according to claim 1, characterized in that the step of determining whether the similarity between two adjacent frames of images stored in sequence is less than a preset similarity includes: determining the similarity between the later image and the earlier image according to a preset similarity algorithm, the preset similarity algorithm including any one of the following: cosine similarity algorithm, hash value algorithm, and histogram algorithm; and determining whether the similarity between the later image and the earlier image is less than the preset similarity.
5. A device for counting pedestrian traffic, characterized in that the device includes:
an acquisition module, used to acquire video data within a preset time period;
a frame extraction module, used to extract frames from the video data according to a preset frame extraction rate to obtain multiple frames of images;
a storage module, used to store the multiple frames of images in an image database in chronological order;
a judgment module, used to determine whether the similarity between two adjacent frames of images stored in sequence is less than a preset similarity;
a recognition module, used to perform face recognition on the later image if the similarity between the two adjacent frames of images is less than the preset similarity, to obtain the corresponding number of faces;
a calculation module, used to add together the numbers of faces identified from the multiple frames of images within the preset time period to obtain the pedestrian flow corresponding to the video data;
wherein the recognition module includes:
an extraction submodule, used to extract target detection boxes from the later image and to obtain feature point sets of multiple faces based on the target detection boxes;
an alignment submodule, used to perform face alignment on the feature point sets of the multiple faces using a preset affine transformation algorithm, the preset affine transformation algorithm including the OpenCV affine transformation algorithm;
a dimensionality reduction submodule, used to reduce the dimensionality of the aligned feature point sets of the multiple faces to obtain low-dimensional vector data of the multiple faces;
a determination submodule, used to store the low-dimensional vector data of the multiple faces in a face database in chronological order, and to determine the number of faces in the later image based on the number of low-dimensional vector data stored in the face database;
wherein the determination submodule includes:
a judgment unit, used to determine the similarity between the low-dimensional vector data of a later face and the low-dimensional vector data of an earlier face;
a storage unit, used to store the low-dimensional vector data of the later face in the face database if the similarity is less than a vector similarity threshold;
and the device further includes:
a deletion module, used to delete the later image if the similarity between the two adjacent frames of images is greater than or equal to the preset similarity.
6. An electronic device, characterized in that it includes: a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for counting pedestrian traffic as described in any one of claims 1-4.
7. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for counting pedestrian traffic as described in any one of claims 1-4.