Semantic segmentation based TOF depth image feature extraction method and extraction system
By combining structured light and optical flow vector information in a TOF depth image feature extraction method, the problem of inaccurate human contour segmentation in existing technologies is solved, enabling accurate counting and tracking in complex scenes and improving the robustness and efficiency of the system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SICHUAN HONGYE XINCHUANG TECHNOLOGY CO LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
Figure CN122244943A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the fields of lidar depth map rendering, segmentation, and semantic symbol feature extraction technology. Specifically, it relates to a TOF depth image feature extraction method and extraction system based on semantic segmentation.
Background Technology
[0002] The content in this section provides only background information related to this application and may not constitute prior art.
[0003] In indoor personnel entry, exit, and retention counting systems, it is necessary to count the number of people entering and exiting. To improve personnel passage efficiency, sensors are typically installed near elevator entrances and main doors to collect the silhouettes of people entering and exiting. This determines the number of people entering and exiting a certain area, as well as their physical characteristics.
[0004] Extracting human silhouettes from real-time video streams typically relies on edge detection combined with Fast Fourier Transform (FFT). The system preprocesses captured video frames, using edge detection algorithms (such as the Canny operator) to initially outline the silhouettes of salient objects in the image. Then, it uses FFT to analyze image features in the frequency domain to enhance or extract specific frequency components related to human edge structures, aiding in the localization of human silhouettes. Finally, semantic segmentation techniques (usually based on deep learning models) are used to perform pixel-level classification of the initially detected silhouette regions, accurately segmenting the set of pixels belonging to the "human silhouette."
[0005] Edge detection algorithms are inherently highly sensitive to image noise, lighting variations, and background texture interference (such as dense grids, vegetation, complex patterns, or moving objects). Under conditions of dense pedestrian traffic, cluttered backgrounds, or uneven lighting, these algorithms often struggle to stably and accurately delineate complete and clear human silhouettes. Furthermore, after acquiring the human silhouettes, it is difficult to determine their direction of movement by tracking them, making it impossible to accurately determine the number of people entering or leaving the access control system.
Summary of the Invention
[0006] The summary section of this application is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
[0007] Some embodiments of this application propose a TOF depth image feature extraction method and system based on semantic segmentation to solve the technical problems mentioned in the background section above.
[0008] As a first aspect of this application, some embodiments of this application provide a TOF depth image feature extraction method based on semantic segmentation, comprising the following steps: Step 1: Collect optical flow vector information of the target area, generate a dynamic region of the target area based on the optical flow vector information, and determine the moving direction of the dynamic region to generate movement information; Step 2: Obtain the structured light information and optical flow vector information of the target area to generate moving target extraction frames; Step 3: Calculate the depth difference between each pixel in the dynamic region of the moving target extraction frame and the other pixels in the preset window to generate a structural difference value, and normalize the structural difference value to generate a structural difference mask; Step 4: Input the structural difference mask and movement information into the image feature extraction network, which extracts the first annotation information from the optical flow vector information and the second annotation information from the structural difference mask, wherein the first annotation information relates to the direction of movement of the dynamic region, and the second annotation information relates to regions resembling the head, shoulders, and limbs; Step 5: Input the first and second annotation information into the image feature extraction network to extract the personnel contours in the target area and the direction of entry and exit of the personnel contours in the target area; Step 6: Record the direction of entry and exit of the personnel contours in the target area to count the number of personnel at the corresponding positions in the direction of entry and exit of the target area.
[0009] This method fuses structured light (TOF) depth information and optical flow vector information. The structured light depth information explicitly captures the inherent three-dimensional features of the person's contour, and this key depth difference information is input into the deep network as a strong guiding feature. This enables the network to accurately perceive the three-dimensional structure of the person during semantic segmentation, significantly improving the accuracy of person contour segmentation. Meanwhile, the optical flow vector information automatically marks the movement direction of pixels in the person contour, thereby increasing the accuracy of person contour tracking.
[0010] Furthermore, step 1 includes the following steps: Step 11: Set up an optical flow vector sensor above the target area to obtain optical flow vector information of the target area; Step 12: Continuously detect the optical flow vector information of the target region, and extract the dynamic region of the target region based on the optical flow vector information; Step 13: Use optical flow vectors to annotate the dynamic region and generate the movement direction of each pixel in the dynamic region to generate movement information.
[0011] Furthermore, step 2 includes the following steps: Step 21: Align the structured light information with the optical flow vector information in the time dimension, and extract the three-dimensional image frame from the structured light information; Step 22: Extract the optical flow vector frames containing dynamic regions from the optical flow vector information, and use the 3D image frames that are aligned with the optical flow vector frames in time scale as the moving target extraction frames.
[0012] This application uses optical flow vector information to accurately extract which frame of the structured light information contains a dynamic object, thereby reducing the amount of computation required for structured light information and increasing system operating efficiency.
[0013] Step 3 includes the following steps: Step 31: Obtain the depth coordinates of the moving target extraction frame and generate a two-dimensional depth image. The pixel value of each pixel in the two-dimensional depth image is the depth coordinate. Step 32: Use the dynamic region as the extraction region in the two-dimensional depth image, and set the pixel values of all pixels outside the dynamic region to 0; Step 33: Based on the depth coordinates and the preset size of the person's outline, divide the extracted area into several initial areas; For the same initial region, the difference in depth coordinates of each pixel is less than a preset depth threshold; Step 34: For depth coordinates within the same initial region, normalize them based on the maximum and minimum values to generate a structural difference mask.
[0014] This application intelligently segments a 2D depth image into multiple initial regions that conform to the human body's external dimensions by combining prior knowledge of human contour size with local depth consistency constraints (depth difference between pixels in the same region is less than a threshold). This effectively overcomes the overwhelming effect of absolute distance error on centimeter-level human body undulation features in long-distance structured light measurements (1-10 meter range). Local depth value normalization within the initial regions significantly amplifies the relative depth gradient changes of the human body contour.
[0015] Furthermore, step 33 includes the following steps: Step 331: Pre-set the depth threshold and the minimum range of the initial region; Step 332: Traverse the pixel value of each pixel in the two-dimensional depth image, and extract the largest segmentation region formed by all pixels whose pixel value difference is less than the depth threshold. If the range of the segmentation region is greater than the minimum range, generate the first initial region, and then remove the initial region from the two-dimensional depth image; Step 333: Repeat step 332 until no further initial region can be extracted from the two-dimensional depth image, and obtain all the initial regions.
[0016] The technical solution provided in this application can divide a two-dimensional depth image into initial regions with an area greater than the minimum range by continuously traversing the image.
[0017] Furthermore, the image feature extraction network includes: a dual-channel input module, used to input the movement information and the structural difference mask respectively, so as to extract the first annotation information and the second annotation information; a feature fusion module, which fuses the first and second annotation information based on dot product operations to generate fused features, and generates, based on the fused features, the probability of each pixel belonging to the boundary of the person's outline to obtain the person's outline; and a direction recognition module, which generates the entry and exit directions of each person's silhouette based on the probability that each pixel belongs to the boundary of the person's silhouette and the movement information.
[0018] The dual-channel feature extraction network designed in this application independently processes video frames (apparent texture) and structural difference masks (depth gradient) to generate a first feature vector (spatial-semantic features) and a second feature vector (3D structural features), respectively. It innovatively employs dot product operations for feature fusion, enabling pixel-level selective enhancement of spatial features by depth gradient information, accurately activating depth abrupt change regions at the edges of person contours. This mechanism significantly improves the network's robustness to illumination changes, background interference, and low-texture regions.
[0019] Furthermore, when training the image feature extraction network, a physical loss is generated based on the predicted residual information, and the physical loss is incorporated into the total loss function to correct the model parameters of the image feature extraction network.
[0020] This application innovatively transforms depth prediction residual information into differentiable physical constraint loss, dynamically correcting network parameters through backpropagation. This loss function forces the network to strictly adhere to the 3D geometric consistency rules (such as surface smoothness and boundary depth abrupt changes) constructed from structured light data while optimizing segmentation accuracy. This significantly suppresses erroneous segmentation that violates physical laws due to insufficient training data or noise (such as broken contours and floating edges), thereby improving the model's generalization ability and spatial rationality of segmentation results in complex real-world scenarios.
[0021] Furthermore, the total loss function includes: physical loss, boundary loss, and point loss; Physical loss is used to constrain neural network models to meet physical laws. Boundary loss is used to constrain neural network models to meet boundary conditions. Point loss is used to fit predicted information to the actual information.
[0022] This application employs a ternary collaborative loss mechanism consisting of physical loss (forcing predictions to conform to 3D anatomical rules), boundary loss (optimizing contour topological integrity and edge accuracy), and point loss (driving pixel-level data fitting). Through adaptive weight fusion, it achieves joint optimization of "physical rule constraints + geometric structure optimization + supervision signal anchoring," significantly improving the model's generalization ability under limited samples. It eliminates anatomical errors such as floating contours and broken edges, and outputs personnel contours with both sub-pixel-level geometric accuracy and spatial rationality, providing a robust segmentation foundation for access control systems.
[0023] Furthermore, step 6 includes the following steps: Step 61: Generate the corresponding intersection location vectors based on the number and direction of the intersections near the target area; Step 62: Acquire the outlines of people in the target area in real time, and track their positions according to their entry and exit directions to generate the destination of each person's outline at the corresponding intersection. Step 63: Calculate the total number of people in the area connected by each intersection based on the destination of each person's profile at the corresponding intersection.
[0024] As a second aspect of this application, a TOF depth image feature extraction system based on semantic segmentation is provided, which uses a TOF depth image feature extraction method based on semantic segmentation to extract the outline of people in the target region.
Attached Figure Description
[0025] Figure 1 is a flowchart of a TOF depth image feature extraction method based on semantic segmentation.
[0026] Figure 2 is a structured light information map.
[0027] Figure 3 is a schematic diagram of the user interface of a TOF depth image feature extraction system based on semantic segmentation.
Detailed Implementation
[0028] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments. The same reference numerals in the accompanying drawings represent the same components. It should be noted that the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the described embodiments of this application without creative effort are within the scope of protection of this application.
[0029] Compared to the embodiments shown in the accompanying drawings, feasible embodiments within the scope of this application may have fewer components, other components not shown in the drawings, different components, differently arranged components, or components with different connections, etc. Furthermore, two or more components in the drawings may be implemented in a single component, or a single component shown in the drawings may be implemented as multiple separate components.
[0030] Unless otherwise defined, the technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this application pertains. The terms “first,” “second,” and similar terms used in this specification and claims do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Similarly, the terms “an” or “a” and similar terms do not necessarily indicate a quantity limitation. Terms such as “upper” and “lower” are used only to indicate relative positional relationships, and these relative positional relationships may change accordingly when the absolute position of the described object changes.
[0031] In complex scenarios, indoor personnel entry, exit and retention counting systems (such as hotel management systems) face problems such as inaccurate contour segmentation (e.g., edge breakage caused by lighting interference, false detection caused by background noise) and low processing efficiency. Based on this, this application provides Embodiment 1, Embodiment 2 and Embodiment 3.
[0032] Referring to Figure 1, Embodiment 1 provides a TOF depth image feature extraction method based on semantic segmentation, which includes the following steps: Step 1: Acquire optical flow vector information of the target area, generate a dynamic region of the target area based on the optical flow vector information, and determine the direction of movement of the dynamic region to generate movement information; Step 1 includes the following steps: Step 11: Set up an optical flow vector sensor above the target area to obtain optical flow vector information of the target area; Step 12: Continuously detect the optical flow vector information of the target region, and extract the dynamic region of the target region based on the optical flow vector information; Step 13: Use optical flow vectors to annotate the dynamic region and generate the movement direction of each pixel in the dynamic region to generate movement information.
[0033] Optical flow vector sensors are devices based on optical principles. They integrate an image acquisition module to capture continuous image sequences of a target area in real time and analyze pixel displacements between image frames using existing optical flow calculation algorithms (such as Lucas-Kanade) to output optical flow vector information.
[0034] The optical flow vector information is represented in the form of a two-dimensional vector, which contains the motion components of each pixel in the horizontal and vertical directions, and quantifies the instantaneous motion direction and speed of the pixel within the time interval.
[0035] In this way, by using optical flow vector information, it is possible to determine which pixels in adjacent image frames are moving in the target area, which means that it is possible to roughly outline the moving objects in the target area.
[0036] The dynamic region is obtained from the optical flow vector information: a motion amplitude threshold is set (for example, the optical flow vector magnitude must exceed a preset threshold) to identify the set of pixels in the target region that undergo significant movement, and these pixels are segmented as the dynamic region. Subsequently, the direction component of the optical flow vector (for example, the angle calculated through the arctangent function) is used to directly label the movement direction of each pixel in the dynamic region, generating movement information (the average movement direction and average movement speed of the pixels). This process requires no additional calculations beyond threshold filtering and direction extraction on the existing optical flow vector data.
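The threshold filtering and direction extraction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dynamic_region` and `movement_info`, the nested-list flow layout, and the choice to average the flow vectors before taking the angle are assumptions, not details given in this application.

```python
import math

def dynamic_region(flow, magnitude_threshold):
    """flow: grid (list of rows) of (u, v) per-pixel displacements.
    Returns (mask, directions): mask[y][x] is True for moving pixels and
    directions[y][x] is the motion angle in radians (None if static)."""
    mask, directions = [], []
    for row in flow:
        mask_row, dir_row = [], []
        for u, v in row:
            moving = math.hypot(u, v) > magnitude_threshold
            mask_row.append(moving)
            # atan2 gives the direction of the flow vector for moving pixels.
            dir_row.append(math.atan2(v, u) if moving else None)
        mask.append(mask_row)
        directions.append(dir_row)
    return mask, directions

def movement_info(flow, mask):
    """Average direction (radians) and average speed over the dynamic region.
    Assumption: the average direction is the angle of the mean flow vector."""
    vectors = [flow[y][x]
               for y, row in enumerate(mask)
               for x, moving in enumerate(row) if moving]
    if not vectors:
        return None
    mean_u = sum(u for u, _ in vectors) / len(vectors)
    mean_v = sum(v for _, v in vectors) / len(vectors)
    mean_speed = sum(math.hypot(u, v) for u, v in vectors) / len(vectors)
    return math.atan2(mean_v, mean_u), mean_speed
```

Averaging the vectors rather than the angles avoids wrap-around problems near ±π; other definitions of "average direction" would be equally compatible with the text.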
[0037] Step 2: Acquire the structured light information of the target area, fuse the structured light information into the corresponding image group, and generate a structured light fused image group of the target area.
[0038] Structured light information involves actively projecting a specific coded pattern (invisible light) onto a target area and using a dedicated sensor to capture the deformation of the pattern on the object's surface, thereby calculating a set of data on the three-dimensional spatial position (depth) of each point on the object's surface relative to the sensor.
[0039] Step 2 includes the following steps: Step 21: Align the structured light information with the optical flow vector information in the time dimension, and extract the three-dimensional image frame from the structured light information; Specifically, the system simultaneously acquires optical flow vector information and structured light information. It precisely pairs the optical flow vector information and structured light information captured at the same instant, then extracts a 3D image frame from the structured light information. This 3D image frame is essentially the 3D coordinates of each pixel in the structured light information, and these coordinates are bound to the optical flow vector information, thus allowing for the approximate marking of the movement direction at each location within the target area.
[0040] Step 22: Extract the optical flow vector frames containing dynamic regions from the optical flow vector information, and use the 3D image frames that are aligned with the optical flow vector frames in time scale as the moving target extraction frames.
[0041] In step 1, optical flow vector information was extracted. This includes optical flow vector frames of the target region, which essentially represent a period of time where the optical flow vector changes drastically. For example, when no object is moving within the target region, the acquired optical flow vector information shows relatively smooth movement. However, when an object is moving, a dynamic region is acquired where the optical flow vector changes drastically.
[0042] Therefore, by aligning the optical flow vector information and structured light information in the time dimension, it is possible to extract the moving target extraction frame based on the existence of dynamic regions. The moving target extraction frame contains an object motion region and records the three-dimensional spatial position of each pixel in the target region.
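A minimal sketch of this time alignment, assuming each stream delivers timestamped frames: every optical flow frame that contains a dynamic region is paired with the nearest structured light frame, and only those paired frames are kept as moving target extraction frames. The function name `select_extraction_frames`, the tuple layouts, and the `max_offset` tolerance are hypothetical and chosen only for illustration.

```python
def select_extraction_frames(flow_frames, depth_frames, max_offset):
    """flow_frames: list of (timestamp, has_dynamic_region).
    depth_frames: list of (timestamp, frame_id) from the structured light sensor.
    Returns the frame_ids of depth frames aligned with flow frames that
    contain a dynamic region."""
    selected = []
    for t_flow, has_motion in flow_frames:
        if not has_motion or not depth_frames:
            continue  # static flow frames trigger no depth processing
        # Nearest depth frame in time to this flow frame.
        t_depth, frame_id = min(depth_frames, key=lambda f: abs(f[0] - t_flow))
        # Assumption: pairs further apart than max_offset are not "aligned".
        if abs(t_depth - t_flow) <= max_offset and frame_id not in selected:
            selected.append(frame_id)
    return selected
```

Skipping depth frames that have no matching dynamic region is what reduces the amount of structured light data to be processed.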
[0043] Step 3: Calculate the depth difference between each pixel in each image of the structured light fusion image group and the other pixels in the preset window to generate a structured difference value. Normalize the structured difference value to generate a structured difference mask.
[0044] Step 3 includes the following steps: Step 31: Obtain the depth coordinates of the moving target extraction frame and generate a two-dimensional depth image. The pixel value of each pixel in the two-dimensional depth image is the depth coordinate.
[0045] As shown in Figure 2, in hotel access control systems, cameras are usually placed at the top of the entrance so that when people enter the target area, the depth coordinates can be used to roughly distinguish the head, shoulders, feet, and other areas.
[0046] For example, the head is closest to the camera and has the smallest depth coordinate, followed by the shoulders, and then the feet. At the same time, the shoulders exhibit a clear symmetry relative to the head. Therefore, converting to depth coordinates reduces the number of information dimensions while maintaining the ability to recognize the human silhouette.
[0047] Step 31 essentially involves extracting the 3D information—how far each point is from the ground and the horizontal plane—from each image in the fused image group and converting it into a planar image (depth map) that uses black and white tones to visually represent elevation. This depth map serves as the foundational input for generating the "structural difference mask" (the core of step 3).
[0048] Step 32: Use the dynamic region as the extraction region in the two-dimensional depth image, and set the pixel values of all pixels outside the dynamic region to 0.
[0049] Dynamic areas are areas where people may move around, so the information in these areas is retained while the rest is deleted.
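Steps 31 and 32 can be illustrated as follows. The sketch assumes the moving target extraction frame is a grid of (x, y, z) coordinates with z as the depth coordinate; the names `to_depth_image` and `mask_outside_dynamic_region` are hypothetical.

```python
def to_depth_image(frame_3d):
    """Step 31: keep only the depth coordinate z of each (x, y, z) pixel."""
    return [[z for (_x, _y, z) in row] for row in frame_3d]

def mask_outside_dynamic_region(depth, dynamic_mask):
    """Step 32: set every pixel outside the dynamic region to 0."""
    return [[d if inside else 0 for d, inside in zip(depth_row, mask_row)]
            for depth_row, mask_row in zip(depth, dynamic_mask)]

# Small usage example on a 2x2 frame.
frame = [[(0, 0, 1.5), (1, 0, 2.0)],
         [(0, 1, 2.5), (1, 1, 3.0)]]
depth = to_depth_image(frame)
masked = mask_outside_dynamic_region(depth, [[True, False], [False, True]])
```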
[0050] Step 33: Based on the depth coordinates and the preset size of the person's outline, divide the extracted area into several initial areas; Step 33 includes the following steps: Step 331: Pre-set the depth threshold and the minimum range of the initial region; Step 332: Traverse the pixel value of each pixel in the two-dimensional depth image, and extract the largest segmentation region formed by all pixels whose pixel value difference is less than the depth threshold. If the range of the segmentation region is greater than the minimum range, generate the first initial region, and then remove the initial region from the two-dimensional depth image. For example, a depth threshold is set for a 2D depth image, and every pixel in the image is scanned: all pixels whose pixel values differ from one another by less than the depth threshold are extracted. Next, from these matching pixels, all contiguous pixel blocks (regions) are identified, and the largest one is selected. If the area of this largest contiguous region exceeds the preset minimum range, the system marks it as the "first initial region" and temporarily removes all pixels corresponding to this region from the depth map, preparing for the search for other regions later.
[0051] The phrase "all pixels whose pixel value difference is less than the depth threshold" means that in an initial region, the difference between the maximum and minimum pixel values must be less than the depth threshold.
[0052] In other embodiments, a depth threshold can be added upwards from the lowest pixel value in the two-dimensional depth image to obtain all pixels that meet the requirements. The region with the largest area among these pixels is compared with the minimum range. If it is larger than the minimum range, it is used as an initial region.
[0053] Step 333: Repeat step 332 until no further initial region can be extracted from the two-dimensional depth image, and obtain all the initial regions.
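The iterative extraction in steps 331 to 333 might look like the sketch below, which follows the variant in which the depth threshold is added upwards from the lowest remaining pixel value. Several choices here are assumptions rather than details from this application: 4-connectivity between pixels, measuring the "range" of a region as its pixel count, and discarding a whole depth band when its largest connected block is too small (which also guarantees termination). It assumes `depth_threshold` is positive.

```python
from collections import deque

def connected_components(pixels):
    """Group a collection of (y, x) pixels into 4-connected components."""
    remaining = set(pixels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            y, x = queue.popleft()
            for neighbor in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def extract_initial_regions(depth, depth_threshold, min_area):
    """Return the initial regions, each as a set of (y, x) pixels."""
    # Pixels with value 0 were cleared in step 32 and are ignored.
    active = {(y, x): d
              for y, row in enumerate(depth)
              for x, d in enumerate(row) if d > 0}
    regions = []
    while active:
        d_min = min(active.values())
        # Candidate band: pixels within depth_threshold of the lowest value.
        band = [p for p, d in active.items() if d - d_min < depth_threshold]
        largest = max(connected_components(band), key=len)
        if len(largest) > min_area:
            regions.append(largest)
            removed = largest
        else:
            # Assumption: nothing in this band is large enough, so drop it.
            removed = band
        for pixel in removed:
            del active[pixel]
    return regions
```

Because every pixel in a returned region lies within one depth band, the difference between the maximum and minimum depth in the region stays below the depth threshold.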
[0054] Step 34: For depth coordinates within the same initial region, normalize them based on the maximum and minimum values to generate a structural difference mask.
[0055] Normalization is an existing technique, and the specific normalization process will not be discussed here. After step 33, the 2D depth image is divided into multiple initial regions. The depth coordinates within each initial region fall within a certain range (meaning the region may contain one or more human silhouettes). Therefore, normalization is performed on each initial region, and the remaining parts are directly deleted. In this way, the structural difference mask intuitively describes the height difference between pixels within each initial region. Analyzing this part allows for faster extraction of human body contour features.
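As an illustration of step 34, the sketch below applies min-max normalization separately inside each initial region and leaves every other pixel at 0. The function name `structural_difference_mask` and the representation of a region as a set of (y, x) pixels are assumptions made for this example.

```python
def structural_difference_mask(depth, regions):
    """Min-max normalize the depth values inside each initial region.
    Pixels that belong to no region stay at 0.0 (they are discarded)."""
    height, width = len(depth), len(depth[0])
    mask = [[0.0] * width for _ in range(height)]
    for region in regions:
        values = [depth[y][x] for y, x in region]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for y, x in region:
            # A perfectly flat region has no depth relief, so it maps to 0.0.
            mask[y][x] = (depth[y][x] - lowest) / span if span > 0 else 0.0
    return mask
```

Normalizing per region rather than over the whole image is what stretches the small head-to-shoulder depth differences across the full 0 to 1 range, regardless of how far the person is from the sensor.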
[0056] Step 4: Input the structural difference mask and movement information into the image feature extraction network. The image feature extraction network extracts the first annotation information from the optical flow vector information and the second annotation information from the structural difference mask. The first annotation information relates to the direction of movement of the dynamic region, and the second annotation information relates to regions resembling the head, shoulders, and limbs. Step 5: Input the first and second annotation information into the image feature extraction network to extract the personnel contours in the target area and the direction of entry and exit of the personnel contours in the target area; Steps 4 and 5 are in practice a single step: the structural difference mask and movement information are input into the image feature extraction network to obtain the personnel contours in the target area and their direction of entry and exit. They are written as two separate steps to make the working process of the image feature extraction network easier to follow.
[0057] The specific structure and working process of the image feature extraction network are described in Embodiment 2.
[0058] The structural difference mask and movement information designed in this scheme are highly correlated. The structural difference mask effectively identifies the contour information of the target area, while the movement information automatically indicates the movement direction of each part of the target area without manual labeling.
[0059] Step 6: Record the direction of entry and exit of the personnel outlines in the target area to count the number of personnel at the corresponding positions in the direction of entry and exit of the target area.
[0060] Step 6 includes the following steps: Step 61: Generate the corresponding intersection location vector based on the number and direction of the intersections near the target area.
[0061] The sensors on the access control system (depth sensor, optical flow vector sensor) are fixed, and therefore the corresponding field of view is also fixed. Each field of view boundary defines an intersection, thus allowing us to obtain the intersection's position vector. The intersection position vector is a direction vector; when the direction of movement of a dynamic object in the target area is the same as the intersection position vector, and the focus is on the entrance of that intersection, it indicates that an object has entered through that intersection.
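One way to test whether a movement direction "is the same as" an intersection position vector is cosine similarity, sketched below. The function name `matching_intersection`, the dictionary of intersection vectors, and the `min_cosine` tolerance are all assumptions for illustration; this application does not specify how the directions are compared.

```python
import math

def matching_intersection(motion, intersections, min_cosine=0.8):
    """motion: (dx, dy) movement direction of a tracked contour.
    intersections: dict mapping an intersection name to its direction vector.
    Returns the best-aligned intersection name, or None if none is close."""
    motion_norm = math.hypot(*motion)
    if motion_norm == 0:
        return None  # a stationary contour has no direction
    best_name, best_cosine = None, min_cosine
    for name, (ix, iy) in intersections.items():
        intersection_norm = math.hypot(ix, iy)
        if intersection_norm == 0:
            continue
        cosine = (motion[0] * ix + motion[1] * iy) / (motion_norm * intersection_norm)
        if cosine >= best_cosine:
            best_name, best_cosine = name, cosine
    return best_name
```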
[0062] Step 62: Acquire the outlines of people in the target area in real time, and track their positions based on their entry and exit directions to generate the destination of each person's outline at the corresponding intersection.
[0063] In step 5, the outline of each person and the direction of the outline can be obtained, so that the direction from which the outline of the person comes and the final destination can be determined.
[0064] In step 62, person contours must be reasonably matched across adjacent video frames. For example, if a person takes 1 second to leave the target area and the acquisition rate of optical flow vector information and structured light information is 60 frames per second, then 60 person contours are captured within that second. Counting them as 60 separate people would be incorrect.
[0065] Therefore, step 62 requires real-time acquisition of the personnel outlines in the target area and position tracking based on the personnel outlines' entry and exit directions. The tracking method involves determining whether the positions of personnel outlines in adjacent frames match their movement directions. For example, if the positions of personnel outlines appearing in two adjacent frames match their movement directions, they are combined into a single personnel outline.
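The frame-to-frame matching can be sketched as a simple predict-and-match loop: each contour's position is advanced by its movement vector, and the nearest contour in the next frame within a tolerance is treated as the same person. The function name `link_tracks`, the use of contour centroids, the per-frame velocity, and the greedy nearest-neighbor matching are assumptions for this illustration (`math.dist` requires Python 3.8 or later).

```python
import math

def link_tracks(frames, tolerance):
    """frames: list of frames; each frame is a list of (position, velocity)
    pairs, with position=(x, y) and velocity=(dx, dy) per frame.
    Returns, for each frame, the track id assigned to each detection."""
    next_id = 0
    previous = []  # (track_id, position, velocity) from the previous frame
    all_ids = []
    for detections in frames:
        ids = [None] * len(detections)
        unused = set(range(len(detections)))
        for track_id, (px, py), (vx, vy) in previous:
            # Where this contour should be if it kept moving the same way.
            predicted = (px + vx, py + vy)
            best = min(unused,
                       key=lambda i: math.dist(predicted, detections[i][0]),
                       default=None)
            if best is not None and math.dist(predicted, detections[best][0]) <= tolerance:
                ids[best] = track_id
                unused.remove(best)
        for i in sorted(unused):  # unmatched detections start new tracks
            ids[i] = next_id
            next_id += 1
        previous = [(ids[i], pos, vel) for i, (pos, vel) in enumerate(detections)]
        all_ids.append(ids)
    return all_ids
```

With this linking, the 60 contours of one person crossing the area in one second collapse into a single track id, so the person is counted once.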
[0066] Step 63: Calculate the total number of people in the area connected by each intersection based on the destination of each person's profile at the corresponding intersection.
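A possible bookkeeping scheme for step 63 is shown below. It assumes each completed track is summarized as an (origin, destination) pair of area names reached through the corresponding intersections, and that a person leaving one area is subtracted from it; both the function name `update_occupancy` and this accounting rule are assumptions.

```python
from collections import Counter

def update_occupancy(occupancy, tracks):
    """occupancy: dict mapping an area name to its current head count.
    tracks: iterable of (origin, destination) area names, one per person.
    Returns the updated head count per area."""
    counts = Counter(occupancy)
    for origin, destination in tracks:
        if origin == destination:
            continue  # the person turned back, so nothing changes
        counts[origin] -= 1
        counts[destination] += 1
    return dict(counts)
```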
[0067] As shown in Figure 3, the image shows the number of people in each room within a certain system.
[0068] Embodiment 2: The image feature extraction network is a semantic segmentation model that can automatically label and segment the outlines of people in the target area based on the input video frames and structural difference masks.
[0069] The image feature extraction network includes: a dual-channel input module, used to input the movement information and the structural difference mask respectively, so as to extract the first annotation information and the second annotation information; a feature fusion module, which fuses the first and second annotation information based on dot product operations to generate fused features, and generates, based on the fused features, the probability of each pixel belonging to the boundary of the person's outline to obtain the person's outline; and a direction recognition module, which generates the entry and exit directions of each person's silhouette based on the probability that each pixel belongs to the boundary of the person's silhouette and the movement information.
[0070] The dual-channel input module in this scheme is a convolutional network, and the feature fusion module is a neural network. The feature fusion module mainly relies on the first and second annotation information to extract the human contour. The first annotation information is related to the direction of movement of the dynamic region, and the second annotation information is similar to the head, shoulders, and limbs. By using these two pieces of information, regions with movement characteristics and similarity to the human contour can be quickly found during training.
[0071] The orientation discrimination module mainly obtains the entry and exit directions of the personnel contour through optical flow vectors. That is, after recognizing the personnel contour, it identifies the personnel's entry and exit directions based on the optical flow direction of the pixels inside the personnel contour. The dual-channel input module and the feature fusion module belong to a neural network and need to be trained in the following way: the orientation discrimination module provides motion-supervised training for the segmentation of the personnel contour based on the personnel contour provided by the feature fusion module and the optical flow vector. When updating parameters during training, the parameters of the network model in the feature fusion module and the dual-channel input module are mainly updated based on the result of the loss function.
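One simple way to turn the optical flow inside a contour into an entry or exit direction is to average the flow vectors of the pixels inside the contour and project the mean onto an assumed entry axis. The sketch below is an illustration only: the function `contour_direction`, the `entry_axis` parameter, and the labels `"enter"` and `"exit"` are hypothetical, and the application does not specify how the direction is computed.

```python
def contour_direction(mask, flow, entry_axis=(0.0, 1.0)):
    """Classify a contour as entering or exiting from the flow inside it.

    mask:       2D list of 0/1 values, 1 marks pixels inside the contour.
    flow:       2D list of (u, v) optical flow vectors, same shape as mask.
    entry_axis: assumed direction (in image coordinates) that counts as "enter".
    Returns (label, mean_flow); label is None if the result is undecided.
    """
    sum_u = sum_v = 0.0
    count = 0
    for mask_row, flow_row in zip(mask, flow):
        for inside, (u, v) in zip(mask_row, flow_row):
            if inside:
                sum_u += u
                sum_v += v
                count += 1
    if count == 0:
        return None, (0.0, 0.0)  # empty contour: nothing to classify
    mean = (sum_u / count, sum_v / count)
    # Positive projection on the entry axis means movement into the area.
    projection = mean[0] * entry_axis[0] + mean[1] * entry_axis[1]
    if projection > 0:
        return "enter", mean
    if projection < 0:
        return "exit", mean
    return None, mean
```

Only pixels inside the contour contribute, so background motion elsewhere in the frame does not affect the result.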
[0072] When training the image feature extraction network, a physical loss is generated based on the predicted residual information. The physical loss is then incorporated into the total loss function to correct the model parameters of the image feature extraction network.
[0073] The total loss function includes a physical loss, a boundary loss, and a point loss. The physical loss constrains the neural network model to satisfy physical laws, the boundary loss constrains it to satisfy boundary conditions, and the point loss fits the predicted information to the actual information.
[0074] Specifically, the loss function is LOSS:

$$\mathrm{LOSS} = \lambda_1 L_{\mathrm{phy}} + \lambda_2 L_{\mathrm{bnd}} + \lambda_3 L_{\mathrm{pt}}$$

$$L_{\mathrm{phy}} = \frac{1}{N}\sum_{i=1}^{N}\left\|\nabla^2 \hat{u}(x_i) - \nabla^2 u(x_i)\right\|^2,\quad x_i \in \Omega$$

$$L_{\mathrm{bnd}} = \frac{1}{M}\sum_{j=1}^{M}\left\|\hat{u}(x_j) - u(x_j)\right\|^2,\quad x_j \in \partial\Omega$$

$$L_{\mathrm{pt}} = \frac{1}{N}\sum_{k=1}^{N}\left\|\hat{u}(x_k) - u(x_k)\right\|^2,\quad x_k \in \Omega$$

where $L_{\mathrm{phy}}$ is the physical loss, $L_{\mathrm{bnd}}$ is the boundary loss, and $L_{\mathrm{pt}}$ is the point loss; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the first, second, and third weight parameters, respectively; $N$ is the number of sampling points; $x_i$ is the depth coordinate of the i-th sampling point; $\Omega$ is the computational domain of the partial differential equation; $\nabla^2$ is the Laplace operator; $\hat{u}(x_i)$ is the predicted solution value at point $x_i$; $\nabla^2 u(x_i)$ is the second derivative of the true solution at point $x_i$; $\|\cdot\|^2$ is the squared norm; $M$ is the number of sampling points on the boundary; $x_j$ is the depth coordinate of the j-th sampling point; $\partial\Omega$ is the boundary of the computational domain; $\hat{u}(x_j)$ is the predicted solution value at point $x_j$; $u(x_j)$ is the true solution at point $x_j$; $x_k$ is the depth coordinate of the k-th sampling point; $\hat{u}(x_k)$ is the predicted solution value at point $x_k$; and $u(x_k)$ is the true solution at point $x_k$.
[0075] The sampling points for the physical loss and the point loss both lie inside the computational domain, whereas the sampling points for the boundary loss lie strictly on its boundary. The boundary points and the interior sampling points are therefore drawn from different sets.
[0076] The image feature extraction network is trained as follows: 500,000 labeled human contour images with motion information and structural difference masks are obtained as the training set, in which the human contours and their movement directions are labeled. End-to-end supervised training is then performed with the loss function LOSS, yielding a highly discriminative human contour feature extractor that can track human contours and provide the movement direction of each contour.
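The weighted combination of physical, boundary, and point losses described in [0073] to [0075] could be computed along the following lines. This is a sketch under stated assumptions: each term is taken to be a mean squared error over scalar values at its own sampling points, the Laplacian values are assumed to be supplied by the caller, and the names `mean_squared_error` and `total_loss` are hypothetical.

```python
def mean_squared_error(predicted, target):
    """Mean of squared differences over one set of sampling points."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(lap_pred, lap_true,   # Laplacian values at interior points
               bnd_pred, bnd_true,   # solution values at boundary points
               pt_pred, pt_true,     # solution values at interior points
               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of physical, boundary, and point losses (assumed MSE form)."""
    w_phy, w_bnd, w_pt = weights
    physical = mean_squared_error(lap_pred, lap_true)
    boundary = mean_squared_error(bnd_pred, bnd_true)
    point = mean_squared_error(pt_pred, pt_true)
    return w_phy * physical + w_bnd * boundary + w_pt * point
```

Keeping the boundary samples in separate arguments mirrors the point made in [0075]: boundary points and interior points come from different sets and may differ in number.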
[0077] Example 3: A TOF depth image feature extraction system based on semantic segmentation, comprising: an iToF depth radar sensor, an information processing device, and an information recording device.
[0078] The iToF depth radar sensors are deployed above access control points, elevators, gates, and similar locations. Each iToF depth radar sensor has a 100 × 100 radar dot array, and the optical flow vector sensor is installed at the same position as the iToF depth radar sensor. The information processing device is connected to the iToF depth radar sensors via a local area network. It uses the schemes described in Examples 1 and 2 to extract the contours of people at all access control points, elevators, gates, and other locations, and calculates the number of people in each independent area.
[0079] As shown in Figure 3, the information recording device is connected to the information processing device. Using the iToF depth radar sensors and information processing devices installed at the entrances and exits of each area, it records the number of people in each room and promptly issues a warning message when the number of people in a room exceeds the warning number. Figure 3 shows that, once a maximum number of people is set for a room, a warning can be issued in time.
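The warning behaviour of the information recording device can be illustrated with a simple threshold check. The function name `rooms_over_limit`, the message format, and the room labels are hypothetical; the application only states that a warning is issued when a room's head count exceeds its warning number.

```python
def rooms_over_limit(occupancy, warning_limits):
    """Return one warning message per room whose head count exceeds its limit.

    occupancy:      dict mapping room name to current number of people.
    warning_limits: dict mapping room name to its warning number;
                    rooms without an entry are never flagged.
    """
    warnings = []
    for room, count in sorted(occupancy.items()):
        limit = warning_limits.get(room)
        if limit is not None and count > limit:
            warnings.append(f"{room}: {count} people exceeds warning number {limit}")
    return warnings
```

A room is flagged only when its count is strictly greater than the warning number, matching the wording "exceeds the warning number".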
[0080] The above are merely preferred embodiments of this application and are not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A Time-of-Flight (TOF) depth image feature extraction method based on semantic segmentation, characterized in that it includes the following steps: Step 1: Collect optical flow vector information of the target area, and generate, based on the optical flow vector information, a dynamic region of the target area and the moving direction of the dynamic region, so as to generate movement information; Step 2: Obtain the structured light information and optical flow vector information of the target area to generate moving target extraction frames; Step 3: Calculate the depth difference between each pixel in the dynamic region of the moving target extraction frame and the other pixels in the preset window, generate a structural difference value, normalize the structural difference value, and generate a structural difference mask; Step 4: Input the structural difference mask and the movement information into the image feature extraction network, wherein the image feature extraction network extracts the first annotation information from the optical flow vector information and extracts the second annotation information from the structural difference mask, the first annotation information being annotation information related to the direction of movement of the dynamic region, and the second annotation information being annotation information on structures resembling the head, shoulders, and limbs; Step 5: Input the first and second annotation information into the image feature extraction network to extract the personnel contours in the target area and the entry and exit directions of the personnel contours in the target area; Step 6: Record the entry and exit directions of the personnel contours in the target area to count the number of personnel at the corresponding positions in the entry and exit directions of the target area.
2. The TOF depth image feature extraction method based on semantic segmentation according to claim 1, characterized in that Step 1 includes the following steps: Step 11: Set up an optical flow vector sensor above the target area to obtain optical flow vector information of the target area; Step 12: Continuously detect the optical flow vector information of the target area, and extract the dynamic region of the target area based on the optical flow vector information; Step 13: Use optical flow vectors to annotate the dynamic region and generate the movement direction of each pixel in the dynamic region, so as to generate movement information.
3. The TOF depth image feature extraction method based on semantic segmentation according to claim 1, characterized in that Step 2 includes the following steps: Step 21: Align the structured light information with the optical flow vector information in the time dimension, and extract the three-dimensional image frames from the structured light information; Step 22: Extract the optical flow vector frames containing dynamic regions from the optical flow vector information, and use the three-dimensional image frames that are aligned in time with those optical flow vector frames as the moving target extraction frames.
4. The TOF depth image feature extraction method based on semantic segmentation according to claim 1, characterized in that Step 3 includes the following steps: Step 31: Obtain the depth coordinates of the moving target extraction frame and generate a two-dimensional depth image, in which the pixel value of each pixel is its depth coordinate; Step 32: Use the dynamic region as the extraction region in the two-dimensional depth image, and set the pixel values of all pixels outside the dynamic region to 0; Step 33: Based on the depth coordinates and the preset size of a person's contour, divide the extraction region into several initial regions, such that within the same initial region the difference in depth coordinates between pixels is less than a preset depth threshold; Step 34: For the depth coordinates within the same initial region, normalize them based on the maximum and minimum values to generate a structural difference mask.
5. The TOF depth image feature extraction method based on semantic segmentation according to claim 4, characterized in that Step 33 includes the following steps: Step 331: Preset the depth threshold and the minimum range of an initial region; Step 332: Traverse the pixel value of each pixel in the two-dimensional depth image and extract the largest segmentation region formed by all pixels with pixel values less than the depth threshold; if the range of the segmentation region is greater than the minimum range, generate an initial region and then remove that initial region from the two-dimensional depth image; Step 333: Repeat step 332 until no further initial region can be extracted from the two-dimensional depth image, thereby obtaining all the initial regions.
6. The TOF depth image feature extraction method based on semantic segmentation according to claim 1, characterized in that the image feature extraction network includes: a dual-channel input module, which takes the movement information and the structural difference mask as separate inputs and extracts the first annotation information and the second annotation information; a feature fusion module, which fuses the first and second annotation information by point-wise multiplication to generate fused features and, from the fused features, generates the probability that each pixel belongs to the boundary of a person's contour, thereby obtaining the person's contour; and a direction recognition module, which generates the entry and exit direction of each person's contour from the per-pixel boundary probability and the movement information.
7. The TOF depth image feature extraction method based on semantic segmentation according to claim 6, characterized in that, when training the image feature extraction network, a physical loss is generated based on the predicted residual information, and the physical loss is then incorporated into the total loss function to correct the model parameters of the image feature extraction network.
8. The TOF depth image feature extraction method based on semantic segmentation according to claim 7, characterized in that the total loss function includes a physical loss, a boundary loss, and a point loss; the physical loss is used to constrain the neural network model to satisfy physical laws; the boundary loss is used to constrain the neural network model to satisfy boundary conditions; and the point loss is used to fit the predicted information to the actual information.
9. The TOF depth image feature extraction method based on semantic segmentation according to claim 1, characterized in that Step 6 includes the following steps: Step 61: Generate the corresponding intersection location vectors based on the number and direction of the intersections near the target area; Step 62: Acquire the contours of people in the target area in real time, and track their positions according to their entry and exit directions to generate the destination of each person's contour at the corresponding intersection; Step 63: Calculate the total number of people in the area connected by each intersection based on the destination of each person's contour at the corresponding intersection.
10. A TOF depth image feature extraction system based on semantic segmentation, characterized in that the person contour of the target region is extracted using the TOF depth image feature extraction method based on semantic segmentation as described in any one of claims 1 to 9.