A gaze target detection method based on visual and semantic cues
The method combines field-of-view (FOV) prediction, saliency detection, and semantic object detection, using both visual and semantic cues, to address the ambiguity and lack of information that limit existing gaze target detection technologies, and achieves efficient and accurate detection in complex scenes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIHANG UNIV
- Filing Date
- 2023-02-22
- Publication Date
- 2026-06-30
[Figure: abstract drawing, CN116402991B_ABST]
Abstract
Description
Technical Field
[0001] The embodiments of this disclosure relate to the field of computer technology, and more specifically to a gaze target detection method based on visual and semantic cues.
Background Technology
[0002] Eye gaze is a crucial factor in revealing human behavior. Traditional research has focused on estimating the direction of human eye gaze. However, to investigate deeper human intentions, the location of a person's gaze—the object being gazed at—offers a more intuitive channel. Therefore, human gaze target detection in the wild, aimed at estimating what each person is looking at in a single (RGB) image, has become a challenging task in computer vision and has been widely applied as a profitable technique in human-computer interaction, social awareness analysis, and medical research.
[0003] The wide range of applications has attracted numerous researchers to explore solutions for gaze target detection tasks. However, due to the ambiguity of the human gaze target problem and the lack of rigorously labeled datasets, existing convolutional methods, which combine gaze estimation results with visual saliency information of the image, cannot provide satisfactory results.
[0004] Recent work has introduced 3D depth as additional information for calculating gaze targets. Despite achieving state-of-the-art performance, issues such as low-resolution or occluded faces and highly blurred scenes persist. In summary, the current state of research in human gaze target detection is highly limited by a lack of information and the inherent ambiguity of the problem itself.
Summary of the Invention
[0005] The summary portion of this disclosure is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description portion. This summary portion is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.
[0006] Some embodiments of this disclosure propose a gaze target detection method based on visual and semantic cues to address one or more of the technical problems mentioned in the background section above.
[0007] This disclosure proposes a coarse-to-fine gaze target detection method that detects gaze targets from a single RGB image by incorporating field of view (FOV), saliency, and semantic cues.
[0008] The disclosed gaze target detection method based on visual and semantic cues consists of three modules: 1) A FOV prediction module first predicts the human gaze direction using different strategies based on human facial visibility, then infers high-probability target regions and generates a weighted FOV map containing FOV cues. 2) A saliency detection module first extracts features from the weighted FOV map, then uses an encoder-decoder to generate an FOV-guided saliency map, merging the FOV cues and saliency cues. 3) A semantic object detection module detects objects of human interest, then generates object candidate maps for each target region, containing well-distributed weights for semantic cues. Finally, the method infers the accurate gaze target by combining the FOV-guided saliency map and the object candidate map.
Attached Figure Description
[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and elements are not necessarily drawn to scale.
[0010] Figure 1 is a flowchart of some embodiments of the gaze target detection method based on visual and semantic cues according to the present disclosure;
[0011] Figure 2 is a general flowchart of some embodiments of the gaze target detection method based on visual and semantic cues according to the present disclosure;
[0012] Figure 3 is a flowchart illustrating multi-person gaze estimation in some embodiments of the gaze target detection method based on visual and semantic cues according to the present disclosure.
Detailed Implementation
[0013] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0014] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.
[0015] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0016] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0017] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0018] This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.
[0019] Figure 1 shows a flow 100 of some embodiments of a gaze target detection method based on visual and semantic cues according to the present disclosure. This gaze target detection method based on visual and semantic cues includes the following steps:
[0020] Step 101: Input an RGB image containing a single or multiple people in a scene, scale the RGB image to a specific size, and obtain the scaled complete image.
[0021] Step 102: Input the scaled complete image into the multi-person gaze estimation module, and estimate the gaze direction of the specified person using different strategies based on the facial clarity of the specified person.
[0022] Step 103: Input the location of the specified person in the image and the gaze direction of the specified person into the visual field prediction module to obtain the high probability gaze region of the specified person in the scaled complete image, and generate a weighted visual field map containing gaze direction cues within the high probability gaze region.
[0023] Step 104: Input the high-probability gaze region of the specified person and the corresponding weighted visual field map into the scene saliency detection network. The scene saliency detection network extracts image features from the gaze region using a feature extractor and generates a vision-guided saliency map using an encoder-decoder.
[0024] Step 105: Input the scaled complete image into the target detector to detect all activity-related objects in the image, and combine them with high-probability gaze regions to generate an attention map of candidate objects within the high-probability gaze regions.
[0025] Step 106: Multiply the saliency map corresponding to the high-probability gaze region and the candidate object attention map to obtain a gaze target heatmap. The point with the highest heat value in the gaze target heatmap is the inferred gaze target.
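As an illustration of Step 106, the following minimal Python sketch multiplies a saliency map with a candidate object attention map and takes the location of the highest value. The function name `infer_gaze_target` and the use of NumPy arrays are assumptions made for illustration; they are not part of the disclosure.

```python
import numpy as np

def infer_gaze_target(saliency_map, candidate_map):
    """Fuse two maps of equal shape; return the peak (row, col) and the heatmap.

    Hypothetical helper: both inputs are assumed to be 2D arrays covering
    the same high-probability gaze region.
    """
    saliency_map = np.asarray(saliency_map, dtype=float)
    candidate_map = np.asarray(candidate_map, dtype=float)
    if saliency_map.shape != candidate_map.shape:
        raise ValueError("both maps must have the same shape")
    # Element-wise product gives the gaze target heatmap.
    heatmap = saliency_map * candidate_map
    # The point with the highest heat value is the inferred gaze target.
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(row), int(col)), heatmap
```

Because the product is element-wise, a point needs support from both the saliency cue and the semantic cue to reach a high heat value.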
[0026] Referring to Figure 2, the disclosed method follows human gaze target estimation strategies and combines cues from three aspects (FOV, saliency, and semantics) to ultimately locate the gaze target. The method consists of three modules: an FOV prediction module, a saliency detection module, and a semantic object detection module. If the face is clear and detectable, the FOV prediction module uses a multi-person gaze estimator to predict the gaze direction of each person in the image; if not, a multi-person pose estimator is used for pseudo-gaze estimation. Either path crops the field of view of each person and generates a weighted FOV map, which not only represents the cone weights along the gaze direction in the FOV but also indicates the target region, which is the smallest rectangular region containing the FOV. The saliency detection module contains an encoder-decoder trained through supervised learning to predict the saliency map of the weighted FOV. The semantic object detection module first detects all activity-related objects in the input image and then uses the weighted FOV to generate object candidate maps for each person. Finally, the method combines the FOV-guided saliency map and the object candidate maps to make the final prediction of the gaze target.
[0027] The FOV prediction module aims to estimate the weighted FOV of each person in the input image. This module can be divided into two stages: gaze estimation and weighted field of view generation.
[0028] The first stage of the FOV prediction module is gaze estimation, in which the module predicts the gaze of each person in the input image. To achieve efficient and robust multi-person gaze estimation, a multi-person gaze estimator is used to estimate the predicted gaze, a multi-person pose estimator is used to estimate the pseudo-gaze, and the appropriate output is then selected according to the detected face visibility.
[0029] A single-stage multi-person 3D gaze estimation network is designed and trained to estimate the 3D gaze of every individual simultaneously. As shown in Figure 3, the gaze estimator consists of a ResNet-50 backbone and a three-layer feature pyramid, which perform whole-image feature extraction and multi-scale feature fusion, respectively. A context module follows to expand the network's receptive field. The multi-task downstream heads then receive the features and output three elements: head location (the bounding box of each detected human head), facial visibility (the confidence score of each detected face), and 3D gaze (the yaw and pitch of each person's gaze). The 3D gaze is then projected into the 2D image space to obtain the 2D predicted gaze $Y_g$:
[0030]
[0031] Where r is half the width of the head bounding box, and θ and the second angle in the equation are the pitch and yaw components of the 3D gaze.
[0032] A multi-person pose estimator aims to approximate the human gaze direction when the face has low visibility (e.g., a blurred or indistinct face, or a person facing away from the camera) by using the positional relationships between anatomical keypoints. The multi-person pose estimator is capable of simultaneously estimating keypoints for each person in the entire image. The pose estimator predicts the 2D positions of the human ears and nose, and then approximates the human 2D pseudo-gaze $Y_h$ using the vector from the midpoint of the two ears to the nose:
[0033] $Y_h = \left( x_n - \frac{x_{el} + x_{er}}{2},\; y_n - \frac{y_{el} + y_{er}}{2} \right)$
[0034] Where $(x_n, y_n)$ is the coordinate of the nose, and $(x_{el}, y_{el})$ and $(x_{er}, y_{er})$ are the coordinates of the left and right ears.
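The pseudo-gaze described above (the vector from the midpoint of the two ears to the nose) can be sketched in a few lines of Python. The function name `pseudo_gaze` and the `(x, y)` tuple inputs are illustrative assumptions; the sketch does not normalize the vector, since the text does not state whether normalization is applied.

```python
import numpy as np

def pseudo_gaze(nose, left_ear, right_ear):
    """Return the 2D vector from the ear midpoint to the nose.

    Hypothetical helper: each argument is an (x, y) keypoint in image
    coordinates, as produced by some pose estimator.
    """
    nose = np.asarray(nose, dtype=float)
    ear_mid = (np.asarray(left_ear, dtype=float)
               + np.asarray(right_ear, dtype=float)) / 2.0
    return nose - ear_mid
```

For example, with the ears at (8, 8) and (8, 12) and the nose at (12, 10), the ear midpoint is (8, 10) and the pseudo-gaze points along the positive x axis.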
[0035] The confidence regressed by the face visibility head of the gaze estimator indicates how likely the 3D gaze estimate is to be accurate, and is directly related to the visibility of the face. Higher confidence indicates a more reliable gaze estimate and corresponds to faces with clearer facial features; lower confidence indicates that the face cannot be clearly detected, so the estimated gaze is unreliable, and corresponds to blurred, back-facing, or indistinct faces. The threshold is set to 0.5. In the following process, when the confidence is above the threshold, the 2D gaze direction g is the $Y_g$ generated by the gaze estimator; when it is below the threshold, g is the pseudo-gaze $Y_h$ generated by the pose estimator.
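The selection rule above reduces to a single comparison. In the sketch below, the function name `select_gaze` is hypothetical, and treating a confidence exactly equal to the threshold as "visible" is an assumption, since the text does not say how the equality case is handled.

```python
def select_gaze(face_confidence, estimated_gaze, pseudo_gaze_vec, threshold=0.5):
    """Pick the 2D gaze direction g according to face visibility confidence.

    Assumption: confidence == threshold is treated as a visible face.
    """
    if face_confidence >= threshold:
        # Face is clear enough: trust the gaze estimator output.
        return estimated_gaze
    # Face is blurred, back-facing, or indistinct: fall back to the pseudo-gaze.
    return pseudo_gaze_vec
```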
[0036] The second stage of the FOV prediction module focuses on FOV cues. With a 2D gaze direction selected, the module first defines the target region that almost certainly contains the gaze target, and then generates cone weights in the FOV that help to locate the object along the gaze direction.
[0037] Given a person's head bounding box, their field of view (FOV) in the image is a sector formed along the estimated 2D gaze direction g, starting from the center of their head. The two edges of this sector are the upper boundary vector g0 and the lower boundary vector g1 of the FOV. The intersection points of these two boundary vectors with the image boundaries are then calculated. The gaze region (target region) is the smallest rectangular region, with edges parallel to the image edges, that contains the complete FOV in the full image. Assuming P is the set of rectangular regions with edges parallel to the image edges, and Q is the union of the two intersection points and the four corner points of the head bounding box, the target region TR can be defined as:
[0038] $TR = \arg\min_{p \in P,\; Q \subseteq p} \operatorname{area}(p)$
[0039] Then, the half-angle ε between the upper/lower boundary vectors of the FOV and the 2D gaze direction g is determined. A 10% sample of the GazeFollow training set was used to measure the mean angular error of the estimated 2D gaze direction. Since the mean angular error on these samples is 11.6°, 12° is used as the initial value of the half-angle ε, and the two corresponding boundary vectors are calculated while ε is incremented by 1° at a time:
[0040] ε∈{12°, 13°, 14°, ..., 180°}.
[0041] As expected, the proportion of gaze targets that fall within the target region increases with ε. When ε is 20°, the proportion reaches 99.3% and improves little thereafter. Therefore, the half-angle ε is set to 20° to calculate the target region.
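The target region construction described above can be sketched as follows: rotate g by ±ε to get the two boundary vectors, intersect each with the image border, and take the axis-aligned bounding box of those two points plus the four head box corners (the set Q). All function names are hypothetical, the head center is assumed to lie inside the image, and boxes are assumed to be `(x0, y0, x1, y1)` tuples in pixel coordinates.

```python
import numpy as np

def rotate(v, degrees):
    """Rotate a 2D vector by the given angle in degrees."""
    a = np.deg2rad(degrees)
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return rot @ np.asarray(v, dtype=float)

def ray_hits_border(origin, direction, width, height):
    """First point where a ray from inside the image meets the image border."""
    ox, oy = origin
    dx, dy = direction
    tx = np.inf if dx == 0 else ((width if dx > 0 else 0.0) - ox) / dx
    ty = np.inf if dy == 0 else ((height if dy > 0 else 0.0) - oy) / dy
    t = min(tx, ty)
    return np.array([ox + t * dx, oy + t * dy])

def target_region(head_box, gaze, width, height, half_angle=20.0):
    """Smallest axis-aligned rectangle containing the point set Q."""
    x0, y0, x1, y1 = head_box
    center = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    # Boundary vectors g0 and g1: gaze rotated by +/- half_angle.
    hits = [ray_hits_border(center, rotate(gaze, s * half_angle), width, height)
            for s in (+1, -1)]
    corners = [np.array(c, dtype=float)
               for c in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    q = np.vstack(hits + corners)
    xmin, ymin = q.min(axis=0)
    xmax, ymax = q.max(axis=0)
    return float(xmin), float(ymin), float(xmax), float(ymax)
```

With a 100×100 image, a head box of (40, 40, 60, 60), and a gaze of (1, 0), the default 20° half-angle yields a region that spans from the left edge of the head box to the right image border.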
[0042] To further utilize FOV cues, a cone weight map generator creates a cone weight map within the target region. The resulting weighted FOV map is denoted $A_v$. For any point (i, j) in the target region, the angle ε between the vector from the head center to (i, j) and the estimated 2D gaze direction g is calculated first. Since the gaze target is more likely to lie close to g, larger weights are assigned to points with smaller ε, and smaller weights to points with larger ε. Furthermore, to improve the robustness of the module, an offset is introduced outside the FOV. The weighted FOV map is therefore designed to be continuous, to decrease with the angle ε inside the FOV, and to take a positive offset outside the FOV. The weighted FOV map can be represented as:
[0043]
[0044] Where ε represents the angle between any point within the gaze region and the gaze direction, ε∈[0, 180°]; $A_v$ represents the weighted visual field map; α represents the weight coefficient for all points; β represents the weight offset; and (i, j) represents the coordinates of a point in the weighted visual field map. When the image does not provide the gaze estimator with sufficiently clear head features of the gazer (i.e., ...), significant errors may occur in the estimation of the 2D gaze direction. To avoid missing the gaze target in this situation, the weight offset for all points outside the FOV (ε > 20°) is set to β = 0.5. Since the half-angle of the FOV is set to ε = 20° in the proposed module, α is set to 4.5 so that the map is continuous at the FOV boundary. The weighted FOV map generation thus incorporates FOV cues, achieves well-distributed cone weights, and ensures that the true gaze target does not fall outside the target area.
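A cone weight map with the stated properties can be sketched as below. The exact weighting function is an assumption: the sketch uses a linear fall-off, 1 − α·ε/180° inside the FOV and β outside it, chosen only because it is continuous at ε = 20° for α = 4.5 and β = 0.5 (1 − 4.5·20/180 = 0.5). The function name `weighted_fov_map` and its argument layout are likewise hypothetical.

```python
import numpy as np

def weighted_fov_map(shape, head_center, gaze, alpha=4.5, beta=0.5, half_angle=20.0):
    """Cone weights over a (height, width) grid; head_center and gaze are (x, y).

    Assumed form: 1 - alpha * eps / 180 inside the FOV, beta outside.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    vx, vy = xs - head_center[0], ys - head_center[1]
    g = np.asarray(gaze, dtype=float)
    norm = np.hypot(vx, vy) * np.linalg.norm(g)
    # Cosine of the angle between each pixel's offset and the gaze direction;
    # the head center itself (zero-length offset) is treated as angle 0.
    cos_eps = np.divide(vx * g[0] + vy * g[1], norm,
                        out=np.ones_like(norm), where=norm > 0)
    eps = np.degrees(np.arccos(np.clip(cos_eps, -1.0, 1.0)))
    inside = 1.0 - alpha * eps / 180.0
    return np.where(eps <= half_angle, inside, beta)
```

Under this assumed form, the weight is 1 on the gaze ray, falls to 0.5 at the FOV boundary, and stays at 0.5 everywhere outside the FOV, so a target is down-weighted but never zeroed out when the gaze estimate is off.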
[0045] The saliency detection module merges FOV cues and saliency cues. In this stage, the target region containing RGB information and the weighted FOV map generated for each individual are concatenated as input. Features are extracted using a ResNet backbone, and then an encoder-decoder regresses the FOV-guided saliency map for each individual. This module is trained using supervised learning. The saliency map regression loss L is the mean squared error between the predicted FOV-guided saliency map $H^*$ and the heatmap $H$ generated from the ground-truth annotations:
[0046] $L = \mathrm{MSE}(H^*, H)$.
[0047] Where MSE() represents the mean squared error loss function, $H^*$ represents the predicted saliency map, and $H$ represents the ground-truth heatmap.
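For reference, the mean squared error between a predicted saliency map and a ground-truth heatmap can be computed as in this small NumPy sketch. The function name `saliency_mse_loss` is hypothetical, and a real training setup would use the loss function of whichever deep learning framework is chosen.

```python
import numpy as np

def saliency_mse_loss(predicted, target):
    """Mean squared error between two maps of the same shape."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target maps must have the same shape")
    # Average of the squared per-pixel differences.
    return float(np.mean((predicted - target) ** 2))
```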
[0048] The semantic object detection module is a two-step framework. In the first step, the module utilizes semantic cues from the entire image to detect activity-related objects. In the second step, the module generates object candidate maps for each target region using well-distributed Gaussian weights, which reduces the visual saliency gap between objects at different scales.
[0049] For activity-related object detection, nine classes are selected from the 80 annotated classes in the COCO dataset based on prior knowledge of common human activities (e.g., ball sports, family gatherings, cooking in the kitchen): sports balls, frisbees, cell phones, cameras, televisions, knives, kites, cakes, and books. Together with detected faces, this gives a total of 10 classes of human activity-related objects. RetinaNet is then trained to detect all objects of the selected classes in the COCO dataset within complete images.
[0050] Object candidate maps are generated for each target region. Gaussian weights are distributed to objects whose bounding boxes intersect with the target region. Gaussian weights are placed around the centers of all objects intersecting with the target region to generate object candidate maps (all locations of the bounding boxes are converted to target region coordinates). Considering the visual saliency gap caused by differences in object scale, a novel weighting strategy is used to eliminate the negative impact of large-scale object visual saliency:
[0051]
[0052] Where n is the number of targets detected within the gaze area, $A_o$ represents the candidate object attention map, (i, j) represents the coordinates of a point in the candidate object attention map, and β represents the weight offset. Each detected object k contributes a Gaussian distribution, with the peak value of the Gaussian distribution being:
[0053]
[0054] Where $l_k$ is the length of the shorter side of the k-th detected target.
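The selection of objects for one target region (keeping only boxes that intersect the region, converting them to target region coordinates, and recording the shorter side $l_k$) can be sketched as follows. The function name `candidates_for_region` and the dictionary layout are hypothetical. The sketch deliberately stops before the Gaussian weighting itself, because the peak value and the way the Gaussians are combined are given by the equations of the disclosure and are not assumed here.

```python
def candidates_for_region(boxes, region):
    """Keep detections intersecting the target region, in region coordinates.

    boxes and region are (x0, y0, x1, y1) tuples in full-image coordinates.
    """
    rx0, ry0, rx1, ry1 = region
    candidates = []
    for x0, y0, x1, y1 in boxes:
        # Axis-aligned rectangles intersect when they overlap on both axes.
        if x0 < rx1 and x1 > rx0 and y0 < ry1 and y1 > ry0:
            center = ((x0 + x1) / 2.0 - rx0, (y0 + y1) / 2.0 - ry0)
            short_side = min(x1 - x0, y1 - y0)
            candidates.append({"center": center, "short_side": short_side})
    return candidates
```

Note that a box only partly inside the region is kept, so its center can fall slightly outside the region bounds after conversion.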
[0055] Finally, the object candidate map is multiplied with the FOV-guided scene saliency map output by the saliency detection module to obtain the final gaze target detection result.
[0056] The above description is merely a selection of preferred embodiments of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, technical solutions formed by substituting the above-described features with (but not limited to) technical features with similar functions disclosed in the embodiments of this disclosure.
Claims
1. A gaze target detection method based on visual and semantic cues, comprising:
inputting an RGB image containing one or more people in a scene, and scaling the RGB image to a specific size to obtain a scaled complete image;
inputting the scaled complete image into a multi-person gaze estimation module, and estimating the gaze direction of a specified person using different strategies based on the facial clarity of the specified person;
inputting the location of the specified person in the image and the gaze direction of the specified person into a visual field prediction module to obtain a high-probability gaze region of the specified person in the scaled complete image, and generating a weighted visual field map containing gaze direction cues within the high-probability gaze region, wherein, based on the biological characteristics of the human visual system, the visual field prediction module models the specified person's 3D visual field in the RGB image as a cone emanating from the person's eye position, the cone is projected onto the 2D image plane to obtain the specified person's 2D visual field in the RGB image, and weights are assigned to all points within the 2D visual field based on the gaze direction, generating the weighted visual field map within the specified person's gaze region according to a weighting formula in which ε represents the angle between any point within the gaze region and the gaze direction, $A_v$ represents the weighted visual field map, α represents the weight coefficient for all points, β represents the weight offset, and (i, j) represents the coordinates of a point in the weighted visual field map;
inputting the high-probability gaze region of the specified person and the corresponding weighted visual field map into a scene saliency detection network, wherein the scene saliency detection network extracts image features in the gaze region through a feature extractor and generates a vision-guided saliency map using an encoder-decoder;
inputting the scaled complete image into an object detector to detect all activity-related objects in the image, and, in combination with the high-probability gaze region, generating an attention map of candidate objects within the high-probability gaze region, including: after the scaled complete image is input into the object detector to detect all activity-related objects in the image, placing an adaptive Gaussian weight around the center point of every detected object within the gaze region of the specified person to generate the candidate object attention map of the gaze region according to a weighting formula in which n is the number of targets detected within the gaze region, $A_o$ represents the candidate object attention map, (i, j) represents the coordinates of a point in the candidate object attention map, β represents the weight offset, and the k-th term is the Gaussian distribution of the k-th detected target, whose peak value is determined by $l_k$, the length of the short side of the k-th detected target;
multiplying the saliency map corresponding to the high-probability gaze region and the candidate object attention map to obtain a gaze target heatmap, wherein the point with the largest heat value in the gaze target heatmap is the inferred gaze target;
wherein, for the gaze target detection problem in natural scenes, the gaze target is detected from coarse to fine from a single RGB image by merging visual field, saliency, and semantic cues.
2. The method according to claim 1, wherein, to address the varying visibility of facial features in natural scenes, different strategies are used to predict the human gaze direction: when facial features are clear and detectable, the gaze direction is estimated directly from the facial features; when the face is back-facing, blurry, low-resolution, or occluded, the positions of the nose and ears are estimated based on the positional relationship between key points on the human body, the direction of the vector from the midpoint of the two ears to the nose is determined, and this vector direction is used as the pseudo-gaze.
3. The method according to claim 1, wherein, based on prior knowledge of common human activities, a target detector is used to detect common activity-related targets in the input RGB image, the common activity-related targets including at least one of the following categories: ball, frisbee, mobile phone, camera, television, knife, kite, cake, book, and human face.