Information processing apparatus, information processing method, learning method, and storage medium

The information processing device changes the coefficients of its gaze estimation model according to the estimated facial orientation, and uses a common learning model for both eyes, so that the eye feature map is weighted appropriately. This addresses inaccurate gaze estimation caused by changes in facial orientation and improves gaze estimation accuracy.

CN115205829B · Active · Publication Date: 2026-06-19 · HONDA MOTOR CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HONDA MOTOR CO LTD
Filing Date: 2022-02-25
Publication Date: 2026-06-19

Abstract

This invention relates to information processing apparatus, information processing method, learning method, and storage medium. A technique is provided that improves the estimation accuracy of a person's gaze based on an image of their eyes. The information processing apparatus for estimating a person's gaze includes: a first processing unit that uses a first model to estimate the orientation of the person's face, the first model being configured to output a calculation result of the orientation of the person's face when an image of the person's face is input; and a second processing unit that uses a second model to estimate the person's gaze, the second model being configured to output a calculation result of the gaze when an image of the person's eyes is input, the second processing unit changing coefficients of the second model based on the orientation of the face estimated by the first processing unit.

Description

Technical Field

[0001] This invention relates to a technique for estimating a person's line of sight.

Background Technology

[0002] Patent document 1 proposes a technique for detecting a driver's gaze based on images obtained by capturing images of the driver's eyes or face.

[0003] Existing technical documents

[0004] Patent documents

[0005] Patent Document 1: Japanese Patent Application Publication No. 2005-278898

Summary of the Invention

[0006] The problem that the invention aims to solve

[0007] A person's gaze sometimes changes depending on the direction of their face, so a technique is desired that can accurately estimate a person's gaze based on the direction of their face.

[0008] Therefore, the object of the present invention is to provide a technique that improves the estimation accuracy when estimating a person's gaze based on an image of the person's eyes.

[0009] Solution for solving the problem

[0010] To achieve the above objectives, an information processing apparatus as one aspect of the present invention is an information processing apparatus for estimating the gaze of a person, characterized in that it comprises: a first calculation unit that uses a first model to estimate the orientation of the person's face, the first model being configured to output a calculation result of the orientation of the person's face when an image of the person's face is input; and a second calculation unit that uses a second model to estimate the gaze of the person, the second model being configured to output a calculation result of the gaze of the person when an image of the person's eyes is input, the second calculation unit changing the coefficients of the second model according to the orientation of the face estimated by the first calculation unit.

[0011] To achieve the above objectives, an information processing method according to one aspect of the present invention is an information processing method for estimating the gaze of a person, characterized by comprising: a first calculation step, using a first model to estimate the orientation of the person's face, the first model being configured to output a calculation result of the orientation of the face when an image of the person's face is input; and a second calculation step, using a second model to estimate the gaze of the person, the second model being configured to output a calculation result of the gaze when an image of the person's eyes is input, wherein in the second calculation step, the coefficients of the second model are changed according to the orientation of the face estimated by the first calculation step.

[0012] To achieve the above objectives, one aspect of the present invention is a learning method for an information processing device for estimating the gaze of a person, characterized by comprising: an extraction step, extracting an image of the person's face and an image of the person's eyes from an image of the person; an estimation step, estimating the person's gaze based on the image of the face and the image of the eyes extracted in the extraction step; an acquisition step, acquiring information about the person's gaze at the time the image of the person is obtained as training data; and a learning step, enabling the information processing device to learn so that the deviation between the person's gaze estimated in the estimation step and the person's gaze acquired in the acquisition step as training data is reduced.

[0013] The effects of the invention

[0014] According to the present invention, for example, a technique can be provided that improves the estimation accuracy when estimating the gaze of a person based on an image of the person's eyes.

Attached Figure Description

[0015] Figure 1 is a diagram illustrating a structural example of a system using the information processing apparatus of the present invention.

[0016] Figure 2 is a diagram showing examples of a captured image, extracted images, and input images.

[0017] Figure 3 is a diagram used to illustrate a learning model applied in an information processing device.

[0018] Figure 4 is a flowchart representing the estimation process performed by the information processing device.

[0019] Figure 5 is a schematic diagram illustrating the construction of inputs and outputs in machine learning.

[0020] Figure 6 is a flowchart illustrating the learning method in an information processing device.

[0021] Explanation of reference numerals in the attached figures

[0022] 1: Information processing apparatus; 1a: Storage unit; 1b: Communication unit; 1c: Generation unit; 1d: Model calculation unit; 2: Imaging unit; 3: External device.

Detailed Implementation

[0023] Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. The present invention is not limited to the following embodiments, and also includes changes and modifications to the structure within the scope of the present invention. Furthermore, the present invention does not require a combination of all the features described in this embodiment. Moreover, the same structural elements are labeled with the same reference numerals, and their descriptions are omitted.

[0024] Figure 1 is a block diagram illustrating a structural example of a system A using an information processing apparatus 1 according to an embodiment of the present invention. System A of this embodiment includes an information processing apparatus 1, an imaging unit 2, and an external device 3. The imaging unit 2 includes, for example, a camera that captures an image containing a person's face. For example, when System A of this embodiment is applied to a vehicle, the imaging unit 2 can be configured to capture the driver sitting in the driver's seat of the vehicle. Furthermore, the external device 3 is a device that acquires the gaze information of a person estimated by the information processing apparatus 1 and performs various processing based on that gaze information. For example, when System A of this embodiment is applied to a vehicle, the external device 3 is a control unit that controls the vehicle (for example, an ECU (Electronic Control Unit)) and that detects where the driver (person) is facing while driving based on the gaze information of the driver (person) estimated by the information processing apparatus 1. The external device 3 can also be a control unit that controls the vehicle's automatic driving.

[0025] Information processing device 1 is a computer including a processor (such as a CPU), a storage device such as semiconductor memory, and an interface with external devices. It performs estimation processing to estimate (determine, calculate) the gaze of a person based on an image of the person obtained by the imaging unit 2. "The person's gaze" is defined as the direction the person is looking in, and can also be understood as the gaze direction or gaze vector. In this embodiment, information processing device 1 may include a storage unit 1a, a communication unit 1b, a generation unit 1c, and a model calculation unit 1d. Storage unit 1a stores programs executed by the processor, various data, and learning models and learning data (described later). Information processing device 1 can perform the aforementioned estimation processing by reading and executing the programs stored in storage unit 1a. Here, the programs executed by information processing device 1 may also be stored on storage media such as CD-ROMs or DVDs, and installed from such storage media into information processing device 1.

[0026] The communication unit 1b of the information processing device 1 is an interface for communicating information and data with the imaging unit 2 and/or the external device 3, and includes an input/output interface and/or a communication interface. The communication unit 1b can be understood both as an acquisition unit that acquires an image of a person obtained by the imaging unit 2 and as an output unit (providing unit) that outputs (provides) information about the person's gaze estimated by the model calculation unit 1d (described later) to the external device 3. It should be noted that, hereinafter, the image of the person obtained by the imaging unit 2 will sometimes be referred to as a "captured image".

[0027] The generation unit 1c of the information processing apparatus 1 applies known image processing techniques to the captured image of a person obtained from the imaging unit 2 via the communication unit 1b, thereby extracting an image of the person's face (the whole face), an image of the person's left eye, and an image of the person's right eye from the captured image. Then, based on the face image, the left eye image, and the right eye image extracted from the captured image, it generates the images to be input to the model calculation unit 1d. Hereinafter, an image extracted from the captured image will sometimes be referred to as an "extracted image," and an image input to the model calculation unit 1d will sometimes be referred to as an "input image."

[0028] In this embodiment, the generation unit 1c performs mirror inversion processing on one of the extracted images of the left eye and the right eye, thereby inputting the inverted image obtained by mirroring one of the extracted images in the left-right direction into the model calculation unit 1d. On the other hand, the extracted image of the other of the extracted images of the left eye and the right eye is not mirrored, and the non-inverted image without mirror inversion in the left-right direction is input into the model calculation unit 1d. The extracted image of the face is also not mirrored, and the non-inverted image without mirror inversion in the left-right direction is input into the model calculation unit 1d. Hereinafter, an example of mirror inversion processing on the extracted image of the right eye will be described. It should be noted that "left-right direction" can be defined as the direction in which the left and right eyes are arranged in the image of a person (i.e., the left-right direction based on the person).

[0029] Figure 2 is a diagram showing examples of a captured image, extracted images, and input images. Figure 2(a) shows the captured image 10 obtained by the imaging unit 2 photographing a person (driver) seated in the driver's seat of a vehicle. The generation unit 1c acquires the captured image 10 shown in Figure 2(a) from the imaging unit 2 via the communication unit 1b and applies known image processing techniques to it, thereby extracting an image of the face, an image of the left eye, and an image of the right eye as extracted images. Figures 2(b-1) to (b-3) show the extracted image 11a of the face, the extracted image 12a of the left eye, and the extracted image 13a of the right eye, respectively. The generation unit 1c then performs mirror inversion processing on the extracted image 13a of the right eye shown in Figure 2(b-3), thereby generating, as shown in Figure 2(c-3), the inverted image obtained by mirroring the extracted image 13a in the left-right direction as the input image 13b of the right eye. On the other hand, the generation unit 1c does not perform mirror inversion processing on the extracted image 11a of the face or the extracted image 12a of the left eye (for example, it performs no processing on them), and uses each extracted image (non-inverted image) as the input image. That is, the generation unit 1c generates the extracted image 11a of the face as the input image 11b of the face, as shown in Figure 2(c-1), and generates the extracted image 12a of the left eye as the input image 12b of the left eye, as shown in Figure 2(c-2).
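The generation of input images described above can be sketched as follows. This is a minimal illustration only, assuming that each extracted image is a list of pixel rows; the names `mirror_left_right` and `make_input_images` are hypothetical and do not appear in the embodiment.

```python
from typing import List, Tuple

# Assumption: an image is a list of rows, each row a list of pixel values.
Image = List[List[int]]


def mirror_left_right(image: Image) -> Image:
    """Return a copy of the image mirrored in the left-right direction."""
    return [list(reversed(row)) for row in image]


def make_input_images(
    face_11a: Image, left_eye_12a: Image, right_eye_13a: Image
) -> Tuple[Image, Image, Image]:
    """Build input images 11b, 12b, 13b from extracted images 11a, 12a, 13a."""
    face_11b = [row[:] for row in face_11a]          # face: used as is
    left_eye_12b = [row[:] for row in left_eye_12a]  # left eye: used as is
    right_eye_13b = mirror_left_right(right_eye_13a) # right eye: mirrored
    return face_11b, left_eye_12b, right_eye_13b
```

Mirroring twice restores the original image, which is why the same operation can later be used to return a mirrored result to its original left-right sense.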

[0030] The model calculation unit 1d of the information processing device 1 performs computation using a machine learning algorithm with a predetermined learning model (neural network), thereby estimating (determining, calculating) the gaze of the left eye and the gaze of the right eye based on the input image 12b of the left eye and the input image 13b of the right eye input from the generation unit 1c, respectively. In this embodiment, an example is described in which the learning model (neural network) has a network structure called a CNN (Convolutional Neural Network), which includes, for example, one or more convolutional layers, pooling layers, and fully connected layers, but the network structure is not limited to a CNN and may be another structure. For example, it may be a structure with skip connections, such as ResNet (Residual Network). It may also be a structure that has a decoder in addition to a CNN-based encoder, such as an autoencoder. Of course, it is not limited to these structures; any neural network structure used for spatially distributed signals such as images may be used.

[0031] In this embodiment, the model calculation unit 1d uses a common (identical) learning model to separately (independently) perform the processing of estimating the left eye's gaze based on the left eye's input image 12b and the processing of estimating the right eye's gaze based on the right eye's input image 13b. A common learning model can be understood to mean that the structure and function of the learning model used to estimate the gaze from the input image are common (identical); more specifically, it can be understood to mean that the coefficients of the learning model (i.e., the weighting coefficients between neurons) are common (identical). A common learning model can be used for the left eye's input image 12b and the right eye's input image 13b because, as described above, one of the left eye's extracted image 12a and the right eye's extracted image 13a (in this embodiment, the right eye's extracted image 13a) is mirrored in the left-right direction before being input into the model calculation unit 1d (learning model). Furthermore, by using a common learning model, both extracted images (left eye and right eye) obtained from a single captured image 10 can be used as input data for machine learning when generating the learning model. In other words, conventionally only the extracted image of either the left eye or the right eye from a single captured image 10 is used as input data, whereas in this embodiment the extracted images of both eyes from a single captured image 10 can be used as input data. Therefore, the learning accuracy (gaze estimation accuracy) and learning efficiency of machine learning can be improved.

[0032] Furthermore, the model computation unit 1d of this embodiment performs computation using a machine learning algorithm with a predetermined learning model (neural network), thereby estimating the orientation (direction of the face) of the person based on the input image 11b of the face input by the generation unit 1c. Then, the model computation unit 1d inputs the estimated face orientation result to a learning model for estimating the gaze of each eye based on the input images 12b and 13b of each eye, and changes the coefficients of the learning model (i.e., the weighting coefficients between neurons). Thus, the gaze of each eye can be estimated with good accuracy based on the face orientation. Here, the correlation between the estimated face orientation result and the coefficient changes can be set through machine learning. Furthermore, an attention mechanism can be applied as the mechanism for changing the coefficients of the learning model.

[0033] Next, the learning model used in the information processing apparatus 1 of this embodiment will be explained. Figure 3 is a block diagram illustrating the learning model applied in the information processing apparatus 1 (model calculation unit 1d) of this embodiment. For example, as shown in Figure 3, the information processing apparatus 1 of this embodiment can include a learning model M1 for estimating the orientation of the face based on the input image 11b of the face, a learning model M2 for estimating the gaze of the left eye based on the input image 12b of the left eye, and a learning model M3 for estimating the gaze of the right eye based on the input image 13b of the right eye. The learning models M1 to M3 can also be understood as a single learning model.

[0034] An input image 11b of the face is input to the learning model M1. As mentioned earlier, the input image 11b is an image obtained from the extracted image 11a of the face without mirroring. In this embodiment, the extracted image 11a is used as is. First, the learning model M1 performs a feature map extraction process 21 based on the input image 11b of the face, for example, using a CNN. The positions of the left eye, right eye, nose, and mouth can be listed as features. Then, the learning model M1 performs an operation process 22 to calculate the orientation of the face based on the extracted feature map. The data representing the orientation of the face calculated in the operation process 22 is provided to the attention mechanism 25 of the learning model M2 and the attention mechanism 29 of the learning model M3, respectively. However, the data obtained by mirroring the orientation of the face calculated in the operation process 22 in the left-right direction is provided to the attention mechanism 29 of the learning model M3.
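The routing of the face orientation data to the two attention mechanisms can be sketched as follows. The embodiment does not specify how the face orientation is represented; this sketch assumes, purely for illustration, a yaw/pitch pair in which mirroring in the left-right direction negates the yaw and leaves the pitch unchanged. `FaceOrientation`, `mirror_orientation`, and `orientation_for_attention` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class FaceOrientation(NamedTuple):
    # Assumed representation, not taken from the embodiment.
    yaw: float    # left-right rotation; sign flips under mirroring
    pitch: float  # up-down rotation; unchanged under mirroring


def mirror_orientation(o: FaceOrientation) -> FaceOrientation:
    """Mirror the face orientation in the left-right direction."""
    return FaceOrientation(yaw=-o.yaw, pitch=o.pitch)


def orientation_for_attention(
    o: FaceOrientation,
) -> Tuple[FaceOrientation, FaceOrientation]:
    """Return the data for attention mechanism 25 (M2) and 29 (M3).

    M2 receives the orientation as calculated; M3 processes a mirrored
    right-eye image, so it receives the mirrored orientation.
    """
    return o, mirror_orientation(o)
```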

[0035] The left-eye input image 12b is input to the learning model M2. As mentioned earlier, the input image 12b is obtained by extracting the left-eye image 12a without mirroring it; in this embodiment, the extracted image 12a is used as is. First, the learning model M2 performs an extraction process 24 of eye-related feature maps based on the left-eye input image 12b, for example, using a CNN. As an example, in the extraction process 24, multiple features required to achieve the CNN's intended function (in this embodiment, estimating the gaze direction) are automatically constructed into the feature map. In the extraction process 24, the size of the eye, the width of the eye, the direction of the eye, and the position of the pupil (black part of the eye) can also be added as auxiliary information for estimating the gaze direction. Then, the learning model M2 weights each feature in the feature map extracted in the extraction process 24 using an attention mechanism 25, thereby generating a weighted feature map, and performs a gaze calculation operation 26 based on this weighted feature map. Thus, the gaze calculation is performed in the learning model M2. Information processing device 1 outputs the gaze information calculated by learning model M2 as information 32 representing the estimation result of the gaze of the left eye (hereinafter sometimes referred to as the gaze estimation information of the left eye). Here, in learning model M2, the weights (weighting coefficients) assigned to the feature map in attention mechanism 25 are changed based on the data provided from learning model M1.
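One way to picture how the weights of the attention mechanism 25 could depend on the face orientation is the sketch below. It is an assumption-laden illustration, not the structure of the embodiment: it assumes one scalar weight per feature-map channel, computed by a sigmoid of a linear function of the orientation data, with hypothetical learned parameters `W` and `b`.

```python
import math
from typing import List, Sequence

FeatureMap = List[List[List[float]]]  # assumed layout: channel -> rows -> values


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def attention_weights(
    orientation: Sequence[float],
    W: Sequence[Sequence[float]],  # hypothetical learned matrix, one row per channel
    b: Sequence[float],            # hypothetical learned bias, one entry per channel
) -> List[float]:
    """Compute one weight in (0, 1) per channel from the face orientation."""
    return [
        sigmoid(sum(w * o for w, o in zip(row, orientation)) + bias)
        for row, bias in zip(W, b)
    ]


def apply_attention(feature_map: FeatureMap, weights: Sequence[float]) -> FeatureMap:
    """Scale every value of each channel by that channel's weight."""
    return [
        [[w * v for v in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
```

With this form, a change in the estimated face orientation changes which channels of the eye feature map contribute most to the subsequent gaze calculation, while the feature extraction itself is unchanged.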

[0036] The input image 13b of the right eye is input to the learning model M3. As mentioned earlier, the input image 13b is obtained by performing mirror inversion processing 27 on the extracted image 13a of the right eye. The learning model M3 is the same as the learning model M2; specifically, its model structure and weighting coefficients are common to the learning model M2. First, the learning model M3 performs an extraction process 28 of eye-related feature maps based on the input image 13b of the right eye, for example, using a CNN. As an example, in the extraction process 28, multiple features required to achieve the purpose of the CNN (in this embodiment, estimating the gaze direction) are automatically constructed into the feature map. In the extraction process 28, the size of the eye, the width of the eye, the orientation of the eye, and the position of the pupil (black part of the eye) can also be added as auxiliary information for estimating the gaze direction. Then, the learning model M3 weights each feature of the extracted feature map using an attention mechanism 29, thereby generating a weighted feature map, and performs a calculation process 30 for gaze estimation based on the weighted feature map. Thus, the gaze calculation is performed in the learning model M3. The information processing device 1 performs a mirror inversion process 31 on the gaze calculated by the learning model M3, thereby mirroring the gaze in the left-right direction, and outputs the information of the mirrored gaze as information 33 representing the estimation result of the right eye's gaze (hereinafter sometimes referred to as the gaze estimation information of the right eye). Here, in the learning model M3, the weights (weighting coefficients) assigned to the feature map in the attention mechanism 29 are changed based on the data provided from the learning model M1.

[0037] Next, the estimation processing performed by the information processing device 1 of this embodiment will be described. Figure 4 is a flowchart illustrating the estimation process performed by the information processing apparatus 1 of this embodiment.

[0038] In step S11, the information processing device 1 (communication unit 1b) acquires a captured image 10 of a person from the imaging unit 2. Next, in step S12, the information processing device 1 (generation unit 1c) applies known image processing techniques to the captured image 10 acquired in step S11, thereby extracting a partial image of the person's face as extracted image 11a, a partial image of the person's left eye as extracted image 12a, and a partial image of the person's right eye as extracted image 13a.

[0039] In step S13, the information processing device 1 (generation unit 1c) generates input images for inputting into learning models M1 to M3 based on the extracted images 11a, 12a, and 13a obtained in step S12. As described above, the information processing device 1 mirrors one of the extracted images 12a (left eye) and 13a (right eye) to generate the input image, while generating the input image without mirroring the other extracted image. In this embodiment, the information processing device 1 mirrors the extracted image 13a (right eye) to generate the input image 13b (right eye), while using the extracted image 12a (left eye) as is without mirroring it to generate the input image 12b (left eye). Similarly, the information processing device 1 uses the extracted image 11a (face) as is without mirroring it to generate the input image 11b (face).

[0040] In step S14, the information processing device 1 (model calculation unit 1d) inputs the input images 11b, 12b, and 13b generated in step S13 into the learning models M1 to M3, thereby calculating the left-eye gaze and the right-eye gaze separately (independently). The method of calculating the left-eye gaze and the right-eye gaze is, for example, as described above with reference to Figure 3. Next, in step S15, the information processing device 1 (model calculation unit 1d) determines the gaze estimation information separately (independently) for each of the left and right eyes based on the gaze information of the left eye and the right eye calculated in step S14. For the eye whose extracted image was mirrored in step S13, the information processing device 1 performs mirror inversion processing on the calculated gaze to restore the original left-right sense, thereby generating the gaze estimation information for that eye. In this embodiment, the information processing device 1 performs mirror inversion processing on the gaze of the right eye calculated in step S14 and determines the information of the mirrored gaze as the gaze estimation information of the right eye. On the other hand, the gaze of the left eye calculated in step S14 is not mirrored, and the calculated gaze information of the left eye is determined as is as the gaze estimation information of the left eye. Next, in step S16, the information processing device 1 outputs the gaze estimation information of the left eye and the gaze estimation information of the right eye determined in step S15 to, for example, an external device 3.
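Steps S13 to S15 can be sketched end to end as follows. The sketch assumes that the face orientation and the gaze are both vectors whose first component lies along the left-right direction, so that mirroring negates that component; this representation, and the names `flip_image`, `flip_x`, `estimate_gaze`, `face_model`, and `eye_model`, are hypothetical. The learning models are passed in as plain callables, with `eye_model` standing in for the common model shared by M2 and M3.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Vec = Tuple[float, ...]


def flip_image(img: Image) -> Image:
    """Mirror an image in the left-right direction."""
    return [row[::-1] for row in img]


def flip_x(v: Vec) -> Vec:
    """Mirror a vector by negating its (assumed) left-right component."""
    return (-v[0],) + tuple(v[1:])


def estimate_gaze(
    face_11a: Image,
    left_eye_12a: Image,
    right_eye_13a: Image,
    face_model: Callable[[Image], Vec],      # stands in for M1
    eye_model: Callable[[Image, Vec], Vec],  # common model for M2 and M3
) -> Tuple[Vec, Vec]:
    # S13: only the right-eye image is mirrored.
    face_11b, left_12b = face_11a, left_eye_12a
    right_13b = flip_image(right_eye_13a)

    # S14: estimate face orientation, then each eye's gaze independently.
    orientation = face_model(face_11b)
    left_gaze = eye_model(left_12b, orientation)
    right_gaze_mirrored = eye_model(right_13b, flip_x(orientation))

    # S15: undo the mirroring for the right eye only.
    return left_gaze, flip_x(right_gaze_mirrored)
```

Because the same `eye_model` is called for both eyes, any coefficient it holds is shared between the left-eye and right-eye paths, which corresponds to the common learning model of this embodiment.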

[0041] Next, the learning method of the information processing device 1 in this embodiment will be explained. Figure 5 is a schematic diagram illustrating the construction of the input and output in the machine learning used to generate the learning model. Input data X1 (41) and input data X2 (42) are the data of the input layer of the learning model M (43). An image of the face (in this embodiment, the input image 11b of the face) is applied as input data X1 (41). An image of one of the left and right eyes (in this embodiment, the input image 12b of the left eye) and/or a mirrored image of the other eye (in this embodiment, the input image 13b of the right eye) is applied as input data X2 (42). In this embodiment, the two images (left eye and right eye) obtained from one captured image 10 can each be applied as input data X2; that is, machine learning can be performed twice based on one captured image 10, thus improving the learning accuracy (gaze estimation accuracy) and learning efficiency of machine learning.

[0042] Input data X1(41) and input data X2(42) are input into the learning model M(43), and the output data Y(44), which is the result of the gaze calculation, is output from the learning model M(43). The learning model M(43) can be understood as including the learning models M1 and M2 of Figure 3, or the learning models M1 and M3 of Figure 3. Furthermore, during machine learning, training data T(45) is provided as ground-truth data for the gaze calculated from the input data X, and the output data Y(44) and the training data T(45) are supplied to the loss function f(46), thereby obtaining the deviation L(47) of the calculated gaze from the ground truth. The coefficients (weighting coefficients) of the learning model M(43) are updated so that the deviation L is reduced over a large amount of learning data (input data), thereby optimizing the learning model M(43).
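The relationship between the output data Y, the training data T, the loss function f, and the deviation L can be illustrated with a toy example. The embodiment does not name a loss function or an update rule; this sketch assumes a squared-error loss and plain gradient descent, and replaces the learning model M with a single hypothetical coefficient `w` so that Y = w * x. The names `deviation`, `update`, and `train` are illustrative only.

```python
from typing import List, Tuple

# Assumption: each sample is (input feature x, ground-truth gaze value t).
Sample = Tuple[float, float]


def deviation(w: float, samples: List[Sample]) -> float:
    """Deviation L: mean squared error between Y = w * x and T."""
    return sum((w * x - t) ** 2 for x, t in samples) / len(samples)


def update(w: float, samples: List[Sample], lr: float) -> float:
    """One gradient-descent step that reduces the deviation L."""
    grad = sum(2.0 * (w * x - t) * x for x, t in samples) / len(samples)
    return w - lr * grad


def train(samples: List[Sample], w: float = 0.0, lr: float = 0.1,
          steps: int = 100) -> float:
    """Repeat the update so that the coefficient fits the training data."""
    for _ in range(steps):
        w = update(w, samples, lr)
    return w
```

In a real learning model the single coefficient would be replaced by all weighting coefficients of M1 and M2 (or M1 and M3), but the loop has the same shape: compute Y, compare it with T, and adjust the coefficients to reduce L.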

[0043] Here, the measurement results of a person's gaze are used as training data T (45). For example, to measure a person's gaze, the person is photographed by the imaging unit 2 while the person's gaze is directed toward a predetermined location (target location). The person's gaze at this time can be used as training data T, the face image extracted from the captured image obtained by the imaging unit 2 can be used as input data X1 (41), and the eye image extracted from the captured image can be used as input data X2 (42).

[0044] Figure 6 is a flowchart illustrating the learning method of the information processing apparatus 1 in this embodiment.

[0045] In step S21, a captured image obtained by having the imaging unit 2 photograph a person, and information about the person's gaze at that moment, are acquired. For example, as described above, by having the imaging unit 2 photograph the person while the person's gaze is directed toward a predetermined location (target location), the captured image and the information about the person's gaze can be acquired. The information about the person's gaze acquired in this step S21 is used as training data T(45).

[0046] In step S22, a partial image of the person's face is extracted from the captured image obtained in step S21 as input data X1 (41), and a partial image of the person's eyes is extracted as input data X2 (42). Here, the input data X2 (42) can be either a reversed image obtained by flipping the extracted partial image of the person's eyes in the left-right direction, or a non-reversed image obtained by not flipping the extracted partial image of the person's eyes.

[0047] In step S23, based on the partial image of the person's face extracted as input data X1 (41) in step S22 and the partial image of the person's eyes extracted as input data X2 (42), the information processing device 1 estimates the person's gaze using the learning model M (43). The gaze estimated in this step corresponds to the output data Y(44) in Figure 5. Next, in step S24, the information processing device 1 is trained so as to reduce the deviation L(47) between the person's gaze estimated as output data Y(44) in step S23 and the person's gaze acquired as training data T(45) in step S21.
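The way one captured image can yield two training examples (steps S21 and S22) can be sketched as follows. The function names are hypothetical. The sketch also makes one assumption that the embodiment does not state explicitly: when the eye image is mirrored, the ground-truth gaze is mirrored as well, so that the target stays consistent with the mirrored input. The gaze is assumed to be a vector whose first component lies along the left-right direction.

```python
from typing import List, Tuple

Image = List[List[float]]
Gaze = Tuple[float, float, float]
# One training example: (input data X1, input data X2, training data T).
Example = Tuple[Image, Image, Gaze]


def mirror_image(img: Image) -> Image:
    """Mirror an image in the left-right direction."""
    return [row[::-1] for row in img]


def mirror_gaze(g: Gaze) -> Gaze:
    """Mirror a gaze vector (assumed: first component is left-right)."""
    return (-g[0], g[1], g[2])


def build_examples(
    face: Image, left_eye: Image, right_eye: Image, measured_gaze: Gaze
) -> List[Example]:
    """Return two examples from one captured image and one measured gaze."""
    left_example = (face, left_eye, measured_gaze)  # non-inverted image
    right_example = (
        face,
        mirror_image(right_eye),      # inverted image as X2
        mirror_gaze(measured_gaze),   # assumed: target mirrored to match
    )
    return [left_example, right_example]
```

Both examples can then be fed to the same learning model M in steps S23 and S24, which is how a single captured image contributes two training passes.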

[0048] As described above, the information processing apparatus 1 of this embodiment uses a common learning model to separately perform processing (first processing) that estimates the gaze of one of the person's left and right eyes using an inverted image obtained by mirroring the image of that eye, and processing (second processing) that estimates the gaze of the other eye using a non-inverted image in which the image of the other eye is not mirrored. Therefore, machine learning can be performed using the two images (left eye and right eye) obtained from a single captured image 10 to generate the common learning model, thus improving the learning accuracy (gaze estimation accuracy) and learning efficiency of the machine learning.

[0049] Furthermore, in this embodiment, the information processing apparatus 1 uses a learning model M1 to estimate the orientation of a person's face based on an image of the person's face. Based on the orientation of the person's face estimated using the learning model M1, the coefficients of the learning models (M2 and / or M3) used to estimate the person's gaze based on an image of the person's eyes are changed. Thus, the person's gaze, which may change depending on the orientation of the person's face, can be estimated with good accuracy.

[0050] <Other Implementation Methods>

[0051] Furthermore, a program that implements one or more functions described in the above embodiments is provided to the system or device via a network or storage medium, and one or more processors in the computer of the system or device can read and execute the program. The present invention can also be implemented in this way.

[0052] <Summary of Implementation Methods>

[0053] 1. The information processing device of the above embodiment is an information processing device for estimating a person's line of sight, wherein

[0054] The information processing device (e.g., 1) includes:

[0055] A first processing unit (e.g., 1d) that uses a first model (e.g., M1) to estimate the orientation of the person's face, the first model being configured to output a calculation result of the orientation of the person's face when an image of the person's face (e.g., 11b) is input; and

[0056] A second processing unit (e.g., 1d) that uses a second model (e.g., M2, M3) to estimate the person's gaze, the second model being configured to output a calculation result of the gaze when an image of the person's eyes (e.g., 12b, 13b) is input.

[0057] The second processing unit changes the coefficients of the second model based on the orientation of the face estimated by the first processing unit.

[0058] According to this embodiment, it is possible to accurately estimate the gaze of a person, which may vary depending on the orientation of the person's face.

[0059] 2. In the above embodiments,

[0060] The second model has an attention mechanism (e.g., 25, 29) that weights a feature map of the image of the eyes.

[0061] The second processing unit changes the weighting coefficients in the attention mechanism based on the orientation of the face estimated by the first processing unit.

[0062] According to this embodiment, it is possible to accurately estimate the gaze of a person, which may vary depending on the orientation of the person's face.

[0063] 3. In the above embodiments, the information processing device further includes:

[0064] An acquisition component (e.g., 1b, 1c) that acquires an image (e.g., 10) of the person obtained by an image capturing component (e.g., 2); and

[0065] A generation component (e.g., 1c) that, based on the image of the person acquired by the acquisition component, generates the image of the person's face (e.g., 11b) to be input to the first model and the images of the person's eyes (e.g., 12b, 13b) to be input to the second model.

[0066] According to this embodiment, an image of the person's face and images of the person's eyes can be obtained from an image of the person captured by the image capturing component (camera), and the person's gaze can be estimated with good accuracy based on these images.

[0067] 4. In the above embodiments,

[0068] The second processing unit inputs a reversed image (e.g., 13b) obtained by inverting the image of the person's eyes into the second model (e.g., M3), and estimates the person's gaze based on the information obtained by inverting the gaze information output from the second model (e.g., 33).

[0069] According to this embodiment, a common model can be used to estimate the line of sight of a person's left eye and right eye. Even in this case, the line of sight of the left eye and right eye can be estimated with good accuracy based on the orientation of the person's face.

[0070] 5. In the above embodiments,

[0071] The second processing unit changes the coefficients of the second model based on the orientation of the face obtained by reversing the orientation of the face estimated by the first processing unit (e.g., 23).

[0072] According to this embodiment, when using a common model to estimate the line of sight of a person's left and right eyes, the line of sight of the left and right eyes can be estimated with good accuracy based on the orientation of the person's face.
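Items 4 and 5 above can be sketched together as one inference pass over both eyes with a single eye model. This is a minimal illustration, not the patented implementation: it assumes that "inverting" means a left-right mirror, that gaze and face orientation are (yaw, pitch) pairs whose horizontal component changes sign under mirroring, and that crop boxes are supplied by some detector not shown. The names `crop`, `mirror`, `estimate_both_eyes`, `eye_a`, `eye_b`, and the two toy models are hypothetical.

```python
def crop(image, box):
    """Return the sub-image inside box = (top, left, height, width)."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def flip_horizontal(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def mirror(angles):
    """Mirror a (yaw, pitch) pair: only the horizontal component changes sign."""
    yaw, pitch = angles
    return (-yaw, pitch)


def estimate_both_eyes(captured, boxes, face_model, eye_model):
    face_orientation = face_model(crop(captured, boxes["face"]))

    # Non-reversed eye: image and face orientation are used as they are.
    gaze_a = eye_model(crop(captured, boxes["eye_a"]), face_orientation)

    # Reversed eye: mirror the image and the face orientation on the way in,
    # then mirror the estimated gaze on the way out.
    reversed_eye = flip_horizontal(crop(captured, boxes["eye_b"]))
    gaze_b = mirror(eye_model(reversed_eye, mirror(face_orientation)))
    return gaze_a, gaze_b


# Toy stand-ins so the sketch runs; real models would be trained networks.
def toy_face_model(face_image):
    return (0.25, 0.1)


def toy_eye_model(eye_image, face_orientation):
    yaw, pitch = face_orientation
    slope = eye_image[0][-1] - eye_image[0][0]   # crude horizontal cue
    return (yaw + slope, pitch)


captured = [[0, 1, 2, 9, 9, 2, 1, 0]]
boxes = {"face": (0, 0, 1, 8), "eye_a": (0, 0, 1, 3), "eye_b": (0, 5, 1, 3)}
gaze_a, gaze_b = estimate_both_eyes(captured, boxes, toy_face_model, toy_eye_model)
```

Mirroring the face orientation together with the eye image keeps the two inputs geometrically consistent, so the common model sees the reversed eye as if it were the other eye of a mirrored face.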

[0073] This invention is not limited to the above-described embodiments, and various changes and modifications can be made without departing from the spirit and scope of this invention.

Claims

1. An information processing device for estimating a person's line of sight, characterized by comprising: a first processing unit that uses a first model to estimate the orientation of the person's face, the first model being configured to output a calculation result of the orientation of the person's face when an image of the person's face is input; and a second processing unit that uses a second model to estimate the person's gaze, the second model being configured to output a calculation result of the gaze when an image of the person's eyes is input, wherein the second processing unit changes the coefficients of the second model based on the orientation of the face estimated by the first processing unit.

2. The information processing device according to claim 1, characterized in that the second model has an attention mechanism that weights a feature map of the image of the eyes, and the second processing unit changes the weighting coefficients in the attention mechanism based on the orientation of the face estimated by the first processing unit.

3. The information processing device according to claim 1, characterized by further comprising: an acquisition component that acquires an image of the person obtained by an image capturing component; and a generation component that, based on the image of the person acquired by the acquisition component, generates the image of the person's face to be input to the first model and the image of the person's eyes to be input to the second model.

4. The information processing device according to claim 1, characterized in that the second processing unit inputs a reversed image, obtained by inverting the image of the person's eyes, into the second model, and estimates the person's gaze based on information obtained by inverting the gaze information output from the second model.

5. The information processing device according to claim 4, characterized in that the second processing unit changes the coefficients of the second model based on the orientation of the face obtained by reversing the orientation of the face estimated by the first processing unit.

6. An information processing method for estimating a person's line of sight, characterized by comprising: a first calculation step of using a first model to estimate the orientation of the person's face, the first model being configured to output a calculation result of the orientation of the person's face when an image of the person's face is input; and a second calculation step of using a second model to estimate the person's gaze, the second model being configured to output a calculation result of the gaze when an image of the person's eyes is input, wherein, in the second calculation step, the coefficients of the second model are changed based on the orientation of the face estimated in the first calculation step.

7. A storage medium storing a program for causing a computer to perform the steps of the information processing method according to claim 6.

8. A learning method for an information processing device used to estimate a person's gaze, the learning method being characterized by comprising: an extraction process of extracting an image of the person's face and an image of the person's eyes from an image of the person; an estimation process of causing the information processing device to estimate the person's gaze, in accordance with the information processing method according to claim 6, based on the image of the face and the image of the eyes extracted in the extraction process; an acquisition process of acquiring, as training data, information on the person's gaze at the time the image of the person was obtained; and a learning process of causing the information processing device to learn so as to reduce the deviation between the person's gaze estimated in the estimation process and the person's gaze acquired as training data in the acquisition process.

9. A storage medium storing a program for causing a computer to execute the steps of the learning method according to claim 8.