Information processing device, information processing method, and program

The information processing apparatus enhances object detection accuracy in tilted faces by estimating and correcting for face orientation using neural networks, addressing the limitations of existing technologies in handling inclined faces.

JP7880742B2 · Active · Publication Date: 2026-06-26 · CANON KK

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
CANON KK
Filing Date
2022-05-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing object detection technologies, particularly in facial organ detection using deep learning, struggle with accuracy when faces are inclined or tilted, as they are trained primarily on upright images and lack the capability to correct for face orientation.

Method used

An information processing apparatus that estimates the central position and likelihood of an object's tilt relative to a standard orientation, using neural networks to output evaluation values for multiple reference angles, allowing for accurate detection by correcting the detection angle based on the estimated tilt.

Benefits of technology

Accurately detects tilted objects in images by adjusting the detection process to account for face orientation, improving detection accuracy even when faces are not in a standard upright position.

✦ Generated by Eureka AI based on patent content.

Abstract

To provide an information processor, an information processing method, and a program that accurately detect a detection object that is tilted in an image. SOLUTION: An information processor 200 included in a camera 100 comprises: a detection object estimation unit 220 for outputting, for each of a plurality of reference angles, an evaluation value for determining whether a detection object in an image is tilted at that reference angle relative to a standard attitude of the detection object; an angle estimation unit 240 for estimating a tilt angle of the detection object in the image relative to the standard attitude on the basis of the evaluation values output for the respective reference angles; and an organ detection unit 260 for detecting the detection object from the image by processing adjusted with the estimated tilt angle. SELECTED DRAWING: Figure 4

Description

Technical Field

[0001] The present invention relates to an information processing apparatus, an information processing method, and a program.

Background Art

[0002] Object detection processing for detecting an object from an image has been applied to the functions of imaging devices such as digital cameras. Conventionally, the target of object detection processing has often been limited to the face of a person. However, in recent years, with the development of deep learning, it has become possible to detect facial organs such as a person's pupils, and this capability is built into products as a pupil detection function.

[0003] It has been found that, when training facial organ detection with deep learning, restricting the training images to those in which the person is close to upright yields higher facial organ detection accuracy. However, a facial organ detector trained in this way detects the facial organs of a nearly upright face with high accuracy, but its accuracy decreases when the inclination of the face is large. Regarding the detection of inclined faces, Patent Document 1, for example, discloses a technique for determining whether a face is front-facing or side-facing using a plurality of face orientation estimators. Further, Patent Document 2 discloses a technique for estimating the orientation of a detected face by integrating the scores of a plurality of face orientation estimators realized by machine learning.

Prior Art Documents

Patent Documents

[0004]

Patent Document 1

Patent Document 2

Summary of the Invention

Problems to be Solved by the Invention

[0005] However, the technology described in Patent Document 1 only determines whether a face is facing forward or to the side, and cannot determine in detail which direction the face is facing. Furthermore, the technology described in Patent Document 2 only calculates the orientation of a face that has already been detected, and cannot perform detection while correcting for the tilt of the face.

[0006] The present invention aims to accurately detect tilted objects in an image.

Means for Solving the Problem

[0007] To achieve the objectives of the present invention, for example, an information processing apparatus according to one embodiment has the following configuration. That is, the apparatus is characterized by comprising: first estimation means for estimating the central position of the object to be detected in the image; output means for outputting, for each of multiple reference angles within the same plane, the likelihood that the object to be detected is tilted at that reference angle relative to the standard orientation of the object to be detected; second estimation means for estimating the tilt angle of the object to be detected in the image relative to the standard orientation, based on the combination of the likelihoods corresponding to the multiple reference angles and on the central position of the object to be detected; and detection means for detecting the object to be detected by processing adjusted using the estimated tilt angle.

Effects of the Invention

[0008] The present invention makes it possible to accurately detect tilted objects in an image.

Brief Description of the Drawings

[0009]
[Figure 1] A diagram illustrating an example of the facial tilt that is the target of detection according to Embodiment 1.
[Figure 2] A block diagram showing an example of a system including an information processing device according to Embodiment 1.
[Figure 3] A block diagram showing an example of the hardware configuration of the information processing device according to Embodiment 1.
[Figure 4] A block diagram showing an example of the functional configuration of the information processing device according to Embodiment 1.
[Figure 5] A diagram for explaining input/output data by the detector according to Embodiment 1.
[Figure 6] A diagram for explaining the map output by the detector according to Embodiment 1.
[Figure 7] A diagram for explaining the tilt angle estimated by the information processing apparatus according to Embodiment 1.
[Figure 8] A flowchart showing an example of the adjusted detection process according to Embodiment 1.
[Figure 9] A block diagram showing an example of the functional configuration of the learning apparatus according to Embodiment 1.
[Figure 10] A diagram showing an example of the correct answer information and map of learning according to Embodiment 1.
[Figure 11] A diagram for explaining the map generation process from the correct answer information according to Embodiment 1.
[Figure 12] A flowchart showing an example of the learning process according to Embodiment 1.
[Figure 13] A diagram for explaining the map output by the detector according to Embodiment 1.
[Figure 14] A block diagram showing an example of the functional configuration of the information processing apparatus according to Embodiment 2.
[Figure 15] A diagram for explaining the map output by the detector according to Embodiment 2.
[Figure 16] A block diagram showing an example of the functional configuration of the learning apparatus according to Embodiment 2.
[Figure 17] A diagram showing an example of the correct answer information and map of learning according to Embodiment 2.

Mode for Carrying Out the Invention

[0010] Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Note that the following embodiments do not limit the invention according to the claims. Although a plurality of features are described in the embodiments, not all of these plurality of features are essential to the invention, and the plurality of features may be arbitrarily combined. Further, in the accompanying drawings, the same or similar configurations are denoted by the same reference numerals, and redundant descriptions are omitted.

[0011] [Embodiment 1] An information processing apparatus according to an embodiment of the present invention detects a detection target in an image. In particular, the information processing apparatus outputs an evaluation value as to whether the detection target in the image is inclined at a reference angle with respect to the standard posture for each of a plurality of reference angles. Next, the information processing apparatus estimates the inclination angle of the detection target in the image with respect to the standard posture based on the evaluation value, and detects the detection target by a process adjusted using the estimated inclination angle.

[0012] The information processing apparatus according to the present embodiment detects a detection target from an image captured by a camera, which is an imaging device. FIG. 1 is a diagram showing an example in which a face, which is the detection target according to the present embodiment, is inclined with respect to the standard posture. In the present embodiment, a vertical face with the top of the head positioned upward, as shown in (a) of FIG. 1, is used as the standard posture of the face. In (b) of FIG. 1, a face 11 in the standard posture, a face 12 inclined to the right, a face 13 with the top of the head positioned downward, and a face 14 inclined to the left are illustrated. In this example, with respect to the face 11 in the standard posture, the face 12 is rotated 90° clockwise in the plane, the face 13 is rotated 180° clockwise in the plane, and the face 14 is rotated 270° clockwise (that is, 90° counterclockwise) in the plane.

[0013] FIG. 2 is a diagram showing an example of the configuration of a system including the information processing apparatus 200 according to the present embodiment. The information processing apparatus 200 according to the present embodiment is built in the camera 100, and performs various processes on the captured image by the camera 100 to detect a detection target. Note that the information processing apparatus 200 may use, instead of the captured image by the camera 100, an image acquired from a device different from the camera 100 as a processing target, and the information processing apparatus 200 may have an imaging function and capture an image to be a processing target. Here, the image may be a still image or an image included in a video.

[0014] Figure 3 shows an example of the hardware configuration of the information processing device 200 according to this embodiment. The information processing device 200 includes a processing unit 101, a storage unit 102, an input unit 103, an output unit 104, and a communication unit 105.

[0015] The processing unit 101 controls the operation of the information processing device 200 by executing programs stored in the storage unit 102. The processing unit 101 is, for example, a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The storage unit 102 is a storage device such as a magnetic memory device or semiconductor memory, and stores programs that are read based on the operation of the processing unit 101, or data to be stored for a long time. In this embodiment, the processing unit 101 reads the programs stored in the storage unit 102 and processes them, thereby executing the processes described below, including various processes performed by the information processing device 200. The storage unit 102 may also store images captured by the camera 100 according to this embodiment, and the processing results of those captured images.

[0016] The input unit 103 is a mouse and keyboard, a touch panel, or buttons, and acquires various inputs from the user. The output unit 104 is a liquid crystal panel or an external monitor, and outputs various information. In this embodiment, the output unit 104 is a liquid crystal panel, and the touch panel, which is the input unit 103, is mounted on the output unit 104. By using such an input unit 103 and output unit 104, the user can perform input operations via the touch panel while checking the image displayed on the liquid crystal panel.

[0017] The communication unit 105 communicates with other devices via wired or wireless communication. Furthermore, each functional unit shown in Figure 3 is connected via a system bus, enabling communication and the transmission and reception of various types of information according to the processing requirements.

[0018] The imaging unit (not shown) of the camera 100 according to this embodiment consists of a lens, aperture, image sensor, A / D converter for converting analog signals to digital signals, aperture control unit, and focus control unit. The image sensor is composed of a CCD or CMOS, etc., and converts the optical image of the subject into an electrical signal.

[0019] The overall system configuration is not limited to the examples described above. For example, the camera 100 may perform various processes that the information processing device 200 would normally perform. Also, for example, the learning device 300 may be the same device as the camera 100 or the information processing device 200. Furthermore, the camera 100 may be equipped with I / O devices for communication between various devices. Here, the I / O devices are, for example, input / output units such as memory cards and USB cables, or transmitting and receiving units such as wired or wireless devices.

[0020] Figure 4 is a block diagram showing an example of the functional configuration of an information processing device 200 and a camera 100 equipped with the information processing device 200. The information processing device 200 according to this embodiment includes an image acquisition unit 210, a detection target estimation unit 220, a center position calculation unit 230, and an angle estimation unit 240. The detection target estimation unit 220 also includes a center position estimation unit 221 and a direction estimation unit 222. The camera 100 includes an angle correction unit 250, an organ detection unit 260, and an AF processing unit 270.

[0021] The image acquisition unit 210 acquires images included in the time-series video captured by the imaging unit of the camera 100. In the following, 1600 x 1200 pixel image data will be treated as "images," but the size and format of the images are not particularly limited as long as the processing described below is possible. In this embodiment, the image acquisition unit 210 acquires images in real time (60 frames per second).

[0022] The detection target estimation unit 220 outputs, for each of several reference angles, an evaluation value indicating whether the detection target in the image is tilted at that reference angle relative to the standard posture. Here, the reference angles used are 0° (upward), 90° (rightward), 180° (downward), and 270° (or -90°) (leftward), as shown in Figure 1(b). For this purpose, the center position estimation unit 221 outputs a center feature map as a map showing the likelihood of the center position of the detection target for each position in the image. In addition, the direction estimation unit 222 outputs a direction feature map as a map showing the evaluation value of whether the detection target is tilted at a reference angle for each position in the image. Each map will be explained later. In the following explanation, the detection target will be assumed to be a human face.

[0023] The detection target estimation unit 220 according to this embodiment uses a neural network (NN) to extract features from an image. Figure 5 is a schematic diagram of the output that the NN of the detection target estimation unit 220 produces for an input image. In this embodiment, the NN has a hierarchical structure in which multiple modules, each consisting of layers such as convolutional layers, activation layers, pooling layers, and normalization layers, are linked together. Here, these modules are collectively referred to as the feature extraction layer 410. The fully connected layer 420 takes the intermediate features output from the feature extraction layer 410 as input and outputs a feature map 440 (output layer 430). Note that the processing in the NN is basically the same as that performed by general techniques, so a detailed explanation is omitted.

[0024] Feature map 440 includes a central feature map, the face central feature map 450, and a directional feature map, the face orientation feature map 460. Face orientation feature map 460 includes an upward feature map 461, a rightward feature map 462, a downward feature map 463, and a leftward feature map 464, as directional feature maps corresponding to each of the reference angles.

[0025] Feature map 440 is two-dimensional matrix data corresponding to the input image 400. Face center feature map 450 shows, for each location, the likelihood that the location is the center position of the person's face on the input image 400. Face orientation feature map 460 shows, for each location, the likelihood of the face being tilted at a reference angle. The size of this matrix data may be the same as the number of pixels in the input image 400, or may be enlarged or reduced. Hereafter, "center position" by itself refers to the center position of the person's face.

[0026] In this embodiment, the feature map 440 is assumed to be a 320 x 240 map, scaled down to 1/5 of the input image in both width and height, and the data for each position is represented in the range of 0 to 1. That is, in the face center feature map 450, positions with a higher probability of being the center of the face will have higher values, showing values close to 1. Similarly, in the face orientation feature map 460, positions with a higher probability of being tilted at a reference angle will have higher values, showing values close to 1. In this embodiment, the face center feature map 450 and the face orientation feature map 460 are assumed to be the same size, but their sizes may be different, in which case the corresponding positions may be processed as described below.

[0027] Figure 6 is a diagram illustrating the values of each element in the feature map 440. In the example in Figure 6, the top of the person's head is pointing diagonally upwards to the right in the input image 400, so in the upward feature map 461 and the rightward feature map 462, the elements corresponding to the area of the face show values close to 1. In each of the feature maps 440 in Figure 6, the elements that do not correspond to the area of the face (background) show values close to 0, and in this example, these are represented by leaving the values blank.

[0028] The center position calculation unit 230 calculates the image coordinate values of the center position of the face in the image from the face center feature map 450 output by the detection target estimation unit 220. The center position calculation unit 230 can use the position where the value peaks among the elements of the face center feature map 450 (the position showing the value "0.9" in the example of Figure 6) as the center position element, and use the corresponding coordinates in the input image 400 as the center position of the face. For example, if the element of the face center position in the face center feature map 450 is (180,100), then the coordinates of the center position in the input image 400 will be (900,500). Note that this process is just one example, and any other known technique, such as subpixel estimation, may be used as long as it can estimate the center position of the detection target.

[0029] The center position calculation unit 230 may use an element that exceeds a predetermined threshold as the center position, or it may use an element that exceeds the predetermined threshold and is at a peak as the center position. If there are multiple elements that exceed the predetermined threshold or multiple elements that are at a peak, multiple faces will be detected, but in the following explanation, only one face will be used as the processing target. If multiple faces are detected, each of those faces may be processed in the same manner.
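The peak search and the 1/5 scaling described above can be illustrated with a minimal sketch. The function name `find_face_center`, the threshold of 0.5, and the reading of the element (180,100) as (column, row) are assumptions made for illustration only; the embodiment does not specify them.

```python
def find_face_center(center_map, scale=5, threshold=0.5):
    """Return (x, y) in input-image pixels for the highest element of the
    face center feature map that exceeds the threshold, or None.

    center_map is a list of rows; threshold=0.5 is a hypothetical value.
    """
    best_value = threshold
    best_pos = None
    for row_idx, row in enumerate(center_map):
        for col_idx, value in enumerate(row):
            if value > best_value:
                best_value = value
                best_pos = (col_idx, row_idx)
    if best_pos is None:
        return None
    col, row = best_pos
    # The map is 1/5 of the input image, so scale back up by 5.
    return (col * scale, row * scale)


# Toy 320 x 240 map (240 rows of 320 columns) with one peak at element (180, 100).
center_map = [[0.0] * 320 for _ in range(240)]
center_map[100][180] = 0.9
center = find_face_center(center_map)  # (900, 500)
```

This sketch returns a single face; handling multiple peaks, or subpixel estimation, would need additional logic.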

[0030] The angle estimation unit 240 estimates the tilt angle (face orientation angle) of the face in the image relative to the standard posture, based on the face orientation feature map 460 and the center position calculated by the center position calculation unit 230. In this embodiment, the face orientation feature map 460 outputs evaluation values for each reference angle, and the estimated face orientation angle can be calculated based on these evaluation values. Hereinafter, the evaluation values for the reference angles will simply be referred to as evaluation values.

[0031] Next, the evaluation values described above will be explained. In this embodiment, the center position calculation unit 230 calculates evaluation values from the element corresponding to the center position in the face orientation feature map 460. Here, the center position calculation unit 230 can use the average of the element corresponding to the center position and the eight elements adjacent to that element as the evaluation value. The evaluation values for up, right, down, and left calculated from the face orientation feature maps 461 to 464 in Figure 6 are (up, right, down, left) = (0.9, 0.7, 0.1, 0.1). The method of calculating the evaluation values is not limited to this; for example, the average of elements within a predetermined range from the center position, such as the four neighboring pixels or twelve neighboring pixels of the element corresponding to the center position, or only the element corresponding to the center position, may be used as the evaluation value.
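As a minimal sketch of the nine-element averaging described above: the function name `evaluation_value` and the clipping of the neighborhood at the map border are assumptions, since the embodiment does not state how edge positions are handled.

```python
def evaluation_value(direction_map, col, row):
    """Average the element at (col, row) and its eight neighbours.

    direction_map is a list of rows. Neighbours that would fall outside
    the map are skipped (an assumption; the source does not cover borders).
    """
    height = len(direction_map)
    width = len(direction_map[0])
    values = []
    for r in range(max(0, row - 1), min(height, row + 2)):
        for c in range(max(0, col - 1), min(width, col + 2)):
            values.append(direction_map[r][c])
    return sum(values) / len(values)


# Toy 5 x 5 "upward" map whose central 3 x 3 block holds 0.9.
up_map = [[0.0] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        up_map[r][c] = 0.9
up_value = evaluation_value(up_map, 2, 2)  # close to 0.9
```

Applying the same function to each of the four directional feature maps at the center position yields one evaluation value per reference angle.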

[0032] As described above, the angle estimation unit 240 estimates the face orientation angle based on the evaluation values. The angle estimation unit 240 may also calculate a vector representing the estimated face orientation angle by combining the evaluation values for up, down, left, and right, for example, using them as coefficients for the unit vectors of the up, down, left, and right directions. The calculation of the combined vector using the feature map shown in Figure 6 will be explained with reference to Figure 7. Figure 7(a) is a diagram showing the vectors in the four directions (up, down, left, and right) based on the evaluation values calculated from the face orientation feature map 460. The upward vector 471, the rightward vector 472, the downward vector 473, and the leftward vector 474 have lengths of 0.9, 0.7, 0.1, and 0.1, respectively. The combined vector obtained by combining these vectors is shown in Figure 7(b). The length of the combined upward vector 481 is determined to be 0.8 from the difference between the upward vector 471 and the downward vector 473, and the length of the combined rightward vector 482 is determined to be 0.6 from the difference between the rightward vector 472 and the leftward vector 474. Therefore, the combined vector 483 gives the direction of the face, and the face orientation angle is calculated as angle 484 (approximately 37° in the example in Figure 7, since the arctangent of 0.6 / 0.8 is about 36.9°).
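The vector combination can be sketched as follows. The function name `face_orientation_angle` is hypothetical, and the angle is assumed to be measured clockwise from the upward direction, matching the reference angles of 0°, 90°, 180°, and 270°. For evaluation values of (0.9, 0.7, 0.1, 0.1), the combined components are 0.8 upward and 0.6 rightward, so the angle is atan2(0.6, 0.8), about 36.9°.

```python
import math


def face_orientation_angle(up, right, down, left):
    """Combine four directional evaluation values into one angle in degrees,
    measured clockwise from 'up' and wrapped into [0, 360)."""
    vertical = up - down        # combined upward component
    horizontal = right - left   # combined rightward component
    # atan2(horizontal, vertical) is 0 for straight up and 90 for straight right.
    angle = math.degrees(math.atan2(horizontal, vertical))
    return angle % 360.0


angle = face_orientation_angle(0.9, 0.7, 0.1, 0.1)  # about 36.87
```

Using `atan2` rather than a plain arctangent of the ratio keeps the result well defined in all four quadrants, including faces that point downward or to the left.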

[0033] As described above, in this embodiment, the values calculated from the likelihood shown in the face orientation feature map were used as evaluation values for each reference angle direction, and the face orientation angle was estimated by combining vectors using these evaluation values. However, the method for estimating the face orientation angle is not particularly limited as long as the angle can be estimated based on the face orientation feature map. For example, the angle estimation unit 240 may use, as the face orientation angle, the weighted average of the angles of the four directions of the face orientation feature map (0°, 90°, 180°, 270°), using the values of the center position element as the weights and taking the remainder when divided by 360°. Alternatively, the angle estimation unit 240 may use the direction with the highest evaluation value among the directions as the face orientation angle.
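The simplest alternative mentioned above, taking the direction with the highest evaluation value, can be sketched as follows. The dictionary layout and the name `dominant_reference_angle` are assumptions made for illustration.

```python
# Reference angles in degrees, clockwise from the standard (upward) posture.
REFERENCE_ANGLES = {"up": 0, "right": 90, "down": 180, "left": 270}


def dominant_reference_angle(evaluations):
    """Return the reference angle whose evaluation value is highest.

    evaluations maps each direction name to its evaluation value.
    """
    best_direction = max(evaluations, key=evaluations.get)
    return REFERENCE_ANGLES[best_direction]


coarse_angle = dominant_reference_angle(
    {"up": 0.9, "right": 0.7, "down": 0.1, "left": 0.1}
)  # 0
```

This variant resolves the angle only to the nearest reference direction, so it trades angular precision for simplicity.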

[0034] Furthermore, although this embodiment has been described assuming the existence of four directional feature maps (for four directions), various processes may be performed using a different number of directional feature maps, such as two directional feature maps.

[0035] The organ detection unit 260 detects faces by performing processing adjusted using the tilt angle of the detection target (face) in the image relative to the standard posture, as estimated by the angle estimation unit 240. For example, the organ detection unit 260 may detect the detection target as it would appear after being rotated back by a tilt angle equal to the estimated face orientation angle. Here, the organ detection unit 260 can do this by correcting the detection angle of the detector by the face orientation angle and then detecting a face from the image. The organ detection unit 260 is composed of a neural network and has been trained using images that include detection targets at an angle close to upright (standard posture). Therefore, by rotating and correcting the angle of the detector based on the face orientation angle, it is possible to detect a detection target that is not in the standard posture with the same accuracy as a detection target in the standard posture. Alternatively, for example, the organ detection unit 260 may rotate the image by the face orientation angle and then detect a face from the rotated image.
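The geometry behind the rotation-based correction can be illustrated at the level of individual coordinates. The following is a sketch under stated assumptions, not the detector correction of the embodiment: image coordinates are taken with x to the right and y downward, the rotation is about the face center, and all function names are hypothetical.

```python
import math


def rotate_clockwise(point, center, angle_deg):
    """Rotate point about center by angle_deg clockwise as seen on screen,
    in image coordinates (x to the right, y downward)."""
    theta = math.radians(angle_deg)
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    x = center[0] + dx * math.cos(theta) - dy * math.sin(theta)
    y = center[1] + dx * math.sin(theta) + dy * math.cos(theta)
    return (x, y)


def to_upright_frame(point, face_center, face_angle_deg):
    """Undo a clockwise tilt of face_angle_deg so the face appears upright."""
    return rotate_clockwise(point, face_center, -face_angle_deg)


def to_image_frame(point, face_center, face_angle_deg):
    """Map a point found in the upright frame back to the original image."""
    return rotate_clockwise(point, face_center, face_angle_deg)


# A face tilted 90 degrees clockwise has the top of its head to the right
# of the center; after correction that point lies above the center.
head_top = to_upright_frame((910, 500), (900, 500), 90)  # about (900, 490)
```

A position detected in the upright frame, such as a pupil, would then be mapped back with `to_image_frame` before being used in the original image.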

[0036] In this embodiment, the organ detection unit 260 detects a face as the detection target using a detector whose detection angle has been corrected by the face orientation angle. Here, the detection method is not particularly limited as long as it can detect a person's face. For example, the organ detection unit 260 may detect a face by detecting a person's pupils, or it may detect a face by detecting other facial features such as the nose, mouth, or ears. If the detection target is a vehicle such as an automobile, the organ detection unit 260 may detect the target by detecting a part of the vehicle, such as the headlights.

[0037] The AF processing unit 270 performs autofocus (AF) processing so as to focus on the eyes of the person detected by the organ detection unit 260. Since the AF processing can be performed using known techniques, a detailed explanation is omitted.

[0038] Figure 8 is a flowchart illustrating an example of the process performed by the information processing device 200 according to this embodiment, which estimates the face orientation angle of the target to be detected in the captured image and uses the estimated face orientation angle to detect the target. Note that this flowchart is just one example, and the information processing device 200 does not need to perform all of the processes described below.

[0039] In S501, the image acquisition unit 210 acquires an image captured by the camera 100. In this embodiment, the image captured by the camera 100 is assumed to be bitmap data represented by RGB 8 bits. In S502, the detection target estimation unit 220 outputs a face center feature map (center feature map) and a face orientation feature map (direction feature map) from the image acquired in S501.

[0040] In S503, the center position calculation unit 230 calculates the coordinates of the center position of the person's face in the captured image from the face center feature map output in S502. In S504, the angle estimation unit 240 estimates the face orientation angle based on the face orientation feature map and the center position of the face.

[0041] In S505, the angle correction unit 250 corrects the detection angle of the detector of the organ detection unit 260 by the estimated face orientation angle. In S506, the organ detection unit 260 detects a face from the captured image using the detector with the corrected detection angle. In S507, the AF processing unit 270 performs AF processing to focus on the pupil of the detected face.

[0042] In S508, the information processing device 200 determines whether or not to continue the operation of the camera 100. Here, if the user has performed an operation to stop imaging, such as turning off the imaging function of the camera 100, the camera's operation will be stopped; otherwise, the camera's operation will continue. If the camera's operation is to be continued, the process returns to S501; otherwise, the process ends.
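The flow of S501 to S508 can be summarized as a loop. In this sketch each processing unit is passed in as a callable, and all parameter names are placeholders rather than interfaces defined by the embodiment.

```python
def run_detection_loop(acquire_image, estimate_maps, find_center,
                       estimate_angle, detect_with_angle, autofocus,
                       should_continue):
    """Run the per-frame flow until should_continue() returns False.

    Returns the number of frames processed.
    """
    frames = 0
    while True:
        image = acquire_image()                             # S501
        center_map, direction_maps = estimate_maps(image)   # S502
        center = find_center(center_map)                    # S503
        angle = estimate_angle(direction_maps, center)      # S504
        detection = detect_with_angle(image, angle)         # S505 and S506
        autofocus(detection)                                # S507
        frames += 1
        if not should_continue():                           # S508
            return frames


# Stub units standing in for the real ones, for illustration only.
remaining = [True, True, False]
focus_log = []
frames = run_detection_loop(
    acquire_image=lambda: "frame",
    estimate_maps=lambda image: ("center_map", "direction_maps"),
    find_center=lambda center_map: (900, 500),
    estimate_angle=lambda direction_maps, center: 37.0,
    detect_with_angle=lambda image, angle: {"angle": angle},
    autofocus=focus_log.append,
    should_continue=lambda: remaining.pop(0),
)
```

With these stubs the loop processes three frames before the stop condition is reached.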

[0043] With this configuration, an evaluation value is output to determine whether the object to be detected in the image is tilted at a reference angle relative to the standard orientation, and the tilt of the object relative to the standard orientation is estimated based on the output evaluation value. Then, the object to be detected can be detected by processing adjusted based on the estimated tilt. Therefore, by considering the tilt of the object to be detected in the image, detection accuracy can be improved with simple processing.

[0044] In this embodiment, the evaluation value was calculated from elements in the face orientation feature map near the position designated as the center position, by referring to the face center feature map. However, this limitation is not necessary if the evaluation value can be calculated from the elements in the face orientation feature map corresponding to the position of the detected target, and the face center feature map is not essential. For example, the face center feature map may not be used, the face position may be obtained by a different means, and the evaluation value may be calculated from the elements in the face orientation feature map corresponding to the face position.

[0045] [Learning Method] Next, a learning method that enables the information processing device 200 according to this embodiment to output the face center feature map and the face orientation feature map from an input image will be described. The learning device 300 shown in Figure 9 includes a learning data storage unit 310, a learning data acquisition unit 320, an image acquisition unit 330, a detection target estimation unit 340, a training data creation unit 350, a position error calculation unit 360, a direction error calculation unit 370, and a learning unit 380.

[0046] The learning data storage unit 310 stores learning data for the learning device 300 to perform learning. Here, the learning data includes a pair of a learning image and correct information about the faces of people in that image. The correct information includes the coordinates of the center position of the face in the image and the angle of the face's orientation, and may also include other information such as the size of the face (size on the image). The learning data storage unit 310 may store a sufficient number of learning data for learning, and may also be able to acquire learning data from an external device. The learning data acquisition unit 320 acquires the learning data stored in the learning data storage unit 310 as the target of processing in the learning process.

[0047] The image acquisition unit 330 acquires the images included in the learning data that the learning data acquisition unit 320 has selected for processing. The detection target estimation unit 340 takes the images acquired by the image acquisition unit 330 as input and outputs a face center feature map and a face orientation feature map by processing them in the same way as the detection target estimation unit 220 in Figure 4. The detection target estimation unit 340 has basically the same configuration as the detection target estimation unit 220 and can perform the same processing, so redundant explanations are omitted.

[0048] The training data creation unit 350 creates a face center target map and a face orientation target map as training data that will serve as target values for learning, from the correct answer information contained in the learning data acquired by the learning data acquisition unit 320. The face center target map and the face orientation target map will be explained below, along with examples of how to create these maps. Here, it is assumed that the images acquired by the image acquisition unit 330 are 1600 x 1200 pixel images, the same as the images acquired by the image acquisition unit 210. In the following, when the face center target map and the face orientation target map need not be distinguished, they will be referred to simply as "target maps".

[0049] The face center target map is matrix data of the same size as the face center feature map and contains information on the correct face center position. In this embodiment, the face center feature map is 320 x 240 pixels, which is 1/5 the size of the input image in both width and height. Therefore, the face center coordinates and face size on the face center target map are also 1/5 of their values in the input image. The face orientation target map is matrix data of the same size as the face orientation feature map (i.e., the same size as the face center target map in this embodiment) and contains information on the correct face orientation angle. Figure 10 is a diagram illustrating an example of a learning image, the correct information for that image, and the training data generated from that image according to this embodiment.

[0050] Figure 10(a) shows the learning image, Figure 10(b) shows its ground truth information, and Figure 10(c) shows the ground truth information converted to the coordinates of the face center target map and face orientation target map. In the ground truth information in Figure 10(b), the coordinates of the face center position are (X,Y)=(900,500), the size (here, assumed to be the width in the X-axis direction) is 600, and the face orientation angle is 37°. In the ground truth information on the map in Figure 10(c), the coordinates of the face center position are (X,Y)=(180,100), the size is 120, and the face orientation angle is 37°.
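The conversion from Figure 10(b) to Figure 10(c) can be sketched as follows. This is a minimal illustration, not the device's implementation: the class `FaceTruth`, the function `to_map_coords`, and the parameter `stride` are hypothetical names introduced here, and the factor of 5 is taken from the 1600 x 1200 image and 320 x 240 map sizes stated above.

```python
from dataclasses import dataclass


@dataclass
class FaceTruth:
    x: float          # face center X coordinate
    y: float          # face center Y coordinate
    size: float       # face size (width in the X-axis direction)
    angle_deg: float  # face orientation angle in degrees


def to_map_coords(truth: FaceTruth, stride: int = 5) -> FaceTruth:
    """Scale position and size to map resolution; the angle is unchanged."""
    return FaceTruth(truth.x / stride, truth.y / stride,
                     truth.size / stride, truth.angle_deg)


image_truth = FaceTruth(x=900, y=500, size=600, angle_deg=37)
map_truth = to_map_coords(image_truth)
print(map_truth)
```

Only the position and the size are scaled; the face orientation angle does not depend on the map resolution.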

[0051] The face center target map 620 shown in Figure 10(d) is a map in which a positive example is labeled at the face center position (180, 100). The face center target map 620 is labeled with a heatmap covering a circular region centered at the face center position, with a diameter of 120, the same as the face size. Here, each element of the target map has a value in the range of 0 to 1, like the elements of the feature map. The element corresponding to the center position is set to 1, and the value gradually decreases from the center position towards the circumference of the heatmap. In Figure 10(d), the element at the center position of the target map is set to 1.0, the adjacent elements above, below, left, and right are set to 0.8, and the elements adjacent to the 0.8 elements (excluding the center position) are set to 0.4. In this embodiment, elements outside the heatmap are set to Void, a label that holds an empty value so that the element does not contribute to learning.
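The labeling pattern illustrated in Figure 10(d) can be sketched on a small grid. This is only an illustration under stated assumptions: `VOID`, `make_center_target`, and `falloff` are hypothetical names, Void is represented here by NaN, "adjacent" is interpreted as 4-neighbour adjacency, and the falloff values 1.0, 0.8, 0.4 are the example values from the figure rather than a general formula (the source does not specify one).

```python
import math

VOID = math.nan  # assumed stand-in for the Void label (ignored during learning)


def make_center_target(width, height, cx, cy, falloff=(1.0, 0.8, 0.4)):
    """Return a height x width grid with falloff values near (cx, cy), VOID elsewhere."""
    grid = [[VOID] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            distance = abs(x - cx) + abs(y - cy)  # 4-neighbour (Manhattan) distance
            if distance < len(falloff):
                grid[y][x] = falloff[distance]
    return grid


target = make_center_target(width=7, height=7, cx=3, cy=3)
for row in target:
    print(["void" if math.isnan(v) else v for v in row])
```

In an actual map the labeled region would span the full face diameter; the three-step falloff here only mirrors the simplified values shown in the figure.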

[0052] Next, the method for creating the face orientation target map will be explained with reference to Figure 11. As shown in Figure 11(b), the face orientation target map 630 includes an upward target map 631, a rightward target map 632, a downward target map 633, and a leftward target map 634. In each of the face orientation target maps 631 to 634, a bounding box is placed whose center is the center position of the face and whose side length is the face size value. Each bounding box is assigned one of three labels: positive example, negative example, or Void. The values set for each element within the bounding box for each label will be described later. Figure 11(a) shows label criteria 641 to 644, which indicate the criteria for deciding how to label each of the face orientation target maps 631 to 634.

[0053] In the label criteria (upward label criteria) 641 of the upward target map 631, angles between -45° and 45° from the standard orientation are positive examples, angles between -90° and -45° and between 45° and 90° are Void, and all other angles are negative examples. While the Void range is not mandatory, including a Void range between the positive and negative ranges helps avoid instability in learning near the boundary between positive and negative examples. Note that the ranges used for classification here are just examples. Angles for which the absolute difference between the tilt angle and the reference angle |θ-θs| is small can be treated as positive examples, angles for which this difference is larger than in the positive range can be treated as Void, and angles for which it is larger than in the Void range can be treated as negative examples.

[0054] Here, as shown in Figure 10(c), the face orientation angle of the ground truth information is 37°, so the upward target map 631 is labeled as a positive example by referring to the label criterion 641. The training data creation unit 350 sets each element within the bounding box of a face orientation target map labeled as a positive example to the cosine value cos(θ-θs). In this embodiment, θ is the face orientation angle of the ground truth information, and θs is the reference angle in that face orientation target map (i.e., in the corresponding face orientation feature map). In the example in Figure 11, the value of θs is 0° for the upward target map 631, 90° for the rightward target map 632, 180° for the downward target map 633, and 270° for the leftward target map 634. Therefore, the value of each element within the bounding box in the upward target map 631 is cos37°. Here, the value of each element is rounded at the second decimal place, so cos37° (approximately 0.799) is set to 0.8, but the values are not limited to this. Also, although the training data creation unit 350 uses cos(θ-θs) for the elements within the bounding box of a face orientation target map labeled as a positive example, other values may be used as long as they indicate a positive example, such as uniformly setting the elements to 1.0. Furthermore, the training data creation unit 350 sets the elements within the bounding box of a face orientation target map labeled as a negative example to 0, and the elements within the bounding box of a face orientation target map labeled as Void to empty values.
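The labeling rule of paragraphs [0053] and [0054] can be sketched as a function of the ground truth angle θ and a reference angle θs. This is a hedged sketch: the function names are hypothetical, the thresholds 45° and 90° are the example ranges from the label criteria, the treatment of the exact boundary values (inclusive here) and the wrap-around of the angular difference at 360° are assumptions not stated in the source, and Void is represented by `None`.

```python
import math


def angular_difference(theta, theta_s):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(theta - theta_s) % 360
    return 360 - d if d > 180 else d


def orientation_target(theta, theta_s, pos_limit=45, void_limit=90):
    """Return (label, value) for one face orientation target map."""
    d = angular_difference(theta, theta_s)
    if d <= pos_limit:
        # Positive example: cosine of the difference, rounded to one decimal.
        return "positive", round(math.cos(math.radians(d)), 1)
    if d <= void_limit:
        return "void", None  # empty value, not used for learning
    return "negative", 0.0


for name, theta_s in [("up", 0), ("right", 90), ("down", 180), ("left", 270)]:
    print(name, orientation_target(37, theta_s))
```

Under these assumptions, a ground truth angle of 37° yields a positive example with value 0.8 for the upward map, Void for the rightward map (difference of 53°), and negative examples for the downward and leftward maps.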

[0055] The position error calculation unit 360 calculates the center position error, which is the error between the face center feature map output by the detection target estimation unit 340 and the face center target map created by the training data creation unit 350. For Void elements, the error is treated as 0. The direction error calculation unit 370 calculates the direction error, which is the error between the face orientation feature map output by the detection target estimation unit 340 and the face orientation target map created by the training data creation unit 350. Void elements are handled in the same way as in the position error calculation unit 360.
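The handling of Void elements in the error calculation can be sketched as a masked sum. The source does not specify the error function, so the squared error used here is an assumption, as are the function name `masked_squared_error` and the use of NaN to represent Void.

```python
import math


def masked_squared_error(output, target):
    """Sum of squared errors over non-Void elements; Void (NaN) contributes 0."""
    total = 0.0
    for out_row, tgt_row in zip(output, target):
        for out_value, tgt_value in zip(out_row, tgt_row):
            if math.isnan(tgt_value):
                continue  # Void element: error treated as 0
            total += (out_value - tgt_value) ** 2
    return total


feature_map = [[0.5, 0.9],
               [0.2, 0.0]]
target_map = [[math.nan, 1.0],
              [0.0, math.nan]]
error = masked_squared_error(feature_map, target_map)
print(error)
```

Because Void elements are skipped, the network output at those positions has no influence on the parameter update.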

[0056] The learning unit 380 learns (updates) the parameters of the detection target estimation unit 340 so as to reduce the center position error and direction error. The learning process can be carried out in the same way as general learning processes, so a detailed explanation is omitted.

[0057] Figure 12 is a flowchart showing an example of the learning process performed by the learning device 300 according to this embodiment. In S701, the learning data acquisition unit 320 acquires the learning data stored in the learning data storage unit 310. In S702, the image acquisition unit 330 acquires the learning images included in the learning data. In S703, the detection target estimation unit 340 outputs a face center feature map and a face orientation feature map from the learning images.

[0058] In S704, the training data creation unit 350 creates a face center target map and a face orientation target map from the correct information contained in the learning data. In S705, the position error calculation unit 360 calculates the center position error, which is the error between the output face center feature map and the created face center target map. In S706, the direction error calculation unit 370 calculates the direction error, which is the error between the output face orientation feature map and the created face orientation target map. In S707, the learning unit 380 learns the parameters of the detection target estimation unit 340 so that the center position error and direction error are reduced.

[0059] In S708, the learning unit 380 determines whether to continue learning. If learning is to be continued, the process returns to S701; otherwise, the process terminates. The learning unit 380 may decide to terminate learning, for example, when a predetermined number of learning sessions or a set learning time has been completed, or it may set other criteria for whether or not to continue learning.
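The control flow of Figure 12 can be sketched as a loop with a termination check. This is a schematic only: `run_learning` and its parameters are hypothetical, the callable `step` stands in for one pass through S701 to S707, and the two stopping criteria shown (iteration count and elapsed time) are the examples mentioned for S708.

```python
import time


def run_learning(step, max_iterations=None, max_seconds=None):
    """Repeat step() (S701 to S707) until a stopping criterion is met (S708)."""
    if max_iterations is None and max_seconds is None:
        raise ValueError("at least one stopping criterion is required")
    start = time.monotonic()
    iterations = 0
    while True:
        step()
        iterations += 1
        if max_iterations is not None and iterations >= max_iterations:
            break
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
    return iterations


calls = []
completed = run_learning(lambda: calls.append(1), max_iterations=3)
print(completed)
```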

[0060] In this embodiment, the detection target estimation unit 340 performs estimation using images acquired by the image acquisition unit 330 as input. However, the image acquisition unit 330 may also perform data augmentation of the learning images. For example, if the learning data contains few or no images of people facing a specific direction, face images can be rotated to create inputs of faces facing that direction, thereby enabling more comprehensive learning and improving the accuracy of face orientation estimation. Furthermore, robustness can sometimes be improved by scaling images, adding noise, or changing the brightness or color of the images. When performing data augmentation involving geometric transformations, such as image rotation or scaling, the correct information in the learning data must also be transformed to correspond to those geometric transformations.
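The requirement that the correct information follow a geometric transformation can be sketched for rotation. The following assumes image coordinates with X to the right and Y downward, a rotation about a pivot such as the image center, and the angle convention implied by the reference angles (0° up, 90° right), so that a positive rotation turns "up" toward "right". The function name `rotate_truth` and these conventions are assumptions for illustration.

```python
import math


def rotate_truth(center, angle_deg, phi_deg, pivot):
    """Rotate a face center and orientation angle by phi_deg about pivot."""
    phi = math.radians(phi_deg)
    dx, dy = center[0] - pivot[0], center[1] - pivot[1]
    new_x = pivot[0] + dx * math.cos(phi) - dy * math.sin(phi)
    new_y = pivot[1] + dx * math.sin(phi) + dy * math.cos(phi)
    new_angle = (angle_deg + phi_deg) % 360  # face size is unchanged by rotation
    return (new_x, new_y), new_angle


# Ground truth of Figure 10(b), rotated by 90 degrees about the image center.
new_center, new_angle = rotate_truth((900, 500), 37, 90, pivot=(800, 600))
print(new_center, new_angle)
```

The rotated image itself would be produced by an image library with the same angle and pivot; only the label transformation is shown here.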

[0061] In this embodiment, the information processing device 200 estimated the face orientation angle of a face tilted by in-plane rotation relative to a standard posture. However, the information processing device 200 may also estimate the three-dimensional face tilt angle relative to the standard posture due to rotation around the pitch axis or yaw axis, in addition to in-plane rotation (rotation around the roll axis), and perform detection of the target object by processing adjusted using the estimated tilt angle. In other words, as described above, the information processing device 200 can estimate the face orientation angle by considering not only the angle of in-plane rotation but also the rotation angle around the pitch axis or yaw axis as the face tilt angle.

[0062] Figure 13 shows an example of a feature map 800, which includes a face center feature map 810 and a face orientation feature map 820, output by the information processing device 200 according to this embodiment. The face orientation feature map 820 includes a roll axis head direction map 830, a pitch axis head direction map 840, and a yaw axis head direction map 850, as head direction maps corresponding to the roll axis, pitch axis, and yaw axis, respectively. Furthermore, each of the head direction maps 830 to 850 includes maps for four directions. The face center feature map 810 is a map similar to the face center feature map 450 in Figure 6.

[0063] The roll axis head orientation map 830 is similar to the face orientation feature map 460, and includes maps 831-834, where the reference angles of face orientation correspond to up, down, left, and right, respectively.

[0064] The pitch axis head direction map 840 includes a map 841 for the case where the face is facing forward, a map 842 for the case where the face is facing the zenith, a map 843 for the case where the face is facing backward, and a map 844 for the case where the face is facing the ground. The yaw axis head direction map 850 includes a map 851 for the case where the face is facing forward, a map 852 for the case where the face is facing sideways to the right, a map 853 for the case where the face is facing backward, and a map 854 for the case where the face is facing sideways to the left. In other words, the face orientation feature map 820 includes 12 maps in total, compared with the 4 maps included in the face orientation feature map 460 shown in Figure 6.

[0065] The information processing device 200 can output the roll axis head direction map 830 using the same processing as described for the face orientation feature map in Embodiment 1. Furthermore, the information processing device 200 can output the pitch axis head direction map 840 and the yaw axis head direction map 850 as separate planar coordinate systems using the same processing as for the roll axis head direction map 830, and can calculate the face orientation angle from each. In this way, the information processing device 200 can estimate the tilt angle relative to the standard posture of the detected object even in a three-dimensional coordinate system.

[0066] The learning device 300 can prepare target maps for the head direction of the roll axis, pitch axis, and yaw axis, and perform learning. This process can be achieved by performing the learning process for the roll axis, as explained with reference to Figures 10 to 12, on the pitch axis and yaw axis as well. Through this process, it becomes possible to estimate the three-dimensional tilt angle of the object to be detected in the image, and then perform detection after correcting for the estimated tilt angle.

[0067] [Embodiment 2] The information processing device according to Embodiment 1 outputs an evaluation value for whether or not the detected object in the image is tilted at a reference angle relative to the standard posture, using a face center feature map and a face orientation feature map. In addition to the face center feature map and face orientation feature map, the information processing device according to this embodiment outputs the above-mentioned evaluation value using a size feature map that estimates and outputs the size of the detected object, and uses the output evaluation value to estimate the face orientation angle.

[0068] Figure 14 shows an example of the functional configuration of the information processing device 900 according to this embodiment. The information processing device 900 has the same configuration as the information processing device 200 of Embodiment 1, except that it has a detection target estimation unit 910 instead of a detection target estimation unit 220, and additionally has a size calculation unit 920 and a box generation unit 930.

[0069] The detection target estimation unit 910 has a size estimation unit 911 and outputs a size feature map in addition to the processing performed by the detection target estimation unit 220. Figure 15 shows an example of a feature map 1000 that includes a size feature map 1020 in addition to a face center feature map 1010 and a face orientation feature map 1030, which are output by the detection target estimation unit 910 according to this embodiment. The face center feature map 1010 and the face orientation feature map 1030 are output by the same processing as the face center feature map 450 and the face orientation feature map 460 of Embodiment 1, so redundant explanations are omitted here. The face orientation feature map 1030 includes an upward feature map 1031, a rightward feature map 1032, a downward feature map 1033, and a leftward feature map 1034, which are face orientation feature maps corresponding to up, down, left, and right, similar to 461 to 464 in Figure 6.

[0070] The size feature map 1020 is two-dimensional matrix data, like the face center feature map and the face orientation feature map. In this map, each element corresponding to a face region in the image holds the relative size of that face, where the maximum size of a recognizable face in the image is taken as 1. The size estimation unit 911 is trained to take an image as input and output the size feature map described above. Here, we assume that the width and height of a face are the same and that this value is used as the face size. However, when the width and height of a face differ, for example, either the width or the height may be used as the face size, or the average of the width and height may be used as the face size.

[0071] The size calculation unit 920 calculates the face size of a person in an image based on the size feature map 1020 and the center position of the face output by the center position calculation unit 230. The thick black frame shown on the size feature map 1020 in Figure 15 indicates the center position. In this embodiment, the size calculation unit 920 can calculate the face size in the image as the product of the center position value of the size feature map and the maximum face size value. In the example in Figure 15, the center position value of the size feature map 1020 is 0.8, and with the maximum face size being 1000, the face size is calculated as 1000 × 0.8 = 800.
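The calculation in this paragraph can be sketched as a single lookup and multiplication. The function name `face_size_in_image` and the 3 x 3 toy map are hypothetical; the values 0.8 and 1000 are those of the Figure 15 example.

```python
def face_size_in_image(size_map, center, max_face_size):
    """Face size = size feature map value at the center x maximum face size."""
    cx, cy = center
    return size_map[cy][cx] * max_face_size


# Toy size feature map whose center element holds the relative size 0.8.
size_map = [[0.0, 0.0, 0.0],
            [0.0, 0.8, 0.0],
            [0.0, 0.0, 0.0]]
size = face_size_in_image(size_map, center=(1, 1), max_face_size=1000)
print(size)
```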

[0072] The box generation unit 930 generates a bounding box representing the face region based on the face size output by the size calculation unit 920 and the center position of the face output by the center position calculation unit 230. This bounding box is centered at the center position of the face and has the face size value (a value corresponding to the map) as its width and height.

[0073] The angle estimation unit 240 estimates the face orientation angle based on the face orientation feature map 1030 and the bounding box generated by the box generation unit 930. The angle estimation unit 240 calculates the average value of the elements within the bounding box as an evaluation value for each of the four face orientation feature maps 1031 to 1034. In the face orientation feature map 1030 in Figure 15, the bounding box is shown with a thick black border, and the evaluation values for up, down, left, and right are (0.9, 0.7, 0.1, 0.1). The angle estimation unit 240 estimates the face orientation angle using the evaluation values calculated in this way, but this process is the same as in Embodiment 1, so the explanation is omitted.
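The box generation, the averaging within the box, and a subsequent angle estimate can be sketched together. This is an illustration under several assumptions: all function names are hypothetical, the box is taken with inclusive corners and clipped to the map, and the angle estimate uses the vector composition described in item 7 of this disclosure (one vector per reference angle, with the evaluation value as its length). The example call further assumes that the Figure 15 values (0.9, 0.7, 0.1, 0.1) are listed in map order 1031 to 1034, i.e. reference angles 0°, 90°, 180°, 270°.

```python
import math


def bounding_box(center, size_on_map):
    """Box centered at `center` with side `size_on_map` (inclusive corners)."""
    cx, cy = center
    half = size_on_map // 2
    return cx - half, cy - half, cx + half, cy + half


def box_average(feature_map, box):
    """Average of the feature map elements inside the box, clipped to the map."""
    x0, y0, x1, y1 = box
    height, width = len(feature_map), len(feature_map[0])
    values = [feature_map[y][x]
              for y in range(max(y0, 0), min(y1, height - 1) + 1)
              for x in range(max(x0, 0), min(x1, width - 1) + 1)]
    return sum(values) / len(values)


def estimate_angle(evaluations):
    """Angle of the composite vector; keys are reference angles (0 = up, 90 = right)."""
    x = sum(v * math.sin(math.radians(a)) for a, v in evaluations.items())
    y = sum(v * math.cos(math.radians(a)) for a, v in evaluations.items())
    return math.degrees(math.atan2(x, y)) % 360


# Toy 5 x 5 upward feature map with 0.9 inside a 3 x 3 face region.
up_map = [[0.0] * 5 for _ in range(5)]
box = bounding_box(center=(2, 2), size_on_map=3)
for row in range(1, 4):
    for col in range(1, 4):
        up_map[row][col] = 0.9
up_value = box_average(up_map, box)

angle = estimate_angle({0: 0.9, 90: 0.7, 180: 0.1, 270: 0.1})
print(box, up_value, angle)
```

Under the stated ordering assumption, the composite vector points at roughly 37°, which agrees with the face orientation angle used in the running example.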

[0074] The information processing device 900 according to Embodiment 2 is capable of performing the same processing as shown in Figure 8, except that it performs the output processing of a size feature map, the calculation processing of face size, and the generation processing of a bounding box between S503 and S504 shown in Figure 8.

[0075] This processing method allows the face orientation to be estimated while taking the face size into account. In particular, by using the average of the elements within the bounding box, which reflects the face size, as the evaluation value, detection becomes robust against noise caused by changes in face size within the image.

[0076] The bounding box generated by the box generation unit 930 in this embodiment represents the area on the map where the detection target is estimated to exist. Here, the box generation unit 930 generates the bounding box using the face size, but this generation method does not necessarily have to be used if the range of elements in the face orientation feature map corresponding to the face area in the image can be estimated. For example, the box generation unit 930 may generate a bounding box surrounding the face in the image using a known detection technique, and then generate the bounding box to be used by converting the coordinates of each of the four corners of that bounding box to the corresponding positions on the map.

[0077] Next, the learning method using the learning device 1100 according to this embodiment will be described. The learning device 1100 according to this embodiment has the same configuration as the learning device 300 shown in Figure 9 of Embodiment 1, except that it has a detection target estimation unit 1110 instead of the detection target estimation unit 340 and additionally has a size error calculation unit 1120.

[0078] The detection target estimation unit 1110 takes the image acquired by the image acquisition unit 330 as input and outputs a face center feature map, a face orientation feature map, and a size feature map by processing in the same way as the detection target estimation unit 910 in Figure 14. The detection target estimation unit 1110 has basically the same configuration as the detection target estimation unit 910 and can perform common processing, so redundant explanations are omitted.

[0079] In this embodiment, the training data creation unit 350 creates a face size target map, which serves as training data for the size feature map, in addition to the face center target map and face orientation target map similar to those in Embodiment 1, based on the correct answer information. The method for creating the face size target map will be described below.

[0080] Figure 17 is a diagram illustrating the correct answer information according to this embodiment. Figure 17(a) is a diagram showing the correct answer information on the map, similar to Figure 10(c). Here, the center position is (X,Y)=(180,100), the face size is 120, and the face orientation angle is 37°.

[0081] In the face size target map 1200 shown in Figure 17(b), a bounding box 1201 is displayed centered at the central position (180, 100), with the length of each side being the same as the face size value. The face size target map in Figure 17(b) is labeled as a positive example, and the value of each element within the bounding box 1201 is the face size value on the map divided by the maximum face size on the map. Here, since the maximum size is 200, the value within the bounding box 1201 is 120 / 200 = 0.6. Also, elements outside the bounding box 1201 are represented as void.
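The construction of the face size target map can be sketched with the values of Figure 17. The names `VOID` and `make_size_target` are hypothetical, Void is represented by NaN, and whether the box corners are inclusive is an assumption made here.

```python
import math

VOID = math.nan  # assumed stand-in for the Void label


def make_size_target(width, height, cx, cy, size, max_size):
    """Grid with size / max_size inside the face bounding box, VOID elsewhere."""
    value = size / max_size
    half = size // 2
    grid = [[VOID] * width for _ in range(height)]
    for y in range(max(cy - half, 0), min(cy + half, height - 1) + 1):
        for x in range(max(cx - half, 0), min(cx + half, width - 1) + 1):
            grid[y][x] = value
    return grid


# Figure 17 values: 320 x 240 map, center (180, 100), size 120, maximum size 200.
size_target = make_size_target(320, 240, cx=180, cy=100, size=120, max_size=200)
print(size_target[100][180])
```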

[0082] The size error calculation unit 1120 calculates the size error, which is the error between the size feature map output by the detection target estimation unit 1110 and the face size target map created by the training data creation unit 350. The learning unit 380 learns the parameters of the detection target estimation unit 1110 so that the size error is reduced in addition to the center position error and direction error.

[0083] The learning device 1100 can perform the same processing as shown in Figure 12, except that it estimates a size feature map in S703, creates a face size target map in S704, and calculates the size error between S705 and S706.

[0084] The disclosures herein include the following information processing devices, information processing methods, and programs.

[0085] (Item 1) An information processing device characterized by comprising: an output means for outputting, for each of a plurality of reference angles, an evaluation value indicating whether the object to be detected in the image is tilted at that reference angle relative to the standard posture of the object to be detected; a first estimation means for estimating the tilt angle of the object to be detected in the image relative to the standard posture, based on the evaluation values output for the plurality of reference angles; and a detection means for detecting the object to be detected by processing adjusted using the estimated tilt angle.

[0086] (Item 2) The information processing device according to item 1, characterized in that the output means takes an image as input and outputs a matrix having as elements an evaluation value whether or not the detection target is tilted at a reference angle with respect to the standard posture of the detection target.

[0087] (Item 3) The information processing device according to item 2, further comprising a second estimation means for estimating the central position of the object to be detected in the input image, characterized in that the output means outputs the evaluation value from the element of the matrix corresponding to the estimated center position.

[0088] (Item 4) The information processing device according to item 3, characterized in that the output means outputs the average value of the elements of the matrix corresponding to the estimated center position and the elements within a predetermined range from the center position as an evaluation value.

[0089] (Item 5) The information processing device according to item 2, further comprising a third estimation means for estimating the range of elements in the matrix that corresponds to the region of the object to be detected in the input image, characterized in that the output means outputs the evaluation value based on the estimated range of elements of the matrix.

[0090] (Item 6) The information processing device according to item 5, characterized in that the output means outputs the average value of the estimated elements within the range as the evaluation value.

[0091] (Item 7) The information processing device according to any one of items 1 to 6, further comprising a generation means for generating, for each of the reference angles, a vector that points in the direction of the reference angle and whose length is the evaluation value, characterized in that the first estimation means estimates, as the tilt angle with respect to the standard posture, the angle of the composite vector obtained by combining the vectors generated by the generation means for the plurality of reference angles.

[0092] (Item 8) The information processing device according to any one of items 1 to 7, characterized in that the detection means detects a detection target that is rotating to return to the estimated tilt angle.

[0093] (Item 9) The information processing device according to any one of items 1 to 7, characterized in that the detection means rotates the image so as to cancel the estimated tilt angle, and detects the object to be detected from the rotated image.

[0094] (Item 10) The information processing device according to any one of items 1 to 9, characterized in that the output means outputs, for each of the plurality of reference angles, an evaluation value indicating whether or not the object to be detected is tilted at that reference angle by in-plane rotation relative to the standard posture of the object to be detected, and the first estimation means estimates, based on the evaluation values, the tilt angle of the object to be detected due to in-plane rotation relative to the standard posture.

[0095] (Item 11) The information processing device according to any one of items 1 to 9, characterized in that the output means outputs, for each of the plurality of reference angles, an evaluation value indicating whether or not the object to be detected is tilted at that reference angle relative to its standard posture in three-dimensional coordinates, and the first estimation means estimates, based on the evaluation values, the tilt angle of the object to be detected relative to the standard posture in the three-dimensional coordinates.

[0096] (Item 12) An information processing device characterized by comprising: an output means for outputting, for each of a plurality of reference angles, an evaluation value indicating whether the object to be detected in the image is tilted at that reference angle relative to the standard posture of the object to be detected; a first estimation means for estimating the tilt angle of the object to be detected in the image relative to the standard posture, based on the evaluation values output for the plurality of reference angles; an acquisition means for acquiring data indicating the correct answer of the tilt angle relative to the standard posture; and a generation means for generating, based on the data indicating the correct answer, training data for learning the evaluation value for each of the plurality of reference angles, wherein the output means is trained so as to minimize the error between the evaluation value and the training data.

[0097] (Item 13) The information processing device according to item 12, wherein the generation means generates, for each reference angle, as training data, one of the following based on the correct answer of the inclination angle and the reference angle: a positive example having a positive value, a negative example having a value of 0, or an empty value not used for learning.

[0098] (Item 14) The information processing device according to item 13, characterized in that the generation means generates, as training data for each of the reference angles: a positive example if the absolute value of the difference between the correct tilt angle and the reference angle falls within a first range; an empty value if the absolute value of the difference falls within a second range that is greater than the first range; and a negative example if the absolute value of the difference falls within a third range that is greater than the second range.

[0099] (Item 15) The information processing device according to item 14, characterized in that the generation means generates the positive value possessed by the positive example as the cosine value of the difference between the inclination angle and the reference angle.

[0100] (Item 16) The information processing device according to item 14, characterized in that the generation means generates the positive value of the positive example as 1.

[0101] (Item 17) The information processing device according to any one of items 1 to 16, characterized in that the output means outputs the evaluation value using a neural network.

[0102] (Item 18) An information processing method characterized by comprising: a step of outputting, for each of a plurality of reference angles, an evaluation value indicating whether the object to be detected in the image is tilted at that reference angle relative to the standard posture of the object to be detected; a step of estimating the tilt angle of the object to be detected in the image relative to the standard posture, based on the evaluation values output for the plurality of reference angles; and a step of detecting the object to be detected by processing adjusted using the estimated tilt angle.

[0103] (Item 19) An information processing method characterized by comprising: a step of outputting, for each of a plurality of reference angles, an evaluation value indicating whether the object to be detected in the image is tilted at that reference angle relative to the standard posture of the object to be detected; a step of estimating the tilt angle of the object to be detected in the image relative to the standard posture, based on the evaluation values output for the plurality of reference angles; a step of acquiring data indicating the correct answer of the tilt angle relative to the standard posture; and a step of generating, based on the data indicating the correct answer, training data for learning the evaluation value for each of the plurality of reference angles, wherein the outputting step is trained so as to minimize the error between the evaluation value and the training data.

[0104] (Item 20) A program for causing a computer to function as the information processing device according to any one of items 1 to 17.

[0105] (Other examples) The present invention can also be realized by supplying a program that implements one or more of the functions of the above-described embodiments to a system or device via a network or storage medium, and by having one or more processors in the computer of that system or device read and execute the program. It can also be realized by a circuit (e.g., an ASIC) that implements one or more functions.

[0106] The invention is not limited to the embodiments described above, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, claims are attached to disclose the scope of the invention.

[0107] [Explanation of symbols] 100: Camera, 200: Information processing device, 300: Learning device

Claims

1. An information processing apparatus comprising: a first estimation means for estimating the center position of a detection target in an image; an output means for outputting, for each of a plurality of reference angles in the same plane, the likelihood that the detection target is tilted at that reference angle relative to the standard posture of the detection target; a second estimation means for estimating the tilt angle of the detection target in the image relative to the standard posture, based on a combination of the likelihoods corresponding to each of the plurality of reference angles and the center position of the detection target; and a detection means for detecting the detection target using the estimated tilt angle.

2. The information processing apparatus according to claim 1, wherein the output means takes an image as input and outputs a matrix whose elements are evaluation values indicating whether the detection target is tilted at a reference angle relative to the standard posture of the detection target.

3. The information processing apparatus according to claim 2, characterized in that the output means outputs the evaluation value from the element of the matrix corresponding to the estimated center position.

4. The information processing apparatus according to claim 3, characterized in that the output means outputs the average value of the elements of the matrix corresponding to the estimated center position and the elements within a predetermined range from the center position as an evaluation value.
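
The averaging described in claims 3 and 4 can be illustrated with a minimal sketch. It assumes one score matrix per reference angle, indexed as `score_maps[k][y][x]`, and a square neighborhood given by a `radius` parameter. The function name, the data layout, and the neighborhood shape are hypothetical choices for illustration; the claims state only that elements within a predetermined range of the center position are averaged.

```python
from typing import List, Sequence, Tuple


def evaluation_at_center(
    score_maps: Sequence[Sequence[Sequence[float]]],
    center: Tuple[int, int],
    radius: int = 1,
) -> List[float]:
    """Average each reference-angle matrix around the estimated center.

    score_maps[k][y][x] is assumed to hold the evaluation value for
    reference angle k at matrix cell (x, y). radius=0 reads only the
    center element (claim 3); radius>0 averages a square window that is
    clipped at the matrix border (one possible reading of claim 4).
    """
    cx, cy = center
    values = []
    for score_map in score_maps:
        height, width = len(score_map), len(score_map[0])
        cells = [
            score_map[y][x]
            for y in range(max(0, cy - radius), min(height, cy + radius + 1))
            for x in range(max(0, cx - radius), min(width, cx + radius + 1))
        ]
        values.append(sum(cells) / len(cells))
    return values
```

Averaging over a small window rather than reading a single element makes the evaluation value less sensitive to a center estimate that is off by one cell.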

5. The information processing apparatus according to claim 2, further comprising a third estimation means for estimating the range of elements in the matrix that corresponds to the region of the detection target in the input image, wherein the output means outputs the evaluation value based on the estimated range of elements of the matrix.

6. The information processing apparatus according to claim 5, characterized in that the output means outputs the average value of the estimated range of elements as the evaluation value.

7. The information processing apparatus according to claim 1, further comprising a generation means for generating, for each of the reference angles, a vector whose length is the evaluation value of whether the detection target is tilted at that reference angle relative to the standard posture of the detection target, wherein the second estimation means estimates, as the tilt angle relative to the standard posture, the angle of a composite vector obtained by combining the vectors generated by the generation means for each of the plurality of reference angles.
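
The vector composition in claim 7 can be sketched as follows. The sketch assumes that each vector points in the direction of its reference angle, that the vectors are combined by summation, and that angles are expressed in degrees in the range [0, 360). The function name and the handling of a zero-length composite vector are hypothetical and are not taken from the claims.

```python
import math
from typing import Optional, Sequence


def estimate_tilt_angle(
    reference_angles_deg: Sequence[float],
    evaluation_values: Sequence[float],
) -> Optional[float]:
    """Estimate the tilt angle as the direction of a composite vector.

    Each reference angle contributes a vector whose length is its
    evaluation value and whose direction is assumed to be the reference
    angle itself. Returns None when the vectors cancel out, because no
    direction can be read from a zero-length composite vector.
    """
    x = sum(v * math.cos(math.radians(a))
            for a, v in zip(reference_angles_deg, evaluation_values))
    y = sum(v * math.sin(math.radians(a))
            for a, v in zip(reference_angles_deg, evaluation_values))
    if math.hypot(x, y) < 1e-12:
        return None
    return math.degrees(math.atan2(y, x)) % 360.0
```

With reference angles of 0, 90, 180, and 270 degrees, equal evaluation values at 0 and 90 degrees give an estimate close to 45 degrees, so the estimate can fall between reference angles rather than being limited to them.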

8. The information processing apparatus according to claim 1, characterized in that the detection means detects the detection target rotated so as to undo the estimated tilt angle.

9. The information processing apparatus according to claim 1, characterized in that the detection means rotates the image so as to undo the estimated tilt angle, and detects the detection target from the rotated image.
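
The coordinate transform behind claims 8 and 9 can be sketched without an image library. The sketch assumes rotation about the estimated center position and a sign convention in which a positive angle rotates the point (1, 0) toward (0, 1). The helper names `rotate_point`, `to_upright`, and `to_original` are hypothetical, and a real implementation would also resample the image pixels.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def rotate_point(point: Point, center: Point, angle_deg: float) -> Point:
    """Rotate a point about a center by angle_deg (assumed sign convention)."""
    theta = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(theta) - dy * math.sin(theta),
        center[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )


def to_upright(point: Point, center: Point, tilt_deg: float) -> Point:
    """Map an original-image point into the frame with the tilt undone."""
    return rotate_point(point, center, -tilt_deg)


def to_original(point: Point, center: Point, tilt_deg: float) -> Point:
    """Map a point detected in the upright frame back to the original image."""
    return rotate_point(point, center, tilt_deg)
```

Detection then runs in the upright frame, where the detector sees the target in roughly its standard posture, and `to_original` carries each detected position back to the input image.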

10. The information processing apparatus according to claim 1, wherein the output means outputs, for each of the plurality of reference angles, an evaluation value indicating whether the detection target is tilted at that reference angle by in-plane rotation relative to the standard posture, and the second estimation means estimates, based on the evaluation values, the tilt angle of the detection target due to in-plane rotation relative to the standard posture.

11. The information processing apparatus according to claim 1, wherein the output means outputs, for each of the plurality of reference angles, an evaluation value indicating whether the detection target is tilted at that reference angle relative to the standard posture in three-dimensional coordinates, and the second estimation means estimates, based on the evaluation values, the tilt angle of the detection target relative to the standard posture in the three-dimensional coordinates.

12. An information processing apparatus comprising: a first estimation means for estimating the center position of a detection target in an image; an output means for outputting, for each of a plurality of reference angles in the same plane, the likelihood that the detection target is tilted at that reference angle relative to the standard posture of the detection target; a second estimation means for estimating the tilt angle of the detection target in the image relative to the standard posture, based on a combination of the likelihoods corresponding to each of the plurality of reference angles and the center position of the detection target; an acquisition means for acquiring data indicating the correct tilt angle relative to the standard posture; and a generation means for generating, based on the data indicating the correct tilt angle, training data for learning the likelihood for each of the plurality of reference angles, wherein the output means is trained to minimize the error between the likelihood and the training data.

13. The information processing apparatus according to claim 12, characterized in that, for each reference angle, the generation means generates as training data one of the following, based on the correct tilt angle and the reference angle: a positive example having a positive value, a negative example having a value of 0, or an empty value that is not used for learning.
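
One way to keep empty values out of training, as claims 12 and 13 require, is to skip them when the error is computed. The sketch below assumes that an empty value is represented by `None` and that the error is a mean squared error. Both are illustrative assumptions: the claims say only that the error between the likelihood and the training data is minimized and do not name a loss function.

```python
from typing import Optional, Sequence


def masked_squared_error(
    likelihoods: Sequence[float],
    targets: Sequence[Optional[float]],
) -> float:
    """Mean squared error over reference angles that have a target.

    A target of None stands for an empty value (assumed encoding) and is
    excluded, so it contributes nothing to the error being minimized.
    """
    pairs = [(p, t) for p, t in zip(likelihoods, targets) if t is not None]
    if not pairs:
        return 0.0  # nothing to learn from for this sample
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)
```

Excluding the empty values means the output for a reference angle that is neither clearly right nor clearly wrong is pushed toward neither 0 nor a positive value.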

14. The information processing apparatus according to claim 13, characterized in that the generation means generates, as the training data for each of the reference angles: a positive example if the absolute value of the difference between the correct tilt angle and the reference angle falls within a first range; an empty value if that absolute value falls within a second range that is greater than the first range; and a negative example if that absolute value falls within a third range that is greater than the second range.

15. The information processing apparatus according to claim 14, characterized in that the generation means generates the positive value of the positive example as the cosine value of the difference between the inclination angle and the reference angle.

16. The information processing apparatus according to claim 14, characterized in that the generation means generates the positive value of the positive example as 1.
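
The training-data rule of claims 14 to 16 can be sketched as a single function. The range boundaries (`positive_limit` of 45 degrees and `ignore_limit` of 90 degrees), the use of `None` for an empty value, and the wrap-around handling of angle differences are all assumptions made for illustration; the claims do not give numeric ranges.

```python
import math
from typing import Optional


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Absolute difference between two angles, wrapped into [0, 180]."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def make_target(
    correct_deg: float,
    reference_deg: float,
    positive_limit: float = 45.0,  # assumed end of the first range
    ignore_limit: float = 90.0,    # assumed end of the second range
    use_cosine: bool = True,
) -> Optional[float]:
    """Training target for one reference angle.

    First range  -> positive example: cos(diff) (claim 15) or 1 (claim 16).
    Second range -> None, an empty value that is not used for learning.
    Third range  -> 0.0, a negative example.
    positive_limit should stay below 90 so that cos(diff) remains positive.
    """
    diff = angular_difference(correct_deg, reference_deg)
    if diff <= positive_limit:
        return math.cos(math.radians(diff)) if use_cosine else 1.0
    if diff <= ignore_limit:
        return None
    return 0.0
```

For a correct tilt angle of 30 degrees, the reference angle 0 receives a positive target of about 0.87, the reference angle 90 receives an empty value, and the reference angle 180 receives a negative example. The cosine variant grades positive examples by how close the reference angle is to the correct angle, while the constant variant treats every positive example alike.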

17. The information processing apparatus according to claim 12, characterized in that the output means outputs the likelihood using a neural network.

18. An information processing method comprising: a step in which a first estimation means estimates the center position of a detection target in an image; a step in which an output means outputs, for each of a plurality of reference angles in the same plane, the likelihood that the detection target in the image is tilted at that reference angle relative to the standard posture of the detection target; a step in which a second estimation means estimates the tilt angle of the detection target in the image relative to the standard posture, based on a combination of the likelihoods corresponding to each of the plurality of reference angles and the center position of the detection target; and a step in which a detection means detects the detection target using the estimated tilt angle.

19. An information processing method comprising: a step in which a first estimation means estimates the center position of a detection target in an image; a step in which an output means outputs, for each of a plurality of reference angles in the same plane, the likelihood that the detection target is tilted at that reference angle relative to the standard posture of the detection target; a step in which a second estimation means estimates the tilt angle of the detection target in the image relative to the standard posture, based on a combination of the likelihoods corresponding to each of the plurality of reference angles and the center position of the detection target; a step in which an acquisition means acquires data indicating the correct tilt angle relative to the standard posture; and a step in which a generation means generates, based on the data indicating the correct tilt angle, training data for learning the likelihood for each of the plurality of reference angles, wherein the outputting step is trained to minimize the error between the likelihood and the training data.

20. A program for causing a computer to function as one of the means of an information processing device according to any one of claims 1 to 17.