Processing apparatus, control method of processing apparatus, and program

JP2025001892A5 · Pending · Publication Date: 2026-06-26 · CANON KK

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
CANON KK
Filing Date
2023-06-21
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing methods struggle to accurately detect a person's gaze direction in store images due to small face sizes and reliance on line of sight direction alone, leading to inaccurate determinations of interest in products.

Method used

A processing device that acquires images, sets a gaze area based on object characteristics and distance/angle, and uses joint point information to determine if a person is gazing at a target object, correcting for obstructions and lighting conditions.

Benefits of technology

Improves the accuracy of determining visual recognition of objects by fixing the gaze area at the product position, allowing precise gaze determination and interest measurement.

✦ Generated by Eureka AI based on patent content.

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of determining whether a person is viewing an object. SOLUTION: A processing apparatus has video acquiring means for acquiring a video of a place where a target object is disposed, characteristic acquiring means for acquiring characteristic information relating to an angle and a distance at which the target object can be viewed, region setting means for setting a viewing region in which the target object can be viewed, and determination means for determining whether a person is viewing the target object based on information relating to joint points of the person existing in the viewing region set by the region setting means. SELECTED DRAWING: Figure 2
Description

[Technical field]

[0001] The present invention relates to a processing device, a control method for the processing device, and a program.

[Background technology]

[0002] In recent years, technology has been proposed that can detect human behavior from surveillance camera footage, and this technology is increasingly being applied to analyzing customer behavior in stores. When stores put new products on sale, there is a demand to measure the level of interest shown by customers. The level of interest of customers is reflected in actions such as stopping in front of a product to stare at it, or picking it up. Regarding whether customers are staring at a product, a method has been proposed that utilizes surveillance cameras installed in stores.

[0003] The method in Patent Document 1 detects the gaze direction of a person from a video and determines the product that the person is looking at. Patent Document 2 describes the gaze direction detection method in detail, determining the gaze direction from the face direction and the center position of the pupil.

[Prior art documents]

[Patent documents]

[0004] [Patent Document 1] JP 2017-117384 A
[Patent Document 2] JP 2009-104524 A

[Summary of the invention]

[Problem to be solved by the invention]

[0005] However, although Patent Documents 1 and 2 detect the gaze direction of a person in the video, people appear small in in-store camera video, so it is difficult to detect them accurately, and face areas in particular. Also, because only the gaze direction is taken into consideration, a person looking toward a product from a distance too great to focus on it is still treated as gazing at the product.

[0006] The present invention has been made in consideration of the above problems, and an object of the present invention is to improve the accuracy of determination regarding a person's visual recognition of an object.

[Means for solving the problem]

[0007] In order to achieve the above-mentioned object, a processing device as one aspect of the present invention comprises an image acquisition means for acquiring an image of a location where a target object is located, a characteristic acquisition means for acquiring characteristic information relating to the angle and distance at which the target object can be viewed, an area setting means for setting a gaze area in which the target object can be viewed based on the characteristic information acquired by the characteristic acquisition means, and a determination means for determining whether a person is viewing the target object based on information of the joint points of the person present in the gaze area set by the area setting means.

[Effect of the invention]

[0008] According to the present invention, it is possible to improve the accuracy of determination regarding visual recognition of an object by a person.

[Brief description of the drawings]

[0009] [Figure 1] A diagram showing an example of the hardware configuration of the processing device.
[Figure 2] A diagram showing the functional configuration of the processing device.
[Figure 3] A diagram showing an example of a posture estimation result.
[Figure 4] A flowchart showing a process flow in the processing device.
[Figure 5] A flowchart showing a process flow in the processing device.
[Figure 6] A diagram showing an example of a product image.
[Figure 7] A diagram showing an example of a product coordinate system.
[Figure 8] A diagram showing an example of a gaze area set by the gaze area setting unit.
[Figure 9] A diagram showing an example of an in-store camera image and a gaze area.
[Figure 10] A diagram showing an example of an in-store camera image when there is an obstruction around a product.
[Figure 11] An example of the in-store camera image viewed on the XZ plane when there is an obstruction around a product.
[Figure 12] An example of a gaze area when there are multiple products.

[Detailed description of the preferred embodiments]

[0010] The following describes in detail the embodiments for carrying out the present invention. Note that the embodiments described below are examples for realizing the present invention, and should be appropriately modified or adjusted depending on the configuration of the device to which the present invention is applied and various conditions, and the present invention is not limited to the following embodiments. Also, in each drawing, parts having the same functions are given the same numbers, and repeated explanations are omitted.

[0011] <Embodiment 1> FIG. 1 is a block diagram showing a hardware configuration of a processing device 1 in this embodiment. The processing device 1 in this embodiment determines whether a person is gazing at a target object based on a gaze area (to be described later) and information on joint points of the person present in the gaze area. The processing device 1 in this embodiment can also detect a person gazing at a target object. The processing device 1 in this embodiment includes a CPU 101, a ROM 102, a RAM 103, a secondary storage device 104, an imaging device 105, an input device 106, a display device 107, and a network I/F 108.

[0012] The CPU 101 is a central processing unit, and controls the entire processing device 1 by executing control programs stored in the ROM 102 and the RAM 103. The ROM 102 is a non-volatile memory, and stores the control program in this embodiment and other programs and data required for control. The RAM 103 is a volatile memory, and stores temporary data such as frame image data and pattern discrimination results. The secondary storage device 104 is a rewritable secondary storage device such as a hard disk drive or flash memory, and stores image information, programs, various setting contents, and the like. This information is transferred to the RAM 103, and the CPU 101 executes the programs and uses the data.

[0013] The imaging device 105 is composed of an imaging lens, an imaging sensor such as a CCD or CMOS sensor, a video signal processing unit, and the like, and captures images and videos. The input device 106 is a keyboard, a mouse, or the like, and accepts input from the user. The display device 107 is composed of a cathode ray tube (CRT), a liquid crystal display, or the like, and displays processing results and other information on the screen (presents them to the user). The network I/F 108 is a modem, a LAN interface, or the like that connects to a network such as the Internet or an intranet. The bus 109 connects these devices and allows data to be input and output between them.

[0014] FIG. 2 is a diagram showing the functional configuration of the processing device 1 in this embodiment. The processing device 1 includes an area setting unit 201, a gazing person detection unit 202, a video acquisition unit 203, and a gaze area storage unit 208.

[0015] The video acquisition unit 203 is configured with the imaging device 105, and acquires images and videos. Specifically, the video acquisition unit 203 acquires a video of a specific product (target object) that is the object of attention, a video of a location where the target object is located, and the like.

[0016] The area setting unit 201 is a functional unit that sets a gaze area in which a product, which is a target object, can be gazed upon. The area setting unit 201 further includes a characteristic acquisition unit 204, a gaze area setting unit 205, a placement condition acquisition unit 206, and a gaze area correction unit 207 as its functional units.

[0017] The characteristic acquisition unit 204 acquires information on the characteristics of a product that is a target object (characteristic information). Specifically, the characteristic information is information on the angle and distance at which the target object can be gazed upon. The details of the characteristic information of the object acquired by the characteristic acquisition unit 204 will be described later.

[0018] The gaze area setting unit 205 sets a gaze area in which a target object can be gazed upon, based on the characteristic information acquired by the characteristic acquisition unit 204. The gaze area set by the gaze area setting unit 205 will be described later.

[0019] The placement condition acquisition unit 206 acquires information on the three-dimensional shape of the store (such as the placement of shelves) and the three-dimensional positions of products from the video. It also acquires placement conditions, which are conditions for the positions where the products are placed (the positions of the products in the video). Details of the placement conditions acquired by the placement condition acquisition unit 206 will be described later.

[0020] The gaze area correction unit 207 corrects the gaze area set by the gaze area setting unit 205 based on the product arrangement conditions acquired by the arrangement condition acquisition unit 206. The correction process in the gaze area correction unit 207 will be described later in detail.

[0021] The gaze area storage unit 208 is configured by the RAM 103 and the secondary storage device 104. The gaze area storage unit 208 stores information about the gaze area set by the gaze area setting unit 205 and the gaze area correction unit 207, and the like.

[0022] The gazing person detection unit 202 is a functional unit that detects and counts people who are determined to be gazing at a product. The gazing person detection unit 202 includes a person detection unit 209, a person tracking unit 210, a posture estimation unit 211, a gaze determination unit 212, a person number measurement unit 213, and a display unit 214.

[0023] The person detection unit 209 detects a person's area from the video acquired by the video acquisition unit 203. In this embodiment, it is assumed that the area of the entire body of a person is detected.

[0024] The person tracking unit 210 associates the person areas that the person detection unit 209 detected for the same person in successive frames (frame images) of the video acquired by the video acquisition unit 203, and assigns the same person ID to them. That is, in a video composed of frame images, the person areas presumed to belong to the same person in the current frame image and in the immediately preceding frame image are associated with each other, so that the person carries the same ID in every frame image.

[0025] The posture estimation unit 211 acquires information on joint points constituting the posture of a person from the area (whole body area) of the person detected by the person detection unit 209. In this embodiment, the joint points indicate the positions of the parts of the human body. The locations shown as the joint points in this embodiment will be described later with reference to FIG. 3.

[0026] The gaze determination unit 212 determines whether or not a person is gazing at a product, based on the joint points estimated by the posture estimation unit 211 and the gaze area read from the gaze area storage unit 208. That is, the gaze determination unit 212 determines whether or not a person is gazing at a target object, based on information on the joint points of the person present in the gaze area set by the gaze area setting unit 205.

[0027] The person number measurement unit 213 measures the time that a person who is determined by the gaze determination unit 212 to be gazing at a product gazes at the product. It also determines whether the time reaches a predetermined time. It also counts the number of people who are determined by the gaze determination unit 212 to be gazing at the product based on the determination result.
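
The measurement described in [0027] can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in the patent: the class name `GazeCounter`, the per-frame `update` interface, and the choice to accumulate gaze time cumulatively (rather than requiring one continuous gaze) are all hypothetical, and the threshold value stands in for the "predetermined time".

```python
from collections import defaultdict


class GazeCounter:
    """Accumulates per-person gaze time and counts people who reach a threshold."""

    def __init__(self, threshold_sec: float = 2.0):
        # threshold_sec stands in for the "predetermined time"; the value is assumed.
        self.threshold_sec = threshold_sec
        self.gaze_time = defaultdict(float)  # person ID -> accumulated seconds
        self.counted = set()                 # person IDs that reached the threshold

    def update(self, person_id: int, is_gazing: bool, frame_dt: float) -> None:
        """Record one frame's gaze determination for one tracked person."""
        if not is_gazing:
            return
        self.gaze_time[person_id] += frame_dt
        if self.gaze_time[person_id] >= self.threshold_sec:
            self.counted.add(person_id)

    @property
    def count(self) -> int:
        """Number of distinct people whose gaze time reached the threshold."""
        return len(self.counted)
```

Keying the accumulated time by person ID relies on the tracking unit assigning a stable ID to each person across frames.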

[0028] The display unit 214 is configured from the display device 107. The display unit 214 presents the counting result of the person number measurement unit 213 to the user (displays it on the screen of the display device 107).

[0029] Each of the functional units of the processing device 1 is realized by the CPU 101 expanding a program stored in the ROM 102 into the RAM 103 and executing the program. The CPU 101 then stores the execution results of each process (described later) in a predetermined storage medium such as the RAM 103 or the secondary storage device 104.

[0030] FIG. 3 is a diagram showing an example of a posture estimation result by the posture estimation unit 211. The joint points in this embodiment are positions indicated by black dots in FIG. 3. That is, the joint points are a right shoulder 301, a left shoulder 302, a right elbow 303, a left elbow 304, a right wrist 305, a left wrist 306, a right waist 307, a left waist 308, a right knee 309, a left knee 310, a right ankle 311, and a left ankle 312. In this embodiment, not only the above-mentioned parts, but also the right eye 313, the left eye 314, the right ear 315, the left ear 316, and the nose 317, which are organ points of the person's face, are treated as joint points. Thus, the joint points in this embodiment include position information (coordinate information) of the joints of the above-mentioned human body parts, as well as position information of the eyes (right eye, left eye), the ears (right ear, left ear), and the nose, which are organ points of the person's face. Incidentally, the joint points are not limited to the above-mentioned parts, and other parts may also be treated as joint points.
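
One way to represent the joint points listed in [0030] in a program is an enumeration keyed by the reference numerals of FIG. 3. The names `JointPoint`, `FACE_POINTS`, and `Pose` below are hypothetical, and using 2D image coordinates for each point is an assumption for illustration.

```python
from enum import IntEnum


class JointPoint(IntEnum):
    """Joint points of FIG. 3; values are the reference numerals used in the text."""
    RIGHT_SHOULDER = 301
    LEFT_SHOULDER = 302
    RIGHT_ELBOW = 303
    LEFT_ELBOW = 304
    RIGHT_WRIST = 305
    LEFT_WRIST = 306
    RIGHT_WAIST = 307
    LEFT_WAIST = 308
    RIGHT_KNEE = 309
    LEFT_KNEE = 310
    RIGHT_ANKLE = 311
    LEFT_ANKLE = 312
    RIGHT_EYE = 313
    LEFT_EYE = 314
    RIGHT_EAR = 315
    LEFT_EAR = 316
    NOSE = 317


# Facial organ points that the embodiment also treats as joint points.
FACE_POINTS = frozenset({
    JointPoint.RIGHT_EYE, JointPoint.LEFT_EYE,
    JointPoint.RIGHT_EAR, JointPoint.LEFT_EAR,
    JointPoint.NOSE,
})

# A pose maps each joint point to (x, y) image coordinates (assumed representation).
Pose = dict[JointPoint, tuple[float, float]]
```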

[0031] Next, the process by which the processing device 1 in this embodiment sets and saves the gaze area will be described in detail with reference to Fig. 4. Fig. 4 is a flowchart showing the flow of the process of setting the gaze area in the processing device 1. Note that each operation (process) shown in the flowchart in Fig. 4 is realized by the CPU 101 of the processing device 1 executing a program stored in the ROM 102. In the following, each process (step) is denoted by a number prefixed with S, and the word "step" is omitted.

[0032] In S401, the video acquisition unit 203 acquires a video (product video) captured by the imaging device 105. That is, the video is acquired of a location where a product, which is a target object, is placed. The video acquired by the video acquisition unit 203 is composed of a plurality of frame images. In this embodiment, it is assumed that the video acquired by the video acquisition unit 203 in S401 is a video in which the product is captured in a somewhat close-up state, as shown in FIG. 6. FIG. 6 is a diagram showing an example of a product video. Note that FIG. 6 shows an example of one of the frame images constituting the video.

[0033] In the frame image 601 shown in FIG. 6, a product 602 with a label 603 is captured. In FIG. 6, the product 602 is a canned product, but this is merely an example and any product may be used. In S401, the video acquisition unit 203 may acquire images or videos captured in advance by reading them from a storage medium such as the ROM 102 or the secondary storage device 104. In addition, for example, when the inside of a store is being imaged and the whole of the product 602 can be captured, the image or video may be acquired from the imaging device. Here, in order to acquire the characteristics (characteristic information) of the product 602 described below, it is desirable that the area of the product in the frame image has a sufficient resolution, for example, 200×200 pixels or more.

[0034] In S402, the characteristic acquisition unit 204 acquires characteristic information of the product, which is the target object, from the frame image in the video acquired in S401. The characteristic information here refers to properties of the product itself that affect how it is seen by a viewer, such as the size of the characters on the product or on the label attached to it, the contrast difference between the label background and the characters, the font used, and the shape of the product. One specific way to acquire the characteristic information is to analyze the image, for example by identifying fonts using a Convolutional Neural Network (CNN). Note that the user may instead directly input the character size as a numerical value via the input device 106 or the like.

[0035] In S403, the gaze area setting unit 205 sets a gaze area in a product coordinate system, that is, sets a gaze area in which a target object can be gazed upon based on the characteristic information acquired by the characteristic acquisition unit 204. The product coordinate system (xyz) is a three-dimensional coordinate system with the origin at the center of the surface of the product illustrated in Fig. 7. Fig. 7 is a diagram of an example of the product coordinate system for the product 602 illustrated in Fig. 6.

[0036] In FIG. 7, a product (product 602) is indicated by 701 on the xy plane. It is also indicated by 702 on the xz plane. It is also indicated by 703 on the yz plane. The xy plane is in contact with the surface of the product, and the z axis is in the direction of a perpendicular line extending from the surface. Furthermore, the product's label (label 603) is indicated by 704 on the xy plane. Normally, when purchasing or considering purchasing a product, it is thought that a customer will gaze at the product and read the label in order to determine whether or not the product is necessary. Therefore, it is desirable to take the origin o on the surface of the product where the label is attached. In the product coordinate system, the longest length of the product (d in the example of FIG. 7) is taken as 1.

[0037] The gaze area is a partial area in a three-dimensional space determined by the distance from a product, which is a target object, to a person (a person existing near the product) and the angle of the person with respect to the target object. Note that in this embodiment, the gaze area is defined as a shape obtained by cutting out a part of a sphere, but it may be any shape in space that can be defined by the distance and angle to the target object.

[0038] FIG. 8 is a diagram showing an example of a gaze area. As mentioned above, the origin o is on the label surface of the product. The area bounded by o, a, b, c, and d in FIG. 8 is the gaze area. When setting the gaze area, first consider a sphere Q whose radius r is the gazeable distance. In FIG. 8, r is oa (or ob, oc, od). As mentioned above, the gazeable distance r depends on the characteristics of the target object, such as the size of the characters on the label, the contrast difference between the label background and the characters, and the font used on the label. For example, the larger the characters on the label, the larger r becomes, since they can be read from farther away. If the character width, with the short side of the label taken as 1, is S_c, then r can be expressed by the following equation (1).

[0039] r = (S_c / S_b) × D_b   (1)

[0040] In equation (1), D_b is the gazeable distance measured in advance under a reference condition (character size S_b). That is, if S_c > S_b, the gazeable distance r becomes longer; conversely, if S_c < S_b, r becomes shorter. Other characteristics, such as the contrast difference between the label and the text and the font used for the label, can be handled in the same way by quantifying them, defining a reference condition, and measuring D_b under it. Note that this is just an example, and other formulas may be used as long as they can express the relationship between the quantified object characteristics and the gazeable distance.
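
Equation (1) is a simple proportional scaling, which can be sketched as below. The function name and the numeric values in the usage note are hypothetical; the sketch assumes that D_b was measured at the reference character size S_b and that both character sizes are expressed relative to the short side of the label.

```python
def gazeable_distance(s_c: float, s_b: float, d_b: float) -> float:
    """Equation (1): r = (S_c / S_b) * D_b.

    s_c: character size on the target label
    s_b: character size under the reference condition
    d_b: gazeable distance measured in advance under the reference condition
    """
    if s_b <= 0:
        raise ValueError("reference character size S_b must be positive")
    return (s_c / s_b) * d_b
```

For instance, with illustrative values `s_c=0.5`, `s_b=0.25`, and `d_b=3.0`, the characters are twice the reference size, so the gazeable distance doubles relative to D_b.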

[0041] Next, we explain how to determine the angles at which the target object can easily be gazed upon: the horizontal angle θ_h (= ∠aod or ∠boc) and the vertical angle θ_v (= ∠aob or ∠doc). As with the distance r, these can be determined from an angle measured under a reference condition. For example, when the character size is S_c, θ_h can be expressed by the following equation (2).

[0042] θ_h = (S_c / S_b) × θ_hb   (2)

[0043] In equation (2), θ_hb is the horizontal gazeable angle measured in advance under the reference condition (character size S_b). The vertical angle θ_v is determined in the same way. For other characteristics as well, θ_hb is measured under a reference condition and the angle is obtained from the ratio of each condition to that reference condition. Note that this is just an example, and other expressions may be used as long as they can express the relationship between the quantified object characteristics and the gazeable angle.

[0044] The gaze area setting unit 205 takes the gaze area in the direction perpendicular to the object surface, which is in the z-axis direction in the example of Fig. 8. That is, the z-axis passes through the center of abcd and coincides with the center line of the gaze area (the line segment from o to the center point of abcd). As will be described later, the gaze area correction unit 207 changes the direction and scale of the center line depending on the position where the product is placed.
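
The gaze area in the product coordinate system, with its center line on the z axis, can be tested for membership as sketched below. This is an assumed geometric reading, not the definition in the patent: the area is taken as the part of the sphere of radius r that lies within θ_h / 2 horizontally and θ_v / 2 vertically of the +z axis. The function names are hypothetical, and angles are in radians.

```python
import math


def gazeable_angle(s_c: float, s_b: float, theta_b: float) -> float:
    """Equation (2): theta = (S_c / S_b) * theta_b, in radians."""
    return (s_c / s_b) * theta_b


def in_gaze_area(point, r: float, theta_h: float, theta_v: float) -> bool:
    """Return True if point (x, y, z), in product coordinates, lies in the gaze area.

    Assumes the area is bounded by the sphere of radius r around the origin o
    and by half-angles theta_h / 2 and theta_v / 2 about the +z center line.
    """
    x, y, z = point
    if z <= 0:  # on or behind the label surface
        return False
    if math.sqrt(x * x + y * y + z * z) > r:
        return False
    horizontal = abs(math.atan2(x, z))  # angle from the z axis in the xz plane
    vertical = abs(math.atan2(y, z))    # angle from the z axis in the yz plane
    return horizontal <= theta_h / 2 and vertical <= theta_v / 2
```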

[0045] In S404, the video acquisition unit 203 acquires a video (store video) of a location where a product (target object) is arranged in a store or the like, captured by the imaging device 105. The video acquired by the video acquisition unit 203 is composed of a plurality of frame images. In S404, the video acquisition unit 203 acquires a video with the same angle of view as in S502 in FIG. 5, which will be described later.

[0046] In S405, the arrangement condition acquisition unit 206 acquires information on the three-dimensional shape (such as the arrangement of shelves) of the inside of the store from the in-store image acquired in S404. A specific method of acquiring the arrangement information is, for example, a method of estimating depth information for each pixel from an image using a Vision Transformer (ViT). In addition, a synthetic image generated in advance using a three-dimensional CG model may be overlaid on the in-store image and displayed, and the user may adjust the size of the CG model while checking it to fit it.

[0047] Fig. 9 is a diagram showing an example of an in-store camera image and a gaze area. A shelf 902 and a product 903 are captured in a frame image 901, which is a part of the image shown in Fig. 9. Specifically, the product 903 is placed on the shelf 902. The placement condition acquisition unit 206 acquires the three-dimensional shapes of these in S405.

[0048] In S406, a coordinate system is set that indicates each coordinate of the three-dimensional shape. A world coordinate system (hereinafter, the in-store coordinate system) is used, with a specific position in the in-store image (point O at the left rear of the shelf in FIG. 9) as the origin. In this embodiment, the XZ plane is the floor, the X axis is the horizontal direction of the shelf, the Z axis is the depth direction of the shelf, and the Y axis is the height direction. Each of these axes is measured in units of length in the real world (in this embodiment, the length of the long side of the product). The position of the origin, the X axis, the Y axis, the Z axis, the length of the long side of the product, etc. are set by the user (operator) inputting them via the input device 106 while checking the in-store image.

[0049] In S407, the placement condition acquisition unit 206 acquires the three-dimensional position of the product (product 903). The placement condition acquisition unit 206 acquires the three-dimensional position of the product based on a user instruction via the input device 106.

[0050] In S408, the gaze area correction unit 207 adapts the gaze area set by the gaze area setting unit 205 to the position where the product is placed. That is, the gaze area is adapted to the in-store coordinate system. In this process, first, the gaze area expressed in the product coordinate system is scaled based on the real-world length of the long side of the product to match it with the in-store coordinate system. Next, the vertex o of the gaze area is shifted to the center position 904 of the product label. Next, the direction of the center line 905 of the gaze area in the xz plane is matched with the perpendicular line to the surface of the product. Next, the direction of the yz plane is determined based on the height information of the person. Specifically, it is determined based on the average height of the people who are expected to gaze at the target object.

[0051] Here, the average height is set to match the target customer group of the product. For example, it is set to 170 cm for adult males and 120 cm for elementary school students. The user may select the average height from values preset for each target customer group, or may set it arbitrarily.

[0052] Taking FIG. 9 as an example, it is assumed that a person 906 of average height stands at a position whose distance from the product is a predetermined ratio of the gazeable distance (distance r, the radius of the sphere shown in FIG. 8). The gazeable distance is not used as is because r is the limit of the gazeable range, whereas customers are expected to stand at the distance from which the product is easiest to see. In FIG. 9, the head center of a person of average height standing at the optimal gaze distance 907 is denoted 908. The direction in the yz plane is adjusted so that the center line 905 of the gaze area passes through the head center 908. In this way, the gaze area correction unit 207 adapts the gaze area to the position where the product is arranged in the store. The gaze area 909 is the gaze area after it has been adapted to the position where the product is arranged in the store.
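
The orientation of the center line toward the head of an average-height person can be sketched as follows, in the in-store coordinate system (Y is height, XZ is the floor). The function names, the ratio of 0.5, and the head height used in the usage note are hypothetical; the text only specifies "a predetermined ratio" and an average height per customer group. All lengths are assumed to share one unit.

```python
import math


def head_center(label_center, normal_xz, r, ratio, head_height):
    """Head position (908) of a person standing ratio * r in front of the product.

    label_center: (X, Y, Z) of the label center (904)
    normal_xz: direction of the product-surface normal on the floor plane, (nx, nz)
    """
    x, _, z = label_center
    nx, nz = normal_xz
    norm = math.hypot(nx, nz)
    d = ratio * r  # assumed optimal gaze distance (907)
    return (x + d * nx / norm, head_height, z + d * nz / norm)


def center_line(label_center, head):
    """Unit vector from the label center (904) toward the head center (908)."""
    v = tuple(h - c for h, c in zip(head, label_center))
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)
```

With a label center at `(1.0, 1.0, 0.0)`, a normal of `(0.0, 1.0)`, `r=2.0`, `ratio=0.5`, and a head height of `1.6`, the head lies one unit in front of the product and the center line tilts upward from the label toward it.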

[0053] In S409, the gaze area correction unit 207 acquires placement information, which is information on conditions related to placement other than the three-dimensional position of the product. Here, the placement information is information on lighting conditions (brightness, color temperature of the light source, light source direction, etc.) and on the positions of obstructions around the product. Methods for acquiring lighting conditions include estimating them from an image using the gray-world assumption, and having the user specify the type of light source. The positions of obstructions around the product may be acquired using, for example, a method based on ViT, but any method may be used.
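
Estimating lighting from an image under a gray-world style assumption (the scene is taken to average out to neutral gray) can be sketched as follows. The source gives no details of its estimation method, so this is a generic illustration with a hypothetical function name: the per-channel means serve as an estimate of the illuminant color, and their average as a rough brightness.

```python
def gray_world_illuminant(pixels):
    """Estimate the illuminant from RGB pixels under the gray-world assumption.

    pixels: non-empty list of (r, g, b) tuples.
    Returns (means, gains): per-channel means, and the per-channel gains that
    would bring those means to a common gray level.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3  # rough overall brightness
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return means, gains
```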

[0054] In S410, the gaze area correction unit 207 corrects the gaze area based on the product placement conditions (placement information). When correcting the gaze area in S410, for example, if the brightness falls below a predetermined value, it is considered that the gazeable distance will become smaller. The predetermined value is obtained by experiment, and when it is lower than the predetermined value, the length of the center line of the gaze area is changed according to the brightness difference value ΔV using the following formula (3).

[0055] L = 1 / ΔV × α × L1 (3)

[0056] In the above formula (3), L1 is the length of the center line of the gaze area before correction, and α is a coefficient for adjusting the range of values. Note that this is just an example, and other formulas may be used as long as they can express the correction ratio based on the relationship between the quantified placement conditions and the gazeable distance.
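
Formula (3) can be sketched as below. The text does not define how ΔV is computed, so taking it as the predetermined value minus the measured brightness is an assumption, as are the function and parameter names. The coefficient α has to be chosen so that the result is actually shorter than L1 over the expected range of ΔV.

```python
def corrected_center_line_length(l1, brightness, threshold, alpha):
    """Formula (3): L = (1 / dV) * alpha * L1, applied only below the threshold.

    l1: length of the center line before correction
    brightness: measured brightness at the product position
    threshold: the experimentally obtained predetermined value
    alpha: coefficient for adjusting the range of values
    """
    if brightness >= threshold:
        return l1  # bright enough, no correction
    delta_v = threshold - brightness  # assumed definition of the difference
    return (1.0 / delta_v) * alpha * l1
```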

[0057] FIG. 10 is a diagram of an example of a camera image in a store when there is an obstruction around a product. When there is an obstruction around a product as in FIG. 10, the area that can be gazed at is limited. For example, using FIG. 10 as an example, assume that there is a shelf 1002 and a product 1003 in a camera image 1001 in the store, and a pillar 1004 protrudes outward. In such a case, since a person cannot stand on the right side in front of the shelf 1002, the product 1003 cannot be gazed at from the right side. Therefore, based on the positional relationship between the three-dimensional information of the store acquired in S405 and the gaze area matched in S408, it is identified that there is an obstruction around the product (for example, a pillar 1004) (positional information of the obstruction is acquired).

[0058] Fig. 11 shows the interior of the store in Fig. 10 viewed from the ceiling, on the XZ plane. In Fig. 11, 1101 indicates the shelf, 1102 the product, and 1103 the pillar. The gaze area 1104, indicated by a dotted line, would overlap the pillar 1103, so the correction based on the obstruction narrows only the angle on the right side, yielding the gaze area 1105 indicated by a solid line. In this way, the gaze area correction unit 207 corrects the gaze area so that it does not overlap an obstruction, based on the position information, contained in the placement information, of an obstruction that limits the area from which a person can gaze.
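
Seen from above, the obstruction correction amounts to narrowing one side of a circular sector. The sketch below is a simplified illustration, not the method of the patent: it places the product at the origin with the center line on the +z axis, treats positive x as the right side, and checks only the corner points of the obstruction outline (an edge that crosses the sector while all its corners lie outside would be missed). All names are hypothetical.

```python
import math


def clip_to_obstruction(half_left, half_right, r, corners):
    """Narrow the horizontal half-angles so the sector avoids obstruction corners.

    half_left, half_right: positive angles (radians) on either side of the
        center line, which is the +z axis with the product at the origin.
    r: gazeable distance (radius of the sector).
    corners: (x, z) points on the obstruction outline, in the same coordinates.
    """
    for x, z in corners:
        if math.hypot(x, z) > r:
            continue  # beyond the gazeable distance, so it does not matter
        angle = math.atan2(x, z)  # signed angle from the center line
        if 0 <= angle < half_right:
            half_right = angle    # obstruction on the right side
        elif -half_left < angle < 0:
            half_left = -angle    # obstruction on the left side
    return half_left, half_right
```

Only the side on which the obstruction lies is narrowed, which matches the correction from gaze area 1104 to gaze area 1105 in which only the right-hand angle changes.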

[0059] In S411, the gaze area correction unit 207 stores the gaze area corrected in S410 (corrected gaze area) in the gaze area storage unit 208. After that, the process in this processing flow, that is, the process until the processing device 1 sets and saves the gaze area, is terminated.

[0060] Next, the process by which the processing device 1 in this embodiment detects and counts people determined to be gazing at a product will be described in detail with reference to Fig. 5. Fig. 5 is a flowchart showing the flow of the process in the processing device 1 to detect and count people determined to be gazing at a product. Note that each operation (process) shown in the flowchart in Fig. 5 is realized by the CPU 101 of the processing device 1 executing a program stored in the ROM 102. As before, each process (step) is denoted by a number prefixed with S, and the word "step" is omitted.

[0061] In S501, the gaze determination unit 212 reads out the gaze area from the gaze area storage unit 208 and temporarily stores it in the RAM 103. Note that the gaze area read out by the gaze determination unit 212 from the gaze area storage unit 208 in S501 is the gaze area stored in the above-mentioned S411 (the corrected gaze area).

[0062] In S502, the video acquisition unit 203 acquires the video of the inside of the store from the imaging device 105 in units of frame images while associating the video with time information. The time information is at least one of a time stamp and a frame ID. The video acquisition unit 203 acquires the video with the same angle of view as the video acquired in the above-mentioned S404.

[0063] In S503, the person detection unit 209 detects a person (a person's area) within a frame image of the video acquired in S502. That is, the person detection unit 209 detects the area of a person within each of a plurality of frames. A specific method of person detection is, for example, a method using a CNN. Note that the person detection is not limited to a method using a CNN as long as the person area can be detected. In addition, in this embodiment, the whole body area is used as the target of person detection, but a part of the person's area, such as the upper body area, may be used.

[0064] The whole body area is expressed by the x and y coordinates of two points, the top-left and bottom-right corners of a rectangle surrounding the person, with the top-left corner of the frame image as the origin. In addition, the time information of the frame image is attached to each whole body area.

[0065] In S504, the person tracking unit 210 performs a tracking process to determine which person's whole body area detected in the previous frame (the frame image immediately before the latest frame image) corresponds to the whole body area detected in the current frame (the latest frame image). The person tracking unit 210 also assigns a person ID issued for each person to each whole body area.

[0066] Various methods can be used for the tracking process performed by the person tracking unit 210. One example is to match each whole body area in the previous frame with the whole body area in the current frame whose center position is closest to the center position of that previous-frame area. Any other method may be used as long as it can match the whole body areas of a person between frames, such as pattern matching that uses the whole body area of the previous frame as the matching pattern.
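The center-position matching mentioned above might look like the following minimal sketch. It assumes whole body areas are given as (x1, y1, x2, y2) rectangles as in paragraph [0064], matches greedily in detection order, and issues a new person ID when no previous area lies within `max_dist` pixels. The function names, the greedy strategy, and the `max_dist` threshold are assumptions made for illustration, not details taken from the embodiment.

```python
import math
from itertools import count


def center(box):
    """Center of a whole body area given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def track(prev_tracks, boxes, id_source, max_dist=80.0):
    """Assign person IDs to the current frame's boxes.

    prev_tracks: {person_id: box} from the previous frame.
    boxes: whole body areas detected in the current frame.
    id_source: iterator that issues new person IDs.
    """
    unmatched = dict(prev_tracks)
    result = {}
    for box in boxes:
        cx, cy = center(box)
        best_id, best_d = None, max_dist
        for pid, pbox in unmatched.items():
            px, py = center(pbox)
            d = math.hypot(cx - px, cy - py)
            if d <= best_d:
                best_id, best_d = pid, d
        if best_id is None:
            best_id = next(id_source)   # no nearby area in the previous frame: new person
        else:
            del unmatched[best_id]      # each previous area is matched at most once
        result[best_id] = box
    return result


ids = count(1)
frame1 = track({}, [(0, 0, 10, 20), (100, 0, 110, 20)], ids)
frame2 = track(frame1, [(102, 0, 112, 20), (1, 0, 11, 20), (300, 0, 310, 20)], ids)
```

A greedy match is the simplest choice; when people cross paths closely, a globally optimal assignment or the pattern matching alternative mentioned above may be more robust.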

[0067] In S505, the posture estimation unit 211 estimates the posture of the person from all the whole body regions in the frame images, and outputs the joint points corresponding to each whole body region in the form of a list. Here, a specific method of posture estimation is, for example, a method of estimating the three-dimensional coordinates of each joint point using CNN and calculating the reliability thereof. In addition, as another method, a method of first calculating the joint points on the two-dimensional coordinates and then estimating the positions of the joint points on the three-dimensional coordinates may be used. That is, the posture estimation unit 211 estimates the postures of all the people in a plurality of frame images, estimates or calculates the coordinates (positions) of each joint point from the posture estimation results, and outputs the information of the joint points in the form of a list. Note that the method is not limited to the above-mentioned method as long as it is a method capable of estimating the three-dimensional coordinates of the joint points.

[0068] One joint point list is created for each whole body region (person) included in a frame image, so a frame image containing multiple people yields multiple joint point lists. Each joint point list contains the time information of the frame image, the person ID, and the coordinates and reliability of all joint points of the person, arranged in a specific order.
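As an illustration of the joint point list described above, the record could be represented as below. The class names, field names, and the particular joint order are hypothetical; only the contents (time information, person ID, and per-joint coordinates and reliability in a fixed order) come from the description, and only joint points named in this section are listed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed fixed order; the embodiment only states that a specific order is used.
JOINT_ORDER = (
    "right_shoulder", "left_shoulder", "right_hip", "left_hip",
    "right_eye", "left_eye", "right_ear", "left_ear", "nose",
)


@dataclass
class JointPoint:
    xyz: Tuple[float, float, float]  # estimated three-dimensional coordinates
    reliability: float               # confidence of the estimate


@dataclass
class JointPointList:
    time_info: int            # time stamp or frame ID of the frame image
    person_id: int            # ID assigned by the tracking process
    joints: List[JointPoint]  # one entry per name in JOINT_ORDER

    def get(self, name: str) -> JointPoint:
        """Look up a joint point by name using the fixed order."""
        return self.joints[JOINT_ORDER.index(name)]


example = JointPointList(
    time_info=120,
    person_id=7,
    joints=[JointPoint((0.0, 1.0 + 0.1 * i, 2.0), 0.9) for i in range(len(JOINT_ORDER))],
)
```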

[0069] In S506, the gaze determination unit 212 determines whether or not a person present in the gaze area is gazing at a product (target object), based on the joint point list output by the posture estimation unit 211 in S505 and the gaze area read out in S501. If it is determined that the person is gazing at the product (YES in S506), the process proceeds to S507. On the other hand, if it is determined that the person is not gazing at the product (NO in S506), the process proceeds to S510.

[0070] The gaze determination in S506 is made on the premise that when a face is within the gaze area, the probability that the person is gazing is high. The accuracy of the coordinates of facial organ points, especially on the side not visible from the camera, is likely to decrease, and the gaze direction calculated from the position of the facial organ points as in the conventional technology is likely to have large errors, which makes the gaze determination accuracy worse. In contrast, the gaze area is fixed and stable at the position of the product, so gaze determination can be made with greater accuracy.

[0071] Specifically, for example, the average of the coordinates of five facial organ points, namely the right eye 313, the left eye 314, the right ear 315, the left ear 316 and the nose 317, is set as the center position of the face. The gaze determination unit 212 then determines whether the person is gazing at the product by determining whether the center position is within the gaze area (first determination) and determining whether the face is turned toward the product from the left-right positional relationship of the eyes and ears (second determination). That is, in the first determination, the gaze determination unit 212 determines whether the average of the coordinates of the right eye, the left eye, the right ear, the left ear and the nose of the person to be determined is within the gaze area. In the second determination, it determines whether the person is facing the target object, based on the positional relationship of the right eye, the left eye, the right ear and the left ear of the person to be determined.
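The two determinations can be sketched as follows. Several points here are assumptions for illustration only: the gaze area is simplified to a symmetric sector on the xz plane, the height (y) coordinate is ignored, and the second determination uses one possible heuristic, namely that the vector from the midpoint of the ears to the midpoint of the eyes points toward the product within `max_turn_deg`. The description does not specify how the eye and ear positions are compared, so `faces_product` and its threshold are hypothetical.

```python
import math


def face_center(points):
    """Average the 3-D coordinates of the facial organ points (eyes, ears, nose)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def in_gaze_area(center, product_xz, facing_deg, half_angle_deg, max_dist):
    """First determination: is the face center inside the sector in front of the product?"""
    dx = center[0] - product_xz[0]
    dz = center[2] - product_xz[1]
    dist = math.hypot(dx, dz)
    if dist == 0.0 or dist > max_dist:
        return False
    offset = (math.degrees(math.atan2(dz, dx)) - facing_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= half_angle_deg


def faces_product(r_eye, l_eye, r_ear, l_ear, product_xz, max_turn_deg=45.0):
    """Second determination (assumed heuristic): ear midpoint to eye midpoint points at the product."""
    eye_x, eye_z = (r_eye[0] + l_eye[0]) / 2.0, (r_eye[2] + l_eye[2]) / 2.0
    ear_x, ear_z = (r_ear[0] + l_ear[0]) / 2.0, (r_ear[2] + l_ear[2]) / 2.0
    fx, fz = eye_x - ear_x, eye_z - ear_z                   # facing direction on the xz plane
    tx, tz = product_xz[0] - ear_x, product_xz[1] - ear_z   # direction toward the product
    norm = math.hypot(fx, fz) * math.hypot(tx, tz)
    if norm == 0.0:
        return False
    cos_angle = max(-1.0, min(1.0, (fx * tx + fz * tz) / norm))
    return math.degrees(math.acos(cos_angle)) <= max_turn_deg


def is_gazing(joints, product_xz, facing_deg, half_angle_deg, max_dist):
    """Combine both determinations; `joints` maps point names to (x, y, z)."""
    names = ("right_eye", "left_eye", "right_ear", "left_ear", "nose")
    center = face_center([joints[n] for n in names])
    return (in_gaze_area(center, product_xz, facing_deg, half_angle_deg, max_dist)
            and faces_product(joints["right_eye"], joints["left_eye"],
                              joints["right_ear"], joints["left_ear"], product_xz))


# Assumed scene: product at the origin facing +z, person about 2 m in front of it.
toward = {"right_eye": (0.03, 1.6, 2.0), "left_eye": (-0.03, 1.6, 2.0),
          "right_ear": (0.08, 1.6, 2.1), "left_ear": (-0.08, 1.6, 2.1),
          "nose": (0.0, 1.55, 1.95)}
away = dict(toward, right_eye=(0.03, 1.6, 2.2), left_eye=(-0.03, 1.6, 2.2),
            nose=(0.0, 1.55, 2.25))
```

With these inputs, `toward` passes both determinations, while `away` passes the first (the face center is in the area) but fails the second because the eyes lie farther from the product than the ears.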

[0072] Furthermore, if facial organ points cannot be detected due to occlusion or other reasons, the center position may be estimated from other joint points such as the right shoulder 301, left shoulder 302, right hip 307, and left hip 308. For example, the center position of the face is determined to be a position extended from the center coordinates of both shoulders by a predetermined percentage of the length from the shoulder to the hip.
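The fallback estimate above can be written as a short vector calculation. This sketch assumes the extension is taken along the direction from the hip midpoint to the shoulder midpoint, and the default `ratio` of 0.35 is an arbitrary placeholder for the predetermined percentage, which is not given in the description. The function name is hypothetical.

```python
def estimate_face_center(r_shoulder, l_shoulder, r_hip, l_hip, ratio=0.35):
    """Estimate the face center by extending the torso line beyond the shoulders.

    All inputs are (x, y, z) joint coordinates. `ratio` is the assumed fraction
    of the shoulder-to-hip length by which the shoulder midpoint is extended.
    """
    shoulder_mid = [(a + b) / 2.0 for a, b in zip(r_shoulder, l_shoulder)]
    hip_mid = [(a + b) / 2.0 for a, b in zip(r_hip, l_hip)]
    torso = [s - h for s, h in zip(shoulder_mid, hip_mid)]  # hip midpoint to shoulder midpoint
    return tuple(s + ratio * t for s, t in zip(shoulder_mid, torso))


# Assumed upright person: shoulders at height 1.4, hips at height 0.9.
estimated = estimate_face_center((0.2, 1.4, 2.0), (-0.2, 1.4, 2.0),
                                 (0.15, 0.9, 2.0), (-0.15, 0.9, 2.0), ratio=0.4)
```

Because the extension follows the torso direction rather than a fixed axis, the estimate also tilts when the person leans forward or sideways.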

[0073] If there are multiple products, a corresponding gaze area is set for each product, and gaze determination is performed for each gaze area. That is, the gaze area setting unit 205 sets, for each target object, a gaze area in which the target object can be gazed at, based on the characteristic information acquired by the characteristic acquisition unit 204. Furthermore, when multiple gaze areas are set, the gaze determination unit 212 determines, for each set gaze area, whether a person present in the gaze area is gazing at the product (target object).

[0074] Fig. 12 is a diagram showing multiple gaze areas set for multiple products as viewed on the xz plane. In Fig. 12, product 1201 and product 1202 are arranged side by side. Gaze area 1203 is the gaze area set for product 1201. Gaze area 1204 is the gaze area set for product 1202. Center line 1205, shown by a dotted line, is the center line of gaze area 1203. Center line 1206 is the center line of gaze area 1204. Gaze area 1203 and gaze area 1204 overlap in area (overlapping area) 1207.

[0075] Here, when the center position of the face of the above-mentioned person is within overlapping area 1207, the gaze determination unit 212 determines that the person is gazing at the product whose center line (center line 1205 or center line 1206) is closer to that center position. In other words, when the center position of the person is in an area where one gaze area and another gaze area partially overlap, the gaze determination unit 212 determines which product (target object) the person is gazing at, based on the center lines (center position information) of the respective gaze areas.
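The center-line rule for an overlapping area can be illustrated as below. The sketch assumes each center line is a ray on the xz plane that starts at the product position and runs in the direction the gaze area faces, and that closeness is measured as the distance from the face center to that ray. The function names and the data layout of `candidates` are hypothetical.

```python
import math


def distance_to_center_line(point_xz, origin_xz, facing_deg):
    """Distance from a point to the ray that starts at the product and follows the center line."""
    dx, dz = point_xz[0] - origin_xz[0], point_xz[1] - origin_xz[1]
    ux, uz = math.cos(math.radians(facing_deg)), math.sin(math.radians(facing_deg))
    along = dx * ux + dz * uz
    if along <= 0.0:
        return math.hypot(dx, dz)   # behind the product: distance to the ray's start point
    return abs(dx * uz - dz * ux)   # perpendicular distance to the line


def resolve_overlap(face_center_xz, candidates):
    """Pick the product whose center line is closest to the face center.

    candidates: {product_id: (origin_xz, facing_deg)} for every gaze area
    that contains the face center. Ties go to the first candidate.
    """
    return min(
        candidates,
        key=lambda pid: distance_to_center_line(face_center_xz, *candidates[pid]),
    )


# Assumed layout: two products side by side, both facing +z.
areas = {1201: ((0.0, 0.0), 90.0), 1202: ((1.0, 0.0), 90.0)}
chosen_left = resolve_overlap((0.3, 1.5), areas)
chosen_right = resolve_overlap((0.8, 1.5), areas)
```

A variant that also reports a comparison between products could return every candidate whose distance falls under a threshold instead of only the minimum.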

[0076] Incidentally, since it is conceivable that two products are being compared, the gaze determination unit 212 may determine that both product 1201 and product 1202 are being gazed at simultaneously. Also, it may determine that the product closer to the center line is being gazed at, and also determine that the two products are being compared. Incidentally, the above is an example in which two gaze regions are set for two products, but the same can be applied to the case in which multiple gaze regions are set for three or more products.

[0077] In S507, the person number measuring unit 213 measures (calculates) the gaze time, that is, the time elapsed since the time information at which the person was first determined to be gazing. The gaze time is then temporarily stored in the RAM 103.

[0078] In S508, the person number measuring unit 213 reads out the gaze time temporarily stored in S507 and determines whether the gaze time has reached a predetermined time. That is, when the gaze determination unit 212 determines that a person in the gaze area is gazing at the target object, the person number measuring unit 213 determines whether the predetermined time has been reached, based on information on the time for which the person has been gazing at the target object. If it is determined that the gaze time has reached the predetermined time (YES in S508), the person is determined to be interested in the target object, and the process proceeds to S509. On the other hand, if it is determined that the gaze time has not reached the predetermined time (NO in S508), the person is determined not to be interested in the target object, and the process proceeds to S510. Note that the predetermined time is set in advance to an arbitrary value.

[0079] In S509, the person number measuring unit 213 counts (measures) the person who is determined by the gaze determining unit 212 to be gazing at the target object as a person who is interested (shows interest) in the product.
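Steps S507 to S509 can be sketched as a small per-person timer. This is an illustrative sketch only: the class name `InterestCounter` is hypothetical, time information is assumed to be a time stamp in seconds, each person is counted once, and an interruption of the gaze is assumed to reset the timer, a behavior the description does not specify.

```python
class InterestCounter:
    """Counts people whose gaze time has reached a predetermined time."""

    def __init__(self, threshold_sec):
        self.threshold_sec = threshold_sec
        self.first_gaze = {}     # person_id -> time first determined to be gazing
        self.interested = set()  # person IDs already counted as interested

    def update(self, person_id, timestamp, gazing):
        """Process one gaze determination; return True if the person is counted as interested."""
        if not gazing:
            # Assumption: a break in gazing restarts the measurement.
            self.first_gaze.pop(person_id, None)
            return False
        start = self.first_gaze.setdefault(person_id, timestamp)
        if timestamp - start >= self.threshold_sec:  # corresponds to YES in S508
            self.interested.add(person_id)           # corresponds to counting in S509
            return True
        return False

    @property
    def count(self):
        return len(self.interested)


counter = InterestCounter(threshold_sec=3.0)
results = [
    counter.update(1, 0.0, True),
    counter.update(1, 2.0, True),
    counter.update(1, 3.0, True),
    counter.update(2, 0.0, True),
    counter.update(2, 1.0, False),
    counter.update(2, 3.5, True),
]
```

Storing the first gaze time rather than an accumulated duration keeps the state per person to a single value, at the cost of not crediting gaze time from before an interruption.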

[0080] In S510, it is determined whether all processing has been completed for all people (whole body regions) included in the current frame image. If it is determined that all processing has not been completed for all people (whole body regions) (NO in S510), the process returns to S505 and similar processing is performed. On the other hand, if it is determined that all processing has been completed for all people (whole body regions) (YES in S510), the process proceeds to S511.

[0081] In S511, the display unit 214 presents the count result of S509 to the user via the display device 107 (displays it on the screen of the display device 107). After that, this processing flow ends. During this processing, for example, a message (notification) such as "There are 5 people interested in the product" is displayed on the screen of the display device.

[0082] In this embodiment, the processing result is notified to the user, but further processing such as creating statistical information may be performed. Alternatively, a mobile terminal such as a tablet or smartphone may be carried by a store clerk or staff member working in the store, and a message such as "Someone is interested in the product" may be notified to the store clerk carrying the mobile terminal when the gaze time reaches a predetermined time in S508. By notifying the store clerk in this manner, it is possible to prompt the store clerk to explain the product to the target person (the person gazing at the product).

[0083] The above is the flow of processing by the processing device 1 in this embodiment. Note that, although the above describes the flow of processing up to the point where the number of people who are interested in the product is counted, all steps from S502 onwards are always repeated until the processing device 1 is terminated.

[0084] As described above, the processing device 1 in this embodiment can set a gaze area defined by the angle and distance at which a person can gaze at a commodity (target object). Since the gaze area set by the processing device 1 in this embodiment is an area fixed at the position of the commodity, it is possible to perform gaze determination with high accuracy (improve the determination accuracy of a person's gaze behavior).

[0085] In this embodiment, all the functions are incorporated in one device. However, the present invention is not limited to this. For example, the video acquired from the video acquisition unit 203 is transmitted to the cloud, and the gaze area setting unit 205 performs processing on the cloud to set the gaze area and store it in a storage unit on the cloud. After that, the information on the gaze area set on the cloud is transmitted to a processing device formed of a PC or the like, and the above-mentioned gaze determination processing is performed, and the number of people determined to be gazing at the product is counted and presented to the user.

[0086] In addition, the processing device 1 in this embodiment may be linked or combined with measurement of the time a person spends standing in front of a product or analysis of the behavior of reaching out for a product to form part of a system for analyzing the customer's level of interest in a product.

[0087] The object of the present invention may be achieved by the following method. A recording medium (or storage medium) on which program code of software for realizing the functions of the above-mentioned embodiments is recorded is supplied to a system or device. Then, a computer (or a CPU, MPU, or GPU) of the system or device reads and executes the program code stored in the recording medium. In this case, the program code itself read from the recording medium realizes the functions of the above-mentioned embodiments, and the recording medium on which the program code is recorded constitutes the present invention. It can also be realized by a circuit (e.g., ASIC) that realizes one or more functions.

[0088] Furthermore, the functions of the above-described embodiments are not only realized by the computer reading and executing the program code, but also include an operating system (OS) running on the computer performing all or part of the actual processing based on the instructions of the program code.

[0089] Furthermore, the functions of the above-described embodiments may be realized by the following method: Program code read from a recording medium is written into a memory provided in a function expansion card inserted into a computer or a function expansion unit connected to a computer. Then, based on the instructions of the program code, a CPU provided in the function expansion card or function expansion unit performs some or all of the actual processing.

[0090] When the present invention is applied to the above-mentioned recording medium, the recording medium stores program code corresponding to the flowcharts described above.

[0091] Although the preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications and changes are possible within the scope of the gist of the present invention.

[0092] The disclosure of this embodiment includes the following configuration, method, and program.

[0093] (Configuration 1) A processing device comprising: video acquisition means for acquiring a video of a location where a target object is placed; characteristic acquisition means for acquiring characteristic information relating to an angle and a distance at which the target object can be gazed at; area setting means for setting, based on the characteristic information acquired by the characteristic acquisition means, a gaze area in which the target object can be gazed at; and determination means for determining whether a person present in the gaze area set by the area setting means is gazing at the target object, based on information on joint points of the person.

[0094] (Configuration 2) The processing device according to configuration 1, further comprising correction means for correcting the gaze area, based on height information of the person, so that it conforms to the position of the target object in the video.

[0095] (Configuration 3) The processing device according to configuration 1 or 2, wherein the gaze area is a partial area in a three-dimensional space determined by a distance from the target object to the person and an angle of the person relative to the target object.

[0096] (Configuration 4) The processing device according to configuration 2, further comprising arrangement information acquisition means for acquiring arrangement information relating to the arrangement of the target object.

[0097] (Configuration 5) The processing device according to configuration 4, wherein the correction means corrects the gaze area based on the arrangement information.

[0098] (Configuration 6) The processing device according to configuration 4 or 5, wherein the arrangement information includes at least one of information on a position where the target object is placed, lighting conditions, and information on the positions of obstructions around the target object.

[0099] (Configuration 7) The processing device according to configuration 6, wherein the correction means corrects the gaze area based on the position of the obstruction so that the gaze area does not overlap with the position of the obstruction.

[0100] (Configuration 8) The processing device according to any one of configurations 1 to 7, wherein the video acquisition means acquires the video as a plurality of frames, the processing device further comprising: area detection means for detecting an area of a person within each of the plurality of frames; and estimation means for estimating a posture from the area of the person detected by the area detection means and estimating information on joint points of the person based on the posture.

[0101] (Configuration 9) The processing device according to any one of configurations 1 to 8, characterized in that the information on the joint points includes coordinate information of the person's right eye, left eye, right ear, left ear, and nose in addition to coordinate information of each joint of the person.

[0102] (Configuration 10) The processing device according to configuration 9, characterized in that the determination means determines whether the person is gazing at the target object by a first determination of whether a center position, which is the average of the coordinates of the person's right eye, left eye, right ear, left ear, and nose, is within the gaze area, and a second determination of whether the person is facing the target object based on the positional relationship between the person's right eye, left eye, right ear, and left ear.

[0103] (Configuration 11) The processing device according to configuration 10, characterized in that, when there are multiple target objects and gaze areas, and the center position of the person is located within an area where one gaze area partially overlaps with another gaze area, the determination means determines which target object the person is gazing at based on the center position information of each of the gaze areas.

[0104] (Configuration 12) The processing device according to any one of configurations 1 to 11, further comprising interest determination means for determining, when the determination means determines that a person in the gaze area is gazing at the target object, whether the person is interested in the target object based on information about the time for which the person has been gazing at the target object.

[0105] (Configuration 13) The processing device according to configuration 12, further comprising: measuring means for measuring the number of people who are determined by the interest determination means to be interested in the target object; and display means for displaying the number of people measured by the measuring means on a display device.

[0106] (Configuration 14) The processing device according to any one of configurations 1 to 13, characterized in that the characteristic information includes any one of a size of a character on the target object, a contrast difference between a background and a character, a font used, and a shape of the target object.

[0107] (Configuration 15) A method for controlling a processing device, comprising: a video acquisition step of acquiring a video of a location where a target object is placed; a characteristic acquisition step of acquiring characteristic information relating to an angle and a distance at which the target object can be gazed at; an area setting step of setting, based on the characteristic information acquired in the characteristic acquisition step, a gaze area in which the target object can be gazed at; and a determination step of determining whether a person present in the gaze area set in the area setting step is gazing at the target object, based on information on joint points of the person.

[0108] (Configuration 16) A program for causing a computer to operate as the processing device according to any one of configurations 1 to 14.

[Explanation of symbols]

[0109] 201 Area setting section
202 Gazing person detection unit
203 Video acquisition unit

Claims

1. A processing device comprising: video acquisition means for acquiring a video of a location where a target object is placed; characteristic acquisition means for acquiring characteristic information relating to an angle and a distance at which the target object can be gazed at; area setting means for setting, based on the characteristic information acquired by the characteristic acquisition means, a gaze area in which the target object can be gazed at; and determination means for determining whether a person present in the gaze area set by the area setting means is gazing at the target object, based on information on joint points of the person.

2. The processing device according to claim 1, further comprising correction means for correcting the gaze area, based on height information of the person, so that it conforms to the position of the target object in the video.

3. The processing device according to claim 1, wherein the gaze area is a partial area in a three-dimensional space determined by a distance from the target object to the person and an angle of the person relative to the target object.

4. The processing device according to claim 2, further comprising arrangement information acquisition means for acquiring arrangement information relating to an arrangement of the target object.

5. The processing device according to claim 4, wherein the correction means corrects the gaze area based on the arrangement information.

6. The processing device according to claim 4, wherein the arrangement information includes at least one of information on a position where the target object is placed, lighting conditions, and information on the positions of obstructions around the target object.

7. The processing device according to claim 6, wherein the correction means corrects the gaze area based on the position of the obstruction so that the gaze area does not overlap the position of the obstruction.

8. The processing device according to claim 1, wherein the video acquisition means acquires the video as a plurality of frames, the processing device further comprising: area detection means for detecting an area of a person within each of the plurality of frames; and estimation means for estimating a posture from the area of the person detected by the area detection means and estimating information on joint points of the person based on the posture.

9. The processing device according to claim 1, wherein the information on the joint points includes coordinate information of the person's right eye, left eye, right ear, left ear, and nose in addition to coordinate information on each joint of the person.

10. The processing device according to claim 9, characterized in that the determination means determines whether the person is gazing at the target object by a first determination of whether a center position, which is the average of the coordinates of the person's right eye, left eye, right ear, left ear, and nose, is within the gaze area, and a second determination of whether the person is facing the target object based on the positional relationship of the person's right eye, left eye, right ear, and left ear.

11. The processing device according to claim 10, characterized in that when there are multiple target objects and gaze areas and the center position of the person is located within an area where one gaze area partially overlaps with another gaze area, the determination means determines which target object the person is gazing at based on center position information of each of the gaze areas.

12. The processing device according to claim 1, further comprising an interest determination means for determining, when the determination means determines that a person in the gaze area is gazing at the target object, whether the person is interested in the target object based on information about the time the person is gazing at the target object.

13. The processing device according to claim 12, further comprising: measuring means for measuring the number of people who are determined by the interest determination means to be interested in the target object; and display means for displaying the number of people measured by the measuring means on a display device.

14. The processing device according to claim 1, wherein the characteristic information includes any one of a size of characters on the target object, a contrast difference between the background and the characters, a font used, and a shape of the target object.

15. A method for controlling a processing device, comprising: a video acquisition step of acquiring a video of a location where a target object is placed; a characteristic acquisition step of acquiring characteristic information relating to an angle and a distance at which the target object can be gazed at; an area setting step of setting, based on the characteristic information acquired in the characteristic acquisition step, a gaze area in which the target object can be gazed at; and a determination step of determining whether a person present in the gaze area set in the area setting step is gazing at the target object, based on information on joint points of the person.

16. A program for causing a computer to operate as the processing device according to any one of claims 1 to 14.