Action recognition device, action recognition method, and program

By combining room layout information of a building with image sensor location information, and using CNN and other classifiers to calculate feature quantities, the problem of insufficient accuracy in action recognition within buildings in existing technologies is solved, and high-precision recognition of space-dependent actions is achieved.

CN115803790B · Active · Publication Date: 2026-06-30 · PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA
Filing Date
2021-04-01
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies fail to effectively consider the room layout information of buildings when identifying human actions inside buildings, resulting in insufficient recognition accuracy, especially for space-dependent actions such as cooking activities in the kitchen.

Method used

By acquiring room layout information of the building and location information of image sensors, candidate actions are selected, and feature quantities of image data are calculated using convolutional neural networks (CNN) and other classifiers. The candidate actions are identified by combining the feature quantities, taking into account the spatial relationships within the building.

Benefits of technology

It achieves high-precision recognition of space-dependent actions within buildings, improves recognition accuracy, simplifies the composition of action recognition devices, and reduces processing load.

✦ Generated by Eureka AI based on patent content.

Abstract

The action recognition device of the present invention selects candidate actions from object actions based on the location information and room layout information of the image sensor, acquires image data detected by the image sensor, determines one or more recognizers corresponding to the candidate actions, calculates the feature quantity of the image data using one or more recognizers, and identifies the candidate actions based on the feature quantity.

Description

Technical Field

[0001] This invention relates to a technology for recognizing a user's actions within a building.

Background Technology

[0002] In recent years, research has been advancing on recognizing human actions based on moving images. For example, Patent Document 1 discloses a technique that detects a human region from a moving image and recognizes human actions based on the combination of the posture of the person mapped onto the region and surrounding objects.

[0003] For example, Patent Document 2 discloses a technique that extracts skeleton information based on human joints from dynamic images in a time series, extracts the surrounding region of the skeleton information, and identifies human actions based on the extracted surrounding region.

[0004] However, the technologies in Patent Document 1 and Patent Document 2 do not take into account the room layout information of the building, and improvements are necessary to accurately identify human actions that depend on the space within the building.

[0005] Existing technical documents

[0006] Patent documents

[0007] Patent Document 1: Japanese Patent Publication No. 2018-206321

[0008] Patent Document 2: Japanese Patent Publication No. 2019-144830

Summary of the Invention

[0009] An embodiment of the present invention relates to an action recognition device for recognizing human actions in a building, comprising: a first acquisition unit for acquiring object action information including one or more object actions as predetermined recognition objects, room layout information of the building, and location information of an image sensor installed in the building; an action selection unit for selecting candidate actions as recognition candidates from the one or more object actions included in the object action information based on the location information of the image sensor and the room layout information; a second acquisition unit for acquiring image data detected by the image sensor; a first action recognition unit for determining one or more recognizers corresponding to the candidate actions and calculating feature quantities of the image data using the one or more recognizers; a second action recognition unit for recognizing the candidate actions based on the feature quantities; and an output unit for outputting the recognition result of the second action recognition unit.

[0010] According to the present invention, human actions dependent on the space within a building can be identified with high precision.

Attached Figure Description

[0011] Figure 1 is a block diagram illustrating an example of the configuration of an action recognition device according to an embodiment.

[0012] Figure 2 is an illustration of the CNN that constitutes the recognizer.

[0013] Figure 3 is a schematic diagram illustrating an example of the configuration of the first action recognition unit.

[0014] Figure 4 is a schematic diagram illustrating an example of the structure of an action selection table.

[0015] Figure 5 is a schematic diagram illustrating an example of the structure of a recognizer selection table.

[0016] Figure 6 is a schematic diagram illustrating an example of a weight table referenced by the combiner when setting weight coefficients for feature quantities.

[0017] Figure 7 is a flowchart illustrating an example of the process by which the action recognition device according to the embodiment generates list information.

[0018] Figure 8 is a flowchart detailing the processing of step S106 in Figure 7.

[0019] Figure 9 is a flowchart illustrating an example of action recognition processing by the action recognition device.

[0020] Figure 10 is a schematic diagram representing the data structure of a prior knowledge table that summarizes information related to the recognizers.

[0021] Figure 11 is a table summarizing the processing time per frame when the existing recognizers are executed separately.

[0022] Figure 12 is a table summarizing the input image data.

[0023] Figure 13 is a table summarizing the results of simulations conducted to evaluate the recognition accuracy of the action recognition device.

[0024] Figure 14 is a block diagram illustrating an example of the configuration of an action recognition device according to a variation of the present invention.

[0025] Figure 15 is a block diagram illustrating an example of the configuration of the action recognition device according to modified example (3) of the present invention.

[0026] Figure 16 is a schematic diagram illustrating a scenario where an image sensor is installed in the entryway.

[0027] Figure 17 is a schematic diagram illustrating an example of interaction between a user and a display terminal in a scenario where an image sensor is installed in the entryway.

[0028] Figure 18 is a diagram illustrating the interactions that follow Figure 17.

[0029] Figure 19 is a schematic diagram illustrating an example of a setup screen on which a labeled image is overlaid.

[0030] Figure 20 is a schematic diagram illustrating a scenario where an image sensor is installed in a kitchen.

[0031] Figure 21 is a schematic diagram illustrating an example of interaction between a user and a display terminal in a scenario where an image sensor is installed in a kitchen.

[0032] Figure 22 is a diagram illustrating the interactions that follow Figure 21.

[0033] Figure 23 is a schematic diagram illustrating an example of a setup screen on which a labeled image is overlaid.

[0034] Figure 24 is a schematic diagram illustrating an example of the interaction between the user and the display terminal when correcting a labeled image after the image sensor has been installed.

[0035] Figure 25 is a schematic diagram illustrating an example of a setup screen on which a labeled image is overlaid.

Detailed Implementation

[0036] Insights underlying the present invention

[0037] Methods have previously been proposed for inferring human actions from moving images or still images used as sensor data. These methods label the actions to be identified in advance and determine which label the sensor data corresponds to. When the sensor data is moving images or still images, high recognition accuracy can be achieved by using deep neural networks (DNNs) with convolutional and pooling layers. However, in current DNN-based methods, the actions to be identified are actions such as walking, which are expressed by body movement alone and are independent of space. Therefore, to identify space-dependent actions, the DNN must be trained from scratch with spatially relevant information taken into account, which is cumbersome and costly.

[0038] In Patent Document 1, the human action is identified by considering the combination of the person's posture and the objects around the person. However, when there are objects around the person that are unrelated to the person's action, these objects do not become useful information in inferring the person's action. Therefore, further improvements to Patent Document 1 are needed to identify space-dependent human actions.

[0039] Patent Document 2 infers human movement from the region surrounding the skeleton information, so the actions it targets are evidently assumed to be independent of the background. Therefore, in order to infer actions that are highly dependent on space, such as cooking in a kitchen, further improvements to Patent Document 2 are needed.

[0040] Here, the inventors of the present invention have come to the insight that if the room layout information of a building is taken into account, the actions of people dependent on the space within the building can be identified with high accuracy, and thus the various embodiments of the present invention shown below were conceived.

[0041] An embodiment of the present invention relates to an action recognition device for recognizing human actions in a building, comprising: a first acquisition unit for acquiring object action information including one or more object actions as predetermined recognition objects, room layout information of the building, and location information of an image sensor installed in the building; an action selection unit for selecting candidate actions as recognition candidates from the one or more object actions included in the object action information based on the location information of the image sensor and the room layout information; a second acquisition unit for acquiring image data detected by the image sensor; a first action recognition unit for determining one or more recognizers corresponding to the candidate actions and calculating feature quantities of the image data using the one or more recognizers; a second action recognition unit for recognizing the candidate actions based on the feature quantities; and an output unit for outputting the recognition result of the second action recognition unit.

[0042] Based on this configuration, candidate actions are selected from object actions using location information from image sensors and room layout information, and the recognition result for each candidate action is calculated. Therefore, it is possible to accurately recognize human actions dependent on the space within a building.

[0043] Furthermore, according to this configuration, the first action recognition unit determines a recognizer corresponding to a candidate action and calculates feature quantities of the image data using the determined recognizer. The second action recognition unit then identifies the candidate action based on the calculated feature quantities. Therefore, existing recognizers can be used as the recognizer, making it easy to construct an action recognition device. Moreover, because feature quantities are calculated using one or more recognizers corresponding to the candidate action, feature quantities suitable for recognizing candidate actions can be calculated, improving the accuracy of object action recognition.

[0044] In the action recognition device, the first action recognition unit may determine multiple recognizers when the candidate action is a specified action, and the second action recognition unit may combine the feature quantities calculated by each of the multiple recognizers and recognize the candidate action based on the combined feature quantities.

[0045] According to this configuration, multiple recognizers are determined when the candidate action is a specified action. For example, when the specified action depends on objects located around a person, feature quantities can be calculated both by a recognizer that recognizes the person and by a recognizer that recognizes the object, which can improve the recognition accuracy of candidate actions.

[0046] Furthermore, according to this configuration, the feature quantities calculated by each recognizer are combined, and the candidate action is identified based on the combined feature quantities. Therefore, when multiple feature quantities are calculated by multiple recognizers, they can be combined into a single feature quantity and input to the second action recognition unit, which simplifies the configuration of the action recognition device.

[0047] In the action recognition device, the specified action can also be sweeping, brushing teeth, cooking, washing clothes, using a computer, reading, or eating.

[0048] Based on this configuration, feature quantities of actions such as sweeping or brushing teeth, which depend on objects located around the person, can be calculated with high precision from image data.

[0049] In the action recognition device, each recognizer may be constructed using a convolutional neural network (hereinafter referred to as CNN), and the second action recognition unit may use a classifier employing one of the following methods to recognize the candidate action: logistic regression, support vector machine, decision tree, random forest, k-nearest neighbor method, Gaussian Naive Bayes, perceptron, and stochastic gradient descent.

[0050] Based on this configuration, since each recognizer is constructed using a CNN and the second action recognition unit is constructed using a classifier such as logistic regression, the second action recognition unit can be constructed using a classifier with a lower processing cost than the first action recognition unit.

[0051] Alternatively, in the action recognition device, the second action recognition unit may identify the candidate action using a classifier trained by machine learning with the feature quantity as the explanatory variable and the object action as the objective variable.

[0052] Based on this configuration, a second action recognition unit can be constructed using a classifier generated through machine learning, which uses the feature quantities calculated by each recognizer as explanatory variables and the candidate actions corresponding to those feature quantities as objective variables. For example, if a classifier with a lower processing load than a CNN, such as logistic regression, is used, the classifier can be trained in a shorter time. Furthermore, by using an existing recognizer constructed from a CNN for the first action recognition unit, and training only the second action recognition unit, an action recognition device can be constructed without requiring the first action recognition unit to undergo machine learning.

[0053] In the action recognition device, the second action recognition unit may use a weighting coefficient predetermined according to the candidate action to weight each feature quantity, and identify the candidate action based on the weighted feature quantity.

[0054] According to this configuration, since each feature quantity is weighted according to the candidate action and candidate actions are identified based on the weighted feature quantities, candidate actions can be correctly identified.

[0055] In the aforementioned action recognition device, one or more objects installed in the building may be extracted from the room layout information, and each of the one or more objects may be classified as a movable first object, a second object that is a water-related facility, or a third object that is a structure of the building. For each of the one or more objects, a room layout feature quantity that associates classification information representing the classification result with the installation position may be extracted, and an action selection table may be generated based on the room layout feature quantity. The action selection table associates one or more spaces of the building with the object actions corresponding to each space, and the first acquisition unit may acquire the action selection table as the object action information.

[0056] Based on this configuration, a room layout feature quantity that associates classification information with an installation position is extracted from the room layout information for each object installed in the building. Therefore, it is possible to determine what objects are placed in each space within the building, which yields information useful for creating an action selection table. Furthermore, an action selection table that associates each space in the building with the object actions corresponding to that space is generated based on the room layout feature quantities, and this action selection table is used as the object action information. Therefore, the relationship between each space and the object actions can be quickly understood, making it easy to select candidate actions.

[0057] Alternatively, the action recognition device may be communicatively connected to a display terminal. The action recognition device may also include a setting support unit, which obtains, via the display terminal, the name of the space in the building where the image sensor is installed and outputs setting guidance to the display terminal. The setting guidance prompts the user to install the image sensor such that specific equipment or facilities associated with the space are included within the field of view of the image sensor.

[0058] According to this configuration, because setting guidance is output to the display terminal so that a specific device or facility related to the space is included within the image sensor's field of view, the user can install the image sensor properly. Furthermore, because the name of the building's space is obtained through the display terminal, the image sensor can be associated with the space where it is installed. As a result, candidate actions can be easily selected by referring to the image sensor's installation location and the action selection table.
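
The classification and table generation described in [0055] and [0056] can be sketched as follows. This is a minimal illustration only: the object vocabularies, the `OBJECT_TO_ACTIONS` rule, and all function names are assumptions introduced for the example, and the text does not state how object actions are actually derived from the room layout feature quantities.

```python
# Hypothetical object vocabularies for the three classes named in the text.
FIRST_OBJECTS = {"dining table", "sofa", "television"}  # movable objects
SECOND_OBJECTS = {"sink", "washbasin", "bathtub"}       # water-related facilities
THIRD_OBJECTS = {"door", "window", "stairs"}            # building structures

# Hypothetical rule linking an object to the actions it suggests.
OBJECT_TO_ACTIONS = {
    "sink": ["cooking"],
    "dining table": ["eating"],
    "washbasin": ["brushing teeth"],
}


def classify_object(name):
    """Return 'first', 'second', 'third', or 'unknown' for an object name."""
    if name in FIRST_OBJECTS:
        return "first"
    if name in SECOND_OBJECTS:
        return "second"
    if name in THIRD_OBJECTS:
        return "third"
    return "unknown"


def extract_layout_features(objects):
    """Associate each (object, space) pair with its classification."""
    return [
        {"object": name, "class": classify_object(name), "space": space}
        for name, space in objects
    ]


def build_action_selection_table(features):
    """Map each space to the actions suggested by the objects placed in it."""
    table = {}
    for feature in features:
        for action in OBJECT_TO_ACTIONS.get(feature["object"], []):
            actions = table.setdefault(feature["space"], [])
            if action not in actions:
                actions.append(action)
    return table


objects = [
    ("sink", "kitchen"),
    ("dining table", "kitchen"),
    ("washbasin", "washroom"),
    ("door", "entryway"),
]
features = extract_layout_features(objects)
table = build_action_selection_table(features)
```

In this sketch a space appears in the table only if at least one of its objects suggests an action, so the entryway (which contains only a door) has no entry.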

[0059] In the action recognition device, the setting support unit may acquire image data captured by the image sensor, detect the specific device or the specific facility contained in the image data, and overlay a labeled image representing the detection result of the specific device or the specific facility onto the image shown in the image data.

[0060] According to this configuration, a labeled image, representing the detection result of a specific device or facility detected from the image data, is overlaid on the image shown in the image data. Therefore, it is easy to confirm whether the specific device or facility has been correctly detected from the image data.

[0061] In the aforementioned action recognition device, the setting support unit may also acquire a correction instruction for the labeled image through the display terminal and store the labeled information shown in the corrected labeled image into a memory.

[0062] According to this configuration, because the correction instruction for the labeled image is obtained through the display terminal, the labeled information can be corrected to accurately represent the location of the specific device or facility if, for example, the recognizer fails to detect it correctly. Furthermore, the location of the specific device or facility on the image, which is important for the recognizer in recognizing the object action, can be determined.

[0063] This invention can also be implemented as an action recognition method that causes a computer to execute the characteristic components included in the action recognition device, and as an action recognition program that causes a computer to execute the characteristic components. Furthermore, needless to say, the action recognition program can be distributed via a computer-readable non-transitory recording medium such as a CD-ROM or a communication network such as the Internet.

[0064] Furthermore, the embodiments described below are specific examples of the present invention. The numerical values, shapes, constituent elements, steps, and order of steps shown in the following embodiments are merely specific examples and are not intended to limit the present invention. Moreover, among the constituent elements in the following embodiments, those not described in the independent claims representing the highest-level concept are described as arbitrary constituent elements. Furthermore, the contents of all embodiments can be combined arbitrarily.

[0065] Implementation

[0066] Computer-aided action recognition methods are categorized into three types based on the relationship between the person and their surrounding space. The first type identifies actions unrelated to the person's spatial location. For example, actions like walking or standing are represented solely by the person's movements and are independent of their spatial context. The second type identifies actions influenced by objects within or near the person's space. For instance, the action of riding a bicycle can be recognized by combining the detection of the person's posture with the detection of objects. When recognizing actions outdoors or where there is interference from objects, it's difficult to have prior knowledge of building structures or road conditions. Therefore, actions can be identified by utilizing both the person and objects in their vicinity. The third type identifies actions by considering both the person and their surrounding space. For example, the action of starting to cook in the kitchen can be recognized by the person's presence in the kitchen and their orientation towards the microwave.

[0067] Therefore, actions such as "starting to cook," which are difficult to identify without utilizing information about the space in which a person is located, cannot be identified using conventional identification methods, namely the first and second types. In this embodiment, a method is proposed to identify human actions using information about the person and the space in which they are located, namely room layout information. This embodiment will now be described with reference to the accompanying drawings. In this embodiment, the human actions to be identified include, in addition to actions such as starting to cook or cleaning, actions that do not involve physical movement, such as lying down while watching television.

[0068] Figure 1 is a block diagram illustrating an example of the configuration of the action recognition device 1 according to the embodiment. The action recognition device 1 is a device for recognizing the actions of a user within the user's residence (an example of a building).

[0069] The action recognition device 1 is composed of a computer equipped with, for example, a processor, memory, and interface circuitry. The action recognition device 1 does not necessarily need to be implemented by a single computer; it can also be implemented using a distributed processing system (not shown) that includes a terminal device and a server. For example, the action recognition device 1 could also employ a configuration where a terminal device housing the memory for storing image data 204 is located in a residence, and a server houses some or all of the modules constituting the processor 100. This implementation will be described in later variations.

[0070] The action recognition device 1 includes a processor 100 and a memory 200. The memory 200 is constructed using a non-volatile storage device such as an SSD or an HDD, and stores object action information 201, room layout information 202, location information 203, image data 204, and list information 205. The memory 200 only needs to store the image data 204 that the second acquisition unit 103 has acquired from the image sensor 2 at a predetermined frame rate over a certain period (e.g., 1 minute) going back from the present.
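
The retention rule in [0070], where only frames from the most recent period need to be kept, can be sketched with a time-bounded buffer. The class name `FrameBuffer`, its methods, and the use of floating-point timestamps in seconds are assumptions for illustration; the text does not describe how the memory 200 manages old frames.

```python
from collections import deque


class FrameBuffer:
    """Keeps only frames captured within the last `retention_seconds`."""

    def __init__(self, retention_seconds=60.0):
        self.retention_seconds = retention_seconds
        self._frames = deque()  # (timestamp, frame) pairs, oldest first

    def add(self, timestamp, frame):
        """Store a new frame and discard frames older than the window."""
        self._frames.append((timestamp, frame))
        cutoff = timestamp - self.retention_seconds
        while self._frames and self._frames[0][0] < cutoff:
            self._frames.popleft()

    def __len__(self):
        return len(self._frames)

    def oldest_timestamp(self):
        """Return the timestamp of the oldest retained frame, or None."""
        return self._frames[0][0] if self._frames else None


# One frame per second for two minutes; only the last minute is retained.
buffer = FrameBuffer(retention_seconds=60.0)
for second in range(121):
    buffer.add(float(second), b"")
```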

[0071] Image sensor 2 is, for example, constructed using a camera installed inside a residence. Image sensor 2 acquires image data 204 by capturing images of the space inside the residence at a predetermined frame rate, and inputs the acquired image data 204 into a second acquisition unit 103. Multiple image sensors 2 may also be used. Image data 204 may be, for example, either color image data or black-and-white image data.

[0072] The processor 100 is composed of circuits such as a CPU. The processor 100 includes a first acquisition unit 101, an action selection unit 102, a second acquisition unit 103, a first action recognition unit 104, a second action recognition unit 105, and an output unit 106. These modules are implemented by having the processor 100 execute, for example, an action recognition program. However, this is only one example; such modules can also be implemented using dedicated hardware circuits such as ASICs.

[0073] The first acquisition unit 101 acquires object action information 201, which represents the object actions predetermined as recognition targets, and saves it to the memory 200. For example, the first acquisition unit 101 may acquire object action information 201 entered during registration work using an input device (not shown) such as a keyboard and a mouse. However, this is just one example; the object action information 201 may also be stored in the memory 200 in advance. The object actions are, for example, the various actions, such as wiping and sweeping, registered in the action selection table T1 shown in Figure 4. The action selection table T1 is an example of object action information 201.

[0074] Furthermore, the first acquisition unit 101 also acquires room layout information 202 within the residence and saves it to the memory 200. Room layout information 202 is two-dimensional or three-dimensional information representing the constituent elements of rooms such as the living room, dining room, and kitchen, as well as the shape and positional relationships of each room. Room layout information 202 may include, for example, two-dimensional drawing data recording the room layout, three-dimensional design data (CAD data) used in residential design, three-dimensional point cloud information (point cloud) measured by a three-dimensional laser scanner within the residence, trajectory information of equipment moving within the residence such as a cleaning robot, image data of the residence captured by a calibrated camera, or information acquired from devices equipped with room information.

[0075] Furthermore, the first acquisition unit 101 also acquires the location information 203 of the image sensor 2 installed in the residence and saves it to the memory 200. The location information 203 is, for example, composed of coordinate data expressed in the two-dimensional or three-dimensional coordinate system of the room layout information 202. The first acquisition unit 101 acquires the location information 203, for example, through registration work using an input device (not shown).

[0076] The action selection unit 102 selects candidate actions from among the object actions based on the location information 203 of the image sensor 2 and the room layout information 202. For example, the action selection unit 102 determines the space within the residence where the image sensor 2 is installed based on the location information 203 and the room layout information 202, and selects the predetermined actions that the user is presumed likely to take in the determined space. Specifically, the action selection unit 102 refers to the action selection table T1 shown in Figure 4 to determine the actions corresponding to the determined space. The space is, for example, a space that makes up part of the residence, such as a room, an entryway, or a kitchen.
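
The selection step in [0076] can be sketched as a two-stage lookup: find the space that contains the sensor position, then read the actions registered for that space. The rectangular room model, the example coordinates, the table contents, and the name `select_candidate_actions` are all assumptions for illustration; the actual format of the room layout information 202 and of table T1 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Room:
    """A space modeled as an axis-aligned rectangle (hypothetical format)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hypothetical room layout information, in metres.
ROOM_LAYOUT = [
    Room("kitchen", 0.0, 0.0, 3.0, 4.0),
    Room("washroom", 3.5, 0.0, 5.5, 2.0),
]

# Hypothetical stand-in for the action selection table T1.
ACTION_SELECTION_TABLE = {
    "kitchen": ["cooking", "eating"],
    "washroom": ["brushing teeth", "washing clothes"],
}


def select_candidate_actions(sensor_xy, layout=ROOM_LAYOUT,
                             table=ACTION_SELECTION_TABLE):
    """Return (space name, candidate actions) for a sensor position."""
    x, y = sensor_xy
    for room in layout:
        if room.contains(x, y):
            return room.name, list(table.get(room.name, []))
    return None, []  # the sensor lies outside every known space


space, candidates = select_candidate_actions((1.0, 1.0))
```

Restricting the candidates in this way means that only the recognizers needed for the selected space have to run on the image data.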

[0077] The second acquisition unit 103 acquires image data 204 captured by the image sensor 2 at a specified frame rate and saves it to the memory 200.

[0078] The first action recognition unit 104 determines one or more recognizers corresponding to the candidate actions selected by the action selection unit 102, and calculates feature quantities of the image data 204 using the determined recognizers. The first action recognition unit 104 includes a recognizer selection unit 110 and N (N is an integer greater than or equal to 1) recognizers 111_1, 111_2, ..., 111_N. Hereinafter, when collectively referred to as recognizers 111_1, 111_2, ..., 111_N, they will be referred to as recognizer 111.

[0079] The recognizer selection unit 110 selects a recognizer 111 for recognizing the candidate action selected by the action selection unit 102. For example, when generating the list information 205 described later, the recognizer selection unit 110 refers to the recognizer selection table T2 shown in Figure 5 and selects the recognizer 111 corresponding to the candidate action. When performing recognition processing, the recognizer selection unit 110 selects the recognizer 111 for the candidate action by referring to the list information 205.

[0080] Recognizer 111 is, for example, a recognizer constructed using a CNN. In this embodiment, as shown in Figure 5, the first action recognition unit 104 includes recognizers 111 for pose estimation, object detection, face detection, head orientation estimation, age and gender estimation, individual estimation, and trajectory tracking. These recognizers 111 are existing recognizers whose source code has been made public, used either as they are or with partial modifications.

[0081] The feature quantities vary depending on the recognizer 111. The feature quantities calculated by the pose estimation recognizer 111 include, for example, two-dimensional coordinate data for each of the multiple feature points that constitute skeletal information (e.g., 17 points such as the right shoulder and right elbow).

[0082] The object detection recognizer 111 calculates feature quantities including, for example, the coordinate data of the bounding rectangle surrounding the object and the label of the identified object. The objects detected in object detection are pre-defined; for example, eighty types of objects within a dwelling (such as microwave ovens and refrigerators) are detected. The face detection recognizer 111 calculates feature quantities including, for example, image data of the face region surrounded by a bounding rectangle. Here, the feature quantities calculated by each recognizer 111 are not data that humans cannot interpret (e.g., tensors) output from intermediate layers of a DNN.

[0083] Depending on the candidate action, the first action recognition unit 104 includes a dependent recognizer 111 that depends on another recognizer 111. A dependent recognizer 111 is a recognizer 111 that calculates a feature quantity using a feature quantity calculated by another recognizer 111 as input. For example, the head orientation estimation recognizer 111 calculates a feature quantity representing head orientation using the feature quantity (image data of the face region) calculated by the face detection recognizer 111.
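
One way to handle dependent recognizers is to expand the recognizers chosen for a candidate action so that every prerequisite runs first. The sketch below assumes a table that maps actions to recognizer names and a separate dependency map; both tables, their contents, and the name `resolve_recognizers` are hypothetical, and the sketch assumes the dependency map contains no cycles.

```python
# Hypothetical stand-in for the recognizer selection table T2.
RECOGNIZER_SELECTION_TABLE = {
    "cooking": ["pose_estimation", "object_detection", "head_orientation"],
    "walking": ["pose_estimation"],
}

# Hypothetical dependency map: a recognizer and the recognizers whose
# output it takes as input (head orientation needs the face region).
DEPENDS_ON = {
    "head_orientation": ["face_detection"],
}


def resolve_recognizers(action):
    """Return recognizer names for `action`, prerequisites listed first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for prerequisite in DEPENDS_ON.get(name, []):
            visit(prerequisite)
        ordered.append(name)

    for name in RECOGNIZER_SELECTION_TABLE.get(action, []):
        visit(name)
    return ordered


order = resolve_recognizers("cooking")
```

Running the recognizers in the returned order guarantees that the face region is available before head orientation is estimated.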

[0084] The second action recognition unit 105 identifies each candidate action selected by the action selection unit 102 based on the feature values calculated by each recognizer 111. The second action recognition unit 105 includes a combiner 121 and a classifier 122. The combiner 121 combines the feature values calculated by each recognizer 111 and inputs the combined feature values into the classifier 122. Here, the combiner 121 may also weight each feature value using a weighting coefficient predetermined according to the candidate action, then concatenate the weighted feature values and input them into the classifier 122.

[0085] Classifier 122 identifies candidate actions by calculating the likelihood for each candidate action based on the features input from combiner 121. Classifier 122 performs class classification. Classifier 122 can employ, for example, any classifier based on logistic regression, a support vector machine, a decision tree, a random forest, the k-nearest neighbor algorithm, Gaussian Naive Bayes, a perceptron, or stochastic gradient descent. When there are multiple candidate actions, classifier 122 calculates the likelihood for each candidate action separately.

[0086] Classifier 122 is a classifier trained by machine learning that uses the feature values output by recognizer 111 as explanatory variables and the object action as the objective variable. This machine learning is performed separately for each object action. For example, in the case of the object action "start cooking", machine learning is performed using the combination of the feature values output by each recognizer 111 used for that object action as explanatory variables and "start cooking" as the objective variable.

[0087] The output unit 106 outputs the recognition result of the second action recognition unit 105. Specifically, the output unit 106 outputs the recognition result for human actions based on the likelihood calculated by the classifier 122. For example, when there are multiple candidate actions, the output unit 106 may output the label with the highest likelihood as the recognition result.

[0088] The identification result is output to an external device (not shown) via a communication circuit, for example. The external device could be a household appliance installed in the home that performs control based on the identification result. This household appliance could also be a display device installed in the home for displaying the identification result.

[0089] Figure 2 is an illustrative diagram of the CNN that constitutes the recognizer 111. The CNN contains convolutional layers and fully connected layers. In this example, the CNN is constructed with 9 convolutional layers and 2 fully connected layers. In Figure 2, the suffix of each symbol indicates the layer number, where the input layer that receives the input image data D1 is layer 0, the convolutional layers are layers 1 to 9, the fully connected layers are layers 10 to 11, and the output layer is layer 12. The input image data D1 is represented by H0×W0×C0. Here, H represents the height of the data in each layer, W represents the width of the data in each layer, and C represents the number of channels in each layer. When the input image data D1 is color image data, C0 = 3. In the convolutional layers, the feature values of the data in each layer are calculated through convolution operations. For example, the size of the data output from layer 5, H5×W5×C5, is 64×64×256 = 1048576.

[0090] The fully connected layers aggregate the data output from the convolutional layers and generate the output data. For example, in the case of a pose estimation CNN that detects skeletal information of M points for each of up to P individuals, the fully connected layers output P×M×2 values (an x and a y coordinate for each point) as output data.
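The size bookkeeping in paragraphs [0089] and [0090] can be checked with a few lines of Python. This is only an arithmetic sketch: the function names are hypothetical, and the value P = 4 is an illustrative assumption that does not come from the embodiment.

```python
def tensor_size(height, width, channels):
    """Number of elements in an H x W x C block of layer data."""
    return height * width * channels


def pose_output_size(max_people, keypoints):
    """Number of values output for up to P people with M keypoints, each an (x, y) pair."""
    return max_people * keypoints * 2


# Layer 5 in the example of Figure 2: H5 x W5 x C5 = 64 x 64 x 256.
layer5_size = tensor_size(64, 64, 256)

# Pose output for a hypothetical P = 4 people and M = 17 keypoints.
pose_size = pose_output_size(4, 17)
```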

[0091] CNNs offer very high accuracy in recognizing image data. However, because CNNs consist of multiple layers such as convolutional layers and fully connected layers, they have a high processing load and require a significant amount of time to train. In contrast, classifier 122 is a support vector machine (SVM) or similar classifier with a much lower processing load than a CNN. Therefore, in this embodiment, existing recognizers 111 composed of CNNs are used, while only the classifier 122, with its lower processing load, is trained on the object actions. This reduces the training time and allows the action recognition device 1 to be constructed easily.

[0092] Figure 3 is a schematic diagram illustrating an example of the configuration of the first action recognition unit 104. In this example, the first action recognition unit 104 includes a recognizer 111_1 for pose estimation, a recognizer 111_2 for object detection, a recognizer 111_3 for face detection, and a recognizer 111_4 for head orientation estimation. Input image data D1 is input to each of recognizers 111_1 to 111_3. The input image data D1 is image data of the interior of a residence captured by image sensor 2.

[0093] The following explanation uses the case where the object's action is "start cooking" as an example. For instance, the input image data D1 contains a scene of a person standing in the kitchen facing the refrigerator. Recognizer 111_1, given the input image data D1 showing a person standing up, outputs the coordinate data of the skeleton constituting that posture as a feature. Recognizer 111_2 detects a given object from the input image data D1, outputting the coordinate data of the vertices of the bounding rectangle surrounding the detected object and the label of the detected object as feature values. Here, the refrigerator is detected as the object.

[0094] Recognizer 111_3 detects the human face from the input image data D1 and outputs the image data of the face region enclosed by the bounding rectangle as a feature. Recognizer 111_4 infers the human head orientation based on the image data of the face region output from Recognizer 111_3. The head orientation is represented, for example, by a direction vector originating from the centroid of the face region.

[0095] The second action recognition unit 105 determines that a person is facing the refrigerator when, for example, the center of gravity of the circumscribed rectangle surrounding the refrigerator lies in the direction of the direction vector. For example, the second action recognition unit 105 determines that a person is facing the refrigerator when the extension line of the direction vector passes through the center of gravity of the refrigerator after the direction vector is rotated about its starting point on the image by some angle within plus or minus a threshold angle (e.g., plus or minus 30 degrees).
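The facing test in paragraph [0095] can be sketched as an angle comparison between the head direction vector and the vector from the face centroid to the center of the object's bounding rectangle. The function names, the coordinate values, and the choice of computing a signed angle with `math.atan2` are assumptions made for illustration; the embodiment does not specify how the test is implemented.

```python
import math


def rect_center(x_min, y_min, x_max, y_max):
    """Center of gravity of an axis-aligned bounding rectangle."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def is_facing(origin, direction, target, threshold_deg=30.0):
    """Return True if `target` lies within +/- threshold_deg of `direction`.

    origin:    (x, y) starting point of the direction vector (face centroid)
    direction: (dx, dy) head orientation vector on the image
    target:    (x, y) center of gravity of the detected object
    """
    dx, dy = direction
    tx, ty = target[0] - origin[0], target[1] - origin[1]
    if (dx == 0 and dy == 0) or (tx == 0 and ty == 0):
        return False  # the angle is undefined for a zero-length vector
    # Signed angle from the direction vector to the origin-to-target vector.
    cross = dx * ty - dy * tx
    dot = dx * tx + dy * ty
    angle_deg = math.degrees(math.atan2(cross, dot))
    return abs(angle_deg) <= threshold_deg


# Hypothetical values: a face at (100, 100) and a refrigerator to its right.
fridge_center = rect_center(200, 80, 260, 160)
facing = is_facing((100, 100), (1, 0), fridge_center)
facing_away = is_facing((100, 100), (-1, 0), fridge_center)
```

Rotating the direction vector by up to the threshold angle until its extension passes through the target is equivalent to checking that the angle between the two vectors does not exceed the threshold, which is what the sketch does.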

[0096] By combining information about the person's posture, object position, and head orientation, the second action recognition unit 105 can determine the action of "starting to cook". Additionally, the recognizer 111_4 can detect the direction of the gaze, the direction of the nose, and the normal direction of the body instead of detecting head orientation.

[0097] Figure 4 is a schematic diagram illustrating an example of the structure of the action selection table T1. Action selection table T1 stores multiple spaces and the object actions corresponding to each space. Object actions include wiping, sweeping, brushing teeth, starting to cook, making coffee, using a laptop (PC), doing laundry, reading, eating, and walking. Spaces include the entryway, kitchen, living room, dining room, bedroom, bathroom, and toilet. However, these are just examples; other spaces and object actions can be used. For example, the object action of lying on the sofa watching TV could be assigned to the living room.

[0098] In the action selection table T1, the ○ mark indicates that for the corresponding space, the corresponding object action is used as the identification object. The × mark indicates that for the corresponding space, the corresponding object action is not used as the identification object.

[0099] For example, wiping cleaning, shown in the first row, is an identification target in the entryway, kitchen, living room, dining room, bathroom, and toilet.

[0100] Regarding wiping cleaning, since it is likely to be done in all spaces, it is an identification target in all spaces. Regarding sweeping cleaning, since it is unlikely to be done in the bathroom, it is an identification target in the spaces other than the bathroom. Regarding brushing teeth, since it is an action performed in a space where washing is possible, it is an identification target in the kitchen, bathroom, and toilet.

[0101] Regarding starting to cook, since it won't happen in a space other than the kitchen, the kitchen will be the only relevant space. Regarding brewing coffee, since coffee machines can be placed not only in the kitchen but also in the dining room, both the kitchen and the dining room will be considered as the relevant spaces.

[0102] Regarding the use of laptops, since laptops may be used in the dining room, living room, and bedroom, laptop use is an identification target in these spaces.

[0103] Regarding laundry, washing machines are generally considered to be located in the bathroom, thus the bathroom is the primary identification area. Regarding reading, since it's more likely to happen in a seated space, the living room, dining room, and bedroom are considered primary identification areas. Regarding eating, since it can happen not only in the dining room but also in the living room, the living room and dining room are considered primary identification areas. Regarding walking, since it can happen in any space, all spaces are considered primary identification areas.

[0104] The action selection unit 102 can easily select appropriate object actions for the space by referring to such an action selection table T1 to select object actions corresponding to the space.
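The lookup described in paragraph [0104] can be sketched as a dictionary from each object action to the set of spaces marked ○. The table below is a partial, illustrative transcription of the prose in [0100] to [0103], not a copy of Figure 4; wiping cleaning and sweeping cleaning are omitted, and the identifier names are hypothetical.

```python
ALL_SPACES = (
    "entryway", "kitchen", "living room", "dining room",
    "bedroom", "bathroom", "toilet",
)

# Object action -> spaces in which it is an identification target (the "○" cells).
# Illustrative subset based on the prose description, not the actual table T1.
ACTION_SELECTION_TABLE = {
    "brushing teeth": {"kitchen", "bathroom", "toilet"},
    "start cooking": {"kitchen"},
    "brewing coffee": {"kitchen", "dining room"},
    "using a laptop": {"dining room", "living room", "bedroom"},
    "laundry": {"bathroom"},
    "reading": {"living room", "dining room", "bedroom"},
    "eating": {"living room", "dining room"},
    "walking": set(ALL_SPACES),
}


def select_candidate_actions(space):
    """Return the object actions registered for `space`, in table order."""
    return [
        action
        for action, spaces in ACTION_SELECTION_TABLE.items()
        if space in spaces
    ]


kitchen_candidates = select_candidate_actions("kitchen")
bedroom_candidates = select_candidate_actions("bedroom")
```

With this subset, a sensor in the kitchen yields brushing teeth, start cooking, brewing coffee, and walking as candidate actions, while a sensor in the bedroom yields using a laptop, reading, and walking.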

[0105] Figure 5 is a schematic diagram illustrating an example of the structure of the recognizer selection table T2. The recognizer selection table T2 stores multiple object actions in correspondence with the recognizers 111 used to identify each object action. In the recognizer selection table T2, the ○ symbol indicates a recognizer 111 used for the corresponding object action, and the × symbol indicates a recognizer 111 not used for the corresponding object action. The object actions are the same as those shown in Figure 4. Pose estimation, object detection, face detection, head orientation estimation, age and gender estimation, individual estimation, and trajectory tracking are each performed by the corresponding recognizer 111.

[0106] For example, for wiping cleaning, since it is not necessary to identify a specific person, the recognizers 111 for face detection, head orientation estimation, age and gender estimation, and individual estimation are not used; the recognizers 111 for pose estimation, object detection, and trajectory tracking are used.

[0107] For sweeping cleaning, a recognizer 111 using posture estimation, object detection, and trajectory tracking is used for the same reasons as for wiping cleaning. For brushing teeth, since the individual's identity and movement are not important, a recognizer 111 other than age and gender estimation, individual estimation, and trajectory tracking is used.

[0108] For starting to cook, all the recognizers 111 are used. For example, since the mother is more likely to cook, the recognizers 111 for age and gender estimation and individual estimation are included in order to infer whether it is the mother who is cooking.

[0109] For brewing coffee, unlike starting to cook, coffee may also be brewed by the father or a child, and movement is not important, so the recognizers 111 other than those for age and gender estimation, individual estimation, and trajectory tracking are used.

[0110] For laptop use, since few people move around while using a PC, a recognizer 111 other than trajectory tracking is used. For laundry, since the location where the washing machine is set up is fixed, a recognizer 111 other than trajectory tracking is used. For reading, since it is not necessary to determine the individual or their movement, a recognizer 111 other than age and gender estimation, individual estimation, and trajectory tracking is used.

[0111] Regarding eating, since anyone can do it, head orientation is unimportant, and there is no movement, the recognizers 111 for pose estimation and object detection are used. Regarding walking, since objects are not involved and head orientation and individual identification are unimportant, the recognizers 111 for pose estimation and trajectory tracking are used. When detecting unknown actions other than those mentioned above, all recognizers 111 are used.

[0112] Next, the effectiveness of the recognizers 111 used in recognizing each object action is explained.

[0113] Pose inference is effective for recognizing all object actions because it can detect poses corresponding to those actions. Furthermore, face detection is required to infer head orientation. In other words, head orientation detection depends on face detection.

[0114] In wiping cleaning, object detection is effective for detecting cleaning tools such as vacuum cleaners, and trajectory tracking is effective because the person cleans while moving among multiple locations.

[0115] In sweeping cleaning, the same as wiping cleaning, object detection and trajectory tracking are effective.

[0116] When brushing teeth, object detection is effective for detecting both the toothbrush and the mirror. Since the head orientation tends to be towards the mirror, head orientation inference is effective.

[0117] When cooking begins, object detection is effective for detecting cooking utensils and ingredients. Since the head orientation tends to be towards the kitchen or sink, head orientation inference is effective. Furthermore, when cooking begins, age and gender inference and personal inference are effective for identifying individuals who cook frequently, such as mothers who often cook. Since individuals sometimes move within the kitchen, for example from the sink to the stove, trajectory tracking is also effective.

[0118] When brewing coffee, object detection is effective for detecting the coffee machine and coffee cup, and head orientation inference is effective for detecting whether the head is facing the coffee machine.

[0119] When using a laptop, object detection is effective for detecting the laptop, and head orientation inference is effective for detecting whether the head is facing the laptop. Since laptops are sometimes used at work, age, gender, and personal inference are also effective.

[0120] In laundry, object detection is effective for detecting the washing machine, head orientation inference is effective for detecting whether the head is facing the washing machine, and age and gender inference, as well as individual inference, are effective for identifying individuals such as mothers who frequently wash clothes.

[0121] When reading, object detection is effective for detecting books, and head orientation inference is effective for detecting whether the head is facing the book.

[0122] While eating, object detection is effective for detecting the food.

[0123] When walking, trajectory tracking is effective in capturing continuous movement.

[0124] For actions other than these object actions, since it is unknown which recognizers 111 are effective, all recognizers 111 are selected.

[0125] In the action recognition device 1, each object action to be recognized is associated with a space. For example, "start cooking" is strongly associated with the kitchen, and information about that space is useful for action recognition. That is, if weights are applied to specific objects or head orientations for each object action, the detection accuracy of the object action improves. Therefore, in this embodiment, the combiner 121 applies weight coefficients corresponding to the object action to the feature quantities.

[0126] Figure 6 is a schematic diagram illustrating an example of the weight table T3 referenced by the combiner 121 when setting weight coefficients for feature quantities. Weight table T3 stores, for each of multiple object actions, the category index, label, spatial information, object information, head orientation information, spatial information weight coefficient Wr, object information weight coefficient We, and weighted content in correspondence with one another. Here, "start cooking" and "wiping cleaning" from Figure 4 are shown as the object actions. Although omitted from the illustration, records for each of the other object actions shown in Figure 4 are in practice also registered in the weight table T3.

[0127] The category index is a unique index assigned to each object action. Here, the category indices are assigned as sequential numbers. The label is the label of the object action output by the output unit 106. Here, "Start cooking" and "Wipe and clean" are shown as example labels. Although omitted from the illustration, the labels of each object action shown in Figure 4 are registered in the weight table T3. Spatial information represents the space in which the object action is performed, that is, the space registered for the object action in the action selection table T1. Here, the space for "Start cooking" is the kitchen.

[0128] Object information represents information about the object associated with the corresponding action. For example, the object information for "start cooking" is at least one of a refrigerator, microwave oven, gas stove, and oven. Head orientation information represents the relationship between the object recorded in the object information and the head orientation. For example, the head orientation information for "start cooking" is the head facing at least one of a refrigerator, microwave oven, gas stove, and oven. The determination of whether the head is facing these objects is based on the relationship between the direction vector and the object's center of gravity.

[0129] The spatial information weight coefficient Wr is the weight coefficient for the space where the object's action is performed. Here, the weight coefficient is set to "1" for spaces marked with ○ in the action selection table T1, and the weight coefficient is set to "0" for spaces marked with ×. For example, for "Start cooking", because the kitchen is marked with ○ in the action selection table T1, the weight coefficient Wr for the kitchen is "1", and the weight coefficient Wr for spaces other than the kitchen is "0".

[0130] The weighting coefficient We for object information is a weighting coefficient for the detected object; it is "1" when the corresponding object is detected and "0" otherwise. When multiple features are extracted for an object, each feature is assigned a value of 1 or 0. The weighted content represents the weighting coefficient set based on the relationship between the detected object and the head orientation.

[0131] With reference to Figure 4, the following explains the weighting performed by the combiner 121, using as an example the feature quantities output by three recognizers 111 (pose estimation, object detection, and head orientation estimation). Here, it is assumed that the final output vector of the combiner 121 is Vo. It is also assumed that the feature quantity output by each recognizer 111 is a vector. The output vector Vo is not limited to a one-dimensional vector and can also be a multi-dimensional tensor. Without weighting, the output vector Vo is represented by formula (1).

[0132] [Formula 1]

[0133] $V_o = \langle V_p \;\; V_{ob} \;\; V_h \rangle \qquad (1)$

[0134] Here, Vp, Vob, and Vh are the feature quantities output by the pose estimation recognizer 111, the object detection recognizer 111, and the head orientation estimation recognizer 111, respectively. Without weighting, the output vector Vo is simply the concatenation of the feature quantities Vp, Vob, and Vh. The number of features in the output vector Vo is the sum of the numbers of features in Vp, Vob, and Vh. The symbol <> indicates that the vectors inside it are concatenated in the direction in which they are listed.

[0135] With weighting, the output vector Vo' is represented by formula (2). As in the unweighted case, the output vector Vo' is generated by concatenating the feature vectors of each recognizer 111, except that the feature quantities of each recognizer 111 are weighted first. The number of features in the output vector Vo' is therefore the same as in the unweighted case.

[0136] [Formula 2]

[0137]

[0138] In formula (2), the symbol ⊙ (a circle containing "·") represents the broadcast operation, and the symbol ⊗ (a circle containing "×") represents the Hadamard product.

[0139] Wr, Wp, Wob, and Wh are the weighting coefficients for the room, pose estimation, object detection, and head orientation estimation, respectively. Weighting coefficient Wp performs a broadcast operation on feature quantity Vp, weighting coefficient Wob performs a broadcast operation on the Hadamard product of weighting coefficient We and feature quantity Vob, and weighting coefficient Wh performs a broadcast operation on feature quantity Vh. The broadcast operation calculates the product of the weighting coefficient (scalar value) and each element of the vector. Weighting coefficient We is the weighting coefficient for the object and has the same number of elements as feature quantity Vob.
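The operations named in paragraphs [0134] to [0139] can be sketched in plain Python. Formula (2) itself is not reproduced in this text, so the sketch follows only the verbal description: Wp, Wob, and Wh scale their feature vectors by broadcasting, and We is applied to Vob element by element. How the room coefficient Wr enters the formula is not stated here; applying it to the whole concatenated vector is an assumption of this sketch, as are all function and parameter names.

```python
def broadcast(scalar, vector):
    """Broadcast operation: multiply every element of `vector` by `scalar`."""
    return [scalar * x for x in vector]


def hadamard(a, b):
    """Hadamard product: element-wise product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Hadamard product requires vectors of equal length")
    return [x * y for x, y in zip(a, b)]


def combine(v_p, v_ob, v_h, w_r=1.0, w_p=1.0, w_ob=1.0, w_e=None, w_h=1.0):
    """Weight and concatenate pose, object, and head-orientation features.

    With all coefficients left at 1 this reduces to plain concatenation,
    as in formula (1).
    """
    if w_e is None:
        w_e = [1.0] * len(v_ob)  # one object coefficient per element of Vob
    concatenated = (
        broadcast(w_p, v_p)
        + broadcast(w_ob, hadamard(w_e, v_ob))
        + broadcast(w_h, v_h)
    )
    # ASSUMPTION: the room coefficient Wr scales the whole concatenated vector.
    return broadcast(w_r, concatenated)


# Hypothetical feature vectors, chosen only to make the arithmetic visible.
unweighted = combine([1.0, 2.0], [3.0, 4.0], [5.0])
masked = combine([1.0, 2.0], [3.0, 4.0], [5.0], w_e=[1.0, 0.0])
wrong_room = combine([1.0, 2.0], [3.0, 4.0], [5.0], w_r=0.0)
```

In all three calls the output has five elements, matching the statement that weighting does not change the number of features.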

[0140] For example, suppose the object's action is to start cooking, and the input image data D1 is image data taken in the kitchen, with feature Vob containing the refrigerator's label and feature Vh containing the direction vector toward the refrigerator. In this case, the combiner 121 refers to the weight table T3 and sets the weight coefficients Wr, We, Wh, and Wob to "1".

[0141] To simplify the explanation, the weighting of formula (2) has been illustrated for the case where there is only one object action. When there are two or more object actions, the combiner 121 can apply the weighting of formula (2) for each object action and input the weighted feature values into the classifier 122. The classifier 122 calculates the likelihood for each object action. For example, when recognizing the actions "start cooking" and "wipe and clean", the combiner 121 can first set each weight coefficient of formula (2) to the value corresponding to "start cooking", and then set each weight coefficient of formula (2) to the value corresponding to "wipe and clean".

[0142] As another approach for cases where there are two or more object actions, the weighting coefficients can be integrated, for example, by calculating the arithmetic mean, arithmetic sum, or logical sum (OR) of the weighting coefficients for "start cooking" and "sweeping".
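The three integration options in paragraph [0142] can be compared on a small example. The coefficient values and the function name below are hypothetical; the sketch only assumes that each object action has one list of coefficients of the same length.

```python
def integrate_weights(weight_sets, mode="mean"):
    """Integrate per-action weight coefficient lists element by element.

    mode: "mean" (arithmetic mean), "sum" (arithmetic sum),
          or "or" (logical sum of 0/1 coefficients).
    """
    if len({len(w) for w in weight_sets}) != 1:
        raise ValueError("all weight lists must have the same length")
    columns = list(zip(*weight_sets))
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "or":
        return [1 if any(c) else 0 for c in columns]
    raise ValueError("unknown mode: " + mode)


# Hypothetical 0/1 coefficients for two object actions.
w_cooking = [1, 0, 1]
w_sweeping = [1, 1, 0]

mean_weights = integrate_weights([w_cooking, w_sweeping], "mean")
sum_weights = integrate_weights([w_cooking, w_sweeping], "sum")
or_weights = integrate_weights([w_cooking, w_sweeping], "or")
```

The logical sum keeps every feature that is relevant to at least one action at full weight, whereas the mean halves the weight of features relevant to only one of the two actions.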

[0143] Figure 7 is a flowchart illustrating an example of the process by which the action recognition device 1 according to the embodiment generates the list information 205. Details of the list information 205 are described later. The flowchart of Figure 7 is executed, for example, when the action recognition device 1 is installed in a residence. It can also be executed, for example, when the room layout information 202 or the location information 203 is changed.

[0144] First, the first acquisition unit 101 acquires the action selection table T1 (object action information 201) and saves it to the memory 200 (step S101). Second, the first acquisition unit 101 acquires the room layout information 202 and saves it to the memory 200 (step S102).

[0145] Next, the second acquisition unit 103 acquires the position information 203 of the image sensor 2 and saves it to the memory 200 (step S103). When there are multiple image sensors 2, the second acquisition unit 103 can acquire the position information 203 of each image sensor 2.

[0146] Next, the action selection unit 102 uses the location information 203 and the room layout information 202 to determine the space where the image sensor 2 is installed (step S104). For example, the action selection unit 102 can determine the space by investigating which space the coordinate data shown in the location information 203 is located among the multiple spaces included in the room layout information 202. Moreover, when the location information 203 for multiple image sensors 2 is obtained, the action selection unit 102 only needs to determine the space corresponding to each location information 203.
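Step S104 can be sketched as a point-in-region lookup. The format of the room layout information 202 is not described in this passage, so the sketch assumes, purely for illustration, that each space is an axis-aligned rectangle given as (x_min, y_min, x_max, y_max) in the same coordinate system as the sensor position; the names and coordinates are hypothetical.

```python
def find_space(sensor_xy, room_layout):
    """Return the name of the first space whose rectangle contains the sensor.

    room_layout: dict mapping a space name to (x_min, y_min, x_max, y_max).
    Returns None when the position lies outside every space.
    """
    x, y = sensor_xy
    for name, (x_min, y_min, x_max, y_max) in room_layout.items():
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return name
    return None


# Hypothetical layout in metres.
room_layout_202 = {
    "kitchen": (0.0, 0.0, 4.0, 3.0),
    "living room": (4.0, 0.0, 10.0, 6.0),
}

# One lookup per image sensor when several sensors are installed.
sensor_positions_203 = {"sensor_1": (1.5, 2.0), "sensor_2": (6.0, 5.0)}
sensor_spaces = {
    sensor_id: find_space(xy, room_layout_202)
    for sensor_id, xy in sensor_positions_203.items()
}
```

A real floor plan would need polygonal rooms and a rule for positions on shared walls; the rectangle test is only the simplest case.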

[0147] Next, the action selection unit 102, referring to the action selection table T1, selects candidate actions as object actions corresponding to the determined spaces (step S105). Here, if multiple spaces are determined, one or more candidate actions corresponding to each space are selected. Moreover, if multiple candidate actions are selected, the recognizer 111 corresponding to each candidate action is selected.

[0148] Next, the recognizer selection unit 110 selects the recognizer 111 corresponding to the selected candidate action (step S106). The details of this process are explained with reference to Figure 8.

[0149] Next, the recognizer selection unit 110 generates a list information 205 that corresponds to the identifier of each image sensor 2, the candidate actions identified from the image data 204 captured by each image sensor 2, and the recognizer 111 used when identifying the candidate actions, and saves it to the memory 200 (step S107). As described above, the list information 205 is generated.
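The list information 205 produced in step S107 can be pictured as a nested mapping from sensor identifier to candidate actions to recognizers. The storage format is not specified in the text, so the dictionary layout, the function name, and the small tables below are all illustrative assumptions.

```python
def build_list_information(sensor_spaces, action_table, recognizer_table):
    """Map each sensor identifier to its candidate actions and their recognizers.

    sensor_spaces:    sensor id -> space determined in step S104
    action_table:     object action -> set of spaces (cf. table T1)
    recognizer_table: object action -> list of recognizer names (cf. table T2)
    """
    list_information = {}
    for sensor_id, space in sensor_spaces.items():
        candidates = [a for a, spaces in action_table.items() if space in spaces]
        list_information[sensor_id] = {
            action: list(recognizer_table[action]) for action in candidates
        }
    return list_information


# Hypothetical inputs covering three object actions.
sensor_spaces = {"sensor_1": "kitchen", "sensor_2": "living room"}
action_table = {
    "start cooking": {"kitchen"},
    "eating": {"living room", "dining room"},
    "walking": {"kitchen", "living room"},
}
recognizer_table = {
    "start cooking": ["pose", "object", "face", "head_orientation",
                      "age_gender", "individual", "trajectory"],
    "eating": ["pose", "object"],
    "walking": ["pose", "trajectory"],
}

list_information_205 = build_list_information(
    sensor_spaces, action_table, recognizer_table
)
```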

[0150] Figure 8 is a flowchart detailing the process of step S106 in Figure 7. First, the recognizer selection unit 110 acquires the label of the object action acquired in step S101 (step S201). Next, the recognizer selection unit 110 selects the recognizer 111 to be used when recognizing the object action, referring to the recognizer selection table T2 (step S202). Here, if there are multiple object actions, the recognizer 111 to be used when recognizing each object action is selected. Next, the recognizer selection unit 110 selects the dependent recognizer 111 (step S203). Here, which recognizer 111 depends on which recognizer 111 is determined by referring to the recognizer dependency items in the prior knowledge table T4 (described later).
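Steps S202 and S203 can be sketched as a table lookup followed by dependency resolution. The table entries below transcribe only part of the prose description of table T2, and the only dependency included is the one stated in the text (head orientation estimation depends on face detection); the contents of the prior knowledge table T4 are not given in this passage, so the `DEPENDENCIES` mapping and all names are assumptions.

```python
# Object action -> recognizers marked "○" (illustrative subset of table T2).
RECOGNIZER_SELECTION_TABLE = {
    "wiping cleaning": ["pose", "object", "trajectory"],
    "reading": ["pose", "object", "face", "head_orientation"],
    "eating": ["pose", "object"],
    "walking": ["pose", "trajectory"],
}

# Dependent recognizer -> recognizers whose output it needs as input.
DEPENDENCIES = {"head_orientation": ["face"]}


def lookup_recognizers(labels):
    """Step S202: collect the recognizers for all labels, without duplicates."""
    selected = []
    for label in labels:
        for name in RECOGNIZER_SELECTION_TABLE[label]:
            if name not in selected:
                selected.append(name)
    return selected


def add_dependencies(names):
    """Step S203: add prerequisites and order them before their dependents."""
    ordered = []

    def visit(name):
        for prerequisite in DEPENDENCIES.get(name, []):
            visit(prerequisite)
        if name not in ordered:
            ordered.append(name)

    for name in names:
        visit(name)
    return ordered


recognizers_for_reading = add_dependencies(lookup_recognizers(["reading"]))
recognizers_for_two = add_dependencies(lookup_recognizers(["walking", "eating"]))
head_only = add_dependencies(["head_orientation"])
```

Returning the recognizers in dependency order means a dependent recognizer can always be run after the recognizer whose output it consumes.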

[0151] Figure 9 is a flowchart illustrating an example of the action recognition process of the action recognition device 1. Figure 9 shows the processing performed when a given image sensor 2 captures one frame of image data 204. Therefore, when there are multiple image sensors 2, the processing of Figure 9 can be performed in parallel on the image data 204 captured by each image sensor 2. Furthermore, the processing of Figure 9 can be performed either each time one frame of image data 204 is captured or each time multiple frames of image data 204 are captured.

[0152] First, the second acquisition unit 103 acquires the image data 204 captured by the image sensor 2 and saves it to the memory 200 (step S301).

[0153] Next, the recognizer selection unit 110, referring to the list information 205, selects one of the candidate actions corresponding to the identifier of the image sensor 2 that captured the image data 204 (step S302). Thus, a candidate action corresponding to the space is selected.

[0154] Next, the recognizer selection unit 110, referring to the list information 205, inputs image data 204 into the recognizer 111 used when recognizing a candidate action, and the recognizer 111 calculates feature quantities (step S303). The calculated feature quantities are input into the combiner 121.

[0155] Next, the combiner 121 weights the feature quantities input with weight coefficients corresponding to a candidate action (step S304).

[0156] Next, the combiner 121 concatenates the weighted features (step S305). The concatenated features are then input into the classifier 122. Next, the classifier 122 calculates the likelihood of the input features (step S306).

[0157] Next, the recognizer selection unit 110 determines whether all candidate actions have been selected (step S307). If all candidate actions have not been selected (no in step S307), the process returns to step S302 to calculate the likelihood of the next candidate action to be selected.

[0158] On the other hand, if all candidate actions have been selected (Yes in step S307), the output unit 106 outputs, as the recognition result, the label of the candidate action with the highest likelihood among the likelihoods calculated for the candidate actions (step S308). In step S308, instead of outputting only the label with the highest likelihood, the labels of the top k candidate actions may be output in descending order of likelihood. For example, if k is 5, the labels of the top five candidate actions are output.
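The loop of steps S302 to S308 can be sketched with stand-in components. The recognizers and the classifier below are stubs that return fixed numbers so that the control flow can be followed; they are not CNNs or a trained classifier, dependent recognizers are not handled, and every name and value is a hypothetical placeholder.

```python
def recognize(image, candidates, recognizers, weights, classifier, k=1):
    """Return the labels of the top-k candidate actions by likelihood.

    candidates:  candidate action -> list of recognizer names
    recognizers: recognizer name -> function(image) returning a feature list
    weights:     candidate action -> {recognizer name: scalar coefficient}
    classifier:  function(action, features) returning a likelihood
    """
    likelihoods = {}
    for action, names in candidates.items():               # S302 / S307 loop
        combined = []
        for name in names:
            features = recognizers[name](image)             # S303
            coefficient = weights[action][name]
            combined += [coefficient * x for x in features]  # S304 and S305
        likelihoods[action] = classifier(action, combined)  # S306
    ranked = sorted(likelihoods, key=likelihoods.get, reverse=True)
    return ranked[:k]                                       # S308


# Stub components (assumptions for illustration only).
stub_recognizers = {
    "pose": lambda image: [1.0, 2.0],
    "object": lambda image: [3.0],
}
stub_candidates = {"eating": ["pose", "object"], "walking": ["pose"]}
stub_weights = {
    "eating": {"pose": 1.0, "object": 1.0},
    "walking": {"pose": 0.5},
}


def stub_classifier(action, features):
    """Stand-in score; a trained classifier would return a likelihood here."""
    return sum(features)


best = recognize(None, stub_candidates, stub_recognizers,
                 stub_weights, stub_classifier)
top_two = recognize(None, stub_candidates, stub_recognizers,
                    stub_weights, stub_classifier, k=2)
```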

[0159] Thus, according to this embodiment, based on the location information 203 of the image sensor 2 and the room layout information 202, candidate actions to be identified are selected from the object actions, and the identification result for the candidate actions is calculated. Therefore, it is possible to identify space-dependent human actions with high accuracy.

[0160] Selection of Recognizer

[0161] Next, the method for selecting recognizers is explained. Previous methods for action recognition using DNNs can be broadly categorized into two approaches.

[0162] The first approach is an action recognition method that utilizes multiple convolutional layers and pooling layers as general feature extraction layers to extract action features from the input data. This method then calculates the likelihood of a given action based on the extracted features. However, this approach is not ideal in terms of computational cost and accuracy.

[0163] The second approach is to heuristically design feature quantities that help identify the actions. Similar to the first approach, this action identification method calculates the likelihood of a given action based on the extracted feature quantities. The details of this heuristic design are explained below.

[0164] In the field of image processing, skeletal information (representing joints like shoulders and knees connected by straight lines) is widely used to represent humans, and bounding boxes are used to represent the position of objects. Therefore, heuristic action recognition methods utilizing these skeletal or bounding box representations have been proposed. However, neither skeletal information nor bounding box representations are considered as features for action recognition. Therefore, the action recognition method is determined based on heuristic trial-and-error results. Furthermore, the learning and evaluation of DNNs require significant time, limiting the trial conditions. The background for implementing this heuristic trial-and-error approach is that the action to be recognized is the same as or similar to previously recognized objects, allowing the use of features from previous recognitions as a reference. Especially when using a public dataset for learning, the action to be recognized becomes a category defined by the dataset.

[0165] Furthermore, existing DNN-based action recognition methods are ineffective when the action to be recognized is not included in previous recognition datasets. Therefore, efficient action recognition requires the extraction of features that accurately represent the action to be recognized. To this end, when training data containing the actions to be recognized is provided, the composition of the feature extraction method needs to be determined.

[0166] Action recognition using DNNs calculates the likelihood of a given action by processing sensor data. When a label for a single action is obtained, the label of the action with the highest likelihood is selected. However, when calculating the likelihood of each action from multiple DNNs or feature extractors, a larger amount of feature data does not guarantee higher recognition accuracy. This is because the features are highly likely to contain components such as noise that do not contribute to recognition accuracy. Similarly, even increasing the number of DNNs or feature extractors used does not necessarily improve accuracy. Therefore, it is important to set the number of DNNs or feature extractors to a number that provides sufficient information for the action to be recognized.

[0167] When considering computational costs, it is especially important to choose the least computationally expensive configuration from those that achieve the desired recognition accuracy. However, because of the nonlinear transformations involved in their internal processing, typical DNNs or feature extractors do not allow recognition accuracy to be computed analytically. In other words, the only way to obtain the recognition accuracy for a given configuration is to prepare a data set and actually perform the recognition process. The computational cost of obtaining the recognition accuracy in this way is high, so only a limited number of conditions can be explored.

[0168] As mentioned above, in order to use computationally expensive DNNs for action recognition in real-time processing, the composition of the DNN or feature extractor used becomes very important.

[0169] The computational cost of the layers (convolutional layers) of a DNN performing convolutional processing is explained below. Similar to convolution in mathematics, a convolutional layer obtains an output by applying a given kernel to the input image data. There are two-dimensional convolutions applicable to one frame of two-dimensional image data and three-dimensional convolutions applicable to N frames (N is the number of frames) of two-dimensional image data. The number of parameters Wc of the weight coefficients of the convolution can be expressed by formula (3) if the kernel size is k, the number of convolutional dimensions is d (2 for two-dimensional convolution and 3 for three-dimensional convolution), the number of input channels is Ci, and the number of output channels is Co.

[0170] [Formula 3]

[0171] $W_c = k^d \times C_i \times C_o \quad (3)$

[0172] The number of multiplication operations Uc required for the convolution calculation can be expressed by formula (4), where s is the stride, i.e., the step by which the convolution kernel moves. For simplicity, it is assumed that the vertical and horizontal processing of the image is the same.

[0173] [Formula 4]

[0174] $U_c = W_c \times (C_i / s)^d \quad (4)$

[0175] On the other hand, the number of weight coefficients Wfc of the fully connected layer, as shown in formula (5), is the product of the number of input channels Ci and the number of output channels Co.

[0176] [Formula 5]

[0177] $W_{fc} = C_i \times C_o \quad (5)$

[0178] The number of multiplication operations Ufc required for the calculation of the fully connected layer, as shown in formula (6), is equal to the number of weight coefficients Wfc.

[0179] [Formula 6]

[0180] $U_{fc} = W_{fc} \quad (6)$
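The four formulas can be evaluated directly to compare layer types. The following is a minimal sketch that transcribes formulas (3) to (6) as printed; the function names and the example values (kernel size 3, 64 input and output channels, stride 1) are illustrative assumptions, not values from the embodiment.

```python
def conv_weights(k: int, d: int, c_in: int, c_out: int) -> int:
    """Formula (3): W_c = k^d * C_i * C_o."""
    return k ** d * c_in * c_out


def conv_multiplications(k: int, d: int, c_in: int, c_out: int, s: int) -> float:
    """Formula (4) as printed: U_c = W_c * (C_i / s)^d."""
    return conv_weights(k, d, c_in, c_out) * (c_in / s) ** d


def fc_weights(c_in: int, c_out: int) -> int:
    """Formula (5): W_fc = C_i * C_o."""
    return c_in * c_out


def fc_multiplications(c_in: int, c_out: int) -> int:
    """Formula (6): U_fc = W_fc."""
    return fc_weights(c_in, c_out)


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    k, c_in, c_out, s = 3, 64, 64, 1
    print("2D conv weights:", conv_weights(k, 2, c_in, c_out))
    print("3D conv weights:", conv_weights(k, 3, c_in, c_out))
    print("FC weights     :", fc_weights(c_in, c_out))
    print("2D conv mults  :", conv_multiplications(k, 2, c_in, c_out, s))
    print("FC mults       :", fc_multiplications(c_in, c_out))
```

With these assumed sizes, moving from two-dimensional to three-dimensional convolution multiplies the weight count by the kernel size k, and both convolution variants need far more multiplications than a fully connected layer with the same channel counts.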

[0181] The number of weight coefficients corresponds to the amount of memory required during execution, while the number of multiplication operations corresponds to the number of operations required during execution. As shown in formula (5), the number of operations in convolution becomes greater than that in a fully connected layer because of the multiplication operations introduced by the convolution kernel. In the case of three-dimensional convolution, both the number of weight coefficients and the number of multiplication operations increase further. Therefore, the computational cost of a DNN centered on multiple convolutional layers, as in Figure 2, becomes very high, making it difficult to perform real-time processing (at typical video frame rates) even on a desktop computer using a graphics processing unit (GPU). Real-time processing becomes even more challenging when multiple DNNs are used to improve recognition accuracy.

[0182] The high computational cost of existing DNNs stems from the fact that the network layers are primarily arranged in series, with most being convolutional layers. Even in the intermediate layers, the number of weights is large, increasing the number of multiplication operations. The outputs of these intermediate layers become tensors that are difficult for humans to interpret, and may not represent the most suitable representation for recognition. In other words, there are excessive weights in the intermediate layers, further increasing the number of multiplication operations. Therefore, in the intermediate layers, it is possible to reduce the number of weights and multiplication operations by effectively representing human actions or spatial states with fewer parameters.

[0183] When considering the recognition of human actions, actions can be inferred from skeletal information rather than from densely packed input images. For example, an action like cutting food with a knife is characterized by repetitive up-and-down movements of the arm. The number of bytes required to represent the input image data, in the case of 8-bit image data, is the product of the image's vertical length, horizontal length, and number of channels. On the other hand, if it is two-dimensional skeletal information, it can be represented by the product of the number of vertices (e.g., 17 points) and the number of components of the vertices' coordinate data (x, y), which is 34.
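The size difference described above can be made concrete with a short calculation. This is a sketch only; the 640×480, 3-channel image size is borrowed from the simulation described in paragraph [0199], and the function names are hypothetical.

```python
def image_bytes(height: int, width: int, channels: int, bytes_per_sample: int = 1) -> int:
    """Bytes needed for an image with 8-bit samples (1 byte per sample by default)."""
    return height * width * channels * bytes_per_sample


def skeleton_values(num_vertices: int = 17, components: int = 2) -> int:
    """Number of scalar values in 2D skeletal information: vertices x (x, y)."""
    return num_vertices * components


if __name__ == "__main__":
    img = image_bytes(480, 640, 3)
    skel = skeleton_values()
    print("image bytes     :", img)
    print("skeleton values :", skel)
    print("ratio           :", round(img / skel))
```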

[0184] In view of the above, the method for selecting the recognizers 111 according to an embodiment of the present invention will be described below. The actions of the person to be recognized include, for example, moving actions such as walking, running, standing, talking, cleaning, and starting to cook, as well as stationary actions such as sleeping, lying down, sitting, and watching TV.

[0185] First, the case where there are many options for the recognizer 111 will be explained. The first action recognition unit 104 uses at least one recognizer 111. When the first action recognition unit 104 holds N recognizers 111 and selects M recognizers 111, the number of combinations of recognizers 111 is the factorial of N. The factorial of N diverges faster than an exponential function. Considering the learning time, once N reaches approximately 5, it becomes difficult to learn all the combinations of recognizers 111.

[0186] Secondly, the problem related to the combination of recognizers 111 is represented mathematically. By performing learning for each combination of the selected recognizers 111, the recognition accuracy and computational cost of that combination can be obtained. Given a recognition accuracy that should be satisfied (e.g., 70%), searching for the combination that minimizes the computational cost while satisfying this accuracy is a constrained combinatorial optimization problem. Such a problem cannot be solved analytically, and when the number of combinations becomes too large to search exhaustively, the task becomes a search for a near-optimal solution. Furthermore, there is no universal algorithm for combinatorial optimization problems.
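When the number of evaluated combinations is small, the constrained problem can be solved by simple enumeration. The sketch below assumes that the accuracy and cost of each combination have already been measured; the table of results, the recognizer numbers, and the function name are hypothetical and serve only to show the shape of the search.

```python
def cheapest_combination(results, min_accuracy):
    """Return (combination, cost) with the lowest cost among the combinations
    whose accuracy is at least min_accuracy, or None if none qualifies.

    results maps a frozenset of recognizer ids to (accuracy, cost).
    """
    feasible = [
        (cost, sorted(combo))
        for combo, (accuracy, cost) in results.items()
        if accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    cost, combo = min(feasible)  # lowest cost first, ties broken by ids
    return frozenset(combo), cost


if __name__ == "__main__":
    # Hypothetical measurements: (recognition accuracy, relative cost).
    measured = {
        frozenset({1}): (0.66, 1.0),
        frozenset({2}): (0.61, 2.0),
        frozenset({1, 2}): (0.74, 3.0),
        frozenset({1, 3}): (0.72, 2.5),
        frozenset({1, 2, 3}): (0.73, 4.5),
    }
    print(cheapest_combination(measured, 0.70))
```

Each entry in such a table requires a full learning and evaluation run, which is why exhaustive enumeration stops being practical as the number of recognizers grows.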

[0187] Here, in order to search efficiently for the recognizers 111 to be selected, the recognizers 111 are selected based on prior knowledge and a greedy algorithm.

[0188] First, the prior knowledge will be explained. Figure 10 is a schematic diagram representing the data structure of a prior knowledge table T4 that summarizes information related to the recognizers 111. This table is stored in memory 200. The prior knowledge table T4 has the following items for each recognizer 111: identification number, identification content, input sensor data, relative computational cost, and dependencies. Each item is explained below. The identification number is a unique number (e.g., a serial number) assigned to the registered recognizer 111, and serves as an index or hash key in the combination search. The identification content is the content identified by the recognizer 111. The input sensor data is the sensor data required by the recognizer 111 for inference, and the number of such inputs is 0 or more. Zero is allowed because a recognizer 111 that infers solely from features calculated by other recognizers 111 does not require sensor data. The relative computational cost is a relative value of the computational cost of the recognizer 111, calculated from the results of a benchmark run on a reference computer. Alternatively, the relative computational cost can be calculated from benchmark results on multiple computers, or the absolute value of the benchmark result itself can be used.

[0189] The dependency relationship of a recognizer indicates whether it depends on information from other recognizers 111. For example, recognizers 111 with identification numbers 4 to 6 depend on recognizer 111 with identification number 3. This is because the results of face detection are needed when performing head orientation estimation, age and gender estimation, and individual estimation.
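One way to picture the prior knowledge table and its dependency column is as a small record type plus a helper that expands a selection to include everything it depends on. The sketch below is an assumption about a possible representation, not the structure of Figure 10; the field names, the relative cost values, and the listed input data are hypothetical, while the dependency of recognizers 4 to 6 on recognizer 3 follows the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizerEntry:
    ident: int                 # identification number (index in the search)
    content: str               # identification content
    inputs: tuple              # input sensor data names; may be empty
    relative_cost: float       # relative computational cost (hypothetical values)
    depends_on: tuple = ()     # identification numbers this recognizer depends on


# Hypothetical excerpt of a prior knowledge table.
T4 = {
    3: RecognizerEntry(3, "face detection", ("image",), 1.0),
    4: RecognizerEntry(4, "head orientation estimation", ("image",), 0.5, (3,)),
    5: RecognizerEntry(5, "age and gender estimation", ("image",), 0.5, (3,)),
    6: RecognizerEntry(6, "individual estimation", ("image",), 0.75, (3,)),
}


def with_dependencies(selected, table):
    """Expand a selection so that every required recognizer is included."""
    closed, stack = set(), list(selected)
    while stack:
        ident = stack.pop()
        if ident not in closed:
            closed.add(ident)
            stack.extend(table[ident].depends_on)
    return closed


def total_cost(selected, table):
    """Sum of relative costs, counting each required recognizer once."""
    return sum(table[i].relative_cost for i in with_dependencies(selected, table))


if __name__ == "__main__":
    print(sorted(with_dependencies({4, 5}, T4)))
    print(total_cost({4, 5}, T4))
```

Counting a shared dependency only once matters for the cost comparison: selecting both head orientation and age and gender estimation pays for face detection a single time.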

[0190] Alternatively, a recognizer 111 that detects hands or feet can be used, or a recognizer 111 that detects specific regions of a person can be used. Moreover, a recognizer 111 that outputs dense data, such as a mask representing the extent of an image, can also be used.

[0191] Next, the greedy algorithm will be explained. A greedy algorithm is a method that makes the best choice based on partial information. When selecting recognizers 111, first, for each recognizer 111 contained in the prior knowledge table T4, the recognition accuracy obtained when only that recognizer 111 is selected is calculated using the learning data. If the recognition accuracy is high when using only a certain recognizer 111, it can be expected, though not guaranteed, that the recognition accuracy when combining that recognizer 111 with other recognizers 111 will also be high. The greedy algorithm adopts this assumption and uses it to determine the priority order in which combinations of recognizers 111 are evaluated. Specifically, the recognizers 111 are sorted in descending order of their single-recognizer accuracy, and evaluation starts from the first recognizer 111. For example, when indices are assigned sequentially starting with the first recognizer 111, combinations using indices closer to the beginning are evaluated preferentially, in the order of 1 and 2, 1 and 3, 1 to 3, 1 to 4. More precisely, the selection of indices can be controlled by a parameter X, which limits the candidates to the first X indices, and a parameter Y, which allows up to Y indices to be selected. The parameters X and Y are determined based on the performance of the computer performing the inference; for example, X = 4 and Y = 3. The recognizer selection table T2 shown in Figure 5 is created, for example, based on the selection results of the recognizers 111 obtained as described above.
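The ranking step and the X and Y limits can be sketched as follows. This is one possible reading, under stated assumptions: the recognizer names and single-recognizer accuracies are hypothetical, and the ordering rule used here (combinations whose lowest-ranked member is nearer the top come first) is an illustrative choice rather than the exact order given in the text.

```python
from itertools import combinations


def rank_recognizers(single_accuracy):
    """Sort recognizer names by single-recognizer accuracy, best first."""
    return sorted(single_accuracy, key=lambda name: single_accuracy[name], reverse=True)


def candidate_combinations(ranked, x, y):
    """List combinations drawn from the top-x recognizers, at most y at a time.

    Combinations are ordered so that those using only highly ranked
    recognizers are evaluated first (an assumed priority rule).
    """
    top = ranked[:x]
    index_sets = [
        combo
        for size in range(1, y + 1)
        for combo in combinations(range(len(top)), size)
    ]
    index_sets.sort(key=lambda combo: (max(combo), len(combo), combo))
    return [tuple(top[i] for i in combo) for combo in index_sets]


if __name__ == "__main__":
    # Hypothetical accuracies measured with one recognizer at a time.
    accuracy = {"pose": 0.80, "object": 0.75, "face": 0.70, "head": 0.65, "hand": 0.60}
    ranked = rank_recognizers(accuracy)
    for combo in candidate_combinations(ranked, x=4, y=3):
        print(combo)
```

With X = 4 and Y = 3, the candidate list contains 14 combinations instead of every subset of the five recognizers, and the fifth-ranked recognizer is never evaluated.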

[0192] Classifier selection

[0193] Next, the selection of classifier 122 will be explained. Since the feature data output by the first action recognition unit 104 is data vectorized with fewer parameters, similar to skeletal information, recognition can be performed even without using a CNN. Classifier 122 significantly reduces computational processing compared to a CNN. Cross-validation can be performed on multiple candidate classifiers to select the classifier 122 with the highest accuracy or F1 score. Candidate classifiers 122 may include, for example, classifiers using logistic regression, support vector machine (SVM), decision tree, random forest, k-nearest neighbors, Gaussian Naive Bayes, perceptron, or stochastic gradient descent (SGD).
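The cross-validation step can be illustrated without any external library. In the sketch below, two very simple stand-in classifiers (nearest centroid and one-nearest-neighbor) take the place of the candidates listed above, and the feature vectors are made up; in practice a machine learning library would supply the candidates. All names are hypothetical.

```python
import math


class NearestCentroid:
    """Stand-in classifier: predicts the label of the closest class mean."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda c: math.dist(x, self.centroids[c]))


class OneNearestNeighbor:
    """Stand-in classifier: predicts the label of the closest training sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        best = min(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))
        return self.y[best]


def cross_val_accuracy(make_clf, X, y, k=3):
    """Mean accuracy over k interleaved folds."""
    correct = 0
    for start in range(k):
        held_out = set(range(start, len(X), k))
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        clf = make_clf().fit(train_X, train_y)
        correct += sum(clf.predict(X[i]) == y[i] for i in held_out)
    return correct / len(X)


def select_classifier(candidates, X, y, k=3):
    """Return the name of the best-scoring candidate and all scores."""
    scores = {name: cross_val_accuracy(make, X, y, k) for name, make in candidates.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up 2D feature vectors with labels 0 / 1.
    X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    y = [0, 0, 0, 1, 1, 1]
    candidates = {"nearest_centroid": NearestCentroid, "one_nn": OneNearestNeighbor}
    print(select_classifier(candidates, X, y))
```

The same loop applies unchanged if the score is replaced by the F1 score, since only the per-fold metric differs.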

[0194] Classifier learning

[0195] Next, the learning process of classifier 122 will be explained. Classifier 122 can learn using learning data containing labels of object actions. For example, image data of moving images is prepared as learning data with the labels "start cooking" and "have not started cooking yet". The action recognition device 1 then uses the prepared image data to perform the action recognition process shown in Figure 9, compares the obtained labels with the correct labels in the learning data, and updates the weights of classifier 122 so as to minimize the error.

[0196] Simulation

[0197] Next, a simulation of the action recognition device 1 to confirm that the computational processing load is less than that of existing action prediction AI will be explained.

[0198] In this simulation, CNNs were used as the existing recognizers 111. Specifically, "1: PoseNet" was used as the pose inference recognizer 111, "2: SSD" was used as the object detection recognizer 111, "3: RetinaFace" was used as the face detection recognizer 111, and "4: DeepHeadPose" was used as the head orientation inference recognizer 111. A classifier using stochastic gradient descent was used as the classifier 122. On the other hand, "5: RepresentationFlowNet" was used as the existing action inference AI.

[0199] The input image data used was a single color image containing three people. This image data had a resolution of 640×480, 3 channels, and a bit depth of 8 bits. The computational cost of classifier 122 was treated as negligible because it is sufficiently small compared to that of the existing recognizers 111, regardless of the number of categories to be recognized.

[0200] The computer used was a desktop personal computer whose computation is accelerated by a graphics processing unit (product model: GeForce GTX 1080 Ti).

[0201] Figure 11 is a table summarizing the per-frame processing time of the existing recognizers 111. In the simulation, the processing time of each recognizer 111 for one frame was measured. Figure 11 shows the processing times for processes A (pose estimation), B (object detection), and C (head orientation detection). The processing time for process C includes the processing time for both head orientation estimation and face detection. Since processes A, B, and C are independent, they can be executed in parallel. Ignoring overhead time (time not directly related to the processing), the maximum processing time among processes A, B, and C is 0.0455 seconds, for process C. On the other hand, the overall processing time when processes A, B, and C are executed sequentially is 0.0725 seconds.

[0202] The processing time for process D (the existing action prediction AI) is 0.1429 seconds. Therefore, the processing speed when processes A, B, and C are executed in parallel is approximately three times that of process D, and the processing speed when processes A, B, and C are executed sequentially is approximately twice that of process D. Thus, it can be confirmed that processing using multiple recognizers 111 is significantly faster than the existing action prediction AI.
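The speed ratios follow directly from the three processing times quoted in paragraphs [0201] and [0202]. The short sketch below only restates that arithmetic; the constant and function names are hypothetical.

```python
PARALLEL_S = 0.0455    # slowest of processes A, B, C (process C)
SEQUENTIAL_S = 0.0725  # processes A, B, C executed one after another
EXISTING_S = 0.1429    # process D, the existing action prediction AI


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def frames_per_second(seconds_per_frame: float) -> float:
    """Throughput implied by a per-frame processing time."""
    return 1.0 / seconds_per_frame


if __name__ == "__main__":
    print("parallel vs D  :", round(speedup(EXISTING_S, PARALLEL_S), 2))
    print("sequential vs D:", round(speedup(EXISTING_S, SEQUENTIAL_S), 2))
    print("fps parallel   :", round(frames_per_second(PARALLEL_S), 1))
    print("fps existing   :", round(frames_per_second(EXISTING_S), 1))
```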

[0203] Next, a simulation conducted to evaluate the recognition accuracy of the action recognition device 1 will be described. In this simulation, the input image data used was the image data of the five moving images shown in Figure 12. Figure 12 is a table summarizing the input image data. The video ID is an identifier used to identify each of the five moving images. For each moving image, Figure 12 lists the total number of frames labeled "start cooking" and the total number of frames labeled "have not started cooking".

[0204] Figure 13 is a table summarizing the results of the simulation conducted to evaluate the recognition accuracy of the action recognition device 1. In Figure 13, for example, "process A+B+C" represents the case where the first action recognition unit 104 is constructed using the recognizer 111 of process A, the recognizer 111 of process B, and the recognizer 111 of process C. Similarly, "process A+B" represents the case where the first action recognition unit 104 is constructed using the recognizer 111 of process A and the recognizer 111 of process B. In this simulation, for each combination of processes A through C shown in Figure 13, the recognition accuracy and processing time of each of eight classifiers 122 were measured. The eight classifiers 122 are logistic regression, support vector machine, decision tree, random forest, k-nearest neighbors, Gaussian Naive Bayes, perceptron, and stochastic gradient descent. In this simulation, the classifier 122 using stochastic gradient descent had the shortest processing time and the highest recognition accuracy among the eight classifiers 122.

[0205] Furthermore, for the classifier 122 using stochastic gradient descent, the average recognition accuracy across all combinations of processes A through C shown in Figure 13 is 88.5%. This is higher than the 50% expected from random guessing between the two categories "start cooking" and "have not started cooking". Therefore, based on this simulation result, it can be confirmed that action recognition device 1 is capable of recognizing actions.

[0206] Of the combinations shown in Figure 13, the highest recognition accuracy, 88.8%, is achieved with process A+C. However, process A alone also yields a recognition accuracy of 88.7%. If a recognition accuracy threshold of 88.7% is assumed, the computational cost is minimal when only process A is used. Process A+B+C is not optimal because it includes a recognizer 111 that does not contribute to improving accuracy; this recognizer 111 behaves like a noise source.

[0207] The simulation results above are just an example. The recognizers 111 constituting the first action recognition unit 104 are not limited to the recognizers described above, and the optimal recognizers 111 can be adopted as appropriate for the object action. Moreover, the classifier 122 can also be a classifier other than one using stochastic gradient descent.

[0208] Variations

[0209] (1) In the above embodiment, the action recognition device 1 is composed of a single unit. However, the present invention is not limited to this, and the device can also be composed of multiple devices. Figure 14 is a block diagram illustrating an example of the configuration of an action recognition device 1A according to a variation of the present invention. The action recognition device 1A is configured as a server. The action recognition device 1A also includes a communication unit 500. The communication unit 500 transmits the recognition result output by the output unit 106 to a home appliance 600 via a network NT and a gateway 700. The communication unit 500 also receives image data captured by an image sensor 300 installed in the residence. The network NT is, for example, a wide area communication network such as the Internet. The gateway 700 is installed in the residence and connects the image sensor 300 and the home appliance 600 to the network NT. The home appliance 600 is, for example, a washing machine, a microwave oven, or a television. The home appliance 600 either performs control based on the recognition result sent from the action recognition device 1A or displays the recognition result. Thus, even when configured as a server, the action recognition device 1A can recognize the actions of people in a residence.

[0210] (2) In the above embodiments, in addition to using image data, at least one of thermal image data, depth image data, audio data, room temperature data, humidity data, illuminance data, and radio wave data can also be used as input data.

[0211] (3) When cameras installed in a residence are used to recognize the actions of people within the residence, it is assumed that each camera's location and angle are fixed or rarely changed. Furthermore, the camera's field of view is approximately 110° with a wide-angle lens and less than 110° with a narrow-angle lens, so the area captured by a particular camera is only a portion of the space where it is installed. Therefore, it is assumed that the camera is installed so that it captures spaces where the actions to be recognized occur frequently or are of high importance. Under this assumption, if the space where the camera is installed and the objects captured or potentially captured by the camera are known in advance, the accuracy of action recognition can be improved. For example, actions with a frequency of 0 can be excluded from the actions to be recognized. Moreover, if a sofa, which is rarely moved, is captured by the camera, the frequency of actions involving sitting on the sofa can be expected to be higher. Likewise, if a movable chair is present in the space, the frequency of actions involving sitting on the chair can be expected to be higher. On this basis, the values of the weight coefficients in the weight table T3 shown in Figure 6 can be determined.

[0212] In action recognition within a residence, direct contact between a person and an object, or interaction between a person and an object located near that person, is particularly important. When the location of an object that can be moved, opened, closed, or manipulated (the first object described below) is known, the positional relationship or orientation of a person relative to that object becomes the basis for action recognition. This is because people frequently move, open, close, or manipulate objects while taking action.

[0213] Water-related facilities that cannot be moved within a residence (the second object described below) are also important in action recognition. Actions involving water, such as cooking or washing one's face, typically make use of such facilities located within the residence. Therefore, the relationship between the location of water-related facilities and the location of a person becomes crucial data for action recognition. This is because people frequently require water in their daily lives.

[0214] A residence is typically divided into about 4 to 10 rooms by doors or by fixed or movable walls. Furthermore, multi-story residences also have means of moving between floors, such as staircases. Therefore, a residence is an object containing multiple spaces and multiple doors (the third object described below). The names of these spaces, the locations of the doors or staircases, and their relationship to a person's location become the basis for action recognition. This is because actions such as cooking or taking a bath are performed only in specific rooms.

[0215] As mentioned above, information about where an object or device is located and how space is divided is particularly important in action recognition. While technologies that consider space names such as living room or bedroom for action recognition exist, even the same space name can be used in various ways, making it insufficient to rely solely on space names like living room or bedroom as the basis for action recognition. Therefore, it is particularly important to consider what kind of interactions between people and objects might occur in what kind of spaces. In view of these considerations, a variation (3) of the present invention will be described below.

[0216] Figure 15 is a block diagram illustrating an example of the configuration of the action recognition device 1B according to variation (3) of the present invention. In this variation, room layout features are extracted based on room layout information 202, and an action selection table T1 is generated based on the extracted room layout features. In this variation, the room layout information 202 includes not only information about the spaces within the building but also information about the objects (facilities and equipment) installed in each space (category information and location information).

[0217] The action recognition device 1B is communicatively connected to the display terminal 400 via a predetermined communication path. The predetermined communication path can be a wireless LAN, Bluetooth (registered trademark), or the Internet. The display terminal 400 is, for example, a smartphone or tablet computer held by a user. The user is, for example, the installer of the image sensor 2. The installer is, for example, a resident of the house or a construction contractor.

[0218] In addition to the configuration shown in Figure 1, the processor 100B of the action recognition device 1B includes a setting support unit 301 and a room layout feature extraction unit 302. The room layout feature extraction unit 302 extracts the objects installed within the building based on the room layout information 202, and classifies each extracted object into one of three categories: movable first objects, second objects serving as water-related facilities, or third objects serving as building structures. For each classified object, it extracts room layout features that associate the installation location with classification information representing the classification result. The room layout features are, for example, organized as a two-dimensional table.

[0219] The first object includes movable objects such as furniture and household appliances. Specifically, the first object is, for example, a vacuum cleaner, coffee machine, laptop, chair, or sofa.

[0220] The second object is a water-related facility that cannot be moved, such as a sink or washbasin.

[0221] The third object refers to the structural elements of a building, such as the entrance hall, kitchen, living room, dining room, bedroom, bathroom, elevator, and toilet.

[0222] The locations of the first and second objects can be expressed using two-dimensional or three-dimensional coordinate data referenced to a point in the building, such as the entrance hall. Likewise, the location of the third object can be expressed using coordinate data representing the area that the third object occupies. This makes it possible to determine in which space within the building the first and second objects are located. Furthermore, in addition to the coordinate data, the locations of the first and second objects can also include the name of the space within the building where they are located.

[0223] The classification information indicates which of the first to third objects an object extracted from the room layout information 202 corresponds to.
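A room layout feature table of the kind described in paragraphs [0218] to [0223] could be sketched as a list of rows, each pairing an object with its classification and installation location. The representation below is an assumption for illustration only: the class and function names, the coordinate values, and the tuple layout are hypothetical, while the example categories are taken from paragraphs [0219] to [0221].

```python
from enum import Enum


class ObjectClass(Enum):
    FIRST = "movable object"
    SECOND = "water-related facility"
    THIRD = "building structure"


# Example categories from the description, mapped to the three classes.
CATEGORY_TO_CLASS = {
    "vacuum cleaner": ObjectClass.FIRST,
    "coffee machine": ObjectClass.FIRST,
    "laptop": ObjectClass.FIRST,
    "chair": ObjectClass.FIRST,
    "sofa": ObjectClass.FIRST,
    "sink": ObjectClass.SECOND,
    "washbasin": ObjectClass.SECOND,
    "entrance hall": ObjectClass.THIRD,
    "kitchen": ObjectClass.THIRD,
    "living room": ObjectClass.THIRD,
    "dining room": ObjectClass.THIRD,
    "bedroom": ObjectClass.THIRD,
    "bathroom": ObjectClass.THIRD,
    "toilet": ObjectClass.THIRD,
}


def extract_layout_features(layout):
    """Build table rows of (category, classification, space name, position).

    layout is a list of (category, space name, position) tuples, standing in
    for the room layout information.
    """
    return [
        (category, CATEGORY_TO_CLASS[category], space, position)
        for category, space, position in layout
    ]


if __name__ == "__main__":
    # Hypothetical layout with made-up 2D coordinates in metres.
    layout = [
        ("sofa", "living room", (2.0, 3.5)),
        ("sink", "kitchen", (0.5, 1.0)),
        ("coffee machine", "kitchen", (1.2, 1.0)),
    ]
    for row in extract_layout_features(layout):
        print(row)
```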

[0224] The room layout feature extraction unit 302 generates an action selection table T1 based on the extracted room layout features.

[0225] For example, the location of the first object corresponds to the location of the user's action. For example, the location of the coffee machine can correspond to the action of brewing coffee. For example, the location of the microwave oven can correspond to the action of the user starting to cook.

[0226] Furthermore, since the second object is a water-related facility, its location indicates the position where the action of using water is performed. For example, the location of the sink can correspond to the action of washing dishes.

[0227] The placement of a third object indicates the name of the space, the user's entry and exit from the space, and the range of movement the user can make. For example, the placement of the bathroom door can correspond to the act of taking a shower.

[0228] The room layout feature extraction unit 302 generates an action selection table based on the extracted room layout features. As shown in Figure 4, the action selection table T1 associates each of one or more spaces within a building with the object actions a user is most likely to perform in that space.

[0229] Here, the room layout feature extraction unit 302 can determine what kind of equipment or facilities are installed in each space based on the room layout features, and generate the action selection table T1 by referring to predetermined rules based on the determination results and the classification information. For example, rules can be adopted that associate equipment or facilities with the object actions most likely to be performed using them, such as associating the microwave oven (a first object) with heating food, the coffee machine (a first object) with brewing coffee, and the sink (a second object) with washing dishes.

[0230] For example, suppose a microwave oven, coffee maker, and sink are installed in the kitchen. In this case, the room layout feature extraction unit 302 can generate an action selection table T1 for the kitchen that corresponds to object actions such as starting to cook, making coffee, and washing dishes. Furthermore, the room layout feature extraction unit 302 can similarly map other spaces within the building to object actions.
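The rule-based generation of the action selection table can be sketched as a lookup from installed equipment to object actions. This is a minimal sketch under stated assumptions: the rule table and function name are hypothetical, and the microwave oven is mapped to "start cooking" following paragraphs [0225] and [0230].

```python
# Hypothetical rule table: equipment or facility -> likely object action.
EQUIPMENT_RULES = {
    "microwave oven": "start cooking",
    "coffee machine": "brew coffee",
    "sink": "wash dishes",
}


def build_action_selection_table(equipment_by_space):
    """Map each space to the object actions implied by its equipment."""
    table = {}
    for space, items in equipment_by_space.items():
        actions = []
        for item in items:
            action = EQUIPMENT_RULES.get(item)
            if action is not None and action not in actions:
                actions.append(action)
        table[space] = actions
    return table


if __name__ == "__main__":
    layout = {
        "kitchen": ["microwave oven", "coffee machine", "sink"],
        "living room": ["sofa"],
    }
    print(build_action_selection_table(layout))
```

Spaces whose equipment matches no rule end up with an empty action list here; rules tied to the space itself, as described in the next paragraph, would fill in such cases.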

[0231] Moreover, rules can be defined not only for equipment or facilities but also for the object actions that are highly likely to be performed in the space itself. For example, as shown in Figure 4, rules can be adopted that associate the kitchen with starting to cook, the bathroom with doing laundry, the living room, dining room, and bedroom with reading, and the living room and dining room with eating.

[0232] Furthermore, the room layout feature extraction unit 302 can also set, based on the extracted room layout features, the weight coefficients recorded in the weight table T3 shown in Figure 6. For example, if the kitchen is equipped with a sink, refrigerator, microwave oven, and gas stove, the weight coefficient We for the kitchen can be set to 1 for the sink, refrigerator, microwave oven, and gas stove, and 0 for other equipment.

[0233] The room layout feature extraction unit 302 can also set the weight coefficient We for other spaces besides the kitchen in the same way as for the kitchen.
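The kitchen example for the weight coefficient We amounts to a 1-or-0 flag per piece of equipment. The sketch below shows that assignment; the function name and the list of candidate equipment are hypothetical, and the actual layout of weight table T3 in Figure 6 may differ.

```python
def weight_coefficients(installed, all_equipment):
    """Return We per equipment: 1 if installed in the space, otherwise 0."""
    installed = set(installed)
    return {name: (1 if name in installed else 0) for name in all_equipment}


if __name__ == "__main__":
    # Hypothetical list of equipment considered by the weight table.
    all_equipment = ["sink", "refrigerator", "microwave oven", "gas stove", "sofa", "television"]
    kitchen = ["sink", "refrigerator", "microwave oven", "gas stove"]
    print(weight_coefficients(kitchen, all_equipment))
```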

[0234] In addition, the first acquisition unit 101 can acquire the action selection table T1 generated in this way as object action information.

[0235] The setting support unit 301 obtains, through the display terminal 400, the name of the space of the building where the image sensor 2 is to be installed, and outputs installation guidance to the display terminal 400. The installation guidance is used to install the image sensor 2 so that a specific device or facility related to the space is included in the field of view of the image sensor 2.

[0236] Figure 16 is a schematic diagram illustrating a scenario in which the image sensor 2 is installed in the entryway 501. Figure 17 is a schematic diagram illustrating an example of the interaction between a user and the display terminal 400 in the scenario where the image sensor 2 is installed in the entryway 501.

[0237] In step ST1, the display terminal 400 accepts the user's request to open the settings screen, and displays a list of spaces where the image sensor 2 is to be set. Here, the list displays names such as A: Entrance Hall, B: Kitchen, C: Living Room, etc., for the spaces where the image sensor 2 is to be set.

[0238] In step ST2, the display terminal 400 accepts the user's operation of selecting, from the list of space names, the name of the space where the image sensor 2 is to be installed. The display terminal 400 sends the selected space name to the action recognition device 1B. Here, since the image sensor 2 is installed in the entrance hall 501, "A: Entrance Hall" is selected.

[0239] In step ST3, the display terminal 400 displays a message prompting the user to set up the image sensor 2, such as, "Please install it at a position where the door is in view. After installing, please press OK." An OK button may also be displayed on the setting screen together with this message. This OK button is used to notify the display terminal 400 that the setup of the image sensor 2 is complete.

[0240] As a result of this setup, the image sensor 2 is installed in the entrance hall 501 as shown in Figure 16.
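The mapping from the selected space to the setup guidance in steps ST2 and ST3 can be sketched as follows. This is a minimal sketch, not the device's implementation: the table `SPECIFIC_DEVICE_BY_SPACE`, the function `setup_guidance`, and the message wording are hypothetical. Only the entrance hall/door and kitchen/refrigerator pairs come from the description.

```python
from typing import Optional

# Space name -> specific device or facility that should be in the field of view.
# Only the two pairs named in the description are listed here.
SPECIFIC_DEVICE_BY_SPACE = {
    "Entrance Hall": "door",
    "Kitchen": "refrigerator",
}


def setup_guidance(space_name: str) -> Optional[str]:
    """Build the guidance message for a space, or None if no device is registered."""
    device = SPECIFIC_DEVICE_BY_SPACE.get(space_name)
    if device is None:
        return None
    return (f"Please install the sensor at a position where the {device} "
            "is in view. After installing, please press OK.")
```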

[0241] In step ST4, the display terminal 400 accepts the user's action of pressing the OK button. The display terminal 400, having accepted the OK button press, sends information indicating that the OK button has been pressed to the motion recognition device 1B.

[0242] Upon receiving information indicating that the OK button has been pressed, the setting support unit 301 acquires image data captured by the image sensor 2 and performs processing to detect the door 401 from the acquired image data. At this time, the setting support unit 301 causes the display terminal 400 to display a message saying "Automatically detecting the door" (step ST5). Here, the setting support unit 301 can detect the door 401 using a pre-defined identifier. The door 401 is an example of a specific device for the entryway 501. The specific device can be predetermined based on the space where the image sensor 2 is installed. For example, a pre-prepared identifier for detecting the door 401 can be used as the pre-defined identifier.

[0243] Figure 18 is a schematic diagram of the interaction continuing from Figure 17. In step ST6, the display terminal 400 obtains annotation information representing the detection result of the door 401 from the setting support unit 301, and displays a setting screen that overlays the annotation image indicated by the obtained annotation information onto the image captured by the image sensor 2. This setting screen displays a message prompting correction of the annotation image, such as, "Please adjust the automatically detected red guide to match the position of the door, and press OK." The annotation information is, for example, the coordinate data of the annotation image.

[0244] Figure 19 is a schematic diagram of an example of setting screens G1 and G2 on which the labeled image A1 is overlaid. Setting screen G1 displays the labeled image A1 before correction, and setting screen G2 displays the labeled image A1 after correction. Labeled image A1 is composed of a quadrilateral bounding box. Labeled image A1 has a specified color (in this case, red). Setting screens G1 and G2 display an OK button 1901 for indicating the end of the correction work on labeled image A1.

[0245] A circle marker P1 is displayed at the top left corner of the labeled image A1, and a circle marker P2 is displayed at the bottom right corner of the labeled image A1.

[0246] When the display terminal 400 receives an operation of moving the circle markers P1 and P2, it changes the size of the labeled image A1 in accordance with the received operation.

[0247] The setting screen G1 displays a labeled image A1, which is the detection result of the detector for door 401. It can be seen that the labeled image A1 is deviated to the upper left relative to door 401 and does not correctly identify door 401.

[0248] The user inputs an operation to move the circular markers P1 and P2, thereby changing the shape of the annotation image A1 into the bounding rectangle of the door 401. This results in the setting screen G2. In the setting screen G2, the annotation image A1 is positioned as the bounding rectangle of the door 401.

[0249] Therefore, the labeled image A1, which is displayed on the setting screen G1 at a position offset from the door 401, is corrected to cover the entire area surrounding the door 401 as shown on the setting screen G2. As a result, even if the recognizer cannot correctly identify the position of the door 401 on the image, it can still obtain the correct position of the door 401 on the image.

[0250] In step ST7, the display terminal 400 receives an operation from the user who has completed the adjustment of the labeled image A1, indicating that the user has pressed the OK button 1901. The display terminal 400, having received this operation, sends the coordinate data of the corrected labeled image A1 to the setting support unit 301. Upon receiving this coordinate data, the setting support unit 301 stores the coordinate data of the corrected labeled image A1 in the memory 200, corresponding to the position information 203 of the image sensor 2 installed in the entrance 501.
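The correction of the labeled image A1 with the markers P1 and P2, and the storage of the corrected coordinates in association with the sensor position, can be sketched as follows. This is an illustration under stated assumptions: the class `AnnotationBox`, the dictionary `annotation_store` (a stand-in for the memory 200), the function `save_annotation`, and all pixel coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class AnnotationBox:
    """Rectangular annotation defined by markers P1 (top left) and P2 (bottom right)."""
    p1: Tuple[int, int]
    p2: Tuple[int, int]

    def move_marker(self, marker: str, new_pos: Tuple[int, int]) -> None:
        """Move one corner marker; the box is resized accordingly."""
        if marker == "P1":
            self.p1 = new_pos
        elif marker == "P2":
            self.p2 = new_pos
        else:
            raise ValueError(f"unknown marker: {marker}")

    def as_xyxy(self) -> Tuple[int, int, int, int]:
        # Normalize so the result is valid even if P1 is dragged past P2.
        x1, x2 = sorted((self.p1[0], self.p2[0]))
        y1, y2 = sorted((self.p1[1], self.p2[1]))
        return (x1, y1, x2, y2)


# Stand-in for memory 200: sensor location -> corrected box coordinates.
annotation_store: Dict[str, Tuple[int, int, int, int]] = {}


def save_annotation(sensor_location: str, box: AnnotationBox) -> None:
    """Store the corrected coordinates keyed by the sensor's location."""
    annotation_store[sensor_location] = box.as_xyxy()


# Detector output that is offset to the upper left, then corrected by the user.
box = AnnotationBox(p1=(40, 30), p2=(200, 380))
box.move_marker("P1", (80, 60))
box.move_marker("P2", (240, 420))
save_annotation("entrance 501", box)
```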

[0251] Then, the recognizer 111, used to identify the actions of the object corresponding to the entrance 501, identifies the user's actions based on the image data captured by the image sensor 2 installed in the entrance 501, using the labeled image A1 shown in the coordinate data as a reference. Thus, the recognizer 111 can efficiently identify the user's actions.

[0252] In step ST8, the display terminal 400 displays a message saying "Settings have been saved" because it has received information from the setting support unit 301 indicating that the coordinate data has been saved.

[0253] In step ST9, the display terminal 400 receives an operation from the user to close the settings screen. As a result, the display terminal 400 closes the settings screen.

[0254] Figure 20 is a schematic diagram illustrating a scenario where the image sensor 2 is installed in the kitchen 502. Figure 21 is a schematic diagram illustrating an example of the interaction between a user and the display terminal 400 in a scenario where the image sensor 2 is installed in the kitchen 502.

[0255] Step ST1 is the same as step ST1 in Figure 17. In step ST2, the display terminal 400 accepts the user's operation selecting "B: Kitchen" as the space where the image sensor 2 is set.

[0256] Step ST3 is the same as step ST3 in Figure 17. In step ST4, the display terminal 400 accepts the user's action of pressing the OK button. The display terminal 400, having accepted the action of pressing the OK button, sends information indicating that the OK button has been pressed to the motion recognition device 1B.

[0257] The setting support unit 301, having received information indicating that the OK button has been pressed, acquires image data captured by the image sensor 2 and performs processing to detect the refrigerator 402 from the acquired image data. At this time, the setting support unit 301 causes the display terminal 400 to display a message such as "Automatically detecting the refrigerator" (step ST5). Here, the setting support unit 301 can detect the refrigerator 402 using a pre-defined identifier. The refrigerator 402 is an example of a specific appliance in the kitchen 502. The specific appliance is predetermined based on the space where the image sensor 2 is installed. The pre-defined identifier is, for example, an identifier prepared in advance for detecting the refrigerator 402.

[0258] Figure 22 is a schematic diagram of the interaction continuing from Figure 21. Steps ST6 to ST9 are the same as in Figure 18. Figure 23 is a schematic diagram illustrating an example of setting screens G3 and G4 on which the labeled image A1 is overlaid. Setting screen G3 displays the labeled image A1 before correction, and setting screen G4 displays the labeled image A1 after correction. Setting screen G3 displays the labeled image A1 as the detection result of the detector for the refrigerator 402. It can be seen that the labeled image A1 is offset upward relative to the refrigerator 402 and does not correctly indicate the refrigerator 402. The user inputs an operation to move the circular markers P1 and P2, thereby changing the shape of the labeled image A1 into the bounding rectangle of the refrigerator 402. Thus, setting screen G4 is obtained. In setting screen G4, the labeled image A1 is positioned as the bounding rectangle of the refrigerator 402.

[0259] Therefore, the annotation image A1, which was displayed on the setting screen G3 at a position offset from the refrigerator 402, is corrected to surround the entire area of the refrigerator 402 as shown on the setting screen G4. As a result, even if the recognizer cannot correctly identify the position of the refrigerator 402 on the image, the correct position of the refrigerator 402 on the image can be obtained.

[0260] Then, the recognizer 111, used to identify the object actions corresponding to the kitchen 502, identifies the user's actions based on the image data captured by the image sensor 2 installed in the kitchen 502, using the labeled image A1 indicated by the coordinate data as a reference. Thus, the recognizer 111 can efficiently identify the user's actions.

[0261] Figure 24 is a schematic diagram illustrating an example of the interaction between the user and the display terminal 400 when correcting the annotation image A1 after the image sensor 2 has been set up. The angle of the image sensor 2 may change due to external forces after setup. In this case, the annotation image A1 indicated by the coordinate data stored during setup may deviate from the position of the specific device or facility in the image. To correct this deviation, the following correction operation is performed.

[0262] In step ST1, the display terminal 400 receives an operation from the user to open the settings screen and displays a message prompting the user to confirm whether the image sensor 2's position is off. Here, a message such as "Please confirm whether the camera position is off. The red guide indicates the current setting. The light blue guide indicates the automatically detected position" is displayed.

[0263] Figure 25 is a schematic diagram illustrating an example of setting screens G5 and G6 on which the annotation image A1 is overlaid. Setting screen G5 shows the annotation image A1 before correction, and setting screen G6 shows the annotation image A1 after correction. Annotation image A1 corresponds to the red guide. The dashed annotation image A2 is the annotation image representing the detection result of the detector for the refrigerator 402. Annotation image A2 corresponds to the light blue guide.

[0264] In the setting screen G5, because the angle of the image sensor 2 is deviated from the angle at the setting time, the displayed annotation image A1 is deviated upward relative to the annotation image A2.

[0265] In step ST2, the user confirms the angle deviation of the image sensor 2 from the setting screen G5 and performs the adjustment operation on the angle of the image sensor 2.

[0266] In step ST3, the display terminal 400 determines through image processing whether the position of the labeled image A1 is consistent with the position of the labeled image A2. If the positions are consistent, the display terminal 400 displays a message indicating that the angle of the image sensor 2 has returned to the angle at the time of setup.

[0267] In step ST4, the display terminal 400 receives an operation from the user to close the settings screen. Thus, the display terminal 400 closes the settings screen.
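The description does not specify which image processing decides whether the positions of A1 and A2 are consistent. One common choice, shown here only as an assumption, is to compare the two bounding boxes by intersection over union (IoU) against a threshold. The names `iou` and `positions_consistent` and the threshold value 0.9 are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positions_consistent(a1: Box, a2: Box, threshold: float = 0.9) -> bool:
    """Treat the stored guide A1 and the detected guide A2 as consistent
    when their overlap reaches the (assumed) threshold."""
    return iou(a1, a2) >= threshold
```

A threshold below 1.0 tolerates small detector jitter, so the user is not asked to readjust the sensor for offsets of only a pixel or two.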

[0268] By allowing the user to make such adjustments, even if the angle of image sensor 2 changes after the settings are applied, the angle of image sensor 2 can be returned to the angle it was set at. Therefore, the recognizer 111 can accurately identify the user's actions.
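For orientation, the recognition flow that relies on this setup (one or more recognizers calculate feature quantities, the feature quantities are combined and weighted, and the candidate action is identified from the result) can be sketched as follows. This is a toy illustration under several assumptions: the recognizers are plain functions standing in for CNNs, combination is done by concatenation, and identification uses a hand-written linear score instead of a trained classifier. All names and numbers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A recognizer maps image data to a list of feature quantities.
Recognizer = Callable[[Sequence[float]], List[float]]


def combine_features(image: Sequence[float],
                     recognizers: List[Recognizer]) -> List[float]:
    """Concatenate the feature quantities calculated by each recognizer."""
    combined: List[float] = []
    for recognizer in recognizers:
        combined.extend(recognizer(image))
    return combined


def weight_features(features: List[float], weights: List[float]) -> List[float]:
    """Apply one predetermined weight coefficient per feature quantity."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return [f * w for f, w in zip(features, weights)]


def identify(features: List[float],
             class_weights: Dict[str, List[float]]) -> str:
    """Pick the candidate action with the highest linear score."""
    scores = {
        action: sum(f * w for f, w in zip(features, ws))
        for action, ws in class_weights.items()
    }
    return max(scores, key=scores.get)


# Toy stand-ins for CNN-based recognizers.
mean_intensity: Recognizer = lambda img: [sum(img) / len(img)]
peak_intensity: Recognizer = lambda img: [max(img)]

image = [0.2, 0.4, 0.9]
features = combine_features(image, [mean_intensity, peak_intensity])
weighted = weight_features(features, [1.0, 2.0])
result = identify(weighted, {"cooking": [1.0, 1.0], "eating": [1.0, -1.0]})
```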

[0269] Industrial applicability

[0270] The action recognition device of the present invention has practical value in recognizing the actions of people in buildings.

Claims

1. A motion recognition device for recognizing the actions of people in a building, characterized in that it comprises:
a first acquisition unit that acquires object action information including one or more object actions predetermined as recognition targets, room layout information of the building, and location information of an image sensor installed in the building;
an action selection unit that selects a candidate action as a recognition candidate from the one or more object actions included in the object action information, based on the location information of the image sensor and the room layout information;
a second acquisition unit that acquires the image data detected by the image sensor;
a first action recognition unit that determines one or more recognizers corresponding to the candidate action and uses the one or more recognizers to calculate the feature quantity of the image data;
a second action recognition unit that identifies the candidate action based on the feature quantity; and
an output unit that outputs the recognition result of the second action recognition unit.

2. The motion recognition device according to claim 1, characterized in that, the first action recognition unit, when the candidate action is a prescribed action, determines a plurality of recognizers, and the second action recognition unit combines the feature quantities calculated by the plurality of recognizers respectively, and identifies the candidate action based on the combined feature quantities.

3. The motion recognition device according to claim 2, characterized in that, The prescribed actions are cleaning, brushing teeth, cooking, doing laundry, using a computer, reading, or eating.

4. The motion recognition device according to any one of claims 1 to 3, characterized in that, each recognizer is constructed using a convolutional neural network, and the second action recognition unit uses a classifier employing one of the following methods to identify the candidate action: logistic regression, support vector machine, decision tree, random forest, k-nearest neighbor, Gaussian naive Bayes, perceptron, or stochastic gradient descent.

5. The motion recognition device according to any one of claims 1 to 3, characterized in that, The second action recognition unit uses a classifier that performs machine learning by using the feature quantity as an explanatory variable and the object action as the objective variable to identify the candidate action.

6. The motion recognition device according to any one of claims 1 to 3, characterized in that, the second action recognition unit weights each feature quantity using a weighting coefficient predetermined according to the candidate action, and identifies the candidate action based on the weighted feature quantities.

7. The motion recognition device according to any one of claims 1 to 3, characterized in that, one or more objects located in the building are extracted from the room layout information, and each object is classified as one of a first object that is movable, a second object that is a water-related facility, and a third object that is a building structure; room layout feature quantities corresponding to the classification results and the placement locations of each object are extracted, and an action selection table is generated based on the room layout feature quantities; the action selection table is a table that associates each of one or more spaces of the building with the object actions corresponding to that space; and the first acquisition unit acquires the action selection table as the object action information.

8. The motion recognition device according to claim 7, characterized in that, the motion recognition device can be communicatively connected to a display terminal, and the motion recognition device also includes a setting support unit, which obtains the name of the space in the building where the image sensor is installed via the display terminal, and outputs setup guidance to the display terminal, the setup guidance being for setting up the image sensor so that a specific device or facility associated with the space is included in the field of view of the image sensor.

9. The motion recognition device according to claim 8, characterized in that, The setting support unit acquires image data captured by the image sensor, detects the specific device or facility contained in the image data, and overlays a labeled image representing the detection result of the specific device or facility onto the image shown in the image data.

10. The motion recognition device according to claim 9, characterized in that, The setting support unit obtains the correction instruction of the annotation image through the display terminal and stores the annotation information shown in the corrected annotation image into the memory.

11. An action recognition method for recognizing the actions of users in a building, characterized in that a computer performs the following:
acquiring object action information including one or more object actions predetermined as recognition targets;
acquiring the room layout information of the building;
acquiring the location information of an image sensor installed in the building;
selecting, based on the location information of the image sensor and the room layout information, a candidate action as a recognition candidate from the one or more object actions included in the object action information;
acquiring the image data detected by the image sensor;
determining, by a first action recognition unit, one or more recognizers corresponding to the candidate action, and using the one or more recognizers to calculate the feature quantity of the image data;
identifying, by a second action recognition unit, the candidate action based on the feature quantity; and
outputting the recognition result of the second action recognition unit.

12. A computer program product comprising a computer program for recognizing the actions of users in a building, characterized in that the computer program causes a computer to function as:
a first acquisition unit that acquires object action information including one or more object actions predetermined as recognition targets, room layout information of the building, and location information of an image sensor installed in the building;
an action selection unit that selects a candidate action as a recognition candidate from the one or more object actions included in the object action information, based on the location information of the image sensor and the room layout information;
a second acquisition unit that acquires the image data detected by the image sensor;
a first action recognition unit that determines one or more recognizers corresponding to the candidate action, and uses the determined one or more recognizers to calculate the feature quantity of the image data;
a second action recognition unit that identifies the candidate action based on the feature quantity; and
an output unit that outputs the recognition result of the second action recognition unit.