Information processing device, information processing method, and recording medium
The information processing device addresses inaccuracies in work performance assessment by aligning frame images and comparing work speeds, enhancing the detection of attention segments in target videos.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NEC CORP
- Filing Date
- 2025-12-09
- Publication Date
- 2026-07-02
AI Technical Summary
Existing video analysis technologies fail to accurately assess work performance by comparing frame images at the same time, leading to incorrect detection of attention segments due to time lags or differences in work speed.
An information processing device that identifies corresponding frame images between a target and reference video, detecting attention segments based on the correspondence and work speed comparison, using techniques like Temporal Cycle-Consistency Learning and Dynamic Time Warping.
Accurately detects attention segments in target videos, correcting for time differences and work speed variations, thereby improving the analysis of work performance.
Description
Information processing device, information processing method, and recording medium
[0001] This disclosure relates to an information processing device, an information processing method, a recording medium, and a program.
[0002] Technology related to this disclosure is disclosed in Patent Document 1. Patent Document 1 discloses a technology for determining the success or failure of a worker's work based on a comparison of a worker's work video with a skilled worker's work video. In this technology, the success or failure of each work performed by the worker is determined based on whether or not the "object that has reached a state corresponding to the progress of the work" that appears in the skilled worker's work video appears in the worker's work video at the same timing as its appearance in the skilled worker's work video.
[0003] International Publication No. 2020 / 235120
[0004] The inventors identified the following problems in a technology that analyzes the work of an evaluation subject by comparing a video of the evaluation subject performing the work with a video of an example subject.
[0005] A video image contains multiple frame images. When comparing two video images, simply comparing frame images taken at the same time (same elapsed time from the start), as in the technology disclosed in Patent Document 1, may not allow for a proper analysis of the work of the person being evaluated.
[0006] One example of the purpose of this disclosure is to develop technology for analyzing the work of an evaluated person based on video footage of that person performing their work.
[0007] According to one aspect of this disclosure, an information processing device is provided, comprising: acquisition means for acquiring a target video of an evaluation subject performing work; identification means for identifying which of a plurality of reference frame images included in a reference video corresponds to each of a plurality of target frame images included in the target video; detection means for detecting an attention section from the target video based on the result of identifying the correspondence; and notification means for providing a notification regarding the attention section.
[0008] Furthermore, according to one aspect of this disclosure, an information processing method is provided in which one or more computers acquire a target video of an evaluation subject performing work, identify which of a plurality of reference frame images included in a reference video corresponds to each of a plurality of target frame images included in the target video, detect an attention section from the target video based on the result of identifying the correspondence, and provide a notification regarding the attention section.
[0009] Furthermore, according to one aspect of this disclosure, a program is provided that causes a computer to execute: an acquisition step of acquiring a target video of an evaluation subject performing work; an identification step of identifying which of a plurality of reference frame images included in a reference video corresponds to each of a plurality of target frame images included in the target video; a detection step of detecting an attention section from the target video based on the result of identifying the correspondence; and a notification step of providing a notification regarding the attention section.
[0010] According to one aspect of this disclosure, a technology is realized that analyzes the work of an evaluation subject based on a video of that person performing the work.
[0011] Figure 1 is a diagram showing an example of a functional block diagram of an information processing device. Figure 2 is a flowchart showing an example of the processing flow of an information processing device. Figure 3 is a diagram illustrating a comparative example. Figure 4 is a diagram illustrating another comparative example. Figure 5 is a diagram showing an example of the hardware configuration of an information processing device. Figure 6 is a diagram showing an example of a target video. Figure 7 is a diagram illustrating the process of associating multiple target frame images with multiple example frame images. Figure 8 is a diagram illustrating the process of detecting attention intervals. Figure 9 is a flowchart showing another example of the processing flow of an information processing device. Figure 10 is a diagram showing an example of the result of the information processing device detecting attention intervals. Figure 11 is a diagram showing an example of a screen output by the information processing device.
[0012] The embodiments of this disclosure will be described below with reference to the drawings. In this disclosure, the drawings are associated with one or more embodiments. In all drawings, similar components are denoted by the same reference numerals, and their descriptions are omitted where appropriate.
[0013] <<First Embodiment>> Figure 1 is a functional block diagram showing an overview of the information processing device 10. Figure 2 is a flowchart showing an example of the processing flow executed by the information processing device 10.
[0014] As shown in Figure 1, the information processing device 10 includes an acquisition unit 11, an identification unit 12, a detection unit 13, and a notification unit 14. These functional units execute the processes shown in the flowchart of Figure 2.
[0015] In S10, the acquisition unit 11 acquires a target video image of the person being evaluated performing a task. In S11, the identification unit 12 identifies which of the multiple example frame images (reference frame images) in the example video (reference video) corresponds to each of the multiple target frame images included in the target video acquired in S10. In S12, the detection unit 13 detects a cautionary section from the target video based on the correspondence identification result obtained in S11. In S13, the notification unit 14 provides a notification regarding the cautionary section detected in S12.
[0016] In this way, the information processing device 10 detects attention segments based on "the correspondence between each of the multiple target frame images included in the target video and each of the multiple example frame images included in the example video." In generally known techniques (comparative example), attention segments are detected from the target video based on, for example, "the comparison result between the target frame image and the example frame image captured at the same time." The information processing device 10 can detect attention segments from the target video using a method different from such comparative example. With such an information processing device 10, the options for methods of detecting attention segments from the target video can be increased.
[0017] Incidentally, adopting the above comparative example, that is, detecting an attention section from the target video based on "the comparison result between the target frame image and the example frame image captured at the same timing," raises the following problems.
[0018] FIG. 3 shows an example of data for an example video and a target video. The horizontal axis indicates the frame image number. The vertical axis indicates the feature amount (sensor information) extracted from each frame image. Various types of sensor information may be used, and the type is not particularly limited.
[0019] In the example of FIG. 3, for frame image numbers 0 to 20, the sensor information of the example video and the target video at the same timing (the same frame image number) has equivalent values. After frame image number 20, however, the sensor information of the example video and the target video at the same timing (the same frame image number) differs significantly.
[0020] Consequently, when the above comparative example is applied to the example of FIG. 3, the section after frame image number 20, where the sensor information of the example video and the target video differs significantly, is detected as the attention section.
[0021] Here, observing the waveforms of the example video and the target video in FIG. 3, the section from 0 to t1 in the target video and the section from 0 to t1' in the example video have substantially the same waveform. The section from t1 to t2 in the target video and the section from t1' to t2' in the example video differ in required time (required number of frames) and, as a result, in waveform. The waveform of the section from t1 to t2 in the target video is the waveform of the section from t1' to t2' in the example video stretched horizontally. Furthermore, the waveform of the section after t2 in the target video and the waveform of the section after t2' in the example video are almost identical. This means that the person being evaluated, as captured in the target video, worked smoothly up to t1, was delayed for some reason between t1 and t2, and worked smoothly again from t2 onward.
[0022] Despite this situation, when an attention section is detected based on "the comparison result between the target frame image and the example frame image taken at the same time," the section from frame number 20 onward is detected as an attention section. In other words, even the section from t2 onward, in which the work proceeded smoothly, continues to be detected as an attention section.
[0023] If a delay occurs partway through the work performed by the person being evaluated, and a time lag arises between the target video and the example video, this time lag is highly likely to persist without being resolved. Likewise, if the person being evaluated forgets to perform a certain task, the subsequent tasks may be performed earlier than in the example video, again causing a time lag between the target video and the example video. In this case as well, the time lag is highly likely to persist without being resolved. If an attention section is detected from the target video based on "the comparison result between the target frame image and the example frame image taken at the same time" despite such a persistent time lag, an incorrect judgment may be made, as described above. In other words, even work that is performed correctly after the time lag occurs may be detected as an attention section.
[0024] Figure 4 shows another example of data for the example video and the target video. The data is presented using the same method as in Figure 3.
[0025] Even for individuals who perform their tasks smoothly without any particular problems, the time required for each task can vary. Therefore, even if an individual performs their task smoothly without any particular problems, their work time may not match the work time shown in the example video. Figure 4 shows data for such an example.
[0026] Observing the waveforms of the example video and the target video in Figure 4, the subject of evaluation proceeded smoothly without any particular problems, so the overall waveforms of the example video and the target video are very similar to each other. There are no characteristics of the target video waveform being locally stretched or missing in the horizontal direction. The overall waveform in the target video is a horizontally stretched version of the overall waveform in the example video. However, due to the difference in work time mentioned above, these waveforms do not completely overlap. In other words, in most places, the sensor information of the target frame image and the example frame image, which were captured at the same time, have different values.
[0027] Despite these circumstances, when an attention section is detected based on "the comparison result between the target frame image and the example frame image taken at the same time," the section from around frame number 15 onward, for example, is detected as an attention section. In other words, even though the work proceeded smoothly without any particular problems, an attention section is detected merely because of individual differences in work speed.
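The failure mode of the comparative example described in paragraphs [0024] to [0027] can be illustrated with a minimal Python sketch. The function name `same_timing_flags`, the sine-wave feature values, and the threshold of 0.5 are assumptions made purely for illustration and do not appear in this disclosure. The target waveform here is the example waveform played at half speed, so the work content is identical, yet a comparison at the same frame number still flags frames.

```python
import math


def same_timing_flags(example, target, threshold):
    """Comparative example: flag frame i when the feature values taken at
    the same frame number differ by at least `threshold`."""
    n = min(len(example), len(target))
    return [abs(target[i] - example[i]) >= threshold for i in range(n)]


# Hypothetical one-dimensional feature (sensor information) per frame.
example = [math.sin(0.30 * i) for i in range(40)]
# Same waveform, but the evaluated person works at half the speed.
target = [math.sin(0.15 * i) for i in range(40)]

flags = same_timing_flags(example, target, threshold=0.5)
flagged_frames = [i for i, f in enumerate(flags) if f]
print(flagged_frames)
```

Every flagged frame in this sketch is a false alarm in the sense of paragraph [0027], because the only difference between the two sequences is work speed.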
[0028] The information processing device 10 can mitigate these problems. The information processing device 10 detects attention segments in the target video using a different technique than the technique that detects attention segments in the target video based on "the result of comparing the target frame image and the example frame image captured at the same time." With such an information processing device 10, the above problems that occur when detecting attention segments in the target video based on "the result of comparing the target frame image and the example frame image captured at the same time" can be mitigated.
[0029] Furthermore, the information processing device 10 identifies which of the multiple example frame images included in the example video corresponds to each of the multiple target frame images included in the target video. Based on the identification of this correspondence, the information processing device 10 detects the attention section from the target video. With such an information processing device 10, even if a time difference occurs between the target video and the example video, the attention section can be detected from the target video based on the correct correspondence. As a result, the problems described above are mitigated and the accuracy of attention section detection is improved.
[0030] Thus, the information processing device 10 enables the development of a technology that analyzes the work of an evaluation subject based on video footage of the evaluation subject performing the work.
[0031] <<Second Embodiment>> <Overview> The information processing device 10 of the second embodiment is a concrete implementation of the configuration of the information processing device 10 of the first embodiment. It will be described in detail below.
[0032] <Hardware Configuration> First, an example of the hardware configuration of the information processing device 10 will be described. Each functional unit of the information processing device 10 is realized by any combination of hardware and software. Those skilled in the art will understand that there are various variations in the implementation method and the device. The software includes programs that are pre-installed at the time of shipment of the device, as well as programs downloaded from recording media such as CDs (Compact Discs) or from servers on the Internet.
[0033] Figure 5 is a block diagram illustrating the hardware configuration of the information processing device 10. As shown in Figure 5, the information processing device 10 includes a processor 1A, memory 2A, input / output interface 3A, peripheral circuitry 4A, and bus 5A. The peripheral circuitry 4A includes various modules. The information processing device 10 does not necessarily have peripheral circuitry 4A. The information processing device 10 may also be composed of multiple physically and / or logically separated devices. In this case, each of the multiple devices may have the above hardware configuration.
[0034] Bus 5A is a data transmission path for the processor 1A, memory 2A, peripheral circuit 4A, and input / output interface 3A to send and receive data to and from each other. The processor 1A is a processing unit such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit). Memory 2A is a memory such as RAM (Random Access Memory) or ROM (Read Only Memory). The input / output interface 3A includes interfaces for acquiring information from input devices, external devices, external servers, external sensors, cameras, etc., and interfaces for outputting information to output devices, external devices, external servers, etc. The input / output interface 3A also includes interfaces for connecting to a communication network such as the Internet. Input devices include, for example, a keyboard, mouse, microphone, physical buttons, touch panel, etc. Output devices include, for example, a display, projection device, speaker, printer, mailer, etc. The processor 1A can issue commands to each module and perform calculations based on their calculation results.
[0035] <Functional Configuration> Next, the functional configuration of the information processing device 10 will be described in detail. Figure 1 is an example of a functional block diagram of the information processing device 10. As shown in the figure, the information processing device 10 has an acquisition unit 11, an identification unit 12, a detection unit 13, and a notification unit 14.
[0036] The acquisition unit 11 acquires target video footage of the person being evaluated while performing a task.
[0037] "Persons being evaluated" refers to the workers who are being evaluated.
[0038] A "task" is work performed by the person being evaluated according to certain procedures and rules. A task may include multiple subtasks. The person being evaluated may perform multiple subtasks in a predetermined order. The task may be performed in a factory, a retail store such as a convenience store, an office, an outdoor site such as a construction site, or any other location. The content of the task is not limited.
[0039] For example, the work may be manufacturing work performed in a factory. This work may include several small tasks such as attaching part A, attaching part B, soldering, tightening screws, checking the attachment, and performing operational tests. Alternatively, the work may be preparing prepared food at a convenience store. This work may include several small tasks such as cleaning equipment, storing ingredients in equipment, and operating equipment. Note that while examples of work have been described here, the work is not limited to these examples.
[0040] The "target video footage" is video footage of the person being evaluated performing the tasks described above. The target video footage must show at least a portion of the person being evaluated. For example, the entire body of the person being evaluated may be shown in the target video footage. Alternatively, only a part of the person being evaluated (for example, their hands) may be shown in the target video footage.
[0041] Figure 6 shows an example of a target video P. The target video P shows the hands H of the person being evaluated. The target video P also shows a workbench T and a work object Q.
[0042] "Working object" refers to at least one of the object being worked on and the object used in the work, and its type is not limited. For example, the working object may be an intermediate or finished product of the goods manufactured in the work. In addition, the working object may be a tool used in the work (soldering iron, screwdriver, cleaning equipment, device, etc.).
[0043] The acquisition unit 11 can acquire target video images generated by a camera that films the person being evaluated while they are working. For example, a fixed camera may be installed at a position that films the person being evaluated while they are working. Alternatively, the person being evaluated may wear a wearable device with a camera and film their hands, etc. Alternatively, a camera may be installed on a moving object. The moving object may then film the person being evaluated while moving as appropriate. The moving object may move on the floor, move through the air, or move along walls or ceilings. The acquisition unit 11 may acquire the target video images generated by the camera in real time processing or in batch processing.
[0044] The camera can detect visible light and create an image. In addition, the camera may also detect other electromagnetic waves such as infrared light and create an image. Furthermore, the camera may be a monocular camera or a stereo camera. The camera may also have a standard lens with a field of view of about 45°, or it may have lenses such as a telephoto lens (field of view of about 30° or less), a wide-angle lens (field of view of about 60-84°), an ultra-wide-angle lens (field of view of about 94-118°), or a fisheye lens (field of view of about 180°).
[0045] The acquisition unit 11 may acquire target video footage from an external device. Alternatively, the acquisition unit 11 may acquire target video footage input by the user to the information processing device 10 by any means.
[0046] "Acquisition" includes at least one of the following: the device retrieving data or information stored in another device or storage medium (active acquisition), and the device inputting data or information output from another device into its own device (passive acquisition). Examples of active acquisition include making a request to another device and receiving a reply, and accessing and reading data from another device or storage medium. Examples of passive acquisition include receiving information that is delivered (or transmitted, push notification, etc.). Furthermore, acquisition may also involve selecting and acquiring data or information from among the received data or information, or selecting and receiving data or information that has been delivered.
[0047] Returning to Figure 1, the identification unit 12 identifies which of the multiple example frame images included in the example video corresponds to each of the multiple target frame images included in the target video.
[0048] The "target frame image" is a frame image included in the target video.
[0049] A "model video (also called a reference video)" is a video recording of a reference subject performing a task. The task performed by the reference subject is the same as the task performed by the subject being evaluated. The model video may be a video generated by filming the reference subject performing the task with a camera. Alternatively, the model video may be an animation depicting the reference subject performing the task. In the model video, the reference subject performs the task correctly according to certain procedures and rules. In other words, the work performed by the reference subject recorded in the model video is a model (exemplary) to be used as a reference.
[0050] The "reference subject" can be a real person or a real robot. Alternatively, the reference subject may be a fictional person or character depicted for the example video.
[0051] A "model frame image (also called a reference frame image)" is a frame image included in the model video.
[0052] A "correspondence" is a relationship in which a target frame image and an example frame image record the same scene within a series of tasks. In other words, a target frame image and an example frame image that correspond to each other record the scene in which the person being evaluated and the reference subject, respectively, are performing the same task within the series of tasks.
[0053] The identification unit 12 identifies which of the multiple example frame images included in the example video corresponds to each of the multiple target frame images included in the target video. That is, as shown in Figure 7, each of the multiple target frame images is associated with one of the example frame images.
[0054] This mapping can be achieved using widely known techniques. For example, the identification unit 12 may extract predetermined feature quantities from each of the multiple target frame images. The identification unit 12 may then perform the mapping based on the comparison result between the feature quantities of each target frame image and the feature quantities extracted from each of the multiple example frame images. In one example, the identification unit 12 can map each target frame image to an example frame image from which the feature quantities most similar to those of each target frame image have been extracted.
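The nearest-feature mapping mentioned in paragraph [0054] can be sketched as follows. This is a minimal illustration, not the implementation of the identification unit 12: the function name `nearest_example_frame`, the use of squared Euclidean distance as the similarity measure, and the toy two-dimensional feature vectors are all assumptions.

```python
def nearest_example_frame(target_features, example_features):
    """For each target frame, return the index of the example frame whose
    feature vector is closest (squared Euclidean distance, an assumed
    similarity measure). Ties resolve to the earliest example frame."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    mapping = []
    for t in target_features:
        best = min(
            range(len(example_features)),
            key=lambda j: sq_dist(t, example_features[j]),
        )
        mapping.append(best)
    return mapping


# Hypothetical feature vectors (for instance, two posture coordinates).
example_features = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
target_features = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [2.1, 0.0]]
print(nearest_example_frame(target_features, example_features))
```

Matching each frame independently does not force the mapping to follow the order of the work, so repeated motions could be matched to the wrong occurrence. An order-preserving alignment is one way to address this.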
[0055] The features used here can vary. For example, the identification unit 12 may use features related to the posture of the person being evaluated and the person being referenced. Features related to posture can be extracted using technologies such as OpenPose or MMPose. In addition, the identification unit 12 may use features related to RGB information. RGB information indicates the color information of each pixel. In addition, the identification unit 12 may use features related to the appearance of the work object. If the work object is an intermediate or finished product of a product manufactured in the work, the state of the work object may change according to the progress of the work. The identification unit 12 may identify the correspondence between the target frame image and the example frame image based on such changes in the state of the work object according to the progress of the work. Also, if the work object is a tool used in the work (soldering iron, screwdriver, cleaning tool, device, etc.), the tool used in the work may change according to the progress of the work. The identification unit 12 may identify the correspondence between the target frame image and the example frame image based on such changes in the tool used according to the progress of the work. In addition, the identification unit 12 may use a combination of the above-mentioned features.
[0056] The identification unit 12 may utilize techniques such as TCC (Temporal Cycle-Consistency Learning), LA2DS (Learning by Aligning 2D Skeleton Sequences in Time), and DTW (Dynamic Time Warping) for the above-mentioned mapping. The techniques used are not limited to those given here.
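As one illustration of the techniques named in paragraph [0056], the following is a minimal sketch of textbook DTW applied to a one-dimensional feature per frame. It is an assumption-laden example rather than the disclosed implementation: the function name `dtw_align`, the absolute-difference cost, and the rule of keeping the earliest matched example frame for each target frame are all choices made here for illustration.

```python
def dtw_align(target, example):
    """Align two 1-D feature sequences with classic DTW and return a list
    in which element i is the example frame index matched to target
    frame i."""
    n, m = len(target), len(example)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost aligning target[:i] with example[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(target[i - 1] - example[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both videos
                cost[i - 1][j],      # target advances, example frame repeats
                cost[i][j - 1],      # example advances, target frame repeats
            )

    # Backtrack from the end of both sequences along the cheapest path.
    mapping = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        # Later visits have smaller j, so each target frame ends up with
        # the earliest example frame on the path.
        mapping[i - 1] = j - 1
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return mapping


# Hypothetical data: the evaluated person lingers on the second step.
example = [0, 1, 2, 3, 4]
target = [0, 1, 1, 1, 2, 3, 4]
print(dtw_align(target, example))
```

In this toy input, the three target frames recorded during the delay are all mapped to the same example frame, and the frames after the delay are mapped correctly again, which is the behavior paragraph [0029] relies on.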
[0057] The detection unit 13 detects an attention section from the target video based on the result of identifying the correspondence between the target frame images and the example frame images.
[0058] The "result of identifying the correspondence between the target frame images and the example frame images" indicates, for example, as shown in Figure 7, which of the multiple example frame images in the example video corresponds to each of the multiple target frame images in the target video.
[0059] An "attention section" (also called a cautionary section) is a portion of the target video having a certain length. An attention section can be identified, for example, by the elapsed time from the beginning of the target video. An attention section is a section that shows a difference of a certain level or higher when compared to the example video. Such attention sections are considered to require some kind of action, such as improvement, review, examination, practice, or skill development.
[0060] Here, the process of detecting an attention section from the target video based on the "result of identifying the correspondence between the target frame images and the example frame images," as shown in Figure 7, for example, will be described.
[0061] The detection unit 13 detects, as an attention section, a section in the target video in which the relationship between M1 and M2 satisfies a predetermined condition. Specifically, the detection unit 13 detects, as an attention section, a section that satisfies the predetermined condition that "the absolute value of the difference between the value of M2 / M1 and the reference value is greater than or equal to a threshold."
[0062] "M1" is the time interval or frame interval between the first target frame image and the second target frame image among the multiple target frame images.
[0063] First, the detection unit 13 identifies a pair of first and second target frame images from among a plurality of target frame images. The first and second target frame images may be, for example, two consecutive target frame images. In this case, there are no other target frame images between the first and second target frame images. Alternatively, the first and second target frame images may be two target frame images with a predetermined interval between them. In this case, there is a predetermined number of target frame images between the first and second target frame images. The detection unit 13 can identify multiple pairs of first and second target frame images that satisfy this relationship from among a plurality of target frame images. Then, the detection unit 13 can sequentially identify one of these multiple pairs as the target for processing.
[0064] The detection unit 13 identifies the time interval or frame interval M1 between the first target frame image and the second target frame image.
[0065] The "frame interval" is the value obtained by subtracting the frame number (target frame image number) of the first target frame image from the frame number (target frame image number) of the second target frame image. If the first and second target frame images are consecutive target frame images, the frame interval is "1". If the first and second target frame images have Q target frame images between them, the frame interval is "Q+1".
[0066] The "time interval" is the frame interval multiplied by the time of one frame.
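The definitions in paragraphs [0065] and [0066] can be written as two small functions. This is a sketch under the assumption of a constant frame rate, so that the time of one frame is `1 / fps`; the function names and the `fps` parameter are introduced here for illustration only.

```python
def frame_interval(first_frame_no, second_frame_no):
    """Frame number of the second frame minus that of the first frame."""
    return second_frame_no - first_frame_no


def time_interval(first_frame_no, second_frame_no, fps):
    """Frame interval multiplied by the time of one frame (1 / fps seconds),
    assuming a constant frame rate."""
    return frame_interval(first_frame_no, second_frame_no) * (1.0 / fps)


# Consecutive frames give an interval of 1.
print(frame_interval(10, 11))
# With Q = 4 frames in between (frames 11 to 14), the interval is Q + 1 = 5.
print(frame_interval(10, 15))
# At an assumed 10 frames per second, 5 frames span 0.5 seconds.
print(time_interval(10, 15, fps=10.0))
```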
[0067] "M 2 " is the time interval or frame interval between the first example frame image corresponding to the first target frame image and the second example frame image corresponding to the second target frame image.
[0068] The detection unit 13 identifies a first example frame image corresponding to the first target frame image and a second example frame image corresponding to the second target frame image, based on the correspondence relationship identified by the identification unit 12. Next, the detection unit 13 determines the time interval or frame interval M between the first example frame image and the second example frame image. 2 Identify.
[0069] "M 2 / M 1 This indicates the degree of work speed in a portion of the series of operations recorded in the target video (the portion from the first target frame image to the second target frame image) by comparing it with the work speed in the same portion recorded in the example video. That is, M 2 / M 1 This indicates whether the speed of work in a portion of the series of tasks recorded in the target video is the same as, or to what extent faster or slower than, the speed of work in the same portion recorded in the example video. 2 / M 1 The closer this value is to 1, the more it indicates that the work speed in a portion of the target video is approximately the same as the work speed in a portion of the example video. 2 / M 1 The greater the value of 1, the faster the work speed in the portion recorded in the target video is compared to the work speed in the same portion recorded in the example video. 2 / M 1 The smaller the value is (less than 1), the slower the work speed in the portion recorded in the target video is compared to the work speed in the same portion recorded in the example video.
[0070] The "reference value" is the value that is compared with M2 / M1 described above. It indicates, for example, the value of M2 / M1 obtained when the person being evaluated proceeds through the series of tasks smoothly without any particular problems, or an equivalent value. The detection unit 13 detects attention sections within the target video based on the result of comparing M2 / M1 with the reference value.
[0071] In one example, a user can determine a reference value and input it into the information processing device 10 or register it in advance. The detection unit 13 can identify the reference value that the user has input or registered in advance.
[0072] The reference value may be, for example, "1". Such a reference value is effective when evaluating a person who is familiar with the task and can perform the series of tasks at the same speed as the reference subject recorded in the example video.
[0073] In addition, the reference value may be determined for each person being evaluated, taking into account their individual work speed. For example, the user may determine the time required when the person being evaluated proceeds smoothly and without any particular problems, based on prior preparation or past videos. The user may also determine the time required for the series of tasks performed by the reference subject recorded in the example video by reviewing the example video. The user may then determine the reference value for each person being evaluated as "(time required for the series of tasks performed by the reference subject recorded in the example video) / (time required when the person being evaluated proceeds smoothly and without any particular problems)".
[0074] In addition, the detection unit 13 may determine a reference value for each target video (for each person being evaluated) based on the target video and the example video. For example, the detection unit 13 may determine (N2 / N1) as the reference value.
[0075] "N1" is the time or number of frames required for the series of operations recorded in the target video. The detection unit 13 may, for example, identify the total time length or total number of frames of the target video as N1. This identification method is effective when the target video records only scenes of the work being evaluated.
[0076] In addition, the detection unit 13 may analyze the target video and identify the start and end times of the work within the target video. The detection unit 13 may then identify the time length or number of frames from the start to the end of the work within the target video as N1. For example, the detection unit 13 may identify the start and end times of the work based on the "feature quantities of each target frame image" described above. The detection unit 13 can identify the start and end times of the work as the times when feature quantities specific to the start of the work and feature quantities specific to the end of the work are detected. In addition, the detection unit 13 may accept input from the user specifying the start and end times of the work in the target video. The detection unit 13 may then identify the start and end times of the work in the target video based on the user input.
[0077] "N2" is the time or number of frames required for the series of operations recorded in the example video. The detection unit 13 can identify N2 using the same method as that used to identify N1 described above.
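The reference value (N2 / N1) can be illustrated with a short sketch. The function name `reference_value` and the frame counts below are hypothetical; how N1 and N2 are actually obtained (whole video length, detected start and end, or user input) is left outside the sketch.

```python
def reference_value(n1, n2):
    """Return N2 / N1.

    n1: time or frame count of the series of operations in the target video.
    n2: time or frame count of the same series in the example video.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive")
    return n2 / n1


# Hypothetical counts: the target video needs 120 frames, the example 100.
print(reference_value(120, 100))
```

With these hypothetical numbers the reference value is below 1, reflecting a person being evaluated who needs more frames overall than the reference subject.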
[0078] Next, using Figure 8, the significance of the predetermined condition "the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the threshold" will be explained.
[0079] The graph in Figure 8 plots the target frame image number on the horizontal axis and the example frame image number on the vertical axis. The "target frame image number" is a sequential number assigned to the multiple time-series target frame images contained in the target video in chronological order. The "example frame image number" is a sequential number assigned to the multiple time-series example frame images contained in the example video in chronological order.
[0080] In the graph of Figure 8, the correspondence identified by the identification unit 12 is plotted. That is, circles are plotted at the positions of the corresponding target frame image and example frame image, and these circles are connected by straight lines in chronological order. When the first target frame image and the second target frame image are two consecutive target frame images, (M2 / M1) represents the slope of the straight line connecting two consecutive circles.
[0081] Furthermore, the graph of Figure 8 shows a straight line A whose slope is the reference value. As mentioned above, the reference value indicates the value of M2 / M1 obtained when the person being evaluated proceeds through the series of tasks smoothly without any particular problems, or an equivalent value.
[0082] If M2 / M1 deviates from the reference value by the threshold or more, that is, if the difference between the slope of the line connecting two consecutive circles and the slope of line A is greater than or equal to the threshold, it means that the work of the person being evaluated is not progressing smoothly in the section connecting those two circles.
[0083] On the other hand, if M2 / M1 does not deviate from the reference value by the threshold or more, that is, if the difference between the slope of the line connecting two consecutive circles and the slope of line A is less than the threshold, it means that the work of the person being evaluated is progressing smoothly in the section connecting those two circles.
[0084] The detection unit 13 determines, for each pair of the first target frame image and the second target frame image (for example, for each pair of consecutive circles in Figure 8), whether the predetermined condition "the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the threshold" is satisfied. The detection unit 13 then detects the set of sections of the pairs that satisfy the predetermined condition as the attention section.
[0085] The case where the first and second target frame images are two consecutive target frame images has been explained here. Even if a predetermined number of target frame images lie between the first and second target frame images, the attention section can be detected using the same process.
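The predetermined condition can be sketched as follows, assuming again that the correspondence is a list `align` indexed by target frame image number. The names `satisfies_condition` and `flagged_pairs`, the `interval` parameter, and the sample values are assumptions made for illustration only.

```python
def satisfies_condition(m1, m2, reference, threshold):
    """True when |M2 / M1 - reference value| >= threshold."""
    return abs(m2 / m1 - reference) >= threshold


def flagged_pairs(align, reference, threshold, interval=1):
    """Return (first, second) target frame pairs that satisfy the condition.

    interval is M1 in frames: 1 for consecutive target frame images,
    Q + 1 when Q target frame images lie between the pair.
    """
    pairs = []
    for first in range(0, len(align) - interval, interval):
        second = first + interval
        m2 = align[second] - align[first]
        if satisfies_condition(interval, m2, reference, threshold):
            pairs.append((first, second))
    return pairs


# Hypothetical correspondence in which the work stalls at example frame 2.
align = [0, 1, 2, 2, 2, 3, 4]
print(flagged_pairs(align, reference=1.0, threshold=0.5))
# [(2, 3), (3, 4)]
print(flagged_pairs(align, reference=1.0, threshold=0.5, interval=2))
# [(2, 4)]
```

A larger `interval` smooths over frame-level jitter in the correspondence at the cost of coarser section boundaries.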
[0086] Returning to Figure 1, the notification unit 14 provides notification regarding the attention section. The notification unit 14 can provide this notification via any output device such as a display, projection device, or speaker. Alternatively, the notification unit 14 may provide this notification by transmitting the notification screen to an external device.
[0087] For example, the notification unit 14 can notify the start and end positions of the attention section within the target video. Such notifications can be implemented in various ways. For example, the notification unit 14 may display a time bar corresponding to the duration of the target video and clearly indicate the start and end positions of the attention section on that time bar. Alternatively, the notification unit 14 may clearly indicate the start and end positions of the attention section using numerical values that show the elapsed time from the beginning of the target video, such as "0:03:23 to 0:05:11".
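The numerical notification can be sketched as below. The conversion from frame numbers to elapsed time assumes a known frame rate `fps`, and the function names and frame numbers are hypothetical.

```python
def format_position(frame_number, fps):
    """Format a target frame number as elapsed time 'H:MM:SS'."""
    total_seconds = int(frame_number / fps)
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def format_section(start_frame, end_frame, fps):
    """Format the start and end positions of an attention section."""
    start = format_position(start_frame, fps)
    end = format_position(end_frame, fps)
    return f"{start} to {end}"


# Hypothetical attention section in a 30 frames-per-second target video.
print(format_section(6090, 9330, fps=30))  # 0:03:23 to 0:05:11
```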
[0088] Next, an example of the processing flow of the information processing device 10 will be explained using the flowchart in Figure 9. The purpose here is simply to explain the processing flow. Details of each process have been described above, so explanations will be omitted here as appropriate.
[0089] In S20, the information processing device 10 acquires the target video. The information processing device 10 then identifies the total number of frames (Na) of the acquired target video. In this example, the target frame image number takes values from 0 to (Na - 1).
[0090] In S21, the information processing device 10 extracts feature quantities from each of the multiple target frame images included in the target video.
[0091] In S22, the information processing device 10 identifies the correspondence between multiple target frame images and multiple example frame images based on the feature quantities extracted in S21 and the pre-prepared "feature quantities extracted from each of the multiple example frame images included in the example video." That is, the information processing device 10 identifies which of the multiple example frame images included in the example video each of the multiple target frame images corresponds to. The information processing device 10 can generate a list Align that shows the correspondence. Align(i) indicates the example frame image number corresponding to the target frame image number i.
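One possible way to build the list Align is Dynamic Time Warping, which the summary names as an applicable technique. The sketch below is a generic textbook DTW over per-frame feature vectors, not the implementation of the information processing device 10. The function name `dtw_align`, the Euclidean distance, and the rule of keeping the last example frame matched to each target frame are all assumptions.

```python
def dtw_align(target_feats, example_feats):
    """Return align, where align[i] is the example frame matched to target frame i."""
    n, m = len(target_feats), len(example_feats)
    if n == 0 or m == 0:
        raise ValueError("both feature sequences must be non-empty")
    inf = float("inf")

    def dist(a, b):
        # Euclidean distance between two feature vectors (assumed metric).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # cost[i][j]: minimum cumulative cost of aligning the first i target
    # frames with the first j example frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(target_feats[i - 1], example_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end. The first visit to each target frame records
    # the last example frame on the warping path for that target frame.
    align = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if align[i - 1] is None:
            align[i - 1] = j - 1
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return align


# Hypothetical one-dimensional features: the target lingers on the second state.
target = [[0.0], [1.0], [1.0], [2.0]]
example = [[0.0], [1.0], [2.0]]
print(dtw_align(target, example))  # [0, 1, 1, 2]
```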
[0092] In S23, the information processing device 10 determines whether the variable i is (Na - 1). If the variable i is (Na - 1) (Yes in S23), the information processing device 10 terminates processing. On the other hand, if the variable i is not (Na - 1) (No in S23), the information processing device 10 proceeds to S24.
[0093] In S24, the information processing device 10 determines whether the condition shown as "*1" in the figure is met. The condition shown as "*1" in the figure is an example of the predetermined condition on the relationship between M1 and M2 described above. If the condition *1 is not met (No in S24), the information processing device 10 proceeds to S26. On the other hand, if the condition *1 is met (Yes in S24), the information processing device 10 proceeds to S25.
[0094] In S25, the information processing device 10 registers the i-th target frame number in the difference frame list Diff. Then, the information processing device 10 proceeds to S26.
[0095] In S26, the information processing device 10 updates the variable i to (i+1), then returns to S23 and repeats the same process.
[0096] The difference frame list Diff generated by this process indicates the detected attention section. In other words, the section containing the target frame numbers registered in the difference frame list Diff is the attention section.
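The flow of S23 to S26 can be sketched as a loop. The condition "*1" appears only in the figure, so the sketch assumes it is the predetermined condition |M2 / M1 - reference value| >= threshold evaluated on consecutive target frames (M1 = 1), and that the variable i starts at 0. The function names and the grouping of Diff into sections are likewise illustrative assumptions.

```python
def build_diff(align, reference, threshold):
    """Follow S23 to S26 and return the difference frame list Diff."""
    n_a = len(align)      # total number of target frames (Na)
    diff = []
    i = 0                 # assumed initial value of the variable i
    # S23: stop when i reaches Na - 1 ("<" also covers an empty list).
    while i < n_a - 1:
        m2 = align[i + 1] - align[i]   # M1 is 1 for consecutive frames
        # S24: assumed form of the condition "*1".
        if abs(m2 - reference) >= threshold:
            diff.append(i)             # S25: register frame number i
        i += 1                         # S26
    return diff


def diff_to_sections(diff):
    """Group consecutive registered frame numbers into (start, end) sections."""
    sections = []
    for i in diff:
        if sections and sections[-1][1] == i:
            sections[-1][1] = i + 1    # extend the current section
        else:
            sections.append([i, i + 1])
    return [tuple(s) for s in sections]


# Hypothetical correspondence in which the work stalls at example frame 2.
align = [0, 1, 2, 2, 2, 3, 4]
diff = build_diff(align, reference=1.0, threshold=0.5)
print(diff)                    # [2, 3]
print(diff_to_sections(diff))  # [(2, 4)]
```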
[0097] <Operation and Effects> The information processing device 10 of the second embodiment can achieve the same operation and effects as the information processing device 10 of the first embodiment.
[0098] Furthermore, the information processing device 10 can identify the degree of work speed in a part of a series of tasks recorded in the target video by comparing it with the work speed in the same part recorded in the example video. Based on the comparison result between the identified degree of speed and a reference value, the information processing device 10 can decide whether or not to detect that part as a section requiring attention. In this way, the information processing device 10 can detect sections requiring attention by evaluating the series of tasks recorded in the target video in separate parts. With the information processing device 10, which can subdivide and evaluate the series of tasks recorded in the target video in this way, the work of the person being evaluated recorded in the target video can be evaluated with high resolution.
[0099] As described above, the information processing device 10 identifies the degree of work speed in a portion of a series of operations recorded in the target video by comparing it with the work speed in a portion of the same operation recorded in the example video. In this identification, if the portion of the example video compared with the portion of the series of operations recorded in the target video is not appropriate, the degree of work speed in that portion of the target video cannot be correctly identified. To address this problem, the information processing device 10 identifies which of the multiple example frame images included in the example video corresponds to each of the multiple target frame images included in the target video. Then, based on the result of this correspondence identification, the information processing device 10 determines the portion of the example video that is compared with the portion of the series of operations recorded in the target video. With such an information processing device 10, even if a time difference occurs between the target video and the example video, the above identification can be performed based on the correct correspondence. As a result, even if a time difference occurs between the target video and the example video, the attention section can be detected with high accuracy.
[0100] Furthermore, the information processing device 10 can determine, for each person being evaluated, the reference value against which the work speed in a portion of the series of tasks recorded in the target video is compared. For example, the information processing device 10 can use the above-mentioned "(time required for the series of tasks performed by the reference subject recorded in the example video) / (time required when the person being evaluated proceeds through the series of tasks smoothly without any particular problems)" as the reference value for each person being evaluated. Alternatively, the information processing device 10 can use the above-mentioned (N2 / N1) as the reference value.
[0101] As explained with reference to Figure 4 in the first embodiment, even among individuals who can perform their tasks smoothly without any particular problems, the time required for each task may differ. Therefore, if the above reference value is set to a common value for all individuals, it may not be possible to correctly evaluate the work of each individual. The information processing device 10 can determine the reference value for each individual, taking into account their individual work speed. Such an information processing device 10 can suppress the inconveniences explained with reference to Figure 4 in the first embodiment and accurately detect the attention section.
[0102] Furthermore, the information processing device 10 can notify the user of the start and end positions of the attention section within the target video. Based on this notification, the user can easily identify and confirm the attention section within the target video.
[0103] Here, Figure 10 shows the results of processing the data of the example video and target video in Figure 3 using the information processing device 10 to detect the attention section. The effect of the information processing device 10 will then be explained by comparing it with the comparative example in Figure 3. Figure 10 shows the data using the same method as in Figure 8 described above.
[0104] Observing the waveforms of the example video and the target video in Figure 3, it can be seen that the section from 0 to t1 in the target video and the section from 0 to t'1 in the example video have almost the same waveform. It can also be seen that the section from t1 to t2 in the target video and the section from t'1 to t'2 in the example video differ in required time (required number of frames) and therefore have different waveforms. Further, it can be seen that the section after t2 in the target video and the section after t'2 in the example video have almost the same waveform. This means that the person being evaluated recorded in the target video worked smoothly until t1, fell behind for some reason between t1 and t2, and then worked smoothly again after t2.
[0105] With the data of Figure 3, when the attention section is detected from the target video based on "the result of comparing the target frame image and the example frame image captured at the same timing", the section after target frame image number 20 is detected as the attention section, as shown in Figure 3. That is, the section after t2, in which the work proceeded smoothly, is also detected as the attention section.
[0106] In contrast, when the information processing device 10 processes the data of Figure 3 to detect the attention section, as shown in Figure 10, the section in which the slope of the line connecting the circles in chronological order differs from the slope of straight line A by the threshold or more, that is, the section from target frame image number 20 to 60, is detected as the attention section. The information processing device 10 does not detect the section up to t1 or the section after t2, in which the person being evaluated worked smoothly, as the attention section, because in those sections the slope of the straight line connecting consecutive circles is approximately equal to the slope of straight line A (the difference is less than the threshold). The information processing device 10 detects the section from t1 to t2 as the attention section, because in that section the slope of the straight line connecting consecutive circles differs significantly from the slope of straight line A (the difference is greater than or equal to the threshold). In this way, the information processing device 10 can detect the attention section with high accuracy.
[0107] <<Third Embodiment>> The information processing device 10 of the third embodiment can provide notification regarding the attention section using a different method than the information processing device 10 of the second embodiment. This will be explained in detail below.
[0108] In the third embodiment, the work recorded in the target video and the example video includes multiple sub-tasks. The person being evaluated and the person being referenced perform a task that involves performing multiple sub-tasks in a predetermined order according to certain procedures and rules. The target video and the example video record the person being evaluated and the person being referenced performing such a task. An example of such a task and sub-tasks is as described in the second embodiment.
[0109] The detection unit 13 detects each sub-task within the target video. Detecting each sub-task means identifying the start and end positions of the sections in the target video where each sub-task is recorded. The detection unit 13 can achieve this identification by various means. For example, the user may input the start and end positions of the sections in which each sub-task is recorded while viewing the target video. The detection unit 13 may then detect each sub-task based on the user's input.
[0110] In addition, the detection unit 13 may detect each sub-task by analyzing the target video image. In this case, feature quantities specific to each sub-task are pre-registered. The detection unit 13 can then detect the sections in which the feature quantities specific to each sub-task are detected as sections for each sub-task. Various feature quantities can be used here. For example, the detection unit 13 may use feature quantities related to the posture of the person being evaluated and the person being referenced. In addition, the detection unit 13 may use feature quantities related to RGB information. In addition, the detection unit 13 may use feature quantities related to the appearance of the work object. In addition, the detection unit 13 may use a combination of these feature quantities. These feature quantities are as described in the second embodiment.
[0111] The detection unit 13 detects each sub-task within the target video image and then detects sub-tasks that are included in the attention section. Specifically, the detection unit 13 determines whether the relationship between each sub-task section and the attention section satisfies the judgment conditions. The detection unit 13 then identifies the sub-task sections that satisfy the judgment conditions and detects the sub-tasks in the identified sections as "sub-tasks included in the attention section."
[0112] The judgment condition can be, for example, any of the following:
・At least a portion of the sub-task overlaps with the attention section (or at least a portion of the sub-task is included in the attention section)
・A predetermined percentage or more of the sub-task overlaps with the attention section (or a predetermined percentage or more of the sub-task is included in the attention section)
・The entire sub-task overlaps with the attention section (or the entire sub-task is included in the attention section)
[0113] The above-mentioned predetermined percentage is a value that is set in advance.
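The three judgment conditions can be sketched as a single overlap computation. Sections are represented here as (start, end) frame-number tuples. The function names, the `mode` labels, the default percentage, and the sample sections are hypothetical.

```python
def overlap_fraction(subtask, attention):
    """Fraction of the sub-task section that lies inside the attention section."""
    length = subtask[1] - subtask[0]
    if length <= 0:
        raise ValueError("sub-task section must have positive length")
    start = max(subtask[0], attention[0])
    end = min(subtask[1], attention[1])
    return max(0, end - start) / length


def included(subtask, attention, mode="any", ratio=0.5):
    """Apply one judgment condition: 'any', 'ratio', or 'all'."""
    fraction = overlap_fraction(subtask, attention)
    if mode == "any":      # at least a portion overlaps
        return fraction > 0
    if mode == "ratio":    # a predetermined percentage or more overlaps
        return fraction >= ratio
    if mode == "all":      # the entire sub-task overlaps
        return fraction == 1.0
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical sub-task sections and attention section.
attention = (20, 60)
subtasks = {"A": (0, 25), "B": (25, 55), "C": (55, 90)}
print([name for name, s in subtasks.items() if included(s, attention, "any")])
# ['A', 'B', 'C']
print([name for name, s in subtasks.items() if included(s, attention, "ratio")])
# ['B']
```

The "any" condition reports the most sub-tasks and the "all" condition the fewest, so the choice trades coverage against specificity of the notification.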
[0114] The notification unit 14 notifies of the sub-tasks included in the attention section. For example, the names of the multiple sub-tasks may be defined in advance and registered in the information processing device 10. The notification unit 14 may then identify the names of the sub-tasks included in the attention section based on that information and notify the names of the identified sub-tasks. In addition to names, other identification information such as serial numbers, icons, or symbols may be used.
[0115] Other configurations of the information processing device 10 can be the same as those in the first and second embodiments.
[0116] The information processing device 10 of the third embodiment can achieve the same effects as the information processing device 10 of the first and second embodiments. Furthermore, the information processing device 10 can identify the sub-tasks included in the attention section and notify users of them. With such an information processing device 10, users can easily identify sub-tasks that require improvement or other actions based on the notification content.
[0117] <<Fourth Embodiment>> In the information processing device 10 of the fourth embodiment, a plurality of thresholds to be compared with "the absolute value of the difference between (M2 / M1) and the reference value" are set in advance. The information processing device 10 then varies the notification method according to which of the thresholds the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to. This will be described in detail below.
[0118] As described in detail in the second embodiment, the detection unit 13 determines, for each pair of the first target frame image and the second target frame image (for example, for each pair of consecutive circles in Figure 8), whether the predetermined condition "the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the threshold" is satisfied. The detection unit 13 then detects the set of sections of the pairs that satisfy the predetermined condition as the attention section.
[0119] In the fourth embodiment, a plurality of thresholds are set for this predetermined condition. The detection unit 13 compares the absolute value of the difference between (M2 / M1) and the reference value with each of the thresholds, and identifies which thresholds that absolute value is greater than or equal to. When the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to a larger threshold, the difference between the target video and the example video is larger.
[0120] The notification unit 14 notifies, as the attention section, a section of the target video in which the absolute value is greater than or equal to any of the thresholds. The notification unit 14 can subdivide the attention section into sub-sections based on "which threshold the absolute value is greater than or equal to", and identify and notify each sub-section. The notification unit 14 can also vary the notification method according to which of the thresholds the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to. In this case, the notification unit 14 can display with greater emphasis a sub-section in which the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to a larger threshold.
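The subdivision by multiple thresholds can be sketched as follows, for consecutive target frames and a correspondence list `align`. Expressing "which threshold is met" as a level equal to the number of thresholds met is an assumption of the sketch, as are the function names and sample values.

```python
def severity_level(ratio, reference, thresholds):
    """Number of thresholds that |ratio - reference| is greater than or equal to."""
    deviation = abs(ratio - reference)
    return sum(1 for t in thresholds if deviation >= t)


def sub_sections(align, reference, thresholds):
    """Return (start, end, level) runs of consecutive pairs with the same level.

    Pairs with level 0 meet no threshold and are outside the attention section.
    """
    runs = []
    for i in range(len(align) - 1):
        ratio = align[i + 1] - align[i]   # M2 / M1 with M1 = 1
        level = severity_level(ratio, reference, thresholds)
        if level == 0:
            continue
        if runs and runs[-1][1] == i and runs[-1][2] == level:
            runs[-1] = (runs[-1][0], i + 1, level)   # extend the current run
        else:
            runs.append((i, i + 1, level))
    return runs


# Hypothetical correspondence: a stall followed by a sudden jump.
align = [0, 1, 1, 1, 4, 5]
print(sub_sections(align, reference=1.0, thresholds=[0.5, 1.5]))
# [(1, 3, 1), (3, 4, 2)]
```

A notification layer could then map a higher level to stronger emphasis, for example flashing or red display for the highest level.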
[0121] For example, the notification unit 14 can display a time bar corresponding to the duration of the target video, and on that time bar, it can clearly indicate the start and end positions of the attention section and the start and end positions of the sub-sections within the attention section. The notification unit 14 can then highlight the sub-sections in which the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the first threshold (for example, the largest threshold) by flashing them or displaying them in red. Note that the highlighting methods are not limited to these.
[0122] In addition, the notification unit 14 can clearly indicate the start and end positions of the attention section using numerical values that show the elapsed time from the beginning of the target video, such as "0:03:23 to 0:05:11". Separately from this display, the notification unit 14 may also notify the start and end positions of each sub-section, such as "0:04:03 to 0:04:45". The notification unit 14 may then highlight the numerical values indicating the start and end positions of the sub-sections in which the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the first threshold (for example, the largest threshold) by flashing them or displaying them in red. Note that the highlighting methods are not limited to these.
[0123] In addition, as described in the third embodiment, the notification unit 14 may notify of "sub-tasks included in a sub-section in which the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the first threshold (for example, the largest threshold)".
[0124] The detection unit 13 identifies a sub-task section that satisfies one of the following determination conditions, and can detect the sub-task of the identified section as a "sub-task included in a sub-section in which the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the first threshold (for example, the largest threshold)".
・At least a portion of the sub-task overlaps with that sub-section (or at least a portion of the sub-task is included in that sub-section)
・A predetermined percentage or more of the sub-task overlaps with that sub-section (or a predetermined percentage or more of the sub-task is included in that sub-section)
・The entire sub-task overlaps with that sub-section (or the entire sub-task is included in that sub-section)
[0125] The above-mentioned predetermined percentage is a value that is set in advance.
[0126] Other configurations of the information processing device 10 can be the same as those of the first to third embodiments.
[0127] The information processing device 10 of the fourth embodiment can achieve the same effects as the information processing device 10 of the first to third embodiments. Furthermore, the information processing device 10 can change the notification method depending on the degree of difference between the target video and the example video. Specifically, the information processing device 10 can highlight and notify sections where the degree of difference is greater. With such an information processing device 10, the user can easily grasp sub-tasks or sections that particularly require improvement.
[0128] <<Fifth Embodiment>> The information processing device 10 of the fifth embodiment can set, for each person being evaluated, the threshold that is compared with "the absolute value of the difference between (M2 / M1) and the reference value". This will be explained in detail below.
[0129] As described in detail in the second embodiment, the detection unit 13 determines, for each pair of the first target frame image and the second target frame image (for example, for each pair of consecutive circles in Figure 8), whether the predetermined condition "the absolute value of the difference between (M2 / M1) and the reference value is greater than or equal to the threshold" is satisfied. The detection unit 13 then detects the set of sections of the pairs that satisfy the predetermined condition as the attention section.
[0130] The information processing device 10 of the fifth embodiment can set the threshold in this predetermined condition for each person being evaluated.
[0131] In one example, the detection unit 13 can accept user input specifying a threshold for each person being evaluated. The detection unit 13 then identifies the threshold input by the user as the threshold for each person being evaluated. The detection unit 13 may also accept user input specifying a threshold to be applied to the processing of the target video when the target video is input to the information processing device 10 and the target video is processed.
[0132] In addition, the detection unit 13 may accept user input that registers, in the information processing device 10, a threshold set in advance for each person being evaluated. The detection unit 13 may then store the threshold set for each person being evaluated in the storage device of the information processing device 10, linked to that person.
[0133] In this example, the detection unit 13 receives user input to input the identification information of the person being evaluated recorded in the target video when the target video is input to the information processing device 10 and the target video is processed. The detection unit 13 then reads from the storage device the threshold value registered in association with the person being evaluated identified by the input identification information. Alternatively, the detection unit 13 may identify the person being evaluated recorded in the target video using facial recognition technology or the like. The detection unit 13 may then read from the storage device the threshold value registered in association with the identified person being evaluated.
[0134] In addition, multiple thresholds may be set in advance, and the application conditions for each threshold may be defined and stored in the storage device of the information processing device 10. The application conditions are defined using the attribute information of the person being evaluated. Examples of attribute information of the person being evaluated include, but are not limited to, years of service, years of work experience, number of past accidents, age, and evaluation score by supervisor. In this example, the attribute information of each person being evaluated is stored in the storage device of the information processing device 10 in advance, linked to each person being evaluated. This information is updated as needed.
[0135] In this example, the detection unit 13 receives user input to input the identification information of the person to be evaluated recorded in the target video when the target video is input to the information processing device 10 and the target video is processed. Next, the detection unit 13 reads the attribute information of the person to be evaluated identified by the input identification information from the storage device. The detection unit 13 then determines which threshold application conditions the read attribute information satisfies, and identifies the threshold associated with the application conditions satisfied by the read attribute information as the threshold for that person to be evaluated. Alternatively, the detection unit 13 may identify the person to be evaluated recorded in the target video using facial recognition technology or the like. The detection unit 13 may then read the attribute information registered in association with the identified person to be evaluated from the storage device and perform the above processing.
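The selection of a threshold from application conditions defined on attribute information can be sketched as follows. This is an illustrative sketch only: the attribute fields, the concrete conditions, the threshold values, and the first-match evaluation order are assumptions introduced for the example and are not taken from this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attributes:
    """Hypothetical subset of the attribute information of a person being evaluated."""
    years_of_service: float
    past_accidents: int


@dataclass
class ThresholdRule:
    """A threshold together with its application condition."""
    condition: Callable[[Attributes], bool]
    threshold: float


# Hypothetical rules, evaluated in order (first match wins, an assumption).
RULES: List[ThresholdRule] = [
    ThresholdRule(lambda a: a.past_accidents >= 1, 0.2),
    ThresholdRule(lambda a: a.years_of_service < 1.0, 0.3),
    ThresholdRule(lambda a: True, 0.5),  # catch-all condition
]


def select_threshold(attrs: Attributes, rules: List[ThresholdRule] = RULES) -> float:
    """Return the threshold whose application condition the attributes satisfy."""
    for rule in rules:
        if rule.condition(attrs):
            return rule.threshold
    raise LookupError("no application condition matched")


print(select_threshold(Attributes(years_of_service=0.5, past_accidents=0)))  # prints 0.3
```

Ordering the rules and ending with a catch-all condition is one way to guarantee that exactly one threshold is identified for every person being evaluated; other resolution policies are equally possible.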
[0136] Other configurations of the information processing device 10 can be the same as those of the first to fourth embodiments.
[0137] The information processing device 10 of the fifth embodiment can achieve the same effects as the information processing device 10 of the first to fourth embodiments. Furthermore, the information processing device 10 can set appropriate thresholds for each person being evaluated. For example, the information processing device 10 can set appropriate thresholds for each person being evaluated based on their attribute information. With such an information processing device 10, attention intervals can be detected with high accuracy based on the appropriate thresholds determined for each person being evaluated.
[0138] <<Modified Examples>> Here, modified examples applicable to the first to fifth embodiments will be described.
[0139] The information processing device 10 can generate and output a screen as shown in Figure 11. For example, the information processing device 10 may output the screen via an output device it has or an output device connected to the information processing device 10. Alternatively, the information processing device 10 may transmit the screen to an external device and have the external device output the screen. The output device is exemplified by, but is not limited to, displays and projection devices.
[0140] The screen in Figure 11 displays a time bar corresponding to the duration of the target video. This is the time bar labeled "You" in the figure.
[0141] Furthermore, the screen in Figure 11 displays a time bar corresponding to the duration of the example video. The time bar corresponding to "Example" in the figure corresponds to the duration of the example video. The target video and the example video have different durations. Therefore, the lengths of their time bars are also different. Each time bar is color-coded to identify and display multiple sections. Each color-coded section indicates the section in which each of the multiple subtasks is recorded.
[0142] On the time bar corresponding to the duration of the target video, a frame W is displayed that encloses a specific section. The section enclosed by frame W is the attention section.
[0143] The user can play back the attention section by operating the "Start Playback Button (triangle mark in the diagram)" displayed in conjunction with frame W. During playback, this mark may change to a "Stop Playback Button (e.g., square mark)". In this case, the user can stop the playback of the attention section by operating this stop playback button. Note that while playback is stopped, this mark may change back to the "Start Playback Button (triangle mark in the diagram)". In addition, the user may perform operations such as play, stop, skip forward 5 seconds, and skip back 5 seconds by operating the operation buttons displayed below the illustrated playback image.
[0144] The time bar in Figure 11 clearly indicates the current playback position by the position of a mark that combines a triangle and a vertical line.
[0145] The screen in Figure 11 displays the example video in time synchronization with the playback of the target video.
[0146] Furthermore, the screen in Figure 11 displays work advice. For example, work advice information may be pre-registered in the information processing device 10 for each subtask. When the information processing device 10 identifies a subtask included in the detected attention section, it may read the work advice information associated with the identified subtask and display it on the screen as shown in Figure 11. Alternatively, the information processing device 10 may identify a subtask at the playback position, read the work advice information associated with the identified subtask, and display it on the screen as shown in Figure 11. In this example, the work advice information displayed on the screen may also change according to the change in the playback position.
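The two ways of selecting work advice described above (by the subtasks included in the detected attention section, or by the subtask at the current playback position) can be sketched as follows. This is a minimal sketch under assumptions: the subtask names, start times, advice texts, and function names are hypothetical, and subtasks are assumed to be contiguous sections identified by their start times in seconds.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical subtask sections: (start time in seconds, subtask name), sorted by start.
SEGMENTS: List[Tuple[float, str]] = [(0.0, "prepare"), (12.0, "assemble"), (30.0, "inspect")]

# Hypothetical work advice information pre-registered for each subtask.
ADVICE: Dict[str, str] = {
    "prepare": "Lay out all parts before starting.",
    "assemble": "Tighten the screws in a diagonal order.",
    "inspect": "Check every joint against the checklist.",
}


def subtask_at(position: float, segments: List[Tuple[float, str]] = SEGMENTS) -> str:
    """Return the subtask recorded at the given playback position."""
    starts = [start for start, _ in segments]
    i = bisect_right(starts, position) - 1
    if i < 0:
        raise ValueError("position precedes the first subtask")
    return segments[i][1]


def subtasks_in_section(start: float, end: float, duration: float,
                        segments: List[Tuple[float, str]] = SEGMENTS) -> List[str]:
    """Return the subtasks that overlap the attention section [start, end)."""
    names = []
    for i, (seg_start, name) in enumerate(segments):
        seg_end = segments[i + 1][0] if i + 1 < len(segments) else duration
        if seg_start < end and start < seg_end:
            names.append(name)
    return names


print(ADVICE[subtask_at(12.0)])
print([ADVICE[name] for name in subtasks_in_section(10.0, 15.0, duration=40.0)])
```

Calling `subtask_at` again whenever the playback position changes gives the behavior in which the displayed advice follows the playback position.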
[0147] Although this disclosure has been described above with reference to embodiments, this disclosure is not limited to the embodiments described above. Various modifications to the structure and details of this disclosure are possible, which can be understood by those skilled in the art within the scope of this disclosure. Furthermore, each embodiment can be combined with other embodiments as appropriate.
[0148] Furthermore, the flowchart used in the above explanation shows multiple steps (processes) in sequence. However, the execution order of the steps performed in each embodiment is not limited to the order in which they are described. In each embodiment, the order of the illustrated steps can be changed to the extent that the change does not interfere with the content of the processing.
[0149] Some or all of the above embodiments may also be described as in the following appendices, but are not limited to the following:
1. An information processing device having: an acquisition means for acquiring a target video of a person being evaluated performing a task; an identification means for identifying which of a plurality of reference frame images included in a reference video each of a plurality of target frame images included in the target video corresponds to; a detection means for detecting a section of attention from the target video based on the result of the identification of the correspondence; and a notification means for providing notification regarding the section of attention.
2. The information processing device according to 1, wherein the notification means notifies the start and end positions of the section of attention in the target video.
3. The information processing device according to 1 or 2, wherein the task includes a plurality of subtasks, and the notification means notifies the subtasks included in the section of attention.
4. The information processing device according to any one of 1 to 3, wherein, when the time interval or frame interval between a first target frame image and a second target frame image is $M_1$, and the time interval or frame interval between a first reference frame image corresponding to the first target frame image and a second reference frame image corresponding to the second target frame image is $M_2$, the detection means detects, as the section of attention, a section in which the relationship between $M_1$ and $M_2$ satisfies a predetermined condition.
5. The information processing device according to 4, wherein the predetermined condition is that the absolute value of the difference between ($M_2 / M_1$) and a reference value is greater than or equal to a threshold.
6. The information processing device according to 5, wherein, when the required time or number of frames of the task recorded in the target video is $N_1$, and the required time or number of frames of the task recorded in the reference video is $N_2$, the reference value is ($N_2 / N_1$).
7. The information processing device according to 5 or 6, wherein a plurality of thresholds are set, and the notification means uses a different notification method depending on which of the plurality of thresholds the absolute value is greater than or equal to.
8. The information processing device according to any one of 5 to 7, wherein the detection means acquires attribute information of the person being evaluated, and determines the threshold for each person being evaluated based on the attribute information.
9. An information processing method in which one or more computers: acquire a target video of a person being evaluated performing a task; identify which of a plurality of reference frame images included in a reference video each of a plurality of target frame images included in the target video corresponds to; detect a section of attention from the target video based on the result of the identification of the correspondence; and provide notification regarding the section of attention.
10. A program that causes a computer to execute: an acquisition step of acquiring a target video of a person being evaluated performing a task; an identification step of identifying which of a plurality of reference frame images included in a reference video each of a plurality of target frame images included in the target video corresponds to; a detection step of detecting a section of attention from the target video based on the result of the identification of the correspondence; and a notification step of providing notification regarding the section of attention.
[0150] Some or all of appendices 2 to 8, which depend on the information processing device described in appendix 1 above, may also depend on the information processing method of appendix 9 and the program of appendix 10, in the same dependency relationship as that between appendix 1 and appendices 2 to 8. Furthermore, to the extent that they do not depart from the embodiments described above, some or all of the configurations described in the appendices can be realized as various kinds of hardware, software, recording means for recording software, or systems.
[0151] This application claims priority based on Japanese Patent Application No. 2024-228365, filed on 25 December 2024, and incorporates all of its disclosures herein.
[0152]
10 Information processing device
11 Acquisition unit
12 Identification unit
13 Detection unit
14 Notification unit
1A Processor
2A Memory
3A Input/Output I/F
4A Peripheral circuit
5A Bus
Claims
1. An information processing device comprising: an acquisition means for acquiring a target video of a person being evaluated while performing a task; an identification means for identifying which of a plurality of reference frame images included in a reference video corresponds to each of a plurality of target frame images included in the target video; a detection means for detecting a section of attention from the target video based on the result of the correspondence identification; and a notification means for providing notification regarding the section of attention.
2. The information processing apparatus according to claim 1, wherein the notification means notifies the start and end positions of the attention section within the target video image.
3. The information processing apparatus according to claim 1 or 2, wherein the task includes a plurality of subtasks, and the notification means notifies of the subtasks included in the attention section.
4. The information processing apparatus according to any one of claims 1 to 3, wherein, when the time interval or frame interval between a first target frame image and a second target frame image is $M_1$, and the time interval or frame interval between a first reference frame image corresponding to the first target frame image and a second reference frame image corresponding to the second target frame image is $M_2$, the detection means detects, as the attention section, a section in which the relationship between $M_1$ and $M_2$ satisfies a predetermined condition.
5. The information processing apparatus according to claim 4, wherein the predetermined condition is that the absolute value of the difference between ($M_2 / M_1$) and a reference value is greater than or equal to a threshold value.
6. The information processing apparatus according to claim 5, wherein, when the required time or number of frames of the task recorded in the target video is $N_1$, and the required time or number of frames of the task recorded in the reference video is $N_2$, the reference value is ($N_2 / N_1$).
7. The information processing apparatus according to claim 5 or 6, wherein a plurality of thresholds are set, and the notification means uses a different notification method depending on which of the plurality of thresholds the absolute value is greater than or equal to.
8. The information processing apparatus according to any one of claims 5 to 7, wherein the detection means acquires attribute information of the person to be evaluated, and determines the threshold value for each person to be evaluated based on the attribute information.
9. An information processing method comprising: one or more computers acquiring a target video of a person being evaluated performing a task; identifying which of a plurality of reference frame images included in a reference video corresponds to each of a plurality of target frame images included in the target video; detecting a section of attention from the target video based on the result of the correspondence identification; and providing notification regarding the section of attention.
10. The information processing method according to claim 9, wherein the notification concerning the attention section includes notifying the start and end positions of the attention section within the target video image.
11. The information processing method according to claim 9 or 10, wherein the task includes a plurality of subtasks, and the notification regarding the attention section includes notification of the subtasks included in the attention section.
12. The information processing method according to any one of claims 9 to 11, wherein, when the time interval or frame interval between a first target frame image and a second target frame image is $M_1$, and the time interval or frame interval between a first reference frame image corresponding to the first target frame image and a second reference frame image corresponding to the second target frame image is $M_2$, detecting the attention section includes detecting, as the attention section, a section in which the relationship between $M_1$ and $M_2$ satisfies a predetermined condition.
13. The information processing method according to claim 12, wherein the predetermined condition is that the absolute value of the difference between ($M_2 / M_1$) and a reference value is greater than or equal to a threshold value.
14. The information processing method according to claim 13, wherein, when the required time or number of frames of the task recorded in the target video is $N_1$, and the required time or number of frames of the task recorded in the reference video is $N_2$, the reference value is ($N_2 / N_1$).
15. A recording medium that records a program causing a computer to execute: an acquisition step of acquiring a target video of a person being evaluated performing a task; an identification step of identifying which of a plurality of reference frame images included in a reference video corresponds to each of a plurality of target frame images included in the target video; a detection step of detecting a section of attention from the target video based on the result of the identification of the correspondence; and a notification step of providing notification regarding the section of attention.
16. The recording medium according to claim 15, wherein the notification step provides notification of the start and end positions of the attention section within the target video image.
17. The recording medium according to claim 15 or 16, wherein the task includes a plurality of subtasks, and the notification step provides notification of the subtasks included in the attention section.
18. The recording medium according to any one of claims 15 to 17, wherein, when the time interval or frame interval between a first target frame image and a second target frame image is $M_1$, and the time interval or frame interval between a first reference frame image corresponding to the first target frame image and a second reference frame image corresponding to the second target frame image is $M_2$, the detection step detects, as the attention section, a section in which the relationship between $M_1$ and $M_2$ satisfies a predetermined condition.
19. The recording medium according to claim 18, wherein the predetermined condition is that the absolute value of the difference between ($M_2 / M_1$) and a reference value is greater than or equal to a threshold value.
20. The recording medium according to claim 19, wherein, when the required time or number of frames of the task recorded in the target video is $N_1$, and the required time or number of frames of the task recorded in the reference video is $N_2$, the reference value is ($N_2 / N_1$).