Practice support program, practice support method, and practice support system

The practice support system addresses the lack of sufficient practice results by comparing user and instructor performances to provide targeted feedback and personalized practice plans, enhancing practice efficiency.

WO2026126390A1 · PCT designated stage · Publication Date: 2026-06-18 · HONDA MOTOR CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HONDA MOTOR CO LTD
Filing Date
2024-12-11
Publication Date
2026-06-18


Abstract

This practice support program is characterized by causing a computer to execute: a reading step for reading, from a storage device, first imaging data that is stored in advance in the storage device and obtained by imaging a user who is practicing a predetermined practical skill while receiving instruction on the predetermined practical skill; a reception step for receiving, from an imaging device, second imaging data that is imaging data of the user who is practicing the predetermined practical skill; a generation step for editing the first imaging data to generate third imaging data that is imaging data of a person involved in the predetermined practical skill; a comparison step for comparing the second imaging data with the third imaging data; and an output step for outputting, to an output device, imaging data among the first imaging data of a target section to be replayed as determined on the basis of the comparison result.

Description

Practice support program, practice support method, and practice support system 【0001】 The present invention relates to a practice support program, a practice support method, and a practice support system related to a predetermined practical skill. 【0002】 Conventionally, as this type of device, a system for supporting musical performance practice has been proposed (see, for example, Patent Document 1). In the system described in Patent Document 1, music performed according to a musical score displayed on a display unit (touch panel) is input, and performance data representing the music is generated. Further, when a touch operation is performed on the musical score, the music represented by the performance data is reproduced with the touched position as the reproduction start position. 【0003】 Japanese Patent Application Laid-Open No. 2013-200455 【0004】 On the other hand, in the system described in Patent Document 1, although the practice efficiency of the practitioner can be improved, it is difficult to give the practitioner sufficient practice results. 
【0005】 A practice support program according to an aspect of the present invention causes a computer to execute a reading step of reading out first imaging data obtained by imaging a user who is practicing a predetermined practical skill while receiving guidance on the predetermined practical skill, which is stored in a storage device in advance, from the storage device; a receiving step of receiving second imaging data, which is imaging data of a user who is practicing a predetermined practical skill, from an imaging device; a generating step of editing the first imaging data to generate third imaging data, which is imaging data of a person involved in a predetermined practical skill; a comparing step of comparing the second imaging data and the third imaging data; and an outputting step of outputting imaging data of a reproduction target section determined based on the comparison result among the first imaging data to an output device. 【0006】Another aspect of the present invention is a practice support method which includes: a reading step of reading first imaging data from a storage device, which is stored in a storage device beforehand and obtained by imaging a user practicing a predetermined practical skill while receiving instruction on the predetermined practical skill; a receiving step of receiving second imaging data, which is imaging data of a user practicing a predetermined practical skill, from an imaging device; a generation step of editing the first imaging data to generate third imaging data, which is imaging data of a person involved in the predetermined practical skill; a comparison step of comparing the second imaging data and the third imaging data; and an output step of outputting imaging data of the playback target section determined from the first imaging data based on the comparison result to an output device. 
【0007】 Another aspect of the present invention, a practice support system, comprises a server device that stores in advance first imaging data obtained by imaging a user practicing a predetermined practical skill while receiving instruction in that skill, and a user terminal having an imaging device. The user terminal includes a data acquisition unit that acquires the first imaging data from the server device and receives second imaging data, which is imaging data of a user practicing a predetermined practical skill, from the imaging device; a generation unit that edits the first imaging data to generate third imaging data, which is imaging data of a person involved in the predetermined practical skill; a determination unit that compares the second imaging data and the third imaging data and determines a playback target section from the first imaging data based on the comparison result; and a playback unit that outputs the imaging data of the playback target section to an output device. 【0008】 According to the present invention, it is possible to provide practitioners with sufficient practice results. 【0009】A diagram showing an example of the configuration of a practice support system according to an embodiment of the present invention. A block diagram showing the main components of the video acquisition device in Figure 1. A block diagram showing the main components of the server device in Figure 1. A diagram for explaining the image data. A flowchart showing an example of processing performed by the server device in Figure 1. A block diagram showing the main components of the user terminal in Figure 1. A diagram showing an example of the display screen of the user terminal's display. A diagram showing an example of the display screen of the user terminal's display. A diagram showing an example of the display screen of the user terminal's display. A diagram showing an example of the display screen of the user terminal's display. 
A flowchart showing an example of processing performed by the arithmetic unit of the user terminal. A flowchart showing an example of the processing in step S203 of Figure 9. 【0010】 Embodiments of the present invention will be described below with reference to the figures. Figure 1 is a diagram showing an example of the configuration of a practice support system according to an embodiment of the present invention. The practice support system 1 comprises a video acquisition device 100, a server device 200, and a user terminal 300, which are connected via a network NW so that they can communicate with each other. 【0011】 Figure 2 is a block diagram showing the main components of the video acquisition device 100 shown in Figure 1. As shown in Figure 2, the video acquisition device 100 includes a controller 10, a communication unit 13, and an imaging unit 14. 【0012】 The communication unit 13 communicates with external devices via a network that includes wireless communication networks such as the Internet and mobile phone networks. The network includes not only public wireless communication networks but also closed communication networks established for each designated management area, such as wireless LANs, Wi-Fi®, and Bluetooth®. 【0013】 The imaging unit 14 includes a camera 14a and a microphone 14b. The camera 14a has an image sensor such as a CCD or CMOS sensor and is installed so that the instructor and the student are within the imaging range during the lesson. The microphone 14b receives ambient sound as an audio signal. The audio signal received from the microphone 14b is converted into audio data via an A/D converter (not shown). The imaging unit 14 outputs data (hereinafter referred to as imaging data) including image data obtained by the camera 14a and audio data acquired by the microphone 14b to the controller 10. 
The installation position and number of cameras 14a and microphone 14b may be changed depending on the type of instrument and the position of the student. The imaging unit 14 may also have multiple cameras and multiple microphones. 【0014】 The imaging data acquired by the imaging unit 14 (hereinafter also referred to as lesson video data or simply lesson video) includes the sound produced by the instrument played by the student (student instrument sound) and the image of the student (student image). The lesson video also includes the sound produced by the instrument played by the instructor (instructor instrument sound), the image of the instructor (instructor image), the student's voice, and the instructor's voice (criticism and praise). If multiple students are taking a lesson, the lesson video will include instrument sounds, images, and voices corresponding to each student. 【0015】 The controller 10 is configured to include a computer having an arithmetic unit 11 such as a CPU (microprocessor), a storage unit 12 such as ROM or RAM, and other peripheral circuits (not shown) such as an I / O interface. The arithmetic unit 11 has a functional configuration that includes an acquisition unit 111 and a transmission unit 112. 【0016】 The acquisition unit 111 stores the imaging data received from the imaging unit 14 in the storage unit 12. The transmission unit 112 transmits the image data and audio data stored in the storage unit 12 by the acquisition unit 111 to the server device 200 via the communication unit 13. 【0017】Figure 3 is a block diagram showing the main components of the server device 200 shown in Figure 1. As shown in Figure 3, the server device 200 includes a controller 20 and a communication unit 23. The configuration of the communication unit 23 is the same as that of the communication unit 13 of the video acquisition device 100, so its explanation is omitted. 
【0018】 The controller 20 is configured to include a computer having an arithmetic unit 21 such as a CPU, a storage unit 22 such as ROM and RAM, and other peripheral circuits (not shown) such as an I/O interface. The arithmetic unit 21 has a functional configuration that includes a receiving unit 211, an editing unit 212, and a transmission unit 213. The server device 200 may be configured using a virtual server function on the cloud. Furthermore, the server device 200 may be configured with its functional configuration distributed across multiple devices. 【0019】 The receiving unit 211 receives imaging data from the video acquisition device 100 via the communication unit 23. The receiving unit 211 stores the received imaging data in the storage unit 22. Figure 4 is a diagram illustrating the imaging data. As shown in Figure 4, the imaging data includes audio data, image data, and a timestamp. 【0020】 The editing unit 212 reads the imaging data stored in the storage unit 22 by the receiving unit 211 and edits it. The editing unit 212 separates the imaging data read from the storage unit 22 into audio data and image data. 【0021】 Next, the editing unit 212 classifies the audio data separated from the imaging data into audio data for each sound source (speaker and instrument) based on the characteristics of the audio data (frequency, wavelength, amplitude, etc.). The waveform of the audio signal differs between the sound of a musical instrument and human speech. Therefore, based on these characteristics of the audio signal, the editing unit 212 first classifies the audio data separated from the imaging data into audio data corresponding to a speaker (hereinafter referred to as speech data) and audio data corresponding to an instrument (hereinafter referred to as performance data). 【0022】 Next, the editing unit 212 classifies the speech data into speech data for each speaker. 
For example, if the learner is a child, the learner's voice will be higher in pitch than that of the adult instructor, so the learner's voice will have a relatively higher frequency than the instructor's voice. Based on these speech characteristics, the editing unit 212 classifies the speech data into speech data corresponding to the learner (hereinafter referred to as learner speech data) and speech data corresponding to the instructor (hereinafter referred to as instructor speech data). 【0023】 The editing unit 212 similarly classifies the performance data into audio data for each instrument. However, when the student and the instructor are playing the same instrument, the characteristics of their performance sounds are basically the same, making it difficult to distinguish between the student's performance and the instructor's performance based solely on sound characteristics. Therefore, the editing unit 212 classifies the performance data using the image data included in the imaging data. Specifically, the editing unit 212 determines whether the subject in the image data included in the imaging data (image data with a timestamp recorded at the same time as the performance data) is the student or the instructor, and classifies the performance data into audio data for each performer. Hereinafter, performance data corresponding to the student will be called student performance data, and performance data corresponding to the instructor will be called instructor performance data. 【0024】 Normally, by recognizing actions such as mouth movements based on image data, it is possible to determine which of the people within the imaging range is speaking. However, it is difficult to detect the mouth movements of people whose faces are covered by masks or the like, or of people who are outside the camera's field of view. Therefore, the editing unit 212 may classify the sound source based on the physical characteristics of the audio signal. 
Specifically, the editing unit 212 may classify the speech data of the instructor and the speech data of the trainee based on the frequency of the audio signal. 【0025】 Now, with reference to Figure 5, the editing process performed by the editing unit 212 will be explained. Figure 5 is a flowchart showing an example of the editing process performed by the editing unit 212, which is executed by the arithmetic unit 21 of the server device 200 according to a predetermined program. 【0026】 In step S101, the editing unit 212 extracts non-silent sections from the imaging data. Specifically, the editing unit 212 extracts from the imaging data sections that contain only the practicer's performance data, sections that contain only the instructor's speech data, sections that contain only the instructor's performance data, and sections that contain both the instructor's speech data and the instructor's performance data. 【0027】 To determine whether the extracted non-silent sections are similar to each other, the editing unit 212 represents the audio data of each non-silent section as a "language of performance." This allows the similarity of the non-silent sections to be measured using a similarity calculation method from natural language processing. A "language of performance" is information that represents sound in a manner that can be expressed in language. More specifically, it is information that, after converting sound into a Mel spectrogram, represents the pitch, length, importance (frequency of occurrence), and continuity of the sound as vectors (hereinafter referred to as feature vectors). A Mel spectrogram is information that represents the amplitude of sound on a time axis and a frequency axis on the Mel scale. 【0028】 In step S102, the editing unit 212 clusters the non-silent intervals based on similarity. Specifically, first, the editing unit 212 generates a feature vector corresponding to each non-silent interval. 
If a non-silent interval contains multiple pieces of audio data, the editing unit 212 generates a feature vector corresponding to each of them. 【0029】 Next, the editing unit 212 calculates the relative similarity of each non-silent interval based on the feature vectors. More specifically, the editing unit 212 calculates the relative similarity (cosine similarity) of each non-silent interval based on the cosine of the angle formed by the feature vectors corresponding to the non-silent intervals. However, the method for calculating the similarity of non-silent intervals is not limited to this, and the editing unit 212 may replace the feature vectors of each non-silent interval with TF-IDF (Term Frequency-Inverse Document Frequency) values and calculate the cosine similarity between the non-silent intervals based on the TF-IDF values. 【0030】 After calculating the similarity of each non-silent interval, the editing unit 212 clusters the non-silent intervals based on the calculated similarity. As a result, each non-silent interval is grouped with other non-silent intervals whose feature vectors are similar to its own. The threshold used to determine whether or not non-silent intervals are similar may be a fixed value or a relative value based on the similarity of the non-silent intervals. 【0031】 In step S103, the editing unit 212 arranges each group (cluster) generated by clustering along the time axis. At this time, the editing unit 212 may merge clusters that contain only instructor speech data with clusters that contain only instructor performance data. 【0032】 Specifically, the editing unit 212 refers to the timestamps to determine whether there is a temporal relationship between the cluster containing only instructor speech data and the cluster containing only instructor performance data. If there is a temporal relationship, the editing unit 212 merges those clusters. 
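As an illustration only, the cosine-similarity calculation of paragraph 0029, the clustering of paragraph 0030, and the merging of temporally continuous clusters of paragraph 0032 might be sketched in Python as follows. The description does not specify the clustering algorithm, the threshold value, the allowed time gap, or the data layout, so the greedy grouping, the 0.9 threshold, the zero-second gap, and every name used here (Section, cluster_sections, merge_if_continuous) are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Section:
    """One non-silent section; the field names are hypothetical."""
    label: str      # e.g. "instructor_speech" or "instructor_performance"
    start: float    # recording start, in seconds from the start of the lesson
    end: float      # recording end, in seconds
    feature: list   # feature vector derived from the section's audio


def cosine_similarity(a, b):
    """Cosine of the angle formed by two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_sections(sections, threshold=0.9):
    """Greedy grouping (an assumption): a section joins the first cluster
    whose first member is at least `threshold` similar to it."""
    clusters = []
    for sec in sections:
        for cluster in clusters:
            if cosine_similarity(cluster[0].feature, sec.feature) >= threshold:
                cluster.append(sec)
                break
        else:
            clusters.append([sec])
    return clusters


def merge_if_continuous(first, second, max_gap=0.0):
    """Return the merged (start, end) span when `second` begins within
    `max_gap` seconds after `first` ends; otherwise return None."""
    gap = second[0] - first[1]
    if 0.0 <= gap <= max_gap:
        return (first[0], second[1])
    return None


# Recording times 00h10m10s-00h10m20s and 00h10m20s-00h10m50s, in seconds.
cluster_a = (610.0, 620.0)
cluster_b = (620.0, 650.0)
print(merge_if_continuous(cluster_a, cluster_b))  # (610.0, 650.0)
```

A fixed threshold is used here for simplicity; paragraph 0030 also allows a relative threshold derived from the similarities themselves, which this sketch does not attempt.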
For example, if the recording time of cluster A, which contains only instructor speech data, is "00h10m10s to 00h10m20s", and the recording time of cluster B, which contains only instructor performance data, is "00h10m20s to 00h10m50s", then clusters A and B are merged because they have a temporal relationship (continuity). 【0033】 In step S104, the editing unit 212 stores each cluster arranged along the time axis as a performance practice set in the storage unit 22, associating it with information that can identify the practicer (hereinafter referred to as the user ID). A performance practice set is information that associates the practicer's performance data acquired during the lesson with the instructor's speech data and instructor's performance data acquired during the lesson. 【0034】 The transmission unit 213 transmits the performance practice sets stored in the storage unit 22 by the editing unit 212 to the user terminal 300 based on a transmission request command described later. 【0035】Figure 6 is a block diagram showing the main components of the user terminal 300 in Figure 1. As shown in Figure 6, the user terminal 300 includes a controller 30, a communication unit 33, an imaging unit 34 including a camera 34a and a microphone 34b, a speaker 35, and a display 36. The configurations of the communication unit 33 and the imaging unit 34 are the same as those of the communication unit 13 and imaging unit 14 of the video acquisition device 100, so their explanations will be omitted. 【0036】 Speaker 35 outputs audio data from controller 30 as an audio signal via a D / A converter (not shown). Display 36 displays an image based on image data from controller 30. 【0037】 The controller 30 is composed of a computer having an arithmetic unit 31 such as a CPU, a storage unit 32 such as ROM and RAM, and other peripheral circuits not shown such as an I / O interface. 
The arithmetic unit 31 has a functional configuration including a data acquisition unit 311, an analysis unit 312, a conversation unit 313, a generation unit 314, a determination unit 315, and a playback unit 316. 【0038】 The data acquisition unit 311 requests the server device 200 to send the performance practice set. More specifically, the data acquisition unit 311 outputs to the server device 200 a transmission request command that includes the user ID of the practicing user. When the data acquisition unit 311 receives the performance practice set transmitted from the server device 200 in response to this transmission request command, it stores the received performance practice set in the storage unit 32, associating it with the user ID. 【0039】 Furthermore, the data acquisition unit 311 stores the imaging data of the person practicing playing a musical instrument, acquired by the imaging unit 34, in the storage unit 32. When the imaging unit 34 acquires the imaging data of the person practicing, the user terminal 300 is positioned so that the person practicing and the instrument are included in the imaging range of the camera 34a, and so that the person's voice and the sound of the instrument being played can be acquired by the microphone 34b. 【0040】 When the performance practice set stored in the storage unit 32 includes instructor speech data, the analysis unit 312 analyzes the characteristics of the instructor's speech based on the speech data. These characteristics include how words are extended, emphasis on specific sounds, speaking speed, intonation biases, verbal tics, and intonation. 【0041】 The conversation unit 313 generates conversation data related to instrument practice and presents it to the student. At this time, the conversation unit 313 generates conversation data that reflects the characteristics of the instructor's speech based on the analysis results of the analysis unit 312. 
Note that the conversation data is not limited to audio data, but may also be image data including text messages. Furthermore, the instructor's verbal habits may be reflected in the conversation data as characteristics of the instructor's speech. 【0042】 Figures 7A to 7D show examples of the display screen of the display 36 when display information including conversation data is output. In the display screens of Figures 7A to 7D, conversation data including text messages is displayed as image data in area 71. In the display screens of Figures 7A to 7D, an image representing a character modeled after the instructor is displayed in area 72, but area 72 may instead display an image representing the instructor's face or an animated character. Furthermore, the image displayed in area 72 may be switchable by a display switching instruction via an input device. 【0043】 The input device may be a keyboard (not shown) in addition to the microphone 34b. Furthermore, if the display 36 has a touch panel, the touch panel may be used as the input device. Also, the conversation data may be output as audio data via the speaker 35 in synchronization with the display of the conversation data in area 71. The audio output from the speaker 35 based on the conversation data may be audio mimicking the voice of the instructor or audio mimicking the voice of an anime character. Furthermore, the input device may be located outside the user terminal 300. 【0044】 The conversation unit 313 outputs conversation data including questions regarding practice via an output device. When the conversation data is audio data, the speaker 35 is used as the output device. On the other hand, when the conversation data is image data, the display 36 is used as the output device. 
The conversation unit 313 derives a practice plan based on the answers of the trainee to the questions and the performance practice sets stored in the storage unit 32. 【0045】 Specifically, the conversation unit 313 outputs conversation data (Fig. 7A) that prompts the input of the practice time desired by the trainee (hereinafter referred to as the desired practice time). When performance practice sets corresponding to a plurality of contents are respectively stored in the storage unit 32, the conversation unit 313 may output conversation data that prompts the input of the total practice time of the plurality of contents or the practice time for each content. Further, conversation data that prompts the input of the name of the content for which practice is desired (hereinafter referred to as the desired practice content) and the order of practice execution for each content may be output. When the practice target is instrument performance, a piece of music is given as an example of the content. Note that the practice plan may include items other than those related to the desired practice content and the desired practice time. 【0046】 When the answer of the trainee is input via the input device, the conversation unit 313 derives a practice plan (such as the content to be practiced (hereinafter referred to as the practice target content) and the practice time, etc.) based on the content of the answer. The conversation unit 313 outputs conversation data (Fig. 7B) for presenting the derived practice plan to the trainee. 【0047】 When the answer of the trainee is not input, the conversation unit 313 regenerates and outputs the conversation data. For example, when conversation data that prompts the input of the total practice time of a plurality of contents is output, and the answer of the trainee is not input, conversation data that prompts the input of the practice time for each content is generated and output. 
In this way, the questions to the trainee are repeated while changing the content of the questions until the information necessary for deriving the practice plan is obtained from the trainee. 【0048】In addition, when neither the trainee nor the musical instrument played by the trainee is included in the imaging range of the camera 34a, the conversation unit 313 may output conversation data for prompting changes in the position, posture, orientation, etc. of the camera 34a. Further, when the magnitudes of the voice of the trainee and the performance sound of the musical instrument input to the microphone 34b are below a predetermined value, the conversation unit 313 may output conversation data for prompting changes in the orientation, position, etc. of the microphone 34b. 【0049】 When the conversation unit 313 derives a practice plan, it outputs conversation data for prompting the start of a pre-performance (hereinafter referred to as a pre-performance start instruction) (FIG. 7C). For example, when the practice target content is determined to be piece X, a pre-performance start instruction for prompting the start of the performance of piece X is output. 【0050】 The generation unit 314 reads out a performance practice set corresponding to the practice target content from the storage unit 32. The generation unit 314 extracts imaging data of the trainee who is practicing the performance and imaging data of the instructor who is giving performance guidance from the read performance practice set. 【0051】 At this time, as the imaging data of the trainee, trainee speech data, trainee performance data, and image data with time stamps recorded at the same time as them are extracted. Also, as the imaging data of the instructor, instructor speech data, instructor performance data, and image data with time stamps recorded at the same time as them are extracted. 
Among the extracted imaging data of the trainee, the trainee performance data is used as comparison source imaging data in the determination unit 315. The comparison source imaging data may include image data corresponding to the trainee performance data. On the other hand, among the extracted imaging data of the instructor, the instructor performance data is used as model imaging data in the determination unit 315. The model imaging data may include image data corresponding to the instructor performance data. 【0052】 The determination unit 315 reads the imaging data acquired by the imaging unit 34 from the storage unit 32 as comparison target imaging data. At this time, the determination unit 315 reads from the storage unit 32, as the comparison target imaging data, the imaging data of the practicer who started playing in response to the pre-performance start instruction. For example, if piece X is selected as the practice target content, the imaging data acquired by the imaging unit 34 from the start to the end of the performance of piece X is read from the storage unit 32 as the comparison target imaging data. 【0053】 The determination unit 315 compares the comparison source imaging data and the model imaging data and generates information indicating the differences between them (hereinafter referred to as lesson-time difference information). The determination unit 315 also compares the comparison target imaging data and the model imaging data and generates information indicating the differences between them (hereinafter referred to as pre-performance difference information). The lesson-time difference information and the pre-performance difference information each record, in chronological order, the difference between the audio signal corresponding to the practicer's performance and the audio signal corresponding to the instructor's performance. 
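The chronological difference information of paragraph 0053 might look like the following minimal Python sketch. It assumes, beyond what the description states, that both audio signals have already been time-aligned and converted into per-frame feature vectors of equal frame length, and that the difference is measured as a sum of absolute differences. The function name difference_info and the 0.5-second frame length are hypothetical.

```python
def difference_info(practicer_frames, model_frames, frame_sec=0.5):
    """Return [(time_sec, difference), ...] in chronological order.

    Each frame is a feature vector (list of floats) for one time slice.
    The metric (sum of absolute differences) is an assumption of this sketch.
    """
    n = min(len(practicer_frames), len(model_frames))
    info = []
    for i in range(n):
        diff = sum(abs(p - m)
                   for p, m in zip(practicer_frames[i], model_frames[i]))
        info.append((i * frame_sec, diff))
    return info


# Hypothetical frames: the second frame of the practicer deviates from the model.
model = [[1.0, 0.0], [1.0, 0.0]]
practicer = [[1.0, 0.0], [0.0, 1.0]]
print(difference_info(practicer, model))  # [(0.0, 0.0), (0.5, 2.0)]
```

The same function would be called twice: once with the comparison source imaging data to obtain the lesson-time difference information, and once with the comparison target imaging data to obtain the pre-performance difference information.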
【0054】 The determination unit 315 detects sections performed erroneously during the lesson (hereinafter referred to as "erroneous performance sections during lessons") based on the lesson-time difference information. An erroneous performance section during lessons is a performance section in which the difference between the comparison source imaging data and the model imaging data is greater than or equal to a predetermined level. In addition, the determination unit 315 detects sections performed erroneously during practice (hereinafter referred to as "erroneous performance sections during practice") based on the pre-performance difference information. An erroneous performance section during practice is a performance section in which the difference between the comparison target imaging data and the model imaging data is greater than or equal to a predetermined level. 【0055】 Furthermore, the determination unit 315 determines the sections where an erroneous performance section during lessons and an erroneous performance section during practice overlap to be sections requiring instruction. Sections requiring instruction are performance sections in which the learner repeats during the practice session the same error made during the lesson. An "erroneous performance" is a performance that does not match the score, that is, a performance that does not follow the notes or time signatures written in the score. An "erroneous performance" also includes performances in which the dynamics, hand shape, fingering, and how the fingers play the keys do not conform to the instructor's guidance. 【0056】 Furthermore, the lesson-time difference information and pre-performance difference information may indicate differences in the image signal, either in place of differences in the audio signal or in conjunction with differences in the audio signal. 
This allows performance sections in which not only the sound but also the hand shape, fingering, and how the fingers play the keys do not conform to the instructor's guidance to be detected as erroneous performance sections. 【0057】 The playback unit 316 acquires, from the comparison target imaging data, the imaging data that has a timestamp corresponding to the section requiring instruction. Furthermore, if the comparison source imaging data contains instructor speech data with a timestamp corresponding to the section requiring instruction, or to a section within a predetermined time immediately following the section requiring instruction, the playback unit 316 acquires that instructor speech data. In this case, the playback unit 316 may acquire the corresponding image data along with the instructor speech data. 【0058】 The playback unit 316 generates instructional data based on the data acquired from the comparison target imaging data and the comparison source imaging data, and outputs the instructional data to the output device. As a result, the student is presented with the audio (or audio and video) of the student's incorrect performance during the lesson, along with the audio (or audio and video) of the instructor giving guidance on that incorrect performance. This allows the student to review the feedback they received from the instructor during the previous lesson when practicing on their own. 【0059】 Furthermore, when the determination unit 315 determines multiple instruction-requiring sections for the practice content, the playback unit 316 divides the practice content into parts according to these instruction-requiring sections. Specifically, first, it divides the performance section of the practice content from the start position to the detection position of the first instruction-requiring section into a single practice part. 
Next, it divides the remaining performance section up to the detection position of the next instruction-requiring section into a single practice part. This division process is repeated until the end position of the performance of the practice content, thereby dividing the practice content into multiple practice parts. The playback unit 317 generates instructional data for each practice part. 【0060】When the playback unit 317 divides the practice content into parts, the conversation unit 313 outputs conversation data (Figure 7D) prompting the user to select the practice part they wish to play. When the playback unit 317 receives a playback instruction via the input device specifying the practice part to be played, it plays back the difference visualization information and instructional data corresponding to that practice part. The difference visualization information is image information that visualizes the difference information from the prior performance. 【0061】 Figure 8 shows an example of the display screen of the display 36, which outputs display information including difference visualization information. Note that the information displayed in regions 81 and 82 in Figure 8 is the same as that in regions 71 and 72 in Figures 7A to 7D, so their explanation is omitted. 【0062】 Area 82 displays information indicating the currently playing practice part. In the example in Figure 8, area 82 displays information indicating that "Part #1" is currently playing. Area 83 displays a slider indicating the playback position. Area 84 highlights the keys pressed by the instructor during performance according to the playback position. Area 84a displays a button to switch the output of the instructor's performance sound from speaker 35 on and off. Area 85 highlights the keys pressed by the practicer during performance according to the playback position. Area 85a displays a button to switch the output of the practicer's performance sound from speaker 35 on and off. 
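As an illustrative sketch only, the detection of sections requiring instruction (paragraphs 【0054】 and 【0055】) and the part division (paragraph 【0059】) might be expressed in Python as follows. The sketch assumes that the lesson-time and pre-performance difference information are available as per-frame numeric series aligned to the same positions in the piece, and that each practice part ends where its section requiring instruction ends (cutting at the start of the section would be an equally plausible reading). The function names, the frame representation, and the threshold are hypothetical and do not appear in the embodiment.

```python
from typing import List, Optional, Tuple

Section = Tuple[int, int]  # half-open range [start, end) of frame indices


def erroneous_sections(diff: List[float], threshold: float) -> List[Section]:
    """Return maximal runs of frames whose difference is >= threshold."""
    sections: List[Section] = []
    start: Optional[int] = None
    for i, d in enumerate(diff):
        if d >= threshold and start is None:
            start = i                      # a run of erroneous frames begins
        elif d < threshold and start is not None:
            sections.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:
        sections.append((start, len(diff)))
    return sections


def overlapping_sections(a: List[Section], b: List[Section]) -> List[Section]:
    """Intersect two sorted, non-overlapping lists of sections."""
    out: List[Section] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever section finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def divide_into_parts(total_frames: int, required: List[Section]) -> List[Section]:
    """Cut [0, total_frames) so each part ends with one section requiring instruction
    (assumption); any remainder after the last such section forms a final part."""
    parts: List[Section] = []
    start = 0
    for _, end in required:
        if end > start:
            parts.append((start, end))
            start = end
    if start < total_frames:
        parts.append((start, total_frames))
    return parts


# Hypothetical difference series (one value per frame) and threshold.
lesson_diff = [0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.9, 0.2]
practice_diff = [0.1, 0.2, 0.9, 0.9, 0.1, 0.8, 0.1, 0.1]
required = overlapping_sections(
    erroneous_sections(lesson_diff, 0.5),
    erroneous_sections(practice_diff, 0.5),
)
parts = divide_into_parts(len(lesson_diff), required)
```

In this example the learner repeats the lesson-time error in frames 2 and 5, so the piece is divided into three practice parts, the last of which contains no section requiring instruction.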
【0063】 Area 86 displays difference visualization information. In the example in Figure 8, image 86a schematically represents the audio signal of the instructor's performance, and image 86b schematically represents the audio signal of the student's performance; both are displayed as difference visualization information. Image 86c is an indicator showing the playback position; together with the slider indicator displayed in area 83, it moves left and right on the screen in accordance with the playback position. Area 87 displays buttons for starting and stopping playback of the practice part. Area 88 displays a button for the student to request the end of practice. The slider indicator displayed in area 83 may be operable by the user (student). This allows the user to repeat playback of the instructor's performance in a predetermined section, for example, the section in which the same mistake was made as during the lesson. Note that the display screen in Figure 8 is just an example, and the difference visualization information and other items may be presented to the student in a manner different from that shown in Figure 8. 【0064】 Figure 9 is a flowchart showing an example of processing performed by the arithmetic unit 31 of the user terminal 300 according to a predetermined program. The processing shown in Figure 9 starts when the self-practice application installed on the user terminal 300 is launched. 【0065】 In step S201, it is determined whether or not a trainee has been recognized within the imaging range of camera 34a. Specifically, facial recognition based on machine learning is used to determine whether a person within the imaging range of camera 34a is a trainee, based on facial image data of trainees pre-stored in the memory unit 32. The process in step S201 is repeated until a positive determination is made. 
If a positive determination is not made after a certain period of time, the self-practice application may be terminated. 【0066】If affirmed in step S201, in step S202, conversational data (Figure 7A) including questions about practice is presented to the practicer via the output device. At this time, in order to ease the practicer's tension and pique their interest in practice, conversational data including small talk (for example, "Good job at school!", "You seem a little tired. Don't push yourself too hard!") is presented to the practicer first. After that, conversational data prompting the practicer to input the desired practice time and the name of the content they wish to practice is presented. Furthermore, if performance practice sets corresponding to multiple contents are stored in the memory unit 32, conversational data prompting the practicer to input the total practice time for the multiple contents and the order in which each content should be practiced is presented. When the practicer's answer to a question is input via the input device, information indicating the content of that answer (hereinafter referred to as practicer answer information) is stored in the memory unit 32. 【0067】 In step S203, a practice plan (including the content to be practiced and the practice time) is derived based on the practicer response information stored in the memory unit 32. In step S204, a part division process is performed on the content to be practiced derived in step S203. Details of the part division process will be described later with reference to Figure 10. 【0068】 In step S204, the practice content, which was divided into parts in step S203, is presented to the practicer, and conversational data is presented to the practicer to allow them to select which part of the divided practice song they would like to practice (Figure 7D). 
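The embodiment does not specify how the practice plan is computed from the practicer's answers. As a minimal sketch under stated assumptions, the derivation in step S203 could look like the following Python, which takes the total practice time, the practice order, and any per-content times the practicer supplied, and splits the remaining time evenly over the other contents. The names `PlanItem` and `derive_practice_plan`, the even split, and the use of whole minutes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PlanItem:
    content: str  # name of the content to be practiced
    minutes: int  # practice time allotted to that content


def derive_practice_plan(
    total_minutes: int,
    order: List[str],
    fixed_minutes: Optional[Dict[str, int]] = None,
) -> List[PlanItem]:
    """Build a plan in the requested order (hypothetical even-split policy)."""
    fixed = fixed_minutes or {}
    unknown = set(fixed) - set(order)
    if unknown:
        raise ValueError(f"times given for contents not in the order: {sorted(unknown)}")
    remaining = total_minutes - sum(fixed.values())
    if remaining < 0:
        raise ValueError("per-content times exceed the total practice time")

    # Contents without an explicit time share the remaining minutes evenly;
    # leftover minutes go to the contents practiced first.
    free = [name for name in order if name not in fixed]
    base, extra = divmod(remaining, len(free)) if free else (0, 0)

    plan: List[PlanItem] = []
    free_seen = 0
    for name in order:
        if name in fixed:
            minutes = fixed[name]
        else:
            minutes = base + (1 if free_seen < extra else 0)
            free_seen += 1
        plan.append(PlanItem(name, minutes))
    return plan


# Hypothetical answers: 30 minutes in total, three contents, 5 minutes fixed for scales.
plan = derive_practice_plan(30, ["Etude", "Sonatina", "Scales"], {"Scales": 5})
```

Here the 25 minutes not reserved for "Scales" are split between the two other contents, with the odd minute going to the content practiced first.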
【0069】In step S205, it is determined whether a practice part to be practiced (hereinafter referred to as the practice target part) has been selected. Step S205 is repeated until a positive determination is made. If a positive determination is made in step S205, in step S206, difference visualization information and instructional data corresponding to the practice target part selected in step S205 are presented to the practicer via the output device. This information may also be presented when an output request instruction is received from the practicer via the input device. In addition to presenting the difference visualization information and instructional data, conversational data may also be presented to the practicer encouraging them to practice playing while confirming the presented difference visualization information and instructional data. 【0070】 In step S207, it is determined whether the practice time corresponding to the practice part has elapsed. The practice time corresponding to the practice part may be determined by dividing the practice time set for the piece to be practiced by the number of parts, or it may be determined based on the difficulty of playing each practice part. The difficulty of playing may be determined based on the frequency of fast passages, odd time signatures, and complex rhythms included in the practice part, or it may be determined according to other criteria. 【0071】 Step S207 is repeated until a positive result is obtained. If a positive result is obtained in step S207, in step S208, conversational data is presented to the trainee to confirm whether or not to continue practicing the practice part. Then, based on the content of the trainee's response entered via the input device, it is determined whether or not to continue practicing. If a negative result is obtained in step S208, the process returns to step S204, and the trainee is prompted again to select a practice part. 
If a positive result is obtained in step S208, in step S209, conversational data is presented to the trainee to notify them of the end of practice, and the process ends. If a request to end practice is entered via the input device before a positive result is obtained in step S205 or step S207, the process may proceed to step S208. 【0072】Figure 10 is a flowchart showing an example of the process (practice planning process) in step S203 of Figure 9. In step S301, materials for reviewing the previous lesson are presented to the student. More specifically, a performance practice set corresponding to the practice target song set in step S301 is read from the storage unit 32, and the student's performance data included in the performance practice set is output via the output device. At this time, if the performance practice set includes image data with a timestamp recorded at the same time as the student's performance data, the image data may be output along with the performance data. 【0073】 In step S302, conversational data to guide the practicer to pre-play, i.e., a pre-play start instruction (Figure 7C), is output. In step S303, it is determined whether the practicer has started pre-play based on the image data acquired by the camera 34a and the audio data acquired by the microphone 34b. When the practicer starts pre-playing, the image is captured by the imaging unit 34, and the image data is stored in the storage unit 32. The process in step S303 is repeated until a positive determination is made. If a positive determination is not made after a certain period of time, the process may proceed to step S305. 【0074】 In step S304, it is determined whether the practicer has finished the preliminary performance. Whether the preliminary performance has finished may be determined based on the image data acquired by camera 34a, based on the audio data acquired by microphone 34b, or based on the image data acquired by camera 34a and the audio data acquired by microphone 34b. 
【0075】 In step S306, the image data of the practicer acquired by the imaging unit 34 during the pre-performance and stored in the storage unit 32 is acquired as comparison target image data. In addition, the image data of the practicer included in the performance practice set read from the storage unit 32 in step S301 is acquired as comparison source image data, and the image data of the instructor included in the performance practice set is acquired as example image data. Furthermore, based on the comparison target image data, the comparison source image data, and the example image data, sections requiring instruction within the practice piece are detected, and the practice piece is divided into parts according to these sections. Specifically, when a section requiring instruction is detected, the performance section up to that section is divided off as a single practice part. By repeating this division process until the end of the performance, the practice piece is divided into multiple practice parts. 【0076】 According to embodiments of the present invention, the following effects can be achieved. 
(1) The user terminal 300 includes: a data acquisition unit 311 that acquires, from a server device 200, imaging data (hereinafter referred to as first imaging data) obtained by imaging a user practicing playing a musical instrument while receiving instruction, and that receives imaging data of the user (practitioner) practicing playing a musical instrument (hereinafter referred to as second imaging data) from an imaging unit 34 acting as an imaging device; a generation unit 314 that edits the first imaging data, more specifically by removing from it any imaging data other than that of the user practicing playing the musical instrument, to generate imaging data of the person involved in playing the musical instrument (hereinafter referred to as third imaging data); a determination unit 315 that compares the second imaging data and the third imaging data and determines the instruction-requiring section (hereinafter also referred to as the playback target section) from the first imaging data based on the comparison result; and a playback unit 316 that outputs the imaging data of the playback target section to an output device. When users practice on their own, for example when playing an instrument, this allows the system to present them with video footage of their lessons, selected on the basis of a comparison between their current playing and their playing during lessons. As a result, users can recall the instruction they received from their instructor while practicing on their own. 【0077】 (2) The data acquisition unit 311 acquires example image data (hereinafter also simply called example data) showing an example of musical instrument performance from the server device 200, and the determination unit 315 determines the playback target section based on the result of comparing the difference between the second imaging data and the example data with the difference between the third imaging data and the example data. 
More specifically, a data section in which the similarity between the difference between the second imaging data and the example data and the difference between the third imaging data and the example data is above a predetermined level is determined as the playback target section. This enables the practicer, when practicing alone, to recognize the performance sections in which they are making the same mistakes as during lessons. 【0078】 (3) The user terminal 300 further includes a conversation unit 313 that presents conversation data (hereinafter also referred to as text data) to the user via a speaker 35 as an audio output device or a display 36 as a display device. The practice of playing a musical instrument includes multiple contents, and the conversation unit 313 first presents the user with first text data for casual conversation. Subsequently, the conversation unit 313, acting as a practice derivation unit, presents the user with second text data (Figure 7A) that prompts the user to input an answer to at least one of the following: the total practice time for the multiple contents, the practice time for each content, the name of each content, and the execution order of the multiple contents. Furthermore, when an answer is input from the user via the input device, the conversation unit 313 presents the user with third text data (Figure 7B) that shows a practice plan for playing a musical instrument derived based on the content of the answer. If no answer is input via the input device, the conversation unit 313 regenerates the second text data and presents it to the user again. This helps the user to start self-practice smoothly. 【0079】 (4) The conversation unit 313, acting as a practice guidance unit, further presents the user with fourth text data (Figure 7C) prompting them to begin practicing playing an instrument according to the practice plan. 
When the practice of playing an instrument according to the practice plan begins, the data acquisition unit 311 starts receiving the second imaging data. This allows the system to support the user's self-practice according to the practice plan they themselves have chosen. 【0080】 (5) When the first imaging data includes audio data representing the voice of an instructor who provides instruction to the user on playing a musical instrument, the user terminal 300 further includes an analysis unit 312 that analyzes the characteristics of the instructor's speech based on the audio data. The conversation unit 313 generates the first, second, third, and fourth text data based on the speech characteristics obtained by the analysis unit 312. This makes it easier to ease the learner's tension and to stimulate their interest in practicing. 【0081】 The above embodiment can be modified in various ways. Modifications will be described below. In the above embodiment, the practice support system 1 supports self-practice of playing a musical instrument, but the practice support system may also support self-practice of a predetermined practical skill other than playing a musical instrument (for example, dancing or singing). That is, the instructor may be a predetermined person who instructs the user in dancing or singing. Also, in the above embodiment, the practice support system 1 includes a video acquisition device 100 and a user terminal 300. However, the user terminal 300 may also have the functions of the video acquisition device 100. That is, the user terminal 300 and the video acquisition device 100 may be realized in a single device. 【0082】 The above description is merely an example, and the present invention is not limited by the embodiments and modifications described above, as long as the features of the present invention are not impaired. 
It is also possible to arbitrarily combine one or more of the above embodiments and modifications, and to combine modifications with each other. 【0083】 1 Practice support system, 30 Controller, 31 Arithmetic unit, 32 Memory unit, 33 Communication unit, 34 Imaging unit, 34a Camera, 34b Microphone, 35 Speaker, 36 Display, 311 Data acquisition unit, 312 Analysis unit, 313 Conversation unit, 314 Generation unit, 315 Determination unit, 316 Playback unit, 100 Video acquisition device, 200 Server device, 300 User terminal

Claims

1. A practice support program characterized by causing a computer to execute: a reading step of reading, from a storage device, first imaging data that is stored in advance in the storage device and obtained by imaging a user practicing a predetermined practical skill while receiving instruction in the predetermined practical skill; a reception step of receiving, from an imaging device, second imaging data, which is imaging data of the user practicing the predetermined practical skill; a generation step of editing the first imaging data to generate third imaging data, which is imaging data of a person involved in the predetermined practical skill; a comparison step of comparing the second imaging data with the third imaging data; and an output step of outputting, to an output device, imaging data of a playback target section determined from the first imaging data based on the result of the comparison.

2. A practice support program according to claim 1, characterized in that, in the reading step, example data, which is imaging data showing a predetermined example of the practical skill, is further read from the storage device, and in the comparison step, the playback target section is determined based on the result of comparing the difference between the second imaging data and the example data with the difference between the third imaging data and the example data.

3. The practice support program according to claim 2, characterized in that the playback target section is a data section in which the similarity between the difference between the second imaging data and the example data and the difference between the third imaging data and the example data is above a predetermined level.

4. A practice support program according to claim 1, wherein the practice of the predetermined practical skill includes a plurality of contents, and the program further causes the computer to execute: a conversation step of presenting first text data to the user via an audio output device or a display device; and a derivation step of, after the conversation step, presenting to the user via the audio output device or the display device second text data prompting the user to input an answer to at least one of the total practice time of the plurality of contents, the practice time of each content, the name of each content, and the execution order of the plurality of contents, and, when the answer is input via an input device, deriving a practice plan for the predetermined practical skill based on the content of the answer and presenting third text data showing the derived practice plan to the user via the audio output device or the display device.

5. A practice support program according to claim 4, characterized in that, in the derivation step, when the answer is not input via the input device, the second text data is regenerated and re-presented to the user.

6. A practice support program according to claim 4, wherein, in the derivation step, fourth text data prompting the user to start practicing the predetermined practical skill according to the practice plan is presented to the user via the audio output device or the display device, and in the reception step, reception of the second imaging data is started when the practice of the predetermined practical skill according to the practice plan is started.

7. A practice support program according to claim 5, wherein, when the first imaging data includes audio data representing the voice of a predetermined person, the program further causes the computer to execute an analysis step of analyzing the characteristics of the speech of the predetermined person based on the audio data, and the first, second, third, and fourth text data are generated based on the characteristics of the speech obtained in the analysis step.

8. A practice support program according to claim 7, characterized in that the predetermined person is an instructor who provides the user with instruction in the predetermined practical skill.

9. A practice support program according to claim 8, characterized in that the first imaging data includes imaging data of the instructor and the user.

10. A practice support program according to any one of claims 1 to 9, characterized in that the predetermined practical skill is the performance of a musical instrument.

11. A practice support method characterized by comprising: a reading step of reading, from a storage device, first imaging data that is stored in advance in the storage device and obtained by imaging a user practicing a predetermined practical skill while receiving instruction in the predetermined practical skill; a receiving step of receiving, from an imaging device, second imaging data, which is imaging data of the user practicing the predetermined practical skill; a generation step of editing the first imaging data to generate third imaging data, which is imaging data of a person involved in the predetermined practical skill; a comparison step of comparing the second imaging data with the third imaging data; and an output step of outputting, to an output device, imaging data of a playback target section determined from the first imaging data based on the result of the comparison.

12. A practice support system comprising: a server device that stores in advance first imaging data obtained by imaging a user practicing a predetermined practical skill while receiving instruction in the predetermined practical skill; and a user terminal having an imaging device, wherein the user terminal includes: a data acquisition unit that acquires the first imaging data from the server device and receives second imaging data, which is imaging data of the user practicing the predetermined practical skill, from the imaging device; a generation unit that edits the first imaging data to generate third imaging data, which is imaging data of a person involved in the predetermined practical skill; a determination unit that compares the second imaging data and the third imaging data and determines a playback target section from the first imaging data based on the comparison result; and a playback unit that outputs the imaging data of the playback target section to an output device.