Information processing device, information processing method, and recording medium

The information processing device and method efficiently extract specific voices from mixed audio data by training an audio extraction model with a loss function, allowing for flexible and accurate voice extraction without needing multiple models.

WO2026140123A1 · PCT designated stage · Publication Date: 2026-07-02 · NEC CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NEC CORP
Filing Date
2024-12-25
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing technologies struggle to efficiently extract specific voices from mixed audio data, particularly in changing the range of voices to be extracted without requiring multiple machine-learned models.

Method used

An information processing device and method that utilizes an audio extraction model trained with a loss function to extract desired audio data by superimposing and separating audio data based on input parameters, allowing for flexible and accurate voice extraction.

Benefits of technology

Enables flexible and accurate extraction of desired voices from mixed audio data by optimizing the audio extraction model, enabling changes in the range of extracted voices using a single model.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure JP2024045969_02072026_PF_FP_ABST
Patent Text Reader

Abstract

This information processing device comprises: an acquisition means that acquires mixed voice data in which a plurality of voices are mixed and a parameter for designating an extracted voice to be extracted from the plurality of voices; and a voice extraction means that outputs extracted voice data obtained by extracting the voice data about the extracted voice from the mixed voice data on the basis of the mixed voice data and the parameter. According to such an information processing device, a desired voice can be extracted according to a purpose because the type of voice to be extracted can be designated by a parameter.

Description

Information processing device, information processing method, and recording medium

[0001] This disclosure relates to the technical fields of information processing devices, information processing methods, and recording media.

[0002] Techniques are known for extracting only a desired voice from audio data in which multiple voices are mixed. For example, Patent Document 1 discloses a technique that uses a machine learning model composed of a neural network to suppress noise and emphasize only the desired voice.

[0003] Patent Document 1: Japanese Patent Publication No. 2018-138936

[0004] This disclosure aims to provide an information processing device, an information processing method, and a recording medium that can solve technical problems that are difficult to solve with the technologies disclosed in prior art documents.

[0005] One aspect of the information processing device of this disclosure includes an acquisition means for acquiring mixed audio data obtained by mixing multiple voices and a parameter for specifying an extracted voice to be extracted from the multiple voices, and an audio extraction means for outputting extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameter.

[0006] Another aspect of the information processing device of this disclosure includes: an audio acquisition means for acquiring a plurality of audio data; a first superposition means for superimposing all of the plurality of audio data and outputting them as total audio data; a second superposition means for superimposing audio data from the plurality of audio data according to input parameters and outputting it as correct audio data; an audio extraction means for using an audio extraction model to extract audio according to the parameters from the total audio data and outputting it as extracted audio data; and a first learning means for learning the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.

[0007] One aspect of the information processing method of this disclosure is that at least one computer obtains mixed audio data, which is a mixture of multiple voices, and parameters that specify an extracted voice to be extracted from the multiple voices, and outputs extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

[0008] Another aspect of the information processing method of this disclosure involves at least one computer acquiring multiple audio data, superimposing all of the multiple audio data and outputting them as total audio data, superimposing the audio data from the multiple audio data corresponding to the input parameters and outputting it as ground truth audio data, using an audio extraction model to extract the audio corresponding to the parameters from the total audio data and outputting it as extracted audio data, and training the audio extraction model using a loss function calculated from the extracted audio data and the ground truth audio data.

[0009] One aspect of the recording medium of this disclosure records a computer program that causes at least one computer to execute an information processing method that includes acquiring mixed audio data obtained by mixing multiple voices and parameters that specify an extracted voice to be extracted from the multiple voices, and outputting extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

[0010] Another aspect of the recording medium of this disclosure includes recording a computer program that causes at least one computer to execute an information processing method which includes acquiring multiple audio data, superimposing all of the multiple audio data and outputting them as total audio data, superimposing audio data from the multiple audio data corresponding to input parameters and outputting them as ground truth audio data, extracting audio corresponding to the parameters from the total audio data using an audio extraction model and outputting it as extracted audio data, and learning the audio extraction model using a loss function calculated from the extracted audio data and the ground truth audio data.

[0011]
Figure 1 is a block diagram showing the hardware configuration of the first information processing device.
Figure 2 is a block diagram showing the functional configuration of the first information processing device.
Figure 3 is a flowchart showing the operation flow of the first information processing device.
Figure 4 is a conceptual diagram showing an example of an index corresponding to each of multiple voices.
Figure 5 is a block diagram (part 1) showing an example of the voice extraction operation by the second information processing device.
Figure 6 is a block diagram (part 2) showing an example of the voice extraction operation by the second information processing device.
Figure 7 is a table showing an example of category labels assigned to multiple voices.
Figure 8 is a block diagram (part 1) showing an example of the voice extraction operation by the third information processing device.
Figure 9 is a block diagram (part 2) showing an example of the voice extraction operation by the third information processing device.
Figure 10 is a block diagram showing the functional configuration of the fourth information processing device.
Figure 11 is a flowchart showing the flow of the learning operation by the fourth information processing device.
Figure 12 is a block diagram showing the functional configuration of the fifth information processing device.
Figure 13 is a flowchart showing the flow of the learning operation by the fifth information processing device.
Figure 14 is a block diagram showing the functional configuration of the sixth information processing device.
Figure 15 is a flowchart showing the flow of the learning operation by the sixth information processing device.

[0012] The following describes embodiments of the information processing device, information processing method, and recording medium with reference to the drawings.

[0013] <First Embodiment> The first information processing device will be described with reference to Figures 1 to 3.

[0014] (Hardware Configuration) First, the hardware configuration of the first information processing device will be described with reference to Figure 1. Figure 1 is a block diagram showing the hardware configuration of the first information processing device.

[0015] As shown in Figure 1, the first information processing device 10 includes a processor 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, a storage device 14, an input device 15, and an output device 16. The processor 11, RAM 12, ROM 13, storage device 14, input device 15, and output device 16 are all connected via a data bus 17. Note that an interface other than a data bus (for example, LAN or USB) may be used in place of the data bus 17.

[0016] The processor 11 reads a computer program. For example, the processor 11 is configured to read a computer program stored in at least one of the RAM 12, ROM 13, and storage device 14. Alternatively, the processor 11 may read a computer program stored in a computer-readable storage medium using a storage medium reading device (not shown). The processor 11 may also obtain (i.e., read) a computer program from a device (not shown) located outside the first information processing device 10 via a network interface. The processor 11 performs various processes by executing the read computer program. When the processor 11 executes the read computer program, a functional block related to the processing performed by the first information processing device 10 is realized within the processor 11. That is, the processor 11 may function as a controller that performs various controls in the first information processing device 10.

[0017] The processor 11 may be configured as, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an FPGA (field-programmable gate array), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or a quantum processor. The processor 11 may consist of one of these, or it may be configured to use multiple of them in parallel.

[0018] RAM 12 temporarily stores computer programs executed by processor 11. RAM 12 also temporarily stores data that processor 11 uses while executing computer programs. RAM 12 may be, for example, DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory). Alternatively, other types of volatile memory may be used instead of RAM 12.

[0019] ROM 13 stores computer programs executed by processor 11. ROM 13 may also store other static data. ROM 13 may be, for example, PROM (Programmable Read Only Memory) or EPROM (Erasable Programmable Read Only Memory). Alternatively, other types of non-volatile memory may be used instead of ROM 13.

[0020] The storage device 14 stores data that the first information processing device 10 stores long-term. The storage device 14 may also operate as a temporary storage device for the processor 11. The storage device 14 may store computer programs executed by the processor 11. The storage device 14 may include, for example, at least one of a hard disk drive, a magneto-optical disk drive, an SSD (Solid State Drive), and a disk array device.

[0021] The input device 15 is a device that receives input instructions from the user of the first information processing device 10. The input device 15 may include, for example, at least one of a keyboard, mouse, touch panel, and stylus. The input device 15 may also be a device capable of voice input, for example, including a microphone.

[0022] The output device 16 is a device that outputs information related to the first information processing device 10 to the outside. For example, the output device 16 may be a display device (e.g., a display or monitor) capable of displaying information related to the first information processing device 10. Alternatively, the output device 16 may be a speaker or the like capable of outputting audio information related to the first information processing device 10.

[0023] The first information processing device 10 may be configured to include only some of the components described in Figure 1. For example, the first information processing device 10 may be configured to include only the processor 11, RAM 12, and ROM 13 from the components described above. In this case, the storage device 14, input device 15, and output device 16 may each be provided as external devices to the first information processing device 10. Furthermore, some of the arithmetic functions of the first information processing device 10 may be implemented by an external server or cloud.

[0024] (Functional Configuration) Next, the functional configuration of the first information processing device 10 will be described with reference to Figure 2. Figure 2 is a block diagram showing the functional configuration of the first information processing device.

[0025] In Figure 2, the first information processing device 10 is configured as a device that extracts a desired voice from voice data in which multiple voices are mixed (hereinafter referred to as "mixed voice data" as appropriate). The first information processing device 10 is configured to include an acquisition unit 110 and a voice extraction unit 120 as components for realizing its function. Note that each of the acquisition unit 110 and the voice extraction unit 120 may be a processing block realized by the processor 11 (see Figure 1) described above.

[0026] The acquisition unit 110 is configured to acquire mixed audio data, which is a mixture of multiple voices. The mixed audio data may be, for example, audio data in which the voices of different people are mixed. The mixed audio data may also contain non-voice noise. The acquisition unit 110 may be configured to acquire mixed audio data in real time from, for example, a microphone, or it may be configured to acquire mixed audio data stored in storage (i.e., pre-recorded mixed audio data). The mixed audio data may be acquired with a single microphone. In this case, the mixed audio data can be acquired even with a simple configuration such as a microphone installed in a smartphone, thus expanding the range of applications of the device. Alternatively, the mixed audio data may be acquired with multiple microphones. In this case, the amount of information in the mixed audio data increases compared to when it is acquired with a single microphone, thus improving the accuracy of voice extraction.

[0027] The acquisition unit 110 is further configured to acquire parameters that specify the voice to be extracted from among the multiple voices (hereinafter referred to as "extracted voice" as appropriate). The parameters may be input by the user, for example. The parameters may be thresholds corresponding to voice indices, or category labels that specify the category of the voice. Specific examples of these will be explained in detail in other embodiments described later. Here, an example is given in which one acquisition unit 110 acquires both the mixed audio data and the parameters, but an audio acquisition unit that acquires mixed audio data and a parameter acquisition unit that acquires parameters may be provided separately. The mixed audio data and parameters acquired by the acquisition unit 110 are output to the audio extraction unit 120.

[0028] The audio extraction unit 120 is configured to extract audio data of extracted audio from the mixed audio data based on the mixed audio data and parameters acquired by the acquisition unit 110, and to output the extracted audio data. Specifically, the audio extraction unit 120 extracts the extracted audio specified by the parameters from among multiple audios included in the mixed audio data. The audio extraction unit 120 then outputs the audio data of the extracted audio extracted from the mixed audio data as extracted audio data. The extracted audio may include multiple audios. For example, if the mixed audio data is a mixture of three audios, the extracted audio may be two of the three audios. In this case, the extracted audio data is output as audio data in which multiple extracted audios are superimposed.

[0029] The voice extraction unit 120 may extract the voice data of the extracted voice using a machine learning model. For example, the voice extraction unit 120 may use a voice extraction model that takes mixed voice data and parameters as input and outputs voice data of the extracted voice according to the parameters. The voice extraction model may be a voice enhancement model that emphasizes the voice according to the parameters. The voice enhancement model may have a function to suppress noise. Alternatively, the voice extraction unit 120 may use a voice separation model that separates the mixed voice data into multiple voices. In this case, the voice extraction unit 120 can output the extracted voice data by superimposing the voices corresponding to the extracted voice according to the parameters from among the multiple voices separated by the voice separation model. Each model used by the voice extraction unit 120 will be described in detail in other embodiments described later.
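The separation-based path described in paragraph [0029] (separate the mixture, then superimpose only the voices that match the parameters) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `extract_via_separation`, `superimpose`, the `keep` predicate, and the stand-in `fake_separate` function, which simply returns known sources in place of a trained voice separation model. The disclosure does not specify any of these.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def superimpose(signals: Sequence[Signal], length: int) -> Signal:
    """Sample-wise sum of equal-length signals; silence if none are given."""
    out = [0.0] * length
    for sig in signals:
        for i, x in enumerate(sig):
            out[i] += x
    return out


def extract_via_separation(
    mixed: Signal,
    separate: Callable[[Signal], List[Signal]],
    keep: Callable[[Signal], bool],
) -> Signal:
    """Separate the mixture, then re-mix only the voices accepted by `keep`."""
    separated = separate(mixed)
    chosen = [s for s in separated if keep(s)]
    return superimpose(chosen, len(mixed))


# Hypothetical stand-in for a trained separation model: it ignores its input
# and returns two known sources whose sum equals the mixture below.
voice_a = [1.0, 0.0, 1.0, 0.0]
voice_b = [0.0, 2.0, 0.0, 2.0]
mixed = [a + b for a, b in zip(voice_a, voice_b)]


def fake_separate(_mixed: Signal) -> List[Signal]:
    return [voice_a, voice_b]


# Keep only voices whose peak amplitude is at least 2.0 (an arbitrary rule
# standing in for "voices corresponding to the parameters").
extracted = extract_via_separation(mixed, fake_separate, lambda s: max(s) >= 2.0)
print(extracted)
```

In this sketch the parameter-dependent choice is isolated in `keep`, so changing the extracted range does not require a different separation model.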

[0030] (Operation Flow) Next, the operation flow of the first information processing device 10 will be explained with reference to Figure 3. Figure 3 is a flowchart showing the operation flow of the first information processing device.

[0031] As shown in Figure 3, when the operation of the first information processing device 10 is started, the acquisition unit 110 first acquires mixed audio data in which multiple voices are mixed (step S101). The acquisition unit 110 also acquires parameters for specifying the extracted voice (step S102).

[0032] Subsequently, the voice extraction unit 120 extracts voice data of the extracted voice corresponding to the parameters from the mixed voice data (step S103). Then, the voice extraction unit 120 outputs the extracted voice data, which is the voice data of the extracted voice (step S104).

[0033] (Technical Effect) Next, the technical effect obtained by the first information processing apparatus 10 will be described.

[0034] As described with reference to FIGS. 1 to 3, in the first information processing apparatus 10, voice data of the extracted voice corresponding to the input parameters is extracted. In this way, a desired voice can be extracted from the mixed voice data in which a plurality of voices are mixed. Also, if the input parameters are changed, the extracted voice can be changed. For example, only the voice of the speaker can be extracted from the voice data recorded in a meeting, or only the voices of the speaker and some of the participants can be extracted.

[0035] In the existing voice extraction technology, it is difficult to change the range of the voice to be extracted. For example, when extracting a specific voice using a machine-learned model, in order to change the range of the voice, it is required to use another model (that is, a separately learned model). However, according to the information processing apparatus 10 according to the present embodiment, voices in various ranges can be extracted using one model.

[0036] <Second Embodiment> The second information processing apparatus 10 will be described with reference to FIGS. 4 to 6. The second information processing apparatus 10 may have some operations different from those of the first information processing apparatus 10 described above, and the other parts may be the same as those of the first information processing apparatus 10. Therefore, hereinafter, the parts different from the first embodiment will be described in detail, and the description of other overlapping parts will be omitted as appropriate.

[0037] (Index of Voice) First, the index of the voice processed by the second information processing apparatus 10 will be described while referring to FIG. 4. FIG. 4 is a conceptual diagram showing an example of the index corresponding to each of a plurality of voices.

[0038] As shown in FIG. 4, in the second information processing apparatus 10, an index is set for each of a plurality of voices. In the example of FIG. 4, the voice quality (MOS: Mean Opinion Score) is adopted as the index, but the index according to the present embodiment is not limited to this. For example, the index may be obtained in consideration of, in addition to the voice quality, the distance between the object emitting the voice and the microphone, the volume, and the like.

[0039] In the example shown in FIG. 4, the recording data of the meeting is acquired as mixed voice data. For the voice of the speaker, the index is "4", for the voices of meeting participants other than the speaker (for example, conversations with people next to them), the index is "3", and for the voices of other passing people, the index is "2". Hereinafter, the operation for extracting a desired voice from such mixed voice data will be described.

[0040] (Example of operation) Next, an example of the operation by the second information processing apparatus 10 will be described with reference to FIGS. 5 and 6. FIG. 5 is a block diagram (part 1) showing an example of the voice extraction operation by the second information processing apparatus. FIG. 6 is a block diagram (part 2) showing an example of the voice extraction operation by the second information processing apparatus.

[0041] In the example shown in FIG. 5, the mixed voice data shown in FIG. 4 is input to the second information processing apparatus 10. Further, a threshold value corresponding to the index is input to the second information processing apparatus 10 as a parameter. The second information processing apparatus 10 extracts, as the extracted voice, a voice whose index is equal to or higher than the threshold value among the plurality of voices included in the mixed voice data. Here, the threshold value is Λ = 4. Therefore, the extracted voice data output by the second information processing apparatus 10 includes the voice of the speaker whose index is "4".

[0042] On the other hand, in the example shown in FIG. 6, the threshold value Λ = 3 is input. In this case, the second information processing apparatus 10 extracts, as the extracted voice, a voice whose index is 3 or higher. As a result, the extracted voice data output by the second information processing apparatus 10 includes the voice of the speaker whose index is "4" and the voices of meeting participants whose index is "3".
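The selection rule in the two examples above (keep every voice whose index is equal to or higher than Λ) can be written out as a small Python sketch. The index values are those of the FIG. 4 example; the `Voice` class, the label strings, and the function name are hypothetical. Note that this only illustrates which voices the threshold designates; in the device itself the extraction is performed on the mixed voice data, not on pre-labelled voices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Voice:
    label: str    # who is speaking (illustrative name)
    index: float  # per-voice index, e.g. a MOS-like quality score


# Index values taken from the FIG. 4 example.
VOICES = [
    Voice("speaker", 4),
    Voice("other participants", 3),
    Voice("passers-by", 2),
]


def voices_at_or_above(voices: List[Voice], threshold: float) -> List[str]:
    """Return the labels of voices whose index is >= threshold (Λ)."""
    return [v.label for v in voices if v.index >= threshold]


print(voices_at_or_above(VOICES, 4))  # ['speaker']
print(voices_at_or_above(VOICES, 3))  # ['speaker', 'other participants']
```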

[0043] (Technical Effects) Next, the technical effects obtained by the second information processing device 10 will be explained.

[0044] As explained in Figures 4 to 6, the second information processing device 10 extracts voices whose index is equal to or higher than a threshold. In this way, by changing the value of the threshold, it is possible to easily and accurately specify the voice to be extracted.

[0045] <Third Embodiment> The third information processing device 10 will be described with reference to Figures 7 to 9. Note that the third information processing device 10 differs in some operations from the first and second information processing devices 10 described above, but other parts may be the same as those of the first and second information processing devices 10. For this reason, the parts that differ from the embodiments already described will be explained in detail below, and other overlapping parts will be omitted as appropriate.

[0046] (Category Labels) First, the category labels handled by the third information processing device 10 will be described with reference to Figure 7. Figure 7 is a table showing an example of category labels assigned to multiple voices.

[0047] As shown in Figure 7, the third information processing device 10 assigns a category label to each of the multiple voices. The category label indicates the category of the voice and is not a continuous value like the index described in the second embodiment. Examples of category labels include gender, emotion, age, and diseases that can be identified from the voice data. More specifically, the category label may indicate whether the person speaking is male or female. The category label may indicate whether the person speaking is happy, angry, or sad. The category label may indicate whether the person speaking is 20 years old or younger, in their 20s, 30s, 40s, 50s, or 60 years old or older. The category label may indicate the state of the person speaking, such as having a sore throat or dementia. Note that these category labels are merely examples, and the category labels applicable to this embodiment are not limited to the above examples.

[0048] Furthermore, the speech extraction unit 120 may determine the categories of multiple voices included in the mixed speech data. Alternatively, the categories of multiple voices included in the mixed speech data may be assigned in advance before being input to the speech extraction unit 120. In this case, a labeling model that assigns a label to each voice in the mixed speech data may be used. The labeling model may be learned separately from the various models used by the speech extraction unit 120.

[0049] (Example of Operation) Next, an example of operation by the third information processing device 10 will be described with reference to Figures 8 and 9. Figure 8 is a block diagram (part 1) showing an example of speech extraction operation by the third information processing device. Figure 9 is a block diagram (part 2) showing an example of speech extraction operation by the third information processing device.

[0050] In the example shown in Figure 8, mixed speech data and a category label are input to the third information processing device 10. The third information processing device 10 extracts the speech corresponding to the input category label from among the multiple speeches contained in the mixed speech data as extracted speech. The category label input here is "male". Therefore, the extracted speech data output by the third information processing device 10 will include speech uttered by a male (in other words, it will not include speech uttered by a female).

[0051] On the other hand, in the example shown in Figure 9, three category labels are input: "female," "happy," and "in her 20s." In this case, the third information processing device 10 extracts audio corresponding to all three input category labels as extracted audio. As a result, the extracted audio data output by the third information processing device 10 will include audio spoken by a "happy woman in her 20s."
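When several category labels are input, the examples above keep only voices that match all of them. A minimal Python sketch of that matching rule follows. The voice names, the label strings, and the function name are hypothetical, and the sketch assumes each voice already carries its labels (for instance from a labeling model as mentioned in paragraph [0048]); it does not show how the device extracts the audio itself.

```python
from typing import Dict, List, Set

# Hypothetical per-voice category labels.
VOICE_LABELS: Dict[str, Set[str]] = {
    "voice_a": {"female", "happy", "20s"},
    "voice_b": {"female", "sad", "20s"},
    "voice_c": {"male", "happy", "30s"},
}


def select_by_labels(voice_labels: Dict[str, Set[str]], requested: Set[str]) -> List[str]:
    """Return the voices whose label set contains every requested label."""
    return sorted(
        name for name, labels in voice_labels.items() if requested <= labels
    )


print(select_by_labels(VOICE_LABELS, {"male"}))                    # ['voice_c']
print(select_by_labels(VOICE_LABELS, {"female", "happy", "20s"}))  # ['voice_a']
```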

[0052] (Technical Effects) Next, the technical effects obtained by the third information processing device 10 will be explained.

[0053] As explained in Figures 7 to 9, the third information processing device 10 extracts audio corresponding to the category label. In this way, by inputting the desired category label, it is possible to easily and accurately specify the audio to be extracted. For example, if the category label relates to at least one of gender, emotion, age, and a disease that can be identified from the audio data, it is possible to narrow down the category of the person who spoke and extract the audio data.

[0054] <Fourth Embodiment> The fourth information processing device 10 will be described with reference to Figures 10 and 11. The fourth information processing device 10 differs in some configurations and operations from the first to third information processing devices 10 described above, but other parts may be the same as those of the first to third information processing devices 10. For this reason, the parts that differ from the embodiments already described will be explained in detail below, and other overlapping parts will be omitted as appropriate.

[0055] (Functional Configuration) First, the functional configuration of the fourth information processing device 10 will be explained with reference to Figure 10. Figure 10 is a block diagram showing the functional configuration of the fourth information processing device. Note that the fourth information processing device 10 shown in Figure 10 includes components used for machine learning. For this reason, some of the components shown in Figure 10 (i.e., components used for machine learning) may be omitted when operating the device.

[0056] As shown in Figure 10, the fourth information processing device 10 is configured to include, as components for realizing its function, a voice acquisition unit 210, a first learning data generation unit 221, a first voice extraction unit 231, and a first learning unit 241. Note that each of the voice acquisition unit 210, the first learning data generation unit 221, the first voice extraction unit 231, and the first learning unit 241 may be a processing block realized by the processor 11 (see Figure 1) described above.

[0057] The voice acquisition unit 210 is configured to acquire multiple audio data. The multiple audio data S_λ1, S_λ2, ..., S_λn are acquired as separate audio data. The multiple audio data acquired by the voice acquisition unit 210 are output to both the first superposition unit 301 and the second superposition unit 302.

[0058] The first learning data generation unit 221 is configured to generate learning data for training the first speech extraction model 310 used by the first speech extraction unit 231. The first learning data generation unit 221 receives, as input, the plurality of speech data acquired by the speech acquisition unit 210 and a threshold Λ serving as the parameter. The first learning data generation unit 221 includes a first superposition unit 301 and a second superposition unit 302.

[0059] The first superposition unit 301 superimposes all of the multiple audio data acquired by the audio acquisition unit 210 to generate total audio data. The total audio data generated by the first superposition unit 301 is output to the first audio extraction unit 231. The second superposition unit 302 superimposes, from among the audio data acquired by the audio acquisition unit 210, the audio data of the voices whose index is greater than or equal to the threshold Λ to generate correct audio data. The correct audio data generated by the second superposition unit 302 is output to the first learning unit 241.
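The two superposition steps can be sketched in Python as follows. This is a minimal illustration, assuming each source is an equal-length list of samples with a known index; the function name `make_training_pair` and the sample values are hypothetical, while the index values 4, 3, and 2 follow the FIG. 4 example.

```python
from typing import List, Sequence, Tuple

Signal = List[float]


def make_training_pair(
    sources: Sequence[Signal], indices: Sequence[float], threshold: float
) -> Tuple[Signal, Signal]:
    """Build (total audio data, correct audio data) for one threshold Λ."""
    length = len(sources[0])
    # First superposition: sum of every source.
    total = [sum(s[i] for s in sources) for i in range(length)]
    # Second superposition: sum of the sources whose index is >= Λ.
    kept = [s for s, idx in zip(sources, indices) if idx >= threshold]
    correct = [sum(s[i] for s in kept) for i in range(length)]
    return total, correct


sources = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # illustrative samples
indices = [4, 3, 2]                             # as in the FIG. 4 example

total, correct = make_training_pair(sources, indices, threshold=3)
print(total)    # [1.5, 2.5]
print(correct)  # [1.0, 2.0]
```

Because the correct audio data is derived from the same sources as the total audio data, varying Λ over the same recordings yields differently targeted training pairs for one model.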

[0060] The first voice extraction unit 231 is configured to extract extracted voices according to a threshold Λ from the total voice data generated by the first superposition unit 301. The first voice extraction unit 231 extracts the voice data of the extracted voices using the first voice extraction model 310. The first voice extraction model 310 is a model that takes the total voice data and the threshold Λ as input and outputs extracted voice data. The first voice extraction model 310 is trained by the first learning unit 241.

[0061] The first learning unit 241 is configured to train the first speech extraction model 310. Specifically, the first learning unit 241 calculates a loss function from the extracted speech data extracted by the first speech extraction unit 231 and the correct speech data generated by the second superposition unit 302. Subsequently, the first learning unit 241 updates the parameters of the first speech extraction model 310 so that the calculated loss function becomes smaller.

[0062] (Learning Operation) Next, with reference to Figure 11, the flow of the learning operation by the fourth information processing device 10 (i.e., the operation when machine learning the first speech extraction model 310) will be explained. Figure 11 is a flowchart showing the flow of the learning operation by the fourth information processing device.

[0063] As shown in Figure 11, when the learning operation by the fourth information processing device 10 is started, the voice acquisition unit 210 acquires multiple voice data (step S201). Then, the first superposition unit 301 superimposes all of the multiple voice data acquired by the voice acquisition unit 210 to generate total voice data (step S202). On the other hand, the second superposition unit 302 acquires a threshold Λ (step S203). Then, the second superposition unit 302 superimposes the voice data of the voices whose index is greater than or equal to the threshold Λ among the multiple voice data acquired by the voice acquisition unit 210 to generate correct voice data (step S204).

[0064] Next, the first voice extraction unit 231 acquires the model structure of the first voice extraction model 310 (step S205). After that, the first voice extraction unit 231 acquires the initial parameters of the first voice extraction model 310 (step S206).

[0065] Next, the first audio extraction unit 231 uses the first audio extraction model 310 to extract audio data of the extracted audio from the total audio data and outputs the extracted audio data (step S207). That is, the first audio extraction unit 231 extracts, from among the multiple audios included in the total audio data, the audio whose index is equal to or greater than the threshold Λ, and outputs it as extracted audio data.

[0066] Next, the first learning unit 241 calculates a loss function from the extracted audio data extracted by the first audio extraction unit 231 and the correct audio data generated by the second superposition unit 302 (step S208). After that, the first learning unit 241 updates the parameters of the first audio extraction model 310 so that the calculated loss function becomes smaller (step S209).

[0067] Next, the first learning unit 241 determines whether or not to terminate the update of the first speech extraction model 310 (step S210). The first learning unit 241 only needs to determine, for example, whether or not a predetermined termination condition has been met. If it is determined that the update of the first speech extraction model 310 should not be terminated (step S210: NO), the process from step S207 is repeatedly executed. On the other hand, if it is determined that the update of the first speech extraction model 310 should be terminated (step S210: YES), the series of learning operations is terminated.
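The loop formed by steps S206 to S210 can be sketched as follows. This is only an illustration under strong assumptions: the extraction model is replaced by a hypothetical one-parameter stand-in (a single gain applied to the all voice data), the loss is a mean squared error, the update is plain gradient descent, and the termination condition is "loss below a tolerance or a maximum number of steps". The conditioning of the model on the threshold Λ is omitted for brevity. None of these choices is specified by the embodiment.

```python
from typing import Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    if len(a) != len(b) or not a:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def train_gain_model(
    all_voice: Sequence[float],
    correct: Sequence[float],
    lr: float = 0.1,
    max_steps: int = 1000,
    tol: float = 1e-10,
) -> Tuple[float, float]:
    """Fit a single gain so that gain * all_voice approximates correct."""
    gain = 0.0  # Step S206: initial parameter (hypothetical stand-in model).
    loss = float("inf")
    for _ in range(max_steps):
        # Step S207: run the stand-in model on the all voice data.
        extracted = [gain * x for x in all_voice]
        # Step S208: loss between extracted and correct voice data.
        loss = mse(extracted, correct)
        # Step S209: gradient of the MSE with respect to the gain, then update.
        grad = sum(
            2.0 * (e - c) * x for e, c, x in zip(extracted, correct, all_voice)
        ) / len(all_voice)
        gain -= lr * grad
        # Step S210: predetermined termination condition (assumed: small loss).
        if loss < tol:
            break
    return gain, loss


all_voice = [1.0, -1.0, 2.0, -2.0]
correct = [0.5 * x for x in all_voice]  # toy target: half of the mixture
gain, loss = train_gain_model(all_voice, correct)
```

In this toy case the gain approaches 0.5 and the loop stops once the loss falls below the tolerance. A real extraction model would have many parameters and would also receive Λ as an input.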

[0068] (Technical Effects) Next, the technical effects obtained by the fourth information processing device 10 will be explained.

[0069] As explained with reference to Figures 10 and 11, the fourth information processing device 10 trains the first speech extraction model 310 using training data generated from multiple speech data. In this way, it is possible to optimize the parameters of the first speech extraction model 310 and improve the accuracy of speech extraction.

[0070] <Fifth Embodiment> The fifth information processing device 10 will be described with reference to Figures 12 and 13. The fifth information processing device 10 differs from the fourth information processing device 10 described above in some configurations and operations, but other parts may be the same as those of the first to fourth information processing devices 10. For this reason, the parts that differ from each embodiment already described will be explained in detail below, and other overlapping parts will be omitted as appropriate.

[0071] (Functional Configuration) First, the functional configuration of the fifth information processing device 10 will be described with reference to Figure 12. Figure 12 is a block diagram showing the functional configuration of the fifth information processing device. In Figure 12, the same reference numerals are used for the same elements as those described in Figure 10. Furthermore, the fifth information processing device 10 shown in Figure 12 includes components used for machine learning. For this reason, some of the components shown in Figure 12 (i.e., components used for machine learning) may be omitted when operating the device.

[0072] As shown in Figure 12, the fifth information processing device 10 is configured to include a voice acquisition unit 210, a second learning data generation unit 222, a second voice extraction unit 232, and a second learning unit 242 as components for realizing its function. Note that each of the voice acquisition unit 210, the second learning data generation unit 222, the second voice extraction unit 232, and the second learning unit 242 may be a processing block realized by the processor 11 (see Figure 1) described above.

[0073] The second learning data generation unit 222 is configured to generate learning data for training the second speech extraction model 320 used by the second speech extraction unit 232. Multiple speech data acquired by the speech acquisition unit 210 are input to the second learning data generation unit 222. The second learning data generation unit 222 includes a first superposition unit 301 and a second superposition unit 302. As already described in the fourth embodiment, the first superposition unit 301 generates all speech data by superimposing all of the multiple speech data acquired by the speech acquisition unit 210. The all speech data generated by the first superposition unit 301 is output to the second speech extraction unit 232. The second superposition unit 302 also generates correct speech data by superimposing the speech data of speeches whose index is greater than or equal to the threshold Λ among the multiple speech data acquired by the speech acquisition unit 210, as already described in the fourth embodiment. The correct speech data generated by the second superposition unit 302 is output to the second learning unit 242.

[0074] The second audio extraction unit 232 is configured to extract extracted audio according to a threshold Λ from all audio data generated by the first superposition unit 301. Specifically, the second audio extraction unit 232 uses the second audio extraction model 320 to extract audio data of the extracted audio. The second audio extraction unit 232 also uses the second audio extraction model 320 to separate all audio data generated by the first superposition unit 301 and generate a plurality of separated audio data S_est1, S_est2, ..., S_estn. The second audio extraction model 320 is a model that takes all audio data and a threshold Λ as inputs and outputs extracted audio data and a plurality of separated audio data. Each of the extracted audio data and the plurality of separated audio data is output to the second learning unit 242. Note that the second audio extraction model 320 is trained by the second learning unit 242.

[0075] The second learning unit 242 is configured to train the second voice extraction model 320. Specifically, the second learning unit 242 calculates a loss function from the extracted voice data generated by the second voice extraction model 320 and the correct voice data generated by the second superposition unit 302. The second learning unit 242 also calculates a loss function from the plurality of separated voice data S_est1, S_est2, ..., S_estn generated by the second voice extraction model 320 and the plurality of voice data S_λ1, S_λ2, ..., S_λn acquired by the voice acquisition unit 210. Then, the second learning unit 242 updates the parameters of the second voice extraction model 320 so that each of the two calculated loss functions becomes smaller.

[0076] (Learning Operation) Next, with reference to Figure 13, the flow of the learning operation by the fifth information processing device 10 (i.e., the operation when machine learning the second voice extraction model 320) will be explained. Figure 13 is a flowchart showing the flow of the learning operation by the fifth information processing device. In Figure 13, the same reference numerals are used for the same processes as those shown in Figure 11.

[0077] As shown in Figure 13, when the learning operation by the fifth information processing device 10 is started, the voice acquisition unit 210 acquires multiple voice data (step S201). Then, the first superposition unit 301 superimposes all of the multiple voice data acquired by the voice acquisition unit 210 to generate all voice data (step S202). Meanwhile, the second superposition unit 302 acquires a threshold Λ (step S203). Then, the second superposition unit 302 superimposes, from among the multiple voice data acquired by the voice acquisition unit 210, the voice data of the voices whose index is greater than or equal to the threshold Λ to generate correct voice data (step S204).

[0078] Next, the second voice extraction unit 232 acquires the model structure of the second voice extraction model 320 (step S301). After that, the second voice extraction unit 232 acquires the initial parameters of the second voice extraction model 320 (step S302).

[0079] Next, the second audio extraction unit 232 uses the second audio extraction model 320 to extract the extracted audio data from the total audio data, and separates the total audio data to generate multiple separated audio data (step S303). Then, the second audio extraction unit 232 outputs each of the extracted audio data and the multiple separated audio data generated by the second audio extraction model 320 to the second learning unit 242.

[0080] Next, the second learning unit 242 calculates a loss function from the extracted audio data generated by the second audio extraction model 320 and the correct audio data generated by the second superposition unit 302. The second learning unit 242 also calculates a loss function from the multiple separated audio data S_est1, S_est2, ..., S_estn generated by the second audio extraction model 320 and the multiple audio data S_λ1, S_λ2, ..., S_λn acquired by the audio acquisition unit 210 (step S304). After that, the second learning unit 242 updates the parameters of the second audio extraction model 320 so that each of the two calculated loss functions becomes smaller (step S305).
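The two loss functions of step S304 can be illustrated with a short Python sketch. It assumes a mean squared error for both losses, an estimate-to-source pairing given by position (the i-th separated signal is compared with the i-th acquired signal), and a weighted sum as one possible way to decrease both losses in a single update. The function names and the `weight` parameter are hypothetical. The embodiment does not state the loss type or how the two losses are combined.

```python
from typing import Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    if len(a) != len(b) or not a:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fifth_embodiment_losses(
    extracted: Sequence[float],
    correct: Sequence[float],
    separated: Sequence[Sequence[float]],
    sources: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (extraction loss, separation loss) as in step S304."""
    if len(separated) != len(sources) or not sources:
        raise ValueError("need one separated signal per source")
    # Loss 1: extracted audio data vs. correct audio data.
    extraction_loss = mse(extracted, correct)
    # Loss 2: average of per-source errors between estimates and sources.
    separation_loss = sum(
        mse(est, src) for est, src in zip(separated, sources)
    ) / len(sources)
    return extraction_loss, separation_loss


def combined_loss(extraction_loss: float, separation_loss: float,
                  weight: float = 1.0) -> float:
    """Weighted sum used as a single training objective (an assumption)."""
    return extraction_loss + weight * separation_loss


ext_loss, sep_loss = fifth_embodiment_losses(
    extracted=[1.0, 1.0],
    correct=[1.0, 3.0],
    separated=[[0.0, 0.0], [1.0, 1.0]],
    sources=[[0.0, 2.0], [1.0, 1.0]],
)
print(ext_loss, sep_loss, combined_loss(ext_loss, sep_loss))  # 2.0 1.0 3.0
```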

[0081] Next, the second learning unit 242 determines whether or not to terminate the update of the second speech extraction model 320 (step S306). The second learning unit 242 can determine, for example, whether or not a predetermined termination condition has been met. If it is determined that the update of the second speech extraction model 320 should not be terminated (step S306: NO), the process from step S303 is repeatedly executed. On the other hand, if it is determined that the update of the second speech extraction model 320 should be terminated (step S306: YES), the series of learning operations will be terminated.

[0082] (Technical Effects) Next, the technical effects obtained by the fifth information processing device 10 will be explained.

[0083] As explained in Figures 12 and 13, in the fifth information processing device 10, the second speech extraction model 320 is trained using multiple separated speech data in addition to the extracted speech data. In this way, it is possible to optimize the parameters of the second speech extraction model 320 and improve the accuracy of speech extraction and speech separation.

[0084] <Sixth Embodiment> The sixth information processing device 10 will be described with reference to Figures 14 and 15. The sixth information processing device 10 differs from the fourth and fifth information processing devices 10 described above in some configurations and operations, while other parts may be the same as those of the first to fifth information processing devices 10. Therefore, the following will explain in detail the parts that differ from the embodiments already described, and will omit explanations of other overlapping parts as appropriate.

[0085] (Functional Configuration) First, the functional configuration of the sixth information processing device 10 will be described with reference to Figure 14. Figure 14 is a block diagram showing the functional configuration of the sixth information processing device. In Figure 14, the same reference numerals are used for the same elements as those described in Figure 10. Furthermore, the sixth information processing device 10 shown in Figure 14 includes elements used for machine learning. For this reason, some of the elements shown in Figure 14 (i.e., elements used for machine learning) may be omitted when operating the device.

[0086] As shown in Figure 14, the sixth information processing device 10 is configured to include a voice acquisition unit 210, a third learning data generation unit 223, a third voice extraction unit 233, and a third learning unit 243 as components for realizing its function. Note that each of the voice acquisition unit 210, the third learning data generation unit 223, the third voice extraction unit 233, and the third learning unit 243 may be a processing block realized by the processor 11 (see Figure 1) described above.

[0087] The third learning data generation unit 223 is configured to generate learning data for training the third speech extraction model 330 used by the third speech extraction unit 233. The third learning data generation unit 223 receives, as inputs, the multiple speech data acquired by the speech acquisition unit 210, a threshold Λ, and noise N (for example, sounds other than human voices). The third learning data generation unit 223 includes a first superposition unit 301, a second superposition unit 302, and a third superposition unit 303.

[0088] As already described in the fourth and fifth embodiments, the first superposition unit 301 generates all audio data by superimposing all of the multiple audio data acquired by the audio acquisition unit 210. The all audio data generated by the first superposition unit 301 is output to the third learning unit 243. The second superposition unit 302 generates correct audio data by superimposing audio data of audio from among the multiple audio data acquired by the audio acquisition unit 210 whose index is greater than or equal to the threshold Λ. The correct audio data generated by the second superposition unit 302 is output to the third learning unit 243. The third superposition unit 303 generates noisy all audio data by superimposing all of the multiple audio data acquired by the audio acquisition unit 210 and noise N. The noisy all audio data generated by the third superposition unit 303 is output to the third audio extraction unit 233.
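The three superposition outputs described above, and where each one is sent, can be sketched in Python as follows. The sketch assumes equal-length sample lists, sample-wise addition as the superposition, and one scalar index per voice. The `SixthTrainingData` container and the function names are hypothetical and serve only to show the data flow.

```python
from typing import List, NamedTuple, Sequence


class SixthTrainingData(NamedTuple):
    all_voice: List[float]        # from unit 301, sent to the learning unit 243
    correct: List[float]          # from unit 302, sent to the learning unit 243
    noisy_all_voice: List[float]  # from unit 303, sent to the extraction unit 233


def superimpose(signals: Sequence[Sequence[float]], length: int) -> List[float]:
    """Sample-wise sum; an empty selection yields silence (an assumption)."""
    if any(len(s) != length for s in signals):
        raise ValueError("all signals must have the same length")
    if not signals:
        return [0.0] * length
    return [sum(samples) for samples in zip(*signals)]


def make_sixth_training_data(
    voices: Sequence[Sequence[float]],
    indices: Sequence[float],
    threshold: float,
    noise: Sequence[float],
) -> SixthTrainingData:
    length = len(noise)
    # First superposition unit 301: all voices, no noise.
    all_voice = superimpose(voices, length)
    # Second superposition unit 302: only voices whose index >= threshold.
    selected = [v for v, idx in zip(voices, indices) if idx >= threshold]
    correct = superimpose(selected, length)
    # Third superposition unit 303: all voices plus noise N.
    noisy_all_voice = superimpose(list(voices) + [noise], length)
    return SixthTrainingData(all_voice, correct, noisy_all_voice)


data = make_sixth_training_data(
    voices=[[1.0, 2.0], [0.5, 0.5]],
    indices=[0.9, 0.1],
    threshold=0.5,
    noise=[0.25, -0.25],
)
print(data.all_voice, data.correct, data.noisy_all_voice)
# [1.5, 2.5] [1.0, 2.0] [1.75, 2.25]
```

Only the noisy mixture is fed to the model, while the clean mixture and the correct voice data serve as training targets.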

[0089] The third voice extraction unit 233 is configured to extract extracted voice data according to a threshold Λ from all noisy voice data generated by the third superposition unit 303. Specifically, the third voice extraction unit 233 uses the third voice extraction model 330 to extract the voice data of the extracted voice from all noisy voice data. The third voice extraction unit 233 also uses the third voice extraction model 330 to suppress the noise in all noisy voice data and outputs it as noise-suppressed all voice data. That is, the third voice extraction model 330 is a model that takes all noisy voice data and a threshold Λ as input and outputs extracted voice data and noise-suppressed all voice data, respectively. Each of the extracted voice data and noise-suppressed all voice data is output to the third learning unit 243. The third voice extraction model 330 is learned by the third learning unit 243.

[0090] The third learning unit 243 is configured to train the third speech extraction model 330. Specifically, the third learning unit 243 calculates a loss function from the extracted speech data generated by the third speech extraction model 330 and the correct speech data generated by the second superposition unit 302. The third learning unit 243 also calculates a loss function from the noise-suppressed total speech data generated by the third speech extraction model 330 and the total speech data generated by the first superposition unit 301. Subsequently, the third learning unit 243 updates the parameters of the third speech extraction model 330 so that each of the two calculated loss functions becomes smaller.

[0091] (Learning Operation) Next, with reference to Figure 15, the flow of the learning operation by the sixth information processing device 10 (i.e., the operation when machine learning the third speech extraction model 330) will be explained. Figure 15 is a flowchart showing the flow of the learning operation by the sixth information processing device. Note that in Figure 15, the same reference numerals are used for the same processes as shown in Figure 11.

[0092] As shown in Figure 15, when the learning operation by the sixth information processing device 10 is started, the voice acquisition unit 210 acquires multiple voice data (step S201). Then, the first superposition unit 301 superimposes all of the multiple voice data acquired by the voice acquisition unit 210 to generate all voice data (step S202). In addition, the third superposition unit 303 superimposes all of the multiple voice data acquired by the voice acquisition unit 210 with noise N to generate noisy all voice data (step S401). Meanwhile, the second superposition unit 302 acquires a threshold Λ (step S203). Then, the second superposition unit 302 superimposes, from among the multiple voice data acquired by the voice acquisition unit 210, the voice data of the voices whose index is greater than or equal to the threshold Λ to generate correct voice data (step S204).

[0093] Next, the third voice extraction unit 233 acquires the model structure of the third voice extraction model 330 (step S402). After that, the third voice extraction unit 233 acquires the initial parameters of the third voice extraction model 330 (step S403).

[0094] Next, the third audio extraction unit 233 uses the third audio extraction model 330 to extract the audio data of the extracted audio from the noisy total audio data. The third audio extraction unit 233 also uses the third audio extraction model 330 to suppress the noise in the noisy total audio data and outputs the result as noise-suppressed total audio data (step S404).

[0095] Next, the third learning unit 243 calculates a loss function from the extracted speech data generated by the third speech extraction model 330 and the correct speech data generated by the second superposition unit 302. The third learning unit 243 also calculates a loss function from the noise-suppressed total speech data generated by the third speech extraction model 330 and the total speech data generated by the first superposition unit 301 (step S405). After that, the third learning unit 243 updates the parameters of the third speech extraction model 330 so that each of the two calculated loss functions becomes smaller (step S406).
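As a small worked example of step S405, the sketch below evaluates both losses for a hypothetical untrained model that simply passes the noisy input through on both outputs. It assumes a mean squared error for both losses, which the embodiment does not specify. Under that assumption, the noise-suppression loss of the pass-through model equals the mean power of the noise, so reducing this loss corresponds directly to removing noise.

```python
from typing import Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    if len(a) != len(b) or not a:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sixth_embodiment_losses(
    extracted: Sequence[float],
    correct: Sequence[float],
    noise_suppressed: Sequence[float],
    all_voice: Sequence[float],
) -> Tuple[float, float]:
    """Return (extraction loss, noise-suppression loss) as in step S405."""
    return mse(extracted, correct), mse(noise_suppressed, all_voice)


all_voice = [1.5, 2.5]   # clean mixture of all voices (target of output 2)
correct = [1.0, 2.0]     # voices selected by the threshold (target of output 1)
noise = [0.5, -0.5]
noisy_all = [v + n for v, n in zip(all_voice, noise)]

# Hypothetical pass-through model: both outputs are the unmodified noisy input.
ext_loss, sup_loss = sixth_embodiment_losses(
    noisy_all, correct, noisy_all, all_voice
)
noise_power = sum(n * n for n in noise) / len(noise)
print(ext_loss, sup_loss, noise_power)  # 0.5 0.25 0.25

# An ideal model would reproduce both targets and drive both losses to zero.
ideal = sixth_embodiment_losses(correct, correct, all_voice, all_voice)
```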

[0096] Next, the third learning unit 243 determines whether or not to terminate the update of the third speech extraction model 330 (step S407). The third learning unit 243 only needs to determine, for example, whether or not a predetermined termination condition has been met. If it is determined that the update of the third speech extraction model 330 should not be terminated (step S407: NO), the process from step S404 is repeatedly executed. On the other hand, if it is determined that the update of the third speech extraction model 330 should be terminated (step S407: YES), the series of learning operations is terminated.

[0097] (Technical Effects) Next, the technical effects obtained by the sixth information processing device 10 will be explained.

[0098] As explained in Figures 14 and 15, in the sixth information processing device 10, the third speech extraction model 330 is trained using the extracted speech data as well as the noise-suppressed total speech data. In this way, it is possible to optimize the parameters of the third speech extraction model 330 and improve the speech extraction accuracy and noise suppression effect.

[0099] A processing method in which a program that operates the configuration of each embodiment so as to realize the functions of that embodiment is recorded on a recording medium, and in which the program recorded on the recording medium is read as code and executed on a computer, is also included in the scope of each embodiment. In other words, a computer-readable recording medium is also included in the scope of each embodiment. Furthermore, not only the recording medium on which the above-mentioned program is recorded but also the program itself is included in each embodiment.

[0100] Examples of recording media that can be used include floppy disks (registered trademark), hard disks, optical disks, magneto-optical disks, CD-ROMs, magnetic tapes, non-volatile memory cards, and ROMs. Furthermore, the scope of each embodiment is not limited to programs that perform processing on the recording media alone, but also includes programs that operate on the OS and perform processing in cooperation with other software and the functions of expansion boards. In addition, the program itself may be stored on a server, and part or all of the program may be made available for download from the server to the user terminal. The program may be provided to the user in, for example, SaaS (Software as a Service) format.

[0101] <Note> The embodiments described above may also be described in the following way, but are not limited to the following.

[0102] (Note 1) The information processing device described in Note 1 is an information processing device comprising: an acquisition means for acquiring mixed audio data obtained by mixing multiple audios and a parameter for specifying an extracted audio to be extracted from the multiple audios; and an audio extraction means for outputting extracted audio data obtained by extracting the audio data of the extracted audio from the mixed audio data based on the mixed audio data and the parameter.

[0103] (Note 2) The information processing device described in Note 2 is the information processing device described in Note 1, wherein the acquisition means acquires a threshold corresponding to an index related to speech as the parameter, and the speech extraction means extracts speech data from the mixed speech data in which the index is equal to or greater than the threshold as the extracted speech data.

[0104] (Note 3) The information processing device described in Note 3 is the information processing device described in Note 1, wherein the acquisition means acquires a category label indicating the category of the voice as the parameter, and the voice extraction means extracts voice data belonging to the category indicated by the category label from the mixed voice data as the extracted voice data.

[0105] (Note 4) The information processing device described in Note 4 is the information processing device described in Note 3, wherein the category label relates to at least one of gender, emotion, age, and a disease that can be identified from the voice data.

[0106] (Note 5) The information processing device described in Note 5 is the information processing device described in any one of Notes 1 to 4, wherein the voice extraction means outputs the extracted voice data using a machine learning model.

[0107] (Note 6) The information processing device described in Note 6 is an information processing device comprising: an audio acquisition means for acquiring multiple audio data; a first superposition means for superimposing all of the multiple audio data and outputting them as total audio data; a second superposition means for superimposing audio data from the multiple audio data according to the input parameters and outputting it as correct audio data; an audio extraction means for using an audio extraction model to extract audio according to the parameters from the total audio data and outputting it as extracted audio data; and a learning means for learning the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.

[0108] (Note 7) The information processing device described in Note 7 is the information processing device described in Note 6, wherein the voice extraction means includes a voice separation means that separates the total voice data into voice data corresponding to each of the plurality of voice data using a voice separation model and outputs them as a plurality of separated voice data, and a third superposition means that superimposes the voice data according to the parameters from the plurality of separated voice data and outputs it as extracted voice data, and the learning means learns the voice separation model using a loss function calculated from the plurality of separated voice data and the plurality of voice data.

[0109] (Note 8) The information processing device described in Note 8 is the information processing device described in Note 6 or 7, further comprising a third superposition means that superimposes all of the plurality of audio data with noise and outputs the result as noise-containing total audio data, wherein the audio extraction means outputs the extracted audio data from the noise-containing total audio data using the audio extraction model, and also suppresses the noise contained in the noise-containing total audio data using the audio extraction model and outputs the result as noise-suppressed total audio data, and the learning means learns the audio extraction model using a loss function calculated from the noise-suppressed total audio data and the total audio data, in addition to a loss function calculated from the extracted audio data and the correct audio data.

[0110] (Note 9) The information processing device described in Note 9 is the information processing device described in any one of Notes 1 to 4, wherein the acquisition means acquires the mixed voice data using one microphone.

[0111] (Note 10) The information processing device described in Note 10 is the information processing device described in any one of Notes 1 to 4, wherein the acquisition means acquires the mixed voice data using a plurality of microphones.

[0112] (Note 11) The information processing method described in Note 11 is an information processing method in which at least one computer obtains mixed audio data obtained by mixing multiple voices and parameters that specify the extracted voice to be extracted from the multiple voices, and outputs extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

[0113] (Note 12) The information processing method described in Note 12 is an information processing method in which at least one computer acquires multiple audio data, superimposes all of the multiple audio data and outputs them as total audio data, superimposes the audio data from the multiple audio data that corresponds to the input parameters and outputs it as correct audio data, uses an audio extraction model to extract the audio corresponding to the parameters from the total audio data and outputs it as extracted audio data, and learns the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.

[0114] (Note 13) The recording medium described in Note 13 is a recording medium on which a computer program is recorded that causes at least one computer to execute an information processing method that acquires mixed audio data obtained by mixing multiple voices and parameters that specify the extracted voice to be extracted from the multiple voices, and outputs extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

[0115] (Note 14) The recording medium described in Note 14 is a recording medium on which a computer program is recorded that causes at least one computer to execute an information processing method which includes acquiring multiple audio data, superimposing all of the multiple audio data and outputting them as total audio data, superimposing the audio data from the multiple audio data that corresponds to the input parameters and outputting it as correct audio data, extracting audio corresponding to the parameters from the total audio data using an audio extraction model and outputting it as extracted audio data, and learning the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.

[0116] (Note 15) The computer program described in Note 15 is a computer program that causes at least one computer to execute an information processing method that obtains mixed audio data obtained by mixing multiple voices and parameters that specify the extracted voice to be extracted from the multiple voices, and outputs extracted audio data obtained by extracting the voice data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

[0117] (Note 16) The computer program described in Note 16 is a computer program that causes at least one computer to execute an information processing method which includes acquiring multiple audio data, superimposing all of the multiple audio data and outputting them as total audio data, superimposing the audio data corresponding to the input parameters from the multiple audio data and outputting it as correct audio data, extracting the audio corresponding to the parameters from the total audio data using an audio extraction model and outputting it as extracted audio data, and learning the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.

[0118] This disclosure may be modified as appropriate, insofar as it does not contradict the gist or idea of the invention as can be inferred from the claims and the specification as a whole, and information processing devices, information processing methods, and recording media with such modifications are also included in the technical idea of this disclosure.

[0119]
10 Information processing device
11 Processor
12 RAM
13 ROM
14 Storage device
15 Input device
16 Output device
17 Data bus
110 Acquisition unit
120 Voice extraction unit
210 Voice acquisition unit
221 First learning data generation unit
222 Second learning data generation unit
223 Third learning data generation unit
231 First voice extraction unit
232 Second voice extraction unit
233 Third voice extraction unit
241 First learning unit
242 Second learning unit
243 Third learning unit
301 First superposition unit
302 Second superposition unit
303 Third superposition unit
310 First voice extraction model
320 Second voice extraction model
330 Third voice extraction model

Claims

1. An information processing device comprising: an acquisition means for acquiring mixed audio data obtained by mixing multiple voices and parameters for specifying an extracted voice to be extracted from the multiple voices; and an audio extraction means for outputting extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

2. The information processing apparatus according to claim 1, wherein the acquisition means acquires a threshold corresponding to an index related to voice as the parameter, and the voice extraction means extracts voice data from the mixed voice data in which the index is equal to or greater than the threshold as the extracted voice data.

3. The information processing apparatus according to claim 1, wherein the acquisition means acquires a category label indicating the category of the voice as the parameter, and the voice extraction means extracts voice data from the mixed voice data that belongs to the category indicated by the category label as the extracted voice data.

4. The information processing apparatus according to claim 3, wherein the category label relates to at least one of gender, emotion, age, and a disease that can be identified from the voice data.

5. The information processing apparatus according to any one of claims 1 to 4, wherein the voice extraction means outputs the extracted voice data using a machine learning model.

6. An information processing device comprising: an audio acquisition means for acquiring multiple audio data; a first superposition means for superimposing all of the multiple audio data and outputting them as total audio data; a second superposition means for superimposing audio data from the multiple audio data according to the input parameters and outputting it as correct audio data; an audio extraction means for using an audio extraction model to extract audio according to the parameters from the total audio data and outputting it as extracted audio data; and a learning means for learning the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.

7. The information processing device according to claim 6, wherein the audio extraction means, in addition to outputting the extracted audio data using the audio extraction model, uses the audio extraction model to separate the total audio data into audio data corresponding to each of the multiple audio data and outputs them as multiple separated audio data, and the learning means learns the audio extraction model using a loss function calculated from the multiple separated audio data and the multiple audio data, in addition to the loss function calculated from the extracted audio data and the correct audio data.

8. The information processing device according to claim 6 or 7, further comprising a third superposition means for superimposing all of the multiple audio data together with noise and outputting the result as noise-containing total audio data, wherein the audio extraction means outputs the extracted audio data from the noise-containing total audio data using the audio extraction model, and also uses the audio extraction model to suppress the noise contained in the noise-containing total audio data and outputs the result as noise-suppressed total audio data, and the learning means learns the audio extraction model using a loss function calculated from the noise-suppressed total audio data and the total audio data, in addition to the loss function calculated from the extracted audio data and the correct audio data.

9. The information processing device according to any one of claims 1 to 4, wherein the acquisition means acquires the mixed audio data using one microphone.

10. The information processing device according to any one of claims 1 to 4, wherein the acquisition means acquires the mixed audio data using a plurality of microphones.

11. An information processing method comprising: at least one computer obtaining mixed audio data obtained by mixing multiple voices, and parameters specifying an extracted voice to be extracted from the multiple voices; and outputting extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

12. An information processing method comprising: at least one computer acquiring multiple audio data; superimposing all of the multiple audio data and outputting them as total audio data; superimposing audio data from the multiple audio data corresponding to input parameters and outputting it as ground truth audio data; using an audio extraction model, extracting audio corresponding to the parameters from the total audio data and outputting it as extracted audio data; and training the audio extraction model using a loss function calculated from the extracted audio data and the ground truth audio data.

13. A recording medium on which a computer program is recorded that causes at least one computer to execute an information processing method that obtains mixed audio data obtained by mixing multiple voices and parameters that specify an extracted voice to be extracted from the multiple voices, and outputs extracted audio data obtained by extracting the audio data of the extracted voice from the mixed audio data based on the mixed audio data and the parameters.

14. A recording medium on which a computer program is stored that causes at least one computer to execute an information processing method comprising: acquiring multiple audio data; superimposing all of the multiple audio data and outputting them as total audio data; superimposing audio data from the multiple audio data corresponding to input parameters and outputting it as correct audio data; using an audio extraction model to extract audio corresponding to the parameters from the total audio data and outputting it as extracted audio data; and training the audio extraction model using a loss function calculated from the extracted audio data and the correct audio data.
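
For illustration only, the training-data flow recited in claims 6, 12, and 14 (superimpose all audio data, superimpose the audio data selected by the parameter, and compute a loss between the model output and the correct audio data) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not specify the loss function, so mean squared error is assumed here; the parameter is assumed to be a threshold on a per-source index, as in claim 2; and the names `superimpose`, `build_training_pair`, `mse_loss`, and `identity_model` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Signal = List[float]  # one audio signal as a list of samples


def superimpose(signals: Sequence[Signal], length: int) -> Signal:
    """Add signals sample by sample; an empty selection yields silence."""
    total = [0.0] * length
    for signal in signals:
        for i, sample in enumerate(signal):
            total[i] += sample
    return total


def build_training_pair(
    sources: Sequence[Signal],
    indices: Sequence[float],
    threshold: float,
) -> Tuple[Signal, Signal]:
    """Return (total audio data, correct audio data) for one training example.

    `indices` holds one hypothetical voice-related index per source; sources
    whose index is at or above `threshold` form the correct audio data.
    """
    length = len(sources[0])
    total = superimpose(sources, length)
    selected = [s for s, idx in zip(sources, indices) if idx >= threshold]
    correct = superimpose(selected, length)
    return total, correct


def mse_loss(extracted: Signal, correct: Signal) -> float:
    """Mean squared error, assumed here as one possible loss function."""
    return sum((e - c) ** 2 for e, c in zip(extracted, correct)) / len(correct)


def identity_model(total: Signal, threshold: float) -> Signal:
    """Placeholder for the audio extraction model: returns its input unchanged."""
    return list(total)


def training_loss(
    model: Callable[[Signal, float], Signal],
    sources: Sequence[Signal],
    indices: Sequence[float],
    threshold: float,
) -> float:
    """Loss for one example; a real learning step would update the model with it."""
    total, correct = build_training_pair(sources, indices, threshold)
    extracted = model(total, threshold)
    return mse_loss(extracted, correct)


sources = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]
indices = [0.9, 0.2, 0.6]  # hypothetical per-source index values
loss = training_loss(identity_model, sources, indices, threshold=0.5)
print(loss)  # 0.25
```

Because the threshold is an input to both the ground-truth construction and the model, varying it across training examples is what would let a single model cover different extraction ranges. The additional loss terms of claims 7 and 8 (per-source separation and noise suppression) would be added to the same training loss in an analogous way.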