Information processing system, information processing method, and computer program

JP2026104751A · Pending · Publication Date: 2026-06-25 · SONY GROUP CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SONY GROUP CORP
Filing Date
2025-01-28
Publication Date
2026-06-25


Abstract

[Problem] To provide an information processing system that generates or edits audio using an AI model. [Solution] The information processing system comprises a repair unit that repairs masked audio data using a generative model, and a sampler that extracts, from the audio data repaired by the repair unit, the mask positions for the next iterative synthesis. The system generates the final output audio data by iterative synthesis, in which the extraction of mask positions by the sampler and the repair, by the repair unit, of the audio data to which those mask positions are applied are repeated a predetermined number of times.

Claims

1. An information processing system comprising: a repair unit that repairs masked audio data using a generative model; and a sampler that extracts, from the audio data repaired by the repair unit, mask positions for the next iterative synthesis, wherein the system generates final output audio data by iterative synthesis in which the extraction of mask positions by the sampler and the repair, by the repair unit, of the audio data to which the mask positions are applied are repeated a predetermined number of times.
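
The loop recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `iterative_synthesis`, `toy_repair`, `toy_sampler`, and the `MASK` marker are hypothetical, the audio data is represented as a list of integer tokens, and the two toy callables merely stand in for the generative model and the sampler.

```python
import random
from typing import Callable, List, Optional, Set

MASK = None  # hypothetical marker for a masked position


def iterative_synthesis(
    masked: List[Optional[int]],
    repair: Callable[[List[Optional[int]]], List[int]],
    sample_mask: Callable[[List[int], int], Set[int]],
    num_iterations: int,
) -> List[int]:
    """Repeat repair -> mask extraction -> re-masking a fixed number of times."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    data = list(masked)
    for step in range(num_iterations):
        repaired = repair(data)  # repair unit: fill every masked position
        if step < num_iterations - 1:
            # Sampler: choose the positions to mask for the next iteration.
            positions = sample_mask(repaired, step)
            data = [MASK if i in positions else tok
                    for i, tok in enumerate(repaired)]
    return repaired  # output of the last repair pass


def toy_repair(seq: List[Optional[int]]) -> List[int]:
    # Stand-in for the generative model: random token ids in masked slots.
    return [random.randrange(1024) if t is MASK else t for t in seq]


def toy_sampler(seq: List[int], step: int) -> Set[int]:
    # Stand-in sampler: re-mask a shrinking random subset of positions.
    k = max(1, len(seq) // (step + 2))
    return set(random.sample(range(len(seq)), k))


tokens = iterative_synthesis([MASK] * 16, toy_repair, toy_sampler, num_iterations=4)
```

Passing the repair unit and the sampler as callables keeps the loop independent of how either is realized, which mirrors the way the claim separates the two components.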

2. The information processing system according to claim 1, further comprising a vector quantization encoder that encodes a mel spectrogram of audio waveform data into a token sequence, wherein the repair unit repairs the masked token sequence, and the sampler extracts the mask positions from the token sequence repaired by the repair unit.

3. The information processing system according to claim 2, wherein the generative model is a transformer model.

4. The information processing system according to claim 2, further comprising: a masking unit that masks tokens at arbitrary positions in a token sequence; and a loss function calculation unit that calculates a loss function based on the difference between a first token sequence and a second token sequence, the second token sequence being obtained by using the generative model to restore the masked token sequence produced when the masking unit masks the first token sequence, wherein the generative model is trained so as to optimize the loss function.

5. The information processing system according to claim 4, wherein the loss function calculation unit calculates a cross-entropy loss over the masked portion by comparing the predictions for the masked portion with the correct labels for the masked portion.
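
As an illustration of the loss recited in claim 5, the sketch below computes cross-entropy only at masked positions. The function name `masked_cross_entropy` is hypothetical, and averaging over the masked positions is an assumption; the claim does not say whether the loss is summed or averaged.

```python
import math
from typing import Sequence


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],  # one row of vocabulary logits per position
    targets: Sequence[int],             # correct token id per position
    masked: Sequence[bool],             # True where the input token was masked
) -> float:
    """Mean cross-entropy over masked positions (averaging is an assumption)."""
    total, count = 0.0, 0
    for row, target, is_masked in zip(logits, targets, masked):
        if not is_masked:
            continue  # unmasked positions do not contribute to the loss
        peak = max(row)
        # Numerically stable log-sum-exp of the row.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]  # negative log-probability of the target
        count += 1
    return total / count if count else 0.0
```

For example, a single masked position with two equal logits yields a loss of ln 2, while positions that were never masked leave the result unchanged whatever their logits are.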

6. The information processing system according to claim 4, wherein the masking unit masks tokens at arbitrary positions in the token sequence using either an unconditional mask or a conditional mask.

7. The information processing system according to claim 6, wherein the masking unit uses the conditional mask based on a feature vector obtained by mapping the mel spectrogram from which the first token sequence originates into a shared latent space.

8. The information processing system according to claim 6, wherein the generative model employs an iterative synthesis algorithm using classifier-free guidance (CFG).

9. The information processing system according to claim 8, wherein, in the training phase, the token sequence input to the generative model is masked while switching from the unconditional mask to the conditional mask at a predetermined ratio of the training steps, and, in the inference phase, the final logit for each masked token is calculated by linearly combining, using a guidance scale, the conditional logit and the unconditional logit calculated for that token.
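
The inference-time combination in claim 9 can be sketched as below. The claim states only that the two logits are combined linearly using a guidance scale; the specific form used here, final = (1 + s) * conditional - s * unconditional, is an assumption (another common convention is unconditional + s * (conditional - unconditional)), and the name `combine_logits` is hypothetical.

```python
from typing import List, Sequence


def combine_logits(
    conditional: Sequence[float],
    unconditional: Sequence[float],
    scale: float,
) -> List[float]:
    """Linear CFG combination over the vocabulary for one masked token.

    Assumed form: (1 + scale) * conditional - scale * unconditional,
    so scale = 0.0 returns the conditional logits unchanged.
    """
    if len(conditional) != len(unconditional):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 + scale) * c - scale * u
            for c, u in zip(conditional, unconditional)]
```

Under this convention a larger scale pushes the final logits further away from the unconditional prediction, in the direction of the conditional one.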

10. The information processing system according to claim 9, wherein the guidance scale is increased linearly from 0.0 to an assigned value over the course of the iterative synthesis.
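
A linear schedule of the kind described in claim 10 might look like the sketch below. The name `guidance_scale_at` is hypothetical, and two details are assumptions: the scale is 0.0 at the first iteration and reaches the assigned value exactly at the last one, and a single-iteration run uses the assigned value directly.

```python
def guidance_scale_at(step: int, num_steps: int, target: float) -> float:
    """Guidance scale for iteration `step` (0-based) out of `num_steps`.

    Assumption: 0.0 at the first iteration, `target` at the last one.
    """
    if num_steps < 1 or not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if num_steps == 1:
        return target  # assumption: nothing to ramp over
    return target * step / (num_steps - 1)
```

With five iterations and an assigned value of 3.0, the midpoint iteration (step 2) receives a scale of 1.5.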

11. The information processing system according to claim 9, wherein the sampler extracts the k lowest-quality tokens from the token sequence repaired by the repair unit as the mask positions for the next iterative synthesis.
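
The selection in claim 11 can be sketched as a top-k pick over a per-token quality score. The claim does not define how quality is measured; the sketch assumes a confidence value per token (for instance, the probability the model assigned to the token it produced), and the name `lowest_confidence_positions` is hypothetical.

```python
from typing import List, Sequence


def lowest_confidence_positions(confidences: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k lowest-confidence tokens, in ascending index order."""
    k = max(0, min(k, len(confidences)))  # clamp k to the valid range
    # Rank positions from least to most confident; ties keep index order.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(ranked[:k])
```

For confidences `[0.9, 0.1, 0.5, 0.05]` and `k = 2`, the function returns `[1, 3]`, so those two positions would be masked again in the next iteration.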

12. The information processing system according to claim 1, further comprising a frequency-domain masking unit that masks an arbitrary frequency range of audio data, wherein the repair unit repairs the frequency range masked by the frequency-domain masking unit.
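
Frequency-domain masking as in claim 12 can be illustrated on a spectrogram stored as rows of frequency bins. This is only a sketch: the layout `[frequency_bin][frame]`, the use of `None` as the mask marker, and the name `mask_frequency_band` are all assumptions introduced for illustration.

```python
from typing import List, Optional

Spectrogram = List[List[Optional[float]]]  # assumed layout: [frequency_bin][frame]


def mask_frequency_band(spec: Spectrogram, low_bin: int, high_bin: int) -> Spectrogram:
    """Return a copy of `spec` with bins low_bin..high_bin-1 masked in every frame."""
    if not 0 <= low_bin <= high_bin <= len(spec):
        raise ValueError("band must lie within the spectrogram")
    return [
        [None] * len(row) if low_bin <= b < high_bin else list(row)
        for b, row in enumerate(spec)
    ]
```

The masked rows are what a repair unit would then be asked to fill in, leaving the remaining bins as context.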

13. The information processing system according to claim 1, further comprising a time-domain masking unit that masks an arbitrary time interval of audio data, wherein the repair unit repairs the time interval masked by the time-domain masking unit.

14. The information processing system according to claim 13, wherein the time-domain masking unit appends a mask to the end of a generated sound source that the repair unit generated based on a text prompt, and the repair unit repairs the appended mask based on the text prompt, thereby producing a generated sound source whose duration is extended by the duration of the appended mask.
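
The duration extension in claim 14 amounts to appending a masked tail and repairing it. The sketch below shows that idea on a token list; `extend_by_tail_mask` and `MASK` are hypothetical names, the repair callable stands in for the prompt-conditioned repair unit, and keeping the original tokens untouched while taking only the repaired tail is an assumption.

```python
from typing import Callable, List, Optional, Sequence

MASK = None  # hypothetical marker for a masked position


def extend_by_tail_mask(
    tokens: Sequence[int],
    extra: int,
    repair: Callable[[List[Optional[int]]], List[int]],
) -> List[int]:
    """Append `extra` masked positions to `tokens` and let `repair` fill them."""
    if extra < 0:
        raise ValueError("extra must be non-negative")
    masked = list(tokens) + [MASK] * extra  # time-domain mask at the end
    repaired = repair(masked)
    if len(repaired) != len(masked):
        raise ValueError("repair must preserve the sequence length")
    # Assumption: the existing part is kept as-is; only the tail is new.
    return list(tokens) + list(repaired[len(tokens):])
```

The same pattern would cover the variants in which the tail is repaired from a different prompt or from an existing sound source, since only the conditioning behind `repair` changes.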

15. The information processing system according to claim 13, wherein the time-domain masking unit appends a mask to the end of a first generated sound source that the repair unit generated based on a first text prompt, and the repair unit repairs the appended mask with a second generated sound generated based on a second text prompt, thereby extending the first generated sound source by the duration of the appended mask.

16. The information processing system according to claim 13, wherein the time-domain masking unit appends a mask to the end of an existing sound source, and the repair unit repairs the appended mask with a generated sound generated based on the existing sound source, thereby producing the existing sound source extended by the duration of the appended mask.

17. The information processing system according to claim 13, wherein the time-domain masking unit appends a mask to the end of a first existing sound source, and the repair unit repairs the appended mask with a generated sound generated based on a second existing sound source or a text prompt, thereby extending the first existing sound source by the duration of the appended mask.

18. An information processing method comprising: a repair step of repairing masked audio data using a generative model; and a sampling step of extracting, from the audio data repaired in the repair step, mask positions for the next iterative synthesis, wherein final output audio data is generated by iterative synthesis in which the extraction of mask positions in the sampling step and the repair, in the repair step, of the audio data to which the mask positions are applied are repeated a predetermined number of times.

19. A computer program written in a computer-readable format so as to cause a computer to function as: a repair unit that repairs masked audio data using a generative model; and a sampler that extracts, from the audio data repaired by the repair unit, mask positions for the next iterative synthesis, wherein final output audio data is generated by iterative synthesis in which the extraction of mask positions by the sampler and the repair, by the repair unit, of the audio data to which the mask positions are applied are repeated a predetermined number of times.