Discrete token generation device, language model fine-tuning system, audio caption generation system, discrete token generation method, and program

The discrete token generation device enhances acoustic description generation by using a general-purpose acoustic signal representation to extract and vector-quantize features, improving semantic information recognition and classification tasks.

WO2026133543A1 · PCT designated stage · Publication Date: 2026-06-25 · NT T INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NT T INC
Filing Date
2024-12-20
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing Neural Audio Codecs focus on compressing the waveform of audio signals and do not effectively extract semantic information, making them suboptimal for tasks involving the recognition and classification of sound semantic information.

Method used

A discrete token generation device that utilizes a general-purpose acoustic signal representation to extract features and vector-quantize them, generating discrete tokens that better capture semantic information, combined with a language model fine-tuning process to enhance acoustic description generation.

Benefits of technology

The proposed method improves the accuracy of acoustic description generation by generating more appropriate discrete tokens, as evidenced by higher scores in METEOR, CIDEr, SPICE, and SPIDEr metrics compared to existing methods.

✦ Generated by Eureka AI based on patent content.

Abstract

A discrete token generation device according to the present invention includes an acoustic signal acquisition unit for acquiring an acoustic signal, a feature extraction unit for extracting features of the acoustic signal using general-purpose acoustic signal representation, and a discrete token generation unit for performing vector-quantization of the extracted features and generating a discrete token.

Description

Discrete Token Generation Device, Language Model Fine-Tuning System, Acoustic Description Generation System, Discrete Token Generation Method, and Program

[0001] The present invention relates to a discrete token generation device, a language model fine-tuning system, an acoustic description generation system, a discrete token generation method, and a program.

[0002] As technologies for recognizing and classifying general sounds, not limited to speech, including environmental sounds and sound effects, acoustic scene classification, acoustic event detection, acoustic description generation, etc. have been studied. These technologies aim to recognize and classify semantic information of sounds such as scenes like restaurants and lectures, events like a dog's bark and a car's running sound, and the failure status of a machine by taking an acoustic signal as an input.

[0003] There is EnCLAP as a method for generating an acoustic description (see Non-Patent Document 1). EnCLAP treats the input sound as a sequence of discrete tokens. In EnCLAP, discrete tokens of sound are obtained using EnCodec, which is one of the Neural Audio Codecs (Non-Patent Document 2). Neural Audio Codec is an encoding technology for acoustic signals using an encoder and a decoder based on a deep learning model, and can convert an acoustic signal into a sequence of discrete tokens. EnCLAP realizes more efficient fine-tuning of a text generation model by using discrete tokens as input features.

[0004] Non-Patent Document 1: J. Kim, J. Jung, J. Lee, and S. H. Woo, “EnCLAP: Combining neural audio codec and audio-text joint embedding for automated audio captioning,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2024, pp. 6735-6739.
Non-Patent Document 2: A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Trans. Mach. Learn. Res., 2023.
Non-Patent Document 3: S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2023, pp. 5178-5193.
Non-Patent Document 4: D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Towards a Universal Audio Pre-training Framework,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 32, pp. 2391-2406, 2024.

[0005] However, Neural Audio Codecs focus on compressing the waveform of the acoustic signal itself, not on extracting semantic information about the content of the sound. They are therefore not considered optimal for tasks that involve recognizing and classifying semantic information of sound, such as acoustic description generation.

[0006] In view of the above circumstances, the present invention aims to provide a technology for generating more appropriate discrete tokens in the task of recognizing and classifying semantic information of sound.

[0007] One aspect of the present invention is a discrete token generation device comprising: an acoustic signal acquisition unit that acquires an acoustic signal; a feature extraction unit that extracts feature quantities from the acoustic signal using a general-purpose acoustic signal representation; and a discrete token generation unit that vector-quantizes the extracted feature quantities to generate discrete tokens.

[0008] One aspect of the present invention is a discrete token generation method comprising: an acoustic signal acquisition step of acquiring an acoustic signal; a feature extraction step of extracting feature quantities from the acoustic signal using a general-purpose acoustic signal representation; and a discrete token generation step of vector quantizing the extracted feature quantities to generate discrete tokens.

[0009] The present invention makes it possible to generate more appropriate discrete tokens in the task of recognizing and classifying semantic information of sound.

[0010] Figure 1 is a diagram showing an example configuration of the discrete token generation device 10 according to this embodiment. Figure 2 is a flowchart showing the operation of the discrete token generation device 10 according to this embodiment. Figure 3 is a diagram showing the configuration of the language model fine-tuning system 1 according to this embodiment. Figure 4 is a diagram showing the procedure for fine-tuning the language model. Figure 5 is a diagram showing the configuration of the acoustic description generation system 2 according to this embodiment. Figure 6 is a diagram showing experimental results. Figure 7 is a diagram showing experimental results.

[0011] Embodiments of the present invention will be described in detail below with reference to the drawings.

[0012] Figure 1 shows an example of the configuration of a discrete token generation device 10 according to this embodiment. The discrete token generation device 10 generates discrete tokens based on an acoustic signal. The discrete token generation device 10 comprises an acoustic signal acquisition unit 11, a feature extraction unit 12, a discrete token generation unit 13, and a discrete token output unit 14.

[0013] The acoustic signal acquisition unit 11 acquires an acoustic signal. The feature extraction unit 12 extracts the features of the acquired acoustic signal. The feature extraction unit 12 may extract the features of the acoustic signal using a Deep Neural Network (DNN) such as a deep embedding model, or it may extract the features of the acoustic signal using a method with fixed parameters such as a spectrogram transformation process.
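As an illustration of the fixed-parameter option mentioned above, the following minimal sketch computes a log-magnitude spectrogram with a Hann window and a direct DFT. The function names, frame length, and hop size are assumptions chosen for the example and do not come from the specification; a practical system would more likely use an FFT library or a trained DNN embedding model.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def log_magnitude_spectrum(frame):
    """Hann-window one frame and return log magnitudes of the non-negative DFT bins."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(acc) + 1e-8))  # small offset avoids log(0)
    return spectrum


def extract_features(signal, frame_len=64, hop=32):
    """Fixed-parameter feature extraction: one spectrum vector per frame."""
    return [log_magnitude_spectrum(f) for f in frame_signal(signal, frame_len, hop)]


# Hypothetical input: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
features = extract_features(tone)
peak_bin = max(range(len(features[0])), key=features[0].__getitem__)
```

Each element of `features` is one feature vector per frame, which is the kind of input the discrete token generation unit 13 would then vector-quantize.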

[0014] A general-purpose acoustic signal representation (PASR) is used for feature extraction. A PASR is a feature extraction method that has been trained to extract the semantic information contained in sound. It is trained on a large dataset using self-supervised learning, so that the features remain unchanged when different variations are applied to the acoustic signal. The resulting PASR is therefore robust to variations that do not significantly alter the semantic information, and it provides features that represent semantic information well. It can thus be used effectively for the recognition and classification of general sounds described above. Examples of PASRs include BEATs (see Non-Patent Document 3) and M2D (see Non-Patent Document 4).

[0015] The discrete token generation unit 13 generates discrete tokens based on the extracted features. The discrete token generation unit 13 generates discrete tokens by vector quantizing the extracted features.

[0016] The vector quantization method performed by the discrete token generation unit 13 may be single-layer vector quantization or multi-layer vector quantization such as Residual Vector Quantization (RVQ). Furthermore, the codebook for quantization may be pre-learned or continuously updated through online learning.
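The difference between single-layer quantization and RVQ can be sketched as follows. This is a minimal illustration under stated assumptions: the codebooks are tiny hand-written placeholders rather than learned ones, and the function names are hypothetical. Each layer quantizes the residual left by the previous layer, so one feature vector yields one token per layer.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared Euclidean)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_vector_quantize(vector, codebooks):
    """Quantize layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens, residual


# Placeholder codebooks (assumed values, not learned from data).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # layer 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # layer 2: fine
]

tokens, residual = residual_vector_quantize([1.2, 0.9], codebooks)
single_layer_tokens, _ = residual_vector_quantize([1.2, 0.9], codebooks[:1])
```

Passing a single codebook reduces the same routine to single-layer vector quantization. How the codebooks are obtained (pre-learned or updated online) is outside this sketch.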

[0017] The discrete token output unit 14 outputs the generated discrete tokens. The generated discrete tokens are input to the language model fine-tuning device 20 and the acoustic descriptive text generation device 30, which will be described later.

[0018] Figure 2 is a flowchart illustrating the operation of the discrete token generation device 10 according to this embodiment. The acoustic signal acquisition unit 11 acquires an acoustic signal (step S11). The feature extraction unit 12 extracts feature quantities from the acquired acoustic signal (step S12). The discrete token generation unit 13 generates discrete tokens based on the extracted feature quantities (step S13). The discrete token output unit 14 outputs the generated discrete tokens (step S14).
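Steps S11 to S14 can be sketched as a small pipeline with pluggable stages. The class name, the toy frame-energy feature, and the one-dimensional codebook are all assumptions made for illustration; in the embodiment the feature extractor would be a general-purpose acoustic signal representation, and the quantizer could be multi-layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class DiscreteTokenGenerator:
    """Mirrors units 11 to 14: acquire, extract features, vector-quantize, output."""
    acquire: Callable[[], Sequence[float]]              # step S11
    extract: Callable[[Sequence[float]], List[Vector]]  # step S12
    quantize: Callable[[Vector], int]                   # step S13

    def run(self) -> List[int]:
        signal = self.acquire()
        features = self.extract(signal)
        tokens = [self.quantize(f) for f in features]
        return tokens                                   # step S14


def frame_energy(signal, frame_len=4):
    """Toy stand-in feature: mean absolute amplitude per non-overlapping frame."""
    return [[sum(abs(x) for x in signal[i:i + frame_len]) / frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


CODEBOOK = [[0.0], [0.5], [1.0]]  # hypothetical single-layer codebook


def quantize(vector):
    """Nearest-neighbour lookup in CODEBOOK (squared Euclidean distance)."""
    return min(range(len(CODEBOOK)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, CODEBOOK[i])))


generator = DiscreteTokenGenerator(
    acquire=lambda: [0.0, 0.1, -0.1, 0.0, 0.9, -1.0, 1.0, -0.9, 0.5, -0.5, 0.4, -0.6],
    extract=frame_energy,
    quantize=quantize,
)
tokens = generator.run()
```

Keeping the three stages as separate callables reflects the unit structure of Figure 1: the feature extractor or the quantizer can be swapped without touching the rest of the pipeline.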

[0019] The following describes how to use the discrete tokens generated by the discrete token generation device 10. Figure 3 shows the configuration of the language model fine-tuning system 1 according to this embodiment. The language model fine-tuning system 1 comprises a discrete token generation device 10 and a language model fine-tuning device 20. An acoustic signal is input to the discrete token generation device 10, and discrete tokens are output. The language model to be fine-tuned is input to the language model fine-tuning device 20 in advance. The discrete tokens output by the discrete token generation device 10 are also input to the language model fine-tuning device 20.

[0020] The language model fine-tuning device 20 receives an acoustic signal and a caption corresponding to the acoustic signal as input.

[0021] The discrete token generator 10 may receive data consisting of an acoustic signal and a corresponding caption, and the discrete token generator 10 may generate discrete tokens based on the acoustic signal, thereby generating data consisting of discrete tokens and captions. The generated data consisting of discrete tokens and captions may be input to the language model fine-tuning device 20.

[0022] The language model fine-tuning device 20 fine-tunes the input language model based on the input discrete tokens and their corresponding captions. The language model fine-tuning device 20 outputs the fine-tuned language model.

[0023] Figure 4 shows the procedure for fine-tuning a language model. The language model to be fine-tuned is, for example, BART. First, ART processing and CLAP processing are performed on the acoustic signal. ART processing is a process performed by the discrete token generator 10 to extract features from the acoustic signal and generate discrete tokens by vector quantization.

[0024] CLAP processing is a process that extracts features using a model trained by CLAP (Contrastive Language-Audio Pretraining). The method for CLAP processing is disclosed in Non-Patent Document 1. Through CLAP processing, the CLAP audio embedding E_A is extracted from the acoustic signal.

[0025] In addition, the special tokens <bos> and <eos>, which indicate the beginning and end of a sentence, are converted into embedded representations by the BART embedding layer. This conversion is denoted as "BART Embed" in Figure 4. The conversion results are denoted e_bos and e_eos, respectively.

[0026] The discrete tokens generated by the ART process are replaced with embedded representations. This replacement is labeled "Token Embed" in Figure 4. The embeddings e_bos and e_eos and the embedded representations of the discrete tokens are then converted into embedded representations that represent positional information. This conversion is labeled "positional embedding" in Figure 4. The linear projection of the CLAP audio embedding E_A (represented as "Linear" in Figure 4) and the embedded representations carrying positional information are input to a trained BART encoder. BART is an example of a language model.
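The assembly of the encoder input described in paragraphs [0024] to [0026] might look like the sketch below. Several details are assumptions, since the text does not fix them: positional information is added element-wise, the projected CLAP embedding is placed first in the sequence, one token per time step is used (single-layer quantization), and all tables are random placeholders rather than trained BART parameters.

```python
import random


def linear(vector, weight, bias):
    """Affine projection ("Linear"); weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def build_encoder_input(clap_embedding, tokens, token_table, e_bos, e_eos,
                        position_table, weight, bias):
    """Return [Linear(E_A), e_bos, embed(tokens...), e_eos] with positions added."""
    sequence = [e_bos] + [token_table[t] for t in tokens] + [e_eos]  # "Token Embed"
    positioned = [[x + p for x, p in zip(vec, position_table[i])]    # "positional embedding"
                  for i, vec in enumerate(sequence)]
    return [linear(clap_embedding, weight, bias)] + positioned


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


# Placeholder sizes and parameters (assumptions for illustration only).
random.seed(0)
dim, clap_dim, vocab, max_len = 4, 6, 8, 16
token_table = [rand_vec(dim) for _ in range(vocab)]
position_table = [rand_vec(dim) for _ in range(max_len)]
weight = [rand_vec(clap_dim) for _ in range(dim)]
bias = [0.0] * dim
e_bos, e_eos = rand_vec(dim), rand_vec(dim)
clap_embedding = rand_vec(clap_dim)  # stands in for E_A

encoder_input = build_encoder_input(clap_embedding, [3, 1, 7], token_table,
                                    e_bos, e_eos, position_table, weight, bias)
```

The result is a sequence of vectors of the encoder's hidden size: one for the projected E_A, one each for e_bos and e_eos, and one per discrete token.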

[0027] At this time, the output of the BART encoder is input to the BART decoder. Furthermore, the BART decoder receives the caption corresponding to the acoustic signal, which has been converted into an embedded representation by BART (indicated as "BART Embed" in Figure 4), and then converted into an embedded representation representing positional information (indicated as "positional embedding" in Figure 4). In the BART decoder, cross-attention is performed on the two inputs and the cross-entropy loss is calculated.

[0028] Furthermore, the loss for the task of predicting masked input tokens in the output of the BART encoder (Masked Code Modeling Loss) is calculated. Here, the parameters of the BART encoder and BART decoder are adjusted so that the Cross Entropy Loss and Masked Code Modeling Loss are minimized. This fine-tunes the language model.
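The two losses can be combined as in the following minimal sketch. The text states only that both losses are minimized; summing them with a weight `mcm_weight` is an assumption, as are the function names and the toy logits. Cross-entropy is computed over the caption tokens predicted by the decoder, and the Masked Code Modeling Loss is computed only at the masked positions of the encoder output.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def mean_cross_entropy(logit_rows, targets):
    """Average cross-entropy over a sequence of predictions."""
    return sum(cross_entropy(row, t) for row, t in zip(logit_rows, targets)) / len(targets)


def total_loss(caption_logits, caption_ids, code_logits, code_ids,
               masked_positions, mcm_weight=1.0):
    """Cross Entropy Loss plus weighted Masked Code Modeling Loss (weighting is assumed)."""
    ce = mean_cross_entropy(caption_logits, caption_ids)
    mcm = mean_cross_entropy([code_logits[i] for i in masked_positions],
                             [code_ids[i] for i in masked_positions])
    return ce + mcm_weight * mcm


# Toy example: one caption token over 2 words, two code positions over 4 codes,
# of which only position 0 was masked.
loss = total_loss(
    caption_logits=[[0.0, 0.0]], caption_ids=[0],
    code_logits=[[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]], code_ids=[1, 0],
    masked_positions=[0],
)
```

With uniform logits the caption term equals log 2 and the masked-code term equals log 4, so the toy loss is log 8; in training, the gradient of this scalar would drive the parameter updates of the BART encoder and decoder.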

[0029] The language model fine-tuning device 20 may also be configured to include a discrete token generator 10. In this configuration, when an acoustic signal and a corresponding caption are input to the language model fine-tuning device 20, the discrete token generator 10 generates discrete tokens based on the input acoustic signal.

[0030] Figure 5 shows the configuration of the acoustic description generation system 2 according to this embodiment. The acoustic description generation system 2 comprises a discrete token generation device 10 and an acoustic description generation device 30. An acoustic signal is input to the discrete token generation device 10, and discrete tokens are output. A language model is stored in advance in the acoustic description generation device 30. The stored language model may be a language model that has been fine-tuned by the language model fine-tuning device 20. The acoustic description generation device 30 inputs the discrete tokens received from the discrete token generation device 10 into the stored language model, and thereby generates and outputs an acoustic description of the acoustic signal that was input to the discrete token generation device 10. The output result is displayed on an external display device, for example. The output result may also be stored in a storage device, which may be provided in the acoustic description generation device 30 or externally.

[0031] The above describes how language models can be fine-tuned and acoustic descriptions can be generated based on discrete tokens generated by the discrete token generator 10. However, the invention is not limited to these methods, and the discrete tokens generated by the discrete token generator 10 can be used for acoustic recognition and classification. For example, based on the discrete tokens generated by the discrete token generator 10, the scenes in which the acoustics occurred may be classified, the events in which the acoustics occurred may be detected, or abnormal sounds may be detected.

[0032] As mentioned above, in the acoustic description generation method EnCLAP disclosed in Non-Patent Document 1, discrete tokens of sound are obtained using EnCodec, one of the Neural Audio Codecs. Neural Audio Codecs are a technology that focuses on compressing the waveform of the acoustic signal itself and does not focus on extracting semantic information of the sound content. Therefore, discrete tokens obtained by EnCodec are not considered optimal for tasks such as the recognition and classification of semantic information of sound, such as acoustic description generation. In contrast, in this embodiment, the discrete token generation device 10 extracts features using a general-purpose acoustic signal representation and generates discrete tokens based on the extracted features. Therefore, the discrete tokens generated by the discrete token generation device 10 can be said to focus on the extraction of semantic information of the sound content. For this reason, in tasks such as the recognition and classification of semantic information of sound, such as acoustic description generation, the discrete tokens generated by the discrete token generation device 10 are considered more appropriate than the discrete tokens obtained by EnCodec.

[0033] (Experiment) The experimental results of this embodiment are described below. The scores of the objective evaluation index for the acoustic descriptions generated in this embodiment were compared with those of the comparative example (EnCLAP). In this embodiment, the acoustic descriptions were generated by the acoustic description generation system 2. The language model used by the acoustic description generation device 30 of the acoustic description generation system 2 is BART, which has been fine-tuned by the language model fine-tuning system 1 in the manner shown in Figure 4. In the experiment, the only difference between the comparative example and this embodiment was the method of generating discrete tokens.

[0034] Furthermore, the discrete token generation unit 13 generated discrete tokens using BEATs described in Non-Patent Document 3 as a general-purpose acoustic signal representation. In addition, the number of quantization layers was set to 16 in the comparative example and this embodiment.

[0035] METEOR, CIDEr, SPICE, and SPIDEr were calculated as objective evaluation metrics for sound descriptions. METEOR stands for Metric for Evaluation of Translation with Explicit Ordering and is a metric that focuses on the similarity of word sequences considering synonyms and stems. CIDEr stands for Consensus-based image description evaluation and is a metric that focuses on the similarity of consecutive word sequences. SPICE stands for Semantic propositional image caption evaluation and is a metric that focuses on the dependency structure between words. SPIDEr is a linear combination of SPICE and CIDEr and is a metric that measures the similarity to the reference description corresponding to the input sound.

[0036] Furthermore, the mean (Mean) and the Claimed score were calculated for METEOR, CIDEr, SPICE, and SPIDEr. The standard deviation (Std) was calculated along with the mean. Claimed represents the best score. The scores in the comparative example are taken from Non-Patent Document 1, while the scores in this embodiment are those obtained when SPIDEr was highest among language models trained with six different seeds.
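One plausible reading of this evaluation protocol is sketched below. SPIDEr is computed as the equal-weight average of CIDEr and SPICE, which is its usual definition; the per-seed scores are placeholder values invented for the example and are not results from the patent. The use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
import statistics


def spider(cider, spice):
    """SPIDEr as the equal-weight linear combination of CIDEr and SPICE."""
    return (cider + spice) / 2


# Placeholder scores for six seeds (NOT experimental results).
runs = [
    {"seed": 0, "cider": 0.40, "spice": 0.10},
    {"seed": 1, "cider": 0.50, "spice": 0.20},
    {"seed": 2, "cider": 0.45, "spice": 0.15},
    {"seed": 3, "cider": 0.42, "spice": 0.12},
    {"seed": 4, "cider": 0.48, "spice": 0.18},
    {"seed": 5, "cider": 0.44, "spice": 0.14},
]
for run in runs:
    run["spider"] = spider(run["cider"], run["spice"])

# Mean and Std over seeds, reported per metric.
summary = {
    metric: (statistics.mean(r[metric] for r in runs),
             statistics.stdev(r[metric] for r in runs))
    for metric in ("cider", "spice", "spider")
}

# "Claimed": the scores of the run whose SPIDEr is highest.
claimed = max(runs, key=lambda r: r["spider"])
```

Selecting the whole run by its SPIDEr, rather than taking the maximum of each metric independently, keeps the reported Claimed scores consistent with a single trained model.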

[0037] Figure 6 shows the experimental results. Comparing the comparative example with this embodiment, this embodiment outperformed the comparative example in all aspects: METEOR, CIDEr, SPICE, and SPIDEr.

[0038] Furthermore, in this embodiment, the number of quantization layers used by the discrete token generation unit 13 was varied; for each setting, discrete tokens were generated, acoustic descriptions were generated, and the scores for METEOR, CIDEr, SPICE, and SPIDEr were calculated. Figure 7 shows the experimental results. In this embodiment, increasing the number of quantization layers improved the scores, except for CIDEr when the number of quantization layers was changed from 4 to 8.

[0039] Based on the above, it can be seen that this embodiment improves the objective evaluation index of acoustic descriptions compared to the comparative example, EnCLAP.

[0040] (Other Embodiments) As described above, one embodiment of the present invention has been described in detail with reference to the drawings. However, the specific configuration is not limited to the above, and various design changes and the like can be made without departing from the gist of the present invention.

[0041] The processing by the discrete token generation device 10, the language model fine-tuning device 20, and/or the acoustic description generation device 30 in the above-described embodiment may be realized by a computer using software. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be read into a computer system and executed. Here, the "computer system" is assumed to include an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may also include something that dynamically holds a program for a short time, such as a communication line used when transmitting a program via a network such as the Internet or a communication line such as a telephone line, and something that holds a program for a certain period of time, such as volatile memory inside a computer system that serves as a server or a client in that case. The above program may realize only a part of the above-described functions, may realize them in combination with a program already recorded in the computer system, or may be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

[0042] 1. Language model fine-tuning system, 2. Acoustic description text generation system, 10. Discrete token generation device, 11. Acoustic signal acquisition unit, 12. Feature quantity extraction unit, 13. Discrete token generation unit, 14. Discrete token output unit, 20. Language model fine-tuning device, 30. Acoustic description text generation device

Claims

1. A discrete token generation device comprising: an acoustic signal acquisition unit for acquiring an acoustic signal; a feature extraction unit for extracting feature quantities from the acoustic signal using a general-purpose acoustic signal representation; and a discrete token generation unit for vector quantizing the extracted feature quantities and generating discrete tokens, wherein the discrete tokens are used for acoustic recognition and classification.

2. The discrete token generation device according to claim 1, wherein the feature extraction unit extracts the feature quantities of the acoustic signal using a DNN (Deep Neural Network).

3. The discrete token generation device according to claim 1, wherein the feature extraction unit extracts the feature quantities of the acoustic signal using a method having fixed parameters.

4. A language model fine-tuning system comprising: the discrete token generation device according to claim 1; and a language model fine-tuning device that fine-tunes a language model based on the discrete tokens generated by the discrete token generation device and captions corresponding to the acoustic signal.

5. The language model fine-tuning system according to claim 4, wherein the language model is BART.

6. An acoustic description generation system comprising: a discrete token generation device according to claim 1; and an acoustic description generation device that generates an acoustic description based on the discrete tokens generated by the discrete token generation device.

7. A discrete token generation method comprising: an acoustic signal acquisition step of acquiring an acoustic signal; a feature extraction step of extracting feature quantities from the acoustic signal using a general-purpose acoustic signal representation; and a discrete token generation step of vector quantizing the extracted feature quantities to generate discrete tokens.

8. A program that causes a computer to perform the method described in claim 7.