Speech synthesis device, speech synthesis method, and program
The speech synthesis device efficiently selects and generates speech based on expressive characteristics by dimensionally reducing and graphically representing speech features, addressing the challenge of voice selection in text-to-speech systems.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- CYBER AGENT
- Filing Date
- 2025-11-05
- Publication Date
- 2026-06-26
AI Technical Summary
Users of text-to-speech synthesis systems cannot efficiently search for voices that reflect desired expressive characteristics, as they cannot know in advance the expressive characteristics of the voices available.
A speech synthesis device and method that extracts speech features including expressive characteristics, reduces their dimensionality to three dimensions or less, plots these features on a graph, allows user selection, and generates synthesized speech based on the selected features and acquired text.
Enables users to efficiently find and generate synthesized speech that matches desired expressive characteristics, enhancing user experience by presenting voices based on their emotional and stylistic traits.
Description
Technical Field
[0001] The present invention relates to a speech synthesis device, a speech synthesis method, and a program.
Background Art
[0002] In recent years, the introduction of neural networks into text-to-speech (TTS) technology, which synthesizes natural speech from text, has made high-quality speech synthesis possible. Non-Patent Document 1 discloses a method of text-to-speech synthesis using a language model. In the method disclosed in Non-Patent Document 1, speech a few seconds long is used as input to the speech synthesis model. The generated synthesized speech reflects the expressive characteristics of the input speech.
Prior Art Documents
Non-Patent Documents
[0003]
Non-Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] To generate synthesized speech that reflects desired expressive characteristics, speech having those expressive characteristics is required. However, users of text-to-speech synthesis cannot know the expressive characteristics of the available speech in advance, so they cannot efficiently search for speech having the desired expressive characteristics. The problem, therefore, is that speech cannot be presented to users based on its expressive characteristics.
[0005] In view of the above circumstances, an object of the present invention is to provide a speech synthesis device, a speech synthesis method, and a program that can present speech to users based on its expressive characteristics.
Means for Solving the Problem
[0006] One aspect of the present invention is a speech synthesis device comprising: an extraction unit that extracts speech features including information about expressive characteristics, which are characteristics of speech, from speech; a dimensionality reduction unit that compresses the extracted speech features to three dimensions or less; a mapping unit that plots speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the compressed speech features, which are the dimensionality-reduced speech features; a display unit that displays the graph on which the speech points are plotted; a speech selection unit that selects the speech points; a speech determination unit that determines the selected speech, which is the speech corresponding to the selected speech points, as speech to be used for speech synthesis; a text acquisition unit that acquires text to be used for speech synthesis; and a speech synthesis unit that generates synthesized speech based on the determined speech or speech features extracted from the determined speech and the acquired text.
[0007] One aspect of the present invention is a speech synthesis method performed by a speech synthesizer, comprising the steps of: extracting speech features from speech, which include information about expressive characteristics that are features of speech; dimensionally compressing the extracted speech features to three dimensions or less; plotting speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the compressed speech features that are the dimensionally compressed speech features; displaying the graph on which the speech points are plotted; selecting the speech points; determining the selected speech, which corresponds to the selected speech points, as the speech to be used for speech synthesis; obtaining text to be used for speech synthesis; and generating synthesized speech based on the determined speech or speech features extracted from the determined speech and the obtained text.
[0008] One aspect of the present invention is a program for causing a computer to perform the following steps: extract speech features including information about expressive characteristics, which are characteristics of speech, from speech; compress the extracted speech features to three dimensions or less; plot speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the compressed speech features, which are the dimensionally compressed speech features; display the graph on which the speech points are plotted; select the speech points; determine the selected speech, which is the speech corresponding to the selected speech points, as the speech to be used for speech synthesis; obtain text to be used for speech synthesis; and generate synthesized speech based on the determined speech or the speech features extracted from the determined speech and the obtained text.
Effects of the Invention
[0009] The present invention makes it possible to provide a speech synthesis device, a speech synthesis method, and a program that can present speech to a user based on its expressive characteristics.
Brief Description of the Drawings
[0010]
[Figure 1] This figure shows an example of the configuration of a speech synthesis device in an embodiment.
[Figure 2] This figure shows an example of the configuration of the display control unit in the embodiment.
[Figure 3] This figure shows an example of the display screen configuration in the embodiment.
[Figure 4] This figure shows an example of a labeled graph in the embodiment.
[Figure 5] This figure shows an example of a tagged graph in the embodiment.
[Figure 6] This figure shows an example of generating speech features in an embodiment.
[Figure 7] This figure shows an example of plotting the selected points again in the embodiment.
[Figure 8] This is a flowchart showing an example of the operation of the speech synthesis device in the embodiment.
[Figure 9] This figure shows an example of the hardware configuration of an information processing device in an embodiment.
Modes for Carrying Out the Invention
[0011] Embodiments of the present invention will be described in detail with reference to the drawings. Figure 1 shows an example of the configuration of the speech synthesis device 1. The speech synthesis device 1 comprises a speech acquisition unit 11, an extraction unit 12, a dimensionality reduction unit 13, a display control unit 14, a display unit 15, a speech selection unit 16, a speech playback unit 17, a speech determination unit 18, a text acquisition unit 19, a generation count acquisition unit 20, a feature generation unit 21, a speech synthesis unit 22, a correction unit 23, a storage processing unit 24, and a storage device 25.
[0012] The speech acquisition unit 11 acquires speech uttered by the user (hereinafter referred to as "spoken speech"). The speech acquisition unit 11 is, for example, a microphone. The extraction unit 12 extracts, from speech, speech features that include information about expressive characteristics, which are characteristics of the speech. The speech from which speech features are extracted is, for example, speech data a few seconds long, such as a pre-stored speech sample. Expressive characteristics include, for example, the speaker's emotions, the speaker's way of speaking, the speech rate, the pitch range of the speech, the volume (accent) of the speech, the intonation of the speech, the tone of voice, and the intensity of the speech.
[0013] The speech from which speech features are extracted may also be spoken speech. This allows the user to use similar spoken speech for speech synthesis even if the speech samples do not include speech with the desired expressive characteristics (hereinafter referred to as "desired speech").
[0014] Furthermore, the speech from which speech features are extracted may also be synthesized speech generated by the speech synthesis device 1 (hereinafter referred to as "generated speech"). By plotting the speech point (described later) corresponding to the generated speech, the user can check whether the expressive characteristics of the generated speech are similar to those of the speech used for speech synthesis.
[0015] The extraction unit 12 includes, for example, a machine learning model. The machine learning model may be trained by self-supervised learning (SSL: Self-Supervised Learning). Examples of machine learning models trained by self-supervised learning include models such as wav2vec or HuBERT. At least a part of the configuration in a neural audio codec (NAC: Neural Audio Codec) may be used in the machine learning model. At least a part of the configuration in an audio language model may be used in the machine learning model.
[0016] The dimensionality reduction unit 13 compresses the speech features extracted from the speech (hereinafter referred to as the "speech features of the speech") to three dimensions or less. The dimensionality reduction unit 13 may also compress the speech features of the speech corresponding to a plurality of selected speech points (described later) to three dimensions or less. The dimensionality reduction unit 13 compresses the speech features by, for example, principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE).
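As a minimal sketch of the compression step in [0016], the following Python code projects high-dimensional speech features onto their leading principal components using NumPy's singular value decomposition. The function name `reduce_to_low_dim`, the random 16-dimensional placeholder features, and the choice of two output dimensions are illustrative assumptions and do not come from the embodiment.

```python
import numpy as np

def reduce_to_low_dim(features: np.ndarray, n_dims: int = 2) -> np.ndarray:
    """Project (n_samples, n_features) speech features onto the top principal components."""
    if not 1 <= n_dims <= 3:
        raise ValueError("the embodiment compresses to three dimensions or less")
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T

rng = np.random.default_rng(0)
speech_features = rng.normal(size=(50, 16))  # placeholder for extracted speech features
compressed = reduce_to_low_dim(speech_features, n_dims=2)
print(compressed.shape)
```

Each row of `compressed` would serve as the coordinates of one speech point on the graph. A t-SNE implementation could be substituted for the PCA step where a nonlinear embedding is preferred.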
[0017] The display control unit 14 controls information regarding the graph displayed on the display screen. For example, the display control unit 14 plots speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the dimensionality-reduced speech features (hereinafter referred to as "compressed features"). As a result, speech points plotted close to each other correspond to speech with similar expressive characteristics.
[0018] The display unit 15 displays the graph on which the speech points are plotted. The display unit 15 may highlight the speech points selected by the speech selection unit 16, described later (hereinafter referred to as "selected points"). The display unit 15 includes, for example, a display device such as an organic electroluminescence (EL) display or a liquid crystal display.
[0019] The speech selection unit 16 selects speech points plotted on the graph, for example, based on input from the user. The speech selection unit 16 may select a plurality of speech points.
[0020] The speech playback unit 17 plays back the speech corresponding to a selected point (hereinafter referred to as "selected speech"), so the user can listen to the speech corresponding to a plotted speech point. The user can thereby search for the desired speech by referring to both the position at which a speech point is plotted and the expressive characteristics of the corresponding speech.
[0021] The speech determination unit 18 determines the selected speech as the speech to be used for speech synthesis. The text acquisition unit 19 acquires the text to be used for speech synthesis. The generation count acquisition unit 20 acquires the number of synthesized speeches to be generated. The feature generation unit 21 generates, based on the speech features of the speech, speech features corresponding to locations on the graph where no speech points are plotted.
[0022] The speech synthesis unit 22 generates synthesized speech based on the acquired text and either the determined speech or the speech features of the determined speech. The speech synthesis unit 22 may generate as many synthesized speeches as the number acquired by the generation count acquisition unit 20. The speech synthesis unit 22 may also generate synthesized speech based on the speech features generated by the feature generation unit 21 and the acquired text.
[0023] The speech synthesis unit 22 includes, for example, a speech synthesis model using a neural network. The speech synthesis model is, for example, VALL-E or CosyVoice2. Note that with a speech synthesis model using a neural network, the expressive characteristics of the generated speech are not necessarily the same even when the input speech and text are the same.
[0024] The correction unit 23 corrects the generated speech based on instructions from the user. The correction may include, for example, a change in the intensity of the generated speech. For example, the correction unit 23 may correct the amplitude of the generated speech by normalization or by a volume scaling function based on the root mean square (RMS) value of the speech data. The correction may also include, for example, a change in the pitch of the generated speech. For example, the correction unit 23 may correct the pitch of the generated speech by frequency axis transformation on the mel spectrogram using the Pitch Synchronous Overlap and Add (PSOLA) method or the WORLD method.
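The RMS-based volume scaling mentioned in [0024] can be sketched as follows. This is only an illustration under stated assumptions: the function name `scale_to_rms`, the target value of 0.1, the floating-point sample range of -1.0 to 1.0, and the 220 Hz test tone are all hypothetical and are not specified in the embodiment.

```python
import numpy as np

def scale_to_rms(samples: np.ndarray, target_rms: float) -> np.ndarray:
    """Scale a waveform so its RMS value equals target_rms (unless clipping occurs)."""
    current_rms = np.sqrt(np.mean(np.square(samples)))
    if current_rms == 0.0:
        return samples.copy()  # silent input: nothing to scale
    scaled = samples * (target_rms / current_rms)
    # Assumed sample range of [-1.0, 1.0]; clipping guards against overflow on export.
    return np.clip(scaled, -1.0, 1.0)

t = np.linspace(0.0, 1.0, 16000, endpoint=False)
wave = 0.2 * np.sin(2.0 * np.pi * 220.0 * t)  # stand-in for generated speech
scaled = scale_to_rms(wave, target_rms=0.1)
```

A user instruction such as "make it quieter" could then be mapped to a lower `target_rms`; how instructions are mapped to parameters is left open here.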
[0025] The correction unit 23 may perform corrections based on the extracted speech features. Alternatively, the correction unit 23 may perform various corrections through function calls generated from semantic features, which are the results of analyzing the user interaction. The user interaction is analyzed, for example, by a natural language processing unit (not shown).
[0026] The storage processing unit 24 stores the generated speech or the corrected generated speech. The storage device 25 stores, for example, speech samples.
[0027] Figure 2 shows an example of the configuration of the display control unit 14 in the embodiment. The display control unit 14 includes a mapping unit 141, a labeling unit 142, and a tagging unit 143.
[0028] The mapping unit 141 plots speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the compressed features. The mapping unit 141 may also plot speech points corresponding to a plurality of selected speeches on a graph, based on the compressed features obtained by compressing the speech features of those selected speeches.
[0029] The labeling unit 142 assigns labels to the coordinate axes of the graph that represent the features of each dimension in the compressed feature. The features of each dimension include, for example, the speaker's speaking style, the speaker's acting, the speaker's emotions, the intensity of the voice, and the voice quality.
[0030] The tagging unit 143 assigns, in the vicinity of a speech point, a tag indicating information about the speech corresponding to that speech point. The information about the speech is, for example, information about voice quality, such as whether the voice is loud.
[0031] Figure 3 shows an example of the configuration of the display screen 3 in the embodiment. The display screen 3 comprises a graph 31, a selected speech screen 32, a text input field 33, a quantity input field 34, a synthesis button 35, a synthesized speech screen 36, and a saved speech screen 37 as screen components. The graph 31 comprises speech points 301 and selected points 302. The selected speech screen 32 comprises a play button 321. The synthesized speech screen 36 comprises a play button 361 and a save button 362. The saved speech screen 37 comprises a play button 371 and a delete button 372.
[0032] Graph 31 plots the speech points 301 corresponding to the speech. Although graph 31 shown in Figure 3 is a two-dimensional graph, graph 31 may also be a one-dimensional or three-dimensional graph.
[0033] When the user clicks on a speech point on graph 31, the speech corresponding to the clicked point is selected. The selected point 302 is highlighted on graph 31. The selected speech screen 32 displays information about operations on the selected speech. For example, when the user clicks the play button 321, the selected speech is played.
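One way the click-to-select behavior in [0033] might be realized is a nearest-point hit test, sketched below. The function name `pick_speech_point`, the distance threshold `max_distance`, and the sample coordinates are assumptions made for illustration; the embodiment does not specify how a click is matched to a point.

```python
import numpy as np

def pick_speech_point(points, click, max_distance=0.05):
    """Return the index of the plotted point nearest to the click, or None if none is close enough."""
    points = np.asarray(points, dtype=float)
    distances = np.linalg.norm(points - np.asarray(click, dtype=float), axis=1)
    nearest = int(np.argmin(distances))
    return nearest if distances[nearest] <= max_distance else None

plotted = [(0.10, 0.20), (0.50, 0.50), (0.90, 0.30)]   # hypothetical graph coordinates
selected = pick_speech_point(plotted, click=(0.52, 0.49))  # lands near the second point
missed = pick_speech_point(plotted, click=(0.30, 0.90))    # empty area of the graph
```

Returning `None` for a click in an empty area lets the caller distinguish a selection from a click on a location where no point is plotted.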
[0034] The text input field 33 is where the text to be used for speech synthesis is entered. The quantity input field 34 is where information about the number of synthesized speeches to be generated is entered. For example, the user enters text into the text input field 33 using an input device such as a keyboard. In Figure 3, "5" is entered in the quantity input field 34. That is, the setting is to generate synthesized speech 5 times. For example, the user enters the quantity information into the quantity input field 34 using an input device such as a keyboard.
[0035] Furthermore, the number of synthesized speeches to be generated may be automatically determined based on the characteristics of the speech synthesis model provided by the speech synthesis unit 22, the input text, and the selected speech. The characteristics of the speech synthesis model are, for example, the number of parameters. In other words, the number of synthesized speeches to be generated is determined according to the randomness of the generated synthesized speeches. This means that, for example, more synthesized speeches will be generated if the randomness of the generated synthesized speeches is higher.
[0036] When the user clicks the synthesis button 35, synthesized speech is generated based on the text entered in the text input field 33 and the selected voice. The synthesized speech screen 36 displays information related to the operation of the generated speech. For example, when the user clicks the play button 361, the generated speech is played. When the user clicks the save button 362, the generated speech is saved to the storage device 25.
[0037] In Figure 3, since "5" is entered in the quantity input field 34, the synthesized speech is generated 5 times. Here, depending on the configuration of the speech synthesis model, even if the voice and text used for input are the same, the expressive characteristics of the generated speech may not be the same. Therefore, by generating synthesized speech multiple times, the user can save the most suitable generated speech from among those with different expressive characteristics.
[0038] The saved speech screen 37 displays information about operations on the saved generated speech. For example, the user can play the saved generated speech by clicking the play button 371 and delete it by clicking the delete button 372.
[0039] Figure 4 shows an example of graph 31 with labels 311 and 312 assigned. Label 311, assigned to the vertical axis, reads "Emotion." Label 312, assigned to the horizontal axis, reads "Intensity." Each label corresponds to the feature represented by one dimension of the compressed features. That is, in graph 31 as illustrated in Figure 4, speech points plotted higher up correspond to speech in which the speaker's emotion is heightened, and speech points plotted further to the right correspond to speech with higher intensity. By referring to the labels, the user can, for example, identify speech points that correspond to speech with higher intensity while the speaker's emotion is maintained. Furthermore, users can search for desired speech by considering the positions at which two or more speech points are plotted.
[0040] Labels are assigned by the labeling unit 142 based on, for example, the locations where the speech points are plotted and the expressive characteristics of the speech. Labels may also be assigned in advance by a user who understands what each dimension of the compressed features represents. What each dimension represents can be determined, for example, by playing back the speech. It may also be determined from the method by which the dimensionality reduction unit 13 compresses the speech features, for example, a method that compresses the speech features into acting components and voice quality components. It may also be determined from a description output by an audio language model. The audio language model outputs a description of the speech based on, for example, the locations where the speech points are plotted and the expressive characteristics of the speech. The audio language model may also generate labels describing what each dimension represents, based on the same information.
[0041] Figure 5 shows an example of graph 31 with tags 313 and 314 assigned. Tag 313 reads "Loud voice". Tag 314 reads "Sounds fun". Each tag indicates information about the speech corresponding to the speech points in its vicinity. By referring to the tags, the user can roughly locate the speech point corresponding to the desired speech.
[0042] Tags are assigned by the tagging unit 143 based on, for example, classification results calculated in advance for each speech. The classification results are calculated using, for example, the k-nearest neighbor algorithm (k-NN), a support vector machine (SVM), or the YAMNet model. Tags may also be assigned in advance by a user who understands the expressive characteristics of the speech corresponding to each speech point. Furthermore, a tag may contain a description of the speech output by an audio language model such as Gemini® or Qwen-Audio (Qwen is a registered trademark).
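To illustrate the k-NN option named in [0042], the sketch below assigns a tag to a speech by majority vote among its nearest labeled neighbors. The function name `knn_tag`, the two-dimensional feature vectors, and the hand-labeled examples are hypothetical; only the tag strings "Loud voice" and "Sounds fun" are taken from Figure 5.

```python
from collections import Counter

import numpy as np

def knn_tag(query, labeled_features, labels, k=3):
    """Return the majority tag among the k labeled features nearest to the query."""
    diffs = np.asarray(labeled_features, dtype=float) - np.asarray(query, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical pre-labeled speech features and their tags.
labeled_features = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
labels = ["Loud voice", "Loud voice", "Loud voice", "Sounds fun", "Sounds fun"]
tag = knn_tag([0.7, 0.3], labeled_features, labels, k=3)
```

In practice the classification would likely run on the extracted speech features before compression rather than on toy vectors, and the results would be computed in advance as the paragraph describes.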
[0043] Figure 6 shows an example of generating speech features in the embodiment. The user clicks on position 315, which is a location where no speech point is plotted. The feature generation unit 21 generates speech features corresponding to position 315 based on the speech features of the speech corresponding to speech points within a predetermined range 316 from position 315. For example, the speech features corresponding to position 315 may be generated by inverse distance weighting (IDW) using the speech points within the predetermined range 316. When speech features are generated by inverse distance weighting, all speech points on graph 31 may be used instead of only those within the predetermined range 316.
[0044] The speech features generated by the feature generation unit 21 are used as input to the speech synthesis unit 22. This allows the user to generate synthesized speech that reflects the desired expressive characteristics, even if the desired speech is not among the available speech.
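The inverse distance weighting described in [0043] can be sketched as below. Weights are computed from distances on the graph, and the interpolation is applied to the original (uncompressed) speech features. The function name `idw_features`, the exponent `power=2.0`, and the toy coordinates and feature vectors are assumptions for illustration; the embodiment does not fix these values.

```python
import numpy as np

def idw_features(position, points, features, power=2.0, radius=None):
    """Interpolate a speech feature vector at `position` by inverse distance weighting.

    points:   (n, d) graph coordinates of the plotted speech points
    features: (n, m) speech features of the corresponding speech
    radius:   if given, only points within this distance are used (range 316);
              if None, all points on the graph are used
    """
    points = np.asarray(points, dtype=float)
    features = np.asarray(features, dtype=float)
    distances = np.linalg.norm(points - np.asarray(position, dtype=float), axis=1)
    if radius is not None:
        mask = distances <= radius
        if not mask.any():
            raise ValueError("no speech points within the given radius")
        features, distances = features[mask], distances[mask]
    exact = distances == 0.0
    if exact.any():
        return features[exact][0]  # clicked exactly on a plotted point
    weights = 1.0 / distances ** power
    return (weights[:, None] * features).sum(axis=0) / weights.sum()

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
generated = idw_features((1.0, 0.0), points, features, radius=1.5)
```

With `radius=1.5` only the two nearby points contribute, so the result is their equal-weight average; passing `radius=None` would let the distant third point contribute with a small weight.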
[0045] Figure 7 shows an example of replotting selected points in the embodiment. For example, when the user specifies a selection range 317 on graph 31, the speech points within the selection range 317 are selected. Alternatively, multiple speech points may be selected when the user clicks on them individually.
[0046] The dimensionality reduction unit 13 compresses the dimensionality of the speech features to three dimensions or less, for example, by principal component analysis using the speech features of the selected speech. The mapping unit 141 plots the speech points corresponding to the selected speech on graph 38 based on the compressed features, which have a reduced dimensionality. This allows the user to display only the speech points corresponding to speech similar to the desired speech. Therefore, even when there are many speech points, the user can efficiently find the desired speech.
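The replotting in [0046] amounts to recomputing the projection using only the selected speech. A minimal sketch follows, assuming a rectangular selection range and PCA via SVD; the function names `pca` and `replot_selection`, the random placeholder features, and the range bounds are hypothetical.

```python
import numpy as np

def pca(features, n_dims=2):
    """Project features onto their top principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T

def replot_selection(features, coords, x_range, y_range, n_dims=2):
    """Select points inside a rectangle and recompute PCA on their features only."""
    inside = (
        (coords[:, 0] >= x_range[0]) & (coords[:, 0] <= x_range[1])
        & (coords[:, 1] >= y_range[0]) & (coords[:, 1] <= y_range[1])
    )
    indices = np.flatnonzero(inside)
    if indices.size < max(2, n_dims):
        raise ValueError("too few speech points selected to recompute the projection")
    return indices, pca(features[indices], n_dims)

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 16))   # placeholder speech features
coords = pca(features)                  # first plot (graph 31)
indices, new_coords = replot_selection(features, coords, (-1.0, 1.0), (-1.0, 1.0))
```

Because the principal axes are refit to the subset, differences among the selected speech are spread over the whole new graph instead of being crowded into one region of the original one.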
[0047] Figure 8 is a flowchart illustrating an example of the operation of the speech synthesis device 1 in the embodiment. The extraction unit 12 extracts speech features from pre-stored speech (step S101). The dimensionality reduction unit 13 compresses the extracted speech features to three dimensions or less (step S102). The mapping unit 141 plots the speech points on a graph based on the compressed features (step S103). The display unit 15 displays the graph on which the speech points are plotted (step S104). The speech selection unit 16 selects speech points (step S105). The speech determination unit 18 determines the selected speech to be used for speech synthesis (step S106). The text acquisition unit 19 acquires the text to be used for speech synthesis (step S107). The speech synthesis unit 22 generates synthesized speech based on the determined speech and the acquired text (step S108).
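The order of steps S101 to S108 can be sketched end to end as follows. Every function here is a hypothetical stand-in: `extract_features` uses simple frame statistics in place of a learned model, `synthesize` returns a record instead of audio, and display and user input are reduced to function arguments. The sketch shows only the data flow, not the embodiment's actual components.

```python
import numpy as np

def extract_features(waveform):
    # S101 stand-in: frame energy and zero-crossing statistics instead of a learned encoder.
    frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
    energy = np.log(np.mean(frames ** 2, axis=1) + 1e-8)
    zero_cross = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.array([energy.mean(), energy.std(), zero_cross.mean(), zero_cross.std()])

def reduce_dims(features, n_dims=2):
    # S102: PCA via SVD, compressing to three dimensions or less.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T

def synthesize(reference_waveform, text):
    # S108 stand-in: a real system would call a speech synthesis model here.
    return {"text": text, "reference_samples": len(reference_waveform)}

def run_pipeline(samples, selected_index, text):
    features = np.stack([extract_features(w) for w in samples])  # S101
    coords = reduce_dims(features)                               # S102
    graph = {"points": coords}                                   # S103 (S104 would render it)
    chosen = samples[selected_index]                             # S105, S106
    return graph, synthesize(chosen, text)                       # S107, S108

rng = np.random.default_rng(2)
samples = [rng.normal(scale=s, size=1600) for s in (0.1, 0.2, 0.3, 0.4, 0.5)]
graph, result = run_pipeline(samples, selected_index=3, text="Hello")
```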
[0048] As described above, the extraction unit 12 extracts speech features from the speech, which include information about the expressive characteristics that are features of the speech. The dimensionality reduction unit 13 compresses the extracted speech features to three dimensions or less. The mapping unit 141 plots the speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the compressed features which are the dimensionality-reduced speech features. The display unit 15 displays the graph on which the speech points are plotted. The speech selection unit 16 selects the plotted speech points. The speech determination unit 18 determines the selected speech to be used for speech synthesis. The text acquisition unit 19 acquires the text to be used for speech synthesis. The speech synthesis unit 22 generates synthesized speech based on the determined speech or the speech features extracted from the determined speech and the acquired text.
[0049] In this way, the mapping unit 141 plots the speech points corresponding to the speech on a graph with the same dimensions as the compressed features, based on the compressed features. This makes it possible to present the speech to the user based on its expressive characteristics.
[0050] (Example hardware configuration) Figure 9 shows an example of the hardware configuration of the information processing device 6 in the embodiment. Some or all of the functional units of the information processing device 6 are realized as software by a processor 61, such as a CPU (Central Processing Unit), executing a program stored in a storage unit 62 having a non-volatile recording medium (non-transitory recording medium). The program may be recorded on a computer-readable recording medium. Computer-readable recording media are non-transitory recording media and include, for example, portable media such as flexible disks, magneto-optical disks, ROMs (Read Only Memory), and CD-ROMs (Compact Disc Read Only Memory), as well as storage devices such as hard disks and solid-state drives built into computer systems. The program may also be received by the communication unit 63 via a communication line.
[0051] Some or all of the functional units of the information processing device 6 may be implemented using hardware including electronic circuits (or circuits) such as LSI (Large Scale Integrated Circuit), ASIC (Application Specific Integrated Circuit), PLD (Programmable Logic Device), or FPGA (Field Programmable Gate Array).
[0052] Although embodiments of this invention have been described in detail above with reference to the drawings, the specific configuration is not limited to these embodiments and includes designs and the like that do not depart from the gist of this invention. For example, the speech synthesis device 1 may receive speech, together with speech features extracted by a terminal other than the speech synthesis device 1, and perform the above-described processing based on the received speech features and speech.
Industrial Applicability
[0053] The present invention can be applied, for example, to a speech synthesis device that can present speech to a user based on its expressive characteristics.
Explanation of Symbols
[0054] 1...Speech synthesis device, 3...Display screen, 11...Speech acquisition unit, 12...Extraction unit, 13...Dimensionality reduction unit, 14...Display control unit, 15...Display unit, 16...Speech selection unit, 17...Speech playback unit, 18...Speech determination unit, 19...Text acquisition unit, 20...Generation count acquisition unit, 21...Feature generation unit, 22...Speech synthesis unit, 23...Correction unit, 24...Storage processing unit, 25...Storage device, 31...Graph, 32...Selected speech screen, 33...Text input field, 34...Quantity input field, 35...Synthesis button, 36...Synthesized speech screen, 37...Saved speech screen, 141...Mapping unit, 142...Labeling unit, 143...Tagging unit, 301...Speech point, 302...Selected point, 311...Label, 312...Label, 313...Tag, 314...Tag, 315...Position, 316...Range, 317...Selection range, 321...Play button, 361...Play button, 362...Save button, 371...Play button, 372...Delete button
Claims
1. A speech synthesis device comprising: an extraction unit that extracts, from speech, speech features including information about expressive characteristics, which are characteristics of the speech; a dimensionality reduction unit that compresses the extracted speech features to three dimensions or less; a mapping unit that plots speech points corresponding to the speech on a graph with the same dimensions as compressed features, which are the dimensionality-reduced speech features, based on the compressed features; a display unit that displays the graph on which the speech points are plotted; a speech selection unit that selects the speech points; a speech determination unit that determines selected speech, which is the speech corresponding to the selected speech points, as speech to be used for speech synthesis; a text acquisition unit that acquires text to be used for the speech synthesis; and a speech synthesis unit that generates synthesized speech based on the determined speech or speech features extracted from the determined speech and the acquired text, wherein the speech selection unit selects a plurality of the speech points, the dimensionality reduction unit compresses the speech features extracted from the plurality of selected speeches to three dimensions or less, and the mapping unit plots speech points corresponding to the selected speeches on the graph based on the compressed features.
2. The speech synthesis device according to claim 1, wherein the speech is a pre-stored speech sample.
3. The speech synthesis device according to claim 1, wherein the speech is spoken speech of a user, the speech synthesis device further comprising a speech acquisition unit that acquires the spoken speech.
4. The speech synthesis device according to claim 1, wherein the speech is the generated synthesized speech.
5. The speech synthesis device according to claim 1, further comprising: a speech playback unit that plays back the selected speech; and a storage processing unit that stores the generated synthesized speech.
6. The speech synthesis device according to claim 1, further comprising a generation count acquisition unit that acquires the number of synthesized speeches to be generated, wherein the speech synthesis unit generates the acquired number of synthesized speeches.
7. The speech synthesis apparatus according to claim 1, further comprising a labeling unit that assigns labels indicating the features of each dimension in the compressed feature quantity to the coordinate axes of the graph.
8. The speech synthesis device according to claim 1, further comprising a tagging unit that attaches, in the vicinity of a speech point, a tag indicating information about the speech corresponding to that speech point.
9. The speech synthesis device according to claim 1, further comprising a correction unit that corrects the generated synthesized speech based on an instruction from a user.
10. The speech synthesis device according to claim 1, further comprising a feature generation unit that generates, based on the extracted speech features, speech features corresponding to a location on the graph where no speech point is plotted, wherein the speech synthesis unit generates synthesized speech based on the generated speech features and the acquired text.
11. A speech synthesis method performed by a speech synthesis device, the method comprising: a step of extracting, from speech, speech features including information on expressive characteristics that characterize the speech; a step of compressing the extracted speech features to three or fewer dimensions; a step of plotting, based on the compressed speech features, speech points corresponding to the speech on a graph having the same number of dimensions as the compressed speech features; a step of displaying the graph on which the speech points are plotted; a step of selecting a speech point; a step of determining the selected speech, which is the speech corresponding to the selected speech point, as the speech to be used for speech synthesis; a step of acquiring text to be used for the speech synthesis; and a step of generating synthesized speech based on the acquired text and on either the determined speech or speech features extracted from the determined speech, wherein the step of selecting a speech point includes selecting a plurality of speech points, the step of compressing the extracted speech features includes compressing the speech features extracted from the plurality of selected speeches to three or fewer dimensions, and the step of plotting the speech points includes plotting speech points corresponding to the selected speeches on the graph based on the compressed speech features.
12. A program that causes a computer to execute: a procedure of extracting, from speech, speech features including information on expressive characteristics that characterize the speech; a procedure of compressing the extracted speech features to three or fewer dimensions; a procedure of plotting, based on the compressed speech features, speech points corresponding to the speech on a graph having the same number of dimensions as the compressed speech features; a procedure of displaying the graph on which the speech points are plotted; a procedure of selecting a speech point; a procedure of determining the selected speech, which is the speech corresponding to the selected speech point, as the speech to be used for speech synthesis; a procedure of acquiring text to be used for the speech synthesis; and a procedure of generating synthesized speech based on the acquired text and on either the determined speech or speech features extracted from the determined speech, wherein, in the procedure of selecting a speech point, a plurality of speech points are selected, in the procedure of compressing the extracted speech features, the speech features extracted from the plurality of selected speeches are compressed to three or fewer dimensions, and in the procedure of plotting the speech points, speech points corresponding to the selected speeches are plotted on the graph based on the compressed speech features.
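As an informal illustration only, and not part of the claims, the selection of a plotted speech point (claim 1) and the generation of features for an unplotted location on the graph (claim 10) might be sketched as follows in Python. The claims do not name a selection rule or a generation method; nearest-point selection and inverse-distance weighting are assumptions chosen here for simplicity. The names `SpeechPoint`, `select_point`, and `generate_feature` are hypothetical, and the compressed coordinates are assumed to have already been produced by some dimensionality reduction method.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class SpeechPoint:
    """Hypothetical record for one plotted speech (not defined in the claims)."""
    name: str
    coords: tuple    # compressed speech features (three or fewer dimensions)
    features: tuple  # original, uncompressed speech features


def select_point(points, position):
    """Return the plotted point closest to the position chosen on the graph.

    Assumption: selection is by Euclidean nearest neighbour.
    """
    return min(points, key=lambda p: dist(p.coords, position))


def generate_feature(points, location, eps=1e-9):
    """Estimate speech features for a location where no point is plotted.

    Assumption: inverse-distance weighting of the original feature vectors;
    `eps` avoids division by zero when `location` coincides with a point.
    """
    weights = [1.0 / (dist(p.coords, location) + eps) for p in points]
    total = sum(weights)
    n_features = len(points[0].features)
    return [
        sum(w * p.features[i] for w, p in zip(weights, points)) / total
        for i in range(n_features)
    ]


# Illustrative data only: two speeches plotted on a two-dimensional graph.
voices = [
    SpeechPoint("calm", (0.0, 0.0), (1.0, 0.0)),
    SpeechPoint("excited", (2.0, 0.0), (0.0, 1.0)),
]
chosen = select_point(voices, (0.3, 0.1))       # point nearest the chosen position
blended = generate_feature(voices, (1.0, 0.0))  # features for an empty location
```

In this sketch, the features generated for a location midway between two plotted points are an even blend of the two original feature vectors, and the blend shifts toward whichever point is closer. A speech synthesis unit would then take such features together with the acquired text; that stage is not sketched here.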