Voice creation system and voice creation method
The voice generation system addresses the challenge of creating natural-sounding voices with pauses by ranking and concatenating phrase groups with breaths, improving user experience in scenarios like political campaigns and disaster prevention announcements.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- 平間 充
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-12
AI Technical Summary
Existing voice generation systems struggle to create natural-sounding voices in scenarios where short texts are repeatedly conveyed with pauses, such as in political campaigns, commercial solicitations, and disaster prevention announcements.
A voice generation system that ranks and selects multiple phrase groups based on predetermined ratios, concatenates selected phrases with breaths, and includes a repetition suppression mechanism to create natural-sounding voices.
The system effectively generates suitable audio for scenarios where short sentences are repeatedly conveyed with pauses, enhancing user experience and convenience.
Abstract
Description
【Technical Field】 【0001】 The present invention relates to a voice generation system and a voice generation method. In particular, it relates to a system for generating voice by combining divided phrases, and a method for generating such voice. 【Background Art】 【0002】 Conventionally, a device for vocalizing pre-stored phrases has been proposed (for example, see Patent Document 1). According to Patent Document 1, it is stated that the synthetic voice when reading out stereotypical phrases from an input string can be made closer to a more natural voice. 【Prior Art Documents】 【Patent Documents】 【0003】 【Patent Document 1】 Japanese Patent Application Laid-Open No. 11-249679 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0004】 In recent years, automatic voice needs related to Patent Document 1 have been increasing, and therefore, a voice generation system focused on a specific scenario has been demanded. For example, when it is desired to repeatedly convey a short text with pauses, the need for creating natural voice within a long text without pauses is low. Therefore, on one hand, there has been a situation where the provision of a system capable of suitably generating voice in such a specific scenario has been awaited. 【0005】 Therefore, an object of the present invention is to provide a voice generation system capable of suitably generating voice in a scenario where a short text is repeatedly conveyed with pauses. Another object of the present invention is to provide a voice generation method capable of suitably generating voice in a scenario where a short text is repeatedly conveyed with pauses. 【Means for Solving the Problems】 【0006】 One aspect of the present invention is illustrated below. 
[1] A voice creation system that creates voice from at least one phrase in a phrase group including one or more phrases, the system comprising: a storage unit that stores the phrases included in each of a plurality of mutually different phrase groups in association with the respective phrase groups; an inter-group selection unit that selects at least two of the phrase groups from the plurality of phrase groups based on a predetermined inter-group ratio; an intra-group selection unit that selects at least one of the phrases from the selected phrase groups based on a predetermined intra-group ratio; and a voice creation unit that creates voice by sequentially concatenating the selected phrases of the selected phrase groups with breaths in between.
[2] The system according to item 1, wherein the plurality of phrase groups are ranked, and the voice creation unit creates voice by sequentially concatenating the selected phrases of the selected phrase groups, in accordance with the ranking of the phrase groups, with breaths in between.
[3] The system according to item 1 or 2, wherein the plurality of phrase groups include at least three mutually different phrase groups ranked in the order of an upper phrase group, a middle phrase group, and a lower phrase group.
[4] The system according to any one of items 1 to 3, wherein the plurality of phrase groups include nine or fewer mutually different phrase groups.
[5] The system according to any one of items 1 to 4, further comprising a repetition suppression unit that detects an identical repetition of the selected phrases and suppresses opportunities to select the phrases that constitute the identical repetition.
[6] The system according to any one of items 1 to 5, wherein at least one of the inter-group ratio and the intra-group ratio can be manually changed.
[7] The system according to item 5, wherein at least one of the inter-group ratio and the intra-group ratio can be automatically changed based on the detection of the identical repetition.
[8] The system according to any one of items 1 to 7, wherein at least one of the inter-group ratio and the intra-group ratio can be automatically changed based on a log of the selected phrase groups and/or a log of the selected phrases.
[9] The system according to any one of items 1 to 8, further comprising an output device that outputs the created voice, wherein at least one selected from the group consisting of volume, speed, and intonation of the output voice can be automatically changed based on at least one selected from the group consisting of time of day, location information, and weather.
[10] The system according to item 3, applicable to political street campaigning, wherein the upper phrase group includes policy words as phrases, the middle phrase group includes names and/or party names as phrases, the lower phrase group includes greeting words as phrases, and the system can be set to: a policy mode in which the inter-group ratio at which the upper phrase group is selected is larger than the others; a notification mode in which the inter-group ratio at which the middle phrase group is selected is larger than the others; and a greeting mode in which the inter-group ratio at which the lower phrase group is selected is larger than the others.
[11] The system according to any one of items 1 to 10, wherein the plurality of phrase groups further include a specific phrase group including one or more specific phrases suited to a situation relating to at least one selected from the group consisting of time of day, location information, and weather, and a specific mode can be set in which the inter-group ratio at which the specific phrase group is selected is larger than the others.
[12] A voice generation method for generating voice of at least one of a plurality of phrases included in a phrase group including one or more phrases, comprising: An inter-group selection step of selecting at least two of the phrase groups based on a predetermined inter-group ratio from a plurality of different phrase groups; An intra-group selection step of selecting at least one of the phrases based on a predetermined intra-group ratio from the selected phrase groups; A voice generation step of sequentially connecting the selected phrases among the selected phrase groups with pauses to generate voice; A voice generation method having the above. [Advantages of the Invention] 【0007】 According to the present invention, it is possible to provide a voice generation system capable of suitably generating voice in a scene where a short sentence is repeatedly conveyed with pauses. Also, an object of the present invention is to provide a voice generation method capable of suitably generating voice in a scene where a short sentence is repeatedly conveyed with pauses. [Brief Description of the Drawings] 【0008】 [Figure 1] A diagram showing an outline of the system according to Embodiment 1. [Figure 2] A block diagram showing a configuration example of the system according to Embodiment 1. [Figure 3] A flowchart for explaining an example of the flow of the method according to Embodiment 1. [Modes for Carrying Out the Invention] 【0009】 Hereinafter, an aspect of the present invention will be described with reference to the drawings. In this specification, the term "step" is not limited to independent steps; therefore, even if it is not clearly distinguishable from other steps, any manner in which the function of that step is achieved is included in the term "step." The same applies to the term "process." Furthermore, in the drawings, the scale, shape, and length may be exaggerated for the sake of clarity. 
【0010】 One aspect of the present invention is specialized for situations in which short sentences are repeatedly conveyed with breaths in between. Examples of such situations include political campaigns, commercial solicitations, tourist information, disaster prevention announcements, and announcements within public facilities. It is also well suited to the entertainment field. In general, a challenge in voice creation technology is to create natural-sounding voice within long sentences without pauses, but that need is low in the situations described above. According to one aspect of the present invention, voice can be suitably created in the above situations; that is, the system is easy to use and convenient.
【0011】 [Embodiment 1] <Voice Creation System> ≪Overview≫ The voice creation system according to this embodiment (hereinafter sometimes simply referred to as "this system") is a voice creation system that creates voice from at least one phrase in a phrase group including one or more phrases, and comprises: a storage unit that stores the phrases included in each of a plurality of mutually different phrase groups in association with the respective phrase groups; an inter-group selection unit that selects at least two of the phrase groups from the plurality of phrase groups based on a predetermined inter-group ratio; an intra-group selection unit that selects at least one of the phrases from the selected phrase groups based on a predetermined intra-group ratio; and a voice creation unit that creates voice by sequentially concatenating the selected phrases of the selected phrase groups with breaths in between. According to this, voice can be suitably created for situations in which short sentences are repeatedly conveyed with breaths in between.
【0012】 Figure 1 shows an overview of this system. Here, an overview is shown of the system 100 intended for application to political street campaigning.
In political street campaigning, policies, names (names or party names), and greetings are generally conveyed repeatedly in short sentences with breaths in between. One aspect of this system is suitable for creating audio in such situations, meaning it is easy to use and convenient. 【0013】 This system 100 generates speech from at least one phrase f from a phrase group Gf that includes one or more phrases f. In this system, as a prerequisite, the phrases f included in each of several different phrase groups Gf are stored and linked to the respective phrase groups Gf. Here, "several different phrase groups" refers to phrase groups in which the phrases f included in the phrase groups do not exactly match or do not contain the same phrase f. 【0014】 In one embodiment, the system 100 selects at least two phrase groups Gf from a plurality of phrase groups Gf based on a predetermined inter-group ratio. Then, at least one phrase f is selected from the selected phrase groups Gf based on a predetermined intra-group ratio. Subsequently, the selected phrase f from the selected phrase groups Gf are sequentially linked together with breaths in between to create speech. 【0015】 In one embodiment, the system 100 is configured to rank multiple phrase groups Gf, and it is preferable to create audio by sequentially linking selected phrases f from the selected phrase group Gf, with breaths in between, according to the ranking of the phrase group Gf. Regarding the ranking, for example, it may include at least three different phrase groups Gf ranked in the order of upper phrase group Gf_Top, middle phrase group Gf_Mid, and lower phrase group Gf_Btm. Each of these arrangements makes it easier to create audio in which phrases appear sequentially in accordance with the listener's understanding. 【0016】 Furthermore, the number of phrase groups Gf to be memorized and ranked is not limited to three. Not all of the memorized phrase groups Gf are required to be ranked. 
In this regard, the limit of human short-term memory is generally said to be "7 plus or minus 2," and sometimes "4 plus or minus 1." Therefore, to stay within the broadest of these limits, "7 plus 2," it is preferable that the plurality of phrase groups Gf include nine or fewer mutually different phrase groups Gf. This makes it easier for the listener to understand the created series of voice.
【0017】 In an example of this system 100 intended for application to political street campaigning:
(1) the upper phrase group Gf_Top can include phrases f such as phrase f1: "We aim to reduce taxes.", phrase f2: "We will focus on supporting child-rearing.", and phrase f3: "We will reform social security.";
(2) the middle phrase group Gf_Mid can include phrases f such as phrase f1: "I belong to the △△ Party."; and
(3) the lower phrase group Gf_Btm can include phrases f such as phrase f1: "Thank you very much." and phrase f2: "I will do my best."
【0018】 Suppose that the phrase groups are ranked in the order of the upper phrase group Gf_Top, the middle phrase group Gf_Mid, and the lower phrase group Gf_Btm. Then, from the plurality of phrase groups Gf:
(1) When the upper phrase group Gf_Top and the middle phrase group Gf_Mid are selected, for example, phrase f1 included in the upper phrase group Gf_Top and phrase f1 included in the middle phrase group Gf_Mid are sequentially concatenated with a breath in between to create voice. As a result, a series of voice such as "We aim to reduce taxes." (breath) "I belong to the △△ Party." is created.
(2) When the upper phrase group Gf_Top and the lower phrase group Gf_Btm are selected, for example, phrase f2 included in the upper phrase group Gf_Top and phrase f1 included in the lower phrase group Gf_Btm are sequentially concatenated with a breath in between to create voice.
As a result, a series of voice such as "We will focus on supporting child-rearing." (breath) "Thank you very much." is created.
(3) When the middle phrase group Gf_Mid and the lower phrase group Gf_Btm are selected, for example, phrase f1 included in the middle phrase group Gf_Mid and phrase f2 included in the lower phrase group Gf_Btm are sequentially concatenated with a breath in between to create voice. As a result, a series of voice such as "I belong to the △△ Party." (breath) "I will do my best." is created.
(4) When all the phrase groups Gf are selected, for example, phrase f3 included in the upper phrase group Gf_Top, phrase f1 included in the middle phrase group Gf_Mid, and phrase f2 included in the lower phrase group Gf_Btm are sequentially concatenated with breaths in between to create voice. As a result, a series of voice such as "We will reform social security." (breath) "I belong to the △△ Party." (breath) "I will do my best." is created.
【0019】 ≪Voice Creation Device≫ Figure 2 is a schematic diagram of the voice creation device 10 that constitutes the system 100. The voice creation device 10 includes, for example, a storage unit 11, an inter-group selection unit 12, an intra-group selection unit 13, and a voice creation unit 14. It may further include a repetition suppression unit 15 and a mode input receiving unit 16. Each of these units is realized by a microcomputer executing a program.
【0020】 (Storage unit) In one embodiment, the storage unit 11 is composed of, for example, ROM (Read Only Memory) or RAM (Random Access Memory). The storage unit 11 can store, for example, a program executed in the system 100, and by executing the program, the computer can perform various controls related to voice creation.
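The selection and concatenation described in paragraphs 【0013】 to 【0018】 can be illustrated with a minimal Python sketch. It is not the implementation of this system: the data layout (`PHRASE_GROUPS`), the function names, and the use of weighted random sampling to realize the inter-group ratio and intra-group ratio are all assumptions made for illustration, and a breath is represented only by a text marker rather than by actual silence in audio.

```python
import random

# Hypothetical storage layout: groups are listed in rank order (upper, middle,
# lower). "ratio" is the inter-group ratio; each phrase maps to its
# intra-group ratio. All ratios start equal, as in paragraphs 0025 and 0027.
PHRASE_GROUPS = [
    {"name": "Gf_Top", "ratio": 1.0, "phrases": {
        "We aim to reduce taxes.": 1.0,
        "We will focus on supporting child-rearing.": 1.0,
        "We will reform social security.": 1.0,
    }},
    {"name": "Gf_Mid", "ratio": 1.0, "phrases": {
        "I belong to the △△ Party.": 1.0,
    }},
    {"name": "Gf_Btm", "ratio": 1.0, "phrases": {
        "Thank you very much.": 1.0,
        "I will do my best.": 1.0,
    }},
]
BREATH = "(breath)"  # stand-in for a pause of, e.g., 0.5 to 2 seconds


def select_groups(groups, count, rng):
    """Pick `count` distinct groups, weighted by inter-group ratio.

    The result is returned in rank order (the order of `groups`).
    """
    candidates = list(range(len(groups)))
    chosen = []
    for _ in range(count):
        weights = [groups[i]["ratio"] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # a group is selected at most once
    return [groups[i] for i in sorted(chosen)]


def select_phrase(group, rng):
    """Pick one phrase from a group, weighted by intra-group ratio."""
    phrases = list(group["phrases"])
    weights = [group["phrases"][p] for p in phrases]
    return rng.choices(phrases, weights=weights, k=1)[0]


def create_utterance(groups, rng, count=2):
    """Select at least two groups, then one phrase from each, in rank order."""
    return [select_phrase(g, rng) for g in select_groups(groups, count, rng)]


def render(phrases):
    """Concatenate the selected phrases with a breath marker in between."""
    return f" {BREATH} ".join(phrases)


if __name__ == "__main__":
    rng = random.Random()
    print(render(create_utterance(PHRASE_GROUPS, rng, count=2)))
```

Sampling groups without replacement and then sorting them by rank is one simple way to satisfy both "select at least two phrase groups" and "concatenate in accordance with the ranking"; other selection schemes would fit the description equally well.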
【0021】 In one embodiment, the storage unit 11 stores, for example, the phrases included in each of a plurality of mutually different phrase groups in association with the respective phrase groups. Each phrase group includes one or more phrases, and the phrases included in each phrase group are stored so that the phrase group to which they belong can be identified.
【0022】 Furthermore, in one embodiment, the storage unit 11 can store, for example, the inter-group ratio, the intra-group ratio, various thresholds and reference values, set values, modes, and the like used to perform the functions of this system. The storage unit 11 can also store, for example, log (record) information on the phrase groups and phrases selected based on the inter-group ratio and the intra-group ratio.
【0023】 The storage structure for storing phrases or phrase groups is not limited. A phrase group folder containing the phrases may be stored, or the phrases themselves may be stored with tags, classifications, identifiers, or the like indicating their phrase group. Databases, tables, and the like may be used for storage as appropriate.
【0024】 (Inter-group selection unit) The inter-group selection unit 12 selects at least two phrase groups from the plurality of phrase groups based on a predetermined inter-group ratio. For example, it reads the inter-group ratio from the storage unit 11 and selects two or more target phrase groups from all the phrase groups based on that inter-group ratio.
【0025】 The inter-group ratio corresponds to the ratio at which a target phrase group is selected relative to the other phrase groups. For example, if the ratio of upper phrase group Gf_Top : middle phrase group Gf_Mid : lower phrase group Gf_Btm is 1:1:1, then theoretically each phrase group is selected with equal probability.
As the initial value of the inter-group ratio, a ratio at which each phrase group is selected with equal probability is stored in the storage unit 11, but this ratio can be updated as needed.
【0026】 (Intra-group selection unit) The intra-group selection unit 13 selects at least one phrase from the selected phrase groups based on a predetermined intra-group ratio. For example, it reads the intra-group ratio from the storage unit 11 and selects one or more target phrases from each selected phrase group based on that intra-group ratio. The number of target phrases selected from one phrase group may be three or fewer, two or fewer, or one.
【0027】 The intra-group ratio corresponds to the ratio at which a target phrase is selected within a phrase group. For example, in a phrase group of three phrases, if phrase f1 : phrase f2 : phrase f3 is 1:1:1, then theoretically each phrase is selected with equal probability. As the initial value of the intra-group ratio, a ratio at which each phrase is selected with equal probability is stored in the storage unit 11, but this ratio can be updated as needed. Note that the intra-group ratio may be set for each phrase group.
【0028】 (Voice creation unit) The voice creation unit 14 creates voice by sequentially concatenating the selected phrases of the selected phrase groups with breaths in between. For example, it receives information on the selected phrase groups and the selected phrases from the inter-group selection unit 12 and the intra-group selection unit 13, reads the target phrases from the storage unit 11, and sequentially concatenates them with breaths in between. A series of voice is thereby created.
【0029】 A breath is a period of time that the listener perceives as a pause for breathing, for example, 0.5 to 2 seconds, but is not limited to this length. The length of a breath may be set as appropriate and may be adjustable.
Breaths broadly include silences and pauses between phrases, as well as boundaries between beats.
【0030】 Information on the created voice is transmitted, for example, to the output device 20 (a speaker or the like) by wire or wirelessly and is played back as voice. The voice creation device 10 repeatedly creates a series of voice and continuously transmits information on that voice, so that the voice is automatically and continuously played back from the output device 20.
【0031】 (Repetition suppression unit) In one embodiment, the voice creation device 10 may include a repetition suppression unit 15. The repetition suppression unit 15 detects an identical repetition in the concatenation of the selected phrases. For example, the repetition suppression unit 15 receives information on the selected phrases from the intra-group selection unit 13 and the voice creation unit 14, and detects an identical repetition in their concatenation.
【0032】 For example, the repetition suppression unit 15 can detect that the same concatenation of selected phrases has been repeated a predetermined number of times or more (for example, three times or more), or that it has been repeated a predetermined number of times or more within a predetermined period (for example, five times or more within five minutes).
【0033】 The repetition suppression unit 15 may detect a repetition in at least part of the concatenation within the series of voice created from the selected phrases. In this case, for example, regardless of which phrase is selected in the lower phrase group Gf_Btm, if a predetermined phrase included in the upper phrase group Gf_Top (e.g., phrase f3) and a predetermined phrase included in the middle phrase group Gf_Mid (e.g., phrase f1) are repeatedly selected, the repetition suppression unit 15 can detect this.
【0034】 On the other hand, the repetition suppression unit 15 may detect a repetition across the entire concatenation of the series of voice created from the selected phrases.
In this case, for example, the repetition suppression unit 15 can detect that phrase f3 included in the upper phrase group Gf_Top, phrase f1 included in the middle phrase group Gf_Mid, and phrase f2 included in the lower phrase group Gf_Btm are repeatedly selected together.
【0035】 Furthermore, when the repetition suppression unit 15 detects an identical repetition, it can suppress opportunities to select the phrases that constitute that identical repetition.
【0036】 For example, it is preferable that the repetition suppression unit 15 lower the intra-group ratio at which at least one of the phrases constituting the identical repetition is selected, and/or the inter-group ratio of the phrase group that includes that phrase (that is, update the intra-group ratio and/or inter-group ratio stored in the storage unit 11 to a lower value). By making it possible in this way to automatically change at least one of the inter-group ratio and the intra-group ratio based on the detection of an identical repetition, it becomes easier to deliver a variety of voice to the listener.
【0037】 In one embodiment, the repetition suppression unit 15 can update the intra-group ratio at which a phrase involved in the repetition is selected, and/or the inter-group ratio of the phrase group including that phrase, to one half or less, one third or less, or one quarter or less of the current value.
【0038】 Information on previously selected phrase groups and phrases can be obtained by reading the phrase group log and the phrase log stored in the storage unit 11. An embodiment in which at least one of the inter-group ratio and the intra-group ratio can be automatically changed based on the log of the selected phrase groups and/or the log of the selected phrases is preferable because it makes it easier to deliver a variety of voice to the listener.
【0039】 Furthermore, the repetition suppression unit 15 can also cancel the creation and output of voice for the phrases that constitute the identical repetition. This makes it easier to reliably deliver a variety of voice to the listener.
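The repetition detection and ratio reduction of paragraphs 【0031】 to 【0037】 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the repetition suppression unit 15: the class name `RepetitionSuppressor`, the flat phrase-to-ratio dictionary, and the choice to count only consecutive identical sequences are hypothetical. The time-window variant of paragraph 【0032】 (a number of repetitions within a period) is not covered.

```python
from collections import deque


class RepetitionSuppressor:
    """Detect consecutive identical phrase sequences and lower their ratios.

    Assumptions for illustration: `max_repeats` is the predetermined number
    of repetitions (three in the example of paragraph 0032) and `factor` is
    the reduction applied to the intra-group ratio (one half here).
    """

    def __init__(self, max_repeats=3, factor=0.5):
        self.max_repeats = max_repeats
        self.factor = factor
        # Only the most recent `max_repeats` sequences need to be remembered.
        self.history = deque(maxlen=max_repeats)

    def is_repetition(self, phrases):
        """Record a sequence; return True if the last N sequences are identical."""
        self.history.append(tuple(phrases))
        return (len(self.history) == self.max_repeats
                and len(set(self.history)) == 1)

    def suppress(self, ratios, phrases):
        """Multiply the ratio of each phrase in the repetition by `factor`."""
        for phrase in phrases:
            if phrase in ratios:
                ratios[phrase] *= self.factor


if __name__ == "__main__":
    suppressor = RepetitionSuppressor()
    ratios = {"We aim to reduce taxes.": 1.0, "I will do my best.": 1.0}
    sequence = ["We aim to reduce taxes.", "I will do my best."]
    for _ in range(3):
        if suppressor.is_repetition(sequence):
            suppressor.suppress(ratios, sequence)
    print(ratios)
```

Lowering the ratio rather than forbidding the phrase keeps every phrase selectable, which matches the description of suppressing "opportunities" to select it; the cancellation of output mentioned in paragraph 【0039】 would be a separate, stricter measure.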
【0040】 Of course, at least one of the inter-group ratio and the intra-group ratio may be manually changeable. This makes it easier to vary the weighting of the phrases used in voice creation, and thus easier to create the desired voice depending on the situation.
【0041】 (Mode input receiving unit) In one embodiment, the voice creation device 10 may include a mode input receiving unit 16. The mode input receiving unit 16 can, for example, receive operation input from the user and update the inter-group ratio and/or the intra-group ratio to preset values suited to the situation. This makes it easier to create the desired voice according to the situation.
【0042】 Regarding this system 100 intended for application to political street campaigning, assume, for example, that the upper phrase group Gf_Top includes policy words as phrases f, the middle phrase group Gf_Mid includes names and/or party names as phrases f, and the lower phrase group Gf_Btm includes greeting words as phrases f.
【0043】 In this case, the mode input receiving unit 16 receives the user's operation input and can set and switch among: a policy mode, in which the inter-group ratio at which the upper phrase group Gf_Top is selected is larger than the others; a notification mode, in which the inter-group ratio at which the middle phrase group Gf_Mid is selected is larger than the others; and a greeting mode, in which the inter-group ratio at which the lower phrase group Gf_Btm is selected is larger than the others.
【0044】 For example, when the mode input receiving unit 16 receives from the user, via an input device 30 such as a touch panel or a switch, an operation signal for switching to the "policy mode," it updates the inter-group ratio at which the upper phrase group Gf_Top is selected, stored in the storage unit 11, to a set value that is larger than the inter-group ratios of the other phrase groups Gf.
The set value makes the inter-group ratio at which the upper phrase group Gf_Top is selected the largest of all (specifically, 80% or more, or 90% or more, etc.). The inter-group ratios of the remaining phrase groups Gf are updated by allocating the remaining share, excluding that of the upper phrase group Gf_Top, among them while keeping their ratios relative to one another unchanged.
【0045】 The same applies when an operation signal for switching to the "notification mode" or the "greeting mode" is received from the user. The inter-group ratio at which the middle phrase group Gf_Mid or the lower phrase group Gf_Btm, respectively, is selected is updated to a large value (specifically, 80% or more, or 90% or more, etc.), and the inter-group ratios of the remaining phrase groups Gf are updated while keeping their ratios relative to one another unchanged.
【0046】 The "policy words" included as phrases f in the upper phrase group Gf_Top include, for example, words such as "economy," "employment," "education," "childcare," "social security," "tax cuts," "environment," "energy," "diversity," "human rights," "local government," "diplomacy," and "security."
【0047】 The "names and/or party names" included as phrases f in the middle phrase group Gf_Mid include, for example, an individual's name or the name of the political party to which the individual belongs.
【0048】 The "greeting words" included as phrases f in the lower phrase group Gf_Btm include, for example, words such as "Thank you in advance," "I will do my best," "I apologize for the trouble," "Thank you for your time," and "We would appreciate your support."
【0049】 Furthermore, the user may be able to adjust at least one selected from the group consisting of volume, speed, and intonation of the output voice via the input device 30.
【0050】 One aspect of the present invention has been described above.
According to one aspect of the present invention, it is possible to provide a voice creation system and a voice creation method that can suitably create voice in situations where short sentences are repeatedly conveyed with breaths in between.
【0051】 <Voice Creation Method> ≪Overview≫ The voice creation method according to this embodiment (hereinafter sometimes simply referred to as "this method") is a voice creation method for creating voice from at least one phrase in a phrase group including one or more phrases, and comprises: an inter-group selection step of selecting at least two phrase groups from a plurality of mutually different phrase groups based on a predetermined inter-group ratio; an intra-group selection step of selecting at least one phrase from the selected phrase groups based on a predetermined intra-group ratio; and a voice creation step of creating voice by sequentially concatenating the selected phrases of the selected phrase groups with breaths in between. According to this, voice can be suitably created for situations in which short sentences are repeatedly conveyed with breaths in between.
【0052】 For each requirement of this method, refer to the explanation of the corresponding requirement of this system. Preferred embodiments of this system may also be preferred in this method.
【0053】 ≪Flow≫ Figure 3 is a flowchart illustrating an example of the flow of this method. In this method, first, at least two phrase groups are selected from a plurality of mutually different phrase groups based on a predetermined inter-group ratio (inter-group selection step S10). In one embodiment, information on the selected phrase groups is transmitted to the intra-group selection unit, the voice creation unit, and the repetition suppression unit.
【0054】 Next, at least one phrase is selected from the selected phrase groups based on a predetermined intra-group ratio (intra-group selection step S20).
In one embodiment, information on the selected phrases is transmitted to the voice creation unit and the repetition suppression unit.
【0055】 The method may then include a step of detecting an identical repetition in the concatenation of the selected phrases (repetition detection step S30). If a repetition is detected in step S30, the flow proceeds to step S31, where the inter-group ratio and the intra-group ratio for the phrases constituting the repetition are updated to lower values, and the flow returns to step S10. If no repetition is detected in step S30, the flow proceeds to step S40.
【0056】 Next, the selected phrases of the selected phrase groups are sequentially concatenated with breaths in between to create voice (voice creation step S40).
【0057】 Although not shown in the figure, the inter-group selection step S10 may include a step of reading the currently set mode and a step of updating the inter-group ratio and the intra-group ratio to values corresponding to the read mode.
【0058】 [Other embodiments] One aspect of the present invention is not limited to the above-described embodiment, and various modifications are possible within the scope of the technical idea described in the claims.
【0059】 For example, the phrase groups used in one embodiment of the present invention are not limited to three. In addition to the upper phrase group Gf_Top, the middle phrase group Gf_Mid, and the lower phrase group Gf_Btm, a top-tier phrase group Gf_MTop, which is ranked higher than any of the above, and/or a bottom-tier phrase group Gf_MBtm, which is ranked lower than any of the above, may also be used.
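The flow of Figure 3 described in paragraphs 【0053】 to 【0056】 (steps S10, S20, S30, S31, and S40) can be sketched as a simple control loop. The function `run_cycle` and its callable parameters are hypothetical names introduced for illustration, and the `max_retries` bound is an added assumption so that the loop always terminates; the flowchart as described does not state such a bound.

```python
def run_cycle(select_groups, select_phrases, is_repetition,
              lower_ratios, create_speech, max_retries=10):
    """Run one pass of the S10 -> S20 -> S30 -> (S31 | S40) flow.

    Each argument is a callable supplied by the caller; returns the created
    speech, or None if every attempt was rejected as a repetition.
    """
    for _ in range(max_retries):
        groups = select_groups()            # S10: inter-group selection
        phrases = select_phrases(groups)    # S20: intra-group selection
        if is_repetition(phrases):          # S30: repetition detection
            lower_ratios(phrases)           # S31: update ratios to lower values
            continue                        # return to S10
        return create_speech(phrases)       # S40: voice creation
    return None


if __name__ == "__main__":
    # Stub callables: the first selection is treated as a repetition,
    # the second one is accepted.
    picks = iter([["We aim to reduce taxes."], ["I will do my best."]])
    speech = run_cycle(
        select_groups=lambda: ["Gf_Top"],
        select_phrases=lambda groups: next(picks),
        is_repetition=lambda phrases: phrases == ["We aim to reduce taxes."],
        lower_ratios=lambda phrases: None,
        create_speech=lambda phrases: " (breath) ".join(phrases),
    )
    print(speech)
```

Passing the steps in as callables keeps the sketch independent of any particular storage layout or speech engine.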
【0060】 One aspect of the present invention may be a voice creation system comprising an output device that outputs the created voice, in which at least one selected from the group consisting of volume, speed, and intonation of the output voice can be automatically changed based on at least one selected from the group consisting of time of day, location information, and weather.
【0061】 For example, the voice creation device can easily determine the time of day, location, and weather by obtaining time information from a timer, time and location information from GPS (Global Positioning System), and weather information from other known systems capable of acquiring weather information.
【0062】 For example, the voice creation device can determine that the current time is nighttime and lower the volume, speed, and intonation compared with other times of day (e.g., daytime). Furthermore, for example, the voice creation device can determine that the current location is within a residential area or a protected living environment area such as the vicinity of schools, hospitals, or libraries, and lower the volume, speed, and intonation compared with other locations (for example, areas not included in protected living environment areas, such as in front of train stations, shopping districts, or industrial areas). Furthermore, for example, the voice creation device can determine that the current weather is rainy and raise the volume, speed, and intonation compared with other weather conditions (e.g., sunny or cloudy).
【0063】 The plurality of phrase groups may further include a specific phrase group including one or more specific phrases suited to a situation relating to at least one selected from the group consisting of time of day, location information, and weather.
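The automatic adjustment described in paragraphs 【0060】 to 【0062】 can be sketched as below. The description states only the direction of each change (lower at night and in protected areas, higher in rain), so the factors 0.8 and 1.2, the `OutputSettings` type, and the boolean inputs are assumptions for illustration. How night, location, and weather are actually determined (timer, GPS, weather service) is outside this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputSettings:
    """Relative output parameters; 1.0 means the unadjusted baseline."""
    volume: float
    speed: float
    intonation: float


def adjust_output(base, is_night=False, in_protected_area=False,
                  is_raining=False, lower_factor=0.8, raise_factor=1.2):
    """Scale volume, speed, and intonation according to the situation.

    The factors are hypothetical; only the direction of each change is
    taken from the description.
    """
    scale = 1.0
    if is_night:
        scale *= lower_factor       # quieter, slower, flatter at night
    if in_protected_area:
        scale *= lower_factor       # e.g., near schools, hospitals, libraries
    if is_raining:
        scale *= raise_factor       # louder, faster, livelier in rain
    return OutputSettings(base.volume * scale,
                          base.speed * scale,
                          base.intonation * scale)


if __name__ == "__main__":
    base = OutputSettings(volume=1.0, speed=1.0, intonation=1.0)
    print(adjust_output(base, is_night=True))
    print(adjust_output(base, is_raining=True))
```

Applying one common scale to all three parameters is a simplification; an actual device could adjust each parameter independently, since the description allows changing at least one of them.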
【0064】 Specific phrases suited to situations relating to time of day include, for example, "Good morning," "Good evening," and "Excuse me for bothering you so early in the morning (or late at night)." Specific phrases suited to situations relating to location information include, for example, "Thank you for using the station," and "Excuse me for disturbing your peaceful rest." Specific phrases suited to situations relating to weather include, for example, "We will deliver the news cheerfully, undeterred by the rain," and "Thank you for coming despite the intense heat."
【0065】 In this case, it is preferable that a specific mode can be set in which the inter-group ratio at which the specific phrase group is selected is larger than that of the other groups. Whether the specific phrase group is adopted, and its inter-group ratio, may be input by the user via an input device 30 such as a touch panel or a switch. When the user turns a specific phrase group ON, the inter-group ratio of that specific phrase group is read from the storage unit, the remainder is allocated among the remaining phrase groups Gf while their mutual proportions are preserved, and the ratios are updated accordingly.
【0066】 As described above, one aspect of the present invention is specialized for situations in which short sentences are repeatedly conveyed with pauses in between. Such situations include, for example, political campaigns, commercial solicitations, tourist information, disaster prevention announcements, and announcements within facilities. Although the above embodiment was intended for application to political campaigns, one aspect of the present invention is also intended for application to other situations, such as commercial solicitations, tourist information, disaster prevention announcements, and announcements within public facilities.
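The ratio update described in paragraph 0065 for turning a specific phrase group ON can be illustrated with a short sketch. It assumes that the inter-group ratios are fractions that sum to 1, which the source does not state explicitly; the function name `enable_specific_group` and the group label `Gf_Specific` are hypothetical.

```python
def enable_specific_group(ratios, specific, specific_ratio):
    """Give `specific` the stored ratio and share the remainder among the other
    groups while keeping their mutual proportions unchanged."""
    if not 0.0 <= specific_ratio <= 1.0:
        raise ValueError("specific_ratio must be between 0 and 1")
    others = {g: r for g, r in ratios.items() if g != specific}
    total = sum(others.values())
    if total <= 0:
        raise ValueError("the remaining groups need a positive total ratio")
    remaining = 1.0 - specific_ratio
    # Each remaining group keeps its share of the other groups' total.
    updated = {g: remaining * r / total for g, r in others.items()}
    updated[specific] = specific_ratio
    return updated
```

For instance, starting from ratios of 0.5, 0.3, and 0.2 for Gf_Top, Gf_Mid, and Gf_Btm and enabling a specific group at 0.4, the other three would become approximately 0.30, 0.18, and 0.12, so their 5 : 3 : 2 proportion is preserved.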
【0067】 When applied to commercial solicitation, a voice creation system / method can be constructed using the following:
• a phrase group Gf whose phrases f include "attraction words" (phrases such as "Welcome" and "We can assist you right away");
• a phrase group Gf whose phrases f include the "store name"; and
• a phrase group Gf whose phrases f include "selling point words" (phrases such as "Lunch for 500 yen," "Today's special offer," and "Great deal").
【0068】 When applied to tourist information, a voice creation system / method can be constructed using the following:
• a phrase group Gf whose phrases f include the "tourist destination"; and
• a phrase group Gf whose phrases f include "the highlights of that tourist destination".
【0069】 When applied to disaster prevention announcements, a voice creation system / method can be constructed using the following:
• a phrase group Gf whose phrases f include a "disaster warning" (phrases such as "A heavy rain warning has been issued" or "An evacuation order has been issued");
• a phrase group Gf whose phrases f include the "location of the disaster"; and
• a phrase group Gf whose phrases f include a "suggestion of evacuation" (phrases such as "Please evacuate to higher ground" or "Please refrain from going outside").
【0070】 When applied to announcements within public facilities, a voice creation system / method can be constructed using the following:
• a phrase group Gf whose phrases f include the "announcement content" (phrases such as "Missing child announcement," "Train is coming," and "Please come"); and
• a phrase group Gf whose phrases f include a "name and / or designation".
【0071】 [Application Examples] One aspect of the present invention, which is specialized for situations where short sentences are repeatedly conveyed with pauses in between, may also be well suited to the entertainment field.
【0072】 For example, by storing rhymed phrases (e.g., a / i / u / e / o) in association with phrase groups according to their rhymes, it becomes easier to create rap audio in which words are set to a rhyming rhythm. Furthermore, by storing phrases corresponding to the 5W1H (e.g., who / where / what / when / why / how) in association with phrase groups, it becomes easier to create unique audio with a game-like element of combination and a high degree of unexpectedness.
【0073】 These application examples, as well as at least some of those described in the other embodiments, may be appropriately combined with Embodiment 1 above.
[Industrial applicability]
【0074】 According to one aspect of the present invention, a voice creation system and a voice creation method can suitably create voice in situations where short sentences are repeatedly conveyed with breaths in between. Examples of such situations include political campaigning, commercial solicitation, tourist information, disaster prevention announcements, and announcements within public facilities. In these situations, the system is particularly easy to use and convenient, and therefore has great industrial applicability.
[Explanation of Symbols]
【0075】
100: Voice creation system
10: Voice creation device
11: Storage unit
12: Inter-group selection unit
13: Intra-group selection unit
14: Voice creation unit
15: Repetition suppression unit
16: Mode input reception unit
20: Output device
30: Input device
f: Phrase
Gf: Phrase group
Gf_Top: Upper phrase group
Gf_Mid: Middle phrase group
Gf_Btm: Lower phrase group
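As a rough illustration of how the phrase groups listed for the other application scenarios (paragraphs 0067 to 0070) might be stored and combined, consider the sketch below. It is a simplification under assumptions: every group is used in every announcement and phrases are chosen uniformly, so the inter-group and intra-group ratios are omitted, and the listing order is assumed to be the rank order. The names `SCENARIOS`, `BREATH`, and `announce` are hypothetical, and the entries `<store name>` and `<location>` are placeholders, since the source gives no example phrases for those groups.

```python
import random

BREATH = " / "  # stand-in for a breath between phrases (assumption)

# Scenario -> ordered list of (phrase group label, phrases).
SCENARIOS = {
    "commercial_solicitation": [
        ("attraction words", ["Welcome", "We can assist you right away"]),
        ("store name", ["<store name>"]),  # placeholder, not from the source
        ("selling point words", ["Lunch for 500 yen", "Today's special offer", "Great deal"]),
    ],
    "disaster_prevention": [
        ("disaster warning", ["A heavy rain warning has been issued",
                              "An evacuation order has been issued"]),
        ("location of the disaster", ["<location>"]),  # placeholder, not from the source
        ("suggestion of evacuation", ["Please evacuate to higher ground",
                                      "Please refrain from going outside"]),
    ],
}


def announce(scenario, rng):
    """Pick one phrase per group, in listing order, and join them with breaths."""
    return BREATH.join(rng.choice(phrases) for _, phrases in SCENARIOS[scenario])
```

Calling `announce("disaster_prevention", random.Random())` yields one warning, the location placeholder, and one evacuation suggestion, separated by the breath marker.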
Claims
[Claim 1] A voice creation system that creates voice from at least one phrase of phrase groups each including one or more phrases, the system comprising: a storage unit that stores the phrases included in each of a plurality of mutually different phrase groups in association with the respective phrase groups; an inter-group selection unit that selects at least two phrase groups from the plurality of phrase groups based on a predetermined inter-group ratio; an intra-group selection unit that selects at least one of the phrases from the selected phrase groups based on a predetermined intra-group ratio; and a voice creation unit that sequentially connects the selected phrases from the selected phrase groups, with breaths in between, to create voice.
[Claim 2] The system according to claim 1, wherein the plurality of phrase groups are ranked, and the voice creation unit sequentially connects the selected phrases from the selected phrase groups, with breaths in between, according to the ranking of the phrase groups, to create voice.
[Claim 3] The system according to claim 2, wherein the plurality of phrase groups include at least three mutually different phrase groups ranked in the order of an upper phrase group, a middle phrase group, and a lower phrase group.
[Claim 4] The system according to claim 1, wherein the plurality of phrase groups include nine or fewer phrase groups that are different from each other.
[Claim 5] The system according to claim 1, further comprising a repetition suppression unit that detects identical repetitions of the selected phrases and suppresses opportunities to select the phrases constituting the identical repetitions.
[Claim 6] The system according to claim 1, wherein at least one of the inter-group ratio and the intra-group ratio can be manually changed.
[Claim 7] The system according to claim 5, wherein at least one of the inter-group ratio and the intra-group ratio can be automatically changed based on the detection of the identical repetition.
[Claim 8] The system according to claim 1, wherein at least one of the inter-group ratio and the intra-group ratio can be automatically changed based on a log of the selected phrase groups and / or a log of the selected phrases.
[Claim 9] The system according to claim 1, further comprising an output device that outputs the created voice, wherein at least one selected from the group consisting of volume, speed, and intonation of the output voice can be automatically changed based on at least one selected from the group consisting of time of day, location information, and weather.
[Claim 10] The system according to claim 3, applied to political street campaigning, wherein the upper phrase group includes policy words as phrases, the middle phrase group includes names and / or party names as phrases, the lower phrase group includes greeting words as phrases, and the following modes can be set: a policy mode in which the inter-group ratio at which the upper phrase group is selected is larger than that of the others; a mode in which the inter-group ratio at which the middle phrase group is selected is larger than that of the others; and a greeting mode in which the inter-group ratio at which the lower phrase group is selected is larger than that of the others.
[Claim 11] The system according to claim 1, wherein the plurality of phrase groups further include a specific phrase group containing one or more specific phrases that suit a situation relating to at least one selected from the group consisting of time of day, location information, and weather, and a specific mode can be set in which the inter-group ratio at which the specific phrase group is selected is larger than that of the other groups.
[Claim 12] A voice creation method in which a computer creates voice from at least one phrase of phrase groups each including one or more phrases, the method comprising: an inter-group selection step in which the computer selects at least two phrase groups from a plurality of mutually different phrase groups based on a predetermined inter-group ratio; an intra-group selection step in which the computer selects at least one of the phrases from the selected phrase groups based on a predetermined intra-group ratio; and a voice creation step in which the computer sequentially connects the selected phrases from the selected phrase groups, with breaths in between, to create voice.