Information processing method, information processing device, and program
The information processing method and device address the issue of users not seeing the effect of speech processing by displaying visualization images that show the spectrum changes, enabling real-time confirmation of speech clarification effects.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Application
- Current Assignee / Owner: RADIUS CO LTD
- Filing Date: 2024-12-27
- Publication Date: 2026-07-02
AI Technical Summary
Users on both ends of a call cannot confirm the effect of speech processing on their own or the other party's voice, necessitating a technology that presents the effect of speech processing for clarifying voice.
An information processing method and device acquire audio data, acquire clarified audio data generated by processing it, and display a visualization image showing the spectrum changes before and after processing, so that users can see the effect of speech clarification.
Enables users to visually confirm the effect of speech processing on their own and the other party's voice, enhancing clarity and understanding in real-time communications.
Abstract
Description
Information processing method, information processing device, and program
[0001] This disclosure relates to a technology for displaying information related to speech processing.
[0002] Patent Document 1 discloses a technique for applying audio processing to an audio signal representing the voice of the speaker during a call in order to clarify the voice. According to this technique, the audio signal of the speaker, which has been processed to be perceived as clear by the user on the receiving end, is output to the receiving end.
[0003] Japanese Patent Application Publication No. 2021-108429
[0004] The technology described in Patent Document 1 has a problem in that the user on the transmitting side cannot confirm the effect that the speech processing has on their own voice as heard on the receiving side. It also has a problem in that the user on the receiving side cannot confirm the effect of the speech processing on the voice they are hearing. Therefore, there is a need for a technology that presents to the user the effect of speech processing that clarifies voice.
[0005] One aspect of this disclosure aims to realize a technology that presents to the user the effect of clarification through speech processing for the purpose of clarifying speech.
[0006] To solve the above problems, an information processing method according to one aspect of the present disclosure includes, at least one processor performing: a first acquisition process for acquiring audio data representing the voice of a speaker; a second acquisition process for acquiring clarified audio data generated by audio processing for clarifying the audio data; and a visualization control process for displaying a visualization image that visualizes the audio processing based on the spectrum of the audio data and the spectrum of the clarified audio data.
[0007] To solve the above problems, an information processing device according to one aspect of the present disclosure comprises at least one processor, and the at least one processor executes each of the processes included in the information processing method described above.
[0008] To solve the above problems, a program according to one aspect of this disclosure causes at least one processor to execute each of the processes included in the information processing method described above. A computer-readable non-transitory recording medium on which the program is recorded also falls within the scope of this disclosure.
[0009] According to one aspect of this disclosure, the effect of clarification through speech processing to clarify speech can be presented to the user.
[0010] The drawings, in order, are as follows.
- A block diagram showing the configuration of an information processing system according to Embodiment 1 of this disclosure.
- A block diagram showing the functional configuration of a user terminal according to Embodiment 1 of this disclosure.
- A block diagram showing the functional configuration of another user terminal according to Embodiment 1 of this disclosure.
- A flowchart illustrating the flow of an information processing method according to Embodiment 1 of this disclosure.
- A diagram schematically showing an example of a screen displayed in Embodiment 1 of this disclosure.
- A diagram schematically showing variations of a visualization image in Embodiment 1 of this disclosure.
- A flowchart illustrating the flow of another information processing method according to Embodiment 1 of this disclosure.
- A diagram schematically showing another example of a screen displayed in Embodiment 1 of this disclosure.
- A block diagram showing the functional configuration of a user terminal according to Embodiment 2 of this disclosure.
- A flowchart illustrating the flow of an information processing method according to Embodiment 2 of this disclosure.
- A diagram schematically showing an example of a screen displayed in Embodiment 2 of this disclosure.
- A block diagram showing the configuration of an information processing system according to Embodiment 3 of this disclosure.
- A block diagram showing the functional configuration of each device constituting the information processing system according to Embodiment 3 of this disclosure.
- A flowchart illustrating the flow of an information processing method according to Embodiment 3 of this disclosure.
- A flowchart illustrating the flow of another information processing method according to Embodiment 3 of this disclosure.
- A block diagram showing the configuration of a telephone system according to Embodiment 4 of this disclosure.
- A block diagram showing the configuration of a voice processing device according to Embodiment 4 of this disclosure.
- A flowchart illustrating the flow of an information processing method according to Embodiment 4 of this disclosure.
[0011] [Embodiment 1] Hereinafter, an information processing system 100 according to Embodiment 1 of the present disclosure will be described in detail with reference to the drawings.
[0012] (Configuration of Information Processing System 100) Figure 1 is a block diagram showing the configuration of the information processing system 100. As shown in Figure 1, the information processing system 100 includes a user terminal 1 used by user U1 and a user terminal 2 used by user U2. User terminals 1 and 2 can be connected via network N for users U1 and U2 to make calls. Although Figure 1 shows one user terminal 1 and one user terminal 2, there may be multiple instances of each.
[0013] Here, "call" refers to the exchange of at least voice in real time between multiple users via a network N, and may be a voice call where only voice is exchanged, or a video call where both voice and video are exchanged. Furthermore, the call method may be a circuit-switched method via a public telephone network, etc., or a packet-switched method via an IP (Internet Protocol) network, etc., but is not limited to these.
[0014] Network N may be, for example, a public telephone network, a dedicated telephone network, a mobile communication network, an IP network, a wireless or wired LAN (Local Area Network), or a combination of some or all of these. However, Network N is not limited to the examples described above, and is only required to be a network that connects user terminal 1 and user terminal 2 in a manner that enables communication using a predetermined communication method.
[0015] The user terminal 1 is a terminal used by the user U1. The user terminal 1 has a function of connecting to the user terminal 2 via the network N by a predetermined call method (for example, the circuit-switched method or the packet-switched method described above). That is, the user U1 can communicate with the user U2 using the user terminal 1. The user terminal 1 is an aspect of the information processing device described in the claims, and has a clarifying function for clarifying the user's own voice or the other party's voice in a call, and a visualization function for visualizing the clarification of the voice. The user terminal 1 is a computer including at least one processor and a memory. The user terminal 1 may be, for example, a mobile phone, a smartphone, a tablet, a smartwatch, a notebook personal computer, a desktop personal computer, etc., but is not limited thereto.
[0016] The user terminal 2 is a terminal used by the user U2. The user terminal 2 has a function of connecting to the user terminal 1 via the network N by the same call method as the user terminal 1. That is, the user U2 can communicate with the user U1 using the user terminal 2. Hereinafter, the user terminal 2 will be described as not having the above-described clarifying function and visualization function, but it is not limited to this and may have these functions. The user terminal 2 is a computer including at least one processor and a memory. The user terminal 2 may be, for example, a mobile phone, a smartphone, a tablet, a smartwatch, a notebook personal computer, a desktop personal computer, etc., but is not limited thereto.
[0017] (Configuration of User Terminal 1) Figure 2 is a block diagram showing the functional configuration of user terminal 1. As shown in Figure 2, user terminal 1 includes a control unit 110, a storage unit 120, a display unit 130, an input unit 140, an audio input unit 150, an audio output unit 160, and a communication unit 170. The control unit 110 is realized, for example, by a processor executing a program stored in a memory, and comprehensively controls each part of user terminal 1. Details of each functional block of the control unit 110 will be described later. The storage unit 120 is constituted by, for example, a memory, and stores various data and programs used by the control unit 110.
[0018] The display unit 130 displays an image generated by the control unit 110. The display unit 130 may be configured to include, for example, a liquid crystal display, an organic EL (Electroluminescence) display, etc., but is not limited thereto. The input unit 140 receives operations by user U1, acquires the input information indicated by each operation, and outputs the input information to the control unit 110. The input unit 140 may be configured to include, for example, a mouse, a keyboard, a touch pad, or a combination of some or all of these, but is not limited to these. Further, the display unit 130 and the input unit 140 may be integrally formed as a touch panel.
[0019] The audio input unit 150 converts an externally input audio signal into audio data, which is a digital signal, and outputs the audio data to the control unit 110. The audio input unit 150 may be configured to include, for example, a microphone and an AD (Analog to Digital) converter, but is not limited thereto. The audio output unit 160 converts the audio data input from the control unit 110 into an audio signal, which is an analog signal, and outputs the audio signal to the outside as audio. The audio output unit 160 may be configured to include, for example, a DA (Digital to Analog) converter and a speaker, but is not limited thereto.
[0020] The communication unit 170 connects to the network N and communicates with the outside. The communication unit 170 transmits the information input from the control unit 110 via the network N. Further, the communication unit 170 outputs the information received via the network N to the control unit 110.
[0021] Note that some or all of the storage unit 120, display unit 130, input unit 140, audio input unit 150, audio output unit 160, and communication unit 170 may be connected as peripheral devices instead of being built into the user terminal 1.
[0022] (Functional Blocks of the Control Unit 110) As shown in Figure 2, the control unit 110 includes a call application unit 111, a first acquisition unit 112, a second acquisition unit 113, an audio output control unit 114, a clarification UI (User Interface) unit 115, and a visualization control unit 116. The first acquisition unit 112, the second acquisition unit 113, the audio output control unit 114, the clarification UI unit 115, and the visualization control unit 116 may constitute a clarification application that extends the call function of the call application unit 111.
[0023] The call application unit 111 provides functions for making calls according to a predetermined call method. For example, in response to an operation by user U1 to start a call, the call application unit 111 connects to the call application unit 211 of user terminal 2 via the network N according to a predetermined call method. The operation to start the call may include, for example, an operation to specify the call destination and an operation to instruct the sending of a call request to the call destination. The operation to start the call may also include, for example, an operation to respond to a received call request. The call application unit 111 also transmits audio data representing the voice of user U1, which is input from the audio input unit 150, to the connected user terminal 2. The call application unit 111 also receives audio data representing the voice of user U2 from user terminal 2 and outputs the audio data from the audio output unit 160. The call application unit 111 also terminates the connection with user terminal 2 in response to an operation by user U1 to instruct the end of the call.
[0024] The first acquisition unit 112 executes the first acquisition process. The first acquisition process is the process of acquiring audio data that represents the speaker's voice. Here, the audio data to be acquired may represent the speaker's voice in a call conducted by connecting multiple user terminals 1 and user terminals 2. Hereafter, "audio data that represents the speaker's voice" will also be referred to as "speaker's voice data". The audio data is acquired in real time according to the progress of the speaker's speech.
[0025] For example, in a call between user U1 and user U2, when user U1 speaks, user U1 is the speaker and user U2 is the listener. In this case, the first acquisition unit 112 acquires the voice data of user U1 (speaker), who is using the user terminal 1, via the audio input unit 150.
[0026] Furthermore, for example, in a call between user U1 and user U2, when user U2 speaks, user U2 is the speaker and user U1 is the listener. In this case, the first acquisition unit 112 acquires the voice data of user U2 (speaker) received by the call application unit 111.
[0027] The second acquisition unit 113 executes a second acquisition process. The second acquisition process is the process of acquiring clarified speech data generated by speech processing for clarifying the speech of the speaker's speech data. Here, the clarified speech data may be acquired in real time. Real-time acquisition of clarified speech data means that clarified speech data generated from the speech data is acquired in parallel with the process of acquiring speech data in accordance with the progress of speech.
[0028] For example, the second acquisition unit 113 may generate clarified speech data by performing speech processing on the speech data to clarify the speech. For example, the second acquisition unit 113 may perform the speech processing in real time in accordance with the progress of the speaker's speech, thereby acquiring the clarified speech data in real time.
[0029] Furthermore, for example, the second acquisition unit 113 may acquire clarified speech data from the speech processing device that performs the speech processing. Such a speech processing device may consist of a speech processing circuit that performs speech processing on an analog speech signal, or it may consist of a computer that performs speech processing on digital speech data. The clarified speech data is acquired in real time by the speech processing device performing the speech processing in real time in accordance with the progress of the speaker's utterance.
[0030] In the embodiments described below, the focus will be on examples in which the second acquisition unit 113 (or the second acquisition unit 313, described later) performs voice processing to acquire clarified voice data. However, the descriptions of each embodiment will also apply when the second acquisition unit 113 (or 313) acquires clarified voice data from an external voice processing device.
[0031] For example, speech processing to clarify speech may involve amplifying higher-order formant components, including at least the second-order formant component, in the speech data. In the spectral analysis of speech, multiple peak frequencies appear at integer multiples of each other. These multiple peak frequencies are referred to as the first-order formant, second-order formant, third-order formant, fourth-order formant, and so on, in order of increasing frequency. Although each peak frequency differs depending on the speaker's skeletal structure, it is generally known that in order to understand language, it is necessary to reliably hear the components from the first to the fourth formant. Furthermore, users with insufficient hearing often have reduced hearing in the frequency band containing higher-order formant components, such as the second to the fourth formant. Therefore, by amplifying the higher-order formant components, including the second formant, speech is clarified so that it can be perceived as clear by users with insufficient hearing.
[0032] For example, the second acquisition unit 113 may perform a process to amplify the level of the frequency band above the lower limit frequency in the audio data. In this case, the lower limit frequency is determined such that higher-order formant components, including at least a second-order formant component, are included in the frequency band above the lower limit frequency. The second acquisition unit 113 may further determine an upper limit frequency and perform a process to amplify the level of the frequency band above the lower limit frequency and below the upper limit frequency in the audio data. In this case, the lower limit frequency and the upper limit frequency are determined such that higher-order formant components, including at least a second-order formant component, are included in the frequency band. As an example, the lower limit frequency may be 400 Hz and the upper limit frequency may be 5 kHz. The frequency band between 400 Hz and 5 kHz is likely to contain the second to fourth formant components of typical human speech. However, the lower limit frequency and upper limit frequency are not limited to the examples described above. Also, for example, the frequency of the first-order formant component detected in real time in the audio data may be applied as the lower limit frequency. Furthermore, the audio processing for clarifying speech is not limited to the processes described above, and known processes can be applied.
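As an illustration of the band amplification described above, the following is a minimal sketch only. The disclosure does not say how the band level is raised; the frequency-domain scaling used here, the naive DFT, and the names `dft`, `idft`, and `amplify_band` are all assumptions made for this example. The default limits are the 400 Hz and 5 kHz values given in the text.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short demo frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def amplify_band(frame, sample_rate, gain, low_hz=400.0, high_hz=5000.0):
    """Scale every frequency bin between low_hz and high_hz by `gain` (assumed method)."""
    n = len(frame)
    spectrum = dft(frame)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] *= gain
    return idft(spectrum)
```

With a 64-sample frame at 16 kHz, a 1 kHz component falls inside the band and is scaled by the gain, while a 250 Hz component lies below the lower limit and is left unchanged. A real-time implementation would more likely use an FFT library or a time-domain filter.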
[0033] Furthermore, the degree of speech clarification in speech processing can be set according to a first operation. The first operation is an operation to set the degree of clarification, and in this embodiment, it is performed by user U1. The degree of clarification may be, for example, a gain that amplifies the level of the frequency band above the lower limit frequency and below the upper limit frequency as described above. This first operation is received by the clarification UI unit 115, which will be described later. For example, the degree of speech clarification may be set discretely as multiple steps. Alternatively, the degree of speech clarification may be set continuously.
[0034] The audio output control unit 114 executes audio output control processing. Audio output control processing is the process of controlling the output of the voice indicated by the clarified voice data from the user terminal of the listener among multiple user terminals (for example, multiple user terminals including user terminal 1 and user terminal 2). For example, when user U1 speaks, user U1 is the speaker and user U2 is the listener. Therefore, if clarified voice data for user U1 has been generated, the audio output control unit 114 transmits the clarified voice data to user terminal 2 so that it is output from user terminal 2. Also, for example, when user U2 speaks in a call between user U1 and user U2, user U2 is the speaker and user U1 is the listener. Therefore, if clarified voice data for user U2 has been generated, the audio output control unit 114 controls the audio output unit 160 to output the clarified voice data.
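The routing rule in this paragraph can be sketched as follows. This is an illustrative model only: the `Terminal` class, its `sent` and `played` lists, and the function `output_control` are hypothetical names standing in for the communication unit and the audio output unit, not anything defined in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Terminal:
    """Hypothetical stand-in for user terminal 1."""
    user: str
    sent: list = field(default_factory=list)    # data handed to the communication unit
    played: list = field(default_factory=list)  # data handed to the audio output unit


def output_control(local, speaker, clarified):
    """Route clarified data so that it is output at the listener's terminal."""
    if speaker == local.user:
        # The local user is the speaker: send the data to the listener's terminal.
        local.sent.append(clarified)
    else:
        # The local user is the listener: play the data on this terminal.
        local.played.append(clarified)
```

In both branches the clarified voice ends up at the listener, which is the point of the processing described above.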
[0035] The clarification UI unit 115 accepts a first operation for setting the degree of speech clarification in speech processing. For example, the first operation may include setting the degree of clarification for the transmitted speech, or setting the degree of clarification for the received speech. Here, transmitted speech refers to the voice of user U1 when user U1 is the speaker. Received speech refers to the voice heard by user U1 when user U2 is the speaker.
[0036] For example, the first operation may include selecting one of several levels of clarification. These levels could be, for example, three levels such as weak, medium, and strong. The levels may also include one in which no clarification is performed. In this case, the levels could be, for example, four levels such as off, weak, medium, and strong, or two levels such as on and off. However, the names and number of levels are not limited to these examples.
[0037] Furthermore, for example, the first operation may include an operation to specify an arbitrary degree of clarification within a continuous range. For example, an arbitrary degree of clarification may be represented by any numerical value from 1 to m, where "clarification off" is 1x and "clarification strong" is mx (where m is a number greater than 1). Alternatively, the first operation may be an operation on an operation object (e.g., a slider object, a volume object, etc.) that is continuously associated with values from 1 to m. The information indicating the degree of clarification for the transmitted or received voice, received by the first operation, is stored in the storage unit 120 as setting information.
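The discrete and continuous settings described above could be mapped to a gain as in the sketch below. The gain values in `LEVEL_GAINS`, the normalized slider position, and the linear mapping are assumptions chosen for illustration; the disclosure only states that "off" corresponds to 1x and "strong" to mx with m greater than 1.

```python
# Hypothetical gain per discrete level; only "off" = 1.0 follows from the text.
LEVEL_GAINS = {"off": 1.0, "weak": 1.5, "medium": 2.0, "strong": 3.0}


def gain_from_level(level):
    """Look up the gain for a discrete clarification level."""
    return LEVEL_GAINS[level]


def gain_from_slider(position, m=3.0):
    """Map a slider position in [0, 1] linearly onto a gain in [1, m]."""
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    position = min(1.0, max(0.0, position))  # clamp out-of-range input
    return 1.0 + position * (m - 1.0)
```

For instance, `gain_from_slider(0.5, m=3.0)` gives a gain of 2.0, halfway between "off" and "strong" under this assumed linear scale.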
[0038] The visualization control unit 116 executes visualization control processing. Visualization control processing is a process for displaying a visualization image that visualizes the audio processing based on the spectrum of the audio data and the spectrum of the clarified audio data. As a result, the changes in the spectrum before and after the audio is clarified are visualized, so that user U1 can confirm the effect of the audio clarification.
[0039] For example, suppose that the speaker in a call is user U1, and that voice processing is performed on the transmitted voice, which is user U1's voice. In this case, the visualization control unit 116 may display the visualization image on the display unit 130 of user terminal 1 of user U1 (speaker), which is one of a group of user terminals (for example, a group of user terminals including user terminal 1 and user terminal 2). This allows user U1 (speaker) to confirm the effect on user U2 (receiver), the other party in the call, of clarifying their own transmitted voice.
[0040] Furthermore, for example, suppose that the speaker in a call is user U2, and that audio processing is performed on the received audio, which is user U2's voice. In this case, the visualization control unit 116 may display the visualization image on the display unit 130 of user terminal 1 of user U1 (the receiver), which is one of a group of user terminals (for example, a group of user terminals including user terminal 1 and user terminal 2). This allows user U1 (the receiver) to confirm the effect of clarifying the received audio.
[0041] Furthermore, the visualization control unit 116 may update the visualization image in real time according to the progress of the speaker's utterance. Here, since clarified speech data is generated in real time from the speech data that changes according to the progress of the utterance, the spectrum of the speech data and the spectrum of the clarified speech data change in real time. Therefore, the visualization image is updated in real time based on these two spectra, and the visualization image is displayed as a moving image.
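One way to picture the real-time update is a frame-by-frame loop that emits a pair of spectra for each frame as the utterance progresses. The sketch below is an assumption about structure, not the disclosed implementation: `frames` and `visualization_updates` are hypothetical names, and the `clarify` and `spectrum` callables are supplied by the caller.

```python
def frames(samples, frame_size):
    """Yield consecutive, non-overlapping frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        yield samples[start:start + frame_size]


def visualization_updates(samples, frame_size, clarify, spectrum):
    """For each frame, yield (spectrum before clarification, spectrum after)."""
    for frame in frames(samples, frame_size):
        clarified = clarify(frame)
        yield spectrum(frame), spectrum(clarified)
```

Each yielded pair would drive one redraw of the visualization image, so a sequence of pairs corresponds to the moving image described above.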
[0042] Furthermore, the visualization image is displayed in a manner that allows for comparison of these two spectra. For example, the visualization image may include an image in which an image showing the spectrum of the audio data and an image showing the spectrum of the clarified audio data are superimposed. Alternatively, the visualization image may include an image showing the difference between the spectrum of the audio data and the spectrum of the clarified audio data. Alternatively, the visualization image may include an image in which an image showing the spectrum of the audio data and an image showing the spectrum of the clarified audio data are placed side by side. However, the display manner of the visualization image is not limited to the examples described above. This makes it possible to visualize which frequency bands in the audio spectrum are amplified and to what extent before and after clarification, so that user U1 can confirm the effect of audio clarification.
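The data behind the superimposed, difference, and side-by-side displays can be derived from the same two magnitude spectra. The following sketch assumes the spectra are lists of linear magnitudes on a shared frequency grid and expresses the change in decibels; the decibel scale and the names `to_db`, `spectrum_difference_db`, and `comparison_rows` are illustrative assumptions.

```python
import math


def to_db(magnitude, floor=1e-12):
    """Convert a linear magnitude to decibels, clamping very small values."""
    return 20.0 * math.log10(max(magnitude, floor))


def spectrum_difference_db(original, clarified):
    """Per-bin level change (clarified minus original) in dB."""
    if len(original) != len(clarified):
        raise ValueError("spectra must have the same number of bins")
    return [to_db(c) - to_db(o) for o, c in zip(original, clarified)]


def comparison_rows(freqs_hz, original, clarified):
    """One row per bin, usable for an overlay, side-by-side, or difference plot."""
    diffs = spectrum_difference_db(original, clarified)
    return [
        {"freq_hz": f, "before_db": to_db(o), "after_db": to_db(c), "gain_db": d}
        for f, o, c, d in zip(freqs_hz, original, clarified, diffs)
    ]
```

A bin whose magnitude doubles shows a `gain_db` of about 6 dB, while an untouched bin shows 0 dB, which makes the amplified band easy to see.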
[0043] Furthermore, the visualization image may also include label images showing each formant component. This visualizes the extent to which each higher-order formant component in the speech spectrum is amplified, allowing user U1 to confirm the effect of speech clarification.
[0044] Furthermore, the visualization control unit 116 displays a visualization image that reflects the degree of clarification set in response to the first operation. Here, when the degree of clarification changes, the spectrum of the clarified audio data changes, and therefore the visualization image also changes. In other words, the visualization image is updated to reflect the set degree of clarification. This allows user U1 to check how the effect of audio clarification changes when the setting of the degree of clarification is changed.
[0045] (Configuration of User Terminal 2) Figure 3 is a block diagram showing the functional configuration of User Terminal 2. As shown in Figure 3, User Terminal 2 includes a control unit 210, a storage unit 220, a display unit 230, an input unit 240, an audio input unit 250, an audio output unit 260, and a communication unit 270. The descriptions of the functional blocks of the same names in User Terminal 1 apply to each of these units.
[0046] (Functional Blocks of the Control Unit 210) As shown in Figure 3, the control unit 210 includes a call application unit 211. The call application unit 211 provides functions for making calls according to the same call method as the call application unit 111. The description of the call application unit 111 applies to the call application unit 211 when users U1 and U2, user terminals 1 and 2, and the functional blocks of the same names in the two terminals are read interchangeably.
[0047] (Flow of Information Processing Method S1) The information processing system 100 configured as described above executes information processing method S1. In information processing method S1, when user U1 is the speaker, speech processing is performed to clarify the voice transmitted by user U1. Information processing method S1 is a method of presenting user U1 with a visualization image that visualizes the speech processing on the voice transmitted by user U1. Figure 4 is a flowchart illustrating the flow of information processing method S1. As shown in Figure 4, information processing method S1 includes steps S101 to S111.
[0048] In step S101, the call application unit 111 of user terminal 1 connects to the call application unit 211 of user terminal 2 based on an operation by user U1 to initiate a call with user U2. Then, in step S102, the call application unit 211 of user terminal 2 connects to the call application unit 111 of user terminal 1 based on an operation by user U2. This initiates a call between user U1 and user U2. Note that steps S101 and S102 are not limited to being executed in this order, but may be executed in the reverse order, or some or all of them may be executed in parallel. Also, steps S101 and S102 do not restrict which of user U1 and user U2 places the call to the other.
[0049] In step S103, the clarification UI unit 115 of the user terminal 1 receives a first operation from user U1 to set the clarification level of the transmitted voice. The clarification level is an example of the degree of clarification, and refers to the stages when the degree of clarification is set discretely. Hereafter, the explanation will focus on examples in which the degree of clarification can be set discretely, but is not limited to these. The setting information, including the clarification level of the transmitted voice set by the first operation, is stored in the storage unit 120. Note that the timing of executing step S103 is not limited to after the start of a call (after step S101), but may also be executed before the start of a call (before step S101). Furthermore, this timing is not limited to immediately after the start of a call, but may also be executed at any point during a call (any point after step S104, which will be explained below). In addition, if the first operation is not accepted, pre-stored setting information may be applied.
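The handling of setting information in step S103, including the fallback to pre-stored settings when no first operation is received, might look like the sketch below. The class `SettingStore`, the keys `"transmitted"` and `"received"`, and the default of `"off"` are hypothetical choices made for this example.

```python
LEVELS = ("off", "weak", "medium", "strong")
# Hypothetical pre-stored defaults; the disclosure does not specify them.
DEFAULT_SETTINGS = {"transmitted": "off", "received": "off"}


class SettingStore:
    """Holds the clarification level for transmitted and received voice."""

    def __init__(self, stored=None):
        self._settings = dict(DEFAULT_SETTINGS)
        if stored:
            self._settings.update(stored)  # previously saved setting information

    def apply_first_operation(self, direction, level):
        """Record a level chosen by the user; may be called before or during a call."""
        if direction not in self._settings:
            raise KeyError(direction)
        if level not in LEVELS:
            raise ValueError(level)
        self._settings[direction] = level

    def level_for(self, direction):
        """Return the level in effect, which is the stored one if none was chosen."""
        return self._settings[direction]
```

Because `level_for` is read each time voice data is processed, a change made mid-call takes effect from the next processed frame.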
[0050] Step S104 is an example of the first acquisition process. In step S104, the first acquisition unit 112 acquires the voice data of user U1 via the audio input unit 150. In other words, this step assumes that user U1 has spoken during the call.
[0051] Step S105 is an example of the second acquisition process. In step S105, the second acquisition unit 113 performs voice processing on the user U1's voice data according to the setting information. As a result, clarified voice data is acquired that is clarified according to the clarification level indicated by the setting information. If the setting information indicates that the clarification level of the transmitted voice is "off", voice processing is not performed and clarified voice data is not acquired.
[0052] In step S106, the audio output control unit 114 transmits the clarified audio data to the user terminal 2. If the clarified audio data has not been generated, the user U1's audio data is transmitted instead.
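Steps S105 and S106 together amount to a simple fallback: clarify when a level other than "off" is set, otherwise transmit the original voice data. A minimal sketch, assuming a caller-supplied `clarify` function and the hypothetical name `prepare_outgoing`:

```python
def prepare_outgoing(voice_data, level, clarify):
    """Return (payload to transmit, clarified data or None)."""
    if level == "off":
        # S105 is skipped, so S106 transmits the unprocessed voice data.
        return voice_data, None
    clarified = clarify(voice_data, level)
    # S106 transmits the clarified data; it is also kept for visualization.
    return clarified, clarified
```

Returning the clarified data separately lets the caller decide whether a visualization image can be generated for the current frame.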
[0053] In step S107, the call application unit 211 of the user terminal 2 receives the clarified voice data. The call application unit 211 also outputs the voice indicated by the received clarified voice data via the audio output unit 260. As a result, user U2 can hear the clarified voice of user U1. In other words, user U1 can have user U2 hear their clarified voice even if the user terminal 2 used by user U2, the other party in the call, does not have a voice clarification function. If voice data is received instead of clarified voice data, the unclarified voice indicated by that voice data is output.
[0054] Step S108 is an example of visualization control processing. In step S108, the visualization control unit 116 of the user terminal 1 generates a visualization image based on the spectrum of the user U1's voice data and the spectrum of the clarified voice data. The visualization control unit 116 also displays the visualization image on the display unit 130. Here, steps S104 to S108 are executed in real time according to the progress of the utterance by user U1. Therefore, the visualization image displayed in step S108 is a moving image that changes according to the progress of the utterance. By viewing the visualization image, user U1 can confirm the effect on user U2 due to the clarification of their own voice.
[0055] Subsequently, when user U2 speaks, user terminal 1 receives voice data from user terminal 2 and outputs the received voice data via the voice output unit 160. As a result, user U1 hears user U2's voice, which is not clarified. It is also possible to configure the system so that when user U2 speaks, steps S204 to S208 of the information processing method S2 described later are executed. In this case, user U1 hears user U2's voice, which is clarified. Furthermore, if user U1 speaks again, steps S104 to S108 are repeated.
[0056] Furthermore, if step S103 is executed again during the call and the clarification level of the transmitted voice is changed, the changed clarification level is reflected in the clarified voice data acquired in step S105 and in the voice heard by user U2 in step S107. In addition, the changed clarification level is reflected in the visualization image presented to user U1 in step S108.
[0057] In step S110, the call application unit 111 of user terminal 1 terminates its connection with the call application unit 211 of user terminal 2 based on an operation by user U1. Also in step S111, the call application unit 211 of user terminal 2 terminates its connection with the call application unit 111 of user terminal 1. This terminates the call between user U1 and user U2. Note that steps S110 and S111 are not limited to being executed in this order; they may be executed in the reverse order, or some or all of them may be executed in parallel. Furthermore, the operation to terminate the call may be performed by user U2, user U1, or both users.
[0058] (Screen Example) Figure 5 is a schematic diagram showing a screen example G1 displayed on the display unit 130 of the user terminal 1 in the information processing method S1. Screen example G1 provides a user interface related to the function of clarifying and visualizing the user's own voice (transmitted voice) during a call. As shown in Figure 5, screen example G1 includes information G11 to G13, operation objects G14-1 to G14-4, G15, and a visualization image G16.
[0059] Information G11 (for example, "Currently on a call with U2") indicates that a call is currently in progress and identifies the other party to the call. Information G12 (for example, "1:02") indicates the elapsed time of the call. Information G13 (for example, "Your voice has been clarified and is audible to U2") indicates that the voice of user U1, who is using user terminal 1, has been clarified.
[0060] The operation objects G14-1 "Weak", G14-2 "Medium", G14-3 "Strong", and G14-4 "Off" accept an operation to select the clarity level for transmitted speech (an example of a first operation). In this example, any of the four clarity levels (Off, Weak, Medium, Strong) can be selected. Among the operation objects G14-1 to G14-4, the highlighted G14-2 "Medium" indicates that the currently selected clarity level is "Medium". Note that in Figure 5, the highlighting is shown as a thick border and gray fill, but is not limited to this. When an operation is accepted for any of the operation objects G14-1 to G14-4, setting information including the corresponding clarity level is stored in the storage unit 120. Furthermore, the operation objects G14-1 "Weak," G14-2 "Medium," G14-3 "Strong," and G14-4 "Off" can be operated at any point before the start of a call and during a call. Multiple operations may also be accepted, in which case the setting information based on the most recent operation will be stored.
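The behavior of the setting information described here (four selectable levels, the most recent operation wins, and a pre-stored setting applies when no operation has been made) might be modeled as below. The class and attribute names are hypothetical, and the choice of "Medium" as the pre-stored level is an assumption made only for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    OFF = "Off"
    WEAK = "Weak"
    MEDIUM = "Medium"
    STRONG = "Strong"


@dataclass
class SettingStore:
    """Stand-in for the setting information held in the storage unit."""

    prestored: Level = Level.MEDIUM  # assumed default, not given in the source
    selected: Optional[Level] = None

    def on_select(self, level: Level) -> None:
        """First operation: overwrite with the most recently selected level."""
        self.selected = level

    def current(self) -> Level:
        """Level to apply now; falls back to the pre-stored setting."""
        return self.prestored if self.selected is None else self.selected


store = SettingStore()
level_before = store.current()
store.on_select(Level.WEAK)
store.on_select(Level.STRONG)  # a later operation replaces the earlier one
level_after = store.current()
```

Because `current()` is read each time processing runs, a level changed during a call takes effect from the next processed frame onward.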
[0061] Operation object G15 receives an operation from user U1 to end the call. When an operation is received by operation object G15, the call application unit 111 of user terminal 1 terminates its connection with the call application unit 211 of user terminal 2.
[0062] Visualization image G16 includes an image in which the spectrum G16-1 of the speaker user U1's voice before it was clarified and the spectrum G16-2 after it was clarified are superimposed. Visualization image G16 is a moving image that changes according to the progress of speech. Visualization image G16 allows user U1 to superimpose and compare the spectra G16-1 and G16-2 of their voice before and after it was clarified, so they can confirm the effect that their voice is clarified and heard by user U2, the person they are talking to.
[0063] Here, screen example G1 may include visualization images G16a to G16d shown in Figure 6 instead of visualization image G16. Figure 6 is a schematic diagram showing variations of visualization images.
[0064] As shown in Figure 6, the visualization image G16a is an image in which label images G16a-1 to G16a-4, each representing a formant component, are superimposed on the visualization image G16. Visualization image G16a allows user U1 to more clearly confirm that the first-order formant components, which are likely to be audible to user U2 (the other party in the conversation), are not amplified, while the higher-order formant components, which may be difficult to hear, are amplified.
[0065] Furthermore, as shown in Figure 6, visualization image G16b is an image showing the difference spectrum G16b-1 between the spectrum G16-1 before clarification and the spectrum G16-2 after clarification. Visualization image G16b allows user U1 to more clearly see the difference in how much the higher-order formant components in their own voice have been amplified.
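One simple way to obtain a difference spectrum such as G16b-1 is a bin-by-bin subtraction of the two magnitude spectra. The sketch below assumes a linear-magnitude difference; the source does not state whether the difference is taken on a linear or a decibel scale, and the function name is hypothetical.

```python
from typing import List


def difference_spectrum(before: List[float], after: List[float]) -> List[float]:
    """Per-bin amount by which processing raised (or lowered) the spectrum."""
    if len(before) != len(after):
        raise ValueError("spectra must have the same number of bins")
    return [a - b for b, a in zip(before, after)]


# Toy spectra: the lowest bin is untouched, higher bins are amplified.
before = [1.0, 0.5, 0.25]
after = [1.0, 1.0, 1.0]
diff = difference_spectrum(before, after)
```

In this toy input the first bin shows no change while the higher bins show positive values, mirroring the case where only higher-order components are amplified.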
[0066] Furthermore, as shown in Figure 6, visualization image G16c is an image obtained by placing visualization image G16 and visualization image G16b side by side. In other words, visualization image G16c includes an image in which the spectra G16-1 and G16-2 before and after clarification are superimposed, and an image showing the difference between the spectra G16-1 and G16-2. Visualization image G16c allows user U1 to more clearly confirm the effect on user U2 due to the clarification of their own voice.
[0067] Furthermore, as shown in Figure 6, visualization image G16d is an image in which visualization image G16d-1 and visualization image G16d-2 are placed side by side. Visualization image G16d-1 shows the spectrum G16-1 before clarification. Visualization image G16d-2 shows the spectrum G16-2 after clarification. Visualization image G16d allows user U1 to compare the spectra G16-1 and G16-2 before and after clarification side by side, so that user U1 can clearly confirm the effect of clarifying their own voice.
[0068] Note that the example screen G1 may include not only visualization images G16, G16a to G16d, but also visualization images of other forms, or combinations of some or all of them.
[0069] (Flow of Information Processing Method S2) The information processing system 100 also executes information processing method S2. In information processing method S2, when user U2 is the speaker, speech processing is performed to clarify the received speech (in other words, the speech of user U2) that is heard by user U1. Information processing method S2 is a method of presenting user U1 with a visualization image that visualizes the speech processing on the received speech. Figure 7 is a flowchart illustrating the flow of information processing method S2. As shown in Figure 7, information processing method S2 includes steps S201 to S211.
[0070] Steps S201 and S202 are steps to initiate a call between user U1 and user U2. They are the same as steps S101 and S102, so a detailed explanation is not repeated.
[0071] In step S203, the clarification UI unit 115 of the user terminal 1 receives a first operation from user U1 to set the clarification level of the received voice. The setting information, including the clarification level of the received voice set by the first operation, is stored in the storage unit 120. The details of the clarification level of the received voice are the same as those of the clarification level of the transmitted voice. Note that step S203 is not limited to being executed after the start of a call (after step S201); it may also be executed before the start of a call (before step S201). Likewise, it is not limited to immediately after the start of a call; it may be executed at any point during the call (any point after step S204, which is explained below). Furthermore, if the first operation is not accepted, pre-stored setting information may be applied.
[0072] In step S204, the call application unit 211 of the user terminal 2 acquires the voice data of user U2 via the voice input unit 250. In other words, it is assumed that user U2 spoke during the call. The call application unit 211 then transmits the acquired voice data of user U2 to the user terminal 1.
[0073] Step S205 is an example of the first acquisition process. In step S205, the first acquisition unit 112 acquires the user U2's voice data by receiving it.
[0074] Step S206 is an example of the second acquisition process. In step S206, the second acquisition unit 113 performs voice processing on the user U2's voice data according to the setting information. As a result, clarified voice data is acquired that is clarified according to the clarification level indicated by the setting information. If the setting information indicates that the clarification level of the received voice is "off", voice processing is not performed and clarified voice data is not acquired.
[0075] In step S207, the audio output control unit 114 outputs the audio indicated by the clarified audio data via the audio output unit 160. As a result, user U1 can hear the clarified voice of user U2. In other words, user U1 can hear the voice of user U2, the person they are talking to, clarified by the clarification function of the user terminal 1 that they are using. If clarified audio data has not been generated, the audio indicated by user U2's audio data will be output.
[0076] Step S208 is an example of visualization control processing. In step S208, the visualization control unit 116 of the user terminal 1 generates a visualization image based on the spectrum of user U2's voice data and the spectrum of the clarified voice data. The visualization control unit 116 also displays the visualization image on the display unit 130. Here, steps S204 to S208 are executed in real time according to the progress of speech by user U2. Therefore, the visualization image displayed in step S208 is a moving image that changes according to the progress of speech. By viewing the visualization image, user U1 can confirm the effect that the voice of user U2, the person they are talking to, has been clarified.
[0077] Subsequently, when user U1 speaks, user terminal 1 transmits the voice data of user U1 acquired via the voice input unit 150 to user terminal 2. User terminal 2 then outputs the received voice data via the voice output unit 260. As a result, user U2 hears the voice of user U1, which is not yet clarified. It is also possible to configure the system so that when user U1 speaks, steps S104 to S108 of the information processing method S1 described above are executed. In this case, user U2 hears the voice of user U1 that has been clarified. Furthermore, if user U2 speaks again, steps S204 to S208 are repeated.
[0078] Furthermore, if step S203 is executed again during the call and the clarification level of the received audio is changed, the changed clarification level is reflected in the clarified audio data acquired in step S206 and in the audio heard by user U1 in step S207. In addition, the changed clarification level is reflected in the visualization image presented to user U1 in step S208.
[0079] Steps S210 and S211 are steps to terminate the call between user U1 and user U2. They are the same as steps S110 and S111, so a detailed explanation is not repeated.
[0080] (Screen Example) Figure 8 is a schematic diagram showing a screen example G2 displayed on the display unit 130 of the user terminal 1 in the information processing method S2. In addition to providing a user interface for the user's own voice (transmitted voice) similar to the screen example G1 shown in Figure 5, screen example G2 also provides a user interface for clarifying and visualizing the voice of the other party to the call (received voice). The screen example G2 shown in Figure 8 includes information G21 to G22, areas G23 and G25, and an operation object G27.
[0081] Information G21 and G22 indicate that a call is in progress and are explained in the same way as information G11 and G12 in screen example G1. Operation object G27 "End Call" is explained in the same way as operation object G15 in screen example G1.
[0082] Area G23 is an area that provides a user interface for clarifying one's own voice. The information G231, operation objects G232-1 to G232-4, and visualization image G233 included in area G23 are explained in the same way as the information G13, operation objects G14-1 to G14-4, and visualization image G16 in screen example G1. Note that in screen example G2, operation object G232-4 "Off" is highlighted, indicating an example of turning off one's own voice clarification. For this reason, visualization image G233 includes the spectrum G23-3 of the voice of the speaker, user U1, before clarification, but does not include the spectrum of the clarified voice.
[0083] Area G25 is an area that provides a user interface related to clarifying the voice of the person on the other end of a call. Area G25 includes information G251, operation objects G252-1 to G252-4, and a visualization image G253. Information G251 (for example, "U2's voice has been clarified") indicates that the voice of user U2, the person on the other end of the call, has been clarified.
[0084] The operation objects G252-1 "Weak", G252-2 "Medium", G252-3 "Strong", and G252-4 "Off" accept an operation to select the clarity level for the received audio (an example of the first operation). Operation objects G252-1 to G252-4 are explained in the same way as operation objects G14-1 to G14-4 in screen example G1, with "transmitted audio" replaced by "received audio". However, the number of clarity levels for the received audio does not necessarily have to be the same as the number of clarity levels for the transmitted audio. For example, the number of levels that can be set for the received audio may be greater than the number of levels that can be set for the transmitted audio. This allows the user to clarify their own voice for the other party while setting a more precise degree of clarity for the audio they need to hear. Conversely, the number of levels that can be set for the received audio may be less than the number of levels that can be set for the transmitted audio. This allows for detailed clarification of one's own voice so that the other party can hear it, while also making it easy to adjust the degree of clarification for the voice that the user needs to hear.
[0085] Additionally, among the operation objects G252-1 to G252-4, the highlighted G252-1 "Weak" indicates that the currently selected clarity level is "Weak".
[0086] Visualization image G253 includes an image in which the spectrum G253-1 of user U2's voice before it was clarified and the spectrum G253-2 of user U2's voice after it was clarified are superimposed. Visualization image G253 is a moving image that changes according to the progress of speech. Visualization image G253 allows user U1 to superimpose and compare the spectra G253-1 and G253-2 of user U2's voice before and after it was clarified, thereby confirming the effect of clarifying user U2's voice.
[0087] Here, screen example G2 may include visualization images in a display manner similar to visualization images G16a to G16d shown in Figure 6, instead of visualization images G233 and G253. Note that the display manner of the visualization images included in screen example G2 is not limited to the examples described above. For example, the visualization image is not limited to showing the spectrum as a continuous waveform, but may also be shown as a histogram for each unit band (for example, a stacked graph in the style of a graphic equalizer).
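The per-band display mentioned above can be approximated by grouping adjacent spectrum bins into unit bands and plotting one bar per band. The sketch below averages the bins in each band; the function name, the fixed band width, and the use of averaging rather than another statistic are assumptions.

```python
from typing import List


def band_levels(spectrum: List[float], bins_per_band: int) -> List[float]:
    """Average magnitude of each unit band (the last band may be narrower)."""
    if bins_per_band <= 0:
        raise ValueError("bins_per_band must be positive")
    levels = []
    for start in range(0, len(spectrum), bins_per_band):
        band = spectrum[start:start + bins_per_band]
        levels.append(sum(band) / len(band))
    return levels


bars = band_levels([1.0, 3.0, 2.0, 2.0, 5.0], bins_per_band=2)
```

Computing `band_levels` for both the original and the clarified spectrum gives two bar heights per band, which could be drawn as a stacked or paired graph.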
[0088] (Modification 1 of Embodiment 1) In the information processing system 100 according to Embodiment 1, the description mainly focused on the case in which the user terminal 2 does not have a clarity function. However, when the user terminal 2 does have a clarity function, the second acquisition unit 113 in the user terminal 1 is modified as follows.
[0089] For example, suppose that voice processing to clarify spoken audio is performed in user terminal 2 based on the operation of user U2 (speaker). In this case, the second acquisition unit 113 of user terminal 1 either makes it impossible to accept the operation of user U1 (receiver) to perform the voice processing on the received audio, or even if it accepts the operation of user U1 (receiver), it does not perform the voice processing.
[0090] For example, the second acquisition unit 113 may determine whether or not speech processing for clarifying spoken speech has been performed at the user terminal 2 based on the spectrum of the received speech. Alternatively, for example, the second acquisition unit 113 may determine whether or not speech processing for clarifying spoken speech has been performed at the user terminal 2 based on the received speaker-side clarification flag. The speaker-side clarification flag is, for example, a flag that is transmitted to the receiver along with the speech data indicating spoken speech or the clarified speech data. For example, the user terminal 2 may transmit a speaker-side clarification flag indicating clarification on along with the clarified speech data to the user terminal 1, and a speaker-side clarification flag indicating clarification off along with the unclarified speech data to the user terminal 1. The user terminal 1 may also have a function to transmit the speaker-side clarification flag. This allows the user terminal 2 to operate based on the speaker-side clarification flag in the same way as the user terminal 1.
[0091] Furthermore, for example, suppose that voice processing to clarify the received voice is performed in user terminal 2 based on the operation of user U2 (receiver). In this case, the second acquisition unit 113 of user terminal 1 either makes it impossible to accept the operation of user U1 (speaker) to perform the voice processing on the uttered voice, or even if it accepts the operation of user U1 (speaker), it does not perform the voice processing.
[0092] For example, the second acquisition unit 113 may determine whether or not voice processing to clarify the received voice has been performed in the user terminal 2 based on the receiver-side clarification flag. The receiver-side clarification flag is, for example, a flag that is sent to the speaker when voice processing to clarify the received voice has been performed. For example, the user terminal 2 may send a receiver-side clarification flag indicating clarification on to the user terminal 1 when it has performed voice processing to clarify the received voice. In this case, the second acquisition unit 113 of the user terminal 1 will determine that voice processing to clarify the received voice has been performed in the user terminal 2 when it receives the receiver-side clarification flag indicating clarification on. Alternatively, the second acquisition unit 113 of the user terminal 1 may determine that voice processing to clarify the received voice has not been performed in the user terminal 2 when it has not received the receiver-side clarification flag indicating clarification on.
[0093] In this modified example, for example, the screen example G2 shown in Figure 8 is transformed as follows. For example, suppose it is determined that voice processing to clarify spoken audio is being performed on user terminal 2. In this case, on screen example G2 displayed on user terminal 1, the operation objects G252-1 to G252-4 related to clarifying received audio may be hidden or in a state where they do not accept operation (e.g., grayed out). However, even in this case, the visualization image G253 may still be displayed. Also, for example, suppose it is determined that voice processing to clarify received audio is being performed on user terminal 2. In this case, on screen example G2 displayed on user terminal 1, the operation objects G232-1 to G232-4 related to clarifying spoken audio may be hidden or in a state where they do not accept operation (e.g., grayed out). However, even in this case, the visualization image G253 may still be displayed.
[0094] In this modified version, since voice processing to clarify the same utterance is not performed twice on both the speaker and the receiver, it has the effect of preventing unnatural-sounding utterances from being heard by the receiver.
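The decisions described in this modification can be sketched as two checks on the flags received from the other terminal. The type and function names are hypothetical; treating a flag that has not been received as "off" follows the alternative described for the receiver-side clarification flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerFlags:
    """Flags reported by the other terminal; unreceived flags default to off."""

    speaker_side_on: bool = False   # peer already clarified its own speech
    receiver_side_on: bool = False  # peer clarifies the speech it receives


def may_clarify_received(peer: PeerFlags) -> bool:
    """Clarify incoming speech only if the peer has not already done so."""
    return not peer.speaker_side_on


def may_clarify_transmitted(peer: PeerFlags) -> bool:
    """Clarify outgoing speech only if the peer will not clarify it on receipt."""
    return not peer.receiver_side_on


no_flags = PeerFlags()
peer_clarifies_own_voice = PeerFlags(speaker_side_on=True)
peer_clarifies_on_receipt = PeerFlags(receiver_side_on=True)
```

The same two results could also drive the screen state, for example graying out the corresponding operation objects when a check returns `False`.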
[0095] (Modification 2 of Embodiment 1) In the information processing system 100 according to Embodiment 1, the second acquisition unit 113 is modified as follows. The second acquisition unit 113 performs speech processing to clarify the speaker's voice based on setting information selected from among multiple pieces of setting information, each predetermined for one of a plurality of types of speech characteristics. Speech characteristics may be, for example, language characteristics, dialect characteristics, gender characteristics, or individual characteristics, but are not limited to these. Here, the frequencies of higher-order formant components, including secondary formant components, that should be amplified for clarification may differ depending on the speech characteristics.
[0096] Therefore, the setting information may include information indicating the frequency band of higher-order formant components, including secondary formant components, according to the characteristics of the speech. For example, the first setting information may include information indicating the frequency band according to the characteristics of Japanese, and the second setting information may include information indicating the frequency band according to the characteristics of English. The number of types of speech characteristics for which the setting information is predetermined is not limited to two, but may be three or more. Furthermore, the process of selecting one of the multiple setting information may be performed by user operation or by a computer. The selection of setting information by a computer can be performed based on characteristics obtained by analyzing the spoken speech in real time.
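A possible data layout for such setting information is sketched below. All names are hypothetical, and the frequency bands are arbitrary placeholder numbers chosen only so the example runs; the source gives no numeric values. The `detect` callback stands in for the computer-side selection based on real-time analysis of the speech.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class SettingInfo:
    name: str
    boost_band_hz: Tuple[float, float]  # band of higher-order formants to amplify


# Placeholder bands for illustration only; not taken from the source.
PRESETS: Dict[str, SettingInfo] = {
    "japanese": SettingInfo("japanese", (1000.0, 4000.0)),
    "english": SettingInfo("english", (1200.0, 4500.0)),
}


def select_setting(
    user_choice: Optional[str], detect: Callable[[], str]
) -> SettingInfo:
    """Use the user's selection if given; otherwise ask the detector."""
    key = user_choice if user_choice is not None else detect()
    return PRESETS[key]


by_user = select_setting("english", detect=lambda: "japanese")
by_computer = select_setting(None, detect=lambda: "japanese")
```

Adding a third or further speech characteristic would only require another entry in `PRESETS`.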
[0097] In this modified example, for example, screen example G1 shown in Figure 5 and screen example G2 shown in Figure 8 are modified as follows. For example, screen examples G1 and G2 may include an operation object for selecting setting information to apply to clarifying one's own voice from multiple types of setting information (for example, setting information corresponding to Japanese, English, etc.). Also, for example, screen example G2 may include an operation object for selecting setting information to apply to clarifying the voice of the person on the other end of the call from the same multiple types of setting information.
[0098] This modified version has the effect of being able to clarify the speech accurately according to the characteristics of the speaker's voice.
[0099] [Embodiment 2] User terminal 1A according to Embodiment 2 of the present disclosure will be described below. For the sake of convenience of explanation, components having the same function as those described in the above embodiment will be denoted by the same reference numerals, and their descriptions will not be repeated.
[0100] User terminal 1A is one embodiment of the information processing device described in the claims, and has a function for clarifying and visualizing spoken voice. User terminal 1A can be used, for example, as a pre-call test before a call, but is not limited to this.
[0101] Figure 9 is a block diagram showing the functional configuration of user terminal 1A. As shown in Figure 9, user terminal 1A has the same configuration as user terminal 1, plus a playback control unit 117 in the control unit 110. The playback control unit 117 may also be included in a clarity application that extends the call function of the call application unit 111. Furthermore, the visualization control unit 116 in this embodiment is a modified version of the visualization control unit 116 in Embodiment 1. The other configurations are the same as in Embodiment 1, so a detailed explanation will not be repeated.
[0102] In the visualization control process, the visualization control unit 116 displays a visualization image in response to a second operation performed after the speaker has spoken. The second operation is, for example, an operation by which user U1 of user terminal 1A, after speaking, instructs that the spoken voice be clarified. However, it is sufficient that at least the display of the visualization image is performed in response to the second operation; the voice processing that generates the clarified voice data and the processing that generates the visualization image may be performed in response to the second operation, but may also be started before the second operation.
[0103] The playback control unit 117 executes playback control processing. Playback control processing is the process of playing back the selected audio in response to a third operation in which the speaker selects either the audio indicated by the audio data or the audio indicated by the clarified audio data after the speaker has spoken. For example, after speaking, user U1 may sequentially perform both the third operation of selecting the audio indicated by the audio data and the third operation of selecting the audio indicated by the clarified audio data in order to compare the audio before and after clarification. In this case, the order in which each of the third operations is performed is not limited to the order described above.
[0104] (Flow of Information Processing Method S3) The user terminal 1A configured as described above executes information processing method S3. Information processing method S3 assumes that user U1 performs a pre-test of voice clarity before making a call. Hereafter, "performing a pre-test of voice clarity before making a call" will also be referred to as "test call". Information processing method S3 is a method of presenting user U1 with a visualized image that visualizes the voice processing applied to user U1's voice. Figure 10 is a flowchart showing the flow of information processing method S3. As shown in Figure 10, information processing method S3 includes steps S301 to S306.
[0105] Step S301 is an example of the first acquisition process. In step S301, the first acquisition unit 112 acquires the voice data of user U1 via the voice input unit 150. Here, it is assumed that user U1 has spoken during a test call. The acquired voice data is stored in the storage unit 120.
[0106] In step S302, the clarification UI unit 115 receives a second operation from user U1 instructing that the voice be clarified. In step S302, the clarification UI unit 115 may also receive a first operation to set the clarification level. The setting information, including the clarification level set by the first operation, is stored in the storage unit 120. If the first operation is not received, pre-stored setting information may be applied.
[0107] Step S303 is an example of the second acquisition process. In step S303, the second acquisition unit 113 performs voice processing on the user U1's voice data according to the setting information. As a result, clarified voice data is acquired according to the clarification level indicated by the setting information. The acquired clarified voice data is stored in the storage unit 120.
[0108] Step S304 is an example of visualization control processing. In step S304, the visualization control unit 116 refers to the audio data and clarified audio data stored in the storage unit 120 and generates a visualization image based on the spectrum of the audio data and the spectrum of the clarified audio data. The visualization image is a moving image corresponding to the temporal length of the audio data (or clarified audio data). The visualization control unit 116 also displays the visualization image on the display unit 130. Here, the moving image that is the visualization image may be played, or at least one still image that constitutes the moving image may be displayed. By viewing the visualization image, user U1 can confirm the clarification effect on their own voice. For example, user U1 can confirm the clarification effect of how their voice will be heard by the other party during an actual call before making a call.
[0109] In step S305, the playback control unit 117 accepts a third operation to select either the audio before clarification or the audio after clarification.
[0110] In step S306, the playback control unit 117 outputs the audio selected by the third operation received in step S305 via the audio output unit 160. This allows user U1 to hear their own voice before or after clarification. Furthermore, by repeating steps S305 to S306, user U1 can compare their own voice before and after clarification and thereby confirm the effect of clarification by ear. For example, before an actual call, user U1 can confirm how much clearer their voice will sound to the other party during a call.
[0111] (Screen Example) Figure 11 is a schematic diagram showing a screen example G3 displayed on the display unit 130 of the user terminal 1A in the information processing method S3. Screen example G3 provides a user interface related to the function of clarifying and visualizing the user's own voice during a test call. As shown in Figure 11, screen example G3 includes operation objects G31, G32-1 to G32-3, G34 to G36, and a visualization image G33.
[0112] The operation object G31 accepts an operation to initiate speech. When an operation is received by the operation object G31, the first acquisition unit 112 acquires audio data representing the voice of user U1 input via the voice input unit 150. For example, the acquired audio data may represent the voice input from the time the operation is received by the operation object G31 until a predetermined speaking time has elapsed. Alternatively, for example, the acquired audio data may represent the voice input from the time the operation is received by the operation object G31 until the end of speech is detected. Alternatively, for example, the acquired audio data may represent the voice input from the time the operation is received by the operation object G31 until the operation is received again.
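The three alternatives for ending acquisition can be expressed as a single stop check, sketched below. The names and default thresholds are hypothetical, and detecting the end of speech by a period of trailing silence is an assumption; the source does not say how the end of speech is detected.

```python
from enum import Enum


class StopRule(Enum):
    FIXED_TIME = 1        # stop after a predetermined speaking time
    END_OF_SPEECH = 2     # stop when the end of speech is detected
    SECOND_OPERATION = 3  # stop when the operation object is operated again


def should_stop(
    rule: StopRule,
    elapsed_s: float,
    trailing_silence_s: float,
    operated_again: bool,
    max_time_s: float = 5.0,       # placeholder value
    silence_limit_s: float = 1.0,  # placeholder value
) -> bool:
    """Return True when recording should end under the chosen rule."""
    if rule is StopRule.FIXED_TIME:
        return elapsed_s >= max_time_s
    if rule is StopRule.END_OF_SPEECH:
        return trailing_silence_s >= silence_limit_s
    return operated_again


stop_by_time = should_stop(StopRule.FIXED_TIME, 5.2, 0.0, False)
stop_by_silence = should_stop(StopRule.END_OF_SPEECH, 2.0, 0.3, False)
stop_by_operation = should_stop(StopRule.SECOND_OPERATION, 2.0, 0.0, True)
```

A recording loop would call `should_stop` after each captured frame and finalize the audio data once it returns `True`.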
[0113] The operation objects G32-1 "Weak", G32-2 "Medium", and G32-3 "Strong" accept an operation to select a clarification level during a test call (an example of a first operation). In this example, when an operation is received on any of the operation objects G32-1 to G32-3, voice processing is performed according to that clarification level. In other words, an operation on operation objects G32-1 to G32-3 is an example of a first operation, as well as an example of a second operation that instructs voice clarification. Also, in the example screen G3, the highlighted G32-2 "Medium" among the operation objects G32-1 to G32-3 indicates that an instruction to perform voice processing at clarification level "Medium" has been received. Note that the number of clarification levels, names, and highlighting methods during a test call are not limited to the example shown in Figure 11.
[0114] The visualization image G33 is a moving image generated based on the spectra of the audio data and the clarified audio data. The visualization image G33 may be played back as a moving image, or at least one still image that constitutes the moving image may be displayed. Furthermore, if a still image is displayed, it may be played back as a moving image when an operation is performed on, for example, the operation objects G34 and G35 described later. The visualization image G33 allows user U1 to compare the spectra G33-1 and G33-2 before and after clarification side by side, so that they can confirm the effect of how their voice will be clarified and heard by the other party during a call before the actual call.
[0115] Here, screen example G3 may include a visualization image in a display manner similar to visualization images G16a to G16d shown in Figure 6, instead of visualization image G33. Note that the display manner of the visualization image included in screen example G3 is not limited to the examples described above.
[0116] Operation object G34, "Listen to the audio before clarification," accepts an operation to select the audio before clarification as the audio to be played (an example of a third operation). Operation object G35, "Listen to the clarified audio," accepts an operation to select the audio after clarification as the audio to be played (an example of a third operation). For example, the aforementioned visualization image G33 may be played in sync with the audio when either the audio before clarification or the audio after clarification is played.
[0117] The operation object G36 accepts an operation to instruct the end of the test call. When this operation is accepted, for example, screen example G3 may transition to a screen for starting a call.
[0118] In the example screen G3, operation objects G34 and G35 may be either unable to accept operations or hidden before any of the operation objects G32-1 to G32-3 that instruct the clarification of sound are operated and the visualization image G33 is displayed. In this case, once any of the operation objects G32-1 to G32-3 are operated and the visualization image G33 is displayed, operation objects G34 and G35 may become able to accept operations or change from hidden to displayed. In other words, the third operation may become available after the second operation has been accepted.
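The availability rule described above can be sketched as a small piece of screen state. The class name `PreCallTestScreen` and its methods are hypothetical, introduced only to illustrate that the third operation (selecting audio to play) becomes available after the second operation (instructing clarification) has been accepted.

```python
class PreCallTestScreen:
    """Hypothetical model of the playback controls on a test-call screen."""

    LEVELS = ("Weak", "Medium", "Strong")

    def __init__(self) -> None:
        self.selected_level = None
        self.visualization_shown = False

    def select_level(self, level: str) -> None:
        """Second operation: instruct clarification at the chosen level."""
        if level not in self.LEVELS:
            raise ValueError(f"unknown clarification level: {level}")
        self.selected_level = level
        # Clarification has run, so the visualization image is now displayed.
        self.visualization_shown = True

    @property
    def playback_enabled(self) -> bool:
        """Playback controls accept operations only once the image is shown."""
        return self.visualization_shown

    def play(self, which: str) -> str:
        """Third operation: choose 'before' or 'after' clarification audio."""
        if not self.playback_enabled:
            raise RuntimeError("playback is unavailable before clarification")
        if which not in ("before", "after"):
            raise ValueError("which must be 'before' or 'after'")
        return which
```

Whether the controls are disabled or hidden is a presentation choice; both map onto the same `playback_enabled` flag in this sketch.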
[0119] Furthermore, in the example screen G3, operation objects G34 and G35 do not necessarily have to be included. For example, when any of operation objects G32-1 to G32-3 is operated, audio processing is performed and the visualized image G33 is displayed, and the audio before clarification and the audio after clarification are played sequentially. In this case, the operation on operation objects G32-1 to G32-3 is an example of a second operation that instructs the clarification of the audio, and is also an example of a third operation that selects the audio to be played. In other words, the same operation may be applied as both the second and third operations.
[0120] (Modification of Embodiment 2) In the user terminal 1A according to Embodiment 2, the second acquisition unit 113 can be modified in the same way as the second acquisition unit 113 in Modification 2 of Embodiment 1. This provides the same effect as Modification 2 of Embodiment 1, even when performing a test before a call.
[0121] [Embodiment 3] An information processing system 100B according to Embodiment 3 of the present disclosure will be described below. For the sake of convenience of explanation, components having the same function as those described in the above embodiments will be denoted by the same reference numerals, and their descriptions will not be repeated.
[0122] (Configuration of Information Processing System 100B) Figure 12 is a block diagram showing the configuration of the information processing system 100B. As shown in Figure 12, the information processing system 100B includes a user terminal 1B used by user U1, a user terminal 2 used by user U2, and a server 3. User terminals 1B and 2 can be connected to each other via the server 3 so that users U1 and U2 can make calls. User terminals 1B and 2 and the server 3 can be connected via a network N. Although Figure 12 shows one each of user terminal 1B, user terminal 2, and server 3, there may be multiple instances of each. Server 3 is a computer that has the function of relaying calls between user terminals 1B and 2 using a predetermined call method. Server 3 is one embodiment of the information processing device described in the claims, and provides at least a voice clarification function and a visualization function to user terminal 1B.
[0123] User terminal 1B is a terminal used by user U1 and is a modified version of user terminal 1. User terminal 1B has a user interface that utilizes the clarification and visualization functions provided by server 3. User terminal 2 is described as above and is assumed not to have the clarification and visualization functions, but is not limited to this and may have such functions.
[0124] (Configuration of Server 3) Figure 13 is a block diagram showing the functional configuration of each device constituting the information processing system 100B. As shown in Figure 13, Server 3 comprises a control unit 310, a storage unit 320, and a communication unit 370. The control unit 310 is implemented, for example, by a processor executing a program stored in memory, and controls each part of Server 3. Details of each functional block of the control unit 310 will be described later. The storage unit 320 is composed of, for example, memory, and stores various data and programs used by the control unit 310. The communication unit 370 connects to the network N and communicates with the outside. The communication unit 370 transmits information input from the control unit 310 via the network N. The communication unit 370 also outputs information received via the network N to the control unit 310. Note that the storage unit 320 and the communication unit 370 may be connected as peripheral devices instead of being built into Server 3.
[0125] (Functional block of control unit 310) As shown in Figure 13, the control unit 310 includes a call relay unit 311, a first acquisition unit 312, a second acquisition unit 313, an audio output control unit 314, and a visualization control unit 316.
[0126] The call relay unit 311 has the function of relaying calls between multiple user terminals (for example, user terminal 1B and user terminal 2) according to a predetermined call method. For example, the call relay unit 311 generates a call session in response to a request from any of the multiple user terminals. The call relay unit 311 also allows any of the multiple user terminals to join the generated call session in response to a request from any of the multiple user terminals. For example, the call relay unit 311 relays voice data between the multiple user terminals participating in the call session while it is active. For example, the call relay unit 311 either removes any of the multiple user terminals from the call session or terminates the call session itself in response to a request from any of the multiple user terminals.
[0127] For example, the call relay unit 311 may be at least part of a server application that provides a two-way voice call function or a group voice call function. Alternatively, for example, the call relay unit 311 may be at least part of a server application that provides a two-way video call function or a group video call function. In the following description, the call relay unit 311 will mainly be described in an example where it relays calls between two user terminals, user terminal 1B and user terminal 2, but the number of user terminals to be relayed may be three or more.
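The session operations attributed to the call relay unit 311 (create, join, relay, leave, terminate) can be sketched as below. The `CallRelay` class, its method names, and the use of string terminal identifiers are assumptions for illustration; the actual call method and transport are not specified in the text.

```python
from typing import Dict, List, Set, Tuple


class CallRelay:
    """Hypothetical in-memory model of call-session relaying."""

    def __init__(self) -> None:
        self._sessions: Dict[int, Set[str]] = {}
        self._next_id = 1

    def create_session(self) -> int:
        """Create an empty call session and return its identifier."""
        session_id = self._next_id
        self._next_id += 1
        self._sessions[session_id] = set()
        return session_id

    def join(self, session_id: int, terminal: str) -> None:
        """Add a terminal to an existing session."""
        self._sessions[session_id].add(terminal)

    def leave(self, session_id: int, terminal: str) -> None:
        """Remove one terminal; the session itself stays active."""
        self._sessions[session_id].discard(terminal)

    def terminate(self, session_id: int) -> None:
        """End the session for every participant."""
        del self._sessions[session_id]

    def relay(self, session_id: int, sender: str,
              voice_data: bytes) -> List[Tuple[str, bytes]]:
        """Return (recipient, data) pairs for every participant but the sender."""
        members = self._sessions[session_id]
        if sender not in members:
            raise ValueError("sender is not part of this session")
        return [(t, voice_data) for t in sorted(members) if t != sender]
```

Because `relay` fans out to every other participant, the same sketch covers both a two-party call and a group call with three or more terminals.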
[0128] The first acquisition unit 312 executes the first acquisition process by receiving voice data from the speaker (for example, user U1 or user U2) from user terminal 1B or user terminal 2.
[0129] The second acquisition unit 313 acquires clarified speech data generated by speech processing for clarifying the speech of the speaker (for example, user U1 or user U2). The second acquisition unit 313 is the same as the second acquisition unit 113 described above, so a detailed explanation will not be repeated.
[0130] The voice output control unit 314 transmits the clarified voice data to the recipient's user terminal (for example, user terminal 1B or user terminal 2) to control the output of the voice indicated by the clarified voice data from the user terminal. For example, if user U1 is the speaker and clarified voice data for user U1 has been generated, the voice output control unit 314 transmits the clarified voice data to user terminal 2. As a result, the voice indicated by the clarified voice data is output from user terminal 2. Also, for example, if user U2 is the speaker and clarified voice data for user U2 has been generated, the voice output control unit 314 transmits the clarified voice data to user terminal 1B. As a result, the voice indicated by the clarified voice data is output from user terminal 1B.
[0131] The visualization control unit 316 performs visualization control processing. For example, the visualization control unit 316 transmits a visualization image to the user terminal 1B, which has a visualization function. Note that the visualization image does not need to be transmitted to the user terminal 2, which does not have a visualization function. For example, if user U1 is the speaker and clarified voice data for user U1 has been generated, a visualization image visualizing the clarification of the transmitted voice is transmitted to the user terminal 1B. Also, for example, if user U2 is the speaker and clarified voice data for user U2 has been generated, a visualization image visualizing the clarification of the received voice is transmitted to the user terminal 1B.
[0132] (Configuration of User Terminal 1B) As shown in Figure 13, User Terminal 1B, like User Terminal 1, is equipped with a control unit 110, a storage unit 120, a display unit 130, an input unit 140, an audio input unit 150, an audio output unit 160, and a communication unit 170, but the functional blocks provided by the control unit 110 are different. The storage unit 120, display unit 130, input unit 140, audio input unit 150, audio output unit 160, and communication unit 170 have been described above, so a detailed explanation will not be repeated.
[0133] (Functional Block of Control Unit 110) As shown in Figure 13, the control unit 110 includes a call application unit 111 and a clarification UI unit 115. The clarification UI unit 115 may constitute a clarification application that extends the call function of the call application unit 111.
[0134] The call application unit 111 provides functions for making calls according to a predetermined call method. The call application unit 111 is a modified version of the call application unit 111 in Embodiment 1. The following description will focus on the modifications made to the call application unit 111, and will not repeat the same points. The call application unit 111 connects with the call destination via the server 3. The call application unit 111 also sends and receives voice data with the call destination via the server 3. Furthermore, the call application unit 111 can make calls with multiple call destinations, not just one. For example, the call application unit 111 sends a request to the server 3 to create a call session with one or more call destinations. The call application unit 111 also joins the call session by sending a request to the server 3 to join the call session. The call application unit 111 also sends and receives voice data with other user terminals participating in the same call session via the server 3. The call application unit 111 leaves the call session by sending a request to the server 3 to leave the call session. Furthermore, the call application unit 111 terminates the call session by sending a request to the server 3 to terminate the call session.
[0135] The clarification UI unit 115 receives the first operation and generates setting information. The details of the first operation and setting information are as described above, so a detailed explanation will not be repeated. The generated setting information is transmitted to the server 3 and stored in the server 3's storage unit 320. The clarification UI unit 115 also displays the visualization image transmitted from the server 3 on the display unit 130.
[0136] (Configuration of User Terminal 2) The configuration of User Terminal 2 is as described above, so a detailed explanation will not be repeated. However, the call application unit 211 is a modified version of the call application unit 211 in Embodiment 1. The modifications in the call application unit 211 will be explained in the same way as the modifications in the call application unit 111.
[0137] (Flow of Information Processing Method S4) The information processing system 100B configured as described above executes information processing method S4. In information processing method S4, when user U1 is the speaker, voice processing to clarify user U1's transmitted voice is performed at the server 3, which relays the call. In addition, a visualization image that visualizes the voice processing on user U1's transmitted voice is generated by the server 3. Below, an example of a two-person call using information processing method S4 will be described. However, information processing method S4 can also be applied to group calls of three or more parties by assuming that there are multiple instances of at least one of user terminal 1B and user terminal 2. Figure 14 is a flowchart illustrating the flow of information processing method S4. As shown in Figure 14, information processing method S4 includes steps S401 to S422.
[0138] In step S401, the call application unit 111 of user terminal 1B sends a call session creation request and a join request to server 3 based on user U1's operation to start a call with user U2. In step S402, the call relay unit 311 of server 3 creates a call session and allows user terminal 1B to join. In step S403, the call application unit 211 of user terminal 2 sends a call session join request to server 3 based on user U2's operation to respond to the call with user U1. The call relay unit 311 of server 3 allows user terminal 2 to join the call session. This initiates a call between user U1 and user U2. Steps S401 to S403 are not limited to this order; they can be executed in the reverse order, or some or all of them in parallel. Furthermore, steps S401 to S403 do not limit which of user U1 and user U2 initiates the call to the other.
[0139] In step S404, the clarification UI unit 115 of the user terminal 1B receives a first operation from user U1 to set the level of clarification of the transmitted voice. The details of step S404 are the same as those of step S103. The clarification UI unit 115 also transmits the setting information to the server 3.
[0140] In step S405, the second acquisition unit 313 of the server 3 stores setting information, including the clarification level of the transmitted voice, in the storage unit 320.
[0141] In step S406, the call application unit 111 of the user terminal 1B acquires the voice data of user U1 via the voice input unit 150. In other words, it is assumed that user U1 spoke during the call. The call application unit 111 also transmits the voice data of user U1 to the server 3.
[0142] Step S407 is an example of the first acquisition process. In step S407, the first acquisition unit 312 of the server 3 acquires the voice data of user U1 by receiving it from the user terminal 1B.
[0143] Step S408 is an example of the second acquisition process. In step S408, the second acquisition unit 313 performs voice processing on the user U1's voice data according to the setting information. As a result, clarified voice data is acquired that is clarified according to the clarification level indicated by the setting information. If the setting information indicates that the clarification level of the transmitted voice is "off", voice processing is not performed and clarified voice data is not acquired.
[0144] In step S409, the audio output control unit 314 transmits clarified audio data to the user terminal 2. If clarified audio data has not been generated, the user U1's audio data is transmitted instead.
[0145] In step S410, the call application unit 211 of the user terminal 2 receives the clarified voice data. The call application unit 211 also outputs the voice indicated by the received clarified voice data via the voice output unit 260. As a result, user U2 hears the clarified voice of user U1. If voice data is received instead of clarified voice data, the unclarified voice indicated by that voice data is output.
[0146] Step S411 is an example of visualization control processing. In step S411, the visualization control unit 316 of the server 3 generates a visualization image based on the spectrum of the user U1's voice data and the spectrum of the clarified voice data. The visualization control unit 316 also transmits the visualization image to the user terminal 1B.
[0147] In step S412, the clarification UI unit 115 of the user terminal 1B displays the visualization image received from the server 3 on the display unit 130. Here, steps S406 to S412 are executed in real time according to the progress of speech by user U1. Therefore, the visualization image displayed in step S412 is a moving image that changes according to the progress of speech. By viewing the visualization image, user U1 can confirm the effect on user U2 due to the clarification of their own voice.
[0148] Subsequently, when user U2 speaks, user terminal 1B receives the voice data of user U2 transmitted from user terminal 2 from server 3 and outputs the received voice data via voice output unit 160. As a result, user U1 hears user U2's voice, which is not clarified. It is also possible to configure the system so that when user U2 speaks, steps S506 to S512 of the information processing method S5 described later are executed. In this case, user U1 hears user U2's voice, which is clarified. Furthermore, if user U1 speaks again, steps S406 to S412 are repeated.
[0149] Furthermore, if step S404 is executed again during the call and the clarification level of the transmitted voice is changed, the changed clarification level is reflected in the clarified voice data acquired in step S408 and in the voice heard by user U2 in step S410. In addition, the changed clarification level is reflected in the visualization image presented to user U1 in step S412.
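The server-side handling in steps S405 and S407 to S409, including the fallback when the level is "off" and the effect of changing the level mid-call, can be sketched as follows. Everything named here is an assumption for illustration: the `RelayServer` class, the level names, and in particular the `clarify` function, which uses a simple clipped gain purely as a placeholder because the text does not define the actual clarification processing.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical gains per level; placeholder only, not the disclosed processing.
GAIN_BY_LEVEL = {"weak": 1.2, "medium": 1.5, "strong": 2.0}


def clarify(samples: List[float], level: str) -> Optional[List[float]]:
    """Placeholder clarification: clipped gain, or None when the level is 'off'."""
    if level == "off":
        return None
    gain = GAIN_BY_LEVEL[level]
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


class RelayServer:
    """Hypothetical model of the per-utterance flow on the relaying server."""

    def __init__(self) -> None:
        self.settings: Dict[str, str] = {}

    def store_setting(self, terminal: str, level: str) -> None:
        """Step S405: keep the latest clarification level for a terminal."""
        self.settings[terminal] = level

    def handle_voice(
        self, terminal: str, samples: List[float]
    ) -> Tuple[List[float], Optional[List[float]]]:
        """Steps S407 to S409: acquire, clarify, and choose what to transmit."""
        level = self.settings.get(terminal, "off")
        clarified = clarify(samples, level)
        # If no clarified data was generated, the original data is sent instead.
        outgoing = clarified if clarified is not None else samples
        return outgoing, clarified
```

Because the level is looked up on every call to `handle_voice`, a setting stored again during the call takes effect from the next utterance, which mirrors the behavior described above.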
[0150] In step S420, the call application unit 111 of user terminal 1B sends a call session termination request to server 3 based on user U1's operation. In step S421, the call relay unit 311 of server 3 terminates the call session in response to the termination request. In step S422, the call application unit 211 of user terminal 2 completes the process of exiting the terminated call session. This terminates the call between user U1 and user U2. Note that steps S420 to S422 are not limited to being executed in this order, but may be executed in the reverse order, or some or all of them may be executed in parallel. Also, the operation to terminate the call session may be performed by user U2, not just user U1, or by both users. Furthermore, at least one of user U1 and user U2 may perform an operation to exit the call session instead of an operation to terminate it.
[0151] According to the information processing method S4, for example, the display unit 130 of the user terminal 1B displays the example screen G1 in the same manner as the information processing method S1. Details of the example screen G1 are as described above. By viewing the example screen G1, user U1 can confirm that their voice is clearly audible to user U2, the other party to the call.
[0152] (Flow of Information Processing Method S5) The information processing system 100B also executes information processing method S5. In information processing method S5, when user U2 is the speaker, voice processing to clarify the received voice heard by user U1 is performed at the server 3, which relays the call. In addition, a visualization image that visualizes the voice processing for the received voice heard by user U1 is generated by the server 3. Below, as with information processing method S4, an example of a two-person call using information processing method S5 will be described. However, information processing method S5 can also be applied to group calls of three or more parties by assuming that there are multiple instances of at least one of user terminal 1B and user terminal 2. Figure 15 is a flowchart illustrating the flow of information processing method S5. As shown in Figure 15, information processing method S5 includes steps S501 to S522.
[0153] Steps S501 to S503 are steps to initiate a conversation between user U1 and user U2. Steps S501 to S503 will be explained in the same way as steps S401 to S403, so a detailed explanation will not be repeated.
[0154] In step S504, the clarification UI unit 115 of the user terminal 1B receives a first operation from user U1 to set the level of clarification of the received voice. The details of step S504 will be explained in the same way as in step S203. The clarification UI unit 115 also transmits the setting information to the server 3.
[0155] In step S505, the second acquisition unit 313 of the server 3 stores setting information, including the clarification level of the received voice, in the storage unit 320.
[0156] In step S506, the call application unit 211 of the user terminal 2 acquires the voice data of user U2 via the voice input unit 250. In other words, it is assumed that user U2 spoke during the call. The call application unit 211 also transmits the voice data of user U2 to the server 3.
[0157] Step S507 is an example of the first acquisition process. In step S507, the first acquisition unit 312 of the server 3 acquires the voice data of user U2 by receiving it from the user terminal 2.
[0158] Step S508 is an example of the second acquisition process. In step S508, the second acquisition unit 313 performs voice processing on the user U2's voice data according to the setting information. As a result, clarified voice data is acquired according to the clarification level indicated by the setting information. If the setting information indicates that the clarification level of the received voice is "off", voice processing is not performed and clarified voice data is not acquired.
[0159] In step S509, the voice output control unit 314 transmits the clarified voice data to the user terminal 1B. If the clarified voice data has not been generated, the user U2's voice data is transmitted instead.
[0160] In step S510, the call application unit 111 of the user terminal 1B receives the clarified voice data. The call application unit 111 also outputs the voice indicated by the received clarified voice data via the voice output unit 160. As a result, user U1 hears the clarified voice of user U2. If voice data is received instead of clarified voice data, the unclarified voice indicated by that voice data is output.
[0161] Step S511 is an example of visualization control processing. In step S511, the visualization control unit 316 of the server 3 generates a visualization image based on the spectrum of the user U2's voice data and the spectrum of the clarified voice data. The visualization control unit 316 also transmits the visualization image to the user terminal 1B.
[0162] In step S512, the clarification UI unit 115 of the user terminal 1B displays the visualization image received from the server 3 on the display unit 130. Here, steps S506 to S512 are executed in real time according to the progress of the speech by user U2. Therefore, the visualization image displayed in step S512 is a moving image that changes according to the progress of the speech. By viewing the visualization image, user U1 can confirm the effect of clarification on the voice of user U2, the person they are talking to.
[0163] Subsequently, when user U1 speaks, user terminal 2 receives the voice data of user U1 transmitted from user terminal 1B from server 3 and outputs the received voice data via voice output unit 260. As a result, user U2 hears user U1's voice, which is not clarified. It is also possible to configure the system so that when user U1 speaks, steps S406 to S412 of the information processing method S4 described above are executed. In this case, user U2 hears user U1's voice, which is clarified. Furthermore, if user U2 speaks again, steps S506 to S512 are repeated.
[0164] Furthermore, if step S504 is executed again during the call and the clarification level of the received audio is changed, the changed clarification level is reflected in the clarified audio data acquired in step S508 and in the audio heard by user U1 in step S510. In addition, the changed clarification level is reflected in the visualization image presented to user U1 in step S512.
[0165] Steps S520 to S522 are steps for terminating the call between user U1 and user U2. Steps S520 to S522 will be explained in the same way as steps S420 to S422, so a detailed explanation will not be repeated.
[0166] According to the information processing method S5, for example, the display unit 130 of the user terminal 1B displays the example screen G2, similar to the information processing method S2. The details of the example screen G2 are as described above. By viewing the example screen G2, user U1 can confirm the effect that the voice of user U2, the person they are talking to, has been made clearer.
[0167] (Modification 1 of Embodiment 3) The visualization control unit 316 of the server 3 in Embodiment 3 may be modified to display a visualization image in response to a second operation after utterance by user U1. The control unit 310 of the server 3 may also be modified to include a playback control unit that performs playback control processing. As described in Embodiment 2, the playback control processing is a process for playing back the selected audio in response to a third operation in which user U1 selects either the audio indicated by the audio data or the audio indicated by the clarified audio data after utterance by user U1. The clarification UI unit 115 of the user terminal 1B may also be modified to accept a second or third operation after utterance by user U1. Details of the second and third operations are as described in Embodiment 2.
[0168] In Modification 1, user U1 can confirm the effect of clarification on their own voice by viewing a visualization image or playing back audio before and after clarification, regardless of whether they are in the middle of a call. For example, the display unit 130 of the user terminal 1B may display a screen example G3 similar to that of Embodiment 2. This allows user U1 to confirm the effect of clarification, that is, how their voice will sound clearer to the other party during a call, before actually making a call.
[0169] (Modification 2 of Embodiment 3) Embodiment 3 may be modified so that instead of the server 3 having a visualization control unit 316, the user terminal 1B has a visualization control unit 116. For example, in the information processing method S4 for visualizing the clarification of transmitted speech, the visualization control processing in step S411 may be performed by the user terminal 1B instead of the server 3. In this case, for example, in step S409, the server 3 not only transmits the clarified speech data of user U1 to the user terminal 2 but also to the user terminal 1B. This makes it possible for the user terminal 1B to perform the visualization control processing in step S411. As a result, instead of transmitting a moving image as a visualization image from the server 3 to the user terminal 1B, it is sufficient to transmit clarified speech data, which is generally smaller in size than the moving image, thus reducing communication costs.
[0170] Furthermore, for example, in the information processing method S5 for visualizing the clarification of received audio, the visualization control processing in step S511 may be performed by the user terminal 1B instead of the server 3. For example, in step S509, the server 3 transmits not only the clarified audio data of user U2 but also the audio data before audio processing to the user terminal 1B. This makes it possible for the user terminal 1B to perform the visualization control processing in step S511. As a result, instead of transmitting a moving image as a visualization image from the server 3 to the user terminal 1B, it is sufficient to transmit audio data, which is generally smaller in size than the moving image, thus reducing communication costs.
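The size comparison behind this modification can be made concrete with a back-of-the-envelope calculation. All figures below are illustrative assumptions, not values from the disclosure: uncompressed 16 kHz, 16-bit mono audio, and an uncompressed 320x240, 24-bit, 15 frames-per-second moving image. Real systems usually compress both audio and video, so the actual ratio depends on the codecs used.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bytes_per_sample: int,
                         channels: int = 1) -> int:
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * bytes_per_sample * channels


def raw_video_bytes_per_second(width: int, height: int,
                               bytes_per_pixel: int,
                               frames_per_second: int) -> int:
    """Data rate of an uncompressed moving image."""
    return width * height * bytes_per_pixel * frames_per_second


# Assumed formats (hypothetical, chosen only for illustration).
one_audio_stream = pcm_bytes_per_second(16_000, 2)        # 32,000 bytes/s
# Received-voice case: both the original and the clarified audio are sent.
two_audio_streams = 2 * one_audio_stream                   # 64,000 bytes/s
moving_image = raw_video_bytes_per_second(320, 240, 3, 15) # 3,456,000 bytes/s

ratio = moving_image / two_audio_streams                   # 54.0
```

Under these assumptions, sending two audio streams still requires far less data than sending the moving image, which is consistent with the reduction in communication costs described above.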
[0171] (Modification 3 of Embodiment 3) In the information processing system 100B according to Embodiment 3, the second acquisition unit 313 of the server 3 can be modified in the same way as the second acquisition unit 113 in Modification 1 or 2 of Embodiment 1. This provides the same effect as Modification 1 or 2 of Embodiment 1, even when the server 3 relays the call.
[0172] [Embodiment 4] The telephone system 100C according to Embodiment 4 of the present disclosure will be described below. The telephone system 100C includes a voice processing device 10C that performs voice processing to clarify the transmitted or received voice. The voice processing device 10C also has a built-in microcomputer 1C that visualizes the voice processing. The microcomputer 1C is one embodiment of the information processing device described in the claims. Note that the microcomputer 1C is not limited to what is called a "microcomputer," but may be any computer that has a processor and memory and can be built into the voice processing device 10C. For the sake of convenience of explanation, in the following, components that have the same function as the components described in the above embodiments will be denoted by the same reference numerals, and their descriptions will not be repeated.
[0173] (Configuration of Telephone System 100C) Figure 16 is a block diagram showing the configuration of telephone system 100C. As shown in Figure 16, telephone system 100C comprises an audio processing device 10C, a handset 20, a main unit 30, and cables 41, 42, and 43. Telephone system 100C is used by user U1 to make calls. The person with whom a call is made via telephone system 100C is referred to as user U2.
[0174] (Handset 20) The handset 20 includes a microphone 21, a speaker 22, and a port P21. The microphone 21 and speaker 22 are each connected to port P21. The microphone 21 is an example of a transmitter, and the speaker 22 is an example of a receiver.
[0175] The microphone 21 converts the voice emitted by user U1 into an audio signal S_V and supplies the audio signal S_V to port P21. In this embodiment, the audio signal S_V is an analog signal and an electrical signal. However, the microphone 21 may be configured to convert the voice emitted by user U1 into a digital audio signal S_V.
[0176] For example, an electret condenser microphone can be used as the microphone 21. An electret condenser microphone requires a DC voltage to be applied in order to drive it. Also, electret condenser microphones are often connected using different types of connectors depending on the manufacturer or model.
[0177] Speaker 22 converts the audio signal supplied from port P21, which is the voice emitted by user U2, into sound and outputs the sound. In this embodiment, the audio signal supplied from port P21 is an analog signal and an electrical signal. However, the audio signal supplied from port P21 may be a digital signal, and speaker 22 may be configured to convert the digital audio signal into sound.
[0178] Port P21 is an example of a port on the handset 20, and in this embodiment, a modular jack conforming to the RJ-9 standard is used. However, port P21 is not limited to a modular jack conforming to the RJ-9 standard. Any connector capable of inputting and outputting audio signals may be used as port P21.
[0179] (Main unit 30) The main unit 30 is equipped with ports P31 and P32. Port P31 is an example of a port on the handset 20 side of the main unit 30, and in this embodiment, a modular jack conforming to the RJ-9 standard is used, similar to port P21.
[0180] Port P32 is an example of a line-side port on the main unit 30, and in this embodiment, a modular jack conforming to the RJ-11 standard is used. However, port P32 is not limited to a modular jack conforming to the RJ-11 standard. For example, a modular jack conforming to the RJ-12 standard or the RJ-14 standard may be used as port P32.
[0181] The main unit 30 is the main unit of a telephone using a two-wire telephone line. The main unit 30 is configured to enable bidirectional voice signal communication with the main unit of the receiving telephone by connecting the telephone line to port P32 and the handset 20 to port P31. The main unit 30 can be configured in the same way as the main unit of an existing telephone. Therefore, in this embodiment, the description of the main unit 30 will be omitted.
[0182] The main unit 30 outputs, to cable 43 (described later), the synthesized voice signal S_SV supplied to port P31 from the adder / combiner 16 of the voice processing device 10C (described later).
[0183] (Cables 41, 42, 43) Each of the cables 41, 42, and 43 is equipped with multiple signal lines. Each of the multiple signal lines is made of a conductor capable of transmitting an audio signal, which is an electrical signal.
[0184] Cable 41 connects port P11 of the voice processing device 10C (described later) to port P21 of the handset 20, cable 42 connects port P12 of the voice processing device 10C to port P31 of the main unit 30, and cable 43 forms the end of the telephone line and is connected to port P32 of the main unit 30.
[0185] In this embodiment, each of the cables 41 and 42 is a cable provided with modular plugs conforming to the RJ-9 standard at both ends. Also, in this embodiment, the cable 43 is a cable provided with modular plugs conforming to the RJ-11 standard at both ends. In FIG. 16, the black squares shown at both ends of each of the cables 41 and 42 indicate modular plugs conforming to the RJ-9 standard, and the white squares shown at the ends of the cable 43 indicate modular plugs conforming to the RJ-11 standard.
[0186] (Voice Processing Device 10C) FIG. 17 is a block diagram showing the detailed configuration of the voice processing device 10C. As shown in FIG. 17, the voice processing device 10C includes a microcomputer 1C, a display unit 130, an AD (Analog-Digital) converter 180, and a voice processing circuit 200. The voice processing device 10C is provided so as to be interposed between the handset 20 and the main unit 30.
[0187] Also, in this embodiment, the housing of the voice processing device 10C is made of aluminum. However, the material constituting the housing is not limited to aluminum, and may be another metal such as copper or stainless steel. Also, the housing may be made mainly of resin with a metal layer provided on its surface. Because at least its surface is covered with metal, the housing can shield the interior from electromagnetic waves entering from the outside.
[0188] (Voice Processing Circuit 200) The voice processing circuit 200 performs voice processing on the voice signal S_V input from the handset 20 so that the voice is perceived as clear, and outputs the synthesized voice signal S_SV to the main unit 30. The voice signal S_V input to the voice processing circuit 200 is also supplied to the AD converter 180, and is input to the microcomputer 1C as voice data converted into a digital signal by the AD converter 180. Likewise, the synthesized voice signal S_SV is also supplied to the AD converter 180, and the clarified audio data converted into a digital signal by the AD converter 180 is input to the microcomputer 1C.
[0189] The audio processing circuit 200 includes port P11, port P12, branching unit 11, phase corrector 12, filter 13, amplifier 14, adjuster 15, and adder / combiner 16. The audio processing circuit 200 also further includes a power supply unit, a detection unit, and a control unit, which are not shown in Figure 17.
[0190] Port P11 is a port connected to the handset 20. Port P11 and port P21 of the handset 20 are connected by a cable 41. Of the terminals of port P11, the terminal connected to the microphone 21 is supplied, from the microphone 21, with the audio signal S_V into which the microphone 21 has converted the voice emitted by user U1.
[0191] The audio signal S_V includes a first-order formant component, a second-order formant component, ..., and an nth-order formant component. Note that n is a positive integer of 4 or more, although it varies somewhat between individuals depending on the speaking user U1.
[0192] The branching unit 11 branches the audio signal S_V into two signals: a first audio signal S_V1 and a second audio signal S_V2. In this embodiment, the branching unit 11 is configured so that the intensity ratio between the first audio signal S_V1 and the second audio signal S_V2 is 1:1, i.e., the distribution ratio is 1:1. However, the distribution ratio of the branching unit 11 is not limited to 1:1 and can be set as appropriate.
[0193] Note that the spectrum of each of the first audio signal S_V1 and the second audio signal S_V2 is similar to the spectrum of the audio signal S_V. Each of the first audio signal S_V1 and the second audio signal S_V2 includes a first-order formant component, a second-order formant component, ..., and an nth-order formant component.
[0194] The filter 13 is a high-pass filter that removes low-frequency components, i.e., components with a frequency of less than 400 Hz, from the first audio signal S_V1 that has passed through the branching unit 11, and outputs a first audio signal S_V1’ that does not contain the low-frequency components. The first audio signal S_V1’ is an audio signal that does not contain a first-order formant component, but contains a second-order formant component, a third-order formant component, ..., and an nth-order formant component.
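As an illustration of the filtering described above, the following minimal sketch models the filter 13 as a first-order digital high-pass filter with a 400 Hz cutoff. The sample rate, the function names `high_pass`, `rms`, and `tone`, and the choice of a first-order RC-style design are assumptions made for this example and are not specified in this disclosure. A first-order filter only attenuates low-frequency components gradually, so a steeper design would be needed to remove them substantially.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz=400.0):
    """First-order RC-style high-pass filter (illustrative stand-in for filter 13)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        # Difference equation: y[n] = alpha * (y[n-1] + x[n] - x[n-1])
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate, n):
    """Unit-amplitude sine wave used as a test signal."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 16000  # assumed sample rate in Hz
low = tone(100.0, SAMPLE_RATE, 4000)    # below the 400 Hz cutoff
high = tone(2000.0, SAMPLE_RATE, 4000)  # above the 400 Hz cutoff

low_gain = rms(high_pass(low, SAMPLE_RATE)) / rms(low)
high_gain = rms(high_pass(high, SAMPLE_RATE)) / rms(high)
print(f"gain at 100 Hz: {low_gain:.2f}, gain at 2000 Hz: {high_gain:.2f}")
```

In this sketch the 100 Hz tone is attenuated far more strongly than the 2000 Hz tone, which mirrors the intent of suppressing the first-order formant region while keeping the higher-order formant components.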
[0195] The amplifier 14 amplifies the first audio signal S_V1’ that has passed through the filter 13, and thereby outputs a first audio signal S_V1” in which the intensities of the second-order formant component, the third-order formant component, ..., and the nth-order formant component are increased.
[0196] The adjuster 15 adjusts the gain of the amplifier 14. In this embodiment, the adjuster 15 is configured using a switch that allows the gain to be adjusted in three stages. The adjuster 15 is configured such that when "0" is selected by the switch, the gain becomes 1x; when "1" is selected by the switch, the gain becomes 3x; and when "2" is selected by the switch, the gain becomes 5x.
[0197] However, the number of switch stages constituting the adjuster 15, and the gain selected by the switch, are not limited to the examples described above and can be selected as appropriate. Furthermore, the adjuster 15 can employ a volume control that continuously changes the gain instead of a switch that discretely changes the gain. Also, in cases where the user U2 on the receiving end can be identified to some extent, such as when the telephone system 100C is installed in a home, the gain can be pre-set, and the adjuster 15 can be omitted.
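The three-stage gain selection described above can be sketched as a simple lookup. The names `GAIN_BY_SWITCH_POSITION`, `amplify`, and `gain_in_db` are hypothetical, and the sketch assumes that the stated 1x, 3x, and 5x values are amplitude gains, which this disclosure does not state explicitly.

```python
import math

# Switch position -> amplifier gain, using the values given for this embodiment.
# Treating these as amplitude (not power) gains is an assumption of this sketch.
GAIN_BY_SWITCH_POSITION = {0: 1.0, 1: 3.0, 2: 5.0}


def amplify(samples, switch_position):
    """Scale samples by the gain selected with the switch."""
    gain = GAIN_BY_SWITCH_POSITION[switch_position]
    return [gain * s for s in samples]


def gain_in_db(switch_position):
    """Express the selected amplitude gain in decibels."""
    return 20.0 * math.log10(GAIN_BY_SWITCH_POSITION[switch_position])


boosted = amplify([0.5, -2.0], 1)
print(boosted, [round(gain_in_db(p), 2) for p in (0, 1, 2)])
```

Under the amplitude-gain assumption, the three positions correspond to roughly 0 dB, 9.5 dB, and 14 dB of boost.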
[0198] The phase corrector 12 corrects the phase of the second audio signal S_V2 so that it matches the phase of the first audio signal S_V1” that has passed through the amplifier 14. That is, the phase corrector 12 outputs a second audio signal S_V2’ that is in phase with the first audio signal S_V1”.
[0199] The adder / combiner 16 generates a synthesized voice signal S_SV by adding and combining the first audio signal S_V1” that has passed through the amplifier 14 and the second audio signal S_V2’ that has passed through the phase corrector 12. The adder / combiner 16 supplies the synthesized voice signal S_SV to port P12.
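The signal path from the branching unit 11 through the adder / combiner 16 can be sketched digitally as follows. This is a minimal sketch under stated assumptions, not the analog circuit of this embodiment: the filter is a first-order high-pass filter, the phase corrector 12 is idealized as a pass-through, and the function name `clarify`, the sample rate, and the test tones are hypothetical.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz=400.0):
    """First-order high-pass filter (assumed stand-in for filter 13)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / sample_rate)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out


def clarify(s_v, sample_rate, gain=3.0):
    """Branch, filter, amplify, and recombine, following units 11 to 16 in order."""
    s_v1 = list(s_v)  # branching unit 11: 1:1 split into two paths
    s_v2 = list(s_v)
    s_v1_filtered = high_pass(s_v1, sample_rate)         # filter 13
    s_v1_amplified = [gain * x for x in s_v1_filtered]   # amplifier 14
    s_v2_corrected = s_v2  # phase corrector 12, idealized as pass-through (assumption)
    return [a + b for a, b in zip(s_v1_amplified, s_v2_corrected)]  # adder / combiner 16


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate, n):
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 16000  # assumed sample rate in Hz
low = tone(100.0, SAMPLE_RATE, 4000)
high = tone(2000.0, SAMPLE_RATE, 4000)
low_ratio = rms(clarify(low, SAMPLE_RATE)) / rms(low)
high_ratio = rms(clarify(high, SAMPLE_RATE)) / rms(high)
print(f"level change at 100 Hz: {low_ratio:.2f}x, at 2000 Hz: {high_ratio:.2f}x")
```

With a gain of 3, the sketch raises the level of the 2000 Hz tone considerably more than that of the 100 Hz tone, which is the qualitative behavior expected when higher-order formant components are emphasized relative to the first-order formant component.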
[0200] Port P12 is a port connected to the main unit 30 and is an example of a second port. Port P12 and port P31 of the main unit 30 are connected using a cable 42. The terminal of port P12 to which the adder / combiner 16 is connected is supplied with the synthesized voice signal S_SV from the adder / combiner 16.
[0201] In this embodiment, the microphone 21 supplies the audio signal S_V, which is an analog signal, to port P11 via port P21, and port P12 supplies the synthesized voice signal S_SV, which is an analog signal, to the main unit 30 via port P31. However, the microphone 21 may be configured to supply the audio signal S_V as a digital signal to port P11, and port P12 may be configured to supply the synthesized voice signal S_SV as a digital signal to the main unit 30.
[0202] Furthermore, in this embodiment, the phase corrector 12, filter 13, amplifier 14, and adder / combiner 16 are each composed of analog circuits. However, the phase corrector 12, filter 13, amplifier 14, and adder / combiner 16 may each be composed of digital circuits.
[0203] Furthermore, in this embodiment, the audio processing circuit 200 includes the filter 13, which is a high-pass filter, as the filter that processes the first audio signal S_V1. However, the filter applied to the first audio signal S_V1 that has passed through the branching unit 11 may instead be configured to remove both (1) low-frequency components with a frequency of less than 400 Hz and (2) high-frequency components with a frequency exceeding 7 kHz. That is, the filter 13 may be a band-pass filter that passes components with a frequency of 400 Hz or more and 7 kHz or less, or a band-pass filter that passes components with a frequency of 400 Hz or more and 5 kHz or less. In this case, the filter 13 may be implemented as a single band-pass filter, or it may be implemented by connecting a high-pass filter and a low-pass filter in series.
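The series connection of a high-pass filter and a low-pass filter mentioned above can be sketched as follows. The first-order filter designs, the 48 kHz sample rate, and the names `high_pass`, `low_pass`, `band_pass`, and `gain_at` are assumptions made for illustration only. First-order sections roll off gently, so this shows the structure rather than a filter that sharply removes components outside 400 Hz to 7 kHz.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order high-pass section."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / sample_rate)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out


def low_pass(samples, sample_rate, cutoff_hz):
    """First-order low-pass section."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = dt / (rc + dt)
    out, prev_out = [], 0.0
    for x in samples:
        prev_out = prev_out + a * (x - prev_out)
        out.append(prev_out)
    return out


def band_pass(samples, sample_rate, low_hz=400.0, high_hz=7000.0):
    """Band-pass built by connecting a high-pass and a low-pass section in series."""
    return low_pass(high_pass(samples, sample_rate, low_hz), sample_rate, high_hz)


def gain_at(freq_hz, sample_rate=48000, n=4800):
    """Measure the RMS gain of the band-pass for a sine tone at freq_hz."""
    x = [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
    y = band_pass(x, sample_rate)
    return math.sqrt(sum(v * v for v in y) / sum(v * v for v in x))


gains = {f: gain_at(f) for f in (100.0, 1000.0, 20000.0)}
for freq, g in gains.items():
    print(f"{freq:>8.0f} Hz: gain {g:.2f}")
```

In this sketch the 1000 Hz tone inside the pass band keeps most of its level, while the 100 Hz and 20 kHz tones on either side are attenuated.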
[0204] In the audio processing circuit 200, the terminal of port P11 connected to the speaker 22 and a predetermined terminal of port P12 are directly connected. Therefore, the audio processing circuit 200 outputs to port P11 the audio signal supplied from the main unit 30 to port P12, which is the converted audio signal of the voice emitted by user U2, without applying any audio processing. Consequently, the telephone system 100C outputs the voice emitted by user U2 from speaker 22 without applying any audio processing.
[0205] (AD converter 180) The audio signal S_V supplied to port P11 is branched and input to the AD converter 180 and the branching unit 11. The AD converter 180 converts the audio signal S_V into audio data, which is a digital signal, and outputs the audio data to the microcomputer 1C.
[0206] The synthesized voice signal S_SV output from the adder / combiner 16 is branched and supplied to the AD converter 180 and port P12. The AD converter 180 converts the synthesized voice signal S_SV into clarified audio data, which is a digital signal, and outputs the clarified audio data to the microcomputer 1C.
[0207] (Display Unit 130) The display unit 130 displays an image generated by the microcomputer 1C. The display unit 130 may include, but is not limited to, a liquid crystal display or an organic EL (electroluminescence) display.
[0208] (Microcomputer 1C) The microcomputer 1C includes a control unit 110 and a storage unit 120. The control unit 110 is implemented, for example, by a processor executing a program stored in memory, and comprehensively controls each part of the audio processing device 10C. The storage unit 120 is composed of, for example, memory, and stores various data and programs used by the control unit 110.
[0209] The control unit 110 includes a first acquisition unit 112, a second acquisition unit 113, and a visualization control unit 116.
[0210] The first acquisition unit 112 executes the first acquisition process. In this embodiment, the first acquisition unit 112 acquires the audio data output from the AD converter 180. The audio data represents the voice of user U1 speaking into the handset 20 using the telephone system 100C.
[0211] The second acquisition unit 113 executes the second acquisition process. In this embodiment, the second acquisition unit 113 acquires the clarified audio data output from the AD converter 180.
[0212] The visualization control unit 116 executes visualization control processing. The details of the visualization control unit 116 are as described in Embodiment 1. As a result, the display unit 130 displays a visualization image based on the spectrum of the audio data and the spectrum of the clarified audio data.
[0213] (Flow of Information Processing Method S6) In the telephone system 100C configured as described above, the microcomputer 1C executes information processing method S6. Information processing method S6 is a method executed by the microcomputer 1C when user U1 uses the telephone system 100C to make a call with the other party. In parallel with the execution of information processing method S6, the voice processing circuit 200 generates the synthesized voice signal S_SV from the voice signal S_V in real time. Figure 18 is a flowchart showing the flow of the information processing method S6. As shown in Figure 18, the information processing method S6 includes steps S601 to S603.
[0214] In step S601, the first acquisition unit 112 acquires the audio data output from the AD converter 180. In step S602, the second acquisition unit 113 acquires the clarified audio data output from the AD converter 180. In step S603, the visualization control unit 116 displays a visualization image on the display unit 130 based on the spectrum of the audio data and the spectrum of the clarified audio data. By viewing the visualization image, user U1 can confirm the clarification effect, specifically how their voice sounds clearer to the person they are talking to.
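Steps S601 to S603 can be sketched as follows, assuming that the visualization image is drawn from per-frequency magnitudes of the two data streams. The function names `magnitude_spectrum` and `visualization_rows`, the sample rate, and the synthetic two-tone test data are hypothetical. The clarified audio data here is a synthetic stand-in whose 1600 Hz component is three times stronger, not the output of the voice processing circuit 200.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Single-sided magnitude spectrum via a direct DFT (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) / n
        for k in range(n // 2 + 1)
    ]


def visualization_rows(audio_data, clarified_audio_data, sample_rate):
    """Return (frequency_hz, magnitude_before, magnitude_after) rows for plotting."""
    before = magnitude_spectrum(audio_data)           # data acquired in step S601
    after = magnitude_spectrum(clarified_audio_data)  # data acquired in step S602
    n = len(audio_data)
    # The rows are what a display routine would draw in step S603.
    return [(k * sample_rate / n, b, a) for k, (b, a) in enumerate(zip(before, after))]


SAMPLE_RATE = 8000  # assumed sample rate in Hz
N = 80              # frame length chosen so that each DFT bin is 100 Hz wide


def two_tone(low_amp, high_amp):
    """Synthetic test signal: a 200 Hz component plus a 1600 Hz component."""
    return [
        low_amp * math.sin(2.0 * math.pi * 200.0 * i / SAMPLE_RATE)
        + high_amp * math.sin(2.0 * math.pi * 1600.0 * i / SAMPLE_RATE)
        for i in range(N)
    ]


audio_data = two_tone(1.0, 0.2)
clarified_audio_data = two_tone(1.0, 0.6)  # synthetic stand-in for clarified data
rows = visualization_rows(audio_data, clarified_audio_data, SAMPLE_RATE)
for freq, before, after in rows:
    if before > 0.01 or after > 0.01:
        print(f"{freq:6.0f} Hz  before={before:.2f}  after={after:.2f}")
```

Plotting the second and third columns of `rows` against the first would give two overlaid curves in which only the higher-frequency component differs.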
[0215] (Modification 1 of Embodiment 4) The voice processing device 10C according to Embodiment 4 can be modified to include, in addition to the voice processing circuit 200 for clarifying transmitted voice, another voice processing circuit for clarifying received voice (hereinafter referred to as the voice processing circuit for received voice). In this case, the voice processing circuit for received voice is configured in the same way as the voice processing circuit 200, but is arranged between port P11 and port P12 in the opposite direction to the voice processing circuit 200. That is, the adder / combiner 16 included in the voice processing circuit for received voice is connected to the terminal corresponding to the speaker 22 of port P11. Also, the branching unit 11 included in the voice processing circuit for received voice is connected to a predetermined terminal of port P12.
[0216] The audio processing circuit for received audio receives an audio signal representing the received audio from the main unit 30. This audio signal is processed to make the audio clearer, and a synthesized audio signal is output to the speaker 22. The audio signal representing the received audio input to the audio processing circuit for received audio is also branched and supplied to the AD converter 180, where it is converted into digital audio data. This audio data is output to the microcomputer 1C. The synthesized audio signal output from the audio processing circuit for received audio is also branched and supplied to the AD converter 180, where it is converted into digital clarified audio data. This clarified audio data is output to the microcomputer 1C.
[0217] The first acquisition unit 112 of the microcomputer 1C acquires audio data representing the received voice. The second acquisition unit 113 acquires clarified audio data corresponding to that audio data. The visualization control unit 116 displays a visualization image on the display unit 130 based on the spectrum of the audio data representing the received voice and the spectrum of the clarified audio data.
[0218] According to this modification, by viewing the visualization image, user U1 can confirm the clarification effect, specifically how the voice of the person they are talking to becomes clearer and easier to hear.
[0219] (Modification 2 of Embodiment 4) In the audio processing device 10C according to Embodiment 4, the audio data input to the microcomputer 1C is not limited to data obtained by converting the audio signal S_V from port P11 into a digital signal. For example, the second audio signal S_V2’ output from the phase corrector 12 can be said to represent the sound before the higher-order formant components are amplified (i.e., before clarification). Therefore, the device may be configured so that the second audio signal S_V2’ is input to the AD converter 180 instead of the audio signal S_V, and the audio data obtained by converting the second audio signal S_V2’ into a digital signal is input to the microcomputer 1C.
[0220] Furthermore, the microcomputer 1C does not necessarily need to receive audio data and clarified audio data as input; it only needs to receive the data necessary to generate the visualization image. For example, the device may be configured so that the first audio signal S_V1” output from the amplifier 14 is input to the AD converter 180 instead of the audio signal S_V and the synthesized voice signal S_SV. The first audio signal S_V1” can be said to represent the difference between the audio signal before and after clarification. In this case, the difference audio data obtained by converting the first audio signal S_V1” into a digital signal is input to the microcomputer 1C. The visualization control unit 116 can generate a visualization image (for example, visualization image G16b in Figure 6) that shows the difference between the spectrum of the audio data and the spectrum of the clarified audio data based on the difference audio data.
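The reason the amplifier output can serve as difference data follows from the adder / combiner 16 summing its two inputs: the synthesized signal is the sum of the amplified first signal and the phase-corrected second signal, and the discrete Fourier transform is linear. The sketch below checks this numerically on synthetic signals. The signal contents and all variable names are hypothetical, and the identity shown holds for complex spectra; the difference of magnitude spectra equals the magnitude spectrum of the amplifier output only where the two inputs are in phase, which is the condition the phase corrector 12 is intended to establish.

```python
import cmath
import math


def dft(frame):
    """Complex discrete Fourier transform of a short frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


N = 64
# Synthetic stand-ins for the phase-corrected second signal and the amplified first signal.
s_v2_corrected = [math.sin(2.0 * math.pi * 3 * i / N) for i in range(N)]
s_v1_amplified = [0.8 * math.sin(2.0 * math.pi * 10 * i / N + 0.4) for i in range(N)]
# Adder / combiner 16: sample-by-sample sum of the two inputs.
s_sv = [a + b for a, b in zip(s_v1_amplified, s_v2_corrected)]

# Spectrum after clarification minus spectrum before clarification ...
spectrum_difference = [
    after - before for after, before in zip(dft(s_sv), dft(s_v2_corrected))
]
# ... compared with the spectrum of the amplifier output alone.
difference_spectrum = dft(s_v1_amplified)
max_error = max(abs(p - q) for p, q in zip(spectrum_difference, difference_spectrum))
print(f"largest mismatch between the two spectra: {max_error:.2e}")
```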
[0221] (Modification 3 of Embodiment 4) The voice processing device 10C according to Embodiment 4 may be configured such that the control unit 110 of the microcomputer 1C performs voice processing instead of including the voice processing circuit 200. In this case, the voice processing device 10C further includes a DA (Digital-Analog) converter that converts the clarified voice data generated by the microcomputer 1C into an analog signal. The analog signal is output to the handset 20 or the main unit 30.
[0222] (Other Modifications of Embodiment 4) The voice processing device 10C according to Embodiment 4 may be connected wirelessly to one or both of the handset 20 and the main unit 30, not limited to a wired connection. Furthermore, the voice processing device 10C is not limited to being configured separately from the handset 20 and the main unit 30. For example, the voice processing device 10C may be housed within the casing of the main unit 30 so as to be integrated with the main unit 30. Alternatively, the voice processing device 10C may be housed within the casing of the handset 20 so as to be integrated with the handset 20.
[0223] Furthermore, the telephone system 100C according to Embodiment 4 may include a computer with a calling function instead of the main unit 30. This computer may be, but is not limited to, a mobile phone, smartphone, tablet, smartwatch, laptop computer, or desktop computer. Also, the handset 20 is not limited to the configuration shown in Figure 16. For example, the handset 20 may be a headset or the like.
[0224] [Example of implementation by software] The functions of user terminals 1, 1A, 1B, microcomputer 1C, and server 3 (hereinafter referred to as "devices") can be realized by programs that cause the devices to function as computers, and by programs that cause each control block of the devices (especially each part included in the control units 110 and 310) to function as computers.
[0225] In this case, the device includes a computer having at least one control device (e.g., a processor) and at least one storage device (e.g., memory) as hardware for executing the program. By executing the program using this control device and storage device, the functions described in each of the embodiments are realized.
[0226] The above program may be recorded on one or more non-transitory computer-readable recording media. These recording media may or may not be provided in the above device. In the latter case, the program may be supplied to the above device via any wired or wireless transmission medium.
[0227] Furthermore, some or all of the functions of each of the above control blocks can also be implemented by logic circuits. For example, an integrated circuit in which logic circuits functioning as each of the above control blocks are formed is also included in the scope of this disclosure. In addition, it is also possible to implement the functions of each of the above control blocks by, for example, a quantum computer.
[0228] Furthermore, each process described in the above embodiments may be performed by AI (Artificial Intelligence). In this case, the AI may operate on the control device described above, or it may operate on another device (for example, an edge computer or a cloud server).
[0229] [Summary] The information processing method according to Embodiment 1 includes, at least one processor performing: a first acquisition process for acquiring audio data representing the voice of a speaker; a second acquisition process for acquiring clarified audio data generated by audio processing for clarifying the audio data; and a visualization control process for displaying a visualization image that visualizes the audio processing based on the spectrum of the audio data and the spectrum of the clarified audio data. The above configuration has the effect of being able to present to the user the effect of clarification by audio processing for clarifying the voice.
[0230] In the information processing method according to Embodiment 2, in Embodiment 1, the visualization image includes an image in which an image showing the spectrum of the audio data and an image showing the spectrum of the clarified audio data are superimposed. With this configuration, the user can visually compare the spectra before and after clarification, and thus confirm the effect of clarification through this comparison.
[0231] In the information processing method according to Embodiment 3, in Embodiment 1 or Embodiment 2, the visualization image includes an image showing the difference between the spectrum of the audio data and the spectrum of the clarified audio data. With this configuration, the user can visually recognize the difference between the spectra before and after clarification, and thus confirm the effect of clarification from that difference.
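One possible way to derive the data for a difference image is to express the change in each frequency bin in decibels. The following is a minimal sketch under that assumption; the function name `difference_db`, the `floor` parameter used to avoid taking the logarithm of zero, and the sample magnitudes are all hypothetical and are not taken from this disclosure.

```python
import math


def difference_db(before, after, floor=1e-12):
    """Per-bin level change in dB between two magnitude spectra of equal length."""
    if len(before) != len(after):
        raise ValueError("spectra must have the same number of bins")
    return [
        20.0 * math.log10(max(a, floor) / max(b, floor))
        for b, a in zip(before, after)
    ]


before = [1.0, 0.5, 0.1]  # hypothetical magnitudes before clarification
after = [1.0, 1.5, 0.5]   # hypothetical magnitudes after clarification
change = difference_db(before, after)
print([round(c, 2) for c in change])
```

A bin whose magnitude is unchanged maps to 0 dB, so a plot of `change` would stay flat in the low-frequency region and rise where components were amplified.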
[0232] The information processing method according to Embodiment 4 is such that, in any one of Embodiments 1 to 3, the degree of clarification in the speech processing can be set according to a first operation, and in the visualization control processing, at least one processor displays the visualization image that reflects the degree of clarification set according to the first operation. With the above configuration, the user can confirm how the effect of clarification changes when the degree of clarification is changed.
[0233] The information processing method according to Embodiment 5 is such that, in any one of Embodiments 1 to 4, the audio data represents the voice of the speaker in a call made by connecting multiple user terminals, and in the visualization control processing, at least one processor displays the visualization image on the user terminal of the speaker among the multiple user terminals. With this configuration, the user can confirm the effect of the voice processing on the receiving end of a call on their own voice.
[0234] The information processing method according to Embodiment 6 is such that, in any one of Embodiments 1 to 5, the audio data represents the voice of the speaker in a call made by connecting multiple user terminals, and in the visualization control processing, at least one processor displays the visualization image on the user terminal of the receiver among the multiple user terminals. With this configuration, the user can confirm the effect of clarification by audio processing on the voice of the person they are talking to during a call.
[0235] The information processing method according to Embodiment 7 is such that, in any one of Embodiments 1 to 6, in the visualization control process, at least one processor updates the visualization image in real time according to the progress of the speaker's utterance. With this configuration, the user can confirm the clarification effect in real time according to the progress of the speaker's utterance.
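Real-time updating can be approximated by recomputing the spectra for each newly available frame of samples. The sketch below assumes a fixed frame length and hop size and operates on already buffered data for simplicity, whereas a live system would receive samples incrementally. The names `streaming_updates` and `magnitude_spectrum`, the frame parameters, and the test signal are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Single-sided magnitude spectrum via a direct DFT (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) / n
        for k in range(n // 2 + 1)
    ]


def streaming_updates(audio_data, clarified_audio_data, sample_rate, frame_len=64, hop=32):
    """Yield (time_sec, spectrum_before, spectrum_after) once per frame."""
    usable = min(len(audio_data), len(clarified_audio_data))
    for start in range(0, usable - frame_len + 1, hop):
        yield (
            start / sample_rate,
            magnitude_spectrum(audio_data[start:start + frame_len]),
            magnitude_spectrum(clarified_audio_data[start:start + frame_len]),
        )


SAMPLE_RATE = 8000  # assumed sample rate in Hz
audio = [math.sin(2.0 * math.pi * 1000.0 * i / SAMPLE_RATE) for i in range(160)]
clarified = [2.0 * x for x in audio]  # synthetic stand-in for clarified audio data
updates = list(streaming_updates(audio, clarified, SAMPLE_RATE))
for time_sec, before, after in updates:
    print(f"t={time_sec:.3f}s  1 kHz bin: before={before[8]:.2f} after={after[8]:.2f}")
```

Each yielded tuple corresponds to one refresh of the visualization image; a shorter hop gives more frequent updates at the cost of more computation.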
[0236] The information processing method according to Embodiment 8 is such that, in any one of Embodiments 1 to 7, in the visualization control process, at least one processor displays the visualization image in response to a second operation after the speaker has spoken. With this configuration, the user can confirm the effect of the speech processing on their own voice by viewing the visualization image after they have spoken.
[0237] The information processing method according to Embodiment 9 is characterized in that, in any one of Embodiments 1 to 8, at least one processor further executes a playback control process to play the selected voice in response to a third operation in which the speaker selects either the voice indicated by the voice data or the voice indicated by the clarified voice data after the speaker has spoken. With this configuration, the user can confirm the effect of clarification on their own voice by listening to the voice before and after clarification after they have spoken.
[0238] The information processing method according to Embodiment 10 is, in any one of Embodiments 1 to 9, a process in which the audio processing amplifies higher-order formant components, including at least a second-order formant component, in the audio data. With the above configuration, the audio can be made clearer.
[0239] The information processing device according to Embodiment 11 comprises at least one processor, the at least one processor executing each of the processes included in the information processing method described in any one of Embodiments 1 to 10. With the above configuration, the same effects as any one of Embodiments 1 to 10 are achieved.
[0240] The program according to Embodiment 12 causes at least one processor to execute each of the processes included in the information processing method described in any one of Embodiments 1 to 10. With the above configuration, the same effect as any one of Embodiments 1 to 10 is achieved.
[0241] The non-transitory computer-readable recording medium according to Embodiment 13 stores the program according to Embodiment 12. This configuration produces the same effect as any one of Embodiments 1 to 10.
[0242] Each embodiment of this disclosure has the effect of being able to present to the user the effect of clarification through speech processing for clarifying speech. Such an effect contributes, for example, to achieving Sustainable Development Goal (SDG) 3, "Ensure good health and well-being for all," as advocated by the United Nations.
[0243] This disclosure is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. Embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of this disclosure.
[0244] 1, 1A, 1B, 2 User terminal
1C Microcomputer
3 Server
10C Voice processing device
11 Branching unit
12 Phase corrector
13 Filter
14 Amplifier
15 Adjuster
16 Adder / combiner
20 Handset (transmitter / receiver)
21 Microphone
22 Speaker
30 Main unit
41, 42, 43 Cable
100, 100B Information processing system
100C Telephone system
110, 210, 310 Control unit
111, 211 Call application unit
112, 312 First acquisition unit
113, 313 Second acquisition unit
114, 314 Voice output control unit
115 Clarification UI unit
116, 316 Visualization control unit
117 Playback control unit
120, 220, 320 Storage unit
130, 230 Display unit
140, 240 Input section
150, 250 Audio input section
160, 260 Audio output section
170, 270, 370 Communication section
180 AD converter
200 Audio processing circuit
311 Call relay section
Claims
1. An information processing method comprising: a first acquisition process in which at least one processor acquires audio data representing the voice of a speaker; a second acquisition process in which it acquires clarified audio data generated by audio processing for clarifying the audio data; and a visualization control process for displaying a visualization image that visualizes the audio processing based on the spectrum of the audio data and the spectrum of the clarified audio data.
2. The information processing method according to claim 1, wherein the visualization image includes an image obtained by superimposing an image showing the spectrum of the audio data and an image showing the spectrum of the clarified audio data.
3. The information processing method according to claim 1, wherein the visualization image includes an image showing the difference between the spectrum of the audio data and the spectrum of the clarified audio data.
4. The information processing method according to claim 1, wherein the degree of clarification in the audio processing can be set according to a first operation, and in the visualization control processing, at least one processor displays the visualization image that reflects the degree of clarification set according to the first operation.
5. The information processing method according to claim 1, wherein the audio data represents the voice of a speaker in a call made by connecting multiple user terminals, and in the visualization control process, at least one processor displays the visualization image on the user terminal of the speaker among the multiple user terminals.
6. The information processing method according to claim 1, wherein the audio data represents the voice of a speaker in a call made by connecting multiple user terminals, and in the visualization control process, at least one processor displays the visualization image on the user terminal of the receiver among the multiple user terminals.
7. The information processing method according to claim 1, wherein in the visualization control process, the at least one processor updates the visualization image in real time according to the progress of speech by the speaker.
8. The information processing method according to claim 1, wherein in the visualization control process, at least one processor displays the visualization image in response to a second operation after the speaker has spoken.
9. The information processing method according to claim 8, wherein the at least one processor further performs a playback control process for playing back the selected voice in response to a third operation of selecting either the voice indicated by the voice data or the voice indicated by the clarified voice data after the utterance by the speaker.
10. The information processing method according to claim 1, wherein the audio processing is a process of amplifying higher-order formant components, including at least a second-order formant component, in the audio data.
11. An information processing apparatus comprising at least one processor, wherein the at least one processor performs each of the processes included in the information processing method according to any one of claims 1 to 10.
12. A program that causes at least one processor to execute each of the processes included in the information processing method according to any one of claims 1 to 10.