Information processing method, information processing device, and program
The information processing method and device address communication challenges in multi-user calls by clarifying voice data through real-time amplification of higher-order formants, enabling smooth communication for users with hearing impairments by adjusting clarity levels.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: RADIUS CO LTD
- Filing Date: 2024-12-27
- Publication Date: 2026-07-02
AI Technical Summary
Existing hearing aids struggle with maintaining clear voice communication, especially when users with insufficient hearing ability are involved, due to distance and howling issues, and there is a lack of support for smooth communication in multi-user calls.
An information processing method and device that clarifies voice data by amplifying higher-order formant components in real-time, allowing users to control the clarity level of voice output based on user actions and operations, and supports voice processing across multiple user terminals.
Enables smooth communication in calls involving users with insufficient hearing by clarifying voices in real-time, allowing users to adjust clarity levels for both transmitted and received voices, enhancing understanding and reducing distance and howling issues.
Description
Information Processing Method, Information Processing Apparatus, and Program
[0001] This disclosure relates to voice processing for clarifying voices.
[0002] Patent Document 1 discloses a hearing aid that clarifies voices by performing frequency analysis on an acoustic signal that changes over time to obtain, in real time, a signal in the highest level frequency band, and increasing the gain of signals in a higher frequency range compared to the signal in that frequency band.
[0003] Japanese Patent No. 3731179
[0004] Here, when communicating with a person with insufficient hearing ability, it may be difficult to achieve smooth communication with the hearing aid described in Patent Document 1. For example, it is difficult for a user communicating with a person with insufficient hearing ability to force the other person to wear a hearing aid. Also, even when the other person is wearing a hearing aid, it is necessary to bring the receiving end close to the hearing aid. For example, when the hearing aid is an ear-hook type, depending on the distance between the hearing aid and the receiving end, the received voice may not be sufficiently clarified. Also, if the receiving end is brought too close to the hearing aid, there is a possibility of howling.
[0005] One aspect of this disclosure aims to realize a technology for supporting smooth communication in a call between a plurality of users including a user with insufficient hearing ability.
[0006] To solve the above problems, an information processing method according to one aspect of this disclosure includes an acquisition process in which at least one processor acquires voice data indicating the voice of a speaker input to the speaker's user terminal among a plurality of user terminals connected for making a call, a voice processing process in which the voice data is subjected to voice processing for clarifying the voice to generate clarified voice data, and a voice output control process in which the voice indicated by the clarified voice data is controlled to be output from the user terminal of the listener among the plurality of user terminals.
[0007] To solve the above problems, an information processing device according to one aspect of the present disclosure comprises at least one processor, the at least one processor executing each of the processes included in the information processing method described above.
[0008] To solve the above problems, a program according to one aspect of this disclosure causes at least one processor to execute each of the processes included in the information processing method described above. A computer-readable non-transitory recording medium on which the program is recorded also falls within the scope of this disclosure.
[0009] According to one aspect of this disclosure, it is possible to support smooth communication in calls between multiple users, including users with insufficient hearing.
[0010]
This is a block diagram showing the configuration of an information processing system according to Embodiment 1 of this disclosure.
This is a block diagram showing the functional configuration of a user terminal according to Embodiment 1 of this disclosure.
This is a block diagram showing the functional configuration of another user terminal according to Embodiment 1 of this disclosure.
This is a flowchart illustrating the flow of an information processing method according to Embodiment 1 of this disclosure.
This is a diagram schematically showing an example of a screen displayed in Embodiment 1 of this disclosure.
This is a flowchart illustrating the flow of another information processing method according to Embodiment 1 of this disclosure.
This is a diagram schematically showing another example of a screen displayed in Embodiment 1 of this disclosure.
This is a diagram schematically showing yet another example of a screen displayed in Embodiment 1 of this disclosure.
This is a block diagram showing the configuration of an information processing system according to Embodiment 2 of this disclosure.
This is a block diagram showing the functional configuration of each device constituting the information processing system according to Embodiment 2 of this disclosure.
This is a flowchart illustrating the flow of an information processing method according to Embodiment 2 of this disclosure.
This is a flowchart illustrating the flow of another information processing method according to Embodiment 2 of this disclosure.
This is a block diagram showing the configuration of a telephone system according to Embodiment 3 of this disclosure.
This is a block diagram showing the configuration of an audio processing device according to Embodiment 3 of this disclosure.
This is a flowchart illustrating the flow of an information processing method according to Embodiment 3 of this disclosure.
[0011] [Embodiment 1] Hereinafter, an information processing system 100 according to Embodiment 1 of the present disclosure will be described in detail with reference to the drawings.
[0012] (Configuration of Information Processing System 100) Figure 1 is a block diagram showing the configuration of the information processing system 100. As shown in Figure 1, the information processing system 100 includes a user terminal 1 used by user U1 and a user terminal 2 used by user U2. User terminals 1 and 2 can be connected via network N for users U1 and U2 to make calls. Although Figure 1 shows one user terminal 1 and one user terminal 2, there may be multiple instances of each.
[0013] Here, "call" refers to the exchange of at least voice in real time between multiple users via a network N, and may be a voice call where only voice is exchanged, or a video call where both voice and video are exchanged. Furthermore, the call method may be a circuit-switched method via a public telephone network, etc., or a packet-switched method via an IP (Internet Protocol) network, etc., but is not limited to these.
[0014] Network N may be, for example, a public telephone network, a dedicated telephone network, a mobile communication network, an IP network, a wireless or wired LAN (Local Area Network), or a combination of some or all of these. However, Network N is not limited to the examples described above, and is only required to be a network that connects user terminal 1 and user terminal 2 in a manner that enables communication using a predetermined communication method.
[0015] User terminal 1 is a terminal used by user U1. User terminal 1 has the function of connecting to user terminal 2 via network N using a predetermined communication method (for example, the circuit-switched method or packet-switched method described above). In other words, user U1 can communicate with user U2 using user terminal 1. User terminal 1 is one embodiment of the information processing device recited in the claims and has a clarification function that clarifies the voice of the user or of the other party during a call. User terminal 1 is a computer including at least one processor and memory. User terminal 1 may be, but is not limited to, a mobile phone, smartphone, tablet, smartwatch, notebook computer (notebook personal computer), desktop computer (stationary personal computer), etc.
[0016] User terminal 2 is a terminal used by user U2. User terminal 2 has the function of connecting to user terminal 1 via network N using the same communication method as user terminal 1. In other words, user U2 can communicate with user U1 using user terminal 2. In the following description, user terminal 2 is described as not having the clarification function described above, but it is not limited to this and may have the clarification function. User terminal 2 is a computer including at least one processor and memory. User terminal 2 may be, for example, a mobile phone, smartphone, tablet, smartwatch, laptop computer, desktop computer, etc., but is not limited to these.
[0017] (Configuration of User Terminal 1) Figure 2 is a block diagram showing the functional configuration of User Terminal 1. As shown in Figure 2, User Terminal 1 comprises a control unit 110, a storage unit 120, a display unit 130, an input unit 140, an audio input unit 150, an audio output unit 160, and a communication unit 170. The control unit 110 is implemented, for example, by a processor executing a program stored in memory, and comprehensively controls each part of User Terminal 1. Details of each functional block of the control unit 110 will be described later. The storage unit 120 is composed, for example, of memory and stores various data and programs used by the control unit 110.
[0018] The display unit 130 displays the image generated by the control unit 110. The display unit 130 may be, but is not limited to, a liquid crystal display or an organic EL (electroluminescence) display. The input unit 140 receives operations from user U1, acquires input information indicated by those operations, and outputs the input information to the control unit 110. The input unit 140 may be, but is not limited to, a mouse, keyboard, touchpad, or a combination of some or all of these. Furthermore, the display unit 130 and the input unit 140 may be integrally formed as a touch panel.
[0019] The audio input unit 150 converts an audio signal input from an external source into audio data, which is a digital signal, and outputs the audio data to the control unit 110. The audio input unit 150 may include, but is not limited to, a microphone and an analog-to-digital (AD) converter. The audio output unit 160 converts the audio data input from the control unit 110, which is a digital signal, into an analog audio signal and outputs the audio signal to the outside as audio. The audio output unit 160 may include, but is not limited to, a digital-to-analog (DA) converter and a speaker.
[0020] The communication unit 170 connects to the network N and communicates with the outside world. The communication unit 170 transmits information input from the control unit 110 via the network N. The communication unit 170 also outputs information received via the network N to the control unit 110.
[0021] Note that some or all of the storage unit 120, display unit 130, input unit 140, audio input unit 150, audio output unit 160, and communication unit 170 may be connected as peripheral devices instead of being built into the user terminal 1.
[0022] (Functional Blocks of the Control Unit 110) As shown in Figure 2, the control unit 110 includes a call application unit 111, an acquisition unit 112, an audio processing unit 113, an audio output control unit 114, and a clarification UI (User Interface) unit 115. The acquisition unit 112, the audio processing unit 113, the audio output control unit 114, and the clarification UI unit 115 may constitute a clarification application that extends the call function of the call application unit 111.
[0023] The call application unit 111 provides functions for making calls according to a predetermined call method. For example, in response to an operation by user U1 to start a call, the call application unit 111 connects to the call application unit 211 of user terminal 2 via the network N according to a predetermined call method. The operation to start the call may include, for example, an operation to specify the call destination and an operation to instruct the sending of a call request to the call destination. The operation to start the call may also include, for example, an operation to respond to a received call request. The call application unit 111 also transmits audio data representing the voice of user U1, which is input from the audio input unit 150, to the connected user terminal 2. The call application unit 111 also receives audio data representing the voice of user U2 from user terminal 2 and outputs the audio data from the audio output unit 160. The call application unit 111 also terminates the connection with user terminal 2 in response to an operation by user U1 to instruct the end of the call.
[0024] The acquisition unit 112 executes an acquisition process. The acquisition process is the process of acquiring audio data indicating the speaker's voice, which is input into the speaker's user terminal among multiple user terminals connected for making a call (for example, multiple user terminals including user terminal 1 and user terminal 2). Hereafter, "audio data indicating the speaker's voice" will also be referred to as "speaker's voice data". The audio data is acquired in real time as the speaker's speech progresses.
[0025] For example, in a call between user U1 and user U2, when user U1 speaks, user U1 is the speaker and user U2 is the listener. In this case, the acquisition unit 112 acquires the voice data of user U1 (the speaker), who is using user terminal 1, via the audio input unit 150.
[0026] Furthermore, for example, in a call between user U1 and user U2, when user U2 speaks, user U2 is the speaker and user U1 is the listener. In this case, the acquisition unit 112 acquires the voice data of user U2 (speaker) received by the call application unit 111.
[0027] The voice processing unit 113 executes voice processing. Voice processing is the process of generating clarified voice data by applying, to the voice data, processing that clarifies the voice. Here, the clarified voice data may be generated in real time. Real-time generation of clarified voice data means that clarified voice data is generated from the voice data in parallel with the acquisition of the voice data as the speech progresses.
[0028] For example, speech processing to clarify speech may involve amplifying higher-order formant components, including at least the second-order formant component, in the speech data. In the spectral analysis of speech, multiple peak frequencies appear at integer multiples of each other. These multiple peak frequencies are referred to as the first-order formant, second-order formant, third-order formant, fourth-order formant, and so on, in order of increasing frequency. Although each peak frequency differs depending on the speaker's skeletal structure, it is generally known that in order to understand language, it is necessary to reliably hear the components from the first to the fourth formant. Furthermore, users with insufficient hearing often have reduced hearing in the frequency band containing higher-order formant components, such as the second to the fourth formant. Therefore, by amplifying the higher-order formant components, including the second formant, speech is clarified so that it can be perceived as clear by users with insufficient hearing.
[0029] For example, the audio processing unit 113 may perform a process to amplify the level of the frequency band above the lower limit frequency in the audio data. In this case, the lower limit frequency is determined such that higher-order formant components, including at least a second-order formant component, are included in the frequency band above the lower limit frequency. The audio processing unit 113 may further determine an upper limit frequency and perform a process to amplify the level of the frequency band above the lower limit frequency and below the upper limit frequency in the audio data. In this case, the lower limit frequency and the upper limit frequency are determined such that higher-order formant components, including at least a second-order formant component, are included in the frequency band. As an example, the lower limit frequency may be 400 Hz and the upper limit frequency may be 5 kHz. The frequency band between 400 Hz and 5 kHz is likely to contain the second to fourth formant components of typical human speech. However, the lower limit frequency and upper limit frequency are not limited to the examples described above. Also, for example, the frequency of the first-order formant component detected in real time in the audio data may be applied as the lower limit frequency. Furthermore, the audio processing for clarifying speech is not limited to the processes described above, and known processes can be applied.
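The band amplification described in this paragraph can be illustrated with a minimal sketch. It is not the implementation of the audio processing unit 113: it assumes frame-by-frame processing with a naive discrete Fourier transform, and the function names (`dft`, `idft`, `amplify_band`) and the `gain` parameter are hypothetical. Only the example limits of 400 Hz and 5 kHz come from the text.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def amplify_band(
    samples: List[float],
    sample_rate: float,
    gain: float,
    low_hz: float = 400.0,    # example lower limit frequency from the text
    high_hz: float = 5000.0,  # example upper limit frequency from the text
) -> List[float]:
    """Multiply every spectral bin between low_hz and high_hz by `gain`."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins k and n - k carry the same physical frequency; scaling both
        # keeps the spectrum symmetric so the output remains real-valued.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] *= gain
    return idft(spectrum)
```

For instance, with a frame that mixes a 200 Hz tone and a 1 kHz tone, calling `amplify_band(frame, 8000, 2.0)` leaves the 200 Hz component unchanged and doubles the 1 kHz component. A practical system would more likely use an FFT or a time-domain filter; the naive transform is used here only to keep the sketch self-contained.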
[0030] Furthermore, the voice processing unit 113 may perform voice processing on the speaker's voice data based on the speaker's actions. For example, in a call between user U1 and user U2, when user U1 speaks, user U1 is the speaker and user U2 is the listener. In this case, the voice processing unit 113 performs voice processing on user U1's voice data based on user U1's actions. This allows user U1, the speaker, to instruct user terminal 1 to clarify their voice if they want user U2, the other party in the call, to hear their voice more clearly.
[0031] Furthermore, the voice processing unit 113 may set the degree of clarity in voice processing for the speaker's voice data based on the speaker's actions. For example, the voice processing unit 113 sets the degree of clarity in voice processing for user U1's (speaker's) voice data based on the user U1's actions. This allows user U1, the speaker, to set how clearly they want their voice to be heard by user U2, the other party in the call.
[0032] Furthermore, the voice processing unit 113 may perform voice processing on the speaker's voice data based on the listener's operation. For example, in a call between user U1 and user U2, when user U2 speaks, user U2 is the speaker and user U1 is the listener. In this case, the voice processing unit 113 performs voice processing on the voice data of user U2 (the speaker) received by the call application unit 111, based on the operation of user U1 (the listener), who is using user terminal 1. This allows user U1, as the listener, to instruct user terminal 1 to clarify user U2's voice if user U1 finds it difficult to hear.
[0033] Furthermore, the voice processing unit 113 may set the degree of clarification in voice processing for the speaker's voice data based on the listener's operation. For example, the voice processing unit 113 sets the degree of clarification in voice processing for the voice data of user U2 (the speaker) based on the operation of user U1 (the listener). This allows user U1, as the listener, to set how strongly the voice of user U2, the other party in the call, is clarified.
[0034] The degree of clarification may, for example, be a gain that amplifies the level of the frequency band between the lower limit frequency and the upper limit frequency mentioned above. The operation for setting the degree of clarification is accepted by the clarification UI unit 115, which will be described later. For example, the degree of clarification may be set discretely in multiple stages. Alternatively, the degree of clarification may be set continuously.
[0035] The voice output control unit 114 executes voice output control processing. Voice output control processing is the process of controlling the output of the voice indicated by the clarified voice data from the user terminal of the listener among multiple user terminals (for example, multiple user terminals including user terminal 1 and user terminal 2). For example, when user U1 speaks, user U1 is the speaker and user U2 is the listener. Therefore, if clarified voice data for user U1 has been generated, the voice output control unit 114 transmits the clarified voice data to user terminal 2 in order to output the clarified voice data from user terminal 2. Also, for example, when user U2 speaks in a call between user U1 and user U2, user U2 is the speaker and user U1 is the listener. Therefore, if clarified voice data for user U2 has been generated, the voice output control unit 114 controls the voice output unit 160 to output the clarified voice data.
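As a rough illustration of how the acquisition source and the output destination depend on who is speaking, the sketch below encodes the two cases described for user terminal 1. The enum and function names are hypothetical and the return values are descriptive strings rather than real device handles.

```python
from enum import Enum


class Speaker(Enum):
    U1 = "user of user terminal 1"
    U2 = "user of user terminal 2"


def acquisition_source(speaker: Speaker) -> str:
    """Where user terminal 1 obtains the speaker's voice data."""
    if speaker is Speaker.U1:
        return "audio input unit 150"
    return "voice data received by call application unit 111"


def output_destination(speaker: Speaker) -> str:
    """Where user terminal 1 sends the clarified voice data."""
    if speaker is Speaker.U1:
        # U2 is the listener, so the data is transmitted to user terminal 2.
        return "transmit to user terminal 2"
    # U1 is the listener, so the data is played back locally.
    return "audio output unit 160"
```

In both cases the clarification itself runs on user terminal 1; only the input and output paths change.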
[0036] The clarification UI unit 115 accepts operations to set the degree of speech clarification in speech processing. For example, such operations to set the degree of clarification may include operations to set the degree of clarification for transmitted speech, or operations to set the degree of clarification for received speech. Here, transmitted speech refers to the voice of user U1 when user U1 is the speaker. Received speech refers to the voice heard by user U1 when user U2 is the speaker.
[0037] For example, the setting operation may include selecting one of several levels of clarification. These levels could be, for example, three levels such as weak, medium, and strong. Furthermore, the levels may include a level at which no clarification is applied. In this case, the levels could be, for example, four levels such as off, weak, medium, and strong, or two levels such as on and off. However, the names and number of levels are not limited to these examples.
[0038] Furthermore, for example, the setting operation may include an operation to specify an arbitrary degree of clarification within a continuous range. For example, the degree of clarification may be represented by any numerical value from 1 to m, where "clarification off" corresponds to a factor of 1 and "clarification strong" corresponds to a factor of m (where m is a number greater than 1). The setting operation may also be an operation on an operation object (e.g., a slider object, a volume object, etc.) that corresponds to the continuous range of values from 1 to m. The information indicating the degree of clarification for the transmitted or received voice, received through the setting operation, is stored in the storage unit 120 as setting information.
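One possible way to turn these settings into a gain is sketched below. The gain assigned to each named level is an illustrative assumption (the text gives no numerical values), as are the function names and the convention that a slider reports a position between 0 and 1.

```python
# Hypothetical gains for the four example levels; only "off" = 1 (no
# amplification) follows directly from the description.
LEVEL_GAINS = {"off": 1.0, "weak": 1.5, "medium": 2.0, "strong": 3.0}


def gain_for_level(level: str) -> float:
    """Discrete setting: look up the gain for a named clarification level."""
    return LEVEL_GAINS[level]


def gain_for_slider(position: float, m: float) -> float:
    """Continuous setting: map a slider position in [0, 1] onto [1, m]."""
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    position = min(1.0, max(0.0, position))  # clamp out-of-range input
    return 1.0 + position * (m - 1.0)
```

With m = 3, a slider at its midpoint yields a gain of 2, which matches the hypothetical "medium" level above.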
[0039] (Configuration of User Terminal 2) Figure 3 is a block diagram showing the functional configuration of User Terminal 2. As shown in Figure 3, User Terminal 2 includes a control unit 210, a storage unit 220, a display unit 230, an input unit 240, an audio input unit 250, an audio output unit 260, and a communication unit 270. Each of these units is the same as the functional block of the same name in User Terminal 1, so the description is not repeated.
[0040] (Functional Blocks of the Control Unit 210) As shown in Figure 3, the control unit 210 includes a call application unit 211. The call application unit 211 provides functions for making calls according to the same call method as the call application unit 111. The details of the call application unit 211 are obtained from the description of the call application unit 111 by interchanging users U1 and U2, user terminals 1 and 2, and the functional blocks of the same name in user terminals 1 and 2.
[0041] (Flow of Information Processing Method S1) The information processing system 100 configured as described above executes information processing method S1. Information processing method S1 is a method of performing speech processing to clarify the transmitted speech of user U1 when user U1 is the speaker, on the terminal of user U1 who is the speaker. In other words, the processor provided in the user terminal 1 of user U1 who is the speaker performs at least speech processing. Figure 4 is a flowchart illustrating the flow of information processing method S1. As shown in Figure 4, information processing method S1 includes steps S101 to S111.
[0042] In step S101, the call application unit 111 of the user terminal 1 connects to the call application unit 211 of the user terminal 2 based on an operation of the user U1 for starting a call with the user U2. Also, in step S102, the call application unit 211 of the user terminal 2 connects to the call application unit 111 of the user terminal 1 based on an operation of the user U2. Thereby, a call between the user U1 and the user U2 is started. Note that steps S101 and S102 are not limited to being executed in this order, and may be executed in the reverse order, or partially or entirely in parallel. Also, steps S101 and S102 do not limit which of the user U1 and the user U2 requests a call to the other party.
[0043] In step S103, the clarification UI unit 115 of the user terminal 1 receives an operation (hereinafter also referred to as a level setting operation) from the user U1 for setting the clarification level of the transmitted voice. The clarification level is an example of the degree of clarification, and refers to the stage when the degree of clarification is discretely set. Hereinafter, an example where the degree of clarification can be discretely set will be mainly described, but it is not limited to this. The setting information including the clarification level of the transmitted voice set by the level setting operation is stored in the storage unit 120. Note that the timing of executing step S103 is not limited to after the start of the call (after step S101), and may be executed before the start of the call (before step S101). Also, the timing is not limited to immediately after the start of the call, and may be executed at an arbitrary point during the call (any point after step S104 described below). Also, when the level setting operation is not received, the previously stored setting information may be applied.
[0044] Step S104 is an example of acquisition processing. In step S104, the acquisition unit 112 acquires the voice data of user U1 via the audio input unit 150. That is, it is assumed here that user U1 has spoken during the call.
[0045] Step S105 is an example of voice processing. In step S105, the voice processing unit 113 performs voice processing on the user U1's voice data according to the setting information. This generates clarified voice data that is clarified according to the clarification level indicated by the setting information. If the setting information indicates that the clarification level of the transmitted voice is "off", voice processing is not performed and clarified voice data is not generated.
[0046] Step S106 is an example of voice output control processing. In step S106, the voice output control unit 114 transmits clarified voice data to the user terminal 2. If clarified voice data has not been generated, the voice data of user U1 is transmitted instead.
[0047] In step S107, the call application unit 211 of the user terminal 2 receives the clarified voice data. The call application unit 211 also outputs the voice indicated by the received clarified voice data via the audio output unit 260. As a result, user U2 can hear the clarified voice of user U1. In other words, user U1 can have user U2 hear the clarified voice even if the user terminal 2 used by user U2, the other party in the call, does not have a clarification function. If voice data is received instead of clarified voice data, the unclarified voice indicated by that voice data is output.
[0048] Note that steps S104 to S107 are executed in real time according to the progress of the speech by user U1. That is, according to the progress of the speech by user U1, the clarified voice of user U1 can be heard by user U2 in real time. Thereafter, when user U2 speaks, user terminal 1 receives the voice data of user U2 from user terminal 2 and outputs the received voice data via the voice output unit 160. Thereby, the unclarified voice of user U2 can be heard by user U1. It is also possible to configure such that steps S204 to S207 of information processing method S2 described later are executed when user U2 speaks. In this case, the clarified voice of user U2 can be heard by user U1. Further, when user U1 speaks again, steps S104 to S107 are repeated.
[0049] Also, when step S103 is executed again during the call and the clarification level of the transmitted voice is changed, the changed clarification level is referred to in the voice processing of step S105. Therefore, the changed clarification level is reflected in the voice that can be heard by user U2 in step S107.
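Steps S104 to S106, including the "off" fallback and the handling of a level change during the call, can be sketched as a per-frame loop. This is a simplified illustration under stated assumptions: the names are hypothetical, the gains are made-up values, and `clarify_frame` applies a plain broadband gain as a stand-in for the formant-band processing described earlier.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[float]

# Hypothetical gain per clarification level (not specified in the text).
GAINS = {"weak": 1.5, "medium": 2.0, "strong": 3.0}


def clarify_frame(frame: Frame, gain: float) -> Frame:
    """Stand-in for the voice processing of step S105 (broadband gain only)."""
    return [sample * gain for sample in frame]


def sender_loop(frames: Iterable[Frame], read_level: Callable[[], str]) -> Iterator[Frame]:
    """Yield, frame by frame, the data that would be transmitted in step S106."""
    for frame in frames:          # S104: acquire voice data as speech progresses
        level = read_level()      # setting information is re-read for each frame
        if level == "off":
            yield frame           # no clarified data: send the original voice data
        else:
            yield clarify_frame(frame, GAINS[level])  # S105, then S106
```

Because the level is read once per frame, a level setting operation made in the middle of the call takes effect from the next frame onward, which mirrors the behavior described for a repeated step S103.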
[0050] In step S110, the call application unit 111 of user terminal 1 terminates the connection with the call application unit 211 of user terminal 2 based on the operation of user U1. Also, in step S111, the call application unit 211 of user terminal 2 terminates the connection with the call application unit 111 of user terminal 1. Thereby, the call between user U1 and user U2 ends. Note that steps S110 and S111 are not limited to being executed in this order, and may be executed in the reverse order, or partially or entirely in parallel. Also, the operation for ending the call is not limited to being performed by user U1; user U2 may perform it, or both may perform it.
[0051] (Screen Example) Figure 5 is a schematic diagram showing a screen example G1 displayed on the display unit 130 of the user terminal 1 in the information processing method S1. Screen example G1 provides a user interface related to the function of clarifying one's own voice (transmitted voice) during a call. As shown in Figure 5, screen example G1 includes information G11 to G13 and operation objects G14-1 to G14-4 and G15.
[0052] Information G11 (for example, "Currently on a call with U2") indicates that a call is currently in progress and identifies the other party to the call. Information G12 (for example, "1:02") indicates the elapsed time of the call. Information G13 (for example, "Your voice has been clarified and is audible to U2") indicates that the voice of user U1, who is using user terminal 1, has been clarified.
[0053] The operation objects G14-1 "Weak", G14-2 "Medium", G14-3 "Strong", and G14-4 "Off" accept level setting operations to select the clarity level for transmitted speech. In this example, any of the four clarity levels (Off, Weak, Medium, Strong) can be selected. Among the operation objects G14-1 to G14-4, the highlighted G14-2 "Medium" indicates that the currently selected clarity level is "Medium". Note that in Figure 5, the highlighting is shown as a thick border and gray fill, but is not limited to this. When a level setting operation is accepted for any of the operation objects G14-1 to G14-4, setting information including the corresponding clarity level is stored in the storage unit 120. Furthermore, the operation objects G14-1 "Weak," G14-2 "Medium," G14-3 "Strong," and G14-4 "Off" can accept level setting operations before the start of a call and at any point during a call. Multiple level setting operations may also be accepted, in which case the setting information based on the most recent operation will be stored.
[0054] Operation object G15 receives an operation from user U1 to end the call. When an operation is received by operation object G15, the call application unit 111 of user terminal 1 terminates its connection with the call application unit 211 of user terminal 2.
[0055] User U1 can clarify their own voice (transmitted voice) and set the degree of clarity by operating on screen example G1. Therefore, even if the other party, User U2, has insufficient hearing, User U1 can communicate smoothly with User U2 during a call.
[0056] (Flow of Information Processing Method S2) The information processing system 100 also executes information processing method S2. Information processing method S2 is a method of performing speech processing on the user terminal 1 of user U1, who is the recipient, to clarify the received speech heard by user U1 when user U2 is the speaker. In other words, the processor in the recipient's user terminal 1 executes at least this speech processing. Figure 6 is a flowchart illustrating the flow of information processing method S2. As shown in Figure 6, information processing method S2 includes steps S201 to S211.
[0057] Steps S201 and S202 initiate a call between user U1 and user U2. They are the same as steps S101 and S102, so a detailed explanation will not be repeated.
[0058] In step S203, the clarification UI unit 115 of the user terminal 1 receives a level setting operation from user U1 to set the level of clarification of the received voice. The setting information, including the level of clarification of the received voice set by the level setting operation, is stored in the storage unit 120. The level of clarification of the received voice is the same as that described for the transmitted voice. Note that step S203 is not limited to being executed after the start of a call (after step S201), but may also be executed before the start of a call (before step S201). Nor is it limited to immediately after the start of a call; it may be executed at any point during the call (any point after step S204, which will be explained below). In addition, if no level setting operation is accepted, pre-stored setting information may be applied.
[0059] In step S204, the call application unit 211 of the user terminal 2 acquires the voice data of user U2 via the voice input unit 250. In other words, it is assumed that user U2 spoke during the call. The call application unit 211 then transmits the acquired voice data of user U2 to the user terminal 1.
[0060] Step S205 is an example of the acquisition process. In step S205, the acquisition unit 112 acquires the voice data of user U2 by receiving it.
[0061] Step S206 is an example of voice processing. In step S206, the voice processing unit 113 performs voice processing on the user U2's voice data according to the setting information. This generates clarified voice data that is clarified according to the clarification level indicated by the setting information. However, if the setting information indicates that the clarification level of the received voice is "off", voice processing is not performed and clarified voice data is not generated.
[0062] In step S207, the audio output control unit 114 outputs the audio indicated by the clarified audio data via the audio output unit 160. As a result, user U1 can hear the clarified voice of user U2. In other words, user U1 can hear the voice of user U2, the person they are talking to, clarified by the clarification function of the user terminal 1 that they are using. If clarified audio data has not been generated, the audio indicated by user U2's audio data will be output.
[0063] Steps S204 to S207 are executed in real time in accordance with the progress of the utterance by user U2. In other words, as user U2's utterance progresses, the clarified voice of user U2 is heard by user U1 in real time. Subsequently, when user U1 speaks, user terminal 1 transmits the voice data of user U1 acquired via the voice input unit 150 to user terminal 2. User terminal 2 then outputs the received voice data via the voice output unit 260. As a result, user U2 hears the unclarified voice of user U1. It is also possible to configure the system so that steps S104 to S107 of the information processing method S1 described above are executed when user U1 speaks. In this case, user U2 hears the clarified voice of user U1. Furthermore, if user U2 speaks again, steps S204 to S207 are repeated.
[0064] Furthermore, if step S203 is executed again during the call and the clarity level of the received audio is changed, the changed clarity level is referenced in the audio processing in step S206. Therefore, the changed clarity level is reflected in the audio heard by user U1 in step S207.
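The control flow of steps S205 to S207, including the bypass when the level is "off" and the reflection of a level change made during the call, can be sketched as follows. This is a sketch under stated assumptions: `receive_loop`, `placeholder_clarify`, the frame representation, and the gain values are all hypothetical, and the placeholder merely scales samples rather than amplifying higher-order formant components as the actual voice processing would.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[float]


def placeholder_clarify(frame: Frame, level: str) -> Frame:
    """Stand-in for step S206. A real implementation would amplify
    higher-order formant components; this only scales the samples so
    that the control flow can be exercised."""
    gain = {"weak": 1.25, "medium": 1.5, "strong": 2.0}[level]  # hypothetical values
    return [sample * gain for sample in frame]


def receive_loop(
    frames: Iterable[Frame],
    current_level: Callable[[], str],
    clarify: Callable[[Frame, str], Frame] = placeholder_clarify,
) -> Iterator[Frame]:
    """Yield the frame to output (step S207) for each received frame (step S205)."""
    for frame in frames:
        level = current_level()  # re-read per frame so a change applies mid-call
        if level == "off":
            yield frame          # no clarified data: output the original voice
        else:
            yield clarify(frame, level)


# Usage: the level changes from "off" to "strong" after the first frame.
levels = iter(["off", "strong"])
out = list(receive_loop([[1.0, -1.0], [1.0, -1.0]], lambda: next(levels)))
```

Reading the level once per frame is one simple way to make a repeated step S203 take effect without restarting the call.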
[0065] Steps S210 and S211 terminate the call between user U1 and user U2. They are the same as steps S110 and S111, so a detailed explanation will not be repeated.
[0066] (Screen Example) Figure 7 is a schematic diagram showing a screen example G2 displayed on the display unit 130 of the user terminal 1 in the information processing method S2. In addition to providing a user interface for the user's own voice (transmitted voice) as shown in screen example G1 in Figure 5, screen example G2 also provides a user interface for clarifying the voice of the other party (received voice). The screen example G2 shown in Figure 7 includes information G21 to G22, areas G23 and G25, and an operation object G27.
[0067] Information G21 and G22 indicate that a call is in progress and are explained in the same way as information G11 and G12 in screen example G1. Operation object G27 "End Call" is explained in the same way as operation object G15 in screen example G1.
[0068] Area G23 is an area that provides a user interface for clarifying one's own voice. The information G231 and operation objects G232-1 to G232-4 included in area G23 are explained in the same way as the information G13 and operation objects G14-1 to G14-4 in screen example G1. Note that in screen example G2, operation object G232-4 "Off" is highlighted, indicating an example of turning off one's own voice clarity.
[0069] Area G25 is an area that provides a user interface related to clarifying the voice of the person on the other end of a call. Area G25 includes information G251 and operation objects G252-1 to G252-4. Information G251 (for example, "U2's voice has been clarified") indicates that the voice of user U2, the person on the other end of the call, has been clarified.
[0070] The operation objects G252-1 "Weak", G252-2 "Medium", G252-3 "Strong", and G252-4 "Off" accept level setting operations to select the level of clarity for the received audio. The description of operation objects G14-1 to G14-4 in screen example G1 applies to operation objects G252-1 to G252-4, with "transmitted audio" read as "received audio". However, the number of levels for the clarity of the received audio does not necessarily have to be the same as the number of levels for the clarity of the transmitted audio. For example, the number of levels that can be set for the received audio may be greater than the number of levels that can be set for the transmitted audio. This allows the user to clarify their own voice for the other party while setting a more precise degree of clarity for the audio they need to hear. Conversely, the number of levels that can be set for the received audio may be less than the number of levels that can be set for the transmitted audio. This allows the user to finely adjust the clarification of their own voice for the other party while keeping the adjustment for the audio they need to hear simple.
[0071] Additionally, among the operation objects G252-1 to G252-4, the highlighted G252-1 "Weak" indicates that the currently selected clarity level is "Weak".
[0072] By operating screen example G2, User U1 can clarify their own voice (transmitted voice), the voice of the other party, User U2 (received voice), or both, and can set the degree of clarity for each. Therefore, even if either User U1 or the other party has insufficient hearing, User U1 can communicate smoothly with User U2 during a call.
[0073] (Other screen examples) Figure 8 schematically shows a screen example G3 that is displayed on the display unit 130 of the user terminal 1 when setting the clarity level before starting a call. In other words, screen example G3 is displayed when step S103 of the information processing method S1, or step S203 of the information processing method S2, is executed before steps S101 to S102 or steps S201 to S202, respectively. As shown in Figure 8, screen example G3 includes information G31, areas G32 and G33, and an operation object G34.
[0074] Information G31 indicates that the person to whom the call is to be initiated is user U2. Operation object G34 accepts an operation to initiate a call with the other party. When an operation is accepted by operation object G34, step S101 or S201 for initiating a call is executed.
[0075] Area G32 is an area that provides a user interface for clarifying one's own voice. The operation objects G32-1 to G32-4 included in area G32 are described in the same way as the operation objects G14-1 to G14-4 in screen example G1. However, operation objects G32-1 to G32-4 accept settings for the clarity level of the outgoing voice in a call that is about to start, rather than settings for the outgoing voice in a call that is already in progress.
[0076] Area G33 is an area that provides a user interface for clarifying the voice of the person on the other end of a call. The operation objects G33-1 to G33-4 included in area G33 are described in the same way as the operation objects G252-1 to G252-4 in screen example G2. However, operation objects G33-1 to G33-4 accept settings for the clarity level of the received voice in a call that is about to start, rather than settings for the received voice in a call that is already in progress.
[0077] User U1 can, by operating on screen example G3, pre-set whether to clarify either the transmitted or received voice, or both, and to what degree of clarity before starting a call. Therefore, if User U1 knows in advance that either they or the other party has insufficient hearing, they can prepare in advance to ensure smooth communication during a call with User U2.
[0078] (Modification 1 of Embodiment 1) In the information processing system 100 according to Embodiment 1, the description mainly focused on an example in which the user terminal 2 does not have a clarity function. However, when the user terminal 2 does have a clarity function, the voice processing unit 113 in the user terminal 1 is modified as follows.
[0079] For example, suppose that in user terminal 2, voice processing is performed to clarify the spoken voice based on the operation of user U2 (speaker). In this case, the voice processing unit 113 of user terminal 1 either does not accept the operation of user U1 (receiver) to perform the voice processing on the received voice, or, even if it accepts that operation, does not perform the voice processing.
[0080] For example, the voice processing unit 113 may determine whether or not voice processing for clarifying spoken speech has been performed at the user terminal 2 based on the spectrum of the received speech. Alternatively, for example, the voice processing unit 113 may determine whether or not voice processing for clarifying spoken speech has been performed at the user terminal 2 based on the received speaker-side clarification flag. The speaker-side clarification flag is, for example, a flag that is sent to the receiver along with the voice data indicating spoken speech or the clarified voice data. For example, the user terminal 2 may send the speaker-side clarification flag indicating clarification on along with the clarified voice data to the user terminal 1, and send the speaker-side clarification flag indicating clarification off along with the unclarified voice data to the user terminal 1. The user terminal 1 may also have a function to send the speaker-side clarification flag. This allows the user terminal 2 to operate based on the speaker-side clarification flag in the same way as the user terminal 1.
[0081] Furthermore, for example, suppose that voice processing to clarify the received audio is performed in user terminal 2 based on the operation of user U2 (receiver). In this case, the voice processing unit 113 of user terminal 1 either does not accept the operation of user U1 (speaker) to perform the voice processing on the spoken audio, or, even if it accepts that operation, does not perform the voice processing.
[0082] For example, the voice processing unit 113 may determine whether or not voice processing to clarify the received audio has been performed in the user terminal 2 based on the receiver-side clarification flag. The receiver-side clarification flag is, for example, a flag that is sent to the speaker when voice processing to clarify the received audio has been performed. For example, the user terminal 2 may send a receiver-side clarification flag indicating clarification on to the user terminal 1 when it has performed voice processing to clarify the received audio. In this case, the voice processing unit 113 of the user terminal 1 will determine that voice processing to clarify the received audio has been performed in the user terminal 2 when it receives the receiver-side clarification flag indicating clarification on. Alternatively, the voice processing unit 113 of the user terminal 1 may determine that voice processing to clarify the received audio has not been performed in the user terminal 2 when it has not received the receiver-side clarification flag indicating clarification on.
[0083] In this modified example, for instance, the screen example G2 shown in Figure 7 is modified as follows. For example, suppose it is determined that voice processing for clarifying spoken audio is being performed on user terminal 2. In this case, in screen example G2 displayed on user terminal 1, the operation objects G252-1 to G252-4 related to clarifying received audio may be hidden or may be in a state where they do not accept operation (e.g., grayed out). Also, for example, suppose it is determined that voice processing for clarifying received audio is being performed on user terminal 2. In this case, in screen example G2 displayed on user terminal 1, the operation objects G232-1 to G232-4 related to clarifying spoken audio may be hidden or may be in a state where they do not accept operation (e.g., grayed out).
[0084] In this modification, voice processing to clarify the same utterance is not performed on both the speaker side and the receiver side. This prevents the receiver from hearing an utterance that sounds unnatural because it has been clarified twice.
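The guard against double processing in this modification can be sketched as a mapping from the flags received from the other terminal to the local controls that remain operable. The names `PeerFlags`, `LocalControls`, and `local_controls` are hypothetical; the sketch assumes that both flags have already been received and decoded, which the disclosure leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerFlags:
    speaker_side_on: bool = False   # peer already clarifies the voice it transmits
    receiver_side_on: bool = False  # peer already clarifies the voice it receives


@dataclass(frozen=True)
class LocalControls:
    transmitted_enabled: bool  # corresponds to operation objects G232-1 to G232-4
    received_enabled: bool     # corresponds to operation objects G252-1 to G252-4


def local_controls(peer: PeerFlags) -> LocalControls:
    """Disable whichever local control would clarify the same utterance twice."""
    return LocalControls(
        # Our transmitted voice is the peer's received voice.
        transmitted_enabled=not peer.receiver_side_on,
        # Our received voice is the peer's transmitted voice.
        received_enabled=not peer.speaker_side_on,
    )


controls = local_controls(PeerFlags(speaker_side_on=True))
```

Here `controls.received_enabled` is false, which corresponds to hiding or graying out the operation objects for the received audio in screen example G2.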
[0085] (Modification 2 of Embodiment 1) In the information processing system 100 according to Embodiment 1, the voice processing unit 113 is modified as follows. The voice processing unit 113 performs voice processing to clarify the speaker's voice based on any of the setting information selected from the setting information predetermined according to each of a plurality of types of voice characteristics. Voice characteristics may be, for example, language characteristics, dialect characteristics, gender characteristics, or individual characteristics, but are not limited to these. Here, the frequencies of higher-order formant components, including secondary formant components, that should be amplified for clarification may differ depending on the voice characteristics.
[0086] Therefore, the setting information may include information indicating the frequency band of higher-order formant components, including secondary formant components, according to the characteristics of the speech. For example, the first setting information may include information indicating the frequency band according to the characteristics of Japanese, and the second setting information may include information indicating the frequency band according to the characteristics of English. The number of types of speech characteristics for which the setting information is predetermined is not limited to two, but may be three or more. Furthermore, the process of selecting one of the multiple setting information may be performed by user operation or by a computer. The selection of setting information by a computer can be performed based on characteristics obtained by analyzing the spoken speech in real time.
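A selection among predetermined setting information might be organized as in the following sketch. Everything here is an assumption made for illustration: the names `SettingInfo`, `SETTINGS`, and `select_setting` are hypothetical, the frequency bands are arbitrary placeholders that are neither taken from this disclosure nor acoustically validated, and giving a user selection priority over a computer-detected characteristic is only one possible policy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SettingInfo:
    label: str
    band_hz: Tuple[float, float]  # band of formant components to amplify


# Placeholder bands only; not taken from the source and not validated.
SETTINGS = {
    "ja": SettingInfo("Japanese", (1000.0, 4000.0)),
    "en": SettingInfo("English", (1200.0, 4500.0)),
}


def select_setting(
    user_choice: Optional[str] = None,
    detected: Optional[str] = None,
    default: str = "ja",
) -> SettingInfo:
    """Prefer a user selection, then a detected characteristic, then a default."""
    for key in (user_choice, detected, default):
        if key in SETTINGS:
            return SETTINGS[key]
    raise KeyError(f"no setting information for default {default!r}")


chosen = select_setting(user_choice="en", detected="ja")
```

With these inputs the user selection wins, so `chosen.label` is "English"; an unknown key falls through to the next candidate.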
[0087] In this modified example, for example, screen example G1 shown in Figure 5, screen example G2 shown in Figure 7, and screen example G3 shown in Figure 8 can be modified as follows. For example, screen examples G1, G2, and G3 may include an operation object for selecting setting information to apply to clarifying one's own voice from multiple types of setting information (for example, setting information corresponding to Japanese, English, etc.). Also, for example, screen examples G2 and G3 may include an operation object for selecting setting information to apply to clarifying the voice of the person on the other end of the call from the same multiple types of setting information.
[0088] This modified version has the effect of being able to clarify the speech accurately according to the characteristics of the speaker's voice.
[0089] [Embodiment 2] An information processing system 100B according to Embodiment 2 of the present disclosure will be described below. For the sake of convenience of explanation, components having the same function as those described in the above embodiment will be denoted by the same reference numerals, and their descriptions will not be repeated.
[0090] (Configuration of Information Processing System 100B) Figure 9 is a block diagram showing the configuration of the information processing system 100B. As shown in Figure 9, the information processing system 100B includes a user terminal 1B used by user U1, a user terminal 2 used by user U2, and a server 3. User terminals 1B and 2 can be connected to each other via the server 3 so that users U1 and U2 can make calls. User terminals 1B and 2 and the server 3 can be connected via a network N. Although Figure 9 shows one each of user terminal 1B, user terminal 2, and server 3, there may be multiple instances of each. Server 3 is a computer that has the function of relaying calls between user terminals 1B and 2 using a predetermined call method. Server 3 is one embodiment of the information processing device described in the claims and provides at least a voice clarification function to user terminal 1B.
[0091] User terminal 1B is a terminal used by user U1 and is a modified version of user terminal 1. User terminal 1B has a user interface that utilizes the clarification function provided by server 3. User terminal 2 is described as above and is assumed not to have the clarification function, but is not limited to this and may have such a function.
[0092] (Configuration of Server 3) Figure 10 is a block diagram showing the functional configuration of each device constituting the information processing system 100B. As shown in Figure 10, Server 3 comprises a control unit 310, a storage unit 320, and a communication unit 370. The control unit 310 is implemented, for example, by a processor executing a program stored in memory, and controls each part of Server 3. Details of each functional block of the control unit 310 will be described later. The storage unit 320 is composed of, for example, memory, and stores various data and programs used by the control unit 310. The communication unit 370 connects to the network N and communicates with the outside. The communication unit 370 transmits information input from the control unit 310 via the network N. The communication unit 370 also outputs information received via the network N to the control unit 310. Note that the storage unit 320 and the communication unit 370 may be connected as peripheral devices instead of being built into Server 3.
[0093] (Functional Block of Control Unit 310) As shown in Figure 10, the control unit 310 includes a call relay unit 311, an acquisition unit 312, a voice processing unit 313, and an audio output control unit 314.
[0094] The call relay unit 311 has the function of relaying calls between multiple user terminals (for example, user terminal 1B and user terminal 2) according to a predetermined call method. For example, the call relay unit 311 generates a call session in response to a request from any of the multiple user terminals. The call relay unit 311 also allows any of the multiple user terminals to join the generated call session in response to a request from any of the multiple user terminals. For example, the call relay unit 311 relays voice data between the multiple user terminals participating in the call session while it is active. For example, the call relay unit 311 either removes any of the multiple user terminals from the call session or terminates the call session itself in response to a request from any of the multiple user terminals.
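The session lifecycle handled by the call relay unit 311 (create, join, relay, leave, terminate) can be modeled in memory as follows. This is a minimal sketch, not the disclosed implementation: the class `CallRelay`, its method names, and the string identifiers for sessions and terminals are hypothetical, and network transport is omitted entirely.

```python
from typing import Dict, Set


class CallRelay:
    """Hypothetical in-memory model of call relay unit 311."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Set[str]] = {}

    def create_session(self, session_id: str) -> None:
        self._sessions.setdefault(session_id, set())

    def join(self, session_id: str, terminal_id: str) -> None:
        self._sessions[session_id].add(terminal_id)

    def leave(self, session_id: str, terminal_id: str) -> None:
        self._sessions[session_id].discard(terminal_id)

    def terminate(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)

    def relay(self, session_id: str, sender_id: str, voice_data: bytes) -> Dict[str, bytes]:
        """Return the voice data addressed to every participant except the sender."""
        members = self._sessions.get(session_id, set())
        if sender_id not in members:
            return {}
        return {t: voice_data for t in members if t != sender_id}


relay = CallRelay()
relay.create_session("s1")
relay.join("s1", "1B")
relay.join("s1", "2")
delivered = relay.relay("s1", "1B", b"voice")
```

In this two-party example, `delivered` contains a single entry addressed to terminal "2"; with three or more participants the same method returns one entry per other participant.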
[0095] For example, the call relay unit 311 may be at least part of a server application that provides a two-way voice call function or a group voice call function. Alternatively, for example, the call relay unit 311 may be at least part of a server application that provides a two-way video call function or a group video call function. In the following description, the call relay unit 311 will mainly be described in an example where it relays calls between two user terminals, user terminal 1B and user terminal 2, but the number of user terminals to be relayed may be three or more.
[0096] The acquisition unit 312 performs the acquisition process by receiving the voice data of the speaker (for example, user U1 or user U2) from user terminal 1B or user terminal 2.
[0097] The voice processing unit 313 generates clarified voice data by performing voice processing on the voice data of the speaker (for example, user U1 or user U2) to clarify the voice. The voice processing unit 313 is the same as the voice processing unit 113, so a detailed explanation will not be repeated.
[0098] The voice output control unit 314 transmits the clarified voice data to the recipient's user terminal (for example, user terminal 1B or user terminal 2) to control the output of the voice indicated by the clarified voice data from the user terminal. For example, if user U1 is the speaker and clarified voice data for user U1 has been generated, the voice output control unit 314 transmits the clarified voice data to user terminal 2. As a result, the voice indicated by the clarified voice data is output from user terminal 2. Also, for example, if user U2 is the speaker and clarified voice data for user U2 has been generated, the voice output control unit 314 transmits the clarified voice data to user terminal 1B. As a result, the voice indicated by the clarified voice data is output from user terminal 1B.
[0099] (Configuration of User Terminal 1B) As shown in Figure 10, User Terminal 1B, like User Terminal 1, is equipped with a control unit 110, a storage unit 120, a display unit 130, an input unit 140, an audio input unit 150, an audio output unit 160, and a communication unit 170, but the functional blocks provided by the control unit 110 are different. The storage unit 120, display unit 130, input unit 140, audio input unit 150, audio output unit 160, and communication unit 170 have been described above, so a detailed explanation will not be repeated.
[0100] (Functional Block of Control Unit 110) As shown in Figure 10, the control unit 110 includes a call application unit 111 and a clarification UI unit 115. The clarification UI unit 115 may constitute a clarification application that extends the call function of the call application unit 111.
[0101] The call application unit 111 provides functions for making calls according to a predetermined call method. The call application unit 111 is a modified version of the call application unit 111 in Embodiment 1. The following description focuses on the modifications, and points in common with Embodiment 1 will not be repeated. The call application unit 111 connects with the call destination via the server 3. The call application unit 111 also sends and receives voice data with the call destination via the server 3. Furthermore, the call application unit 111 can make calls with multiple call destinations, not just one. For example, the call application unit 111 sends a request to the server 3 to create a call session with one or more call destinations. The call application unit 111 also joins the call session by sending a request to the server 3 to join the call session. The call application unit 111 also sends and receives voice data with other user terminals participating in the same call session via the server 3. The call application unit 111 leaves the call session by sending a request to the server 3 to leave the call session. Furthermore, the call application unit 111 terminates the call session by sending a request to the server 3 to terminate the call session.
[0102] The clarification UI unit 115 receives a level setting operation and generates setting information. The details of the level setting operation and setting information are as described above, so a detailed explanation will not be repeated. The generated setting information is sent to the server 3 and stored in the storage unit 320 of the server 3.
[0103] (Configuration of User Terminal 2) The configuration of User Terminal 2 is as described above, so a detailed explanation will not be repeated. However, the call application unit 211 is a modified version of the call application unit 211 in Embodiment 1. The modifications to the call application unit 211 are the same as those described for the call application unit 111.
[0104] (Flow of Information Processing Method S4) The information processing system 100B configured as described above executes information processing method S4. Information processing method S4 is a method in which the server 3 that relays the call performs voice processing to clarify the transmitted voice of user U1 when user U1 is the speaker. In other words, the processor of the server 3 that relays the call performs at least this voice processing. Below, an example of a two-person call will be described using information processing method S4. However, information processing method S4 can also be applied to group calls of three or more parties by assuming that there are multiple user terminals 1B and user terminals 2. Figure 11 is a flowchart illustrating the flow of information processing method S4. As shown in Figure 11, information processing method S4 includes steps S401 to S422.
[0105] In step S401, the call application unit 111 of user terminal 1B sends a call session creation request and a join request to server 3 based on user U1's operation to start a call with user U2. In step S402, the call relay unit 311 of server 3 creates a call session and allows user terminal 1B to join. In step S403, the call application unit 211 of user terminal 2 sends a call session join request to server 3 based on user U2's operation to respond to the call with user U1. The call relay unit 311 of server 3 allows user terminal 2 to join the call session. This initiates a call between user U1 and user U2. Steps S401 to S403 are not limited to this order; they can be executed in the reverse order, or some or all of them in parallel. Furthermore, steps S401 to S403 do not restrict whether user U1 or user U2 is the one who initiates the call to the other.
[0106] In step S404, the clarification UI unit 115 of the user terminal 1B receives a level setting operation from user U1 to set the level of clarification of the transmitted voice. The details of step S404 are the same as those of step S103. The clarification UI unit 115 also transmits the setting information to the server 3.
[0107] In step S405, the voice processing unit 313 of the server 3 stores setting information, including the clarity level of the transmitted voice, in the storage unit 320.
[0108] In step S406, the call application unit 111 of the user terminal 1B acquires the voice data of user U1 via the voice input unit 150. In other words, it is assumed that user U1 spoke during the call. The call application unit 111 also transmits the voice data of user U1 to the server 3.
[0109] Step S407 is an example of the acquisition process. In step S407, the acquisition unit 312 of the server 3 acquires the voice data of user U1 by receiving it from the user terminal 1B.
[0110] Step S408 is an example of voice processing. In step S408, the voice processing unit 313 performs voice processing on the user U1's voice data according to the setting information. This generates clarified voice data that is clarified according to the clarification level indicated by the setting information. If the setting information indicates that the clarification level of the transmitted voice is "off", voice processing is not performed and clarified voice data is not generated.
[0111] In step S409, the audio output control unit 314 transmits clarified audio data to the user terminal 2. If clarified audio data has not been generated, the user U1's audio data is transmitted instead.
[0112] In step S410, the call application unit 211 of the user terminal 2 receives the clarified voice data. The call application unit 211 also outputs the voice indicated by the received clarified voice data via the voice output unit 260. As a result, user U2 hears the clarified voice of user U1. If voice data is received instead of clarified voice data, the unclarified voice indicated by that voice data is output.
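The server-side handling in steps S405 and S407 to S409 can be sketched as follows. The class `ClarifyingServer`, its methods, and the injected `clarify` callable are hypothetical; treating a terminal with no stored setting information as "off" is an assumption, and the usage example substitutes a dummy clarifier that only tags the payload instead of performing real voice processing.

```python
from typing import Callable, Dict, Tuple

Clarifier = Callable[[bytes, str], bytes]


class ClarifyingServer:
    """Hypothetical model of steps S405 and S407 to S409 on server 3."""

    def __init__(self, clarify: Clarifier) -> None:
        self._clarify = clarify
        self._transmit_level: Dict[str, str] = {}  # stand-in for storage unit 320

    def store_setting(self, terminal_id: str, level: str) -> None:
        # Step S405: keep the most recent level for each speaker's terminal.
        self._transmit_level[terminal_id] = level

    def handle_voice(self, sender_id: str, voice_data: bytes) -> Tuple[bytes, bool]:
        """Return (data to forward, whether it was clarified)."""
        level = self._transmit_level.get(sender_id, "off")  # assumed default
        if level == "off":
            return voice_data, False  # step S409 fallback: forward as-is
        return self._clarify(voice_data, level), True


# Usage with a dummy clarifier that only tags the payload.
server = ClarifyingServer(lambda data, level: level.encode() + b":" + data)
before = server.handle_voice("1B", b"hello")
server.store_setting("1B", "medium")
after = server.handle_voice("1B", b"hello")
```

Because the level is looked up on every call to `handle_voice`, a setting stored during the call affects the next voice data received from that terminal.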
[0113] Steps S406 to S410 are executed in real time in accordance with the progress of the utterance by user U1. In other words, as the utterance by user U1 progresses, user U2 hears the clarified voice of user U1 in real time. Subsequently, when user U2 speaks, user terminal 1B receives, from server 3, the voice data of user U2 transmitted from user terminal 2, and outputs the voice indicated by the received voice data via the voice output unit 160. As a result, user U1 hears the unclarified voice of user U2. It is also possible to configure the system so that when user U2 speaks, steps S506 to S510 of the information processing method S5 described later are executed. In this case, user U1 hears the clarified voice of user U2. Furthermore, if user U1 speaks again, steps S406 to S410 are repeated.
[0114] Furthermore, if step S404 is executed again during the call and the clarity level of the transmitted voice is changed, the changed clarity level is referenced in the voice processing in step S408. Therefore, the changed clarity level is reflected in the voice heard by user U2 in step S410.
[0115] In step S420, the call application unit 111 of user terminal 1B sends a call session termination request to server 3 based on user U1's operation. In step S421, the call relay unit 311 of server 3 terminates the call session in response to the termination request. In step S422, the call application unit 211 of user terminal 2 completes the process of exiting the terminated call session. This terminates the call between user U1 and user U2. Note that steps S420 to S422 are not limited to being executed in this order, but may be executed in the reverse order, or some or all of them may be executed in parallel. Also, the operation to terminate the call session may be performed by user U2, not just user U1, or by both users. Furthermore, at least one of user U1 and user U2 may perform an operation to exit the call session instead of an operation to terminate it.
[0116] According to the information processing method S4, for example, the display unit 130 of the user terminal 1B displays the example screen G1 in the same manner as the information processing method S1. Details of the example screen G1 are as described above. By operating on the example screen G1, user U1 can clarify their voice so that user U2, the person they are talking to, can hear them, even if user U2 has poor hearing. As a result, this contributes to smooth communication between user U1 and user U2.
[0117] (Flow of Information Processing Method S5) The information processing system 100B then executes information processing method S5. Information processing method S5 is a method in which the server 3 that relays the call performs voice processing to clarify the received voice heard by user U1 when user U2 is the speaker. In other words, the processor of the server 3 that relays the call performs at least this voice processing. In the following, similar to information processing method S4, an example of performing a two-party call in information processing method S5 will be described. However, information processing method S5 can also be applied to group calls of three or more parties by assuming that there are multiple user terminals 1B and user terminals 2. Figure 12 is a flowchart illustrating the flow of information processing method S5. As shown in Figure 12, information processing method S5 includes steps S501 to S522.
[0118] Steps S501 to S503 are steps to initiate a call between user U1 and user U2. Steps S501 to S503 are the same as steps S401 to S403, so a detailed explanation will not be repeated.
[0119] In step S504, the clarification UI unit 115 of the user terminal 1B receives a level setting operation from user U1 to set the level of clarification of the received voice. The details of step S504 are the same as those of step S203. The clarification UI unit 115 also transmits setting information indicating the set level to the server 3.
[0120] In step S505, the voice processing unit 313 of the server 3 stores setting information, including the level of clarity of the received voice, in the storage unit 320.
[0121] In step S506, the call application unit 211 of the user terminal 2 acquires the voice data of user U2 via the voice input unit 250. In other words, it is assumed that user U2 spoke during the call. The call application unit 211 also transmits the voice data of user U2 to the server 3.
[0122] Step S507 is an example of the acquisition process. In step S507, the acquisition unit 312 of the server 3 acquires the voice data of user U2 by receiving it from the user terminal 2.
[0123] Step S508 is an example of voice processing. In step S508, the voice processing unit 313 performs voice processing on the user U2's voice data according to the setting information. This generates clarified voice data that is clarified according to the clarification level indicated by the setting information. However, if the setting information indicates that the clarification level of the received voice is "off", voice processing is not performed and clarified voice data is not generated.
[0124] In step S509, the voice output control unit 314 transmits the clarified voice data to the user terminal 1B. If the clarified voice data has not been generated, the user U2's voice data is transmitted instead.
[0125] In step S510, the call application unit 111 of the user terminal 1B receives the clarified voice data. The call application unit 111 also outputs the voice indicated by the received clarified voice data via the voice output unit 160. As a result, user U1 hears the clarified voice of user U2. If voice data is received instead of clarified voice data, the unclarified voice indicated by that voice data is output.
[0126] Steps S506 to S510 are executed in real time in accordance with the progress of the utterance by user U2. In other words, as the utterance by user U2 progresses, user U1 hears the clarified voice of user U2 in real time. Subsequently, when user U1 speaks, user terminal 2 receives, from server 3, the voice data of user U1 transmitted from user terminal 1B, and outputs the received voice data via the voice output unit 260. As a result, user U2 hears the unclarified voice of user U1. The system may also be configured so that, when user U1 speaks, steps S406 to S410 of the information processing method S4 described above are executed. In this case, user U2 hears the clarified voice of user U1. Furthermore, if user U2 speaks again, steps S506 to S510 are repeated.
[0127] Furthermore, if step S504 is executed again during the call and the clarity level of the received voice is changed, the changed clarity level is referenced in the voice processing of step S508. Therefore, the changed clarity level is reflected in the voice heard by user U1 in step S510.
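The server-side flow of steps S504 to S509, including a level change during the call, can be illustrated with the following minimal sketch. It is not the disclosed implementation: the class name `RelayServer`, the function `toy_clarify`, the level names "weak" and "strong", and the gain values are all assumptions introduced for illustration. Only the "off" behavior (forwarding the unclarified voice data) and the per-call storage of the setting information come from the description above.

```python
from typing import Callable, Dict, List

Frame = List[float]

OFF = "off"  # the description states that "off" disables voice processing


class RelayServer:
    """Toy stand-in for server 3: stores settings and relays voice frames."""

    def __init__(self, clarify: Callable[[Frame, str], Frame]) -> None:
        self._clarify = clarify            # stand-in for voice processing unit 313
        self._levels: Dict[str, str] = {}  # stand-in for storage unit 320

    def set_received_level(self, receiver_id: str, level: str) -> None:
        # Steps S504/S505: may be called again at any time during the call.
        self._levels[receiver_id] = level

    def relay(self, receiver_id: str, frame: Frame) -> Frame:
        # Steps S507 to S509: the stored level is looked up for every frame,
        # so a mid-call change takes effect from the next frame onward.
        level = self._levels.get(receiver_id, OFF)
        if level == OFF:
            return frame                   # forward the unclarified voice data
        return self._clarify(frame, level)


def toy_clarify(frame: Frame, level: str) -> Frame:
    # Placeholder only (assumption): plain scaling, NOT formant amplification.
    gain = {"weak": 2.0, "strong": 4.0}[level]
    return [sample * gain for sample in frame]
```

In this sketch the setting is read on every frame rather than cached at call start, which is one simple way to obtain the behavior of paragraph [0127].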
[0128] Steps S520 to S522 are steps for terminating the call between user U1 and user U2. Steps S520 to S522 are the same as steps S420 to S422, so a detailed explanation will not be repeated.
[0129] According to the information processing method S5, for example, the display unit 130 of the user terminal 1B displays the example screen G2 in the same manner as the information processing method S2. Details of the example screen G2 are as described above. By operating on the example screen G2, user U1 can hear the voice of user U2, the other party to the call, more clearly, even if U1's own hearing is not sufficient. As a result, this contributes to smooth communication between user U1 and user U2.
(Modification of Embodiment 2) In the information processing system 100B according to Embodiment 2, the voice processing unit 313 of the server 3 can be modified in the same way as the voice processing unit 113 in Modification 1 or 2 of Embodiment 1. This provides the same effect as Modification 1 or 2 of Embodiment 1, even when the server 3 relays the call.
[0130] [Embodiment 3] The telephone system 100C according to Embodiment 3 of the present disclosure will be described below. The telephone system 100C includes a voice processing device 10C that performs voice processing to clarify the transmitted or received voice. The voice processing device 10C also has a built-in microcomputer 1C that performs the voice processing. The microcomputer 1C is one embodiment of the information processing device described in the claims. Note that the microcomputer 1C is not limited to what is called a "microcomputer," but may be any computer that has a processor and memory and can be built into the voice processing device 10C. For the sake of convenience of explanation, in the following, components that have the same function as the components described in the above embodiments will be denoted by the same reference numerals, and their descriptions will not be repeated.
[0131] (Configuration of Telephone System 100C) Figure 13 is a block diagram showing the configuration of telephone system 100C. As shown in Figure 13, telephone system 100C comprises an audio processing device 10C, a handset 20, a main unit 30, and cables 41, 42, and 43. Telephone system 100C is used by user U1 to make calls. The person with whom a call is made via telephone system 100C is referred to as user U2.
[0132] (Handset 20) The handset 20 includes a microphone 21, a speaker 22, and a port P21. The microphone 21 and speaker 22 are each connected to port P21. The microphone 21 is an example of a transmitter, and the speaker 22 is an example of a receiver.
[0133] The microphone 21 converts the voice emitted by user U1 into an audio signal and supplies the audio signal to port P21. In this embodiment, the audio signal is an analog signal and an electrical signal. However, the microphone 21 may be configured to convert the voice emitted by user U1 into a digital audio signal.
[0134] For example, an electret condenser microphone can be used as the microphone 21. A DC voltage is applied to an electret condenser microphone to drive it. Also, electret condenser microphones are often connected using different types of connectors depending on the manufacturer or model.
[0135] The speaker 22 converts the audio signal supplied from port P21, based on the voice emitted by user U2, into sound, and outputs the sound. In this embodiment, the audio signal supplied from port P21 is an analog signal and an electrical signal. However, the audio signal supplied from port P21 may be a digital signal, and the speaker 22 may be configured to convert the digital audio signal into sound.
[0136] Port P21 is an example of a port on the handset 20, and in this embodiment, a modular jack conforming to the RJ-9 standard is used. However, port P21 is not limited to a modular jack conforming to the RJ-9 standard. Any connector capable of inputting and outputting audio signals may be used as port P21.
[0137] (Main unit 30) The main unit 30 is equipped with ports P31 and P32. Port P31 is an example of a port on the handset 20 side of the main unit 30, and in this embodiment, a modular jack conforming to the RJ-9 standard is used, similar to port P21.
[0138] Port P32 is an example of a line-side port on the main unit 30, and in this embodiment, a modular jack conforming to the RJ-11 standard is used. However, port P32 is not limited to a modular jack conforming to the RJ-11 standard. For example, a modular jack conforming to the RJ-12 standard or the RJ-14 standard may be used as port P32.
[0139] The main unit 30 is the main unit of a telephone using a two-wire telephone line. The main unit 30 is configured to enable bidirectional voice signal communication with the main unit of the receiving telephone by connecting the telephone line to port P32 and the handset 20 to port P31. The main unit 30 can be configured in the same way as the main unit of an existing telephone. Therefore, in this embodiment, the description of the main unit 30 will be omitted.
[0140] The main unit 30 outputs the audio signal supplied to port P31 from the audio processing device 10C (described later) to the cable 43 (described later).
[0141] (Cables 41, 42, 43) Each of the cables 41, 42, and 43 is equipped with multiple signal lines. Each of the multiple signal lines is made of a conductor capable of transmitting an audio signal, which is an electrical signal.
[0142] Cable 41 connects port P11 of the voice processing device 10C (described later) to port P21 of the handset 20, cable 42 connects port P12 of the voice processing device 10C to port P31 of the main unit 30, and cable 43 forms the end of the telephone line and is connected to port P32 of the main unit 30.
[0143] In this embodiment, each of cables 41 and 42 is a cable equipped with modular plugs conforming to the RJ-9 standard at both ends. In this embodiment, cable 43 is a cable equipped with modular plugs conforming to the RJ-11 standard at both ends. In Figure 13, the black squares shown at both ends of cables 41 and 42 represent modular plugs conforming to the RJ-9 standard, and the white square shown at the end of cable 43 represents a modular plug conforming to the RJ-11 standard.
[0144] (Voice Processing Device 10C) The voice processing device 10C is a device that performs voice processing to clarify transmitted voice. Figure 14 is a block diagram showing the detailed configuration of the voice processing device 10C. As shown in Figure 14, the voice processing device 10C includes a microcomputer 1C, an AD (Analog-Digital) converter 180, a DA (Digital-Analog) converter 190, and an adjuster 15. The voice processing device 10C is installed between the handset 20 and the main unit 30.
[0145] Furthermore, in this embodiment, the housing of the audio processing device 10C is made of aluminum. However, the material constituting the housing is not limited to aluminum, and may be a metal such as copper or stainless steel. Alternatively, the material constituting the housing may be mainly resin with a metal layer provided on its surface. The housing can shield electromagnetic waves from entering the interior from the outside by having at least its surface covered with metal.
[0146] (AD converter 180, DA converter 190) The AD converter 180 converts the audio signal (corresponding to transmitted speech) supplied to port P11 into digital audio data and outputs it to the microcomputer 1C. The DA converter 190 converts the clarified audio data output from the microcomputer 1C into an analog audio signal and supplies it to port P12.
[0147] (Adjuster 15) The adjuster 15 is a user interface for setting the clarity level. In this embodiment, the adjuster 15 is configured using a switch that allows selection of three clarity levels (0, 1, 2). For example, the levels 0, 1, and 2 selectable with the switch correspond to the clarity levels "off," "weak," and "strong," respectively. The adjuster 15 outputs a signal to the microcomputer 1C indicating which of the levels 0, 1, or 2 is selected by the switch.
[0148] The number of switch stages constituting the adjuster 15 is not limited to the example described above and can be selected as appropriate. Furthermore, the adjuster 15 can employ a volume control that continuously changes the degree of clarity instead of a switch that discretely changes the degree of clarity. Also, in cases where the user U2 on the receiving end can be identified to some extent, such as when the telephone system 100C is installed in a home, the degree of clarity can be preset and the adjuster 15 can be omitted.
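As an illustration of how the microcomputer 1C might interpret the signal from the adjuster 15, the following sketch maps the three switch positions to the clarity levels named above, and maps a continuous volume reading to a degree between 0.0 and 1.0. The function names and the 10-bit raw range (0 to 1023) are assumptions made for this example; the disclosure does not specify the signal format.

```python
# Switch positions and their clarity levels, as listed in the description.
SWITCH_LEVELS = {0: "off", 1: "weak", 2: "strong"}


def level_from_switch(position: int) -> str:
    """Translate a discrete switch position into a clarity level name."""
    try:
        return SWITCH_LEVELS[position]
    except KeyError:
        raise ValueError(f"unsupported switch position: {position}") from None


def degree_from_volume(raw: int, raw_max: int = 1023) -> float:
    """Map a raw volume reading to a clarity degree in the range 0.0 to 1.0.

    raw_max is a hypothetical full-scale value (10-bit reading assumed).
    Out-of-range readings are clamped rather than rejected.
    """
    if raw_max <= 0:
        raise ValueError("raw_max must be positive")
    clamped = min(max(raw, 0), raw_max)
    return clamped / raw_max
```

When the adjuster 15 is omitted and the degree of clarity is preset, the same downstream code could simply be given a fixed value in place of either function's result.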
[0149] (Microcomputer 1C) The microcomputer 1C includes a control unit 110 and a storage unit 120. The control unit 110 is implemented, for example, by a processor executing a program stored in memory, and comprehensively controls each part of the audio processing device 10C. The storage unit 120 is composed, for example, of memory and stores various data and programs used by the control unit 110. The control unit 110 includes an acquisition unit 112, an audio processing unit 113, and an audio output control unit 114.
[0150] The acquisition unit 112 performs the acquisition process. In this embodiment, the acquisition unit 112 acquires the audio data output from the AD converter 180. The audio data represents the voice (transmitted voice) of user U1 speaking into the handset 20.
[0151] The voice processing unit 113 performs voice processing. This generates clarified voice data corresponding to the transmitted voice. Details of the voice processing unit 113 are the same as those described in Embodiment 1.
[0152] The audio output control unit 114 performs audio output control processing. In this embodiment, the audio output control processing is performed by outputting clarified audio data corresponding to the transmitted speech to the DA converter 190.
[0153] (Flow of Information Processing Method S7) In the telephone system 100C configured as described above, the microcomputer 1C executes information processing method S7. Information processing method S7 is a method executed by the microcomputer 1C when user U1 makes a call with the other party using the telephone system 100C. Figure 15 is a flowchart showing the flow of information processing method S7. As shown in Figure 15, information processing method S7 includes steps S701 to S704.
[0154] In step S701, the audio processing unit 113 acquires the clarity level from the adjuster 15. In step S702, the acquisition unit 112 acquires the audio data (corresponding to the transmitted voice) output from the AD converter 180. In step S703, the audio processing unit 113 generates clarified audio data corresponding to the transmitted voice by performing audio processing on the audio data according to the clarity level. In step S704, the audio output control unit 114 outputs the clarified audio data corresponding to the transmitted voice to the DA converter 190.
[0155] As a result, the clarified voice data corresponding to the transmitted voice is converted into an analog signal, and this voice signal is transmitted via the main unit 30 to the user terminal of user U2, who is the other party in the call. Consequently, the user terminal of user U2 is controlled to output the clarified voice of user U1. Even if user U2, the other party in the call, has insufficient hearing, user U1 can clarify their own voice using the voice processing device 10C they use, so that user U2 can hear them. This contributes to smooth communication between user U1 and user U2.
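The flow of steps S701 to S704 can be sketched as a small streaming loop. This is a minimal illustration under stated assumptions, not the disclosed firmware: the names `run_s7` and `toy_process` are hypothetical, the clarity level is assumed to be re-read once per frame, samples are assumed to be 16-bit signed integers, and the placeholder processing is plain scaling with clamping rather than the voice processing of Embodiment 1.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[int]

INT16_MIN, INT16_MAX = -32768, 32767  # assumed sample range of the converters


def run_s7(
    read_level: Callable[[], int],            # S701: reads adjuster 15
    adc_frames: Iterable[Frame],              # S702: output of AD converter 180
    process: Callable[[Frame, int], Frame],   # S703: voice processing unit 113
) -> Iterator[Frame]:
    """Yield one processed frame per input frame (S704: to DA converter 190)."""
    for frame in adc_frames:
        level = read_level()  # re-read per frame so a switch change is picked up
        yield process(frame, level)


def toy_process(frame: Frame, level: int) -> Frame:
    # Placeholder only (assumption): level 0 passes audio through unchanged;
    # other levels scale the samples and clamp them to the 16-bit range.
    if level == 0:
        return list(frame)
    gain = 1 + level
    return [max(INT16_MIN, min(INT16_MAX, s * gain)) for s in frame]
```

Because `run_s7` is a generator, each frame is handed on as soon as it is processed, which mirrors the real-time character of the method.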
[0156] (Modification of Embodiment 3) The voice processing device 10C according to Embodiment 3 can be modified to perform voice processing for clarifying received voice in addition to voice processing for clarifying transmitted voice. In this case, the voice processing device 10C further includes an AD converter that converts the voice signal (corresponding to received voice) supplied from port P12 into a digital signal and outputs it to the microcomputer 1C. The voice processing device 10C further includes a DA converter that converts the clarified voice data (corresponding to received voice) output from the microcomputer 1C into an analog signal and supplies it to port P11. The voice processing device 10C further includes an adjuster for setting the clarification level for received voice.
[0157] In this modified example, the acquisition unit 112 acquires audio data supplied from port P12 via the AD converter. This audio data represents the voice (received voice) of user U2, the person on the other end of the call, as received by the main unit 30.
[0158] The voice processing unit 113 obtains the clarity level for the received voice from the adjuster. The voice processing unit 113 also performs voice processing on the user U2's voice data according to the clarity level. This generates clarified voice data corresponding to the received voice. Details of the voice processing unit 113 are the same as those described in Embodiment 1.
[0159] The audio output control unit 114 performs audio output control processing. In this modification, the audio output control processing is performed by supplying clarified audio data corresponding to the received voice to port P11 via the DA converter. As a result, the clarified voice of user U2 is output from speaker 22.
[0160] According to this modification, even if user U1 has insufficient hearing, the voice of user U2, the person they are talking to, can be clarified by the voice processing device 10C that user U1 uses, so that user U1 can hear it. As a result, this contributes to smooth communication between user U1 and user U2.
[0161] (Another Modification 1 of Embodiment 3) The voice processing device 10C according to Embodiment 3 is not limited to a wired connection and may be connected wirelessly to one or both of the handset 20 and the main unit 30. Furthermore, the voice processing device 10C is not limited to being configured separately from the handset 20 and the main unit 30. For example, the voice processing device 10C may be housed within the casing of the main unit 30 so as to be integrated with the main unit 30. Alternatively, the voice processing device 10C may be housed within the casing of the handset 20 so as to be integrated with the handset 20.
[0162] Furthermore, the telephone system 100C according to Embodiment 3 may include a computer with a calling function instead of the main unit 30. This computer may be, but is not limited to, a mobile phone, smartphone, tablet, smartwatch, laptop computer, or desktop computer. Also, the handset 20 is not limited to the configuration shown in Figure 13. For example, the handset 20 may be a headset or the like.
[0163] (Modification 2 of Embodiment 3) In the telephone system 100C according to Embodiment 3, the voice processing unit 113 of the microcomputer 1C can be modified in the same way as the voice processing unit 113 in Modification 1 or 2 of Embodiment 1. As a result, even in a call using the voice processing device 10C which has the microcomputer 1C built in, the same effects as in Modification 1 or 2 of Embodiment 1 can be achieved.
[0164] [Example of implementation by software] The functions of user terminals 1 and 1B, microcomputer 1C, and server 3 (hereinafter referred to as "devices") can be realized by a program for causing a computer to function as the device, that is, a program for causing the computer to function as each control block of the device (in particular, each unit included in the control units 110 and 310).
[0165] In this case, the device includes a computer having at least one control device (e.g., a processor) and at least one storage device (e.g., memory) as hardware for executing the program. By executing the program using this control device and storage device, the functions described in each of the embodiments are realized.
[0166] The above program may be recorded on one or more non-transitory computer-readable recording media. These recording media may or may not be included in the above device. In the latter case, the program may be supplied to the above device via any wired or wireless transmission medium.
[0167] Furthermore, some or all of the functions of each of the above control blocks can also be implemented by logic circuits. For example, an integrated circuit in which logic circuits functioning as each of the above control blocks are formed is also included in the scope of this disclosure. In addition, it is also possible to implement the functions of each of the above control blocks by, for example, a quantum computer.
[0168] Furthermore, each process described in the above embodiments may be performed by AI (Artificial Intelligence). In this case, the AI may operate on the control device described above, or it may operate on another device (for example, an edge computer or a cloud server).
[0169] [Summary] The information processing method according to Embodiment 1 includes, at least one processor performing: an acquisition process to acquire audio data indicating the voice of the speaker that has been input to the speaker's user terminal among a plurality of user terminals connected for making a call; an audio processing to generate clarified audio data by performing audio processing on the audio data to clarify the voice; and an audio output control process to control the output of the voice indicated by the clarified audio data from the receiver's user terminal among the plurality of user terminals. With the above configuration, the speaker's voice is clarified in calls between multiple users, thus providing the effect of supporting smooth communication in calls between multiple users, including users with insufficient hearing.
[0170] The information processing method according to Embodiment 2 is such that, in Embodiment 1, at least one processor performs the voice processing based on the speaker's operation. With this configuration, the speaker can perform an operation to clarify their own voice if the person they are talking to does not have sufficient hearing, so that the person they are talking to can hear their clarified voice.
[0171] The information processing method according to Embodiment 3 is such that, in Embodiment 2, at least one processor sets the degree of clarity in the speech processing based on the speaker's operation. With this configuration, the speaker can set the degree of clarity for their own voice according to the hearing ability of the person they are talking to.
[0172] The information processing method according to Embodiment 4 is such that, in Embodiment 2 or Embodiment 3, at least one processor, when the speech processing is performed based on the speaker's operation, either makes it impossible to receive the receiver's operation to perform the speech processing, or does not perform the speech processing even if the receiver's operation is received. With this configuration, since speech processing to clarify the speaker's voice is not performed twice on both the speaker and receiver sides, the occurrence of unnatural received voice can be suppressed.
[0173] The information processing method according to Embodiment 5 is such that, in any one of Embodiments 1 to 4, at least one processor performs the voice processing based on the receiver's operation. With this configuration, if the receiver's own hearing is insufficient, the receiver can perform an operation to clarify the voice of the person they are talking to, and thus hear that person's clarified voice.
[0174] The information processing method according to embodiment 6 is such that, in embodiment 5, at least one processor sets the degree of clarity in the voice processing based on the operation of the receiver. With this configuration, the receiver can set the degree of clarity for the voice of the person they are talking to, according to their own hearing ability.
[0175] The information processing method according to Embodiment 7 is such that, in Embodiment 5 or Embodiment 6, if the voice processing is performed based on the receiver's operation, at least one processor either does not accept the speaker's operation to perform the voice processing, or does not perform the voice processing even if the speaker's operation is accepted. With this configuration, voice processing to clarify the speaker's voice is not performed twice on both the speaker and receiver sides, thus suppressing the occurrence of unnatural received voice.
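The mutual exclusion between speaker-side and receiver-side clarification described above can be sketched as a small arbiter that tracks which side currently applies voice processing to a given speaker's voice. The class name `ClarificationArbiter`, the side labels, and the `release` behavior (what happens when a side turns processing off) are assumptions introduced for illustration; the disclosure states only that the second side's operation is either not accepted or has no effect.

```python
from dataclasses import dataclass
from typing import Optional

SIDES = ("speaker", "receiver")


@dataclass
class ClarificationArbiter:
    """Tracks which side, if any, currently clarifies one speaker's voice."""

    owner: Optional[str] = None  # "speaker", "receiver", or None

    def request(self, side: str) -> bool:
        """Return True if this side may perform the voice processing."""
        if side not in SIDES:
            raise ValueError(f"unknown side: {side}")
        if self.owner is None or self.owner == side:
            self.owner = side
            return True
        # The other side already clarifies this voice: refuse, so that the
        # same voice is never processed twice.
        return False

    def release(self, side: str) -> None:
        """Assumed behavior: only the owning side can give up ownership."""
        if self.owner == side:
            self.owner = None
```

A user interface could use the return value of `request` to disable the clarification control, corresponding to "not accepting" the operation, or accept the operation and silently skip processing, corresponding to the second alternative.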
[0176] The information processing method according to embodiment 8 is characterized in that, in any one of embodiments 1 to 7, in the speech processing, at least one processor performs the speech processing based on any of the setting information selected from predetermined setting information according to each of the characteristics of a plurality of types of speech. With the above configuration, the speech can be clarified with greater accuracy according to the characteristics of the speaker's speech.
[0177] The information processing method according to Embodiment 9 is such that, in any one of Embodiments 1 to 8, the processor provided in the server relaying the call performs at least the voice processing. With this configuration, the speaker and the listener can clarify the speaker's voice even if their own user terminals used for the call do not have a function to perform voice processing for clarifying the voice.
[0178] The information processing method according to embodiment 10 is such that, in any one of embodiments 1 to 8, the processor in the speaker's user terminal performs at least the voice processing. With this configuration, the speaker can clarify their own voice using their own user terminal, regardless of whether the other party's user terminal has a function to perform voice processing for clarifying voice.
[0179] The information processing method according to embodiment 11 is such that, in any one of embodiments 1 to 8, the processor in the receiver's user terminal performs at least the voice processing. With this configuration, the receiver can clarify the voice of the person they are talking to using their own user terminal, regardless of whether the other person's user terminal has a function to perform voice processing for clarifying the voice.
[0180] The information processing method according to embodiment 12 is such that, in any one of embodiments 1 to 11, the speech processing is a process that amplifies higher-order formant components, including at least a second-order formant component, in the speech data. With the above configuration, speech can be made clearer.
[0181] The information processing device according to embodiment 13 comprises at least one processor, the at least one processor executing each of the processes included in the information processing method described in any one of embodiments 1 to 12. With this configuration, the same effects as any one of embodiments 1 to 12 are achieved.
[0182] The program according to embodiment 14 causes at least one processor to execute each of the processes included in the information processing method described in any one of embodiments 1 to 12. With this configuration, the same effects as any one of embodiments 1 to 12 are achieved.
[0183] The computer-readable, non-transitory recording medium according to Embodiment 15 stores the program according to Embodiment 14. This configuration provides the same effect as any one of Embodiments 1 to 12.
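The summary above characterizes the voice processing as amplifying higher-order formant components, including at least the second formant. The following sketch is a deliberately crude stand-in for that idea and is not the disclosed algorithm: instead of locating formants, it assumes a fixed boundary frequency (`boundary_hz`, a hypothetical parameter) between the first and second formants and applies a gain to every frequency component at or above it. The function names and the naive DFT are choices made for this example only.

```python
import cmath
import math
from typing import List


def boost_above(frame: List[float], sample_rate: float,
                boundary_hz: float, gain: float) -> List[float]:
    """Scale all spectral components at or above boundary_hz by gain."""
    n = len(frame)
    # Forward DFT (naive O(n^2); adequate for a short illustrative frame).
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Bins k and n - k mirror each other for real input, so both are scaled.
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n
        if freq >= boundary_hz:
            spectrum[k] *= gain
    # Inverse DFT; the imaginary part is only rounding noise for real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def tone_amplitude(frame: List[float], sample_rate: float, freq_hz: float) -> float:
    """Amplitude of a sinusoid that falls exactly on a DFT bin of the frame."""
    n = len(frame)
    k = round(freq_hz * n / sample_rate)
    coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
    return 2 * abs(coeff) / n
```

For example, with a 64-sample frame at 8000 Hz containing a 500 Hz tone and a weaker 2000 Hz tone, `boost_above(frame, 8000.0, 1000.0, 2.0)` leaves the 500 Hz component unchanged and doubles the 2000 Hz component. A practical implementation would estimate the formant positions from the signal itself and would process overlapping frames in real time, neither of which is shown here.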
[0184] Each embodiment of this disclosure has the effect of being able to present to the user the effect of clarification through speech processing for clarifying speech. Such an effect contributes, for example, to achieving Sustainable Development Goal (SDG) 3, "Ensure good health and well-being for all," as advocated by the United Nations.
[0185] This disclosure is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. Embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of this disclosure.
[0186] 1, 1B, 2 User terminal
1C Microcomputer
3 Server
10C Voice processing device
15 Adjuster
20 Handset
21 Microphone
22 Speaker
30 Main unit
41, 42, 43 Cable
100, 100B Information processing system
100C Telephone system
110, 210, 310 Control unit
111, 211 Call application unit
112, 312 Acquisition unit
113, 313 Voice processing unit
114, 314 Voice output control unit
115 Clarification UI unit
120, 220, 320 Storage unit
130, 230 Display unit
140, 240 Input unit
150, 250 Voice input unit
160, 260 Voice output unit
170, 270, 370 Communication unit
180 AD converter
190 DA converter
311 Call relay unit
Claims
1. An information processing method comprising: an acquisition process in which at least one processor acquires audio data indicating the voice of a speaker that has been input to the speaker's user terminal among a plurality of user terminals connected for making a call; an audio processing process that generates clarified audio data by performing audio processing on the audio data to clarify the voice; and an audio output control process that controls the output of the voice indicated by the clarified audio data from the receiver's user terminal among the plurality of user terminals.
2. The information processing method according to claim 1, wherein the at least one processor performs the speech processing based on the operation of the speaker.
3. The information processing method according to claim 2, wherein at least one processor sets the degree of clarification in the speech processing based on the operation of the speaker.
4. The information processing method according to claim 2, wherein if the at least one processor has performed the speech processing based on the speaker's operation, it will not accept the receiver's operation to perform the speech processing, or will not perform the speech processing even if it accepts the receiver's operation.
5. The information processing method according to claim 1, wherein the at least one processor performs the voice processing based on the operation of the receiver.
6. The information processing method according to claim 5, wherein at least one processor sets the degree of clarification in the speech processing based on the operation of the receiver.
7. The information processing method according to claim 5, wherein if the at least one processor has performed the voice processing based on the receiver's operation, it will not accept the speaker's operation to perform the voice processing, or will not perform the voice processing even if it accepts the speaker's operation.
8. The information processing method according to claim 1, wherein, in the audio processing, the at least one processor performs the audio processing based on any setting information selected from predetermined setting information according to each of a plurality of types of audio characteristics.
9. The information processing method according to claim 1, wherein a processor in the server that relays the call performs at least the voice processing.
10. The information processing method according to claim 1, wherein the processor provided in the speaker's user terminal performs at least the speech processing.
11. The information processing method according to claim 1, wherein the processor provided in the receiver's user terminal performs at least the voice processing.
12. The information processing method according to claim 1, wherein the audio processing is a process of amplifying higher-order formant components, including at least a second-order formant component, in the audio data.
13. An information processing apparatus comprising at least one processor, wherein the at least one processor performs each of the processes included in the information processing method according to any one of claims 1 to 12.
14. A program that causes at least one processor to execute each of the processes included in the information processing method described in any one of claims 1 to 12.