Voice changing method, apparatus and device for speech dialog in live-streaming room, and medium

By detecting speech segments in the live broadcast room and using deep learning technology to match pitch and timbre features for voice changing, the problem of insufficient real-time performance and personalization in traditional technologies has been solved, achieving natural and personalized voice changing effects and improving the live broadcast interactive experience.

WO2026138260A1 · PCT designated stage · Publication Date: 2026-07-02 · GUANGZHOU FANGGUI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: WO
Patent Type: Application
Current Assignee / Owner: GUANGZHOU FANGGUI INFORMATION TECHNOLOGY CO LTD
Filing Date: 2025-11-18
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Traditional voice changing technology has shortcomings in real-time performance, natural pitch adjustment, realistic timbre reproduction, and personalized services, which affect user experience and voice changing effect. Especially in highly interactive scenarios such as live streaming, it is difficult to meet users' requirements for naturalness and similarity.

Method used

By detecting the voice segments of users speaking in the live broadcast room, the system performs pitch and voice-changing processing using preset target pitch values and target timbre features to generate voice-changing audio data that meets the user's needs and sends it to the receiving user in real time. The system also combines deep learning and audio processing technologies to extract pitch features and match timbre.

Benefits of technology

It achieves real-time, natural, and personalized voice changing, improving the naturalness and accuracy of the voice changing effect, enhancing the interactivity of the live broadcast room and the audience's sense of participation, and ensuring the continuity and fluency of voice communication.


Abstract

A voice changing method, apparatus and device for a speech dialog in a live-streaming room, and a medium, which relate to the field of network live streaming. The method comprises: in response to a speech speaking event of a speaker user in a live-streaming room, detecting and determining a speech segment in target audio data; on the basis of a preset target pitch value, performing tone-change processing on a segment pitch feature of the speech segment, so as to obtain an optimized pitch feature; on the basis of the optimized pitch feature and a target timbre feature, performing voice-change processing on the speech segment, so as to obtain a voice-changed segment; and replacing a corresponding speech segment in the target audio data with the voice-changed segment, so as to obtain voice-changed audio data, and sending the voice-changed audio data to a receiver user in the live-streaming room. The present application significantly improves the performance of voice changing technology, and solves the problems in the conventional technology in the aspects of real-time performance, personalized services, pitch adjustment naturalness, real timbre restoration, etc.

Description

Voice Changing Method, Apparatus, Device, and Medium for Voice Dialogue in a Live-Streaming Room

[0001] This application claims priority to Chinese Patent Application No. 202411954867.0, filed on December 27, 2024, entitled "Voice Changing Method and Apparatus, Device and Medium for Live Streaming Voice Dialogue", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to online live streaming technology, and more particularly to a method, apparatus, device, and medium for voice changing in live streaming conversations.

Background Technology

[0003] In modern communications and entertainment, especially in the live streaming industry, voice communication is one of the core interactive methods. Users often want to increase the fun of interaction during live streams, or for privacy reasons, they need to change their voices. However, traditional voice-changing technologies often have limitations, which are particularly evident in terms of real-time performance and personalized services.

[0004] First, traditional voice-changing technologies often struggle to balance efficiency and quality when processing real-time audio. In high-real-time scenarios like live streaming, these technologies frequently negatively impact user experience due to processing latency. Furthermore, traditional technologies also have shortcomings in timbre matching; they typically lack flexible timbre adjustment capabilities and struggle to accurately match and change timbre according to the user's personalized needs.

[0005] Secondly, many existing voice-changing technologies fail to adequately consider the natural adjustment of pitch and the realistic reproduction of timbre during the voice-changing process. This results in a significant difference between the changed voice and the target timbre or the user's expected effect, failing to meet users' requirements for naturalness and similarity. Especially in highly interactive scenarios such as live streaming, this unnaturalness reduces audience engagement and satisfaction.

[0006] Furthermore, traditional technologies often overlook users' personalized needs for sound quality when processing audio data. Users may want to adjust the pitch according to their preferences to achieve a voice-changing effect that better suits their individual characteristics. However, existing technologies often cannot provide such personalized services, limiting the improvement of user experience.

[0007] Finally, traditional voice-changing technology, in integrating timbre and pitch features, fails to effectively utilize advanced audio processing techniques, resulting in significant differences in sound quality between the changed speech and the original speech, thus failing to achieve the desired voice-changing effect. This difference is particularly pronounced in applications with high sound quality requirements, such as live streaming, affecting the practicality and popularity of voice-changing technology.

[0008] It is evident that while existing voice-changing technologies meet market demands to some extent, they still have significant shortcomings in many areas, such as real-time performance, personalized services, naturalness of pitch adjustment, and realistic timbre reproduction. These shortcomings limit the application potential of voice-changing technology in real-time interactive scenarios such as live streaming, necessitating a new technological solution to overcome these challenges and provide more natural, realistic, and personalized voice-changing services.

Summary of the Invention

[0009] The purpose of this application is to solve the above-mentioned problems by providing a method for voice changing in live streaming, as well as corresponding devices, equipment, and non-volatile readable storage media.

[0010] According to one aspect of this application, a method for voice changing in live streaming conversations is provided, comprising the following steps:

[0011] In response to voice speaking events triggered by users speaking in the live broadcast room, detect and determine the voice segment in the target audio data corresponding to the event;

[0012] Based on the preset target pitch value, the pitch features of the speech segment are subjected to pitch shifting to obtain optimized pitch features;

[0013] Based on the optimized pitch feature and the target timbre feature, the speech segment is processed to obtain a voice-changing segment. The target timbre feature is the template timbre feature of the target object obtained by matching the speaker's own timbre feature from the sound information template library.

[0014] The corresponding voice segment in the target audio data is replaced by the voice-changing segment to obtain voice-changing audio data, which is then sent to the listener in the live broadcast room.

[0015] According to another aspect of this application, a voice-changing device for live-streaming conversations is provided, comprising:

[0016] The segment analysis module is configured to respond to voice speaking events triggered by users speaking in the live broadcast room, and detect and determine the voice segments in the target audio data corresponding to the event;

[0017] The pitch shifting module is configured to perform pitch shifting on the segment pitch features of the speech segment based on a preset target pitch value to obtain optimized pitch features.

[0018] The voice-changing processing module is configured to perform voice-changing processing on the speech segment based on the optimized pitch feature and the target timbre feature to obtain a voice-changing segment. The target timbre feature is the template timbre feature of the target object obtained by matching the speaker's own timbre feature from the sound information template library.

[0019] The segment update module is configured to replace the corresponding voice segment in the target audio data with the voice-changing segment to obtain voice-changing audio data, and send the voice-changing audio data to the listener in the live broadcast room.

[0020] According to another aspect of this application, a voice-changing device for live-streaming conversations is provided, including a central processing unit and a memory, wherein the central processing unit is used to invoke and run a computer program stored in the memory to execute the steps of the voice-changing method for live-streaming conversations described in this application.

[0021] According to another aspect of this application, a non-volatile readable storage medium is provided, which stores a computer program implemented according to the live-streaming voice dialogue voice changing method in the form of computer-readable instructions. When the computer program is invoked by a computer, it executes the steps included in the method.

[0022] This application, through its technical solution, effectively solves a series of technical problems existing in traditional technologies and achieves significant technical effects, demonstrating its technical advantages, including but not limited to:

[0023] First, this application addresses the problem of insufficient real-time performance in traditional technologies by proposing a voice-changing method for live-streaming dialogues. This method can respond to voice speaking events triggered by users speaking in the live-streaming room and detect and determine voice segments in the target audio data in real time, ensuring the real-time performance of voice-changing processing and meeting the needs of real-time interactive scenarios such as live-streaming.

[0024] Secondly, this application introduces a preset target pitch value to precisely modulate the pitch characteristics of the speaking user, resulting in optimized pitch characteristics for subsequent voice-changing processing. This improvement not only enhances the naturalness and authenticity of the changed speech but also makes the voice-changing effect more in line with the user's personalized needs, thereby improving the user's interactive experience.

[0025] Furthermore, this application performs voice-changing processing on speech segments by combining optimized pitch features and target timbre features. The target timbre features are template timbre features of the target object obtained by matching the timbre features of the speaking user from a sound information template library. This not only improves the accuracy of timbre matching but also makes the changed speech closer to the target timbre and expected pitch, enhancing the naturalness and similarity of the voice-changing effect.

[0026] Furthermore, after completing the voice-changing process, this application can replace the corresponding human voice segment in the original audio data with the changed voice segment and send the changed voice audio data to the recipient in the live broadcast room in real time. This not only improves the efficiency and accuracy of voice-changing processing but also ensures the continuity and fluency of voice communication in the live broadcast room, providing a richer and more interesting experience for voice interaction in live broadcast scenarios.

Attached Figure Description

[0027] Figure 1 shows an exemplary network architecture suitable for applying the live-streaming voice-changing method of this application;

[0028] Figure 2 is a flowchart illustrating an embodiment of the voice-changing method for live-streaming dialogue in this application;

[0029] Figure 3 shows an exemplary deep learning-based network architecture proposed in this application for implementing an end-to-end voice changing process;

[0030] Figure 4 shows an exemplary voice-changing configuration interface of this application, which can be used to select the target voice characteristics of the person and set the target pitch value;

[0031] Figure 5 is a schematic diagram of the voice changing device for live streaming in this application.

[0032] Figure 6 is a structural schematic diagram of a voice-changing device for live streaming conversations used in this application.

Detailed Implementation

[0033] To facilitate understanding of the various embodiments of the live-streaming voice-changing method of this application, an exemplary network architecture is first introduced. As shown in Figure 1, the exemplary network architecture includes multiple computer devices, including multiple servers and multiple terminal devices. Through the interaction between these devices, a live-streaming service can be ensured, and information transmission is achieved by transmitting live streams, specifically including image streams and audio streams. For example, the network architecture deploys a live-streaming server 83 and a media server 85. Simultaneously, terminal devices 90 of broadcasters and terminal devices 92 of viewers are allowed to access the network architecture to communicate with the various servers.

[0034] Media server 85 is responsible for processing media data in the live streaming room, mainly the live stream, including its image stream and/or audio stream. To implement the voice changing of this application, in one embodiment, media server 85 receives an audio stream generated by a speaking user, decodes it, and applies the live streaming room voice dialogue voice changing method of this application to the corresponding target audio data in the decoded audio to obtain voice-changing audio data. Media server 85 then encodes the voice-changing audio data for each terminal device 90, 92 in the live streaming room and transmits it to each audience user, as a receiving user, for playback.

[0035] In another embodiment, terminal devices 90 and 92, acting as the broadcaster or audience user, can perform voice-changing processing on the target audio data in the audio stream generated by their terminal devices according to the live broadcast voice dialogue voice-changing method of this application to obtain the corresponding voice-changing audio data. Terminal devices 90 and 92 can acquire the target audio data in real time from their audio acquisition units, detect the speech segments within it, perform voice-changing processing, and then encode it into an audio stream, which is then sent to the live broadcast room via a media server. This allows the audio to be delivered to the recipient user, such as other audience users or broadcasters in the live broadcast room, through instant messaging, the public chat area of the live broadcast room, or comments.

[0036] The live streaming server 83 serves as a bridge connecting broadcasters and viewers to maintain the operation of the live streaming room. To this end, the live streaming server 83 is responsible for providing users in the live streaming room with the address to pull the live stream from the media server 85, as well as handling interactive functions related to the live streaming, such as public chat, bullet comments, gifts, and comments, thereby enhancing the interactivity of the live streaming and the sense of participation of the viewers.

[0037] In this network architecture, the computer program product implementing the live-streaming voice-changing method of this application can be flexibly deployed in different network nodes. For broadcasters and viewers, this program product can be integrated into the live-streaming application and deployed on their terminal devices 90 and 92, such as smartphones, tablets, personal computers, or game consoles. On terminal devices 90 and 92, the program can acquire the audio data generated by the speaking user in real time as the target audio data to be changed, demonstrating its real-time advantage.

[0038] Furthermore, this computer program product can also be directly deployed on the media server 85 for centralized processing of audio streams in the live broadcast. On the media server 85, the program can process a large number of audio streams from different speaking users, perform voice-changing processing on the audio stream of each speaking user, and send it to the terminal devices 92 of each viewer user in the corresponding live broadcast room, so that the corresponding listener user can play the corresponding voice-changing audio data.

[0039] Based on the above description of the network architecture, it can be seen that the method of this application can be implemented on terminal devices 90, 92 or media server 85, providing high flexibility and broad application potential.

[0040] Referring to Figure 2, the voice-changing method for live-streaming dialogue provided in this application can be implemented as a computer program product that is installed and run on computer devices such as terminal devices or servers. It processes the target audio data generated by the speaking user, changes the voice in it, and then sends the result to the live-streaming room for other listeners to play. In some embodiments, the method includes the following steps:

[0041] Step S3100: Respond to the voice speaking event triggered by the user speaking in the live broadcast room, and detect and determine the voice segment in the target audio data corresponding to the event;

[0042] In a live stream, viewers or hosts can act as speakers, sending voice messages to other viewers and / or hosts via the public chat area, comment section, or real-time chat controls. This triggers specific voice message events. For example, when a user wants to speak during the live stream, they can click the voice message control on the live stream interface; this action triggers a voice message event. The speaker's computer device then begins collecting audio data, which is used as the target audio data. Following the inherent business logic of the live stream, after the speaker finishes speaking, this target audio data is sent back to the live stream for other users to listen to.

[0043] During this process, the computer device can acquire the target audio data in real time, or acquire the data when the user finishes speaking, for subsequent processing in this application. To determine the human speech segments contained in the target audio data, various techniques can be employed for detection. These techniques include, but are not limited to, audio signal processing techniques, such as Fourier transform, to identify speech components in the audio, or the use of speech activity detection algorithms to determine segments in the audio containing human voices.
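As an illustration of the speech activity detection mentioned above, the following is a minimal sketch of a short-time energy detector, not the detector used in this application. The function name `detect_speech_segments`, the 20 ms frame length, the RMS threshold, and the minimum run length are all assumptions chosen for the example; a practical detector would typically adapt the threshold to the noise floor or use a trained model.

```python
import math

def detect_speech_segments(samples, sample_rate, frame_ms=20, threshold=0.02, min_frames=3):
    """Return (start, end) sample indices for runs of frames whose RMS energy
    reaches the threshold. All parameter values are illustrative assumptions."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    run_start = None  # index of the first frame in the current high-energy run
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run just ended; keep it only if it is long enough.
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None
    # Close a run that extends to the end of the audio.
    if run_start is not None and n_frames - run_start >= min_frames:
        segments.append((run_start * frame_len, n_frames * frame_len))
    return segments

# Demo: 0.1 s of silence, 0.2 s of a 200 Hz tone standing in for speech, 0.1 s of silence.
sr = 8000
audio = ([0.0] * 800
         + [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(1600)]
         + [0.0] * 800)
segments = detect_speech_segments(audio, sr)
print(segments)
```

Only the returned sample ranges would need to be passed to the later pitch and timbre steps, which is how skipping non-speech audio reduces processing cost.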

[0044] In some embodiments, noise reduction preprocessing can be performed on the target audio data before speech segment detection to improve detection accuracy. Noise reduction can be achieved through various algorithms, such as Wiener filters, spectral subtraction, or deep learning-driven noise reduction techniques. These noise reduction techniques can reduce the impact of background noise on speech detection, thereby improving the accuracy and efficiency of subsequent speech processing steps.

[0045] In one embodiment, specifically for noise reduction, a Wiener filter is selected. This is a linear filter based on statistical methods that reduces noise by minimizing the mean square error between the original signal and the estimated signal. In some embodiments, spectral subtraction can also be used. Spectral subtraction is a simpler method that reduces noise by estimating the power spectrum of the noise and subtracting it from the power spectrum of the signal. Other embodiments may employ deep learning-based noise reduction techniques, which train neural networks to identify and remove noise. This approach is adaptable to various complex noise environments and can provide better noise reduction results.
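The spectral subtraction described above can be sketched on a single frame as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, a naive DFT is used to keep the example dependency-free, the noise power spectrum is taken from a known noise-only frame, and the noisy phase is reused unchanged. A deterministic tone stands in for the noise so the result is easy to check; with real random noise the estimate is an average and some residual noise remains.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short demo frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtract(noisy_frame, noise_power):
    """Subtract the estimated noise power from each bin, floor at zero,
    and rebuild the frame using the phase of the noisy signal."""
    cleaned = []
    for value, n_pow in zip(dft(noisy_frame), noise_power):
        power = max(abs(value) ** 2 - n_pow, 0.0)
        cleaned.append(cmath.rect(math.sqrt(power), cmath.phase(value)))
    return idft(cleaned)

# Demo: a tone in bin 4 (the "speech") plus a weaker tone in bin 10 (the "noise").
n = 64
clean = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
noise = [0.2 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
noise_power = [abs(v) ** 2 for v in dft(noise)]  # estimated from a noise-only frame
noisy = [c + w for c, w in zip(clean, noise)]
denoised = spectral_subtract(noisy, noise_power)
max_error = max(abs(d - c) for d, c in zip(denoised, clean))
print(max_error)
```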

[0046] Through these technologies, computer equipment can effectively detect and extract speech segments from target audio data, laying the foundation for subsequent voice-changing processing. This processing workflow allows the speaker's voice to be presented more clearly and naturally in live broadcasts, enhancing the interactivity and audience engagement. Furthermore, by avoiding non-speech segments in the target audio data, it can save on the system overhead of computer equipment, making it more suitable for deployment on terminal devices.

[0047] Step S3200: According to the preset target pitch value, the pitch feature of the speech segment is subjected to pitch shifting processing to obtain the optimized pitch feature;

[0048] By altering the pitch of a speaker's voice segment and adjusting the speaker's pitch characteristics to match a preset target pitch value, a more natural and user-friendly voice-changing effect can be achieved.

[0049] The preset target pitch value can be set in several ways. In one embodiment, users can directly input their desired pitch value through the graphical user interface of the live broadcast room. This pitch value can be an absolute frequency value, such as 440Hz representing the A note in music, or a relative value, such as a semitone higher or lower than the original pitch. In another embodiment, users can select preset pitch values from a sound information template library corresponding to the target object selected by the user, so as to reflect the tonal characteristics of the voice of a specific person that the user wishes to imitate.
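The relationship between an absolute target pitch and a relative offset in semitones can be made concrete with a short sketch. It assumes twelve-tone equal temperament, in which one semitone corresponds to a frequency ratio of 2^(1/12); the function names are hypothetical and are not taken from this application.

```python
import math

def shift_by_semitones(frequency_hz, semitones):
    """Convert a relative offset in semitones into an absolute frequency
    (assumes twelve-tone equal temperament)."""
    return frequency_hz * 2 ** (semitones / 12)

def semitones_between(source_hz, target_hz):
    """Express the interval from source to target as a number of semitones."""
    return 12 * math.log2(target_hz / source_hz)

def pitch_ratio(source_hz, target_hz):
    """Scaling factor a pitch shifter would apply to reach the target pitch."""
    return target_hz / source_hz

print(shift_by_semitones(440.0, 1))      # one semitone above A at 440 Hz
print(semitones_between(220.0, 440.0))   # one octave expressed in semitones
print(pitch_ratio(220.0, 440.0))
```

Either form of input can therefore be reduced to a single ratio before the pitch-shifting step.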

[0050] Extracting corresponding pitch features from speech segments is fundamental to pitch shifting. Pitch feature extraction can be achieved through various audio signal processing techniques. One approach is to use Fourier transform to convert the audio signal from the time domain to the frequency domain, and then determine the fundamental frequency (pitch) by searching for energy concentration regions in the spectrum, thus obtaining the segment pitch feature. Another approach is to use autocorrelation, estimating periodicity by measuring the correlation between the signal and its delayed version, thereby obtaining the segment pitch feature. Furthermore, in other embodiments, deep learning methods can also be used for pitch feature extraction, training neural networks to recognize pitch features in audio, and then using them to extract the corresponding segment pitch features for speech segments.
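The autocorrelation approach can be illustrated with a minimal sketch. The function name `estimate_pitch_autocorr` and the 60 to 500 Hz search range are assumptions made for the example. The sketch picks the lag with the largest unnormalized autocorrelation and does no voicing decision or peak interpolation, so it should be read as a demonstration of the principle rather than a production pitch tracker.

```python
import math

def estimate_pitch_autocorr(frame, sample_rate, fmin=60.0, fmax=500.0):
    """Estimate the fundamental frequency of a frame from the lag that
    maximizes the correlation between the frame and its delayed copy.
    Returns None when no positive correlation is found (e.g. silence)."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Demo: a 0.1 s frame of a 200 Hz tone sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
pitch = estimate_pitch_autocorr(tone, sr)
print(pitch)
```

Because the resolution is limited to whole-sample lags, the estimate is exact here only because the period of the demo tone is a whole number of samples.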

[0051] After extracting pitch features, the pitch features of the speech segment can be transposed to match the segment pitch features with a preset target pitch value. The key to this process is adjusting the frequency components of the audio signal to achieve precise pitch adjustment while preserving other characteristics of the speech.

[0052] In one embodiment, the PSOLA (Pitch-Synchronous Overlap-Add) algorithm can be used to perform pitch conversion on the segment pitch features. This algorithm can adjust the pitch without significantly changing the duration of the speech. PSOLA shifts the pitch up or down by cutting the original audio signal into pitch-synchronous segments, repositioning them, and overlap-adding them, thereby obtaining an audio signal that matches the target pitch value.

[0053] Another embodiment is based on a phase-modulation method, which adjusts the pitch characteristics of a speech segment by directly manipulating its phase information. Phase-modulation technology can adjust pitch while maintaining the naturalness and clarity of the speech signal, avoiding distortion or sound quality degradation caused by pitch changes.

[0054] In another embodiment, a deep learning-based pitch-shifting technique can be employed, particularly utilizing the WaveNet model. WaveNet is a technique that uses deep convolutional neural networks to generate audio waveforms. It can generate high-quality pitch-shifted speech based on the target pitch value, obtaining corresponding optimized pitch features. By learning from a large amount of audio data, the WaveNet model can generate pitch-shifted speech with high naturalness and clarity, while accurately preserving the intonation and emotional features of the original speech.

[0055] Through the various embodiments described above, pitch shifting of speech segments can be achieved, yielding optimized pitch features, that is, adjusted pitch features that match the target pitch value. This not only improves the realism and naturalness of the voice change but also enhances the interactivity of live streaming and audience participation. By precisely controlling the pitch, computer equipment can generate voice-changing effects that better meet user expectations.

[0056] Step S3300: Based on the optimized pitch feature and the target timbre feature, perform voice-changing processing on the speech segment to obtain a voice-changing segment. The target timbre feature is the template timbre feature of the target object obtained by matching the speaker's own timbre feature from the sound information template library.

[0057] After the pitch adjustment is completed, the speech segment can undergo voice-changing processing. To do this, the adjusted pitch features are combined with the target timbre features to generate voice-changed audio that differs from the original speech segment yet sounds natural and realistic.

[0058] Voice changing requires combining optimized pitch features with target timbre features. Target timbre features can be template timbre features extracted from a sound information template library that match the speaker's timbre. Various algorithms, such as cosine similarity or Euclidean distance, can be used to determine the template closest to the speaker's timbre from the sound information template library.
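The template lookup by cosine similarity can be sketched as follows. The three-dimensional vectors, the template names, and the dictionary layout of the library are purely illustrative assumptions; real timbre features would be much higher-dimensional embeddings, and this application does not specify their format.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_template(speaker_feature, template_library):
    """Return the name of the template whose timbre feature is most similar
    to the speaker's own timbre feature."""
    return max(template_library,
               key=lambda name: cosine_similarity(speaker_feature, template_library[name]))

# Hypothetical template library: name -> template timbre feature.
library = {
    "template_a": [0.9, 0.1, 0.0],
    "template_b": [0.1, 0.8, 0.3],
    "template_c": [0.0, 0.2, 0.9],
}
speaker = [0.2, 0.7, 0.4]
best = match_template(speaker, library)
print(best)
```

Swapping in Euclidean distance would only require replacing the scoring function and selecting the minimum instead of the maximum.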

[0059] After obtaining the target timbre features, computer devices use these features to perform voice-changing processing on speech segments. This can be achieved through various techniques. One embodiment uses traditional signal processing techniques, such as spectral transformation methods, to adjust the spectral envelope of the speech segment to match the spectral characteristics of the target timbre, thus obtaining a voice-changing segment. Another embodiment utilizes HMM (Hidden Markov Model)-based speech synthesis technology, which generates voice-changing segments that match the target timbre by modeling the statistical properties of timbre and pitch features. Yet another embodiment employs deep learning techniques, especially neural network-based methods, providing a more advanced solution for voice-changing processing. For example, the WaveNet model can directly learn the distribution of the original audio waveform and generate high-quality voice-changing segments. Furthermore, GAN (Generative Adversarial Network)-based techniques can be used to generate realistic voice-changing segments, where a generator network is responsible for producing the voice-changing segments, while a discriminator network ensures that the generated voice-changing segments are indistinguishable from real audio.

[0060] When performing voice-changing processing, feature fusion technology can also be used to jointly encode the optimized pitch features, target timbre features, and other speech features such as the semantic content of speech segments to form voice-changing audio features. These features can then be used to decode and generate corresponding voice-changing segments. This encoding process can be implemented using deep neural networks, whose structures can include convolutional layers, recurrent layers, or fully connected layers to extract and fuse multiple speech features.

[0061] After encoding, the computer device will decode the voice-changing audio features to generate the final voice-changing segment. This decoding process can also be achieved through neural networks, particularly using a sequence-to-sequence (Seq2Seq) model to convert the encoded voice-changing audio features back into an audio waveform.

[0062] Accordingly, computer equipment can accurately process the voice segments of the speaker, generating natural and realistic voice-changed audio that matches both the target timbre and the adjusted pitch. This voice-changing process not only enhances the fun and interactivity of the audio but also increases the entertainment value of the live stream and audience participation.

[0063] Step S3400: Replace the corresponding voice segment in the target audio data with the voice-changing segment to obtain voice-changing audio data, and send the voice-changing audio data to the listener in the live broadcast room.

[0064] Voice-changing clips are generated to simulate specific timbre and pitch to suit the needs of the speaking user or provide entertainment. The generated voice-changing clips need to be temporally aligned with the speech segments in the target audio data to ensure coherence and synchronization during replacement. To this end, computer devices can use audio editing techniques, such as timestamp matching and audio alignment algorithms, to ensure that the voice-changing clips accurately replace the corresponding speech segments.

[0065] The replacement operation can be implemented programmatically, where the computer device uses a custom script to process the target audio data. Specifically, the computer device first locates the speech segment in the target audio data and replaces it with the voice-changing segment.

[0066] To ensure the replaced audio data sounds natural and smooth, computer equipment can also perform post-processing, such as smoothing the transitions between audio segments and reducing potential audio distortion and noise. This can be achieved using crossfade technology, which gradually increases the volume of one segment while gradually decreasing the volume of the other at the transition point between two audio segments, thus achieving a smooth transition.
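A minimal sketch of segment replacement with a linear crossfade at both edges is given below. It assumes the voice-changing segment has already been time-aligned and has exactly the same number of samples as the speech segment it replaces; the function name `replace_with_crossfade` and the fade length are hypothetical choices for illustration.

```python
def replace_with_crossfade(target, start, end, replacement, fade_len=4):
    """Replace target[start:end] with replacement. Near each edge the gain of
    the replacement ramps linearly while the gain of the original ramps the
    opposite way, so the two signals blend instead of switching abruptly."""
    if len(replacement) != end - start:
        raise ValueError("replacement must be the same length as the replaced span")
    fade_len = min(fade_len, (end - start) // 2)
    out = list(target)
    last = len(replacement)
    for i, value in enumerate(replacement):
        if i < fade_len:
            weight = (i + 1) / (fade_len + 1)    # fade the replacement in
        elif i >= last - fade_len:
            weight = (last - i) / (fade_len + 1)  # fade the replacement out
        else:
            weight = 1.0
        out[start + i] = weight * value + (1.0 - weight) * target[start + i]
    return out

# Demo: replace samples 2..9 of a silent signal with a constant segment.
original = [0.0] * 12
changed = [1.0] * 8
mixed = replace_with_crossfade(original, 2, 10, changed, fade_len=3)
print(mixed)
```

Samples outside the replaced span are left untouched, so non-speech audio passes through unchanged.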

[0067] In addition, the computer device replaces the corresponding voice segment with the voice-changing segment to obtain voice-changing audio data, which is then sent to the listener in the live broadcast room.

[0068] After receiving the voice-changing audio data, the listeners can play and hear it in the live stream. This voice-changing audio data is transmitted via the live stream's audio stream, ensuring that all viewers receive the same audio content simultaneously. Listeners experience richer and more engaging interactions when hearing the voice-changing audio, as the voice-changing effect adds a new dimension to the live stream. For example, the speaker's voice might sound like a completely different person, increasing the live stream's mystery and entertainment value.

[0069] Furthermore, listeners can react to the voice-changing effect through interactive features in the live stream, such as bullet comments or real-time messages. This feedback allows them to evaluate the voice-changing effect and further engage with the live stream. This real-time feedback mechanism not only enhances user participation but also provides speakers with a basis for adjusting voice-changing parameters, enabling the effect to be optimized according to audience preferences. In this way, audio communication in the live stream becomes more dynamic and personalized, improving the overall live stream experience.

[0070] As can be seen from the above embodiments, this application effectively solves a series of technical problems existing in traditional technologies and achieves significant technical effects, demonstrating its technical advantages, including but not limited to:

[0071] First, this application addresses the problem of insufficient real-time performance in traditional technologies by proposing a voice-changing method for live-streaming dialogues. This method can respond to voice speaking events triggered by users speaking in the live-streaming room and detect and determine voice segments in the target audio data in real time, ensuring the real-time performance of voice-changing processing and meeting the needs of real-time interactive scenarios such as live-streaming.

[0072] Secondly, this application introduces a preset target pitch value to precisely modulate the pitch characteristics of the speaking user, resulting in optimized pitch characteristics for subsequent voice-changing processing. This improvement not only enhances the naturalness and authenticity of the changed speech but also makes the voice-changing effect more in line with the user's personalized needs, thereby improving the user's interactive experience.

[0073] Furthermore, this application performs voice-changing processing on speech segments by combining optimized pitch features and target timbre features. The target timbre features are template timbre features of the target object obtained by matching the timbre features of the speaking user from a sound information template library. This not only improves the accuracy of timbre matching but also makes the changed speech closer to the target timbre and expected pitch, enhancing the naturalness and similarity of the voice-changing effect.

[0074] Furthermore, after completing the voice-changing process, this application can replace the corresponding human voice segment in the original audio data with the changed voice segment and send the changed voice audio data to the recipient in the live broadcast room in real time. This not only improves the efficiency and accuracy of voice-changing processing but also ensures the continuity and fluency of voice communication in the live broadcast room, providing a richer and more interesting experience for voice interaction in live broadcast scenarios.

[0075] Based on any embodiment of this application, end-to-end pitch shifting and voice changing can be achieved using a pre-built deep learning model network architecture. As shown in Figure 3, the network architecture includes a content encoder, a pitch extraction model, a joint encoder, and a generator. The content encoder extracts deep semantic features corresponding to the spoken content of a speech segment to obtain content semantic features. The pitch extraction model extracts the pitch features of the speech segment and shifts the pitch according to the target pitch value to obtain optimized pitch features. The content semantic features, optimized pitch features, and target timbre features are jointly encoded by the joint encoder and then sent to the generator for decoding to generate corresponding audio data as a voice-changing segment.
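
The data flow between the four components can be sketched as follows. This is only a wiring illustration under stated assumptions: each component is represented by an arbitrary callable standing in for a trained model, features are plain float sequences, and the names `VoiceChangeNetwork` and `convert` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass
class VoiceChangeNetwork:
    # Each field is a stand-in for a trained model (assumption for illustration).
    content_encoder: Callable[[Vector], Vector]
    pitch_model: Callable[[Vector, float], Vector]
    joint_encoder: Callable[[Vector, Vector, Vector], Vector]
    generator: Callable[[Vector], Vector]

    def convert(self, segment, target_pitch, target_timbre):
        # Content semantic features of the speech segment.
        content = self.content_encoder(segment)
        # Pitch features extracted and shifted towards the target pitch value.
        pitch = self.pitch_model(segment, target_pitch)
        # Joint encoding of content, optimized pitch and target timbre.
        embedding = self.joint_encoder(content, pitch, target_timbre)
        # Decoding into the voice-changing segment.
        return self.generator(embedding)
```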

[0076] Both the content encoder and the pitch extraction model can be implemented using convolutional layers. These convolutional layers can capture corresponding local features in the audio signal and extract deeper feature representations through stacking. Therefore, these convolutional layers can constitute a content encoder that encodes the content of a speech segment and a pitch extraction model that extracts the pitch features of the speech segment. The content encoder can also be implemented using existing models such as Whisper, HuBERT, and ContentVec.

[0077] For the pitch adjustment process in step S3200, the pitch extraction model in this network architecture can be used to extract the pitch features of the speech segment, and then these pitch features can be adjusted according to the preset target pitch value. This pitch extraction model can be regarded as a pitch conversion network, which can adjust the fundamental frequency of the speech segment according to the target pitch value to obtain optimized pitch features, realize the upward or downward shift of pitch, and thus obtain the segment pitch features that match the target pitch value.

[0078] In the voice-changing process of step S3300, the network architecture combines the content semantic features, the adjusted pitch features, and the target timbre features. These are then encoded into an embedding vector by a joint encoder, which is input into the generator of the network architecture. The generator generates voice-changing segments based on this embedding vector. By learning from a large amount of audio data, the generator can generate voice-changing audio with high naturalness and clarity, while accurately preserving the intonation and emotional features of the original speech.

[0079] This network architecture can be trained beforehand using a generative adversarial approach, where a generator produces altered voice clips, while a discriminator evaluates the differences between the generated audio and the real audio. In this way, the generator can learn to generate high-quality altered voice clips that are difficult to distinguish from real audio.

[0080] The voice-changing segments output by the network architecture obtained in this embodiment can be used to replace corresponding speech segments in the target audio data. The advantage of this network architecture lies in its ability to generate high-quality voice-changing segments end-to-end while maintaining the naturalness and clarity of the audio, providing users with a seamless and natural voice-changing experience. In this way, the network architecture not only improves the realism and naturalness of the voice changing but also enhances the interactivity of the live stream and the audience's sense of participation. Its efficiency advantage is even more significant when this network architecture is used to process the voice segments of a massive number of users speaking on a live streaming platform.

[0081] Based on any embodiment of the method in this application, in response to a voice speaking event triggered by a speaking user in a live broadcast room, detecting and determining a voice segment in the target audio data corresponding to the event includes:

[0082] Step S3110: Respond to the voice speaking event triggered by the speaking user in the live broadcast room, and obtain the target audio data corresponding to the real-time speech of the speaking user;

[0083] In a live broadcast, when a user begins to speak, a specific voice event is triggered. The computer device responds to this event and acquires the target audio data corresponding to the user's real-time speech. In the computer device acting as the terminal, real-time acquisition of audio data is achieved through a microphone attached to the user's device. The sound signal captured by the microphone is then converted into digital form so that the computer device can process it further.

[0084] Audio data can be acquired in various ways, such as by using the chat button on the live streaming interface. Once the user activates this button, the computer device's audio acquisition unit, i.e., the microphone, begins to work, capturing the user's voice in real time. This acquired raw audio data includes the user's voice and any ambient noise, which forms the basis for subsequent processing steps.

[0085] During audio acquisition, computer equipment records information such as the amplitude and frequency of the audio signal. This information is then used to identify the human voice portion of the audio. Acquiring audio data requires not only capturing clear sound but also ensuring the integrity and accuracy of the data so that human voice segments can be accurately identified and extracted in subsequent processing steps.

[0086] Acquired audio data typically exists in the form of audio waveforms, which contain rich information such as pitch, volume, and timbre of the speech. To ensure that this data can be accurately analyzed and processed, computer equipment uses specific algorithms to process these waveforms in order to extract useful information. These algorithms may include signal processing techniques such as Fourier transform, which can convert audio signals from the time domain to the frequency domain, making it easier to identify and analyze features in the audio.
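
As an illustration of the time-domain to frequency-domain conversion mentioned above, the following sketch computes a direct discrete Fourier transform and reports the strongest frequency in a short frame. It is written for clarity rather than speed (a practical system would use an FFT), and the function names `dft_magnitudes` and `dominant_frequency` are hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)
```

The frequency resolution of this estimate is `sample_rate / len(samples)`, so longer frames give finer resolution at the cost of coarser time localization.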

[0087] Through the above process, the computer equipment can accurately acquire the real-time audio data of the speaker as the target audio data, providing a foundation for subsequent audio noise reduction and human voice activity detection. This implementation method enables the speaker's voice to be presented more clearly and naturally in live broadcasts, while also laying a solid foundation for achieving high-quality voice changing processing.

[0088] Step S3120: Perform audio noise reduction processing on the target audio data to obtain low-noise frequency data with noise suppressed;

[0089] After acquiring the target audio data corresponding to the real-time speech of the user, audio denoising processing can be performed to remove or reduce background noise in the target audio data, and more accurately identify and extract speech segments with vocal activity. Audio denoising processing is achieved by applying specific algorithms to the target audio data. These algorithms can identify and suppress unwanted noise components while preserving the core features of the speech signal.

[0090] As revealed above, audio noise reduction techniques can be implemented in various ways, including but not limited to Wiener filters, spectral subtraction, and deep learning-driven noise reduction techniques. When performing audio noise reduction, computer equipment analyzes the spectrum of the target audio data, identifies the characteristics of noise and speech, and adjusts the signal accordingly. For example, deep learning models can be trained to identify specific types of noise, such as fan noise, traffic noise, or background conversations, and remove them from the audio data. Such processing not only improves the accuracy of speech detection but also provides a clearer audio input for subsequent voice-changing processing.
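
Of the techniques listed, spectral subtraction is the simplest to sketch. The example below assumes the audio has already been converted into per-frame magnitude spectra and that a few frames known to contain only noise are available for estimating the noise spectrum; both assumptions, and the names `estimate_noise` and `spectral_subtract`, are illustrative and not taken from this application. Resynthesis of a waveform (typically by reusing the noisy phase) is omitted.

```python
def estimate_noise(noise_frames):
    """Average magnitude per frequency bin over frames assumed to be speech-free."""
    count = len(noise_frames)
    bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / count for k in range(bins)]


def spectral_subtract(frame_mags, noise_mags, floor=0.0):
    """Subtract the noise estimate from one frame, clamping each bin at `floor`."""
    if len(frame_mags) != len(noise_mags):
        raise ValueError("frame and noise spectra must have the same length")
    return [max(m - n, floor) for m, n in zip(frame_mags, noise_mags)]
```

Clamping at a floor prevents negative magnitudes; a small positive floor is often chosen in practice to reduce the "musical noise" artifact that hard zeroing can produce.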

[0091] Through audio noise reduction processing, computer equipment can obtain low-noise frequency data with suppressed noise, providing high-quality audio signals for subsequent human voice activity detection and voice changing processing.

[0092] Step S3130: Detect human voice activity based on the low-noise frequency data to identify speech segments containing human voices.

[0093] After audio noise reduction, the computer device performs voice activity detection to accurately identify segments containing human voices from the low-noise frequency data. Voice activity detection is a crucial step in audio processing; it uses sound characteristics to distinguish between voice and non-voice portions, ensuring that voice-changing processing is applied only to the actual speech content.

[0094] Human voice activity detection can be achieved through various algorithms. For example, a common method is to use features based on the Short-Time Fourier Transform (STFT), which converts time-series signals into a time-frequency representation, making it easier to identify the frequency range of human voices. Building upon STFT, energy threshold detection can be applied, where an energy level threshold is set, and when the energy of an audio signal within a specific frequency range exceeds this threshold, it is considered human voice activity.
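
Energy threshold detection can be sketched as below. For brevity this version measures the mean energy of each frame directly in the time domain rather than within an STFT frequency band, which is a simplification of the approach described above; the function name `detect_voice_segments`, the fixed non-overlapping frames, and the threshold value are all assumptions for illustration.

```python
def detect_voice_segments(samples, frame_len, threshold):
    """Return (start, end) sample-index pairs covering consecutive frames
    whose mean energy exceeds `threshold`. `end` is exclusive."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        active = energy > threshold
        if active and start is None:
            start = i * frame_len                     # a voiced run begins
        elif not active and start is not None:
            segments.append((start, i * frame_len))   # the run ends
            start = None
    if start is not None:
        # The signal ended while a run was still active.
        segments.append((start, n_frames * frame_len))
    return segments
```

The returned index pairs can serve as the timestamps used later to locate and replace the speech segment in the target audio data.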

[0095] Another approach is to use deep learning-based algorithms, such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), which are capable of learning complex patterns in human voice and taking into account the temporal continuity of the speech signal during detection. Deep learning methods typically require large amounts of labeled data to train the model to ensure its accuracy and robustness.

[0096] Another implementation uses Mel-frequency cepstral coefficients (MFCCs) as features, a widely used feature extraction method for speech recognition that captures key spectral features of human voice signals. By analyzing the changes in MFCC features over time, computer devices can distinguish human voices from background noise, thereby accurately locating segments of human voice.

[0097] In practical applications, computer equipment can combine various technologies to improve the accuracy of human voice activity detection. For example, energy threshold detection can be used first to quickly locate possible human voice regions, and then a deep learning-based model can be used for more refined recognition to ensure the accuracy of the detection results.

[0098] By executing all the steps of the above embodiments, more efficient and accurate speech processing can be achieved compared to other embodiments, thereby significantly improving the performance of voice changing technology. This embodiment, through the combination of its various steps, not only enhances the naturalness and clarity of the speech but also ensures that the changed audio data truly reflects the speaker's original speech characteristics, while avoiding unnecessary noise interference, providing users with a more natural, realistic, and personalized voice changing service. Furthermore, this comprehensive technical solution can adapt to various complex live streaming environments, providing stable and reliable voice changing effects whether in noisy backgrounds or in scenarios requiring high privacy protection, greatly enhancing the interactivity of the live stream and the audience's sense of participation.

[0099] Based on any embodiment of the method in this application, the segment pitch features of the speech segment are subjected to pitch shifting processing according to a preset target pitch value to obtain optimized pitch features, including:

[0100] Step S3210: Determine whether the speaking user has preset a pitch value. If the user has preset a pitch value, use that pitch value as the target pitch value.

[0101] Users can access the voice changer configuration interface through the voice changer configuration control in the live broadcast room, where they can set their desired pitch value to be used as the target pitch value. Whether a user has preset this pitch value can be indicated by a status flag. When the user sets a pitch value, the status flag directly stores the user's preset pitch value. When the user has not set a pitch value or has it marked as the default state, the user's desired target pitch value is considered to be the preset pitch value corresponding to the target object of their determined target timbre feature. This pitch value is usually mapped to and stored in the sound information template library along with the target timbre feature.

[0102] Accordingly, the computer device can determine whether the user has preset a pitch value by reading the value of the status flag. If it recognizes that the user has manually preset a pitch value, it will use that pitch value as the target pitch value. Otherwise, it will continue to step S3220 to determine the target pitch value.

[0103] Step S3220: When the speaking user has not preset a pitch value, use as the target pitch value the preset pitch value corresponding to the target object that the speaking user selected from the character objects in the sound information template library.

[0104] When a user does not preset a pitch value, the computer will automatically retrieve and use the preset pitch value from the sound information template library. The sound information template library is a database that stores various timbre features and their corresponding pitch values. This data can be based on statistical results from a large number of samples or customized for a specific individual. In this library, each timbre feature has its corresponding pitch value, which reflects the tonal characteristics of different timbres.

[0105] The computer equipment has pre-determined the target timbre features corresponding to the target object that the user wishes to imitate. Specifically, the user has selected a target object through a pre-provided timbre sample recommendation list. This target object can be a preset person, role, or other voice that the user wishes to imitate. Therefore, based on the identity of the target object, the preset pitch value associated with that target object can be retrieved from the sound information template library.
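
The decision made in steps S3210 and S3220 can be sketched as a simple fallback. The sketch assumes the status flag is modeled as an optional value (`None` meaning "not set or default"), that pitch values are expressed in Hz, and that the template library is a dictionary keyed by a person identifier; `TemplateEntry` and `resolve_target_pitch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class TemplateEntry:
    person_id: str
    timbre: Sequence[float]     # template timbre features
    preset_pitch_hz: float      # preset pitch value stored with the timbre


def resolve_target_pitch(user_pitch_hz: Optional[float],
                         target_id: str,
                         library: Dict[str, TemplateEntry]) -> float:
    """Prefer the user's preset pitch; otherwise fall back to the preset
    pitch value of the selected target object."""
    if user_pitch_hz is not None:
        return user_pitch_hz
    return library[target_id].preset_pitch_hz
```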

[0106] Step S3230: Perform pitch shifting processing on the segment pitch features of the speech segment according to the target pitch value, so that the segment pitch features reach the level of the target pitch value.

[0107] Pitch shifting of a speaker's speech segments allows the pitch features of the speech segments to be matched with a target pitch value. As previously explained, pitch shifting can be implemented using various deep learning models, such as the pitch extraction model based on convolutional neural networks. This model analyzes the spectral characteristics of the speech segments and identifies the fundamental frequency components in the speech signal, which determine the pitch features. By adjusting these fundamental frequency components, the model can change the pitch of the original speech, shifting it upwards or downwards to the target pitch value.
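
The effect of shifting the fundamental frequency towards a target can be illustrated at the feature level. The sketch below is a signal-level stand-in and not the convolutional model described above. It assumes the segment pitch features are a per-frame fundamental frequency contour in Hz with 0 marking unvoiced frames, and it interprets the target pitch value as the desired average pitch of the segment; this application does not define the target pitch value that precisely, so both points are assumptions.

```python
import math


def shift_f0_contour(f0_hz, target_hz):
    """Scale a per-frame F0 contour so that the geometric mean of its
    voiced frames equals `target_hz`. Unvoiced frames (0) stay at 0."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return list(f0_hz)
    log_mean = sum(math.log(f) for f in voiced) / len(voiced)
    ratio = target_hz / math.exp(log_mean)
    # A constant ratio moves the whole contour up or down while keeping
    # the relative rises and falls (the intonation shape) intact.
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]
```

Scaling by a ratio rather than adding an offset keeps musical intervals between frames unchanged, which is one reason the mean is taken in the log domain here.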

[0108] As disclosed earlier, in pitch shifting, computer equipment can also apply a range of other algorithms to finely adjust the pitch of speech segments. These algorithms include traditional signal processing techniques, such as the PSOLA (Pitch Synchronous Overlap and Add) algorithm, which adjusts the pitch by re-spacing pitch-synchronous segments of the speech signal in the time domain without changing the duration of the speech. Additionally, phase estimation-based methods can be used, which modify the phase spectrum of the speech signal to change the pitch while maintaining the naturalness of the speech.

[0109] This embodiment enables intelligent adjustment of the pitch characteristics of a user's speech segment, significantly improving the performance of voice changing technology, especially in meeting users' personalized needs and providing a natural voice changing experience.

[0110] First, this embodiment allows users to directly set the desired pitch value through the voice-changing configuration interface. This direct user input enables the voice-changing process to accurately reflect the user's personalized needs, enhancing the user's sense of participation and control. When the user sets a pitch value, the system can immediately respond and adopt that value as the target pitch value, thereby ensuring that the voice-changing result is consistent with the user's expectations.

[0111] For users who haven't manually set pitch values, the system automatically retrieves and uses preset pitch values from the sound information template library for the target object. This intelligent processing provides convenience, especially when users want to imitate a specific person or timbre. This not only simplifies the user's workflow but also ensures the naturalness and accuracy of the voice-changing process, enabling high-quality voice-changing effects even when the user hasn't explicitly set pitch values.

[0112] As can be seen, this embodiment intelligently considers the user's personalized needs for pitch values and provides the target object's preset pitch value when the user does not manually set the pitch value, thus realizing a flexible and accurate pitch adjustment scheme, which significantly improves the practicality of voice changing technology and user experience.

[0113] Based on any embodiment of the method in this application, before performing voice-changing processing on the speech segment based on the optimized pitch features and target timbre features, the method includes:

[0114] Step S1100: Obtain the speaking voice data of the speaking user, and extract the self-voice timbre features of the speaking user based on the speaking voice data. The speaking voice data is the target audio data generated in real time, or the historical audio data pre-recorded by the speaking user.

[0115] Acquiring the speaker's voice data can be approached in two different ways: First, processing real-time generated audio data, i.e., the target audio data in the audio stream generated during a live broadcast. This type of target audio data makes it easier to extract the speaker's unique vocal characteristics in real time. Second, processing pre-recorded audio data from the speaker, which is typically generated by requiring the user to read a pre-set text. Therefore, this type of audio data usually provides richer semantics for more accurate and efficient extraction of the speaker's unique vocal characteristics. Whether using real-time or historical audio, the goal is to extract the speaker's unique vocal characteristics, obtaining their individual vocal features—parameters in the speech signal that represent the individual's vocal characteristics.

[0116] Timbre features can be extracted using audio signal processing techniques. The computer device analyzes the speaker's voice data using specific algorithms. First, a Fourier transform converts the audio data from the time domain to the frequency domain, revealing the frequency components of the audio signal, from which Mel-frequency cepstral coefficients (MFCCs) can be determined. Then, key spectral features in the audio signal are identified; these features characterize the speaker's timbre and constitute the timbre features. In practice, a timbre feature extraction model pre-trained to convergence can be used to extract the timbre features from the audio data.

[0117] Timbre features play a crucial role in speech processing. They are not only used to identify and verify the identity of the speaker, but also serve as adjustment targets in voice-changing processing to generate audio output that matches a specific timbre. By precisely matching or adjusting these features, computer devices can generate voice-changing audio that sounds natural and similar to the target timbre, which is essential for improving the realism and naturalness of speech.

[0118] Step S1200: Calculate the semantic similarity between the self-voice timbre features and the template timbre features corresponding to each preset character object in the sound information template library;

[0119] This application pre-constructs a sound information template library. This library is a pre-built database containing the timbre features and pitch values of multiple individuals, along with associated audio clips of each individual's speech. This data is linked to the individual's identity and stored in the library. Building such a database requires collecting audio clips of different individuals' speech, and then using audio processing techniques, such as the timbre feature extraction model described above, to extract the timbre features of each person.

[0120] In the sound information template library, the timbre features and pitch values of each person are encoded into a feature vector. These vectors are stored in the database and associated with a unique identifier for each person, thus giving each person a corresponding template timbre feature. In this way, when timbre matching is needed, computer devices can quickly retrieve and compare these template timbre features.

[0121] Based on the sound information template library, the semantic similarity between the speaker's self-voice timbre features and each template timbre feature can be determined. To this end, the speaker's self-voice timbre features are compared with the template timbre features of each person in the sound information template library to determine the degree of similarity between them. This comparison can be implemented using various algorithms, such as cosine similarity, Euclidean distance, or neural networks.

[0122] For example, cosine similarity measures the similarity between two timbre features by calculating the cosine of the angle between their corresponding vectors. The closer the cosine value is to 1, the more similar the two vectors are. Euclidean distance measures the distance between two vectors in multidimensional space; the smaller the distance, the higher the similarity. Furthermore, deep learning methods, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), can be used. These models can learn complex patterns in timbre features and calculate high-precision similarity scores.
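
The two vector measures named above can be written down directly. The sketch assumes timbre features are equal-length lists of floats; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; values near 1 mean similar direction."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v; smaller means more similar."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity ignores vector length and compares direction only, whereas Euclidean distance is sensitive to magnitude, so the two can rank the same candidates differently unless the feature vectors are normalized.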

[0123] Step S1300: Select a subset of people with relatively high semantic similarity, construct a timbre sample recommendation list from the subset of people and their corresponding audio clips, and push it to the speaking user;

[0124] After calculating the semantic similarity between the speaker's self-voice timbre features and the template timbre features of the individuals in the sound information template library, individuals with relatively high semantic similarity to the speaker's timbre features are selected based on these similarity scores. In one embodiment, a threshold is preset, and semantic similarity exceeding this threshold is defined as relatively high semantic similarity.

[0125] After identifying individuals with high semantic similarity to the speaker's voice timbre, a voice sample recommendation list is constructed from these individuals and their corresponding audio clips. This recommendation list contains multiple candidate voice samples, i.e., audio clips, each with a certain degree of similarity to the speaker's voice, thus providing the user with a range of possible voice timbre choices. The purpose of the recommendation list is to allow the user to select the voice timbre they deem most suitable for subsequent voice-changing processing.

[0126] The recommendation list can also be generated in the following way: a sorting algorithm can be used to sort all the person objects in the sound information template library according to the similarity of their timbre features to those of the speaking user, and the top N (N is a preset integer value) person objects with the highest similarity are then selected.
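
Both selection strategies, the preset threshold and the top-N ranking, can be sketched in one helper. The sketch assumes the similarity scores have already been computed and are held in a dictionary keyed by person identifier; the function name `recommend` and its parameters are hypothetical.

```python
def recommend(similarities, top_n=None, threshold=None):
    """Return person identifiers ordered from most to least similar.

    similarities: dict mapping person identifier -> similarity score.
    threshold:    if given, keep only scores strictly above it.
    top_n:        if given, keep at most this many entries.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [kv for kv in ranked if kv[1] > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [person_id for person_id, _ in ranked]
```

For example, `recommend(scores, top_n=5)` yields the five closest person objects, whose audio clips would then make up the timbre sample recommendation list.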

[0127] The final list of recommended voice samples is then pushed to the user, who can hear sample audio for each candidate voice and choose the one that best suits their preferences. This step not only improves the user experience by allowing users to select according to their own preferences but also enhances the personalization and naturalness of the voice-changing process, as the final selected voice is more likely to be one that satisfies and pleases the user.

[0128] Step S1400: Take the person object determined by the speaking user based on the speaking audio segment of the timbre sample recommendation list as the target object, and take the template timbre feature of the target object as the target timbre feature.

[0129] After the user selects, from the recommendation list, the timbre sample that best matches their desired timbre, the computer device determines the target timbre features based on the user's selection. Specifically, the computer device displays the timbre sample recommendation list to the user. When the user selects a specific timbre sample, the computer device identifies the corresponding person object as the target object, and retrieves the template timbre features of that target object from the sound information template library as the target timbre features. After determining the target timbre features, the computer device uses these features to guide the voice-changing process, ensuring that the final generated voice-changing audio matches the timbre selected by the user.

[0130] This embodiment achieves highly personalized and accurate timbre matching by precisely extracting the speaker's timbre features and comparing them with template timbre features in a sound information template library. This process first effectively extracts the user's own timbre features based on the speaker's voice data. Then, by calculating the semantic similarity between the speaker's timbre features and the template timbre features of individuals in the template library, it can identify the individuals whose timbre is closest to the user's. This not only provides the user with a recommended list of timbre samples but also ensures that these recommendations are based on precise matching results, greatly improving the relevance and accuracy of timbre selection.

[0131] Once a user selects a person from the timbre sample recommendation list as the target object, and the corresponding template timbre features are determined as the target timbre features, these can be used for subsequent voice-changing processing. This not only improves the naturalness and realism of the voice-changing process but also enhances user participation and satisfaction, as users can choose a voice that best suits their personal preferences. Furthermore, this intelligent timbre selection and matching process makes voice-changing technology more flexible and adaptable, capable of meeting the personalized needs of different users, whether in live streaming, entertainment, or other applications requiring voice changing.

[0132] As can be seen, the technical advantage of this embodiment lies in its ability to provide an intelligent timbre matching and selection mechanism based on the user's timbre characteristics. This not only enhances the personalization and naturalness of voice changing processing but also improves the overall user experience. In this way, this embodiment achieves a highly efficient and user-friendly method for timbre selection and voice changing processing, providing users with a novel and highly interactive live streaming experience.

[0133] Based on any embodiment of the method in this application, the speech segment is subjected to voice-changing processing based on the optimized pitch feature and the target timbre feature to obtain a voice-changing segment, including:

[0134] Step S3310: Extract features from the speech segment to obtain its content semantic features;

[0135] This embodiment can be implemented using the pre-built network architecture disclosed above. In this network architecture, a content encoder extracts features from speech segments to obtain their content semantic features. These features represent the content information of the speech, i.e., the meaning of the specific words and sentences spoken by the speaker. The content encoder plays a core role, responsible for extracting deep semantic features from the input speech segments. These features include not only physical attributes of the speech such as pitch, volume, and timbre, but also semantic information in the speech, such as phonemes, words, and phrases.

[0136] Content encoders typically consist of multiple layers of convolutional neural networks, which abstract and extract features from speech signals layer by layer. For example, the first layer might focus on extracting basic spectral features of the audio signal, while subsequent layers can identify more complex patterns, such as phonemes and words. In this way, content encoders can capture the semantic content in speech and encode this information into a set of high-dimensional feature vectors, i.e., content semantic features.

[0137] These content semantic features are then used in other steps of the voice-changing process. For example, in the joint encoding step, the content semantic features are combined with the optimized pitch features and the target timbre features to form the voice-changing audio features.

[0138] Step S3320: Jointly encode the content semantic features, the optimized pitch features, and the target timbre features to obtain the voice-changing audio features;

[0139] Referring to the network architecture described above in this application, it combines extracted semantic features, optimized pitch features, and target timbre features through a joint encoder to generate voice-changing audio features. By fusing multiple features into a unified representation, this representation can comprehensively capture the characteristics of the original speech segment and guide the generator to produce the corresponding voice-changing audio.

[0140] Content semantic features contain the content information of the speech, optimized pitch features represent the adjusted pitch information, and target timbre features describe the desired timbre attributes. The task of the joint encoder is to fuse these features into an embedding vector that the generator can use to produce voice-changing audio, which serves as the voice-changing segment.
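As a non-limiting illustration, the fusion of the three feature types can be sketched as follows. The sketch assumes simple frame-wise concatenation in place of a learned joint encoder; the function name `fuse_features` and all numeric values are hypothetical and do not appear in this application.

```python
from typing import List


def fuse_features(content: List[List[float]],
                  pitch: List[float],
                  timbre: List[float]) -> List[List[float]]:
    """Concatenate per-frame content and pitch values with a segment-level timbre vector.

    Assumption: one content vector and one pitch value per frame, and a single
    timbre vector for the whole segment. A trained joint encoder would learn
    this mapping instead of concatenating.
    """
    if len(content) != len(pitch):
        raise ValueError("content and pitch must have the same number of frames")
    # The timbre vector is constant over the segment, so it is repeated on every frame.
    return [list(frame) + [f0] + list(timbre) for frame, f0 in zip(content, pitch)]


content = [[0.1, 0.2], [0.3, 0.4]]  # 2 frames of 2-dim content features (made-up values)
pitch = [220.0, 225.0]              # optimized pitch per frame in Hz (made-up values)
timbre = [0.9, -0.5, 0.0]           # target timbre embedding (made-up values)
fused = fuse_features(content, pitch, timbre)
```

Each fused frame then carries content, pitch, and timbre information in one vector that a generator could consume.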

[0141] Joint encoders can be implemented using various techniques. One embodiment uses deep neural networks, such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), which can handle sequential data and take temporal dependencies into account during encoding. In this network, the input feature sequence is passed through a series of layers, each of which can contain non-linear activation functions to enhance the model's expressive power. In this way, the joint encoder can learn how to map different features into a common embedding space.

[0142] Another implementation uses an attention mechanism to enhance the performance of the joint encoder. The attention mechanism allows different weights to be assigned to different parts of the input features during encoding, thus focusing more on features that significantly impact the voice-changing effect. For example, if a particular pitch feature is crucial for timbre recognition, the attention mechanism ensures that this feature receives sufficient attention during encoding.
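As a non-limiting illustration, the weighting idea behind the attention mechanism can be sketched as a softmax-weighted sum of feature vectors. In a trained model the relevance scores would be learned; here they are passed in directly, and the names `softmax` and `attention_fuse` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw relevance scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(streams: List[List[float]], scores: List[float]) -> List[float]:
    """Weighted sum of equal-length feature vectors using softmax attention weights."""
    dim = len(streams[0])
    if len(streams) != len(scores) or any(len(s) != dim for s in streams):
        raise ValueError("need one score per stream and equal-length streams")
    weights = softmax(scores)
    return [sum(w * s[i] for w, s in zip(weights, streams)) for i in range(dim)]


# Equal scores give equal weights, i.e. a plain average of the two vectors.
fused = attention_fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A higher score for one stream shifts the fused vector toward that stream, which is how a feature deemed important receives more attention during encoding.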

[0143] Another embodiment uses an autoencoder architecture for joint encoding. The autoencoder compresses the input features into a low-dimensional representation through an encoding stage, and then restores the original data from this representation through a decoding stage. In this process, the encoder learns how to capture the essential information of the input features, while the decoder learns how to reconstruct the audio signal from this compressed representation.

[0144] By applying a joint encoder, multiple features can be effectively fused to generate a richly informative embedding vector as the voice-altering audio feature. This vector is then used by the generator to produce the voice-altered audio segment. This joint encoding process not only improves the accuracy and naturalness of voice-altering processing but also enhances the sound quality and expressiveness of the final audio output.

[0145] Step S3330: Decode the voice-changing audio features to generate corresponding audio data as a voice-changing segment.

[0146] Based on the network architecture described above in this application, the decoding and generation of voice-altering audio features can be accomplished by a generator within the network architecture. This generator is responsible for converting the encoded voice-altering audio features back into audio data to generate voice-altering segments. The generator is a key component of the deep learning model's network architecture; it typically consists of multiple layers and is capable of reconstructing high-quality audio signals from embedded vectors.

[0147] Generators can be designed and implemented in various ways. A common approach is to use deep convolutional neural networks (DCNNs), which progressively upsample the encoded audio features to reconstruct the audio signal. In this process, each upsampling layer of the generator raises the temporal resolution of the features until the output matches the sampling rate of the original audio. For example, the WaveNet model is a deep convolutional network that uses stacked dilated causal convolutional layers with gated activation units to generate audio waveforms.
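As a non-limiting illustration of the dilation concept used by WaveNet-style generators, the following sketch implements a single dilated causal 1-D convolution and computes the receptive field of a stack of such layers. It is not the WaveNet implementation itself; the function names are hypothetical, and gating and residual connections are omitted.

```python
from typing import List


def dilated_causal_conv(x: List[float], weights: List[float], dilation: int) -> List[float]:
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input samples that influence one output sample of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


y = dilated_causal_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)
rf = receptive_field(2, [1, 2, 4, 8])
```

Doubling the dilation at each layer lets the receptive field grow exponentially with depth while the number of weights grows only linearly.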

[0148] Another implementation uses recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), which are capable of processing sequential data and taking temporal dependencies into account during generation. These networks generate continuous audio waveforms by recursively processing encoded features, progressively constructing each time step of the audio signal.

[0149] Another implementation uses an autoencoder structure, where the encoder compresses the input features into a low-dimensional representation, and the generator (decoder) reconstructs the audio signal from this representation. In this process, the generator learns how to recover the details and structure of the original audio signal from the compressed features.

[0150] Training a generator typically requires a large amount of audio data, which is used to tune the generator's parameters so that it can produce altered audio that is indistinguishable from real audio. During training, the generator's output is compared with the target audio data, and the difference is calculated using a loss function. Common loss functions include mean squared error (MSE) and adversarial loss, which help the generator learn how to better reconstruct the audio signal.
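As a non-limiting illustration, the mean squared error between generated and target audio samples can be computed as follows. The function name `mse` is hypothetical, and the adversarial loss mentioned above is not shown because it requires a separate discriminator network.

```python
from typing import Sequence


def mse(generated: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sample sequences."""
    if len(generated) != len(target) or len(generated) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((g - t) ** 2 for g, t in zip(generated, target)) / len(generated)


loss = mse([0.0, 0.5, 1.0], [0.0, 0.0, 0.0])  # (0 + 0.25 + 1) / 3
```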

[0151] This embodiment presents a highly efficient and accurate voice-changing processing workflow with significant technical advantages. First, the workflow accurately extracts deep semantic features from speech segments. These features encompass not only the physical properties of the speech but also semantic information, providing rich input information for voice-changing processing. Second, by applying a joint encoder, content semantic features, optimized pitch features, and target timbre features are effectively fused to generate a comprehensive voice-changing audio feature vector. This not only improves the accuracy and naturalness of the voice-changing process but also enhances the sound quality and expressiveness of the final audio output. Furthermore, the generator design is flexible and diverse, capable of reconstructing high-quality audio signals from the encoded voice-changing audio features. Whether using deep convolutional neural networks, recurrent neural networks, or autoencoder structures, the naturalness and clarity of the voice-changing segments are ensured. Finally, this workflow provides users with a seamless and natural voice-changing experience. It also demonstrates its efficiency advantage when processing massive amounts of speech segments from users on live streaming platforms, significantly improving the performance of voice-changing technology and solving problems related to real-time performance, personalized services, natural pitch adjustment, and realistic timbre reproduction in traditional technologies.

[0152] Based on any embodiment of the method in this application, after sending the voice-changing audio data to the listener in the live broadcast room, the method includes:

[0153] Step S5100: Obtain feedback information from the recipient after playing the voice-changing audio data, and determine the recipient's acceptance level of the voice-changing audio data based on the feedback information;

[0154] In a live streaming environment, feedback from listeners on the voice-changing audio data can be used to measure the effectiveness of the voice-changing effect. Therefore, relevant feedback information can be collected and analyzed to determine the degree to which the voice-changing audio data is accepted by the listeners. Feedback information can be obtained through various channels, including any channel suitable for enabling dialogue between the speaker and listeners, such as chat messages in the public chat area of the live stream, bullet comments, comment sections, and instant chat windows. Listeners can provide their feedback through these channels in the form of voice or text, and this feedback may include positive evaluations, criticisms, suggestions, or other forms of commentary.

[0155] To process this feedback, a pre-defined deep learning scoring model can be used to reason about it. This model can classify and map the collected feedback to determine the recipient's acceptance level of the voice-changing audio data. Specifically, the scoring model analyzes the emotional tone in textual feedback or identifies keywords and intonation in voice feedback, then maps these analysis results to a predefined category, obtaining a corresponding acceptance probability value, which is then used as a representation of the degree of acceptance. Positive feedback corresponds to a higher acceptance probability value, while negative feedback corresponds to a lower value.

[0156] Training the scoring model requires a large amount of labeled data, including historical feedback information and its corresponding acceptance labels. Through training, the model learns how to accurately extract sentiment and intent from feedback information and translate them into quantifiable levels of acceptance. Furthermore, the model can continuously learn from user feedback to optimize its performance, more accurately reflecting users' true feelings.

[0157] In practice, the feedback information flow in the live broadcast room is monitored in real time, and this information is analyzed using the scoring model. Once enough feedback information has been collected, the model calculates a comprehensive acceptance probability value, which reflects the listeners' overall acceptance level of the current voice-changing audio data.
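As a non-limiting illustration, the aggregation of per-message scores into a comprehensive acceptance probability value can be sketched as follows. A keyword counter stands in for the deep learning scoring model purely for demonstration; the keyword lists, the `min_count` parameter, and the function names are hypothetical.

```python
from typing import List, Optional

# Hypothetical keyword lists; a trained scoring model would replace them.
POSITIVE = {"nice", "great", "love", "natural"}
NEGATIVE = {"weird", "bad", "robotic", "annoying"}


def score_message(text: str) -> float:
    """Stand-in scorer: map one text message to an acceptance probability in [0, 1]."""
    words = set(text.lower().split())
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos + neg == 0:
        return 0.5  # no recognizable sentiment is treated as neutral
    return pos / (pos + neg)


def overall_acceptance(messages: List[str], min_count: int = 3) -> Optional[float]:
    """Average the per-message scores once at least min_count messages are available."""
    if len(messages) < min_count:
        return None  # not enough feedback collected yet
    return sum(score_message(m) for m in messages) / len(messages)


acceptance = overall_acceptance(["great voice", "sounds robotic", "love it"])
```

Returning `None` until enough messages arrive mirrors the condition that the comprehensive value is only calculated once sufficient feedback has been collected.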

[0158] Step S5200: Detect whether the acceptance level is lower than a preset threshold. When it is lower than the preset threshold, select other person objects from the sound information template library, construct a re-voice-changing request from the other person objects and their speaking audio segments, and send it to the speaking user.

[0159] When the computer device detects that the recipient's acceptance level is lower than the preset threshold, the voice-changing effect has failed to meet the recipient's expectations and needs to be adjusted. In that case, new person objects are selected from the sound information template library. Their template timbre features differ from the previous target timbre features, so that they can provide a voice-changing effect that better matches the recipient's preferences.

[0160] There are several ways to select a new person object. For example, the computer device can automatically select one or more new person objects based on the recipient's historical feedback, preference settings, or the specific content of the current feedback. Alternatively, machine learning algorithms can be used to predict and select the person object most likely to improve satisfaction by analyzing the recipient's feedback patterns on different timbres. A simpler approach is to select a new person object at random.
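As a non-limiting illustration, the threshold check and two of the selection strategies (highest predicted satisfaction, or random choice) can be sketched as follows. The function name `pick_new_person`, the satisfaction scores, and the identifiers "A", "B", "C" are hypothetical; how the predictions are obtained is outside the scope of the sketch.

```python
import random
from typing import Dict, Optional


def pick_new_person(acceptance: float,
                    threshold: float,
                    current: str,
                    predicted_satisfaction: Dict[str, float],
                    rng: Optional[random.Random] = None) -> Optional[str]:
    """Return a replacement person object, or None when no change is needed.

    If rng is given, a candidate is chosen at random; otherwise the candidate
    with the highest predicted satisfaction is chosen.
    """
    if acceptance >= threshold:
        return None  # acceptance is high enough, keep the current target object
    candidates = [p for p in predicted_satisfaction if p != current]
    if not candidates:
        return None  # the library offers no alternative
    if rng is not None:
        return rng.choice(candidates)
    return max(candidates, key=lambda p: predicted_satisfaction[p])


library_scores = {"A": 0.9, "B": 0.4, "C": 0.7}  # made-up predictions
choice = pick_new_person(0.3, 0.5, "A", library_scores)
```

The current target object is always excluded, so the proposed replacement has a different timbre from the one that was poorly received.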

[0161] Once a new person object is identified, a re-voice-changing request is constructed using its template timbre features and sent to the speaking user. This request contains an audio clip of the new person object speaking, allowing the speaking user to preview the effect of the new timbre and decide whether to accept the re-voice-changing request.

[0162] Step S5300: After the speaking user confirms the re-voice-changing request, the other person objects are taken as the new target objects. For subsequently generated speech segments, execution restarts from the step of performing pitch-shifting processing on the segment pitch features of the speech segment according to the preset target pitch value, so that voice-changing processing is performed on those segments.

[0163] After the speaker confirms the request to re-change their voice, the newly selected person can be used as the target, and its template timbre features can be used as the target timbre features to perform voice-changing processing on subsequent speech segments. Specifically, steps S3200 to S3400 can be executed on subsequent speech segments, so that subsequent speech segments can be processed based on the target timbre features of the new person, thereby improving the recipient's satisfaction with the voice-changing audio data.

[0164] This embodiment significantly enhances the interactivity and adaptability of voice-changing technology by introducing a mechanism for collecting and analyzing feedback from the recipient, solving the problem that traditional technologies often neglect the recipient's experience. Traditional voice-changing processes typically only consider the speaker's needs, lacking consideration for the recipient's acceptance, which may lead to a discrepancy between the voice-changing effect and the recipient's expectations. This embodiment, by monitoring the recipient's feedback in real time, can dynamically adjust the voice-changing parameters to ensure that the voice-changing effect better matches the recipient's preferences, thereby improving the overall user experience.

[0165] This embodiment uses a deep learning scoring model to perform sentiment analysis and intent recognition on the recipient's feedback. This allows for the quantification of the recipient's acceptance level and the adjustment of the voice-changing strategy based on this feedback. This feedback-based dynamic adjustment mechanism transforms voice-changing from a one-way process into a two-way interactive process, where the recipient's feedback is considered and responded to in real time. When the recipient's acceptance level falls below a preset threshold, a new voice subject is automatically selected from the voice information template library, providing an option to re-change the voice. This intelligent processing method improves the flexibility and personalization of voice-changing.

[0166] Furthermore, this embodiment allows the speaker to re-change their voice based on feedback from the listener. This user-participatory voice-changing process increases the speaker's sense of control and makes the voice-changing process more accurately meet the needs of both parties. In this way, not only is the listener's satisfaction improved, but the speaker's participation is also enhanced, thereby creating a more active and positive interactive atmosphere on the live streaming platform.

[0167] As can be seen, the technical advantage of this embodiment lies in its ability to achieve voice-changing processing driven by the receiving user's feedback. This user-centric approach improves the practical application effect and user satisfaction of voice-changing technology, and provides a richer and more personalized voice communication experience for live streaming platforms.

[0168] Based on any embodiment of the method in this application, after sending the voice-changing audio data to the listener in the live broadcast room, the method includes:

[0169] Step S7100: Respond to the user configuration event and pop up the voice changer configuration interface in the live broadcast room. Receive the voice changer control parameters of the speaking user through the voice changer configuration interface. The voice changer control parameters include the target timbre characteristics of the target object selected by the speaking user from the character objects in the voice information template library and the target pitch value set by the speaking user.

[0170] This embodiment allows the speaking user to trigger a user configuration event through the controls provided in the live broadcast room. In response to the event, a voice changer configuration interface pops up in the live broadcast room, as shown in Figure 4, so that the speaking user can further personalize and optimize the voice changer effect.

[0171] The voice changer configuration interface provides an intuitive operating environment. Users can preview the voice effects by playing the audio clips of the various person objects in the timbre sample recommendation list displayed on the interface. Based on personal preference, they can select the person object they wish to imitate from that list and designate it as the target object, thereby determining the corresponding target timbre features.

[0172] In addition to selecting the target timbre feature, users can also set the target pitch value in the voice changer configuration interface. This parameter allows users to adjust the pitch of the changed voice to suit their personal preferences or the needs of a specific scenario. The target pitch value can be absolute, such as a specific frequency in hertz, or relative, such as a number of semitones above or below the original pitch.
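As a non-limiting illustration, an absolute or relative target pitch setting can be converted to a frequency as follows. The sketch assumes twelve-tone equal temperament, in which one semitone corresponds to a frequency ratio of 2^(1/12); the function name and parameters are hypothetical.

```python
from typing import Optional


def target_pitch_hz(source_hz: float,
                    absolute_hz: Optional[float] = None,
                    semitones: Optional[float] = None) -> float:
    """Resolve the target pitch from either an absolute value or a semitone offset."""
    if (absolute_hz is None) == (semitones is None):
        raise ValueError("specify exactly one of absolute_hz or semitones")
    if absolute_hz is not None:
        return float(absolute_hz)
    # Each semitone multiplies the frequency by 2 ** (1 / 12).
    return source_hz * 2.0 ** (semitones / 12.0)


octave_up = target_pitch_hz(220.0, semitones=12)   # one octave above 220 Hz
fixed = target_pitch_hz(220.0, absolute_hz=180.0)  # absolute setting ignores the source
```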

[0173] After the user makes a selection and sets parameters, the computer device collects these voice-changing control parameters. Specifically, this may include the target's identity identifier and a preset target pitch value as voice-changing control parameters for subsequent speech processing. These parameters guide the computer device on how to perform pitch-shifting on new speech segments, ensuring the voice-changing effect matches the user's expectations. For example, if the user selects a deep male voice as the target timbre and sets a low target pitch value, the computer device will adjust the pitch of subsequent speech segments accordingly to match this setting.

[0174] Step S7200: Based on the target timbre characteristics corresponding to the target object selected by the speaking user and the target pitch value set by the speaking user, for the subsequently generated speech segments, start from the step of performing pitch-shifting processing on the segment pitch characteristics of the speech segments according to the preset target pitch value, and perform voice-shifting processing on the subsequently generated speech segments.

[0175] After the user selects the target timbre feature and sets the target pitch value through the voice changing configuration interface, these parameters can be used to perform subsequent voice changing processing. Specifically, the computer device can execute steps S3200 to S3400 of this application to generate corresponding voice-changing audio data for the subsequently generated speech segments based on the user's selected target timbre feature and set target pitch value. This enables voice changing processing of the user's subsequently generated speech segments, and then the generated voice-changing audio data is sent back to the live broadcast room for the receiving user to play.

[0176] This embodiment significantly improves user experience and satisfaction by providing a user-friendly voice-changing configuration interface. Speakers can select specific individuals based on their preferences and needs to determine the target vocal characteristics. This personalized selection not only increases the user's control over the voice-changing effect but also makes the result more in line with the speaker's personal style and expectations, enhancing the realism and naturalness of the voice change.

[0177] Furthermore, users can set a target pitch value to further fine-tune the voice-changing effect, making it more closely match their vocal characteristics or the needs of a specific scenario. This flexibility allows users to adjust their voice in different live-streaming environments, whether to increase entertainment value, protect privacy, or better adapt to the live-streaming content. In this way, users are not only participants in the voice-changing process but also creators, able to adjust and optimize their voice-changing effects in real time.

[0178] This user-customizable and participatory speaking mechanism makes voice-changing technology more flexible and responsive to user needs. It not only increases user satisfaction with the technology but also enhances user loyalty and activity on the live streaming platform. Users can express themselves more confidently while enjoying the limitless possibilities offered by voice-changing technology. In summary, this embodiment, by giving users more control and choice, achieves more personalized and precise voice-changing effects, providing users with a brand-new and highly interactive live streaming experience.

[0179] Please refer to Figure 5. A voice-changing device for live-streaming dialogue, according to one aspect of this application, includes a segment analysis module 3100, a pitch-shifting processing module 3200, a voice-changing processing module 3300, and a segment update module 3400. The segment analysis module 3100 is configured to respond to a voice speaking event triggered by a user speaking in the live-streaming room, detecting and determining a voice segment in the target audio data corresponding to the event. The pitch-shifting processing module 3200 is configured to perform pitch-shifting processing on the segment pitch features of the voice segment according to a preset target pitch value to obtain an optimized pitch feature. The voice-changing processing module 3300 is configured to perform voice-changing processing on the voice segment based on the optimized pitch feature and the target timbre feature to obtain a voice-changing segment. The target timbre feature is a template timbre feature of a target object matched from a sound information template library based on the user's own timbre feature. The segment update module 3400 is configured to replace the corresponding voice segment in the target audio data with the voice-changing segment to obtain voice-changing audio data, and send the voice-changing audio data to the receiving user in the live-streaming room.
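As a non-limiting illustration, the replacement performed by the segment update module 3400 can be sketched as a splice on a list of samples. The function name `replace_segment` and the use of sample indices to locate the segment are assumptions of the sketch.

```python
from typing import List


def replace_segment(audio: List[float], start: int, end: int,
                    changed: List[float]) -> List[float]:
    """Return a copy of audio with samples [start, end) replaced by the changed segment."""
    if not 0 <= start <= end <= len(audio):
        raise ValueError("segment bounds must lie within the audio")
    # Samples outside the speech segment are passed through unchanged.
    return audio[:start] + changed + audio[end:]


target_audio = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
voice_changed = replace_segment(target_audio, 2, 4, [9.0, 9.0])
```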

[0180] Based on any embodiment of the device in this application, the segment analysis module 3100 includes: an audio determination module, configured to respond to a voice speaking event triggered by a user speaking in a live broadcast room, and acquire target audio data corresponding to the real-time speech of the speaking user; a noise reduction processing module, configured to perform audio noise reduction processing on the target audio data to obtain low-noise audio data with suppressed noise; and a detection and interception module, configured to perform human voice activity detection based on the low-noise audio data, and determine the speech segment in which speech is present.
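As a non-limiting illustration, the detection and interception of speech segments can be sketched with a simple frame-energy threshold. This threshold detector is an assumption standing in for whatever human voice activity detection is actually used, noise reduction is not modeled, and the function name and parameters are hypothetical.

```python
from typing import List, Tuple


def detect_speech_segments(samples: List[float], frame_len: int,
                           threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of consecutive frames whose mean energy exceeds threshold."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = i * frame_len  # a speech segment begins
        elif start is not None:
            segments.append((start, i * frame_len))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments


audio = [0.0] * 4 + [0.5] * 8 + [0.0] * 4  # silence, speech-like burst, silence
segments = detect_speech_segments(audio, frame_len=4, threshold=0.01)
```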

[0181] Based on any embodiment of the device in this application, the pitch shifting processing module 3200 includes: a manual application module, configured to determine whether the speaking user has preset a pitch value, and if so, to use that pitch value as the target pitch value; a template application module, configured to use the preset pitch value corresponding to the target object determined by the speaking user from the voice information template library as the target pitch value when the speaking user has not preset a pitch value; and a pitch shifting execution module, configured to perform pitch shifting processing on the segment pitch features of the speech segment according to the target pitch value, so that the segment pitch features reach the level of the target pitch value.
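As a non-limiting illustration, the selection of the target pitch value and a simple pitch-shifting step can be sketched as follows. The sketch assumes that pitch features are a per-frame fundamental-frequency contour in hertz, that a value of 0 marks an unvoiced frame, and that "reaching the level of the target pitch value" means scaling the contour so that its voiced mean equals the target; all names are hypothetical.

```python
from typing import List, Optional


def resolve_target_pitch(user_preset: Optional[float], template_preset: float) -> float:
    """Prefer the speaking user's preset; otherwise fall back to the target object's preset."""
    return user_preset if user_preset is not None else template_preset


def shift_contour(f0: List[float], target_hz: float) -> List[float]:
    """Scale voiced frames so that their mean pitch equals target_hz."""
    voiced = [f for f in f0 if f > 0]
    if not voiced:
        return list(f0)  # nothing to shift in a fully unvoiced segment
    ratio = target_hz / (sum(voiced) / len(voiced))
    # Unvoiced frames (0) stay at 0 because 0 * ratio == 0.
    return [f * ratio for f in f0]


target = resolve_target_pitch(None, 400.0)            # no user preset, template value applies
shifted = shift_contour([100.0, 0.0, 300.0], target)  # voiced mean 200 Hz, ratio 2
```

Scaling by a constant ratio preserves the relative shape of the contour, so the intonation pattern of the original speech is retained at the new pitch level.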

[0182] Based on any embodiment of the device in this application, prior to the voice-changing processing module 3300, this device includes: a timbre extraction module, configured to acquire the speaking voice data of the speaking user, and extract the speaking user's own timbre features based on the speaking voice data, wherein the speaking voice data is the target audio data generated in real time, or the historical audio data pre-recorded by the speaking user; a semantic calculation module, configured to calculate the semantic similarity between the own timbre features and the template timbre features corresponding to each preset person object in the sound information template library; an object recommendation module, configured to select a portion of person objects with relatively high semantic similarity, construct a timbre sample recommendation list of the portion of person objects and their corresponding speaking audio segments, and push it to the speaking user; and a user selection module, configured to select the person object determined by the speaking user based on the speaking audio segments of the timbre sample recommendation list as the target object, and use the template timbre features of the target object as the target timbre features.
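As a non-limiting illustration, the similarity calculation and the construction of the recommendation list can be sketched as follows. This application does not specify the similarity measure, so cosine similarity between timbre feature vectors is used here as an assumption; the function names, the parameter `k`, and the sample vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors, defined as 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def recommend(own: Sequence[float], library: Dict[str, Sequence[float]],
              k: int = 2) -> List[str]:
    """Return the k person objects whose template timbre is most similar to the user's own."""
    ranked = sorted(library, key=lambda name: cosine(own, library[name]), reverse=True)
    return ranked[:k]


library = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}  # made-up template timbres
recommended = recommend([1.0, 0.1], library, k=2)
```

The returned identifiers would then be paired with their speaking audio segments to form the timbre sample recommendation list pushed to the speaking user.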

[0183] Based on any embodiment of the device in this application, the voice-changing processing module 3300 includes: a content extraction module, configured to extract features from the speech segment to obtain its content semantic features; a joint encoding module, configured to jointly encode the content semantic features, the optimized pitch features, and the target timbre features to obtain voice-changing audio features; and a decoding generation module, configured to decode and generate corresponding audio data as a voice-changing segment based on the voice-changing audio features.

[0184] Based on any embodiment of the device in this application, following the segment update module 3400, this device includes: a speech evaluation module, configured to obtain feedback information from the receiving user after playing the voice-changing audio data, and determine the receiving user's acceptance level of the voice-changing audio data based on the feedback information; a reset recommendation module, configured to detect whether the acceptance level is lower than a preset threshold, and when it is lower than the preset threshold, to reselect other person objects from the sound information template library, construct a re-voice-changing request from the other person objects and their speaking audio segments, and send it to the speaking user; and a pitch-shifting reset module, configured to, after the speaking user confirms the re-voice-changing request, use the other person objects as new target objects, and for subsequently generated speech segments, restart execution from the pitch-shifting processing module 3200 to perform voice-changing processing on the subsequently generated speech segments.

[0185] Based on any embodiment of the device in this application, following the segment update module 3400, this device includes: a user configuration module, configured to respond to a user configuration event, pop up a voice-changing configuration interface in the live broadcast room, and receive the voice-changing control parameters of the speaking user through the voice-changing configuration interface, the voice-changing control parameters including the target timbre features of the target object selected by the speaking user from the person objects in the sound information template library and the target pitch value set by the speaking user; and a configuration application module, configured to, based on the target timbre features corresponding to the target object selected by the speaking user and the target pitch value set by the speaking user, restart execution from the pitch-shifting processing module 3200 for the subsequently generated speech segments, and perform voice-changing processing on the subsequently generated speech segments.

[0186] Another embodiment of this application provides a voice-changing device for live-streaming voice dialogue. Figure 6 shows a schematic diagram of the internal structure of the live-streaming voice dialogue device. This device includes a processor, a computer-readable storage medium, a memory, and a network interface connected via a system bus. The computer-readable, non-volatile storage medium stores an operating system, a database, and computer-readable instructions. The database may store information sequences. When the computer-readable instructions are executed by the processor, the processor can implement a voice-changing method for live-streaming voice dialogue.

[0187] The processor of this live-streaming voice-changing device provides computing and control capabilities, supporting the operation of the entire device. The device's memory can store computer-readable instructions, which, when executed by the processor, cause the processor to perform the live-streaming voice-changing method of this application. The network interface of the device is used for communication with a terminal.

[0188] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the live-streaming voice-changing device to which the present application is applied. A specific live-streaming voice-changing device may include more or fewer components than those shown in the figure, or may combine certain components, or may have different component arrangements.

[0189] In this embodiment, the processor executes the specific functions of each module in Figure 5, and the memory stores the program code and various types of data required to execute the above modules or sub-modules. The network interface is used to realize data transmission between user terminals or servers. The non-volatile readable storage medium in this embodiment stores the program code and data required to execute all modules in the live broadcast voice dialogue voice changing device of this application, and the server can call this program code and data to execute the functions of all modules.

[0190] This application also provides a non-volatile readable storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the live-streaming voice-changing method of any embodiment of this application.

[0191] This application also provides a computer program product, including a computer program / instructions that, when executed by one or more processors, implement the steps of the method described in any embodiment of this application.

[0192] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The aforementioned storage medium can be a computer-readable storage medium such as a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM).

[0193] In summary, this application significantly improves the performance of voice changing technology through innovations such as real-time response, precise pitch shifting, accurate timbre matching, and efficient audio processing. It solves the problems of real-time performance, personalized service, natural pitch adjustment, and realistic timbre reproduction in traditional technologies, providing users with a more natural, realistic, and personalized voice changing service.

Claims

1. A voice-changing method for live-streaming dialogue, characterized in that it comprises: in response to a voice speaking event triggered by a speaking user in the live broadcast room, detecting and determining the speech segment in the target audio data corresponding to the event; performing pitch-shifting processing on the segment pitch features of the speech segment according to a preset target pitch value to obtain optimized pitch features; performing voice-changing processing on the speech segment based on the optimized pitch features and target timbre features to obtain a voice-changing segment, wherein the target timbre features are the template timbre features of a target object matched from the sound information template library based on the speaking user's own timbre features; and replacing the corresponding speech segment in the target audio data with the voice-changing segment to obtain voice-changing audio data, and sending the voice-changing audio data to the receiving user in the live broadcast room.

2. The voice-changing method for live-streaming dialogue according to claim 1, characterized in that detecting and determining, in response to the voice speaking event triggered by the speaking user in the live broadcast room, the speech segment in the target audio data corresponding to the event comprises: in response to the voice speaking event triggered by the speaking user in the live broadcast room, obtaining the target audio data corresponding to the real-time speech of the speaking user; performing audio noise reduction on the target audio data to obtain low-noise audio data in which noise is suppressed; and performing human voice activity detection on the low-noise audio data to identify the speech segments containing human voice.

3. The voice-changing method for live-streaming dialogue according to claim 1, characterized in that performing pitch shifting on the segment pitch features of the speech segment based on the preset target pitch value to obtain optimized pitch features comprises: determining whether the speaking user has preset a pitch value; if the speaking user has preset a pitch value, using that pitch value as the target pitch value; if the speaking user has not preset a pitch value, using as the target pitch value the preset pitch value corresponding to the target object determined by the speaking user from the character objects in the sound information template library; and performing pitch shifting on the segment pitch features of the speech segment based on the target pitch value, so that the segment pitch features reach the level of the target pitch value.

4. The voice-changing method for live-streaming dialogue according to claim 1, characterized in that, before performing voice-changing processing on the speech segment based on the optimized pitch features and the target timbre features, the method comprises: acquiring speaking voice data of the speaking user and extracting the speaking user's own timbre features from the speaking voice data, the speaking voice data being the target audio data generated in real time or historical audio data pre-recorded by the speaking user; calculating the semantic similarity between the own timbre features and the template timbre features corresponding to each preset character object in the sound information template library; selecting a subset of character objects with relatively high semantic similarity, constructing a timbre sample recommendation list from those character objects and their corresponding audio clips, and pushing the list to the speaking user; and taking the character object determined by the speaking user from the timbre sample recommendation list as the target object, and taking the template timbre features of the target object as the target timbre features.

5. The voice-changing method for live-streaming dialogue according to claim 1, characterized in that performing voice-changing processing on the speech segment based on the optimized pitch features and the target timbre features to obtain a voice-changing segment comprises: performing feature extraction on the speech segment to obtain its content semantic features; jointly encoding the content semantic features, the optimized pitch features, and the target timbre features to obtain voice-changing audio features; and decoding the voice-changing audio features to generate the corresponding audio data as the voice-changing segment.

6. The voice-changing method for live-streaming dialogue according to any one of claims 1 to 5, characterized in that, after sending the voice-changing audio data to the listener in the live broadcast room, the method comprises: obtaining feedback information from the listener after the voice-changing audio data is played, and determining the listener's acceptance level of the voice-changing audio data based on the feedback information; if the acceptance level is lower than a preset threshold, selecting other character objects from the sound information template library, constructing a re-voice-changing request from the other character objects and their audio clips, and sending the request to the speaking user; and, once the speaking user confirms the re-voice-changing request, taking the other character object as the new target object and, for subsequently generated speech segments, resuming from the step of performing pitch shifting on the segment pitch features of the speech segment based on the preset target pitch value, so as to perform voice-changing processing on the subsequently generated speech segments.

7. The voice-changing method for live-streaming dialogue according to any one of claims 1 to 5, characterized in that, after sending the voice-changing audio data to the listener in the live broadcast room, the method comprises: in response to a user configuration event, displaying a voice-changing configuration interface in the live broadcast room; receiving voice-changing control parameters of the speaking user through the voice-changing configuration interface, the voice-changing control parameters including the target timbre features of the target object selected by the speaking user from the character objects in the sound information template library and the target pitch value set by the speaking user; and, based on the target timbre features corresponding to the target object selected by the speaking user and the target pitch value set by the speaking user, for subsequently generated speech segments, resuming from the step of performing pitch shifting on the segment pitch features of the speech segment based on the preset target pitch value, so as to perform voice-changing processing on the subsequently generated speech segments.

8. A voice-changing apparatus for live-streaming dialogue, characterized in that it comprises: a segment analysis module configured to, in response to a voice speaking event triggered by a speaking user in the live broadcast room, detect and determine the speech segment in the target audio data corresponding to the event; a pitch shifting module configured to perform pitch shifting on the segment pitch features of the speech segment based on a preset target pitch value to obtain optimized pitch features; a voice-changing processing module configured to perform voice-changing processing on the speech segment based on the optimized pitch features and target timbre features to obtain a voice-changing segment, wherein the target timbre features are the template timbre features of a target object obtained by matching the speaking user's own timbre features against the sound information template library; and a segment update module configured to replace the corresponding speech segment in the target audio data with the voice-changing segment to obtain voice-changing audio data, and to send the voice-changing audio data to the listener in the live broadcast room.

9. A voice-changing device for live-streaming dialogue, comprising a central processing unit and a memory, characterized in that the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.

10. A non-volatile readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 7, which, when invoked and run by a computer, executes the steps included in the corresponding method.
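By way of non-limiting illustration only, the segment detection and segment replacement steps recited in claims 1 and 2 can be sketched as below. The sketch rests on several assumptions that are not taken from this application: audio is a plain list of float samples, a simple frame-energy threshold stands in for human voice activity detection, noise reduction is omitted, and the `convert` callback (a hypothetical placeholder for the pitch shifting and voice-changing processing) returns a segment of the same length as its input.

```python
def detect_speech_segments(samples, frame_len=4, energy_threshold=0.01):
    """Return (start, end) sample ranges whose frame energy exceeds a threshold.

    A frame-energy detector is used here purely as a stand-in for voice
    activity detection; adjacent active frames are merged into one segment.
    """
    segments = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > energy_threshold:
            if start is None:
                start = i  # a new speech segment begins at this frame
        elif start is not None:
            segments.append((start, i))  # the segment ended before this frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


def replace_segments(samples, segments, convert):
    """Replace each detected segment with its converted version.

    convert is a hypothetical callback standing in for pitch shifting plus
    voice-changing processing; it is assumed to preserve segment length so
    that the non-speech audio around each segment stays aligned.
    """
    out = list(samples)
    for start, end in segments:
        converted = convert(samples[start:end])
        if len(converted) != end - start:
            raise ValueError("converted segment must keep the original length")
        out[start:end] = converted
    return out
```

For example, calling `replace_segments(audio, detect_speech_segments(audio), convert)` leaves the non-speech portions of `audio` untouched and substitutes only the detected speech segments, which corresponds to replacing the speech segment in the target audio data with the voice-changing segment before the result is sent to the listener.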