Speech coding

12 results about "Speech coding" patented technology

Speech coding is the application of data compression to digital audio signals containing speech. It estimates speech-specific parameters with audio signal processing techniques to model the speech signal, then applies generic data compression algorithms to represent the modeled parameters in a compact bitstream.
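As a rough illustration of the "speech-specific parameter estimation" step in the definition above, the sketch below estimates linear-prediction (LPC) coefficients for one frame with the autocorrelation method and the Levinson-Durbin recursion. LPC is only one common modeling choice and is not named by this page; the helper names and the synthetic test signal `synth_ar2` are assumptions made for the example, and a real codec would go on to quantize and entropy-code the parameters.

```python
import random


def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for the prediction-error filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy, starts as the frame energy
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep the remaining taps at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc(frame, order):
    """Estimate LPC coefficients and residual energy for one frame."""
    return levinson_durbin(autocorrelation(frame, order), order)


def synth_ar2(n, seed=0):
    """Hypothetical test signal: x[t] = 1.3*x[t-1] - 0.6*x[t-2] + white noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


coeffs, residual_energy = lpc(synth_ar2(4000), order=2)
print(coeffs, residual_energy)
```

For the synthetic signal, the estimated coefficients should land close to the generating filter (1, -1.3, 0.6), which shows how a few model parameters can stand in for thousands of samples.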

Audio-visual joint speech enhancement method and apparatus in multi-person environment, device and medium

Status: Pending; Publication number: CN122337229A; Topics: Speech sound, Audio frequency

This application relates to a method, apparatus, device, and medium for audiovisual joint speech enhancement in a multi-person environment. The method includes: a normalization unit processes facial video and mixed audio to obtain a facial image sequence and a synchronized two-dimensional encoded audio sequence; extracting the lip region of each frame to obtain a lip image sequence; processing this sequence through a network to obtain lip-reading content features; aligning these features with the audio frame; inputting the aligned audio frame to a speech encoder to obtain fixed-length speech encoding features; outputting these features to a speaker feature decoupling network to decouple audio content features from target speaker features; aligning and fusing the lip-reading and audio content features through a cross-modal network to obtain cross-modal fused features; fusing these features with the target speaker features through an attention mechanism to obtain target speaker enhanced features; and finally, using these features, a feature enhancement network enhances the mixed audio and outputs the target audio. This method, by using audiovisual multimodal fusion and speaker feature decoupling, can effectively improve the clarity of target speech in a multi-person environment and reduce background interference.

Owner:NAT INNOVATION INST OF DEFENSE TECH PLA ACAD OF MILITARY SCI

Using machine learning speech synthesizer using synthetic analytic speech coding

Status: Pending; Publication number: CN122374817A; Topics: Speech code, Speech synthesis

An apparatus includes a memory configured to store data associated with a machine learning (ML) based speech synthesis model. The apparatus also includes a speech encoder that includes the ML based speech synthesis model. The speech encoder is configured to perform a synthetic analysis (analysis-by-synthesis) operation on an input speech signal, which includes generating a synthetic version of the input speech signal with the ML based speech synthesis model.

Owner:QUALCOMM INC

Voice interaction method, device and electronic equipment

Status: Active; Publication number: CN121075331B; Topics: Speech recognition, Spoken language, Intent recognition

The application provides a speech interaction method and device and electronic equipment, and relates to the technical field of speech processing. The method comprises the following steps: acquiring speech information input by a user, and acquiring historical intention text of the user; inputting the speech information into a speech encoder of a spoken language understanding model to obtain acoustic coding features output by the speech encoder; inputting the historical intention text into a text encoder of the spoken language understanding model to obtain text coding features output by the text encoder; inputting the acoustic coding features and the text coding features into an intention recognition module of the spoken language understanding model to obtain an intention recognition result output by the intention recognition module, so as to be used for speech interaction. The application can improve the accuracy of the intention recognition result and accurately acquire the real intention of the user.

Owner:HANGZHOU QIUGUOJIHUA TECHNOLOGY CO LTD

Method and apparatus for adjusting speech coding, electronic device, and storage medium

Status: Active; Publication number: CN116600304B; Topics: Computer network, Frequency spectrum

The application provides a method and device for adjusting speech coding in dynamic spectrum sharing, together with electronic equipment and a storage medium, and relates to the technical field of wireless communication. The method comprises the following steps: receiving, at the current time, a notification from a long term evolution (LTE) network indicating whether it will occupy a shared frequency band during the next first preset time length; if the notification indicates that the LTE network needs to occupy the shared frequency band during that period, correcting channel quality information and adjusting a speech coding rate based on the corrected channel quality information. This at least addresses the problem that the LTE system, having high priority, occupies the shared spectrum for long periods when traffic is heavy, which degrades voice service quality in the UMTS system. The application is suitable for spectrum sharing optimization, voice service optimization and the like.

Owner:CHINA UNITED NETWORK COMM GRP CO LTD
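The control flow described in the abstract above (notification, corrected channel quality, adjusted coding rate) can be sketched as follows. This is only a minimal illustration and not the patented method: the fixed `penalty`, the 0 to 30 quality scale, and the linear mapping onto a rate table are all assumptions made for the example. The rate table lists the standard AMR narrowband modes.

```python
# AMR narrowband codec modes in kbit/s.
AMR_NB_RATES_KBPS = [4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2]


def corrected_cqi(cqi, lte_will_occupy, penalty=5, lo=0, hi=30):
    """Lower the reported channel quality when LTE is about to take the shared band.

    The fixed penalty and the 0..30 scale are hypothetical choices for this sketch.
    """
    if lte_will_occupy:
        cqi -= penalty
    return max(lo, min(hi, cqi))


def select_rate(cqi, lo=0, hi=30):
    """Map channel quality linearly onto the rate table (an assumed policy)."""
    idx = (cqi - lo) * (len(AMR_NB_RATES_KBPS) - 1) // (hi - lo)
    return AMR_NB_RATES_KBPS[idx]


def adjust_speech_rate(reported_cqi, lte_will_occupy):
    """Pick a speech coding rate from the corrected channel quality."""
    return select_rate(corrected_cqi(reported_cqi, lte_will_occupy))


for cqi in (5, 15, 30):
    print(cqi, adjust_speech_rate(cqi, False), adjust_speech_rate(cqi, True))
```

Lowering the rate ahead of the expected loss of spectrum, rather than after quality has already dropped, is the point of correcting the channel quality first.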

Heart failure speech data type recognition method and system based on multi-task learning

Status: Pending; Publication number: CN122347965A; Topics: Engineering, Multi-task learning

The application provides a heart failure voice data type recognition method and system based on multi-task learning, which comprises the following steps: after standardizing the heart failure voice data, inputting the standardized data into a shared voice encoder to obtain shared features; for each of multiple recognition tasks, generating a corresponding task embedding vector, generating a task-specific gating weight vector from the task embedding vector and the shared features, and from these obtaining the routing features for that task; for each recognition task, using an independent Adapter module to obtain the private features for that task; and inputting each private feature into the corresponding task classification head to obtain a recognition result. The scheme addresses two problems in the prior art: with a single voice input, parallel multi-task recognition of heart failure voice data cannot be achieved stably and consistently, and interference between tasks makes the recognition results unbalanced and unreliable.

Owner:UNIV OF SCI & TECH BEIJING
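The task-specific gating step in the abstract above can be pictured with a small sketch. The abstract does not say how the gating weights are computed, so everything concrete here is an assumption: a sigmoid gate over the concatenated task embedding and shared features, random untrained weights, and the names `make_gate_weights` and `route`.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_gate_weights(dim_shared, dim_task, seed=0):
    """Random, untrained gate weights (hypothetical; a real model would learn them)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim_task + dim_shared)]
            for _ in range(dim_shared)]


def route(shared, task_embedding, weights):
    """Gate the shared features with a weight vector conditioned on the task."""
    joint = list(task_embedding) + list(shared)
    gate = [sigmoid(sum(w * v for w, v in zip(row, joint))) for row in weights]
    routed = [g * s for g, s in zip(gate, shared)]
    return gate, routed


shared = [0.8, -1.2, 0.3, 2.0]  # stand-in for the shared encoder output
task_embeddings = {"task_a": [1.0, 0.0], "task_b": [0.0, 1.0]}
weights = make_gate_weights(len(shared), 2)
routed = {name: route(shared, emb, weights)[1]
          for name, emb in task_embeddings.items()}
print(routed)
```

Each task sees the same shared features scaled differently, which is one way to limit interference between tasks before the per-task Adapter modules and classification heads.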

Deep learning-based adaptive speech recognition system

Status: Pending; Publication number: CN122337201A; Topics: Feature extraction, Speech code

This invention discloses a deep learning-based adaptive speech recognition system, relating to the field of speech recognition technology. The system processes raw speech signal data acquired by a speech acquisition module and dynamically adapts it to user identification information to obtain speech coding feature data. Furthermore, it extracts features from environmental metadata to obtain environmental embedding feature data. Based on the environmental embedding feature data and the speech coding feature data, feature recognition processing is performed to calculate recognition feature coefficients. The speech recognition module compares these coefficients with preset recognition feature thresholds and determines the speech quality based on the comparison results. This enables recognition under dynamically changing user and environmental conditions, improving the accuracy of speech recognition.

Owner:IANGSU COLLEGE OF ENG & TECH

Information processing device, information processing method, computer program, learning device, remote conference system, and support device

Status: Pending; Publication number: US20260204275A1; Topics: Information processing, Supervised learning

Provided is an information processing device that performs processing related to speech conversion of speech that is not uttered normally and does not include pitch information, such as a whisper or faint speech. The information processing device includes a speech-to-unit encoder that generates an acoustic unit from a speech waveform, and a unit-to-speech decoder that reconstructs a speech waveform from an acoustic unit. The unit-to-speech decoder is subjected to preliminary learning by self-supervised learning of a Masked Language Model type using normal speech and whispers, without text labels, of a specific speaker to generate an acoustic unit common to normal speech and whispers, the acoustic unit being a latent representation in which the difference between normal speech and whispers is absorbed.

Owner:SONY GROUP CORP

A voice-driven three-dimensional virtual figure complex emotional facial motion generation method

Status: Pending; Publication number: CN122265483A; Topics: Biological models, Animation, Medicine

The application discloses a speech-driven method for generating complex emotional facial actions of a three-dimensional virtual image, and relates to the technical field of virtual image animation. The method comprises the following steps: receiving a speech segment and an emotion guide image as input, and extracting speech features and emotion coding features through a speech coding module and an emotion coding module respectively; randomly sampling a fixed-length noise sequence and splicing it with the time step embedding and the processed default shape parameters to form an initial noise input; and, based on a conditional diffusion model, guiding the diffusion process with the speech features and the emotion coding features as conditions in turn, to generate a three-dimensional facial expression animation whose lip shapes are synchronized with the speech and whose emotion is consistent with the input image. The method can handle complex emotion scenes and express mixed emotions and hidden emotion characteristics, and quantitative evaluation results show that the method performs well.

Owner:BEIJING INST OF TECH

Speech recognition methods, speech recognition systems, computer equipment and storage media

Status: Active; Publication number: CN116959424B; Topics: Avoid collecting, Accurate identification, Speech recognition, Speech code, Speech classification

This application provides a speech recognition method, a speech recognition system, a computer device, and a storage medium, belonging to the field of financial technology. The method includes: inputting target speech with a preset emotion category into a pre-trained multi-task speech recognition model; encoding the target speech using a first speech coding sub-model to obtain initial speech features; performing speech attention processing on the initial speech features using a first attention sub-model to obtain first target attention features; encoding the initial speech features using a second speech coding sub-model to obtain hidden speech features; performing hidden attention processing on the first target attention features and the hidden speech features using a second attention sub-model to obtain second target attention features; and performing speech classification on the second target attention features using a multi-task classification sub-model to obtain a target speech label. This application embodiment can improve the recognition accuracy of multi-task speech recognition.

Owner:PING AN TECH (SHENZHEN) CO LTD

A system and method for processing audio data

Status: Pending; Publication number: CN122313998A; Topics: Data processing system, Noise

This invention relates to the field of speech coding technology, specifically to an audio data processing system and method. The system includes an audio signal framing module, a spectrum threshold routing module, a threshold balance calibration module, a peak link recognition module, and a curve audio correction module. In this invention, after the audio signal is framed and transformed, the amplitude is compared with adjacent differences to trigger dynamic adjustments, enabling fine correction of spectral abrupt change regions. Local threshold adaptive reconstruction maintains frequency band energy coordination, half-value iterative compensation suppresses single-point deviations and stabilizes the energy structure, and inverse-range weighted calculation of peak positions makes frequency band transitions smoother, reducing energy abrupt changes. Offset correction adjusts the amplitude distribution based on the predicted curve difference, optimizing spectral continuity and residual smoothness, and overall improving transient fidelity and low-noise balance, allowing compressed audio to remain clear and natural in complex scenes.

Owner:NANJING CODE NOTE NETWORK TECH CO LTD

Method and device for speech synthesis

Status: PCT designated stage; Publication number: WO2026151373A1; Topics: Acoustics, Speech synthesis

The present disclosure provides a method and a device for speech synthesis, wherein the method includes obtaining a target text in a speech synthesis request; performing preset processing on the target text to obtain to-be-synthesized input information, wherein the to-be-synthesized input information includes information obtained by splicing the target text and historical information, or encoded combination information obtained by parsing the text encoding of the target text and combining it with a fixed speech encoding and/or a model output speech encoding; and inputting the to-be-synthesized input information into a pre-trained speech synthesis model, so that the speech synthesis model outputs target synthesized speech corresponding to the target text in combination with the text-associated content of the to-be-synthesized input information. This ensures that the information input to the speech synthesis model contains text-associated content, and also reduces the delay between the target text input and the target synthesized speech output, which helps improve the accuracy and efficiency of speech synthesis.
