56 results about "Prosody" patented technology

In linguistics, prosody is concerned with those elements of speech that are not individual phonetic segments (vowels and consonants) but are properties of syllables and larger units of speech, including linguistic functions such as intonation, tone, stress, and rhythm. Such elements are known as suprasegmentals.

Multi-lingual speech synthesis

A method for speech synthesis of a word in a first language, comprising dividing the word into a first sequence of pronunciation phonemes in the first language, mapping the first phoneme sequence to a second sequence of pronunciation phonemes in at least one second language, and generating an audio output of the phonemes in the second phoneme sequence using prosody models adapted for the at least one second language. According to this method, an audio output of a word in a first language can be generated by a speech synthesizing engine not having actual support for this language. Instead, the pronunciation phonemes of the word are mapped onto phonemes of at least one second language, for which the speech synthesizing engine does have support.
Owner:NOKIA CORP
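
The phoneme-mapping step described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the table `PHONEME_MAP`, its entries, and the function name `map_phonemes` are hypothetical, and passing unmapped phonemes through unchanged is a simplification of this sketch.

```python
# Hypothetical table: phonemes of an unsupported first language mapped to the
# closest phonemes of a supported second language. Entries are illustrative.
PHONEME_MAP = {
    "y": ["i"],        # one-to-one approximation
    "ø": ["e"],
    "ts": ["t", "s"],  # one phoneme may map to a short sequence
}


def map_phonemes(first_seq, table=PHONEME_MAP):
    """Map a first-language phoneme sequence onto second-language phonemes.

    Phonemes without a table entry are assumed to exist in both languages
    and are passed through unchanged (an assumption of this sketch).
    """
    second_seq = []
    for phoneme in first_seq:
        second_seq.extend(table.get(phoneme, [phoneme]))
    return second_seq
```

The resulting second sequence would then be handed to a synthesizer that supports the second language, together with that language's prosody models.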

Speech synthesis apparatus and speech synthesis method

The present invention includes: a characteristic parameter DB 106 that holds, with respect to each speech-unit, speech-unit data indicating a loan word attribute and acoustic characteristics; a language analysis unit 104 and a prosody prediction unit 109 that obtain text data and respectively predict a loan word attribute and acoustic characteristics of each of a plurality of speech-units that form text indicated by the text data; a speech-unit selection unit 108 that selects, from the characteristic parameter DB 106, speech-unit data that represents the loan word attribute and the acoustic characteristics similar to the predicted loan word attribute and acoustic characteristics of each speech-unit; and a speech synthesis unit 110 that generates synthesized speech using a plurality of the selected speech-units and outputs the synthesized speech.
Owner:PANASONIC CORP

System for tuning synthesized speech

An embodiment of the invention is a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio. Provisions are made to create, view, play, and edit the synthesized speech, including editing pitch and duration targets, speaking type, paralinguistic events, and prosody. Prosody can be provided by way of a sample recording. Users can interact with the software tool by way of a graphical user interface (GUI). The software tool can produce synthesized audio file output in many file formats.
Owner:CERENCE OPERATING CO

Confirmation system for command or speech recognition using activation means

A system and method for confirming command or speech recognition results returned by an automatic speech recognition (ASR) engine from a command issued by an operator of a vehicle or platform, such as an aircraft or unmanned air-vehicle (UAV). The operator transmits a command signal to the ASR engine, initiated by an activation means, such as a push-button (formally known as push-to-talk or push-to-recognize). A recognition result is communicated to the user and the system awaits the confirmation for a limited period of time. During this period, in one embodiment, a low tone with high prosody is played to notify the user that the system is ready to receive the confirmation. If the user quickly presses and releases the push-button a predetermined number of times (for instance, twice to make a double-click), the result is confirmed and the ASR forwards a command signal to a system controlled thereby. Otherwise, the ASR waits for another speech command.
Owner:ADACEL
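
The confirmation window described above can be illustrated with a small check on button-press timestamps. This is a hedged sketch, not the patented system: the function name `is_confirmed`, the 3-second default window, and the default of two clicks are assumptions chosen for the example, and the sketch ignores how long each press is held.

```python
def is_confirmed(press_times, window_start, window_s=3.0, required_clicks=2):
    """Return True if exactly `required_clicks` presses fall inside the window.

    press_times  -- timestamps (seconds) of push-button press-and-release events
    window_start -- time at which the recognition result was presented
    window_s     -- hypothetical length of the confirmation window in seconds
    """
    window_end = window_start + window_s
    clicks = [t for t in press_times if window_start <= t <= window_end]
    return len(clicks) == required_clicks
```

If the check fails, the system would fall back to waiting for another speech command, as the abstract describes.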

Method and system for adjusting the voice prompt of an interactive system based upon the user's state

Active · US7881934B2 · Enhance better driving · Promote alertness · Speech recognition · Speech synthesis · Speech sound · Signal processing
The voice prompt of an interactive system is adjusted based upon a state of a user. An utterance of the user is received, and the state of the user is determined based upon signal processing of the utterance of the user. Once the state of the user is determined, the voice prompt is adjusted by adjusting at least one of a tone of voice of the voice prompt, a content of the voice prompt, a prosody of the voice prompt, and a gender of the voice prompt based upon the determined state of the user.
Owner:TOYOTA INFOTECHNOLOGY CENT CO LTD

Method and apparatus for preventing speech comprehension by interactive voice response systems

A method and apparatus utilizing prosody modification of a speech signal output by a text-to-speech (TTS) system to substantially prevent an interactive voice response (IVR) system from understanding the speech signal without significantly degrading the speech signal with respect to human understanding. The present invention involves modifying the prosody of the speech output signal by using the prosody of the user's response to a prompt. In addition, a randomly generated overlay frequency is used to modify the speech signal to further prevent an IVR system from recognizing the TTS output. The randomly generated frequency may be periodically changed using an overlay timer that changes the random frequency signal at predetermined intervals.
Owner:NUANCE COMM INC

Prosody conversion

A contour for a syllable (or other speech segment) in a voice undergoing conversion is transformed. The transform of that contour is then used to identify one or more source syllable transforms in a codebook. Information regarding the context and / or linguistic features of the contour being converted can also be compared to similar information in the codebook when identifying an appropriate source transform. Once a codebook source transform is selected, an inverse transformation is performed on a corresponding codebook target transform to yield an output contour. The corresponding codebook target transform represents a target voice version of the same syllable represented by the selected codebook source transform. The output contour may be further processed to improve conversion quality.
Owner:WSOU INVESTMENTS LLC

Computerized speech synthesizer for synthesizing speech from text

Disclosed are novel embodiments of a speech synthesizer and speech synthesis method for generating human-like speech wherein a speech signal can be generated by concatenation from phonemes stored in a phoneme database. Wavelet transforms and interpolation between frames can be employed to effect smooth morphological fusion of adjacent phonemes in the output signal. The phonemes may have one prosody or set of prosody characteristics and one or more alternative prosodies may be created by applying prosody modification parameters to the phonemes from a differential prosody database. Preferred embodiments can provide fast, resource-efficient speech synthesis with an appealing musical or rhythmic output in a desired prosody style such as reportorial or human interest. The invention includes computer-determining a suitable prosody to apply to a portion of the text by reference to the determined semantic meaning of another portion of the text and applying the determined prosody to the text by modification of the digitized phonemes. In this manner, prosodization can effectively be automated.
Owner:LESSAC TECH INC

Method For Adding Realism To Synthetic Speech

The present disclosure provides a method for adding realism to synthetic speech. The method includes receiving text (218) that is to be converted into synthetic speech from a mobile device (108). The text (218) may include embedded emoticons indicating a first prosody information and a predefined sound stored in a stored data repository (208). The method also includes identifying a user associated with the text (218) based on a comparison between metadata associated with the text (218) and user profiles stored in the stored data repository (208), and retrieving a speech font from a speech data corpus associated with the user stored in the stored data repository (208). The speech font includes a second prosody information and a predefined accent of the user. The method further includes converting the text (218) into synthetic speech based on the retrieved speech font, which is modulated based on the emoticon.
Owner:CLEARONCE COMM INC

Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature

The present disclosure relates to a text-to-speech synthesis method using machine learning based on a sequential prosody feature. The text-to-speech synthesis method includes receiving input text, receiving a sequential prosody feature, and generating output speech data for the input text reflecting the received sequential prosody feature by inputting the input text and the received sequential prosody feature to an artificial neural network text-to-speech synthesis model.
Owner:NEOSAPIENCE INC

Prosody Generation Using Syllable-Centered Polynomial Representation of Pitch Contours

The present invention discloses a parametric representation of prosody based on polynomial expansion coefficients of the pitch contour near the center of each syllable. These syllable pitch expansion coefficients are generated from a recorded speech database, read from a number of sentences by a reference speaker. By correlating the stress level and context information of each syllable in the text with the polynomial expansion coefficients of the corresponding spoken syllable, a correlation database is formed. To generate prosody for an input text, the stress level and context information of each syllable in the text are identified. The prosody is generated by using the correlation database to find the best set of pitch parameters for each syllable. By adding these to global pitch contours and using interpolation formulas, a complete pitch contour for the input text is generated. Duration and intensity profiles are generated using a similar procedure.
Owner:THE TRUSTEES OF COLUMBIA UNIV IN THE CITY OF NEW YORK
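
As a rough illustration of the syllable-centered representation above, the sketch below evaluates a polynomial expansion around a syllable center and adds it to a global contour. Everything here is an assumption for illustration: the function names, the linear declination used as the global contour, and its default values in Hz are hypothetical, and the abstract's interpolation between syllables is not modeled.

```python
def syllable_pitch(coeffs, center, t):
    """Evaluate sum_k coeffs[k] * (t - center)**k, the local pitch offset in Hz."""
    dt = t - center
    return sum(c * dt ** k for k, c in enumerate(coeffs))


def global_pitch(t, start_hz=120.0, slope_hz_per_s=-10.0):
    """Hypothetical global contour: a straight-line declination over the utterance."""
    return start_hz + slope_hz_per_s * t


def pitch_at(t, coeffs, center):
    """Pitch near one syllable: global contour plus the syllable's local expansion."""
    return global_pitch(t) + syllable_pitch(coeffs, center, t)
```

For example, with coefficients `[5.0, 2.0, -1.0]` centered at 1.0 s, the local offset at the center is simply the zeroth coefficient, 5.0 Hz.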

Method and System for Preventing Speech Comprehension by Interactive Voice Response Systems

A method of and system for generating a speech signal with an overlaid random frequency signal using prosody modification of a speech signal output by a text-to-speech (TTS) system to substantially prevent an interactive voice response (IVR) system from understanding the speech signal without significantly degrading the speech signal with respect to human understanding. The present invention involves modifying a prosody of the speech output signal by using a prosody of the user's response to a prompt. In addition, a randomly generated overlay frequency is used to modify the speech signal to further prevent the IVR system from recognizing the TTS output. The randomly generated frequency may be periodically changed using an overlay timer that changes the random frequency signal at predetermined intervals.
Owner:NUANCE COMM INC

Training method and device for prosody model used for speech synthesis

Active · CN104867491A · Improve accuracy · Pause smoothly and naturally · Speech synthesis · Speech sound
The invention discloses a training method and device for a prosody model used for speech synthesis, wherein the training method for the prosody model used for speech synthesis comprises the following steps: S1, extracting textual features and marker features corresponding to participles from a training corpus text; S2, generalizing the participles in the training corpus text on the basis of Chinese thesaurus; S3, training the prosody model according to the textual features, the marker features and the generalized participles. According to the training method and device for the prosody model used for speech synthesis, by extracting the textual features and marker features corresponding to participles from the training corpus text, generalizing the participles in the training corpus text on the basis of Chinese thesaurus and then training the prosody model according to the textual features, the marker features and the generalized participles, the prosody model is more perfect, and further the prosody prediction accuracy is improved.
Owner:BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

Prosody generating device, prosody generating method, and program

A prosody generation apparatus capable of suppressing distortion that occurs when generating prosodic patterns, and therefore of generating a natural prosody, is provided. A prosody changing point extraction unit in this apparatus extracts a prosody changing point located at the beginning and the ending of a sentence, the beginning and the ending of a breath group, an accent position and the like. A selection rule and a transformation rule of a prosodic pattern including the prosody changing point are generated by means of a statistical or learning technique, and the rules thus generated are stored beforehand in a representative prosodic pattern selection rule table and a transformation rule table. A pattern selection unit selects a representative prosodic pattern from the representative prosodic pattern selection rule table according to the selection rule. A prosody generation unit transforms the selected pattern according to the transformation rule and carries out interpolation with respect to portions other than the prosody changing points so as to generate prosody as a whole.
Owner:SOVEREIGN PEAK VENTURES LLC

System and method for cross-speaker style transfer in text-to-speech and training data generation

Systems are configured for generating spectrogram data characterized by a voice timbre of a target speaker and a prosody style of a source speaker by converting a waveform of source speaker data to phonetic posterior gram (PPG) data, extracting additional prosody features from the source speaker data, and generating a spectrogram based on the PPG data and the extracted prosody features. The systems are configured to utilize / train a machine learning model for generating spectrogram data and for training a neural text-to-speech model with the generated spectrogram data.
Owner:MICROSOFT TECH LICENSING LLC

Electronic apparatus and method for controlling thereof

An electronic apparatus, based on a text sentence being input, obtains prosody information of the text sentence, segments the text sentence into a plurality of sentence elements, obtains a speech in which prosody information is reflected to each of the plurality of sentence elements in parallel by inputting the plurality of sentence elements and the prosody information of the text sentence to a text to speech (TTS) module, and merges the speech for the plurality of sentence elements that are obtained in parallel to output speech for the text sentence.
Owner:SAMSUNG ELECTRONICS CO LTD

Speech synthesis apparatus and method

The present disclosure relates to a speech synthesis apparatus and method that can remove discontinuity between phoneme units when generating a synthesized sound from the phoneme units, thereby implementing natural utterances and producing a high-quality synthesized sound having stable prosody.
Owner:SK TELECOM CO LTD

Prosody model training method and device thereof

The invention relates to a prosody model training method and a device thereof. The method comprises the following steps: receiving a training corpus containing prosodic annotation information; inputting the training corpus into a prosody model to be trained to obtain a prosody output result; and training network parameters of the prosody model to be trained according to the prosody output result and / or the prosodic annotation information to obtain a target prosody model. Through the technical scheme of the invention, the target prosody model is a personalized prosody model with relatively high adaptability and precision, and the annotation universality can be better learned from training data from different sources, so that the prediction accuracy of prosodic word boundaries and prosodic phrase boundaries and the robustness of the prosody model can be improved.
Owner:BEIJING UNISOUND INFORMATION TECH +1

Voice synthetic method and device, dictionary constructional method and computer ready-read medium

A plurality of tasks of a speech synthesizing process, which differ in at least one of the speakers, the emotion or situation at the time when speeches are made, and the contents of the speeches, are set (s1); word dictionaries, prosody dictionaries, and waveform dictionaries corresponding to the respective tasks are organized (s2); and when a character string to be synthesized is input with the task specified through a game system, etc., a speech synthesizing process is performed using the word dictionary, the prosody dictionary, and the waveform dictionary corresponding to the specified task (s3). Therefore, a speech message can be generated depending on the personality of a speaker, the emotion or situation at the time when a speech is made, and the contents of the speech.
Owner:KONAMI DIGITAL ENTERTAINMENT CO LTD +1

Speech synthesis system

Inactive · US20110196680A1 · Preventing excessive deterioration in degree of naturalness of the synthesized speech · Speech synthesis · Acoustics
When a system (100) is used for synthesizing speech having prosody serving as a reference, the system stores speech element information representing a speech element capable of synthesizing speech having a degree of naturalness indicating a degree of similarity to speech uttered by a human higher than a predetermined reference value (speech element information storage (115)). The system accepts requested prosody information representing prosody requested by the user (requested prosody information accepting part (113)). The system generates intermediate prosody information representing intermediate prosody between the reference prosody and the requested prosody (intermediate prosody information generator (114)). The system executes a speech synthesis process to synthesize speech based on the generated intermediate prosody information and the stored speech element information (speech synthesizer (116)).
Owner:NEC CORP
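
One simple way to picture "intermediate prosody" between a reference and a requested prosody is linear interpolation of matching prosody values. The abstract does not say how the intermediate prosody is computed, so the linear blend, the function name `intermediate_prosody`, and the `weight` parameter below are all assumptions of this sketch.

```python
def intermediate_prosody(reference, requested, weight=0.5):
    """Blend two equal-length prosody tracks (for example, pitch targets in Hz).

    weight=0.0 returns the reference prosody, weight=1.0 the requested prosody.
    Linear interpolation is an assumption of this sketch, not the patented rule.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if len(reference) != len(requested):
        raise ValueError("prosody tracks must have the same length")
    return [(1.0 - weight) * r + weight * q for r, q in zip(reference, requested)]
```

A smaller `weight` keeps the result closer to the reference prosody, which is the prosody the stored speech elements are known to render naturally.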

Hybrid predictive model for enhancing prosodic expressiveness

Systems and methods for prosody prediction include extracting features from runtime data using a parametric model. The features from runtime data are compared with features from training data using an exemplar-based model to predict prosody of the runtime data. The features from the training data are paired with exemplars from the training data and stored on a computer readable storage medium.
Owner:IBM CORP

Depression Auxiliary Detection Method and Classifier Based on Acoustic Features and Sparse Mathematics

The invention belongs to the technical field of voice processing and image processing, and discloses an auxiliary detection method and a classifier for depression based on acoustic features and sparse mathematics, in which depression is discriminated by joint recognition of voice and facial emotion. The glottal signal is estimated through an inverse filter; global analysis is applied to the voice signal, feature parameters are extracted, and the timing and distribution characteristics of the feature parameters are analyzed to find the prosody of different emotional voices as the basis for emotion recognition. MFCC is used as the feature parameter to analyze the voice signal to be processed, multiple sets of training data are collected from the recorded data, and a neural network model is established for discrimination. A sparse linear combination of test samples is obtained using a sparse representation algorithm based on OMP, the facial emotions are discriminated and classified, and the results are linearly combined with the speech recognition results to obtain the final probability for each data point. The depression recognition rate is greatly improved and the cost is low.
Owner:NORTHWEST UNIV +1

Systems and methods for integrating recorded content

It would be desirable to have audio and / or video systems and processing tools that can automatically record audio / video and analyze such recordings to capture material that may be relevant to the user. In one or more embodiments disclosed herein, recordings may be compressed by using one or more tools, including but not limited to: converting speech to text and searching for relevant content, keywords, etc.; detecting and analyzing speakers and / or classifying content (e.g., events in a conversation); removing non-substantial content (e.g., silence and other extraneous content); adjusting audio to increase playback speed; using rhythm and other markers in audio to identify areas of interest; performing segmentation clustering; using pseudo-random or random samples to select content; and other methods of extracting information to provide a summary or representation of recorded content for user review.
Owner:BAIDU USA LLC

Audio detection method, device, electronic device and readable storage medium

The present application relates to the technical field of information processing, and discloses an audio detection method, device, electronic equipment, and a readable storage medium. The audio detection method includes: receiving the audio to be detected and the text corresponding to the audio sent by the terminal; aligning the audio with the text to obtain the start and end time of each of the multiple phonemes corresponding to the text in the audio; extracting the phoneme feature vector of each phoneme in the audio, and obtaining the audio sequence feature of the audio based on the start and end time of each phoneme; obtaining the prosody detection result of the audio from the phoneme feature vector and the audio sequence feature, where the prosody detection result includes the accent feature and the pause feature of the audio; and returning the prosody detection result to the terminal, so that the terminal displays the text corresponding to the accent feature and the pause feature. The audio detection method provided by the present application can improve the accuracy of prosody detection results.
Owner:TENCENT TECH (SHENZHEN) CO LTD

Clockwork hierarchical variational encoder

A method (400) for representing an intended prosody in synthesized speech (152) includes receiving a text utterance (320) having at least one word (250), and selecting an utterance embedding (260) for the text utterance. Each word in the text utterance has at least one syllable (240) and each syllable has at least one phoneme (230). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features (232) of each phoneme of the syllable with a corresponding prosodic syllable embedding (245) for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames (280) based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Owner:GOOGLE LLC
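
The last step of the abstract above, turning a predicted syllable duration and pitch contour into fixed-length pitch frames, can be pictured with the following sketch. It does not reproduce the encoder itself: the function name `pitch_frames`, the 5 ms default frame length, the midpoint sampling, and the `contour` callable standing in for the predicted pitch contour are all assumptions.

```python
import math


def pitch_frames(duration_ms, contour, frame_ms=5):
    """Split a syllable of `duration_ms` into fixed-length frames of `frame_ms`.

    `contour` is a hypothetical callable mapping time in ms (from the syllable
    start) to pitch; each frame stores the contour value at its midpoint.
    """
    n_frames = max(1, math.ceil(duration_ms / frame_ms))
    return [contour((i + 0.5) * frame_ms) for i in range(n_frames)]
```

The number of frames depends only on the predicted duration, so a longer syllable yields more frames of the same length.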

Prosodic labeling method, device and equipment

The invention provides a prosodic labeling method, device and equipment. The method comprises the following steps: voice data of a to-be-labeled text is obtained; prosody information of the voice data is determined from the voice data, the prosody information indicating the pause duration of the voice data; and prosodic symbols of the to-be-labeled text are labeled according to the prosody information of the voice data. The method improves the efficiency and accuracy of prosodic labeling.
Owner:北京海天瑞声科技股份有限公司
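
Labeling text from pause durations, as the abstract above describes, might look like the sketch below. The thresholds, the `#1`/`#2`/`#3` symbols, and the function name `label_prosody` are hypothetical choices for illustration; the abstract does not specify how pause durations map to symbols.

```python
def label_prosody(words, pauses_ms,
                  thresholds=((300, "#3"), (100, "#2"), (0, "#1"))):
    """Append a prosodic symbol to each word based on the pause that follows it.

    `thresholds` is a hypothetical list of (minimum pause in ms, symbol) pairs,
    ordered from the longest pause to the shortest.
    """
    if len(words) != len(pauses_ms):
        raise ValueError("need one pause duration per word")
    labeled = []
    for word, pause in zip(words, pauses_ms):
        # Pick the first threshold the pause reaches; fall back to the weakest symbol.
        symbol = next((sym for limit, sym in thresholds if pause >= limit),
                      thresholds[-1][1])
        labeled.append(word + symbol)
    return " ".join(labeled)
```

In practice the pause durations would come from the voice data, for instance from a forced alignment of the audio with the text.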