Representation of the speech apparatus in an articulatory feature space
The articulatory feature space in speech synthesis addresses the lack of interpretability in existing techniques by mapping speech production to anatomical properties, enabling adaptable and explainable speech synthesis and compensation for speech impairments.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ALTAVO GMBH
- Filing Date
- 2025-11-20
- Publication Date
- 2026-06-18
AI Technical Summary
Existing speech synthesis techniques lack explainability and modifiability, particularly in machine-learned models, as they often rely on acoustic features that are not directly interpretable or physically correlated with the speech apparatus.
Utilize an articulatory feature space defined by parameters of a source-filter forward model, which maps speech production to anatomical properties of the vocal tract, enabling comprehensive characterization and modification of speech synthesis.
The articulatory feature space allows for explainable and adaptable speech synthesis, accommodating anatomical and linguistic modifications, and compensating for speech impairments, while being speaker-independent.
Figure EP2025083623_18062026_PF_FP_ABST
Abstract
Description
[0001] DESCRIPTION
[0002] REPRESENTATION OF THE SPEECH APPARATUS IN ARTICULATORY FEATURE SPACE
[0003] TECHNICAL FIELD
[0004] Several examples of the disclosure concern the use of an articulatory feature space in connection with speech-related applications, such as speech synthesis, user feedback during a speech utterance, or the diagnosis of pathological conditions related to the speech apparatus.
[0005] BACKGROUND
[0006] Speech synthesis is situated within the field of speech analysis and processing. It is used in various applications, including text-to-speech (TTS) and speech-to-speech (SSP) applications. Techniques also exist for synthesizing speech by measuring the articulatory activity of the vocal tract (for example, with a camera or radar sensor). See, e.g., EP 4 139 917 A1. One example of such vocal tract measurements is so-called "silent speech," in which the vocal tract performs articulation without phonation, i.e., without the excitation of acoustic vibrations by the vocal cords. The lack of phonation can be pathological or intentional, for example, to protect privacy in public spaces.
[0007] Speech synthesis is often implemented algorithmically using machine-learned models. A typical data processing pipeline then includes an encoding model (also called an acoustic model) that maps input data (e.g., text in TTS applications or measurements of the vocal tract) into a specific feature space. A decoding model then generates a synthetic speech utterance based on the feature vectors in the feature space. A concrete example in the context of TTS speech synthesis is receiving text as input data, from which an acoustic model generates frequency-resolved acoustic spectrograms, for example, Mel spectrograms. A vocoder is then applied to these acoustic spectrograms to generate synthetic speech, i.e., a waveform in the time domain that can be reproduced acoustically. See, for example, Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018. Mel spectrograms have a physical meaning, namely the frequency spectrum of the speech utterance as a function of time. Such acoustic spectrograms are therefore interpretable or "explainable." Machine-learned techniques are also conceivable in which the features in the feature space between the encoding model and the decoding model are entirely machine-learned and thus are not readily interpretable or exhibit no direct physical, technical, or biological correlation.
[0008] SUMMARY
[0009] The object of the present invention is to provide improved techniques for speech analysis and processing. In particular, it is an object of the present invention to provide techniques that are explainable and modifiable.
[0010] This object is achieved by the features of the independent patent claims. The features of the dependent patent claims define embodiments.
[0011] Several disclosed techniques involve speech analysis and processing based on feature vectors generated by a machine-learned model. These features are used within a specific feature space. This feature space is defined by features that correspond to certain parameters of a source-filter forward model for a speech apparatus. The feature space can therefore also be referred to as an articulatory feature space. Thus, the articulatory feature space is interpretable and has a direct physical-anatomical correspondence in certain parameters or properties of the speech apparatus.
[0012] A source-filter forward model uses aeroacoustic simulation to model how sounds are generated in the vocal tract. One-dimensional, two-dimensional, or three-dimensional simulations are possible. For example, a finite element simulation can be used for the fluid mechanics of the vocal tract. In principle, the source-filter forward model can include both dynamic and static parameters. Examples of static parameters are jaw opening at rest or a specific jaw angle at rest. Other static parameters include, for example, the area of a proximal tongue region or the area of a distal tongue region. The dynamic parameters can then describe specific movements of anatomical points or regions within the vocal tract. For example, movements around a rest point or rest area defined by the corresponding static parameter could be described by dynamic parameters.
[0013] The speech apparatus comprises the larynx with its vocal cords and the vocal tract. The source-filter forward model posits that speech production occurs through two main processes: sound generation and sound shaping. The "source" of the model typically refers to the vocal cords in the larynx, which generate sound waves when set into vibration by the airflow from the lungs. These vibrations have a fundamental frequency that constitutes the fundamental tone of the voice. The "filter" of the model is formed by the vocal tract, which includes the space above the vocal cords, encompassing the pharynx, mouth, and nasal cavity. Movements of the jaw, tongue, and lips alter the shape of these cavities, modifying the sound waves by amplifying or attenuating certain frequencies. This results in the formation of different speech sounds. The "forward model" describes how the vibrations of the vocal cords are modified by the vocal tract to produce the audible sounds that we perceive as speech. Below are some exemplary parameters associated with such a source-filter model of the speech apparatus. All such parameters are candidates for defining features in the articulatory feature space. For example, movements of specific anatomical reference points in the vocal tract can be captured. For instance, a pharyngeal movement or jaw opening could be captured. The movement and/or shape of the tongue could be captured. Other examples include the lateral width of the larynx (as a source parameter) and of the pharynx (as a filter parameter). Exemplary parameters of the vocal tract are described, for example, in: Birkholz, Peter, Dietmar Jäckel, and Bernd J. Kröger. "Construction and control of a three-dimensional vocal tract model." 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. Vol. 1. IEEE, 2006. A three-dimensional model of the vocal tract is used there, although a two-dimensional or one-dimensional model can also be used in the various examples. Another example of a source-filter model is described in Öhman, Sven EG. "Numerical model of coarticulation." The Journal of the Acoustical Society of America 41.2 (1967): 310-320.
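As a point of orientation, the source-filter idea can be written compactly in its classical linear form. This is a simplifying assumption: the aeroacoustic simulations mentioned above can be considerably more detailed, and the symbols below are illustrative and not taken from the disclosure.

```latex
P(f) = U(f)\, T(f)\, R(f)
```

Here U(f) stands for the spectrum of the source (the vibrating vocal cords), T(f) for the transfer function of the filter (the vocal tract, whose shape depends on jaw, tongue, and lip configuration), and R(f) for the radiation characteristic at the mouth opening. P(f) is the resulting spectrum of the radiated speech sound.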
[0014] It is evident from the above that the parameters of the source-filter forward model of the speech apparatus have a direct correspondence in the anatomy or movement of the speech apparatus. Thus, the features predicted by the machine-learned model in the articulatory feature space are readily explainable and interpretable. However, unlike, for example, Mel spectrograms, they do not (directly) capture acoustic properties of speech utterances; rather, they relate to physical-anatomical properties of the speech apparatus. This enables diverse and expanded applications in the field of speech analysis and processing, for example, in speech synthesis.
[0015] A data processing device is disclosed.
[0016] The data processing device comprises at least one processor and a memory. The at least one processor is configured to load program code from the memory and execute it. Based on the program code, the at least one processor receives one or more measurement data streams for a person's speech apparatus. These are received for a common recording period.
[0017] Based on the program code, the at least one processor further processes the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors. The features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus.
[0018] In various examples, it would be conceivable to use more than one measurement modality to record multiple data streams simultaneously during the recording period. However, it would also be possible to use only a single measurement modality.
[0019] One example of a measurement modality is an audio recording of a person's speech utterance during the recording period. This could, for example, enable speech-to-speech applications. Alternatively or additionally, one or more articulatory measurement modalities could be used.
[0020] For example, an image-based measurement modality could be used, which captures and analyzes camera images of the person's lips or oral cavity. Other measurement modalities include radar measurements, ultrasound measurements, and electromyography.
[0021] Examples of articulatory measurement modalities include: optical lip reading, ultrasound imaging, electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), strain gauges; high-frequency measurements, for example in the UWB range, especially frequency-modulated continuous wave radar (FMCW) that transmits through the oral cavity.
[0022] In principle, the various techniques can work particularly well when different properties of the speech apparatus, especially the vocal tract, are comprehensively measured. Therefore, in several of the techniques described herein, it can be advantageous to use articulatory measurement modalities that can specifically measure properties of the oral cavity and pharynx of the vocal tract. Examples of such articulatory measurement modalities are radar and ultrasound measurements. These can be performed using extracorporeal sensors, which are attached to the skin surface, for example, with a medical patch. At the same time, it can be beneficial to combine such articulatory measurement modalities, which measure properties within the vocal tract, such as in the oral cavity, with other articulatory measurement modalities that measure lip and/or tongue movement. This can include, in particular, camera-based techniques. In this way, the vocal tract can be comprehensively characterized. Specifically, it has been shown that radar measurements make it possible to comprehensively measure large areas of the vocal tract. For example, jaw opening, tongue positioning, and lip movement can be measured.
[0023] For each measurement modality, a corresponding sequence of feature vectors can be determined. Then, a fusion of the feature vector sequences can be performed in the articulatory feature space. In such a case, several encoding branches can be provided in the machine-learned model, each specifically trained to process the measurement data streams of the respective measurement modality. However, it would also be conceivable to perform a fusion of the measurement data streams before inputting them into the machine-learned model.
[0024] In various examples, features corresponding to dynamic and/or static parameters of the source-filter forward model can be predicted using the machine-learned model. Dynamic parameters can be determined, in particular, from the time dependence of the feature values in the sequence of feature vectors. Static parameters can be determined once at the beginning of a corresponding measurement period or even at the beginning of the respective recording period and then fixed. The dynamic parameters have parameter values that vary over the recording period because the speech apparatus moves. The static parameters characterize static properties of the speech apparatus, and thus the corresponding parameter values are constant over the recording period. A feature can be predicted that corresponds to a parameter of the source of the source-filter model. Alternatively or additionally, a feature corresponding to a parameter of the filter can also be predicted.
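The distinction between static and dynamic parameters can be illustrated with a minimal sketch. It assumes that each feature vector is a dictionary and that static parameters are fixed from the first few frames of the recording period. The feature names, the calibration length, and the use of a median are hypothetical choices for illustration and are not prescribed by the disclosure.

```python
from statistics import median

# Hypothetical feature names; the disclosure does not fix a concrete parameter set.
STATIC_FEATURES = ("jaw_rest_opening",)
DYNAMIC_FEATURES = ("jaw_opening", "tongue_height")


def split_static_dynamic(frames, calibration_frames=3):
    """Fix static parameters once from the first frames; keep dynamic ones per frame.

    frames: list of dicts mapping feature name -> value (one dict per time step).
    Returns (static, dynamic), where static is a single dict that stays constant
    over the recording period and dynamic is a list of per-frame dicts.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    head = frames[:calibration_frames]
    # Assumption: a median over the first frames is a robust one-time estimate.
    static = {name: median(f[name] for f in head) for name in STATIC_FEATURES}
    dynamic = [{name: f[name] for name in DYNAMIC_FEATURES} for f in frames]
    return static, dynamic
```

In this sketch the static dictionary would be determined once and then held fixed, while the dynamic list carries the time dependence of the articulation.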
[0025] For example, different motor states could define different reference ranges of a vocal tract. Such motor states could, for instance, specify a movement trajectory or the extent of the reference ranges.
[0026] It has been found that such features possess three helpful properties: (i) first, these features enable a comprehensive characterization of speech production and thus a natural and accurate speech synthesis; (ii) second, such features are also explainable, i.e., they allow for evaluations, modifications, diagnoses, or other inferences about sound formation within the context of speech utterance; (iii) third, such features are—at least predominantly—speaker-independent. In other words, this means that speech utterances can be transferred into a speaker-independent feature space, regardless of the speaker. This promotes, on the one hand, the trainability of the machine-learned model, because the same machine-learned model can be used for different speakers. Furthermore, this also promotes explainability, because, regardless of the specific speaker, certain feature values have a specific meaning.
[0027] If speaker-dependent speech synthesis is desired, this can be achieved during decoding (i.e., mapping the articulatory feature space into the audio space); for example, vocoders are known that take speaker-specific characteristics (e.g., voice quality or timbre, fundamental frequency, intonation, prosody, articulation speed, nasality, etc.) into account. Examples include WaveNet and Tacotron, which can be individually adapted to each speaker.
[0028] Based on the feature vectors in the articulatory feature space, various applications are possible. Some applications are summarized in Table 1.
[0029]
[0030] Table 1: List of various applications made possible by an articulatory feature space. The different applications can be combined or used in isolation.
[0031] For example, speech synthesis is possible, see Table 1, Example 1. In this process, speech synthesis is performed based on one or more feature values. It is possible that one or more feature values are modified beforehand, so that the speech synthesis is then performed based on these modified values. The speech synthesis generates a synthetic utterance for the person during the recording period. This utterance can be represented, in particular, as an audio waveform within the time domain. The audio waveform could be determined by a vocoder, which receives a spectrogram as input. This spectrogram is then further processed by a decoding branch; the decoding branch maps the articulatory feature space to the audio space.
[0032] By modifying feature values in the feature space, a particular characteristic of speech synthesis can be adjusted based on anatomical specifications. In state-of-the-art techniques, feature values in the feature space can be adjusted, for example, based on acoustic specifications or on mathematical-functional characteristics of the temporal sequence of feature vector values. Examples include, for instance, the smoothing of Mel spectrograms using machine-learned models that operate purely on a pattern-based basis. See, for example, Neekhara, Paarth, et al. "Expediting TTS synthesis with adversarial vocoding." arXiv preprint arXiv:1904.07944 (2019). In contrast, the articulatory feature space allows, in addition to purely pattern-based mathematical or acoustic adjustment of feature values, a modification of feature values based on anatomical dependencies. Modifying feature values in the articulatory feature space offers certain advantages compared to modifying feature values in the acoustic feature space or even in a machine-learned feature space. In contrast to modifications in a machine-learned feature space, features defined in the articulatory feature space are explainable and can therefore be adjusted in a physically informed manner, going beyond purely mathematical-functional adaptation. Physically informed modification of features is also possible in the acoustic feature space, for example, by smoothing spectrograms. However, the various characteristics of sound production and speech formation are superimposed in the spectrograms, meaning that individual aspects of sound production and speech formation often cannot be addressed in isolation within spectrograms. In contrast, the articulatory feature space directly maps sound formation and speech production, allowing certain corresponding properties to be adjusted in the vocal tract based on a corresponding aeroacoustic understanding of sound formation and speech production. This enables a wide range of modifications that are simply not possible in the acoustic feature space. Some examples are explained below.
[0033] For example, modification can be based on a vocal or linguistic specification for the synthetic speech utterance. In the source-filter model, sound shaping can be adjusted by setting specific geometric parameters or altering movement to align with certain anatomical reference points. Jaw opening, tongue position, and lip rounding are important geometric parameters in the source-filter model that influence the sound shaping of vowels. When articulating the vowel /i/ as in "sie" (she / they), the tongue is positioned high in the mouth and close to the front teeth, which alters the resonance chamber in such a way that certain frequencies are amplified, producing the characteristic sound of this vowel. Another example is lip rounding in vowels like /u/ in "Mut" (courage), where the lips are extended and rounded. This rounding lengthens the vocal tract, which alters the resonance chamber and leads to the specific acoustic properties of the sound. Finally, jaw opening has a strong influence on the timbre of vowels: for a vowel like /a/ in "Tag" (day), the jaw opens wide, increasing the space in the mouth and lowering the frequency of the first formants, thus making the vowel sound deeper. These anatomical adjustments modulate the vocal tract and therefore decisively influence the specific sound characteristics of different vowels. For example, the sound of a particular vowel could be adjusted. The length of a vowel could also be adjusted.
[0034] In addition to such sound-related modifications dependent on vocal specifications, linguistic specifications are also conceivable. For example, a specific dialect feature (such as a rolled "r" for Franconian) could be suppressed or emphasized. Further examples include the unrounding of /ö/, /ö:/, /ü/, /ü:/, and /üe/ to /e/, /e:/, /i/, and /i:/ in certain dialects; or consonant weakening. It is possible to specifically consider such linguistic specifications during the modification process. A comparable modification is not possible, or only possible to a very limited extent, in the acoustic feature space, for example, based on spectrograms.
[0035] Typically, linguistic conventions comprise a collection of several vocal conventions. For example, a particular dialect is characterized by a collection of specific conventions for the pronunciation of certain sounds.
[0036] In principle, the state of the art has comprehensively investigated, within the framework of the source-filter model, the physical-anatomical dependencies that exist between vocal or linguistic characteristics of the synthetic speech utterance on the one hand and the individual parameters of the source-filter model on the other. Based on such prior knowledge about the physical-anatomical dependencies, the modification of one or more feature values can then be carried out.
[0037] Modifying one or more of the feature values can be based not only on a vocal or linguistic specification for the synthetic speech utterance (i.e., based on a target specification), but alternatively or additionally on prior knowledge about a vocal or linguistic characteristic or limitation of the person (i.e., based on properties of the input data). The examples discussed above in connection with the vocal or linguistic specification for the synthetic speech utterance also apply, in principle, to modifications based on prior knowledge about the vocal or linguistic characteristic or limitation of the person. For example, as discussed above, it would be conceivable that the emphasis on certain sounds—perhaps due to the known use of a dialect—could be compensated for.
[0038] Other examples include pathological articulatory impairments such as dysarthria, stroke, Parkinson's disease, or tongue paralysis, as well as absent or impaired phonation ability and non-organically caused phonetic disorders such as lisping. Various situations that lead to a partial or complete loss of phonation are explained below. In patients requiring long-term mechanical ventilation, a tracheotomy is typically performed to avoid the side effects of nasal or oral intubation. A cannula is inserted through an artificial opening in the trachea. An inflatable cuff is typically used to seal the trachea tightly. This prevents exhaled air from flowing through the larynx, thus preventing phonation. Following a total laryngectomy, a permanent opening in the trachea, called a tracheostomy, is created, and the esophagus and pharynx are surgically separated from the airway. This situation also prevents physiological phonation. Removal of the thyroid gland (thyroidectomy) can lead to injury of one or both recurrent laryngeal nerves, which are anatomically very close to the thyroid gland. Injury to the laryngeal nerves can partially or completely paralyze the vocal cords, thus impairing or eliminating phonation. It is important to understand that in each of the situations described here, the mechanism of phonation is deactivated, while the articulatory function is not affected (although it is also conceivable that the speaker might intentionally speak without phonation, i.e., "whisper"; this might occur, for example, to maintain the confidentiality of a telephone conversation in a public space).
[0039] Such limitations mean that certain sound formations are no longer possible, or only possible in a distorted form, due to a corresponding anatomical impairment. If corresponding features are observed in the temporal sequence of feature vectors affected by such a limitation, the corresponding feature values can be modified to compensate for the deviation from a target value caused by the limitation. However, such compensation for certain anatomically caused limitations in speech production is not possible, or only possible to a limited extent, in the acoustic feature space according to previously known techniques.
[0040] For example, dysarthria can lead to motor impairment in the vocal tract, which manifests symptomatically as a vocal impairment; typical symptoms include slurred or indistinct speech, and altered voice quality, such as being particularly hoarse or nasal. For instance, some stroke patients can only open their jaw within a specific range reduced by the pathology. This is due to impairment of the corresponding motor center in the brain. Within the set of features, there might be a specific feature corresponding to a parameter of the source-filter model that indicates jaw opening. It would then be conceivable to selectively modify the feature values of this feature when a reduced jaw opening compared to a target value is detected.
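The jaw-opening example can be sketched as a simple range compensation. The following is a minimal illustration, assuming that the feature value indicates jaw opening as a scalar and that a linear stretch around a rest value is an acceptable compensation; the function name, the linear mapping, and all numeric values are hypothetical.

```python
def compensate_range(values, observed_max, target_max, rest=0.0):
    """Linearly stretch excursions around `rest` so that `observed_max`
    (the pathologically reduced maximum) maps onto `target_max`.

    values: feature values for one feature (e.g., jaw opening) over time.
    This is only one conceivable compensation; the disclosure does not
    prescribe a specific mathematical operation.
    """
    if observed_max <= rest:
        raise ValueError("observed_max must be greater than rest")
    gain = (target_max - rest) / (observed_max - rest)
    return [rest + (v - rest) * gain for v in values]


# Usage: a speaker whose jaw opening reaches at most 1.0 (arbitrary units)
# is mapped onto a target range reaching 2.0.
compensated = compensate_range([0.0, 0.5, 1.0], observed_max=1.0, target_max=2.0)
```

Because the operation acts on a single, anatomically interpretable feature, the other features of the feature vector remain untouched.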
[0041] Another example is lisping. Lisping, also called sigmatism, involves the incorrect pronunciation of certain sounds, especially the /s/ sound and sometimes the /z/ sound. These pronunciation deviations arise from an unusual positioning or movement of the tongue, as well as a possible misalignment of the teeth, which disrupts the normal airflow patterns in the vocal tract. This can be compensated for by specifically adjusting the feature values associated with parameters of the source-filter model that describe the positioning and movement of the tongue or teeth.
[0042] In principle, modification can be based on a comparison of the feature values with a predefined reference. The predefined reference could, for example, be determined based on the vocal or linguistic specifications for the synthetic speech utterance discussed above. Furthermore, the comparison can take into account, for example, the vocal or linguistic characteristics or limitations of the person discussed above; if, for instance, certain limitations or characteristics are known, their expression in specific characteristic movement patterns or positions of anatomical features in the speech apparatus can be specifically sought and identified within the comparison.
[0043] With such techniques, the modification can be carried out directly in the feature space, meaning that the feature values can be directly mapped to adapted feature values by applying a mathematical operation (which is determined, for example, as described above).
[0044] However, in various scenarios, it would also be conceivable to first convert the data into the parameter space of the source-filter forward model, then adjust it within the framework of the source-filter forward model, and finally convert it back into the articulatory feature space of the machine-learned model. This means that the source-filter forward model is parameterized based on the feature values of the features, and then a prediction of the parameterized source-filter model is generated. The source-filter model can then be inferred. A comparison of the source-filter forward model's prediction (for example, the airflow or a geometric parameter of a specific anatomical feature) with a predefined reference can then be performed, and the feature values can be modified based on this comparison. In such a scenario, speech synthesis would be performed using the source-filter forward model to modify the feature values in the feature space of the machine-learned encoding model. In principle, it would then be possible for the modified feature values to be translated by a machine-learned decoding model (vocoder) into a synthetic speech utterance or audio waveform, thus resulting in further speech synthesis using a machine-learned vocoder.
[0045] For example, modifying one or more feature values can be based on a predefined weighting of the features. This weighting can, for instance, involve a relative weighting of different features within a feature vector of the temporal sequence. This means that, for example, a first feature has a first feature value in a specific feature vector, and a second feature has a second feature value in this same feature vector; the ratio of the first feature value to the second feature value can then be compared with a specified relative weighting, and, if necessary, at least one of the first feature value and the second feature value can be adjusted to meet the specification. Another example, in addition to the weighting of different features described above, would be the weighting of feature values for one and the same feature within the corresponding temporal sequence. This means that the weighting involves a relative weighting of a specific feature between two feature vectors in the temporal sequence. For example, certain rates of change could be taken into account. Such a predefined weighting can, in turn, be determined based on a vocal or linguistic specification for the synthetic speech utterance; alternatively or additionally, such a predefined weighting can be based on prior knowledge about a vocal or linguistic characteristic or limitation of the person. For example, a certain limited tongue movement speed could be considered within the framework of a relative weighting over a given time period. As part of a relative weighting between different features, for example, the ratio of jaw opening to lip opening could be taken into account.
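The relative weighting between two features within one feature vector can be sketched as a ratio check. The example below assumes scalar feature values for jaw opening and lip opening, a hypothetical target ratio, and a hypothetical relative tolerance. Adjusting only the lip opening is an arbitrary choice made for illustration.

```python
def enforce_ratio(jaw_opening, lip_opening, target_ratio, tolerance=0.1):
    """Compare the ratio jaw_opening / lip_opening with a specified relative
    weighting and, if it deviates by more than `tolerance` (relative to the
    target), adjust the lip opening so that the ratio meets the specification.

    Returns the (possibly adjusted) pair (jaw_opening, lip_opening).
    """
    if lip_opening <= 0 or target_ratio <= 0:
        raise ValueError("lip_opening and target_ratio must be positive")
    ratio = jaw_opening / lip_opening
    if abs(ratio - target_ratio) <= tolerance * target_ratio:
        return jaw_opening, lip_opening  # within tolerance: leave unchanged
    # Assumption: the second feature is the one that gets adjusted.
    return jaw_opening, jaw_opening / target_ratio
```

The same pattern could be applied to one feature at two different times, in which case the ratio would express a relative change over time instead of a relation between features.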
[0046] The preceding section described techniques in which one or more feature values are modified based on a vocal or linguistic specification, or on prior knowledge of a vocal or linguistic characteristic or limitation of the person. This means that techniques were specifically disclosed in which modifications are adapted based on anatomical features of the speech apparatus, for example, symptomatically due to subjective characteristics of the person, or outcome-driven based on specific target specifications for the synthetic speech utterance. Alternatively or additionally to such modifications dependent on characteristics of the person, the synthetic speech utterance, or the speech apparatus, modifications based on properties of the measurement modality or modalities would also be conceivable. For example, modification could be based on limitations or characteristics of one or more measurement modalities used to acquire the one or more measurement data streams. For instance, it would be conceivable to apply a weighting (e.g., as described above, at a specific time between different features, or for a specific feature between different times) based on a predefined rule that emphasizes or suppresses a particular feature for a specific measurement modality.
[0047] Such techniques are particularly relevant for articulatory measurement modalities, that is, measurement modalities that measure properties of the speech apparatus and, in particular, the vocal tract. The observable of an articulatory measurement modality typically has a direct relationship to a corresponding feature in the articulatory feature space. Therefore, certain limitations associated with the measurement modality can be better compensated for in the articulatory feature space than, for example, in an acoustic feature space, as in reference implementations.
[0048] For example, certain measurement methods can systematically distort, overemphasize, or underemphasize specific characteristics. For instance, one measurement method might exhibit particularly high accuracy in determining lip position, while another might show particularly high accuracy in determining tongue position. An example of this would be image-based lip positioning using a camera pointed directly at the speaker's face. By analyzing the images captured by the camera, the lip position or movement can be determined relatively accurately. Conversely, image analysis of the camera's captured images typically only allows for a relatively inaccurate determination of tongue position. On the other hand, radar measurement can determine the position of the tongue or even the jaw opening with high accuracy (at least compared to the image analysis described above). Accordingly, it would be conceivable that feature values for the "lip position" obtained through image analysis of camera images could be weighted relatively highly, while feature values for the "tongue position" obtained through image analysis of camera images could be weighted relatively low. Conversely, feature values for the "tongue position" obtained through radar measurement could be weighted relatively highly.
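Such a modality-dependent weighting can be sketched as a weighted average of per-modality feature values in the articulatory feature space. The weight table below is purely hypothetical and only mirrors the qualitative statement that a camera is more reliable for lip position and radar for tongue position. The feature and modality names are likewise assumptions.

```python
# Hypothetical reliability weights per modality and feature (not from the disclosure).
WEIGHTS = {
    "camera": {"lip_position": 0.9, "tongue_position": 0.1},
    "radar": {"lip_position": 0.4, "tongue_position": 0.9},
}


def fuse_frame(frame_by_modality, weights=WEIGHTS):
    """Fuse one time step of per-modality feature vectors into one feature vector.

    frame_by_modality: dict modality -> dict feature -> value.
    Each fused feature is the weighted mean of the values reported by the
    modalities that observed it; features with zero total weight are dropped.
    """
    features = {name for frame in frame_by_modality.values() for name in frame}
    fused = {}
    for name in sorted(features):
        weighted_sum = 0.0
        total_weight = 0.0
        for modality, frame in frame_by_modality.items():
            if name in frame:
                w = weights.get(modality, {}).get(name, 0.0)
                weighted_sum += w * frame[name]
                total_weight += w
        if total_weight > 0:
            fused[name] = weighted_sum / total_weight
    return fused
```

With these example weights, a tongue position reported by radar dominates the fused value, while the camera estimate contributes only marginally.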
[0049] Techniques have been described above for modifying one or more feature values of the feature vectors when prior knowledge exists regarding the measurement modality and / or linguistic characteristics or limitations of the speaker and / or target specifications for the synthetic speech utterance. Sometimes, as an alternative or in addition to rules that specifically consider certain characteristics or specifications, it may be desirable to generally detect and, if necessary, compensate for deviations from the norm. In particular, it may be desirable to detect and compensate for previously unobserved deviations from the norm. For example, anomaly detection could be performed to identify anomalies in the feature values of the feature vectors. Modification can then be based on the results of the anomaly detection. Anomaly detection can also identify previously unknown deviations, such as in movements, of certain anatomical reference points in the vocal tract compared to a norm and, if necessary, compensate for them.
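One simple way to illustrate anomaly detection on a single feature is a z-score test against norm data. The disclosure does not prescribe a particular detector, so the z-score criterion, the threshold of 3.0, and the replacement by the norm mean are all assumptions made for this sketch.

```python
from statistics import mean, stdev


def detect_anomalies(values, norm_values, threshold=3.0):
    """Return indices whose z-score relative to the norm exceeds the threshold."""
    mu = mean(norm_values)
    sigma = stdev(norm_values)
    if sigma == 0.0:
        return [i for i, v in enumerate(values) if v != mu]
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def compensate(values, anomalous_indices, norm_values):
    """Replace anomalous feature values by the norm mean (one simple choice)."""
    mu = mean(norm_values)
    flagged = set(anomalous_indices)
    return [mu if i in flagged else v for i, v in enumerate(values)]


# Hypothetical norm data and observed feature values for one feature.
norm = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
observed = [1.0, 1.02, 2.5, 0.98]
anomalies = detect_anomalies(observed, norm)
cleaned = compensate(observed, anomalies, norm)
```

Only the third observed value lies far outside the norm distribution, so it is the only one flagged and replaced.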
[0050] For example, modification can be based on the time dependencies of feature values over time. Specific rates or amplitudes of change can be considered, and modifications can be made based on these time dependencies. In particular, the change in a feature value over a specific period can be taken into account, in addition to the feature value at a given time. Certain dynamic properties of speech production are characterized by such time dependencies and can be specifically adjusted in this way. An example of such a dynamic pattern would be the amplitude of tongue movement. This contrasts with a corresponding static pattern, such as jaw opening within an observation interval.
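The distinction between instantaneous values and time dependencies can be made concrete with a short sketch. The function names, the sample values, and the choice of peak-to-peak amplitude as the dynamic measure are assumptions for illustration.

```python
def rate_of_change(values, dt):
    """Finite-difference rate of change between consecutive samples."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def movement_amplitude(values):
    """Peak-to-peak amplitude over the observation interval (dynamic pattern)."""
    return max(values) - min(values)


def scale_amplitude(values, factor):
    """Scale the movement around its mean; the mean itself stays unchanged."""
    center = sum(values) / len(values)
    return [center + factor * (v - center) for v in values]


# Hypothetical tongue-height samples in arbitrary units, sampled every 0.5 s.
tongue_height = [0.0, 2.0, 4.0, 2.0, 0.0]
rates = rate_of_change(tongue_height, dt=0.5)
boosted = scale_amplitude(tongue_height, factor=1.5)
```

Scaling around the mean changes the amplitude of the movement while leaving the average position untouched, which is one way to adjust a dynamic pattern without altering the corresponding static one.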
[0051] The preceding section described aspects where, after a corresponding modification of one or more feature values, speech synthesis is performed (cf. Table 1: Application 1). Speech synthesis is not required in all variants of the present disclosure. Other applications are also conceivable. For example, it would be conceivable that a control signal is generated based on the temporal sequence of the feature vector (with or without modification of at least one corresponding feature value) (cf. Table 1: Application 2). This control signal can then be output to one or more components for their control.
[0052] The control signal could be implemented, for example, as a digital control message. Alternatively, the control signal could be an analog control signal. A digital control message could, for example, contain one or more information elements that are indicative of the feature values of the various features.
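A digital control message with one information element per feature could, for instance, be serialized as JSON. The message layout, the field names, and the feature names below are hypothetical; the disclosure does not define a message format.

```python
import json


def build_control_message(timestamp_s, feature_vector):
    """Serialize one feature vector into a digital control message (JSON)."""
    message = {
        "timestamp_s": timestamp_s,
        # One information element per feature of the articulatory feature space.
        "information_elements": [
            {"feature": name, "value": value}
            for name, value in sorted(feature_vector.items())
        ],
    }
    return json.dumps(message)


def parse_control_message(payload):
    """Recover the feature vector on the receiving component."""
    message = json.loads(payload)
    return {e["feature"]: e["value"] for e in message["information_elements"]}


payload = build_control_message(0.25, {"lip_distance": 4.0, "jaw_opening": 12.5})
decoded = parse_control_message(payload)
```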
[0053] For example, the control signal could be passed to a user interface. This user interface could be a graphical user interface. The control signal could instruct the user interface to output contextual information about a vocal or linguistic characteristic of the person during the recording period. For instance, the contextual information could include a visual representation of the time-varying spatial shape of the vocal tract during the recording period.
[0054] For example, a corresponding visual representation could occur essentially in real time. This would allow continuous feedback to be given to the speaker during speech. Through the visualization of the spatial structure of the speech apparatus (especially the vocal tract), the user can see which movement of the vocal tract produces which sound, and conversely, which changes in vocal tract movement result in which changes in sound production. This can be particularly helpful for speech training (for example, for learning or unlearning a specific dialect or special singing techniques) or in speech therapy. Certain pathological conditions (for example, as described above, pathological articulatory limitations such as dysarthria, stroke, Parkinson's disease, tongue paralysis, etc.) could also be treated through such accompanying visual feedback to the speaker.
[0055] In principle, the term "speech utterance" as used in this disclosure encompasses both singing and verbal communication. Singing and verbal communication differ in terms of sound production, specifically in the manner of voice use and breath control. In singing, the voice is modulated to allow for a smooth and controlled vibration of the vocal cords, resulting in clearer and often more powerful sound production. This requires deep and controlled breathing, which allows singers to sustain and modulate tones for longer periods and with a greater dynamic range. In contrast, sound production in verbal communication is often characterized by a faster, less controlled breathing technique, enabling shorter and more pragmatic sounds. Despite certain differences in sound production and use of the vocal apparatus, both singing and verbal communication can benefit from the techniques described herein.
[0056] For example, the visual representation can be configured based on the value of a static parameter in the source-filter forward model and then modified over time based on the value of a dynamic parameter. For instance, a relative scaling of tongue size to jaw opening or a relative scaling of tongue size to lip size could be set within a corresponding configuration of the source-filter forward model and then remain fixed; based on this, the position estimation or lip opening could then be dynamically modified and displayed. Such a technique, where the visual representation is also configured based on one or more static parameters of the source-filter forward model, has the advantage of providing the speaker with more intuitive visual feedback. In speech therapy for children, certain anatomical characteristics of the vocal tract can be taken into account. For example, in children, the larynx is smaller and its parts are less developed than in adults. This affects the voice and speech clarity, which is why speech therapy exercises often aim to improve control and use of the larynx. The same applies to the teeth.
[0057] In one variant, for example, contextual information displayed via the user interface can include a visual representation of a target specification for the time-varying spatial shape of the vocal tract during the recording period. This means that, for example, a specific (e.g., time-varying) geometry of the vocal tract is displayed to the user. This can be done interactively. For instance, the user could be asked to read a specific text. Using speech recognition, it can then be determined which specific passage of the text the user is currently reading. Simultaneously, the target specification can be determined using a reference shape of the vocal tract defined for this text passage. Such a reference shape of the vocal tract can be determined, for example, using the source-filter forward model itself and / or an alternative speech synthesis method. For example, a specific speaker model could be taken into account. The target could therefore be based on a predetermined pronunciation of a speech utterance made during the recording period. In this way, specific dialects can be trained, for instance, for actors preparing for a particular role. Pathological conditions can also be treated. The target could be based on a therapeutic articulation task. For example, it would be conceivable to create a corresponding therapeutic articulation task for dysarthria patients who suffer from restricted tongue mobility, which would train their tongue mobility.
[0058] For example, the displayed contextual information could, as an alternative or in addition to the examples mentioned above, include highlighting a part of the speech apparatus, where one of the features associated with this part of the speech apparatus has feature values that deviate from a corresponding predefined reference within the recording period. For example, such contextual information could be determined based on anomaly detection (as described above) to identify anomalies in the feature values.
[0059] In such a scenario, speech training can be carried out by giving visual feedback on articulation disorders (instead of, for example, acoustic feedback through a corresponding reproduction of the speech utterance).
[0060] The contextual information can be determined, for example, based on a change in a characteristic of the feature values over time during the common recording period compared to an earlier recording period.
[0061] In this way, specific changes in speech patterns can be communicated to the user, enabling them to improve their articulation. This facilitates speech training, making progress visible.
[0062] The displayed contextual information can be indicative of user instructions for modifying vocal tract movements. For example, a specific articulation task can be assigned to the user, such as "enlarge jaw opening to X millimeters," "move tongue tip further upwards in the oral cavity," etc. Such user instructions can be displayed in text form or graphically.
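Text-form user instructions of the kind quoted above could be derived from the deviation between measured feature values and a target specification. The following sketch assumes that both are expressed in millimeters; the feature names, the tolerance, and the wording of the instruction are hypothetical.

```python
def make_instructions(measured, target, tolerance):
    """Turn deviations from a target specification into short text instructions."""
    instructions = []
    for feature in sorted(target):
        delta = target[feature] - measured[feature]
        if abs(delta) <= tolerance:
            continue  # close enough to the target, no instruction needed
        verb = "increase" if delta > 0 else "decrease"
        instructions.append(
            f"{verb} {feature} by {abs(delta):.1f} mm to reach {target[feature]:.1f} mm"
        )
    return instructions


# Hypothetical measured values and target specification, in millimeters.
measured = {"jaw_opening": 8.0, "lip_distance": 5.2}
target = {"jaw_opening": 12.0, "lip_distance": 5.0}
hints = make_instructions(measured, target, tolerance=0.5)
```

The same deviations could equally drive a graphical display, for example by highlighting the affected part of the vocal tract instead of printing text.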
[0063] The preceding section described aspects where the control signal (cf. Table 1: Application 2) was used to control a user interface. In such cases, relevant contextual information can be visualized in relation to a user's speech utterance. This contextual information can relate to vocal or linguistic characteristics of the person during the recording period. The contextual information need not necessarily be speaker-dependent; it could also relate to the measurement modality used to measure one or more data streams. The contextual information can be derived from an analysis of feature values regarding characteristics or deviations from the norm for the respective measurement modality. For example, the contextual information could be indicative of user instructions for adjusting the mounting of a sensor used to record one of the data streams. Such techniques are based on the understanding that certain measurement modalities (radar measurements in particular, for example) can be highly sensitive to incorrect placement of the corresponding high-frequency antennas on the skin surface in the vocal tract region, such as the throat or cheeks. For instance, a medical patch with one or more high-frequency antennas attached may slip or partially detach from the skin. It has also been observed that perspiration alters the coupling of high-frequency waves into the tissue, potentially leading to drift in the measurement data. These effects are not limited to radar measurements but can also occur with other articulatory measurement modalities. Examples include capacitance measurements using electrodes that are attached to the skin and can shift. Such changes or problems with the attachment of a sensor in an articulatory measurement modality can be detected particularly reliably in the articulatory feature space, as this is relatively "close" to the measured observables.
[0064] The preceding examples describe situations where the control signal is passed to a user interface, for instance, to output information about the user's speech and / or information about the measurement modality used. However, it is not always necessary to pass the control signal to a user interface. For example, a control signal can also be used for controlling a machine.
[0065] For example, the control signal could be transmitted as user input to a machine's user interface. Such a machine could be, for instance, a mobility aid in rehabilitation technology or an electronic user device. This could enable machine control techniques for people with limited mobility or paraplegia. The control signal could, for example, be used to control a motorized device. One or more actuators could be configured. A user interface of the mobile device—such as a smartphone or a PC—could be accessed.
[0066] Another application of the feature values determined in the articulatory feature space would be a diagnosis based on these values (cf. Table 1: Application 3). For example, an evaluation of the temporal sequence of feature vectors could be used to diagnose a pathology associated with the speech apparatus. For instance, a specific organic restriction of vocal tract movement, as discussed above, could be diagnosed. For example, reduced tongue mobility could be identified, leading to a corresponding diagnosis such as dysarthria.
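As a purely illustrative sketch of how a mobility restriction might be quantified from the temporal sequence of one feature, the observed range of motion can be compared with a reference range. The reference range, the threshold `min_ratio`, and the sample values are assumptions with no clinical validity; the disclosure does not specify a diagnostic criterion.

```python
def mobility_ratio(values, reference_range):
    """Observed peak-to-peak range divided by a reference range of motion."""
    return (max(values) - min(values)) / reference_range


def flag_restricted(values, reference_range, min_ratio=0.6):
    """Flag restricted mobility if the observed range falls below min_ratio."""
    ratio = mobility_ratio(values, reference_range)
    return ratio < min_ratio, ratio


# Hypothetical tongue-tip height over one recording period (arbitrary units).
tongue_tip = [1.0, 2.0, 3.0, 2.0, 1.0]
restricted, ratio = flag_restricted(tongue_tip, reference_range=5.0)
```

The ratio also serves as a simple quantification of the restriction, in the sense that a smaller value indicates a smaller observed range of motion relative to the reference.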
[0067] A method involves recording one or more measurement data streams for a person's speech apparatus. These data streams are acquired during a common recording period. The method also includes processing these data streams in a machine-learned model to generate a temporal sequence of feature vectors. The features of these vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus.
[0068] For example, the method could involve modifying one or more of the feature values. Based on these modified feature values, the method could then include performing speech synthesis. This speech synthesis could generate a synthetic speech utterance for the person during the recording period.
[0069] Alternatively or additionally, the method can include generating a control signal based on the temporal sequence of the feature vectors. The feature vectors may or may not be modified.
[0070] Alternatively or additionally, the method can include evaluating the temporal sequence of feature vectors to determine a diagnosis of a pathology associated with the speech apparatus. Techniques such as those described above in connection with a data processing device are also applicable to such a method.
[0071] The features set out above and those described below can be used not only in the corresponding explicitly set out combinations, but also in further combinations or in isolation, without leaving the scope of protection of the present invention.
[0072] BRIEF DESCRIPTION OF THE FIGURES
[0073] FIG. 1 schematically illustrates the speech apparatus of a person according to various examples.
[0074] FIG. 2 schematically illustrates an electronic data processing device according to various examples, coupled with several sensors for different measurement modalities to capture measurement data streams for the speech apparatus.
[0075] FIG. 3 is a flowchart of an exemplary process.
[0076] FIG. 4 schematically illustrates a data processing pipeline for processing measurement data streams according to various examples.
[0077] FIG. 5 schematically illustrates various parameters of a source-filter forward model of the speech apparatus according to different examples.
[0078] DETAILED DESCRIPTION OF EXAMPLES
[0079] The present invention is explained in more detail below with reference to preferred embodiments and the drawings. In the figures, identical reference numerals denote identical or similar elements. The figures are schematic representations of various embodiments of the invention. Elements depicted in the figures are not necessarily shown to scale. Rather, the various elements depicted in the figures are represented in such a way that their function and general purpose are understandable to a person skilled in the art. Connections and couplings between functional units and elements shown in the figures can also be implemented as indirect connections or couplings. A connection or coupling can be implemented as a wired or wireless connection. Functional units can be implemented as hardware, software, or a combination of hardware and software.
[0080] The following describes techniques for translating measurements of a person's vocal tract (e.g., acoustic and / or articulatory measurements) into an articulatory feature space. Based on a temporal sequence of feature vectors and for features defined within this articulatory feature space, various applications become possible. Examples include speech synthesis, the output of contextual information about speech utterance to a user via a user interface (particularly a graphical user interface for visualizing the vocal tract during speech), the control of a technical machine, such as a robot or motorized device, and the diagnosis of pathological conditions.
[0081] FIG. 1 shows the main parts of the anatomy of the human vocal organs (speech apparatus). Technically, the human voice (for example, for verbal communication or singing) is often described by the so-called source-filter model. The lungs, the trachea 203, and the larynx 204 together form the source 201. Air is compressed in the lungs and flows upwards through the trachea to the larynx. In the larynx, the vocal folds 204a (colloquially referred to as "vocal cords") form the glottis. The laryngeal muscles keep the vocal folds under tension by exerting force via the arytenoid cartilages. During voiced speech, the pressure in the trachea and the tension of the vocal folds cause them to open and close periodically, thereby creating an acoustic vibration, a sound wave. This sound wave is acoustically filtered by the time-varying shape of the vocal tract 202, consisting of the pharynx 206, oral cavity 208, and nasal cavity 212, before it exits the mouth and nostrils 213. Speech production consists of the process of phonation, technically expressed as the excitation of an acoustic vibration by the vocal cords, and articulation, i.e., the filtering of the sound spectrum by the time-varying shape of the vocal tract. The shaping of the vocal tract is carried out by the soft palate 207, which opens or closes the nasal cavity, the tongue 209, the upper 210a and lower teeth 210b, and the upper 211a and lower lips 211b.
[0082] FIG. 2 schematically illustrates a system comprising a data processing device 60 and several sensor devices 66, 67, 68. The sensor devices 66, 67, 68 are arranged on or near a speaker 20 and are each configured to provide measurement data streams for the speaker's speech apparatus. A processor 61 of the data processing device 60 can receive and record the measurement data streams via a suitable interface 63. For example, sensor device 66 could be a microphone, sensor device 67 a camera, and sensor device 68 one or more radar antennas. These are only examples, however, and various sensor modalities are known that can be used.
[0083] The processor 61 can also load and execute program code from memory 62. When the processor 61 executes the program code, this causes the processor 61 to implement techniques as described herein, for example, in particular, processing the recorded measurement data streams to obtain a temporal sequence of feature vectors, modifying feature values of the feature vectors, performing speech synthesis, outputting a control signal via a communication interface 64, for example, to control a technical device or to display context information on a graphical user interface, etc.
[0084] FIG. 3 is a flowchart of an exemplary procedure. The procedure in FIG. 3 can be executed by a data processing device. The procedure in FIG. 3 can be executed by at least one processor that loads and executes program code from memory. For example, the procedure in FIG. 3 can be executed by the processor 61 of the data processing device 60, as discussed above in connection with FIG. 2. In step 905, one or more measurement data streams are acquired for a person's speech apparatus. The person uses the speech apparatus, for example, to sing or to communicate verbally (also referred to as "speaking"). One or more measurement modalities can be used.
[0085] For example, an audio recording of a person's speech utterance can be made. Alternatively or additionally, one or more measurement data streams can be recorded, based on an articulatory measurement modality. Such articulatory measurement modalities include an observable that quantifies a characteristic of the vocal tract. Examples include, for instance, radar measurement of tongue movement or an image sequence indicating lip movement. Various articulatory measurement modalities are known in the prior art and can be used in conjunction with the techniques disclosed herein.
[0086] In step 910, the one or more measurement data streams from step 905 are processed. Various examples use a machine-learned model for this purpose. This machine-learned model provides a temporal sequence of feature vectors. For example, a temporal sequence could be provided for each measurement data stream. However, it would also be conceivable to provide a common temporal sequence for all recorded measurement data streams. The machine-learned model could have multiple coding branches, for example, one for each measurement data stream or for each measurement modality. Alternatively, multiple measurement data streams could be merged / combined before being input into the machine-learned model.
[0087] The feature values are defined in a feature space where different features correspond to predefined parameters of a source-filter forward model of the vocal tract. Such parameters can be, for example, static parameters, dynamic parameters, source parameters, or filter parameters. Different features can, for example, indicate motor states of different areas of the vocal tract. Motor states can, in particular, indicate a movement trajectory or the extent of the reference ranges. In step 915, it is optionally possible to modify one or more feature values. This can be done, for example, based on a vocal or linguistic specification for the synthetic speech utterance. It would be conceivable that the modification could be based on prior knowledge about a vocal or linguistic characteristic or limitation of the person. For example, pathological articulatory impairments such as dysarthria, stroke, Parkinson's disease, or tongue paralysis could be taken into account. Further examples include absent or impaired phonation, or non-organically caused phonetic disorders such as lisping. Feature values can be compared and / or weighted. A comparison with a predefined threshold as a reference can also be performed. Modifications can consider, for example, the time dependence of the feature values and / or instantaneous amplitudes. Anomaly detection can reveal deviations from a norm, allowing them to be compensated for. Based on the feature values, the source-filter model could also be parameterized, and a prediction of the source-filter forward model could then be determined. This prediction can then be compared with a reference.
[0088] The modification can also be based on a characteristic or restriction of at least one of the one or more measurement modalities from step 905. For example, measurement metadata (such as error bars or variance) could be obtained, specifying one or more properties of the corresponding sensors; the feature values could then be modified based on such measurement metadata. For example, feature values for unreliably measured features could be weighted less (i.e., suppressed) than feature values for particularly reliably measured features (which would be emphasized).
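If the measurement metadata is available as a variance per feature value, one common way to suppress unreliable values and emphasize reliable ones is inverse-variance weighting. This technique is named here as an illustrative choice and is not taken from the disclosure; it assumes independent measurement errors, and the numeric values are hypothetical.

```python
def inverse_variance_fusion(estimates):
    """Fuse (value, variance) pairs for one feature.

    A small variance (reliable measurement) yields a large weight.
    Returns the fused value and the variance of the fused value.
    """
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total


# Hypothetical tongue-position estimates: (value, variance) per modality.
camera = (0.20, 0.04)  # unreliable for the tongue, large variance
radar = (0.60, 0.01)   # reliable for the tongue, small variance
fused_value, fused_variance = inverse_variance_fusion([camera, radar])
```

In this example the radar estimate receives four times the weight of the camera estimate, so the fused value lies much closer to the radar value.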
[0089] Modifying the feature values in step 915 can involve applying another machine-learned model, for example, to detect anomalies or to identify specific patterns or features that need to be compensated for or balanced. The modification in step 915 can take into account instantaneous amplitudes of feature values and / or time dependencies of feature values.
[0090] In step 920, speech synthesis can optionally be performed, possibly based on the modified feature values from step 915 or on the basis of the unmodified feature values from step 910.
[0091] Speech synthesis can be performed, for example, using a machine-learned model. This model can receive additional input (beyond the potentially modified feature values), such as information characterizing the speaker. The machine-learned model can, for instance, first map the articulatory feature space to a frequency-resolved acoustic feature space. Based on the temporal sequence of feature vectors, one or more audio spectrograms can then be generated, and these spectrograms can subsequently be translated into an audio waveform.
[0092] In step 925, based on the output of step 910 or step 915, a control signal can optionally be passed to a user interface, particularly a graphical user interface. This control signal can instruct the graphical user interface to output specific contextual information about a speech utterance during the recording period. The contextual information can relate to a vocal or linguistic characteristic of the person's speech utterance during the recording period from step 905. This contextual data can be determined, at least in part, based on the (possibly modified) feature values. For example, a visualization of the spatial structure of the vocal tract, and especially the vocal cords, can be triggered. Dynamic and / or static parameters of the vocal tract can be set in this process. For example, certain deviations from the norm could be highlighted, or specific (user) instructions for the movement and / or shaping of certain parts of the vocal tract, such as the tongue, lips, or jaw opening, could be visualized. A target specification can be issued, for example, regarding the movement of the tongue, jaw, etc. This target specification can be part of a therapeutic application to correct certain speech impairments or defects. The target specification can be part of an articulation task. For example, the graphical user interface can be instructed to highlight certain parts of the speech apparatus, particularly the vocal tract. For instance, those parts or areas of the vocal tract that are responsible for a specific abnormality in speech production can be graphically highlighted. Parts or areas of the vocal tract that exhibit movement deviating from the norm can be graphically highlighted. Anomalies can be identified and visualized in this context.
[0093] Deviations from the norm can be determined absolutely or as a function of time, that is, by considering, for example, the change in a characteristic's values over time compared to historical references. This can facilitate articulation training or vocal training. A specific dialect can be acquired or unlearned. Speech disorders can be treated.
[0094] Output via the graphical user interface can occur in real time or with a latency of less than, for example, 100 ms, in parallel with the spoken utterance.
[0095] Step 925 allows for the control of not only a user interface, but also, alternatively or additionally, another machine. For example, a mobility aid in rehabilitation technology or an electronic user device, such as a computer, smartphone, telephone, etc., could be controlled.
[0096] In step 930, based on an output from step 910 or step 915, a history of pathological limitations can optionally be taken. A diagnosis of a pathology associated with the speech apparatus can be made. For example, it could be checked whether a specific organic restriction of the mobility of one or more elements of the vocal tract is present. The corresponding restriction of mobility, if detected, could also be quantified.
FIG. 4 schematically illustrates a data processing pipeline 500. Several measurement data streams 501, 502, 503 are obtained, which are associated with different measurement modalities (compare sensors 66, 67, 68 in FIG. 2). For example, the measurement data streams can be obtained as vectors, with different vectors corresponding to different time points within a recording period. In the example of FIG. 4, a machine-learned model 540 comprises a number of machine-learned coding branches 541, 542, 543, each configured to evaluate a respective one of the different measurement data streams 501, 502, 503 (corresponding to step 910 in FIG. 3). Associated feature vectors 511, 512, 513 are then obtained, containing feature values for features in an articulatory feature space defined by parameters of a source-filter forward model of the speech apparatus. By repeatedly applying each of the coding branches 541, 542, 543 to different input vectors (which, for example, cover different time ranges within a recording period and are typically shifted using a sliding window technique), a corresponding temporal sequence of feature vectors 511, 512, 513 is obtained for each dimension. The feature values can then be optionally modified (this corresponds to step 915 in FIG. 3), resulting in modified feature vectors 521, 522, 523, which can then be combined. The combination can be performed, for example, by averaging. During modification, different feature values within feature vectors 511, 512, 513 can be weighted differently, as shown in FIG. 4 by the differently sized circles in the modified feature vectors 521, 522, 523. Subsequently, a decoding branch 549 is applied in the data processing pipeline 500 to obtain an audio waveform 530 of a synthetic speech utterance (this corresponds to step 920 in FIG. 3).
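The structure of the pipeline in FIG. 4 (sliding windows, one coding branch per measurement data stream, combination by averaging) can be sketched as follows. The function `encode` is only a placeholder that computes two simple statistics; in the disclosure the coding branches are machine-learned. The window size, hop size, and sample data are assumptions, and the optional weighting and the decoding branch are omitted.

```python
def sliding_windows(stream, size, hop):
    """Split one measurement data stream into overlapping input vectors."""
    return [stream[i:i + size] for i in range(0, len(stream) - size + 1, hop)]


def encode(window):
    """Placeholder for a trained coding branch: returns a 2-feature vector."""
    return [sum(window) / len(window), max(window) - min(window)]


def combine(vectors):
    """Combine feature vectors from several branches by element-wise averaging."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def run_pipeline(streams, size=4, hop=2):
    """Return one combined feature vector per window position."""
    per_branch = [[encode(w) for w in sliding_windows(s, size, hop)] for s in streams]
    return [combine(step) for step in zip(*per_branch)]


# Two hypothetical measurement data streams of equal length.
audio = [0, 1, 2, 3, 4, 5, 6, 7]
radar = [2, 2, 2, 2, 4, 4, 4, 4]
sequence = run_pipeline([audio, radar])
```

Each entry of `sequence` corresponds to one time step of the temporal sequence of feature vectors, obtained after combining the per-branch outputs for the same window position.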
[0097] Coding branches 541, 542, and 543 can be trained as follows: First, synthetic pseudo-speech samples and the corresponding temporal sequences of feature vectors can be generated using a Monte Carlo process based on the available source-filter forward model. For this purpose, pseudo-phoneme states are determined and then randomly chained and interpolated by target approximation to form continuous trajectories for the different areas of the vocal tract. This yields the ground truth for the output of the coding branches for specific pseudo-speech samples. The input to the coding branches can be determined, for example, by appropriate models for the respective measurement modality or (when using audio as input) directly from the pseudo-speech samples. Then, the training of each coding branch can be performed. Details are described in particular in: P.K. Krug, P. Birkholz, B. Gerazov, D.R. Van Niekerk, A. Xu, and Y. Xu, “Self-supervised solution to the control problem of articulatory synthesis,” in Proc. Interspeech, 2023, pp. 4329–4333. Such training can also preserve speaker-independent features. This has the advantage that speaker-specific training of the coding branches is not necessary.
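The generation of ground-truth trajectories by randomly chaining pseudo-phoneme states can be illustrated with a simplified sketch. Here, a first-order approach toward each target is used as a stand-in for target approximation; the formulation in the cited work may differ. The normalized value range, the number of states and features, and the rate constant are assumptions.

```python
import random


def random_targets(n_states, n_features, rng):
    """Draw pseudo-phoneme states as random points in a normalized [0, 1) range."""
    return [[rng.random() for _ in range(n_features)] for _ in range(n_states)]


def chain_trajectory(targets, steps_per_target, rate, start=None):
    """Chain targets into a continuous trajectory by first-order approximation."""
    state = list(start if start is not None else targets[0])
    trajectory = []
    for target in targets:
        for _ in range(steps_per_target):
            # Move a fixed fraction of the remaining distance toward the target.
            state = [s + rate * (t - s) for s, t in zip(state, target)]
            trajectory.append(list(state))
    return trajectory


rng = random.Random(0)  # fixed seed so the sketch is reproducible
targets = random_targets(n_states=5, n_features=3, rng=rng)
trajectory = chain_trajectory(targets, steps_per_target=20, rate=0.2)
```

Each row of `trajectory` would play the role of one ground-truth feature vector; the matching pseudo-speech sample would be produced by feeding the trajectory into the source-filter forward model, which is outside the scope of this sketch.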
[0098] Once the coding branches are trained, the decoding branch 549 can, for example, be trained using a pre-made audio training dataset: Using a coding branch that processes audio data from the audio training dataset, the time sequences of the feature vectors are determined, and then a reconstruction of the initial audio data from the audio training dataset can be used to train the decoding branch 549.
[0099] FIG. 5 is a schematic illustration of a two-dimensional source-filter forward model 600 of the speech apparatus. The source-filter forward model 600 can be characterized by a variety of static and dynamic parameters of the filter and the source. See, for example: Birkholz, Peter. "Modeling consonant-vowel coarticulation for articulatory speech synthesis." PLoS ONE 8.4 (2013): e60603, Table 2. By way of example, FIG. 5 shows the horizontal and vertical position 611 of the hyoid bone, the vertical lip distance 612, the lip protrusion 613, and the extent 614 of the tongue body (modeled as a circle).
[0100] In summary, techniques for processing measurement data streams from a person's vocal tract were described. These data streams can be acquired through various sensor modalities, such as audio recordings or articulatory measurements, which contain an observable that quantifies a characteristic of the vocal tract. The processing of these data streams can be performed using a machine-learned model to obtain a temporal sequence of feature vectors in an articulatory feature space. Within this feature space, the features can be assigned to the vocal tract and represent various parameters, such as static or dynamic parameters of the filter or source. Optionally, feature values in the feature vectors can be modified, for example, based on a vocal or linguistic specification for synthetic speech production. The modification can also be based on prior knowledge of a person's vocal or linguistic characteristic or limitation, such as a pathological articulatory impairment. The temporal sequences of feature vectors can then be used for various applications, such as speech synthesis, the output of contextual information about speech utterance to a user via a graphical user interface, or the control of a technical machine. It is also possible to use the feature values to diagnose pathological limitations or to enable therapeutic applications such as articulation training or vocal training. Overall, the disclosed techniques offer a wide range of possibilities for processing and analyzing measurement data streams of a person's vocal apparatus to support various applications in speech technology.
[0101] In summary, the following EXAMPLES were described in particular.
[0102] EXAMPLE 1. Data processing device comprising at least one processor and one memory, wherein the at least one processor is configured to load and execute program code from memory, wherein the at least one processor, based on the program code, performs the following steps:
[0103] - Recording one or more measurement data streams for a person's speech apparatus during a common recording period,
[0104] - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, where the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus,
[0105] - Modifying one or more feature values of the features, and
- based on the modified feature values, performing a speech synthesis that generates a synthetic speech utterance for the person during the recording period.
[0106] EXAMPLE 2. Data processing device according to EXAMPLE 1, wherein the modification is based on a vocal or linguistic specification for the synthetic speech utterance.
[0107] EXAMPLE 3. Data processing device according to EXAMPLE 1 or 2, wherein the modification is based on prior knowledge of a vocal or linguistic characteristic or limitation of the person.
[0108] EXAMPLE 4. Data processing device according to EXAMPLE 3, wherein the vocal or linguistic characteristic or limitation of the person is selected from the following group: pathological articulatory limitation such as dysarthria, stroke, Parkinson's disease, tongue paralysis; absent or limited phonation ability; non-organic phonetic disorder such as lisping.
[0109] EXAMPLE 5. Data processing device according to one of the preceding EXAMPLES, wherein the modification is based on a comparison of the feature values with a predetermined reference.
[0110] EXAMPLE 6. Data processing device according to one of the preceding EXAMPLES, wherein the processor is further configured to perform the following step based on the program code:
[0111] - Parameterizing the source-filter forward model based on the feature values of the features and determining a prediction of the parameterized source-filter forward model, wherein the modification is based on a comparison of the source-filter forward model's prediction with a predefined reference.
EXAMPLE 7. Data processing device according to one of the preceding EXAMPLES, wherein the modification is based on a predefined weighting of the features.
[0112] EXAMPLE 8. Data processing device according to EXAMPLE 6, wherein the weighting comprises a relative weighting of different features within a feature vector of the temporal sequence, and / or wherein the weighting comprises a relative weighting of a particular feature between two feature vectors of the temporal sequence.
[0113] EXAMPLE 9. Data processing device according to one of the preceding EXAMPLES, wherein the modification is based on a characteristic or limitation of one or more measurement modalities used for receiving the one or more measurement data streams.
[0114] EXAMPLE 10. Data processing device according to EXAMPLE 9, as well as according to EXAMPLE 7 or 8, wherein the weighting is based on a predetermined rule which emphasizes or suppresses a particular feature for a particular measurement modality.
[0115] EXAMPLE 11. Data processing device according to one of the preceding EXAMPLES, wherein the processor is further configured to perform the following step based on the program code:
[0116] - Performing anomaly detection to identify anomalies in feature values of the feature vectors, wherein the modification is based on a result of the anomaly detection.
[0117] EXAMPLE 12. Data processing device according to one of the preceding EXAMPLES, wherein the modification is based on time dependencies of feature values of the temporal sequence of feature vectors.
[0119] EXAMPLE 13. Data processing device comprising a processor and a memory, wherein the processor is configured to load and execute program code from memory, and wherein, based on the program code, the processor performs the following steps:
[0120] - Recording one or more measurement data streams for a person's speech apparatus during a common recording period,
[0121] - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and
[0122] - Generating a control signal based on the temporal sequence of the feature vectors.
[0123] EXAMPLE 14. Data processing device according to EXAMPLE 13, wherein the processor is further configured to perform the following step based on the program code:
[0124] - Passing the control signal to a user interface, wherein the control signal instructs the user interface to output contextual information for a vocal or linguistic characteristic of the person during the recording period.
[0125] EXAMPLE 15. Data processing device according to EXAMPLE 14, wherein the context information comprises a visual representation of the time-varying spatial shape of the speech apparatus during the recording period.
[0126] EXAMPLE 16. Data processing device according to EXAMPLE 15, wherein the predefined parameters of the source-filter forward model include at least one static parameter and at least one dynamic parameter, wherein the visual representation is configured depending on a parameter value of the static parameter and is changed depending on a parameter value of the dynamic parameter.
[0127] EXAMPLE 17. Data processing device according to EXAMPLE 15 or 16, wherein the context information includes a visual representation of a target specification for the time-varying spatial shape of the speech apparatus during the recording period.
[0128] EXAMPLE 18. Data processing device according to EXAMPLE 17, wherein the target specification is based on a predetermined pronunciation of a speech utterance made during the recording period and / or on a therapeutic articulation task.
[0129] EXAMPLE 19. Data processing device according to one of EXAMPLES 14 to 18, wherein the context information includes a highlighting of a part of the speech apparatus, wherein one of the features associated with that part of the speech apparatus has feature values that show a deviation from a predetermined reference within the recording period.
[0130] EXAMPLE 20. Data processing device according to one of EXAMPLES 14 to 19, wherein the context information is determined on the basis of anomaly detection to recognize anomalies in the feature values.
[0131] EXAMPLE 21. Computer-implemented method according to one of EXAMPLES 14 to 20, wherein the context information is determined based on a temporal change in a characteristic of the feature values during the common recording period compared to an earlier recording period.
EXAMPLE 22. Computer-implemented method according to one of EXAMPLES 14 to 21, wherein the context information is indicative of a user instruction to modify vocal tract movements.
[0132] EXAMPLE 23. Computer-implemented method according to one of EXAMPLES 14 to 22, wherein the context information is indicative of a user instruction for adjusting a mounting of a sensor used to record a measurement data stream.
[0133] EXAMPLE 24. Data processing device according to EXAMPLE 13, wherein the method further comprises:
[0134] - Passing the control signal as user input to a user interface of a machine.
[0135] EXAMPLE 25. Data processing device according to EXAMPLE 24, wherein the machine is a mobility aid in rehabilitation technology.
[0136] EXAMPLE 26. Data processing device according to EXAMPLE 24, wherein the machine is an electronic user terminal device.
[0137] EXAMPLE 27. Data processing device comprising a processor and a memory, wherein the processor is configured to load and execute program code from memory, wherein the processor, based on the program code, performs the following steps:
[0138] - Recording one or more measurement data streams for a person's speech apparatus during a common recording period,
[0139] - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and
[0140] - Evaluating the temporal sequence of feature vectors to determine a diagnosis of a pathology associated with the speech apparatus.
[0141] EXAMPLE 28. Data processing device according to one of the preceding EXAMPLES, wherein the machine-learned model has a corresponding coding branch for each of the one or more measurement data streams, or wherein multiple measurement data streams are joined together before being input into the machine-learned model.
[0142] EXAMPLE 29. Data processing device according to one of the preceding EXAMPLES, wherein the one or more measurement data streams comprise an audio recording of a speech utterance of the person during the recording period.
[0143] EXAMPLE 30. Data processing device according to one of the preceding EXAMPLES, wherein at least one of the one or more measurement data streams is measured based on an articulatory measurement modality.
[0144] EXAMPLE 31. Data processing device according to one of the preceding EXAMPLES, wherein the parameters of the source-filter forward model are selected from the following group: static parameter; dynamic parameter; source parameter; filter parameter.
[0145] EXAMPLE 32. Data processing device according to one of the preceding EXAMPLES, wherein different features of the feature vectors indicate motor states of different areas of a vocal tract.
[0146] EXAMPLE 33. Data processing device according to EXAMPLE 32, wherein the motor states specify a motion trajectory or an extension of the reference ranges.
[0147] EXAMPLE 34. Method comprising:
- Recording one or more measurement data streams for a person's speech apparatus during a common recording period,
[0148] - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, where the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus,
[0149] - Modifying one or more feature values of the features, and
[0150] - based on the modified feature values, performing a speech synthesis that generates a synthetic speech utterance for the person during the recording period.
[0151] EXAMPLE 35. Method comprising:
[0152] - Recording one or more measurement data streams for a person's speech apparatus during a common recording period,
[0153] - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and
[0154] - Generating a control signal based on the temporal sequence of the feature vectors.
[0155] EXAMPLE 36. Method comprising:
[0156] - Recording one or more measurement data streams for a person's speech apparatus during a common recording period,
[0157] - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and
[0158] - Evaluating the temporal sequence of feature vectors to determine a diagnosis of a pathology associated with the speech apparatus.
[0159] EXAMPLE 37. Method according to one of EXAMPLES 34 to 36, wherein the method is carried out by a data processing device according to one of EXAMPLES 1 to 33.
Naturally, the features of the embodiments and aspects of the invention described above can be combined with one another. In particular, the features can be used not only in the combinations described, but also in other combinations or individually, without departing from the scope of the invention.
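Several of the EXAMPLES above compare feature values with a predetermined reference, for instance to highlight a part of the speech apparatus whose associated feature deviates within the recording period. The following Python sketch shows one conceivable way to do this. It is an assumption-laden illustration, not the disclosed method: the function name `features_to_highlight`, the feature names, the representation of the reference as a (mean, spread) pair per feature, and the threshold are all hypothetical.

```python
def features_to_highlight(sequence, reference, threshold=2.0):
    """Return the names of features that deviate from a reference.

    sequence:  list of dicts mapping feature name -> value, one per time step
    reference: dict mapping feature name -> (mean, spread), with spread > 0
    threshold: maximum tolerated deviation in units of the spread

    A feature is flagged if, at any time step in the recording period,
    its normalized deviation from the reference mean exceeds the threshold.
    """
    flagged = []
    for name, (ref_mean, ref_spread) in reference.items():
        max_deviation = max(
            abs(vec[name] - ref_mean) / ref_spread for vec in sequence
        )
        if max_deviation > threshold:
            flagged.append(name)
    return flagged


# Hypothetical feature values for two time steps of a recording period.
sequence = [
    {"lip_distance": 1.0, "tongue_body_x": 0.1},
    {"lip_distance": 1.1, "tongue_body_x": 0.9},
]
# Hypothetical predetermined reference: (mean, spread) per feature.
reference = {
    "lip_distance": (1.0, 0.2),
    "tongue_body_x": (0.0, 0.25),
}

flagged = features_to_highlight(sequence, reference)
```

Because each feature corresponds to a parameter of the source-filter forward model, a flagged feature name can be mapped directly to the anatomical part it describes. A user interface could then highlight that part, here the tongue body, without any additional interpretation step.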
[0160] For example, various techniques have been described above in which one or more measurement data streams are mapped into the articulatory feature space using a machine-learned model. In principle, it would be conceivable to use a non-machine-learned model for mapping the one or more measurement data streams into the articulatory feature space, either as an alternative or in addition to a machine-learned model.
[0161] Furthermore, techniques have been described above in which one or more measurement data streams are processed to obtain a sequence of feature vectors in an articulatory feature space. Additionally, it would be conceivable to process the one or more measurement data streams to obtain a further sequence of feature vectors in an acoustic feature space. Then, articulatory and acoustic features can be used together, for example, to modify articulatory features.
[0162] Furthermore, the aforementioned techniques were described for the vocal apparatus particularly in connection with speaking in the sense of verbal communication; however, these techniques are equally applicable to using the vocal apparatus for singing.
Claims
PATENT CLAIMS
1. Data processing device comprising at least one processor and a memory, wherein the at least one processor is configured to load and execute program code from the memory, wherein the at least one processor, based on the program code, performs the following steps: - Recording one or more measurement data streams for a person's speech apparatus during a common recording period, - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, - Modifying one or more feature values of the features, and - based on the modified feature values, performing a speech synthesis that generates a synthetic speech utterance for the person during the recording period.
2. Data processing device according to claim 1, wherein the modification is based on a vocal or linguistic specification for the synthetic speech utterance.
3. Data processing device according to claim 1 or 2, wherein the modification is based on prior knowledge of a vocal or linguistic characteristic or limitation of the person.
4. Data processing device according to claim 3, wherein the vocal or linguistic characteristic or limitation of the person is preferably selected from the following group: pathological articulatory limitation such as dysarthria, stroke, Parkinson's disease, tongue paralysis; absent or limited phonation ability; non-organically caused phonetic disorder such as lisping.
5. Data processing device according to one of the preceding claims, wherein the modification is based on a comparison of the feature values with a predefined reference.
6. Data processing device according to any one of the preceding claims, wherein the processor is further configured to perform the following step based on the program code: - Parameterizing the source-filter forward model based on the feature values of the features and determining a prediction of the parameterized source-filter forward model, wherein the modification is based on a comparison of the prediction of the source-filter forward model with a predefined reference.
7. Data processing device according to one of the preceding claims, wherein the modification is based on a predetermined weighting of the features.
8. Data processing device according to claim 6, wherein the weighting comprises a relative weighting of different features within a feature vector of the temporal sequence, and / or wherein the weighting comprises a relative weighting of a specific feature between two feature vectors of the temporal sequence.
9. Data processing device according to one of the preceding claims, wherein the modification is based on a characteristic or limitation of one or more measurement modalities used for recording the one or more measurement data streams.
10. Data processing device according to claim 9, as well as according to claim 7 or 8, wherein the weighting is based on a predetermined rule which emphasizes or suppresses a particular feature for a particular measurement modality.
11. Data processing device according to any of the preceding claims, wherein the processor is further configured to perform the following step based on the program code: - Performing anomaly detection to identify anomalies in feature values of the feature vectors, wherein the modification is based on a result of the anomaly detection.
12. Data processing device according to one of the preceding claims, wherein the modification is based on time dependencies of feature values of the temporal sequence of feature vectors.
13. Data processing device comprising a processor and a memory, wherein the processor is configured to load and execute program code from memory, the processor performing the following steps based on the program code: - Recording one or more measurement data streams for a person's speech apparatus during a common recording period, - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and - Generating a control signal based on the temporal sequence of the feature vectors.
14. Data processing device according to claim 13, wherein the processor is further configured to perform the following step based on the program code: - Passing the control signal to a user interface, wherein the control signal instructs the user interface to output contextual information for a vocal or linguistic characteristic of the person during the recording period.
15. Data processing device according to claim 14, wherein the context information comprises a visual representation of the time-varying spatial shape of the speech apparatus during the recording period.
16. Data processing device according to claim 15, wherein the predefined parameters of the source-filter forward model comprise at least one static parameter and at least one dynamic parameter, wherein the visual representation is configured depending on a parameter value of the static parameter and is changed depending on a parameter value of the dynamic parameter.
17. Data processing device according to claim 15 or 16, wherein the context information comprises a visual representation of a target specification for the time-varying spatial shape of the speech apparatus during the recording period.
18. Data processing device according to claim 17, wherein the target specification is based on a predetermined pronunciation of a speech utterance made during the recording period and / or on a therapeutic articulation task.
19. Data processing device according to one of claims 14 to 18, wherein the context information includes a highlighting of a part of the speech apparatus, wherein one of the features associated with that part of the speech apparatus has feature values that show a deviation from a predetermined reference within the recording period.
20. Data processing device according to one of claims 14 to 19, wherein the context information is determined on the basis of anomaly detection to recognize anomalies in the feature values.
21. Computer-implemented method according to one of claims 14 to 20, wherein the context information is determined based on a temporal change in a characteristic of the feature values during the common recording period compared to an earlier recording period.
22. Computer-implemented method according to any one of claims 14 to 21, wherein the context information is indicative of a user instruction to modify vocal tract movements.
23. Computer-implemented method according to any one of claims 14 to 22, wherein the context information is indicative of a user instruction for adjusting a mounting of a sensor used to record a measurement data stream.
24. Data processing device according to claim 13, wherein the method further comprises: - Passing the control signal as user input to a user interface of a machine.
25. Data processing device according to claim 24, wherein the machine is a mobility aid in rehabilitation technology.
26. Data processing device according to claim 24, wherein the machine is an electronic user terminal device.
27. Data processing device comprising a processor and a memory, wherein the processor is configured to load and execute program code from memory, the processor performing the following steps based on the program code: - Recording one or more measurement data streams for a person's speech apparatus during a common recording period, - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and - Evaluating the temporal sequence of feature vectors to determine a diagnosis of a pathology associated with the speech apparatus.
28. Data processing device according to one of the preceding claims, wherein the machine-learned model has a corresponding coding branch for each of the one or more measurement data streams, or wherein multiple measurement data streams are combined before being input into the machine-learned model.
29. Data processing device according to one of the preceding claims, wherein the one or more measurement data streams comprise an audio recording of a speech utterance of the person during the recording period.
30. Data processing device according to one of the preceding claims, wherein at least one of the one or more measurement data streams is measured based on an articulatory measurement modality.
31. Data processing device according to one of the preceding claims, wherein the parameters of the source-filter forward model are selected from the following group: static parameter; dynamic parameter; source parameter; filter parameter.
32. Data processing device according to one of the preceding claims, wherein different features of the feature vectors specify motor states of different areas of a vocal tract.
33. Data processing device according to claim 32, wherein the motor states specify a motion trajectory or an extent of the reference ranges.
34. Method comprising: - Recording one or more measurement data streams for a person's speech apparatus during a common recording period, - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, - Modifying one or more feature values of the features, and - based on the modified feature values, performing a speech synthesis that generates a synthetic speech utterance for the person during the recording period.
35. Method comprising: - Recording one or more measurement data streams for a person's speech apparatus during a common recording period, - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and - Generating a control signal based on the temporal sequence of the feature vectors.
36. Method comprising: - Recording one or more measurement data streams for a person's speech apparatus during a common recording period, - Processing the one or more measurement data streams in a machine-learned model into a temporal sequence of feature vectors, wherein the features of the feature vectors correspond to predefined parameters of a source-filter forward model for the speech apparatus, and - Evaluating the temporal sequence of feature vectors to determine a diagnosis of a pathology associated with the speech apparatus.
37. Method according to any one of claims 34 to 36, wherein the method is carried out by a data processing device according to any one of claims 1 to 33.