Systems and methods for laryngeal engagement signal analysis

WO2026143035A1 · PCT designated stage · Publication Date: 2026-07-02 · EAST CAROLINA UNIVERSITY

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: EAST CAROLINA UNIVERSITY
Filing Date: 2025-12-22
Publication Date: 2026-07-02

Abstract

Systems and methods are provided for analyzing laryngeal engagement during speech production. Audio data is obtained from a user, and acoustic features are processed to determine a sustained phonation metric and a voice-onset-time metric. These metrics may be combined with sensor-derived indicators of vocal-fold activation to generate a composite laryngeal engagement value that reflects engagement continuity and onset timing. The system identifies engagement-reduction points where the composite value fails to satisfy an engagement criterion and provides corresponding visual, auditory, or interactive corrective feedback in real time.

Description

Attorney Docket No. 190412-00041 WO Patent

SYSTEMS AND METHODS FOR LARYNGEAL ENGAGEMENT SIGNAL ANALYSIS

RELATED APPLICATIONS

[0001] The present application claims priority benefit to U.S. Provisional App. No. 63/739,182, filed December 27, 2024, which is hereby incorporated herein by reference in its entirety.

FIELD

[0002] The present disclosure generally relates to systems and techniques for processing speech signals and sensor data and, more specifically, to analysis of laryngeal engagement using acoustic features, physiological indicators, or multimodal sensing sources to detect characteristics of vocal-fold activation during speech production.

BACKGROUND

[0003] Speech production involves coordinated activation of the vocal folds to initiate and sustain voicing across expected phonetic contexts. In some individuals, disruptions in this coordination, including those that can occur during stuttering, may lead to delayed onsets of voicing, unexpected losses of voicing, or other irregularities in phonatory timing that can affect speech fluency. Conventional acoustic-only analysis methods often have difficulty reliably determining when vocal-fold engagement occurs, especially in low-intensity, noisy, or ambiguous speech conditions.

SUMMARY

[0004] Systems and methods are provided for analyzing laryngeal engagement during speech production. Audio data is obtained from a user, and acoustic features are processed to determine a sustained phonation metric and a voice-onset-time metric. These metrics may be combined with sensor-derived indicators of vocal-fold activation to generate a composite laryngeal engagement value that reflects engagement continuity and onset timing. The system identifies engagement-reduction points where the composite value fails to satisfy an engagement criterion and provides corresponding visual, auditory, or interactive corrective feedback in real time.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Throughout the drawings, reference numbers can be re-used to indicate correspondence between referenced elements. The drawings are provided to illustrate embodiments of the present disclosure and do not limit the scope thereof.

[0006] FIG. 1 illustrates an example laryngeal engagement analysis environment according to some aspects of the inventive concepts.

[0007] FIG. 2 illustrates an example visual feedback element.

[0008] FIGS. 3A, 3B, and 3C illustrate example visual feedback elements.

[0009] FIG. 4 is a flow diagram illustrative of an embodiment of a routine implemented by one or more components of the laryngeal engagement analysis system for processing speech-production data and identifying block locations in a spoken utterance.

DETAILED DESCRIPTION

[0010] In light of the description provided herein, it will be understood that the embodiments disclosed provide improvements to computer functionality in the areas of digital signal processing, real-time multimodal analysis, and automated detection of phonatory conditions. Conventional speech performance analysis workflows may rely on subjective human judgment and lack the ability to directly observe, quantify, or evaluate the rapid changes in laryngeal engagement that occur during speech production. By contrast, some inventive aspects described herein implement computer-based techniques that automatically detect sustained phonation, measure voice onset time with millisecond-level precision, derive additional engagement-related features such as airflow characteristics, temporal relationships between syllables, or other patterns that may be identified by trained machine-learning models, and determine a composite laryngeal engagement state from acoustic features and physiological indicators. These operations are executed through specialized processing carried out by computing devices and cannot be performed manually by human observers.

[0011] For example, the sustained phonation detector 122 may analyze frame-level periodicity, harmonic structure, spectral content, and/or inferred voicing characteristics at temporal resolutions far below human perceptual limits and may further evaluate additional engagement-related indicators such as intersyllabic timing patterns or inferred subglottal-pressure conditions. The voice onset time detector 123 may identify transient release bursts and the onset of periodicity using quantitative acoustic analysis that humans cannot observe with comparable accuracy, where the resulting timing values may include positive or negative voice-onset-time intervals corresponding to voicing that begins after or before a stop release. The composite laryngeal engagement system 124 may combine and evaluate multiple metrics through computational methods that apply weighting, normalization, threshold adaptation, and cross-sensor corroboration, including weighting of sustained-phonation metrics, voice-onset-time metrics, and any additional engagement-related features derived from audio data or other sensing modalities. These operations generate a unified representation of laryngeal engagement that no human could compute, let alone in real time.
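By way of non-limiting illustration, the weighting and normalization described above can be sketched as a weighted average of min-max normalized metrics. The names `Metric`, `normalize`, and `composite_engagement`, as well as the calibration bounds and weights, are hypothetical and are not taken from the disclosure; threshold adaptation and cross-sensor corroboration are omitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metric:
    """One engagement-related measurement (all fields are illustrative)."""
    value: float   # raw measurement
    lo: float      # raw value that maps to 0.0 (least engaged)
    hi: float      # raw value that maps to 1.0 (most engaged)
    weight: float  # relative contribution to the composite value


def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max scale a raw value into [0, 1], clamping out-of-range inputs.

    Passing lo > hi inverts the scale, which suits metrics such as
    voice onset time where a larger raw value means less engagement.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    scaled = (value - lo) / (hi - lo)
    return max(0.0, min(1.0, scaled))


def composite_engagement(metrics: Sequence[Metric]) -> float:
    """Weighted average of the normalized metrics, in [0, 1]."""
    total_weight = sum(m.weight for m in metrics)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(m.weight * normalize(m.value, m.lo, m.hi) for m in metrics)
    return weighted / total_weight


# Hypothetical inputs: a sustained-phonation score, a voice onset time in
# milliseconds (inverted scale), and a sensor-derived activation indicator.
example = [
    Metric(value=0.8, lo=0.0, hi=1.0, weight=0.5),
    Metric(value=70.0, lo=120.0, hi=20.0, weight=0.3),
    Metric(value=1.0, lo=0.0, hi=1.0, weight=0.2),
]
score = composite_engagement(example)
```

A weighted average is only one possible fusion rule; a trained model or a rule-based combination could stand in for `composite_engagement` without changing the surrounding flow.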

[0012] Some inventive aspects described herein enhance the functioning of computing devices by enabling continuous time alignment and fusion of audio data with physiological and imaging data obtained from a laryngeal-activation sensor 130. The speech signal processor 121 may synchronize these data streams so that engagement-related events can be identified even when the acoustic signal is weak, occluded, or absent. This synchronized multimodal analysis allows the system to detect silent blocks, pre-phonatory delays, and other engagement failures that cannot be reliably detected by humans or conventional audio-only systems.

[0013] Some inventive aspects described herein improve computer-based feedback systems. For example, the feedback generator 126 may modify visual indicators, timing thresholds, or cursor movement in real time based on continuous engagement-related measurements. These adjustments require rapid evaluation of multiple concurrent data sources and dynamic control of graphical display elements. The resulting improvements in responsiveness, adaptability, and display precision represent enhancements to the operation of the underlying computing device, rather than mere presentation of information.

[0014] Some inventive aspects described herein compute relationships between voicing-interruption frequency (sometimes referred to as stuttering frequency), heart rate, heart rate variability, respiration rate, and context dependent speaking performance. These relationships may be derived through automated computational processes that update continuously as new data is received. Some inventive aspects described herein may automatically modify internal thresholds, adjust feedback behavior, or select alternative speaking environments based on these quantitative indicators. Humans cannot perform these complex correlations or real time adjustments with comparable speed or precision.

[0015] Some inventive aspects described herein therefore address technical problems in automated phonation monitoring, multi-sensor fusion, and real-time speech analysis. The techniques described include specialized feature extraction, computational thresholding, synchronized processing of heterogeneous sensor data, identification of engagement events under ambiguous acoustic conditions, and/or automatic transformation of engagement-related measurements into control signals for computer-based displays. These capabilities materially improve the functioning of the computer systems involved and provide technical solutions that cannot be performed manually. As a result, the present disclosure represents a substantial improvement in the fields of real-time speech signal processing, multimodal physiological sensing, and automated feedback systems.

[0016] Some inventive aspects described herein relate to using acoustic cues that people cannot perceive, such as continuous-phonation continuity and voice-onset-time changes, as well as additional engagement-related patterns including airflow signatures, intersyllabic temporal features, or other features discovered by machine-learning models, to identify the moment of an engagement-reduction event, and to providing system-controlled feedback based on the voicing-interruption condition.

Laryngeal Engagement Analysis Environment

[0017] FIG. 1 illustrates an example laryngeal engagement analysis environment 100 that includes a user interface device 110, a network 105, a laryngeal engagement analysis system 120, a laryngeal-activation sensor 130, and a performance management portal 140. To simplify discussion and not to limit the present disclosure, FIG. 1 illustrates only one user 102, user interface device 110, laryngeal-activation sensor 130, and laryngeal engagement analysis system 120, although multiple of any of these components may be used.

[0018] Any of the components of the environment 100 may communicate through the network 105. Although a single network 105 is illustrated, multiple distinct or distributed networks may be present. The network 105 can include any type of communication network. For example, the network 105 may include one or more of a wide-area network (WAN), a local-area network (LAN), a wireless network, a cellular network (e.g., 5G, LTE, or other cellular technologies), a satellite network, an ad-hoc network, a wired network, or combinations thereof. In some embodiments, the network 105 can include the Internet. The network 105 may support transmission of audio data, sensor data, or physiological data for real-time, near-real-time, or delayed analysis. The network 105 may support extended-reality (XR) environments such as, but not limited to, virtual-reality (VR), augmented-reality (AR), or mixed-reality (MR) interfaces in which stimulus phrases, conversational tasks, or feedback may be delivered to the user 102 via the user interface device 110 or another device. In some implementations, the network 105 facilitates cloud-based processing, low-latency communication, or distributed execution of functions across multiple computing resources.

[0019] Any of the components or systems of the environment 100, such as the user interface device 110, the laryngeal engagement analysis system 120, the laryngeal-activation sensor 130, or the performance management portal 140, may be implemented using individual computing devices, processors, distributed computing platforms, servers, edge-processing devices, or cloud-based execution environments. In some embodiments, any of the components may be implemented within virtualized or isolated execution environments, such as, but not limited to, virtual machines, containers, or other software-defined computing resources. The components of the environment 100 may be combined in various configurations. For example, portions of the laryngeal engagement analysis system 120 may execute on the user interface device 110, the performance management portal 140, on a remote server accessible through the network 105, or across multiple distributed systems. Similarly, functionality associated with the laryngeal-activation sensor 130 may be integrated into the user interface device 110 or may operate as an independent sensing device that communicates through the network 105. Any of the foregoing components may include software, firmware, hardware, or any combination of software, firmware, and hardware suitable for performing the operations described herein.

[0020] The user 102 refers to an individual who produces spoken utterances for analysis within the environment 100. Such speech may include, without limitation, a single syllable, a word, a multisyllabic word, a phrase, a sentence, continuous reading, or conversational speech that the user initiates or produces in response to a prompt. The user 102 may experience difficulty initiating or sustaining phonation or may exhibit speech-production disruptions including, but not limited to, hesitations, blocks, or interruptions in voicing that are detectable from the spoken utterance. The user 102 may engage in speaking tasks such as repeating stimulus phrases, reading structured passages, or participating in conversational prompts presented by the user interface device 110, and the resulting speech is evaluated by the laryngeal engagement analysis system 120. The user 102 may receive visual, auditory, or interactive feedback through the user interface device 110 based on the detection of phonatory disruptions or block locations.

[0021] The user interface device 110 includes any computing device configured to present a stimulus phrase to a user 102 or receive a spoken utterance from the user 102. The user interface device 110 may include, but is not limited to, a smartphone, tablet, laptop, desktop computer, wearable device, smart speaker, or extended-reality headset. The user interface device 110 may include one or more components for capturing speech or user input, including microphones, cameras, touch-sensitive displays, speakers, motion sensors, inertial sensors, or depth sensors. The user interface device 110 may include audio hardware including, but not limited to, near-field microphones, array microphones, beamforming microphones, noise-canceling microphones, built-in speakers, or headphones used to present auditory stimuli or feedback to the user 102.

[0022] The user interface device 110 can include local processing resources configured to execute applications or modules that manage stimulus delivery, control extended-reality scenes, generate visual indicators of laryngeal engagement, or coordinate communication with the laryngeal engagement analysis system 120. The user interface device 110 may include graphics hardware for rendering virtual-reality or augmented-reality environments or displaying visual feedback including voice bars, progress bars, cursor indicators, or phonation-related animations. The user interface device 110 may include communication capabilities for transmitting audio data or sensor data through the network 105 or for synchronizing speech data with one or more laryngeal-activation sensors 130.

[0023] The user interface device 110 can include storage or memory resources for caching speech recordings, application state data, stimulus-phrase libraries, or user-specific performance information used during speech performance analysis tasks. The user interface device 110 may include haptic components or illumination components that provide tactile or visual cues corresponding to block detection, laryngeal engagement, or feedback prompts generated by the laryngeal engagement analysis system 120. In some implementations, the user interface device 110 may include positional-tracking systems or XR controllers that enable the user 102 to participate in immersive speaking scenarios or interactive speech-practice tasks that are monitored for phonation behavior.

[0024] The user interface device 110 can include functionality for presenting a stimulus phrase in textual, auditory, graphical, or extended-reality formats. The user interface device 110 may generate or display structured speaking tasks, continuous-reading tasks, conversational prompts, or immersive speaking scenarios including, but not limited to, those produced in virtual-reality or augmented-reality environments. The user interface device 110 may interact with the user 102 by presenting syllable-level prompts, displaying model utterances, or initiating dialogue sequences that are monitored for sustained phonation or voice-onset-time characteristics.

[0025] The user interface device 110 can include functionality for receiving or transmitting speech signals or sensor signals to the laryngeal engagement analysis system 120. The user interface device 110 may capture the spoken utterance, convert the response to a speech signal, or transmit raw or processed audio data across the network 105 for analysis. The user interface device 110 may communicate with the laryngeal-activation sensor 130 to obtain indications of laryngeal engagement that contribute to generation of the composite laryngeal engagement value.

[0026] The user interface device 110 can include functionality for presenting at least one of an indication of a block location or a feedback prompt associated with the block location. The user interface device 110 may display visual indicators including, but not limited to, voice bars, progress bars, cursor indicators, or graphical elements that represent sustained phonation or laryngeal engagement. The user interface device 110 may animate or update such indicators in real time, or present feedback prompts such as re-initiating phonation or prolonging a vowel segment based on output from the laryngeal engagement analysis system 120. The user interface device 110 may present visual feedback, auditory feedback, or extended-reality feedback, or present summaries of the user's performance across tasks or sessions.

[0027] For purposes of this disclosure, a block location may be referred to as an engagement-reduction point and may refer to a point within a spoken utterance at which laryngeal engagement decreases or fails to satisfy a system-defined phonation criterion expected for the utterance. An engagement-reduction point may occur at the onset of a word, during a transition between phonetic segments, or at any point in the utterance where continuous periodic voicing or timely initiation of voicing is expected. An engagement-reduction point may be identified by, for example, detecting a drop in periodic-voicing features, an increase in voice-onset-time beyond an expected timing range, a voice-onset-time interval of inappropriate sign (e.g., unexpected negative VOT), one or more airflow- or pressure-based indicators of disengagement, intersyllabic temporal patterns that deviate from an expected engagement profile, or a decrease in a composite laryngeal engagement value. An engagement-reduction point may correspond to a moment in which the system detects a delay or interruption in initiating or maintaining periodic vocal-fold vibration, or another speech-production disruption observed in the signal data. An engagement-reduction point may be used or determined by the laryngeal engagement analysis system 120 to generate an indication of the detected voicing-interruption condition or to provide a system-controlled interface-state adjustment through the user interface device 110. In some cases, the term engagement-reduction point refers to a time-aligned point within an utterance at which expected vocal-fold vibration fails to occur or ceases, including a missing or delayed onset of periodic voicing at a voiceless→voiced transition, or an unintended cessation of periodic voicing within a voiced segment. An engagement-reduction point may be strictly defined by measurable acoustic or sensor-derived indicators, independent of any subjective judgment.
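As a minimal, non-limiting sketch of one way such time-aligned points could be located, the following scans a per-frame composite value and reports spans in which voicing is expected but the value falls below a criterion. The function name `find_reduction_points`, the per-frame representation, and the fixed criterion are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def find_reduction_points(
    times_s: Sequence[float],
    composite: Sequence[float],
    expected_voiced: Sequence[bool],
    criterion: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) time spans where voicing is expected but the
    composite engagement value is below the criterion.

    A span ends at the timestamp of the first frame that no longer fails,
    or at the last timestamp if the failure runs to the end of the input.
    """
    spans: List[Tuple[float, float]] = []
    start = None
    for t, value, expected in zip(times_s, composite, expected_voiced):
        failing = expected and value < criterion
        if failing and start is None:
            start = t                    # a failing span opens here
        elif not failing and start is not None:
            spans.append((start, t))     # the failing span closes here
            start = None
    if start is not None:
        spans.append((start, times_s[-1]))
    return spans


# Hypothetical 10 ms frames; the fifth frame is a voiceless interval, so
# its low value is not reported.
times = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]
values = [0.9, 0.4, 0.3, 0.8, 0.2, 0.9]
expected = [True, True, True, True, False, True]
points = find_reduction_points(times, values, expected, criterion=0.5)
```

Gating on the expected-voicing flag keeps legitimately voiceless intervals from being reported, which mirrors the definition above of a point where voicing is expected but absent.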

[0028] The laryngeal engagement analysis system 120 includes computing resources that can coordinate the overall process of evaluating phonatory behavior or vocal-fold activity of the user 102 and supporting improvement in speaking performance. The laryngeal engagement analysis system 120 may receive data from the user interface device 110 or from the laryngeal-activation sensor 130 and may assess the data to identify patterns of voicing, determine whether the user 102 experiences difficulty initiating or sustaining voicing, or detect locations within an utterance where laryngeal engagement decreases. The laryngeal engagement analysis system 120 can generate information, prompts, or guidance that may be delivered through the user interface device 110 to encourage more stable voicing behavior or to help the user 102 adjust phonatory patterns during speaking tasks.

[0029] The speech signal processor 121 may manage acquisition or preparation of speech data for laryngeal engagement analysis. The speech signal processor 121 can receive an acoustic signal corresponding to a spoken utterance from the user interface device 110 or from another capture device, or can obtain a digitized speech signal that has been buffered or streamed over the network 105. The speech signal processor 121 may perform pre-processing operations such as noise suppression, echo cancellation, automatic gain control, channel equalization, or sampling-rate conversion to improve suitability of the signal for subsequent detection of sustained phonation or voice onset time. In some implementations, the speech signal processor 121 can segment the speech signal into frames or phonetic regions, can perform endpoint detection to identify utterance onsets or offsets, or can align the speech signal with a known stimulus phrase or with a phonetic transcription obtained from an automatic speech recognizer. The speech signal processor 121 may compute acoustic features such as, but not limited to, pitch, voicing probability, spectral energy, cepstral coefficients, formant frequencies, or temporal envelopes that may be provided to the sustained phonation detector 122, the voice-onset-time detector 123, or other components of the laryngeal engagement analysis system 120.
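For illustration only, frame segmentation and two simple per-frame features can be sketched as follows. Root-mean-square energy and zero-crossing rate are stand-ins chosen for brevity; they are common voicing cues but are not named in the disclosure, and the function names and frame sizes are hypothetical.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a signal into overlapping frames, dropping any partial tail."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, hop)]


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ.

    Voiced speech tends to show a lower rate than voiceless frication.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Hypothetical usage: 4-sample frames with a 2-sample hop.
frames = frame_signal(list(range(10)), frame_len=4, hop=2)
features = [(rms(f), zero_crossing_rate(f)) for f in frames]
```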

[0030] In some cases, the speech signal processor 121 can combine audio data with auxiliary information such as timestamps, phoneme labels, syllable boundaries, or prosodic markers to support evaluation of laryngeal engagement at specific transitions between phonetic segments. The speech signal processor 121 may integrate device-level metadata, such as microphone configuration or XR-scene context, to distinguish structured reading tasks from conversational speech or extended-reality interactions. The speech signal processor 121 may coordinate with one or more laryngeal-activation sensors 130 by synchronizing audio frames with physiological measurements or imaging data, so that voice-related metrics or sensor-derived indicators can be fused on a common timeline.
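One simple way to place audio frames and sensor samples on a common timeline is nearest-timestamp matching with a skew limit, sketched below. The function names and the `max_skew_s` parameter are hypothetical, and both streams are assumed to carry timestamps from a shared clock; a deployed system might instead interpolate or resample.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def nearest_index(sorted_times: Sequence[float], t: float) -> int:
    """Index of the timestamp closest to t in a non-empty sorted sequence."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    return i if (after - t) < (t - before) else i - 1


def align_sensor_to_frames(
    frame_times: Sequence[float],
    sensor_times: Sequence[float],
    sensor_values: Sequence[float],
    max_skew_s: float,
) -> List[Optional[float]]:
    """For each audio frame time, pick the nearest sensor value, or None
    when no sensor sample lies within max_skew_s of the frame."""
    if not sensor_times:
        return [None] * len(frame_times)
    aligned: List[Optional[float]] = []
    for t in frame_times:
        j = nearest_index(sensor_times, t)
        within = abs(sensor_times[j] - t) <= max_skew_s
        aligned.append(sensor_values[j] if within else None)
    return aligned


# Hypothetical timestamps in seconds on a shared clock.
aligned = align_sensor_to_frames(
    frame_times=[0.00, 0.01, 0.02, 0.10],
    sensor_times=[0.000, 0.012, 0.021],
    sensor_values=[0.1, 0.7, 0.9],
    max_skew_s=0.005,
)
```

Returning `None` for frames without a nearby sensor sample lets a downstream fusion step fall back to audio-only evidence for those frames.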

[0031] The sustained phonation detector 122 may evaluate whether the user 102 maintains continuous voicing during portions of an utterance where steady phonation is expected. Continuous voicing generally means that the vocal folds remain vibrating without unintended pauses during segments that are intended to be voiced, and is a direct acoustic indicator that the larynx is engaged. When the larynx is engaged, the vocal folds stay appropriately adducted and vibrate in a regular, periodic pattern, producing a smooth and uninterrupted sound. Normal speech does not require the larynx to be engaged at all times, because some sounds, such as voiceless consonants, are naturally produced without vocal-fold vibration. However, fluent speakers rapidly re-engage the larynx at the exact moments voicing should resume. When the larynx disengages or fails to adduct adequately at a moment where voicing should occur, the vocal folds fail to vibrate despite respiratory effort, causing the sound to drop out; this loss of vibration often produces an internal “block” characteristic of stuttering. The sustained phonation detector 122 can analyze features of the speech signal, such as pitch periodicity, voicing probability, or formant patterns, and can further incorporate other engagement-related indicators including airflow continuity, subglottal-pressure surrogates, or intersyllabic timing patterns, to identify whether periodic vibration is present, whether it continues across vowels or syllable transitions, and/or whether it unexpectedly ceases. By detecting whether and/or where the user 102 maintains or loses this vibration when voicing is expected, the sustained phonation detector 122 can provide an objective indication of how consistently the user keeps the larynx engaged during an utterance. Such information can be used by downstream components to identify points of phonatory instability, adjust system feedback parameters, or select subsequent speech stimuli that target specific patterns of reduced laryngeal engagement.

[0032] For purposes of this disclosure, an engaged larynx may refer to a physiological state in which the vocal folds are sufficiently adducted and are vibrating in a self-sustaining, periodic manner in response to subglottal air pressure. Once initiated, such vibration is often governed primarily by myoelastic-aerodynamic principles, in which vocal-fold elasticity and glottal airflow interact to maintain oscillation without requiring continuous neural adjustment. As a result, assuming healthy vocal folds and adequate airflow, sustained phonation during a vowel or other voiced segment is expected to remain stable once voicing has begun. Observable disruptions associated with speech disfluency therefore tend to occur at moments of voice initiation or re-initiation, such as at utterance onsets or following a termination of voicing, rather than during uninterrupted sustained phonation. Accordingly, in some embodiments described herein, reduced laryngeal engagement corresponds to conditions in which this self-sustaining oscillatory state fails to initiate, is delayed beyond an expected timing window, or is terminated due to a loss of subglottal air pressure or other disengagement event. These engaged and disengaged laryngeal states may be inferred from acoustic features, sensor-derived indicators, or patterns identified by trained machine-learning models, as described herein.

[0033] The sustained phonation detector 122 may determine where laryngeal engagement is expected by referencing a phonetic representation of the utterance (e.g., a predetermined stimulus phrase provided to the user, a word or sentence spontaneously spoken by the user, etc.). In some implementations, the sustained phonation detector 122 can access a phonetic transcription associated with a stimulus phrase presented to the user 102, such as /m u v/ for the word “move” or /p æ k/ for the word “pack”, where each phoneme is already known to the system and can be annotated with its corresponding voicing property. In some implementations, the sustained phonation detector 122 can receive a phoneme sequence generated by the speech signal processor 121 from an automatic speech recognizer operating on a spontaneous spoken utterance, such as converting the open-ended utterance “I think so” into a phonetic sequence like /aɪ θ ɪ ŋ k s oʊ/. In some or all cases, each phoneme may be annotated with a voicing attribute indicating whether the sound normally requires vocal-fold vibration; for example, vowels such as /u/ or /æ/ and voiced consonants such as /v/ or /g/ may be marked as requiring engagement, whereas voiceless consonants such as /p/, /t/, or /θ/ may be marked as non-voicing intervals. Using this voicing attribute, the sustained phonation detector 122 can identify intervals during which the larynx is expected to be engaged, such as vowels, voiced consonants, or transitions between sequential voiced segments. In some cases, the sustained phonation detector 122 may identify transitions from voiceless to voiced segments as engagement-onset locations at which phonation should begin promptly following the release of a voiceless consonant, such as the transition from /p/ to /æ/ in “pack” or from /θ/ to /ɪ/ in “think”.
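The voicing annotation described above can be illustrated with a small lookup table. The table below covers only the phonemes used in the examples of this disclosure, and the names `VOICING`, `voiced_runs`, and `engagement_onsets` are hypothetical; a practical system would presumably draw on a complete phoneme inventory.

```python
from typing import List, Sequence, Tuple

# Illustrative voicing table: True means vocal-fold vibration is expected.
VOICING = {
    "m": True, "u": True, "v": True, "æ": True, "g": True, "n": True,
    "aɪ": True, "ɪ": True, "ŋ": True, "oʊ": True, "eɪ": True,
    "p": False, "t": False, "k": False, "θ": False, "s": False, "f": False,
}


def is_voiced(phoneme: str) -> bool:
    """Look up the voicing attribute; unknown phonemes raise KeyError."""
    return VOICING[phoneme]


def voiced_runs(phonemes: Sequence[str]) -> List[Tuple[int, int]]:
    """Inclusive (first, last) index spans of consecutive voiced phonemes,
    i.e. intervals where the larynx is expected to stay engaged."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(phonemes):
        if is_voiced(p):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(phonemes) - 1))
    return runs


def engagement_onsets(phonemes: Sequence[str]) -> List[int]:
    """Indices of voiced phonemes that directly follow a voiceless one."""
    return [
        i for i in range(1, len(phonemes))
        if is_voiced(phonemes[i]) and not is_voiced(phonemes[i - 1])
    ]


pack = ["p", "æ", "k"]
i_think_so = ["aɪ", "θ", "ɪ", "ŋ", "k", "s", "oʊ"]
```

With these inputs, `engagement_onsets(pack)` marks the /p/ to /æ/ transition, and `voiced_runs(i_think_so)` separates the voiced stretches of “I think so” from its voiceless consonants.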

[0034] In some embodiments, the sustained phonation detector 122 may determine expected engagement positions by analyzing boundary information supplied by the speech signal processor 121, including timestamps corresponding to syllable nuclei, vowel onsets, or release bursts of stop consonants. When the user 102 repeats a known stimulus phrase, such as “take it,” the speech signal processor 121 may identify the release burst of the voiceless /t/ in /t eɪ k/ and provide a corresponding timestamp. The sustained phonation detector 122 may classify the onset of the following vowel /eɪ/ as an engagement-onset position because the larynx is expected to begin vibrating immediately after the release. Conversely, during spontaneous speech such as “it’s fine,” the speech signal processor 121 may identify the voiced sequence /ɪ t s f aɪ n/, and the sustained phonation detector 122 may treat the region spanning the vowel /aɪ/ and the nasal /n/ as a continuous-voicing interval where the larynx should remain engaged. In either case, the sustained phonation detector 122 may maintain an expected-voicing timeline that represents the specific intervals during which the larynx should remain engaged based on detected phonetic boundaries. In some embodiments, the sustained phonation detector 122 may classify the phonetic environment of the stimulus phrase or spontaneous utterance (for example, distinguishing stop-vowel transitions, vowel-vowel sequences, or nasal-vowel boundaries) to determine which segment boundaries are evaluated for laryngeal engagement.

[0035] The sustained phonation detector 122 may evaluate whether the user 102 engaged the larynx at each expected interval by analyzing one or more acoustic features computed by the speech signal processor 121. These features may include pitch periodicity, fundamental-frequency estimates, voicing-probability curves, spectral-harmonic structure, harmonic-to-noise ratios, or low-frequency energy distributions that are characteristic of vocal-fold vibration. For instance, if the system expects voicing on the vowel /æ/ in a known stimulus such as “cat”, the sustained phonation detector 122 may examine whether periodicity appears immediately after the /k/ release. In the case of spontaneous speech such as “I know,” the sustained phonation detector 122 may analyze whether the voiced sequence /aɪ n oʊ/ exhibits uninterrupted harmonic structure across the vowel transitions. If required features exceed defined thresholds for a minimum duration, the sustained phonation detector 122 may classify the interval as having adequate laryngeal engagement. If the features fail to reach a threshold, or if periodicity is absent longer than an allowable temporal gap, the sustained phonation detector 122 may classify the interval as exhibiting a lapse of engagement.
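The threshold, minimum-duration, and allowable-gap logic above might be sketched as follows for a single expected-voicing interval. The use of per-frame voicing probability as the sole feature, the interpretation of minimum duration as total voiced time, and all names and numeric values are assumptions made for illustration.

```python
from typing import Sequence


def classify_interval(
    voicing_prob: Sequence[float],
    hop_ms: int,
    threshold: float,
    min_voiced_ms: int,
    max_gap_ms: int,
) -> str:
    """Label one expected-voicing interval as 'adequate' or 'lapse'.

    A frame counts as voiced when its probability meets the threshold.
    The interval is adequate only if the total voiced time reaches
    min_voiced_ms and no unvoiced stretch exceeds max_gap_ms.
    """
    voiced_frames = 0
    current_gap = 0
    longest_gap = 0
    for p in voicing_prob:
        if p >= threshold:
            voiced_frames += 1
            current_gap = 0
        else:
            current_gap += 1
            longest_gap = max(longest_gap, current_gap)
    voiced_ms = voiced_frames * hop_ms
    longest_gap_ms = longest_gap * hop_ms
    if voiced_ms >= min_voiced_ms and longest_gap_ms <= max_gap_ms:
        return "adequate"
    return "lapse"


# Hypothetical 10 ms frames within one expected-voicing interval.
steady = classify_interval([0.9, 0.8, 0.2, 0.9, 0.9], 10, 0.5, 30, 20)
dropout = classify_interval([0.9, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9], 10, 0.5, 30, 20)
```

Integer millisecond arithmetic is used here so that the duration comparisons are exact rather than subject to floating-point rounding.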

[0036] In some implementations, the sustained phonation detector 122 may use temporal-alignment techniques to compare expected voicing intervals with observed acoustic features. The sustained phonation detector 122 may align the speech signal with the expected phonetic timeline using forced alignment, dynamic time-warping, energy-based syllable detection, or vowel-center identification. For a system-provided stimulus such as “apple”, the expected phonetic sequence (/æ p ə l/) allows the sustained phonation detector 122 to determine that voicing is expected at the beginning of /æ/ and again after the voiceless /p/. If periodicity does not appear within an acceptable window following the /p/ release, the sustained phonation detector 122 may identify a delayed-onset interval. When the user produces spontaneous speech such as “about it,” alignment may be guided by automatically detected vowel centers at /ə/, /aʊ/, and /ɪ/, enabling the sustained phonation detector 122 to compare the observed voicing pattern with the inferred timing of expected engagement across consecutive vowels.
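A delayed-onset check of the kind described for the /p/ release can be sketched as below, assuming the release time and per-frame voiced flags are already available from earlier processing. The names, the status labels, and the window values are hypothetical; the sketch only searches forward from the release, so negative voice-onset-time intervals are outside its scope.

```python
from typing import Optional, Sequence, Tuple


def first_voiced_time_ms(
    frame_times_ms: Sequence[int],
    voiced_flags: Sequence[bool],
    search_from_ms: int,
) -> Optional[int]:
    """Time of the first voiced frame at or after search_from_ms, if any."""
    for t, voiced in zip(frame_times_ms, voiced_flags):
        if t >= search_from_ms and voiced:
            return t
    return None


def onset_status(
    release_ms: int,
    frame_times_ms: Sequence[int],
    voiced_flags: Sequence[bool],
    window_ms: int,
) -> Tuple[str, Optional[int]]:
    """Classify voicing onset after a stop release.

    Returns ('on_time' | 'delayed' | 'missing', onset delay in ms or None).
    """
    onset = first_voiced_time_ms(frame_times_ms, voiced_flags, release_ms)
    if onset is None:
        return ("missing", None)
    delay = onset - release_ms
    return ("on_time" if delay <= window_ms else "delayed", delay)


# Hypothetical 10 ms frames: periodicity first appears at 60 ms and the
# stop release is detected at 20 ms.
times_ms = list(range(0, 110, 10))
flags = [t >= 60 for t in times_ms]
status = onset_status(release_ms=20, frame_times_ms=times_ms,
                      voiced_flags=flags, window_ms=30)
```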

[0037] The sustained phonation detector 122 may evaluate continuity of voicing across segment boundaries to determine whether the user 102 maintains engagement during transitions that normally require uninterrupted phonation. For example, in the system-provided phrase “move along”, the voiced sequence /m u v ə l ɔ ŋ/ contains several transitions, such as /u/ → /v/ or /l/ → /ɔ/, during which continuous vibration is expected. If the sustained phonation detector 122 identifies a cessation of voicing between any of these transitions, this may indicate decreased laryngeal engagement. In contrast, for spontaneous speech such as “I’m going now,” the detector may treat the sequence /aɪ m g oʊ ɪ ŋ n aʊ/ as a multi-segment voiced region, except for naturally voiceless intervals if present, and may assess whether the user maintains voicing across the expected /m/ → /g/ and /oʊ/ → /ɪ/ transitions. Conversely, in a phrase such as “pack a bag”, the sustained phonation detector 122 may correctly classify the /p/ in /p æ k/ as a legitimate non-voicing interval and evaluate whether voicing resumes promptly at the onset of /æ/.

[0038] In some cases, the sustained phonation detector 122 may classify the magnitude or nature of an engagement deviation. For example, the sustained phonation detector 122 may distinguish between a brief micro-gap in periodicity during a stimulus phrase such as “ready”, e.g., a short unexplained drop in voicing between /r/ and /ɛ/, and a prolonged absence of voicing following a voiceless stop in a phrase such as “take it”. In spontaneous speech, the sustained phonation detector 122 may detect repeated interruption patterns across similar phonetic contexts, such as multiple delayed-onset events following voiceless-to-voiced transitions like /t/ → /u/ or /k/ → /a/ in the user's natural utterances. The sustained phonation detector 122 may quantify these deviations by computing the duration of the disengagement, the timing of the disengagement relative to the expected onset, or the number of affected transitions. These metrics may be stored or provided to other components of the laryngeal engagement analysis system 120 for generating block-location indicators, adjusting timing thresholds, or identifying phonetic environments associated with decreased laryngeal engagement.

[0039] In some implementations, the sustained phonation detector 122 may employ machine-learning or generative-AI models to enhance detection of expected and observed laryngeal engagement. A machine-learning model may be trained on labeled speech data to classify each frame of the speech signal as voiced or unvoiced, or more generally to classify segments as engaged or disengaged based on any combination of acoustic, airflow-related, temporal, physiological, or other engagement-related features learned by the model, or to predict a likelihood of periodic vocal-fold vibration based on acoustic features such as pitch estimates, spectral-harmonic structure, cepstral embeddings, or temporal modulation patterns. For example, a recurrent neural network or transformer-based acoustic classifier may learn to detect the onset of voicing in a /p/ → /æ/ transition or to identify subtle disruptions in periodicity across a voiced sequence such as /m u v/ in “move”. In some cases, the sustained phonation detector 122 may use a machine-learning model to segment the user’s spoken utterance into phonetic units and to infer which segments are expected to contain continuous voicing, supporting the determination of sustained phonation metrics or voice-onset-time metrics. In some cases, a generative-AI model may be used to produce or refine phonetic representations of spontaneous speech, to synthesize alternative pronunciations for alignment purposes, or to generate user-specific stimulus phrases that incorporate phonetic challenge types associated with reduced laryngeal engagement. For example, a generative model may produce a new phrase containing a voiceless-to-voiced transition that the laryngeal engagement analysis system 120 has identified as difficult for the user 102, enabling the sustained phonation detector 122 to evaluate whether the user 102 re-engages the larynx in accordance with expected phonation patterns. In some embodiments, the sustained phonation detector 122 may integrate outputs from rule-based phonetic logic and/or machine-learning classifiers, using the combined information to improve accuracy in determining whether expected engagement occurred, whether periodic voicing remained continuous across a voiced region, or whether a disruption reflects a potential block location that can be provided to downstream components of the laryngeal engagement analysis system 120.

[0040] The voice-onset-time detector 123 may evaluate a temporal interval between a release burst of a stop consonant, which may be voiced or voiceless, and an onset of periodic vocal-fold vibration in the segment in which periodic voicing begins. The temporal interval may be positive, indicating that voicing begins after the stop release, or negative, indicating that voicing begins before the stop release. This interval, referred to as voice onset time (VOT), can reflect whether the user 102 initiates laryngeal engagement at the expected moment following a stop release, rather than whether the user maintains engagement once voicing has begun, which may be determined by the laryngeal engagement analysis system 120. In typical adult speech, many English stop-vowel transitions exhibit VOT values in the range of approximately 20-60 milliseconds for aspirated stops such as /p/, /t/, or /k/, and in the range of approximately 10-30 milliseconds for less aspirated contexts. When the larynx fails to initiate vibration within these expected windows, the observed VOT may be abnormally prolonged, interrupted, or missing, indicating delayed or absent engagement rather than a lapse in continuous voicing. The voice-onset-time detector 123 may therefore provide an objective indicator of engagement initiation rather than engagement continuity.

[0041] The voice-onset-time detector 123 may determine where a VOT interval is expected by referencing a phonetic representation of the utterance. When the user repeats a structured system-provided stimulus such as “take it” (/t eɪ k ɪ t/), the voiceless stops /t/ and /k/ are already known to the system, along with the expected post-stop vowels /eɪ/ and /ɪ/. Each stop-vowel pair may therefore be pre-annotated as requiring a VOT measurement. In spontaneous speech, such as the user saying “maybe tomorrow,” the speech signal processor 121 may convert the utterance into a phonetic sequence such as /m eɪ b i t ə m ɑ r oʊ/, from which the voice-onset-time detector 123 may identify the voiceless /t/ followed by the vowel /ə/ and designate that location as requiring a VOT analysis. In both structured and spontaneous cases, the voice-onset-time detector 123 may compile an expected set of VOT regions for subsequent timing evaluation.

[0042] In some embodiments, the voice-onset-time detector 123 may analyze acoustic boundary information supplied by the speech signal processor 121, such as timestamps corresponding to stop-release bursts, aspiration noise, or vowel onsets. For example, when the user speaks the stimulus word “cap,” the speech signal processor 121 may detect the high-frequency transient associated with the /k/ release at time t = 2.310 s, and the rising periodic energy associated with the vowel /æ/ at time t = 2.355 s, resulting in an observed VOT of 45 ms. In spontaneous speech such as “I can do it,” the processor may identify the release of the /k/ in “can” and the onset of /æ/ to compute a similar interval. The voice-onset-time detector 123 may use this timing information to construct an observed VOT timeline and compare it with expected ranges for the given phonetic context.
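As a minimal sketch of the timing computation in the “cap” example, the observed VOT may be obtained by subtracting the burst timestamp from the voicing-onset timestamp and then compared with an expected range. The helper names `voice_onset_time_ms` and `within_expected_range` are hypothetical, and the 20-60 ms range is used here only because it is the aspirated-stop range given in this description.

```python
def voice_onset_time_ms(burst_s: float, voicing_onset_s: float) -> float:
    """Signed VOT in milliseconds (negative if voicing precedes the burst)."""
    return (voicing_onset_s - burst_s) * 1000.0


def within_expected_range(vot_ms: float, low_ms: float, high_ms: float) -> bool:
    """True if the observed VOT lies inside the expected range (inclusive)."""
    return low_ms <= vot_ms <= high_ms


# "cap": /k/ release at 2.310 s, vowel onset at 2.355 s.
vot = voice_onset_time_ms(2.310, 2.355)
print(round(vot, 1), within_expected_range(vot, 20.0, 60.0))
```

Keeping the sign of the difference allows the same helper to represent prevoicing, in which voicing begins before the stop release.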

[0043] The voice-onset-time detector 123 may determine whether the user 102 initiated phonation within a permissible VOT tolerance by analyzing acoustic features associated with the onset of voicing. These features may include, but are not limited to, increases in harmonic-to-noise ratio, low-frequency periodic energy, stable pitch estimates, or voicing-probability values that exceed a defined threshold for a minimum temporal duration (e.g., 15-20 ms). For instance, in the phrase “pick up,” the voice-onset-time detector 123 may examine whether periodicity emerges within 20-70 ms after the /p/ release. If the user 102 produces a VOT of 180 ms, or a negative VOT of inappropriate magnitude, or fails to produce periodic vibration following the release, the voice-onset-time detector 123 may classify the event as a delayed or absent engagement-onset.

[0044] In some implementations, the voice-onset-time detector 123 may apply temporal-alignment techniques such as forced alignment, dynamic time-warping, or vowel-center identification to refine the correspondence between expected and observed boundaries. For example, in a structured phrase such as “potato,” alignment techniques may anchor the expected releases of /p/ and /t/ and allow the voice-onset-time detector 123 to validate whether voicing begins within typical ranges (e.g., 20-60 ms for the /p/ → /ə/ transition). In spontaneous productions like “I took it,” automatic vowel-center detection may identify the beginning of /ʊ/ and /ɪ/ to support precise VOT estimation, even when speech rate or prosody varies.

[0045] The voice-onset-time detector 123 may classify the magnitude and nature of VOT deviations using one or more thresholds. For example, the detector may distinguish: (i) a mild delay (e.g., VOT extended from 40 ms to 80 ms), (ii) a moderate delay (e.g., 120-200 ms), or (iii) a severe delay or omission, in which no periodicity is detected within 300 ms of the release burst. The detector may also classify inappropriate or unstable negative VOT values, in which voicing begins substantially before the stop release. The voice-onset-time detector 123 may classify the frequency of delayed VOT events across similar contexts. For instance, repeated prolongation of VOT in /t/ → vowel transitions during spontaneous conversation may indicate a recurring difficulty initiating laryngeal engagement. These metrics may be transmitted to other components of the laryngeal engagement analysis system 120 to support generation of a composite laryngeal engagement value or identification of block locations.
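The tiered classification described above might be sketched as follows, for illustration only. The description gives example values (80 ms, 120-200 ms, 300 ms) rather than complete cut points, so the boundaries between tiers, the negative-VOT limit, and the names `classify_vot` and `delayed_counts` are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_vot(vot_ms: Optional[float],
                 expected_max_ms: float = 60.0,     # assumed upper end of typical range
                 mild_max_ms: float = 120.0,        # assumed mild/moderate boundary
                 severe_after_ms: float = 300.0,    # no periodicity within 300 ms
                 negative_limit_ms: float = -100.0  # assumed prevoicing limit
                 ) -> str:
    """Classify one VOT measurement; None means no voicing onset was detected."""
    if vot_ms is None or vot_ms > severe_after_ms:
        return "severe_or_omitted"
    if vot_ms < negative_limit_ms:
        return "atypical_negative"
    if vot_ms <= expected_max_ms:
        return "typical"
    if vot_ms <= mild_max_ms:
        return "mild"
    return "moderate"


def delayed_counts(events: Iterable[Tuple[str, Optional[float]]]) -> Counter:
    """Count non-typical VOT events per phonetic-context label."""
    return Counter(ctx for ctx, vot in events if classify_vot(vot) != "typical")


events = [("t-vowel", 45.0), ("t-vowel", 150.0), ("t-vowel", None), ("k-vowel", 50.0)]
print(classify_vot(80.0), classify_vot(180.0), delayed_counts(events))
```

The per-context counter corresponds to the frequency classification mentioned above: a high count for one context label would suggest a recurring difficulty in that transition type.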

[0046] In some embodiments, the voice-onset-time detector 123 may employ machine-learning models to improve detection of release bursts and voicing onsets. A neural classifier may be trained to recognize the acoustic signature of stop bursts using features such as high-frequency transients, sudden energy spikes, or spectral discontinuities. A second model may be trained to predict frame-level probabilities of voicing onset using harmonically rich spectro-temporal cues. For example, a transformer-based model may learn typical VOT patterns for /k/ → /i/ transitions, while a convolutional classifier may learn to detect the burst of /t/ under variable environmental noise. Model outputs may be fused to compute VOT values with greater robustness than rule-based methods.

[0047] In some cases, generative-AI models may be used to supplement the voice-onset-time detector 123 by producing or refining phonetic expectations, synthesizing alternative realizations of stop-vowel transitions, or generating new user-specific phrases that contain stop-vowel contexts associated with delayed VOT patterns. For example, if the voice-onset-time detector 123 identifies unusually long VOT values in /k/ → /æ/ transitions across multiple utterances, a generative model may produce a set of new phrases emphasizing /k/ → vowel transitions to evaluate whether the user initiates phonation more reliably in different lexical or prosodic environments. Outputs from generative models may assist the voice-onset-time detector 123 in aligning spontaneous speech with predicted phonetic structures when the user produces reduced forms, variable pronunciations, or disfluencies. In some implementations, the voice-onset-time detector 123 may integrate outputs from rule-based phonetic logic, machine-learning classifiers, or generative-AI predictions to refine VOT estimates and to identify timing deviations associated with reduced laryngeal engagement.

[0048] The composite laryngeal engagement system 124 may generate a composite representation of laryngeal engagement exhibited by the user 102 during an utterance. The composite representation may be based on the sustained phonation metric (e.g., produced by the sustained phonation detector 122), the voice-onset-time metric (e.g., produced by the voice-onset-time detector 123), information obtained from the laryngeal-activation sensor 130, and/or one or more additional engagement-related features derived from the audio data or from other sensing modalities, such as airflow characteristics, intersyllabic timing patterns, respiratory-effort indicators, or subglottal-pressure signatures, or any combination of these indicators, depending on which indicators of vocal-fold behavior are available or more relevant for the spoken utterance. By integrating information associated with engagement maintenance and engagement initiation, the composite laryngeal engagement system 124 may produce a composite laryngeal engagement state that reflects overall stability, timing, or continuity of phonatory activity across the utterance. A composite laryngeal engagement state may refer to a fused representation of vocal-fold activity produced from sustained phonation metrics, voice-onset-time metrics, and/or sensor-derived activation indicators. The composite laryngeal engagement system 124 may structure the composite representation as a time-aligned sequence, a per-segment engagement vector, or another format suitable for subsequent evaluation.

[0049] The composite laryngeal engagement system 124 may analyze the composite representation to determine where, within the spoken utterance, the user 102 exhibited reduced, disrupted, or delayed laryngeal engagement. In some implementations, the composite laryngeal engagement system 124 may evaluate the representation segment-by-segment, comparing engagement indicators to one or more thresholds associated with expected voicing intervals. If the composite representation fails to satisfy a threshold at a segment boundary, vowel onset, or transition from a voiceless consonant to a voiced segment, or if airflow-related, subglottal-pressure-related, or temporal features indicate insufficient or unstable engagement, the composite laryngeal engagement system 124 may classify that location as an engagement-reduction point that may correspond to a block or to another phonatory disruption.

[0050] In some embodiments, the composite laryngeal engagement system 124 may determine the magnitude or type of a detected engagement reduction. For example, the composite laryngeal engagement system 124 may distinguish between a brief decrease in sustained phonation, a prolonged delay in voice onset, or combined reductions across adjacent transitions. The composite laryngeal engagement system 124 may compute associated measures such as the duration of the reduction, the position of the reduction within the utterance, or the number of affected transitions. These measures may indicate whether the user 102 experienced difficulty initiating voicing, maintaining voicing across voiced sequences, or re-engaging the larynx after a particular phonetic segment.

[0051] The composite laryngeal engagement system 124 may identify one or more block locations within the spoken utterance by determining that the composite representation, or a composite laryngeal engagement value or other indicator derived from the representation, fails to satisfy a laryngeal engagement criterion. A laryngeal engagement criterion may refer to a quantitative threshold applied to one or more voicing features, e.g., fundamental-frequency stability, harmonic-to-noise ratio, voicing-probability value, sensor-derived vibration amplitude, airflow- or pressure-based indicators of sufficient subglottal air pressure, intersyllabic temporal regularity, or other engagement-related features, used to determine whether vocal-fold vibration is present at a given moment. In some implementations, when the composite laryngeal engagement system 124 detects that the composite value falls below a threshold for longer than an allowable tolerance during an utterance onset or during a transition between syllables, the composite laryngeal engagement system 124 may designate that time, segment index, or phonetic boundary as a block location. The composite laryngeal engagement system 124 may classify each block location according to its context, such as whether the block occurred at an initial vowel onset, at a transition from a stop consonant to a voiced segment, or within a multi-syllabic voiced sequence, and may provide these block-location indicators to the feedback generator 126 or other components for use in presenting indications of the block location or feedback prompts associated with the block location.
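One possible reading of the criterion-and-tolerance test described above is sketched below for illustration. It scans a time-aligned sequence of composite values and reports each run that stays below the criterion for longer than the tolerance. The function name `find_block_locations`, the 10 ms frame length, the 0.5 criterion, and the 40 ms tolerance are all assumed values, not parameters stated in the disclosure.

```python
from typing import List, Tuple


def find_block_locations(values: List[float],
                         frame_ms: float = 10.0,      # assumed frame hop
                         criterion: float = 0.5,      # assumed engagement criterion
                         tolerance_ms: float = 40.0   # assumed allowable tolerance
                         ) -> List[Tuple[float, float]]:
    """Return (start_ms, duration_ms) for each below-criterion run exceeding tolerance."""
    blocks: List[Tuple[float, float]] = []
    start = None
    # A sentinel equal to the criterion closes any run that reaches the end.
    for i, v in enumerate(values + [criterion]):
        if v < criterion:
            if start is None:
                start = i
        elif start is not None:
            duration = (i - start) * frame_ms
            if duration > tolerance_ms:
                blocks.append((start * frame_ms, duration))
            start = None
    return blocks


composite = [0.8] * 5 + [0.2] * 6 + [0.9] * 4 + [0.3] * 2 + [0.8] * 3
print(find_block_locations(composite))
```

In the example sequence, the 60 ms dip is reported as a candidate block location while the 20 ms dip is ignored because it is shorter than the tolerance. Mapping each reported start time to a segment index or phonetic boundary would require the alignment information described elsewhere in this description.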

[0052] In some implementations, the composite laryngeal engagement system 124 may quantify the composite laryngeal engagement state using a score or other numeric indicator. For example, the composite laryngeal engagement system 124 may generate a value on a normalized scale (e.g., 0-1) or on a discrete performance scale (e.g., 0-100). The score may represent a weighted function of the sustained phonation metric and the voice-onset-time metric, optionally in combination with one or more additional engagement-related features such as airflow continuity, subglottal-pressure indicators, or intersyllabic timing stability. In some examples, sustained phonation continuity for a phrase may be represented as a value of 0.85 (e.g., meaning 85% of expected continuous-voicing intervals were successfully maintained), and the voice-onset-time metric may be normalized to a value of 0.60 (e.g., indicating moderately delayed onset relative to typical ranges). The composite laryngeal engagement system 124 may combine these metrics using weighting factors, such as 0.6 for sustained phonation and 0.4 for voice onset, to produce a composite engagement score of 0.75 (0.6 × 0.85 + 0.4 × 0.60). A higher score may indicate more stable or timely laryngeal engagement across the utterance, whereas a lower score may signal one or more regions in which expected engagement was reduced or delayed. The composite score may be stored, displayed, or used to select subsequent stimuli or feedback prompts tailored to the user’s specific engagement pattern. In some embodiments, the sustained phonation metric and the voice-onset-time metric may be normalized to user-specific baselines derived from prior utterances, so that the composite laryngeal engagement system 124 evaluates changes in engagement relative to each user’s typical performance rather than to a fixed global standard.
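The weighted combination may be illustrated with a short sketch. With only the two example terms, the weighted sum works out to 0.6 × 0.85 + 0.4 × 0.60 = 0.75; including additional engagement-related features or different weights would change the result. The function name `composite_engagement_score` is hypothetical, and dividing by the weight total is an added assumption so that the output stays on the 0-1 scale if the weights do not sum to one.

```python
def composite_engagement_score(sustained: float,
                               onset: float,
                               w_sustained: float = 0.6,
                               w_onset: float = 0.4) -> float:
    """Weighted mean of two normalized metrics, each expected in [0, 1]."""
    total = w_sustained + w_onset
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_sustained * sustained + w_onset * onset) / total


score = composite_engagement_score(0.85, 0.60)
print(round(score, 2))        # normalized 0-1 scale
print(round(score * 100))     # discrete 0-100 scale
```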

[0053] The composite laryngeal engagement system 124 may evaluate patterns of reduced engagement across multiple utterances to determine whether the user exhibits recurring phonetic challenges. The composite laryngeal engagement system 124 may identify repeated reductions in the composite representation at specific contexts such as voiceless-to-voiced onsets, high-vowel transitions, nasal-to-vowel boundaries, or multi-syllabic voiced sequences. The composite laryngeal engagement system 124 may classify these recurring patterns as phonetic challenge types, which may be associated with a manner of articulation, a place of articulation, or a contextual property such as a vowel onset, a stop-consonant release, or a specific segment-duration pattern.
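A simple way to surface recurring contexts, offered only as a sketch, is to count the distinct utterances in which each context label shows a reduction and to report the labels that meet a minimum count. The function name `recurring_challenges`, the context labels, and the minimum of three utterances are assumptions for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_challenges(reductions: Iterable[Tuple[str, str]],
                         min_utterances: int = 3) -> List[str]:
    """Return context labels with reductions in at least min_utterances utterances.

    reductions holds (utterance_id, context_label) pairs.
    """
    unique_pairs = set(reductions)  # count each context once per utterance
    counts = Counter(context for _, context in unique_pairs)
    return sorted(c for c, n in counts.items() if n >= min_utterances)


observed = [
    ("u1", "voiceless_to_voiced"), ("u1", "voiceless_to_voiced"),
    ("u2", "voiceless_to_voiced"), ("u3", "voiceless_to_voiced"),
    ("u1", "nasal_to_vowel"),
]
print(recurring_challenges(observed))
```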

[0054] In some cases, the composite laryngeal engagement system 124 may use the composite representation to select or adjust subsequent stimulus phrases. The composite laryngeal engagement system 124 may determine which phonetic environments correspond to reduced engagement for the user 102 and may select a phrase containing similar phonetic challenges. For instance, if reduced engagement is repeatedly detected at /k/ → vowel onsets, the composite laryngeal engagement system 124 may choose a phrase containing multiple occurrences of that transition. In some embodiments, generative phrase selection may be employed to construct a user-specific phrase that includes a phonetic pattern associated with prior reductions observed in the composite representation.

[0055] The composite laryngeal engagement system 124 may also provide information used for real-time or post-utterance feedback. The composite laryngeal engagement system 124 may supply an indication of a detected engagement-reduction location, a measure of its duration, or an overall representation of engagement performance across the utterance. Such information may support downstream operations including presenting a feedback prompt, updating visual indicators of phonation, adjusting adaptive timing thresholds, or generating a session-level summary that includes sustained phonation metrics, voice-onset-time metrics, or composite engagement scores.

[0056] In some embodiments, the composite laryngeal engagement system 124 may compute engagement measures at the level of individual syllables or phonetic transitions and may combine these measures to derive a phrase-level composite representation. For example, for the phrase “take it over,” the composite laryngeal engagement system 124 may identify the transitions /t/ → /eɪ/, /k/ → /ɪ/, and /t/ → /oʊ/ as positions where voice onset is expected. The system 124 may assign each transition an engagement score, such as 0.90 for a prompt onset after /t/, 0.50 for a delayed onset after /k/, and 0.80 for a moderately prompt onset after the final /t/. The composite laryngeal engagement system 124 may evaluate sustained phonation continuity within voiced regions such as /eɪ k ɪ/ and /oʊ v ɚ/, producing continuity values such as 0.95 and 0.88, respectively. The composite laryngeal engagement system 124 may aggregate these per-transition and per-syllable values using any suitable method, such as averaging, weighted fusion, or rule-based scoring, to produce a detailed engagement profile across the phrase. This profile may highlight specific transitions with reduced engagement, may indicate which voiced regions exhibit instability, or may identify patterns such as consistently low scores following voiceless or voiced stops. The aggregated per-transition information may be used to pinpoint specific block-susceptible moments within the utterance and to guide the selection of targeted follow-up stimuli that focus on the particular transitions or voiced sequences associated with reduced laryngeal engagement.
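Using the example values for “take it over,” the averaging option might be sketched as follows. The function name `engagement_profile`, the ASCII transition labels, and the 0.6 flagging threshold are assumptions for illustration; weighted fusion or rule-based scoring could be substituted for the plain mean.

```python
from statistics import mean
from typing import Dict, List, Union

Profile = Dict[str, Union[float, List[str]]]


def engagement_profile(onset_scores: Dict[str, float],
                       continuity_scores: Dict[str, float],
                       flag_below: float = 0.6) -> Profile:  # assumed flag threshold
    """Aggregate per-transition and per-region scores into a phrase-level profile."""
    return {
        "mean_onset": mean(list(onset_scores.values())),
        "mean_continuity": mean(list(continuity_scores.values())),
        "flagged_transitions": [k for k, v in onset_scores.items() if v < flag_below],
    }


# Example values from the "take it over" phrase; labels are ASCII stand-ins.
onsets = {"t-ei": 0.90, "k-i": 0.50, "t-ou": 0.80}
continuity = {"ei-k-i": 0.95, "ou-v-er": 0.88}
profile = engagement_profile(onsets, continuity)
print(round(profile["mean_onset"], 3), round(profile["mean_continuity"], 3),
      profile["flagged_transitions"])
```

Here the delayed onset after /k/ is the only transition flagged, which matches the intent of highlighting specific transitions with reduced engagement.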

[0057] In some implementations, the composite laryngeal engagement system 124 may adapt engagement thresholds based on the user's composite scores or engagement patterns across multiple utterances. For example, if a user consistently produces composite engagement scores above 0.85 in phrases with simple CV transitions but scores around 0.40 in phrases involving voiceless-to-voiced transitions such as /k/ → /æ/ or /t/ → /u/, the composite laryngeal engagement system 124 may adjust its internal thresholds to be more sensitive to delays in these specific contexts. Such adaptation may include narrowing the acceptable timing window for voice onset, increasing the required continuity threshold for sustained phonation in voiced sequences, or modifying baseline expectations for segment-duration characteristics. Conversely, if a user’s scores indicate improvement in previously disrupted transitions, the composite laryngeal engagement system 124 may broaden or relax thresholds to reduce false-positive indications of engagement reduction. These adaptive adjustments may allow the composite laryngeal engagement system 124 to account for user-specific phonation patterns, developmental progress, or variability across speaking contexts, and may support dynamic selection of stimulus phrases, update real-time feedback behavior, or enhance the system’s ability to detect genuine phonatory disruptions while avoiding over-sensitivity to natural variation.
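The narrowing and relaxing of a voice-onset timing window might be sketched, for illustration only, as a bounded step adjustment driven by recent scores in one phonetic context. The function name `adapt_onset_window`, the step size, the window bounds, and the 0.5 cut-off are assumptions; the 0.85 cut-off echoes the example score above but is likewise used here only as an illustrative default.

```python
from typing import Sequence


def adapt_onset_window(window_ms: float,
                       recent_scores: Sequence[float],
                       low: float = 0.5,       # assumed: below this, tighten the window
                       high: float = 0.85,     # assumed: above this, relax the window
                       step_ms: float = 5.0,   # assumed adjustment step
                       min_ms: float = 30.0,   # assumed lower bound
                       max_ms: float = 100.0   # assumed upper bound
                       ) -> float:
    """Return an updated acceptable voice-onset window for one phonetic context."""
    if not recent_scores:
        return window_ms
    avg = sum(recent_scores) / len(recent_scores)
    if avg < low:
        window_ms -= step_ms   # more sensitive to delays in this context
    elif avg > high:
        window_ms += step_ms   # relax to reduce false positives
    return min(max(window_ms, min_ms), max_ms)


print(adapt_onset_window(70.0, [0.40, 0.42, 0.38]))  # struggling context
print(adapt_onset_window(70.0, [0.90, 0.88, 0.92]))  # improved context
```

Bounding the window keeps repeated adjustments from drifting to values that would flag every onset or none.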

[0058] In some embodiments, the composite laryngeal engagement system 124 employs explicit sensor-fusion algorithms to synchronize and combine heterogeneous data streams. Temporal alignment may be performed using cross-correlation of transient acoustic and sensor-derived events, or by applying fixed or adaptive temporal-synchronization windows. Engagement estimates may be refined using Kalman filtering or other state-estimation techniques that integrate acoustic features with physiological or imaging-derived indicators. In some cases, the composite laryngeal engagement system 124 implements probabilistic fusion in which each modality contributes a likelihood of vocal-fold activation, and the modalities are combined using weighted or Bayesian methods. The fusion pipeline may further incorporate confidence-weighted blending, where each sensor stream is assigned a dynamic confidence score based on signal quality, enabling robust composite engagement detection even under noise, motion, or occlusion conditions. Sensor-fusion inputs may include, without limitation, metrics of sustained phonation, voice-onset timing, airflow presence, subglottal air-pressure surrogates, intersyllabic temporal patterns, or other engagement-related features derived from audio or physiological sensing.
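Two of the operations named above, cross-correlation alignment and confidence-weighted blending, might be sketched as follows. This is a simplified illustration in plain Python and does not cover Kalman filtering or Bayesian fusion; the function names `best_lag` and `fuse_activation` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def best_lag(a: Sequence[float], b: Sequence[float], max_lag: int) -> int:
    """Lag, in samples, by which stream b trails stream a (brute-force cross-correlation)."""
    def corr(lag: int) -> float:
        return sum(a[i] * b[i + lag]
                   for i in range(len(a)) if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=corr)


def fuse_activation(estimates: List[Tuple[float, float]]) -> Optional[float]:
    """Confidence-weighted blend of per-modality activation probabilities.

    estimates holds (probability, confidence) pairs; returns None if no
    modality has positive confidence.
    """
    total_conf = sum(conf for _, conf in estimates)
    if total_conf <= 0:
        return None
    return sum(p * conf for p, conf in estimates) / total_conf


acoustic_events = [0, 0, 1, 0, 0, 0]   # transient in the audio stream
sensor_events = [0, 0, 0, 0, 1, 0]     # same transient seen two samples later
print(best_lag(acoustic_events, sensor_events, max_lag=3))
print(fuse_activation([(0.8, 0.5), (0.4, 0.5)]))
```

When one stream's confidence falls to zero, for example under occlusion, the blend reduces to the remaining modalities, which is the robustness property the confidence weighting is intended to provide.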

[0059] In some implementations, the composite laryngeal engagement system 124 may incorporate machine-learning models to refine classification or prediction of reduced-engagement events. A model may receive the sustained phonation metric, the voice-onset-time metric, one or more additional engagement-related features such as airflow characteristics, temporal patterns between syllables, or physiological indicators of sufficient subglottal air pressure, or auxiliary features extracted by the speech signal processor 121, and may output estimated engagement probabilities or likelihoods of phonatory disruption at each segment boundary. In some cases, a generative-AI model may produce candidate stimulus phrases, simulate expected voicing behavior for a given phonetic sequence, or assist in aligning observed engagement patterns with predicted patterns to improve detection accuracy. In some cases, the machine-learning model may be trained to identify patterns of disrupted phonation or reduced laryngeal engagement that are indicative of stuttering blocks, allowing the composite laryngeal engagement system 124 to classify block locations based on learned disruption patterns in addition to rule-based criteria.

[0060] The feedback generator 126 may provide real-time, near-real-time, or post-utterance feedback based on information derived from the sustained phonation metric, the voice-onset-time metric, the composite representation of laryngeal engagement, or any detected engagement-reduction or block locations. The feedback generator 126 may present such feedback through interfaces including, but not limited to, visual, auditory, haptic, or extended-reality interfaces provided by, for example, the user interface device 110, and may adjust the feedback modality or timing according to user performance or system-defined criteria.

[0061] In some implementations, the feedback generator 126 may provide continuous visual indicators that reflect the presence or absence of periodic voicing. These indicators may include, but are not limited to, a voice bar that illuminates when periodic voicing is detected, a progress bar that fills according to a duration of sustained phonation, or a cursor that advances as voiced sound continues and signals the user when a time threshold has been met. For example, when the user sustains phonation on a vowel or across a voiced sequence, the voice bar may remain filled; if periodic voicing ceases unexpectedly, the indicator may pause, retract, or change color to illustrate the interruption. In some implementations, the cursor may fill according to the duration of sustained phonation on an initial vowel of a word and may signal the user 102 to proceed once a time threshold is met, where the time threshold corresponds to a nearest-normal duration in a range from approximately 0.5 seconds to approximately 1.5 seconds.
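The cursor behavior described above might be sketched as a small state holder that accumulates voiced time and pauses when voicing stops. The class name `PhonationCursor`, the 10 ms frame length, and the choice to pause rather than retract are assumptions for the example; only the 0.5-1.5 second threshold range comes from this description.

```python
class PhonationCursor:
    """Tracks how much of a sustained-phonation time threshold has been met."""

    def __init__(self, target_ms: float = 1000.0) -> None:
        # The description gives a threshold range of roughly 0.5 s to 1.5 s.
        if not 500.0 <= target_ms <= 1500.0:
            raise ValueError("target_ms expected in the range 500-1500 ms")
        self.target_ms = target_ms
        self.voiced_ms = 0.0

    def update(self, is_voiced: bool, frame_ms: float = 10.0) -> float:
        """Advance on a voiced frame; pause (hold position) on an unvoiced one."""
        if is_voiced:
            self.voiced_ms = min(self.voiced_ms + frame_ms, self.target_ms)
        return self.fill

    @property
    def fill(self) -> float:
        """Fraction of the cursor that is filled, from 0.0 to 1.0."""
        return self.voiced_ms / self.target_ms

    @property
    def ready(self) -> bool:
        """True once the time threshold has been met."""
        return self.voiced_ms >= self.target_ms


cursor = PhonationCursor(target_ms=1000.0)
for _ in range(50):
    cursor.update(True)    # 500 ms of voicing
for _ in range(10):
    cursor.update(False)   # voicing drops out; the cursor holds
print(cursor.fill, cursor.ready)
```

A retracting or color-changing indicator could be modeled by replacing the hold behavior in `update` with a decrement or a state flag.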

[0062] The feedback generator 126 may provide timing-related cues associated with voice-onset behavior. During transitions from a voiceless consonant to a voiced segment, the feedback generator 126 may display a timing window, countdown marker, or animated cursor that indicates whether the onset of periodic voicing occurred within an expected interval. If the voice onset is delayed, the interface may visually highlight the transition, change indicator color, or display a small notification to signal that the voice-onset-time criterion was not satisfied.

[0063] The feedback generator 126 may deliver feedback prompts associated with detected reductions in laryngeal engagement. Feedback prompts may include, but are not limited to, directing the user 102 to re-initiate phonation at the point where voicing was lost, prolong a vowel associated with a disrupted onset, maintain airflow across a voiced sequence, or ease into a consonant-to-vowel transition. These instructions may be presented textually, aurally, or through XR-based animations (e.g., an avatar demonstrating a slowed phonatory onset), to name a few.

[0064] When the composite laryngeal engagement system 124 identifies a block location, the feedback generator 126 may generate an indication of the block location. The indication may be rendered as a highlight on a phonetic timeline, a symbol marking the affected syllable, an illuminated cursor at the moment of voicing cessation, or an auditory cue that signals the interruption. The feedback generator 126 may present contextual information such as the type of phonetic transition at which the block occurred or whether the block corresponded to a delayed onset, a mid-sequence dropout, or a failure to maintain continuous phonation.

[0065] For blocks detected in real time, the feedback generator 126 may modify or interrupt ongoing visual indicators. For example, a progress bar may freeze or dim, a voice bar may collapse, or a cursor may stop advancing at the exact position of the block. The feedback generator 126 may then present a recovery prompt, such as “restart voicing,” “smoothly begin the vowel,” or “maintain airflow,” to encourage immediate re-engagement of the larynx. If blocks occur repeatedly at similar phonetic contexts, the feedback generator 126 may emphasize those contexts visually to reinforce awareness.

[0066] In some implementations, the feedback generator 126 may align block indications or engagement-reduction cues with the phonetic structure of the stimulus phrase or with an automatically inferred phonetic representation of spontaneous speech. For example, if a block is detected at the transition from /k/ to /æ/, the feedback generator 126 may display an annotation or animation highlighting that specific onset location, allowing the user to understand precisely where expected voicing failed to initiate.

[0067] The feedback generator 126 may provide auditory feedback that reinforces phonation behavior. Examples include tones that sound while sustained phonation is present, pitch-neutral chimes that indicate timely onset, or attenuated audio effects that signal loss of voicing. In some cases, auditory cues may be synchronized with visual indicators to reinforce user awareness of phonatory control.

[0068] In implementations using extended-reality environments, the feedback generator 126 may produce immersive feedback effects. For example, a virtual object may brighten or move while periodic voicing remains stable, or an avatar may pause, gesture, or provide contextual coaching when voicing drops out. Such feedback may be integrated into simulated conversational or anxiety-inducing environments to reinforce naturalistic speaking behavior.

[0069] The feedback generator 126 may adapt its feedback strategy based on dynamic thresholds or user-specific baselines. For instance, as the user demonstrates improved sustained phonation control, the feedback generator 126 may increase the duration required to fill a progress bar or narrow the permitted window for voice onset. Conversely, if the user experiences difficulty, the feedback generator 126 may relax thresholds or simplify visual cues to reduce cognitive effort. Adaptation may be based on composite-engagement scores, per-transition metrics, or historical performance.
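As a non-limiting illustration, the threshold adaptation described above may be sketched as follows. The names `Thresholds` and `adapt_thresholds`, the cut-off values 0.8 and 0.5, and the 10% step size are assumptions introduced only for this example and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    target_phonation_s: float  # duration required to fill the progress bar
    onset_window_ms: float     # permitted window for voice onset


def adapt_thresholds(
    current: Thresholds,
    recent_scores: Sequence[float],
    raise_above: float = 0.8,   # hypothetical "doing well" cut-off
    relax_below: float = 0.5,   # hypothetical "struggling" cut-off
    step: float = 0.1,          # hypothetical fractional adjustment
) -> Thresholds:
    """Tighten or relax thresholds from recent composite-engagement scores."""
    if not recent_scores:
        return current
    mean_score = sum(recent_scores) / len(recent_scores)
    if mean_score >= raise_above:
        # Improved control: longer target, narrower onset window.
        return Thresholds(
            current.target_phonation_s * (1 + step),
            current.onset_window_ms * (1 - step),
        )
    if mean_score <= relax_below:
        # Difficulty: shorter target, wider onset window.
        return Thresholds(
            current.target_phonation_s * (1 - step),
            current.onset_window_ms * (1 + step),
        )
    return current
```

In such a sketch, scores between the two cut-offs leave the thresholds unchanged, which avoids oscillation when performance is near a boundary.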

[0070] In some implementations, the feedback generator 126 may generate post-utterance summaries that include measures such as sustained phonation continuity, voice-onset-time values, the composite laryngeal engagement value, or any detected block locations. The summary may include a timeline showing where engagement reductions occurred, a per-transition scoring chart, or an overall session-level performance indicator. These summaries may support longitudinal progress tracking or may inform selection of subsequent stimulus phrases.

[0071] In some embodiments, the laryngeal engagement analysis system 120 may receive, through the user interface device 110, a self-reported control level from the user 102 (for example, a rating of perceived control or ease of phonation during an utterance). The feedback generator 126 may adjust feedback prompts, feedback sensitivity, or subsequent stimulus selection based at least in part on the self-reported control level, such as providing more detailed guidance or relaxed thresholds when the user reports low control and gradually increasing task difficulty as the reported control level improves.

[0072] The feedback generator 126 may allow user-adjustable parameters including phonation-duration thresholds, indicator formats, sensitivity levels, or timing windows for voice onset. Such customizations may be applied dynamically to guide phonation behavior in a manner suited to the user’s preferences or therapeutic goals.

[0073] In some embodiments, the feedback generator 126 acts as an exercise-recommendation engine configured to select, rank, or generate performance analysis tasks based on detected stuttering behaviors. The feedback generator 126 may access a database storing phonetic patterns, previously observed laryngeal engagement reductions, user-specific challenge types, and historical session outcomes. Using these data, the feedback generator 126 may automatically determine which stimulus phrases, sound classes, conversational scenarios, or XR-based speaking environments are most appropriate for reinforcing improved phonatory control. The feedback generator 126 may transmit its selections to an SLP-facing dashboard for review or approval.

[0074] In some embodiments, the feedback generator 126 may access one or more stimulus-content libraries maintained by the laryngeal engagement analysis system 120 and/or the performance management portal 140. The stimulus-content libraries may include items such as monosyllabic words, multisyllabic words, words organized by phonetic class (for example, plosives, fricatives, or nasals), short phrases, full sentences, reading passages, sentence-formulation prompts, or conversation-starter prompts. The feedback generator 126 may select or request items from these libraries based on engagement metrics, block-location indicators, phonetic challenge types, anxiety-analysis outputs, or configuration data supplied by the performance management portal 140, thereby enabling adaptive sequencing of stimulus items across speaking tasks and environments.

[0075] In extended-reality or conversational-avatar implementations, the feedback generator 126 may allow selection among multiple intervention behaviors for the avatar. Examples include, but are not limited to, immediate interruption upon detection of disengagement, delayed intervention at syllable boundaries, or end-of-sentence feedback. These settings may be user-selectable or automatically configured based on anxiety level, session goals, or clinician input via the supervisory portal.

[0076] In some embodiments, the feedback generator 126 may employ machine-learning or generative models to refine or personalize feedback. A model may predict which feedback modality or threshold will most effectively support improvement, may generate context-specific corrective messages, or may synthesize demonstration audio or animations illustrating ideal phonation patterns. Predictive models may also anticipate upcoming stalls in voicing and enable pre-emptive cues designed to support continuous, stable laryngeal engagement.

[0077] The laryngeal engagement analysis system 120 may include an anxiety-analysis system 128 configured to assess how the user 102 responds to varying speaking contexts, including structured speaking tasks, conversational tasks, and/or extended-reality environments. The anxiety-analysis system 128 may determine whether voicing-interruption frequency or block frequency increases when the user speaks in an environment that is designed to simulate social or performance-related pressure, thereby enabling the anxiety-analysis system 128 to infer an anxiety level associated with the speaking task. The anxiety-analysis system 128 may operate in conjunction with the sustained phonation detector 122, the voice-onset-time detector 123, the composite laryngeal engagement system 124, or any other component that detects phonatory disruptions.

[0078] Structured speaking tasks may include reading a stimulus phrase, repeating predetermined text, or producing short, medium, or long utterances in a low-stress context where environmental variables are mitigated or minimized. In contrast, conversational tasks may include open-ended dialogue with an avatar or a sequence of socially interactive prompts. Extended-reality environments, including virtual-reality or mixed-reality scenes, may introduce simulated social pressure, such as speaking to a virtual audience, ordering food from a virtual cashier, or answering questions from an interactive avatar. The anxiety-analysis system 128 may compare voicing-interruption frequency, block locations, or composite laryngeal engagement reductions across these contexts to detect whether disfluencies increase under conversational or XR-based conditions relative to structured tasks.

[0079] The anxiety-analysis system 128 may compute a context-sensitivity metric representing a degree of context-dependent variation in laryngeal engagement. For example, the anxiety-analysis system 128 may compute a first engagement-stability measure for the user 102 during structured reading tasks and a second engagement-stability measure during conversational or XR scenarios. If the second measure reflects an increased number of block locations, increased duration of laryngeal disengagement, or lower composite laryngeal engagement values relative to the structured condition, the anxiety-analysis system 128 may classify the difference as an elevation in speaking anxiety. The magnitude of the context-sensitivity metric may be proportional to, for example, the ratio or difference between the voicing-interruption frequency in the high-pressure environment and the voicing-interruption frequency in the structured environment. In some cases, the context-sensitivity metric is an anxiety metric representing variations in speech-production performance that correlate with changes in speaking-condition difficulty. In some cases, the context-sensitivity metric is an anxiety metric representing a data-driven indicator of context-dependent changes in phonation behavior as measured from audio and physiological signals.
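By way of example only, the ratio or difference between voicing-interruption frequencies may be computed as sketched below. The function name `context_sensitivity`, its parameters, and the handling of a zero baseline are assumptions made for illustration; the disclosure does not prescribe a particular formula.

```python
def context_sensitivity(
    structured_interruptions: int,
    structured_transitions: int,
    pressured_interruptions: int,
    pressured_transitions: int,
    mode: str = "difference",
) -> float:
    """Compare interruption frequency under pressure against a structured baseline."""
    if structured_transitions <= 0 or pressured_transitions <= 0:
        raise ValueError("transition counts must be positive")
    baseline = structured_interruptions / structured_transitions
    pressured = pressured_interruptions / pressured_transitions
    if mode == "difference":
        return pressured - baseline
    if mode == "ratio":
        if baseline == 0.0:
            # Assumed convention: no baseline interruptions means an unbounded
            # ratio when any interruption occurs under pressure, else 1.0.
            return float("inf") if pressured > 0.0 else 1.0
        return pressured / baseline
    raise ValueError(f"unknown mode: {mode!r}")
```

A difference is bounded and easy to threshold, whereas a ratio emphasizes relative change for users whose baseline interruption frequency is already low.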

[0080] In some embodiments, the anxiety-analysis system 128 may incorporate physiological signals obtained from a wearable device (e.g., the sensor 130), such as heart rate, heart-rate variability, respiration rate, or respiratory-effort metrics. These physiological signals may be temporally aligned with the user’s spoken utterances to determine whether increases in physiological arousal correspond with increased voicing-interruption frequency or prolonged engagement-reduction events. A correlation between physiological arousal and increased phonatory disruptions across different speaking contexts may increase the context-sensitivity metric, while a reduction in physiological arousal accompanying improved phonatory stability may decrease the metric.

[0081] The anxiety-analysis system 128 may integrate physiological indicators with phonation-based indicators to produce a multimodal anxiety estimate. For example, during a structured reading task, the user 102 may exhibit a block rate of approximately 2% (e.g., 1 block in 50 transitions), an average voice-onset-time of 45 ms, and a composite-engagement score of 0.88. In contrast, when speaking in a virtual-reality classroom, the user 102 may exhibit a block rate of approximately 18% (e.g., 9 blocks in 50 transitions), an average voice-onset-time of 135 ms, and a composite-engagement score of 0.52. Simultaneously, physiological measurements from a wearable device may indicate an increase in heart rate from a baseline of 72 bpm during the structured task to 104 bpm in the virtual-reality context, a decrease in heart-rate variability from 52 ms to 21 ms, and an increase in respiration rate from 12 breaths per minute to 19 breaths per minute. The anxiety-analysis system 128 may compute an anxiety score using a weighted combination of phonation-based features (e.g., block rate, VOT prolongation, composite-engagement reductions) and physiological features (e.g., elevated heart rate, reduced heart-rate variability, increased respiration rate). For instance, the anxiety-analysis system 128 may assign a low-anxiety score of 0.18 to the structured reading task and a high-anxiety score of 0.82 to the virtual-reality classroom scenario based on these multimodal indicators.
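One possible form of such a weighted combination is sketched below using the example measurements above. The normalization ranges, the weights, and the names `FEATURES`, `normalize`, and `anxiety_score` are hypothetical; the disclosure does not specify them, so this sketch is not expected to reproduce the example scores of 0.18 and 0.82 exactly.

```python
def normalize(value: float, low: float, high: float) -> float:
    """Map value linearly onto [0, 1], clamping outside [low, high]."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


# feature name -> (low, high, inverted, weight); all values are assumptions.
# "inverted" marks features where a LOWER reading suggests HIGHER anxiety.
FEATURES = {
    "block_rate":     (0.0, 0.25, False, 0.25),
    "vot_ms":         (30.0, 150.0, False, 0.15),
    "engagement":     (0.4, 1.0, True, 0.20),
    "heart_rate_bpm": (60.0, 120.0, False, 0.15),
    "hrv_ms":         (15.0, 60.0, True, 0.15),
    "resp_rate_bpm":  (10.0, 24.0, False, 0.10),
}


def anxiety_score(observation: dict) -> float:
    """Weighted sum of normalized phonation and physiological features."""
    score = 0.0
    for name, (low, high, inverted, weight) in FEATURES.items():
        level = normalize(observation[name], low, high)
        if inverted:
            level = 1.0 - level
        score += weight * level
    return score


# Example measurements taken from the paragraph above.
structured = {"block_rate": 1 / 50, "vot_ms": 45, "engagement": 0.88,
              "heart_rate_bpm": 72, "hrv_ms": 52, "resp_rate_bpm": 12}
vr_classroom = {"block_rate": 9 / 50, "vot_ms": 135, "engagement": 0.52,
                "heart_rate_bpm": 104, "hrv_ms": 21, "resp_rate_bpm": 19}
```

Because the assumed weights sum to one and every feature is clamped to the range 0 to 1, the resulting score also lies between 0 and 1, with the structured task scoring lower than the virtual-reality classroom.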

[0082] The anxiety-analysis system 128 may use the context-sensitivity metric or anxiety score to select subsequent speaking tasks. For example, if a user 102 demonstrates pronounced increases in voicing-interruption frequency when speaking to a virtual audience, the anxiety-analysis system 128 may present follow-up tasks that gradually increase or decrease the simulated audience size, adjust the difficulty level, or transition the user 102 toward conversational interactions that train stability under elevated pressure. The anxiety-analysis system 128 may adjust the difficulty of phonation-based feedback, such as widening or narrowing acceptable timing windows for voice onset, based on the user’s current context-sensitivity metric.

[0083] In some cases, the anxiety-analysis system 128 may store anxiety scores, physiological-speech correlations, or context-dependent performance differences for longitudinal tracking. Such stored information may assist in monitoring progress over time, evaluating performance analysis effectiveness, or generating reports that compare the user’s anxiety-related stuttering patterns across multiple sessions or speaking contexts.

[0084] The anxiety-analysis system 128 may provide an output indicative of the user’s anxiety state to the feedback generator 126 or to another presentation component, such as the user interface device 110. In some embodiments, the anxiety-analysis system 128 may transmit an anxiety score, an anxiety level classification, or an indication of increased voicing-interruption frequency in an anxiety-inducing context to the feedback generator 126. The feedback generator 126 may incorporate this information to modify visual, auditory, or XR-based cues presented to the user 102. For example, when the anxiety-analysis system 128 detects an elevated anxiety score relative to a structured speaking baseline, the feedback generator 126 may simplify the real-time visual indicators, broaden timing thresholds for voice onset, adjust target durations for sustained phonation, reduce the frequency of corrective prompts, or present calming cues such as slowed animations, breathing indicators, or supportive auditory tones. Conversely, when the anxiety-analysis system 128 detects lower anxiety, the feedback generator 126 may restore default thresholds or increase precision of visual indicators. In some embodiments, the anxiety-related output may be displayed to the user 102 as part of a session summary, including graphical representations of anxiety scores across contexts, correlations between voicing-interruption frequency and physiological measurements, or indicators showing improvement or escalation in speaking-related anxiety across tasks or environments.

[0085] In some cases, data or analysis produced by one or more components of the laryngeal engagement analysis system 120, such as the sustained phonation detector 122, the voice-onset-time detector 123, or the composite laryngeal engagement system 124, may be replaced or supplemented by data obtained from the laryngeal-activation sensor 130. The laryngeal-activation sensor 130 may obtain physiological or imaging information directly indicative of vocal-fold engagement and may function as a supplemental sensing device that provides data separate from the microphone-based speech signal processed by the speech signal processor 121. The laryngeal-activation sensor 130 may capture physical indicators of laryngeal activation that are not derivable from the acoustic signal alone, enabling the laryngeal engagement analysis system 120 to determine whether the vocal folds are engaged even when the speech audio is weak, ambiguous, occluded, or silent. Such information may assist in identifying sustained phonation behavior, voice-onset behavior, or laryngeal-disengagement events associated with block locations.

[0086] In some embodiments, the laryngeal-activation sensor 130 may include a physiological sensor configured to detect neck-surface vibration or subglottal excitation associated with vocal-fold vibration. A throat-mounted accelerometer, contact microphone, or vibration sensor may detect mechanical oscillation of the vocal folds and may indicate whether the folds are physically engaged even when audible periodicity is absent from an external microphone. Data obtained from such physiological sensing devices may be supplied to the composite laryngeal engagement system 124 as an independent indicator of laryngeal engagement.

[0087] In some embodiments, the laryngeal engagement analysis system 120 may obtain a physiological signal, such as a heart-rate signal, heart-rate-variability signal, or respiration-rate signal, from the laryngeal-activation sensor 130, which may be a wearable device. The laryngeal engagement analysis system 120 may correlate changes in the physiological signal with changes in voicing-interruption frequency across different speaking contexts (e.g., structured tasks versus conversational or extended-reality interaction) and may adjust the context-sensitivity metric based at least in part on this correlation, such that concurrent increases in physiological arousal and voicing-interruption frequency yield a higher context-sensitivity metric than either measure alone.

[0088] The laryngeal-activation sensor 130 may include an optical, infrared, or camera-based sensor positioned to observe movement of the user's neck or lower facial region. Such imaging devices may detect external motion correlated with vocal-fold activation, including laryngeal elevation, anterior neck vibration, surface displacement, or visual cues associated with glottal opening or closing patterns. In some embodiments, the laryngeal-activation sensor 130 may detect non-laryngeal stuttering behaviors including, but not limited to, eye blinks, facial tension, jaw freezing, head-nodding, or other secondary behaviors commonly associated with speech blocks. These behaviors may be incorporated into engagement-reduction detection logic, block-classification models, or the composite laryngeal engagement system 124.

[0089] In some embodiments, the laryngeal-activation sensor 130 may include an endoscopic or intranasal imaging device configured to capture internal views of the glottis, supraglottic structures, or vocal-fold motion during speech production. Such a device may provide direct physiological evidence of engagement or disengagement that cannot be derived from external audio or surface-imaging signals alone. The system may synchronize endoscopic frames with the speech signal to identify silent blocks, pre-phonatory positioning, incomplete adduction, or other engagement failures with increased anatomical precision.

[0090] In some embodiments, the laryngeal-activation sensor 130 may include an imaging device configured to capture data from a line of sight generally directed toward the laryngeal or submandibular region. Such a sensor may include a camera, infrared imager, structured-light device, or optical module positioned at the neck, under the jaw, or at another vantage point that allows indirect visualization of vocal-fold motion, glottal aperture changes, or supraglottic activity correlated with laryngeal engagement. These sensors may detect physical cues such as glottal narrowing, vocal-fold vibration patterns observable through tissue, or rapid changes in laryngeal posture that precede or accompany phonation. Data from these devices may allow the laryngeal engagement analysis system 120 to identify whether the vocal folds attempt to adduct or vibrate at moments when acoustic output is weak or absent, thereby improving identification of pre-voicing delays, silent blocks, or incomplete engagement at expected onset locations.

[0091] In some implementations, the laryngeal-activation sensor 130 may include proximity sensors, structured-light sensors, or depth-sensing hardware configured to detect displacement of the skin surface or laryngeal structure with fine spatial resolution. Because these sensing modalities observe physical movement independent of acoustic output, the laryngeal-activation sensor 130 may detect silent blocks, pre-phonatory delays, or other engagement failures even when no sound is produced, conditions that the speech signal processor 121 may not reliably identify using acoustic data alone.

[0092] The laryngeal-activation sensor 130 may include physiological or biometric sensors such as respiration belts, airflow or pressure sensors, or photoplethysmography devices that detect respiratory timing, subglottal pressure patterns, or airflow presence. These physiological signals may indicate whether expiratory flow or sufficient subglottal air pressure to support vocal-fold engagement is present while the vocal folds remain disengaged, thereby assisting in distinguishing airflow-driven silent blocks from naturally unvoiced segments or low-intensity phonation.

[0093] In some embodiments, the laryngeal-activation sensor 130 may integrate multiple sensing modalities, such as vibration sensing, imaging-based sensing, or physiological sensing, to generate an activation indicator. An activation indicator may refer to a sensor-derived signal feature (for example, an acoustic, vibratory, optical, physiological, or mechanical feature) that provides direct or indirect evidence of vocal-fold activation. Indicators may include neck-surface vibration amplitude, glottal-area motion cues, measures of sufficient subglottal air pressure, subglottal pressure signatures, airflow onset, or other measurable correlates. The laryngeal-activation sensor 130 may synchronize the data stream with the acoustic timeline established by the speech signal processor 121 so that the composite laryngeal engagement system 124 can evaluate whether vocal-fold engagement was expected at a particular moment and whether sensor-derived information confirms or contradicts the acoustic signal.

[0094] The laryngeal-activation sensor 130 may output activation indicators in any suitable format, including binary engagement flags, continuous activation values, time-aligned activation curves, or feature vectors representing vibration, displacement, optical motion, or airflow patterns. These indicators may be synchronized with the speech-derived engagement metrics for fusion by the composite laryngeal engagement system 124.

[0095] In some embodiments, the laryngeal-activation sensor 130 may be integrated into AR/VR headsets or XR-wearables. Optical modules, depth cameras, or inward-facing sensors may track neck motion, glottal-region displacement, or respiration patterns while the user interacts with an XR scene, enabling low-latency feedback synchronized with the virtual environment.

[0096] The laryngeal-activation sensor 130 may provide an independent verification pathway for confirming reductions in laryngeal engagement, identifying block locations, or improving robustness of sustained phonation and voice-onset-time measurements. For example, when the acoustic signal does not clearly indicate whether the user attempted to initiate voicing after a stop-consonant release, the laryngeal-activation sensor 130 may detect the absence of vocal-fold vibration despite articulatory movement, thereby enabling more precise localization of a block even when no audible signal is present.

[0097] Signals from the laryngeal-activation sensor 130 may contribute to generation of the composite laryngeal engagement representation or value by serving as an auxiliary indicator of vocal-fold engagement. The composite laryngeal engagement system 124 may incorporate indications of laryngeal activation from the laryngeal-activation sensor 130, alone or in combination with acoustically derived metrics, to refine assessments of engagement continuity, engagement onset, or disengagement events. Sensor-derived activation information may improve accuracy in detecting silent or partially voiced blocks, may support threshold adaptation to user-specific physiology, and may enhance overall reliability when environmental noise or atypical articulatory patterns complicate detection based solely on the speech signal processed by the speech signal processor 121.
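As a simplified, non-limiting sketch, time-aligned acoustic and sensor-derived values may be fused frame by frame as shown below. The linear weighting, the `sensor_weight` parameter, and the fallback to the acoustic value when a sensor frame is missing are assumptions chosen for illustration rather than the fusion method of the disclosure.

```python
from typing import List, Optional, Sequence


def fuse_engagement(
    acoustic: Sequence[float],
    sensor: Sequence[Optional[float]],
    sensor_weight: float = 0.5,  # hypothetical relative trust in the sensor
) -> List[float]:
    """Blend per-frame acoustic voicing values with sensor activation values.

    Both inputs are assumed to be in [0, 1] and already aligned to the same
    frame timeline. A sensor entry of None marks a dropped sensor frame.
    """
    if len(acoustic) != len(sensor):
        raise ValueError("streams must be time-aligned and equal in length")
    if not 0.0 <= sensor_weight <= 1.0:
        raise ValueError("sensor_weight must lie in [0, 1]")
    fused = []
    for a, s in zip(acoustic, sensor):
        if s is None:
            fused.append(a)  # no sensor evidence: rely on acoustics alone
        else:
            fused.append((1.0 - sensor_weight) * a + sensor_weight * s)
    return fused
```

With this form, a frame whose acoustic value is low but whose sensor value is high yields an intermediate composite value, reflecting sensor evidence of vocal-fold activity that the microphone did not capture.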

[0098] The performance management portal 140 may provide computing resources that support supervisory, administrative, or coordination functions associated with operation of the environment 100 across multiple devices or users. In some embodiments, the performance management portal 140 may receive any information generated or received by the user interface device 110, the laryngeal engagement analysis system 120, or the laryngeal-activation sensor 130, including sustained phonation metrics, voice-onset-time metrics, composite laryngeal engagement values, engagement-reduction indicators, anxiety-analysis outputs, physiological-signal summaries, task-completion records, or system-generated performance assessments. The performance management portal 140 may process, store, aggregate, or visualize these data for purposes such as tracking usage patterns, monitoring performance trends, organizing session histories, or identifying changes in engagement behavior across speaking tasks or environments.

[0099] The performance management portal 140 may support automated or semi-automated decision processes that influence operation of the user interface device 110 or the laryngeal engagement analysis system 120. Such processes may include selecting subsequent stimulus phrases, approving or modifying exercise recommendations produced by an internal recommendation engine, adjusting adaptive timing thresholds, configuring feedback-sensitivity parameters, or managing access to stimulus-phrase libraries or XR-based scenarios. In some embodiments, the performance management portal 140 may enable oversight of multiple users by presenting summaries of engagement-related measurements, session-level analytics, longitudinal progress indicators, or comparisons of performance across structured tasks, conversational tasks, or extended-reality environments. The performance management portal 140 may function as a centralized computational interface for supervising system behavior, managing performance-tracking workflows, or coordinating data associated with one or more users without requiring direct interaction with the user interface device 110 or the laryngeal-activation sensor 130.

[0100] Although generally described herein with respect to English, in some embodiments the laryngeal engagement analysis environment 100 supports multilingual phonetic rules, including languages with distinct voicing patterns, aspirated or unaspirated contrasts, or non-English stop categories. The sustained phonation detector 122 and the voice-onset-time detector 123 may apply language-specific voicing attributes, such as Korean three-way stop distinctions, Hindi breathy-voiced consonants, or vowel-initial onset behaviors common in French.

[0101] In some embodiments, the laryngeal engagement analysis environment 100 supports tonal languages, where pitch is used lexically. For example, the speech signal processor 121 may isolate linguistic f0 modulations from phonatory-continuity cues so that tone changes do not create false "disengagement" events. In some such cases, tone contours may be used as features for improved voicing-probability estimation.

[0102] In some embodiments, the laryngeal engagement analysis environment 100 supports pediatric speech, including shorter segment durations, higher f0 ranges, immature articulatory timing, or increased acoustic variability. In some such cases, the laryngeal engagement analysis environment 100 may adjust thresholds, expected VOT ranges, or voicing-probability models to a pediatric baseline.

[0103] In some embodiments, the laryngeal engagement analysis environment 100 detects whisper speech in which periodic voicing is absent by design. Whisper detection may be performed using spectral tilt, turbulent-noise ratios, or subharmonic absence, and the laryngeal engagement analysis environment 100 may evaluate engagement primarily from sensor-derived activation indicators rather than acoustic periodicity.

[0104] In some embodiments, the laryngeal engagement analysis environment 100 detects breath-to-voicing transitions, where airflow begins prior to vocal-fold vibration. A pre-phonatory airflow signature may be detected using low-frequency energy, airflow sensors, or optical neck-movement cues, allowing more accurate identification of delayed engagement.

[0105] FIG. 2 illustrates an example visual feedback element 200 generated by the feedback generator 126 of FIG. 1 and displayed by the user interface device 110. The illustrated example includes multiple progress-type indicators corresponding to different target prolongation durations, such as approximately two seconds, one second, and a nearest-normal duration. In some embodiments, these indicators may be displayed simultaneously to represent multiple available training levels. In some embodiments, a user interface device may display only the progress indicator associated with the target duration applicable to the current exercise. For example, when the user is instructed to begin an utterance by prolonging the initial vowel of a stimulus phrase (such as the first vowel of “out,” “fin,” or “apple”), the user interface device may present the single progress indicator corresponding to the required prolongation duration for that task. In some embodiments, the indicators may further adapt based on system-determined performance conditions, such as adjusting the displayed target duration according to a difficulty level selected (e.g., via the performance management portal 140), or modifying the visual style of the bar based on whether the environment includes a conversational avatar or another interactive XR element.

[0106] Each indicator includes a reference bar that represents the target duration and a fillable portion that expands in real time as sustained phonation is detected (e.g., by the sustained phonation detector 122). When the fillable portion reaches the end of the reference bar, the user interface device 110 provides an indication that the required prolongation has been achieved and that the user may proceed with the remainder of the utterance. FIG. 2 depicts an example of real-time visual feedback that facilitates timely initiation and maintenance of laryngeal engagement at the beginning of an utterance. In some embodiments, the feedback generator 126 may present cues associated with avatar-interaction behaviors, such as a visual prompt indicating that an avatar will respond once the prolongation target is reached, or a timing indicator showing how the detected phonation aligns with system-specified performance thresholds.
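For illustration only, the fill level of such a progress indicator may be derived from per-frame voicing decisions as sketched below. The function name `progress_fill`, the 10 ms frame duration, and the choice to reset the bar when voicing drops out are assumptions of this example, not requirements of the disclosure.

```python
from typing import Iterable


def progress_fill(
    voiced_flags: Iterable[bool],
    frame_s: float = 0.01,   # hypothetical frame duration in seconds
    target_s: float = 1.0,   # target prolongation duration in seconds
) -> float:
    """Return the filled fraction (0.0 to 1.0) of the reference bar.

    Counts the current uninterrupted run of voiced frames, so a dropout in
    voicing resets the bar (an assumed design choice).
    """
    if frame_s <= 0 or target_s <= 0:
        raise ValueError("frame_s and target_s must be positive")
    run = 0
    for voiced in voiced_flags:
        run = run + 1 if voiced else 0
    return min(1.0, run * frame_s / target_s)
```

A returned value of 1.0 corresponds to the fillable portion reaching the end of the reference bar, at which point the interface may signal that the required prolongation has been achieved.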

[0107] FIGS. 3A-3C illustrate example visual feedback elements 310, 320, 330 generated by the feedback generator 126 of FIG. 1 and displayed by the user interface device 110 to support continuous laryngeal engagement during multi-word or multi-syllable utterances. With reference to FIG. 1, in each example, the feedback generator 126 receives real-time voicing information derived from components of the laryngeal engagement analysis system 120, such as the sustained phonation detector 122, the voice-onset-time detector 123, and the composite laryngeal engagement system 124, and produces corresponding visual indicators aligned with the text of the spoken phrase. In some embodiments, the visual indicators may reflect system-defined interaction settings, such as whether feedback is to be shown continuously, delayed until a phrase boundary, or synchronized with avatar-based conversational timing. The indicators may adjust based on performance-management instructions, including adjustments to sensitivity, timing windows, or highlighting modes for particular phonetic contexts.

[0108] FIG. 3A illustrates an example visual indicator 310 generated by the feedback generator 126 and displayed by the user interface device 110 to show continuous phonation throughout an utterance. As the user produces a phrase such as “Now that I think about it, the door was open, and the lights were on,” the feedback generator 126 receives real-time voicing information (e.g., from the sustained phonation detector 122 and/or the composite laryngeal engagement system 124) and fills the voice bar in synchrony with detected periodic voicing. In the illustrated case, phonation remains active across the entire segment, and the voice bar fills continuously from left to right. In some embodiments, an avatar or XR element may be configured to provide a synchronized gesture, animation, or conversational acknowledgment when continuous phonation is maintained, and such behaviors may be selected or modified by the performance management portal 140.

[0109] FIG. 3B illustrates an example visual indicator 320 used when the system detects a reduction in laryngeal engagement at a particular transition. In the illustrated scenario, the composite laryngeal engagement system 124 identifies a disengagement between “it,” and “the,” and the voice bar displayed by the user interface device 110 is visually interrupted at that point. The interruption corresponds to a location where the composite laryngeal engagement value did not satisfy an engagement criterion, indicating a block or a loss of continuous phonation at that syllable boundary. In some embodiments, the interface may present an avatar-based pause, a contextual highlight, or an adaptive notification indicating that the system has detected a disengagement according to rules or thresholds that may be adjusted through the performance management portal 140.
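One minimal way to locate such interruptions, assuming the composite laryngeal engagement value is available as one number per analysis frame, is sketched below. The fixed `criterion` of 0.5, the `min_frames` debounce, and the function name `find_engagement_reductions` are hypothetical choices for this example; the disclosure permits other engagement criteria.

```python
from typing import List, Sequence, Tuple


def find_engagement_reductions(
    composite: Sequence[float],
    criterion: float = 0.5,  # hypothetical engagement criterion
    min_frames: int = 3,     # ignore dips shorter than this many frames
) -> List[Tuple[int, int]]:
    """Return (start, end) frame spans where composite < criterion.

    `end` is exclusive. Spans shorter than `min_frames` are discarded so
    that brief fluctuations are not reported as blocks.
    """
    spans = []
    start = None
    for i, value in enumerate(composite):
        if value < criterion:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                spans.append((start, i))
            start = None
    # Close a span that runs to the end of the utterance.
    if start is not None and len(composite) - start >= min_frames:
        spans.append((start, len(composite)))
    return spans
```

Each returned span can be converted to seconds by multiplying the frame indices by the frame duration, and then mapped onto the text of the phrase to position the break in the voice bar.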

[0110] FIG. 3C illustrates an example visual indicator 330 that provides guidance for re-initiating phonation and linking the restarted segment into the following syllable or word. After detecting the disengagement shown in FIG. 3B, the feedback generator 126 may present a short prolongation bar beneath the word at which voicing should be restarted (e.g., “the”), prompting the user to re-engage the larynx. A linking cue may also be displayed to indicate that the newly initiated voicing should continue smoothly into the next segment. These visual elements reflect real-time corrective feedback derived from the composite representation produced by the composite laryngeal engagement system 124. In some embodiments, restart cues may coordinate with avatar-interaction behaviors, such as an avatar waiting, signaling readiness, or visually linking to the next word, and these behaviors may be configured or overridden based on supervisory instructions received via the performance management portal 140.
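By way of non-limiting illustration, the voice bar of FIGS. 3A-3C could be approximated in plain text as sketched below. The function name `render_voice_bar`, the `breaks` argument, and the characters used for the bar and the restart cue are assumptions made for this sketch and are not part of the disclosed interface.

```python
def render_voice_bar(words, breaks):
    """Render a text voice bar and a restart cue for a phrase.

    words  -- list of words in the displayed phrase
    breaks -- set of word indices i where voicing was lost just before word i
              (hypothetical input; in practice derived from block locations)
    """
    bar_parts, cue_parts = [], []
    for i, word in enumerate(words):
        if i > 0:
            # A space interrupts the bar at a disengagement; '=' links words.
            bar_parts.append(" " if i in breaks else "=")
            cue_parts.append(" ")
        bar_parts.append("=" * len(word))
        # '~' marks the word at which voicing should be restarted.
        cue_parts.append(("~" if i in breaks else " ") * len(word))
    return " ".join(words), "".join(bar_parts), "".join(cue_parts).rstrip()


text, bar, cue = render_voice_bar(["it,", "the", "door"], breaks={1})
print(text)  # it, the door
print(bar)   # === ========
print(cue)   #     ~~~
```

With an empty `breaks` set the bar is unbroken, corresponding to the continuous case of FIG. 3A.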

[0111] FIG. 4 is a flow diagram illustrative of an embodiment of a routine 400 implemented by one or more components of the laryngeal engagement analysis system 120 for processing speech-production data and identifying block locations in a spoken utterance. Although certain operations of routine 400 are described as being performed by specific modules, such as the speech signal processor 121, the sustained phonation detector 122, the voice-onset-time detector 123, the composite laryngeal engagement system 124, or the feedback generator 126, it will be understood that any of these operations can be distributed among multiple components of the laryngeal engagement analysis environment 100, including the user interface device 110 or the performance management portal 140. Accordingly, the following description should not be construed as limiting.

[0112] At block 402, the laryngeal engagement analysis system 120 receives speech-production audio. For example, the speech signal processor 121 may receive an acoustic signal corresponding to a spoken utterance produced by the user 102 and captured by the user interface device 110. The spoken utterance may include an utterance of a prompted stimulus phrase presented by the user interface device 110, or may include spontaneous speech, such as that produced organically or in response to an open-ended prompt. In some embodiments, the spoken utterance includes speech produced by the user, including at least one of a prompted stimulus phrase or spontaneous speech, the speech being in the form of a single syllable, word, multisyllabic word, phrase, sentence, continuous reading, or conversational production. In some embodiments, the speech signal processor 121 receives streaming audio frames via the network 105; in some embodiments, the speech signal processor 121 receives buffered recordings or feature vectors previously computed on the user interface device 110. The speech signal processor 121 may perform preprocessing such as noise suppression, normalization, or framing so that the audio data is suitable for subsequent phonation and voice-onset analysis.
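As a minimal sketch of the normalization and framing mentioned for block 402, the following example peak-normalizes a sample sequence and splits it into overlapping frames. The function names, the choice of peak normalization, and the frame and hop sizes are assumptions for illustration only; noise suppression is omitted.

```python
def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0 (silence is unchanged)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    return [s / peak for s in samples]


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, advancing by hop.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]


normalized = peak_normalize([0.5, -0.25])       # [1.0, -0.5]
frames = frame_signal(list(range(10)), 4, 2)    # 4 frames with 50% overlap
print(normalized, len(frames), frames[1])
```

In a real deployment the frame length and hop would typically be expressed in milliseconds and converted using the sampling rate of the captured audio.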

[0113] At block 404, the sustained phonation detector 122 detects a sustained phonation metric from the audio data. As described herein, the sustained phonation detector 122 can analyze acoustic features supplied by the speech signal processor 121, such as pitch estimates, voicing-probability values, or spectral-harmonic structure, to determine continuity of periodic voicing across one or more voiced-segment transitions and / or may incorporate other engagement-related indicators, including airflow continuity, intersyllabic temporal regularity, or inferred subglottal-pressure conditions, when computing the sustained phonation metric. In some implementations, the sustained phonation detector 122 compares observed voicing patterns to an expected-voicing timeline derived from a phonetic representation of the utterance, and computes a metric that reflects how consistently the user 102 maintains laryngeal engagement across vowels, voiced consonants, or other regions where continuous voicing is expected.
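One possible way to express the comparison against an expected-voicing timeline is sketched below, assuming per-frame voicing-probability values and a per-frame flag indicating where voicing is expected. The function name, the 0.5 threshold, and the two returned quantities (fraction of expected frames that are voiced, and longest run of unvoiced expected frames) are assumptions for this sketch rather than the disclosed metric.

```python
def sustained_phonation_metric(voicing_prob, expected_voiced, threshold=0.5):
    """Return (continuity, longest_gap) over frames where voicing is expected.

    voicing_prob    -- per-frame voicing probability in [0, 1]
    expected_voiced -- per-frame booleans from a hypothetical expected-voicing timeline
    threshold       -- assumed probability at or above which a frame counts as voiced
    """
    expected = [p for p, e in zip(voicing_prob, expected_voiced) if e]
    if not expected:
        return 1.0, 0  # nothing expected, so nothing was missed

    voiced = sum(1 for p in expected if p >= threshold)

    # Longest consecutive run of frames that should be voiced but are not.
    longest = current = 0
    for p, e in zip(voicing_prob, expected_voiced):
        if e and p < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return voiced / len(expected), longest


probs = [0.9, 0.8, 0.2, 0.1, 0.9, 0.3]
expected = [True, True, True, True, True, False]
print(sustained_phonation_metric(probs, expected))  # (0.6, 2)
```

The final frame is ignored because voicing is not expected there, so a voiceless segment does not count against the speaker.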

[0114] At block 406, the voice-onset-time detector 123 detects a voice-onset-time metric. The voice-onset-time detector 123 may use boundary information and acoustic landmarks provided by the speech signal processor 121 to identify release bursts of stop consonants, which may be voiced or voiceless, and / or to detect the subsequent onset of periodic vocal-fold vibration in following voiced segments or in segments in which periodic voicing begins. For one, some, or each identified stop-vowel or stop-voiced transition, the voice-onset-time detector 123 can compute a temporal interval between the release burst and the onset of periodicity, producing a voice-onset-time value that may be positive or negative, corresponding respectively to voicing that begins after or before the stop release. These values may be aggregated or otherwise combined to form a voice-onset-time metric that indicates whether the user 102 initiates laryngeal engagement within expected timing ranges at voiceless-to-voiced transitions or other stop-voicing contexts.
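The interval computation and aggregation described for block 406 can be sketched as follows, assuming the burst and voicing-onset times have already been located. The function names and the expected range of 0 to 100 ms are hypothetical values chosen for illustration and are not taken from the disclosure.

```python
from statistics import mean


def voice_onset_time_ms(burst_s, voicing_onset_s):
    """Signed voice-onset time: positive if voicing starts after the release burst."""
    return (voicing_onset_s - burst_s) * 1000.0


def vot_metric(transitions, expected_range_ms=(0.0, 100.0)):
    """Aggregate per-transition VOT values.

    transitions       -- list of (burst_time_s, voicing_onset_time_s) pairs
    expected_range_ms -- assumed acceptable range; a real system may set it
                         per phonetic context or per user
    """
    if not transitions:
        return None
    vots = [voice_onset_time_ms(b, v) for b, v in transitions]
    lo, hi = expected_range_ms
    in_range = sum(1 for v in vots if lo <= v <= hi)
    return {
        "values_ms": vots,
        "mean_ms": mean(vots),
        "fraction_in_range": in_range / len(vots),
    }


result = vot_metric([(0.100, 0.160), (0.500, 0.480), (1.000, 1.250)])
print(result["fraction_in_range"])  # one of three transitions is in range
```

Here the second transition yields a negative value (voicing before the release) and the third a long positive value, so only the first falls within the assumed range.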

[0115] At block 408, the composite laryngeal engagement system 124 generates a composite laryngeal-engagement representation based on the sustained phonation metric and the voice-onset-time metric. In some embodiments, the composite laryngeal engagement system 124 further incorporates one or more additional engagement-related features, such as airflow characteristics, intersyllabic timing patterns, respiratory-effort indicators, or activation indicators obtained from the laryngeal-activation sensor 130, so that the composite representation reflects a fusion of acoustic and sensor-derived information. In some embodiments, the composite laryngeal engagement system 124 normalizes each metric to a common scale and combines them using weighting factors or rule-based logic to obtain a composite laryngeal engagement value for the utterance or for individual segment boundaries. In some embodiments, the composite laryngeal engagement system 124 generates a time-aligned sequence of engagement values that reflects continuity of voicing and / or onset timing across the spoken utterance. The composite representation can provide an integrated indication of overall laryngeal engagement behavior during the utterance.
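A weighted combination of this kind might look like the sketch below, which assumes each input has already been mapped to a common 0-to-1 scale. The function name, the default weights, and the renormalization of weights when no sensor indicator is available are assumptions for illustration; rule-based logic could be used instead.

```python
def clamp01(x):
    """Clamp a value to the common [0, 1] scale."""
    return max(0.0, min(1.0, x))


def composite_engagement(phonation, vot_score, sensor=None,
                         weights=(0.5, 0.3, 0.2)):
    """Weighted mean of the available engagement metrics.

    phonation -- sustained phonation metric on [0, 1]
    vot_score -- voice-onset-time metric on [0, 1] (e.g., fraction in range)
    sensor    -- optional activation indicator on [0, 1], or None if absent
    weights   -- hypothetical weights for (phonation, vot, sensor)
    """
    terms = [(weights[0], clamp01(phonation)),
             (weights[1], clamp01(vot_score))]
    if sensor is not None:
        terms.append((weights[2], clamp01(sensor)))
    total = sum(w for w, _ in terms)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Dividing by the weights actually used keeps the result on [0, 1].
    return sum(w * v for w, v in terms) / total


print(composite_engagement(0.6, 1.0))              # acoustic metrics only
print(composite_engagement(0.6, 1.0, sensor=0.0))  # sensor reports no activation
```

Applying the same function to each segment boundary, rather than once per utterance, would yield the time-aligned sequence of engagement values described above.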

[0116] At block 410, the composite laryngeal engagement system 124 identifies a block location based on the composite laryngeal-engagement representation. For example, the composite laryngeal engagement system 124 may compare composite engagement values to a laryngeal-engagement criterion and determine that a block location, which may be an engagement-reduction point, is present when the composite laryngeal-engagement representation fails to satisfy the criterion for longer than an allowable temporal tolerance at a particular onset or transition. In some cases, the composite laryngeal engagement system 124 classifies the context of the block location (e.g., an initial vowel onset or a voiceless-to-voiced transition) and identifies an index or timestamp that can be aligned with a phonetic transcription or with text of the stimulus phrase.
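The comparison against a criterion with a temporal tolerance can be sketched as a scan for runs of below-criterion values in a time-aligned sequence. The function name, the treatment of the criterion as a simple lower threshold, and the example frame duration and tolerance are assumptions made for this sketch.

```python
def find_block_locations(values, frame_s, criterion, tolerance_s):
    """Return (start_index, end_index) for each run that fails the criterion
    for longer than tolerance_s; end_index is exclusive.

    values      -- time-aligned composite engagement values, one per frame
    frame_s     -- assumed duration of each frame in seconds
    criterion   -- assumed minimum acceptable engagement value
    tolerance_s -- allowable duration below the criterion
    """
    blocks = []
    run_start = None
    # A sentinel equal to the criterion closes any run reaching the end.
    for i, v in enumerate(list(values) + [criterion]):
        if v < criterion:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_s > tolerance_s:
                blocks.append((run_start, i))
            run_start = None
    return blocks


values = [0.9, 0.4, 0.3, 0.2, 0.9, 0.4, 0.9]
blocks = find_block_locations(values, frame_s=0.01, criterion=0.5, tolerance_s=0.02)
print(blocks)  # [(1, 4)]
```

The single low frame at index 5 is shorter than the tolerance and is therefore not reported. Multiplying a returned start index by `frame_s` gives a timestamp that can be aligned with the text of the stimulus phrase.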

[0117] At block 412, the feedback generator 126 provides system-generated output signals representing at least one of an indication of the identified block location or a system-determined adjustment to interface-state parameters associated with that location. The feedback generator 126 may, for example, cause the user interface device 110 to highlight a portion of displayed text corresponding to the block location, update a voice bar or progress bar to reflect an interruption in sustained phonation, or modify timing thresholds used for subsequent utterances. In some embodiments, the feedback generator 126 provides auditory or haptic cues, or updates extended-reality elements presented through the user interface device 110, to guide the user 102 in re-initiating phonation or maintaining continuous voicing. The system-generated output signals may be transmitted to the performance management portal 140 for storage, visualization, or use in selecting subsequent speaking tasks.
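As one hedged illustration of highlighting displayed text at a block location, the sketch below maps a block timestamp onto word-level time alignments and brackets the word at which the block occurred or before which voicing was lost. The function name, the alignment tuples, and the bracket notation are hypothetical; an actual interface would change display state rather than return a string.

```python
def highlight_block(aligned_words, block_time_s):
    """Bracket the word that contains, or first follows, the block timestamp.

    aligned_words -- list of (word, start_s, end_s) tuples, assumed to come
                     from a forced alignment of the stimulus phrase
    block_time_s  -- timestamp of the identified block location
    """
    # First word still in progress or not yet started at the block time.
    target = next(
        (i for i, (_, _, end) in enumerate(aligned_words) if end > block_time_s),
        None,
    )
    return " ".join(
        f"[{word}]" if i == target else word
        for i, (word, _, _) in enumerate(aligned_words)
    )


alignment = [("it,", 0.00, 0.20), ("the", 0.35, 0.50), ("door", 0.50, 0.90)]
print(highlight_block(alignment, 0.30))  # it, [the] door
```

A block timestamp later than the end of the last word leaves the text unhighlighted.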

[0118] In some embodiments, one or more operations of routine 400 are repeated across multiple utterances, and the sustained phonation metrics, voice-onset-time metrics, composite laryngeal engagement values, and block locations generated during routine 400 may be aggregated to support longitudinal tracking, adaptive thresholding, or exercise-recommendation logic as described elsewhere herein.
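Adaptive thresholding over aggregated values could take many forms; one simple possibility is sketched below, in which the engagement criterion tracks a fraction of the user's recent mean composite value. The function name, the window size, the fraction, and the clamping bounds are all assumptions introduced for this example and are not specified by the disclosure.

```python
from statistics import mean


def adaptive_criterion(history, base=0.5, window=5, fraction=0.8,
                       lo=0.3, hi=0.9):
    """Derive an engagement criterion from recent composite values.

    history  -- composite engagement values from prior utterances, oldest first
    base     -- criterion used when no history exists
    window   -- number of most recent utterances considered
    fraction -- portion of the recent mean used as the new criterion
    lo, hi   -- bounds keeping the criterion in a usable range
    """
    if not history:
        return base
    recent = history[-window:]
    return max(lo, min(hi, fraction * mean(recent)))


print(adaptive_criterion([]))                         # 0.5 (no history yet)
print(adaptive_criterion([0.6, 0.8, 0.7], window=2))  # about 0.6
```

Under this rule the criterion rises as the user's recent performance improves and relaxes after weaker sessions, within the stated bounds.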

[0119] It will be appreciated that the arrangement of blocks in routine 400 is illustrative and that fewer, additional, reordered, or concurrent operations may be used in other embodiments. For example, in some implementations the laryngeal engagement analysis system 120 may introduce an additional operation in which the composite laryngeal engagement system 124 receives one or more activation indicators from the laryngeal-activation sensor 130 and incorporates those indicators into the composite laryngeal-engagement representation. In some such cases, the composite laryngeal engagement system 124 may weight or normalize sensor-derived activation information relative to the sustained phonation metric and the voice-onset-time metric, or may replace one of those metrics entirely when acoustic data is unreliable, so that the engagement representation at block 408 reflects a combination of microphone-derived and sensor-derived evidence of vocal-fold activation.

Example Embodiments

[0120] Various illustrative examples, relating to processing speech-production data, are described in the following numbered clauses:

[0121] Clause 1. A computer-implemented method for processing speech-production data, comprising: receiving audio data representing a spoken utterance produced by the user, the spoken utterance comprising speech produced by the user, including at least one of a prompted stimulus phrase or spontaneous speech, the speech being in the form of a single syllable, word, multisyllabic word, phrase, sentence, continuous reading, or conversational production; detecting, from the audio data, a sustained phonation metric representing continuity of periodic voicing across one or more voiced-segment transitions; detecting, from the audio data, a voice-onset-time metric representing a temporal interval between a stop consonant and a subsequent voiced segment in which periodic voicing begins, the temporal interval including positive or negative voice-onset-time values; generating a composite laryngeal engagement representation based on at least one of the sustained phonation metric, the voice-onset-time metric, or one or more additional engagement-related features derived from the audio data or from another sensing modality; identifying a block location in the spoken utterance based on the composite laryngeal engagement representation; and providing system-generated output signals representing at least one of an indication of the block location or a system-determined adjustment to interface-state parameters associated with the block location.

[0122] Clause 2. The method of clause 1, wherein the stimulus phrase defines an intended phonetic sequence including at least one intended transition between sequential voiced phonetic segments for which continuous voicing is expected, and wherein generating the composite laryngeal engagement representation is based on evaluating a system-defined engagement criterion for the at least one intended transition.

[0123] Clause 3. The method of any of the preceding clauses, wherein the stimulus phrase includes at least one intended voiceless stop consonant or voiced stop consonant, followed by an intended voiced segment, and generating the composite laryngeal engagement representation is based on evaluating a system-defined onset-timing criterion for the intended voiceless-to-voiced transition.

[0124] Clause 4. The method of any of the preceding clauses, further comprising identifying a phonetic challenge type indicated by a reduced laryngeal engagement characteristic in the composite laryngeal engagement representation and selecting a subsequent stimulus phrase for presentation to the user that includes the phonetic challenge type.

[0125] Clause 5. The method of clause 4, wherein the phonetic challenge type comprises at least one phonetic feature including voicing, manner of articulation, place of articulation, or segment duration associated with the reduced laryngeal engagement characteristic.

[0126] Clause 6. The method of clause 5, wherein the phonetic feature further comprises a contextual property including at least one of a transition between adjacent phonetic segments, an onset characteristic of periodic voicing, a release burst property associated with a stop consonant, or a temporal characteristic of a voiced segment associated with the reduced laryngeal engagement characteristic.

[0127] Clause 7. The method of clause 4, wherein selecting the subsequent stimulus phrase includes automatically generating a user-specific stimulus phrase, the user-specific stimulus phrase being generated to include a phonetic element or phonetic challenge type corresponding to the reduced laryngeal engagement characteristic identified from the composite laryngeal engagement state.

[0128] Clause 8. The method of any of the preceding clauses, wherein detecting the sustained phonation metric comprises identifying a duration of periodic voicing of a first vowel of a word in the stimulus phrase.

[0129] Clause 9. The method of any of the preceding clauses, wherein detecting the sustained phonation metric comprises determining whether periodic voicing remains continuous across a plurality of syllable transitions.

[0130] Clause 10. The method of clause 9, further comprising evaluating one or more additional engagement-related features detectable from the audio data or another sensing modality, including airflow-related characteristics or temporal patterns of intersyllabic production.

[0131] Clause 11. The method of any of the preceding clauses, wherein detecting the sustained phonation metric comprises identifying a first vowel sound within the spoken utterance.

[0132] Clause 12. The method of clause 11, wherein detecting the first vowel sound comprises extracting acoustic features including at least one of formant frequencies or an initial segment exhibiting vowel-articulation patterns.

[0133] Clause 13. The method of any of the preceding clauses, wherein detecting the voice-onset-time metric comprises identifying a release burst of the stop consonant in the audio data and detecting a subsequent onset of periodic voicing.

[0134] Clause 14. The method of any of the preceding clauses, wherein generating the composite laryngeal engagement representation comprises applying respective weighting factors to the sustained phonation metric and the voice-onset-time metric.

[0135] Clause 15. The method of any of the preceding clauses, wherein generating the composite laryngeal engagement representation further comprises normalizing at least one of the sustained phonation metric or the voice-onset-time metric to a user-specific baseline.

[0136] Clause 16. The method of any of the preceding clauses, wherein identifying the block location comprises determining that the composite laryngeal engagement representation fails to satisfy a laryngeal engagement criterion.

[0137] Clause 17. The method of any of the preceding clauses, wherein identifying the block location comprises detecting a cessation of periodic voicing in a voiced segment in the stimulus phrase.

[0138] Clause 18. The method of any of the preceding clauses, wherein identifying the block location further comprises determining whether the block occurred during an utterance onset or during a transition between syllables.

[0139] Clause 19. The method of any of the preceding clauses, wherein providing the system-generated output signals comprises modifying interface-state parameters to highlight the block location.

[0140] Clause 20. The method of any of the preceding clauses, wherein providing the system-generated output signals comprises modifying a display or timing parameter associated with an initial vowel of a word.

[0141] Clause 21. The method of any of the preceding clauses, wherein providing the system-generated output signals comprises updating a graphical user interface with at least one of the sustained phonation metric, the voice-onset-time metric, or the composite laryngeal engagement representation.

[0142] Clause 22. The method of clause 21, wherein the graphical user interface comprises a voice bar that indicates whether periodic voicing is present at one or more syllable transitions.

[0143] Clause 23. The method of clause 21, wherein the graphical user interface comprises a progress bar that fills in accordance with a duration of periodic voicing.

[0144] Clause 24. The method of any of the preceding clauses, wherein the system-generated output signals represent a cursor that fills according to a duration of sustained phonation and generates an output signal indicating that a time threshold has been satisfied.

[0145] Clause 25. The method of clause 24, wherein the time threshold corresponds to a nearest-normal duration in a range from approximately 0.5 seconds to approximately 1.5 seconds.

[0146] Clause 26. The method of any of the preceding clauses, wherein the audio data is processed in real time and the system-generated output signals are updated continuously.

[0147] Clause 27. The method of any of the preceding clauses, wherein phonation-duration thresholds are dynamically adjusted based on user-specific performance metrics.

[0148] Clause 28. The method of any of the preceding clauses, wherein the system-generated output signals include real-time auditory output signals to indicate system-detected phonation conditions.

[0149] Clause 29. The method of any of the preceding clauses, further comprising storing at least one of the sustained phonation metric, the voice-onset-time metric, or the composite laryngeal engagement representation for longitudinal progress tracking.

[0150] Clause 30. The method of any of the preceding clauses, further comprising receiving a user-provided parameter and adjusting system-output behavior based on the user-provided parameter.

[0151] Clause 31. The method of any of the preceding clauses, further comprising classifying a phonetic environment of the stimulus phrase to determine which segment boundaries are evaluated for laryngeal engagement.

[0152] Clause 32. The method of any of the preceding clauses, further comprising generating a summary report including metrics of sustained phonation and interface adjustments performed during a session.

[0153] Clause 33. The method of any of the preceding clauses, wherein detecting the sustained phonation metric includes using a trained machine-learning model to classify segments as engaged or disengaged based on voicing characteristics or other engagement-related features.

[0154] Clause 34. The method of any of the preceding clauses, wherein detecting the block location comprises applying a trained machine-learning model to identify patterns of disrupted phonation associated with stuttering-type behaviors.

[0155] Clause 35. The method of any of the preceding clauses, wherein the audio data is obtained during interaction with an extended-reality environment that simulates real-world speaking scenarios and is configured to present a performance-challenging speaking context.

[0156] Clause 36. The method of any of the preceding clauses, wherein presenting the stimulus phrase comprises presenting through a visual or auditory prompt.

[0157] Clause 37. The method of any of the preceding clauses, wherein the audio data corresponds to a word or sentence spontaneously spoken by the user in response to an open-ended prompt.

[0158] Clause 38. The method of any of the preceding clauses, wherein the method is configured to support performance-based adjustment of system-generated feedback related to speech-production characteristics.

[0159] Clause 39. The method of any of the preceding clauses, further comprising determining a context-sensitivity metric by comparing a voicing-interruption frequency measured during a structured speaking task to a voicing-interruption frequency measured during at least one of conversational speech with an avatar or speech produced in an extended-reality environment, wherein an increase in voicing-interruption frequency in the conversational or extended-reality environment relative to the structured speaking task is interpreted as an indication of a context-sensitivity metric associated with speaking-condition variability.

[0160] Clause 40. The method of any of the preceding clauses, further comprising obtaining a physiological signal comprising at least one of heart rate, heart-rate variability, or respiration rate from a wearable device, and adjusting the context-sensitivity metric based at least in part on a correlation between the physiological signal and changes in voicing-interruption frequency across different speaking contexts.

[0161] Clause 41. The method of any of the preceding clauses, further comprising obtaining an indication of laryngeal activation from at least one sensor configured to detect whether vocal folds of the user are engaged during speech production or whether sufficient subglottal air pressure is present to support engagement, wherein generating the composite laryngeal engagement representation is based at least in part on the indication of laryngeal activation.

[0162] Clause 42. The method of clause 41, wherein obtaining the indication of laryngeal activation comprises analyzing a microphone signal to detect periodic voicing, including identifying continuous phonation or a reduction in voice-onset time as correlates of vocal-fold engagement.

[0163] Clause 43. The method of clause 41, wherein obtaining the indication of laryngeal activation comprises receiving imaging data or physiological-sensor data from a camera, optical sensor, or proximity sensor positioned to observe vocal-fold motion or glottal opening, the imaging data indicating whether the larynx remains engaged during speech production.

[0164] Clause 44. The method of clause 41, further comprising confirming laryngeal activation by combining microphone-derived voicing metrics with imaging-derived or sensor-derived indicators of vocal-fold engagement to improve accuracy in detecting laryngeal disengagement associated with stuttering blocks.

[0165] Clause 45. The method of any of the preceding clauses, wherein the stop consonant is voiceless.

[0166] Clause 46. A computer-implemented method for processing speech-production data, comprising: presenting a stimulus phrase to a user, the stimulus phrase defining an intended phonetic sequence that includes at least one intended transition between sequential voiced phonetic segments at which continuous voicing is expected when a larynx is engaged, and that further includes at least one intended voiceless stop consonant followed by an intended voiced segment, such that a spoken utterance to the stimulus phrase includes a first criterion corresponding to detecting whether the larynx remains engaged across the at least one intended transition between sequential voiced phonetic segments and includes a second criterion corresponding to measuring a voice onset time representation between the at least one intended voiceless stop consonant and the intended voiced segment that follows the at least one intended voiceless stop consonant; receiving the spoken utterance that includes an utterance of the stimulus phrase; determining a sustained phonation representation from the spoken utterance by evaluating the first criterion to compute, for each of the at least one intended transition between sequential voiced phonetic segments, a continuity representation of voiced sound at the at least one intended transition, the sustained phonation representation indicating a laryngeal engagement representation at each of the at least one intended transition; determining the voice onset time representation from the spoken utterance by evaluating the second criterion to compute a timing representation representing an acoustic delay between a release burst of the at least one intended voiceless stop consonant included in the stimulus phrase and an onset of periodic voicing in the intended voiced segment that follows the at least one intended voiceless stop consonant, the timing representation indicating a degree of laryngeal activation during a transition from the at least one intended voiceless stop consonant to the intended voiced segment; deriving a composite laryngeal engagement state from the sustained phonation representation and from the timing representation, the composite laryngeal engagement state indicating a level of laryngeal engagement during the utterance of the stimulus phrase; identifying a block location within the utterance of the spoken utterance by detecting a point at which the composite laryngeal engagement state decreases to a disengaged condition; and providing system-generated output signals representing at least one of an indication of the block location or a system-determined adjustment to a graphical or auditory interface state associated with the block location.

[0167] Clause 47. A system for processing speech-production data, comprising: one or more processors; and a non-transitory memory storing instructions that, when executed by the one or more processors, cause the system to: receive audio data representing a spoken utterance produced by a user, the spoken utterance including an utterance of a stimulus phrase or spontaneous speech; detect, from the audio data, a sustained-phonation metric representing continuity of periodic voicing across one or more voiced-segment transitions; detect, from the audio data, a voice-onset-time metric representing a temporal interval between a voiceless stop consonant and a subsequent voiced segment; generate a composite laryngeal-engagement representation based on the sustained-phonation metric and the voice-onset-time metric; identify a block location in the spoken utterance based on the composite laryngeal-engagement representation; and provide system-generated output signals representing at least one of an indication of the block location, or a system-determined adjustment to interface-state parameters associated with the block location.

[0168] Clause 48. The system of clause 47, wherein the system is further configured to perform any of the steps or have any of the features of any of the preceding clauses.

[0169] Clause 49. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the processors to perform a method comprising: receiving audio data representing a spoken utterance produced by the user, the spoken utterance including an utterance of a stimulus phrase or spontaneous speech; detecting, from the audio data, a sustained-phonation metric representing continuity of periodic voicing across one or more voiced-segment transitions; detecting, from the audio data, a voice-onset-time metric representing a temporal interval between a voiceless stop consonant and a subsequent voiced segment; generating a composite laryngeal-engagement representation based on the sustained-phonation metric and the voice-onset-time metric; identifying a block location in the spoken utterance based on the composite laryngeal-engagement representation; and providing system-generated output signals representing at least one of: an indication of the block location, or a system-determined adjustment to interface-state parameters associated with the block location.

[0170] Clause 50. The non-transitory computer-readable medium of clause 49, wherein the method comprises any of the steps or features of any of the preceding clauses.

Terminology

[0171] Any or all of the features and functions described above can be combined with each other, except to the extent it may be otherwise stated above or to the extent that any such embodiments may be incompatible by virtue of their function or structure, as will be apparent to persons of ordinary skill in the art. Unless contrary to physical possibility, it is envisioned that (i) the methods / steps described herein may be performed in any sequence and / or in any combination, and (ii) the components of respective embodiments may be combined in any manner.

[0172] Although the subject matter has been described in language specific to structural features and / or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims, and other equivalent features and acts are intended to be within the scope of the claims.

[0173] Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and / or steps. Thus, such conditional language is not generally intended to imply that features, elements and / or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and / or steps are included or are to be performed in any particular embodiment.

[0174] Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense, e.g., in the sense of “including, but not limited to.” Where the context permits, words using the singular or plural number may also include the plural or singular number respectively. The word “or” in reference to a list of two or more items, covers all of the following interpretations of the word: any one of the items in the list, all of the items in the list, and any combination of the items in the list. Likewise the term “and / or” in reference to a list of two or more items, covers all of the following interpretations of the word: any one of the items in the list, all of the items in the list, and any combination of the items in the list.

[0175] Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to convey that an item, term, etc. may be either X, Y or Z, or any combination thereof. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y and at least one of Z to each be present. Further, use of the phrase “at least one of X, Y or Z” as used in general is to convey that an item, term, etc. may be either X, Y or Z, or any combination thereof.

[0176] Any patents and applications and other references noted above, including any that may be listed in accompanying filing papers, are incorporated herein by reference to the extent permitted by applicable law and regulations and not to incorporate essential material. To the extent of any inconsistency, this specification governs. Aspects of the invention can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet further implementations of the invention. These and other changes can be made to the invention in light of the above Detailed Description. While the above description describes certain examples of the invention, and describes the best mode contemplated, no matter how detailed the above appears in text, the invention can be practiced in many ways. Details of the system may vary considerably in its specific implementation, while still being encompassed by the invention disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the invention should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the invention with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the invention to the specific examples disclosed in the specification, unless the above Detailed Description section explicitly defines such terms. Accordingly, the actual scope of the invention encompasses not only the disclosed examples, but also all equivalent ways of practicing or implementing the invention under the claims.

[0177] To reduce the number of claims, certain aspects of the invention are presented below in certain claim forms, but the applicant contemplates other aspects of the invention in any number of claim forms. Any claims intended to be treated under 35 U.S.C. § 112(f) will begin with the words “means for,” but use of the term “for” in any other context is not intended to invoke treatment under 35 U.S.C. § 112(f). Accordingly, the applicant reserves the right to pursue additional claims after filing this application, in either this application or in a continuing application. Nothing in this section is intended to disclaim subject matter or limit the broadest reasonable interpretation of the claims.

Claims

Attorney Docket No. 190412-00041 WO Patent

WHAT IS CLAIMED IS:

1. A computer-implemented method for processing speech-production data, comprising:
receiving audio data representing a spoken utterance produced by the user, the spoken utterance comprising speech produced by the user, including at least one of a prompted stimulus phrase or spontaneous speech, the speech being in the form of a single syllable, word, multisyllabic word, phrase, sentence, continuous reading, or conversational production;
detecting, from the audio data, a sustained phonation metric representing continuity of periodic voicing across one or more voiced-segment transitions;
detecting, from the audio data, a voice-onset-time metric representing a temporal interval between a stop consonant and a subsequent voiced segment in which periodic voicing begins, the temporal interval including positive or negative voice-onset-time values;
generating a composite laryngeal engagement representation based on at least one of the sustained phonation metric, the voice-onset-time metric, or one or more additional engagement-related features derived from the audio data or from another sensing modality;
identifying a block location in the spoken utterance based on the composite laryngeal engagement representation; and
providing system-generated output signals representing at least one of an indication of the block location or a system-determined adjustment to interface-state parameters associated with the block location.

2. The method of claim 1, wherein the stimulus phrase defines an intended phonetic sequence including at least one intended transition between sequential voiced phonetic segments for which continuous voicing is expected, and wherein generating the composite laryngeal engagement representation is based on evaluating a system-defined engagement criterion for the at least one intended transition.

3. The method of claim 1, wherein the stimulus phrase includes at least one intended voiceless stop consonant or voiced stop consonant, followed by an intended voiced segment, and generating the composite laryngeal engagement representation is based on evaluating a system-defined onset-timing criterion for the intended voiceless-to-voiced transition.

4. The method of claim 1, further comprising identifying a phonetic challenge type indicated by a reduced laryngeal engagement characteristic in the composite laryngeal engagement representation and selecting a subsequent stimulus phrase for presentation to the user that includes the phonetic challenge type.

5. The method of claim 4, wherein the phonetic challenge type comprises at least one phonetic feature including voicing, manner of articulation, place of articulation, or segment duration associated with the reduced laryngeal engagement characteristic.

6. The method of claim 5, wherein the phonetic feature further comprises a contextual property including at least one of a transition between adjacent phonetic segments, an onset characteristic of periodic voicing, a release burst property associated with a stop consonant, or a temporal characteristic of a voiced segment associated with the reduced laryngeal engagement characteristic.

7. The method of claim 4, wherein selecting the subsequent stimulus phrase includes automatically generating a user-specific stimulus phrase, the user-specific stimulus phrase being generated to include a phonetic element or phonetic challenge type corresponding to the reduced laryngeal engagement characteristic identified from the composite laryngeal engagement state.

8. The method of claim 1, wherein detecting the sustained phonation metric comprises identifying a duration of periodic voicing of a first vowel of a word in the stimulus phrase.

9. The method of claim 1, wherein detecting the sustained phonation metric comprises determining whether periodic voicing remains continuous across a plurality of syllable transitions.

10. The method of claim 9, further comprising evaluating one or more additional engagement-related features detectable from the audio data or another sensing modality, including airflow-related characteristics or temporal patterns of intersyllabic production.

11. The method of claim 1, wherein detecting the sustained phonation metric comprises identifying a first vowel sound within the spoken utterance.

12. The method of claim 11, wherein detecting the first vowel sound comprises extracting acoustic features including at least one of formant frequencies or an initial segment exhibiting vowel-articulation patterns.

13. The method of claim 1, wherein detecting the voice-onset-time metric comprises identifying a release burst of the stop consonant in the audio data and detecting a subsequent onset of periodic voicing.

14. The method of claim 1, wherein generating the composite laryngeal engagement representation comprises applying respective weighting factors to the sustained phonation metric and the voice-onset-time metric.

15. The method of claim 1, wherein generating the composite laryngeal engagement representation further comprises normalizing at least one of the sustained phonation metric or the voice-onset-time metric to a user-specific baseline.

16. The method of claim 1, wherein identifying the block location comprises determining that the composite laryngeal engagement representation fails to satisfy a laryngeal engagement criterion.

17. The method of claim 1, wherein identifying the block location comprises detecting a cessation of periodic voicing in a voiced segment in the stimulus phrase.

18. The method of claim 1, wherein identifying the block location further comprises determining whether the block occurred during an utterance onset or during a transition between syllables.

19. The method of claim 1, wherein providing the system-generated output signals comprises modifying interface-state parameters to highlight the block location.

20. The method of claim 1, wherein providing the system-generated output signals comprises modifying a display or timing parameter associated with an initial vowel of a word.

21. The method of claim 1, wherein providing the system-generated output signals comprises updating a graphical user interface with at least one of the sustained phonation metric, the voice-onset-time metric, or the composite laryngeal engagement representation.

22. The method of claim 21, wherein the graphical user interface comprises a voice bar that indicates whether periodic voicing is present at one or more syllable transitions.

23. The method of claim 21, wherein the graphical user interface comprises a progress bar that fills in accordance with a duration of periodic voicing.

24. The method of claim 1, wherein the system-generated output signals represent a cursor that fills according to a duration of sustained phonation and generates an output signal indicating that a time threshold has been satisfied.

25. The method of claim 24, wherein the time threshold corresponds to a nearest-normal duration in a range from approximately 0.5 seconds to approximately 1.5 seconds.

26. The method of claim 1, wherein the audio data is processed in real time and the system-generated output signals are updated continuously.

27. The method of claim 1, wherein phonation-duration thresholds are dynamically adjusted based on user-specific performance metrics.

28. The method of claim 1, wherein the system-generated output signals include real-time auditory output signals to indicate system-detected phonation conditions.

29. The method of claim 1, further comprising storing at least one of the sustained phonation metric, the voice-onset-time metric, or the composite laryngeal engagement representation for longitudinal progress tracking.

30. The method of claim 1, further comprising receiving a user-provided parameter and adjusting system-output behavior based on the user-provided parameter.

31. The method of claim 1, further comprising classifying a phonetic environment of the stimulus phrase to determine which segment boundaries are evaluated for laryngeal engagement.

32. The method of claim 1, further comprising generating a summary report including metrics of sustained phonation and interface adjustments performed during a session.

33. The method of claim 1, wherein detecting the sustained phonation metric includes using a trained machine-learning model to classify segments as engaged or disengaged based on voicing characteristics or other engagement-related features.

34. The method of claim 1, wherein detecting the block location comprises applying a trained machine-learning model to identify patterns of disrupted phonation associated with stuttering-type behaviors.

35. The method of claim 1, wherein the audio data is obtained during interaction with an extended-reality environment that simulates real-world speaking scenarios and is configured to present a performance-challenging speaking context.

36. The method of claim 1, wherein presenting the stimulus phrase comprises presenting through a visual or auditory prompt.

37. The method of claim 1, wherein the audio data corresponds to a word or sentence spontaneously spoken by the user in response to an open-ended prompt.

38. The method of claim 1, wherein the method is configured to support performance-based adjustment of system-generated feedback related to speech-production characteristics.

39. The method of claim 1, further comprising determining a context-sensitivity metric by comparing a voicing-interruption frequency measured during a structured speaking task to a voicing-interruption frequency measured during at least one of conversational speech with an avatar or speech produced in an extended-reality environment, wherein an increase in voicing-interruption frequency in the conversational or extended-reality environment relative to the structured speaking task is interpreted as an indication of a context-sensitivity metric associated with speaking-condition variability.

40. The method of claim 1, further comprising obtaining a physiological signal comprising at least one of heart rate, heart-rate variability, or respiration rate from a wearable device, and adjusting the context-sensitivity metric based at least in part on a correlation between the physiological signal and changes in voicing-interruption frequency across different speaking contexts.

41. The method of claim 1, further comprising obtaining an indication of laryngeal activation from at least one sensor configured to detect whether vocal folds of the user are engaged during speech production or whether sufficient subglottal air pressure is present to support engagement, wherein generating the composite laryngeal engagement representation is based at least in part on the indication of laryngeal activation.

42. The method of claim 41, wherein obtaining the indication of laryngeal activation comprises analyzing a microphone signal to detect periodic voicing, including identifying continuous phonation or a reduction in voice-onset time as correlates of vocal-fold engagement.

43. The method of claim 41, wherein obtaining the indication of laryngeal activation comprises receiving imaging data or physiological-sensor data from a camera, optical sensor, or proximity sensor positioned to observe vocal-fold motion or glottal opening, the imaging data indicating whether the larynx remains engaged during speech production.

44. The method of claim 41, further comprising confirming laryngeal activation by combining microphone-derived voicing metrics with imaging-derived or sensor-derived indicators of vocal-fold engagement to improve accuracy in detecting laryngeal disengagement associated with stuttering blocks.

45. The method of claim 1, wherein the stop consonant is voiceless.

46. A computer-implemented method for processing speech-production data, comprising:
presenting a stimulus phrase to a user, the stimulus phrase defining an intended phonetic sequence that includes at least one intended transition between sequential voiced phonetic segments at which continuous voicing is expected when a larynx is engaged, and that further includes at least one intended voiceless stop consonant followed by an intended voiced segment, such that a spoken utterance to the stimulus phrase includes a first criterion corresponding to detecting whether the larynx remains engaged across the at least one intended transition between sequential voiced phonetic segments and includes a second criterion corresponding to measuring a voice onset time representation between the at least one intended voiceless stop consonant and the intended voiced segment that follows the at least one intended voiceless stop consonant;
receiving the spoken utterance that includes an utterance of the stimulus phrase;
determining a sustained phonation representation from the spoken utterance by evaluating the first criterion to compute, for each of the at least one intended transition between sequential voiced phonetic segments, a continuity representation of voiced sound at the at least one intended transition, the sustained phonation representation indicating a laryngeal engagement representation at each of the at least one intended transition;
determining the voice onset time representation from the spoken utterance by evaluating the second criterion to compute a timing representation representing an acoustic delay between a release burst of the at least one intended voiceless stop consonant included in the stimulus phrase and an onset of periodic voicing in the intended voiced segment that follows the at least one intended voiceless stop consonant, the timing representation indicating a degree of laryngeal activation during a transition from the at least one intended voiceless stop consonant to the intended voiced segment;
deriving a composite laryngeal engagement state from the sustained phonation representation and from the timing representation, the composite laryngeal engagement state indicating a level of laryngeal engagement during the utterance of the stimulus phrase;
identifying a block location within the utterance of the spoken utterance by detecting a point at which the composite laryngeal engagement state decreases to a disengaged condition; and
providing system-generated output signals representing at least one of an indication of the block location or a system-determined adjustment to a graphical or auditory interface state associated with the block location.

47. A system for processing speech-production data, comprising:
one or more processors; and
a non-transitory memory storing instructions that, when executed by the one or more processors, cause the system to:
receive audio data representing a spoken utterance produced by a user, the spoken utterance including an utterance of a stimulus phrase or spontaneous speech;
detect, from the audio data, a sustained-phonation metric representing continuity of periodic voicing across one or more voiced-segment transitions;
detect, from the audio data, a voice-onset-time metric representing a temporal interval between a voiceless stop consonant and a subsequent voiced segment;
generate a composite laryngeal-engagement representation based on the sustained-phonation metric and the voice-onset-time metric;
identify a block location in the spoken utterance based on the composite laryngeal-engagement representation; and
provide system-generated output signals representing at least one of an indication of the block location, or a system-determined adjustment to interface-state parameters associated with the block location.

48. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the processors to perform a method comprising:
receiving audio data representing a spoken utterance produced by the user, the spoken utterance including an utterance of a stimulus phrase or spontaneous speech;
detecting, from the audio data, a sustained-phonation metric representing continuity of periodic voicing across one or more voiced-segment transitions;
detecting, from the audio data, a voice-onset-time metric representing a temporal interval between a voiceless stop consonant and a subsequent voiced segment;
generating a composite laryngeal-engagement representation based on the sustained-phonation metric and the voice-onset-time metric;
identifying a block location in the spoken utterance based on the composite laryngeal-engagement representation; and
providing system-generated output signals representing at least one of:
an indication of the block location, or
a system-determined adjustment to interface-state parameters associated with the block location.
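
The baseline normalization, weighting, and criterion test recited in claims 14 to 16 can be illustrated with a short Python sketch. It is a minimal example under stated assumptions and is not taken from the specification: the `Segment` fields, the two scoring functions, the weights (0.6 and 0.4), and the 0.6 engagement criterion are all hypothetical. It also assumes, following claim 42, that a shorter voice-onset time correlates with stronger engagement, and it treats negative voice-onset-time values as fully engaged.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One analysed transition in an utterance (hypothetical structure)."""
    start_s: float    # start time of the transition, in seconds
    phonation: float  # sustained-phonation metric, 0..1 (1 = continuous voicing)
    vot_ms: float     # voice-onset time in milliseconds (may be negative)


def phonation_score(phonation: float, baseline: float) -> float:
    """Normalize the phonation metric to a user-specific baseline, capped at 1."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return min(1.0, phonation / baseline)


def vot_score(vot_ms: float, baseline_ms: float) -> float:
    """Score onset timing against a user-specific baseline.

    Assumption: a VOT at or below baseline (including negative values)
    scores 1.0; longer delays score proportionally lower.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    if vot_ms <= baseline_ms:
        return 1.0
    return baseline_ms / vot_ms


def composite(seg: Segment, base_phon: float, base_vot_ms: float,
              w_phon: float = 0.6, w_vot: float = 0.4) -> float:
    """Weighted combination of the two normalized metrics (weights are illustrative)."""
    return (w_phon * phonation_score(seg.phonation, base_phon)
            + w_vot * vot_score(seg.vot_ms, base_vot_ms))


def block_locations(segments, base_phon, base_vot_ms, criterion=0.6):
    """Return start times of segments whose composite value fails the criterion."""
    return [s.start_s for s in segments
            if composite(s, base_phon, base_vot_ms) < criterion]


# Hypothetical utterance: the middle transition shows weak voicing and a long VOT.
example = [
    Segment(start_s=0.0, phonation=0.90, vot_ms=60.0),
    Segment(start_s=0.4, phonation=0.30, vot_ms=240.0),
    Segment(start_s=0.9, phonation=0.85, vot_ms=70.0),
]
blocks = block_locations(example, base_phon=0.9, base_vot_ms=60.0)
print(blocks)
```

In this sketch only the second transition falls below the criterion, so its start time is the single reported block location. A real system would derive the per-segment metrics from audio or sensor data and could replace the linear weighting with any other combination rule the claims permit.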