Subvocalization feedback systems, methods and computer readable media

The wearable system detects facial neuromuscular activity to interpret subvocalization, addressing the challenge of silent speech interpretation and enabling real-time translation and communication.

WO2026133098A1 · PCT designated stage · Publication Date: 2026-06-25 · Q (CUE) LTD +3

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
Q (CUE) LTD
Filing Date
2025-12-15
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively interpret and provide feedback on subvocalization, which occurs when individuals articulate sounds without vocal airflow, limiting applications such as real-time translation and silent communication.

Method used

A wearable system that detects facial neuromuscular activity using a coherent or non-coherent light source and a detector to analyze facial skin micromovements, processing these movements with machine learning algorithms to identify subvocalized words and provide feedback.

Benefits of technology

Enables real-time detection and interpretation of subvocalized words, facilitating applications like real-time translation and silent communication without the need for vocalization.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure IB2025062904_25062026_PF_FP_ABST
Patent Text Reader

Abstract

The present disclosure provides a wearable system for providing feedback on subvocalization. The system includes at least one processor configured to determine subvocalization data obtained via a wearable detector worn by an individual, wherein the subvocalization data corresponds to physical engagement of the individual. The at least one processor analyzes the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit. The processor then provides feedback to the individual based on the determination. The feedback may include phoneme-level or word-level guidance for improving silent speech effectiveness.

Description

SUBVOCALIZATION FEEDBACK SYSTEMS, METHODS AND COMPUTER READABLE MEDIA

CROSS REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority of U.S. Provisional Patent Application No. 63/734,454, filed on December 16, 2024, the entirety of which is incorporated herein by reference.

TECHNICAL FIELD

[0002] The present disclosure relates to systems, computer-readable media, and methods for interpreting facial neuromuscular activity and providing feedback on subvocalization.

BACKGROUND

[0003] The human brain and neural activity are complex and involve many subsystems. One of those subsystems is the facial region used by humans for communication with others. From birth, humans are trained to activate craniofacial muscles to articulate sounds. Even before full language ability evolves, babies use facial expressions, including microexpressions, to convey deeper information about themselves. After language abilities are learned, however, speech is the main technique that humans use to communicate.

[0004] The normal process of vocalized speech uses multiple groups of muscles and nerves, from the chest and abdomen, through the throat, and up through the mouth and face. To utter a given phoneme, motor neurons activate muscle groups in the face, larynx, and mouth in preparation for propulsion of air flow out of the lungs, and these muscles continue moving during speech to create words and sentences. Without this air flow, no sounds are emitted from the mouth. Silent speech occurs when the air flow from the lungs is absent, while the muscles in the face, larynx, and mouth articulate the desired sounds or move in a manner enabling interpretation.

[0005] Some of the disclosed embodiments are directed to providing a new approach for extracting meaning from neuromuscular activity, one that detects facial skin micromovements that occur during subvocalization, such as silent speech.

SUMMARY

[0006] Embodiments consistent with the present disclosure provide systems, methods, and devices for detection and usage of neuromuscular activity. Consistent with other disclosed embodiments, non-transitory computer-readable storage media may store program instructions, which are executed by at least one processing device and perform any of the methods described herein.

[0007] According to an aspect of the present disclosure, a wearable system for providing feedback on subvocalization data is provided. The wearable system includes at least one processor configured to determine subvocalization data obtained via a wearable detector worn by an individual, wherein the subvocalization data corresponds to physical engagement of the individual. The at least one processor is further configured to analyze the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit. Additionally, the at least one processor is configured to provide feedback to the individual based on the determination.
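The determine, analyze, and feed-back sequence of this aspect can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the disclosure does not say how sufficiency of physical engagement is measured, so the normalized `engagement_score`, the `SUFFICIENCY_THRESHOLD` value, and the feedback wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff; the disclosure does not define a sufficiency metric.
SUFFICIENCY_THRESHOLD = 0.6


@dataclass
class SubvocalizationSample:
    """Hypothetical container for one analysis window of detector output."""
    engagement_score: float        # assumed to be normalized to 0.0 .. 1.0
    candidate_unit: Optional[str]  # best-guess phoneme or word, if any


def is_engagement_sufficient(sample: SubvocalizationSample,
                             threshold: float = SUFFICIENCY_THRESHOLD) -> bool:
    """Decide whether a linguistic unit can be ascertained from the sample."""
    return (sample.candidate_unit is not None
            and sample.engagement_score >= threshold)


def build_feedback(sample: SubvocalizationSample,
                   threshold: float = SUFFICIENCY_THRESHOLD) -> str:
    """Return a feedback message based on the sufficiency determination."""
    if is_engagement_sufficient(sample, threshold):
        return f"Recognized '{sample.candidate_unit}'."
    return "Articulation too weak to recognize; engage the facial muscles more fully."
```

In this sketch a sample such as `SubvocalizationSample(0.8, "hello")` yields the recognition message, while a low score or a missing candidate yields the corrective message.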

[0008] According to another aspect of the present disclosure, a wearable system for silent speech self-presentation is provided. The wearable system includes at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of an individual and generate signals of a first signal type associated with the detected neuromuscular activity. For example, the wearable system may also include at least one audio sensor configured to detect sounds within an ear canal of the individual and generate signals of a second signal type associated with the detected sounds. The wearable system may further include at least one processor configured to perform several operations. During a first time period, the processor may use a first set of the first signal type to identify a first plurality of words vocalized by the individual. The processor may further use a second set of the second signal type to determine an acoustic profile indicative of a presentation characteristic associated with perception of the first plurality of words by the individual. During a second time period subsequent to the first time period, the processor may use a third set of the first signal type to identify a second plurality of words subvocalized by the individual. Based on the acoustic profile, the processor may determine a presentation manner for an audible presentation of the second plurality of words. The processor then synthesizes the second plurality of words in the determined presentation manner.
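The two time periods of this aspect can be laid out as a short Python sketch. It is only a structural illustration under stated assumptions: the word decoder is a placeholder that treats each signal as an already-decoded word label, the acoustic profile is reduced to a single hypothetical `mean_level` value, and `synthesize` returns a tagged string rather than audio. None of these names come from the disclosure.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass
class AcousticProfile:
    """Hypothetical one-number summary of how the wearer hears their own voice."""
    mean_level: float


def identify_words(neuromuscular_signals: Sequence[str]) -> List[str]:
    # Placeholder decoder (assumption): each "signal" is already a word label.
    return list(neuromuscular_signals)


def determine_acoustic_profile(ear_canal_levels: Sequence[float]) -> AcousticProfile:
    # Second signal type: sound levels captured within the ear canal.
    return AcousticProfile(mean_level=mean(ear_canal_levels))


def determine_presentation_manner(profile: AcousticProfile) -> Dict[str, float]:
    # Assumption: the presentation manner is just a playback volume.
    return {"volume": profile.mean_level}


def synthesize(words: Sequence[str], manner: Dict[str, float]) -> str:
    # Stand-in for speech synthesis: returns a tagged string instead of audio.
    return f"[volume={manner['volume']}] " + " ".join(words)


# First time period: vocalized words plus in-ear audio give the profile.
vocalized = identify_words(["hello", "world"])
profile = determine_acoustic_profile([0.25, 0.75])

# Second time period: subvocalized words are presented using that profile.
subvocalized = identify_words(["good", "morning"])
output = synthesize(subvocalized, determine_presentation_manner(profile))
```

The point of the sketch is the ordering: the profile is learned while the individual vocalizes and is reused later, when only subvocalized words are available.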

[0009] The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various disclosed embodiments. In the drawings:

[0011] Fig. 1 is a schematic illustration of a user using a first example speech detection system, consistent with some embodiments of the present disclosure.

[0012] Fig. 2A is a schematic illustration of a user using a second example speech detection system, consistent with some embodiments of the present disclosure.

[0013] Fig. 2B is a perspective view of a user using a third example speech detection system, consistent with some embodiments of the present disclosure.

[0014] Fig. 3 is a schematic illustration of a user using a fourth example speech detection system, consistent with some embodiments of the present disclosure.

[0015] Fig. 4 is a block diagram illustrating some of the components of a speech detection system and a remote processing system, consistent with some embodiments of the present disclosure.

[0016] Figs. 5A and 5B are schematic illustrations of part of the speech detection system as it detects facial skin micromovements, consistent with some embodiments of the present disclosure.

[0017] Fig. 6 is a schematic illustration of a reflection image associated with light reflections received from an area of facial region associated with a single spot, consistent with some embodiments of the present disclosure.

[0018] Fig. 7 is a block diagram of a memory consistent with the disclosed embodiments.

[0019] Fig. 8 is an illustration of two example use cases for interpreting facial skin movements from light reflections, consistent with some embodiments of the present disclosure.

[0020] Fig. 9 is an illustration of another example use case for interpreting facial skin movements from light reflections, consistent with some embodiments of the present disclosure.

[0021] Fig. 10 is a schematic illustration of an individual using a wearable system that provides feedback on subvocalization data, consistent with some embodiments of the present disclosure.

[0022] Fig. 11 is a flow diagram chart of an example process for providing subvocalization feedback in different cases, consistent with some embodiments of the present disclosure.

[0023] Fig. 12 is a schematic illustration of an individual using a wearable system that determines an appropriateness for providing subvocalization feedback, consistent with some embodiments of the present disclosure.

[0024] Fig. 13 is a flowchart of an example method for providing feedback on subvocalization data, consistent with some embodiments of the present disclosure.

[0025] Fig. 14 is a system diagram illustrating a wearable system capable of silent speech self-presentation, consistent with some embodiments of the present disclosure.

[0026] Fig. 15 is a chart depicting a process for silent speech self-presentation occurring over three time periods, consistent with some embodiments of the present disclosure.

[0027] Fig. 16 is a flowchart of a process for silent speech self-presentation, consistent with some embodiments of the present disclosure.

DETAILED DESCRIPTION

[0028] The following detailed description includes references to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or similar parts. While several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components illustrated in the drawings, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.

[0029] Various terms used in the specification and claims may be defined or summarized differently when discussed in connection with differing disclosed embodiments. It is to be understood that the definitions, summaries and explanations of terminology in each instance apply to all instances, even when not repeated, unless the transitive definition, explanation, or summary would result in inoperability of an embodiment. It is also to be understood that once a term is defined herein, in the absence of an inherent inconsistency, that definition applies to all other uses of the term herein. Moreover, the exemplary embodiments of the figures and their description are not to be considered definitions of claim terms, but rather are non-limiting examples used to illustrate specific embodiments.

[0030] Throughout, this disclosure mentions “embodiments” and “disclosed embodiments,” which refer to examples of inventive ideas, concepts, and/or manifestations described herein. Many related and unrelated embodiments are described throughout this disclosure. The fact that some “disclosed embodiments” are described as exhibiting a feature or characteristic does not mean that other disclosed embodiments necessarily share that feature or characteristic.

[0031] This disclosure employs open-ended permissive language, indicating for example, that some embodiments “may” employ, involve, or include specific features. The use of the term “may,” and other open-ended terminology, is intended to indicate that although not every embodiment may employ the specific disclosed feature, at least one embodiment employs the specific disclosed feature.

[0032] Differing embodiments of this disclosure may involve systems, methods, and/or computer readable media containing instructions. A system refers to at least two interconnected or interrelated components or parts that work together to achieve a common objective, function, or subfunction. A method refers to at least two steps, actions, or techniques to be followed to complete a task or a sub-task, to reach an objective, or to arrive at a next step. Computer-readable media containing instructions refers to any storage mechanism that contains program code instructions, for example to be executed by a computer processor. Examples of computer-readable media are further described elsewhere in this disclosure. Instructions may be written in any type of computer programming language, such as an interpretive language (e.g., scripting languages such as HTML and JavaScript), a procedural or functional language (e.g., C or Pascal that may be compiled for converting to executable code), an object-oriented programming language (e.g., Java or Python), a logical programming language (e.g., Prolog or Answer Set Programming), and/or any other programming language. Instructions executed by at least one processor may include implementing one or more program code instructions in hardware, in software (including in one or more signal processing and/or application specific integrated circuits), in firmware, or in any combination thereof, as described earlier. Causing a processor to perform operations may involve causing the processor to calculate, execute, or otherwise implement one or more arithmetic, mathematic, logic, reasoning, or inference steps.

[0033] Some disclosed embodiments may involve detecting facial skin micromovements. The term “facial skin micromovements” broadly refers to skin motions on the face that may be detectable using a sensor, but which might not be readily detectable to the naked eye. The facial skin micromovements include various types of movements, including involuntary movements caused by muscle recruitments and other types of small-scale skin deformations that fall within the range of micrometers to millimeters and fractions of a second to several seconds in duration. In some cases, the facial skin micromovements are part of a larger-scale skin movement visible to the naked eye (e.g., a smile may involve many facial skin micromovements). In other cases, the facial skin micromovements are not part of any larger-scale skin movement visible to the naked eye. While such micromovements may occur over a multi-square millimeter facial area, they may occur in a surface area of the facial skin of less than one square centimeter, less than one square millimeter, less than 0.1 square millimeter, less than 0.01 square millimeter, or an even smaller area. In some embodiments, the facial skin micromovements correspond to one or more muscle recruitments in a facial region of a head of an individual. The facial region may include specific anatomical areas, for example: a part of the cheek above the mouth, a part of the cheek below the mouth, a part of the mid-jaw, a part of the cheek below the eye, a neck, a chin, and other areas associated with specific muscle recruitments that may cause facial skin micromovements. In some embodiments, the specific muscles may be connected to skin tissue and not to any bone. In particular, the specific muscles may be in a subcutaneous tissue associated with cranial nerve V or cranial nerve VII. As is discussed herein in greater detail, first facial skin micromovement 522A and second facial skin micromovement 522B in Fig. 5A are non-limiting examples of facial skin micromovements, consistent with the present disclosure.

[0034] When specific muscles contract, the muscles pull on the facial skin and cause movements of the facial skin. Some of the movements that occur when the specific muscles contract may be micromovements. By way of example, the specific muscles that may cause facial skin micromovements in the context of the present disclosure may broadly be split into four groups: orbital, nasal, oral, and tongue. The orbital group of facial muscles contains two muscles associated with the eye socket. These muscles control the movements of the eyelids to protect the cornea from damage. They are both innervated by cranial nerve VII. The nasal group of facial muscles is associated with movements of the nose and the skin around it. There are three muscles in this group, and they are also all innervated by cranial nerve VII. The oral group is the most important group of the facial expressors and is responsible for movements of the mouth and lips. Such movements are required in singing and whistling and add emphasis to vocal communication. The oral group of muscles consists of the orbicularis oris, buccinator, and various smaller muscles. In a specific embodiment, a disclosed system may monitor facial skin micromovements that correspond to recruitment of the buccinator muscle. The buccinator muscle is located between the mandible and maxilla relatively deep compared to other muscles of the face. The tongue group of muscles consists of four intrinsic muscles (e.g., the superior longitudinal muscle, the inferior longitudinal muscle, the vertical muscle, and the transverse muscle) used to change the shape of the tongue; and four extrinsic muscles (e.g., the genioglossus, the hyoglossus, the styloglossus, and the palatoglossus) used to change the position of the tongue. Any of the tongue muscles listed above may cause movements of the tongue that may be detected by analyzing detected facial skin micromovements. As is discussed herein in greater detail, muscle fiber 520 in Figs. 5A and 5B is a non-limiting example of a facial muscle that causes micromovements of the facial skin, consistent with the present disclosure.

[0035] Consistent with the present disclosure, facial skin micromovements may be detected during subvocalization. The term “during subvocalization” refers to any speech-related activity that takes place without utterance, before utterance, or preceding an imperceptible utterance. In one embodiment, the speech-related activity may include silent speech (i.e., when air flow from the lungs is absent but the facial muscles articulate the desired sounds). In another embodiment, the speech-related activity may include speaking soundlessly (i.e., when some air flows from the lungs, but words are articulated in a manner that is not perceptible using an audio sensor). In yet another embodiment, the speech-related activity may include prevocalization muscle recruitments (i.e., subvocalization that occurs prior to an onset of vocalization is sometimes referred to herein as prevocalization). In some cases, the prevocalization facial skin micromovements may be triggered by voluntary muscle recruitments that occur when certain craniofacial muscles start to vocalize words. In other cases, the prevocalization facial skin micromovements may be triggered by involuntary facial muscle recruitments that the individual makes when certain craniofacial muscles prepare to vocalize words. By way of example, the involuntary facial muscle recruitments may occur between 0.1 seconds and 0.5 seconds before the actual vocalization. In some cases, a suggested system may use the detected facial skin micromovement that occurs during subvocalization to identify words that are about to be vocalized. Determining words that the user intends to say before they are actually vocalized may have many benefits because the system does not have to wait for the user to vocally articulate the words to start processing the words. In one example, a disclosed system may generate subtitles for live broadcasts without delays. In another example, a disclosed system may translate what the user is saying in real-time to a different language. Additionally, because the disclosed system can detect words before they are vocalized, the actual vocalization of these words is not a requirement. Thus, facial skin micromovements that occur during subvocalization may be detected in an absence of perceptible vocalization. Movement of facial skin or muscles in an absence of vocalization but which nevertheless conveys speech-related information is referred to herein as silent speech. Detecting silent speech may have various usages, including but not limited to enabling silent communication with other users, initiating a command, or enabling interaction with a virtual personal assistant. As is discussed herein in greater detail, subvocalization deciphering module 708 in Fig. 7 is a non-limiting example of a software module used for deciphering some subvocalization facial skin micromovements.
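As a small illustration of the example timing given above, the following Python sketch checks whether a detected micromovement falls in the 0.1 to 0.5 second window before vocalization onset. The function name and the use of timestamps in seconds are assumptions made here; the disclosure gives the window only by way of example and does not describe how onset is determined.

```python
# Example lead-time window from the description (involuntary recruitments
# may occur 0.1 s to 0.5 s before the actual vocalization).
PREVOCAL_LEAD_MIN_S = 0.1
PREVOCAL_LEAD_MAX_S = 0.5


def is_prevocalization(movement_time_s: float, vocal_onset_s: float) -> bool:
    """Return True if the movement precedes onset by 0.1 s to 0.5 s.

    Hypothetical helper: both arguments are timestamps in seconds on the
    same clock.
    """
    lead_s = vocal_onset_s - movement_time_s
    return PREVOCAL_LEAD_MIN_S <= lead_s <= PREVOCAL_LEAD_MAX_S
```

A movement at 9.75 s with onset at 10.0 s has a lead of 0.25 s and falls inside the window; a movement a full second before onset, or one after onset, does not.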

[0036] In some embodiments, the detection of the facial skin micromovements occurs using a speech detection system. While the shorthand “speech detection system” is employed, it is to be understood that the system may alternatively or additionally be configured to detect non-speech commands, expressions, or emotions. The system may also be used for user authentication. The speech detection system may include any device of a group of devices operatively coupled together. As used herein, the term “system” includes any device or a group of devices operatively connected together and configured to perform a function. In some embodiments, the system may include a computer (e.g., a desktop computer, a laptop computer, a server, a smart phone, a portable digital assistant (PDA), or a similar device) or plurality of computers or servers operatively connected together (e.g., using wires or wirelessly) to share information and/or data. The computer(s) may include special purpose computers (e.g., hardwired and coded to perform desired functions) or may include general purpose computers (e.g., using software to perform any desired function). In some embodiments, the system may include a cloud server. As described elsewhere in this disclosure, a cloud server may be a computer platform that provides services via a network, such as the Internet. In one embodiment, the speech detection system may include a wearable housing, a coherent light source or a non-coherent light source, a light detector, and a processor. However, the specific list of components mentioned above is not intended to limit systems covered by the present disclosure. As will be appreciated by a person skilled in the art having the benefit of this disclosure, numerous variations and/or modifications may be made to the example speech detection system. For example, not all components may be essential for the detection of facial skin micromovements in all cases. Moreover, the components may be rearranged into a variety of configurations while providing the functionality of various disclosed embodiments. In some cases, a speech detection system according to some embodiments of the disclosure does not have to be wearable, but could be aimed at the skin from a location not connected to a human body. A wearable or a non-wearable system may project coherent light towards a facial region of a user, analyze reflected light, and determine facial skin micromovements. Alternatively, in other cases, a speech detection system according to some embodiments of the disclosure does not have to include a coherent light source. Specifically, the light detector may be an ultra-high resolution image sensor (e.g., more than 120 megapixels) or any other sensor capable of facial micromovement detection, and the detection of the facial skin micromovements may be accomplished using one or more image processing algorithms. As is discussed herein in greater detail, speech detection systems 100 in Figs. 1-3 are non-limiting examples of a speech detection system, consistent with the present disclosure. As illustrated in these examples, the system includes a wearable housing 110, a light source 410, a light detector 412, and a processing device 400.

[0037] Some disclosed embodiments involve a wearable housing configured to be worn on a head of an individual. The term “wearable housing” broadly includes any structure or enclosure designed for connection to a human head, such as in a manner configured to be worn by a user. Such wearable housing may be configured to contain or support one or more electronic components or sensors. In one example, the wearable housing is configured for association with a pair of glasses. In another example, the wearable housing is associated with an earbud. The wearable housing may have a cross-section that is button-shaped, P-shaped, square, rectangular, rounded rectangular, or any other regular or irregular shape capable of being worn by a user. Such a structure may permit the wearable housing to be worn on, in, or around a body part associated with a head of the user (e.g., on the ear, in the ear, around the neck). The wearable housing may be made of plastic, metal, composite, a combination of two or more of plastic, metal and composite, or other suitable material. Consistent with disclosed embodiments, the housing may be worn on an ear. There are several ways in which the housing can be attached to the ear: In-the-ear (ITE): the housing may be inserted directly into the ear canal and held in place by the shape of the ear. Examples include earbuds and earplugs. In some cases, the housing may be custom-made to fit the specific shape of an individual's ear and seated in the ear bowl. Behind-the-ear (BTE): the housing may be seated behind the ear, with a small tube that runs to the ear canal. Examples include hearing aids and Bluetooth headsets. Over-the-ear (OTE): the housing may be seated on top of the ear and held in place by a headband or other support. Examples include structures like headphones and earmuffs. Over-the-head (OTH): the housing may be held in place by a headband that goes over the top of the head. In other embodiments, the wearable housing may be attached to a secondary device such as glasses (e.g., sun or corrective vision glasses), a hat, a helmet, a visor, or any other type of head-wearable device. In some cases, the wearable housing may be attached to a secondary device using at least one adapter. Specifically, the at least one adapter may be configured to enable the individual to wear the speech detection system in two or more different ways. For example, a single adapter may enable the wearable housing to be attached to glasses and to an earbud. As is discussed herein in greater detail, wearable housings 110 in Fig. 1 and Fig. 2A are non-limiting examples of a wearable housing, consistent with the present disclosure.

[0038] Some embodiments involve a coherent light source configured to project light towards a facial region of the user. Other embodiments involve a non-coherent light source configured to project light towards a facial region of the user. As used herein, the term “light source” broadly refers to any device configured to emit light. The term “coherent light” includes light that is highly ordered and exhibits a high degree of spatial and temporal coherence. This may occur, for example, when the light waves are in phase with each other and have a uniform frequency and wavelength, resulting in a beam of light that is highly directional and has restricted outward spread as it travels. Alternatively, coherent light may include a scenario when light waves have a constant phase difference. In some examples, coherent light may be produced by a coherent light source, such as lasers and other types of light sources that have a narrow spectral range and a high degree of monochromaticity (i.e., the light consists of a single wavelength). In contrast, incoherent light may be produced by a non-coherent light source such as incandescent bulbs and natural sunlight, which have a broad spectral range and a low degree of monochromaticity.

[0039] By way of example, coherent light may include many waves of the same frequency, having different phases and amplitudes, not necessarily at the same time and locations. To control the interference, light phase information may be required to be recognized in advance. In one embodiment, the coherent light source may be a laser such as a solid-state laser, a laser diode, a high-power laser, a quantum-cascade laser (QCL), or an alternative light source such as a light emitting diode (LED)-based light source. In addition, the coherent light source may emit light in differing formats, such as light pulses, continuous wave (CW), quasi-CW, and so on. For example, one type of light source that may be used is a vertical-cavity surface-emitting laser (VCSEL). Another type of light source that may be used is an external cavity diode laser (ECDL). In some examples, the light source may include a laser diode configured to emit light at a wavelength between about 650 nm and 1150 nm. Alternatively, the coherent light source may include a laser diode configured to emit light at a wavelength between about 800 nm and about 1020 nm, between about 850 nm and about 950 nm, or between about 1300 nm and about 1700 nm. Unless indicated otherwise, the terms “about” and “substantially the same,” with regard to a numeric value, may include a variance of up to 5% with respect to the stated value. As is discussed herein in greater detail, light source 410 in Fig. 4 and in Figs. 5A and 5B is a non-limiting example of a light source, consistent with the present disclosure. In the context of this disclosure, it should be recognized that the use of a coherent light source is intended as a non-limiting example implementation in the context of speech detection systems, methods, and computer readable media. Many of the embodiments described herein may be practiced with coherent light or non-coherent light, and the reference to either, by way of example, is not intended to be limiting. For example, even when not explicitly stated, the described and claimed speech detection systems, methods, and computer program products may be configured to measure non-coherent light reflections for detecting facial skin micromovements.
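The example wavelength ranges and the 5% reading of “about” can be made concrete with a short Python sketch. It assumes, purely for illustration, that the 5% variance widens each endpoint outward (lower bound reduced by 5%, upper bound increased by 5%) and that it applies to both endpoints of every listed range; the helper names are hypothetical.

```python
from typing import List, Tuple

# "About" may include a variance of up to 5% of the stated value.
ABOUT_TOLERANCE = 0.05

# Example laser-diode wavelength ranges from the description, in nanometers.
EXAMPLE_RANGES_NM: List[Tuple[float, float]] = [
    (650.0, 1150.0),
    (800.0, 1020.0),
    (850.0, 950.0),
    (1300.0, 1700.0),
]


def within_about_range(wavelength_nm: float, low_nm: float, high_nm: float,
                       tol: float = ABOUT_TOLERANCE) -> bool:
    """Check a wavelength against a range whose endpoints are read as 'about'.

    Assumption: the tolerance widens both endpoints outward.
    """
    return low_nm * (1.0 - tol) <= wavelength_nm <= high_nm * (1.0 + tol)


def matching_ranges(wavelength_nm: float) -> List[Tuple[float, float]]:
    """Return every example range that the wavelength falls within."""
    return [r for r in EXAMPLE_RANGES_NM
            if within_about_range(wavelength_nm, r[0], r[1])]
```

Under these assumptions a 940 nm emitter falls within the first three example ranges, a 1550 nm emitter falls only within the last, and 780 nm counts as “about 800 nm” at the lower end of the second range because 800 nm less 5% is 760 nm.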

[0040] Some embodiments involve at least one detector configured to receive light reflections from a facial region of the user. The term “light detector,” or simply “detector,” broadly refers to any device, element, or system capable of measuring one or more properties (e.g., power, frequency, phase, pulse timing, pulse duration, or other characteristics) of electromagnetic waves and to generate an output relating to the measured property or properties. Examples of detectors consistent with this disclosure may include: a light sensitive sensor, an imaging sensor, a phase detector, a MEMS sensor, a wavemeter, a spectrometer, a spectrophotometer, a homodyne detector, or a heterodyne detector. In some embodiments, the at least one detector may be configured to detect coherent light reflections. Additionally or alternatively, the at least one detector may be configured to detect non-coherent light reflections. The at least one detector may include a plurality of detectors constructed from a plurality of detecting elements. The at least one detector may include a light detector of different types. The at least one detector may include multiple detectors of the same type which may differ in other characteristics (e.g., sensitivity, size). Combinations of several types of detectors may be used for different reasons. Consistent with some embodiments, the at least one detector may measure any form of reflection and of scattering of light, including secondary speckle patterns, different types of specular reflections, diffuse reflections, speckle interferometry, and any other form of light scattering. In some embodiments, the at least one detector is configured to output associated reflection signals from the detected coherent light reflections. In the context of this disclosure, the term “reflection signals” broadly refers to any form of data retrieved from the at least one light detector in response to the light reflections from the facial region. The reflection signals may be any electronic representation of a property determined from the light reflections, or raw measurement signals detected by the at least one light detector. As is discussed herein in greater detail, light detector 412 in Fig. 4 and in Figs. 5A and 5B is a non-limiting example of a light detector, consistent with the present disclosure.

[0041] Some embodiments involve at least one processor configured to use the reflection signals from the detector and determine the facial skin micromovements. The term “processor” may involve any physical device or group of devices having electric circuitry that performs a logic operation on an input or inputs. For example, the at least one processor may include one or more integrated circuits (IC), including an application-specific integrated circuit (ASIC), microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field-programmable gate array (FPGA), server, virtual server, or other circuits suitable for executing instructions or performing logic operations. The instructions executed by at least one processor may, for example, be pre-loaded into a memory integrated with or embedded into the controller or may be stored in a separate memory. The memory may include a Random Access Memory (RAM), a Read-Only Memory (ROM), a hard disk, an optical disk, a magnetic medium, a flash memory, other permanent, fixed, or volatile memory, or any other mechanism capable of storing instructions. In some embodiments, the at least one processor may include more than one processor. Each processor may have a similar construction, or the processors may be of differing constructions that are electrically connected or disconnected from each other. For example, the processors may be separate circuits or integrated in a single circuit. When more than one processor is used, the processors may be configured to operate independently or collaboratively and may be colocated or located remotely from each other. The processors may be coupled electrically, magnetically, optically, acoustically, mechanically, or by other means that permit them to interact. As is discussed herein in greater detail, processing unit 112 in Fig. 1 and processing device 400 in Fig. 4 are non-limiting examples of at least one processor, consistent with the present disclosure.

[0042] In some embodiments, at least one processor may determine the facial skin micromovements by applying a light reflection analysis. The term “light reflection analysis” involves the evaluation of properties of a surface by analyzing patterns of light scattered off the surface. When light strikes a surface (e.g., the facial skin), some of it is absorbed, some is transmitted, and some is reflected. The amount and type of light that is reflected depend on the properties of the surface and the angle at which the light strikes it. In one example, when a non-coherent light source is used, the light reflection analysis may include scattering analysis which involves measuring the scattering of light from the surface (e.g., the facial skin). In another example, when a coherent light source is used, the light reflection analysis may include a speckle analysis or any pattern-based analysis. By way of example, coherent light shining onto a rough, contoured, or textured surface may be reflected or scattered in many different directions, resulting in a pattern of bright and dark areas called “speckles.” Such analysis may be performed using a computer (e.g., including a processor) to identify a speckle pattern and derive information about a surface (e.g., facial skin) represented in reflection signals received from at least one light detector. A speckle pattern may occur as the result of the interference of coherent light waves added together to give a resultant wave whose intensity varies. The detected speckle pattern or any other detected pattern may then be processed to generate reflection image data. As is discussed herein in greater detail, light reflections processing module 706 depicted in Fig. 7 is a non-limiting example of a software module used for determining facial skin micromovements by applying a light reflection analysis.
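By way of a non-limiting illustration only, the following Python sketch shows one possible pattern-based analysis: estimating how far a speckle pattern has shifted between two frames of reflection image data. The function names (`shift_frame`, `estimate_shift`), the exhaustive search over small integer shifts, and the mean-squared-difference criterion are assumptions introduced for this example and are not taken from the present disclosure.

```python
import random


def shift_frame(frame, dy, dx):
    """Return a copy of `frame` displaced by (dy, dx) pixels, wrapping at the edges."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def estimate_shift(prev, curr, max_shift=2):
    """Find the integer (dy, dx) displacement of `curr` relative to `prev`.

    Every candidate shift within +/- max_shift is scored by the mean squared
    difference over the interior of the frame; the lowest score wins.
    """
    h, w = len(prev), len(prev[0])
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err, n = 0.0, 0
            # Stay max_shift pixels away from the border so indices remain valid.
            for y in range(max_shift, h - max_shift):
                for x in range(max_shift, w - max_shift):
                    d = curr[y + dy][x + dx] - prev[y][x]
                    err += d * d
                    n += 1
            err /= n
            if err < best_err:
                best, best_err = (dy, dx), err
    return best


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical 12x12 speckle frame with random intensities.
    frame = [[random.random() for _ in range(12)] for _ in range(12)]
    moved = shift_frame(frame, 1, -2)
    print(estimate_shift(frame, moved))
```

A sequence of such frame-to-frame displacement estimates could serve as one simple representation of skin motion over time; practical systems may rely on sub-pixel or frequency-domain methods instead.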

[0043] Consistent with the present disclosure, the reflection image data may be processed by any image processing algorithms, including classic and / or artificial neural network (ANN) based algorithms such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In some examples, the reflection image data may be preprocessed by transforming the image data using a transformation function to obtain a transformed speckle image. For example, the transformed reflection image data may include one or more convolutions of the speckle image. The transformation function may include one or more image filters, such as low-pass filters, high-pass filters, band-pass filters, all-pass filters, and so forth. In some examples, the transformation function may comprise a nonlinear function. In some examples, the reflection image data may be preprocessed by smoothing at least parts of the reflection image data, for example using Gaussian convolution, using a median filter, and so forth. In some examples, the reflection image data may be preprocessed to obtain a different representation of the reflection image data. For example, reflection image data may comprise: a representation of at least part of the reflection image data in a frequency domain; a Discrete Fourier Transform of at least part of the reflection image data; a Discrete Wavelet Transform of at least part of the reflection image data; a time / frequency representation of at least part of the reflection image data; a representation of at least part of the reflection image data in a lower dimension; a lossy representation of at least part of the reflection image data; a lossless representation of at least part of the reflection image data; a time-ordered series of any of the above; or any combination of the above. In some examples, the reflection image data may be preprocessed to extract edges, and the preprocessed reflection image data may comprise information based on and / or related to the extracted edges. 
In some examples, the reflection image data may be preprocessed to extract features from the reflection image data. Some examples of such features may comprise information related to edges, corners, blobs, ridges, Scale Invariant Feature Transform (SIFT) features, temporal features, and more.
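As a hedged illustration of two of the preprocessing options mentioned above, the following Python sketch applies a 3x3 median filter and computes a naive two-dimensional Discrete Fourier Transform of a small image. The function names `median_filter3` and `dft2`, the window size, and the direct (non-FFT) evaluation are choices made for this example only; they are not described in the present disclosure.

```python
import cmath
from statistics import median


def median_filter3(img):
    """Smooth `img` with a 3x3 median filter (smaller windows at the borders)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


def dft2(img):
    """Naive 2D Discrete Fourier Transform; suitable only for tiny images."""
    h, w = len(img), len(img[0])
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]


if __name__ == "__main__":
    # Hypothetical 5x5 frame: uniform background with one outlier pixel.
    noisy = [[10] * 5 for _ in range(5)]
    noisy[2][2] = 255
    smoothed = median_filter3(noisy)
    print(smoothed[2][2])          # the isolated outlier is replaced by the background value
    spectrum = dft2([[1.0] * 4 for _ in range(4)])
    print(abs(spectrum[0][0]))     # a constant image has all of its energy in the zero-frequency term
```

The median filter is shown because it suppresses isolated outlier pixels without blurring as heavily as a mean filter; the frequency-domain representation is one of the alternative representations listed in the paragraph above.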

[0044] In some embodiments, performing light reflection analysis may include evaluating the reflection image data and / or the preprocessed reflection image data using one or more rules, functions, procedures, artificial neural networks, object detection algorithms, visual event detection algorithms, action detection algorithms, motion detection algorithms, background subtraction algorithms, inference models, and so forth. Some non-limiting examples of such inference models may include: an inference model preprogrammed manually; a classification model; a regression model; a result of training algorithms, such as machine learning algorithms and / or deep learning algorithms, on training examples, where the training examples may include examples of data instances, and in some cases, a data instance may be labeled with a corresponding desired label and / or result; and so forth. In some embodiments, performing speckle analysis may comprise analyzing pixels, voxels, point cloud, range data, etc. included in the reflection image data.

[0045] Some embodiments may involve analyzing the reflection image data to decipher speech. The process of deciphering the speech from the reflection image data may involve identifying patterns or recognizing signatures in the reflection image data. For example, known data, patterns, or signatures may be associated with certain phonemes, combinations of phonemes, words, combinations of words, or any other speech-related component. By recognizing such information in the reflection image data, speech may be deciphered. Such recognition and / or deciphering may be aided by machine learning. For example, machine learning models or algorithms may be employed to recognize and / or understand speech or commands. Some non-limiting examples of machine learning algorithms that may be used include classification algorithms, data regression algorithms, image segmentation algorithms, visual detection algorithms (such as object detectors, motion detectors, edge detectors, etc.), visual recognition algorithms (such as object recognition, etc.), speech recognition algorithms, mathematical embedding algorithms, natural language processing algorithms, support vector machines, random forests, nearest neighbors algorithms, deep learning algorithms, artificial neural network algorithms, convolutional neural network algorithms, recursive neural network algorithms, linear machine learning models, non-linear machine learning models, ensemble algorithms, and so forth. For example, a trained machine learning algorithm may include an inference model, such as a predictive model, a classification model, a regression model, a clustering model, a segmentation model, an artificial neural network (such as a deep neural network, a convolutional neural network, a recursive neural network, etc.), a random forest, a support vector machine, and so forth. In some examples, the training examples may include example inputs together with the desired outputs corresponding to the example inputs. 
Further, in some examples, training machine learning algorithms using the training examples may generate a trained machine learning algorithm, and the trained machine learning algorithm may be used to estimate outputs for inputs not included in the training examples. In some examples, engineers, scientists, processes, and machines that train machine learning algorithms may further use validation examples and / or test examples. For example, validation examples and / or test examples may include example inputs together with the desired outputs corresponding to the example inputs, a trained machine learning algorithm and / or an intermediately trained machine learning algorithm may be used to estimate outputs for the example inputs of the validation examples and / or test examples, the estimated outputs may be compared to the corresponding desired outputs, and the trained machine learning algorithm and / or the intermediately trained machine learning algorithm may be evaluated based on a result of the comparison. In some examples, a machine learning algorithm may have parameters and hyperparameters, where the hyperparameters are set manually by a person or automatically by a process external to the machine learning algorithm (such as a hyperparameter search algorithm), and the parameters of the machine learning algorithm are set by the machine learning algorithm according to the training examples. In some implementations, the hyperparameters are set according to the training examples and the validation examples, and the parameters are set according to the training examples and the selected hyperparameters.
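The division of labor between training examples, validation examples, and hyperparameters can be sketched as follows. This is an illustration under stated assumptions: a one-dimensional nearest-neighbors classifier stands in for whatever model an embodiment might use, the neighbor count `k` plays the role of the hyperparameter, and the data values and the names `knn_predict`, `accuracy`, and `select_k` are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Classify scalar feature `x` by majority vote among its k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(train, examples, k):
    """Fraction of `examples` whose desired output matches the estimated output."""
    hits = sum(knn_predict(train, x, k) == y for x, y in examples)
    return hits / len(examples)


def select_k(train, validation, candidates):
    """Hyperparameter search: keep the candidate k that scores best on the validation examples."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))


# Hypothetical labeled examples: (feature value, label). One training label is noisy.
TRAIN = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.15, "b"),
         (1.0, "b"), (1.1, "b"), (1.2, "b")]
VALIDATION = [(0.14, "a"), (0.05, "a"), (1.05, "b")]

if __name__ == "__main__":
    k = select_k(TRAIN, VALIDATION, candidates=[1, 3, 7])
    print(k, knn_predict(TRAIN, 0.12, k))
```

In this toy data set, a single neighbor is misled by the noisy label and using every training example ignores locality, so the validation examples favor the intermediate setting.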

[0046] In some examples, deciphering the speech from the reflection image data may involve a trained machine learning algorithm that is used as an inference model that when provided with an input generates an inferred output. For example, a trained machine learning algorithm may include a classification algorithm, the input may include a sample, and the inferred output may include a classification of the sample. In another example, a trained machine learning algorithm may include a regression model, the input may include a sample, and the inferred output may include an inferred value for the sample. In yet another example, a trained machine learning algorithm may include a clustering model, the input may include a sample, and the inferred output may include an assignment of the sample to at least one cluster. In an additional example, a trained machine learning algorithm may include a classification algorithm, the input may include an image, and the inferred output may include a classification of an item depicted in the image. In yet another example, a trained machine learning algorithm may include a regression model, the input may include an image, and the inferred output may include an inferred value for an item depicted in the image (such as an estimated facial skin motion, and so forth). In an additional example, a trained machine learning algorithm may include an image segmentation model, the input may include an image, and the inferred output may include a segmentation of the image. In yet another example, a trained machine learning algorithm may include an object detector, the input may include an image, and the inferred output may include one or more detected objects in the image and / or one or more locations of objects within the image. 
In some examples, the trained machine learning algorithm may include one or more formulas and / or one or more functions and / or one or more rules and / or one or more procedures, the input may be used as input to the formulas and / or functions and / or rules and / or procedures, and the inferred output may be based on the outputs of the formulas and / or functions and / or rules and / or procedures (for example, selecting one of the outputs of the formulas and / or functions and / or rules and / or procedures, using a statistical measure of the outputs of the formulas and / or functions and / or rules and / or procedures, and so forth). As is discussed herein in greater detail, reflection image 600 in Fig. 6 is a non-limiting example of a visualization of reflection image data, consistent with the present disclosure.

[0047] In some embodiments, artificial neural networks may be configured to analyze inputs and generate corresponding outputs. Some non-limiting examples of such artificial neural networks may include shallow artificial neural networks, deep artificial neural networks, feedback artificial neural networks, feed-forward artificial neural networks, autoencoder artificial neural networks, probabilistic artificial neural networks, time-delay artificial neural networks, convolutional artificial neural networks, recurrent artificial neural networks, long / short term memory artificial neural networks, and so forth. In some examples, an artificial neural network may be configured manually. For example, a structure of the artificial neural network may be selected manually, a type of an artificial neuron of the artificial neural network may be selected manually, a parameter of the artificial neural network (such as a parameter of an artificial neuron of the artificial neural network) may be selected manually, and so forth. In some examples, an artificial neural network may be configured using a machine learning algorithm. For example, a user may select hyperparameters for the artificial neural network and / or the machine learning algorithm, and the machine learning algorithm may use the hyper-parameters and training examples to determine the parameters of the artificial neural network, for example using back propagation, using gradient descent, using stochastic gradient descent, using mini-batch gradient descent, and so forth. In some examples, an artificial neural network may be created from two or more other artificial neural networks by combining the two or more other artificial neural networks into a single artificial neural network.
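To make the distinction between user-selected hyperparameters and learned parameters concrete, the following Python sketch trains a single artificial neuron with plain (full-batch) gradient descent. It is a minimal illustration only: the learning rate and epoch count are the hyperparameters, the weight `w` and bias `b` are the parameters, and the data set and the name `train_neuron` are hypothetical rather than drawn from the present disclosure.

```python
import math


def sigmoid(z):
    """Logistic activation of the neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(examples, learning_rate=0.5, epochs=200):
    """Fit weight w and bias b of one sigmoid neuron by gradient descent.

    `learning_rate` and `epochs` are hyperparameters chosen by the caller;
    `w` and `b` are parameters determined from the training examples.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in examples:
            error = sigmoid(w * x + b) - y   # gradient of cross-entropy loss w.r.t. the pre-activation
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / len(examples)
        b -= learning_rate * grad_b / len(examples)
    return w, b


if __name__ == "__main__":
    # Hypothetical training examples: (feature, desired output).
    data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
    w, b = train_neuron(data)
    print(sigmoid(w * 1.5 + b) > 0.5, sigmoid(w * -1.5 + b) < 0.5)
```

Stochastic or mini-batch variants would update `w` and `b` after each example or each small group of examples rather than after a full pass over the data.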

[0048] Disclosed embodiments may include and / or access a data structure or data. A data structure consistent with the present disclosure may include any collection of data values and relationships among them. By way of example, a data structure may contain correlations of facial micromovements with words or phonemes, and the at least one processor may perform a lookup in the data structure of particular words or phonemes associated with detected facial skin micromovements. The data may be stored linearly, horizontally, hierarchically, relationally, non-relationally, uni-dimensionally, multidimensionally, operationally, in an ordered manner, in an unordered manner, in an object-oriented manner, in a centralized manner, in a decentralized manner, in a distributed manner, in a custom manner, or in any manner enabling data access. By way of non-limiting examples, data structures may include an array, an associative array, a linked list, a binary tree, a balanced tree, a heap, a stack, a queue, a set, a hash table, a record, a tagged union, ER model, and a graph. For example, a data structure may include an XML database, an RDBMS database, an SQL database, or NoSQL alternatives for data storage / search such as, for example, MongoDB, Redis, Couchbase, Datastax Enterprise Graph, Elastic Search, Splunk, Solr, Cassandra, Amazon DynamoDB, Scylla, HBase, and Neo4J. A data structure may be a component of the disclosed system or a remote computing component (e.g., a cloud-based data structure). Data in the data structure may be stored in contiguous or non-contiguous memory. Moreover, a data structure, as used herein, does not require information to be colocated. It may be distributed across multiple servers, for example, servers that may be owned or operated by the same or different entities. Thus, the term “data structure” as used herein in the singular is inclusive of plural data structures. As is discussed herein in greater detail, data structure 124 in Fig. 1 and data structures 422 and 464 in Fig. 4 are non-limiting examples of a data structure, consistent with the present disclosure.
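The lookup described above may be pictured, purely as an assumption-laden sketch, as an associative array that maps phonemes to stored micromovement signatures, queried by nearest match. The three-element feature vectors, the table contents, the distance threshold, and the name `lookup_phoneme` are all hypothetical and serve only to illustrate the idea.

```python
import math

# Hypothetical associative array: phoneme -> stored micromovement signature.
SIGNATURES = {
    "p": (0.9, 0.1, 0.0),
    "a": (0.1, 0.8, 0.3),
    "s": (0.2, 0.1, 0.9),
}


def lookup_phoneme(features, table, max_distance=0.5):
    """Return the phoneme whose signature is closest to `features`.

    Returns None when even the closest signature is farther than
    `max_distance`, i.e. when no stored entry is a plausible match.
    """
    best = min(table, key=lambda ph: math.dist(table[ph], features))
    if math.dist(table[best], features) <= max_distance:
        return best
    return None


if __name__ == "__main__":
    print(lookup_phoneme((0.85, 0.15, 0.05), SIGNATURES))
    print(lookup_phoneme((5.0, 5.0, 5.0), SIGNATURES))
```

Returning `None` for a poor match is one way such a lookup could signal that the detected movement does not correspond to any stored entry.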

[0049] Consistent with the present disclosure, at least one processor may generate output associated with the determined facial skin micromovements. The term “generating an output” broadly refers to emitting a command, emitting data, and / or causing any type of electronic device to initiate an action. In some embodiments, the output may be sound (e.g., delivered via a speaker configured to fit in the ear of the user), and the sound may be an audible presentation of words associated with silent or prevocalized speech. In one example, the audible presentation of words may include an answer to a question that the user silently asked a virtual personal assistant. In another example, the audible presentation of words may include synthesized speech (e.g., artificial production of human speech). According to other disclosed embodiments, the output may be directed to a display (e.g., a visual display such as a computer monitor, television, mobile communications device, VR or XR glasses, or any other device that enables visual perception) and the generated output may include graphics, images, or textual presentations of words associated with prevocalized or vocalized speech (e.g., subtitles). The textual presentation of the words may be presented at the same time words are vocalized. In other embodiments, the output may be directed to a communications device associated with the user and the generated output may be any data exchanged with the communications device. The term “communications device” is intended to include all possible types of devices capable of exchanging data using a network configured to convey data. In some examples, the communications device may include a smartphone, a tablet, a smartwatch, a personal digital assistant, a desktop computer, a laptop computer, an Internet of Things (IoT) device, a dedicated terminal, a wearable communications device, and any other device that enables data communications. 
As is discussed herein in greater detail, output determination module 712 in Fig. 7 is a non-limiting example of a software module used for generating output associated with the determined facial skin micromovements.

[0050] Disclosed embodiments may involve exchanging data (e.g., textual data) using a network. The term “communications network,” or simply “network,” may include any type of physical or wireless computer networking arrangement used to exchange data. For example, a network may be the Internet, a private data network, a virtual private network using a public network, a Wi-Fi network, a LAN or WAN network, a combination of one or more of the foregoing, and / or other suitable connections that may enable information exchange among various components of the system. In some embodiments, a network may include one or more physical links used to exchange data, such as Ethernet, coaxial cables, twisted pair cables, fiber optics, or any other suitable physical medium for exchanging data. A network may also include a public switched telephone network (“PSTN”) and / or a wireless cellular network. A network may be a secured network or an unsecured network. In other embodiments, one or more components of the system may communicate directly through a dedicated communication network. Direct communications may use any suitable technologies, including, for example, BLUETOOTH™, BLUETOOTH LE™ (BLE), Wi-Fi, near-field communications (NFC), or other suitable communication methods that provide a medium for exchanging data and / or information between separate entities. As is discussed herein in greater detail, communications network 126 shown in Fig. 1, is a non-limiting example of a communications network, consistent with the present disclosure.

[0051] As used herein, a non-transitory computer-readable storage medium (or similar constructs such as a non-transitory computer-readable media) refers to any type of physical memory on which information or data readable by at least one processor can be stored. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, any other optical data storage medium, any physical medium with patterns of holes, markers, or other readable elements, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same. The terms “memory” and “computer-readable storage medium” may refer to multiple structures, such as a plurality of memories or computer-readable storage mediums located within a wearable device or at a remote location. Additionally, one or more computer-readable storage mediums can be utilized in implementing a computer-implemented method. Accordingly, the term computer-readable storage medium should be understood to include tangible items and exclude carrier waves and transient signals.

[0052] Reference is now made to Fig. 1, which illustrates a user 102 using a speech detection system consistent with some embodiments of the present disclosure. Fig. 1 is a single exemplary representation, and it is to be understood that some illustrated elements might be omitted, and others may be added within the scope of this disclosure. In the illustrated example implementation, a speech detection system 100 may be mountable on a head of user 102. Specifically, speech detection system 100 (also referred to herein simply as “the system”) may have the form and appearance of an over-the-ear clip-on headset. Alternatively, the system may be head-mountable in one of many other ways within the scope of this disclosure, including an in-ear bud, integration into or connectable to a temple of glasses, a head band, or any other mechanism capable of securing the system or a portion thereof to a human head. Speech detection system 100 may be configured to direct projected light 104 (e.g., coherent light) toward respective locations on the face of user 102, thus creating an array of light spots 106 extending over a facial region 108 of the face. Facial region 108 may have an area of at least 1 cm², at least 2 cm², at least 4 cm², at least 6 cm², or at least 8 cm². In some embodiments, the size of facial region 108 may be determined to enable sensing the motion of different parts of the facial muscles. In the depicted example, only one beam of projected light 104 is illustrated; however, it is contemplated that every spot projected towards facial region 108 may be associated with a corresponding light beam or with one or more light beams. In other embodiments, the light source may project light in a manner other than an array of spots. For example, a region of the face may be uniformly or non-uniformly illuminated.

[0053] For embodiments that are head-worn, speech detection system 100 may include a wearable housing 110 configured to be worn on a head of user 102. Wearable housing 110 may include or be associated with a processing unit 112 configured to interpret facial skin micromovements; an output unit 114 configured to fit into the user’s ear and to present audible and / or vibrational output; and optical sensing unit 116 configured to project light toward a non-lip part of the face of user 102 and to detect reflections of the projected light. In the illustrated example, optical sensing unit 116 may be connected to output unit 114 by an arm 118 and thus may be held in a location in proximity to and / or facing the user’s face. According to some disclosed embodiments, optical sensing unit 116 does not contact the user’s skin at facial region 108, but rather optical sensing unit 116 may be held at a certain distance from the skin surface of facial region 108. The distance of optical sensing unit 116 from the skin surface may be at least 5 mm, at least 7.5 mm, at least 10 mm, at least 15 mm, or at least 20 mm.

[0054] Optical sensing unit 116 may be configured to receive reflections of projected light 104 from facial region 108 and to output associated reflection signals. Specifically, the reflection signals may be indicative of light patterns (e.g., secondary speckle patterns) that may arise due to reflection of the coherent light from each of spots 106 within a field of view of speech detection system 100. To cover a sufficiently large facial region 108, the detector of speech detection system 100 may have a wide field of view, for example, the field of view may have an angular width of at least 60°, at least 70°, or at least 90°. Within this field of view, speech detection system 100 may sense and process the signals reflective of light patterns in all of spots 106 or only a certain subset of spots 106. For example, processing unit 112 may select a subset of spots 106 determined to give the largest amount of useful and reliable information with respect to the relevant movements of the skin surface of user 102 and may avoid processing data from other spots 106. Additional details of the structure and operation of optical sensing unit 116 are described below with reference to Fig. 5.
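One way to picture the spot-subset selection is to rank spots by how much their reflection signal varies over time and keep only the top few. The disclosure does not specify the selection criterion; using temporal variance as a proxy for "useful and reliable information," as well as the name `select_spots` and the sample values, are assumptions made solely for this sketch.

```python
from statistics import pvariance


def select_spots(signals, k):
    """Return the ids of the k spots whose signals vary the most over time.

    `signals` maps a spot id to a list of intensity samples for that spot.
    Variance is an assumed stand-in for informativeness, not a disclosed rule.
    """
    ranked = sorted(signals, key=lambda spot: pvariance(signals[spot]), reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical per-spot intensity traces.
    traces = {
        0: [1.0, 1.0, 1.0, 1.0],   # static spot, carries no motion information
        1: [0.0, 2.0, 0.0, 2.0],   # strongly modulated spot
        2: [1.0, 1.1, 0.9, 1.0],   # weakly modulated spot
    }
    print(select_spots(traces, k=2))
```

Processing only the selected spots would reduce the amount of data handled per frame, at the cost of ignoring whatever information the discarded spots carry.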

[0055] Consistent with the present disclosure, speech detection system 100 may be capable of detecting facial skin micromovements of user 102 and extracting meaning from the detected movements, even without vocalization of speech or utterance of any other sounds by user 102. The extracted meaning may be an identification of user 102 wearing speech detection system 100, an identification of a subvocalization by a user, such as a word silently spoken by user 102, an identification of a word vocally spoken by user 102, an identification of a phoneme silently spoken by user 102, or an identification of a phoneme vocally spoken by user 102. Similarly, the extracted meaning may include an identification of a heart rate of user 102, an identification of a breathing rate of user 102, and / or other characteristics associated with verbal or non-verbal communication by user 102. In one example, speech detection system 100 may generate output signals that include data associated with identification information, a III command, a synthesized audio signal, a textual transcription, or any combination thereof. In one example, the synthesized audio signal may be played back to user 102 via a speaker in output unit 114. This playback may be useful in giving user 102 feedback with respect to the speech output.

[0056] Consistent with the present disclosure, speech detection system 100 may exchange data (e.g., output signals) with a variety of communications devices associated with users, for example, a mobile communications device 120 or a server 122. The term “communications device” is intended to include all possible types of devices capable of exchanging data using a digital communications network, an analog communication network, or any other communications network configured to convey data. In some examples, the communications device may include a wearable communications device, such as a smartphone, a tablet, a smartwatch, a personal digital assistant, a laptop computer, an IoT device, a dedicated terminal, industrial machinery, a vehicle, a smart house, an appliance, or any other electronic device capable of exchanging information or data with another electronic device. In other examples, the communications device may include a non-wearable communications device, such as a desktop computer, a smart home hub, a router, a server, or any other network-connected equipment. In some cases, a processing device of mobile communications device 120 or server 122 may supplement or replace some functions of processing unit 112 of speech detection system 100. In some embodiments, the output signals generated by speech detection system 100 may be transmitted via a communication link to mobile communications device 120 or to a cloud server. The term “cloud server” refers to a computer platform that provides services via a network, such as the Internet. In the example embodiment illustrated in Fig. 1, a server 122 may use one or more virtual machines that may not correspond to individual pieces of hardware. For example, computational and / or storage capabilities may be implemented by allocating appropriate portions of desirable computation / storage power from a scalable repository, such as a data center or a distributed computing environment. 
In one example configuration, server 122 may be a cloud server that determines neural activity of user 102 based on facial skin micromovements. In one example, server 122 may implement the methods described herein using customized hard-wired logic, one or more Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), firmware, and / or program logic which, in combination with the computer system, cause server 122 to be a special-purpose machine.

[0057] In some embodiments, server 122 may access data structure 124 to determine, for example, correlations between words and a plurality of facial movements. Data structure 124 may utilize a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, nonremovable, other type of storage device or tangible or non-transitory computer-readable medium, or any medium or mechanism for storing information. Data structure 124 may be part of server 122 or separate from server 122, as shown. When data structure 124 is not part of server 122, server 122 may exchange data with data structure 124 via a communication link. Data structure 124 may include one or more memory devices that store data and instructions used to perform one or more features of the disclosed methods. In one embodiment, data structure 124 may include any of a plurality of suitable data structures, ranging from small data structures hosted on a workstation to large data structures distributed among data centers. Data structure 124 may also include any combination of one or more data structures controlled by memory controller devices (e.g., servers) or software. Consistent with the present disclosure, speech detection system 100 may communicate with mobile communications device 120 or server 122 using a communications network 126 as defined above.

[0058] Reference is now made to Fig. 2A, which illustrates another example implementation of speech detection system 100, in accordance with the present disclosure. In this example, wearable housing 110 may be integrated with or otherwise attached to a pair of glasses 200 having a frame 202. In this example implementation, glasses 200 may include nasal electrodes 204 and temporal electrodes 206 attached to frame 202 and contacting the user’s skin surface. Electrodes 204 and 206 may receive body surface electromyogram (sEMG) signals, which provide additional information regarding the activation of the user’s facial muscles. Speech detection system 100 may use the electrical activity sensed by electrodes 204 and 206 together with the output of optical sensing unit 116 in generating, for example, the synthesized audio signals. Additionally or alternatively, speech detection system 100 may include one or more additional optical sensing units 208, similar to optical sensing unit 116, for sensing skin movements in other areas of the user’s face, such as eye movement. These additional optical sensing units may be used together with or instead of optical sensing unit 116. In the illustrated example, optical sensing unit 116 may illuminate a first facial region 108A and optical sensing unit 208 may illuminate a second facial region 108B. First facial region 108A and second facial region 108B may be nonoverlapping.

[0059] In some disclosed embodiments, the speech detection system may be incorporated with, integrated with, or otherwise attached to an extended reality appliance. As used herein, the term “extended reality appliance” may include any type of device or system that enables a user to perceive and / or interact with an extended reality environment. The term “extended reality environment” refers to all types of real-and-virtual combined environments and human-machine interactions at least partially generated by computer technology. One non-limiting example of an extended reality environment may be a Virtual Reality (VR) environment. A virtual reality environment may be an immersive simulated non-physical environment which provides the user with the perception of being present in the virtual environment. Another non-limiting example of an extended reality environment may be an Augmented Reality (AR) environment. An augmented reality environment may involve live direct or indirect views of a physical real-world environment enhanced with virtual computer-generated perceptual information, such as virtual objects with which the user may interact. Another non-limiting example of an extended reality environment is a Mixed Reality (MR) environment. A mixed reality environment may be a hybrid of physical real-world and virtual environments, in which physical and virtual objects may coexist and interact in real time. Examples of the extended reality appliance may include VR headsets, AR headsets, MR headsets, smart glasses, and wearable projection devices.

[0060] Reference is now made to Fig. 2B, illustrating another example implementation of speech detection system 100, in accordance with some embodiments of the present disclosure. In the depicted example, speech detection system 100 may be part of an extended reality appliance 250. Extended reality appliance 250 may include all the sensors discussed above with reference to glasses 200 and more. For example, extended reality appliance 250 may include one or more of a gyroscope, an accelerometer, a magnetometer, an image sensor, a depth sensor, an infrared sensor, a proximity sensor, and / or any other sensor configured to measure one or more properties associated with the individual wearing extended reality appliance 250 and to generate an output relating to the measured property or properties. In some cases, speech detection system 100 may use the input from any one of the sensors of extended reality appliance 250 to determine the vocalized or subvocalized words that user 102 articulated. The term “determining” may refer to establishing or arriving at a conclusive outcome as a result of a reasoned, learned, calculated or logical process. For example, speech detection system 100 may use input from an image sensor of extended reality appliance 250 together with data from optical sensing unit 116 (see Fig. 1) to extract the meaning of facial movements. In other cases, extended reality appliance 250 may generate output that includes a visual and / or audible presentation associated with the words detected by speech detection system 100. For example, user 102 may interact with extended reality appliance 250 using silent commands.

[0061] Reference is now made to Fig. 3, which illustrates another example implementation of speech detection system 100, in accordance with the present disclosure. In the implementation illustrated in Fig. 3, speech detection system 100 may be integrated with mobile communications device 120. Specifically, mobile communications device 120 may include a light detector configured to detect reflections 300 of light from facial region 108. In this example, the light projected to facial region 108 originates from a non-wearable light source 302 that may be a coherent light source or non-coherent light source. In some configurations, non-wearable light source 302 may be included in mobile communications device 120. Alternatively, non-wearable light source 302 may be separated from mobile communications device 120.

[0062] Consistent with the present disclosure, and as depicted in Fig. 3, the pattern of the light projected to facial region 108 may be a single spot 106 large enough to illuminate different portions of facial region 108. For example, spot 106 may include a first portion 304A associated with a first facial muscle and a second portion 304B associated with a second facial muscle. Thereafter, a processing device of mobile communications device 120 may apply a light reflection analysis on received reflections 300 to determine facial skin micromovements. In particular, the processing device of mobile communications device 120 may determine first facial skin micromovements of first portion 304A and second facial skin micromovements of second portion 304B. The processing device may use both the first facial skin micromovements and the second facial skin micromovements to extract meaning (e.g., to determine speech or a command, or to authenticate user 102) and to generate output. The example implementation of speech detection system 100 illustrated in Fig. 3 may be used when the extracted meaning includes a continuous authentication of user 102. Specifically, speech detection system 100 may provide an authentication service that uses biometrics of facial micromovements for continuous authentication during usage of mobile communications device 120.

[0063] Fig. 4 is a block diagram of an exemplary configuration of speech detection system 100 and an exemplary configuration of remote processing system 450. It is to be noted that Fig. 4 is a representation of just one embodiment, and it is to be understood that some illustrated elements might be omitted and others added within the scope of this disclosure. In the depicted embodiment, speech detection system 100 comprises processing unit 112 that includes a processing device 400 and a memory device 402; output unit 114 that includes a speaker 404, a light indicator 406, and a haptic feedback device 408; optical sensing unit 116 that includes at least one light source 410 and at least one light detector 412; an audio sensor 414, a power source 416, one or more additional sensors 418, network interface 420, and data structure 422. Speech detection system 100 may directly or indirectly access a bus 424 (or any other communication mechanism) that interconnects the above-mentioned subsystems and components for transferring information and commands within speech detection system 100. Some of the subsystems and components listed above are referred to herein in the singular but in alternative configurations may be plural. For example, in some configurations speech detection system 100 may include multiple light sources 410 or multiple light detectors 412.

[0064] Processing device 400, shown in Fig. 4, may constitute any physical device or group of devices having electric circuitry that performs a logic operation on an input or inputs. The instructions executed by at least one processor may, for example, be pre-loaded into a memory integrated with or embedded into processing device 400, or may be stored in a separate memory (e.g., memory device 402 or data structure 422). As described above, the processing device may include more than one processor. Each processor may have a similar construction, or the processors may be of differing constructions that are electrically connected or disconnected from each other. For example, the processors may be separate circuits or integrated in a single circuit. When more than one processor is used, the processors may be configured to operate independently or collaboratively and may be colocated or located remotely from each other. The processors may be coupled electrically, magnetically, optically, acoustically, mechanically, or by other means that permit them to interact. Consistent with the present disclosure, at least some of the functionalities described below regarding processing device 400 may be executed by a processing device of remote processing system 450.

[0065] Memory device 402, shown in Fig. 4, may include high-speed random-access memory and / or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and / or flash memory (e.g., NAND, NOR). Consistent with the present disclosure, the components of memory device 402 may be distributed in more than one unit of speech detection system 100 and / or in more than one memory device. In particular, memory device 402 may be used to store a software product and / or data stored on a non-transitory computer-readable medium. As described above, the terms “memory” and “computer-readable storage medium” may refer to multiple structures, such as a plurality of memories or computer-readable storage mediums located within speech detection system 100 or at a remote location (e.g., at remote processing system 450). Additionally, one or more computer-readable storage mediums can be utilized in implementing a computer-implemented method. Examples of software modules stored in memory device 402 are described below with reference to Fig. 7.

[0066] Output unit 114, shown in Fig. 4, may cause output from a variety of output devices, such as speaker 404, light indicator 406, and haptic feedback device 408. Examples of speaker 404 may include or may be incorporated with a loudspeaker, earbuds, audio headphones, a hearing aid type device, a bone conduction headphone, and any other device capable of converting an electrical audio signal into a corresponding sound. In some embodiments, speaker 404 may be configured to allow only user 102 to hear the generated audio signals. Alternatively, speaker 404 may be configured to emit sound into the open air for anyone nearby to hear. Light indicator 406 may include one or more light sources, for example, an LED array associated with different colors. Light indicator 406 may be used to indicate the battery status of speech detection system 100 or to indicate its operational mode. Haptic feedback device 408 may include a vibrating motor, linear actuator, vibrational transducer, or any other force feedback device that provides tactile or haptic cues or can convert an electrical signal into corresponding vibrations or force applications.

[0067] Optical sensing unit 116, shown in Fig. 4, may include light source 410 and light detector 412. Light source 410 may project coherent light or non-coherent light to facial region 108. As discussed above, light source 410 may be a laser, such as a solid-state laser, a laser diode, or a high-power laser, or an alternative light source such as a light emitting diode (LED)-based light source. In addition, light source 410 may emit light in differing formats, such as light pulses, continuous wave (CW), quasi-CW, and so on. In one embodiment, light source 410 may be an infrared laser diode configured to emit an input beam of coherent radiation. Light source 410 may be associated with a beam-splitting element, such as a Dammann grating or another suitable type of diffractive optical element (DOE), for splitting an input beam into multiple output beams, which form respective spots 106 at a matrix of locations extending over facial region 108. In another embodiment (not shown in the figures), light source 410 may include multiple laser diodes or other emitters, which generate respective groups of the output beams, covering different respective sub-areas within facial region 108. In one embodiment, processing unit 112 may select and actuate only a subset of the emitters, without actuating all the emitters. For example, to reduce the power consumption of speech detection system 100, processing unit 112 may actuate only one emitter or a subset consisting of two or more emitters that illuminates a specific area on the user’s face that has been found to give the most useful information for generating the desired speech output.
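By way of non-limiting illustration only, the selective actuation of emitters described above may be sketched as follows. The emitter identifiers, the per-emitter usefulness scores, and the function name select_emitters are hypothetical and are introduced solely for this example; the disclosure does not specify how the usefulness of an illuminated area is scored.

```python
from typing import Dict, List


def select_emitters(usefulness: Dict[int, float], max_active: int) -> List[int]:
    """Return the IDs of the emitters to actuate, most useful first.

    `usefulness` maps a hypothetical emitter ID to a hypothetical score for
    how informative its illuminated sub-area has been. How that score is
    obtained is an assumption and is outside the scope of this sketch.
    """
    if max_active < 1:
        raise ValueError("at least one emitter must remain active")
    # Sort by descending score; break ties by ascending emitter ID.
    ranked = sorted(usefulness, key=lambda eid: (-usefulness[eid], eid))
    return ranked[:max_active]


# Example: four emitters, of which only the two most useful are actuated.
scores = {0: 0.12, 1: 0.81, 2: 0.47, 3: 0.78}
active = select_emitters(scores, max_active=2)
print(active)
```

In such a sketch, lowering max_active trades coverage of facial region 108 for reduced power consumption.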

[0068] Light detector 412, shown in Fig. 4, may be used to detect reflections from facial region 108 indicative of facial skin movements. As discussed above, a light detector may be capable of measuring properties of coherent or non-coherent light, such as power, frequency, phase, pulse timing, pulse duration, and other properties. In some embodiments, light detector 412 may include an array of detecting elements, for example, a set of charge-coupled device (CCD) sensors and / or a set of complementary metal-oxide semiconductor (CMOS) sensors, with objective optics for imaging facial region 108 onto the array. Due to the small dimensions of optical sensing unit 116 and its proximity to the skin surface, light detector 412 may have a sufficiently wide field of view to detect many of spots 106 at a high angle of at least 60°, at least 70°, or at least 90°. Light detector 412 may be configured to generate an output relating to the measured properties of the detected light. Consistent with the present disclosure, the output of light detector 412 may include any form of data determined in response to the received light reflections from facial region 108. In some embodiments, the output may include reflection signals that include an electronic representation of one or more properties determined from the coherent or non-coherent light reflections. In other embodiments, the output may include raw measurements detected by at least one light detector 412.

[0069] In some embodiments, light detector 412 may measure one or more optical attributes associated with skin changes. The term “skin changes” refers to any detectable movements, alterations, or modifications that occur in the skin. Such skin changes may include changes in the epidermis (i.e., the outermost layer of the skin), changes in the dermis (i.e., the middle layer of the skin), changes in the hypodermis (i.e., the deepest layer of the skin), and changes in deeper muscle tissues. The optical attributes may be measured without contacting the skin of user 102. Examples of one or more optical attributes of the reflected light that may be measured by light detector 412 may include intensity, frequency, reflection, angle, sharpness, bidirectional reflectance distribution function, color, brightness, glossiness, transparency, opacity, surface texture, surface relief, surface movement, and other optical attributes derivable from analysis of light reflections. The output of light detector 412 may be used to determine information associated with skin changes. In some embodiments, the information associated with those skin changes may be derived from changes in a distance from the skin to the detector as the skin moves, and in other embodiments the changes may not be derived from variations in the distance of the skin from light detector 412. For example, the speed or angular speed of the changes of the facial skin may be determined by detecting the changes of non-distance measurements (e.g., image sharpness) over time. Thus, in one non-limiting example, optical attributes may be detected from random intensity variations observed when coherent light interacts with a rough or scattering surface, such as human skin. In another non-limiting example, optical attributes may be detected based on the interference of light waves, such as when interference patterns are used to measure the phase difference or amplitude changes between two or more optical paths.
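By way of non-limiting illustration only, a non-distance measurement tracked over time may be sketched as follows. The sketch uses speckle contrast (the standard deviation of pixel intensity divided by the mean intensity), a commonly used statistic of random intensity variations; its use here is an assumption made for the example and is not stated in the disclosure as the measurement performed by light detector 412. The frame layout and the function names are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence

# A frame is a hypothetical 2D grid of pixel intensities from the detector.
Frame = Sequence[Sequence[float]]


def speckle_contrast(frame: Frame) -> float:
    """Speckle contrast K = sigma / mean over all pixel intensities."""
    pixels = [p for row in frame for p in row]
    mu = mean(pixels)
    if mu == 0:
        return 0.0  # a fully dark frame carries no contrast information
    return pstdev(pixels) / mu


def contrast_changes(frames: Sequence[Frame]) -> List[float]:
    """Frame-to-frame change in speckle contrast, one value per transition."""
    ks = [speckle_contrast(f) for f in frames]
    return [later - earlier for earlier, later in zip(ks, ks[1:])]


uniform = [[1.0, 1.0], [1.0, 1.0]]   # no intensity variation
speckled = [[0.0, 2.0], [0.0, 2.0]]  # strong intensity variation
print(contrast_changes([uniform, speckled]))
```

The sequence returned by contrast_changes is one possible time series from which a rate of skin change could be estimated without measuring distance.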

[0070] In some embodiments, optical sensing unit 116 may not require reference to parameters of the light source, such as the light source’s wavelength, intensity, or coherence, and may not require a reference beam (typically used with a beam-splitter) to measure the one or more optical attributes of the reflected light. For example, optical sensing unit 116 may use a single beam to illuminate the skin and then process the light reflections returned to light detector 412. While some speech detection systems may include a single pixel sensor (e.g., a photo diode), in other embodiments, light detector 412 may include one or more multi-pixel sensors (e.g., each sensor includes more than 4 megapixels or more than 10 megapixels) that enable producing an image providing spatial information beyond a single point. For example, a reflection image depicted in Fig. 6 may be produced from the output of light detector 412. As described throughout the disclosure, output of light detector 412 may be analyzed using image processing methods to determine patterns of light scattered off a surface. For example, features of secondary speckles may be determined.

[0071] In some non-limiting examples, optical sensing unit 116 may use a diffractive element to split the outbound beam into multiple beams and may not rely on superposition of coherent light waves to cause interference. In some non-limiting examples, optical sensing unit 116 may be arranged such that light detector 412 may be positioned along a different optical axis from light source 410. In other non-limiting examples, aligning the light source and the sensor along the same optical axis may be used for maintaining coherence, achieving path length matching, ensuring spatial overlap, and preserving the sensitivity and accuracy of the interference patterns. However, since some implementations of light detector 412 detect a reflection image and not a distance to a point, optical sensing unit 116 may include a first optical axis for outbound light and a second optical axis, not aligned with the first optical axis, for inbound light. In some embodiments, light detector 412 is configured to measure both sub-microbic speed and depth changes in the range of 5-500 microns. In alternative embodiments, light detector 412 is configured to measure changes that are less than a micron. All the examples provided in this paragraph are alternatives and may be implemented in the many alternative embodiments provided herein, depending on the specifics of implementation.

[0072] Audio sensor 414, shown in Fig. 4, may include one or more audio sensors configured to capture audio by converting sounds to digital information. Some examples of audio sensors may include microphones, unidirectional microphones, bidirectional microphones, cardioid microphones, omnidirectional microphones, onboard microphones, wired microphones, wireless microphones, or any combination of the above. Audio sensor 414 may be configured to capture sounds uttered by user 102, thereby enabling user 102 to use speech detection system 100 as a conventional headphone when desired. Additionally or alternatively, audio sensor 414 may be used in conjunction with the silent speech sensing capabilities of speech detection system 100. In one embodiment, the audio signals output by audio sensor 414 can be used in changing the operational state of speech detection system 100. For example, processing unit 112 may generate the speech output only when audio sensor 414 does not detect vocalization of words by user 102. In another embodiment, audio sensor 414 may be used in a calibration procedure, in which optical sensing unit 116 detects micromovements of the skin while user 102 utters certain phonemes or words. Processing unit 112 may compare the reflection signals output by light detector 412 to the sounds sensed by audio sensor 414 to calibrate optical sensing unit 116. This calibration may include prompting user 102 to shift the position of optical sensing unit 116 to align the optical components in the desired position relative to facial region 108. In yet another embodiment, audio sensor 414 enables on-the-fly training of a neural network of speech detection system 100. For example, speech detection system 100 may be configured to correlate facial skin micromovements with words using audio signals concurrently captured with the micromovements. After recognizing recorded words, speech detection system 100 can perform a look-back to identify facial micromovements that preceded articulation of those words, thereby training speech detection system 100. In a similar way, speech detection systems can be used to train on expressions, commands, user recognition, and emotions.
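By way of non-limiting illustration only, the look-back described above may be sketched as pairing each recognized word with the micromovement samples captured in a window immediately preceding the onset of that word. The Sample type, the millisecond timestamps, the window length, and the function name look_back are hypothetical; the disclosure does not specify how far back the look-back extends or how the resulting pairs are supplied to a neural network.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    t_ms: int         # capture time in milliseconds (hypothetical clock)
    features: tuple   # hypothetical micromovement feature vector


def look_back(
    samples: List[Sample],
    words: List[Tuple[str, int]],
    window_ms: int,
) -> List[Tuple[str, List[tuple]]]:
    """Pair each (word, onset_ms) with the features captured just before it."""
    pairs = []
    for word, onset_ms in words:
        preceding = [
            s.features
            for s in samples
            if onset_ms - window_ms <= s.t_ms < onset_ms
        ]
        if preceding:  # skip words with no buffered samples in the window
            pairs.append((word, preceding))
    return pairs


buffer = [Sample(t, (t / 1000.0,)) for t in (0, 100, 200, 300)]
training_pairs = look_back(buffer, [("hello", 250)], window_ms=200)
print(training_pairs)
```

Each resulting pair is one candidate labeled example; how such examples are used for training is left open in this sketch.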

[0073] Power source 416, shown in Fig. 4, may provide electrical energy to power speech detection system 100. A power source may include any device or system that can store, dispense, or convey electric power, including, but not limited to, one or more batteries (e.g., a lead-acid battery, a lithium-ion battery, a nickel-metal hydride battery, a nickel-cadmium battery), one or more capacitors, one or more connections to external power sources, one or more power converters, or any combination of the foregoing. With reference to the example illustrated in Fig. 4, power source 416 may be mobile, which means that speech detection system 100 can be wearable. The mobility of the power source enables user 102 to use speech detection system 100 in a variety of situations. In other embodiments, power source 416 may be associated with a connection to an external power source (such as an electrical power grid) that may be used to charge power source 416.

[0074] Additional sensors 418, shown in Fig. 4, may include a variety of sensors, for example, image sensors, motion sensors, environmental sensors, Electromyography (EMG) sensors, resistive sensors, ultrasonic sensors, proximity sensors, biometric sensors, or other sensing devices configured to facilitate related functionalities. For example, speech detection system 100 may include one or more image sensors configured to capture visual information from the environment of user 102 by converting light (not emitted from light source 410) to image data. Consistent with the present disclosure, an image sensor may be included in any device or system capable of detecting and converting optical signals in the near-infrared, infrared, visible, and / or ultraviolet spectrums into electrical signals. Examples of image sensors may include digital cameras, semiconductor charge-coupled devices (CCDs), active pixel sensors in complementary metal-oxide semiconductor (CMOS), or N-type metal-oxide-semiconductor (NMOS, Live MOS). The electrical signals may be used to generate image data. Consistent with the present disclosure, the image data may include pixel data streams, digital images, digital video streams, data derived from captured images, and data that may be used to construct one or more 3D images, a sequence of 3D images, 3D videos, or a virtual 3D representation. The image data acquired by the one or more image sensors may be transmitted by wired or wireless transmission to processing unit 112 or to remote processing system 450.

[0075] Speech detection system 100 may also include one or more motion sensors configured to measure motion of user 102. Specifically, a motion sensor may perform at least one of the following: detect motion of user 102, measure the velocity of user 102, measure the acceleration of user 102, or measure any other action that involves movement. In some embodiments, the motion sensor may include one or more accelerometers configured to detect changes in acceleration (e.g., proper acceleration) and / or to measure acceleration of speech detection system 100. In some embodiments, the motion sensor may include one or more gyroscopes configured to detect changes in the orientation of speech detection system 100 and / or to measure information related to the orientation of speech detection system 100. In some embodiments, the motion sensor may include one or more image sensors, LIDAR sensors, radar sensors, or proximity sensors. For example, by analyzing captured images, processing device 400 may determine the motion of speech detection system 100, for example, using ego-motion algorithms. In addition, the processing device may determine the motion of objects in the environment of speech detection system 100, for example, through object tracking.

[0076] Speech detection system 100 may also include one or more environmental sensors of different types configured to capture data reflective of the environment of user 102. In some embodiments, the environmental sensor may include one or more chemical sensors configured to perform at least one of the following: measure chemical properties in the environment of user 102, measure changes in the chemical properties in the environment of user 102, detect the presence of chemicals in the environment of user 102, and / or measure the concentration of chemicals in the environment of user 102. Examples of measurable chemical properties include pH level, toxicity, and temperature. Examples of chemicals or phenomena that may be measured include electrolytes, particular enzymes, particular hormones, particular proteins, smoke, carbon dioxide, carbon monoxide, oxygen, ozone, hydrogen, and hydrogen sulfide. In other embodiments, the environmental sensor may include one or more temperature sensors configured to detect changes in the temperature of the environment of user 102 and / or to measure the temperature of the environment of user 102. In other embodiments, the environmental sensor may include one or more barometers configured to detect changes in the atmospheric pressure in the environment of user 102 and / or to measure the atmospheric pressure in the environment of user 102. In other embodiments, the environmental sensor may include one or more light sensors configured to detect changes in the ambient light in the environment of user 102.

[0077] Network interface 420, shown in Fig. 4, may provide two-way data communications to a network, such as communications network 126. In one embodiment, network interface 420 may include an Integrated Services Digital Network (ISDN) card, cellular modem, satellite modem, or a modem to provide a data communication connection over the Internet. As another example, network interface 420 may include a Wireless Local Area Network (WLAN) card. In another embodiment, network interface 420 may include an Ethernet port connected to radio frequency receivers and transmitters and / or optical (e.g., infrared) receivers and transmitters. The specific design and implementation of network interface 420 may depend on the communications network or networks over which speech detection system 100 is intended to operate. For example, in some embodiments, speech detection system 100 may include network interface 420 designed to operate over a GSM network, a GPRS network, an EDGE network, a Wi-Fi or WiMax network, and a Bluetooth network. In any such implementation, network interface 420 may be configured to send and receive electrical, electromagnetic, or optical signals that carry digital data streams or digital signals representing various types of information.

[0078] Data structure 422, shown in Fig. 4, may include any hardware, software, firmware, or combination thereof for storing and facilitating the retrieval of information from a database. The term “database” may be understood to include a collection of data that may be distributed or non-distributed. A database may include a database management system that controls the organization, storage and retrieval of data contained within the database. As described above, the data included in the database may be stored linearly, horizontally, hierarchically, relationally, non-relationally, uni-dimensionally, multidimensionally, operationally, in an ordered manner, in an unordered manner, in an object-oriented manner, in a centralized manner, in a decentralized manner, in a distributed manner, in a custom manner, or in any manner enabling data access. In disclosed embodiments, data structure 422 may include correlations of facial micromovements with words, commands, emotions, expressions, and / or biological conditions. The at least one processor may perform a lookup in the data structure to thereby interpret the detected facial skin micromovements. In accordance with one embodiment, at least some of the data stored in data structure 422 may alternatively or additionally be stored in remote processing system 450.
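By way of non-limiting illustration only, a lookup of detected facial skin micromovements against stored correlations may be sketched as a nearest-reference search. The feature vectors, the use of Euclidean distance, the rejection threshold, and the function name lookup are assumptions made for this example; the disclosure does not specify the form in which correlations are stored or the matching technique used.

```python
import math
from typing import Dict, Optional, Sequence


def lookup(
    correlations: Dict[str, Sequence[float]],
    observed: Sequence[float],
    max_distance: float,
) -> Optional[str]:
    """Return the label whose stored reference is closest to `observed`.

    Returns None when even the closest reference is farther away than
    `max_distance`, so that an unfamiliar movement is not forced to a label.
    """
    best_label, best_dist = None, math.inf
    for label, reference in correlations.items():
        d = math.dist(reference, observed)  # Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_distance else None


# Hypothetical two-dimensional reference signatures for two words.
table = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
print(lookup(table, [0.9, 0.1], max_distance=0.5))
print(lookup(table, [5.0, 5.0], max_distance=0.5))
```

The same table could hold entries for commands, emotions, or expressions by adding further labeled references.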

[0079] Consistent with the present disclosure, speech detection system 100 may be configured to communicate with a remote processing system 450 (e.g., mobile communications device 120 or server 122). Remote processing system 450 may directly or indirectly access a bus 452 (or other communication mechanism) interconnecting subsystems and components for transferring information within remote processing system 450. For example, bus 452 may interconnect a memory interface 454, a network interface 456, a power source 458, a processing device 460, one or more additional sensors 462, a data structure 464, and memory device 466.

[0080] Memory interface 454, shown in Fig. 4, may be used to access a software product and / or data stored on a non-transitory computer-readable medium or on other memory devices, such as memory devices 402, 466, data structure 422, or data structure 464. Memory device 466 may contain software modules to execute processes consistent with the present disclosure. In some embodiments, memory device 466 may include a shared memory module 472, a node registration module 473, a load balancing module 474, one or more computational nodes 475, an internal communication module 476, an external communication module 477, and a database access module (not shown). Modules 472-477 may contain software instructions for execution by at least one processor (e.g., processing device 460) associated with remote processing system 450. Shared memory module 472, node registration module 473, load balancing module 474, computational node 475, and external communication module 477 may cooperate to perform various operations.

[0081] Shared memory module 472 may allow information sharing between remote processing system 450 and other devices related to one or more speech detection systems 100. In some embodiments, shared memory module 472 may be configured to enable processing device 460 to access, retrieve, and store data. For example, using shared memory module 472, processing device 460 may perform at least one of: executing software programs stored on memory devices 402, 466, data structure 422, or data structure 464; storing information in memory devices 402, 466, data structure 422, or data structure 464; or retrieving information from memory devices 402, 466, data structure 422, or data structure 464.

[0082] Node registration module 473 may be configured to track the availability of one or more computational nodes 475. In some examples, node registration module 473 may be implemented as: a software program, such as a software program executed by one or more computational nodes 475, a hardware solution, or a combined software and hardware solution. In some implementations, node registration module 473 may communicate with one or more computational nodes 475, for example, using internal communication module 476. In some examples, one or more computational nodes 475 may notify node registration module 473 of their status, for example, by sending messages: at startup, at shutdown, at constant intervals, at selected times, in response to queries received from node registration module 473, or at any other determined times. In some examples, node registration module 473 may query about the status of one or more computational nodes 475, for example, by sending messages: at startup, at constant intervals, at selected times, or at any other determined times.
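By way of non-limiting illustration only, the tracking of node availability from status messages may be sketched as follows. The class name NodeRegistry, the timeout rule for treating a silent node as unavailable, and the explicit timestamps are hypothetical choices made for this example and are not stated in the disclosure.

```python
from typing import Dict, List


class NodeRegistry:
    """Tracks which computational nodes have recently reported their status."""

    def __init__(self, timeout_s: float) -> None:
        # A node that has been silent longer than timeout_s is treated as
        # unavailable (an assumption of this sketch).
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def notify(self, node_id: str, now: float) -> None:
        """Record a status message (startup, periodic, or query response)."""
        self._last_seen[node_id] = now

    def shutdown(self, node_id: str) -> None:
        """Record a shutdown message by forgetting the node."""
        self._last_seen.pop(node_id, None)

    def available(self, now: float) -> List[str]:
        """Return the IDs of nodes heard from within the timeout, sorted."""
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen <= self.timeout_s
        )


registry = NodeRegistry(timeout_s=10.0)
registry.notify("node-a", now=0.0)   # startup message
registry.notify("node-b", now=5.0)   # periodic message
print(registry.available(now=12.0))  # node-a has been silent too long
```

Passing the current time explicitly keeps the sketch deterministic; a deployed module would more likely read a clock.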

[0083] Load balancing module 474 may be configured to divide the workload among one or more computational nodes 475. In some examples, load balancing module 474 may be implemented as a software program, such as a software program executed by one or more of the computational nodes 475, a hardware solution, or a combined software and hardware solution. In some implementations, load balancing module 474 may interact with node registration module 473 to obtain information regarding the availability of one or more computational nodes 475. In some implementations, load balancing module 474 may communicate with one or more computational nodes 475, for example, using internal communication module 476. In some examples, one or more computational nodes 475 may notify load balancing module 474 of their status, for example, by sending messages: at startup, at shutdown, at constant intervals, at selected times, in response to queries received from load balancing module 474, or at any other determined times. In some examples, load balancing module 474 may query about the status of one or more computational nodes 475, for example, by sending messages: at startup, at constant intervals, at pre-selected times, or at any other determined times.
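By way of non-limiting illustration only, dividing a workload among available computational nodes may be sketched with a least-loaded policy, in which each task is given to the node with the smallest accumulated cost. The policy, the task cost estimates, and the function name assign_tasks are assumptions made for this example; the disclosure does not specify a balancing strategy.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def assign_tasks(
    task_costs: Sequence[float],
    nodes: Sequence[str],
) -> Dict[str, List[int]]:
    """Assign each task index to the currently least-loaded node."""
    if not nodes:
        raise ValueError("no computational nodes available")
    # Min-heap of (accumulated load, node ID); ties resolve by node ID.
    heap: List[Tuple[float, str]] = [(0.0, n) for n in nodes]
    heapq.heapify(heap)
    plan: Dict[str, List[int]] = {n: [] for n in nodes}
    for index, cost in enumerate(task_costs):
        load, node = heapq.heappop(heap)
        plan[node].append(index)
        heapq.heappush(heap, (load + cost, node))
    return plan


# Three tasks with hypothetical cost estimates, spread over two nodes.
plan = assign_tasks([3.0, 1.0, 2.0], ["node-a", "node-b"])
print(plan)
```

In this example the first task occupies one node while the two cheaper tasks share the other, so both nodes end with an equal accumulated cost.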

[0084] Internal communication module 476 may be configured to receive information from and/or to transmit information to one or more components of remote processing system 450. For example, control signals and/or synchronization signals may be sent and/or received through internal communication module 476. In one embodiment, input information for computer programs, output information of computer programs, and/or intermediate information of computer programs may be sent and/or received through internal communication module 476. In another embodiment, information received through internal communication module 476 may be stored in memory device 466 or in data structure 464. For example, information retrieved from data structure 464 may be transmitted using internal communication module 476. In another example, reference signals reflecting facial micromovements of user 102 may be stored in data structure 464 and accessed using internal communication module 476.

[0085] External communication module 477 may be configured to receive information from and/or to transmit information to one or more speech detection systems 100. For example, control signals may be sent and/or received through external communication module 477. In one embodiment, information received through external communication module 477 may be stored in memory device 466, in data structure 464, and/or any memory device in the one or more speech detection systems 100. In another embodiment, information retrieved from data structure 464 may be transmitted using external communication module 477 to speech detection system 100 or to any entity with whom user 102 communicates. For example, when user 102 communicates with a financial institution (e.g., a bank), information retrieved from data structure 464 may be transmitted to enable authentication of user 102. In another embodiment, sensor data may be transmitted and/or received using external communication module 477. Examples of such sensor data may include data received from speech detection system 100 and information captured from the environment of user 102 using one or more sensors such as additional sensors 418 and additional sensors 462.

[0086] In some embodiments, aspects of modules 472-477 may be implemented in hardware, in software (including in one or more signal processing and/or application specific integrated circuits), in firmware, or in any combination thereof, executable by one or more processors, alone, or in various combinations with each other. Specifically, modules 472-477 may be configured to interact with each other and/or other modules of speech detection system 100 to perform functions consistent with disclosed embodiments. Memory device 466 may include additional modules and instructions or fewer modules and instructions.

[0087] Network interface 456, power source 458, processing device 460, additional sensors 462, and data structure 464, shown in Fig. 4, may share similar functionality with the functionality of corresponding elements in speech detection system 100, as described above. The specific design and implementation of the above-mentioned components may vary based on the implementation of remote processing system 450. In addition, remote processing system 450 may include more or fewer components. For example, when remote processing system 450 is a mobile communications device associated with user 102 (e.g., mobile communications device 120), it may include a speaker, a microphone, and additional sensors.

[0088] The components and arrangements of speech detection system 100 and remote processing system 450 as illustrated in Fig. 4 are not intended to limit the disclosed embodiments. As will be appreciated by a person skilled in the art having the benefit of this disclosure, numerous variations and/or modifications may be made to the depicted configuration of speech detection system 100 and remote processing system 450. For example, not all components may be essential for the operation of an input unit in all cases. Any component may be located in any appropriate part of speech detection system 100 or remote processing system 450. Moreover, the components may be rearranged into a variety of configurations while providing the functionality of the disclosed embodiments. For example, some speech detection systems may not include all of the elements as shown in speech detection system 100 and in remote processing system 450. Other speech detection systems may include additional components and still fall within the scope of this disclosure.

[0089] Figs. 5A and 5B include two schematic illustrations of optical sensing unit 116 as it detects facial skin micromovements in accordance with some embodiments of the present disclosure. The two schematic illustrations show a simplified scenario before muscle recruitment and after muscle recruitment. As depicted, optical sensing unit 116 may include an illumination module 500, a detection module 502, and, optionally, audio sensor 414. As discussed above and illustrated in Fig. 5, optical sensing unit 116 may be configured not to contact the user’s skin at facial region 108, but rather may be held at a distance D from the skin surface of facial region 108. The distance D of optical sensing unit 116 from the skin surface may be at least 5 mm, at least 7.5 mm, at least 10 mm, at least 15 mm, or at least 20 mm.

[0090] In the depicted embodiment, illumination module 500 includes light source 410 (e.g., an infrared laser diode) configured to generate an input light beam 504. Illumination module 500 further includes a beam-splitting element 506, such as a Dammann grating or another suitable type of diffractive optical element (DOE), configured to split input light beam 504 into multiple output beams 508, which form respective spots 106A-106E in a pattern (e.g., a matrix of locations) extending over facial region 108. In an alternative embodiment (not shown in the figure), illumination module 500 may include multiple light sources 410, which generate respective groups of output beams 508, covering different respective subareas within facial region 108. In this alternative embodiment, processing unit 112 may select and actuate only a subset of the multiple light sources, without actuating all of them. For example, to reduce the power consumption of speech detection system 100, processing unit 112 may actuate only one light source or a group of two or more light sources that illuminate a part of facial region 108.

[0091] Detection module 502 may include light detector 412, which may include an array 510 of optical sensors (e.g., an array of CMOS image sensors) with objective optics 512 for obtaining reflections 300 of coherent light from facial region 108. Because of the small dimensions of optical sensing unit 116 and its proximity to the skin surface, detection module 502 may be configured to have a wide field of view to acquire reflections from many spots 106 at a high angle. As mentioned above, the field of view of light detector 412 may have an angular width of at least 60°, at least 70°, or at least 90°. Due to the roughness of the skin surface, the light patterns at spots 106 can be detected at these high angles, as well.

[0092] Speech detection system 100 may analyze light reflections 300 to determine facial skin micromovements resulting from recruitment of muscle fiber 520. Determining the facial skin micromovements may include determining an amount of the skin movement, determining a direction of the skin movement, and/or determining an acceleration of the skin movement. The determined facial skin micromovements may result from voluntary and/or involuntary recruitment of muscle fiber 520. Muscle fiber 520 may be part of: a zygomaticus muscle, an orbicularis oris muscle, a risorius muscle, a genioglossus muscle, or a levator labii superioris alaeque nasi muscle. Processing device 400 may be configured to perform a first speckle analysis on light reflected from a first region of the face in proximity to spot 106A to determine that the first region moved by a distance d1, i.e., first facial skin micromovement 522A; and perform a second speckle analysis on light reflected from a second region of the face in proximity to spot 106E to determine that the second region moved by a distance d2, i.e., second facial skin micromovement 522B. Thereafter, processing device 400 may use the determined movements of the first region and the second region to ascertain at least one spoken word. Consistent with disclosed embodiments, distances d1 and d2 may be less than 1000 micrometers, less than 100 micrometers, less than 10 micrometers, or less.
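By way of a non-limiting illustration, one simple form of pattern-based analysis is to search for the integer pixel shift that best aligns two successive reflection images of the same spot. The disclosure does not specify the speckle analysis used; the block-matching approach, the function name `estimate_shift`, and the `max_shift` search range below are assumptions, and converting a pixel shift into a skin displacement such as d1 or d2 would require a device-specific calibration that is not shown.

```python
def estimate_shift(ref, cur, max_shift=3):
    """Return (dy, dx) such that cur[y + dy][x + dx] best matches ref[y][x].

    ref and cur are equally sized 2-D lists of pixel intensities.
    """
    h, w = len(ref), len(ref[0])
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err, n = 0.0, 0
            # Compare only the region where both images overlap.
            for y in range(max(0, -dy), min(h, h - dy)):
                for x in range(max(0, -dx), min(w, w - dx)):
                    diff = cur[y + dy][x + dx] - ref[y][x]
                    err += diff * diff
                    n += 1
            if n and err / n < best_err:
                best, best_err = (dy, dx), err / n
    return best
```

Running the same routine on the image regions around two different spots would yield two independent shift estimates, one per region.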

[0093] Fig. 6 is a schematic illustration of a reflection image 600 associated with light reflections 300 received from an area of facial region 108 associated with a single spot 106 (e.g., spot 106A depicted in Fig. 5). In disclosed embodiments, processing device 400 may receive reflection signals indicative of coherent light reflections from facial region 108. The reflection signals may be represented by reflection image 600. Thereafter, processing device 400 may determine the facial skin micromovements by applying a light reflection analysis. When light source 410 is a coherent light source, the light reflection analysis may include a speckle analysis or any pattern-based analysis. Such analysis may be performed by processing device 400 or processing device 460 to identify a speckle pattern and derive therefrom movement of a corresponding area of facial region 108.

[0094] In the depicted example, a speckle 602 appears in reflection image 600 after recruitment of muscle fiber 520. The detected speckle or any other detected pattern may then be processed to generate reflection image data. With reference to the example discussed above, assuming reflection image 600 reflects spot 106A, the reflection image data may include data indicating that the first region moved by a distance d1. In some cases, the reflection image data may be processed by any image processing algorithms (e.g., CNN and RNN) to determine skin movements of at least two areas within facial region 108. Thereafter, processing device 400 may use one or more machine learning (ML) algorithms and artificial intelligence (AI) algorithms to decipher the reflection image data and to extract meaning from the facial skin micromovement.

[0095] As shown in Fig. 7, memory device 700 may contain software modules to execute processes consistent with the present disclosure. In particular, memory device 700 may include an illumination control module 702, a sensors communication module 704, a light reflections processing module 706, a subvocalization deciphering module 708, an artificial neural network (ANN) training module 710, an output determination module 712, and a database access module 714. The disclosed embodiments are not limited to any particular configuration of memory device 700. Further, processing device 400 and/or processing device 460 may execute the instructions stored in any of modules 702-714 included in memory device 700. It is to be understood that references in the following discussions to a processing device may refer to processing device 400 of speech detection system 100 and processing device 460 of remote processing system 450 individually or collectively. Accordingly, steps of any of the following processes associated with modules 702-714 may be performed by one or more processors associated with speech detection system 100.

[0096] Consistent with disclosed embodiments, illumination control module 702, sensors communication module 704, light reflections processing module 706, subvocalization deciphering module 708, ANN training module 710, output determination module 712, and database access module 714 may cooperate to perform various operations. For example, illumination control module 702 may determine light characteristics for illuminating facial region 108. Sensors communication module 704 may receive coherent light reflections from facial region 108 and output associated reflection signals. Light reflections processing module 706 may process the reflection signals to determine facial skin micromovements. Subvocalization deciphering module 708 and database access module 714 may cooperate to extract meaning (e.g., determine silently spoken words) from the facial skin micromovements. In some cases, ANN training module 710 may use the determined silently spoken words and the determined facial skin micromovements to train an artificial neural network. Output determination module 712 may generate a presentation of the determined words.

[0097] Illumination control module 702 may regulate the operation of light source 410 to illuminate facial region 108. In some embodiments, illumination control module 702 may determine values for characteristics of projected light 104 such as light intensity, pulse frequency, duty cycle, illumination pattern, light flux, or any other optical characteristic. In a specific embodiment, as long as user 102 is not speaking, speech detection system 100 may operate in a first illumination mode (e.g., low frame rate) to conserve battery power. While speech detection system 100 operates in this first illumination mode, it may process the images to detect at least one trigger in the reflection signals (e.g., a movement of the face) indicative of speech. When such a trigger is detected, illumination control module 702 may cause the coherent light source to operate in a second illumination mode (e.g., high frame rate) to enable detection of changes in the coherent light patterns (e.g., speckle) that occur due to silent speech. Illumination control module 702 may also be configured to change one or more characteristics of projected light 104 based on various types of triggers. The various types of triggers may be detected by analysis of data from sensors communication module 704.
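By way of a non-limiting illustration, the switching between the first and second illumination modes may be sketched as a small state machine. The frame-rate values, the `trigger_threshold` on a scalar motion score, and the rule for returning to the first mode after a number of quiet frames are all assumptions; the disclosure describes only the transition from the first mode to the second mode upon a trigger.

```python
LOW_FPS, HIGH_FPS = 30, 500  # hypothetical frame rates for the two modes


class IlluminationController:
    """Selects a frame rate from a per-frame motion score."""

    def __init__(self, trigger_threshold=0.2, idle_frames=50):
        self.trigger_threshold = trigger_threshold  # assumed trigger level
        self.idle_frames = idle_frames              # assumed fall-back delay
        self.mode = "low"
        self._quiet = 0

    def update(self, motion_score):
        """Consume one motion score and return the frame rate to use next."""
        if motion_score >= self.trigger_threshold:
            self.mode = "high"          # trigger detected: second mode
            self._quiet = 0
        elif self.mode == "high":
            self._quiet += 1
            if self._quiet >= self.idle_frames:
                self.mode = "low"       # no activity for a while: first mode
        return HIGH_FPS if self.mode == "high" else LOW_FPS
```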

[0098] Sensors communication module 704 may regulate the operation of light detector 412, audio sensor 414, and additional sensors 418 to receive captured measurements from one or more sensors, integrated with, or connected to, speech detection system 100. In one embodiment, sensors communication module 704 may use the signals received from one or more sensors to generate sensor data associated with user 102. In one example, sensors communication module 704 may receive reflection signals from light detector 412 and may generate a first data stream of reflection images from which the facial skin micromovements in the facial region may be determined. In another example, sensors communication module 704 may receive audio signals from audio sensor 414 and may generate a second data stream from which the words vocally spoken by user 102 may be determined. In another example, sensors communication module 704 may receive motion signals from a motion sensor included in additional sensors 418 and generate a third data stream from which an activity in which user 102 is engaged may be determined. Sensors communication module 704 may convey the sensor data to other software modules for processing.

[0099] Light reflections processing module 706 may process the sensor data received from sensors communication module 704 in preparation for speech deciphering. In one embodiment, light reflections processing module 706 may receive from sensors communication module 704 reflection signals, originating from light detector 412, that are indicative of coherent light reflections from facial region 108. The reflection signals may be represented by a reflection image (e.g., reflection image 600) that can be processed by at least one image processing algorithm to extract the skin motion at a set of pre-selected locations on the face of user 102. The number of locations to inspect may be an input to the image processing algorithm. In some cases, the locations on the skin that are extracted for coherent light processing may be taken from a list of points of interest. The list of points of interest specifies anatomical locations that correspond with the zygomaticus muscle, the orbicularis oris muscle, the risorius muscle, the genioglossus muscle, or the levator labii superioris alaeque nasi muscle. In plain language, the list of points of interest may include specific points in the cheek above the mouth, in the chin, in mid-jaw, in the cheek below the mouth, in the high cheek, and in the back of the cheek. Consistent with the present disclosure, the list of points of interest may be dynamically updated with more points on the face that are extracted during a training phase. The entire set of locations may be ordered in descending order such that any subset of the list (in order) minimizes the word error rate (WER) with respect to the chosen number of locations that are inspected. In another embodiment, light reflections processing module 706 may crop the raw image frames around each of the extracted coherent light spots, and the algorithm may process only the cropped images. Typically, the process of coherent light spot processing involves reducing by two orders of magnitude the size of the full-frame images (of ~1.5 MP) that are received from sensors communication module 704, with a very short exposure. Exposure may be dynamically set and adapted to be able to capture only coherent light reflections and not skin segments. The cropped images of the coherent light spots may depict coherent light patterns. In other embodiments, light reflections processing module 706 may apply an image processing algorithm on the reflection image. For example, light reflections processing module 706 may improve the images’ contrast by removing noise, using a threshold to determine black pixels, and may compute a characteristic metric of the coherent light, such as a scalar speckle energy measure (e.g., an average intensity). In addition, light reflections processing module 706 may analyze changes over time in the reflection pattern (e.g., in average speckle intensity). Alternatively, other metrics may be used, such as the detection of specific coherent light patterns. Thereafter, light reflections processing module 706 may assemble a sequence of values of the characteristic metric of the coherent light, calculated frame-by-frame, and aggregate the sequence to generate reflection image data indicative of facial skin micromovements. Light reflections processing module 706 may convey the reflection image data indicative of facial skin micromovements to other software modules for processing.
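By way of a non-limiting illustration, the cropping, thresholding, and frame-by-frame metric computation described above may be sketched as follows. The function names, the `black_level` threshold value, and the square crop of half-width `half` are assumptions; the metric shown is the average intensity of the pixels that survive the threshold, which is one example of a scalar speckle energy measure.

```python
def crop(frame, cy, cx, half):
    """Cut a (2*half+1)-pixel square around a spot centre from a 2-D list."""
    rows = frame[max(0, cy - half):cy + half + 1]
    return [row[max(0, cx - half):cx + half + 1] for row in rows]


def speckle_energy(patch, black_level=10):
    """Mean intensity of pixels above the black threshold (0.0 if none)."""
    bright = [p for row in patch for p in row if p > black_level]
    return sum(bright) / len(bright) if bright else 0.0


def metric_series(frames, spot_center, half=1, black_level=10):
    """Compute the metric for one spot in every frame, in frame order."""
    cy, cx = spot_center
    return [speckle_energy(crop(f, cy, cx, half), black_level) for f in frames]


def frame_to_frame_change(series):
    """Change of the metric between consecutive frames."""
    return [later - earlier for earlier, later in zip(series, series[1:])]
```

Processing only the cropped patch rather than the full frame is what keeps the per-frame cost small in this sketch.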

[0100] Subvocalization deciphering module 708 may use machine learning (ML) algorithms and artificial intelligence (AI) algorithms to decipher the reflection image data indicative of facial skin micromovements received from light reflections processing module 706. Consistent with the present disclosure, deciphering the reflection image data may include extracting meaning from the detected facial skin micromovements. In one embodiment, subvocalization deciphering module 708 may use a trained ANN to correlate words with the facial skin micromovements. Different types of ANNs may be used, such as a classification NN that eventually outputs words, and a sequence-to-sequence NN which outputs a sentence (word sequence). In some embodiments, during normal speech of the user, system 100 may simultaneously sample the voice of user 102 and the facial movements. Automatic speech recognition (ASR) and natural language processing (NLP) algorithms may be applied by subvocalization deciphering module 708 on the actual voice, and the outcome of these algorithms may be used for optimizing the parameters of the algorithms used by subvocalization deciphering module 708. These parameters may include the weights of the various neural networks, as well as the spatial distribution of laser beams for optimal performance. In addition, subvocalization deciphering module 708 may limit the output of the algorithms to a pre-defined word set, which may significantly increase the accuracy of word detection in cases of ambiguity, i.e., when two different words result in similar micromovements on the facial skin. The word set used can be personalized over time, adjusting the dictionary to the actual words used by the specific user, with their respective frequency and context. In addition, subvocalization deciphering module 708 may use the context of a conversation between user 102 and a callee. The context may be determined from the input of the words and sentences extraction algorithms to increase the accuracy by eliminating out-of-context options. The context of the conversation may be understood by applying ASR and NLP algorithms on the side of user 102 and on the side of the callee.
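By way of a non-limiting illustration, restricting the output to a personalized word set and weighting candidates by how often the user employs them may be sketched as follows. The function name `pick_word`, the representation of decipherer output as per-word log-likelihood scores, and the additive log-frequency prior are assumptions and are not taken from the disclosure.

```python
import math


def pick_word(scores, vocabulary_counts, prior_weight=1.0):
    """Choose the best word among those in the user's word set.

    scores: word -> log-likelihood assigned by the decipherer (assumed form).
    vocabulary_counts: word -> how often this user has used the word.
    """
    total = sum(vocabulary_counts.values())
    best_word, best_value = None, -math.inf
    for word, count in vocabulary_counts.items():
        if word not in scores or count <= 0:
            continue  # words outside the set or never used are not candidates
        value = scores[word] + prior_weight * math.log(count / total)
        if value > best_value:
            best_word, best_value = word, value
    return best_word
```

For example, two words with nearly identical scores would be separated by the usage counts, and a high-scoring word that is absent from the word set would not be returned.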

[0101] ANN training module 710 may be used to train an ANN to perform silent speech deciphering, in accordance with embodiments of the disclosure. Training an ANN such as the one that may be used by subvocalization deciphering module 708 may require several thousands of examples. To achieve this, ANN training module 710 may rely on a large group of people (e.g., a group of reference human subjects). In one example, subvocalization deciphering module 708 may perform fine adjustments to the ANN such that it is customized to user 102. In this manner, within minutes or less of wearing speech detection system 100, subvocalization deciphering module 708 may be ready for deciphering the facial skin micromovements. ANN training module 710 can be used to train two different ANN types: a classification neural network that eventually outputs words, and a sequence-to-sequence neural network which outputs a sentence (word sequence). To do so, ANN training module 710 may load training data from a memory, such as silent speech data received from light reflections processing module 706 that was gathered from multiple reference human subjects. The silent speech data may be collected from a wide variety of people (people of varying ages, genders, ethnicities, physical disabilities, etc.). It is to be noted that the number of examples required for learning and generalization may be task dependent. For word / utterance prediction (within a closed group), at least several thousands of examples may be gathered. Thereafter, ANN training module 710 may augment the image processed training data to get more artificial data for the training process. In particular, the augmented data may include image processed coherent light patterns, with some of the image processing steps described herein. The data augmentation process may include the steps of: (i) time dropout, where amplitudes at random time points are replaced by zeros; (ii) frequency dropout, where the signal is transformed into the frequency domain, and random frequency chunks are filtered out; (iii) clipping, where the maximum amplitude of the signal at random time points is clamped, which may add a saturation effect to the data; (iv) noise addition, where Gaussian noise is added to the signal; and (v) speed change, where the signal is resampled to achieve a slightly slower or slightly faster signal.
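By way of a non-limiting illustration, the five augmentation steps may be sketched on a one-dimensional signal as follows. All probabilities, widths, limits, and the noise level are assumed values, and the naive discrete Fourier transform is used only to keep the sketch self-contained; it is practical only for short signals.

```python
import cmath
import random


def time_dropout(x, rng, p=0.1):
    """(i) Replace amplitudes at random time points with zeros."""
    return [0.0 if rng.random() < p else v for v in x]


def frequency_dropout(x, rng, width=2):
    """(ii) Zero a random chunk of frequency bins, then transform back."""
    n = len(x)
    spec = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
    start = rng.randrange(1, n // 2 - width + 1)
    for k in range(start, start + width):
        spec[k] = 0        # drop bin k ...
        spec[n - k] = 0    # ... and its mirror bin, so the output stays real
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def clip(x, rng, p=0.1, limit=0.5):
    """(iii) Clamp the amplitude at random time points (saturation effect)."""
    return [max(-limit, min(limit, v)) if rng.random() < p else v for v in x]


def add_noise(x, rng, sigma=0.01):
    """(iv) Add zero-mean Gaussian noise to every sample."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def speed_change(x, rate):
    """(v) Resample by linear interpolation; rate > 1 gives a faster signal."""
    n_out = max(2, round(len(x) / rate))
    out = []
    for i in range(n_out):
        pos = i * (len(x) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        out.append(x[lo] + (x[hi] - x[lo]) * (pos - lo))
    return out
```

Passing an explicit `random.Random` instance as `rng` makes each augmented copy reproducible from a seed.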

[0102] The augmented dataset may go through a feature extraction process. In this process, ANN training module 710 may compute time-domain silent speech features. For this purpose, for example, each signal may be split into low and high frequency components, x_low and x_high, and windowed to create time frames, for example, using a frame length of 27 ms and a shift of 10 ms. For each frame, five time-domain features and nine frequency-domain features, a total of 14 features per signal, may be computed. Specifically, the time-domain features may be represented as follows: [equation not reproduced in the source text] where ZCR is the zero-crossing rate. In addition, in this example, the magnitude values used are from a 16-point short-time Fourier transform (i.e., the frequency-domain features), and all features are normalized to zero mean and unit variance.
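By way of a non-limiting illustration, the framing and feature computation may be sketched as follows. Because the expression for the time-domain features is not reproduced in the text, the five features used below (power and mean of the low band, power and rectified mean of the high band, and the zero-crossing rate of the high band) are an assumption, as are the moving-average band split and the choice to apply the 16-point transform to the first 16 samples of each frame. A 16-point transform of a real signal has nine non-redundant magnitude bins, which is consistent with the nine frequency-domain features mentioned above.

```python
import cmath


def split_bands(x, width=5):
    """Moving-average low-pass (assumed filter); high band is the residual."""
    half = width // 2
    low = []
    for i in range(len(x)):
        window = x[max(0, i - half):i + half + 1]
        low.append(sum(window) / len(window))
    return low, [a - b for a, b in zip(x, low)]


def frames(x, rate_hz, frame_ms=27, shift_ms=10):
    """Split a signal into overlapping frames of frame_ms, stepped by shift_ms."""
    size = int(rate_hz * frame_ms / 1000)
    step = int(rate_hz * shift_ms / 1000)
    return [x[i:i + size] for i in range(0, len(x) - size + 1, step)]


def zero_crossing_rate(f):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum((a < 0) != (b < 0) for a, b in zip(f, f[1:])) / (len(f) - 1)


def time_features(low, high):
    """Five assumed time-domain features for one frame."""
    n = len(low)
    return [sum(v * v for v in low) / n, sum(low) / n,
            sum(v * v for v in high) / n, sum(abs(v) for v in high) / n,
            zero_crossing_rate(high)]


def stft16_magnitudes(f):
    """Magnitudes of the 9 non-redundant bins of a 16-point DFT."""
    seg = (list(f) + [0.0] * 16)[:16]
    return [abs(sum(seg[t] * cmath.exp(-2j * cmath.pi * k * t / 16)
                    for t in range(16))) for k in range(9)]


def normalize(rows):
    """Scale every feature column to zero mean and unit variance."""
    cols = []
    for c in zip(*rows):
        mean = sum(c) / len(c)
        std = (sum((v - mean) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
        cols.append([(v - mean) / std for v in c])
    return [list(r) for r in zip(*cols)]


def extract(x, rate_hz):
    """Return one normalized 14-element feature row per frame."""
    low, high = split_bands(x)
    rows = [time_features(fl, fh) + stft16_magnitudes(fx)
            for fl, fh, fx in zip(frames(low, rate_hz), frames(high, rate_hz),
                                  frames(x, rate_hz))]
    return normalize(rows)
```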

[0103] Thereafter, ANN training module 710 may split the data into training, validation, and test sets. The training set may be the data used to train the model. Hyperparameter tuning may be done using the validation set, and final evaluation may be done using the test set. The model architecture may be task dependent. Two different examples describe training two networks for two conceptually different tasks. A first task may include signal transcription, i.e., translating silent speech to text by generating a word, a phoneme, or a letter. This first task may be addressed by using a sequence-to-sequence model. A second task may include predicting a word or an utterance, i.e., categorizing utterances uttered by users into a single category within a closed group. This second task may be addressed by using a classification model. The disclosed sequence-to-sequence model may be composed of an encoder, which may transform the input signal into high-level representations (embeddings), and a decoder, which produces linguistic outputs (i.e., characters or words) from the encoded representations. The input entering the encoder may be a sequence of feature vectors. In one example, the input may enter the first layer of the encoder, a temporal convolution layer, which may down-sample the data to achieve good performance. The model may use on the order of a hundred such convolution layers.
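By way of a non-limiting illustration, the division of the data into training, validation, and test sets may be sketched as follows. The 80/10/10 proportions, the fixed seed, and the function name `split_dataset` are assumptions; the disclosure does not state how the split is performed.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train / val / test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

When the examples come from multiple reference subjects, one design choice worth considering is to split by subject rather than by example, so that the test set measures generalization to unseen individuals; that variant is not shown here.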

[0104] In some embodiments, the outputs from the temporal convolution layer at each time step may be passed to three layers of bidirectional recurrent neural networks (RNN). ANN training module 710 may employ long short-term memory (LSTM) as units in each RNN layer. Each RNN state may be a concatenation of the state of the forward RNN with the state of the backward RNN. The decoder RNN may be initialized with the final state of the encoder RNN (concatenation of the final state of the forward encoder RNN with the first state of the backward encoder RNN). At each time step, the decoder RNN may receive as input the preceding word, encoded one-hot and embedded in a 150-dimensional space with a fully connected layer. The decoder RNN output may be projected through a matrix into the space of words or phonemes (depending on the training data). The sequence-to-sequence model may condition the next step prediction on the previous prediction. During learning, a log probability may be maximized: $\sum_{i} \log P(y_i \mid x, y_{<i})$, where $x$ is the input sequence and $y_{<i}$ is the ground truth of the previous predictions. The classification neural network may be composed of the encoder as in the sequence-to-sequence network and an additional fully connected classification layer on top of the encoder output. The output may be projected into the space of closed words and the scores may be translated into probabilities for each word in the dictionary. The results of the above entire procedure may include two types of trained ANNs, expressed in computed coefficients. The coefficients may be stored in a data structure associated with speech detection system 100 (e.g., data structure 422 and data structure 464). In day-to-day use, ANN training module 710 may receive up-to-date coefficients for the trained ANN. The first ANN task may be signal transcription, i.e., translating silent speech to text by word / phoneme / letter generation. The second ANN task may be word / utterance prediction, i.e., categorizing utterances uttered by users into a single category within a closed group.
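By way of a non-limiting illustration, translating scores into probabilities and accumulating a sequence log probability conditioned on ground-truth prefixes may be sketched as follows. The use of a softmax and the function names are assumptions; `step_scores[i]` stands in for the decoder's raw scores at step i, which in an actual model would be computed with the ground-truth prefix as input.

```python
import math


def softmax(scores):
    """Translate raw scores into probabilities that sum to one."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(step_scores, target_ids):
    """Sum over steps of the log probability of the ground-truth token.

    step_scores[i]: scores over the vocabulary at step i (assumed given).
    target_ids[i]:  index of the ground-truth token at step i.
    """
    return sum(math.log(softmax(scores)[target])
               for scores, target in zip(step_scores, target_ids))
```

Training would adjust the model coefficients so that this quantity increases over the training set; the optimization itself is not shown.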

[0105] Output determination module 712 may regulate the operation of output unit 114 and the operation of network interface 420 to generate output using speaker 404, light indicator 406, haptic feedback device 408, and/or to send data to a remote computing device. In some embodiments, the output generated by output determination module 712 may include various types of output associated with silent speech determined from detected facial skin micromovements. Specifically, output determination module 712 may synthesize vocalization of words determined from the facial skin movements by subvocalization deciphering module 708. The synthesis may emulate a voice of user 102 or emulate a voice of someone other than user 102 (e.g., a voice of a celebrity or preselected template voice). The vocalization of the words may be presented via speaker 404 or transmitted to the remote computing device via network interface 420. Alternatively, output determination module 712 may generate a textual output of the words determined from the facial skin movements by subvocalization deciphering module 708. The textual output may be transmitted to the remote computing device via network interface 420. According to another embodiment, the output generated by output determination module 712 may relate to the operation of speech detection system 100. In some cases, light indicator 406 may include a light indicator that shows the battery status of speech detection system 100. For example, the light indicator may start to blink when speech detection system 100 has a low battery. Additional examples of the types of output that may be generated by output determination module 712 are described throughout the present disclosure.

[0106] Database access module 714 may cooperate with data structures 422 and 464 to retrieve stored data. The retrieved data may include, for example, correlations between a plurality of words and a plurality of facial skin movements, correlations between a specific individual and a plurality of facial skin micromovements associated with the specific individual, and more. As described above, subvocalization deciphering module 708 may use a trained ANN to perform silent speech deciphering. The trained ANN may use data stored in data structures 422 and 464 to extract meaning from detected facial skin micromovements. Data structures 422 and 464 may include separate databases, including, for example, a vector database, raster database, tile database, viewport database, and/or a user input database. The data stored in data structures 422 and 464 may be received from modules 702-712 or other components of speech detection system 100. Moreover, the data stored in data structures 422 and 464 may be provided as input using data entry, data transfer, or data uploading.

[0107] Modules 702-714 may be implemented in software, hardware, firmware, a mix of any of those, or the like. Processing devices of speech detection system 100 and remote processing system 450 may be configured to execute the instructions of modules 702-714. In some embodiments, aspects of modules 702-714 may be implemented in hardware, in software (including in one or more signal processing and/or application specific integrated circuits), in firmware, or in any combination thereof, executable by one or more processors, alone, or in various combinations with each other. Specifically, modules 702-714 may be configured to interact with each other and/or other modules associated with speech detection system 100 to perform functions consistent with disclosed embodiments.

[0108] In accordance with one implementation, a speech detection system projects a pattern of light on facial skin (e.g., a cheek) of a user. Thereafter, the speech detection system may detect light reflections from various locations of the facial skin. Notably, reflections associated with specific areas may be more relevant for extracting meaning (e.g., determining communication) than other areas. The specific areas may be those that are located closer to particular facial muscles. Identifying the specific locations may pose challenges because each user has unique facial features, and the position of the light source and/or detector relative to the user’s face may change during every usage and even during ongoing operations. The following paragraphs describe systems, methods, and computer program products for identifying the locations of those specific areas, using the light reflections from the specific areas to extract meaning, and ignoring light reflections from other areas to conserve processing resources.

[0109] Some disclosed embodiments involve interpreting facial skin movements. The term “interpreting facial skin movements” refers to extracting meaning from detected skin movements, as described elsewhere in this disclosure. In one example, interpreting facial skin movements may include determining one or more vocalized or subvocalized words from the facial skin movements or determining a facial expression (e.g., happiness, sadness, anger, fear, surprise, disgust, contempt, or other emotion) of the individual. In another example, interpreting facial skin movements may include determining an identity of the individual. These facial skin movements may be detectable as described elsewhere in this disclosure.

[0110] Some disclosed embodiments involve projecting light on a plurality of facial region areas of an individual, wherein the plurality of areas includes at least a first area and a second area. The term “projecting” includes controlling a light source (e.g., a coherent light source) such that it emits light in a given direction (e.g., toward a portion of the face), as discussed elsewhere in this disclosure. The term “individual” includes a person who uses a speech detection system (or another person to whom the light source is projected), as described elsewhere in this disclosure. The term “facial region area” or simply “area” in the context of the face includes a portion of the face of the individual, as described elsewhere in this disclosure. For example, a facial region area may have a size of at least 1 cm², at least 2 cm², at least 4 cm², at least 6 cm², or at least 8 cm². Consistent with some disclosed embodiments, the projected light illuminates a plurality of facial region areas. For example, the plurality of areas includes 4, 8, 16, 32, or any other number of areas. In some cases, the projected light may include at least one spot, as described elsewhere in this disclosure. The at least one spot may illuminate more than one facial region area. For example, as illustrated in Fig. 3, a single spot 106 may illuminate different portions of facial region 108. For example, spot 106 may include a first portion 304A associated with a first facial muscle and a second portion 304B associated with a second facial muscle. Alternatively, a single facial region area may be illuminated by multiple light spots. Some of the plurality of areas may be spaced apart from each other while others of the plurality of areas may overlap with each other. The term “spaced apart” may refer to being non-overlapping or separated by at least some distance. Thus, spaced apart areas may refer to two or more facial region areas that do not overlap with each other and have even a very small gap in between. For example, stating that a first facial region area is spaced apart from a second facial region area may include distances between the first and second region of at least 5 mm, at least 10 mm, at least 15 mm, or any other desired distance. In some embodiments, the distance may be less than 1 mm, or between 1 mm and 5 mm. In some cases, only a portion of a facial region area may be illuminated by the projected light. In other cases, all of the facial region areas may be illuminated by the projected light. By way of example, Figs. 8 and 12 illustrate illuminating a plurality of facial region areas of an individual using a plurality of spots. As illustrated, each of areas 800A and 800B is illuminated by more than one light spot.

[0111] Some disclosed embodiments involve illuminating at least a portion of the first area and at least a portion of the second area with a common light spot. As used herein, the term “at least a portion” and / or grammatical equivalents thereof can refer to any fraction of a whole amount. For example, “at least a portion” can refer to at least about 1%, 5%, 10%, 20%, 40%, 65%, 90%, 95%, 99%, 99.9%, or 100% of a whole amount, or any other fraction. The term “common light spot” means that a single (common) light spot may cover some or all of the first area and the second area. The common light spot may illuminate at least a portion of the first area and the second area. In one example, the common light spot may illuminate 30% of the first area and 10% of the second area. In another example, the common light spot may illuminate 100% of the first area and 100% of the second area. Controlling the at least one coherent light source may include illuminating a continuous area on the face that includes the first area and the second area. By way of one example, as illustrated in Fig. 3, single light spot 106 may illuminate two or more facial areas (e.g., 304A and 304B).

[0112] Some disclosed embodiments involve illuminating the first area with a first group of spots and illuminating the second area with a second group of spots distinct from the first group of spots. The term “group of spots” refers to more than one light spot. The number of spots in the group of spots may range from two to 64 or more. For example, the group of spots may include 4 spots, 8 spots, 16 spots, 32 spots, 64 spots, or any number of spots greater than two. There may be variations in illumination characteristics between spots or within the group of spots, as discussed elsewhere in this disclosure. Illuminating an area with a group of spots may refer to illuminating some or all of a facial region area by two or more spots. In one example, the group of spots may illuminate at least 15% of the area, at least 40% of the area, or at least 70% of the area. A first area may be illuminated by a first group of spots and a second area may be illuminated by a second group of spots distinct from the first group of spots. In this context, the term “distinct” means that the first group of spots is distinguishable from the second group of spots. For example, the first group of spots may include at least one spot not included in the second group of spots. By way of example, Figs. 8 and 12 illustrate a first area 800A illuminated by a first group of spots 808A and a second area 800B illuminated by a second group of spots 808B distinct from the first group of spots.

[0113] Some disclosed embodiments involve operating a coherent light source (as described elsewhere in this disclosure) located within a wearable housing (as described elsewhere in this disclosure) in a manner enabling illumination of the plurality of facial region areas. Enabling illumination, as used herein, may refer to a process of controlling a light source to generate at least one light beam and directing the at least one light beam toward the plurality of facial region areas. For example, enabling illumination may also include utilizing a beam-splitting element (as described elsewhere in this disclosure) configured to split an input beam into multiple output beams (as described elsewhere in this disclosure) extending over a portion of a face. In an alternative embodiment, enabling illumination may include utilizing multiple light sources which generate respective groups of output beams, covering different respective sub-areas within a portion of a face. Figs. 1 and 2 illustrate an example implementation of a speech detection system (e.g., speech detection system 100) in which at least one facial region area (e.g., facial region 108) is illuminated by a plurality of light spots (e.g., light spots 106). In some embodiments, the plurality of light spots may be generated by optical sensing unit 116, which includes at least one light source 410 and at least one light detector 412 and is located in a wearable housing 110.

[0114] Some disclosed embodiments involve operating a coherent light source (as described elsewhere in this disclosure) located remote from a wearable housing (as described elsewhere in this disclosure) in a manner enabling illumination of the plurality of facial region areas (as described elsewhere in this disclosure). The term “located remote” indicates that two objects are separated from each other by a physical distance such that they do not appear physically as a unified component. For example, the coherent light source may be part of a device other than the speech detection system and located more than 1 cm from a wearable housing of the speech detection system. As another example, the coherent light source may be located more than 3 cm from a wearable housing of the speech detection system. It should be understood that the distances 1 cm and 3 cm are exemplary and nonlimiting and other distances may be used. Fig. 3 illustrates an example implementation of a speech detection system in which a plurality of facial region areas (e.g., first portion 304A of facial region 108 and second portion 304B of facial region 108) are illuminated by a coherent light source located remote from the wearable housing (e.g., a non-wearable light source 302).

[0115] In some disclosed embodiments, the first area is closer to at least one of a zygomaticus muscle or a risorius muscle than the second area. The phrase “a first area is closer to a muscle than a second area” means that a distance of the first area to a specific muscle is less than a distance of the second area to the same muscle. For example, the distances may be measured from an edge of an area to an edge of a specific muscle, from a center of an area to a center of a specific muscle, or any combination thereof. In this context, the center of a shape (i.e., the first area, the second area, or a specific muscle) may be a geometric center, which is the point that corresponds to the mean position of all the points in the shape; a circumscribed center, which is the center of the smallest circle that completely encloses the 2D shape; an incenter, which is the center of the inscribed circle that is tangent to all sides of the 2D shape; or any other reference point previously defined. As discussed, the first area is closer to at least one of a zygomaticus muscle or a risorius muscle than a second area. In other words, the disclosed embodiments capture two example use cases. The first example use case is that the first area is closer to the zygomaticus muscle than the second area. The second example use case is that the first area is closer to the risorius muscle than the second area. By way of example, Fig. 8 illustrates one implementation of the first and second example use cases. Specifically, the first use case is illustrated with regard to user 102A and the second use case is illustrated with regard to user 102B.
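The center-to-center comparison described above can be sketched as follows. This is an illustrative sketch only, assuming each area and the muscle outline are given as lists of 2D points in a shared coordinate frame; the function names (`geometric_center`, `center_distance`, `first_area_is_closer`) are hypothetical and do not appear in the disclosure.

```python
from math import hypot

Point = tuple[float, float]


def geometric_center(points: list[Point]) -> Point:
    """Mean position of all sample points of a shape (the geometric center)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def center_distance(shape_a: list[Point], shape_b: list[Point]) -> float:
    """Distance between the geometric centers of two shapes."""
    (ax, ay), (bx, by) = geometric_center(shape_a), geometric_center(shape_b)
    return hypot(ax - bx, ay - by)


def first_area_is_closer(first_area: list[Point],
                         second_area: list[Point],
                         muscle: list[Point]) -> bool:
    """True when the first area is closer to the muscle than the second area."""
    return center_distance(first_area, muscle) < center_distance(second_area, muscle)


# Hypothetical coordinates (arbitrary units), chosen only for illustration.
muscle = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
first = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
second = [(5.0, 5.0), (7.0, 5.0), (7.0, 7.0), (5.0, 7.0)]
closer = first_area_is_closer(first, second, muscle)
```

An edge-to-edge measurement, which the paragraph also permits, would replace `center_distance` with a minimum over pairs of boundary points.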

[0116] Fig. 8 illustrates two example use cases for interpreting facial skin movements. In both example use cases, a plurality of facial areas 800 of user 102 may be illuminated by at least one light source (e.g., light source 410, not shown). The depicted plurality of areas includes at least a first area 800A and a second area 800B. In the first example use case involving user 102A, first area 800A is closer to the zygomaticus muscle than second area 800B, and in the second example use case involving user 102B, first area 800A is closer to the risorius muscle than second area 800B.

[0117] Some disclosed embodiments involve receiving reflections from the plurality of areas. The term “receiving” may include obtaining, retrieving, acquiring, or otherwise gaining access to data or signals. In some cases, receiving may include reading data from memory and / or obtaining data from a computing device via a (e.g., wired and / or wireless) communications channel. In other cases, receiving may include detecting electromagnetic waves (e.g., in the visible or invisible spectrum) and generating an output relating to measured properties of the electromagnetic waves. In a first embodiment, at least one processor may receive data indicative of light reflected from the plurality of areas from at least one detector. In a second embodiment, at least one detector may receive light rays reflected from the plurality of areas. The term “reflections” refers to one or more light rays bouncing off a surface (e.g., the individual’s face) or data derived from the one or more light rays bouncing off the surface. For example, the reflections may include light detected by a light detector after it was deflected from an object. The light detected by the light detector may be generated by at least one coherent light source of the disclosed speech detection system and / or may be generated from sources other than the disclosed speech detection system. By way of one example, light detector 412 in Figs. 5A and 5B is employed to receive reflections 300 that originated from light generated by light source 410.

[0118] By way of example with reference to the two use cases depicted in Fig. 8, a reflection image 802A may represent the reflections received from the first area 800A, and reflection image 802B may represent the reflections received from the second area 800B. As illustrated, in the first example use case, reflection image 802A represents the reflections received from an area closer to the zygomaticus muscle; and in the second example use case, reflection image 802A represents the reflections received from an area closer to the risorius muscle.

[0119] Some disclosed embodiments involve detecting first facial skin movements corresponding to reflections from the first area and second facial skin movements corresponding to reflections from the second area. The term “detecting” in this context refers to the process of discovering, identifying, or determining the existence of light reflections (or signals associated therewith). In one example, a change in the position of facial skin may be detected. As discussed elsewhere in this disclosure, the detection process may involve using various techniques or technologies to determine the existence of the pattern or the event. In some cases, the process of detecting facial skin movement may involve determining whether any movement occurred and recording information representing the detected movement. For example, at least one processor may detect facial skin movements by applying a light reflection analysis on received reflections. In other cases, detecting facial skin movements may include determining times in which facial skin movements occurred. In other cases, detecting facial skin movements may include determining data representing the facial skin movements (e.g., direction, velocity, acceleration). The term “facial skin movements” broadly refers to any type of movements prompted by recruitment of underlying facial muscles. The facial skin movements include facial skin micromovements (as described elsewhere in this disclosure) and larger-scale skin movements generally visible and detectable to the naked eye without the need for magnification (e.g., a smile, a yawn, a frown). The term “the facial skin movements corresponding to reflections from a specific area” means that the detected facial skin movements took place in a specific area of the face from which reflections were received. For example, detecting first facial skin movements corresponding to reflections from the first area means that the first facial skin movements may be detected by analyzing reflections received from the first area; and detecting second facial skin movements corresponding to reflections from the second area means that the second facial skin movements may be detected by analyzing reflections received from the second area.

[0120] In some disclosed embodiments, detecting the first facial skin movements involves performing a first speckle analysis on light reflected from the first area, and detecting the second facial skin movements involves performing a second speckle analysis on light reflected from the second area. The term “performing” refers to the act of carrying out a task, activity, or function. The term “speckle analysis” may be understood as described elsewhere in this disclosure. Consistent with the present disclosure, performing a speckle analysis may include detecting a speckle pattern, or any other pattern in signals received from light reflected from a facial region area. For example, performing a speckle analysis may include identifying secondary speckle patterns that arise due to reflection of the coherent light from each area. In other embodiments, detecting facial skin movements may involve performing a pattern-based analysis or an image-based analysis in addition to, or as an alternative to, performing a speckle analysis.

[0121] Consistent with some disclosed embodiments, the first speckle analysis and the second speckle analysis are performed concurrently by the at least one processor. The term “occur concurrently” means that two or more events occur during coincident or overlapping time periods, either where one begins and ends during the duration of the other, or where a later one starts before the completion of the other. In some cases, the two or more events may be speckle analyses (or any pattern-based analysis). In order for the first speckle analysis and the second speckle analysis to occur concurrently, the at least one processor may include a plurality of processors or a multi-core processor that allows multiple speckle analyses to be executed simultaneously.
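One way to picture two analyses running during overlapping time periods is sketched below. It is a minimal sketch under stated assumptions, not the disclosed implementation: the per-area analysis is stood in for by a simple contrast ratio (standard deviation over mean of pixel intensities), and the names `speckle_contrast` and `analyze_concurrently` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def speckle_contrast(pixels: list[float]) -> float:
    """Stand-in analysis (an assumption): contrast = std / mean of intensities."""
    m = mean(pixels)
    return pstdev(pixels) / m if m else 0.0


def analyze_concurrently(first_pixels: list[float],
                         second_pixels: list[float]) -> tuple[float, float]:
    """Submit both analyses before waiting on either, so their run times overlap."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_job = pool.submit(speckle_contrast, first_pixels)
        second_job = pool.submit(speckle_contrast, second_pixels)
        return first_job.result(), second_job.result()


# Hypothetical pixel intensities for the two areas.
first_result, second_result = analyze_concurrently([1.0, 1.0, 1.0, 1.0],
                                                   [0.0, 2.0, 0.0, 2.0])
```

In CPython, threads overlap in time but do not execute CPU-bound code on several cores at once; `concurrent.futures.ProcessPoolExecutor` offers the same `submit` interface and would be the closer analogue of the multi-core case mentioned above.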

[0122] By way of example with reference to the two use cases depicted in Fig. 8, first facial skin movements 804A may correspond to reflections from the first area 800A and second facial skin movements 804B may correspond to reflections from the second area 800B. For example, in the first example use case, first facial skin movements 804A correspond to reflections received from an area closer to the zygomaticus muscle; and in the second example use case, first facial skin movements 804A correspond to reflections received from an area closer to the risorius muscle.

[0123] Some disclosed embodiments involve determining, based on differences between the first facial skin movements and the second facial skin movements, that the reflections from the first area closer to the at least one of a zygomaticus muscle or a risorius muscle are a stronger indicator of communication than the reflections from the second area. Determining refers to ascertaining. For example, from the differences between the first and second facial skin movements, the processor may determine which area is closer to the associated muscle. The differences between the first facial skin movements and the second facial skin movements may include any distinctions, variations, or dissimilarities between the first facial skin movements and the second facial skin movements. The differences between the first facial skin movements and the second facial skin movements may be determined using at least one of the following techniques: surface alignment, point-to-point comparison, surface registration, topological analysis, or any other technique for determining differences between two data sets. For example, the differences between the first facial skin movements and the second facial skin movements may include differences in the movement intensity, the movement trajectory, the movement speed, and / or various changes in topography of the facial skin. Based on the differences, the at least one processor may determine that reflections from a first area are a stronger indicator of communication than the reflections from a second area. The term “communication” refers to the process of conveying information through various mediums, such as spoken language, words, body language, gestures, or signals. For example, the communication may include verbal cues (e.g., words, phrases, and language) and non-verbal cues (e.g., body language, facial expressions, gestures, and eye contact). The term “indicator of communication” refers to a measure or sign reflective of information conveyed by the individual. For example, the statement that reflections from the first area are a stronger indicator of communication than the reflections from a second area means that it may be easier to determine that the individual intends to convey information, and what communication the individual intends to convey, from the first facial skin movements than from the second facial skin movements. For example, the reflections from the first area may be a stronger indicator of communication than the reflections from a second area because the facial skin micromovements determined from the reflections from the first area may be associated with a higher velocity, a higher displacement, or a higher value of another parameter indicating that the individual intends to convey information and / or the content of the information that the individual intends to convey. Consistent with disclosed embodiments, in the first example use case, when the first area is closer to the zygomaticus muscle, the first facial skin movements may reflect movements with a velocity on the order of one to ten µm / ms, and the second facial skin movements may reflect smaller movements, if any. In the second example use case, when the first area is closer to the risorius muscle, the first facial skin movements may reflect movements on the order of 0.5-2 mm, and the second facial skin movements may reflect smaller movements, if any.

[0124] Consistent with some disclosed embodiments, the differences between the first facial skin movements and the second facial skin movements include differences of less than 100 microns. The term “differences of less than 100 microns” means that the difference between a first parameter that represents the first facial skin movements and a second parameter that represents the second facial skin movements is less than 100 microns. In one example, the first parameter may be a magnitude of a first displacement change vector associated with the first facial skin movements and the second parameter may be a magnitude of a second displacement change vector associated with the second facial skin movements. A displacement change is a vector that quantifies the distance and direction changes between two measurements of the facial skin. For example, the differences between the first facial skin movements and the second facial skin movements include differences of less than 50 microns, less than 10 microns, or less than 1 micron. In other embodiments, the differences between the first facial skin movements and the second facial skin movements include differences of less than 1 millimeter. Accordingly, the determination that the reflections from the first area are a stronger indicator of communication than the reflections from the second area is based on the differences of less than 1 millimeter, less than 100 microns, less than 50 microns, less than 10 microns, or less than 1 micron.
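The comparison of displacement change magnitudes can be illustrated with a short sketch. It assumes, purely for illustration, that each skin measurement is a 2D position in microns; the helper names and the sample coordinates are hypothetical.

```python
from math import hypot

Position = tuple[float, float]  # assumed units: microns


def displacement_magnitude(before: Position, after: Position) -> float:
    """Magnitude of the displacement change vector between two measurements."""
    return hypot(after[0] - before[0], after[1] - before[1])


def movement_difference(first_before: Position, first_after: Position,
                        second_before: Position, second_after: Position) -> float:
    """Absolute difference between the first and second displacement magnitudes."""
    first = displacement_magnitude(first_before, first_after)
    second = displacement_magnitude(second_before, second_after)
    return abs(first - second)


# Hypothetical measurements: the first area moves 50 microns, the second 10.
diff = movement_difference((0.0, 0.0), (30.0, 40.0), (0.0, 0.0), (6.0, 8.0))
under_100_microns = diff < 100.0
under_10_microns = diff < 10.0
```

Here the difference is 40 microns, so it falls under the 100 micron bound but not under the 10 micron bound.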

[0125] Some disclosed embodiments involve, based on the determination that the reflections from the first area are a stronger indicator of communication, processing the reflections from the first area to ascertain the communication. The term “processing” refers to the act of performing operations or transformations on data or information to achieve a desired outcome. For example, processing may include manipulating, analyzing, or altering inputs in a systematic way to produce meaningful outputs. The term “processing reflections” means extracting information from signals representing the received reflections. For example, processing reflections may include actions such as filtering, amplifying, modulating, and applying light reflection analysis as described elsewhere in this disclosure. Based on the determination that the reflections from the first area are a stronger indicator of communication, the reflections from the first area are processed to ascertain the communication. The term “ascertain the communication” means determining speech or facial expressions associated with non-verbal communication from facial movements, as described elsewhere in this disclosure. Consistent with the present disclosure, the reflections from the first area may be processed to create images of speckle patterns. Even at fast exposure times, such as 10 ms, the velocity of motion of the skin may be sufficient to make the speckle pattern change during each frame so that the bright pixels are blurred and washed out. The degree of speckle blur of a given spot in a given frame, as manifested by the loss of contrast in the image, for example, may be indicative of the instantaneous velocity of motion of the skin in the small area of the cheek under the spot. Processing the reflections from the first area may also include extracting quantitative image features from the images of speckle patterns. Vectors of these features, extracted from successive image frames, may be input to a neural network in order to ascertain the communication. Details of neural network architectures and training algorithms that may be used for this purpose are described elsewhere in this disclosure. An example feature that may be extracted for the purpose of ascertaining the communication is speckle contrast. Any suitable measure of contrast may be used for this purpose, for example, the mean square value of the luminance gradient taken over the area of the speckle pattern. High contrast in the speckle pattern of a given spot from the first area may be indicative that the corresponding location of the cheek is stationary, while reduced contrast may be indicative of motion. The contrast decreases with increasing velocity of motion. Contrast features of this sort may typically be extracted from multiple spots distributed over the first area. Additionally, or alternatively, other features may be extracted from the speckle images and input to the neural network. Examples of such features may include total brightness of the speckle pattern and orientation of the speckle pattern, for instance, as computed by a Sobel filter. By way of one example, subvocalization deciphering module 708 in Fig. 7 may be used for processing the reflections from the first area to ascertain the communication.
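The contrast measure named above, the mean square value of the luminance gradient, can be sketched as follows. This is a simplified illustration that assumes a speckle image is a small 2D list of luminance values and uses forward differences for the gradient; the function names and the sample images are hypothetical, and a real pipeline would likely operate on detector frames with an array library.

```python
Image = list[list[float]]


def mean_square_gradient(image: Image) -> float:
    """Mean of gx^2 + gy^2 using forward differences (one possible contrast measure)."""
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            gx = image[r][c + 1] - image[r][c]  # horizontal luminance change
            gy = image[r + 1][c] - image[r][c]  # vertical luminance change
            total += gx * gx + gy * gy
            count += 1
    return total / count if count else 0.0


def total_brightness(image: Image) -> float:
    """Sum of all luminance values in the speckle image."""
    return sum(sum(row) for row in image)


def feature_vector(image: Image) -> list[float]:
    """Per-frame features that could be fed to a downstream network."""
    return [mean_square_gradient(image), total_brightness(image)]


# Hypothetical frames: a crisp pattern (skin at rest) and a washed-out one (skin moving).
sharp = [[0.0, 255.0, 0.0], [255.0, 0.0, 255.0], [0.0, 255.0, 0.0]]
blurred = [[120.0, 130.0, 120.0], [130.0, 120.0, 130.0], [120.0, 130.0, 120.0]]
sharp_contrast = mean_square_gradient(sharp)
blurred_contrast = mean_square_gradient(blurred)
```

The blurred frame yields a much lower value than the sharp frame, which matches the statement that contrast decreases as the velocity of motion increases.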

[0126] Consistent with some disclosed embodiments, the communication ascertained from the reflections from the first area includes words articulated by the individual. “Ascertaining words articulated by the individual” refers to understanding words that are either vocalized or subvocalized by the individual. By processing the signals resulting from reflections, words can be ascertained as discussed elsewhere herein. By way of example, the word “Hello” in Fig. 8 represents the words articulated by user 102A or user 102B that may be ascertained from the reflections from the first area.

[0127] Consistent with some disclosed embodiments, the communication ascertained from the reflections from the first area includes non-verbal cues of the individual. The term “non-verbal cues” refers to the various forms of communication that occur without the use of spoken words. Some examples of non-verbal cues may include facial expressions, body language, gestures, eye contact, tone of voice, postures, and other subtle signals that convey meaning in interpersonal interactions. For example, non-verbal cues, such as facial expressions, may be used to communicate basic emotions like happiness, sadness, anger, fear, surprise, and disgust. As discussed elsewhere in this disclosure, the at least one processor may determine a non-verbal cue by analyzing reflection signals representing facial skin micromovements in the first facial area. By way of example, the emoji in Fig. 8 represents the non-verbal cues that may be ascertained from the reflections from the first area.

[0128] Some disclosed embodiments involve, based on the determination that the reflections from the first area are a stronger indicator of communication, ignoring the reflections from the second area. In this context, the term “ignoring the reflections” means that fewer processing actions are performed on the signals representing the received reflections from the second area than on the signals representing the received reflections from the first area. In one embodiment, signals representing the received reflections from the second area may be filtered, amplified, and analyzed to determine the second facial skin movements, but some quantitative features may not be extracted because the communication may not be ascertained from signals representing the received reflections from the second area. In another embodiment which also involves “ignoring,” during a first time frame, reflections from both the first area and the second area may be processed to determine which area is closer to the zygomaticus muscle or the risorius muscle. Thereafter, during a subsequent second time frame, and upon determining that the first area is closer to the zygomaticus muscle or the risorius muscle, reflections from the second area may be automatically discarded.
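The two-time-frame embodiment can be sketched as a calibration phase followed by a phase in which one area's data is dropped. The sketch below is illustrative only: it assumes each frame carries one movement value per area and uses accumulated absolute movement as a stand-in for the closeness determination; `run_two_phase` and the sample values are hypothetical.

```python
Frame = dict[str, float]  # assumed: one movement value per area, per frame


def run_two_phase(frames: list[Frame],
                  calibration_frames: int = 2) -> tuple[str, list[float]]:
    """Process both areas during calibration, then keep only the selected area."""
    calibration = frames[:calibration_frames]
    # First time frame: both areas are processed and compared.
    totals = {
        area: sum(abs(frame[area]) for frame in calibration)
        for area in ("first", "second")
    }
    selected = max(totals, key=totals.get)
    # Second time frame: data from the other area is discarded without processing.
    kept = [frame[selected] for frame in frames[calibration_frames:]]
    return selected, kept


# Hypothetical stream of four frames.
stream = [
    {"first": 5.0, "second": 0.5},
    {"first": 6.0, "second": 0.2},
    {"first": 7.0, "second": 0.1},
    {"first": 4.0, "second": 0.3},
]
selected_area, kept_values = run_two_phase(stream)
```

Because the position of the light source relative to the face may shift during use, a practical system might repeat the calibration phase periodically; that scheduling is outside this sketch.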

[0129] According to some disclosed embodiments, ignoring the reflections from the second area includes omitting use of the reflections from the second area to ascertain the communication. The term “omitting use” refers to not using information associated with reflections from the second area when determining the meaning of the communication.

[0130] By way of example with reference to the two use cases depicted in Fig. 8, reflection image 802A may be processed to ascertain communication 806 from first facial skin movements 804A associated with the zygomaticus muscle or the risorius muscle, and reflection image 802B may be ignored, e.g., not used or omitted in ascertaining the communication. As depicted, the ascertained communication may include at least one word 806A (articulated silently or vocally by user 102A or user 102B) and / or at least one facial expression 806B that serves as an example of a non-verbal cue.

[0131] Some disclosed embodiments involve determining, based on differences between the first facial skin movements and the second facial skin movements, that the first area is closer than the second area to the subcutaneous tissue associated with cranial nerve V or with cranial nerve VII. The term “subcutaneous tissue” refers to the layer of tissue located beneath the skin and above the underlying muscles and bones. It is composed of fat cells, connective tissue, blood vessels, nerves, and other structures. Cranial nerve V, also known as the trigeminal nerve, is a sensory nerve for the face that also controls the jaw muscles. Cranial nerve VII, also known as the facial nerve, controls facial expressions and carries taste sensation from the front of the tongue. Based on differences between the first facial skin movements and the second facial skin movements (as described above), a determination may be made that the first area is closer than the second area to the subcutaneous tissue associated with cranial nerve V or with cranial nerve VII.

[0132] Some disclosed embodiments involve operating a coherent light source in a manner enabling bi-mode illumination of the plurality of facial region areas. The term “coherent light source” may be understood as described elsewhere in this disclosure. Operating a coherent light source in this context refers to regulating, supervising, instructing, allowing, and / or enabling the coherent light source to illuminate at least part of a face. For example, the coherent light source may be controlled to illuminate a region of a face in a specific mode of illumination when turned on in response to a trigger. Bi-mode illumination refers to a capability of the coherent light source to illuminate an object using at least two different modes of illumination. The term “mode of illumination” refers to a specific configuration or set of settings of the coherent light source. Each of the two modes may be associated with different values of illumination parameters, such as light intensity, illumination pattern, pulse frequency, duty cycle, and light flux. Light source 410 in Fig. 4 is one example of either a single mode or multi-mode (e.g., bi-mode) light source.

[0133] In some disclosed embodiments, a first light intensity of the first mode of illumination differs from a second light intensity of the second mode of illumination. In some disclosed embodiments, a first illumination pattern of the first mode of illumination differs from a second illumination pattern of the second mode of illumination. Light intensity refers to a brightness level of an illumination, and an illumination pattern refers to an arrangement, distribution, or sequence of coherent or non-coherent light emitted from a source or reflected off a surface. The illumination pattern may be created by a specific design, shape, or configuration of light sources to create a particular visual or non-visual effect on the portion of the face. Examples of illumination patterns may include a grid of light spots having the same size, a grid of light spots having various sizes, a single light spot, or any other pattern.

[0134] Some disclosed embodiments involve analyzing reflections associated with a first mode of illumination to identify one or more light spots associated with the first area, and analyzing reflections associated with a second mode of illumination to ascertain the communication. The term “identifying one or more light spots associated with the first area” means determining which of the light spots projected by the coherent light source are located in the first area. For example, identifying the one or more light spots associated with the first area may be implemented by comparing light intensity at a particular location with boundaries of the first area, based on image analysis of the face of the individual, or by any other processing method. In one example, the first mode of illumination may include a first illumination pattern (e.g., 64 light spots) and the second mode of illumination may include a second illumination pattern (e.g., 32 light spots). By way of example, with reference to the first example use case depicted in Fig. 8, the first mode of illumination may be used to identify eight light spots included within first area 800A associated with the zygomaticus muscle. Thereafter, the second mode of illumination (e.g., 4 light spots) may be used to illuminate first area 800A in a manner that enables ascertaining the communication from received reflections.
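As a minimal sketch of the spot-identification step described above, the following example assumes that each projected spot has a known position in image coordinates and that the first area can be approximated by a rectangular boundary. The names `Spot`, `Area`, and `spots_in_area`, the 8 x 8 grid layout, and the boundary values are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    spot_id: int
    x: float  # assumed image coordinate (arbitrary units)
    y: float


@dataclass(frozen=True)
class Area:
    # Hypothetical rectangular approximation of a facial area boundary.
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, spot: Spot) -> bool:
        return (self.x_min <= spot.x <= self.x_max
                and self.y_min <= spot.y <= self.y_max)


def spots_in_area(spots, area):
    """Return the ids of the spots whose positions fall inside the area."""
    return [s.spot_id for s in spots if area.contains(s)]


# First mode: an assumed 8 x 8 grid of 64 spots on a unit-spaced lattice.
grid = [Spot(r * 8 + c, float(c), float(r)) for r in range(8) for c in range(8)]

# Assumed boundary of the first area in the same coordinates.
first_area = Area(x_min=2.0, x_max=5.0, y_min=3.0, y_max=4.0)

selected = spots_in_area(grid, first_area)
print(len(selected), selected)
```

With these assumed values, eight spots fall inside the boundary. A second mode of illumination could then be restricted to a subset of those spots.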

[0135] Consistent with some disclosed embodiments, the first area is closer than the second area to the zygomaticus muscle, and the plurality of areas further include a third area closer to the risorius muscle than each of the first area and second area. The terms “plurality of areas” and “closer to” may be understood as described elsewhere in this disclosure. By way of example with reference to Fig. 9, the plurality of facial areas 800 includes the first area 800A closer to the zygomaticus muscle than second area 800B, and a third area 800C closer to the risorius muscle than each of the first area 800A and second area 800B. In some disclosed embodiments, based on a determination that user 102C is engaged in silent speech, a processing device of the speech detection system may process the reflections from the first area 800A to ascertain the communication, and ignore the reflections from the second area 800B and the third area 800C. In other embodiments, based on a determination that user 102C is engaged in voiced speech, a processing device of the speech detection system may process the reflections from third area 800C to ascertain the communication, and ignore the reflections from the second area 800B and the first area 800A.

[0136] Some disclosed embodiments involve analyzing reflected light from the first area when speech is generated with perceptible vocalization (i.e., voiced speech) and analyzing reflected light from the third area when speech is generated in an absence of perceptible vocalization (i.e., silent speech). In other words, rather than monitoring the entire cheek and processing reflections from a plurality of areas, the speech detection system may process reflections received from a subset of the cheek area (e.g., only a few square millimeters or centimeters) in these two areas to detect both silent and voiced speech.

[0137] Furthermore, when the plurality of areas are illuminated by multiple light sources (e.g., an array of laser diodes), only the light sources that illuminate these two areas may be actuated, thus reducing power consumption. If a large movement of the speech detection system relative to the skin is detected, a different set of light sources may be actuated.

[0138] In some disclosed embodiments, different modes of processing may be applied to ascertain silent speech and voiced speech. For example, during silent speech, the first area being closer to the zygomaticus muscle may exhibit movements with a velocity on the order of one to ten µm/ms. Therefore, features of the images of the speckles themselves may change rapidly, and these features may be analyzed to generate an output. But during voiced speech, the third area being closer to the risorius muscle may exhibit movements on the order of 0.5-2 mm. Thus, the locations of the spots on the cheek may shift laterally due to the movement of the cheek. In this case, the lateral movements of the spots may be indicative of changes in the distance of the spots from the speech detection system, which may thus function as a sort of depth sensor. The two processing modes (speckle sensing and depth sensing) may be used individually in detecting silent and voiced speech, respectively. Alternatively, or additionally, these two processing modes may be used together to improve the precision and specificity of measurement, for example, by applying measurements of voiced speech by a given user to learn the patterns of microscopic movement that will occur in silent speech by the same user.

[0139] The field of interpreting silent speech has seen significant advancements in recent years. Systems have been developed to interpret subvocalized speech, where users silently articulate words without producing audible sounds. However, a significant challenge with current systems is that users need to learn how to use them effectively. The process of mastering subvocalization techniques can be difficult, as users must learn to consistently produce subtle neuromuscular signals that the system can accurately detect and interpret. This learning curve often results in unreliable or inconsistent performance. A key issue is that users typically do not know if their physical engagement during subvocalization is strong enough for the system to understand their intended speech. Inconsistency in interpreting silent speech can lead to frustration and ineffective use of the technology. Users may find themselves repeating subvocalized commands or phrases multiple times, unsure if their attempts are being accurately interpreted by the system. The need for more robust and user-friendly silent speech systems is driven by the increasing demand for discreet communication methods in various settings, such as offices, public spaces, or assistive technologies for individuals with speech impairments. By addressing these challenges, including providing real-time feedback on subvocalization quality, the suggested system aims to enhance the accuracy, usability, and user experience of subvocalization-based systems.

[0140] Some disclosed embodiments involve a wearable system for providing feedback on subvocalization data. The term “wearable system” refers to a device or set of interconnected devices designed to be worn on or attached to the body of a user. A wearable system may include components such as sensors, processors, and output devices integrated into clothing, accessories, or standalone wearable units. Consistent with the present disclosure, a wearable system may be designed to be worn on, in, or around a body part associated with a head of the user (e.g., on the ear, in the ear, around the neck). The term “subvocalization data” refers to information associated with any speech-related activity that occurs without audible utterance, before utterance, or immediately prior to an imperceptible utterance. This data may be generated during silent speech, or prevocalization. Subvocalization data may include raw sensor outputs or processed data that has undergone initial analysis or transformation. For example, it may encompass information derived from signals (e.g., electrical, mechanical, or optical signals) produced by sensors measuring neuromuscular activity in non-lip regions of the head, including the cheeks, jawline, temples, forehead, or ear canal. In some cases, subvocalization data includes these signals in their raw form, as directly captured by the sensors. In other cases, it includes signals that have been processed, such as through noise reduction, feature extraction, or signal amplification. The neuromuscular activity underlying the subvocalization data occurs when a user of the wearable system mentally articulates words or phrases without producing audible speech, such as during mental rehearsal, silent reading, or imagined speech. The term “feedback” refers to any type of output provided in response to an input or action.
Feedback may be presented in various forms, such as visual, auditory, or haptic cues, to convey information or guide user behavior. Consistent with the present disclosure, the feedback may convey an indication of the sufficiency or quality of the subvocalization data, thereby guiding users on how to improve their subvocalization technique. For example, the wearable system may provide continuous auditory feedback in the form of a subtle background tone. As the user begins to subvocalize, the pitch or volume of this tone may change in proportion to the strength and clarity of the detected subvocalization signals. A higher pitch or increased volume could indicate stronger, more easily interpretable signals, while a lower pitch or decreased volume might suggest that the user’s subvocalization is too weak or unclear for accurate interpretation. This real-time auditory feedback allows users to adjust their subvocalization technique immediately, helping them learn how to produce more consistent and recognizable silent speech patterns.
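The proportional auditory feedback described above can be sketched as a simple linear mapping. The example assumes that the strength and clarity of the detected signals have already been reduced to a single quality score between 0 and 1. The function name `feedback_pitch_hz`, the score itself, and the 220 Hz to 880 Hz tone range are illustrative assumptions, not values from the disclosure.

```python
def feedback_pitch_hz(signal_quality: float,
                      low_hz: float = 220.0,
                      high_hz: float = 880.0) -> float:
    """Map a quality score in [0, 1] to a tone pitch in hertz.

    Scores outside the range are clamped, so a weak signal yields the
    lowest pitch and a strong, clear signal yields the highest pitch.
    """
    q = min(1.0, max(0.0, signal_quality))
    return low_hz + q * (high_hz - low_hz)


for score in (0.0, 0.5, 1.0):
    print(score, feedback_pitch_hz(score))
```

The same mapping could drive volume instead of pitch. A non-linear curve may be preferable in practice because perceived pitch is roughly logarithmic in frequency.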

[0141] Some disclosed embodiments involve determining subvocalization data obtained via a wearable detector worn by an individual. The term “determine” refers to the act of ascertaining, calculating, or establishing information or data through analysis, measurement, or processing. For example, determining may involve collecting raw sensor data, applying algorithms to process the data, and extracting meaningful information from the processed data. The term “wearable detector” refers to any device, component, or system designed to be worn on or attached to the body of a user, capable of detecting, measuring, or responding to physical stimuli or changes in its environment. A wearable detector may convert any measurable quantities or properties obtained as a result of one or more physical stimuli or environmental changes into electrical signals, optical signals, or other forms of data that can be processed or analyzed. By way of non-limiting examples, a wearable detector may measure one or more properties of signals indicative of neuromuscular activity (e.g., power, frequency, phase, pulse timing, pulse duration) and may generate an output relating to the measured properties. The wearable detector may include one or more sensors to detect the neuromuscular activity. The phrase “worn by an individual” means that the wearable detector is physically attached to a person's body or to an object worn by the person. A wearable system for providing feedback on subvocalization data may determine subvocalization data through various sensing modalities. For example, a wearable detector may include one or more sensors capable of detecting neuromuscular activity in regions associated with speech production. These sensors may employ technologies such as electromyography (EMG) to measure electrical signals generated by muscle activity, or may use optical sensors to detect subtle changes in skin deformation caused by muscle contractions.
In some cases, the wearable system may process the raw sensor data to determine the subvocalization data. This processing may involve filtering out noise, amplifying relevant signals, and applying pattern recognition algorithms to identify specific subvocalized phonemes or words. Additionally, the wearable system may use machine learning techniques to improve its ability to interpret the subvocalization data over time, adapting to the individual’s unique patterns of neuromuscular activity during silent speech.
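The following sketch illustrates one possible form of the noise-filtering and signal-strength steps mentioned above, assuming the raw sensor output is a one-dimensional sequence of samples. A causal moving average stands in for the unspecified noise filter and root-mean-square (RMS) amplitude stands in for a signal-strength measure. Both choices, and the names `moving_average` and `rms`, are assumptions made for illustration.

```python
import math


def moving_average(samples, window):
    """Smooth a sequence with a causal moving average of the given width."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(samples):
        running += value
        if i >= window:
            running -= samples[i - window]  # drop the sample leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def rms(samples):
    """Root-mean-square amplitude, used here as a crude strength measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(v * v for v in samples) / len(samples))


raw = [1.0, 2.0, 3.0, 4.0]  # hypothetical raw sensor samples
print(moving_average(raw, 2), rms(raw))
```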

[0142] In some embodiments, the wearable detector includes at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of the individual and to generate signals associated with the neuromuscular activity, and wherein the at least one processor is configured to determine the subvocalization data from the signals obtained from the at least one sensor. The term “sensor” refers to any device, component, or system capable of detecting, measuring, or responding to physical stimuli or changes in its environment. A sensor may convert any measurable quantities or properties associated with one or more physical stimuli or environmental changes into electrical signals, optical signals, or other forms of data that can be processed or analyzed. By way of non-limiting examples, a sensor may measure one or more properties of signals indicative of neuromuscular activity (e.g., power, frequency, phase, pulse timing, pulse duration) and may generate an output relating to the measured properties. The at least one sensor may employ various sensing modalities to detect the neuromuscular activity. The term “detect” refers to the act of identifying, recognizing, measuring, or perceiving the presence, occurrence, or characteristics of a particular phenomenon, signal, or event. The term “neuromuscular activity” refers to any electrical, chemical, or mechanical processes associated with the interaction between nerves and muscles. For example, neuromuscular activity may include nerve impulses, muscle contractions, or other physiological events related to the nervous and muscular systems. The phrase “non-lip region of a head” refers to any area of an individual’s head excluding the lips. For example, the non-lip region of the head may include, but is not limited to, the cheeks, jaw, forehead, temples, ear canal, or any other facial, cranial, or inner ear area.
By way of example, the placement of the at least one sensor in proximity to the non-lip region may allow for the detection of neuromuscular activity associated with speech without relying on tongue movements or lip movements, thus enabling more discreet communication. The term “generating” refers to creating, producing, or forming through any means or method. In some examples, generating may involve transforming, converting, or deriving new data from input signals or existing information. The term “signals” includes any electrical or electromagnetic waves that carry information such as voice, video, or data. Signals can take various forms, including analog signals and digital signals. Other examples of signals that may be received include radio signals, optical signals, microwave signals, infrared signals, ultrasonic signals, or any other wave or other conveyance that carries information. In the context of neuromuscular activity detection, the received signals may include electrical potentials, mechanical vibrations, optical reflections, or any other measurable manifestation of physiological processes associated with muscle or nerve activity. The phrase “associated with” refers to a relationship, connection, or correlation between two or more elements, concepts, or phenomena. In the context of the present disclosure, signals associated with neuromuscular activity may be directly or indirectly related to the underlying physiological processes, potentially including both causal and correlative relationships.

[0143] In some embodiments, the at least one sensor includes a light detector configured to detect reflections of light projected from a light source. The terms “light source,” “light detector,” and “reflections” may be understood as described elsewhere in this disclosure. In some configurations, the light source may project light onto a non-lip region of the individual’s head, such as the cheek or temple area. The light detector may then capture the reflected light from the skin surface. As the individual engages in subvocalization activity or attempts to speak silently, the associated activity may cause subtle changes in the skin’s surface position. These changes may alter the characteristics of the reflected light, such as the reflected light intensity or the scattering patterns from the skin surface. By analyzing these changes in the reflected light (e.g., using pattern analysis), the system may generate a reflection image and detect speckle patterns or any other patterns in that image. The reflection image can be processed using a variety of image analysis techniques, including pattern recognition and neural networks (such as CNNs and RNNs), to extract features related to micromovements. These processed data are then evaluated using inference models, object detection, and pattern recognition techniques to decipher speech by identifying patterns or signatures within the reflection image data. Analyzing the image reflection data of scattered light may differ fundamentally from analyzing phase shifts or interference patterns between coherent light beams, which typically involves precise measurements of optical path differences and relies on the coherent properties of light. The scattered light analysis focuses on detecting and interpreting subtle changes in reflected light patterns to infer facial skin micromovements, while coherent light analysis measures direct interference effects between precisely controlled light beams.
The analysis of the image reflection data may be used to detect facial skin micromovements. As discussed elsewhere in this disclosure, the term “facial skin micromovements” refers to minute, often visually imperceptible movements or deformations of the skin on the face, typically associated with muscle contractions or nerve impulses related to speech or other cognitive processes.

[0144] In other embodiments, the at least one sensor includes a touch sensor configured to detect skin deformations caused by muscle engagement. The term “skin deformations” refers to any changes in the shape, texture, or position of the skin surface or soft tissue. This may include stretching, compression, wrinkling, or other alterations in skin topography. In some examples, the skin deformations include ear canal deformations that involve subtle changes in the shape, diameter, or pressure within the ear canal that occur due to muscle movements associated with subvocalization or other neuromuscular activities. The phrase “muscle engagement” refers to the activation or contraction of muscles, which may occur voluntarily or involuntarily in response to nerve signals or other stimuli. In this configuration, the at least one sensor may be designed to detect subtle changes in the skin’s surface or ear canal that occur when facial muscles associated with speech are engaged. One or more sensors may use various technologies such as strain gauges, piezoelectric sensors, capacitive sensors, or pressure sensors to measure small displacements, tensions, or deformations in the skin or ear canal. In these embodiments, detecting neuromuscular activity may involve detecting ear canal vibrations. The term “ear canal vibrations” refers to small-scale oscillations or movements within the ear canal, which may be caused by bone conduction of speech-related muscle activity or other physiological processes associated with subvocalization. By incorporating detection of facial skin micromovements or ear canal vibrations, systems, methods and computer readable media for interpreting facial neuromuscular activity may be able to capture a wide range of physiological signals associated with subvocalization.

[0145] Additionally or alternatively, detecting neuromuscular activity may involve detecting electrical signals transmitted through cranial nerves. Specifically, in some embodiments, at least one sensor includes an electrode configured to detect electrical signals transmitted through cranial nerves. The term “electrode” refers to a conductive material or device used to make electrical contact with a nonmetallic part of a circuit, in this case, the skin or underlying tissue of an individual. The phrase “cranial nerves” refers to one or more of the set of twelve paired nerves that emerge directly from the brain and brainstem, controlling various functions of the head and neck, including facial movements and sensations. In this configuration, at least one sensor may include one or more electrodes placed on the skin surface in non-lip regions of the head. These electrodes may detect the small electrical potentials generated by nerve impulses traveling through the cranial nerves, particularly those associated with speech. By analyzing these electrical signals, the system may infer patterns of nerve activity related to speech (e.g., audible or silent).

[0146] By way of non-limiting example, reference is made to Fig. 10, which illustrates a wearable system 1000 for providing feedback on subvocalization data. Wearable system 1000 may include a wearable detector 1001 worn by a user 102. Wearable detector 1001 may be configured to obtain subvocalization data from user 102. Wearable system 1000 may be capable of processing information from wearable detector 1001 to determine subvocalization data 1002. For example, wearable system 1000 may determine first subvocalization data 1002A, second subvocalization data 1002B, and third subvocalization data 1002C.

[0147] In some embodiments, subvocalization data corresponds to physical engagement of the individual. The term “corresponds” refers to a relationship, connection, or correlation between two or more elements, concepts, or phenomena. The term “physical engagement” refers to the activation or involvement of an individual's muscles, nerves, or other bodily structures in performing a specific action. In the context of subvocalization, physical engagement may involve subtle movements or tensions in various anatomical structures associated with speech production, even when no audible sound is produced. Examples of physical engagement during subvocalization include: micro-movements of facial muscles, such as subtle contractions of the orbicularis oris (around the mouth), zygomaticus major (cheek), or mentalis (chin) that occur during silent articulation; electrical signals in speech-related areas of the brain and along cranial nerves, detectable through neuroimaging or electrophysiological techniques; and skin surface deformations, such as changes in the skin’s topography due to underlying muscle activity, potentially measurable through high-precision optical techniques. The phrase “corresponds to physical engagement of the individual” means that the subvocalization data is related to the activation of the individual’s muscles, nerves, or other bodily structures during the process of subvocalization. For example, the subvocalization data may be derived from signals that measure the physical engagement during the process of subvocalization. These signals may be captured by any of the sensors in the wearable system, as described elsewhere in this disclosure. In other words, the subvocalization data may be considered as a measurable representation of the user’s physical engagement during subvocalization.

[0148] In some disclosed embodiments, the physical engagement is indicative of facial muscle activity or facial nerve activity. The term “facial muscle activity” refers to the movement or contraction of muscles in the face. In the context of subvocalization, this may include subtle movements or tensions in facial muscles associated with speech production, even when no audible sound is produced. For example, activity of the facial muscles, such as the orbicularis oris, buccinator, and mentalis, may occur when an individual silently articulates words. This activity can be detected through various means, such as the analysis of light reflections from the skin surface, as described elsewhere in this disclosure. The term “facial nerve activity” refers to the transmission of signals through nerves in the face. In the context of subvocalization, this may involve the electrical impulses sent through cranial nerves that control facial muscles involved in speech production. For example, activity of cranial nerve VII (facial nerve), cranial nerve XII (hypoglossal nerve), and cranial nerve V (trigeminal nerve) may be involved in the coordination of muscle movements during subvocalization. This neural activity can be measured using techniques such as electromyography (EMG) or through the detection of subtle skin movements or vibrations caused by these neural signals. Both facial muscle activity and facial nerve activity can serve as indicators of physical engagement during subvocalization, allowing the system to detect and interpret silent speech attempts. In some cases, the wearable detector may analyze patterns of facial muscle contractions or nerve impulses to determine the subvocalization data. This analysis may take into account the timing, intensity, and sequence of muscle activations to reconstruct the subvocalized words or phrases.

[0149] Referring back to Fig. 10, wearable system 1000 may determine first subvocalization data 1002A, second subvocalization data 1002B, and third subvocalization data 1002C. Each instance of subvocalization data corresponds to a different level of physical engagement. Specifically, first subvocalization data 1002A is associated with a use case where the physical engagement is very weak, resulting in no interpretable subvocalization data being determined. Second subvocalization data 1002B is associated with a use case where the physical engagement is still not strong enough, leading to mistakes in the interpretation of the subvocalization data. Third subvocalization data 1002C is associated with a use case where the physical engagement is strong enough, resulting in sufficient subvocalization data for accurate interpretation. In each case, the subvocalization data itself comprises information reflective of subtle skin deformations or other physiological indicators associated with subvocalization attempts from which the text depicted in the figure is derived. In other words, the text shown in the figure represents the system’s interpretation of the subvocalization data, not the subvocalization data itself.

[0150] Some disclosed embodiments involve analyzing the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit. The term “analyze” refers to examining, investigating, scrutinizing, and / or studying data to extract meaningful information or draw conclusions. In the context of subvocalization data, analyzing may involve determining correlations, associations, or patterns within the data set or in comparison to other data sets. For example, analyzing may include applying various processing techniques such as signal processing, pattern recognition, feature extraction, machine learning algorithms, artificial intelligence, or deep learning to interpret the subvocalization data and derive insights about the user’s silent speech attempts. The wearable system may employ various analytical techniques to process and interpret the subvocalization data. For example, the system may use signal processing algorithms to filter and enhance the raw data collected from sensors. Feature extraction methods may be applied to identify key characteristics or patterns in the data that correspond to specific subvocalized linguistic units. Machine learning models, such as neural networks or support vector machines, may be trained on labeled datasets to recognize and classify different subvocalized linguistic units based on the patterns of physical engagement detected. These models may take into account factors such as the intensity, duration, and spatial distribution of neuromuscular activity to make accurate determinations. Analyzing may also involve using trained machine learning models to compare received subvocalization data with reference data to determine if the physical engagement is sufficient for ascertaining subvocalized words.
In some cases, analysis may include calculating various metrics or features from the subvocalization data and using the results to determine the quality or clarity of the subvocalization attempt and / or what the individual had actually subvocalized. The system may also incorporate adaptive thresholding techniques to account for variations in individual physiology and subvocalization styles. By dynamically adjusting the criteria for sufficiency based on historical data and user-specific patterns, the system may improve its ability to accurately ascertain subvocalized linguistic units across different contexts.
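One way to picture the adaptive thresholding mentioned above is a per-user threshold derived from that user's own history of quality scores. The sketch below places the threshold a chosen number of standard deviations below the historical mean, with a fixed lower bound. The function name `adaptive_threshold`, the parameters `k` and `floor`, and the mean-minus-deviation rule are assumptions chosen for illustration; the disclosure does not specify a particular rule.

```python
from statistics import mean, pstdev


def adaptive_threshold(history, k=1.0, floor=0.2):
    """Return a user-specific sufficiency threshold.

    history: past quality scores for this user (assumed to lie in [0, 1]).
    k:       how many standard deviations below the mean to place the threshold.
    floor:   lower bound so the threshold never drops to a trivial value.
    """
    if not history:
        return floor  # no user data yet, fall back to the default
    return max(floor, mean(history) - k * pstdev(history))


def is_sufficient(score, history):
    """Compare a new score with the threshold adapted to this user."""
    return score >= adaptive_threshold(history)


user_history = [0.6, 0.8]  # hypothetical past scores
print(adaptive_threshold(user_history), is_sufficient(0.7, user_history))
```

Because the threshold follows each user's own score distribution, a user whose signals are consistently weaker is not held to the same absolute level as a user whose signals are strong.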

[0151] The phrase “make a determination” refers to reaching a conclusion, decision, or judgment based on available information or evidence. For example, making a determination may involve assessing multiple factors, comparing data against predefined criteria, or using decision-making algorithms to arrive at a specific outcome. The term “sufficient” refers to meeting a minimum threshold or requirement for a particular purpose or goal. For example, sufficiency may be determined based on predefined criteria, statistical measures, or performance benchmarks relevant to the specific use case. In the context of the present disclosure, the subvocalization data is analyzed to determine whether the physical engagement of the individual associated with the subvocalization data provides enough information to reliably identify and interpret the subvocalized speech. This means that the system evaluates the quality of the subvocalization data to assess whether the data meets the minimum requirements for accurately interpreting and identifying what the user is silently articulating. The sufficiency threshold may be based on factors such as signal strength, clarity, or consistency, and may be determined through machine learning algorithms trained on large datasets of subvocalization patterns.

[0152] The term “ascertaining” refers to finding out, learning, or determining with certainty. For example, ascertaining may involve gathering evidence, conducting tests, or applying analytical methods to establish facts or confirm information. The term “subvocalized linguistic unit” refers to a component of language that is mentally articulated without producing audible speech. For example, a subvocalized linguistic unit may include phonemes (basic units of sound), morphemes (smallest meaningful units of language), words, phrases, or sentences that are silently formed in the mind. The phrase “ascertaining a subvocalized linguistic unit” refers to the process of determining, establishing, or identifying a specific component of language that is articulated without producing audible speech. This process involves analysis of subvocalization data or associated signals using advanced signal processing techniques, such as feature extraction, pattern recognition, and machine learning algorithms. For example, a neural network may be trained on labeled datasets of subvocalization signals to recognize specific phonemes, morphemes, words, or phrases. The system may use techniques such as time-frequency analysis, wavelet transforms, or convolutional neural networks to extract relevant features from the subvocalization data. These features may be compared against pre-established patterns or fed into classification algorithms to determine the most likely linguistic unit being subvocalized. In some implementations, the process of ascertaining a subvocalized linguistic unit may also involve using contextual information or language models to improve accuracy, especially when ascertaining larger linguistic units like words or phrases. The ascertainment process can occur in real-time, allowing for immediate interpretation of silent speech, or it may involve a slight delay for more complex linguistic units.
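The comparison of extracted features against pre-established patterns can be sketched as nearest-template matching. The example assumes that feature extraction has already produced a fixed-length numeric vector and that one reference vector is stored per linguistic unit. The function `nearest_unit`, the two-dimensional features, and the template values are hypothetical, and Euclidean distance is only one of many possible similarity measures.

```python
import math


def nearest_unit(features, templates):
    """Return the unit whose template is closest to the feature vector,
    together with the distance to that template."""
    best = min(templates, key=lambda unit: math.dist(features, templates[unit]))
    return best, math.dist(features, templates[best])


# Hypothetical reference patterns, one per linguistic unit.
templates = {
    "pa": (1.0, 0.0),
    "ma": (0.0, 1.0),
}

unit, distance = nearest_unit((0.9, 0.2), templates)
print(unit, round(distance, 3))
```

The returned distance could also serve as a rough confidence measure, since a large distance to every template suggests that no unit can be ascertained reliably.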

[0153] In some embodiments, the subvocalized linguistic unit includes at least one of: a phoneme, a morpheme, a word, a clause, or a sentence. The disclosed wearable system may be configured to analyze the subvocalization data at different linguistic levels. For phoneme-level analysis, the system may focus on detecting and interpreting the smallest units of sound in language, even when subvocalized. For morpheme-level analysis, the system may look for patterns that correspond to the smallest meaningful units of language, such as prefixes, suffixes, or root words. Word-level analysis may involve detecting more complex patterns of physical engagement that represent entire words subvocalized by the individual. The system may also be capable of determining larger linguistic structures such as clauses, phrases, or complete sentences, either by combining determined words or as standalone linguistic structures. By supporting multiple levels of linguistic analysis, the wearable system may provide a more comprehensive and flexible approach to providing feedback on subvocalization.

[0154] Referring back to Fig. 10, the wearable system 1000 may analyze the first subvocalization data 1002A, second subvocalization data 1002B, and third subvocalization data 1002C to determine whether the physical engagement associated with each instance is sufficient for ascertaining a subvocalized linguistic unit. The system may process these data streams using the techniques described throughout the disclosure to make determinations about the sufficiency of the detected physical engagement. Specifically, based on the first subvocalization data 1002A, wearable system 1000 may determine that the physical engagement is insufficient to ascertain linguistic unit 1004, even with additional information like context. For the second subvocalization data 1002B, wearable system 1000 may determine that the physical engagement is insufficient to ascertain linguistic unit 1004 without additional information but may be sufficient with such additional information. Finally, for the third subvocalization data 1002C, the system may determine that the physical engagement is sufficient to ascertain linguistic unit 1004 even without additional information.

[0155] Some disclosed embodiments involve providing feedback to the individual based on the determination. The term “provide” refers to the act of supplying, offering, or making available something for use or consideration. For example, providing may involve presenting information, delivering a service, or offering a resource to a recipient. The term “feedback” refers to any type of information or output returned to a source (e.g., a user of the wearable system) about the result of a process, action, or behavior. For example, feedback may be provided to inform the individual that the physical engagement associated with their subvocalization was insufficient for ascertaining a subvocalized linguistic unit. This feedback may be designed to guide, inform, or assist the individual in improving their subvocalization technique or understanding the system’s interpretation of their silent speech attempts. The phrase “based on the determination” refers to using the outcome of the analysis of subvocalization data as the foundation for generating and delivering feedback to the user. Specifically, it means that the feedback provided is responsive to the system’s assessment of whether the user’s physical engagement during subvocalization was sufficient for ascertaining a subvocalized linguistic unit. This determination serves as the basis for tailoring the feedback to help the user understand the quality of their subvocalization attempt and potentially improve their technique. Consistent with the present disclosure, the feedback provides phoneme-level guidance or word-level guidance for improving silent speech effectiveness of the individual. For phoneme-level guidance, the wearable system may identify specific sounds or parts of words that are not being sufficiently articulated during subvocalization attempts.
For example, if the user’s silent articulation of the phoneme “pe” is insufficient, and the system identifies that the intended subvocalized word is “petrol,” the system may output “etrol” as feedback. This indicates to the user that their silent articulation of the initial “pe” phoneme was not strong enough for accurate detection. For word-level guidance, the wearable system may provide feedback on entire words that are not being subvocalized clearly enough for accurate interpretation. For instance, the system may emit a weak background tone when it detects that the engagement associated with a particular word is not strong enough. The intensity or pitch of this tone may vary based on the degree of insufficiency, allowing the user to identify which words or parts of words require more focused effort in their silent articulation. This subtle auditory cue helps users recognize and improve upon weaknesses in their subvocalization technique without disrupting their thought process.

[0156] In some embodiments, the feedback includes at least one of: a visual output, an audible output, or a haptic output. A visual output may include displaying text, graphics, or animations on a screen associated with the wearable system or a connected device. For example, if the wearable system is part of smart glasses or paired with a device with a screen (e.g., a smartphone), it may present visual output such as a color-coded representation of subvocalization quality, with green indicating sufficient physical engagement and red indicating areas needing improvement. An audible output may involve playing sounds, tones, or synthesized speech through speakers or earphones connected to the wearable system. For instance, if the wearable system is incorporated into an earplug with a speaker, it may use different pitches or patterns of sounds to indicate the quality of subvocalization or provide spoken instructions for improvement. A haptic output may include vibrations, pressure sensations, or other tactile feedback delivered through the wearable device. For example, the system may use different vibration patterns to signal successful interpretation of subvocalized words or to guide the user in adjusting their physical engagement. By offering multiple types of feedback, the wearable system can be adapted to different user preferences and environmental contexts, ensuring that the individual can receive and understand the feedback effectively in various situations.

[0157] In other embodiments, the feedback includes a continuous output, and the at least one processor is configured to detect a change in a level of the physical engagement and to adjust the feedback accordingly. In this context, “continuous output” refers to feedback that is provided in an ongoing manner to the individual using the wearable system. Rather than discrete or intermittent signals, continuous output involves a constant stream of sensory cues that reflect the current state of the user’s subvocalization performance in real time. The phrase “a change in a level of the physical engagement” describes a detectable variation in the quality or characteristics of the user’s subvocalization efforts. It refers to fluctuations in the neuromuscular activity, facial movements, or other physiological signals associated with silent speech that the system is monitoring. The term “to adjust the feedback accordingly” means that the wearable system modifies its output in response to detected changes in the user’s physical engagement during subvocalization. The system dynamically alters the feedback provided to reflect the current quality or sufficiency of the user’s subvocalization attempts, thereby guiding the user towards more effective silent speech. In some implementations, the wearable system may be configured to monitor the level of physical engagement associated with the subvocalization attempts continuously. As the system detects changes in the intensity, consistency, or patterns of the physical engagement, it may dynamically adjust the feedback provided to the individual. For example, if the system detects an improvement in the clarity of subvocalized phonemes, it may reduce the frequency or intensity of corrective feedback. Conversely, if the system detects a decline in the quality of subvocalization, it may increase a characteristic of the feedback provided.
For example, a continuous output may take the form of an ongoing auditory signal or a persistent haptic sensation that reflects the current quality of the individual’s subvocalization attempts. For instance, the system may use a modulating tone through an earpiece or varying vibration intensities to provide immediate and ongoing feedback about the user's subvocalization performance.
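
As a minimal sketch of such a continuous output, the function below maps a monitored engagement level to the intensity of a corrective tone, so that the tone fades as engagement improves and disappears once engagement is sufficient. It assumes a normalized engagement level between 0 and 1. The name `tone_intensity`, the linear mapping, and the `sufficient_level` value are hypothetical and are not specified in the disclosure.

```python
def tone_intensity(
    engagement: float,
    sufficient_level: float = 0.7,  # hypothetical level at which engagement is sufficient
    max_intensity: float = 1.0,
) -> float:
    """Map an engagement level in [0, 1] to a corrective-tone intensity.

    The intensity is 0 once engagement reaches the sufficient level and grows
    linearly as engagement falls further below it.
    """
    if sufficient_level <= 0.0:
        raise ValueError("sufficient_level must be positive")
    clipped = min(max(engagement, 0.0), 1.0)
    shortfall = max(0.0, sufficient_level - clipped) / sufficient_level
    return max_intensity * shortfall
```

Calling the function on every new engagement sample yields feedback that follows changes in the level of physical engagement, in the manner described above.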

[0158] Referring back to Fig. 10, the wearable system 1000 may provide a first feedback 1006A, a second feedback 1006B, and a third feedback 1006C to user 102 based on the analysis of the first subvocalization data 1002A, second subvocalization data 1002B, and third subvocalization data 1002C, respectively. Each feedback instance may correspond to the determination made regarding the sufficiency of the physical engagement for ascertaining a subvocalized linguistic unit.

[0159] Some disclosed embodiments involve selecting a feedback output characteristic based on information other than the subvocalization data. The term “feedback output characteristic” refers to a specific attribute or property of the feedback provided to the individual. For example, a feedback output characteristic includes at least one of: a volume level, an audible pattern, a visual pattern, a pitch, a duration of an output signal, or the intensity, frequency, or location of a haptic sensation. Consistent with the present disclosure, a volume level may be a feedback output characteristic that refers to the loudness or intensity of an auditory output or haptic output. An audible pattern may be a feedback output characteristic that refers to a specific sequence, rhythm, or arrangement of sounds used to convey information. For example, an audible pattern may include a series of beeps, tones, or spoken words that indicate the quality or accuracy of the subvocalization. A visual pattern may be a feedback output characteristic that refers to a specific arrangement, sequence, or display of visual elements used to convey information. For example, a visual pattern may include color-coded indicators, animated graphics, or text displays that represent the effectiveness of the subvocalization attempts. The phrase “information other than the subvocalization data” refers to any data, parameters, or contextual factors that are not directly derived from the detected neuromuscular activity associated with subvocalization. In some disclosed embodiments, this includes information about an environment of the individual, stored user preferences, an operational system status, stored system settings, or other external inputs that may influence the selection of feedback characteristics. In some cases, the environment of the individual includes physical conditions in which the individual is using the wearable system.
For example, the environment may include factors such as ambient noise levels, lighting conditions, or the presence of other people. The stored user preferences may include options selected by the individual or learned by the system over time. For example, user preferences may include preferred feedback modalities, sensitivity thresholds, or personalized training data for interpreting subvocalization signals. The operational system status may include the current state of the wearable system itself. For example, operational system status may include battery level, processing load, or the availability of specific system features or components. In the context of the present disclosure, the phrase “select a feedback output characteristic based on information other than the subvocalization data” refers to determining specific attributes of the feedback provided to the individual using factors beyond the detected subvocalization signals. This selection process may take into account various contextual or user-specific information to optimize the effectiveness of the feedback. For example, the volume level may be adjusted based on the ambient noise in the environment or the user’s hearing sensitivity.
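
The volume example above can be sketched as follows, purely as an illustration. The sketch assumes that an ambient noise estimate in decibels is available from a microphone. The function name `select_volume`, the default decibel bounds, and the linear interpolation are hypothetical and are not part of the disclosure.

```python
def select_volume(
    ambient_noise_db: float,
    base_volume: float = 0.3,  # hypothetical volume used in quiet surroundings
    quiet_db: float = 40.0,    # hypothetical noise level treated as "quiet"
    loud_db: float = 80.0,     # hypothetical noise level treated as "loud"
) -> float:
    """Select a feedback volume in [base_volume, 1.0] from ambient noise alone.

    The subvocalization data is not consulted: the volume depends only on
    information about the environment of the individual.
    """
    if loud_db <= quiet_db:
        raise ValueError("loud_db must be greater than quiet_db")
    fraction = (ambient_noise_db - quiet_db) / (loud_db - quiet_db)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp outside the quiet..loud range
    return base_volume + (1.0 - base_volume) * fraction
```

Other inputs named in the paragraph, such as stored user preferences or battery level, could be folded in the same way, for example by scaling the returned volume.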

[0160] Referring back to Fig. 10, the wearable system 1000 may determine ambient sounds originating from a speaker 1010 in the environment of user 102 (step 1012) and select a feedback output characteristic (step 1014). The selected feedback output characteristic may be implemented by any of the feedback instances that correspond to the determination made regarding the sufficiency of the physical engagement for ascertaining a subvocalized linguistic unit. For example, the volume level of feedback 1008B may be adjusted due to the ambient noise from speaker 1010.

[0161] In some cases, when physical engagement associated with a specific subvocalized linguistic unit is insufficient for interpretation, the wearable system estimates a value of the specific subvocalized linguistic unit based on the subvocalization data and additional data. The term “estimate” refers to inferring or predicting the most likely outcome based on available information, even when that information is incomplete or uncertain. In this context, it involves using various analytical techniques to determine the probable value of a subvocalized linguistic unit. The term “a value of the specific subvocalized linguistic unit” refers to the identity of the linguistic component (such as a phoneme, word, or phrase) that the user is attempting to subvocalize. In other words, the value represents the system’s best interpretation of what the user is trying to communicate silently. The term “additional data” refers to any information or inputs beyond the detected subvocalization signals that may aid in estimating the value of an insufficiently interpreted linguistic unit. Examples of the additional data may include the user’s past subvocalization patterns, contextual information (e.g., the topic of conversation or frequently used words), or outputs from predictive models. The phrase “based on the subvocalization data and additional data” indicates that the estimation process of the value of the specific subvocalized linguistic unit considers both the current subvocalization signals (even if insufficient) and the supplementary information to make a more informed prediction about the intended linguistic unit.

[0162] By way of non-limiting example, reference is made to Fig. 11, which illustrates a flowchart of a process 1100 for providing subvocalization feedback in different cases. The steps of process 1100 may be performed in any order and may be carried out by at least one processor of a wearable system or other relevant entity. The process 1100 begins with step 1102, where neuromuscular activity is detected. This detection may involve using sensors in the wearable system to capture signals associated with muscle movements or nerve impulses in a non-lip region of the head during subvocalization. In step 1104, the system determines subvocalization data based on the detected activity. This step may involve processing the raw sensor data to extract meaningful features or patterns that correspond to specific subvocalized linguistic units. In step 1106, the subvocalization data is analyzed. This analysis may involve comparing the extracted features against known patterns or using machine learning models to interpret the subvocalization signals. Decision step 1108 determines whether the engagement is sufficient. This decision point evaluates whether the detected physical engagement associated with the subvocalization attempt provides enough information for accurate interpretation of the intended linguistic unit. If the engagement is deemed sufficient, the method proceeds to step 1112, where third feedback 1006C is provided. Third feedback 1006C may indicate to the individual that their subvocalization attempt was successful and clearly interpretable, and in some cases, third feedback 1006C may be the avoidance of any other feedback indicating that the engagement was insufficient. In case the engagement is not sufficient in step 1108, the process moves to step 1110 to obtain additional data. The additional data may include contextual information, historical subvocalization patterns of the individual, or outputs from predictive language models.
This additional data serves to supplement the insufficient subvocalization signals, allowing the system to make a more informed estimate of the intended linguistic unit.

[0163] Consistent with the present disclosure, the additional data used to estimate the value of the specific subvocalized linguistic unit is received from one or more machine learning models trained using prior subvocalization signals as training examples. The term “machine learning models” refers to computational algorithms or systems capable of improving their performance on a specific task through experience or exposure to data. For example, machine learning models may include neural networks, support vector machines, or decision trees trained to recognize patterns in subvocalization signals. The phrase “prior subvocalization signals as training examples” refers to previously collected and labeled subvocalization data used to teach the machine learning models how to interpret and predict subvocalized linguistic units. These training examples may include a diverse range of subvocalization patterns from multiple users or sessions. Additionally or alternatively, the additional data used to estimate the value of the specific subvocalized linguistic unit includes at least one of: historical subvocalization patterns, linguistic context, or environmental data. The historical subvocalization patterns may include previously observed and recorded characteristics of an individual’s subvocalization attempts. For example, this may include typical muscle activation patterns, common errors, or user-specific tendencies in silent speech production. The linguistic context may include the surrounding words, phrases, or semantic information that may provide clues about the likely identity of an insufficiently interpreted linguistic unit. For example, linguistic context may include the subject matter of a conversation, grammatical structures, or frequently co-occurring words. The environmental data may include information about the physical surroundings or conditions that may influence subvocalization performance or interpretation.
For example, environmental data may include ambient noise levels, the presence of distractions, or the individual’s current activity or location. After obtaining the additional data, the wearable system may use various analytical techniques, such as probabilistic inference or machine learning algorithms, to estimate the most likely value of the specific subvocalized linguistic unit. This estimation process combines the available subvocalization data with the additional contextual or historical information to produce a best guess of the intended silent speech.

[0164] Referring back to Fig. 11, after obtaining the additional data in step 1110, the process reaches another decision step 1114 to assess whether the engagement combined with the additional data is sufficient. This decision step evaluates whether the combination of the subvocalization data and the additional data provides enough information for accurate interpretation of the subvocalized linguistic unit. If it is sufficient, the process proceeds to step 1116 to provide second feedback 1006B. The second feedback 1006B may indicate to the individual that their subvocalization attempt was insufficient to interpret the intended linguistic unit without the help of additional data. Additionally or alternatively, as described in greater detail below with reference to Fig. 12, the system may determine an appropriateness of providing the feedback and may elect to withhold the feedback and simply synthesize the determined linguistic unit. However, if the combination of the subvocalization data and the additional data is not enough for accurate interpretation of the subvocalized linguistic unit, the process moves to step 1118 to provide first feedback 1006A. The first feedback 1006A may inform the individual that their subvocalization attempt was insufficient to interpret the subvocalization. Process 1100 then loops back from steps 1112, 1116, and 1118 to step 1102, creating a continuous cycle of detection, analysis, and feedback provision. This iterative process allows for ongoing assessment and improvement of subvocalization data collection and analysis. By continuously monitoring and providing feedback on the individual's subvocalization attempts, the wearable system may help the individual improve their silent speech technique over time.
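
The branching of process 1100 between decision steps 1108 and 1114 can be sketched as shown below. This is an illustrative sketch only: it assumes that sufficiency is expressed as a score between 0 and 1 compared against a threshold, and that additional data contributes an additive boost to that score. The scoring scheme, the threshold value, and the names `Feedback` and `choose_feedback` are hypothetical. The disclosure does not specify how sufficiency is computed.

```python
from enum import Enum
from typing import Callable


class Feedback(Enum):
    FIRST = "1006A"   # insufficient even with additional data (step 1118)
    SECOND = "1006B"  # sufficient only with additional data (step 1116)
    THIRD = "1006C"   # sufficient without additional data (step 1112)


def choose_feedback(
    signal_score: float,
    get_context_boost: Callable[[], float],
    threshold: float = 0.8,  # hypothetical sufficiency threshold
) -> Feedback:
    """Pick the feedback instance for one pass through process 1100."""
    # Decision step 1108: is the engagement sufficient on its own?
    if signal_score >= threshold:
        return Feedback.THIRD

    # Step 1110: obtain additional data only when it is actually needed.
    combined = min(1.0, signal_score + get_context_boost())

    # Decision step 1114: is the engagement sufficient with the additional data?
    if combined >= threshold:
        return Feedback.SECOND
    return Feedback.FIRST
```

Passing the additional data as a callable mirrors the flowchart, in which step 1110 is reached only after step 1108 finds the engagement insufficient.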

[0165] In some embodiments, when the determined value of the specific subvocalized linguistic unit corresponds to a phoneme, the feedback includes phoneme replacement with substitute output. The term “phoneme” refers to the smallest unit of sound in speech that can distinguish one word from another in a given language. The phrase “phoneme replacement with substitute output” refers to providing alternative feedback when a specific subvocalized phoneme is not accurately interpreted. This approach allows the system to offer targeted feedback on individual sound units, helping users identify and improve specific aspects of their subvocalization technique. Thus, if the system detects that a user consistently struggles with a particular phoneme, it can provide substitute output that highlights this issue, guiding the user towards more effective silent articulation of that sound. For example, if the user attempts to subvocalize the word “petrol” but the system fails to detect sufficient engagement for the initial “pe” phoneme, the feedback might be “etrol”. In this case, the omission of the “pe” in the feedback indicates to the user that their subvocalization of the initial phoneme was not strong enough for accurate detection. Alternatively, the feedback might be “detrol”. In this case, the replacement of the “pe” with a “de” in the feedback may indicate to the user that their subvocalization of the initial phoneme was not strong enough or clear enough for accurate detection. By hearing the “de” sound, the user may understand that they did not articulate the “pe” well enough. This type of feedback can guide users to focus on improving their silent articulation of specific problematic phonemes, in this case, emphasizing the need for a stronger, more defined “pe” sound in their subvocalization.
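
The “etrol” and “detrol” outputs above can be reproduced with a short sketch. For simplicity it operates on letter segments rather than a phonetic transcription, and it assumes that a per-segment sufficiency flag has already been determined. The function name `render_feedback` and the substitution table are hypothetical and serve only to illustrate omission versus replacement.

```python
from typing import Dict, Optional, Sequence


def render_feedback(
    segments: Sequence[str],
    sufficient: Sequence[bool],
    substitutes: Optional[Dict[str, str]] = None,
) -> str:
    """Build feedback text in which insufficient segments are omitted or replaced."""
    if len(segments) != len(sufficient):
        raise ValueError("segments and sufficient must have the same length")
    output = []
    for segment, is_sufficient in zip(segments, sufficient):
        if is_sufficient:
            output.append(segment)
        elif substitutes and segment in substitutes:
            output.append(substitutes[segment])  # replacement with substitute output
        # otherwise the insufficient segment is simply omitted
    return "".join(output)


# Letter-level stand-in for the "petrol" example.
segments = ["p", "etrol"]
print(render_feedback(segments, [False, True]))              # etrol
print(render_feedback(segments, [False, True], {"p": "d"}))  # detrol
```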

[0166] In some embodiments, prior to providing feedback, the wearable system determines an appropriateness of providing feedback and withholds feedback when timing is determined to be inappropriate. The term “determine an appropriateness” refers to assessing whether providing feedback at a given moment is suitable, beneficial, or likely to be well-received by the individual. The phrase “withhold feedback” refers to the action of temporarily refraining from delivering feedback to the individual. In some cases, withholding feedback may take place even when subvocalization data has been processed and it is determined that the physical engagement is insufficient to determine the linguistic unit. Withholding feedback may involve storing the feedback for later delivery or discarding it if it becomes irrelevant. The phrase “timing is determined to be inappropriate” means that the system has assessed the current situation and concluded that providing feedback at this specific moment would not be suitable or beneficial for the user. This determination could be based on factors such as the user’s condition or other contextual information that suggests the feedback might be disruptive or ineffective at that particular time. For example, the appropriateness may be based on a psychological state or a physiological state of the individual. A psychological state may refer to a current mental or emotional condition of the individual. For example, the psychological state may include factors such as stress levels, attention focus, or mood. A physiological state may refer to the current physical condition of the individual, such as tiredness or stress levels. These physiological states may be inferred from measurable bodily functions. For example, the system may assess the individual's psychological and physiological states by monitoring indicators such as heart rate, skin conductance, muscle tension, or other biometric data.
To determine appropriateness, the system may employ various technical approaches. The system may integrate data from multiple sensors, such as accelerometers, gyroscopes, and biometric sensors, to assess the user’s current activity, and may use trained machine learning algorithms to analyze patterns in sensor data, user behavior, and historical feedback responses to predict the optimal timing for feedback delivery. In addition, the system may use context awareness that may be obtained by using environmental sensors (e.g., ambient light sensors, microphones) to evaluate the user’s surroundings and determine if it is an appropriate setting for this type of feedback. In some cases, the wearable system may employ natural language processing to analyze the semantic context and determine if providing feedback is appropriate. Additionally, the system may examine temporal patterns in the user’s behavior and feedback receptiveness through time-series analysis to identify optimal feedback windows.

[0167] By way of non-limiting example, reference is made to Fig. 12, which illustrates providing subvocalization feedback at appropriate times. As shown, the determined subvocalization data 1002 is associated with the text “I need to sto_ and get _etrol at the next gas station.” It appears that individual 102 has difficulty silently articulating “p” sounds. In this case, the system manages to determine the linguistic unit 1004 “I need to stop and get petrol at the next gas station” using additional data, such as context information indicating that user 102 is driving. The system then determines whether it is appropriate to provide feedback (decision step 1200). If it is determined that it is not an appropriate time to provide feedback (e.g., the individual is stressed), the wearable system proceeds to output an audible presentation of the linguistic unit as determined without informing the individual that he struggles with the silent articulation of the linguistic unit “p” in the words “stop” and “petrol” (audible presentation 1206A). Conversely, if it is determined that it is appropriate to provide feedback, the wearable system proceeds to output an audible presentation of the linguistic unit as determined using only the subvocalization data, thereby informing user 102 that his physical engagement associated with the silent articulation of the linguistic unit “p” was insufficient (audible presentation 1206B).
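
Decision step 1200 can be sketched as a choice between the two transcripts. The sketch assumes that appropriateness is reduced to a single normalized stress score. That score, the `max_stress` limit, and the function name `choose_presentation` are hypothetical simplifications of the multi-sensor assessment described in the disclosure.

```python
def choose_presentation(
    raw_transcript: str,       # as determined from the subvocalization data alone
    resolved_transcript: str,  # as determined with the help of additional data
    stress_score: float,       # hypothetical normalized stress estimate in [0, 1]
    max_stress: float = 0.6,   # hypothetical limit above which feedback is withheld
) -> str:
    """Return the text to synthesize for the individual (decision step 1200)."""
    feedback_appropriate = stress_score <= max_stress
    if feedback_appropriate:
        # Audible presentation 1206B: the gaps are left audible as feedback.
        return raw_transcript
    # Audible presentation 1206A: feedback is withheld.
    return resolved_transcript


raw = "I need to sto_ and get _etrol at the next gas station."
resolved = "I need to stop and get petrol at the next gas station."
print(choose_presentation(raw, resolved, stress_score=0.9))  # resolved sentence
```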

[0168] In one specific implementation, the at least one processor is configured to: receive first subvocalization data indicative of a first facial muscle engagement; analyze the first subvocalization data to determine that a level of the first facial muscle engagement is insufficient for ascertaining a first subvocalized linguistic unit; provide ongoing feedback to the individual indicating that the level of the first facial muscle engagement is insufficient for subvocalization determination; while providing the ongoing feedback, receive second subvocalization data indicative of a second facial muscle engagement; analyze the second subvocalization data to determine that a level of the second facial muscle engagement is sufficient for subvocalization determination; and cease the ongoing feedback when the second facial muscle engagement is determined to be sufficient. The term “ongoing feedback” refers to continuous or repeated signals or information provided to the individual over a period of time. In the context of auditory feedback, this may include a constant background tone, rhythmic beeps, or modulating sounds that reflect the current quality of the user's subvocalization attempts. For example, the system might use a continuously playing sound whose pitch or volume changes in real-time based on the detected level of facial muscle engagement during subvocalization. The phrase “cease the ongoing feedback” refers to any change in the continuous feedback signals once a specific condition is met. This encompasses not only completely stopping the feedback but also altering its characteristics. In the case of auditory feedback, this may involve silencing the sound completely, changing the tone or pitch to indicate successful subvocalization, transitioning to a different type of audio cue, or modifying the rhythm or intensity of the ongoing sound.
Consistent with the present disclosure, the feedback may change in response to the detected improvement in facial muscle engagement, signaling to the user that their subvocalization technique has become sufficient for accurate interpretation. In some cases, the wearable system may be configured to provide real-time guidance to help users improve their subvocalization technique. The system may continuously monitor the level of facial muscle engagement associated with subvocalization attempts and provide immediate feedback to the individual. When the system detects insufficient muscle engagement for accurate interpretation of a subvocalized linguistic unit, it may initiate ongoing feedback to alert the user.

[0169] In some embodiments, the feedback may be included in or incorporated into an audible presentation of the subvocalized linguistic unit delivered to the individual. The term “audible presentation of the subvocalized linguistic unit” refers to an output of a process of converting detected and interpreted subvocalized speech into audible sound. This process involves synthesizing speech that represents the content of the subvocalized linguistic unit, making it perceptible to the human ear. In other words, the wearable system may embed feedback regarding the sufficiency of physical engagement associated with subvocalization directly into the audible output of the interpreted subvocalized speech. For example, the system might add or modify a tone, pitch, or other presentation parameter of the synthesized speech to indicate when the physical engagement associated with subvocalization was insufficient. By way of example, audible presentation 1206B in Fig. 12 includes feedback indicating that the physical engagement associated with the linguistic unit “p” in the words “stop” and “petrol” was insufficient.

[0170] In related embodiments, the wearable system generates an additional audible2605 presentation for delivery to a third party, wherein the additional audible presentation differs from the audible presentation delivered to the individual. In this context, “third party” refers to any person or entity other than the individual using the wearable system for subvocalization. This may include, but is not limited to, conversation partners, caregivers, or automated systems that need to receive or process the subvocalized speech. The phrase2610 “differs from” means that there are distinct differences or variations between the two audible presentations. These differences may relate to content, format, quality, or othercharacteristics of the audible presentations. Additionally, the audible presentation of the subvocalized linguistic unit delivered to the individual is associated with a first quality level, while the additional audible presentation delivered to the third party is associated with a2615 second quality level that is higher than the first quality level. In this context, “quality level” refers to the overall fidelity, clarity, and comprehensibility of the synthesized speech. The system may leverage a time delay in transmission to the third party to achieve this higher quality level. For example, while the individual may receive an immediate audible presentation with potentially lower quality, the system may introduce a delay (e.g., 2002620 milliseconds) when transmitting to the third party to allow for more sophisticated processing and enhancement of the audible presentation. This delay may enable the system to apply advanced audio processing techniques, resulting in a higher-quality audible presentation for the third party. The higher quality level may involve more natural-sounding speech, improved articulation, or the inclusion of additional contextual cues. 
By providing a higher-quality audible presentation to the third party, the system ensures clearer communication in scenarios such as phone conversations, while maintaining an immediate feedback loop for the individual.
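The two-tier delivery described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `Presentation`, `fast_synthesis`, `enhanced_synthesis`, and `present` are hypothetical, the two synthesis functions are trivial stand-ins that return text rather than audio, and the ordinal `quality` field is only an assumed way of expressing "first" and "second" quality levels.

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    recipient: str   # "individual" or "third_party"
    content: str     # stand-in for synthesized audio
    quality: int     # hypothetical ordinal quality level
    delay_ms: int    # delay applied before delivery


def fast_synthesis(words):
    # Stand-in for a low-latency synthesizer (assumption).
    return " ".join(words)


def enhanced_synthesis(words):
    # Stand-in for slower, higher-quality processing (assumption).
    return " ".join(words).capitalize() + "."


def present(words, third_party_delay_ms=200):
    """Return an immediate presentation for the individual and a delayed,
    higher-quality presentation for the third party."""
    individual = Presentation("individual", fast_synthesis(words), 1, 0)
    third_party = Presentation(
        "third_party", enhanced_synthesis(words), 2, third_party_delay_ms
    )
    return individual, third_party


own, other = present(["stop", "for", "petrol"])
print(own)
print(other)
```

In this sketch the delay is only recorded as a field; an actual system would presumably use the extra time budget to run the more expensive processing before transmission.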

[0171] By way of non-limiting example, reference is made to Fig. 13, which illustrates a flowchart of a method for providing feedback on subvocalization data. In some disclosed embodiments, process 1300 may be performed by at least one processor (e.g., processing device 400 or processing device 460) to perform operations or functions described herein. In some embodiments, some aspects of process 1300 may be implemented as software (e.g., program codes or instructions) that are stored in a memory (e.g., memory device 402 or memory device 466) or a non-transitory computer readable medium. In some embodiments, some aspects of process 1300 may be implemented as hardware (e.g., a specific-purpose circuit). In some embodiments, process 1300 may be implemented as a combination of software and hardware.

[0172] Process 1300 includes step 1302, where subvocalization data is determined. This subvocalization data may be obtained via a wearable detector worn by an individual. The subvocalization data corresponds to physical engagement of the individual, such as subtle muscle movements or neural activity associated with silent speech attempts. In step 1304, the process involves analyzing the subvocalization data. This analysis is performed to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit. The analysis may involve signal processing, pattern recognition, or machine learning techniques to evaluate the quality and clarity of the detected subvocalization signals. In step 1306, feedback is provided to the individual based on the determination made in step 1304. This feedback may take various forms, such as visual, auditory, or haptic cues, designed to guide the individual in improving their subvocalization technique.
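The three steps of process 1300 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the disclosed method: the function names, the mean-removal preprocessing, the use of RMS amplitude as a sufficiency measure, and the threshold value of 0.1 are all hypothetical stand-ins for the signal processing, pattern recognition, or machine learning techniques mentioned above.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Determination:
    sufficient: bool
    score: float


def determine_subvocalization_data(raw_samples):
    """Step 1302 (sketch): remove the mean from raw detector samples."""
    mean = sum(raw_samples) / len(raw_samples)
    return [s - mean for s in raw_samples]


def analyze(data, threshold=0.1):
    """Step 1304 (sketch): use RMS amplitude as an assumed proxy for
    whether the physical engagement is sufficient."""
    rms = sqrt(sum(x * x for x in data) / len(data))
    return Determination(sufficient=rms >= threshold, score=rms)


def provide_feedback(determination):
    """Step 1306 (sketch): map the determination to a feedback cue."""
    if determination.sufficient:
        return "engagement sufficient"
    return "engagement insufficient: articulate more firmly"


weak = determine_subvocalization_data([0.01, -0.01, 0.01, -0.01])
print(provide_feedback(analyze(weak)))
```

The returned string stands in for whichever visual, auditory, or haptic cue a particular embodiment would deliver.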

[0173] Another aspect of this disclosure addresses the challenge of providing a natural and personalized user experience in silent speech systems. The present disclosure offers an innovative approach that leverages in-ear audio sensing to capture the individual’s self-perceived voice characteristics. By utilizing measurements of sounds within the ear canal, the disclosed system can create a personalized acoustic profile that accurately represents how users hear their own voices. This personalized acoustic profile may be used to align synthesized output of silent speech more closely with the individual’s self-perception. By doing so, the disclosed system makes subvocalized communication more intuitive and less fatiguing, potentially facilitating easier adoption and integration of this technology.

[0174] Some disclosed embodiments involve a wearable system for silent speech self-presentation. The term “wearable system” refers to any device or collection of components designed to be worn on or attached to an individual’s body, such as defined elsewhere in this disclosure. In the context of the present disclosure, the phrase “silent speech self-presentation” refers to providing a user with a representation of what the user subvocalized. In some embodiments, silent speech self-presentation may be accomplished using a visual output. In other embodiments, as discussed in detail below, it may be implemented using audio output. For example, when a user subvocalizes, the user may receive one or more of an audio or textual representation of that subvocalization. Such an audio presentation may be presented through an earbud or other sound producing element of the system, or via a separate device wired to or wirelessly connected to a system component. Similarly, the textual representation may be provided via a display associated with the system or via a separate device wired to or wirelessly connected to a system component. This enables individuals using the wearable system to process their own subvocalized thoughts without the need to speak aloud. The wearable nature of the disclosed system may allow for continuous and context-aware operation, adapting to the individual's environment and communication needs. For instance, the system may adjust the volume of the silent speech self-presentation based on ambient noise levels or the individual's activity. By way of a non-limiting example, as described in detail below, a wearable system for silent speech self-presentation may include sensors positioned near the individual's jaw or throat to detect subtle muscle movements associated with subvocalized speech. These signals may be processed to determine subvocalized words, which are then synthesized and played back through an earpiece.
This allows individuals to speak silently in various settings without disrupting others or compromising privacy.

[0175] Some disclosed embodiments involve at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of an individual and generate signals of a first signal type associated with the detected neuromuscular activity. The term “sensor” refers to any device or component capable of detecting, measuring, or responding to changes in its environment, as described elsewhere in this disclosure. The term “detect” refers to any process of identifying, measuring, or recognizing the presence, absence, or characteristics of a particular phenomenon, signal, or condition. The phrase “neuromuscular activity” refers to any electrical, chemical, or mechanical processes that occur during vocalized speech production or subvocalized speech production, as defined elsewhere in this disclosure. The term “non-lip region” refers to any area of the head that is not part of the lips. This may include, but is not limited to, the cheeks, jaw, throat, temples, ear, ear canal, or forehead. In some embodiments, the neuromuscular activity may be detected in the neck. The term “generating signals” refers to producing, creating, or bringing into existence detectable physical quantities or impulses that can transmit messages or information. In the context of the present disclosure, these signals are associated with detected neuromuscular activity in non-lip regions of the head. The signals may be electrical, optical, mechanical, or of any other nature that can be measured, interpreted, and produced by the at least one sensor in response to subtle muscle movements, electrical impulses, or other physiological changes associated with speech production. The phrase “signal type” refers to a specific category or classification of signals that are generated by one or more sensors. For example, signals of a first signal type may be directly associated with the detected neuromuscular activity.
Specifically, signals from a first signal type may be generated by the at least one sensor in response to subtle muscle movements, electrical impulses, or other physiological changes associated with vocalized speech or silent speech. In some cases, signals from the first signal type may be associated with detected neuromuscular activity in non-lip regions of the head of the individual. In other cases, signals from the first signal type may be associated with detected neuromuscular activity in other regions of the individual.

[0176] The at least one sensor may include at least one of the following: a light detector configured to detect reflections of light projected from a light source, a touch sensor configured to detect skin deformations caused by muscle engagement, or an electrode configured to detect electrical signals transmitted through cranial nerves. The specific design of the wearable system may incorporate different types of sensors for detecting neuromuscular activity in a non-lip region of the individual’s head. A detailed description of several exemplary sensor types is provided above. By integrating one or more of these sensors, a wearable system can achieve robust and accurate detection of speech-related neuromuscular activity, thereby enabling effective detection of subvocalization.

[0177] By way of example, Fig. 14 illustrates a wearable system 1400 capable of silent speech self-presentation. Wearable system 1400 includes a sensor 1402 configured to detect neuromuscular activity and generate signals of a first signal type (first signals 1404). In one implementation, sensor 1402 may be a light detector used in conjunction with a light source for detecting neuromuscular activity. In this implementation, first signals 1404 of the first signal type may correspond to reflections of projected light 1406 and indicate facial skin micromovements associated with speech.

[0178] Some disclosed embodiments involve at least one audio sensor configured to detect sounds within an ear canal of the individual and generate signals of a second signal type associated with the detected sounds. The term “audio sensor” refers to an electronic device configured to capture audio or sound and convert it into an electrical representation that may be transmitted, recorded, or processed by various electronic devices. The audio sensor may be used for recording, communication, broadcasting, or any other suitable audio application. Examples of audio sensors include microphones (such as dynamic, condenser, or ribbon microphones), piezoelectric sensors, MEMS (Micro-Electro-Mechanical Systems) microphones, hydrophones for underwater sound detection, contact sensors that detect audio vibrations through solid objects, and bone conduction sensors that pick up vibrations through bones. The term “detect sounds” refers to any process of identifying, measuring, or capturing acoustic waves or vibrations within a specific environment or medium, and converting them into electrical signals. The term “ear canal” refers to the tubular passage in the outer ear that leads from the exterior of the head to the eardrum. The ear canal is a part of the auditory system and plays a role in conducting sound waves to the middle and inner ear. The phrase “second signal type” refers to a specific category or classification of signals. Specifically, signals of the second signal type are generated by the audio sensor and directly associated with acoustic waves, vibrations, or other sound-related phenomena occurring within the ear canal. The signals of the second signal type generated by the audio sensor may contain rich information about the acoustic properties of the individual’s speech. This may include data on pitch, timbre, resonance, and other characteristics that are unique to how the individual perceives their own voice.
The second signal type is distinguished from other types of signals that the system may process, such as those of the first signal type which are associated with neuromuscular activity in non-lip regions of the head.

[0179] Some disclosed embodiments involve at least one processor configured to execute different actions. The term “processor” refers to any electronic circuit, device, or system capable of performing computations, executing instructions, or manipulating data, as described elsewhere in this disclosure.

[0180] Referring back to Fig. 14, wearable system 1400 may include a first audio sensor 1408 positioned to detect sounds within the individual's ear canal, producing signals of a second signal type (second signals 1410). Wearable system 1400 may also include at least one processor 1412. Processor 1412 may work in conjunction with sensor 1402 and with first audio sensor 1408 to provide a comprehensive system for detecting and processing speech-related signals.

[0181] Some disclosed embodiments involve, during a first time period, using a first set of signals of the first signal type to identify a first plurality of words vocalized by the individual. The term “time period” refers to any interval or duration during which a specific process, action, or measurement occurs. For example, a time period may refer to a number of seconds (or portions thereof), minutes, hours, or any other length of time during which the individual is engaged in a certain activity or the processor executes certain actions. In the context of this disclosure, the activity associated with the time period may be vocalizing a first plurality of words. The term “set of signals” generally refers to a collection or group of related signals that are treated as a unit for processing purposes. In some cases, the set of signals may share a common characteristic, such as signals generated by the same sensor. In other cases, each signal in the set conveys part of the overall information required for the system to perform a function. For example, in a wearable system detecting neuromuscular activity, a set of signals might include all electrical impulses recorded by electrodes during a vocalization event or all light reflection measurements captured by an optical sensor within a specific time window or within a speech session. In the context of the present disclosure, the first set of signals originates from the at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of an individual. The term “identify” refers to any process of recognizing, determining, or distinguishing specific elements, patterns, or characteristics within a set of data. In the context of the present disclosure, identification may involve analyzing signals from a sensor to determine a plurality of words. The term “a plurality of words” refers to any group or collection of two or more distinct words or utterances.
This may include complete sentences, phrases, or individual words spoken by a user of the wearable system. The term “vocalized” refers to any process of producing audible speech sounds using the vocal cords. In the context of the present disclosure, the system may utilize signals generated by the sensor during a specific time period to identify a first plurality of words that were vocalized. In some implementations, the wearable system may compare extracted features from the signals to pre-existing models or databases of speech patterns. This comparison could allow the system to match the detected neuromuscular activity with known patterns associated with specific words or phonemes to determine what was said. In other implementations, the identification process may focus solely on recognizing that vocalization has occurred, without necessarily determining the specific words that were vocalized. This approach may be used to accurately distinguish between vocalized speech from the individual and other sounds. By confirming that the individual is indeed the source of the vocalization, the system can ensure that the sounds captured in the ear canal correspond to the individual's own speech.

[0182] Referring back to Fig. 14, wearable system 1400 may use sensor 1402 to generate first signals 1404 indicative of neuromuscular activity in a non-lip region of an individual’s head that took place during a first time period. These signals may be used to identify a first plurality of words 1414 vocalized by the individual.

[0183] Some disclosed embodiments involve using a second set of signals of the second signal type to determine an acoustic profile. The term “determine” refers to identifying, calculating, deriving, or establishing a particular value, characteristic, or result based on input data or measurements. The term “acoustic profile” refers to a set of acoustic parameters or characteristics that describe the auditory properties of a sound or voice. An acoustic profile may include various attributes that define how a sound is perceived, such as frequency distribution, amplitude, resonance, and temporal patterns. For example, an acoustic profile associated with a voice of the individual may include values for at least one of: volume, pitch, timbre, tempo, or prosody. Volume refers to the typical loudness or intensity of the voice of the individual measured in decibels or represented on a relative scale. Pitch refers to the typical frequency of the voice of the individual. Timbre refers to the typical quality of the voice of the individual that distinguishes it from other sounds of the same pitch and volume. Tempo refers to the pace at which the voice of the individual is typically delivered, often measured in words or syllables per minute. Prosody refers to the patterns of stress and intonation in speech, encompassing aspects such as rhythm, emphasis, and melodic contour of spoken language. By determining the acoustic profile, the system can capture a representation of the individual's voice characteristics. This acoustic profile enables the system to replicate the individual's natural speech patterns when synthesizing subvocalized words. The wearable system may employ various signal processing techniques to extract relevant features from the second set of signals of the second signal type. These features may include spectral characteristics, temporal patterns, and amplitude variations that correspond to the individual's perception of their own voice.
Advanced algorithms, such as machine learning models or neural networks, may be used to analyze these features and to determine the acoustic profile.
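As a minimal illustration of extracting two of the attributes listed above (volume and pitch) from in-ear audio samples, the following Python sketch uses RMS level and a crude autocorrelation pitch estimate. It is not the disclosed technique: the function names, the 60 to 400 Hz search range, the dictionary layout of the profile, and the synthetic 200 Hz test tone are all assumptions, and a practical system would likely use more robust estimators or the learned models mentioned above.

```python
import math


def rms_dbfs(samples):
    """Volume as RMS level in dB relative to full scale (amplitude 1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Crude pitch estimate: pick the lag with the largest (unnormalized)
    autocorrelation inside an assumed voice range."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def build_acoustic_profile(samples, sample_rate):
    """Hypothetical profile holding only volume and pitch."""
    return {
        "volume_dbfs": rms_dbfs(samples),
        "pitch_hz": estimate_pitch_hz(samples, sample_rate),
    }


# Synthetic stand-in for in-ear audio: 0.1 s of a 200 Hz tone at 8 kHz.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
print(build_acoustic_profile(tone, rate))
```

Timbre, tempo, and prosody are omitted here because they require longer analysis windows and richer features than this sketch covers.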

[0184] In some embodiments, an acoustic profile may be indicative of a presentation characteristic associated with perception of the first plurality of words by the individual. The term “presentation characteristic” refers to any attribute or quality that influences how information, in this case, speech, is perceived. This may include aspects such as volume, pitch, tone, speed, or emphasis that contribute to the overall auditory experience of the individual. The term “indicative” refers to something that serves as a sign, indication, or pointer towards a particular characteristic, quality, or condition. The phrase “indicative of a presentation characteristic” means that the acoustic profile contains information that demonstrates specific attributes of how the individual sounds. For example, a typical volume level of the individual may be one presentation characteristic, as some people tend to speak louder compared to others. The term “associated with” may relate to any component that is linked, incorporated, affiliated with, connected to, or related to another element or concept. For example, when an object is associated with another object, it indicates a relationship or connection between the two objects without necessarily implying a specific type or degree of connection. In this case, the presentation characteristic may be associated with perception of the first plurality of words by the individual. The phrase “perception of the first plurality of words by the individual” refers to how the individual using the wearable system experiences or hears their own voice when vocalizing words. This internal perception may differ from how others hear the individual's voice due to factors such as bone conduction and the unique acoustics of the individual's head and ear structures.
Specifically, determining an acoustic profile based on signals from the at least one audio sensor involves analyzing the sounds detected (e.g., within the ear canal of the individual) while they are vocalizing the first plurality of words. This process aims to capture the unique characteristics of how the individual perceives their own voice, which may differ significantly from how their voice sounds to others or when recorded externally. By capturing the nuances of how the individual perceives their own voice, the system can later use this information to synthesize subvocalized speech in a manner that closely resembles the individual's natural voice perception. For example, the system may adjust the volume levels of synthesized speech to match the individual’s typical speaking volume or modulate the pitch and timbre to closely resemble the unique qualities of the individual’s voice.

[0185] Referring back to Fig. 14, wearable system 1400 may use first audio sensor 1408 to generate second signals 1410 indicative of sounds within an ear canal of the individual captured during the first time period. These signals may be used to determine an acoustic profile 1416 based on the individual's perception of their own voice.

[0186] Some disclosed embodiments involve, during a second time period subsequent to the first time period, using a third set of signals of the first signal type to identify a second plurality of words subvocalized by the individual. The term “subsequent time period” refers to any interval of time occurring after (or extending beyond) a previously defined time period. A subsequent time period may be immediately consecutive to the first time period, somewhat overlapping with and extending beyond the first time period, or may occur after a delay or intervening events. In some examples, a second time period may take place after the system determined an acoustic profile. In some examples, it may take place beforehand, for later application to pre-recorded data. The term “using” refers to employing, applying, or utilizing a particular set of data, tools, or methods to accomplish a specific task or achieve a desired outcome. In some embodiments the system may use a third set of signals of the first signal type for determining subvocalization. The third set of signals are signals generated by the at least one sensor configured to detect neuromuscular activity and captured during the second time period. Using the third set of signals to identify a second plurality of words subvocalized by the individual typically involves the application of signal processing techniques, pattern recognition algorithms, or machine learning models to analyze and interpret the third set of signals. Known signal processing techniques or variants of known signal processing techniques may be employed in this regard. The term “identifying” refers to recognizing, determining, or distinguishing specific elements, patterns, or characteristics within a set of data or signals. In the context of speech processing, identification may involve analyzing neuromuscular signals to determine the second plurality of words subvocalized by the individual.
The term “subvocalized” refers to any process of internally articulating words or speech without producing audible sounds, or when producing minute sounds not readily understandable or perceptible. Subvocalization may involve any speech-related activity that occurs without audible utterance, before utterance, or immediately prior to an imperceptible utterance. Identifying subvocalized words may involve various signal processing techniques to extract relevant features from the third set of signals of the first signal type. These features may include temporal, spectral, or spatial characteristics that correspond to articulatory movements associated with subvocalization. Machine learning models or neural networks may be used to analyze these features and map them to specific phonemes or words. For example, identification of subvocalized words may involve comparing the extracted features to pre-existing models or databases of speech patterns. These models may be general or user-specific, allowing the system to adapt to individual speech characteristics and improve accuracy over time. The system may also employ contextual analysis to enhance word identification accuracy, considering linguistic context, such as grammar rules or common word sequences, to resolve ambiguities and improve the overall interpretation of the subvocalized speech.

[0187] In some embodiments, a wearable system may make a determination whether a physical engagement associated with the second time period is sufficient for ascertaining the second plurality of words, and provide feedback to the individual based on the determination. The terms “physical engagement,” “sufficient for ascertaining,” and “feedback” are defined elsewhere in this disclosure. If the system determines that the physical engagement is insufficient for accurate word ascertainment, it may provide feedback to the individual to help improve their subvocalization technique. This feedback mechanism allows for a dynamic and interactive user experience, potentially improving the overall effectiveness of the silent speech self-presentation system over time. The system may determine sufficiency using a machine learning model or by comparing captured information with prestored data. The feedback on the physical engagement may include phoneme-level guidance or word-level guidance for improving silent speech effectiveness of the individual. By providing detailed guidance at the phoneme level and/or at the word level, the system can help users refine their subvocalization techniques. The guidance may involve suggestions for adjusting tongue position, lip shape, or other aspects of articulation to improve the clarity and detectability of subvocalized speech. This type of guidance can be particularly valuable for users who are new to silent speech systems.

[0188] In other embodiments, the system may make a determination whether environmental lighting conditions associated with the second time period interfere with the determination of the second plurality of words and provide feedback to the individual based on the determination. The term “environmental lighting conditions” refers to the ambient illumination present in the surroundings of the individual, which may affect the performance of optical sensors or other light-dependent components of the wearable system. For example, environmental lighting conditions may encompass fluorescent office lighting, natural outdoor light, dim restaurant lighting, or the flickering lights of a moving vehicle. The term “interferes” in this context refers to any environmental lighting conditions that negatively impact or hinder the system's ability to accurately detect and interpret the neuromuscular signals associated with subvocalized speech. This interference may result in reduced accuracy or reliability in determining the second plurality of words. For example, bright sunlight or rapidly changing light conditions could potentially interfere with optical sensors used to detect subtle facial movements, leading to misinterpretation of subvocalized speech signals. In these cases, the feedback may include suggestions for adjusting the individual's position relative to light sources, recommendations for optimal usage environments, or notifications about the system's current performance limitations due to lighting conditions.

[0189] Referring back to Fig. 14, wearable system 1400 may use sensor 1402 to detect neuromuscular activity during the second time period and generate a third set of signals of the first signal type (third signals 1418). Third signals 1418 may be used to identify a second plurality of words 1420 subvocalized by the individual.

[0190] Some disclosed embodiments involve, based on the acoustic profile, determining a presentation manner for an audible presentation of the second plurality of words. The term “audible presentation” refers to an output of sound, audio, acoustic waves, or any other output that may be perceived by human hearing or via a listening device. Generating an audible presentation may be accomplished by generating audio signals that when played by a speaker (e.g., headphone or external speaker) generate sound that may be perceived by a human ear. For example, particular words corresponding to the second plurality of words may be stored in a data structure in a digital audio format. Upon accessing the data structure, the digital audio may be retrieved, converted to analog audio (e.g., using a D/A converter) and the analog audio may be used to drive a speaker to generate sound output. In some embodiments, generating the audible presentation may include creating sound (e.g., delivered via a speaker configured to fit in the ear of the user), and may be associated with silent or prevocalized speech. In an example, the audible presentation of the second plurality of words may include synthesized speech (e.g., artificial production of human speech). For example, the synthesized speech may be generated using a text-to-speech algorithm to convert normal language text into speech by assigning a phonetic transcription to each text word and converting the symbolic linguistic representation into sound. In some examples, a text-to-speech (TTS) system may convert normal language text into speech. Examples of suitable text-to-speech algorithms include concatenative TTS, parametric TTS, neural network-based TTS, and end-to-end transformer-based models. Other systems may render symbolic linguistic representations like phonetic transcriptions into speech.
In one example, a speaker may be used to generate an audible presentation based on particular subvocalized words detected through analysis of light reflections from the face region. The term “presentation manner” with reference to the audible presentation refers to the specific characteristics, qualities, and methods used to generate and deliver the audible output of synthesized speech or other audio signals. This may include aspects such as the voice characteristics (e.g., pitch, tone, speed, volume), the type of speech synthesis method (e.g., text-to-speech algorithms, phonetic transcription rendering), and the type of delivery method (e.g., through in-ear speakers or external audio devices). The presentation manner may determine how the audible presentation is perceived by the individual, taking into account factors such as clarity, naturalness, and personalization based on the individual’s acoustic profile or preferences.

[0191] In the context of the present disclosure, the presentation manner may be determined based on the acoustic profile such that the unique characteristics associated with the manner of how the individual perceives their own voice will be applied when providing the audible presentation of the second plurality of words. Doing so may create a more natural and intuitive user experience by aligning the synthesized output with the individual's internal auditory expectations. The system may employ various techniques to determine the presentation manner. For example, it may use digital signal processing to adjust the spectral characteristics of the synthesized speech to match the frequency response captured in the acoustic profile. Additionally, the system may apply filters or other audio processing to simulate resonance effects that contribute to how an individual perceives their own voice internally. Some embodiments involve determining the presentation manner to replicate how the individual perceives their own voice. In these embodiments, determining the presentation manner may involve analyzing the acoustic profile derived from the second set of signals of the second signal type and applying this information to the synthesis of subvocalized speech. To do so, the system may use digital signal processing to adjust the spectral characteristics of the synthesized speech to match the frequency response captured in the acoustic profile. For example, a TTS system may generate speech that sounds clear but lacks the tonal warmth of a target voice. DSP algorithms like equalization and spectral envelope matching may then adjust the synthesized audio’s frequency response to align with the acoustic profile, creating a voice that closely matches the desired characteristics. Additionally, the wearable system may apply filters or other audio processing techniques to simulate the conditions that contribute to how an individual perceives their own voice internally.
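The equalization idea mentioned above can be sketched in Python as a per-band gain computation. This is a simplified illustration under stated assumptions rather than the disclosed algorithm: it assumes the synthesized speech and the acoustic profile have each already been reduced to a short list of band energies (the three-band split, the function names, and the 12 dB gain limit are hypothetical), and it omits the actual filtering of audio samples.

```python
import math

EPS = 1e-12  # guards against division by zero and log of zero


def band_gains_db(synth_band_energy, profile_band_energy, max_gain_db=12.0):
    """Per-band gain (dB) that moves the synthesized band energies toward
    the band energies stored in the acoustic profile, clamped to a limit."""
    gains = []
    for synth, target in zip(synth_band_energy, profile_band_energy):
        gain = 10.0 * math.log10(max(target, EPS) / max(synth, EPS))
        gains.append(max(-max_gain_db, min(max_gain_db, gain)))
    return gains


def apply_gains(band_energy, gains_db):
    """Apply energy-domain gains to each band."""
    return [e * 10.0 ** (g / 10.0) for e, g in zip(band_energy, gains_db)]


# Hypothetical low / mid / high band energies.
synthesized = [1.0, 1.0, 1.0]
profile = [4.0, 1.0, 0.25]   # more low-frequency energy, less high

gains = band_gains_db(synthesized, profile)
print([round(g, 2) for g in gains])
print(apply_gains(synthesized, gains))
```

Clamping the gains is a design choice in this sketch to avoid extreme boosts when a band of the synthesized signal carries very little energy.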

[0192] Referring back to Fig. 14, wearable system 1400 determines the presentation manner for an audible presentation of the second plurality of words. Processor 1412 uses the acoustic profile 1416, which was previously determined based on first signals 1404 associated with first plurality of words 1414 vocalized by the individual during the first period of time. Acoustic profile 1416 represents the individual's perception of their own voice. Then, processor 1412 applies the characteristics from acoustic profile 1416 to ensure the audible presentation of second plurality of words 1420 aligns closely with how the individual perceives their own voice.

[0193] Some disclosed embodiments involve synthesizing the second plurality of words in a determined presentation manner. The term “synthesize” refers to artificially producing speech or other sounds using electronic means. Accordingly, the system may employ various speech synthesis techniques to generate an audible presentation of the second plurality of words. These techniques can include concatenative synthesis that combines prerecorded speech segments, and parametric synthesis that uses mathematical models to generate speech waveforms. The choice of synthesis technique may depend on factors such as computational resources, desired voice quality, and specific application requirements. Consistent with the present disclosure, the synthesis of the second plurality of words may occur in the determined presentation manner, which refers to the specific approach and characteristics used to synthesize and deliver the audible presentation of the second plurality of words. As discussed above, the presentation manner may be determined based on various factors, including the individual’s acoustic profile, and may involve voice parameters and attributes designed to mimic the individual’s unique vocal qualities as perceived by themselves. In some cases, synthesis of words may be implemented using an artificial voice generated by computer algorithms and software associated with the wearable system, and the synthesized voice for the audible presentation may be created using previous recordings of the individual (e.g., during the first time period). Consistent with the present disclosure, the synthesized voice may be designed to closely resemble the individual’s natural voice characteristics, enabling the output of subvocalized speech in a way that mimics how the individual hears their own voice.

[0194] Referring back to Fig. 14, wearable system 1400 may use a speaker 1422 to synthesize the second plurality of words 1420 in the determined presentation manner. This may involve setting voice parameters such as pitch, timbre, and prosody to align with the characteristics captured in acoustic profile 1416.

[0195] In some embodiments, a wearable system may include an additional at least one audio sensor configured to detect ambient noise outside the ear canal and to generate signals of a third signal type associated with ambient noise. The term “ambient noise” refers to any background sounds or acoustic disturbances present in the environment surrounding the user. This may include sounds from traffic, conversations, machinery, or any other sources that are not the individual. The third signal type refers to a distinct category of signals generated by the additional audio sensor configured to detect ambient noise outside the ear canal. The inclusion of an additional audio sensor for detecting ambient noise allows the system to gather contextual information about the individual’s acoustic environment. The contextual information is used to make informed decisions about how to adjust the presentation manner of synthesized speech. Specifically, the system may analyze an additional set of signals of a third signal type to determine an ambient noise level during the second time period, and may dynamically adjust the presentation manner based on the determined ambient noise level. The term “ambient noise level” refers to the magnitude of background sounds in the individual's environment. The ambient noise level may be measured in decibels or represented on a relative scale. In some cases, the ambient noise level may be determined through various signal processing techniques, such as spectral analysis, root mean square (RMS) amplitude calculation, or machine learning algorithms trained to classify and quantify different types of environmental sounds. The term “dynamically adjust” refers to modifying data in real-time or near real-time in response to changing conditions, such as the ambient noise level.
In the context of the present embodiments, the dynamic adjustment of the presentation manner based on the determined ambient noise level allows the system to maintain optimal performance across various acoustic environments. For example, in a quiet setting, the wearable system may use a softer, more natural-sounding voice for the synthesized speech. In contrast, in a noisy environment, the wearable system may increase the volume or adjust the frequency range so that the synthesized speech stands out from the background noise. In a specific implementation, a volume at which the second plurality of words is synthesized may be increased when the determined ambient noise level exceeds a predefined threshold. In this context, the predefined threshold may refer to a predetermined value for making decisions or triggering specific actions within the system. For example, when the ambient noise exceeds the predefined threshold, the system may increase the volume of the synthesized speech to ensure it remains audible and comprehensible to the user. The predefined threshold may be set based on various factors, such as user preferences, typical usage scenarios, or empirical studies on speech intelligibility in different noise conditions. The system may also incorporate adaptive thresholds that learn from user behavior and adjust over time to provide optimal performance for individual users. By automatically increasing the volume in noisy environments, the system can maintain effective communication without requiring manual adjustments from the user.
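As a rough illustration of the RMS-based level estimate and the threshold rule described above, the following sketch assumes audio samples normalized to the range [-1, 1]. The function names, the -30 dBFS threshold, the slope, and the 12 dB cap are assumptions chosen for the example and are not values from the disclosure.

```python
import math


def rms_level_dbfs(samples):
    """RMS level of samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def output_gain_db(noise_dbfs, threshold_dbfs=-30.0, slope=0.5, max_gain_db=12.0):
    """Extra synthesis gain (dB): zero up to the threshold, then rising with noise."""
    if noise_dbfs <= threshold_dbfs:
        return 0.0
    return min(max_gain_db, slope * (noise_dbfs - threshold_dbfs))
```

An adaptive threshold of the kind mentioned above could be obtained by adjusting `threshold_dbfs` over time from user behavior; that is outside the scope of this sketch.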

[0196] Referring back to Fig. 14, wearable system 1400 may use second audio sensor 1424 as an additional audio sensor for detecting ambient noise outside the ear canal and generating signals of a third signal type associated with ambient noise. In this context, processor 1412 may be configured to analyze the signals from second audio sensor 1424 and to dynamically adjust the presentation manner based on the determined ambient noise level.

[0197] As mentioned above, the wearable system may include an additional at least one audio sensor configured to generate signals of a third signal type associated with ambient noise. In some embodiments, a wearable system may analyze an additional set of signals of the third signal type to identify background noise during the second time period. The term “background noise” refers to any unwanted sounds or acoustic disturbances present in an environment of the individual that are not the primary focus of attention. These sounds may interfere with the audible presentation of the subvocalized words, potentially affecting the clarity and intelligibility of the synthesized speech. Background noise may be identified by analyzing the signals from the additional audio sensor through various signal processing techniques. These include spectral analysis to examine frequency components, time-domain analysis to assess amplitude and temporal patterns, machine learning algorithms to classify different types of noise, and signal-to-noise ratio calculations to quantify noise levels. Examples of background noise that the system may encounter include environmental sounds (e.g., traffic, wind, rain), mechanical noises (e.g., air conditioning, computer fans), crowd noise in public spaces, electronic interference, transportation noise, music or media, household appliances, and natural sounds like animal noises or flowing water. By identifying and characterizing these types of background noise, the wearable system can adjust its speech processing and synthesis algorithms to maintain optimal performance across various acoustic environments, ensuring clear and effective silent speech self-presentation regardless of the individual’s surroundings. For example, the wearable system may cancel the background noise while synthesizing the second plurality of words in the determined presentation manner.
As used herein, the term “cancel” may refer to reducing, eliminating, or counteracting the effects of unwanted signals or disturbances. In the context of background noise during the second time period, cancellation may involve techniques to minimize or remove background noise from a desired audio signal. This may be achieved through various methods such as adaptive filtering, spectral subtraction, or active noise control, which aim to improve the clarity and intelligibility of the target audio by attenuating or neutralizing interfering sounds.
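Of the methods listed above, adaptive filtering is the simplest to sketch. The example below is a basic least-mean-squares (LMS) noise canceller, written under the assumption that one signal (`primary`) contains the desired audio plus noise and another (`reference`) contains correlated noise only, as an outward-facing microphone might provide. The function names, tap count, and step size are illustrative choices, not the disclosed implementation.

```python
import random


def lms_cancel(primary, reference, taps=4, mu=0.05):
    """LMS adaptive noise canceller.

    Estimates the noise in `primary` from recent `reference` samples and
    subtracts it; the returned error signal is the cleaned output.
    """
    weights = [0.0] * taps
    history = [0.0] * taps          # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate        # cleaned sample
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def demo_residual(n=2000, seed=0):
    """Noise-only demo: the residual should shrink as the filter converges."""
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(n)]
    primary = [0.8 * x for x in reference]   # noise reaching the primary path
    cleaned = lms_cancel(primary, reference)
    return max(abs(e) for e in cleaned[-100:])
```

The step size `mu` trades convergence speed against stability; the value here suits unit-variance reference noise and would need tuning for real sensor signals.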

[0198] Referring back to Fig. 14, wearable system 1400 may use second audio sensor 1424 as an additional audio sensor for detecting background noise. Processor 1412 may be configured to analyze signals from second audio sensor 1424 and to cancel the background noise while synthesizing the second plurality of words in the determined presentation manner.

[0199] Fig. 15 is a chart depicting a process for silent speech self-presentation occurring over three time periods, consistent with some embodiments of the present disclosure. Process 1500 begins with a first time period 1502A, which involves capturing and analyzing vocalized speech from the individual. This initial phase sets the foundation for understanding the individual's speech characteristics and how the individual perceives their own voice. During this period, the system creates an acoustic profile based on sounds captured within the individual's ear canal. In a second time period 1502B, subsequent to first time period 1502A, the system identifies subvocalized speech from the individual, determines various conditions that may affect a desired audible presentation of the subvocalized speech, and synthesizes the personalized audible presentation of the subvocalized speech. Process 1500 continues to a third time period 1502C that represents ongoing use and refinement of the wearable system. During this time period, the system processes additional data, updates the acoustic profile, and synthesizes new subvocalized speech based on the updated acoustic profile. Each time period builds upon the information gathered in the previous one. This structure allows the system to initially calibrate to the individual’s voice, then apply that calibration to silent speech, and finally refine its performance over time for increasingly natural and personalized silent speech presentation.

[0200] In some embodiments, prior to determining the acoustic profile from the second set of signals of the second signal type, the wearable system may verify that the first plurality of words are pronounced while a comfort level of the individual is at an acceptable value. Verifying may refer to checking, confirming, or validating the status of a particular condition or set of data. For example, verification may involve assessing specific parameters or criteria to ensure they meet predetermined standards before proceeding with subsequent operations. In this case, the wearable system may verify that a comfort level of the individual is at an acceptable value. The term “comfort level” generally refers to the degree of physical or psychological ease, relaxation, or well-being in a given situation or environment. It can include a range of conditions where a person feels physically at ease, such as temperature or ergonomic settings, and a sense of security in social settings. For example, a high comfort level might be indicated by relaxed muscle tension, steady breathing, and positive emotional affect, while a low comfort level could manifest as physical fidgeting, elevated heart rate, or expressed anxiety about using the system. These characteristics may be assessed by one or more sensors in the system. The phrase “acceptable value” refers to a predefined threshold or range that indicates a satisfactory level of comfort for the individual, which may vary based on user preferences or system requirements. Verifying the comfort level of the individual before determining the acoustic profile may ensure that the captured voice data accurately represents the individual's natural speech patterns, as discomfort or stress may alter vocal characteristics. The wearable system may employ various methods to assess the individual's comfort level.
These may include analyzing detected ambient noise, analyzing physiological signals such as heart rate or skin conductance, monitoring speech patterns for signs of stress or hesitation, or directly querying the user about their comfort through a user interface. If the comfort level is determined to be below the acceptable value, the system may delay the acoustic profile determination process, provide guidance to the user to improve comfort, or adjust system parameters to create a more comfortable experience.

[0201] Referring back to Fig. 15, first time period 1502A of process 1500 includes identifying a first plurality of words vocalized by the individual (step 1502), verifying the individual's comfort level during vocalization (step 1504), and determining the acoustic profile for the individual (step 1506). Analyzing various physiological indicators to confirm that the comfort level is at an acceptable value helps ensure that the data used for determining the acoustic profile reflects the individual's real voice. If the comfort level is below the acceptable threshold, the wearable system (e.g., processor 1412) may initiate actions such as delaying a process, providing guidance to the individual, or adjusting system parameters to improve comfort.

[0202] In some embodiments, a wearable system may determine a state of the individual during the second time period, and dynamically adjust the presentation manner based on the determined state. In this context, the term “state of the individual” refers to the current condition of the individual, which may encompass various physiological or psychological factors that can influence speech perception. Determining the state of the individual during the second time period involves analyzing various inputs or captured data to assess the individual's current condition. This may include monitoring physiological parameters such as blood pressure, heart rate, and other vital signs, analyzing speech patterns, and considering environmental factors that may affect the individual's state. This may occur using one or more sensors in the system. In related embodiments, the state of the individual includes at least one of: an emotional state or a physical condition. An emotional state refers to the current mood-related condition of the individual (e.g., feelings such as happiness, sadness, anger, or anxiety), and a physical condition refers to the current physiological factors of the individual (e.g., fatigue, illness, or physical exertion). By considering both emotional states and physical conditions, the wearable system can provide a more comprehensive and nuanced approach to adjusting the presentation manner of synthesized speech. Specifically, the wearable system may use the information on the state of the individual to dynamically adjust the presentation manner of the synthesized speech. For example, if the system detects that the individual is in a state of heightened stress or excitement, it may modify the tempo or pitch of the synthesized speech to reflect increased enthusiasm.
Conversely, if the system detects a physical condition of fatigue, it may slow down the tempo of the synthesized speech or increase its clarity to compensate for potential cognitive fatigue.

[0203] Referring back to Fig. 15, second time period 1502B of process 1500 includes identifying a second plurality of words subvocalized by the individual (step 1510), determining the state of the individual to refine the presentation manner of synthesized speech (step 1512), determining the ambient noise level using additional audio sensors to allow the system to adapt its output to different acoustic environments (step 1514), and determining the presentation manner (step 1516) based on at least one of the results from step 1508 (determining the acoustic profile in the first time period), step 1512 (determining the state of the individual in the second time period), and step 1514 (determining ambient noise level in the second time period). Step 1516 may involve weighing these various factors to optimize the synthesized speech output. Finally, in step 1518, the system synthesizes the second plurality of words based on the determined presentation manner. These steps collectively enable the system to provide a personalized and context-aware silent speech self-presentation experience, adapting in real time to the individual's subvocalized speech, internal state, and external environment.
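One way to picture how the acoustic profile, the individual's state, and the ambient noise level might be combined into a single presentation manner is the sketch below. Everything in it is an assumption made for illustration: the `Manner` fields, the state labels, the 10% tempo changes, and the fixed 6 dB boost are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Manner:
    """Hypothetical presentation manner: pitch, relative speaking rate, gain."""
    pitch_hz: float
    rate: float
    gain_db: float


def determine_manner(profile, state=None, noise_dbfs=None, threshold_dbfs=-30.0):
    """Start from the acoustic profile, then adjust for state and ambient noise."""
    manner = Manner(profile["pitch_hz"], profile["rate"], 0.0)
    if state == "fatigued":
        manner = replace(manner, rate=manner.rate * 0.9)   # slow down slightly
    elif state == "excited":
        manner = replace(manner, rate=manner.rate * 1.1)   # speed up slightly
    if noise_dbfs is not None and noise_dbfs > threshold_dbfs:
        manner = replace(manner, gain_db=6.0)              # louder in noise
    return manner
```

For instance, `determine_manner({"pitch_hz": 120.0, "rate": 1.0}, state="fatigued", noise_dbfs=-10.0)` yields a slower and louder manner than the profile alone would.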

[0204] Some embodiments involve updating the acoustic profile based on an additional set of signals of the second signal type. The term “updating” refers to modifying, refining, or adjusting existing information or parameters based on new data or inputs. In the context of the present disclosure, updating may involve incorporating new information to refine an existing acoustic profile. In some cases, the acoustic profile is updated based on an additional set of signals of the second signal type. The additional set of signals may include new signals generated by the at least one audio sensor, which correspond to sounds within the ear canal of the individual. This additional set of signals may be collected during subsequent use of the wearable system or during specific calibration periods. Updating the acoustic profile based on an additional set of signals of the second signal type allows the wearable system to adapt and improve its performance over time. By continuously refining the acoustic profile, the system may maintain or enhance the accuracy and naturalness of the synthesized speech output. Updating the acoustic profile may involve various techniques, such as adaptive filtering, machine learning, or statistical analysis. These methods may compare new data derived from the additional set of signals of the second signal type with the existing acoustic profile, identifying discrepancies or trends that warrant adjustments. The system may then modify specific parameters within the acoustic profile, such as volume levels, pitch, timbre, tempo, or prosody, to better align with the updated information.
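A simple statistical way to fold new measurements into an existing profile is an exponential moving average, sketched below. This is one possible technique, not necessarily the one used in the disclosed system; the dictionary representation of the profile, the parameter names, and the default update rate are assumptions for the example.

```python
def update_profile(profile, observation, rate=0.1):
    """Blend newly measured parameters into an acoustic profile.

    Each parameter moves a fraction `rate` of the way toward the new
    measurement; parameters not seen before are adopted as-is.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    updated = dict(profile)
    for key, new_value in observation.items():
        if key in profile:
            updated[key] = (1.0 - rate) * profile[key] + rate * new_value
        else:
            updated[key] = new_value
    return updated
```

A small `rate` makes the profile stable against one-off measurements, while a larger one lets it track genuine changes in the individual's voice more quickly.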

[0205] Related embodiments involve updating the acoustic profile based on user input. The term “user input” refers to any form of information or commands provided by the individual using the wearable system. This may include verbal commands, physical interactions with the device, or inputs through a user interface. Updating the acoustic profile based on user input may involve a dedicated calibration mode or ongoing adjustments during regular use. The system may present the user with options to modify various aspects of the synthesized speech, such as increasing or decreasing volume, adjusting pitch, or altering the speaking rate. As the individual provides feedback or makes selections, the system may update the corresponding parameters in the acoustic profile. In some cases, the user input may be indicative of preference parameters. The term “indicative” refers to something that conveys, signals, or demonstrates a particular characteristic, quality, or condition. Preference parameters refer to specific settings or characteristics that the individual finds most comfortable for the synthesized speech output. These parameters may relate to various aspects of the acoustic profile, such as volume levels, pitch, speaking rate, or tonal qualities. Updating the acoustic profile based on user input allows for personalization and fine-tuning of the synthesized speech output.

[0206] In other embodiments, during a third time period subsequent to the second time period, the wearable system is configured to use an additional set of signals of the first signal type to identify a third plurality of words subvocalized by the individual. The additional set of signals of the first signal type refers to a new collection of signals associated with neuromuscular activity in a non-lip region of the individual detected when the individual subvocalized. As described above, the wearable system can determine a third plurality of words from the additional set of signals associated with neuromuscular activity detected during the third time period. In some cases, the wearable system may determine a change between the third set of signals of the first signal type and the additional set of signals of the first signal type. The term “determine a change” refers to identifying differences, variations, or trends between two sets of data or signals. In this case, the system compares signals associated with detected speech subvocalized during the second time period and signals associated with detected speech subvocalized during the third time period. This analysis may involve comparing signal characteristics such as amplitude, frequency content, or temporal patterns to identify any significant differences or trends in the manner of subvocalization or the physical engagement associated with it. For example, the system might detect decreased amplitude in the signals, indicating less intense physical engagement during subvocalization, even if the subvocalized words are the same. Alternatively, it might observe increased frequency content, suggesting a higher pitch in subvocalization, which could be associated with increased excitement or stress while subvocalizing the same words.
Changes in temporal patterns, such as longer intervals between signal bursts, might indicate a slower pace of subvocalization or different emphasis patterns, regardless of the actual words being subvocalized. These changes reflect variations in how the individual is performing the act of subvocalization, rather than changes in the content of the subvocalized speech itself. Thereafter, based on the change, the wearable system may determine an updated presentation manner for an audible presentation of the third plurality of words. The detected change may indicate a desired change in presentation manner. For instance, if decreased amplitude is detected, the system might adjust the volume of the synthesized speech to be softer. If temporal patterns indicate a slower pace, the system might reduce the speed of the synthesized speech, inserting slight pauses between words to match the individual’s rhythm. This dynamic adaptation in the presentation manner aims to maintain a comfortable experience for the individual by adapting the audible output to match their current subvocalization patterns and presumed intentions.
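The comparison described above can be sketched with two simple features: mean absolute amplitude and the mean gap between signal bursts. The sketch assumes those gaps have already been measured elsewhere; the function names, the dictionary keys, and the clamping range of 0.5 to 2.0 are hypothetical choices for illustration.

```python
def mean_abs(signal):
    """Mean absolute amplitude of a non-empty signal."""
    if not signal:
        raise ValueError("signal must not be empty")
    return sum(abs(s) for s in signal) / len(signal)


def clamp(value, low=0.5, high=2.0):
    return max(low, min(high, value))


def updated_manner(manner, prev_signal, new_signal, prev_gap_s, new_gap_s):
    """Scale volume and speaking rate by the change between two signal sets.

    Lower amplitude gives a softer voice; longer gaps between bursts give
    slower speech. The previous signal is assumed to be non-silent.
    """
    amp_ratio = clamp(mean_abs(new_signal) / mean_abs(prev_signal))
    pace_ratio = clamp(prev_gap_s / new_gap_s)
    return {
        "volume": manner["volume"] * amp_ratio,
        "rate": manner["rate"] * pace_ratio,
    }
```

Clamping the ratios is a conservative choice so that a single unusually weak or slow utterance does not swing the output too far.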

[0207] Referring back to Fig. 15, third time period 1502C includes identifying additional input from the individual (step 1520). This additional input may include new subvocalized speech, vocalized speech, or user feedback. The system may use various sensors and input methods to capture this information, potentially including the neuromuscular sensors and in-ear audio sensors described earlier. Thereafter, process 1500 includes updating the acoustic profile based on the additional input (step 1522). This update may incorporate new data about the individual’s speech patterns, changes in their vocal characteristics, or explicit preferences they have expressed. The system may use machine learning or adaptive filtering to refine the acoustic profile, ensuring it remains an accurate representation of the individual’s voice and speech perception. Finally, process 1500 includes synthesizing a third plurality of words based on the updated acoustic profile and the determined presentation manner (step 1524). This synthesis may reflect any changes or refinements made to the acoustic profile, potentially resulting in a more natural and personalized audible presentation of the individual’s subvocalized speech. These steps collectively represent an iterative process of continuous improvement, allowing the wearable system to adapt to changes in the individual’s speech patterns, preferences, or environmental conditions over time. This ongoing refinement may help maintain or enhance the system's accuracy and user comfort during extended use.

[0208] By way of non-limiting example, reference is made to Fig. 16, which illustrates a flowchart of a process 1600 for silent speech self-presentation. In some disclosed embodiments, process 1600 may be performed by at least one processor (e.g., processor 1412) to perform operations or functions described herein. In some embodiments, some aspects of process 1600 may be implemented as software (e.g., program codes or instructions) that are stored in a memory or a non-transitory computer readable medium. In some embodiments, some aspects of process 1600 may be implemented as hardware (e.g., a specific purpose circuit). In some embodiments, process 1600 may be implemented as a combination of software and hardware.

[0209] Process 1600 includes step 1602 that involves generating signals of a first signal type associated with neuromuscular activity in a non-lip region of a head of an individual. The generation of signals of the first signal type may be accomplished using a sensor to detect subtle muscle movements or nerve impulses related to speech production in areas such as the jaw, throat, or temples. Process 1600 continues with step 1604 that involves generating signals of a second signal type associated with sounds detected inside an ear canal of the individual. The generation of signals of the second signal type may be accomplished using an audio sensor configured to capture sounds within the ear canal, providing insight into how the individual actually hears sounds, including their own voice. Process 1600 continues with step 1606 that involves, during a first time period, using a first set of signals of the first signal type to identify a first plurality of words vocalized by the individual. This step may include analyzing the neuromuscular signals to confirm that it is the individual that vocalized the first plurality of words. Process 1600 continues with step 1608 that involves using a second set of signals of the second signal type to determine an acoustic profile indicative of a presentation characteristic associated with perception of the first plurality of words by the individual. The determination of the acoustic profile may be based on an analysis of sounds captured within the ear canal and reflective of how the individual perceives their own voice.

[0210] Process 1600 then moves to a second time period subsequent to the first time period. Step 1610 involves using a third set of signals of the first signal type to identify a second plurality of words subvocalized by the individual. The third set of signals is generated by the at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of an individual. Based on the acoustic profile, step 1612 involves determining a presentation manner for an audible presentation of the second plurality of words. This step may take into account the individual's unique voice characteristics and their perception of their own voice. Finally, in step 1614, process 1600 synthesizes the second plurality of words in the determined presentation manner. This may involve using speech synthesis techniques to generate audible output that closely matches the individual's natural voice and speaking style.

[0211] Various example embodiments of speech detection technology are articulated below in the form of clauses. It is to be understood that the term “technology” refers equally to systems, methods, and non-transitory computer readable media:

[0212] Examples of inventive concepts are contained in the following clauses which are an integral part of this disclosure:

Clause 1. A system, method, or computer readable media for providing feedback on subvocalization data, comprising: at least one processor configured to: determine subvocalization data obtained via a wearable detector worn by an individual, wherein the subvocalization data corresponds to physical engagement of the individual; analyze the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit; and provide feedback to the individual based on the determination.

Clause 2. The technology of clause 1, wherein the wearable detector includes at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of the individual and to generate signals associated with the neuromuscular activity, and wherein the at least one processor is configured to determine the subvocalization data from the signals obtained from the at least one sensor.

Clause 3. The technology of each preceding clause, wherein the at least one sensor includes at least one of: a light detector configured to detect reflections of light projected from a light source; a touch sensor configured to detect skin deformations caused by muscle engagement; or an electrode configured to detect electrical signals transmitted through cranial nerves.

Clause 4. The technology of one or more of the preceding clauses, wherein the physical engagement is indicative of facial muscle activity or facial nerve activity.

Clause 5. The technology of one or more of the preceding clauses, wherein the subvocalized linguistic unit includes at least one of: a phoneme, a morpheme, a word, a clause, or a sentence.

Clause 6. The technology of one or more of the preceding clauses, wherein the feedback provides phoneme-level guidance or word-level guidance for improving silent speech effectiveness of the individual.

Clause 7. The technology of one or more of the preceding clauses, wherein the feedback includes a continuous output, and wherein the at least one processor is configured to detect a change in a level of the physical engagement and to adjust the feedback accordingly.

Clause 8. The technology of one or more of the preceding clauses, wherein the feedback includes at least one of: a visual output, an audible output, or a haptic output.

Clause 9. The technology of one or more of the preceding clauses, wherein the at least one processor is configured to select a feedback output characteristic based on information other than the subvocalization data.

Clause 10. The technology of one or more of the preceding clauses, wherein the feedback output characteristic includes at least one of: a volume level, an audible pattern, a visual pattern.

Clause 11. The technology of one or more of the preceding clauses, wherein the information other than the subvocalization data includes at least one of: an environment of the individual, stored user preferences, or an operational system status.

Clause 12. The technology of one or more of the preceding clauses, wherein when physical engagement associated with a specific subvocalized linguistic unit is insufficient for interpretation, the at least one processor is configured to estimate a value of the specific subvocalized linguistic unit based on the subvocalization data and additional data.

Clause 13. The technology of one or more of the preceding clauses, wherein the additional data used to estimate the value of the specific subvocalized linguistic unit is received from one or more machine learning models trained using prior subvocalization signals as training examples.

Clause 14. The technology of one or more of the preceding clauses, wherein the additional data used to estimate the value of the specific subvocalized linguistic unit includes at least one of: historical subvocalization patterns, linguistic context, or environmental data.

Clause 15. The technology of one or more of the preceding clauses, wherein if the determined value of the specific subvocalized linguistic unit corresponds to a phoneme, the feedback includes phoneme replacement with substitute output.

Clause 16. The technology of one or more of the preceding clauses, wherein prior to providing feedback, the at least one processor is configured to determine an appropriateness of providing feedback and to withhold feedback when timing is determined to be inappropriate.

Clause 17. The technology of one or more of the preceding clauses, wherein the at least one processor is configured to determine the appropriateness based on a psychological state or a physiological state of the individual.

Clause 18. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to: receive first subvocalization data indicative of a first facial muscle engagement; analyze the first subvocalization data to determine that a level of the first facial muscle engagement is insufficient for ascertaining a first subvocalized linguistic unit; provide ongoing feedback to the individual indicating that the level of the first facial muscle engagement is insufficient for subvocalization determination; while providing the ongoing feedback, receive second subvocalization data indicative of a second facial muscle engagement; analyze the second subvocalization data to determine that a level of the second facial muscle engagement is sufficient for subvocalization determination; and cease the ongoing feedback when the second facial muscle engagement is determined to be sufficient.

Clause 19. The technology of one or more of the preceding clauses, wherein the feedback is included in an audible presentation of the subvocalized linguistic unit delivered to the individual.

Clause 20. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to generate an additional audible presentation for delivery to a third party, the additional audible presentation differing from the audible presentation delivered to the individual.

Clause 21. The technology of one or more of the preceding clauses, wherein the audible presentation of the subvocalized linguistic unit delivered to the individual is associated with a first quality level, and the additional audible presentation delivered to the third party is associated with a second quality level that is higher than the first quality level.

Clause 22. A system, method, or computer readable media for silent speech self-presentation, either alone or in combination with one or more of the preceding clauses, comprising: at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of an individual and generate signals of a first signal type associated with the detected neuromuscular activity; at least one audio sensor configured to detect sounds within an ear canal of the individual and generate signals of a second signal type associated with the detected sounds; at least one processor configured to: during a first time period, use a first set of signals of the first signal type to identify a first plurality of words vocalized by the individual; use a second set of signals of the second signal type to determine an acoustic profile indicative of a presentation characteristic associated with perception of the first plurality of words by the individual; during a second time period subsequent to the first time period, use a third set of signals of the first signal type to identify a second plurality of words subvocalized
by the individual; based on the acoustic profile, determine a presentation manner for an audible presentation of the second plurality of words; and synthesize the second plurality of words in the determined presentation manner.3425 Clause 23. The technology of one or more of the preceding clauses, wherein the at least one sensor includes at least one of: a light detector configured to detect reflections of light projected from a light source, a touch sensor configured to detect skin deformations caused by muscle engagement, or an electrode configured to detect electrical signals transmitted through cranial nerves.3430 Clause 24. The technology of one or more of the preceding clauses, wherein acoustic profile includes values for at least one of: volume, pitch, timbre, tempo, or prosody.Clause 25. The technology of one or more of the preceding clauses, wherein the at least one processor determines the presentation manner to replicate how the individual perceives their own voice.3435 Clause 26. The technology of one or more of the preceding clauses, wherein, prior to determining the acoustic profile from the second set of signals of the second signal type, the at least one processor is configured to verify that the first plurality of words are pronounced while a comfort level of the individual is at an acceptable value.Clause 27. The technology of one or more of the preceding clauses, wherein the at least one3440 processor is further configured to determine a state of the individual during the second time period, and to dynamically adjust the presentation manner based on the state.Clause 28. The technology of one or more of the preceding clauses, wherein the state of the individual includes at least one of: an emotional state or a physical condition.Clause 29. 
The technology of one or more of the preceding clauses, further comprising an3445 additional at least one audio sensor configured to detect ambient noise outside the ear canal and to generate signals of a third signal type associated with ambient noise.Clause 30. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to analyze an additional set of signals of the third signal type for determining an ambient noise level during the second time period, and to dynamically3450 adjust the presentation manner based on the determined ambient noise level.Clause 31. The technology of one or more of the preceding clauses, wherein a volume at which the second plurality of words is synthesized is increased when the determined ambient noise level exceeds a predefined threshold.Clause 32. The technology of one or more of the preceding clauses, wherein the at least one3455 processor is further configured to analyze an additional set of the third signal type to identify background noise during the second time period.Clause 33. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to cancel the background noise while synthesizing the second plurality of words in the determined presentation manner.3460 Clause 34. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to update the acoustic profile based on an additional set of signals of the second signal type.Clause 35. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to update the acoustic profile based on user input indicative3465 of preference parameters.Clause 36. 
The technology of one or more of the preceding clauses, wherein during a third time period subsequent to the second time period, the at least one processor is configured to use an additional set of signals of the first signal type to identify a third plurality of words subvocalized by the individual; determine a change between the third set of signals of the3470 first signal type and the additional set of signals of the first signal type; and determine an updated presentation manner for an audible presentation of the third plurality of words.Clause 37. The technology of one or more of the preceding clauses, wherein the at least one processor is further configured to make a determination whether a physical engagement associated with the second time period is sufficient for ascertaining the second plurality of3475 words, and provide feedback to the individual based on the determination.Clause 38. The technology of one or more of the preceding clauses, wherein the feedback includes phoneme-level guidance or word-level guidance for improving silent speech effectiveness of the individual.Clause 39. The technology of one or more of the preceding clauses, wherein the at least one3480 processor is further configured to make a determination whether environmental lighting conditions associated with the second time period interferes with the determination of the second plurality of words and provide feedback to the individual based on the determination.
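By way of non-limiting illustration, the sufficiency determination of Clause 1 and the ongoing-feedback behavior of Clause 18 may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the normalized `engagement_level`, the fixed `SUFFICIENCY_THRESHOLD`, and the names `SubvocalizationSample`, `is_sufficient`, and `feedback_events` are hypothetical and do not appear in the disclosure, which does not specify how sufficiency is computed.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical cutoff; the disclosure does not state how sufficiency is decided.
SUFFICIENCY_THRESHOLD = 0.6


@dataclass
class SubvocalizationSample:
    """One window of detector output, reduced to an assumed 0..1 engagement level."""
    engagement_level: float


def is_sufficient(sample: SubvocalizationSample,
                  threshold: float = SUFFICIENCY_THRESHOLD) -> bool:
    """Decide whether the engagement is sufficient to ascertain a linguistic unit."""
    return sample.engagement_level >= threshold


def feedback_events(samples: Iterable[SubvocalizationSample]) -> List[str]:
    """Return one feedback event per sample.

    Feedback starts when engagement is insufficient, continues while it stays
    insufficient, and ceases once a later sample is sufficient.
    """
    events: List[str] = []
    feedback_active = False
    for sample in samples:
        if is_sufficient(sample):
            events.append("cease_feedback" if feedback_active else "no_feedback")
            feedback_active = False
        else:
            events.append("continue_feedback" if feedback_active else "start_feedback")
            feedback_active = True
    return events
```

For example, a sequence of engagement levels 0.2, 0.3, 0.8 would start feedback, continue it, and then cease it once the third sample crosses the assumed threshold. In practice the determination could equally come from a trained model rather than a fixed cutoff.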

[0213] Disclosed e...

Claims

1. A wearable system for providing feedback on subvocalization data, the wearable system comprising: at least one processor configured to: determine subvocalization data obtained via a wearable detector worn by an individual, wherein the subvocalization data corresponds to physical engagement of the individual; analyze the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit; and provide feedback to the individual based on the determination.

2. The wearable system of claim 1, wherein the wearable detector includes at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of the individual and to generate signals associated with the neuromuscular activity, and wherein the at least one processor is configured to determine the subvocalization data from the signals obtained from the at least one sensor.

3. The wearable system of claim 2, wherein the at least one sensor includes at least one of: a light detector configured to detect reflections of light projected from a light source; a touch sensor configured to detect skin deformations caused by muscle engagement; or an electrode configured to detect electrical signals transmitted through cranial nerves.

4. The wearable system of claim 1, wherein the physical engagement is indicative of facial muscle activity or facial nerve activity.

5. The wearable system of claim 1, wherein the subvocalized linguistic unit includes at least one of: a phoneme, a morpheme, a word, a clause, or a sentence.

6. The wearable system of claim 1, wherein the feedback provides phoneme-level guidance or word-level guidance for improving silent speech effectiveness of the individual.

7. The wearable system of claim 1, wherein the feedback includes a continuous output, and wherein the at least one processor is configured to detect a change in a level of the physical engagement and to adjust the feedback accordingly.

8. The wearable system of claim 1, wherein the feedback includes at least one of: a visual output, an audible output, or a haptic output.

9. The wearable system of claim 1, wherein the at least one processor is configured to select a feedback output characteristic based on information other than the subvocalization data.

10. The wearable system of claim 9, wherein the feedback output characteristic includes at least one of: a volume level, an audible pattern, or a visual pattern.

11. The wearable system of claim 9, wherein the information other than the subvocalization data includes at least one of: an environment of the individual, stored user preferences, or an operational system status.

12. The wearable system of claim 1, wherein when physical engagement associated with a specific subvocalized linguistic unit is insufficient for interpretation, the at least one processor is configured to estimate a value of the specific subvocalized linguistic unit based on the subvocalization data and additional data.

13. The wearable system of claim 12, wherein the additional data used to estimate the value of the specific subvocalized linguistic unit is received from one or more machine learning models trained using prior subvocalization signals as training examples.

14. The wearable system of claim 12, wherein the additional data used to estimate the value of the specific subvocalized linguistic unit includes at least one of: historical subvocalization patterns, linguistic context, or environmental data.

15. The wearable system of claim 12, wherein if the determined value of the specific subvocalized linguistic unit corresponds to a phoneme, the feedback includes phoneme replacement with substitute output.

16. The wearable system of claim 12, wherein prior to providing feedback, the at least one processor is configured to determine an appropriateness of providing feedback and to withhold feedback when timing is determined to be inappropriate.

17. The wearable system of claim 16, wherein the at least one processor is configured to determine the appropriateness based on a psychological state or a physiological state of the individual.

18. The wearable system of claim 1, wherein the at least one processor is further configured to: receive first subvocalization data indicative of a first facial muscle engagement; analyze the first subvocalization data to determine that a level of the first facial muscle engagement is insufficient for ascertaining a first subvocalized linguistic unit; provide ongoing feedback to the individual indicating that the level of the first facial muscle engagement is insufficient for subvocalization determination; while providing the ongoing feedback, receive second subvocalization data indicative of a second facial muscle engagement; analyze the second subvocalization data to determine that a level of the second facial muscle engagement is sufficient for subvocalization determination; and cease the ongoing feedback when the second facial muscle engagement is determined to be sufficient.

19. The wearable system of claim 1, wherein the feedback is included in an audible presentation of the subvocalized linguistic unit delivered to the individual.

20. The wearable system of claim 19, wherein the at least one processor is further configured to generate an additional audible presentation for delivery to a third party, the additional audible presentation differing from the audible presentation delivered to the individual.

21. The wearable system of claim 20, wherein the audible presentation of the subvocalized linguistic unit delivered to the individual is associated with a first quality level, and the additional audible presentation delivered to the third party is associated with a second quality level that is higher than the first quality level.

22. A method for providing feedback on subvocalization data, the method comprising: determining subvocalization data obtained via a wearable detector worn by an individual, wherein the subvocalization data corresponds to physical engagement of the individual; analyzing the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit; and providing feedback to the individual based on the determination.

23. A non-transitory computer readable medium containing instructions that when executed by at least one processor cause the at least one processor to perform operations for providing feedback on subvocalization data, the operations comprising: determining subvocalization data obtained via a wearable detector worn by an individual, wherein the subvocalization data corresponds to physical engagement of the individual; analyzing the subvocalization data to make a determination whether the physical engagement is sufficient for ascertaining a subvocalized linguistic unit; and providing feedback to the individual based on the determination.

24. A wearable system for silent speech self-presentation, the wearable system comprising: at least one sensor configured to detect neuromuscular activity in a non-lip region of a head of an individual and generate signals of a first signal type associated with the detected neuromuscular activity; at least one audio sensor configured to detect sounds within an ear canal of the individual and generate signals of a second signal type associated with the detected sounds; at least one processor configured to: during a first time period, use a first set of signals of the first signal type to identify a first plurality of words vocalized by the individual; use a second set of signals of the second signal type to determine an acoustic profile indicative of a presentation characteristic associated with perception of the first plurality of words by the individual; during a second time period subsequent to the first time period, use a third set of signals of the first signal type to identify a second plurality of words subvocalized by the individual; based on the acoustic profile, determine a presentation manner for an audible presentation of the second plurality of words; and synthesize the second plurality of words in the determined presentation manner.

25. The wearable system of claim 24, wherein the at least one sensor includes at least one of: a light detector configured to detect reflections of light projected from a light source, a touch sensor configured to detect skin deformations caused by muscle engagement, or an electrode configured to detect electrical signals transmitted through cranial nerves.

26. The wearable system of claim 24, wherein the acoustic profile includes values for at least one of: volume, pitch, timbre, tempo, or prosody.

27. The wearable system of claim 24, wherein the at least one processor determines the presentation manner to replicate how the individual perceives their own voice.

28. The wearable system of claim 24, wherein, prior to determining the acoustic profile from the second set of signals of the second signal type, the at least one processor is configured to verify that the first plurality of words are pronounced while a comfort level of the individual is at an acceptable value.

29. The wearable system of claim 24, wherein the at least one processor is further configured to determine a state of the individual during the second time period, and to dynamically adjust the presentation manner based on the state.

30. The wearable system of claim 29, wherein the state of the individual includes at least one of: an emotional state or a physical condition.

31. The wearable system of claim 24, further comprising an additional at least one audio sensor configured to detect ambient noise outside the ear canal and to generate signals of a third signal type associated with ambient noise.

32. The wearable system of claim 31, wherein the at least one processor is further configured to analyze an additional set of signals of the third signal type for determining an ambient noise level during the second time period, and to dynamically adjust the presentation manner based on the determined ambient noise level.

33. The wearable system of claim 32, wherein a volume at which the second plurality of words is synthesized is increased when the determined ambient noise level exceeds a predefined threshold.

34. The wearable system of claim 31, wherein the at least one processor is further configured to analyze an additional set of the third signal type to identify background noise during the second time period.

35. The wearable system of claim 34, wherein the at least one processor is further configured to cancel the background noise while synthesizing the second plurality of words in the determined presentation manner.

36. The wearable system of claim 24, wherein the at least one processor is further configured to update the acoustic profile based on an additional set of signals of the second signal type.

37. The wearable system of claim 24, wherein the at least one processor is further configured to update the acoustic profile based on user input indicative of preference parameters.

38. The wearable system of claim 24, wherein during a third time period subsequent to the second time period, the at least one processor is configured to use an additional set of signals of the first signal type to identify a third plurality of words subvocalized by the individual; determine a change between the third set of signals of the first signal type and the additional set of signals of the first signal type; and determine an updated presentation manner for an audible presentation of the third plurality of words.

39. The wearable system of claim 24, wherein the at least one processor is further configured to make a determination whether a physical engagement associated with the second time period is sufficient for ascertaining the second plurality of words, and provide feedback to the individual based on the determination.

40. The wearable system of claim 39, wherein the feedback includes phoneme-level guidance or word-level guidance for improving silent speech effectiveness of the individual.

41. The wearable system of claim 24, wherein the at least one processor is further configured to make a determination whether environmental lighting conditions associated with the second time period interfere with the determination of the second plurality of words, and provide feedback to the individual based on the determination.

42. A method for silent speech self-presentation, the method comprising: generating signals of a first signal type associated with neuromuscular activity in a non-lip region of a head of an individual; generating signals of a second signal type associated with sounds detected inside an ear canal of the individual; during a first time period, using a first set of signals of the first signal type to identify a first plurality of words vocalized by the individual; using a second set of signals of the second signal type to determine an acoustic profile indicative of a presentation characteristic associated with perception of the first plurality of words by the individual; during a second time period subsequent to the first time period, using a third set of signals of the first signal type to identify a second plurality of words subvocalized by the individual; based on the acoustic profile, determining a presentation manner for an audible presentation of the second plurality of words; and synthesizing the second plurality of words in the determined presentation manner.

43. A non-transitory computer readable medium containing instructions that when executed by at least one processor cause the at least one processor to perform operations for silent speech self-presentation, the operations comprising: generating signals of a first signal type associated with neuromuscular activity in a non-lip region of a head of an individual; generating signals of a second signal type associated with sounds detected inside an ear canal of the individual; during a first time period, using a first set of signals of the first signal type to identify a first plurality of words vocalized by the individual; using a second set of signals of the second signal type to determine an acoustic profile indicative of a presentation characteristic associated with perception of the first plurality of words by the individual; during a second time period subsequent to the first time period, using a third set of signals of the first signal type to identify a second plurality of words subvocalized by the individual; based on the acoustic profile, determining a presentation manner for an audible presentation of the second plurality of words; and synthesizing the second plurality of words in the determined presentation manner.
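
By way of non-limiting illustration, the step of determining a presentation manner from an acoustic profile (claim 24), adjusted for ambient noise (claims 32 and 33), may be sketched as follows. This is a minimal sketch under stated assumptions and forms no part of the claims: the field names, the decibel units, and the constants `NOISE_THRESHOLD_DB`, `VOLUME_BOOST_DB`, and `MAX_VOLUME_DB` are hypothetical, and the output cap is an added assumption not recited in the claims.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AcousticProfile:
    """Assumed subset of profile values (the claims also list timbre and prosody)."""
    volume_db: float
    pitch_hz: float
    tempo_wpm: float


@dataclass(frozen=True)
class PresentationManner:
    """Parameters that a speech synthesizer would be asked to use."""
    volume_db: float
    pitch_hz: float
    tempo_wpm: float


# Hypothetical constants; the claims only recite "a predefined threshold".
NOISE_THRESHOLD_DB = 65.0
VOLUME_BOOST_DB = 6.0
MAX_VOLUME_DB = 85.0  # assumed output cap, not recited in the claims


def determine_presentation_manner(profile: AcousticProfile,
                                  ambient_noise_db: float) -> PresentationManner:
    """Start from the stored profile, then raise volume if ambient noise is high."""
    manner = PresentationManner(profile.volume_db, profile.pitch_hz, profile.tempo_wpm)
    if ambient_noise_db > NOISE_THRESHOLD_DB:
        boosted = min(manner.volume_db + VOLUME_BOOST_DB, MAX_VOLUME_DB)
        manner = replace(manner, volume_db=boosted)
    return manner
```

Only the volume is adjusted here; pitch and tempo pass through from the profile unchanged, so the synthesized output keeps the characteristics captured during the first time period.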