Selective communication updating based on eye monitoring

By using eye-tracking to stabilize real-time text streams based on user gaze, the system addresses visual instability in live captions and translations, ensuring a seamless and accurate reading experience.

WO2026142912A1 · PCT designated stage · Publication Date: 2026-07-02 · GOOGLE LLC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GOOGLE LLC
Filing Date
2025-12-17
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Real-time text streams, such as live captions and translations, experience visual instability due to frequent updates and corrections, which distract users and impair readability, especially on devices with limited display space or for individuals with visual impairments.

Method used

A system that uses eye-tracking to determine the user's focal region and selectively updates or delays corrections to text portions outside this region, prioritizing stability for already-read or soon-to-be-read text, thereby reducing visual noise and improving the reading experience.

Benefits of technology

The solution provides a smoother and more focused reading experience by minimizing unnecessary text updates, maintaining accuracy without sacrificing latency, and enhancing user engagement in immersive environments.

✦ Generated by Eureka AI based on patent content.

Abstract

A method includes receiving a stream of content. The method further includes receiving data related to a gaze of a user viewing a region of the content. The method further includes identifying an error related to at least one character within the content, based on a criterion. The method further includes modifying a portion of the content that includes the error based on a location of the error relative to the region.

Description

Atty Docket No. 0120-1136W01

SELECTIVE COMMUNICATION UPDATING BASED ON EYE MONITORING

BACKGROUND

[0001] Speech-to-text (STT) technologies are capable of converting spoken language into written text, for example using combinations of voice recognition and linguistic algorithms. For example, known STT technologies may capture sound vibrations and translate them into digital signals, and then use software to convert the digital signals into editable text that can be displayed for a user.

SUMMARY

[0002] This disclosure is related to a technology that makes reading live captions or real-time translations less distracting. Specifically, the systems and techniques described herein are related to stabilizing the display of real-time text streams (e.g., captions, translations, and the like) by selectively updating portions of a text feed based on where a user is looking. For example, the updates to a stream of text may be limited to portions of a real-time stream of text that are actively being looked at by a user. Additionally, updates may be delayed or omitted for portions of text that the user is not looking at, such as previously read sentences for example. The systems and techniques described herein may be implemented in a device that is capable of monitoring the gaze of a user, such as an XR device for example. The systems and techniques described herein may be used by such a device while a user of the device is viewing a real-time stream of content that includes text, such as a live translation or live captioning. The systems and techniques may also be implemented in, for example, any type of computing system.

[0003] In an aspect, a method includes receiving a stream of content. The method further includes receiving data related to a gaze of a user viewing a region of the content. The method further includes identifying an error related to at least one character within the content, based on a criterion. The method further includes modifying a portion of the content that includes the error based on a location of the error relative to the region.

[0004] In another aspect, a computing system includes a computer-readable storage media, at least one processor operatively coupled to the computer-readable storage media, and program instructions stored on the computer-readable storage media. When the program instructions are executed by the at least one processor, the at least one processor is directed to perform a method. The method includes receiving a stream of content. The method further includes receiving data related to a gaze of a user viewing a region of the content. The method further includes identifying an error related to at least one character within the content, based on a criterion. The method further includes modifying a portion of the content that includes the error based on a location of the error relative to the region.

[0005] In another aspect, a computer-readable storage medium has program instructions stored thereon. When the program instructions are executed by at least one processor, the at least one processor is directed to perform a method. The method includes receiving a stream of content. The method further includes receiving data related to a gaze of a user viewing a region of the content. The method further includes identifying an error related to at least one character within the content, based on a criterion. The method further includes modifying a portion of the content that includes the error based on a location of the error relative to the region.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] FIG. 1 is a block diagram of an example of a computing environment for selective text updating, using eye monitoring data, in accordance with the present disclosure.

[0007] FIGs. 2A-2D depict an example of selectively updating text of a stream of content in real time, using eye monitoring data of a user viewing the content.

[0008] FIG. 3 is a block diagram of an example of a process for selective text updating, using eye monitoring data, in accordance with the present disclosure.

[0009] FIG. 4 is a block diagram of an example of a computing system for selective text updating, using eye monitoring data, in accordance with the present disclosure.

DETAILED DESCRIPTION

[0010] The systems and techniques described herein provide a solution to the problem of visual instability in real-time text streams, such as live captions for video calls or real-time translations. A fundamental challenge of speech-to-text rendering is that as an automatic speech recognition system processes more of an utterance, it may refine its initial predictions, causing previously displayed text to change. Such updates can be distracting and can impair readability. The described solution uses gaze-tracking data to determine which part of a stream of content (e.g., text) a user is actively reading. It then prioritizes stability of portions of the text the user has already read or is not currently reading by preventing or delaying updates to those portions. For instance, in a live-captioned presentation, if the system initially transcribes a phrase incorrectly, it may delay correction of the error until the user's gaze is focused on or past the incorrect word, or may omit the correction altogether, for example if the user has already read the portion of text with the error. This may prevent having words change after the user has already read them or is about to read them, creating a smoother and more focused reading experience.

[0011] Specifically, the solution is a smart system that uses eye-tracking, for example in a pair of AR glasses, to know where a user’s gaze is focused within a stream of content. For example, the system may know exactly which word a user is currently reading. The system may use the gaze data to prevent changes from being made to text that the user has already read. Corrections of text of the stream of content may be limited to a portion of the text that the user is actively looking at, or to the text they haven't seen yet but will read soon, based on the gaze data. This makes the correction process feel much smoother and less disruptive for the user.

[0012] To illustrate, a user may be watching a foreign film on a device having a system for showing live-translated subtitles. A character may utter a long sentence, and the system may initially translate one or more words incorrectly. Without this technology, the entire subtitle might re-render and / or shift on the screen as the corrections are made, pulling the user’s attention away from the film. With this technology, the system would know that the user has already read the beginning of the sentence and would prevent corrections to that portion of the captions. The system may correct only a portion of the captions that the user is currently reading. This may make corrections to the captions subtle and seamless from the user’s perspective, which may result in a much more comfortable and immersive viewing experience for the user.

[0013] Live captions have gained popularity through their availability in remote video conferences, mobile applications, and the like. Unlike preprocessed subtitles or captions, live captions require real-time responsiveness by displaying results of interim speech-to-text operations. As prediction confidences for portions of content displayed in real-time including live captions change, one or more portions of the captions may be updated. Such caption updates may lead to visual instability that interferes with a user’s viewing experience.

[0014] The widespread adoption of automatic speech recognition (ASR) technology has made conversations more accessible, for example by enabling live captioning and / or real-time translation in remote conferencing software, mobile applications, head-worn displays, and the like. However, to maintain real-time responsiveness, live caption systems may display interim ASR predictions that may be updated subsequently, for example as new utterances are received and prediction confidences change. Real-time updating of portions of live captions may introduce visual instability to one or more portions of text. Such text instability in live captions may significantly impair a user’s reading experience. Users may be distracted by changes in the layout of a live caption (e.g., words moving between lines, the spacing between words changing, etc.), by the modification of words, by the adjustment of punctuation in live captions, and the like.

[0015] Live captioning and / or translation services have become indispensable tools for accessibility and communication, empowering millions of users globally. These real-time services may be based on processing pipelines (e.g., artificial intelligence (AI) pipelines) that transcribe and / or translate spoken language into text, thereby providing immediate access to information for individuals that are deaf or have difficulty hearing, language learners, and individuals in noisy environments. In an example implementation, live captioning and / or translation services may be at least partially performed by a speech-to-text application programming interface (API) executing on a computing device or computing system.

[0016] Live captioning and / or translation services may be implemented in automatic speech recognition (ASR) systems. While ASR systems have been widely adopted for services like voice assistants, computer-mediated communication tools and assistive applications, ASR is, by nature, imperfect. This has led to concerns over transcription quality. There is a hesitation to adopt ASR if it means that high-quality professional captioning is replaced with lower-quality automated solutions.

[0017] An extended reality (XR) device incorporates a spectrum of technologies that blend physical and virtual worlds, including virtual reality (VR), augmented reality (AR), and mixed reality (MR). These devices immerse users in digital environments, either by blocking out the real world (VR), overlaying digital content onto the real world (AR), or blending digital and physical elements seamlessly (MR). XR devices include glasses, headsets, or screens equipped with sensors, cameras, and displays that utilize the movement of users and their surroundings to deliver immersive experiences across various applications such as gaming, education, healthcare, on-the-go computing, and industrial training.

[0018] The concepts described herein are related to enhancing the accuracy of communications that include real-time text streams, such as live captioning and / or translation, by selectively updating one or more portions of a text stream based on eye-monitoring information associated with a reader of the text stream. The concepts described herein may be implemented, for example, within virtual reality (VR) or extended reality (XR) systems such as, for example, augmented reality (AR) glasses, head mounted display (HMD) devices, and the like. The concepts may also be used in, for example, any type of computing system.

[0019] At least one technical problem with known systems for enhancing the accuracy of real-time text streams, for example those generated by AI pipelines, is that they often cause visual instability to the text stream, for instance due to frequent updates and / or corrections. Updating and / or correcting may be referred to as modifying respective portions of the text stream. To illustrate, when accuracy enhancement of a real-time text stream is coupled with audio input in a foreign language, the results of real-time text enhancement may vary. For example, completeness of speech and / or partial speech input may lead to an incorrect interpretation of text corresponding to the speech. Live captions and / or live translation may require updates and / or corrections at a high frequency, with words or phrases being modified as confidence scores of predictions change. Such instability may be characterized by jittering and / or flickering text, for example as text based on inaccurate predictions is subsequently corrected, which may negatively impact readability and / or comprehension, such that quality of a user’s experience is reduced. These undesirable effects may be especially impactful on devices with smaller screens, where display space may be limited, such as mobile phones, smart watches, or the like, or to individuals with visual impairments.

[0020] At least some of the technical solutions described herein may be configured to leverage eye-monitoring data to selectively update portions of a real-time text stream that are currently within the gaze of a user. The technical solutions described herein may be configured such that updates are delayed or not performed to portions of text that are not within the gaze of the user. Stated differently, updates to a real-time text stream may be prioritized on a portion of the stream that a user is gazing at and updates to portions of the stream that are not within the gaze of the user are delayed or avoided. These technical solutions may be referred to as foveated stabilization of real-time text streams.

[0021] The technical solutions described herein may involve identifying one or more inaccuracies, for example an error related to at least one character of text, in a real-time stream of content that includes text. Such an error may be referred to as a text error. In the context of a streaming speech-to-text (STT) or automatic speech recognition (ASR) system, an inaccuracy often refers to an interim or low-confidence prediction that the ASR system subsequently refines as it processes more of the audio stream. For example, as a speaker completes a sentence, the additional context may allow the STT or ASR model to increase its confidence in an earlier part of the transcription and issue an updated, more accurate word or phrase.
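One way to picture how a refined prediction exposes text errors in an interim prediction is a word-level diff between the two. The sketch below is illustrative only and is not taken from the disclosure: the function name `changed_word_spans`, the whitespace tokenization, and the sample sentences are all assumptions. It uses Python's standard `difflib.SequenceMatcher` to list the spans of the interim text that the refined text would revise.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def changed_word_spans(interim: str, refined: str) -> List[Tuple[int, int, List[str]]]:
    """Return (start, end, replacement) for each revised span.

    start/end are word indices into the interim text (end exclusive);
    replacement is the list of refined words proposed for that span.
    """
    old_words = interim.split()
    new_words = refined.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete", or "insert"
            spans.append((i1, i2, new_words[j1:j2]))
    return spans


# Hypothetical interim and refined hypotheses for the same utterance.
revisions = changed_word_spans(
    "I scream is my favorite desert",
    "ice cream is my favorite dessert",
)
print(revisions)  # [(0, 2, ['ice', 'cream']), (5, 6, ['dessert'])]
```

Each returned span is a candidate text error; whether and when it is actually shown to the user would depend on where the span sits relative to the user's gaze.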

[0022] The technical solutions described herein may involve receiving gaze data related to a user viewing a stream of content that includes text in real time. The gaze data may be based on eye monitoring information of the user captured by one or more sensors of a device, such as infrared cameras of a wearable device that are pointed at the user’s eyes. The gaze data may be received continuously from the device. The gaze data may be used to identify a focal region within a stream of content that is representative of a portion of the content on which the user’s gaze is currently focused.

[0023] The technical solutions described herein may involve identifying one or more inaccuracies in a real-time stream of content that includes text. The inaccuracies may result from speech-to-text errors that occur for various reasons, such as audio quality issues, characteristics of a speaker’s speech, environmental noise, and the like. Such inaccuracies may include errors in the text produced by speech-to-text software or services, such as an error related to at least one character of the text within the content. Such speech-to-text errors may include, for example, misused words, gaps in content (e.g., missing words), mistakes in contextual prediction, and the like, and may be referred to as text errors and more generally as errors.

[0024] The technical solutions described herein may involve determining that one or more identified errors, such as text errors, should be modified (e.g., corrected), for example based on respective locations of the one or more errors relative to the focal region of the user (e.g., where the user’s gaze is focused). Priority may be given to correcting errors that are within the focal region of the content, since it corresponds to a portion of the stream of content where the user’s gaze is focused. Identified errors that are not within or near the focal region may be assigned lower priority, such that they may not be corrected or their correction may be delayed. For example, if errors are in portions of the text that are not within the focal region, they may not be corrected to avoid the introduction of visual noise that may reduce the reading experience of the user. The technical solutions described herein may include displaying corrections of errors, such as text errors, which were determined should be corrected.
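The prioritization described above can be sketched as a small decision function. This is one possible reading under stated assumptions, not the claimed method: the names `Action`, `FocalRegion`, and `decide`, the representation of the focal region as a range of word indices, and the `lookahead` threshold of three words are all hypothetical. The sketch omits corrections to words behind the focal region (treated as already read), applies corrections inside or just ahead of it, and delays the rest.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    APPLY = "apply"  # show the correction now
    DELAY = "delay"  # hold the correction until the gaze gets closer
    OMIT = "omit"    # never show the correction


@dataclass(frozen=True)
class FocalRegion:
    first_word: int  # index of the first word inside the region
    last_word: int   # index of the last word inside the region (inclusive)


def decide(error_index: int, region: FocalRegion, lookahead: int = 3) -> Action:
    """Choose what to do with a correction at word position error_index.

    lookahead is an assumed tuning parameter: how many words ahead of the
    focal region still count as "near" it.
    """
    if error_index < region.first_word:
        return Action.OMIT   # behind the gaze: assumed to be already read
    if error_index <= region.last_word + lookahead:
        return Action.APPLY  # inside or just ahead of the focal region
    return Action.DELAY      # far ahead: revisit as the gaze advances


region = FocalRegion(first_word=4, last_word=6)
print(decide(2, region), decide(5, region), decide(12, region))
```

A delayed correction would be re-evaluated whenever the focal region moves, so it is eventually applied as the reader approaches it.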

[0025] At least one technical benefit of these technical solutions is reducing or eliminating unnecessary updates to portions of text streams that fall outside the focal region of the gaze of the user. Reducing or eliminating unnecessary updates may reduce visual noise, which in turn may improve an overall reading experience of the user, without sacrificing accuracy and / or latency.

[0026] An example implementation of the concepts described herein may include a software and / or hardware process that is configured to receive a real-time stream of text, for example. The software and / or hardware process may be configured to receive gaze tracking information corresponding to a user that is gazing at (e.g., reading) the stream. The software and / or hardware process may be configured to identify a region within the portion of the stream that corresponds to a focal point of the user’s gaze. The software and / or hardware process may be configured to identify one or more portions of content, such as words, within and / or adjacent to the region that are inaccurate. The software and / or hardware process may be configured to replace the inaccurate content with modified content, and to display the modified content in place of the inaccurate content within the region.

[0027] In some implementations, the software and / or hardware process may be implemented within a service and / or application that outputs a stream of information, such as a stream of text, in real-time. For example, the software and / or hardware process may be implemented in a transcription service and / or application, a translation service and / or application, or the like. Such a service and / or application may be configured to employ AI, such as a language model, to generate all or a portion of the stream of information. In some implementations, the software and / or hardware process may be configured to be executable on one or more devices, such as AR glasses, an HMD device, a mobile phone, a wearable (e.g., a smart watch), a tablet, a laptop computer, a desktop computer, a television or other display medium, or the like.

[0028] FIG. 1 illustrates an example implementation of a computing environment 100 in which selectively updating text of a stream of content in real time may be implemented.

[0029] In the illustrated implementation, the computing environment 100 includes a user 110, a device 120, and an image 130 displayed to the user 110 by the device 120. The image 130 is a representation of an environment 140, such as an XR environment, which may include images of physical, real-world objects and virtual objects that are created, placed, and managed within the environment 140 by the device 120.

[0030] In accordance with the illustrated implementation of the computing environment 100, the environment 140 includes an example of selective, real-time updating of a stream of content that includes text. As shown, the environment 140 includes a graphical representation of a speaker 142 and a content window 144. The content window 144 may display a stream of content representative of speech that originates from the speaker 142. The graphical representation may be a real-time video feed of the speaker 142, a photo of the speaker 142, an avatar or other graphic representing the speaker 142, or the like. As shown, the content window 144 may include a stream of text 146 spoken by the speaker 142. The text 146 may be displayed in the native language spoken by the speaker 142, and / or may be a translation of the speech being spoken by the speaker 142.

[0031] In some implementations, the device 120 may be provided as an XR device, for example a head-mounted device such as a pair of AR glasses, a VR headset device, or the like. As shown, the device 120 may be provided with components including one or more sensors 122, the display 124, and a processing system 126. The one or more sensors 122 may be used to collect gaze data 128 related to the user 110. The gaze data 128 may be provided to the processing system 126.

[0032] In some implementations, the sensors 122 may include one or more image sensors (e.g., RGB cameras), thermal sensors (e.g., infrared imaging devices), gyroscopes, accelerometers, magnetometers, depth sensors, and audio sensors (e.g., microphones), in any combination. The sensors 122 are configured to sense and collect data relative to the user 110 and the environment 140, and to provide the data to the processing system 126. In some implementations, the device 120 may be configured to combine data sensed by one or more of the sensors 122.

[0033] In some implementations, the device 120 may be configured to leverage one or more of the sensors 122 to collect the gaze data 128. For example, in an example implementation the sensors 122 of the device 120 may include one or more high-speed infrared (IR) cameras mounted to the device 120 and aimed at one or both eyes of the user 110. The device 120 may further include one or more IR light-emitting diodes (LEDs) that safely and invisibly illuminate the user’s eyes, creating distinct reflection patterns, known as glints, on the surfaces of the corneas and making the pupils clearly visible. The device 120 may further include one or more internal cameras that may be used to capture images of the user’s eyes and of these reflection patterns. The images may be used in determining gaze data 128 related to the user 110, relative to the environment 140 displayed by the display 124. For example, the images may be provided to the processing system 126, which may determine the gaze data 128 for the user 110. Such processes of capturing image data of the user’s eyes and using the images to determine the gaze data 128 for the user 110 may be referred to interchangeably as eye monitoring, eye tracking, gaze monitoring, or gaze tracking.

[0034] In some implementations, the gaze data 128 may be determined by a discrete component of the device 120, such as an application-specific integrated circuit (ASIC) in combination with one or more sensors (e.g., IR cameras and / or IR lights). In some implementations, one or more components used to determine the gaze data 128 may be integrated with a hardware and / or software component of the processing system 126. In some implementations, the gaze data 128 may include a sequence of gaze coordinates and corresponding confidence scores obtained from eye monitoring. In some implementations, the gaze data 128 may be sourced from on-device sensors, such as the sensors 122, may be streamed from one or more wearable devices in communication with the device 120, or a combination thereof.
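A sequence of gaze coordinates with confidence scores lends itself to a simple data structure. The following is a minimal sketch under assumptions that do not come from the disclosure: the `GazeSample` type, the use of display-space pixel coordinates, the 0-to-1 confidence scale, and the `min_confidence` cutoff are all hypothetical. It discards low-confidence samples and returns a confidence-weighted average of the rest as a single estimated gaze point.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class GazeSample:
    x: float           # horizontal gaze coordinate (assumed display pixels)
    y: float           # vertical gaze coordinate (assumed display pixels)
    confidence: float  # assumed to lie between 0.0 and 1.0


def estimate_gaze_point(
    samples: Sequence[GazeSample], min_confidence: float = 0.5
) -> Optional[Tuple[float, float]]:
    """Confidence-weighted mean of recent samples, or None if none are usable."""
    usable = [s for s in samples if s.confidence >= min_confidence]
    total = sum(s.confidence for s in usable)
    if total == 0:
        return None  # e.g. during a blink, when tracking confidence drops
    x = sum(s.x * s.confidence for s in usable) / total
    y = sum(s.y * s.confidence for s in usable) / total
    return (x, y)


window = [
    GazeSample(100.0, 200.0, 1.0),
    GazeSample(110.0, 200.0, 1.0),
    GazeSample(900.0, 900.0, 0.1),  # low-confidence outlier, ignored
]
print(estimate_gaze_point(window))  # (105.0, 200.0)
```

Returning `None` when no sample is trustworthy lets a caller keep the previous focal region rather than jump to an unreliable position.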

[0035] The display 124 of the device 120 may be provided as one or more screens or projection surfaces that present immersive visual content of the environment 140 to the user 110. The display 124 may be configured to merge virtual objects with physical, real-world objects in the environment 140. Example implementations of the display 124 can include optical see-through displays on a head-mounted device (e.g., lenses of AR glasses or XR glasses) or video pass-through of a head-mounted device (e.g., screens in an MR device or VR headset device).

[0036] The processing system 126 of the device 120 may include one or more processors, such as central processing units (CPUs), graphics processing units (GPUs), specialized artificial intelligence (AI) processors, and the like. The processing system 126 may include one or more processors that reside in the device 120 (e.g., AR glasses, VR headset device, etc.), one or more processors or devices that are communicatively coupled to and share processing load with the device 120, such as processors of a mobile device (e.g., mobile phone, laptop, etc.), server processors, cloud-based processors, or the like, in any combination.

[0037] In some implementations, image data of the user's eyes (e.g., a plurality of images or a continuous stream of images) may be provided to the processing system 126 for determining the gaze data 128. For example, a stream of images captured by the sensors 122 may be provided to the processing system 126. The images may be processed by the processing system 126 using one or more computer vision algorithms. Such algorithms may identify the pupils and the respective positions of corneal glints. The processing system 126 may calculate, for example using the algorithms, geometric relationships between respective locations of portions of the user’s eyes (e.g., respective centers of the pupils) and respective locations of the glints, to determine respective directions the user’s eyes are pointing, with a high degree of accuracy. For example, the processing system 126, via the algorithms, may determine respective gaze vectors for the user’s eyes. The processing system 126 may perform the calculations in real time, such that the gaze vectors are dynamically updated, representing movement of the user's gaze over time.

[0038] The gaze vectors may be projected into the environment 140, allowing the processing system 126 to calculate one or more locations where the user’s gaze intersects with content being displayed on the display 124. To illustrate, in an example where a stream of content that includes text is being displayed on the display 124, for example in real time, the processing system may calculate, dynamically in real time, a region of the content that the user 110 is currently gazing at. This region may be referred to as a focal region relative to the user’s gaze. The processing system 126 may identify one or more specific portions of content that are within the focal region. The one or more specific portions of content may include individual characters, one or more words, one or more graphical images (e.g., emojis), and so on, in any combination.
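The projection of a gaze vector onto displayed content can be illustrated with a ray-plane intersection followed by a hit test against word bounding boxes. This is a simplified geometric sketch, not the disclosed implementation. It assumes the content lies on a flat plane at constant depth `plane_z` in the device's coordinate frame, that each word has a known axis-aligned bounding box (`WordBox`), and that the focal region is a circle of a chosen `radius` around the intersection point; all of these names and choices are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def intersect_display_plane(
    origin: Vec3, direction: Vec3, plane_z: float
) -> Optional[Tuple[float, float]]:
    """Intersect a gaze ray with the plane z = plane_z; return (x, y) or None."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    if abs(dz) < 1e-9:
        return None  # ray runs parallel to the content plane
    t = (plane_z - oz) / dz
    if t <= 0:
        return None  # content plane is behind the eye
    return (ox + t * dx, oy + t * dy)


@dataclass(frozen=True)
class WordBox:
    index: int    # position of the word in the text stream
    left: float
    top: float
    right: float
    bottom: float


def words_in_focal_region(
    point: Tuple[float, float], boxes: Sequence[WordBox], radius: float
) -> List[int]:
    """Indices of words whose box lies within radius of the gaze point."""
    px, py = point
    hits = []
    for box in boxes:
        # Closest point on the box to the gaze point.
        nx = min(max(px, box.left), box.right)
        ny = min(max(py, box.top), box.bottom)
        if (px - nx) ** 2 + (py - ny) ** 2 <= radius ** 2:
            hits.append(box.index)
    return hits


boxes = [
    WordBox(0, 0.00, 0.0, 0.10, 0.05),
    WordBox(1, 0.15, 0.0, 0.25, 0.05),
    WordBox(2, 0.50, 0.0, 0.60, 0.05),
]
gaze_point = intersect_display_plane((0.0, 0.0, 0.0), (0.1, 0.0, 1.0), plane_z=2.0)
if gaze_point is not None:
    print(words_in_focal_region(gaze_point, boxes, radius=0.06))  # [1]
```

With two eyes, the same intersection could be computed per eye and the results combined, for example by averaging; the sketch leaves that step out.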

[0039] In accordance with the illustrated implementation of the computing environment 100, the device 120 displays an image 130 to the user 110 that may include physical, real-world objects in the environment 140 and / or virtual objects generated within the environment 140, for example by the processing system 126. For example, as shown, the environment 140 includes an image of the speaker 142 and an overlaid graphical representation of the content window 144 having a stream of content that includes text 146 displayed therein. It should be appreciated that the stream of content may be independently presented within the image 130, omitting the content window 144.

[0040] In some implementations, the software and / or hardware process may be configured to receive a stream of content that includes text 146. For example, the stream of content may comprise text 146 that corresponds to speech originating from a speaker (not shown) who is speaking. In an implementation, the text 146 may include predicted text (e.g., automatic speech recognition (ASR) predicted text) that corresponds to output from an artificial intelligence (AI) pipeline, such as from a transcription process or a translation process carried out by a machine learning model, for example.

[0041] In some implementations, the machine learning model may be implemented as a Recurrent Neural Network (RNN), such as a Long Short-Term Memory (LSTM) or a Gated Recurrent Unit (GRU), as a Temporal Convolutional Network (TCN), or another temporal neural network. Outputs of the machine learning model may include, for example, predicted text determined with one or more speech-to-text processes or techniques.

[0042] In some implementations, training the machine learning model may include providing inputs from comprehensive data collection including a broad range of speech-to-text determinations in various environments, for example in noisy environments, quiet environments, indoor environments, and / or outdoor environments, speech-to-text determinations from a variety of acoustic features, for example various tones, pitches, levels of loudness, vocal timbres, etc., and / or a broad range of speech-to-text translations from various languages and / or regional dialects, and including contextual information, in any combination.

[0043] As shown, the content window 144 displays a real-time stream of content that includes text 146 corresponding to speech of the speaker. The text 146 may be displayed in the content window 144 in real-time as it is spoken by the speaker.

[0044] In some implementations, the software and / or hardware process may include gaze data 128 that corresponds to the user 110. In an example implementation, the gaze data 128 may be collected by the sensors 122 of the device 120 and provided to the processing system 126. The processing system 126 may use the gaze data 128 to determine a focal region 112 associated with the user 110, for example as described elsewhere herein. The focal region 112 may include, and be representative of, a portion of content of the stream that the user’s gaze is focused on. To illustrate, the user’s gaze may be focused on a portion of the text 146, such as one or more words for example, within a stream of content displayed in the content window 144. The focal region may alternatively be referred to as a region, relative to the portion of content of the stream that the user’s gaze is focused on.

[0045] As shown, the focal region 112 is bounded to represent a portion of the content that the user 110 is gazing at. The focal region 112 may be updated dynamically and thus may correspond to a portion of the text 146 that the user 110 is gazing at in real time. As shown, the focal region 112 may move within the content window 144, for example as the user 110 reads respective portions of the text 146 of the stream of content. In this regard, the focal region 112 may be updated dynamically, in real time. It should be appreciated that the depiction of the focal region 112 in FIGs. 2A-2D is for the purposes of illustration and description of the software and / or hardware process, and that the focal region 112 need not be represented graphically (e.g., displayed) by the device 120. Stated broadly, the focal region 112 may be indicative of the user 110 reading one or more portions of the text 146 of the stream of content.

[0046] In some implementations, the processing system 126 may continuously receive the gaze data 128 from the sensors 122. For example, the sensors 122 may provide a continuous stream of images of the pupils of the user’s eyes and corresponding glints. The processing system 126 may use the gaze data 128 that is continuously received to continuously update positioning of the focal region 112 relative to the text 146 of the content stream (e.g., to update the location of the focal region 112 within the content window 144). In an example implementation, the processing system 126 may update the focal region 112 concurrently with receipt of updates of the gaze data 128.
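
The mapping from a gaze sample to a focal region can be illustrated with a minimal sketch. It assumes the gaze data has already been reduced to a screen coordinate and that each displayed word has a known bounding box; the names WordBox and focal_region and the radius parameter are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class WordBox:
    """Hypothetical on-screen bounding box of one displayed word."""
    index: int
    x: float
    y: float
    w: float
    h: float


def focal_region(
    gaze_xy: Tuple[float, float],
    boxes: Sequence[WordBox],
    radius: float,
) -> Optional[Tuple[int, int]]:
    """Return (first, last) word index within `radius` of the gaze point."""
    gx, gy = gaze_xy
    hits = []
    for b in boxes:
        # Distance from the gaze point to the nearest point of the box.
        dx = max(b.x - gx, 0.0, gx - (b.x + b.w))
        dy = max(b.y - gy, 0.0, gy - (b.y + b.h))
        if dx * dx + dy * dy <= radius * radius:
            hits.append(b.index)
    return (min(hits), max(hits)) if hits else None


# Three words laid out on one line; the gaze rests on the middle word.
line = [WordBox(0, 0, 0, 40, 10), WordBox(1, 50, 0, 40, 10), WordBox(2, 100, 0, 40, 10)]
region = focal_region((60.0, 5.0), line, radius=15.0)
```

Calling focal_region again for each new gaze sample would move the region along the text, which is one simple way to model the continuous updating described above.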

[0047] In some implementations, the software and / or hardware process may be configured to identify and correct errors within the stream of content that have resulted from real-time transcription and / or translation of utterances from a speaker. Such errors may be produced by speech-to-text applications or services, for example when a prediction of a word or phrase that was uttered by the speaker is incorrect. Inaccuracies may introduce textual errors into a real-time stream of content that includes text. Textual errors may be referred to as text errors. For example, a stream of content that includes text that is output from a machine learning model or AI pipeline may contain one or more text errors.

[0048] In some implementations, a stream of content that includes text may be provided by speech-to-text (STT) software, which may be referred to as automatic speech recognition (ASR) software. The STT or ASR processes may be implemented in and / or performed by an AI, such as a machine learning model. In an implementation, an ASR model may produce an initial transcript or translation and provide it to the software and / or hardware process. In another implementation, the software and / or hardware process may include the STT or ASR model.

[0049] In some implementations, the software and / or hardware process may use a specialized machine learning model, such as a large language model (LLM) to perform an error checking process on the initial transcript or translation. Such a model may use linguistic context techniques to identify one or more errors in the stream of content. Such errors may include, for example, errors related to at least one character within the content. The model may perform processes such as N-Best Hypothesis Reranking and / or Contextual and Semantic Correction, to identify errors in the stream of content. Such processes may identify errors in the content based on a criterion or multiple criteria.

[0050] The criteria used to identify errors in a speech-to-text transcript or translation fall into several categories, focusing on both word-level accuracy and contextual meaning. A first category of criteria may include word-level accuracy criteria that may be used to evaluate word and / or phrase accuracy. For example, they may be used to check whether spoken words and phrases are correctly represented in the text. Word-level accuracy criteria may include substitutions where a different word takes the place of a word that was actually spoken (e.g., caused by homophones), deletions and / or omissions where a word or entire phrase that was spoken is missing, insertions of one or more words that were not spoken, misspelled names and / or terminology, homophone errors such as the use of words that sound the same but have different spellings and / or meanings, or the like. A second category of criteria may include contextual meaning criteria that may be used to evaluate contextual and semantic accuracy. For example, they may be used to assess whether errors change the meaning or integrity of the original message. Contextual meaning criteria may include semantic errors which may distort meaning, nonsense errors wherein words do not make sense, critical errors of words that could impact a major decision (e.g., legal, medical, or financial), or the like.

[0051] In some implementations, the software and / or hardware process may evaluate the stream of content by evaluating text of the content based on one or more criteria. The evaluation may be performed by a specialized machine learning model, such as an LLM, to perform error checking on the initial transcript or translation using a criterion or multiple criteria.

[0052] The text of the stream of content may be evaluated based on the criterion. In an implementation, examples of a criterion for evaluating text may include one or more of accuracy of the text, fluency of the text, readability of the text, consistency of the text, and the like. Errors that may be identified in the stream of content may include, for example, words that are incorrectly transcribed and / or translated, words that distort meaning or context relative to that of surrounding words in the content, and the like. Such errors may be referred to as errors related to one or more characters within the content. Such errors may result from prediction errors made by the software and / or hardware processes (e.g., STT or ASR processes), for example during real-time transcription and / or translation of the speech of a speaker.

[0053] In some implementations, the software and / or hardware process may be configured to perform error detection processes as part of the real-time transcription and / or translation of the speech of a speaker. For example, the software and / or hardware process may include error detection performed by an artificial intelligence, such as a machine learning model. In an example implementation, the software and / or hardware process may identify errors in a stream of content that includes text, such as an error related to at least one character within the content, using a machine learning model that uses an acoustic encoder that processes audio of the speech, a linguistic encoder that processes the predicted transcription and / or translation, and a classifier that determines positive entailments or negative entailments based on outputs of the encoders. In another example implementation, the software and / or hardware process may identify errors in a stream of content that includes text, such as an error related to at least one character within the content, using a machine learning model that uses probes or classifiers trained on an internal state of the model to identify problems like incorrect accent or noise, which can be indicative of transcription and / or translation errors.

[0054] The software and / or hardware process may be configured to determine whether errors identified in a stream of content including text should be modified, for example corrected. Dynamic updates made to a real-time stream of content that includes text, such as corrections of identified errors that include updating and / or replacing one or more words may introduce visual noise, such as jittering and / or flickering to displayed portions of the content. For example, as an artificial intelligence or machine learning model refines one or more predictions for the text in a real-time stream of content, the displayed text may undergo frequent updates and / or corrections, which may lead to visual noise including jittering and / or flickering of the text. This visual noise may become increasingly problematic with an increased number of corrections, for example if multiple corrections are being made concurrently, or substantially concurrently, at different locations of displayed content that are located within and / or proximate to a portion of the content that a user is currently reading. Such visual noise may degrade an overall reading experience of the user.

[0055] In some implementations, the software and / or hardware process may be configured to selectively correct one or more errors identified within a real-time stream of content that includes text, to lessen the likelihood of visual noise associated with correcting the one or more errors negatively affecting a user reading the stream of content. For example, in some implementations, the software and / or hardware process may be configured to delay or omit the correction of one or more identified errors that are located outside of the focal region 112 of the user 110. In this regard, the updating and / or replacing of one or more errors in respective portions of the content that are not currently within the user's gaze may be limited. Stated differently, the software and / or hardware process may be configured to prioritize the correction of identified errors that are located within or proximate to the focal region 112 of the user’s gaze over correcting identified errors in other portions of the stream of content that are not within the user’s gaze. To illustrate, in some implementations the software and / or hardware process may be configured to delay or omit the correction of identified errors that are located in a portion of the stream of content that has already been read by the user, may be configured to delay the correction of identified errors that are located in a portion of the stream of content that the user has not yet read, or any combination thereof. The delay or omission of correcting identified errors may reduce jittering and / or flickering of text within a currently displayed portion of the stream of content, which in turn may maintain and / or improve an overall reading experience of the user.

[0056] In some implementations, the software and / or hardware process may be configured to determine whether an error should be corrected. For example, this determination process may be triggered when an underlying STT or ASR system provides a potential update or correction for a previously displayed portion of text within the content. In an example implementation, upon receiving a potential update, the software and / or hardware process may be configured to determine whether to apply the update by inputting information related to the identified error and the potential correction into an optimization algorithm. The result of an analysis using the algorithm may determine whether to immediately correct the error, delay correction of the error, or omit correction of the error entirely.

[0057] In some implementations, the software and / or hardware process may be configured to determine whether an error, such as a text error, should be corrected. In an example implementation, the software and / or hardware process may be configured to determine whether an error in a stream of content should be corrected based on the results of inputting information related to an identified error into an optimization algorithm. The result of an analysis using the algorithm may determine whether to correct the error.

[0058] In an example implementation, the optimization algorithm may be based on finding a global minimum per frame of whether or not to update a subset of words from {u_1, u_2, ... u_N}, to minimize the cost function C_render, in accordance with the below representative formula. Stated differently, the algorithm may comprise an optimization problem that minimizes a cost function, by determining whether or not to update one or more words that are identified as errors to a stable, corrected state, or latest, in {u_1, u_2, ... u_N}.

{u_1, u_2, ... u_N} = argmin {C_render = C_jitter + C_semantic + C_foveation}
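
As a minimal sketch of the per-frame minimization, the following enumerates every update mask for a small number of flagged words and keeps the mask with the lowest total cost. The exhaustive search, the per-word cost tables, and the way each term is attributed to updating or not updating a word are assumptions made for illustration; the description does not specify how the minimum is found.

```python
from itertools import product
from typing import Callable, Tuple

Mask = Tuple[bool, ...]


def choose_updates(n_words: int, render_cost: Callable[[Mask], float]) -> Mask:
    """Return the mask (u_1..u_N) with the lowest render cost.

    Exhaustive search over all 2**N masks; only practical for small N.
    """
    return min(product((False, True), repeat=n_words), key=render_cost)


# Hypothetical per-word costs for one frame (illustrative values only).
JITTER = [0.2, 0.2, 0.2]     # visual cost of changing word i on screen
SEMANTIC = [0.9, 0.1, 0.5]   # cost of leaving error i uncorrected
FOVEATION = [0.0, 0.6, 0.1]  # cost of changing word i given the gaze


def example_cost(mask: Mask) -> float:
    """C_render = C_jitter + C_semantic + C_foveation for one mask."""
    c_jitter = sum(j for j, u in zip(JITTER, mask) if u)
    c_semantic = sum(s for s, u in zip(SEMANTIC, mask) if not u)
    c_foveation = sum(f for f, u in zip(FOVEATION, mask) if u)
    return c_jitter + c_semantic + c_foveation


best = choose_updates(3, example_cost)
```

With these values the first and third words are updated and the second is left alone, because updating the second word would cost more in jitter and foveation than its uncorrected error costs in meaning.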

[0059] In some implementations, C_jitter may be a measure of visual instability caused by text updates. Visual instability may correspond to a visual disturbance of the content caused by modifying the portion of the content. For example, modifying a portion of the content may result in flickering of the content being modified. Modifying a portion of the content may include displaying a correction in place of a corresponding error. A correction may include a portion of text, such as one or more characters (e.g., letters, punctuation marks, etc.), one or more words, or the like that are altered such that the correction replaces the error in the content. Such a correction may be referred to as a text correction. Modifying a portion of the content to replace or update an error with a correction may be referred to as correcting that portion of the content.

[0060] In an implementation, C_jitter may be a measure of visual flicker that is estimated to result from modifying, for instance correcting, one or more errors identified in the stream of content, such as an error related to at least one character within the content. Stated differently, C_jitter may be a representation of an instability associated with displaying a correction. In some implementations, C_jitter may be calculated using the weighted average of Fourier transform differences between consecutive frames of text. In some implementations, the software and / or hardware process may perform the algorithm to determine a global minimum per frame of whether or not to update (e.g., correct) a subset of words from {u_1, u_2, ... , u_N} that are identified as errors, which may minimize a cost function of C_render.
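
One possible reading of "weighted average of Fourier transform differences between consecutive frames" is sketched below. It assumes each frame has been reduced to a one-dimensional row of luminance samples and uses a direct discrete Fourier transform from the standard library; the frame representation, the per-pair weights, and the function names are assumptions, not details taken from the description.

```python
import cmath
from typing import List, Optional, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Direct O(n^2) discrete Fourier transform of a real signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def jitter_cost(
    frames: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of spectral differences between consecutive frames."""
    pairs = list(zip(frames, frames[1:]))
    if not pairs:
        return 0.0
    if weights is None:
        weights = [1.0] * len(pairs)
    if len(weights) != len(pairs):
        raise ValueError("need one weight per pair of consecutive frames")
    diffs = []
    for prev, curr in pairs:
        a, b = dft(prev), dft(curr)
        # Mean magnitude of the per-frequency spectral difference.
        diffs.append(sum(abs(x - y) for x, y in zip(a, b)) / len(a))
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)


stable = jitter_cost([[0, 1, 0, 1], [0, 1, 0, 1]])   # unchanged text
changed = jitter_cost([[0, 1, 0, 1], [1, 1, 1, 1]])  # text replaced
```

An unchanged frame pair yields a cost of zero, while a frame whose pixels change yields a positive cost, so larger on-screen changes contribute more to C_jitter.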

[0061] In some implementations, the software and / or hardware process may be configured to reduce this cost function as much as possible (e.g., to minimize the cost function). In so doing, the software and / or hardware process may be configured to attempt to achieve a balance between a threshold visual stability and an importance metric of displaying accurate and up-to-date content (e.g., text of the stream of content). Corresponding values for the threshold visual stability and the importance metric may be predetermined, may be specified by the user 110 of the device 120, or may be determined by the software and / or hardware process (e.g., based on user behavior), or the like, in any combination.

[0062] In some implementations, optionally, the algorithm may be implemented with an additional modifier lambda. The lambda modifier may function as a weighting factor in the cost function, allowing a user to prioritize either visual stability or immediate accuracy. For example, a lower lambda value may correspond to faster update rate, and a higher lambda value may correspond to a slower update rate. To illustrate, a higher lambda value may serve as a weighting factor for C_jitter, allowing a user to specify a preference for visual stability over immediate semantic correction. Relative to the cost function, lambda may be expressed in accordance with the below representative formula.

C_render = \lambda * C_jitter + ...
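
The effect of lambda on a single update decision can be sketched as follows. The representative formula leaves the remaining terms elided, so this sketch assumes they are the unweighted C_semantic and C_foveation terms; the function names and numeric values are hypothetical.

```python
def render_cost(c_jitter: float, c_semantic: float, c_foveation: float,
                lam: float = 1.0) -> float:
    """C_render with lambda weighting the jitter term (other terms assumed unweighted)."""
    return lam * c_jitter + c_semantic + c_foveation


def should_update(jitter_if_updated: float, semantic_if_kept: float,
                  foveation_if_updated: float, lam: float = 1.0) -> bool:
    """Update a word only if doing so lowers the render cost."""
    cost_update = render_cost(jitter_if_updated, 0.0, foveation_if_updated, lam)
    cost_keep = render_cost(0.0, semantic_if_kept, 0.0, lam)
    return cost_update < cost_keep


fast = should_update(0.3, 0.5, 0.0, lam=1.0)  # lower lambda: correction applied
slow = should_update(0.3, 0.5, 0.0, lam=2.0)  # higher lambda: correction held back
```

The same candidate correction is applied at the lower lambda and held back at the higher one, which matches the described relationship between lambda and update rate.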

[0063] In some implementations, lambda may be implemented into the algorithm to control a rate at which the software and / or hardware process updates and / or corrects portions of the stream of content, for example by correcting respective errors in one or more portions of text of the stream of content. This rate at which the software and / or hardware process updates and / or corrects portions of the stream of content may be referred to as an update rate of the software and / or hardware process. In an example implementation, an update rate may be specified by a user, for example as a user preference. To illustrate, some users (e.g., students) may prefer to have updates performed at a high update rate, for example in scenarios of transcription or AI translation. Users in such scenarios may be tolerant of jittering and / or flickering text. In another illustration, some users (e.g., those who are deaf or have difficulty hearing) may depend on deriving the correct meaning from real-time text, and thus may prefer any updates and / or corrections provided by enhancement to be as accurate as possible. Users in such scenarios may be willing to endure slower interpretation of live transcription, translation, and / or AI feedback, and thus may prefer to specify a slower update rate.

[0064] In some implementations, C_semantic may quantify a semantic impact of word changes (e.g., correction of errors in the stream of content). Semantic impact may refer to the effect of word meanings on communication, understanding, and interpretation. More specifically, semantic impact may refer to how the specific meanings of words and their relationships influence how a message is conveyed and received. In this regard, C_semantic may be related to an effect on a meaning of the content associated with displaying the correction, for example how displaying a correction in place of an error may affect the meaning of one or more words that come before and / or after the error in the stream of content. To illustrate, in an example where the stream of content corresponds to real-time translation of speech, semantic impact may refer to the influence and effect that the original text's meaning has on the translated text, and how meaning can change between languages, for example due to cultural and / or linguistic differences.

[0065] In an implementation, the algorithm may be configured to determine C_semantic using a distance between word embeddings. In some implementations, the algorithm may be configured to determine C_semantic using a semantic similarity oracle to update one or more Case C tokens. In an example, some or all English stop-words and / or punctuations may be removed. One or more remaining words may be lemmatized. Some or all of the words may be transformed to lowercase. Original and updated and / or corrected words may then be mapped into a vector space, for example using a sentence transformer. Semantic similarity of one or more words may be measured, for example by computing a dot product of the two vectors. In an example, the algorithm may be configured such that two tokens may be considered to be semantically similar if their similarity score is greater than 0.85. One or more techniques of natural language processing (NLP) may be used for obtaining vector representations of one or more words. Such vectors may capture information about the meaning of the word based on the surrounding words, for example. The algorithm may be configured to use, or may be integrated with, an NLP algorithm that estimates these representations, for example by modeling text in a large corpus.
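
The preprocessing and thresholding steps above can be sketched with the standard library alone. Note the substitutions: the embedding here is a unit-length character-bigram count vector standing in for a sentence transformer, so it measures spelling overlap rather than meaning; lemmatization is omitted; and the stop-word list is a tiny illustrative sample. Only the lowercasing, stop-word and punctuation removal, dot product, and 0.85 threshold follow the description.

```python
import math
import string
from collections import Counter
from typing import Dict, List

STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "in"}  # illustrative only


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop stop-words (no lemmatization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def embed(words: List[str]) -> Dict[str, float]:
    """Stand-in embedding: unit-length character-bigram counts."""
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {k: c / norm for k, c in counts.items()} if norm else {}


def similarity(original: str, corrected: str) -> float:
    """Dot product of the two unit vectors."""
    a, b = embed(normalize(original)), embed(normalize(corrected))
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def is_semantically_similar(original: str, corrected: str,
                            threshold: float = 0.85) -> bool:
    return similarity(original, corrected) > threshold
```

In a real pipeline the embed function would be replaced by a learned embedding model, and a correction scoring above the threshold could be treated as low in semantic impact.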

[0066] In some implementations, C_foveation may represent an assessment of an alignment between the focal region 112, and thus user’s gaze, and one or more identified errors that may be corrected (e.g., via textual changes and / or updates). In some implementations, an input to the algorithm for determining C_foveation may include the gaze data 128, which the algorithm may be configured to use to determine the focal region 112 for use in determining C_foveation. In an implementation, the alignment upon which the determination of C_foveation is based may be a location of one or more identified errors relative to the focal region 112 of the user’s gaze. In an example, if the location of an identified error is within the focal region 112, or is proximate to the focal region 112, C_foveation may represent a strong alignment between the focal region 112 and the identified error. A strong alignment may result in C_foveation indicating that the error should be corrected. For example, if the error is within the focal region 112, correcting the error may be important to the reading experience of the user 110. Stated differently, modifying a portion of the content that corresponds to the error may be based on the location of the error being within the focal region 112. In another example, if the location of an identified error is not within the focal region 112 and / or is located at least a threshold distance from the focal region 112, C_foveation may represent a weak alignment, or misalignment between the focal region 112 and the identified error. A weak alignment or misalignment may result in C_foveation indicating that the error should not be corrected. For example, if the error is outside the focal region 112, for example in a portion of the stream of content that the user 110 has already read, or in a portion of the stream of content that the user is yet to read, correcting the error may cause visual noise that may degrade the reading experience of the user 110. Stated more generally, the software and / or hardware process may be configured to determine that an identified error should be corrected if it is located in a portion of the stream of content that is currently within the focal region 112 of the user 110. In some implementations, the algorithm may be configured to compare respective heatmaps associated with the gaze data 128 (e.g., the focal region 112) and one or more identified errors that may be corrected in assessing the alignment between the focal region 112 and the one or more identified errors.
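
A simple distance-based version of this alignment can be sketched as below. It assumes word positions are integer indices and the focal region is an inclusive index range, and it maps distance linearly onto a cost between 0 (strong alignment) and 1 (at or beyond the threshold distance). The linear ramp, the default threshold, and the function names are assumptions; the heatmap comparison mentioned above is not modeled.

```python
from typing import Tuple


def distance_to_region(index: int, region: Tuple[int, int]) -> int:
    """Number of words between `index` and the nearest edge of the region."""
    start, end = region
    if index < start:
        return start - index
    if index > end:
        return index - end
    return 0


def foveation_cost(error_index: int, region: Tuple[int, int],
                   threshold: int = 3) -> float:
    """0.0 inside the focal region, rising to 1.0 at the threshold distance."""
    return min(distance_to_region(error_index, region) / threshold, 1.0)
```

For a focal region spanning words 4 to 6, an error at word 5 costs 0.0, while an error at word 10 costs 1.0 and would tend to be left uncorrected.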

[0067] In some implementations, an algorithm that integrates the gaze data 128 may offer a powerful contextual signal to improve the transcription and / or translation process itself and its presentation. For example, integrating the gaze data 128 may improve accuracy. To illustrate, knowing where the user 110 is looking may enable the algorithm to bias its language model (LM) toward words and / or phrases semantically related to the gazed-upon content. In some implementations, the gaze data 128 may inform when and where new text is displayed by the algorithm. For example, if the user 110 is actively reading a portion of a stream of content, such as within the focal region 112, the algorithm may be configured to momentarily pause or stabilize updates to that portion of the content to avoid distracting the user 110, thereby reducing visual jitter in the focal region 112, before scrolling updated, stable text into the peripheral view of the user 110.

[0068] In some implementations, the software and / or hardware process may be configured to attempt to balance the respective impacts (e.g., to minimize the cost functions) of C_jitter vs. C_semantic vs. C_foveation. Balancing visual jitter, semantic meaning, and a user's gaze in a real-time speech-to-text transcription and / or translation algorithm may provide significant benefits in usability, accuracy, and user experience. This balancing may address the inherent trade-off in real-time transcription and / or translation between low latency (speed) and high accuracy / coherence. To illustrate, visual jitter in a transcription display may refer to frequent, distracting updates and / or changes to the text as the algorithm revises its initial word predictions with more audio context. By prioritizing semantic meaning (coherence) over the most immediate, unverified text, the algorithm may strategically buffer, delay, or revise text chunks to ensure the final output is grammatically and contextually correct. This may reduce a user's cognitive load by providing a more stable and readable output.

[0069] With reference also to FIGs. 2A-2D, an example of the software and / or hardware process for selectively updating the text of a stream of content in real time, using gaze data of a user viewing the content, is depicted. For the purposes of illustration, the software and / or hardware process is described as being performed by the device 120, operated by the user 110. However, it should be appreciated that the example software and / or hardware process is not limited to being performed by the device 120 as illustrated and described herein.

[0070] As shown, the user 110 is using the device 120 to view a real-time stream of content that includes text 146. The content is displayed in the content window 144 on the display 124 of the device 120. The device 120 may be executing a software and / or hardware process for selectively updating the text of a stream of content in real time, for example as described herein.

[0071] The software and / or hardware process may receive the stream of content, for example from a source, such as a video conference feed. The software and / or hardware process may display the content in the content window 144, for example in real-time or subject to a delay, for example based on user preferences.

[0072] The software and / or hardware process may identify one or more errors, such as text errors, in the stream of content, for example in the text 146. Such an error may, for example, be related to at least one character within the content. As shown, the software and / or hardware process identifies a first error 202, a second error 204, a third error 206, and a fourth error 208. The software and / or hardware process may determine whether one or more, such as each, of the identified errors should be corrected, for example by performing the optimization algorithm. The software and / or hardware process may be configured to continuously perform the optimization algorithm, for example as the user 110 reads the text 146 of the stream of content.

[0073] FIGs. 2A-2D depict an example of how the software and / or hardware process may selectively update portions of the text 146 of the stream of content that are within the gaze of the user 110 (e.g., within the focal region 112). At FIG. 2A, the optimization algorithm may determine, via the render cost C_render, that none of the identified errors should be corrected. This determination may be based on the respective locations of the first error 202, the second error 204, the third error 206, and the fourth error 208, which are outside the focal region 112 of the user 110, such that correcting the identified errors may introduce undesirable visual noise into the stream of content.

[0074] At FIG. 2B, as the user 110 reads the text 146, the focal region 112 moves, such that the first error 202 is within the focal region 112. The software and / or hardware process may perform the optimization algorithm again. The optimization algorithm may determine that the first error 202 should be corrected. As shown, the software and / or hardware process may correct the first error 202 by replacing the word “it’s” with “it was.” The determination that the first error 202 should be corrected may be based on the cost function of making the correction, including the visual noise associated with the correction (C_jitter), the semantic impact of the change (C_semantic), and the location of the first error 202 within the focal region 112 (C_foveation). The optimization algorithm may further determine that the remaining identified errors, including the second error 204, the third error 206, and the fourth error 208, should not be corrected, based on their respective locations still being outside the focal region 112.

[0075] At FIG. 2C, as the user 110 continues to read the text 146, the focal region 112 moves, such that the second error 204 and the third error 206 are within the focal region 112. The software and / or hardware process may perform the optimization algorithm again. The optimization algorithm may determine that the second error 204 and the third error 206 should be corrected. As shown, the software and / or hardware process may correct the second error 204 by replacing the word “Went” with “I went,” and may correct the third error 206 by replacing the word “sand” with “San.” The determination that the second error 204 and the third error 206 should be corrected may be based on the respective cost functions of making the corrections, including the respective visual noise associated with displaying each correction (C_jitter), the semantic impacts of the changes (C_semantic), and the respective locations of the second error 204 and the third error 206 within the focal region 112 (C_foveation). The optimization algorithm may further determine that the remaining identified error, the fourth error 208, should not be corrected, based on its location still being outside the focal region 112.

[0076] At FIG. 2D, as the user 110 continues to read the text 146, the focal region 112 moves, such that the fourth error 208 is within the focal region 112. The software and / or hardware process may perform the optimization algorithm again. The optimization algorithm may determine that the fourth error 208 should be corrected. As shown, the software and / or hardware process may correct the fourth error 208 by inserting the word “friends” into the blank space between “my” and “last.” The determination that the fourth error 208 should be corrected may be based on the cost function of making the correction, including the visual noise associated with displaying the correction (C_jitter), the semantic impact of the change (C_semantic), and the location of the fourth error 208 within the focal region 112 (C_foveation).

[0077] As shown in FIGs. 2A-2D, the software and / or hardware process may be configured to identify one or more errors within the focal region 112 of the stream of content and to display respective corrections in place of the one or more errors.
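
The progression across FIGs. 2A-2D can be sketched as a loop in which pending corrections are applied only once the focal region reaches them. The four replacement pairs come from the description; the surrounding sentence, the word indices, and the focal-region ranges are invented for illustration, and the full cost function is reduced here to a simple inside-the-region test.

```python
from typing import Dict, List, Tuple


def apply_in_focus(
    tokens: List[str],
    pending: Dict[int, str],
    region: Tuple[int, int],
) -> Tuple[List[str], Dict[int, str]]:
    """Apply corrections inside the focal region; keep the rest pending."""
    start, end = region
    out = list(tokens)
    remaining = {}
    for i, fix in pending.items():
        if start <= i <= end:
            out[i] = fix
        else:
            remaining[i] = fix
    return out, remaining


# Hypothetical displayed text; the empty string marks a blank space (a missing word).
tokens = ["Yesterday", "it's", "late", "Went", "to", "sand",
          "Diego", "with", "my", "", "last", "week"]
pending = {1: "it was", 3: "I went", 5: "San", 9: "friends"}

# Focal region (inclusive word indices) at FIG. 2A, 2B, 2C, and 2D.
history = []
for region in [(0, 0), (1, 2), (3, 5), (8, 10)]:
    tokens, pending = apply_in_focus(tokens, pending, region)
    history.append(len(pending))

final_text = " ".join(t for t in tokens if t)
```

The number of pending corrections falls from four to three, then one, then zero as the region advances, mirroring the first, then the second and third, then the fourth error being corrected.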

[0078] In some implementations, the stream of content may be received by the software and / or hardware process as an unoptimized text stream, such as a raw output of an AI pipeline. The unoptimized text stream may be referred to as a first input to the software and / or hardware process and / or to the optimization algorithm. The raw output of an AI pipeline may include one or more lines of text with potentially unstable words and / or phrases, for example.

[0079] In some implementations, the stream of content may be received by the software and / or hardware process as a buffered stabilized text stream. The buffered stabilized text stream may be referred to as a second input to the software and / or hardware process and / or to the optimization algorithm. The buffered stabilized text stream may include one or more portions of previously displayed text of the stream of content, for example. The buffered stabilized text stream may represent a current stable state of content of the stream.

[0080] In some implementations, an output of the software and / or hardware process and / or of the algorithm may include a stabilized text stream enhanced for display to a user reading the stream (e.g., the user 110). Corrections and / or updates to portions of the content of the stream may be selectively applied by the optimization algorithm, based on the user's gaze (e.g., the focal region 112).

[0081] In some implementations, the software and / or hardware process may be configured to use smooth textual morphing techniques for applying corrections and / or updates to identified errors in the content, such as errors related to one or more characters within the content. For example, the software and / or hardware process may be configured to use smooth textual morphing when updating and / or replacing portions of the text 146 of the stream of content. In this regard, modifying a portion of the content may include smoothing a displayed transition from an error to a correction. The use of smooth textual morphing may limit jarring transitions within the stream of content. For example, the software and / or hardware process may be configured to perform smoothed morphing of signed distance functions. Employing this technique may provide smooth transitions between different text states, thereby reducing perceived abruptness of updates, which may in turn enhance visual comfort for the user 110. In some implementations, the software and / or hardware process may be configured to display one or more corrections by morphing one or more corresponding portions of the content. Morphing the one or more portions of the content may include using smooth textual morphing to transition one or more characters of text, words, phrases, for example from an error to a correction, in order to limit visual disturbances associated with morphing the content that may be perceived by the user 110.
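
One way to morph between two text states is to blend their signed distance fields over time, sketched below. It assumes each glyph state is available as a small grid of signed distances (negative inside the glyph) and uses a smoothstep easing curve; the grid representation, the easing choice, and the function names are assumptions rather than details from the description.

```python
from typing import List

Grid = List[List[float]]


def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve on [0, 1], clamped outside that range."""
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)


def morph_sdf(sdf_from: Grid, sdf_to: Grid, t: float) -> Grid:
    """Blend two signed distance fields; t=0 gives the error, t=1 the correction."""
    w = smoothstep(t)
    return [
        [(1.0 - w) * a + w * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(sdf_from, sdf_to)
    ]


def coverage(sdf: Grid) -> List[List[bool]]:
    """A pixel is inside the glyph where the signed distance is negative."""
    return [[d < 0 for d in row] for row in sdf]


# Two tiny hypothetical glyph fields with opposite inside/outside pixels.
old_glyph = [[-1.0, 1.0], [1.0, -1.0]]
new_glyph = [[1.0, -1.0], [-1.0, 1.0]]
halfway = morph_sdf(old_glyph, new_glyph, 0.5)
```

Rendering coverage(morph_sdf(...)) for t stepping from 0 to 1 over several frames would move the glyph outline gradually instead of swapping it in a single frame.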

[0082] In some implementations, the software and / or hardware process may be configured to reduce an emphasis of one or more portions of the content that are adjacent to the focal region 112 within the stream. To illustrate, the software and / or hardware process may be configured to reduce the emphasis of one or more words that precede the focal region 112 and / or to reduce the emphasis of one or more words that follow the focal region 112. Reducing the emphasis of one or more portions of the content may include techniques intended to cause the user 110 to pay less attention to the one or more portions, such as fading content within the one or more portions, blurring content within the one or more portions, or the like.
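To illustrate, fading the words adjacent to the focal region might be sketched as follows. This is a minimal sketch only: the function name `word_opacities`, the `margin` of two words on each side, and the `faded` opacity of 0.4 are hypothetical values chosen for illustration and do not appear in the description.

```python
from typing import List, Tuple


def word_opacities(
    n_words: int,
    focal: Tuple[int, int],
    margin: int = 2,
    faded: float = 0.4,
) -> List[float]:
    """Return one opacity per word; words next to the focal span are faded."""
    start, end = focal  # inclusive word-index span under the gaze
    opacities = []
    for i in range(n_words):
        precedes = start - margin <= i < start
        follows = end < i <= end + margin
        opacities.append(faded if (precedes or follows) else 1.0)
    return opacities


# Eight words, gaze on words 3..4: two words on each side are faded.
print(word_opacities(8, (3, 4)))
# [1.0, 0.4, 0.4, 1.0, 1.0, 0.4, 0.4, 1.0]
```

A blur radius could be computed in the same way in place of an opacity.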

[0083] FIG. 3 is a block diagram of an example process 300 for enhancing the accuracy of a stream of content that includes text by selectively updating one or more portions of text based on eye monitoring information. For the purposes of illustration, the process 300 is described with reference to an implementation using systems and components of the computing environment 100. For instance, as shown, the example process 300 may be performed on the device 120. One or more portions of the example process 300 may be performed by the processing system 126 of the device 120.

[0084] In some implementations, the example process 300 may be initiated when a corresponding application or mode on the device 120 is enabled, such as a real-time transcription and / or translation application, or a real-time transcription and / or translation application mode, for example. In an implementation, the device 120 may be configured to automatically initiate the application or mode. In another implementation, the application or mode may be initiated by the user 110, for example in response to an action of the user 110, such as operating a control of a user interface, issuing a voice command, performing a gesture, or the like. Once enabled, the example process 300 may continue executing until the application or mode is subsequently terminated or disabled by the user 110 (e.g., by the user 110 terminating an application that uses the example process 300).

[0085] In some implementations of the example process 300, the device 120 may be a wearable device worn by the user 110, such as a head-mounted device (e.g., AR glasses, XR glasses, a VR headset device, or the like). As shown, the device 120 is operating and displaying the image 130 of the environment 140 to the user 110, via the display 124. The environment 140 may be an AR environment, an MR environment, an XR environment, or another type of environment, for example.

[0086] In accordance with the illustrated implementation, at step 301 the example process 300 receives a stream of content. The stream of content may include text, such as the text 146. The stream of content may be received by the example process 300 in real time. In some implementations, the stream of content may be related to real-time transcription or captioning, to real-time translation, or the like, in any combination.

[0087] At step 302, the example process 300 may receive gaze monitoring information (e.g., the gaze data 128) related to the user 110 viewing one or more portions of the stream of content, such as a portion of the text 146 displayed in the content window 144, for example. The gaze data 128 may be indicative of a direction in which the user 110 is currently gazing. The gaze data 128 may be received by the processing system 126 of the device 120.

[0088] The example process 300 may identify, based on the gaze data 128 received, a region within the stream of content where the user 110 is gazing. As shown, the region may be a focal region 112 within the content where the user’s gaze is focused. In this regard, the example process 300 may receive data related to the gaze of the user 110 viewing a region (e.g., the focal region 112) of the content. The example process 300 may determine the focal region 112 as described elsewhere herein.
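To illustrate, mapping a gaze point to a region of displayed text might be sketched as follows. This is a minimal sketch and not the determination described elsewhere herein: the `WordBox` type, the `focal_span` function, the pixel coordinates, and the fixed radius are all hypothetical, and the words are assumed to lie on a single line so that the words near the gaze form one contiguous span.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WordBox:
    """Hypothetical layout record: a displayed word and its center in pixels."""
    text: str
    x: float
    y: float


def focal_span(
    words: List[WordBox],
    gaze: Tuple[float, float],
    radius: float,
) -> Optional[Tuple[int, int]]:
    """Return the inclusive index span of words within `radius` of the gaze point."""
    gx, gy = gaze
    hits = [
        i for i, w in enumerate(words)
        if math.hypot(w.x - gx, w.y - gy) <= radius
    ]
    if not hits:
        return None  # the user is not looking at the text
    return (hits[0], hits[-1])


line = [WordBox(t, 50.0 * i, 0.0) for i, t in enumerate("the weather is nice today".split())]

print(focal_span(line, gaze=(100.0, 0.0), radius=60.0))   # (1, 3)
print(focal_span(line, gaze=(1000.0, 0.0), radius=60.0))  # None
```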

[0089] At step 303, the example process 300 may identify one or more errors, such as text errors, within the content. The example process 300 may be configured to identify the one or more errors using error detection performed by an artificial intelligence, such as a machine learning model, for example as described elsewhere herein. At step 303, the example process 300 may also identify semantic similarities within one or more portions of the content, such as portions of the content that are within or surrounding identified errors. In this regard, the example process 300 may identify semantic similarities in the content, for example as a part of error detection performed by an artificial intelligence, such as a machine learning model.

[0090] The example process 300 may determine that the one or more errors should be corrected based on respective locations of the one or more errors relative to the focal region 112. The example process 300 may be configured to use an optimization algorithm to determine that one or more errors in a stream of content should be corrected, for example as described elsewhere herein. An output of the optimization algorithm may indicate that the one or more errors should be corrected based on a magnitude of a cost function (e.g., C_render) associated with correcting the one or more errors.

[0091] In some implementations, the example process 300 may be configured to determine that one or more identified errors in the stream of content should not be corrected. For instance, an output of the optimization algorithm may indicate that one or more identified errors in the stream of content should not be corrected based on a magnitude of a cost function (e.g., C_render) associated with correcting the one or more errors. In some implementations, the example process 300 may be configured to determine that the correction of one or more errors should be delayed, for example until the user's gaze is no longer located near the one or more errors (e.g., the one or more errors are outside the focal region 112). For instance, an output of the optimization algorithm may indicate that respective corrections of one or more identified errors in the stream of content should be delayed based on a magnitude of a cost function (e.g., C_render) associated with correcting the one or more errors.
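To illustrate, a decision among correcting, delaying, and not correcting an error based on the magnitude of a cost might be sketched as follows. This is a minimal sketch under loudly stated assumptions: the form of the cost used here (a visual disturbance term minus a meaning term, plus a penalty when the error lies in the focal region), the weights, and the two thresholds are all hypothetical and are not the C_render of the disclosure, whose definition is given elsewhere herein.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPLY = "apply"
    DELAY = "delay"
    SKIP = "skip"


@dataclass
class Candidate:
    """Hypothetical description of one candidate correction."""
    disturbance: float   # assumed 0..1 estimate of how visible the change is
    meaning_gain: float  # assumed 0..1 estimate of how much the fix affects meaning
    in_focus: bool       # whether the error lies within the focal region


def render_cost(
    c: Candidate,
    w_disturb: float = 1.0,
    w_meaning: float = 1.0,
    focus_penalty: float = 0.5,
) -> float:
    """Assumed cost form: disturbance raises cost, meaning gain lowers it."""
    cost = w_disturb * c.disturbance - w_meaning * c.meaning_gain
    if c.in_focus:
        cost += focus_penalty  # changes under the gaze are treated as costlier
    return cost


def decide(c: Candidate, apply_below: float = 0.0, skip_above: float = 0.75) -> Decision:
    """Map the cost magnitude to one of three outcomes (thresholds are assumptions)."""
    cost = render_cost(c)
    if cost <= apply_below:
        return Decision.APPLY
    if cost > skip_above:
        return Decision.SKIP
    return Decision.DELAY


# The same fix is applied when away from the gaze and delayed when under it.
print(decide(Candidate(disturbance=0.5, meaning_gain=0.6, in_focus=False)))  # Decision.APPLY
print(decide(Candidate(disturbance=0.5, meaning_gain=0.6, in_focus=True)))   # Decision.DELAY
# A highly visible change with little effect on meaning is not made.
print(decide(Candidate(disturbance=0.9, meaning_gain=0.05, in_focus=True)))  # Decision.SKIP
```

A delayed candidate would be re-evaluated as the gaze data changes, so that its cost may later fall below the apply threshold.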

[0092] At step 304, the example process 300 may modify a portion of the content that includes the error based on the location of the error relative to the focal region 112. For example, the example process 300 may be configured to replace the error with a text correction if the location of the error is within or proximate to the focal region 112 of the user’s gaze. Stated differently, the example process 300 may display respective corrections in place of one or more errors that the optimization algorithm has determined should be corrected. To illustrate, one or more errors in the text 146 of the stream of content that the example process 300 has determined should be corrected may be updated or replaced, for example using smooth textual morphing.

[0093] FIG. 4 illustrates a computing system 400 to enhance the accuracy of a stream of content that includes text by selectively updating one or more portions of text based on eye monitoring information, according to an implementation. Computing system 400 represents any apparatus, computing system, or systems with which the various operational architectures, processes, scenarios, and sequences disclosed herein for enhancing the accuracy of a stream of content that includes text by selectively updating one or more portions of text based on eye monitoring information can be implemented. Computing system 400 can be provided as an AR device, an XR device, a wearable device, or another computing device capable of the operations described herein. For example, the computing system 400 may be implemented as the device 120.

[0094] As shown, the computing system 400 includes a storage system 410, a processing system 420, a communication interface 430, input / output (I / O) device(s) 440, and sensors 450. The sensors 450 can be provided as the sensors 122 of the device 120, for example. The processing system 420 is operatively linked to the communication interface 430, the I / O device(s) 440, the sensors 450, and the storage system 410. In some implementations, the sensors 450, the communication interface 430, and / or the I / O device(s) 440 may be communicatively linked to the storage system 410. The computing system 400 may further include other components such as a battery and an enclosure that are not shown for clarity.

[0095] The communication interface 430 includes components that communicate over communication links, such as network cards, ports, radio frequency, processing circuitry (and corresponding software), or some other communication devices. The communication interface 430 may be configured to communicate over metallic, wireless, or optical links. The communication interface 430 may be configured to use Time Division Multiplex (TDM), Internet Protocol (IP), Ethernet, optical networking, wireless protocols, communication signaling, or some other communication format - including combinations thereof. The communication interface 430 may be configured to communicate with external devices, such as servers, user devices, or other computing devices.

[0096] The I / O device(s) 440 may include peripherals of a computer that facilitate the interaction between the user 110 and the computing system 400. Examples of the I / O device(s) 440 may include keyboards, mice, trackpads, monitors, displays, printers, cameras, microphones, external storage devices, and the like. In some implementations, the I / O device(s) 440 includes a see-through or video pass-through display providing a view of the physical environment, such as the display 124. In some implementations, the computing system 400 can include at least one camera, such as an RGB camera and / or a depth camera that captures image data, for example eye monitoring information that may be used to determine the gaze data 128.

[0097] The storage system 410 may include volatile and nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. The storage system 410 may be implemented as a single storage device but may also be implemented across multiple storage devices or sub-systems. The storage system 410 may include additional elements, such as a controller to read operating software from the storage systems. Examples of storage media (also referred to as computer-readable storage media or a computer-readable storage medium) include random access memory, read-only memory, magnetic disks, optical disks, and flash memory, as well as any combination or variation thereof, or any other type of storage media. In some implementations, the storage media may be non-transitory. In some instances, at least a portion of the storage media may be transitory. In no case is the storage media a propagated signal. The computer-readable storage medium may have program instructions stored thereon.

[0098] The processing system 420 can include microprocessor circuitry (e.g., at least one processor) and other circuitry that retrieves and executes operating software (i.e., program instructions) from the storage system 410. The processing system 420 may comprise the processing system 126 of the device 120. In some implementations, the processing system 420 can include external computing resources, such as those provided by another device in communication with the computing system 400, cloud processing resources, or the like that are accessible to the computing system 400 via the communication interface 430.

[0099] The processing system 420 may be mounted on a circuit board that may also hold the storage system 410. Operating software of the storage system 410 may include computer programs, firmware, or some other form of machine-readable program instructions. The operating software on the storage system 410 may further include an operating system, utilities, drivers, network interfaces, applications, or some other type of software. When read and executed by processing system 420, the operating software on the storage system 410 directs the computing system 400 to operate as described herein. In at least one implementation, the operating software can provide the process 300. The operating software can provide or cause at least one processor to enhance the accuracy of a stream of content that includes text by selectively updating one or more portions of text based on eye monitoring information, for example as described herein.

[0100] In at least one implementation, the processing system 420 receives eye monitoring information (e.g., gaze data 128) related to the user 110, for instance captured by one or more of the sensors 450. The processing system 420 analyzes the gaze data 128 to determine the focal region 112. The sensors 450 and the processing system 420 may cooperate to continuously determine the focal region 112, for example dynamically in real time. The processing system 420 may identify one or more errors, such as text errors, in a stream of content. For example, the processing system 420 may identify one or more errors in the text 146 of the stream of content the user 110 is reading. The processing system 420 may determine that one or more errors in the stream of content should be corrected, for example based on respective locations of the one or more errors, as described elsewhere herein. The processing system 420 may display one or more corrections of one or more respective identified errors in the stream of content to the user 110, for example via the I / O device(s) 440 (e.g., the display 124). In some implementations, the processing system 420 may be configured to provide information related to the focal region 112 of the user 110, for example via the communication interface 430.

[0101] Techniques and processes for enhancing the accuracy of communications that include real-time text streams, such as live captioning and / or translation services and / or applications, by selectively updating one or more portions of a text stream based on eye monitoring information, for example as illustrated and described herein, may be referred to as foveated stabilization.

[0102] Foveated stabilization may be implemented in one or more services and / or applications that include real-time text streams. For example, foveated stabilization may be implemented in a live captioning service and / or application, such as those used in video conferencing, presentation, online video sessions, and the like. In such environments, foveated stabilization may reduce distractions, such as those caused by flickering captions, when compared to known systems.

[0103] In another example, foveated stabilization may be implemented in a live translation service and / or application, such as those used for international communication, language learning, and the like. In such environments, foveated stabilization may enable a smoother reading experience for translated text, when compared to known systems.

[0104] In another example, foveated stabilization may be implemented in a service and / or application executing within an AR environment, such as an AR application with dynamic text overlays, or the like. In such environments, foveated stabilization may enhance readability and / or immersion, when compared to known systems.

[0105] In some implementations, foveated stabilization may enable benefits over known techniques of enhancing real-time text streams. For example, foveated stabilization may reduce visual instability associated with the enhancement of real-time text streams. To illustrate, implementations using foveated stabilization may minimize distraction and / or improve readability. Additionally, foveated stabilization may improve a user experience of enhanced real-time text streams, for example by creating a more comfortable and / or less jarring viewing experience for a user. Furthermore, foveated stabilization may preserve accuracy and / or latency within the enhancement of real-time text streams, for example by maintaining a responsiveness and / or an accuracy of real-time text streams.

[0106] Below are example clauses associated with the present disclosure. The described clauses should not be considered exhaustive.

[0107] Clause 1. A method comprising: receiving a stream of content; receiving data related to a gaze of a user viewing a region of the content; identifying an error related to at least one character within the content, based on a criterion; and modifying a portion of the content that includes the error based on a location of the error relative to the region.

[0108] Clause 2. The method of clause 1, wherein modifying the portion of the content is based on the location being within the region.

[0109] Clause 3. The method of clause 2, wherein modifying the portion of the content is further based on a visual disturbance of the content caused by modifying the portion of the content, and an effect on a meaning of surrounding content associated with modifying the content.

[0110] Clause 4. The method of one of the preceding clauses 1 to 3, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to an accuracy of the error relative to surrounding content.

[0111] Clause 5. The method of one of the preceding clauses 1 to 4, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to a contextual meaning of the error relative to surrounding content.

[0112] Clause 6. The method of one of the preceding clauses 1 to 5, wherein modifying the portion of the content is further based on an update rate related to the stream of content.

[0113] Clause 7. The method of one of the preceding clauses 1 to 6, wherein modifying the portion of the content includes smoothing a displayed transition from the error to a correction displayed in place of the error.

[0114] Clause 8. A system, comprising: a processor; and a memory comprising instructions that when executed by the processor, cause the processor to: receive a stream of content; receive data related to a gaze of a user viewing a region of the content; identify an error related to at least one character within the content, based on a criterion; and modify a portion of the content that includes the error based on a location of the error relative to the region.

[0115] Clause 9. The system of clause 8, wherein modifying the portion of the content is based on the location being within the region.

[0116] Clause 10. The system of clause 9, wherein modifying the portion of the content is further based on a visual disturbance of the content caused by modifying the portion of the content, and an effect on a meaning of surrounding content associated with modifying the content.

[0117] Clause 11. The system of one of the preceding clauses 8 to 10, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to an accuracy of the error relative to surrounding content.

[0118] Clause 12. The system of one of the preceding clauses 8 to 11, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to a contextual meaning of the error relative to surrounding content.

[0119] Clause 13. The system of one of the preceding clauses 8 to 12, wherein modifying the portion of the content is further based on an update rate related to the stream of content.

[0120] Clause 14. The system of one of the preceding clauses 8 to 13, wherein modifying the portion of the content includes smoothing a displayed transition from the error to a correction displayed in place of the error.

[0121] Clause 15. A computer-readable storage medium having stored thereon computer-readable instructions that when executed by a processor perform a method comprising: receiving a stream of content; receiving data related to a gaze of a user viewing a region of the content; identifying an error related to at least one character within the content, based on a criterion; and modifying a portion of the content that includes the error based on a location of the error relative to the region.

[0122] Clause 16. The computer-readable storage medium of clause 15, wherein modifying the portion of the content is based on the location being within the region.

[0123] Clause 17. The computer-readable storage medium of clause 16, wherein modifying the portion of the content is further based on a visual disturbance of the content caused by modifying the portion of the content, and an effect on a meaning of surrounding content associated with modifying the content.

[0124] Clause 18. The computer-readable storage medium of one of the preceding clauses 15 to 17, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to an accuracy of the error relative to surrounding content.

[0125] Clause 19. The computer-readable storage medium of one of the preceding clauses 15 to 18, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to a contextual meaning of the error relative to surrounding content.

[0126] Clause 20. The computer-readable storage medium of one of the preceding clauses 15 to 19, wherein modifying the portion of the content includes smoothing a displayed transition from the error to a correction displayed in place of the error.

[0127] Clause 21. The computer-readable storage medium of one of the preceding clauses 15 to 20, wherein modifying the portion of the content is further based on an update rate related to the stream of content.

[0128] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the specification. For example, it is to be understood that the concepts illustrated and described herein are not limited to being implemented within live captioning and / or translation services and / or applications, and that the concepts may be implemented in other types of systems where enhancing the accuracy of the predicted content (e.g., text) of real-time streams, such as those produced by AI pipelines, is desirable.

[0129] In some implementations, live-captioning may refer to the process of generating and displaying a textual transcription of spoken language in real-time or near real-time, wherein the transcription is created contemporaneously with the spoken language. In some implementations, live-captioning may be a service that provides a stream of text corresponding to spoken audio as the audio is being produced, wherein the stream of text may include interim predictions that are subject to subsequent correction or refinement. In some implementations, live-captioning can be a computer-implemented process that utilizes an automatic speech recognition (ASR) model to convert an audio stream into a corresponding text stream for display, wherein the conversion and display occur with a latency that allows a user to read the text as the audio is being spoken.
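To illustrate, locating where an interim prediction was subsequently refined might be sketched as follows. This is a minimal sketch only: a word-level comparison using the `difflib` module of the Python standard library is swapped in as a simple stand-in and is not the machine-learning-based error detection described herein, and the function name `changed_spans` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def changed_spans(interim: str, refined: str) -> List[Tuple[str, List[str], List[str]]]:
    """List the word spans that differ between an interim and a refined caption."""
    a, b = interim.split(), refined.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]


print(changed_spans("the whether is nice", "the weather is nice today"))
# [('replace', ['whether'], ['weather']), ('insert', [], ['today'])]
```

Each reported span is a candidate portion of the displayed text that a later step could choose to update, delay, or leave unchanged.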

[0130] In accordance with aspects of the instant disclosure, implementations of various techniques and methods described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may be implemented as a computer program product (e.g., a computer program tangibly embodied in an information carrier, a machine-readable storage device, a computer-readable medium, a tangible computer-readable medium), for processing by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). In some implementations, a tangible computer-readable storage medium may be configured to store instructions that when executed cause a processor to perform a process. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

[0131] While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the implementations. It should be understood that they have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the apparatus and / or methods described herein may be combined in any combination, except mutually exclusive combinations. The implementations described herein can include various combinations and / or sub-combinations of the functions, components and / or features of the different implementations described.

[0132] Further to the descriptions above, a user may be provided with controls allowing the user to make an election as to both if and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user’s social network, social actions, or activities, profession, a user’s preferences, or a user’s current location), and if the user is sent content or communications from a server. In addition, certain data, such as information related to portions of a user’s body and / or surrounding environment, may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user’s identity may be treated so that no personally identifiable information can be determined for the user, or a user’s geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.

[0133] It should be appreciated that logic flows depicted in the figures, such as the process 300 shown in FIG. 3, do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the claims.

Claims

WHAT IS CLAIMED IS:

1. A method comprising: receiving a stream of content; receiving data related to a gaze of a user viewing a region of the content; identifying an error related to at least one character within the content, based on a criterion; and modifying a portion of the content that includes the error based on a location of the error relative to the region.

2. The method of claim 1, wherein modifying the portion of the content is based on the location being within the region.

3. The method of claim 2, wherein modifying the portion of the content is further based on a visual disturbance of the content caused by modifying the portion of the content, and an effect on a meaning of surrounding content associated with modifying the content.

4. The method of one of the preceding claims 1 to 3, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to an accuracy of the error relative to surrounding content.

5. The method of one of the preceding claims 1 to 4, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to a contextual meaning of the error relative to surrounding content.

6. The method of one of the preceding claims 1 to 5, wherein modifying the portion of the content is further based on an update rate related to the stream of content.

7. The method of one of the preceding claims 1 to 6, wherein modifying the portion of the content includes smoothing a displayed transition from the error to a correction displayed in place of the error.

8. A computing system, comprising: a computer-readable storage media; at least one processor operatively coupled to the computer-readable storage media; and program instructions stored on the computer-readable storage media that, when executed by the at least one processor, direct the at least one processor to perform a method, the method comprising: receiving a stream of content; receiving data related to a gaze of a user viewing a region of the content; identifying an error related to at least one character within the content, based on a criterion; and modifying a portion of the content that includes the error based on a location of the error relative to the region.

9. The computing system of claim 8, wherein modifying the portion of the content is based on the location being within the region.

10. The computing system of claim 9, wherein modifying the portion of the content is further based on a visual disturbance of the content caused by modifying the portion of the content, and an effect on a meaning of surrounding content associated with modifying the content.

11. The computing system of one of the preceding claims 8 to 10, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to an accuracy of the error relative to surrounding content.

12. The computing system of one of the preceding claims 8 to 11, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to a contextual meaning of the error relative to surrounding content.

13. The computing system of one of the preceding claims 8 to 12, wherein modifying the portion of the content is further based on an update rate related to the stream of content.

14. The computing system of one of the preceding claims 8 to 13, wherein modifying the portion of the content includes smoothing a displayed transition from the error to a correction displayed in place of the error.

15. A computer-readable storage medium having program instructions stored thereon that, when executed by at least one processor, direct the at least one processor to perform a method comprising: receiving a stream of content; receiving data related to a gaze of a user viewing a region of the content; identifying an error related to at least one character within the content, based on a criterion; and modifying a portion of the content that includes the error based on a location of the error relative to the region.

16. The computer-readable storage medium of claim 15, wherein modifying the portion of the content is based on the location being within the region.

17. The computer-readable storage medium of claim 16, wherein modifying the portion of the content is further based on a visual disturbance of the content caused by modifying the portion of the content, and an effect on a meaning of surrounding content associated with modifying the content.

18. The computer-readable storage medium of one of the preceding claims 15 to 17, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to an accuracy of the error relative to surrounding content.

19. The computer-readable storage medium of one of the preceding claims 15 to 18, wherein identifying the error comprises performing an error checking process on the content, and wherein the criterion corresponds to a contextual meaning of the error relative to surrounding content.

20. The computer-readable storage medium of one of the preceding claims 15 to 19, wherein modifying the portion of the content includes smoothing a displayed transition from the error to a correction displayed in place of the error.