Generative AI for time series prediction for creating radiotherapy treatment planning systems

A generative AI model that monitors TPS screen capture sequences and predicts the next screen addresses the long training time and high resource consumption caused by the complexity of TPS, enabling efficient, real-time operation guidance for radiotherapy treatment planning systems that can be adapted to the unique practices of different clinics.

CN122201625A · Pending · Publication Date: 2026-06-12 · SIEMENS HEALTHINEERS INTERNATIONAL AG

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SIEMENS HEALTHINEERS INTERNATIONAL AG
Filing Date
2025-12-10
Publication Date
2026-06-12


Abstract

Embodiments of the present disclosure relate to generative AI for time series prediction of radiotherapy treatment planning system creation. A server monitors screen capture sequences from a user interface of a radiotherapy treatment planning platform operated by a medical professional. The server generates a training dataset comprising time series of the screen captures, and uses the dataset to train a machine learning model. The trained model is configured to predict visual attributes of the user interface to determine attributes of a next screen based on previous interactions. When executed, the model predicts future visual attributes of the interface as a user interacts with a current screen to provide real-time guidance for navigating the treatment planning platform.

Description

Technical Field

[0001] This application generally relates to training generative AI (artificial intelligence) models for radiotherapy treatment planning systems.

Background Technology

[0002] Radiation therapy (RT) is one of the primary modalities used in cancer treatment, and RT treatment planning (RTTP) is a complex process that involves specific guidelines, protocols, and instructions adopted by various healthcare professionals, such as clinicians, medical device manufacturers, etc. Healthcare professionals strive to provide patients with the safest and most effective treatment.

[0003] RTTP creation typically involves collaboration between multiple professionals and automated computer models, such as treatment planners or plan optimizers. The initial treatment plan is usually created based on the best available information and is tailored to the individual patient according to standard treatment protocols. Specifically, various attributes are input by healthcare professionals into treatment planning software (TPS), which is the interface or platform for the treatment planner, where the computer model uses these attributes to optimize the RTTP.

[0004] A TPS typically consists of a highly complex platform with a series of interactive interfaces, each containing multiple menus, buttons, and data fields. Healthcare professionals must navigate these menus, buttons, and data fields to perform tasks and ultimately input the medical features of the RTTP. The sheer number of options and non-linear workflows make it difficult for users to know which steps to take next, especially when each clinic may have its own preferred approach. This complexity often results in a steep learning curve, requiring extensive training and experience to navigate the system efficiently and correctly. Navigating a TPS generally requires specialized knowledge, necessitating comprehensive training for new clinicians before they can use it effectively. This training is crucial when staff begin using a new planning system or when software is updated to support new optimizations or treatment techniques, ensuring alignment with current clinical practice. Acquiring full expertise takes even longer, and the rapidly evolving field of RT requires continuous learning of new aspects of treatment planning. Educating new or junior staff increases the workload of senior staff, which is highly undesirable.

Summary of the Invention

[0005] Due to the sheer number of features and the multiple ways users can achieve their desired results, training healthcare professionals to become proficient in a complex TPS typically requires a considerable amount of time. Traditional user manuals are insufficient to cover all potential workflows and variations in TPS usage. Recently, Large Language Models (LLMs) have gained attention as a potential solution for simplifying user interfaces and enhancing TPS user training. Commonly proposed uses of LLMs include providing a natural language question-and-answer interface for "help" functions or enabling direct TPS control via natural language commands.

[0006] However, conventional LLMs face technical challenges. For example, while LLMs can quickly provide a reliable way to organize help features, they are not tailored to the diverse ways different clinics use a TPS. Standard LLMs may not account for varied practices and each clinic's unique workflows, and collecting the specific training data required for further LLM customization can be challenging. Moreover, integrating LLM-driven guidance into existing TPS structures, which are rich in visual elements such as menus, buttons, fields, and graphs, is difficult because the natural language processing capabilities of LLMs do not translate easily to such visual interfaces.

[0007] The methods and systems discussed herein are directed (at least in part) to training and using generative AI models to predict how the visual interface of a TPS will change as the user progresses through a task. The model disclosed herein can monitor the TPS screen in real-time (or near real-time) and can predict the next visual state based on the user's actions. Unlike conventional methods, the model can monitor visual elements and guide the user based on the sequence of screens the user has navigated. Moreover, the predictions provided by this model can be visual in nature. For example, the model can be trained and configured to predict the next graphical user interface that should ultimately lead to the desired outcome. Accordingly, the model improves on the capabilities of other machine learning models configured to perform the same task. For example, by excluding actual treatment data, the model can be trained with less data and can run faster with less computational power (because it omits data that is, in some cases, non-essential). Unlike many existing solutions, the model does not compute the cost function of a potential or candidate RTTP. Therefore, the model can run more efficiently with less computational power and provide answers in a shorter time.

[0008] The predictions provided by the model discussed herein allow it to suggest actions, making workflows more intuitive and efficient for users. The model can be trained on time-series screen captures from healthcare professionals who previously worked with a TPS. Additional training can be performed using video recordings from a clinical setting or algorithmically generated input, enabling the model to predict the next screen in the sequence. In one example implementation, a Video Stream Model (VSM) acquires a series of screen captures and predicts the next frame purely based on visual data. It should be noted that this model differs from other models or AI copilots because it is trained on the visual aspects of the TPS, not necessarily on interaction data generated as a result of interactions between the healthcare professional and the TPS. For example, the model may not necessarily check the effectiveness of an RTTP or be trained to optimize one or more treatment attributes. Instead, the model can be trained using visual elements of a specific TPS that previous users have already used. These characteristics allow the model to run more efficiently (e.g., faster) and require less training data.

[0009] In some aspects, the technology described herein relates to a method comprising: monitoring a sequence of screen captures generated from a user interface of a radiotherapy treatment planning platform operated by a group of medical professionals by at least one processor; generating a training dataset comprising a time-series dataset corresponding to the screen capture sequence by at least one processor; training a machine learning model by at least one processor using the training dataset such that the machine learning model is configured to predict the visual properties of the next screen capture of the user interface of the radiotherapy treatment planning platform based on previous screen captures of the radiotherapy treatment planning platform; and executing the machine learning model by at least one processor to predict the future visual properties of future screen captures of the radiotherapy treatment planning platform for the current screen capture of the radiotherapy treatment planning platform with which the user is interacting.

[0010] In some aspects, the techniques described herein relate to a method that further includes: displaying, by the at least one processor, a warning notification when a future screen capture deviates from the future visual properties predicted by the machine learning model.

[0011] In some aspects, the techniques described herein relate to a method in which the machine learning model is also trained using videos recording interactions between medical professionals and the radiotherapy treatment planning platform.

[0012] In some aspects, the techniques described herein relate to a method that further includes: editing, by the at least one processor, at least one visual attribute within at least one screen capture in the training dataset.

[0013] In some aspects, the techniques described herein relate to a method in which the machine learning model is also trained using auditory instructions from medical professionals interacting with the radiotherapy treatment planning platform.

[0014] In some aspects, the techniques described herein relate to a method that further includes: displaying, by the at least one processor, at least one future visual attribute.

[0015] In some aspects, the techniques described herein relate to a method in which the machine learning model is trained for a specific clinic.

[0016] In some aspects, the techniques described herein relate to a computer-readable medium comprising a non-transitory set of instructions that, when executed, cause a processor to: monitor a sequence of screen captures generated from a user interface of a radiotherapy treatment planning platform operated by a group of medical professionals; generate a training dataset comprising a time-series dataset corresponding to the screen capture sequence; use the training dataset to train a machine learning model such that the machine learning model is configured to predict the visual properties of the next screen capture of the user interface of the radiotherapy treatment planning platform based on previous screen captures of the radiotherapy treatment planning platform; and execute the machine learning model to predict the future visual properties of future screen captures of the radiotherapy treatment planning platform for the current screen capture of the radiotherapy treatment planning platform with which the user is interacting.

[0017] In some aspects, the techniques described herein relate to a computer-readable medium in which the instruction set also causes the processor to display a warning notification when a future screen capture deviates from the future visual properties predicted by the machine learning model.

[0018] In some aspects, the techniques described herein relate to a computer-readable medium in which the machine learning model is also trained using videos recording interactions between medical professionals and the radiotherapy treatment planning platform.

[0019] In some aspects, the techniques described herein relate to a computer-readable medium in which the instruction set also causes the processor to: edit at least one visual attribute within at least one screen capture in the training dataset.

[0020] In some aspects, the techniques described herein relate to a computer-readable medium in which the machine learning model is also trained using auditory instructions from medical professionals interacting with the radiotherapy treatment planning platform.

[0021] In some aspects, the techniques described herein relate to a computer-readable medium in which the instruction set also causes the processor to display at least one future visual attribute.

[0022] In some aspects, the techniques described herein relate to a computer-readable medium in which the machine learning model is trained for a specific clinic.

[0023] In some aspects, the technology described herein relates to a system including a server configured to: monitor a sequence of screen captures generated from the user interface of a radiotherapy treatment planning platform operated by a group of medical professionals; generate a training dataset including a time-series dataset corresponding to the screen capture sequence; use the training dataset to train a machine learning model such that the machine learning model is configured to predict the visual properties of the next screen capture of the user interface of the radiotherapy treatment planning platform based on previous screen captures of the radiotherapy treatment planning platform; and execute the machine learning model to predict the future visual properties of future screen captures of the radiotherapy treatment planning platform for the current screen capture of the radiotherapy treatment planning platform with which the user is interacting.

[0024] In some aspects, the techniques described herein relate to a system in which the server is also configured to display a warning notification when a future screen capture deviates from the future visual properties predicted by the machine learning model.

[0025] In some aspects, the techniques described herein relate to a system in which the machine learning model is also trained using videos recording interactions between medical professionals and the radiotherapy treatment planning platform.

[0026] In some aspects, the techniques described herein relate to a system in which the server is also configured to: edit at least one visual attribute within at least one screen capture in the training dataset.

[0027] In some aspects, the techniques described herein relate to a system in which the machine learning model is also trained using auditory instructions from medical professionals interacting with the radiotherapy treatment planning platform.

[0028] In some aspects, the techniques described herein relate to a system in which the server is also configured to display at least one future visual attribute.

Attached Figure Description

[0029] Non-limiting embodiments of this disclosure are described by way of example with reference to the accompanying drawings, which are schematic and not intended to be drawn to scale. Unless indicated as background art, these drawings represent aspects of this disclosure.

[0030] Figure 1 illustrates system components for implementing an AI-enabled treatment planning system, according to one embodiment.

[0031] Figure 2 illustrates the operational workflow of a method performed in the hardware and software computing components of a system for hosting and executing AI-enabled treatment planning, according to one embodiment.

[0032] Figure 3 illustrates a non-limiting example of data flow within an AI-enabled treatment planning system, according to one embodiment.

Detailed Implementation

[0033] Reference will now be made to the illustrative embodiments depicted in the accompanying drawings, and these embodiments will be described herein using specific language. However, it will be understood that this is not intended to limit the scope of the claims or this disclosure. Changes and further modifications to the inventive features shown herein, as well as additional applications to the principles of the subject matter shown herein, that will occur to those skilled in the art and those with knowledge of this disclosure, will be considered within the scope of the subject matter disclosed herein. Other embodiments may be used, and / or other variations may be made, without departing from the spirit or scope of this disclosure. The illustrative embodiments described in the detailed description are not intended to limit the subject matter presented.

[0034] Figure 1 illustrates components of a system 100 for an AI-enabled treatment planning system, according to one embodiment. System 100 may include an analytics server 110a, a system database 110b, a machine learning model 111, end-user devices 120a-120f (collectively referred to as end-user device 120), a medical device 150, a medical device computer 152, a database 160, and a radiotherapy planning optimizer 162. The various components depicted in Figure 1 may belong to a radiotherapy treatment clinic, where patients can receive radiotherapy treatment (in some cases, via one or more radiotherapy machines, e.g., medical device 150).

[0035] System 100 is not limited to the components described herein, and may include additional or other components that are considered to be within the scope of the embodiments described herein but are not shown for the sake of brevity.

[0036] The aforementioned components can be interconnected through one or more networks 130. Examples of networks 130 may include, but are not limited to, private or public local-area networks (LANs), wireless local-area networks (WLANs), metropolitan-area networks (MANs), wide-area networks (WANs), and the Internet. Network 130 may include wired and / or wireless communications according to one or more standards and / or via one or more transmission media. Communication on network 130 may be performed according to various communication protocols, such as Transmission Control Protocol and Internet Protocol (TCP / IP), User Datagram Protocol (UDP), and IEEE communication protocols. In one example, network 130 may include wireless communications according to the Bluetooth specification set or another standard or proprietary wireless communication protocol. In another example, network 130 may also include communications on a cellular network, including, for example, GSM (Global System for Mobile Communications), CDMA (Code Division Multiple Access), or EDGE (Enhanced Data for Global Evolution) networks.

[0037] The analytics server 110a can generate and display an electronic platform (also referred to herein as a treatment planning system or TPS) configured to interface with the radiotherapy planning optimizer 162 and / or (indirectly) the machine learning model 111, and to receive patient information (via various sources) and visualization preferences / instructions. The analytics server 110a can then use the TPS to output the execution results of the machine learning model 111 and the radiotherapy planning optimizer 162. In some embodiments, the interface of the TPS and / or the platform can be populated via the radiotherapy planning optimizer 162 itself.

[0038] TPS may include a graphical user interface (GUI) displayed on each of the end-user device 120, medical device 150, and / or medical device computer 152. Examples of TPS generated and hosted by analytics server 110a may be web-based applications or websites configured to be displayed on various electronic devices, such as mobile devices, tablets, personal computers, etc.

[0039] The TPS, whether hosted on the analytics server 110a or on another device of system 100, may include collaboration software accessible to the user devices 120 of participating members, enabling multiple healthcare professionals to collaborate and view visualizations provided by the analytics server 110a. The collaboration software can include any type of software that facilitates collaboration among user groups, including live interactive software (e.g., teleconferencing software) or asynchronous collaboration software (e.g., online publishing).

[0040] Collaborative software can also facilitate communication between one or more healthcare professionals and the analytics server 110a and / or the radiotherapy planning optimizer 162. For example, the platform provided or hosted by the analytics server 110a may include input elements (e.g., visual, text, or auditory elements) allowing users (e.g., one or more healthcare professionals) to input their desired treatment plan attributes. The TPS can also display the predicted output of the radiotherapy planning optimizer 162. As described herein, the analytics server 110a can also display outputs predicted by the machine learning model 111.

[0041] The information displayed by the TPS or the analytics server 110a may include any contextual data associated with user input during the generation of treatment plans. For example, the analytics server may display any data required to generate treatment plans for one or more patients. Non-limiting examples may include data associated with the patient to be treated (e.g., planning goals) or visualization attributes (e.g., what a medical professional expects to see), missing data, erroneous data, contextual data associated with how the radiotherapy planning optimizer 162 operates, etc.

[0042] The analytics server 110a can be any computing device, including a processor and non-transitory machine-readable storage, capable of performing the various tasks and processes described herein. The analytics server 110a can employ various processors, such as a central processing unit (CPU) and a graphics processing unit (GPU). Non-limiting examples of such computing devices may include workstation computers, laptop computers, server computers, etc. While system 100 includes a single analytics server 110a, the analytics server 110a can also include any number of computing devices operating in a distributed computing environment, such as a cloud environment.

[0043] End-user device 120 can be any computing device, including processors and non-transitory machine-readable storage media, capable of performing the various tasks and processes described herein. Non-limiting examples of end-user device 120 may include workstation computers, laptop computers, tablet computers, or server computers. In operation, various users can use end-user device 120 to access a GUI operatively managed by analytics server 110a. Specifically, end-user device 120 may include clinic computer 120a, clinic server 120b, and medical professional device 120c, which may include any electronic device operated by oncology committee members, medical professionals, and scientists that accesses and examines various types of patient-related treatment data and RTTPs, as well as other types of data and information exchange.

[0044] In a non-limiting example, multiple healthcare professionals can operate medical professional device 120c to examine patient-related treatment data and reach a consensus on patient treatment. Even though these devices are referred to herein as "end-user" devices, they may not always be operated by end-users. For example, clinic server 120b may not be directly used by end-users. However, results stored on clinic server 120b can be used to populate various GUIs accessed by end-users via medical professional device 120c. Patient-related information generated by various types of devices in system 100 (outside the context of the AI treatment planning agent) can be stored within the system database 110b. The stored patient data can be referenced by analytics server 110a to train machine learning model 111.

[0045] Medical device 150 may be a radiotherapy machine configured to administer radiotherapy to a patient. Medical device 150 may also communicate with medical device computer 152 configured to display various GUIs discussed herein. For example, analysis server 110a may display results predicted by radiotherapy planning optimizer 162 on the computing device described herein.

[0046] The machine learning model 111 can be stored in the system database 110b. The machine learning model 111 can be configured or trained to automatically generate text, image, or video responses based on input received at the user interface or other types of input (e.g., speech captured at a conference room microphone or the microphone of the end-user device 120).

[0047] In some embodiments, analysis server 110a can retrieve data input via TPS and execute radiotherapy planning optimizer 162 to generate one or more treatment attributes for an RTTP that meets any radiotherapy planning objectives, based on patient attributes of the patient for whom a radiotherapy treatment plan is being generated. Radiotherapy planning optimizer 162 can be stored in database 160. Radiotherapy planning optimizer 162 can generate one or more treatment attributes, for example, by iteratively calculating one or more treatment attributes, wherein for each iteration, radiotherapy planning optimizer 162 can revise one or more treatment attributes of the RTTP based on cost values.

[0048] Analysis server 110a can deploy a radiotherapy planning optimizer 162 to generate RTTPs for patients based on patient attributes and one or more treatment attributes received (via TPS) from one or more end-user devices 120. The radiotherapy planning optimizer 162 can iteratively compute one or more treatment attributes of the RTTP. For example, for each iteration, the radiotherapy planning optimizer 162 can generate candidate RTTPs with various attributes. The planning optimizer 162 can then use one or more loss functions to compute the cost value of the generated candidate RTTPs. The cost value can indicate the likelihood that the candidate RTTP violates a set of rules (whether internal or external). For example, the cost value can indicate whether the candidate RTTP violates any planning objectives. The radiotherapy planning optimizer 162 can analyze the cost value. If needed (e.g., when the cost value meets a threshold), the radiotherapy planning optimizer 162 can revise the candidate RTTP and re-execute its loss function to generate new cost values.

[0049] Depending on whether the new cost value has increased or decreased, the radiotherapy planning optimizer 162 can revise the candidate RTTP again and recalculate the cost value. The radiotherapy planning optimizer 162 can continue this iterative approach until it converges on an RTTP (or final RTTP) with a cost value that meets a threshold. In some implementations, patient treatment attributes can also indicate how radiotherapy can be combined with, or implemented sequentially with, other types of treatment modalities (e.g., surgery, chemotherapy).
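The revise-and-rescore loop described in paragraphs [0048] and [0049] can be sketched as follows. This is only an illustration under stated assumptions, not the radiotherapy planning optimizer 162 itself: the plan representation (a list of numbers), the quadratic `cost` function, the `revise` step, and the threshold value are all hypothetical stand-ins chosen to keep the example small.

```python
from typing import List, Tuple

Plan = List[float]  # hypothetical: a candidate RTTP reduced to numeric attributes


def cost(plan: Plan, objectives: Plan) -> float:
    """Hypothetical cost value: squared distance from the planning objectives."""
    return sum((p - o) ** 2 for p, o in zip(plan, objectives))


def revise(plan: Plan, objectives: Plan, step: float = 0.5) -> Plan:
    """Hypothetical revision: move each attribute part of the way to its objective."""
    return [p + step * (o - p) for p, o in zip(plan, objectives)]


def optimize(
    plan: Plan, objectives: Plan, threshold: float = 1e-3, max_iters: int = 100
) -> Tuple[Plan, float, int]:
    """Revise the candidate plan until its cost value meets the threshold."""
    current_cost = cost(plan, objectives)
    iterations = 0
    while current_cost > threshold and iterations < max_iters:
        plan = revise(plan, objectives)          # generate a revised candidate
        current_cost = cost(plan, objectives)    # recompute its cost value
        iterations += 1
    return plan, current_cost, iterations


final_plan, final_cost, n_iters = optimize([0.0, 10.0], [2.0, 8.0])
```

The `max_iters` guard is an added safety assumption so the loop terminates even if a candidate never reaches the threshold.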

[0050] In some embodiments, the analytics server 110a or end-user device 120 may use RTTP to automatically control the medical device 150 to treat patients based on RTTP attributes. The system database 110b may contain data for training the machine learning model 111 to assist users in operating the TPS.

[0051] Figure 2 illustrates the operational workflow of method 200, performed within hardware and software computing components of an AI-enabled treatment planning system, according to one embodiment. Method 200 may include steps 202-208. However, other embodiments may include additional or alternative steps, or one or more steps may be omitted entirely. Method 200 is described as being performed by a server (such as the analytics server described in Figure 1). However, one or more steps of method 200 can be performed by any number of computing devices operating in the distributed computing system described in Figure 1. For example, one or more computing devices can locally perform some or all of the steps described in Figure 2.

[0052] In step 202, the analysis server can monitor a sequence of screen captures generated from the user interface of a radiotherapy treatment planning platform operated by a group of medical professionals.

[0053] As discussed herein, a screen capture can refer to a time series or a collection of one or more images obtained from a TPS interface while a user (e.g., a physicist, clinician, or other medical professional) is using that interface. Furthermore, as used herein, a radiotherapy treatment planning platform refers to the TPS interface. Screen captures can be recorded periodically or triggered when significant changes occur in the interface (such as a user selecting a menu option or adjusting treatment parameters). Another trigger for screen capture may be a change in system state, such as when a new tool is activated or a dialog box opens. As discussed herein, such data can be ingested by machine learning models to predict the next screen (e.g., the visual elements of the next screen), thus reflecting the likely state of the interface after the next major user interaction or system change, helping to guide the user through the workflow.
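One way to picture a change-triggered capture, as opposed to a purely periodic one, is to keep a frame only when it differs enough from the last kept frame. The sketch below assumes frames are small grayscale pixel grids and uses a hypothetical `min_change` fraction as the definition of a "significant" change; the source does not specify how significance is measured.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # assumed: rows of grayscale pixel values


def changed_fraction(prev: Frame, curr: Frame) -> float:
    """Fraction of pixels that differ between two same-sized frames."""
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev, curr):
        for a, b in zip(prev_row, curr_row):
            total += 1
            if a != b:
                changed += 1
    return changed / total if total else 0.0


def select_captures(frames: List[Frame], min_change: float = 0.1) -> List[int]:
    """Return indices of frames to keep: the first frame, plus every frame
    that changed by at least min_change relative to the last kept frame."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if not kept or changed_fraction(frames[kept[-1]], frame) >= min_change:
            kept.append(i)
    return kept


blank = [[0, 0], [0, 0]]
menu_open = [[255, 255], [0, 0]]  # hypothetical frame after a menu opens
kept_indices = select_captures([blank, blank, menu_open, menu_open])
```

Here the repeated frames are dropped and only the initial screen and the screen after the change are kept.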

[0054] Each screen capture can include a timestamp that can be used to generate a time-series dataset corresponding to the progress of how TPS was used to generate RTTP.
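As a rough illustration of how timestamps could turn a set of captures into a time-series dataset for next-screen prediction, the sketch below sorts captures by time and slides a fixed-length window over them, pairing each run of previous captures with the capture that follows. The `Capture` type, the `image_id` placeholder for pixel data, and the `context_len` parameter are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Capture:
    timestamp: float  # assumed unit: seconds since the session started
    image_id: str     # placeholder standing in for the captured pixels


def make_windows(
    captures: List[Capture], context_len: int = 3
) -> List[Tuple[List[Capture], Capture]]:
    """Build (previous captures, next capture) pairs in timestamp order."""
    ordered = sorted(captures, key=lambda c: c.timestamp)
    windows = []
    for end in range(context_len, len(ordered)):
        context = ordered[end - context_len:end]  # the previous screens
        target = ordered[end]                     # the screen to predict
        windows.append((context, target))
    return windows


captures = [
    Capture(2.0, "c"), Capture(0.0, "a"), Capture(4.5, "e"),
    Capture(1.0, "b"), Capture(3.0, "d"),
]
windows = make_windows(captures, context_len=3)
```

Sorting first means captures that arrive out of order still yield a consistent sequence.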

[0055] In some embodiments, when a group of medical professionals operate a radiotherapy treatment planning platform (e.g., TPS), an analytics server can monitor a sequence of screen captures generated from the user interface of the radiotherapy treatment planning platform. In these embodiments, the server captures and tracks visual changes within the interface in real-time or near real-time, allowing it to observe the actions taken by users during their workflow. This sequence of screen captures can provide the analytics server with a detailed view of the user's interaction with the TPS.

[0056] In some embodiments, the analytics server can monitor user interactions with the TPS by performing periodic screen captures of the user's computer (e.g., generating images associated with the TPS). Using the captured images, the analytics server can generate sequences of screen captures reflecting visual changes occurring in the TPS user interface as healthcare professionals perform their tasks and navigate between different interfaces to generate RTTP. These screen captures can be acquired in real-time or near real-time, recording every action taken by the user, such as selecting menu options, entering data into fields, or adjusting treatment parameters. In some embodiments, the analytics server can record (e.g., generate video files) the TPS interface.

[0057] By continuously monitoring these visual interactions, the analytics server can track workflow progress, understand user behavior, and later use this data for predictive analytics or training purposes. Additionally, the analytics server can capture sequences of interface states. For example, the monitored data can reveal user-specific patterns and common patterns that users follow when navigating the TPS.

[0058] In step 204, the analysis server can generate a training dataset that includes a time-series dataset corresponding to the screen capture sequence.

[0059] Using the monitored data, the analytics server can generate a training dataset. This training dataset can be generated by collecting video recordings of TPS use during a healthcare professional's normal workflow or by aggregating images captured while the healthcare professional works. These recordings capture sequences of screen changes and interactions made by the user as they navigate various tasks within the TPS. The captured data can also indicate clickstreams and interaction data (e.g., when and where the user clicks). Additionally, verbal annotations provided by the professional during or after the recording can be included to add contextual information about the planning process.

[0060] The analytics server can aggregate monitored / captured data into a training dataset. In some embodiments, the analytics server can perform one or more preprocessing protocols. For example, in some embodiments, the analytics server can generate timestamped time series of the captured data. In another embodiment, the analytics server can redact sensitive data from the screen captures or from the contextual data used to augment the training dataset.
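
By way of non-limiting illustration, the preprocessing described above could be sketched in Python as follows. The `Capture` type, the `redact_region` and `build_time_series` helpers, and the representation of a screen as rows of grayscale integers are hypothetical and are assumed here only for illustration; they are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a screen capture is modeled as rows of grayscale pixel values.
Frame = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


@dataclass
class Capture:
    timestamp: float  # seconds since the recording started
    frame: Frame


def redact_region(frame: Frame, box: Box, fill: int = 0) -> Frame:
    """Return a copy of the frame with the given box overwritten by `fill`."""
    top, left, bottom, right = box
    out = [row[:] for row in frame]  # copy so the raw capture is left untouched
    for r in range(top, min(bottom, len(out))):
        for c in range(left, min(right, len(out[r]))):
            out[r][c] = fill
    return out


def build_time_series(captures: List[Capture], sensitive_box: Box) -> List[Capture]:
    """Order captures by timestamp and redact the same region in each one."""
    ordered = sorted(captures, key=lambda c: c.timestamp)
    return [Capture(c.timestamp, redact_region(c.frame, sensitive_box)) for c in ordered]


# Hypothetical usage: two out-of-order captures with a sensitive top-left region.
raw = [
    Capture(2.0, [[5, 5, 5], [5, 5, 5]]),
    Capture(1.0, [[9, 9, 9], [9, 9, 9]]),
]
series = build_time_series(raw, sensitive_box=(0, 0, 1, 2))
```

In this sketch the redaction region is fixed; an actual implementation could instead locate sensitive fields per capture.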

[0061] The training dataset can capture a comprehensive view of how healthcare professionals interact with the TPS during their routine workflows. The training dataset can include a series of screen captures or video recordings documenting every action a user takes while navigating the TPS and generating RTTPs. These screen captures can reflect dynamic changes in the user interface as professionals engage with various features, such as selecting options from menus, entering data into fields, and adjusting treatment parameters. These visual representations of interactions can be directly used to train a model to understand how the TPS is used in different clinical settings and / or different clinics.

[0062] Additionally, the analytics server can augment the training dataset with contextual information (such as patient data or user instructions) to enhance the model's understanding of these interactions. As used herein, contextual information can refer to any additional data beyond the screen captures that can enhance model predictions. This information can include a list of tokens representing patient-specific data retrieved from a database (such as medical history or treatment goals). It can also encompass natural language input, either spoken or written by the clinician, or inferred from the screen captures and processed by a specialized large language model (LLM). This contextual data can provide the (trained) machine learning model with a deeper understanding of the clinical scenario, enabling it to deliver more accurate predictions and customized guidance within the TPS.

[0063] In some embodiments, the analytics server may use a secondary model to generate or transform contextual data. For example, the analytics server may use a visual embedding model. As used herein, a visual embedding model can refer to any specialized neural network designed to transform screen captures from TPS into serialized data, such as natural language descriptions or latent vector representations. This transformation can allow machine learning models to extract meaningful semantic and structural information from the raw bitmap image within the screen capture. For example, a visual embedding model can analyze the screen capture to identify and describe user interface elements, such as menus, data fields, or open dialogs. It can also extract numerical or textual data displayed on the screen, transforming it into a structured format that can be processed by another model, such as an LLM.

[0064] The visual embedding model can be trained using annotated screen captures. Accordingly, the visual embedding model can learn to recognize and interpret specific elements and data displayed within the TPS interface. Training can be tailored to a specific TPS, ensuring that the model understands the unique layout and functionality of that TPS. In some embodiments, the model is designed to be generalizable across different clinics using the same TPS, meaning it does not require further retraining or adjustments to account for changes in workflows between clinics.

[0065] In some embodiments, the training dataset can be generated by recording video sessions of clinicians performing various tasks within a specific clinic's TPS. These recordings can be further enhanced with verbal annotations provided by the clinicians during or after the sessions, allowing one or more users to provide additional context about their actions. This process enables clinics to collect clinic-specific training datasets that capture how the TPS is utilized in their specific environment. As a result, a machine learning model can be trained for use in a specific clinic and can make its predictions using clinic-specific rules and protocols.

[0066] Once this clinic-specific dataset is collected, the machine learning model can be trained by building upon an existing, pre-trained general-purpose model. This approach allows for further fine-tuning of the model using data that reflects the clinic's unique workflows and practices. By continuing the training process using this localized data, the model can become better suited to predicting user actions and providing real-time guidance tailored to the specific usage patterns of TPS within the clinic, thereby improving both the accuracy and relevance of routine operations.

[0067] Analytics servers can augment training datasets with several additional types of information to enhance the model's predictive capabilities. For example, in addition to screen captures, training datasets can be enhanced by verbal or written annotations provided by users during or after their interactions. These annotations provide contextual information, explaining why a particular action was taken or clarifying the rationale behind certain choices. For instance, a user might explain why they adjusted treatment parameters in response to a specific patient condition. In another example, a user might describe why a particular path was taken and why the inputs were provided in a particular manner and order. This information not only helps machine learning models understand what actions are being performed but also the intent behind those actions, thus enabling them to more effectively predict future user behavior.

[0068] In some embodiments, the analytics server can augment the training dataset with patient data such as the type of cancer being treated, treatment goals, any specific instructions provided by the attending physician, or patient attributes such as BMI (body mass index), tumor location, etc. By incorporating this additional layer of context, the model can better tailor its predictions to the specific needs of the clinic and the patients being treated.

[0069] In some embodiments, the analytics server can augment the training dataset using user interaction logs, which capture detailed records of actions such as mouse clicks, keystrokes, and menu selections. These logs, when aligned with screen captures, can provide a more granular view of how users interact with TPS, thereby increasing the precision and depth of the dataset.
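
By way of non-limiting illustration, aligning an interaction log with screen captures could be sketched as follows. The function name `align_events_to_captures`, the event strings, and the rule of attaching each event to the most recent capture are assumptions made for illustration only.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def align_events_to_captures(
    capture_times: List[float],
    events: List[Tuple[float, str]],
) -> Dict[int, List[str]]:
    """Attach each logged event to the latest capture taken at or before it.

    `capture_times` must be sorted in ascending order. Events that occur
    before the first capture have no matching capture and are dropped.
    """
    aligned: Dict[int, List[str]] = {i: [] for i in range(len(capture_times))}
    for event_time, description in events:
        index = bisect_right(capture_times, event_time) - 1
        if index >= 0:
            aligned[index].append(description)
    return aligned


# Hypothetical usage: three captures and four logged interactions.
capture_times = [0.0, 1.0, 2.0]
events = [
    (0.4, "click:File"),
    (1.0, "key:Enter"),
    (1.7, "click:Beam"),
    (-0.5, "click:ignored"),
]
aligned = align_events_to_captures(capture_times, events)
```

Attaching events to the preceding capture is one possible design choice; events could equally be attached to the following capture if the log is meant to explain what caused a screen change.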

[0070] In some embodiments, the analytics server can augment the training dataset using external clinical guidelines and treatment protocols that guide medical decision-making in radiotherapy. By embedding these guidelines, the model can align its predictions with best practices and regulatory standards, thereby supporting clinicians in making clinically rational choices. Finally, historical performance data from past planning sessions and treatment outcomes can be included in the dataset. This historical data allows the model to identify patterns that have led to successful treatment plans, helping it to provide guidance informed by real-world success stories. Together, these data sources create a comprehensive and highly informative training dataset for the model.

[0071] In addition, the video input generated by the algorithm can be used to further refine the model.

[0072] The contextual data discussed herein, combined with the screen captures and annotations, can form a rich multidimensional training set that enables the system to learn both the visual patterns of interaction and the underlying medical rationale that guides the use of the TPS.

[0073] In step 206, the analytics server can use the training dataset to train a machine learning model, such that the machine learning model is configured to predict the visual attributes of the next screen capture of the user interface of the radiotherapy treatment planning platform based on at least one previous screen capture of the radiotherapy treatment planning platform.

[0074] Using the training dataset discussed herein, the analytics server can train a machine learning model. By ingesting visual data and combining it with the interaction logs and patient information discussed herein, the model can learn to form a more comprehensive understanding of the workflows involved in treatment planning. This multimodal approach allows the model to provide customized guidance so that its predictions are aligned not only with the visual flow of the interface but also with the unique needs of each situation. Training can use any deep learning technique in combination with any supervised, unsupervised, or semi-supervised technique. The model can be trained using the training data to predict the visual properties of subsequent interfaces of the TPS given a sequence of the current interface and / or past interfaces of the TPS. For example, the model can be configured to ingest the graphical user interface (GUI) of the TPS and predict how the next step should look (e.g., the set of visual properties of the next GUI).

[0075] Machine learning models can be further refined using historical performance data and / or real-time TPS information. Using historical data during training helps the model identify patterns that lead to successful treatment plans, allowing it to learn from past successes and mistakes. Real-time TPS status data, such as the current stage of a treatment plan or the active tool, allows the model to track how the TPS evolves throughout the workflow. By incorporating all these elements, machine learning models can become more robust and context-aware, enabling them to more accurately predict user actions and provide real-time guidance aligned with both the current state of the TPS and the clinical goals of the healthcare professional.

[0076] In practice, this model can be configured to predict the next steps or actions a user might take within the TPS based on their current interaction. For example, the model can handle real-time (or near real-time) screen captures and user input to anticipate how the visual interface will evolve as the user navigates the system. The model can also consider contextual information such as patient data, current system state, and user behavior patterns, allowing it to predict specific actions, such as selecting menu options, adjusting treatment parameters, or opening dialog boxes. Furthermore, the model can be trained to predict how the interface should respond to these actions, providing a continuous, real-time understanding of the workflow. This predictive capability can help guide users through the complexities of the TPS environment by offering suggestions for next steps or highlighting potential options to optimize the treatment planning process.

[0077] In some embodiments, the video streaming model (VSM) can be trained using a training dataset. Specifically, a dataset consisting of screen captures from healthcare professionals interacting with the TPS can be ingested by the VSM. In some embodiments, the training dataset can also be augmented using the data and methods discussed herein.

[0078] The training process can begin by collecting a series of screen capture sequences documenting the user's interactions with the TPS, capturing actions such as selecting menu options, entering treatment parameters, and adjusting settings. These sequences can then be paired with any available contextual information, such as patient data or task-related annotations, providing additional insights into why certain actions were taken.

[0079] The VSM can be trained to predict the next screen in a sequence (e.g., at least one visual attribute of the next screen) by analyzing patterns in previous screens. For example, given a set of frames showing a user setting treatment parameters, the VSM can learn to predict what the next screen will display based on a sequence of previous actions. For instance, the VSM can predict which menu options the next screen will show. In some embodiments, training can be performed in a supervised learning manner, where the actual next screen acts as the ground truth, and the model iteratively improves by reducing the discrepancy between its predictions and the actual screen.
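
By way of non-limiting illustration, the supervised setup described above, in which the actual next screen serves as the ground truth, could be prepared with a sliding window over a capture sequence. The function name `next_screen_pairs`, the `history` parameter, and the use of screen labels in place of images are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def next_screen_pairs(sequence: Sequence[T], history: int) -> List[Tuple[List[T], T]]:
    """Build (previous screens, actual next screen) training pairs.

    Each pair holds `history` consecutive screens as the model input and the
    screen that actually followed them as the ground-truth target.
    """
    if history < 1:
        raise ValueError("history must be at least 1")
    pairs: List[Tuple[List[T], T]] = []
    for end in range(history, len(sequence)):
        context = list(sequence[end - history:end])
        target = sequence[end]
        pairs.append((context, target))
    return pairs


# Hypothetical usage: screen names stand in for the captured images.
screens = ["patient_list", "contouring", "beam_setup", "dose_review"]
pairs = next_screen_pairs(screens, history=2)
```

A sequence shorter than or equal to `history` yields no pairs, since no ground-truth next screen is available for it.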

[0080] To further refine the model, advanced techniques such as Generative Adversarial Networks (GANs) can be used, where a generator predicts the next screen and a discriminator evaluates the accuracy of that prediction. Over time, the VSM could become adept at predicting future screens, thus providing real-time guidance to users by predicting what their next steps should be based on their interactions with TPS.

[0081] In step 208, the analytics server can execute the machine learning model to predict the future visual attributes of future screen captures of the radiotherapy treatment planning platform based on the current screen capture of the radiotherapy treatment planning platform with which the user is interacting.

[0082] As discussed herein, the model can be configured to predict the user's next action (or expected action) within a TPS based on the user's current interaction. The model can provide real-time (or near-real-time) guidance, suggesting the next steps in the workflow, or highlighting potential options for the user based on patterns it has learned during training.

[0083] The model can ingest real-time (or near real-time) data indicating how the user is navigating the TPS. As a result, the model can predict the visual attributes of the next screen. The analytics server can use various methods to provide the user with this predicted information.

[0084] In some embodiments, the model can be configured to detect deviations from the expected workflow. For example, if a user is about to take an action that differs from the predicted optimal path or may lead to an error, the model can flag such behavior and provide corrective suggestions and / or send warnings to the user. This capability helps prevent errors and ensures that users follow best practices or clinic-specific protocols during treatment planning. In some embodiments, the model can also act as a proactive assistant, providing predictive visual aids or tooltips. For example, the model can display predicted future screens or suggested sequences of actions, which will help users complete their tasks more efficiently. This allows the system to act like a virtual expert, guiding users through complex workflows and improving their proficiency in using the TPS.

[0085] In some embodiments, predictions generated by a machine learning model can be used to guide users by showing them what action is expected next based on their previous interactions with the TPS. For example, the analytics server can suggest which menu item to select, what value to enter or adjust, or which dialog box to complete. These predicted screen images can be created in the background, and the analytics server can display them to the user only if it has high confidence that the predicted action aligns with the user's current workflow.
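
By way of non-limiting illustration, displaying a background prediction only when confidence is high could be sketched as follows. The `Prediction` type, the confidence score in the range 0 to 1, and the default threshold of 0.9 are hypothetical values chosen for illustration; the disclosure does not specify how confidence is computed or what threshold is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    suggested_action: str
    confidence: float  # assumption: a score in [0, 1] produced by the model


def guidance_to_display(prediction: Prediction, threshold: float = 0.9) -> Optional[str]:
    """Return guidance text only when confidence meets the threshold.

    Returns None when the prediction should stay in the background.
    """
    if prediction.confidence >= threshold:
        return f"Suggested next step: {prediction.suggested_action}"
    return None


# Hypothetical usage: one confident and one uncertain prediction.
shown = guidance_to_display(Prediction("select 'Beam Setup' menu", 0.95))
hidden = guidance_to_display(Prediction("open export dialog", 0.40))
```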

[0086] Additionally or alternatively, the analytics server can use predictions to alert users when they are about to perform potentially harmful or incorrect actions. By flagging deviations from the expected workflow, the model helps prevent errors that could compromise RTTP. If the model is equipped with contextual information—such as patient data or task-specific details—the analytics server can generate a predictive screen only when explicitly requested by the user, guiding them through specific situations requiring additional assistance.

[0087] In embodiments using visual embedding models, predictions can be enhanced by analyzing the content of the predicted screen to provide more detailed contextual guidance. For example, a machine learning model can recognize that a user is performing a specific task (such as adjusting treatment parameters) and suggest the next action accordingly. Machine learning models can also support the creation of multiple consecutive predicted screens, allowing the system to guide users through more complex workflows involving multiple steps or stages, providing a comprehensive, step-by-step visual roadmap for the task at hand.
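
By way of non-limiting illustration, multiple consecutive predicted screens could be produced by feeding each prediction back into the model as input. The `rollout` function, the `window` parameter, and the lookup-table `toy_model` are hypothetical stand-ins for a trained model and are assumed here only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def rollout(
    predict_next: Callable[[Sequence[str]], str],
    history: Sequence[str],
    steps: int,
    window: int = 3,
) -> List[str]:
    """Predict `steps` future screens, reusing each prediction as new input."""
    screens = list(history)
    predicted: List[str] = []
    for _ in range(steps):
        next_screen = predict_next(screens[-window:])
        predicted.append(next_screen)
        screens.append(next_screen)  # the prediction becomes part of the context
    return predicted


# Toy stand-in for a trained model: a fixed mapping from the current screen
# to the next one. A screen with no entry is predicted to stay unchanged.
WORKFLOW: Dict[str, str] = {
    "patient_list": "contouring",
    "contouring": "beam_setup",
    "beam_setup": "dose_review",
}


def toy_model(recent: Sequence[str]) -> str:
    return WORKFLOW.get(recent[-1], recent[-1])


roadmap = rollout(toy_model, ["patient_list"], steps=3)
```

Because each step builds on earlier predictions, errors can accumulate over longer rollouts, which is one reason a system might limit the number of steps shown.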

[0088] Compared to traditional methods that use paper or electronic manuals to provide general guidance on how to navigate software, machine learning models trained and implemented using the methods and systems discussed herein can act as experienced colleagues, assisting users in navigating the workflow step-by-step. Using these methods and systems, the analytics server can present users with small, specific actions that can be taken to advance the planning process in real-time or near real-time. This approach allows users to efficiently navigate workflows with personalized, real-time guidance based on specific clinical situations. As a result, the methods and systems discussed herein can significantly reduce the time required to navigate complex tasks, helping users complete comprehensive and approved radiotherapy plans faster and more accurately than with traditional manual guidance or conventional LLMs.

[0089] In one example, medical physicists use the TPS to develop radiotherapy plans for cancer patients. The physicists begin by selecting appropriate cases and examining tumor locations. As they successively work through the steps of adjusting the treatment beam and defining the dose distribution, the system, powered by the generative AI model discussed herein, continuously monitors the user's actions by capturing screen images and analyzing input.

[0090] As physicists move through the workflow (e.g., different GUIs of TPS), the model predicts the visual properties of the next GUI of TPS based on past interactions, patient-specific data, and the current screen layout.

[0091] Based on predicted visual attributes, the model can determine whether the physicist has navigated to a next GUI that matches the predicted visual attributes. If the visual attributes indicate that the physicist has navigated to a different page / GUI, the system can display a notification and guide the physicist to the correct next step. For example, the system could display a small notification suggesting specific menu options. Alternatively, the system might automatically prepare the next screen or dialog box, making it easier for the physicist to continue working without manually navigating the interface.

[0092] If a physicist is about to make a mistake or has already made one (such as entering a dosage value in the wrong field), the model can detect this deviation from the expected workflow and provide an alert, thus suggesting the correct action. Additionally, if users are unsure what to do next, they can request guidance, and the system will visually display the next predicted step, much like a virtual assistant offering expert advice.
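
By way of non-limiting illustration, comparing predicted visual attributes with the screen the user actually reached could be sketched as follows. Representing visual attributes as a dictionary of named values, and the names `find_deviations` and `deviation_notice`, are assumptions made for illustration only.

```python
from typing import Dict, List, Optional

# Assumption: visual attributes are modeled as named string values.
Attributes = Dict[str, Optional[str]]


def find_deviations(predicted: Attributes, actual: Attributes) -> List[str]:
    """Return the names of predicted attributes that the actual screen does not match."""
    return sorted(key for key, value in predicted.items() if actual.get(key) != value)


def deviation_notice(predicted: Attributes, actual: Attributes) -> Optional[str]:
    """Return a notification message if the actual screen deviates, else None."""
    mismatched = find_deviations(predicted, actual)
    if not mismatched:
        return None
    return "Screen differs from the expected next step in: " + ", ".join(mismatched)


# Hypothetical usage: the user navigated to a different page than predicted.
predicted = {"title": "Beam Setup", "active_tool": "beam"}
actual = {"title": "Dose Review", "active_tool": "beam"}
notice = deviation_notice(predicted, actual)
```

An exact-match comparison is used here for simplicity; a practical system would likely tolerate small visual differences before raising a notification.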

[0093] Referring now to Figure 3, a non-limiting example of data flow within an AI-enabled treatment planning system is depicted. In the depicted example, a generative machine learning model is trained to predict how the TPS user interface will visually change over time as a user performs various tasks. Once trained, the model can be used to continuously monitor the TPS interface (e.g., while it is being used by a user), thereby predicting upcoming screen changes in real-time or near real-time based on user actions.

[0094] In the depicted embodiment, the model is a VSM. However, the methods and systems discussed herein are applicable to all machine learning models. Therefore, no limitation is intended.

[0095] Embodiment 300 describes a non-limiting implementation of a VSM that takes a variable-length series of screen captures as input, wherein the VSM is trained to predict the next screen in the sequence. In this approach, the VSM operates only on the visual data presented in these screen captures, without requiring additional contextual information such as patient data or TPS status. The VSM's task is to learn patterns of user interaction by observing how the interface evolves over time.

[0096] To evaluate the accuracy of the VSM's predictions, the system can calculate the pixel difference between the predicted screen and the actual next screen in the sequence. This method quantifies how closely the VSM's predictions match the actual screen, thus helping to fine-tune its accuracy. Additionally, to ensure compliance with privacy and data sensitivity requirements, screen captures can be preprocessed to filter (e.g., redact) irrelevant or sensitive information, such as patient names, identifiers, or temporary data that does not contribute to training. This allows the VSM to focus only on relevant interface components, thereby improving its predictive efficiency while maintaining the privacy of sensitive data. Furthermore, this prevents the VSM from learning, and accidentally sharing, confidential patient data.
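
By way of non-limiting illustration, one possible pixel-difference measure is the mean absolute difference between corresponding pixels. The disclosure does not specify which measure is used; the function name `mean_absolute_pixel_difference` and the grayscale grid representation are assumptions made for illustration only.

```python
from typing import List

# Assumption: a screen is modeled as rows of grayscale pixel values.
Frame = List[List[int]]


def mean_absolute_pixel_difference(predicted: Frame, actual: Frame) -> float:
    """Average the absolute per-pixel difference between two same-sized frames."""
    same_rows = len(predicted) == len(actual)
    if not same_rows or any(len(p) != len(a) for p, a in zip(predicted, actual)):
        raise ValueError("frames must have the same shape")
    total = 0
    count = 0
    for p_row, a_row in zip(predicted, actual):
        for p, a in zip(p_row, a_row):
            total += abs(p - a)
            count += 1
    if count == 0:
        raise ValueError("frames must not be empty")
    return total / count


# Hypothetical usage: two of four pixels differ by 4 gray levels each.
predicted = [[0, 10], [20, 30]]
actual = [[0, 14], [20, 26]]
error = mean_absolute_pixel_difference(predicted, actual)
```

A value of zero indicates an exact match, and larger values indicate a predicted screen that is further from the actual one.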

[0097] In embodiment 302, the VSM ingests both screen captures and contextual information as input to enhance its predictive capabilities. In addition to visual data from the TPS interface, the VSM incorporates relevant patient information such as medical history, treatment goals, prognostic information, and physician instructions. This increased context allows the VSM to better understand specific situations and tailor its predictions to clinical circumstances. By doing so, the VSM can not only predict the next screen but also suggest actions based on the patient's treatment plan and the current state of the system.

[0098] The VSM can also take into account the operational state of the TPS when making predictions. For example, the VSM can consider which features or menus are currently active, what tools the user has selected, and the progress made in the workflow. This allows the VSM to generate predictions that go beyond the visual layout of the screen and are highly relevant to the user's workflow and system state. If the user has reached a specific stage in the treatment plan, the VSM can anticipate the next logical action or system change, thereby simplifying the process and minimizing unnecessary / incorrect / inefficient steps.

[0099] Additionally, the VSM can incorporate verbal explanations or other real-time feedback from the user. For example, if a user verbally indicates that they will adjust treatment parameters, the VSM can integrate this information and predict the appropriate tool or screen. This multifaceted approach provides dynamic, real-time (or near real-time) guidance, thus not only predicting the next screen but also understanding the rationale behind the user's actions. This results in more accurate, context-aware assistance, reducing errors and improving overall workflow efficiency within the TPS.

[0100] In the final embodiment 304, the AI-enabled treatment planning system includes both a VSM and a large language model (LLM) linked together by a visual embedding model. In this embodiment, the visual embedding model converts screen captures or images from the TPS into natural language descriptions or latent space vector representations. These representations can capture necessary information from the TPS interface, such as menus, data fields, or graphical elements, in a manner that can be interpreted by the LLM.

[0101] Once the image data is converted into the corresponding visual embedding, LLM can process the output of the visual embedding model along with contextual information, such as patient-specific data or the current TPS state. Trained to understand both clinical context and TPS structure, an LLM can generate natural language interpretations or structured information that informs the next prediction. For example, an LLM might interpret a screen containing dose parameters and predict that the next action might involve adjusting the treatment beam based on the current patient condition.

[0102] Finally, in embodiment 304, the VSM leverages the rich output from the LLM to predict the next screen in the TPS interface. By combining visual data with natural language interpretation and clinical context, the VSM can provide more accurate and context-aware predictions of user actions. This integrated system not only predicts visual transitions but also generates meaningful insights into the user's workflow, thereby helping to guide the user through complex tasks with personalized, step-by-step assistance adapted to the clinical setting and user behavior.

[0103] In a non-limiting example of Example 304, a medical professional using a TPS to design a radiotherapy plan for a patient adjusts dose parameters on a screen. As they interact with the interface, the visual embedding model converts the screen capture into a latent vector representation or natural language description, such as “the current screen shows dose adjustments for a left lung tumor.” This data is then transmitted to an LLM, which combines it with patient-specific information, such as the location of the tumor and the prescribed treatment goals. The LLM interprets this context and generates predictions for the next possible action, such as “adjust beam intensity” or “navigate to the next planning step.” The VSM then predicts the next screen interface based on this combined input / prediction from the LLM. For example, the VSM could predict the screen showing the next step in the beam adjustment workflow, thus providing real-time guidance and advice to the user as they proceed with the treatment planning process.
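
By way of non-limiting illustration, the data flow of this example (visual embedding model, then LLM, then VSM) could be sketched as a chain of three callables. The function `predict_next_screen` and the `toy_embed`, `toy_llm`, and `toy_vsm` stand-ins are hypothetical; they return fixed strings and only show how the outputs are passed along, not how any trained model behaves.

```python
from typing import Callable, Dict


def predict_next_screen(
    screen_capture: str,
    patient_context: Dict[str, str],
    embed: Callable[[str], str],
    llm: Callable[[str, Dict[str, str]], str],
    vsm: Callable[[str, str], str],
) -> str:
    """Chain the three models: describe the screen, infer an action, predict the screen."""
    description = embed(screen_capture)
    suggested_action = llm(description, patient_context)
    return vsm(screen_capture, suggested_action)


# Toy stand-ins; an actual system would call trained models here.
def toy_embed(screen_capture: str) -> str:
    return f"the current screen shows {screen_capture}"


def toy_llm(description: str, patient_context: Dict[str, str]) -> str:
    is_dose_screen = "dose adjustments" in description
    if is_dose_screen and patient_context.get("tumor_site") == "left lung":
        return "adjust beam intensity"
    return "navigate to the next planning step"


def toy_vsm(screen_capture: str, suggested_action: str) -> str:
    return f"screen after: {suggested_action}"


result = predict_next_screen(
    "dose adjustments",
    {"tumor_site": "left lung"},
    toy_embed,
    toy_llm,
    toy_vsm,
)
```

Passing the models in as callables keeps the data flow separate from any particular model implementation, so each stage could be replaced independently.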

[0104] The various illustrative logic blocks, modules, circuits, and algorithmic steps described in conjunction with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability between hardware and software, the various illustrative components, blocks, modules, circuits, and steps have been generally described above in terms of their functionality. Whether this functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art can implement the described functionality in different ways for each specific application, but such implementation decisions should not be construed as departing from the scope of this disclosure or the claims.

[0105] Embodiments implemented in computer software can be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. Code segments or machine-executable instructions can represent procedures, functions, subprograms, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. Code segments can be coupled to other code segments or hardware circuitry by passing and / or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., can be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, etc.

[0106] The actual software code or specialized control hardware used to implement these systems and methods does not limit the claimed features or this disclosure. Therefore, the operation and behavior of the systems and methods are described without reference to specific software code, and it should be understood that the software and control hardware can be designed to implement the systems and methods based on the description herein.

[0107] When implemented in software, functions can be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium. The steps of the methods or algorithms disclosed herein can be embodied in a processor-executable software module that may reside on a computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable media include both computer storage media and tangible storage media that facilitate the transfer of a computer program from one place to another. A non-transitory processor-readable storage medium can be any available medium accessible to a computer. By way of example and not limitation, such non-transitory processor-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disc storage devices, disk storage devices or other magnetic storage devices, or any other tangible storage medium that can be used to store desired program code in the form of instructions or data structures and is accessible to a computer or processor. Disks and optical discs as used herein include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, wherein disks typically reproduce data magnetically, while optical discs reproduce data optically using lasers. Combinations of the foregoing should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and / or instructions on a non-transitory processor-readable medium and / or computer-readable medium that may be incorporated into a computer program product.

[0108] The foregoing description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the embodiments described herein and variations thereof. Various modifications to these embodiments will readily be apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the spirit or scope of the subject matter disclosed herein. Therefore, this disclosure is not intended to be limited to the embodiments shown herein, but is accorded the widest scope consistent with the appended claims and the principles and novel features disclosed herein.

[0109] While various aspects and embodiments have been disclosed, other aspects and embodiments are contemplated. The various aspects and embodiments disclosed are for illustrative purposes and are not intended to be limiting, with the true scope and spirit being indicated by the appended claims.

Claims

1. A method comprising: monitoring, by at least one processor, a sequence of screen captures generated from the user interface of a radiotherapy treatment planning platform operated by a group of medical professionals; generating, by the at least one processor, a training dataset comprising a time-series dataset corresponding to the screen capture sequence; training, by the at least one processor, a machine learning model using the training dataset, such that the machine learning model is configured to predict the visual attributes of the next screen capture of the user interface of the radiotherapy treatment planning platform based on previous screen captures of the radiotherapy treatment planning platform; and executing, by the at least one processor, the machine learning model to predict future visual attributes of future screen captures of the radiotherapy treatment planning platform based on the current screen capture of the radiotherapy treatment planning platform with which the user is interacting.

2. The method according to claim 1, further comprising: displaying, by the at least one processor, a warning notification when a future screen capture deviates from the future visual attributes predicted by the machine learning model.

3. The method of claim 1, wherein the machine learning model is further trained using videos recording interactions between medical professionals and the radiotherapy treatment planning platform.

4. The method according to claim 1, further comprising: redacting, by the at least one processor, at least one visual attribute within at least one screen capture in the training dataset.

5. The method of claim 1, wherein the machine learning model is further trained using auditory instructions from a medical professional interacting with the radiotherapy treatment planning platform.

6. The method of claim 1, further comprising: displaying, by the at least one processor, at least one of the future visual attributes.

7. The method of claim 1, wherein the machine learning model is trained for a specific clinic.

8. A computer-readable medium comprising a non-transitory instruction set, said non-transitory instruction set, when executed, causing a processor to: monitor a sequence of screen captures generated from a user interface of a radiotherapy treatment planning platform operated by a group of medical professionals; generate a training dataset comprising a time-series dataset corresponding to the sequence of screen captures; train a machine learning model using the training dataset, such that the machine learning model is configured to predict visual attributes of a next screen capture of the user interface of the radiotherapy treatment planning platform based on a previous screen capture of the radiotherapy treatment planning platform; and execute the machine learning model to predict future visual attributes of future screen captures of the radiotherapy treatment planning platform based on a current screen capture of the radiotherapy treatment planning platform with which a user is interacting.

9. The computer-readable medium of claim 8, wherein the instruction set further causes the processor to: display a warning notification when a future screen capture deviates from the future visual attributes predicted by the machine learning model.

10. The computer-readable medium of claim 8, wherein the machine learning model is further trained using video recordings of medical professionals interacting with the radiotherapy treatment planning platform.

11. The computer-readable medium of claim 8, wherein the instruction set further causes the processor to: edit at least one visual attribute in at least one screen capture within the training dataset.

12. The computer-readable medium of claim 8, wherein the machine learning model is further trained using auditory instructions from a medical professional interacting with the radiotherapy treatment planning platform.

13. The computer-readable medium of claim 8, wherein the instruction set further causes the processor to: display at least one of the future visual attributes.

14. The computer-readable medium of claim 8, wherein the machine learning model is trained for a specific clinic.

15. A system including a server, said server being configured to: monitor a sequence of screen captures generated from a user interface of a radiotherapy treatment planning platform operated by a group of medical professionals; generate a training dataset comprising a time-series dataset corresponding to the sequence of screen captures; train a machine learning model using the training dataset, such that the machine learning model is configured to predict visual attributes of a next screen capture of the user interface of the radiotherapy treatment planning platform based on a previous screen capture of the radiotherapy treatment planning platform; and execute the machine learning model to predict future visual attributes of future screen captures of the radiotherapy treatment planning platform based on a current screen capture of the radiotherapy treatment planning platform with which a user is interacting.

16. The system of claim 15, wherein the server is further configured to: display a warning notification when a future screen capture deviates from the future visual attributes predicted by the machine learning model.

17. The system of claim 15, wherein the machine learning model is further trained using videos recording interactions between medical professionals and the radiotherapy treatment planning platform.

18. The system of claim 15, wherein the server is further configured to: edit at least one visual attribute in at least one screen capture within the training dataset.

19. The system of claim 15, wherein the machine learning model is further trained using auditory instructions from a medical professional interacting with the radiotherapy treatment planning platform.

20. The system of claim 15, wherein the server is further configured to: display at least one of the future visual attributes.
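
By way of non-limiting illustration only, the steps recited in the independent claims (generating a time-series dataset from a screen capture sequence, training a model to predict the next screen capture, and flagging a deviation) might be sketched as follows. The sketch rests on several assumptions not stated in the disclosure: each screen capture is reduced to a single hypothetical label standing in for its visual attributes, a simple frequency table stands in for the generative machine learning model, and the names `build_time_series_pairs`, `NextCaptureModel`, and `deviation_warning` are invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

# Assumption: each screen capture is summarized by one hashable label
# (for example, the name of the active panel) rather than raw pixels.
Capture = Hashable
History = Tuple[Capture, ...]


def build_time_series_pairs(
    captures: Sequence[Capture], window: int = 2
) -> List[Tuple[History, Capture]]:
    """Turn one capture sequence into (previous captures, next capture) pairs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        (tuple(captures[i - window:i]), captures[i])
        for i in range(window, len(captures))
    ]


class NextCaptureModel:
    """Frequency-table stand-in for the trained model (not a generative model)."""

    def __init__(self, window: int = 2) -> None:
        self.window = window
        self._counts: Dict[History, Counter] = defaultdict(Counter)

    def train(self, sequences: Sequence[Sequence[Capture]]) -> None:
        # Count how often each next capture follows each history window.
        for seq in sequences:
            for history, nxt in build_time_series_pairs(seq, self.window):
                self._counts[history][nxt] += 1

    def predict(self, recent: Sequence[Capture]) -> Optional[Capture]:
        # Return the most frequent follow-up, or None for an unseen history.
        counts = self._counts.get(tuple(recent[-self.window:]))
        if not counts:
            return None
        return counts.most_common(1)[0][0]


def deviation_warning(
    predicted: Optional[Capture], observed: Capture
) -> Optional[str]:
    """Return a warning message when the observed capture differs from the prediction."""
    if predicted is None or predicted == observed:
        return None
    return f"Warning: expected '{predicted}' but observed '{observed}'"


# Hypothetical recorded sessions; the screen names are invented.
sessions = [
    ["patient_list", "contouring", "dose_calc", "plan_review"],
    ["patient_list", "contouring", "dose_calc", "plan_review"],
    ["patient_list", "contouring", "beam_setup", "dose_calc"],
]
model = NextCaptureModel(window=2)
model.train(sessions)
predicted = model.predict(["patient_list", "contouring"])
message = deviation_warning(predicted, "beam_setup")
```

In this sketch, training on sessions recorded at a single clinic would yield a table specific to that clinic's habits, which loosely mirrors the clinic-specific training recited in the dependent claims. An actual embodiment would presumably operate on image data with a generative sequence model rather than on labels with a lookup table.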