Medical event prediction using machine-learning

Fine-tuned large language models with integrated gradients enhance the accuracy and explainability of medical event predictions from EHRs, providing actionable insights for clinical decisions.

WO2026131101A1 · PCT designated stage · Publication Date: 2026-06-25 · SANOFI SA(FR)

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SANOFI SA(FR)
Filing Date
2025-12-02
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing machine-learning models for predicting medical events from electronic health records lack accuracy and explainability, failing to effectively attribute contributions of specific features to predictions.

Method used

Utilizing large language models like BERT, fine-tuned for medical event prediction tasks, and integrated gradients to quantify the contribution of each token in EHR data to predictions, enhancing explainability and accuracy.

Benefits of technology

Improves prediction accuracy and provides explicit insights into feature impacts, enabling better clinical decision-making.

Abstract

This specification relates to predicting medical events based on medical records using machine learning. According to a first aspect of this specification, there is described a computer implemented method, comprising: inputting, into a machine-learning language model, a set of input data relating to a subject, the input data comprising: a plurality of medical record tokens, wherein a plurality of the medical record tokens each indicate a respective medical event for the subject within a first time window; and a plurality of time tokens, each time token corresponding to a respective medical record token and indicating a respective time associated with the medical record token. The method further comprises processing, by the machine-learning language model, the set of input data to generate a set of output data, the set of output data comprising data indicative of a prediction of a medical event for the subject within a second time window subsequent to the first time window. The method further comprises generating, based on the set of input data and the prediction of a medical event, an attribution score for each of a plurality of the input medical record tokens using an integrated gradients method, wherein the attribution score for a token is indicative of an amount that said token contributed to the prediction of the medical event.

Description

[0001] MEDICAL EVENT PREDICTION USING MACHINE-LEARNING

[0002] TECHNICAL FIELD

[0003] This specification relates to predicting medical events based on medical records using machine learning.

[0004] BACKGROUND

[0005] Electronic health records (EHRs) are digital versions of patients’ medical histories, systematically capturing comprehensive information about a patient’s past diagnoses, treatments, medications, immunizations, laboratory test results, and more. With the move to EHRs, their use in studies regarding patients and diseases has exploded. Recently, attention-based deep neural networks have been applied to analyse EHR data. For instance, numerous studies have utilized the encoder-only bidirectional transformer (e.g., BERT) model architecture for clinical applications, including diagnosing previously unidentified diseases, predicting test results, and forecasting treatment outcomes.

[0006] SUMMARY

[0007] According to a first aspect of this specification, there is described a computer implemented method, comprising: inputting, into a machine-learning language model, a set of input data relating to a subject, the input data comprising: a plurality of medical record tokens, wherein a plurality of the medical record tokens each indicate a respective medical event for the subject within a first time window; and a plurality of time tokens, each time token corresponding to a respective medical record token and indicating a respective time associated with the medical record token. The method further comprises processing, by the machine-learning language model, the set of input data to generate a set of output data, the set of output data comprising data indicative of a prediction of a medical event for the subject within a second time window subsequent to the first time window. The method further comprises generating, based on the set of input data and the prediction of a medical event, an attribution score for each of a plurality of the input medical record tokens using an integrated gradients method, wherein the attribution score for a token is indicative of an amount that said token contributed to the prediction of the medical event.

[0008] The method may further comprise one or more of the following features, either alone or in combination. The plurality of medical record tokens may comprise one or more lab test tokens identifying a lab test. The set of input data may further comprise one or more test result tokens each corresponding to a respective lab test token, wherein each test result token indicates an outcome of the respective lab test. A test result token may represent a percentile value for a respective lab test result.

[0009] The plurality of medical record tokens may comprise one or more of: a diagnoses event token; a medication token; a procedural token; and / or a treatment token. The plurality of medical record tokens may comprise one or more demographic tokens indicative of a property of the subject. Respective time tokens associated with the one or more demographic tokens may have a null value.

[0010] Generating, based on the set of input data and the prediction of a medical event, the attribution score for each of a plurality of the input medical record tokens using an integrated gradients method may comprise, for each of a plurality of medical record tokens: computing a set of gradients of the set of output data along an interpolation path between a baseline set of input data and the set of input data in an embedding space, wherein the interpolation path is defined by an interpolation parameter; numerically integrating the set of gradients over the interpolation parameter; scaling the integrated set of gradients by a difference in the embedding space between the medical record token in the baseline set of input data and the medical record token in the set of input data to obtain a scaled integrated gradient; and summing components of the scaled integrated gradient to obtain the attribution score for the medical record token.

[0011] The baseline set of input data may comprise a null value for each of a plurality of medical record tokens. The plurality of medical record tokens may comprise one or more demographic tokens indicative of a property of the subject; and the baseline set of input data may comprise, for each demographic token, a respective average demographic token value. The plurality of medical record tokens may comprise one or more lab test tokens identifying a lab test. The set of input data may further comprise one or more test result tokens each corresponding to a respective lab test token, wherein each test result token indicates an outcome of the respective lab test; and the baseline set of input data comprises, for each test result token, a 50th percentile token.

[0012] The method may further comprise: iterating the method over a set of health records, the set of health records comprising a plurality of sets of input data, each relating to a respective subject; generating, from the attribution scores for the plurality of sets of input data, an average attribution score for each token type in the medical event tokens; and identifying a set of token types that are most predictive of the medical event in the second time window.

[0013] The medical events for the subject may relate to an asthma diagnosis, and the prediction of a medical event for the subject may relate to an asthma event. The medical events for the subject may relate to a diabetes diagnosis, and the prediction of a medical event for the subject may relate to a diabetes event. The medical events for the subject may relate to an uncontrolled disease outcome. The medical events for the subject may relate to a non-response in a user therapy.

[0014] The method may further comprise: identifying, based on one or more of the attribution scores, one or more input tokens that contributed most to the prediction of the medical event; and providing a treatment and / or therapy to the subject based at least in part on the identified one or more input tokens.

[0015] The time token may indicate a time since a first medical event in the plurality of medical record tokens. The first medical event may be a diagnosis event.

[0016] According to a further aspect of this specification, there is described a computer program product comprising computer readable instructions that, when executed by a computer, cause the computer to perform any one or more of the methods described herein.

[0017] According to a further aspect of this specification, there is described a system comprising one or more processors and a memory, the memory storing computer readable instructions that, when executed by the one or more processors, cause the system to perform any one or more of the methods described herein.

[0018] BRIEF DESCRIPTION OF THE DRAWINGS

[0019] Exemplary embodiments of the present invention are described with reference to the accompanying drawings, in which:

[0020] FIG. 1 shows an overview of an example method for predicting medical events from electronic health records using a machine-learning model;

[0021] FIG. 2 shows an example of an electronic health record;

[0022] FIG. 3 shows an overview of an example method for training and evaluating a machine-learning model; FIG. 4 shows a flow diagram of an example method for predicting medical events from electronic health records using a machine-learning model;

[0023] FIG. 5 shows a comparison of the performance of an example model trained using the method described herein with other machine-learning medical event prediction models on the task of predicting asthma exacerbation; and

[0024] FIG. 6 shows a schematic overview of an example computing system.

[0025] DETAILED DESCRIPTION

[0026] The specification describes the use of machine-learning models, and in particular large language models, to predict medical events for a subject (e.g., a patient) using electronic health records (EHRs), e.g., a history of diagnoses, lab tests, prescriptions and procedures for the subject. Compared with traditional machine-learning approaches, the models described herein have a higher accuracy. Furthermore, this specification describes the use of integrated gradients to attribute events in medical records that lead to said predicted medical events. The explainability approach described herein can be applied to many therapeutic areas and prediction tasks using language models trained on electronic health records. Examples are described herein with reference to asthma exacerbation, but the methods described herein are applicable to other therapeutic areas too, e.g., to diabetes (such as Type I diabetes) or inflammatory diseases. The methods described herein may also be used to predict the response of a subject to a treatment.

[0027] To address the shortcomings of existing EHR models and improve on the use of explainable AI in the analysis of EHR data, this specification uses a large language model (LLM), e.g., a BERT model, that has been pre-trained on a large dataset of EHR and, in some examples, claims data. The LLM is fine-tuned for prediction tasks, e.g., prediction tasks related to asthma exacerbation, diabetes screening, and / or disease progression out of control and / or non-response to a treatment. The models described herein can yield results with an increased accuracy when compared to previous methods. Furthermore, the features identified by the models can include novel predictions that can provide new insights into disease progression and treatment response. To enhance explainability compared to prior methods, Integrated Gradients methods are used to quantify the contribution of each token to a prediction. This approach begins with a baseline that represents a neutral input, such as an all-zeros vector, and gradually transitions to the actual EHR input. As the input changes, the model records the rate of change for each newly added feature, which helps identify key features. This method is agnostic to the model’s implementation, requiring only that the training process is differentiable with respect to the input feature space. Therefore, unlike previous methods, the models described herein can incorporate lab tests and provide explicit statements regarding the impact of specific feature values on the outcome.

[0028] FIG. 1 shows an overview of an example method 100 for predicting medical events from electronic health records using a machine-learning model. The method 100 receives as input a set of input data 102, x, comprising a tokenised EHR for a subject in a first time window, and processes the input data using a machine-learning model 104 to generate output data 106, F(x), comprising a prediction of a medical event in a second time window. Attribution scores 108 are generated for each of a plurality of tokens in the tokenised EHR using an integrated gradient method.

[0029] The input data 102 comprises a tokenised representation of elements of the EHR in a first time window (also referred to herein as a “baseline period”), i.e., an entry in the EHR is represented by one or more tokens in an embedding space taken from a token vocabulary. The tokens may be arranged in an array, for example as described in relation to FIG. 2. The input data comprises a plurality of tokens representing demographic data and medical event data of the subject of the EHR. Each of these tokens is associated with a respective temporal token indicating when the event occurred (for medical events) or indicating that the token is demographic data.

[0030] The medical event data comprises one or more medical events that occur within the first time window. Such events may, for example, include one or more of: a diagnosis of a condition; a treatment for a condition (e.g., a prescription and / or a therapy); a lab test; and / or any other event associated with a medical condition. The first time window may be a time period starting from the initial diagnosis of a condition. For example, the fixed time window may run from the date of the first diagnosis to a current time. Alternatively, the fixed time window may run from the first diagnosis to a predefined subsequent time, e.g., six months, one year, or eighteen months.

[0031] The demographic data represents demographic information relating to the subject of the EHR. The demographic data may, for example, include one or more of: a subject age; a subject gender; a subject race / ethnicity; a subject weight; a subject height; a subject location / region; and / or other demographic data. The demographic data may be arranged in the sequence of tokens before the medical event tokens.

[0032] The temporal tokens indicate either: (i) a time that a medical event occurred, e.g., measured from an initial medical event; or (ii) that the token is a demographic token. For example, a zero temporal token may be used to indicate that the corresponding token is a demographic token, and a numerical token greater than or equal to one may be used to indicate the time from an initial diagnosis at which the corresponding medical event occurred.

[0033] In some examples, medical event tokens indicating a lab test occurrence are further associated with a lab test result token indicating the result of the lab test. Some lab test results may be categorical values (e.g., a positive or negative result). The results of such lab tests can be assigned to predefined tokens directly. Some lab test results may take values in a continuous range. The results of such lab tests can be assigned to predefined tokens that represent percentile ranges of the test results, e.g., into 10 bands of ten-percentile ranges. Similar to temporal tokens, the percentile tokens for all non-numeric medical events are set to a 0 token.

[0034] The machine-learning model 104 is, in some examples, a large language model (LLM) with one or more classification heads that has been fine-tuned for predicting the occurrence of a medical event in a second time window. An example of such a model is the BERT model. The LLM may be a transformer-based LLM, e.g., comprise a plurality of transformer layers.

[0035] The medical event predictions 106 may comprise a prediction that a medical event will occur within a second time window (also referred to herein as a “follow-up period”) that is subsequent to the first time window. The second time window may be a predefined time period contiguous with the first time window. For example, the second time window may be the six months, year or eighteen months following the first time window.

[0036] The prediction that a medical event will occur within a second time window may comprise a classification of the subject into one or more pre-defined classes that correspond to types of medical event. For example, the predefined classes may comprise an exacerbation class, indicating that the subject is predicted to experience an exacerbation of the medical condition. The predefined classes may alternatively or additionally comprise a loss-of-control class, indicating that the subject is predicted to experience a loss-of-control of a condition that they currently have under control. The predefined classes may alternatively or additionally comprise a response / non-response label, indicating that the subject is predicted to respond well to the current therapy regime or requires an adjustment to their clinical care. The prediction that a medical event will occur may comprise a distribution of scores over a plurality of classes. The scores may indicate a probability of the event corresponding to the class occurring. In some examples, the highest scoring class is selected as the output classification. Alternatively, one or more classes having a score above a threshold value may be selected as the output classification(s).
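
The two selection rules described above can be sketched as follows. This is a minimal illustration only; the function name `select_classes`, the class names and the score values are hypothetical and do not appear in the specification.

```python
from typing import Dict, List, Optional


def select_classes(scores: Dict[str, float], threshold: Optional[float] = None) -> List[str]:
    """Return the predicted class label(s) from a mapping of class name to score."""
    if threshold is None:
        # Single-label selection: keep only the highest-scoring class.
        return [max(scores, key=scores.get)]
    # Threshold selection: keep every class whose score exceeds the threshold.
    return [label for label, score in scores.items() if score > threshold]


# Hypothetical scores for one subject (illustrative values only).
scores = {"exacerbation": 0.62, "loss_of_control": 0.55, "non_response": 0.10}
top_class = select_classes(scores)
classes_above = select_classes(scores, threshold=0.5)
```

With threshold selection, a subject may receive several classes or none, which may suit tasks where the classes are not mutually exclusive.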

[0037] The Integrated Gradients (IG) method provides a systematic approach to quantify the contribution of each input token to a particular prediction. The IG method is based on the principle of attributing the difference between the model’s prediction for an actual input (x) and a baseline input (x'). The baseline serves as a neutral reference point to highlight the contribution of features, and it is typically chosen to represent a "neutral" or "uninformative" input state. For example, the neutral state for medical event tokens may be taken to be zero. The neutral state for a lab test token may be the embedding of a 50th percentile token for lab tests with continuous results. The neutral state for demographic information may be an average value of the embeddings for the demographic token over an example dataset.
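
One possible way to assemble such a baseline in embedding space is sketched below. The per-position `kinds` labels, the function name `build_baseline` and all numeric values are assumptions made for illustration; the specification does not prescribe a data layout.

```python
import numpy as np


def build_baseline(kinds, embeddings, median_result_embedding, demographic_means):
    """Build a neutral baseline with the same shape as `embeddings` (seq_len, hidden).

    kinds: per-position label, assumed to be "event", "lab_result", or the name
           of a demographic field (e.g. "age") present in `demographic_means`.
    """
    # Medical event positions keep the all-zeros neutral state.
    baseline = np.zeros_like(embeddings, dtype=float)
    for i, kind in enumerate(kinds):
        if kind == "lab_result":
            # Continuous lab results: embedding of the 50th-percentile token.
            baseline[i] = median_result_embedding
        elif kind in demographic_means:
            # Demographics: average embedding of that field over a reference dataset.
            baseline[i] = demographic_means[kind]
    return baseline


# Toy example with a hidden size of 4 (values are illustrative only).
kinds = ["age", "event", "lab_result"]
embeddings = np.ones((3, 4))
baseline = build_baseline(
    kinds,
    embeddings,
    median_result_embedding=np.full(4, 0.5),
    demographic_means={"age": np.full(4, 0.25)},
)
```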

[0038] The baseline and actual input are represented as embeddings, with the baseline as a vector of neutral embeddings and the actual input represented by its corresponding embedding. Given these two inputs, the IG method works by computing the gradient of the model prediction along a path that interpolates between the baseline and the actual input and summing these gradients (in some examples with a scaling applied based on the difference between the actual input and the baseline input in an embedding space). The attribution score is calculated by summing over the hidden dimension.

[0039] For example, the IG for the i-th input feature may be defined as:

$$\mathrm{IG}_i(x) = \int_{0}^{1} \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha$$

[0040] Here, F denotes the prediction function of the model; x' + α(x − x') represents the interpolated input with the scaling parameter α (which has values between zero and one), and ∂F(x) / ∂x_i is the gradient of F(x) along the i-th dimension. In some examples, the integral over α is approximated using a Riemann sum over m steps, where α = k / m for k = 1, ..., m. The integrated gradient may then be scaled by the difference between the actual and baseline input (in embedding space), i.e., (x_i − x'_i), to reflect the attribution of the i-th feature within the embedding space. The resulting output vector has the same shape (i.e., the same dimensions) as the token embedding. A scalar attribution score for the i-th token can be obtained by summing across all dimensions of this vector.
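
The steps in this paragraph (Riemann-sum integration, scaling by the embedding difference, and summing over the hidden dimension) can be illustrated with a small NumPy sketch. The model here is a toy logistic function with an analytic gradient, chosen only so that the example is self-contained; it is an assumption and stands in for the fine-tuned language model, whose gradients would normally come from automatic differentiation.

```python
import numpy as np


def model(E, W, b):
    """Toy differentiable predictor F: sigmoid of a weighted sum of embeddings."""
    return 1.0 / (1.0 + np.exp(-(np.sum(W * E) + b)))


def model_grad(E, W, b):
    """Analytic gradient of the toy model with respect to the embeddings E."""
    p = model(E, W, b)
    return p * (1.0 - p) * W


def integrated_gradients(x, x_base, W, b, m=100):
    """Return one attribution score per token (row of x)."""
    total = np.zeros_like(x)
    for k in range(1, m + 1):
        alpha = k / m  # interpolation parameter, alpha = k / m
        total += model_grad(x_base + alpha * (x - x_base), W, b)
    avg_grad = total / m                 # Riemann-sum approximation of the integral
    scaled = (x - x_base) * avg_grad     # scale by the difference in embedding space
    return scaled.sum(axis=1)            # sum over the hidden dimension


# Three tokens with a hidden size of 2 (illustrative values only).
W = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])
x = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 1.0]])
x_base = np.zeros_like(x)
b = -0.1
token_scores = integrated_gradients(x, x_base, W, b)
```

A useful sanity check is that the attribution scores sum approximately to F(x) − F(x'), with the residual shrinking as m grows.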

[0041] The resulting attribution scores can be output to a user, e.g., a healthcare professional or the subject, to guide clinical decisions. For example, the input sequence tokens may be displayed via the display of a user device, with highlights indicating the corresponding attribution scores, e.g., scores with a darker shade may reflect a greater positive attribution towards the prediction. This can quickly allow a user to identify the medical events in the input data that most contributed to the prediction.

[0042] The method 100 may be iterated over a test dataset to generate a plurality of attribution scores for each input token. Statistical analysis of the attribution scores may be performed to identify features that contribute highly to the output classifications. Such an analysis is described in further detail herein with respect to FIG. 3.
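
A simple form of such a statistical analysis is a per-token-type mean over all subjects, followed by a ranking. The sketch below assumes each subject contributes a list of tokens and a matching list of attribution scores; the token names and score values are illustrative placeholders.

```python
from collections import defaultdict


def mean_attribution_by_token(records):
    """records: iterable of (tokens, scores) pairs, one pair per subject."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, scores in records:
        for token, score in zip(tokens, scores):
            totals[token] += score
            counts[token] += 1
    return {token: totals[token] / counts[token] for token in totals}


def top_tokens(mean_scores, k):
    """Return the k token types with the highest mean attribution."""
    return sorted(mean_scores, key=mean_scores.get, reverse=True)[:k]


# Two hypothetical subjects (placeholder token names and scores).
records = [
    (["DX_ASTHMA", "RX_SABA"], [0.30, 0.10]),
    (["DX_ASTHMA", "RX_OCS"], [0.50, 0.20]),
]
mean_scores = mean_attribution_by_token(records)
most_predictive = top_tokens(mean_scores, k=2)
```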

[0043] FIG. 2 shows an example of an electronic health record 200 for a subject. The electronic health record 200 comprises an array of tokens. An identification column 202 of the array comprises one or more identification tokens, a temporal column 204 of the array comprises a plurality of time tokens, an event column 206 of the array comprises a plurality of medical event tokens, and a test result column 208 of the array comprises one or more lab test result tokens. The array 200 columns 202-208 may be stored in any order; the ordering shown is by way of example only. For example, the time column 204 may alternatively be the final column of the array.

[0044] The identity column 202 of the array comprises one or more identification tokens that indicate an identity of the subject of the EHR. The identification tokens are, in some examples, anonymised to obscure the actual identity of the subject. In the example shown, a single identity token is included. The identity token(s) are not input into the machine-learning model when generating the event predictions.

[0045] The temporal column 204 of the array comprises a plurality of time / temporal tokens (also referred to as “position IDs”), each corresponding to a respective medical event in the event column 206. One or more (e.g., a plurality) of the time tokens represent a time associated with their respective medical events. For example, the time tokens can represent a time since the subject was first diagnosed with a condition, a time since the subject first presented with symptoms of the condition, or the like. The time is, in some examples, a number of days since the first event. The “first event” time is represented by the token ‘1’ in some examples, with the token ‘0’ reserved for indicating demographic data. In other examples, the “first event” time is represented by the token ‘0’.

[0046] In some implementations, the temporal column 204 further comprises one or more tokens indicating that the respective medical record associated with that token comprises demographic information rather than information relating to a medical event. A predefined token may be used to indicate demographic information, e.g., the token ‘0’ or an alphabetical token (e.g., ‘d’).

[0047] Tokens in the array 200 may be ordered based on the temporal tokens in the temporal column 204. For example, the event column 206 and lab test column 208 may be time ordered based on the temporal tokens in the temporal column 204, i.e., so that earlier medical events in time occur in the column before later events. Demographic tokens in the event column 206 may be arranged to occur before the medical event tokens in the event column 206.

[0048] The event column 206 of the array comprises a plurality of medical event tokens and, in some examples, one or more demographic tokens.

[0049] Each medical event token indicates a respective medical event that has occurred for the subject. Such medical events, for example, include one or more of: a medical diagnosis; a medicament (e.g., the prescription and / or administration of a medicament); a therapy or other treatment; a procedure; and / or a lab test. The medical events may represent one or more International Classification of Diseases (ICD) codes, National Drug Codes (NDC(s)), lab codes (e.g., LOINC codes) and / or procedure codes (e.g., CPT / HCPCS codes).

[0050] The demographic tokens each indicate demographic data / information relating to the subject. Such demographic information, for example, includes one or more of: a subject age; a subject ethnicity and / or race; a subject gender; one or more subject regions; a subject height; a subject weight; a subject BMI; one or more lifestyle indications (e.g., whether the subject is a smoker, the weekly consumption of alcohol by the subject, or the like); and / or one or more other physiological measurements of the subject.

[0051] The test result column 208 of the array comprises one or more lab test result tokens that each correspond to a respective “lab test” medical event token in the event column 206. Each lab test result token indicates a result of its respective lab test for the subject. In some examples, for lab tests with continuous values, a lab test token can represent a percentile value of the lab test. For example, the lab test results are discretized by converting each test result into one of N (e.g., N=5, 10, or 20) percentile-based categories. For example, for N=10, a value of 1 indicates that the result falls within the 0-10% percentile range, while a value of 9 corresponds to the 80-90% percentile range. For categorical lab tests (e.g., “positive” or “negative”, etc.), discrete categorical tokens can be used.
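
The percentile discretization can be sketched as below. The sketch assumes that percentiles are computed against a reference set of results for the same lab test (the specification does not state how the reference distribution is obtained), and the function name `percentile_token` is hypothetical.

```python
import numpy as np


def percentile_token(value, reference, n_bins=10):
    """Map a continuous lab result to a percentile category in 1..n_bins.

    Category 1 covers the lowest percentile range and category n_bins the
    highest; 0 is left free for positions without a numeric result.
    """
    ref = np.sort(np.asarray(reference, dtype=float))
    # Number of reference results strictly below the value.
    below = int(np.searchsorted(ref, value, side="left"))
    # Integer arithmetic avoids floating-point edge cases at bin boundaries.
    return min(n_bins, (below * n_bins) // len(ref) + 1)


# Reference distribution: 100 results with values 1..100 (illustrative only).
reference = np.arange(1, 101)
low = percentile_token(5, reference)     # falls in the 0-10% range
high = percentile_token(85, reference)   # falls in the 80-90% range
```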

[0052] FIG. 3 shows an overview of an example pipeline 300 for training and utilising a large language model for medical event prediction and EHR analysis. LLM models, such as BERT, were first introduced in the context of natural language, allowing unsupervised pretraining using large unlabelled datasets. This approach can be extended to process structured electronic health records (EHRs) and to fine-tune models for disease prediction tasks.

[0053] The pipeline 300 uses a database of EHRs 302 as a source of training data for a training dataset 304, validation data for a validation dataset 306 and test data for a test dataset 308. Predefined ratios may be used to split the data into the training dataset 304, validation dataset 306 and test dataset 308, e.g., 70:15:15. The EHRs are pre-processed 310 to convert them to token-based representation; since the data is a set of events, such as various diagnoses, prescriptions, and laboratory tests, the input to the model is a sequence of coded tokens each representing an event (e.g., an ICD-10 code). To incorporate lab tests with continuous values, the results are discretized by converting each test result into one of 10 percentile-based categories. For example, a value of 1 indicates that the result falls within the 0-10% percentile range, while a value of 9 corresponds to the 80-90% percentile range. Following this procedure, the EHR of each subject is represented as three input arrays: (i) a sequence of input tokens including demographic information (e.g., age or gender) and medical events in a time-window (e.g., diagnosis or prescription); (ii) a sequence of position IDs, representing the number of days since the first medical event in the time-window; and (iii) a sequence of percentile IDs, representing what range of values a given lab test result falls under.
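
A minimal sketch of how the three input arrays might be assembled for one subject is given below. It assumes the convention in which demographic tokens receive position ID 0 and the first medical event receives position ID 1; the token strings and the function name `encode_record` are hypothetical placeholders.

```python
from datetime import date


def encode_record(demographics, events):
    """Build (tokens, position_ids, percentile_ids) for one subject.

    demographics: list of demographic token strings.
    events: list of (event_date, code, percentile_or_None) tuples.
    """
    events = sorted(events, key=lambda e: e[0])  # time-order the medical events
    tokens = list(demographics)                  # demographics come first
    position_ids = [0] * len(demographics)       # 0 marks demographic tokens
    percentile_ids = [0] * len(demographics)     # 0 marks "no numeric result"
    if not events:
        return tokens, position_ids, percentile_ids
    first_day = events[0][0]
    for day, code, percentile in events:
        tokens.append(code)
        # Days since the first event, offset so that the first event is 1.
        position_ids.append((day - first_day).days + 1)
        percentile_ids.append(percentile if percentile is not None else 0)
    return tokens, position_ids, percentile_ids


tokens, position_ids, percentile_ids = encode_record(
    demographics=["AGE_40_49", "GENDER_F"],
    events=[
        (date(2020, 1, 10), "DX_ASTHMA", None),
        (date(2020, 1, 3), "LAB_EOSINOPHILS", 9),
    ],
)
```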

[0054] Each EHR in the training 304 and / or validation 306 dataset may be divided into one or more baseline time windows and a respective one or more follow-up time windows. For example, the baseline time window(s) may be six months, one year (365 days) or 18 months. The follow-up time window(s) may be one month, six months, one year (365 days) or 18 months. In some examples, the baseline time window(s) and a follow-up time window(s) are the same length. In some examples, the baseline time window(s) and a follow-up time window(s) are different lengths. Alternatively or additionally, baseline time windows and follow-up time windows can be determined based on clinically defined events, i.e., do not have a fixed time period. For example, the baseline time window may be defined as the time between a first diagnosis and the start of a first biologic therapy, with the follow-up window being defined as the duration of the biologic therapy. Many other examples of event-based windowing are possible.
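
For the fixed-length case, the split into a baseline window and a contiguous follow-up window can be sketched as follows. The half-open interval convention and the use of an index date (e.g., the first diagnosis) as the start of the baseline window are assumptions made for illustration.

```python
from datetime import date, timedelta


def split_windows(events, index_date, baseline_days=365, followup_days=365):
    """Split (event_date, code) tuples into baseline and follow-up windows.

    Baseline covers [index_date, index_date + baseline_days); follow-up covers
    the next `followup_days` days immediately after the baseline window.
    """
    baseline_end = index_date + timedelta(days=baseline_days)
    followup_end = baseline_end + timedelta(days=followup_days)
    baseline = [e for e in events if index_date <= e[0] < baseline_end]
    followup = [e for e in events if baseline_end <= e[0] < followup_end]
    return baseline, followup


# Hypothetical events for one subject (placeholder codes).
events = [
    (date(2019, 3, 1), "DX_ASTHMA"),
    (date(2020, 2, 1), "ED_VISIT"),
    (date(2021, 6, 1), "RX_SABA"),
]
baseline, followup = split_windows(events, index_date=date(2019, 1, 1))
```

In this example the third event falls outside both windows and is dropped. Event-based windowing would replace the fixed day counts with dates derived from clinical events.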

[0055] The training dataset 304 and validation dataset 306 are used to pre-train 312 and validate one or more base LLM models 314. Each trained base model 314 has a respective vocabulary. Each vocabulary comprises, for example, a minimum number of unique tokens needed to cover at least 99.95% of the occurrences in each EHR. The number of tokens may depend on a granularity of the token coverage; more granular coverage requires a higher number of tokens. Pre-training 312 can, for example, be performed using self-supervision relying on masked language modelling, for example as described in “Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction” (L. Rasmy et al., npj Digital Medicine, 4(1), May 2021, ISSN 2398-6352, the contents of which are incorporated herein by reference) and “BERT: Pre-training of deep bidirectional transformers for language understanding” (J. Devlin et al., In North American Chapter of the Association for Computational Linguistics, 2019, the contents of which are incorporated herein by reference). In some examples, the pre-training 312 can use early stopping evaluated on the categorical cross entropy loss evaluated every 3000 training steps. The pre-training 312 essentially trains the base model 314 to predict one or more subsequent tokens in a sequence given an input sequence of tokens.
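
One way to obtain a minimal vocabulary for a target coverage is to add tokens in order of decreasing frequency until the covered share of occurrences reaches the target. The sketch below assumes coverage is measured over a pooled stream of tokens; the function name `build_vocabulary` and the toy data are illustrative.

```python
from collections import Counter


def build_vocabulary(token_stream, coverage=0.9995):
    """Return the smallest most-frequent-first vocabulary reaching `coverage`."""
    counts = Counter(token_stream)
    total = sum(counts.values())
    vocabulary, covered = [], 0
    for token, n in counts.most_common():
        if covered >= coverage * total:
            break  # target share of occurrences already covered
        vocabulary.append(token)
        covered += n
    return vocabulary


# Toy stream: 90 occurrences of "A", 9 of "B" and 1 of "C".
stream = ["A"] * 90 + ["B"] * 9 + ["C"]
vocab_95 = build_vocabulary(stream, coverage=0.95)
vocab_default = build_vocabulary(stream)
```

Tokens left out of the vocabulary would typically be mapped to a shared unknown token, although the specification does not say how they are handled.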

[0056] For example, a base model which incorporates diagnosis codes, lab tests, prescriptions, and procedure codes, can be generated with a first vocabulary size at a first granularity. A further base model that uses only diagnosis codes, can be generated with a second vocabulary size at a second granularity.

[0057] While the LLM captures unsupervised information, for downstream tasks the model can be fine-tuned on a specific dataset. Cohort selection 316 may be applied to each of the training dataset 304, validation dataset 306 and test dataset 308 to generate a fine-tuning training dataset 318, a fine-tuning validation dataset 320 and a fine-tuning test dataset 322. One or more predefined inclusion criteria and / or one or more exclusion criteria may be applied to each dataset to prune subjects from the dataset. For example, the one or more inclusion criteria for an EHR may comprise one or more of: an EHR including events within a predefined time window, e.g., between 2017 and 2021 inclusive; an EHR having at least two events occurring at least a predefined time period apart, e.g., 365 days; the EHR having at least one inpatient and / or outpatient diagnosis of the condition under study; and / or at least one treatment of a particular type in a baseline time window. The one or more exclusion criteria for an EHR may comprise one or more of: evidence of one or more conflicting diseases; breaks in the EHR for the baseline time period and / or follow up period; death of the subject of the EHR in the baseline time period and / or follow up period; a subject age being below a threshold value, e.g., below 18 years old; and / or the use of one or more predefined therapies in the baseline time period and / or follow up period.

[0058] In some implementations, subjects included in the fine-tuning datasets are each classified as belonging to one or more of a plurality of classes based on their respective EHR. In some examples, a subject is assigned one or more classes for the baseline period and one or more classes for the follow-up period. In some examples, the subject is assigned one or more classes in the follow-up period only. Two or more of the classes may be mutually exclusive.

[0059] For example, for an asthma exacerbation model, subjects meeting the study inclusion / exclusion criteria are classified as belonging to one of three mutually exclusive asthma categories / profiles of interest during two separate time periods, 1) the 365-day baseline period and 2) the 365-day follow-up period. The first category / profile may be an exacerbation category, and comprises subjects having: an inpatient visit with asthma as the primary, admitting, or discharge diagnosis; an emergency department visit with a diagnosis of asthma as the primary diagnosis; and / or an oral corticosteroid (days’ supply greater than or equal to 3 days but less than 30 days) within 7 days of an asthma diagnosis. The second category / profile may be a symptoms without exacerbation category that comprises subjects having: a step-up from a lower to higher GINA step (e.g., Step 2 to 3, Step 3 to 4); receipt of greater than or equal to 6 SABA rescue inhaler prescriptions; diagnosis of uncontrolled asthma symptoms, including dyspnea, wheezing, chest pain or cough at an outpatient office visit with an asthma diagnosis in any position; receipt of an oral corticosteroid prescription with a <3-day supply; and / or occurrence of greater than or equal to 2 outpatient office visits with an asthma diagnosis code in any position. If a patient experiences both a first profile and a second profile qualifying event, they will be classified in the exacerbation category. The third category / profile comprises subjects that do not fall in either the first or the second category. It will be appreciated that other categories may alternatively or additionally be defined, based on the condition under investigation.
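The precedence rule in paragraph [0059], under which an exacerbation event takes priority over a symptom event, can be sketched as follows. The enum and function names are hypothetical, and labelling the third category "controlled" is an assumption based on the wording of the prediction tasks described later; detection of the qualifying events themselves is outside the scope of this sketch.

```python
from enum import Enum


class Profile(Enum):
    EXACERBATION = "exacerbation"
    SYMPTOMATIC = "symptoms without exacerbation"
    CONTROLLED = "controlled"  # assumed label for the third category


def classify_period(has_exacerbation_event: bool,
                    has_symptom_event: bool) -> Profile:
    """Assign one mutually exclusive profile for a single time period."""
    # An exacerbation-qualifying event always wins, even when a
    # symptom-qualifying event is also present in the same period.
    if has_exacerbation_event:
        return Profile.EXACERBATION
    if has_symptom_event:
        return Profile.SYMPTOMATIC
    return Profile.CONTROLLED


# One label per period: baseline and follow-up are classified separately.
baseline = classify_period(has_exacerbation_event=False, has_symptom_event=True)
follow_up = classify_period(has_exacerbation_event=True, has_symptom_event=True)
```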

[0060] In some examples, EHRs may be redistributed between the fine-tuning training dataset 318, fine-tuning validation dataset 320 and fine-tuning test dataset 322 subsequent to cohort selection to ensure balanced datasets, e.g., to maintain the correct ratio of training examples between sets.

[0061] The fine-tuning 324 may be based on the approach used in the Med-BERT model described in “Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction”. One or more classification heads may be added to the base model 314. The base model may then be retrained on the fine-tuning training dataset 318. Each model may be fine-tuned 324 for one or more prediction tasks.

[0062] To mitigate potential class imbalance issues during training, the model may be trained in a stratified manner, ensuring that data from each class is presented to the model in equal proportions during each iteration. In some examples, early stopping with patience is used to halt training if the performance metric (e.g., ROC-AUC for binary models; average ROC-AUC for multiclass models) plateaus or declines over a specified number of evaluation rounds. Model performance can be comprehensively assessed using ROC-AUC, F1 score, precision, recall, and accuracy.
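The two training safeguards of paragraph [0062] can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `balanced_batch` samples with replacement so that each class contributes equally to a batch, and `EarlyStopping` halts when the monitored metric has not improved for `patience` consecutive evaluation rounds. Both names and all parameter values are hypothetical.

```python
import random
from typing import Dict, List


def balanced_batch(indices_by_class: Dict[int, List[int]],
                   per_class: int,
                   rng: random.Random) -> List[int]:
    """Draw the same number of example indices from every class."""
    batch: List[int] = []
    for indices in indices_by_class.values():
        # Sampling with replacement lets small classes fill their quota.
        batch.extend(rng.choices(indices, k=per_class))
    rng.shuffle(batch)
    return batch


class EarlyStopping:
    """Stop when a higher-is-better metric (e.g. ROC-AUC) stops improving."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_rounds = 0

    def step(self, metric: float) -> bool:
        """Record one evaluation round; return True if training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


stopper = EarlyStopping(patience=2)
decisions = [stopper.step(m) for m in [0.60, 0.62, 0.61, 0.62]]
```

In this usage example the metric plateaus after the second round, so the stop signal is raised on the fourth evaluation.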

[0063] As an example, for an asthma exacerbation use case, the base model(s) 314 may be fine-tuned 324 for three prediction tasks:

[0064] • Task 1: Exacerbation prediction: For patients symptomatic at baseline, the model can be fine-tuned to predict whether these patients would experience an exacerbation during the follow-up period, remain the same or get better (i.e. controlled).

[0065] • Task 2: Loss-of-control prediction: For patients who were classified as having controlled asthma during the baseline period, the model can be fine-tuned to predict whether they would experience a loss of control (either symptomatic or exacerbation) during the follow-up period.

[0066] • Task 3: Multiclass prediction: This task expands on Task 2 using a three-way classification for remaining controlled, developing symptomatic loss of control, or experiencing an exacerbation.

Once fine-tuned 324, the model can be applied to the fine-tuning test dataset 322 to generate predictions relating to the subjects in the test dataset. IG-based attribution 326 can be applied to the results to generate, for each prediction, an attribution score for each input token. Statistics of the attribution scores for each token type can be collated to identify which token types contribute most to which predictions. For example, the N-most highly scoring tokens (by magnitude) for a class can be identified, where N is a predefined number, e.g., in the range [20, 100], such as 40.

[0067] For example, attribution scores can be calculated for each token for every subject in the test set. These individual scores can then be averaged across the entire set and ranked by magnitude to identify the tokens with the highest overall attribution scores in the population. To ensure that the aggregated attribution scores reflect meaningful insights, an adjustment was applied to account for infrequent tokens using the following formula:

[0068] Where k is a user-defined minimum frequency threshold and n represents the count of occurrences of the token. Tokens with a frequency less than or equal to k were assigned a score of 0, and rarer events (but still with n>k) received heavier penalties.
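The aggregation described in paragraphs [0067] and [0068] can be sketched as below. The sketch makes two assumptions that should be read with care: scores are averaged over the subjects in which a token occurs, and only the hard threshold is implemented (tokens with n ≤ k receive a score of 0). The graded penalty for tokens with n > k is not reproduced here because it depends on the formula referred to above. The function name `top_tokens` and the token strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_tokens(per_subject_scores: Iterable[Dict[str, float]],
               k: int,
               top_n: int = 40) -> List[Tuple[str, float]]:
    """Average attribution scores per token and rank them by magnitude."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for scores in per_subject_scores:
        for token, score in scores.items():
            sums[token] += score
            counts[token] += 1
    # Tokens seen k times or fewer are zeroed; no further penalty is
    # applied in this sketch.
    adjusted = {t: (sums[t] / counts[t] if counts[t] > k else 0.0)
                for t in sums}
    ranked = sorted(adjusted.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]


# Hypothetical per-subject attribution scores.
scores = [{"DX_A": 0.5, "RX_B": -0.9}, {"DX_A": 0.3}, {"DX_A": 0.1}]
ranking = top_tokens(scores, k=1)
```

With k = 1, the rare token is zeroed and ranks last even though its single raw score has the largest magnitude.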

[0069] The resulting statistics can be evaluated 328 to identify medical events that are most predictive of a given classification output.

[0070] FIG. 4 shows a flow diagram of an example method 400 for predicting medical events from electronic health records using a machine-learning model. The method 400 may be performed by one or more computers operating in one or more locations, e.g., the computing system of FIG. 6. For convenience, the method 400 is described as being performed by a system.

[0071] At operation 402, the system inputs, into a machine-learning model, a set of input data relating to a subject / patient. The set of input data may represent a set of EHRs for the subject. The input data may be in the form of an array of tokens, e.g., an array as described in relation to FIG. 2.

[0072] The input data is derived from an EHR for the subject and comprises a plurality of medical record tokens and a plurality of corresponding time tokens. The plurality of the medical record tokens comprises one or more tokens that each indicate a respective medical event for the subject within a first time window. Examples of medical events include, but are not limited to, a diagnoses event, a medication event, a treatment event, a lab test, and / or an adverse event. The first time window may, for example, encompass the time from an initial medical event (e.g., a first diagnosis) to a present time.

[0073] Each time token corresponds to a respective medical record token and indicates a respective time associated with the medical record token. The time may be a time measured from an initial medical event in the EHR, e.g., a number of days since the first medical event, where the first medical event is associated with the number one.

[0074] In some implementations, the set of medical record tokens further comprises a set of demographic tokens representing demographic information relating to the subject. For example, the demographic tokens may indicate one or more of: a subject age; a subject ethnicity; a subject gender; one or more subject locations, e.g., birth location, home address location, etc.; and / or the like. Each demographic token is, in some examples, associated with a null time token that indicates that the medical record token is a demographic token. For example, each demographic token may be associated with a “time” token with a value of zero or a predefined letter, e.g., “d”.
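The pairing of medical record tokens with time tokens described in paragraphs [0072] to [0074] can be sketched as follows. The sketch assumes that time is counted in days from the first medical event, with the first event assigned the number one, and that demographic tokens receive a null time token of zero. The function name `build_inputs` and the token strings are hypothetical.

```python
from datetime import date
from typing import List, Tuple


def build_inputs(demographics: List[str],
                 events: List[Tuple[date, str]]) -> Tuple[List[str], List[int]]:
    """Return parallel lists of medical record tokens and time tokens."""
    ordered = sorted(events, key=lambda e: e[0])
    tokens: List[str] = list(demographics)
    # Null time token (zero) marks each demographic token.
    times: List[int] = [0] * len(demographics)
    if ordered:
        first_day = ordered[0][0]
        for day, code in ordered:
            tokens.append(code)
            # The first medical event is associated with the number one.
            times.append((day - first_day).days + 1)
    return tokens, times


tokens, times = build_inputs(
    ["AGE_40_49", "SEX_F"],
    [(date(2020, 1, 10), "DX_ASTHMA"), (date(2020, 1, 1), "RX_SABA")],
)
```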

[0075] In some implementations, the set of input data further comprises a set of lab test result tokens. Each lab test result token is associated with a respective medical record token indicating that a lab test has been performed, and indicates the result of the test. In some examples, a lab test result token indicates a percentile value for the result of the lab test rather than an absolute value of the lab test.
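One way to picture the percentile-based lab test result token of paragraph [0075] is to rank a result against a reference sample and bin the rank. The sketch below is an assumption-laden illustration: the use of ten equal bins, the reference sample, and the `LAB_Q` token naming are all hypothetical choices and are not taken from this specification.

```python
from bisect import bisect_right
from typing import List


def percentile_token(value: float,
                     reference_values: List[float],
                     n_bins: int = 10) -> str:
    """Map an absolute lab result to a coarse percentile-bin token."""
    if not reference_values:
        raise ValueError("reference_values must not be empty")
    ref = sorted(reference_values)
    # Number of reference results less than or equal to this value.
    rank = bisect_right(ref, value)
    # Integer arithmetic avoids floating-point edge cases at bin borders.
    bin_index = min(rank * n_bins // len(ref), n_bins - 1)
    return f"LAB_Q{bin_index}"


reference = [float(v) for v in range(1, 101)]
median_token = percentile_token(50.0, reference)
```

Expressing results as percentiles rather than absolute values places lab tests with different units on a common ordinal scale.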

[0076] The machine-learned model may be a language model, e.g., a large language model, with a classification head.

[0077] At operation 404, the system processes, using the machine-learning model, the set of input data to generate a set of output data. The set of output data comprises data indicative of a prediction of a medical event for the subject within a second time window subsequent to the first time window. The second time window may have a fixed duration, e.g., one week to four weeks, one to six months, one year, or the like.

[0078] In some examples, the data indicative of a prediction of a medical event comprises an indication of an event type that is predicted to occur in the second time window.

At operation 406, the system generates, based on the set of input data and the prediction of a medical event, an attribution score for each of a plurality of the input medical record tokens using an integrated gradients method. The attribution score for a token is indicative of an amount that said token contributed to the prediction of the medical event. In some examples, a visual output may be generated based on the set of attribution scores, e.g., an image of the input tokens that are shaded in dependence on their respective attribution scores, or a table of the tokens and their corresponding attribution scores, which may be ordered by score.

[0079] In some implementations, generating the attribution score for each of a plurality of the input medical record tokens comprises, for each of a plurality of medical record tokens, computing a set of gradients of the set of output data along an interpolation path between a baseline set of input data and the set of input data in an embedding space. The interpolation path is defined by an interpolation parameter, α, e.g., a value that varies between zero and one.

[0080] For example, a plurality of interpolated input token values is generated between the baseline input value of a token, x_i′, and the actual input value for that token, x_i, e.g., evenly spaced values between the baseline input value of the token and the actual input value for that token. For each interpolated value of the input token, a gradient / derivative of the output of the machine-learning model, F, with respect to the input token, x_i, is determined. For non-demographic medical record tokens, the baseline value of a token may be taken to be a null value, e.g., zero. For demographic tokens, the baseline value of a token may be taken to be an average value of the demographic token over a set of medical records of different subjects. For lab test result tokens, the baseline value of a token may be taken to be a fiftieth percentile token.

[0081] The resulting gradients for the input token are then numerically integrated over the interpolation parameter. This results in a set of integrated gradients for the input token, x_i. In some implementations, a Riemann sum may be used to numerically integrate the gradients.
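The interpolation and Riemann-sum steps above, together with the scaling and summation step described in the following paragraph, can be sketched in plain Python. The sketch is illustrative only: a toy logistic model over mean-pooled token embeddings stands in for the fine-tuned language model, its gradient is written by hand (a practical implementation would be expected to use automatic differentiation), a midpoint Riemann sum is assumed, and all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]  # one embedding vector (row) per token


def integrated_gradients(grad_fn: Callable[[Matrix], Matrix],
                         inputs: Matrix,
                         baseline: Matrix,
                         steps: int = 50) -> List[float]:
    """Return one attribution score per token."""
    n_tokens, dim = len(inputs), len(inputs[0])
    avg_grad = [[0.0] * dim for _ in range(n_tokens)]
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint Riemann sum over [0, 1]
        point = [[b + alpha * (x - b) for x, b in zip(x_row, b_row)]
                 for x_row, b_row in zip(inputs, baseline)]
        grads = grad_fn(point)
        for t in range(n_tokens):
            for d in range(dim):
                avg_grad[t][d] += grads[t][d] / steps
    # Scale by (input - baseline) in embedding space, then sum components.
    return [sum((inputs[t][d] - baseline[t][d]) * avg_grad[t][d]
                for d in range(dim))
            for t in range(n_tokens)]


def make_toy_model(weights: Sequence[float]
                   ) -> Tuple[Callable[[Matrix], float],
                              Callable[[Matrix], Matrix]]:
    """Toy stand-in model: sigmoid(w . mean of token embeddings)."""
    def forward(x: Matrix) -> float:
        pooled = [sum(col) / len(x) for col in zip(*x)]
        z = sum(w * p for w, p in zip(weights, pooled))
        return 1.0 / (1.0 + math.exp(-z))

    def grad(x: Matrix) -> Matrix:
        y = forward(x)
        scale = y * (1.0 - y) / len(x)
        return [[scale * w for w in weights] for _ in x]

    return forward, grad


forward, grad = make_toy_model([0.5, -1.0, 2.0])
x = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.4]]
x_base = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # null baseline for both tokens
scores = integrated_gradients(grad, x, x_base, steps=200)
```

A useful sanity check is that the attribution scores sum approximately to the difference between the model output at the input and at the baseline; the first token here receives a positive score and the second a negative one.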

[0082] The integrated set of gradients is then scaled by a difference (measured in the embedding space) between the medical record token in the baseline set of input data and the medical record token in the set of input data. The result is a scaled integrated gradient. The components of the scaled integrated gradient are then summed to obtain the attribution score for the medical record token.

FIG. 5 shows a comparison of the performance of an example model trained using the method described herein (BERT-LER) with other machine-learning medical event prediction models on the tasks of predicting asthma exacerbation, predicting asthma loss-of-control, and multiclass prediction. The BERT model corresponds to an LLM model that has been trained using the methods described in relation to FIG. 3, but only using diagnosis medical event tokens (i.e., not using lab test result and treatment tokens). Each evaluation metric represents the weighted average of metrics for each class, except for ROC-AUC, which is the unweighted average of the ROC-AUC for each class. Underlined scores indicate better performance. BERT-LER demonstrated superior capabilities in predicting all three task outcomes when compared to traditional feature-based and MCA-based ML models (average ROC-AUC: BERT-LER: 0.645, Logistic Regression: 0.607, XGBoost: 0.605, EBM: 0.607, MCA Decision Tree: 0.512, MCA Random Forest: 0.545). The inclusion of ordinal lab test results, prescriptions, and procedures in the comprehensive BERT-LER model generally led to improved performance compared to the diagnoses-only BERT model across most evaluation metrics (increase in ROC-AUC: Exacerbation: 0.019, Loss-of-control: 0.071, Multiclass: 0.080).

[0083] FIG. 6 shows a schematic example of a system / apparatus for performing any of the methods described herein. The system / apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices / systems may alternatively be used to implement the methods described herein, such as a distributed computing system.

[0084] The apparatus (or system) 500 comprises one or more processors 502. The one or more processors control operation of other components of the system / apparatus 500. The one or more processors 502 may, for example, comprise a general-purpose processor. The one or more processors 502 may be a single core device or a multiple core device. The one or more processors 502 may comprise a Central Processing Unit (CPU) or a graphics processing unit (GPU). Alternatively, the one or more processors 502 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.

[0085] The system / apparatus comprises a working or volatile memory 504. The one or more processors may access the volatile memory 504 in order to process data and may control the storage of data in memory. The volatile memory 504 may comprise RAM of any type, for example, Static RAM (SRAM), Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.

[0086] The system / apparatus comprises a non-volatile memory 506. The non-volatile memory 506 stores a set of operating instructions 508 for controlling the operation of the processors 502 in the form of computer readable instructions. The non-volatile memory 506 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.

[0087] The one or more processors 502 are configured to execute operating instructions 508 to cause the system / apparatus to perform any of the methods described herein. The operating instructions 508 may comprise code (i.e. drivers) relating to the hardware components of the system / apparatus 500, as well as code relating to the basic operation of the system / apparatus 500. Generally speaking, the one or more processors 502 execute one or more instructions of the operating instructions 508, which are stored permanently or semi-permanently in the non-volatile memory 506, using the volatile memory 504 to temporarily store data generated during execution of said operating instructions 508.

[0088] Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and / or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to FIG. 6, cause the computer to perform one or more of the methods described herein.

[0089] Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.

[0090] Furthermore, any, some and / or all features in one aspect can be applied to any, some and / or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and / or supplied and / or used independently. Although several embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims and their equivalents.

[0091] The terms “drug” or “medicament” are used synonymously herein and describe a pharmaceutical formulation containing one or more active pharmaceutical ingredients or pharmaceutically acceptable salts or solvates thereof, and optionally a pharmaceutically acceptable carrier. An active pharmaceutical ingredient (“API”), in the broadest terms, is a chemical structure that has a biological effect on humans or animals. In pharmacology, a drug or medicament is used in the treatment, cure, prevention, or diagnosis of disease or used to otherwise enhance physical or mental well-being. A drug or medicament may be used for a limited duration, or on a regular basis for chronic disorders.

[0092] As described below, a drug or medicament can include at least one API, or combinations thereof, in various types of formulations, for the treatment of one or more diseases. Examples of API may include small molecules having a molecular weight of 500 Da or less; polypeptides, peptides and proteins (e.g., hormones, growth factors, antibodies, antibody fragments, and enzymes); carbohydrates and polysaccharides; and nucleic acids, double or single stranded DNA (including naked and cDNA), RNA, antisense nucleic acids such as antisense DNA and RNA, small interfering RNA (siRNA), ribozymes, genes, and oligonucleotides. Nucleic acids may be incorporated into molecular delivery systems such as vectors, plasmids, or liposomes. Mixtures of one or more drugs are also contemplated.

[0093] The drug or medicament may be contained in a primary package or “drug container” adapted for use with a drug delivery device. The drug container may be, e.g., a cartridge, syringe, reservoir, or other solid or flexible vessel configured to provide a suitable chamber for storage (e.g., short- or long-term storage) of one or more drugs. For example, in some instances, the chamber may be designed to store a drug for at least one day (e.g., one to at least 30 days). In some instances, the chamber may be designed to store a drug for about 1 month to about 2 years. Storage may occur at room temperature (e.g., about 20°C), or refrigerated temperatures (e.g., from about -4°C to about 4°C). In some instances, the drug container may be or may include a dual-chamber cartridge configured to store two or more components of the pharmaceutical formulation to-be-administered (e.g., an API and a diluent, or two different drugs) separately, one in each chamber. In such instances, the two chambers of the dual-chamber cartridge may be configured to allow mixing between the two or more components prior to and / or during dispensing into the human or animal body. For example, the two chambers may be configured such that they are in fluid communication with each other (e.g., by way of a conduit between the two chambers) and allow mixing of the two components when desired by a user prior to dispensing. Alternatively or in addition, the two chambers may be configured to allow mixing as the components are being dispensed into the human or animal body.

[0094] The drugs or medicaments contained in the drug delivery devices as described herein can be used for the treatment and / or prophylaxis of many different types of medical disorders. Examples of disorders include, e.g., diabetes mellitus or complications associated with diabetes mellitus such as diabetic retinopathy, thromboembolism disorders such as deep vein or pulmonary thromboembolism. Further examples of disorders are acute coronary syndrome (ACS), angina, myocardial infarction, cancer, macular degeneration, inflammation, hay fever, atherosclerosis and / or rheumatoid arthritis. Examples of APIs and drugs are those as described in handbooks such as Rote Liste 2014, for example, without limitation, main groups 12 (anti-diabetic drugs) or 86 (oncology drugs), and Merck Index, 15th edition.

[0095] Examples of APIs for the treatment and / or prophylaxis of type 1 or type 2 diabetes mellitus or complications associated with type 1 or type 2 diabetes mellitus include an insulin, e.g., human insulin, or a human insulin analogue or derivative, a glucagon-like peptide (GLP-1), GLP-1 analogues or GLP-1 receptor agonists, or an analogue or derivative thereof, a dipeptidyl peptidase-4 (DPP4) inhibitor, or a pharmaceutically acceptable salt or solvate thereof, or any mixture thereof. As used herein, the terms “analogue” and “derivative” refer to a polypeptide which has a molecular structure which formally can be derived from the structure of a naturally occurring peptide, for example that of human insulin, by deleting and / or exchanging at least one amino acid residue occurring in the naturally occurring peptide and / or by adding at least one amino acid residue. The added and / or exchanged amino acid residue can either be codable amino acid residues or other naturally occurring residues or purely synthetic amino acid residues. Insulin analogues are also referred to as “insulin receptor ligands”. In particular, the term “derivative” refers to a polypeptide which has a molecular structure which formally can be derived from the structure of a naturally occurring peptide, for example that of human insulin, in which one or more organic substituent (e.g., a fatty acid) is bound to one or more of the amino acids. Optionally, one or more amino acids occurring in the naturally occurring peptide may have been deleted and / or replaced by other amino acids, including non-codeable amino acids, or amino acids, including non-codeable, have been added to the naturally occurring peptide.

[0096] Examples of insulin analogues are Gly(A21), Arg(B31), Arg(B32) human insulin (insulin glargine); Lys(B3), Glu(B29) human insulin (insulin glulisine); Lys(B28), Pro(B29) human insulin (insulin lispro); Asp(B28) human insulin (insulin aspart); human insulin, wherein proline in position B28 is replaced by Asp, Lys, Leu, Val or Ala and wherein in position B29 Lys may be replaced by Pro; Ala(B26) human insulin; Des(B28-B30) human insulin; Des(B27) human insulin and Des(B30) human insulin.

[0097] Examples of insulin derivatives are, for example, B29-N-myristoyl-des(B30) human insulin, Lys(B29) (N-tetradecanoyl)-des(B30) human insulin (insulin detemir, Levemir®); B29-N-palmitoyl-des(B30) human insulin; B29-N-myristoyl human insulin; B29-N-palmitoyl human insulin; B28-N-myristoyl LysB28ProB29 human insulin; B28-N-palmitoyl-LysB28ProB29 human insulin; B30-N-myristoyl-ThrB29LysB30 human insulin; B30-N-palmitoyl-ThrB29LysB30 human insulin; B29-N-(N-palmitoyl-gamma-glutamyl)-des(B30) human insulin, B29-N-omega-carboxypentadecanoyl-gamma-L-glutamyl-des(B30) human insulin (insulin degludec, Tresiba®); B29-N-(N-lithocholyl-gamma-glutamyl)-des(B30) human insulin; B29-N-(ω-carboxyheptadecanoyl)-des(B30) human insulin and B29-N-(ω-carboxyheptadecanoyl) human insulin.

[0098] Examples of GLP-1, GLP-1 analogues and GLP-1 receptor agonists are, for example, Lixisenatide (Lyxumia®), Exenatide (Exendin-4, Byetta®, Bydureon®, a 39 amino acid peptide which is produced by the salivary glands of the Gila monster), Liraglutide (Victoza®), Semaglutide, Taspoglutide, Albiglutide (Syncria®), Dulaglutide (Trulicity®), rExendin-4, CJC-1134-PC, PB-1023, TTP-054, Langlenatide / HM-11260C (Efpeglenatide), HM-15211, CM-3, GLP-1 Eligen, GRMD-0901, NN-9423, NN-9709, NN-9924, NN-9926, NN-9927, Nodexen, Viador-GLP-1, CVX-096, ZYOG-1, ZYD-1, GSK-2374697, DA-3091, MAR-701, MAR709, ZP-2929, ZP-3022, ZP-DI-70, TT-401 (Pegapamodtide), BHM-034, MOD-6030, CAM-2036, DA-15864, ARI-2651, ARI-2255, Tirzepatide (LY3298176), Bamadutide (SAR425899), Exenatide-XTEN and Glucagon-Xten.

[0099] An example of an oligonucleotide is mipomersen sodium (Kynamro®), a cholesterol-reducing antisense therapeutic for the treatment of familial hypercholesterolemia, or RG012 for the treatment of Alport syndrome. Examples of DPP4 inhibitors are Linagliptin, Vildagliptin, Sitagliptin, Denagliptin, Saxagliptin, Berberine.

[0100] Examples of hormones include hypophysis hormones or hypothalamus hormones or regulatory active peptides and their antagonists, such as Gonadotropine (Follitropin, Lutropin, Choriongonadotropin, Menotropin), Somatropine (Somatropin), Desmopressin, Terlipressin, Gonadorelin, Triptorelin, Leuprorelin, Buserelin, Nafarelin, and Goserelin.

[0101] Examples of polysaccharides include a glucosaminoglycane, a hyaluronic acid, a heparin, a low molecular weight heparin or an ultra-low molecular weight heparin or a derivative thereof, or a sulphated polysaccharide, e.g. a poly-sulphated form of the above-mentioned polysaccharides, and / or a pharmaceutically acceptable salt thereof. An example of a pharmaceutically acceptable salt of a poly-sulphated low molecular weight heparin is enoxaparin sodium. An example of a hyaluronic acid derivative is Hylan G-F 20 (Synvisc®), a sodium hyaluronate.

[0102] The term “antibody”, as used herein, refers to an immunoglobulin molecule or an antigen-binding portion thereof. Examples of antigen-binding portions of immunoglobulin molecules include F(ab) and F(ab')2 fragments, which retain the ability to bind antigen. The antibody can be polyclonal, monoclonal, recombinant, chimeric, de-immunized or humanized, fully human, non-human (e.g., murine), or single chain antibody. In some embodiments, the antibody has effector function and can fix complement. In some embodiments, the antibody has reduced or no ability to bind an Fc receptor. For example, the antibody can be an isotype or subtype, an antibody fragment or mutant, which does not support binding to an Fc receptor, e.g., it has a mutagenized or deleted Fc receptor binding region. The term antibody also includes an antigen-binding molecule based on tetravalent bispecific tandem immunoglobulins (TBTI) and / or a dual variable region antibody-like binding protein having cross-over binding region orientation (CODV).

[0103] The terms “fragment” or “antibody fragment” refer to a polypeptide derived from an antibody polypeptide molecule (e.g., an antibody heavy and / or light chain polypeptide) that does not comprise a full-length antibody polypeptide, but that still comprises at least a portion of a full-length antibody polypeptide that is capable of binding to an antigen. Antibody fragments can comprise a cleaved portion of a full length antibody polypeptide, although the term is not limited to such cleaved fragments. Antibody fragments that are useful in the present invention include, for example, Fab fragments, F(ab')2 fragments, scFv (single-chain Fv) fragments, linear antibodies, monospecific or multispecific antibody fragments such as bispecific, trispecific, tetraspecific and multispecific antibodies (e.g., diabodies, triabodies, tetrabodies), monovalent or multivalent antibody fragments such as bivalent, trivalent, tetravalent and multivalent antibodies, minibodies, chelating recombinant antibodies, tribodies or bibodies, intrabodies, nanobodies, small modular immunopharmaceuticals (SMIP), binding-domain immunoglobulin fusion proteins, camelized antibodies, and VHH containing antibodies. Additional examples of antigen-binding antibody fragments are known in the art.

[0104] The terms “Complementarity-determining region” or “CDR” refer to short polypeptide sequences within the variable region of both heavy and light chain polypeptides that are primarily responsible for mediating specific antigen recognition. The term “framework region” refers to amino acid sequences within the variable region of both heavy and light chain polypeptides that are not CDR sequences, and are primarily responsible for maintaining correct positioning of the CDR sequences to permit antigen binding. Although the framework regions themselves typically do not directly participate in antigen binding, as is known in the art, certain residues within the framework regions of certain antibodies can directly participate in antigen binding or can affect the ability of one or more amino acids in CDRs to interact with antigen.

[0105] Examples of antibodies are anti PCSK-9 mAb (e.g., Alirocumab), anti IL-6 mAb (e.g., Sarilumab), and anti IL-4 mAb (e.g., Dupilumab).

[0106] Pharmaceutically acceptable salts of any API described herein are also contemplated for use in a drug or medicament in a drug delivery device. Pharmaceutically acceptable salts are for example acid addition salts and basic salts.

[0107] Those of skill in the art will understand that modifications (additions and / or removals) of various components of the APIs, formulations, apparatuses, methods, systems and embodiments described herein may be made without departing from the full scope and spirit of the present invention, which encompass such modifications and any and all equivalents thereof.

[0108] An example drug delivery device may involve a needle-based injection system as described in Table 1 of section 5.2 of ISO 11608-1:2014(E). As described in ISO 11608-1:2014(E), needle-based injection systems may be broadly distinguished into multi-dose container systems and single-dose (with partial or full evacuation) container systems. The container may be a replaceable container or an integrated non-replaceable container.

[0109] As further described in ISO 11608-1:2014(E), a multi-dose container system may involve a needle-based injection device with a replaceable container. In such a system, each container holds multiple doses, the size of which may be fixed or variable (pre-set by the user). Another multi-dose container system may involve a needle-based injection device with an integrated non-replaceable container. In such a system, each container holds multiple doses, the size of which may be fixed or variable (pre-set by the user).

[0110] As further described in ISO 11608-1:2014(E), a single-dose container system may involve a needle-based injection device with a replaceable container. In one example for such a system, each container holds a single dose, whereby the entire deliverable volume is expelled (full evacuation). In a further example, each container holds a single dose, whereby a portion of the deliverable volume is expelled (partial evacuation). As also described in ISO 11608-1:2014(E), a single-dose container system may involve a needle-based injection device with an integrated non-replaceable container. In one example for such a system, each container holds a single dose, whereby the entire deliverable volume is expelled (full evacuation). In a further example, each container holds a single dose, whereby a portion of the deliverable volume is expelled (partial evacuation).

[0111] Those of skill in the art will understand that modifications (additions and / or removals) of various components of the embodiments described herein may be made without departing from the full scope and spirit of the present invention, which encompass such modifications and any and all equivalents thereof.

Claims

1. A computer implemented method, comprising: inputting, into a machine-learning language model, a set of input data relating to a subject, the input data comprising: a plurality of medical record tokens, wherein a plurality of the medical record tokens each indicate a respective medical event for the subject within a first time window; and a plurality of time tokens, each time token corresponding to a respective medical record token and indicating a respective time associated with the medical record token; processing, by the machine-learning language model, the set of input data to generate a set of output data, the set of output data comprising data indicative of a prediction of a medical event for the subject within a second time window subsequent to the first time window; and generating, based on the set of input data and the prediction of a medical event, an attribution score for each of a plurality of the input medical record tokens using an integrated gradients method, wherein the attribution score for a token is indicative of an amount that said token contributed to the prediction of the medical event.

2. The method of claim 1, wherein: the plurality of medical record tokens comprises one or more lab test tokens identifying a lab test; and the set of input data further comprises one or more test result tokens each corresponding to a respective lab test token, wherein each test result token indicates an outcome of the respective lab test.

3. The method of claim 2, wherein a test result token represents a percentile value for a respective lab test result.

4. The method of any preceding claim, wherein the plurality of medical record tokens comprises one or more of: a diagnoses event token; a medication token; a procedural token; and/or a treatment token.

5. The method of any preceding claim, wherein: the plurality of medical record tokens comprises one or more demographic tokens indicative of a property of the subject; and respective time tokens associated with the one or more demographic tokens have a null value.

6. The method of any preceding claim, wherein generating, based on the set of input data and the prediction of a medical event, the attribution score for each of a plurality of the input medical record tokens using an integrated gradients method comprises, for each of a plurality of medical record tokens: computing a set of gradients of the set of output data along an interpolation path between a baseline set of input data and the set of input data in an embedding space, wherein the interpolation path is defined by an interpolation parameter; numerically integrating the set of gradients over the interpolation parameter; scaling the integrated set of gradients by a difference in the embedding space between the medical record token in the baseline set of input data and the medical record token in the set of input data to obtain a scaled integrated gradient; and summing components of the scaled integrated gradient to obtain the attribution score for the medical record token.

7. The method of claim 6, wherein the baseline set of input data comprises a null value for each of a plurality of medical record tokens.

8. The method of claim 6 or claim 7, wherein: the plurality of medical record tokens comprises one or more demographic tokens indicative of a property of the subject; and the baseline set of input data comprises, for each demographic token, a respective average demographic token value.

9. The method of any of claims 6 to 8, wherein: the plurality of medical record tokens comprises one or more lab test tokens identifying a lab test; the set of input data further comprises one or more test result tokens each corresponding to a respective lab test token, wherein each test result token indicates an outcome of the respective lab test; and the baseline set of input data comprises, for each test result token, a 50th percentile token.

10. The method of any preceding claim, wherein each time token indicates a time since a first medical event in the plurality of medical record tokens.

11. The method of any preceding claim, further comprising: iterating the method over a set of health records, the set of health records comprising a plurality of sets of input data, each relating to a respective subject; generating, from the attribution scores for the plurality of sets of input data, an average attribution score for each token type in the medical event tokens; and identifying a set of token types that are most predictive of the medical event in the second time window.

12. The method of any preceding claim, wherein: the medical events for the subject relate to an asthma diagnosis, and the prediction of a medical event for the subject relates to an asthma event; and/or the medical events for the subject relate to a diabetes diagnosis, and the prediction of a medical event for the subject relates to a diabetes event; and/or the medical events for the subject relate to an uncontrolled disease outcome; and/or the medical events for the subject relate to a non-response in a user therapy.

13. The method of any preceding claim, further comprising: identifying, based on one or more of the attribution scores, one or more input tokens that contributed most to the prediction of the medical event; and providing a treatment and/or therapy to the subject based at least in part on the identified one or more input tokens.

14. A computer program product comprising computer readable instructions that, when executed by a computer, cause the computer to perform a method according to any preceding claim.

15. A system comprising one or more processors and a memory, the memory storing computer readable instructions that, when executed by the one or more processors, cause the system to perform the method of any preceding claim.
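
By way of non-limiting illustration, the attribution steps recited in claim 6 (gradients along an interpolation path in the embedding space, numerical integration over the interpolation parameter, scaling by the difference between the input and the baseline, and summing components) may be sketched as follows. The sketch rests on several assumptions that are not taken from the specification: `toy_model` is a hypothetical logistic scorer standing in for the machine-learning language model, `weights`, `inputs` and `baseline` are made-up values, the gradient is computed analytically rather than by backpropagation, and a midpoint Riemann sum is assumed as the numerical integration rule. The all-zero baseline plays the role of the null-valued baseline of claim 7.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(embeddings, weights, bias=0.0):
    """Hypothetical stand-in for the language model (assumption, not the
    claimed model): a logistic score over all token embedding components."""
    z = bias + sum(w * e
                   for w_vec, e_vec in zip(weights, embeddings)
                   for w, e in zip(w_vec, e_vec))
    return sigmoid(z)


def toy_gradient(embeddings, weights, bias=0.0):
    """Analytic gradient of toy_model with respect to each embedding component."""
    p = toy_model(embeddings, weights, bias)
    return [[p * (1.0 - p) * w for w in w_vec] for w_vec in weights]


def integrated_gradients(inputs, baseline, weights, bias=0.0, steps=200):
    """Return one attribution score per token."""
    n_tokens, dim = len(inputs), len(inputs[0])
    grad_sum = [[0.0] * dim for _ in range(n_tokens)]
    for k in range(steps):
        # Interpolation parameter; midpoint rule is an assumed choice.
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [[b + alpha * (x - b) for x, b in zip(x_vec, b_vec)]
                 for x_vec, b_vec in zip(inputs, baseline)]
        grads = toy_gradient(point, weights, bias)
        # Accumulate the numerical integral of the gradients over alpha.
        for i in range(n_tokens):
            for j in range(dim):
                grad_sum[i][j] += grads[i][j] / steps
    scores = []
    for i in range(n_tokens):
        # Scale by (input - baseline) in the embedding space, then sum the
        # components to obtain a single score for the token.
        scaled = [(inputs[i][j] - baseline[i][j]) * grad_sum[i][j]
                  for j in range(dim)]
        scores.append(sum(scaled))
    return scores


# Made-up embeddings for three tokens in a two-dimensional embedding space.
inputs = [[1.0, 0.5], [0.2, -0.3], [0.0, 0.8]]
baseline = [[0.0, 0.0] for _ in inputs]  # null baseline (cf. claim 7)
weights = [[0.8, -0.2], [0.1, 0.4], [-0.5, 0.3]]

scores = integrated_gradients(inputs, baseline, weights)
delta = toy_model(inputs, weights) - toy_model(baseline, weights)
print("attribution scores:", scores)
print("sum of scores:", sum(scores), "output difference:", delta)
```

In this sketch the attribution scores sum, up to integration error, to the difference between the model output for the input and for the baseline, which is a known property of integrated gradients and offers a simple sanity check on the number of integration steps.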