System and method for incorporating real-world evidence data and factor analysis in clinical trials

By integrating RWE data into clinical trials, the system enhances predictive accuracy and transparency by identifying key factors influencing patient outcomes, addressing inefficiencies and limitations of traditional methods.

WO2026136076A1 · PCT designated stage · Publication Date: 2026-06-25 · PROKIDNEY IPCO LLC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
PROKIDNEY IPCO LLC
Filing Date
2025-12-10
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Clinical trials face inefficiencies due to high costs, lengthy durations, and challenges in patient enrollment and retention, particularly when control arms use placebos. Existing synthetic controls also lack insight into the factors influencing patient outcomes and provide only one-sided predictions.

Method used

A system that combines clinical trial data with real-world evidence (RWE) to create predictive models, allowing for the identification of key factors influencing patient outcomes by comparing performance metrics before and after removing specific variables, thereby enhancing predictive accuracy and transparency.

Benefits of technology

This approach reduces trial time and costs, increases patient enrollment, and provides deeper insights into outcome drivers, improving the clarity and efficiency of clinical trials by identifying critical variables, applicable across various predictive models.

✦ Generated by Eureka AI based on patent content.


Abstract

A method comprising: generating an appended dataset by combining treatment arm data from a clinical trial and real-world evidence data; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns.

Description

Attorney Docket No.: 55701WO / 0728.000557W001

SYSTEM AND METHOD FOR INCORPORATING REAL-WORLD EVIDENCE DATA AND FACTOR ANALYSIS IN CLINICAL TRIALS

CROSS-REFERENCE TO RELATED APPLICATIONS

[00.5] This application claims the benefit of U.S. Provisional Application Serial No. 63/735,161, filed on December 17, 2024, the disclosure of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

[0001] The subject matter described herein relates to machine learning (ML) technology, specifically systems and methods for incorporating real-world evidence (RWE) data and factor analysis into clinical trials to facilitate outcome prediction and factor impact determination.

BACKGROUND

[0002] Clinical trials are essential for evaluating the effectiveness of new treatments, therapies, or medical interventions. Traditionally, clinical trials consist of two arms: a treatment arm, where participants receive the experimental therapy, and a control arm, where participants either receive no treatment, e.g., placebo, or a standard-of-care treatment. While this structure is necessary for comparison, it often results in several inefficiencies and challenges. Clinical trials are typically expensive, lengthy, and inefficient due to difficulties in patient enrollment and retention, particularly in cases where patients fear being assigned to the control arm. This is especially pronounced in trials where the control arm involves placebos or ineffective treatments, leading to potential dropout rates and failure to meet enrollment targets. To address these challenges, the FDA has introduced programs such as the Real-World Evidence (RWE) program, which accelerates the approval of new drugs by utilizing external data from real-world sources. One of the outcomes of such programs is the development of synthetic controls (sometimes referred to as digital twins), which use RWE data to predict the outcomes of control arms in single-arm clinical trials. Synthetic controls allow researchers to simulate control conditions based on external data, thereby reducing the need for an actual control group in the trial. This approach not only reduces time and costs but also increases the attractiveness of trials by allowing more patients to receive the experimental therapy. However, despite the advantages of synthetic controls, they come with notable shortcomings. A key limitation is that they do not identify which factors or variables contribute most significantly to patient outcomes.
Synthetic controls typically focus on predicting outcomes for the control condition, but they do not provide insights into which specific variables, either alone or in combination, influenced those outcomes. Moreover, synthetic controls tend to generate one-sided predictions, predicting outcomes for the treatment group under control conditions but not vice versa, limiting their broader applicability in understanding the complete effect of a treatment. Thus, there is a need for improved systems and methods that can not only leverage RWE data to enhance the efficiency and clarity of clinical trials but also identify the factors that most influence patient outcomes and enable prediction across both treatment and control scenarios.

SUMMARY

[0001] Methods, systems, and articles of manufacture, including computer program products, are provided for generating an appended dataset by combining treatment arm data from a clinical trial with real-world evidence data, wherein the appended dataset is a matrix of two-dimensional arrays where rows represent individual subjects and columns represent attributes associated with a subject; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.

[0002] In some variations, the appended dataset comprises data from real-world evidence sources external to the clinical trial.

[0003] In some variations, the first performance measurement and the second performance measurement are based on accuracy, precision, recall, root mean squared error (RMSE), F1 Score, AUC-ROC (Area Under the Receiver Operating Characteristic Curve), AUC-PR (Area Under the Precision-Recall Curve), Confusion Matrix, Log Loss (Cross-Entropy Loss), Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), R-Squared (Coefficient of Determination), Mean Squared Logarithmic Error (MSLE), Loss (training and testing), Mean Average Precision (MAP), Average Precision (AP), Mean Average Recall (MAR), Mean Absolute Scaled Error (MASE), or a combination thereof.

[0003] In some variations, the method further comprises identifying attributes that influence patient outcomes by evaluating the impact level of the removed one or more columns.

[0004] In some variations, removing one or more columns from the appended dataset comprises removing columns corresponding to demographic factors, clinical variables, or treatment indicators.

[0005] In some variations, the method further comprises dynamically adjusting the columns to be removed based on predefined criteria comprising variable importance scores or feature selection algorithms.

[0006] In some variations, comparing the first performance measurement and the second performance measurement further comprises generating a visualization that highlights the impact level of the removed one or more columns on the predictive model's performance.

[0007] In another aspect, a computer program product comprising a non-transitory machine-readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising generating an appended dataset by combining treatment arm data from a clinical trial with real-world evidence data, wherein the appended dataset is a matrix of two-dimensional arrays where rows represent individual subjects and columns represent attributes associated with a subject; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.

[0008] In some variations, the appended dataset comprises data from real-world evidence sources external to the clinical trial. For example, in some embodiments, the data may be derived from sources such as electronic health records, registries, or claims databases, and may include subjects who are not participants in the clinical trial. In cases where there is overlap in subjects between the clinical trial and the external dataset, the overlapping data can be excluded from the control training dataset to avoid duplication or bias.
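By way of non-limiting illustration, the exclusion of overlapping subjects described above may be sketched as follows. The field name `subject_id` and the function name `exclude_overlap` are assumptions made for this example only; the disclosure does not specify how subjects are matched across the clinical trial and the external sources.

```python
def exclude_overlap(trial_rows, rwe_rows, id_key="subject_id"):
    """Drop external (RWE) rows whose subject also appears in the trial data.

    `id_key` is a hypothetical shared identifier; real sources may require
    record linkage rather than an exact key match.
    """
    trial_ids = {row[id_key] for row in trial_rows}
    return [row for row in rwe_rows if row[id_key] not in trial_ids]


# Hypothetical records: subject "T2" appears in both sources.
trial = [
    {"subject_id": "T1", "arm": "treatment"},
    {"subject_id": "T2", "arm": "treatment"},
]
rwe = [
    {"subject_id": "T2", "arm": "control"},
    {"subject_id": "R7", "arm": "control"},
]

control_training = exclude_overlap(trial, rwe)
print([row["subject_id"] for row in control_training])
```

Only the subject that is absent from the trial remains in the control training data, which avoids counting the same subject in both arms.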

[0009] In some variations, the first performance measurement and the second performance measurement are based on accuracy, precision, recall, root mean squared error (RMSE), or a combination thereof.

[0010] In some variations, the operations further comprise identifying attributes that influence patient outcomes by evaluating the impact level of the removed one or more columns.

[0011] In some variations, removing one or more columns from the appended dataset comprises removing columns corresponding to demographic factors, clinical variables, or treatment indicators.

[0012] In some variations, the operations further comprise dynamically adjusting the columns to be removed based on predefined criteria comprising variable importance scores or feature selection algorithms.

[0013] In some variations, comparing the first performance measurement and the second performance measurement further comprises generating a visualization that highlights the impact level of the removed one or more columns on the predictive model's performance.

[0014] A system comprising a programmable processor and a non-transitory machine-readable medium storing instructions that, when executed by the processor, cause the programmable processor to perform operations comprising generating an appended dataset by combining treatment arm data from a clinical trial with real-world evidence data, wherein the appended dataset is a matrix of two-dimensional arrays where rows represent individual subjects and columns represent attributes associated with a subject; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.

[0015] In some variations, the appended dataset comprises data from real-world evidence sources external to the clinical trial.

[0016] In some variations, the first performance measurement and the second performance measurement are based on accuracy, precision, recall, root mean squared error (RMSE), or a combination thereof.

[0017] In some variations, the operations further comprise identifying attributes that influence patient outcomes by evaluating the impact level of the removed one or more columns.

[0004] In some variations, removing one or more columns from the appended dataset comprises removing columns corresponding to demographic factors, clinical variables, or treatment indicators.

[0005] In some variations, the operations further comprise dynamically adjusting the columns to be removed based on predefined criteria comprising variable importance scores or feature selection algorithms.

[0006] The approach described herein offers several key benefits and advantages over traditional clinical trial methodologies and existing synthetic control systems. By incorporating real-world evidence (RWE) data into clinical trials, this approach significantly reduces the time, cost, and inefficiencies associated with conventional trials. One of the major advantages is that it eliminates or reduces the need for a traditional control arm, thereby allowing more participants to receive the experimental therapy, which can lead to increased patient enrollment and retention. Furthermore, the proposed system not only predicts control outcomes using RWE data, it also provides critical insights into which variables or factors contribute most to patient outcomes. By building multiple predictive models that focus on different sets of variables and comparing their performance, the system identifies key factors that influence the outcome, offering clarity into the effects of treatment. This dual capability of outcome prediction and factor identification enhances the overall accuracy, interpretability, and transparency of clinical trial results, ultimately accelerating drug development and improving patient care. Additionally, the invention allows for model manipulation in certain machine learning techniques, making it possible to assess variable importance without rebuilding the entire model, further increasing efficiency.

[0018] Another advantage of the approach described herein is its versatility in application across a wide range of predictive models. The method can be utilized with any type of model, regardless of the specific architecture or algorithm employed. Whether the predictive model is a neural network, decision tree, support vector machine, or any other machine learning model, the process of generating an appended dataset, removing specific variables, and measuring the impact on performance remains consistent. This flexibility allows the method to be integrated into various clinical trial frameworks and modeling environments, making it adaptable to different predictive tasks and ensuring its broad applicability across diverse machine learning models.

[0019] Implementations of the current subject matter can include, but are not limited to, methods consistent with the descriptions provided herein as well as articles that include a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a computer-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and / or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g. the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.

[0020] The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. The claims that follow this disclosure are intended to define the scope of the protected subject matter.

DESCRIPTION OF DRAWINGS

[0021] The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,

[0022] FIG. 1 is a diagram illustrating a flow chart of a process 100 for identifying key factors influencing patient outcomes to enhance clinical trial efficiency and clarity, in accordance with one or more embodiments of the current subject matter.

[0023] FIG. 2 is a diagram illustrating an example of a first performance measurement associated with a first predictive model, in accordance with one or more embodiments of the current subject matter.

[0024] FIG. 3 is a diagram illustrating an example of a second performance measurement associated with a second predictive model, in accordance with one or more embodiments of the current subject matter.

[0025] FIG. 4 depicts a block diagram illustrating a computing system consistent with implementations of the current subject matter.

[0026] When practical, like labels are used to refer to same or similar items in the drawings.

DETAILED DESCRIPTION

[0027] The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings.

[0007] As discussed herein elsewhere, machine learning models, including synthetic control systems and real-world evidence (RWE) models, have become integral to enhancing clinical trials and medical research. These models have demonstrated their ability to solve complex challenges, from predicting patient outcomes to accelerating drug development. For instance, synthetic control models have been applied to simulate control arm outcomes in clinical trials, allowing researchers to assess the effects of treatments without needing a full control group. In many cases, these models have the potential to influence life-altering decisions, such as determining the effectiveness of a new treatment or therapy. As a result, there is an increasing need to ensure that these models not only improve clinical trial efficiency and outcome clarity but also provide deeper insights into the factors driving patient outcomes. This includes designing the models with fairness, interpretability, and transparency considerations, thereby improving confidence in the results. In response to these needs, the approach described herein integrates RWE data into clinical trials, offering a system that enhances the prediction of control outcomes while also identifying the variables that most influence patient outcomes, facilitating a more transparent and informed decision-making process.

[0008] FIG. 1 is a diagram illustrating a flow chart of a process 100 for identifying key factors influencing patient outcomes to enhance clinical trial efficiency and clarity. As shown in FIG. 1, the process 100 may begin with operation 102, wherein the system generates an appended dataset by combining treatment arm data from a clinical trial with real-world evidence (RWE) data. In some embodiments, the RWE data may be sourced from, for example, electronic health records, claims data, or patient registries. The RWE data may be all or part of control arm data for the trial, which may also include control patient data and / or synthetic control data, simulating outcomes for patients who did not receive the experimental therapy. Synthetic control data may be derived from the RWE data. This setup may eliminate the need for, or reduce the number of patients enrolled in, a traditional control arm in the clinical trial, allowing more patients to receive the experimental therapy. In some embodiments, the system may perform data preprocessing and normalization before feeding the appended dataset into the predictive models to aid in consistency and comparability across different data sources. This preprocessing operation may involve normalizing or scaling the data to address variations in data formats, measurement units, and distribution characteristics that can arise from combining real-world evidence (RWE) data with clinical trial data. For example, demographic factors, clinical variables, and treatment indicators may be standardized using methods such as min-max scaling or z-score normalization to align their ranges and distributions. This preprocessing ensures that the predictive models are not unduly influenced by differences in data scales or outliers, leading to more accurate and reliable outcome predictions.
By applying consistent normalization techniques across the dataset, the system enhances the comparability of data entries from diverse sources, improving the overall robustness and performance of the predictive models. In some embodiments, the appended dataset is structured as a two-dimensional array and / or matrix where rows represent individual subjects, and columns represent attributes. The attributes may include, but are not limited to, demographic factors, clinical variables, and treatment indicators. Demographic factors may include age, gender, income, and geographic location of the subjects. Clinical variables may encompass a variety of data types, such as laboratory results, genetic test outcomes, gene expression profiles, body measurements, and biometric data, including but not limited to heart rate, blood pressure, or oxygen saturation levels. Treatment indicators may include information regarding drugs administered to the subjects, including dosage and timing, as well as medical procedures performed, such as surgical interventions, imaging procedures, or therapeutic treatments.
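As a non-limiting sketch of operation 102 and the preprocessing described above, the following example stacks treatment arm rows and RWE rows into one row-per-subject structure and applies min-max scaling and z-score normalization to a numeric column. The column names (`age`, `gender`, `arm`), the function names, and the sample values are hypothetical and are not taken from the disclosure.

```python
from statistics import mean, pstdev


def append_datasets(treatment_rows, rwe_rows):
    """Stack both sources row-wise, tagging each row with its arm."""
    appended = [dict(row, arm="treatment") for row in treatment_rows]
    appended += [dict(row, arm="control") for row in rwe_rows]
    return appended


def min_max_scale(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical subjects; rows are subjects, keys are attribute columns.
treatment = [{"age": 40, "gender": "F"}, {"age": 60, "gender": "M"}]
rwe = [{"age": 50, "gender": "F"}, {"age": 70, "gender": "M"}]

appended = append_datasets(treatment, rwe)
ages = [row["age"] for row in appended]
scaled = min_max_scale(ages)
standardized = z_score(ages)
print(scaled)
print(standardized)
```

Min-max scaling preserves the ordering of values within a bounded range, while z-score normalization expresses each value in standard deviations from the mean; which method suits a given column depends on its distribution.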

[0009] In operation 104, the system may create a first predictive model based on a training subset of the appended dataset. The first predictive model may be trained to predict patient outcomes by learning from both treatment and RWE data, and, optionally, any other data that may be available as control arm data. By incorporating external RWE data, the model may enhance the outcome interpretation and streamline the trial process.
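Operation 104 may be illustrated, purely as a sketch, with a deliberately simple stand-in model that learns the most common outcome for each combination of attribute values. The disclosure does not prescribe this model; the names `split_rows` and `fit_majority_model`, the splitting rule, and the sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict


def split_rows(rows, test_every=3):
    """Deterministic split: every `test_every`-th row joins the test subset."""
    train, test = [], []
    for index, row in enumerate(rows, start=1):
        (test if index % test_every == 0 else train).append(row)
    return train, test


def fit_majority_model(rows, features, target="outcome"):
    """Learn the most common outcome for each combination of feature values."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[f] for f in features)][row[target]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    fallback = Counter(row[target] for row in rows).most_common(1)[0][0]

    def predict(row):
        # Unseen feature combinations fall back to the overall majority outcome.
        return table.get(tuple(row[f] for f in features), fallback)

    return predict


# Hypothetical appended dataset (not the data of Table 3).
pattern = [("treatment", "F", 1), ("treatment", "M", 0),
           ("control", "F", 1), ("control", "M", 0)]
appended = [{"arm": a, "gender": g, "outcome": o} for a, g, o in pattern * 2]

train, test = split_rows(appended)
first_model = fit_majority_model(train, ["arm", "gender"])
predictions = [first_model(row) for row in test]
print(predictions)
```

Any other model type could be substituted for `fit_majority_model` as long as it is trained on the training subset only and can produce an outcome for each test row.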

[0010] In operation 106, the system may generate a first performance measurement for the first predictive model. The performance measurement may be based on the predictive model's ability to forecast outcomes for a test subset of the appended dataset, which includes both actual and RWE data. Metrics such as accuracy, precision, recall, and root mean squared error (RMSE), or combinations thereof, may be used to evaluate the performance, quantifying how well the model predicts an actual outcome for the test subset.
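For illustration only, the metrics named in operation 106 may be computed from paired actual and predicted outcomes as sketched below. The function names and the sample outcome vectors are hypothetical; the formulas are the standard definitions of accuracy, precision, recall, and RMSE.

```python
from math import sqrt


def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, and recall for a binary outcome."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def rmse(actual, predicted):
    """Root mean squared error for a numeric outcome."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical test-subset outcomes.
actual = [1, 0, 1, 1]
predicted = [1, 0, 0, 1]
scores = classification_metrics(actual, predicted)
error = rmse([3.0, 5.0], [1.0, 5.0])
print(scores, error)
```

Classification metrics apply when the outcome is categorical (for example, response versus no response), whereas RMSE applies when the outcome is a continuous quantity.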

[0028] The process 100 may proceed to operation 108, wherein the system removes one or more columns from the appended dataset to generate a modified dataset. The removed columns may correspond to variables such as demographic factors (e.g., gender, age), clinical variables (e.g., baseline health conditions, biomarkers), or treatment indicators (e.g., dosage or treatment duration). In some embodiments, this operation may facilitate an evaluation of how these factors affect predictive outcomes. By removing certain variables, the system may analyze their influence on the model's predictions. In some embodiments, the system may dynamically adjust the selection of columns for removal based on predefined criteria, such as variable importance scores or feature selection algorithms. In operation 110, the system may create a second predictive model based on the modified dataset. The second model may be generated to assess how the removal of specific variables impacts the predictive accuracy of the model. This second predictive model may be a variation of the first model, with certain influencing factors excluded, allowing for a more focused analysis of the remaining variables. In operation 112, the system may generate a second performance measurement for the second predictive model. This second measurement may be based on metrics similar to those used for the first model, including accuracy, precision, recall, RMSE, or combinations thereof. The performance measurement for the second model may be used to evaluate the predictive accuracy on the same test subset of the appended dataset, allowing for a consistent comparison between models. In operation 114, the system may return a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.
In some embodiments, this is implemented by comparing the first performance measurement with the second performance measurement to determine the impact of the removed columns. This comparison may identify variables, such as demographic or clinical factors, that influenced the model’s predictive accuracy.
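The sequence of operations 104 through 114 may be sketched, under stated assumptions, as a single comparison routine. The stand-in model (`fit`), the helper names, and the sample rows below are hypothetical and chosen so that one column carries the outcome signal and the other does not; they are not the data or the model of the disclosure.

```python
from collections import Counter, defaultdict


def fit(rows, features, target="outcome"):
    """Stand-in model: majority outcome per combination of feature values."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[f] for f in features)][row[target]] += 1
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    fallback = Counter(r[target] for r in rows).most_common(1)[0][0]
    return lambda row: table.get(tuple(row[f] for f in features), fallback)


def accuracy(model, rows, target="outcome"):
    return sum(model(r) == r[target] for r in rows) / len(rows)


def column_impact(train, test, features, removed):
    """Score a model with all columns, then one without `removed`, and compare."""
    first = accuracy(fit(train, features), test)
    # Ignoring a column during training is equivalent to dropping it here.
    kept = [f for f in features if f not in removed]
    second = accuracy(fit(train, kept), test)
    return {"removed": list(removed), "first": first,
            "second": second, "impact": first - second}


def make_rows(triples):
    return [{"arm": a, "gender": g, "outcome": o} for a, g, o in triples]


# Hypothetical data in which gender, not arm, determines the outcome.
train = make_rows([("treatment", "F", 1), ("treatment", "M", 0),
                   ("control", "M", 0), ("treatment", "F", 1),
                   ("control", "F", 1), ("control", "M", 0)])
test = make_rows([("control", "F", 1), ("treatment", "M", 0)])

features = ["arm", "gender"]
results = [column_impact(train, test, features, [col]) for col in features]
for result in results:
    print(result)
```

Both models are scored on the same test subset, so the difference between the two measurements can be attributed to the removed column rather than to a change in evaluation data. Because `fit` is passed nothing model-specific, another model type could be swapped in without changing `column_impact`.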

[0029] In some embodiments, performance measurement refers to a quantitative assessment of the accuracy and effectiveness of a predictive model in relation to a defined set of criteria. Performance measurement may be calculated based on various metrics, including but not limited to accuracy, precision, recall, root mean squared error (RMSE), and other relevant statistical measures. These metrics evaluate how well the model's predicted outcomes align with actual outcomes in a test dataset, providing insights into the model’s predictive capability and overall performance in determining patient outcomes within the context of a clinical trial or similar application. In some embodiments, performance measurement may extend beyond accuracy to include additional metrics that provide a more comprehensive evaluation of the predictive model, particularly in the context of imbalanced datasets, which are common in clinical trials. For example, the system may utilize the F1 score, which balances precision and recall, making it especially useful when dealing with uneven class distributions where accuracy alone may not provide an accurate representation of the model’s performance. The F1 score helps in cases where false positives and false negatives carry different implications for patient outcomes. Additionally, the area under the receiver operating characteristic curve (AUC-ROC) may be employed as a performance metric to assess the model’s ability to discriminate between different outcome classes. AUC-ROC provides insight into the trade-offs between the true positive rate and the false positive rate, offering a valuable measure of the model's predictive power, particularly when distinguishing between treatment success and failure.
By incorporating metrics such as the F1 score and AUC-ROC, the system may provide a more nuanced and reliable assessment of predictive model performance, especially in clinical trials where class imbalance and diverse outcomes require careful consideration.
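As a minimal sketch of the two additional metrics, the F1 score may be computed as the harmonic mean of precision and recall, and AUC-ROC may be computed as the probability that a randomly chosen positive subject receives a higher predicted score than a randomly chosen negative subject. The function names and the sample scores are hypothetical; the pairwise formulation of AUC-ROC is one standard way to compute it and is used here for brevity.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc_roc(actual, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical values for illustration.
f1 = f1_score(precision=1.0, recall=0.5)
auc = auc_roc(actual=[1, 0, 1, 0], scores=[0.9, 0.8, 0.7, 0.1])
print(f1, auc)
```

An AUC-ROC of 0.5 corresponds to ranking no better than chance, and 1.0 corresponds to perfect separation of the two outcome classes.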

[0030] In some embodiments, the terms "variables," "columns," and "factors" may be used interchangeably to refer to distinct data elements or attributes within a dataset that contribute to the predictive modeling process. These terms may represent any measurable characteristics, features, or indicators relevant to the dataset, such as demographic information, clinical metrics, treatment indicators, or other attributes. Each variable, column, or factor may influence the outcome of the predictive model in different ways, and their inclusion, exclusion, or modification may be analyzed to assess their impact on the overall performance of the model in predicting patient outcomes.

[0031] By analyzing the performance differences between the two models, the system may determine which variables contributed most to patient outcomes. In some embodiments, the system may generate a visualization that highlights the impact of the removed columns on the predictive model's performance, offering insights into the relationships between the variables and the trial outcomes. The approach described herein may incorporate synthetic controls and RWE data, which may improve the efficiency of clinical trials by reducing time and cost while enhancing the clarity of outcome interpretation. In some embodiments, this method may address limitations of one-sided predictions by allowing the system to model outcomes for both treatment and control conditions, thereby increasing the robustness of the clinical trial analysis.
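One simple form such a visualization could take is a ranked text bar chart of the performance drop attributed to each removed column. The sketch below is illustrative only; the function name `impact_chart`, the bar width, and the impact values are assumptions, and a graphical plotting library could be used instead.

```python
def impact_chart(impacts, width=40):
    """Render one line per removed column, largest performance drop first."""
    lines = []
    ranked = sorted(impacts.items(), key=lambda item: item[1], reverse=True)
    for column, drop in ranked:
        bar = "#" * round(max(drop, 0.0) * width)
        lines.append(f"{column:<10} {drop:6.1%} {bar}")
    return "\n".join(lines)


# Illustrative impact levels (first measurement minus second measurement).
chart = impact_chart({"arm": 0.0, "gender": 0.125})
print(chart)
```

Sorting by impact places the most influential columns at the top, so a reviewer can see at a glance which removals degraded the model most.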

[0032] FIG. 2 is a diagram illustrating an example of a first performance measurement associated with a first predictive model, and FIG. 3 is a diagram illustrating an example of a second performance measurement associated with a second predictive model, in accordance with one or more embodiments of the current subject matter. In an example, a first dataset depicting the treatment arm data is shown in Table 1.

Table 1: Treatment Arm Data

[0033] In some embodiments, a dataset depicting the control arm data is shown in Table 2.

Table 2: Control Arm Data (External to Treatment Arm; not part of clinical trial)

[0034] In some embodiments, the system described herein may generate the appended data by combining the Treatment Arm Data and the Control Arm Data. For example, Table 3 illustrates the appended data:

Table 3: Appended Data

[0035] In some embodiments, this appended data may be utilized to train one or more predictive models. For example, as shown in FIG. 2, a first predictive model may be built from the appended data. As shown in FIG. 1, upon working on a test dataset, a first measurement of 100% accuracy for the first predictive model is calculated, meaning that the first model is accurate on the test dataset. In some embodiments, a separate test dataset is generated for testing the first predictive model. In some embodiments, this separate test dataset is a subset of the appended dataset. In some embodiments, the entire appended dataset may be utilized as a test dataset.

[0036] As shown in FIG. 2, when removing the treatment or control column, the performance measurement does not change. This result indicates that the removed column has a low impact level on the outcome of the patient. In some embodiments, this may suggest that the factor represented by the removed column, such as treatment or control status, may not play a significant role in influencing patient outcomes for this specific clinical trial or predictive model.

[0037] In contrast, as shown in FIG. 3, when the gender column is removed, the accuracy measurement changes. This change in performance measurement may indicate that gender is a more influential factor in determining patient outcomes. The degree of impact may be reflected in the magnitude of the difference between the first and second performance measurements. In some embodiments, the removal of the gender column may result in a noticeable decrease in accuracy, precision, or other performance metrics, demonstrating that gender plays a key role in the model's ability to accurately predict outcomes. This may provide insights into the underlying relationships between patient demographics and clinical results.
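The comparison of performance measurements before and after column removal can be sketched as follows. This is an illustrative Python example under stated assumptions: the predictive model is replaced by a simple majority-vote lookup table, the rows are hypothetical and are not the data of FIGS. 2 and 3, and the names `fit`, `accuracy`, and `impact_of_removing` are introduced here for illustration only.

```python
from collections import Counter, defaultdict


def fit(rows, features, target="outcome"):
    """Stand-in model: remember the most common outcome per feature combination."""
    votes = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        votes[tuple(row[f] for f in features)][row[target]] += 1
        overall[row[target]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    default = overall.most_common(1)[0][0]  # fallback for unseen combinations
    return table, default


def accuracy(model, rows, features, target="outcome"):
    """Fraction of rows whose predicted outcome equals the actual outcome."""
    table, default = model
    hits = sum(
        table.get(tuple(r[f] for f in features), default) == r[target] for r in rows
    )
    return hits / len(rows)


def impact_of_removing(column, train, test, features):
    """First performance measurement minus the one obtained without `column`."""
    first = accuracy(fit(train, features), test, features)
    kept = [f for f in features if f != column]
    second = accuracy(fit(train, kept), test, kept)
    return first - second


def row(gender, arm, outcome):
    return {"gender": gender, "arm": arm, "outcome": outcome}


# Hypothetical data in which the outcome follows gender and ignores the arm.
train = [row("F", "treatment", 1), row("F", "treatment", 1),
         row("F", "control", 1), row("F", "control", 1),
         row("M", "treatment", 0), row("M", "control", 0)]
test = [row("F", "treatment", 1), row("M", "treatment", 0),
        row("F", "control", 1), row("M", "control", 0)]

features = ["gender", "arm"]
impact_arm = impact_of_removing("arm", train, test, features)
impact_gender = impact_of_removing("gender", train, test, features)
print("impact of arm:", impact_arm)        # 0.0 on this hypothetical data
print("impact of gender:", impact_gender)  # 0.5 on this hypothetical data
```

On this hypothetical data, removing the arm column leaves accuracy unchanged, while removing the gender column lowers it, which mirrors the qualitative pattern described for FIGS. 2 and 3. Any performance measurement may be substituted for accuracy in the same comparison.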

[0011] As shown in FIG. 3, when evaluating the impact of gender and treatment / control status on the outcome, it was observed that the removal of the gender column resulted in a more substantial change in the model's accuracy compared to the removal of the treatment or control column. Specifically, the accuracy calculation after removing the gender column showed a combined average accuracy of 87.5%, calculated as (100% + 75%) / 2. This result suggests that gender plays a more influential role in determining patient outcomes than treatment or control status. The difference in performance between the two scenarios indicates that the impact of gender on the predictive model’s outcome is 12.5 percentage points greater, or approximately 1.14 times more significant, than that of the treatment or control variable. In this case, the treatment does not appear to have a substantial effect on the outcome, and the control group demonstrates a similar outcome performance as the treatment group, suggesting that other factors, such as gender, may be more critical in influencing the clinical trial results.
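The two comparison figures quoted above can be related to the stated accuracies as follows. This is one reading consistent with the text, assuming that $A_1$ denotes the 100% accuracy reported for the first predictive model (which is unchanged by removing the treatment or control column) and that $A_2$ denotes the 87.5% accuracy reported after removing the gender column. The symbols $A_1$ and $A_2$ are introduced here for illustration only.

```latex
\begin{align*}
A_1 - A_2 &= 100\% - 87.5\% = 12.5\ \text{percentage points},\\
\frac{A_1}{A_2} &= \frac{100\%}{87.5\%} \approx 1.14 .
\end{align*}
```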

[0038] In comparison to existing methods, such as Synthetic Control, the approach described herein offers enhanced flexibility in evaluating the influence of individual factors on clinical outcomes. Traditional Synthetic Control models are limited in their ability to assess the impact of one or more columns, as they are typically built relative to all available columns and focused exclusively on the control group. This limitation prevents Synthetic Control methods from isolating and measuring the significance of specific variables on patient outcomes. By contrast, the method described herein allows for the removal of individual columns from the dataset, enabling the system to measure the direct impact of those columns on the performance of both the treatment and control models. This provides a more granular analysis of how particular factors contribute to clinical trial outcomes, offering a deeper understanding of the variables that drive patient responses.

[0012] In some embodiments, the approach described herein provides the advantage of not requiring the retraining of the second predictive model after column removal. Instead of rebuilding the model from scratch with the modified dataset, the system may manipulate the existing model to simulate the effect of removing specific columns. This eliminates the need for a complete retraining process, saving computational resources and time. By directly adjusting the model to account for the removed variables, the system can efficiently measure their impact on performance without the overhead typically associated with training a new model, enhancing the overall efficiency of the method. In some embodiments, the need for retraining a model after removing a data column can be eliminated through the use of specific mathematical properties inherent to certain model architectures. For example, a Gaussian Mixture Model (GMM), comprising one or more components defined by a covariance matrix, a mean vector, and probabilities for each component, may inherently accommodate variable removal without requiring a complete retraining process. In such models, removing a variable can be achieved by eliminating the corresponding row and column from the covariance matrix and the corresponding value from the mean vector. This adjustment allows the model to continue producing predictions as if retrained on the dataset with the variable removed. This capability arises from the mathematical property of GMMs, where the removal of a variable through this method is equivalent to marginalizing over the removed variable, which, in effect, mirrors the outcome of retraining the model without the variable. Such an approach not only preserves computational efficiency but also enhances the adaptability of the system to changing data conditions.
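The row, column, and mean-entry deletion described above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions: the GMM is held as plain lists of component weights, mean vectors, and full covariance matrices, the parameter values are hypothetical, and the helper names `drop_variable` and `marginalize_gmm` are introduced here for illustration only.

```python
def drop_variable(mean, cov, index):
    """Marginalize one Gaussian component over the variable at `index`.

    Deletes entry `index` from the mean vector and row/column `index`
    from the covariance matrix; the inputs are left unmodified.
    """
    new_mean = [m for i, m in enumerate(mean) if i != index]
    new_cov = [
        [v for j, v in enumerate(cov_row) if j != index]
        for i, cov_row in enumerate(cov)
        if i != index
    ]
    return new_mean, new_cov


def marginalize_gmm(weights, means, covs, index):
    """Apply the same deletion to every component; weights stay unchanged."""
    reduced = [drop_variable(m, c, index) for m, c in zip(means, covs)]
    return list(weights), [m for m, _ in reduced], [c for _, c in reduced]


# Hypothetical two-component GMM over three variables.
weights = [0.6, 0.4]
means = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
covs = [
    [[1.0, 0.1, 0.2], [0.1, 2.0, 0.3], [0.2, 0.3, 3.0]],
    [[1.5, 0.0, 0.0], [0.0, 2.5, 0.0], [0.0, 0.0, 3.5]],
]

# Remove the second variable (index 1) without refitting the model.
new_weights, new_means, new_covs = marginalize_gmm(weights, means, covs, 1)
print(new_means)
print(new_covs)
```

The reduced parameters describe the exact marginal density of the fitted mixture over the remaining variables. Whether a model refit on the reduced dataset would reproduce the same parameters depends on the fitting procedure.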

[0039] FIG. 4 depicts a block diagram illustrating a computing system 400 consistent with implementations of the current subject matter. As shown in FIG. 4, the computing system 400 can include a processor 410, a memory 420, a storage device 430, and input / output devices 440. The processor 410, the memory 420, the storage device 430, and the input / output devices 440 can be interconnected via a system bus 450. The computing system 400 may additionally or alternatively include a graphics processing unit (GPU), such as for image processing, and / or an associated memory for the GPU. The GPU and / or the associated memory for the GPU may be interconnected via the system bus 450 with the processor 410, the memory 420, the storage device 430, and the input / output devices 440. The memory associated with the GPU may store one or more images described herein, and the GPU may process one or more of the images described herein. The GPU may be coupled to and / or form a part of the processor 410. The processor 410 is capable of processing instructions for execution within the computing system 400. In some implementations of the current subject matter, the processor 410 can be a single-threaded processor. Alternatively, the processor 410 can be a multi-threaded processor. The processor 410 is capable of processing instructions stored in the memory 420 and / or on the storage device 430 to display graphical information for a user interface provided via the input / output device 440.

[0040] The memory 420 is a computer-readable medium, such as volatile or non-volatile memory, that stores information within the computing system 400. The memory 420 can store data structures representing configuration object databases, for example. The storage device 430 is capable of providing persistent storage for the computing system 400. The storage device 430 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage means. The input / output device 440 provides input / output operations for the computing system 400. In some implementations of the current subject matter, the input / output device 440 includes a keyboard and / or pointing device. In various implementations, the input / output device 440 includes a display unit for displaying graphical user interfaces.

[0041] According to some implementations of the current subject matter, the input / output device 440 can provide input / output operations for a network device. For example, the input / output device 440 can include Ethernet ports or other networking ports to communicate with one or more wired and / or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).

[0042] In some implementations of the current subject matter, the computing system 400 can be used to execute various interactive computer software applications that can be used for organization, analysis and / or storage of data in various (e.g., tabular) formats (e.g., Microsoft Excel®, and / or any other type of software). Alternatively, the computing system 400 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and / or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities or can be standalone computing products and / or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input / output device 440. The user interface can be generated and presented to a user by the computing system 400 (e.g., on a computer screen monitor, etc.).

[0043] One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and / or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and / or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0044] These computer programs, which can also be referred to as programs, software, software frameworks, frameworks, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural language, an object-oriented programming language, a functional programming language, a logical programming language, and / or in assembly / machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and / or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and / or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and / or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.

[0045] To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including, but not limited to, acoustic, speech, or tactile input. Other possible input devices include, but are not limited to, touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.

[0046] In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and / or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features. For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and / or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and / or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” Use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.

[0047] The subject matter described herein can be embodied in systems, apparatus, methods, and / or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and / or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and / or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and / or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.

Claims

Attorney Docket No.: 55701WO / 0728.000557W001

CLAIMS

What is claimed is:

1. A computer-implemented method, comprising: generating an appended dataset by combining treatment arm data from a clinical trial with real-world evidence data, wherein the appended dataset is a matrix of two-dimensional array where rows represent individual subjects and columns represent attributes associated with a subject; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.

2. The method of claim 1, wherein the appended dataset comprises data from real-world evidence sources external to the clinical trial.

3. The method of claim 1, wherein the first performance measurement and the second performance measurement are based on accuracy, precision, recall, root mean squared error (RMSE), F1 Score, AUC-ROC (Area Under the Receiver Operating Characteristic Curve), AUC-PR (Area Under the Precision-Recall Curve), Confusion Matrix, Log Loss (Cross-Entropy Loss), Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), R-Squared (Coefficient of Determination), Mean Squared Logarithmic Error (MSLE), Loss (training and testing), Mean Average Precision (MAP), Average Precision (AP), Mean Average Recall (MAR), Mean Absolute Scaled Error (MASE), or a combination thereof.

4. The method of claim 1, further comprising identifying attributes that influence patient outcomes by evaluating the impact level of the removed one or more columns.

5. The method of claim 1, wherein removing one or more columns from the appended dataset comprises removing columns corresponding to demographic factors, clinical variables, or treatment indicators.

6. The method of claim 1, further comprising dynamically adjusting the columns to be removed based on predefined criteria comprising variable importance scores or feature selection algorithms.

7. The method of claim 1, wherein comparing the first performance measurement and the second performance measurement further comprises generating a visualization that highlights the impact level of the removed one or more columns on predictive model's performance.

8. A computer program product comprising a non-transient machine-readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising: generating an appended dataset by combining treatment arm data from a clinical trial with real-world evidence data, wherein the appended dataset is a matrix of two-dimensional array where rows represent individual subjects and columns represent attributes associated with a subject; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.

9. The computer program product of claim 8, wherein the appended dataset comprises data from real-world evidence sources external to the clinical trial.

10. The computer program product of claim 8, wherein the first performance measurement and the second performance measurement are based on accuracy, precision, recall, root mean squared error (RMSE), F1 Score, AUC-ROC (Area Under the Receiver Operating Characteristic Curve), AUC-PR (Area Under the Precision-Recall Curve), Confusion Matrix, Log Loss (Cross-Entropy Loss), Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), R-Squared (Coefficient of Determination), Mean Squared Logarithmic Error (MSLE), Loss (training and testing), Mean Average Precision (MAP), Average Precision (AP), Mean Average Recall (MAR), Mean Absolute Scaled Error (MASE), or a combination thereof.

11. The computer program product of claim 8, wherein the operations further comprise identifying attributes that influence patient outcomes by evaluating the impact level of the removed one or more columns.

12. The computer program product of claim 8, wherein removing one or more columns from the appended dataset comprises removing columns corresponding to demographic factors, clinical variables, or treatment indicators.

13. The computer program product of claim 8, wherein the operations further comprise dynamically adjusting the columns to be removed based on predefined criteria comprising variable importance scores or feature selection algorithms.

14. The computer program product of claim 8, wherein comparing the first performance measurement and the second performance measurement further comprises generating a visualization that highlights the impact level of the removed one or more columns on predictive model's performance.

15. A system comprising: a programmable processor; and a non-transient machine-readable medium storing instructions that, when executed by the processor, cause the programmable processor to perform operations comprising: generating an appended dataset by combining treatment arm data from a clinical trial with real-world evidence data, wherein the appended dataset is a matrix of two-dimensional array where rows represent individual subjects and columns represent attributes associated with a subject; creating a first predictive model based on a training subset of the appended dataset; generating a first performance measurement for the first predictive model based on a first outcome for a test subset of the appended dataset using the first predictive model and an actual outcome for the test subset; removing one or more columns from the appended dataset to generate a modified dataset; creating a second predictive model based on the modified dataset; generating a second performance measurement for the second predictive model based on a second outcome for the test subset of the appended dataset using the second predictive model and the actual outcome for the test subset; and returning a result that indicates an impact level of the one or more columns by comparing the first performance measurement and the second performance measurement.

16. The system of claim 15, wherein the appended dataset comprises data from real-world evidence sources external to the clinical trial.

17. The system of claim 15, wherein the first performance measurement and the second performance measurement are based on accuracy, precision, recall, root mean squared error (RMSE), F1 Score, AUC-ROC (Area Under the Receiver Operating Characteristic Curve), AUC-PR (Area Under the Precision-Recall Curve), Confusion Matrix, Log Loss (Cross-Entropy Loss), Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), R-Squared (Coefficient of Determination), Mean Squared Logarithmic Error (MSLE), Loss (training and testing), Mean Average Precision (MAP), Average Precision (AP), Mean Average Recall (MAR), Mean Absolute Scaled Error (MASE), or a combination thereof.

18. The system of claim 15, wherein the operations further comprise identifying attributes that influence patient outcomes by evaluating the impact level of the removed one or more columns.

19. The system of claim 15, wherein removing one or more columns from the appended dataset comprises removing columns corresponding to demographic factors, clinical variables, or treatment indicators.

20. The system of claim 15, wherein the operations further comprise dynamically adjusting the columns to be removed based on predefined criteria comprising variable importance scores or feature selection algorithms.