Systems and methods for fair prediction of undiagnosed disease

WO2026136828A1 · PCT designated stage · Publication Date: 2026-06-25 · RGT UNIV OF CALIFORNIA

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: RGT UNIV OF CALIFORNIA
Filing Date: 2025-12-19
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Conventional methods for predicting Alzheimer's Disease (AD) fail to accurately diagnose the condition in understudied demographic groups, leading to significant underdiagnosis and racial/ethnic disparities, and do not adequately address label bias in existing datasets.

Method used

Employ semi-supervised positive-unlabeled learning coupled with race-specific bias mitigation to generate fair and accurate AD risk predictions using electronic medical records (EMRs), leveraging both labeled and unlabeled data to improve disease state prediction models.

Benefits of technology

The approach enhances the accuracy and fairness of AD prediction models by mitigating biases and leveraging underutilized data, enabling equitable diagnosis across diverse populations without compromising accuracy.


Abstract

Systems and methods are disclosed for developing models for predicting Alzheimer's Disease (AD) and other disease states with improved fairness and bias mitigation. An example method includes receiving an EMR dataset comprising a first EMR subset for a first population of patients having confirmed positive disease indications, and an unlabeled EMR subset for a remainder population of the patients. The patients are categorized into various demographic groups. The method further includes using the EMR dataset to train a set of machine learning models via positive unlabeled learning (PUL) to first determine a second EMR subset for a second population of patients having reliable negative indications for the disease state, and then determine additional positive and additional negative indications. Biases specific to each group are mitigated by applying probabilistic criteria specific to each demographic group to subsets of the EMR data during the training of these machine learning models.

Description

SYSTEMS AND METHODS FOR FAIR PREDICTION OF UNDIAGNOSED DISEASE

RELATED APPLICATION

[0001] This application claims priority to U.S. Provisional Patent Application Serial No. 63/737,184, filed on December 20, 2024, which is incorporated herein by reference in its entirety for all purposes.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] This invention was made with government support under AG085518 awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

[0003] The embodiments disclosed herein are generally directed towards systems, processes and methods for predicting Alzheimer’s Disease (AD). Alzheimer’s Disease is one of the most common neurodegenerative diseases affecting seniors and poses a significant health and economic burden in the US. By 2050, over 12.7 million people age 65+ in the US are projected to have AD. Early AD diagnosis is therefore important for patients to implement lifestyle changes, plan for the future, and receive optimal treatment.

[0004] Unfortunately, AD is largely underdiagnosed. For example, while large cohort studies use in-person assessments of dementia (i.e., the gold standard for AD diagnosis), real-world settings have relied on data from Medicare claims to estimate AD prevalence. Compared to the gold standard diagnoses, the sensitivity of Medicare claims was only about 50-65%, which highlights the significant underdiagnosis of AD in real-world settings.

[0005] Furthermore, this underdiagnosis is particularly prevalent in understudied demographic groups. For example, when compared to patients in the White demographic group, patients in the non-Hispanic African American (NH-AfAm) demographic group are about twice as likely to have AD or other dementias, but are only about 34% more likely to be diagnosed, and patients in the Hispanic/Latino (HL) demographic group are about 1.5 times as likely to have AD or other dementias, but are only about 18% more likely to be diagnosed.

[0006] There is a desire and need for improved diagnosis of AD to allow patients to receive proper treatment and plan their lives accordingly. However, conventional methods for predicting AD among patients fail to account for the underdiagnoses of AD among patients of understudied groups. Furthermore, such conventional methods fail to mitigate biases in diagnosing dementia that are specific to each understudied group.

[0007] Various embodiments of the present disclosure address one or more of the above-described shortcomings.

SUMMARY

[0008] Recognized herein is a need for detecting undiagnosed Alzheimer’s Disease (AD) in an accurate and equitable manner using real-world electronic medical records (EMRs), and for addressing label bias and racial and ethnic disparities that can arise in AD identification. The present disclosure describes methods and systems for applying semi-supervised positive-unlabeled learning coupled with race-specific bias mitigation to generate fair, accurate, and clinically meaningful AD risk predictions across diverse patient populations.

[0009] Early AD diagnosis is important for patients to implement lifestyle changes, plan for the future, and receive optimal treatment. However, there is a discrepancy between reported diagnoses of AD based on medical-driven assessments (e.g., in-person assessments of dementia) and the actual prevalence of AD in real-world settings based on patient-driven reports (e.g., Medicare claims, reported symptoms, exhibited physiologies of AD and/or dementia, etc.). This discrepancy is especially acute in understudied groups in the US (e.g., the Hispanic/Latino (HL), non-Hispanic African American (NH-AfAm), and East Asian (EA) groups).

[0010] Conventional methods for predicting AD among patients fail to account for the underdiagnosis of AD among patients of understudied groups, and fail to mitigate biases in diagnosing dementia that are specific to each understudied group. Furthermore, the datasets used to build conventional AD prediction models have limited samples of understudied patient populations, and such models have failed to stratify for, or mitigate, labeling bias in these patient populations. Thus, such AD prediction models have failed to exploit the full range of patient data pertaining to AD.

[0011] Accordingly, provided herein is a computer-implemented method for improving disease state prediction models from unlabeled electronic medical records (EMRs), the method comprising: (a) receiving, by a processor, an initial electronic medical record (EMR) dataset of patient specific EMR data for a plurality of patients, wherein the initial EMR dataset comprises a first labeled EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a remainder population of the plurality of patients excluding the first population, wherein the first labeled EMR subset indicates that the first population has confirmed positive indications for a disease state, and wherein the unlabeled EMR subset indicates that the remainder population has unconfirmed indications for the disease state; (b) inputting the unlabeled EMR subset to a trained machine learning model to predict disease state outcomes for a second population of the plurality of patients, wherein the trained machine learning model is configured to predict the disease state outcomes based on the initial EMR dataset, wherein each disease state outcome of the disease state outcomes is a reliable negative indication for the disease state or a confirmed positive indication for the disease state, and wherein the second population corresponds to a subset of the remainder population; and (c) generating, based on the predicted disease state outcomes that are reliable negative indications, an updated EMR dataset comprising: (i) the first labeled EMR subset for the first population having the confirmed positive indication for the disease state, and (ii) a second labeled EMR subset for the second population having the reliable negative indication for the disease state.

[0012] In some embodiments, the disease state can comprise a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition. In some embodiments, the disease state can be Alzheimer’s Disease. In some embodiments, the patient specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of chronic pain; an indication of cerebral degeneration; a secondary malignancy of lymph nodes; an indication of osteoarthrosis; an indication of inflammatory and toxic neuropathy; an indication of a disturbance of skin sensation; an indication of septicemia; an indication of vascular dementia; an indication of astigmatism; an indication of acute posthemorrhagic anemia; an indication of melanomas of skin; an indication of a respiratory failure; an indication of a secondary malignancy of bone; an indication of an antineoplastic and/or an immunosuppressive drug causing an adverse effect; an indication of morbid obesity; an indication of a voice disturbance; an indication of alopecia; or an indication of an acute pain. In some embodiments, the patient specific EMR data can comprise two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome. In some embodiments, the patient specific EMR data can comprise one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder. In some embodiments, the trained machine learning model can comprise a gradient-boosted decision tree model, an ensemble of positive-unlabeled classifiers, a multilayer neural network, or any combination thereof.
In some embodiments, the method can further comprise generating derived features comprising record density, encounter frequency, or diagnosis count prior to the inputting. In some embodiments, the method can further comprise converting diagnosis codes into phecodes prior to predicting the disease state outcomes. In some embodiments, the method can further comprise validating the predicted disease state outcomes, wherein the validating can comprise comparing polygenic risk score distributions between the first population having the confirmed positive indication for the disease state and the second population having the reliable negative indication for the disease state. In some embodiments, the validating can further comprise assessing an apolipoprotein E (APOE) ε4 allele count for each patient in the first population.
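
As one illustration of the derived features mentioned above, the following Python sketch computes a diagnosis count, an encounter frequency, and a record density for a single patient. The input layout, the function name `derived_features`, the placeholder diagnosis codes, and the exact definitions used (encounters per year of record, diagnosis entries per year of record, a one-year floor on record length) are assumptions made for illustration only; the features are named above, but their formulas are not given here.

```python
from datetime import date


def derived_features(encounters):
    """Summarize one patient's record.

    encounters: list of (encounter_date, [diagnosis_codes]) tuples.
    All definitions below are illustrative assumptions.
    """
    if not encounters:
        return {"encounter_count": 0, "diagnosis_count": 0,
                "record_length_years": 0.0,
                "encounter_frequency": 0.0, "record_density": 0.0}

    dates = [d for d, _ in encounters]
    # Record length: time between the first and last encounter, in years.
    record_length_years = (max(dates) - min(dates)).days / 365.25
    encounter_count = len(encounters)
    # Diagnosis count: number of distinct codes seen across all encounters.
    diagnosis_count = len({c for _, codes in encounters for c in codes})
    # Total diagnosis entries, counting repeats across encounters.
    diagnosis_entries = sum(len(codes) for _, codes in encounters)

    # A very short record would inflate the rates, so use one year as a
    # floor (an assumption, not a prescribed rule).
    years = max(record_length_years, 1.0)
    return {
        "encounter_count": encounter_count,
        "diagnosis_count": diagnosis_count,
        "record_length_years": record_length_years,
        "encounter_frequency": encounter_count / years,
        "record_density": diagnosis_entries / years,
    }


# Hypothetical patient with three encounters and placeholder codes.
patient = [
    (date(2020, 1, 1), ["code_a", "code_b"]),
    (date(2021, 1, 1), ["code_a"]),
    (date(2022, 1, 1), ["code_c"]),
]
features = derived_features(patient)
```

Features of this kind describe how much data exists for a patient rather than what the data says, which is why they would be computed before the records are passed to the model.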

[0013] Also provided herein is a computer-implemented method for improving fairness of disease state prediction models from unlabeled EMR data of underrepresented demographic groups, the method comprising: (a) receiving, by a processor, a labeled EMR dataset of patient specific EMR data for a first population and a second population of a plurality of patients, wherein the labeled EMR dataset labels the first population as having a confirmed positive indication for a disease state and labels the second population as having a reliable negative indication for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) receiving, by the processor, a plurality of unlabeled group-specific EMR datasets of patient specific EMR data for a remainder population of the plurality of patients that excludes the first population and second population, wherein the remainder population has unconfirmed indications for the disease state, and wherein the plurality of unlabeled group-specific EMR datasets is associated with the plurality of demographic groups; (c) inputting each unlabeled group-specific EMR dataset of the plurality of unlabeled group-specific EMR datasets to a trained machine learning model to predict a disease state outcome for each patient in each demographic group of the plurality of demographic groups, wherein the trained machine learning model is configured to predict a disease state outcome based on input data, wherein the disease state outcome is a positive indication for the disease state or a negative indication for the disease state, and wherein the each unlabeled group-specific EMR dataset is of the each demographic group; and (d) generating, based on the labeled EMR dataset and the predicted disease state outcome, an updated EMR dataset labeling the first population, the second population, and the remainder population.
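
Steps (c) and (d) can be pictured as scoring each group's unlabeled records and then writing the predicted labels back into one dataset. The Python sketch below is a minimal illustration under stated assumptions: the trained model is represented by a scoring function that returns a probability, each demographic group has its own pair of cutoffs, and records scoring between the two cutoffs are left unlabeled. The helper `label_remainder`, the stand-in `toy_score`, the group names, and the cutoff values are all hypothetical.

```python
def label_remainder(unlabeled_by_group, score, cutoffs):
    """Assign additional positive/negative labels per demographic group.

    unlabeled_by_group: {group: {patient_id: feature_vector}}
    score: callable mapping a feature vector to a probability in [0, 1]
    cutoffs: {group: (negative_cutoff, positive_cutoff)}, hypothetical values
    """
    additional_pos, additional_neg, still_unlabeled = [], [], []
    for group, patients in unlabeled_by_group.items():
        neg_cut, pos_cut = cutoffs[group]
        for pid, features in patients.items():
            p = score(features)
            if p >= pos_cut:
                additional_pos.append(pid)    # additional positive indication
            elif p <= neg_cut:
                additional_neg.append(pid)    # additional negative indication
            else:
                still_unlabeled.append(pid)   # not confident either way
    return additional_pos, additional_neg, still_unlabeled


def toy_score(features):
    # Stand-in for a trained model: the mean of the feature values.
    return sum(features) / len(features)


unlabeled_by_group = {
    "group_1": {"p1": [0.9, 0.9], "p2": [0.1, 0.1], "p3": [0.5, 0.5]},
    "group_2": {"p4": [0.7, 0.7], "p5": [0.3, 0.3]},
}
cutoffs = {"group_1": (0.2, 0.8), "group_2": (0.35, 0.65)}
pos, neg, rest = label_remainder(unlabeled_by_group, toy_score, cutoffs)
```

Keeping the cutoffs in a per-group mapping is what makes the labeling group-specific: the same score can lead to different labels in different groups.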

[0014] In some embodiments, the disease state can comprise a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition. In some embodiments, the disease state can be Alzheimer’s Disease. In some embodiments, the updated EMR dataset can comprise: a first EMR subset for the first population, a second EMR subset for the second population, a third EMR subset for a third population having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population having an additional negative indication for the disease state. In some embodiments, the method can further comprise generating derived features comprising record density, encounter frequency, or diagnosis count prior to the inputting. In some embodiments, the method can further comprise converting diagnosis codes into phecodes prior to predicting the disease state outcome. In some embodiments, the inputting can further comprise: mitigating a bias associated with the each demographic group by adjusting the each unlabeled group-specific EMR dataset of the each demographic group. In some embodiments, the mitigating the bias can further comprise optimizing parity of one or more of balanced accuracy, precision, specificity, equal opportunity, or cumulative parity loss across the plurality of demographic groups. In some embodiments, the method can further comprise validating the predicted disease state outcome, wherein the validating can comprise comparing polygenic risk score distributions between the first population and the second population. In some embodiments, the validating can further comprise assessing an APOE ε4 allele count for each patient in the first population.
In some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of an altered mental status; an indication of a neurological disorder; an indication of an abnormality of gait; an indication of a mild cognitive impairment; an indication of a vascular dementia; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of a delirium due to conditions classified elsewhere; an indication of a major depressive disorder; an indication of a cerebral degeneration; an indication of a cerebral ischemia; an indication of an encephalopathy; an indication of a developmental delay and/or disorder; an indication of a urinary tract infection; an indication of a urinary incontinence; an indication of a transient alteration of awareness; an indication of decubitus ulcer; an indication of aphasia/speech disturbance; or an indication of malaise and fatigue. In some embodiments, the patient specific EMR data can comprise two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome. In some embodiments, the patient specific EMR data can comprise one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.
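
The parity criteria named above (balanced accuracy, precision, specificity, equal opportunity) compare per-group values of standard classification metrics. The Python sketch below computes those metrics for each demographic group and reports, for each metric, the gap between the best and worst group. Equal opportunity is taken here in its usual sense of parity in true positive rate. The gap measure, the function names, and the toy data are assumptions for illustration; cumulative parity loss is not computed because its formula is not given in this passage.

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def group_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sensitivity = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": safe_div(tp, tp + fp),
        "specificity": specificity,
        # Equal opportunity compares this rate across groups.
        "true_positive_rate": sensitivity,
    }


def parity_gaps(y_true, y_pred, groups):
    """Per-group metrics plus the max-minus-min gap for each metric."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = group_metrics([y_true[i] for i in idx],
                                     [y_pred[i] for i in idx])
    names = list(next(iter(per_group.values())).keys())
    gaps = {m: max(v[m] for v in per_group.values())
               - min(v[m] for v in per_group.values()) for m in names}
    return per_group, gaps


# Toy data: group "A" is classified perfectly, group "B" is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
per_group, gaps = parity_gaps(y_true, y_pred, groups)
```

A gap of zero for a metric means the groups are at parity on that metric; optimizing parity would aim to shrink these gaps without lowering the per-group values.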

[0015] Further provided herein is a computer-implemented method for disease state prediction with improved fairness and bias mitigation using EMR data, the method comprising: (a) receiving, by a processor, a plurality of EMR datasets of patient-specific EMR data for a plurality of patients having a disease state outcome that is known, wherein the disease state outcome is a positive indication for a disease state or a negative indication for the disease state, wherein the plurality of patients are categorized into a plurality of demographic groups, and wherein each EMR dataset of the plurality of EMR datasets is associated with each patient of the plurality of patients; (b) mitigating a bias in a trained machine learning model for each demographic group of the plurality of demographic groups, wherein the trained machine learning model is configured to predict the disease state outcome based on input data; and (c) predicting the disease state outcome for the each patient by inputting the each EMR dataset to the trained machine learning model.

[0016] In some embodiments, the disease state can comprise a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition. In some embodiments, the disease state can be Alzheimer’s Disease. In some embodiments, the EMR dataset can comprise: a first EMR subset for a first population of the plurality of patients having a confirmed positive indication for the disease state, a second EMR subset for a second population of the plurality of patients having a reliable negative indication for the disease state, a third EMR subset for a third population of the plurality of patients having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population of the plurality of patients having an additional negative indication for the disease state. In some embodiments, the method can further comprise: validating the trained machine learning model using the first EMR subset; and adjusting the second EMR subset, the third EMR subset, and the fourth EMR subset. In some embodiments, the mitigating the bias can comprise: receiving, for the each demographic group, a group benefit equality (GBE) value; and tuning the trained machine learning model based on the GBE value for the each demographic group. In some embodiments, the mitigating the bias can further comprise optimizing parity of one or more of balanced accuracy, precision, specificity, equal opportunity, or cumulative parity loss across the plurality of demographic groups.
In some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and/or development; an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon. In some embodiments, the patient specific EMR data can comprise two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome. In some embodiments, the patient specific EMR data can comprise one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

[0017] Also provided herein is a computer-implemented method for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data, the method comprising: (a) receiving, by a processor, a first EMR dataset of patient-specific EMR data for a plurality of patients, wherein the first EMR dataset comprises a first EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a first remainder population of the plurality of patients excluding the first population, wherein the first EMR subset indicates that the first population has confirmed positive indications for a disease state, wherein the unlabeled EMR subset indicates that the first remainder population has a first set of unconfirmed indications for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) inputting the unlabeled EMR subset into a first trained machine learning model to predict disease state outcomes for a second population of the plurality of patients, wherein the second population is a subset of the first remainder population, and wherein the disease state outcomes are reliable negative indications for the disease state; (c) receiving, by the processor, a second EMR dataset comprising the first EMR subset and a second EMR subset for the second population, and a plurality of unlabeled group-specific EMR datasets of patient-specific EMR data for a second remainder population of the plurality of patients, wherein the second remainder population excludes the first population and the second population and has a second set of unconfirmed indications for the disease state, wherein the second EMR subset indicates that the second population has the reliable negative indications for the disease state, and wherein each unlabeled group-specific EMR dataset of the plurality of unlabeled group-specific EMR datasets is associated with each demographic group of the plurality of demographic groups; (d) inputting the each unlabeled group-specific EMR dataset to a second trained machine learning model to predict the disease state outcomes for the each demographic group, wherein the each demographic group belongs to the second remainder population; (e) generating, based on the predicted disease state outcomes for the second remainder population, a third EMR subset for a third population of the plurality of patients, and a fourth EMR subset for a fourth population of the plurality of patients; and (f) predicting the disease state outcome with improved fairness by inputting a third EMR dataset into a third trained machine learning model, wherein the third EMR dataset comprises the first EMR subset, the second EMR subset, the third EMR subset, and the fourth EMR subset, wherein the first EMR subset indicates that the first population has the confirmed positive indications for the disease state, wherein the second EMR subset indicates that the second population has the reliable negative indications for the disease state, wherein the third EMR subset indicates that the third population has an additional positive indication for the disease state, and wherein the fourth EMR subset indicates that the fourth population has an additional negative indication for the disease state; wherein the first trained machine learning model, the second trained machine learning model, and the third trained machine learning model are configured to predict the disease state outcomes based on input data, wherein the disease state outcomes are the confirmed positive indications for the disease state or the reliable negative indications for the disease state, and wherein training of the third trained machine learning model comprises mitigating a bias using a group benefit equality (GBE) value for the each demographic group.

[0018] In some embodiments, the method can further comprise: receiving, by the processor, EMR data associated with a patient in the each demographic group, wherein the patient has an unconfirmed indication of the disease state; and predicting a disease state outcome for the patient by inputting the EMR data associated with the patient into the third trained machine learning model. In some embodiments, the disease state can comprise a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition. In some embodiments, the disease state can be Alzheimer’s Disease. In some embodiments, the method can further comprise: validating the third trained machine learning model using the first EMR subset; and adjusting the second EMR subset, the third EMR subset, and the fourth EMR subset. In some embodiments, the mitigating the bias can comprise: receiving, by the processor, the GBE value for the each demographic group; and tuning the third trained machine learning model based on the GBE value for the each demographic group. In some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and/or development; an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon.
In some embodiments, the patient specific EMR data can comprise two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome. In some embodiments, the patient specific EMR data can comprise one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

[0019] Also provided herein is a computer-implemented method for predicting a disease state of a patient from EMR data, the method comprising: (a) receiving, by a processor, an initial EMR dataset of patient-specific EMR data for a plurality of patients, wherein the initial EMR dataset comprises a first labeled EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a remainder population of the plurality of patients excluding the first population, wherein the first labeled EMR subset indicates the first population has confirmed positive indications for the disease state, wherein the unlabeled EMR subset indicates that the remainder population has unconfirmed indications for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) identifying, based on probabilistic gap values, a first subset of the unlabeled EMR subset as reliable negative indications for the disease state, wherein a probabilistic gap value of the probabilistic gap values is generated by the processor for each patient in the unlabeled EMR subset based on a first trained machine learning model, and wherein the reliable negative indications correspond to the probabilistic gap values that are below a threshold derived from the probabilistic gap values of the first labeled EMR subset; (c) assigning, by the processor, additional positive indications and additional negative indications to a second subset of the unlabeled EMR subset using race-specific probabilistic criteria based on the probabilistic gap values; (d) classifying, by the processor, the unlabeled EMR subset to predict the disease state for the plurality of patients using a second trained machine learning model, wherein the second trained machine learning model is trained using an expanded labeled EMR dataset, and wherein the expanded labeled EMR dataset comprises the first labeled EMR subset, the first subset having the reliable negative indications, and the second subset having the additional positive indications and the additional negative indications; (e) optimizing, by the processor, a decision threshold for each demographic group of the plurality of demographic groups by selecting a group-specific threshold that maximizes group benefit equality (GBE) for the disease state; and (f) predicting the disease state for the patient by inputting an EMR dataset of the patient into the second trained machine learning model. In some embodiments, the first trained machine learning model can be a trained probabilistic model, and the second trained machine learning model can be a semi-supervised machine learning model.
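
Step (b) above can be sketched as follows. The Python example assumes the first trained model outputs, for each patient, an estimated probability of the positive class, and it takes the probabilistic gap to be the difference between the positive and negative class probabilities. It then uses the smallest gap observed among the confirmed positives as the threshold. Both the gap formula and the choice of the minimum as the threshold are assumptions for illustration; the text above requires only that the threshold be derived from the gap values of the labeled subset. The function names and the scores are hypothetical.

```python
def probabilistic_gap(p_positive):
    """Gap between the estimated positive and negative class probabilities."""
    return p_positive - (1.0 - p_positive)


def select_reliable_negatives(labeled_positive_scores, unlabeled_scores):
    """Return (threshold, ids) of unlabeled patients treated as reliable negatives.

    labeled_positive_scores: probabilities for the confirmed-positive patients
    unlabeled_scores: {patient_id: probability} for the unlabeled patients
    """
    # Assumed threshold rule: the smallest gap seen among confirmed positives.
    threshold = min(probabilistic_gap(p) for p in labeled_positive_scores)
    reliable = sorted(
        pid for pid, p in unlabeled_scores.items()
        if probabilistic_gap(p) < threshold
    )
    return threshold, reliable


# Hypothetical model outputs.
labeled_positive_scores = [0.9, 0.7, 0.4]
unlabeled_scores = {"u1": 0.05, "u2": 0.55, "u3": 0.8, "u4": 0.3}

threshold, reliable_negative_ids = select_reliable_negatives(
    labeled_positive_scores, unlabeled_scores
)

# Expanded labeled set: confirmed positives are handled elsewhere; here the
# reliable negatives receive label 0 and everything else stays unlabeled.
expanded_labels = {pid: 0 for pid in reliable_negative_ids}
```

Under this rule an unlabeled patient is called a reliable negative only if the model finds them less positive-looking than every confirmed positive, which keeps undiagnosed cases out of the negative set at the cost of leaving many patients unlabeled for the later steps.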

[0020] In some embodiments, the predicting can comprise applying a demographic-group-specific optimized threshold. In some embodiments, the disease state can comprise a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition. In some embodiments, the disease state can be Alzheimer’s Disease. In some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of a neurological disorder; an indication of delirium; an indication of vascular dementia; an indication of a mild cognitive impairment; an indication of encephalopathy; an indication of a delusional disorder; an indication of an episodic mood disorder; an indication of an anxiety disorder; an indication of altered mental status; an indication of memory loss; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of adjustment reaction; screening for malignant neoplasms; an indication of decubitus ulcer; an indication of palpitations; an indication of other immunological findings; an indication of aphasia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and/or development; an alteration of consciousness; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; an indication of benign neoplasm of colon; an age of the patient; a number of encounters; a number of diagnoses; a record length; or a record density. In some embodiments, the patient specific EMR data can comprise two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome.
In some embodiments, the patient specific EMR data can comprise one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.
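
The demographic-group-specific optimized threshold can be illustrated with a small search over candidate cutoffs. The Python sketch below assumes a definition of group benefit equality (GBE) used in some healthcare fairness tooling, namely the number of predicted positives in a group divided by the number of actual positives in that group, so that a value of 1 means the model flags as many patients as truly have the condition. "Maximizing" GBE is interpreted here as picking the threshold whose GBE is closest to 1. The definition, that interpretation, the candidate grid, and all names and data are assumptions, not taken from the disclosure.

```python
def gbe(y_true, y_pred):
    """Assumed GBE: predicted positives divided by actual positives."""
    actual = sum(y_true)
    if actual == 0:
        raise ValueError("GBE is undefined for a group with no positives")
    return sum(y_pred) / actual


def best_threshold(y_true, scores, candidates):
    """Pick the candidate threshold whose GBE is closest to 1."""
    best_t, best_dev = None, float("inf")
    for t in candidates:
        y_pred = [1 if s >= t else 0 for s in scores]
        deviation = abs(gbe(y_true, y_pred) - 1.0)
        if deviation < best_dev:  # ties keep the earlier candidate
            best_t, best_dev = t, deviation
    return best_t


def thresholds_by_group(y_true, scores, groups, candidates):
    """Run the threshold search separately within each demographic group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = best_threshold([y_true[i] for i in idx],
                                   [scores[i] for i in idx],
                                   candidates)
    return result


# Toy data: the model's scores run lower overall for group "B".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.1, 0.45, 0.35, 0.2, 0.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
group_thresholds = thresholds_by_group(y_true, scores, groups, [0.3, 0.5, 0.7])


def predict(score, group):
    # Apply the threshold chosen for the patient's own group.
    return 1 if score >= group_thresholds[group] else 0
```

In the toy data a single shared cutoff of 0.5 would flag no one in group "B", whereas the per-group search lowers that group's cutoff so that its true positives are still detected.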

[0021] Further provided herein is a system for improving disease state prediction models from unlabeled EMRs, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform any one of the methods of the present disclosure.

[0022] Further provided herein is a system for improving fairness of disease state prediction models from unlabeled EMR data of underrepresented demographic groups, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform any one of the methods of the present disclosure.

[0023] Further provided herein is a system for disease state prediction with improved fairness and bias mitigation using EMR data, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform any one of the methods of the present disclosure.

[0024] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

[0025] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and / or take precedence over any such contradictory material.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] The features of the present disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:

[0027] FIG. 1 is a schematic showing an example semi-supervised learning framework for developing models to predict Alzheimer’s Disease (AD) (and other disease states) with improved fairness, according to example embodiments of the present disclosure.

[0028] FIG. 2 is a block diagram illustrating a computing device upon which embodiments of the present teachings may be implemented.

[0029] FIGS. 3A-3C are block diagrams that collectively illustrate example computer-implemented methods for developing disease state prediction models with improved fairness and predicting disease states using such models, according to non-limiting embodiments of the present disclosure.

[0030] FIG. 3A is a block diagram illustrating a non-limiting exemplary method 300A for improving disease state prediction models from unlabeled electronic medical records (EMRs).

[0031] FIG. 3B is a block diagram illustrating a non-limiting exemplary method 300B for improving disease state prediction models from unlabeled electronic medical records (EMRs).

[0032] FIG. 3C is a block diagram illustrating a non-limiting exemplary method 300C for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data.

[0033] FIG. 4 is a graph comparing fairness by cutoff method. Fairness metrics were averaged over 1000 test sets. Cutoffs for MCC-maximized models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoffs for GBE-optimized models were selected by optimizing the GBE for each race / ethnicity in the validation set. A lower cumulative parity loss reflects better overall fairness. BA=balanced accuracy, GBE=group benefit equality, MCC=Matthews Correlation Coefficient, SSPUL=semi-supervised positive unlabeled learning.

[0034] FIG. 5 is a schematic illustrating the study design. (Phase 1) Patient- and record-level data preprocessing. (Phase 2) Train and evaluate. After preprocessing, data for non-ATLAS patients were randomly split (labeled positive-stratified). The training set was used for training the SSPUL framework, while the validation set was used for determining the optimal GBE cutoff for each race and ethnicity. The trained model was then applied to the test set for race-stratified performance and fairness evaluation using proxy ICDs and medications. The process was repeated 1000 times. Performance and fairness metrics were averaged over the 1000 splits, and the trained model from each split was saved for predicting unlabeled ATLAS patients for validation. (Phase 3) Validate. Polygenic risk scores were obtained for ATLAS patients using LDPred2. Applying each trained model from phase 2, the mean PRS and ε4 allele count for each final classification of unlabeled patients [i.e., predicted positive (PP) or predicted negative (PN)] was obtained (1000 mean PRS for predicted positives and predicted negatives total). The mean of PRS means and mean of ε4 allele count means were then obtained by aggregating the PRS means and ε4 allele count means, respectively, followed by race-stratified validation. ATLAS = UCLA ATLAS Community Health Initiative, DDR = Data Discovery Repository, EA = East Asian, HL = Hispanic Latino, ICD = International Classification of Diseases, NH-AfAm = non-Hispanic African American, NH-white = non-Hispanic white, PN = predicted negatives, PP = predicted positives, PRS = polygenic risk scores, SSPUL = semi-supervised positive unlabeled learning, VAL = validation set.

[0035] FIGS. 6A-6B collectively illustrate the four-step semi-supervised positive unlabeled learning (SSPUL) framework of the present disclosure.

[0036] FIG. 6A is a schematic illustrating SSPUL framework overview. (Step 1) Identify reliable negatives: Following feature selection, a Generalized Linear Model (GLM) was trained on labeled positive (LP) and unlabeled data. Reliable negatives were obtained based on having a probabilistic gap that is smaller than the smallest observed probabilistic gap of LPs. (Step 2) Pre-processing racial bias mitigation: additional positive (AP) and negative (AN) labels were assigned using race-specific probabilistic gaps. (Step 3) Train final classifier: XGBoost classifier was trained on all labeled and pseudo-labeled patients. (Step 4) Post-processing bias mitigation: classification cutoffs were determined by optimizing the group benefit equality (GBE) for each race and ethnicity.

[0037] FIG. 6B is a schematic illustrating pre- and post-processing bias mitigation details. Pre-processing bias mitigation: After training a distributed random forest classifier using LPs and RNs, APs and ANs were assigned for a subset of the remaining unlabeled data such that the following race- and ethnicity-specific probabilistic criteria are met: (1) APs and ANs have race-specific probabilistic gaps that are greater and smaller than the smallest observed probabilistic gap of LPs and largest observed probabilistic gap of RNs, respectively, for each race and ethnicity; (2) the prevalence of positive labels for each race and ethnicity closely matches the corresponding population AD prevalence. Post-processing bias mitigation: Predicted probabilities for unlabeled patients in the validation set were obtained from the trained final XGBoost classifier. The classification cutoff for each race and ethnicity was determined by optimizing the GBE for each race and ethnicity to ensure that the prevalence of LPs and predicted positives matched that of labeled and proxy-validated positives. The cutoffs were then applied to the test set for classification. AN = additional negative, AP = additional positive, DRF = distributed random forest, g = race or ethnicity variable, GBE = group benefit equality, LP = labeled positive, n = number of patients, RN = reliable negative, U = unlabeled, ΔP_LP = observed probabilistic gap of LPs, ΔP_RN = observed probabilistic gap of RNs, ΔP_U = observed probabilistic gap of unlabeled patients, π_g = population prevalence for race or ethnicity g.
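
The probabilistic-gap criteria described for FIGS. 6A-6B can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the probabilistic gap is assumed to be ΔP = P(y=1|x) − P(y=0|x) = 2p − 1 for a predicted probability p, the predicted probabilities are taken as given inputs rather than produced by a trained GLM or distributed random forest, and all function names, group labels, and toy values are hypothetical. The prevalence-matching criterion (2) is omitted for brevity.

```python
def prob_gap(p):
    """Assumed definition: P(y=1|x) - P(y=0|x) for predicted probability p."""
    return 2.0 * p - 1.0


def reliable_negatives(p_lp, p_unlabeled):
    """Step 1 sketch: indices of unlabeled patients whose gap is smaller
    than the smallest gap observed among labeled positives (LPs)."""
    min_lp_gap = min(prob_gap(p) for p in p_lp)
    return [i for i, p in enumerate(p_unlabeled) if prob_gap(p) < min_lp_gap]


def assign_pseudo_labels(p_lp_by_group, p_rn_by_group, unlabeled):
    """Step 2 sketch: group-specific assignment of additional positives (AP)
    and additional negatives (AN).

    unlabeled: list of (patient_id, group, predicted_probability).
    Patients meeting neither criterion stay unlabeled (absent from the result).
    """
    labels = {}
    for pid, group, p in unlabeled:
        gap = prob_gap(p)
        min_lp = min(prob_gap(q) for q in p_lp_by_group[group])
        max_rn = max(prob_gap(q) for q in p_rn_by_group[group])
        if gap > min_lp:
            labels[pid] = "AP"
        elif gap < max_rn:
            labels[pid] = "AN"
    return labels


# Toy, made-up probabilities for two hypothetical groups "A" and "B".
lp = {"A": [0.8, 0.9], "B": [0.6, 0.7]}
rn = {"A": [0.05, 0.1], "B": [0.1, 0.2]}
pool = [("u1", "A", 0.85), ("u2", "A", 0.5), ("u3", "B", 0.65), ("u4", "B", 0.15)]
pseudo = assign_pseudo_labels(lp, rn, pool)
```

Because the thresholds are computed per group, the same predicted probability can lead to different pseudo-labels in different groups, which is the point of making the criteria group-specific.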

[0038] FIGS. 7A-7B collectively illustrate confusion matrices for predictions. AD=Alzheimer’s disease, EA=East Asian, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews Correlation Coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, SSPUL=semi-supervised positive unlabeled learning.

[0039] FIG. 7A shows confusion matrices illustrating additional positives and additional negatives assigned by step 2 of SSPUL. Values in confusion matrices are means of 1000 training sets.

[0040] FIG. 7B shows confusion matrices illustrating a comparison of final predictions across models. Values in confusion matrices are means of 1000 test sets.

[0041] FIGS. 8A-8B collectively illustrate evaluation of fairness across models. Fairness metrics were averaged over 1000 test sets. BA=balanced accuracy, EA=East Asian, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews Correlation Coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, NPV=negative predictive value, SSPUL=semi-supervised positive unlabeled learning.

[0042] FIG. 8A shows graphs illustrating a comparison of GBE by cutoff method.

[0043] FIG. 8B is a graph illustrating comparison of cumulative parity loss across models.

[0044] FIGS. 9A-9C collectively illustrate a comparison of calibration performance across models. EA=East Asian, ECE=Expected Calibration Error, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews Correlation Coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, SSPUL=semi-supervised positive unlabeled learning.

[0045] FIG. 9A is a graph illustrating calibration curves of SSPUL and baseline models. Each bin represents approximately 10% mean predicted probability. Supervised (risk factors) had a maximum predicted probability of 0.79, resulting in missing points in bins 8-10 from the corresponding calibration curve.

[0046] FIG. 9B is a graph illustrating distributions of predicted probabilities for test set proxy-validated AD cases, stratified by model.

[0047] FIG. 9C shows Brier scores stratified by model and race and ethnicity.

[0048] FIGS. 10A-10E collectively illustrate analyses of top predictive features and test set predictions.

[0049] FIG. 10A illustrates prevalence of top 20 predictive phecodes by test set predictions.

[0050] FIG. 10B illustrates visualization of labeled positives and predicted labels in a 2-dimensional Euclidean feature space via factor analysis.

[0051] FIG. 10C illustrates a factor analysis of the top 20 predictive features, revealing predictive features that are distinct among unlabeled AD patients (purple). Unlabeled AD patients are represented by a true positive (TP) subset (i.e., proxy-validated predicted positives) with coordinates that did not overlap with any of those of LPs in the feature space from FIG. 10B. “_1” indicates the positive class of the feature while “_0” indicates the negative class. Age during last visit, number of encounters, number of diagnoses, record density, and record length were discretized into quartiles.

[0052] FIG. 10D illustrates prevalence of top TP subset-specific and neurological features among LPs and the TP subset.

[0053] FIG. 10E illustrates a comparison of TP subset-specific features’ impact on the TP subset vs. the LP sample (N = 215 for each group). TP subset-specific features are binary (0 = low / blue, 1 = high / red). Purple represents a mixture of 0 and 1 values. Dim=dimension, EF=ejection fraction, LP=labeled positive, SHAP=SHapley Additive exPlanations, TP=true positive.

[0054] FIG. 11 illustrates comparison of SHAP value magnitude and direction among the top 20 features across racial and ethnic groups. Feature values for phecode features are binary (0 = low / blue, 1 = high / red). Feature values for continuous features (age at last visit, record density (per year), number of diagnoses, number of encounters, and record length) were min-max scaled. Purple represents intermediate values for continuous features. EA=East Asian, HL=Hispanic Latino, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, SHAP=SHapley Additive exPlanations.

[0055] FIGS. 12A-12B collectively illustrate genotype validation of holdout set prediction labels. Prediction labels for the holdout set were inferred using 1000 trained models. The mean polygenic risk scores (PRSs) and ε4 allele counts were obtained for LPs and each prediction label for each iteration, and then averaged. EA=East Asian, HL=Hispanic Latino, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, PRS=polygenic risk score.

[0056] FIG. 12A illustrates PRSs, stratified by holdout set prediction labels.

[0057] FIG. 12B illustrates ε4 allele count, stratified by holdout set prediction labels.

[0058] FIGS. 13A-13C collectively illustrate AD genetic risk association with classification and proxy-validated labels. EA=East Asian, HL=Hispanic Latino, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, N.S.=not significant (p > 0.05), PRS=polygenic risk score.

[0059] FIG. 13A illustrates ε4 allele count proportions of holdout set prediction labels.

[0060] FIG. 13B illustrates PRS and ε4 allele counts of proxy-validated labels. Prediction labels for the holdout set were inferred using 1000 trained models.

[0061] FIG. 13C illustrates the ε4 allele proportions for labeled positives and each prediction label, which were obtained for each iteration and then averaged.

[0062] It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended to bring clarity and understanding to various embodiments of apparatuses, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.

DETAILED DESCRIPTION

[0063] The present disclosure describes a novel and nonobvious solution to the aforementioned shortcomings by describing systems, processes and methods for developing AD prediction models with improved fairness and predicting AD in undiagnosed patients using the models. In various embodiments, such AD prediction models have improved robustness and reliability because the development of these models involves large EMR datasets that combine not only diagnosed AD patients but also undiagnosed patients. The AD prediction models are able to harness EMR datasets of the previously ignored or underutilized undiagnosed patients using semi-supervised learning (e.g., machine learning based on labeled and unlabeled data) to determine AD outcomes of the undiagnosed patients (e.g., whether the undiagnosed patients have a negative or positive indication of AD). The present disclosure finds that developing AD prediction models using pre-existing labels (e.g., supervised learning) may be subject to bias (i.e., label bias) as such labels (e.g., whether the patient is known to have a positive or negative diagnosis of AD) can be differentially ascertained across demographic and other patient groups. In the case of race and ethnicity, systemic racism, practitioner bias, and lack of healthcare access in underserved communities contribute to the lower prevalence of labels for instances belonging to underrepresented populations. The present disclosure thus finds that semi-supervised learning can alleviate label bias by leveraging unlabeled data, which is inexpensive and more abundantly available for understudied populations.

[0064] Various embodiments of the present disclosure further utilize positive unlabeled learning (PUL), a special case of semi-supervised learning that is based on learning from both labeled positive and unlabeled data. For example, in various embodiments, an electronic medical records (EMR) dataset comprising patient-specific EMR data of a plurality of patients may be received, but only a portion of the EMR dataset may comprise EMR data of patients that are known to have a positive AD diagnosis and are thus labeled positive. The remainder of the EMR dataset may comprise EMR data of patients that are undiagnosed (e.g., are yet to be diagnosed for an AD outcome of whether such patients have a positive or negative indication of AD), and are thus unlabeled data. Each patient-specific EMR of the EMR dataset can document what diseases a patient may have but may not necessarily document what diseases the patient does not have. Undiagnosed diseases may thus be undocumented in the EMR due to providers not entering diagnostic codes or not being aware of patient disease conditions. Therefore, in the EMR, there may be labeled instances representing patients with documented disease and unlabeled instances representing patients who may be cases or controls.
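
The positive-unlabeled organization of an EMR dataset described above can be sketched as a simple data structure. This is an illustrative assumption rather than the disclosed data model: the `PatientRecord` class, its field names, and the helper functions are hypothetical. The key point it shows is that the observed indicator distinguishes labeled positives from unlabeled patients, and that an unlabeled patient is not a confirmed negative.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified patient-specific EMR record."""
    patient_id: str
    group: str                                   # demographic group
    phecodes: set = field(default_factory=set)   # documented diagnoses
    has_ad_diagnosis: bool = False               # documented positive AD label


def split_positive_unlabeled(records):
    """Partition records into a labeled-positive subset and an unlabeled subset."""
    labeled_positive = [r for r in records if r.has_ad_diagnosis]
    unlabeled = [r for r in records if not r.has_ad_diagnosis]
    return labeled_positive, unlabeled


def observed_indicator(records):
    """s = 1 for labeled positives; s = 0 for unlabeled patients, who may be
    either undiagnosed cases or true controls."""
    return [1 if r.has_ad_diagnosis else 0 for r in records]


records = [
    PatientRecord("p1", "A", {"dementia"}, has_ad_diagnosis=True),
    PatientRecord("p2", "A", {"memory_loss"}),
    PatientRecord("p3", "B"),
]
lp_subset, unlabeled_subset = split_positive_unlabeled(records)
```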

[0065] In various embodiments, EMR datasets may be segmented and / or organized based on the patient demographic groups being represented by the respective subset of the EMR dataset (such EMR dataset subsets referred to herein as group-specific EMR datasets). For example, within an unlabeled portion of the EMR datasets, group-specific EMR datasets may be formed for each understudied demographic group. Various embodiments of the present disclosure may promote algorithmic fairness via pre-processing and post-processing bias mitigation. For example, in some embodiments, bias mitigation may be performed prior to a machine learning model training (e.g., during a pre-processing stage) by assigning positive and negative labels for a subset of unlabeled instances based on race-specific probabilistic criteria. In some embodiments, bias mitigation may be performed after the machine learning model training (e.g., at a post-processing stage) by selecting classification cutoffs that optimize the group benefit equality (GBE) value for each demographic group (e.g., racial and ethnic group).
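
The post-processing step of selecting a classification cutoff per demographic group can be sketched as below. The sketch rests on assumptions that are not stated in this passage: GBE for a group is taken to be the number of predicted positives divided by the number of reference positives (for example, labeled plus proxy-validated positives), with a value of 1 indicating parity, and the cutoff is chosen from a fixed candidate grid. The function names, record layout, and toy values are hypothetical.

```python
def gbe(pred, ref):
    """Assumed GBE definition: predicted positives / reference positives.
    Returns NaN if the group has no reference positives."""
    n_ref = sum(ref)
    return sum(pred) / n_ref if n_ref else float("nan")


def best_cutoff(probs, ref, candidates):
    """Return the candidate cutoff whose GBE is closest to 1 for one group.
    Ties are resolved in favor of the earliest candidate."""
    def loss(cutoff):
        pred = [1 if p >= cutoff else 0 for p in probs]
        return abs(gbe(pred, ref) - 1.0)
    return min(candidates, key=loss)


def cutoffs_by_group(records, candidates):
    """records: list of (group, predicted_probability, reference_label)."""
    grouped = {}
    for group, p, y in records:
        probs, refs = grouped.setdefault(group, ([], []))
        probs.append(p)
        refs.append(y)
    return {g: best_cutoff(ps, ys, candidates) for g, (ps, ys) in grouped.items()}


# Toy validation-set records for two hypothetical groups.
validation = [
    ("A", 0.9, 1), ("A", 0.6, 0), ("A", 0.4, 1), ("A", 0.2, 0),
    ("B", 0.5, 1), ("B", 0.3, 1), ("B", 0.25, 0), ("B", 0.1, 0),
]
selected = cutoffs_by_group(validation, [0.2, 0.28, 0.35, 0.5, 0.7])
```

In this toy data the two groups receive different cutoffs because their predicted-probability distributions differ; a single shared cutoff would over- or under-predict positives in at least one group.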

[0066] The present disclosure also finds various features from the EMR dataset to be significant in their predictability of an AD outcome for a patient (e.g., whether the patient has a positive or negative indication for AD). Such features may include but are not limited to: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of palpitations; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon. In some embodiments, each feature may be assigned a weight (e.g., for a trained machine learning model) indicative of how significant the feature may be to predicting the AD outcome. In some embodiments, the aforementioned list of features is arranged in order of significance (from high to low) in their ability to predict the AD outcome.

[0067] In some embodiments, trained machine learning models for AD prediction are validated at the phenotype level using diagnoses and medications that serve as proxies for AD and at the genotype level by comparing the polygenic risk scores (PRS) and Apolipoprotein E (APOE) ε4 allele count for the predictions of a holdout set. By optimizing both performance and fairness, the presently disclosed prediction models can predict AD equitably without compromising accuracy. Furthermore, the models can more broadly be used to identify undiagnosed AD, screen AD with high sensitivity, reduce racial disparities in AD diagnosis, and guide clinical decision making.

[0068] The embodiments of the present disclosure are described in greater detail below.

[0069] The following descriptions and examples illustrate embodiments of the present disclosure in detail. Although the present disclosure has been described in some detail by way of illustration and example for purposes of clarity and understanding, it will be apparent that certain changes and modifications can be practiced within the scope of the appended claims.

[0070] The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.

[0071] Although various features of the disclosure can be described in the context of a single embodiment, the features can also be provided separately or in any suitable combination. Conversely, although the present disclosure can be described herein in the context of separate embodiments for clarity, the present disclosure can also be implemented in a single embodiment. It is to be understood that the present disclosure is not limited to the particular embodiments described herein and as such can vary. Those of skill in the art will recognize that there are variations and modifications of the present disclosure, which are encompassed within its scope.

[0072] It is intended that every maximum numerical limitation given throughout this specification includes every lower numerical limitation, as if such lower numerical limitations were expressly written herein. Every minimum numerical limitation given throughout this specification will include every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification will include every narrower numerical range that falls within such broader numerical range, as if such narrower numerical ranges were all expressly written herein.

[0073] I. Definition

[0074] All terms are intended to be understood as they would be understood by a person skilled in the art. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.

[0076] Headings, e.g., I, II, III, or A, B, C etc., are presented merely for ease of reading the specification and claims. The use of headings in the specification or claims does not require the steps or elements be performed in alphabetical or numerical order or the order in which they are presented.

[0077] The singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a cell” includes one or more cells, including mixtures thereof.

[0078] “A and / or B” is used herein to include all of the following alternatives: “A”, “B”, “A or B”, and “A and B”.

[0079] The term “about” or “approximately” means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.

[0080] It is understood that aspects and embodiments of the disclosure described herein include “comprising,” “consisting of,” and “consisting essentially of” aspects and embodiments. As used herein, “comprising” is synonymous with “including”, “containing”, or “characterized by”, and is inclusive or open-ended and does not exclude additional, unrecited elements or method steps. As used herein, “consisting of” excludes any elements, steps, or ingredients not specified in the claimed composition or method. As used herein, “consisting essentially of” does not exclude materials or steps that do not materially affect the basic and novel characteristics of the claimed composition or method. Any recitation herein of the term “comprising”, particularly in a description of components of a composition or in a description of steps of a method, is understood to encompass those compositions and methods consisting essentially of and consisting of the recited components or steps.

[0081] The term “phecode” as used herein is a manually curated grouping of International Classification of Diseases (ICD) diagnosis codes that can capture a clinically meaningful phenotype. In some embodiments, a phecode can represent a consolidated diagnostic category formed by mapping multiple ICD-9 or ICD-10 codes to a single phenotype code. Through this grouping, redundant or highly correlated ICD codes can be reduced, and the dimensionality of diagnostic data can be lowered. In some embodiments, phecodes can be used to standardize disease categories, enhance interpretability of electronic medical record-derived features, and facilitate large-scale statistical or machine learning analyses involving clinical phenotypes.
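
The dimensionality reduction achieved by grouping ICD codes into phecodes can be sketched as follows. The mapping below is a toy placeholder written for illustration and is not the curated phecode map; the phenotype labels, the chosen ICD codes, and the function name are assumptions. The sketch shows several ICD-9 and ICD-10 codes collapsing into one binary feature per phenotype.

```python
# Toy mapping for illustration only; the real phecode map is a curated
# resource with its own identifiers and many more codes.
ICD_TO_PHENOTYPE = {
    "G30.9": "alzheimers_disease",   # ICD-10 style code
    "331.0": "alzheimers_disease",   # ICD-9 style code
    "R41.3": "memory_loss",
    "780.93": "memory_loss",
}


def phenotype_features(icd_codes, mapping, vocabulary):
    """Collapse a patient's ICD codes into binary phenotype indicators.

    Codes missing from the mapping are ignored; several ICD codes that map
    to the same phenotype produce a single feature.
    """
    present = {mapping[code] for code in icd_codes if code in mapping}
    return [1 if phenotype in present else 0 for phenotype in vocabulary]


vocabulary = ["alzheimers_disease", "memory_loss"]
features = phenotype_features(["G30.9", "331.0", "Z00.00"], ICD_TO_PHENOTYPE, vocabulary)
```

Here three recorded ICD codes reduce to a two-element feature vector, with the two codes for the same phenotype merged into one indicator.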

[0082] It is appreciated that certain features of the disclosure, which are, for clarity, described in the context of separate embodiments, can also be provided in combination in a single embodiment. Conversely, various features of the disclosure, which are, for brevity, described in the context of a single embodiment, can also be provided separately or in any suitable sub-combination. All combinations of the embodiments pertaining to the disclosure are specifically embraced by the present disclosure and are disclosed herein just as if each and every combination was individually and explicitly disclosed. In addition, all sub-combinations of the various embodiments and elements thereof are also specifically embraced by the present disclosure and are disclosed herein just as if each and every such sub-combination was individually and explicitly disclosed herein.

II. Overview

[0083] Alzheimer’s Disease (AD) is the most common neurodegenerative disease and the sixth leading cause of death in the United States, affecting 1 in 9 (10.7%) Americans aged 65 and older. Currently, more than 6 million Americans are living with AD, and 1 in 3 seniors dies with AD or another dementia. Combined, the total cost of treatment for individuals with AD and other dementias was $345 billion in 2023 and is projected to increase to over $1 trillion by 2050. Skaria, Am. J. Manag. Care 28(10 Suppl):S188-S196 (2022).

[0084] Coupled with the substantial health and economic burden that AD poses, early AD diagnosis can be important for patients to implement lifestyle changes, plan for the future, and receive optimal treatment. Previous studies have shown a discrepancy between the prevalence of AD in large longitudinal cohort studies and real-world, community settings. Amjad et al., J. Gen. Intern. Med. 33(7):1131-1138 (2019); Bradford et al., Alzheimer Dis. Assoc. Disord. 23(4):306-314 (2009); Taylor et al., J. Alzheimers Dis. JAD 17(4):807-815 (2009). While large cohort studies use in-person assessments of dementia (i.e., the gold standard), real-world settings have relied on Medicare claims data to estimate AD prevalence. Compared to gold standard diagnoses, the sensitivity of Medicare claims can be only 50-65%, highlighting significant underdiagnosis of AD in real-world settings. In understudied populations, AD underdiagnosis can be further exacerbated. Gianattasio et al., Alzheimers Dement. Transl. Res. Clin. Interv. 5:891-898 (2019); Lin et al., Med. Care 59(8):679-686 (2021); Lim et al., J. Alzheimers Dis. JAD 77(2):523-537 (2020). The estimated AD prevalence based on projections from longitudinal studies is 10% among non-Hispanic whites (NH-whites), 14% among Hispanic Latinos (HLs), 18.6% among non-Hispanic African Americans (NH-AfAms), and 7.4% among East Asians (EA). Rajan et al., Alzheimers Dement. J. Alzheimers Assoc. 17(12):1966-1975 (2021); Zhu et al., Alzheimers Dement. 20:4315-4330 (2024). Despite being almost two times more likely to have AD than NH-whites, NH-AfAms are only 34% more likely to have a diagnosis in Medicare claims data. Likewise, despite being approximately 1.5 times more likely to have AD than NH-whites, HLs are only 18% more likely to be diagnosed. Rajan et al., Alzheimers Dement. J. Alzheimers Assoc. 17(12):1966-1975 (2021); Matthews et al., Alzheimers Dement. J. Alzheimers Assoc. 15(1):17-24 (2019). Studies have indicated that Asian Americans, including EA, face higher risk of under-detection and delayed diagnosis of cognitive impairment due to lower awareness of AD risk factors and cultural stigma. These findings emphasize the need for not only more sensitive, but also more equitable AD diagnosis.
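
The gap between disease prevalence and diagnosis can be made concrete with a short calculation using only the figures quoted above. This is a rough illustration, not an analysis from the cited studies: the function names are hypothetical, and the "shortfall" quantity is an assumed summary measure introduced here to compare how far the relative diagnosis likelihood falls below the relative prevalence.

```python
# Estimated AD prevalence (percent) as quoted in the text.
PREVALENCE = {"NH-white": 10.0, "HL": 14.0, "NH-AfAm": 18.6, "EA": 7.4}

# Relative likelihood of a Medicare claims diagnosis vs. NH-whites,
# from "34% more likely" and "18% more likely" in the text.
DIAGNOSIS_RATIO = {"NH-AfAm": 1.34, "HL": 1.18}


def prevalence_ratio(group, reference="NH-white"):
    """Prevalence in a group relative to the reference group."""
    return PREVALENCE[group] / PREVALENCE[reference]


def diagnosis_shortfall(group):
    """Assumed illustrative measure: fraction by which the diagnosis ratio
    falls short of the prevalence ratio (0 would mean no shortfall)."""
    return 1.0 - DIAGNOSIS_RATIO[group] / prevalence_ratio(group)


for g in ("NH-AfAm", "HL"):
    print(g, round(prevalence_ratio(g), 2), round(diagnosis_shortfall(g), 2))
```

The prevalence ratios computed this way (about 1.86 for NH-AfAms and 1.4 for HLs) are in line with the "almost two times" and "approximately 1.5 times" statements, while the diagnosis ratios of 1.34 and 1.18 are noticeably lower.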

[0085] Recent efforts to improve AD diagnosis include numerous studies that use data-driven methods to predict the onset of AD. Several studies have leveraged electronic medical records (EMR) data for building machine learning models. Barnes et al., J. Am. Geriatr. Soc. 68(1):103-111 (2020); Li et al., Alzheimers Dement. 19(8):3506-3518 (2023); Shao et al., BMC Med. Inform. Decis. Mak. 19:128 (2019); Tang et al., Nat. Aging 4:379-395 (2024). For example, eRADAR has been developed and validated using clinical features from EMR data linked to participants in the Adult Changes in Thought study. Barnes et al., J. Am. Geriatr. Soc. 68(1):103-111 (2020). Additional studies have incorporated behavioral and other clinical risk factors and evaluated non-linear supervised learning models, including Gradient-Boosted Trees. Akter et al., J. Prev. Alzheimers Dis. 12(7):100169 (2025). These prior studies demonstrate the advantage of large EMR sample size for training prediction models. Only some prior studies have addressed the limited sample size of understudied populations and exploited the full range of diagnoses as features rather than only those selected by experts. Even fewer studies have focused on algorithmic fairness for preventing disparities in dementia diagnosis accuracy in the EMR with respect to race and ethnicity. These fairness-oriented studies employed a supervised learning framework, which can require expensive labels and can be subject to label bias. Further, these studies do not focus on predicting undiagnosed AD and rely primarily on knowledge-driven rather than data-driven feature engineering, limiting the scope of predictive features. In addition, some studies did not include popular machine learning fairness metrics, while others evaluated fairness across algorithms without implementing bias mitigation. These gaps indicate that existing fairness studies in AD in the EMR have yet to simultaneously incorporate unlabeled data, data-driven feature learning, and bias mitigation.

[0086] Semi-supervised learning (SSL) is a class of machine learning algorithms that learns from a limited set of labels and unlabeled data. A number of studies have utilized SSL to diagnose AD, as SSL can overcome the expense of manual labeling via domain experts. Protected groups can include individuals legally protected against discrimination, such as those categorized by sex, race and ethnicity, and religion. Pre-existing labels can be subject to bias (i.e., label bias), as labels can be differentially ascertained across these groups. In the case of race and ethnicity, systemic racism, practitioner bias, and lack of healthcare access in underserved communities can contribute to lower prevalence of labels for patients in underrepresented populations. SSL can alleviate label bias by leveraging unlabeled data, which is inexpensive and more abundantly available for understudied populations. Positive unlabeled learning (PUL) is a special case of SSL that learns from labeled positive and unlabeled data. EMR data fits well in the PUL framework because the EMR documents what diseases patients have but not what they do not have. Undiagnosed diseases can be undocumented in the EMR due to providers not entering diagnostic codes or not being aware of patient disease conditions. Therefore, in the EMR, there can be labeled patients with documented disease and unlabeled patients who can be cases or controls. PUL has been successfully applied to a variety of disease diagnosis prediction tasks, including AD diagnosis using imaging data. However, PUL has not been implemented in the context of AD diagnosis using real-world EMR data, which can be more practical for large-scale prediction than neuroimaging data because imaging data is costly and acquired primarily from symptomatic or high-risk patients. 
Moreover, studies that have utilized PUL for AD diagnosis do not stratify analyses by protected group or incorporate bias mitigation approaches for reducing outcome inequities among protected groups. Consequently, these models can be subject to bias and perform more poorly in protected groups.
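As a minimal illustration of why EMR data fits the PUL setting, the following Python sketch derives positive-unlabeled labels from documented diagnosis codes. The record layout, the function name `to_pu_labels`, and the code strings are hypothetical and are not taken from the disclosure. The point it shows is that a patient without the target code is treated as unlabeled rather than as a confirmed control.

```python
# Hypothetical sketch: deriving positive-unlabeled (PU) labels from EMR codes.
POSITIVE = 1      # disease is documented in the record
UNLABELED = None  # status unknown: the patient may be a case or a control


def to_pu_labels(records, target_code):
    """Map each patient's set of documented codes to a PU label."""
    labels = {}
    for patient_id, codes in records.items():
        # Absence of the code is NOT evidence of absence of the disease.
        labels[patient_id] = POSITIVE if target_code in codes else UNLABELED
    return labels


# Illustrative records; the code names are placeholders, not real code sets.
records = {
    "p1": {"AD", "memory_loss"},
    "p2": {"memory_loss"},   # possibly undiagnosed, so not a confirmed control
    "p3": {"astigmatism"},
}
pu_labels = to_pu_labels(records, "AD")
```

A supervised baseline would instead map every unlabeled patient to a negative label, which is the source of the label bias discussed above.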

[0087] In some embodiments, the present disclosure aims to identify patients with undiagnosed AD from NH-white and understudied populations, including NH-AfAm, HL, and EA. A semi-supervised PUL (SSPUL) framework is presently developed that leverages the full range of diagnoses via data-driven feature selection and a large number of patients from understudied populations with unknown AD status from EMRs, such as, but not limited to, UCLA Health EMR. To promote algorithmic fairness, pre-processing racial bias mitigation can be employed by assigning positive and negative labels for a subset of unlabeled patients based on race-specific probabilistic criteria, along with post-processing racial bias mitigation by selecting classification cutoffs that optimize the group benefit equality (GBE) for each racial and ethnic group. Important EMR features, including those shared and distinct among labeled and unlabeled AD patients, can be highlighted. In the model evaluations, rigorous validation can be performed at the phenotype level using diagnoses and medications that serve as proxies for AD, and at the genotype level by comparing polygenic risk scores (PRSs) and Apolipoprotein E (APOE) e4 allele count for predictions of a holdout set. Robustness to proxy label distribution shifts can also be demonstrated. By optimizing both performance and fairness, the model of the present disclosure can predict AD equitably and outperform baseline supervised models across multiple discrimination performance and fairness metrics. The present disclosure provides the first implementation that bridges PUL with racial and ethnic bias mitigation in the context of undiagnosed AD prediction in the EMR and has implications for improving identification of undiagnosed AD, reducing racial disparities in AD diagnosis, and guiding clinical decision making.

III. Systems and Methods of the Disclosure

[0088] Provided herein are systems and methods of identifying undiagnosed Alzheimer’s Disease using semi-supervised learning applied to electronic medical record data.

[0089] Accordingly, provided herein is a computer-implemented method for improving disease state prediction models from unlabeled electronic medical records (EMRs), the method comprising: (a) receiving, by a processor, an initial electronic medical record (EMR) dataset of patient-specific EMR data for a plurality of patients, wherein the initial EMR dataset comprises a first labeled EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a remainder population of the plurality of patients excluding the first population, wherein the first labeled EMR subset indicates that the first population has confirmed positive indications for a disease state, and wherein the unlabeled EMR subset indicates that the remainder population has unconfirmed indications for the disease state; (b) inputting the unlabeled EMR subset to a trained machine learning model to predict disease state outcomes for a second population of the plurality of patients, wherein the trained machine learning model is configured to predict the disease state outcomes based on the initial EMR dataset, wherein each disease state outcome of the disease state outcomes is a reliable negative indication for the disease state or a confirmed positive indication for the disease state, and wherein the second population corresponds to a subset of the remainder population; and (c) generating, based on the predicted disease state outcomes that are reliable negative indications, an updated EMR dataset comprising: (i) the first labeled EMR subset for the first population having the confirmed positive indication for the disease state, and (ii) a second labeled EMR subset for the second population having the reliable negative indication for the disease state.

[0090] Also provided herein is a computer-implemented method for improving fairness of disease state prediction models from unlabeled EMR data of underrepresented demographic groups, the method comprising: (a) receiving, by a processor, a labeled EMR dataset of patient-specific EMR data for a first population and a second population of a plurality of patients, wherein the labeled EMR dataset labels the first population as having a confirmed positive indication for a disease state and labels the second population as having a reliable negative indication for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) receiving, by the processor, a plurality of unlabeled group-specific EMR datasets of patient-specific EMR data for a remainder population of the plurality of patients that excludes the first population and second population, wherein the remainder population has unconfirmed indications for the disease state, and wherein the plurality of unlabeled group-specific EMR datasets is associated with the plurality of demographic groups; (c) inputting each unlabeled group-specific EMR dataset of the plurality of unlabeled group-specific EMR datasets to a trained machine learning model to predict a disease state outcome for each patient in each demographic group of the plurality of demographic groups, wherein the trained machine learning model is configured to predict a disease state outcome based on input data, wherein the disease state outcome is a positive indication for the disease state or a negative indication for the disease state, and wherein the each unlabeled group-specific EMR dataset is of the each demographic group; and (d) generating, based on the labeled EMR dataset and the predicted disease state outcome, an updated EMR dataset labeling the first population, the second population, and the remainder population.
In some embodiments, the updated EMR dataset comprises: a first EMR subset for the first population, a second EMR subset for the second population, a third EMR subset for a third population having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population having an additional negative indication for the disease state. In some embodiments, the inputting further comprises: mitigating a bias associated with the each demographic group by adjusting the each unlabeled group-specific EMR dataset of the each demographic group.
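The four EMR subsets of the updated dataset can be pictured with a small Python sketch. The `Indication` enum, the function names, and the patient identifiers below are hypothetical illustrations, and the boolean predictions stand in for the output of the trained machine learning model; this is one possible data layout under those assumptions, not the disclosed implementation.

```python
from enum import Enum


class Indication(Enum):
    CONFIRMED_POSITIVE = "confirmed positive"    # first population
    RELIABLE_NEGATIVE = "reliable negative"      # second population
    ADDITIONAL_POSITIVE = "additional positive"  # third population
    ADDITIONAL_NEGATIVE = "additional negative"  # fourth population


def build_updated_dataset(labeled, predictions):
    """Combine existing labels with predictions for the remainder population.

    labeled: {patient_id: Indication} for the first and second populations.
    predictions: {patient_id: bool}, True meaning a predicted positive.
    """
    updated = dict(labeled)
    for patient_id, is_positive in predictions.items():
        if patient_id in updated:
            continue  # existing labels are never overwritten by predictions
        updated[patient_id] = (
            Indication.ADDITIONAL_POSITIVE if is_positive
            else Indication.ADDITIONAL_NEGATIVE
        )
    return updated


def subset(updated, indication):
    """Return the sorted patient ids carrying a given indication."""
    return sorted(pid for pid, ind in updated.items() if ind is indication)


labeled = {"p1": Indication.CONFIRMED_POSITIVE, "p2": Indication.RELIABLE_NEGATIVE}
predictions = {"p3": True, "p4": False, "p1": False}
updated = build_updated_dataset(labeled, predictions)
```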

[0091] Further provided herein is a computer-implemented method for disease state prediction with improved fairness and bias mitigation using EMR data, the method comprising: (a) receiving, by a processor, a plurality of EMR datasets of patient-specific EMR data for a plurality of patients having a disease state outcome that is known, wherein the disease state outcome is a positive indication for a disease state or a negative indication for the disease state, wherein the plurality of patients are categorized into a plurality of demographic groups, and wherein each EMR dataset of the plurality of EMR datasets is associated with each patient of the plurality of patients; (b) mitigating a bias in a trained machine learning model for each demographic group of the plurality of demographic groups, wherein the trained machine learning model is configured to predict the disease state outcome based on input data; and (c) predicting the disease state outcome for the each patient by inputting the each EMR dataset to the trained machine learning model. In some embodiments, the EMR dataset comprises: a first EMR subset for a first population of the plurality of patients having a confirmed positive indication for the disease state, a second EMR subset for a second population of the plurality of patients having a reliable negative indication for the disease state, a third EMR subset for a third population of the plurality of patients having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population of the plurality of patients having an additional negative indication for the disease state. In some embodiments, the method further comprises: validating the trained machine learning model using the first EMR subset; and adjusting the second EMR subset, the third EMR subset, and the fourth EMR subset.
In some embodiments, the mitigating the bias comprises: receiving, for the each demographic group, a group benefit equality (GBE) value; and tuning the trained machine learning model based on the GBE value for the each demographic group.

[0092] Further provided herein is a computer-implemented method for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data, the method comprising: (a) receiving, by a processor, a first EMR dataset of patient-specific EMR data for a plurality of patients, wherein the first EMR dataset comprises a first EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a first remainder population of the plurality of patients excluding the first population, wherein the first EMR subset indicates that the first population has confirmed positive indications for a disease state, wherein the unlabeled EMR subset indicates that the first remainder population has a first set of unconfirmed indications for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) inputting the unlabeled EMR subset into a first trained machine learning model to predict disease state outcomes for a second population of the plurality of patients, wherein the second population is a subset of the first remainder population, and wherein the disease state outcomes are reliable negative indications for the disease state; (c) receiving, by the processor, a second EMR dataset comprising the first EMR subset and a second EMR subset for the second population, and a plurality of unlabeled group-specific EMR datasets of patient-specific EMR data for a second remainder population of the plurality of patients, wherein the second remainder population excludes the first population and the second population and has a second set of unconfirmed indications for the disease state, wherein the second EMR subset indicates that the second population has the reliable negative indications for the disease state, and wherein each unlabeled group-specific EMR dataset of the plurality of unlabeled group-specific EMR datasets is associated with each demographic group of the plurality of demographic groups; (d) inputting the each unlabeled group-specific EMR dataset to a second trained machine learning model to predict the disease state outcomes for the each demographic group, wherein the each demographic group belongs to the second remainder population; (e) generating, based on the predicted disease state outcomes for the second remainder population, a third EMR subset for a third population of the plurality of patients, and a fourth EMR subset for a fourth population of the plurality of patients; and (f) predicting the disease state outcome with improved fairness by inputting a third EMR dataset into a third trained machine learning model, wherein the third EMR dataset comprises the first EMR subset, the second EMR subset, the third EMR subset, and the fourth EMR subset, wherein the first EMR subset indicates that the first population has the confirmed positive indications for the disease state, wherein the second EMR subset indicates that the second population has the reliable negative indications for the disease state, wherein the third EMR subset indicates that the third population has an additional positive indication for the disease state, and wherein the fourth EMR subset indicates that the fourth population has an additional negative indication for the disease state; wherein the first trained machine learning model, the second trained machine learning model, and the third trained machine learning model are configured to predict the disease state outcomes based on input data, wherein the disease state outcomes are the confirmed positive indications for the disease state or the reliable negative indications for the disease state, and wherein training of the third trained machine learning model comprises mitigating a bias using a group benefit equality (GBE) value for the each demographic group.
In some embodiments, the method further comprises receiving, by the processor, EMR data associated with a patient in the each demographic group, wherein the patient has an unconfirmed indication of the disease state; and predicting a disease state outcome for the patient by inputting the EMR data associated with the patient into the third trained machine learning model. In some embodiments, the method further comprises validating the third trained machine learning model using the first EMR subset; and adjusting the second EMR subset, the third EMR subset, and the fourth EMR subset. In some embodiments, the mitigating the bias comprises receiving, by the processor, the GBE value for the each demographic group; and tuning the third trained machine learning model based on the GBE value for the each demographic group.

[0093] Further provided herein is a computer-implemented method for predicting a disease state of a patient from EMR data, the method comprising: (a) receiving, by a processor, an initial EMR dataset of patient-specific EMR data for a plurality of patients, wherein the initial EMR dataset comprises a first labeled EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a remainder population of the plurality of patients excluding the first population, wherein the first labeled EMR subset indicates that the first population has confirmed positive indications for the disease state, wherein the unlabeled EMR subset indicates that the remainder population has unconfirmed indications for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) identifying, based on probabilistic gap values, a first subset of the unlabeled EMR subset as reliable negative indications for the disease state, wherein a probabilistic gap value of the probabilistic gap values is generated by the processor for each patient in the unlabeled EMR subset based on a first trained machine learning model, and wherein the reliable negative indications correspond to the probabilistic gap values that are below a threshold derived from the probabilistic gap values of the first labeled EMR subset; (c) assigning, by the processor, additional positive indications and additional negative indications to a second subset of the unlabeled EMR subset using race-specific probabilistic criteria based on the probabilistic gap values; (d) classifying, by the processor, the unlabeled EMR subset to predict the disease state for the plurality of patients using a second trained machine learning model, wherein the second trained machine learning model is trained using an expanded labeled EMR dataset, and wherein the expanded labeled EMR dataset comprises the first labeled EMR subset, the first subset having the reliable negative indications, and the second subset having the additional positive indications and the additional negative indications; (e) optimizing, by the processor, a decision threshold for each demographic group of the plurality of demographic groups by selecting a group-specific threshold that maximizes group benefit equality (GBE) for the disease state; and (f) predicting the disease state for the patient by inputting an EMR dataset of the patient into the second trained machine learning model. In some embodiments, the first trained machine learning model is a trained probabilistic model, and the second trained machine learning model is a semi-supervised machine learning model. In some embodiments, the predicting comprises applying a demographic-group-specific optimized threshold.

[0094] In any of the methods of the present disclosure, in some embodiments, the disease state comprises a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition. In any of the methods of the present disclosure, in some embodiments, the disease state can be Alzheimer’s Disease.

[0095] In any of the methods of the present disclosure, in some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of an indication of dementia; an indication of memory loss; an indication of chronic pain; an indication of cerebral degeneration; a secondary malignancy of lymph nodes; an indication of osteoarthrosis; an indication of inflammatory and toxic neuropathy; an indication of a disturbance of skin sensation; an indication of septicemia; an indication of vascular dementia; an indication of astigmatism; an indication of acute posthemorrhagic anemia; an indication of melanomas of skin; an indication of a respiratory failure; an indication of a secondary malignancy of bone; an indication of an antineoplastic and / or an immunosuppressive drug causing an adverse effect; an indication of morbid obesity; an indication of a voice disturbance; an indication of alopecia; or an indication of an acute pain.

[0096] In any of the methods of the present disclosure, in some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of an indication of dementia; an indication of memory loss; an indication of an altered mental status; an indication of a neurological disorder; an indication of an abnormality of gait; an indication of a mild cognitive impairment; an indication of a vascular dementia; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of a delirium due to conditions classified elsewhere; an indication of a major depressive disorder; an indication of a cerebral degeneration; an indication of a cerebral ischemia; an indication of an encephalopathy; an indication of a developmental delay and / or disorder; an indication of a urinary tract infection; an indication of a urinary incontinence; an indication of a transient alteration of awareness; an indication of decubitus ulcer; an indication of aphasia / speech disturbance; or an indication of malaise and fatigue.

[0097] In any of the methods of the present disclosure, in some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon.

[0098] In any of the methods of the present disclosure, in some embodiments, the patient-specific EMR data can comprise a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon.

[0099] In any of the methods of the present disclosure, in some embodiments, the patient-specific EMR data comprises a value for each of a plurality of parameters comprising one or more of an indication of dementia; an indication of a neurological disorder; an indication of delirium; an indication of vascular dementia; an indication of a mild cognitive impairment; an indication of encephalopathy; an indication of a delusional disorder; an indication of an episodic mood disorder; an indication of an anxiety disorder; an indication of altered mental status; an indication of memory loss; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of adjustment reaction; screening for malignant neoplasms; an indication of decubitus ulcer; an indication of palpitations; an indication of other immunological findings; an indication of aphasia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; an indication of benign neoplasm of colon; an age of the patient; a number of encounters; a number of diagnoses; a record length; or a record density.

[0100] Also provided herein are systems comprising a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform any of the methods described herein.

A. Machine Learning of the Present Disclosure

[0101] SSPUL of the present disclosure is developed for equitably and accurately identifying undiagnosed AD given a limited number of positive labels. The innovation of the present disclosure lies in coupling a PUL framework with pre- and post-processing bias mitigation approaches via race-specific probabilistic criteria and optimal GBE thresholding. In some embodiments, limitations of label bias and the expense of supervised learning employed in previous studies on fairness in AD can be addressed. In some embodiments, limitations of previous SSL studies that rely solely on neuroimaging data, which can be less practical for large-scale prediction tasks, can be addressed. In some embodiments, racial group fairness can be improved, and rigorous validation for the results can be provided using proxy AD ICD codes, AD medications, and PRSs. In some embodiments, to promote interpretability of the model and results, thorough explanations via feature, sensitivity, and SHAP analyses can be provided. In some embodiments, robustness of the model to proxy label distribution shifts can additionally be demonstrated.

[0102] As detailed in Examples 2-9 infra, SSPUL (GBE or MCC) of the present disclosure can outperform baseline models with respect to multiple discrimination and calibration performance metrics, most notably sensitivity and AUCPR across NH-white, NH-AfAm, HL, and EA groups. Despite high discrimination and calibration performances, SSPUL of the present disclosure does not suffer with respect to fairness, achieving the lowest cumulative parity loss compared to baseline models. In addition, mean sensitivity and precision can remain stable across a wide range of proxy label definitions, highlighting SSPUL’s robustness. In some embodiments, in the feature analysis, top predictive features related to neurological / mental disorders (e.g., memory loss) and non-neurological disorders (e.g., decubitus ulcer) can be identified. In some embodiments, top predictive features that are specific to undiagnosed AD patients (e.g., palpitations) can also be identified. In some embodiments, all top predictive features can contribute similarly to AD prediction across racial and ethnic groups, both in magnitude and direction. When stratifying by prediction labels and race and ethnicity, PRSs can be shown to be significantly higher for LPs and predicted positives than for predicted negatives in NH-white, HL, and EA patients, while e4 allele counts can be higher in NH-white and EA patients, providing additional support for the model’s predictions.

[0103] Among possible approaches for leveraging PUL, the present disclosure applies a four-step method (detailed in Example 1), which can be modified from the vanilla 2-step PUL framework. In some embodiments, other PUL methods such as biased learning rely on the stringent selected completely at random (SCAR) assumption, which states that labeled instances can be an i.i.d. sample of the positive distribution. In some embodiments, the data likely cannot satisfy this assumption as factors such as systemic biases and access to care can shape who receives an AD diagnosis. As a result, diagnosed patients can be a biased sample of the true positive distribution. In some embodiments, the framework can be shown to be superior to baseline supervised models and the vanilla 2-step PUL framework. Although supervised models that treat unlabeled instances as negative can be commonly used in practice for PUL, they are demonstrated to be inferior to SSPUL (GBE) with regard to almost all discrimination and calibration performance metrics across all races and ethnicities. In some embodiments, the sensitivity / precision tradeoff for SSPUL (GBE) can be significantly more favorable than that of Supervised (full / MCC), especially in HL, where SSPUL (GBE) can have on average about 25% higher sensitivity without loss in precision. Additionally, SSPUL (GBE) is shown to outperform vanilla 2-step PUL using race-specific GBE cutoffs, achieving higher precision and balanced accuracy across all racial and ethnic groups, and higher sensitivity among NH-white, NH-AfAm, and HL. The results shown in Examples 2-9 highlight the effectiveness of the pseudo-labeling strategy in the third step of the framework.
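The effect of violating the selected completely at random assumption can be made concrete with a small arithmetic sketch in Python. The group names, counts, and diagnosis rates are invented for illustration and do not come from the disclosure. The sketch compares each group's share among the true positives with its share among the labeled positives when the probability of receiving a diagnosis differs by group.

```python
def labeled_share(true_positives, diagnosis_rate):
    """Share of each group among labeled positives.

    true_positives: {group: number of true cases}
    diagnosis_rate: {group: probability that a true case is diagnosed}
    """
    labeled = {g: n * diagnosis_rate[g] for g, n in true_positives.items()}
    total = sum(labeled.values())
    return {g: count / total for g, count in labeled.items()}


# Hypothetical numbers: equal true burden, unequal access to diagnosis.
true_positives = {"group_a": 100, "group_b": 100}

# Equal rates: labeled positives mirror the true positive distribution.
scar_share = labeled_share(true_positives, {"group_a": 0.5, "group_b": 0.5})

# Unequal rates: group_b is under-represented among labeled positives.
biased_share = labeled_share(true_positives, {"group_a": 0.8, "group_b": 0.4})
```

With equal diagnosis rates both groups contribute half of the labeled positives, whereas with the unequal rates the second group contributes only one third, even though its true burden is identical.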

[0104] In some embodiments, group fairness adopts various definitions. In some embodiments, group fairness can be defined based on parity of performance metrics across races and ethnicities as it enables clear interpretation of results and is widely used in the literature. Due to differences in the prevalence of AD among NH-white, NH-AfAm, HL, and EA groups in the EMR and population, demographic parity, though popular, cannot be deemed appropriate for comparing fairness as it unrealistically favors an equal positive selection rate among all races and ethnicities. Instead, GBE is included as a fairness metric and optimization target for cutoff selection to ensure that the prevalence of each race and ethnicity can be reached. GBE can be related to the concept of the class prior (the proportion of positive data) in PUL as the latter must be estimated to optimize the former. In the present disclosure, AD diagnosis and medication proxies can be used to estimate the class prior. In some embodiments, by optimizing the GBE for each race and ethnicity, SSPUL (GBE) not only achieves the lowest cumulative parity loss relative to baseline models, but also the best balance between sensitivity and precision. In some embodiments, similar patterns are observed when comparing the MCC cutoff to the GBE cutoffs across models. In some embodiments, while GBE optimization yields an ideal GBE of 1 for most splits, GBE variability can exist across models. This variability can be driven by differences in the GBE between the validation and test sets across splits. The relatively high GBE variability for Supervised (risk factors / MCC) can stem from the model’s limited feature set, resulting in many positives and negatives having similar predicted probabilities and leading to an unstable MCC cutoff across splits.
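A group-specific cutoff search of this kind might look like the following Python sketch. It assumes that GBE is the ratio of the number of predicted positives to the number of (proxy) positives within a group, which is consistent with an ideal value of 1 but is not defined in this section. The function names, candidate cutoffs, and validation data are hypothetical.

```python
def gbe(scores, labels, cutoff):
    """Assumed GBE: predicted positives divided by (proxy) positives."""
    actual = sum(labels)
    if actual == 0:
        raise ValueError("GBE is undefined for a group with no positives")
    predicted = sum(1 for s in scores if s >= cutoff)
    return predicted / actual


def select_group_cutoffs(validation, candidates):
    """Pick, per group, the candidate cutoff whose GBE is closest to 1.

    validation: {group: (scores, labels)} from a validation split.
    """
    cutoffs = {}
    for group, (scores, labels) in validation.items():
        cutoffs[group] = min(
            candidates, key=lambda c: abs(gbe(scores, labels, c) - 1.0)
        )
    return cutoffs


# Illustrative validation data for two hypothetical groups.
validation = {
    "G1": ([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 0, 0]),
    "G2": ([0.6, 0.3, 0.2, 0.1], [1, 1, 0, 0]),
}
candidates = [0.1, 0.25, 0.35, 0.5, 0.85]
cutoffs = select_group_cutoffs(validation, candidates)
```

In this toy data the second group's scores are systematically lower, so a single shared cutoff of 0.5 would predict only half of its expected positives, while a lower group-specific cutoff restores a GBE of 1.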

[0105] In addition to GBE optimization, the superior fairness of the model can be attributed to the use of race-specific population prevalence and PLP and PRN to inform the assignment of pseudo-labels in the semi-supervised framework. This approach can produce pseudo-labels that have low false discovery and false omission rates across all racial and ethnic groups and can increase the representation of unprivileged races and ethnicities in the training set. In some embodiments, the positive impact of these contributions is reflected in the sensitivity analysis of self-reported race and ethnicity features, which shows minimal changes to the sensitivity of each race and ethnicity after recoding it as another race and ethnicity while keeping all other features fixed.
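The idea of group-specific pseudo-label assignment can be sketched in Python as follows. The exact probabilistic criteria are not given in this section, so the sketch simply assumes a pair of per-group probability bounds; those bounds, the function name, and the example values are hypothetical stand-ins for criteria that would be derived from quantities such as group-specific prevalence.

```python
def assign_pseudo_labels(scores, groups, criteria):
    """Assign pseudo-labels using bounds specific to each patient's group.

    scores: {patient_id: predicted probability of disease}
    groups: {patient_id: demographic group}
    criteria: {group: (negative_max, positive_min)}, hypothetical bounds
    Patients falling between the two bounds stay unlabeled.
    """
    pseudo = {}
    for patient_id, p in scores.items():
        negative_max, positive_min = criteria[groups[patient_id]]
        if p >= positive_min:
            pseudo[patient_id] = 1   # additional positive
        elif p <= negative_max:
            pseudo[patient_id] = 0   # additional negative
    return pseudo


scores = {"a": 0.8, "b": 0.8, "c": 0.1, "d": 0.5}
groups = {"a": "G1", "b": "G2", "c": "G1", "d": "G2"}
criteria = {"G1": (0.2, 0.9), "G2": (0.3, 0.7)}  # illustrative values only
pseudo_labels = assign_pseudo_labels(scores, groups, criteria)
```

Patients "a" and "b" have the same score, yet only "b" receives a positive pseudo-label because its group has a lower positive bound in this toy configuration.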

[0106] Overall, the top 20 predictive features can contribute similarly to AD prediction across all racial and ethnic groups, suggesting that SSPUL can learn consistent and generalizable patterns. As detailed in Examples 2-9, 18 of the features relate to mental or neurological disorders, healthcare utilization, or age. Among the remaining features, screening for malignant neoplasms shows a negative association with AD, consistent with prior studies reporting an inverse relationship between AD and cancer. Decubitus ulcers have been linked to AD, likely through increased immobility from cognitive and functional decline. More locally, predictive features with low prevalence among diagnosed AD patients but higher prevalence in a TP subset distinct from LPs are identified. These include palpitations (427.9), other immunological findings (279.7), and screening for malignant neoplasms. Palpitations can be a risk factor for atrial fibrillation, which has been associated with AD cognitive decline. Other immunological findings include abnormal immunological finding in serum (R76.9), other specified abnormal immunological findings in serum (R76.8), and raised antibody titer (R76.0), which can reflect evidence of immune dysregulation in AD.

[0107] In some embodiments, for training and validation, large, diverse EMR data containing rich diagnoses and genetic information derived from a data discovery repository (DDR), such as, but not limited to, UCLA DDR, can be leveraged. In some embodiments, to ensure data integrity, patients with missing demographics are excluded. In some embodiments, while imputation of missing demographics such as race and ethnicity can be performed to reduce the number of excluded patients, fairness assessments across racial and ethnic groups can be affected or biased by imputation quality. Validated methods for imputation from de-identified EMR have been developed, such as the Medicare Bayesian Improved Surname Geocoding, which uses surnames to perform imputation. However, such methods are specific to Medicare data elements, and names are not available in the de-identified data.

[0108] With the presently described systems and methods, undiagnosed AD can be successfully predicted from the EMR with high sensitivity and precision by bridging a PUL framework with pre- and post-processing bias mitigation approaches. The implications of the present disclosure are threefold. First, the presently developed model shows potential for assisting providers in identifying high-risk AD patients who can be appropriate for further clinical evaluation or screening. Second, by ensuring equitable predictions across racial and ethnic groups, the presently developed model helps remedy significant underdiagnosis in underrepresented populations, addressing long-standing disparities in AD diagnosis stemming from systemic and algorithmic biases. Lastly, by enhancing interpretability using a variety of race- and ethnicity-stratified analyses, black-box models that can be used to guide clinical decision-making are avoided.

1. Semi-Supervised Learning (SSL)

[0109] In some embodiments, semi-supervised learning (SSL) can be employed to leverage both labeled and unlabeled data when constructing predictive models for identifying undiagnosed diseases or other latent clinical outcomes. SSL can be particularly advantageous in healthcare settings, where high-quality labeled data can be scarce, costly, or inconsistently captured, while large volumes of unlabeled patient records can be available within electronic medical record (EMR) systems. In contrast to traditional supervised learning, which relies exclusively on labeled examples, SSL can utilize patterns, structures, and statistical regularities present within the unlabeled population to improve classification accuracy, generalizability, and robustness.

[0110] In some embodiments, SSL frameworks can incorporate generative models, discriminative models, graph-based models, or hybrid approaches. Generative SSL approaches can model the joint distribution of features and labels, enabling the system to infer latent class structure from the unlabeled data. Discriminative SSL approaches can refine decision boundaries by incorporating unlabeled samples through consistency regularization, pseudo-labeling, entropy minimization, or margin-based techniques. Graph-based SSL approaches can propagate labels across patient similarity networks, allowing clinical phenotypes to be inferred through manifold structure in high-dimensional EMR spaces. Hybrid approaches can combine multiple SSL strategies to improve performance across heterogeneous data modalities such as diagnoses, laboratory results, imaging features, or clinical narratives.

[0111] In some embodiments, SSL can be particularly useful for disease detection applications where positive labels are limited and negative labels cannot be reliably inferred. For example, undiagnosed disease states can be under-documented or coded inconsistently, leading to substantial uncertainty in the label space. SSL can utilize manifold assumptions or cluster consistency assumptions to assign soft or probabilistic labels to unlabeled instances, allowing the model to learn disease-associated patterns even when only a small number of confirmed cases are available. This approach can prevent models from overfitting to the limited labeled set and can produce decision boundaries that better reflect the underlying clinical population.

[0112] In some embodiments, SSL can incorporate regularization strategies that enforce smoothness, class-conditional clustering, or confidence-based constraints. Smoothness regularization can encourage the model to assign similar predictions to patients with similar clinical features. Cluster-based constraints can promote the formation of feature-based patient subgroups, within which unlabeled samples can receive consistent labels. Confidence-based constraints can require the model to issue high-confidence predictions only in regions of the feature space where sufficient unlabeled support exists, thereby preventing over-confident and spurious predictions.

[0113] In some embodiments, SSL can be used to integrate data from multiple healthcare modalities. For example, diagnoses, procedure codes, medication histories, laboratory trajectories, vital signs, social determinants of health, and genetic information can all be used as inputs. Some of these modalities can be sparsely labeled or inconsistently recorded, making SSL an appropriate choice for harmonizing heterogeneous data. SSL can leverage unlabeled text from clinical notes, unlabeled genetic variants, or unlabeled medication patterns to capture latent clinical structure that would otherwise remain unutilized in supervised frameworks.

[0114] In some embodiments, SSL can incorporate temporal learning mechanisms, particularly when applied to longitudinal EMR records that reflect multi-year patient trajectories. Temporal SSL approaches can enforce consistency in predictions across adjacent time steps or across multiple clinical encounters. Unlabeled time intervals can contribute to learning temporal patterns associated with disease onset, progression, or risk stratification. Temporal consistency objectives can help stabilize predictions in patients with irregular visit schedules, sparse documentation, or limited longitudinal depth.

[0115] In some embodiments, SSL can include uncertainty quantification mechanisms to evaluate prediction reliability across both labeled and unlabeled subsets. Techniques such as Monte Carlo dropout, deep ensembles, or Bayesian approximations can be applied to identify uncertain predictions, which can then be weighted differently during training, flagged for clinician review, or excluded from pseudo-label generation. This can reduce the propagation of noise from poorly supported regions of the feature space.
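
The ensemble-based variant of this uncertainty screen can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_by_uncertainty`, the `max_std` cutoff of 0.1, and the patient identifiers are all hypothetical, and disagreement between ensemble members (population standard deviation) is used as a simple stand-in for predictive uncertainty.

```python
from statistics import mean, pstdev

def split_by_uncertainty(ensemble_probs, max_std=0.1):
    """Separate confident from uncertain predictions.

    ensemble_probs maps a patient id to the probabilities produced by
    each ensemble member. max_std is an assumed, tunable cutoff.
    """
    confident, uncertain = {}, []
    for pid, probs in ensemble_probs.items():
        if pstdev(probs) <= max_std:
            # Members agree: keep the averaged probability.
            confident[pid] = mean(probs)
        else:
            # Members disagree: hold back from pseudo-labeling.
            uncertain.append(pid)
    return confident, uncertain

# Hypothetical predictions from a three-member ensemble.
preds = {
    "pt_a": [0.90, 0.88, 0.92],
    "pt_b": [0.20, 0.75, 0.45],
    "pt_c": [0.05, 0.07, 0.03],
}
confident, uncertain = split_by_uncertainty(preds)
```

Patients in `uncertain` could then be down-weighted, routed to clinician review, or excluded from pseudo-label generation, as the paragraph above describes.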

[0116] In some embodiments, SSL can be combined with active learning, enabling the system to request additional labels for samples that convey high informational value. Active semi-supervised systems can identify ambiguous or boundary-case patient records and prioritize them for expert chart review, improving model performance with minimal labeling effort. This hybrid approach can be useful in clinical environments where labeling resources are constrained.

[0117] In some embodiments, SSL can integrate fairness-aware learning objectives to ensure equitable performance across protected groups. For example, group-specific thresholds, parity constraints, or representation-balancing regularization terms can be applied during training. Because SSL relies heavily on unlabeled data, fairness mechanisms in SSL can mitigate biases introduced by label scarcity, misdiagnosis, or systemic inequities in healthcare access across racial, ethnic, or socioeconomic groups.

[0118] In some embodiments, SSL can be deployed as part of a real-time or near-real-time inference engine within a clinical decision support system. Streaming EMR data can be incorporated as unlabeled input to continuously refine the model, allowing for up-to-date prediction logic that adapts to evolving patterns of healthcare utilization, changes in diagnostic practices, or shifts in population demographics. SSL-enabled drift detection can identify when new unlabeled data deviate from the training distribution and can trigger model recalibration or retraining workflows.

2. Positive-Unlabeled Learning (PUL)

[0119] In some embodiments, a semi-supervised learning system can incorporate a positive-unlabeled learning (PUL) framework to address scenarios in which only a limited number of positive labels are available while the majority of data remains unlabeled. Such scenarios can occur frequently in healthcare, where diagnostic codes are inconsistently captured, delayed, or entirely absent for certain conditions. In these situations, a conventional supervised learning approach can be unreliable or biased because negative labels cannot be confidently assumed from their absence in the records. PUL frameworks can instead treat labeled positives as confirmed cases while treating the remaining population as a mixture of true positives and true negatives. This structure can allow a model to effectively learn disease-associated patterns even when ground-truth labels cannot be comprehensively obtained.

[0120] In some embodiments, PUL can be implemented using a multi-step procedure that infers reliable negative labels and pseudo-positive labels based on statistical properties of the data. For example, probabilistic scores derived from an initial classifier can be used to estimate the likelihood that an unlabeled instance belongs to the positive or negative class. Patients whose probabilistic gaps fall below an estimated threshold can be designated as reliable negatives, while patients whose probabilistic gaps exceed a race- or subgroup-specific threshold can be designated as additional positives. This iterative pseudo-labeling process can significantly expand the effective training dataset by incorporating unlabeled samples that are likely to belong to a particular class, thereby allowing the classifier to learn a more representative decision boundary.

[0121] In some embodiments, the selection of pseudo-labels can rely on race-, ethnicity-, or subgroup-specific probabilistic criteria when fairness is a desired objective. For instance, population-level disease prevalence estimates can be used to constrain how many pseudo-positive labels are assigned to each subgroup, avoiding the risk of under-labeling populations that historically experience low diagnostic rates. Similarly, reliable-negative thresholds can be tailored for each subgroup using observed distributions of probabilistic gaps among confirmed positives and inferred negatives. These strategies can reduce the propagation of historical inequities in labeling practices into the learned model.
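
One way to read the prevalence constraint is as a per-subgroup labeling budget. The sketch below assumes, purely for illustration, that the budget equals the expected number of cases (prevalence times subgroup size) minus the already-diagnosed positives, and that the highest-scoring unlabeled patients fill it. The function name, group names, scores, and prevalence values are hypothetical and are not taken from the disclosure.

```python
def assign_pseudo_positives(unlabeled, n_labeled_pos, group_size, prevalence):
    """Pick pseudo-positives per subgroup under a prevalence budget.

    unlabeled:      {group: [(patient_id, score), ...]}
    n_labeled_pos:  {group: number of diagnosed positives}
    group_size:     {group: total patients in the subgroup}
    prevalence:     {group: assumed population-level prevalence}
    """
    pseudo_pos = {}
    for group, scored in unlabeled.items():
        expected_cases = round(prevalence[group] * group_size[group])
        # Budget: cases expected but not yet diagnosed (never negative).
        budget = max(0, expected_cases - n_labeled_pos[group])
        ranked = sorted(scored, key=lambda t: t[1], reverse=True)
        pseudo_pos[group] = [pid for pid, _ in ranked[:budget]]
    return pseudo_pos

# Hypothetical subgroups with equal prevalence but unequal diagnosis rates.
unlabeled = {
    "group_a": [("a1", 0.9), ("a2", 0.4)],
    "group_b": [("b1", 0.3), ("b2", 0.8), ("b3", 0.6), ("b4", 0.1)],
}
result = assign_pseudo_positives(
    unlabeled,
    n_labeled_pos={"group_a": 3, "group_b": 1},
    group_size={"group_a": 20, "group_b": 20},
    prevalence={"group_a": 0.2, "group_b": 0.2},
)
```

In this toy case the under-diagnosed subgroup receives the larger budget, which is the behavior the paragraph above aims for.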

[0122] In some embodiments, PUL can incorporate data-driven feature selection to ensure that the model captures meaningful signals from a wide range of structured or unstructured inputs, such as, but not limited to, diagnostic codes, laboratory results, problem lists, demographic information, clinical narrative embeddings, or temporal utilization patterns. Feature selection can be applied prior to the PUL training stages or iteratively updated after pseudo-labels are assigned. This integration of feature learning with pseudo-labeling can improve both predictive performance and interpretability.

[0123] In some embodiments, PUL can operate in conjunction with iterative re-training, where each round of pseudo-labeling is followed by model refinement using an expanded labeled dataset. The probabilistic thresholds for identifying reliable negatives and probable positives can be updated at each iteration. This iterative process can continue until convergence criteria are met, such as stabilization of the pseudo-label distribution, convergence of validation metrics, or satisfaction of fairness constraints. Such iterative learning schemes can be particularly useful in medical datasets, where the unlabeled pool can be large and noisy.

[0124] In some embodiments, PUL can be integrated with ensemble learning techniques. For example, multiple models, each trained on different pseudo-label subsets, can be combined using averaging, weighted voting, or stacking approaches. Ensemble-based PUL frameworks can reduce variance introduced by noisy pseudo-labels and can further stabilize performance across demographic subgroups. In some embodiments, bootstrapping or resampling procedures can generate multiple training partitions for robust pseudo-label estimation.

[0125] In some embodiments, the PUL framework can be extended to multi-class or multi-label disease prediction tasks. Although PUL is traditionally formulated as a binary task (positive vs. unlabeled), healthcare data often involve multiple overlapping conditions. Adaptations such as class-conditional risk estimation or pairwise-coupled PUL can be employed to generalize the PUL approach to multiple outcomes. This can enable simultaneous detection of undiagnosed conditions beyond Alzheimer’s disease, including, but not limited to, chronic kidney disease, heart failure, autoimmune disorders, or psychiatric conditions.

[0126] In some embodiments, PUL can explicitly account for label noise that arises when positive labels contain some degree of misclassification. Noise-robust loss functions, uncertainty-weighted training examples, or confidence-adjusted pseudo-labels can be employed to mitigate the effect of mislabeled positives. This can be especially important in electronic health records, where diagnostic codes can occasionally be incorrectly applied or inconsistently updated.

[0127] In some embodiments, a PUL system can incorporate calibration procedures to adjust predicted probabilities so that they reflect meaningful long-run frequencies of disease in both labeled and unlabeled subsets. For example, temperature scaling, isotonic regression, or reliability diagram-guided calibration can be used to improve the interpretability of model outputs. In some embodiments, calibrated predictions can be valuable for downstream clinical decision-support tools that rely on absolute risk estimates rather than binary classifications.
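
Of the calibration options named above, temperature scaling is the simplest to illustrate: a single scalar T divides the model's logits, and T is chosen to minimize the negative log-likelihood on held-out data. The sketch below uses a plain grid search rather than a gradient-based optimizer; the grid, the validation logits, and the labels are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood after scaling logits by 1/T."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Return the grid value of T with the lowest validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 to 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))

# Hypothetical held-out logits from an over-confident classifier:
# two of the eight confident predictions are wrong.
val_logits = [4.0, 3.5, -4.0, 3.0, -3.5, -3.0, 4.5, -4.5]
val_labels = [1, 1, 0, 0, 0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
calibrated = [sigmoid(z / T) for z in val_logits]
```

A fitted T above 1 softens over-confident probabilities without changing their rank order, so any threshold-based ranking of patients is preserved.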

[0128] In some embodiments, PUL can be used in combination with post-processing threshold selection methods that optimize for fairness metrics, clinical utility, or prevalence alignment. Threshold selection can be performed separately for each racial or ethnic group, age interval, biological sex category, or other protected attribute category. This flexibility allows the disclosed systems to balance predictive performance and equity across heterogeneous patient populations.

3. Fairness Considerations

[0129] In some embodiments, semi-supervised learning (SSL) and positive-unlabeled learning (PUL) frameworks can incorporate algorithmic fairness techniques to ensure that model predictions do not disproportionately disadvantage specific racial, ethnic, or other protected groups. In many clinical contexts, the labeled data available for algorithm development can be significantly biased due to structural inequities in healthcare access, variable diagnostic practices, and provider-level variation in disease recognition. As a result, models trained solely on labeled data can unknowingly replicate or amplify existing disparities. By integrating fairness-aware mechanisms directly into the model training process, the presently disclosed systems and methods can improve equity in diagnostic prediction while maintaining high levels of accuracy.

[0130] In some embodiments, pre-processing fairness mitigation can be applied by adjusting how pseudo-labels are generated for unlabeled patients. For example, race-specific probabilistic thresholds can be used to determine which unlabeled instances are assigned additional positive or negative pseudo-labels. These thresholds can account for population-level prevalence estimates, observed label disparities, or historical differences in diagnosis rates. Through this approach, the composition of the training dataset can more accurately reflect true disease distributions across groups, thereby reducing the risk that the model will learn biased correlations tied to underdiagnosis.

[0131] In some embodiments, post-processing fairness mitigation can be used to refine the classification thresholds applied to each racial or ethnic group after the model has been trained. Group-specific thresholds can be selected to optimize fairness metrics such as, but not limited to, group benefit equality (GBE), equal opportunity, parity of sensitivity, or parity of predicted prevalence. Adjusting the classification threshold in this manner can help ensure that predicted positive rates for each group align with expected disease prevalence levels rather than with potentially biased patterns in the training data. This form of threshold optimization can be particularly valuable when disease prevalence differs substantially between subgroups or when equalizing error rates across groups is clinically meaningful.

[0132] In some embodiments, fairness considerations can also be incorporated into the model evaluation stage. Classification performance can be evaluated separately for each protected group, enabling the detection of disparities in accuracy, sensitivity, specificity, or precision. Fairness metrics such as, but not limited to, cumulative parity loss, group benefit equality, and equal opportunity can be calculated to quantify the extent to which predictions deviate from equitable performance. These metrics can provide insight into how the model performs across diverse populations and can guide further adjustments to training procedures, pseudo-labeling criteria, or threshold selection.

[0133] In some embodiments, fairness can be assessed using sensitivity analyses that examine model behavior under transformations of protected attributes. For example, the value of a patient’s race or ethnicity can be recoded while all other features remain fixed, and the resulting change in predicted probability or classification outcome can be measured. Minimal changes after such recoding can indicate that the model predictions do not depend heavily on the protected attribute, thereby suggesting reduced bias. Larger changes can signal the need for additional fairness interventions. These forms of sensitivity analysis can provide model-agnostic confirmation of fair treatment across groups.
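
The recoding analysis can be sketched in a model-agnostic way: any callable that maps a patient record to a probability can be probed. In the example below, `toy_model` is a hypothetical stand-in for a trained classifier (its coefficients are invented for illustration), and the summary statistic, the mean absolute change in predicted probability, is one reasonable choice among several.

```python
import math

def toy_model(patient):
    """Hypothetical stand-in for a trained classifier (invented weights)."""
    z = -3.0 + 0.04 * patient["age"] + 1.2 * patient["memory_loss"]
    z += {"group_a": 0.0, "group_b": 0.05}[patient["race"]]
    return 1.0 / (1.0 + math.exp(-z))

def recoding_sensitivity(model, patients, attribute, values):
    """Mean absolute change in prediction when `attribute` is recoded
    to each alternative value while all other features stay fixed."""
    shifts = []
    for p in patients:
        base = model(p)
        for v in values:
            if v != p[attribute]:
                recoded = {**p, attribute: v}  # copy with one field changed
                shifts.append(abs(model(recoded) - base))
    return sum(shifts) / len(shifts)

patients = [
    {"age": 70, "memory_loss": 1, "race": "group_a"},
    {"age": 80, "memory_loss": 0, "race": "group_b"},
]
shift = recoding_sensitivity(toy_model, patients, "race", ["group_a", "group_b"])
```

A small `shift` suggests weak dependence on the recoded attribute. What counts as "small" is a policy decision that this sketch does not settle.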

[0134] In some embodiments, fairness-enhanced SSL frameworks can also limit the propagation of label bias, which can occur when the labeled dataset does not accurately represent the true distribution of disease across groups. Because SSL can leverage large quantities of unlabeled data, it can more effectively distinguish between true signal and biased labeling practices. When fairness constraints are added, SSL can further ensure that historical inequities, such as lower diagnosis rates in underserved racial groups, do not lead to systematically lower predicted probabilities for those groups. This combination of SSL and fairness mitigation can be particularly beneficial in clinical applications involving chronic underdiagnosis.

[0135] In some embodiments, fairness considerations can additionally extend to feature importance and model interpretability. SHAP analysis, feature attribution methods, or group-stratified feature correlations can be used to evaluate whether the model relies on features in a consistent manner across populations. Identifying whether certain features disproportionately influence predictions for specific groups can help ensure that no subgroup is systematically misclassified due to spurious correlations or group-specific artifacts in the data. If disparities are identified, feature-engineering steps or regularization strategies can be applied to correct for such imbalances.

[0136] In some embodiments, fairness-aware SSL frameworks can be deployed in healthcare settings to support equitable clinical decision-making. By ensuring that the model performs consistently across racial and ethnic subgroups, the risk that algorithmic tools will unintentionally worsen existing health disparities can be reduced. As a result, fairness considerations can play an essential role in enabling the responsible use of machine learning for early detection of undiagnosed disease.

4. Application to Undiagnosed Disease Detection

[0137] In some embodiments, the presently disclosed semi-supervised learning framework can be applied to the detection of undiagnosed disease, such as Alzheimer’s Disease (AD), using electronic medical record (EMR) data. EMR systems can contain extensive longitudinal information about patient encounters, diagnoses, procedures, laboratory results, medications, and healthcare utilization patterns. However, many patients who meet clinical criteria for a disease (e.g., AD) can remain undiagnosed due to structural healthcare barriers, inconsistent documentation practices, or implicit provider bias. As a result, a significant proportion of true disease cases can be found within the unlabeled population. Semi-supervised and positive-unlabeled learning (PUL) methods can therefore be particularly effective for extracting latent diagnostic signals and identifying patients who may benefit from further evaluation.

[0138] In some embodiments, the disclosed framework can incorporate data-driven feature selection methods to leverage the complete set of available diagnostic codes rather than relying solely on expert-curated lists of disease risk factors. This approach can capture a broader spectrum of clinical indicators, including neurological, cognitive, behavioral, cardiovascular, metabolic, and immunological findings, that can be informative of disease risk. By analyzing associations between features and the labeled positive class over numerous resampled training sets, the framework can highlight enriched diagnoses that contribute to classification performance while mitigating overfitting. These enriched features can then be passed to subsequent pseudo-labeling and model-training steps to improve predictive accuracy.

[0139] In some embodiments, detection of undiagnosed disease (e.g., AD) can be facilitated through a multi-step PUL framework that identifies reliable negatives, assigns pseudo-positive and pseudo-negative labels using race-specific probabilistic criteria, and trains a final classifier on the expanded labeled dataset. The reliable-negative identification step can identify unlabeled patients whose features strongly resemble negative cases, thereby providing a foundation for stable pseudo-labeling. The pseudo-labeling step can apply race-stratified probabilistic thresholds that account for demographic variability in disease prevalence, diagnostic rates, and labeling biases. The resulting pseudo-labeled dataset can substantially increase the training signal available to the model and improve sensitivity for identifying previously unrecognized disease cases.

[0140] In some embodiments, post-processing bias mitigation can be used to adjust the classification threshold for each racial and ethnic population to optimize fairness metrics such as group benefit equality (GBE). GBE can ensure that the predicted prevalence of the disease for each group corresponds to the expected prevalence inferred from observed disease diagnoses and proxy clinical indicators. This approach can mitigate disparities that can arise when a uniform threshold is applied to population subgroups with differing disease prevalence, healthcare access, or coding patterns. When combined with pre-processing bias mitigation, the framework can produce equitable disease predictions across non-Hispanic white, non-Hispanic African American, Hispanic Latino, and East Asian patient groups.
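
The prevalence-alignment idea behind group-specific thresholds can be illustrated with a small search: for each group, pick the threshold whose predicted-positive rate is closest to that group's expected prevalence. This is a simplified sketch in the spirit of the GBE criterion described above, not a definition of the metric; the function names, candidate grid, scores, and prevalence figures are hypothetical.

```python
def pick_threshold(scores, expected_prevalence, candidates=None):
    """Threshold whose predicted-positive rate best matches the target."""
    candidates = candidates or [i / 100 for i in range(1, 100)]

    def gap(t):
        predicted_prev = sum(s >= t for s in scores) / len(scores)
        return abs(predicted_prev - expected_prevalence)

    return min(candidates, key=gap)

def group_thresholds(scores_by_group, prevalence_by_group):
    """Apply the search independently within each group."""
    return {
        g: pick_threshold(s, prevalence_by_group[g])
        for g, s in scores_by_group.items()
    }

# Hypothetical scores: group_b is scored systematically lower,
# as might happen when its members are under-diagnosed in training data.
scores_by_group = {
    "group_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95],
    "group_b": [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5],
}
thr = group_thresholds(scores_by_group, {"group_a": 0.2, "group_b": 0.2})

def predicted_rate(group):
    s = scores_by_group[group]
    return sum(x >= thr[group] for x in s) / len(s)
```

With a single uniform threshold chosen for `group_a`, no `group_b` patient would be flagged in this toy data. Separate thresholds bring both groups to the same predicted prevalence.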

[0141] In some embodiments, validation of the undiagnosed disease predictions, such as AD prediction, can be performed using proxy clinical measures such as, but not limited to, related dementia ICD codes, memory-loss diagnoses, mild cognitive impairment codes, and medications commonly prescribed for dementia. These proxy outcomes can serve as silver standards that approximate ground truth in the absence of gold-standard diagnoses. Unlabeled patients who exhibit these proxy indicators can support the plausibility of the model’s predicted positives. Conversely, predicted negatives with no such indicators can be flagged as likely controls. Additional validation can be performed using polygenic risk scores derived from genome-wide association studies and APOE-ε4 allele counts, providing independent genetic evidence supportive of the predicted classification.

[0142] In some embodiments, the disclosed framework can also be applied to the detection of other undiagnosed diseases beyond AD. Diseases with significant underdiagnosis, such as Parkinson’s Disease, chronic kidney disease, heart failure with preserved ejection fraction, autoimmune disorders, or mental health conditions, can benefit from similar semi-supervised learning pipelines when labels are incomplete or biased. The PUL approach can be adapted to any setting in which the positive class is underrepresented, mislabeled, or inconsistently recorded in real-world clinical data. By incorporating condition-specific prevalence estimates, feature-selection strategies, and bias-mitigation techniques, the framework can generalize to a wide range of diagnostic applications in healthcare.

[0143] In some embodiments, deployment of the framework can support early detection workflows within clinical settings. High-risk patients identified by the model can be surfaced to providers for additional cognitive screening, neurological evaluation, or referral to memory clinics. Because the framework can operate on routinely collected EMR data, it can be integrated into clinical decision-support modules, population health management systems, or quality-improvement initiatives aimed at reducing missed or delayed diagnoses. The use of bias mitigation can further help ensure that such clinical benefits are equitable across racial and ethnic groups.

B. System Architecture

[0144] Provided herein, in some embodiments, are systems and methods configured to identify undiagnosed disease (e.g., Alzheimer’s disease) using a semi-supervised positive unlabeled learning (SSPUL) framework. The system architecture can include multiple coordinated modules, each of which can execute a distinct computational operation within a multi-stage machine-learning pipeline. In some embodiments, the architecture can be implemented on one or more servers, cloud computing platforms, or distributed computing systems connected to electronic medical record (EMR) repositories.

1. Preprocessing and Feature Construction Module

[0145] In some embodiments, a preprocessing and feature construction module can be configured to receive, verify, and transform electronic medical record (EMR) data prior to its use in a semi-supervised learning framework. The module can function as an upstream data-curation component that prepares raw clinical data for downstream feature selection, pseudo-labeling, and model training.

[0146] In some embodiments, the module can ingest de-identified longitudinal EMR data from a clinical data repository such as, but not limited to, the UCLA Data Discovery Repository. The data can include diagnosis codes, demographic attributes, encounter histories, medication information, and other clinical observations. As the data enter the system, schema validation can be performed to ensure that fields such as, but not limited to, age, sex, race, ethnicity, and encounter timestamps are present and formatted correctly.

[0147] In some embodiments, the module can harmonize diagnosis codes by mapping ICD-9 and ICD-10 codes to phecodes. This mapping can be used to reduce the dimensionality of diagnoses, merge equivalent codes, and create clinically meaningful groupings. Codes from pregnancy-related, congenital, perinatal, or external-cause categories can be removed to avoid introducing irrelevant conditions. For each patient, only the earliest instance of a phecode can be retained to avoid overcounting repeated diagnoses.
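
The harmonization step can be sketched as a lookup followed by an earliest-date reduction. The mapping table below is a toy stand-in with a handful of illustrative entries; a real pipeline would load the full published ICD-to-phecode map instead. The placeholder `PREG_EXAMPLE`, the function name, and the patient records are hypothetical.

```python
from datetime import date

# Toy mapping for illustration only; not a substitute for the real map.
ICD_TO_PHECODE = {
    ("ICD9", "331.0"): "290.11",        # Alzheimer's disease
    ("ICD10", "G30.9"): "290.11",       # Alzheimer's disease, unspecified
    ("ICD10", "R00.2"): "427.9",        # palpitations
    ("ICD10", "O80"): "PREG_EXAMPLE",   # placeholder pregnancy-related group
}
EXCLUDED_PHECODES = {"PREG_EXAMPLE"}    # categories dropped as irrelevant

def earliest_phecodes(diagnoses):
    """Map (patient, vocabulary, code, date) rows to phecodes and keep
    only the earliest date per (patient, phecode) pair."""
    first_seen = {}
    for pid, vocab, code, when in diagnoses:
        phecode = ICD_TO_PHECODE.get((vocab, code))
        if phecode is None or phecode in EXCLUDED_PHECODES:
            continue  # unmapped or excluded category
        key = (pid, phecode)
        if key not in first_seen or when < first_seen[key]:
            first_seen[key] = when
    return first_seen

rows = [
    ("pt1", "ICD9", "331.0", date(2014, 3, 1)),
    ("pt1", "ICD10", "G30.9", date(2018, 6, 1)),  # same phecode, later date
    ("pt1", "ICD10", "O80", date(2001, 1, 1)),    # excluded category
    ("pt2", "ICD10", "R00.2", date(2020, 5, 5)),
]
first = earliest_phecodes(rows)
```

Here the ICD-9 and ICD-10 codes for the same condition collapse into one phecode, and only the 2014 occurrence is retained for `pt1`.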

[0148] In some embodiments, demographic completeness can be required. Patients missing sex, race, or ethnicity can be excluded to ensure that fairness assessments across protected groups remain valid. In other embodiments, imputation procedures can be applied, although such imputation can potentially introduce bias. Age filters can be imposed so that patients outside a clinically relevant age range for a target condition, e.g., Alzheimer’s disease (AD), do not enter the training cohort. Record-level filtering can also be carried out to remove individuals who do not meet minimum thresholds for encounter frequency or record length.

[0149] In some embodiments, temporal and healthcare-utilization features can be constructed. These features can include age during the last encounter, the total number of encounters, the overall record length, the density of encounters relative to time in care, and the number of unique diagnoses observed. Such features can serve as indirect indicators of disease progression, care intensity, or clinical complexity.
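
The utilization features listed above can be derived from a patient's encounter dates and diagnosis list. In this sketch the exact definitions are assumptions made for illustration: record length is the span between first and last encounter in years, and record density is encounters per year of record. The function name and the example patient are hypothetical.

```python
from datetime import date

def utilization_features(birth_date, encounter_dates, phecodes):
    """Compute temporal and utilization features for one patient."""
    first, last = min(encounter_dates), max(encounter_dates)
    record_years = (last - first).days / 365.25
    n_enc = len(encounter_dates)
    return {
        "age_at_last_encounter": (last - birth_date).days / 365.25,
        "n_encounters": n_enc,
        "record_length_years": record_years,
        # Assumed definition: encounters per year of record.
        # Fall back to the raw count for single-day records.
        "record_density": n_enc / record_years if record_years > 0 else float(n_enc),
        "n_unique_diagnoses": len(set(phecodes)),
    }

features = utilization_features(
    birth_date=date(1950, 1, 1),
    encounter_dates=[date(2015, 1, 1), date(2016, 1, 1), date(2019, 1, 1)],
    phecodes=["290.11", "427.9", "427.9"],
)
```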

[0150] In some embodiments, continuous variables can be normalized to a common range using min-max scaling. This normalization can reduce feature-scale disparities and can stabilize the behavior of logistic regression and other machine learning models used later in the pipeline. Although min-max scaling can be preferred for its interpretability, other scaling approaches can be employed when desirable, including z-score normalization, quantile transformation, or rank-based normalization.
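
For reference, min-max scaling maps each value v to (v - min) / (max - min), so the smallest observed value becomes 0 and the largest becomes 1. The sketch below adds one assumed convention, mapping a constant column to zeros to avoid division by zero; the encounter counts are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to scale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

encounter_counts = [2, 4, 10]
scaled = min_max_scale(encounter_counts)
```

In a train / validation / test setting, the minimum and maximum would normally be taken from the training partition only and reused on the other partitions, so that held-out data do not leak into the scaling.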

[0151] In some embodiments, phecodes can be encoded as binary indicators reflecting the presence or absence of the corresponding clinical condition. This representation can simplify downstream computation and can make it feasible to evaluate thousands of potential features. In other embodiments, diagnosis counts or encounter frequencies can be encoded instead of binary indicators.

[0152] In some embodiments, the module can separate the population into labeled positives and unlabeled instances. Labeled positives can include, for example, individuals with documented AD ICD codes, whereas unlabeled instances can include, for example, both undiagnosed AD cases and true negative patients. This separation prepares the dataset for the later positive-unlabeled learning (PUL) stages without assigning any pseudo-labels at this stage.

[0153] In some embodiments, the preprocessing and feature construction module can output a high-dimensional patient-feature matrix that includes binary encoded phecodes, normalized temporal and utilization features, demographic metadata, and identifiers marking labeled positive and unlabeled patients. This matrix can be forwarded directly to the feature-selection component and can serve as the structured foundation upon which SSPUL training is performed.

2. Feature Selection Module

[0154] In some embodiments, a feature selection module can be configured to identify clinical features that exhibit a statistically meaningful association with a target condition, such as, but not limited to, Alzheimer’s disease (AD). The module can operate on the structured patient-feature matrix generated during preprocessing and can determine which phecodes or other EMR-derived features are enriched among labeled positive patients relative to unlabeled patients.

[0155] In some embodiments, logistic regression can be employed as the primary statistical mechanism for feature selection. Each phecode can be evaluated independently by fitting a logistic regression model with disease (e.g., AD) status as the binary outcome. Labeled disease patients can serve as positive examples, while unlabeled patients can be temporarily treated as noisy controls for the purpose of identifying enriched diagnoses. The regression can be repeated across multiple random splits of the dataset to mitigate instability arising from sampling variability within the labeled positive group.

[0156] In some embodiments, covariates such as age since last visit, record length, record density, number of encounters, number of diagnoses, sex, and self-reported race and ethnicity can be incorporated as adjustment variables within each logistic regression model. These covariates can be normalized prior to use, enabling each logistic regression model to account for demographic or healthcare-utilization factors that might otherwise confound diagnostic enrichment signals.

[0157] In some embodiments, the feature selection module can apply significance testing to regression outputs. Wald p-values can be computed for each phecode coefficient, and Bonferroni or similar corrections can be applied to account for multiple comparisons across thousands of phecodes. Phecodes that meet a defined statistical significance threshold and exceed a minimum prevalence criterion can be retained as candidate enriched features. Prevalence thresholds can be used to filter out exceedingly rare conditions that would not meaningfully contribute to disease (e.g., AD) prediction or may destabilize model training.
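
The significance filter can be sketched given per-phecode regression outputs. The two-sided Wald p-value for a coefficient β with standard error SE is computed from z = β / SE under a standard normal reference, and the Bonferroni correction compares each p-value against α divided by the number of tests. The coefficient values, the phecode labels, α = 0.05, and the 1% prevalence floor are hypothetical choices for illustration.

```python
import math

def wald_p_value(beta, std_err):
    """Two-sided p-value for H0: beta = 0 under a normal approximation."""
    z = beta / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def select_enriched(fits, n_tests, alpha=0.05, min_prevalence=0.01):
    """Keep phecodes that pass Bonferroni and the prevalence floor.

    fits: {phecode: (beta, std_err, prevalence)}
    """
    cutoff = alpha / n_tests  # Bonferroni-adjusted significance level
    return sorted(
        code
        for code, (beta, se, prev) in fits.items()
        if wald_p_value(beta, se) < cutoff and prev >= min_prevalence
    )

# Hypothetical regression outputs for three phecodes out of 1000 tested.
fits = {
    "phe_A": (1.2, 0.1, 0.08),   # strong signal, common
    "phe_B": (0.3, 0.2, 0.10),   # weak signal
    "phe_C": (2.0, 0.2, 0.001),  # strong signal, too rare
}
enriched = select_enriched(fits, n_tests=1000)
```

Only `phe_A` survives: `phe_B` fails the corrected significance cutoff and `phe_C` fails the prevalence floor.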

[0158] In some embodiments, the module can aggregate significant features across a large number of random splits, such as 1000 independent train / validation / test partitions. The final set of enriched phecodes can reflect the union or consensus of features that repeatedly achieve significance across splits. This aggregation process can reduce sensitivity to sampling noise and can yield a stable, data-driven representation of conditions associated with disease (e.g., AD) in the study population.
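
Aggregation across splits reduces to counting how often each phecode reaches significance. The sketch below returns both a union and a consensus set; the 50% consensus cutoff, the function name, and the split contents are hypothetical, and four splits stand in for the much larger number described above.

```python
from collections import Counter

def aggregate_splits(significant_per_split, min_fraction=0.5):
    """Combine per-split significant phecodes into union and consensus sets."""
    counts = Counter(
        code for split in significant_per_split for code in set(split)
    )
    n = len(significant_per_split)
    union = sorted(counts)
    consensus = sorted(c for c, k in counts.items() if k / n >= min_fraction)
    return union, consensus, counts

# Hypothetical results from four random splits.
splits = [
    ["phe_A", "phe_B"],
    ["phe_A"],
    ["phe_A", "phe_C"],
    ["phe_A", "phe_B"],
]
union, consensus, counts = aggregate_splits(splits)
```

The `counts` object also supports ranking by frequency of significance, for example via `counts.most_common()`.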

[0159] In some embodiments, the feature selection module can produce a ranked list of enriched features. Such ranking can be based on frequency of significance across splits, magnitude of regression coefficients, or adjusted p-values. Although logistic regression can be preferred for computational efficiency and interpretability, alternative selection strategies can be employed in other embodiments. These alternatives can include L1-regularized regression, mutual information ranking, random forest importance estimation, or embedded feature selection within a gradient boosting framework.

[0160] In some embodiments, the module can generate feature subsets tailored to different downstream tasks. A minimal subset containing the most robust disease-associated phecodes can be constructed for use in lightweight baseline models, while a more comprehensive subset containing all significant features can be provided to the SSPUL framework to maximize predictive power.

[0161] In some embodiments, the feature selection module can output an enriched feature set that captures diagnostic patterns consistently associated with disease (e.g., AD) across demographic groups. This output can be passed directly to the pseudo-labeling and semi-supervised learning module, where it can inform both probabilistic gap estimation and final classifier training. Through this process, the feature selection module can ensure that the machine learning system leverages clinically meaningful, data-driven features in both supervised and semi-supervised components.

3. SSPUL Core Module

[0162] In some embodiments, the semi-supervised positive unlabeled learning (SSPUL) core module can implement a multi-stage classification pipeline designed to learn from a limited number of labeled positive cases and a large set of unlabeled patient records. The SSPUL framework can be configured to overcome the challenge that unlabeled patients can include both true positives and true negatives, such that explicit negative labels are not available. The system can therefore rely on probabilistic criteria, pseudo-label generation, and iterative refinement to progressively distinguish reliable negatives, assign additional positive or negative pseudo-labels, and train a final classifier capable of predicting undiagnosed disease cases.

[0163] In some embodiments, the first stage of the SSPUL pipeline can involve the identification of reliable negatives from the unlabeled population. A preliminary classifier, such as a generalized linear model, can be trained using the labeled positive patients and the unlabeled patients. This model can estimate, for each patient, a probabilistic gap defined as the difference between the estimated probability of belonging to the labeled-positive class and the probability of belonging to the complement class. The probabilistic gap can serve as a surrogate measurement of how similar a patient is to known positive cases relative to presumed negatives. Patients whose probabilistic gaps fall below the smallest observed gap among labeled positives can be treated as reliable negatives, optionally after applying additional exclusion criteria such as minimum age thresholds or exclusion of individuals with diagnostic codes related to the target disease. In some embodiments, reliable negatives can also undergo manual review or automated audit procedures to confirm their plausibility.
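
By way of non-limiting illustration, the reliable-negative rule can be sketched as follows. The sketch assumes that a preliminary classifier has already produced a probability for each patient; the record fields (`p_pos`, `labeled_positive`, `age`, `has_excluded_code`), the patient identifiers, and the probabilities are hypothetical.

```python
def probabilistic_gap(p_positive):
    """Gap between the positive-class probability and its complement."""
    return p_positive - (1.0 - p_positive)

def reliable_negatives(patients, min_age=70):
    """Return ids of unlabeled patients that qualify as reliable negatives.

    A patient qualifies when the gap is below the smallest gap observed
    among labeled positives, the age criterion is met, and no excluded
    diagnostic code is present.
    """
    lp_gaps = [
        probabilistic_gap(p["p_pos"]) for p in patients if p["labeled_positive"]
    ]
    threshold = min(lp_gaps)
    selected = []
    for p in patients:
        if p["labeled_positive"]:
            continue
        below_gap = probabilistic_gap(p["p_pos"]) < threshold
        if below_gap and p["age"] >= min_age and not p["has_excluded_code"]:
            selected.append(p["id"])
    return selected

# Hypothetical classifier outputs for two labeled and four unlabeled patients.
patients = [
    {"id": "lp1", "p_pos": 0.8, "labeled_positive": True, "age": 81, "has_excluded_code": False},
    {"id": "lp2", "p_pos": 0.6, "labeled_positive": True, "age": 77, "has_excluded_code": False},
    {"id": "u1", "p_pos": 0.3, "labeled_positive": False, "age": 75, "has_excluded_code": False},
    {"id": "u2", "p_pos": 0.7, "labeled_positive": False, "age": 80, "has_excluded_code": False},
    {"id": "u3", "p_pos": 0.1, "labeled_positive": False, "age": 60, "has_excluded_code": False},
    {"id": "u4", "p_pos": 0.2, "labeled_positive": False, "age": 72, "has_excluded_code": True},
]
rn_ids = reliable_negatives(patients)
```

Here only `u1` satisfies all three criteria; `u2` fails the gap criterion, `u3` the age criterion, and `u4` the code exclusion.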

[0164] In some embodiments, the second stage of the SSPUL pipeline can involve the assignment of additional pseudo-labels to a subset of the remaining unlabeled patients. To reduce racial or ethnic bias in the resulting classifier, race-specific probabilistic thresholds can be applied. A secondary classifier can be trained using the labeled positives and the previously identified reliable negatives. Updated probabilistic gaps can be generated from this secondary classifier. For each racial or ethnic group, unlabeled patients exhibiting gaps larger than the smallest labeled-positive gap can be assigned as additional positives, while those exhibiting gaps smaller than the largest reliable-negative gap can be designated as additional negatives. In some embodiments, the quantity of additional positive labels assigned to each group can be calibrated to match population-level prevalence estimates derived from demographic studies or external epidemiological sources. This process can expand the training set while enforcing proportional representation across racial and ethnic groups, thereby reducing pre-processing inequities that can arise from underdiagnosis or label scarcity.

[0165] In some embodiments, the third stage of the SSPUL pipeline can involve the training of a final classifier using all available labels, including labeled positives, reliable negatives, and pseudo-labeled positives and negatives. This final classifier can be configured to learn a refined decision boundary based on a substantially larger and more diverse training dataset than would be available under strictly supervised learning. The final classifier can then be applied to predict disease status in the test set, generating probabilistic outputs that reflect confidence in disease classification. These probabilities can subsequently be thresholded using a post-processing fairness optimization module, such as one that selects group-specific cutoffs to optimize group benefit equality.

[0166] In some embodiments, the SSPUL core module can operate iteratively, enabling retraining or adjustment of intermediate classifiers as additional data become available. Such iterative refinement can accommodate shifts in disease prevalence, changes in patient demographics, or updated clinical guidelines. The probabilistic gap thresholds, prevalence targets, or pseudo-labeling rules can also be dynamically adapted to maintain desired fairness objectives or classification performance.

[0167] In some embodiments, alternative machine learning models can be substituted for any of the classifiers within the SSPUL pipeline. Gradient-boosting machines, random forests, neural networks, or ensemble models can be employed depending on the available computational resources, scale of the dataset, and desired interpretability. Similarly, the pseudo-labeling scheme can incorporate additional constraints such as clinical rule-based heuristics, longitudinal trajectory modeling, or multimodal data integration involving laboratory, imaging, or genomic features.

[0168] In some embodiments, the SSPUL core module serves as the central learning engine that harmonizes probabilistic inference, fairness-motivated pseudo-labeling, and final disease prediction. Through its multi-step structure, the module can leverage unlabeled data in a principled manner, mitigate racial and ethnic bias in training data distribution, and generate accurate and equitable predictions of undiagnosed diseases such as Alzheimer’s disease.

4. Fairness Optimization Module

[0169] In some embodiments, a fairness optimization module can be incorporated to enforce equitable performance of the prediction model across different demographic groups. This module can operate after the SSPUL core module has generated probabilistic predictions for each patient and can adjust the final classification threshold for each protected group based on a defined fairness objective. In some embodiments, the fairness objective can include group benefit equality (GBE), equal opportunity (EO), or other group-level parity metrics used in algorithmic fairness literature.

[0170] In some embodiments, the fairness optimization module can calculate, for each racial or ethnic group, the empirical prevalence of the disease based on available labeled positives and proxy-validated cases. These group-specific prevalence estimates can be compared to the predicted prevalence obtained from the classifier’s probability outputs. The module can then solve an optimization problem that selects a probability cutoff for each group such that the predicted prevalence matches or closely approximates the observed or expected prevalence derived from proxy labels or epidemiological benchmarks. In many instances, the Nelder-Mead optimization algorithm or other numerical optimization procedures can be utilized to identify thresholds that minimize the difference between predicted and expected group benefit.
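
By way of non-limiting illustration, group-specific cutoff selection can be sketched as a one-dimensional search per group. The sketch substitutes a simple grid search for the Nelder-Mead procedure named above and takes the objective to be the absolute difference between predicted and expected prevalence; the function names, group names, probabilities, and expected prevalences are hypothetical.

```python
def predicted_prevalence(probs, cutoff):
    """Share of patients classified positive at a given cutoff."""
    return sum(p >= cutoff for p in probs) / len(probs)

def pick_cutoff(probs, expected_prevalence, candidates=None):
    """Grid search (a stand-in for Nelder-Mead) over candidate cutoffs."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_cutoff, best_error = None, float("inf")
    for c in candidates:
        error = abs(predicted_prevalence(probs, c) - expected_prevalence)
        if error < best_error:  # keep the first cutoff reaching the minimum
            best_cutoff, best_error = c, error
    return best_cutoff

def group_cutoffs(probs_by_group, prevalence_by_group):
    """Select one cutoff per demographic group."""
    return {
        g: pick_cutoff(probs, prevalence_by_group[g])
        for g, probs in probs_by_group.items()
    }

# Hypothetical classifier outputs and expected prevalences for two groups.
probs_by_group = {
    "group_1": [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "group_2": [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.9, 0.95],
}
expected = {"group_1": 0.2, "group_2": 0.4}
cutoffs = group_cutoffs(probs_by_group, expected)
```

A grid search is used here only because the objective is a step function of the cutoff; a derivative-free optimizer such as Nelder-Mead can be used in its place.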

[0171] In some embodiments, the fairness optimization module can incorporate constraints to ensure that the resulting cutoffs do not degrade model performance below acceptable bounds. For example, a minimum sensitivity or a minimum precision for each group can be required. The module can also incorporate regularization terms designed to prevent extreme thresholds that would assign too many or too few positive predictions for specific groups. Such regularization can prevent fairness objectives from overcorrecting or amplifying statistical noise in small subgroups.

[0172] In some embodiments, the module can dynamically adjust probability thresholds across iterations of the SSPUL pipeline or across different training and test splits. Because disease prevalence, sample composition, or pseudo-label assignments can vary between iterations, the fairness thresholds can be recalibrated each time the model is retrained. This dynamic adaptation can ensure that fairness gains are preserved even under changing data distributions.

[0173] In some embodiments, alternative fairness metrics can be incorporated depending on the clinical, legal, or operational objectives of the deployment environment. For example, when early disease identification is prioritized, the module can enforce equal opportunity, ensuring that sensitivity is balanced across protected groups. In other embodiments, demographic parity can be applied in contexts where equal selection rates are required due to regulatory or institutional constraints. The architecture can flexibly support multiple fairness definitions and can be configured to select the metric most appropriate for a given use case.

[0174] In some embodiments, calibration-based fairness adjustments can be implemented to align predicted probabilities across groups. This can involve recalibrating each group’s probability distribution using group-specific isotonic regression, Platt scaling, or temperature scaling, thereby ensuring that predicted probabilities reflect similar real-world disease risks across all groups.

[0175] In some embodiments, the fairness optimization module can provide diagnostic output that quantifies fairness metrics before and after threshold adjustment. These metrics can include group benefit equality, equal opportunity differences, equalized odds differences, parity losses, and other statistical measures. Such diagnostic information can be displayed via a dashboard, written to logs, or stored for auditing or regulatory compliance. The module can therefore support transparency and reproducibility of fairness behavior.

[0176] In some embodiments, the fairness optimization module can interact with the SSPUL core module by feeding back fairness adjustments that influence future pseudo-labeling or classifier retraining. This feedback loop can improve fairness at both the labeling and inference stages. For example, if a consistent imbalance is detected in predicted positives for a particular group, the pseudo-labeling thresholds for that group (the labeled-positive gap threshold ΔP_LP or the reliable-negative gap threshold ΔP_RN) can be updated in subsequent training cycles to mitigate such imbalance.

[0177] In some embodiments, the fairness optimization module can be deployed in real-time or batch-processing environments. In real-time systems, group-specific thresholds can be applied at the moment predictions are generated. In batch systems, thresholds can be applied to large cohorts of patients at once, enabling health systems to run equitable screening campaigns or population-level diagnostics.

5. Baseline Model Module

[0178] In some embodiments, a baseline model module can be implemented to provide reference performance against which the semi-supervised positive-unlabeled learning (SSPUL) system can be compared. This module can generate one or more supervised learning models trained under assumptions commonly used in conventional EMR-based disease prediction studies, such as treating all unlabeled patients as controls. The baseline model module can therefore serve to quantify the improvements introduced by the SSPUL framework in terms of sensitivity, precision, calibration, and fairness.

[0179] In some embodiments, the baseline model module can include a model trained solely on demographics and a manually curated set of clinically recognized disease (e.g., AD) risk factors. Such a model can reflect traditional expert-driven feature engineering approaches in which only domain-selected predictors are used. This model can be trained using logistic regression, generalized linear models, or other supervised classifiers and can rely on simple cutoff-based or probability-based prediction strategies. Because this baseline relies heavily on expert-curated features, it can be used to illustrate the limitations of knowledge-driven models compared to data-driven models that learn from the broader EMR feature space.

[0180] In some embodiments, a second baseline model can be implemented using all enriched clinical features identified through logistic-regression-based feature selection. This supervised baseline can use the same set of selected phecodes used by the SSPUL system but can assign negative labels to all unlabeled patients. The resulting model can therefore reflect the performance of a standard supervised classifier built on the same feature set while lacking pseudo-labeling, probabilistic gap modeling, and other semi-supervised mechanisms. Comparisons between this supervised baseline and the SSPUL model can highlight the value of incorporating unlabeled data and race-specific pseudo-label assignment into the learning framework.

[0181] In some embodiments, the baseline model module can use the AutoML component to identify the optimal supervised classifier. The AutoML process can evaluate generalized linear models, gradient boosting machines, distributed random forests, and extreme gradient boosting models under cross-validation. The classifier with the highest mean AUCPR or other accuracy metric can be selected as the baseline model. By standardizing this automatic selection process, the baseline model module can ensure that performance comparisons are not biased by suboptimal model configuration.

[0182] In some embodiments, the baseline model module can include a threshold-selection mechanism for generating final predictions. The threshold can be selected based on maximizing the Matthews correlation coefficient (MCC) or other performance metric calculated using proxy labels in the validation set. Such a thresholding strategy can mimic commonly used evaluation protocols in EMR-based machine learning studies that rely on silver-standard proxy outcomes. In addition, the thresholding process can highlight the contrast between baseline approaches that treat unlabeled data as negative and the SSPUL framework, which explicitly corrects for label uncertainty.
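
By way of non-limiting illustration, MCC-based threshold selection against proxy labels can be sketched as follows. The function names, the validation probabilities, the proxy labels, and the candidate cutoffs are hypothetical.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_mcc_threshold(probs, proxy_labels, candidates):
    """Pick the candidate cutoff with the highest MCC on validation data."""
    def score(cutoff):
        predictions = [int(p >= cutoff) for p in probs]
        return mcc(proxy_labels, predictions)
    return max(candidates, key=score)

# Hypothetical validation-set probabilities and silver-standard proxy labels.
val_probs = [0.10, 0.20, 0.35, 0.60, 0.70, 0.90]
val_proxy = [0, 0, 0, 1, 1, 1]
threshold = best_mcc_threshold(val_probs, val_proxy, [0.25, 0.50, 0.75])
```

With these inputs the middle cutoff separates the proxy cases from the proxy controls exactly and is therefore selected.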

[0183] In some embodiments, demographic-specific or group-specific performance of each baseline model can be calculated to evaluate disparities in prediction accuracy. The baseline model module can compute group-stratified sensitivity, specificity, precision, and fairness metrics and can provide these metrics to downstream modules for parity loss assessment. These results can also be stored or displayed to illustrate the degree to which conventional supervised models underperform in underrepresented or underdiagnosed populations.

[0184] In some embodiments, the baseline model module can be used during system development but can also be deployed in production environments when interpretability, simplicity, or low computational overhead is required. However, comparative evaluation repeatedly demonstrates that baseline models trained on noisy negatives typically produce lower sensitivity and weaker calibration performance than SSPUL models. The baseline model module thus provides both a performance benchmark and an illustration of the limitations inherent in conventional supervised learning approaches applied to EMR populations with incomplete labels.

[0185] Collectively, the baseline model module can serve as a critical subsystem that quantifies the added value, robustness, and fairness improvements introduced by the SSPUL framework. By maintaining consistent training pipelines, feature sets, and validation strategies, the module allows rigorous comparison across modeling paradigms and supports clear demonstration of the inventive step represented by integrating semi-supervised learning and racial bias mitigation into a unified predictive framework.

6. Interpretation and Explainability Module

[0186] In some embodiments, an interpretation and explainability module can be incorporated to provide transparent and clinically meaningful explanations for predictions generated by the semi-supervised positive-unlabeled learning (SSPUL) system. This module can enable users, clinicians, researchers, or downstream automated systems to understand how specific features influence the classification of undiagnosed disease (e.g., AD), thereby improving trust, safety, and potential regulatory compliance. The interpretability module can also support auditing, fairness assessment, and error analysis by revealing model decision pathways at both the population and individual levels.

[0187] In some embodiments, the interpretation and explainability module can compute SHapley Additive exPlanations (SHAP) values to quantify the contribution of each input feature to a prediction. For example, SHAP values can be generated using a tree-based SHAP algorithm when the final SSPUL classifier uses a gradient boosting machine, distributed random forest, or XGBoost model. These values can reflect the marginal impact of each phecode, demographic variable, utilization measure, or probabilistic pseudo-label feature on the model’s output. The module can aggregate SHAP values across patients to identify top predictive features and can additionally stratify these feature impacts by race and ethnicity to assess whether prediction mechanisms remain stable across demographic groups. In some embodiments, SHAP explanations can be generated for both labeled patients and those predicted to have undiagnosed AD, enabling comparison of how neurological and non-neurological features drive prediction across different subpopulations.

[0188] In some embodiments, the interpretation and explainability module can generate visualizations, summaries, or structured outputs that describe global and local model behavior. For global interpretation, the module can produce averaged SHAP importance rankings, correlation analyses between feature magnitudes and their SHAP directions, or biplots derived from factor analysis of mixed data (FAMD). These outputs can highlight the relationships between features and classification outcomes in a reduced-dimensional space. For local interpretation, the module can generate patient-level explanations showing the specific features that elevate or suppress an individual’s predicted probability of disease (e.g., AD), which can assist clinicians in reviewing cases flagged by the model.

[0189] In some embodiments, the interpretation and explainability module can incorporate dimensionality reduction approaches such as FAMD to visualize the distribution of predicted positives, true positives, labeled positives, and predicted negatives in a common feature space. Such visualization can reveal clusters of undiagnosed disease patients whose feature profiles differ from traditionally diagnosed patients, thereby supporting discovery of previously overlooked phenotypes. FAMD projections can be used to identify subsets of predicted positives that exhibit distinct clinical signatures, enabling a deeper understanding of heterogeneity in disease (e.g., AD) presentation.

[0190] In some embodiments, the module can implement statistical comparisons to evaluate whether the direction and magnitude of feature importance differ across demographic groups. Techniques such as Mann-Whitney U tests, permutation-based correlation shift analysis, and Bonferroni-adjusted hypothesis testing can be used to determine whether features exert disproportionate influence on predictions for certain groups. These analyses can reveal whether discrepancies in model behavior exist and can inform fairness-oriented modifications to the SSPUL framework.

[0191] In some embodiments, the interpretation and explainability module can capture and store interpretability metrics for later auditing or compliance review. This can include tracked SHAP distributions, cross-race correlation matrices, feature prevalence shifts, and representations of top predictive features. These outputs can allow the system to document that predictions were generated in a transparent, interpretable manner, which can be critical for medical and regulatory adoption.

[0192] In some embodiments, the interpretation and explainability module can interface with the fairness optimization module to provide input for selection of bias mitigation strategies. For example, if the module identifies that a particular feature disproportionately drives disease (e.g., AD) predictions for a protected group, the system can adjust pseudo-labeling parameters, probabilistic gap thresholds, or group-specific cutoff strategies in response. Thus, the explainability outputs can actively refine the SSPUL framework to improve fairness over iterative development cycles.

[0193] In some embodiments, the module can also support model debugging and version control by providing interpretable diagnostics during training and evaluation. Developers can compare SHAP distributions across model iterations, observe stability or drift in top predictive features, and verify whether pseudo-labels influence representation learning in expected ways. The explainability component therefore serves not only clinical and fairness-driven objectives but also development and validation workflows.

7. Validation and Robustness Module

[0194] In some embodiments, a validation and robustness module can be implemented to rigorously evaluate the performance, reliability, and generalizability of the semi-supervised positive-unlabeled learning (SSPUL) system. This module can be configured to assess model accuracy under a variety of real-world conditions, including incomplete labeling, demographic imbalances, label noise, and shifting proxy criteria. The module can operate during training, validation, and deployment stages, ensuring that the SSPUL framework performs consistently and equitably across diverse patient populations.

[0195] In some embodiments, the module can validate predictions using phenotype-level proxy measures that serve as silver-standard indicators for disease (e.g., AD). For example, these proxy measures can include alternative ICD-10 dementia-related diagnosis codes and FDA-approved dementia medications. For each model iteration, unlabeled patients can be classified as proxy-validated cases or controls based on the presence or absence of these indicators. The module can compute discrimination metrics such as sensitivity, precision, specificity, balanced accuracy, AUCPR, and AUC. These metrics can be reported both overall and stratified by race and ethnicity to ensure that the SSPUL system displays stable performance across demographic groups. The module can further calculate calibration metrics, such as balanced Brier score, to assess whether predicted probabilities reflect the true frequency of proxy-validated outcomes.

[0196] In some embodiments, the module can perform genotype-level validation using polygenic risk scores (PRS) and allele (e.g., APOE ε4) counts. Genotype data from a holdout set can be used to calculate PRSs for late-onset disease (e.g., AD) based on published genome-wide association studies. The module can assign predicted labels to holdout patients using the final SSPUL classifiers and compute mean PRSs for predicted positives, predicted negatives, and labeled positives. A statistically significant elevation of PRS or allele count among labeled and predicted positives relative to predicted negatives can provide independent validation of the SSPUL system’s ability to detect biologically meaningful disease (e.g., AD) risk. The module can additionally stratify genotype-level validation by race and ethnicity to confirm that genetic risk patterns align with known ancestry-dependent variations.

[0197] In some embodiments, the validation and robustness module can perform fairness-specific evaluation by quantifying group-level differences in prediction outcomes. Metrics such as equal opportunity (EO) parity, group benefit equality (GBE), specificity parity, and precision parity can be computed for protected demographic groups relative to a privileged baseline group. The module can compute metric-specific parity losses and a cumulative parity loss to provide a single, interpretable measure of fairness. These outputs can be used to verify that the SSPUL system retains equitable predictive performance across groups, even under varying feature distributions or shifts in proxy label definitions.
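
By way of non-limiting illustration, metric-specific and cumulative parity losses can be sketched as follows. The disclosure does not give a formula for parity loss; the sketch assumes one possible definition, namely the sum over protected groups of the absolute log ratio of the group's metric to the privileged group's metric. The function names and metric values are hypothetical.

```python
import math

def parity_loss(metric_by_group, privileged):
    """Assumed definition: sum of |ln(metric_group / metric_privileged)|."""
    base = metric_by_group[privileged]
    return sum(
        abs(math.log(value / base))
        for group, value in metric_by_group.items()
        if group != privileged
    )

def cumulative_parity_loss(metrics, privileged):
    """Return per-metric parity losses and their sum."""
    per_metric = {
        name: parity_loss(values, privileged) for name, values in metrics.items()
    }
    return per_metric, sum(per_metric.values())

# Hypothetical group-stratified metrics.
metrics = {
    "sensitivity": {"NH-white": 0.8, "NH-AfAm": 0.4, "HL": 0.8},
    "precision": {"NH-white": 0.5, "NH-AfAm": 0.5, "HL": 0.5},
}
per_metric, total = cumulative_parity_loss(metrics, privileged="NH-white")
```

Under this assumed definition a loss of zero indicates identical metric values across groups, and the halved sensitivity for one group in the example contributes ln 2 to the total.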

[0198] In some embodiments, the module can conduct sensitivity analyses to measure model stability when key variables are perturbed. The system can evaluate sensitivity to self-reported race and ethnicity by recoding each patient’s race as another racial category while holding all clinical features fixed. After recoding, the module can recompute predicted probabilities and final labels to determine whether the SSPUL system responds consistently across demographic shifts. Minimal changes in predicted labels following recoding can indicate that the classifier is not disproportionately influenced by race and ethnicity, and that the fairness optimization strategies are functioning correctly.
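
By way of non-limiting illustration, the recoding analysis can be sketched as a count of label changes when each patient's recorded group is replaced while all other features are held fixed. The toy prediction rule, the group names, the cutoffs, and the scores are hypothetical stand-ins for the final classifier and its thresholds.

```python
def recoding_flip_rate(patients, predict_label, groups):
    """Fraction of (patient, alternative group) pairs whose label changes."""
    flips = 0
    total = 0
    for patient in patients:
        original = predict_label(patient)
        for group in groups:
            if group == patient["race"]:
                continue
            recoded = dict(patient, race=group)  # other features unchanged
            flips += int(predict_label(recoded) != original)
            total += 1
    return flips / total if total else 0.0

# Hypothetical stand-in for the final classifier plus group-specific cutoffs.
toy_cutoffs = {"group_1": 0.5, "group_2": 0.6}

def toy_predict(patient):
    return int(patient["score"] >= toy_cutoffs[patient["race"]])

cohort = [
    {"race": "group_1", "score": 0.55},  # label flips when recoded
    {"race": "group_2", "score": 0.90},  # label stable
    {"race": "group_1", "score": 0.20},  # label stable
]
flip_rate = recoding_flip_rate(cohort, toy_predict, ["group_1", "group_2"])
```

A flip rate near zero corresponds to the minimal change in predicted labels described above.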

[0199] In some embodiments, the module can evaluate robustness under proxy label distribution shifts. Because the SSPUL system uses proxy ICD codes and medications for validation and GBE optimization, the robustness module can test how performance changes when subsets of proxy labels are systematically removed. This can include removing one ICD code at a time, removing diagnostic categories grouped by phecode, removing random subsets of proxy codes, or removing all dementia-related medications. For each modified proxy subset, the module can recompute sensitivity, precision, and GBE-optimized thresholds over repeated train-validation-test splits. The stability of discrimination and fairness metrics under such perturbations can indicate the system’s resilience to incomplete or evolving clinical coding practices.
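
By way of non-limiting illustration, the removal of one proxy indicator at a time can be sketched as follows, here recomputing sensitivity only. The proxy names (`code_A`, `code_B`, `med_X`), the patient records, and the predictions are hypothetical placeholders, not actual ICD codes or medications.

```python
def sensitivity(predictions, truth):
    """True-positive rate; NaN when there are no positive proxy labels."""
    true_pos = sum(1 for p, t in zip(predictions, truth) if p and t)
    positives = sum(truth)
    return true_pos / positives if positives else float("nan")

def leave_one_proxy_out(patient_proxies, predictions, proxy_codes):
    """Recompute sensitivity with each proxy indicator removed in turn.

    patient_proxies: one set of proxy indicators per patient.
    """
    results = {}
    for removed in proxy_codes:
        kept = set(proxy_codes) - {removed}
        # A patient counts as a proxy case if any remaining indicator is present.
        truth = [int(bool(kept & set(codes))) for codes in patient_proxies]
        results[removed] = sensitivity(predictions, truth)
    return results

proxy_codes = ["code_A", "code_B", "med_X"]
patient_proxies = [{"code_A"}, {"code_B"}, {"code_A", "med_X"}, set()]
predictions = [1, 0, 1, 0]
shifted = leave_one_proxy_out(patient_proxies, predictions, proxy_codes)
```

Large swings in the recomputed values across removed indicators would suggest dependence on a particular proxy definition.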

[0200] In some embodiments, the validation and robustness module can incorporate holdout-set evaluation to measure out-of-sample performance. Holdout patients, excluded from all stages of SSPUL training and threshold optimization, can be processed through the final classifier to ensure unbiased assessment. Comparisons of model accuracy and fairness between cross-validation sets and holdout sets can reveal generalizability properties and detect overfitting to specific EMR cohorts.

[0201] In some embodiments, the module can maintain comprehensive logs, quantitative summaries, and interpretability artifacts that document system performance. These outputs can be stored for regulatory review, reproducibility, and auditing. Because medical prediction models require high transparency, the validation and robustness module can ensure that each deployed version of the SSPUL system is traceable and accompanied by validated performance documentation.

[0202] In some embodiments, the module can operate in conjunction with the interpretation and explainability module by using SHAP-based outputs to determine whether feature contributions remain stable across splits, demographic groups, and proxy subsets. Instability in feature contribution patterns can trigger retraining, rebalancing, or threshold-adjustment procedures. Thus, the validation and robustness module can function as a safeguard that prevents deployment of models whose behavior deviates from expected clinical or fairness standards.

C. Non-Limiting Exemplary Semi-Supervised Learning Framework

[0203] FIG. 1 is a schematic showing an example semi-supervised learning framework for developing models to predict AD with improved fairness, according to example embodiments of the present disclosure. Although FIG. 1 refers to AD, it is contemplated that the same framework can be applied to develop models for predicting other disease states with improved fairness. As shown in FIG. 1, the example semi-supervised learning framework may involve multiple machine learning models (e.g., machine learning models 120, 130, and 140) for determining disease state outcomes for unlabeled EMR data (e.g., corresponding to undiagnosed patients), improving EMR datasets, and/or predicting disease state outcomes. Furthermore, as shown in FIG. 1, the semi-supervised learning framework shows an example process (e.g., blocks 102-108) for positive unlabeled learning (PUL) (e.g., semi-supervised positive unlabeled learning (SSPUL)) to obtain disease state (e.g., AD) predictions for unlabeled instances in a received EMR dataset 150. To enable learning from positive and unlabeled data, the labeled positive EMR data (e.g., EMR data of a patient known to have a positive indication for AD and labeled as such) may be selected at random. It is contemplated that there may be a multitude of reasons why some AD patients are diagnosed (e.g., labeled positive) while others are not, such as but not limited to some providers’ assumption that cognitive changes are normal rather than pathologic, some providers not taking the symptoms and complaints of racialized AD patients seriously, and socioeconomically disadvantaged AD patients’ lack of access to healthcare. Therefore, it is contemplated that NH-white patients with high socioeconomic status and access to unbiased providers may be more likely to be diagnosed (i.e., be a labeled positive instance).

[0204] At block 102, a first machine learning model (e.g., generalized linear model (GLM)) 120 may be trained using the EMR dataset 150 to classify labeled positive (LP) and unlabeled instances. To identify reliable negative (RN) instances from predicted unlabeled instances, a probabilistic gap for each patient may be calculated. The probabilistic gap may be defined as the difference between the probability of an instance having a label given its features and the complement: ΔPr(x) = Pr(y = 1 | x) - Pr(y = 0 | x). Reliable negative instances may be determined based on one or more of the following criteria: 1) probabilistic gap is smaller than the smallest observed probabilistic gap of the labeled positive instances (ΔP_LP); 2) age during last visit is at least 70 years; 3) does not have any PHE code falling in the AD PHE code exclusion range (e.g., 290-292.99). To minimize potential bias that may be introduced by extreme values (i.e., bias resulting from the smallest probabilistic gap being an outlier), probabilistic gaps of a large number (e.g., 500) of labeled instances may be randomly sampled a large number of times (e.g., 1000 times), and a mean of the minimum may be taken.
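
By way of non-limiting illustration, the mean-of-minimum estimate of the labeled-positive gap threshold can be sketched as follows. The sketch assumes sampling without replacement and a fixed random seed for repeatability, neither of which is specified above; the function name and the gap values are hypothetical.

```python
import random

def robust_min_gap(lp_gaps, sample_size=500, n_repeats=1000, seed=0):
    """Mean of the per-sample minimum gap over repeated random subsamples.

    Assumption: each subsample is drawn without replacement; if fewer than
    sample_size gaps are available, all of them are used.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(lp_gaps))
    minima = [min(rng.sample(lp_gaps, k)) for _ in range(n_repeats)]
    return sum(minima) / n_repeats

# Hypothetical labeled-positive gaps: one outlier at -0.9, the rest near 0.2+.
lp_gaps = [-0.9] + [0.2 + 0.001 * i for i in range(999)]
threshold = robust_min_gap(lp_gaps)
```

Because the outlier is absent from roughly half of the subsamples, the resulting threshold lies between the outlier and the bulk of the gaps rather than at the outlier itself.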

[0205] At block 104, positive and negative labels for a subset of remaining unlabeled instances may be assigned using race and ethnic-specific probabilistic criteria in order to increase the size of the training set while mitigating bias that may arise from overrepresentation of instances belonging to a particular race and ethnicity prior to training (i.e., pre-processing bias mitigation). In some embodiments, LP and RN instances may be used as input for training a second machine learning model classifier (e.g., a distributed random forest model) 130. After obtaining updated predicted probabilities for LP and RN instances from the trained machine learning model, the training data may be subset by race and ethnicity to obtain race-specific ΔP_LP and race-specific observed probabilistic gaps of RN instances (ΔP_RN). For each race and ethnicity subset, additional positive (AP) instances may be determined based on having a probabilistic gap that is greater than the smallest ΔP_LP. Furthermore, additional negative (AN) instances may be determined based on having a probabilistic gap that is smaller than the largest ΔP_RN. Among AP instances satisfying the probabilistic gap criterion, those with the largest probabilistic gap may be selected such that the prevalence of total positive labels (LP and AP) for each race and ethnicity matches the corresponding population AD prevalence obtained from Medicare claims (10% for NH-white, 18.6% for NH-AfAm, and 14% for HL).
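
By way of non-limiting illustration, the assignment of additional positive and additional negative instances within one race and ethnicity subset can be sketched as follows. The sketch assumes that prevalence is computed over all LP, RN, and unlabeled instances in the subset and that an instance selected as an additional positive is not also assigned as an additional negative; both are assumptions, and the function name, identifiers, and gap values are hypothetical.

```python
def assign_pseudo_labels(group, target_prevalence):
    """Return (additional positive ids, additional negative ids) for a subset.

    group: dict with 'lp_gaps' and 'rn_gaps' (lists of gaps) and
    'unlabeled' (dict mapping patient id -> gap).
    """
    min_lp = min(group["lp_gaps"])  # smallest labeled-positive gap
    max_rn = max(group["rn_gaps"])  # largest reliable-negative gap
    unlabeled = group["unlabeled"]
    n_total = len(group["lp_gaps"]) + len(group["rn_gaps"]) + len(unlabeled)
    # Number of additional positives needed to reach the target prevalence.
    n_needed = max(0, round(target_prevalence * n_total) - len(group["lp_gaps"]))
    candidates = sorted(
        (pid for pid, gap in unlabeled.items() if gap > min_lp),
        key=lambda pid: unlabeled[pid],
        reverse=True,  # largest gaps first
    )
    additional_pos = candidates[:n_needed]
    additional_neg = sorted(
        pid for pid, gap in unlabeled.items()
        if gap < max_rn and pid not in additional_pos
    )
    return additional_pos, additional_neg

gaps = [0.95, 0.7, 0.6, 0.3, 0.2, 0.1, 0.0, -0.1,
        -0.2, -0.3, -0.4, -0.5, -0.55, -0.65, -0.7, -0.9]
subset = {
    "lp_gaps": [0.5, 0.9],
    "rn_gaps": [-0.8, -0.6],
    "unlabeled": {f"u{i}": g for i, g in enumerate(gaps)},
}
ap_ids, an_ids = assign_pseudo_labels(subset, target_prevalence=0.186)
```

With 20 instances in the subset and a target prevalence of 18.6%, four positive labels are required, so the two unlabeled instances with the largest gaps are added to the two existing labeled positives.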

[0206] At block 106, labeled positive (LP) instances and pseudo-labeled (RN, AP, AN) instances (across the demographic groups) may be used as input for training a third machine learning model (e.g., a final classifier, a GLM, etc.) 140. The EMR dataset 150 having the labeled positive and unlabeled data may be applied to the third machine learning model 140.

[0207] At block 108, post-processing bias mitigation may be performed by selecting the probability cutoff that optimizes the group benefit equality (GBE) value for each demographic group (e.g., race and ethnicity) for unlabeled data in the validation set and may be further applied to the test set. In some embodiments, an optimization algorithm such as but not limited to the Nelder-Mead optimization algorithm may be used.

D. Non-Limiting Exemplary Computer Implemented System

[0208] In various embodiments, the systems and methods for developing AD (and other disease state) prediction models with improved fairness, and predicting AD and other disease states using the same, can be implemented via computer software or hardware. Refer to the Appendix for further information regarding the system, devices and methods provided herein, in accordance with various embodiments.

[0209] FIG. 2 is a block diagram illustrating a computing device 200 (also referred to herein as “computing system” or “computer device”) upon which embodiments of the present teachings may be implemented. In various embodiments of the present teachings, computing device 200 can include a bus 202 or other communication mechanism for communicating information and a processor 204 coupled with bus 202 for processing information. In various embodiments, computing device 200 can also include a memory, which can be a random-access memory (RAM) 206 or other dynamic storage device, coupled to bus 202 for storing instructions to be executed by processor 204. Memory can also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 204. In various embodiments, computing device 200 can further include a read only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204. A storage device 210, such as a magnetic disk or optical disk, can be provided and coupled to bus 202 for storing information and instructions.

[0210] In various embodiments, computing device 200 can be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. In some embodiments, the display 212 may enable user input (e.g., via a touchscreen). An input device 214, including alphanumeric and other keys, can also enable user input and can be coupled to bus 202 for communication of information and command selections to processor 204. Another type of user input device is a cursor control 216, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212. This input device 214 typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allows the device to specify positions in a plane. Input devices 214 allowing for 3-dimensional (x, y and z) cursor movement are also contemplated herein.

[0211] In some embodiments, the display 212, the input device 214, and / or the cursor control 216 may enable a user (e.g., medical personnel, a researcher) to enter patient-specific EMR data to determine an AD or other disease state outcome (e.g., whether the patient has a positive or negative indication of the disease state based on their patient-specific EMR data). Consistent with certain implementations of the present teachings, results can be provided by computing device 200 in response to processor 204 executing one or more sequences of one or more instructions contained in memory 206. For example, such instructions may include feeding inputted information (e.g., patient-specific EMR data) into a trained machine learning model to predict an AD or other disease outcome for the patient. Such instructions may also include training one or more machine learning models to predict AD and other disease outcomes with improved fairness (e.g., based on incorporation of unlabeled patient EMR data and bias mitigation techniques for underrepresented demographic groups) as described herein. Such instructions can be read into memory 206 from another computer-readable medium or computer-readable storage medium, such as storage device 210. Execution of the sequences of instructions contained in memory 206 can cause processor 204 to perform the processes described herein. Alternatively, hard-wired circuitry can be used in place of or in combination with software instructions to implement the present teachings. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.

[0212] The term “computer-readable medium” (e.g., data store, data storage, etc.) or “computer-readable storage medium” as used herein refers to any media that participates in providing instructions to processor 204 for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media can include, but are not limited to, magnetic or optical disks, such as storage device 210. Examples of volatile media can include, but are not limited to, dynamic memory, such as memory 206. Examples of transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 202.

[0213] Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, another memory chip or cartridge, or any other tangible medium from which a computer can read.

[0214] In addition to computer-readable medium, instructions or data can be provided as signals on transmission media included in a communications apparatus or system to provide sequences of one or more instructions to processor 204 of computing device 200 for execution. For example, a communication apparatus may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the disclosure herein. Representative examples of data communications transmission connections can include, but are not limited to, telephone modem connections, wide area networks (WAN), local area networks (LAN), infrared data connections, NFC connections, etc.

[0215] It should be appreciated that the methodologies described herein, flow charts, diagrams and accompanying disclosure can be implemented using computing device 200 as a standalone device or on a distributed network or shared computer processing resources such as a cloud computing network.

[0216] The methodologies described herein may be implemented by various means depending upon the application. For example, these methodologies may be implemented in hardware, firmware, software, or any combination thereof. For a hardware implementation, the processing unit may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.

[0217] In various embodiments, the methods of the present teachings may be implemented as firmware and / or a software program and applications written in conventional programming languages such as C, C++, Python, etc. If implemented as firmware and / or software, the embodiments described herein can be implemented on a non-transitory computer-readable medium in which a program is stored for causing a computer to perform the methods described above. It should be understood that the various engines described herein can be provided on a computer system, such as computing device 200, whereby processor 204 would execute the analyses and determinations provided by these engines, subject to instructions provided by any one of, or a combination of, memory components 206 / 208 / 210 and user input.

E. Non-Limiting Exemplary Computer Implemented Methods

[0218] FIGS. 3A, 3B, and 3C are block diagrams illustrating example computer-implemented methods 300A, 300B, and 300C, respectively, for developing disease state prediction models with improved fairness and predicting disease states using such models, according to nonlimiting embodiments of the present disclosure. One or more blocks or processes described in the blocks of FIGS. 3A, 3B, and / or 3C may be performed by one or more computing devices (e.g., computing device 200). For example, the one or more blocks or processes may be performed by the processor 204 of computing device 200 based on instructions provided by any one of, or a combination of memory components 206 / 208 / 210 and user input (e.g., provided via the input device 214), as will be discussed herein. In various embodiments, the disease state referenced by these methods may include but is not limited to Alzheimer’s Disease (AD).

[0219] Specifically, FIG. 3A is a block diagram illustrating an example method 300A for improving disease state prediction models from unlabeled electronic medical records (EMRs).

[0220] In various embodiments, method 300A may begin at block 302 with a computing device (e.g., processor 204 of computing device 200) receiving an EMR dataset including patient-specific EMR data for a plurality of patients. As this EMR dataset may be further refined, updated, and / or improved in subsequent blocks and / or methods, the received EMR dataset may be referred to herein as the first EMR dataset or the initial EMR dataset. The first EMR dataset may include a first EMR subset for a first population of the plurality of patients that are labeled as having a positive indication for the disease state and an unlabeled EMR subset for a remainder of patients having unconfirmed indications for the disease state.

[0221] In some embodiments, the computing device may receive the EMR dataset from an EMR system or database storing EMR data of patients. In some embodiments, the EMR data may be a scraped, deidentified, and / or abridged form of the EMR of the patient, in order to preserve the privacy and data security of the patient, or to ensure the minimization of patient-specific health data. In some embodiments, the processor may further process the EMR dataset to prepare it for machine learning model training. For example, the processor may extract or determine values from the received EMR data (e.g., via text recognition or object recognition) for a set of relevant features or parameters. The values for the set of relevant features may be used to form feature vectors for model training.
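
As a non-limiting illustration, forming a feature vector from extracted values may be sketched in Python as follows. The feature names, the binary indicator encoding, and the inclusion of age as a final element are assumptions made for illustration; the disclosure does not prescribe a particular encoding.

```python
# Hypothetical feature set; the names are illustrative only.
FEATURES = ["dementia", "memory_loss", "vascular_dementia"]


def to_feature_vector(patient_indications, age, features=FEATURES):
    """Encode one patient as binary indicators per feature, followed by age."""
    present = set(patient_indications)
    return [1.0 if name in present else 0.0 for name in features] + [float(age)]


vector = to_feature_vector(["memory_loss"], age=74)
```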

[0222] At block 304, the computing device (e.g., via processor 204) may train a machine learning model using the first EMR dataset so that the trained machine learning model is configured to predict a disease state outcome based on EMR data input. As the machine learning model trained at block 304 may be distinguishable from other machine learning models used in subsequent blocks, methods, and / or processes described herein, the machine learning model may be referred to herein as a first machine learning model. As used herein, a disease state outcome may include a positive indication for the disease state (e.g., a patient is diagnosed as having or determined to have AD) or a negative indication for the disease state (e.g., a patient is diagnosed as not having or determined to not have AD).

[0223] In some embodiments, training the first machine learning model may involve determining a set of weights to assign to a set of features of the first EMR dataset. For example, the training may involve inputting feature vectors formed from a vector of values assigned to the set of features, determining an initial set of weights for that set of features, and iteratively adjusting the weights until the set of weights for the set of features is able to predict a corresponding outcome (e.g., the confirmed positive indications of the disease state). In some embodiments, such features determined or extracted from the first EMR dataset may include but are not limited to: an indication of dementia; an indication of memory loss; an indication of chronic pain; an indication of cerebral degeneration; an indication of a secondary malignancy of lymph nodes; an indication of osteoarthrosis; an indication of inflammatory and toxic neuropathy; an indication of a disturbance of skin sensation; an indication of septicemia; an indication of vascular dementia; an indication of astigmatism; an indication of acute posthemorrhagic anemia; an indication of melanomas of skin; an indication of a respiratory failure; an indication of a secondary malignancy of bone; an indication of an antineoplastic and / or an immunosuppressive drug causing an adverse effect; an indication of morbid obesity; an indication of a voice disturbance; an indication of alopecia; or an indication of an acute pain. As previously discussed, each feature may be assigned a weight (e.g., for the trained machine learning model) indicative of how significant the feature may be to predicting the disease state outcome. In some embodiments, the aforementioned list of features is arranged in order of significance (from high to low) in its ability to predict the disease state outcome.
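
As a non-limiting illustration, the iterative adjustment of feature weights may be sketched in Python as follows. The sketch assumes a logistic-regression form of generalized linear model fitted by batch gradient descent, with hypothetical function names, learning rate, and epoch count; the disclosure does not limit the first machine learning model to this form or fitting procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Start from zero weights and adjust them by gradient descent on log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this instance under the current weights.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_features(names, w):
    """Order feature names by absolute weight, from high to low."""
    ranked = sorted(zip(names, w), key=lambda pair: abs(pair[1]), reverse=True)
    return [name for name, _ in ranked]


# Toy data: the first feature determines the label, the second does not.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
order = rank_features(["dementia", "alopecia"], w)
```

Ranking by absolute weight is one simple way to express feature significance; it presumes the features are on comparable scales, as the binary indicators in this toy example are.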

[0224] At block 306, the computing device may input the unlabeled EMR subset of the received first EMR dataset into the trained machine learning model to predict disease state outcomes for a second population of patients that are reliable negative indications of the disease state. For example, as previously discussed in relation to block 102 of FIG. 1, the machine learning model may be used to predict the reliable negative indications from the unlabeled EMR dataset by determining a probabilistic gap for each patient represented in the unlabeled EMR dataset. In some embodiments, a reliable negative indication for a given patient represented by an EMR data may be determined based on one or more of the following criteria: 1) a probabilistic gap for the given patient is smaller than the smallest observed probabilistic gap of the labeled positive instances; 2) an age of the given patient during a last visit is above a predetermined threshold (e.g., at least 70 years); or 3) the EMR data of the given patient does not indicate any phenome (PHE) code falling in the PHE code exclusion range for the disease state (e.g., the code exclusion range for AD being 290-292.99).
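
As a non-limiting illustration, applying the criteria above to a single patient may be sketched in Python as follows. The sketch assumes that all three criteria are required together, although the text permits one or more, and it represents PHE codes as numeric values; the function and parameter names are hypothetical.

```python
# Example exclusion range for AD taken from the text above.
AD_EXCLUSION_RANGE = (290.0, 292.99)


def is_reliable_negative(gap, age_last_visit, phe_codes, min_lp_gap,
                         min_age=70, exclusion=AD_EXCLUSION_RANGE):
    """Return True when a patient meets all three reliable negative criteria."""
    low, high = exclusion
    has_excluded_code = any(low <= code <= high for code in phe_codes)
    return (
        gap < min_lp_gap                 # criterion 1: gap below the LP minimum
        and age_last_visit >= min_age    # criterion 2: age threshold
        and not has_excluded_code        # criterion 3: no code in excluded range
    )


result = is_reliable_negative(-0.6, 75, [401.1], min_lp_gap=0.1)
```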

[0225] At block 308, the computing device may generate an updated EMR dataset. As the updated dataset updates the first EMR dataset, and may continue to be refined, updated, and / or improved for fairness in subsequent blocks and / or methods described herein, the updated EMR dataset in block 308 may be referred to herein as the second EMR dataset. The second EMR dataset may include: the first EMR subset for the first population of patients that is labeled to indicate the first population as having a confirmed positive indication for the disease state; and a second EMR subset for the second population of the plurality of patients and labeled to indicate the second population as having the reliable negative indication for the disease state.

[0226] FIG. 3B is a block diagram illustrating an example method 300B for improving disease state prediction models from unlabeled electronic medical records (EMRs). In some embodiments, one or more blocks of method 300B may be performed after one or more blocks of method 300A (e.g., based on the second EMR dataset generated using method 300A).

[0227] In various embodiments, method 300B may begin with the computing device (e.g., processor 204 of computing device 200) receiving, at block 322, an EMR dataset. The EMR dataset may include: a first EMR subset for a first population of a plurality of patients and labeled to indicate the first population as having a confirmed positive indication for a disease state (e.g., AD); and a second EMR subset for a second population of the plurality of patients and labeled to indicate the second population as having a reliable negative indication for the disease state. For example, the EMR dataset may be the second EMR dataset generated using method 300A discussed in relation to FIG. 3A. The plurality of patients may be categorized into a plurality of demographic groups of patients. For example, the demographic groups may include but are not limited to a White demographic group, a Hispanic Latino (HL) demographic group, and a non-Hispanic African American (NH-AfAm) demographic group. The categorization may assist in bias mitigation techniques described herein for historically understudied and underdiagnosed demographic groups.

[0228] At block 324, the computing device (e.g., via processor 204) may train a machine learning model using the EMR dataset (e.g., the second EMR dataset generated via method 300A) so that the trained machine learning model is configured to predict a disease state outcome based on EMR data input. In some embodiments, the machine learning model used in block 324 may be distinguishable from the machine learning model used in method 300A, and may thus be referred to herein as the second machine learning model.

[0229] In some embodiments, training the second machine learning model may involve determining a set of weights to assign to a set of features of the EMR dataset (e.g., the second EMR dataset generated via method 300A). For example, the training may involve inputting feature vectors formed from a vector of values assigned to the set of features, determining an initial set of weights for that set of features, and iteratively adjusting the weights until the set of weights for the set of features is able to accurately predict a corresponding outcome (e.g., the confirmed positive indications of the disease state). In some embodiments, such features determined or extracted from the EMR dataset may include but are not limited to: an indication of dementia; an indication of memory loss; an indication of an altered mental status; an indication of a neurological disorder; an indication of an abnormality of gait; an indication of a mild cognitive impairment; an indication of a vascular dementia; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of a delirium due to conditions classified elsewhere; an indication of a major depressive disorder; an indication of a cerebral degeneration; an indication of a cerebral ischemia; an indication of an encephalopathy; an indication of a developmental delay and / or disorder; an indication of a urinary tract infection; an indication of a urinary incontinence; an indication of a transient alteration of awareness; an indication of decubitus ulcer; an indication of aphasia / speech disturbance; or an indication of malaise and fatigue. As previously discussed, each feature may be assigned a weight (e.g., for the trained machine learning model) indicative of how significant the feature may be to predicting the disease state outcome. In some embodiments, the aforementioned list of features is arranged in order of significance (from high to low) in its ability to predict the disease state outcome.

[0230] At block 326, the computing device may receive a plurality of unlabeled group-specific EMR datasets for a remainder of the plurality of patients outside of the first and second populations. That remainder of patients may have unconfirmed indications for the disease state and therefore their corresponding EMR data may be unlabeled. The plurality of unlabeled group-specific EMR datasets may be respectively associated with the plurality of demographic groups of patients. For example, an unlabeled group-specific EMR dataset may include EMR data for patients in a demographic group who have unconfirmed indications of the disease state (e.g., whether the patients have been diagnosed for AD is yet to be determined).

[0231] In some embodiments, as previously discussed in relation to block 104 in FIG. 1, positive and negative labels may be assigned to a subset of unlabeled EMR data of the second EMR dataset (e.g., outside of the first and second EMR subsets) using probabilistic criteria that are specific to the demographic group. The assignment of these labels based on the probabilistic criteria may increase the size of the training set while mitigating bias that may arise from overrepresentation of instances belonging to the particular demographic group prior to the training step in block 324 (e.g., as part of a pre-processing bias mitigation). The labeled positive and reliable negative instances (of the first and second EMR subsets, respectively) may then be used as input for training the second machine learning model (e.g., at block 324). In some embodiments, the updated predicted probabilities for the labeled positive and reliable negative instances from the trained machine learning model may enable the training data to be subset by demographic groups (e.g., group-specific EMR datasets) to obtain demographic group-specific observed probabilistic gaps of the labeled positive instances (ΔPLP) and demographic group-specific observed probabilistic gaps of the reliable negative instances (ΔPRN).

[0232] At block 328, the computing device may input, for each demographic group of patients, a respective unlabeled group-specific EMR dataset to the machine learning model (e.g., second machine learning model) to predict a disease state outcome for patients in the respective demographic group. In some embodiments, inputting, for each demographic group, the respective unlabeled group-specific EMR dataset further includes: mitigating a bias associated with each demographic group by adjusting the respective unlabeled group-specific EMR dataset. For example, as explained in relation to block 104 of FIG. 1, for each demographic group, additional positive (AP) instances may be determined based on having a probabilistic gap that is greater than the smallest group-specific probabilistic gap of the labeled positive instances (ΔPLP). Furthermore, additional negative (AN) instances may be determined based on having a probabilistic gap that is smaller than the largest group-specific probabilistic gap of the reliable negative instances (ΔPRN). Among AP instances satisfying the probabilistic gap criterion, those with the largest probabilistic gap may be selected such that the prevalence of total positive labels (LP and AP) for each demographic group matches the corresponding population disease state prevalence obtained from external data (e.g., Medicare claims).

[0233] In some embodiments, a separate machine learning model may be trained (e.g., at block 324) and applied (e.g., at block 328) for each demographic group. For example, the training of group-specific machine learning models may involve using EMR data from the first and second EMR subsets for patients associated with the respective demographic group.

[0234] At block 330, the computing device may generate, based on the labeled EMR dataset and the predicted disease state outcomes, an updated EMR dataset labeling the first and second populations and the remainder of the plurality of patients. For example, the updated EMR dataset may include: a first EMR subset for the first population of patients, a second EMR subset for the second population of patients, a third EMR subset for a third population of the plurality of patients having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population of the plurality of patients having an additional negative indication for the disease state. As the updated dataset updates the second EMR dataset, and may continue to be refined, updated, and / or improved for fairness in subsequent blocks and / or methods described herein, the updated EMR dataset at block 330 may be referred to herein as the third EMR dataset.

[0235] FIG. 3C is a block diagram illustrating an example method 300C for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data. In some embodiments, one or more blocks of method 300C may be performed after one or more blocks of method 300B (e.g., based on the third EMR dataset generated using method 300B).

[0236] In various embodiments, method 300C may begin with a computing device (e.g., processor 204 of computing device 200) receiving (at block 332) an EMR dataset of patient-specific EMR data of a plurality of patients indicating disease state outcomes that are known or determined. As previously discussed, the disease state outcome may refer to a positive indication for a disease state or a negative indication for the disease state. In some embodiments, the EMR dataset may include: a first EMR subset for a first population of the plurality of patients having a confirmed positive indication for the disease state, a second EMR subset for a second population of the plurality of patients having a reliable negative indication for the disease state, a third EMR subset for a third population of the plurality of patients having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population of the plurality of patients having an additional negative indication for the disease state. For example, the received EMR dataset may comprise the third EMR dataset generated via method 300B of FIG. 3B, and may be referred to herein as the third EMR dataset for simplicity. The plurality of patients may be categorized into a plurality of demographic groups.

[0237] At block 334, the computing device may train a machine learning model using the third EMR dataset so that the trained machine learning model is configured to output a disease state outcome based on EMR data input. In some embodiments, the machine learning model used in block 334 may be distinguishable from the machine learning models used in methods 300A and 300B, and may thus be referred to herein as the third machine learning model.

[0238] In some embodiments, training the third machine learning model may involve determining a set of weights to assign to a set of features of the third EMR dataset. For example, the training may involve inputting feature vectors formed from a vector of values assigned to the set of features, determining an initial set of weights for that set of features, and iteratively adjusting the weights until the set of weights for the set of features is able to accurately predict a corresponding outcome (e.g., the confirmed positive indications of the disease state). In some embodiments, such features determined or extracted from the third EMR dataset may include but are not limited to: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; an indication of a major depressive disorder; an indication of a symptom concerning nutrition, metabolism, and / or development; an indication of an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon. As previously discussed, each feature may be assigned a weight (e.g., for the trained machine learning model) indicative of how significant the feature may be to predicting the disease state outcome. In some embodiments, the aforementioned list of features is arranged in order of significance (from high to low) in its ability to predict the disease state outcome.

[0239] In some embodiments, the computing device may further validate the trained machine learning model. For example, the computing device may rely on a validation dataset formed using the first EMR subset (e.g., of the confirmed positive indications of the disease state). Furthermore, the computing device may adjust the second EMR subset, the third EMR subset, and / or the fourth EMR subset of the EMR dataset based on the validation.

[0240] At block 336, the computing device may mitigate, for each demographic group, a bias in the machine learning model. In some embodiments, the bias mitigation may involve receiving, for each demographic group, a group benefit equality (GBE) value; and tuning the machine learning model based on the GBE value for each demographic group. For example, as previously discussed in relation to block 108 of FIG. 1, bias mitigation may be performed by selecting the probability cutoff that optimizes the GBE value for each demographic group for unlabeled EMR data in a validation dataset and may be further applied to the third EMR dataset. In some embodiments, an optimization algorithm such as but not limited to the Nelder-Mead optimization algorithm may be used.
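
As a non-limiting illustration, group-specific cutoff selection may be sketched in Python as follows. Several assumptions are made for illustration only: GBE is taken to be the number of predicted positives divided by the number of reference positives in a group, with a value of 1 treated as optimal; the reference labels are supplied by the caller, since the choice of reference for unlabeled data is not specified above; and a plain grid search over cutoffs stands in for the Nelder-Mead algorithm. All function names are hypothetical.

```python
def group_benefit_equality(probs, y_ref, cutoff):
    """Assumed GBE: predicted positives divided by reference positives."""
    n_ref_pos = sum(y_ref)
    if n_ref_pos == 0:
        raise ValueError("GBE is undefined without reference positives")
    n_pred_pos = sum(1 for p in probs if p >= cutoff)
    return n_pred_pos / n_ref_pos


def select_cutoffs(probs_by_group, y_ref_by_group, grid=None):
    """Pick, per group, the grid cutoff whose GBE is closest to 1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    cutoffs = {}
    for group, probs in probs_by_group.items():
        y_ref = y_ref_by_group[group]
        cutoffs[group] = min(
            grid,
            key=lambda c: abs(group_benefit_equality(probs, y_ref, c) - 1.0),
        )
    return cutoffs


# Illustrative validation data for two hypothetical groups.
probs_by_group = {"A": [0.9, 0.8, 0.3, 0.1], "B": [0.6, 0.4, 0.35, 0.2]}
y_ref_by_group = {"A": [1, 1, 0, 0], "B": [1, 1, 0, 0]}
cutoffs = select_cutoffs(probs_by_group, y_ref_by_group)
```

In this toy example the two groups receive different cutoffs because their score distributions differ, which is the intended effect of tuning the cutoff per demographic group rather than applying one global cutoff.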

[0241] At block 338, the computing device may predict a disease state outcome for a target patient by inputting EMR data associated with the target patient into the trained machine learning model. For example, the computing device may receive patient-specific EMR data associated with the target patient having an unconfirmed indication of the disease state. The target patient may be associated with a demographic group of the plurality of demographic groups. The computing device may predict the disease state outcome for the target patient by inputting the EMR data associated with the target patient into the trained third machine learning model. In some embodiments, a GBE value and / or probabilistic criteria associated with the demographic group of the patient may be further used to tune the machine learning model used for predicting the disease state outcome of the target patient.

[0242] In some embodiments, any machine learning model described herein (e.g., first machine learning model, second machine learning model, and third machine learning model) may include, but is not limited to, at least one of a decision tree, a classification model, a deep learning model, a neural network, a linear discriminant analysis model, a quadratic discriminant analysis model, a support vector machine, a random forest algorithm, a nearest neighbor algorithm (e.g., a k-Nearest Neighbors algorithm), a combined discriminant analysis model, a k-means clustering algorithm, an unsupervised model, a semi-supervised model, a generalized linear model, a multivariable regression model, a penalized multivariable regression model, or another type of model. In various embodiments, the machine learning model may comprise any number of or combination of the models or algorithms described above.

[0243] In describing the various embodiments, the specification may have presented a method and / or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and / or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments. Similarly, any of the various system embodiments may have been presented as a group of particular components. However, these systems should not be limited to the particular set of components, nor their specific configuration, communication and physical orientation with respect to each other. One skilled in the art should readily appreciate that these components can have various configurations and physical orientations (e.g., wholly separate components, units and subunits of groups of components, different communication regimes between components).

[0244] Although specific embodiments and applications of the disclosure have been described in this specification, these embodiments and applications are exemplary only, and many variations are possible.

EXAMPLES

[0245] These examples are provided for illustrative purposes only and not to limit the scope of the claims provided herein.

[0246] In the following Examples, semi-supervised positive unlabeled learning (SSPUL) coupled with racial bias mitigation was performed for equitable prediction of undiagnosed AD from diverse populations at UCLA Health using electronic medical records. SSPUL achieved superior sensitivity (0.77-0.81) and area under the precision recall curve (AUCPR) (0.81-0.87) across non-Hispanic white, non-Hispanic African American, Hispanic Latino, and East Asian groups compared to supervised baseline models (sensitivity: 0.39-0.53; AUCPR: 0.3-0.7). SSPUL was also shown to exhibit superior fairness, as evidenced by the lowest cumulative parity loss. Top shared and distinct features among labeled and unlabeled AD patients, including neurological features (e.g., memory loss) and non-neurological features (e.g., decubitus ulcer), were identified. Validation was performed using polygenic risk scores, which were higher in labeled and predicted positives than in predicted negatives among non-Hispanic white (NH-white), Hispanic Latino (HL), and East Asian (EA) groups (p < 0.001).

[0247] Example 1 provides a detailed description of the computational and analytical methods that were subsequently applied throughout Examples 2-9. These methods were used in a consistent manner across the remaining Examples to generate training data, assign pseudo-labels, train classification models, and evaluate discrimination, calibration, interpretability, fairness, and robustness. Accordingly, Examples 2-9 build upon and apply the methodological pipeline established in Example 1 to demonstrate the performance and utility of the disclosed system across multiple validation settings.

EXAMPLE 1. Methods

[0248] This Example provides an overview of the full methodological pipeline that forms the foundation of the disclosed system. The methods established in this Example are applied throughout Examples 2-9 infra to demonstrate the performance and utility of the present disclosure across multiple evaluation settings.

Data Sources

[0249] Patient data used in the present disclosure were derived from the UCLA Data Discovery Repository (DDR), a de-identified EMR containing longitudinal records of patients enrolled in the UCLA Health System, including demographics, diagnosis codes, procedures, laboratory tests, medications, and hospital admissions. Johnson et al., Cell Genomics 3:100243 (2023). For model validation using genetic data, samples linked to genetic information collected by the UCLA ATLAS Community Health Initiative (ATLAS) were used as a holdout set (refer to Validation of Holdout Set Predictions Using Polygenic Risk Scores and APOE-ε4). ATLAS biological samples were collected during routine clinical lab work performed at a UCLA Health laboratory and were then genotyped using a customized Illumina Global Screening Array. Participants watched a short video explaining the initiative’s goals, and their consent was recorded. For training and testing, non-ATLAS patients with selected EMR features were used. As the EMR and genetic data were de-identified, the study was exempt from human subject research regulations.

Sampling of Positive (AD) and Unlabeled Data

[0250] Patients with non-missing demographics (age, sex, and self-reported race and ethnicity in the non-Hispanic white (NH-white), non-Hispanic African American (NH-AfAm), Hispanic Latino (HL), or East Asian (EA) group) were included. Patients were further included using the following criteria: (1) an average of at least 1 encounter per year and a record length of at least 5 years were required to ensure regular health system visits; and (2) patients were required to be between the ages of 65 and 90 during their last visit.
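
The inclusion criteria above can be sketched as a simple filter. This is a minimal illustration, not the implementation used in the study: the `PatientRecord` type and its field names are hypothetical, and the age bounds are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Optional

# Self-reported race and ethnicity groups named in the text.
ELIGIBLE_GROUPS = {"NH-white", "NH-AfAm", "HL", "EA"}


@dataclass
class PatientRecord:  # hypothetical record layout, for illustration only
    age_at_last_visit: Optional[float]  # years
    sex: Optional[str]
    race_ethnicity: Optional[str]
    n_encounters: int
    record_length_years: float


def is_eligible(p: PatientRecord) -> bool:
    """Apply the demographic, record, and age criteria from the text."""
    # Non-missing demographics within one of the four studied groups.
    if p.age_at_last_visit is None or p.sex is None:
        return False
    if p.race_ethnicity not in ELIGIBLE_GROUPS:
        return False
    # Record length of at least 5 years.
    if p.record_length_years < 5:
        return False
    # Average of at least 1 encounter per year.
    if p.n_encounters / p.record_length_years < 1:
        return False
    # Age between 65 and 90 at the last visit (bounds assumed inclusive).
    return 65 <= p.age_at_last_visit <= 90
```

For example, a 72-year-old patient with 12 encounters over 6 years passes the filter, while the same patient with only 4 encounters does not.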

[0251] From the records of eligible patients, comorbidities represented by International Classification of Diseases (ICD)-10 diagnosis codes that were not risk factors of AD were removed. Diagnoses in chapters XV (Pregnancy, childbirth and the puerperium), XVI (Certain conditions originating in the perinatal period), and XVII (Congenital malformations, deformations and chromosomal abnormalities) were excluded. Diagnoses in chapter XX (External causes of morbidity and mortality) were also excluded, except those beginning with “W,” as they included slipping, tripping, stumbling, and falls, which are known AD comorbidities. In addition, F02.80 (Dementia in other diseases classified elsewhere without behavioral disturbance) was excluded because it was a concurrent diagnosis code with G30 (AD).
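
A minimal sketch of this code-level filter follows. It assumes that ICD-10 chapters can be identified by the leading letter of the code (O for chapter XV, P for XVI, Q for XVII, and V through Y for XX); the function name is hypothetical.

```python
def keep_icd10_code(code: str) -> bool:
    """Return True if an ICD-10 code survives the exclusions in the text.

    Assumption: the chapter is inferred from the first letter of the code.
    """
    code = code.strip().upper()
    # F02.80 co-occurs with G30 (AD) and is dropped explicitly.
    if code == "F02.80":
        return False
    first = code[:1]
    # Chapters XV (O), XVI (P), and XVII (Q) are excluded.
    if first in {"O", "P", "Q"}:
        return False
    # Chapter XX spans V-Y; only "W" codes (slips, trips, falls) are kept.
    if first in {"V", "X", "Y"}:
        return False
    return True
```

Under these assumptions, `keep_icd10_code("W19.XXXA")` keeps a fall-related code, while `keep_icd10_code("V43.52")` drops a transport-accident code.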

[0252] The ICD-10 codes with maximum granularity were then mapped to phecodes, manually curated groups of ICD codes intended to capture clinically meaningful concepts, using Phecode Map 1.2 with ICD-10-CM codes. Bastarache, Annu. Rev. Biomed. Data Sci. 4: 1-19 (2021). This mapping reduced the dimensionality of the data and avoided redundant information from ICD codes within the same group, which may have been highly correlated. Only the first encounter for each phecode was retained.

[0253] After patient- and record-level filtering was performed, positive and unlabeled samples were selected based on the presence and absence, respectively, of at least one diagnosis of AD, indicated by G30. A total of 1000 random 80/10/10 stratified train/validation/test splits by labeled AD status were performed on non-ATLAS patients (N = 97,403). A Kruskal-Wallis test followed by the Games-Howell test, or Pearson’s chi-squared test, was performed to determine statistical significance among the distributions. ATLAS patients (N = 18,305) were held out for model validation using genetic data.

Selection of Enriched Diagnoses in AD Patients

[0254] To select important phecodes for modeling, a logistic regression model [parglm (v. 0.1.7)] with AD as a binary outcome variable was fit on training data for each of the 1,638 unique phecodes, excluding 290.11 (AD), over 1000 random splits. Unlabeled patients were treated as controls, although they contained some positive cases. The goal of feature selection was to identify phecodes most strongly associated with AD in an unbiased manner for subsequent steps in the PUL framework. Assigning negative labels from the onset based on the absence of proxy AD ICD codes or medications was avoided as it would not have aligned with the goal of the PUL framework. Logistic regression was chosen as the feature selection method for computational efficiency and to control for covariates. Min-max scaling was performed on patient age during last visit, record length, record density, number of encounters, and number of diagnoses, and these variables were adjusted for in the models along with sex and self-reported race and ethnicity. Phecodes of high risk for AD were determined based on statistical significance (p < 0.05, Wald test) following Bonferroni correction (0.05 / 1,638), and a prevalence of at least 1% in all patients. Across all training sets from the 1000 splits, a total of 458 unique phecodes were identified as significant. The mean number of significant phecodes per training set was 280.

Baseline Models

[0255] In this study, all models were classification models aimed at identifying undiagnosed AD cases using proxy ICD codes and medications. Here, “prediction” referred to classifying potential existing cases, not forecasting future development of AD.

[0256] As baselines for comparison, two models were tested, both of which were trained on noisy negative labels (i.e., labeling all unlabeled patients as negative). Although it was obviously incorrect to assume that all unlabeled patients were negative, this assumption was popular in practice as it enabled the direct application of supervised binary classification. Further, training on positive and noisy unlabeled data was considered reasonable because the class posterior (the probability that an instance belonged to the positive class) for classifiers trained on positive and negative labeled data was monotonically related to the posterior for classifiers trained on positive and unlabeled data. In other words, as the class prior probability increased for the positive vs. negative classifier, it also increased for the positive vs. unlabeled classifier (although not necessarily linearly).

[0257] The two baseline models were as follows: (1) a supervised model trained using only demographics and a list of manually curated AD risk factors supported by expert domain knowledge and evidence from the literature [Supervised (risk factors)]; and (2) a supervised model trained using all significant features from feature selection (refer to Selection of Enriched Diagnoses in AD Patients) [Supervised (full)]. The following AD risk factor phecodes were included in model 1: 249 (Secondary diabetes mellitus), 250.2 (Type 2 diabetes), 250.22 (Type 2 diabetes with renal manifestations), 250.24 (Type 2 diabetes with neurological manifestations), 250.25 (Diabetes type 2 with peripheral circulatory disorders), 272.1 (Hyperlipidemia), 401 (Hypertension), and 401.1 (Essential hypertension). To obtain final predictions for each model, the probability cutoff that maximized the Matthews correlation coefficient (MCC) for unlabeled data in the validation set was selected and applied to the test set.

Semi-Supervised Positive Unlabeled Learning with Pre- and Post-Processing Racial Bias Mitigation

[0258] A 4-step positive unlabeled learning (PUL) framework, referred to as semi-supervised positive unlabeled learning (SSPUL), was developed herein to obtain AD predictions for unlabeled patients in the test set (FIGS. 6A-6B). To enable learning from positive and unlabeled data, it was assumed that labeled AD cases were selected at random and that a probabilistic gap existed such that positive instances that more closely resembled negative instances were less likely to be labeled. The selected-at-random assumption was considered a weaker variant of the selected-completely-at-random assumption, which stated that labeled instances were an i.i.d. sample of the positive distribution. In contrast, the selected-at-random assumption stated that labeled instances were a biased sample from the positive distribution. This assumption of the labeling mechanism was considered reasonable because there were a multitude of reasons why some AD patients were diagnosed while others were not. Examples included providers’ assumption that cognitive changes were normal rather than pathologic, providers not following up on the symptoms and complaints of AD patients from underrepresented groups, and socioeconomically disadvantaged AD patients’ lack of access to healthcare. Therefore, NH-white patients with high socioeconomic status and access to unbiased providers were more likely to be diagnosed (i.e., be a labeled positive instance).

[0259] For the first step, the PUL framework was implemented by training a model to classify labeled positives (LPs) and unlabeled patients using a GLM classifier. To identify reliable negatives (RNs) from unlabeled patients, the probabilistic gap for each patient was calculated, defined as the difference between the probability of an instance having a label given its features and the complement: ΔPr(x) = Pr(y = 1 | x) - Pr(y = 0 | x). RNs were determined based on the following criteria: (1) the probabilistic gap was smaller than the smallest observed probabilistic gap of the LPs (ΔP_LP); (2) age during the last visit was at least 70 years; and (3) the patient did not have any phecode falling in the AD phecode exclusion range (290-292.99). To minimize potential bias that could have been introduced by extreme values (i.e., bias resulting from the smallest probabilistic gap being an outlier), the probabilistic gaps of 500 LPs were randomly sampled 1000 times, and the mean of the minimums was taken. RNs were manually reviewed by a domain expert for validation (refer to Validation of Reliable Negatives).
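
The reliable-negative criteria in this step can be sketched as follows. This is an illustration only: the function names are hypothetical, the predicted probability is assumed to come from an already-trained step 1 classifier, and sampling of LP gaps is assumed to be without replacement, which the text does not specify.

```python
import random
from statistics import mean


def probabilistic_gap(p_labeled: float) -> float:
    """Gap between Pr(y = 1 | x) and its complement, i.e. 2p - 1."""
    return p_labeled - (1.0 - p_labeled)


def robust_min_gap(lp_gaps, sample_size=500, n_draws=1000, seed=0):
    """Mean of the minimum gap over repeated random samples of LPs.

    Assumption: each draw samples without replacement.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(lp_gaps))
    return mean(min(rng.sample(lp_gaps, k)) for _ in range(n_draws))


def in_ad_exclusion_range(phecode: float) -> bool:
    """AD phecode exclusion range given in the text (290-292.99)."""
    return 290 <= phecode <= 292.99


def is_reliable_negative(gap, age_at_last_visit, phecodes, lp_min_gap):
    """Apply the three RN criteria to one unlabeled patient."""
    return (
        gap < lp_min_gap
        and age_at_last_visit >= 70
        and not any(in_ad_exclusion_range(c) for c in phecodes)
    )
```

Averaging the per-sample minimums yields a threshold that is never below the single smallest LP gap, so one outlying LP has less influence on which unlabeled patients qualify as RNs.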

[0260] For the second step, positive and negative labels for a subset of the remaining unlabeled patients were assigned using race- and ethnicity-specific probabilistic criteria in order to increase the size of the training set while mitigating bias that could arise from overrepresentation of patients belonging to a particular race and ethnicity prior to training (i.e., pre-processing bias mitigation). To do this, LPs and RNs were first used as input for training a separate classifier. After updated predicted probabilities for LPs and RNs were obtained from the trained classifier, the training data were subset by race and ethnicity to obtain race-specific ΔP_LP and race-specific observed probabilistic gaps of RNs (ΔP_RN). For each race and ethnicity subset, additional positives (APs) were determined based on having a probabilistic gap greater than the smallest ΔP_LP, while additional negatives (ANs) were determined based on having a probabilistic gap smaller than the largest ΔP_RN. Among all APs satisfying the probabilistic-gap criterion, those with the largest probabilistic gaps were selected such that the prevalence of total positive labels (LP and AP) for each race and ethnicity matched U.S. Census-adjusted AD population prevalence estimates from the Chicago Health and Aging Project (10% for NH-white, 18.6% for NH-AfAm, and 14% for HL) and meta-analysis from previous work (7.4% for EA).
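
A minimal sketch of the pseudo-labeling rule for a single race and ethnicity subset is shown below. The function name and argument layout are hypothetical, and the prevalence denominator is assumed to be the number of training patients in that subset, which the text does not state explicitly.

```python
# Target prevalences quoted in the text for each group.
TARGET_PREVALENCE = {"NH-white": 0.10, "NH-AfAm": 0.186, "HL": 0.14, "EA": 0.074}


def assign_pseudo_labels(gaps, n_lp, n_group, min_lp_gap, max_rn_gap,
                         target_prevalence):
    """Pick additional positives (APs) and negatives (ANs) for one group.

    gaps: dict mapping unlabeled patient id -> probabilistic gap.
    n_lp: number of labeled positives in the group.
    n_group: group size used as the prevalence denominator (assumption).
    """
    # AP candidates exceed the smallest LP gap; largest gaps come first.
    candidates = sorted(
        (pid for pid, g in gaps.items() if g > min_lp_gap),
        key=lambda pid: gaps[pid],
        reverse=True,
    )
    # Keep only as many APs as needed to reach the target prevalence.
    n_ap = max(0, round(target_prevalence * n_group) - n_lp)
    additional_positives = candidates[:n_ap]
    chosen = set(additional_positives)
    # ANs fall below the largest RN gap and are never also APs.
    additional_negatives = [
        pid for pid, g in gaps.items() if g < max_rn_gap and pid not in chosen
    ]
    return additional_positives, additional_negatives
```

Capping the number of APs per group is what ties the pseudo-labeled positive rate to the external prevalence estimate rather than to the labeling rate observed in the EMR.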

[0261] For the third step, all labeled (LPs) and pseudo-labeled (RNs, APs, ANs) patients were used as input for training a final classifier, which was then applied to the test set.

[0262] In the final step, post-processing bias mitigation was performed by selecting the probability cutoff that optimized the group benefit equality (GBE) for each race and ethnicity (refer to Evaluation Metrics) for unlabeled data in the validation set, and this cutoff was applied to the test set. For GBE, the expected prevalence was calculated from AD ICD codes and proxy AD ICD codes and medications for a given racial and ethnic group. The Nelder-Mead optimization algorithm was implemented using the optim function in R such that the error from the target GBE of 1 was minimized. The discrimination performance and fairness of baseline models and vanilla 2-step PUL (classifier 2) were compared with SSPUL (final classifier), and statistical significance was reported using Wilcoxon signed-rank test with Bonferroni correction in Tables 1-2 and FIG. 4. To investigate the impact of GBE optimization on discrimination performance and fairness, discrimination performance and fairness metrics using race-specific optimal GBE cutoffs or the MCC cutoff for the baseline models and SSPUL, respectively, were also reported in Tables 3-5.

Table 1. Test set performance of SSPUL (GBE) and baseline models

Cutoffs for baseline models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoffs for SSPUL were selected by optimizing the GBE for each race / ethnicity in the validation set using positive and proxy labels. AUC=area under the curve, AUCPR=area under the precision recall curve, B. accuracy=balanced accuracy, EA=East Asian, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews correlation coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, SSPUL=semi-supervised positive unlabeled learning.

Table 2. Test set performance of models with modified cutoffs and vanilla 2-step PUL

Cutoffs for MCC-optimized models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoffs for GBE-optimized models were selected by optimizing the GBE for each race / ethnicity in the validation set. SSPUL (GBE) is shown for reference. B. accuracy=balanced accuracy, EA=East Asian, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews correlation coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, PUL=positive unlabeled learning, SSPUL=semi-supervised positive unlabeled learning.

Table 3. Comparison of test set fairness with respect to NH-AfAm and NH-white using MCC or GBE cutoffs

Metrics reported are means of differences between NH-AfAm and NH-white for 1000 random test sets with 95% CI. Cutoffs for MCC-optimized models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoffs for GBE-optimized models were selected by optimizing the GBE for each race / ethnicity in the validation set. BA=balanced accuracy, EO=equal opportunity, GBE=group benefit equality, MCC=Matthews correlation coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, NPV=negative predictive value, PUL=positive unlabeled learning, SSPUL=semi-supervised positive unlabeled learning.

Table 4. Comparison of test set fairness with respect to HL and NH-white using MCC or GBE cutoffs

Metrics reported are means of differences between HL and NH-white for 1000 random test sets with 95% CI. Cutoffs for MCC-optimized models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoffs for GBE-optimized models were selected by optimizing the GBE for each race / ethnicity in the validation set. BA=balanced accuracy, EO=equal opportunity, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews correlation coefficient, NH-white=non-Hispanic white, NPV=negative predictive value, PUL=positive unlabeled learning, SSPUL=semi-supervised positive unlabeled learning.

Table 5. Comparison of test set fairness with respect to EA and NH-white using MCC or GBE cutoffs

Metrics reported are means of differences between EA and NH-white for 1000 random test sets with 95% CI. Cutoffs for MCC-optimized models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoffs for GBE-optimized models were selected by optimizing the GBE for each race / ethnicity in the validation set. BA=balanced accuracy, EA=East Asian, EO=equal opportunity, GBE=group benefit equality, MCC=Matthews correlation coefficient, NH-white=non-Hispanic white, NPV=negative predictive value, PUL=positive unlabeled learning, SSPUL=semi-supervised positive unlabeled learning.

Classifier Selection

[0263] To select the best classifier for each model, the AutoML feature from the H2O package (v. 3.44.0.3) in R (v. 4.2.2) was used to train and tune four classifiers: Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), and Distributed Random Forest (DRF). Oversampling of patients in the minority class was implemented prior to training to address class imbalance. For both baseline models and steps 1 and 2 of SSPUL, the minority class was LPs. For step 3 of SSPUL, the minority class was LPs plus additional (pseudo-labeled) positives.

[0264] To reduce training time over 1000 splits, AutoML was applied across 10 initial splits, and the classifier to be applied for the remaining 990 splits was then selected based on the best mean area under the precision recall curve (AUCPR) (Table 6). GLM was selected as the classifier for the baseline models and step 1 of SSPUL, and DRF was selected as the classifier for step 2 of SSPUL, as these classifiers achieved the best mean AUCPR evaluated on 10 initial training sets. A higher AUCPR indicated better separability of LPs from unlabeled patients in step 1 and from RNs in step 2, reflected in more confident predicted probabilities. This improved the quality of pseudo-labels by minimizing false positives and false negatives. For step 3 of SSPUL, XGBoost was selected as the classifier as it achieved the best mean AUCPR evaluated on 10 initial validation sets.

Table 6. Classifier selection for baseline models and each step of SSPUL

The classifiers for the baseline models (Generalized Linear Model) and SSPUL steps (Generalized Linear Model, Distributed Random Forest, XGBoost for steps 1, 2, and 3, respectively) were selected based on the best average AUCPR, evaluated on training or validation sets from 10 initial splits. Oversampling of the minority class was implemented prior to training to address class imbalance. For both baseline models and steps 1 and 2 of SSPUL, the minority class was labeled positives. For step 3 of SSPUL, the minority class was labeled positives + additional (pseudo-labeled) positives.

Validation of Reliable Negatives

[0265] One domain expert (TSC) manually reviewed the diagnoses and medications of 100 random RNs. No RN patient was prescribed a proxy medication for dementia. No RN patient had G chapter (diseases of the nervous system) or R40-46 (symptoms and signs involving cognition, perception, emotional state, and behavior) ICD codes that suggested AD or dementia, except for one RN with memory loss (R41.3). Common G or R40-46 codes among RNs were dizziness, cerebrovascular disease, or tremor. A few RNs were diagnosed with encephalopathy.

Validation of Test Set Predictions Using Proxy AD ICDs and Medications

[0266] Final prediction labels of the test set were validated using proxy AD diagnosis measures, which included alternative diagnosis ICD codes and medications for dementia. These proxy measures were considered silver standards and had been used previously to evaluate unlabeled cases predicted by semi-supervised learning. AD proxy ICD codes included: vascular dementia (F01, F01.5, F01.50, F01.51, F01.0, F01.1, F01.2, F01.3, F01.9, F01.511, F01.518); dementias with cerebral degenerations (G31.0, G31.01, G31.09, G31.1, G31.83); memory loss (R41.1, R41.2, R41.3); mild cognitive impairment (G31.84); unspecified dementias (F03.9, F03.90, F03.91, F03.911, F03.918, F02, F02.8, F02.80, F02.81, F02.0, F02.811, F02.818); senile dementia (F03); and other cerebral degenerations (G31.85). Medications included donepezil, rivastigmine, galantamine, memantine, memantine / donepezil combination, aducanumab, and lecanemab. Unlabeled patients without any proxies were considered healthy controls.

Validation of Holdout Set Predictions Using Polygenic Risk Scores and APOE-ε4

[0267] Polygenic risk scores (PRS) estimated the heritable risk of an individual for developing a particular disease by combining information from many SNPs associated with the disease or trait of interest in genome-wide association studies (GWAS). As another method for validation, PRS were generated using genotype data from ATLAS patients in a holdout set (N = 18,305). The late-onset AD GWAS conducted by Kunkle et al., Nat. Genet. 51:414-430 (2019), was selected for building PRS based on its large sample size (21,982 cases and 41,944 controls). Quality control (QC) was first performed using PLINK v1.9 following established guidelines. Subsequently, genotype imputation was performed using the Michigan Imputation Server (Das) to enhance the coverage of genetic variants. The LDpred2 tool from the bigsnpr package (v. 1.6.1) was used to build the PRS. This tool updated SNP weights based on linkage disequilibrium (LD) information from a reference population, which in this case was the European genetic ancestry sample from the 1000 Genomes Project. The sum of a patient’s risk allele dosages, weighted by risk allele effect sizes, was calculated to obtain the final PRS.
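
The final weighted sum can be illustrated with a short sketch. The SNP identifiers and weights below are made up for illustration, and skipping SNPs that are absent from a patient's genotype is an assumption; the LD-based weight adjustment performed by LDpred2 is not reproduced here.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of risk allele dosages weighted by per-SNP effect sizes.

    dosages: dict mapping SNP id -> risk allele dosage (0 to 2).
    weights: dict mapping SNP id -> effect size (already LD-adjusted).
    SNPs missing from `dosages` are skipped (assumption).
    """
    return sum(dosages[snp] * w for snp, w in weights.items() if snp in dosages)


# Hypothetical SNP ids and weights, for illustration only.
example_dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
example_weights = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 1.0}
example_prs = polygenic_risk_score(example_dosages, example_weights)
```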

[0268] To compare the PRSs of LPs and predicted labels, each final SSPUL classifier trained on one of 1000 random splits was applied to the holdout set, the mean PRS of predicted positives or predicted negatives from each split was measured, and the results were aggregated to obtain the mean of PRS means. LPs had a single PRS mean as they were independent of the model’s predictions. The mean of PRS means was stratified by self-reported race and ethnicity and compared across classifications using Wilcoxon signed-rank test with Bonferroni correction.

[0269] In addition to PRS, the model was validated using apolipoprotein E (APOE) ε4 allele count. APOE-ε4 was considered a significant genetic risk factor for AD and was associated with an increase in the levels of amyloid deposition that leads to AD onset. APOE-ε4 allele count was compared across classifications, and the analysis was stratified in the same manner as the PRS validation.

Evaluation Metrics

[0270] The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and precision were used to compare model discrimination performance, stratified by race and ethnicity. To account for class imbalance, balanced accuracy (BA) and AUCPR were also included. Balanced Brier score was additionally included to compare model calibration performance, which was indicative of the quality of the predicted probabilities with respect to the proxy-validated class labels. A well-calibrated model had higher predicted probabilities for proxy-validated AD cases than for controls and therefore had a lower balanced Brier score. Unlike the traditional Brier score, the balanced Brier score was robust to class imbalance as it was composed of stratified Brier scores, each calculated separately over positive or negative samples (Equations 1-3).

$$\text{Brier}^{+} = \frac{1}{N^{+}} \sum_{i:\, y_i = 1} (p_i - 1)^2 \qquad \text{(Equation 1)}$$

$$\text{Brier}^{-} = \frac{1}{N^{-}} \sum_{i:\, y_i = 0} (p_i - 0)^2 \qquad \text{(Equation 2)}$$

$$\text{Balanced Brier} = \text{Brier}^{+} + \text{Brier}^{-} \qquad \text{(Equation 3)}$$

Here p_i is the predicted probability for patient i, y_i is the proxy-validated class label, and N^+ and N^- are the numbers of positive and negative samples, respectively.
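
The balanced Brier score can be computed with a few lines of code. This sketch assumes that each stratified score is the mean squared error within its class and that the two scores are summed, consistent with Equation 3; the function names are hypothetical.

```python
def stratified_brier(probs, labels, target):
    """Mean squared error of predicted probabilities within one class."""
    squared_errors = [
        (p - target) ** 2 for p, y in zip(probs, labels) if y == target
    ]
    if not squared_errors:
        raise ValueError(f"no samples with label {target}")
    return sum(squared_errors) / len(squared_errors)


def balanced_brier(probs, labels):
    """Sum of the positive-class and negative-class Brier scores."""
    return stratified_brier(probs, labels, 1) + stratified_brier(probs, labels, 0)
```

Because each class contributes its own average, adding a large number of well-predicted controls cannot hide poor calibration on the much smaller set of cases.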

[0271] To compare model fairness, differences between an unprivileged group (NH-AfAm, HL, or EA) and the privileged group (NH-white) in obtaining the favored outcome (AD) were assessed with respect to the following metrics: specificity, precision, BA, equal opportunity (EO), and group benefit equality (GBE) (Equation 4). EO asserted that the sensitivities for two groups were equal (Equation 5).

$$M_{\text{difference}} = M_{\text{unprivileged}} - M_{\text{privileged}}, \quad M \in \{\text{specificity}, \text{precision}, \text{BA}, \text{EO}, \text{GBE}\} \qquad \text{(Equation 4)}$$

$$\Pr(\hat{y} = 1 \mid \text{group} = \text{unprivileged},\, y = 1) = \Pr(\hat{y} = 1 \mid \text{group} = \text{privileged},\, y = 1) \qquad \text{(Equation 5)}$$

where unprivileged ∈ {NH-AfAm, HL, EA}, privileged = NH-white, ŷ is the predicted label, and y is the proxy-validated label. Because the prevalence of AD differed among NH-white, NH-AfAm, HL, and EA groups, an even proportion of patients in each group identified to have AD was unlikely.

[0272] For this reason, group benefit equality (GBE) was also considered. GBE enforced that the rate at which AD was diagnosed based on ICD codes and predicted to occur within a race and ethnicity, and the rate at which it occurred in the EMR by virtue of AD ICD code diagnosis and proxy AD ICD codes and medications, were equal. This relationship was expressed as a ratio (Equation 6), which was 1 in the ideal case, indicating that the number of AD predictions combined with LPs matched that of diagnosed and proxy-validated AD cases.
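
As an illustration, the GBE ratio and the race-specific cutoff selection described for the final SSPUL step can be sketched as below. The orientation of the ratio (predictions plus LPs over diagnosed plus proxy-validated cases) is an assumption based on the description above, the function names are hypothetical, and a simple grid search is swapped in for the Nelder-Mead optimizer used in the study.

```python
def group_benefit_equality(n_predicted_positive, n_labeled_positive,
                           n_proxy_validated):
    """Ratio of predicted AD burden to EMR-observed AD burden in one group.

    Numerator: AD predictions among unlabeled patients plus LPs.
    Denominator: diagnosed (LP) plus proxy-validated cases (assumed layout).
    """
    observed = n_labeled_positive + n_proxy_validated
    return (n_predicted_positive + n_labeled_positive) / observed


def best_gbe_cutoff(probs, n_labeled_positive, n_proxy_validated, grid=None):
    """Pick the cutoff whose GBE is closest to the target of 1.

    probs: predicted probabilities for unlabeled patients in one group.
    A grid search stands in for Nelder-Mead here (illustrative swap).
    """
    grid = grid or [i / 100 for i in range(1, 100)]

    def error(cutoff):
        n_pred = sum(1 for p in probs if p >= cutoff)
        gbe = group_benefit_equality(n_pred, n_labeled_positive,
                                     n_proxy_validated)
        return abs(gbe - 1.0)

    return min(grid, key=error)
```

Running `best_gbe_cutoff` once per race and ethnicity gives each group its own threshold, which is what allows the predicted AD rate to track the group-specific expected prevalence.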

[0273] By aggregating the absolute differences across unprivileged groups for each metric, metric-specific parity losses were obtained (Equation 7), which were summed to yield the cumulative parity loss (Equation 8).

Factor Analysis of Mixed Data and Selection of True Positive Subset

[0274] Taking the top 20 predictive features of the final classifier by scaled importance, factor analysis of mixed data (FAMD) was performed to visualize the final classifications of the test set of one split in a 2-dimensional Euclidean feature space. FAMD was used as a principal component method that enabled dimensionality reduction of both continuous variables (e.g., age during last visit) and categorical variables (e.g., phecodes) by scaling the former to unit variance and by crisp coding the latter, followed by scaling using the specific scaling of multiple correspondence analysis. FAMD was performed using the FactoMineR package (v. 2.11) in R (v. 4.2.2).

[0275] One potential benefit of PUL was predicting true positives (TPs) (i.e., proxy-validated predicted positives) with a feature space that differed from that of LPs, potentially revealing novel informative features. This TP subset in the data was identified by selecting predicted positives that did not overlap with LPs (except one outlier) in a 2-dimensional Euclidean feature space visualized via FAMD, and subsequently filtering for TPs.

[0276] To better understand the relationship between the TP subset, LPs, and top 20 features, the coordinates of their projections onto the same feature space were plotted. To accomplish this, FAMD was re-run after discretizing top continuous features, and the coordinates of the projections on the categories of each feature (N = 15 top phecodes × 2 categories + 5 top continuous features (age during last visit, number of encounters, number of diagnoses, record length, record density) × 4 categories = 50 total categories) were obtained on the feature space from the first FAMD run. A two-proportion z-test was then performed to compare the prevalence of features with coordinates that were closer to those of the TP subset relative to LPs and vice versa.

SHAP Analysis

[0277] Originating from cooperative game theory, SHapley Additive exPlanations (SHAP) analysis was widely used for explaining machine learning models by assigning to each feature an importance value that represented its contribution to the model’s output. SHAP values were determined by calculating the marginal contribution of each feature to the final model prediction with respect to all possible subsets of features. To assess the contribution of each TP subset-specific feature to the final predictions of the TP subset and LP sample described in Factor Analysis of Mixed Data and Selection of True Positive Subset, the mean absolute SHAP values of each feature for the two groups were compared using Mann-Whitney U test with Bonferroni correction. For this analysis, Tree SHAP, a variant of SHAP that efficiently computed exact Shapley values for tree-based models, was implemented using H2O’s predict_contributions function. As input, all significant features from feature selection were included.
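
To make the "marginal contribution over all possible subsets" idea concrete, the sketch below computes exact Shapley values by brute-force enumeration for a toy value function. It is not Tree SHAP and is only practical for a handful of features; the feature names and the value function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset of the other features.

    value_fn: maps a frozenset of present features to a model output.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_value(present):
    """Hypothetical model output: two main effects plus one interaction."""
    v = 0.0
    if "memory_loss" in present:
        v += 0.3
    if "age" in present:
        v += 0.1
    if {"memory_loss", "age"} <= present:
        v += 0.2
    return v


toy_phi = shapley_values(toy_value, ["memory_loss", "age"])
```

In this toy case the interaction term is split evenly between the two features, and the values sum to the difference between the full-model output and the empty-model output.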

[0278] Potential disparities in feature importance across racial and ethnic groups were investigated by examining differences in the magnitude and direction of SHAP values. To accomplish this, 100 unlabeled patients of each race and ethnicity were sampled from a random test set, and differences in feature contributions between NH-white and unprivileged groups were evaluated. The magnitude of feature influence was quantified by averaging the absolute SHAP values of each top feature within each racial and ethnic group. To determine differences in the direction of SHAP values, the correlation between each feature’s value and its corresponding SHAP value was analyzed separately for each group using the Pearson correlation coefficient.

[0279] The null hypothesis tested was that no difference existed in the correlation between feature values and SHAP values across NH-white and unprivileged groups. To simulate the null distribution of correlation differences, racial group labels (NH-white and NH-AfAm; NH-white and HL; NH-white and EA) were randomly shuffled 1000 times under the assumption that if correlation was independent of race and ethnicity, then permuting these labels should not yield any meaningful correlation difference. The observed correlation difference was compared to the null distribution to assess whether it was more extreme than what would be expected by random chance. The p-value was defined as the proportion of correlation differences resulting from permutation that were equal to or more extreme than the observed correlation difference.

Sensitivity Analysis of Self-Reported Race and Ethnicity Features

[0280] To investigate the impact of self-reported races and ethnicities on differential sensitivity performance across models, a sensitivity analysis was conducted. In this analysis, each race and ethnicity was recoded as another race and ethnicity (e.g., NH-white to NH-AfAm) while all other features in the test set were kept fixed. Predicted probabilities were then obtained via the final XGBoost classifier, and the original classification cutoff was applied to acquire updated sensitivities. For models that used GBE cutoffs, the optimal GBE cutoff for NH-white prior to recoding was applied to NH-white patients who were recoded to NH-AfAm, as an example.
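
The recoding procedure can be sketched as follows. The dictionary-based patient layout, the `predict_proba` callable, and the toy model are hypothetical stand-ins for the test set and the final classifier.

```python
def sensitivity_after_recoding(patients, labels, predict_proba, cutoffs,
                               from_group, to_group):
    """Sensitivity for `from_group` positives when scored as `to_group`.

    All other features stay fixed, and the cutoff chosen for the original
    group (before recoding) is applied to the new probabilities.
    """
    cutoff = cutoffs[from_group]
    true_positives = 0
    positives = 0
    for patient, label in zip(patients, labels):
        if patient["race_ethnicity"] != from_group or label != 1:
            continue
        positives += 1
        recoded = dict(patient, race_ethnicity=to_group)  # copy, then recode
        if predict_proba(recoded) >= cutoff:
            true_positives += 1
    return true_positives / positives


def toy_model(patient):
    """Hypothetical biased model that penalizes one group."""
    penalty = 0.25 if patient["race_ethnicity"] == "NH-AfAm" else 0.0
    return patient["score"] - penalty


toy_patients = [
    {"race_ethnicity": "NH-white", "score": 0.9},
    {"race_ethnicity": "NH-white", "score": 0.6},
]
toy_labels = [1, 1]
toy_cutoffs = {"NH-white": 0.5}
```

With this toy model, recoding NH-white patients as NH-AfAm drops sensitivity from 1.0 to 0.5, which is the kind of shift the analysis is designed to expose.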

[0281] If the model were racially unbiased, the predicted probability of a patient having AD, given the same set of features except race and ethnicity, should have been the same or similar. Therefore, applying the GBE cutoff before and after recoding should have yielded a similar set of predicted labels. After recoding, the gain or loss in sensitivity for each original race and ethnicity was compared across models.

Sensitivity Analysis of Proxy Label Distribution Shifts

[0282] The different proxy subsets used for testing model robustness to proxy label distribution shifts included: (1) subsets in which each proxy ICD was removed one at a time; (2) subsets defined by grouping proxy ICDs into phecodes and removing one phecode at a time; (3) subsets removing five randomly selected ICDs at a time; and (4) a subset removing all proxy medications. After the ICD code(s) or medications for the respective subset were removed, patients who still had proxies were validated as cases, while those without proxies were validated as healthy controls. Based on each subset, race-specific GBE cutoffs were optimized, and discrimination performance was evaluated on 1000 random validation and test sets, respectively.

EXAMPLE 2. Data Discovery Repository Sample Description

[0283] The samples used in the present disclosure were derived from the UCLA Data Discovery Repository, a de-identified EMR with longitudinal patient records from the UCLA Health System. Samples linked to genetic information collected by the UCLA ATLAS Community Health Initiative (ATLAS) were used as a holdout set for model validation at the genotype level, while non-ATLAS samples were used for training and testing. An overview of the study design was depicted in FIG. 5.

[0284] Following age and record filtering, N = 129,203 patients were obtained. Patients with missing sex, race, or ethnicity were then excluded, resulting in N = 115,708 eligible patients (10% missing rate). Non-ATLAS samples consisted of 97,403 patients (FIG. 5, Table 7). Among labeled positive (LP) and unlabeled patients, statistically different distributions were observed across sex, self-reported NH-AfAm and EA races, record length, number of diagnoses, number of encounters, record density, and age during last visit (p < 0.05) (Table 7).

Table 7. Distributions of non-ATLAS patient demographics and EMR features, stratified by label (N = 97,403)

* AfAm = non-Hispanic African American, NH-white = non-Hispanic white.
a n (%); Median (IQR)
b Pearson’s Chi-squared test; Wilcoxon rank sum test.

[0285] Assuming that most unlabeled patients were negative, the higher median record length, number of diagnoses, number of encounters, record density, and age during last visit among LP patients were expected. Consistent with previously reported estimates of female AD prevalence, the percentage of female patients diagnosed with AD in the sample was approximately twice that of male patients diagnosed with AD.

[0286] Compared to the estimated population prevalence of AD in patients over 65 years old among NH-white, NH-AfAm, HL, and EA patients from longitudinal cohort studies 11 (10%, 18.6%, 14%, and 7.4%, respectively), the AD prevalence in the sample was significantly lower (4.3%, 5.8%, 4.3%, and 3.9%, respectively), indicating that all four races and ethnicities were heavily underdiagnosed.

EXAMPLE 3. SSPUL Training Set Composition and Stratified Predicted AD Prevalence

[0287] A 4-step SSPUL framework (FIGS. 6A-6B), described in Example 1, was leveraged to assign positive and negative labels to unlabeled training data based on the probabilistic gap and other criteria detailed in Example 1 and FIGS. 6A-6B. The probabilistic gap was defined as the difference between the probability of an instance having a label given its features and the complement: ΔPr(x) = Pr(y = 1 | x) - Pr(y = 0 | x). Reliable negatives (RNs) were identified in step 1 from unlabeled patients that had a probabilistic gap smaller than the smallest observed probabilistic gap of LP patients (ΔP_LP).
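For a binary label, Pr(y = 0 | x) = 1 - Pr(y = 1 | x), so the probabilistic gap reduces to 2·Pr(y = 1 | x) - 1 and ranges from -1 to 1. The step-1 rule for reliable negatives can be sketched as follows. The function names and the list-based inputs are illustrative assumptions, and the predicted probabilities are assumed to come from a previously trained classifier.

```python
def probabilistic_gap(p_pos):
    """Gap = Pr(y=1|x) - Pr(y=0|x), given p_pos = Pr(y=1|x) for a binary label."""
    return p_pos - (1.0 - p_pos)

def reliable_negatives(unlabeled_probs, lp_probs):
    """Indices of unlabeled patients whose gap is strictly below the smallest LP gap.

    unlabeled_probs, lp_probs: predicted Pr(y=1|x) for unlabeled and labeled
    positive (LP) patients, respectively (hypothetical inputs).
    """
    min_lp_gap = min(probabilistic_gap(p) for p in lp_probs)
    return [
        i for i, p in enumerate(unlabeled_probs)
        if probabilistic_gap(p) < min_lp_gap
    ]
```

With LP probabilities of 0.6 and 0.9, the smallest LP gap is about 0.2, so an unlabeled patient with probability 0.55 (gap of about 0.1) qualifies as a reliable negative while one with probability 0.7 (gap of about 0.4) does not.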

[0288] To attain racial fairness for the model, pre-processing bias mitigation was performed by assigning additional positive (AP) and negative (AN) labels to a subset of the remaining unlabeled data in the training set. APs were assigned to patients with probabilistic gaps above the smallest ΔP_LP for their race or ethnicity. APs were added until APs + LPs matched each group’s population prevalence estimate based on the Chicago Health and Aging Project and meta-analysis. ANs were assigned to patients with probabilistic gaps below the largest probabilistic gap of RNs (ΔP_RN) for their race or ethnicity.
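The group-specific assignment of APs and ANs can be sketched for a single race or ethnicity as below. Several details are assumptions made only for illustration: candidates are added in descending order of probabilistic gap, the prevalence target is computed against the group's size in the training set, and the function and parameter names are hypothetical.

```python
def assign_pseudo_labels(gaps, n_lp, n_group, prevalence, min_lp_gap, max_rn_gap):
    """Assign additional positives (APs) and additional negatives (ANs) for ONE group.

    gaps:        {patient_id: probabilistic gap} for remaining unlabeled patients
    n_lp:        number of labeled positives (LPs) already in the group
    n_group:     total number of patients in the group (assumed denominator)
    prevalence:  population prevalence estimate for the group (e.g., 0.14)
    min_lp_gap:  smallest probabilistic gap among the group's LPs
    max_rn_gap:  largest probabilistic gap among the group's reliable negatives
    """
    # Number of APs still needed so that APs + LPs match the target prevalence.
    budget = max(round(prevalence * n_group) - n_lp, 0)
    candidates = sorted(
        (pid for pid, g in gaps.items() if g > min_lp_gap),
        key=lambda pid: gaps[pid],
        reverse=True,  # assumption: most confident candidates first
    )
    aps = candidates[:budget]
    chosen = set(aps)
    ans = [pid for pid, g in gaps.items() if g < max_rn_gap and pid not in chosen]
    return aps, ans
```

Patients that meet neither criterion, or that exceed the AP budget, remain unlabeled in this sketch.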

[0289] In the last step, post-processing bias mitigation was implemented by selecting the classification cutoff for each race and ethnicity to optimize GBE in the validation set, ensuring that the prevalence of predicted positives matched that of LPs and proxy-validated positives. The cutoffs were then applied to the test set for classification. For clarity, the cutoff method was referred to in parentheses (e.g., SSPUL (GBE)).
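A minimal sketch of group-specific cutoff selection follows. It assumes that GBE is the ratio of predicted positives to known positives (LPs plus proxy-validated positives), which is consistent with the description that a GBE of 1 corresponds to matching prevalences. The candidate grid and all names are illustrative.

```python
def gbe(y_true, y_pred):
    """Group benefit equality, assumed here to be predicted positives / known positives."""
    return sum(y_pred) / sum(y_true)

def select_gbe_cutoff(probs, y_true, candidates):
    """Return the candidate cutoff whose GBE is closest to 1 (first one wins ties)."""
    def distance(cutoff):
        y_pred = [int(p >= cutoff) for p in probs]
        return abs(gbe(y_true, y_pred) - 1.0)
    return min(candidates, key=distance)

def select_cutoffs_by_group(validation, candidates):
    """validation: {group: (probs, y_true)} from the validation set (hypothetical layout)."""
    return {
        group: select_gbe_cutoff(probs, y_true, candidates)
        for group, (probs, y_true) in validation.items()
    }
```

The cutoffs returned for each group would then be applied unchanged to that group's patients in the test set.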

[0290] Together, LPs and pseudo-labels (APs, ANs, and RNs) made up on average 80% of the total training set (95% CI: 69%, 85%), with on average 12.4%, 19.6%, 14.6%, and 9.8% of NH-white, NH-AfAm, HL, and EA positive labels (LP or AP), respectively. In contrast, baseline supervised models had substantially lower mean prevalences of positive labels (4.3%, 5.8%, 4.3%, and 3.9% for NH-white, NH-AfAm, HL, and EA, respectively).

[0291] FIG. 7A showed confusion matrices for APs and ANs, stratified by race and ethnicity. The mean sensitivity of pseudo-labels compared to LPs and proxy-validated positives across 1000 random train / validation / test splits was 83.8%, 92.3%, 89.0%, and 67.4%, while the mean false discovery rate was 4.1%, 6.1%, 7.6%, and 0.8% for NH-white, NH-AfAm, HL, and EA, respectively. Due to the lower population prevalence for EA (7.4%), fewer APs were added compared to other groups, resulting in a lower sensitivity and false discovery rate.

[0292] RNs and ANs made up on average 87% of the final training set (95% CI: 85%, 88%) for SSPUL, with a mean false omission rate of 1.6%, 1.4%, 1.4%, and 3.6% for NH-white, NH-AfAm, HL, and EA, respectively. Supervised models had a higher prevalence of negative labels (96% of the training set) and higher mean false omission rates (15%, 16.7%, 14.6%, and 16.1% for NH-white, NH-AfAm, HL, and EA, respectively).

[0293] For reference, the mean prevalence of LPs and proxy-validated positives in the training set was 18.7%, 21.5%, 18.3%, and 19.4% for NH-white, NH-AfAm, HL, and EA, respectively. In total, SSPUL (GBE) predicted on average 18.7%, 22%, 18.5%, and 19.3% of NH-white, NH-AfAm, HL, and EA patients to have AD - a significant increase from the labeled AD prevalences and closely matching the LP and proxy-validated AD prevalences (Table 8). In addition, although sex was not optimized for, SSPUL (GBE) predicted higher AD prevalence in females than in males (19.3% and 18.5%, respectively), consistent with established findings (Table 8).

Table 8. Distributions of final labels of test set, stratified by sex and self-reported race and ethnicity

Final labels reported are means of 1000 random test sets with 95% CI. AD=Alzheimer's Disease, HL=Hispanic Latino, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, SSPUL=semi-supervised positive unlabeled learning.

EXAMPLE 4. SSPUL Outperforms Baseline Supervised Models in Accurately Identifying Undiagnosed AD Patients

[0294] Table 1 showed the discrimination performance of SSPUL (GBE) relative to baseline models. As baselines for comparison, two models were tested, both of which were trained on noisy negative labels and had cutoffs selected based on maximizing the Matthews correlation coefficient (MCC) for unlabeled data in the validation set using proxy labels. Supervised (risk factors / MCC) was trained using only demographics and a list of manually curated AD risk factors, while Supervised (full / MCC) was trained using all significant features from feature selection. Compared to the baseline models, SSPUL (GBE) performed best across all races and ethnicities with respect to sensitivity (0.77-0.81), balanced accuracy (BA) (0.86-0.88), area under the receiver operating characteristic curve (AUC) (0.91-0.95), and area under the precision recall curve (AUCPR) (0.81-0.87) (p adj. < 0.001).

[0295] Supervised (full / MCC) performed best with respect to precision for NH-white, NH-AfAm, and EA (0.86, 0.83, and 0.88, respectively) (p adj. < 0.001). This performance may have been partially attributed to the presence of significantly more negative labels in the training set and, as a result, Supervised (full / MCC) on average making significantly fewer positive predictions (768 [95% CI: 622, 917] vs. 1423 [95% CI: 1345, 1507]) (p adj. < 0.001) and more negative predictions (8548 [95% CI: 8398, 8693] vs. 7893 [95% CI: 7809, 7970]) (p adj. < 0.001) compared to SSPUL (GBE). For the HL group, Supervised (full / MCC) and SSPUL (GBE) had similar precision (p adj. = 1). Supervised (full / MCC) had lower sensitivity across all races and ethnicities (0.39-0.52) (p adj. < 0.001 for all comparisons), predicting on average two times fewer positives than SSPUL (GBE) (FIG. 7B).

[0296] Notably, SSPUL (GBE) also outperformed vanilla 2-step PUL 31 using race-specific GBE cutoffs with respect to sensitivity, precision, and BA for NH-white, NH-AfAm, and HL (p adj. < 0.001) (Table 2), demonstrating the benefit of training using additional pseudo-labels. A similar trend was observed for EA, although sensitivity did not differ significantly (p adj. = 0.07). Overall, SSPUL (GBE) had the best sensitivity / precision tradeoff, marked by a substantial gain in mean sensitivity (+0.25 to +0.38) and relatively marginal loss in precision (0 to -0.11) across all races and ethnicities compared to Supervised (full / MCC).

[0297] Discrimination performance metrics were also evaluated for Supervised (full) when using GBE instead of MCC and for SSPUL when using MCC instead of GBE to distinguish the benefit of SSPUL and GBE optimization. When GBE was optimized rather than MCC for Supervised (full), the model still underperformed relative to SSPUL (GBE) with respect to sensitivity (0.57-0.62 vs. 0.77-0.81) (p adj. < 0.001), despite the adjusted cutoffs yielding more positive predictions overall (Table 2). Regardless of the cutoff selection method, SSPUL was superior to supervised learning in terms of discrimination performance.

[0298] Comparing cutoff methods for SSPUL, race- and ethnicity-specific sensitivity / precision tradeoffs were observed (Table 2). For NH-white, SSPUL (MCC) provided a modest gain in sensitivity (+5%) without loss in precision, while EA showed only minor changes (+1% sensitivity, -5% precision). Consistent with these patterns, the mean GBE for NH-white (1.05) and EA (0.95) under the MCC cutoff were close to those under the GBE cutoff (both 1), whereas greater discrepancies were observed for NH-AfAm and HL (FIG. 8A). Despite these tradeoffs, GBE optimization consistently yielded the best balance between sensitivity and precision across all models (Tables 1-2).

[0299] In addition to improved discrimination performance, SSPUL achieved the best calibration performance relative to baseline models (FIG. 9A). Calibration performance indicated the quality of model predictions with respect to proxy-validated labels, allowing predicted probabilities to be interpreted as long-run frequencies 40,41. For example, if a model’s predicted probability for a group (bin) of patients to have AD was 10%, then 10% of patients in that group would be expected to have AD. Each bin in FIG. 9A represented approximately 10% mean predicted probability.
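The binned comparison of predicted probability and observed frequency can be sketched with the standard expected calibration error (ECE) formula, shown here with ten equal-width bins. The binning scheme and function name are assumptions for illustration. The disclosure does not specify how its calibration error was computed beyond the bins shown in FIG. 9A.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |observed positive fraction - mean predicted probability| per bin.

    probs:  predicted probabilities in [0, 1]
    labels: 0/1 reference labels (e.g., proxy-validated labels)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(frac_pos - mean_prob)
    return ece
```

In a bin where the mean predicted probability exceeds the observed fraction of positives, the model is over-confident. The reverse indicates under-confidence.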

[0300] While model over-confidence was evident for parts of the SSPUL calibration curve (i.e., points in bins 7-9), SSPUL nonetheless achieved the lowest expected calibration error among models, indicating the best calibration performance (FIG. 9A). In contrast, the calibration curve for Supervised (full) suggested that the model was highly under-confident for almost all patients above the first bin. A similar, though less pronounced, pattern was observed for Supervised (risk factors). Consistent with these findings, FIG. 9B (original: Supplementary Fig. 2b) showed that most predicted probabilities of SSPUL for proxy-validated AD cases in an example test set were 0.9-1, while most predicted probabilities of the baseline models fell in the lower quartile. Moreover, this was reflected in SSPUL’s significantly lower balanced Brier score (0.17-0.25 vs. 0.8-0.88) and positive Brier score (0.09-0.2 vs. 0.8-0.88) compared to those of baseline models across all races and ethnicities (p adj. < 0.001) (FIG. 9C).

EXAMPLE 5. SSPUL Achieves Racial Fairness Without Compromising Discrimination Performance

[0301] In addition to discrimination performance, fairness across models was compared by measuring the differences between an unprivileged group (NH-AfAm, HL, or EA) and the privileged group (NH-white) in predicting undiagnosed AD with respect to discrimination performance metrics (Equation 4). The absolute differences across unprivileged groups for each metric were also aggregated to obtain metric-specific parity losses (Equation 7), which were then summed to obtain the cumulative parity loss (Equation 8). SSPUL (GBE) was the fairest relative to baseline models, as evidenced by the lowest cumulative parity loss (p adj. < 0.001) (FIG. 8B).
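The parity loss computation can be sketched as below. Equations 4, 7, and 8 are not reproduced in this passage, so the sketch assumes that the metric-specific parity loss is the sum of absolute differences between each unprivileged group and the privileged group, and that the cumulative parity loss is the sum over metrics. The data layout and function name are hypothetical.

```python
def parity_losses(metrics, privileged):
    """Metric-specific and cumulative parity losses.

    metrics:    {group: {metric_name: value}}
    privileged: name of the privileged (reference) group
    Returns (per_metric, cumulative), where per_metric maps each metric to the
    sum of absolute differences from the privileged group (assumed aggregation).
    """
    reference = metrics[privileged]
    per_metric = {
        name: sum(
            abs(values[name] - reference[name])
            for group, values in metrics.items()
            if group != privileged
        )
        for name in reference
    }
    return per_metric, sum(per_metric.values())
```

A value of zero for a metric means that every unprivileged group matched the privileged group on that metric.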

[0302] Stratifying by metric, SSPUL (GBE) had lower mean BA and mean precision parity losses (0.074 and 0.139, respectively) than Supervised (full / MCC) (0.091 and 0.186, respectively) (p adj. < 0.001). Compared to Supervised (risk factors / MCC), SSPUL (GBE) had lower mean specificity parity loss (0.034 vs. 0.074) but higher mean BA parity loss (0.074 vs. 0.064) (p adj. < 0.001). Compared to both baseline models, SSPUL (GBE) achieved the lowest mean equal opportunity (EO) and mean GBE parity losses (FIG. 8B), suggesting consistent improvement in fairness by narrowing group-level differences in sensitivity and predicted AD prevalence between unprivileged and privileged groups.

[0303] When examining group differences underlying parity loss for each metric, SSPUL (GBE) showed the lowest or second-lowest absolute mean differences across multiple metrics for all unprivileged / privileged comparisons (Table 9). The most striking absolute mean differences were in EO and GBE between NH-AfAm and NH-white. While SSPUL (GBE) had absolute mean EO and GBE differences of 0 and 0.03, respectively, Supervised (risk factors / MCC) had absolute mean differences of 0.08 and 0.16, and Supervised (full / MCC) had absolute mean differences of 0.06 and 0.1 (p adj. < 0.001).

Table 9. Test set fairness of SSPUL (GBE) and baseline models

Metrics reported are means of differences between an unprivileged group (NH-AfAm, HL, or EA) and the privileged group (NH-white) for 1000 random test sets with 95% CI. Cutoffs for baseline supervised models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoff for SSPUL model was selected by optimizing the GBE for each race / ethnicity in the validation set using positive and proxy labels. BA=balanced accuracy, EA=East Asian, EO=equal opportunity, GBE=group benefit equality, HL=Hispanic Latino, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white.

[0304] To evaluate whether the pseudo-labeling strategy promoted fairness as a pre-processing bias mitigation approach, a sensitivity analysis independent of post-processing cutoff selection was conducted (Table 10). Each race and ethnicity was recoded as another race and ethnicity (e.g., NH-white to NH-AfAm) while all other features in the test set were kept fixed, then the original classification cutoff was applied to obtain updated sensitivities for each model. For models that used GBE cutoffs, the optimal GBE cutoff for NH-whites prior to recoding was applied to NH-whites that were recoded to NH-AfAm.

Table 10. Changes in sensitivity across models after recoding self-reported race and ethnicity features

Mean changes in sensitivity from 1000 random test sets are reported with 95% CI. Cutoffs for supervised models were selected by maximizing the MCC for unlabeled data in the validation set using proxy labels. Cutoff for SSPUL model was selected by optimizing the GBE for each race / ethnicity in the validation set. The same cutoffs were applied after recoding. EA=East Asian, GBE=group benefit equality, HL=Hispanic Latino, MCC=Matthews correlation coefficient, NH-AfAm=non-Hispanic African American, NH-white=non-Hispanic white, SSPUL=semi-supervised positive unlabeled learning.

[0305] Minimal changes to sensitivity of the original race and ethnicity were observed for SSPUL (GBE) after recoding (maximum change of 3%). In contrast, larger changes were observed for Supervised (risk factors / MCC) (e.g., recoding NH-AfAm as EA decreased sensitivity on average by 10%) and for Supervised (full / MCC) (e.g., recoding HL as EA decreased sensitivity on average by 9%). These findings suggested that SSPUL, independent of the cutoff method, exhibited less racial bias than baseline models, as predicted AD probabilities given identical features aside from race and ethnicity remained stable after recoding, leading to minimal sensitivity shifts.

[0306] Through post-processing bias mitigation, fairness was further improved by selecting classification cutoffs that optimized the GBE for each race and ethnicity in the validation set. This approach consistently achieved a mean GBE of 1 for all races and ethnicities across all models (FIG. 8A). Supervised (risk factors / MCC), on the other hand, over-predicted with a mean GBE of 1.43, 1.53, 1.42, and 1.29, while Supervised (full / MCC) under-predicted with a mean GBE of 0.64, 0.73, 0.76, and 0.55 for NH-white, NH-AfAm, HL, and EA, respectively. SSPUL (MCC) slightly over-predicted for NH-white (1.05), NH-AfAm (1.13), and HL (1.15), and under-predicted for EA (0.95).

[0307] Selecting the cutoff by optimizing GBE also improved cumulative parity loss across all models (FIG. 4). The parity losses for EO and GBE for all models using GBE cutoffs were lower than those using the MCC cutoff (p adj. < 0.001) (FIG. 4). Additionally, BA parity loss decreased for the baseline models, and specificity parity loss decreased for SSPUL (p adj. < 0.001). Racial group differences for each metric for models using the maximum MCC cutoff versus those using the optimal GBE cutoff are reported in Tables 3-5.

EXAMPLE 6. Top Predictive Features

[0308] To infer meaningful phecodes and EMR features for AD prediction, the top 20 features from the final XGBoost classifier were extracted, averaged over 1000 splits (Table 11). Among these features, 13 were related to mental or neurological disorders (e.g., memory loss), and 5 were related to age, number of diagnoses, or healthcare utilization (e.g., number of encounters). Other features included screening for malignant neoplasms (1010.2) and decubitus ulcer (707.1). Four of the top 20 features overlapped with phecodes that mapped to proxy ICDs. Consistent with their importance for AD prediction, most of the top features had a higher prevalence in LPs and predicted positives compared to predicted negatives in an example test set (FIG. 10A). In fact, the top features related to neurological or mental disorders (290.1, 292.3, 292.4, 292, 290.2, 290.16, 292.2, and 348.8) were almost absent in predicted negatives.

Table 11. Variables of importance of final XGBoost classifier

[0309] Using the top 20 features, Factor Analysis of Mixed Data (FAMD) was performed on the predictions of the same test set in FIG. 10A to visualize them in a 2-dimensional Euclidean space (FIG. 10B). Plotting their coordinates along the first and second dimensions, LPs and predicted positives were observed to form overlapping clusters that were mostly distinct from those of predicted negatives, indicating clear separability with respect to the prediction labels based on the model’s top predictive features.

[0310] Given the substantial contribution of the top 20 features in driving model predictions, an assessment was performed to determine whether their influence was consistent across racial and ethnic groups. Disparities in feature importance may have indicated model bias and compromised group fairness. To assess whether SSPUL relied on the top predictive features differently across racial and ethnic groups, the magnitude and / or direction of the top 20 features’ SHAP values were examined between the privileged group (NH-white) and unprivileged groups (NH-AfAm, HL, EA) in a random test set.

[0311] The magnitude of feature influence was quantified by averaging the absolute SHAP values of each top feature within each racial and ethnic group. For most top features, absolute SHAP values did not differ significantly between NH-white and unprivileged groups (Mann-Whitney U test with Bonferroni correction, p adj. > 0.05). Exceptions included screening for malignant neoplasms (1010.2), record length, and other immunological findings (279.7). In addition to magnitude, the correlation between each feature’s value and its corresponding SHAP value within each racial and ethnic group was analyzed, and tests were performed to assess whether these correlations differed across NH-white and unprivileged groups. Across all comparisons, no statistically significant differences in correlation were found [NH-white vs. NH-AfAm (p = 0.84), NH-white vs. HL (p = 0.43), NH-white vs. EA (p = 0.2)].
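A permutation test on the difference in correlations between two groups can be sketched as follows. Treating "equal to or more extreme" as a two-sided comparison of absolute differences is an assumption, as are the function names, the fixed seed, and the `(feature_value, shap_value)` pair layout.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_difference(group_a, group_b):
    """Difference in feature-value/SHAP correlation between two groups of pairs."""
    return pearson(*zip(*group_a)) - pearson(*zip(*group_b))

def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Shuffle group membership n_perm times and compare to the observed difference.

    group_a, group_b: lists of (feature_value, shap_value) pairs.
    Returns (observed_difference, p_value), where p_value is the proportion of
    permuted differences at least as extreme (in absolute value) as observed.
    """
    rng = random.Random(seed)
    observed = correlation_difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # equivalent to permuting the group labels
        permuted = correlation_difference(pooled[:n_a], pooled[n_a:])
        if abs(permuted) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm
```

A large p-value, such as those reported above, indicates that the observed difference in correlation is typical of what label shuffling alone produces.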

[0312] Taken together, these findings suggested no meaningful variation in either SHAP directionality or magnitude across racial and ethnic groups, reinforcing the fairness of the model’s predictions. FIG. 11 visually illustrates the similarity in the direction and magnitude of SHAP values across NH-white, NH-AfAm, HL, and EA groups.

EXAMPLE 7. Predictive Features Corresponding to Undiagnosed AD

[0313] Given the limited number of positive labels that were selected at random, a unique objective of PUL was to predict positives that were feature-wise distinct from LPs in the training or test set in order to obtain a positive sample that more closely resembled the true positive (TP) distribution. To accomplish this, FAMD was performed using the top 20 predictive features, and a subset of TPs (i.e., proxy-validated predicted positives) was selected from the output feature space (FIG. 10B) with coordinates that did not overlap with those of LPs (except one outlier). Age during last visit, number of encounters, record density, record length, and number of diagnoses were then discretized. Subsequently, FAMD was re-run to obtain coordinates of the projections on the categories of each feature on the feature space from the first FAMD run (FIG. 10C).

[0314] As expected, both LPs and the TP subset formed clusters that deviated from points representing the negative class of the top 20 predictive features (i.e., the black triangles clustered around the origin in FIG. 10C). While the coordinates of LPs overlapped with those corresponding to the positive class of top neurological predictive features [e.g., 290.1 (dementias), 292 (neurological disorders), 348.8 (encephalopathy, not elsewhere classified), 290.2 (delirium due to conditions classified elsewhere), 290.16 (vascular dementia), 292.2 (mild cognitive impairment)], the coordinates of the TP subset were closer to those corresponding to the positive class of top non-neurological predictive features [427.9 (palpitations), 1010.2 (screening for malignant neoplasms), 279.7 (other immunological findings)].

[0315] To further evaluate these findings, the prevalence of the TP subset-specific features was compared with those overlapping with LPs in the test set (FIG. 10D). A higher prevalence of top non-neurological features was observed in the TP subset relative to LPs (p adj. < 0.01 for all features). In contrast, a lower prevalence of top neurological features was observed in the TP subset relative to LPs (p adj. < 0.001 for all features). Additionally, 427.9 and 279.7 were found to have a greater impact on the model output of the TP subset compared to that of LPs, as evidenced by a higher mean absolute SHAP value across all features (p adj. < 0.001) (FIG. 10E).

[0316] Although no significant difference in mean absolute SHAP values was observed for screening for malignant neoplasms (1010.2), the proportion of patients with positive SHAP values for 1010.2 differed significantly between the TP and LP subsets (Fisher’s exact test, p < 0.001). While 1010.2 contributed both positively and negatively to model predictions in the TP subset, it had a predominantly negative effect in the LP subset.

EXAMPLE 8. Genotype Validation of Holdout Set Predictions

[0317] Genetic predisposition plays an important role in AD, with more than 40 genetic risk loci identified. For this reason, the model’s predictions were also validated on a holdout test set containing ATLAS patients using polygenic risk scores (PRS) calculated using a late-onset AD GWAS (FIG. 12A). To accomplish this, each final SSPUL classifier trained on one of 1000 random splits was applied to the holdout set. The mean PRS of predicted positives or predicted negatives from each split was measured, and the resulting values were aggregated to obtain the mean of PRS means. LPs had a single PRS mean because they were independent of the model’s predictions. Overall, the mean PRS of LPs (1.01) and the mean of PRS means of predicted positives (0.59 [95% CI: 0.55, 0.64]) were observed to be significantly higher than the mean of PRS means of predicted negatives (0.41 [95% CI: 0.40, 0.42]) (p adj. < 0.001).
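The "mean of PRS means" aggregation can be sketched as below. The dictionary layouts and function names are illustrative assumptions. In this sketch, each split contributes one mean for the requested predicted class, and splits with no patients in that class are skipped.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def mean_of_split_means(prs, predictions_per_split, label):
    """Average, across splits, of the mean PRS among patients predicted as `label`.

    prs:                   {patient_id: polygenic risk score} for the holdout set
    predictions_per_split: list of {patient_id: 0 or 1}, one dict per classifier
    label:                 1 for predicted positives, 0 for predicted negatives
    """
    split_means = []
    for predictions in predictions_per_split:
        scores = [prs[pid] for pid, y in predictions.items() if y == label]
        if scores:  # skip splits with no patients predicted as `label`
            split_means.append(mean(scores))
    return mean(split_means)
```

The same aggregation applies to any per-patient quantity, such as an allele count, by substituting it for the PRS values.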

[0318] When the analysis was stratified by race and ethnicity, a similar trend was observed for NH-white and HL, but not for NH-AfAm. In the NH-AfAm group, the mean of PRS means of predicted positives (3.83 [95% CI: 3.76, 3.94]) was slightly lower than that of predicted negatives (3.96 [95% CI: 3.93, 3.98]) (p adj. < 0.001). This observation may have been partially explained by the lack of generalizability of PRSs calculated from European subjects.

[0319] A similar analysis was also performed comparing the mean of the Apolipoprotein E (APOE) ε4 allele count means (FIG. 12B). For NH-white, NH-AfAm, and EA, the mean ε4 allele count of LPs and the mean of ε4 allele count means of predicted positives were observed to be significantly higher than the mean of ε4 allele count means of predicted negatives (p adj. < 0.001). Consistent with these findings, the mean proportion of patients with at least one ε4 allele was higher among LPs and predicted positives than among predicted negatives overall and within NH-white and EA (p < 0.001) (FIG. 13A).

[0320] The proportion of NH-AfAm or HL patients with at least one ε4 allele did not differ significantly between predicted positives and predicted negatives (p = 1 and 0.99 for NH-AfAm and HL comparisons, respectively). In addition to the slightly lower mean ε4 allele count observed in predicted positives relative to predicted negatives for HL, this discrepancy may have been partly attributed to the lower AD risk associated with APOE-ε4 in NH-AfAms and HLs compared to NH-whites.

[0321] The mean PRS and mean ε4 allele count, stratified by race and ethnicity, were also measured for proxy-validated labels and compared with the mean of PRS means and mean of ε4 allele count means for the model’s predictions (FIGS. 13B-13C). Similar trends were observed for both mean PRS and mean ε4 allele count for proxy-validated positives and proxy-validated negatives across all races and ethnicities, further supporting the model’s predictions at the genotype level. The differences between proxy-validated positives and proxy-validated negatives were not significant for mean ε4 allele count among NH-AfAm and HL, nor for mean PRS among NH-AfAm, HL, and EA. These results were likely due to smaller sample sizes for these groups compared to NH-white. In addition, the mean ε4 allele count findings may have been driven by the lower APOE-ε4 AD risk in NH-AfAm and HL, as previously noted.

EXAMPLE 9. SSPUL Demonstrates Robustness to Proxy Label Distribution Shifts

[0322] Proxy labels were leveraged to select the best-performing final classifier of SSPUL (GBE) and to optimize classification cutoffs using race-specific GBE. While necessary for model selection and tuning in the absence of gold standard labels, this could have unintentionally biased SSPUL (GBE) toward learning the defined proxy label distribution. To test the robustness of the model to shifts in the proxy label distribution, a sensitivity analysis was conducted using different proxy subsets (detailed in Example 1 supra).

[0323] Across nearly all subsets, mean sensitivity and precision remained stable across all racial and ethnic groups (Table 12). Excluding subsets involving F03.90 (F03.90, 290.1, random5) and R41.3 (R41.3, 292.3, random1), the mean sensitivity and precision ranged from 0.75-0.81 and 0.76-0.82, respectively. These ranges were comparable to those obtained using the full proxy definition (0.77-0.81 for sensitivity and 0.77-0.80 for precision).

Table 12. Sensitivity analysis of proxy label distribution shifts

290.16 consists of the ICDs F01.50, F01.51, F01.511, and F01.518.
290.12 consists of the ICDs G31.01, G31.09, and G31.83.
292.3 consists of the ICDs R41.1, R41.2, and R41.3.
290.1 consists of the ICDs F03.90, F03.91, F03.911, F03.918, F02.80, F02.81, F02.811, and F02.818.
random1 consists of the ICDs G31.84, F01.518, G31.83, F03.918, and R41.3.
random2 consists of the ICDs F03.911, F02.818, F02.80, F01.51, and G31.1.
random3 consists of the ICDs F03.918, F02.80, R41.2, G31.09, and F03.91.
random4 consists of the ICDs G31.83, F02.811, F03.911, F02.818, and F01.518.
random5 consists of the ICDs G31.85, F01.51, F03.90, G31.83, and F03.918.
EA=East Asian, HL=Hispanic Latino, ICD=International Classification of Diseases, NH=non-Hispanic, NH-AfAm=non-Hispanic African American.

[0324] Removal of subsets involving F03.90 or R41.3 was more extreme because these ICD codes were the most common sole proxies, accounting for 6.1% and 44.6% of all validated cases (excluding labeled positives), respectively. This removal led to a significant decrease in the estimated AD diagnosis and proxy prevalence across races and ethnicities and to corresponding shifts in GBE-optimized cutoffs. As a result, precision decreased (0.72-0.76 and 0.53-0.67 for subsets involving F03.90 and R41.3, respectively) because some predictions that would have been true positives under the full proxy definition were reclassified as false positives. Sensitivity also decreased (0.71-0.75 and 0.59-0.68 for subsets involving F03.90 and R41.3, respectively) due to a stricter cutoff classifying more negatives, some of which were false.

[0325] These findings suggested that SSPUL (GBE) was generally robust to shifts in proxy label definitions. Mean sensitivity and precision remained consistent across a wide range of proxy subsets, with notable drops only in cases where the excluded proxies were the only ones validating a substantial number of positive cases. However, the exclusion of such proxies constituted an extreme circumstance.

[0326] While the disclosure has been particularly shown and described with reference to specific embodiments (some of which are preferred embodiments), it should be understood by those having skill in the art that various changes in form and detail can be made therein without departing from the spirit and scope of the present disclosure as disclosed herein.

Claims

We claim:

1. A computer-implemented method for improving disease state prediction models from unlabeled electronic medical records (EMRs), the method comprising:
(a) receiving, by a processor, an initial electronic medical record (EMR) dataset of patient specific EMR data for a plurality of patients, wherein the initial EMR dataset comprises a first labeled EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a remainder population of the plurality of patients excluding the first population, wherein the first labeled EMR subset indicates that the first population has confirmed positive indications for a disease state, and wherein the unlabeled EMR subset indicates that the remainder has unconfirmed indications for the disease state;
(b) inputting the unlabeled EMR subset to a trained machine learning model to predict disease state outcomes for a second population of the plurality of patients, wherein the trained machine learning model is configured to predict the disease state outcomes based on the initial EMR dataset, wherein each disease state outcome of the disease state outcomes is a reliable negative indication for the disease state or a confirmed positive indication for the disease state, and wherein the second population corresponds to a subset of the remainder population; and
(c) generating, based on the predicted disease state outcomes that are reliable negative indications, an updated EMR dataset comprising:
(i) the first labeled EMR subset for the first population having the confirmed positive indication for the disease state, and
(ii) a second labeled EMR subset for the second population having the reliable negative indication for the disease state.

2. The method of claim 1, wherein the disease state comprises a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition.

3. The method of claim 1 or 2, wherein the disease state is Alzheimer’s Disease.

4. The method of any one of claims 1-3, wherein the patient specific EMR data comprises a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of chronic pain; an indication of cerebral degeneration; a secondary malignancy of lymph nodes; an indication of osteoarthrosis; an indication of inflammatory and toxic neuropathy; an indication of a disturbance of skin sensation; an indication of septicemia; an indication of vascular dementia; an indication of astigmatism; an indication of acute posthemorrhagic anemia; an indication of melanomas of skin; an indication of a respiratory failure; an indication of a secondary malignancy of bone; an indication of an antineoplastic and / or an immunosuppressive drug causing an adverse effect; an indication of morbid obesity; an indication of a voice disturbance; an indication of alopecia; or an indication of an acute pain.

5. The method of any one of claims 1-4, wherein the patient specific EMR data comprises two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome.

6. The method of any one of claims 1-5, wherein the patient specific EMR data comprises one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

7. The method of any one of claims 1-6, wherein the trained machine learning model comprises a gradient-boosted decision tree model, an ensemble of positive-unlabeled classifiers, multilayer neural network, or any combination thereof.

8. The method of any one of claims 1-5, wherein the method further comprises generating derived features comprising record density, encounter frequency, or diagnosis count prior to the inputting.
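
Illustrative sketch (not part of the claims): the derived features named in claim 8 could be computed per patient roughly as follows. The input layout, the function name `derived_features`, and the per-year normalizations used for encounter frequency and record density are assumptions made for this example; the claim itself does not define them.

```python
from datetime import date


def derived_features(encounters):
    """Summarize one patient's record.

    encounters: list of (encounter_date, [diagnosis codes]) tuples.
    The per-year normalizations below are assumptions for illustration.
    """
    if not encounters:
        return {"encounter_count": 0, "diagnosis_count": 0,
                "record_length_years": 0.0,
                "encounter_frequency": 0.0, "record_density": 0.0}
    dates = [d for d, _ in encounters]
    span_days = (max(dates) - min(dates)).days
    # Floor the span at one day so single-visit records do not divide by zero.
    record_length_years = max(span_days, 1) / 365.25
    encounter_count = len(encounters)
    diagnosis_count = sum(len(codes) for _, codes in encounters)
    return {
        "encounter_count": encounter_count,
        "diagnosis_count": diagnosis_count,
        "record_length_years": record_length_years,
        # Assumed definition: encounters per year of record.
        "encounter_frequency": encounter_count / record_length_years,
        # Assumed definition: diagnoses per year of record.
        "record_density": diagnosis_count / record_length_years,
    }


# Hypothetical patient with three encounters over two years.
example = derived_features([
    (date(2020, 1, 1), ["G30.9", "R41.3"]),
    (date(2021, 1, 1), ["R41.3"]),
    (date(2022, 1, 1), ["F03.90"]),
])
```

Features of this kind describe how much data exists for a patient rather than what the data says, which is one plausible reason to compute them before the inputting step.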

9. The method of any one of claims 1-8, wherein the method further comprises converting diagnosis codes into phecodes prior to predicting the disease state outcomes.

10. The method of any one of claims 1-9, wherein the method further comprises validating the predicted disease state outcomes, wherein the validating comprises comparing polygenic risk score distributions between the first population having the confirmed positive indication for the disease state and the second population having the reliable negative indication for the disease state.

11. The method of claim 10, wherein the validating further comprises assessing an apolipoprotein E (APOE) ε4 allele count for each patient in the first population.

12. A system for improving disease state prediction models from unlabeled EMRs, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 1-11.

13. A computer-implemented method for improving fairness of disease state prediction models from unlabeled EMR data of underrepresented demographic groups, the method comprising: (a) receiving, by a processor, a labeled EMR dataset of patient specific EMR data for a first population and a second population of a plurality of patients, wherein the labeled EMR dataset labels the first population as having a confirmed positive indication for a disease state and labels the second population as having a reliable negative indication for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) receiving, by the processor, a plurality of unlabeled group-specific EMR datasets of patient specific EMR data for a remainder population of the plurality of patients that excludes the first population and second population, wherein the remainder population has unconfirmed indications for the disease state, and wherein the plurality of unlabeled group-specific EMR datasets is associated with the plurality of demographic groups; (c) inputting each unlabeled group-specific EMR dataset of the plurality of unlabeled group-specific EMR datasets to a trained machine learning model to predict a disease state outcome for each patient in each demographic group of the plurality of demographic groups, wherein the trained machine learning model is configured to predict a disease state outcome based on input data, wherein the disease state outcome is a positive indication for the disease state or a negative indication for the disease state, and wherein the each unlabeled group-specific EMR dataset is of the each demographic group; and (d) generating, based on the labeled EMR dataset and the predicted disease state outcome, an updated EMR dataset labeling the first population, the second population, and the remainder population.

14. The method of claim 13, wherein the disease state comprises a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition.

15. The method of claim 13 or 14, wherein the disease state is Alzheimer’s Disease.

16. The method of any one of claims 13-15, wherein the updated EMR dataset comprises: a first EMR subset for the first population, a second EMR subset for the second population, a third EMR subset for a third population having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population having an additional negative indication for the disease state.

17. The method of any one of claims 13-16, wherein the method further comprises generating derived features comprising record density, encounter frequency, or diagnosis count prior to the inputting.

18. The method of any one of claims 13-17, wherein the method further comprises converting diagnosis codes into phecodes prior to predicting the disease state outcome.

19. The method of any one of claims 13-18, wherein the inputting further comprises: mitigating a bias associated with the each demographic group by adjusting the each unlabeled group-specific EMR dataset of the each demographic group.

20. The method of claim 19, wherein the mitigating the bias further comprises optimizing parity of one or more of balanced accuracy, precision, specificity, equal opportunity, or cumulative parity loss across the plurality of demographic groups.
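
Illustrative sketch (not part of the claims): one way to quantify parity of the metrics listed in claim 20 is to compute each metric per demographic group and measure the largest between-group gap. The helper names, the toy labels, and the reading of "cumulative parity loss" as the sum of per-metric gaps are assumptions for this example. Equal opportunity is represented here by parity of recall (true positive rate).

```python
METRICS = ("balanced_accuracy", "precision", "specificity", "recall")


def group_metrics(y_true, y_pred):
    """Confusion-matrix metrics for one group (labels are 0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "specificity": specificity,
        "recall": recall,  # compared across groups for equal opportunity
    }


def parity_report(y_true, y_pred, groups):
    """Return per-group metrics, per-metric max-min gaps, and their sum."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = group_metrics([y_true[i] for i in idx],
                                     [y_pred[i] for i in idx])
    gaps = {
        m: max(v[m] for v in per_group.values())
        - min(v[m] for v in per_group.values())
        for m in METRICS
    }
    # Assumed definition of cumulative parity loss: sum of the gaps.
    return per_group, gaps, sum(gaps.values())


# Toy data for two hypothetical groups, A and B.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["A"] * 4 + ["B"] * 4
per_group, gaps, parity_loss = parity_report(y_true, y_pred, groups)
```

In this toy data both groups have the same balanced accuracy while recall and specificity differ, which shows why a single metric may hide a disparity that a cumulative measure exposes.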

21. The method of any one of claims 13-20, wherein the method further comprises validating the predicted disease state outcome, wherein the validating comprises comparing polygenic risk score distributions between the first population and the second population.

22. The method of claim 21, wherein the validating further comprises assessing an APOE ε4 allele count for each patient in the first population.

23. The method of any one of claims 13-22, wherein the patient-specific EMR data comprises a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of an altered mental status; an indication of a neurological disorder; an indication of an abnormality of gait; an indication of a mild cognitive impairment; an indication of a vascular dementia; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of a delirium due to conditions classified elsewhere; an indication of a major depressive disorder; an indication of a cerebral degeneration; an indication of a cerebral ischemia; an indication of an encephalopathy; an indication of a developmental delay and / or disorder; an indication of a urinary tract infection; an indication of a urinary incontinence; an indication of a transient alteration of awareness; an indication of decubitus ulcer; an indication of aphasia / speech disturbance; or an indication of malaise and fatigue.

24. The method of any one of claims 13-23, wherein the patient specific EMR data comprises two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome.

25. The method of any one of claims 13-24, wherein the patient specific EMR data comprises one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

26. A system for improving fairness of disease state prediction models from unlabeled EMR data of underrepresented demographic groups, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 13-25.

27. A computer-implemented method for disease state prediction with improved fairness and bias mitigation using EMR data, the method comprising: (a) receiving, by a processor, a plurality of EMR datasets of patient-specific EMR data for a plurality of patients having a disease state outcome that is known, wherein the disease state outcome is a positive indication for a disease state or a negative indication for the disease state, wherein the plurality of patients are categorized into a plurality of demographic groups, and wherein each EMR dataset of the plurality of EMR datasets is associated with each patient of the plurality of patients; (b) mitigating a bias in a trained machine learning model for each demographic group of the plurality of demographic groups, wherein the trained machine learning model is configured to predict the disease state outcome based on input data; and (c) predicting the disease state outcome for the each patient by inputting the each EMR dataset to the trained machine learning model.

28. The method of claim 27, wherein the disease state comprises a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition.

29. The method of claim 27 or 28, wherein the disease state is Alzheimer’s Disease.

30. The method of any one of claims 27-29, wherein the EMR dataset comprises: a first EMR subset for a first population of the plurality of patients having a confirmed positive indication for the disease state, a second EMR subset for a second population of the plurality of patients having a reliable negative indication for the disease state, a third EMR subset for a third population of the plurality of patients having an additional positive indication for the disease state, and a fourth EMR subset for a fourth population of the plurality of patients having an additional negative indication for the disease state.

31. The method of claim 30, further comprising: validating the trained machine learning model using the first EMR subset; and adjusting the second EMR subset, the third EMR subset, and the fourth EMR subset.

32. The method of any one of claims 27-31, wherein the mitigating the bias comprises: receiving, for the each demographic group, a group benefit equality (GBE) value; and tuning the trained machine learning model based on the GBE value for the each demographic group.
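
Illustrative sketch (not part of the claims): tuning based on a per-group GBE value, as in claim 32, might take the form of choosing a decision threshold for each demographic group. This example assumes that GBE is the ratio of predicted positives to actual positives within a group, with 1 as the ideal value, and that tuning means picking the candidate threshold whose GBE is closest to 1. Both assumptions, the function names, and the toy scores are hypothetical and are not taken from the claim language.

```python
def gbe(y_true, scores, threshold):
    """Assumed GBE: predicted positives / actual positives in one group."""
    actual = sum(y_true)
    if actual == 0:
        raise ValueError("GBE is undefined for a group with no positives")
    predicted = sum(1 for s in scores if s >= threshold)
    return predicted / actual


def tune_group_thresholds(y_true, scores, groups, candidates):
    """For each group, pick the candidate threshold with GBE closest to 1."""
    chosen = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        yt = [y_true[i] for i in idx]
        sc = [scores[i] for i in idx]
        chosen[g] = min(candidates, key=lambda t: abs(gbe(yt, sc, t) - 1.0))
    return chosen


# Toy model scores: group B is scored systematically lower than group A.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.1, 0.5, 0.35, 0.3, 0.05]
groups = ["A"] * 4 + ["B"] * 4
thresholds = tune_group_thresholds(y_true, scores, groups,
                                   candidates=[0.2, 0.35, 0.5, 0.7])
```

With these toy scores a single shared threshold of 0.5 would flag only half of group B's positives, whereas the group-specific thresholds bring the assumed GBE to 1 in both groups.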

33. The method of any one of claims 27-32, wherein the mitigating the bias further comprises optimizing parity of one or more of balanced accuracy, precision, specificity, equal opportunity, or cumulative parity loss across the plurality of demographic groups.

34. The method of any one of claims 27-33, wherein the patient-specific EMR data comprises a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon.

35. The method of any one of claims 27-34, wherein the patient specific EMR data comprises two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome.

36. The method of any one of claims 27-35, wherein the patient specific EMR data comprises one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

37. A system for disease state prediction with improved fairness and bias mitigation using EMR data, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 27-36.

38. A computer-implemented method for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data, the method comprising: (a) receiving, by a processor, a first EMR dataset of patient-specific EMR data for a plurality of patients, wherein the first EMR dataset comprises a first EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a first remainder population of the plurality of patients excluding the first population, wherein the first EMR subset indicates that the first population has confirmed positive indications for a disease state, wherein the unlabeled EMR subset indicates that the first remainder population has a first set of unconfirmed indications for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) inputting the unlabeled EMR subset into a first trained machine learning model to predict disease state outcomes for a second population of the plurality of patients, wherein the second population is a subset of the first remainder population, and wherein the disease state outcomes are reliable negative indications for the disease state; (c) receiving, by the processor, a second EMR dataset comprising the first EMR subset and a second EMR subset for the second population, and a plurality of unlabeled group-specific EMR datasets of patient-specific EMR data for a second remainder population of the plurality of patients, wherein the second remainder population excludes the first population and the second population and has a second set of unconfirmed indications for the disease state, wherein the second EMR subset indicates that the second population has the reliable negative indications for the disease state, and wherein each unlabeled group-specific EMR dataset of the plurality of unlabeled group-specific EMR datasets is associated with each demographic group of the plurality of demographic groups; (d) inputting the each unlabeled group-specific EMR dataset to a second trained machine learning model to predict the disease state outcomes for the each demographic group, wherein the each demographic group belongs to the second remainder population; (e) generating, based on the predicted disease state outcomes for the second remainder population, a third EMR subset for a third population of the plurality of patients, and a fourth EMR subset for a fourth population of the plurality of patients; and (f) predicting the disease state outcome with improved fairness by inputting a third EMR dataset into a third trained machine learning model, wherein the third EMR dataset comprises the first EMR subset, the second EMR subset, the third EMR subset, and the fourth EMR subset, wherein the first EMR subset indicates that the first population has the confirmed positive indications for the disease state, wherein the second EMR subset indicates that the second population has the reliable negative indications for the disease state, wherein the third EMR subset indicates that the third population has an additional positive indication for the disease state, and wherein the fourth EMR subset indicates that the fourth population has an additional negative indication for the disease state; wherein the first trained machine learning model, the second trained machine learning model, and the third trained machine learning model are configured to predict the disease state outcomes based on input data, wherein the disease state outcomes are the confirmed positive indications for the disease state or the reliable negative indications for the disease state, and wherein training of the third trained machine learning model comprises mitigating a bias using a group benefit equality (GBE) value for the each demographic group.

39. The method of claim 38, further comprising: receiving, by the processor, an EMR data associated with a patient in the each demographic group, wherein the patient has an unconfirmed indication of the disease state; and predicting a disease state outcome for the patient by inputting the EMR data associated with the patient into the third trained machine learning model.

40. The method of claim 38 or 39, wherein the disease state comprises a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition.

41. The method of any one of claims 38-40, wherein the disease state is Alzheimer’s Disease.

42. The method of any one of claims 38-41, further comprising: validating the third trained machine learning model using the first EMR subset; and adjusting the second EMR subset, the third EMR subset, and the fourth EMR subset.

43. The method of any one of claims 38-42, wherein the mitigating the bias comprises: receiving, by the processor, the GBE value for the each demographic group; and tuning the third machine learning model based on the GBE value for the each demographic group.

44. The method of any one of claims 38-43, wherein the patient-specific EMR data comprises a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of memory loss; an indication of altered mental status; an indication of a neurological disorder; an indication of a mild cognitive impairment; an indication of delirium; an indication of a specific nonpsychotic mental disorder due to brain damage; an age of the patient; an indication of aphasia; an indication of vascular dementia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of palpitation; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; or an indication of benign neoplasm of colon.

45. The method of any one of claims 38-44, wherein the patient specific EMR data comprises two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome.

46. The method of any one of claims 38-45, wherein the patient specific EMR data comprises one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

47. A system for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 38-46.

48. A computer-implemented method for predicting a disease state of a patient from EMR data, the method comprising: (a) receiving, by a processor, an initial EMR dataset of patient-specific EMR data for a plurality of patients, wherein the initial EMR dataset comprises a first labeled EMR subset for a first population of the plurality of patients and an unlabeled EMR subset for a remainder population of the plurality of patients excluding the first population, wherein the first labeled EMR subset indicates the first population has confirmed positive indications for the disease state, wherein the unlabeled EMR subset indicates that the remainder population has unconfirmed indications for the disease state, and wherein the plurality of patients are categorized into a plurality of demographic groups; (b) identifying, based on probabilistic gap values, a first subset of the unlabeled EMR subset as reliable negative indications for the disease state, wherein a probabilistic gap value of the probabilistic gap values is generated by the processor for each patient in the unlabeled EMR subset based on a first trained machine learning model, and wherein the reliable negative indications correspond to the probabilistic gap values that are below a threshold derived from the probabilistic gap values of the first labeled EMR subset; (c) assigning, by the processor, additional positive indications and additional negative indications to a second subset of the unlabeled EMR subset using race-specific probabilistic criteria based on the probabilistic gap values; (d) classifying, by the processor, the unlabeled EMR subset to predict the disease state for the plurality of patients using a second trained machine learning model, wherein the second trained machine learning model is trained using an expanded labeled EMR dataset, and wherein the expanded labeled EMR dataset comprises the first labeled EMR subset, the first subset having the reliable negative indications, and the second subset having the additional positive indications and the additional negative indications; (e) optimizing, by the processor, a decision threshold for each demographic group of the plurality of demographic groups by selecting a group-specific threshold that maximizes group benefit equality (GBE) for the disease state; and (f) predicting the disease state for the patient by inputting an EMR dataset of the patient into the second trained machine learning model.
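
Illustrative sketch (not part of the claims): step (b) of claim 48 selects reliable negatives by comparing probabilistic gap values against a threshold derived from the labeled positives. The example below assumes the gap is the difference between a model's estimated probability that a record is labeled positive and the probability that it is not, that is 2p - 1, and assumes the threshold is the minimum gap observed among the labeled positives. These definitions, the function names, and the toy probabilities are hypothetical; the claim fixes neither the gap formula nor how the threshold is derived.

```python
def probabilistic_gap(p_labeled):
    """Assumed gap: P(labeled | x) - P(not labeled | x) = 2p - 1."""
    return 2.0 * p_labeled - 1.0


def reliable_negatives(pos_probs, unlabeled_probs):
    """Return indices of unlabeled records treated as reliable negatives.

    pos_probs: first-model probabilities for the labeled positive records.
    unlabeled_probs: first-model probabilities for the unlabeled records.
    The threshold (minimum gap among labeled positives) is an assumption.
    """
    if not pos_probs:
        raise ValueError("at least one labeled positive is required")
    threshold = min(probabilistic_gap(p) for p in pos_probs)
    selected = [i for i, p in enumerate(unlabeled_probs)
                if probabilistic_gap(p) < threshold]
    return selected, threshold


# Toy probabilities from a hypothetical first trained model.
pos_probs = [0.8, 0.65, 0.9]
unlabeled_probs = [0.7, 0.2, 0.6, 0.05]
negatives, threshold = reliable_negatives(pos_probs, unlabeled_probs)
```

Under these assumptions, an unlabeled record becomes a reliable negative only when it looks less like a labeled case than every confirmed positive does. The remaining unlabeled records (index 0 in the toy data) would stay unlabeled for the later assignment step.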

49. The method of claim 48, wherein the first trained machine learning model is a trained probabilistic model, and the second trained machine learning model is a semi-supervised machine learning model.

50. The method of claim 48 or 49, wherein the predicting comprises applying a demographic-group-specific optimized threshold.

51. The method of any one of claims 48-50, wherein the disease state comprises a chronic, neurodegenerative, metabolic, oncologic, infectious, psychiatric, or autoimmune disease or condition.

52. The method of any one of claims 48-51, wherein the disease state is Alzheimer’s Disease.

53. The method of any one of claims 48-52, wherein the patient-specific EMR data comprises a value for each of a plurality of parameters comprising one or more of: an indication of dementia; an indication of a neurological disorder; an indication of delirium; an indication of vascular dementia; an indication of a mild cognitive impairment; an indication of encephalopathy; an indication of a delusional disorder; an indication of an episodic mood disorder; an indication of an anxiety disorder; an indication of altered mental status; an indication of memory loss; an indication of a specific nonpsychotic mental disorder due to brain damage; an indication of adjustment reaction; screening for malignant neoplasms; an indication of decubitus ulcer; an indication of palpitations; an indication of other immunological findings; an indication of aphasia; an indication of a transient alteration of awareness; a major depressive disorder; a symptom concerning nutrition, metabolism, and / or development; an alteration of consciousness; an indication of potential health hazards due to socioeconomic or psychosocial circumstances; an indication of dyspnea; an indication of dyschromia; an indication of benign neoplasm of colon; an age of the patient; a number of encounters; a number of diagnoses; a record length; or a record density.

54. The method of any one of claims 48-53, wherein the patient specific EMR data comprises two or more values selected from a plurality of parameters listed in Table 11, wherein the plurality of parameters are listed in order of significance to the predicted disease state outcome.

55. The method of any one of claims 48-54, wherein the patient specific EMR data comprises one or more values selected from a plurality of parameters comprising: an indication of dementia, an indication of memory loss, an indication of altered mental status, and an indication of neurological disorder.

56. A system for disease state prediction with improved fairness and bias mitigation using electronic medical record (EMR) data, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 48-55.