Controlling access using multiple data sources with varying availability

A risk assessment model trained with a foundational model to generate dataset representations addresses the challenge of comparing large data sets, enhancing predictive power and accuracy in access control decisions.

WO2026135669A1 · PCT designated stage · Publication Date: 2026-06-25 · EQUIFAX INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
EQUIFAX INC
Filing Date
2024-12-17
Publication Date
2026-06-25

Abstract

In some aspects, a computing system can train a machine learning (ML) model for risk assessment. Once trained, the ML model can determine a risk indicator for a target entity that indicates a level of risk associated with the target entity. Training the ML model can include: using a foundational model pre-trained to predict multiple outcomes to compute the set of common features; and training the machine learning model using the computed set of common features as training inputs.

Description

Attorney Docket No. 096923-1439762

CONTROLLING ACCESS USING MULTIPLE DATA SOURCES WITH VARYING AVAILABILITY

Technical Field

[0001] The present disclosure relates generally to artificial intelligence for risk prediction. More specifically, but not by way of limitation, this disclosure relates to controlling access to secure resources based on a risk assessment model trained on data from multiple data sources.

Background

[0002] In machine learning, data serves as a resource on which to train models. Many large data sets may be available to use to train a machine learning model, but it may be difficult to compare data sets to determine the relative value of an individual data set. For example, a test could examine the pairwise correlations between attributes in each data set, but this approach is infeasible for data sets containing hundreds or thousands of attributes. Methods for comparing large data sets are limited by access to memory and computing resources. Additionally or alternatively, determining similarity metrics from obfuscated coordinate systems can be difficult.

Summary

[0003] Various aspects of the present disclosure provide systems and methods for controlling access to secure resources based on a risk assessment model trained on data from multiple data sources. In some examples, a method includes one or more processing devices performing various operations. The method can include generating a risk indicator for a target entity using a trained machine learning model and predictor variables associated with the target entity, wherein the trained machine learning model is trained, using a training process, to determine a risk indicator for a target entity from the predictor variables, wherein the trained machine learning model is associated with one or more features, and wherein the training process includes operations comprising: (i) evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set; (ii) selecting a first subset of the first data set based on an amount of variance in the first subset that is explained by an amount of variance in the second data set or a second subset of the second data set being above a predetermined threshold, the amount of variance determined based on an analysis of attributes associated with the first data set or the second data set; and (iii) training the trained machine learning model using the first subset. The method can include transmitting, to a remote computing device, a responsive message comprising at least the risk indicator for use in controlling access of the target entity to one or more computing environments.

[0004] In another example, a system includes a processing device; and a memory device in which instructions executable by the processing device are stored for causing the processing device to perform various operations. The system can generate a risk indicator for a target entity using a trained machine learning model and predictor variables associated with the target entity, wherein the trained machine learning model is trainable, using a training process, to determine a risk indicator for a target entity from the predictor variables, wherein the trained machine learning model is associated with one or more features, and wherein the training process includes operations comprising: (i) evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set; (ii) selecting a first subset of the first data set based on an amount of variance in the first subset that is explained by an amount of variance in the second data set or a second subset of the second data set being above a predetermined threshold, the amount of variance determinable based on an analysis of attributes associated with the first data set or the second data set; and (iii) training the trained machine learning model using the first subset. The system can transmit, to a remote computing device, a responsive message comprising at least the risk indicator for use in controlling access of the target entity to one or more computing environments.

[0005] In yet another example, a non-transitory computer-readable storage medium includes program code that is executable by a processor to cause a computing device to perform operations. The operations can include generating a risk indicator for a target entity using a trained machine learning model and predictor variables associated with the target entity, wherein the trained machine learning model is trainable, using a training process, to determine a risk indicator for a target entity from the predictor variables, wherein the trained machine learning model is associated with one or more features, and wherein the training process includes operations comprising: (i) evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set; (ii) selecting a first subset of the first data set based on an amount of variance in the first subset that is explained by an amount of variance in the second data set or a second subset of the second data set being above a predetermined threshold, the amount of variance determinable based on an analysis of attributes associated with the first data set or the second data set; and (iii) training the trained machine learning model using the first subset. The operations can include transmitting, to a remote computing device, a responsive message comprising at least the risk indicator for use in controlling access of the target entity to one or more computing environments.
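
As an illustration only, selection step (ii) can be sketched under one possible reading: each column of the first data set is regressed linearly on the second data set, and columns whose explained variance (R^2) exceeds the threshold are kept. The function names, the column-wise interpretation of "subset", and the use of ordinary least squares are assumptions made for this sketch and are not stated in the disclosure.

```python
import numpy as np


def explained_variance_per_column(first, second):
    """R^2 of each column of `first` when regressed linearly on all of `second`."""
    A = np.asarray(first, dtype=float)
    B = np.asarray(second, dtype=float)
    # Center both data sets so the regression needs no intercept term.
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    coef, *_ = np.linalg.lstsq(B, A, rcond=None)
    residual = A - B @ coef
    return 1.0 - (residual ** 2).sum(axis=0) / (A ** 2).sum(axis=0)


def select_subset(first, second, threshold):
    """Indices of columns of `first` whose explained variance exceeds `threshold`."""
    r2 = explained_variance_per_column(first, second)
    return np.flatnonzero(r2 > threshold), r2


# Hypothetical data: column 0 is mostly explained by `second`, column 1 is noise.
rng = np.random.default_rng(0)
second = rng.normal(size=(500, 3))
first = np.column_stack([
    second @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])
selected, r2 = select_subset(first, second, threshold=0.8)
print(selected, r2)
```

The selected columns would then be passed to whatever training routine is used for the downstream model in step (iii).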

[0006] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification, any or all drawings, and each claim.

[0007] The foregoing, together with other features and examples, will become more apparent upon referring to the following specification, claims, and accompanying drawings.

Brief Description of the Drawings

[0008] FIG. 1 is a block diagram depicting an example of an operating environment in which a risk assessment computing system builds and trains a risk assessment model that can be trained to predict risk indicators based on training data according to certain aspects of the present disclosure.

[0009] FIGS. 2A-2B are illustrations depicting different representations of a single data set according to certain aspects of the present disclosure.

[0010] FIG. 3 is an illustration of a data set associated with two attributes according to certain aspects of the present disclosure.

[0011] FIG. 4A is an illustration of measuring the similarity between two sets of attributes of a data set according to certain aspects of the present disclosure.

[0012] FIG. 4B is an example of obfuscation of underlying geometry and dependencies based on coordinate system choice according to certain aspects of the present disclosure.

[0013] FIG. 5A is a dataset with two variables according to certain aspects of the present disclosure.

[0014] FIG. 5B illustrates the latent variables used to represent the data set illustrated in FIG. 5A according to certain aspects of the present disclosure.

[0015] FIG. 6 is an illustration depicting various transformations of a dataset according to certain aspects of the present disclosure.

[0016] FIG. 7A illustrates decorrelated and standardized versions of original datasets having arbitrary coordinates and obtained via SVD according to certain aspects of the present disclosure.

[0017] FIG. 7B illustrates correlations between various data sets with respect to various transformations according to certain aspects of the present disclosure.

[0018] FIG. 8 illustrates correlations between canonical variables illustrated in FIG. 7B according to certain aspects of the present disclosure.

[0019] FIG. 9 is a flowchart illustrating an example of a process for controlling access using multiple data sources with varying availability according to certain aspects of the present disclosure.

[0020] FIG. 10 is a block diagram depicting an example of a computing device suitable for implementing aspects of the techniques and technologies presented herein.

Detailed Description

[0021] Certain aspects and features of the present disclosure are directed to controlling access to secure resources based on a risk assessment model trained on data from multiple data sources. Certain aspects described herein provide improvements to machine learning techniques for assessing risks, for example, in access control associated with entities. For instance, a foundational model described herein can generate a representation of the training data that can be used to train the machine learning model to determine the risk indicator. The foundational model can be used to perform multiple predictive tasks or to reconstruct a dataset from the representation. Additionally or alternatively, the foundational model can be used to determine similarity measures between datasets having arbitrary coordinate systems. Implementations described herein enable improved predictions from downstream models (e.g., the machine learning model) by using the foundational model to generate the representation of a dataset, to determine the similarity measures, or a combination thereof.

[0022] Additionally or alternatively, disclosed systems and methods improve the field of access control by facilitating training of downstream machine learning models for predicting risk. For example, large and complex data sets can be transformed into a fully populated representation of a fixed size, which can reduce the difficulty in developing downstream models. Further, disclosed systems and methods enable previously unusable or difficult to use data sets to be used to train downstream machine learning models. This can yield more accurate machine learning outcomes by enabling downstream machine learning models to be trained on a larger corpus of data. Accordingly, improved predictive power and improved accuracy of downstream machine learning models also improve an entity’s ability to make accurate and informed decisions on whether to grant a target entity access to a secured or restricted resource.

[0023] These illustrative examples are given to introduce the reader to the general subject matter discussed here and are not intended to limit the scope of the disclosed concepts. The following sections describe various additional features and examples with reference to the drawings in which like numerals indicate like elements, and directional descriptions are used to describe the illustrative examples but, like the illustrative examples, should not be used to limit the present disclosure.

Operating Environment Example for Machine-Learning Operations

[0024] Referring now to the drawings, FIG. 1 is a block diagram depicting an example of an operating environment 100 in which a risk assessment computing system 130 builds and trains a risk assessment model 120 that can be trained to predict risk indicators based on training data. As illustrated, FIG. 1 may depict examples of hardware components of a risk assessment computing system 130 according to some aspects. The risk assessment computing system 130 may be or include a specialized computing system that may be used for processing large amounts of data using a large number of computer processing cycles. The risk assessment computing system 130 can include a model training server 110 for building and training a risk assessment model 120 used to predict risk indicators associated with an entity accessing controlled resources and a foundational model 121 for generating a reduced representation of a dataset that can be used to train the risk assessment model 120. The risk assessment computing system 130 can further include a risk assessment server 118 for performing a risk assessment for given predictor variables 124, or features, using the trained risk assessment model 120.

[0025] The model training server 110 can include one or more processing devices that execute program code such as a model training application 112. The program code is stored on a non-transitory computer-readable medium. The model training application 112 can execute one or more processes or applications to develop, train, and optimize a risk assessment model 120 for predicting risk indicators based on the predictor variables 124. In some instances, the risk assessment model 120 may be trained on a reduced representation of a dataset generated by the foundational model 121.

[0026] In some aspects, the model training application 112 can build and train a risk assessment model 120 using risk assessment training data 126 in a training process. The risk assessment training data 126 can include multiple training vectors including training predictor variables and training risk indicator outputs corresponding to the training vectors. In some cases, the risk assessment training data 126 may include differing subsets of data sources. In some examples, the foundational model 121 can be used to generate a reduced representation of the risk assessment training data 126 that can then be used to train the risk assessment model 120, thereby improving the efficiency of training the risk assessment model 120. The risk assessment training data 126 can be stored in one or more network-attached storage units on which various repositories, databases, or other structures are stored. An example of these data structures is the risk data repository 122.

[0027] Network-attached storage units can include the risk data repository 122. Network-attached storage units may store a variety of different types of data organized in a variety of different ways and from a variety of different sources. For example, the network-attached storage unit may include storage other than primary storage located within the model training server 110 that is accessible, such as directly or indirectly, by processors located therein. In some aspects, the network-attached storage unit may include secondary, tertiary, or auxiliary storage, such as large hard drives, servers, virtual memory, among other types. Storage devices may include portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing and containing data. A machine-readable storage medium or computer-readable storage medium may include a non-transitory medium in which data can be stored and that does not include carrier waves or transitory electronic signals. Examples of a non-transitory medium may include, for example, a magnetic disk or tape, optical storage media such as a compact disk or digital versatile disk, flash memory, memory, or memory devices.

[0028] The risk assessment server 118 can include one or more processing devices that execute program code, such as a risk assessment application 114. The program code can be stored on a non-transitory computer-readable medium. The risk assessment application 114 can execute one or more processes to use the risk assessment model 120 trained during execution of the model training application 112 to predict risk indicators based on input predictor variables 124. The risk indicators can be used to protect or allocate computing resources of the risk assessment computing system 130.

[0029] Furthermore, the risk assessment computing system 130 can communicate with various other computing systems such as client computing systems 104. For example, client computing systems 104 may send risk assessment queries to the risk assessment server 118 for risk assessment or may send signals to the risk assessment server 118 that control or otherwise influence different aspects of the risk assessment computing system 130. The client computing systems 104 may also interact with user computing systems 106 via one or more public data networks 108 to facilitate interactions between users of the user computing systems 106 and interactive computing environments provided by the client computing systems 104.

[0030] Each client computing system 104 may include one or more third-party devices, such as individual servers or groups of servers operating in a distributed manner. A client computing system 104 can include any computing device or group of computing devices operated by a seller, lender, or other providers of products or services. The client computing system 104 can include one or more server devices. The one or more server devices can include or can otherwise access one or more non-transitory computer-readable media. The client computing system 104 can also execute instructions that provide an interactive computing environment accessible to user computing systems 106. Examples of the interactive computing environment include a mobile application specific to a particular client computing system 104, a web-based application accessible via a mobile device, etc. The executable instructions are stored in one or more non-transitory computer-readable media.

[0031] The client computing system 104 can further include one or more processing devices that are capable of providing the interactive computing environment to perform operations described herein. The interactive computing environment can include executable instructions stored in one or more non-transitory computer-readable media. The instructions providing the interactive computing environment can configure one or more processing devices to perform operations described herein. In some aspects, the executable instructions for the interactive computing environment can include instructions that provide one or more graphical interfaces. The graphical interfaces can be used by a user computing system 106 to access, or to provide access to, various functions of the interactive computing environment. For instance, the interactive computing environment may transmit data to and receive data from a user computing system 106 to shift between different states of the interactive computing environment in which the different states allow one or more electronic transactions between the user computing system 106 and the client computing system 104 to be performed.

[0032] In some examples, a client computing system 104 may have other computing resources associated therewith (not shown in FIG. 1), such as server computers hosting and managing virtual machine instances for providing cloud computing services, server computers hosting and managing online storage resources for users, server computers for providing database services, and others. The interaction between the user computing system 106 and the client computing system 104 may be performed through graphical user interfaces presented by the client computing system 104 to the user computing system 106, or through application programming interface (API) calls or web service calls.

[0033] A user computing system 106 can include any computing device or other communication device operated by an entity, such as a user, an organization, or a company. The user computing system 106 can include one or more computing devices such as laptops, smartphones, and other personal computing devices. A user computing system 106 can include executable instructions stored in one or more non-transitory computer-readable media. The user computing system 106 can also include one or more processing devices that are capable of executing program code to perform one or more of the operations described herein. In various examples, the user computing system 106 can allow a user to access certain online services from a client computing system 104 or other computing resources, to engage in mobile commerce with a client computing system 104, to obtain controlled access to electronic content hosted by the client computing system 104, etc.

[0034] For instance, the user can use the user computing system 106 to engage in an electronic transaction with a client computing system 104 via an interactive computing environment. An electronic interaction between the user computing system 106 and the client computing system 104 can include, for example, the user computing system 106 being used to request online storage resources managed by the client computing system 104, to acquire cloud computing resources, such as virtual machine instances, etc., and so on. An electronic interaction between the user computing system 106 and the client computing system 104 can include, for example, querying a set of sensitive or other controlled data, accessing online financial services provided via the interactive computing environment, submitting an online credit card application or other digital application to the client computing system 104 via the interactive computing environment, or operating an electronic tool within an interactive computing environment hosted by the client computing system (e.g., a content-modification feature, an application-processing feature, etc.).

[0035] In some aspects, an interactive computing environment implemented through a client computing system 104 can be used to provide access to various online functions. As a simplified example, a website or other interactive computing environment provided by an online resource provider can include electronic functions for requesting computing resources, online storage resources, network resources, database resources, or other types of resources. In another example, a website or other interactive computing environment provided by a financial institution can include electronic functions for obtaining one or more financial services, such as loan application and management tools, credit card application and transaction management workflows, electronic fund transfers, etc. A user computing system 106 can be used to request access to the interactive computing environment provided by the client computing system 104, which can selectively grant or deny access to various electronic functions. Based on the request, the client computing system 104 can collect data associated with the user and communicate with the risk assessment server 118 for risk assessment. Based on the risk indicator predicted by the risk assessment server 118, the client computing system 104 can determine whether to grant the access request of the user computing system 106 to certain features of the interactive computing environment.

[0036] In a simplified example, the system depicted in FIG. 1 can train the risk assessment model 120 to determine risk indicators, such as credit scores, using predictor variables 124. Additionally or alternatively, the system depicted in FIG. 1 can train the risk assessment model 120 based on a set of features generated from the risk assessment training data 126 by the foundational model 121. A predictor variable 124 can be or include any variable predictive of risk that is associated with an entity. Any suitable predictor variable that is authorized for use by an appropriate legal or regulatory framework may be used.

[0037] Examples of predictor variables 124 used for predicting the risk associated with an entity accessing online resources may include variables indicating the demographic characteristics (e.g., name of the entity, the network or physical address of the company, the identification of the company, the revenue of the company) of the entity, variables (e.g., past requests of online resources submitted by the entity, the amount of online resource currently held by the entity, and so on) indicative of prior actions or transactions involving the entity, variables (e.g., the timeliness of the entity releasing the online resources) indicative of one or more behavioral traits of an entity, etc. Similarly, examples of predictor variables 124 used for predicting the risk associated with an entity accessing services provided by a financial institute may include variables indicative of one or more demographic characteristics (e.g., age, gender, income) of an entity, variables indicative of prior actions or transactions involving the entity (e.g., information that can be obtained from credit files or records, financial records, consumer records, or other data about the activities or characteristics of the entity), variables indicative of one or more behavioral traits of an entity, etc.

[0038] The predicted risk indicator can be used by the service provider, such as the service provider controlling the interactive computing environment, to determine the risk associated with the entity accessing a service provided by the service provider, thereby granting or denying access by the entity to an interactive computing environment implementing the service. For example, if the service provider determines that the predicted risk indicator is lower than a threshold risk indicator value, then the client computing system 104 associated with the service provider can generate or otherwise provide access permission to the user computing system 106 that requested the access. The access permission can include, for example, cryptographic keys used to generate valid access credentials or decryption keys used to decrypt access credentials. The client computing system 104 associated with the service provider can also allocate resources to the user and provide a dedicated web address for the allocated resources to the user computing system 106, for example, by adding it in the access permission. With the obtained access credentials and/or the dedicated web address, the user computing system 106 can establish a secure network connection to the computing environment hosted by the client computing system 104 and access the resources via invoking API calls, web service calls, HTTP requests, or other proper mechanisms.
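
The threshold comparison described in this paragraph can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `AccessDecision` type, the `decide_access` function, and the treatment of a risk indicator equal to the threshold as a denial are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class AccessDecision:
    granted: bool
    reason: str


def decide_access(risk_indicator: float, threshold: float) -> AccessDecision:
    """Grant access only when the predicted risk indicator is lower than the threshold."""
    if risk_indicator < threshold:
        return AccessDecision(True, "risk indicator below threshold")
    # Assumption: a value equal to the threshold is treated as a denial.
    return AccessDecision(False, "risk indicator at or above threshold")


print(decide_access(0.2, threshold=0.5))
print(decide_access(0.7, threshold=0.5))
```

In a fuller system, a granted decision would be followed by issuing the access permission (for example, keys or a dedicated web address) described above.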

[0039] Each communication within the operating environment 100 may occur over one or more data networks, such as a public data network 108, a network 116 such as a private data network, or some combination thereof. A data network may include one or more of a variety of different types of networks, including a wireless network, a wired network, or a combination of a wired and wireless network. Examples of suitable networks include the Internet, a personal area network, a local area network (“LAN”), a wide area network (“WAN”), or a wireless local area network (“WLAN”). A wireless network may include a wireless interface or a combination of wireless interfaces. A wired network may include a wired interface. The wired or wireless networks may be implemented using routers, access points, bridges, gateways, or the like, to connect devices in the data network.

[0040] The number of devices depicted in FIG. 1 is provided for illustrative purposes. Different numbers of devices may be used. For example, while certain devices or systems are shown as single devices in FIG. 1, multiple devices may instead be used to implement these devices or systems. Similarly, devices or systems that are shown as separate, such as the model training server 110 and the risk assessment server 118, may be instead implemented in a single device or system.

Introduction to Data Representations

[0041] FIGS. 2A and 2B are illustrations depicting different representations, such as representations 204A and 204B, of a single data set such as data set 202. Data set 202 can be a representation of a number N of entities. In determining risk, a machine learning model can determine which aspects of each entity are potentially (un)important or (ir)relevant such as by determining attributes and attribute suites. Aspects deemed important for a particular task, such as risk assessment, can be implicit in the specific subset of attributes selected for modeling and analysis, as well as in the way these aspects are transformed and processed. Thus, attributes individually or collectively, or any transformations thereof, can constitute different representations of each entity in a data set. In some examples, a representation can be associated with a group of entities rather than individual entities.

[0042] Representation of a data set can be abstracted as a view on a particular collection of entities. These representations can be expressed in a tabular form in which each row corresponds to an entity and each column corresponds to an attribute, as illustrated in FIG. 2A. However, in some examples, attributes need not be in raw form. For example, a system can preprocess one or more columns using imputation, normalization, binning, capping and flooring, and other kinds of explicit transformations prior to modeling. Likewise, any implicit transformations can be learned and applied using neural networks. In these instances, any given layer of the neural network can be regarded as a set of features or attributes as illustrated in FIG. 2B. Deep learning models 1 and 2 can differ in terms of weight initialization, hyperparameters, mini-batch exposure, architecture, and the like, yielding different representations of common data such as the data set 202. Accordingly, datasets (structured or unstructured), neural network layers, and individual attributes or collections thereof can all be regarded as representations of a data set.
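
The explicit column transformations named above (imputation, normalization, binning, capping and flooring) can be sketched for a single numeric column. The function name, the choice of median imputation, z-score normalization, and equal-width bins are assumptions for illustration; the disclosure does not specify which variants are used.

```python
import numpy as np


def preprocess_column(values, lower, upper, n_bins):
    """Impute, cap and floor, standardize, and bin one numeric column."""
    x = np.asarray(values, dtype=float)
    # Imputation: replace missing values with the median of the observed ones.
    x = np.where(np.isnan(x), np.nanmedian(x), x)
    # Capping and flooring: clip values to the interval [lower, upper].
    x = np.clip(x, lower, upper)
    # Normalization: zero mean and unit variance (guarding a constant column).
    std = x.std()
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    # Binning: equal-width bins over [lower, upper], labeled 0 .. n_bins - 1.
    edges = np.linspace(lower, upper, n_bins + 1)
    bins = np.digitize(x, edges[1:-1])
    return z, bins


z, bins = preprocess_column([1.0, np.nan, 3.0, 250.0], lower=0.0, upper=10.0, n_bins=5)
print(z, bins)
```

Each output (the standardized values and the bin labels) is a different representation of the same underlying column.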

[0043] It can be important to understand representations for machine learning applications as poor representations of a training data set can lead to poor performance of a machine learning model. On the other hand, a good representation of a training data set can yield accurate and robust task-specific predictions, population-specific insights, explanatory data associated with black-box models, and a minimal set of meaningful features. In determining which data set to use for training a machine learning model, desirable properties of the representation can rely on context and use-case. Determining which data set to use for training data can be based on a comparison of representations, or a determination of representational overlap.

Representational Similarity

[0044] FIG. 3 is an illustration of a data set 302 associated with two attributes: attribute X and attribute Y. An example of measuring the similarity between attribute X and attribute Y is illustrated in FIG. 3. For example, attribute X may be $X \in \mathbb{R}^{N \times 1}$ and attribute Y may be $Y \in \mathbb{R}^{N \times 1}$. There may be several ways to define a similarity measure between the two attributes. In this example, correlation can be used to measure attribute similarity. For example, correlation can measure the degree to which attribute X and attribute Y covary, which can be normalized by standard deviation. If, for example, similarity is $S(X, Y) = r_{XY} = +1.0$, then there may be a perfect linear relationship between attribute X and attribute Y and thus, the representations can be the same. As illustrated in FIG. 3, $S(X, Y) = r_{XY} = \pm 0.89$, meaning the two representations are similar, but there is still some unique variation that is not mutually explained.
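
A minimal sketch of this similarity measure, assuming the standard Pearson correlation (covariance normalized by both standard deviations). The function name `attribute_similarity` and the synthetic data are illustrative assumptions and do not reproduce the data of FIG. 3.

```python
import numpy as np


def attribute_similarity(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noisy_y = x + 0.5 * rng.normal(size=1000)

print(attribute_similarity(x, 2.0 * x + 1.0))  # exact linear relationship
print(attribute_similarity(x, noisy_y))        # similar, with some unexplained variation
```

The first call illustrates the perfect linear case; the second illustrates two attributes that are similar but not identical representations.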

[0045] FIG. 4A is an illustration of measuring the similarity between two sets of attributes ($X \in \mathbb{R}^{N \times p_X}$ and $Y \in \mathbb{R}^{N \times p_Y}$) of a data set 402. As discussed with reference to FIG. 3, correlation can be used to measure representational similarity. This approach can generate $p_X \times p_Y$ correlations as produced below in Equation 1:

Equation 1

$$S(X, Y) = f\big(\mathrm{corr}(X^T, Y)\big) = f\left(\begin{bmatrix} E[X_1, Y_1] & \cdots & E[X_1, Y_{p_Y}] \\ \vdots & \ddots & \vdots \\ E[X_{p_X}, Y_1] & \cdots & E[X_{p_X}, Y_{p_Y}] \end{bmatrix}\right)$$

[0046] In this case, there may be no clear means to generate a scalar measure that summarizes the similarity between the representations. Additionally or alternatively, ordinary correlation may depend on the coordinate system used. For example, the raw attributes may not reveal the underlying information geometry and can obscure underlying strong linear dependencies.

[0047] FIG. 4B illustrates an example of obfuscation of underlying geometry and dependencies based on coordinate system choice. The representations generated in FIG. 4B can be based on two normally distributed, two-dimensional, random vectors X and Y in which $X_1 + X_2 = Y_1 + Y_2$. The cross-correlation matrix in the raw coordinates is illustrated on the left of FIG. 4B, while the cross-correlation in the canonical coordinates is illustrated on the right of FIG. 4B. The cross-correlation in the canonical coordinates indicates the presence of a perfect correlation that is not clear from the cross-correlation in the raw coordinates.
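The dependence of ordinary cross-correlation on the coordinate system can be sketched numerically. The construction below is an assumption made for illustration (it is not the data behind FIG. 4B): Y is built so that its coordinates sum to the same value as the coordinates of X, and the helper `cross_correlation` is a hypothetical name.

```python
import numpy as np

def cross_correlation(X, Y):
    """Return the p_X x p_Y matrix of Pearson correlations between columns of X and Y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]

rng = np.random.default_rng(0)
N = 20000
X = rng.normal(size=(N, 2))
s = X[:, 0] + X[:, 1]                         # shared direction X_1 + X_2
d = rng.normal(scale=np.sqrt(2.0), size=N)    # variation unique to Y
Y = np.column_stack([(s + d) / 2, (s - d) / 2])  # so that Y_1 + Y_2 = X_1 + X_2

raw = cross_correlation(X, Y)                 # raw coordinates: every entry is moderate
aligned = cross_correlation(s[:, None], (Y[:, 0] + Y[:, 1])[:, None])  # rotated coordinates
```

Every entry of `raw` is close to 0.5, so no single pairwise correlation reveals the exact linear dependency, while the correlation along the shared direction in `aligned` is 1 up to rounding.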

[0048] Canonical Correlation Analysis (CCA) can address arbitrary coordinates by finding unique transformations, $W_X$ and $W_Y$, that, when applied to X and Y, respectively, yield canonical representations $Z_X$ and $Z_Y$. The canonical representations can be constructed such that the i-th dimensions of $Z_X$ and $Z_Y$ can be correlated with each other and uncorrelated with other dimensions within or between the representations. Moreover, the first canonical dimensions may be dimensions that maximize the correlation between the two datasets, the second canonical dimensions may be selected to maximize the correlation subject to being orthogonal to the first dimensions, and so on. In other words, the representations can be constructed such that their cross-correlation matrix is diagonal and the dimensions are ordered from most to least correlated.

[0049] In some examples, CCA may be invariant with respect to affine transformations of the datasets. For example, two new representations can be created as follows: $X^* = A_X X + b_X$ and $Y^* = A_Y Y + b_Y$. Applying CCA to $X^*$ and $Y^*$ (after mean-centering) can yield the same canonical representations $Z_X$ and $Z_Y$ with the same canonical correlations as when CCA is applied to X and Y. Moreover, applying CCA to X and $X^*$ (or to Y and $Y^*$) can result in a cross-correlation matrix that can be an identity matrix in which the canonical correlations may be 1.
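The affine invariance described above can be checked with a short sketch. It assumes rows are entities, so the linear maps act on the right of the data matrices, and it assumes the maps are invertible; the helper `canonical_correlations` and the specific matrices are hypothetical choices for this example.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations from the SVD of U_X^T U_Y, where U_X and U_Y are
    orthonormal bases of the mean-centered data sets."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Ux = np.linalg.svd(Xc, full_matrices=False)[0]
    Uy = np.linalg.svd(Yc, full_matrices=False)[0]
    return np.linalg.svd(Ux.T @ Uy, compute_uv=False)

rng = np.random.default_rng(1)
N = 500
X = rng.normal(size=(N, 3))
Y = X @ rng.normal(size=(3, 4)) + rng.normal(size=(N, 4))

# Hypothetical invertible linear maps and offsets.
A_x = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, -1.0], [1.0, 0.0, 3.0]])
A_y = np.triu(np.ones((4, 4))) + np.eye(4)
X_star = X @ A_x + np.array([1.0, -2.0, 0.5])
Y_star = Y @ A_y + np.array([0.0, 3.0, -1.0, 2.0])

rho = canonical_correlations(X, Y)
rho_star = canonical_correlations(X_star, Y_star)   # same as rho
rho_self = canonical_correlations(X, X_star)        # all ones
```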

[0050] Under CCA, two representations, X and Y, can be the same if there exist linear transformations, $W_X$ and $W_Y$, such that the cross-correlation matrix of the resultant canonical representations $Z_X$ and $Z_Y$ is the identity. Conversely, if the canonical cross-correlation matrix yields all zeros, then CCA indicates that the two representations are completely different. In in-between cases, CCA may not yield a scalar similarity measure: instead of $p_X \times p_Y$ correlations, there may be $p_X$ correlations (the lesser of $p_X$ and $p_Y$, since it is assumed that $p_X < p_Y$, as shown above). This yields fewer data points to evaluate than in other cases. However, two facts can be leveraged in order to create a scalar similarity measure: 1) since the canonical dimensions of each representation are orthogonal, they partition the variance in their respective datasets; and 2) since the i-th canonical dimensions of $Z_X$ and $Z_Y$ are uncorrelated with other canonical dimensions, each pair of canonical dimensions can be thought of as independent univariate regressions and the square of their associated canonical correlation coefficient is the fraction of mutual variance explained along the i-th canonical dimension.

[0051] Putting these two insights together (e.g., each canonical dimension can carry a unique fraction of the total dataset variance and the square of the corresponding canonical correlation coefficient may dictate how much of that fraction is explained by the other dataset), it is possible to create a similarity measure $S(X, Y)$ that shows what fraction of the variance in X is explained by Y, and vice versa. This technique can be leveraged to map the variance explained back to the original attributes (or groups of attributes), which can be used to derive insight into the nature of the (un)explained variation. In evaluating a new data asset, this approach yields a model-agnostic, task-independent method of assessing the amount of novel (linear) information present.

Dataset Structure

[0052] In some examples, the datasets used in models may reside in tables in which rows correspond to measured items (e.g., entities) and columns correspond to measured variables (attributes or features). Tables of data can be represented as matrices $X \in \mathbb{R}^{N \times p_X}$ and $Y \in \mathbb{R}^{N \times p_Y}$ in which both datasets have the same N unique measured items. The data set X has $p_X$ variables and Y has $p_Y$ variables. In this example, the datasets may have been centered such that each column has had its mean subtracted. Each variable in a dataset can be represented as a column vector, such as $x_j$ or $y_k$. For the purposes of this example, each data set can be scaled by $1/\sqrt{N-1}$ such that:

variance: $x^T x = \sigma_x^2$

covariance: $x^T y = \rho\, \sigma_x \sigma_y$
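The centering and scaling convention above can be sketched as follows. The helper name `center_and_scale` and the synthetic table are assumptions for illustration; the point is only that, after this preprocessing, plain inner products of columns give sample variances and covariances.

```python
import numpy as np

def center_and_scale(A):
    """Subtract column means and divide by sqrt(N - 1), so that A.T @ A is the sample covariance."""
    N = A.shape[0]
    return (A - A.mean(axis=0)) / np.sqrt(N - 1)

rng = np.random.default_rng(2)
# Hypothetical table with N = 200 rows (entities) and 2 correlated columns (attributes).
raw = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])
D = center_and_scale(raw)
x, y = D[:, 0], D[:, 1]

var_x = x @ x     # variance:   x^T x = sigma_x^2
cov_xy = x @ y    # covariance: x^T y = rho * sigma_x * sigma_y
rho = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]
```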

[0053] FIG. 5A illustrates a dataset with two variables $x_1$ and $x_2$. This dataset has a two-dimensional Gaussian distribution. To understand structure, the scatter plot of this data may represent an ellipse rotated at a particular angle. This structure may be related to correlation relationships in the underlying data. The variables $x_1$ and $x_2$ may represent the original measured data.

[0054] Representational similarity can be revisited in more detail through the lens of a univariate regression between $x_1 \in \mathbb{R}^{N \times 1}$ and $x_2 \in \mathbb{R}^{N \times 1}$, as produced below by Equation 2:

Equation 2

$$x_2 = \rho_x \frac{\sigma_{x_2}}{\sigma_{x_1}} x_1 + \sqrt{1 - \rho_x^2}\; \sigma_{x_2}\, \varepsilon$$

The correlation coefficient $\rho_x$ can capture an amount of linear dependency between the two variables (e.g., the expected change in $x_2$ for every unit change in $x_1$), while $\sigma_{x_1}$, which is the standard deviation of $x_1$, can standardize the variance of $x_1$, and $\sigma_{x_2}$ (the standard deviation of $x_2$) scales the variances of $x_1$ and $\varepsilon$ to match that of $x_2$. In this example, $\varepsilon$ can be assumed to be standard normally distributed ($\varepsilon \sim N(0, 1)$) and may be orthogonal to $x_1$ so that $x_1^T \varepsilon = 0$.

[0055] It is also possible to represent $x_1$ in terms of $x_2$. It is possible to represent $x_1$ and $x_2$ in terms of two latent standardized and uncorrelated variables $u_{x_1}$ and $u_{x_2}$, as produced below:

$$x_1 = s_{x_1}\, \rho\, u_{x_1} - s_{x_2} \sqrt{1 - \rho^2}\; u_{x_2}$$

$$x_2 = s_{x_1} \sqrt{1 - \rho^2}\; u_{x_1} + s_{x_2}\, \rho\, u_{x_2}$$

[0056] In some examples, $s_{x_1}$ and $s_{x_2}$ can scale the standardized variables just like the $\sigma$ values in the above-described regression. Additionally or alternatively, the correlation coefficient $\rho$ may be a measure of linear dependency and may dictate how much of $x_1$ and $x_2$ are explained by $u_{x_1}$ and $u_{x_2}$. In this example, $\rho \neq \rho_x$, but they may be related.

[0057] FIG. 5B illustrates the latent variables $u_{x_1}$ and $u_{x_2}$ used to represent the data set illustrated in FIG. 5A. As illustrated in FIG. 5B, the dataset may remain the same, but the coordinate system used to represent the data set has changed.

[0058] In some examples, $\rho = \cos\theta$. Additionally or alternatively, $\cos\theta$ and $\sin\theta$ may take values between -1 and 1, similarly to correlation coefficients. If $\rho = \cos\theta$, then through the trigonometric identity $\cos^2\theta + \sin^2\theta = 1$, $\sqrt{1 - \rho^2} = \sin\theta$. Additionally or alternatively, $x_1$ and $x_2$ can be written as:

$$x_1 = s_{x_1} \cos\theta\; u_{x_1} - s_{x_2} \sin\theta\; u_{x_2}$$

$$x_2 = s_{x_1} \sin\theta\; u_{x_1} + s_{x_2} \cos\theta\; u_{x_2}$$

[0059] There may be a close relationship between the degree of (standardized) covariation between the two variables and the degree of alignment (e.g., the angle between) two vectors, as produced below in Equation 3.

Equation 3

$$x_1^T u_{x_1} = \sum_{n=1}^{N} x_{n1}\, u_{x_1 n} = \lVert x_1 \rVert \cos\theta = \sigma_{x_1} \cos\theta = \rho\, \sigma_{x_1}$$

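[0060] The system of equations defining the original variables $x_1$ and $x_2$ can be represented in terms of latent variables $u_{x_1}$ and $u_{x_2}$ as a factored matrix equation, as produced below in Equation 4:

Equation 4

$$\begin{bmatrix} x_1 & x_2 \end{bmatrix} = \begin{bmatrix} u_{x_1} & u_{x_2} \end{bmatrix} \begin{bmatrix} s_{x_1} & 0 \\ 0 & s_{x_2} \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$$

This can also be represented as Equation 5:

Equation 5

$$X = U_X S_X V_X^T$$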

[0061] Equation 5 may be or include the Singular Value Decomposition (SVD) of data matrix X. A data matrix X can be factored into a set of uncorrelated, standardized columns $U_X$, a diagonal scaling matrix $S_X$, and a rotation matrix $V_X$ using SVD. The columns of $U_X$ and $V_X$ form two sets of orthogonal basis vectors called the left and right singular vectors, respectively, and the scalar values on the diagonal of $S_X$ are called singular values. If X has dimensions ($N \times p_X$), then $U_X$ is ($N \times N$), $S_X$ is ($N \times p_X$), and $V_X$ is ($p_X \times p_X$). These dimensions correspond to the full SVD, and $S_X$ may be rectangular with all zeros for all rows below the diagonal (if $N > p_X$). The zero rows effectively zero out the rightmost ($N - p_X$) columns of $U_X$, so it is common to write the "compact" form where these excess columns have been removed. In some examples, the full SVD may be used when applying CCA to data sets of different dimensionalities. In compact form, $U_X$ is ($N \times p_X$), $S_X$ is ($p_X \times p_X$), and $V_X$ is ($p_X \times p_X$).
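The difference between the full and compact forms can be seen in a small sketch using NumPy's `numpy.linalg.svd`, whose `full_matrices` flag switches between the two. The matrix sizes are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 8, 3                      # hypothetical sizes with N > p
X = rng.normal(size=(N, p))

# Full SVD: U is N x N, and S must be padded to N x p with zero rows.
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=True)
S_full = np.zeros((N, p))
S_full[:p, :p] = np.diag(s_full)

# Compact SVD: the excess N - p columns of U are dropped.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

X_from_full = U_full @ S_full @ Vt_full
X_from_compact = U @ np.diag(s) @ Vt
```

Both factorizations reconstruct X; the compact form simply avoids storing the columns of $U_X$ that are multiplied by zero rows.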

[0062] To find the natural structural representation of dataset X in terms of the latent variables $u_{x_1}$ and $u_{x_2}$, SVD can be performed on the original data matrix. FIG. 6 is an illustration depicting the original dataset X as plotted in graph 602. The decorrelated data illustrated in graph 604 may represent $X V_X$, and graph 606 depicts dataset X in terms of the latent variables $u_{x_1}$ and $u_{x_2}$.

[0063] The right singular vectors $V_X$, obtained by applying SVD to X, can yield a unique linear transform that rotates X into its principal representation as illustrated in graph 604. Mathematically, this is represented as Equation 6:

Equation 6

$$X V_X = U_X S_X V_X^T V_X = U_X S_X$$

[0064] $U_X$ may be or include a set of latent standardized and uncorrelated variables, and $S_X$ can include scaling factors. It is possible to rewrite the SVD of X as produced below in Equation 7:

Equation 7

$$U_X = X V_X S_X^{-1}$$
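Equations 6 and 7 can be sketched numerically as a rotation followed by a rescaling. The function name `sphere` and the synthetic data are assumptions for this example, and the data are centered and scaled by $1/\sqrt{N-1}$ as in the convention described earlier.

```python
import numpy as np

def sphere(X):
    """Return (U, s, V) with U = Xc V diag(1/s), where Xc is X centered and scaled by 1/sqrt(N-1)."""
    Xc = (X - X.mean(axis=0)) / np.sqrt(X.shape[0] - 1)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    principal = Xc @ Vt.T   # Equation 6: rotate onto the principal axes
    U = principal / s       # Equation 7: rescale each principal axis to unit variance
    return U, s, Vt.T

rng = np.random.default_rng(4)
# Hypothetical correlated two-dimensional Gaussian data.
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])
U, s, V = sphere(X)

principal_cov = (U * s).T @ (U * s)   # diagonal: decorrelated but not yet standardized
latent_cov = U.T @ U                  # identity: decorrelated and standardized
```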

[0065] In some examples, $V_X$ can rotate X into its principal representation (illustrated in graph 606), and $S_X^{-1}$ can scale the principal axes to have unit variance. The result $U_X$ is a dataset with standardized, uncorrelated variables in the columns whose covariance matrix is the identity matrix. Geometrically, if the X data point cloud is an ellipsoid, then $V_X$ is aligning the dimensions with the axes of the ellipsoid, and $S_X^{-1}$ is stretching/compressing the ellipsoid into a circle. Thus, when taken together, these operations can be referred to as a sphering or whitening transformation on the dataset.

Dataset Similarity

[0066] Having determined and removed the data structure within each dataset as outlined above, it is possible to uncover the latent similarities between each dataset. It may not be possible to simply examine the cross-covariance matrix computed from the latent variables of each dataset, $U_X^T U_Y$, because this can hide true correlations between the datasets. Rather, a set of latent variables can be determined to uncover dataset similarity. In this case, the latent variables can be referred to as canonical variables. Equation 8 shows that these variables can be defined such that the canonical correlation matrix is:

Equation 8

$$Z_X^T Z_Y = S_{XY}$$

in which $S_{XY}$ is a diagonal matrix with elements specifying the correlation coefficients between the canonical variables for each dataset ordered from maximum to minimum correlation. Because it is a diagonal matrix, this means there are no cross-correlations among the canonical variables. For example, the canonical variables may be orthogonal such that there is at most one canonical variable correlated with another from each dataset. As is the case for latent variables in the above discussion, canonical variables can be standardized to unit variance, $Z_X^T Z_X = I$ and $Z_Y^T Z_Y = I$.

[0067] To determine similarity between X and Y using canonical correlation analysis, within-dataset structure can be removed from each dataset by applying SVD to produce their sphered principal representations as:

Equation 9a

Latent Variables for Dataset X: $U_X = X V_X S_X^{-1}$

Equation 9b

Latent Variables for Dataset Y: $U_Y = Y V_Y S_Y^{-1}$

[0068] FIG. 7A illustrates the decorrelated and standardized versions, $U_X$ and $U_Y$, of the original datasets X and Y in graphs 702 and 704, respectively, obtained through SVD. Each dataset may include N=30,000 samples, though other suitable numbers of samples are possible. To more easily visualize the operation of Canonical Correlation Analysis (CCA), let 706X and 706Y be subsamples of 10 points from identical rows in each dataset (e.g., 10 points that correspond to the same measured item).

[0069] As illustrated in graph (a) of FIG. 7B, based on the subsample, it appears that the two datasets X and Y would be perfectly correlated if an appropriate rotation was applied, as graph (b) of FIG. 7B demonstrates. The equations below compare the covariance matrices of the latent variables, $U_X$ and $U_Y$, underlying the original datasets X and Y, as well as the canonical variables $Z_X$ and $Z_Y$, which can be found through the application of appropriate linear transforms such as a rotation or a reflection. Because these variables are standardized to unit variance, the elements of the covariance matrices may be correlation coefficients. Equations 10a and 10b provide the covariance matrices:

Equation 10a

Covariance Matrix for Dataset Latent Variables: $U_X^T U_Y = \begin{bmatrix} 0.87 & -0.50 \\ 0.50 & 0.87 \end{bmatrix}$

Equation 10b

Covariance Matrix for Canonical Variables: $Z_X^T Z_Y = \begin{bmatrix} 1.00 & 0.00 \\ 0.00 & 1.00 \end{bmatrix}$

[0070] The largest correlation between the datasets of 0.87 occurs between their first dimensions. As stated above and illustrated in graph (b) of FIG. 7B, the canonical representations of each dataset are perfectly correlated with a value of 1.0. Graph (c) of FIG. 7B confirms this with a comparison of the distribution of $u_{y_1}$ vs. $u_{x_1}$ with that of $z_{y_1}$ vs. $z_{x_1}$, which include the first columns of the canonical variables $Z_X$ and $Z_Y$.

[0071] Another example is illustrated in FIG. 8. In this example, the underlying patterns 802X and 802Y do not exactly match. In other words, the correlations between the underlying canonical variables do not equal 1.0. Equations 11a and 11b provide the covariance matrices of the latent variables $U_X$ and $U_Y$ and the canonical variables $Z_X$ and $Z_Y$.

Equation 11a

Covariance Matrix for Dataset Latent Variables: $U_X^T U_Y = \begin{bmatrix} 0.86 & -0.50 \\ 0.49 & 0.86 \end{bmatrix}$

Equation 11b

Covariance Matrix for Canonical Variables: $Z_X^T Z_Y = \begin{bmatrix} 0.99 & 0.00 \\ 0.00 & 0.99 \end{bmatrix}$

[0072] The example illustrated by FIG. 8 may demonstrate how to produce canonical variables, $Z_X$ and $Z_Y$, by applying a simple linear rotation and reflection operation to the datasets $U_X$ and $U_Y$ until the best possible alignment is obtained. Therefore, two orthonormal matrices, $U_{XY}$ and $V_{XY}$, may be determined such that the latent variables of the original datasets can be transformed into the canonical variables through a rotation and reflection, as produced below in Equations 12a and 12b:

Equation 12a

Canonical Variables for Dataset X: $Z_X = U_X U_{XY}$

Equation 12b

Canonical Variables for Dataset Y: $Z_Y = U_Y V_{XY}$

[0073] It is possible to obtain the canonical variables and their correlations by substituting the values for $Z_X$ and $Z_Y$ from Equations 12a and 12b into Equation 8 to yield Equation 13:

Equation 13

$$(U_X U_{XY})^T (U_Y V_{XY}) = U_{XY}^T\, U_X^T U_Y\, V_{XY} = S_{XY}$$

Pre-multiplying both sides of Equation 13 by $U_{XY}$ and post-multiplying both sides by $V_{XY}^T$ yields Equation 14:

Equation 14

$$U_X^T U_Y = U_{XY}\, S_{XY}\, V_{XY}^T$$

in which the matrix operations can be defined in detail as Equation 15:

Equation 15

$$U_X^T U_Y = \begin{bmatrix} \cos\theta_u & -\sin\theta_u \\ \sin\theta_u & \cos\theta_u \end{bmatrix} \begin{bmatrix} \rho_1 & 0 \\ 0 & \rho_2 \end{bmatrix} \begin{bmatrix} \cos\theta_v & \sin\theta_v \\ -\sin\theta_v & \cos\theta_v \end{bmatrix}$$

which is simply the Singular Value Decomposition (SVD) as discussed above. Thus, to obtain the canonical variables and their correlations, it is possible to apply the SVD to the covariance matrix of the latent variables for each dataset, yielding the appropriate transforms (rotations and reflections) to align the datasets in their directions of maximum correlation as demonstrated in FIGs. 7B and 8.
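The alignment step can be sketched on two small matrices. The first is an exact 30-degree rotation, an assumption chosen because its entries round to those of Equation 10a; the second copies the rounded entries of Equation 11a. The helper name `align` is hypothetical.

```python
import numpy as np

def align(M):
    """SVD of the latent cross-covariance M = U_X^T U_Y; returns (U_XY, rho, V_XY)."""
    U_xy, rho, Vt_xy = np.linalg.svd(M)
    return U_xy, rho, Vt_xy.T

theta = np.deg2rad(30.0)  # assumed angle; cos and sin round to 0.87 and 0.50
M_rot = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
M_near = np.array([[0.86, -0.50],
                   [0.49,  0.86]])   # rounded entries taken from Equation 11a

U_xy, rho_rot, V_xy = align(M_rot)
_, rho_near, _ = align(M_near)

# Equation 13: rotating both sides diagonalizes the cross-covariance.
aligned = U_xy.T @ M_rot @ V_xy
```

For the exact rotation, both canonical correlations equal 1 and `aligned` is the identity. For the rounded matrix, both canonical correlations fall slightly below 1, in line with the imperfect match described for FIG. 8.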

[0074] Putting everything together from the above discussions, a pair of linear transformations $W_X$ and $W_Y$ can be defined. These linear transformations can take the original datasets X and Y and generate the canonical variables $Z_X$ and $Z_Y$ by applying SVD, in some examples, three times: once to factor X for $U_X$; once to factor Y for $U_Y$; and once to factor $U_X^T U_Y$ for $S_{XY}$. Substituting the definitions for $U_X$ and $U_Y$ from Equations 9a and 9b into the definitions of $Z_X$ and $Z_Y$ as in Equations 12a and 12b yields Equations 16a and 16b:

Equation 16a

Canonical Variables for Dataset X: $Z_X = U_X U_{XY} = X V_X S_X^{-1} U_{XY} = X W_X$

Equation 16b

Canonical Variables for Dataset Y: $Z_Y = U_Y V_{XY} = Y V_Y S_Y^{-1} V_{XY} = Y W_Y$

The above provides the operations to convert the original datasets to the canonical variable space as produced below in Equations 17a and 17b.

Equation 17a

Canonical Transform for Dataset X: $W_X = V_X S_X^{-1} U_{XY}$

Equation 17b

Canonical Transform for Dataset Y: $W_Y = V_Y S_Y^{-1} V_{XY}$
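The three-SVD procedure of Equations 16a through 17b can be sketched as a single function. This is an illustrative sketch under stated assumptions: the name `cca`, the synthetic data, and the use of the compact SVD (rather than the full SVD mentioned earlier for data sets of different dimensionalities) are choices made for this example.

```python
import numpy as np

def cca(X, Y):
    """Return (Z_X, Z_Y, W_X, W_Y, rho) using three compact SVDs."""
    N = X.shape[0]
    Xc = (X - X.mean(axis=0)) / np.sqrt(N - 1)
    Yc = (Y - Y.mean(axis=0)) / np.sqrt(N - 1)
    Ux, sx, Vxt = np.linalg.svd(Xc, full_matrices=False)              # SVD 1: factor X
    Uy, sy, Vyt = np.linalg.svd(Yc, full_matrices=False)              # SVD 2: factor Y
    Uxy, rho, Vxyt = np.linalg.svd(Ux.T @ Uy, full_matrices=False)    # SVD 3: factor U_X^T U_Y
    Wx = Vxt.T @ np.diag(1.0 / sx) @ Uxy       # Equation 17a
    Wy = Vyt.T @ np.diag(1.0 / sy) @ Vxyt.T    # Equation 17b
    return Xc @ Wx, Yc @ Wy, Wx, Wy, rho

rng = np.random.default_rng(5)
N = 2000
latent = rng.normal(size=(N, 2))   # hypothetical shared signal
X = latent @ np.array([[1.0, 0.5], [0.0, 1.0]]) + 0.3 * rng.normal(size=(N, 2))
Y = latent @ np.array([[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]) + 0.3 * rng.normal(size=(N, 3))

Zx, Zy, Wx, Wy, rho = cca(X, Y)
cross = Zx.T @ Zy   # Equation 8: diagonal matrix of canonical correlations
```

With two columns in X and three in Y, the compact form keeps two canonical dimensions, so `cross` is a 2 x 2 diagonal matrix whose entries are ordered from largest to smallest.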

[0075] The above has demonstrated how to thoroughly discover within-dataset structure and uncover between-dataset similarity and thus describe the datasets and their relationships by applying three singular value decompositions: one SVD to each dataset for their latent variables, and one SVD to the covariance matrix generated from these latent variables for their canonical variables.

Variance Explained

[0076] In some examples, what is determined is a correlation-based measure of similarity. Under such measures, two attributes, such as representations, are highly similar if they strongly co-vary: an increase / decrease in one is expected to correspond to similar increase / decrease in the other. On the other hand, if two attributes are highly dissimilar, an increase / decrease in the value of one attribute may indicate little about the value of the other attribute. In some examples, the units of each attribute and, consequently, the units of these variations may differ. As such, correlation-based similarity requires first standardizing each attribute by its standard deviation (the square root of its variance) such that they are unitless.

[0077] Correlation, then, is a standardized measure of covariation, where 0 correlation means that a change in one attribute is completely independent of the other, and a correlation of ±1 means that a change in one attribute will correspond to exactly the same degree of change in the other attribute (either in the same or the opposite direction). Under the assumption that one attribute can be expressed as a linear model of the other (see, e.g., Equation 2 above), correlation quantifies how much can be known about the value of one attribute just by knowing the value of the other.

[0078] In some examples, the univariate regression of Equation 2 dictates that it is possible to split $x_2$ into two parts:

1. The part of $x_2$ that is explained by $x_1$; and
2. The part of $x_2$ that is not explained by $x_1$.

This allocation can be controlled or represented by the correlation coefficient $\rho$, which can take a value between -1 and 1. The role of the standard deviations is to ensure commensurability, for example to ensure comparisons are made in the same units. Normalizing $x_1$ and $x_2$ to unit variance by dividing them by their respective standard deviations yields a regression equation as produced below in Equation 18:

Equation 18

$$x_2 = \rho\, x_1 + \sqrt{1 - \rho^2}\; \varepsilon$$

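[0079] Further, the square of the correlation coefficient, $\rho^2$, may be a direct measure of the fraction of $x_2$'s variance ($\sigma_{x_2}^2$) explained by $x_1$, as shown in Equation 19:

Equation 19

$$\begin{aligned} \mathrm{Var}(x_2) &= x_2^T x_2 \\ &= \left(\rho \frac{\sigma_{x_2}}{\sigma_{x_1}} x_1 + \sqrt{1 - \rho^2}\; \sigma_{x_2}\, \varepsilon\right)^T \left(\rho \frac{\sigma_{x_2}}{\sigma_{x_1}} x_1 + \sqrt{1 - \rho^2}\; \sigma_{x_2}\, \varepsilon\right) \\ &= \rho^2 \left(\frac{\sigma_{x_2}}{\sigma_{x_1}}\right)^2 x_1^T x_1 + (1 - \rho^2)\, \sigma_{x_2}^2\, \varepsilon^T \varepsilon \\ &= \rho^2 \sigma_{x_2}^2 + (1 - \rho^2)\, \sigma_{x_2}^2 \end{aligned}$$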
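The variance split in Equation 19 can be checked numerically. In this sketch the numerical values of the standard deviations and the correlation are hypothetical, and the vectors are built under the scaled convention used earlier, so that $x_1^T x_1 = \sigma_{x_1}^2$, $\varepsilon^T \varepsilon = 1$, and $x_1^T \varepsilon = 0$ hold exactly rather than only in expectation.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 400
raw = rng.normal(size=(N, 2))
raw -= raw.mean(axis=0)
Q, _ = np.linalg.qr(raw)   # two orthonormal, zero-mean columns

sigma_x1, sigma_x2, rho = 2.0, 0.5, 0.8   # hypothetical values
x1 = sigma_x1 * Q[:, 0]                   # x1^T x1 = sigma_x1^2
eps = Q[:, 1]                             # eps^T eps = 1 and x1^T eps = 0

# Regression form used inside Equation 19.
x2 = rho * (sigma_x2 / sigma_x1) * x1 + np.sqrt(1 - rho**2) * sigma_x2 * eps

explained = rho**2 * sigma_x2**2          # part of Var(x2) explained by x1
unexplained = (1 - rho**2) * sigma_x2**2  # part of Var(x2) not explained by x1
fraction = explained / (x2 @ x2)          # equals rho^2
```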

[0080] The entries of the canonical cross-correlation matrix $S_{XY}$ can be the canonical correlation coefficients in which the i-th diagonal element may capture the correlation between the i-th canonical dimensions of $Z_X$ and $Z_Y$. Since each pair of canonical dimensions is uncorrelated with all others both within and between the canonical representations, the i-th pair of canonical dimensions can be represented as independent univariate regressions. The set can be expressed as Equation 20 below:

Equation 20

$$Z_Y = Z_X S_{XY} + E_Y \sqrt{I - S_{XY}^2}$$

assuming that the above relationships hold given that $Z_X^T E_Y = E_Y^T Z_X = 0$.

[0081] Two factors can now be leveraged to address the goal of analyzing representational overlap:

1. Since the canonical dimensions of each representation are orthogonal, they partition the variance in their respective datasets; and
2. Since the i-th pair of canonical dimensions can be thought of as independent univariate regressions, the square of their associated canonical correlation coefficient is the fraction of mutual variance explained along the i-th canonical dimension.

[0082] Thus, each canonical dimension carries a unique fraction of the total dataset variance and the square of the corresponding canonical correlation coefficient indicates how much of that fraction is explained by the other dataset. Having found the canonical representations of the datasets, they can be expressed as Equations 21a and 21b:

Equation 21a

Dataset X Dependency on $Z_X$: $X = Z_X W_X^{-1}$

Equation 21b

Dataset Y Dependency on $Z_Y$: $Y = Z_Y W_Y^{-1}$

in which $W_X^{-1}$ and $W_Y^{-1}$ may provide the linear transformations from canonical space back into the original data spaces for X and Y, respectively. These inverse canonical transformations are defined by Equations 22a and 22b:

Equation 22a

Inverse Canonical Transformation for Dataset X: $W_X^{-1} = U_{XY}^T S_X V_X^T$

Equation 22b

Inverse Canonical Transformation for Dataset Y: $W_Y^{-1} = V_{XY}^T S_Y V_Y^T$
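The inverse canonical transformations can be sketched for the case in which both data sets have the same number of columns, so that $W_X$ and $W_Y$ are square and invertible. That equal-dimension setup and the synthetic data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1000
# Hypothetical data sets with two columns each.
X = rng.normal(size=(N, 2)) @ np.array([[1.5, 0.4], [0.0, 0.7]])
Y = X @ np.array([[0.6, -0.2], [0.3, 0.9]]) + 0.5 * rng.normal(size=(N, 2))
Xc = (X - X.mean(axis=0)) / np.sqrt(N - 1)
Yc = (Y - Y.mean(axis=0)) / np.sqrt(N - 1)

Ux, sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
Uy, sy, Vyt = np.linalg.svd(Yc, full_matrices=False)
Uxy, rho, Vxyt = np.linalg.svd(Ux.T @ Uy)

Wx = Vxt.T @ np.diag(1.0 / sx) @ Uxy        # Equation 17a
Wy = Vyt.T @ np.diag(1.0 / sy) @ Vxyt.T     # Equation 17b
Wx_inv = Uxy.T @ np.diag(sx) @ Vxt          # Equation 22a
Wy_inv = Vxyt @ np.diag(sy) @ Vyt           # Equation 22b

Zx, Zy = Xc @ Wx, Yc @ Wy
X_back = Zx @ Wx_inv                        # Equation 21a
Y_back = Zy @ Wy_inv                        # Equation 21b
```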

[0083] Given the representation of Y in terms of its canonical variables and the related inverse canonical transform, it is possible to show that the covariance matrix of Y can be expressed solely in terms of the diagonalized canonical cross-correlation matrix $S_{XY}$ and the inverse canonical transformation for Y, as produced below in Equation 23:

Equation 23

$$\Sigma_Y = Y^T Y = \left(S_{XY} W_Y^{-1}\right)^T \left(S_{XY} W_Y^{-1}\right) + \left(\sqrt{I - S_{XY}^2}\; W_Y^{-1}\right)^T \left(\sqrt{I - S_{XY}^2}\; W_Y^{-1}\right)$$

Example of a Flowchart of a Process for Controlling Access

[0001] FIG. 9 is a flowchart illustrating an example of a process for controlling access using multiple data sources with varying availability according to certain aspects of the present disclosure. One or more computing devices, such as the risk assessment computing system 130, may implement operations illustrated in FIG. 9 by executing suitable program code. For illustrative purposes, the process 900 is described with reference to certain examples depicted in the figures. Other implementations, however, are possible.

[0084] At block 902, the process 900 involves training a machine-learning model using a particular training process. The training process may involve training the machine-learning model to determine a risk indicator for a target entity from predictor variables associated with the target entity. The machine-learning model may be associated with one or more features, and the training process may involve multiple steps. For example, the training process may include evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set. Additionally or alternatively, the training process may include selecting the first data set based on an amount of variance in the first data set that is explained by an amount of variance in the second data set being above a predetermined threshold. Additionally or alternatively, the training process may involve training the machine learning model using the first data set.
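The evaluation and selection steps of block 902 can be sketched as follows. This is a hedged illustration and not the claimed implementation: the scalar measure (the trace of the explained part of the covariance, by analogy with Equation 23, divided by the total variance), the function names, the threshold value, and the restriction that the first data set has no more columns than the second are all assumptions introduced for this example.

```python
import numpy as np

def fraction_explained(X, Y):
    """Fraction of the total variance of X explained by Y (assumes X.shape[1] <= Y.shape[1])."""
    if X.shape[1] > Y.shape[1]:
        raise ValueError("this sketch assumes X has no more columns than Y")
    N = X.shape[0]
    Xc = (X - X.mean(axis=0)) / np.sqrt(N - 1)
    Yc = (Y - Y.mean(axis=0)) / np.sqrt(N - 1)
    Ux, sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy = np.linalg.svd(Yc, full_matrices=False)[0]
    Uxy, rho, _ = np.linalg.svd(Ux.T @ Uy, full_matrices=False)
    Wx_inv = Uxy.T @ np.diag(sx) @ Vxt     # inverse canonical transformation for X
    part = np.diag(rho) @ Wx_inv           # S_XY W_X^{-1}
    return float(np.trace(part.T @ part) / np.trace(Xc.T @ Xc))

THRESHOLD = 0.5  # hypothetical predetermined threshold

def select_first_data_set(first, second, threshold=THRESHOLD):
    """Select the first data set when the variance explained exceeds the threshold."""
    return fraction_explained(first, second) > threshold

rng = np.random.default_rng(8)
first = rng.normal(size=(2000, 2))
redundant = first @ np.array([[1.0, 2.0], [0.0, 1.0]])   # linear re-encoding of first
unrelated = rng.normal(size=(2000, 3))                   # independent of first

frac_redundant = fraction_explained(first, redundant)
frac_unrelated = fraction_explained(first, unrelated)
```

In this sketch, a data set that is only a linear re-encoding of the first yields a fraction of 1 up to rounding, while an independent data set yields a fraction near 0.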

[0085] At block 904, the process 900 involves generating a risk indicator. In some examples, the risk indicator may be generated for the target entity using the trained machine-learning model and the predictor variables. At block 906, the process 900 involves transmitting, to a remote computing device, a responsive message. The responsive message may include at least the risk indicator for use in controlling access of the target entity to one or more computing environments, which may be interactive or otherwise have access to protected data.

Example of Computing System for Machine-Learning Operations

[0086] Any suitable computing system or group of computing systems can be used to perform the operations for the machine-learning operations described herein. For example, FIG. 10 is a block diagram depicting an example of a computing device 1000, which can be used to implement the risk assessment server 118 or the model training server 110. The computing device 1000 can include various devices for communicating with other devices in the operating environment 100, as described with respect to FIG. 1. The computing device 1000 can include various devices for performing one or more transformation operations described above with respect to FIGS. 1-9.

[0087] The computing device 1000 can include a processor 1002 that is communicatively coupled to a memory 1004. The processor 1002 executes computer-executable program code stored in the memory 1004, accesses information stored in the memory 1004, or both. Program code may include machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, among others.

[0088] Examples of a processor 1002 can include a microprocessor, an application-specific integrated circuit, a field-programmable gate array, or any other suitable processing device. The processor 1002 can include any number of processing devices, including one. The processor 1002 can include or communicate with a memory 1004. The memory 1004 can store program code that, when executed by the processor 1002, causes the processor to perform the operations described herein.

[0089] The memory 1004 can include any suitable non-transitory computer-readable storage medium. The computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable program code or other program code. Non-limiting examples of a computer-readable medium include a magnetic disk, memory chip, optical storage, flash memory, storage class memory, ROM, RAM, an ASIC, magnetic storage, or any other medium from which a computer processor can read and execute program code. The program code may include processor-specific program code generated by a compiler or an interpreter from code written in any suitable computer-programming language. Examples of suitable programming languages include Hadoop, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, ActionScript, etc.

[0090] The computing device 1000 may also include a number of external or internal devices such as input or output devices. For example, the computing device 1000 is illustrated with an input/output interface 1008 that can receive input from input devices or provide output to output devices. A bus 1006 can also be included in the computing device 1000. The bus 1006 can communicatively couple one or more components of the computing device 1000.

[0091] The computing device 1000 can execute program code 1014 that includes the risk assessment application 114 and / or the model training application 112. The program code 1014 for the risk assessment application 114 and / or the model training application 112 may be resident in any suitable computer-readable medium and may be executed on any suitable processing device. For example, as depicted in FIG. 10, the program code 1014 for the risk assessment application 114 and / or the model training application 112 can reside in the memory 1004 at the computing device 1000 along with the program data 1016 associated with the program code 1014, such as the predictor variables 124 and / or the model training samples. Executing the risk assessment application 114 or the model training application 112 can configure the processor 1002 to perform the operations described herein.

[0092] In some aspects, the computing device 1000 can include one or more output devices. One example of an output device is the network interface device 1010 depicted in FIG. 10. A network interface device 1010 can include any device or group of devices suitable for establishing a wired or wireless data connection to one or more data networks described herein. Non-limiting examples of the network interface device 1010 include an Ethernet network adapter, a modem, etc.

[0093] Another example of an output device is the presentation device 1012 depicted in FIG. 10. A presentation device 1012 can include any device or group of devices suitable for providing visual, auditory, or other suitable sensory output. Non-limiting examples of the presentation device 1012 include a touchscreen, a monitor, a speaker, a separate mobile computing device, etc. In some aspects, the presentation device 1012 can include a remote client-computing device that communicates with the computing device 1000 using one or more data networks described herein. In other aspects, the presentation device 1012 can be omitted.

[0094] The foregoing description of some examples has been presented only for the purpose of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Numerous modifications and adaptations thereof will be apparent to those skilled in the art without departing from the spirit and scope of the disclosure.

Claims

1. A method that includes one or more processing devices performing operations comprising:
generating a risk indicator for a target entity using a trained machine learning model and predictor variables associated with the target entity, wherein the trained machine learning model is trained, using a training process, to determine a risk indicator for a target entity from the predictor variables, wherein the trained machine learning model is associated with one or more features, and wherein the training process includes operations comprising:
evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set;
selecting a first subset of the first data set based on an amount of variance in the first subset that is explained by an amount of variance in the second data set or a second subset of the second data set being above a predetermined threshold, the amount of variance determined based on an analysis of attributes associated with the first data set or the second data set; and
training the trained machine learning model using the first subset; and
transmitting, to a remote computing device, a responsive message comprising at least the risk indicator for use in controlling access of the target entity to one or more computing environments.

2. The method of claim 1, wherein the training process includes operations further comprising applying canonical correlation analysis (CCA) to the first data set and the second data set to generate one or more unique transformations to facilitate a similarity measure for determining the amount of variance.

3. The method of claim 1, wherein the training process includes operations further comprising determining a canonical correlation coordinate system for the first data set and the second data set to maximize a correlation between the first data set and the second data set.

4. The method of claim 3, wherein the training process includes operations further comprising: determining a first transformation associated with the first data set and a second transformation associated with the second data set, wherein the first transformation and the second transformation, when applied to the first data set and the second data set, respectively, yield their canonical representations, respectively.

5. The method of claim 1, wherein the training process further comprises determining a portion of a covariance of the first data set that is explained by the second data set to facilitate a determination of the amount of variance that is above the predetermined threshold.

6. The method of claim 5, wherein the training process further comprises determining a portion of a covariance of the second data set that is explained by the first data set to facilitate the determination of the amount of variance that is above the predetermined threshold.

7. The method of claim 6, wherein the training process further comprises: determining a percentage of the first data set that is correlated with the second data set based on the portion of a covariance of the first data set that is explained by the second data set and the portion of a covariance of the second data set that is explained by the first data set.

8. A system comprising: a processing device; and a memory device in which instructions executable by the processing device are stored for causing the processing device to perform operations comprising: generating a risk indicator for a target entity using a trained machine learning model and predictor variables associated with the target entity, wherein the trained machine learning model is trainable, using a training process, to determine a risk indicator for a target entity from the predictor variables, wherein the trained machine learning model is associated with one or more features, and wherein the training process includes operations comprising: evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set; selecting a first subset of the first data set based on an amount of variance in the first subset that is explained by an amount of variance in the second data set or a second subset of the second data set being above a predetermined threshold, the amount of variance determinable based on an analysis of attributes associated with the first data set or the second data set; and training the trained machine learning model using the first subset; and transmitting, to a remote computing device, a responsive message comprising at least the risk indicator for use in controlling access of the target entity to one or more computing environments.

9. The system of claim 8, wherein the training process includes operations further comprising applying canonical correlation analysis (CCA) to the first data set and the second data set to generate one or more unique transformations to facilitate a similarity measure for determining the amount of variance.

10. The system of claim 8, wherein the training process includes operations further comprising determining a canonical correlation coordinate system for the first data set and the second data set to maximize a correlation between the first data set and the second data set.

11. The system of claim 10, wherein the training process includes operations further comprising: determining a first transformation associated with the first data set and a second transformation associated with the second data set, wherein the first transformation and the second transformation, when applied to the first data set and the second data set, respectively, yield their canonical representations, respectively.

12. The system of claim 8, wherein the training process further comprises determining a portion of a covariance of the first data set that is explained by the second data set to facilitate a determination of the amount of variance that is above the predetermined threshold.

13. The system of claim 12, wherein the training process further comprises determining a portion of a covariance of the second data set that is explained by the first data set to facilitate the determination of the amount of variance that is above the predetermined threshold.

14. The system of claim 13, wherein the training process further comprises: determining a percentage of the first data set that is correlated with the second data set based on the portion of a covariance of the first data set that is explained by the second data set and the portion of a covariance of the second data set that is explained by the first data set.

15. A non-transitory computer-readable storage medium having program code that is executable by a processor to cause a computing device to perform operations, the operations comprising: generating a risk indicator for a target entity using a trained machine learning model and predictor variables associated with the target entity, wherein the trained machine learning model is trainable, using a training process, to determine a risk indicator for a target entity from the predictor variables, wherein the trained machine learning model is associated with one or more features, and wherein the training process includes operations comprising: evaluating a first data set and a second data set based on a representational similarity between the first data set and the second data set; selecting a first subset of the first data set based on an amount of variance in the first subset that is explained by an amount of variance in the second data set or a second subset of the second data set being above a predetermined threshold, the amount of variance determinable based on an analysis of attributes associated with the first data set or the second data set; and training the trained machine learning model using the first subset; and transmitting, to a remote computing device, a responsive message comprising at least the risk indicator for use in controlling access of the target entity to one or more computing environments.

16. The non-transitory computer-readable storage medium of claim 15, wherein the training process includes operations further comprising applying canonical correlation analysis (CCA) to the first data set and the second data set to generate one or more unique transformations to facilitate a similarity measure for determining the amount of variance.

17. The non-transitory computer-readable storage medium of claim 15, wherein the training process includes operations further comprising determining a canonical correlation coordinate system for the first data set and the second data set to maximize a correlation between the first data set and the second data set.

18. The non-transitory computer-readable storage medium of claim 17, wherein the training process includes operations further comprising: determining a first transformation associated with the first data set and a second transformation associated with the second data set, wherein the first transformation and the second transformation, when applied to the first data set and the second data set, respectively, yield their canonical representations, respectively.

19. The non-transitory computer-readable storage medium of claim 15, wherein the training process further comprises, to facilitate a determination of the amount of variance that is above the predetermined threshold: determining a portion of a covariance of the first data set that is explained by the second data set; and determining a portion of a covariance of the second data set that is explained by the first data set.

20. The non-transitory computer-readable storage medium of claim 19, wherein the training process further comprises: determining a percentage of the first data set that is correlated with the second data set based on the portion of a covariance of the first data set that is explained by the second data set and the portion of a covariance of the second data set that is explained by the first data set.
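
The data-set comparison operations recited in the claims (canonical correlation analysis, the portion of one data set's covariance explained by the other, and threshold-based subset selection) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. It assumes that both data sets are numeric matrices with the same rows (entities), that "explained variance" is measured as the R^2 of a linear least-squares fit, and that the "first subset" is the set of attributes (columns) of the first data set whose R^2 exceeds the threshold. The function names, the regularization constant eps, and the default threshold of 0.5 are hypothetical and do not come from the disclosure.

```python
import numpy as np


def _center(M):
    """Subtract the column means so covariances can be formed directly."""
    return M - M.mean(axis=0)


def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T


def canonical_correlations(X, Y, eps=1e-8):
    """Return canonical correlations and the two linear transformations.

    Xc @ A and Yc @ B (with Xc, Yc the centered data) are the canonical
    representations of the first and second data set. eps is an assumed
    ridge term that keeps the covariance matrices invertible.
    """
    Xc, Yc = _center(X), _center(Y)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = _inv_sqrt(Sxx), _inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted from largest to smallest.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky, full_matrices=False)
    return np.clip(s, 0.0, 1.0), Kx @ U, Ky @ Vt.T


def _residuals(X, Y):
    """Centered X and what is left after a least-squares fit on Y."""
    Xc, Yc = _center(X), _center(Y)
    coef, *_ = np.linalg.lstsq(Yc, Xc, rcond=None)
    return Xc, Xc - Yc @ coef


def explained_per_attribute(X, Y):
    """R^2 of each attribute (column) of X when fit linearly from Y."""
    Xc, resid = _residuals(X, Y)
    return 1.0 - resid.var(axis=0) / Xc.var(axis=0)


def explained_fraction(X, Y):
    """Share of the total variance of X that Y explains linearly."""
    Xc, resid = _residuals(X, Y)
    return 1.0 - resid.var(axis=0).sum() / Xc.var(axis=0).sum()


def select_attributes(X, Y, threshold=0.5):
    """Indices of X's attributes whose explained variance exceeds threshold."""
    r2 = explained_per_attribute(X, Y)
    return np.flatnonzero(r2 > threshold), r2
```

Calling explained_fraction(X, Y) and explained_fraction(Y, X) gives the two directional quantities that claims 5 and 6 refer to. How they are combined into a single percentage is not specified in the claims, so the sketch leaves that step out.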