Neural network architecture for transaction data processing

By employing a specific neural network architecture and dynamically updated machine learning models, the challenges of real-time performance and accuracy in existing transaction processing systems are addressed, achieving efficient and accurate fraud detection and enhanced security.

CN116235183B · Active · Publication Date: 2026-06-23 · FITCHERS BASES LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: FITCHERS BASES LTD
Filing Date: 2021-05-24
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing transaction processing systems face challenges in terms of real-time performance, security, and accuracy, especially in large-scale transaction processing. Traditional fraud detection methods have low accuracy and high false positive rates, and existing machine learning models are difficult to apply effectively in real-time transaction processing systems.

Method used

Machine learning systems employing specific neural network architectures, combined with dynamically updated machine learning models, can process large-scale transaction data with second-level or sub-second-level processing latency. By training and optimizing the parameters of neural network layers, they can achieve rapid inference and accurate fraud detection.

Benefits of technology

It achieves efficient and accurate fraud detection in real-time transaction processing, reduces false alarm rates, adapts to the behavioral patterns of different entities, and supports rapid processing and enhanced security of large-scale transactions.

✦ Generated by Eureka AI based on patent content.


Abstract

An example machine learning system for processing data associated with transactions is described. The machine learning system has a first processing stage that includes a recurrent neural network architecture. The recurrent neural network architecture has a forget gate to modify state data of a previous iteration based on data representing a time difference between a proposed transaction and a prior transaction. The machine learning system also has a second processing stage that has an attention neural network architecture communicatively coupled with the first processing stage. The machine learning system is configured to map output data from the second processing stage to a scalar value representing a likelihood that the proposed transaction presents an anomaly in a series of actions. The scalar value is used to determine whether to approve or reject the proposed transaction.

Description

Technical Field

[0001] This invention relates to systems and methods for applying machine learning systems to transaction data. Some examples relate to a machine learning system for real-time transaction processing. Some examples relate to a method for training a machine learning system for real-time transaction processing.

Background Technology

[0002] Digital payments have exploded over the past two decades, with more than three-quarters of payments worldwide now using some form of payment card or e-wallet. Point-of-sale systems are increasingly moving away from cash and becoming digital. In short, the global business system now relies heavily on electronic data processing platforms. This presents numerous engineering challenges that are largely hidden from non-specialist users. For example, digital transactions need to be completed in real time, that is, with minimal delay experienced by the computer equipment at the point of purchase. Digital transactions also need to be secure and resistant to attack and exploitation. The processing of digital transactions is also constrained by the historical development of the global electronic payment system. For instance, much of the infrastructure is still configured around models designed for mainframe architectures used more than 50 years ago.

[0003] As digital transactions increase, new security risks become apparent. Digital transactions provide new opportunities for fraud and malicious activity. In 2015, it was estimated that 7% of digital transactions were fraudulent, and this figure is expected to increase as more economic activity shifts online. Fraud losses are estimated at around four times the world's population (e.g., when expressed in US dollars), and they continue to grow.

[0004] While risks such as fraud are an economic concern for companies engaged in business activities, implementing the technological systems used to process transactions presents an engineering challenge. Traditionally, banks, merchants, and card issuers developed “paper” rules or procedures that staff enforced manually to flag or block certain transactions. As transactions became digital, one approach to building technological systems for processing them was to give computer engineers these established criteria and ask them to implement the criteria against a digital representation of the transactions, essentially translating handwritten rules into coded logical statements applicable to electronic transaction data. This traditional approach has encountered several problems as the volume of digital transactions has grown. First, any processing needs to be applied in “real time” (e.g., with millisecond-level latency). Second, thousands of transactions need to be processed per second (e.g., a typical “load” might be 1000-2000 per second), and the load can change unpredictably over time (e.g., the launch of a new product or a new set of tickets can easily increase the average load several times over). Third, for security reasons, the digital storage systems of transaction processors and banks are often isolated or partitioned, yet digital transactions typically involve interconnected networks of merchant systems. Fourth, it is now possible to conduct large-scale analyses of both reported and predicted fraud. Such analyses expose the flaws in traditional fraud detection methods: low accuracy and high false positive rates. This in turn has a physical effect on digital transaction processing: more genuine point-of-sale and online purchases are rejected, while those seeking to exploit the new digital systems often escape undetected.

[0005] Over the past few years, machine learning methods have been increasingly adopted for processing transaction data. As machine learning models have matured in academia, engineers have begun to explore their application to transaction data processing. However, this has again encountered challenges. Even when engineers are given an academic or theoretical machine learning model and asked to implement it, doing so is no easy task. For example, the problems of large-scale transaction processing systems come into play. Machine learning models do not have the unlimited inference time that they have in the lab. This means that some models cannot practically be implemented in a real-time setting, or that they require significant adaptation to allow real-time processing at the transaction volumes experienced on real servers. Furthermore, engineers need to address the challenge of implementing machine learning models on data that is siloed or partitioned for access-security reasons and that is updated extremely rapidly. The problems faced by engineers building transaction processing systems can therefore be seen as similar to those faced by network or database engineers: machine learning models must be applied while meeting the system throughput and query response time constraints set by the processing infrastructure. There are no easy solutions to these problems. In fact, many transaction processing systems are confidential, proprietary, and based on older technologies, which means that engineers often lack the body of knowledge developed in these adjacent fields and frequently face challenges specific to the transaction processing domain. Furthermore, the field of large-scale practical machine learning is not yet mature, and there are few established design patterns or textbooks that engineers can rely on.

Summary of the Invention

[0006] Various aspects of the invention are set forth in the appended independent claims. Certain variations of the invention are set forth in the appended dependent claims. Other aspects, modifications, and examples are described in the following detailed description.

Description of the Attached Figures

[0007] Examples of the invention will now be described by way of example only with reference to the accompanying drawings, in which:

[0008] Figures 1A to 1C are schematic diagrams illustrating different examples of electronic infrastructure used for transaction processing.

[0009] Figures 2A and 2B are schematic diagrams illustrating different examples of data storage systems used by machine learning transaction processing systems.

[0010] Figures 3A and 3B are diagrams illustrating different examples of transaction data.

[0011] Figure 4 is a schematic diagram illustrating example components of a machine learning transaction processing system.

[0012] Figures 5A and 5B are sequence diagrams showing a set of example processes performed on transaction data by different computing entities.

[0013] Figure 6A is a schematic diagram illustrating a set of example components of a first configuration of a machine learning system for processing transaction data.

[0014] Figure 6B is a schematic diagram showing the first stage of the example machine learning system of Figure 6A.

[0015] Figure 6C is a schematic diagram showing the second stage of the example machine learning system of Figure 6B.

[0016] Figure 7 is a schematic diagram illustrating a set of example components of a second configuration of a machine learning system for processing transaction data.

[0017] Figure 8 is a schematic diagram illustrating an example training configuration for a machine learning system.

[0018] Figure 9 is a schematic diagram of an example machine learning system suitable for training using synthetic data samples.

[0019] Figure 10 is a schematic diagram of a processing pipeline used to generate synthetic data samples from examples.

[0020] Figure 11A is a schematic diagram showing an example portion of a set of transaction data.

[0021] Figure 11B is a schematic diagram showing an example feature vector.

[0022] Figure 12 is a flowchart illustrating an example method for training a machine learning system using synthetic data samples.

[0023] Figure 13 is a flowchart illustrating an example method of applying a machine learning system.

Detailed Implementation

[0024] Introduction

[0025] Some examples described herein relate to a machine learning system for transaction processing. In some examples, the machine learning system is applied in a real-time, high-volume transaction processing pipeline to indicate whether a transaction or entity matches a previously observed and/or predicted pattern of activity or action, for example, whether the transaction or entity is “normal” or “abnormal.” The term “behavior” is used herein to refer to such a pattern of activity or action. The indication may include a scalar value normalized within a predefined range (e.g., 0 to 1), which can then be used to prevent fraud and other abuses of the payment system. The machine learning system may apply machine learning models that are updated as more transaction data becomes available, for example, by continuously training these models on new data to reduce false positives and maintain the accuracy of the output metric. The present examples may be particularly useful for preventing fraud where the physical presence of a payment card cannot be verified (e.g., so-called “card-not-present” online transactions), or in commercial transactions where high-value transactions may be routine and where it may be difficult to classify a behavioral pattern as “unexpected.” The present examples therefore facilitate the processing of transactions that are primarily “online,” i.e., conducted digitally over one or more public communication networks.

[0026] Some of the examples described herein allow machine learning models to be customized for specific entities (e.g., account holders and merchants). For instance, a machine learning model can model entity-specific patterns of behavior rather than general group or aggregate behavior, which leads to poorer accuracy. The machine learning systems described herein are able to provide dynamically updated machine learning models despite high transaction volumes and/or the need to isolate different data sources.

[0027] The present examples can be applied to a variety of digital transactions, including but not limited to card payments, so-called “wire transfers,” peer-to-peer payments, Bankers' Automated Clearing Services (BACS) payments, and Automated Clearing House (ACH) payments. The output of the machine learning system can be used to prevent various fraudulent and criminal activities, such as card fraud, app fraud, payment fraud, merchant fraud, gaming fraud, and money laundering.

[0028] The example machine learning systems described herein (e.g., machine learning systems configured and/or trained as described below with reference to Figures 1A to 13) are capable of fast inference that can be easily parallelized to provide second or sub-second processing latency and to manage large processing volumes (e.g., billions of transactions per year).

[0029] Two specific aspects are described below. Figures 1A to 5B provide context for both aspects. The first aspect concerns a machine learning system with a specific form of neural network architecture, as shown in Figures 6A to 6C. The second aspect relates to a method of training a machine learning system and is described with reference to Figures 9 to 13. The first and second aspects may each be applied independently. For example, Figures 7 and 8 provide examples of alternative machine learning systems that can be trained according to the second aspect, and Figure 8 illustrates an alternative method for training the machine learning system described with reference to the first aspect. In certain situations, however, the two aspects can be applied together; for example, the training method of the second aspect can be used to train the machine learning system of the first aspect. Both aspects offer specific technical advantages over other systems and methods, as set out in the text below.

[0030] Definitions of Certain Terms

[0031] The term "data" is used in various contexts herein to refer to digital information, such as information represented by known bit structures within one or more programming languages. In use, data can refer to digital information stored as a sequence of bits in computer memory. Some machine learning models operate on structured arrays of data in a predefined bit format. In the terminology of the art, these can be referred to as multidimensional arrays or "tensors." It should be noted that, for machine learning methods, multidimensional arrays (e.g., those with a defined extent in multiple dimensions) can be "flattened" so that they are represented (e.g., in memory) as a sequence or vector of values stored according to a predefined format (e.g., n-bit integers or floating-point numbers, signed or unsigned). Therefore, the term "tensor" as used herein covers multidimensional arrays having one or more dimensions (e.g., vectors, matrices, volumetric arrays, etc.).
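As an illustration of the "flattening" mentioned above, the following minimal sketch stores a small two-dimensional array as a single sequence in row-major order and reads an element back by index arithmetic. The array contents and the helper name `element` are hypothetical, and row-major ordering is an assumption chosen for illustration; the text does not prescribe a particular layout.

```python
# Hypothetical 2 x 3 "tensor" held as nested lists (rows, then columns).
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
rows, cols = len(matrix), len(matrix[0])

# Row-major flattening (an assumed layout): element (r, c) lands at r * cols + c.
flat = [value for row in matrix for value in row]


def element(flat_values, r, c, n_cols):
    """Read element (r, c) back out of the flattened sequence."""
    return flat_values[r * n_cols + c]


print(flat)
print(element(flat, 1, 2, cols))
```

The same index formula extends to more dimensions by multiplying through the sizes of the trailing dimensions.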

[0032] The term "structured digital representation" is used to refer to digital data in a structured form, such as an array with one or more dimensions that stores common data types (e.g., integer or floating-point values). Structured digital representations can contain tensors (e.g., tensors as the term is used in machine learning). Structured digital representations are typically stored as a set of indexed and/or contiguous memory locations; for example, a one-dimensional array of 64-bit floating-point values can be represented in computer memory as a contiguous sequence of 64-bit memory locations in a 64-bit computing system.
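The contiguous storage of 64-bit floating-point values can be sketched with Python's standard `array` and `struct` modules. The values are hypothetical, and the sketch assumes a platform where the C `double` type is 64 bits wide, which is the common case.

```python
from array import array
import struct

# One-dimensional array of C doubles (assumed to be 64-bit on this platform).
values = array("d", [0.5, 1.5, 2.5])

# The raw bytes are laid out back to back: 8 bytes per element.
raw = values.tobytes()
bytes_per_value = values.itemsize

# Element i occupies bytes [i * 8, i * 8 + 8); read element 1 directly.
second = struct.unpack_from("d", raw, 1 * bytes_per_value)[0]

print(bytes_per_value, len(raw), second)
```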

[0033] The term "transaction data" is used herein to refer to electronic data associated with a transaction. A transaction involves a series of communications between different electronic systems to facilitate payment or exchange. Generally, transaction data can include data indicating events (e.g., actions taken over time) that are relevant to and may inform transaction processing. Transaction data can include structured, unstructured, and semi-structured data. Transaction data can also include data associated with a transaction, such as data used to process the transaction. In some cases, the term transaction data can be used broadly to refer to actions taken with respect to one or more electronic devices. Transaction data can take many forms, depending on the precise implementation. However, different data types and formats can be converted through appropriate pre-processing or post-processing.

[0034] The term "interface" is used herein to refer to any physical and / or logical interface that allows one or more data inputs and data outputs. An interface can be implemented via a network interface suitable for sending and / or receiving data, or by retrieving data from one or more memory locations, as implemented by a processor executing a set of instructions. An interface can also include physical (network) coupling to receive data, such as hardware that allows wired or wireless communication over a specific medium. An interface can include application programming interfaces and / or method calls or returns. For example, in a software implementation, an interface can include passing data and / or memory accesses to a function initiated by a method call, wherein the function includes computer program code executed by one or more processors; in a hardware implementation, an interface can include a wired interconnect structure between different chips, chipsets, or portions of chips. In the accompanying drawings, an interface can be indicated by the boundaries of processing blocks having inward and / or outward arrows indicating data transfer.

[0035] The terms "component" and "module" are used interchangeably to refer to a hardware structure with a specific function (e.g., in the form of mapping input data to output data) or a combination of general-purpose hardware and specific software (e.g., specific computer program code that executes on one or more general-purpose processors). Components or modules may be implemented as specific packaged chipsets, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), and/or as software objects, classes, class instances, scripts, code segments, etc., for execution by a processor.

[0036] The term "machine learning model" is used herein to refer to at least a hardware-executed implementation of a machine learning model or function. Known models in the field of machine learning include logistic regression models, Naive Bayes models, random forests, support vector machines, and artificial neural networks. Implementations of classifiers can be provided in one or more machine learning programming libraries, including but not limited to scikit-learn, TensorFlow, and PyTorch.

[0037] The term "mapping" is used herein to refer to transforming or converting a first set of data values into a second set of data values. The two sets of data values may be arrays of different sizes, for example with the output array having lower dimensionality than the input array. The input and output arrays may have common or different data types. In some examples, the mapping is a one-way mapping to a scalar value.

[0038] The term "neural network architecture" refers to a collection of one or more artificial neural networks configured to perform a specific data processing task. For example, a "neural network architecture" can include a specific arrangement of one or more neural network layers of one or more neural network types. Neural network types include convolutional neural networks, recurrent neural networks, and feedforward neural networks. Convolutional neural networks involve the application of one or more convolution operations. Recurrent neural networks involve updating an internal state over a series of inputs. A recurrent neural network is thus considered to include a form of recurrent or feedback connection, whereby the state of the recurrent neural network at a given time or iteration (e.g., t) is updated using the state of the recurrent neural network at a previous time or iteration (e.g., t-1). Feedforward neural networks involve transformation operations without feedback, such as operations applied in a unidirectional sequence from input to output. Feedforward neural networks are sometimes referred to as plain "neural networks," "multilayer perceptrons," "fully connected" neural networks, or "dense," "linear," or "deep" neural networks (the latter consisting of multiple neural network layers connected in series). Some examples described herein utilize recurrent and fully connected neural networks.
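The recurrent update described above, in which the state at iteration t is computed from the state at iteration t-1, can be sketched with a single scalar state. This is a plain recurrent update chosen for brevity, not the gated architecture of the claimed system; the function names and weight values are hypothetical.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One iteration: new state from the current input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_sequence(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the recurrent update over a series of inputs, keeping every state."""
    h = 0.0  # initial state before any input has been seen
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# The later inputs are zero, yet the state stays non-zero: the feedback
# connection carries information from the first input forward.
states = run_sequence([1.0, 0.0, 0.0])
print(states)
```

A feedforward layer, by contrast, would compute each output from the current input alone, with no `h_prev` term.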

[0039] A "neural network layer," as typically defined in machine learning programming tools and libraries, can be viewed as an operation that maps input data to output data. A neural network layer can apply one or more parameters (such as weights) to map input data to output data. It can also apply one or more bias terms. The weights and biases of a neural network layer can be applied using one or more multidimensional arrays or matrices. Typically, a neural network layer has multiple parameters whose values affect how the layer maps input data to output data. These parameters can be trained in a supervised manner by optimizing an objective function, which typically involves minimizing a loss function. Some parameters can also be pre-trained or fixed in another way. Fixed parameters can be viewed as configuration data that controls the operation of the neural network layer. A neural network layer or architecture can include a mixture of fixed and learnable parameters. A recurrent neural network layer can apply a series of operations to update a recurrent state and to transform input data. Both updating the recurrent state and transforming the input data may involve transformations of one or more of the previous recurrent state and the input data. A recurrent neural network layer can be trained by unrolling the modeled recurrent unit, as can be done in machine learning programming tools and libraries. Although a recurrent neural network (RNN) can be viewed as containing multiple (sub)layers that apply different gating operations, most machine learning programming tools and libraries refer to the application of an RNN as a whole as a "neural network layer," and this description follows that convention. Finally, a feedforward neural network layer can apply one or more of a set of weights and biases to input data to generate output data. This operation can be represented as a matrix operation (e.g., with the bias term included by appending the value 1 to the input data). Alternatively, the bias can be applied via a separate addition operation. As mentioned above, in line with machine learning libraries, the term "tensor" is used to refer to an array that may have multiple dimensions; for example, a tensor may include a vector, a matrix, or a higher-dimensional data structure. In preferred examples, the tensors described may include vectors with a predefined number of elements.
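The two ways of applying a bias in a feedforward layer can be compared directly. The sketch below computes the same layer output once with a separate addition and once as a single matrix product in which the value 1 is appended to the input and the bias is appended to each weight row. The function names and numeric values are hypothetical.

```python
def dense_separate_bias(x, weights, bias):
    """y = W x + b, with the bias added in a separate step."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def dense_augmented(x, weights, bias):
    """Same mapping as one matrix product: append 1 to x and b to each row of W."""
    x_aug = x + [1.0]
    w_aug = [row + [b] for row, b in zip(weights, bias)]
    return [sum(w * v for w, v in zip(row, x_aug)) for row in w_aug]


# Hypothetical layer with 3 inputs and 2 outputs.
x = [2.0, -1.0, 0.5]
W = [[0.5, 0.25, 1.0],
     [-1.0, 2.0, 4.0]]
b = [0.125, -0.5]

print(dense_separate_bias(x, W, b))
print(dense_augmented(x, W, b))
```

Both calls give the same output vector; the augmented form simply folds the bias into the weight matrix.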

[0040] To model complex nonlinear functions, the neural network layers described above can be followed by nonlinear activation functions. Common activation functions include the sigmoid function, the tanh function, and the rectified linear unit (ReLU). Many other activation functions exist and may be applied. Activation functions can be chosen based on testing and preference. In some cases, activation functions can be omitted and/or can form part of the internal structure of a neural network layer.
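The three activation functions named above can be written in a few lines. This is a minimal sketch using only the standard library; the helper name `apply_activation` and the sample values are hypothetical. Note that the sigmoid maps any real input into the range 0 to 1, which is one way a layer output could be turned into a normalized scalar.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(z):
    """Rectified linear unit: passes positive values, clamps the rest to zero."""
    return max(0.0, z)


def apply_activation(values, activation):
    """Apply an activation function element-wise to a layer's outputs."""
    return [activation(v) for v in values]


pre_activations = [-2.0, 0.0, 3.0]
print(apply_activation(pre_activations, sigmoid))
print(apply_activation(pre_activations, math.tanh))
print(apply_activation(pre_activations, relu))
```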

[0041] The example neural network architectures described herein can be configured through training. In some cases, “learnable” or “trainable” parameters can be trained using a method called backpropagation. During backpropagation, the neural network layers that make up each neural network architecture are initialized (e.g., using random weights) and then used to make a prediction from a set of input data in the training set (e.g., a so-called “forward” pass). This prediction is used to evaluate a loss function. For example, the “true” output can be compared to the predicted output, and the difference may form part of the loss function. In some examples, the loss function may be based on the absolute difference between the predicted scalar value and a binary ground-truth label. The training set may include a set of transactions. If a gradient descent method is used, the loss function is used to determine the gradient of the loss function with respect to the parameters of the neural network architecture, and the gradient is then used to backpropagate updates to the parameter values. Typically, the update is propagated based on derivatives with respect to the weights of the neural network layers. For example, the gradient of the loss function with respect to the weights of a neural network layer can be determined and used to compute an update to the weights that reduces the loss function. In this case, optimization techniques (e.g., gradient descent, stochastic gradient descent, Adam, etc.) can be used to adjust the weights. The gradient of the loss function can be calculated efficiently by applying the chain rule and automatic differentiation, working backward through the neural network layers.
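The training loop described above can be sketched for the smallest possible case: one weight and one bias feeding a sigmoid, trained by stochastic gradient descent on an absolute-difference loss against binary labels. The data set, learning rate, epoch count, and function names are all hypothetical, and the gradients are derived by hand with the chain rule rather than by an automatic differentiation library.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Forward pass: map a single input feature to a scalar in (0, 1)."""
    return sigmoid(w * x + b)


def mean_abs_loss(w, b, data):
    """Mean absolute difference between predictions and binary labels."""
    return sum(abs(predict(w, b, x) - y) for x, y in data) / len(data)


def train(data, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Random initialization of the learnable parameters.
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
    initial = mean_abs_loss(w, b, data)
    for _ in range(epochs):
        for x, y in data:
            p = predict(w, b, x)               # forward pass
            dloss_dp = 1.0 if p > y else -1.0  # d|p - y| / dp
            dp_dz = p * (1.0 - p)              # derivative of the sigmoid
            w -= lr * dloss_dp * dp_dz * x     # chain rule, dz/dw = x
            b -= lr * dloss_dp * dp_dz         # chain rule, dz/db = 1
    return w, b, initial, mean_abs_loss(w, b, data)


# Hypothetical toy training set: (feature, binary label) pairs.
toy_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, initial_loss, final_loss = train(toy_data)
print(initial_loss, final_loss)
```

In a multi-layer architecture the same chain-rule step is repeated layer by layer from the output back to the input, which is what automatic differentiation tools carry out.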

[0042] Example Transaction Processing System

[0043] Figures 1A to 1C illustrate a set of example transaction processing systems 100, 102, and 104. These example transaction processing systems are described to provide context for the invention discussed herein, but should not be considered limiting; the configuration of any one implementation may vary depending on the specific requirements of that implementation. However, the described example transaction processing systems allow those skilled in the art to identify certain high-level technical features relevant to the following description. The three example transaction processing systems 100, 102, and 104 illustrate different areas where variation may occur.

[0044] Figures 1A to 1C show a set of client devices 110 configured to initiate a transaction. In this example, the set of client devices 110 includes a smartphone 110-A, a computer 110-B, a point-of-sale (POS) system 110-C, and a portable merchant device 110-D. These client devices 110 provide a non-exhaustive set of examples. Typically, any electronic device or set of devices can be used to conduct a transaction. In one case, the transaction includes a purchase or payment. For example, the purchase or payment could be an online or mobile purchase or payment made via the smartphone 110-A or the computer 110-B, or it could be a purchase or payment made at a merchant location, such as via the POS system 110-C or the portable merchant device 110-D. The purchase or payment may be for goods and/or services.

[0045] In Figures 1A to 1C, the client devices 110 are communicatively coupled to one or more computer networks 120. The client devices 110 can be communicatively coupled in various ways, including via one or more wired and/or wireless networks, including telecommunications networks. In a preferred example, all communication over the one or more computer networks is secured, for instance using the Transport Layer Security (TLS) protocol. In Figure 1A, two computer networks are shown as 120-A and 120-B. These networks can be separate networks or different parts of a public network. The first computer network 120-A couples the client devices 110 to a merchant server 130. The merchant server 130 can execute computer processes that implement transaction processing. For example, the merchant server 130 can be a backend server that processes transaction requests received from the POS system 110-C or the portable merchant device 110-D, or it can be used by an online merchant to implement a website where purchases can be made. It will be understood that the examples of Figures 1A to 1C are necessary simplifications of actual architectures; for instance, an online merchant may be implemented by multiple interacting server devices, including server devices that serve Hypertext Markup Language (HTML) pages with detailed product and/or service descriptions and separate server devices that handle the payment process.

[0046] In Figure 1A, the merchant server 130 is communicatively coupled with a further set of backend server devices to process the transaction. In Figure 1A, the merchant server 130 is communicatively coupled to a payment processor server 140 via the second network 120-B. The payment processor server 140 is communicatively coupled to a first data storage device 142 storing transaction data 146 and a second data storage device 144 storing auxiliary data 148. The transaction data 146 may include batches of transaction data relating to different transactions conducted over a period of time. The auxiliary data 148 may include data associated with these transactions, such as records storing merchant and/or end-user data. In Figure 1A, the payment processor server 140 is communicatively coupled to a machine learning server 150 via the second network 120-B. The machine learning server 150 implements a machine learning system 160 for processing transaction data. The machine learning system 160 is configured to receive input data 162 and map it to output data 164, which the payment processor server 140 uses to process a specific transaction, such as a transaction initiated by a client device 110. In one case, the machine learning system 160 receives at least transaction data associated with a specific transaction and provides an alert or numerical output that the payment processor server 140 uses to determine whether to authorize (i.e., approve) or reject the transaction. The output of the machine learning system 160 may therefore include a label, alert, or other indication of fraud, or of general malicious or anomalous activity. The output may include a probabilistic indication, such as a score or probability. In one case, the output data 164 may include a scalar numerical value. The input data 162 may also include data derived from one or more of the transaction data 146 and the auxiliary data 148. In one case, the output data 164 indicates the level of deviation from a specific expected behavioral pattern based on past observations or measurements. Such a deviation may indicate fraud or criminal activity, because such activity often differs significantly from observed behavioral patterns, especially at large scale. The output data 164 can form a behavioral measurement. Expected behavioral patterns can be explicitly or implicitly defined based on interactions observed between different entities in the transaction process (e.g., end-users or customers, merchants (including point-of-sale and back-end locations or entities, which may differ), and banks).
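One simple way a payment processor might act on a scalar output is to compare it with a threshold. The sketch below assumes a score normalized to the range 0 to 1 where higher values indicate a greater deviation from expected behavior; the function name `decide` and the threshold value of 0.8 are hypothetical and are not taken from the description.

```python
def decide(score, threshold=0.8):
    """Map a normalized anomaly score to an authorization decision.

    Both the threshold value and the decision labels are illustrative
    assumptions; a real deployment would choose them from measured
    false-positive and false-negative rates.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score!r}")
    return "reject" if score >= threshold else "approve"


for example_score in (0.05, 0.79, 0.8, 0.97):
    print(example_score, decide(example_score))
```

Lowering the threshold rejects more anomalous transactions at the cost of rejecting more genuine ones, which is the trade-off behind the false-positive concerns raised earlier.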

[0047] The machine learning system 160 can be implemented as part of a transaction processing pipeline. An example transaction processing pipeline is described later with reference to Figures 5A and 5B. The transaction processing pipeline may include electronic communication between the client devices 110, the merchant server 130, the payment processor server 140, and the machine learning server 150. Other server devices may also be involved, such as a bank server providing authorization from the issuing bank. In some cases, a client device 110 may communicate directly with the payment processor server 140. In practice, the transaction processing pipeline typically needs to complete within one to two hundred milliseconds. Generally, sub-second processing times can be considered real-time (e.g., humans typically perceive events over a 400-millisecond timeframe). Furthermore, 100-200 milliseconds may be the maximum latency allowed for the entire round trip of transaction processing; within this timeframe, the time allocated to the machine learning system 160 can be a small fraction of the total, for example, 10 milliseconds (i.e., no more than 5-10% of the target processing time), as most of the time may be reserved for other operations in the transaction processing flow. This imposes technical constraints on the implementation of the machine learning system 160. Furthermore, in practical implementations, the average processing volume may be around 1000-2000 transactions per second. This means that most "off-the-shelf" machine learning systems are unsuitable for implementing the machine learning system 160. It further implies that most machine learning methods described in academic papers cannot be implemented in the aforementioned transaction processing pipeline without significant modification. Another issue is that anomalies are rare events by nature, which makes accurate machine learning systems difficult to train.
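The figures quoted above can be combined in a short worked calculation. The 10-millisecond allocation, the 100-200 millisecond round trip, and the 1000-2000 transactions per second are taken from the text; the estimate of transactions in flight uses Little's law (average number in a stage equals arrival rate multiplied by time in the stage), which is an assumption added here for illustration, as are the variable names.

```python
TOTAL_BUDGET_MS = (100, 200)   # round-trip budget quoted in the text
ML_BUDGET_MS = 10              # example allocation for the machine learning stage
LOAD_TPS = (1000, 2000)        # average transactions per second quoted in the text
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Share of the round trip consumed by the machine learning stage.
ml_share = [ML_BUDGET_MS / total for total in TOTAL_BUDGET_MS]

# Yearly volume implied by a sustained average load.
yearly_volume = [tps * SECONDS_PER_YEAR for tps in LOAD_TPS]

# Little's law (an added assumption): transactions inside the stage at once.
in_flight = [tps * ML_BUDGET_MS / 1000 for tps in LOAD_TPS]

print(ml_share)
print(yearly_volume)
print(in_flight)
```

Under these assumptions the stage uses 5-10% of the round trip, a sustained load corresponds to tens of billions of transactions per year, and roughly 10 to 20 transactions are being scored at any instant, which illustrates why parallelizable inference matters.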

[0048] Figure 1B shows a variant 102 of the example transaction processing system 100 of Figure 1A. In this variant 102, the machine learning system 160 is implemented within the payment processor computer infrastructure, for example, executed by the payment processor server 140 and/or executed on a locally coupled server in the same local network as the payment processor server 140. The variant 102 of Figure 1B is likely to be preferred by larger payment processors because it allows faster response times, greater control, and improved security. Functionally, however, the transaction processing pipeline may be similar to the pipeline shown in Figure 1A. For example, in the example of Figure 1A, the machine learning system 160 can be invoked via a secure external application programming interface (API), for example, by a Representational State Transfer (REST) API call made over Hypertext Transfer Protocol Secure (HTTPS), while in Figure 1B the machine learning system 160 can be invoked by internal API calls, where a common API may handle both kinds of request (e.g., a REST HTTPS API can provide an external wrapper for the internal API).

[0049] Figure 1C shows another variation 104 of the example transaction processing system 100 of Figure 1A. In this variation 104, the machine learning system 160 is communicatively coupled to a local data storage device 170. For example, the data storage device 170 may be located on the same local network as the machine learning server 150, or may include a local storage network accessible to the machine learning server 150. In this case, there are multiple local data storage devices 170-A to 170-N, each storing partitioned auxiliary data 172. The partitioned auxiliary data 172 may include parameters of one or more machine learning models. In one case, the auxiliary data 172 may include the state of a machine learning model, which may relate to a specific entity, such as a user or merchant. The partitioning of the auxiliary data 172 may be needed to meet security requirements set by a third party (e.g., a payment processor, one or more banks, and/or one or more merchants). In use, the machine learning system 160 accesses the auxiliary data 172-A to 172-N via the multiple local data storage devices 170-A to 170-N based on the input data 162. For example, the input data 162 may be received via an API request from a specific source and/or may include data identifying which partition is to be used to process the API request. Figures 2A and 2B give more details of different storage systems that can be used to meet security requirements.
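As an illustration of selecting a partition of auxiliary data from the input data, the following is a minimal in-memory sketch. The class names, the `partition_id` request field, and the dictionary layout are all hypothetical; the text does not specify how partitions are identified or stored, and a real deployment would add authentication and encryption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AuxiliaryPartition:
    """Auxiliary data for one tenant (hypothetical layout)."""
    model_params: Dict[str, Any] = field(default_factory=dict)
    entity_state: Dict[str, Any] = field(default_factory=dict)


class PartitionedStore:
    """Keeps each tenant's auxiliary data in a separate partition."""

    def __init__(self) -> None:
        self._partitions: Dict[str, AuxiliaryPartition] = {}

    def create(self, partition_id: str) -> AuxiliaryPartition:
        # Return the existing partition or create an empty one.
        return self._partitions.setdefault(partition_id, AuxiliaryPartition())

    def for_request(self, request: Dict[str, Any]) -> AuxiliaryPartition:
        # Assumption: the request names the partition it is allowed to use.
        partition_id = request.get("partition_id")
        if partition_id not in self._partitions:
            raise KeyError(f"unknown partition: {partition_id!r}")
        return self._partitions[partition_id]
```

Because every lookup goes through `for_request`, state written for one tenant (for example, one bank) is never returned when processing a request that names another.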

[0050] Example data storage configuration

[0051] Figures 2A and 2B show two example data storage configurations, 200 and 202, which can be used by an example machine learning system 210 to process transaction data. The examples of Figures 2A and 2B are two non-limiting examples illustrating different options available for implementation, and a specific configuration can be chosen based on individual circumstances. Machine learning system 210 may include an implementation of the machine learning system 160 described in the preceding examples of Figures 1A to 1C. The examples of Figures 2A and 2B allow the processing of transaction data protected using heterogeneous cryptographic parameters; for instance, they enable machine learning system 210 to securely process transaction data from heterogeneous entities. It is understood that the configurations of Figures 2A and 2B may not be used if machine learning system 160 is implemented for a single set of secure transaction and auxiliary data, for example, within an internal transaction processing system or as a hosted system for a single payment processor.

[0052] Figure 2A illustrates a machine learning system 210 communicatively coupled to a data bus 220. The data bus 220 may include the internal data bus of a machine learning server 150 or may form part of a storage area network. The data bus 220 communicatively couples the machine learning system 210 to multiple data storage devices 230, 232. The data storage devices 230, 232 may include any known data storage devices, such as disks and solid-state devices. Although the data storage devices 230, 232 are shown as different devices in Figure 2A, they can alternatively form different physical areas or portions of storage within a common data storage device. In Figure 2A, the multiple data storage devices 230, 232 store historical transaction data 240 and auxiliary data 242: a first set of data storage devices 230 stores the historical transaction data 240, while a second set of data storage devices 232 stores the auxiliary data 242. Auxiliary data 242 may include one or more of model parameters for a set of machine learning models (e.g., trained parameters for a neural network architecture and/or configuration parameters for a random forest model) and state data for these models. In one case, the different sets of historical transaction data 240-A to N and auxiliary data 242-A to N are associated with different entities that securely and collectively use services provided by the machine learning system 210. For example, the data may represent data from different banks that needs to be kept separate as a condition of providing machine learning services to these entities.

[0053] Figure 2B illustrates another way of storing the different sets of historical transaction data 240-A to N and auxiliary data 242-A to N. In Figure 2B, the machine learning system 210 is communicatively coupled to at least one data storage device 260 via a data transmission channel 250. The data transmission channel 250 may include a local storage bus, a local storage area network, and/or a remote secure storage coupling (e.g., overlaid on an insecure network such as the Internet). In Figure 2B, a secure logical storage layer 270 is provided using the physical data storage device 260. The secure logical storage layer 270 can be a virtualized system that appears to the machine learning system 210 as separate physical storage devices while actually being implemented independently on the at least one data storage device 260. The logical storage layer 270 can provide separate encrypted partitions 280 for data related to groups of entities (e.g., related to different issuing banks), and the historical transaction data 240-A to N and auxiliary data 242-A to N for the different groups can be stored in the corresponding partitions 280-A to N. In some cases, entities can be created dynamically as transactions are received for processing, based on data stored by one or more of the server systems shown in Figures 1A to 1C.

[0054] Example transaction data

[0055] Figures 3A and 3B show an example of transaction data that can be processed by a machine learning system such as 160 or 210. Figure 3A illustrates how transaction data can comprise a set of time-ordered records 300, where each record has a timestamp and includes multiple transaction fields. In some cases, transaction data may be grouped and/or filtered based on timestamps. For example, Figure 3A shows transaction data divided into current transaction data 310, associated with the current transaction, and "older" or historical transaction data 320 that falls within a predefined time range of the current transaction. This time range can be set as a hyperparameter of any machine learning system. Alternatively, the "older" or historical transaction data 320 can be set to a certain number of transactions. A combination of these two methods is also possible.
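The selection of historical transaction data by time range, by count, or by both can be sketched as follows. This is an illustrative assumption of one way to do it: the function name, the `timestamp` key, and the two parameters are hypothetical, and timestamps are taken to be seconds.

```python
from typing import Dict, List, Optional


def select_history(records: List[Dict], current_ts: int,
                   max_age_s: Optional[int] = None,
                   max_count: Optional[int] = None) -> List[Dict]:
    """Return records older than the current transaction, oldest first.

    max_age_s keeps only records within that many seconds of current_ts;
    max_count keeps only the most recent N of the remaining records.
    Either, both, or neither limit may be given.
    """
    older = [r for r in records if r["timestamp"] < current_ts]
    if max_age_s is not None:
        older = [r for r in older if current_ts - r["timestamp"] <= max_age_s]
    older.sort(key=lambda r: r["timestamp"])
    if max_count is not None:
        # Guard the zero case: older[-0:] would return the whole list.
        older = older[-max_count:] if max_count > 0 else []
    return older
```

Here `max_age_s` and `max_count` play the role of the hyperparameters mentioned above; passing both applies the time range first and then the count limit.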

[0056] Figure 3B illustrates how transaction data for a specific transaction 330 may be stored in numeric form for processing by one or more machine learning models. For example, in Figure 3B, the transaction data has at least the following fields: transaction amount, timestamp (e.g., as a Unix epoch), transaction type (e.g., card payment or direct debit), product description or identifier (i.e., related to the purchased item), merchant identifier, issuing bank identifier, a set of characters (e.g., Unicode characters in a field of predefined character length), country identifier, etc. It should be noted that a wide variety of data types and formats can be received and preprocessed into appropriate numerical representations. In some cases, raw transaction data, such as data generated by a client device and sent to merchant server 130, is preprocessed to convert alphanumeric data types to numeric data types for the application of one or more machine learning models. Other fields present in the transaction data may include, but are not limited to: account number (e.g., credit card number), the location where the transaction occurred, and the manner in which the transaction was performed (e.g., face-to-face, by phone, on a website).
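A minimal sketch of converting such a record into a numeric representation is given below. It covers only a subset of the listed fields; the `Transaction` and `Vocabulary` classes, the field names, and the convention of reserving identifier 0 for unseen values are assumptions made for illustration, not details taken from the described system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    amount: float
    timestamp: int          # Unix epoch seconds
    transaction_type: str   # e.g. "card_payment", "direct_debit"
    merchant_id: str
    issuer_id: str
    country: str


class Vocabulary:
    """Maps strings to integer identifiers; 0 is reserved for unknown values."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def lookup(self, token: str, grow: bool = True) -> int:
        if token not in self._ids:
            if not grow:
                return 0
            self._ids[token] = len(self._ids) + 1
        return self._ids[token]


def encode(tx: Transaction, vocabs: Dict[str, Vocabulary],
           grow: bool = True) -> List[float]:
    """Turn one transaction into a flat list of numbers."""
    return [
        float(tx.amount),
        float(tx.timestamp),
        float(vocabs["type"].lookup(tx.transaction_type, grow)),
        float(vocabs["merchant"].lookup(tx.merchant_id, grow)),
        float(vocabs["issuer"].lookup(tx.issuer_id, grow)),
        float(vocabs["country"].lookup(tx.country, grow)),
    ]
```

With `grow=False`, as might be used at inference time, values not seen before map to the reserved identifier 0 rather than extending the vocabulary.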

[0057] Example machine learning system

[0058] Figure 4 shows an example 400 of a machine learning system 402 for processing transaction data. Machine learning system 402 may implement one or more of machine learning systems 160 and 210. Machine learning system 402 receives input data 410. The form of input data 410 may depend on which machine learning model is applied by machine learning system 402. When machine learning system 402 is configured to perform fraud or anomaly detection related to a transaction (e.g., an ongoing transaction as described above), the input data 410 may include transaction data such as 330 (i.e., data forming part of a data packet for the transaction) as well as data derived from historical transaction data (e.g., 300 in Figure 3A) and/or from auxiliary data (e.g., 148 in Figures 1A to 1C or 242 in Figures 2A and 2B). The auxiliary data may include secondary data linked to one or more entities identified in the original data associated with the transaction. For example, if the transaction data for an ongoing transaction identifies a user, a merchant, and one or more banks (e.g., the user's issuing bank and the merchant's bank) associated with the transaction, for example, via unique identifiers present in the transaction data, the auxiliary data may include data related to these transaction entities. The auxiliary data may also include data from activity records, such as interaction logs and/or authentication records. In one case, the auxiliary data is stored in one or more static data records and retrieved from these records based on the received transaction data. Alternatively, the auxiliary data may include machine learning model parameters that are retrieved based on the contents of the transaction data. For example, the machine learning model may have parameters specific to one or more of the user, merchant, and issuing bank, and these parameters can be retrieved based on which of these entities is identified in the transaction data. For example, one or more of the user, merchant, and issuing bank may have corresponding embeddings, which may include retrievable or mappable tensor representations of the entities. For example, each user or merchant can have a tensor representation (e.g., a floating-point vector of size 128-1024) that can be retrieved from a database or other data store, or generated by an embedding layer, for example, based on a user or merchant index.
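Retrieving a per-entity vector by index can be sketched with a simple lookup table. This is an assumption-laden illustration: `EmbeddingTable` and `build_entity_input` are hypothetical names, the values are random placeholders standing in for learned parameters, and concatenating the user and merchant vectors is only one possible way of combining them.

```python
import random
from typing import List


class EmbeddingTable:
    """Index -> fixed-size float vector.

    Values are random placeholders; in a trained system they would be
    learned parameters loaded from a data store.
    """

    def __init__(self, num_entities: int, dim: int = 128, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        self._rows = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(num_entities)]

    def lookup(self, index: int) -> List[float]:
        return self._rows[index]


def build_entity_input(user_idx: int, merchant_idx: int,
                       users: EmbeddingTable,
                       merchants: EmbeddingTable) -> List[float]:
    """Concatenate the user and merchant vectors into one input vector."""
    return users.lookup(user_idx) + merchants.lookup(merchant_idx)
```

A vector size of 128 is used here because it is the lower end of the range quoted above; the same structure applies for larger sizes.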

[0059] Input data 410 is received at input data interface 412. Input data interface 412 may include an API interface, such as an internal or external API interface as described above. In one case, the payment processor server 140 shown in Figures 1A to 1C issues a request to this interface, wherein the request payload contains transaction data. The API interface can be defined to be agnostic as to the form or source of the transaction data. The input data interface 412 is communicatively coupled to the machine learning model platform 414. In one case, a request to the input data interface 412 triggers the execution of the machine learning model platform 414 using the transaction data provided to the interface. The machine learning model platform 414 is configured as an execution environment for applying one or more machine learning models to the input data 410. In one case, the machine learning model platform 414 is arranged as an execution wrapper for multiple different selectable machine learning models. For example, machine learning models can be defined using a model definition language (e.g., similar to or using a markup language, such as Extensible Markup Language, XML). Model definition languages may include (among others, individually or in combination): SQL, TensorFlow, Caffe, Thinc, and PyTorch. In one case, the model definition language includes executable computer program code to implement one or more training and inference operations of the defined machine learning model. For example, machine learning models can include: artificial neural network architectures, ensemble models, regression models, decision trees (e.g., random forests), graph models, and Bayesian networks, etc. An example machine learning model based on an artificial neural network is described later with reference to Figures 6A to 6C, and an example machine learning model based on a random forest model is described later with reference to Figure 7. The machine learning model platform 414 allows for the definition of common (i.e., shared) input and output definitions, enabling different machine learning models to be applied in a common (i.e., shared) manner.
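The idea of an execution wrapper with shared input and output definitions can be sketched as follows. This is a hypothetical design rather than the platform described above: `ModelPlatform`, `logistic_model`, and `rule_model` are illustrative names, the toy models and their constants are invented, and clamping the result to the range 0 to 1 is an added assumption.

```python
import math
from typing import Callable, Dict, Sequence

# Shared contract: a model maps a numeric feature vector to one float.
Model = Callable[[Sequence[float]], float]


class ModelPlatform:
    """Applies any registered model through the same input/output contract."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, name: str, model: Model) -> None:
        self._models[name] = model

    def score(self, name: str, features: Sequence[float]) -> float:
        raw = self._models[name](features)  # KeyError if name is unknown
        return min(1.0, max(0.0, raw))      # keep the output within [0, 1]


def logistic_model(features: Sequence[float]) -> float:
    # Toy model: features[0] is taken to be the transaction amount.
    z = 0.001 * features[0] - 2.0
    return 1.0 / (1.0 + math.exp(-z))


def rule_model(features: Sequence[float]) -> float:
    # Toy rule: flag very large amounts.
    return 1.0 if features[0] > 5000 else 0.0
```

Which model is applied can then be chosen by name, for example from configuration data, without changing the calling code.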

[0060] In this example, the machine learning model platform 414 is configured to provide at least a single scalar output 416. This output can be normalized within a predefined range (e.g., a range of 0 to 1). When normalized, the scalar output 416 can be viewed as the probability that a transaction associated with the input data 410 is fraudulent or anomalous. In this case, a value of "0" could represent a transaction that matches a normal activity pattern for one or more of the user, merchant, and issuing bank, while a value of "1" could represent a transaction that is fraudulent or anomalous, i.e., does not match the expected activity pattern (although those skilled in the art will recognize that the normalization range can be different, e.g., inverse or in different ranges, but with the same functional effect). It should be noted that although the range of values can be defined as 0 to 1, the output values may not be uniformly distributed within this range; for example, a value of "0.2" might be a common output for "normal" events, while a value of "0.8" might be considered to exceed the threshold for typical "abnormal" or fraudulent events. The machine learning model implemented by the machine learning model platform 414 can therefore implement a mapping between high-dimensional input data (e.g., transaction data and any retrieved auxiliary data) and a single-valued output. In one case, for example, the machine learning model platform 414 can be configured to receive input data for a machine learning model in a numeric format, wherein each defined machine learning model is configured to map input data defined in the same manner. The exact machine learning model applied by the machine learning model platform 414 and the parameters of that model can be determined based on configuration data. This configuration data can be included within the input data 410, and/or identified using the input data, and/or can be set based on one or more configuration documents parsed by the machine learning model platform 414.

[0061] In some cases, the machine learning model platform 414 can provide additional outputs depending on context. In some implementations, the machine learning model platform 414 can be configured to return a "reason code" that captures a human-readable explanation of the machine learning model output in terms of suspicious input attributes. For example, the machine learning model platform 414 can indicate which of one or more input elements or units in the input representation influenced the model output, such as a combination of a "quantity" channel being above a learned threshold and a set of "merchant" elements or units (e.g., embeddings or indexes) being outside a given cluster. In the case where the machine learning model platform 414 implements a decision tree, these additional outputs can include the route taken through the decision tree or an aggregated feature importance based on an ensemble of trees. For neural network architectures, this might include layer output activations and/or layer filters with positive activations.

[0062] In Figure 4, some implementations may include an optional alert system 418 that receives scalar output 416. In other implementations, scalar output 416 may be passed directly to output data interface 420 without post-processing. In the latter case, scalar output 416 may be packaged as a response to the original request made to input data interface 412. In both cases, output data 422 derived from scalar output 416 is provided as the output of machine learning system 402. Output data 422 is returned to allow final processing of the transaction data. For example, output data 422 may be returned to payment processor server 140 and used as the basis for a decision to approve or reject the transaction. Depending on the implementation requirements, in one case, alert system 418 may process scalar output 416 to return a binary value indicating whether the transaction should be approved or rejected (e.g., "1" equals rejection). In some cases, a decision may be made by applying a threshold to scalar output 416. This threshold may be context-dependent. In some cases, the alert system 418 and/or output data interface 420 may also receive additional inputs, such as explanatory data (e.g., the "reason code" discussed above) and/or raw input data. The output data interface 420 may generate output data packets for output data 422 that combine these inputs with scalar output 416 (e.g., at least for logging and/or later review). Similarly, alerts generated by the alert system 418 may include, and/or be additionally based on, the aforementioned additional inputs in addition to scalar output 416.
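Applying a threshold to the scalar output and packaging the result with any reason codes can be sketched as below. The function name `decide`, the default threshold of 0.8, and the shape of the returned dictionary are assumptions for illustration; the only conventions taken from the text are that the score lies in the range 0 to 1 and that a binary value of 1 may denote rejection.

```python
from typing import Dict, Iterable, Union


def decide(score: float, threshold: float = 0.8,
           reason_codes: Iterable[str] = ()) -> Dict[str, Union[int, float, list]]:
    """Map a normalized anomaly score to a binary decision (1 = reject).

    The threshold may be context-dependent, so it is passed per call.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    reject = score > threshold
    return {
        "decision": 1 if reject else 0,
        "score": score,
        "reason_codes": list(reason_codes),
    }
```

For example, `decide(0.9)` yields a decision of 1, whereas `decide(0.9, threshold=0.95)` yields 0, showing how a context-specific threshold changes the outcome for the same score.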

[0063] In a preferred implementation, the machine learning system 402 is used in an "online" mode to process a large number of transactions within a narrowly defined timeframe. For example, under normal processing conditions, the machine learning system 402 can process requests within 7-12 milliseconds and manage 1000-2000 requests per second (these being median constraints from real-world operating conditions). However, the machine learning system 402 can also be used in an "offline" mode, for example, by providing selected historical transactions to the input data interface 412. In offline mode, input data can be delivered to the input data interface in batches (i.e., in groups). The machine learning system 402 is also capable of implementing machine learning models that provide scalar outputs for entities as well as, or instead of, transactions. For example, the machine learning system 402 can receive requests associated with an identified user (e.g., a card or payment account holder) or an identified merchant and be configured to provide a scalar output 416 indicating the likelihood that the user or merchant is fraudulent, malicious, or anomalous (i.e., a general threat or risk). This can, for example, form part of a continuous or periodic monitoring process or a one-off request (e.g., as part of a service application). Providing a scalar output for a specific entity can be based on a set of transaction data up to and including the last approved transaction in a sequence of transaction data (e.g., similar to the transaction data for an entity shown in Figure 3A).

[0064] Example Transaction Processing Flow

[0065] Figures 5A and 5B show two possible example transaction processing flows, 500 and 550. These flows may occur in the context of the example transaction processing systems 100, 102, and 104 shown in Figures 1A to 1C, as well as in other systems. Processing flows 500 and 550 are provided as examples of contexts where machine learning can be applied to transaction processing; however, not all transaction processing flows necessarily follow the processes shown in Figures 5A and 5B, and processing flows may vary between implementations, systems, and over time. Example transaction processing flows 500 and 550 reflect two possible scenarios: a first scenario represented by transaction processing flow 500, where a transaction is approved; and a second scenario represented by transaction processing flow 550, where a transaction is rejected. Each transaction processing flow 500, 550 involves the same set of five interacting systems and devices: a POS or user device 502, a merchant system 504, a payment processor (PP) system 506, a machine learning (ML) system 508, and an issuing bank system 510. The POS or user device 502 may include one of the client devices 110, the merchant system 504 may include a merchant server 130, the payment processor system 506 may include a payment processor server 140, and the machine learning system 508 may include implementations of machine learning systems 160, 210, and/or 402. The issuing bank system 510 may include one or more server devices that implement transaction functions on behalf of the issuing bank. The five interacting systems and devices 502 to 510 can be communicatively coupled through one or more internal or external communication channels, such as network 120. In some cases, certain systems can be combined; for example, the issuing bank can also act as a payment processor, so systems 506 and 510 can be implemented using a common system. In other cases, similar processing flows can be performed specifically for merchants (e.g., without involving a payment processor or issuing bank). In this case, machine learning system 508 can communicate directly with merchant system 504. Across these variations, the general functional transaction processing flow can remain similar to the flow described below.

[0066] The transaction processing flows in Figures 5A and 5B each include multiple common (i.e., shared) processes 512 to 528. At box 512, the POS or user device 502 initiates a transaction. For a POS device, this might involve a cashier attempting to accept an electronic payment using a front-end device; for user device 502, this might involve a user making an online purchase using a credit or debit card or online payment account (e.g., clicking "Done" in an online basket). At box 514, payment details are received as electronic data by the merchant system 504. At box 516, the transaction is processed by the merchant system 504, which sends a request to the payment processor system 506 to authorize payment. At box 518, the payment processor system 506 receives the request from the merchant system 504. This request can be sent via a proprietary communication channel or as a secure request over a public network (e.g., an HTTPS request over the Internet). The payment processor system 506 then sends a request to the machine learning system 508 for a score or probability to use in processing the transaction. Box 518 may additionally include retrieving auxiliary data to combine with the transaction data sent as part of the request to machine learning system 508. In other cases, machine learning system 508 may have access to data storage devices that store auxiliary data (e.g., similar to the configurations of Figures 2A and 2B) and may therefore retrieve this data as part of internal operations (e.g., based on identifiers provided in the transaction data and/or as defined by the implemented machine learning model).

[0067] Box 520 illustrates a model initialization operation that occurs prior to any request from the payment processor system 506. For example, this model initialization operation may include loading a defined machine learning model and parameters that represent the defined machine learning model. At box 522, the machine learning system 508 receives the request from the payment processor system 506 (e.g., via an input data interface such as 412 in Figure 4). At box 522, the machine learning system 508 can perform any defined preprocessing before applying the machine learning model initialized at box 520. For example, if the transaction data still retains character data, such as a merchant identified by a string or a character-based transaction description, this data can be converted into suitable structured numerical data (e.g., by converting string categorical data into identifiers via lookup operations or other mappings, and/or by mapping characters or character groups to vector embeddings). Then, at box 524, the machine learning system 508 applies the instantiated machine learning model, supplying the model with input data derived from the received request. This may include applying the machine learning model platform 414 as described with reference to Figure 4. At box 526, a scalar output is generated by the instantiated machine learning model. This scalar output can be processed at machine learning system 508 to determine a binary "approve" or "reject" decision or, preferably, returned to payment processor system 506 as a response to the request made at box 518.

[0068] At box 528, the output of machine learning system 508 is received by payment processor system 506 and used to approve or reject the transaction. Figure 5A shows the process in which the transaction is approved based on the output of machine learning system 508; Figure 5B shows the process in which the transaction is rejected based on the output of machine learning system 508. In Figure 5A, the transaction is approved at box 528. Then, at box 530, a request is sent to the issuing bank system 510. At box 534, the issuing bank system 510 approves or rejects the request. For example, the issuing bank system 510 may approve the request if the end user or cardholder has sufficient funds and is approved to cover the transaction costs. In some cases, the issuing bank system 510 may apply a second level of security; however, this may not be necessary if the issuing bank relies on anomaly detection performed by the payment processor using the machine learning system 508. At box 536, the authorization from the issuing bank system 510 is returned to the payment processor system 506, which in turn sends a response to the merchant system 504 at box 538, and the merchant system 504 responds to the POS or user device 502 at box 540. If the issuing bank system 510 approves the transaction at box 534, the transaction can be completed, and a positive response is returned to the POS or user device 502 via the merchant system 504. The end user may experience this response as an "authorized" message on the screen of the POS or user device 502. The merchant system 504 can then complete the purchase (e.g., by initiating internal processing to fulfil the purchase).

[0069] At a later point in time, one or more of the merchant system 504 and the machine learning system 508 may save data related to the transaction, for example, as part of the transaction data 146, 240, or 300 in the previous examples. This is illustrated by dashed boxes 542 and 544. The transaction data may be saved along with one or more of the output of the machine learning system 508 (e.g., the scalar fraud or anomaly probability) and the final outcome of the transaction (e.g., approval or rejection). The saved data may be stored for use as training data for a machine learning model implemented by the machine learning system 508 (e.g., as the basis for the training data shown in one or more of Figure 8 and Figure 10). The saved data can also be accessed as part of future iterations of box 524, for example, it can form part of future auxiliary data. In some cases, the final outcome or consequences of a transaction may not be known at the time of the transaction. For example, a transaction may only be flagged as anomalous through later review by analysts and/or automated systems, or based on user feedback (e.g., when a user reports fraud or indicates that a payment card or account has been misused since a certain date). In these cases, the ground-truth labels used to train the machine learning system 508 can be collected over time after the transaction itself.
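Saving transaction data with the model output and outcome, and attaching a label only once it becomes known, can be sketched as follows. The `TransactionLog` class, its method names, and the in-memory dictionary are hypothetical; a real system would persist this data, and the text does not prescribe any particular storage scheme.

```python
from typing import Dict, List, Sequence, Tuple


class TransactionLog:
    """Stores scored transactions and accepts labels that arrive later."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def record(self, tx_id: str, features: Sequence[float],
               score: float, approved: bool) -> None:
        # The label is unknown at transaction time.
        self._rows[tx_id] = {"features": list(features), "score": score,
                             "approved": approved, "label": None}

    def add_label(self, tx_id: str, is_anomalous: bool) -> None:
        # Called after analyst review or a user fraud report, for example.
        self._rows[tx_id]["label"] = 1 if is_anomalous else 0

    def training_examples(self) -> List[Tuple[List[float], int]]:
        # Only transactions with a known label are usable for training.
        return [(r["features"], r["label"])
                for r in self._rows.values() if r["label"] is not None]
```

Transactions whose outcome is still unknown remain in the log but are excluded from `training_examples` until a label is added.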

[0070] Turning now to the alternative processing flow of Figure 5B, in this case one or more of the machine learning system 508 and payment processor system 506 reject the transaction based on the output of machine learning system 508. For example, the transaction may be rejected if the scalar output of machine learning system 508 is higher than a retrieved threshold. At box 552, payment processor system 506 sends a response to merchant system 504, which is received at box 554. At box 554, merchant system 504 is responsible for blocking the transaction from completion and returning an appropriate response to POS or user device 502. The response is received at box 556, and the end user or customer may be notified (e.g., via a "Reject" message on screen) that their payment has been rejected. The end user or customer may be prompted to use a different payment method. Although not shown in Figure 5B, in some cases the issuing bank system 510 may be notified that a transaction relating to a specific account holder has been rejected. The issuing bank system 510 may be notified as part of the process shown in Figure 5B, or as part of a periodic (e.g., daily) update. Although the transaction may not become part of transaction data 146, 240, or 300 (because the transaction was not approved), it may still be recorded by at least machine learning system 508, as shown in box 544. For example, as for Figure 5A, the transaction data can be stored together with the output of the machine learning system 508 (e.g., the scalar fraud or anomaly probability) and the final outcome of the transaction (e.g., that the transaction was rejected).

[0071] First example configuration of a machine learning system

[0072] Some of the examples described herein (e.g., the machine learning systems 160, 210, 402, and 508 of Figures 1A to 1C, 2A and 2B, 4, and 5A and 5B) can be implemented as a modular platform that allows the use of different machine learning models and configurations to provide the transaction processing described herein. This modular platform enables different machine learning models and configurations to be used as technology improves and/or based on specific features of the available data. Two example configurations of the machine learning system are provided herein: the first example configuration is shown in Figures 6A to 6C, and the second example configuration is shown in Figure 7. These example configurations can be used independently of one another. The first configuration, shown in Figures 6A to 6C, illustrates a neural network architecture. The second configuration, shown in Figure 7, illustrates a random forest implementation.

[0073] Figure 6A shows the first example configuration of a machine learning system 600. In Figure 6A, machine learning system 600 receives input data 601 and maps it to a scalar output 602. This general processing follows the same framework as described with reference to the preceding examples. Machine learning system 600 includes a first processing stage 603 and a second processing stage 604. The first processing stage 603 is shown in more detail in Figure 6B, and the second processing stage 604 is shown in more detail in Figure 6C. The machine learning system 600 is applied to data associated with at least one proposed transaction to generate a scalar output 602 for the proposed transaction. The scalar output 602 represents the probability that the proposed transaction exhibits anomalous behavior, for example, that the proposed transaction embodies a pattern of actions or events that differs from expected or typical patterns. In some cases, the scalar output 602 represents the probability that the proposed transaction presents an anomaly within a series of actions, wherein these actions include at least previous transactions and may also include other interactions between an entity and one or more computer systems. The scalar output 602 can be used to make an approval decision regarding the proposed transaction, for example, to decide whether to approve or reject the proposed transaction as described with reference to Figures 5A and 5B.

[0074] In Figure 6A, the input data 601 includes transaction time data 606 and transaction feature data 608. The transaction time data 606 may include a timestamp such as that shown in Figure 3B, or any other data format representing the date and/or time of the transaction. The date and/or time of the transaction can be set to the time when the transaction is initiated, for example, at the client computing device 110 or 502, or it can be set to the time when the request is received, for example, similar to the time the request is received at block 522 in Figures 5A and 5B. In one case, as shown later with reference to Figure 6C, the transaction time data 606 may include time data for multiple transactions, such as the currently proposed transaction and a historical set of one or more preceding transactions. The time data for the historical set of one or more preceding transactions may be received along with a request and/or retrieved from a storage device communicatively coupled to the machine learning system 600. In this example, the transaction time data 606 is converted into relative time data and applied to one or more neural network architectures. Specifically, the transaction time data 606 is converted into a set of time differences, where each time difference represents the time difference between the currently proposed transaction and one of the one or more preceding transactions. For example, a time difference may include a normalized time difference in seconds, minutes, or hours. The time difference can be calculated by subtracting the timestamp of each preceding transaction from the timestamp of the proposed transaction. The time difference can be normalized by dividing by a maximum predefined time difference and/or by clipping to the maximum time difference. 
In one case, the one or more preceding transactions may be selected from a predefined time range (e.g., the set 320 in Figure 3A) and/or a predefined number of transactions. The predefined time range, and/or a maximum time difference for the predefined number of transactions, can be used to normalize the time difference values.
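The conversion from absolute timestamps to clipped, normalized time differences described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `normalized_time_differences`, the one-week maximum, and the sample timestamps are all assumptions introduced for the example.

```python
from datetime import datetime

def normalized_time_differences(proposed_ts, previous_ts, max_diff_seconds):
    """Return one value in [0, 1] per preceding transaction.

    Each difference is (proposed - previous) in seconds, clipped to
    [0, max_diff_seconds] and then divided by max_diff_seconds.
    """
    diffs = []
    for ts in previous_ts:
        seconds = (proposed_ts - ts).total_seconds()
        clipped = min(max(seconds, 0.0), max_diff_seconds)
        diffs.append(clipped / max_diff_seconds)
    return diffs

# Hypothetical example data: one transaction an hour ago, one 23 days ago.
proposed = datetime(2021, 5, 24, 12, 0, 0)
previous = [datetime(2021, 5, 24, 11, 0, 0), datetime(2021, 5, 1, 12, 0, 0)]
MAX_DIFF = 7 * 24 * 3600  # assumed maximum time difference of one week

deltas = normalized_time_differences(proposed, previous, MAX_DIFF)
```

With these assumed values, the older transaction falls outside the one-week maximum, so its difference is clipped before normalization.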

[0075] In inference mode, the machine learning system 600 uses the first and second processing stages 603 and 604 to detect feature sets from the input data 601. The output data from the first and second processing stages 603 and 604 is then used to compute the scalar value 602. In Figure 6A, the machine learning system 600 includes a first multilayer perceptron 610. The first multilayer perceptron 610 includes a fully connected neural network architecture with multiple neural network layers (e.g., 1 to 10 layers) to preprocess the data for the proposed transaction prior to at least the first processing stage 603. Although a multilayer perceptron is described herein, in some implementations, preprocessing may be omitted (e.g., if the received transaction data 608 is already in a suitable format) and/or only a single-layer linear mapping may be provided. In some examples, this preprocessing may be considered a form of "embedding" or "initial mapping" layer for the input transaction data. In some cases, one or more neural network layers of the first multilayer perceptron 610 may provide learned scaling and/or normalization of the received transaction data 608. Generally, the fully connected neural network architecture of the first multilayer perceptron 610 represents a first learned feature preprocessing stage that transforms the input data associated with the proposed transaction into feature vectors for further processing. In some cases, the fully connected neural network architecture may learn certain relationships between the elements of the transaction feature data 608, e.g., certain correlations, and output an efficient representation that takes these correlations into account. In one case, the input transaction data 608 may include integers and/or floating-point numbers, and the output of the first multilayer perceptron 610 may include a vector of values between 0 and 1. 
The number of elements or units of the first multilayer perceptron 610 (i.e., the size of the output vector) can be set as a configurable hyperparameter. In some cases, the number of elements or units may be between 32 and 2048.

[0076] It should be noted that the input transaction data 608 may include data from the proposed transaction (e.g., data received via one or more of client devices, POS devices, merchant server devices, and payment processor server devices) and data associated with the proposed transaction but not included in the data packets associated with the proposed transaction. For example, as explained with reference to Figures 1A to 1C and Figures 2A and 2B, the input transaction data 608 may also include auxiliary data (e.g., 148 and 242), wherein the auxiliary data is retrieved by the machine learning system 600 (e.g., as shown in Figure 1C) and/or retrieved by the payment processor server 140 (e.g., as shown in Figure 1A or 1B). The exact content of the input transaction data 608 can vary between implementations. This example relates to a general technical architecture for processing transaction data, rather than the exact form of that data. Generally, since the machine learning system 600 includes a set of neural network layers, parameters can be learned based on any required input data configuration or based on the available input data. This example relates to the engineering design of such a technical architecture to allow transaction processing at the speed and scale described herein.

[0077] In the example of Figure 6A, the first processing stage 603 receives the transaction time data 606 and the output of the first multilayer perceptron 610. The first processing stage 603 includes a recurrent neural network architecture 620, which is shown as processing layer A in Figure 6A. The recurrent neural network architecture 620 receives the transaction time data 606 and the output of the first multilayer perceptron 610, and generates output data through a neural network mapping (e.g., one or more parameterized functions). For example, the recurrent neural network architecture 620 uses internally maintained state to map the transaction time data 606 and the output of the first multilayer perceptron 610 to a vector output of predefined fixed size. The output of the recurrent neural network architecture 620 is then received by the second processing stage 604 and used to generate the scalar output 602. An example configuration of the recurrent neural network architecture 620 is shown in Figure 6B. Generally, the first processing stage 603 produces representations that are particularly suitable for time data with irregular intervals, which matches the time-series properties of transaction data. For example, comparative neural methods for time-series processing typically require time samples at regular intervals, which does not fit the asynchronous nature of payment requests. In some cases, the recurrent neural network architecture 620 may include learnable functions. In other cases, which may be preferred for certain implementations, the recurrent neural network architecture 620 can implement functions that use fixed or provided parameters, such as a time decay function. In this case, the parameters can be configured based on the characteristics of the broader model and on domain knowledge. 
In either case, the parameters of the recurrent neural network architecture 620 can enable the intelligent aggregation of transaction features at unequal time intervals.

[0078] In Figure 6A, the second processing stage 604 includes one or more attention neural network architectures 660. When multiple attention neural network architectures are used, they can be provided in parallel, for example, as a multi-head attention arrangement, where each architecture receives the same input but has different neural network parameters, enabling each to extract a different feature set from the same data. More specifically, different attention heads may focus on different time periods; for example, one attention head may focus on transactions from the past hour, while another attention head may focus on transactions from the past month. Thus, a first attention neural network architecture receives at least the transaction time data 606 and the output data from the recurrent neural network architecture 620, and uses this data to generate output data. If only one "head" is used for the second processing stage 604, this output data is passed to a second multilayer perceptron 690 to generate the scalar value 602. If multiple "heads" are used for the second processing stage 604, the output data of each attention neural network architecture can be combined to form the output data of the second processing stage 604. In one case, the output data of each attention neural network architecture can be at least concatenated. In another case, the concatenated output data can be input into at least one fully connected neural network layer for dimensionality reduction. For example, this function can be provided by the second multilayer perceptron 690. An example configuration for each attention neural network architecture 660 is shown in Figure 6C. Depending on the implementation, the first and/or second multilayer perceptrons 610 and 690 may include, for example, 1 to 10 dense or fully connected layers (with corresponding activation functions, such as the ReLU function). 
Similarly, the number of elements or units (sometimes called channels) used for the first and/or second multilayer perceptrons 610 and 690 can range from 32 to 2048 and can vary between layers, for example, decreasing from input to output.

[0079] In Figure 6A, at least the output of the second processing stage 604, i.e., the output data from the one or more attention neural network architectures 660, is mapped to the scalar output 602 by the second multilayer perceptron 690. The output data from the last of the one or more attention neural network architectures 660 may include a vector of predefined length (e.g., 1 to 1024 elements). The predefined length may be the same as that of the output of the first multilayer perceptron 610, or it may be larger if multi-head attention is used (e.g., when multiple output vectors are concatenated). In one case, the output length (i.e., size) of each attention head may be set as a separate hyperparameter. In some test configurations, the output size of each attention head ranges from about 5 to about 500 elements or units. The second multilayer perceptron 690 includes a fully connected neural network architecture, which in turn includes multiple neural network layers that receive at least the output data from the second processing stage 604 and map that output data to the scalar value 602. The second multilayer perceptron 690 may include a series of dimensionality reduction layers. The last activation function in the multiple neural network layers may include a sigmoid activation function to map the output to the range 0 to 1. The second multilayer perceptron 690 can be trained to extract correlations between the features output by each of the multiple attention heads, and to apply one or more nonlinear functions to finally output the scalar value 602.
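As a rough sketch of the mapping described above, the following pure-Python example passes a vector through a series of dimensionality-reducing dense layers with ReLU activations and a final sigmoid, yielding a scalar between 0 and 1. The layer sizes (64, 32, 16, 1), the random weights, and all function names are assumptions for illustration only; in practice the weights would be learned through training.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(rng, n_in, n_out):
    """Random placeholder weights; a real system would use trained values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def second_mlp(x, layers):
    """Apply ReLU hidden layers, then a sigmoid on the single final unit."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(x, weights, biases)[0])

rng = random.Random(0)
sizes = [64, 32, 16, 1]  # assumed sizes, decreasing from input to output
layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

# Stand-in for the (possibly concatenated) output of the attention heads.
attention_output = [rng.random() for _ in range(sizes[0])]
score = second_mlp(attention_output, layers)
```

The sigmoid on the last layer is what bounds `score` to the open interval (0, 1), so it can be compared against a threshold when making an approval decision.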

[0080] As shown by the dashed lines in Figure 6A, in some variations, one or more skip connections 692, 694 can be configured to bypass one or more of the first processing stage 603 and the second processing stage 604, respectively. The first skip connection 692 bypasses the first processing stage 603, while the second skip connection 694 bypasses the second processing stage 604. Skip connections like these can improve training (e.g., by allowing gradients to bypass layers, thus avoiding the "vanishing gradient" problem). They can also improve accuracy by letting later layers operate on correlations between earlier inputs and intermediate layer outputs (e.g., there may be cases where a simple mapping between the features output by the first multilayer perceptron 610 and the output of one of the attention layers 660 can be used by the second multilayer perceptron 690 to make a "better" decision than with the output of the attention layer 660 alone).

[0081] In some cases, the output of the first multilayer perceptron 610, which can be considered a normalized transaction data feature vector, can be concatenated or otherwise combined with the output data of the recurrent neural network architecture 620 to provide input to the one or more attention neural network architectures 660. In one case, the first skip connection 692 may include a residual connection. Similarly, in some cases, the output of the first multilayer perceptron 610 can be concatenated or otherwise combined with the output data of the one or more attention neural network architectures 660 to provide input to the second multilayer perceptron 690. In one case, the second skip connection 694 may include a residual connection. Generally, in Figure 6A, the dashed lines illustrate a first skip connection 692 around the first processing stage 603 (i.e., layer A) and a second skip connection 694 around the second processing stage 604 (i.e., layer B); if skip connections 692 and 694 are both provided, a skip connection effectively exists around the first and second processing stages 603 and 604 together (i.e., the combination of layers A and B). Residual connections can make it easier for neural network layers to learn mapping functions. Whether to use skip connections, and whether they are combined by concatenation, addition, or another operator (e.g., subtraction or multiplication), can be set as hyperparameters in the configuration data of the machine learning system 600, and can be chosen based on experiments with a specific implementation.
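The choice of combination operator for a skip connection can be illustrated with a small sketch. The function name `combine`, the `mode` strings, and the sample vectors are hypothetical; they only show how a configuration value might select between concatenation and element-wise operators.

```python
def combine(stage_output, skip_input, mode="concat"):
    """Combine a stage's output with the input that bypassed it.

    mode is treated here as a configuration hyperparameter (an assumption):
    "concat" joins the vectors; "add", "subtract", and "multiply" are
    element-wise and therefore require equal lengths.
    """
    if mode == "concat":
        return list(stage_output) + list(skip_input)
    if len(stage_output) != len(skip_input):
        raise ValueError("element-wise modes need vectors of equal length")
    if mode == "add":  # residual-style connection
        return [a + b for a, b in zip(stage_output, skip_input)]
    if mode == "subtract":
        return [a - b for a, b in zip(stage_output, skip_input)]
    if mode == "multiply":
        return [a * b for a, b in zip(stage_output, skip_input)]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical vectors: feature vector from the first multilayer
# perceptron and the output of the recurrent stage it bypasses.
features = [0.2, 0.5, 0.9]
rnn_out = [0.1, 0.4, 0.3]

concat_input = combine(rnn_out, features, "concat")
residual_input = combine(rnn_out, features, "add")
```

Concatenation doubles the input width of the next stage, whereas the element-wise modes keep it unchanged, which is one practical reason to treat the operator as a tunable setting.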

[0082] It has been found that the arrangement of Figure 6A is particularly beneficial for rapid inference on input transaction data, which is necessary for sub-second transaction processing. For example, the number of layers in the overall network architecture is small by modern deep learning standards, which helps to achieve low latency when performing inference on real-time transaction data. The first multilayer perceptron 610 is advantageously arranged to preprocess the transaction data, for example, using learnable parameters, so that the data is provided in a form that yields high accuracy in the remainder of the arrangement and is conducive to stable training. By using the relative time difference between the current transaction and a preceding transaction, time information can be quantified in a format that the arrangement can use to determine the scalar output, and time can be represented efficiently without depending on absolute timestamp values (which can be large or complex). Furthermore, dependence on absolute timestamp values can lead to a lack of generality across entities and/or time periods, which in turn can lead to poor performance of machine learning models. The first processing stage 603 is configured to process the transaction feature vector output by the first multilayer perceptron 610 in a stateful manner, wherein the use of the previous state is determined by parameters that operate on time difference data for the proposed transaction. For example, the recurrent neural network architecture 620 is configured to determine how to use (e.g., how to "remember" or "forget") previous transaction data based on the time interval between the proposed transaction and previous transactions, where the previous transaction data is an aggregation function of the transaction feature vectors for multiple previous transactions. 
Therefore, the first processing stage 603 has a potentially infinite range with respect to previous transactions and is arranged to extract features based on this range. Specifically, the recurrent neural network architecture 620 "knows" about the previous transactions of the same entity only through the previous state vector of that entity (as indicated by the input from the previous iteration 622 in Figure 6B). The state vector has a fixed size, the exact size of which is a hyperparameter of the model (e.g., it might be in the range of 4 to 128 elements). Therefore, the recurrent neural network architecture 620 must summarize the past behavior of any entity up to any point in time in the form of this fixed-size vector. This contrasts with the second processing stage 604, which accesses the entity's historical events, e.g., all events or a subset of events based on event time. In some cases, the second processing stage 604 may have a fixed input range (e.g., it considers a fixed number of input feature vectors), with attention (self-attention) applied within that fixed input range. In some cases, the input range of the second processing stage 604 can be defined based on event time; for example, an attention head might only look at events of the same entity that occurred within the past month, where the time range of each attention head is defined as a hyperparameter. In some cases, the input range of the second processing stage 604 may have no time constraint and thus can include all historical events of the same entity. The chosen configuration may depend on the average number of event data items available for each entity and/or any processing resource constraints (e.g., time-limited attention heads may be faster and easier to implement).

[0083] Therefore, the first and second processing stages 603 and 604 apply differentiated but complementary processing to extract the information used to determine the scalar value 602. The second processing stage 604 applies neural attention in an efficient manner that supports large-scale, fast inference. In fact, the neural network architectures of the first and second processing stages 603 and 604 are both configured for fast computation to handle the scale of parallelized transaction processing (e.g., the configuration shown in Figure 6A supports millisecond-level inference and parallelization, enabling 1000-2000 transactions per second to be handled on a server computer device). For example, the first and second processing stages 603 and 604 omit certain components used in comparative neural network architectures, but still allow for accurate inference.

[0084] An example configuration of the recurrent neural network architecture 620 in the first processing stage 603 is shown in Figure 6B. The recurrent neural network architecture 620 includes three input interfaces for receiving input data: a time difference interface 606 for receiving data that can be used to output the time difference between the proposed transaction and the preceding transaction; a transaction data input interface 618 for receiving data for the proposed transaction; and a state input interface 622 for receiving state data from the previous iteration. These interfaces may include passive interfaces, such as a method API for receiving data via a referenced memory location, and/or may include active interfaces, such as interfaces that apply preprocessing (e.g., calculating the time difference from absolute time data) if that preprocessing has not yet been applied. Functionally, the operations are identical. The time difference interface 606 can receive (or calculate based on received time data) a time interval $\Delta t_{i,i-1}$, which includes data representing the time interval between the currently proposed transaction i and a previous or preceding transaction i-1, where i represents the iteration index of the recurrent neural network architecture 620. The exact form of the time interval $\Delta t_{i,i-1}$ may vary depending on the implementation. Although the interval between the i-th and (i-1)-th transactions is shown, in some cases different intervals can be used, for example, between the i-th and j-th transactions (where i>j), with the state for the j-th transaction retrieved as appropriate. 
The time interval $\Delta t_{i,i-1}$ at the time difference interface 606 can be represented as one or more of the following: an integer difference between Unix epoch times (in seconds); a floating-point value representing a fractional period in hours; a vector of seconds, minutes, hours, days, months, and years, where each element is an integer value or a normalized value between 0 and 1; or a time difference embedding from an embedding layer. Preferably, the time interval $\Delta t_{i,i-1}$ can include a scalar floating-point value representing the number of seconds elapsed between a previous transaction and the proposed transaction for the same entity. Alternatively, the time difference interface 606 can receive timestamp data for the proposed and preceding transactions and can output a data representation of the time interval $\Delta t_{i,i-1}$. The transaction data input interface 618 can receive the transaction feature vector (i.e., a fixed-length vector) output by the first multilayer perceptron 610, for example, where each element includes a normalized value between 0 and 1.

[0085] The recurrent neural network architecture 620 has a stored state that can be retrieved from memory or from a communicatively coupled storage device. The recurrent neural network architecture 620 receives state data from a previous iteration (e.g., i-1) at the state input interface 622 and outputs state data for the current iteration (e.g., i) at a state output interface 624. The state data received at the state input interface 622 results from previous applications of the recurrent neural network architecture 620 to data for a set of previous transactions, including the first-to-last transaction. Generally, the recurrent neural network architecture 620 is configured to operate on one event (e.g., a transaction or a transaction-related event) at a time (i.e., per iteration). Figure 6B shows schematic processing for a single event for a common (i.e., the same) entity (although, for efficiency, computation can be vectorized and/or parallelized across multiple events and/or entities). Therefore, the recurrent neural network architecture 620 of Figure 6B includes a processing layer that computes [current_output, new_state] based on [current_input, previous_state], where each of these data items may contain a fixed-length vector representation. The fact that the current output (e.g., 630) depends not only on the input of the current event but also on the inputs of a set of historical events of the same entity is a property of the layer's recurrent nature (e.g., because each current output depends on a previous state, which in turn depends on another previous state, and so on).

[0086] The recurrent neural network architecture 620 includes a forget gate 626 to modify the state data from the previous iteration based on the data output by the time difference interface 606. For example, the time difference interface 606 can output data representing the time interval $\Delta t_{i,i-1}$ (i.e., data representing the time difference between the proposed transaction and the preceding transaction), which is used as input to the forget gate 626. Within the forget gate 626, a time difference encoding $\phi_d$ 628 is applied to the data representing the time interval $\Delta t_{i,i-1}$ to generate the activation vector of the forget gate. The time difference encoding $\phi_d$ 628 can implement a parameterized function, such as a time decay function. In one case, the time difference encoding $\phi_d$ 628 can be applied as a vectorized exponential time decay function of the form $f(s) = [e^{-s/w_1}, e^{-s/w_2}, \ldots, e^{-s/w_d}]$, i.e., a series of parameterized time decays, where $[w_1, w_2, \ldots, w_d]$ is a set of weights representing "decay lengths" in seconds and d is the size of the recurrent state vector. In this case, $\Delta t_{i,i-1}$ is passed to the function as the time difference s to output a vector of length d; for example, it can be passed as a scalar floating-point value representing the number of seconds elapsed between a previous event (e.g., a transaction) and the current event (e.g., a proposed transaction) for the same entity. The time difference encoding $\phi_d$ 628 may include a weight vector or matrix that is used to transform the data representing the time interval $\Delta t_{i,i-1}$. The parameters of the time difference encoding $\phi_d$ 628 can be fixed (e.g., manually configured) and/or can include trainable parameters of the recurrent neural network architecture 620. 
In the case of an exponential function, the decay lengths can include (or be trained to include) a mixture of different values representing decay over minutes to weeks, so as to capture both short-term and long-term behavior. The state data received at the state input interface 622 is then weighted using the activation vector to output a modified state vector. In the example of Figure 6B, element-wise multiplication 632 is used to apply the activation vector to the state data from the previous iteration (e.g., to compute the Hadamard product). The state data comprises a vector with the same length as the transaction feature vector output from the transaction data input interface 618 (i.e., the same length as the output of the first multilayer perceptron 610). The forget gate 626 effectively controls how much state data from previous iterations is remembered and how much is forgotten. The configuration shown in Figure 6B is intentionally arranged to allow for tractable training and fast inference in the present transaction processing context. For example, certain other gates provided in comparative implementations, such as input or output gates, are omitted.
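The exponential time decay and the element-wise weighting of the previous state can be sketched as below. The four decay lengths (one minute, hour, day, and week) and the all-ones previous state are assumptions chosen so the effect of each decay length is easy to see; they are not values taken from the source.

```python
import math

def time_decay_encoding(seconds, decay_lengths):
    """f(s) = [exp(-s / w_1), ..., exp(-s / w_d)] for decay lengths w_k."""
    return [math.exp(-seconds / w) for w in decay_lengths]

def apply_forget_gate(prev_state, seconds, decay_lengths):
    """Weight the previous state element-wise (Hadamard product)."""
    activation = time_decay_encoding(seconds, decay_lengths)
    return [a * s for a, s in zip(activation, prev_state)]

# Assumed decay lengths in seconds: one minute, hour, day, and week.
DECAY_LENGTHS = [60.0, 3600.0, 86400.0, 604800.0]
prev_state = [1.0, 1.0, 1.0, 1.0]

# One hour has elapsed since the previous event for this entity.
modified = apply_forget_gate(prev_state, 3600.0, DECAY_LENGTHS)
```

After one hour, the element with the one-minute decay length is almost entirely forgotten, while the element with the one-week decay length is nearly unchanged, which shows how a mixture of decay lengths retains both short-term and long-term behavior.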

[0087] After modification via the forget gate 626, the modified state data is combined with the transaction feature vector output from the transaction data input interface 618 via combinational logic 634 to generate output data 630 for the proposed transaction (and for this iteration i). The combinational logic 634 can perform element-wise addition (i.e., normal vector addition). The output data 630 also forms the output of the state output interface 624. The output of the state output interface 624 can then be cached in memory or otherwise stored until the next transaction.
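A single iteration of the recurrent layer, as described above, can be sketched as a function from [current_input, previous_state] to [current_output, new_state]. The zero initial state, the two decay lengths, and the sample events are assumptions for illustration.

```python
import math

def recurrent_step(features, prev_state, seconds, decay_lengths):
    """One iteration: decay the previous state, then add the new features.

    Returns (output, new_state); in this sketch they hold the same values,
    mirroring the description that the output also forms the next state.
    """
    gate = [math.exp(-seconds / w) for w in decay_lengths]
    new_state = [g * s + x for g, s, x in zip(gate, prev_state, features)]
    return list(new_state), new_state

DECAY_LENGTHS = [3600.0, 86400.0]  # assumed: one hour and one day

# Hypothetical events for one entity: (seconds since previous event, features).
events = [(0.0, [0.5, 0.5]), (3600.0, [0.1, 0.2])]

state = [0.0, 0.0]  # assumed initial state for an entity with no history
outputs = []
for seconds_since_previous, features in events:
    output, state = recurrent_step(
        features, state, seconds_since_previous, DECAY_LENGTHS
    )
    outputs.append(output)
```

Because only a forget gate and an addition are involved, each step costs a handful of element-wise operations, which is consistent with the low-latency goal described in the text.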

[0088] In a preferred implementation, the state data used by the recurrent neural network architecture 620 is entity-dependent, i.e., specific to a user, account holder, or merchant account. In this way, the appropriate entity can be identified as part of transaction data preprocessing, and the machine learning system 600 can be configured for that entity. In one case, the machine learning system 600 can apply the same parameters to the neural network architecture for each entity, but can store state data separately and retrieve only the historical transaction data and/or auxiliary data associated with that entity (e.g., indexed by that entity). Therefore, the machine learning system 600 may include an entity state memory, for example, as described with reference to Figures 2A and 2B, which allows for efficient partitioning of data from different entities. In other cases, the parameters used for the forget gate 626 can be shared among multiple entities. This can be advantageous in reducing the number of learnable parameters per entity in the overall machine learning system. For example, the aforementioned exponential time decay weights might be fixed and shared among multiple entities, while the state data might be unique to each entity.
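One way to picture shared parameters with per-entity state is an in-memory store indexed by entity identifier. This is only a sketch under assumptions: the class `EntityStateStore`, the function `process_event`, the entity identifiers, and the use of plain float timestamps are all hypothetical, and a production system would likely use a persistent store.

```python
import math

# Assumed decay lengths, fixed and shared by every entity.
SHARED_DECAY_LENGTHS = [3600.0, 86400.0]

class EntityStateStore:
    """Hypothetical in-memory store keeping one state vector per entity."""

    def __init__(self, state_size):
        self._state_size = state_size
        self._states = {}

    def get(self, entity_id):
        """Return (state, last_timestamp); zeros and None for new entities."""
        default = ([0.0] * self._state_size, None)
        return self._states.get(entity_id, default)

    def put(self, entity_id, state, timestamp):
        self._states[entity_id] = (state, timestamp)

def process_event(store, entity_id, timestamp, features):
    """Update only the state of the entity that the event belongs to."""
    state, last_ts = store.get(entity_id)
    seconds = 0.0 if last_ts is None else timestamp - last_ts
    gate = [math.exp(-seconds / w) for w in SHARED_DECAY_LENGTHS]
    new_state = [g * s + x for g, s, x in zip(gate, state, features)]
    store.put(entity_id, new_state, timestamp)
    return new_state

store = EntityStateStore(state_size=2)
process_event(store, "card-A", 0.0, [1.0, 1.0])
process_event(store, "card-B", 10.0, [0.5, 0.5])
state_a = process_event(store, "card-A", 3600.0, [0.0, 0.0])
state_b, _ = store.get("card-B")
```

Events for "card-B" never touch the state of "card-A", while both entities are decayed with the same shared weights.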

[0089] Figure 6C shows an example configuration for the second processing stage 604. Specifically, Figure 6C shows an example of an attention neural network architecture 660, which can form part of an attention arrangement with one or more "heads". In Figure 6C, the attention neural network architecture 660 receives input from the first processing stage 603, wherein the input relates to the current event and a set of previous events for the same entity. Each "head" of the attention arrangement can receive the same input data but can contain different parameter data, thereby enabling each "head" to learn a different set of parameters. The example configuration of Figure 6C can be called a self-attention architecture because the attention neural network architecture 660 determines which parts of the input to attend to based in part on that input. In the example configuration of Figure 6C, the input feature vector is generated at least in part based on the previous iteration outputs from the first processing stage 603 and is weighted based on the current iteration output of the first processing stage 603. The input feature vector is also generated based on data representing time differences, allowing the time interval between the proposed transaction and one or more preceding transactions to influence the applied weights. In effect, this provides differentiated weighting based on the irregular intervals of the time data. Generally, the attention neural network architecture 660 includes neural network layers that apply attention weights to the input feature vector to generate the output data of the attention neural network architecture 660. The attention weights are calculated based on the input feature vector and the current output data from the first processing stage 603. This is explained in more detail below.

[0090] In Figure 6C, the attention neural network architecture 660 includes a time difference interface 662, a historical input interface 664, and a current input interface 666. The time difference interface 662 is similar to the time difference interface 606 of the recurrent neural network architecture 620, but in this case it is configured to output a vector of relative time intervals $\Delta t_{i,i-1}, \ldots, \Delta t_{i,1}$, i.e., a vector representing the time differences between the currently proposed transaction and each of a set of historical or preceding transactions. For example, this could be multiple concatenated time interval values, where each time interval value takes a form similar to that output by the time difference interface 606. The time difference interface 662 can receive time data as the relative time interval vector, or it can receive absolute time data and calculate the relative time intervals from that absolute time data. In a preferred example, the time difference interface 662 receives time difference data that includes at least data representing the time difference received by the time difference interface 606 and one or more time differences between the proposed transaction and one or more other preceding transactions. The historical input interface 664 receives historical output data from the first processing stage 603. For example, this data could include outputs of the recurrent neural network architecture 620 that are buffered over a predefined time period or number of iterations.

[0091] In one case, both the historical input interface 664 and the time difference interface 662 output data for computations related to a fixed or predefined number of previous iterations (i-1 in the example, and referred to below as T). Alternatively, the historical input interface 664 and the time difference interface 662 can be configured to output data for computations related to a subset of historical events. This subset can be constrained based on a finite number of events (e.g., the 10 most recent events for an entity), a configured time period (e.g., the past month for the same entity), or both (e.g., events within the past month limited to the 10 most recent events). This is the preferred method for managing data size constraints and computational tractability. However, if the number of transactions is small, the number of previous iterations may increase with each iteration.
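Constraining the history by event count, by time period, or by both can be sketched as a simple filter. The function name `select_history`, the tuple layout of each event, and the daily sample events are assumptions made for the example.

```python
def select_history(events, now, max_events=None, max_age_seconds=None):
    """Select the subset of buffered events visible to an attention head.

    events: list of (timestamp_seconds, output_vector), oldest first.
    Either constraint may be None, meaning it is not applied.
    """
    selected = list(events)
    if max_age_seconds is not None:
        selected = [e for e in selected if now - e[0] <= max_age_seconds]
    if max_events is not None:
        selected = selected[-max_events:] if max_events > 0 else []
    return selected

DAY = 86400.0
# Hypothetical buffer: one event per day for 15 days (days 0 to 14).
events = [(day * DAY, [float(day)]) for day in range(15)]
now = 20 * DAY

# Past 30 days, limited to the 10 most recent events.
recent_ten = select_history(events, now, max_events=10,
                            max_age_seconds=30 * DAY)

# Past 10 days, limited to the 10 most recent events.
last_ten_days = select_history(events, now, max_events=10,
                               max_age_seconds=10 * DAY)
```

In the first call the count limit is the binding constraint; in the second the time window is, since only five of the buffered events fall within the past 10 days.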

[0092] Finally, the current input interface 666 receives the current output data from the first processing stage 603. In the attention neural network architecture 660 shown in Figure 6C, this is derived at least from the output data of the current iteration of the recurrent neural network architecture 620 (i.e., the output data 630 in Figure 6B). In some cases, for example, when using a skip connection such as that shown by the dashed line in Figure 6A, the current input interface 666 can receive a combination of the output from the first processing stage 603 (e.g., output data from the recurrent neural network architecture 620) and the input to the first processing stage 603 (e.g., the input transaction data received at 618 as shown in Figure 6B).

[0093] In the example of Figure 6C, the time difference data received from the time difference interface 662 is modified using a time difference encoding $\phi_e$ 668. Like the time difference encoding $\phi_d$ 628 of Figure 6B, the time difference encoding $\phi_e$ 668 may include a parameterized function applied to the time difference data. Depending on the implementation, the parameters of the time difference encoding $\phi_e$ 668 can be fixed or trainable (i.e., learned). In some implementations, the time difference encoding $\phi_e$ 668 can include an exponential time decay function, similar to that described for the time difference encoding $\phi_d$ 628 of Figure 6B. In other implementations, it can include fully connected neural network layers. In one case, the time difference encoding $\phi_e$ 668 can output a tensor containing a set of vectors (i.e., a matrix), with one vector for each event time difference. Each vector can be calculated from the time elapsed between a specific historical event and the current event (e.g., the currently proposed transaction). In one case, each vector of the time difference encoding $\phi_e$ 668 can be computed in a manner similar to the time difference encoding $\phi_d$ 628 of Figure 6B, with the operation repeated for each time difference. In other cases, a different positional encoding can be used, such as an encoding based on (weighted) sinusoidal functions. An example of sinusoidal positional encoding is described in the paper "Attention is All You Need" by Vaswani et al. (published on arXiv on December 6, 2017), which is incorporated herein by reference. The time difference encoding $\phi_e$ 668 can be parameterized, for example based on a weight matrix used to implement the time encoding. As with the time difference encoding $\phi_d$ 628 of Figure 6B, the parameters of the time difference encoding $\phi_e$ 668 can be fixed (e.g., provided as predefined configuration values) and / or learned through (end-to-end) training.
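Two of the encoding options mentioned above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the choice of hours as the time unit, and the use of continuous time differences in place of integer positions in the sinusoidal variant are all illustrative and are not taken from the patent.

```python
import math

def exponential_encoding(deltas, decay_rates):
    """Map T time differences to an M-by-T matrix of exponential decays.

    Row m, column t holds exp(-decay_rates[m] * deltas[t]). The decay
    rates play the role of fixed or learned encoding parameters.
    """
    return [[math.exp(-rate * dt) for dt in deltas] for rate in decay_rates]

def sinusoidal_encoding(deltas, dim, base=10000.0):
    """Sinusoidal encoding in the style of Vaswani et al., applied to time
    differences rather than integer positions (an assumption of this sketch).
    Even rows use sine and odd rows use cosine of the scaled time difference.
    """
    rows = []
    for m in range(dim):
        freq = 1.0 / base ** (2 * (m // 2) / dim)
        fn = math.sin if m % 2 == 0 else math.cos
        rows.append([fn(dt * freq) for dt in deltas])
    return rows

# Hypothetical time differences in hours: one hour, one day, one week.
deltas = [1.0, 24.0, 168.0]
# Decay rates chosen so that the half-lives are one hour and one day.
rates = [math.log(2.0) / 1.0, math.log(2.0) / 24.0]
E = exponential_encoding(deltas, rates)   # shape M x T = 2 x 3
P = sinusoidal_encoding(deltas, dim=4)    # shape M x T = 4 x 3
```

Each column of the result is the encoded vector for one time difference, which matches the "one vector per event time difference" layout described in the paragraph.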

[0094] In this example, the output of the time difference encoding $\phi_e$ 668 includes a relative time encoding matrix, with one encoded vector for each time difference. This is received at concatenation block 670. Concatenation block 670 concatenates the relative time encoding with historical output data from the first processing stage 603, such as a set of buffered outputs from the recurrent neural network architecture 620 for iterations i-1 to 1. For example, each vector of the time encoding associated with a specific time difference can be concatenated with the vector of the corresponding encoded event. Concatenation block 670 thus outputs a feature tensor that combines the time difference data with the historical output data from the first processing stage 603 (e.g., where the input data is in vector form). The feature tensor is then used as input for determining the key vectors and value vectors. The feature tensor can be arranged as a longer, flat vector or as a multidimensional array (e.g., a buffer-like data structure in which one dimension indexes the iteration).

[0095] The attention neural network architecture 660 in Figure 6C can be viewed as an implementation of a single-head self-attention system. This attention system uses key vectors, query vectors, and value vectors. In the example of Figure 6C, value vector computation operation 672 is used to compute the value vectors, key vector computation operation 674 is used to compute the key vectors, and query vector computation operation 676 is used to compute the query vector. In some cases, vectors can be computed for each iteration to generate attention weights for that iteration. Value vector computation operation 672 and key vector computation operation 674 receive as input the output of concatenation block 670, i.e., elements of the feature vector or feature tensor, which is computed from the time difference data and the historical output data from the first processing stage 603. The terms key, query, and value are used in the art to refer to the different representations used when applying attention. These terms were developed with reference to information retrieval systems, where a query is mapped against a set of keys to return one or more values as matches. In neural attention, at a high level, query vectors are compared with key vectors and used to determine a set of attention weights. In some cases, one or more query vectors can be compared with multiple key vectors (e.g., through matrix operations) to generate a set of attention weights, where each weight is associated with a different iteration index. This set of attention weights is applied to at least one value vector to provide a weighted vector output. In some cases, the weighted vectors from each iteration are combined into a final weighted sum to generate an output vector (e.g., 688).
Each of the value, key, and query vector computation operations 672, 674, and 676 involves applying one or more sets of parameter weights (e.g., applied through one or more neural network layers), where the parameter weights are learnable parameters of the attention neural network architecture 660. Thus, the attention neural network architecture 660 "learns" how to transform the input feature tensor output by the concatenation block 670 to generate appropriate key and value vectors, and how to transform the current output data from the first processing stage 603 to generate appropriate query vectors. In effect, the query vector computed by the query vector computation operation 676 represents how the current output of the recurrent neural network architecture 620 should be represented in order to look for information in the input feature vector (i.e., the time difference data and previous outputs of the recurrent neural network architecture 620) and to weight that input feature vector (i.e., emphasizing some aspects and downplaying others). Since the weights used in each of the value, key, and query vector computation operations 672, 674, and 676 are trainable parameters, the attention neural network architecture 660 learns, through training, how best to manipulate the input data to generate output data, thereby producing an accurate scalar output representing anomalies in the transaction.

[0096] Moving to the mechanism of applying attention, the query vector output by query vector computation operation 676 is applied to the key vector output by key vector computation operation 674 using a first dot product operation 678, which computes the dot product of the key vector and the query vector. This operation can be performed on a set of multiple key vectors and / or multiple query vectors to output a dot product result for each iteration in the historical data. In parallel, a further time difference encoding $\phi_w$ 680 is applied to the time difference data output from the time difference interface 662. The time difference encoding $\phi_w$ 680 can again apply one or more functions similar to those described for the time difference encoding $\phi_d$ 628 and the time difference encoding $\phi_e$ 668. In one case, the time difference encoding $\phi_w$ 680 outputs a value for each pair (historical event, current event) of the same entity. This value can be a scalar that is a function of the time elapsed between the historical event and the current event, i.e., a function of $\Delta t_{i,j}$, where i is the current event and j is a historical event. The function applied by the time difference encoding $\phi_w$ 680 can take several forms. As before, the parameters of this function can be fixed or trainable. In one case, the parameters can be configured to assign higher weights to events from a specific time interval (e.g., events from the same entity over the past week) to encourage the attention layer to focus more on events from that time interval. Therefore, by controlling the parameters of the time difference encoding $\phi_w$ 680, a specific attention layer can be configured to focus more or less on specific time intervals.
If the parameters are defined as configuration parameters, this may allow an operator to control the operation of the attention layer; if the parameters are learned during the training of the neural network architecture, they may converge to values that increase the success rate of anomaly classification.

[0097] The output of the time difference encoding $\phi_w$ 680 is then combined with the dot product output by the first dot product operation 678 at combinational logic 682. In operation 682, a time-based weight (for the historical event and the current event) is added to the dot product between the query vector of the current event and the key vector of the historical event; this dot product can be a scalar. Combinational logic 682 effectively adjusts the initial set of attention weights computed from the key and query vectors using the weighted time difference data. The weighting of the time difference data output by the time difference interface 662, via the time difference encoding $\phi_w$ 680, is important because it allows the attention weights to be adjusted so that the attention neural network architecture 660 focuses more on samples in some time intervals than in others. For example, in some tests it was found that this adjustment typically emphasizes features of the transaction data within a learned time range (e.g., approximately one week from the proposed transaction).

[0098] Finally, to complete the computation of attention weights from the input feature tensor and the current output data from the first processing stage, a softmax operation 684 is applied, and a dot (i.e., scalar) product of the output of the softmax operation 684 and one or more value vectors output by the value vector computation operation 672 is computed via a second dot product operation 686. The second dot product operation 686 effectively applies the normalized attention weights output by the softmax operation 684 to one or more value vectors to provide the output data 688 of the attention neural network architecture 660. The softmax operation 684 applies a softmax or normalized exponential function to output a set of normalized attention weights, which can be viewed as a set of probabilities (i.e., they sum to 1, and each is in the interval between 0 and 1). The second dot product operation 686 then computes an iterative weighted sum of the input feature tensor weighted by the computed attention weights.

[0099] For example, the output of the recurrent neural network architecture at each iteration j can be represented as a vector $A_j$ of length K. The time difference between the current iteration i and a previous iteration j (where $i \neq j$) can be expressed as $\Delta t_{i,j}$; this time difference can be a scalar floating-point value. In the computation, the recurrent neural network architecture receives $A_j$ representing the previous state (e.g., with $j = i-1$) and encodes the time difference using the time difference encoding $\phi_d$. If $\phi_d$ is a vector of parameters, for example a set of time decay weights of length K, then the forget gate 626 first calculates $f_i(\phi_d, \Delta t_{i,j})$, and the modified state $S_i$ is then calculated as $S_i = f_i(\phi_d, \Delta t_{i,j}) \circ A_j$, where $\circ$ represents element-wise multiplication. The output is $A_i = S_i + X_i$, where $X_i$ is the input transaction data feature vector for iteration i, which also has length K (e.g., as output by the first multilayer perceptron 610).
Turning to the attention neural network architecture 660, the time difference interface 662 can receive a multidimensional array (e.g., a vector) containing T time differences, i.e., $\Delta t_{i,j}$ for $j = i-1$ to $i-T$. If each time difference is a floating-point scalar as described above, the input to the time difference interface 662 can be a vector $\Delta_i$ of size T. Similarly, the historical input interface 664 can receive a multidimensional array $H_i$ of size K by T, representing the buffered outputs $A_j$ of the recurrent neural network architecture for $j = i-1$ to $i-T$. The current input interface 666 then receives $A_i$ for the current iteration. The time difference encoding $\phi_e$ 668 can implement a parameterized function $f_e(\phi_e, \Delta_i)$, where $\phi_e$ is a set of parameters for the time difference encoding. The function can output a vector for each time difference, i.e., a multidimensional array (or matrix) $E_i$ of size M by T, where M is the length of the time difference encoding. Concatenation operation 670 can then stack the corresponding vectors of each of the T sources to output a matrix of size (K+M) by T. Value and key computation operations 672 and 674 can generate value and key vectors, respectively, for each of the T samples forming the input (i.e., the concatenated time differences and buffered outputs). At the first dot product operation 678, the dot product of the query vector with each of the T key vectors can be computed to generate T initial attention weights $\alpha_i$. The time difference encoding $\phi_w$ 680 can apply a parameter matrix of size T by T and, by applying the parameterized function $f_w(\phi_w, \Delta_i)$, generate a weighted time vector $D_i$ of length T (i.e., $D_i = f_w(\phi_w, \Delta_i)$). At combinational logic 682, this is added to the vector of T initial attention weights, i.e., $\alpha'_i = \alpha_i + D_i$. As mentioned above, the time-based scalar weight from the time difference encoding $\phi_w$ 680 and the scalar key-query dot product weight are added together (combined) at component 682. The softmax function is then applied at softmax operation 684, $\alpha''_i = \mathrm{softmax}(\alpha'_i)$. The attention weights $\alpha''_i$ can then be used to weight each of the T value vectors output by the value vector computation operation 672, and the output $B_i$ can be generated as $B_i = \sum_{j=i-T}^{i-1} \alpha''_{i,j} V_{i,j}$, where $\alpha''_{i,j}$ is one of the T attention weight values and $V_{i,j}$ is the corresponding value vector. The resulting output vector $B_i$ has the same length as the value vectors (which can be M+K), or a customizable length set by the matrix multiplication in the value vector computation operation 672.
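The attention computation walked through above (concatenation 670, key, value, and query projections 672 to 676, dot product 678, time-based bias 680 and 682, softmax 684, and weighted sum 686) can be sketched in plain Python. This is a minimal single-head sketch under stated assumptions: the function names, the plain matrix projections `W_q`, `W_k`, and `W_v`, and all toy numbers are illustrative, and the time-based weights `D_i` are passed in directly instead of being computed from a parameterized function.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Normalized exponential; subtracting the max keeps exp() stable."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(current, history, time_codes, time_bias, W_q, W_k, W_v):
    """Single-head attention over T buffered iterations.

    current:    A_i, length K
    history:    T vectors A_j, each of length K (H_i stored per iteration)
    time_codes: T vectors of length M (columns of E_i)
    time_bias:  T scalars (D_i), added to the key-query dot products
    """
    # Concatenate each buffered output with its time encoding (length K+M).
    features = [h + e for h, e in zip(history, time_codes)]
    query = matvec(W_q, current)
    keys = [matvec(W_k, f) for f in features]
    values = [matvec(W_v, f) for f in features]
    # Initial attention weights: one key-query dot product per iteration.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Add the time-based weights, then normalize.
    weights = softmax([s + d for s, d in zip(scores, time_bias)])
    # Weighted sum of the value vectors.
    out = [sum(w * v[n] for w, v in zip(weights, values))
           for n in range(len(values[0]))]
    return weights, out

# Toy sizes: K = 2, M = 1, T = 3. All numbers are illustrative.
A_i = [1.0, 2.0]
H_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E_i = [[0.9], [0.5], [0.1]]
D_i = [0.0, 0.0, 1.0]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, B_i = attend(A_i, H_i, E_i, D_i, W_q, W_k, W_v)
```

Because the bias is added before the softmax, raising one entry of `D_i` shifts attention toward the corresponding iteration without changing the learned key and query projections, which is the control described for the time difference encoding 680.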

[0100] The second processing stage 604 uses one or more attention neural network architectures 660 as described above to rank actions or events associated with the proposed transaction. These actions or events can be data received along with transaction requests and / or auxiliary data for evaluation and weighting, where the relative timing of the various actions constituting these features can be used as part of the weighting. In some examples, the input transaction features 608 may also include other actions associated with the user, such as sequences of measured user input, e.g., mouse clicks and / or changes in account history. This makes the output scalar value sensitive to patterns in the input data. For example, two transaction requests within a few seconds may be associated with a detected pattern that differs from two transaction requests within a few days.

[0101] The architecture of the first example configuration shown in Figures 6A to 6C is therefore advantageous over comparative transaction processing methods and systems that do not allow the above patterns to be detected in data sequences. Specifically, the first example configuration shown in Figures 6A to 6C provides improved accuracy, such as fewer false positives and more true positives, and avoids the errors and suboptimal results traditionally found when attempting to apply machine learning in the context described herein. For example, the machine learning architecture described above is able to identify and reason about sequences of asynchronous actions over time, while appropriately accounting for the time intervals and temporal densities of individual actions of different types. By using the stateful processing of the first processing stage 603 and the adaptive attention mechanism of the second processing stage 604, the first example configuration is able to learn useful features that capture long-term trends of entities (e.g., users or merchants).

[0102] Although the first processing stage 603 and the second processing stage 604 operate together in Figure 6A, in some examples each can be implemented independently, for example with or without one or more of the first and second multilayer perceptrons 610 and 690. For example, the second processing stage 604 can be omitted in some implementations, such that the second multilayer perceptron 690 operates only on the output of the first processing stage 603; or the first processing stage 603 can be omitted, such that the second processing stage 604 receives the feature vector output by the first multilayer perceptron 610 (e.g., $X_i$ in place of $A_i$), which represents one or more of current and historical data.

[0103] In some cases, the forget gate 626 can be implemented as a form of time decay processing, where the neural network unit (i.e., the recurrent neural network architecture 620) has a local memory (i.e., state) and this memory is decayed over time before processing new data samples. In some examples, the time difference encoding $\phi_d$ 628 can contain one or more exponential time decay factors.

[0104] The first processing stage 603 can be viewed as computing a weighted sum of the new input and the previous inputs. The forget gate 626 acts as a form of time decay that adjusts the contribution of past transactions to the sum. The time decay can be purely a function of time, so that the decay between a pair of transactions is independent of anything that happens in between. Thus, the first processing stage 603 provides long-term storage for the contribution of earlier transactions, not only the most recent transaction.

[0105] As mentioned above, in some implementations, one or more of the time difference encodings (e.g., $\phi_d$ 628, $\phi_e$ 668, and $\phi_w$ 680) can be applied as a function $f_i(\phi_i, \Delta t_{i,j})$ of the time interval, where $\phi_i$ is a set of parameters. At least in the forget gate 626 implementation, the function can be constrained such that $f(\Delta t_{i,i-a}) \cdot f(\Delta t_{i,i-b}) = f(\Delta t_{i,i-a} + \Delta t_{i,i-b})$. For example, the function could include an exponential function. In one case, weights could be applied, followed by an exponential activation function. In this case, the contribution of an iteration depends only on the intervals between iterations and is independent of any intermediate iterations.
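As a brief check of the constraint above, an exponential decay satisfies it directly. The decay rate $\lambda \geq 0$ is a symbol introduced here for illustration only and does not appear in the original text.

```latex
f(\Delta t) = e^{-\lambda \Delta t}
\quad\Longrightarrow\quad
f(\Delta t_{i,i-a}) \, f(\Delta t_{i,i-b})
  = e^{-\lambda \Delta t_{i,i-a}} \, e^{-\lambda \Delta t_{i,i-b}}
  = e^{-\lambda (\Delta t_{i,i-a} + \Delta t_{i,i-b})}
  = f(\Delta t_{i,i-a} + \Delta t_{i,i-b})
```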

[0106] As an example of the above variation, consider a case with three events or actions, such as three transactions. The first event (A) occurs at midnight; the second event (B) occurs at 3 a.m. (i.e., 3 hours after event A); and the third event (C) occurs at 9 a.m. (i.e., 6 hours after event B and 9 hours after event A). The recurrent neural network architecture can determine the current state $S_i$ from the previous state $S_{i-1}$, based on the interval between the previous event (i-1) and the current event (i), and from the input $X_i$ of the current event. In this case, for each update, the recurrent neural network architecture can compute the state as $S_i = X_i + f(t_i - t_{i-1}) \cdot S_{i-1}$. For the three events, the states can be calculated as $S_A = X_A + 0$, $S_B = X_B + f(3\,\text{hours}) \cdot S_A$, and $S_C = X_C + f(6\,\text{hours}) \cdot S_B$, the last of which is equivalent to $S_C = X_C + f(6\,\text{hours}) \cdot X_B + f(6\,\text{hours}) \cdot f(3\,\text{hours}) \cdot X_A$. When $f(\Delta t_{i,i-a}) \cdot f(\Delta t_{i,i-b}) = f(\Delta t_{i,i-a} + \Delta t_{i,i-b})$, for example when exponential time decay is applied, $S_C = X_C + f(6\,\text{hours}) \cdot X_B + f(9\,\text{hours}) \cdot X_A$. That is, the state is a weighted linear combination of the previous states, where each contribution can be determined independently.
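The three-event example can be checked numerically. The sketch below uses scalar inputs in place of state vectors, and the 6-hour half-life, the input values, and the function names are assumptions chosen for illustration.

```python
def decay(hours, half_life_hours=6.0):
    """Exponential time decay f; satisfies f(a) * f(b) == f(a + b)."""
    return 0.5 ** (hours / half_life_hours)

def run_states(events):
    """Apply S_i = X_i + f(t_i - t_{i-1}) * S_{i-1} over (time, input) pairs."""
    states = []
    prev_t, prev_s = None, 0.0
    for t, x in events:
        s = x if prev_t is None else x + decay(t - prev_t) * prev_s
        states.append(s)
        prev_t, prev_s = t, s
    return states

# Events A, B, C at midnight, 3 a.m., and 9 a.m. with illustrative inputs.
x_a, x_b, x_c = 1.0, 2.0, 4.0
s_a, s_b, s_c = run_states([(0.0, x_a), (3.0, x_b), (9.0, x_c)])

# Closed form: each input decays only by the time elapsed since its own event.
direct = x_c + decay(6.0) * x_b + decay(9.0) * x_a
```

The recursive state `s_c` and the closed form `direct` agree up to floating-point rounding, which illustrates that the contribution of event A depends on the 9 hours elapsed and not on the presence of event B in between.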

[0107] In one case, the forget gate 626 can apply exponential time decay with a constant decay rate or decay factor (or, in some cases, several such rates or factors). These decay rates or factors can be set as hyperparameters of the recurrent neural network architecture 620 or can be learned. In some examples, exponential time decay ensures that the contribution of a past action to the state of the recurrent neural network architecture depends only on the time elapsed since that action. In other words, with this variant, the effect of the time elapsed between events attributable to an entity is independent of any other actions performed during that period. The use of time decay in the recurrent neural network architecture 620 also allows for long-term memory storage, where the duration of the long-term memory can be set, for example, by a half-life parameter of the exponential time decay. In this case, a function implementing a series of half-lives can be used, which encourages the encoding of action patterns and changes over different time periods.
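A series of half-lives can be sketched as an element-wise decay in which each element of the state has its own half-life. The specific half-lives below (hour, day, week, and roughly a month) and the function names are assumptions for illustration; a half-life h corresponds to a decay rate of ln(2) / h.

```python
# Illustrative half-lives in hours: one hour, one day, one week, ~one month.
HALF_LIVES_HOURS = [1.0, 24.0, 168.0, 720.0]

def decay_factors(delta_hours, half_lives=HALF_LIVES_HOURS):
    """One exponential decay factor per half-life for a given elapsed time."""
    return [0.5 ** (delta_hours / h) for h in half_lives]

def forget(state, delta_hours, half_lives=HALF_LIVES_HOURS):
    """Element-wise decay of a state vector, one half-life per element."""
    factors = decay_factors(delta_hours, half_lives)
    return [f * s for f, s in zip(factors, state)]

# After one day, the short-lived element has almost vanished while the
# month-scale element is nearly intact.
decayed = forget([1.0, 1.0, 1.0, 1.0], delta_hours=24.0)
```

Mixing short and long half-lives in one state vector lets different elements track activity over different time scales, which is one way to read "encoding of action patterns and changes over different time periods".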

[0108] In some cases, all time difference encodings can use fixed (i.e., retrieved or provided) parameter values (e.g., set as configuration data) instead of learned parameters. In this case, all transformations and / or aggregations of historical and current feature vectors that relate to time differences can be “imposed” on the architecture. This increases the controllability of the system and facilitates (or even guides) the training of the learnable parameters of the attention heads and / or multilayer perceptrons. For example, by setting exponential time decay values for different time periods, useful time difference encodings can be generated, which help the machine learning system described herein to learn features that map onto these different time periods. Furthermore, non-learnable parameters improve the speed of processing and training, making them particularly suitable for fast, high-volume transaction processing.

[0109] Second example configuration of a machine learning system

[0110] Figures 6A to 6C above illustrate one possible configuration for a machine learning system. A second, alternative configuration 700 is shown in Figure 7. The first and second configurations are not limiting, and some examples described herein can be performed using machine learning models with configurations different from the first and second configurations (e.g., other neural network architectures and / or Bayesian configurations). Like the first example configuration 600, the second example configuration 700 can be used to implement one or more of the machine learning systems 160, 210, 402 (where the configuration may include an implementation loaded by the machine learning model platform 414), and 508 of Figures 1A to 1C, 2A and 2B, 4, and 5A and 5B.

[0111] The second configuration 700 of Figure 7 is based on a random forest model. The random forest model is applied to an input feature vector 710 and includes multiple decision trees 720 applied in parallel. Each of the decision trees 722, 724, and 726 outputs a different classification value $C_i$ 730. Figure 7 shows three decision trees 722, 724, and 726 and three classification values 732, 734, and 736, but there can be N decision trees in total, where N is a configuration parameter that could be in the hundreds. The classification values 730 are passed to an ensemble processor 740, which combines the classification values 730 from the decision trees 720 to generate a final scalar output 750. The ensemble processor 740 can compute a weighted output of the decisions of the decision trees 720 and / or can apply a voting process.
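The two ways of combining tree outputs at the processor 740 (a weighted output or a voting process) can be sketched as follows. This is illustrative only: the function names, the three hand-written decision stumps standing in for trained trees, and the feature names are assumptions, not part of the patent.

```python
def weighted_output(tree_outputs, weights=None):
    """Weighted average of per-tree classification values C_i."""
    if weights is None:
        weights = [1.0] * len(tree_outputs)
    return sum(w * c for w, c in zip(weights, tree_outputs)) / sum(weights)

def vote_fraction(tree_outputs, threshold=0.5):
    """Fraction of trees voting 'anomalous' (output at or above threshold)."""
    votes = sum(1 for c in tree_outputs if c >= threshold)
    return votes / len(tree_outputs)

# Three hypothetical decision stumps standing in for trees 722, 724, and 726.
trees = [
    lambda x: 1.0 if x["amount"] > 1000 else 0.0,
    lambda x: 1.0 if x["hours_since_last"] < 0.01 else 0.0,
    lambda x: 1.0 if x["new_merchant"] else 0.0,
]
features = {"amount": 2500.0, "hours_since_last": 5.0, "new_merchant": True}
outputs = [tree(features) for tree in trees]

score = weighted_output(outputs)                        # equal weights
score_weighted = weighted_output(outputs, [1.0, 1.0, 2.0])
votes = vote_fraction(outputs)
```

Either result is a scalar in the interval from 0 to 1, which matches the form of the final scalar output 750.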

[0112] Training machine learning systems for processing transaction data

[0113] In some examples, machine learning systems, such as those described herein, can be trained using labeled training data. For instance, a training set can be provided that includes data associated with transactions labeled as “normal” or “fraudulent.” In one case, these labels can be provided based on reported fraudulent activity, i.e., using past fraud reports. For example, data associated with transactions that were approved and processed, without subsequent fraud reports, might be labeled “0” or “normal”, while data associated with rejected transactions and / or transactions later reported as fraudulent and otherwise flagged or blocked might be labeled “1” or “abnormal.” An example of this training method is described with reference to Figure 8.

[0114] In other examples described herein, improved pipelines for training machine learning systems are proposed. These improved pipelines allow machine learning systems, such as those described herein, to be trained to process transaction data. The training pipeline allows synthetic data samples to be generated for training. This is particularly useful when attempting to train a machine learning system to classify anomalies in transactions (e.g., to indicate fraud or other malicious activity), as examples of “unexpected” action patterns are typically rare, which makes a balanced set of real training labels hard to obtain. The improved pipeline for training machine learning systems involves adapting the feature vector generation process that precedes the machine learning system and adapting the training process. The adaptation of the feature vector generation process is described first in the following sections, followed by the training itself. An example of this training method is described with reference to Figures 9 to 13.

[0115] Pre-labeled training examples

[0116] Figure 8 shows an example 800 of training a machine learning system for transaction processing on a pre-labeled training set. This method can be used when a large amount of labeled training data is available. However, this method may have limitations, which are discussed below. Figures 9 to 13 describe an improved method of training for transaction processing.

[0117] In the example 800 of Figure 8, training data 810 is obtained. The training data 810 includes feature data 812, which may be feature values arranged as a numerical tensor, and labels 814, which in this case are either a value of "0", representing a transaction associated with a normal behavior pattern, or a value of "1", representing a transaction associated with an abnormal or fraudulent behavior pattern. In training mode, the feature data 812 is provided to the machine learning system 840, which outputs a classification 850, preferably including a scalar output. The machine learning system 840 is parameterized by a set of parameters 842. These parameters may be initialized to a set of random values. In use, a training engine 860 receives the output of the machine learning system 840 for a specific set of feature data and compares it with the label for that set of feature data. The training engine 860 may evaluate a loss function, such as a logistic loss function, which calculates the error between the prediction and the label. This error is then used to adjust the parameters 842.
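The loop of predicting, comparing with the label through a logistic loss, and adjusting the parameters can be sketched with a deliberately simple stand-in model. The sketch below uses logistic regression trained by stochastic gradient descent, which is an assumption made for brevity and is not the neural network or random forest configuration of the patent; the function names, toy data, learning rate, and epoch count are likewise illustrative.

```python
import math
import random

def predict(params, features):
    """Scalar output in (0, 1); params holds the weights followed by a bias."""
    z = params[-1] + sum(w * x for w, x in zip(params, features))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(p, label, eps=1e-12):
    """Error between a predicted probability and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def train(data, n_features, epochs=200, lr=0.5, seed=0):
    """Initialize parameters randomly, then adjust them from the loss gradient."""
    rng = random.Random(seed)
    params = [rng.uniform(-0.1, 0.1) for _ in range(n_features + 1)]
    for _ in range(epochs):
        for features, label in data:
            # For logistic loss, the gradient with respect to z is (p - label).
            error = predict(params, features) - label
            for k, x in enumerate(features):
                params[k] -= lr * error * x
            params[-1] -= lr * error
    return params

# Toy labeled set: (feature values, label), with 1 meaning "abnormal".
data = [([0.1, 0.0], 0), ([0.2, 0.1], 0), ([0.9, 1.0], 1), ([0.8, 0.9], 1)]
params = train(data, n_features=2)
```

After training, `predict(params, features)` returns a higher value for the samples labeled 1 than for those labeled 0, and the mean logistic loss over the toy set is lower than at the near-zero random initialization.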

[0118] Where the machine learning system 840 includes the first configuration 600 described with reference to Figures 6A to 6C, the parameters 842 may include the neural network weights (and in some cases biases, although these can be folded into the weights) associated with one or more of components 628, 668, 672, 674, 676, and 680. In this case, the feature data 812 may include both the current transaction data associated with a label and the state data, time difference data, and historical data used to generate inputs for the recurrent neural network architecture 620 and the attention neural network architecture 660.

[0119] In some cases, the training and / or validation of the machine learning system 840 can be performed using externally labeled data. This can be a data feed that is difficult (but possible) to obtain, such as transaction "chargeback" information, which indicates which transactions have been refunded (as this could indicate fraud or irregularities). In some cases, correlation operations can be performed to associate the chargeback information with transactions disputed by the user, since a refund might be due to a defective product rather than fraud. Feedback on the output allows the machine learning system to operate adaptively and to self-correct the model over time.

[0120] Feature vector generation

[0121] Figure 9 shows an example of a machine learning system 900 that has been adapted to allow training based on unlabeled data. In Figure 9, the machine learning system 900 generates a feature vector based on observable and contextual features. The machine learning system 900 includes an observable feature generator 910 and a context feature generator 920. Each is configured to generate a different portion of a feature vector 930, which is provided as input to a trained binary classifier 940. The binary classifier 940 is trained to map the input feature vector 930 to a scalar output 950. The scalar output 950 indicates the presence of an anomaly. For example, the binary classifier 940 can be trained on data with two assignable labels. In some cases, these labels might be "0" and "1," where "0" indicates no anomaly and "1" indicates the presence of an anomaly. In this case, the binary classifier 940 can output a value between 0 and 1, representing the probability that an anomaly is present. The binary classifier 940 is parameterized by a set of parameters 942. For example, these parameters might include weights of a neural network architecture and / or branch weights of a decision tree or random forest model. These parameters can be "learned" using the training methods discussed in more detail below. The scalar output 950 of the binary classifier 940 can be used to process current or proposed transactions.

[0122] In Figure 9, the observable feature generator 910 receives transaction data 912 and 914 and uses it to generate an observable feature vector 916. The context feature generator 920 receives auxiliary data 922 and historical transaction data 924 and uses them to generate a context feature vector 926. The observable feature vector 916 and the context feature vector 926 are then combined to generate the overall feature vector 930. In one case, the observable feature vector 916 and the context feature vector 926 can be combined by concatenating the two feature vectors 916 and 926 to generate a longer vector. In other cases, combinational logic and / or one or more neural network layers can be used to receive the observable feature vector 916 and the context feature vector 926 as input and map this input to the feature vector 930.

[0123] In Figure 9, the observable feature generator 910 receives two types of transaction data. A first portion of transaction data 912 includes data related to the specific transaction being classified. For example, this may be the current or proposed transaction processed in the flows of Figures 5A and 5B. The first portion of transaction data 912 may include data derived from data packets received with the request to process the proposed transaction. A second portion of transaction data 914 includes data related to transactions within a group defined based on the proposed transaction. For example, these may be transactions within a defined time window. The window can be defined as an absolute predefined time range set with reference to the timestamp of the proposed transaction, or as a relative range defined by a discrete number of transactions (e.g., the last X transactions of a specific user associated with the proposed transaction). The observable feature generator 910 can therefore be described as generating the observable feature vector 916 based on recently observed data, which includes at least data derived from the proposed transaction. In contrast, the context feature generator 920 generates the context feature vector 926 based on one or more of: transaction data outside the time window, and retrieved data related to a uniquely identifiable entity for the proposed transaction. For example, Figure 9 shows the context feature generator 920 receiving auxiliary data 922 and historical transaction data 924. The uniquely identifiable entity can be a specific end user (e.g., a cardholder) or a merchant, and the auxiliary data 922 can be data retrieved from records associated with that entity (e.g., so-called static data that is independent of the transaction data representing past transactions). The auxiliary data 922 can include auxiliary data 148 or 242 as previously described. The historical transaction data 924 can include data associated with transactions outside the time window, such as data derived from transactions that fall outside the aforementioned predefined time range set with reference to the timestamp of the proposed transaction, or outside the relative range. The context feature generator 920 can be configured to compute an aggregate metric across the historical transaction data 924 (or retrieve a pre-computed aggregate metric) and then include the aggregate metric in the context feature vector 926. The aggregate metric can be a simple statistical metric or a more advanced feature extracted by a neural network.
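The split between recently observed data and historical data can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the patented implementation: the record layout (`timestamp`, `amount`), the 24-hour window, the helper names, and the choice of sum and median as aggregate metrics are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical transaction record with "timestamp" (seconds) and "amount" keys.
Transaction = Dict[str, float]


def split_by_window(history: List[Transaction], proposed_ts: float,
                    window_seconds: float) -> Tuple[List[Transaction], List[Transaction]]:
    """Split past transactions into those inside and those before the time window."""
    start = proposed_ts - window_seconds
    recent = [t for t in history if start <= t["timestamp"] <= proposed_ts]
    older = [t for t in history if t["timestamp"] < start]
    return recent, older


def build_feature_vector(proposed: Transaction, history: List[Transaction],
                         window_seconds: float = 86400.0) -> List[float]:
    """Concatenate observable features (in-window) with context features (out-of-window)."""
    recent, older = split_by_window(history, proposed["timestamp"], window_seconds)
    # Observable part: proposed amount and total spent inside the window.
    observable = [proposed["amount"],
                  proposed["amount"] + sum(t["amount"] for t in recent)]
    # Context part: aggregate metrics over transactions outside the window.
    older_amounts = [t["amount"] for t in older]
    context = [sum(older_amounts), median(older_amounts) if older_amounts else 0.0]
    return observable + context


history = [
    {"timestamp": 0.0, "amount": 10.0},
    {"timestamp": 1000.0, "amount": 30.0},
    {"timestamp": 99000.0, "amount": 5.0},
]
proposed = {"timestamp": 100000.0, "amount": 20.0}
feature_vector = build_feature_vector(proposed, history)
```

Here the last transaction falls inside the window and contributes to the observable part, while the two older transactions only contribute to the context aggregates.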

[0124] Pipelines for training machine learning systems

[0125] Figure 10 shows a pipeline 1000 for training a machine learning system based on the feature generation process described with reference to Figure 9. Pipeline 1000 operates on a training set 1010 of data samples 1012. In this example, the training set 1010 includes unlabeled data, meaning that the data samples 1012 do not have labels (i.e., assigned data values) indicating whether they relate to normal or anomalous behavior. In other cases, at least a portion of the training set may be labeled, but this could be a small fraction of the available data. The data samples may be feature vectors similar to the feature vector 930 shown in Figure 9. The data samples may include feature vectors generated by the arrangement of Figure 9 and/or feature vectors that were not generated by the arrangement of Figure 9 but that represent historical input data for the form of binary classifier of the machine learning system being trained.

[0126] Pipeline 1000 begins with a data partitioning stage 1020. This stage operates on the training set 1010 and partitions the data of the data samples 1012 to generate partitioned data 1030. Specifically, the data partitioning stage 1020 divides each data sample into two feature sets: a first feature set 1032 representing observable features and a second feature set 1034 representing contextual features. The first and second feature sets 1032 and 1034 can correspond to the observable feature vector 916 and the context feature vector 926 described with reference to Figure 9. If the data samples 1012 are feature vectors 930 generated by the preprocessing of Figure 9, the data partitioning stage 1020 may include partitioning the concatenated feature vector (e.g., splitting the feature vector into two predefined sets of elements). If the data samples 1012 include data prior to preprocessing, such as input data 912, 914, 922, and/or 924, the data partitioning stage 1020 may partition the data based on the data source and/or time data. In one case, the data partitioning stage 1020 may operate in a manner similar to the observable feature generator 910 and the context feature generator 920.

[0127] The first feature set 1032 and the second feature set 1034 can have properties similar to the observable feature vector 916 and the context feature vector 926, respectively, as described with reference to Figure 9. For example, observable features can be derived at least in part from a time window of transaction data defined relative to the transaction associated with each data sample. Contextual features can be derived at least in part from one or more of: transaction data outside the time window, and retrieved data associated with a uniquely identifiable entity for the transaction related to the data sample. In Figure 10, there are n pairs of feature vectors O_i and C_i, where i denotes the i-th data sample, O denotes observable features, and C denotes contextual features.

[0128] Following the data partitioning stage 1020, the partitioned data samples 1030 are passed to a synthetic data generation stage 1040. During the synthetic data generation stage 1040, a set of synthetic data samples 1050 is generated. The synthetic data samples 1050 are generated by combining features from the two feature sets that are respectively associated with two different entities in the set of uniquely identifiable entities. For example, this can be performed by adding the second feature set 1034 to a group indexed by entity identifier, then iterating through the first feature set 1032 and randomly selecting a paired portion from that group whose entity identifier does not match the entity identifier of the current portion of the first feature set. The synthetic data samples 1050 therefore include mixed pairs 1052, 1054 from the first and second feature sets 1032, 1034.
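The mismatched pairing described for the synthetic data generation stage can be sketched as follows. This is a minimal illustration, assuming each partitioned sample is held as a hypothetical `(entity_id, observable, context)` tuple; the function name and the simple linear scan for candidates are choices made for readability rather than anything specified in the text.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical layout of a partitioned sample: (entity_id, observable, context).
Sample = Tuple[str, List[float], List[float]]


def make_synthetic(samples: Sequence[Sample],
                   rng: random.Random) -> List[Tuple[List[float], List[float]]]:
    """Pair each observable part with a context part drawn from a different entity."""
    synthetic = []
    for entity_id, observable, _ in samples:
        # Candidate context parts belong to any entity other than the current one.
        candidates = [ctx for other_id, _, ctx in samples if other_id != entity_id]
        if not candidates:
            continue  # only one entity present, so no mismatch can be formed
        synthetic.append((observable, rng.choice(candidates)))
    return synthetic


samples = [
    ("user_1", [1.0], [10.0]),
    ("user_2", [2.0], [20.0]),
    ("user_3", [3.0], [30.0]),
]
synthetic = make_synthetic(samples, random.Random(0))
```

Each synthetic pair keeps the original observable part but carries a context part from another entity, which is what makes it usable as a positive (anomalous) example.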

[0129] Following the synthetic data generation stage 1040, the original partitioned data 1030 and the synthetic data samples 1050 are passed to a data labeling stage 1060. During the data labeling stage 1060, the original partitioned data 1030 is assigned a label indicating the absence of an anomaly. In this example, that label is the value "0". The synthetic data samples 1050 are then assigned a label indicating the presence of an anomaly. In this example, that label is the value "1". The two sets of labeled data samples are then combined to form an augmented training set 1070 containing labeled data. As shown in Figure 10, the augmented training set 1070 includes pairs of observable features 1072 and contextual features 1074, represented, for example, by individual feature vectors or sets of vector elements, together with assigned labels 1076 (which are "0" or "1"). The augmented training set 1070 can then be used to train a binary classifier that implements the machine learning system of the previous examples. For example, the augmented training set 1070 can be used to determine the set of parameters 942 of the binary classifier 940 shown in Figure 9. When the binary classifier 940 includes a neural network architecture, predictions from the binary classifier 940 during training, in the form of the scalar output 950, can be compared with the assigned labels 1076, for example in a loss function, and an error based on the difference between the scalar output 950 and the label value of 0 or 1 can be backpropagated through the neural network architecture. In this case, the derivative of the loss function with respect to the weights of the neural network architecture can be determined and used to update those weights, for example using gradient descent or one of its variants. This may be used, for example, with the neural network architectures shown in Figures 6A to 6C. Where the binary classifier 940 includes a random forest classifier, for example as shown in Figure 7, the augmented training set can be passed to a model fitting function as a set of data samples (a concatenation of parts 1072 and 1074) and a set of corresponding labels 1076 (e.g., a call such as random_forest.fit(X, y) as defined in machine learning programming libraries).
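The loss comparison and weight update described above can be illustrated with the simplest differentiable binary classifier, a single sigmoid unit. This sketch stands in for, and is much simpler than, the architectures of Figures 6A to 6C; the binary cross-entropy loss, the learning rate, the epoch count, and the toy one-feature data are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Scalar output in (0, 1); the last weight is the bias."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def train(X: Sequence[Sequence[float]], y: Sequence[int],
          lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Batch gradient descent on binary cross-entropy for a single sigmoid unit."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi  # derivative of the loss w.r.t. the pre-activation
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        # Move each weight against its averaged gradient.
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Toy augmented set: label 0 for original samples, label 1 for synthetic samples.
X = [[0.1], [0.2], [0.9], [0.8]]
y = [0, 0, 1, 1]
weights = train(X, y)
low_score = predict(weights, [0.1])
high_score = predict(weights, [0.9])
```

With a real multi-layer network the same error signal would be propagated back through every layer; a library optimizer would normally replace the hand-written update.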

[0130] The present training pipeline therefore offers a solution to the technical problem of training anomaly classifiers on unlabeled transaction data. Instead of turning to unsupervised classifiers, which are typically inaccurate and require extensive calibration, this example modifies the feature generation process so that the problem can be recast as a supervised learning problem. This in turn enables the use of more robust supervised learning models that can identify anomalies in transaction data that more closely align with deviations from expected behavior. The example divides the training data into two sets of features: a first set based on so-called observable features, i.e., data obtained from or computed using the transaction being classified; and a second set based on so-called contextual features, i.e., data not obtained from or computed using the transaction being classified, such as data derived from historical transaction data and/or lookup data related to a specific entity associated with the transaction being classified. Furthermore, collapsing the transaction processing into a single binary classification makes the approach effective while still allowing actionable outputs, such as a decision on whether to approve or reject the transaction being classified.

[0131] Figures 11A and 11B show a variation of the example of Figures 3A and 3B in relation to the present training process. Figure 11A schematically illustrates a set of transaction data 1100 for a uniquely identifiable entity. For example, this data may include transactions recorded over time for an end user (e.g., a cardholder or a specific payment account). The data may also include transactions associated with a specific merchant. The transaction data 1100 may include records of approved transactions. Figure 11A also shows data for a proposed transaction 1102. This may be data for a proposed transaction that has been received but not yet approved or rejected. If the transaction is approved, the data for the proposed transaction 1102 can be added to the transaction data 1100.

[0132] Figure 11A also illustrates how observable and contextual features can be defined relative to the transaction data 1100. If each vertical segment represents data from a different transaction, observable features can be defined as features calculated from a first set of transaction data 1110, and contextual features can be defined as features calculated from a second set of transaction data 1120. The first set of transaction data 1110 includes at least the data for the proposed transaction 1102 and can be considered a raw observation of the actions or behavior defined by the proposed transaction. Observable features therefore relate to “observations” of an entity’s current or recent behavior, as represented by its actions recorded in the transaction data 1110. The second set of transaction data 1120 does not include the proposed transaction 1102 and can be used to calculate historical metrics that are comparable with the data for the proposed transaction 1102. Contextual features can include predictions of current or recent behavior based on the actions or behavior represented in the second set of transaction data 1120.

[0133] Figure 11B shows an example feature vector 1140. The feature vector 1140 may be a data sample 1012 from Figure 10 or the feature vector 930 from Figure 9. Figure 11B shows the feature vector 1140 after a first set of preprocessing steps that convert the input data into numerical values. Further preprocessing based on defined ranges or neural network mappings can also be performed to convert the displayed numerical values into normalized values (e.g., floating-point values between 0 and 1 or between -1 and 1). The feature vector 1140 includes a number of vector elements associated with different data inputs. For example, the feature vector 1140 includes: the entry "amount," which defines the amount of the proposed transaction (e.g., in local currency); the entry "aggregate_amount," which defines the total amount spent within the first set of transaction data 1110; the entry "merchant_id," which is a unique identifier for the merchant in the proposed transaction; the entry "merchant_likelihood," which indicates the likelihood that the user entity making the proposed transaction uses the merchant defined by the merchant identifier; the entry "total_amount," which represents an aggregate metric (e.g., sum) of the transaction amounts in the second set of transaction data 1120; the entry "primary_account," which is a Boolean value indicating whether the transaction is associated with the current user entity's primary account; the entry "me_amount," which represents another aggregate metric (e.g., median) of the transaction amounts in the second set of transaction data 1120; and the entry "country_id," which represents the current user entity's country of origin. In this case, the entries "amount," "aggregate_amount," and "merchant_id" can represent observable features, while the remaining entries can represent contextual features.
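Splitting such a feature vector into its observable and contextual parts can be sketched by entry name. The entry names below come from the description of feature vector 1140; the numerical values, the dictionary representation, and the `partition` helper are hypothetical and shown only for illustration.

```python
from typing import Dict, Tuple

# Entries the text identifies as observable; all other entries are contextual.
OBSERVABLE_KEYS = ("amount", "aggregate_amount", "merchant_id")


def partition(feature_vector: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split a named feature vector into observable and contextual parts."""
    observable = {k: v for k, v in feature_vector.items() if k in OBSERVABLE_KEYS}
    context = {k: v for k, v in feature_vector.items() if k not in OBSERVABLE_KEYS}
    return observable, context


# Hypothetical values for the entries named in the text.
feature_vector_1140 = {
    "amount": 5.5,
    "aggregate_amount": 15.0,
    "merchant_id": 12.0,
    "merchant_likelihood": 0.2,
    "total_amount": 3400.0,
    "primary_account": 1.0,
    "me_amount": 18.0,
    "country_id": 44.0,
}
observable_part, context_part = partition(feature_vector_1140)
```

When the vector is stored as a flat array rather than a dictionary, the same split would be expressed as two predefined sets of element indices.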

[0134] Methods for training machine learning systems

[0135] Figure 12 shows an example method 1200 for training a supervised machine learning system to detect anomalies in transaction data. The method 1200 can be used to implement, for example, the pipeline 1000 shown in Figure 10. At box 1202, the method 1200 includes acquiring a training set of data samples. The data samples may be, for example, the data samples 1012 of Figure 10. Each data sample is at least partially derived from transaction data and is associated with one of a set of uniquely identifiable entities. For example, a data sample may have one or more unique identifiers associated with a user or merchant, or may be retrieved based on such identifiers. In this method, at least a portion of the training set is unlabeled. For example, the training set may include feature vectors, such as the feature vector 1140 shown in Figure 11B, that do not have an assigned anomaly label.

[0136] At box 1204, the method 1200 includes assigning a label indicating the absence of an anomaly to the unlabeled data samples in the training set. For example, such a label could be the "0" label assigned as part of the labels 1076 shown in Figure 10. The label can be a numerical or binary value. Since the data samples are derived from transaction data, such as at least the transaction data 1100 shown in Figure 11A, they relate to transactions that have already been processed and are therefore assumed to represent “normal” behavior, i.e., data that is expected during transaction processing.

[0137] At box 1206, the method includes partitioning the data samples in the training set into two feature sets: a first feature set representing observable features and a second feature set representing contextual features. The observable features are derived from a function of at least the transaction data of the current transaction, for example a function of data that includes at least the proposed transaction 1102 in Figure 11A. The contextual features are derived from one or more of: a function of historical transaction data that excludes the current transaction, and retrieved data associated with a uniquely identifiable entity for the current transaction. For example, the contextual features can be derived from functions of data similar to data 922 and 924 in Figure 9. Example partitioned features 1030 are shown in Figure 10.

[0138] At box 1208, synthetic data samples are generated by combining features from the two feature sets that are respectively associated with two different entities in the set of uniquely identifiable entities. For example, each data sample may be associated with a particular uniquely identifiable entity either directly (e.g., by a unique identifier for the entity that is present in the data sample) or indirectly (e.g., by a specific position in a data matrix), so the context portion of the data sample can be exchanged with the context portion of another data sample that is not directly or indirectly associated with that entity.

[0139] At box 1210, the method 1200 includes assigning a label indicating the presence of an anomaly to the synthetic data samples. For example, this may be the value "1" shown in the labels 1076 of Figure 10. Since the synthetic data samples are generated from a mismatch between observable features and contextual features, they can be treated as positive examples of anomalies for training. At box 1212, the original training set obtained at box 1202 is augmented with the synthetic data samples. This can include concatenating the synthetic data samples to the bottom of a data matrix whose rows represent the data samples obtained at box 1202, and concatenating the “1” labels to the list of “0” labels. Box 1212 may also include shuffling the data samples after augmentation, while preserving the correspondence between the data samples and the assigned labels.
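The augmentation and shuffling at box 1212 can be sketched as below. This is a minimal illustration, assuming each data sample is a plain list of numbers forming one row of the data matrix; the function name and the index-permutation approach to shuffling are illustrative choices rather than details taken from the text.

```python
import random
from typing import List, Sequence, Tuple


def augment(original: Sequence[List[float]], synthetic: Sequence[List[float]],
            rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    """Append synthetic rows below the original rows, label them, then shuffle together."""
    rows = list(original) + list(synthetic)
    labels = [0] * len(original) + [1] * len(synthetic)
    # Shuffle one shared index permutation so each row keeps its own label.
    order = list(range(len(rows)))
    rng.shuffle(order)
    return [rows[i] for i in order], [labels[i] for i in order]


original_rows = [[1.0, 10.0], [2.0, 20.0]]    # observable + matching context
synthetic_rows = [[1.0, 20.0], [2.0, 10.0]]   # observable + mismatched context
X, y = augment(original_rows, synthetic_rows, random.Random(0))
```

The resulting `X` and `y` have the shape expected by typical model fitting functions that take a data matrix and a label list.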

[0140] Finally, at box 1214, the method 1200 includes training a supervised machine learning system with the augmented training set and the assigned labels. This may include applying a known training process with the augmented training set and assigned labels as training data. The supervised machine learning system may be an implementation of one or more of the machine learning systems 160, 210, 402, 508, 600, 700, or 940. The supervised machine learning system may include one or more of an ensemble system based on a set of decision trees and a recurrent neural network, for example as described in the preceding examples. After training, for example after a set of learned parameter values has been determined, the trained supervised machine learning system is configured to use the learned parameter values to output a value indicating the presence of an anomaly when provided with a new data sample.

[0141] In one case, the observable features are derived from a function of transaction data within a predefined time window of the current transaction. For example, the window may be a defined time range (e.g., within 24 hours) and/or a defined number of transactions (e.g., the last 3 actions). The predefined time window may, for example, correspond to the first set of transaction data 1110 shown in Figure 11A. The contextual features are then derived from transaction data outside the predefined time window. For example, they could be derived from the second set of transaction data 1120 shown in Figure 11A.

[0142] One or more of the observable features may include aggregate metrics calculated from transaction data over a predefined time period defined relative to the time of the current transaction. For example, these observable features may include statistical metrics (e.g., mean, mode, or median) or functions learned by a neural network mapping. The predefined time period may, for instance, cover a set of transaction data such as the first set of transaction data 1110 shown in Figure 11A. In this case, box 1202 may include, for a given data sample: obtaining transaction data of the current transaction, which includes an identifier for a uniquely identifiable entity; and obtaining lookup data for the uniquely identifiable entity. For example, the data sample may include a unique identifier for a user, account holder, or merchant. The obtained lookup data may include, for example, auxiliary data such as 148 in Figures 1A to 1C or 242 in Figures 2A and 2B. The unique identifier can be used to retrieve auxiliary data (e.g., metadata) indexed by the identifier, i.e., data belonging to the identified user, account holder, or merchant. Transaction data from the current transaction is used to derive the first feature set, and the retrieved lookup data is used to derive the second feature set. Box 1202 may also include retrieving transaction data for the uniquely identifiable entity for a time window and calculating one or more aggregate metrics from that transaction data. For example, the unique identifier can also be used to retrieve transaction data similar to the first set of transaction data 1110 shown in Figure 11A, from which one or more aggregate metrics can be calculated to derive the first feature set. In some cases, historical transaction data for the uniquely identifiable entity can also be obtained. This historical transaction data may include transaction data outside the time window, for example the second set of transaction data 1120 in Figure 11A. One or more aggregate metrics can be calculated from the historical transaction data and used to derive the second feature set.

[0143] As described with reference to Figure 9 and other examples, the supervised machine learning system trained by method 1200 may include a binary classifier that outputs a value within a predefined range representing the likelihood of an anomaly. The labels indicating the absence or presence of an anomaly may then be two numerical values, such as 0 and 1 or -1 and 1, so that a numerical loss can be computed for training.

[0144] In some examples, box 1214 can be executed periodically to retrain the machine learning system. In this case, the data augmentation of method 1200 can be applied to the incremental data that has become available since the last execution of box 1214. The labeled training data generated by repeating boxes 1202 to 1212 on the new data can then be concatenated with the previously available data to increase the size of the training set.

[0145] Example methods for detecting anomalies

[0146] Figure 13 shows an example method 1300 for detecting anomalies in transaction data. The example method 1300 can be combined with the example transaction processing flows 500 and 550 of Figures 5A and 5B, suitably adapted to apply a supervised machine learning system trained as explained with reference to Figures 9 to 12.

[0147] At box 1302, the method 1300 includes receiving transaction data for anomaly detection. This may be an operation similar to box 522 in Figure 5A. The transaction data may relate to a proposed transaction, for example as described with reference to boxes 512 to 518 in Figures 5A and 5B. The transaction data can be received, for example via an internal function or an external RESTful interface, as a data packet accompanying an API request. At box 1304, a first set of features is generated based on the received transaction data. This can be performed as part of one or more of boxes 522 or 524 in Figures 5A and 5B. The box may include applying the observable feature generator 910 to generate the observable feature vector 916, as described with reference to Figure 9. The first set of features can be configured as a vector of numerical values that are calculated based at least on the data packet accompanying the API request (i.e., the data of the proposed transaction).

[0148] At box 1306, the method 1300 includes identifying a uniquely identifiable entity associated with the received transaction data. For example, this may include parsing the API request and extracting an identifier for one or more of the user and the merchant of the proposed transaction. At box 1308, the method 1300 includes acquiring auxiliary data associated with the uniquely identifiable entity. This may include acquiring, for example, one or more of the auxiliary data 148, 242, and 922 from Figures 1A to 1C, 2A to 2B, and 9. The auxiliary data may also include historical transaction data from one or more of the transaction data 146, 240, and 924 of Figures 1A to 1C, 2A to 2B, and 9, or data calculated from such historical transaction data. If an identifier is extracted, that identifier can be used to locate and/or filter the historical transaction data for the uniquely identifiable entity, for example to retrieve transaction data associated with a specific user account.

[0149] At box 1310, a second set of features is generated based on the acquired auxiliary data. This box may include applying the context feature generator 920 to generate the context feature vector 926, as described with reference to Figure 9. The second set of features can be represented numerically as a vector. In one case, historical transaction data can be processed to determine one or more values in the vector representation. For example, one or more statistical metrics can be computed from the historical transaction data, and/or the historical transaction data can be processed by one or more neural network architectures to extract features based on a set of trained parameters. These parameters can be trained with the parameters of the machine learning system described below. In some cases, box 1310 may include acquiring historical transaction data for the uniquely identifiable entity, where the historical transaction data includes transaction data outside a time window defined relative to the received transaction data. In this case, box 1310 may include computing one or more aggregate metrics from the historical transaction data and generating the second feature set from at least the one or more aggregate metrics.

[0150] At box 1312, input data for the supervised machine learning system is generated based on the generated first and second feature sets. The supervised machine learning system may be an implementation of one or more of the machine learning systems 160, 210, 402, 600, 700, and 940 described above. The supervised machine learning system is trained on a training set containing historical data samples and synthetic data samples. The synthetic data samples are generated by combining features from the first and second feature sets of the historical data samples, where the combined feature sets are respectively associated with two different entities in the set of uniquely identifiable entities, such as two different users or merchants. For example, the training set may be the augmented training set 1070 shown in Figure 10, which includes the synthetic data samples 1050 and the historical data samples from the original training set 1010. The synthetic data samples are assigned a label indicating the presence of an anomaly for training (e.g., the "1" in the labels 1076 of Figure 10), and unlabeled data samples among the historical data samples are assigned a label indicating the absence of an anomaly for training (e.g., the "0" in the labels 1076 of Figure 10). Training may include applying an available model fitting function from a machine learning library of computer program code. Where the supervised machine learning system contains a neural network architecture, training may include applying gradient descent with backpropagation, using a loss function based on the difference between the prediction output by the supervised machine learning system and the assigned label. Training may be performed during a configuration phase prior to applying method 1300. For some machine learning systems, training may be applied online as more data is generated (e.g., as more transactions are approved or rejected, as shown in Figures 5A and 5B). Depending on the chosen supervised machine learning system, offline and online training methods can be used. Generating the input data may include generating an input feature vector similar to the input feature vector 930 shown in Figure 9. In some cases, preprocessing can be applied to generate the input data, including converting text and categorical data into numerical equivalents (if this has not already been performed), normalizing the data, reducing dimensionality, and so on.

[0151] At box 1314, the supervised machine learning system is applied to the input data. This box may include a forward pass of the supervised machine learning system, sometimes referred to as the "inference" step. The supervised machine learning system includes a binary classifier configured to output a value indicating the presence of an anomaly. For example, the output value may be a scalar value such as the output 416 in Figure 4 or the output 950 in Figure 9. This value can be normalized to the range 0 to 1 (e.g., using a sigmoid nonlinearity). Box 1314 may be performed as described with reference to box 524 in Figures 5A and 5B.

[0152] At box 1316, the received transaction data is selectively labeled based on the value output by the supervised machine learning system. This box can be performed by the supervised machine learning system (e.g., as described with reference to Figure 4) and/or by a separate computing device based on that output (e.g., the payment processor system 506 at boxes 528 or 552 in Figures 5A and 5B). The labeling may include sending a response to the original API request that contains the scalar value. It may also include applying one or more custom post-processing calculations, such as applying a threshold to output a binary label of "anomalous" or "not anomalous".

[0153] In some cases, box 1316 may include approving or rejecting the transaction based on the output of the supervised machine learning system. This box may include generating control data that controls whether at least one transaction in the transaction data is accepted or rejected, based on the value output by the supervised machine learning system. For example, in a simple case, a threshold may be applied to the output of the supervised machine learning system: values above the threshold (indicating an "anomalous" action) may be rejected and values below the threshold (indicating a "normal" action) may be approved, with an appropriate decision being made for values equal to the threshold. In some cases, for example as shown in Figures 1A to 1C and Figures 5A to 5B, the transaction data can be received from a point-of-sale device associated with the transaction to be approved. In these cases, box 1316 may include approving the transaction in response to the value output by the supervised machine learning system falling below a predefined threshold.
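The simple threshold case can be sketched as follows. The threshold value of 0.5, the `approve_on_tie` flag for scores equal to the threshold, and the string return values are assumptions made for illustration; the text leaves the tie-breaking decision open.

```python
def decide(score: float, threshold: float = 0.5, approve_on_tie: bool = False) -> str:
    """Map the classifier's scalar output to an approve or reject decision."""
    if score > threshold:
        return "reject"    # above the threshold: treated as anomalous
    if score < threshold:
        return "approve"   # below the threshold: treated as normal
    # Score equals the threshold: the policy is a configuration choice.
    return "approve" if approve_on_tie else "reject"


decisions = [decide(0.9), decide(0.1), decide(0.5), decide(0.5, approve_on_tie=True)]
```

In a deployment, the threshold would typically be tuned against the false positive rate that the payment processor can tolerate.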

[0154] In some examples, box 1316 includes selectively labeling a uniquely identifiable entity associated with the received transaction data based on the value output by the supervised machine learning system. For example, this could include labeling a user or merchant as fraudulent.

[0155] The examples of the second aspect, described with reference to Figures 9 to 13, address the technical problem of training machine learning systems to indicate whether a transaction is associated with expected (i.e., normal) behavior or unexpected (i.e., anomalous) behavior, where behavior refers to a set of actions taken with respect to one or more electronic systems over time. These examples address a problem in transaction processing where a data repository exists but typically holds unlabeled historical data, and where unexpected behavior is rare by its very nature (e.g., less than 10% of cases are “unexpected”). Machine learning engineers therefore cannot use many traditional tools, because these are designed on the assumption of a Gaussian (“normal”) distribution, while anomalies are typically characterized by a power-law distribution. Machine learning engineers further face a lack of labeled data with which to train machine learning systems. These problems are not widely recognized in the machine learning field, and machine learning engineers often either conclude that no solution exists or attempt to apply unsupervised learning methods that do not produce results accurate enough for production systems.

[0156] In contrast, the examples of the second aspect address the problem of identifying a large class of anomalous transactions where labeling is poor. A machine learning system is proposed that identifies transactions where the observed (i.e., measured) interactions with a computing system differ from the interactions expected on the basis of past interactions or auxiliary information. In these examples, a training method is proposed that recasts the problem as a supervised learning problem by augmenting the available data. This allows the use of powerful supervised learning algorithms that can identify anomalies matching deviations from expected interactions more closely than comparative machine learning systems. In practice, the proposed examples enable a supervised machine learning system to be trained to flag transactions as fraudulent more accurately. An implementation of these examples has been successfully tested in a production environment where transactions need to be processed at high volume (1000-2000 transactions per second) with sub-second latency. The machine learning systems described herein are specifically designed to output a simple scalar value. This not only allows a binary approval decision based on the output, but also allows the machine learning systems to be trained as described herein and configured for large-scale production implementations. The scalar value is small but highly informative, because it effectively collapses a large amount of input information (e.g., high-dimensional input vectors) into a single point of information. In testing, this scalar value was used for transaction approval, avoiding false positives that could halt production systems while also identifying true positives consistent with manually flagged anomalies. This is of great benefit in preventing large-scale illegal abuse of electronic payment systems.

[0157] There are many comparative methods for detecting anomalies in datasets. These include unsupervised outlier detection, the synthetic minority oversampling technique (SMOTE), the Snorkel system, semi-supervised learning systems, and active learning systems.

[0158] In unsupervised outlier detection, features are generated that are considered informative for quantifying deviations from expected behavior. These features are then fed into an anomaly detection system configured to identify feature values that are outliers relative to the overall data distribution of those features. Methods for performing unsupervised outlier detection include tree-based isolation forests, generative adversarial networks, and variational autoencoders. The paper "Variational Autoencoder-based Anomaly Detection using Reconstruction Probability" by An and Cho (SNU Data Mining Center – 2015-2 IE Special Lecture), incorporated herein by reference, describes an anomaly detection method using the reconstruction probability of a variational autoencoder. However, these methods are complex and difficult to reconcile with the constraints of transaction processing.

[0159] SMOTE, described in the paper of the same name by Chawla et al., published in the Journal of Artificial Intelligence Research 16 (2002) 321-357 (incorporated herein by reference), is a method for oversampling a minority (abnormal) class and undersampling a majority (normal) class from a data sample. However, this method relies excessively on data samples labeled as "abnormal," leading to accuracy issues. The Snorkel system, described by Ratner et al. in their paper "Snorkel: Rapid Training Data Creation with Weak Supervision" (published on arXiv on November 28, 2017, incorporated herein by reference), allows users to train machine learning models without manually labeling data. It is primarily designed for text-based medical data and attempts to incorporate weakly supervised sources by applying user-assigned labels to a decomposed contextual hierarchy. However, the Snorkel system is not well-suited to the data formats used in transaction processing systems, and its probabilistic approach to generating labels introduces potential accuracy problems for production transaction processing systems. It also requires a source for the "expert" labels.

[0160] Semi-supervised learning methods deal with situations where there may be some labeled training data, but also a batch of unlabeled data. Most semi-supervised methods (implicitly or explicitly) infer possible labels for unlabeled data points, allowing them to be used for training. However, semi-supervised methods, like unsupervised methods, tend to be more complex and more sensitive to hyperparameter configuration. They also rely on a small set of high-precision labels. Active learning methods use oracles (e.g., human experts) to label a subset of the data, and then use that labeled subset to apply labels to a broader unlabeled dataset. However, oracles have limited capacity, so it is important to choose carefully which data points to send to them. Active learning therefore focuses primarily on selecting the unlabeled data points to send to the oracles so as to learn as quickly as possible.

[0161] Therefore, while these methods attempt to address the label sparsity problem in general settings, they are often unsuitable for the context of transaction processing and further fail to help solve the problem of detecting unexpected action patterns in the transaction data.

[0162] Some examples of training machine learning systems described herein rely on the availability of large, unlabeled datasets covering previously processed transactions, where each part of the data (e.g., the transaction data described with reference to Figures 1A to 1C) is associated (e.g., indexed) with a uniquely identifiable entity (e.g., a cardholder or merchant). The examples described herein partition the input data into tuples of two feature sets {C, O}, where “C” represents a set of “contextual” machine learning features and “O” represents a set of “observable” machine learning features. A tuple may relate to a specific action taken by the entity in relation to an electronic payment processing system (e.g., proposing a transaction). In other cases, it may relate to actions over a cumulative period (e.g., activity over a week). In these examples, observable features are computed from raw observations of such actions or from stateful aggregates over time periods containing those actions (e.g., all actions initiated within a 24-hour period). An action here might relate to submitting a payment request to the payment processing system. Observable features can be viewed as observations of an entity’s current or recent behavior, expressed through its actions. Contextual features can be computed from a broader auxiliary dataset that can be used to predict the observable features. Contextual features can depend on the available data sources and/or the configuration of the machine learning system that receives the input data for prediction.
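The partition into {C, O} can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the described examples: the `Action` record, the profile fields `country` and `occupation`, and the choice of a count and a total value over a 24-hour window as the observable features.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Action:
    """One action by an entity, e.g. a payment request (hypothetical schema)."""
    entity_id: str
    timestamp: datetime
    amount: float


def observable_features(actions, entity_id, now, window=timedelta(hours=24)):
    """O: count and total value of the entity's actions in the last `window`."""
    recent = [
        a for a in actions
        if a.entity_id == entity_id and now - window < a.timestamp <= now
    ]
    return (len(recent), sum(a.amount for a in recent))


def contextual_features(profile):
    """C: features drawn from auxiliary data about the entity (assumed fields)."""
    return (profile["country"], profile["occupation"])


def make_sample(actions, profiles, entity_id, now):
    """Build the tuple {C, O} for one entity at time `now`."""
    return (
        contextual_features(profiles[entity_id]),
        observable_features(actions, entity_id, now),
    )


now = datetime(2021, 5, 24, 12, 0)
actions = [
    Action("entity_1", now - timedelta(hours=1), 20.0),
    Action("entity_1", now - timedelta(hours=30), 500.0),  # outside the window
    Action("entity_2", now - timedelta(hours=2), 75.0),
]
profiles = {
    "entity_1": {"country": "GB", "occupation": "teacher"},
    "entity_2": {"country": "FR", "occupation": "engineer"},
}
sample = make_sample(actions, profiles, "entity_1", now)
```

In this sketch the contextual features depend only on auxiliary data about the entity, while the observable features depend only on its recent actions, so the two halves of the tuple can later be recombined independently.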

[0163] Some of the examples described herein can be used in a variety of ways to facilitate payment processing and detect unexpected activity patterns. In one case, contextual features can be computed based on the likelihood of interaction with other entities (e.g., different merchants, entities located in different geographic regions, different receiving entities, etc.). In this case, the scalar value output by the machine learning system can be used to help determine whether a malicious third party has gained access to payment data (e.g., "hacked" account details being used to attempt fraudulent payments). In other cases, the scalar output can be used to identify whether electronic systems are being abused, such as by opening accounts using false information. In these cases, the machine learning system can be used to identify a transaction history of an entity that is inconsistent or incompatible with previously disclosed information. For example, observable features can include the frequency, value, and purpose of transactions within a specified time period, and contextual features can aggregate metadata associated with the entity, such as age, occupation, and country of residence. In another case, the scalar output of the machine learning system can be used to determine whether a merchant is engaging in suspicious activity, such as by detecting electronic behavior patterns that are inconsistent with expectations based on past or disclosed data. In this case, contextual features can include attributes of the entity, including location information. The present examples can be applied to each of these cases.

[0164] Some examples described herein propose a data augmentation method for training that synthesizes binary labels for unlabeled historical training data. In this method, the existing data samples {C, O} in the historical training data are assigned a label indicating the absence of an anomaly. Synthetic tuples (i.e., data samples) {C, O'} or {C', O} are then generated by combining contextual and observable features that each exist in the training data but do not occur together; for example, the C in {C, O} associated with entity_1 is paired with the O' in another tuple {C', O'} associated with entity_2. These synthetic tuples are assigned a label indicating the presence of an anomaly. Once trained on this augmented dataset, the machine learning system can be applied to new data samples {C*, O*} to determine the presence of anomalies. The output of the machine learning system can be used as is or provided as the basis for another processing pipeline (e.g., as input features for training another machine learning system used to predict other output data).
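The augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the described implementation: the function name `augment`, the uniform random choice of entity pairs, the label encoding (0 for observed samples, 1 for synthetic ones), and the retry limit are all hypothetical, and the feature sets are assumed to be hashable tuples.

```python
import random


def augment(samples, n_synthetic, seed=0):
    """Label observed {C, O} samples 0 and add synthetic mismatched samples labeled 1.

    samples: list of (entity_id, C, O) with C and O given as hashable tuples.
    Returns a list of (C, O, label).
    """
    rng = random.Random(seed)
    labeled = [(c, o, 0) for _, c, o in samples]
    observed = {(c, o) for _, c, o in samples}

    synthetic = []
    attempts = 0
    max_attempts = 50 * n_synthetic  # assumed guard against very small datasets
    while len(synthetic) < n_synthetic and attempts < max_attempts:
        attempts += 1
        # Take C from one sample and O' from another sample.
        (id_a, c, _), (id_b, _, o_other) = rng.sample(samples, 2)
        # Skip pairs from the same entity or pairs that were really observed.
        if id_a == id_b or (c, o_other) in observed:
            continue
        synthetic.append((c, o_other, 1))
    return labeled + synthetic


samples = [
    ("entity_1", ("GB", "teacher"), (3, 42.0)),
    ("entity_2", ("FR", "engineer"), (40, 9000.0)),
    ("entity_3", ("US", "grocer"), (1, 5.5)),
]
augmented = augment(samples, n_synthetic=4)
```

The resulting list can be passed to any supervised binary classifier. Filtering out pairs that were actually observed keeps a real {C, O} combination from receiving both labels.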

[0165] Some examples described herein treat anomaly detection in transaction data as a supervised learning problem, with the relevant labels provided by a combination of data partitioning of the input feature vector and a data augmentation process. The data augmentation process ensures that "anomaly" labels indicate anomalous patterns of action, because in a positively labeled data sample the observable features come from the observed actions of an entity different from the entity from which the contextual features were generated. This results in a more accurate scalar output than that produced by comparative methods that use distribution-based unsupervised anomaly detection to detect anomalies relative to the entity's context. Through the present training process, the trained supervised machine learning system "learns" to give more weight to the specific types of data anomaly that are associated with deviations from the entity's expected actions. In contrast, traditional unsupervised anomaly detection methods treat all data anomalies equally, regardless of whether they are related to deviations from the entity's expected actions; for example, they do not consider the temporal relevance of, and patterns within, these actions.

[0166] Certain examples of the second aspect described herein allow machine learning systems for transaction processing to be trained without requiring data samples that carry positive "anomaly" labels. This distinguishes the present approach from comparative upsampling, weakly supervised, and semi-supervised methods. Furthermore, some of the examples described address not only the problem of potentially lacking labels for positive (i.e., anomalous) data samples, but also the problem that positive data samples are rare in the first place, a problem that cannot be solved by weakly supervised and semi-supervised methods. Comparative approaches based on upsampling, weak labeling, active learning, and semi-supervised learning fail to address the limited diversity of action patterns encoded in labeled data when true labels exist but are very rare. The described examples allow a large number of relevant positive (i.e., anomalous) labels to be generated, thereby helping supervised machine learning systems to learn complex decision boundaries and to extract features automatically (e.g., through embeddings), both of which lead to performance improvements. Moreover, unlike active learning methods, some of the examples described do not require an “oracle” to label data points on request, and, unlike weakly supervised methods, they do not require input from human experts or proxy signals.

[0167] In some examples described herein, a transaction processing system is described, including: a transaction processing module (e.g., the first processing stage 603 of Figure 6A) configured to: receive first information associated with a first proposed transaction; retrieve second information related to at least one prior transaction associated with the first proposed transaction; and compute a delay algorithm using the second information to generate third information; and a weighting module (e.g., the second processing stage 604 of Figure 6A) communicatively coupled to the transaction processing module, wherein the weighting module is configured to: receive the third information from the transaction processing module; apply weighting factors to the third information to generate fourth information; and compute at least one processing algorithm using the first and fourth information to generate an output, wherein the output is used by an additional transaction processing module to determine whether the first proposed transaction is fraudulent.
In some examples, a method for detecting fraudulent transactions is also described, comprising: a transaction processing module receiving first information associated with a first proposed transaction; the transaction processing module retrieving, from a repository, second information associated with at least one prior transaction associated with the first proposed transaction; the transaction processing module computing a delay algorithm using the second information to generate third information; a weighting module receiving the third information from the transaction processing module; the weighting module applying weighting factors to the third information to generate fourth information; and the weighting module computing at least one processing algorithm using the first and fourth information to generate an output, wherein the output is used by an additional transaction processing module to determine whether the first proposed transaction is fraudulent.
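One possible reading of this flow of first to fourth information is sketched below in plain Python. The sketch rests on several assumptions that the paragraph above does not state: the "delay algorithm" is read as an element-wise exponential time decay with fixed decay periods (in the style of claims 6 and 7), the "weighting factors" are read as softmax-normalized dot-product attention weights, and the final "processing algorithm" is read as a logistic function of a linear combination. The function names, coefficients, and the 0.5 threshold are hypothetical and untrained.

```python
import math


def delay_step(prev_state, dt_seconds, decay_periods):
    """Third information: prior state decayed element-wise by exp(-dt / period)."""
    return [s * math.exp(-dt_seconds / p) for s, p in zip(prev_state, decay_periods)]


def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable form)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_step(query, history):
    """Fourth information: weighted sum of history vectors, weights from dot products."""
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * vec[i] for w, vec in zip(weights, history)) for i in range(dim)]


def score(first_info, fourth_info, coeffs, bias):
    """Map the first and fourth information to a scalar in (0, 1)."""
    z = bias + sum(c * v for c, v in zip(coeffs, first_info + fourth_info))
    return 1.0 / (1.0 + math.exp(-z))


first_info = [0.5, -0.2]                     # features of the proposed transaction
second_info = [([1.0, 0.0], 30.0),           # (prior state, seconds since that transaction)
               ([0.2, 0.9], 86400.0)]
decay_periods = [60.0, 3600.0]               # assumed fixed decay periods, in seconds

third_info = [delay_step(state, dt, decay_periods) for state, dt in second_info]
fourth_info = weight_step(first_info, third_info)
likelihood = score(first_info, fourth_info, coeffs=[0.3, -0.1, 0.8, 0.4], bias=-1.0)
decision = "reject" if likelihood >= 0.5 else "approve"
```

With these illustrative numbers, the state from the transaction a day earlier decays to almost nothing, so the weighted sum is dominated by the transaction 30 seconds earlier. In a trained system the decay, weighting, and output parameters would be learned rather than fixed by hand.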

[0168] Some of the examples described herein can be implemented by instructions stored in a computer-readable storage medium. The computer-readable medium may include one or more of a rotating magnetic disk, a rotating optical disk, a flash random access memory (RAM) chip, and other mechanically movable or solid-state storage media. In use, the instructions are executed by one or more processors to cause the processors to perform the operations described above. The embodiments, variations, and examples described above should be understood as illustrative. Further embodiments, variations, and examples are contemplated. Although certain components of each example have been described separately, it should be understood that functionality described with reference to one example may be suitably implemented in another example, and that certain components may be omitted depending on the implementation. It should be understood that any feature described in relation to any example may be used alone or in combination with other described features, and may also be used in combination with one or more features of any other example, or of any combination of other examples. For example, features described with respect to system components may also be adapted to be performed as part of the described methods. Furthermore, equivalents and modifications not described above may be adopted without departing from the scope of the invention as defined in the appended claims.

Claims

1. A machine learning system for processing transaction-related data, the machine learning system comprising: a first processing stage comprising a recurrent neural network architecture, the recurrent neural network architecture comprising a forget gate to modify state data from a previous iteration based on data representing a time difference between a proposed transaction and a prior transaction, the recurrent neural network architecture using the modified state data and data of the proposed transaction to generate output data; and a second processing stage comprising an attention neural network architecture communicatively coupled to the first processing stage, the attention neural network architecture comprising neural network layers for applying attention weights to an input feature tensor to generate output data, the input feature tensor being generated from at least the data representing the time difference and historical output data from the first processing stage, and the attention weights being calculated based on the input feature tensor and current output data from the first processing stage; wherein the machine learning system is configured to map the output data from the second processing stage to a scalar value, the scalar value representing the probability that the proposed transaction presents an anomaly in a series of actions; and wherein the scalar value is used to determine whether to approve or reject the proposed transaction.

2. The machine learning system according to claim 1, wherein the recurrent neural network architecture comprises: a time difference interface to receive at least the data representing the time difference between the proposed transaction and the prior transaction; a transaction data input interface to receive at least the data of the proposed transaction; a state input interface to receive the state data from the previous iteration, the state data being generated by a previous application of the recurrent neural network architecture to a set of prior transactions including the prior transaction; and combinatorial logic to combine the modified state data and the data of the proposed transaction to generate the output data, the output data including state data for the present iteration.

3. The machine learning system according to claim 1 or 2, wherein the attention neural network architecture comprises: a time difference interface to receive time difference data, the time difference data including at least the data representing the time difference and data representing one or more time differences between the proposed transaction and one or more other prior transactions; a history input interface to receive the historical output data from the first processing stage; and a current input interface to receive the current output data from the first processing stage.

4. The machine learning system according to claim 3, wherein the attention neural network architecture comprises: combinatorial logic to combine data derived from the time difference interface and the history input interface to generate the input feature tensor; at least one neural network layer to compute at least one attention query vector from data received at the current input interface; at least one neural network layer to compute at least one attention key vector from the input feature tensor; and at least one neural network layer to compute at least one attention value vector from the input feature tensor.

5. The machine learning system according to claim 4, wherein the attention neural network architecture comprises: at least one time encoding layer to apply a parameterized function to the time difference data, wherein the output of the at least one time encoding layer is used together with the at least one attention key vector and the at least one attention query vector to generate the attention weights.

6. The machine learning system according to claim 1, wherein the forget gate comprises a time difference encoding that implements a parameterized exponential function, the parameterized exponential function being applied to the data representing the time difference to output an activation vector that modifies the state data through element-wise multiplication.

7. The machine learning system according to claim 6, wherein the parameters of the parameterized exponential function are fixed to represent different time decay periods.

8. The machine learning system according to claim 1, wherein the attention neural network architecture comprises: a time encoding layer to encode time difference data representing the time difference between the proposed transaction and each transaction in a set of prior transactions, wherein the output of the time encoding layer is combined with the historical output data from the first processing stage to generate the input data used to calculate the attention weights.

9. The machine learning system according to claim 1, wherein the second processing stage comprises multiple attention neural network architectures that form a multi-head self-attention mechanism, wherein the attention neural network architectures are applied in parallel and their outputs are combined to generate the output data of the second processing stage.

10. The machine learning system according to claim 1, comprising: a fully connected neural network architecture comprising one or more neural network layers to receive at least the output data from the second processing stage and map the output data to the scalar value.

11. The machine learning system according to claim 10, wherein the fully connected neural network architecture also receives the output data from the first processing stage to generate the scalar value.

12. The machine learning system according to claim 1, comprising: a fully connected neural network architecture comprising one or more neural network layers for preprocessing the data of the proposed transaction prior to at least the first processing stage.

13. The machine learning system according to claim 1, wherein the scalar value is a normalized value between 0 and 1, representing the probability that the proposed transaction presents an anomaly in a series of actions.

14. The machine learning system according to claim 1, comprising one or more skip connections that bypass one or more of the first processing stage and the second processing stage.

15. A method for processing data associated with a proposed transaction, the method comprising: receiving an incoming event from a client transaction processing system, the incoming event being associated with a request for an approval decision on the proposed transaction; parsing the incoming event to extract data of the proposed transaction, including determining the time difference between the proposed transaction and a prior transaction; applying, based on the extracted data and the time difference, a machine learning system according to any one of claims 1 to 10 to output the scalar value, the scalar value representing the probability that the proposed transaction presents an anomaly in a series of actions, wherein the applying includes accessing a repository to retrieve at least the state data from the previous iteration, the historical output data from the first processing stage, and the time differences between the proposed transaction and one or more other historical transactions; determining a binary output based on the scalar value output by the machine learning system, the binary output indicating whether the proposed transaction is approved or rejected; and returning the binary output to the client transaction processing system.

16. The method according to claim 15, wherein the delay between receiving the incoming event and returning the binary output is less than one second.