Distributed machine learning
A peer-to-peer network of independent data processing nodes in distributed machine learning shares and combines model characteristics to address scalability and reliability issues, ensuring secure and efficient model updates.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: QUEENSLAND UNIVERSITY OF TECHNOLOGY
- Filing Date: 2025-10-01
- Publication Date: 2026-07-02
AI Technical Summary
Distributed machine learning architectures face challenges such as computational overhead, communication delays, and reliance on a central server, leading to scalability issues and potential single points of failure.
A peer-to-peer network of independent data processing nodes shares and combines model characteristics to determine an updated machine learning model without a central server, using encryption and validation to ensure security and integrity.
This approach reduces reliance on a central server, enhances scalability, and improves system reliability by enabling secure and efficient model updates across geographically distributed nodes.
Description
DISTRIBUTED MACHINE LEARNING
PRIORITY DOCUMENTS
[0001] The present application claims priority from:
• Australian Provisional Patent Application No. 2024904313 titled “DISTRIBUTED MACHINE LEARNING” and filed on 24 December 2024;
• Australian Provisional Patent Application No. 2025900106 titled “DISTRIBUTED MACHINE LEARNING” and filed on 14 January 2025; and
• Australian Provisional Patent Application No. 2025900651 titled “DISTRIBUTED MACHINE LEARNING” and filed on 4 March 2025.
[0002] The contents of the above applications are all incorporated by reference in their entirety.
TECHNICAL FIELD
[0003] The present disclosure relates to machine learning. In a particular form, the present disclosure relates to a machine learning architecture distributed over multiple independent data processing nodes.
BACKGROUND
[0004] A significant issue with the training of machine learning models is the computational overhead involved as the size and complexity of these models increases. This has led to the development of distributed machine learning architectures where the processing load is distributed among individual data processing nodes which then each update a single machine learning model located on a central server processing node. As a result, while these federated learning (FL) architectures are able to facilitate data privacy at each node, they require the frequent transmission of model parameters from each of the distributed individual data processing nodes to the central server processing node where the machine learning model is located.
[0005] When the number of independent nodes is increased in an attempt to scale these FL architectures, communication delays result. Furthermore, complex synchronization protocols are required to coordinate the transfer of model parameters. Another system implementation issue is that this type of FL architecture is critically dependent on the availability of the central server processing node, which therefore presents a potential single point of failure that affects system reliability.
SUMMARY
[0006] In a first aspect, the present disclosure provides a computer-implemented method, comprising:
in a peer-to-peer computer network comprising a plurality of independent data processing nodes, wherein each independent data processing node comprises respective local node training data and a respective machine learning model;
determining at each independent data processing node a respective set of model characteristics from training the respective machine learning model on the respective local node training data;
communicating, for each independent data processing node, their respective set of model characteristics to each other independent data processing node of the plurality of independent data processing nodes; and
determining a respective updated machine learning model at a selected independent data processing node of the plurality of independent data processing nodes based on its own locally determined set of model characteristics and received sets of model characteristics from each other independent data processing node.
[0007] In another form, determining the respective updated machine learning model at the selected independent data processing node of the plurality of independent data processing nodes based on its own locally determined set of model characteristics and each received set of model characteristics from each other independent data processing node comprises combining the own locally determined set of model characteristics with a selection of model characteristics from the received sets of model characteristics to form a combination of model characteristics.
[0008] In another form, determining the respective updated machine learning model at the selected independent data processing node comprises:
generating a global model at the selected independent data processing node based on the combination of model characteristics; and
training one or more classifiers based on the global model to generate a final classification result.
[0009] In another form, training the one or more classifiers based on the global model comprises:
training a plurality of classifiers to generate associated prediction outputs; and
aggregating the prediction outputs to generate the final classification result.
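As an illustration only, the aggregation of prediction outputs from a plurality of classifiers might be sketched as a majority vote. The disclosure does not fix the aggregation rule at this point, so majority voting is an assumption, and the function name `aggregate_predictions` and the example labels are hypothetical.

```python
from collections import Counter


def aggregate_predictions(prediction_outputs):
    """Return the label predicted most often (assumed rule: majority vote).

    Ties are resolved in favour of the label that was seen first.
    """
    if not prediction_outputs:
        raise ValueError("at least one prediction output is required")
    counts = Counter(prediction_outputs)
    # most_common orders labels with equal counts by first occurrence.
    return counts.most_common(1)[0][0]


# Hypothetical prediction outputs from three classifiers.
prediction_outputs = ["benign", "malignant", "benign"]
final_classification_result = aggregate_predictions(prediction_outputs)
```

Other aggregation rules, such as averaging class probabilities, would fit the same interface.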
[0010] In another form, the method further comprises determining respective updated machine learning models at each remaining data processing node based on their respective locally determined sets of model characteristics and each received set of model characteristics from each other independent data processing node.
[0011] In another form, the plurality of independent data processing nodes comprises three or more independent data processing nodes.
[0012] In another form, determining the respective set of model characteristics for a machine learning model comprising a plurality of sub-models comprises determining a unified feature representation for the plurality of sub-models.
[0013] In another form, determining the unified feature representation for the plurality of sub-models comprises:
extracting features from each of the plurality of sub-models of the respective machine learning model;
concatenating the extracted features to form a fused feature set; and
processing the fused feature set to determine the unified feature representation having reduced dimensionality.
[0014] In another form, processing the fused feature set to determine the unified feature representation comprises:
performing an independent component analysis on the fused feature set to reduce dimensionality; and optionally:
validating the unified feature representation; or
displaying high dimensional features to visualise model performance.
[0015] In another form, a first independent data processing node seeking to communicate their respective set of model characteristics to a second independent data processing node operable to receive the respective set of model characteristics is defined as a transmitting node and the second independent data processing node is defined to be a receiving node.
[0016] In another form, the method further comprises validating by the receiving node the transmitting node prior to communication.
[0017] In another form, validating by the receiving node the transmitting node comprises the receiving node validating a unique digital identifier of the transmitting node in accordance with an asymmetric encryption scheme based on a transmitting node private-public key pair.
[0018] In another form, validating the unique digital identifier of the transmitting node in accordance with an asymmetric encryption scheme comprises:
encrypting by the transmitting node the unique digital identifier based on a private key of the transmitting node private-public key pair to form an encrypted unique digital identifier;
sending by the transmitting node the encrypted unique digital identifier to the receiving node;
decrypting by the receiving node the encrypted unique digital identifier based on a public key of the transmitting node private-public key pair; and
confirming by the receiving node the unique digital identifier of the transmitting node.
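The validation steps above can be sketched with textbook RSA. This is a toy illustration under stated assumptions, not the scheme of the disclosure: the primes are far too small for real use, the identifier is hashed before encryption only because the toy modulus is small, and all names (`encrypt_with_private_key`, `confirm_identifier`, `node-A`) are hypothetical.

```python
import hashlib

# Toy RSA key pair for the transmitting node (assumption: insecure sizes).
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (Python 3.8+)


def digest(identifier: str) -> int:
    """Map the unique digital identifier to an integer below the modulus."""
    raw = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def encrypt_with_private_key(identifier: str) -> int:
    """Transmitting node: encrypt the identifier with its private key."""
    return pow(digest(identifier), D, N)


def confirm_identifier(claimed_identifier: str, encrypted_identifier: int) -> bool:
    """Receiving node: decrypt with the public key and compare."""
    return pow(encrypted_identifier, E, N) == digest(claimed_identifier)


encrypted = encrypt_with_private_key("node-A")       # sent to the receiver
is_valid = confirm_identifier("node-A", encrypted)   # receiver-side check
```

A production system would use a vetted signature scheme from an established cryptographic library rather than hand-written RSA.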
[0019] In another form, the method further comprises securing communication of the respective set of model characteristics between the transmitting node and the receiving node in accordance with an encryption scheme prior to communication.
[0020] In another form, the encryption scheme is a symmetric encryption scheme based on a unique session key generated for the communication of the respective set of model characteristics from the transmitting node to the receiving node.
[0021] In another form, the unique session key is shared between the transmitting node and the receiving node in accordance with an asymmetric encryption scheme between the transmitting node and the receiving node.
[0022] In another form, the encryption scheme is an asymmetric encryption scheme based on private-public key pairs generated for the transmitting node and receiving node.
[0023] In another form, the asymmetric encryption scheme is a homomorphic encryption scheme.
[0024] In another form, the homomorphic encryption scheme is an additive homomorphic encryption scheme.
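One well-known additive homomorphic scheme is the Paillier cryptosystem; the disclosure does not name a specific scheme, so its use here is an assumption. The sketch below uses tiny primes and a fixed-point scale (`SCALE`, hypothetical) purely to show that encrypted model characteristics can be summed without being decrypted.

```python
import math
import random

# Toy Paillier key (assumption: tiny primes for readability, not security).
P, Q = 293, 433
N = P * Q
N_SQ = N * N
G = N + 1
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                                # valid because G = N + 1

SCALE = 1000  # fixed-point scale for real-valued characteristics


def encrypt(value: float) -> int:
    """Encrypt a real value after fixed-point encoding."""
    m = round(value * SCALE) % N
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N_SQ


def decrypt(ciphertext: int) -> float:
    """Decrypt and decode; values above N / 2 represent negatives."""
    m = ((pow(ciphertext, LAM, N_SQ) - 1) // N * MU) % N
    if m > N // 2:
        m -= N
    return m / SCALE


# Hypothetical weights contributed by three nodes, summed while encrypted.
encrypted_sum = encrypt(0.125)
for weight in (-0.5, 0.75):
    encrypted_sum = add_encrypted(encrypted_sum, encrypt(weight))
total = decrypt(encrypted_sum)
```

With this toy modulus the scaled sum must stay below roughly N / 2 in magnitude, otherwise the decoding wraps around.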
[0025] In another form, the method further comprises assessing by the receiving node the respective set of model characteristics following communication from the transmitting node to determine an associated integrity measure for the respective set of model characteristics.
[0026] In another form, the associated integrity measure indicates interference with the respective set of model characteristics in a form of a cyber-attack, concept drift attack or unlabelled cyber-attack.
[0027] In another form, assessing the respective set of model characteristics to determine an associated integrity measure comprises applying a machine learning classifier to the respective set of model characteristics.
[0028] In another form, communicating, for each independent data processing node, their respective sets of model characteristics to each other independent data processing node comprises each of the independent data processing nodes updating their respective sets of model characteristics to each other independent data processing node substantially simultaneously.
[0029] In another form, each independent data processing node operates as a blockchain node together forming a blockchain network having a corresponding shared blockchain ledger for recording an update based on communications between the plurality of independent data processing nodes.
[0030] In another form, the update is recorded on the shared blockchain ledger following a proof of share operation between the plurality of independent data processing nodes.
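A shared ledger that records updates can be illustrated with a simple hash-linked list of blocks. This sketch covers only the tamper-evident recording of an update; it does not implement the proof of share operation or any consensus step, and the block fields and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def block_hash(index, previous_hash, payload):
    """Hash the block contents in a canonical JSON form."""
    body = json.dumps(
        {"index": index, "previous_hash": previous_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_update(ledger, payload):
    """Record an update, linking it to the hash of the previous block."""
    index = len(ledger)
    previous_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    block = {
        "index": index,
        "previous_hash": previous_hash,
        "payload": payload,
        "hash": block_hash(index, previous_hash, payload),
    }
    ledger.append(block)
    return block


def ledger_is_consistent(ledger):
    """Recompute every hash and check each link to the previous block."""
    previous_hash = GENESIS_HASH
    for index, block in enumerate(ledger):
        if block["index"] != index or block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(index, previous_hash, block["payload"]):
            return False
        previous_hash = block["hash"]
    return True


# Hypothetical updates describing communications between nodes.
ledger = []
append_update(ledger, {"sender": "N1", "receiver": "N2", "round": 1})
append_update(ledger, {"sender": "N2", "receiver": "N1", "round": 1})
consistent = ledger_is_consistent(ledger)
```

Because each block stores the hash of its predecessor, altering any recorded update invalidates that block's hash and the check fails.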
[0031] In another form, a first local node training data for a first independent data processing node of the plurality of independent data processing nodes comprises a first data modality and a second local node training data for a second independent data processing node of the plurality of independent data processing nodes comprises a second data modality, and wherein the first and second data modalities are different.
[0032] In another form, the first and second data modalities are different types of image data modalities.
[0033] In another form, the respective machine learning model for a given node is equivalent to the respective machine learning model for a different node.
[0034] In another form, the respective machine learning model for a given node is different to the respective machine learning model for a different node.
[0035] In another form, the method further comprises adding a new independent data processing node to the plurality of independent data processing nodes, wherein the new independent data processing node comprises respective local node training data and a respective machine learning model corresponding to the new independent data processing node.
[0036] In another form, the respective machine learning model at each independent data processing node comprises a component of a shared machine learning model architecture also resident at each of the independent data processing nodes, and wherein the respective updated machine learning model determined at the selected independent node comprises an updated shared machine learning model architecture based on its own locally determined set of model characteristics for its component of the shared machine learning model architecture and received sets of model characteristics for each other component of the shared machine learning model architecture.
[0037] In another form, the shared machine learning model architecture comprises a multi-head transformer model and the component of the shared machine learning model architecture comprises a head of the multi-head transformer.
[0038] In another form, the set of model characteristics for a respective head of the multi-head transformer comprise trained head weights from the respective trained head.
[0039] In another form, the multi-head transformer comprises an ensemble layer for combining prediction outputs from the trained heads to produce a final prediction vector to produce a classification result.
[0040] In another form, combining prediction outputs from the trained heads to produce a final prediction vector comprises averaging the prediction outputs from the trained heads.
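Averaging the prediction outputs from the trained heads can be sketched as an element-wise mean followed by selection of the highest-scoring class. The function names and example values are hypothetical, and taking the index of the largest entry as the classification result is an assumption.

```python
def average_head_outputs(head_outputs):
    """Element-wise mean of the prediction vectors from each trained head."""
    if not head_outputs:
        raise ValueError("at least one head output is required")
    length = len(head_outputs[0])
    if any(len(output) != length for output in head_outputs):
        raise ValueError("all head outputs must have the same length")
    count = len(head_outputs)
    return [sum(values) / count for values in zip(*head_outputs)]


def classify(final_prediction_vector):
    """Assumed decision rule: index of the largest averaged score."""
    return max(
        range(len(final_prediction_vector)),
        key=final_prediction_vector.__getitem__,
    )


# Hypothetical prediction outputs from three heads over three classes.
head_outputs = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.75, 0.0, 0.25],
]
final_prediction_vector = average_head_outputs(head_outputs)
classification_result = classify(final_prediction_vector)
```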
[0041] In another form, one or more of the heads of the multi-head transformer are pre-trained.
[0042] In another form, the method further comprises:
adding a new independent data processing node to the plurality of independent data processing nodes, the independent data processing node comprising a new head and associated local node training data; and
updating the multi-head transformer to include the new head.
[0043] In another form, the multi-head transformer is a multi-head vision transformer (ViT) configured to analyse image data.
[0044] In another form, the image data is medical image data.
[0045] In another form, the local node training data is accessible only by its associated independent data processing node.
[0046] In another form, each independent data processing node of the plurality of independent data processing nodes is operated by an associated entity having local policy and procedures governing operation of each independent data processing node.
[0047] In another form, each independent data processing node of the plurality of independent data processing nodes is operated in geographically distinct locations with respect to each other.
[0048] In another form, the method further comprises applying the updated machine learning model to a classification task.
[0049] In a second aspect, the present disclosure provides a computing system comprising:
a peer-to-peer computer network comprising a plurality of independent data processing nodes, wherein each independent data processing node comprises one or more data processors, one or more network interfaces, and one or more storage devices and is configured to implement the method in accordance with the first aspect of the disclosure.
[0050] In a third aspect, the present disclosure provides an independent data processing node operating in the peer-to-peer computer network of the second aspect of the disclosure.
[0051] In a fourth aspect, the present disclosure provides a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by a computing system, cause the computing system to perform operations comprising the method in accordance with the first aspect of the disclosure.
[0052] In a fifth aspect, the present disclosure provides an independent data processing node operating in a peer-to-peer computer network comprising a plurality of other independent data processing nodes, wherein all independent data processing nodes comprise respective local node training data and a respective machine learning model, the independent data processing node configured to:
determine a local set of model characteristics from training a local machine learning model on its local node training data;
receive respective sets of model characteristics communicated from each of the plurality of other independent data processing nodes following training the respective machine learning model on the respective local node training data at each of the plurality of other independent data processing nodes; and
determine a respective updated machine learning model based on its own local set of model characteristics and the received respective sets of model characteristics from each of the plurality of other independent data processing nodes.
BRIEF DESCRIPTION OF DRAWINGS
[0053] Embodiments of the present disclosure will be discussed with reference to the accompanying drawings wherein:
[0054] FIG. 1 is a flowchart diagram of a method for determining a machine learning model involving multiple independent data processing nodes according to some illustrative embodiments;
[0055] FIG. 2 is a system overview diagram of a system for determining a machine learning model involving multiple independent data processing nodes according to some illustrative embodiments;
[0056] FIG. 3A is an overview diagram of three independent data processing nodes according to some illustrative embodiments;
[0057] FIG. 3B is a schematic block diagram of the machine learning model architecture referred to in FIG. 3A showing the incorporated SE and BiLSTM layers according to some embodiments;
[0058] FIG. 4 is a flowchart of a method for determining the respective updated machine learning model at the selected independent data processing node according to some embodiments;
[0059] FIG. 5 is a flowchart of a method for determining a unified feature representation for a machine learning model according to some embodiments;
[0060] FIG. 6 is a flowchart of a method for processing a fused feature set according to some embodiments;
[0061] FIG. 7 is an overview diagram showing two independent data processing nodes comprising a transmitting node and a receiving node according to some embodiments;
[0062] FIG. 8 is a flowchart of a method for validating the unique digital identifier of a transmitting node in accordance with an asymmetric encryption scheme according to some embodiments;
[0063] FIG. 9 is a schematic block diagram for a machine learning based data classification architecture according to some embodiments;
[0064] FIG. 10 is a system interaction diagram of an encrypted blockchain based distributed machine learning architecture according to some illustrative embodiments;
[0065] FIG. 11 is a UML diagram of the processing components of the encrypted blockchain based distributed machine learning architecture illustrated in FIG. 10 according to some illustrative embodiments;
[0066] FIG. 12 is a data flow diagram 1200 showing the processing and communications steps for the encrypted blockchain based distributed machine learning architecture illustrated in FIG. 10 according to some illustrative embodiments;
[0067] FIG. 13 is a system overview diagram of a system for determining a shared machine learning model architecture where respective independent data processing nodes comprise a component of the shared machine learning model architecture according to some embodiments;
[0068] FIG. 14 is a system overview diagram of the system for determining a machine learning model architecture illustrated in FIG. 13 showing the determining of an updated machine learning model architecture following communication of sets of model characteristics from independent data processing nodes to a selected independent data processing node according to some embodiments;
[0069] FIG. 15 is a system overview diagram of the system for determining a machine learning model architecture illustrated in FIG. 13 showing the adding of a new independent data processing node according to some embodiments;
[0070] FIG. 16 is a table comparing performance metrics of a distributed machine learning architecture in accordance with the present disclosure to other distributed machine learning methods;
[0071] FIG. 17 is a table demonstrating the performance of a distributed machine learning architecture in accordance with the present disclosure relative to other distributed machine learning methods with regard to multi-modal data;
[0072] FIG. 18 is a system overview diagram of a system for determining a machine learning model involving multiple independent data processing nodes in a peer-to-peer computer network according to some illustrative embodiments;
[0073] FIG. 19 is a system overview diagram depicting the initial communication of pretrained weights as determined on an independent data processing node in a peer-to-peer network following a pretraining step according to some embodiments;
[0074] FIG. 20 is a system overview diagram showing the communication of the various sets of model characteristics in the form of weights between the independent data processing nodes of a peer-to-peer network following training at each node according to some embodiments;
[0075] FIG. 21 is a flowchart of a method for determining a respective updated machine learning model at a selected independent data processing node based on combining or aggregating its own locally determined set of model characteristics and received sets of model characteristics from each other independent data processing node according to some embodiments;
[0076] FIG. 22 is a flowchart of a method for extracting model features in the form of concatenated node model features based on local extracted model features determined at each of the nodes according to some embodiments;
[0077] FIG. 23 is a system overview diagram showing the communicating of the various non-invertible embeddings between the independent data processing nodes in a peer-to-peer network following determining the respective non-invertible embedding at each node according to some embodiments; and
[0078] FIG. 24 is an architecture overview diagram of an example independent data processing node that may optionally be utilised in accordance with the present disclosure.
[0079] In the following description, like reference characters designate like or corresponding parts throughout the figures.
DESCRIPTION OF EMBODIMENTS
[0080] Methods and systems in accordance with the present disclosure provide a distributed machine learning architecture involving a plurality of independent data processing nodes in a peer-to-peer computer network where at each node a machine learning model implemented on the particular node is trained to determine a set of model characteristics for that node. The model characteristics for each node are then exchanged between all of the nodes so that each node now has access to a combined set of model characteristics from all of the nodes. Any of the nodes may then be selected to form an updated machine learning model at that node based on the sets of model characteristics from the other nodes. In one example, this may be achieved by combining the own locally determined set of model characteristics with a selection of model characteristics from the various received sets of model characteristics to form a combination or subset of model characteristics, noting that in various examples this combination or subset may include all of the available sets of model characteristics. In one example, this selection may be based on a machine learning or data classification task at the selected node.
[0081] In this manner, there is no “central” node where the machine learning model resides, and which may critically form a point of failure for the distributed machine learning architecture. Instead, in accordance with the present disclosure, any or all nodes, may implement an updated machine learning model at the respective node based on a selection or all of the learnt model characteristics from the other nodes. The distributed machine learning architecture of the present disclosure may be advantageously applied to shared machine learning model architectures comprising multiple components or heads that each may be trained at a respective independent data processing node.
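The exchange described above can be sketched as follows. This is a minimal illustration under stated assumptions: "training" is replaced by a stand-in that computes the mean of the local data, the combination rule is a plain average, and the `Node` class, its methods and the example data are all hypothetical.

```python
class Node:
    """An independent data processing node in the peer-to-peer network."""

    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data   # stays on this node
        self.characteristics = None    # locally determined set
        self.received = {}             # sets received from peers

    def train(self):
        # Stand-in for model training (assumption): one characteristic,
        # the mean of the local node training data.
        self.characteristics = [sum(self.local_data) / len(self.local_data)]
        return self.characteristics

    def receive(self, sender_name, characteristics):
        self.received[sender_name] = list(characteristics)

    def update_model(self, selected_peers=None):
        # Combine the node's own set with all, or a selection of,
        # the received sets (assumed rule: element-wise average).
        if selected_peers is None:
            peer_sets = list(self.received.values())
        else:
            peer_sets = [self.received[name] for name in selected_peers]
        all_sets = [self.characteristics] + peer_sets
        return [sum(values) / len(all_sets) for values in zip(*all_sets)]


def exchange(nodes):
    """Every node sends its characteristics to every other node."""
    for sender in nodes:
        for receiver in nodes:
            if receiver is not sender:
                receiver.receive(sender.name, sender.characteristics)


nodes = [Node("N1", [1.0, 3.0]), Node("N2", [4.0]), Node("N3", [6.0, 6.0])]
for node in nodes:
    node.train()
exchange(nodes)

selected = nodes[1]  # any node may be selected; there is no central server
updated_from_all = selected.update_model()
updated_from_selection = selected.update_model(selected_peers=["N1"])
```

Only the model characteristics are exchanged in this sketch; the local node training data never leaves the node that holds it.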
[0082] Referring now to FIG. 1, there is shown a flowchart diagram of a method 100 for determining a machine learning model involving multiple independent data processing nodes according to some illustrative embodiments. FIG. 2 is a system overview diagram of a system 200 for determining a machine learning model involving multiple independent data processing nodes N_1, N_2, ..., N_N (eg, see FIG. 18) in a peer-to-peer computer network 210 according to some illustrative embodiments. In this example, node N_N is shown in dashed outline, indicating that there may be any number of nodes up to N independent data processing nodes, where N may range from 2 (ie, two independent data processing nodes) to any integer A as determined by system requirements. In various examples, system 200 may be configured to implement method 100 set out in FIG. 1 and the other computer-implemented methods set out in the present disclosure.
[0083] The term “independent data processing node” as employed throughout this specification is taken to mean that the independent data processing node is operationally isolated and functions independently from the other independent data processing nodes that are members of the peer-to-peer network. In one example, one or more of the independent data processing nodes may be operationally isolated as a result of each independent data processing node being operated by an associated entity having its own local policy and procedures (eg, security and / or data management policies) governing the operation of the independent data processing node. In another example, one or more of the independent data processing nodes may be operationally isolated as a result of the independent data processing node being operated in geographically distinct locations with respect to each other.
[0084] In another example, one or more of the independent data processing nodes may be operationally isolated through the employment of network, software or operating system isolation, such as through the use of firewalls and / or software based security measures. Accordingly, in various examples, one or more of the independent data processing nodes may be based in the cloud where the cloud comprises cloud based processing and data storage resources which themselves may be geographically distributed or centrally located such as provided by commercial cloud service providers (eg, Amazon Web Services, Microsoft Azure, Google Cloud Platform, etc.) or a local server farm.
[0085] At block 110, in a peer-to-peer computer network (eg, peer-to-peer computer network 210 shown in FIG. 2) comprising a plurality of independent data processing nodes N_1, N_2, ..., N_N, where each independent data processing node comprises respective local node training data D_1, D_2, ..., D_N and a respective machine learning model M_1, M_2, ..., M_N, method 100 comprises determining at each independent data processing node a respective set of model characteristics from training the respective machine learning model on the respective local node training data.
[0086] As will be discussed further, a set of model characteristics may include, but not be limited to, any one, or combination of, the following: embeddings characterising abstract representations of data such as the local node training data, weights as determined from training the respective machine learning model, feature representations including a unified feature representation, or head parameters.
[0087] Taking the example of weights, in pseudocode their determination may be indicated as follows for a given node A:

$$W_A^{t+1} = W_A^{t} - \eta \, \nabla L(W_A^{t}) \qquad \text{(Equation 1)}$$

[0088] where $W_A^{t+1}$ is the set of updated weights for time step t + 1 calculated from the weights $W_A^{t}$ calculated at time step t, $\eta$ is the learning rate and $\nabla L(W_A^{t})$ is the gradient of a loss function.
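The update described above, in which the weights for time step t + 1 are obtained from the weights at time step t, a learning rate and the gradient of a loss function, corresponds to a standard gradient descent step. The sketch below assumes a simple squared-error loss purely for illustration; the function names and numeric values are hypothetical.

```python
def loss_gradient(weights, targets):
    """Gradient of the assumed loss L(w) = sum((w_i - target_i) ** 2)."""
    return [2.0 * (w - t) for w, t in zip(weights, targets)]


def gradient_step(weights, gradients, learning_rate):
    """Weights at step t + 1 = weights at step t - learning rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# Hypothetical weights at time step t on a given node.
weights_t = [0.0, 4.0]
targets = [1.0, 2.0]
learning_rate = 0.25

gradients = loss_gradient(weights_t, targets)
weights_next = gradient_step(weights_t, gradients, learning_rate)
```

Each step moves the weights against the gradient, so with a suitably small learning rate the assumed loss decreases from one time step to the next.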
[0089] In various examples, the local node training data may be unique to each independent data processing node. In other examples, the local node training data may be the same for each independent data processing node of the peer-to-peer computer network 210. In yet other examples, a selection of independent data processing nodes may have the same local node training data and other independent data processing nodes have different local node training data. In one example, the local node training data is only accessible by its associated independent data processing node, ie, for a given independent data processing node, the respective local node training data may only be accessed (eg, used for training the machine learning model at the independent data processing node) by that associated independent data processing node and not by any other independent data processing node forming part of the peer-to-peer computer network comprising the plurality of the independent data processing nodes.
[0090] In other examples, the local node training data for different nodes may have different modalities, eg, the local node training data for one node may be video data, text data, audio data, image data, or even sub-modalities (eg, image data in the case of medical image data could comprise X-ray image data, computed tomography (CT) image data, magnetic resonance imaging (MRI) image data, positron emission tomography (PET) image data, etc) or multimodal data combining any of the previously mentioned modalities.
[0091] In various examples, the machine learning model may be unique to each independent data processing node. In other examples, the machine learning model may be the same or equivalent for each independent data processing node of the peer-to-peer computer network 210. In other examples, the machine learning model for a first selection of independent data processing nodes is the same but different to the machine learning model for a second selection of independent data processing nodes. In yet other examples, the machine learning model for each independent data processing node may be based on the same core machine learning model architecture (eg, a multi-head transformer model) and the machine learning model for a given independent data processing node is based on training a dedicated “component” or “head” on its local node training data to provide a set of model characteristics comprising the trained head weights for the head.
[0092] As would be appreciated, machine learning models vary in structure and functionality. Example model characteristics that may be exchanged include, but are not limited to: weights, embeddings, heads, and / or encoders.
[0093] In one example, the machine learning model may comprise a convolutional neural network (CNN), and example model characteristics that may be exchanged include, but are not limited to: weights and / or biases from convolutional and fully connected layers, feature maps extracted at different layers, embeddings that encode compressed representations of input data, and / or filters / kernels that detect spatial patterns in images.
[0094] In one example, the machine learning model may comprise a recurrent neural network (RNN) (eg, long short-term memory (LSTM) or gated recurrent unit (GRU) models) and example model characteristics that may be exchanged include, but are not limited to: hidden state representations that store sequential information, cell state weights in LSTMs that handle long-term dependencies, and / or attention mechanism weights which improve sequence alignment in tasks like machine translation and speech recognition.
[0095] In one example, the machine learning model may comprise a graph neural network (GNN) and example model characteristics that may be exchanged include, but are not limited to: node embeddings which represent graph entities as dense vectors, edge features that define relationships between nodes, graph convolution weights from a graph convolutional network (GCN), and / or message passing weights employed in graph attention networks (GATs) to facilitate node communication.
[0096] In one example, the machine learning model may comprise a generative adversarial network (GAN) and example model characteristics that may be exchanged include, but are not limited to: generator weights responsible for synthesising data, discriminator weights that evaluate generated samples, latent space representations which encode compressed data features and / or feature extractor weights typically based on pre-trained CNN or transformer backbones.
[0097] In one example, the machine learning model may comprise a transformer-based large language model (LLM) (eg, BERT, GPT or T5) and example model characteristics that may be exchanged include, but are not limited to: token / word embeddings for providing contextualized word meanings, encoder weights for capturing long-range dependencies in input sequences, decoder weights for generative tasks, and / or self-attention matrices for determining the relationships between words in a sentence.
[0098] In one example, the machine learning model may comprise a vision transformer (ViT) and example model characteristics that may be exchanged include, but are not limited to: patch embeddings for converting images into tokenised representations, self-attention weights for assigning importance to different input patches, multi-head attention weights that capture long-range dependencies, encoder weights responsible for feature extraction, and / or layer-wise representations from different transformer stages for fine-tuning models across tasks.
[0099] In one example, the machine learning model may comprise a diffusion model (eg, Stable Diffusion or a denoising diffusion probabilistic model (DDPM)) and example model characteristics that may be exchanged include, but are not limited to: noise prediction model weights for guiding data generation, latent space representations for capturing the statistical essence of input data, U-Net encoder weights responsible for the denoising process, attention weights, and / or text embeddings (eg, for contrastive language-image pre-training (CLIP) based models) which assist diffusion models to interpret text prompts for conditional generation.
[0100] In one example, the machine learning model may comprise a reinforcement learning model (RLM) (eg, deep Q-network (DQN), proximal policy optimisation (PPO) or soft actor-critic (SAC)) and example model characteristics that may be exchanged include, but are not limited to: state-action value functions (Q-tables) that help agents make optimal decisions, policy network weights for defining action selection strategies, and / or reward function weights which dictate how an agent learns from its environment.
[0101] In one example, the machine learning model may be pre-trained and training the machine learning model may comprise freezing a component of the pre-trained machine learning model.
[0102] In one example, the machine learning model may comprise a plurality of sub-models trained on the local node training data and determining the set of model characteristics comprises extracting the relevant features in the form of a unified feature representation from the collection of features generated by each sub-model.
[0103] Referring now to FIG. 3A, there is shown an overview diagram 300 of three independent data processing nodes N1, N2 and N3 in accordance with some illustrative embodiments. In this example, node N1 comprises machine learning model M1 for classifying the presence or occurrence of violence in a video clip, which itself comprises three sub-models each based on a convolutional neural network (CNN) which are enhanced by the addition of squeeze-and-excitation (SE) blocks and bidirectional long short-term memory (BiLSTM) layers. SE blocks recalibrate channel-wise feature responses by considering global context, enhancing the model’s ability to capture important spatial features, while the BiLSTM layers capture long-range dependencies in the spatial features extracted by the CNNs, which can be important for complex tasks such as detecting violence, where temporal patterns are likely to be significant.
[0104] Referring now to FIG. 3B, there is shown a schematic block diagram 350 of the machine learning model M1 architecture referred to in FIG. 3A showing the incorporated SE and BiLSTM layers according to some embodiments.
[0105] In this example, machine learning models M2 and M3 for data processing nodes N2 and N3 respectively are the same as M1. Furthermore, the local node training data D1, D2 and D3 for each data processing node N1, N2 and N3 is different, with D1 for node N1 comprising the Real-World Fighting (RWF) dataset, a large-scale dataset of 2000 trimmed video clips depicting fight or non-fight behaviours. These video clips are captured by surveillance cameras from real-world scenes and collected from the YouTube website across 20 different scenes, such as street fights, beach fights, and restaurant altercations. The dataset is divided into two parts: the training set (80%) and the test set (20%). Half of the videos depict fight behaviours, while the others feature normal activities.
[0106] D2 for N2 in this example comprises the “Hockey” dataset including 1000 video clips, each of size 360x288, and labelled into 500 instances of violent and 500 instances of nonviolent events. The videos were collected from actual hockey games played by the National Hockey League (NHL) to capture real-life violent events. Usually, each video clip lasts between one and two seconds, with a frame rate of 25 frames per second.
[0107] D3 for N3 in this example comprises the Real-Life Violence Situations (RLVS) dataset including 2,000 video clips, equally divided between fights and normal actions. The fight videos depict physical altercations in various environments, such as streets, prisons, and schools. Videos within the RLVS dataset feature high resolutions, ranging from 480p to 720p, with durations ranging from 3 to 7 seconds.
[0108] In one example, the machine learning models M1, M2 and M3 (which are the same in this example) have been pre-trained on an initial training data set (eg, the ImageNet dataset) and a selection of layers of the machine learning models M1, M2 and M3 are frozen to retain the “knowledge” from the pretraining while allowing fine tuning to the specific task, which in this example relates to the classification of violence in a video clip.
[0109] In one example, determining the respective set of model characteristics for the respective machine learning model that itself comprises a plurality of sub-models at an independent data processing node comprises determining a unified feature representation for the plurality of sub-models.
[0110] Referring now to FIG. 5, there is shown a flowchart diagram of a method 500 for determining a unified feature representation for a machine learning model according to some embodiments. In pseudocode this may be indicated as follows: for a given i-th node N_i and associated local training data D_i, the unified feature representation F_i is defined to be: Equation 2
[0111] At block 510, method 500 comprises extracting features from each sub-model of the machine learning model.
[0112] In various examples, extracting features from the machine learning model comprises retrieving high-dimensional feature embeddings from the intermediate layers of the model, capturing essential patterns, representations, and relationships within the local training data. Depending on the machine learning model, the extracted features may comprise convolutional feature maps in CNNs, self-attention outputs in ViTs, or latent space embeddings in diffusion models.
[0113] At block 520, the extracted features from each sub-model are concatenated to form a fused feature set.
[0114] In one example, forming the fused feature set comprises integrating information from multiple perspectives, combining spatial, temporal, or modality-specific characteristics into a unified representation. In various examples, this may comprise applying feature stacking, embedding projection, or multi-scale aggregation to ensure optimal feature fusion while preserving the discriminative properties of each individual feature set. As would be appreciated, the resulting fused feature representation may function as a comprehensive, low-dimensional encoding of the independent data processing node’s local model performance and data distribution to assist the exchange of model characteristics between independent data processing nodes in accordance with the present disclosure.
[0115] At block 530, the fused feature set is processed to determine the unified feature representation having reduced dimensionality.
[0116] Referring now to FIG. 6, there is shown a flowchart diagram of a method 600 for processing a fused feature set according to some embodiments. In one example, block 530 of FIG. 5 may be implemented in accordance with method 600.
[0117] At block 610, method 600 comprises performing an independent component analysis (ICA) on the fused feature set to reduce dimensionality.
[0118] Optionally, method 600 may further comprise validating the unified feature representation (at block 620) and / or displaying high dimensional features to visualise model performance (at block 630) such as by t-distributed stochastic neighbour embedding (t-SNE).
[0119] In one example, validating the unified feature representation comprises applying one or more machine learning classifiers. In one example, three or more machine learning classifiers are trained and majority voting is applied across these classifiers to enhance robustness and reliability, ensuring the feature extraction process is effective before features are shared with other nodes.
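By way of non-limiting illustration only, the majority voting described above may be sketched as follows. The sketch assumes that each trained classifier has already produced a list of predicted labels for a held-out set of samples; the function names and the acceptance threshold `min_accuracy` are hypothetical and are not taken from the present disclosure.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers for one sample."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


def validate_features(per_classifier_labels, true_labels, min_accuracy=0.9):
    """Accept the feature representation if the voted labels are accurate enough.

    per_classifier_labels: one list of predicted labels per classifier.
    min_accuracy: hypothetical acceptance threshold (an assumption).
    """
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    voted = [majority_vote(sample) for sample in zip(*per_classifier_labels)]
    correct = sum(v == t for v, t in zip(voted, true_labels))
    accuracy = correct / len(true_labels)
    return accuracy >= min_accuracy, accuracy, voted


# Three classifiers, four samples (1 = violent, 0 = non-violent).
clf_a = [1, 0, 1, 1]
clf_b = [1, 0, 0, 1]
clf_c = [0, 0, 1, 1]
accepted, accuracy, voted = validate_features([clf_a, clf_b, clf_c], [1, 0, 1, 1])
```

In this sketch each classifier errs on at most one sample, yet the voted labels match the ground truth, which illustrates why an odd number of three or more classifiers may be preferred.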
[0120] In one example, validating the unified feature representation before any communication comprises one or more of the following processes to improve data integrity, security, and fairness. In one example, statistical consistency checks may be applied to verify the stability of feature distributions, preventing concept drift and anomalies. In one example, dimensionality reduction techniques like independent component analysis (ICA) or principal component analysis (PCA) may be applied to remove redundant information, ensuring only the most relevant features are shared.
[0121] In another example, adversarial detection models may be applied to identify tampered updates. In one example, model agreement checks may be applied using cosine similarity or Kullback-Leibler (KL) divergence to confirm alignment with previous updates.
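By way of non-limiting illustration only, the model agreement checks referred to above may be sketched as follows. The sketch assumes that an update and the previous update are available as equal-length lists of numbers, and that distributions compared by KL divergence are discrete and normalised; the threshold `min_cosine` and the smoothing constant `eps` are hypothetical values chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is a small smoothing constant (an assumption) to avoid log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def agrees_with_previous(update, previous, min_cosine=0.9):
    """Flag an update as consistent if it points the same way as the last one."""
    return cosine_similarity(update, previous) >= min_cosine


previous = [0.5, 1.0, -0.25]
similar = [0.55, 0.95, -0.2]    # small drift from the previous update
flipped = [-0.5, -1.0, 0.25]    # sign-flipped update, eg, a tampered one
```

A low cosine similarity or a large KL divergence relative to earlier updates does not by itself prove tampering; in this sketch it merely marks the update for further checks.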
[0122] Referring back to FIG. 1, at block 120, method 100 comprises communicating, for each independent data processing node, their respective set of model characteristics to each other independent data processing node of the plurality of independent data processing nodes.
[0123] Referring now to FIG. 7, there is shown an overview diagram 700 showing two independent data processing nodes comprising a transmitting node NT and a receiving node NR according to some embodiments.
[0124] In accordance with the present disclosure, a first independent data processing node selected from any of the independent data processing nodes of the peer-to-peer network of the present disclosure may seek to communicate their respective set of model characteristics to another second independent data processing node from the plurality of independent data processing nodes operable to receive the respective set of model characteristics. As a result, the first independent data processing node functions as a transmitting node NT and the second independent data processing node functions as a receiving node NR in this particular exchange of data. As would be appreciated, in a different exchange of data, the first independent data processing node could function as the receiving node while the second independent data processing node could function as the transmitting node for that particular exchange of data.
[0125] In one example, method 100 further comprises validating, by the receiving node NR, the transmitting node NT prior to communication. In this manner, before any communication of the set of model characteristics from the transmitting node NT to the receiving node NR can occur, the receiving node NR first validates the identity of the transmitting node NT.
[0126] In one example, this validating by the receiving node NR of the transmitting node NT comprises the receiving node NR validating a unique digital identifier of the transmitting node NT in accordance with an asymmetric encryption scheme based on a transmitting node NT private-public key pair.
[0127] Referring now to FIG. 8, there is shown a flowchart of method 800 for validating the unique digital identifier of a transmitting node NT in accordance with an asymmetric encryption scheme according to some embodiments.
[0128] At block 810, method 800 comprises encrypting by the transmitting node NT the unique digital identifier based on a private key of the transmitting node NT private-public key pair to form an encrypted unique digital identifier.
[0129] At block 820, method 800 comprises sending by the transmitting node NT the encrypted unique digital identifier to the receiving node NR.
[0130] At block 830, method 800 comprises decrypting by the receiving node NR the encrypted unique digital identifier based on a public key of the transmitting node NT private-public key pair.
[0131] At block 840, method 800 comprises confirming by the receiving node NR the unique digital identifier of the transmitting node NT.
[0132] In one example, a transmitting node is validated by any potential receiving node before any communication of data between the nodes. In other examples, a transmitting node may only be required to be validated by a receiving node the first time the transmitting node seeks to transmit data to the receiving node. In this manner, a new independent data processing node seeking to join the peer-to-peer network may be initially validated or verified prior to joining the network. In another example, an independent data processing node may be validated for a period of time and then require re-validation after this period of time has elapsed.
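By way of non-limiting illustration only, blocks 810 to 840 of method 800 may be sketched as follows using textbook RSA with very small primes. This is a toy and is not secure; a practical system would use a vetted cryptographic library and a standard signature scheme. The key values, the hashing of the identifier with SHA-256, and the assumption that the receiving node already knows the identifier claimed by the transmitting node are all choices made for this example rather than details of the present disclosure.

```python
import hashlib

# Toy private-public key pair for the transmitting node NT.
# Textbook RSA with tiny primes: for illustration only, NOT secure.
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent (public key)
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)


def digest(identifier: str) -> int:
    """Map the unique digital identifier to an integer below N."""
    return int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big") % N


def encrypt_identifier(identifier: str, private_key: int = D) -> int:
    """Block 810: NT encrypts its identifier with its private key."""
    return pow(digest(identifier), private_key, N)


def confirm_identifier(claimed: str, encrypted: int, public_key: int = E) -> bool:
    """Blocks 830 and 840: NR decrypts with NT's public key and compares."""
    return pow(encrypted, public_key, N) == digest(claimed)


# Block 820: the encrypted identifier is what NT sends to NR.
token = encrypt_identifier("node-T")
is_valid = confirm_identifier("node-T", token)
```

Because only the holder of the private key can produce a value that decrypts to the expected digest under the public key, a successful comparison gives the receiving node evidence of the transmitting node's identity.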
[0133] In one example, method 100 further comprises securing the communication of the respective set of model characteristics between the transmitting node and the receiving node in accordance with an encryption scheme prior to communication.
[0134] In one example, the encryption scheme is a symmetric encryption scheme based on a unique session key generated for the communication of the respective set of model characteristics from the transmitting node to the receiving node. In other examples, the unique session key may apply to communications between the transmitting node and the receiving node for a predetermined period of time.
[0135] In one example, the unique session key (eg, an AES key) is shared between the transmitting node and the receiving node in accordance with an asymmetric encryption scheme between the transmitting node and the receiving node.
[0136] In another example, the encryption scheme governing the communication of data between the transmitting node and the receiving node is an asymmetric encryption scheme based on private-public key pairs generated for the transmitting node and receiving node.
[0137] In various examples, the asymmetric encryption scheme may be selected from the following including, but not limited to: elliptic curve cryptography (ECC), particularly elliptic curve Diffie-Hellman (ECDH) for providing robust encryption with lower computational overhead, Diffie-Hellman (DH) key exchange to secure symmetric key distribution for encrypted communications, lattice-based cryptography (eg, NTRUEncrypt) to provide post-quantum security, and / or proxy re-encryption (PRE) for allowing encrypted data to be re-encrypted without decryption.
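By way of non-limiting illustration only, the use of a Diffie-Hellman (DH) key exchange to establish a shared symmetric session key may be sketched as follows. The modulus, the generator, and the derivation of the session key by hashing the shared value with SHA-256 are assumptions made for this example; the parameters are far too small for real use and a practical system would rely on a vetted library with standardised groups or curves.

```python
import hashlib
import secrets

PRIME = 2**127 - 1   # a Mersenne prime; illustrative only, too small for real use
GENERATOR = 3        # illustrative base (an assumption)


def make_key_pair():
    """Generate a private exponent and the matching public value."""
    private = secrets.randbelow(PRIME - 3) + 2   # uniform in [2, PRIME - 2]
    public = pow(GENERATOR, private, PRIME)
    return private, public


def derive_session_key(own_private, peer_public):
    """Combine the own private value with the peer's public value."""
    shared = pow(peer_public, own_private, PRIME)
    # Hash the shared value to obtain a 32-byte symmetric session key.
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


# Transmitting node NT and receiving node NR exchange only public values.
t_private, t_public = make_key_pair()
r_private, r_public = make_key_pair()
key_at_transmitter = derive_session_key(t_private, r_public)
key_at_receiver = derive_session_key(r_private, t_public)
```

Both nodes arrive at the same 32-byte key without the key itself ever being transmitted; that key could then be used with a symmetric cipher for the model characteristics.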
[0138] In one example, the asymmetric encryption scheme is a homomorphic encryption scheme.
[0139] Homomorphic Encryption (HE) enables computations on encrypted data without requiring decryption, preserving data privacy during AI training. Fully Homomorphic Encryption (FHE) schemes, such as BFV (Brakerski / Fan-Vercauteren) and CKKS (Cheon-Kim-Kim-Song), allow secure updates of model characteristics without compromising confidentiality. Partially Homomorphic Encryption (PHE) schemes, like RSA and ElGamal, support specific operations (eg, multiplication or addition) and are useful for verifying encrypted model updates. Somewhat Homomorphic Encryption (SWHE), including YASHE and BGSW, provides a balance between security and computational efficiency.
[0140] In one example, the homomorphic encryption scheme is an additive homomorphic encryption scheme. In one example, the additive homomorphic encryption scheme is Paillier encryption.
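By way of non-limiting illustration only, the additive property of Paillier encryption may be sketched as follows. The primes are tiny and the implementation is a toy that is not secure; the fixed-point factor `SCALE` used to encode floating-point weights as integers, and the handling of negative values by wrapping modulo `N`, are assumptions made for this example.

```python
import math
import secrets

P, Q = 293, 433                      # toy primes; NOT secure
N = P * Q                            # public modulus
N_SQ = N * N
G = N + 1                            # standard choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                 # modular inverse (Python 3.8+)
SCALE = 1000                         # fixed-point scale for weights (assumption)


def encrypt(weight):
    """Encrypt one weight under the public key (N, G)."""
    m = round(weight * SCALE) % N    # negative values wrap modulo N
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N_SQ


def decrypt(c):
    """Decrypt with the private key (LAM, MU) and undo the fixed-point scaling."""
    m = ((pow(c, LAM, N_SQ) - 1) // N * MU) % N
    if m > N // 2:                   # map the upper half back to negatives
        m -= N
    return m / SCALE


# Two nodes encrypt a weight each; the sum is computed without decryption.
total = decrypt(add_encrypted(encrypt(0.25), encrypt(-0.75)))
```

In this sketch the scaled sum must stay below `N // 2` in magnitude to decode correctly, which is one reason practical deployments use much larger moduli.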
[0141] In one example, a combination of FHE (eg, BFV / CKKS) and ECC (ECDH) is adopted to provide security, scalability, and privacy.
[0142] In various examples, the peer-to-peer network may include a private-public key generation module for generating respective private-public keys (eg, Paillier private-public keys) that is accessible by the independent data processing nodes as required.
[0143] In one example, method 100 further comprises, for each independent data processing node, encrypting their respective set of model characteristics to form a respective encrypted model update before communication of the respective encrypted model update to each other independent data processing node of the plurality of independent data processing nodes.
[0144] In one example, method 100 further comprises the receiving node assessing the respective set of model characteristics following their communication from the transmitting node to determine an associated integrity measure for the respective set of model characteristics. In one example, the associated integrity measure indicates one or more of data drift, cyber-attack or unlabelled data.
[0145] In a further example, assessing the respective set of model characteristics to determine an associated integrity measure comprises applying a data validation classifier to the respective set of model characteristics.
[0146] Referring now to FIG. 9, there is shown a schematic block diagram 900 for a machine learning based data validation classifier architecture according to some embodiments. In this example, data validation classifier 900 comprises a hidden adaptive chaotic dynamics neural network (HACD-NN) and is based on a combination of machine learning approaches including, but not limited to, neural networks, chaotic systems, and incremental learning and is configured to adapt to data traffic (ie, input data) communicated between the independent data processing nodes and learn from it continuously. In various examples, data validation classifier 900 will detect data drift and cyber attacks and deal with unlabelled data or any of these vulnerabilities simultaneously.
[0147] In this example, data validation classifier 900 comprises a chaotic neural network component 910 and a progressive or incremental learning component 980.
[0148] The chaotic neural network component 910 in this example comprises four layers. The first is the input layer 911, which receives the initial features in the form of the set of model characteristics that are being transferred to the receiving node. The number of neurons of input layer 911 is determined based on the input size (input_size) and the number of hidden cells (hidden_size). It converts the data into an initial representation that the chaotic neural network component 910 can process.
[0149] The next layer is chaotic layer 912 which applies a chaotic model to simulate complex dynamics and to detect non-linear and hidden patterns in the data, enhancing the model's sensitivity to sudden changes and drifts. In one example, chaotic layer 912 uses a chaotic sine function to improve the detection of non-linear patterns, data drift, and anomalies. By introducing dynamic transformations, this layer increases the model's adaptability to evolving patterns and emerging cyber threats. Additionally, it enhances robustness against adversarial attacks by incorporating unpredictable transformations, preventing malicious exploitation, and ensuring better generalisation across diverse data sources, ultimately strengthening the security and adaptability of the system.
[0150] Chaotic neural network component 910 then comprises hidden layer 913, which assists to reduce the features and improve their representation before passing them to the final output layer 914. Hidden layer 913 also assists in learning complex patterns and eliminating overfitting through fewer neurons. The final output layer 914 classifies the predictions as normal and abnormal and provides the decision for the model.
[0151] In various examples, activation functions such as a rectified linear unit (ReLU) may be adopted which assist in adding non-linear properties to the network and reduce scaling problems. This design aims to extract Feature Embeddings and deal with unlabelled data, chaotic patterns and drift in the data. Moreover, it provides a dynamic response to sudden changes. It can be retrained when any new changes are detected.
[0152] In this example, the progressive or incremental learning component 980 comprises an incremental learning layer or model 981 in the form of a stochastic gradient descent classifier (SGDClassifier) where instead of raw data, the feature embeddings from output layer 914 of the chaotic neural network component 910 are employed. In this manner, incremental learning layer 981 may be retrained when any changes in data are detected that are classified as abnormal.
[0153] Data validation classifier 900 then comprises a final layer 950 where the classification results of the two models (ie, in this example “Normal” or “Abnormal”) are combined by, in this example, calculating the average classifications received from the chaotic neural network component 910 and the incremental learning component 980 and determining the final decision based on the voting technique for the average classifications.
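By way of non-limiting illustration only, the combination performed at final layer 950 may be sketched as follows. The sketch assumes that each of the two components outputs a score between 0 and 1 indicating how likely the input is to be abnormal, and that the averaged score is compared against a threshold of 0.5; both the score format and the threshold are assumptions made for this example.

```python
def combine_decisions(chaotic_score, incremental_score, threshold=0.5):
    """Average the two component scores and map the result to a decision.

    chaotic_score: score from the chaotic neural network component.
    incremental_score: score from the incremental learning component.
    threshold: hypothetical cut-off for the averaged score.
    """
    average = (chaotic_score + incremental_score) / 2
    return "Abnormal" if average >= threshold else "Normal"


# The components disagree; the averaged score decides the outcome.
decision_a = combine_decisions(0.9, 0.4)   # average 0.65
decision_b = combine_decisions(0.2, 0.3)   # average 0.25
```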
[0154] In this manner, data validation classifier 900 is configured to detect data drift and cyber attacks and further has the ability to adapt from the data and retrain itself when detecting any changes in the data to ensure a fast and accurate response to changes and drifts. In one example, data validation classifier 900 may be trained on two datasets characterising abnormal network traffic (ie, CIC UNSW-NB15 and CIC IoT 2023) to ensure the model can detect data drifts and cyber attacks.
[0155] In one example, communicating, for each independent data processing node, their respective set of model characteristics to each other independent data processing node comprises each of the independent data processing nodes updating their respective set of model characteristics (eg, respective encrypted model updates) to each other independent data processing node at substantially the same time or simultaneously. In various examples, this substantially parallel updating process is subject to the various validation and encryption processes described in the present disclosure.
[0156] In one example, each independent data processing node operates as a blockchain node together forming a blockchain network having a corresponding shared blockchain based ledger for recording updates. In one example, an update is recorded on the corresponding shared blockchain based ledger following a proof of share operation between the plurality of independent data processing nodes. In this manner, the blockchain records each independent data processing node’s interaction and model update to ensure that no single data processing node can manipulate the process or tamper with the results.
[0157] In one example, the peer-to-peer network employs a Shamir secret sharing scheme where each data processing node shares its cryptographic shares with its update. These shares are distributed across multiple nodes, ensuring no single participant can access the full model update or sensitive data before validating the node by reconstructing the secret using its share and other nodes’ shares using a predefined threshold. Only when enough shares are combined (meeting a predefined threshold) can the secret be reconstructed, further safeguarding data privacy.
[0158] In various examples, the peer-to-peer network may include a Shamir share generation module for generating cryptographic shares based on the defined threshold that is accessible by the independent data processing nodes of the peer-to-peer network.
[0159] In one example, an independent data processing node will request the generation of cryptographic shares from the Shamir share generation module based on the total number of shares and a threshold that defines how many shares are needed to reconstruct the secret. These cryptographic shares are generated and then returned to the independent data processing node which retains one of the cryptographic shares and distributes the remaining cryptographic shares to the other independent data processing nodes.
[0160] In the proof of share operation, each individual data processing node will submit its own cryptographic share and cryptographic shares it has received from the other nodes for verification upon correct reconstruction of the original secret.
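By way of non-limiting illustration only, the generation of cryptographic shares and the reconstruction of the secret from a threshold number of shares may be sketched as follows using Shamir secret sharing over a prime field. The field modulus, the function names and the example secret are assumptions made for this example.

```python
import secrets

PRIME = 2**127 - 1   # field modulus (a Mersenne prime); must exceed the secret


def make_shares(secret, total_shares, threshold):
    """Split a secret into shares, any `threshold` of which reconstruct it."""
    # Random polynomial of degree threshold - 1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, total_shares + 1):
        y = 0
        for c in reversed(coeffs):   # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _yj) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# Five shares are generated; any three of them recover the secret.
secret = 123456789
shares = make_shares(secret, total_shares=5, threshold=3)
recovered = reconstruct(shares[:3])
```

In a proof of share operation along the lines described above, a node could submit its retained share together with the shares received from other nodes, and verification would succeed only if the reconstructed value matches the original secret.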
[0161] Following the verification of the proof of share operation, new blocks may be added to the blockchain based on the verified shares. Accordingly, each independent data processing node will send a request to the blockchain to create a block along with a proof that its shares are valid and a new block will be created on the blockchain containing the independent data processing node’s set of model characteristics, which may be encrypted and / or validated in accordance with the present disclosure, and added to the shared blockchain ledger.
[0162] An independent data processing node may optionally request validation of the shared blockchain ledger at any time to ensure consistency and correctness.
[0163] Referring now to FIG. 10, there is shown a system interaction diagram 1000 of an encrypted blockchain based distributed machine learning architecture 1000 according to some illustrative embodiments. Independent data processing node 1010 in this example initially requests public / private key pair generation from Paillier Key generation module 1020 which generates the Paillier public / private key pairs and returns them to independent data processing node 1010. As discussed previously, these keys may be used where applicable for homomorphic encryption within the peer-to-peer network to securely encrypt the sets of model characteristics being transmitted between the nodes while potentially allowing computation on this data without decrypting it (eg, encrypted set of model characteristics 1050 which will form a “block” on the blockchain).
[0164] After obtaining the private / public key pairs, the independent data processing node 1010 now interacts with the secret sharing module 1030 to split its sensitive data (eg, a secret or a share of a blockchain-related value).
[0165] Independent data processing node 1010 requests the generation of shares based on the total number of shares and a threshold that defines how many shares are needed to reconstruct the secret from share generation module 1031 which returns the shares to the node 1010 where the independent data processing node 1010 retains one share, while the rest are distributed to other independent data processing nodes forming the peer-to-peer network.
[0166] Once all the shares are distributed, independent data processing node 1010 submits its own share and the shares it has received from the other nodes for verification to secret sharing module 1030 which verifies that the submitted shares are valid and correctly reconstructs the original secret 1032. Independent data processing node 1010 proceeds to the blockchain creation step only if the shares are valid.
[0167] With valid shares, independent data processing node 1010 now interacts with the Proof of Share Blockchain module 1040 to add new blocks based on the verified shares.
[0168] Independent data processing node 1010 sends a request to the blockchain for block creation 1045, along with proof that its shares are valid (this is done using the reconstructed secret from the previous step). The blockchain module 1040 accepts the request and creates a new block 1045. This new block contains the set of model characteristics 1050 from independent data processing node 1010 that is to be communicated to the other independent data processing nodes comprising the peer-to-peer network as validated by its share and the reconstructed secret.
[0169] Once the block is created it is added to the Blockchain Ledger 1046 which records every block in a decentralised manner. The newly created block is appended to the blockchain ledger at module 1048 ensuring immutability and trust. This ledger is distributed across the network for all nodes to verify and maintain consensus.
[0170] Independent data processing node 1010 (or any other node at module 1065) may request validation of the blockchain ledger 1046 after adding new blocks 1048 or at any given point to ensure consistency and correctness of the blockchain. In one example, independent data processing node 1010 sends a request to validate the ledger 1060 which checks if all blocks have valid signatures and are consistent with the previous blocks. Once validated, the ledger is locked, and further modifications require network consensus. In one example, Excel files are used as an intermediary data storage format before data is structured and integrated into the blockchain and these files are separately processed at module 1070. As would be appreciated, any suitable intermediary buffer or data storage arrangement which optimally is tamper evident may be adopted.
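By way of non-limiting illustration only, appending blocks to a ledger and later validating that each block is consistent with the previous block may be sketched as follows using a simple hash chain. The block layout, the use of SHA-256 over a JSON encoding, and the all-zero hash for the first block are assumptions made for this example; signature checks, consensus and distribution of the ledger across nodes are omitted.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64   # placeholder previous-hash for the first block


def block_hash(index, previous_hash, payload):
    """Hash the block contents in a deterministic JSON encoding."""
    body = json.dumps(
        {"index": index, "previous_hash": previous_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode()).hexdigest()


def append_block(ledger, payload):
    """Create a block linked to the current tail and append it."""
    previous_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    index = len(ledger)
    block = {
        "index": index,
        "previous_hash": previous_hash,
        "payload": payload,
        "hash": block_hash(index, previous_hash, payload),
    }
    ledger.append(block)
    return block


def validate_ledger(ledger):
    """Check every block's own hash and its link to the previous block."""
    expected_previous = GENESIS_HASH
    for index, block in enumerate(ledger):
        if block["index"] != index or block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(index, block["previous_hash"], block["payload"]):
            return False
        expected_previous = block["hash"]
    return True


ledger = []
append_block(ledger, {"node": "N1", "update": [0.1, 0.2]})
append_block(ledger, {"node": "N2", "update": [0.3, 0.4]})
valid_before = validate_ledger(ledger)
ledger[0]["payload"]["update"] = [9.9, 9.9]   # simulate tampering with a block
valid_after = validate_ledger(ledger)
```

Because each block's hash covers its payload and the previous block's hash, altering an earlier block is detected when the ledger is next validated.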
[0171] Referring now to FIG. 11, there is shown a unified modelling language (UML) diagram 1100 of the processing components of the encrypted blockchain based distributed machine learning architecture illustrated in FIG. 10 according to some illustrative embodiments.
[0172] Referring now to FIG. 12, there is shown a data flow diagram 1200 showing the processing and communication steps for the encrypted blockchain based distributed machine learning architecture illustrated in FIG. 10 according to some illustrative embodiments.
[0173] Data flow diagram 1200 shows the steps carried out by independent data processing node 1010 of requesting key generation (in this example Paillier key generation) and subsequent receipt of a public / private key (data exchange 1210), request to generate shares and subsequent receipt of a share and distribution of shares to the other independent data processing nodes (data exchange 1220), submission of share and shares for verification and on verification (data exchange 1230) a request to create a block on the blockchain and confirmation that a block may be created (data exchange 1240), a request to add block from Excel which is used as an intermediary data storage format in this example and confirmation that the block has been added to the blockchain (data exchange 1250) and finally a request to validate the ledger and confirmation that the ledger has been updated (data exchange 1260). If any invalid blocks are found (for instance, if an independent data processing node tries to tamper with the blockchain), then this will be indicated and appropriate action may be taken to maintain the integrity of the blockchain. In other examples, methods and systems in accordance with the present disclosure may, where secure and authenticated communications are a concern, adopt any suitable authenticated peer-to-peer operating arrangement.
[0174] Referring back to FIG. 1, at block 130, method 100 comprises determining a respective updated machine learning model at a selected independent data processing node of the plurality of independent data processing nodes based on its own locally determined set of model characteristics and each received set of model characteristics from each other independent data processing node.
[0175] In one example, determining the respective updated machine learning model at the selected independent data processing node of the plurality of independent data processing nodes based on its own locally determined set of model characteristics and each received set of model characteristics from each other independent data processing node comprises combining the own locally determined set of model characteristics with a selection of model characteristics from the received sets of model characteristics to form a combination or subset of model characteristics.
[0176] Referring now to FIG. 4, there is shown a flowchart of a method 400 for determining the respective updated machine learning model at the selected independent data processing node according to some embodiments. At block 410, method 400 comprises generating a global model at the selected independent data processing node where the machine learning model is to be updated. In various examples, the global model may comprise at least one of a fused feature representation or a fused parameterised backbone. At block 420, one or more classifiers are then trained based on the global model to generate a final classification result for the updated machine learning model. In various examples, training the one or more classifiers comprises training a plurality of classifiers to generate associated prediction outputs and aggregating the prediction outputs to generate the final classification result. In various examples, this aggregating process may comprise at least one of voting, averaging, weighted averaging, or applying a learned ensemble.
[0177] In one example, for a node $N_i$, combining the node's own locally determined set of model characteristics and each received set of model characteristics from each other independent data processing node comprises combining the local model characteristics comprising the unified feature representation $F_i$ for node $N_i$ with the received model characteristics in the form of unified feature representations originating from all the other independent data processing nodes to generate a global model in the form of a fused feature representation $F_{\text{fused},i}$ (see block 420 of FIG. 4), which may be defined as follows:
$$F_{\text{fused},i} = F_i \cup \bigcup_{j \neq i} F_j$$ (Equation 3)
[0178] where $j$ comprises a selection from the other data processing nodes forming the peer-to-peer network.
[0179] In one example, the fused feature representation is processed to determine a fused selected feature representation $F_{\text{selected},i}$ having reduced dimensionality, which is based on the aggregated feature representations from all the nodes. In one example, an ICA is applied to $F_{\text{fused},i}$. This may be represented in pseudocode as follows:
$$F_{\text{selected},i} = \mathrm{ICA}(F_{\text{fused},i})$$ (Equation 4)
[0180] In one example, the fused feature representation is validated. In one example, validating the fused feature representation comprises training multiple machine learning classifiers on the fused feature set to assess its discriminative power and generalisability. In one example, at least three independent classifiers, such as a support vector machine (SVM), Random Forest (RF), k-Nearest Neighbour (k-NN), and / or Gradient Boosting (eg, XGBoost, LightGBM) are selected to evaluate the quality of the fused feature representation. Each classifier independently processes the unified feature representation and makes predictions. To enhance reliability, majority voting is then applied across these classifiers and the final decision is based on the most commonly predicted outcome.
[0181] In one example, determining a respective updated machine learning model at a selected independent data processing node (ie, $N_i$) comprises training a classifier using the fused selected feature representation $F_{\text{selected},i}$ determined for the node (see block 420 of FIG. 4).
[0182] In one example, training a classifier comprises training a plurality of classifiers to generate associated prediction outputs. This may be represented in pseudocode as follows:
$$y_{k,i} = C_k(F_{\text{selected},i}), \quad k = 1, 2, \ldots, K$$ (Equation 5)
[0183] where $C_k$ is a particular classifier and $y_{k,i}$ is the predictions for the $k$-th type classifier operating at the independent data processing node.
[0184] In one example, the prediction outputs from all classifiers may then be aggregated to form a global or final prediction result $y_{\text{global}}$ by majority voting as follows:
$$y_{\text{global}} = \operatorname{mode}\{y_{1,i}, y_{2,i}, \ldots, y_{K,i}\}$$ (Equation 6)
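By way of non-limiting illustration only, majority voting over the prediction outputs of several classifiers may be sketched as follows. The function name `majority_vote`, the example label lists and the tie-breaking behaviour (the label seen first among tied labels is kept) are assumptions made for this sketch; the disclosure does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine K label lists (one per classifier) into one label per sample."""
    n_samples = len(predictions_per_classifier[0])
    if any(len(p) != n_samples for p in predictions_per_classifier):
        raise ValueError("every classifier must predict the same samples")
    final = []
    # zip(*...) regroups the labels so each tuple holds all votes for one sample.
    for sample_votes in zip(*predictions_per_classifier):
        # most_common(1) yields the most frequent label; on a tie the label
        # encountered first among the votes is returned.
        label, _count = Counter(sample_votes).most_common(1)[0]
        final.append(label)
    return final


# Hypothetical predictions from three classifiers for four samples.
y_svm = [0, 1, 1, 2]
y_rf = [0, 1, 2, 2]
y_knn = [1, 1, 2, 0]
y_final = majority_vote([y_svm, y_rf, y_knn])
```

Here each sample receives the label predicted by at least two of the three classifiers.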
[0185] In other examples, other aggregation, ensemble layers or voting approaches may be adopted as required to generate a final prediction result.
[0186] Referring now to FIG. 13, there is shown a system overview diagram of a system 1300 for determining a shared machine learning model architecture where respective independent data processing nodes comprise a component of the shared machine learning model architecture also resident at each of the independent data processing nodes according to some embodiments.
[0187] As depicted in FIG. 13, each independent data processing node comprises respective local node training data $D_1, D_2, \ldots, D_N$ and respective machine learning models $M_1, M_2, \ldots, M_N$. In this example, the machine learning model for the $i$-th independent data processing node comprises a component $h_i$ of a shared machine learning model architecture MLA also resident at each independent data processing node.
[0188] In this example, determining a respective updated machine learning model at a selected independent data processing node, $N_k$, of the plurality of independent data processing nodes based on combining its own locally determined set of model characteristics and each received set of model characteristics from each other independent data processing node (eg, see block 130 shown in FIG. 1) comprises determining an updated shared machine learning model architecture MLA based on its own locally determined set of model characteristics for its component of the shared machine learning model architecture MLA and the received sets of model characteristics for each of the other components of shared machine learning model architecture.
[0189] This process can be seen in FIG. 14, which is a system overview diagram of the system 1400 for determining the shared machine learning model architecture illustrated in FIG. 13 showing the determining of an updated shared machine learning model architecture MLA' following communication of sets of model characteristics from independent data processing nodes (ie, the sets of model characteristics relating to all the different components of the MLA trained at their respective independent data processing nodes) to a selected independent data processing node $N_k$ according to some embodiments. In this manner, the updated shared machine learning model architecture MLA' has the benefit of the training of the separate components of the shared machine learning model architecture at their respective independent data processing nodes without the respective local node training data of the independent data processing nodes requiring disclosure.
[0190] As can be seen in FIGS. 13 and 14, each independent data processing node also includes the entire shared machine learning model architecture comprising the multiple components so that any independent data processing node may in principle be selected and the shared machine learning architecture then determined based on the set of model characteristics for itself, and its own component, and the sets of model characteristics for the other components of the shared machine learning model architecture to determine their own respective updated shared machine learning model architecture. In other examples, each independent data processing node may update each other independent data processing node at substantially the same time adopting security, validation and blockchain protocols in accordance with the present disclosure.
[0191] In various examples, the shared machine learning model architecture may be selected from the following architectures including, but not limited to:
• shared CNN architectures where feature extractors and embeddings are shared across independent data processing nodes;
• shared autoencoder architectures where encoded latent representations are shared across independent data processing nodes;
• shared GAN architectures where generator and discriminator components are shared across independent data processing nodes;
• shared Diffusion Model architectures where noise predictor networks are shared across independent data processing nodes;
• shared GNN architectures where node embeddings are shared across independent data processing nodes;
• shared Capsule Network architectures where capsule activations are shared across independent data processing nodes;
• shared Mixture of Experts (MoE) architectures where specialised expert networks are shared across independent data processing nodes; and / or
• shared RNN and / or LSTM architectures where hidden states are shared across independent data processing nodes.
[0192] In one example, the shared machine learning model architecture is in the form of a multi-head transformer comprising multiple heads $h_1, h_2, \ldots, h_N$ (eg, see FIGS. 13 and 14) and where a component of the shared machine learning model architecture comprises a respective “head” (ie, $h_i$) of the multi-head transformer.
[0193] The core of a multi-head transformer is the transformer encoder comprising multiple layers of multi-head self-attention (MHSA) and feed-forward networks (FFN). Transformers are a deep learning architecture designed to handle sequential data efficiently using a self-attention mechanism. Unlike traditional models such as RNNs, which process data sequentially, Transformers can process entire input sequences in parallel, significantly improving speed and scalability. This architecture is widely used in natural language processing, computer vision, and multi-modal learning.
[0194] A feature of a multi-head transformer is the multi-head self-attention mechanism which allows multiple attention heads to focus on different aspects of the input data simultaneously. This enhances the model’s ability to capture relationships across different data features, leading to more comprehensive learning. In accordance with the present disclosure, different attention heads may be processed or trained on separate independent data processing nodes. This distribution improves scalability, enabling larger models to be trained efficiently without overloading a single processing node. It also enhances privacy by ensuring that training data remains on local nodes while only learned representations (eg, embeddings or weights) are exchanged.
[0195] In accordance with this example, determining at each independent data processing node a respective set of model characteristics from training the respective machine learning model on the respective local node data comprises each independent data processing node determining a respective set of model characteristics for the respective “head” based on its local node training data in the form of trained head weights for the respective trained head at the independent data processing node.
[0196] In one example, the multi-head transformer comprises an ensemble layer to combine prediction outputs from the trained heads to produce a final classification result.
[0197] In one example, the ensemble layer averages the prediction vectors from the trained heads of the multi-head transformer model to determine a final prediction vector. In various examples, each head $h_i$ is trained on respective local node training data corresponding to the independent data processing node, which may have a specific modality, and will generate a prediction vector $p_i$ for a batch of size $B$ and $C$ classes, where $p_i \in \mathbb{R}^{B \times C}$ contains the probabilities for each class across the batch.
[0198] For a selected independent data processing node $N_k$ with, in this example, $N$ heads corresponding to the $N$ independent data processing nodes, the ensemble process combines predictions from all heads by averaging their outputs. The final prediction vector $p_k$ for selected node $N_k$ may then be computed as follows:
$$p_k = \frac{1}{N} \sum_{i=1}^{N} p_i$$ (Equation 7)
[0199] where $p_i$ is the prediction output from the $i$-th head corresponding to the $i$-th independent data processing node. Where the respective local node training data corresponds to different modalities, this averaging operation ensures that each modality contributes equally to the final decision, promoting fairness and reducing the impact of modality-specific noise or biases.
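By way of non-limiting illustration only, the averaging of per-head prediction vectors described above may be sketched as follows, with each prediction represented as a nested list of shape batch size by number of classes. The function names and the numerical values are assumptions made for this sketch and are not taken from the disclosure.

```python
def average_head_predictions(head_predictions):
    """Average N prediction matrices (each B x C) element-wise."""
    n_heads = len(head_predictions)
    batch = len(head_predictions[0])
    classes = len(head_predictions[0][0])
    return [
        [sum(p[b][c] for p in head_predictions) / n_heads for c in range(classes)]
        for b in range(batch)
    ]


def predicted_classes(prediction_matrix):
    """Pick the class index with the highest averaged probability per sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in prediction_matrix]


# Hypothetical outputs of three heads for a batch of two samples and two classes.
p_1 = [[0.9, 0.1], [0.2, 0.8]]
p_2 = [[0.6, 0.4], [0.4, 0.6]]
p_3 = [[0.3, 0.7], [0.3, 0.7]]

p_final = average_head_predictions([p_1, p_2, p_3])
labels = predicted_classes(p_final)
```

In this sketch the third head disagrees with the other two on the first sample, and the averaged vector still favours the first class for that sample.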
[0200] As would be appreciated, the ensemble process functions to aggregate the predictions for each class across all heads, resulting in a smoothed prediction that is less sensitive to individual modality errors. This approach not only improves accuracy but also ensures that the final decision reflects a balanced understanding of any multi-modal data.
[0201] Accordingly, in various examples, each head is trained independently on its respective local node training data, which may vary in modality, such as medical imaging, sensor data, or financial transactions. Each head specialises in learning unique feature representations based on the specific dataset at its assigned node, enhancing the diversity and robustness of the overall model. Once trained, each head generates a prediction vector containing probability distributions across all possible classes for a given batch of data. The ensemble layer then aggregates these predictions across multiple heads to compute the final prediction. This is typically done through averaging, weighted voting, or other fusion techniques, ensuring that the model benefits from the strengths of each head while mitigating individual biases. This approach enhances model generalization, reduces bias, and enables multi-modality integration without sharing raw data.
[0202] As would be appreciated, each of the heads of the multi-head transformer architecture (ie, each component of the machine learning model architecture) may be trained on local training data corresponding to an independent data processing node having a different modality.
[0203] As referred to previously, the machine learning model for one or more independent data processing nodes may be pre-trained. In the example where the respective machine learning model at each independent data processing node is a “head” of a multi-head transformer, the respective head may be pre-trained based on pre-training data provided to the independent data processing node.
[0204] More generally, one or more pre-training datasets $\{PTD_1, PTD_2, PTD_3, \ldots, PTD_D\}$ may be assigned to pre-train the head of a specific node $N_i$, where the subscript $D$ refers to the total number of datasets available to pre-train the various heads. These pre-training datasets may be publicly available or determined by agreement between the entities operating the independent data processing nodes to be shared within the peer-to-peer network and may be contrasted to the local node training data throughout this disclosure.
[0205] Considering an independent data processing node $N_i$, the pre-training process for its corresponding head $h_i$ may be formulated as:
$$W_i^{\text{pre}} = \arg\min_{W} \sum_{(x,y) \in PTD_i} L(h_i(x; W), y)$$ (Equation 8)
[0206] where $W_i^{\text{pre}}$ is the learned weight of the head $h_i$, $(x, y)$ is a sample of the pre-training dataset $PTD_i$, which in this pre-training example comprises one or more of the pre-training datasets as referred to above and which contains $x$ as the input image and $y$ as the class labels, $L$ is the cross-entropy loss function, and $h_i(x; W)$ refers to the output of the head $h_i$ for the input $x$ given weights $W$.
[0207] As would be appreciated, the independent data processing node $N_i$ comprises the entire multi-head transformer so in principle all, or a selection, of the heads, ie, not only head $h_i$, may be pre-trained at a respective independent data processing node based on the pre-training data or a selection of data from the pre-training data.
[0208] Accordingly, independent data processing node $N_i$ may, following a pre-training exercise, generate a set of pre-trained head weights, eg, $\{W^{(1)}, \ldots, W^{(i)}, \ldots, W^{(Q)}\}$ corresponding to the heads $\{h_1, h_2, \ldots, h_i, \ldots, h_Q\}$, where $Q$ indicates the total number of heads that are pre-trained at independent data processing node $N_i$ and which is $\leq N$, $N$ being the total number of independent data processing nodes. Independent data processing node $N_i$ may then transfer these pre-trained head weights to one or more of the other independent data processing nodes comprising the peer-to-peer network where they may be used to form one or more updated pre-trained heads.
[0209] Considering again independent data processing node $N_i$, but now from the perspective of receiving a set of pre-trained head weights from one or more of the independent data processing nodes, the pre-trained head weights for the various heads of the multi-head transformer architecture are fused by averaging the learned weights for a given head received from the independent data processing nodes. This may be expressed as follows:
$$W_q^{(t+1)} = \frac{1}{P} \sum_{p=1}^{P} W_q^{(p)}$$ (Equation 9)
[0210] where $W_q^{(t+1)}$ is the updated weight for the $q$-th head at independent data processing node $N_i$ in iteration $t+1$, $P$ is the number of independent data processing nodes from which pre-trained head weights for the $q$-th head are received, and $W_q^{(p)}$ is the weight for the $q$-th head received from the $p$-th independent data processing node.
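By way of non-limiting illustration only, the fusing of received pre-trained head weights by averaging may be sketched as follows, with the weights of each head flattened into a list of numbers. The function name `fuse_head_weights`, the head identifiers and the numerical values are assumptions made for this sketch; a practical implementation would average tensors layer by layer.

```python
def fuse_head_weights(received):
    """Average, per head, the weight vectors received from peer nodes.

    `received` maps a head identifier to the list of flat weight vectors
    received for that head, one vector per sending node.
    """
    fused = {}
    for head_id, vectors in received.items():
        count = len(vectors)  # number of nodes that sent weights for this head
        # zip(*vectors) groups the same weight position across all senders.
        fused[head_id] = [sum(column) / count for column in zip(*vectors)]
    return fused


# Hypothetical weights: head "h1" received from two nodes, head "h2" from three.
received = {
    "h1": [[0.2, 0.4], [0.4, 0.8]],
    "h2": [[1.0, -1.0], [0.0, 1.0], [2.0, 3.0]],
}
fused = fuse_head_weights(received)
```

Each head is averaged over only those nodes that actually supplied weights for it, so the divisor may differ from head to head.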
[0211] In one example, following the above-described pre-training process all layers except the final few of the respective head are frozen. In one example, only the classification layers of the respective head are made trainable to fine-tune the output based on local training data. This freezing of layers ensures that the features learned during pretraining are not lost.
[0212] Define $W_i^{\text{pre}}$ as the pre-trained weight for the head $h_i$. In one example, the feature extraction layers of the head $h_i$ are frozen, and only the classification weights $\theta_{\text{cls}}^{(i)}$ are updated through the training of head $h_i$ on local node training data $D_i$ as follows:
$$\theta_{\text{cls}}^{(i)} = \arg\min_{\theta} \sum_{(x,y) \in D_i} L(h_i(x; W_i^{\text{pre}}, \theta), y)$$ (Equation 10)
[0213] $W_i^{\text{pre}}$ remains fixed during the training process, while $\theta_{\text{cls}}^{(i)}$ are the trained head weights resulting from training of head $h_i$ on local node training data $D_i$, which then comprise the set of model characteristics for head $h_i$ that are communicated to the other independent data processing nodes as discussed previously.
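By way of non-limiting illustration only, the effect of freezing the feature extraction layers while training only the classification weights may be sketched as a single gradient-descent step that skips frozen parameters. The parameter names, the `cls.` prefix convention, the learning rate and the gradient values are assumptions made for this sketch; in practice the gradients would be obtained by backpropagation of the cross-entropy loss on local node training data.

```python
def fine_tune_step(params, grads, lr=0.1, trainable_prefix="cls."):
    """Apply one gradient-descent step to classification parameters only."""
    updated = {}
    for name, value in params.items():
        if name.startswith(trainable_prefix):
            # Trainable classification weight: move against the gradient.
            updated[name] = value - lr * grads[name]
        else:
            # Frozen pre-trained feature extractor weight: left unchanged.
            updated[name] = value
    return updated


# Hypothetical scalar parameters and gradients for one head.
params = {"backbone.w": 0.50, "backbone.b": -0.20, "cls.w": 1.00, "cls.b": 0.00}
grads = {"backbone.w": 0.30, "backbone.b": 0.10, "cls.w": 0.40, "cls.b": -0.20}

new_params = fine_tune_step(params, grads)

# Only the trained classification weights would be sent to peer nodes.
shared_with_peers = {k: v for k, v in new_params.items() if k.startswith("cls.")}
```

The frozen parameters retain their pre-trained values regardless of their gradients, which is the behaviour relied upon to preserve the features learned during pre-training.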
[0214] Referring now to FIG. 15, there is shown a system overview diagram of the system 1500 for determining a shared machine learning model architecture illustrated in FIG. 13 showing the adding of a new independent data processing node according to some embodiments. In one example, a new independent data processing node $N_{\text{new}}$ having a corresponding head $h_{\text{new}}$ and associated local node training data $D_{\text{new}}$ is added to the peer-to-peer network and the shared machine learning architecture MLA in the form of the multi-head transformer is updated to include the new head.
[0215] In one example, the new head $h_{\text{new}}$ of the new node $N_{\text{new}}$ is initialised with averaged head weights $W_{\text{avg}}$ obtained from averaging the head weights of one or more of the other independent data processing nodes comprising the peer-to-peer network. This initialization assists the new head of the new independent data processing node to commence with a well-generalized set of parameters and further assists in enabling rapid adaptation of $h_{\text{new}}$ to its local node training data $D_{\text{new}}$.
[0216] In one example, following this initialisation, the new independent data processing node $N_{\text{new}}$ trains its head $h_{\text{new}}$ using its local node training data $D_{\text{new}}$. In accordance with the present disclosure, the new head $h_{\text{new}}$ is trained by minimising the cross-entropy loss between the model predictions and the actual labels for the local node training data as follows:
$$W_{\text{new}}^{*} = \arg\min_{W} \sum_{(x,y) \in D_{\text{new}}} L(h_{\text{new}}(x; W), y)$$ (Equation 11)
[0217] where $W_{\text{new}}^{*}$ represents the optimised weights for the head $h_{\text{new}}$ at data processing node $N_{\text{new}}$, $(x, y)$ are input-output pairs from the local dataset $D_{\text{new}}$, $h_{\text{new}}(x; W)$ is the prediction output, and $L$ is the cross-entropy loss function as has been discussed previously.
[0218] Once the local training on $N_{\text{new}}$ is complete, the independent data processing node shares its updated head weights for the head $h_{\text{new}}$ with the other independent data processing nodes in the peer-to-peer network in accordance with the present disclosure.
[0219] In one example, a weighted averaging mechanism is applied to update the global head weights of the multi-head transformer architecture without disrupting the existing learning process by updating the global head weights as follows:Equation 12
[0220] whererepresents the updated weight of the z-th head at iterationthe weights corresponding to head hnewfrom the j-th node, and N+l accounts for the addition of the new node Nnew.
[0221] After the integration of the trained head weights from the new independent data processing node, the ensemble layer is updated to include predictions from $h_{\text{new}}$ as previously described and again each head contributes independently to the final decision, and their predictions are averaged to generate the final output.
[0222] In another example, a selected independent data processing node $N_{\text{rem}}$ may be removed from or depart the peer-to-peer computer network. In various examples, the removal of $N_{\text{rem}}$ may occur as a result of networking issues, changes in policy or computational limitations arising at a node. In one example, the contribution of weights $W_{\text{rem}}$ associated with $N_{\text{rem}}$ is excluded from the global model through a reaggregation process, where the global model is redetermined based solely on the remaining independent data processing nodes of the peer-to-peer network. This ensures that the overall representation is updated without the need for retraining or architectural changes.
[0223] In the example of an ensemble process, the contributions from the remaining nodes following removal of $N_{\text{rem}}$ are rebalanced by dynamically adjusting the weighting of their individual predictions, as a result maintaining consensus across the peer-to-peer network, in an operation analogous to Equation 12 but instead now removing $W_{\text{rem}}$ associated with node $N_{\text{rem}}$, noting that in a multi-head transformer architecture $W_{\text{rem}}$ may include weights corresponding to multiple heads. In this manner, the peer-to-peer network of independent data processing nodes will adapt to the addition and / or removal of nodes without requiring retraining.
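By way of non-limiting illustration only, one possible way of adapting an equally weighted average of head weights when a node joins or departs, without retraining and without re-collecting weights from the remaining nodes, is to update a running mean. The function names, the assumption of equal weighting across nodes and the numerical values are assumptions made for this sketch and do not reproduce the form of aggregation set out in the disclosure.

```python
def add_node(mean, count, new_weights):
    """Fold a joining node's weights into a running mean over `count` nodes."""
    updated = [(count * m + w) / (count + 1) for m, w in zip(mean, new_weights)]
    return updated, count + 1


def remove_node(mean, count, departing_weights):
    """Take a departing node's weights out of a running mean over `count` nodes."""
    if count < 2:
        raise ValueError("cannot remove the last remaining node")
    updated = [(count * m - w) / (count - 1) for m, w in zip(mean, departing_weights)]
    return updated, count - 1


# Hypothetical running mean of one head's weights over three nodes.
mean, count = [1.0, 2.0], 3

# A new node joins with its locally trained weights for the same head.
w_new = [5.0, 6.0]
mean_after_join, count_after_join = add_node(mean, count, w_new)

# The same node later departs; its contribution is removed again.
mean_after_leave, count_after_leave = remove_node(
    mean_after_join, count_after_join, w_new
)
```

Where a departing node contributed weights for more than one head, the removal step in this sketch would be applied once per affected head.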
[0224] As would be appreciated, the above referenced approach allows new independent data processing nodes to join the peer-to-peer network dynamically, without the need to restart the training process or compromise the learning already carried out for the existing independent data processing nodes. The above averaging mechanism for weight integration may be seen to avoid abrupt changes in the multi-head transformer architecture. Additionally, the ensemble approach ensures that all independent data processing nodes, including any new nodes that are added dynamically to the peer-to-peer network, contribute equally to the final decision-making process.
[0225] In one example, the set of model characteristics determined by training a respective machine learning model at a given independent data processing node may be weighted in accordance with one or more properties of the independent data processing node to balance the contribution of the determined set of model characteristics from the node when they are combined with the sets of model characteristics from other independent data processing nodes.
[0226] In various examples, the properties of the independent data processing node may comprise characteristics of the local node training data, characteristics of the machine learning model for the given independent data processing node, and / or characteristics of the associated training task.
[0227] In one example, where the respective machine learning model at each independent data processing node comprises a component of a shared machine learning model architecture also resident at each of the independent data processing nodes, such as a head of a multi-head transformer, and the model characteristics comprise the head weights for the head associated with the independent data processing node, then the weights may be weighted in accordance with one or more properties of the independent data processing node.
[0228] In one example, the weights for each independent data processing node $N_i$ of the peer-to-peer network comprising multiple independent data processing nodes $N_1, N_2, \ldots, N_N$ may be weighted by a corresponding weighting factor $\alpha_i$.
[0229] In this manner, the updated weight $W_j^{(k,t+1)}$ for the $k$-th head at node $N_j$ (as an example) after iteration $t+1$ may be computed as follows:
$$W_j^{(k,t+1)} = \frac{\sum_{m=1}^{N} \alpha_m W_m^{(k,t)}}{\sum_{m=1}^{N} \alpha_m}$$ (Equation 13)
[0230] where $W_j^{(k,t+1)}$ is the updated weight of head $k$ at node $N_j$ after iteration $t+1$, $N$ is the total number of nodes, $\alpha_m$ is the weighting factor for the corresponding node $N_m$, and $W_m^{(k,t)}$ is the weight of head $k$ received from node $N_m$ in iteration $t$.
[0231] In one example, the weighting factor may correspond to an equal contribution from all nodes, ie, $\alpha_m = 1$.
[0232] In another example, the property of the independent data processing node may correspond to the size of the dataset at the node. In this case, the weighting factor may increasingly or proportionally weight nodes with larger datasets, ie, $\alpha_m = |d_m|$, where $|d_m|$ is a measure of the size of the dataset at node $N_m$.
[0233] In another example, the property of the independent data processing node may correspond to a measure of the validation accuracy of the machine learning model operating at the independent data processing node. In this case, the weighting factor may correspond to proportionally weighting nodes with higher validation accuracy, where $\alpha_m$ is defined as follows:
$$\alpha_m = \frac{\text{Accuracy}_m}{\sum_{n=1}^{N} \text{Accuracy}_n}$$ (Equation 14)
[0234] and $\text{Accuracy}_m$ is a validation accuracy measure for a machine learning model operating at independent data processing node $N_m$.
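By way of non-limiting illustration only, the three weighting options described above (equal contribution, dataset size and validation accuracy) may be sketched as follows. The function names, the node identifiers and the numerical values are assumptions made for this sketch, as is the choice to divide by the sum of the weighting factors so that the result remains an average whichever option is used.

```python
def weighted_aggregate(weights_by_node, alphas):
    """Combine per-node weight vectors for one head using weighting factors."""
    total = sum(alphas[node] for node in weights_by_node)
    dim = len(next(iter(weights_by_node.values())))
    return [
        sum(alphas[node] * vec[d] for node, vec in weights_by_node.items()) / total
        for d in range(dim)
    ]


def accuracy_alphas(accuracy_by_node):
    """Weighting factors proportional to each node's validation accuracy."""
    total = sum(accuracy_by_node.values())
    return {node: acc / total for node, acc in accuracy_by_node.items()}


# Hypothetical weights for one head as received from two nodes.
weights_by_node = {"n1": [1.0, 0.0], "n2": [3.0, 2.0]}

equal = weighted_aggregate(weights_by_node, {"n1": 1, "n2": 1})
by_size = weighted_aggregate(weights_by_node, {"n1": 100, "n2": 300})
by_accuracy = weighted_aggregate(
    weights_by_node, accuracy_alphas({"n1": 0.9, "n2": 0.6})
)
```

In this sketch the node with the larger dataset pulls the size-weighted result towards its own weights, while the more accurate node does the same for the accuracy-weighted result.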
[0235] In one example, the validation accuracy measure for a given node is determined based on a local validation dataset for the given node. In this manner, nodes that demonstrate higher accuracy may have greater influence during any combination or aggregation process, while those with lower accuracy contribute less. In various examples, this assists the updated model to benefit most from the strongest-performing nodes in terms of accuracy while still allowing weaker nodes to participate in the training process.
[0236] In one example applicable to a multi-head transformer, an initial assessment of the learned weights for the various heads $h_1, h_2, \ldots, h_k$ is carried out to determine if there is balanced node performance. In one example, this is determined based on a variance or fairness assessment carried out across the shared machine learning model architecture. In one example, the variance assessment comprises each node evaluating its respective head against its local validation data to assess node performance.
[0237] In one example, a network fairness measure $L_{\text{fair}}$ is determined by computing the difference between the node performances and a mean network performance as follows:
$$L_{\text{fair}} = \frac{1}{N} \sum_{j=1}^{N} (P_j - \bar{P})^2$$ (Equation 15)
[0238] where $P_j$ is the performance of the model (eg, accuracy as indicated above) at node $N_j$ and $\bar{P}$ is the mean performance in all nodes.
[0239] If $L_{\text{fair}}$ is low, the contributions from the individual nodes are considered balanced. On the other hand, if the network variance measure is above a predetermined threshold, then this indicates a potentially unbalanced network.
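By way of non-limiting illustration only, a variance-style fairness check over node performances may be sketched as follows. The function names, the use of the mean squared deviation from the mean performance, the threshold of 0.005 and the accuracy values are assumptions made for this sketch and are not taken from the disclosure.

```python
def fairness_measure(performances):
    """Mean squared deviation of node performances from their mean."""
    mean = sum(performances) / len(performances)
    return sum((p - mean) ** 2 for p in performances) / len(performances)


def is_balanced(performances, threshold):
    """Treat the network as balanced when the measure does not exceed the threshold."""
    return fairness_measure(performances) <= threshold


THRESHOLD = 0.005  # hypothetical predetermined threshold

# Hypothetical validation accuracies reported by three nodes.
similar_nodes = [0.94, 0.90, 0.86]
one_weak_node = [0.95, 0.90, 0.55]

balanced_case = is_balanced(similar_nodes, THRESHOLD)
unbalanced_case = is_balanced(one_weak_node, THRESHOLD)
```

In this sketch a single markedly weaker node is enough to push the measure above the threshold and flag the network as potentially unbalanced.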
[0240] In one example, the weighting factors $\alpha_m$ for each node are determined in accordance with a minimisation process seeking to minimise the network fairness measure $L_{\text{fair}}$ over time in an iterative optimisation process. By minimising $L_{\text{fair}}$, nodes with weaker performance are given proportionally greater influence during weight aggregation (see Equation 13), while the contribution of stronger nodes is reduced. In various examples, this dynamic rebalancing reduces the risk of a single node or modality dominating and assists all nodes to converge toward similar performance levels, as a result promoting fairness across the distributed system.
[0241] After weight sharing, each node then integrates the predictions from all heads through ensemble learning as has been previously described.
[0242] In one example, the multi-head transformer is a multi-head vision transformer (ViT) configured to analyse and process image data such as medical image data having different modalities (eg, CT image data, MRI image data, PET image data, etc).
[0243] In accordance with the present disclosure, the multi-head ViT comprises multiple heads that each may be trained independently on the local node training data of the independent data processing node corresponding to the head being trained. This enables each head to capture features based on the characteristics of the specific local node training data which may be of a different modality to the local node training data at another independent data processing node.
[0244] The MHSA mechanism enables the multi-head ViT to capture global relationships within the input data (eg, image data) by allowing each patch to attend to all others in the sequence. This can be particularly important for modalities such as CT images and / or MRI images, where contextual information across the entire image is critical for accurate representation. Each head processes the attention outputs independently, ensuring that modality-specific features are captured effectively.
[0245] In one example, and in accordance with the present disclosure, the multi-head ViT is pre-trained on the pre-training data comprising image data from the MedMNIST data collection, which includes a wide range of image data types, including blood cell images, dermatological images, retinal images, and more, all of which are standardized to make model training more convenient. In one example, three specific datasets from MedMNIST (ie, BloodMNIST, DermaMNIST, and OCTMNIST) may be selected and provided to each independent data processing node to train their respective head of the multi-head ViT.
[0246] Following pre-training, each independent data processing node trains its own head on its local node training data to determine trained weights for that particular head. These weights are then communicated to each other independent data processing node, each of which has been training its own head based on its own local node training data, and the exchanged weights are integrated to improve generalisation. As would be appreciated, this training will specialise to the characteristics of the local node training data such as its modality or the specific type of medical data comprising the local node training data. In this application, this step may be regarded as fine-tuning each head based on local real-world datasets comprising the local node training data for each node following pre-training based on more general data.
[0247] Following this exchange of trained head weights, any selected data processing node may determine an updated multi-head ViT based on its own locally trained head weights and the trained head weights for all of the heads that comprise the multi-head ViT. In this manner, each independent data processing node will have its own version of an updated multi-head ViT and an ensemble layer is applied to combine the prediction outputs from all ViT heads to produce a final classification result.
[0248] By integrating information from all the heads distributed through the peer-to-peer network, each independent data processing node effectively becomes more knowledgeable about the entire dataset distribution over the peer-to-peer network without having to access the local node training data of each independent data processing node as a result maintaining the privacy of this data.
[0249] In accordance with the present disclosure, a new independent data processing node with a corresponding new head can be integrated into the multi-head ViT providing additional flexibility without disrupting the ongoing training or requiring the entire multi-head ViT to restart.
[0250] As can be appreciated, methods and systems in accordance with the present disclosure based on a multi-head ViT address the challenges of handling multi-modality in medical imaging, such as CT, PET, MRI, and X-ray data, by leveraging a decentralised architecture where different independent data processing nodes can train their respective head on their local node training data which may be of a different modality or reflect a different source population for the data.
[0251] Referring now to FIG. 16, there is shown a table 1600 comparing performance metrics of a distributed machine learning architecture in accordance with the present disclosure (ie, “Present Disclosure”) to other distributed machine learning methods. In this example, the peer-to-peer network comprises three independent data processing nodes and the machine learning architecture comprises a multi-head ViT transformer.
[0252] In this example, the local node training data for the first independent data processing node comprised D1 (BloodMNIST), the local node training data for the second independent data processing node comprised D2 (DermaMNIST), and the local node training data for the third independent data processing node comprised D3 (OCTMNIST).
[0253] In terms of evaluation, the first independent data processing node was tested on a combination of datasets from D2 and D3. The second independent data processing node was tested on a combination of datasets from D1 and D3 and the third independent data processing node was tested on D1 and D2. This ensured that each independent data processing node was evaluated on data it hadn't seen during training.
[0254] As can be seen, the machine learning architecture in accordance with the present disclosure demonstrates superior performance across key metrics compared to existing federated learning methods. In terms of accuracy, precision, recall, and Fl-score, the present machine learning architecture achieves the highest values (94%, 93%, 92%, and 93%, respectively), with notably low standard deviations, indicating consistent performance across experiments. In comparison, traditional methods like Federated Learning (86% accuracy) and FedAvg (87% accuracy) fall short, while more recent approaches such as Vision Transformer with Device-to-Device Learning and Swarm Learning show improvements but still lag behind the present machine learning architecture. The high accuracy and Fl -score of the present machine learning architecture highlight its ability to generalize effectively across heterogeneous datasets and make balanced, accurate predictions.
[0255] Bias reduction is a critical factor for fairness in decentralized learning, and the present machine learning architecture excels in this regard, achieving a bias reduction score of 50 ± 1.5, which is significantly higher than all baseline methods. For instance, FedProx and Swarm Learning achieve bias reductions of 32% and 35%, respectively, indicating that these methods still struggle with mitigating biases arising from non-IID data distributions. The present machine learning architecture’s ability to reduce bias is attributed to its decentralized weight-sharing mechanism and ensemble learning approach, which integrate knowledge from diverse datasets without over-relying on any single node or modality. This ensures equitable contributions and improves fairness across nodes.
[0256] Training time and computational efficiency are also notable strengths of the present machine learning architecture. With a training time of 100 ± 5 seconds, the present architecture outperforms methods like Split Learning (220 ± 20 seconds) and Swarm Learning (200 ± 15 seconds), showcasing its efficiency. This advantage stems from the present architecture’s parallel training of ViT heads and lightweight decentralized communication protocol, which minimizes overhead and latency during weight-sharing. Furthermore, the present architecture achieves the highest scores in flexibility (10 ± 0.5) and robustness (10 ± 0.5), emphasizing its ability to handle dynamic network changes, such as the addition of new nodes, without disrupting ongoing training or compromising performance.
[0257] The present machine learning architecture’s strong privacy level (10 ± 0.5) reflects its secure communication protocol, which ensures data privacy throughout the learning process. By enabling decentralized weight sharing without transmitting raw data, the present machine learning architecture adheres to strict privacy regulations, making it suitable for sensitive domains like healthcare. Its high fairness score (10 ± 0.5) reinforces its ability to ensure balanced contributions and benefits across heterogeneous datasets and modalities. Collectively, these results establish the present machine learning architecture as a robust, scalable, and privacy-preserving framework, setting a new benchmark for decentralized federated learning systems.
[0258] Referring now to FIG. 17, there is shown a table 1700 demonstrating the performance of a distributed machine learning architecture in accordance with the present disclosure to other distributed machine learning methods with regard to multi-modal data.
[0259] In this example, the peer-to-peer network comprises three independent data processing nodes and the machine learning architecture comprises a multi-head ViT transformer.
[0260] In this example, the distributed multi-head ViT transformer is evaluated using three distinct medical imaging modalities related to lung cancer detection: CT, PET, and X-ray. Each modality is assigned to a specific independent data processing node, thereby enabling the multi-head ViT transformer to specialize in extracting features from diverse datasets. The LIDC-IDRI dataset is used for CT image data, providing volumetric imaging data essential for detecting lung nodules and assessing lung structures. For PET image data, imaging data from the PET Radiomics Challenge dataset, focusing on metabolic activity that indicates malignancy, are used. For X-ray imaging, X-ray image data from the NIH Chest X-ray dataset or an equivalent lung cancer-specific X-ray dataset are used.
[0261] Table 1700 shows the results for each independent data processing node tested on a particular image modality, first following local training on image data of a different modality, and then following the communication of trained head weights between the three independent data processing nodes after each node has been trained on its respective local node training data comprising image data of a particular modality (see “Global Integration” row).
[0262] In this example, the first independent data processing node (ie, Node 1) is trained on local node training data comprising CT image data, the second independent data processing node (ie, Node 2) is trained on local node training data comprising PET image data, and the third independent data processing node (ie, Node 3) is trained on local node training data comprising X-ray image data.
[0263] The results for Node 1, tested on PET image data, highlight the present machine learning architecture’s capability to handle multi-modality. Following initial local training, Node 1's performance was limited, achieving an accuracy of 78%, precision of 75%, recall of 72%, and an F1-score of 73%. The high bias indicates that the local model struggled to generalize its knowledge to the PET modality, as it lacked exposure to metabolic activity features unique to PET imaging. However, following sharing of the trained head weights with global integration in accordance with the present disclosure, performance metrics improved significantly, with accuracy reaching 89.6%, precision 87%, recall 86%, and F1-score 86.5%. This improvement underscores the ability to integrate specialised knowledge from other nodes, where PET-related features from Node 2 contributed to the enhanced understanding. The generalization gain of 11.6% demonstrates strength in bridging modality-specific gaps, allowing Node 1 to incorporate complementary information and reduce its reliance on structural features obtained from CT and X-ray image data alone.
[0264] Similarly, the results for Node 2, trained on PET image data and tested on X-ray data, further illustrate enhanced multi-modality handling. The local training phase resulted in an accuracy of 81%, precision of 79%, recall of 76%, and an F1-score of 77%, showing slightly better performance compared to Node 1. However, the local model also exhibited high bias, as it lacked access to structural patterns unique to X-ray imaging. After global integration in accordance with the present disclosure, accuracy improved to 90.8%, precision reached 88%, recall increased to 87%, and the F1-score improved to 87.5%. The bias reduction (9.8%) and improved generalization highlight the present machine learning architecture’s ability to integrate structural knowledge learned by Node 1 and metabolic insights from Node 3. By synthesizing features from complementary modalities, Node 2's capacity to understand and classify X-ray data effectively is enhanced, demonstrating how diverse datasets may be unified for robust decision-making.
[0265] The results for Node 3, tested on CT data and trained on X-ray image data, provide the most compelling evidence of the efficacy of integrating multi-modality data in accordance with the present disclosure. Local training yielded an accuracy of 80%, precision of 77%, recall of 74%, and an F1-score of 75%, with high bias indicating difficulty in generalising to volumetric features specific to CT imaging. Global integration in accordance with the present disclosure, however, brought a remarkable improvement, with accuracy rising to 93.2%, precision to 89%, recall to 88%, and the F1-score to 88.5%. The generalisation gain of 13.2% emphasises how Node 3 is able to leverage structural insights from Node 1 (X-ray) and metabolic features from Node 2 (PET) to enrich its representation of CT imaging. This result underscores the capability to unify spatial and structural data from X-ray with metabolic patterns from PET to enhance the model's understanding of CT-specific volumetric features.
[0266] As would be appreciated, across all nodes Table 1700 highlights the ability to handle multi-modality by effectively integrating diverse datasets in a decentralized framework. Each node’s performance during local training was constrained by the limitations of its respective modality, resulting in high bias and moderate accuracy. Global integration significantly improved performance across all metrics by enabling nodes to share knowledge derived from complementary modalities. This process leverages the decentralized weight-sharing mechanism of the present disclosure, which ensures that modality-specific expertise from each node contributes to a unified global model. By reducing bias and enhancing generalization, machine learning architectures in accordance with the present disclosure can bridge the inherent differences between diverse datasets, demonstrating their potential for scalable distributed learning in multi-modal scenarios.
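The generalisation gains quoted for Table 1700 correspond to the difference, in percentage points, between accuracy after global integration and accuracy after local training. The following minimal Python check reproduces that arithmetic using only the accuracy figures stated above; the dictionary keys and function name are illustrative.

```python
# Accuracy (%) after local training and after global integration, as quoted
# in the discussion of Table 1700. Keys are illustrative labels only.
local_accuracy = {"Node 1": 78.0, "Node 2": 81.0, "Node 3": 80.0}
integrated_accuracy = {"Node 1": 89.6, "Node 2": 90.8, "Node 3": 93.2}


def accuracy_gain(node: str) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(integrated_accuracy[node] - local_accuracy[node], 1)


gains = {node: accuracy_gain(node) for node in local_accuracy}
print(gains)
```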
[0267] Referring now to FIG. 18, there is shown a system overview diagram of a system 1800 for determining a machine learning model involving multiple independent data processing nodes $N_1, N_2, N_3, \ldots, N_N$ in a peer-to-peer computer network 1810 according to some illustrative embodiments.
[0268] In this example, each node $N_i$ comprises associated local node training data $D_i$ and a respective machine learning model $M_i$ which in one example may comprise an encoder such as the encoding component of a transformer architecture. Considering $N_1$ as an example, this independent data processing node comprises a machine learning model $M_1$ operating on local node training data $D_1$ to generate a set of model characteristics in the form of weights following training. As will be described later, $N_1$ also comprises weights originating from training respective machine learning models $M_j$ at the other nodes $N_j$ where j ranges from 1, 2, ..., N.
[0269] In this example, $N_1$ further functions to generate pretrained weights $P_1$ by initially training on a general data source S such as a large publicly available dataset. In another example, pretrained weights $P_1$ may be initially generated by training $M_1$ on the local node training data $D_1$.
[0270] Referring now to FIG. 19, there is shown a system overview diagram depicting the initial communication of pretrained weights $P_1$ as determined on independent data processing node $N_1$ in a peer-to-peer network 1810 following a pretraining step according to some embodiments. In this example, pretrained weights $P_1$ are communicated or transferred to each of the other nodes for adoption when training their respective machine learning models $M_j$ on their respective local node training data $D_j$ where j ranges from 1, 2, ..., N. In other examples, pretrained weights $P_1$ may only be communicated to a selection of the remaining nodes that have been determined to benefit from the pretraining exercise at $N_1$. In other examples, a selection of nodes may carry out their own pretraining step based on their own local general data source to generate their own pretrained weights which may be reused as required. In other examples, a selected node may generate pretrained weights for a first collection of nodes and another selected node may generate pretrained weights for a second collection of nodes.
[0271] Referring back to FIG. 18, it can be seen that the pretrained weights $P_1$ may be adopted by the remaining nodes (eg, $N_2$, $N_3$, $N_N$) as an initial input before training their respective machine learning model to generate their associated sets of model characteristics which in this example are in the form of weights. As would be appreciated, the nodes that receive the pretrained weights do not need to duplicate this pretraining effort, thereby reducing the amount of computing resources required.
[0272] Consider the example where independent data processing node $N_1$ trains a respective machine learning model in the form of a local ViT encoder $f_\theta$ on its local node training data $D_1 = \{(x_i, y_i)\}$ by minimizing the standard cross-entropy loss function:
$$\mathcal{L}^{(1)} = \mathbb{E}_{(x,y)\sim D_1}\left[\ell_{\mathrm{CE}}\left(f_\theta(x), y\right)\right]$$ Equation 16
[0273] to generate pretrained weights $P_1$. In this example, the classification head is excluded from the model and the pretrained weights are in the form of backbone extractor weights $W_1$ which are communicated to the other independent data processing nodes, which then attach new classification heads compatible with the label space of their own respective local node training data $D_k = \{(x_k, y_k)\}$ to form their respective machine learning models.
[0274] Each independent data processing node $N_k$ then subsequently fine-tunes its ViT model $f_\theta$ by freezing a subset of the backbone layers and updating the remaining trainable parameters using the same cross-entropy loss function as in Equation 16. As is apparent, no raw data or labels are shared between independent data processing nodes during this pretraining and training process.
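The fine-tuning step described above, in which a subset of backbone layers is frozen and only the remaining parameters are updated, can be illustrated with a minimal sketch. The plain gradient-descent update, the layer names, and the flattened list representation of weights are assumptions made for illustration only; the disclosure does not specify an optimiser or which layers are frozen.

```python
from typing import Dict, List, Set

Params = Dict[str, List[float]]


def sgd_step(params: Params, grads: Params, frozen: Set[str], lr: float) -> Params:
    """Return updated parameters; layers named in `frozen` are left unchanged."""
    updated: Params = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)  # frozen backbone layer: copied as-is
        else:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
    return updated


# Hypothetical layer names: two backbone blocks initialised from received
# pretrained weights and one locally attached classification head.
params = {"block_0": [0.5, -0.2], "block_1": [0.1, 0.3], "head": [0.0, 0.0]}
grads = {"block_0": [1.0, 1.0], "block_1": [0.5, -0.5], "head": [2.0, -2.0]}

new_params = sgd_step(params, grads, frozen={"block_0"}, lr=0.1)
print(new_params)
```

Only the gradients are computed from local data, so nothing in this update requires raw data or labels from another node.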
[0275] Referring now to FIG. 20, there is shown a system overview diagram showing the communication of the various sets of model characteristics in the form of weights between the independent data processing nodes of peer-to-peer network 1810 following training at each node according to some embodiments. As would be appreciated, these weights represent the learned knowledge for the particular machine learning training task carried out at a particular node.
[0276] As shown in FIG. 18, each node following this communication process may potentially have received the weights from each of the other independent data processing nodes of the peer-to-peer network. In this manner, each node will include a repository or “bank” of the various sets of model characteristics (eg, weights) generated by each of the other independent data processing nodes based on training their respective model on their own local training data and communicated from other nodes, as well as their own set of model characteristics from local training.
[0277] Referring now to FIG. 21, there is shown a flowchart of a method 2100 for determining a respective updated machine learning model at a selected independent data processing node based on combining or aggregating its own locally determined set of model characteristics and received sets of model characteristics from other independent data processing nodes according to some embodiments.
[0278] At block 2110, method 2100 comprises selecting a subset of model characteristics from the repository comprising each of the received sets of model characteristics that originate from each other independent data processing node as well as the locally determined set of model characteristics, where the subset of model characteristics comprises those sets of model characteristics that are to be combined to generate the updated machine learning model.
[0279] In one example, selecting the sets of model characteristics to form the subset of model characteristics that are to be combined is determined in accordance with a machine learning task that is to be carried out at the independent data processing node. As an example, consider a peer-to-peer network comprising a plurality of independent data processing nodes where each node is directed to a respective disease detection task based on the respective machine learning model at a given node being trained to detect a particular disease based on a particular type of image (eg, X-ray, MRI, ultrasound, etc). In this example, each of the nodes are then directed to a machine learning task comprising a combination of detecting a particular disease based on a particular image modality.
[0280] Accordingly, for a selected node being tasked to detect a particular disease based on a particular type of image, determining the subset of model characteristics or equivalently the selection of weights from the repository may be based on their provenance (ie, which node they came from). This is on the basis that other nodes may be directed to associated disease classification tasks and image modalities that generate weights which may be of utility or relevance to the disease detection of the selected node. Notably, and in accordance with the present disclosure, the selected node does not require access to the data from the other nodes from which the weights are received.
[0281] As an example, if a selected node has weights in its local repository that have been communicated from ten different nodes that form part of the peer-to-peer network, then the node may determine the selection of weights in one example by choosing weights from two or three of the other nodes based on a particular machine learning task (eg, detecting a specific first condition) to generate an updated model at the selected node. In another example, the selection of weights may originate from another subset of nodes based on a different machine learning task (eg, detecting a different condition or detecting the same condition but based on a different image modality) to generate a different updated machine learning model.
[0282] In other examples, where all the other nodes may have generated weights suitable to the machine learning task at hand, the subset may comprise all of the available sets of model characteristics (ie, the subset is the entire set of model characteristics available at a given node that have been communicated from other nodes).
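The selection at block 2110 can be pictured as a lookup over the local repository keyed by provenance. The sketch below is one possible arrangement under stated assumptions: the `WeightSet` record, the task label strings, and the matching rule are hypothetical and are not defined in the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WeightSet:
    source_node: str      # provenance: the node that trained these weights
    task: str             # hypothetical task label (condition and modality)
    weights: List[float]  # flattened weights, simplified for illustration


def select_subset(repository: Dict[str, WeightSet],
                  relevant_tasks: List[str]) -> List[WeightSet]:
    """Pick the weight sets whose originating task is relevant locally."""
    return [ws for ws in repository.values() if ws.task in relevant_tasks]


# Repository ("bank") held at one node after the exchange of weights.
repository = {
    "N1": WeightSet("N1", "lung-cancer/CT", [0.1, 0.2]),
    "N2": WeightSet("N2", "lung-cancer/PET", [0.3, 0.4]),
    "N3": WeightSet("N3", "skin-lesion/dermoscopy", [0.5, 0.6]),
}

selected = select_subset(repository, ["lung-cancer/CT", "lung-cancer/PET"])
print([ws.source_node for ws in selected])
```

Passing every task label present in the repository yields the case where the subset is the entire set of available model characteristics.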
[0283] At block 2120, method 2100 comprises generating a global model based on combining the subset of model characteristics.
[0284] In one example, where the respective machine learning models for each node are in the form of local encoders (eg, ViT encoders), then the averaged backbone weights $W_{\text{fused}}$ are determined based on the selected weights to construct a unified global encoder $f_{\text{fused}}$. In this example, only structurally consistent machine learning components including, but not limited to, patch embedding layers, positional encodings, multi-head self-attention layers, and feed-forward encoder blocks are included in the combining steps. These layers or machine learning components share identical dimensionality across nodes, making them directly compatible for weight averaging or weighted fusion.
[0285] The classification heads are excluded from the aggregation process unless two or more nodes share the same label space and feature structure. This ensures that task-specific layers are not incorrectly combined, preventing label misalignment or degradation in performance. Instead, each independent data processing node retains its own classifier head locally, while benefiting from the globally fused encoder as a common feature extractor.
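As an illustration of combining only structurally consistent components, the sketch below averages layers that every selected model shares with identical size and skips classification heads. The `head.` naming prefix, the layer names, and the flattened list representation are assumptions for illustration, not part of the disclosure.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def fuse_backbones(models: List[StateDict], head_prefix: str = "head.") -> StateDict:
    """Average layers present in every model with identical size; skip heads."""
    fused: StateDict = {}
    for name, reference in models[0].items():
        if name.startswith(head_prefix):
            continue  # classification heads stay local to each node
        candidates = [m.get(name) for m in models]
        if any(c is None or len(c) != len(reference) for c in candidates):
            continue  # not structurally consistent across nodes
        fused[name] = [sum(vals) / len(models) for vals in zip(*candidates)]
    return fused


# Two nodes with matching backbones but heads of different label-space size.
node_a = {"patch_embed": [1.0, 2.0], "attn.0": [0.0, 4.0], "head.fc": [1.0, 1.0, 1.0]}
node_b = {"patch_embed": [3.0, 4.0], "attn.0": [2.0, 0.0], "head.fc": [5.0, 5.0]}

fused = fuse_backbones([node_a, node_b])
print(fused)
```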
[0286] Different approaches may be adopted in the forming of the global model (eg, a unified global encoder $f_{\text{fused}}$) to assist in maintaining stability during the fusion process. In one example, forming the global model comprises applying layer-wise averaging to weights of structurally aligned blocks while normalisation layers are fused using running statistics to prevent distributional drift.
[0287] In another example, forming the global model comprises performing weighted averaging where nodes with higher validation accuracy or larger, more diverse datasets are assigned greater influence in forming the unified global encoder $f_{\text{fused}}$. In this example, residual connections and attention maps are preserved during fusion to ensure that the resulting unified global encoder retains the representational richness of the individual models. This approach allows the unified global encoder $f_{\text{fused}}$ to function as a universal backbone, from which individual nodes can attach their own classification heads, or selectively integrate compatible heads when label spaces overlap. As a result, the global model in the form of the unified global encoder is capable of providing both shared generalisation power and node-specific flexibility, while reducing the need for retraining from scratch.
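A weighted fusion of this kind can be sketched as follows, assuming that each node's influence is simply its validation accuracy normalised so that the coefficients sum to one. That particular scoring rule is an assumption; the disclosure states only that more accurate nodes, or nodes with larger or more diverse datasets, receive greater influence.

```python
from typing import List


def normalise(scores: List[float]) -> List[float]:
    """Scale non-negative scores so that they sum to one."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


def weighted_fuse(layers: List[List[float]], coeffs: List[float]) -> List[float]:
    """Element-wise weighted average of one aligned layer across nodes."""
    return [sum(c * v for c, v in zip(coeffs, column)) for column in zip(*layers)]


validation_accuracy = [0.9, 0.6, 0.5]           # one illustrative score per node
coeffs = normalise(validation_accuracy)         # influence of each node
layer_per_node = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

fused_layer = weighted_fuse(layer_per_node, coeffs)
print(coeffs, fused_layer)
```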
[0288] In examples where full model aggregation is possible, each independent data processing node may obtain a complete global model following the one-time communication of weights between nodes which may be immediately used for local prediction. In cases where nodes only share an overlapping feature space, aggregation is typically limited to the backbone encoder as discussed above.
[0289] Referring back to FIG. 21, at block 2130, method 2100 comprises extracting model features based on the global model.
[0290] In one example, again where the respective machine learning models for each node are in the form of local encoders, extracting model features comprises extracting a non-invertible embedding in the form of a CLS token representation $z$ as follows:
$$z = f_{\text{fused}}(x)[:, 0] \in \mathbb{R}^{768}$$ Equation 17
[0291] Here, the CLS token representation refers to a special learnable non-invertible embedding prepended to the input sequence designed to aggregate global semantic information across all image patches via multi-head self-attention. In this example, the resulting CLS token representation serves as a compact and expressive representation of the input, which can be used for personalized downstream classifiers or concatenated across nodes for joint feature learning.
[0292] In another example, where the independent data processing nodes have non-overlapping feature spaces, such as data from different modalities or backgrounds, an updated model at a given node may be generated that seeks to generalise across the non-overlapping feature spaces with a unified model. In this example, each node extracts model features from their own local training data for generating a new downstream classifier.
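The indexing `[:, 0]` in Equation 17 selects the first token of every sample in the encoder output. A minimal sketch using nested lists is given below, assuming the encoder output is laid out as [sample][token][embedding dimension] with the CLS token at position 0; the toy embedding width of 2 stands in for 768.

```python
from typing import List

Tokens = List[List[List[float]]]  # [sample][token][embedding_dim]


def cls_embeddings(encoder_output: Tokens) -> List[List[float]]:
    """Equivalent of output[:, 0]: keep the first (CLS) token of each sample."""
    return [sample[0] for sample in encoder_output]


# Two samples, each with a CLS token followed by two patch tokens.
encoder_output = [
    [[0.1, 0.2], [1.0, 1.0], [2.0, 2.0]],
    [[0.3, 0.4], [3.0, 3.0], [4.0, 4.0]],
]

cls = cls_embeddings(encoder_output)
print(cls)
```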
[0293] Referring now to FIG. 22, there is shown a flowchart of a method 2200 for extracting model features in the form of concatenated node model features based on local extracted model features determined at each of the nodes according to some embodiments. In this example, extracting model features based on the global model comprises at block 2210 each node extracting local model features based on its local training data and then at block 2220 communicating the local extracted model features to the other nodes. At block 2230, method 2200 comprises at a selected node concatenating the local model features received from the other nodes to form the concatenated node model features.
[0294] Specifically, in the example referred to above, each node applies the unified global encoder $f_{\text{fused}}$ (ie, the global model) to its local data $x_k$ to determine its respective non-invertible embedding (ie, the local extracted model features) in the form of a CLS token representation as follows:
$$z_k^{(i)} = f_{\text{fused}}(x_k)[:, 0], \quad \forall x_k \in D_i$$ Equation 18
[0295] where $z_k^{(i)}$ denotes the CLS token representation of sample $x_k$ from node $N_i$ (see block 2210 of FIG. 22).
[0296] Referring now to FIG. 23, there is shown a system overview diagram showing the communicating of the various non-invertible embeddings between the independent data processing nodes in peer-to-peer network 1810 following determining the respective non-invertible embedding at each node according to some embodiments (see block 2220 of FIG. 22). In this example, the non-invertible embeddings are in the form of CLS token representations determined in accordance with Equation 18. In other examples, only a subset of nodes may choose to communicate their non-invertible embeddings to each other.
[0297] As would be appreciated, non-invertible embeddings cannot be reconstructed to determine the original data from which they have been generated. This implies that the exchange or communication of these embeddings still preserves data privacy between nodes. In addition, non-invertible embeddings typically comprise a fixed-length representation that can be communicated efficiently between nodes (eg, CLS token representations).
[0298] Following communication of the non-invertible embeddings, they are then concatenated to form a global feature map of concatenated node model features for subsequent training of a shared multi-label classifier which is trained to map the fused embeddings to the union of label spaces contributed by all the independent data processing nodes.
[0299] In one example, where each node generates and communicates their model features in the form of CLS token representations, the concatenated node model features is in the form of a CLS token representation matrix $Z \in \mathbb{R}^{N \times 768}$. In this example, each row of $Z$ corresponds to the 768-dimensional CLS token representation of a single node and its associated local node training data and N represents the total number of nodes that have contributed across the peer-to-peer network 1810. As would be appreciated, the dimensionality of the non-invertible embedding or CLS token may vary depending on the model.
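The concatenation at block 2230 amounts to stacking the received embeddings row by row. The sketch below assumes each node contributes one or more fixed-length rows and that rows are ordered by a node identifier; the identifiers, the ordering rule, and the toy width of 3 (in place of 768) are illustrative assumptions.

```python
from typing import Dict, List

Matrix = List[List[float]]


def build_feature_matrix(received: Dict[str, Matrix]) -> Matrix:
    """Stack the embeddings received from every node into one matrix Z."""
    rows: Matrix = []
    for node_id in sorted(received):  # fixed order keeps rows reproducible
        rows.extend(received[node_id])
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("all embeddings must share one dimensionality")
    return rows


received = {
    "N2": [[0.4, 0.5, 0.6]],
    "N1": [[0.1, 0.2, 0.3]],
    "N3": [[0.7, 0.8, 0.9]],
}

Z = build_feature_matrix(received)
print(len(Z), len(Z[0]))
```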
[0300] In one example, the CLS token representation or more generally the non-invertible embedding is further processed to reduce dimensionality (and redundancy). In one example, reducing dimensionality comprises applying a principal component analysis (PCA). First, the covariance matrix of Z is computed, which captures the correlations among the 768 embedding dimensions across all samples. The eigen-decomposition of this covariance matrix then identifies the principal axes (eigenvectors) along which the embeddings vary the most.
[0301] From this decomposition, d eigenvectors are selected at the next step and stored in matrix $W$. These eigenvectors span a lower-dimensional subspace (where $d \ll 768$), while preserving the majority of the variance and discriminative information present in the original embeddings. Finally, the global CLS token representation matrix $Z$ is then projected into this subspace by multiplying with $W$, resulting in a reduced-dimension representation $Z_{\text{PCA}}$ as follows:
$$Z_{\text{PCA}} = ZW, \quad \text{where } Z \in \mathbb{R}^{N \times 768},\ W \in \mathbb{R}^{768 \times 500}$$ Equation 19
[0302] noting in this case that d is set at 500, but may be varied as required. The concatenated node model features, now in the form of this new processed CLS token representation matrix $Z_{\text{PCA}}$, act as a compact, noise-reduced, and information-rich representation of all samples across the distributed nodes and become the input to the global multi-label classifier (see discussion of block 2140 below).
[0303] By training the classifier on $Z_{\text{PCA}}$, rather than directly on the high-dimensional CLS tokens, the system achieves one or more advantages including improved computational efficiency (due to the smaller input dimension), noise reduction and better generalisation as the PCA filters out less informative dimensions, and a more balanced representation across the contributing nodes. This is because the PCA is performed globally on the concatenated node model features and not locally at individual nodes, thereby transforming the original CLS token representation matrix $Z$ into a compact global feature space $Z_{\text{PCA}}$ that is more suitable for efficient and fair classifier training across the entire peer-to-peer network 1810.
[0304] As would be appreciated, methods and systems in accordance with the present disclosure may involve a two-stage knowledge fusion across the distributed machine learning system comprising the initial exchange of model characteristics between nodes (eg, encoder backbone weights) which are selectively combined to form a global model (eg, unified global encoder $f_{\text{fused}}$) and then the exchange and aggregation of model features determined at each node (eg, locally determined CLS token representations). This two-stage sharing approach allows the system to benefit from rich cross-node features while still respecting privacy constraints, since only weights or embeddings are exchanged and not raw data or full activations.
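The PCA steps described above can be written out as a short derivation. This follows the conventional formulation, in which the embeddings are mean-centred before the covariance matrix is computed; the centring step and the symbols $\bar{z}$, $Z_c$, $C$, $V$, $\Lambda$ and $\lambda_k$ are assumptions introduced here and do not appear in the disclosure.

```latex
\begin{aligned}
\bar{z} &= \frac{1}{N}\sum_{n=1}^{N} Z_{n,:}, \qquad Z_c = Z - \mathbf{1}_N\,\bar{z} \\
C &= \frac{1}{N-1}\, Z_c^{\top} Z_c \in \mathbb{R}^{768 \times 768} \\
C &= V \Lambda V^{\top}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{768} \ge 0 \\
W &= [\,v_1, \dots, v_d\,] \in \mathbb{R}^{768 \times d}, \qquad Z_{\text{PCA}} = Z W \in \mathbb{R}^{N \times d} \\
\text{retained variance} &= \frac{\sum_{k=1}^{d} \lambda_k}{\sum_{k=1}^{768} \lambda_k}
\end{aligned}
```

Because $C$ has rank at most $N - 1$, choosing $d = 500$ only captures additional variance when $Z$ has more than 500 rows; with fewer rows, the trailing components carry zero variance.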
[0305] Referring back to FIG. 21, at block 2140, method 2100 comprises training a classifier based on the extracted features to generate a final classification result.
[0306] Following on from the previous examples, the extracted features in the form of the CLS token representation, or CLS token representation matrix, may be used to train in one example a combined classifier comprising two or more downstream machine learning classifiers whose results are combined to provide the final prediction or classification result. In another example, a single machine learning classifier may be adopted.
[0307] In one example, the two or more downstream machine learning classifiers comprise a multilayer perceptron (MLP) classifier, a random forest (RF) classifier and a support vector machine (SVM) classifier each optimised in accordance with the following objective functions:
Equation 20
[0308] where $\hat{y}_c$ is the predicted probability for class c and $y_c$ is the one-hot encoded ground truth,
Equation 21
[0309] where $p(c|t)$ is the proportion of class c at node t, and
$$\mathcal{L}_{\text{SVM}} = \sum_{i=1}^{N} \max\left(0,\ 1 - y_i f(x_i)\right)^2$$ Equation 22
[0310] where $y_i \in \{-1, +1\}$ is the ground-truth label and $f(x_i)$ is the decision function.
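Objective functions matching these descriptions are conventionally categorical cross-entropy for the MLP, an impurity measure for random forest splits, and a squared hinge loss for the SVM. The sketch below assumes those conventional forms. The choice of Gini impurity in particular is an assumption, since an entropy-based criterion would fit the same description of $p(c|t)$, and the function names are illustrative.

```python
import math
from typing import List


def cross_entropy(y_true: List[float], y_pred: List[float]) -> float:
    """-sum_c y_c * log(y_hat_c) for one-hot y_true and probabilities y_pred."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def gini_impurity(class_proportions: List[float]) -> float:
    """1 - sum_c p(c|t)^2 for the class proportions at one tree node t."""
    return 1.0 - sum(p * p for p in class_proportions)


def squared_hinge(labels: List[int], decisions: List[float]) -> float:
    """sum_i max(0, 1 - y_i * f(x_i))^2 with labels in {-1, +1}."""
    return sum(max(0.0, 1.0 - y * f) ** 2 for y, f in zip(labels, decisions))


ce = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.5, 0.3])
gini = gini_impurity([0.5, 0.5])
hinge = squared_hinge([1, -1], [2.0, 0.5])
print(ce, gini, hinge)
```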
[0311] The probability estimates from the two or more classifiers are then combined to determine a final prediction. In one example, the probability estimates from the three classifiers are combined based on a soft voting ensemble layer. Let MLP’s probability map $P_m(y = c|x)$, RF’s probability map $P_r(y = c|x)$, and SVM’s probability map $P_s(y = c|x)$ denote their respective predicted probabilities for class c given input data x. The soft voting classifier ensemble determines the average probability as:
$$P_{\text{avg}}(y = c|x) = \frac{1}{3}\left[P_m(y = c|x) + P_r(y = c|x) + P_s(y = c|x)\right]$$ Equation 23
[0312] and the final predicted class is then determined by:
$$\hat{y} = \arg\max_{c} P_{\text{avg}}(y = c|x)$$ Equation 24
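A soft voting ensemble of this kind averages the per-class probabilities from each classifier and selects the class with the highest average. The following minimal sketch uses illustrative probability values for three classes; the function names are assumptions.

```python
from typing import List


def soft_vote(prob_maps: List[List[float]]) -> List[float]:
    """Average the per-class probabilities produced by each classifier."""
    n = len(prob_maps)
    return [sum(column) / n for column in zip(*prob_maps)]


def predict(avg_probs: List[float]) -> int:
    """Index of the class with the highest averaged probability."""
    return max(range(len(avg_probs)), key=avg_probs.__getitem__)


p_mlp = [0.6, 0.3, 0.1]   # illustrative MLP output for one input
p_rf = [0.2, 0.5, 0.3]    # illustrative random forest output
p_svm = [0.3, 0.4, 0.3]   # illustrative SVM probability estimates

avg = soft_vote([p_mlp, p_rf, p_svm])
predicted_class = predict(avg)
print(avg, predicted_class)
```

In this example the MLP alone would select class 0, while the averaged probabilities select class 1, which shows how the ensemble can overrule a single classifier.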
[0313] In another example, a hierarchical model aggregation approach may be adopted to enable the integration of different node groups without discarding their previously learned knowledge.
[0314] In this example, independent data processing nodes are grouped and their local models are combined into intermediate global models in accordance with whether their feature spaces overlap or not. These intermediate global models are then treated as new base models and are further aggregated or combined across groups.
[0315] In one example, given a set of global models $\{f^{(i)} \mid i = 1, \ldots, N\}$ trained on datasets with potentially different numbers of classes, this approach applies a weighted averaging of their backbone weights. This approach prevents the weights of a base model from being excessively averaged, which could otherwise introduce bias and degrade performance on specific independent data processing nodes. In various examples, the weighting of the weights can be done in accordance with two approaches.
[0316] The first approach is square-root class-based weighting where each global model’s weight is proportional to the square root of the number of classes it was trained on. Let C(denote the number of classes for model i. WfUsedis then defined as:Wfused= ^1=1aiwlEquation 25Equation 26
[0317] The second approach is uniform model count-based weighting where each global model is assigned a weight proportional to the number of base models it was aggregated from. Let nt denote the number of base models used to create the i-th global model. In this approach, WfUsedis defined:Wfused= ^=1PtW1Equation 27Equation 28
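The two weighting approaches may be illustrated by the following minimal sketch. It assumes that the coefficients are normalised to sum to one and that backbone weights are held as dictionaries of flat lists; the names `fusion_coefficients` and `fuse_backbones` and all numeric values are hypothetical.

```python
import math


def fusion_coefficients(values, transform=lambda v: v):
    """Normalise transformed values (eg, sqrt of class counts) so they sum to one."""
    scores = [transform(v) for v in values]
    total = sum(scores)
    return [s / total for s in scores]


def fuse_backbones(backbones, coeffs):
    """Return the coefficient-weighted sum of backbone weight dictionaries."""
    fused = {}
    for name, reference in backbones[0].items():
        fused[name] = [
            sum(a * b[name][j] for a, b in zip(coeffs, backbones))
            for j in range(len(reference))
        ]
    return fused


# Two hypothetical global models sharing the same backbone layout.
backbones = [{"layer0": [1.0, 2.0]}, {"layer0": [3.0, 6.0]}]

# First approach: coefficients proportional to sqrt(number of classes).
alpha = fusion_coefficients([4, 16], math.sqrt)   # sqrt gives 2 and 4
fused_by_classes = fuse_backbones(backbones, alpha)

# Second approach: coefficients proportional to number of base models.
beta = fusion_coefficients([1, 3])
fused_by_count = fuse_backbones(backbones, beta)
print(fused_by_classes, fused_by_count)
```

The square root in the first approach moderates the influence of a model trained on many classes: with 4 and 16 classes the coefficients are one third and two thirds, rather than one fifth and four fifths.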
[0318] After the weight aggregation, the newly formed global model replaces the lower-level global models and is used to directly extract CLS token representations from each independent data processing node (see block 2130 of FIG. 21).
[0319] These extracted tokens are then reduced from 768-dimensional embeddings to 500-dimensional vectors, as described in Equation 19. Finally, the reduced embeddings are used to train a group of machine learning classifiers (eg, MLP, Random Forest, and SVM), and these classifiers are combined using a soft voting classifier to produce the final prediction (see block 2140 of FIG. 21 and Equations 20 to 24).
[0320] Accordingly, the following cases may be advantageously managed in accordance with the present disclosure. In the first case, where two or more nodes share overlapping or identical feature spaces, the entire model weights may be exchanged and combined to form a unified model that preserves the shared feature spaces. In this example, the classifier head of each node remains specific to that node, reflecting local variations in its label set.
[0321] In the second case, where various nodes have non-overlapping feature spaces (eg, one node processing CT scans and another node processing histopathology images), each node may be initialized with a shared pretrained backbone (eg, a pretrained ViT encoder). During training, only the sets of model characteristics in the form of extractor weights are exchanged and combined, while the classifier heads remain local to avoid misalignment of label spaces. In one example, this may be applied where there are multimodal datasets (eg, MRI, ultrasound, and X-ray images) to allow individual nodes to leverage complementary representations from entirely different domains without degrading performance.
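The second case may be sketched as follows, assuming that model characteristics are held as a dictionary keyed by parameter name, that classifier head parameters are identified by a `head.` prefix, and that extractor weights are combined by simple averaging. These conventions, the function name `combine_extractors` and the values shown are hypothetical and are chosen only for illustration.

```python
def combine_extractors(local_model, peer_models, head_prefix="head."):
    """Average extractor weights over the local node and its peers; keep the local head."""
    models = [local_model] + list(peer_models)
    updated = {}
    for name, value in local_model.items():
        if name.startswith(head_prefix):
            # Classifier head stays local: peer heads are never read,
            # so differing label spaces cannot become misaligned.
            updated[name] = list(value)
        else:
            updated[name] = [
                sum(m[name][j] for m in models) / len(models)
                for j in range(len(value))
            ]
    return updated


# Hypothetical nodes with a shared extractor layout but different label sets.
ct_node = {"extractor.w": [1.0, 3.0], "head.w": [0.5, 0.5, 0.5]}
histo_node = {"extractor.w": [3.0, 5.0], "head.w": [9.0, 9.0]}
updated = combine_extractors(ct_node, [histo_node])
print(updated)
```

Because the head parameters of the peer node are never accessed, the two nodes may hold heads of different sizes without any conflict.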
[0322] In the third case, a hybrid approach may be applied involving a multi-level hierarchical design, where groups of nodes may first be combined locally into intermediate global aggregated models (eg, $M_{\mathrm{aggr1}}$, $M_{\mathrm{aggr2}}$). These intermediate models may then be further combined into higher-level global models (eg, $M_{\mathrm{aggr}} = M_{\mathrm{aggr1}} \oplus M_{\mathrm{aggr2}}$), effectively creating a scalable, layered architecture.
[0323] As an example, in one application the Group 1 aggregated nodes ($M_{\mathrm{aggr1}}$) were trained respectively on lung and bone diseases while the Group 2 aggregated nodes ($M_{\mathrm{aggr2}}$) were trained respectively on liver and eye datasets. These group-level models were subsequently combined into a higher-level model ($M_{\mathrm{aggr}}$) that integrated knowledge across multiple disease domains. This hierarchical strategy allows plug-and-play scalability, enabling new groups of nodes to be integrated without retraining the entire network, while maintaining high accuracy and efficiency.
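The layered combination may be sketched as follows, assuming that extractor weights are flat lists, that nodes within a group are averaged equally, and that groups are combined using the model count-based weighting. The function names and all numeric values are hypothetical; the later addition of a third group illustrates plug-and-play integration without revisiting the earlier groups.

```python
def weighted_average(models, weights):
    """Return the weighted average of equally sized weight vectors."""
    total = float(sum(weights))
    return [
        sum(w * m[j] for w, m in zip(weights, models)) / total
        for j in range(len(models[0]))
    ]


def aggregate_group(node_models):
    """Combine node-level weights into (intermediate model, number of base models)."""
    return weighted_average(node_models, [1] * len(node_models)), len(node_models)


def aggregate_groups(groups):
    """Combine (model, count) pairs, weighting each by its base-model count."""
    models = [model for model, _ in groups]
    counts = [count for _, count in groups]
    return weighted_average(models, counts), sum(counts)


# Hypothetical extractor weights: Group 1 (lung, bone) and Group 2 (liver, eye).
m_aggr1 = aggregate_group([[1.0, 1.0], [3.0, 3.0]])
m_aggr2 = aggregate_group([[4.0, 0.0], [6.0, 2.0]])
m_aggr = aggregate_groups([m_aggr1, m_aggr2])

# A new single-node group is integrated later, reusing the existing m_aggr.
m_aggr3 = aggregate_group([[8.5, 6.5]])
m_expanded = aggregate_groups([m_aggr, m_aggr3])
print(m_aggr, m_expanded)
```

Carrying the base-model count alongside each intermediate model means the new group contributes one fifth of the expanded model, in proportion to the single node it represents.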
[0324] Various examples in accordance with the present disclosure function to mitigate data imbalance by allowing majority-class or high-volume groups to act as knowledge donors for minority-class or low-volume groups. For instance, in experiments with 30 datasets and 80 disease classes, groups with abundant samples (such as pneumonia from chest X-rays) were first trained to generate stable extractor weights. These weights were then reused as pretrained initializations for smaller datasets (such as rare bone diseases), ensuring that underrepresented classes could inherit robust feature representations without overfitting.
[0325] This approach replaces traditional oversampling or synthetic augmentation with a model-driven solution, leveraging the strength of large datasets to lift the performance of rare categories. Across experiments, this strategy improved sensitivity and recall for minority classes while maintaining overall accuracy.
[0326] Various examples in accordance with the present disclosure also function to support improved integration of heterogeneous data modalities, such as X-ray, CT, MRI, ultrasound, fundus images, and even text-based annotations. Each modality is initially trained in its own component or head, producing specialized weights. During aggregation or combination, these weights are combined into a unified multi-modal representation. For example, nodes trained on CT scans contribute structural feature weights, while nodes trained on pathology slides contribute cellular-level representations.
[0327] In accordance with the present disclosure, these modality-specific extractors may be fused, enabling the final model to learn both global structural cues and fine-grained local features to achieve superior cross-domain generalization compared to other machine learning methodologies such as centralized and swarm learning. For example, if one node is specialized in detecting a particular disease and another node focuses on a different disease (or even the same one), each node will eventually accumulate a collection of weights from multiple peers. These collected weights may then be aggregated or combined to construct a robust global model that integrates knowledge from across the network. Additionally, because each node receives weights from many other nodes, it has the capability to selectively combine only the most relevant subsets of weights to create specialized or task-specific models for a particular machine learning task.
[0328] Methods and systems in accordance with the present disclosure have been evaluated on highly heterogeneous and imbalanced medical datasets covering a wide range of imaging modalities and classification tasks, including pathology, CT, MRI, X-ray, ultrasound, endoscopy, and other modalities. In experiments across 30 datasets and 80 independent labels on distributed nodes, an overall accuracy of 92.7% was achieved, surpassing centralized learning (84.9%) and swarm learning (72.99%). Federated learning approaches failed under these conditions due to the high demands on computational resources. Excellent scalability has also been demonstrated, with only a 1% drop in accuracy on existing nodes after expansion compared to a 20% drop found with centralized learning, demonstrating the resilience of the present approach to catastrophic forgetting. Additionally, computational costs were reduced by up to 50% compared to both centralized and swarm learning.
[0329] As would be appreciated, methods and systems in accordance with the present disclosure can reuse previously aggregated or combined weights as pretrained models for future tasks, dramatically reducing retraining requirements and computational cost. In practice, once weights from multiple nodes have been combined into a global or intermediate model, these combined weights may then be stored in a repository or “weight bank”. When a new task, dataset, or class is introduced, the independent data processing nodes do not need to train the model from scratch. Instead, a given machine learning model may initialize training based on the previously aggregated weights, which have already encoded robust and generalizable feature representations.
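A weight bank of this kind may be sketched as a simple in-memory repository, as follows. The class name `WeightBank`, its methods, the use of descriptive tags for selection and the stored values are all hypothetical; the disclosure does not prescribe a storage format or selection scheme.

```python
import copy


class WeightBank:
    """Minimal in-memory repository of previously combined weights."""

    def __init__(self):
        self._entries = {}

    def store(self, name, weights, tags):
        # Deep copies prevent later training from mutating the stored weights.
        self._entries[name] = {
            "weights": copy.deepcopy(weights),
            "tags": frozenset(tags),
        }

    def select(self, required_tags):
        """Return the names of entries carrying every required tag."""
        required = frozenset(required_tags)
        return sorted(
            name for name, entry in self._entries.items()
            if required <= entry["tags"]
        )

    def initialise(self, name):
        """Return a fresh copy of stored weights to start a new training run."""
        return copy.deepcopy(self._entries[name]["weights"])


bank = WeightBank()
bank.store("chest_xray_global", {"extractor.w": [0.1, 0.2]}, tags={"xray", "lung"})
bank.store("ct_liver_global", {"extractor.w": [0.7, 0.9]}, tags={"ct", "liver"})

# A new X-ray task starts from the stored weights instead of from scratch.
candidates = bank.select({"xray"})
init_weights = bank.initialise(candidates[0])
init_weights["extractor.w"][0] = 9.9  # local training changes only the copy
print(candidates, bank.initialise("chest_xray_global"))
```

Returning copies rather than references keeps each stored model reusable as a stable starting point for any number of later tasks.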
[0330] While the various examples and embodiments of the present disclosure have been directed to multiple independent data processing nodes operable to exchange data in a peer-to-peer network, in another example the models and associated training data may be resident on a single node and the training may be conducted sequentially. Each model, once trained, generates a successive set of model characteristics that is stored on the node (eg, in a weight bank) and which in turn may be adopted in later training exercises. The sets of model characteristics generated may then be selectively combined to generate a global model that may be directed to a given machine learning or data classification task. In one example, a large training data set may be divided into training data subsets and machine learning models may be successively trained on these smaller training data subsets. Sets of model characteristics from earlier training of a given model may be adopted in the training of a subsequent model on another training data subset, meaning that the subsequent model does not have to be trained from scratch.
[0331] Referring now to FIG. 24, there is shown an architecture overview diagram of an example independent data processing node 2400 that may optionally be utilised in accordance with the present disclosure. Independent data processing node 2400 typically includes at least one processor 2410 which communicates with a number of peripheral devices via bus subsystem 2450. These peripheral devices may include a storage subsystem 2440, including, for example, a memory subsystem 2441 and a file storage subsystem 2445, user interface input / output devices 2430, and a network interface subsystem 2420. The input and output devices 2430 allow user interaction with independent data processing node 2400 if required.
[0332] Network interface subsystem 2420 provides an interface to outside networks and is coupled to corresponding interface devices in other independent data processing nodes or computing devices. The communications between independent data processing nodes may be implemented with a variety of fabrics, devices and protocols. For example, the fabric for network communications between the nodes may include Ethernet, wireless (802.11), or equivalent. Data communication protocols may include Transmission Control Protocol (TCP), User Datagram Protocol (UDP), Internet Protocol (IP), Hypertext Transfer Protocol (HTTP), Wireless Application Protocol (WAP), Handheld Device Transport Protocol (HDTP), Session Initiation Protocol (SIP), Real-time Transport Protocol (RTP) or equivalent.
[0333] User interface input devices 2430 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and / or other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into independent data processing node 2400 or onto a communication network.
[0334] User interface output devices 2430 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from independent data processing node 2400 to another independent data processing node or to a user or to another computing device.
[0335] Storage subsystem 2440 stores programming and data constructs that provide the functionality of some or all implementations in accordance with the present disclosure. For example, the storage subsystem 2440 may include the logic to perform selected aspects of method 100 of FIG. 1, and / or method 500 of FIG. 5, and / or method 600 of FIG. 6, and / or method 800 of FIG. 8, and / or method 900 of FIG. 9, and / or method 1000 of FIG. 10 and / or method 1200 of FIG. 12 and / or method 2100 of FIG. 21 and / or method 2200 of FIG. 22.
[0336] These software modules are generally executed by processor 2410 alone or in combination with other processors. Memory 2441 used in the storage subsystem 2440 can include a number of memories including a main random access memory (RAM) 2443 for storage of instructions and data during program execution and a read only memory (ROM) 2442 in which fixed instructions are stored. A file storage subsystem 2445 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 2445 in the storage subsystem 2440, or in other machines accessible by the processor(s) 2410.
[0337] Bus subsystem 2450 provides a mechanism for letting the various components and subsystems of independent data processing node 2400 communicate with each other as intended. Although bus subsystem 2450 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
[0338] Independent data processing node 2400 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of independent data processing node 2400 depicted in FIG. 24 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of independent data processing node 2400 are possible having more or fewer components than the computing device depicted in FIG. 24.
[0339] In other examples, an independent data processing node 2400 may be implemented as a cloud computing instance to support the execution of computer-implemented methods in accordance with the present disclosure. The cloud computing instance may be embodied, for example, as an instance of cloud computing resources (eg, a virtual computing machine comprising processing, storage and interface resources) that may be provided by the cloud computing environment.
[0340] The independent data processing node 2400 may run any suitable operating system including, but not limited to, any embedded operating system, any real-time operating system, any open source operating system, any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the independent data processing node 2400 and performing the operations described in this disclosure. In an embodiment, the operating system may be run on one or more cloud machine instances.
[0341] As will be appreciated in light of this disclosure, the various computer-implemented methods of the present disclosure are implemented in software, such as a set of instructions (eg, HTML, XML, C, C++, object oriented C, BASIC, Python, etc.) encoded on any computer readable medium or computer program product (eg, hard drive, server, disc, or other suitable non-transitory memory or set of memories), that when executed by one or more processors, cause the various methodologies provided in this disclosure to be carried out.
[0342] Distributed machine learning methods, architectures and systems in accordance with embodiments of the present disclosure, by contrast to current systems, decentralise the learning process, thereby eliminating the need for a central node server and removing the single point of failure. In accordance with the present disclosure, each independent data processing node independently processes its training data and communicates directly with the other nodes, making the system more resilient and secure against both technical failures and adversarial attacks.
[0343] Eliminating the central node server also functions to significantly reduce network traffic. In traditional centralised and federated learning approaches, frequent model updates and data transmissions to and from a central aggregator (ie, the central node server) create substantial communication overhead. By contrast, in the distributed architecture described, each independent data processing node processes its own training data locally and only shares essential learned representations (such as weights, embeddings, or model heads) directly with other independent data processing nodes in the peer-to-peer network. This peer-to-peer communication minimises the volume of transmitted data, as updates occur selectively and asynchronously rather than through a central coordinator that requires continuous synchronisation. Additionally, by decentralising model updates, network congestion is alleviated, leading to improved scalability and efficiency, particularly in large-scale deployments with numerous participating independent data processing nodes.
[0344] Additionally, the distributed machine learning methods, architectures and systems of the present disclosure offer a more robust solution by allowing each independent data processing node to enforce its own security policies and regulations while using feature fusion techniques to better integrate data from diverse distributions. This flexibility improves model accuracy and generalisation and ensures compliance with jurisdictional requirements, making machine learning architectures in accordance with the present disclosure suitable for cross-border data-sharing applications.
[0345] In various examples, the components of a shared machine learning model architecture (eg, a multi-head transformer) may be distributed across different independent data processing nodes, trained on local node training data which may comprise data of different modalities and then the trained head weights from each independent data processing node may be shared with each other independent data processing node to realise an updated shared machine learning model architecture which may be applied to data of different modalities.
[0346] Distributing components of a shared machine learning model architecture, such as a multi-head transformer, across independent data processing nodes offers several key technical advantages. This approach enhances scalability by enabling parallel processing across multiple nodes, reducing computational bottlenecks. It improves privacy by allowing data to remain localised while only exchanging model updates, minimising the risk of data exposure. Additionally, it increases model robustness and generalisation by leveraging diverse, decentralised datasets. By enabling specialised training for different modalities across nodes, this architecture also facilitates efficient multi-modal learning, making it well-suited for applications in secure, distributed AI systems. As would be appreciated, implementations in accordance with the present disclosure provide flexibility in supporting diverse model architectures that ensures seamless deployment across vision, NLP, reinforcement learning, generative models, and structured data processing without requiring central coordination. Across these diverse architectures, peer-to-peer sharing of model characteristics, such as weights, embeddings, feature representations, and attention heads, ensures that models can collaborate while preserving privacy and security. This adaptability allows for implementations across various applications, including computer vision, NLP, cybersecurity, IoT, and autonomous systems.
[0347] The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement or any form of suggestion that such prior art forms part of the common general knowledge.
[0348] It will be understood that the terms “comprise” and “include” and any of their derivatives (e.g. comprises, comprising, includes, including) as used in this specification, and the claims that follow, are to be taken to be inclusive of features to which the terms refer, and are not meant to exclude the presence of any additional features unless otherwise stated or implied.
[0349] In some cases, a single embodiment may, for succinctness and / or to assist in understanding the scope of the disclosure, combine multiple features. It is to be understood that in such a case, these multiple features may be provided separately (in separate embodiments), or in any other suitable combination. Alternatively, where separate features are described in separate embodiments, these separate features may be combined into a single embodiment unless otherwise stated or implied. This also applies to the claims, which can be recombined in any combination. That is, a claim may be amended to include a feature defined in any other claim. Further, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.
[0350] It will be appreciated by those skilled in the art that the disclosure is not restricted in its use to the particular application or applications described. Neither is the present disclosure restricted in its preferred embodiment with regard to the particular elements and / or features described or depicted herein. It will be appreciated that the disclosure is not limited to the embodiment or embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope as set forth and defined by the following claims.
Claims
1. A computer-implemented method, comprising: in a peer-to-peer computer network comprising a plurality of independent data processing nodes, wherein each independent data processing node comprises respective local node training data and a respective machine learning model; determining at each independent data processing node a respective set of model characteristics from training the respective machine learning model on the respective local node training data; communicating, for each independent data processing node, their respective set of model characteristics to each other independent data processing node of the plurality of independent data processing nodes; and determining a respective updated machine learning model at a selected independent data processing node of the plurality of independent data processing nodes based on its own locally determined set of model characteristics and received sets of model characteristics from each other independent data processing node.
2. The method of claim 1, wherein determining the respective updated machine learning model at the selected independent data processing node of the plurality of independent data processing nodes based on its own locally determined set of model characteristics and each received set of model characteristics from each other independent data processing node comprises combining its own locally determined set of model characteristics with a selection of model characteristics from the received sets of model characteristics to form a combination of model characteristics.
3. The method of claim 2, wherein determining the respective updated machine learning model at the selected independent data processing node comprises: generating a global model at the selected independent data processing node based on the combination of model characteristics; and training one or more classifiers based on the global model to generate a final classification result.
4. The method of claim 3, wherein training the one or more classifiers based on the global model comprises: training a plurality of classifiers to generate associated prediction outputs; and aggregating the prediction outputs to generate the final classification result.
5. The computer-implemented method of any one of claims 1 to 4, further comprising determining respective updated machine learning models at each remaining data processing node based on their respective locally determined sets of model characteristics and each received set of model characteristics from each other independent data processing node.
6. The computer-implemented method of any one of claims 1 to 5, wherein the plurality of independent data processing nodes comprises three or more independent data processing nodes.
7. The computer-implemented method of any one of claims 1 to 6, wherein determining the respective set of model characteristics for a machine learning model comprising a plurality of sub-models comprises determining a unified feature representation for the plurality of sub-models.
8. The computer-implemented method of claim 7, wherein determining the unified feature representation for the plurality of sub-models comprises: extracting features from each of the plurality of sub-models of the respective machine learning model; concatenating the extracted features to form a fused feature set; and processing the fused feature set to determine the unified feature representation having reduced dimensionality.
9. The computer-implemented method of claim 8, wherein processing the fused feature set to determine the unified feature representation comprises: performing an independent component analysis on the fused feature set to reduce dimensionality; and optionally: validating the unified feature representation; or displaying high dimensional features to visualise model performance.
10. The computer-implemented method of any one of claims 1 to 9, wherein a first independent data processing node seeking to communicate their respective set of model characteristics to a second independent data processing node operable to receive the respective set of model characteristics is defined as a transmitting node and the second independent data processing node is defined to be a receiving node.
11. The computer-implemented method of claim 10 further comprising validating by the receiving node the transmitting node prior to communication.
12. The computer-implemented method of claim 11, wherein validating by the receiving node the transmitting node comprises the receiving node validating a unique digital identifier of the transmitting node in accordance with an asymmetric encryption scheme based on a transmitting node private-public key pair.
13. The computer-implemented method of claim 12, wherein validating the unique digital identifier of the transmitting node in accordance with an asymmetric encryption scheme comprises: encrypting by the transmitting node the unique digital identifier based on a private key of the transmitting node private-public key pair to form an encrypted unique digital identifier; sending by the transmitting node the encrypted unique digital identifier to the receiving node; decrypting by the receiving node the encrypted unique digital identifier based on a public key of the transmitting node private-public key pair; and confirming by the receiving node the unique digital identifier of the transmitting node.
14. The computer-implemented method of any one of claims 10 to 13, further comprising: securing communication of the respective set of model characteristics between the transmitting node and the receiving node in accordance with an encryption scheme prior to communication.
15. The computer-implemented method of claim 14, wherein the encryption scheme is a symmetric encryption scheme based on a unique session key generated for the communication of the respective set of model characteristics from the transmitting node to the receiving node.
16. The computer-implemented method of claim 15, wherein the unique session key is shared between the transmitting node and the receiving node in accordance with an asymmetric encryption scheme between the transmitting node and the receiving node.
17. The computer-implemented method of claim 14, wherein the encryption scheme is an asymmetric encryption scheme based on private-public key pairs generated for the transmitting node and receiving node.
18. The computer-implemented method of claim 17, wherein the asymmetric encryption scheme is a homomorphic encryption scheme.
19. The computer-implemented method of claim 18, wherein the homomorphic encryption scheme is an additive homomorphic encryption scheme.
20. The computer-implemented method of any one of claims 10 to 19, further comprising: assessing by the receiving node the respective set of model characteristics following communication from the transmitting node to determine an associated integrity measure for the respective set of model characteristics.
21. The computer-implemented method of claim 20, wherein the associated integrity measure indicates interference with the respective set of model characteristics in a form of a cyber-attack, concept drift attack or unlabelled cyber-attack.
22. The computer-implemented method of claim 20 or 21, wherein assessing the respective set of model characteristics to determine an associated integrity measure comprises applying a machine learning classifier to the respective set of model characteristics.
23. The method of any one of claims 1 to 22, wherein communicating, for each independent data processing node, their respective sets of model characteristics to each other independent data processing node comprises each of the independent data processing nodes updating their respective sets of model characteristics to each other independent data processing node substantially simultaneously.
24. The method of claim 23, wherein each independent data processing node operates as a blockchain node together forming a blockchain network having a corresponding shared blockchain ledger for recording an update based on communications between the plurality of independent data processing nodes.
25. The method of claim 24, wherein the update is recorded on the shared blockchain ledger following a proof of share operation between the plurality of independent data processing nodes.
26. The method of any one of claims 1 to 25, wherein a first local node training data for a first independent data processing node of the plurality of independent data processing nodes comprises a first data modality and a second local node training data for a second independent data processing node of the plurality of independent data processing nodes comprises a second data modality, and wherein the first and second data modalities are different.
27. The method of claim 26, wherein the first and second data modalities are different types of image data modalities.
28. The method of any one of claims 1 to 27, wherein the respective machine learning model for a given node is equivalent to the respective machine learning model for a different node.
29. The method of any one of claims 1 to 27, wherein the respective machine learning model for a given node is different to the respective machine learning model for a different node.
30. The method of any one of claims 1 to 29, further comprising: adding a new independent data processing node to the plurality of independent data processing nodes, wherein the new independent data processing node comprises respective local node training data and a respective machine learning model corresponding to the new independent data processing node.
31. The computer-implemented method of any one of claims 1 to 30, wherein the respective machine learning model at each independent data processing node comprises a component of a shared machine learning model architecture also resident at each of the independent data processing nodes, and wherein the respective updated machine learning model determined at the selected independent node comprises an updated shared machine learning model architecture based on its own locally determined set of model characteristics for its component of the shared machine learning model architecture and received sets of model characteristics for each other component of the shared machine learning model architecture.
32. The computer-implemented method of claim 31, wherein the shared machine learning model architecture comprises a multi-head transformer model and the component of the shared machine learning model architecture comprises a head of the multi-head transformer.
33. The computer-implemented method of claim 32, wherein the set of model characteristics for a respective head of the multi-head transformer comprises trained head weights from the respective trained head.
34. The computer-implemented method of claim 33, wherein the multi-head transformer comprises an ensemble layer for combining prediction outputs from the trained heads to produce a final prediction vector from which a classification result is produced.
35. The computer-implemented method of claim 34, wherein combining prediction outputs from the trained heads to produce a final prediction vector comprises averaging the prediction outputs from the trained heads.
36. The computer-implemented method of any one of claims 32 to 35, wherein one or more of the heads of the multi-head transformer are pre-trained.
37. The computer-implemented method of any one of claims 32 to 36, further comprising: adding a new independent data processing node to the plurality of independent data processing nodes, the new independent data processing node comprising a new head and associated local node training data; and updating the multi-head transformer to include the new head.
38. The computer-implemented method of any one of claims 32 to 37, wherein the multi-head transformer is a multi-head vision transformer (ViT) configured to analyse image data.
39. The computer-implemented method of claim 38, wherein the image data is medical image data.
40. The computer-implemented method of any one of claims 1 to 39, wherein the local node training data is accessible only by its associated independent data processing node.
41. The computer-implemented method of any one of claims 1 to 40, wherein each independent data processing node of the plurality of independent data processing nodes is operated by an associated entity having local policy and procedures governing operation of that independent data processing node.
42. The computer-implemented method of any one of claims 1 to 41, wherein each independent data processing node of the plurality of independent data processing nodes is operated in a geographically distinct location with respect to each other independent data processing node.
43. The computer-implemented method of any one of claims 1 to 39, further comprising applying the updated machine learning model to a classification task.
44. A computing system comprising: a peer-to-peer computer network comprising a plurality of independent data processing nodes, wherein each independent data processing node comprises one or more data processors, one or more network interfaces, and one or more storage devices and is configured to implement the method of any one of claims 1 to 43.
45. An independent data processing node operating in the peer-to-peer computer network of claim 44.
46. A non-transitory computer-readable medium having stored thereon program instructions that, upon execution by a computing system, cause the computing system to perform operations comprising the method of any one of claims 1 to 43.
47. An independent data processing node operating in a peer-to-peer computer network comprising a plurality of other independent data processing nodes, wherein all independent data processing nodes comprise respective local node training data and a respective machine learning model, the independent data processing node configured to: determine a local set of model characteristics from training a local machine learning model on its local node training data; receive respective sets of model characteristics communicated from each of the plurality of other independent data processing nodes following training the respective machine learning model on the respective local node training data at each of the plurality of other independent data processing nodes; and determine a respective updated machine learning model based on its own local set of model characteristics and the received respective sets of model characteristics from each of the plurality of other independent data processing nodes.
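By way of non-limiting illustration only, the arrangement recited in claims 31 to 35 and 47 may be sketched in Python as follows. The sketch assumes that each head is a toy linear scorer standing in for a trained transformer head, and that the sets of model characteristics are exchanged as plain lists of head weights. The function names, node identifiers and data layout are hypothetical and do not form part of the claims. Transport, encryption and validation of the exchanged characteristics are omitted.

```python
from typing import Dict, List, Sequence

Vector = List[float]
# Hypothetical layout: one row of weights per output class.
HeadWeights = List[Vector]


def head_predict(weights: HeadWeights, features: Sequence[float]) -> Vector:
    """Score each class with a toy linear head (stand-in for a trained head)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def assemble_shared_model(
    own_id: str,
    own_weights: HeadWeights,
    received: Dict[str, HeadWeights],
) -> Dict[str, HeadWeights]:
    """Build the updated shared model at one node from its own locally
    trained head weights plus the head weights received from each peer."""
    model = dict(received)        # copy so the received mapping is not mutated
    model[own_id] = own_weights   # the node's own component
    return model


def ensemble_predict(
    model: Dict[str, HeadWeights], features: Sequence[float]
) -> Vector:
    """Average the prediction outputs of all heads into a final vector."""
    outputs = [head_predict(weights, features) for weights in model.values()]
    count = len(outputs)
    return [sum(column) / count for column in zip(*outputs)]


def classify(model: Dict[str, HeadWeights], features: Sequence[float]) -> int:
    """Return the index of the highest-scoring class in the final vector."""
    final = ensemble_predict(model, features)
    return max(range(len(final)), key=final.__getitem__)
```

In this sketch, a node would call assemble_shared_model once it has received head weights from every other node, and then call classify on new input features. Adding a new node with a new head, as in claim 37, corresponds here to adding one more entry to the mapping passed to ensemble_predict.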