Cloud-edge collaborative training and deployment system and method of deep learning model
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIANGXI UNIVERSITY OF FINANCE AND ECONOMICS
- Filing Date
- 2026-04-12
- Publication Date
- 2026-06-19
Figure CN122247823A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and distributed computing technology, and more specifically, to a cloud-edge collaborative training and deployment system and method for deep learning models.
Background Technology
[0002] With the rapid development of the Internet of Things (IoT) and edge computing, the demand for real-time intelligent processing on terminal devices is increasing. Cloud-edge collaborative training, as an effective paradigm, aims to leverage the powerful computing capabilities of cloud servers and the local data of edge nodes to jointly optimize deep learning models, enabling high-performance, low-latency intelligent services on resource-constrained edge devices. This model, by maintaining a global model in the cloud and performing localized training and inference at the edge, effectively utilizes distributed data while ensuring data privacy and response efficiency. It is of great significance for promoting the application of artificial intelligence in key areas such as intelligent manufacturing and smart cities.
[0003] Existing cloud-edge collaborative training frameworks typically employ basic architectures such as federated learning, which generally involve distributing structurally consistent global model copies from the cloud to all edge nodes. However, this approach has several inherent drawbacks: First, edge nodes exhibit high heterogeneity in computing power, storage resources, and local data distribution. A uniform model structure struggles to balance performance and efficiency, burdening weaker nodes and failing to fully leverage the potential of stronger nodes. Second, local training is based solely on isolated data, lacking effective alignment with the global model's knowledge structure. This can easily lead to training biases on nodes with unique data distributions, impairing the generalization ability of the global model. Finally, when aggregating model updates from various nodes, the cloud often uses a simple weighted average strategy, failing to fully consider the differences in data distribution carried by updates from different nodes. This can result in compatibility issues in the internal feature fusion layer of the aggregated global model when dealing with complex and diverse edge scenarios, limiting the overall performance improvement of the model.
[0004] Therefore, the present invention proposes a cloud-edge collaborative training and deployment system and method for deep learning models to address the above problems. The core issues to be solved are: how to dynamically allocate appropriate model structures to heterogeneous edge nodes, how to keep nodes consistent with global knowledge during local training, and how to aggregate heterogeneous model updates in the cloud in a way that enhances the compatibility and robustness of the global model.
Summary of the Invention
[0005] To overcome the aforementioned deficiencies of the prior art, embodiments of the present invention provide a cloud-edge collaborative training and deployment system and method for deep learning models, in order to solve the problems mentioned in the background art.
[0006] To achieve the above objectives, the present invention provides the following technical solution: a cloud-edge collaborative training and deployment method for a deep learning model, wherein the deep learning model is a neural network model, comprising the following steps:
S1: The cloud server dynamically generates a personalized sub-network for each edge computing node. The personalized sub-network is a reduced network formed by selecting a continuous layer sequence from the global neural network model and sampling and masking the neurons in some layers of the sequence. The length of the selected layer sequence is determined based on the historical training stability of the node, and the sampling and masking ratio of the neurons is determined based on the uniqueness of the local data distribution of the node.
S2: The cloud server distributes each generated personalized sub-network to the corresponding edge computing node.
S3: Edge computing nodes use local data to train the personalized sub-network. The loss function of the training includes a feature constraint loss, which is used to reduce the difference between the feature representation output by the last layer of the personalized sub-network and the feature representation of the adjacent subsequent layer in the global model, synchronously obtained from the cloud server.
S4: The edge computing node uploads to the cloud server the sub-network parameter increments obtained after training, as well as the mean and variance, over the batch dimension, of the feature representation output by the last layer.
S5: The cloud server performs hierarchical aggregation updates, specifically as follows: first, the feature distribution similarity between nodes is calculated based on the mean and variance of the features uploaded by each node, and the nodes are divided into multiple aggregation groups according to the similarity; then, for each aggregation group, the parameter increments of the nodes in the group that correspond to the same network layer are weighted and averaged to obtain the group-level parameter update; finally, for the fusion layer in the global model that is responsible for receiving the output of different aggregation groups, its output distribution is optimized to be compatible with the feature distribution of each aggregation group, thereby updating the parameters of the fusion layer.
S6: Based on the updated global model and the latest state information of each node, the cloud server re-executes step S1 to start the next round of collaborative training.
[0007] Preferably, the historical training stability in step S1 is calculated by: calculating the average value of the cosine of the angle between the parameter increment direction uploaded by the node and the parameter update direction after global aggregation over the past M consecutive training cycles, and the average value is the historical training stability.
[0008] Preferably, the calculation method for the uniqueness of local data distribution in step S1 is as follows: the cloud server maintains a global data feature prototype based on the statistical data of all nodes; for each node, the geodesic distance between its periodically uploaded local data features and the global data feature prototype is calculated, and the distance is normalized and used as its uniqueness of local data distribution.
[0009] Preferably, the specific rule for determining the layer sequence length in step S1 is as follows: a base layer number L_base is preset, the historical training stability of the node is mapped to a layer number adjustment amount ΔL, and the final layer sequence length L = L_base + ΔL, wherein the higher the historical training stability, the larger and more positive ΔL is.
[0010] Preferably, the specific rule for determining the sampling masking ratio of the neuron in step S1 is as follows: a basic masking ratio is preset for each layer in the neural network, and the masking ratio of the layer is adjusted upward according to the uniqueness of the local data distribution of the node. The higher the uniqueness, the greater the actual adjustment of the layer.
[0011] Preferably, the weight coefficient of the feature constraint loss in step S3 is set adaptively: during training, the difference between the output feature of the last layer of the personalized sub-network and the synchronously obtained feature of the adjacent subsequent layer is calculated in real time, and the weight of the feature constraint loss in the total loss function is dynamically adjusted according to the magnitude of this difference. The larger the difference, the higher the weight.
[0012] Preferably, the specific method for dividing nodes into multiple aggregation groups based on similarity in step S5 is as follows: using the vector composed of the mean and variance of the features uploaded by each node as its feature distribution coordinates, calculating the Bhattacharyya distance between all pairs of coordinates; using a hierarchical clustering algorithm, with the Bhattacharyya distance as the metric, gradually merging nodes until the distance between groups is greater than a preset merging threshold, thus forming the final aggregation group division.
[0013] Preferably, the specific method for optimizing the fusion layer in step S5 is as follows: the group-level update parameters from different aggregation groups are forward propagated to the fusion layer to obtain multiple sets of output feature distributions; with the goal of minimizing the Jensen-Shannon divergence between these output feature distributions, the parameters of the fusion layer are iteratively updated using the gradient descent algorithm.
[0014] A cloud-edge collaborative training and deployment system for a deep learning model, wherein the deep learning model is a neural network model, including a cloud server and multiple edge computing nodes connected through a communication network.
The cloud server includes:
- a sub-network dynamic construction module, used to perform the personalized sub-network generation process for each edge computing node;
- a node feature clustering module, used to perform node aggregation group partitioning;
- a hierarchical aggregation update module, used to perform hierarchical aggregation and fusion layer optimization operations;
- a global model management and distribution module.
The edge computing nodes include:
- a local training module, used to perform training that includes the adaptively weighted feature constraint loss;
- an update information processing and uploading module, used to prepare and upload parameter increments and feature statistics;
- a model inference service module.
[0015] Preferably, the cloud server further includes a node status monitoring module, which continuously tracks the changing trends of the historical training stability and local data distribution uniqueness of each node; when it is detected that the status value of any node in the current period deviates from its moving average by more than a preset threshold, the sub-network dynamic construction module will be triggered outside the normal period to regenerate a personalized sub-network for that node.
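The triggering condition of the node status monitoring module can be illustrated with a minimal Python sketch. It assumes a simple moving average over a fixed window and an absolute-deviation test; the class name `NodeStatusMonitor`, the window length, and the threshold value are hypothetical and are not specified in the text.

```python
from collections import deque


class NodeStatusMonitor:
    """Tracks one status value of one node (e.g. stability or uniqueness)."""

    def __init__(self, window, threshold):
        # Hypothetical parameters: moving-average window and deviation threshold.
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Record a new value; return True if regeneration should be triggered."""
        trigger = False
        if self.history:
            moving_average = sum(self.history) / len(self.history)
            # Assumption: deviation is measured as an absolute difference.
            trigger = abs(value - moving_average) > self.threshold
        self.history.append(value)
        return trigger


monitor = NodeStatusMonitor(window=3, threshold=0.2)
flags = [monitor.observe(v) for v in (0.80, 0.82, 0.40)]
```

In this usage the third observation deviates from the moving average of the first two by more than the threshold, so only the last flag is set.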
[0016] The technical effects and advantages of this invention are as follows: Compared to existing technologies that distribute the same model to all nodes, this invention designs a personalized sub-network dynamic generation mechanism to customize a uniquely structured reduced network for each edge computing node. This mechanism adaptively determines the depth and width of its sub-network based on the stability of a node's historical training behavior and the uniqueness of its local data distribution. For stable nodes with representative data, a deeper and wider network is allocated to tap into their potential; for unstable nodes or nodes with unique data, a simplified network is allocated with increased neuron masking to improve efficiency and focus. This allows the system to flexibly adapt to node heterogeneity, achieving precise allocation of computing power and model capacity matching under resource-constrained conditions, thereby improving the overall resource utilization efficiency and adaptability of the collaborative training system.
[0017] To address the issue of local training easily deviating from global knowledge, this invention introduces an adaptively weighted feature constraint loss into the local loss function of each node. This loss does not directly compare parameters, but rather constrains the consistency between the output features of the node's personalized sub-network and the feature representations of adjacent layers in the global model distributed from the cloud. Its weights are dynamically adjusted based on real-time feature differences; the greater the difference, the stronger the constraint. This design provides a clear yet flexible global knowledge anchor for the local optimization of nodes, effectively suppressing training divergence caused by data heterogeneity without excessively limiting local fitting capabilities. It guides local updates to maintain coordination with the evolution direction of the global model, thereby ensuring the generalization performance of the aggregated global model.
[0018] To address the shortcomings of neglecting the distributional differences between updates during the cloud aggregation phase, this invention proposes a hierarchical aggregation strategy based on feature distribution clustering. This strategy first divides nodes into multiple aggregation groups with similar distributions based on the feature statistics uploaded by the nodes, and then performs parameter fusion within each group. Subsequently, it optimizes the fusion layer in the global model, which is responsible for connecting different groups, with the goal of minimizing the output distribution differences when processing features from each group. This method changes the traditional uniform aggregation approach for all updates, achieving refined grouping and collaborative optimization of heterogeneous updates. This enables the global model to learn robust representations of different data distribution patterns, enhances the model's compatibility and fusion capabilities for features from different edge scenarios, and ultimately improves the overall performance and stability of the model in complex and open environments.
Attached Figure Description
[0019] Figure 1 is a flowchart illustrating the overall workflow of the method of the present invention.
[0020] Figure 2 is a flowchart illustrating the core processing and data analysis of the present invention.
[0021] Figure 3 is a flowchart of the node clustering and grouping decision-making process of the present invention.
Detailed Implementation
[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0023] Example 1. As shown in Figures 1 to 3, the cloud-edge collaborative training and deployment system and method for a deep learning model is mainly executed in a distributed system consisting of a cloud server and multiple edge computing nodes connected by a communication network.
[0024] The method involves the following steps: First, based on the analysis of the long-term behavior and data characteristics of each edge node, the cloud server dynamically generates and distributes a structurally matched personalized neural network subnetwork to each node. Then, each edge node trains this subnetwork locally using its private data, with the training process subject to a feature constraint aimed at aligning global knowledge. After training, each node uploads the updated model parameters and key statistical information reflecting its data feature distribution back to the cloud server. Upon receiving information from all nodes, the cloud server does not perform a simple global average. Instead, it intelligently groups the nodes based on the similarity of their feature distributions, performing initial aggregation within each group. Then, it fine-tunes the specific layers in the model that connect different groups, thereby updating the global model. This process is repeated continuously, enabling the global model to continuously evolve and adapt to the dynamic changes in the edge environment.
[0025] Further specific implementation methods are as follows. The collaborative training and deployment process proceeds according to the following steps:
Steps S1 and S2: Dynamic generation and distribution of personalized subnetworks.
[0026] A cloud server maintains a global deep neural network model. In traditional collaborative training, all nodes receive the same complete copy of the model, ignoring the heterogeneity of nodes in terms of computing power, data distribution, and training reliability. This leads to weak nodes being overburdened, strong nodes failing to realize their potential, and the uniform model struggling to adapt to all local data characteristics. To address this issue, at the beginning of each collaborative training cycle the cloud server of this invention generates a unique, personalized subnetwork for each edge node $i$. This subnetwork is a continuous sequence of layers truncated from the global model according to specific rules, with neurons in certain layers of this sequence being randomly masked, thus forming a reduced network that is pruned in both depth (number of layers) and width (number of neurons per layer). This pruning is guided by two core state metrics of the node.
[0027] Furthermore, the two dynamic evaluation metrics on which this personalized subnetwork is based are the historical training stability of the node, $S_i$, and the uniqueness of its local data distribution, $U_i$. $S_i$ measures the reliability of a node as a "trainer"; $U_i$ measures the uniqueness of its data as a "source of knowledge".
[0028] Furthermore, the historical training stability $S_i$ is used to quantify the consistency between a node's past training behavior and the global optimization direction. Its calculation principle is based on an observation in federated learning: ideally, the local stochastic gradient descent direction of each node should, in expectation, align with the update direction after global aggregation. Persistent deviations may indicate data heterogeneity, malicious behavior, or local overfitting. Over the past $M$ consecutive training cycles (with $M$ preset), the cloud server records the parameter update vector $\Delta\theta_i^{(t)}$ received from node $i$ and the aggregated update vector $\Delta\theta_g^{(t)}$ of the global model for the same period. The cosine similarity between the two vectors is calculated in each period: $c_i^{(t)} = \frac{\langle \Delta\theta_i^{(t)}, \Delta\theta_g^{(t)} \rangle}{\lVert \Delta\theta_i^{(t)} \rVert \, \lVert \Delta\theta_g^{(t)} \rVert}$. This value lies in $[-1, 1]$; the closer it is to $1$, the more consistent the directions. The arithmetic mean of the cosine similarity over these $M$ periods then gives $S_i = \frac{1}{M} \sum_{t=1}^{M} c_i^{(t)}$. $S_i$ is a value in the interval [0, 1] (negative values can be truncated in practice). The higher the value, the more consistent the historical training direction of the node is with the global optimization direction, and the more reliable the gradient signal it generates.
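The stability calculation can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the update vectors are assumed to be flattened to plain lists over the same parameters, and truncation is applied to the final mean (the text only says negative values "can be truncated in practice").

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def training_stability(node_updates, global_updates):
    """Mean per-cycle cosine similarity over the recorded cycles, truncated to [0, 1]."""
    if not node_updates or len(node_updates) != len(global_updates):
        raise ValueError("need the same non-zero number of node and global updates")
    sims = [cosine_similarity(u, g) for u, g in zip(node_updates, global_updates)]
    mean_sim = sum(sims) / len(sims)
    # Assumption: truncation is applied once, to the averaged value.
    return min(1.0, max(0.0, mean_sim))


# Two recorded cycles: one perfectly aligned update, one orthogonal update.
stability = training_stability(
    node_updates=[[1.0, 0.0], [0.0, 1.0]],
    global_updates=[[1.0, 0.0], [1.0, 0.0]],
)
```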
[0029] Furthermore, the uniqueness of local data distribution $U_i$ is used to assess the deviation of a node's data from the overall data, identifying the "long tail" or "marginal" portions of the data distribution. A lightweight global data feature prototype $P_g$ is maintained on the cloud server. It can be a reference point obtained by calculating the cluster center of, or simply averaging, the local data feature means periodically reported by all nodes.
[0030] For node $i$, the Euclidean distance from its feature mean $\mu_i$ to the prototype $P_g$ is calculated as $d_i = \lVert \mu_i - P_g \rVert_2$. This reflects the absolute deviation. However, relying solely on absolute distance can lead to misjudgment, because the overall feature space itself can be highly dispersed. Therefore, a local comparison is introduced: the $k$ other nodes whose feature means are closest to $\mu_i$ in the feature space are found, and the average distance $\bar{d}_i$ from node $i$ to these $k$ nearest neighbors is calculated. This reflects local density. The uniqueness $U_i$ can be calculated from the ratio of the absolute distance to the local average distance: $U_i = \sigma\!\left(\frac{d_i}{\bar{d}_i + \epsilon}\right)$, where $\sigma$ is the Sigmoid function, used to smoothly map the result to the $(0, 1)$ interval, and $\epsilon$ is a small constant that prevents division by zero. A ratio greater than $1$ indicates that the node's features are farther from the global center than from its neighbors, indicating high uniqueness; a ratio less than $1$ indicates that the node's features are located in a relatively dense region. The larger $U_i$ is, the more unique the data distribution of that node, which may represent a rare but potentially important pattern.
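A minimal Python sketch of this ratio-based uniqueness score follows. The function names and the default values of `k` and `eps` are hypothetical, and applying the Sigmoid directly to the ratio is an assumption about the exact form of the formula. Under that assumption the score is at least 0.5, because the ratio is non-negative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def uniqueness(node_mean, other_means, prototype, k=3, eps=1e-8):
    """Sigmoid of (distance to global prototype) / (mean distance to k nearest nodes)."""
    if not other_means:
        raise ValueError("at least one other node is required")
    d_global = euclidean(node_mean, prototype)
    nearest = sorted(euclidean(node_mean, m) for m in other_means)[:k]
    d_local = sum(nearest) / len(nearest)
    ratio = d_global / (d_local + eps)  # eps guards against division by zero
    return 1.0 / (1.0 + math.exp(-ratio))


prototype = [0.0, 0.0]
# A node far from the prototype but with close neighbours: ratio of about 2.
u_far = uniqueness([2.0, 0.0], [[2.0, 1.0], [2.0, -1.0], [10.0, 10.0]], prototype, k=2)
# A node near the prototype whose neighbours are far away: ratio close to 0.
u_near = uniqueness([0.1, 0.0], [[5.0, 0.0], [-5.0, 0.0]], prototype, k=2)
```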
[0031] After $S_i$ and $U_i$ are obtained, the structural parameters of the subnetwork are determined according to the following rules: reliable nodes are given stronger expressive power, and unique data is subject to stronger regularization.
[0032] The subnetwork depth (number of layers $L_i$) is mainly determined by $S_i$, because deeper networks have a stronger fitting ability but are also more prone to overfitting or producing unstable updates.
[0033] A base depth $L_{base}$ is set, and $S_i$ is converted to a depth adjustment amount $\Delta L_i$ through a linear or piecewise linear mapping function. A simple rule could be a three-tier piecewise mapping with preset thresholds on $S_i$, in which a higher tier of $S_i$ yields a larger $\Delta L_i$. Therefore, $L_i = \mathrm{clip}(L_{base} + \Delta L_i,\ L_{min},\ L_{max})$, where the $\mathrm{clip}$ function limits the number of layers to a minimum $L_{min}$ and a maximum $L_{max}$. This ensures that nodes with stable training receive deeper subnetworks to learn complex patterns, while unstable nodes receive shallower networks to guarantee training robustness.
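The depth rule can be sketched as follows. The two thresholds (`low`, `high`) and the adjustment size (`step`) are hypothetical placeholders chosen only for illustration; the text states only that higher stability maps to a larger adjustment and that the result is clipped to a minimum and maximum.

```python
def subnetwork_depth(stability, base_depth, min_depth, max_depth,
                     low=0.3, high=0.7, step=2):
    """Map stability to a layer count: base depth plus a piecewise adjustment, clipped."""
    # Hypothetical three-tier rule: the thresholds and step size are assumptions.
    if stability >= high:
        delta = step
    elif stability <= low:
        delta = -step
    else:
        delta = 0
    return max(min_depth, min(max_depth, base_depth + delta))


stable_depth = subnetwork_depth(0.9, base_depth=8, min_depth=4, max_depth=12)
unstable_depth = subnetwork_depth(0.1, base_depth=8, min_depth=4, max_depth=12)
```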
[0034] Furthermore, the width of each layer in the subnetwork (the proportion of neurons retained, $r_i$) is mainly determined by $U_i$. The principle behind this decision is to apply a "width penalty" to unique data as an implicit form of structured regularization.
[0035] Specifically, a base retention ratio $r_{base}$ is set for each layer. The width adjustment lowers the retention ratio $r_i$ below $r_{base}$ as $U_i$ increases, where $\alpha$ is a scaling factor used to control the intensity of the penalty. This means that for nodes with highly unique data ($U_i$ close to $1$), each layer retains only a reduced fraction of its neurons. When the subnetwork is generated, neurons in each layer, together with their connections, are randomly dropped so that the proportion $r_i$ is retained. This design forces the network to process unique data in narrower channels, thus requiring it to learn more core, discriminative features, suppressing overfitting to noise or idiosyncratic details, and enabling its learned knowledge to generalize and benefit the global model.
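One possible reading of the width rule is sketched below in Python. The multiplicative form `base_ratio * (1 - alpha * uniqueness)`, the lower bound `floor`, and all default values are assumptions made for illustration; the text fixes only the direction of the adjustment and the role of the scaling factor.

```python
import random


def retention_ratio(uniqueness, base_ratio=0.8, alpha=0.5, floor=0.1):
    """Shrink the base retention ratio as uniqueness grows (assumed linear form)."""
    return max(floor, base_ratio * (1.0 - alpha * uniqueness))


def sample_neuron_mask(num_neurons, ratio, rng):
    """Return a 0/1 mask keeping a random subset of about `ratio` of the neurons."""
    keep = min(num_neurons, max(1, round(num_neurons * ratio)))
    kept = set(rng.sample(range(num_neurons), keep))
    return [1 if j in kept else 0 for j in range(num_neurons)]


ratio = retention_ratio(uniqueness=1.0)          # highly unique node
mask = sample_neuron_mask(10, ratio, random.Random(0))
```

A seeded `random.Random` instance is passed in so that the cloud server could reproduce the same mask for a node if needed; that is a design convenience of the sketch, not something the text requires.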
[0036] Step S3: Local training combined with feature constraints.
[0037] Upon receiving its personalized subnetwork, edge node $i$ trains it using its local data. In classic FedAvg, nodes independently optimize their local models, which is prone to "client drift" due to data heterogeneity: each node's model converges toward its own local optimum, impairing the performance of the global model.
[0038] The training objective function of this scheme includes a main task loss $\mathcal{L}_{task}$ for the specific task (such as image classification, for which cross-entropy loss may be used), and additionally introduces a feature constraint loss $\mathcal{L}_{feat}$. The specific construction of this loss term is as follows: during the forward propagation process of local training, the output features $F_{local}$ of the last layer of the personalized sub-network (i.e., before the output layer) are recorded. At the same time, before training begins, the node synchronously obtains from the cloud server the output features $F_{global}$ of the adjacent layer immediately following the end of the sub-network in the current global model (under standard forward propagation), and this feature serves as a stable "teacher" signal. The aim is to minimize the difference between $F_{local}$ and $F_{global}$, measured using mean squared error: $\mathcal{L}_{feat} = \mathrm{MSE}(F_{local}, F_{global})$. Therefore, the total training loss of the node is $\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda \, \mathcal{L}_{feat}$, where $\lambda$ is a tradeoff coefficient. Through this loss, while nodes strive to fit local data, the feature representations of their intermediate layers are guided to move closer to the representation space expected by the global model, which is equivalent to performing a kind of local feature space knowledge distillation.
[0039] Furthermore, the coefficient $\lambda$ is designed to be dynamically adaptive rather than a fixed value, to meet the differing alignment needs of different nodes, or of the same node at different training stages. Its adjustment logic is based on real-time feedback: after each training batch, the current batch's $\mathcal{L}_{feat}$ value is calculated, and its exponential moving average $\bar{\mathcal{L}}_{feat}$ over a sliding window of the most recent batches is maintained. A threshold $\tau$ that reflects the acceptable alignment error is set. The update rule for the dynamic weight is: $\lambda = \lambda_0 + \beta \cdot \max(0,\ \bar{\mathcal{L}}_{feat} - \tau)$, where $\lambda_0$ is the base weight and $\beta$ is the scaling factor.
[0040] This means that when the average difference $\bar{\mathcal{L}}_{feat}$ between local features and global features is below the threshold $\tau$, the alignment is considered good and only the base-level constraint is applied ($\lambda = \lambda_0$); once $\bar{\mathcal{L}}_{feat}$ exceeds the threshold, the constraint strength increases linearly, applying a stronger "pull" to correct potential biases from local training. With this mechanism, nodes with highly unique data ($U_i$ high) or unstable training ($S_i$ low) automatically receive stronger global guidance, while nodes that are close to the global distribution receive less intervention, preserving their flexibility in local learning.
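The feature constraint and its adaptive weight can be sketched in plain Python as below. The class name, the EMA decay value, and the additive form of the update (base weight plus scale times the excess over the threshold) are assumptions used for illustration; a real training loop would compute the loss on framework tensors rather than lists.

```python
def mse(features_a, features_b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(features_a, features_b)) / len(features_a)


class AdaptiveFeatureWeight:
    """Keeps an EMA of the feature loss and raises the weight once it exceeds a threshold."""

    def __init__(self, base_weight, threshold, scale, decay=0.9):
        self.base_weight = base_weight
        self.threshold = threshold
        self.scale = scale
        self.decay = decay  # hypothetical EMA decay, not given in the text
        self.ema = None

    def update(self, batch_feature_loss):
        """Fold in one batch loss and return the weight to use for the next batch."""
        if self.ema is None:
            self.ema = batch_feature_loss
        else:
            self.ema = self.decay * self.ema + (1.0 - self.decay) * batch_feature_loss
        excess = max(0.0, self.ema - self.threshold)
        return self.base_weight + self.scale * excess


weight = AdaptiveFeatureWeight(base_weight=0.1, threshold=0.5, scale=2.0, decay=0.5)
w_aligned = weight.update(0.2)   # EMA 0.2 is below the threshold: base weight only
w_drifting = weight.update(1.8)  # EMA becomes 1.0, exceeding the threshold by 0.5
feature_loss = mse([1.0, 2.0], [1.0, 4.0])
total_loss = 0.7 + w_drifting * feature_loss  # 0.7 stands in for a task loss value
```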
[0041] Step S4: Upload update information.
[0042] After local training is complete, each edge node uploads update information to the cloud server. The uploaded content is designed to balance information volume, communication overhead, and privacy protection. It mainly includes two parts:
1. Parameter increment $\Delta\theta_i$: the difference between the subnetwork parameters after training and the initial parameters at reception. Uploading only the increments instead of all parameters can significantly reduce the amount of data transmitted.
[0043] 2. Feature distribution statistics: the mean vector $\mu_i$ and variance vector $\sigma_i^2$ of the features $F_{local}$ used for alignment in step S3, calculated locally across multiple training batches (e.g., all batches in one epoch). The mean $\mu_i$ describes the central tendency of the node data, and the variance $\sigma_i^2$ characterizes its degree of dispersion. Together, these two define a simple Gaussian distribution with diagonal covariance, $\mathcal{N}(\mu_i, \mathrm{diag}(\sigma_i^2))$, which is used to approximate the distribution of the node data in the feature subspace. This statistical information is highly compressed (only $2D$ floating-point numbers, where $D$ is the feature dimension), does not contain any original data, protects privacy, and at the same time provides a key distributional basis for intelligent aggregation in the cloud.
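As an illustration of the second upload item, the per-dimension statistics can be accumulated across batches without storing the features themselves. The sketch below assumes each batch is a list of feature rows and uses the population variance; both choices, and the function name, are assumptions.

```python
def feature_statistics(batches):
    """Per-dimension mean and (population) variance over all rows in all batches."""
    count = 0
    total = None
    total_sq = None
    for batch in batches:
        for row in batch:
            if total is None:
                total = [0.0] * len(row)
                total_sq = [0.0] * len(row)
            for j, x in enumerate(row):
                total[j] += x
                total_sq[j] += x * x
            count += 1
    if count == 0:
        raise ValueError("no feature rows were provided")
    mean = [t / count for t in total]
    # E[x^2] - E[x]^2, clamped at zero to absorb floating-point round-off.
    var = [max(0.0, s / count - m * m) for s, m in zip(total_sq, mean)]
    return mean, var


# Two batches of 2-dimensional features: the upload is 2 * D = 4 floats.
mean, var = feature_statistics([[[1.0, 0.0], [3.0, 0.0]], [[5.0, 0.0]]])
```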
[0044] Step S5: Hierarchical aggregation update based on feature clustering.
[0045] The cloud server collects the uploaded information $(\Delta\theta_i, \mu_i, \sigma_i^2)$ from all nodes and then initiates a global model update. Traditional federated averaging takes a weighted average of the updates from all nodes, implicitly assuming that the data are independent and identically distributed, which can reduce model performance in heterogeneous scenarios.
[0046] First, node clustering is performed. The goal is to find sets of nodes with similar feature distributions. For any two nodes $i$ and $j$, using the statistics they uploaded, their feature distributions are approximated as $\mathcal{N}(\mu_i, \Sigma_i)$ and $\mathcal{N}(\mu_j, \Sigma_j)$. The Bhattacharyya distance is used as the measure of distribution similarity; it considers both the mean and the variance (covariance) of the distributions, providing a more comprehensive comparison than comparing only the means. Its calculation formula is: $D_B(i, j) = \frac{1}{8} (\mu_i - \mu_j)^{\top} \Sigma^{-1} (\mu_i - \mu_j) + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_i \, \det \Sigma_j}}$, where $\Sigma = \frac{\Sigma_i + \Sigma_j}{2}$. Here the covariance is assumed to be a diagonal matrix, $\Sigma_i = \mathrm{diag}(\sigma_i^2)$, so $\det$ is the product of the diagonal elements. The smaller $D_B$ is, the more similar the two distributions are. Based on the calculated $N \times N$ Bhattacharyya distance matrix (where $N$ is the number of nodes), a bottom-up hierarchical clustering algorithm (AGNES) is used.
[0047] At the start of the algorithm, each node forms its own class. At each step, the two classes with the smallest Bhattacharyya distance among all current classes are merged, and the distances between the new class and the other classes are updated (e.g., using average linkage). This process stops when the average distance between all remaining classes exceeds a preset merging threshold. Finally, all nodes are divided into $K$ aggregation groups $G_1, \dots, G_K$. The feature distributions of nodes within a group are similar, while the differences between groups are relatively large.
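The clustering step can be sketched in Python as follows, using the standard Bhattacharyya distance between two Gaussians with diagonal covariance and a straightforward average-linkage agglomeration. The function names and the threshold value in the usage lines are hypothetical, all variances are assumed to be strictly positive, and merging stops once the smallest inter-group distance exceeds the threshold.

```python
import math


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                       # averaged variance for this dimension
        dist += 0.125 * (m1 - m2) ** 2 / v        # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))  # covariance-mismatch term
    return dist


def cluster_nodes(stats, merge_threshold):
    """Average-linkage agglomerative clustering over (mean, variance) pairs."""
    n = len(stats)
    dist = [[bhattacharyya_diag(*stats[a], *stats[b]) for b in range(n)]
            for a in range(n)]
    groups = [[i] for i in range(n)]

    def linkage(g1, g2):
        return sum(dist[a][b] for a in g1 for b in g2) / (len(g1) * len(g2))

    while len(groups) > 1:
        best_d, p, q = min(
            ((linkage(groups[p], groups[q]), p, q)
             for p in range(len(groups)) for q in range(p + 1, len(groups))),
            key=lambda t: t[0],
        )
        if best_d > merge_threshold:
            break  # every remaining pair of groups is farther apart than the threshold
        groups[p] = groups[p] + groups[q]
        del groups[q]
    return groups


node_stats = [([0.0], [1.0]), ([0.1], [1.0]), ([10.0], [1.0]), ([10.2], [1.0])]
groups = cluster_nodes(node_stats, merge_threshold=1.0)
```

For larger deployments a library routine for hierarchical clustering on a precomputed distance matrix would likely be preferable to this quadratic search per merge.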
[0048] Secondly, parameter aggregation is performed within each group. For each aggregation group $G_k$, because nodes within the group have similar data distributions, their model updates are theoretically more consistent. For each group separately, the cloud server takes the parameter increments $\Delta\theta_i$ uploaded by all nodes in the group that correspond to the same shared layer in the global model and computes their weighted average to obtain the aggregate update for this group, $\Delta\theta_{G_k}$. The weight $w_i$ can be set proportional to the node's local data volume $n_i$ ($w_i \propto n_i$), or further combined with its stability $S_i$. Then $\Delta\theta_{G_k}$ is applied to the corresponding layer of the global model. This step is equivalent to first forming a local consensus model within several data-homogeneous subgroups.
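The in-group step is an ordinary weighted average, sketched below. The function name is hypothetical, and the sketch assumes the increments passed in already refer to the same layer and have been aligned to the same shape (for example by treating masked neurons as zero increments), which the text does not spell out.

```python
def aggregate_group(increments, weights):
    """Weighted average of same-shape parameter increments from one aggregation group."""
    if not increments or len(increments) != len(weights):
        raise ValueError("need one weight per increment")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(increments[0])
    return [
        sum(w * inc[j] for inc, w in zip(increments, weights)) / total
        for j in range(dim)
    ]


# Two nodes in one group, weighted by local sample counts 100 and 300.
group_update = aggregate_group([[1.0, 2.0], [3.0, 6.0]], weights=[100, 300])
```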
[0049] Finally, inter-group fusion layer optimization is performed. The global model typically contains specific layers (such as fully connected layers or a Transformer block) whose input depends on the processing results of multiple preceding branches or paths. These layers can be considered "fusion layers," with parameters $\theta_{fuse}$.
[0050] After aggregation within the groups, the paths in the global model that correspond to different groups (i.e., the portions updated by the different $\Delta\theta_{G_k}$) generate diverse intermediate features that flow into the fusion layer. To enhance the robustness and fusion capability of the global model with respect to diverse inputs, $\theta_{fuse}$ is specifically optimized so that the fusion layer maps features from different distributions to a more harmonious common subspace. The optimization objective is to minimize the difference between the output distributions produced by the fusion layer for different groups of inputs. The Jensen-Shannon divergence (JSD) is used as a measure of the difference between multiple distributions because it is symmetric and smooth. The alignment loss is defined as: $\mathcal{L}_{align} = \mathrm{JSD}(P_1, \dots, P_K) = \frac{1}{K} \sum_{k=1}^{K} D_{KL}(P_k \,\|\, \bar{P})$, where $P_k$ denotes the distribution of the fusion layer's output features when the global model, driven by the $k$-th group's aggregated parameters $\Delta\theta_{G_k}$, is forward propagated to the fusion layer (this distribution can be approximated by calculating the mean and covariance of the batch of outputs); $\bar{P} = \frac{1}{K} \sum_{k=1}^{K} P_k$ is the arithmetic mean of the $K$ distributions; and $D_{KL}$ is the KL divergence. Using the gradient descent algorithm, only the fusion layer parameters $\theta_{fuse}$ are updated to minimize $\mathcal{L}_{align}$. After several iterations, the fusion layer learns to adjust its weights so that its output tends toward common statistical properties that are more favorable to subsequent tasks, regardless of which data distribution the input features come from, thereby improving the generalization ability of the global model on unknown heterogeneous data.
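To make the alignment objective concrete, the sketch below evaluates the multi-distribution Jensen-Shannon divergence with uniform weights. It is only the objective, not the gradient-descent update of the fusion layer. It also swaps in an assumption: each output distribution is represented as a discrete histogram over shared bins, whereas the text approximates the distributions by their mean and covariance; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(distributions):
    """Uniform-weight Jensen-Shannon divergence among K discrete distributions."""
    k = len(distributions)
    # Arithmetic mean distribution; wherever any p_i > 0 the mixture is also > 0.
    mixture = [sum(column) / k for column in zip(*distributions)]
    return sum(kl_divergence(p, mixture) for p in distributions) / k


identical = js_divergence([[0.5, 0.5], [0.5, 0.5]])   # groups already aligned
disjoint = js_divergence([[1.0, 0.0], [0.0, 1.0]])    # groups maximally different
```

With two distributions the value ranges from 0 (identical outputs) to ln 2 (non-overlapping outputs), so a falling value during fusion layer optimization indicates that the group-wise outputs are becoming more compatible.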
[0051] Step S6: Iterate through the loop.
[0052] After completing the update in step S5, the cloud server obtains a new generation of the global model that integrates the knowledge of the various groups and has an optimized internal fusion layer. Subsequently, the system returns to step S1 and, based on this latest global model and the latest state of each node ($S_i$ and $U_i$, calculated from the information uploaded in the previous round), begins the next round of personalized subnetwork generation, distribution, and collaborative training, forming a self-evolving closed loop. This dynamic adjustment mechanism enables the system to continuously adapt to changes in node data distribution and training state transitions.
[0053] Example 2. Application scenario: Consider a collaborative training scenario for a customer behavior analysis model in a cross-regional smart retail chain. The cloud server is located at headquarters, and the edge nodes are local servers in each store, responsible for analyzing video streams from in-store cameras to identify customer behaviors (such as "browsing," "trying out," and "purchase intention") and product attention. The global model is a 3D convolutional neural network for video behavior recognition.
[0054] Round t of collaborative training: 1. Personalized sub-network generation and distribution (S1, S2): The headquarters cloud server calculates the state indicators of each store based on its performance in the previous round.
[0055] Store A (large flagship store in the city center): historical training stability is high (stable, classic customer flow pattern), and data uniqueness is low (data close to the mainstream distribution). Headquarters generates a deeper sub-network for it, with a per-layer width retention ratio determined by its data uniqueness.
[0056] Store B (newly opened art-themed community store): historical training stability is average, and data uniqueness is high (unique customer base and behavioral patterns). Headquarters generates a shallower sub-network for it, again with a per-layer width retention ratio determined by its data uniqueness. The two sub-networks, with their distinct structures, are distributed to the respective stores.
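The mapping from node state to sub-network structure can be illustrated with a short sketch. It follows the rules stated in the claims (depth equals a base layer count plus an adjustment that grows with stability; the masking ratio is adjusted upward with uniqueness), but the linear mappings, the default constants, and the function name `subnetwork_spec` are assumptions chosen only for illustration.

```python
def subnetwork_spec(stability, uniqueness, total_layers,
                    l_base=6, max_delta=2, base_mask=0.2, max_extra_mask=0.3):
    """Return (depth, width_retention_ratio) for one node.

    stability:  mean cosine similarity, assumed to lie in [-1, 1]
    uniqueness: normalized distance to the global prototype, assumed in [0, 1]
    All constants are hypothetical defaults, not values from the source.
    """
    # Depth rule: L = L_base + delta_L, with delta_L increasing in stability.
    delta_l = round(max_delta * stability)
    depth = max(1, min(total_layers, l_base + delta_l))
    # Masking rule: base ratio adjusted upward as uniqueness increases.
    mask_ratio = min(0.9, base_mask + max_extra_mask * uniqueness)
    return depth, 1.0 - mask_ratio

# Hypothetical indicator values for a stable mainstream node and a unique node.
depth_a, keep_a = subnetwork_spec(stability=0.9, uniqueness=0.1, total_layers=10)
depth_b, keep_b = subnetwork_spec(stability=-0.4, uniqueness=0.8, total_layers=10)
```

With these illustrative inputs the stable, mainstream node receives a deeper and wider sub-network than the node with unusual data, mirroring the contrast between stores A and B.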
[0057] 2. Local Training (S3): Stores A and B use local surveillance video data from this week to train their subnetworks.
[0058] During the training of store A, because its data is similar to the global features, the feature difference is small, and the dynamic weight remains at approximately its base value.
[0059] In the initial training phase of store B, because of the unique characteristics of its customers' clothing and shopping paths, the extracted features differ significantly from the features synchronized from headquarters. The feature difference rapidly exceeds the threshold, and the dynamic weight rises, imposing stronger alignment constraints to prevent the model from learning overly specific, non-generalizable patterns.
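The behavior described for stores A and B (weight held at a base value while the feature difference is small, then rising once it exceeds a threshold) can be sketched as below. The piecewise-linear rule, the mean-squared feature difference, and all constants are assumptions made for illustration; the names are hypothetical.

```python
def feature_diff(local_feat, global_feat):
    """Mean squared difference between local and synchronized global features."""
    return sum((a - b) ** 2 for a, b in zip(local_feat, global_feat)) / len(local_feat)

def dynamic_weight(diff, base=0.1, threshold=1.0, gain=0.5, max_weight=1.0):
    """Weight of the feature constraint loss.

    Stays at `base` while diff <= threshold, then grows linearly with the
    excess, capped at `max_weight`. All constants are hypothetical.
    """
    if diff <= threshold:
        return base
    return min(max_weight, base + gain * (diff - threshold))

def total_loss(task_loss, local_feat, global_feat):
    """Task loss plus the adaptively weighted feature constraint loss."""
    diff = feature_diff(local_feat, global_feat)
    return task_loss + dynamic_weight(diff) * diff
```

A node whose features track the global ones pays only the small base penalty, while a node that drifts is pulled back more strongly the further it drifts.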
[0060] 3. Information upload (S4): After training is completed, store A uploads the parameter increments for all shared layers of its 8-layer sub-network, together with the statistics (mean and variance) of the feature distribution of its last layer.
[0061] Store B uploads the parameter increments for the corresponding shared layers of its 5-layer sub-network, together with the statistics (mean and variance) of its feature distribution.
[0062] The other stores (C, D, E, F, etc.) follow the same procedure.
[0063] 4. Hierarchical aggregation update (S5): Clustering (S5.1): Headquarters calculates the Bhattacharyya distance between the feature distributions of all stores. It finds that the feature distributions of stores A, C, and D are very similar to each other, and that the feature distributions of stores B, E, and F are similar to each other but far from those of the group containing store A. Using hierarchical clustering with a preset merging threshold, two aggregation groups are formed automatically: the "Main Commercial Area Group" and the "Specialty Community Store Group".
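The clustering step can be sketched as follows. This is an illustrative sketch rather than the patented procedure: it assumes each store's feature distribution is a diagonal Gaussian described by the uploaded mean and variance, uses average linkage (the source does not name a linkage rule), and the store statistics and threshold are invented example values.

```python
import math
from itertools import combinations

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)
        dist += 0.125 * (m1 - m2) ** 2 / v + 0.5 * math.log(v / math.sqrt(v1 * v2))
    return dist

def cluster_nodes(stats, threshold):
    """Agglomerative clustering; stops when the closest pair exceeds threshold.

    stats: dict mapping node name -> (mean list, variance list).
    """
    clusters = [[name] for name in stats]

    def linkage(a, b):
        # Average linkage: mean pairwise distance (an assumed choice).
        dists = [bhattacharyya(*stats[x], *stats[y]) for x in a for y in b]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = linkage(clusters[i], clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical two-dimensional feature statistics for six stores.
store_stats = {
    "A": ([0.0, 0.0], [1.0, 1.0]), "B": ([3.0, 3.0], [1.0, 1.0]),
    "C": ([0.1, 0.0], [1.0, 1.0]), "D": ([0.0, 0.2], [1.0, 1.0]),
    "E": ([3.1, 3.0], [1.0, 1.0]), "F": ([3.0, 2.9], [1.0, 1.0]),
}
groups = cluster_nodes(store_stats, threshold=0.5)
```

Because merging stops at the threshold rather than at a fixed cluster count, the number of aggregation groups follows from the data instead of being set in advance.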
[0064] Intra-group aggregation (S5.2): Within the first group, the parameter increments uploaded by stores A, C, and D, corresponding to the 5 shared layers among the first 8 layers of the global model, are weighted and averaged (assuming weights based on data volume) to obtain the group-level aggregate update, which is applied to layers 1-5 of the global model. Within the second group, the parameter increments of the corresponding layers uploaded by stores B, E, and F are averaged to obtain that group's aggregate update, which is also applied to layers 1-5 of the global model. (Note: the two group-level updates target the same layers, and handling this is part of the hierarchical aggregation. The actual implementation may use a more complex parameterization or branching structure; for simplicity, this means that two different sets of updates to the shared base layers need to be merged, and the key to merging lies in the subsequent fusion layer.)
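The intra-group weighted average can be sketched as below, assuming (as the example does) that weights are proportional to each store's data volume. Parameter increments are represented as flat lists per layer, and only layers uploaded by every node in the group are aggregated; both choices, and the function name, are simplifying assumptions.

```python
def aggregate_group(increments, sample_counts):
    """Data-volume-weighted average of per-layer parameter increments.

    increments:    dict node -> {layer name: list of floats}
    sample_counts: dict node -> number of local training samples
    """
    total = sum(sample_counts[node] for node in increments)
    # Assumption: aggregate only the layers shared by all nodes in the group.
    shared = set.intersection(*(set(layers) for layers in increments.values()))
    aggregated = {}
    for layer in sorted(shared):
        size = len(next(iter(increments.values()))[layer])
        aggregated[layer] = [
            sum(sample_counts[n] * increments[n][layer][k] for n in increments) / total
            for k in range(size)
        ]
    return aggregated

# Hypothetical increments for two stores of one group.
group_update = aggregate_group(
    {"A": {"layer1": [1.0, 2.0], "layer6": [9.0]}, "C": {"layer1": [3.0, 4.0]}},
    {"A": 300, "C": 100},
)
```

Here store A contributes three times the weight of store C, and `layer6`, which only A uploaded, is left out of the group update.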
[0065] Inter-group fusion optimization (S5.3): Assume that layer 6 of the global model is the key fully connected fusion layer. Headquarters performs the optimization as follows: 1. Using the model parameters updated with each of the two group-level updates in turn, forward-propagate a representative calibration dataset (or sample features extracted from each group) to layer 6 to obtain two sets of output features, and estimate their respective distributions.
[0066] 2. Calculate the alignment loss between the two distributions.
[0067] 3. Calculate the gradient of the alignment loss with respect to the fusion layer parameters using backpropagation, and update the parameters using an optimizer (such as SGD), for example $\theta_f \leftarrow \theta_f - \eta \nabla_{\theta_f} \mathcal{L}_{align}$, where $\theta_f$ denotes the fusion layer parameters, $\mathcal{L}_{align}$ the alignment loss, and $\eta$ the learning rate.
[0068] 4. Repeat the iteration several times, until the fusion layer has been adjusted to a state in which it can better handle the different characteristics of commercial-area stores and community stores.
[0069] 5. Loop (S6): Round t ends, and the updated global model is obtained. In round t+1, headquarters recalculates the historical training stability based on the update directions uploaded by stores A and B in the latest training session. Because the optimization of the fusion layer may make the global model more accommodating of community-store characteristics, store B's training direction may become more consistent with the global direction, and its stability may improve to 0.7. At the same time, the data uniqueness is recalculated based on the new features. Based on these new states, headquarters may generate a slightly deeper sub-network for store B, continuously optimizing collaboration efficiency.
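The stability recalculation mentioned here is defined in claim 2 as the mean cosine between a node's uploaded increment direction and the globally aggregated update direction over the past M rounds. A minimal sketch of that definition follows; parameter updates are flattened to plain lists, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two flattened update vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Assumption: a zero update counts as uncorrelated.
    return dot / (norm_u * norm_v)

def training_stability(node_updates, global_updates, m):
    """Mean cosine over the most recent m rounds (m >= 1)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    recent = list(zip(node_updates, global_updates))[-m:]
    return sum(cosine(u, g) for u, g in recent) / len(recent)
```

A value near 1 indicates that the node has consistently moved in the same direction as the aggregated model, which the method rewards with a deeper sub-network.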
[0070] This application scenario fully demonstrates how the method of this invention accurately matches node characteristics through dynamic subnetwork allocation, stabilizes local training and prevents drift through adaptive feature constraint loss, and intelligently identifies, groups, and fuses knowledge from heterogeneous data sources through hierarchical aggregation and fusion layer optimization based on distribution clustering. The entire process, while strictly protecting the privacy of the original data of each node, systematically solves the core challenges faced by traditional cloud-edge collaborative training in heterogeneous environments, ultimately training a unified intelligent model with strong generalization capabilities and scenario adaptability.
[0071] Finally, the following points should be noted: First, in the description of this application, it should be noted that, unless otherwise specified and limited, the terms "installation", "connection", and "linkage" should be interpreted broadly, and can be mechanical or electrical connections, or internal connections between two components, or direct connections. "Up", "down", "left", "right", etc. are only used to indicate relative positional relationships. When the absolute position of the described object changes, the relative positional relationship may change. Secondly: The accompanying drawings of the embodiments disclosed in this invention only involve the structures involved in the embodiments disclosed in this invention. Other structures can refer to the general design. In the absence of conflict, the same embodiment and different embodiments of this invention can be combined with each other. In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A cloud-edge collaborative training and deployment method for a deep learning model, wherein the deep learning model is a neural network model, characterized in that it includes the following steps:
S1: The cloud server dynamically generates a personalized sub-network for each edge computing node. The personalized sub-network is a reduced network formed by selecting a continuous layer sequence from the global neural network model and sampling and masking the neurons in some layers of the sequence. The length of the selected layer sequence is determined based on the historical training stability of the node, and the sampling and masking ratio of the neurons is determined based on the uniqueness of the local data distribution of the node;
S2: The cloud server distributes the generated personalized sub-networks to the corresponding edge computing nodes;
S3: The edge computing nodes use local data to train the personalized sub-network. The loss function of the training includes a feature constraint loss, which is used to reduce the difference between the feature representation output by the last layer of the personalized sub-network and the feature representation of the adjacent subsequent layer in the global model, obtained synchronously from the cloud server;
S4: The edge computing node uploads the sub-network parameter increments obtained after training, as well as the mean and variance, over the batch dimension, of the feature representation output by the last layer, to the cloud server;
S5: The cloud server performs a hierarchical aggregation update, specifically as follows: first, the feature distribution similarity between nodes is calculated based on the mean and variance of the features uploaded by each node, and the nodes are divided into multiple aggregation groups according to the similarity; then, for each aggregation group, the parameter increments of the nodes in the group that correspond to the same network layer are weighted and averaged to obtain the group-level parameter update; finally, for the fusion layer in the global model that is responsible for receiving the outputs of different aggregation groups, its output distribution is optimized to be compatible with the feature distribution of each aggregation group, thereby updating the parameters of the fusion layer;
S6: Based on the updated global model and the latest state information of each node, the cloud server re-executes step S1 to start the next round of collaborative training.
2. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the historical training stability in step S1 is calculated as follows: for each of the past M consecutive training cycles, the cosine of the angle between the parameter increment direction uploaded by the node and the parameter update direction after global aggregation is calculated, and the average of these cosine values is taken as the historical training stability.
3. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the uniqueness of the local data distribution in step S1 is calculated as follows: the cloud server maintains a global data feature prototype based on the statistical data of all nodes; for each node, the geodesic distance between its periodically uploaded local data features and the global data feature prototype is calculated, and the normalized distance is used as the node's local data distribution uniqueness.
4. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the length of the layer sequence in step S1 is determined by the following rule: a base layer number L_base is preset, the historical training stability of the node is mapped to a layer number adjustment amount ΔL, and the final layer sequence length is L = L_base + ΔL, where the higher the historical training stability, the larger (more positive) ΔL is.
5. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the sampling and masking ratio of the neurons in step S1 is determined by the following rule: a basic masking ratio is preset for each layer in the neural network, and the masking ratio of each layer is adjusted upward according to the uniqueness of the local data distribution of the node; the higher the uniqueness, the greater the upward adjustment.
6. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the weight coefficient of the feature constraint loss in step S3 is set adaptively: during training, the difference between the output features of the last layer of the personalized sub-network and the synchronously obtained features of the adjacent subsequent layer is calculated in real time, and the weight of the feature constraint loss in the total loss function is dynamically adjusted according to the magnitude of the difference; the larger the difference, the higher the weight.
7. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the nodes are divided into multiple aggregation groups based on similarity in step S5 as follows: the vector composed of the mean and variance of the features uploaded by each node is used as its feature distribution coordinates, and the Bhattacharyya distance between all pairs of coordinates is calculated; a hierarchical clustering algorithm, with the Bhattacharyya distance as the metric, gradually merges nodes until the distance between groups is greater than a preset merging threshold, thus forming the final aggregation group division.
8. The cloud-edge collaborative training and deployment method for deep learning models according to claim 1, characterized in that the fusion layer in step S5 is optimized as follows: the global model is forward-propagated to the fusion layer using the group-level update parameters from the different aggregation groups, to obtain multiple sets of output feature distributions; with the goal of minimizing the Jensen-Shannon divergence between these output feature distributions, the parameters of the fusion layer are iteratively updated using the gradient descent algorithm.
9. A cloud-edge collaborative training and deployment system for a deep learning model, wherein the deep learning model is a neural network model, characterized in that it includes a cloud server and multiple edge computing nodes connected via a communication network; the cloud server includes: a sub-network dynamic construction module, used to perform the personalized sub-network generation process for each edge computing node; a node feature clustering module, used to perform node aggregation group partitioning; a hierarchical aggregation update module, used to perform the hierarchical aggregation and fusion layer optimization operations; and a global model management and distribution module; the edge computing nodes include: a local training module, used to perform training that includes the adaptively weighted feature constraint loss; an update information processing and uploading module, used to prepare and upload parameter increments and feature statistics; and a model inference service module.
10. The cloud-edge collaborative training and deployment system for deep learning models according to claim 9, characterized in that the cloud server also includes a node status monitoring module, which continuously tracks the trends of the historical training stability and local data distribution uniqueness of each node; when it detects that the status value of any node in the current period deviates from its moving average by more than a preset threshold, it triggers the sub-network dynamic construction module outside the normal cycle to regenerate a personalized sub-network for that node.