Gradient compression method and system based on activation function patterns
By using activation functions to determine the node activity state in the neural network model, and updating weights and accumulating gradients only when nodes are active, the problems of communication overhead and training time delay are solved, achieving a more efficient training process and reduced power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SAMSUNG ELECTRONICS CO LTD
- Filing Date
- 2025-09-23
- Publication Date
- 2026-06-12
AI Technical Summary
In the process of training neural network models on a large scale, existing technologies suffer from significant communication overhead and memory bottlenecks, resulting in training time delays. Furthermore, the initial experiments to find an appropriate threshold take a long time, which affects training efficiency.
By using activation functions to determine the activity state of nodes during the training process of neural network models, weights are updated and gradients are accumulated only when nodes are active, reducing communication overhead.
It effectively reduces the communication overhead between computing devices and servers, improves training efficiency, and reduces power consumption, while maintaining or improving the training accuracy of the model.
Figure CN122197985A_ABST
Description
[0001] Cross-references to related applications
[0002] This application claims priority to Korean Patent Application No. 10-2024-0183567, filed on December 11, 2024, with the Korean Intellectual Property Office, and all rights derived therefrom, the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This disclosure relates to gradient compression methods and / or systems based on activation function patterns.
Background Technology
[0004] Some example embodiments generally relate to gradient compression methods and / or systems based on activation function patterns, and more specifically, to methods for reducing communication overhead by accumulating gradients of the weights corresponding to inactive nodes during the training of a neural network model.
[0005] In large-scale training environments for neural network models, a common approach involves multiple distributed hardware devices computing the gradients of the weights, with a parameter server aggregating all gradients generated by the hardware devices to train the neural network model. However, significant communication overhead and memory bottlenecks occur during the process of sending the aggregated results to the parameter server, which can delay the training time of the neural network model.
[0006] To partially address this problem, a method has been used that reduces communication overhead by accumulating gradients whenever weights are updated, pre-defining a threshold for the accumulated gradients, and sending the accumulated gradients to a parameter server once the threshold is exceeded. However, this type of method requires or utilizes a process that depends on the neural network model and training data to find an appropriate threshold, necessitating initial experimentation. These initial experiments are quite time-consuming, making efficient training of neural network models challenging. Therefore, a gradient compression method is desired that enables efficient training of neural network models while reducing communication overhead between the parameter server and hardware devices, regardless of the type of training data and neural network model.
Summary of the Invention
[0007] At least one technical objective to be achieved according to some example embodiments is to provide a method for reducing communication overhead between multiple computing devices and a server by accumulating gradients of weights corresponding to inactive nodes in a neural network model during training across multiple computing devices and sending the accumulated gradients to a server when the corresponding nodes become active.
[0008] Furthermore, at least one technical objective to be achieved according to some example embodiments is to provide a method for accumulating gradients by taking into account not only the current state but also the previous states of nodes included in the neural network model.
[0009] The technical objectives of the example embodiments are not limited to those mentioned above, and based on the following description, those skilled in the art will clearly understand other objectives not explicitly stated.
[0010] According to some example embodiments, a gradient compression method performed by a computing device is provided. The gradient compression method may include: performing a computation on the input data of a first node of a first model using an activation function; calculating a first gradient of a first weight corresponding to the first node based on the computation result; determining whether the first node is active based on the computation result; updating the first weight based on the first gradient in response to the first node being active; and accumulating the first gradient as an accumulated gradient of the first node in response to the first node being inactive.
[0011] Alternatively or additionally, according to some example embodiments, a gradient compression system is provided. The system may include: a processor; and a memory storing instructions, wherein, when executed by the processor, the instructions enable the processor to perform computation on input data of a first node of a first model using an activation function; compute a first gradient of a first weight corresponding to the first node based on the computation result; determine whether the first node is active based on the computation result; update the first weight based on the first gradient in response to the first node being active; and accumulate the first gradient as an accumulated gradient of the first node in response to the first node being inactive.
[0012] Alternatively or additionally, according to some example embodiments, a non-transitory computer-readable medium is provided storing a computer program configured to, when executed by a processor, cause the system to: perform computation on input data of a first node of a first model using an activation function; compute a first gradient of a first weight corresponding to the first node based on the computation result; determine whether the first node is active based on the computation result; update the first weight based on the first gradient in response to the first node being active; and accumulate the first gradient as an accumulated gradient of the first node in response to the first node being inactive.
[0013] Alternatively or additionally, according to some example embodiments, a server and a plurality of computing devices configured to communicate with the server are provided. Each of the plurality of computing devices is configured to compute a first gradient of a first weight corresponding to a first node based on the computation result, determine whether the first node is active based on the computation result, and accumulate the first gradient as an accumulated gradient of the first node in response to the first node being inactive.
[0014] In some example embodiments, each of the plurality of computing devices is configured to transmit the results of the accumulated gradients to a server.
[0015] In some example embodiments, at least one of the plurality of computing devices has an operating speed different from at least one other of the plurality of computing devices.
[0016] It should be noted that the effects of the present invention are not limited to those described above, and other effects of some example embodiments will be apparent from the following description.
Attached Figure Description
[0017] The above and other aspects and features of the present invention will become more apparent from a detailed description of some exemplary embodiments of the invention with reference to the accompanying drawings, in which:
[0018] Figure 1 is a block diagram illustrating the overall system configuration according to some example embodiments;
[0019] Figure 2 conceptually illustrates a gradient compression method according to some example embodiments;
[0020] Figure 3 shows an example of performing weight updates or gradient accumulation based on the current state of each node;
[0021] Figure 4 shows an example of performing weight updates or gradient accumulation based on both the previous state and the current state of each node;
[0022] Figure 5 is a flowchart illustrating a gradient compression method according to some example embodiments;
[0023] Figure 6 is a flowchart illustrating an example of the step of updating the first weight in Figure 5;
[0024] Figure 7 is a flowchart illustrating another example of the step of updating the first weight in Figure 5;
[0025] Figure 8 is a flowchart illustrating another embodiment of a gradient compression method according to the present invention;
[0026] Figure 9 shows the loss and accuracy of a neural network model versus the number of training iterations when a gradient compression method according to some example embodiments is used;
[0027] Figure 10 shows the communication overhead versus the number of training iterations when a gradient compression method according to some example embodiments is used; and
[0028] Figure 11 is a block diagram illustrating the hardware configuration of a computing device including a neural network model according to some example embodiments.
Detailed Implementation
[0029] In the following sections, some exemplary embodiments will be described in detail with reference to the accompanying drawings. The advantages and features of the inventive concept, as well as the methods for achieving these advantages and features, will become apparent from the embodiments described in detail below with reference to the accompanying drawings. However, some exemplary embodiments are not limited to the exemplary embodiments disclosed below, but can be implemented in various different forms. Therefore, the exemplary embodiments are set forth only to complete the inventive concept and to fully inform those skilled in the art of the scope of the inventive concept, which is limited only by the scope of the claims.
[0030] The same reference numerals in different figures denote the same or similar elements and therefore perform similar functions. Furthermore, for the sake of simplicity, descriptions and details of well-known steps and elements have been omitted. Moreover, numerous specific details are set forth in the following detailed description of the inventive concept in order to provide a thorough understanding of the inventive concept. However, it should be understood that the inventive concept can be practiced without these specific details. In other instances, well-known methods, processes, components, and circuits have not been described in detail so as not to unnecessarily obscure the gist of the inventive concept. Examples of various embodiments are further shown and described below. It should be understood that the description herein is not intended to limit the claims to the specific embodiments described. Rather, it is intended to cover substitutions, modifications, and equivalents that may be included within the spirit and scope of the inventive concept as defined by the appended claims.
[0031] Unless otherwise defined, all terms used herein, including technical and scientific terms, shall have the same meaning as commonly understood by one of ordinary skill in the art to which the inventive concept pertains. It will be further understood that terms (such as those defined in common dictionaries) shall be interpreted as having the meaning consistent with their meaning in the context of the relevant field and shall not be interpreted in an idealized or overly formal sense unless expressly defined herein. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the inventive concept. As used herein, unless the context clearly indicates otherwise, the singular constructions “a” and “an” are intended to include the plural constructions as well.
[0032] Furthermore, when describing components of an inventive concept, terms such as first, second, A, B, a, and b may be used. These terms are used only to distinguish one component from another, and the nature, sequence, order, or number of components is not limited by the terms. It should be understood that when a component is described as being “connected,” “coupled,” or “combined” to another component, that component may be directly connected, coupled, or combined to the other component, and another component may be “inserted” therein, and thus that component may be connected, coupled, or combined to another component via yet another component.
[0033] It will be further understood that the terms “comprise,” “comprising,” “include,” and “including” as used herein specify the presence of the stated features, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, elements, components, and / or portions thereof.
[0034] Figure 1 is a block diagram illustrating the configuration of the overall system 10 according to some example embodiments. Referring to Figure 1, the overall system 10 may include a server 11 and one or more computing devices 13-1, 13-2, ..., 13-N. The server 11 may further include storage 12, and the computing devices 13-1, 13-2, ..., 13-N may each include neural network models 14-1, 14-2, ..., 14-N. The computing devices 13-1, 13-2, ..., 13-N will be collectively referred to as computing device 13 below, and the neural network models 14-1, 14-2, ..., 14-N will be collectively referred to as neural network model 14 below.
[0035] Each of the computing devices 13-1, 13-2, ..., 13-N may be designed to be identical, for example, having the same electrical and / or physical characteristics; alternatively, at least one of the computing devices 13-1, 13-2, ..., 13-N may differ from the other computing devices, for example, it may have different physical and / or electrical characteristics. For example, physical characteristics may be, or may include, or may be based on at least one of the size and / or number of input / output ports and / or devices, and electrical characteristics may be, or may include, or may be based on at least one of storage capacity, memory capacity, processing speed, or power consumption; the example embodiments are not limited thereto.
[0036] Each of the computing devices 13-1, 13-2, ..., 13-N can communicate with each other and / or with server 11 via a bus (such as, but not limited to, a wireless bus and / or a wired bus) to exchange information (such as, but not limited to, data and / or commands) stored in various formats (such as, but not limited to, analog and / or digital formats), and can communicate in various ways (such as, but not limited to, broadcast, one-way, two-way, or multi-way) to send and / or receive information; information can be sent and / or received in various ways, such as, but not limited to, serial and / or parallel methods. The example embodiments are not limited thereto.
[0037] The neural network model 14 can operate based on statistical learning algorithms inspired by biological neurons in the fields of machine learning and cognitive science. The neural network model 14 refers to a model in which artificial neurons (or nodes) forming a network through synaptic connections can learn to adjust the strength of synaptic connections to solve problems. Each neural network model 14 may include multiple neural network layers. For example, each neural network model 14 may include an input layer, one or more hidden layers (such as, but not limited to, a large number of hidden layers), and an output layer.
[0038] Multiple neural network layers can each include at least one node and at least one weight, and neural network computation can be performed by calculating the results of previous layer computations and the corresponding weights. The results of previous layer computations refer to the input data provided to the nodes in the current layer. The calculation between the results of previous layer computations and the corresponding weights can be performed based on one or more activation functions. For example, the activation function can be, or may include, or be based on one or more of the sigmoid function, tanh function, rectified linear unit (ReLU), or softmax function, but the example embodiment is not limited thereto.
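The per-node computation described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the disclosure: the helper names (`relu`, `sigmoid`, `node_output`) and the sample values are hypothetical, and the weighted sum is assumed to be the "calculation between the results of previous layer computations and the corresponding weights".

```python
import math
from typing import Callable, List


def relu(z: float) -> float:
    """Rectified linear unit: passes positive values, clamps the rest to 0."""
    return max(0.0, z)


def sigmoid(z: float) -> float:
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs: List[float], weights: List[float],
                activation: Callable[[float], float]) -> float:
    """Weighted sum of the previous layer's results, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)


# Hypothetical values: three outputs of the previous layer and their weights.
prev_layer = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.1]  # weighted sum is about -0.3

for name, fn in (("relu", relu), ("sigmoid", sigmoid), ("tanh", math.tanh)):
    print(name, node_output(prev_layer, weights, fn))
```

The same weighted sum yields different computation results depending on the activation function, which is why the activity test in later paragraphs depends on the chosen function.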
[0039] The weights in multiple neural network layers can be derived, improved, optimized, or at least partially optimized based on the results of training the neural network model 14. For example, the weights can be updated during training to reduce or minimize the loss or cost values obtained from the neural network model 14 (e.g., reduce or minimize the gradient of the weights). The neural network model 14 can infer the desired outcome data from arbitrary input data.
[0040] For example, neural network model 14 may utilize at least one artificial intelligence (AI) architecture and / or algorithm for visual processing and image classification, such as one or more of convolutional neural networks (CNNs) (e.g., GoogleNet, AlexNet, VGG networks, or ResNet), visual analytics, visual understanding, or video synthesis, but the example embodiments are not limited thereto. The AI architectures and algorithms that may be used according to some example embodiments are not limited to the above examples.
[0041] Computing devices 13-1, 13-2, ..., 13-N can receive input data and train neural network models 14-1, 14-2, ..., 14-N respectively. During training, whenever the weights in the neural network models 14-1, 14-2, ..., 14-N change, computing devices 13-1, 13-2, ..., 13-N can transmit the weight change values ΔW1, ΔW2, ..., ΔWN to storage 12 on server 11. The average of the weight change values stored in storage 12 (or another measure of central tendency, such as a measure based on at least one of the mean, median, or mode) can be set as new weights for neural network models 14-1, 14-2, ..., 14-N, and computing devices 13-1, 13-2, ..., 13-N can continue to train neural network models 14-1, 14-2, ..., 14-N.
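A minimal sketch of the server-side aggregation described above, under stated assumptions: the function name `aggregate` is hypothetical, the weights are represented as flat lists, and "setting the average as new weights" is interpreted here as adding the element-wise mean of the received weight change values to the shared weights. The text does not fix that interpretation, and another measure of central tendency could be substituted for the mean.

```python
from statistics import fmean
from typing import List


def aggregate(weights: List[float],
              deltas_per_device: List[List[float]]) -> List[float]:
    """Apply the element-wise mean of the devices' weight change values.

    deltas_per_device[i] is the weight change vector sent by device i.
    """
    mean_delta = [fmean(column) for column in zip(*deltas_per_device)]
    return [w + d for w, d in zip(weights, mean_delta)]


# Hypothetical example: two devices, two shared weights.
shared = [1.0, 2.0]
received = [[0.2, -0.4],   # weight change values from device 1
            [0.4, 0.0]]    # weight change values from device 2
new_weights = aggregate(shared, received)
print(new_weights)
```

The updated weights would then be sent back to every device so that all copies of the model continue training from the same point.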
[0042] In this scenario, the time spent by computing device 13 training the neural network model 14 corresponds to computation time, and the time spent sending the weight change values to server 11 corresponds to communication time. As the number of computing devices 13 increases, the number of calculated weight change values also increases, which may lead to longer communication time. Therefore, communication overhead may increase significantly.
[0043] Therefore, according to some example embodiments, in order to reduce communication overhead, computing device 13 may determine whether to send weight change values to server 11 based on whether the nodes in neural network model 14 are active, rather than sending weight change values to server 11 every time the weight changes.
[0044] Specifically, the computing device 13 can perform calculations on the input data of the first node in the neural network model 14 using an activation function. For example, the computing device 13 can input the input data of the first node and its corresponding first weight into the activation function to perform the calculation. The first weight corresponding to the first node refers to the weight between the first node and the nodes connected to the first node in each previous layer. The computing device 13 can calculate the gradient of the first weight based on the calculation result.
[0045] In addition, the computing device 13 can determine whether the first node is active based on the calculation result. If the calculation result is equal to or greater than a threshold, the first node may be active. Conversely, if the calculation result is less than the threshold, the first node may be inactive.
[0046] According to some example embodiments, if the first node is active, the computing device 13 can update the first weight based on previously computed gradients. Specifically, the update of the first weight can utilize not only the gradient computed in the current state but also gradients accumulated from previous states. Conversely, if the first node is inactive, the computing device 13 can accumulate the computed gradients into an accumulated gradient instead of immediately updating the first weight. For example, the accumulated gradient can be stored in a buffer (not shown) of the computing device 13.
[0047] For example, according to some example embodiments, the first weight is updated based on the gradient only when the first node is active, and the change in the first weight is sent to server 11 only when the first node is active. Conversely, if the first node is inactive, the gradient is accumulated, and the first weight is not updated, meaning no weight change is sent to server 11. For example, according to some example embodiments, communication overhead occurs only when the first node is active, and no communication overhead occurs when the first node is inactive. This can reduce communication overhead during the training of neural network model 14 and can reduce the power consumption of computing device 13 and server 11.
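The update-or-accumulate behavior described in the preceding paragraphs can be sketched for a single node as follows. This is an illustration under assumptions, not the disclosed implementation: `NodeState`, `train_step`, the plain gradient-descent update rule, and the `learning_rate` value are all hypothetical, and the gradient and activity flag are supplied by the caller rather than computed from a real model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeState:
    """Weights feeding one node, plus a local buffer for deferred gradients."""
    weights: List[float]
    accumulated: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.accumulated:
            self.accumulated = [0.0] * len(self.weights)


def train_step(node: NodeState, gradient: List[float], active: bool,
               learning_rate: float = 0.1) -> Optional[List[float]]:
    """Return the weight change to send to the server, or None if nothing is sent."""
    if not active:
        # Inactive node: defer the gradient locally and send nothing.
        node.accumulated = [a + g for a, g in zip(node.accumulated, gradient)]
        return None
    # Active node: apply the current gradient together with any deferred ones.
    total = [a + g for a, g in zip(node.accumulated, gradient)]
    delta = [-learning_rate * t for t in total]
    node.weights = [w + d for w, d in zip(node.weights, delta)]
    node.accumulated = [0.0] * len(node.weights)  # re-initialize the buffer
    return delta


# Hypothetical run: the node is inactive twice, then active once.
node = NodeState(weights=[1.0, 1.0])
history = [([0.5, -0.5], False), ([0.25, 0.25], False), ([0.25, 0.25], True)]
messages = [train_step(node, grad, is_active) for grad, is_active in history]
sent = [m for m in messages if m is not None]
print(len(sent), "of", len(history), "steps produced a message")
```

In this run only the third step produces a weight change, so a conventional scheme that sends after every step would have transmitted three messages where this sketch transmits one.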
[0048] This type of gradient accumulation process can be referred to as gradient compression. Some example embodiments correspond to gradient compression methods, and computing device 13 may correspond to a system that performs gradient compression methods. The gradient compression method is described conceptually below with reference to Figures 2 to 4.
[0049] Figure 2 conceptually illustrates a gradient compression method according to some example embodiments. Figure 2 depicts an example overall system 10 comprising four computing devices 13-1 to 13-4 and neural network models 14-1 to 14-4. Input data IDAT1 to IDAT4 can be input to the neural network models 14-1 to 14-4 of the computing devices 13-1 to 13-4, respectively. Activation functions can be used to perform calculations between the input data IDAT1 to IDAT4 and the weights of the nodes in the neural network models 14-1 to 14-4. Based on the calculation results, it can be determined whether a node in the neural network models 14-1 to 14-4 is active.
[0050] For example, in Figure 2, active nodes are shaded in gray, while inactive nodes are unshaded (e.g., displayed in white). The gradients of the weights corresponding to shaded nodes can be used immediately for weight updates, while the gradients of the weights corresponding to unshaded nodes can continue to accumulate. The weight change values ΔW1 to ΔW4 in Figure 2 can include only the gradients of the weights corresponding to the shaded nodes (e.g., active nodes). Since the weight change values ΔW1 to ΔW4 only include the gradients of the weights corresponding to the active nodes, rather than including the gradients of the weights corresponding to all nodes, communication overhead can be reduced.
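As a rough illustration of why such a weight change value is smaller, the sketch below builds a message that contains entries only for active nodes. The dictionary representation, the node labels, and the function name `build_weight_change` are assumptions made for this example; the disclosure does not specify a message format.

```python
from typing import Dict, List, Set


def build_weight_change(gradients: Dict[str, List[float]],
                        active: Set[str]) -> Dict[str, List[float]]:
    """Keep only the gradients of weights that correspond to active nodes."""
    return {node: grad for node, grad in gradients.items() if node in active}


# Hypothetical gradients for three nodes; only n1 and n3 are active.
gradients = {"n1": [0.1, -0.2], "n2": [0.0, 0.3], "n3": [0.4, 0.1]}
message = build_weight_change(gradients, active={"n1", "n3"})

full_size = sum(len(g) for g in gradients.values())
sent_size = sum(len(g) for g in message.values())
print(f"sending {sent_size} of {full_size} gradient values")
```

The gradient for `n2` is not discarded in the described method; it would stay in the device's accumulation buffer until that node becomes active.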
[0051] Figure 3 shows an example of performing weight updates or gradient accumulation based on the current state of each node. Referring to Figure 3, gradient accumulation (①) can be performed on inactive node 21, and gradient-based weight updates (②) can be performed on active node 22. That is, the example embodiment shown in Figure 3 considers only the current state of each node. However, since nodes that were previously inactive are more likely to remain inactive, it may be necessary or desirable to also consider the previous state of each node when determining whether to perform a weight update.
[0052] Figure 4 shows an example in which weight updates or gradient accumulation are performed based on both the previous and current states of each node. Referring to Figure 4, if node 21, which is in an inactive state, remains inactive in a subsequent stage, gradient accumulation (①) can be performed, and if node 22, which is in an active state, remains active in a subsequent stage, a gradient-based weight update (④) can be performed, as in the example embodiment shown in Figure 2. However, if node 22, which is currently active, was previously inactive, gradient accumulation (②) can be performed even though it is currently active. Similarly, if node 21, which is currently inactive, was previously active, gradient accumulation (③) can be performed. That is, unlike the example embodiment shown in Figure 3, the example embodiment shown in Figure 4 considers both the previous state and the current state of each node, which results in more selective weight updates than in Figure 3 and further reduces communication overhead.
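The two decision rules can be compared side by side in a short sketch. The function names `decide_current_only` and `decide_with_history` are hypothetical labels for the Figure 3 and Figure 4 rules as described in the text; the circled case numbers in the comments follow the Figure 4 description.

```python
from itertools import product


def decide_current_only(currently_active: bool) -> str:
    """Figure 3 rule: only the current state of the node matters."""
    return "update" if currently_active else "accumulate"


def decide_with_history(previously_active: bool, currently_active: bool) -> str:
    """Figure 4 rule: update only if the node was active before and still is."""
    if previously_active and currently_active:
        return "update"       # case (4): active -> active
    return "accumulate"       # cases (1), (2), (3)


# Enumerate all four (previous, current) combinations.
for prev, curr in product([False, True], repeat=2):
    print(f"prev={prev!s:5} curr={curr!s:5} "
          f"fig3={decide_current_only(curr):10} "
          f"fig4={decide_with_history(prev, curr)}")
```

The only combination where the two rules differ is a node that has just become active: the current-state rule updates immediately, while the history-aware rule keeps accumulating for one more step.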
[0053] Referring again to Figure 1, in some example embodiments, computing device 13 may update the first weight only when the first node is active and the gradient of the first weight exceeds a threshold. In some example embodiments, computing device 13 may update the first weight based on the accumulated gradient even when the first node is inactive, if or in response to the accumulated gradient exceeding a threshold. In this case, the threshold for the accumulated gradient may vary depending on the type of input data or neural network model 14. This can prevent or reduce the possibility and / or impact of the training of neural network model 14 becoming too slow due to gradient compression. In addition, after updating the first weight based on the computed gradient and the accumulated gradient, computing device 13 may initialize the accumulated gradient.
[0054] In some example embodiments in which the activation function is ReLU, the first node can be determined to be inactive if, or in response to, the input to the first node being negative, regardless of the computation result. In that case, accumulation is performed only for the small gradients generated by momentum, and the communication overhead remains zero even when training of the neural network model 14 is repeated.
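A small worked example may help with the ReLU case. The derivative of ReLU is zero for a negative pre-activation, so the backpropagated gradient of every weight feeding such a node is zero; any nonzero term left to accumulate would have to come from elsewhere, such as the momentum mentioned above. The sketch below shows only the zero-gradient part. The names `relu_grad`, `weight_gradients`, and `upstream` (the loss gradient arriving at the node's output) are hypothetical, and treating a pre-activation of exactly zero as inactive is a choice made for this sketch.

```python
from typing import List


def relu_grad(z: float) -> float:
    """Derivative of ReLU with respect to its input (taken as 0 at z == 0 here)."""
    return 1.0 if z > 0 else 0.0


def weight_gradients(inputs: List[float], z: float, upstream: float) -> List[float]:
    """Chain rule for one node: dL/dw_i = upstream * relu'(z) * x_i."""
    return [upstream * relu_grad(z) * x for x in inputs]


inputs = [0.5, 2.0]
print(weight_gradients(inputs, z=-0.3, upstream=1.5))  # negative input: inactive node
print(weight_gradients(inputs, z=0.3, upstream=1.5))   # positive input: active node
```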
[0055] Server 11 and computing device 13 can be configured using one or more physical servers (such as virtual machines) included in a cloud-based server cluster. The detailed configuration and operation of computing device 13 according to some example embodiments are described later with reference to Figure 11.
[0056] In some example embodiments, server 11 may deploy the neural network model 14 trained according to the foregoing embodiments to a user terminal (not shown). Here, the user terminal may include any one or more devices used by the user to perform tasks using the deployed neural network model 14, such as smartphones, tablet PCs, and laptop computers.
[0057] The components shown in Figure 1 can communicate with each other via a network. For example, the network can be implemented as any type of wired and / or wireless network, such as one or more of a local area network (LAN), a wide area network (WAN), a mobile radio communication network, or a wireless broadband internet (WiBro) network.
[0058] Figure 5 is a flowchart illustrating a gradient compression method according to some example embodiments. The steps / operations shown in Figures 5 to 8, described below, may be performed by the computing device 13 of Figure 1 or the computing device 500 of Figure 11. Therefore, unless the subject of a specific step / operation is explicitly mentioned in the following description, it should be understood that the step / operation may be performed by the computing device 13 of Figure 1 and / or the computing device 500 of Figure 11.
[0059] Referring to Figure 5, in operation S100, an activation function can be used to perform computation on the input data of the first node of the first model. In operation S200, based on the computation result, the first gradient of the first weight corresponding to the first node can be calculated. In operation S300, a determination can be made regarding whether the first node is active based on the computation result. In operation S400, if the first node is active ("yes"), the first weight can be updated based on the first gradient. If the accumulated gradient of the first node already exists, the first weight can be updated based on both the accumulated gradient and the first gradient, and the accumulated gradient can be initialized after the update. Conversely, if the first node is not active ("no"), in operation S500, the first gradient can be accumulated as the accumulated gradient of the first node. Examples of operation S400 are described later with reference to Figure 6 and Figure 7.
[0060] Figure 6 is a flowchart illustrating an example of operation S400 of Figure 5, which is the step of updating the first weight. Referring to Figure 6, in operation S410, it can be determined whether the first node, which is currently active, was previously active. If the first node was previously active ("yes"), then in operation S420, the first weight can be updated based on the first gradient. The embodiment of Figure 6 corresponds to the example embodiment shown in Figure 4.
[0061] Figure 7 is a flowchart illustrating another example of operation S400 of Figure 5. Referring to Figure 7, in operation S430, a determination can be made regarding whether the first gradient exceeds a threshold. If the first gradient exceeds the threshold ("yes"), the first weight can be updated based on the first gradient in operation S440.
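The Figure 7 check can be sketched as a single predicate. Several details are assumptions rather than part of the text: the gradient is treated as a vector and compared to the threshold by its Euclidean norm, the names `gradient_norm` and `should_update` are hypothetical, and the sketch returns only the decision because the text does not state what happens on the "no" branch of operation S430.

```python
import math
from typing import List


def gradient_norm(gradient: List[float]) -> float:
    """Euclidean norm, used here as the magnitude compared to the threshold."""
    return math.sqrt(sum(g * g for g in gradient))


def should_update(active: bool, gradient: List[float], threshold: float) -> bool:
    """S430 sketch: update only if the node is active and the gradient is large enough."""
    return active and gradient_norm(gradient) > threshold


print(should_update(True, [0.3, 0.4], threshold=0.4))    # norm is about 0.5
print(should_update(True, [0.03, 0.04], threshold=0.4))  # norm is about 0.05
print(should_update(False, [3.0, 4.0], threshold=0.4))   # inactive node
```

A different magnitude measure, such as the largest absolute component, could be swapped in without changing the structure of the check.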
[0062] Figure 8 is a flowchart illustrating another embodiment of a gradient compression method according to a concept presented in this invention. Referring to Figure 8, after operation S500, a determination can be made in operation S600 regarding whether the accumulated gradient exceeds a threshold. If the accumulated gradient exceeds the threshold ("yes"), the first weight can be updated in operation S700 based on the accumulated gradient, regardless of the state of the first node (i.e., even if the first node is inactive).
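Operations S500 to S700 for an inactive node can be sketched as below. The function name `accumulate_and_maybe_flush`, the Euclidean-norm comparison, the gradient-descent update rule, and the `learning_rate` value are assumptions for illustration. Re-initializing the buffer after the forced update follows the earlier statement that the accumulated gradient may be initialized once it has been applied.

```python
import math
from typing import List, Tuple


def accumulate_and_maybe_flush(
    weights: List[float], accumulated: List[float], gradient: List[float],
    threshold: float, learning_rate: float = 0.1,
) -> Tuple[List[float], List[float], bool]:
    """S500: accumulate. S600: compare to threshold. S700: update if exceeded.

    Returns (weights, accumulated, flushed); flushed is True when an update,
    and therefore a message to the server, was triggered.
    """
    accumulated = [a + g for a, g in zip(accumulated, gradient)]        # S500
    if math.sqrt(sum(a * a for a in accumulated)) > threshold:          # S600
        weights = [w - learning_rate * a
                   for w, a in zip(weights, accumulated)]               # S700
        return weights, [0.0] * len(weights), True
    return weights, accumulated, False


# Hypothetical run: the same gradient arrives twice while the node is inactive.
w, acc = [1.0], [0.0]
w, acc, flushed_first = accumulate_and_maybe_flush(w, acc, [0.3], threshold=0.5)
w, acc, flushed_second = accumulate_and_maybe_flush(w, acc, [0.3], threshold=0.5)
print(flushed_first, flushed_second, w, acc)
```

In this run the first call only accumulates, and the second call pushes the buffer over the threshold and triggers the update, which bounds how long an inactive node's gradients can be withheld.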
[0063] Figure 9 The diagram illustrates the loss and accuracy of a neural network model based on the number of training iterations using a gradient compression method, according to some example embodiments. In Figure 30, reference numeral 31 indicates the loss based on the number of training iterations when training the neural network model using a conventional method without gradient accumulation, and reference numeral 33 indicates the loss based on the number of training iterations when training the neural network model using a gradient accumulation method that considers only the current state of each node. Figure 3 In the embodiments, and reference numeral 35 indicates the loss based on the number of training iterations when training the neural network model using a gradient accumulation method that considers both the previous and current states of each node, as in Figure 4 In the embodiments described. Reference numerals 32, 34, and 36 respectively indicate when using conventional methods, according to Figure 3 The method of the embodiment and according to Figure 4 The methods shown in some example embodiments train neural network models based on the accuracy of the number of training iterations. (See references.) Figure 9 It was confirmed that, compared with traditional training methods, the neural network model training method according to some example embodiments leads to more reduced or minimized loss and higher accuracy.
[0064] Figure 10 The diagram illustrates the communication overhead based on the number of training iterations of a neural network model using a gradient compression method, according to some example embodiments. In Figure 40, reference numeral 41 indicates the communication overhead based on the number of training iterations when training the neural network model using a conventional method, and reference numeral 42 indicates the communication overhead based on the number of training iterations when training the neural network model using a gradient accumulation method that considers only the current state of each node, as shown in Figure 40. Figure 3 In some example embodiments shown, and with reference numeral 43 indicating the communication overhead depending on the number of training iterations when training a neural network model using a gradient accumulation method that considers both the previous and current states of each node, as in... Figure 4 Some example embodiments are shown. (See references) Figure 10 It can be confirmed that, compared with traditional training methods, the neural network model training method according to some example embodiments significantly reduces communication overhead. Specifically, it can be confirmed that, compared with the method according to... Figure 3 Compared to training a neural network model using some example embodiments shown, when based on, for example Figure 4The example embodiments shown demonstrate a significant reduction in communication overhead when training neural network models.
[0065] In summary, referring to Figures 9 and 10, training neural network models according to some example embodiments can further reduce or minimize the loss, further improve the accuracy, and/or reduce the communication overhead.
[0066] Figure 11 is a block diagram illustrating the hardware configuration of a computing device including a neural network model according to some example embodiments.
[0067] Referring to Figure 11, the computing device 500 may include at least one processor 510, a bus 530, a communication interface 540, a memory 520 for loading a computer program 560 executed by the processor 510, and a storage 550 for storing the computer program 560. However, Figure 11 depicts only the components relevant to some example embodiments. It should therefore be understood that the computing device 500 may include other general-purpose components in addition to those shown in Figure 11. Alternatively or additionally, the computing device 500 may be configured to omit some of the components shown in Figure 11. Each component of the computing device 500 is described below.
[0068] Processor 510 can control at least some or all of the overall operation of the components of computing device 500. Processor 510 may include at least one of a central processing unit (CPU), microprocessor unit (MPU), microcontroller unit (MCU), graphics processing unit (GPU), or any other type of processor. Furthermore, processor 510 can perform calculations for at least one application or program to perform operations / methods according to some example embodiments. Computing device 500 may include one or more processors.
[0069] Memory 520 can store various data, commands, and/or information. Memory 520 can load computer program 560 from storage 550 to perform operations/methods according to some example embodiments. Memory 520 can be implemented as, or may include, non-volatile memory and/or volatile memory such as random access memory (RAM), but the example embodiments are not limited thereto.
[0070] Bus 530 can provide communication functionality between components of computing device 500. Bus 530 can be implemented as various types of buses, such as an address bus, a data bus, and a control bus.
[0071] Communication interface 540 can support both wired and wireless Internet communication for computing device 500. Alternatively or additionally, communication interface 540 can support various communication methods other than Internet communication. For this purpose, communication interface 540 may include a communication module.
[0072] Storage 550 can non-transitorily store one or more computer programs 560. Storage 550 can be implemented as a non-volatile memory, such as read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory, as a hard disk or a removable disk, or as any other type of computer-readable recording medium.
[0073] Computer program 560 may include one or more instructions that, when loaded into memory 520, enable processor 510 to perform various operations / methods according to some example embodiments. In other words, by executing the loaded instructions, processor 510 can perform various operations / methods according to some example embodiments.
[0074] For example, computer program 560 may include instructions for performing the following operations: performing computation on the input data of a first node of a first model using an activation function; calculating a first gradient of a first weight corresponding to the first node based on the computation result; determining whether the first node is active based on the computation result; updating the first weight based on the first gradient if the first node is active; and accumulating the first gradient as the accumulated gradient of the first node if the first node is inactive.
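The operations listed above can be sketched for a single node as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the disclosure: the names `NodeState`, `is_active`, and `apply_gradient`, the learning rate, the plain gradient-descent update, and the rule that a node is active when its ReLU output is greater than zero are all hypothetical. The gradient is passed in as an argument because how it is computed is outside the scope of the sketch. Folding the accumulated gradient into the update and then resetting it follows the variants recited in claims 2 and 7.

```python
from dataclasses import dataclass


def relu(x: float) -> float:
    """Rectified linear unit, used here as the activation function."""
    return max(0.0, x)


@dataclass
class NodeState:
    weight: float             # the "first weight" of the node
    accumulated: float = 0.0  # accumulated gradient kept while the node is inactive


def is_active(activation_output: float) -> bool:
    # Assumption: a node counts as active when its ReLU output is non-zero.
    return activation_output > 0.0


def apply_gradient(
    node: NodeState, node_input: float, gradient: float, lr: float = 0.1
) -> bool:
    """Update the weight if the node is active, otherwise accumulate.

    `gradient` is assumed to be computed elsewhere (e.g. by backpropagation).
    Returns True when the weight was updated, False when the gradient
    was only accumulated.
    """
    output = relu(node_input)
    if is_active(output):
        # Active: update with the current gradient plus anything accumulated,
        # then reset the accumulator.
        node.weight -= lr * (gradient + node.accumulated)
        node.accumulated = 0.0
        return True
    # Inactive: hold the gradient locally instead of applying it.
    node.accumulated += gradient
    return False


node = NodeState(weight=1.0)
apply_gradient(node, node_input=-2.0, gradient=0.5)   # inactive: accumulate
apply_gradient(node, node_input=-1.0, gradient=0.25)  # inactive: accumulate
updated = apply_gradient(node, node_input=3.0, gradient=0.25)  # active: update
```

In this usage example the two inactive steps only grow the accumulator, and the third step applies the current and accumulated gradients together in a single update, which is the point at which a gradient would need to be communicated.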
[0075] According to some example embodiments, since gradients are accumulated based on the pattern of the activation function and the active or inactive state of nodes in each neural network model, without the need or expectation to determine a threshold for the accumulated gradients every time each neural network model or training dataset changes, communication overhead can be reduced without degrading the efficiency of training each neural network model. Alternatively or additionally, with the reduction in communication overhead, the power consumption of parameter servers and hardware devices can also be significantly reduced.
[0076] Various exemplary embodiments and their effects have been described above with reference to the accompanying drawings. The effects of the inventive concept are not limited to those described above, and other effects not mentioned will be readily apparent to those skilled in the art from the above description.
[0077] All components constituting the exemplary embodiments are described as being combined or as operating in combination with each other. However, the inventive concept is not necessarily limited to such embodiments. In other words, within the scope of the inventive concept, the components may be selectively combined with one another and operate in one or more such combinations.
[0078] Although the operations are shown in the accompanying drawings as being performed in a specific order, this should not be construed as requiring that the operations be performed in the specific order shown or in sequential order, or that all of the illustrated operations be performed, to obtain the desired result.
[0079] Computing devices can, for example, have trainable structures, such as artificial neural networks, decision trees, support vector machines, Bayesian networks, genetic algorithms, etc., utilizing training data. Non-limiting examples of trainable structures can include convolutional neural networks (CNNs), generative adversarial networks (GANs), artificial neural networks (ANNs), region-based convolutional neural networks (R-CNNs), region proposal networks (RPNs), recurrent neural networks (RNNs), stacked deep neural networks (S-DNNs), state-space dynamic neural networks (S-SDNNs), deconvolutional networks, deep belief networks (DBNs), restricted Boltzmann machines (RBMs), fully convolutional networks, long short-term memory (LSTM) networks, classification networks, etc.
[0080] Any elements and / or functional blocks disclosed above may include or be implemented in processing circuitry, such as hardware including logic circuitry; hardware / software combinations, such as a processor executing software; or combinations thereof. For example, processing circuitry may more specifically include, but is not limited to, a central processing unit (CPU), an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field-programmable gate array (FPGA), a system-on-a-chip (SoC), a programmable logic unit, a microprocessor, an application-specific integrated circuit (ASIC), etc. Processing circuitry may include at least one electronic component, such as a transistor, a resistor, a capacitor, etc. Processing circuitry may include electronic components, such as logic gates including at least one of AND gates, OR gates, NAND gates, NOT gates, etc.
[0081] Although some exemplary embodiments have been described with reference to the accompanying drawings, these exemplary embodiments are not limited to those described above, but can be implemented in various different forms. Those skilled in the art will understand that the exemplary embodiments can be practiced in other specific forms without changing the technical spirit or essential characteristics of the described exemplary embodiments. Therefore, it should be understood that the exemplary embodiments described above are illustrative in all respects, not restrictive. Furthermore, the exemplary embodiments are not necessarily mutually exclusive. For example, some exemplary embodiments may include one or more features described with reference to one or more accompanying drawings, and may also include one or more other features described with reference to one or more other accompanying drawings.
Claims
1. A gradient compression method performed by a computing device, the gradient compression method comprising: performing computation on input data of a first node of a first model using an activation function; calculating a first gradient of a first weight corresponding to the first node based on a result of the computation; determining, based on the result of the computation, whether the first node is in an active state; in response to the first node being in the active state, updating the first weight based on the first gradient; and in response to the first node not being in the active state, accumulating the first gradient as an accumulated gradient of the first node.
2. The gradient compression method according to claim 1, wherein updating the first weight includes updating the first weight based on the first gradient and the accumulated gradient.
3. The gradient compression method according to claim 1, wherein updating the first weight includes updating the first weight based on the first gradient in response to both a previous state and a current state of the first node being the active state.
4. The gradient compression method according to claim 1, wherein updating the first weight includes updating the first weight based on the first gradient in response to the first gradient exceeding a threshold.
5. The gradient compression method according to claim 1, wherein the activation function is based on a rectified linear unit (ReLU), and the first model is based on an image classification model.
6. The gradient compression method according to claim 5, wherein the input data is negative, and determining whether the first node is in the active state includes determining that the first node is not in the active state, regardless of the result of the computation.
7. The gradient compression method according to claim 1, wherein updating the first weight further includes initializing the accumulated gradient after updating the first weight.
8. The gradient compression method according to claim 1, further comprising: in response to the accumulated gradient exceeding a threshold, updating the first weight based on the accumulated gradient, regardless of whether the first node is in the active state.
9. A gradient compression system, comprising: a processor; and a non-transitory computer-readable medium storing instructions that, when executed by the processor, cause the system to: perform computation on input data of a first node of a first model using an activation function; calculate a first gradient of a first weight corresponding to the first node based on a result of the computation; determine, based on the result of the computation, whether the first node is in an active state; in response to the first node being in the active state, update the first weight based on the first gradient; and in response to the first node not being in the active state, accumulate the first gradient as an accumulated gradient of the first node.
10. The gradient compression system according to claim 9, wherein updating the first weight includes updating the first weight based on the first gradient and the accumulated gradient.
11. The gradient compression system according to claim 9, wherein updating the first weight includes updating the first weight based on the first gradient in response to both a previous state and a current state of the first node being the active state.
12. The gradient compression system according to claim 9, wherein updating the first weight includes updating the first weight based on the first gradient in response to the first gradient exceeding a threshold.
13. The gradient compression system according to claim 9, wherein updating the first weight further includes initializing the accumulated gradient after updating the first weight.
14. The gradient compression system according to claim 9, wherein the instructions, when executed by the processor, further cause the system to update the first weight based on the accumulated gradient in response to the accumulated gradient exceeding a threshold, regardless of whether the first node is in the active state.
15. A non-transitory computer-readable medium storing a computer program, wherein the computer program is configured to, when executed by a processor, cause the processor to: perform computation on input data of a first node of a first model using an activation function; calculate a first gradient of a first weight corresponding to the first node based on a result of the computation; determine, based on the result of the computation, whether the first node is in an active state; if the first node is in the active state, update the first weight based on the first gradient; and in response to the first node not being in the active state, accumulate the first gradient as an accumulated gradient of the first node.
16. The non-transitory computer-readable medium according to claim 15, wherein updating the first weight includes updating the first weight based on the first gradient and the accumulated gradient.
17. The non-transitory computer-readable medium according to claim 15, wherein updating the first weight includes updating the first weight based on the first gradient in response to both a previous state and a current state of the first node being the active state.
18. The non-transitory computer-readable medium according to claim 15, wherein updating the first weight includes updating the first weight based on the first gradient in response to the first gradient exceeding a threshold.
19. The non-transitory computer-readable medium according to claim 15, wherein updating the first weight further includes initializing the accumulated gradient after updating the first weight.
20. The non-transitory computer-readable medium according to claim 15, wherein the computer program is configured to, when executed by the processor, further cause the processor to update the first weight based on the accumulated gradient in response to the accumulated gradient exceeding a threshold, regardless of whether the first node is in the active state.