Artificial intelligence core, artificial intelligence core system, and loading / storing method for the artificial intelligence core system
The AI core system addresses bandwidth limitations by employing a load/store method with main and standby operations, optimizing data and program transmission for efficient deep learning tasks, ensuring no delays in current tasks and maximizing hardware utilization.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- REBELLIONS INC
- Filing Date
- 2021-12-30
- Publication Date
- 2026-06-18
- Estimated Expiration
- Not applicable · inactive patent
Abstract
Description
【Technical Field】 【0001】 The present invention relates to an artificial intelligence core, an artificial intelligence core system, and a load / store method of an artificial intelligence core system. Specifically, the present invention relates to an artificial intelligence core, an artificial intelligence core system, and a load / store method of an artificial intelligence core system for maximizing the utilization of an artificial intelligence core. 【Background Art】 【0002】 In recent years, artificial intelligence (AI) technology has been attracting attention worldwide as the core technology of the fourth industrial revolution and as the most promising technology. The biggest problem with such AI technology is computing performance. For AI technology that realizes human learning ability, reasoning ability, perception ability, natural language processing ability, etc., it is most important to quickly process a large amount of data. 【0003】 In the early days of artificial intelligence, deep learning training and inference were performed on the central processing unit (CPU) or graphics processing unit (GPU) of a conventional computer. However, because these have limitations in deep learning and inference operations with a high workload, artificial intelligence cores structurally specialized for deep learning operations have come into the spotlight. 【0004】 An artificial intelligence core contains a large number of multipliers. It is difficult to secure sufficient bandwidth for fetching the data and programs that these arithmetic units need for their operations. 【0005】 Therefore, fetching the programs and data required for the next operation in advance, in time-series order, may be a very effective way to improve the performance of an artificial intelligence core. 【Prior Art Documents】 【Patent Documents】 【0006】 [Patent Document 1] Korean Registered Patent Publication No. 10-2258566
[Summary of the Invention] [Problems that the invention aims to solve] 【0007】 An objective of the present invention is to provide an artificial intelligence core that can efficiently maximize bandwidth with respect to external interfaces. 【0008】 Another objective of the present invention is to provide an artificial intelligence core system that can efficiently maximize bandwidth with respect to external interfaces. 【0009】 Another objective of the present invention is to provide a load / store method for an artificial intelligence core system that can efficiently maximize bandwidth with respect to external interfaces. 【0010】 The objects of the present invention are not limited to those stated above, and other objects and advantages of the present invention not mentioned can be understood from the following description and more clearly from the embodiments of the present invention. Furthermore, it is readily apparent that the objects and advantages of the present invention can be achieved by the means and combinations set forth in the claims. [Means for solving the problem] 【0011】 To solve the aforementioned problems, an artificial intelligence core according to some embodiments of the present invention includes a process unit that receives input activations and weights and generates output activations by two-dimensional matrix calculations, and a load / store unit that performs a load / store operation of transmitting programs and input data received via an external interface to an on-chip buffer and transmitting output data from the on-chip buffer to the external interface, wherein the load / store operation includes a main load / store operation for the operation currently being performed by the process unit and a standby load / store operation for a standby operation to be performed by the process unit after the current operation. 
【0012】 The system may also include: an activation buffer that provides the input activation to the process unit, receives the output activation from the process unit, and temporarily stores the input activation and the output activation; an on-chip buffer that temporarily stores and transmits to the process unit a program and input data for the process unit to perform calculations, and temporarily stores the output data received from the process unit, the input data including the input activation and the weighted value; and an activation load / store unit that transmits the input activation from the on-chip buffer to the activation buffer and transmits the output activation from the activation buffer to the on-chip buffer. 【0013】 Furthermore, the standby load / store operation may be performed using bandwidth of the external interface that is not used by the main load / store operation. 【0014】 Furthermore, the load / store unit may include a main load / store unit that performs the main load / store operation and transmits the first load data and the first store data to the on-chip buffer, and a hidden load / store unit that performs the standby load / store operation and transmits the second load data and the second store data to the on-chip buffer. 
【0015】 Furthermore, the hidden load / store unit may include: a hidden load unit that fetches a standby load instruction received from a task controller and issues the standby load instruction; a hidden store unit that fetches a standby store instruction received from the task controller and issues the standby store instruction; a hidden load buffer that sequentially receives memory access requests corresponding to the standby load instruction from the hidden load unit; a hidden store buffer that sequentially receives memory access requests corresponding to the standby store instruction from the hidden store unit; a hidden load engine that receives a memory access request from the hidden load buffer and transmits the second load data to the on-chip buffer; and a hidden store engine that receives a memory access request from the hidden store buffer and transmits the second store data to the on-chip buffer. 【0016】 The load / store unit may further include a translation lookaside buffer that stores a translation table of recently used virtual memory addresses and physical memory addresses. 【0017】 Furthermore, the main load / store unit may include a load unit that fetches load instructions and issues them, a store unit that fetches store instructions and issues them, a load buffer that sequentially receives memory access requests from the load unit, a store buffer that sequentially receives memory access requests from the store unit, a load engine that receives memory access requests from the load buffer and transmits first load data to the on-chip buffer, and a store engine that receives memory access requests from the store buffer and transmits first store data to the on-chip buffer. 【0018】 Furthermore, the first load data may have a higher priority than the second load data, and the first store data may have a higher priority than the second store data. 
【0019】 Furthermore, the priority order can be tagged to the first and second load data and the first and second store data. 【0020】 Furthermore, the priority can be tagged by the load engine or the store engine. 【0021】 The load / store unit may further include an arbiter that receives the first and second load data and the first and second store data and transmits them to the on-chip buffer in a round-robin manner. 【0022】 Furthermore, the on-chip buffer includes multiple banks, and the value obtained by dividing the number of inputs of the first load data, second load data, first store data, and second store data per unit clock cycle by the number of banks of the on-chip buffer is smaller than the reference input / output ratio of the arbiter, and the reference input / output ratio may be the largest input-to-output ratio value within the range in which the arbiter does not cause any waiting time for each of the first load data, second load data, first store data, and second store data. 【0023】 Also, the hidden load / store unit and the main load / store unit may share at least a part of the hardware with each other. 【0024】 Also, the hidden load / store unit and the main load / store unit may be realized by different hardware from each other. 【0025】 Also, the process unit may include a PE array that executes a two-dimensional matrix operation of sequentially multiplying the input activation and the weighting value to generate the output activation, and a vector unit that executes a one-dimensional operation. 【0026】 Also, the external interface may include any one of a data bus, an external chip interface, or a local bus. 
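The round-robin arbitration of paragraph 【0021】 and the bank-ratio condition of paragraph 【0022】 can be illustrated with a minimal Python sketch. It is not the claimed hardware: the function names, the queue names, and the parameter `grants_per_cycle` (used here as a stand-in for the number of on-chip buffer banks) are assumptions made only for illustration.

```python
from collections import deque


def round_robin_arbiter(queues, grants_per_cycle):
    """Drain the source queues in round-robin order.

    queues: dict mapping a source name to a deque of pending items.
    grants_per_cycle: items forwarded per cycle (hypothetical stand-in
        for the number of banks of the on-chip buffer).
    Returns a list of cycles, each a list of (source, item) grants.
    """
    if grants_per_cycle < 1:
        raise ValueError("grants_per_cycle must be at least 1")
    names = list(queues)
    start = 0  # index of the source that gets first pick this cycle
    schedule = []
    while any(queues[name] for name in names):
        granted = []
        last = start
        # Visit every source at most once per cycle, beginning at `start`.
        for offset in range(len(names)):
            if len(granted) == grants_per_cycle:
                break
            idx = (start + offset) % len(names)
            if queues[names[idx]]:
                granted.append((names[idx], queues[names[idx]].popleft()))
                last = idx
        # The source after the last one served goes first in the next cycle.
        start = (last + 1) % len(names)
        schedule.append(granted)
    return schedule


def within_reference_ratio(inputs_per_cycle, num_banks, reference_ratio):
    """Check the condition of paragraph 0022: inputs / banks < reference ratio."""
    return inputs_per_cycle / num_banks < reference_ratio


queues = {
    "first_load": deque(["a0", "a1"]),
    "second_load": deque(["b0"]),
    "first_store": deque(["c0"]),
    "second_store": deque(),
}
schedule = round_robin_arbiter(queues, grants_per_cycle=2)
```

With these made-up queues, the first cycle serves the first and second load data, and the second cycle serves the first store data and then wraps around to the remaining first load data. The reference ratio passed to `within_reference_ratio` is a free parameter; the specification gives no numerical value for it.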
【0027】 According to some embodiments of the present invention for solving the above-mentioned other problems, an artificial intelligence core system includes a memory for storing a program for performing operations and input data, a bus for transmitting the input data and a control signal from the memory, and an artificial intelligence core for receiving the program, the input data and the control signal and executing a two-dimensional matrix operation to generate output data. The artificial intelligence core includes a load / store unit for loading the program and the input data from the memory and storing the output data in the memory, a process unit for executing operations using the program and the input data, and an on-chip buffer for temporarily storing the program, the input data and the output data between the process unit and the load / store unit. The bus includes a control bus for transmitting the control signal and a data bus for transmitting the input data and the output data. The load / store unit executes a main load / store operation for the currently executing task currently executed by the process unit and a standby load / store operation for a standby execution task to be executed by the process unit after the currently executing task. The standby load / store operation is executed using a bandwidth of the data bus that is not used by the main load / store operation. 【0028】 In addition, the memory may include an on-chip memory formed within the same chip as the artificial intelligence core and an off-chip memory formed separately from the artificial intelligence core. 【0029】 Further, the artificial intelligence core is a first artificial intelligence core, and further includes a second artificial intelligence core different from the first artificial intelligence core. The bus further includes a local bus for transmitting the input data and the output data between the first and second artificial intelligence cores. 
The load / store unit may execute the standby load / store operation using a bandwidth of the local bus that is not used by the main load / store operation. 【0030】 Furthermore, the load / store unit includes a main load / store unit that performs the main load / store operation and a hidden load / store unit that performs the standby load / store operation, and the standby load / store operation may have a lower priority than the main load / store operation. 【0031】 Furthermore, the aforementioned priority can be identified in a tagged form. 【0032】 The artificial intelligence core may further include an activation buffer that provides input activations to the process unit and receives output activations from the process unit, and an activation load / store unit that retrieves the input activations from the on-chip buffer and transmits them to the activation buffer, and transmits the output activations from the activation buffer to the on-chip buffer. 【0033】 A load / store method for an artificial intelligence core system according to some embodiments of the present invention for solving the aforementioned and other problems includes: a main load / store unit loading a first program for a first task; executing the first task using the first program; upon confirming that the main load / store unit is not operating during the first task, a hidden load / store unit loading a second program for a second task that is waiting to be executed after the first task; and, upon completion of the first task and of the loading of the second program, executing the second task using the second program. 
【0034】 Furthermore, loading the second program may include fetching a standby load instruction for the second program, issuing the fetched standby load instruction, transmitting a memory access request corresponding to the issued standby load instruction to a hidden load buffer, the hidden load buffer sequentially transmitting the memory access request to the load engine, the load engine receiving second load data from off-chip memory via a data bus in response to the memory access request, and transmitting the second load data to an on-chip buffer. 【0035】 Furthermore, loading the first program may include fetching a load instruction for the first program, issuing the fetched load instruction, transmitting a memory access request corresponding to the issued load instruction to a load buffer, the load buffer sequentially transmitting the memory access request to a load engine, the load engine receiving first load data from off-chip memory via a data bus in response to the memory access request, and transmitting the first load data to an on-chip buffer. 【0036】 Furthermore, the first load data may have a higher priority than the second load data. 【0037】 A load / store method for an artificial intelligence core system according to some embodiments of the present invention for solving the aforementioned and other problems includes: a main load / store unit performing a load operation of first data for a first operation; performing the first operation using the first data; upon confirming that the main load / store unit is not operating during the first operation, a hidden load / store unit performing a load operation of second data for a second operation that is waiting to be executed after the first operation; and, when the first operation and the load operation of the second data are completed, performing the second operation using the second data. 
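The benefit of the method in paragraphs 【0033】 and 【0037】 can be made concrete with a small timing model in Python. This is a sketch under simplifying assumptions that are not in the specification: the hidden load for the next task starts as soon as the current task starts, only one task is loaded ahead, and the function name `total_time` and all durations are hypothetical.

```python
def total_time(tasks, hidden_load):
    """Return the finish time of the last task.

    tasks: list of (load_time, exec_time) pairs, in execution order.
    hidden_load: if True, the next task's program or data is loaded
        while the current task executes; if False, loading waits until
        the current task has finished.
    """
    if not tasks:
        return 0
    ready = tasks[0][0]  # the first load always precedes the first task
    end = 0
    for i, (_, exec_time) in enumerate(tasks):
        start = ready
        end = start + exec_time
        if i + 1 < len(tasks):
            next_load = tasks[i + 1][0]
            if hidden_load:
                # The next task starts once the current task is done AND
                # its own load, started at `start`, has completed.
                ready = max(end, start + next_load)
            else:
                ready = end + next_load
    return end


tasks = [(2, 10), (3, 8), (4, 6)]  # made-up (load, execute) durations
sequential = total_time(tasks, hidden_load=False)
overlapped = total_time(tasks, hidden_load=True)
```

In this made-up example every load is shorter than the preceding execution, so all loads after the first are fully hidden behind computation and the total drops from 33 to 26 time units.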
【0038】 Furthermore, the first operation may be a matrix operation of the first layer of the neural network, the second operation may be a matrix operation of the second layer of the neural network, and the second data may be the kernel data of the second layer. 【0039】 Furthermore, the first data may include an input activation, and performing the first operation may include storing the input activation in an activation buffer, a process unit receiving the input activation from the activation buffer and generating an output activation, and the activation buffer storing the output activation. [Effects of the Invention] 【0040】 The artificial intelligence core, artificial intelligence core system, and load / store method of the artificial intelligence core system of the present invention can optimally utilize the bandwidth of the connection interface between the artificial intelligence core and the outside to preload data and programs for the next task. 【0041】 Furthermore, the loading / storing of programs and data for the next task is performed so that it does not delay the currently running task. 【0042】 Furthermore, the main load / store unit and the hidden load / store unit can share hardware, maximizing the efficiency of hardware utilization. 【0043】 In addition to the above, the specific effects of the present invention will be described below, together with a detailed explanation of the specific matters for implementing the present invention. [Brief Description of the Drawings] 【0044】 [Figure 1] Figure 1 is a block diagram illustrating an artificial intelligence core system according to several embodiments of the present invention. [Figure 2] Figure 2 is a block diagram that provides a detailed explanation of the structure of the artificial intelligence core shown in Figure 1. 
[Figure 3] Figure 3 is a block diagram that provides a detailed explanation of the structure of the process unit shown in Figure 2. [Figure 4] Figure 4 is a conceptual diagram illustrating the structure of the neural network in the deep learning work performed by the process unit. [Figure 5] Figure 5 is a block diagram illustrating the operation of the load / store unit in Figure 2. [Figure 6] Figure 6 is a block diagram illustrating in detail the structure of the load / store unit in Figure 5. [Figure 7] Figure 7 is a timing diagram illustrating the program loading operation of an artificial intelligence core system according to several embodiments of the present invention in chronological order. [Figure 8] Figure 8 is a timing diagram illustrating the data prefetch operation of an artificial intelligence core system according to several embodiments of the present invention in a time-series manner. [Figure 9] Figure 9 is a block diagram illustrating in detail the main load / store unit of an artificial intelligence core according to several embodiments of the present invention. [Figure 10] Figure 10 is a block diagram illustrating in detail a hidden load / store unit of an artificial intelligence core according to several embodiments of the present invention. [Figure 11] Figure 11 is a block diagram illustrating an artificial intelligence core system according to several embodiments of the present invention. [Figure 12] Figure 12 is a block diagram illustrating an artificial intelligence core system according to several embodiments of the present invention. [Figure 13] Figure 13 is a block diagram illustrating in detail the structure and operation of the first artificial intelligence core in Figure 12. [Figure 14]Figure 14 is a flowchart illustrating a load / store method for an artificial intelligence core system according to several embodiments of the present invention. 
[Figure 15] Figure 15 is a flowchart illustrating in detail the step of loading the first program in Figure 14. [Figure 16] Figure 16 is a flowchart illustrating in detail the step of loading the second program in Figure 14. [Figure 17] Figure 17 is a flowchart illustrating a load / store method for an artificial intelligence core system according to several embodiments of the present invention. [Figure 18] Figure 18 is a flowchart that provides a detailed explanation of the steps involved in performing the first task shown in Figure 17. [Modes for carrying out the invention] 【0045】 The terms and words used in this specification and in the claims should not be construed to be limited to their general or dictionary meanings. In accordance with the principle that inventors may define the concepts of terms and words to best describe their invention, they should be interpreted as meanings and concepts consistent with the technical idea of the present invention. Furthermore, the embodiments and configurations shown in the drawings described herein represent only one embodiment of the present invention and do not represent the entire technical idea of the present invention; therefore, it should be understood that, at the time of filing, there may be various equivalents, variations, and applicable examples that can substitute for them. 【0046】 The terms first, second, A, B, etc., used herein and in the claims may be used to describe various components, but such components should not be limited by such terms. The terms are used solely for the purpose of distinguishing one component from another. For example, without departing the scope of the present invention, the first component may be named the second component, and similarly the second component may be named the first component. The terms "and / or" include a combination of multiple related descriptions or any of the multiple related descriptions. 
【0047】 The terms used herein and in the claims are used solely to describe specific embodiments and are not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as “includes” or “having” should be understood to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and not to exclude in advance the presence or addition of one or more others. 【0048】 Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as those generally understood by a person of ordinary skill in the art to which this invention pertains. 【0049】 Terms that are defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this application. 【0050】 Furthermore, the configurations, processes, steps, or methods included in each embodiment of the present invention can be shared to the extent that they do not technically contradict each other. 【0051】 The following describes several embodiments of the artificial intelligence core system according to the present invention with reference to Figures 1 to 8. 【0052】 Figure 1 is a block diagram illustrating an artificial intelligence core system according to several embodiments of the present invention. 【0053】 Referring to Figure 1, an artificial intelligence core system according to some embodiments of the present invention includes an artificial intelligence core 100, memory 200, and an external interface 300. 【0054】 The artificial intelligence core 100 may be a processing module specifically designed for deep learning computations. The artificial intelligence core 100 may be implemented as a separate single chip or multiple chips, or as part of a System on Chip (SoC) integrated into a system. 
Specializing in convolution operations, i.e., matrix multiplication, the artificial intelligence core 100 can perform deep learning training and inference tasks far more efficiently than conventional CPUs or GPUs. The artificial intelligence core 100 can be implemented as a module in hardware. 【0055】 The memory 200 can transmit programs, input data, and control signals to the artificial intelligence core 100 via the external interface 300. The memory 200 can also receive and store output data from the artificial intelligence core 100. 【0056】 The memory 200 may include on-chip memory 210 and off-chip memory 220. The on-chip memory 210 may be, for example, SRAM (Static Random Access Memory) formed on the same chip as the artificial intelligence core 100. The on-chip memory 210 may be shared memory shared by multiple cores. However, this embodiment is not limited thereto. 【0057】 The off-chip memory 220 may be an external memory formed separately from the artificial intelligence core 100. The off-chip memory 220 may include, for example, at least one of DRAM (Dynamic Random-Access Memory), NAND flash memory, NOR flash memory, and 3D crosspoint memory. However, this embodiment is not limited to these. 【0058】 The memory 200 can provide programs and input data to the artificial intelligence core 100 via the external interface 300, and can receive and store output data from the artificial intelligence core 100 via the external interface 300. 【0059】 The external interface 300 can perform data exchange between the artificial intelligence core 100 and the memory 200. The external interface 300 can transfer not only data, but also programs and control signals. 【0060】 The external interface 300 can be implemented in various forms. Specifically, if the artificial intelligence core 100 is implemented in SoC form, the external interface 300 may be the main data bus. 
Alternatively, if the artificial intelligence core 100 is implemented in single-chip form, the external interface 300 may be an external chip interface. 【0061】 Figure 2 is a block diagram that provides a detailed explanation of the structure of the artificial intelligence core shown in Figure 1. 【0062】 Referring to Figure 2, the artificial intelligence core 100 may include a process unit 110, an activation buffer 120, an activation load / store unit 130, an on-chip buffer 140, and a load / store unit 150. 【0063】 The process unit 110 may be a module that performs calculations. The process unit 110 may perform not only one-dimensional calculations but also two-dimensional matrix calculations, i.e., convolution calculations. The process unit 110 may receive an input activation Act_In, multiply it by a weighted value, and then add the results to generate an output activation Act_Out. 【0064】 Figure 3 is a block diagram that provides a detailed explanation of the structure of the process unit shown in Figure 2. 【0065】 Referring to Figures 2 and 3, the process unit 110 may include a PE array 111 and a vector unit 112. 【0066】 The PE array 111 can receive the input activation Act_In and perform multiplication with the weighted values. In this process, the input activation Act_In and the weighted values form a matrix and can be calculated by convolution. As a result, the PE array 111 can generate the output activation Act_Out. 【0067】 The PE array 111 may include at least one processing element 111a. The processing elements 111a may be aligned with each other and each perform multiplication of one input activation Act_In and one weight. 【0068】 The PE array 111 can generate partial sums by summing the values for each multiplication. Such partial sums can be used as the output activation Act_Out. Since the PE array 111 performs two-dimensional matrix multiplication, it can also be referred to as a two-dimensional matrix compute unit. 
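The multiply-and-accumulate behavior of the PE array 111 in paragraphs 【0066】 to 【0068】 can be sketched in plain Python. The function name `pe_array_matmul` and the nested-list representation are illustrative assumptions; real hardware performs the per-element multiplications in parallel rather than in loops.

```python
def pe_array_matmul(activations, weights):
    """Multiply an activation matrix by a weight matrix.

    Each innermost multiplication models one processing element
    multiplying one input activation by one weight; the running
    total models the partial sum that becomes an output activation.
    """
    rows, inner, cols = len(activations), len(weights), len(weights[0])
    if any(len(row) != inner for row in activations):
        raise ValueError("activation columns must match weight rows")
    output = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            partial_sum = 0
            for k in range(inner):
                partial_sum += activations[r][k] * weights[k][c]
            output[r][c] = partial_sum
    return output


act_in = [[1, 2], [3, 4]]     # made-up input activations
kernel = [[5, 6], [7, 8]]     # made-up weights
act_out = pe_array_matmul(act_in, kernel)  # [[19, 22], [43, 50]]
```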
【0069】 The vector unit 112 can primarily perform one-dimensional operations. The vector unit 112 can perform deep learning operations together with the PE array 111. This allows the process unit 110 to specialize in the necessary operations. In other words, the artificial intelligence core 100 has computation modules that perform large amounts of two-dimensional matrix multiplication and one-dimensional operations, enabling it to efficiently perform deep learning tasks. 【0070】 Figure 4 is a conceptual diagram illustrating the structure of the neural network in the deep learning work performed by the process unit. 【0071】 Referring to Figure 4, the neural network implemented by the PE array 111 may include an input layer Input1~k containing input nodes into which input data is received, an output layer Output1~i containing output nodes that output output data, and M hidden layers placed between the input and output layers. 【0072】 Here, the edges connecting the nodes in each layer may be assigned weights. These weights, or the presence or absence of edges, can be added, removed, or updated during the learning process. Therefore, the weights of the nodes and edges between the k input nodes and the i output nodes can be updated during the learning process. 【0073】 Before a neural network performs learning, all nodes and edges can be set to their initial values. However, as information is input cumulatively, the weights of the nodes and edges are changed, and in this process, matching can occur between the parameters input as learning factors and the values assigned to the output nodes. 【0074】 Furthermore, the weights of the nodes and edges between the input and output nodes that make up the neural network can be updated during the neural network's learning process. 【0075】 Referring again to Figure 2, the activation buffer 120 can provide the process unit 110 with an input activation Act_In and receive an output activation Act_Out from the process unit 110. 
The activation buffer 120 can temporarily store the input activation Act_In and the output activation Act_Out. 【0076】 The input activation Act_In and output activation Act_Out can refer to the input and output values of a layer in a neural network. In this case, if the neural network has multiple layers, the output value of the previous layer becomes the input value of the next layer, so the output activation Act_Out of the previous layer can be used as the input activation Act_In of the next layer. 【0077】 The activation buffer 120 can rapidly provide activations to the computationally intensive process unit 110, particularly the PE array 111, and rapidly receive activations from it, thereby increasing the computation speed of the artificial intelligence core 100. 【0078】 The activation load / store unit 130 can transmit input activation Act_In from the on-chip buffer 140 to the activation buffer 120, and output activation Act_Out from the activation buffer 120 to the on-chip buffer 140. In other words, the activation load / store unit 130 can perform both activation load and activation store operations. 【0079】 The on-chip buffer 140 is a memory located inside the artificial intelligence core 100, and can receive and temporarily store all the input data necessary for the artificial intelligence core 100's work from an external source. The on-chip buffer 140 can also temporarily store output data calculated by the artificial intelligence core 100 for transmission to an external source. 【0080】 The on-chip buffer 140 can transmit the input activation Act_In to the activation buffer 120 and receive the output activation Act_Out from the activation buffer 120 via the activation load / store unit 130. In addition to going through the activation load / store unit 130, the on-chip buffer 140 can also exchange data directly with the process unit 110. In other words, the on-chip buffer 140 can exchange data with both the PE array 111 and the vector unit 112. 
【0081】 The load / store unit 150 can receive at least one of input data, programs, and control signals from an external source via the external interface 300. The load / store unit 150 can transmit at least one of the received input data, programs, and control signals to the on-chip buffer 140. 【0082】 Similarly, the load / store unit 150 can transmit output data to the outside via the external interface 300. The load / store unit 150 can transmit output data generated by the process unit 110. 【0083】 Figure 5 is a block diagram illustrating the operation of the load / store unit in Figure 2. 【0084】 Referring to Figure 5, the task controller 10 may be implemented by the artificial intelligence core 100. The task controller 10 may be a module that controls the work of the artificial intelligence core 100. The task controller 10 may be a module logically implemented by the artificial intelligence core 100. However, this embodiment is not limited to these. 【0085】 The external interface 300 may include a control bus 310 and a data bus 320 if the artificial intelligence core 100 is an SoC. In this case, the control bus 310 is a bus for transmitting control signals, and the data bus 320 may be a bus for transmitting input data and output data. 【0086】 The control bus 310 can transmit control signals to the task controller 10 for loading or storing for the current operation. For example, the task controller 10 can transmit at least one of a load command and a standby load command to the load / store unit 150. Alternatively, the task controller 10 can transmit at least one of a store command and a standby store command to the load / store unit 150. The load / store unit 150 can perform a load / store operation according to at least one of the load command, store command, standby load command, and standby store command. 
【0087】 In this context, load and store instructions refer to instructions for programs and data related to the task currently being performed by the process unit 110, while standby load and standby store instructions may refer to instructions for programs and data related to the next task that the process unit 110 will perform. 【0088】 Load instructions, standby load instructions, store instructions, and standby store instructions may each include the following details: 【0089】 Dscrptr{src, dst, burst size, #burst}. Here, src may mean the source, i.e., the address of the data to be loaded or stored; dst may mean the destination, i.e., the address to which the data is transmitted; burst size may mean the size of each burst, i.e., the split size; and #burst may mean the number of bursts, i.e., the number of splits. However, this embodiment is not limited to these. 【0090】 The load / store unit 150 may include a main load / store unit 151 and a hidden load / store unit 152. Among the load / store operations, the main load / store unit 151 may perform the main load / store operation. 【0091】 For example, the main load / store unit 151 may fetch load instructions and issue load instructions. Here, issuing an instruction may mean determining whether the conditions allow the instruction to be executed, and if so, proceeding with the execution. 【0092】 The main load / store unit 151 may, in accordance with the issued load instruction, access the off-chip memory 220 via the data bus 320 to receive the first load data Dpr and transmit it to the on-chip buffer 140. In this case, the first load data Dpr may be data with a high priority. 【0093】 Among the load / store operations, the hidden load / store unit 152 may perform the standby load / store operation. For example, the hidden load / store unit 152 may fetch a standby load instruction and issue a standby load instruction. 
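The descriptor fields listed in paragraph 【0089】 can be pictured as a small record that expands into individual burst requests. The following is only a minimal sketch: the class name Descriptor, the method names, the byte units, and the assumption that bursts are laid out contiguously from src and dst are all illustrative and are not stated in the specification.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical model of Dscrptr{src, dst, burst size, #burst}."""
    src: int         # address the data is read from
    dst: int         # address the data is written to
    burst_size: int  # size of one burst (assumed to be in bytes)
    num_burst: int   # number of bursts the transfer is split into

    def total_bytes(self) -> int:
        """Total transfer size implied by the two burst fields."""
        return self.burst_size * self.num_burst

    def bursts(self) -> Iterator[Tuple[int, int, int]]:
        """Yield (src_addr, dst_addr, length) per burst, assuming contiguous layout."""
        for i in range(self.num_burst):
            offset = i * self.burst_size
            yield (self.src + offset, self.dst + offset, self.burst_size)


# Example values are made up for illustration.
desc = Descriptor(src=0x1000, dst=0x0, burst_size=64, num_burst=4)
requests = list(desc.bursts())
```

Under these assumptions, one descriptor stands for four 64-byte memory access requests, which is the kind of unit the load and store buffers would then forward in order.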
【0094】 The hidden load / store unit 152 may, in accordance with the issued standby load instruction, access the off-chip memory 220 via the data bus 320 to receive the second load data Dnpr and transmit it to the on-chip buffer 140. In this case, the second load data Dnpr may be data with a lower priority. That is, the first load data Dpr may have a relatively higher priority than the second load data Dnpr. Accordingly, the on-chip buffer 140 may store the first load data Dpr before the second load data Dnpr. 【0095】 In this case, priority can be identified by tagging the data. This ensures that the main load / store operation for the currently running operation is not delayed by the standby load / store operation. In other words, the standby load / store operation can be performed without any interference with the execution of the main load / store operation. Furthermore, the standby load / store operation can be performed using the bandwidth of the external interface 300 that remains after deducting the bandwidth used by the main load / store operation. That is, the loading operation of the program and data must be performed first in chronological order before the computational operation can be performed, and the execution time of the computational operation can be much longer than that of the loading operation. 【0096】 As a result, the artificial intelligence core system according to this embodiment can maximize bandwidth utilization by allocating bandwidth that is not being used during computational work to standby tasks. 【0097】 Figure 6 is a block diagram illustrating in detail the structure of the load / store unit shown in Figure 5. 【0098】 Referring to Figure 6, the load / store unit 150 may include a load unit 151a, a store unit 151b, a load buffer 151a_b, a store buffer 151b_b, a hidden load unit 152a, a hidden load buffer 152a_b, a hidden store unit 152b, a hidden store buffer 152b_b, a load engine 153, a store engine 154, a translation index buffer 155, and an arbiter 156. 
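The idea in paragraphs 【0094】 to 【0096】, that standby transfers use only the bandwidth the main operation leaves free, can be illustrated with a toy per-cycle scheduler. This is a sketch under simplifying assumptions that do not come from the specification: the interface carries exactly one request per cycle, and the function name schedule and the request labels are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def schedule(main_arrivals: List[List[str]],
             standby: List[str]) -> List[Tuple[int, str, str]]:
    """Issue at most one request per cycle, always preferring main requests.

    main_arrivals[c] lists the main requests that arrive at cycle c.
    Standby requests are issued only in cycles where no main request waits.
    Returns (cycle, tag, request) tuples in issue order.
    """
    main_q: Deque[str] = deque()
    standby_q: Deque[str] = deque(standby)
    issued: List[Tuple[int, str, str]] = []
    cycle = 0
    while cycle < len(main_arrivals) or main_q or standby_q:
        if cycle < len(main_arrivals):
            main_q.extend(main_arrivals[cycle])
        if main_q:
            issued.append((cycle, "main", main_q.popleft()))
        elif standby_q:
            issued.append((cycle, "standby", standby_q.popleft()))
        cycle += 1
    return issued


# Made-up traffic: two main requests at cycle 0, one more at cycle 3.
log = schedule(main_arrivals=[["m0", "m1"], [], [], ["m2"]],
               standby=["s0", "s1", "s2"])
```

In this example the standby requests fill cycles 2, 4 and 5, while every main request is issued in the earliest cycle available to it, so the main traffic sees no added delay.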
【0099】 The load unit 151a can fetch load instructions from the task controller 10 and issue load instructions. When the load unit 151a provides the issued load instructions to the load buffer 151a_b, the load buffer 151a_b can sequentially transmit memory access requests to the load engine 153 in the order in which the inputs were received. 【0100】 Furthermore, the store unit 151b can fetch store instructions from the task controller 10 and issue store instructions. When the store unit 151b provides the issued store instructions to the store buffer 151b_b, the store buffer 151b_b can sequentially transmit memory access requests to the store engine 154 in the order they were received. 【0101】 The hidden load unit 152a can fetch a standby load instruction from the task controller 10 and issue a standby load instruction. When the hidden load unit 152a provides the issued standby load instruction to the hidden load buffer 152a_b, the hidden load buffer 152a_b can sequentially transmit memory access requests to the load engine 153 in the order they were received. 【0102】 Furthermore, the hidden store unit 152b can fetch a standby store instruction from the task controller 10 and issue a standby store instruction. When the hidden store unit 152b provides the issued standby store instruction to the hidden store buffer 152b_b, the hidden store buffer 152b_b can sequentially transmit memory access requests to the store engine 154 in the order in which they were received. 【0103】 The load engine 153 can receive a memory access request and call the first load data Dpr and the second load data Dnpr via the data bus 320. In this case, the load engine 153 can quickly look up the data using the translation table of recently used virtual addresses and physical addresses in the translation index buffer 155. If the virtual address of the load engine 153 is not in the translation index buffer 155, address translation information can be looked up in memory 200. 
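The lookup behaviour described in paragraph 【0103】 resembles a small cache of recent address translations with a slower fallback to a table held in memory. The sketch below assumes a page size of 4096 bytes, a least-recently-used replacement policy, and a dictionary standing in for the translation information in memory 200; none of these details are given in the specification, and the class name is illustrative.

```python
from collections import OrderedDict
from typing import Dict

PAGE_SIZE = 4096  # assumed page size; the specification does not give one


class TranslationIndexBuffer:
    """Small LRU cache of virtual-page to physical-page mappings (a sketch)."""

    def __init__(self, page_table: Dict[int, int], capacity: int = 2):
        self.page_table = page_table  # stands in for the table kept in memory 200
        self.capacity = capacity
        self.entries: "OrderedDict[int, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpage)  # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpage] = self.page_table[vpage]  # slow path: look up in memory
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
        return self.entries[vpage] * PAGE_SIZE + offset


# Made-up mappings for illustration.
tlb = TranslationIndexBuffer(page_table={0: 7, 1: 3, 2: 9})
first = tlb.translate(0x1010)   # miss: virtual page 1 is fetched from the table
second = tlb.translate(0x1020)  # hit: same page, served from the buffer
```

The second access to the same page avoids the memory lookup, which is the speed-up the translation index buffer 155 is described as providing to the load engine 153.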
【0104】 The first load data Dpr may be data corresponding to a memory access request received from the load buffer 151a_b, and the second load data Dnpr may be data corresponding to a memory access request received from the hidden load buffer 152a_b. 【0105】 In this case, the load buffer 151a_b and the hidden load buffer 152a_b do not transmit memory access requests to the load engine 153 simultaneously. In other words, the hidden load unit 152a and the hidden load buffer 152a_b can identify when the load unit 151a and the load buffer 151a_b are not transmitting a memory access request to the load engine 153, and then transmit their own memory access request to the load engine 153. That is, the hidden load buffer 152a_b can only operate if the instruction issuance operation is stalled in the load buffer 151a_b. 【0106】 The arbiter 156 can receive the first load data Dpr and the second load data Dnpr from the load engine 153. The arbiter 156 can transmit the first load data Dpr and the second load data Dnpr, which are input in a round-robin manner, to bank B of the on-chip buffer 140, respectively. In other words, since the arbiter 156 sequentially distributes the data to bank B of the on-chip buffer 140, a delay in the first load data Dpr may generally occur when the second load data Dnpr is added. 【0107】 However, in some embodiments of the present invention, the artificial intelligence core can assign a high priority to the first load data Dpr, thereby preventing processing delays to the first load data Dpr even when the second load data Dnpr is added. 【0108】 Such priorities can be tagged by the load engine 153. However, this embodiment is not limited to this. That is, it is also possible for the priority information to be determined in advance and transmitted by the load unit 151a and the hidden load unit 152a. 【0109】 The store engine 154 can receive a memory access request and transfer the first store data and the second store data via the data bus 320. 
In this case, the store engine 154 can quickly look up the data using the translation table of recently used virtual addresses and physical addresses held in the translation index buffer 155. If the virtual address requested by the store engine 154 is not in the translation index buffer 155, address translation information can be retrieved from memory 200. 【0110】 The first store data may be data corresponding to a memory access request received from the store buffer 151b_b, and the second store data may be data corresponding to a memory access request received from the hidden store buffer 152b_b. 【0111】 In this case, the store buffer 151b_b and the hidden store buffer 152b_b do not send memory access requests to the store engine 154 simultaneously. That is, the hidden store unit 152b and the hidden store buffer 152b_b can identify when the store unit 151b and the store buffer 151b_b are not transmitting a memory access request to the store engine 154, and then transmit their own memory access request to the store engine 154. In other words, the hidden store buffer 152b_b can only operate if the instruction issuance operation is stalled in the store buffer 151b_b. 【0112】 The arbiter 156 can receive the first store data and the second store data from the store engine 154. The arbiter 156 can transmit the first store data and the second store data, which are input in a round-robin manner, from bank B of the on-chip buffer 140 to the data bus 320, respectively. In other words, since the arbiter 156 sequentially retrieves data from bank B of the on-chip buffer 140, a processing delay of the first store data may generally occur when the second store data is added. 
【0113】 However, in some embodiments of the present invention, the artificial intelligence core can assign a high priority to the first store data, thereby preventing processing delays to the first store data even when the second store data is added. 【0114】 Such priorities can be tagged by the store engine 154. However, this embodiment is not limited to this. That is, it is also possible for the priority information to be determined in advance and transmitted by the store unit 151b and the hidden store unit 152b. 【0115】 In this case, the load unit 151a, load buffer 151a_b, store unit 151b, store buffer 151b_b, load engine 153, store engine 154, translation index buffer 155, and arbiter 156 may be included in the main load / store unit 151. 【0116】 On the other hand, the hidden load unit 152a, hidden load buffer 152a_b, hidden store unit 152b, hidden store buffer 152b_b, load engine 153, store engine 154, translation index buffer 155, and arbiter 156 may be included in the hidden load / store unit 152. 【0117】 In other words, the main load / store unit 151 and the hidden load / store unit 152 may share the load engine 153, the store engine 154, the translation index buffer 155, and the arbiter 156 with each other. At least one of the load engine 153, the store engine 154, and the translation index buffer 155 may be implemented in hardware. 【0118】 Since the usage times of the main load / store unit 151 and the hidden load / store unit 152 will inevitably differ in practice, the two units can share the load engine 153 and the store engine 154 as common hardware. This can maximize the resource utilization efficiency of this embodiment. 【0119】 Figure 7 is a timing diagram illustrating the program loading operation of an artificial intelligence core system according to several embodiments of the present invention in chronological order. 【0120】 Referring to Figure 7, first, the task controller 10 may execute a first program load PrLD1. 
The first program is a program necessary for the first task execution EXEC1 and may be a program for deep learning work. Since the first program load PrLD1 must precede the first task execution EXEC1, the first task execution EXEC1 may be dependent on the first program load PrLD1. 【0121】 In a typical artificial intelligence core, the second program load PrLD2 may be executed after the first task execution EXEC1 is completed. In contrast, the artificial intelligence core 100 according to this embodiment may execute the second program load PrLD2 in parallel with the first task execution EXEC1 of the deep learning task. As a result, the second task execution EXEC2 can start immediately when the first task execution EXEC1 is completed. This allows the artificial intelligence core 100 according to this embodiment to dramatically increase the speed of the deep learning task. 【0122】 Figure 8 is a timing diagram illustrating the data prefetch operation of an artificial intelligence core system according to several embodiments of the present invention in a time-series manner. 【0123】 Referring to Figure 8, first, the task controller 10 may execute a first program load PrLD1. Next, a first fetch Fetch1 may be executed. The first fetch Fetch1 may be the stage of retrieving data for deep learning training and inference. 【0124】 The first task execution, EXEC1, can be dependent because it requires loading the program and data. Similarly, the second task execution, EXEC2, can also be dependent because it requires loading data, like the second prefetch, PreFetch2. The second prefetch, PreFetch2, could, for example, retrieve kernel data for the next layer of a CNN (Convolutional Neural Network) or LSTM (Long Short-Term Memory). 
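The benefit of running the second program load PrLD2 during the first task execution EXEC1, as described for Figure 7 in paragraph 【0121】, can be checked with simple arithmetic. The sketch below compares a serial schedule with an overlapped one; the durations are invented for illustration, the time units are arbitrary, and the model assumes that a standby load begins as soon as the preceding task starts executing.

```python
from typing import List, Tuple


def serial_time(tasks: List[Tuple[int, int]]) -> int:
    """Each load starts only after the previous task has finished executing."""
    return sum(load + run for load, run in tasks)


def overlapped_time(tasks: List[Tuple[int, int]]) -> int:
    """The load for the next task runs while the current task executes.

    A task starts executing once its own load is done and the previous
    execution has finished.
    """
    load_start = 0
    exec_done = 0
    for load, run in tasks:
        load_done = load_start + load
        exec_start = max(load_done, exec_done)
        exec_done = exec_start + run
        load_start = exec_start  # the next load overlaps this execution
    return exec_done


# (load duration, execution duration) for two tasks; values are made up.
tasks = [(2, 10), (3, 8)]
t_serial = serial_time(tasks)       # 2 + 10 + 3 + 8
t_overlap = overlapped_time(tasks)  # the second load is hidden behind EXEC1
```

With these numbers the serial schedule takes 23 units and the overlapped one takes 20, the difference being exactly the second load, which completes before EXEC1 ends.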
【0125】 In this embodiment, the artificial intelligence core system executes a second prefetch PreFetch2 during the first task execution EXEC1 to acquire data corresponding to the second task execution EXEC2 in advance, so that the second task execution EXEC2 can be started immediately after the first task execution EXEC1 is completed. This can further increase the processing speed of the artificial intelligence core in this embodiment. 【0126】 Hereinafter, with reference to Figures 9 and 10, an artificial intelligence core and an artificial intelligence core system according to several embodiments of the present invention will be described. If there is any overlap with the above, it will be simplified or omitted. 【0127】 Figure 9 is a block diagram illustrating in detail the main load / store unit of an artificial intelligence core according to some embodiments of the present invention, and Figure 10 is a block diagram illustrating in detail the hidden load / store unit of an artificial intelligence core according to some embodiments of the present invention. 【0128】 Referring to Figures 9 and 10, the load / store units of the artificial intelligence core system according to some embodiments of the present invention may be separated in hardware. That is, the main load / store unit 151 may include a load unit 151a, a store unit 151b, a load buffer 151a_b, a store buffer 151b_b, a first load engine 153_1, a first store engine 154_1, and a first translation index buffer 155_1. 【0129】 Furthermore, the hidden load / store unit 152 may include a hidden load unit 152a, a hidden store unit 152b, a hidden load buffer 152a_b, a hidden store buffer 152b_b, a second load engine 153_2, a second store engine 154_2, and a second translation index buffer 155_2. 
【0130】 In this embodiment, the design difficulty of the artificial intelligence core 100 is reduced because the main load / store unit 151 and the hidden load / store unit 152 are physically separated from each other, and the durability of each can be maintained for a longer period because the load engine 153 and the store engine 154 are not shared with each other. However, in the case of the arbiter 156, more accurate values can be obtained by setting them identically. 【0131】 Hereinafter, with reference to Figure 11, several embodiments of the present invention, including artificial intelligence cores and artificial intelligence core systems, will be described. If there is any overlap with the previously mentioned content, it will be simplified or omitted. 【0132】 Figure 11 is a block diagram illustrating an artificial intelligence core system according to several embodiments of the present invention. 【0133】 Referring to Figure 11, in some embodiments of the present invention, the artificial intelligence core system includes a load / store unit 150 which includes an extended arbiter 156_1, and the load engine 153 and store engine 154 do not need to use data with different priorities. Instead, the number of banks B held by the on-chip buffer 140 can be further increased, and an extended bank Be may be included in the on-chip buffer 140. 【0134】 In other words, if the number of Bank B entries increases in proportion to the increase in the number of inputs, existing data will no longer need to wait, thus preventing delays in the processing speed of the AI core 100. 【0135】 The extended arbiter 156_1 may have a reference input-output ratio. In this case, the reference input-output ratio may mean the largest input-to-output ratio within a range where no input latency occurs. 
The value obtained by dividing the number of inputs of the first load data, second load data, first store data, and second store data that enter into the extended arbiter 156_1 by the number of banks B and extended banks (Be) of the on-chip buffer 140 may be smaller than the reference input-output ratio. 【0136】 Therefore, it is possible that simply increasing the number of Bank B units in the on-chip buffer 140, without prioritizing the load data, may not cause any damage to the main load / store operations. 【0137】 Hereinafter, with reference to Figures 12 and 13, several embodiments of the present invention, including artificial intelligence cores and artificial intelligence core systems, will be described. Any overlap with the previously mentioned content will be simplified or omitted. 【0138】 Figure 12 is a block diagram illustrating an artificial intelligence core system according to several embodiments of the present invention, and Figure 13 is a block diagram illustrating in detail the structure and operation of the first artificial intelligence core in Figure 12. 【0139】 Referring to Figure 12, an artificial intelligence core system according to some embodiments of the present invention may include a first artificial intelligence core 100, a second artificial intelligence core 400, and a local bus 500. 【0140】 The first artificial intelligence core 100 may be identical to the artificial intelligence core 100 in Figure 1. The second artificial intelligence core 400 may be a separate core isolated from the first artificial intelligence core 100. The first artificial intelligence core 100 and the second artificial intelligence core 400 may exchange data with each other using the local bus 500. 【0141】 The local bus 500 can be a pathway for transmitting data between cores. The local bus 500 can improve the speed of a multicore system through inter-core communication. 
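Returning to the reference input-output ratio of paragraph 【0135】, the condition is that the number of arbiter inputs divided by the total number of banks (banks B plus extended banks Be) stays below the reference ratio. The sketch below expresses that check and searches for the smallest number of extended banks that satisfies it. The function names and the example numbers are illustrative only, and the reference ratio is treated as a given constant because the specification does not say how it is obtained.

```python
def meets_reference_ratio(num_inputs: int, num_banks: int,
                          num_extended_banks: int,
                          reference_ratio: float) -> bool:
    """True if inputs / (banks + extended banks) is below the reference ratio."""
    total_banks = num_banks + num_extended_banks
    if total_banks <= 0:
        raise ValueError("the on-chip buffer needs at least one bank")
    return num_inputs / total_banks < reference_ratio


def min_extended_banks(num_inputs: int, num_banks: int,
                       reference_ratio: float) -> int:
    """Smallest number of extended banks for which the condition holds."""
    if num_banks < 1 or reference_ratio <= 0:
        raise ValueError("need at least one bank and a positive reference ratio")
    extra = 0
    while not meets_reference_ratio(num_inputs, num_banks, extra, reference_ratio):
        extra += 1
    return extra


# Made-up example: four input streams, two existing banks, reference ratio 1.0.
extra_needed = min_extended_banks(num_inputs=4, num_banks=2, reference_ratio=1.0)
```

In this example three extended banks are needed, because with four banks in total the ratio equals the reference value rather than falling below it.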
【0142】 Referring to Figure 13, the load / store unit 150 of the first artificial intelligence core 100 can communicate with the second artificial intelligence core 400 via the local bus 500. In particular, the main load / store unit 151 and the hidden load / store unit 152 can each perform data load / store operations via the local bus. 【0143】 This embodiment allows for maximizing bandwidth utilization even in data exchange between cores. 【0144】 The following describes a loading / storing method for an artificial intelligence core system according to several embodiments of the present invention, with reference to Figures 6, 7, and 14-16. Any overlap with the previously mentioned content will be simplified or omitted. 【0145】 Figure 14 is a flowchart illustrating a loading / storing method of an artificial intelligence core system according to several embodiments of the present invention, and Figure 15 is a flowchart illustrating in detail the step of loading the first program in Figure 14. Figure 16 is a flowchart illustrating in detail the step of loading the second program in Figure 14. 【0146】 Referring to Figure 14, the main load / store unit loads the first program (S100). 【0147】 Referring to Figure 15 in more detail, a load instruction for the first program can be fetched (S110), and the fetched load instruction can be issued (S120). 【0148】 Next, a memory access request corresponding to the issued load instruction is transmitted to the load buffer (S130), and the load buffer sequentially transmits the memory access request to the load engine (S140). 【0149】 Next, the first load data is received from the off-chip memory via the data bus (S150), and the first load data is transmitted to the on-chip buffer (S160). 【0150】 Referring again to Figure 14, the first operation is performed using the first program (S200). 
【0151】 Specifically, referring to Figure 7, the first program is a program necessary for the first task execution EXEC1, and may be a program for deep learning work. Since the first program load PrLD1 needs to precede the first task execution EXEC1, i.e., the first task, the first task execution EXEC1 may be dependent on the first program load PrLD1. 【0152】 Referring again to Figure 14, it is confirmed that the main load / store unit is not operating (S300), and the hidden load / store unit loads the second program for the second operation (S400). 【0153】 Referring to Figure 16 in more detail, a standby load instruction for the second program may be fetched (S410), and the fetched standby load instruction may be issued (S420). 【0154】 Next, the memory access request corresponding to the issued standby load instruction is transmitted to the hidden load buffer (S430), and the hidden load buffer sequentially transmits the memory access request to the load engine (S440). 【0155】 Next, the second load data is received from the off-chip memory via the data bus (S450), and the second load data is transmitted to the on-chip buffer (S460). 【0156】 Specifically, referring to Figure 7, in this embodiment, the artificial intelligence core 100 can execute the second program load PrLD2 in parallel with the first task execution EXEC1 of the deep learning work. As a result, when the first task execution EXEC1 finishes, the second task execution EXEC2 can start immediately. This allows the artificial intelligence core 100 in this embodiment to dramatically increase the speed of the deep learning work. 【0157】 Furthermore, referring to Figure 6, the hidden load unit 152a and the hidden load buffer 152a_b can identify when the load unit 151a and the load buffer 151a_b are not transmitting a memory access request to the load engine 153, and then transmit their own memory access request to the load engine 153. 【0158】 Steps S300 and S400 may be executed in parallel with step S200. 
【0159】 Referring again to Figure 14, the second operation is performed using the second program (S500). 【0160】 Specifically, referring to Figure 7, the second program is a program necessary for the execution of the second task EXEC2, and may be a program for deep learning work. Since the second program load PrLD2 needs to precede the execution of the second task EXEC2, i.e., the second task, the execution of the second task EXEC2 may be dependent on the second program load PrLD2. 【0161】 In this embodiment, the loading / storing method of the artificial intelligence core allows for the parallel execution of the first task and the loading of the second program for the second task, thereby increasing work efficiency and enabling maximum utilization of the bandwidth of the external interface 300, which could not be utilized conventionally. 【0162】 The following describes a loading / storing method for an artificial intelligence core system according to several embodiments of the present invention, with reference to Figures 17 and 18. Any overlap with the previously mentioned content will be simplified or omitted. 【0163】 Figure 17 is a flowchart illustrating a load / store method for an artificial intelligence core system according to several embodiments of the present invention, and Figure 18 is a flowchart illustrating in detail the steps for performing the first operation in Figure 17. 【0164】 Referring to Figure 17, the main load / store unit loads the first data (S1100). 【0165】 Specifically, referring to Figure 8, the first fetch, Fetch1, may be performed. The first fetch, Fetch1, may be the stage where data is retrieved for deep learning training and inference. 【0166】 Referring again to Figure 17, the first operation is performed using the first data (S1200). 【0167】 More specifically, referring to Figure 18, the input activation is saved to the activation buffer (S1210). 
【0168】 Specifically, referring to Figure 2, the activation load / store unit 130 can transmit the input activation Act_In from the on-chip buffer 140 to the activation buffer 120. The activation buffer 120 can temporarily store the input activation Act_In. 【0169】 Referring again to Figure 18, the process unit receives the input activation from the activation buffer and generates the output activation (S1220). The activation buffer then stores the output activation (S1230). 【0170】 Referring again to Figure 17, it is confirmed that the main load / store unit is not operating (S1300), and the hidden load / store unit loads the second data for the second operation (S1400). 【0171】 Steps S1300 and S1400 may be executed in parallel with step S1200. 【0172】 Referring again to Figure 17, the second operation is performed using the second data (S1500). 【0173】 Specifically, referring to Figure 8, the second task execution EXEC2 may also be dependent on the second prefetch PreFetch2, because it requires the data to be loaded first. In this embodiment, the artificial intelligence core system executes the second prefetch PreFetch2 during the first task execution EXEC1 to retrieve data corresponding to the second task execution EXEC2 in advance, so that the second task execution EXEC2 can start immediately after the first task execution EXEC1 is completed. 【0174】 The above description is merely illustrative of the technical concept of this embodiment, and any person with ordinary skill in the art to which this embodiment belongs can make various modifications and variations without departing from the essential characteristics of this embodiment. Therefore, this embodiment is for illustrative purposes only, not to limit the technical concept of this embodiment, and the scope of the technical concept of this embodiment is not limited by such embodiment. 
The scope of protection of this embodiment should be interpreted by the following claims, and all technical concepts within an equivalent scope should be interpreted as being included in the scope of rights of this embodiment.
Claims
[Claim 1] A process unit that receives input activations and weights, and generates output activations through a two-dimensional matrix operation, An artificial intelligence core comprising a load / store unit that performs load / store operations, which include transferring program and input data received via an external interface to an on-chip buffer and transferring output data from the on-chip buffer to the external interface, wherein the load / store operations include a main load / store operation for currently running operations performed by the process unit and a standby load / store operation for standby operations performed by the process unit after the currently running operations. [Claim 2] An activation buffer provides the input activation to the process unit, receives the output activation from the process unit, and temporarily stores the input activation and the output activation. The process unit temporarily stores the program and input data for the process unit to perform calculations and transmits them to the process unit, temporarily stores the output data received from the process unit, and the input data is stored in an on-chip buffer including the input activation and the weighted value. The system includes an activation load / store unit that transmits the input activation from the on-chip buffer to the activation buffer and transmits the output activation from the activation buffer to the on-chip buffer. The artificial intelligence core according to claim 1. [Claim 3] The standby load / store operation is performed using bandwidth from the external interface that is not used by the main load / store operation. The artificial intelligence core according to claim 1. 
[Claim 4] The aforementioned load / store unit is A main load / store unit that performs the main load / store operation and transmits the first load data and first store data to the on-chip buffer, Includes a hidden load / store unit that performs the standby load / store operation and transmits the second load data and the second store data to the on-chip buffer, The artificial intelligence core according to claim 1. [Claim 5] The aforementioned hidden load / store unit is A hidden load unit that fetches a standby load command received from the task controller and executes the issuance of a standby load command, A hidden store unit that fetches a standby store command received from the task controller and executes a standby store command, A hidden load buffer that sequentially receives memory access requests corresponding to the load instruction from the hidden load unit, A hidden store buffer that sequentially receives memory access requests corresponding to the store instruction from the hidden store unit, A hidden load engine that receives a memory access request from the hidden load buffer and transmits the second load data to the on-chip buffer, A hidden store engine that receives a memory access request from the hidden store buffer and transmits the second store data to the on-chip buffer, The artificial intelligence core according to claim 4. [Claim 6] The load / store unit further includes a translation index buffer that stores a translation table of recently used virtual memory addresses and physical memory addresses. The artificial intelligence core according to claim 5. 
[Claim 7] The aforementioned main load / store unit is A load unit that fetches load instructions and executes load instruction issuance, A store unit that fetches store instructions and executes store instruction issuance, A load buffer that sequentially receives memory access requests from the aforementioned load unit, A store buffer that sequentially receives memory access requests from the aforementioned store unit, A load engine that receives a memory access request from the load buffer and transmits the first load data to the on-chip buffer, Includes a store engine that receives a memory access request from the store buffer and transmits the first store data to the on-chip buffer, The artificial intelligence core according to claim 4. [Claim 8] The first load data has a higher priority than the second load data. The first store data has a higher priority than the second store data. The artificial intelligence core according to claim 4. [Claim 9] The aforementioned priority order is determined by the first load data and the second load data, and the first store data and the second store data being tagged. The artificial intelligence core according to claim 8. [Claim 10] The aforementioned priority is tagged by the load engine or store engine. The artificial intelligence core according to claim 9. [Claim 11] The load / store unit further includes an arbiter that receives the first load data and the second load data, and the first store data and the second store data, and transmits them to the on-chip buffer in a round-robin manner. The artificial intelligence core according to claim 4. [Claim 12] The aforementioned on-chip buffer includes multiple banks, The value obtained by dividing the number of inputs of the first load data, the second load data, the first store data, and the second store data per unit clock cycle by the number of banks of the on-chip buffer is smaller than the reference input / output ratio of the arbiter. 
The reference input-output ratio is the largest input-output ratio value within the range where no waiting time occurs for the first load data, the second load data, the first store data, and the second store data, respectively, by the arbiter. The artificial intelligence core according to claim 11. [Claim 13] The hidden load / store unit and the main load / store unit share at least some hardware with each other. The artificial intelligence core according to claim 4. [Claim 14] The hidden load / store unit and the main load / store unit are implemented using different hardware. The artificial intelligence core according to claim 4. [Claim 15] The aforementioned process unit is A processing element array (PE array) performs a two-dimensional matrix operation that sequentially multiplies the input activation and the weighted value to generate the output activation, Includes a vector unit that performs one-dimensional operations, The artificial intelligence core according to claim 1. [Claim 16] The aforementioned external interface includes one of the following: a data bus, an external chip interface, or a local bus. The artificial intelligence core according to claim 1. [Claim 17] A program for performing calculations and memory for storing input data, A bus that transmits the input data and control signals from the memory, The program, the input data and the control signals, and the artificial intelligence core that receives these signals, performs a two-dimensional matrix operation, and generates output data, are included. 
The artificial intelligence core includes: a load / store unit that loads the program and the input data from the memory and stores the output data in the memory; a process unit that performs calculations using the program and the input data; and an on-chip buffer that temporarily stores the program, the input data, and the output data for the process unit and the load / store unit. The bus includes a control bus for transmitting the control signals and a data bus for transmitting the input data and the output data. The load / store unit performs a main load / store operation for the currently running operation that the process unit is performing, and a standby load / store operation for a standby operation that the process unit will perform after the currently running operation, wherein the standby load / store operation is performed using bandwidth of the data bus that is not used by the main load / store operation. An artificial intelligence core system.
[Claim 18] The memory includes: an on-chip memory formed within the same chip as the artificial intelligence core; and an off-chip memory formed separately from the artificial intelligence core. The artificial intelligence core system according to claim 17.
[Claim 19] The artificial intelligence core is a first artificial intelligence core. The system further includes a second artificial intelligence core that is different from the first artificial intelligence core. The bus further includes a local bus for transmitting the input data and the output data between the first artificial intelligence core and the second artificial intelligence core. The load / store unit performs the standby load / store operation using bandwidth of the local bus that is not used by the main load / store operation. The artificial intelligence core system according to claim 17.
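For illustration only, the bandwidth sharing recited in claims 17 and 19 can be sketched as a per-cycle allocation in which the main load / store operation is served first and the standby load / store operation receives only what is left over. This is a minimal sketch and not the claimed hardware: the function name `share_bus`, the notion of "transfers per cycle", and all numbers are assumptions introduced here.

```python
def share_bus(capacity, main_requests, standby_backlog):
    """Per-cycle bus grants: main traffic first, standby fills only the rest.

    capacity        -- hypothetical bus capacity in transfers per cycle
    main_requests   -- transfers the main load / store operation wants each cycle
    standby_backlog -- total transfers the standby operation still needs
    Returns the list of (main_granted, standby_granted) per cycle and the
    standby backlog remaining after the last cycle.
    Simplification: main demand above capacity is clipped, not carried over.
    """
    grants = []
    for wanted in main_requests:
        main_granted = min(wanted, capacity)   # main is never delayed by standby
        spare = capacity - main_granted        # bandwidth the main operation left unused
        standby_granted = min(spare, standby_backlog)
        standby_backlog -= standby_granted
        grants.append((main_granted, standby_granted))
    return grants, standby_backlog


# Assumed example: a bus moving 4 transfers per cycle and a 6-transfer standby load.
grants, left = share_bus(capacity=4, main_requests=[4, 1, 0, 3], standby_backlog=6)
print(grants, left)
```

In this assumed example the standby load makes no progress in the first cycle, when the main operation saturates the bus, and completes during the two lightly loaded cycles that follow.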
[Claim 20] The load / store unit includes: a main load / store unit that performs the main load / store operation; and a hidden load / store unit that performs the standby load / store operation. The standby load / store operation has a lower priority than the main load / store operation. The artificial intelligence core system according to claim 17.
[Claim 21] The priority is identified in a tagged form. The artificial intelligence core system according to claim 20.
[Claim 22] The artificial intelligence core further includes: an activation buffer that provides an input activation to the process unit and receives an output activation from the process unit; and an activation load / store unit that retrieves the input activation from the on-chip buffer and transmits it to the activation buffer, and transmits the output activation from the activation buffer to the on-chip buffer. The artificial intelligence core system according to claim 17.
[Claim 23] The main load / store unit loads a first program for a first operation. The first operation is performed using the first program. When the main load / store unit is not operating during the first operation, the hidden load / store unit loads a second program for a second operation that is waiting to be executed after the first operation. When the first operation and the loading of the second program are completed, the second operation is performed using the second program. The loading of the second program and the performing of the second operation use bandwidth of the data bus that is not used by the first operation.
The loading of the second program for the second operation by the hidden load / store unit includes: fetching a standby load instruction for the second program; issuing the fetched standby load instruction; transmitting a memory access request corresponding to the issued standby load instruction to a hidden load buffer; sequentially transmitting, by the hidden load buffer, the memory access request to a load engine; receiving, by the load engine in response to the memory access request, second load data from an off-chip memory via the data bus; and transmitting the second load data to an on-chip buffer. A load / store method for an artificial intelligence core system.
[Claim 24] Loading the first program includes: fetching a load instruction for the first program; issuing the fetched load instruction; transmitting a memory access request corresponding to the issued load instruction to a load buffer; sequentially transmitting, by the load buffer, the memory access request to the load engine; receiving, by the load engine in response to the memory access request, first load data from the off-chip memory via the data bus; and transmitting the first load data to the on-chip buffer. The load / store method for an artificial intelligence core system according to claim 23.
[Claim 25] The first load data has a higher priority than the second load data. The load / store method for an artificial intelligence core system according to claim 24.
[Claim 26] The main load / store unit loads first data for a first operation. The first operation is performed using the first data. When it is confirmed that the main load / store unit is not operating during the first operation, the hidden load / store unit loads second data for a second operation that is waiting to be executed after the first operation.
When the first operation and the loading of the second data are completed, the second operation is performed using the second data. The loading of the second data and the performing of the second operation use bandwidth of the data bus that is not used by the first operation. The loading of the second data for the second operation by the hidden load / store unit includes: fetching a standby load instruction for the second data; issuing the fetched standby load instruction; transmitting a memory access request corresponding to the issued standby load instruction to a hidden load buffer; sequentially transmitting, by the hidden load buffer, the memory access request to a load engine; receiving, by the load engine in response to the memory access request, second load data from an off-chip memory via the data bus; and transmitting the second load data to an on-chip buffer. A load / store method for an artificial intelligence core system.
[Claim 27] The first operation is a matrix operation in a first layer of a neural network. The second operation is a matrix operation in a second layer of the neural network. The second data is kernel data of the second layer. The load / store method for an artificial intelligence core system according to claim 26.
[Claim 28] The first data includes an input activation. Performing the first operation includes: storing the input activation in an activation buffer; receiving, by the process unit, the input activation from the activation buffer and generating an output activation; and storing the output activation in the activation buffer. The load / store method for an artificial intelligence core system according to claim 26.
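For illustration only, the effect of the method of claims 26 and 27 can be sketched as a simple timeline in which the kernel data of the next layer is loaded while the current layer's matrix operation runs. This is a rough model under stated assumptions, not the claimed method itself: the function `run_layers`, the cycle counts, and the assumption that a standby load started alongside the current operation takes the same number of cycles as a normal load are all hypothetical.

```python
def run_layers(load_cycles, compute_cycles, prefetch):
    """Return the cycle at which the last layer finishes.

    load_cycles[i]    -- hypothetical cycles to load layer i's kernel data
    compute_cycles[i] -- hypothetical cycles for layer i's matrix operation
    prefetch          -- if True, layer i+1's kernel is loaded while layer i computes
    """
    now = load_cycles[0]  # the first layer's data is always loaded up front
    for i, compute in enumerate(compute_cycles):
        compute_end = now + compute
        if i + 1 < len(load_cycles):
            if prefetch:
                # Standby load starts together with the current operation;
                # the next operation waits for whichever finishes last.
                load_end = now + load_cycles[i + 1]
                now = max(compute_end, load_end)
            else:
                # Load starts only after the current operation has finished.
                now = compute_end + load_cycles[i + 1]
        else:
            now = compute_end
    return now


# Assumed three-layer example.
loads = [3, 2, 4]
computes = [5, 3, 6]
serial = run_layers(loads, computes, prefetch=False)
overlapped = run_layers(loads, computes, prefetch=True)
print(serial, overlapped)
```

With these assumed numbers the overlapped schedule is shorter than the serial one because the loads for the second and third layers are mostly hidden behind the preceding layer's computation; only the portion of a load that outlasts the computation adds to the total.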