Silicon brain
The silicon chip design addresses the von Neumann bottleneck by integrating non-volatile memory cells in NAND strings for digital neural network operations, reducing power consumption and computation time in neural networks.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- WATANABE HIROSHI
- Filing Date
- 2025-11-30
- Publication Date
- 2026-06-18
AI Technical Summary
Current semiconductor memory technologies struggle both to increase integration density and to reduce power consumption. The von Neumann bottleneck and the inefficiency of volatile memory limit computation speed and scalability in neural networks.
A silicon chip design that integrates a thread arithmetic unit with non-volatile memory cells arranged in NAND strings, performing digital operations to simulate neural network functions without analog signals, reducing the need for external memory access and analog-to-digital conversions.
This approach significantly reduces power consumption and computation time by minimizing data transfers between the arithmetic unit and main memory, enabling efficient processing of neural networks with reduced power demands.
Abstract
Description
Silicon Brain 【0001】 This invention relates to a technology for integrating a neural network onto a silicon chip (IC chip). 【0002】 Conventional semiconductor-based computing methods involve the coordinated operation of a memory device and an arithmetic processing unit (CPU or other processor). A memory device (semiconductor memory) consists of a collection of memory elements called memory cells (memory elements, bit cells, or simply cells or elements) (array, cell array, memory cell array, or memory element array). Each element consists of at least a source, drain, gate, or control gate. The drain can be connected to a bit line, the source can be connected to a source line, and the gate can be connected to a word line. These connections are generally made through contacts (terminals). For example, word line contacts (terminals), bit line contacts (terminals), or source line contacts (terminals). When such a collection of elements is distributed on a two-dimensional plane, access to each memory element is made by word lines (WL) and bit lines (BL) located in the X and Y directions of the two-dimensional plane, where the angle between them is greater than zero. For example, the address of a memory element located at the intersection of the A-th word line and the B-th bit line is (A,B). This is called the memory element address. However, A is specifically referred to as the address on the X-axis (X address), and B is specifically referred to as the address on the Y-axis (Y address). 【0003】 For a long time, semiconductor memory technology development has been dominated by integrating more memory elements onto the surface of a silicon wafer using semiconductor manufacturing processes in accordance with Moore's Law (see Non-Patent Literature 1). 
However, in recent years (since 2015), it has become difficult to increase the integration density of memory elements in a two-dimensional plane, and methods of arranging memory elements in three-dimensional space have become mainstream even at the mass production level. In this case, the address can be represented as (A, B, C), where C is the address on the Z axis (Z address), and the Z axis makes an angle greater than zero with the XY plane. 【0004】 However, whether two-dimensional or three-dimensional, the current information recording method of semiconductor memory devices is based on memory elements. When each memory element (cell) takes one of two values, 0 or 1, it is said to have a memory capacity (amount of information that can be stored) of 1 bit per cell. If there are two such memory elements, the memory capacity is 2 bits. In this case, there are four combinations of 0 and 1: (00), (01), (10), and (11). The number of cases is 2 to the power of 2. If the cell array is composed of N memory elements (where 1 cell is 1 bit), the memory capacity of the cell array is N bits, and the number of cases is 2 to the power of N. If the cell array is composed of N memory elements (where 1 cell is 2 bits), the memory capacity of the cell array is 2N bits, and the number of cases is 2 to the power of 2N. 【0005】 Therefore, the amount of information (number of bits) of conventional semiconductor devices is expressed as the base-2 logarithm of the number of cases. Even if so-called multi-value technology is used, the base of the logarithm only becomes 4 or 8, and such a logarithm can always be converted to a base-2 logarithm. Therefore, even if multi-value technology is used, information is still described in bits. 【0006】 In contrast, the human brain is not composed of memory elements. 
If anything corresponds to a memory element, it would be the cell body that constitutes a part of a nerve cell, but the cell body does not store information of 0 or 1. 【0007】 As briefly shown in FIG. 1, a nerve cell (neuron) generally consists of three parts: one cell body (soma), a plurality of (for example, dozens of) dendrites, and one axon. The cell body can receive external inputs from these multiple dendrites. The axon generally extends longer than the dendrites, and its tip further branches into dozens to hundreds of ends. The tips of these branched axons are called axon terminals (or axon endings). 【0008】 As briefly shown in Figure 2, an axon terminal approaches one of the dendrites of another cell body and forms a junction. This junction is called a synapse. 【0009】 Consider two cell bodies, A and B. Cell body A receives multiple inputs x(n) from the outside through multiple dendrites (n), where n is an integer from 1 to N. N can range from several hundred to tens of thousands. Cell body A assigns a weight W(n) to each input x(n). The signal obtained by summing the weighted inputs is called SUM. SUM is transmitted through the axon to one of the axon terminals. The number of axon terminals is approximately 10 to several hundred per nerve cell. When SUM exceeds a certain threshold (threshold of excitation), the nerve cell generates an action potential, which drives the synapse and transmits neurotransmitters from cell body A to cell body B. 【0010】 This threshold changes as the signal is repeatedly transmitted. In other words, repeated learning from experience can strengthen, weaken, or shift synaptic connections. Strengthening of synaptic connections can be explained by a decrease in the threshold. Breaking of synaptic connections can be explained by an increase in the threshold. Synaptic shifts can be explained by a change in the distribution of synaptic thresholds. 【0011】 Figure 3 shows a model of this. 
When neurotransmitters are transmitted, the output y is set to 1 (y=1), and otherwise y=0. This model is called a perceptron. It is widely used in deep learning and machine learning. 【0012】 There are mainly two ways to implement a perceptron on a computer. 【0013】 The traditional method represents the input x(n), synaptic weights w(n), sum, threshold, and output y all as bit information; in other words, it's a computer program. 【0014】This method places a heavy load on computers, which is a significant problem. There is a greater need than ever for improved processing speed and reduced power consumption. Deep learning and machine learning require the instantaneous processing of massive amounts of data, and if computations that place a heavy load on the system flood the world, the power consumption of data centers will increase exponentially, making it practically impossible to operate them. Furthermore, there are growing concerns that this could accelerate global warming. (See Non-Patent Literature 2) 【0015】 The main cause of the computation speed limit is excessive data communication between the arithmetic unit and main memory. While the arithmetic unit can still be made faster, the communication speed of the data bus between the arithmetic unit and main memory has plateaued. This is called the von Neumann bottleneck (or memory bus problem). 【0016】 The main reason for the increase in power consumption is that the currently dominant main memory is a volatile memory called dynamic random access memory (DRAM). As a result, the power consumption due to refreshing the recorded data has become significant and cannot be ignored. 【0017】 A recent trend is to directly replicate the perceptron within a semiconductor chip to avoid the von Neumann bottleneck and simultaneously reduce power consumption. However, the neural network of the human brain is generally designed to generate synapses between two unspecified nerve cells. 
In other words, while current semiconductor technology makes it possible to place perceptrons at precisely defined addresses on a two-dimensional plane or three-dimensional space, it is not easy to replicate synapses between arbitrary nerve cells or to freely rearrange them according to learning. 【0018】 Generally, a neural network refers to a network of nerve circuits. Originally, it refers to a network with countless synapses, as shown in Figure 2, and is something that actually exists in living organisms. The perceptron in Figure 3 is a model that includes both the neuron (Figure 1) and synapses. 【0019】If we extract the portion enclosed by the dotted line from the perceptron in Figure 3, we get Figure 4. Here, let N be the number of dendrites, x(p) be the input, and w(p) be the synaptic weights. Here, p is an integer from 1 to N and refers to each individual dendrite. In Figure 3, N=4 is used as an example. The sum of the products of the input x(p) and synaptic weight w(p) from 1 to N is SUM in the perceptron in Figure 3, and t(1,1) in Figure 4. 【0020】 In Figure 3, the part not enclosed by the dotted line, i.e., the part related to the action potential, is missing in Figure 4. Nevertheless, in Figure 5, t(2,1) and t(3,1) are added to the second column, and t(1,2) is added to the third column. The arrow from the first column to the first row of the second column means that the input x(p) is multiplied by the weight w(p,1,1) and added together in the first row of the second column. This means that the scalar product (sum of products) of the row vector with x(p) as an element and the column vector with w(p,1,1) as an element is calculated. The result of that calculation is t(1,1). 【0021】 The reason there are three weight arguments is that we added 1 to represent the transition from the first column to the second column, and another 1 to represent the summation in the first row of the second column. In other words, the weight from x(p) to t(1,1) is w(p,1,1). 
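The sum of products described in paragraphs 0019 to 0021, together with the thresholded output y of paragraph 0011, can be sketched in Python as follows. This is only an illustration of the arithmetic, not the circuit of the invention. The function names `thread` and `perceptron`, and the example inputs, weights, and threshold, are assumptions introduced here and do not appear in the original.

```python
def thread(inputs, weights):
    """Scalar product (sum of products) of an input vector and a weight vector.

    Corresponds to t(1,1) = x(1)*w(1,1,1) + ... + x(N)*w(N,1,1) in the text.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


def perceptron(inputs, weights, threshold):
    """Return y = 1 when the sum exceeds the threshold, otherwise y = 0."""
    return 1 if thread(inputs, weights) > threshold else 0


# Hypothetical example with N = 4 inputs, as in Figure 3.
x = [1, 0, 1, 1]
w = [2, 5, 3, 1]
print(thread(x, w))                    # sum of products
print(perceptron(x, w, threshold=5))   # fires only if the sum exceeds 5
```

Keeping `thread` separate from `perceptron` mirrors the text, where Figure 4 keeps only the sum-of-products part of the perceptron in Figure 3.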
【0022】 In Figure 5, the calculation of the input from the second column to the first row of the third column is performed. That is, the arrow from the second column to the first row of the third column means that the input t(q,1) is multiplied by the weight w(q,1,2) and added together in the first row of the third column. This means that the scalar product (sum of products) of the row vector with t(q,1) as an element and the column vector with w(q,1,2) as an element is calculated. The result of that calculation is t(1,2). 【0023】 In Figure 6, t(2,2) and t(3,2) are added to the third column, and y is added to the fourth column. The arrow from the third column to the first row of the fourth column (i.e., y) means that the input t(r,2) is multiplied by the weight w(r,1,3) and added together. This means that the scalar product (sum of products) is calculated between the row vector whose element is t(r,2) and the column vector whose element is w(r,1,3). The result of this calculation is y. 【0024】 In Figure 7, we further consider the arrow from column 1, x(p), to column 2, row 2. Although not shown in the figure, the newly added arrow is assigned a weight w(p,2,1). 【0025】 In Figure 8, we also take into account the arrow from column 1, x(p), to column 2, row 3. Although not shown in the figure, the newly added arrow is assigned a weight w(p,3,1). 【0026】 In Figure 9, we also take into account the arrow from column 2, t(q,1), to column 3, row 2. Although not shown in the figure, the newly added arrow is assigned a weight w(q,2,2). 【0027】 In Figure 10, we also take into account the arrow from column 2, t(q,1), to column 3, row 3. Although not shown in the figure, the newly added arrow is assigned a weight w(q,3,2). Thus, as we move to the right in the figure, the column number (k) increases, and at the same time the depth of the learning layers increases. In other words, the column number (k) is related to the depth of the learning layer. 
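The column-by-column calculation built up in paragraphs 0022 to 0027 can be sketched as a short Python routine. This is a minimal sketch under stated assumptions: the nested-list layout `weights[k][j][p]` (transition k, output row j, input p, all zero-based) is chosen here for illustration and stands in for w(p,j,k) in the text, the all-ones weights are placeholder values, and the 4-3-3-1 column sizes are inferred from the Figure 10 example rather than stated in one place.

```python
def layer(inputs, weight_rows):
    """Compute every row of the next column: one sum of products per row."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weight_rows]


def forward(x, weights):
    """Propagate the input column through each transition in turn."""
    t = list(x)
    for weight_rows in weights:
        t = layer(t, weight_rows)
    return t


# Placeholder weights for assumed column sizes 4 -> 3 -> 3 -> 1.
weights = [
    [[1] * 4 for _ in range(3)],  # column 1 (inputs) to column 2
    [[1] * 3 for _ in range(3)],  # column 2 to column 3
    [[1] * 3 for _ in range(1)],  # column 3 to column 4 (output y)
]

y = forward([1, 2, 3, 4], weights)
print(y)  # a one-element list holding the output y
```

Each call to `layer` performs as many independent sums of products as there are rows in the next column, which is why the text later treats these operations as candidates for parallel processing.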
【0028】In neural networks, commonly used in the field of artificial intelligence, the first column in Figure 10 is called the input layer. The last column in Figure 10 (for example, the fourth column) is called the output layer. Multiple layers between the input and output layers (for example, the second and third columns) are called hidden layers. When there are multiple hidden layers (number of layers, or number of hidden layers), it is called deep learning. It is believed that the more hidden layers there are, the higher the performance of deep learning. An epoch is the period from the input layer to calculating the output y. The weights are readjusted and the output y is recalculated from the same input until the output y reaches the desired value. Various feedback methods are employed at this time to bring the output y closer to the desired value. The learning process is completed when the desired output y is finally obtained after repeating epochs. Therefore, the learning time is the product of the time to process one epoch and the number of epochs spent until learning is complete. The time to process one epoch is determined by the chip's ability to parallel process countless thread operations (multiply-accumulate operations). 【0029】 The left diagram in Figure 11 is similar to Figure 4. The arrow from the kth column to the (k+1)th column, row j, represents the input t(p,k) multiplied by the weight w(p,j,k+1) to obtain the (k+1)th column, row j. This means calculating the scalar product (thread operation, or sum-of-products operation) of the row vector (input vector) whose elements are t(p,k) and the column vector (weight vector) whose elements are w(p,j,k+1). The result of this calculation is the thread output or sum-of-products, which in this case is the thread t(j,k+1). 
This calculation result can be provided to an external AI chip that primarily performs the sum-of-products operations necessary for neural network computation, and can be stored in main memory or storage. In other words, the column number k represents the layer number of the hidden layer. However, when k=0, it is the input layer, and when k=M, it is the output layer. The number j is the element number of the output of the (k+1)-th layer (output element number). The number p is the element number of the input in the k-th layer (input element number). 【0030】 If we denote the thread arithmetic unit that computes t(j, k+1) as T(j, k), it can be considered as the element in the j-th row and k-th column of the thread matrix. That is, T(j, k) is a unit (or circuit) that computes the thread in the j-th row of the (k+1)-th layer. In the right diagram of Figure 11, the thread arithmetic unit T(j, k) is positioned on the semiconductor chip at a location determined by j, given k. In other words, there is a relationship between j and its position (or placement) on the semiconductor chip. In the example on the right of Figure 11, positions are assigned in order of j from the top left toward the right of the chip, and when the edge is reached, the assignment moves down one row and continues from left to right. In general, the row number j in the output of the (k+1)-th column is an integer from 1 to L(k+1). However, the number of elements (rows) L in each column (each layer) is different for each k, so it is an integer function L(k) with k as an argument. In other words, L(k+1) threads t(j, k+1) can be input to the (k+1)-th layer thread arithmetic unit T(j', k+1) and output to row j' and column (k+2), thereby computing thread t(j', k+2). Here, j' is an integer from 1 to L(k+2), and k is an integer from 0 to M. In this case, the number of elements in the weight w in Figure 11 is the product of the number of rows in the input layer and the number of rows in the output, resulting in L(k+1)L(k+2). 
That is, it is the product of L(k+1) and L(k+2). In the example in Figure 10, M=3, L(0)=4, L(1)=L(2)=3, and L(M)=1, so the layer-to-layer weight counts are L(0)L(1)=12, L(1)L(2)=9, and L(2)L(3)=3. 【0031】 Let's return to Figure 10. The hidden layer is made up of a superposition of multiple thread operations. The total number of hidden layers is given by M-1. In Figure 10, for example, M=3, so there are two hidden layers (columns). When calculating t(j, k+1), L(k) is the number of input elements to the thread arithmetic unit. Thus, the number of weight elements w transferred from the k-th column to the (k+1)-th column is L(k)L(k+1), i.e., the product of L(k) and L(k+1). Since the integer k runs from 0 to M-1, the total number of weight elements w required to calculate the output y is L(0)L(1) + L(1)L(2) + ... + L(M-1)L(M), that is, the sum of the products L(k)L(k+1). Since the number of outputs y is 1, L(M)=1. In the example in Figure 10, the total is 12 + 9 + 3 = 24. If all weight elements were to be stored in main memory, a very large bit capacity would be required for complex deep learning. 【0032】 To calculate the output y, first, L(1) threads, t(1, 1), t(2, 1)… t(L(1), 1), are calculated (thread operations) from L(0) inputs. Next, these are used as input to calculate (thread operations) L(2) more threads, t(1, 2), t(2, 2)… t(L(2), 2), and so on. This is repeated from k=0 to M-1 to calculate the output y. The calculation of the output y is essentially the same as the calculation of threads. Thus, the number of thread operations required to calculate the output y in deep learning is L(1) + L(2) + … + L(M-1) + L(M), where L(M) = 1. In the example in Figure 10 (L(1) = 3, L(2) = 3, L(3) = 1, M=3), a total of 7 threads are required (3 + 3 + 1). 【0033】 Next, let's discuss Graphics Processing Units (GPUs), which are commonly used in image analysis. 【0034】 Generally, a GPU consists of multiple cores. For example, it might be composed of 1000 cores, each with 512 bits. Thus, it has far more cores than a typical CPU, which usually has around 10. 
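As a small check on the thread count given in paragraph 0032, the sum L(1) + L(2) + … + L(M) can be written as a one-line Python function. The name `thread_count` and the list representation of the layer sizes are assumptions made for this sketch. The first entry, L(0), does not affect the count, and its value of 4 is inferred from the Figure 3 example rather than stated in paragraph 0032.

```python
def thread_count(layer_sizes):
    """Number of thread operations needed to compute the output.

    layer_sizes is [L(0), L(1), ..., L(M)]; every element of every column
    after the input column requires one thread operation.
    """
    return sum(layer_sizes[1:])


# Figure 10 example: L(1) = 3, L(2) = 3, L(3) = 1.
print(thread_count([4, 3, 3, 1]))
```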
【0035】 A GPU is a processor chip originally designed for image processing, and because it has a large number of cores, it is better suited than a CPU (Central Processing Unit) to processing multiple rotation matrices in parallel. The more complex the 3D image being processed, the more rotation matrices need to be processed in parallel. Calculating a 3D rotation involves calculating the product of a 3x3 rotation matrix and a 3D column vector (the vector before rotation). The result is a 3D column vector (the vector after rotation). 【0036】 The first component of the vector after rotation is the scalar product (thread operation, or sum-of-products operation) of the first row of the rotation matrix and the vector before rotation. In other words, it can be calculated using the thread T(a, b) mentioned above. For example, if N=3, the following calculation result is obtained: T(a, b) = t(1, b) × w(1, a, b+1) + t(2, b) × w(2, a, b+1) + t(3, b) × w(3, a, b+1). This calculation result is provided outside the chip and can be saved to main memory or storage as needed. 【0037】 The second component of the vector after rotation is the scalar product (thread operation, or sum-of-products operation) of the second row of the rotation matrix and the vector before rotation. In other words, it can be calculated using the thread T(c, d) described above. As an example, if N=3, the following calculation result is obtained: T(c, d) = t(1, d) × w(1, c, d+1) + t(2, d) × w(2, c, d+1) + t(3, d) × w(3, c, d+1). This calculation result is provided outside the chip and can be stored in main memory or storage. 【0038】 The third component of the vector after rotation is the scalar product (thread operation, or sum-of-products operation) of the third row of the rotation matrix and the vector before rotation. In other words, it can be calculated using the thread T(e, f) mentioned above. 
For example, if N=3, the following calculation result is obtained: T(e, f) = t(1, f) × w(1, e, f+1) + t(2, f) × w(2, e, f+1) + t(3, f) × w(3, e, f+1). This calculation result is provided outside the chip and can be stored in main memory or storage. 【0039】 Thus, calculating a single 3D rotation matrix requires three threads (multiply-accumulate). In short, both deep learning and the analysis of complex 3D shapes (innumerable 3D rotation matrices) can be performed by parallel processing with countless threads (multiply-accumulate). In other words, the computational processing required for image analysis and deep learning is almost identical. This computational similarity between image analysis and deep learning is why GPUs are widely used in deep learning. For example, one core can be assigned to the calculations related to one thread. Alternatively, two or three threads can be combined and assigned to a single core. In any case, since GPUs have a significantly larger number of cores than CPUs, they are more suitable processors for deep learning than CPUs. 【0040】 However, GPUs consume a very large amount of power, ranging from 500W to about 1kW. In AI servers that utilize countless GPUs, the increase in power consumption is a problem. One solution is to dramatically improve computing speed. For example, even if the power consumption per chip triples, if the processing speed of the chip increases 100 times, the power consumption can be reduced to 3%. In other words, it is expected that power consumption can be reduced by as much as 97%. However, a 100-fold increase in processing speed means that the AI can be used 100 times more at the same cost. In other words, humans have an inherent desire, and ultimately they end up using the AI 100 times more, causing power consumption to triple. Therefore, this method will ultimately not be able to suppress the power consumption of AI servers. 【0041】 Another method is to reduce the amount of information handled. 
GPUs handle not only pixel coordinates but also subtle nuances of color and brightness, so they handle fixed-length data of 8, 16, and 32 bits as well as floating-point numbers. However, if only the thread operations specific to deep learning are performed, it is sufficient to handle only fixed-length 8-bit data. In other words, the amount of information handled to perform the same thread operations can be significantly reduced. The Neural Network Processing Unit (NPU) was developed with this concept in mind. The power consumption of an NPU is about one-tenth that of a GPU. In short, reducing the amount of information handled in thread operations is an effective way to reduce power consumption. 【0042】 Another method for reducing the power consumption of AI chips is to mimic the workings of the human brain to some extent. ICs developed with this concept are specifically called neuromorphic ICs (see Non-Patent Literature 3). One concrete method for realizing this is computing-in-memory. This approach often involves analog operation of memory cells. However, even if analog operation is performed inside the chip, some or all of the calculation results must always be converted to digital signals when they are output to the outside. Generally, AI places high demands on the throughput of such analog-to-digital conversion, and the dilemma is that power consumption ultimately increases in order to meet these demands. 【0043】 This invention was made in view of the above circumstances and provides a method for reducing the power consumption of an information processing system using a neural network. 【0044】 To solve the above problems, the present invention employs the following means. 
The solution proposed by the present invention includes an SB block comprising a thread arithmetic unit, the thread arithmetic unit returns a thread output to an external input, the external input consists of input elements from the 0th to the (N-1)th, the input elements from the 0th to the (N-1)th are each represented in m-element binary, the thread arithmetic unit consists of a NAND string from the nth column to the (n+m-1)th column, a bit line from the nth column to the (n+m-1)th column, and a word line from the sth row to the (s+N-1)th row for any integer n, the NAND string from the nth column to the (n+m-1)th column consists of memory cells from the sth row to the (s+N-1)th row, the data of the jth weight element is stored as 0 or 1 in the memory cell of the jth row in the NAND string from the nth column to the (n+m-1)th column, where j is an integer between 0 and N-1. From the word lines of the (s+N-1)th row from the sth row, the word line of the (s+j)th row is selected and a read voltage is applied. From the input elements of the (N-1)th row from the 0th row, the jth input element is selected. From the bit lines of the (n+m-1)th column from the nth column, the bit line of the (n+m-r)th column is selected. (r-1) data zeros are attached to the right end of the jth input element, and (m-r) data zeros are attached to the left end of the jth input element, making this the rth input code for the jth input element. The rth input code is input to the bit line of the (n+m-r)th column, which is represented in 2m binary. The (n+m-r)th column bit line inputs the rth input code digit by digit to one end of the (n+m-r)th column NAND string. 
The output from the memory cell in the jth row of the (n+m-r)th column NAND string is taken as the output code for the rth column, the output codes from the 0th to the (m-1)th bit lines are output, the output codes from the 0th to the (m-1)th are added together, and this sum is taken as the output element for the jth input element and the jth weight element. The sth to (s+N-1)th word lines are sequentially selected, the read voltage is sequentially applied, the 0th to (N-1)th output elements are sequentially output, and the 0th to (N-1)th output elements are added together to obtain the thread output. The invention is characterized by these features. 【0045】 Furthermore, the memory cells of the jth row and the (j+1)th row in the NAND string of the (n+m-r)th column are the jth and (j+1)th non-volatile memory cells. The jth and (j+1)th non-volatile memory cells each have first to third terminals. The third terminal of the jth non-volatile memory cell is connected to the word line of the jth row. The second terminal of the jth non-volatile memory cell is connected to the first terminal of the (j+1)th non-volatile memory cell. The jth non-volatile memory cell can store data 0 or data 1. When the stored data is data 1, data 0 is output if data 0 is input from the bit line of the (n+m-r)th column, and data 1 is output if data 1 is input from that bit line. When the stored data is data 0, data 0 is output whether data 0 or data 1 is input from the bit line of the (n+m-r)th column. The invention is further characterized by these features. 【0046】 According to the present invention, by simulating a part of the function of a neural network in a silicon chip (IC chip) without using an analog signal, it is possible to save the power consumed by artificial intelligence. Hereinafter, the best mode for implementing the invention will be specifically described. 
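The digital thread operation of paragraphs 0044 and 0045 can be read as a shift-and-add multiplication built from cells that behave like a logical AND. The Python sketch below follows that reading and should be taken as one interpretation, not as the definitive circuit behavior. It assumes that each weight element is an m-bit value whose bit (r-1) is stored in the cell on the (n+m-r)th column, that each output code is interpreted as an unsigned binary number, and that the padded input code built here has 2m-1 digits (which fits within the 2m digits mentioned in the text). All function names are hypothetical.

```python
def cell_output(stored_bit, input_bit):
    """Paragraph 0045: output is 1 only when stored data and input are both 1."""
    return stored_bit & input_bit


def input_code(x, r, m):
    """r-th input code: (m-r) zeros on the left, the m digits of x, (r-1) zeros on the right."""
    digits = [(x >> i) & 1 for i in reversed(range(m))]  # most significant digit first
    return [0] * (m - r) + digits + [0] * (r - 1)


def output_element(x, w, m):
    """Add the m output codes for one input element x and one weight element w."""
    total = 0
    for r in range(1, m + 1):
        w_bit = (w >> (r - 1)) & 1  # assumed placement of weight bits across columns
        code = [cell_output(w_bit, d) for d in input_code(x, r, m)]
        total += int("".join(str(d) for d in code), 2)
    return total


def thread_output(inputs, weights, m):
    """Select the word lines one by one and add the N output elements."""
    return sum(output_element(x, w, m) for x, w in zip(inputs, weights))


# Hypothetical example with N = 3 elements and m = 3 binary digits each.
print(thread_output([3, 5, 2], [2, 7, 1], m=3))
```

Under these assumptions each output element equals the ordinary product of the input element and the weight element, so the thread output equals the scalar product while every signal stays a 0 or a 1.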
【0047】 First, let's see how power is consumed in the thread operation. (Principle of Product-Sum Operation) 【0048】 The mechanism of the sum-of-products operation (or thread operation) will be described using FIG. 12. First, as an example, three pieces of data (A, B, and C) are prepared. Data A is used as the first input to an adder. Data B is used as the first input to a multiplier. Data C is used as the second input to the multiplier. The output of the multiplier, that is, the product of data B and data C, becomes the second input to the adder. The output of the adder is substituted back into data A. This operation is repeated for the number of input elements (N). 【0049】 FIG. 13 is a drawing showing how data is exchanged between an AI chip (such as a GPU or NPU) and a main memory (such as DRAM). Of course, only digital data is transferred here. Therefore, if analog signals are handled inside the chip, analog-to-digital conversion must be performed once inside the chip. Since a signal that has been processed digitally outside the chip is input back into the AI chip, a conversion between the analog and digital domains must be performed again if analog signals are handled inside the chip. Thus, as the throughput of data processing improves, the power consumption increases. Naturally, when the power consumption is concentrated, the operating speed is also suppressed. Handling analog signals inside the AI chip will degrade the performance of the AI chip in terms of both power consumption and speed. 【0050】 Each time a processor such as an AI chip accesses the main memory, power is consumed through the memory bus. In FIG. 12, the substitution into data A is an overwrite in the main memory. Data A is read (retrieved) from the main memory and used as the first input to the adder. Data B is likewise retrieved from the main memory, and so is data C. 【0051】 Let PO be the power consumption for one overwrite operation to main memory. 
Let PR be the power consumption each time data is retrieved from main memory. In this case, the power consumption for calculating the scalar product of a row vector and a column vector of N elements (thread operation, or sum-of-products operation) is N × (PO + PR + PR + PR). (First Embodiment) 【0052】 What would happen if we could make the input C to the multiplier on-chip, that is, if we could supply it within the AI chip without using main memory? As shown in Figure 14, the number of times data is retrieved from main memory per multiplication is reduced from three to two. Therefore, the power consumption for the scalar product (thread operation, or sum-of-products operation) of a row vector and a column vector with N elements becomes N × (PO + PR + PR). 【0053】 The data C corresponds to the weights w(i, j, k) in the neural network calculation. As there are three integer arguments (i, j, k), it is a three-dimensional matrix. In deep learning, k becomes large, and as the number of inputs and outputs of each thread operation increases, i and j also become large. In other words, a large number of data C elements must be handled. For this reason, the system corresponding to Figure 12 (conventional example) uses HBM (High Bandwidth Memory), which consists of multiple DRAM dies stacked vertically and connected by through-silicon vias (TSVs), as the main memory. 【0054】 Therefore, in order to handle data C on-chip, the AI chip and HBM must be integrated into a single chip. This is a difficult task. 【0055】 In this application, we attempt a different approach. As shown in Figure 15, as the first input, data B consisting of N external elements is acquired from outside the chip (or AI chip). As the second input, N weight elements are acquired from within the chip (or AI chip), that is, on-chip. The thread is computed from these two inputs, and the result is output as data A. Data A can be provided to the outside of the chip (or AI chip). The number of elements in the output is 1, since it is only a scalar product. 
In this case, access to main memory costs PR for each element of data B that is acquired and PS when data A is output. Assuming that PS and PO are equal, the power consumption for the scalar product (thread operation, or sum-of-products operation) of a row vector and a column vector with N elements is N × PR + PO. In this way, it is possible to reduce power consumption without relying on embedded main memory. This represents an example of the effect of the thread computing device (Present MAD) according to this application. 【0056】 In other words, by using this invention, the scalar product (sum-of-products operation) of a row vector and a column vector with N elements can be calculated as a thread operation while saving 2N × PR + (N-1) × PO of power consumption each time it is executed. This is the difference between the conventional N × (PO + 3PR) and N × PR + PO. 【0057】 Therefore, if the same thread operation is repeated K times to execute one epoch, K × (2N × PR + (N-1) × PO) of power consumption can be saved per epoch. 【0058】 Therefore, if Q is the number of epochs required to complete one learning cycle, then Q × K × (2N × PR + (N-1) × PO) of power consumption can be saved per learning cycle. (Second Embodiment) 【0059】 Figure 16 is an example of an equivalent circuit diagram illustrating the first embodiment. By using this circuit, it is possible to enjoy the advantages of Figure 15. However, analog processing is not performed inside the AI chip, including in the circuit of Figure 16. In other words, Figure 16 is an example of an equivalent circuit of a digital thread arithmetic unit. That is, the input, output, and weights are all digital data. 【0060】 This circuit makes it possible to calculate t(j, k+1) from t(1, k), t(2, k), ... t(N-1, k), and t(N, k) in Figure 11, taking into account the weights w(1, j, k+1), w(2, j, k+1), ... w(N, j, k+1) (thread operation, or sum-of-products operation). In other words, this equivalent circuit is for calculating the thread T(j, k) in Figure 11. 
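The power accounting of paragraphs 0051 to 0058 can be checked with a few lines of Python. This is a sketch of the bookkeeping only: the function names are introduced here for illustration, PO and PR are treated as abstract cost units, and the numeric values in the example are arbitrary placeholders rather than measured figures.

```python
def conventional_cost(n, po, pr):
    """Figure 12 case: one overwrite and three retrievals per element."""
    return n * (po + 3 * pr)


def on_chip_weight_cost(n, po, pr):
    """Figure 14 case: weights on-chip, so two retrievals per element."""
    return n * (po + 2 * pr)


def proposed_cost(n, po, pr):
    """Figure 15 case: N retrievals of data B and one output (PS assumed equal to PO)."""
    return n * pr + po


def saving_per_thread(n, po, pr):
    """Difference between the conventional and proposed costs."""
    return conventional_cost(n, po, pr) - proposed_cost(n, po, pr)


def saving_per_learning_cycle(n, po, pr, k, q):
    """K thread operations per epoch, Q epochs per learning cycle."""
    return q * k * saving_per_thread(n, po, pr)


# Placeholder values: N = 4 elements, PO = 2 units, PR = 1 unit.
n, po, pr = 4, 2, 1
print(saving_per_thread(n, po, pr))
print(2 * n * pr + (n - 1) * po)  # closed form from paragraph 0056
```

The two printed values agree, which confirms that the saving stated in paragraph 0056 is the difference between N × (PO + 3PR) and N × PR + PO.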
When calculating this thread, analog processing is not used, and as explained with reference to Figure 15, power consumption can be significantly reduced. 【0061】 Figure 16 shows an example of a circuit that performs multiply-accumulate operations using only digital processing, which is a feature of this invention. The NAND strings connected to the respective bit lines are connected in parallel through a common source line (CSL). 【0062】 For simplicity, only the portion consisting of the four bit lines BL(n) to BL(n+3) is shown, where n is any non-negative integer. The number of word lines is N, which is the number of memory cells connected in series in each NAND string. 【0063】 Figure 17 shows the equivalent circuit of a memory cell constituting a NAND string. The memory cell has a control gate, a charge storage layer, and three terminals: terminal 1, terminal 2, and terminal 3. The charge storage layer consists of a floating gate or a charge trapping layer. In either case, the cell is in state 0 when a certain amount of charge has been stored, and in state 1 otherwise. 【0064】 Terminal 2 of a memory cell in a NAND string is connected to terminal 1 of the adjacent memory cell in the same NAND string. This connection is omitted from the equivalent circuit diagram in Figure 16. As one example, the connection between terminal 1 and terminal 2 can be achieved by sharing a portion of the semiconductor between adjacent memory cells. As another example, it can be achieved by sharing a diffusion layer on the semiconductor surface between adjacent memory cells. As yet another example, it can be achieved by sharing a portion of a conductor between adjacent memory cells. Terminal 3 can be connected to a word line through which an external control gate voltage can be applied. 
Terminal 3 is also omitted from the diagram in Figure 16. In Figure 16, the NAND strings extend vertically and the word lines extend horizontally. 【0065】 Three control gate voltages are mainly used: the write voltage Vpgm, the read voltage Vread, and the pass-through voltage Vpass. 【0066】 When a potential above a certain level is applied to terminal 3, electrons are injected into the charge storage layer from the channel region between terminals 1 and 2. This is called writing. Typically, the potential between terminals 1 and 2 is set to 0V, and a voltage of approximately 12V to 25V is applied as Vpgm. Vpgm may be a pulse of constant voltage or a pulse that increases in steps, and is generally a voltage pulse adjusted to control the number of electrons injected into the charge storage layer. 【0067】 A memory cell into which a certain number of electrons have been written is called a state 0 cell, and the data stored in this memory cell is data 0. Conversely, a memory cell into which that number of electrons has not been written, or from which the electrons have been extracted, is called a state 1 cell, and the data stored in this memory cell is data 1. In state 1, when the read voltage Vread is applied to the control gate from the word line connected to terminal 3, a current of a certain level or more flows between terminals 1 and 2. In state 0, even if the read voltage Vread is applied to the control gate from the word line connected to terminal 3, no current of that level flows between terminals 1 and 2. 【0068】 By setting the control gate voltage to 0V and applying a high voltage Vers to terminals 1 and 2, and to the channel region between terminals 1 and 2, electrons can be extracted from the charge storage layer into the channel region. This is called erasure. Erasure changes a cell that was in state 0 to state 1; a cell that was in state 1 remains in state 1. Generally, Vers is approximately 12V to 25V. 
【0069】 In Figure 16, select transistors are placed at both ends of each NAND string. When the potential applied to the gate (selection gate) of a select transistor is Von, the select transistor is switched on, and when it is Voff, the select transistor is switched off. Generally, Von is a higher potential than Voff. That is, Von is higher than the threshold voltage Vt of the select transistor, and Voff is lower than the threshold voltage Vt of the select transistor. In Figure 16, as an example, Von is applied to the select transistors at the top and bottom ends (both ends of each NAND string). 【0070】 In Figure 16, as an example, Vread is applied to the word line of the first row (row 0). Here, as an example, the row numbering starts from 0. Vpass is applied to the word lines from the second row (row 1) to row (N-1). Vpass is a voltage high enough to switch on a cell regardless of whether the cell state is 1 or 0. However, it is lower than the write voltage Vpgm, and is low enough that applying Vpass does not cause accidental writing. In other words, Vpgm is higher than Vpass, and Vpass is higher than Vread. Both Vpass and Vread are low enough that writing is not considered to occur. 【0071】 The state (data) of the cells in row 0 is 1, 1, 0, 1 from left to right. This is the 0th component (0th weight element) w(0) of the weight vector, represented as a 4-bit binary number. That is, w(0) = (1,1,0,1). Next, the state (data) of the cells in row 1 is 1, 0, 0, 1 from left to right. This is the 1st component (1st weight element) w(1) of the weight vector, represented as a 4-bit binary number. That is, w(1) = (1,0,0,1). Similarly, the state (data) of the cells in row j is 0, 0, 1, 1 from left to right. This is the j-th component (j-th weight element) w(j) of the weight vector, represented as a 4-bit binary number. That is, w(j) = (0,0,1,1). In general, j can be any integer from 0 to N-1. 
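The read biasing described above can be summarized in a small behavioral model. The following Python sketch is a purely logical abstraction assumed for illustration (it does not model voltages or currents, and all names in it are hypothetical): a cell under Vpass always conducts, a cell under Vread conducts only in state 1, and a NAND string conducts only when its select transistors are on and every cell in series conducts.

```python
from enum import Enum


class Gate(Enum):
    VREAD = "Vread"
    VPASS = "Vpass"


def cell_conducts(state, gate):
    """Vpass turns a cell on regardless of state; Vread only if the state is 1."""
    if gate is Gate.VPASS:
        return True
    return state == 1


def string_conducts(states, selected_row, select_gates_on=True):
    """A NAND string conducts only if the select gates and all series cells conduct."""
    if not select_gates_on:
        return False
    return all(
        cell_conducts(s, Gate.VREAD if row == selected_row else Gate.VPASS)
        for row, s in enumerate(states)
    )


# Rows are word lines, columns are BL(n)..BL(n+3); values follow the text.
weights = [
    (1, 1, 0, 1),  # row 0: w(0)
    (1, 0, 0, 1),  # row 1: w(1)
]


def read_row(weights, row):
    """Apply Vread to one row and Vpass to the others; return one bit per bit line."""
    columns = list(zip(*weights))  # cell states along each NAND string
    return tuple(int(string_conducts(col, row)) for col in columns)


print(read_row(weights, 0))
print(read_row(weights, 1))
```

Under this abstraction, selecting row 0 recovers w(0) and selecting row 1 recovers w(1), because only the cell under Vread can block its string.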
【0072】 In the example shown in Figure 16, there are word lines up to the (N-1)th row. Therefore, the number of word lines is N. This N is equal to or greater than the number of inputs to the sum-of-products operation. 【0073】 Below the (N-1)th row is a row of selection transistors. These selection transistors are, for example, connected in parallel to the common source line (CSL). In other words, in this example, four NAND strings corresponding to the four digits of the 4-bit binary representation are connected in parallel to the CSL. The selection transistors shown at the bottom of the diagram switch this connection on and off. 【0074】 Figure 18 shows an example of the input and output method in this application. Note that row 1 and below of Figure 16, to which Vpass is applied, are omitted. As an example, the 0th component (element) x0 of the input vector is (1010) when represented as a 4-bit binary number. In other words, the j-th component (j-th input element) of the input vector is a sequence of data 0 or data 1. Similarly, the j-th component (j-th output element) of the output vector is also a sequence of data 0 or data 1. 【0075】 In Figure 18, first, (1010) is input to the bit line BL(n+3). By adding (0000) to the beginning of x0, it is expanded to an 8-bit binary number (00001010), and this is then used as the 0th component of the input vector. That is, x(0) = (00001010). 【0076】 Next, consider inputting the first digit of x0 (the rightmost digit) into BL(n+3). Inputting 0 into the bit line when the upper selection gate is ON means applying a low voltage (for example, 0V) to the drain of the cell to which Vread is applied (in this case, the cell in the 0th row). Conversely, inputting 1 into the bit line when the upper selection gate is ON means applying a high voltage (for example, 5V) to the drain of the cell to which Vread is applied (in this case, the cell in the 0th row). In this example, the first digit of x0 = (00001010) is 0. 
The cell in the 0th row connected to BL(n+3) is in state 1, but since the drain voltage is low, no current above a certain level flows through BL(n+3). In other words, the output (or output data) is data 0. 【0077】 Next, the second digit (1) of x0 = (00001010) is input to BL(n+3). However, inputting 1 to the bit line when the upper selection gate is ON means applying a high voltage (for example, 5V) to the drain of the cell to which Vread is applied (in this case, the cell in the 0th row). In this example, the cell in the 0th row connected to BL(n+3) is in state 1, and since the drain voltage is high, a current greater than a certain amount flows through BL(n+3). In other words, the output (or output data) is data 1. 【0078】 Next, the third digit (0) of x0 = (00001010) is input to BL(n+3). However, inputting 0 to the bit line when the upper selection gate is ON means applying a low voltage (for example, 0V) to the drain of the cell to which Vread is applied (in this case, the cell in the 0th row). In this example, the cell in the 0th row connected to BL(n+3) is in state 1, but because the drain voltage is low, no current above a certain level flows through BL(n+3). In other words, the output (or output data) is data 0. 【0079】 Next, the fourth digit (1) of x0 = (00001010) is input to BL(n+3). However, inputting 1 to the bit line when the upper selection gate is ON means applying a high voltage (5V for example) to the drain of the cell to which Vread is applied (in this case, the cell in row 0). In this example, the cell in row 0 connected to BL(n+3) is in state 1, and since the drain voltage is high, a current greater than a certain amount flows through BL(n+3). In other words, the output (or output data) is data 1. 【0080】 Next, the 5th to 8th digits (0000) of x0 = (00001010) are input to BL(n+3) in order (from right to left). 
However, inputting data 0 to the bit line when the upper selection gate is ON means applying a low voltage (for example, 0V) to the drain of the cell to which Vread is applied (in this case, the cell in row 0). Whether the cell connected to BL(n+3) to which Vread is applied is in state 1 or state 0, no current above a certain level flows through BL(n+3) because the drain voltage is low. In other words, the output (or output data) is (0000). This is an output code consisting of four data 0s in a row. 【0081】 Thus, once all digits of x0 = (00001010) have been input, the output on BL(n+3) is (00001010) as an 8-bit binary number. 【0082】 Next, the 4-bit binary number (1010) is input to bit line BL(n+2). First, (000) is added to the beginning and (0) to the end to expand it into the 8-bit binary number (00010100), which is then used as the input code. That is, x0 = (00010100). 【0083】 As in the case of bit line BL(n+3), the output from BL(n+2) can be obtained. If the 8-bit binary number x0 = (00010100) is input to the cell on BL(n+2) to which Vread is applied (the cell in row 0 in this example), the output is the output code (00000000), because that cell is in state 0. 【0084】 Next, the 4-bit binary number (1010) is input to bit line BL(n+1). First, (00) is added to the beginning and (00) to the end to expand it into the 8-bit binary number (00101000), which is then used as the input code. That is, x0 = (00101000). 【0085】 As in the case of bit line BL(n+3), the output from BL(n+1) can be obtained. If the input code for the cell on BL(n+1) to which Vread is applied (the cell in row 0 in this example) is the 8-bit binary number x0 = (00101000), the output code is (00101000). 【0086】 Next, (1010) is input to bit line BL(n). First, (0) is added to the beginning and (000) to the end to expand it into the 8-bit binary number (01010000), which is then used as the input code. That is, x0 = (01010000). 【0087】 As with the bit line BL(n+3), the output from BL(n) can be obtained. 
If we input the 8-bit binary number x0 = (01010000) into the cell on BL(n) to which Vread is applied (in this example, the cell in row 0), the output is the output code (01010000). 【0088】 Next, we sum all the outputs from BL(n), BL(n+1), BL(n+2), and BL(n+3). Figure 19 shows how to sum these outputs by column addition, as in long multiplication. The result is the 8-bit binary number (10000010). This is the product of the input code x0 = (1010), which is the 0th component of the input vector, and the 0th component w0 = (1101) of the weight vector stored in the 0th row; in decimal, 10 × 13 = 130. 【0089】 In the above, the product of the 0th component of the input, expressed as a 4-bit binary number, and the 0th component of the weight, also expressed as a 4-bit binary number, is represented as an 8-bit binary number. Below, we generalize this description. 【0090】 First, we express the j-th component of the input vector (the j-th input element), xj, and the j-th component of the weight vector (the j-th weight element), wj, as m-bit binary numbers. For example, assume xj = (1...10) and wj = (10...01). We denote this original, unextended xj as xjx. 【0091】 Next, in Figure 20, as an example, select the bit line with the largest index, i.e., the BL(n+m-1) column. Then attach the m-bit binary number (0...0) to the beginning of the m-bit binary xj, converting it into a 2m-bit binary input code. That is, set xj = (0...0xjx) = (0...01...10). Here, the m-bit binary number (0...0) is a sequence of m zeros. 【0092】 Next, select the bit line with the next lower index, i.e., BL(n+m-2). Then attach the (m-1)-bit binary number (0...0) to the beginning of xj and the 1-bit binary number (0) to the end, converting it into a 2m-bit binary input code. That is, set xj = (0...0xjx0) = (0...01...100). Here, the 1-bit binary number (0) is data 0, and the (m-1)-bit binary number (0...0) is a sequence of (m-1) data 0s. 【0093】 Next, we select the bit line with the next lower index, i.e., BL(n+m-3). 
Then we attach the (m-2)-bit binary number (0...0) to the beginning of xj and the 2-bit binary number (00) to the end, converting it into a 2m-bit binary input code. That is, we set xj = (0...0xjx00) = (0...01...1000). Here, the 2-bit binary number (00) is two data 0s in a row, and the (m-2)-bit binary number (0...0) is (m-2) data 0s in a row. 【0094】 Next, select the bit line with the next lower index, i.e., BL(n+m-4). Then attach the (m-3)-bit binary number (0...0) to the beginning of xj and the 3-bit binary number (000) to the end, converting it into a 2m-bit binary input code. That is, set xj = (0...0xjx000) = (0...01...10000). Here, the 3-bit binary number (000) is three data 0s in a row, and the (m-3)-bit binary number (0...0) is (m-3) data 0s in a row. 【0095】 In this way, by sequentially inputting the digits of xj, represented as a 2m-bit binary number, to the bit line BL(n+m-r) of each selected column, outputs are obtained sequentially according to the state (0 or 1) of the cell to which Vread is applied. Here, r is an integer between 1 and m. That is, the above procedure can be continued from r = 1 to r = m. 【0096】 Figure 21 shows a method for calculating the product of xjx and wj. 【0097】 That is, when r = m, BL(n) is selected and the extended xj = (0xjx0...0). When r = m-1, BL(n+1) is selected and the extended xj = (00xjx0...0). When r = m-2, BL(n+2) is selected and the extended xj = (000xjx0...0). When r = 2, BL(n+m-2) is selected and the extended xj = (0...0xjx0). When r = 1, BL(n+m-1) is selected and the extended xj = (0...0xjx). 【0098】 Since the number of bit lines used is m, m output codes are obtained from the circuit in Figure 21. As shown in Figure 22, by adding up all of these output codes from the 1st to the m-th using column addition, we obtain the j-th scalar product (the j-th output element), that is, the product of the j-th component of the input vector (the j-th input element) and the j-th component of the weight vector (the j-th weight element). 
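The bit-line procedure above amounts to binary shift-and-add multiplication. The following Python sketch reproduces it under simplifying assumptions: bits are held in tuples written left to right, the digit-by-digit timing on the bit line is abstracted into a single bitwise AND with the state of the cell under Vread, and the function names are hypothetical. The r-th input code carries (m-r+1) leading zeros and (r-1) trailing zeros, which gives the 2m-bit codes listed in the text.

```python
def bits_to_int(bits):
    """Interpret a tuple of bits, most significant bit first, as an integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def int_to_bits(value, width):
    """Return `value` as a tuple of `width` bits, most significant bit first."""
    return tuple((value >> (width - 1 - i)) & 1 for i in range(width))


def input_code(x_bits, r):
    """r-th 2m-bit input code for BL(n+m-r): zeros, then x, then (r-1) zeros."""
    m = len(x_bits)
    return (0,) * (m - r + 1) + tuple(x_bits) + (0,) * (r - 1)


def bit_line_output(code, cell_state):
    """Data 1 appears only when the input digit is 1 and the cell is in state 1."""
    return tuple(b & cell_state for b in code)


def scalar_product(x_bits, w_bits):
    """Sum the m bit-line output codes; w_bits is listed from BL(n) to BL(n+m-1)."""
    m = len(x_bits)
    outputs = []
    total = 0
    for r in range(1, m + 1):
        state = w_bits[m - r]  # cell under Vread on BL(n+m-r)
        out = bit_line_output(input_code(x_bits, r), state)
        outputs.append(out)
        total += bits_to_int(out)
    return outputs, int_to_bits(total, 2 * m)


outputs, product = scalar_product((1, 0, 1, 0), (1, 1, 0, 1))
for out in outputs:
    print(out)
print(product)
```

For x0 = (1010) and w0 = (1,1,0,1) the four output codes are (00001010), (00000000), (00101000), and (01010000), and their sum is (10000010), in agreement with Figure 19 as described.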
【0099】 Next, consider summing over j from 0 to (N-1). This is a step towards completing the sum-of-products operation. 【0100】 The case where j = 0 in the scalar product operation in Figure 21 corresponds to Figure 16. In other words, in Figure 16, Vread is applied only to the 0th row, and Vpass is applied to all other rows. In Figure 20, Vread is applied only to the j-th row, and Vpass is applied to all other rows. 【0101】 In other words, the scalar product can be calculated according to Figures 21 and 22 while varying j from 0 to N-1. 【0102】 Figure 23 is a flowchart that briefly illustrates one example of this method. In the figure, (q)0 represents a sequence of q zeros, where q is a non-negative integer. In Figure 20, it is assumed that Von has already been applied to the selection gates. Also, in the above explanation, the word lines were placed from row 0 to row (N-1), but they can be placed from row s to row (s+N-1), as long as the total number of rows is N. For the sake of simplicity, the following explanation also uses the case where s = 0. 【0103】 First, assign s to the integer j, and assign 0.0 to the variables output and scalarp. Here, for the sake of simplicity, we assume s = 0. 【0104】 Next, check whether j is less than N. If it is not (NO), output the variable output and terminate. This output is the result of the multiply-accumulate operation (thread output). If it is less than N (YES), add scalarp to output, select the j-th word line WL(j) in the circuit diagram shown in Figure 20, apply Vread to that WL(j), and apply Vpass to the other word lines. 【0105】 Next, assign 0.0 to the variable scalarp and 1 to the integer r. 【0106】 Select the (n+m-r)-th bit line BL(n+m-r) and input the binary number (m-r)0 / xjx / (r-1)0. 
Here, (m-r)0 / xjx / (r-1)0 is obtained by attaching the (m-r)-bit binary number (0...0) to the beginning and the (r-1)-bit binary number (0...0) to the end of the j-th component (xjx) of the m-bit binary input. This is the r-th input code for the j-th input element. 【0107】 For this input code, the current flowing through BL(n+m-r) is used to obtain temp (the output code) according to the method described in Figure 18 or Figure 21. This temp is added to scalarp, and it is then checked whether r is less than m. If it is (YES), r is incremented by 1 and BL(n+m-r) is selected. If it is not less than m (NO), j is incremented by 1 and it is checked whether j is less than N. At this point, scalarp is the scalar product (the j-th output element) of xj (the j-th input element) and wj (the j-th weight element). 【0108】 This process is repeated until END is reached. (Third embodiment) 【0109】 This operation is carried out by a circuit such as that shown in Figure 20. As an example, this circuit consists of 2m selection gates, mN memory cells (where mN is the product of m and N), and a common source line CSL. This is referred to as one block in this application. 【0110】 Multiple such blocks can be arranged within a chip. Figure 24 shows two blocks, Block(n) and Block(p), arranged vertically with a CSL in between. Here, n and p are arbitrary integers that are distinct from each other. 【0111】 The combined number of rows (word lines) of the two blocks is (N-1), where N is the number of inputs. In other words, these two blocks can function as a single block. A concrete example of how to operate them is shown in Figure 23. Such blocks are used to compute the threads shown in Figure 11, and in this application, we refer to them specifically as small blocks for silicon brains (SB Blocks). 【0112】 Figure 25 shows another example. Both Block(n) and Block(p) are SB Blocks, each consisting of N word lines. 
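The flow of Figure 23, as described in 【0103】 to 【0108】, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the flowchart itself: the function name and data layout are hypothetical, s = 0 is assumed, each input code is padded with (m-r+1) leading zeros so that it is 2m bits long (a leading zero does not change its value), and each scalarp is added to output as soon as the inner loop over r finishes, which is one reading of the ordering in the text.

```python
def bits_to_int(bits):
    """Interpret a tuple of bits, most significant bit first, as an integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def thread_operation(x_elems, w_rows):
    """Multiply-accumulate over N rows using only bitwise AND, shifts, and sums.

    x_elems[j] is the m-bit input element xjx; w_rows[j] holds the cell states
    of row j from BL(n) to BL(n+m-1), i.e. the m-bit weight element wj.
    """
    output = 0
    for j, (x_bits, w_bits) in enumerate(zip(x_elems, w_rows)):
        # Vread on WL(j), Vpass elsewhere: only row j gates the bit lines.
        m = len(x_bits)
        scalarp = 0
        for r in range(1, m + 1):
            state = w_bits[m - r]  # cell on BL(n+m-r) in row j
            code = (0,) * (m - r + 1) + tuple(x_bits) + (0,) * (r - 1)
            temp = bits_to_int(tuple(b & state for b in code))  # output code
            scalarp += temp
        output += scalarp  # scalarp is the product of xj and wj
    return output


x = [(1, 0, 1, 0), (0, 0, 1, 1)]   # 10 and 3
w = [(1, 1, 0, 1), (1, 0, 0, 1)]   # 13 and 9
print(thread_operation(x, w))
```

With the assumed inputs the result equals 10 × 13 + 3 × 9, so the sketch agrees with an ordinary integer dot product while using only the digital steps named in the text.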
In other words, two blocks of the kind shown in Figure 20 are arranged as independent blocks sharing the CSL. A specific way of operating Block(n) and Block(p) is shown, for example, in Figure 23. 【0113】 Figure 26 shows four of the multiple blocks arranged with a single shared source line CSL in between. Because blocks can be arranged in both the row and column directions, in Figures 24 and 25, for example, n can be replaced with (j,k) and p can be replaced with (j+1,k). Thus, the four blocks become Block(j,k), Block(j,k+1), Block(j+1,k), and Block(j+1,k+1), respectively. 【0114】 Figure 27 shows an example of wiring the CSL in a U-shape. In this example, 12 blocks can be arranged to share the CSL. (Fourth Embodiment) 【0115】 Figure 28 is a circuit diagram containing Block(j,k) and Block(j+r,k). Page is the number of columns (number of bit lines). 【0116】 Block(j,k) is a silicon brain block (SB Block) characteristic of this invention, consisting of m columns (m bit lines). The memory cells constituting the silicon brain block may be the same as the memory cells constituting flash memory. In this case, Block(j+r,k) can be regarded as a small block (NAND Block) consisting of memory cells having a non-volatile storage function, such as NAND flash. 【0117】 These can be combined to form a larger block by bundling bit lines equivalent to one page. In other words, the Block(j,k) mentioned above is a small block intended for thread operations. A NAND Block is also a small block, similar to an SB Block, but its function is non-volatile storage. The circuit diagram for a NAND Block is the same as in Figures 16 and 20, and an SB Block cannot be distinguished from a NAND Block from a circuit perspective. They are the same circuit, but can be used for different functions. In other words, an SB Block can be used as a NAND Block. 
Conversely, a NAND Block can also be used as an SB Block for thread operations. 【0118】 In other words, Figure 27 is also a mixed configuration of SB Blocks and NAND Blocks, as shown in Figure 29. It is up to the user whether to use each small block for thread operations or as non-volatile storage. 【0119】 In conventional NAND flash memory, one block is arranged on the chip so that it occupies one well. In relation to the present invention, such a block that occupies one well is divided into multiple smaller blocks. Of these multiple smaller blocks, one is an SB block and another is a NAND block. Figure 30 shows an example of such an arrangement. In other words, one block in the present invention, or one block relating to the present invention, is a smaller block obtained by dividing one block of conventional NAND flash memory. (Fifth Embodiment) 【0120】Figure 31 is a diagram showing an example of the flow over one epoch. 【0121】 The integers used for increment are j and k. However, k has a special meaning in determining the structure of the neural network. Specifically, when k = 0, it represents the input layer, and when k = M, it represents the output layer. The hidden layers are from k = 1 to k = M-1. That is, the number of hidden layers is M-1. When k = M, the output y is calculated. That is, the output y is the charge added in the Mth layer. Also, j is the element number of the (k+1)th column output element obtained by thread operations, and is an integer from 1 to L(k+1). 【0122】 Here, i is the element number of the input element x(i) input at the k-th layer when performing a thread operation, and is an integer from 1 to L(k). However, if the number of elements in the input layer (k=0) is N, then L(0) = N. Input elements can be input from outside the thread operation unit. 【0123】 The Start command will start both increments from 1. 【0124】 Next, the first thread operation is performed to obtain the output t(1, 2) (Get t(1, 2)). 
An activation function is used to convert this into the activated output element Act(1, 2) and save it to main memory. 【0125】 Next, the following thread operation is performed to obtain the output t(2, 2) (Get t(2, 2)). Using the activation function, this is converted into the activated output element Act(2, 2) and stored in main memory. 【0126】This is repeated L(2) times. The activation output element Act(j, 2) is provided to the outside of the semiconductor chip. Depending on the case, the activation output element Act(j, 2) is stored in main memory, where j is an integer from 1 to L(2). Thus, the L(2) activation output elements Act(1, 2), Act(2, 2), ... Act(L(2), 2) are combined and called the activation output of the second layer. (The first layer is the input layer.) By incrementing k, the activation output of the (k+1)th layer is obtained. However, the activation output of the (k+1)th layer is the combination of L(k+1) activation output elements Act(1, k+1), Act(2, k+1), ... Act(L(k+1), k+1). 【0127】 Next, we check whether k is less than M. If it is, we substitute (overwrite) the activation output element into the input element and update (increment) k. That is, we increase it by 1, and repeat this until k = M. Then, when k = M, we calculate the Y value. However, when k = M, we may or may not use the activation function. 【0128】 In this way, the activation output element Act(j, k+1) is calculated for all possible j and k values and stored in main memory as needed. The Y value is also calculated (Get Y) and stored in main memory as needed. This completes one training cycle. In other words, the calculation of the Y value (Get Y) marks the end of one epoch. 【0129】 Figure 32 is a diagram illustrating an example of a method for repeating epochs. Here, the number of epochs (learning cycles) is represented by g (epoch number). 【0130】 First, at the start, let g = 1. 【0131】 Y(g) is calculated using the method shown in Figure 31. 
However, Y(g) is the same as the Y value described above. The Y value calculated the first time is Y(1), the Y value calculated the second time is Y(2), and so on, up to the Y value calculated the gth time is Y(g). 【0132】Next, the difference (or residual) between the latest Y(g) and the previous Y(g-1) is compared with a predetermined value. If the absolute value of this residual is considered sufficiently small compared to the predetermined value, Y(g) is considered to have converged. If it has not converged and g is not smaller than the upper limit gupp, the calculation is considered a failure and terminated. Alternatively, if g is smaller than the upper limit gupp, g is incremented (increased by 1), Y is calculated again, and this is taken as Y(g+1). 【0133】 On the other hand, if Y(g) has converged, Y=Y(g) is set, Y is output, and the learning process ends (End). The final output Y can be saved to the main memory outside the AI chip, if necessary. At this point, the latest g is the number of epochs. The number of epochs is useful for representing the learning efficiency of the algorithm. In other words, a large number of epochs indicates low learning efficiency, while a small number indicates high learning efficiency. 【0134】 Learning speed is expressed as the product of the operating speed of the AI chip, which is represented by the clock frequency and bandwidth, and the number of epochs. 【0135】 As mentioned above, the total number of elements of the weight w required to calculate the output Y is L(0)L(1)...L(M-1). (Sixth Embodiment) 【0136】 The above explanation uses the example of a case where there is only one output Y. However, the key feature of this invention is the chip technology for threaded computation. Its use is not limited to the above-described use, but can be applied to any computing architecture that uses threaded computation. 【0137】 When there are multiple outputs Y, for example, there is a phenomenon called convolution. 
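The epoch-repetition flow of Figure 32, described in 【0130】 to 【0133】, can be sketched as the following loop. This is a sketch under stated assumptions: compute_y is a hypothetical callable standing in for one pass of the flow in Figure 31, tolerance stands for the predetermined value, and because no Y(0) is defined in the text, the first epoch is assumed never to count as converged.

```python
def run_epochs(compute_y, tolerance, g_upp):
    """Repeat epochs until |Y(g) - Y(g-1)| is small, or fail at the upper limit."""
    g = 1
    previous = None  # assumption: there is no Y(0) to compare against
    while True:
        y = compute_y(g)  # Y(g), obtained as in Figure 31
        if previous is not None and abs(y - previous) < tolerance:
            return y, g  # converged: Y = Y(g), and g is the number of epochs
        if g >= g_upp:
            raise RuntimeError("Y(g) did not converge before g reached gupp")
        previous = y
        g += 1


# Toy stand-in for Figure 31 (assumed for illustration): Y(g) approaches 1.
y, epochs = run_epochs(lambda g: 1 + 0.5 ** g, tolerance=0.01, g_upp=20)
print(y, epochs)
```

In this toy case the residual halves every epoch, so the loop stops at the first g for which 0.5 to the power g falls below the assumed tolerance; a smaller returned g corresponds to higher learning efficiency in the sense of 【0133】.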
It is obvious that the chip technology for calculating threads of this invention can be utilized even in such cases. 【0138】Figure 33 shows an example of convolution. In this example, the input layer has 4 columns and the output layer has 3 columns (y1, y2, y3). The number of hidden layers is set to M-1. Of course, the number of input layers does not have to be 4; in principle, it can be 3 or even 2 columns. The number of output layers does not have to be 3; it can be 2 or even more. Figure 34 is a redrawn version of Figure 10 for comparison. The number of hidden layers is expanded from 2 to M-1 while keeping the number of output layers at 1. 【0139】 One way to use convolution is to show the AI a picture of a bird and ask it to choose the correct answer from among bird (y1), cat (y2), and turtle (y3). When the AI chooses y1, it is told that it is correct; otherwise, it is told that it is incorrect. The AI learns by repeating this process. (Seventh Embodiment) 【0140】 Figure 30 shows a layout of multiple SB blocks and multiple NAND blocks on the X-Y plane. In other words, as an example, word lines are drawn in the X-axis direction and bit lines are drawn in the Y-axis direction. If this is rotated 90 degrees around the X-axis, the word lines remain in the X-axis direction, but the bit lines become in the Z-axis direction. In other words, the multiple SB blocks and multiple NAND blocks are laid out in a cell array on the X-Z plane. 【0141】Figure 35 shows multiple cell arrays arranged in the Y-axis direction on the X-Z plane. In other words, if the Z-axis direction, where the bit lines run, is perpendicular to the semiconductor substrate, the channels are arranged vertically. This is a structure unique to 3D NAND flash memory. Looking at the Y-axis direction, the (j-1)th layer, the j-th layer, and the (j+1)th layer are arranged from the front towards the back of the page. Figure 36 shows the layout of the (j-1)th layer on the module plane within the Z-X plane. 
It is completely filled with SB Blocks. Figure 37 shows the layout of the j-th layer on the module plane within the Z-X plane. SB Blocks and NAND Blocks are mixed. Figure 38 shows the layout of the (j+1)th layer on the module plane within the Z-X plane. It is completely filled with NAND Blocks. Thus, the layout of small blocks in each layer is flexible. 【0142】 Thus, the SB Block, which is one of the features of this invention, can also be incorporated into 3D NAND flash memory. 【0143】 In Figures 18 and 21, the input data for the input elements was entered from the right side of the diagram, but this can also be reversed. This is explained in detail using Figure 39. First, assign 0 to the integer j (that is, s = 0 is assumed for simplicity), and assign 0.0 to the variables output and scalarp. 【0144】 Next, check whether j is less than N. If it is not (NO), output the variable output and terminate. This output is the result of the multiply-accumulate operation (thread output). If it is less than N (YES), add scalarp to output, select the j-th word line WL(j) in the circuit diagram shown in Figure 20, apply Vread to that WL(j), and apply Vpass to the other word lines. 【0145】 Next, assign 0.0 to the variable scalarp and assign m to the integer r. 【0146】 Select the (n+m-r)-th bit line BL(n+m-r) and input the binary number (r-1)0 / xjx / (m-r)0. Here, (r-1)0 / xjx / (m-r)0 is obtained by attaching the (r-1)-bit binary number (0...0) to the beginning and the (m-r)-bit binary number (0...0) to the end of the j-th component (xjx) of the m-bit binary input. This is the r-th input code for the j-th input element. 【0147】 For this input code, the current flowing through BL(n+m-r) is used to obtain temp (the output code) according to the method described in Figure 18 or Figure 21. This temp is added to scalarp, and it is then checked whether r is greater than 1. If it is greater (YES), r is decremented by 1 and BL(n+m-r) is selected. 
If it is not greater (NO), j is incremented by 1 and it is checked whether j is less than N. At this point, scalarp is the scalar product (the j-th output element) of xj (the j-th input element) and wj (the j-th weight element). 【0148】 This process is repeated until END is reached. 【0149】 Finally, as mentioned above, the weight elements in this application can be expressed in the form of a three-dimensional matrix, w(i, j, k). When the number of hidden layers increases, that is, when dealing with more complex problems in deep learning, the number of elements in w(i, j, k) becomes very large. Consequently, accessing main memory each time the weights are updated consumes a very large amount of power. Therefore, integrating the weight cells directly onto the AI chip, as in this application, contributes to a significant reduction in the power consumption required to utilize deep learning (artificial intelligence). 【0150】 Gordon E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, April 19, 1965. Masanet, E.; Shehabi, A.; Lei, N.; Smith, S.; Koomey, J. Recalibrating global data center energy-use estimates. Science 2020, vol. 367, 984-986. A Survey of Neuromorphic Computing and Neural Networks in Hardware, C. D. Schuman, T. E. Potok, R. M. Patton, D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank; https://doi.org/10.48550/arXiv.1705.06963. 【0151】 The features of this application have been explained above. 【0152】 The technical scope of the present invention is not limited to the embodiments described above, and various modifications can be made without departing from the spirit of the invention. 【0153】 This makes it possible to provide a method for significantly reducing the power consumption of deep learning using only digital processing on a semiconductor chip. 【0154】 A diagram illustrating an example of a nerve cell. A diagram illustrating an example of a synapse. 
A diagram illustrating an example of the concept of a perceptron. A diagram illustrating an example of thread operation. A diagram illustrating an example of a combination of thread operations. A diagram illustrating an example of a combination of thread operations. A diagram illustrating an example of a combination of thread operations. A diagram illustrating an example of a combination of thread operations. A diagram illustrating an example of a combination of thread operations. A diagram illustrating an example of a thread. A diagram illustrating a conventional method of thread operation. A diagram illustrating an example of a von Neumann bottleneck. A diagram illustrating an example of a thread operation method when data C is on-chip. A diagram illustrating an example of thread operation relating to the present application. A diagram illustrating an example of a circuit configuration relating to the present application. A diagram illustrating an example of a method for handling input data and generating output data relating to the present application. A diagram illustrating an example of a method for generating output data relating to the present application. A diagram illustrating an example of a circuit configuration relating to the present application. A diagram illustrating an example of a method for handling input data relating to the present application. A diagram illustrating an example of a method for generating output data relating to the present application. A diagram illustrating an example of a thread operation (multiply-accumulate operation) relating to the present application. A diagram illustrating an example of block layout relating to this application. A diagram illustrating an example of block arrangement relating to this application. A diagram illustrating an example of block arrangement relating to this application. A diagram illustrating an example of block arrangement relating to this application.
A diagram illustrating an example of circuit configuration relating to this application. A diagram illustrating an example of block arrangement relating to this application. A diagram illustrating an example of block arrangement relating to this application. A diagram illustrating an example of one learning cycle (one epoch) relating to this application. A diagram illustrating an example of neural network computation relating to this application. A diagram illustrating an example of a convolution model relating to this application. A diagram illustrating an example of thread operation combination relating to this application. A diagram illustrating an example of block arrangement in three-dimensional space relating to this application. A diagram illustrating an example of block arrangement in three-dimensional space relating to this application. A diagram illustrating an example of block arrangement in three-dimensional space relating to this application. A diagram illustrating an example of thread operations (multiply-accumulate operations) related to this invention.
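The flow described for Figure 39 in paragraphs 【0143】 to 【0148】 can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the circuit itself. Each selected memory cell is modelled as passing the input digit when it stores 1 and outputting 0 when it stores 0. Digit lists are read most significant digit first, and the cell read in round r is assumed to hold the weight digit of significance 2^(m-r), which is the choice that makes the summed output codes equal xj × wj. With the stated padding counts, each input code in this sketch has 2m-1 digits. The function names (`input_code`, `cell_read`, `thread_output`) are hypothetical, and scalarp is added to output at the end of each j iteration, a simplification of the flowchart ordering.

```python
def int_to_bits(value, width):
    """Unsigned integer -> list of 0/1 digits, most significant digit first.
    Assumes 0 <= value < 2**width."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def bits_to_int(bits):
    """List of 0/1 digits (most significant first) -> unsigned integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def input_code(x_bits, r):
    """r-th input code: (r-1) zeros at the beginning, then x_j, then (m-r) zeros."""
    m = len(x_bits)
    return [0] * (r - 1) + list(x_bits) + [0] * (m - r)


def cell_read(stored_bit, input_bit):
    """Modelled cell: outputs the input digit if it stores 1, else 0."""
    return input_bit & stored_bit


def thread_output(xs, ws, m):
    """Multiply-accumulate of m-digit inputs xs with m-digit weights ws."""
    output = 0
    for j in range(len(xs)):          # select WL(j): Vread on WL(j), Vpass elsewhere
        x_bits = int_to_bits(xs[j], m)
        w_bits = int_to_bits(ws[j], m)
        scalarp = 0
        for r in range(m, 0, -1):     # r = m, ..., 1 selects BL(n+m-r)
            code = input_code(x_bits, r)
            stored = w_bits[r - 1]    # ASSUMPTION: digit of significance 2**(m-r)
            temp = [cell_read(stored, b) for b in code]   # output code
            scalarp += bits_to_int(temp)
        output += scalarp             # scalarp is now x_j * w_j
    return output


print(thread_output([5, 2], [3, 7], 3))
```

For example, with m = 3, inputs (5, 2) and weights (3, 7), the sketch returns 5×3 + 2×7 = 29.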
Claims
1. (Figures 20 and 23) A semiconductor device including an SB block comprising a thread arithmetic unit, wherein: the thread arithmetic unit returns a thread output in response to an external input; the external input consists of the 0th to (N-1)th input elements; the 0th to (N-1)th input elements are each represented in m-element binary; the thread arithmetic unit consists of NAND strings from the nth column to the (n+m-1)th column, bit lines from the nth column to the (n+m-1)th column, and word lines from the sth row to the (s+N-1)th row, for any integer n; each NAND string from the nth column to the (n+m-1)th column consists of memory cells from the sth row to the (s+N-1)th row; and the data of the jth weight element is stored as 0 or 1 in the memory cell of the jth row in the NAND strings from the nth column to the (n+m-1)th column, where j is an integer between 0 and N-1; the semiconductor device being characterized by: selecting the word line of the (s+j)th row from among the word lines of the sth to (s+N-1)th rows and applying a read voltage to it; selecting the jth input element from among the 0th to (N-1)th input elements; selecting the bit line of the (n+m-r)th column from among the bit lines of the nth to (n+m-1)th columns; attaching (r-1) data zeros to the right end of the jth input element and (m-r) data zeros to the left end of the jth input element to form the rth input code for the jth input element; and inputting the rth input code to the bit line of the (n+m-r)th column.
2. (Figures 18 and 21) The semiconductor device according to claim 1, characterized in that: the rth input code is represented in 2m-element binary; the bit line of the (n+m-r)th column inputs the rth input code digit by digit to one end of the NAND string of the (n+m-r)th column; the output from the memory cell of the jth row in the NAND string of the (n+m-r)th column is taken as the rth output code; the 0th to (m-1)th output codes are output to the bit lines of the nth to (n+m-1)th columns; and the 0th to (m-1)th output codes are added together to form the jth output element for the jth input element and the jth weight element.
3. The semiconductor device according to claim 2, characterized in that it sequentially selects the word lines from the sth row to the (s+N-1)th row, sequentially applies the read voltage to them, sequentially outputs the 0th to (N-1)th output elements, and sums the 0th to (N-1)th output elements to obtain the thread output.
4. (Figures 18 and 21) The semiconductor device according to claim 1, characterized in that: the memory cells in the jth and (j+1)th rows of the NAND string of the (n+m-r)th column are the jth and (j+1)th nonvolatile memory cells; the jth and (j+1)th nonvolatile memory cells each have first to third terminals; the third terminal of the jth nonvolatile memory cell is connected to the word line of the jth row; the second terminal of the jth nonvolatile memory cell is connected to the first terminal of the (j+1)th nonvolatile memory cell; the jth nonvolatile memory cell is capable of storing data 0 or data 1; when the stored data is data 1, it outputs data 0 if data 0 is input from the bit line of the (n+m-r)th column and outputs data 1 if data 1 is input from the bit line of the (n+m-r)th column; and when the stored data is data 0, it outputs data 0 whether data 0 or data 1 is input from the bit line of the (n+m-r)th column.
5. (Figure 30) The semiconductor device according to claim 1, characterized in that it includes a plurality of small blocks on a well formed on the surface of a semiconductor substrate, and one of the plurality of small blocks is the SB block.
6. (Figure 35) The semiconductor device according to claim 1, characterized in that the word lines from the sth row to the (s+N-1)th row are stacked on a semiconductor surface, the module plane including the SB block is perpendicular to the semiconductor surface, the module plane is arranged to be perpendicular to the first axis direction within the semiconductor surface, the module plane is arranged to be parallel to the second axis direction, which is perpendicular to the first axis direction, the word lines from the sth row to the (s+N-1)th row are arranged to be parallel to the second axis direction, and the bit lines from the nth column to the (n+m-1)th column are arranged to be parallel to the first axis direction.
7. The semiconductor device according to claim 1, characterized in that the SB block is arranged perpendicular to a third axis direction perpendicular to the semiconductor surface, the word lines from the sth row to the (s+N-1)th row are arranged parallel to the fourth axis direction within the semiconductor surface, and the bit lines from the nth column to the (n+m-1)th column are arranged perpendicular to the fourth axis direction within the semiconductor surface.
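As a worked check of why the procedure of claims 1 to 3 yields a multiply-accumulate result, the arithmetic can be written out as below. It assumes, which the claims do not state explicitly, that the input element x_j is read as an unsigned binary integer, that the cell in column n+m-r stores the weight digit b_{j,r} of significance 2^{r-1}, and that the selected cell behaves as described in claim 4. The symbols c_{j,r}, o_{j,r}, b_{j,r} and y_j are introduced here for illustration only.

```latex
\begin{align*}
c_{j,r} &= 2^{\,r-1}\, x_j
  && \text{($r-1$ zeros at the right end; zeros at the left end leave the value unchanged)}\\
o_{j,r} &= b_{j,r}\, c_{j,r}
  && \text{(the cell passes the code if it stores 1 and outputs 0 if it stores 0)}\\
y_j &= \sum_{r=1}^{m} o_{j,r}
     = x_j \sum_{r=1}^{m} b_{j,r}\, 2^{\,r-1}
     = x_j\, w_j \\
\text{thread output} &= \sum_{j=0}^{N-1} y_j = \sum_{j=0}^{N-1} x_j\, w_j
\end{align*}
```

Under these assumptions the per-column output codes are the partial products of ordinary binary long multiplication, so adding them gives the jth output element and summing over the word lines gives the thread output.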