Stacked storage device, system with the same and associated method
The stacked storage device with silicon via connections between computational units addresses memory bottlenecks by enabling parallel data processing, thereby reducing processing time and power consumption.
Patent Information
- Authority / Receiving Office
- DE · DE
- Patent Type
- Patents
- Current Assignee / Owner
- SAMSUNG ELECTRONICS CO LTD
- Filing Date
- 2018-04-12
- Publication Date
- 2026-06-25
AI Technical Summary
Memory bandwidth and latency bottlenecks in processing systems are exacerbated by inter-device bandwidth and inter-device latency when accessing stacked semiconductor devices, leading to inefficiencies in data processing and increased power consumption.
The stacked storage device architecture uses through-silicon vias to connect computational units across multiple semiconductor dies, which allows data to be processed simultaneously and reduces the data exchanged between the stacked device and external components.
This architecture enables parallel data processing, reducing processing time and power consumption by minimizing inter-device data transfer, particularly in memory-intensive operations.
Description
Technical field
The present inventive concept relates to integrated semiconductor circuits and, more precisely, to a stacked storage device, a system comprising a stacked storage device, and a method for operating a stacked storage device.
Discussion of the state of the art
Memory bandwidth and latency are performance bottlenecks in many processing systems. Storage capacity can be increased by using a stacked memory device, in which multiple semiconductor devices are stacked within a single memory chip package. The stacked semiconductor devices (or dies) can be electrically connected to each other via through-silicon vias or through-substrate vias (TSVs). Such a stacking technology can increase storage capacity while also mitigating bandwidth and latency drawbacks. Each time an external device accesses the stacked memory device, data is communicated between the stacked semiconductor dies. In this case, however, inter-device bandwidth and inter-device latency drawbacks can occur twice with each access. Therefore, inter-device bandwidth and inter-device latency can have a significant impact on processing efficiency and power consumption when the external device requests multiple accesses to the stacked storage device.
JP S60262253 A discloses the following: An instruction sent from a CPU to a memory data processing circuit is interpreted by a control part. The control part then simultaneously controls a memory circuit and a processing part of each memory level via a control/address signal line. In this case, data can be transferred simultaneously from a given memory level to any number of other memory levels, as long as a data selector in the part selects the data on the side of a data bus. If the selector instead selects the data provided by the same memory level, the memory levels form completely independent data processing systems controlled by a single part. Thus, the memory levels can perform data processing concurrently and in parallel.
JP 2015176435 A discloses the following: An LSI chip lamination system comprises: a plurality of processor chips, each of which has one or more processors mounted on it capable of executing a process according to image data; a memory chip containing a memory capable of storing image data to be input and output by the processors; and super-parallel pass-through buses containing a plurality of signal lines connecting the laminated plurality of processor chips to the memory chip. The plurality of processor chips are configured to simultaneously read the image data stored in the memory chip via the super-parallel pass-through buses, and each process to be executed by the processors on the plurality of processor chips is configured to process the image data.
US 2017/0263306 A1 discloses the following: Devices and methods for logic/memory devices are provided. A device comprises, for example, a plurality of memory components that are adjacent to and coupled with each other. A logic component is coupled to the plurality of memory components. At least one memory component comprises a memory device with an array of memory cells and a sensing circuit coupled to the array. The sensing circuit comprises a sense amplifier and a computing component. A timing circuit is coupled to the array and the sensing circuit and is configured to control the timing of operations for the sensing circuit. The logic component comprises control logic coupled to the timing circuit. The control logic is configured to execute instructions to cause the sensing circuit to perform the operations.
US 2015/0348603 A1 discloses the following: A semiconductor memory device comprising a ZQ calibration unit configured to generate a pull-up VOH code according to a first target VOH proportional to a power supply voltage, and an output driver configured to generate a data signal with a VOH proportional to the power supply voltage based on the pull-up VOH code, where VOH means "high-level output voltage".
US 2017/0148496 A1 discloses the following: a semiconductor storage device comprising: a pull-up VOH control block configured to generate a first target VOH, wherein the voltage level of the first target VOH is proportional to a supply voltage; a ZQ calibration unit configured, during a ZQ calibration operation, to generate a pull-up VOH code in accordance with the first target VOH; and an output driver configured to generate a data signal with a first VOH level based on the pull-up VOH code, wherein the first VOH level is an output high-level voltage proportional to the power supply voltage, and wherein the first target VOH has a voltage level of supply voltage/2.5 or supply voltage/3.
SUMMARY
According to an exemplary embodiment of the inventive concept, a stacked storage device comprises: a logic semiconductor die; a plurality of storage semiconductor dies stacked with the logic semiconductor die, each storage semiconductor die having an integrated memory circuit and one or more of the storage semiconductor dies being a computation semiconductor die having a computation unit; and through-silicon vias electrically connecting the logic semiconductor die and the plurality of storage semiconductor dies, wherein each of the computation units is configured to perform calculations based on transmitted data and internal data, and to generate computation result data, wherein the transmitted data is provided in common to the computation semiconductor dies through the through-silicon vias, and the internal data is read from the integrated memory circuits of the computation semiconductor dies.
According to an exemplary embodiment of the inventive concept, a storage system comprises: a base substrate; at least one logic semiconductor die stacked on the base substrate; a plurality of storage semiconductor dies stacked on the base substrate or on the logic semiconductor die; and a plurality of computation units formed in one or more computation semiconductor dies among the plurality of storage semiconductor dies, each of the computation units being configured to perform calculations based on transmitted data and internal data and to generate calculation result data, the transmitted data being provided in common to the computation semiconductor dies, and the internal data being read from integrated memory circuits of the computation semiconductor dies.
According to an exemplary embodiment of the inventive concept, a method for operating a stacked storage device is provided, wherein the stacked storage device has a computation unit in each of a plurality of computation semiconductor dies which are stacked in a vertical direction, wherein the method comprises: providing transmitted data in common to each of the computation units via through-silicon vias which electrically connect the computation semiconductor dies; providing internal data, which is read from the integrated memory circuits of the computation semiconductor dies, to each of the computation units; and performing a plurality of calculations simultaneously, based on the transmitted data and the internal data, using the computation units.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features of the present inventive concept will be more clearly understood from a detailed description of exemplary embodiments thereof with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a method for operating a stacked memory device according to an exemplary embodiment of the inventive concept.
Fig. 2 is a perspective exploded view of a system comprising a stacked memory device according to an exemplary embodiment of the inventive concept.
Fig. 3 is a diagram illustrating an example of a high-bandwidth memory (HBM) organization.
Fig. 4 is a diagram illustrating a memory bank contained in the stacked memory device of Fig. 2 according to an exemplary embodiment of the inventive concept.
Fig. 5 is a diagram illustrating an integrated memory circuit contained in a memory semiconductor die of the stacked memory device of Fig. 2 according to an exemplary embodiment of the inventive concept.
Fig. 6 is a diagram illustrating a computation unit according to an exemplary embodiment of the inventive concept.
Fig. 7 is a diagram showing a data transmission path during a normal access operation in a stacked memory device according to an exemplary embodiment of the inventive concept.
Figs. 8A and 8B are diagrams showing implementations of the data transmission path of Fig. 7 according to exemplary embodiments of the inventive concept.
Figs. 9, 10, 11A, 11B, 12, 13, 14A, 14B, and 14C are diagrams illustrating a transmission path of transmitted data in a stacked storage device according to exemplary embodiments of the inventive concept.
Figs. 15, 16, 17, 18, 19, 20, 21, and 22 are diagrams illustrating a transmission path of output data from a computation unit in a stacked storage device according to exemplary embodiments of the inventive concept.
Figs. 23 and 24 are diagrams illustrating a transmission path of transmitted data in a stacked storage device according to exemplary embodiments of the inventive concept.
Fig. 25 is a diagram illustrating a computation unit contained in a stacked storage device according to an exemplary embodiment of the inventive concept.
Fig. 26 is a diagram illustrating the output of calculation result data according to an exemplary embodiment of the inventive concept.
Fig. 27 is a diagram illustrating a matrix calculation using a computation circuit according to an exemplary embodiment of the inventive concept.
Fig. 28 is a timing diagram illustrating the operation of a stacked storage device according to an exemplary embodiment of the inventive concept.
Figs. 29 and 30 are diagrams illustrating packaging structures of a stacked storage device according to exemplary embodiments of the inventive concept.
Fig. 31 is a block diagram illustrating a mobile system according to an exemplary embodiment of the inventive concept.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Exemplary embodiments of the present inventive concept are described in more detail below with reference to the accompanying drawings. In the drawings, the same reference numerals refer to the same elements.
Fig. 1 is a flowchart illustrating a method for operating a stacked storage device according to an exemplary embodiment of the inventive concept. Referring to Fig. 1, a plurality of computation units are formed in one or more computation semiconductor dies among a plurality of storage semiconductor dies stacked vertically (S100). Transmitted data is provided to the plurality of computation units by using through-silicon vias that electrically connect the plurality of storage semiconductor dies (S200). Internal data, which is read from integrated memory circuits of the computation semiconductor dies, is provided to the plurality of computation units (S300). A plurality of calculations are performed simultaneously, based on the transmitted data and the internal data, by using the plurality of computation units (S400). As such, the method for operating a stacked storage device according to the present embodiment can reduce the amount of data exchanged between the stacked storage device, the logic semiconductor die, and the external device.
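The reduction in exchanged data can be illustrated with a simple counting model. The sketch below is not taken from the patent: it assumes, purely for illustration, a matrix-vector product in which a `rows` x `cols` matrix is the internal data held in the integrated memory circuits and a vector of `cols` words is the transmitted data. The function names and the word-count cost model are hypothetical.

```python
def words_moved_without_in_memory_compute(rows: int, cols: int) -> int:
    """Assumed baseline: every stored matrix word is read out of the stack
    and the calculation is done outside the stacked device."""
    return rows * cols


def words_moved_with_in_memory_compute(rows: int, cols: int) -> int:
    """Assumed in-memory case: the vector is sent in once (cols words) and
    one calculation result per matrix row is sent back (rows words)."""
    return cols + rows


if __name__ == "__main__":
    rows, cols = 1024, 1024  # assumed example sizes, not from the patent
    outside = words_moved_without_in_memory_compute(rows, cols)
    inside = words_moved_with_in_memory_compute(rows, cols)
    print(f"outside the stack: {outside} words, inside the stack: {inside} words")
```

Under these assumptions a 1024 x 1024 matrix requires 1,048,576 words to be moved when the calculation is done outside the stack, but only 2,048 words when it is done by the computation units. Real savings depend on the workload and on bus widths that this model ignores.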
For example, memory-intensive or data-intensive data processing can be performed in parallel by the multiple computation units contained in the storage semiconductor dies. Consequently, the processing time and power consumption can be reduced.
Fig. 2 is a perspective exploded view of a system comprising a stacked storage device according to an exemplary embodiment of the inventive concept. Referring to Fig. 2, a system 10 comprises a stacked storage device 1000 and a host device 2000. The stacked storage device 1000 can comprise a base semiconductor die or a logic semiconductor die 1010 and a plurality of storage semiconductor dies 1070 and 1080, which are stacked with the logic semiconductor die 1010. Fig. 2 illustrates a non-limiting example of one logic semiconductor die and two storage semiconductor dies. For example, two or more logic semiconductor dies and one, three, or more storage semiconductor dies can be included in the stacked structure of Fig. 2. Additionally, Fig. 2 illustrates a non-limiting example in which the storage semiconductor dies 1070 and 1080 are stacked vertically with the logic semiconductor die 1010. As will be described below with reference to Fig. 29, the storage semiconductor dies 1070 and 1080 can be stacked vertically while the logic semiconductor die 1010 is not stacked with the storage semiconductor dies 1070 and 1080 but is electrically connected to them by an interposer or a wiring layer and/or a base substrate.
The logic semiconductor die 1010 can include a memory interface MIF 1020 and logic for accessing integrated memory circuits 1071 and 1081, which are formed in the storage semiconductor dies 1070 and 1080. The logic can include a control circuit CTRL 1030, a global buffer GBF 1040, and a data transformation logic DTL 1050. The memory interface 1020 can communicate with an external device, such as the host device 2000, via an intermediate connection device 12.
The control circuit 1030 can control the overall operation of the stacked memory device 1000. The data transformation logic 1050 can perform logic operations on data exchanged with the storage semiconductor dies 1070 and 1080, or on data exchanged via the memory interface 1020. For example, the data transformation logic 1050 can perform max pooling, rectified linear unit (ReLU) operations, channel-wise addition, etc.
The storage semiconductor dies 1070 and 1080 can include the integrated memory circuits 1071 and 1081, respectively. At least one of the storage semiconductor dies 1070 and 1080 can be a computation semiconductor die 1080, which includes a computation circuit 100. As will be described below, the computation circuit 100 can include one or more computation blocks, and each of the computation blocks can include one or more computation units. Each of the computation units can perform calculations based on transmitted data and internal data to provide computation result data. For example, the transmitted data can be provided in common to the computation semiconductor dies by using through-silicon vias (TSVs), and the internal data can be read from the integrated memory circuit of the corresponding computation semiconductor die.
The host device 2000 can have a host interface HIF 2110 and processor cores CR1 2120 and CR2 2130. The host interface 2110 can communicate with an external device, such as the stacked storage device 1000, via the intermediate connection device 12. The components of the host device 2000 can be arranged, for example, on a base semiconductor die, a logic semiconductor die, or a substrate 2100.
Fig. 3 is a diagram illustrating an exemplary high-bandwidth memory (HBM) organization. Referring to Fig. 3, an HBM 1001 can have a stack of several dynamic random access memory (DRAM) semiconductor dies 1100, 1200, 1300, and 1400. The HBM stack structure can be optimized by a plurality of independent interfaces, which are called channels.
Each DRAM stack can support up to eight channels in accordance with the HBM standards. Fig. 3 shows an example stack containing four DRAM semiconductor dies 1100, 1200, 1300, and 1400, in which each DRAM semiconductor die supports two channels, CHANNEL0 and CHANNEL1. For example, as illustrated in Fig. 3, the fourth storage semiconductor die 1400 can have two integrated memory circuits 1401 and 1402, which correspond to the two channels CHANNEL0 and CHANNEL1. The fourth storage semiconductor die 1400 can, for example, correspond to a computation semiconductor die, which has computation units. Each of the integrated memory circuits 1401 and 1402 can have a plurality of memory banks MB, and each memory bank MB can have a computation block CB. As described with reference to Fig. 4, each computation block CB can have a plurality of computation units CU. As such, the computation units can be distributed across the memory banks MB of the computation semiconductor die.
Each channel, for example CHANNEL0 and CHANNEL1, provides access to an independent set of DRAM banks. Requests from one channel cannot access data attached to a different channel. Channels are clocked independently and do not need to be synchronized. Each of the memory semiconductor dies 1100, 1200, 1300, and 1400 of the HBM 1001 can access the other memory semiconductor dies to transmit the transmitted data and/or the calculation result data.
The HBM 1001 may also include an interface die 1010 or a logic semiconductor die, which is located at one end of the stack structure to provide signal routing and other functions. Some functions of the DRAM semiconductor dies 1100, 1200, 1300, and 1400 may be implemented in the interface die 1010.
Fig. 4 is a diagram illustrating a memory bank contained in the stacked memory device of Fig. 2 according to an exemplary embodiment of the inventive concept. Referring to Fig. 4, a memory bank 200 can have a plurality of data blocks DBK1~DBKn and one computation block 300.
Fig. 4 illustrates a configuration of a first data block DBK1 as an example. The other data blocks DBK2~DBKn in Fig. 4 can have the same configuration as the first data block DBK1. Each data block can have a plurality of sub memory cell arrays SARR, and each sub memory cell array SARR can have a plurality of memory cells. In a read operation, bit line sense amplifiers BLSA can sense and amplify data stored in the memory cells to provide the read data sequentially outside the memory bank (for example, to an external device) via local input/output lines LIO and global input/output lines GIO. In a write operation, data provided from outside the memory bank (e.g., from an external device) can be sequentially stored in the memory cells via the global input/output lines GIO and the local input/output lines LIO.
The computation block 300 can have a plurality of computation units CU1~CUn. Fig. 4 illustrates a non-limiting example in which one computation unit is assigned to each data block; however, in accordance with an exemplary embodiment of the inventive concept, one computation unit can be assigned to every two or more data blocks. As described above, the computation units CU1~CUn can perform the calculations simultaneously based on transmitted data DA and internal data DW1~DWn. The transmitted data DA is provided in common to the computation units CU1~CUn, and the internal data DW1~DWn is read from the data blocks DBK1~DBKn of the corresponding memory bank.
Although Fig. 4 shows an exemplary arrangement of the computation units for one memory bank, the stacked storage device can have a plurality of computation semiconductor dies, each computation semiconductor die having a plurality of memory banks, and the computation units can be arranged as in Fig. 4 for all of the memory banks. All of the computation units of all memory banks can receive the common transmitted data and the internal data from their respective data blocks.
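The arrangement of Fig. 4 can be modelled in a few lines. The following is a minimal sketch under stated assumptions, not the patented circuit: it assumes that each computation unit performs a multiply-accumulate over its two inputs (the text above only says "calculations"), and the function names `computation_unit` and `computation_block` are hypothetical. The list comprehension runs sequentially and merely stands in for units that operate simultaneously in hardware.

```python
from typing import List, Sequence


def computation_unit(da: Sequence[int], dw: Sequence[int]) -> int:
    """One CU: combines the common transmitted data DA with its own internal
    data DW. Multiply-accumulate is an assumed operation for illustration."""
    if len(da) != len(dw):
        raise ValueError("DA and DW must have the same length")
    return sum(a * w for a, w in zip(da, dw))


def computation_block(da: Sequence[int],
                      data_blocks: Sequence[Sequence[int]]) -> List[int]:
    """CU_i receives the same DA and the internal data DW_i read from DBK_i."""
    return [computation_unit(da, dw) for dw in data_blocks]


if __name__ == "__main__":
    da = [1, 2, 3]                       # transmitted data, shared by all CUs
    dbk = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]  # internal data DW1..DW3
    print(computation_block(da, dbk))
```

Only DA enters the bank and only one result per computation unit leaves it; the internal data DW1~DWn never crosses the bank boundary.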
As such, the amount of data exchanged between the stacked storage device, the logic semiconductor die, and the external device can be reduced. For example, memory-intensive or data-intensive data processing can be performed in parallel by the plurality of computation units distributed within the storage semiconductor dies. This means that data processing time and power consumption can be reduced.
Fig. 5 is a diagram illustrating an integrated memory circuit according to an exemplary embodiment of the inventive concept. A DRAM is described as an example of the integrated memory circuits formed in the memory semiconductor dies, with reference to Fig. 5. The stacked memory device can be implemented with any of a variety of memory cell architectures, including, but not limited to, volatile memory architectures such as DRAM, thyristor RAM (TRAM), and static RAM (SRAM), or non-volatile memory architectures such as read-only memory (ROM), flash memory, ferroelectric RAM (FRAM), magnetoresistive RAM (MRAM), and the like.
Referring to Fig. 5, an integrated memory circuit 400 can include a control logic 410, an address register 420, a bank control logic 430, a row address multiplexer 440, a column address latch 450, a row decoder 460, a column decoder 470, a memory cell arrangement 480, a computation circuit 100, an input/output (I/O) gating circuit 490, a data input/output (I/O) buffer 495, and a refresh counter 445. The memory cell arrangement 480 can have a plurality of bank arrangements 480a~480h. The row decoder 460 can have a plurality of bank row decoders 460a~460h, respectively coupled to the bank arrangements 480a~480h, and the column decoder 470 can have a plurality of bank column decoders 470a~470h, respectively coupled to the bank arrangements 480a~480h. The computation circuit 100 can have a plurality of computation blocks CB 100a~100h, respectively coupled to the bank arrangements 480a~480h.
As described above, each of the computation blocks 100a~100h can have a plurality of computation units which receive the common transmitted data and the respective internal data from the bank arrangements 480a~480h.
The address register 420 can receive an address ADDR, which contains a bank address BANK_ADDR, a row address ROW_ADDR, and a column address COL_ADDR, from a memory controller. The address register 420 can provide the received bank address BANK_ADDR to the bank control logic 430, the received row address ROW_ADDR to the row address multiplexer 440, and the received column address COL_ADDR to the column address latch 450.
The bank control logic 430 can generate bank control signals in response to the bank address BANK_ADDR. One of the bank row decoders 460a~460h corresponding to the bank address BANK_ADDR can be activated in response to the bank control signals, and one of the bank column decoders 470a~470h corresponding to the bank address BANK_ADDR can be activated in response to the bank control signals.
The row address multiplexer 440 can receive the row address ROW_ADDR from the address register 420 and can receive a refresh row address REF_ADDR from the refresh counter 445. The row address multiplexer 440 can selectively output either the row address ROW_ADDR or the refresh row address REF_ADDR as a row address RA. The row address RA output by the row address multiplexer 440 can be applied to the bank row decoders 460a~460h.
The activated one of the bank row decoders 460a~460h can decode the row address RA, which is output by the row address multiplexer 440, and can activate a word line corresponding to the row address RA. For example, the activated bank row decoder can apply a word line driver voltage to the word line corresponding to the row address RA.
The column address latch 450 can receive the column address COL_ADDR from the address register 420 and can temporarily store the received column address COL_ADDR.
In an exemplary embodiment of the inventive concept, in a burst mode, the column address latch 450 can generate column addresses by incrementing the received column address COL_ADDR. The column address latch 450 can apply the temporarily stored or generated column address to the bank column decoders 470a~470h.
The activated one of the bank column decoders 470a~470h can decode the column address COL_ADDR, which is output by the column address latch 450, and can control the input/output gating circuit 490 to output the data corresponding to the column address COL_ADDR. The I/O gating circuit 490 can include circuits for gating input/output data. The I/O gating circuit 490 can also include read data latches for storing data output by the bank arrangements 480a~480h, and write drivers for writing data to the bank arrangements 480a~480h.
Data to be read from one of the bank arrangements 480a~480h can be sensed by one of the sense amplifiers coupled to that bank arrangement and can be stored in the read data latches. The data stored in the read data latches can be made available to the memory controller via the data I/O buffer 495. Data DQ to be written to one of the bank arrangements 480a~480h can be provided to the data I/O buffer 495 by the memory controller. The write driver can write the data DQ to that bank arrangement.
The control logic 410 can control operations of the integrated memory circuit 400. For example, the control logic 410 can generate control signals for the integrated memory circuit 400 to perform a write or read operation. The control logic 410 can include an instruction decoder 411, which decodes an instruction CMD received from the memory controller, and a mode register set 412, which sets an operating mode of the integrated memory circuit 400.
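The address path described above (address register 420, row address multiplexer 440, column address latch 450) can be sketched as follows. This is an illustrative model only: the field widths, the bank/row/column bit ordering within ADDR, the wrap-around behaviour of the burst counter, and all function names are assumptions that do not appear in the patent. The three bank bits are chosen only because Fig. 5 shows eight bank arrangements 480a~480h.

```python
from dataclasses import dataclass
from typing import List

# Assumed field widths (hypothetical).
BANK_BITS, ROW_BITS, COL_BITS = 3, 14, 10


@dataclass(frozen=True)
class DecodedAddress:
    bank: int  # BANK_ADDR, goes to the bank control logic
    row: int   # ROW_ADDR, goes to the row address multiplexer
    col: int   # COL_ADDR, goes to the column address latch


def split_address(addr: int) -> DecodedAddress:
    """Address register: split ADDR into its three fields
    (assumed layout: bank | row | column, column in the low bits)."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return DecodedAddress(bank, row, col)


def select_row_address(row_addr: int, ref_addr: int, refresh: bool) -> int:
    """Row address multiplexer: output ROW_ADDR or REF_ADDR as RA."""
    return ref_addr if refresh else row_addr


def burst_columns(col_addr: int, burst_length: int) -> List[int]:
    """Column address latch in burst mode: successive incremented column
    addresses (wrapping at the column width is an assumption)."""
    return [(col_addr + i) % (1 << COL_BITS) for i in range(burst_length)]


if __name__ == "__main__":
    decoded = split_address((5 << 24) | (0x123 << 10) | 0x3F)
    print(decoded)
    print(select_row_address(decoded.row, ref_addr=7, refresh=False))
    print(burst_columns(decoded.col, burst_length=4))
```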
For example, the instruction decoder 411 can generate the control signals corresponding to the instruction CMD by decoding a write enable signal, a row address strobe signal, a column address strobe signal, a chip select signal, etc.
Fig. 6 is a diagram illustrating a computation unit according to an exemplary embodiment of the inventive concept. Referring to Fig. 6, each computation unit CU can have first input terminals connected to first nodes N1, which receive the internal data DW[N-1:0], and second input terminals connected to second nodes N2, which receive the transmitted data DA[N-1:0]. The first nodes N1 are connected to output terminals of an input/output sense amplifier IOSA, which amplifies signals on global input/output lines GIO and GIOB and outputs the amplified signals. The second nodes N2 are connected to input terminals of an input/output driver IODRV, which drives the global input/output lines GIO and GIOB.
During a normal read operation, the computation unit CU is disabled, and the input/output sense amplifier IOSA amplifies the read data provided via the global input/output lines GIO and GIOB to provide the amplified signals to the outside. During a normal write operation, the computation unit CU is disabled, and the input/output driver IODRV drives the global input/output lines GIO and GIOB based on the write data provided from the outside. During a computation operation, the computation unit CU is enabled to receive the transmitted data DA[N-1:0] and the internal data DW[N-1:0]. In this case, the input/output sense amplifier IOSA is enabled to output the internal data DW[N-1:0], and the input/output driver IODRV is disabled to prevent the transmitted data DA[N-1:0] from being provided to the internal memory cells.
In an exemplary embodiment of the inventive concept, as illustrated in Fig. 6, the output terminals of the computation unit CU, which provide the computation result data DR, can be connected to the first nodes N1, in other words, to the output terminals of the input/output sense amplifier IOSA. Accordingly, the computation result data DR can be provided to the outside via the normal read path. The input/output sense amplifier IOSA is disabled while the computation unit CU provides the computation result data DR. In another exemplary embodiment of the inventive concept, the output terminals of the computation unit CU are not connected to the first nodes N1, and the computation result data DR can be provided via an additional data path, distinct from the normal read path. In an exemplary embodiment of the inventive concept, the output terminals of the computation unit CU can be connected to the second nodes N2 in order to store the computation result data DR in the memory cells via the normal write path.
Fig. 6 illustrates a single differential global line pair GIO and GIOB for convenience; however, each computation unit CU can be connected to N global line pairs to receive N bits of transmitted data DA[N-1:0] and N bits of internal data DW[N-1:0]. For example, N can be 8, 16, or 21 depending on the operating modes of the stacked storage device.
The following describes data transmission paths of a stacked storage device according to exemplary embodiments of the inventive concept with reference to Figs. 7 to 24. Although one logic semiconductor die 1010 and first to fourth memory semiconductor dies 1100, 1200, 1300, and 1400 are illustrated in Figs. 7 to 24, the number of logic semiconductor dies and memory semiconductor dies may differ.
Fig. 7 is a diagram illustrating a data transmission path during a normal access operation in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 8A and Fig. 8B are diagrams illustrating implementations of the data transmission path of Fig. 7 according to exemplary embodiments of the inventive concept.
Referring to the stacked storage device of Fig. 7, data can be exchanged between the logic semiconductor die 1010 and the first to fourth memory semiconductor dies 1100, 1200, 1300, and 1400 by means of first to fourth data buses DBUS1~DBUS4, which respectively correspond to the first to fourth memory semiconductor dies 1100, 1200, 1300, and 1400. In other words, during normal read and write operations, data can be exchanged between the logic semiconductor die 1010 and the first memory semiconductor die 1100 via the first data bus DBUS1, between the logic semiconductor die 1010 and the second memory semiconductor die 1200 via the second data bus DBUS2, between the logic semiconductor die 1010 and the third memory semiconductor die 1300 via the third data bus DBUS3, and between the logic semiconductor die 1010 and the fourth memory semiconductor die 1400 via the fourth data bus DBUS4. During normal read and write operations, data may not be exchanged between the memory semiconductor dies 1100, 1200, 1300, and 1400.
Each of the data buses DBUS1~DBUS4 can have a plurality of data paths, and each data path can extend in the vertical direction by connecting the through-silicon vias which are formed in the memory semiconductor dies 1100, 1200, 1300, and 1400.
Referring to Figs. 8A and 8B, the logic semiconductor die 1010 and the memory semiconductor dies 1100, 1200, 1300, and 1400 can each have transmission circuits TX and receiving circuits RX to perform bidirectional communication via the data buses DBUS1~DBUS4. The transmission circuits TX and the receiving circuits RX corresponding to the first to fourth data buses DBUS1~DBUS4 can be implemented in all of the memory semiconductor dies 1100, 1200, 1300, and 1400. This allows for a unified manufacturing process.
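The one-bus-per-die routing of Fig. 7 can be expressed as a small lookup. The sketch below is only an illustration of the rule stated above (each memory semiconductor die exchanges data with the logic semiconductor die over its own data bus, and memory dies do not exchange data with each other during normal access). The function name `normal_access_bus` and the use of reference numerals as identifiers are hypothetical.

```python
LOGIC_DIE = 1010
MEMORY_DIES = (1100, 1200, 1300, 1400)

# Die i is paired with DBUSi, following the description of Fig. 7.
BUS_OF_DIE = {die: f"DBUS{i}" for i, die in enumerate(MEMORY_DIES, start=1)}


def normal_access_bus(src: int, dst: int) -> str:
    """Return the data bus used for a normal read or write between src and dst.

    Exactly one endpoint must be the logic die and the other a memory die;
    anything else (e.g. memory die to memory die) is rejected.
    """
    endpoints = {src, dst}
    if LOGIC_DIE not in endpoints or len(endpoints) != 2:
        raise ValueError("normal access is between the logic die and one memory die")
    (memory_die,) = endpoints - {LOGIC_DIE}
    if memory_die not in BUS_OF_DIE:
        raise ValueError(f"unknown memory die {memory_die}")
    return BUS_OF_DIE[memory_die]


if __name__ == "__main__":
    print(normal_access_bus(LOGIC_DIE, 1300))  # write direction
    print(normal_access_bus(1400, LOGIC_DIE))  # read direction
```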
Furthermore, the transmission circuits TX and the receiving circuits RX can be selectively activated for the required data communication.

Fig. 8A illustrates a data transmission path corresponding to a normal write operation, and Fig. 8B illustrates a data transmission path corresponding to a normal read operation.

Referring to Fig. 8A, during the normal write operation, the transmission circuits TX of the logic semiconductor die 1010 and the receiving circuits RX of the memory semiconductor dies 1100, 1200, 1300, and 1400 can be activated to transfer write data from the logic semiconductor die 1010 to the memory semiconductor dies 1100, 1200, 1300, and 1400 via the data buses DBUS1~DBUS4, respectively. A first transmission circuit TX1 of the logic semiconductor die 1010 and a first receiving circuit RX11 of the first memory semiconductor die 1100 can be activated to transmit first write data WR1 via the first data bus DBUS1. A second transmission circuit TX2 of the logic semiconductor die 1010 and a second receiving circuit RX22 of the second memory semiconductor die 1200 can be activated to transmit second write data WR2 via the second data bus DBUS2. A third transmission circuit TX3 of the logic semiconductor die 1010 and a third receiving circuit RX33 of the third memory semiconductor die 1300 can be activated to transmit third write data WR3 via the third data bus DBUS3. A fourth transmission circuit TX4 of the logic semiconductor die 1010 and a fourth receiving circuit RX44 of the fourth memory semiconductor die 1400 can be activated to transmit fourth write data WR4 via the fourth data bus DBUS4. In Fig. 8A, the transmission and receiving circuits drawn in bold are the activated circuits.

Referring to Fig.
8B, during the normal read operation, the transmission circuits TX of the memory semiconductor dies 1100, 1200, 1300, and 1400 and the receiving circuits RX of the logic semiconductor die 1010 can be activated to transfer read data from the memory semiconductor dies 1100, 1200, 1300, and 1400 to the logic semiconductor die 1010 via the data buses DBUS1~DBUS4, respectively. A first transmission circuit TX11 of the first memory semiconductor die 1100 and a first receiving circuit RX1 of the logic semiconductor die 1010 can be activated to transmit first read data RD1 via the first data bus DBUS1. A second transmission circuit TX22 of the second memory semiconductor die 1200 and a second receiving circuit RX2 of the logic semiconductor die 1010 can be activated to transmit second read data RD2 via the second data bus DBUS2. A third transmission circuit TX33 of the third memory semiconductor die 1300 and a third receiving circuit RX3 of the logic semiconductor die 1010 can be activated to transmit third read data RD3 via the third data bus DBUS3. A fourth transmission circuit TX44 of the fourth memory semiconductor die 1400 and a fourth receiving circuit RX4 of the logic semiconductor die 1010 can be activated to transmit fourth read data RD4 via the fourth data bus DBUS4. In Fig. 8B, the transmission and receiving circuits drawn in bold are the activated circuits. As such, during normal read and write operations, data can be transmitted via the data buses DBUS1~DBUS4, which correspond to the memory semiconductor dies 1100, 1200, 1300, and 1400, respectively.

Figs. 9 to 24 illustrate data transmission paths for a computation operation according to exemplary embodiments of the inventive concept. In Figs. 9 to 24, certain configurations and operations are the same as those shown and described with reference to Figs.
7, 8A, and 8B, and therefore repeated descriptions may be omitted.

Fig. 9 is a diagram illustrating a transmission path of transmit data in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 10 is a diagram illustrating an implementation of the data transmission path of Fig. 9 according to an exemplary embodiment of the inventive concept.

Some of the stacked memory semiconductor dies 1100, 1200, 1300, and 1400 can be computation semiconductor dies, which include a computation circuit CAL, and the other stacked memory semiconductor dies can be input-output semiconductor dies, which do not include the computation circuit CAL. Fig. 9 illustrates a non-limiting example in which the first, second, and third memory semiconductor dies 1100, 1200, and 1300 are the computation semiconductor dies and the fourth memory semiconductor die 1400 is the input-output semiconductor die.

Referring to Fig. 9, transmit data DA can be transferred directly from the input-output semiconductor die 1400 to the computation semiconductor dies 1100, 1200, and 1300 without passing through the logic semiconductor die 1010. The input-output semiconductor die 1400 can simultaneously drive the data buses DBUS1~DBUS4, which correspond to the respective memory semiconductor dies 1100, 1200, 1300, and 1400, with the transmit data DA. The computation semiconductor dies 1100, 1200, and 1300 can receive the transmit data DA via the data buses DBUS1~DBUS3, which correspond to the respective computation semiconductor dies 1100, 1200, and 1300.

Referring to Fig. 10, a first transmission circuit TX41 of the input-output semiconductor die 1400 and a first receiving circuit RX11 of the first computation semiconductor die 1100 can be activated to transmit the transmit data DA via the first data bus DBUS1.
A second transmission circuit TX42 of the input-output semiconductor die 1400 and a second receiving circuit RX22 of the second computation semiconductor die 1200 can be activated to transmit the transmit data DA via the second data bus DBUS2. A third transmission circuit TX43 of the input-output semiconductor die 1400 and a third receiving circuit RX33 of the third computation semiconductor die 1300 can be activated to transmit the transmit data DA via the third data bus DBUS3.

In the embodiment shown in Fig. 10, the transmit data DA is transferred by selectively activating the transmission circuits and the receiving circuits. In the embodiments shown in Figs. 11A and 11B, the transmit data DA can be transferred by selectively connecting the data buses.

Figs. 11A and 11B are diagrams illustrating implementations of the data transmission path of Fig. 9 according to exemplary embodiments of the inventive concept.

Referring to Fig. 11A, switching circuits SW1, SW2, and SW3 can be connected between adjacent ones of the data buses DBUS1~DBUS4. The switching circuits SW1, SW2, and SW3 are switched on in response to switching control signals SCON1, SCON2, and SCON3, respectively. All of the data buses DBUS1~DBUS4 can be electrically connected when all of the switching circuits SW1, SW2, and SW3 are switched on. In this case, the transmit data DA can be transmitted to the computation semiconductor dies 1100, 1200, and 1300 via the first, second, and third data buses DBUS1, DBUS2, and DBUS3, respectively, even if the input-output semiconductor die 1400 drives only the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400.

Referring to Fig. 11B, switching circuits SW1, SW2, and SW3 can be connected between the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400, and each of the data buses DBUS1, DBUS2, and DBUS3, which correspond to the computation semiconductor dies 1100, 1200, and 1300, respectively.
The switching circuits SW1, SW2, and SW3 are switched on in response to switching control signals SCON1, SCON2, and SCON3, respectively. All of the data buses DBUS1~DBUS4 can be electrically connected when all of the switching circuits SW1, SW2, and SW3 are switched on. In this case, the transmit data DA can be transmitted to the computation semiconductor dies 1100, 1200, and 1300 via the first, second, and third data buses DBUS1, DBUS2, and DBUS3, respectively, even if the input-output semiconductor die 1400 drives only the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400.

Fig. 12 is a diagram illustrating a transmission path of transmit data in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 13 is a diagram illustrating an implementation of the data transmission path of Fig. 12 according to an exemplary embodiment of the inventive concept.

Referring to Fig. 12, the transmit data DA can be transferred directly from the input-output semiconductor die 1400 to the computation semiconductor dies 1100, 1200, and 1300 without passing through the logic semiconductor die 1010. The input-output semiconductor die 1400 can drive the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400, with the transmit data DA, and the computation semiconductor dies 1100, 1200, and 1300 can receive the transmit data DA via the fourth data bus DBUS4.

Referring to Fig.
13, a fourth receiving circuit RX14 of the first computation semiconductor die 1100, a fourth receiving circuit RX24 of the second computation semiconductor die 1200, and a fourth receiving circuit RX34 of the third computation semiconductor die 1300 can be activated simultaneously when a fourth transmission circuit TX44 of the input-output semiconductor die 1400 is activated, such that the transmit data DA can be transmitted simultaneously to all of the computation semiconductor dies 1100, 1200, and 1300 via the fourth data bus DBUS4.

In the embodiment shown in Fig. 13, the transmit data DA is transferred by selectively activating the transmission circuits and the receiving circuits. In the embodiments shown in Figs. 14A, 14B, and 14C, the transmit data DA can be transferred by selectively connecting the data buses.

Figs. 14A, 14B, and 14C are diagrams illustrating implementations of the data transmission path of Fig. 12 according to exemplary embodiments of the inventive concept.

Referring to Figs. 14A, 14B, and 14C, switching circuits SW1, SW2, and SW3 can be connected between the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400, and each of the data buses DBUS1, DBUS2, and DBUS3, which correspond to the computation semiconductor dies 1100, 1200, and 1300, respectively. The switching circuits SW1, SW2, and SW3 are switched on in response to switching control signals SCON1, SCON2, and SCON3, respectively.
All of the data buses DBUS1~DBUS4 can be electrically connected when all of the switching circuits SW1, SW2, and SW3 are switched on. In this case, the transmit data DA can be transmitted to the computation semiconductor dies 1100, 1200, and 1300 via the first, second, and third data buses DBUS1, DBUS2, and DBUS3 by activating the receiving circuits RX11, RX22, and RX33, which correspond to the first, second, and third data buses DBUS1, DBUS2, and DBUS3 in the respective computation semiconductor dies 1100, 1200, and 1300, even if the input-output semiconductor die 1400 drives only the fourth data bus DBUS4.

In exemplary embodiments of the inventive concept, as described below with reference to Figs. 15 to 18, computation result data DR1, DR2, and DR3, which are output by the computation circuits CAL in the computation semiconductor dies 1100, 1200, and 1300, are transferred from the computation semiconductor dies 1100, 1200, and 1300 to the logic semiconductor die 1010 and then from the logic semiconductor die 1010 to the input-output semiconductor die 1400. In other exemplary embodiments of the inventive concept, as described below with reference to Figs. 19 to 22, the computation result data DR1, DR2, and DR3 can be transferred directly from the computation semiconductor dies 1100, 1200, and 1300 to the input-output semiconductor die 1400 without passing through the logic semiconductor die 1010.

Fig. 15 is a diagram illustrating a first transmission path of output data from computation circuits in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 16 is a diagram illustrating an implementation of the first transmission path of Fig. 15 according to an exemplary embodiment of the inventive concept.

Referring to Fig.
15, the computation result data DR1, DR2, and DR3 can be transmitted simultaneously from the computation semiconductor dies 1100, 1200, and 1300 to the logic semiconductor die 1010 via the data buses DBUS1, DBUS2, and DBUS3, which correspond to the computation semiconductor dies 1100, 1200, and 1300, respectively. As described with reference to Fig. 2, the logic semiconductor die 1010 can have a global buffer 1040, and the computation result data DR1, DR2, and DR3 from the computation semiconductor dies 1100, 1200, and 1300 can be stored in the global buffer 1040.

Referring to Fig. 16, a first transmission circuit TX11 of the first computation semiconductor die 1100 and a first receiving circuit RX1 of the logic semiconductor die 1010 can be activated to transmit the computation result data DR1 via the first data bus DBUS1. A second transmission circuit TX22 of the second computation semiconductor die 1200 and a second receiving circuit RX2 of the logic semiconductor die 1010 can be activated to transmit the computation result data DR2 via the second data bus DBUS2. A third transmission circuit TX33 of the third computation semiconductor die 1300 and a third receiving circuit RX3 of the logic semiconductor die 1010 can be activated to transmit the computation result data DR3 via the third data bus DBUS3. The transfer of the computation result data DR1, DR2, and DR3 can be performed simultaneously for all of the computation semiconductor dies 1100, 1200, and 1300.

Fig. 17 is a diagram illustrating a second transmission path of output data from computation circuits in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 18 is a diagram illustrating an implementation of the second transmission path of Fig. 17 according to an exemplary embodiment of the inventive concept.

Referring to Fig.
17, computation result data DR can be transmitted sequentially from the logic semiconductor die 1010 to the input-output semiconductor die 1400 via the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400, using a time-division multiplexing scheme. The computation result data DR transmitted from the logic semiconductor die 1010 to the input-output semiconductor die 1400 can be the same as the computation result data DR1, DR2, and DR3 output by the computation semiconductor dies 1100, 1200, and 1300, respectively, or can be data processed by the data transformation logic 1050 in Fig. 2.

Referring to Fig. 18, a fourth transmission circuit TX4 of the logic semiconductor die 1010 and a fourth receiving circuit RX44 of the input-output semiconductor die 1400 can be activated to transmit the computation result data DR via the fourth data bus DBUS4. In this case, the computation result data DR can be stored in the integrated memory circuit of the input-output semiconductor die 1400 by means of a normal write operation. If the amount of computation result data DR is large, the computation result data DR can be transmitted and stored using a time-division multiplexing scheme.

Fig. 19 is a diagram illustrating a transmission path of output data from computation circuits in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 20 is a diagram illustrating an implementation of the transmission path of Fig. 19 according to an exemplary embodiment of the inventive concept.

Referring to Fig. 19, the computation result data DR1, DR2, and DR3 can be transferred directly from the computation semiconductor dies 1100, 1200, and 1300 to the input-output semiconductor die 1400 without passing through the logic semiconductor die 1010.
Each of the computation semiconductor dies 1100, 1200, and 1300 can drive its corresponding data bus DBUS1, DBUS2, or DBUS3 with its computation result data DR1, DR2, or DR3, respectively. The input-output semiconductor die 1400 can sequentially receive the computation result data DR1, DR2, and DR3 via the data buses DBUS1, DBUS2, and DBUS3, which correspond to the computation semiconductor dies 1100, 1200, and 1300.

Referring to Fig. 20, a first transmission circuit TX11 of the first computation semiconductor die 1100 and a first receiving circuit RX41 of the input-output semiconductor die 1400 can be activated to transmit the computation result data DR1 via the first data bus DBUS1. A second transmission circuit TX22 of the second computation semiconductor die 1200 and a second receiving circuit RX42 of the input-output semiconductor die 1400 can be activated to transmit the computation result data DR2 via the second data bus DBUS2. A third transmission circuit TX33 of the third computation semiconductor die 1300 and a third receiving circuit RX43 of the input-output semiconductor die 1400 can be activated to transmit the computation result data DR3 via the third data bus DBUS3. The transfer of the computation result data DR1, DR2, and DR3 can be performed sequentially for the computation semiconductor dies 1100, 1200, and 1300.

Fig. 21 is a diagram illustrating a transmission path of output data from computation circuits in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 22 is a diagram illustrating an implementation of the transmission path of Fig. 21 according to an exemplary embodiment of the inventive concept.

Referring to Fig.
21, the computation result data DR1, DR2, and DR3 can be transmitted directly from the computation semiconductor dies 1100, 1200, and 1300 to the input-output semiconductor die 1400 without passing through the logic semiconductor die 1010. The computation semiconductor dies 1100, 1200, and 1300 can sequentially drive the fourth data bus DBUS4, which corresponds to the input-output semiconductor die 1400, with the computation result data DR1, DR2, and DR3, and the input-output semiconductor die 1400 can sequentially receive the computation result data DR1, DR2, and DR3 via the fourth data bus DBUS4.

Referring to Fig. 22, a fourth transmission circuit TX14 of the first computation semiconductor die 1100, a fourth transmission circuit TX24 of the second computation semiconductor die 1200, and a fourth transmission circuit TX34 of the third computation semiconductor die 1300 can be activated to sequentially drive the fourth data bus DBUS4 with the computation result data DR1, DR2, and DR3. The fourth receiving circuit RX44 of the input-output semiconductor die 1400 can remain activated to sequentially receive the computation result data DR1, DR2, and DR3.

Fig. 23 is a diagram illustrating a transmission path of transmit data in a stacked storage device according to an exemplary embodiment of the inventive concept, and Fig. 24 is a diagram illustrating an implementation of the data transmission path of Fig. 23 according to an exemplary embodiment of the inventive concept.

As illustrated in Fig. 23, each of the stacked memory semiconductor dies 1100, 1200, 1300, and 1400 can be a computation semiconductor die that includes the computation circuit CAL. In this case, the transmit data DA can be transferred from one of the computation semiconductor dies 1100, 1200, 1300, and 1400 to the other computation semiconductor dies.
The computation semiconductor die that is to provide the transmit data DA can be determined based on a command provided by the logic semiconductor die 1010. Fig. 23 illustrates a non-limiting example in which the third computation semiconductor die 1300 provides the transmit data DA.

Referring to Fig. 23, the transmit data DA can be transferred directly from one computation semiconductor die, in other words the third computation semiconductor die 1300, to the other computation semiconductor dies, in other words the first, second, and fourth computation semiconductor dies 1100, 1200, and 1400, without passing through the logic semiconductor die 1010. The third computation semiconductor die 1300 can simultaneously drive the first, second, and fourth data buses DBUS1, DBUS2, and DBUS4, which correspond to the first, second, and fourth computation semiconductor dies 1100, 1200, and 1400, respectively, with the transmit data DA. The first, second, and fourth computation semiconductor dies 1100, 1200, and 1400 can receive the transmit data DA via the data buses DBUS1, DBUS2, and DBUS4, respectively.

Referring to Fig. 24, a first transmission circuit TX31 of the third computation semiconductor die 1300 and a first receiving circuit RX11 of the first computation semiconductor die 1100 can be activated to transmit the transmit data DA via the first data bus DBUS1. A second transmission circuit TX32 of the third computation semiconductor die 1300 and a second receiving circuit RX22 of the second computation semiconductor die 1200 can be activated to transmit the transmit data DA via the second data bus DBUS2. A fourth transmission circuit TX34 of the third computation semiconductor die 1300 and a fourth receiving circuit RX44 of the fourth computation semiconductor die 1400 can be activated to transmit the transmit data DA via the fourth data bus DBUS4.
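
The circuit pairs listed for Fig. 24 follow the same pattern as those listed for Fig. 10, and that pattern can be sketched in a few lines of Python. The sketch assumes, as an inference from the reference signs rather than from anything the description states, that TXij denotes the transmission circuit of memory die i on data bus DBUSj and that RXij denotes the receiving circuit of die i on DBUSj. The function name `transmit_plan` and the 1-based die indices are hypothetical.

```python
def transmit_plan(source_die, target_dies):
    """Return (bus, tx, rx) triples for sending DA from source_die to target_dies.

    Assumption: TXij / RXij name the circuit of die i on bus DBUSj, and each
    target die receives on the bus that corresponds to that target die.
    """
    plan = []
    for die in sorted(target_dies):
        if die == source_die:
            continue  # the source die's own input path is described separately
        plan.append((f"DBUS{die}", f"TX{source_die}{die}", f"RX{die}{die}"))
    return plan


# Fig. 24: die 3 (1300) sends to dies 1, 2, and 4 (1100, 1200, 1400)
fig24 = transmit_plan(source_die=3, target_dies=[1, 2, 4])
# Fig. 10: die 4 (1400) sends to dies 1, 2, and 3 (1100, 1200, 1300)
fig10 = transmit_plan(source_die=4, target_dies=[1, 2, 3])

for bus, tx, rx in fig24:
    print(bus, tx, rx)
```

Under that naming assumption the same function reproduces both enumerations, which suggests that the choice of the providing die only changes which transmission circuits are enabled, not the receiving side.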
The transmission of the transmit data DA can be performed simultaneously for all of the other computation semiconductor dies 1100, 1200, and 1400. It should be understood that the third transmission circuit TX33 and the third receiving circuit RX33 of the third computation semiconductor die 1300 can be activated to transmit the transmit data DA to the input terminals of the computation units of the third computation semiconductor die 1300. In this case, the third computation semiconductor die 1300 can perform the computations like the other computation semiconductor dies 1100, 1200, and 1400, in addition to providing the transmit data DA to the other computation semiconductor dies 1100, 1200, and 1400.

Fig. 25 is a diagram illustrating a computation unit included in a stacked storage device according to an exemplary embodiment of the inventive concept.

Referring to Fig. 25, each computation unit 500 can include a multiplication circuit 520 and an accumulation circuit 540. The multiplication circuit 520 can include buffers 521 and 522 and a multiplier 523, which is configured to multiply the transmit data DA[N-1:0] by the internal data DW[N-1:0]. The accumulation circuit 540 can include an adder 541 and a buffer 542 to accumulate the outputs of the multiplication circuit 520 and provide the computation result data DR. The accumulation circuit 540 can be initialized in response to a reset signal RST and can output the computation result data DR in response to an output enable signal OUTEN. Using the computation units illustrated in Fig. 25, a matrix computation can be performed efficiently, as will be described with reference to Fig. 27.

Fig. 26 is a diagram illustrating the output of computation result data according to an exemplary embodiment of the inventive concept. Fig. 26 illustrates the output of the computation result data corresponding to a channel CHANNEL-0.
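
Returning briefly to the unit of Fig. 25, its multiply-accumulate behavior can be illustrated with a minimal behavioral sketch in Python. The class and method names are hypothetical, the buffers 521 and 522 are not modeled, and bit widths are ignored (Python integers do not overflow), so this is only an illustration of the described data flow, not of the actual circuit.

```python
class ComputationUnit:
    """Behavioral model of the unit of Fig. 25: multiplier 523, adder 541, buffer 542."""

    def __init__(self):
        self.acc = 0  # stands in for buffer 542

    def reset(self):
        """Models the reset signal RST: clear the accumulated value."""
        self.acc = 0

    def step(self, da, dw):
        """Multiply one word of transmit data DA by one word of internal data DW
        and add the product to the accumulated value."""
        self.acc += da * dw

    def output(self, outen=True):
        """Models OUTEN: the result DR is provided only while the enable is active."""
        return self.acc if outen else None


cu = ComputationUnit()
cu.reset()
for da, dw in zip([1, 2, 3], [4, 5, 6]):
    cu.step(da, dw)
dr = cu.output()
print(dr)  # 1*4 + 2*5 + 3*6 = 32
```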
A single channel CHANNEL-0 can include multiple memory banks BANK0~BANK15, and each of the memory banks BANK0~BANK15 can include multiple computation units CU0~CU15. The memory banks BANK0~BANK15 can be divided between two pseudo-channels PSE-0 and PSE-1. Each computation semiconductor die can further include multiple bank adders 610a~610p. Each of the bank adders 610a~610p can sum the outputs of the computation units CU0~CU15 in the corresponding one of the memory banks BANK0~BANK15 to generate the corresponding one of the bank result signals BR0~BR15. The bank result signals BR0~BR15 can be output simultaneously via the data bus DBUS that corresponds to each computation semiconductor die. For example, if the data bus corresponding to one computation semiconductor die has a data width of 128 bits and one channel CHANNEL-0 includes 16 memory banks BANK0~BANK15, the output of each bank adder can be output via 8-bit, that is, one-byte, data paths of the data bus DBUS. In other words, the bank result signal BR0 of the first bank adder 610a can be output via the data paths corresponding to the first byte BY0 of the data bus DBUS, the bank result signal BR1 of the second bank adder 610b can be output via the data paths corresponding to the second byte BY1 of the data bus DBUS, and in this way the bank result signal BR15 of the sixteenth bank adder 610p can be output via the data paths corresponding to the sixteenth byte BY15 of the data bus DBUS.

Fig. 27 is a diagram illustrating a matrix computation using a computation circuit according to an exemplary embodiment of the inventive concept. Fig. 27 illustrates a matrix-vector multiplication performed using computation units CU0-0 to CU95-15 in a stacked storage device according to an exemplary embodiment of the inventive concept. In Fig. 27, the computation units CUi-0 to CUi-15 in the i-th row (i=0~95) correspond to the i-th memory bank BANKi.
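
The byte-lane assignment described for Fig. 26 (16 bank results, one byte each, on a 128-bit bus) can be sketched as follows. The function names are hypothetical, and two details are assumptions the description does not settle: BY0 is placed in the least significant byte of the bus word, and each bank result is truncated to 8 bits.

```python
def bank_results(cu_outputs):
    """Sum the computation-unit outputs of each bank (role of bank adders 610a~610p)."""
    return [sum(bank) for bank in cu_outputs]


def pack_byte_lanes(results, lane_bits=8):
    """Place BRk on byte lane BYk of one bus word (BY0 in the lowest byte, assumed)."""
    mask = (1 << lane_bits) - 1
    word = 0
    for k, br in enumerate(results):
        word |= (br & mask) << (k * lane_bits)  # truncation to one byte is an assumption
    return word


# Illustrative values: in bank k, exactly k of the 16 units output 1, so BRk == k.
cu_outputs = [[1 if j < k else 0 for j in range(16)] for k in range(16)]
br = bank_results(cu_outputs)
word = pack_byte_lanes(br)
print(br)
print(f"{word:032x}")  # 128-bit bus word as 32 hex digits
```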
For example, the matrix-vector multiplication can be performed in a 32-bit mode, and each memory bank can have 16 computation units. It is assumed that each of the four memory semiconductor dies has two channels and each channel has 16 memory banks. In this case, if one memory semiconductor die is used as the input-output semiconductor die described above and the other three memory semiconductor dies are used as the computation semiconductor dies described above, the number of memory banks included in the computation semiconductor dies can be 96, in other words, six channels * 16 memory banks.

A first set of transmit data DA0~DA15 during a first time period T1 and a second set of transmit data DA16~DA31 during a subsequent time period are provided sequentially to all of the computation units in all of the memory banks. As such, the activations can be transmitted sequentially. Additionally, a first set of internal data DW0~DW95 during the first time period T1 and a second set of internal data DW96~DW191 during the subsequent time period are provided sequentially to the computation units as weights. The internal data corresponds to data read from the respective memory banks. As such, the computation units can perform dot-product operations based on the activations and weights, which are provided sequentially. The computation units in the same memory bank provide partial sums of the same output activation. Consequently, after the dot-product operations are completed, the partial sums can be summed by the bank adders of Fig. 26 to provide the final result as bank result signals BR0~BR95.

The matrix-vector multiplication illustrated in Fig. 27 can correspond to a 1x1 convolution or a fully connected layer. In the case of a multilayer perceptron (MLP) or a recurrent neural network (RNN), the transmit data, or transmitted activations, correspond to a subset of a one-dimensional input activation. In the case of a convolutional neural network (CNN), the input activation corresponds to a 1x1 subcolumn of an input activation tensor.

Fig.
28 is a timing diagram illustrating the operation of a stacked storage device according to an exemplary embodiment of the inventive concept.

As described with reference to Fig. 9, in the stacked storage device according to an exemplary embodiment of the inventive concept, the first, second, and third memory semiconductor dies 1100, 1200, and 1300 can correspond to the first, second, and third computation semiconductor dies, in which the computation circuits CAL are formed, and the fourth memory semiconductor die 1400 can correspond to the input-output semiconductor die, which does not have the computation circuits CAL. In this case, the transmit data can be provided by the input-output semiconductor die 1400. As specified in the HBM standards, the first computation semiconductor die 1100 can have a first channel CH0 and a second channel CH1; the second computation semiconductor die 1200 can have a third channel CH2 and a fourth channel CH3; the third computation semiconductor die 1300 can have a fifth channel CH4 and a sixth channel CH5; and the input-output semiconductor die 1400 can have a seventh channel CH6 and an eighth channel CH7. Each channel can operate as pseudo-channel 0 or pseudo-channel 1.

Commands such as MRST, ABR0, MAC, SUM, and MWRT, as illustrated in Fig. 28, can be specified to perform computations in parallel using the computation units in the stacked storage device according to exemplary embodiments of the inventive concept. In Fig. 28, time points T0~TN+1 indicate the relative timing of the commands.

MRST can be a command to reset buffers in the computation units. For example, the reset signal RST in Fig. 25 can be activated based on MRST to reset the buffer 542. Additionally, MRST can be used to set a channel selector in the control circuit 1030 in Fig. 2 for transmitting the data.

ABR0 can initiate the transmission of the transmit data.
ABR0 can be similar to the read command, but the read data can be transmitted to the computation units in the computation semiconductor dies rather than to an external device. ABR0 can be issued per pseudo-channel.

MAC can initiate the computation operation in the computation semiconductor dies. MAC can be similar to the read command, but the internal data can be transferred to the computation units, while transfers to the external device or to other semiconductor dies via the through-silicon vias are prevented. MAC can be sent to all of the computation semiconductor dies and issued per pseudo-channel.

SUM can transfer the computation result data from the computation units to the logic semiconductor die. For example, the output enable signal OUTEN in Fig. 25 can be activated based on SUM, and the computation result data DR can be summed by the bank adders 610a~610p in Fig. 26 to provide the bank result data BR to the logic semiconductor die 1010.

MWRT can set a channel selector in the control circuit 1030 in Fig. 2 such that the computation result data can be transferred from the logic semiconductor die 1010 to the input-output semiconductor die 1400.

In Fig. 28, the seventh channel CH6 and the eighth channel CH7 can correspond to the input-output semiconductor die 1400, which stores the transmit data and the computation results, and the first through sixth channels CH0~CH5 can correspond to the computation semiconductor dies 1100, 1200, and 1300, which store the internal data and perform the computation operation. As illustrated in Fig. 28, ABR0, MAC, and MWRT can be issued alternately for the first pseudo-channel PSE-0 and the second pseudo-channel PSE-1, and thus the operations of the computation units can be performed alternately in units of a pseudo-channel.
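
The alternation between the two pseudo-channels can be pictured as a simple two-stage pipeline. The Python sketch below is only an illustration: the slot numbering, the one-slot delay between ABR0 and MAC, and the function name `overlap_schedule` are assumptions, since the exact timing is defined by Fig. 28 and is not reproduced in the text.

```python
def overlap_schedule(num_groups):
    """Map each time slot to the (command, pseudo-channel) pairs issued in it.

    Assumptions: ABR0 for a pseudo-channel is followed one slot later by MAC
    for the same pseudo-channel, and successive groups alternate PSE-0 / PSE-1.
    """
    slots = {}
    for g in range(num_groups):
        pc = f"PSE-{g % 2}"
        slots.setdefault(g + 1, []).append(("ABR0", pc))  # fetch transmit data
        slots.setdefault(g + 2, []).append(("MAC", pc))   # multiply-accumulate
    return slots


schedule = overlap_schedule(4)
for t in sorted(schedule):
    print(f"T{t}:", schedule[t])
```

In this sketch, slot T2 carries ABR0 for PSE-1 together with MAC for PSE-0, which is the kind of overlap the description attributes to time T2.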
For example, at time T2, the transmission of the transmit data for the second pseudo-channel PSE-1 and the transmission of the internal data and the computations for the first pseudo-channel PSE-0 can be performed simultaneously.

Figs. 29 and 30 are diagrams illustrating packaging structures of a stacked storage device according to exemplary embodiments of the inventive concept.

Referring to Fig. 29, a memory chip 801 can have an interposer ITP and a stacked storage device stacked on the interposer ITP. The stacked storage device can have a logic semiconductor die LSD and a plurality of memory semiconductor dies MSD1~MSD4.

Referring to Fig. 30, a memory chip 802 can have a base substrate BSUB and a stacked storage device stacked on the base substrate BSUB. The stacked storage device can have a logic semiconductor die LSD and a plurality of memory semiconductor dies MSD1~MSD4.

Fig. 29 illustrates a structure in which the memory semiconductor dies MSD1~MSD4, but not the logic semiconductor die LSD, are stacked vertically, and the logic semiconductor die LSD is electrically connected to the memory semiconductor dies MSD1~MSD4 via the interposer ITP or the base substrate. Fig. 30 illustrates a structure in which the logic semiconductor die LSD is stacked vertically with the memory semiconductor dies MSD1~MSD4.

As described above, at least one of the memory semiconductor dies MSD1~MSD4 can be a computation semiconductor die that includes the computation circuit CAL. The computation circuit CAL can have multiple computation units that perform the computations based on the common transmit data and their respective internal data, as described above.

The base substrate BSUB can be the same as, or include, the interposer ITP. The base substrate BSUB can be a printed circuit board (PCB).
External interconnects, such as conductive bumps BMP, can be formed on a lower surface of the base substrate BSUB, and internal interconnects, such as conductive bumps, can be formed on a top surface of the base substrate BSUB. In the embodiment of Fig. 30, the logic semiconductor die LSD and the memory semiconductor dies MSD1~MSD4 can be electrically connected via silicon vias. In Fig. 29, the memory semiconductor dies MSD1~MSD4 can be electrically connected via silicon vias. The stacked semiconductor dies LSD and MSD1~MSD4 can be packaged or encapsulated using a resin RSN.
Fig. 31 is a block diagram illustrating a mobile system according to an exemplary embodiment of the inventive concept. Referring to Fig. 31, a mobile system 3000 comprises an application processor 3100, a connectivity unit 3200, a volatile storage device VM 3300, a non-volatile storage device NVM 3400, a user interface 3500 and a power supply 3600, which are connected via a bus. The application processor 3100 can run applications such as a web browser, a game application, a video player, etc. The connectivity unit 3200 can perform wired or wireless communication with an external device. The volatile storage device 3300 can store data processed by the application processor 3100 or can function as main memory. For example, the volatile storage device 3300 can be a DRAM such as double data rate synchronous dynamic random-access memory (DDR SDRAM), low-power DDR (LPDDR) SDRAM, graphics DDR (GDDR) SDRAM, Rambus DRAM (RDRAM), etc. The non-volatile storage device 3400 can store a boot image for booting the mobile system 3000 and other data. The user interface 3500 can include at least one input device, such as a keypad, a touchscreen, etc., and at least one output device, such as a loudspeaker, a display device, etc. The power supply 3600 can supply a power supply voltage to the mobile system 3000.
In an exemplary embodiment of the inventive concept, the mobile system 3000 can further comprise a camera image processor (CIS) and / or a storage device, such as a memory card, a solid-state drive (SSD), a hard disk drive (HDD), a compact disc read-only memory (CD-ROM), etc. The volatile storage device 3300 and / or the non-volatile storage device 3400 can be implemented in a stacked structure as described with reference to Figs. 1 to 30. The stacked structure can have a plurality of storage semiconductor dies connected by silicon vias, and the computation units described above can be formed in at least one of the storage semiconductor dies.
As described above, the stacked storage device, the storage system comprising the stacked storage device, and the method for operating a stacked storage device according to exemplary embodiments of the inventive concept can reduce the amount of data exchanged between the stacked storage device, the logic semiconductor die, and the external device. For example, they can perform memory-intensive or data-intensive data processing in parallel through the plurality of computation units contained in the storage semiconductor dies. Consequently, the data processing time and power consumption are reduced. Furthermore, the data processing time and power consumption of a multilayer perceptron (MLP), a recurrent neural network (RNN), a convolutional neural network (CNN), etc., can be reduced by increasing the memory bandwidth of kernel weights for matrix-vector multiplication through the plurality of computation units arranged in the memory banks, and by increasing the memory bandwidth of activations for matrix-vector multiplication by transmitting the activations jointly to the computation units.
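The multiply-accumulate scheme summarized above can be illustrated with a minimal Python sketch. It is a behavioral model under stated assumptions, not the circuit of the embodiments: the function names `computation_unit` and `bank_result` are hypothetical, data are modeled as plain integer lists, and each computation unit is assumed to consume the common transmitted data DA element by element alongside its own internal data DW.

```python
def computation_unit(transmitted, internal):
    """Model one computation unit: multiply the common transmitted data (DA)
    by the unit's own internal data (DW) and accumulate the products (DR)."""
    if len(transmitted) != len(internal):
        raise ValueError("DA and DW must have the same length")
    acc = 0
    for da, dw in zip(transmitted, internal):
        acc += da * dw  # multiplication followed by accumulation
    return acc


def bank_result(transmitted, internal_per_unit):
    """Model one memory bank: every unit receives the same DA, each reads its
    own DW, and a bank adder sums the per-unit results DR into BR."""
    unit_results = [computation_unit(transmitted, dw) for dw in internal_per_unit]
    return unit_results, sum(unit_results)


if __name__ == "__main__":
    da = [1, 2, 3]                      # common transmitted data (activations)
    dw = [[1, 0, 1], [2, 2, 2]]         # internal data (weights), one list per unit
    dr, br = bank_result(da, dw)
    print(dr, br)                       # [4, 12] 16
```

Only DA crosses the silicon vias in this model, and it is sent once for all units; the weights stay inside the dies that store them, which is where the reduction in exchanged data comes from.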
Exemplary embodiments of the present inventive concept can be applied to any devices and systems that have a storage device requiring a refresh operation. For example, the present inventive concept can be applied to systems such as a mobile phone, a smartphone, a personal digital assistant (PDA), a portable multimedia player (PMP), a digital camera, a camcorder, a personal computer (PC), a server computer, a workstation, a laptop computer, a digital TV, a set-top box, a portable game console, a navigation system, etc.
Claims
1. Stacked storage device (1000) comprising: a logic semiconductor die (1010); a plurality of storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) stacked with the logic semiconductor die (1010), each of the storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) comprising an integrated memory circuit (1071, 1081; 1401, 1402), and one or more of the storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) being a computation semiconductor die (1080; 1400) comprising a computation unit (CU, CU1~CUn); and silicon vias (TSV) which electrically connect the logic semiconductor die (1010) and the plurality of storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400); wherein each of the computation units (CU, CU1~CUn) is configured to perform calculations based on transmitted data (DA) and internal data (DW1~DWn) and to generate computation result data (DR), wherein the transmitted data (DA) is provided jointly to the computation semiconductor dies (1080; 1400) through the silicon vias (TSV) and the internal data (DW1~DWn) is read from the integrated memory circuits (1071, 1081; 1401, 1402) of the computation semiconductor dies (1080; 1400).
2. Stacked storage device (1000) according to claim 1, wherein each of the storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) has a plurality of memory banks (MB; BANK0~BANK15), and the computation units (CU, CU1~CUn) are arranged in the memory banks (MB; BANK0~BANK15) which are contained in the computation semiconductor dies (1080; 1400).
3. Stacked storage device (1000) according to claim 2, wherein the computation units (CU, CU1~CUn) contained in the memory banks (MB; BANK0~BANK15) of the computation semiconductor dies (1080; 1400) jointly receive the transmitted data (DA) and simultaneously perform the calculations based on the transmitted data (DA).
4. Stacked storage device (1000) according to claim 2, wherein each of the memory banks (MB; BANK0~BANK15) has a plurality of data blocks (DBK1) and each of the computation units (CU, CU1~CUn) is assigned to a predetermined number of the data blocks (DBK1).
5. Stacked storage device (1000) according to claim 1, wherein each of the computation units (CU, CU1~CUn) has first input terminals for receiving the internal data (DW1~DWn) and second input terminals for receiving the transmitted data (DA), wherein the first input terminals are connected to output terminals of an input-output sense amplifier (IOSA) which amplifies signals on global input-output lines (GIO, GIOB), and the second input terminals are connected to input terminals of an input-output driver (IODRV) which drives the global input-output lines (GIO, GIOB).
6. Stacked storage device (1000) according to claim 1, wherein at least one of the storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) is an input-output semiconductor die (1400) which does not have the computation units (CU, CU1~CUn).
7. Stacked storage device (1000) according to claim 6, wherein the transmitted data (DA) are transferred directly from the input-output semiconductor die (1400) to the computation semiconductor dies (1080; 1400) without passing through the logic semiconductor die (1010).
8. Stacked storage device (1000) according to claim 6, wherein the input-output semiconductor die (1400) simultaneously drives data buses (DBUS; DBUS1~DBUS4) with the transmitted data (DA) and each of the computation semiconductor dies (1080; 1400) receives the transmitted data (DA) via a corresponding one of the data buses (DBUS; DBUS1~DBUS4).
9. Stacked storage device (1000) according to claim 6, wherein the input-output semiconductor die (1400) drives a data bus (DBUS; DBUS1~DBUS4) corresponding to the input-output semiconductor die (1400) with the transmitted data (DA), and wherein each of the computation semiconductor dies (1080; 1400) receives the transmitted data (DA) via the data bus (DBUS; DBUS1~DBUS4) corresponding to the input-output semiconductor die (1400).
10. Stacked storage device (1000) according to claim 6, wherein the computation result data (DR) are transferred from the computation semiconductor dies (1080; 1400) to the logic semiconductor die (1010), and then from the logic semiconductor die (1010) to the input-output semiconductor die (1400).
11. Stacked storage device (1000) according to claim 10, wherein the computation result data (DR) are simultaneously transferred from the computation semiconductor dies (1080; 1400) to the logic semiconductor die (1010) via data buses (DBUS; DBUS1~DBUS4) which each correspond to one of the computation semiconductor dies (1080; 1400), and the computation result data (DR) are sequentially transferred from the logic semiconductor die (1010) to the input-output semiconductor die (1400) by a time-division scheme via the data bus (DBUS; DBUS1~DBUS4) which corresponds to the input-output semiconductor die (1400).
12. Stacked storage device (1000) according to claim 6, wherein the computation result data (DR) are transferred directly from the computation semiconductor dies (1080; 1400) to the input-output semiconductor die (1400) without passing through the logic semiconductor die (1010).
13. Stacked storage device (1000) according to claim 12, wherein each of the computation semiconductor dies (1080; 1400) drives a corresponding data bus (DBUS; DBUS1~DBUS4) with the computation result data (DR) and the input-output semiconductor die (1400) receives the computation result data (DR) sequentially via the data buses (DBUS; DBUS1~DBUS4) corresponding to the computation semiconductor dies (1080; 1400).
14. Stacked storage device (1000) according to claim 12, wherein the computation semiconductor dies (1080; 1400) sequentially drive a data bus (DBUS; DBUS1~DBUS4) corresponding to the input-output semiconductor die (1400) with the computation result data (DR), and wherein the input-output semiconductor die (1400) sequentially receives the computation result data (DR) via the data bus (DBUS; DBUS1~DBUS4) corresponding to the input-output semiconductor die (1400).
15. Stacked storage device (1000) according to claim 1, wherein all of the storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) are computation semiconductor dies (1080; 1400) which have the computation units (CU, CU1~CUn).
16. Stacked storage device (1000) according to claim 2, wherein each of the computation semiconductor dies (1080; 1400) further comprises a plurality of bank adders, and each of the bank adders sums the outputs of the computation units (CU, CU1~CUn) in a respective one of the memory banks (MB; BANK0~BANK15) to generate bank result signals (BR0~BR95).
17. Stacked storage device (1000) according to claim 1, wherein each of the computation units (CU, CU1~CUn) comprises: a multiplication circuit (520) configured to multiply the transmitted data (DA) and the internal data (DW1~DWn); and an accumulation circuit (540) configured to accumulate outputs of the multiplication circuit (520) to provide the computation result data (DR).
18. Stacked storage device (1000) according to claim 1, wherein the logic semiconductor die (1010) further comprises a data transformation logic configured to process data provided by the storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) or data provided by an external device.
19. A memory system comprising: a base substrate (BSUB); at least one logic semiconductor die (1010) stacked on the base substrate (BSUB); a plurality of storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400) stacked on the base substrate (BSUB) or on the logic semiconductor die (1010); and a plurality of computation units (CU, CU1~CUn) which are formed in one or more computation semiconductor dies (1080; 1400) among the plurality of storage semiconductor dies (1070, 1080; 1100, 1200, 1300, 1400), wherein each of the computation units (CU, CU1~CUn) is configured to perform calculations based on transmitted data (DA) and internal data (DW1~DWn) and to generate computation result data (DR), wherein the transmitted data (DA) is provided in common to the computation semiconductor dies (1080; 1400), and the internal data (DW1~DWn) is read from integrated memory circuits of the computation semiconductor dies (1080; 1400).
20. Method for operating a stacked storage device (1000), wherein the stacked storage device (1000) comprises a computation unit (CU, CU1~CUn) in each of a plurality of computation semiconductor dies (1080; 1400) stacked in a vertical direction, the method comprising: providing transmitted data (DA) in common to each of the computation units (CU, CU1~CUn) via silicon vias (TSV) electrically connecting the computation semiconductor dies (1080; 1400); providing internal data (DW1~DWn) read from integrated memory circuits of the computation semiconductor dies (1080; 1400) to each of the computation units (CU, CU1~CUn); and performing multiple calculations based on the transmitted data (DA) and the internal data (DW1~DWn) simultaneously using the computation units (CU, CU1~CUn).