Compute-in-memory system, control method and apparatus, and electronic device

By dynamically controlling the parallelism and computation count of the memory circuit's computation core, the problem of insufficient reliability in the in-memory computing architecture is solved, achieving high-precision and high-efficiency computation under different input conditions.

WO2026138895A1 · PCT designated stage · Publication Date: 2026-07-02 · BEIJING ZHICUN (WITIN) TECH CORP LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
BEIJING ZHICUN (WITIN) TECH CORP LTD
Filing Date
2025-12-24
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

The reliability of in-memory computing architectures still needs improvement. With varying input-data sparsity and device deviations, the calculation error can be large, which affects computational accuracy and speed.

Method used

A control circuit controls the parallelism of a single computation by a computing core in the storage circuit. The parallelism and the number of computations are adjusted dynamically based on the input data or the circuit state, so as to balance computational accuracy and speed.

Benefits of technology

It improves the reliability and computational accuracy of the in-memory computing architecture, reduces computational errors, and enhances the overall performance of the system.



Abstract

Disclosed are a compute-in-memory system, a control method and apparatus, and an electronic device, which relate to the technical field of electronics. The compute-in-memory system comprises: a memory circuit, which comprises a first computing core, wherein the first computing core comprises a plurality of first memory cells, which are used for storing first weight data, the first computing core is used for receiving a first input signal, and converting the first input signal into a first output signal on the basis of the first weight data, and the first input signal is generated on the basis of first input data; and a control circuit, which is used for controlling the parallelism of the first computing core for one instance of computation on the basis of the first input data or a first state of the memory circuit. The compute-in-memory system can improve the reliability of a compute-in-memory architecture.

Description

In-memory computing system, control method and apparatus, and electronic device

[0001] This application claims priority to Chinese Patent Application No. 202411918651.9, filed on December 24, 2024, entitled "In-memory computing system, control method and apparatus, and electronic device", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of electronic technology, and more specifically, to an in-memory computing system, a control method and apparatus, and an electronic device.

Background Technology

[0003] In-memory computing (IMC) is an innovative technology proposed to address the problem of separation between computation and storage in traditional von Neumann computer architectures. IMC reduces transmission latency and memory access power consumption by physically integrating data storage and computation, or bringing them closer together. IMC can significantly improve data processing efficiency, but it still faces challenges, such as the need to improve its reliability.

Summary of the Invention

[0004] This application provides an in-memory computing system, a control method and apparatus, and an electronic device to improve the reliability of the in-memory computing architecture.

[0005] In a first aspect, an in-memory computing system is provided, comprising: a storage circuit including a first computing core, the first computing core including a plurality of first storage units, the plurality of first storage units being used to store first weight data, the first computing core being used to receive a first input signal and convert the first input signal into a first output signal based on the first weight data, wherein the first input signal is generated based on first input data; and a control circuit used to control the parallelism of a single computation by the first computing core according to the first input data or a first state of the storage circuit.

[0006] Optionally, the above-mentioned storage circuit may include multiple computing cores, and the first computing core is one of the multiple computing cores.

[0007] Optionally, the first state may include the state of the entire storage circuit or the state of the first computing core, wherein the state of the first computing core may be the state of the storage area where the first computing core is located.

[0008] Optionally, a first precision of the storage circuit, or of the storage area where the first computing core is located, can be determined based on the usage of the storage circuit or of that storage area.

[0009] Optionally, the usage of the storage circuit may include the running time (or duration) of the storage circuit; or, the usage of the storage area where the first computing core is located may include the running time (or duration) of the storage area where the first computing core is located.

[0010] Optionally, the usage of the aforementioned storage circuit or the storage area where the first computing core is located may also include the number of times the storage area where the first computing core is located is used.

[0011] Optionally, the above usage count may include the usage of data stored in the storage area where the storage circuit or the first computing core is located, and may include one or more of the following: data read count, or calculation count.

[0012] In some implementations of the first aspect, the control circuit is used to: control the parallelism of a single computation by the first computing core based on the accumulated value of the non-zero values in the first input data.

[0013] In some implementations of the first aspect, the control circuit is used to: control the first input signal input to the first computing core so as to control the number of first storage units participating in a first computation, wherein the parallelism of the first computation is related to the number of first storage units in the storage unit group that participate in the first computation, the first input signal is generated based on a first data segment in the first input data, the accumulated value of the non-zero data in the first data segment is less than or equal to a first threshold, and the computation of the first computing core includes the first computation.
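The segmentation condition above (the accumulated value of each data segment stays at or below the first threshold) can be illustrated with a minimal Python sketch. It assumes that "accumulated value" means the sum of the values in a segment, that inputs are non-negative integers, and that segments are formed greedily from consecutive inputs. The function name `split_by_accumulated_value` and the sample data are hypothetical and are not taken from the application.

```python
from typing import List


def split_by_accumulated_value(data: List[int], threshold: int) -> List[List[int]]:
    """Greedily split `data` into segments of consecutive indices.

    The sum of the values in each segment stays at or below `threshold`.
    Returns one list of indices per segment.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    segments: List[List[int]] = []
    current: List[int] = []
    accumulated = 0
    for index, value in enumerate(data):
        if value < 0:
            raise ValueError("this sketch assumes non-negative inputs")
        if value > threshold:
            raise ValueError("a single value exceeds the threshold")
        # Close the current segment before it would exceed the threshold.
        if accumulated + value > threshold and current:
            segments.append(current)
            current, accumulated = [], 0
        current.append(index)
        accumulated += value
    if current:
        segments.append(current)
    return segments


input_data = [3, 0, 0, 2, 4, 0, 1, 5]   # hypothetical first input data
segments = split_by_accumulated_value(input_data, threshold=5)
num_calculations = len(segments)        # one computation per segment
```

Under these assumptions, a smaller threshold yields more segments, so fewer storage units take part in each computation and more computations are needed.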

[0014] In some implementations of the first aspect, the control circuit is further configured to: control the number of calculations performed by the first computing core based on the first input data or the first state of the storage circuit.

[0015] In some implementations of the first aspect, the in-memory computing system further includes: an input circuit for receiving first input data and first control data, and inputting a first input signal to the storage circuit based on the first input data and the first control data; and a control circuit for providing the first control data to the input circuit.

[0016] In some implementations of the first aspect, the first control data includes mask data.
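One way to picture mask data is as a 0/1 vector per computation that selects which inputs are passed to the computing core. The following Python sketch rests on that assumption; the helper names `segment_masks` and `apply_mask` and the sample segmentation are hypothetical.

```python
from typing import List


def segment_masks(segments: List[List[int]], length: int) -> List[List[int]]:
    """Build one 0/1 mask per segment; 1 marks an input that takes part."""
    masks = []
    for segment in segments:
        mask = [0] * length
        for index in segment:
            mask[index] = 1
        masks.append(mask)
    return masks


def apply_mask(data: List[int], mask: List[int]) -> List[int]:
    """Zero out every input whose mask bit is 0."""
    return [value if bit else 0 for value, bit in zip(data, mask)]


input_data = [3, 0, 0, 2, 4, 0, 1, 5]          # hypothetical first input data
segments = [[0, 1, 2, 3], [4, 5, 6], [7]]      # hypothetical segmentation
masks = segment_masks(segments, len(input_data))
masked_inputs = [apply_mask(input_data, m) for m in masks]

# Summing the masked inputs element-wise recovers the original data,
# so the per-segment results can be added to obtain the full result.
recombined = [sum(column) for column in zip(*masked_inputs)]
```

Because the masks are disjoint and cover every input, adding the per-computation results gives the same total as a single unmasked computation would in the ideal case.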

[0017] In some implementations of the first aspect, the in-memory computing system further includes: an output circuit for receiving first indication information and shifting the output of a first calculation by the first computing core according to the first indication information, wherein the first indication information is used to determine the number of bits to be shifted; and a control circuit for providing the first indication information to the output circuit.

[0018] Optionally, the output circuit may include a first register group, a first shift accumulator group, a second register group, a second shift accumulator group, and a local buffer. The first shift accumulator group includes first shift accumulator 1 to first shift accumulator K, and the second shift accumulator group includes second shift accumulator 1 to second shift accumulator J, where K and J are positive integers greater than 1. The values of K and J may be the same or different, depending on the encoding method of the weight vector and the data vector. Optionally, the number of bits in the second register is greater than the number of bits in the first register.
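The shifting of a computation's output can be sketched in Python under one common assumption: the input data is fed one bit plane at a time, and the shift amount equals the significance of that bit plane. The application states only that the first indication information determines the number of bits to shift; the bit-serial scheme and the names `dot` and `bit_serial_dot` are illustrative assumptions.

```python
from typing import List


def dot(weights: List[int], inputs: List[int]) -> int:
    """Stand-in for a single computation by the computing core."""
    return sum(w * x for w, x in zip(weights, inputs))


def bit_serial_dot(weights: List[int], inputs: List[int], input_bits: int) -> int:
    """Feed the inputs one bit plane at a time and shift-accumulate."""
    total = 0
    for bit in range(input_bits):
        plane = [(x >> bit) & 1 for x in inputs]   # bit plane of the inputs
        partial = dot(weights, plane)
        shift_amount = bit   # plays the role of the first indication information
        total += partial << shift_amount
    return total


weights = [2, -1, 3, 0]   # hypothetical weight data
inputs = [5, 3, 7, 1]     # hypothetical unsigned 3-bit input data
result = bit_serial_dot(weights, inputs, input_bits=3)
```

In this sketch the shift-accumulated result matches the direct dot product, which is the property a shift accumulator is meant to preserve.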

[0019] In some implementations of the first aspect, the storage circuit further includes a second computing core, which includes a plurality of second storage units for storing second weight data. The second computing core is used to receive a second input signal and convert the second input signal into a second output signal based on the second weight data, wherein the second input signal is generated based on the second input data. The control circuit is also used to control the parallelism of the second computing core in one computation according to the second input data or a second state of the storage circuit.

[0020] In some implementations of the first aspect, the storage circuit includes a first storage region, the storage cells in the first storage region have the same first access address, the first storage region includes a first computing core and a second computing core, and the control circuit is further configured to: control the number of calculations of the second computing core based on the number of calculations of the first computing core.

[0021] In some implementations of the first aspect, the control circuit is also used to: determine a first threshold based on a first state of the storage circuit.

[0022] Optionally, the first threshold limits the number of data segments into which the first input data is split, and the larger the number of data segments, the more slowly the first computing core completes the first calculation based on the first input data. Therefore, when the first calculation speed needs to be improved, the value of the first threshold can be increased adaptively.

[0023] Optionally, when the control circuit determines the first threshold, it can consider both the computational accuracy and the computational speed of the storage circuit or the specified computing core, weigh these two parameters, and adjust the value of the first threshold to achieve the optimal overall performance of the in-memory computing system in the specified scenario.

[0024] Optionally, the correspondence between the first threshold, the first precision, and the first computing speed can be represented in tabular form or in the form of a mathematical model formula. Based on this, for different computing scenarios, after determining the requirements for the first precision and the first computing speed in that scenario, the corresponding first threshold that can achieve the optimal overall performance of the in-memory computing system can be determined, that is, the trade-off between the first precision and the first computing speed of the storage circuit or the first computing core in the in-memory computing system can be achieved.
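A tabular correspondence of this kind can be sketched as a lookup in Python. The table values below are placeholders for illustration only and are not measured data. The scoring scale, the selection rule (largest threshold that meets both requirements), and the names `Entry`, `TABLE`, and `choose_threshold` are assumptions.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    threshold: int
    precision: float   # hypothetical score, higher is better
    speed: float       # hypothetical score, higher is better


# Placeholder values for illustration only; not measured data.
TABLE: List[Entry] = [
    Entry(threshold=4, precision=0.99, speed=0.25),
    Entry(threshold=8, precision=0.97, speed=0.50),
    Entry(threshold=16, precision=0.93, speed=0.75),
    Entry(threshold=32, precision=0.85, speed=1.00),
]


def choose_threshold(min_precision: float, min_speed: float) -> Optional[int]:
    """Return the largest threshold meeting both requirements, else None."""
    candidates = [
        entry for entry in TABLE
        if entry.precision >= min_precision and entry.speed >= min_speed
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.threshold).threshold


balanced = choose_threshold(min_precision=0.90, min_speed=0.50)
too_strict = choose_threshold(min_precision=0.98, min_speed=0.50)
```

A mathematical model could replace the table without changing the calling code, since only `choose_threshold` would need to be rewritten.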

[0025] In a second aspect, a control method is provided for controlling a storage circuit, the storage circuit including a first computing core, the first computing core including a plurality of first storage units, the plurality of first storage units for storing first weight data, the first computing core for receiving a first input signal, and converting the first input signal into a first output signal based on the first weight data, the control method including: determining the first input data or a first state of the storage circuit, wherein the first input signal is generated based on the first input data; and controlling the parallelism of a single computation by the first computing core according to the first input data or the first state of the storage circuit.

[0026] In some implementations of the second aspect, the control method further includes: controlling the parallelism of the first computing core in a single computation based on the accumulated value of the non-zero values in the first input data.

[0027] In some implementations of the second aspect, the control method further includes: controlling the first input signal input to the first computing core so as to control the number of first storage units participating in a first computation, wherein the parallelism of the first computation is related to the number of first storage units in the storage unit group that participate in the first computation, the first input signal is generated based on a first data segment in the first input data, the accumulated value of the non-zero data in the first data segment is less than or equal to a first threshold, and the computation of the first computing core includes the first computation.

[0028] In some implementations of the second aspect, the control method further includes: controlling the number of calculations of the first computing core based on the first input data or the first state of the storage circuit.

[0029] In some implementations of the second aspect, the storage circuit is connected to the input circuit, the input circuit is used to receive the first input data, and the control method further includes: providing the input circuit with the first control data, the input circuit being used to input the first input signal to the storage circuit based on the first input data and the first control data.

[0030] In some implementations of the second aspect, the first control data includes mask data.

[0031] In some implementations of the second aspect, the storage circuit is connected to the output circuit, and the control method further includes: providing the output circuit with first indication information, the first indication information being used to determine the number of bits to be shifted, and the output circuit being used to shift the output of a single calculation by the first calculation core according to the first indication information.

[0032] In some implementations of the second aspect, the storage circuit further includes a second computing core, which includes a plurality of second storage units for storing second weight data. The second computing core is used to receive a second input signal and convert the second input signal into a second output signal based on the second weight data. The second input signal is generated based on the second input data. The control method further includes controlling the parallelism of the second computing core in one computation according to the second input data or the second state of the storage circuit.

[0033] In some implementations of the second aspect, the storage circuit includes a first storage region, the storage cells in the first storage region have the same first access address, the first storage region includes a first computing core and a second computing core, and the control method further includes: controlling the number of calculations of the second computing core based on the number of calculations of the first computing core.

[0034] In some implementations of the second aspect, the control method further includes: determining a first threshold based on a first state of the storage circuit.

[0035] For a description of the beneficial effects of the second aspect, please refer to the description of the beneficial effects of the first aspect, which will not be repeated here.

[0036] In a third aspect, a control device is provided for controlling a storage circuit, the storage circuit including a first computing core, the first computing core including a plurality of first storage units, the plurality of first storage units for storing first weight data, the first computing core for receiving a first input signal, and converting the first input signal into a first output signal based on the first weight data. The control device includes a determining unit and a control unit, wherein the determining unit is used to: determine first input data or a first state of the storage circuit, wherein the first input signal is generated based on the first input data; and the control unit is used to: control the parallelism of a single computation by the first computing core according to the first input data or the first state of the storage circuit.

[0037] In some implementations of the third aspect, the control unit is also used to: control the parallelism of the first computing core in a single computation based on the accumulated value of the non-zero values in the first input data.

[0038] In some implementations of the third aspect, the control unit is further configured to: control the first input signal input to the first computing core so as to control the number of first storage units participating in a first computation, wherein the parallelism of the first computation is related to the number of first storage units in the storage unit group that participate in the first computation, the first input signal is generated based on a first data segment in the first input data, the accumulated value of the non-zero data in the first data segment is less than or equal to a first threshold, and the computation of the first computing core includes the first computation.

[0039] In some implementations of the third aspect, the control unit is also used to: control the number of calculations of the first computing core according to the first input data or the first state of the storage circuit.

[0040] In some implementations of the third aspect, the storage circuit is connected to the input circuit, the input circuit is used to receive first input data, and the control device further includes: a transmission unit, which is used to provide first control data to the input circuit, and the input circuit is used to input a first input signal to the storage circuit based on the first input data and the first control data.

[0041] In some implementations of the third aspect, the first control data includes mask data.

[0042] In some implementations of the third aspect, the storage circuit is connected to the output circuit, and the control device further includes: an indicator unit for providing first indication information to the output circuit for determining the number of bits to be shifted, and the output circuit for shifting the output of the first calculation core in one calculation according to the first indication information.

[0043] In some implementations of the third aspect, the storage circuit further includes a second computing core, which includes a plurality of second storage units for storing second weight data. The second computing core is used to receive a second input signal and convert the second input signal into a second output signal based on the second weight data. The second input signal is generated based on the second input data. The control unit is also used to control the parallelism of the second computing core in one computation according to the second input data or the second state of the storage circuit.

[0044] In some implementations of the third aspect, the storage circuit includes a first storage region, the storage cells in the first storage region have the same first access address, the first storage region includes a first computing core and a second computing core, and the control unit is further configured to: control the number of calculations of the second computing core based on the number of calculations of the first computing core.

[0045] In some implementations of the third aspect, the determining unit is also used to: determine a first threshold based on a first state of the storage circuit.

[0046] For a description of the beneficial effects of the third aspect, please refer to the description of the beneficial effects of the first aspect, which will not be repeated here.

[0047] In a fourth aspect, a control device is provided, comprising at least one processing circuit and an interface circuit, the interface circuit being used for signal connection with a storage circuit, and the at least one processing circuit being used to execute any of the control methods of the second aspect.

[0048] In a fifth aspect, a control device is provided, configured to perform any of the control methods of the second aspect.

[0049] In a sixth aspect, an electronic device is provided, comprising any of the in-memory computing systems of the first aspect.

[0050] In a seventh aspect, a computer program product is provided, the computer program product including instructions that, when executed by a processor, cause any of the control methods of the second aspect above to be executed.

[0051] In an eighth aspect, a computer-readable medium is provided that stores instructions which, when executed by a processor, cause any of the control methods of the second aspect above to be performed.

[0052] Based on the above technical solutions, the first precision of the region where the storage circuit or the first computing core is located can be evaluated by one or more of these methods combined. Combining multiple methods can help improve the accuracy of the evaluation results.

Brief Description of the Drawings

[0053] Figure 1 shows a schematic diagram of an in-memory computing system according to an exemplary embodiment of this application;

[0054] Figure 2 shows a schematic diagram of an in-memory computing system according to an exemplary embodiment of this application;

[0055] Figure 3 is a schematic diagram of the architecture of an in-memory computing system according to an exemplary embodiment of this application;

[0056] Figure 4 is a diagram illustrating the effect of dynamic input according to an exemplary embodiment of this application;

[0057] Figure 5 is a schematic diagram of an implementation of input control according to an exemplary embodiment of this application;

[0058] Figure 6 is a schematic diagram of an input data segmentation method according to an exemplary embodiment of this application;

[0059] Figure 7 is a schematic diagram of segmented input data according to an exemplary embodiment of this application;

[0060] Figure 8 shows a schematic diagram of a shifting portion of an output circuit according to an exemplary embodiment of this application;

[0061] Figure 9 illustrates a schematic diagram of parallel computing using multiple computing cores according to an exemplary embodiment of this application;

[0062] Figure 10 shows a schematic diagram of segmented input data corresponding to multiple computing cores according to an exemplary embodiment of this application;

[0063] Figure 11 shows a schematic diagram of segmented input data corresponding to multiple computing cores according to an exemplary embodiment of this application;

[0064] Figure 12 shows a schematic diagram of segmented input data corresponding to multiple computing cores according to an exemplary embodiment of this application;

[0065] Figure 13 shows a flowchart of a control method proposed in an embodiment of this application;

[0066] Figure 14 shows a flowchart of a data processing method proposed in an embodiment of this application;

[0067] Figure 15 shows an experimental result diagram of the comparison experiment for the calculation accuracy proposed in the embodiment of this application;

[0068] Figure 16 shows an experimental result diagram of the comparison experiment for the comparison of computing speed proposed in the embodiment of this application;

[0069] Figure 17 shows an experimental result diagram of a comprehensive comparative experiment proposed in the embodiments of this application;

[0070] Figure 18 is a schematic diagram of a control device according to an exemplary embodiment of the present application;

[0071] Figure 19 shows a schematic diagram of a control device according to an exemplary embodiment of the present application;

[0072] Figure 20 shows a schematic diagram of an electronic device according to an exemplary embodiment of this application.

Detailed Description

[0073] The technical solutions in this application will now be described with reference to the accompanying drawings.

[0074] To keep the drawings concise, the figures in this application only schematically show the parts related to the corresponding embodiments, and they do not represent the actual structure of the product. In addition, to make the drawings concise and easy to understand, some figures only schematically show some structures or components, and there may actually be more or fewer identical or similar structures or components.

[0075] The business scenarios described in the embodiments of this application are for illustrative purposes only and do not constitute a limitation on the technical solutions provided in the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.

[0076] In this application, unless otherwise expressly specified and limited, ordinal numbers, such as "first," "second," etc., are used only to distinguish the objects being described and should not be construed as indicating or implying the relative importance or order between the objects being described. Furthermore, ordinal numbers do not represent the quantity of the objects being described. "Multiple" includes two or more, and other quantifiers are similar. "Or," "and / or," etc., are used to describe the relationship between objects, indicating a non-exclusive inclusion. For example, "A and / or B," "A or B" can include: "A alone," "B alone," or "A and B." Similarly, "A, B, and / or C," "A, B, or C" can include: "A alone," "B alone," "C alone," "A and B," "A and C," "B and C," or "A, B, and C." Additionally, the " / " in this application is used to indicate an "or" relationship between preceding and following objects. The meaning of "one or more of A and B" or "at least one of A and B" in this application is the same as the meaning of "A and / or B" or "A or B" above. "One or more of A, B and C" or "at least one of A, B and C" has the same meaning as "A, B and / or C" or "A, B or C" above.

[0077] In this application, unless otherwise expressly specified and limited, "connection" includes direct or indirect connection between objects: the connected objects may be directly connected through a medium (e.g., wires, wiring, etc.), or indirectly connected through other elements, or may be an internal connection.

[0078] In an in-memory computing architecture, the in-memory computing system can perform in-memory computation (or operations) using memory as the carrier. This memory can include non-volatile memory (NVM) or volatile memory (VM). Volatile memory can include, but is not limited to, static random access memory (SRAM). Non-volatile memory can include, but is not limited to, flash memory, resistive random access memory (RRAM), magnetic random access memory (MRAM), or phase change memory (PCM).

[0079] For ease of understanding, Figure 1 shows a schematic diagram of an in-memory computing system according to an exemplary embodiment of this application.

[0080] As shown in Figure 1, the in-memory computing system 100 includes a storage circuit (also called an in-memory computing circuit) 110 and a control circuit 120. The storage circuit 110 stores weight data (also called weights); the control circuit 120 controls the operating state of the storage circuit 110. The operating states of the storage circuit 110 include, for example, a programming state and a calculation state. In the programming state, weight data is written into the storage circuit 110. In the calculation state, the storage circuit 110 receives an input signal Sin and converts the input signal Sin into an output signal Sout based on the weight data. The storage circuit 110 can store multiple weight data, which can be equivalent to at least one vector (or matrix). The storage circuit 110 can store weight data in units of storage cells, which can also be called storage units or storage structures. For example, the storage circuit 110 includes a storage cell array, which includes multiple storage cells arranged in an array.

[0081] The memory cell utilizes the conduction capability of a semiconductor device, such as electrical conductance or transconductance, to store weight data. For example, the memory cell may include a resistive memory device or a transistor memory device. For instance, weight data can be stored by controlling the electrical conductance of a resistive memory device, or by controlling the transconductance of a transistor memory device.

[0082] The storage circuit 110 can perform calculations in groups. For example, a storage cell array includes at least one storage cell group, and each storage cell group includes multiple storage cells that can store multiple weight data. These multiple weight data can be equivalent to a first data vector (or a first data matrix). In programming mode, the weight data is written into the storage cells, which is equivalent to writing the first data vector (or the first data matrix) into the storage cell group in the storage cell array. In calculation mode, the storage circuit 110 receives an input signal, and the conduction capability of the storage cells can change the input signal to obtain an output signal. Accumulating the output signals in the storage cell group can achieve an equivalent multiplication operation. The storage cell array includes one-dimensional arrays, two-dimensional arrays, or three-dimensional arrays, etc., and the storage cell group includes multiple storage cells located in the same row or column, or multiple storage cells located in multiple rows or columns, etc. These multiple storage cells can output their output signals collinearly.

[0083] In some possible implementations, the in-memory computing system 100 may further include an input circuit 130 and an output circuit 140. The input circuit 130 converts input data D1 into at least one input signal Sin and provides it to the storage circuit 110; the storage circuit 110 converts the received input signal Sin into an output signal Sout based on weight data; the output circuit 140 converts the output signal Sout into output data D2 and outputs it. The at least one input signal can be equivalent to a second data vector (or a second data matrix), and the output data D2 can be equivalent to the product of a first data vector (or a first data matrix) and a second data vector (or a second data matrix).
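The equivalence between the output data and a vector-matrix product can be sketched in Python as an ideal multiply-accumulate, ignoring device non-idealities. The row and column layout (inputs drive rows, outputs accumulate along columns) and the name `mac_output` are assumptions made for illustration.

```python
from typing import List


def mac_output(weights: List[List[int]], inputs: List[int]) -> List[int]:
    """Ideal multiply-accumulate of an input vector with a weight matrix.

    weights[i][j] is the weight stored at row i, column j; inputs[i] drives
    row i. Column j accumulates inputs[i] * weights[i][j] over all rows.
    """
    rows, cols = len(weights), len(weights[0])
    if len(inputs) != rows:
        raise ValueError("one input is required per row")
    return [
        sum(inputs[i] * weights[i][j] for i in range(rows))
        for j in range(cols)
    ]


weights = [[1, 2], [3, 4], [5, 6]]   # hypothetical first data matrix (3 x 2)
inputs = [1, 0, 2]                   # hypothetical second data vector
output_data = mac_output(weights, inputs)
```

A zero input contributes nothing to any column, which is why sparse input data changes how many storage units effectively take part in a computation.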

[0084] In some possible implementations, the output circuit 140 may include at least one conversion circuit that can sense the output signal Sout and convert the output signal Sout into output data D2 for subsequent circuits.

[0085] As an example, Figure 2 shows a schematic diagram of an in-memory computing system according to an exemplary embodiment of this application.

[0086] As shown in Figure 2, the in-memory computing system 200 includes a storage cell array 210, which includes multiple storage cells S_ij, where i ∈ [1, m], j ∈ [1, n], m is the number of rows in the storage cell array, and n is the number of columns in the storage cell array. Storage cell S_ij stores weight data w_ij. When the storage cell array 210 is in the programming state, the conduction capability of storage cell S_ij can be controlled based on the weight data to reach a target state, thereby storing the weight data. When the storage cell array 210 is in the calculation state, an input signal, such as an input voltage V_i, can be provided to storage cell S_ij through its input terminal IN, and storage cell S_ij outputs its output signal, such as an output current, through its output terminal OUT. The output terminals of multiple storage cells (e.g., S_1j to S_mj) can be collinear. According to Kirchhoff's laws, the output signals of the multiple storage cells are accumulated to obtain the output signal I_j, which satisfies the following formula:
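The formula referred to above does not appear in this text. A plausible form, offered as an assumption rather than as the original expression, follows from Kirchhoff's current law if the weight data w_ij is stored as a cell conductance G_ij (a symbol introduced here for illustration) and each cell contributes an output current V_i · G_ij:

```latex
I_j = \sum_{i=1}^{m} V_i \, G_{ij}, \qquad j \in [1, n]
```

For transistor memory devices, the transconductance of the cell would take the place of G_ij in this sketch.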

[0087] In some possible implementations, the input data includes digital signals, while the input signal of the storage cell array 210, such as the input voltage V_i, may include an analog signal. The input circuit 230 may include, for example, a digital-to-analog converter (DAC) to convert the digital signal into an analog signal and provide it to the memory cell array 210. In some possible implementations, the input signal to the memory cell array 210 may include a digital signal, represented by waveform characteristics such as pulse width, amplitude, or area. The input circuit 230 adjusts the waveform of the signal based on the input data to obtain the input signal, which is then provided to the memory cell array.

[0088] In some possible implementations, the output circuit 240 may include at least one conversion circuit for converting the output signal of the memory cell array 210 and outputting it to a subsequent circuit. For example, the output circuit 240 may include a first conversion circuit 241 for performing a first conversion on the output signal of the memory cell array 210. For example, if the input signal includes a voltage signal and the output signal includes a current signal, the first conversion circuit 241 can convert the current signal into a voltage signal. Optionally, the output circuit 240 may further include a second conversion circuit 242. The second conversion can be implemented, for example, through a sampling circuit, and the signal converted by the first conversion circuit 241 can be further provided to the second conversion circuit 242 for a second conversion. For example, the first conversion circuit 241 may include a transimpedance amplifier (TIA) to convert the current signal into a voltage signal; the second conversion circuit 242 may include an analog-to-digital converter (ADC) to convert the analog signal into a digital signal and provide it to the subsequent circuit. Additionally, in the example of Figure 2, the control circuit 220 can be used to control the operating state of the memory cells S_ij in the memory cell array 210, such as the programming state and calculation state mentioned above.

[0089] Figure 2 is only an example illustrating one connection method of memory cells in a memory cell array 210. Other connection methods can be used besides the one shown in Figure 2. For example, the input terminals of the memory cells can be connected collinearly by columns, and the output terminals can be connected collinearly by rows. Furthermore, the input terminal of the memory cell may include the gate of a transistor memory device, or it may include the source or drain of a transistor memory device; this application does not limit this. This application also does not limit the type of memory cell; for example, the memory cell may include a floating gate transistor (FGT), a memristor, a magnetic tunnel junction (MTJ), or a phase-change structure. Furthermore, the memory cell may include multiple transistors; for example, the memory cell may include a first transistor and a second transistor, where the gate of one transistor is connected to the source or drain of the other transistor, and the charge stored at the gate can be used to characterize weight data. Optionally, the gate may also be connected to a capacitor to increase the stability and duration of the stored charge.

[0090] With the development of electronic technology, the traditional von Neumann computer architecture, which separates computation and storage, can no longer meet ever-increasing performance demands. Because computation and storage are separate, data must be constantly moved from memory to the processor during computational tasks, and the computation results must then be stored back into memory. This data movement consumes a significant amount of time and resources, especially when processing large-scale datasets, where data movement overhead can become a key factor limiting system performance. Against this backdrop, the in-memory computing architecture was proposed. It reduces transmission latency and memory access power consumption by physically integrating data storage and computation, or by bringing them closer together, and can greatly improve data processing efficiency. However, it still faces many challenges; for example, the reliability of the in-memory computing architecture still needs improvement.

[0091] In view of this, embodiments of this application propose an in-memory computing system, a control method and apparatus, and an electronic device, which improve the reliability of the in-memory computing architecture by controlling the parallelism of a single computation by the computing core in the storage circuit.

[0092] Figure 3 is a schematic diagram of the architecture of an in-memory computing system according to an exemplary embodiment of this application. Referring to Figure 3, the in-memory computing system 300 includes a storage circuit 310 and a control circuit 320. The storage circuit 310 includes computing cores (or arithmetic cores) 311-31p, where p represents the number of computing cores and is a positive integer. The computing core 31a includes multiple storage units, where a∈[1,p]. These multiple storage units are used to store first weight data. The computing core 31a is used to receive a first input signal and convert the first input signal into a first output signal based on the first weight data, wherein the first input signal is generated based on the first input data. The control circuit 320 is used to control the parallelism of the computing core 31a in one computation according to the first input data or a first state of the storage circuit.

[0093] The parallelism of a single computation includes, for example, the amount of input to the computational core in a single computation. Taking the memory cell array shown in Figure 2 as an example, this parallelism can include the number of rows activated in a single computation; in other types of memory cell arrays, this parallelism can include the number of columns activated in a single computation. This parallelism is related to the number of memory cells activated within a memory cell group, and it can be characterized by the number of sub-signals in the input signals coupled to the computational core in a single computation.

[0094] For clarity, computing core 31a can be referred to as the first computing core, and the storage units within computing core 31a can be referred to as the first storage units. In some possible embodiments, the computing core may include the above-mentioned storage unit groups or may include multiple storage unit groups with their input terminals collinearly connected.

[0095] Device deviations may exist between storage units in a storage circuit, potentially causing computational errors during the computation process of the in-memory computing system. These errors may be related to the magnitude of the multiplication-accumulation result; the larger the result, the greater the computational error. In other words, for the same weight data, a larger average input results in a larger average accumulation, potentially leading to more severe computational errors. However, in practical applications, the distribution of input data may vary significantly; for example, the sparsity of the input data may differ considerably. Taking artificial intelligence (AI) applications as an example, this sparsity difference may be caused by differences in sparsity between network layers, or by differences in sparsity between different input data (e.g., input bits) within a single layer. Sparsity, for example, refers to the proportion of a certain type of element in a dataset to the total number of elements. Data sparsity can include the proportion of non-zero elements in a dataset. Zero elements are those with values near 0; the difference between their value and 0 does not affect the circuit's determination that the value is 0. Non-zero elements include elements other than zero elements; different non-zero elements can have different or the same values. Based on the above technical solution, the control circuit controls the parallelism of a single computation by the computing core, so that input data is fed in dynamically and the sparsity of the data entering each computation is controlled, which helps to reduce the computational error of the in-memory computing system and improve the reliability of the in-memory computing architecture.

[0096] Please refer to Figure 4, which is a diagram illustrating the effect of dynamic input according to an exemplary embodiment of this application. As shown in the upper example of Figure 4, the input quantity of a single computation of the storage circuit is fixed; for example, the input quantity per computation is 128 for every computing core. This is highly disadvantageous for achieving a high-performance or high-reliability in-memory computing system. In cases with high sparsity and a large average input value, the calculation error is relatively large, leading to a decrease in calculation accuracy; in cases with low sparsity and a small average input value, the calculation error is relatively small, but the fixed input quantity caps the calculation speed, wasting the computing performance of the in-memory computing system. It is evident that a fixed input quantity fails to keep the average input value stable; on the contrary, the average input value may vary widely.

[0097] The above technical solutions take the aforementioned issues into account. The parallelism (or input quantity) of a single computation by a computing core can be dynamically adjusted according to the input data or the state of the storage circuit, so that different combinations of input quantities are supported. This allows flexible adjustment for different device deviation characteristics or input data distributions, achieving coordinated optimization of computational accuracy and speed. For example, as shown in the lower example of Figure 4, the parallelism (or input quantity) of a single computation differs between computing cores. For instance, computing core 1 can perform 4 computations, with parallelism (or input quantity) of 64, 128, 32, and 32 respectively; computing core 2 can perform 5 computations, with parallelism (or input quantity) of 64, 64, 32, 32, and 64 respectively; and computing core 3 can perform 2 computations, with parallelism (or input quantity) of 128 per computation.

[0098] The above are merely examples; the number of calculations may be the same or different between computing cores, and the parallelism (or input quantity) of a single calculation may be the same or different. This application is not limited in this respect, and the number of calculations or the parallelism of a single calculation may vary depending on the input data, the state of the storage circuit, and so on.

[0099] This application does not limit the type of input signal to the computing core. For example, the first input signal described above may include a digital signal or an analog signal. That is, the above scheme can be applied to storage circuits in either the analog or digital domains. For storage circuits in the analog domain, the input signal may include an analog signal, and different values of data can be characterized by different values of the analog signal's parameters (e.g., amplitude).

[0100] In some possible embodiments, input data (e.g., first input data) is carried by an input signal (e.g., a first input signal), or in other words, the input signal is generated based on the input data. The input data can be understood as an input vector, and the input signal includes multiple sub-signals, each sub-signal corresponding to an element of the input vector. Multiple sub-signals can be input into the storage circuit in parallel, for example, inputting sub-signals to multiple rows or columns of storage cells at once. The sub-signal may correspond to a 1-bit element (e.g., element 0 or 1), or the sub-signal may correspond to a multi-bit element; this application is not limited thereto.

[0101] In some possible embodiments, the first state described above may include the state of the entire memory circuit or the state of the first computing core. The state may include the first precision of the memory region where the memory circuit or the first computing core is located.

[0102] In some possible embodiments, the control circuit described above can control the parallelism of the first computing core in a single computation based on the accumulated value of the non-zero values in the first input data. In this way, the parallelism can be dynamically adjusted by detecting the input data, thereby matching the parallelism of a single computation with the sparsity of the current input data. This achieves coordinated optimization and flexible adjustment of computational accuracy and speed under different sparsity conditions.

[0103] The accumulated value of non-zero values in the input data is positively correlated with the result of multiplication and accumulation during the calculation process; that is, the larger the accumulated value, the larger the result of multiplication and accumulation, and the greater the potential calculation error. Furthermore, the detection of the input data has low implementation complexity, which can reduce the algorithm's complexity, and it also offers strong real-time performance.

[0104] Non-zero values in the input data can include data with a value of 1 or a value greater than 1. For example, for 1-bit data, non-zero values can include the data corresponding to element 1, so the accumulated value of non-zero values in the input data can also be understood as the number of elements equal to 1; for digital-domain storage circuits, it can also be understood as the number of sub-signals in the input signal whose corresponding element is 1. Furthermore, for multi-bit data, a non-zero value can be the sum of the bits equal to 1, each weighted by the weight of its bit position; the impact of such data on the multiplication-accumulation result after conversion into the input signal is positively correlated with its value, so accumulation can be performed based on its value.

[0105] Taking 1-bit data as an example, the more 1s there are in the input data, the larger the accumulated non-zero value (under the definition above, the higher the proportion of non-zero elements), and the greater the calculation error of multiplication and accumulation based on that input data; conversely, the fewer 1s there are in the input data, the smaller the accumulated non-zero value, and the smaller the calculation error. Therefore, a threshold for the accumulated non-zero values can be set to control the parallelism.
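As a minimal illustration of the quantity discussed above, the following Python sketch computes the accumulated value of the non-zero elements and compares it with a threshold. The function names and the sample vectors are hypothetical, and the sketch assumes non-negative integer elements; it is not the detection circuit of the application.

```python
def accumulated_nonzero(data):
    """Accumulate the non-zero elements of an input vector.

    For 1-bit data this equals the number of elements equal to 1;
    for multi-bit data each element contributes its own value.
    Assumes non-negative integer elements (an assumption of this sketch).
    """
    return sum(v for v in data if v != 0)


def fits_threshold(data, threshold):
    """True if the vector can be fed in one computation under the threshold."""
    return accumulated_nonzero(data) <= threshold


one_bit = [1, 0, 0, 1, 1, 0, 0, 0]
multi_bit = [3, 0, 2, 0, 0, 1, 0, 0]

print(accumulated_nonzero(one_bit))    # count of 1s
print(accumulated_nonzero(multi_bit))  # sum of element values
print(fits_threshold(one_bit, 4))
```

A vector that does not fit the threshold would be a candidate for splitting into shorter data segments.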

[0106] In some possible embodiments, the control circuit can control the parallelism of a single computation by the computing core 31a by controlling a first input signal input to the computing core 31a to control the number of first memory units participating in the current computation (which may be referred to as the first computation for clarity), wherein the parallelism of the computation is related to the number of memory units in the memory unit group participating in the computation. The first input signal is generated based on a first data segment in the first input data, wherein the accumulated value of data with non-zero values in the first data segment is less than or equal to a first threshold.

[0107] From the perspective of the input data, the control circuit's control of the first input signal is equivalent to splitting the first input data into at least one data segment, with the first data segment being one of these data segments.

[0108] Based on the above technical solution, by controlling the input signal to the computing core, the length of the effective data input to the computing core can be affected. This allows the length of the effective data provided to the computing core at one time to be flexibly adjusted, which is equivalent to splitting the first input data into multiple data segments and inputting them to the computing core for computation in stages. Since the cumulative value of the non-zero values in each data segment is less than or equal to a first threshold, the computational accuracy of the first computing core for each data segment can be maintained at a high level. This increases the computational accuracy of the in-memory computing system for the first input data and enhances the reliability of the in-memory computing architecture.

[0109] In some possible embodiments, the control circuit described above is further configured to control the number of calculations performed by the computing core 31a based on the first input data or the first state of the storage circuit. In this way, the sparsity of the data participating in a single calculation can be reduced by performing calculations in stages by the computing core, thereby improving computational accuracy. The number of calculations performed by the computing core can be understood as being related to the number of data segments into which the first input data is divided. When the first input data is divided into N data segments, the number of calculations performed by the first computing core is N, where N is a positive integer.

[0110] In some possible embodiments, the in-memory computing system 300 can control the parallelism of a single computation by the computing core 31a in the following manner. Referring to Figure 3, the in-memory computing system 300 further includes:

[0111] Input circuit 330. Input circuit 330 can receive first input data and first control data, and input a first input signal to storage circuit 310 based on the first input data and first control data. Control circuit 320 can provide the aforementioned first control data to input circuit 330.

[0112] This application does not limit the type of the first control data, as long as the control data can truncate the first input data to the required length. For example, the first control data may include indication information for indicating the data segments (e.g., the first data segment) in the first input data that participate in the current calculation (e.g., the first calculation).

[0113] Please refer to Figure 5, which is a schematic diagram of an input control implementation according to an exemplary embodiment of this application. In some possible embodiments, the first control data includes mask data, which can extract at least a portion of the data in the first input data, i.e., the data segments (e.g., the first data segment) participating in the current calculation (e.g., the first calculation). In some possible embodiments, the control circuit can determine the first control data in the following manner: determining the first data segment based on a first threshold, wherein the cumulative value of the non-zero data in the first data segment is less than or equal to the first threshold; determining a first mask based on the first data segment and the first input data; and determining the first control data based on the first mask.

[0114] After the input circuit receives the first input data and the first control data, it can perform operations on the first input data and the first mask data to extract at least a portion of the first input data, and output the first input signal directly or indirectly through conversion, thereby realizing the control of the parallelism of the computing core.

[0115] Furthermore, the control circuit described above can also perform the following operations: when the cumulative value of non-zero data in the remaining data segment is greater than the first threshold, a second data segment is determined based on the remaining data segment and the first threshold, wherein the cumulative value of non-zero data in the second data segment is less than or equal to the first threshold; a second mask is determined based on the second data segment and the first input data; and second control data is determined based on the second mask, wherein the second control data includes the second mask data and is used to extract the second data segment from the first input data.

[0116] After the input circuit receives the second control data, it can perform operations on the first input data and the second mask data to extract at least a portion of the remaining data segments and input a second input signal to the storage circuit. The second input signal is used to carry the aforementioned second data segments.

[0117] The above process can be repeated. When the cumulative value of non-zero data in the current remaining data segment is less than or equal to the first threshold, a third mask can be determined based on the current remaining data segment and the first input data. Third control data is then determined based on the third mask; the third control data includes the third mask data and is used to extract the current remaining data segment from the first input data.

[0118] After receiving the third control data, the input circuit can perform operations on the first input data and the third mask data to extract the data of the current remaining data segment. A third input signal carrying the current remaining data segment is then input to the storage circuit. Based on this technical solution, by recording data segments using mask data, the segmentation of input data can be implemented in the hardware of the input circuit, thereby improving the efficiency of input data segmentation control and further enhancing the efficiency of the in-memory computing system.

[0119] In some other embodiments, the above input circuit may include a multiplier circuit or a logic gate circuit, etc. Such circuits can perform corresponding logical operations to retain valid data (such as the data of the first data segment) in the input data, and turn other input data into 0 after the operation, so as to achieve the effect of invalid input.
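The mask-based extraction described above can be sketched in Python as follows. The sketch assumes that each data segment is a contiguous range given as a hypothetical (start, length) pair and that the masking operation is an element-wise multiplication, which for 1-bit data behaves like a logical AND. The function names are illustrative only.

```python
def build_mask(total_length, start, length):
    """Mask with 1 inside the segment [start, start + length) and 0 elsewhere."""
    return [1 if start <= i < start + length else 0 for i in range(total_length)]


def apply_mask(data, mask):
    """Keep the data selected by the mask and turn all other positions into 0."""
    return [d * m for d, m in zip(data, mask)]


# Hypothetical 8-element input split into two segments of length 4.
data = [1, 0, 1, 1, 0, 1, 0, 1]
segments = [(0, 4), (4, 4)]

masked_inputs = [
    apply_mask(data, build_mask(len(data), start, length))
    for start, length in segments
]
for vector in masked_inputs:
    print(vector)
```

Because the masks of the segments do not overlap and together cover the whole vector, the element-wise sum of the masked inputs reproduces the original input data, so no input element is lost or counted twice across the staged computations.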

[0120] To facilitate understanding, the following example illustrates the segmentation process of the first input data from the perspective of data splitting.

[0121] Figure 6 is a schematic diagram of an input data segmentation method according to an exemplary embodiment of the present application. Figure 7 is a schematic diagram of segmented input data according to an exemplary embodiment of the present application.

[0122] During the segmentation of the input data, the first input data to be input is obtained based on the data input length of the storage circuit; in this embodiment, the data input length can be understood as the maximum data input length of the storage circuit, and a data input length of 256 is taken as an example here. This first input data is used as the current data segment, and the accumulated value of the non-zero values in the current data segment is determined. Taking 1-bit data as an example, this accumulated value can be determined by counting the elements equal to 1. If the accumulated value does not exceed a set threshold (e.g., a first threshold), a first input signal can be generated based on the current data segment and sent to the storage circuit to complete one calculation. If the accumulated value exceeds the set threshold, the current data segment is halved, and the above process is repeated for each half until a set of data segments (or split vectors) is obtained in which the accumulated value of the non-zero values in each data segment does not exceed the set threshold. Thus, each data segment in this set meets the threshold limit while the number of segments is kept as small as possible, thereby achieving a high computing power utilization rate.
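The top-down halving procedure can be sketched in Python as below. This is only an illustration for 1-bit data: the function name, the (start, length) representation of a segment, and the `min_length` guard that stops recursion are assumptions of the sketch, and it performs pure halving without the merging refinement discussed elsewhere in this application.

```python
def split_top_down(data, threshold, min_length=1):
    """Recursively halve the input until every segment meets the threshold.

    Returns a list of (start, length) pairs covering the whole input.
    A segment of min_length is accepted even if it still exceeds the
    threshold, since it cannot be halved further (assumption of this sketch).
    """
    def recurse(start, length):
        accumulated = sum(data[start:start + length])
        if accumulated <= threshold or length <= min_length:
            return [(start, length)]
        half = length // 2
        return recurse(start, half) + recurse(start + half, length - half)

    return recurse(0, len(data))


# Hypothetical 16-element 1-bit input with its 1s concentrated at the front.
data = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
segments = split_top_down(data, threshold=2)
print(segments)
```

With this input the dense front part ends up in short segments while the all-zero back half stays as one long segment, which is the behaviour the dynamic parallelism is meant to exploit.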

[0123] In the examples above, the splitting granularity is a power of 2, such as 8, 16, 32, 64, 128, 256, etc. This is convenient for hardware implementation and can improve the efficiency of data splitting; in a software implementation, it keeps the algorithm simple and helps to reduce complexity. However, this application is not limited to this. In some other embodiments, other splitting granularities can also be used, as long as the data segments obtained by splitting meet the threshold setting range.

[0124] Furthermore, the above example uses a top-down splitting approach, but a bottom-up splitting approach can also be used. Assuming a basic splitting granularity (unit) of 8, first obtain a first segment of length 8 from the first input data as the current segment. If the cumulative value of non-zero values in the current segment does not exceed the threshold, add one unit of data to obtain a second segment of length 16. Repeat this process until the cumulative value of non-zero values in the current segment exceeds the threshold. Then, take the previous segment as a data segment and repeat the process on the remaining data until a set of data segments (or split vectors) is obtained in which the cumulative value of non-zero values in each data segment does not exceed the threshold. In other embodiments, adding one unit of data can be replaced by doubling the length of the current data segment.
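A bottom-up variant can be sketched in the same style. The sketch grows the current segment one unit at a time and closes it just before the threshold would be exceeded. The function name and the (start, length) representation are hypothetical, and the sketch accepts a single unit even when that unit alone exceeds the threshold, a case the text above does not address.

```python
def split_bottom_up(data, threshold, unit=8):
    """Grow each segment unit by unit while it still meets the threshold.

    Returns a list of (start, length) pairs covering the whole input.
    """
    segments = []
    start = 0
    total = len(data)
    while start < total:
        # Begin with one unit (or whatever data is left).
        length = min(unit, total - start)
        while start + length < total:
            grown = min(length + unit, total - start)
            if sum(data[start:start + grown]) > threshold:
                break  # keep the previous, still valid, segment
            length = grown
        segments.append((start, length))
        start += length
    return segments


# Hypothetical 32-element 1-bit input built from four units of 8.
data = (
    [1, 0, 0, 0, 0, 0, 0, 1]
    + [0, 1, 0, 0, 0, 0, 0, 0]
    + [1, 1, 0, 0, 0, 0, 0, 0]
    + [0, 0, 0, 0, 0, 0, 0, 1]
)
segments = split_bottom_up(data, threshold=3, unit=8)
print(segments)
```

Unlike pure halving, segment boundaries here can fall on any multiple of the unit, so segment lengths are not restricted to powers of 2.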

[0125] In summary, this application does not restrict the implementation method of data splitting, as long as the data segments obtained from the splitting can meet the range of the threshold setting.

[0126] Furthermore, in some embodiments of this application, during the splitting process, the remaining segments can be merged with or remain independent of the further split segments. The merged data segments can be split unequally based on the size of the segments from the previous split. This merging and splitting method can further reduce the number of segments, thereby achieving higher computing power utilization.

[0127] Taking the splitting result shown in Figure 7 as an example, assume the threshold is 16, the data input length is 256, and the cumulative value of the non-zero values in the first input data is 55. Since 55 exceeds the threshold, the first input data can be split into data segment 1 and data segment 2, each with a length of 128. Assume the cumulative value of the non-zero values in data segment 1 is 25, and the cumulative value of the non-zero values in data segment 2 is 30. To minimize the number of segments in the input data, the subsequent splitting of the first input data can be performed as follows:

[0128] Data segment 1 is further split into data segment 3 and data segment 4, both with a length of 64. The cumulative value of the non-zero values in data segment 3 is 15, which does not exceed the threshold, so data segment 3 is retained. Data segment 4 can be merged with the aforementioned data segment 2 into a single data segment, namely data segment 5, with a length of 192. The cumulative value of the non-zero values in data segment 5 is greater than the threshold, so 128 bits of data can be split off from data segment 5 as data segment 6, and the remaining 64 bits of data become data segment 7. The cumulative value of the non-zero values in data segment 6 is 15, which does not exceed the threshold, so data segment 6 is retained. The cumulative value of the non-zero values in data segment 7 is 25, which exceeds the threshold, so data segment 7 can be split into data segment 8 and data segment 9, each with a length of 32. The cumulative value of the non-zero values in data segment 8 is 15, and that in data segment 9 is 10; both meet the condition of not exceeding the threshold. Thus, the splitting operation of the first input data based on the threshold is completed.
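The arithmetic of this example can be checked with a few lines of Python. The values for data segments 3, 6, 7, 8, and 9 are taken from the text; the values for data segment 4 (25 − 15 = 10) and data segment 5 (10 + 30 = 40) are not stated explicitly and are inferred here by subtraction and addition.

```python
threshold = 16

# Inferred intermediate values (not stated explicitly in the text).
segment4_nonzero = 25 - 15                  # segment 1 minus segment 3
segment5_nonzero = segment4_nonzero + 30    # segment 4 merged with segment 2
segment5_length = 64 + 128

# Final data segments as (length, accumulated non-zero value).
final_segments = {
    3: (64, 15),
    6: (128, 15),
    8: (32, 15),
    9: (32, 10),
}

total_length = sum(length for length, _ in final_segments.values())
total_nonzero = sum(count for _, count in final_segments.values())

print(segment5_length, segment5_nonzero)
print(total_length, total_nonzero)
print(all(count <= threshold for _, count in final_segments.values()))
```

The four final segments cover the full input length and account for every non-zero element, so the input is processed in four computations.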

[0129] In some possible embodiments, the aforementioned threshold (e.g., a first threshold) may be a pre-set threshold. In some possible embodiments, the threshold may be dynamically adjusted based on a first state of the storage circuit.

[0130] When the sparsity of the first input data is very low, for example, below a threshold-related sparsity, the first input data can be directly converted into an input signal and provided to the storage circuit without splitting it into multiple data segments. Compared to schemes based on a fixed input data length, the above scheme has a significant advantage in computation speed. For example, a scheme with a fixed input data length of 128 bits must input 128 bits of data twice and perform two separate calculations, whereas the above scheme can provide the same input data all at once and complete a single calculation when the input data sparsity is very low. Therefore, for business scenarios with low sparsity, the technical solution provided in this application helps increase the computation speed of the in-memory computing system. For other business scenarios with different sparsities, it can balance computational accuracy and computational speed, enabling the in-memory computing system to utilize its computing power as much as possible while meeting the requirements for computational accuracy.

[0131] In some possible embodiments, the result of a single calculation by the computing core can be shifted and then accumulated. For example, referring to Figure 3, in some embodiments, the in-memory computing system 300 may further include an output circuit 340. This output circuit 340 can receive first indication information and, based on the first indication information, shift the output of a single calculation by the computing core 31a. The control circuit 320 can provide the first indication information to the output circuit 340. The first indication information is used to determine the number of bits to be shifted.

[0132] Suppose an in-memory computing system is to perform operations on a raw data vector. The components of this raw data vector can be encoded as multi-bit data. Each bit position of this multi-bit data can be converted into input data for calculation with the weight data vector of the corresponding computing core. When a fixed input data length scheme is adopted for the input data, the shift operation at the output end can be based on fixed segments. For example, taking 8-bit data as an example, the shift order can increase from bit 0 to bit 7 (for example, the shift order is denoted as: 01234567).

[0133] For scenarios where the length of the input data can change dynamically, the shift order can also change dynamically. For example, the input data corresponding to the 0th bit can be split into 5 data segments, the input data corresponding to the 1st bit can be split into 3 data segments, the input data corresponding to the 2nd bit can be split into 2 data segments, and so on; the shift order for this multi-bit data can be 0000011122... In some possible embodiments, the control circuit 320 can generate first indication information, and the shift order directly or indirectly indicated by this first indication information can realize arbitrary shift-add operations in the in-memory computing system based on dynamic parallelism control.
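The relationship between segment counts and the shift order can be illustrated with a short Python sketch. It assumes that every data segment of a given bit position is shifted by that bit position; the function name is hypothetical, and only the first three bit positions from the example above are shown.

```python
def shift_order(segments_per_bit):
    """Expand per-bit segment counts into the per-computation shift order.

    segments_per_bit[b] is the number of data segments that the input
    data of bit position b was split into.
    """
    order = []
    for bit, count in enumerate(segments_per_bit):
        order.extend([bit] * count)
    return order


# Bit 0 -> 5 segments, bit 1 -> 3 segments, bit 2 -> 2 segments.
order = shift_order([5, 3, 2])
print("".join(str(bit) for bit in order))
```

With a fixed input length each bit position would appear exactly once in the order; with dynamic parallelism the repetition count of each bit position follows the number of segments chosen for it.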

[0134] In some possible embodiments, the shift-addition can be performed recursively in the time domain. This process includes, for example, shift-accumulation of the weight vector and shift-accumulation of the data vector. The shift-accumulation of the weight vector can be determined based on parameters such as the encoding method of the weight data or the number of bits in the weight data. The shift-accumulation of the data vector can be determined based on parameters such as the encoding method of the data vector or the number of bits in the input data.

[0135] For example, Figure 8 shows a schematic diagram of a shifting portion of an output circuit according to an exemplary embodiment of this application.

[0136] Referring to Figure 8, the output circuit 340 may include a first register group 81, a first shift accumulator group 82, a second register group 83, a second shift accumulator group 84, and a local buffer 85. The first shift accumulator group 82 includes first shift accumulator 1 to first shift accumulator K, and the second shift accumulator group 84 includes second shift accumulator 1 to second shift accumulator J, where K and J are positive integers greater than 1. The values of K and J may be the same or different, depending on the encoding methods of the weight vector and the data vector.

[0137] The first register group 81 can be connected to the sampling circuit 80 (e.g., an ADC) to receive the sampled output of the calculation result from the sampling circuit 80. The first register group 81 is also connected to the first shift accumulator group 82 to provide the sampled output to the first shift accumulator group 82 for shift accumulation. The first shift accumulator k can shift the sampled output k times by, for example, x bits, and accumulate it with the accumulation result (or partial sum) stored in the second register group 83, then output it to the second register group 83, where k∈[1,K]. This application does not limit the shift direction or the number of bits x, which can be related to the encoding method of the weight vector.

[0138] The second register group 83 is connected to the second shift-accumulator group 84, and can provide the current accumulation result to the second shift-accumulator group 84 for shift-accumulation. The second shift-accumulator j can shift the j-th accumulation result provided by the second register group 83 by, for example, y bits, and accumulate it with the corresponding accumulation result (or partial sum) stored in the local buffer 85, where j∈[1,J]. This application does not limit the direction of the shift and the number of bits y, which can be related to the encoding method of the data vector.
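The two-stage shift accumulation of Figure 8 can be sketched in software as follows. This is only a behavioral model under stated assumptions: weights and inputs are unsigned integers processed one bit at a time, x = y = 1, shifts are left shifts, and bit positions are counted from 0. The names `bit` and `two_stage_dot` are hypothetical.

```python
def bit(value, position):
    """Return the bit of value at the given position (0 = least significant)."""
    return (value >> position) & 1


def two_stage_dot(inputs, weights, input_bits, weight_bits):
    """Dot product computed bit-serially with two shift-accumulate stages."""
    result = 0  # plays the role of the local buffer
    for j in range(input_bits):
        partial = 0  # plays the role of the second register group
        for k in range(weight_bits):
            # Sampled output for input bit j and weight bit k.
            sampled = sum(bit(x, j) * bit(w, k) for x, w in zip(inputs, weights))
            # First stage: shift according to the weight bit position.
            partial += sampled << k
        # Second stage: shift according to the input (data vector) bit position.
        result += partial << j
    return result


print(two_stage_dot([3, 5, 2], [4, 1, 7], input_bits=3, weight_bits=3))  # 31
```

The printed value equals the ordinary dot product 3·4 + 5·1 + 2·7, which shows that the two shift stages together restore the weighting of every bit pair.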

[0139] In some possible embodiments, first indication information may be provided to the second shift accumulator group 84, so that the second shift accumulator group 84 can determine the current number of shift bits as the length (e.g., number of bits) of the input data changes dynamically.

[0140] In some possible embodiments, the first indication information may indicate the position of the data component corresponding to the current input data (e.g., the first input data) in the encoded multi-bit data, such as "0000011122..." in the example above. In this way, the output circuit (e.g., the second shift accumulator group) determines the number of shifts as the length (e.g., the number of bits) of the input data changes dynamically.

[0141] In some possible embodiments, the first indication information may indicate whether the data segment corresponding to the current calculation belongs to the same input data as the data segment corresponding to the previous calculation, or whether the shift method has changed. For example, when they belong to the same input data or the shift method remains unchanged, the first indication information has a first value; when they do not belong to the same input data or the shift method changes, the first indication information has a second value. In this way, the output circuit (e.g., the second shift accumulator group) can determine the number of shift bits based on the first indication information and the number of times the sampling result is obtained.
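As a minimal sketch of this flag-style indication, assume the first value is 0 (same input data as the previous computation) and the second value is 1 (new input data, so the bit position advances by one). These encodings and the name `shifts_from_flags` are assumptions made for illustration only.

```python
def shifts_from_flags(flags):
    """Derive the bit position of each sampled result from per-step flags.

    flags[i] == 0: computation i continues the same input data (same bit).
    flags[i] == 1: computation i starts the input data of the next bit.
    The flag of the very first computation is ignored (bit position 0).
    """
    shifts = []
    bit_position = 0
    for index, flag in enumerate(flags):
        if index > 0 and flag == 1:
            bit_position += 1
        shifts.append(bit_position)
    return shifts


# Reproduces the shift order 0000011122 from the earlier example.
print(shifts_from_flags([0, 0, 0, 0, 0, 1, 0, 0, 1, 0]))
```

Here the output circuit needs only a one-bit indication per computation plus its own count of sampled results to recover the shift amount.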

[0142] In some possible embodiments, the first indication information may indicate the degree of parallelism of a single computation, and the output circuit may determine the number of shift bits based on this degree of parallelism and the number of times the sampling results are obtained.
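One possible reading of this variant is sketched below. It assumes that the degree of parallelism is the number of input elements covered by one computation and that the input data of each bit always covers the same total vector length; neither assumption is stated in the application, and the name `shifts_from_parallelism` is hypothetical.

```python
def shifts_from_parallelism(parallelism_per_step, vector_length):
    """Derive the bit position of each sampled result from the parallelism.

    The bit position advances once the computations of the current bit
    have together covered vector_length input elements.
    """
    shifts = []
    bit_position = 0
    covered = 0
    for parallelism in parallelism_per_step:
        shifts.append(bit_position)
        covered += parallelism
        if covered >= vector_length:
            bit_position += 1
            covered = 0
    return shifts


# Bit 0 in four steps of 16, bit 1 in two steps of 32, bit 2 in one step of 64.
print(shifts_from_parallelism([16, 16, 16, 16, 32, 32, 64], vector_length=64))
```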

[0143] This application does not limit the form of the first indication information, as long as the output circuit can directly or indirectly determine the position of the segment of the input data corresponding to the current calculation result in the multi-bit data, so as to determine the number of shifts. Based on the above technical solution, the step-by-step shift operation can flexibly configure the number of shifts in the in-memory computing system during the calculation process to be compatible with input data of different splitting forms. The output circuit can accurately complete the shift and accumulation operation of the calculation result without significantly changing the structure or logic of the output part of the in-memory computing system.

[0144] In some possible embodiments, referring to Figure 3, the in-memory computing system 300 may include multiple computing cores, denoted 311 to 31p in Figure 3. The computing cores 311 to 31p can be independent of one another, and all or some of them can compute in a time-shared or parallel manner. In some scenarios, however, the computing cores may not be able to operate completely independently due to factors such as access limitations of the storage circuit. In this case, in some possible embodiments, the next set of input data can be prefetched only after all relevant computing cores have completed their calculations. This implementation may leave some computing cores idle, wasting computing power.

[0145] Taking NAND flash memory as an example, computational cores can be configured in units of planes, with one core corresponding to one plane. When multiple cores run in parallel, multiple planes are accessed in parallel, sharing the same vertical address. This address access restriction means that the parallel-running cores are not completely independent of each other. Even if one core has completed its computation, the vertical access address cannot be released until the computations of other cores are completed, at which point the next set of input data can be processed.

[0146] For example, Figure 9 illustrates a schematic diagram of parallel computing using multiple computing cores according to an exemplary embodiment of this application. Referring to Figure 9, parallel computing using computing cores 91, 92, and 93 is taken as an example. For example, for computing core 91, the input data is divided into four data segments, and computing core 91 performs four calculations to complete the operation on the input data, consuming four computation cycles. For computing core 92, the input data is divided into five data segments, and computing core 92 performs five calculations to complete the operation on the input data, consuming five computation cycles. For computing core 93, the input data is divided into two data segments, and computing core 93 performs two calculations to complete the operation on the input data, consuming two computation cycles.

[0147] Although computing core 93 completed the calculation in only 2 computation cycles, due to the aforementioned access address limitation, computing core 93 was idle for 3 computation cycles before starting the next round of input data computation. Similarly, computing core 91 was idle for 1 computation cycle. This idleness of computing cores results in a waste of computing power in the in-memory computing system.
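The idle cycles in the Figure 9 example follow from simple arithmetic, sketched below. The function name `idle_cycles` is hypothetical; the cycle counts are those given in the text.

```python
def idle_cycles(cycles_per_core):
    """Idle cycles of each core when all cores must wait for the slowest one."""
    longest = max(cycles_per_core)
    return [longest - cycles for cycles in cycles_per_core]


# Computing cores 91, 92, and 93 need 4, 5, and 2 computation cycles.
print(idle_cycles([4, 5, 2]))  # [1, 0, 3]
```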

[0148] In view of this, this application also proposes an optimization scheme for parallel operation of multiple computing cores. Among the computing cores running in parallel, at least two are aligned so that their computation times are the same, and the idle time of the cores is used to improve computation accuracy. For example, the input data of computing core 93 or computing core 91 is further split, so that the computation time of computing core 93 or computing core 91 is aligned with that of computing core 92, and the computation accuracy of computing core 93 or computing core 91 is improved by using the otherwise idle time. Because some data segments in the input data are split into multiple inputs through this further input control, the accumulated value of non-zero values in each further-split data segment is reduced, which further improves the computational accuracy of the in-memory computing system. In this way, computational accuracy is maximized while the computing power of the in-memory computing system is fully used.

[0149] In some possible embodiments, referring to Figure 3, for the sake of distinction, the computing core 31b in the storage circuit 310 described above can be referred to as the second computing core, where b∈[1,p] and b≠a. The storage units in the computing core 31b can be referred to as second storage units. The second computing core includes multiple second storage units, which are used to store second weight data. The second computing core is used to receive a second input signal and convert the second input signal into a second output signal based on the second weight data, wherein the second input signal is generated based on the second input data. The control circuit 320 is also used to control the parallelism of the second computing core in one computation according to the second input data or the second state of the storage circuit 310.

[0150] In some possible embodiments, similar to the first state, the second state may include the state of the storage circuit or the second computing core, which may include the second precision of the storage region where the storage circuit or the second computing core is located. The method for controlling the parallelism of the second computing core's computation in a single operation can refer to the method for controlling the parallelism of the first computing core's computation in a single operation in the foregoing embodiments, and will not be repeated here.

[0151] In some possible embodiments, the storage circuit 310 includes a first storage region, where storage cells within the first storage region have the same first access address. The first storage region includes the first computing core and the second computing core. In some possible embodiments, the control circuit 320 is further configured to control the number of calculations performed by the second computing core based on the number of calculations performed by the first computing core. The first access address may differ across storage media, and this application does not limit it; for example, it may include a vertical address.

[0152] The aforementioned control over the parallelism of the second computing core in a single computation, together with the control over the number of computations performed by the second computing core, can be regarded as controlling the number of data segments in the second input data, where each data segment corresponds to the parallelism of one computation by the second computing core. This involves further splitting data segments in the second input data that already satisfy the first threshold. The further splitting can be achieved by the control circuit determining mask data: the input circuit, under the control of the control circuit, generates the corresponding input signals based on the aforementioned data segments and the mask data, so that the further-split data segments are input into the storage circuit. For ease of understanding, the following examples illustrate the secondary splitting of input data for multiple computing cores from the perspective of data splitting. The result of this data splitting can be achieved by the control circuit controlling the input form of the input data (e.g., the mask-based segmented input proposed in the foregoing embodiments).

[0153] Figure 10 shows a schematic diagram of segmented input data corresponding to multiple computing cores according to an exemplary embodiment of the present application. Figure 11 shows a schematic diagram of segmented input data corresponding to multiple computing cores according to another exemplary embodiment of the present application.

[0154] Taking the splitting result shown in Figure 10 as an example, assuming the threshold is 16, and computational cores 101 and 102 run in parallel, based on the splitting results of the first input data corresponding to computational core 101 and the second input data corresponding to computational core 102, it can be seen that the first input data includes 4 data segments, and the second input data also includes 4 data segments. Therefore, it is not necessary to split any one of the 4 data segments in the first input data, nor is it necessary to split any one of the 4 data segments in the second input data. At this time, the time consumed by the two computational cores to complete the calculation of all the input data in this round is 4 computation cycles.

[0155] Taking the splitting result shown in Figure 11 as an example, assume the threshold is 16 and computing cores 111 and 112 run in parallel. Based on the splitting results of the first input data corresponding to computing core 111 and the second input data corresponding to computing core 112, the first input data includes 4 data segments and the second input data includes 5 data segments. Computing core 111 therefore needs 4 computation cycles to complete its computation, and computing core 112 needs 5 computation cycles. Due to the aforementioned access address limitation, computing core 111 would be idle for 1 computation cycle. In this case, a target data segment in the first input data can be split, so that the first input data is also split into 5 segments. The two computing cores then complete the computation of all the input data in this round in 5 computation cycles. Moreover, the cumulative value of non-zero values in the target data segment of the first input data may be further reduced after splitting, thereby increasing the overall computational accuracy of the in-memory computing system, while the time consumed by the two parallel computing cores to complete the entire computation task for this input data remains unchanged.

[0156] The target data segment can be any one of the multiple data segments included in the first input data. In some possible embodiments, the target data segment can include the data segment with the largest sum of non-zero values among the multiple data segments included in the first input data. Alternatively, the target data segment can include the longest data segment among the multiple data segments included in the first input data. This can further improve computational accuracy. Furthermore, with this way of selecting the data segment to be split, the difference in the sum of non-zero values between data segments can be reduced, which helps stabilize the computational accuracy of the storage circuit across different data segments.
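The alignment by further splitting can be sketched as follows. The sketch assumes that each data segment is a list of non-negative values, that the target data segment is the one with the largest sum, and that it is split at its midpoint; the application realizes the split through mask data and does not prescribe a split position. The segment values and the names `split_largest` and `align_segments` are hypothetical.

```python
def split_largest(segments):
    """Split the segment with the largest sum into two halves."""
    candidates = [i for i, seg in enumerate(segments) if len(seg) > 1]
    if not candidates:
        raise ValueError("no segment can be split further")
    target = max(candidates, key=lambda i: sum(segments[i]))
    seg = segments[target]
    mid = len(seg) // 2
    return segments[:target] + [seg[:mid], seg[mid:]] + segments[target + 1:]


def align_segments(segments, target_count):
    """Split segments until the core needs target_count computation cycles."""
    aligned = [list(seg) for seg in segments]
    while len(aligned) < target_count:
        aligned = split_largest(aligned)
    return aligned


# Hypothetical first input data in 4 segments, each summing to at most 16.
first = [[3, 5, 8], [1, 2], [9, 7], [4]]
# The second core needs 5 cycles, so the first input data is aligned to 5.
print(align_segments(first, 5))  # [[3], [5, 8], [1, 2], [9, 7], [4]]
```

The same function can be applied to each of several faster cores in turn, using the segment count of the slowest core as `target_count`.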

[0157] In some possible embodiments, assuming that the second input data includes 6 data segments, then after splitting the target data segments of the first input data, the control circuit 320 can further split the data segments in the first input data in the same or different ways.

[0158] In some possible embodiments, there can be more than two computing cores running in parallel. When there are more than two computing cores, the method of splitting or merging the input data for the computing cores can be implemented by referring to the scheme ideas proposed in the above examples.

[0159] For ease of understanding, the following example uses three computing cores running in parallel to illustrate the method for processing input data in this application embodiment.

[0160] Figure 12 shows a schematic diagram of segmented input data corresponding to multiple computing cores according to an exemplary embodiment of this application.

[0161] Taking the splitting result shown in Figure 12 as an example, assume a threshold of 16 and that computing cores 121, 122, and 123 run in parallel. The first input data to computing core 121 is split into 4 segments, the second input data to computing core 122 is split into 4 segments, and the third input data to computing core 123 is split into 5 segments. In this case, the segment with the largest cumulative non-zero value in the first input data and the segment with the largest cumulative non-zero value in the second input data can each be split, so that the three computing cores complete the calculation of all the input data in this round in 5 computation cycles. The cumulative values of non-zero values in the target data segments of the first input data and of the second input data may thereby be further reduced, increasing the overall computational accuracy of computing cores 121 and 122. Meanwhile, the time consumed by the three parallel computing cores to complete all computational tasks for this input data remains unchanged, thus maximizing the use of the computing power of the in-memory computing system.

[0162] Based on the above technical solution, the occurrence of some computing cores being idle during the execution of computing tasks by the storage circuit can be reduced, and the computing accuracy of the computing cores can be increased. Moreover, the time consumed by multiple parallel in-memory computing cores to complete all computing tasks of the input data remains unchanged, thereby maximizing the computing power of the in-memory computing system.

[0163] In storage circuits, device deviation is not constant and may gradually increase over time. A fixed input data length or a fixed conduction mode may be suitable for the device deviation at a certain moment, but it cannot be flexibly adjusted according to the actual device state. This application also provides a control scheme that can control the parallelism of a single computation by the computing core based on the state of the storage circuit (e.g., the first state mentioned above), or it can control a threshold (e.g., the first threshold) based on the state of the storage circuit. For example, if the device deviation is still small, a larger threshold can be set to allow a larger input amount (e.g., more rows or columns conducting), achieving higher parallelism and improving computation speed. As the device deviation increases, the threshold can be reduced, decreasing the input amount (e.g., allowing fewer rows or columns to conduct), maintaining or improving computation accuracy through lower parallelism. This threshold setting allows for a flexible balance between computation accuracy and computation speed.

[0164] In some possible embodiments, the control circuit 320 may also be used to determine a first threshold based on a first state of the storage circuit.

[0165] In some possible embodiments, the first state of the storage circuit can also be used directly to control the parallelism of the computation core in one computation.

[0166] In some possible embodiments, the first state of the storage circuit may include, for example, the operating state of the storage circuit or of the computing core 31a included in the storage circuit. This operating state may include, for example, running time or the number of computations. For instance, the running time or the number of computations may be divided into multiple regions, each corresponding to a different threshold. The region to which the storage circuit or computing core 31a belongs is determined based on its current running time or number of computations; the threshold corresponding to that region is then used as the first threshold.
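The region-based selection of the first threshold can be sketched as a table lookup. The region boundaries and threshold values below are purely hypothetical; they only follow the tendency described earlier that the threshold decreases as device deviation grows with use.

```python
from bisect import bisect_right

# Hypothetical region boundaries, expressed as a number of computations.
REGION_BOUNDS = [1_000, 10_000, 100_000]
# Hypothetical thresholds, one per region (one more entry than boundaries).
REGION_THRESHOLDS = [32, 24, 16, 8]


def first_threshold(computation_count):
    """Return the threshold of the region containing computation_count."""
    region = bisect_right(REGION_BOUNDS, computation_count)
    return REGION_THRESHOLDS[region]


print(first_threshold(500))     # 32
print(first_threshold(50_000))  # 16
```

The same lookup applies when the regions are defined over running time or over an accuracy index instead of the number of computations.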

[0167] In some possible embodiments, the first state of the storage circuit may include, for example, information related to the accuracy of the storage circuit or its computational core 31a, which can be referred to as an accuracy index. The accuracy index is divided into multiple regions, each corresponding to a different threshold. Based on the current accuracy index of the storage circuit or computational core 31a, the region to which it belongs is determined; the threshold corresponding to that region is used as the first threshold. The detection of the accuracy index can be performed on the overall accuracy of the storage circuit or on the accuracy of the computational core.

[0168] In some possible embodiments, the accuracy metric can be acquired periodically, and a first threshold can be determined based on the acquired accuracy metric. In some possible embodiments, the acquisition of the accuracy metric can be event-triggered, for example, triggered after the storage circuit or computing core 31a has continuously performed computational tasks. Alternatively, it can be triggered based on the runtime of the storage circuit or computing core, for example, when the runtime of the storage circuit or computing core 31a reaches or exceeds a time threshold.

[0169] The above precision indicators are indicators that reflect the operational or storage precision of the storage circuit or computing core. This application does not limit the type of these indicators. For example, they may include usage time, number of uses, sampled values of the computation results, or bit error rate. The division into regions may differ depending on the type of indicator.

[0170] In some possible embodiments, the above-mentioned accuracy indicators can be obtained or determined by at least one of the following methods:

[0171] Method 1: Determine the accuracy index based on the usage of the storage circuit or the computing core.

[0172] In some possible embodiments, the usage of the storage circuit may include the runtime (or duration) of the storage circuit; or, the usage of the computing core may include the runtime (or duration) of the computing core. The runtime mentioned above refers, for example, to the accumulated runtime of the storage circuit or computing core before a refresh. In some possible embodiments, the runtime of the storage circuit or computing core can be counted after a refresh, and the runtime count is restarted after the data in the storage circuit or computing core is rewritten.

[0173] In some possible embodiments, the usage of the storage circuitry may include the number of times the storage circuitry is used; or, the usage of the computing core may include the number of times the computing core is used. For example, the number of uses may include the usage of data stored in the storage circuitry or the computing core, such as one or more of the following: the number of data reads or the number of computations.

[0174] The above usage count refers to the cumulative usage count of the storage circuit or computing core before a refresh. In some possible embodiments, the usage count of the storage circuit or computing core can be counted after a refresh, and the usage count is restarted after the data is rewritten.

[0175] The above usage information, such as usage time or number of uses, can be used as an accuracy indicator to characterize the accuracy. Alternatively, a correlation can be established between usage information such as usage time or number of uses and the accuracy value to obtain the corresponding accuracy value as an accuracy indicator.

[0176] For example, a relationship table can be established between the usage of a memory circuit or computing core (e.g., runtime or number of uses) and its precision value. The corresponding precision value, serving as a precision indicator, is determined by looking up this relationship table based on the current usage of the memory circuit or computing core. For instance, the relationship table can be determined through one or more methods, such as experience, theoretical derivation, experimentation, or simulation.

[0177] For example, the relationship between the usage of storage circuits or computing cores and accuracy values can be modeled. By collecting the usage data of the storage circuit or computing core and inputting the usage data into the model, the corresponding accuracy value can be obtained as the accuracy indicator.

[0178] Method 2: Determine the accuracy index based on the first sampled value, wherein the first sampled value can be used to indicate the equivalent data stored in the storage unit.

[0179] In some possible embodiments, a reading method (e.g., sampling reading) can be used to obtain the representation value of the equivalent data stored in the current tested storage unit, and the accuracy index can be determined based on the difference between the representation value and the target value of the target data.

[0180] Optionally, this difference can be used as a precision metric. Alternatively, a precision metric can be determined based on this difference. For example, a relationship table between the difference and the precision value can be established. The corresponding precision value is determined based on the current difference, serving as the precision metric. For instance, the relationship table can be determined through one or more methods such as experience, theoretical derivation, experimentation, or simulation. For example, by modeling the relationship between the difference and the precision value, statistically analyzing the difference data, and inputting the difference data into the model, the corresponding precision value can be obtained as the precision metric.

[0181] The above differences could be the differences obtained by reading a single memory cell of the storage circuit or computing core; or, the average of the differences obtained by reading all or part of the memory cells of the storage circuit or computing core.
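Method 2 can be sketched as follows, assuming that the difference is taken as an absolute value and averaged over the memory cells that were read. The name `accuracy_index` and the sample values are hypothetical.

```python
def accuracy_index(read_values, target_values):
    """Average absolute difference between read values and target values."""
    if len(read_values) != len(target_values) or not read_values:
        raise ValueError("need the same non-zero number of read and target values")
    differences = [abs(r - t) for r, t in zip(read_values, target_values)]
    return sum(differences) / len(differences)


# Three cells read back as 10, 12, 7 while the targets were 10, 10, 8.
print(accuracy_index([10, 12, 7], [10, 10, 8]))  # 1.0
```

Passing a single pair of values covers the case in which the difference of a single memory cell is used directly.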

[0182] The above sampling and reading can be achieved using an ADC, for example.

[0183] Method 3: Determine the accuracy index based on a second sampled value. The second sampled value may include sampled values of the computing core or of a storage unit group of the computing core, and can be used to indicate the calculation result of the computing core or the storage unit group.

[0184] In some possible embodiments, when using a sampling circuit to obtain the computation results of a computing core or memory cell group, the sampling precision of the sampling circuit can be set to be higher than that used in actual operation. For example, the ADC can be configured with additional redundant precision bits during sampling. The value of this redundant precision bit can reflect the magnitude of the computational precision; for example, a larger value indicates a greater deviation in the computational result and lower precision. The precision index can be determined based on this redundant precision bit.

[0185] Optionally, the value corresponding to the redundant precision bit can be used as a precision index. Alternatively, the precision index can be determined based on the value corresponding to the redundant precision bit. For example, a relationship table between the value and the precision value can be established, and the precision value corresponding to the currently sampled value is looked up and used as the precision index. The relationship table can be determined through one or more methods such as experience, theoretical derivation, experimentation, or simulation. As another example, the relationship between the value and the precision value can be modeled; the values corresponding to the redundant precision bits are collected and input into the model to obtain the corresponding precision value, which serves as the precision index.

[0186] Method 4: Determine the precision index based on the bit error rate of the original codeword, which is used to generate data to be written into the storage circuit or computing core.

[0187] In some possible embodiments, data stored in the storage circuitry or computing core can be read, or computation results can be sensed. The read or sensed data can be verified, for example, using error-correcting codes. The lower the precision of the storage circuitry or computing core, the higher the bit error rate; therefore, the bit error rate can be used to determine the precision metric.
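A bit error rate can be computed as sketched below. The sketch assumes that the reference (error-free) words are known, for example after error correction, and that all words have the same width; the name `bit_error_rate` is hypothetical.

```python
def bit_error_rate(read_words, reference_words, word_bits):
    """Fraction of bits that differ between read words and reference words."""
    if len(read_words) != len(reference_words) or not read_words:
        raise ValueError("need the same non-zero number of read and reference words")
    mask = (1 << word_bits) - 1
    errors = sum(
        bin((read ^ ref) & mask).count("1")
        for read, ref in zip(read_words, reference_words)
    )
    return errors / (len(read_words) * word_bits)


# One flipped bit out of eight bits read.
print(bit_error_rate([0b1010, 0b1111], [0b1000, 0b1111], word_bits=4))  # 0.125
```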

[0188] This application does not limit the type of error correction code (ECC). For example, the error correction code may include a low-density parity-check (LDPC) code or other error correction codes.

[0189] Optionally, the bit error rate (BER) can be used as a precision metric. Alternatively, the precision metric can be determined based on the BER. For example, a relationship table between the BER and precision values can be established, and the precision value corresponding to the currently obtained BER is looked up and used as the precision metric. The relationship table can be determined through one or more methods such as experience, theoretical derivation, experimentation, or simulation. As another example, the relationship between the BER and precision values can be modeled; the BER data are collected and input into the model to obtain the corresponding precision value, which serves as the precision metric.

[0190] In some possible embodiments, when the tolerance for computational error of the storage circuit or computing core is high, the operation speed of the storage circuit or computing core can be increased, and the value of the first threshold can therefore be adaptively increased. Increasing the value of the first threshold may cause the control circuit to change the input form of the input data, for example, by increasing the parallelism of the computing core when computing on the input data. This reduces the number of data segments into which the input data is divided, which in turn reduces the time the computing core needs to complete the computation on the input data, thereby improving the computation speed of the storage circuit or computing core.

[0191] In some possible embodiments, when the control circuit determines the first threshold, it can simultaneously consider the computational accuracy and computational speed of the storage circuit or the computing core, and weigh these two parameters. By adjusting the value of the threshold, a balance between the computational speed and computational accuracy of the in-memory computing system can be achieved in different scenarios.

[0192] For example, a relationship can be established between a threshold, an accuracy metric, and a computational speed. A first threshold is determined based on the target accuracy metric and the target computational speed. For instance, this relationship can include a relational table and can be determined through one or more methods such as experience, theoretical derivation, experimentation, or simulation. For example, the first threshold can be obtained by modeling the relationship between the threshold, the accuracy metric, and the computational speed, by determining the target accuracy and the target computational speed, and inputting the target accuracy metric and the target computational speed into the model.
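Such a relationship table could be used as sketched below. All table entries are hypothetical placeholders rather than measured values, the accuracy metric is assumed to be one where larger is better, and the selection rule (largest feasible threshold) is only one possible choice. The name `pick_threshold` is likewise hypothetical.

```python
# Hypothetical rows: (threshold, accuracy metric, relative computation speed).
RELATION_TABLE = [
    (8, 30.0, 1.0),
    (16, 26.0, 1.8),
    (32, 21.0, 3.0),
]


def pick_threshold(table, target_accuracy, target_speed):
    """Return the largest threshold meeting both targets, or None if none does."""
    feasible = [
        threshold
        for threshold, accuracy, speed in table
        if accuracy >= target_accuracy and speed >= target_speed
    ]
    return max(feasible) if feasible else None


print(pick_threshold(RELATION_TABLE, target_accuracy=25.0, target_speed=1.5))  # 16
```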

[0193] Accordingly, this application also proposes a control method for implementing the control function of the control circuit on the storage circuit.

[0194] Figure 13 shows a flowchart of a control method proposed in an embodiment of this application. This control method 1300 can be used to control a storage circuit in an in-memory computing system. The storage circuit includes a first computing core, which includes multiple first storage units for storing first weight data. The first computing core receives a first input signal and converts the first input signal into a first output signal based on the first weight data. The control method 1300 may include the following steps:

[0195] S1310: Determine the first input data or the first state of the storage circuit, wherein the first input signal is generated based on the first input data.

[0196] S1320: Control the parallelism of the first computing core in one computation based on the first input data or the first state of the storage circuit.

[0197] For a detailed description of the two steps described above, please refer to the corresponding embodiments of the control circuit mentioned above, which will not be repeated here.
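
As an illustration of steps S1310 and S1320, the following minimal Python sketch treats the parallelism of one computation as the number of rows enabled in that computation and expresses the control data as masks. It assumes non-negative integer inputs and a greedy, in-order grouping of rows; the function names `accumulated_nonzero_value`, `build_masks`, and `apply_mask` are hypothetical, and the actual control circuit may work differently.

```python
from typing import List, Sequence


def accumulated_nonzero_value(first_input_data: Sequence[int]) -> int:
    """S1310 (sketch): accumulate the non-zero values of the first input data."""
    return sum(v for v in first_input_data if v != 0)


def build_masks(first_input_data: Sequence[int], first_threshold: int) -> List[List[int]]:
    """S1320 (sketch): one mask per computation; 1 enables a row, 0 disables it."""
    n = len(first_input_data)
    if accumulated_nonzero_value(first_input_data) <= first_threshold:
        return [[1] * n]  # everything fits in a single, fully parallel computation
    masks: List[List[int]] = []
    current, running = [0] * n, 0
    for row, value in enumerate(first_input_data):
        if value > first_threshold:
            raise ValueError("a single input value exceeds the first threshold")
        if running + value > first_threshold:
            masks.append(current)          # close this computation
            current, running = [0] * n, 0  # start the next one
        current[row] = 1
        running += value
    masks.append(current)
    return masks


def apply_mask(first_input_data: Sequence[int], mask: Sequence[int]) -> List[int]:
    """Input-circuit side (sketch): masked-out rows contribute a zero input."""
    return [v * m for v, m in zip(first_input_data, mask)]
```

For the input `[1, 1, 0, 1, 1, 1]` and a first threshold of 2, this sketch produces three computations with 3, 2, and 1 enabled rows; with a first threshold of 5 it produces a single computation with all six rows enabled.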

[0198] In some possible embodiments, a data processing device can split the first input data and input the resulting data segments directly to the input circuit, so that the input circuit generates the corresponding input signals from the data segments, thereby controlling the parallelism of the first computing core in one computation. This data processing device may be located in the control circuit 120 / 220 or be independent of it.

[0199] Figure 14 shows a flowchart of a data processing method proposed in an embodiment of this application.

[0200] The data processing method 1400 can be executed by the aforementioned data processing device. Referring to Figure 14, the method 1400 may include the following steps:

[0201] S1410: Receive the first input data.

[0202] S1420: Divide the first input data into N data segments, where N is a positive integer, and the cumulative value of the non-zero data in each data segment is less than or equal to the first threshold.

[0203] When N=1, the first input data does not need to be split into multiple segments, that is, splitting the first input data into N data segments includes the case where the first input data is not split.

[0204] S1430: Transmit the N data segments to the input circuit.
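
Steps S1410 to S1430 can be sketched in Python as follows. The sketch assumes non-negative integer inputs and contiguous segments formed greedily in input order; the function name `split_into_segments` is hypothetical, and this application does not state that segments must be formed this way.

```python
from typing import List, Sequence


def split_into_segments(
    first_input_data: Sequence[int], first_threshold: int
) -> List[List[int]]:
    """Split the input into N >= 1 contiguous segments (S1420, sketch).

    The cumulative value of the non-zero data in every returned segment
    is less than or equal to first_threshold.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    running = 0
    for value in first_input_data:
        if value > first_threshold:
            raise ValueError("a single input value exceeds the first threshold")
        if current and running + value > first_threshold:
            segments.append(current)  # close the current segment
            current, running = [], 0
        current.append(value)
        running += value
    if current or not segments:
        segments.append(current)      # N is always at least 1
    return segments
```

For example, `split_into_segments([1, 0, 1, 1, 0, 0, 1, 1], 2)` yields three segments, while a first threshold of 16 leaves the same input unsplit (N=1). Each returned segment would then be transmitted to the input circuit in turn.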

[0205] Figure 15 illustrates the effects of different control schemes under an exemplary embodiment of this application. Referring to Figure 15, the horizontal axis represents the layer numbering of the large model, and the vertical axis represents the signal-to-noise ratio (SNR). The first threshold is assumed to be 16.

[0206] The SNR and the number of data segments are evaluated for the following three schemes in turn:

[0207] Scheme 1: an in-memory computing scheme with a fixed input data length, assumed here to be 64. Scheme 2: a scheme that dynamically adjusts the input data length, but without optimization for parallel operation of the computing cores. Scheme 3: a scheme that dynamically adjusts the input data length while optimizing for parallel operation of the computing cores.

[0208] Scheme 1 exhibits significant fluctuations in computational accuracy when processing data across different layers, with an average SNR of 51.44 dB, barely meeting the required computational accuracy. Scheme 2 shows smaller fluctuations in the computational accuracy of the storage circuit across different layers, achieving an average SNR of 52.36 dB. Scheme 3 likewise shows smaller fluctuations across different layers. In addition, Scheme 3 further divides the data segments in the computing cores with shorter computation cycles, thereby further increasing the computational accuracy of the storage circuit and achieving an average SNR of 56.75 dB.
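
This application does not state how the SNR in Figure 15 is computed. As a hedged illustration only, the following Python sketch uses the common definition of SNR in decibels, treating the ideal (noise-free) output as the signal and the difference between the measured and ideal outputs as the noise. The function name `snr_db` and this choice of definition are assumptions.

```python
import math
from typing import Sequence


def snr_db(ideal: Sequence[float], measured: Sequence[float]) -> float:
    """SNR in dB: 10 * log10(signal power / noise power)."""
    if len(ideal) != len(measured):
        raise ValueError("ideal and measured outputs must have the same length")
    signal_power = sum(x * x for x in ideal)
    noise_power = sum((x - y) ** 2 for x, y in zip(ideal, measured))
    if noise_power == 0:
        return math.inf    # no error at all
    if signal_power == 0:
        return -math.inf   # only noise is present
    return 10.0 * math.log10(signal_power / noise_power)
```

Under this definition, the reported average for Scheme 3 (56.75 dB) exceeds that of Scheme 1 (51.44 dB) by 5.31 dB, which corresponds to a signal-to-noise power ratio roughly 3.4 times higher.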

[0209] In summary, the control scheme for dynamically adjusting the length of input data proposed in the embodiments of this application can effectively increase the computational accuracy of the in-memory computing system and improve the reliability of the in-memory computing architecture.

[0210] Figure 16 shows a comparison of computational speeds proposed in the exemplary embodiments of this application. Referring to Figure 16, the horizontal axis represents the layer numbering of the large model, and the vertical axis represents the number of data segments input to the storage circuit. From Figure 16, it can be seen that:

[0211] Taking a data input length of 256 as an example, and assuming that Scheme 1 uses a fixed 64-bit data input, Scheme 1 requires 4 calculations regardless of the data sparsity. Scheme 2 can dynamically adjust the number of calculations based on data sparsity; for example, it can input a maximum of 256 bits of data at a time. Therefore, Scheme 2 has fewer data segments than Scheme 1 in most layers, meaning that Scheme 2 computes faster than Scheme 1, especially in layers with high input data sparsity, where Scheme 2 can significantly accelerate computation. Schemes 1 and 2 are as defined in the description of Figure 15 above.
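
The arithmetic in this comparison can be checked with a short Python sketch. It assumes binary (0/1) inputs, a first threshold of 16 as in the Figure 15 example, and greedy in-order segmentation; the function names are hypothetical.

```python
import math
from typing import Sequence


def fixed_segment_count(length: int, fixed_length: int) -> int:
    """Scheme 1 (sketch): number of calculations with a fixed input length."""
    return math.ceil(length / fixed_length)


def dynamic_segment_count(data: Sequence[int], first_threshold: int) -> int:
    """Scheme 2 (sketch): number of calculations when segments are closed
    only once the cumulative non-zero value would exceed the threshold."""
    count, running = 1, 0
    for value in data:
        if value > first_threshold:
            raise ValueError("a single input value exceeds the first threshold")
        if running + value > first_threshold:
            count += 1
            running = 0
        running += value
    return count


# A sparse 256-element input with 10 non-zero entries (illustrative data).
sparse = [1 if i % 26 == 0 else 0 for i in range(256)]
# A fully dense 256-element input (illustrative data).
dense = [1] * 256
```

Here `fixed_segment_count(256, 64)` is 4, while `dynamic_segment_count(sparse, 16)` is 1, which is the speed-up described for sparse layers. For the dense input the dynamic count is 16, which is consistent with the statement that Scheme 2 has fewer segments in most layers rather than in all of them.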

[0212] Figure 17 shows a comprehensive comparison of the effects proposed in an exemplary embodiment of this application. Noise is added to the computation process of the large model, and the SNR and number of data segments of the following various schemes are evaluated sequentially:

[0213] Scheme a: a scheme based on a fixed input data length, with the fixed number of conducting rows set to 64;

[0214] Scheme b: a scheme based on a fixed input data length, with the fixed number of conducting rows set to 128;

[0215] Scheme c: a scheme that dynamically controls the length of input data and optimizes the parallel operation of multiple computing cores, with the first threshold set to 16;

[0216] Scheme d: a scheme that dynamically controls the length of input data and optimizes the parallel operation of multiple computing cores, with the first threshold set to 18;

[0217] Scheme e: a scheme that dynamically controls the length of input data and optimizes the parallel operation of multiple computing cores, with the first threshold set to 20.

[0218] In Figure 17, the horizontal axis represents the relative SNR and the vertical axis represents the relative computation time. The experimental results of scheme a serve as the baseline: its relative SNR is set to 0 and its relative computation time is set to 1.

[0219] Although the computation time of scheme b is significantly shorter than that of scheme a, the computation accuracy of scheme b is very low and usually cannot meet the computation accuracy requirements of the storage circuit.

[0220] For schemes c, d, and e, as the first threshold decreases, the relative SNR of the storage circuit increases and its computation time also increases; in other words, the computation accuracy of the storage circuit rises while its computation speed falls. Even so, the computation time remains shorter than that of scheme a. Although the computation time is longer than that of scheme b, the computation accuracy of the storage circuit in schemes c, d, and e is significantly higher, which meets the computation accuracy requirements of the storage circuit, helps to make full use of the computing power of the in-memory computing system, and reduces wasted computing power.

[0221] As Figure 17 shows, adjusting the first threshold yields a smooth Pareto front relating the first threshold, computational accuracy, and computational speed. When applying the method of this embodiment, a point on this front can be selected by adjusting the first threshold, which allows smooth and precise adjustment of the trade-off between computational accuracy and computational speed.
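
To make the notion of a Pareto front concrete, the following Python sketch keeps only the operating points that no other point beats on both relative SNR (higher is better) and relative computation time (lower is better). The names `OperatingPoint` and `pareto_front` are hypothetical. Only the baseline values of scheme a (relative SNR 0, relative time 1) come from the description of Figure 17; every other number in `EXAMPLE_POINTS` is a placeholder chosen to match the qualitative ordering described above, not measured data.

```python
from typing import List, NamedTuple, Sequence


class OperatingPoint(NamedTuple):
    label: str
    relative_snr: float   # higher is better
    relative_time: float  # lower is better


def pareto_front(points: Sequence[OperatingPoint]) -> List[OperatingPoint]:
    """Return the non-dominated points, sorted by relative computation time."""
    front = []
    for p in points:
        dominated = any(
            q.relative_snr >= p.relative_snr
            and q.relative_time <= p.relative_time
            and (q.relative_snr > p.relative_snr or q.relative_time < p.relative_time)
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.relative_time)


# Scheme a is the stated baseline; all other values are illustrative placeholders.
EXAMPLE_POINTS = [
    OperatingPoint("a", 0.0, 1.0),
    OperatingPoint("b", -6.0, 0.5),
    OperatingPoint("c", 4.0, 0.9),
    OperatingPoint("d", 3.0, 0.8),
    OperatingPoint("e", 2.0, 0.7),
]
```

With these placeholder values, scheme a is dominated by scheme c and drops out. Scheme b stays on the front only because it is the fastest point, so a practical selection would also apply a minimum accuracy requirement before choosing among c, d, and e.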

[0222] In the above method embodiments, the order of the process numbers does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0223] This application also provides an apparatus for implementing any of the above methods. For example, a control device is provided, which includes a unit (or means) for implementing any of the above control methods 1300.

[0224] Figure 18 is a schematic diagram of a control device according to an exemplary embodiment of the present application.

[0225] The control device 1800 can be used to control a storage circuit, which includes a first computing core. The first computing core includes multiple first storage units for storing first weight data. The first computing core is used to receive a first input signal and convert the first input signal into a first output signal based on the first weight data. The control device 1800 includes a determining unit 1810 and a controlling unit 1820. The determining unit 1810 is used to determine the first input data or a first state of the storage circuit, wherein the first input signal is generated based on the first input data. The controlling unit 1820 is used to control the parallelism of the first computing core in one computation according to the first input data or the first state of the storage circuit.

[0226] It should be understood that the above division of units is only a logical functional division. In actual implementation, all or part of them can be integrated into a single physical entity, or they can be physically separated. Furthermore, the above units can be implemented in the form of a processor calling software; for example, a control device may include a processor connected to a memory containing instructions. The processor calls the instructions stored in the memory to implement any of the above control methods. The memory can be internal to the control device or external to it. Alternatively, the above units can be implemented in the form of hardware circuits. The functions of some or all units can be achieved through the design of the hardware circuits, which can be understood as one or more processing circuits. For example, in some embodiments, the hardware circuit may include an application-specific integrated circuit (ASIC), which implements the functions of some or all of the above units through the design of the logical relationships between the devices within the circuit. Furthermore, in some embodiments, the hardware circuit can be implemented using a programmable logic device (PLD) circuit, which may include a large number of logic devices. The logical relationships between the logic devices are configured through a configuration file, thereby achieving the functions of some or all of the above units. The above control devices can be implemented by a processor calling a program; or by a hardware circuit; or partially by a processor calling a program and partially by a hardware circuit.

[0227] In some possible embodiments, the processor or processing circuit is a circuit with signal processing capabilities. For example, the processor may be a circuit capable of reading and executing instructions. In other possible embodiments, the processor can implement its functions through the logical relationships of hardware circuits, which are fixed or reconfigurable. For example, the processor may be a hardware circuit implemented as an ASIC or a PLD, such as a field-programmable gate array (FPGA). In a reconfigurable hardware circuit, the process of the processor loading a configuration file and configuring the hardware circuit can be understood as the process of the processor loading instructions to implement the functions of some or all of the above units. This application does not limit the type of processor, which may be, for example, a central processing unit (CPU), a microcontroller unit (MCU), a graphics processing unit (GPU), or a digital signal processor (DSP). Alternatively, it may be a hardware circuit designed for artificial intelligence, which can be understood as an ASIC, such as a neural network processing unit (NPU), a tensor processing unit (TPU), or a deep learning processing unit (DPU).

[0228] In some possible embodiments, the units in the above control device may be integrated in whole or in part, or may be implemented independently. In some embodiments, these units are integrated together and implemented in the form of a system on chip (SOC).

[0229] This application also provides a control device, which may be located within or include the above-described control circuit. The control device may be located within the control circuit 120 / 220 shown in FIG. 1 or FIG. 2, or may be independent of the control circuit 120 / 220. This control device can be used to execute any of the above-described control methods.

[0230] This application also provides a control device, as shown in FIG. 19. FIG. 19 shows a schematic diagram of a control device according to an exemplary embodiment of this application. As shown in FIG. 19, the control device 1900 includes at least one processing circuit 1910 and an interface circuit 1920, the interface circuit 1920 being used for signal connection with a storage circuit, and the at least one processing circuit 1910 being used for executing any of the control methods provided in the above embodiments.

[0231] This application also provides a computer program product, which includes instructions that, when executed by a processor, cause any of the control methods described in the above embodiments to be executed.

[0232] This application also provides a computer-readable medium storing instructions that, when executed by a processor, cause any of the control methods described in the above embodiments to be executed.

[0234] This application also provides an electronic device, as shown in FIG. 20. FIG. 20 illustrates a schematic diagram of an electronic device according to an exemplary embodiment of this application. As shown in FIG. 20, the electronic device 2000 may include any of the above-described in-memory computing systems 300 for processing data of the electronic device. The electronic device may also include an input / output device 2020 for receiving user input or outputting processing results. This application does not limit the input type and output type; for example, input may include voice input, text input, image input, or video input, etc. The output may include text output, voice output, image output, or video output, etc. The electronic device may also include a processor 2030, which may process data provided to the in-memory computing system 300 or process output data of the in-memory computing system 300. The output of the input / output device 2020 may be based on the output of the processor 2030 or the output of the in-memory computing system 300.

[0235] This application does not limit the type of electronic device. For example, according to some embodiments, the electronic device may include wearable devices. Wearable devices include, but are not limited to: head-mounted devices (e.g., helmets or hats), devices worn on the ears (e.g., headphones), devices worn on the wrist (e.g., watches), and devices worn on other parts of the body (e.g., electronic necklaces, medical monitoring devices, or glasses). According to some embodiments, the electronic device may include portable terminals. For example, the electronic device may include, but is not limited to, mobile phones, general-purpose computing devices (e.g., laptops or tablets), personal digital assistants, etc. According to some embodiments, the electronic device may include other types of edge devices, such as personal computers, in-vehicle computers or in-vehicle computing platforms, or smart home electronic products. According to some embodiments, the electronic device may also include devices such as servers.

[0236] In the above embodiments, the descriptions of different embodiments each have their own emphasis. Parts not described in detail or recorded in a certain embodiment can be referred to in the relevant descriptions of other embodiments. Furthermore, the different embodiments described above can be freely combined as needed. Moreover, as technology evolves, the elements described in this application can be replaced by equivalent elements appearing after this application.

Claims

1. An in-memory computing system, characterized in that it comprises: a storage circuit comprising a first computing core, the first computing core comprising a plurality of first storage units for storing first weight data, the first computing core being used to receive a first input signal and convert the first input signal into a first output signal based on the first weight data, wherein the first input signal is generated based on first input data; and a control circuit used to control the parallelism of the first computing core in one computation based on the first input data or a first state of the storage circuit.

2. The in-memory computing system according to claim 1, characterized in that the control circuit is used to: control the parallelism of the first computing core in one computation based on the cumulative value of the data with non-zero values in the first input data.

3. The in-memory computing system according to claim 1 or 2, characterized in that the control circuit is used to: control the first input signal input to the first computing core so as to control the number of first storage units participating in a first computation, wherein the parallelism of the first computation is related to the number of first storage units in the storage unit group that participate in the first computation; the first input signal is generated based on a first data segment in the first input data; the cumulative value of the data with non-zero values in the first data segment is less than or equal to a first threshold; and the computation of the first computing core includes the first computation.

4. The in-memory computing system according to any one of claims 1 to 3, characterized in that the control circuit is further used to: control the number of computations performed by the first computing core based on the first input data or the first state of the storage circuit.

5. The in-memory computing system according to any one of claims 1 to 4, characterized in that the in-memory computing system further comprises: an input circuit configured to receive the first input data and first control data, and to input the first input signal to the storage circuit based on the first input data and the first control data; wherein the control circuit is used to provide the first control data to the input circuit.

6. The in-memory computing system according to claim 5, characterized in that the first control data includes mask data.

7. The in-memory computing system according to any one of claims 1 to 6, characterized in that the in-memory computing system further comprises: an output circuit used to receive first indication information and to shift the output of the first computing core in one computation according to the first indication information, wherein the first indication information is used to determine the number of bits to be shifted; wherein the control circuit is used to provide the first indication information to the output circuit.

8. The in-memory computing system according to any one of claims 1 to 7, characterized in that the storage circuit further comprises a second computing core, the second computing core comprising a plurality of second storage units for storing second weight data, the second computing core being used to receive a second input signal and convert the second input signal into a second output signal based on the second weight data, wherein the second input signal is generated based on second input data; and the control circuit is further used to control the parallelism of the second computing core in one computation based on the second input data or a second state of the storage circuit.

9. The in-memory computing system according to claim 8, characterized in that the storage circuit includes a first storage region, wherein the storage units within the first storage region have the same first access address, and the first storage region includes the first computing core and the second computing core; and the control circuit is further used to: control the number of computations of the second computing core based on the number of computations performed by the first computing core.

10. The in-memory computing system according to any one of claims 3 to 9, characterized in that the control circuit is further used to: determine the first threshold based on the first state of the storage circuit.

11. A control method, characterized in that it is used for controlling a storage circuit, the storage circuit comprising a first computing core, the first computing core comprising a plurality of first storage units for storing first weight data, the first computing core being used to receive a first input signal and convert the first input signal into a first output signal based on the first weight data, the control method comprising: determining first input data or a first state of the storage circuit, wherein the first input signal is generated based on the first input data; and controlling the parallelism of the first computing core in one computation based on the first input data or the first state of the storage circuit.

12. The method according to claim 11, characterized in that the method further includes: controlling the parallelism of the first computing core in one computation based on the cumulative value of the data with non-zero values in the first input data.

13. The method according to claim 11 or 12, characterized in that the method further includes: controlling the first input signal input to the first computing core so as to control the number of first storage units participating in a first computation, wherein the parallelism of the first computation is related to the number of first storage units in the storage unit group that participate in the first computation; the first input signal is generated based on a first data segment in the first input data; the cumulative value of the data with non-zero values in the first data segment is less than or equal to a first threshold; and the computation of the first computing core includes the first computation.

14. The method according to any one of claims 11 to 13, characterized in that the method further includes: controlling the number of computations performed by the first computing core based on the first input data or the first state of the storage circuit.

15. The method according to any one of claims 11 to 14, characterized in that the storage circuit is connected to an input circuit, the input circuit being used to receive the first input data, and the method further includes: providing first control data to the input circuit, wherein the input circuit is used to input the first input signal to the storage circuit based on the first input data and the first control data.

16. The method according to claim 15, characterized in that the first control data includes mask data.

17. The method according to any one of claims 11 to 16, characterized in that the storage circuit is connected to an output circuit, and the method further includes: providing first indication information to the output circuit, wherein the first indication information is used to determine the number of bits to be shifted, and the output circuit is used to shift the output of the first computing core in one computation according to the first indication information.

18. The method according to any one of claims 11 to 17, characterized in that the storage circuit further comprises a second computing core, the second computing core comprising a plurality of second storage units for storing second weight data, the second computing core being used to receive a second input signal and convert the second input signal into a second output signal based on the second weight data, wherein the second input signal is generated based on second input data; and the method further includes: controlling the parallelism of the second computing core in one computation based on the second input data or a second state of the storage circuit.

19. The method according to claim 18, characterized in that the storage circuit includes a first storage region, wherein the storage units within the first storage region have the same first access address, and the first storage region includes the first computing core and the second computing core; and the method further includes: controlling the number of computations of the second computing core based on the number of computations performed by the first computing core.

20. The method according to any one of claims 13 to 19, characterized in that the method further includes: determining the first threshold based on the first state of the storage circuit.

21. A control device, characterized in that the control device includes an interface circuit and a processing circuit, the interface circuit being used for signal connection with a storage circuit, and the processing circuit being used to execute the control method according to any one of claims 11 to 20.

22. An electronic device, characterized in that it comprises the in-memory computing system according to any one of claims 1 to 10.