Systems, cores, and chips for neuromorphic computing

By adopting a chiplet-based hierarchical tree topology architecture, the disclosure addresses the limited scalability and data transmission bandwidth of neuromorphic chips, enabling efficient data communication and system expansion to meet the needs of next-generation AI models.

CN117436491B · Active · Publication Date: 2026-06-30 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ALIBABA (CHINA) CO LTD
Filing Date: 2022-07-11
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing neuromorphic chips and systems face challenges in scalability and data transfer bandwidth. The manufacturing cost of monolithic processors increases with die size, and existing architectures struggle to scale to the number of processing units required by next-generation AI models and applications.

Method used

The disclosure adopts a hierarchical tree topology architecture based on chiplets. By combining interposers and chiplets, it organizes computing nodes using multi-level switches to achieve flexible data transmission and scalability. Microbump packaging is used to improve chiplet density and data transmission bandwidth.

Benefits of technology

The architecture achieves performance similar to that of a monolithic processor together with a higher production yield, while optimizing short-range data communication and system scalability to meet the large-scale requirements of spiking neural networks.

✦ Generated by Eureka AI based on patent content.

Abstract

This disclosure provides a system, chiplet, and chip for neuromorphic computing. The system includes multiple interposers, each interposer including multiple routers and a set of chiplets, and each chiplet including multiple switches and a set of neuron processing entities. Each of the multiple switches within each chiplet is connected to one or more neuron processing entities in the set of neuron processing entities, and the multiple switches within each chiplet are organized in a tree topology. Each of the multiple routers within each interposer is connected to one or more chiplets in the set of chiplets, and the multiple routers within each interposer are organized in a tree topology. This disclosure improves data transmission bandwidth, flexibility, and scalability.

Description

Technical Field

[0001] This disclosure generally relates to neuromorphic devices and systems with ultra-high data transmission bandwidth, flexibility, and scalability for artificial intelligence (AI) applications.

Background Technology

[0002] Neuromorphic chips or systems have the potential to become the next generation of artificial intelligence (AI) architectures due to their power-efficient computing and cognitive computing capabilities. However, existing neuromorphic chip and system architectures face scaling challenges. From a performance perspective, typical neuromorphic chips or systems are based on bus, crossbar, network-on-chip (NoC), or mesh architectures, which have limited data transfer bandwidth and poor scalability. From a hardware perspective, existing neuromorphic chips or systems rely on monolithic processors (such as CPUs and GPUs) to provide computing resources. While monolithic processors offer higher density and thus potentially better performance (because processing units are closer together), the cost of manufacturing monolithic processors increases with die size (e.g., larger dies lead to lower production yields). Furthermore, next-generation AI models and applications require a large number of processing units, making monolithic chip construction increasingly impractical.

Summary of the Invention

[0003] To address the aforementioned challenges, this disclosure describes a novel chiplet-based hierarchical tree topology architecture.

[0004] Various embodiments of this specification may include hardware circuits, systems, and devices having a chiplet-based hierarchical tree topology architecture. This architecture can be applied at both the macroscopic (e.g., data center) and microscopic (e.g., chip) levels.

[0005] According to one aspect, a chiplet-based neuromorphic system is described. The system may include a plurality of interposers, each interposer including a plurality of routers and a set of chiplets, and each chiplet including a plurality of switches and a set of neuron processing entities (NPEs). In some embodiments, each of the plurality of switches within each chiplet is connected to one or more neuron processing entities within the set of neuron processing entities of that chiplet, and the plurality of switches within each chiplet are organized in a tree topology. In some embodiments, each of the plurality of routers within each interposer is connected to one or more chiplets within the set of chiplets of that interposer, and the plurality of routers within each interposer are organized in a tree topology.

[0006] In some embodiments, each neuron processing entity in a set of neuron processing entities includes a register file as local memory.

[0007] In some embodiments, the set of chiplets within each interposer is connected to one or more microbumps.

[0008] In some embodiments, the system further includes multiple rack-level switches, each rack-level switch being connected to a set of interposers among the multiple interposers, wherein the multiple rack-level switches are organized in a tree topology.

[0009] In some embodiments, each of the plurality of chiplets includes a configurable chiplet-level clock to coordinate the set of neuron processing entities within that chiplet.

[0010] In some embodiments, the chiplet-level clock of one chiplet is independent of the chiplet-level clock of another chiplet.

[0011] In some embodiments, each of the plurality of interposers includes a configurable interposer-level clock to coordinate the set of chiplets within that interposer, wherein the interposer-level clock is independent of the chiplet-level clocks.

[0012] In some embodiments, the plurality of switches organized in a tree topology within each chiplet comprises: a root-level switch at the root level, where the root level is the highest level in the tree topology; a plurality of leaf-level switches at the leaf level, where the leaf level is the lowest level in the tree topology; and a plurality of intermediate-level switches between the root level and the leaf level, wherein each of the plurality of intermediate-level switches is connected to two or more lower-level switches and one higher-level switch.

[0013] In some embodiments, one of a plurality of intermediate-level switches includes: a first input interface configured to receive one or more first requests from a higher-level switch; a first priority queue configured to store the received one or more first requests; a second input interface configured to receive one or more second requests from one of two or more lower-level switches; a second priority queue configured to store the received one or more second requests; and a third input interface configured to receive one or more global commands that control the forwarding order of the one or more first requests stored in the first priority queue and the one or more second requests stored in the second priority queue.

[0014] In some embodiments, one or more first requests received from a higher-level switch include data received from one or more neuron processing entities from a group of neuron processing entities connected to the higher-level switch.

[0015] In some embodiments, the set of neuron processing entities within each chiplet is connected to a local bus for local data communication.

[0016] According to another aspect, a neuromorphic chip based on a chiplet-based tree topology is described. The chip includes multiple chiplets, each chiplet comprising multiple switches and a set of neuron processing entities (NPEs), wherein each of the multiple switches in each chiplet is connected to one or more neuron processing entities in the set of neuron processing entities of that chiplet, the multiple switches in each chiplet are organized in a tree topology, and the multiple chiplets are packaged in an interposer using one or more microbumps.

[0017] According to another aspect, a chip device based on a tree topology is described. The tree topology-based chip may include: multiple neuron processing entities (NPEs); and multiple switches, each switch connected to one or more of the multiple neuron processing entities; wherein the multiple switches are organized in a tree topology, and the multiple switches include: a root-level switch at the root level, where the root level is the highest level in the tree topology; multiple leaf-level switches at the leaf level, where the leaf level is the lowest level in the tree topology; and multiple intermediate-level switches between the root level and the leaf level, wherein each of the multiple intermediate-level switches is connected to two or more lower-level switches and one higher-level switch.

Description of the Drawings

[0018] These and other features of the systems, methods, and hardware devices of this disclosure, as well as the operational methods and functions of the related elements and component combinations of the structure, and the economics of manufacture, will become more apparent upon consideration of the following description and appended claims with reference to the accompanying drawings, which form part of this specification, wherein similar reference numerals denote corresponding portions in the drawings. However, it should be understood that the drawings are for illustration and description only and are not intended to be a definition of limitations of this disclosure.

[0019] Figure 1A shows a schematic diagram of an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments.

[0020] Figure 1B shows a logical view of an interposer within an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments.

[0021] Figure 2 shows a cross-sectional view of an interposer in an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments.

[0022] Figure 3 shows a system diagram of a chiplet within an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments.

[0023] Figure 4 shows a system diagram of an exemplary neuron processing entity-level switch within a chiplet-based hierarchical tree topology architecture according to some embodiments.

[0024] Figure 5 shows a block diagram of an interposer in an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments.

Detailed Description

[0026] This disclosure is intended to enable those skilled in the art to make and use embodiments, and is provided in the context of specific applications and their requirements. Various modifications to the embodiments of this disclosure will be apparent to those skilled in the art, and the general principles defined herein can be applied to other embodiments and applications without departing from the spirit and scope of this disclosure. Therefore, this disclosure is not limited to the embodiments shown, but is accorded the widest scope consistent with the principles and features of this disclosure.

[0027] Neuromorphic chips use analog electronic circuits to mimic the neurobiological architectures present in the human nervous system, and they have the potential to become the next generation of artificial intelligence (AI) architecture.

[0028] In some embodiments, these neuromorphic chips are designed for spiking neural networks (SNNs), which replicate the structure of the human brain. In addition to neuronal and synaptic states, spiking neural networks incorporate the concept of time into their operation. Neurons in a spiking neural network do not transmit information in every propagation cycle, as they do in typical multilayer perceptron networks (such as convolutional neural networks (CNNs) or deep neural networks (DNNs)). Instead, a neuron in a spiking neural network transmits information only when its membrane potential (an inherent property of the neuron related to its membrane charge) reaches a specific value called the threshold. When the membrane potential reaches the threshold, the neuron fires and generates a signal that is transmitted to other neurons, which respond by increasing or decreasing their own potentials. As a result, spiking neural networks can have a different data flow pattern than traditional neural networks. For example, training a traditional CNN or DNN involves forward and backward propagation, both of which move large amounts of data through the network layers; that is, all data moves between layers. In contrast, data transfer within spiking neural networks tends to be localized: neurons located closer together have a higher chance of exchanging data, while neurons located farther apart have a lower chance. Therefore, the performance of spiking neural networks is limited mainly by the bandwidth between neurons located close together.
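
The threshold-and-fire behavior described in this paragraph can be illustrated with a minimal integrate-and-fire sketch. This is not taken from the disclosure: the function name `simulate_if_neuron`, the leak factor, the threshold value, and the reset rule are all assumptions chosen only to make the idea concrete.

```python
def simulate_if_neuron(input_currents, threshold=1.0, leak=0.9, reset=0.0):
    """Return the time steps at which a simple integrate-and-fire neuron fires.

    All parameters are illustrative assumptions, not values from the patent.
    """
    potential = reset
    spike_times = []
    for t, current in enumerate(input_currents):
        # Leak part of the stored potential, then integrate the new input.
        potential = leak * potential + current
        if potential >= threshold:
            spike_times.append(t)  # the neuron fires at this time step
            potential = reset      # and its membrane potential is reset
    return spike_times


# The neuron stays silent until enough input has accumulated.
spikes = simulate_if_neuron([0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5])
print(spikes)  # [2, 6]
```

Unlike a layer in a CNN or DNN, this neuron produces output only at the two steps where the threshold is crossed, which is why traffic in an SNN is sparse and event-driven.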

[0029] Furthermore, spiking neural networks typically have an enormous scale (e.g., 80 billion neurons) compared to existing CNNs or DNNs. Existing CNN and DNN chip and system designs rely on monolithic processors connected via bus, crossbar, network-on-chip (NoC), and mesh architectures. However, monolithic processors may not be able to scale to the level of SNNs due to declining production yields (e.g., manufacturing such large-scale monolithic chips results in extremely low yields), high latency (e.g., using a large number of chips with NoC architectures leads to considerable inter-chip communication latency, and NoC architectures also suffer from high latency in long-distance synaptic packet transmission), and increasing complexity (e.g., the number of connections and the routing complexity grow rapidly with the number of nodes in a mesh architecture).

[0030] Considering the challenges posed by SNNs and the shortcomings of existing neuromorphic designs, a more suitable chip and system architecture can focus on optimizing short-range data communication and scalability. In some embodiments, short-range data communication can be optimized by implementing a hierarchical architecture with different levels of network bandwidth, in which neurons located closer together are connected by links with higher network bandwidth. Chip and system scalability can be improved by manufacturing smaller chiplets and packaging them together, thereby achieving performance similar to that of monolithic processors with higher production yields. Taking these design factors into account, the following description introduces a chiplet-based hierarchical tree topology architecture.

[0031] Figure 1A shows a schematic diagram of an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments. Figure 1A covers both the macro-level architecture (i.e., overall architecture 100) and the micro-level architecture (i.e., interposers 110). In some embodiments, overall architecture 100 refers to a data center environment including racks 105 containing compute nodes. These compute nodes can be organized in a tree topology using different levels of switches, such as a first-level core switch, multiple second-level aggregation switches, and multiple top-of-rack (ToR) switches. In this tree topology, the core switch acts as the root and connects to multiple aggregation switches, and each aggregation switch connects to multiple ToR switches. A compute node within a rack 105 can exchange data with other compute nodes within the same rack 105, or with compute nodes in other racks 105, via the tree of switches (e.g., intra-rack data exchange can be conducted via the corresponding ToR switch, and cross-rack data exchange between two adjacent racks can be conducted via the source ToR switch, an aggregation switch, and then the target ToR switch to reach the target compute node).
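
The intra-rack and cross-rack paths described above can be sketched as a walk up the switch tree to the first shared switch and back down. The sketch below is only an illustration under assumptions: the node names (`node-a`, `tor-1`, `agg-1`, `core`) and the `parent` map are hypothetical and do not come from the disclosure.

```python
def path_to_root(node, parent):
    """List a node followed by each switch above it, up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def route(src, dst, parent):
    """Return the hop sequence from src to dst through the switch tree."""
    up = path_to_root(src, parent)
    down = path_to_root(dst, parent)
    above_dst = set(down)
    climb = []
    # Climb from the source until reaching a switch that is also above dst.
    for node in up:
        climb.append(node)
        if node in above_dst:
            break
    meeting = climb[-1]
    # Descend from the shared switch to the target.
    descend = down[:down.index(meeting)]
    return climb + descend[::-1]


# Hypothetical topology: one core switch, two aggregation switches,
# three top-of-rack (ToR) switches, and four compute nodes.
parent = {
    "node-a": "tor-1", "node-b": "tor-1", "node-c": "tor-2", "node-d": "tor-3",
    "tor-1": "agg-1", "tor-2": "agg-1", "tor-3": "agg-2",
    "agg-1": "core", "agg-2": "core",
}

print(route("node-a", "node-b", parent))  # stays under one ToR switch
print(route("node-a", "node-c", parent))  # crosses one aggregation switch
```

In this sketch, traffic between distant racks is the only traffic that reaches the core switch, so the hop count grows with how far apart the endpoints sit in the tree.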

[0032] In some embodiments, each compute node includes a plurality of interposers 110, each interposer 110 includes a plurality of chiplets 120, and each chiplet 120 includes a plurality of neuron processing entities (NPEs) 121. In some embodiments, each chiplet 120 includes a plurality of switches that organize the neuron processing entities therein in a tree topology. For illustration, the chiplet 120 in Figure 1A includes a single-level tree structure with a switch (the circle labeled 0) connected to four neuron processing entities (the circles labeled 00, 01, 02, and 03). In some embodiments, each interposer 110 includes multiple routers that organize the chiplets therein in a tree topology. For illustration, the interposer 110 in Figure 1A includes a single-level tree structure with a router (the circle labeled R) connected to four chiplets 120.

[0033] With this design, different levels of switches are used at the macro level (e.g., the rack level in a data center) to organize the racks 105 in a tree topology, and at the micro level (e.g., the chip level), each compute node within each rack 105 also has an internal tree topology organized using interposers 110, chiplets 120, and neuron processing entities 121.

[0034] In some embodiments, chiplets 120 are packaged within an interposer 110 to replace a monolithic processor. A chiplet 120 is an integrated circuit block configured to work with other similar chiplets to form larger, more complex chips. Chiplets built on smaller dies (e.g., 5000 neurons on a single die) offer several advantages over monolithic chips that require large dies (e.g., 1 million neurons on a single die), such as higher yield, at the cost of some performance. In some embodiments, to compensate for the performance degradation caused by using smaller dies, the chiplets 120 are packaged within an interposer 110 to achieve performance similar to that of a monolithic processor while maintaining a high yield. The interposer 110 can use multi-level switches to package a tree structure of chiplets 120 into a single silicon chip.
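
The yield advantage of smaller dies can be made concrete with the commonly used Poisson yield model, in which the fraction of defect-free dies is exp(-A * D) for die area A and defect density D. The disclosure does not name a yield model or give any defect figures; the model choice, the defect density, and the die areas below are assumptions for illustration only.

```python
import math


def poisson_yield(die_area_cm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free under a Poisson model."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


DEFECT_DENSITY = 0.2   # assumed defects per square centimeter
MONOLITHIC_AREA = 8.0  # assumed area of one large monolithic die, in cm^2
CHIPLET_AREA = 1.0     # assumed area of one chiplet die, in cm^2

monolithic_yield = poisson_yield(MONOLITHIC_AREA, DEFECT_DENSITY)
chiplet_yield = poisson_yield(CHIPLET_AREA, DEFECT_DENSITY)

print(f"monolithic die yield: {monolithic_yield:.3f}")
print(f"single chiplet yield: {chiplet_yield:.3f}")
```

Under these assumed numbers, most individual chiplets are usable while most large dies are not. If chiplets can be tested before packaging (a further assumption), only good ones are placed on the interposer, so the same total silicon area is obtained with far less waste.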

[0035] Figure 1B shows a logical view of an interposer within an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments. The logical view shown in Figure 1B corresponds to the interposer 110 in Figure 1A. As shown in Figure 1B, the interposer 150 includes a router (the circle labeled R) as the root of the tree, which has multiple chiplet-level switches (the circles labeled 0, 1, 2, 3) as child nodes. Each chiplet-level switch is further connected to multiple neuron processing entities (the circles labeled 00-33) as child nodes.

[0036] Figure 2 shows a cross-sectional view of an interposer in an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments. The interposer 220 in Figure 2 is shown for illustrative purposes and, depending on the implementation, may include fewer, more, or alternative components.

[0037] In some embodiments, the interposer 220 uses multiple routers to organize multiple chiplets 210 in a tree topology. Each router has a first set of interfaces and a second set of interfaces. The first set of interfaces is used to connect to one or more chiplets among the multiple chiplets 210, and the second set of interfaces is used to connect to one or more other routers. A chiplet is connected to only one router, but a router can connect to multiple chiplets. In some embodiments, each chiplet among the multiple chiplets 210 includes a chiplet-level tree topology with a root-level switch (shown in more detail in Figure 3). Each chiplet connects to its corresponding router by connecting its root-level switch to the router via the first set of interfaces. In some embodiments, the first set of interfaces includes microbumps 230. Compared to solder bumps in a conventional printed circuit board (PCB) package, the microbumps 230 can have a smaller pitch (e.g., 40-55 μm), thus providing a denser arrangement of chiplets 210 and higher data transmission bandwidth within the interposer 220.

[0038] In some embodiments, the interposer 220 has an interposer-level clock 222 for coordinating the multiple chiplets 210 within the interposer via the multiple routers. For example, the interposer-level clock 222 may send a clock signal that oscillates between high and low states, acting as a metronome that controls the actions of the routers and thereby indirectly coordinates the actions of the chiplets 210 connected to the routers.

[0039] As shown in Figure 2, each chiplet 210 includes a chiplet-level clock 212 for coordinating the actions of the neuron processing entities within the chiplet 210. In some embodiments, all chiplets 210 within the interposer 220 share the same chiplet-level clock 212, and all neuron processing entities on the chiplets 210 operate on the same clock. In some embodiments, the clocks of different chiplets 210 within the interposer 220 are individually configurable. Users can configure different clocks on the chiplets 210 so that the chiplets 210 can perform different functions. This is particularly useful for processing SNN models and applications. For example, this design enables SNNs that fire different neurons on different time signals (e.g., clock signals).

[0040] In some embodiments, within each chiplet 210, one or more neuron processing entities may also be organized in a tree topology. Each neuron processing entity may be associated with one or more register files serving as local memory. As shown in Figure 3, the neuron processing entities of a chiplet 210 can be organized into a tree structure using neuron processing entity-level switches.

[0041] Figure 3 shows a system diagram of a chiplet 300 in an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments. The chiplets shown in Figures 1A through 2 have only a single level of neuron processing entities, so the tree structure within the chiplet is not shown there. The chiplet 300 in Figure 3 illustrates multiple neuron processing entities organized in a multi-level tree topology using hierarchical neuron processing entity-level switches. Figure 3 is for illustrative purposes only, and depending on the implementation, the chiplet 300 may have fewer, more, or alternative components and connections.

[0042] In some embodiments, the chiplet 300 includes a root-level switch 310 (at the highest level in the tree topology of the chiplet 300), a plurality of leaf-level switches 330 (at the lowest level in the tree topology of the chiplet 300), and a plurality of intermediate-level switches 320 between the root and leaf levels. In some embodiments, each of the plurality of intermediate-level switches 320 is connected to two or more lower-level switches (e.g., 330) and one higher-level switch (e.g., 310). Each neuron processing entity in the chiplet 300 is connected to only one switch.
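
The structural rules in this paragraph (one root, leaf switches at the bottom, intermediate switches with two or more lower-level switches and one higher-level switch, and each neuron processing entity attached to only one switch) can be expressed as a small data structure with a validity check. This is a sketch under assumptions: the class `Switch`, the helper names, and the example node labels are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Switch:
    name: str
    npes: List[str] = field(default_factory=list)
    children: List["Switch"] = field(default_factory=list)
    # Excluded from repr and comparison to avoid parent <-> child recursion.
    parent: Optional["Switch"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "Switch") -> "Switch":
        child.parent = self
        self.children.append(child)
        return child

    def level(self) -> str:
        if self.parent is None:
            return "root"
        return "intermediate" if self.children else "leaf"


def all_switches(root: Switch) -> List[Switch]:
    found, stack = [], [root]
    while stack:
        switch = stack.pop()
        found.append(switch)
        stack.extend(switch.children)
    return found


def satisfies_tree_rules(root: Switch) -> bool:
    switches = all_switches(root)
    # Intermediate switches need two or more lower-level switches; having
    # exactly one higher-level switch follows from the single parent field.
    fan_out_ok = all(
        len(s.children) >= 2 for s in switches if s.level() == "intermediate"
    )
    # Each neuron processing entity may be attached to only one switch.
    npes = [npe for s in switches for npe in s.npes]
    return fan_out_ok and len(npes) == len(set(npes))


# Hypothetical three-level chiplet: 1 root, 2 intermediate, 4 leaf switches.
root = Switch("root-310")
for i in range(2):
    mid = root.add_child(Switch(f"mid-320-{i}"))
    for j in range(2):
        mid.add_child(Switch(f"leaf-330-{i}{j}", npes=[f"npe-{i}{j}0", f"npe-{i}{j}1"]))

levels = sorted(s.level() for s in all_switches(root))
print(levels, satisfies_tree_rules(root))
```

Keeping a single `parent` reference per switch is a design choice that makes the "one higher-level switch" rule hold by construction, so only the fan-out and the NPE uniqueness need to be checked.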

[0043] In some embodiments, each neuron processing entity-level switch (whether the root-level switch 310, an intermediate-level switch 320, or a leaf-level switch 330) in the chiplet 300 is connected to another neuron processing entity-level switch via a first set of interfaces, and to one or more neuron processing entities and their corresponding register files (RF) via a second set of interfaces. In some embodiments, the connection interfaces within the chiplet 300 (the first set of interfaces, the second set of interfaces, or both) employ general-purpose input/output (GPIO) interfaces. A GPIO is an uncommitted digital signal pin on an integrated circuit or electronic circuit board that can be used as an input, an output, or both. In some embodiments, neuron processing entities under the same switch can be connected via a local bus serving as a dedicated data communication channel, and neuron processing entities under different switches can exchange data by routing across two or more switches.

[0044] Figure 4 shows a system diagram of a neuron processing entity-level switch 400 in an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments. The neuron processing entity-level switch 400 may be any one of the root-level switch 310, the intermediate-level switches 320, or the leaf-level switches 330 of Figure 3. Figure 4 is for illustrative purposes only, and depending on the implementation, the neuron processing entity-level switch 400 may have fewer, more, or alternative components and connections.

[0045] In some embodiments, the neuron processing entity-level switch 400 can be used for data exchange between neuron processing entities within the same chiplet, across different chiplets, or even across interposers. Neuron processing entities under the same neuron processing entity-level switch 400 can be locally connected to a bus providing dedicated bandwidth.

[0046] As shown in Figure 4, the neuron processing entity-level switch 400 may include two or more input interfaces (e.g., input interfaces 410A and 410B), which correspond to a higher-level switch node and a lower-level switch node, respectively. Here, "higher-level" and "lower-level" refer to levels in the tree topology within the chiplet. For example, input interface 410A may include a multiplexer (MUX) for receiving requests from external memory or from higher-level switch nodes. A request may include data, instructions, or both, for forwarding to another switch.

[0047] In some embodiments, input interfaces 410A and 410B can store received requests in different priority queues 430A and 430B, respectively. These priority queues 430A and 430B temporarily hold the received requests until they are forwarded. In some embodiments, the neuron processing entity-level switch 400 may also include a node controller for receiving global commands 440 from a higher-level scheduler. These global commands 440 can specify the forwarding order of the requests stored in priority queues 430A and 430B. For example, priority queues 430A and 430B can each send their first request to a multiplexer controlled by the node controller, which selects one of the two requests based on the global command 440. The selected request is the next request to be forwarded.
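
The selection step described above, in which two queues each offer their first request and a global command decides which one is forwarded, can be sketched as follows. This is a simplified illustration under several assumptions: the class `SwitchNode`, the queue names `"upper"` and `"lower"`, first-in-first-out ordering within each queue, and the fallback to the other queue when the preferred one is empty are all hypothetical choices, since the disclosure does not specify them.

```python
from collections import deque


class SwitchNode:
    """Toy model of a switch with one queue per input interface."""

    def __init__(self):
        # "upper" holds requests from the higher-level switch,
        # "lower" holds requests from lower-level switches.
        self.queues = {"upper": deque(), "lower": deque()}

    def receive(self, source, request):
        self.queues[source].append(request)

    def forward_next(self, global_command):
        """Pick the next request; global_command names the queue to favor."""
        preferred = global_command
        fallback = "lower" if preferred == "upper" else "upper"
        for name in (preferred, fallback):
            if self.queues[name]:
                return self.queues[name].popleft()
        return None  # nothing is waiting in either queue


switch = SwitchNode()
switch.receive("upper", "u1")
switch.receive("lower", "l1")
switch.receive("lower", "l2")

# A hypothetical scheduler issues one global command per forwarding slot.
order = [switch.forward_next(cmd) for cmd in ["lower", "upper", "upper", "upper"]]
print(order)  # ['l1', 'u1', 'l2', None]
```

Because the command is supplied from outside the switch, the scheduler in this sketch can change the forwarding policy without changing the switch itself.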

[0048] In some embodiments, the neuron processing entity-level switch 400 may include a local clock 450 (e.g., a chiplet-level clock) for controlling forwarding actions based on time signals. For example, clock 450 may specify that one request is forwarded every 1 microsecond. Clock 450 may be a separately configurable clock that sets the request forwarding rate of the neuron processing entity-level switch 400. This configurable clock allows the neuron processing entity-level switch 400 (and its underlying neuron processing entities and register files) to implement a set of neurons in a neuromorphic chip or system that can be triggered based on clock signals.

[0049] In some embodiments, a selected request is fed into a corresponding output buffer within the neuron processing entity-level switch 400 before being forwarded to the target switch node. The neuron processing entity-level switch 400 includes multiple output buffers corresponding to the input interfaces (e.g., 410A and 410B). These output buffers act as staging areas while the node controller determines the target switch node. Once the node controller determines the target switch node (e.g., a higher-level switch node, denoted node n+1, or a lower-level switch node, denoted node n-1), the corresponding output buffer is selected, and the first request therein is forwarded to the corresponding output interface (e.g., 420A or 420B).

[0050] Figure 5 shows a block diagram of an interposer 500 in an exemplary chiplet-based hierarchical tree topology architecture according to some embodiments. The interposer 500 can be regarded as an electrical carrier with internal circuitry organized in a tree topology. As shown in Figure 5, the interposer 500 includes one or more routers 505 and 510 that organize multiple chiplets 520 in a tree topology, and each chiplet 520 includes one or more switches 530 that organize multiple neuron processing entities and register files 540 in a tree topology. Figure 5 is for illustrative purposes only, and depending on the implementation, the interposer 500 may have fewer, more, or alternative components and connections.

[0051] In some embodiments, multiple interposers 500 can serve as building blocks for next-generation processing units (such as GPUs or NPUs) and for next-generation data centers that process SNN models and applications. In some embodiments, each switch 530 within an interposer includes 100-200 neuron processing entities to balance die size and performance during manufacturing. Each neuron processing entity can implement approximately 200 million neurons. This means that each switch 530 can correspond to 2-4 billion neurons. The number of chiplets 520 within an interposer 500 can be determined based on the size of each switch 530. With current state-of-the-art manufacturing technologies, the size of an interposer may be limited by the reticle size, which defines the chip surface area that can be exposed using a single mask.

[0052] The tree topology of the interposer 500 effectively provides flexible, hierarchical bandwidth configuration for the neuron processing entities (where the neurons reside). For example, neuron processing entities 540 under the same switch 530 are connected to a local bus (i.e., a dedicated channel serving only these neuron processing entities 540) to provide the best bandwidth; neuron processing entities 540 under different switches 530 but within the same chiplet 520 can exchange data through several dedicated switches 530 (dedicated only to the corresponding neuron processing entities 540); and neuron processing entities 540 in different chiplets 520 can exchange data through several dedicated switches 530 and routers 510 (dedicated to the corresponding chiplets 520). Note that the more local the nodes (neuron processing entities 540 or chiplets 520), the denser and more dedicated the data exchange channels. Thus, the structure of the interposer 500 favors local data exchange, which matches the data flow patterns in SNNs. This design has technical advantages over existing bus-based and mesh-based architectures. For example, the nodes of a bus-based system on chip (SoC) may share the same bus and compete with each other for its fixed bandwidth, and this competition limits the scalability and bandwidth of the SoC. As another example, a mesh-based SoC may require a direct connection between every two nodes to achieve ideal bandwidth, which may not scale well because the number of such connections grows quadratically with the number of nodes.
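
The three tiers of data exchange described in this paragraph can be sketched as a lookup on a hierarchical address. The address layout (`chiplet`, `switch`, `npe`), the type `NpeAddress`, and the function `exchange_channel` are assumptions introduced for illustration; the disclosure does not define an addressing scheme. For simplicity the sketch also assumes that each neuron processing entity is identified by a single switch within its chiplet.

```python
from typing import NamedTuple


class NpeAddress(NamedTuple):
    """Hypothetical position of a neuron processing entity in the tree."""
    chiplet: int
    switch: int
    npe: int


def exchange_channel(a: NpeAddress, b: NpeAddress) -> str:
    """Name the most local channel that two entities can use."""
    if a.chiplet != b.chiplet:
        return "switches and routers"         # crosses chiplets
    if a.switch != b.switch:
        return "switches within the chiplet"  # same chiplet, other switch
    return "local bus"                        # same switch


first = NpeAddress(chiplet=0, switch=0, npe=1)
same_switch = NpeAddress(chiplet=0, switch=0, npe=3)
same_chiplet = NpeAddress(chiplet=0, switch=2, npe=0)
other_chiplet = NpeAddress(chiplet=3, switch=1, npe=2)

for other in (same_switch, same_chiplet, other_chiplet):
    print(other, "->", exchange_channel(first, other))
```

A mapping tool could use such a classification to place neurons that communicate often under the same switch, so that most traffic stays on the dedicated local bus.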

[0053] Each process, method, and algorithm described in the preceding sections can be embodied in a code module executed by one or more computer systems or computer processors including computer hardware, and can be fully or partially automated by that code module. These processes and algorithms can be implemented, partially or entirely, in dedicated circuitry.

[0054] When the functions disclosed herein are implemented as software functional units and sold or used as independent products, they may be stored in a processor-executable, non-volatile, computer-readable storage medium. Specific technical solutions (all or part) disclosed herein, or aspects contributing to the present technology, may be embodied in the form of a software product. A software product includes multiple instructions that may be stored in a storage medium to cause a computing device (which may be a personal computer, server, network device, etc.) to perform all or some steps of the methods of the embodiments of this disclosure. The storage medium may include a flash drive, portable hard disk drive, ROM, RAM, magnetic disk, optical disk, another medium operable for storing program code, or any combination thereof.

[0055] Specific embodiments also provide a system including a processor and a non-transitory computer-readable storage medium storing processor-executable instructions to cause the system to perform operations corresponding to the steps in any of the methods of the above embodiments. Specific embodiments also provide a non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause one or more processors to perform operations corresponding to the steps in any of the methods of the above embodiments.

[0056] The embodiments disclosed herein can be implemented through a cloud platform, server, or group of servers (collectively referred to as the "service system") that interacts with a client. The client can be a terminal device or a client registered by a user on the platform, wherein the terminal device can be a mobile terminal, a personal computer (PC), or any device on which the platform application can be installed.

[0057] The various features and processes described above can be used independently of each other or combined in various ways. All possible combinations and sub-combinations are within the scope of this disclosure. Furthermore, in some embodiments, certain method or process blocks may be omitted. The methods and processes described herein are not limited to any particular order, and the associated blocks or states may be executed in other suitable orders. For example, the described blocks or states may be executed in an order other than the order specified in this disclosure, or multiple blocks or states may be combined in a single block or state. Example blocks or states may be executed serially, in parallel, or in some other manner. Blocks or states may be added to or removed from the example embodiments of this disclosure. The example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged relative to the example embodiments of this disclosure.

[0058] The various operations of the example methods described herein can be performed at least partially by an algorithm. This algorithm may include program code or instructions stored in memory (e.g., the aforementioned non-transitory computer-readable storage medium). Such an algorithm may include a machine learning algorithm. In some embodiments, the machine learning algorithm may not explicitly program the computer to perform the function, but may learn from training data to build a predictive model for performing that function.

[0059] The various operations of the example methods described herein can be performed at least partially by one or more processors, which can be temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, these processors can constitute processor-implemented engines that operate to perform one or more of the operations or functions described herein.

[0060] Similarly, the methods described herein can be implemented at least in part by a processor, with a particular processor or processors being an example of hardware. For example, at least some operations of the methods can be performed by one or more processors or by processor-implemented engines. Furthermore, one or more processors can also be used to support the performance of related operations in a “cloud computing” environment or as “software as a service” (SaaS). For example, at least some operations can be performed by a set of computers (as an example of machines including processors) that can be accessed via a network (e.g., the Internet) and through one or more appropriate interfaces (e.g., application program interfaces (APIs)).

[0061] The performance of certain operations can be distributed among the processors, not only residing within a single machine but also deployed across multiple machines. In some example embodiments, the processors or processor-implemented engines may reside in a single geographic location (e.g., in a home environment, office environment, or server farm). In other example embodiments, the processors or processor-implemented engines may be distributed across multiple geographic locations.

[0062] Throughout this specification, multiple instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are shown and described as separate operations, one or more of them may be performed simultaneously, and nothing requires that the operations be performed in the order shown. Structures and functions presented as separate components in the example configurations can be implemented as a combined structure or component. Similarly, structures and functions presented as a single component can be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

[0063] Although an overview of the subject matter has been described with reference to specific example embodiments, various modifications and changes can be made to these embodiments without departing from the broader scope of embodiments of this disclosure. These embodiments of this disclosure may be referred to individually or collectively by the term "this disclosure" merely for convenience, and are not intended to voluntarily limit the scope of this disclosure to any single disclosure or concept, if in fact more than one disclosure or concept is disclosed.

[0064] The embodiments illustrated herein have been described in sufficient detail to enable those skilled in the art to practice the disclosed teachings. Other embodiments may be used and derived therefrom, allowing for structural and logical substitutions and changes without departing from the scope of this disclosure. Therefore, the Specific Embodiments section should not be construed as limiting, and the scope of the various embodiments is defined only by the appended claims and all their equivalents.

[0065] Any process description, element, or block in the flowcharts described herein and / or in the accompanying drawings should be understood to potentially represent a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or step in the process. As will be understood by those skilled in the art, alternative implementations are included within the scope of the embodiments described herein, wherein elements or functions may be removed depending on the functionality involved, and elements or functions may be performed in an order different from the order shown or discussed (including substantially simultaneous or reverse order).

[0066] As used herein, “or” is inclusive, not exclusive, unless otherwise expressly indicated or indicated by the context. Therefore, here, “A, B, or C” means “A, B, C, A and B, A and C, B and C, or A, B, and C”, unless otherwise expressly indicated or indicated by the context. Furthermore, “and” is both joint and several, unless otherwise expressly indicated or indicated by the context. Therefore, here, “A and B” means “A and B, jointly or separately”, unless otherwise expressly indicated or indicated by the context. Furthermore, multiple instances may be provided for a resource, operation, or structure described herein as a single instance. Moreover, the boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and specific operations are described within the context of a particular illustrative configuration. Other allocations of functionality are contemplated and may fall within the scope of various embodiments of this disclosure. Generally, structures and functions presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functions presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within the scope of embodiments of this disclosure as represented by the appended claims. Therefore, the specification and drawings are to be regarded as illustrative rather than restrictive.

[0067] The terms “comprising” or “including” are used to indicate the presence of a subsequently stated feature, but do not preclude the addition of other features. Conditional language, such as “can” or “may”, unless explicitly stated otherwise or otherwise understood in the context in which it is used, is generally intended to convey that certain embodiments include certain features, elements, and/or steps while other embodiments do not. Therefore, such conditional language generally does not imply that features, elements, and/or steps are required in any way by one or more embodiments, or that one or more embodiments must include logic for determining whether such features, elements, and/or steps are included or will be performed in any particular embodiment, with or without user input or prompting.

Claims

1. A system for neuromorphic computing, comprising:
a plurality of inner plug-ins, each inner plug-in comprising a plurality of routers and a set of cores, each core comprising a plurality of switches and a set of neuron processing entities;
wherein, within each of the cores, each of the plurality of switches includes a first set of interfaces and a second set of interfaces, the first set of interfaces being used to connect to one or more neuron processing entities in the set of neuron processing entities, and the second set of interfaces being used to connect to one or more other switches in the plurality of switches; the plurality of switches are organized in a tree topology through their respective second sets of interfaces; and each neuron processing entity in the set of neuron processing entities is connected to only a first switch of the plurality of switches, through the first set of interfaces of that first switch; and
wherein, within each of the inner plug-ins, each of the plurality of routers includes a third set of interfaces and a fourth set of interfaces, the third set of interfaces being used to connect to one or more cores in the same set of cores via one or more switches in the same set of cores, and the fourth set of interfaces being used to connect to one or more other routers in the plurality of routers; the plurality of routers are organized in a tree topology through their respective fourth sets of interfaces; and each core in the same set of cores is connected to only a first router of the plurality of routers, through the third set of interfaces of that first router.

2. The system according to claim 1, wherein each of the neuron processing entities includes a register file as local memory.

3. The system according to claim 1, wherein the third set of interfaces for connecting the one or more cores to the router includes one or more microbumps.

4. The system according to claim 1, further comprising: a plurality of rack-mount switches, each of which is connected to a set of inner plug-ins among the plurality of inner plug-ins, wherein the plurality of rack-mount switches are organized in a tree topology.

5. The system according to claim 1, wherein each of the cores includes a configurable core-level clock to coordinate the set of neuron processing entities within that core.

6. The system according to claim 5, wherein the core-level clock of one core is independent of the core-level clock of another core.

7. The system according to claim 5, wherein each of the plurality of inner plug-ins includes a configurable plug-in-level clock to coordinate the set of cores within that inner plug-in, and the plug-in-level clock is independent of the core-level clocks.

8. The system according to claim 1, wherein the plurality of switches within each of the cores, organized in the tree topology, include:
a root-level switch at a root level, the root level being the highest level in the tree topology;
a plurality of leaf-level switches at a leaf level, the leaf level being the lowest level in the tree topology; and
a plurality of intermediate-level switches between the root level and the leaf level, wherein each of the plurality of intermediate-level switches is connected to two or more lower-level switches and one higher-level switch.

9. The system according to claim 8, wherein one of the plurality of intermediate-level switches includes:
a first input interface configured to receive one or more first requests from the higher-level switch;
a first priority queue configured to store the received one or more first requests;
a second input interface configured to receive one or more second requests from the two or more lower-level switches;
a second priority queue configured to store the received one or more second requests; and
a third input interface configured to receive one or more global commands that control the order of forwarding the one or more first requests stored in the first priority queue and the one or more second requests stored in the second priority queue.

10. The system according to claim 9, wherein the one or more first requests received from the higher-level switch include data received from one or more neuron processing entities of the set of neuron processing entities connected to the higher-level switch.

11. The system according to claim 1, wherein each of the cores contains a group of neuron processing entities connected to a local bus for local data communication, without needing to pass through the plurality of switches within the core.

12. A core for neuromorphic computing, comprising:
a plurality of neuron processing entities; and
a plurality of switches, each of which is connected to one or more neuron processing entities among the plurality of neuron processing entities, wherein each neuron processing entity is connected to only one of the plurality of switches, the plurality of switches being organized in a tree topology and comprising:
a root-level switch at a root level, the root level being the highest level in the tree topology;
a plurality of leaf-level switches at a leaf level, the leaf level being the lowest level in the tree topology; and
a plurality of intermediate-level switches between the root level and the leaf level, wherein each of the plurality of intermediate-level switches is connected to two or more lower-level switches and one higher-level switch.

13. The core according to claim 12, further comprising: a configurable core-level clock to coordinate the plurality of neuron processing entities.

14. The core according to claim 13, wherein the core-level clock of the core is independent of the core-level clock of another core.

15. The core according to claim 12, wherein one of the plurality of intermediate-level switches includes:
a first input interface configured to receive one or more first requests from the higher-level switch;
a first priority queue configured to store the received one or more first requests;
a second input interface configured to receive one or more second requests from the two or more lower-level switches;
a second priority queue configured to store the received one or more second requests; and
a third input interface configured to receive one or more global commands that control the order of forwarding the one or more first requests stored in the first priority queue and the one or more second requests stored in the second priority queue.

16. The core according to claim 15, wherein the one or more first requests received from the higher-level switch include data received from one or more neuron processing entities of a group of neuron processing entities connected to the higher-level switch.

17. A chip for neuromorphic computing, comprising:
a plurality of cores, each core comprising a plurality of switches and a set of neuron processing entities, wherein:
each of the plurality of switches within each core is connected to one or more neuron processing entities in the set of neuron processing entities;
each of the neuron processing entities is connected to only one of the plurality of switches;
each of the plurality of switches includes a first set of interfaces and a second set of interfaces, the first set of interfaces being used to connect to one or more neuron processing entities in the set of neuron processing entities, and the second set of interfaces being used to connect to one or more other switches in the plurality of switches;
the plurality of switches within each core are organized in a tree topology through their respective second sets of interfaces; and
the plurality of cores are packaged in an inner plug-in using one or more microbumps.

18. The chip according to claim 17, wherein each of the plurality of cores includes a configurable core-level clock to coordinate the set of neuron processing entities within that core.

19. The chip according to claim 18, wherein the core-level clock of one core is independent of the core-level clock of another core.

20. The chip according to claim 17, wherein the plurality of switches within each of the cores, organized in the tree topology, include:
a root-level switch at a root level, the root level being the highest level in the tree topology;
a plurality of leaf-level switches at a leaf level, the leaf level being the lowest level in the tree topology; and
a plurality of intermediate-level switches between the root level and the leaf level, wherein each of the plurality of intermediate-level switches is connected to two or more lower-level switches and one higher-level switch.
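
As an illustration of the intermediate-level switch recited in claims 9 and 15, the following Python sketch models the two priority queues and the global command input. It is not part of the claims. The class name, the `(priority, payload)` request format, the command strings `"first_queue_first"` and `"second_queue_first"`, and the rule that a lower number means a higher priority are all assumptions made for this example; the claims do not specify them.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class IntermediateSwitch:
    """Toy model of an intermediate-level switch (hypothetical API)."""

    COMMANDS = ("first_queue_first", "second_queue_first")

    def __init__(self) -> None:
        self._from_upper: List[Tuple[int, int, Any]] = []  # first priority queue
        self._from_lower: List[Tuple[int, int, Any]] = []  # second priority queue
        self._seq = itertools.count()      # tie-breaker: FIFO within a priority
        self._command = self.COMMANDS[0]   # assumed default forwarding order

    def receive_from_upper(self, priority: int, payload: Any) -> None:
        """First input interface: request from the higher-level switch."""
        heapq.heappush(self._from_upper, (priority, next(self._seq), payload))

    def receive_from_lower(self, priority: int, payload: Any) -> None:
        """Second input interface: request from a lower-level switch."""
        heapq.heappush(self._from_lower, (priority, next(self._seq), payload))

    def receive_global_command(self, command: str) -> None:
        """Third input interface: global command selecting the forwarding order."""
        if command not in self.COMMANDS:
            raise ValueError(f"unknown command: {command}")
        self._command = command

    def forward_all(self) -> List[Any]:
        """Drain both queues in the order set by the latest global command."""
        order = [self._from_upper, self._from_lower]
        if self._command == "second_queue_first":
            order.reverse()
        forwarded = []
        for queue in order:
            while queue:
                forwarded.append(heapq.heappop(queue)[2])
        return forwarded


sw = IntermediateSwitch()
sw.receive_from_upper(2, "u-low")
sw.receive_from_upper(1, "u-high")
sw.receive_from_lower(1, "l-high")
sw.receive_global_command("second_queue_first")
print(sw.forward_all())  # ['l-high', 'u-high', 'u-low']
```

Draining one queue completely before the other is only one possible reading of "forwarding order"; an interleaving policy driven by the global command would fit the same interfaces.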