A continuous-time Ising model hardware solving system based on multi-chip interconnection
By using a multi-chip interconnect system with a distributed architecture, the storage and communication latency problems of continuous-time Ising machines in large-scale combinatorial optimization problems are solved, achieving efficient and accurate combinatorial optimization solutions and breaking through the limitations of single-chip resources.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents (China)
- Current Assignee / Owner
- UNIV OF SCI & TECH OF CHINA
- Filing Date
- 2026-03-03
- Publication Date
- 2026-06-26
Figure CN121765178B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computational modeling technology, specifically to a hardware solution system for the continuous-time Ising model based on multi-chip interconnection.
Background Technology
[0002] The Ising model is a mathematical model originating from statistical physics. It can be used to simulate complex computational problems that are difficult for traditional computers to handle. Hardware solvers for the Ising model are a class of dedicated computing devices based on non-von Neumann architectures. Their core objective is to leverage the dynamic evolution characteristics of real physical systems (such as circuits and optics) to accelerate the search for the global minimum (i.e., the ground state) of the Ising model's energy function, thereby solving complex computational problems that are difficult for traditional computers to process.
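For reference, the energy function mentioned above is commonly written in the following form. The symbols $s_i$ (spin or neuron state), $J_{ij}$ (coupling weight), and $h_i$ (local bias) are not named in the source, and the sign convention shown is only one of several in use, so this should be read as an illustrative assumption rather than the patent's own definition.

```latex
H(\mathbf{s}) = -\sum_{i<j} J_{ij}\, s_i s_j \;-\; \sum_{i} h_i s_i,
\qquad s_i \in \{-1, +1\}
```

Under this convention the ground state is the spin configuration $\mathbf{s}$ that minimizes $H(\mathbf{s})$.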
[0003] Many mathematical problems with practical applications can be transformed into energy minimization problems using the Ising model, the most typical of which is the combinatorial optimization problem. Combinatorial optimization is a class of optimization problems that seek the maximum or minimum value under constraints. Such problems are usually NP-hard, meaning that no algorithm is known that finds the optimal solution in polynomial time, and exhaustive search becomes impractical as the problem grows. Traditional computers typically use heuristic algorithms such as simulated annealing to find approximate solutions, but as the problem size increases further, software algorithms face severe bottlenecks due to excessively long iteration times and high energy consumption. In contrast, mapping combinatorial optimization problems to the Ising energy model and utilizing the natural physical evolution of hardware circuits to automatically converge to a low-energy state has become a key technical path to overcome existing computing power limitations and solve large-scale complex optimization problems.
[0004] To address the aforementioned computational challenges, the industry has proposed various hardware solution systems for the Ising model with different architectures, collectively referred to as Ising Machines or Annealing Machines. These systems mainly include: Coherent Ising Machines (CIMs) that utilize optical physical effects, with their core based on Optical Parametric Oscillators (OPOs); dedicated annealing systems based on traditional semiconductor digital circuits, such as Fujitsu's Digital Annealer and Hitachi's CMOS Annealer; and Continuous-Time Digital Ising Machines based on combinational logic.
[0005] However, existing Ising machines, especially continuous-time Ising machines, face severe challenges in solving large-scale problems, including the following: First, storage and computing power bottlenecks; the number of weights in fully connected problems increases quadratically with the number of nodes, and the on-chip storage resources of a single chip are insufficient to support large-scale global weight matrices, making it impossible to solve large-scale combinatorial optimization problems. Second, node partitioning in multi-chip systems; simple multi-chip cascading often ignores the coupling topology characteristics between nodes, leading to the partitioning of nodes with a large number of connections onto different chips, resulting in high cross-chip communication requirements. Third, latency issues in multi-chip interconnects; due to the high cross-chip communication requirements in large-scale problems, communication latency becomes a bottleneck for system convergence time. Existing numerical transfer methods all use communication protocols, introducing microsecond-level communication latency, which disrupts the dynamic continuity of continuous Ising machines in the solution process, leading to a decrease in solution accuracy.
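The quadratic growth of the weight count described above can be made concrete with a small calculation. The sketch below assumes a symmetric, fully connected problem with no self-couplings and an illustrative 8-bit weight width; neither the bit width nor the node counts come from the source.

```python
def coupling_count(num_nodes: int) -> int:
    """Number of distinct pairwise weights in a fully connected Ising problem."""
    return num_nodes * (num_nodes - 1) // 2


def weight_storage_bytes(num_nodes: int, bits_per_weight: int) -> int:
    """Storage needed for all distinct weights, rounded up to whole bytes."""
    total_bits = coupling_count(num_nodes) * bits_per_weight
    return (total_bits + 7) // 8


# Hypothetical problem sizes; 8 bits per weight is an assumption.
for n in (1_000, 10_000, 100_000):
    print(n, coupling_count(n), weight_storage_bytes(n, bits_per_weight=8))
```

Increasing the node count tenfold increases the number of weights roughly a hundredfold, which is why a single chip's on-chip storage is exhausted quickly.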
[0006] Therefore, there is a need for a distributed continuous-time Ising model hardware solution system that can overcome the limitations of single-chip resources, optimize the amount of data transmission between chips, and maintain the continuity of dynamics in multi-chip interconnection.
Summary of the Invention
[0007] To address the aforementioned technical problems, this invention proposes a hardware solution system for continuous-time Ising models based on a distributed architecture. This system achieves efficient solutions to large-scale combinatorial optimization problems with limited physical hardware resources through the interconnection and collaboration of multiple Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs), combined with matrix partitioning mapping and module reuse technology based on continuous-time hardware solution systems.
[0008] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0009] A hardware solution system for the continuous-time Ising model based on multi-chip interconnection includes:
[0010] The host computer is used to perform cluster analysis and graph partitioning on the global weight matrix of the Ising model, and generate neuron mapping relationships and scheduling tables;
[0011] Multiple chips, each with multiple logic accumulation units and activation function units, are used to achieve continuous time updates of neuron states through combinational logic;
[0012] The data transceiver module, located in each chip, is used to receive and store the sub-weight matrices in the global weight matrix of the neurons related to the current chip from the host computer, and distribute the sub-weight matrix data to the corresponding logic accumulation unit and activation function unit according to the neuron mapping relationship and scheduling table.
[0013] The system control module is used to control the loading of sub-weight matrix data and the iterative calculation of each logical accumulation unit and activation function unit according to the scheduling table;
[0014] The output interconnect structure includes an inter-chip physical interconnect network based on interconnect signals and an on-chip delay interconnect network based on combinational logic, which is used to realize the state transmission of neurons between chips and within the chip.
[0015] An external communication module is used to transmit sub-weight matrix data, neuron states, and control commands between the host computer and each chip.
[0016] In one embodiment, the host computer is used to perform cluster analysis and graph partitioning on the global weight matrix of the Ising model, generating neuron mapping relationships and a scheduling table, specifically including:
[0017] The host computer maintains the global weight matrix of the target Ising model, performs cluster analysis on the global weight matrix to identify the number of connections in each sub-weight matrix, and divides the sub-weight matrices of the global weight matrix into high-density and low-density sub-weight matrices based on the number of connections in each sub-weight matrix and a set threshold. Neurons are then renumbered, and neurons belonging to the same high-density sub-weight matrix are mapped to the same chip. Neurons connecting different high-density sub-weight matrices are defined as boundary nodes and assigned to different chips, thus obtaining a graph partitioning result for multiple chips and establishing the mapping relationship between neurons in the actual problem-solving process and the hardware solution system. Based on the upper limit of available computing resources for each chip, the set of neurons assigned to the same chip is further divided into multiple virtual node groups, generating a corresponding scheduling table. According to the neuron partitioning strategy, the sub-weight matrix data obtained after partitioning the global weight matrix, along with the corresponding neuron numbers and target chip identifiers, are encapsulated into data frames conforming to the communication protocol and distributed to the corresponding chips through the communication link.
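The host-side steps described above can be sketched roughly as follows. This is a minimal illustration, not the patented procedure: the clustering step itself is not specified in the source, so the cluster labels are taken as given, and the function names, the toy graph, the threshold, and the per-chip capacity are all hypothetical. The sketch only identifies boundary nodes; how they are finally placed on chips is left out.

```python
from collections import defaultdict


def block_connection_counts(weights, labels):
    """Count nonzero couplings in each cluster-pair block (each pair i < j once)."""
    counts = defaultdict(int)
    n = len(weights)
    for i in range(n):
        for j in range(i + 1, n):
            if weights[i][j] != 0:
                key = tuple(sorted((labels[i], labels[j])))
                counts[key] += 1
    return dict(counts)


def boundary_nodes(weights, labels):
    """Nodes with at least one nonzero coupling to a node in another cluster."""
    n = len(weights)
    return sorted(
        i for i in range(n)
        if any(weights[i][j] != 0 and labels[i] != labels[j] for j in range(n))
    )


def virtual_node_groups(nodes, capacity):
    """Split the nodes of one chip into groups no larger than the chip capacity."""
    return [nodes[k:k + capacity] for k in range(0, len(nodes), capacity)]


# Toy problem: two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
weights = [[0] * 6 for _ in range(6)]
for i, j in edges:
    weights[i][j] = weights[j][i] = 1

labels = [0, 0, 0, 1, 1, 1]   # assumed output of the (unspecified) clustering step
threshold = 2                 # hypothetical density threshold

counts = block_connection_counts(weights, labels)
dense_blocks = sorted(block for block, c in counts.items() if c >= threshold)
chip_of = {node: labels[node] for node in range(6)}   # one dense cluster per chip
groups = virtual_node_groups([n for n in range(6) if chip_of[n] == 0], capacity=2)
```

In this toy case the two triangles are the high-density blocks, nodes 2 and 3 are the boundary nodes, and chip 0 needs two virtual node groups because its assumed capacity is two neurons.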
[0018] In one embodiment, each activation function unit on the chip corresponds to a neuron.
[0019] In one embodiment, the logic accumulation unit includes a value generation unit, an arithmetic unit, and a noise injection unit; the activation function unit includes a parameter configuration unit, a nonlinear activation unit, and an output control unit.
[0020] The value generation unit receives the current output state of the connected neurons from the local chip or other chips, performs logical operations with the weights between the neurons, and obtains the numerical magnitude of the influence of other neurons on the neuron based on the result of the logical operation.
[0021] The arithmetic unit adopts a parallel accumulation architecture, including a data allocation unit, a parallel accumulation array, and an output generation unit; the data allocation unit identifies the input data and allocates the data to the parallel accumulation array; the parallel accumulation array performs an accumulation operation on the data; the output generation unit receives the accumulation result and the noise injection value and produces the final output signal.
[0022] The noise injection unit is located at the output of the arithmetic unit. Based on the noise intensity parameters stored in the corresponding noise intensity register given by the system control module, it generates random or controllable disturbance values. The disturbance values are added to the output signal of the arithmetic unit to obtain the total output including the noise term.
[0023] The parameter configuration unit receives external parameters and compares the total output with the threshold set by the activation function register; the nonlinear activation unit performs variable nonlinear activation processing according to different parameters, thereby dynamically adjusting the output of the activation function unit by adjusting the parameter size; the output control unit receives control instructions from the system control module to latch and output the neuron state of the activation function unit, and outputs the neuron state to all logic accumulation units.
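The behavior of one logic accumulation unit paired with one activation function unit can be approximated in software as below. This is only a behavioral sketch: the hardware is described as using logical operations and a parallel accumulation array, for which a plain multiply-and-sum stands in here, and the uniform noise distribution, the function names, and the example values are assumptions not taken from the source.

```python
import random


def local_field(weights_row, states, noise_strength, rng):
    """Accumulate the weighted neighbour states and add a noise term."""
    accumulated = sum(w * s for w, s in zip(weights_row, states))
    # Assumed noise model: uniform disturbance scaled by the noise intensity.
    noise = rng.uniform(-noise_strength, noise_strength)
    return accumulated + noise


def activate(total, threshold=0.0):
    """Biased sign activation: +1 when the total reaches the threshold, else -1."""
    return 1 if total >= threshold else -1


rng = random.Random(0)
states = [1, -1, 1]
weights_row = [0, 2, -1]   # couplings from neuron 0 to neurons 0, 1 and 2
field = local_field(weights_row, states, noise_strength=0.0, rng=rng)
new_state = activate(field)
```

With the noise intensity set to zero, the accumulated value is determined by the weights alone; raising `noise_strength` lets the neuron occasionally move against its local field, which is how annealing-style schedules are usually realized.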
[0024] In one embodiment, the data transceiver module, disposed in each chip, is used to receive and store sub-weight matrices from the global weight matrix of neurons related to the current chip, received from the host computer, and to distribute the sub-weight matrix data to the corresponding logical accumulation unit and activation function unit according to the neuron mapping relationship and scheduling table, specifically including:
[0025] The data transceiver module receives, filters, maps, and loads the global weight matrix that the host computer segments and sends in blocks. The data transceiver module has a mapping relationship storage unit, which receives and stores, for each neuron, the global neuron number, the neuron number on the current chip, and the identifier of the virtual node group to which the neuron belongs.
[0026] During the initialization phase, under the control of the system control module, the data transceiver module receives and parses the data frames segmented and sent by the host computer. Based on the target chip identifier and neuron number carried in the data frame, it determines whether the weight of the neuron belongs to the current chip. For sub-weight matrix data belonging to the current chip, the sub-weight matrix data is redirected to the corresponding neuron according to the neuron mapping relationship and stored in the weight matrix register, categorized and stored according to virtual node groups. Sub-weight matrix data that does not belong to the current chip is not received or stored. Each chip only receives and stores the weight of the corresponding neuron on the current chip and the sub-weight matrix corresponding to that neuron, thereby realizing the distributed carrying of the global connection matrix.
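The filtering and remapping performed during initialization can be illustrated with the following sketch. The frame layout (`WeightFrame`), the class name, and the shape of the mapping table are hypothetical; the source only states that a frame carries a target chip identifier and a neuron number, and that accepted weights are stored per virtual node group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightFrame:
    """Hypothetical frame layout: target chip, global neuron number, weight row."""
    target_chip: int
    global_neuron: int
    weights: tuple


class DataTransceiver:
    def __init__(self, chip_id, mapping):
        # mapping: global neuron number -> (local neuron number, virtual group id)
        self.chip_id = chip_id
        self.mapping = mapping
        self.weight_store = {}   # virtual group id -> {local neuron: weights}

    def receive(self, frame):
        """Store the frame if it targets this chip; return True when stored."""
        if frame.target_chip != self.chip_id:
            return False
        if frame.global_neuron not in self.mapping:
            return False
        local, group = self.mapping[frame.global_neuron]
        self.weight_store.setdefault(group, {})[local] = frame.weights
        return True


chip = DataTransceiver(chip_id=1, mapping={7: (0, 0), 9: (1, 0), 12: (0, 1)})
frames = [
    WeightFrame(1, 7, (0, 3, -1)),
    WeightFrame(2, 8, (1, 1, 1)),    # addressed to another chip, so it is dropped
    WeightFrame(1, 12, (2, 0, 0)),
]
accepted = [chip.receive(f) for f in frames]
```

Each chip therefore ends up holding only the weight rows of its own neurons, grouped by virtual node group, which is the distributed storage of the global matrix that the paragraph describes.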
[0027] During the solution execution phase, the data transceiver module, under the control of the system control module, inputs the weights corresponding to the neurons updated in different time slices into the weight matrix register of the logic accumulation unit according to the scheduling table, thereby solving large-scale problems.
[0028] In one embodiment, the system control module is used to control the loading of sub-weight matrix data and the iterative calculation of each logical accumulation unit and activation function unit according to the scheduling table, specifically including:
[0029] The system control module includes an internal loop control module and a distributed coordination module;
[0030] The internal loop control module controls the switching of each logic accumulation unit and activation function unit in the chip between different working modes through a multi-state controller, so that the logic accumulation unit and activation function unit can support parameter reconfiguration and debugging while maintaining continuous-time dynamic evolution.
[0031] The distributed coordination module synchronously rotates weights and state data among multiple chips based on the scheduling table, breaking down the combinatorial optimization problem into subtasks that are computed in time-shared fashion on limited hardware resources.
[0032] In one embodiment, the internal loop control module controls the switching between different operating modes of the logic accumulation units and activation function units within the chip via a multi-state controller, enabling the logic accumulation units and activation function units to support parameter reconfiguration and debugging while maintaining continuous-time dynamic evolution. Specifically, this includes:
[0033] The working modes include initialization mode and iterative calculation mode;
[0034] In initialization mode, the data transceiver module is controlled to receive the neuron mapping relationship and global weight matrix data sent by the host computer, store the mapping relationship, and load the sub-weight matrices of the global weight matrix, the initial states of the neurons, and the activation function unit parameters.
[0035] The iterative calculation mode drives each logic accumulation unit and activation function unit to perform continuous-time state evolution according to the predetermined schedule and activation function unit parameter settings. During this process, the output of the logic accumulation unit and activation function unit is sampled at a set period, and the sampling results are output to the host computer through the external communication module.
[0036] In one embodiment, the distributed coordination module synchronously rotates weights and state data across multiple chips according to the scheduling table, decomposing the combinatorial optimization problem into time-sharing computation of subtasks on limited hardware resources, specifically including:
[0037] The M neurons in the combinatorial optimization problem are divided into N groups. The number of neurons in each group is equal to the number of logic accumulation units and activation function units in the hardware solution system. At the beginning of the k-th time slice, the system control module first sets the logic accumulation units and activation function units on the current chip to their initial states and holds the output control units of the activation function units so that their outputs cannot change. The system control module reads the sub-weight matrix and activation function unit parameters corresponding to the current virtual node group according to the scheduling table transmitted by the host computer, and writes the weight matrix data and noise intensity parameters into the weight matrix registers and noise intensity registers of the corresponding logic accumulation units and activation function units through the internal bus. At the same time, it reads the final state of each neuron in the previous evolution cycle from the memory where the virtual node group is located, and preloads that state into the corresponding logic accumulation unit register as the initial state of this evolution cycle. Then the activation function units are enabled, the output control units are released, and the neurons begin to iterate. After a specified evolution time, the weight matrix registers, noise intensity registers, logic accumulation unit registers, and activation function unit parameters are updated according to the next virtual node group in the scheduling table, and the next evolution cycle begins. This repeats until the neuron states no longer change or the solution time limit is reached.
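The time-sliced rotation described above can be emulated in software as a rough sketch. Several simplifications are assumptions rather than the source's method: continuous-time evolution is replaced by a fixed number of random single-neuron updates per time slice, noise injection is omitted, groups are formed from consecutive neuron numbers, and the stopping test only checks that one full rotation produced no change.

```python
import random


def run_time_sliced(weights, states, group_size, steps_per_slice, max_rounds, seed=0):
    """Rotate through virtual node groups, updating only the active group."""
    rng = random.Random(seed)
    n = len(states)
    states = list(states)
    groups = [list(range(k, min(k + group_size, n))) for k in range(0, n, group_size)]
    for _ in range(max_rounds):
        changed = False
        for group in groups:                      # one time slice per group
            for _ in range(steps_per_slice):
                i = rng.choice(group)             # asynchronous, unordered update
                field = sum(weights[i][j] * states[j] for j in range(n))
                new_state = 1 if field >= 0 else -1
                if new_state != states[i]:
                    states[i] = new_state
                    changed = True
        if not changed:                           # no state changed in this rotation
            break
    return states


def energy(weights, states):
    """Ising energy H = -sum_{i<j} w_ij s_i s_j (assumed sign convention)."""
    n = len(states)
    return -sum(weights[i][j] * states[i] * states[j]
                for i in range(n) for j in range(i + 1, n))


weights = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
initial = [1, -1, 1, -1]
final = run_time_sliced(weights, initial, group_size=2, steps_per_slice=4, max_rounds=10)
```

Because the weights are symmetric with a zero diagonal, each accepted single-neuron update cannot raise the energy, so `energy(weights, final)` is never larger than `energy(weights, initial)` even though only one group is active at a time.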
[0038] In one embodiment, the output interconnect structure includes an inter-chip physical interconnect network based on interconnect signals and an intra-chip delay interconnect network based on combinational logic, used to realize state transmission of neurons between chips and within a chip, specifically including:
[0039] Inter-chip physical interconnection network based on interconnection signals: The inter-chip physical interconnection network establishes hardware connection paths between multiple chips through multiple sets of transmission lines. One end of each set of transmission lines is connected to the output pin of the transmitting chip, and the other end is connected to the input pin of the receiving chip. After activation by the activation function unit, the neuron state of one chip is transmitted directly to the other chips over the physical interconnection link via the output buffer. At the receiving chip, the input buffer converts it into an internal signal, which then passes through a programmable input delay unit that configures the cross-chip propagation time and connects directly to the logic accumulation unit of that chip, thereby simulating the propagation time constants of different paths on the digital platform.
[0040] On-chip delayed interconnect network based on combinational logic: For neurons located inside the same chip, the accumulated input of each neuron is activated by its activation function unit, and the resulting output value is transmitted directly to the other neurons on the same chip via combinational-logic interconnects.
[0041] In one embodiment, the external communication module includes a physical layer interface unit, an internal data transceiver synchronization unit, a protocol parsing and encapsulation unit, and a data buffering and distribution unit.
[0042] In the downlink direction, the host computer sends a data stream through an external communication module. After being converted by the physical layer interface unit, the data stream is input to the data transceiver synchronization unit of the current chip. After clock data recovery and serial-to-parallel conversion, the data stream is output to the protocol parsing and encapsulation unit for instruction parsing. In the uplink direction, under the trigger of the system control module, the chip writes the sampling result of the activation function unit to the data transceiver synchronization unit after protocol conversion by the data buffer and distribution unit, and then sends it to the host computer.
[0043] Compared with the prior art, the beneficial technical effects of the present invention are:
[0044] The continuous-time Ising model hardware solution system based on multi-chip interconnection in this invention breaks through the limitations of resource quantity in solving large-scale combinatorial optimization problems. It can be configured to solve Max-Cut problems or Ising model problems of arbitrary size. Under the multi-chip architecture, it also maintains the asynchronous energy reduction characteristics of continuous-time or quasi-continuous-time problems, which is more consistent with real-world physical systems. Through distributed weighting, larger-scale problems can be solved with fewer resources. Furthermore, the architecture is applicable to any Ising problem. By changing the weight bit width, other combinatorial optimization problems can be solved. The design architecture using multiple FPGAs or ASICs also facilitates system scalability.
Attached Figure Description
[0045] Figure 1 is a schematic diagram of the overall architecture of the present invention.
[0046] Figure 2 is a schematic diagram of the system workflow of the present invention.
[0047] Figure 3 is a schematic diagram of the structure of the continuous-time Ising model solving chip of the present invention.
[0048] Figure 4 is a detailed structural diagram of the logic accumulation unit and activation function unit of the present invention.
Detailed Implementation
[0049] A preferred embodiment of the present invention will now be described in detail with reference to the accompanying drawings.
[0050] This invention discloses a hardware solution system for the continuous-time Ising model based on a distributed architecture, comprising: a host computer, a data transceiver module, a system control module, a logic accumulation unit and an activation function unit, an output interconnection structure, and an external communication module, all implemented by digital logic circuits;
[0051] The host computer is used to maintain the global weight matrix of the Ising model corresponding to the combinatorial optimization problem. It performs cluster analysis on the global weight matrix to identify the number of connections in each sub-weight matrix. Based on the number of connections in each sub-weight matrix and a set threshold, the sub-weight matrices of the global weight matrix are divided into high-density and low-density sub-weight matrices. Neurons are renumbered, and neurons belonging to the same high-density sub-weight matrix are mapped to the same chip. Neurons connecting different high-density sub-weight matrices are defined as boundary nodes and assigned to different chips, thus obtaining a graph partitioning result for multiple chips and establishing the mapping relationship between neurons in the actual problem-solving process and the hardware solution system. Based on the upper limit of available computing resources for each chip, the set of neurons assigned to the same chip is further divided into multiple virtual node groups, generating a corresponding scheduling table. According to the neuron partitioning strategy, the sub-weight matrix data obtained after partitioning the global weight matrix, along with the corresponding neuron number and target chip identifier, are encapsulated into data frames conforming to the communication protocol and distributed to the corresponding chips through the communication link.
[0052] The output interconnect structure is divided into an inter-chip physical interconnect network based on interconnect signals and an intra-chip delay interconnect based on combinational logic.
[0053] Inter-chip physical interconnect network based on interconnect signals: The inter-chip physical interconnect unit is set on a multi-chip carrier board (PCB) to establish hardware connection paths between multiple FPGA or ASIC chips. Multiple transmission lines are pre-laid on the PCB. One end of each transmission line connects to the differential output pin of the transmitting chip, and the other end connects to the input pin of the receiving chip. The lengths of the transmission lines are matched during the PCB design phase so that the propagation delays of cross-chip connections are approximately equal. After activation by the activation function unit, the neuron state signal is converted directly into an interconnect signal pair by the output buffer. At the receiving chip, the input buffer converts the signal into an internal chip signal, which passes through a programmable input delay unit that configures the effective propagation time of neuron states transmitted across chips and then connects directly to the logic accumulation unit. This simulates the propagation time constants of different paths on the digital platform.
[0054] On-chip delay interconnect based on combinational logic: For different neurons located within the same chip, after activation by the activation function unit, the output value is transmitted directly to the logic accumulation units of the remaining neurons on the same chip via combinational logic. The combinational logic achieves nanosecond-level delay modulation through a clock-controlled carry chain, but the signal transmission path itself is not clocked and no registers are inserted into it. This allows the spin states of neurons within the same chip to interact directly through the combinational logic. Because of the modulated path delay, the asynchronous and unordered update characteristics of the continuous-time Ising solution system are preserved.
[0055] The logic accumulation unit includes a value generation unit, an arithmetic unit, and a noise injection unit.
[0056] The value generation unit receives the current output state of the connected neurons from the local chip or other chips, combines it with the weight between the two neurons to obtain the sign and magnitude of each connected neuron's weighted contribution, and generates the signals used to compute the neuron's accumulated sum.
[0057] The arithmetic unit adopts a parallel accumulation architecture, including a data allocation unit, a parallel accumulation array, and an output generation unit. The data allocation unit identifies the input data and allocates the data to the parallel accumulation array. The parallel accumulation array performs an accumulation operation on the data. The output generation unit receives the accumulation result and the noise injection value and produces the final output signal.
[0058] The noise injection unit is located at the output of the arithmetic unit. Based on the noise intensity parameters stored in the corresponding noise intensity register given by the system control module, it generates random or controllable disturbance values. The disturbance values are added to the output signal of the arithmetic unit to obtain the total output including the noise term.
[0059] The activation function unit includes a parameter configuration unit, a nonlinear activation unit, and an output control unit. The parameter configuration unit receives external parameters and compares the total output with the threshold set by the activation function unit's peripheral register. The nonlinear activation unit performs variable nonlinear activation processing based on different parameters, so that the output of the activation function unit can be adjusted dynamically by changing the parameter values. The specific form of the activation function (such as a sign function, a biased sign function, a piecewise function, etc.) can be configured through the activation function unit's peripheral register, thereby supporting different energy decay strategies and annealing strategies. The output control unit receives control commands from the system control module and latches and outputs the neuron state of the activation function unit. The output binary neuron state represents the output of the activation function unit at the current moment. This output is fed to the remaining logic accumulation units on this chip through combinational logic and to the logic accumulation units on other chips through the inter-chip physical interconnect network.
[0060] The system control module is divided into an internal loop control module and a distributed coordination module.
[0061] Internal Loop Control Module: Internally equipped with a multi-state controller, this module manages the overall workflow of the chip and the state of each logic accumulation unit and activation function unit, including but not limited to initialization mode and iterative calculation mode. In initialization mode, the control data transceiver module receives neuron mapping relationships and global weight matrix data from the host computer, stores the mapping relationships, loads the sub-weight matrices of the global weight matrix, the initial states of neurons, and the parameters of the activation function units. In iterative calculation mode, each logic accumulation unit and activation function unit undergoes continuous-time state evolution according to a predetermined schedule and activation function unit parameter settings. During this process, the outputs of the logic accumulation units and activation function units are sampled at set intervals, and the sampling results are output to the host computer via an external communication module.
[0062] Distributed Coordination Module: The system control module divides the M neurons of the problem into N groups. The number of neurons in each group is equal to the number of logic accumulation units and activation function units in the system. At the beginning of the k-th time slice, the system control module first sets the logic accumulation units and activation function units on the chip to their initial states and disables state latch updates. Then, according to the scheduling table provided by the host computer, the system control module retrieves the sub-weight matrix and activation function unit parameters belonging to the current neuron node group from the on-chip memory, and writes the weight matrix data and noise intensity parameters into the weight matrix registers and noise intensity registers of the corresponding logic accumulation units and activation function units through the internal bus. Simultaneously, it reads the final state of each neuron in the previous evolution cycle from the memory of the virtual node group and preloads that state into the corresponding logic accumulation unit register as the initial state for this round of evolution.
[0063] The data transceiver module receives the global connection matrix segmented by the host computer and, in conjunction with a multiplexing mechanism, receives, filters, maps, and loads the global connection matrix sent by the host computer in blocks. Internally, the data transceiver module includes a mapping table storage unit to receive and store the correspondence between global neuron numbers, corresponding neuron numbers of activation function units on the chip, and their respective virtual node groups. During initialization, under the control of the system control module, the data transceiver module receives and parses the sub-weight matrix data frames segmented and sent by the host computer: based on the target chip identifier and global neuron number carried in the data frame, it determines whether the weight belongs to the chip; for sub-weight matrix data belonging to the chip, it redirects them to the corresponding logical accumulation unit and activation function unit and their weight peripheral registers or local weight storage areas according to the mapping relationship, and categorizes them by virtual node group or sub-block; sub-weight matrix data not belonging to the chip is not received or stored. Each chip only receives and stores the local connection matrix related to the neuron corresponding to the logical accumulation unit on its own chip, thereby achieving distributed carrying of the global connection matrix. During the solution execution phase, under the control of the system control module, the data transceiver module, according to the scheduling table, inputs the weights corresponding to the neurons updated by the activation function unit in different time slices into the peripheral registers of the logic accumulation unit corresponding to the numerical generation unit, thereby reusing the weight matrix registers. 
Through this structure, this chip can solve combinatorial optimization problems of arbitrary scale by inputting weights and updating the local connection matrices of multiple neuron node groups in turn, with a fixed physical scale of logic accumulation units, activation function units, and register resources.
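The receive-side filtering described in paragraph [0063] can be sketched in software as follows. This is a minimal illustration, not the hardware implementation: the frame tuple layout `(target_chip, global_id, row)`, the function name `load_local_weights`, and the dictionary-based mapping table are assumptions made for the example.

```python
from collections import defaultdict


def load_local_weights(frames, chip_id, mapping_table):
    """Keep only weight rows addressed to this chip, filed by virtual node group.

    frames:        iterable of (target_chip, global_id, row) tuples (assumed layout)
    mapping_table: global neuron number -> (local unit number, virtual node group)
    """
    store = defaultdict(dict)
    for target_chip, global_id, row in frames:
        if target_chip != chip_id or global_id not in mapping_table:
            continue  # not for this chip: neither received nor stored
        local_unit, group = mapping_table[global_id]
        store[group][local_unit] = row
    return dict(store)
```

Grouping the stored rows by virtual node group mirrors the text's requirement that weights be categorized so they can be reloaded into the same physical registers in later time slices.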
[0064] An external communication module, mounted on a custom PCB, transmits sub-weight matrix data, activation function unit sampling results, and control commands between the host computer and multiple chips. It distributes and filters the data using target chip identifiers and data frame header identifiers, and writes the connection matrix related to the neurons corresponding to the logic accumulation units on the chip into local storage. The external communication module includes a physical layer interface unit and, inside each chip, a data transmission and reception synchronization unit, a protocol parsing and encapsulation unit, and a data buffering and distribution unit. In the downlink direction, the host computer sends a data stream through the external communication module. After conversion by the physical layer interface unit, the data is input to the current chip's data transmission and reception synchronization unit. After clock data recovery and serial-to-parallel conversion, the data is output to the protocol parsing and encapsulation unit for command parsing. In the uplink direction, triggered by the system control module, the chip writes the sampling results of the activation function units to the data transmission and reception synchronization unit after protocol conversion by the data buffering and distribution unit, and then sends them to the host computer.
[0065] The solution system divides the M neuron nodes corresponding to a combinatorial optimization problem of size M into N groups. Under the control of the system control module, when a time slice begins, the logic accumulation unit and activation function unit are in the initialization state, driving the sub-weight matrix corresponding to the current group of neuron nodes and writing it into the weight peripheral register group of each logic accumulation unit and activation function unit. After the input is completed, the system outputs activation control signals to the logic accumulation unit and activation function unit that need to be activated in the current time slice. The logic accumulation unit and activation function unit begin iterating until the next time slice arrives. The logic accumulation unit and activation function unit then re-enter the initialization state, rewrite the neuron state and sub-weight matrix corresponding to the next time slice, and begin iterating again. This process is repeated to achieve the rotational update of the overall neuron nodes. Thus, block-based solutions to Max-Cut problems of arbitrary size are achieved with a fixed number of activation function units.
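The rotational update of neuron groups described above can be sketched as a simple round-robin schedule. This is an illustrative model only; the function names `split_into_groups` and `rotation_schedule` are hypothetical, and contiguous grouping of neuron indices is an assumption (the text leaves the grouping to the host computer's scheduling table).

```python
def split_into_groups(num_neurons, units_per_chip):
    """Split neuron indices 0..M-1 into groups no larger than the number of physical units."""
    return [
        list(range(start, min(start + units_per_chip, num_neurons)))
        for start in range(0, num_neurons, units_per_chip)
    ]


def rotation_schedule(groups, num_slices):
    """Return (time_slice, group_index) pairs; groups are activated in turn."""
    return [(k, k % len(groups)) for k in range(num_slices)]


groups = split_into_groups(600, 200)       # three virtual node groups
schedule = rotation_schedule(groups, 6)    # two full passes over the groups
```

In each scheduled slice the hardware would reload the weights and latched states of the selected group before enabling iteration, so the same physical units serve every group in turn.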
[0066] When the output of an activation function unit changes, the other units on the same chip receive the change directly through the combinational logic connection, without being gated by a clock, and each logic accumulation unit adjusts its accumulated value accordingly. Units on other chips receive the change through the direct inter-chip interconnect, which introduces a certain delay. The logic accumulation units and activation function units in the system continue to influence one another until all activation function units settle to fixed outputs, which gives the solution to the corresponding combinatorial optimization problem.
[0067] Furthermore, the system control module processes the segmented global connection matrix to form a combinational logic closed loop and controllable oscillation. This invention includes a state latch and an oscillation control unit in the system control module. The state latch is used to maintain or update the current output state when needed. The system control module determines whether the current activation function unit should be updated by controlling the enable signal of the state latch: when the latch is enabled, the output results of the logic accumulation unit and the activation function unit continuously change, interconnected with other logic accumulation units and activation function units by combinational logic, forming a large-scale continuous-time combinational loop; when the latch is closed, the output of the activation function unit remains unchanged, equivalent to temporarily stopping the updating of neurons.
[0068] The global connection matrix can address Max-Cut problems of arbitrary size and connection rate, or general Ising-type combinatorial optimization problems. Through a reuse mechanism, the data transceiver module, under the control of the scheduling table provided by the system control module, sequentially loads multiple pre-divided sub-weight matrices from the host computer into a limited number of weight peripheral registers. This allows for block-wise computation of the global weight matrix under conditions of fixed physical resources and limited inter-chip interconnects. Furthermore, the host computer performs cluster analysis and renumbers the neuron nodes of the global weight matrix. Based on the number of connections in each sub-weight matrix and a set threshold, the sub-weight matrices of the global weight matrix are divided into high-density and low-density sub-weight matrices, and the neurons are renumbered. Neurons belonging to the same high-density sub-weight matrix are mapped to the same chip. Neurons connecting different high-density sub-weight matrices are defined as boundary nodes and assigned to different chips, thus enabling the solution of Max-Cut or general combinatorial optimization problems of arbitrary size and connection rate at the overall system level.
[0069] Furthermore, the number of transmission lines in the inter-chip physical interconnection unit does not require complete interconnection between all neurons. Instead, the host computer's graph partitioning and edge mapping algorithm maps as few neurons as possible to different chips.
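The partitioning goal in the two preceding paragraphs can be checked with a short sketch: given a chip assignment, find the boundary nodes and count the edges that must cross chips. Here a boundary node is assumed to be any node with at least one neighbor on a different chip; the function names and the edge-list representation are hypothetical.

```python
def boundary_nodes(edges, chip_of):
    """Return the set of nodes that have at least one edge to a node on another chip."""
    boundary = set()
    for i, j in edges:
        if chip_of[i] != chip_of[j]:
            boundary.update((i, j))
    return boundary


def inter_chip_edges(edges, chip_of):
    """Count edges whose endpoints sit on different chips (a proxy for interconnect demand)."""
    return sum(1 for i, j in edges if chip_of[i] != chip_of[j])
```

A host-side partitioner could compare candidate assignments by `inter_chip_edges` and prefer the one that needs the fewest physical inter-chip lines.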
[0070] Furthermore, a delay control module is added to the interconnects between chips. By injecting random values, different delays are configured for the chips' IODELAY outputs, thereby injecting random delays into the neurons and making it easier for the system to escape local energy minima.
[0071] Furthermore, the external communication module identifies different data types by controlling the data frame type identifier. The data type identifier is used to distinguish: global connection matrix or bias parameters from the host computer, sampling results of the combined activation function unit transmitted from the chip to the host computer, and configuration commands related to system control.
[0072] Furthermore, the external communication module internally includes a transmit buffer queue and a receive buffer queue to temporarily store pending weight data frames, result data frames, and control frames, preventing momentary link congestion. At the receiving end, incoming data is buffered and, in coordination with the system control module, provided to the data transceiver module or result processing module as needed.
[0073] Furthermore, the construction of the logical accumulation unit and the activation function unit is based on the physical Ising model; the updating process of the logical accumulation unit and the activation function unit is the process of the corresponding neuron system's total energy state value automatically decreasing.
[0074] Example 1:
[0075] In this embodiment, taking the Max-Cut problem with 800 nodes as an example, the hardware solution system of the Ising model of the present invention is deployed on four interconnected chips on a custom PCB. The chips are implemented as Field Programmable Gate Arrays (FPGAs), and the overall architecture is shown in Figure 1.
[0076] A Zynq AX100B FPGA is used as the main control chip, responsible for overall system scheduling, task allocation, data transmission control, and some computational tasks. It supports 50 nodes of logic accumulation units and activation function units. Three Xilinx KU095 FPGA chips serve as computing chips; each is configured with 250 logic accumulation units and activation function units, with each unit pair processing one neuron. Because the system has enough logic accumulation units and activation function units for all 800 nodes (50 + 3 × 250), the problem can be solved without activating a scheduling table or virtual node groups. The system achieves high-speed data transmission between FPGA chips via fiber optic communication, with a fiber optic link transmission rate of 12 Gbps.
[0077] The PC is responsible for dividing the global weight matrix into four sub-weight matrices, each 250×800 in size (i.e., each subproblem contains 250 nodes and 250×800 connections). Together, these four sub-weight matrices constitute the entire 800-node problem, and they are distributed across the four FPGA chips for computation. Each computing FPGA has 250 logic accumulation units and activation function units, each responsible for one node of its 250×800 matrix. Within each time step, the FPGAs compute in parallel, and a 200 MHz clock controls the circuit functions other than the logic accumulation and activation function units. Each time step lasts 100 ns, i.e., 20 clock cycles.
[0078] Within each time step, the Zynq AX100B FPGA master control chip sequentially transmits the 250×800 matrix to three Ku095 FPGAs through scheduling.
[0079] Each FPGA's internal memory connects to an external communication module to receive sub-weight matrix data from the host computer. The FPGA receiver's QSFP28 fiber optic interface is configured for a 12 Gbps high-speed serial transceiver protocol. The fiber optic interface module receives the optical signal and converts it into an electrical signal, which is connected to the RX pin of the FPGA's BANK127-128 GTY transceiver. The GTY transceiver's reference clock is provided by a Si5332BD11025-4 chip. The signal is then passed to the FPGA's data transceiver module for decoding, verification, and data loading. If verification fails, the FPGA sends a retransmission request to the transmitting FPGA via the fiber optic channel. The transmission delay of the external communication module is proportional to the length of the fiber optic cable. Assuming the fiber optic cable length is L meters and the propagation speed of the signal in the fiber is approximately 2 × 10⁸ m/s, the propagation delay is 5 nanoseconds per meter, or 5L nanoseconds in total. In this example, the optical fiber length is 1 meter, and the communication delay for this segment is 5 nanoseconds. During system design, all optical fiber links are matched in length to maintain consistent communication timing.
[0080] Each data frame includes the following fields:
[0081] Target chip ID (2 bytes) indicates the target FPGA chip of the data frame.
[0082] Data type (2 bytes) identifies the data type:
[0083] weight matrix or initial state.
[0084] Data length (4 bytes) indicates the length of the data portion; in this example, it is 250 × 800 (i.e., 200,000 bytes).
[0085] Data content (a 250×800-bit matrix) includes weight matrix or neuron state data;
[0086] The scheduling table data is transmitted in byte order.
[0087] The checksum (4 bytes) is used to verify data integrity and ensure that no data is corrupted during transmission.
[0088] End of frame marker (1 byte).
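The frame layout listed above can be sketched with Python's `struct` module. This is a hedged illustration: the big-endian byte order, the CRC-32 polynomial used by `zlib.crc32`, the choice to compute the checksum over header plus payload, and the constants `END_MARKER` and `TYPE_WEIGHTS` are all assumptions, since the text does not specify them.

```python
import struct
import zlib

END_MARKER = 0x7E       # hypothetical end-of-frame byte
TYPE_WEIGHTS = 0x0001   # hypothetical data-type code for a weight matrix


def build_frame(chip_id, data_type, payload):
    """Pack chip ID (2 B), data type (2 B), length (4 B), payload, checksum (4 B), end marker (1 B)."""
    header = struct.pack(">HHI", chip_id, data_type, len(payload))
    crc = zlib.crc32(header + payload) & 0xFFFFFFFF
    return header + payload + struct.pack(">IB", crc, END_MARKER)


def parse_frame(frame):
    """Return (chip_id, data_type, payload); raise ValueError if the frame is corrupted."""
    chip_id, data_type, length = struct.unpack(">HHI", frame[:8])
    body_end = 8 + length
    payload = frame[8:body_end]
    crc, end = struct.unpack(">IB", frame[body_end:body_end + 5])
    if end != END_MARKER or crc != (zlib.crc32(frame[:body_end]) & 0xFFFFFFFF):
        raise ValueError("corrupted frame: request retransmission")
    return chip_id, data_type, payload
```

For the 200,000-byte payload of this example, the frame adds 13 bytes of overhead; a receiver would call `parse_frame` and issue a retransmission request when it raises.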
[0089] The system uses a 200MHz clock to control the global operation of the FPGA, including circuit functions except for the logic accumulation unit and the activation function unit. Within each 100ns time step, the state update and data loading of the activation function unit are synchronized according to the schedule. Each time step is 20 clock cycles of the 200MHz clock, ensuring that the system accurately processes each subtask.
[0090] Each FPGA receives sub-weight matrix data from the host computer via a fiber optic link. Each data frame includes a CRC (Cyclic Redundancy Check) field. The receiving FPGA performs integrity checks on the received frames; if the data is inconsistent, a retransmission mechanism is triggered. Data transmission proceeds sequentially: first, data is transmitted from the host computer to the FPGA via a QSFP28 fiber optic interface and loaded into the weight matrix register; then, it is updated in parallel through the computation logic of the logic accumulation unit and activation function unit. When the system transitions to an iterative state, the 250 logic accumulation units and activation function units within each FPGA begin parallel computation, performing continuous-time Ising updates. The connections between neurons corresponding to logic accumulation units and other neurons are stored in the weight matrix register and updated according to the current node's state and weight. Each logic accumulation unit includes a value generation unit, an arithmetic unit, and a noise injection unit; the activation function unit includes a parameter configuration unit, a nonlinear activation unit, and an output control unit. These units work together, with the computation unit summing the neuron connection weights step by step, the noise injection unit introducing thermal noise to simulate the computation, and the activation function unit updating the neuron state according to a threshold, thereby simulating the physical evolution process in the continuous-time Ising model.
[0091] On the PC, cluster analysis is performed on the global weight matrix to identify the number of connections in each sub-weight matrix. Based on the number of connections in each sub-weight matrix and a set threshold, the sub-weight matrices of the global weight matrix are divided into high-density and low-density sub-weight matrices. Neurons are then renumbered, and neurons belonging to the same high-density sub-weight matrix are mapped to the same chip. Neurons connecting different high-density sub-weight matrices are defined as boundary nodes and assigned to different chips. During the neuron allocation phase, the 50 neurons allocated to a Zynq chip are designated as low-density regions, with fewer than 50 interconnections with other neurons. This allows the system to be applicable to connection problems involving 800 nodes.
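The density-based classification described above can be sketched as follows. This is a simplified stand-in for the host-side cluster analysis: it assumes square blocks along a fixed node ordering and a plain nonzero count, whereas the text does not specify the clustering algorithm or block shape. The function name `classify_blocks` is hypothetical.

```python
def classify_blocks(weights, block_size, threshold):
    """Label each block of a square weight matrix 'high' or 'low' by its nonzero count."""
    n = len(weights)
    labels = {}
    for r in range(0, n, block_size):
        for c in range(0, n, block_size):
            count = sum(
                1
                for i in range(r, min(r + block_size, n))
                for j in range(c, min(c + block_size, n))
                if weights[i][j] != 0
            )
            labels[(r // block_size, c // block_size)] = (
                "high" if count >= threshold else "low"
            )
    return labels
```

Diagonal blocks labeled "high" correspond to neuron groups that would be kept on one chip, while "low" off-diagonal blocks indicate sparse coupling that can tolerate the slower inter-chip path.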
[0092] Fifty of the 74 I/O ports of the Zynq AX100B chip are connected to the 50 I/O ports of three KU095 FPGAs via differential pair interconnects. The three KU095 FPGAs are interconnected by 200 I/O ports each using differential pair interconnects. Interconnect signals communicate between the FPGAs via high-speed I/O pins, ensuring high-quality data transmission and low interference. Dedicated pin configurations and routing are used for data transmission between these I/O ports to avoid collisions and signal loss.
[0093] The fiber optic link operates at 12 Gbps. In addition to the 200 neurons directly connected between every two KU095s, the register states of 600 neurons are transmitted through the fiber optic interface. The transmission delay on the fiber is 600 bits / 12 Gbps = 50 ns, and the total delay including this transmission delay is about 150 ns, which is sufficient to update the neurons that are not directly connected.
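The link-delay arithmetic can be reproduced with a short sketch. It assumes one bit per neuron state and ignores line coding and framing overhead; the propagation speed of 2 × 10⁸ m/s is the approximate figure for optical fiber used elsewhere in this description. The function names are hypothetical.

```python
def serialization_delay_ns(num_bits, rate_gbps):
    """Time to clock num_bits onto a serial link; 1 Gbps equals 1 bit per nanosecond."""
    return num_bits / rate_gbps


def propagation_delay_ns(length_m, speed_m_per_s=2e8):
    """Time for a signal to travel length_m of fiber at the given propagation speed."""
    return length_m / speed_m_per_s * 1e9


serial_ns = serialization_delay_ns(600, 12)   # 600 neuron states at 12 Gbps
flight_ns = propagation_delay_ns(1.0)         # 1 m of fiber
```

These two terms account for only part of the roughly 150 ns total quoted in the text; the remaining components are not itemized there.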
[0094] Within each time slice, the calculation results propagate directly through the logic accumulation units and activation function units within the FPGA. The output of each activation function unit is synchronously updated through local combinational logic. This combinational output is also wired directly to the other FPGA chips: the state is transmitted to the other three FPGAs via equal-length transmission lines and fed straight into their logic accumulation units, bypassing registers.
[0095] After the logic accumulation unit and activation function unit within each FPGA chip complete their calculations, the sampling module stores the results in the transmission buffer. Data is transmitted to the host computer via the uplink of the QSFP28 fiber optic interface through the TX interface of the GTY transceiver. The optical signal is converted into an electrical signal and transmitted back at a rate of 12Gbps. The host computer performs global optimization analysis based on the transmitted sampling results, ultimately obtaining the optimal solution to the Max-Cut problem.
[0096] Example 2:
[0097] This embodiment takes the Max-Cut problem with 2000 nodes, a connection rate of 10%, and a connection weight width of 5 bits as an example.
[0098] The system uses a Zynq AX100B FPGA as the main control chip, responsible for overall system scheduling, task allocation, multiplexing scheduling, and data transmission control. The system is configured with one AX100B and three Xilinx KU095 FPGAs as parallel computing chips. The AX100B implements 50 logic accumulation units and activation function units on-chip, mapped to 200 logic nodes through a multiplexing scheduling table. Each KU095 implements 200 logic accumulation units and activation function units on-chip, mapped to 600 logic nodes through a multiplexing scheduling table. Thus, the AX100B and the three KU095s together implement a parallel solution architecture for 2000 logic nodes (200 + 3 × 600), used to complete the iterative update and solution of the Max-Cut problem. The problem is a weighted Max-Cut random graph with 2000 nodes and a connectivity ratio of 10%. For any pair of nodes $(i, j)$, an edge is generated with probability $p = 0.1$; the average number of connections per node is approximately $0.1 \times 1999 \approx 200$.
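A test instance of the kind described above can be generated with a short sketch. This assumes an Erdős–Rényi style construction with edge probability 0.1 (the stated connectivity ratio) and nonzero integer weights in the 5-bit two's-complement range; the weight distribution and the function names are assumptions, not details from the text.

```python
import random

NONZERO_WEIGHTS = [w for w in range(-16, 16) if w != 0]  # assumed 5-bit signed range


def random_maxcut_graph(num_nodes, edge_prob, seed=0):
    """Give each node pair (i, j), i < j, an edge with probability edge_prob."""
    rng = random.Random(seed)
    edges = {}
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < edge_prob:
                edges[(i, j)] = rng.choice(NONZERO_WEIGHTS)
    return edges


def average_degree(num_nodes, edges):
    """Each edge contributes to the degree of both endpoints."""
    return 2 * len(edges) / num_nodes
```

Calling `random_maxcut_graph(2000, 0.1)` yields an average degree close to 200, which matches the fixed adjacency-list length used later in this embodiment.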
[0099] The connection weight $w_{ij}$ of each edge is stored using 5-bit fixed-point encoding (preferably signed two's complement), with value range $w_{ij} \in [-16, 15]$. The KU095 internally sign-extends the 5-bit weights and accumulates them in the adder tree during calculation.
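The sign extension of 5-bit weights can be illustrated as follows, assuming the two's-complement option mentioned above. The function names are hypothetical.

```python
def encode_5bit(value):
    """Encode an integer in [-16, 15] as a 5-bit two's-complement field."""
    if not -16 <= value <= 15:
        raise ValueError("weight does not fit in 5 signed bits")
    return value & 0x1F


def sign_extend_5bit(raw):
    """Interpret the low 5 bits of raw as a two's-complement integer."""
    raw &= 0x1F
    return raw - 32 if raw & 0x10 else raw
```

For example, the bit pattern `0b11111` decodes to -1 and `0b10000` decodes to -16, the most negative representable weight.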
[0100] Each KU095 FPGA is configured with 200 logic accumulation units and activation function units (each activation function unit corresponds to one neuron). To support 2000 logic nodes without significantly increasing hardware resources, this embodiment adopts 3-fold time multiplexing: each KU095 updates 3 groups of logic nodes (200 nodes per group) in 3 consecutive time slices, thereby enabling a single KU095 to cover 600 logic nodes.
[0101] The system uses a 200 MHz clock to control all global functional modules (communication, buffering, scheduling, CRC check), that is, everything except the logic accumulation units and activation function units. The basic scheduling time slice is defined as 100 ns (i.e., 20 clock cycles at 200 MHz). Each KU095 completes the "data loading and parallel update" of a group of 200 logical nodes within one 100 ns time slice, and completes a full update of its 600 logical nodes in three consecutive time slices (300 ns); this 300 ns is defined as one "iteration step".
[0102] The PC maintains the global graph structure and weight set, and divides the 2000 nodes into four groups (one group of 200 nodes and three groups of 600 nodes each), mapping them to the four FPGAs. Since the connectivity is 10% and the global weight matrix is sparse, this embodiment uses sparse connections, but still distributes a 2000×2000 fully connected matrix. For each logical node $i$ mapped to an FPGA, the PC generates its adjacency list: the neighbor node number $j$ must cover 0-1999 and uses 11-bit encoding; the edge weight $w_{ij}$ uses 5-bit encoding; each adjacency entry is packaged into a fixed-length 16-bit record (11-bit node sequence number plus 5-bit weight).
[0103] To facilitate fixed hardware bandwidth and the addition tree structure, this embodiment fixes the number of adjacent items for each node at 200 (nodes with fewer than 200 items are padded with all zeros; if a node has more than 200 items, it is truncated on the PC, renumbered, and then partitioned to meet the upper limit constraint). Therefore, the data size of the sub-weight matrix of each node is approximately 200 × 2 bytes = 400 bytes.
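The fixed-length adjacency records can be sketched as follows. The bit layout (neighbor index in the upper 11 bits, weight in the lower 5 bits) and the big-endian byte order are assumptions; the text only fixes the field widths and the 200-entry limit. The function names are hypothetical.

```python
import struct

MAX_NEIGHBORS = 200


def pack_record(neighbor, weight):
    """Pack an 11-bit neighbor index and a 5-bit two's-complement weight into 16 bits."""
    if not 0 <= neighbor < 2048 or not -16 <= weight <= 15:
        raise ValueError("field out of range")
    return (neighbor << 5) | (weight & 0x1F)


def pack_adjacency(neighbors):
    """Pack a list of (index, weight) pairs into 400 bytes, zero-padded to 200 records."""
    if len(neighbors) > MAX_NEIGHBORS:
        raise ValueError("more than 200 neighbors: repartition on the host first")
    words = [pack_record(j, w) for j, w in neighbors]
    words += [0] * (MAX_NEIGHBORS - len(words))
    return struct.pack(">200H", *words)
```

An all-zero padding record decodes to a zero weight, so padded entries contribute nothing to the accumulated sum regardless of the neighbor's state.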
[0104] Each KU095 corresponds to 200 nodes, so the weight payload of a single KU095 is approximately 80,000 bytes (200 nodes × 400 bytes). The four FPGAs together contain approximately 320,000 bytes of sub-weight matrix data (excluding frame headers, CRC, and scheduling table overhead). High-speed data transmission between the Zynq AX100B and each KU095 FPGA is achieved via QSFP28 fiber optic communication. The fiber optic link uses a high-speed serial transceiver protocol. The KU095 side is connected to the QSFP28 optical module via a GTY transceiver, and the GTY reference clock is provided by a Si5332BD11025-4. The transmission delay of the external communication module is proportional to the fiber length: at a propagation speed of approximately 2 × 10⁸ m/s, the delay per meter is approximately 5 ns. In this embodiment, the fiber length is 1 m, corresponding to a one-way propagation delay of approximately 5 ns; the system design matches the lengths of each link to maintain timing consistency.
[0105] Each KU095 internally has a "weight receiving and decoding module" that writes the received sparse adjacency list data into the on-chip BRAM to form a weight storage structure that can be randomly accessed by the logic accumulation unit; at the same time, it writes into the "node mapping table" and "scheduling table" to indicate the set of 200 logical nodes that should be updated in each 100 ns time slice.
[0106] Each data frame includes the following fields:
[0107] Target chip ID (2 bytes): Indicates the target KU095 number (0~2);
[0108] Data type (2 bytes): Identifies the data categories of "weight matrix, initial state, scheduling table, and state return".
[0109] Data length (4 bytes): Indicates the length of the data portion (in bytes), used to carry a weight payload of 320,000 bytes;
[0110] Data content (variable length): Weight matrix: a 16-bit fixed-length record stream arranged in the order of "node-adjacency" (11-bit node sequence number plus 5-bit weight connection), and includes node offset and index information;
[0111] Initial state: Initial values for 2000 nodes (1 bit / node), totaling 2000 bits (250 bytes), distributed according to chip fragments;
[0112] Scheduling table: Describes the loading order and time slice number of logical nodes under 3-time reuse;
[0113] CRC checksum (4 bytes): used for transmission integrity verification; End of frame flag (1 byte).
[0114] The receiving FPGA triggers a retransmission mechanism for data frames that fail CRC check. If the check fails, the FPGA sends a retransmission request to the sending FPGA through the fiber optic channel to ensure that the weight and status data are reliably loaded.
[0115] After the system enters the iterative state, the logic accumulation units and activation function units inside each FPGA begin parallel computation. The core process is as follows:
[0116] At the beginning of each 100 ns time slice, the scheduling module selects the virtual node group corresponding to the current time slice according to the scheduling table and sends its adjacency data from BRAM to the input of the logic accumulation unit.
[0117] The logic accumulation unit of each logical node $i$ accumulates its 200 adjacency entries in parallel to calculate the local field $h_i$:
[0118] $$h_i = \sum_{j \in N(i)} w_{ij} s_j$$
[0119] where $s_j$ is the current state of neighboring node $j$ of logical node $i$ (mapped from the 0/1 encoding); $w_{ij}$ is the 5-bit weight; $N(i)$ is the set of neighboring nodes of the current logical node $i$; the summation bit width is configured to 13 bits according to the worst case to avoid overflow; the noise injection unit introduces a controllable perturbation as a bias term to simulate thermal noise annealing. The result is used to determine the current state of the logical node: the activation function unit updates the state according to the configured nonlinear function, and the update result is written to the state latch; 200 neurons complete the update within this time slice; the KU095 completes the update of 600 logical nodes in 3 consecutive time slices (one iteration step is 300 ns).
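The local-field accumulation and state update described above can be modeled in software as follows. This is a sketch under stated assumptions: the 0/1 state is mapped to a -1/+1 spin, the nonlinear function is taken to be a simple sign threshold, and for Max-Cut a node moves to the side opposite the weighted majority of its neighbors. The text leaves the activation function configurable, so the function names and the update rule here are illustrative only.

```python
def spin(bit):
    """Map a 0/1 state to a -1/+1 spin (assumed mapping)."""
    return 2 * bit - 1


def local_field(neighbors, state):
    """Sum weight * spin(state[j]) over the (j, weight) entries of one node's adjacency list."""
    h = sum(w * spin(state[j]) for j, w in neighbors)
    # 200 entries of magnitude at most 16 give |h| <= 3200, inside a 13-bit signed range.
    assert -4096 <= h <= 4095, "exceeds the 13-bit signed accumulator"
    return h


def update_state(neighbors, state, current, noise=0):
    """Assumed threshold rule for Max-Cut; noise models the injected perturbation."""
    h = local_field(neighbors, state) + noise
    if h > 0:
        return 0
    if h < 0:
        return 1
    return current  # no net pull: keep the latched state
```

The worst case of 200 neighbors each with weight magnitude 16 gives a sum of magnitude 3200, which is consistent with the 13-bit accumulator width stated in the text.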
[0120] After each KU095 completes a time slice or an iteration step, it writes the updated logical node state to the transmit buffer and transmits it back to Zynq via the QSFP28 uplink through the GTY's TX interface, and then back to the PC. The PC reads the global state, calculates the current cut value, and calculates the solution time according to the set sampling period for external evaluation; finally, it outputs the corresponding Max-Cut solution and its cut value, completing the solution to the problem.
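The cut value that the PC computes from the returned global state can be sketched as follows. The edge dictionary keyed by node pairs and the function name `cut_value` are assumptions made for the example.

```python
def cut_value(edges, state):
    """Sum the weights of edges whose two endpoints lie in different partitions.

    edges: dict mapping (i, j) node pairs to edge weights
    state: sequence of 0/1 partition labels indexed by node number
    """
    return sum(w for (i, j), w in edges.items() if state[i] != state[j])
```

Evaluating `cut_value` on each sampled state lets the host track the best cut seen so far and record the time at which it was reached.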
[0121] Combinatorial optimization problems (COPs) are widespread across numerous fields, but traditional algorithms often experience rapidly increasing solution time as the scale grows. To address this, this invention proposes a multi-chip interconnected FPGA-based combinatorial optimization solution system and method for large-scale combinatorial optimization. The system consists of a host computer, a main control FPGA, and several parallel computing FPGAs. The host computer is responsible for generating and managing the target problem weight matrix and can perform block encoding according to the problem size. The main control FPGA is responsible for task allocation, scheduling, data frame organization, and result aggregation, and distributes subtasks to each computing FPGA through high-speed inter-chip interconnects to achieve system-level scalability. The overall hardware of this system adopts a layered architecture of PC-PS-PL, facilitating expansion and adaptation to different Ising machine cores and problems of different sizes.
[0122] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0123] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0124] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention, and no reference numerals in the claims should be construed as limiting the scope of the claims.
[0125] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.
Claims
1. A hardware solution system for the continuous-time Ising model based on multi-chip interconnection, characterized in that it comprises: a host computer, used to perform cluster analysis and graph partitioning on the global weight matrix of the Ising model, and to generate neuron mapping relationships and scheduling tables; multiple chips, each with multiple logic accumulation units and activation function units, used to achieve continuous-time updates of neuron states through combinational logic; a data transceiver module, located in each chip, used to receive and store, from the host computer, the sub-weight matrices in the global weight matrix of the neurons related to the current chip, and to distribute the sub-weight matrix data to the corresponding logic accumulation unit and activation function unit according to the neuron mapping relationship and scheduling table; a system control module, used to control the loading of sub-weight matrix data and the iterative calculation of each logic accumulation unit and activation function unit according to the scheduling table; an output interconnect structure, including an inter-chip physical interconnect network based on interconnect signals and an on-chip delay interconnect network based on combinational logic, used to realize the state transmission of neurons between chips and within the chip; and an external communication module, used to transmit sub-weight matrix data, neuron states, and control commands between the host computer and each chip.
2. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that, The host computer is used to perform cluster analysis and graph partitioning on the global weight matrix of the Ising model, generating neuron mapping relationships and scheduling tables, specifically including: The host computer maintains the global weight matrix of the target Ising model, performs cluster analysis on the global weight matrix to identify the number of connections in each sub-weight matrix, and divides the sub-weight matrices of the global weight matrix into high-density and low-density sub-weight matrices based on the number of connections in each sub-weight matrix and a set threshold. Neurons are then renumbered, and neurons belonging to the same high-density sub-weight matrix are mapped to the same chip. Neurons connecting different high-density sub-weight matrices are defined as boundary nodes and assigned to different chips, thus obtaining a graph partitioning result for multiple chips and establishing the mapping relationship between neurons in the actual problem-solving process and the hardware solution system. Based on the upper limit of available computing resources for each chip, the set of neurons assigned to the same chip is further divided into multiple virtual node groups, generating a corresponding scheduling table. According to the neuron partitioning strategy, the sub-weight matrix data obtained after partitioning the global weight matrix, along with the corresponding neuron numbers and target chip identifiers, are encapsulated into data frames conforming to the communication protocol and distributed to the corresponding chips through the communication link.
3. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that, Each activation function unit on the chip corresponds to one neuron.
4. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that, The logic accumulation unit includes a value generation unit, an arithmetic unit, and a noise injection unit; the activation function unit includes a parameter configuration unit, a nonlinear activation unit, and an output control unit. The numerical generation unit receives the current output state of the connected neurons from the local chip or other chips, performs logical operations with the weights between the neurons, and obtains the numerical magnitude of the influence of other neurons on the neuron based on the result of the logical operation. The computing unit adopts a parallel accumulation architecture, including a data allocation unit, a parallel accumulation array, and an output generation unit; the data allocation unit identifies the input data and allocates the data to the parallel accumulation array; The parallel accumulation array performs an accumulation operation on the data; the output generation unit is used to receive the accumulation operation result and the noise injection value and obtain the final output signal. The noise injection unit is located at the output of the arithmetic unit. Based on the noise intensity parameters stored in the corresponding noise intensity register given by the system control module, it generates random or controllable disturbance values. The disturbance values are added to the output signal of the arithmetic unit to obtain the total output including the noise term. The parameter configuration unit receives external parameters and compares the total output with the threshold set by the activation function register. 
The nonlinear activation unit performs variable nonlinear activation processing based on different parameters, thereby dynamically adjusting the output of the activation function unit by adjusting the parameter size; The output control unit receives control commands from the system control module to latch and output the neuron state of the activation function unit, and outputs the neuron state to all logic accumulation units.
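The signal path of claim 4 (value generation, parallel accumulation, noise injection, activation) can be sketched in Python as follows. This is a minimal sketch under assumptions that the claims do not state: neuron states take values in {-1, +1}, the "logical operation" is modelled as conditional negation of the weight, the noise is uniform, and the activation is a hard threshold, which is only the simplest case of the variable nonlinear activation. All function names and the `lanes` parameter are hypothetical.

```python
import random


def value_generation(states, weights):
    # With states in {-1, +1}, the influence of a neighbour is its weight,
    # negated when the neighbour's state is -1 (a simple logic operation in hardware).
    return [w if s > 0 else -w for s, w in zip(states, weights)]


def parallel_accumulate(values, lanes=4):
    # Data allocation: deal the values across `lanes` partial accumulators, then reduce.
    partial_sums = [sum(values[lane::lanes]) for lane in range(lanes)]
    return sum(partial_sums)


def inject_noise(total, noise_intensity, rng):
    # Add a disturbance scaled by the noise intensity register value (uniform noise assumed).
    return total + noise_intensity * rng.uniform(-1.0, 1.0)


def activate(total, threshold=0.0):
    # Hard-threshold activation: compare the noisy total with the configured threshold.
    return 1 if total >= threshold else -1


def neuron_update(states, weights, noise_intensity=0.0, threshold=0.0, rng=None):
    rng = rng or random.Random()
    influences = value_generation(states, weights)
    total = inject_noise(parallel_accumulate(influences), noise_intensity, rng)
    return activate(total, threshold)
```

Setting `noise_intensity` to zero makes the update deterministic, which is convenient for checking the accumulation path in isolation.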
5. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that the data transceiver module, located in each chip, is used to receive from the host computer and store the sub-weight matrices of the global weight matrix that relate to neurons on the current chip, and to distribute the sub-weight matrix data to the corresponding logic accumulation unit and activation function unit according to the neuron mapping relationship and scheduling table, specifically including: The data transceiver module receives, filters, maps, and loads the global weight matrix that the host computer has segmented and sends in blocks. The data transceiver module has a mapping relationship storage unit, which is used to receive and store the global neuron number, the neuron number on the current chip, and the identifier of the virtual node group to which the neuron belongs. During the initialization phase, under the control of the system control module, the data transceiver module receives and parses the data frames segmented and sent by the host computer and, based on the target chip identifier and neuron number carried in each data frame, determines whether the weights of the neuron belong to the current chip. Sub-weight matrix data belonging to the current chip is redirected to the corresponding neuron according to the neuron mapping relationship and stored in the weight matrix register, organized by virtual node group. Sub-weight matrix data that does not belong to the current chip is neither received nor stored. Each chip thus receives and stores only the weights of its own neurons and the sub-weight matrices corresponding to those neurons, thereby realizing distributed storage of the global connection matrix.
During the solution execution phase, the data transceiver module, under the control of the system control module, inputs the weights corresponding to the neurons updated in different time slices into the weight matrix register of the logic accumulation unit according to the scheduling table, thereby solving large-scale problems.
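The receive, filter, map, and load behaviour of claim 5 can be sketched in Python as follows. The class names `WeightFrame` and `DataTransceiver`, the frame fields, and the dictionary-based weight store are hypothetical stand-ins chosen for illustration; the claims do not define a frame layout or storage structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightFrame:
    # Hypothetical decoded data frame sent by the host computer.
    target_chip: int
    global_neuron: int
    weights: tuple


class DataTransceiver:
    def __init__(self, chip_id, mapping):
        # mapping: global neuron number -> (neuron number on this chip, virtual node group id)
        self.chip_id = chip_id
        self.mapping = mapping
        self.weight_store = {}  # group id -> {local neuron number -> weights}

    def receive(self, frame):
        """Store the frame if it belongs to this chip; otherwise discard it."""
        if frame.target_chip != self.chip_id or frame.global_neuron not in self.mapping:
            return False
        local, group = self.mapping[frame.global_neuron]
        self.weight_store.setdefault(group, {})[local] = frame.weights
        return True

    def weights_for_group(self, group):
        # During execution, the weights of the scheduled virtual node group are loaded.
        return self.weight_store.get(group, {})
```

A chip built this way holds only the rows of the global weight matrix that concern its own neurons, grouped by virtual node group so that one group can be loaded per time slice.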
6. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that the system control module is used to control the loading of sub-weight matrix data and the iterative calculation of each logic accumulation unit and activation function unit according to the scheduling table, specifically including: The system control module includes an internal loop control module and a distributed collaboration module. The internal loop control module uses a multi-state controller to switch each logic accumulation unit and activation function unit in the chip between different working modes, so that the logic accumulation units and activation function units support parameter reconfiguration and debugging while maintaining continuous-time dynamic evolution. The distributed collaboration module synchronously rotates weights and state data among multiple chips based on the scheduling table, breaking the combinatorial optimization problem down into subtasks computed in a time-shared manner on limited hardware resources.
7. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 6, characterized in that the internal loop control module uses a multi-state controller to switch each logic accumulation unit and activation function unit in the chip between different working modes, so that the logic accumulation units and activation function units support parameter reconfiguration and debugging while maintaining continuous-time dynamic evolution, specifically including: The working modes include an initialization mode and an iterative calculation mode. In the initialization mode, the data transceiver module is controlled to receive the neuron mapping relationship and global weight matrix data sent by the host computer, complete the storage of the mapping relationship, and load the sub-weight matrices of the global weight matrix, the initial states of the neurons, and the activation function unit parameters. The iterative calculation mode drives each logic accumulation unit and activation function unit to perform continuous-time state evolution according to the predetermined schedule and activation function unit parameter settings. During this process, the outputs of the logic accumulation units and activation function units are sampled at a set period, and the sampling results are output to the host computer through the external communication module.
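The two working modes of claim 7 and the periodic sampling can be sketched in Python as a small state machine. This is an illustrative sketch only: the class `InternalLoopController`, the item names checked before leaving initialization, and the tick-based notion of a sampling period are assumptions, and the continuous-time evolution itself is not modelled.

```python
from enum import Enum, auto


class Mode(Enum):
    INITIALIZATION = auto()
    ITERATIVE_CALCULATION = auto()


class InternalLoopController:
    # Items assumed to be loaded during initialization (names are hypothetical).
    REQUIRED = {"mapping", "weights", "initial_states", "activation_params"}

    def __init__(self, sample_period):
        self.mode = Mode.INITIALIZATION
        self.sample_period = sample_period
        self.loaded = {}
        self.samples = []   # sampled states that would be sent to the host computer
        self.ticks = 0

    def load(self, name, value):
        self.loaded[name] = value

    def start_iteration(self):
        missing = self.REQUIRED - set(self.loaded)
        if missing:
            raise RuntimeError(f"initialization incomplete: {sorted(missing)}")
        self.mode = Mode.ITERATIVE_CALCULATION
        self.ticks = 0

    def tick(self, neuron_states):
        # Sample the neuron outputs once every `sample_period` ticks while iterating.
        if self.mode is not Mode.ITERATIVE_CALCULATION:
            return
        self.ticks += 1
        if self.ticks % self.sample_period == 0:
            self.samples.append(tuple(neuron_states))
```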
8. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 7, characterized in that the distributed collaboration module synchronously rotates weights and state data among multiple chips based on the scheduling table, breaking the combinatorial optimization problem down into subtasks computed in a time-shared manner on limited hardware resources, specifically including: The M neurons in the combinatorial optimization problem are divided into N groups. The number of neurons in each group is equal to the number of logic accumulation units and activation function units in the hardware solution system. At the beginning of the k-th time slice, the system control module first sets the logic accumulation units and activation function units on the current chip to their initial states, and prohibits changes in the output control units in the logic accumulation units. The system control module reads the sub-weight matrix and activation function unit parameters corresponding to the current virtual node group according to the scheduling table transmitted by the host computer, and writes the weight matrix data and noise intensity parameters into the weight matrix register and noise intensity register of the corresponding logic accumulation unit and activation function unit through the internal bus. At the same time, it reads the final state of each neuron in the previous evolution cycle from the memory holding the virtual node group, and preloads that final state into the corresponding logic accumulation unit register as the initial state of the current evolution cycle. Then the activation function units are enabled, the output control units are allowed to change, and the neurons begin to iterate.
After a specified evolution time, the weight matrix register, noise intensity register, logic accumulation unit register, and activation function unit parameters are updated according to the virtual node group and the scheduling table, and then the next evolution cycle begins, until the neuron state no longer changes or the solution time is reached.
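The time-sliced rotation of claim 8 can be sketched in Python as follows. This is a software stand-in under stated assumptions, not the hardware behaviour: the continuous-time evolution window is replaced by a few sequential sign updates, states are taken in {-1, +1}, a neuron is assumed to align with the sign of its weighted input, and noise is omitted. The names `evolve_group`, `time_sliced_solve`, `steps_per_slice`, and `max_cycles` are hypothetical.

```python
def evolve_group(group, states, weights, steps):
    # Stand-in for one evolution window: only neurons of the active group are updated,
    # while the stored states of all other neurons are held fixed.
    for _ in range(steps):
        for i in group:
            field = sum(weights[i][j] * states[j] for j in range(len(states)) if j != i)
            if field != 0:
                states[i] = 1 if field > 0 else -1


def time_sliced_solve(weights, groups, initial_states, steps_per_slice=4, max_cycles=50):
    # `states` plays the role of the memory holding each neuron's last final state.
    states = list(initial_states)
    for cycle in range(max_cycles):
        before = list(states)
        for group in groups:  # one time slice per virtual node group
            evolve_group(group, states, weights, steps_per_slice)
        if states == before:  # no neuron changed over a full rotation
            return states, cycle + 1
    return states, max_cycles  # solution time limit reached


# Usage: four ferromagnetically coupled neurons, evolved two at a time.
W = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
final_states, cycles = time_sliced_solve(W, [[0, 1], [2, 3]], [1, 1, 1, -1])
```

The loop ends either when a full rotation over all groups leaves every state unchanged or when the cycle limit is reached, mirroring the two stopping conditions in the claim.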
9. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that the output interconnect structure includes an inter-chip physical interconnect network based on interconnect signals and an intra-chip delay interconnect network based on combinational logic, used to realize the state transmission of neurons between chips and within a chip, specifically including: Inter-chip physical interconnect network based on interconnect signals: The inter-chip physical interconnect network establishes hardware connection paths between multiple chips through multiple sets of transmission lines. One end of each set of transmission lines is connected to an output pin of the transmitting chip, and the other end is connected to an input pin of the receiving chip. The neuron state produced by the logic accumulation unit of one chip and activated by its activation function unit is transmitted directly to the other chips over the physical interconnection link via the output buffer. In the receiving chip, the input buffer converts it into an internal signal, which then passes through a programmable input delay unit that configures the cross-chip propagation time and connects directly to the logic accumulation unit of that chip, thereby simulating the propagation time constants of different paths on the digital platform. Intra-chip delay interconnect network based on combinational logic: The states of different neurons located inside the same chip are accumulated and then activated by the activation function units, and the resulting output values are transmitted directly to the other neurons on the same chip via interconnects.
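The programmable input delay unit of claim 9 can be modelled in software as a fixed-length delay line. The sketch below is an assumption-laden illustration: the class name `InputDelayUnit`, the tick-based delay, and the idle value emitted before the first state arrives are all hypothetical, and real hardware would realize the delay with configurable circuit elements rather than a queue.

```python
from collections import deque


class InputDelayUnit:
    """Delay line: a state pushed at tick t is emitted at tick t + delay_ticks."""

    def __init__(self, delay_ticks, idle_state=0):
        # Pre-fill with idle values so the first outputs are defined.
        self.line = deque([idle_state] * delay_ticks)

    def push(self, state):
        if not self.line:          # zero delay: pass the state straight through
            return state
        self.line.append(state)
        return self.line.popleft()


# Usage: a cross-chip path configured with a two-tick propagation delay.
unit = InputDelayUnit(delay_ticks=2)
outputs = [unit.push(s) for s in [1, -1, 1, 1]]
```

Giving each cross-chip path its own `delay_ticks` value is one way to represent different propagation time constants on a digital platform.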
10. The hardware solution system for the continuous-time Ising model based on multi-chip interconnection according to claim 1, characterized in that the external communication module includes a physical layer interface unit, an on-chip data transceiver synchronization unit, a protocol parsing and encapsulation unit, and a data buffering and distribution unit. In the downlink direction, the host computer sends a data stream through the external communication module. After conversion by the physical layer interface unit, the data stream is input to the data transceiver synchronization unit of the current chip and, after clock data recovery and serial-to-parallel conversion, is output to the protocol parsing and encapsulation unit for instruction parsing. In the uplink direction, when triggered by the system control module, the chip writes the sampling results of the activation function units to the data transceiver synchronization unit after protocol conversion by the data buffering and distribution unit, and the results are then sent to the host computer.
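The protocol parsing and encapsulation step of claim 10 can be illustrated with a toy frame format in Python. The layout below (sync word, target chip, neuron number, weight count, 16-bit weights, one-byte checksum) is entirely hypothetical and chosen only to make the idea concrete; the claims do not specify any frame structure, field widths, or error check.

```python
import struct

SYNC = 0xA55A                      # hypothetical sync word
HEADER = struct.Struct("<HBHH")    # sync, target chip, neuron number, weight count


def encapsulate(chip, neuron, weights):
    # Header, then signed 16-bit weights, then a one-byte additive checksum.
    body = HEADER.pack(SYNC, chip, neuron, len(weights))
    body += struct.pack(f"<{len(weights)}h", *weights)
    return body + bytes([sum(body) & 0xFF])


def parse(frame):
    body, checksum = frame[:-1], frame[-1]
    if sum(body) & 0xFF != checksum:
        raise ValueError("checksum mismatch")
    sync, chip, neuron, count = HEADER.unpack_from(body)
    if sync != SYNC:
        raise ValueError("bad sync word")
    weights = struct.unpack_from(f"<{count}h", body, HEADER.size)
    return chip, neuron, list(weights)
```

The same pair of functions would serve both directions in this toy model: the host encapsulates weight frames for the downlink, and a chip could encapsulate sampled states for the uplink.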