Computing chip, data processing method, device, apparatus, and storage medium

By integrating computing modules and memory modules into a computing chip and introducing cluster-level shared memory and a hierarchical communication architecture, the problems of data transmission latency and network congestion caused by the separation of computing modules and memory modules are solved, realizing a computing chip design with high-efficiency data transmission and low power consumption.

CN122019466B · Active · Publication Date: 2026-06-26 · SUZHOU YIZHU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SUZHOU YIZHU INTELLIGENT TECH CO LTD
Filing Date
2026-04-07
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In high-performance computing chips, the physical separation of computing modules and memory modules leads to problems such as increased data transmission latency, on-chip network congestion, and excessive hardware costs and power consumption.

Method used

By physically integrating the computing module and the memory module, and introducing a cluster-level shared memory and hierarchical communication architecture, the data access path is optimized through cluster shared memory, reducing cross-storage access distance and latency, and simplifying on-chip network routing design and flow control.

Benefits of technology

It significantly improves the data transmission efficiency of the computing module, reduces the access traffic and power consumption of the on-chip network, reduces the chip area and hardware cost, and yields a computing chip architecture that is both computationally efficient and low-power.

✦ Generated by Eureka AI based on patent content.

[Figure CN122019466B_ABST]

Abstract

Embodiments of the present application provide a computing chip, a data processing method, an apparatus, a device, and a storage medium, and relate to the technical field of chips. The computing chip comprises at least one computing cluster, and the computing cluster comprises: a plurality of in-memory computing units, each in-memory computing unit comprising a computing module and a memory module physically integrated with the computing module; a cluster center; a cluster shared memory, which is communicatively connected to the cluster center and, through the cluster center, to each computing module in the plurality of in-memory computing units; and a network on chip, the cluster center of each computing cluster being connected to the network on chip. Efficient access to local data is achieved by the physical integration of memory and computing, cross-storage access paths are optimized by the cluster-level shared memory, the interaction logic of the network on chip is simplified by a hierarchical communication architecture, and the data transmission efficiency of the computing module is improved. By reducing access requests to the network on chip and simplifying network design, the access traffic and power consumption of the network on chip are reduced, and the chip area and hardware cost are reduced.

Description

Technical Field

[0001] This application relates to the field of chip technology, and in particular to computing chips, data processing methods, apparatus, devices, and storage media.

Background Technology

[0002] In high-performance computing chip architectures, the compute module (CM) and the adjacent memory module (MM) are typically physically separated. In this decoupled architecture, if a compute module needs data that is not in its own memory module but in the memory module of another compute unit, it must retrieve the data through the network-on-chip (NOC) within the chip. If the data required by the compute unit is not stored locally, any memory access across memory modules must traverse the high-speed interconnect network within the chip. This increases data transmission latency and causes a large amount of unnecessary NOC traffic and network congestion. Furthermore, the NOC needs to handle the massive concurrent random access requests from all compute modules, complicating its routing design and flow control mechanisms, and significantly increasing the chip's area, cost, and power consumption.

Summary of the Invention

[0003] The main objective of this application is to propose a computing chip, a data processing method, an apparatus, a device, and a storage medium to improve the data transmission efficiency of the computing module and reduce the access traffic and power consumption of the on-chip network.

[0004] To achieve the above objectives, a first aspect of this application provides a computing chip, comprising:

[0005] At least one computing cluster;

[0006] The computing cluster includes:

[0007] Multiple in-memory computing units, each of the in-memory computing units including a computing module and a memory module physically integrated with the computing module;

[0008] Cluster center;

[0009] Cluster shared memory, communicatively connected to the cluster center and, through the cluster center, to each computing module in the plurality of in-memory computing units;

[0010] The on-chip network is connected to the cluster center of each computing cluster.

[0011] In some embodiments, the computing module and the memory module are vertically integrated using a three-dimensional stacking technique, and the memory module is configured as the private memory of the computing module.

[0012] In some embodiments, the cluster shared memory of the computing cluster is on-chip memory, has a physical address space independent of the private cache hierarchy of the computing modules, and is configured to be directly accessible by all the computing modules within the computing cluster.

[0013] In some embodiments, the cluster shared memory of the computing cluster is connected to the on-chip network via the cluster center.

[0014] In some embodiments, the on-chip network includes a switch corresponding to each of the computing clusters, and the cluster center is connected to the on-chip network via the switch.

[0015] In some embodiments, the computing chip includes at least two wafers, each wafer including at least one of the computing clusters.

[0016] In some embodiments, each chip has a chip center, which is communicatively connected to all the cluster centers within the corresponding chip. The chip center is configured to perform address routing for cross-cluster access requests and uniformly access the on-chip network.

[0017] In some embodiments, the cluster center is configured with an address discrimination circuit, which, in response to an access request initiated by the computing module, directs the access request to the cluster shared memory or forwards it to the on-chip network based on the target address.

[0018] In some embodiments, the address discrimination circuit is specifically configured to: if the target address of the access request initiated by the computing module is located in the cluster shared memory of the computing cluster, the access request is directed to the cluster shared memory; if the target address is located outside the computing cluster, the access request is forwarded to the on-chip network for routing transmission.
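
The address discrimination described in [0017] and [0018] can be sketched as a simple range check. This is only an illustrative sketch, not the circuit of this application: the names `Route`, `CSM_BASE`, `CSM_SIZE`, and `discriminate`, the address values, and the assumption that the cluster shared memory occupies one contiguous address window are all hypothetical.

```python
from enum import Enum


class Route(Enum):
    CLUSTER_SHARED_MEMORY = "cluster_shared_memory"
    ON_CHIP_NETWORK = "on_chip_network"


# Hypothetical address window of this cluster's shared memory (assumed values).
CSM_BASE = 0x1000_0000
CSM_SIZE = 0x0010_0000


def discriminate(target_address: int) -> Route:
    """Route a request that has reached the cluster center by its target address.

    Private memory accesses never reach the cluster center, so only two
    outcomes are modeled here: the local cluster shared memory, or the
    on-chip network for targets outside the computing cluster.
    """
    if CSM_BASE <= target_address < CSM_BASE + CSM_SIZE:
        return Route.CLUSTER_SHARED_MEMORY
    return Route.ON_CHIP_NETWORK
```

Under these assumptions, a request to `CSM_BASE + 0x40` stays inside the cluster, while a request to any address beyond the window is forwarded to the on-chip network.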

[0019] In some embodiments, the computing chip has a three-level data access path, including:

[0020] A private access path, over which the computing module accesses its corresponding memory module;

[0021] A cluster-shared path, over which the computing module accesses the cluster shared memory through the cluster center;

[0022] A global access path, over which the computing module reaches resources outside the computing cluster through the cluster center and the on-chip network.
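
The three-level data access path listed above can be modeled as a lookup from access level to the components a request traverses. This is a minimal sketch under assumptions: the names `Level`, `components_traversed`, and `uses_noc` are hypothetical, and the component lists only restate what the paragraphs above say about which path passes through the cluster center and the on-chip network.

```python
from enum import Enum


class Level(Enum):
    PRIVATE = 1         # computing module -> its own memory module
    CLUSTER_SHARED = 2  # computing module -> cluster shared memory
    GLOBAL = 3          # computing module -> resources outside the cluster


def components_traversed(level: Level) -> list[str]:
    """Return the components an access request passes through at each level."""
    if level is Level.PRIVATE:
        # Bypasses both the cluster center and the on-chip network.
        return ["memory_module"]
    if level is Level.CLUSTER_SHARED:
        return ["cluster_center", "cluster_shared_memory"]
    return ["cluster_center", "on_chip_network"]


def uses_noc(level: Level) -> bool:
    """True only for accesses that consume on-chip network bandwidth."""
    return "on_chip_network" in components_traversed(level)


# Only one of the three levels touches the on-chip network.
noc_usage = {level.name: uses_noc(level) for level in Level}
```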

[0023] In some embodiments, the memory module is 3D-DRAM, and the cluster shared memory is SRAM.

[0024] In some embodiments, the path for the computing module to access its corresponding memory module does not pass through the cluster center and the on-chip network.

[0025] In some embodiments, the cluster shared memory is configured to store data shared by multiple computing modules within the computing cluster, and the computing modules can access the cluster shared memory without going through the on-chip network.

[0026] To achieve the above objectives, a second aspect of this application provides a data processing method applied to a computing chip as described in any of the first aspects, comprising:

[0027] Within each time step, the allocated local data blocks are processed in parallel in each of the computing modules, and the resulting calculations are stored in the corresponding memory modules.

[0028] Within the same computing cluster, the computation results corresponding to the computing modules are transferred to the cluster shared memory to obtain the corresponding cluster aggregated data;

[0029] Global data is obtained by synchronizing the aggregated cluster data in the shared memory of the clusters through the on-chip network via each cluster center.

[0030] The global data is stored in the shared memory of each of the clusters, so that each computing module in the computing cluster can directly access the global data.
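
The four steps of the method in [0027] to [0030] can be sketched in software as follows. This is a behavioral illustration only, assuming plain Python lists stand in for memory modules and cluster shared memories; `process_block` is a hypothetical placeholder for the real per-module computation, and `run_time_step` is a hypothetical name.

```python
def process_block(block):
    """Hypothetical stand-in for the computation a computing module performs."""
    return [value * 2 for value in block]


def run_time_step(clusters):
    """clusters[c][m] is the local data block assigned to module m of cluster c.

    Returns the content of every cluster shared memory after the time step.
    """
    # Step 1: each computing module processes its local block and stores the
    # result in its own memory module.
    memory_modules = [[process_block(block) for block in cluster] for cluster in clusters]

    # Step 2: results within one cluster are moved to that cluster's shared
    # memory, giving the cluster aggregated data.
    cluster_aggregated = [list(results) for results in memory_modules]

    # Step 3: cluster centers synchronize the aggregated data over the
    # on-chip network, giving the global data.
    global_data = [block for aggregated in cluster_aggregated for block in aggregated]

    # Step 4: the global data is stored in every cluster shared memory so
    # that each computing module can read it without leaving its cluster.
    return [list(global_data) for _ in clusters]


shared_memories = run_time_step([[[1], [2]], [[3], [4]]])
```

With two clusters of two modules each, both cluster shared memories end up holding the same four processed blocks.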

[0031] In some embodiments, the cluster shared memory includes a shared buffer for storing the cluster aggregate data and the global data.

[0032] In some embodiments, the method is applied to image or video generation tasks based on a diffusion model, wherein the local data blocks are latent representation patches of the image or video in a latent space.

[0033] In some embodiments, the calculation result is a denoised latent representation patch; the cluster aggregated data is a latent frame composed of multiple denoised latent representation patches.

[0034] In some embodiments, depending on the generation stage, the latent frame includes at least one of a global latent frame, a window latent frame, and a current latent frame, wherein the global latent frame contains the complete latent representation of the current target generated object; the window latent frame consists of latent frames from a preset number of time steps prior to the current time step; and the current latent frame is the latent representation to be processed at the current time step.
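
The three kinds of latent frame can be pictured as fields of a shared buffer in which the window is a bounded history of earlier frames. The sketch below is an assumption-laden illustration: the class name `SharedLatentFrameBuffer`, its attributes, and the `advance` method are hypothetical, and a fixed-length `collections.deque` is only one possible way to keep "a preset number" of previous time steps.

```python
from collections import deque


class SharedLatentFrameBuffer:
    """Toy model of a shared buffer holding global, window, and current latent frames."""

    def __init__(self, window_size: int):
        self.global_frame = None                 # complete latent representation, set by the caller
        self.window = deque(maxlen=window_size)  # frames from the preceding time steps
        self.current_frame = None                # frame to be processed at this time step

    def advance(self, new_frame):
        """Move to the next time step, pushing the old current frame into the window."""
        if self.current_frame is not None:
            self.window.append(self.current_frame)  # oldest frame drops out when full
        self.current_frame = new_frame


buffer = SharedLatentFrameBuffer(window_size=2)
for frame in ["f0", "f1", "f2", "f3"]:
    buffer.advance(frame)
```

After the four calls, the current frame is `"f3"` and the window holds the two frames immediately before it.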

[0035] In some embodiments, the step of synchronizing the aggregated cluster data in the shared memory of the clusters through each of the cluster centers on the on-chip network to obtain global data includes:

[0036] The cluster aggregated data in the cluster shared memory is sent to the corresponding cluster center;

[0037] The on-chip network performs an all-gather operation among the cluster centers, so that each cluster center obtains the cluster aggregated data from the other computing clusters.

[0038] In some embodiments, the all-gather operation employs a ring-based communication algorithm, using each of the cluster centers participating in the synchronization as communication nodes, and performing a preset number of cyclic exchanges, so that each of the cluster centers obtains the complete global data.
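
A ring-based exchange among cluster centers can be simulated as below. This is a sketch of the textbook ring all-gather, not necessarily the exact algorithm of this application: the function name `ring_all_gather` is hypothetical, and the number of exchange rounds is assumed to be N-1 for N cluster centers, which is the usual count for this algorithm.

```python
def ring_all_gather(chunks):
    """Simulate a ring all-gather; chunks[i] is the data owned by node i.

    Returns one buffer per node; after the exchange every buffer holds all chunks.
    """
    n = len(chunks)
    # buffers[i][j] is chunk j as seen by node i (None until received).
    buffers = [[None] * n for _ in range(n)]
    for i in range(n):
        buffers[i][i] = chunks[i]

    for step in range(n - 1):
        # Each node forwards the chunk it obtained most recently:
        # its own chunk in round 0, then whatever arrived in the previous round.
        outgoing = []
        for i in range(n):
            index = (i - step) % n
            outgoing.append((index, buffers[i][index]))
        # Deliver all messages at once to the right-hand neighbor.
        for i, (index, payload) in enumerate(outgoing):
            buffers[(i + 1) % n][index] = payload
    return buffers


result = ring_all_gather(["A", "B", "C", "D"])
```

Each node sends exactly one chunk to one neighbor per round, so no node has to serve requests from all other nodes at the same time.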

[0039] In some embodiments, when the computing chip is the computing chip with chip centers described in the first aspect, the step of synchronizing the cluster aggregated data through each of the cluster centers on the on-chip network to obtain global data includes:

[0040] Each cluster center sends the obtained cluster aggregated data to the chip center of the chip in which it is located, and the chip center performs address routing for cross-cluster access requests.

[0041] By uniformly accessing the on-chip network through each of the chip centers, the cluster aggregated data collected by each chip center is sent to other chips through the on-chip network for data synchronization, so that the shared memory of each cluster contains all the cluster aggregated data to form the global data.

[0042] To achieve the above objectives, a third aspect of this application provides a data processing apparatus applied to a computing chip as described in any of the first aspects, comprising:

[0043] A local computing module is used to process the allocated local data blocks in parallel within each computing module at each time step, and store the obtained computing results in the corresponding memory module.

[0044] The cluster aggregation module is used to transfer the calculation results corresponding to the calculation module to the cluster shared memory to obtain the corresponding cluster aggregated data within the same computing cluster.

[0045] The on-chip synchronization module is used to synchronize the aggregated cluster data in the shared memory of the clusters in the on-chip network through each of the cluster centers to obtain global data.

[0046] A global data access module is used to store the global data in the shared memory of each of the clusters, so that each computing module in the computing cluster can directly access the global data.

[0047] To achieve the above objectives, a fourth aspect of the present application provides an electronic device, the electronic device including a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method described in the second aspect above.

[0048] To achieve the above objectives, a fifth aspect of the present application provides a storage medium storing a computer program that, when executed by a processor, implements the method described in the second aspect.

[0049] The computing chip, data processing method, apparatus, device, and storage medium proposed in this application embodiment include at least one computing cluster. Each computing cluster comprises: multiple in-memory computing units, each including a computing module and a memory module physically integrated with the computing module; a cluster center; cluster shared memory, communicatively connected to the cluster center and to each computing module in the multiple in-memory computing units via the cluster center; and an on-chip network, with the cluster center of each computing cluster connected to the on-chip network. This application embodiment first reduces the energy consumption of local data access through the physical integration of computing modules and memory modules. Then, by utilizing the cluster-level cluster shared memory design, it constructs a two-tier storage system combining local storage and cluster shared storage to optimize the data access path. When an in-memory computing unit needs to interact with other in-memory computing units within the cluster, it does not need to obtain data across the cluster via the global on-chip network; it can directly retrieve data from the cluster shared memory within the cluster. This transforms the original global on-chip network transmission into direct intra-cluster transmission, significantly shortening the transmission distance and latency of cross-storage access and significantly improving the data transmission efficiency of the computing modules. Meanwhile, local access and shared memory access within the cluster do not consume bandwidth resources of the on-chip network. Data is only transmitted through the on-chip network when it exceeds the range of the shared memory, effectively reducing the number of concurrent access requests to the on-chip network and alleviating network congestion at its source. 
Furthermore, the on-chip network does not need to handle massive random access, significantly simplifying its routing design and flow control mechanisms, thereby reducing chip area and hardware costs. In addition, each cluster center is individually connected to the on-chip network, allowing the on-chip network to handle only core data interactions between clusters, reducing invalid data transmission and lowering network access traffic. Therefore, this embodiment achieves efficient access to local data through in-memory computing physical integration, optimizes cross-storage access paths through cluster-level shared memory, simplifies the interaction logic of the on-chip network through a layered communication architecture, and significantly improves the data transmission efficiency of the computing modules. Simultaneously, by reducing on-chip network access requests and simplifying network design, it achieves a dual reduction in on-chip network access traffic and power consumption, while also reducing chip area and hardware costs, ultimately forming a computing chip architecture that integrates high-efficiency computing, low-power design, and cost optimization.

Attached Figure Description

[0050] Figure 1 is a schematic diagram of a parallel computing structure in related technologies.

[0051] Figure 2 is a schematic diagram of the computing chip provided in an embodiment of this application.

[0052] Figure 3 is another structural schematic diagram of the computing chip provided in the embodiments of this application.

[0053] Figure 4 is another structural schematic diagram of the computing chip provided in the embodiments of this application.

[0054] Figure 5 is a flowchart of the data processing method provided in the embodiments of this application.

[0055] Figure 6 is a schematic diagram of the parallel processing of allocated local data blocks in each computing module, provided in an embodiment of this application.

[0056] Figure 7 is a data flow diagram of the diffusion model in related technologies.

[0057] Figure 8 is a temporal logic diagram of the diffusion model in related technologies.

[0058] Figure 9 is a schematic diagram illustrating the execution logic of the patch parallel strategy in related technologies.

[0059] Figure 10 is a flowchart, provided in an embodiment of this application, of synchronizing the cluster aggregated data in the cluster shared memory over the on-chip network through each cluster center to obtain global data.

[0060] Figure 11 is a schematic diagram summarizing the global data provided in the embodiments of this application.

[0061] Figure 12 is another schematic diagram, provided in the embodiments of this application, of synchronizing cluster aggregated data over the on-chip network through each cluster center to obtain global data.

[0062] Figure 13 is a schematic diagram of cross-chip global data construction provided in an embodiment of this application.

[0063] Figure 14 is a schematic diagram of continuous time step data processing provided in an embodiment of this application.

[0064] Figure 15 is a structural block diagram of a data processing device applied to a computing chip, provided in another embodiment of this application.

[0065] Figure 16 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0066] To clearly present the purpose, technical solution, and advantages of this application, the following will provide a more detailed description of this application in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only for explaining this application and are not intended to limit this application.

[0067] It should be noted that although functional modules are divided in the device schematic diagram and the logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart.

[0068] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0069] First, some of the terms used in this application are explained:

[0070] Diffusion Model (DM) is a generative AI model that generates new data by progressively adding noise to the data and then denoising it in reverse. It can learn the distribution of real data from random noise and is widely used in fields such as image and speech generation.

[0071] Latent Space Frame (LSF / LF): refers to the compressed representation of data in the latent space. It is a low-dimensional feature frame after encoding high-dimensional original data. It can significantly reduce computation and storage overhead and is the core carrier for feature transfer and processing in diffusion models.

[0072] Image patch: refers to a local sub-block after the complete image is divided. It is usually a pixel region of fixed size. It can decompose a large image into multiple small units for parallel processing, adapt to the distributed computing architecture of computing chips, and improve the computing efficiency of image-related tasks.
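
Splitting an image into fixed-size patches can be illustrated with nested lists. This is a minimal sketch: the function name `split_into_patches` is hypothetical, the image is assumed to be a list of equal-length rows, and the image dimensions are assumed to be exact multiples of the patch size.

```python
def split_into_patches(image, patch_h, patch_w):
    """Split a 2D image (list of rows) into patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")

    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Take patch_h rows and slice patch_w columns out of each one.
            patch = [row[left:left + patch_w] for row in image[top:top + patch_h]]
            patches.append(patch)
    return patches


# A 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2, 2)
```

Each resulting patch could then be assigned to a different computing module for parallel processing.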

[0073] Time Step (TS): In a diffusion model, the time step represents the iterative stage of noise addition or removal. Each time step corresponds to a different noise intensity. The model completes the generation task by learning the noise distribution at different time steps and is the core control parameter for the iterative calculation of the diffusion model.

[0074] Shared Latent Frame Buffer (SLFB): This is a shared buffer area that stores latent frames and is accessible to multiple computing modules. It supports efficient sharing and reuse of latent frames by multiple modules within the cluster, reducing data transfer overhead.

[0075] 3D-DRAM: or 3D vertically stacked DRAM, is a high-density memory achieved by vertically stacking multiple layers of memory cells, featuring large capacity and high bandwidth.

[0076] Memory Module (MM): This is a local storage unit physically integrated with the computing module in this embodiment, and it uses 3D-DRAM. It provides low-latency local data caching for the computing module, supporting the in-memory computing architecture design.

[0077] The Compute Module (CM) is the core computing unit of the chip, responsible for performing tasks such as tensor calculations and noise prediction for the diffusion model. It is physically integrated with the memory module and acquires data through a three-level data access path, serving as the core carrier of computing power output in this application.

[0078] Cluster: In this embodiment, a computing cluster consists of multiple in-memory units, shared cluster memory, and a cluster center. It achieves resource scheduling through the cluster center and supports collaborative computing among multiple computing modules within the cluster.

[0079] Network on Chip (NOC): This is an on-chip communication network that connects multiple computing clusters or chips, and forwards data through routing nodes. It supports global data interaction across clusters and chips.

[0080] Cluster Shared Memory (CSM): This is a shared storage unit within the computing cluster in this embodiment, using SRAM. It provides low-latency data sharing capabilities for all computing modules within the cluster, reducing data movement across nodes.

[0081] In high-performance computing chip architectures, the compute module (CM) and the adjacent memory module (MM) are typically physically separated. In this decoupled architecture, if a compute module needs data that is not in its own memory module but in the memory module of another compute unit, it must obtain the data through the network-on-chip (NOC) within the chip.

[0082] Referring to Figure 1, which is a schematic diagram of a computing chip architecture in related technologies, the diagram illustrates a 4×4 alternating grid structure in which each node is equipped with a computing module and a memory module. The memory module corresponding to each computing module serves as its local storage. All computing modules are connected to the on-chip network (NOC) via a global communication link, and the NOC handles the data and instruction transmission between modules.

[0083] Therefore, when the data required by the computing module is not stored locally, any access operation across memory modules must be completed through the on-chip network (NOC). This not only increases data transmission latency but also generates a large amount of invalid on-chip network traffic, leading to network congestion. For example, in some computing tasks, related data should be exchanged between physically adjacent computing modules. However, in this grid layout, even if two adjacent computing modules communicate, data transmission still needs to be relayed through the on-chip NOC, thereby increasing network data traffic and transmission latency, and reducing overall transmission efficiency.

[0084] Furthermore, since all non-local memory access requests must be processed through the on-chip network (NOC), the NOC must be capable of handling massive random access requests. This makes its routing design and flow control mechanisms more complex, significantly increasing chip area, hardware cost, and overall power consumption. For example, in Figure 1, the computing module CM in the lower left corner needs to access data in the memory module MM of the computing module to its right. Although the two are physically adjacent, the access request must still go through the on-chip network NOC, so an operation that should be local generates additional NOC traffic.

[0085] Furthermore, existing distributed parallel computing architectures typically employ a patch parallelism strategy, splitting high-resolution input data into multiple smaller patches and distributing them to different computing modules (CMs) for parallel processing. Between processing different time steps, each computing module needs to perform an "all-gather" operation through topologies such as network-on-chip (NOC), Xbar, or 2D mesh to exchange intermediate computation results and obtain global context information. However, in large-scale parallel computing scenarios, when hundreds or thousands of computing modules simultaneously and frequently access global shared memory and perform data synchronization, severe data contention arises, leading to rapid exhaustion of NOC bandwidth, a significant increase in communication latency, and impacting the data processing efficiency of the diffusion model.

[0086] Based on this, embodiments of this application provide a computing chip, a data processing method, an apparatus, a device, and a storage medium. The computing chip first reduces the energy consumption of local data access through the physical integration of the computing module and the memory module. Then, by utilizing a cluster-level shared memory design, a two-tier storage system combining local storage and cluster shared storage is constructed to optimize the data access path. When a computing unit needs to interact with other computing units within the cluster, it does not need to obtain data across the cluster via a global on-chip network. Instead, it can directly retrieve data from the cluster's shared memory, transforming the original global on-chip network transmission into direct intra-cluster transmission. This significantly shortens the transmission distance and latency of cross-storage access, and significantly improves the data transmission efficiency of the computing module. Simultaneously, neither local access within the cluster nor cluster shared memory access consumes bandwidth resources of the on-chip network. Transmission only occurs via the on-chip network when data exceeds the range of the cluster shared memory, effectively reducing the number of concurrent access requests to the on-chip network and alleviating network congestion at its source. Furthermore, the on-chip network does not need to handle massive random access, and its routing design and flow control mechanisms can be greatly simplified, thereby reducing chip area and hardware costs. Furthermore, each cluster center is independently connected to the on-chip network, allowing the on-chip network to handle only core data interactions between clusters, reducing invalid data transmission and lowering network access traffic. 
Therefore, this embodiment achieves efficient access to local data through in-memory computing physical integration, optimizes cross-storage access paths through cluster-level shared memory, and simplifies the interaction logic of the on-chip network through a layered communication architecture, significantly improving the data transmission efficiency of the computing modules. Simultaneously, by reducing on-chip network access requests and simplifying network design, it achieves a dual reduction in on-chip network access traffic and power consumption, while also reducing chip area and hardware costs, ultimately forming a computing chip architecture that integrates high-efficiency computing, low-power design, and cost optimization.
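
The reduction in on-chip network traffic can be made concrete with a back-of-the-envelope count. The figures below are arithmetic under stated assumptions, not measured results: a ring all-gather in which every node forwards one payload per round for N-1 rounds, one payload unit per patch, and a layout of four clusters with four computing modules each. The function name `ring_all_gather_noc_units` is hypothetical.

```python
def ring_all_gather_noc_units(nodes: int, payload_per_node: int) -> int:
    """Payload units crossing the network in a ring all-gather.

    Each of the `nodes` participants sends one payload in each of the
    nodes - 1 rounds.
    """
    return nodes * (nodes - 1) * payload_per_node


CLUSTERS = 4             # assumed number of computing clusters
MODULES_PER_CLUSTER = 4  # assumed number of computing modules per cluster

# Flat scheme: every computing module is a node on the on-chip network
# and contributes one patch.
flat_units = ring_all_gather_noc_units(CLUSTERS * MODULES_PER_CLUSTER, 1)

# Hierarchical scheme: only cluster centers are nodes on the on-chip network;
# each contributes the aggregated patches of its cluster. Intra-cluster
# transfers to the cluster shared memory do not use the on-chip network.
hierarchical_units = ring_all_gather_noc_units(CLUSTERS, MODULES_PER_CLUSTER)
```

Under these assumptions the flat scheme moves 240 patch units over the on-chip network and the hierarchical scheme moves 48, with 4 network endpoints instead of 16.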

[0087] The data processing method provided in this application can be applied to a terminal, a server, or a computer program running on either the terminal or the server. For example, the computer program can be a native program or software module in an operating system; it can be a native application (APP), i.e., a program that needs to be installed in the operating system to run, such as a client supporting data processing for computing chips; it can be a mini program, i.e., a program that only needs to be downloaded to a browser environment to run; it can also be a mini program that can be embedded in any APP. In short, the above-mentioned computer program can be any form of application, module, or plugin. The terminal communicates with the server via a network. The data processing method can be executed by the terminal or the server, or by the terminal and the server working together.

[0088] In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, or smartwatch, etc. The server can be a standalone server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms; it can also be a service node in a blockchain system, where the service nodes form a peer-to-peer (P2P) network. The P2P protocol is an application layer protocol running on top of the Transmission Control Protocol (TCP). The terminal and server can connect via Bluetooth, Universal Serial Bus (USB), or a network, etc., and this embodiment does not impose any limitations.

[0089] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0090] The computing chip provided in the embodiments of this application will be described below, specifically through the following embodiments.

[0091] In one embodiment, reference is made to Figure 2, which is a schematic diagram of the computing chip provided in an embodiment of this application. As can be seen from Figure 2, the architecture of the computing chip in this embodiment includes at least one computing cluster (four are shown in the figure, but this does not limit the number) and an on-chip network (NoC) for inter-cluster communication. Each computing cluster is an independent unit with local computing and storage capabilities, consisting of multiple in-memory computing units (as shown by the dashed box in the figure, each computing cluster contains four in-memory computing units).

[0092] In this design, each in-memory computing unit integrates a computing module and a memory module, which are physically tightly coupled. This integrated design allows the computing module to directly access the local memory module. The path for the computing module to access its corresponding memory module does not pass through the cluster center or the on-chip network, which can significantly reduce the latency and energy consumption of local data access and achieve efficient local data interaction for in-memory computing. Here, the memory module can be 3D-DRAM.

[0093] In addition, each computing cluster is equipped with a Cluster Hub (CH) and Cluster Shared Memory (CSM) connected to it. This CSM communicates directly with the computing modules of all in-memory units within the cluster, constructing a two-tiered storage system combining local memory and CSM. Specifically, the CSM is configured to store data shared by multiple computing modules within the cluster, and computing modules can access the CSM without going through an on-chip network. When an in-memory unit needs to retrieve data from other memory modules within the cluster, it can directly retrieve it from the CSM within the cluster without cross-cluster access. The Cluster Hub, acting as the cluster's communication hub, coordinates data interaction between local memory, CSM, and computing modules, and serves as the sole interface for communication between the cluster and the outside world. The CSM here can be SRAM.

[0094] Furthermore, in the cluster architecture, the cluster shared memory achieves efficient connectivity with the on-chip network through the cluster center, ensuring that the central nodes of all computing clusters can uniformly access the on-chip network, thus constructing a layered communication architecture. This architecture leverages the fast communication characteristics of shared memory, allowing different computing nodes to directly access shared data, greatly improving the efficiency of data processing and transmission. The first layer is intra-cluster communication, including data interaction between in-memory computing units within the cluster and between in-memory computing units and the cluster shared memory, all completed within the cluster without consuming on-chip network bandwidth, effectively reducing concurrent access requests to the on-chip network. The second layer is inter-cluster communication; only when data exceeds the current cluster shared memory range will the cluster center send a request to the on-chip network, transmitting core data between different computing clusters via the on-chip network. This ensures that the on-chip network only handles necessary data interactions between clusters, avoiding massive random access, simplifying network routing design and flow control logic, thereby reducing chip area, hardware cost, and overall power consumption.

[0095] In one embodiment, the computing module and memory module are vertically integrated using three-dimensional stacking technology, achieving a high-bandwidth transmission channel and significantly improving data processing speed. Furthermore, the memory module of the in-memory computing unit is configured as private memory for the computing module. In this in-memory computing architecture, internal access within a single in-memory computing unit is extremely fast. However, if a global on-chip network (CBI) communication method is used, the communication cost between different memory modules is relatively high. Therefore, in this embodiment, multiple in-memory computing units can first be divided into the same computing cluster based on physical distance, and then cluster shared memory can be introduced within the computing cluster to establish a high-speed shared area. This fully utilizes the advantage of near-memory access in in-memory computing, intercepting a large number of high-bandwidth demands within the computing cluster and avoiding the expensive overhead of cross-module communication.

[0096] Furthermore, the in-memory computing units are vertically integrated using three-dimensional stacking technology, with computing modules stacked on top of storage modules, resulting in extremely high computing density per unit area of the chip. In this high-density architecture, if all non-local access relies on the global on-chip network, the on-chip network would have to handle massive concurrent requests, easily causing network congestion and data transmission delays. In this case, cluster shared memory adds a buffer layer between local memory and remote access, directly reducing the design complexity, area, and power consumption of the high-density in-memory computing chip by reducing the data flow entering the global network. Additionally, if the computing chip is used to process tasks such as AI training or big data analysis, the related data in these tasks often exhibits locality, meaning that physically close units exchange data frequently. With the global on-chip network transmission method in related technologies, two in-memory computing units must exchange data through the on-chip network (NoC) even if they are physically adjacent. In contrast, this embodiment's hierarchical design based on cluster shared memory matches the data flow characteristics of in-memory computing tasks, ensuring that local tasks are completed locally or within the cluster, while only global tasks follow the slower cross-cluster path.

[0097] In one embodiment, the cluster shared memory of the computing cluster is an on-chip memory with a physical address space that is independent of the private cache hierarchy of the computing modules and is configured to be directly accessible by all computing modules within the computing cluster. This embodiment treats the cluster shared memory as an independent hardware entity on the chip and assigns it an independent physical address space primarily for the following two reasons. Firstly, this design allows all computing modules within the computing cluster to directly read and write to it via hardware links, without relying on complex caching protocols or relay mechanisms, significantly improving the efficiency and determinism of data access. Secondly, if the cluster shared memory is treated as part of the processor cache, it is essentially just a logical partition of the L1 / L2 cache within the computing modules, still limited by the organization of cache lines. If the cluster shared memory consists of caches shared by multiple memory modules within the computing cluster, it directly weakens the ability of a single computing core to process private data and mixes communication traffic with cache coherence traffic, potentially exacerbating congestion and making it unpredictable. Furthermore, the underlying hardware of such partitioned caches may still share bandwidth or ports with other cache logic, and access latency will be affected by the load on the cache controller, so stable latency cannot be guaranteed.

[0098] This embodiment treats the cluster shared memory as an independent shared space, not as a private cache of any individual memory module. Therefore, it allows for simpler software-defined coherence management, avoiding complex hardware cache coherence protocols and simplifying routing design. Furthermore, since the cluster shared memory is not a cache, it provides deterministic access latency and eliminates write-back and replacement operations caused by cache misses, further improving the reliability and predictability of data access.

[0099] In one embodiment, referring to Figure 2, the on-chip network also includes switches corresponding to each computing cluster, with each cluster center connecting to the on-chip network through a dedicated switch. These switches, corresponding to the computing clusters, serve as core scheduling nodes connecting the cluster centers and the on-chip network, enabling fine-grained cluster-level traffic control, further improving transmission efficiency and reducing network traffic and power consumption.

[0100] On the one hand, cross-cluster data requests from each computing cluster are aggregated centrally by the cluster center and then connected to the on-chip network via a dedicated switch. The switch can filter, schedule, and limit outbound traffic from its own cluster, preventing scattered and invalid cross-cluster requests from directly flooding the on-chip network and reducing invalid data transmission. Simultaneously, it achieves isolated control of outbound traffic from each cluster, preventing massive requests from a single cluster from consuming the entire network bandwidth and further alleviating network congestion. On the other hand, as a dedicated access node between the cluster and the on-chip network, the switch can forward traffic from the on-chip network to its own cluster in a targeted manner, ensuring that external data is accurately delivered to the cluster center, reducing routing time in the on-chip network, and improving the transmission efficiency of cross-cluster data. Furthermore, by replacing the on-chip network's direct control of traffic to a single cluster with the switch's lightweight scheduling logic, the on-chip network is free from the need to design complex routing adaptation rules for each computing cluster, further simplifying the overall design of the on-chip network and reducing its hardware area and power consumption.
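The filtering and limiting role of the dedicated switch can be illustrated with a minimal Python sketch. The text does not specify the scheduling policy, so the fixed per-cycle budget, the `ClusterSwitch` class, and its method names are all hypothetical and chosen only for illustration.

```python
from collections import deque


class ClusterSwitch:
    """Hypothetical model of a per-cluster switch that filters and limits outbound traffic."""

    def __init__(self, max_requests_per_cycle: int):
        # Assumed policy: at most this many requests enter the on-chip network per cycle.
        self.max_requests_per_cycle = max_requests_per_cycle
        self.pending = deque()

    def submit(self, request: str, is_valid: bool = True) -> bool:
        """Queue a cross-cluster request; invalid requests are filtered out."""
        if not is_valid:
            return False
        self.pending.append(request)
        return True

    def drain_cycle(self) -> list:
        """Release up to the per-cycle budget of requests, in arrival order."""
        count = min(self.max_requests_per_cycle, len(self.pending))
        return [self.pending.popleft() for _ in range(count)]


switch = ClusterSwitch(max_requests_per_cycle=2)
for name in ["r0", "r1", "r2", "r3", "r4"]:
    switch.submit(name)
rejected = switch.submit("bad", is_valid=False)
cycles = [switch.drain_cycle() for _ in range(3)]
```

Because each cluster has its own budget in this sketch, a burst from one cluster is spread over several cycles instead of occupying the whole network at once.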

[0101] Furthermore, the cluster center is connected to the on-chip network via a switch, enabling flexible decoupling between the computing cluster and the on-chip network. Traffic scheduling and fault management for a single cluster can be completed at the dedicated switch level, without affecting the overall operation of the on-chip network or interfering with network interactions between other computing clusters, thus improving the scalability and stability of the chip architecture. The on-demand on / off switching and traffic regulation features of the switch can dynamically adjust cross-cluster traffic based on the actual workload of the computing cluster: reducing the switch's transmission power consumption when the cluster is under low load, and ensuring efficient traffic transmission when under high load, achieving dynamic optimization of network power consumption.

[0102] Therefore, this application embodiment achieves refined and isolated management of cross-cluster traffic through a dedicated switch for the on-chip network, significantly improving the data transmission efficiency of the computing module both locally and across clusters. At the same time, by filtering, aggregating, and isolating traffic through the switch, the number of access requests and invalid transmissions of the on-chip network is reduced. Combined with the simplification of network design, this achieves a dual reduction in on-chip network access traffic and power consumption, and also reduces chip area and hardware cost, ultimately forming a high-efficiency, low-power, and highly scalable computing chip architecture design.

[0103] In one embodiment, the cluster center is configured with an address discrimination circuit. This circuit can efficiently process access requests initiated by computing modules. In response to these requests, it directs the access request to the cluster's shared memory or forwards it to the on-chip network based on the target address. If the target address of the access request initiated by the computing module is located within the shared memory of the corresponding computing cluster, the access request is directed to the shared memory. If the target address is located outside the computing cluster, the access request is forwarded to the on-chip network for routing and transmission.

[0104] Specifically, the address discrimination circuit can accurately determine the target address of each access request initiated by the computing module: if the target address is located within the shared memory of the computing cluster, the access request is directly directed to the shared memory, allowing the data request to complete closed-loop processing within the cluster without accessing the on-chip network; only when the target address is located outside the computing cluster is the access request forwarded to the on-chip network for routing. This address discrimination and request routing avoids the problem of indiscriminate cross-node access leading to limited transmission efficiency, ensuring that all data requests within the same computing cluster are completed locally, generating no invalid traffic on the on-chip network, significantly reducing the number of concurrent access requests to the on-chip network, alleviating network congestion, and allowing computing modules to avoid waiting for complex routing on the on-chip network, significantly improving the efficiency of data transmission within the cluster.
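As an illustration only, the decision made by the address discrimination circuit can be sketched in Python as a range check on the target address. The base address, the size, and the names `ClusterCenter`, `Route`, and `discriminate` are assumptions, not values or identifiers from this application. The sketch also simplifies by treating every address outside the cluster shared memory range as outside the cluster, since private memory accesses do not pass through the cluster center.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CLUSTER_SHARED_MEMORY = "cluster shared memory"
    ON_CHIP_NETWORK = "on-chip network"


@dataclass(frozen=True)
class ClusterCenter:
    # Hypothetical address window occupied by this cluster's shared memory.
    csm_base: int
    csm_size: int

    def discriminate(self, target_address: int) -> Route:
        """Direct the request to shared memory if the address falls in its window."""
        if self.csm_base <= target_address < self.csm_base + self.csm_size:
            return Route.CLUSTER_SHARED_MEMORY
        # Everything else is treated as outside the cluster and forwarded.
        return Route.ON_CHIP_NETWORK


center = ClusterCenter(csm_base=0x1000_0000, csm_size=0x0010_0000)
local_route = center.discriminate(0x1000_0040)
remote_route = center.discriminate(0x2000_0000)
```

In hardware this check would typically be a comparison on the upper address bits; the Python form only shows the routing outcome.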

[0105] After the cluster center completes request scheduling through the address discrimination circuit, it connects to the on-chip network via a dedicated switch to achieve centralized and isolated management of cross-cluster traffic. For access requests determined by the address discrimination circuit to be cross-cluster requests, the cluster center aggregates them and connects them to the on-chip network via a switch. The precise traffic routing through the address discrimination circuit allows the on-chip network to handle only core cross-cluster data requests, without having to deal with massive intra-cluster requests, reducing chip hardware area and cost, as well as the power consumption caused by complex routing and massive requests.

[0106] As can be seen from Figure 2, the architecture of the computing chip in this application embodiment has a three-level data access path: a private access path, by which the computing module accesses its corresponding memory module; an intra-cluster shared path, by which the computing module accesses the cluster shared memory through the cluster center; and a global access path, by which the computing module reaches resources outside its computing cluster by accessing the on-chip network through the cluster center.

[0107] In the private access path, the compute module directly accesses its bound memory module without going through the cluster center or on-chip network. This is the lowest latency access path, suitable for the high-frequency local data needs of compute modules, ensuring the efficient execution of basic computing tasks. In the intra-cluster shared path, the compute module accesses the shared memory of its compute cluster through the cluster center. This path supports data sharing among multiple compute modules within the same cluster, with lower latency than the global access path, suitable for medium-range data interaction during multi-module collaborative computing within a cluster. In the global access path, the compute module accesses the on-chip network via the cluster center to access resources outside its own compute cluster. This is the most comprehensive access path, supporting cross-cluster global data interaction, suitable for large-scale distributed computing scenarios, and enabling efficient communication between multiple clusters through the on-chip network's switching nodes.

[0108] In one embodiment, reference is made to Figure 3, which is another schematic diagram of the computing chip provided in this application embodiment. The following description uses two computing clusters as an example. Each computing cluster consists of multiple in-memory computing units, a cluster center, and cluster shared memory. The cluster shared memory is communicatively connected to the computing modules of all in-memory computing units within the cluster. The cluster center acts as a scheduling node, responsible for forwarding and controlling data access within the cluster. The dotted lines in the diagram delimit the computing clusters. Each computing cluster includes four in-memory computing units, one block of cluster shared memory, and one cluster center. The cluster center of each computing cluster is then connected to an on-chip network, forming a global communication link across the clusters.

[0109] In one embodiment, to expand the scale of the computing chip, the computing chip may further include at least two wafers, each wafer containing at least one computing cluster. Each wafer also has a wafer center, which is communicatively connected to all cluster centers within its corresponding wafer. The wafer center is configured to perform address routing for cross-cluster access requests and provide unified access to the on-chip network.

[0110] In one embodiment, reference is made to Figure 4, which is another structural schematic diagram of the computing chip provided in this application embodiment. The description uses two chips as an example. The computing chip is deployed in a distributed manner using two chips to achieve scaling of computing power. Specifically, each chip contains two computing clusters, each cluster contains eight in-memory computing units, and each chip is configured with a chip center, which communicates with all cluster centers within the corresponding chip. In this case, the chip centers of all chips are connected to the on-chip network, forming a global communication link between multiple chips, supporting cross-chip data interaction in large-scale distributed computing scenarios. The chip center is used for address resolution and routing of cross-cluster access requests and for unifying all cluster requests from its chip to the on-chip network, realizing global communication between chips.

[0111] In one embodiment, referring to Figure 4, access paths in this multi-chip extended architecture are also layered and extended. Private access paths are those where computing modules directly load data from, or write intermediate results to, their bound memory modules, without going through any intermediate nodes. Intra-cluster shared paths are those where in-memory computing units within the same cluster access shared data in the cluster's shared memory through cluster center scheduling; these have moderate latency and are suitable for collaborative computing among multiple units within the cluster. Global access paths include intra-chip cross-cluster paths and global cross-chip paths. An intra-chip cross-cluster path is used when in-memory computing units in different clusters within the same chip exchange data: requests are uploaded from the local cluster center to the chip center, and then routed and forwarded by the chip center to the target cluster, enabling global resource access within the chip. A global cross-chip path is used when, in large-scale distributed computing tasks, in-memory computing units in different chips need to exchange global data: requests enter the on-chip network from the local chip center, are forwarded to the target chip center, and ultimately access the target cluster's resources; this is the most comprehensive access path.
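The layered paths described above can be summarized with a short Python sketch that lists the nodes a request would traverse. The function name `access_hops`, its parameters, and the hop labels are hypothetical; the sketch only restates the path ordering given in the text and does not model latency.

```python
from typing import List


def access_hops(src_chip: int, src_cluster: int,
                dst_chip: int, dst_cluster: int,
                private: bool = False) -> List[str]:
    """Return the ordered nodes a request passes through (illustrative labels)."""
    if private:
        # Private path: the computing module reaches its bound memory module directly.
        return ["memory module"]
    hops = ["cluster center"]
    if (src_chip, src_cluster) == (dst_chip, dst_cluster):
        # Intra-cluster shared path.
        return hops + ["cluster shared memory"]
    hops.append("chip center")
    if src_chip != dst_chip:
        # Global cross-chip path goes through the on-chip network.
        hops += ["on-chip network", "target chip center"]
    return hops + ["target cluster resources"]


private_path = access_hops(0, 0, 0, 0, private=True)
shared_path = access_hops(0, 0, 0, 0)
cross_cluster_path = access_hops(0, 0, 0, 1)
cross_chip_path = access_hops(0, 0, 1, 1)
```

The growing length of the returned list mirrors the ordering in the text: private access involves no intermediate node, while a cross-chip request passes through the most nodes.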

[0112] In one embodiment, the cluster shared memory is configured to sequentially store cluster aggregated data formed by the aggregation of intermediate results from each computing module and global data obtained after synchronization via the on-chip network in each iteration time step during multiple iterations of computation, so that each computing module can directly access it in subsequent iteration time steps.

[0113] The computing chip in this embodiment is suitable for parallel processing tasks requiring multiple iterations, such as image or video generation tasks using diffusion models. In such tasks, within each iteration time step, each computing module within the computing cluster processes its allocated local data in parallel, temporarily storing intermediate results in a private memory module. Then, the intermediate results from each computing module are aggregated into the cluster's shared memory via the cluster center, forming the cluster aggregated data for that time step. Finally, the global data is synchronized between clusters via the on-chip network, writing it back to the shared memory of each cluster for direct access by the computing modules in the next time step. In this way, the cluster shared memory repeatedly stores and updates global data across multiple time steps, ensuring that cross-module data exchange in each iteration is completed within the cluster, significantly reducing the access pressure on the on-chip network.
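The per-time-step flow described above (parallel local computation, aggregation into cluster shared memory, inter-cluster synchronization, and write-back) can be sketched in Python as follows. This is a software analogy under stated assumptions: each cluster is modeled as a dictionary, `run_time_step` and `add_one` are hypothetical names, and the placeholder computation stands in for the actual denoising work.

```python
from typing import Callable, Dict, List


def run_time_step(clusters: List[Dict],
                  compute: Callable[[int, List[int]], int]) -> List[int]:
    """Run one iteration time step over all clusters and return the global data."""
    for cluster in clusters:
        # Global data written back in the previous step is read from shared memory.
        global_view = cluster["csm"].get("global", [])
        # Each computing module processes its own block; results stay in private memory.
        cluster["private"] = [compute(block, global_view) for block in cluster["private"]]
        # Intermediate results are aggregated into the cluster shared memory.
        cluster["csm"]["aggregated"] = list(cluster["private"])
    # Inter-cluster synchronization (the only stage that would use the on-chip network).
    global_data = [value for cluster in clusters for value in cluster["csm"]["aggregated"]]
    for cluster in clusters:
        # Global data is written back to every cluster's shared memory.
        cluster["csm"]["global"] = list(global_data)
    return global_data


def add_one(block: int, global_view: List[int]) -> int:
    """Placeholder computation; a real task would use global_view as context."""
    return block + 1


clusters = [{"private": [1, 2], "csm": {}}, {"private": [3, 4], "csm": {}}]
first_step = run_time_step(clusters, add_one)
second_step = run_time_step(clusters, add_one)
```

In this sketch the shared memory of each cluster is overwritten at every step, which corresponds to the repeated storing and updating of global data across time steps.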

[0114] The computing chip proposed in this application includes at least one computing cluster. Each computing cluster includes: multiple in-memory units, each including a computing module and a memory module physically integrated with the computing module; a cluster center; cluster shared memory, communicatively connected to the cluster center and to each computing module in the multiple in-memory units via the cluster center; and an on-chip network, with the cluster center of each computing cluster connected to the on-chip network. This application first reduces the energy consumption of local data access through the physical integration of computing modules and memory modules. Then, by utilizing the cluster-level cluster shared memory design, a two-tier storage system combining local storage and cluster shared storage is constructed to optimize the data access path. When an in-memory unit needs to interact with other in-memory units within the cluster, it does not need to obtain data across the cluster via the global on-chip network. Instead, it can directly retrieve data from the cluster shared memory within the cluster, transforming the original global on-chip network transmission into direct intra-cluster transmission. This significantly shortens the transmission distance and latency of cross-storage access, and significantly improves the data transmission efficiency of the computing modules. Meanwhile, local access and shared memory access within the cluster do not consume bandwidth resources of the on-chip network. Data is only transmitted through the on-chip network when it exceeds the range of the shared memory, effectively reducing the number of concurrent access requests to the on-chip network and alleviating network congestion at its source. Furthermore, the on-chip network does not need to handle massive random access, significantly simplifying its routing design and flow control mechanisms, thereby reducing chip area and hardware costs. 
In addition, each cluster center is individually connected to the on-chip network, allowing the on-chip network to handle only core data interactions between clusters, reducing invalid data transmission and lowering network access traffic. Therefore, this embodiment achieves efficient access to local data through physical integration of in-memory computing, optimizes cross-storage access paths through cluster-level shared memory, simplifies the interaction logic of the on-chip network through a layered communication architecture, and significantly improves the data transmission efficiency of the computing modules. Simultaneously, by reducing on-chip network access requests and simplifying network design, it achieves a dual reduction in on-chip network access traffic and power consumption, while also reducing chip area and hardware costs, ultimately forming a high-efficiency, low-power, and low-cost computing chip architecture design.

[0115] The data processing method in the embodiments of this application is described below based on the architecture of the aforementioned computing chip.

[0116] Figure 5 is an optional flowchart of the data processing method provided in the embodiments of this application. As shown in Figure 5, the method may include, but is not limited to, steps 110 to 130. It is also understood that this embodiment does not specifically limit the order of steps 110 to 130 in Figure 5; the order of steps can be adjusted, and steps can be removed or added, according to actual needs.

[0117] Step 110: Within each time step, the allocated local data blocks are processed in parallel in each computing module, and the obtained computing results are stored in the corresponding memory module.

[0118] In one embodiment, when performing large-scale parallel computing tasks such as image or video generation, it is necessary to decompose the complex global computing task into multiple independently processable units to fully utilize the chip's parallel computing power. Therefore, high-resolution input data can be split into multiple smaller local data blocks, with each computing module processing one local data block, and a one-to-one proprietary mapping relationship established between the data block and the computing module (CM). This partitioning strategy allows each computing module to be responsible only for processing the specific local data block assigned to it, thereby achieving a high degree of spatial parallelism in the task.

[0119] Because the diffusion model is iterative, the following description uses a single time step as an example to illustrate the data processing at each time step. During this process, the cluster's shared memory repeatedly stores global data across multiple time steps. In one embodiment, reference is made to Figure 6, which is a schematic diagram illustrating the parallel processing of allocated local data blocks in each computing module according to an embodiment of this application. The left side of the diagram shows the data segmentation. Input data, such as a super-resolution image, is uniformly divided into four independent local data blocks: P0, P1, P2, and P3. The right side shows four independent computing modules: CM0, CM1, CM2, and CM3. Arrows indicate the one-to-one mapping between local data blocks and computing modules; that is, each local data block is fixedly assigned to a corresponding computing module for processing.
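The uniform segmentation and one-to-one mapping shown in Figure 6 can be illustrated with a small Python sketch. The helper names `split_into_blocks` and `assign_blocks`, the nested-list frame representation, and the row-major block order are assumptions made for illustration; the labels CM0 to CM3 follow the figure.

```python
from typing import Dict, List

Frame = List[List[int]]


def split_into_blocks(frame: Frame, block_rows: int, block_cols: int) -> List[Frame]:
    """Uniformly split a 2D frame into equally sized blocks, in row-major order."""
    height, width = len(frame), len(frame[0])
    if height % block_rows or width % block_cols:
        raise ValueError("frame size must be a multiple of the block size")
    blocks = []
    for top in range(0, height, block_rows):
        for left in range(0, width, block_cols):
            blocks.append([row[left:left + block_cols]
                           for row in frame[top:top + block_rows]])
    return blocks


def assign_blocks(blocks: List[Frame]) -> Dict[str, Frame]:
    """Fixed one-to-one mapping: block Pi is always handled by computing module CMi."""
    return {f"CM{index}": block for index, block in enumerate(blocks)}


# A 4x4 frame split into four 2x2 blocks P0..P3.
frame = [[row * 4 + col for col in range(4)] for row in range(4)]
mapping = assign_blocks(split_into_blocks(frame, 2, 2))
```

Because the mapping is fixed, each computing module can keep its block and results in its own memory module across time steps.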

[0120] In this embodiment, each computing module is equipped with its own private memory module. For example, computing module CM0 corresponds to memory module MM0, computing module CM1 corresponds to memory module MM1, computing module CM2 corresponds to memory module MM2, and computing module CM3 corresponds to memory module MM3. These memory modules are physically and tightly integrated with the computing unit using 3D stacking technology. Since each computing module only needs to process its own dedicated local data block, and the computation results can be directly stored in its physically integrated private memory, the computation process does not need to contend for the bandwidth of the cluster's shared memory or on-chip network. This extremely short data transmission path eliminates bandwidth contention and also achieves low latency and low power consumption.

[0121] In one embodiment, the data processing method provided in this application can be efficiently applied to image or video generation tasks based on diffusion models. In such tasks, within each iteration time step, each computing module in the computing cluster processes the allocated local data in parallel, temporarily storing intermediate results in a private memory module; then, the intermediate results of each computing module are aggregated into the cluster's shared memory through the cluster center, forming the cluster aggregated data for that time step; then, the global data is written back to the cluster shared memory of each cluster through on-chip network synchronization between clusters, allowing the computing modules of the next time step to access it directly. In this way, the cluster shared memory repeatedly stores and updates global data across multiple time steps, ensuring that cross-module data exchange in each iteration is completed within the cluster, significantly reducing the access pressure on the on-chip network. Since direct video diffusion in pixel space would generate a huge computational burden, this embodiment chooses to perform computation in a lower-dimensional, more information-dense latent space. By using encoders such as Variational Autoencoders (VAEs), high-resolution original video frames are compressed into latent representations containing core features, thereby significantly improving the efficiency of denoising computation.

[0122] In one embodiment, reference is made to Figure 7, which is a data flow diagram of the diffusion model in related technologies. The process generates new latent frames by concatenating multi-source latent frames and inputting them into the backbone model, while simultaneously caching and reusing historical frames. First, the system reads the global latent frame, which provides global scene information, and a sequence of window latent frames from the KV cache. These window latent frames are the results of previous generation and are cached to support context-dependent generation tasks. Subsequently, all global and window latent frames are fed into the concatenation module, which combines them into a feature tensor containing the global background and the local temporal context. The concatenated tensor is input into the data model, such as the backbone model, which performs the core diffusion model calculations, including noise prediction and feature reconstruction, ultimately generating a new latent frame containing new temporal information. This new latent frame serves as output for subsequent processing and is also written to the KV cache, becoming a window latent frame for the next round of calculation, thus forming a cyclical context caching mechanism that ensures the temporal continuity of the generated content.
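The cyclical context caching mechanism can be sketched in Python with a bounded queue standing in for the KV cache. This is a simplified illustration: frames are represented as plain integers, `generate_frames` and `toy_backbone` are hypothetical names, and the summing backbone is a placeholder rather than a diffusion model.

```python
from collections import deque
from typing import Callable, List


def generate_frames(global_frame: int, num_frames: int, window_size: int,
                    backbone: Callable[[List[int]], int]) -> List[int]:
    """Generate frames one by one, feeding each output back into a sliding window."""
    # Stand-in for the KV cache: keeps only the most recent window_size frames.
    window_cache = deque(maxlen=window_size)
    outputs = []
    for _ in range(num_frames):
        # Concatenate the global frame with the cached window frames.
        context = [global_frame, *window_cache]
        new_frame = backbone(context)
        outputs.append(new_frame)
        # The new frame becomes a window frame for the next round.
        window_cache.append(new_frame)
    return outputs


def toy_backbone(context: List[int]) -> int:
    """Placeholder for the backbone model: simply sums its context."""
    return sum(context)


frames = generate_frames(global_frame=1, num_frames=4, window_size=2,
                         backbone=toy_backbone)
```

With `deque(maxlen=...)`, the oldest window frame is dropped automatically once the window is full, which matches the idea of a fixed number of preceding frames serving as context.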

[0123] Next, reference is made to Figure 8, which is a temporal logic diagram of the diffusion model in related technologies. In image generation tasks, the diffusion model starts with random noise, corresponding to time step t=0 in the diagram, and gradually removes the noise through N iterations. At each time step, such as t=1 to t=N-1, it predicts and eliminates noise of corresponding intensity based on the output of the previous step, finally obtaining a clear generated image at time step t=N-1. The number of time steps N in this iterative generation method can be adjusted according to the requirements of generation quality and computational efficiency to balance generation effect and computational cost.

[0124] In related technologies, a patch parallelism strategy is employed to improve data processing efficiency. Reference is made to Figure 9, which is a schematic diagram illustrating the execution logic of the patch parallelism strategy in related technologies. First, the model input, such as the latent space frame of a diffusion model, is divided into fixed-size image blocks, shown in the diagram as data block 0, data block 1, data block 2, and data block 3. For example, a 2048×2048 image can be divided into a 16×16 grid of 128×128 image blocks. The purpose of segmentation is to decompose a single large computational task into multiple independently processable subtasks, adapting to the distributed computing power cluster architecture of computing chips and avoiding single-node computing power bottlenecks. Then, each data block is sent to an independent GPU for processing. Each GPU is only responsible for processing the data block assigned to it, and all GPUs start working simultaneously, executing the same model computation task. Since each GPU can only see a small portion of the input, it lacks a global view of the entire image, which affects model understanding and performance. Therefore, between two time steps, all GPUs exchange their intermediate computation results through the All Gather collective communication operation. The All Gather operation aggregates all scattered local information, providing a global context for each GPU. This ensures that the processing of each data block takes into account the information of the entire input, thereby guaranteeing the accuracy of the computation results.
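The block count in the example and the effect of the All Gather operation can be checked with a small Python sketch. It is a single-process simulation, not a call into any real collective communication library; the names `patch_grid` and `all_gather` are hypothetical.

```python
from typing import List, Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (blocks per side, total blocks) for a square image and square blocks."""
    if image_size % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def all_gather(local_results: List[str]) -> List[List[str]]:
    """Simulate All Gather: every worker ends up with every worker's result."""
    gathered = list(local_results)
    return [list(gathered) for _ in local_results]


per_side, total_blocks = patch_grid(2048, 128)
views = all_gather(["out0", "out1", "out2", "out3"])
```

Here `per_side` is 16 and `total_blocks` is 256, matching the 16×16 grid of 128×128 blocks; after the simulated All Gather, each of the four workers holds all four intermediate results.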

[0125] The parallel processing procedure of embodiments of this application is described below.

[0126] In one embodiment, the local data block is a latent representation patch of an image or video in the latent space, and the intermediate computation result is a denoised latent representation patch. The system first receives compressed latent space data. To cope with the large data volume of high-resolution video generation, the complete latent representation frame is spatially split into multiple smaller latent representation patches. Each latent representation patch is then assigned, as an independent local data block, to a specific computing module within the computing cluster through a one-to-one exclusive mapping. This partitioning not only allows the computing cluster to handle very large images that exceed the capacity of any single computing module, but also gives each computing module clear task boundaries during the computation phase, thereby achieving large-scale parallel processing of the latent space data.

[0127] In one embodiment, the cluster aggregated data consists of latent frames composed of multiple denoised latent representation patches. Depending on the generation stage, the latent frame includes at least one of a global latent frame, a window latent frame, and a new latent frame. The global latent frame contains the complete latent representation of the current target object; the window latent frame consists of latent frames from a preset number of time steps before the current time step; and the new latent frame is the latent representation to be processed at the current time step. After the computing modules start in parallel, each performs the reverse denoising task on its assigned latent representation patch. At each denoising step, the computing module uses an attention mechanism to query the historical information stored in the key-value cache, using the currently noisy patch of the new latent frame as the query. This historical information includes global latent frames, which provide the macroscopic scene layout, and window latent frames, which provide object motion trends and continuous changes in lighting and shadow. By combining the current noisy patch with the contextual information extracted from the cache, the backbone model can predict and remove the noise, thereby outputting clear, temporally coherent denoised latent representation patches. Subsequently, the computing module takes the denoised latent representation patch as the computation result and writes it directly into its physically integrated private memory module over a very short physical path. This proximity-based storage avoids frequent data movement across the global on-chip network, removes the bottleneck caused by bandwidth contention at this stage, and keeps latency low for the generation task at high throughput.
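
The cache query described above can be pictured with a small sketch of standard scaled dot-product attention. The embodiment does not specify the attention variant, so this form is an assumption; the function `attend`, the two-dimensional vectors, and the three cache entries are hypothetical and chosen only to keep the example readable.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (pure Python)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


# Hypothetical cache: one entry from a global latent frame, two from window frames.
kv_cache_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kv_cache_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
noisy_patch_query = [1.0, 1.0]   # stands in for the noisy patch of the new latent frame

context, weights = attend(noisy_patch_query, kv_cache_keys, kv_cache_values)
```

Here `context` is the weighted mix of cached values that would be combined with the noisy patch before the backbone model predicts the noise.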

[0128] It can be understood that, in the related art, the physical distance between computing units and storage units is relatively large, and data synchronization requires frequent data transfer over a global on-chip network, which places a high load on that network. In this embodiment, by contrast, the denoised latent representation patch is taken as the computation result and written directly to the physically integrated private memory over a very short physical path. This reduces data synchronization from traversing complex network paths to a simple local memory write.

[0129] Step 120: Within the same computing cluster, transfer the computation results corresponding to the computing modules to the cluster shared memory to obtain the corresponding cluster aggregated data.

[0130] In one embodiment, as can be seen from Figure 2, a cluster shared memory is configured within a computing cluster, and a shared buffer is set up in the cluster shared memory. The system obtains a latent frame by splicing, in the shared buffer, the denoised latent representation patches generated by the computing modules, and uses the latent frame as the cluster aggregated data.

[0131] When computation results are migrated from private memory to the cluster shared memory, the system uses a pre-defined shared buffer to perform logical data reorganization and storage management. This process is not merely a transfer between physical locations, but a transformation of the data from local patches into semantically complete frames. Within the shared buffer, the system performs address alignment and concatenation on the denoised latent representation patches generated by the computing modules, thereby obtaining a complete latent frame, which is then used as the cluster aggregated data. Specifically, the cluster center spatially arranges the computation results from different computing modules within the shared buffer according to a pre-defined mapping logic. Because the computation results are already stored as denoised, aligned features in the memory modules, this aggregation preserves a complete feature representation of the latent frame in the latent space, facilitating subsequent global synchronization and cross-patch attention computation.
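
The splicing step above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `splice_patches`, the row-major mapping from module index to grid position, and the 2×2 patches are hypothetical stand-ins for the pre-defined mapping logic, which the embodiment does not spell out.

```python
def splice_patches(patches, grid_rows, grid_cols):
    """Concatenate equal-size 2-D patches into one frame.

    patches maps a computing-module index to its patch (a list of rows).
    Module i is placed at grid cell divmod(i, grid_cols); this row-major
    mapping is an assumption standing in for the pre-defined mapping logic.
    """
    patch_h = len(patches[0])
    patch_w = len(patches[0][0])
    frame = [[None] * (grid_cols * patch_w) for _ in range(grid_rows * patch_h)]
    for module, patch in patches.items():
        grid_r, grid_c = divmod(module, grid_cols)
        for r in range(patch_h):
            for c in range(patch_w):
                frame[grid_r * patch_h + r][grid_c * patch_w + c] = patch[r][c]
    return frame


# Four modules, each holding a 2x2 patch filled with its own index.
patches = {m: [[m, m], [m, m]] for m in range(4)}
shared_buffer = splice_patches(patches, grid_rows=2, grid_cols=2)
for row in shared_buffer:
    print(row)
```

The printed frame places module 0 top-left, module 1 top-right, module 2 bottom-left, and module 3 bottom-right, which is the sense in which local patches become one complete latent frame.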

[0132] Furthermore, to adapt to the computational needs of the diffusion model at different time steps and task stages, the shared buffer can also implement classified storage management of latent frames. Depending on the generation stage, latent frames include at least one of the following: global latent frames, window latent frames, and new latent frames. Global latent frames provide macroscopic scene context and feature representations of the main structure for video or image generation, serving as benchmark reference information to ensure the layout stability of generated content over long sequences or at large scales. Window latent frames consist of a preset number of frames preceding the current target generation frame, serving as dynamic context, recording the microscopic trends of object motion and continuous changes in lighting and shadow to ensure the coherence of the generated video stream. New latent frames refer to the target generation frame that needs denoising processing in the current time step of the diffusion model. This frame is initially filled with noise and is iteratively optimized by continuously referencing information from global and window latent frames.

[0133] Step 130: Synchronize the cluster aggregated data in the cluster shared memory on the on-chip network through each cluster center to obtain global data.

[0134] In one embodiment, after each computing cluster completes the preliminary computation on its local data blocks and obtains the corresponding cluster aggregated data, the data processing flow enters the cross-cluster global synchronization stage. The purpose of this stage is to remove information silos between clusters and construct a complete global context view through an efficient collective communication protocol. Refer to Figure 10, which is a flowchart, provided in an embodiment of this application, of obtaining global data by synchronizing the cluster aggregated data in the cluster shared memory over the on-chip network through each cluster center. The flow specifically includes the following steps:

[0135] Step 1010: Send the cluster aggregated data in the cluster shared memory to the corresponding cluster center.

[0136] In one embodiment, the synchronization task is initiated by the cluster center as the core hub. Cluster aggregated data stored in a shared buffer within the cluster's shared memory is extracted in parallel and pushed to the respective cluster center. In this case, the cluster center not only acts as a data buffer but also as a gateway for accessing the on-chip network.

[0137] Step 1020: Perform a full collection (All Gather) operation between the cluster centers via the on-chip network so that each cluster center obtains the cluster aggregated data of the other computing clusters.

[0138] In one embodiment, a full collection operation is performed between the cluster centers via the on-chip network. Because the computing chip may be distributed across multiple chips, the routing function of each chip center is responsible for address redirection and path optimization of cross-cluster access requests, ensuring that data packets are addressed correctly in complex topologies. In essence, the full collection operation is a many-to-many data broadcast: the cluster aggregated data held by all computing clusters is ultimately assembled into a complete global dataset within each cluster.

[0139] The full collection operation employs a ring-based communication algorithm: each participating cluster center acts as a communication node, and a preset number of rounds of cyclic exchange is performed so that each cluster center receives the cluster aggregated data of the other computing clusters and thereby obtains the complete global data.

[0140] In one embodiment, to avoid on-chip network bandwidth bottlenecks during large-scale data synchronization, a ring-based communication algorithm is employed. Under this algorithm, the participating cluster centers are organized into a closed-loop logical topology. Specifically, the N participating cluster centers are defined as communication nodes, each initially holding only the cluster aggregated data of its own computing cluster. Then, a preset number of rounds of cyclic exchange is performed. In each round, each cluster center receives and sends in parallel through the on-chip network: it receives data from its upstream node in the ring and forwards to its downstream node either its own data (in the first round) or the data it received in the previous round. The cluster aggregated data is thus transmitted in a pipelined manner, and after the preset N-1 rounds of cyclic exchange, every piece of cluster aggregated data has reached every communication node hop by hop. This ring-based communication spreads the instantaneous bandwidth demand on the on-chip network evenly across the physical links in each round, avoiding congestion caused by data conflicts.

[0141] Next, once the ring exchange completes, each cluster center has accumulated the cluster aggregated data of all other computing clusters. Under the scheduling of the cluster center, this cluster aggregated data is written back, as global data, to the shared buffer of each cluster's shared memory, completing the transition from local aggregation to global consistency. The resulting global data provides a complete reference view for subsequent diffusion model computation. Because each computing module can directly access the latent representation of the entire video frame through its own cluster's shared memory, the attention mechanism can extract global features across the boundaries of physical patches, which supports the spatial and temporal coherence and accuracy of the video generation task.

[0142] In one embodiment, refer to Figure 11, which is a schematic diagram of assembling global data provided in an embodiment of this application. Figure 11 illustrates four cluster shared memories constructing global data through a switching hub within a preset number of rounds. It is assumed that all four cluster shared memories are connected to the same switching hub, and that each initially holds only one piece of cluster aggregated data, labeled p0, p1, p2, and p3 respectively. Each cluster shared memory, together with its corresponding cluster center, is treated as one communication node in a ring topology. With N=4 nodes, N-1=3 rounds of cyclic exchange are required, after which every node holds the complete set {p0, p1, p2, p3} as global data.

[0143] Specifically, in the first round of exchange, each node, under the address routing of the switching hub, sends its own cluster aggregated data to the next node in the ring. For example, the node holding p0 receives p3, the node holding p1 receives p0, and so on. At this point, each cluster shared memory holds two different pieces of cluster aggregated data. In the second round, each node forwards the cluster aggregated data it received in the previous round. After this round, each cluster shared memory has accumulated three pieces of cluster aggregated data; for example, the first node now holds {p0, p3, p2}. Then comes the final round of the preset cyclic exchange. Once the last missing piece of cluster aggregated data has been forwarded to its destination through the switching hub, each cluster shared memory has collected the cluster aggregated data of all other computing clusters, i.e., it holds {p0, p1, p2, p3}. It can be seen that, through three rounds of ordered cyclic exchange, the cluster aggregated data scattered across the cluster shared memories is finally transformed into consistent global data.
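
The three-round exchange walked through above can be reproduced with a small simulation. This is a sketch of a generic ring all-gather, assuming node i always sends to node (i+1) mod N; the function name `ring_all_gather` and the list-based bookkeeping are illustrative and do not model the switching hub or the on-chip network.

```python
def ring_all_gather(blocks):
    """Simulate a ring all-gather over len(blocks) nodes.

    Node i starts with blocks[i] and, in every round, sends one block to
    node (i + 1) % n: its own block first, then whatever it last received.
    Returns the final holdings and a snapshot of all holdings per round.
    """
    n = len(blocks)
    held = [[b] for b in blocks]      # blocks collected by each node, in arrival order
    in_flight = list(blocks)          # block each node sends in the coming round
    history = []
    for _ in range(n - 1):            # N-1 rounds of cyclic exchange
        # All nodes send and receive at once; node i receives from node i-1.
        received = [in_flight[(i - 1) % n] for i in range(n)]
        for i in range(n):
            held[i].append(received[i])
        in_flight = received          # forward the newly received block next round
        history.append([list(h) for h in held])
    return held, history


final, history = ring_all_gather(["p0", "p1", "p2", "p3"])
print(history[0][0])   # ['p0', 'p3']
print(history[1][0])   # ['p0', 'p3', 'p2']
print(final[0])        # ['p0', 'p3', 'p2', 'p1']
```

Every node sends exactly one block and receives exactly one block per round, which is the constant per-round load the ring scheme relies on.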

[0144] The above embodiments maximize bandwidth utilization. In each round of communication, each node simultaneously sends and receives data with a constant data volume, ensuring uniform utilization of the physical bandwidth of the switching hub and avoiding sudden congestion. Furthermore, executing a pre-defined ring routing protocol reduces the control overhead of the on-chip network, achieving extremely low-latency synchronization under high throughput. This ensures that before the next round of diffusion model denoising calculation begins, each computing cluster can directly read complete global context information from its local shared buffer, thereby guaranteeing the continuity of the generation task.

[0145] In one embodiment, in scenarios targeting ultra-large-scale computing power requirements, the hardware architecture of the computing chip can adopt a multi-chip distributed design. For example, the computing chip architecture illustrated in Figure 4 is distributed across at least two chips, each chip containing at least one computing cluster and having a chip center. The chip center communicates with all cluster centers within its chip and acts as the communication hub between the inside of the chip and the on-chip network outside it. In this architecture, cross-chip data synchronization also involves a physical routing layer. Refer to Figure 12, which is another schematic diagram, provided in an embodiment of this application, of obtaining global data by synchronizing the cluster aggregated data over the on-chip network through each cluster center, specifically including the following steps:

[0146] Step 1210: Use the cluster center to send the obtained cluster aggregated data to the chip center of the corresponding chip, and the chip center will perform address routing for cross-cluster access requests.

[0147] In one embodiment, the cluster aggregated data generated by each computing cluster is first aggregated within the chip. For the cluster aggregated data stored in its cluster shared memory, each cluster center initiates a transmission request to the chip center, and the chip center is configured to perform address routing for cross-cluster access requests.

[0148] Specifically, after receiving data packets from different cluster centers, the chip center parses them according to the globally unified address space to determine whether the destination belongs to another cluster within the same chip or to a remote chip. For intra-chip synchronization, the chip center forwards the data directly over a high-speed bus, achieving low-latency exchange. For cross-chip synchronization, the chip center logically encapsulates the datasets from multiple clusters in preparation for transmission over the higher-level physical links. This design prevents the cluster centers from competing directly for on-chip network bandwidth, and the chip center's pre-routing function reduces the risk of network congestion.
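
The routing decision described above can be sketched as a simple address decode. The embodiment does not define the layout of the globally unified address space, so the bit widths, the field order, and the names `Target`, `encode`, `decode`, and `route` below are all hypothetical.

```python
from dataclasses import dataclass

CLUSTER_BITS = 4    # hypothetical: up to 16 clusters per chip
OFFSET_BITS = 24    # hypothetical: 16 MiB addressable per cluster shared memory


@dataclass(frozen=True)
class Target:
    chip: int
    cluster: int
    offset: int


def encode(chip, cluster, offset):
    """Pack (chip, cluster, offset) into one global address."""
    return (chip << (OFFSET_BITS + CLUSTER_BITS)) | (cluster << OFFSET_BITS) | offset


def decode(address):
    """Split a global address back into its chip, cluster, and offset fields."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    cluster = (address >> OFFSET_BITS) & ((1 << CLUSTER_BITS) - 1)
    chip = address >> (OFFSET_BITS + CLUSTER_BITS)
    return Target(chip, cluster, offset)


def route(address, local_chip):
    """Decide whether a request stays inside the chip or must leave it."""
    target = decode(address)
    if target.chip == local_chip:
        return "intra-chip", target      # forward over the chip's internal bus
    return "cross-chip", target          # encapsulate for the on-chip network


print(route(encode(0, 3, 0x10), local_chip=0)[0])   # intra-chip
print(route(encode(1, 2, 0x00), local_chip=0)[0])   # cross-chip
```

Only requests classified as cross-chip would reach the higher-level links, which is how the chip center keeps cluster centers from contending for them directly.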

[0149] Step 1220: By connecting all chip centers to the on-chip network, the cluster aggregated data collected by each chip center is sent to other chips through the on-chip network for data synchronization, so that the shared memory of each cluster contains all the cluster aggregated data to form global data.

[0150] In one embodiment, during the cross-chip global exchange phase, each chip center acts as the sole access node of its chip to the on-chip network. Because the chip center has already collected and organized the cluster aggregated data of all clusters within its chip, it represents the entire chip in peer-to-peer data exchange with other chips. Through the on-chip network, the chip centers perform collective communication operations. For example, in an architecture with two or more chips, each chip center sends its locally collected cluster aggregated data to the remote chips while simultaneously receiving the corresponding datasets from them. This cross-chip synchronization ensures that all chips participating in the computation share the same data. Finally, through return and distribution over the on-chip network, the cluster aggregated data of all computing clusters on all chips is written back into each cluster's shared memory. This hierarchical synchronization mechanism ensures that each cluster's shared memory contains all cluster aggregated data, thus forming logically complete global data.

[0151] The hierarchical routing architecture based on chip centers described in this embodiment improves the scalability of the computing chip. When processing ultra-high-resolution video generation tasks, the data volume often exceeds the storage and bus limits of a single chip. In such cases, address routing and unified access through the chip centers reduce physical link complexity: the on-chip network only needs to connect a small number of chip centers rather than a large number of cluster centers, which reduces wiring difficulty. At the same time, by merging address requests, the chip centers reduce the number of small packets on the on-chip network, thereby increasing effective bandwidth.

[0152] In one embodiment, refer to Figure 13, which is a schematic diagram of cross-chip global data construction provided in an embodiment of this application. The diagram illustrates a computing device with at least two chips (e.g., chip 0 and chip 1), each deploying multiple computing clusters. Initially, the shared memory of each computing cluster holds only its own cluster aggregated data, i.e., one of p0, p1, p2, and p3. First, an intra-chip full collection is performed, synchronizing the data within a single chip. Taking chip 0 as an example, its computing clusters exchange the data they hold through address routing at the chip center. After this level of synchronization, all computing clusters within the same chip hold the complete dataset covered by that chip. For example, all clusters in chip 0 now hold p0 and p1, while all clusters in chip 1 hold p2 and p3. This local aggregation reduces the number of data packets in the subsequent cross-chip communication. Next, an inter-chip full collection is performed. Once the data within each chip is ready, each chip center connects to the on-chip network and initiates a device-level full collection operation. The chip center of chip 0 sends its locally aggregated dataset {p0, p1} to chip 1 while simultaneously receiving the dataset {p2, p3} from chip 1. Through this cross-chip data exchange, every computing cluster on every chip within the device eventually collects the cluster aggregated data of all clusters, thus constructing complete global data containing {p0, p1, p2, p3}.
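
The two-stage collection in this example can be sketched as follows. The function names and the message-counting model are assumptions made for illustration; in particular, the "flat" baseline, in which every cluster sends its block directly to every cluster on another chip, is a simplified comparison point and not a description of any specific related-art design.

```python
def hierarchical_all_gather(chips):
    """chips[c][k] is the block initially held by cluster k of chip c."""
    # Stage 1: intra-chip full collection through each chip center.
    per_chip = [list(clusters) for clusters in chips]
    # Stage 2: inter-chip full collection between chip centers.
    global_data = [block for chip_blocks in per_chip for block in chip_blocks]
    # Write-back: every cluster shared memory receives a copy of the global data.
    result = [[list(global_data) for _ in clusters] for clusters in chips]
    return per_chip, result


def cross_chip_messages(chips):
    """Count cross-chip messages under two simple models (illustrative only)."""
    n_chips = len(chips)
    total_clusters = sum(len(c) for c in chips)
    # Hierarchical: each chip center sends one merged dataset to every other chip.
    hierarchical = n_chips * (n_chips - 1)
    # Flat baseline (assumed): each cluster sends its block to every remote cluster.
    flat = sum(len(c) * (total_clusters - len(c)) for c in chips)
    return hierarchical, flat


chips = [["p0", "p1"], ["p2", "p3"]]           # chip 0 and chip 1, two clusters each
per_chip, result = hierarchical_all_gather(chips)
print(per_chip)                                # [['p0', 'p1'], ['p2', 'p3']]
print(result[0][0])                            # ['p0', 'p1', 'p2', 'p3']
print(cross_chip_messages(chips))              # (2, 8)
```

Under these assumptions, merging at the chip center cuts the cross-chip message count from 8 to 2 for the two-chip, four-cluster case, which matches the packet-reduction argument made above.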

[0153] Step 140: Store the global data in the cluster shared memory of each computing cluster so that each computing module in the computing cluster can directly access the global data.

[0154] In one embodiment, after the ring-based full collection synchronization across chips or clusters is completed via the on-chip network, the final data distribution and storage stage begins. Each cluster center receives the global data from the on-chip network and writes it into the cluster shared memory to which it is directly connected. Specifically, the data is stored in a pre-defined shared buffer within the cluster shared memory. Because the cluster shared memory has a physical or logical communication connection with the computing module of each in-memory computing unit in the computing cluster, the cluster aggregated data originally scattered across the entire chip becomes a complete data copy held independently by each computing cluster.

[0155] In related technologies, computing units need to traverse multiple bus layers to access the memory of other nodes, resulting in significant latency. However, in this embodiment, since the global data is stored in the shared buffer of the cluster's shared memory, the computing module only needs to initiate a local memory read request to obtain the features of the entire video frame when performing related attention calculations.

[0156] In one embodiment, refer to Figure 14, which is a schematic diagram of data processing over consecutive time steps provided in an embodiment of this application. Figure 14 illustrates a full collection being performed between two time steps, which keeps the distributed computation consistent over time. Specifically, in tasks such as video generation, the computation is divided into multiple consecutive time steps, and the diagram shows the iteration from time step t-1 to time step t. At the end of time step t-1, the local latent representation patches generated by the computing modules are aggregated through a full collection operation to obtain cluster aggregated data, which is then stored in a shared buffer in the cluster's shared memory. After the full collection operation, the cluster aggregated data of the different computing clusters is synchronized into global data, which is stored in the shared buffer and can be directly accessed by the computing modules within the same cluster. Next, on entering time step t, each computing module uses this global data as a context reference and extracts the necessary spatiotemporal features through an attention mechanism. Because this synchronization is performed between time steps, each newly generated frame can reference the complete global information of the previous frame, which mitigates common issues in video generation such as flickering or discontinuity.

[0157] With the architecture of the related art, it is difficult to handle the rapid growth in data scale when processing high-resolution video generation tasks. Take a raw image frame of 4096×4096×3 as an example. To reduce the computational burden, it is usually compressed into the latent space by an encoder for processing. Even with 8×8 downsampling, a 4-channel feature representation, and 2-byte data precision, the latent data of a single frame still amounts to 2 MB (512×512×4×2 bytes). With data at the scale of several megabytes, when the system adopts a patch parallelism strategy to divide the task into multiple patches and distribute them to different computing modules, the modules must frequently perform full collection operations to obtain the complete global context. Related-art solutions rely on on-chip networks, crossbar switches, or two-dimensional meshes to access remote global memory when handling data exchange on this scale. In this mode, the highly concurrent requests issued simultaneously by hundreds or thousands of computing modules quickly saturate the on-chip network bandwidth and memory bandwidth, resulting in severe communication congestion and data contention and thereby limiting the execution efficiency of the computing chip.
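
The 2 MB figure above follows directly from the stated parameters, as the short calculation below shows. The function name `latent_frame_bytes` is introduced here for illustration, and the result is expressed in units of 2^20 bytes.

```python
def latent_frame_bytes(height, width, downsample, channels, bytes_per_value):
    """Size in bytes of one latent frame after spatial downsampling."""
    latent_h = height // downsample
    latent_w = width // downsample
    return latent_h * latent_w * channels * bytes_per_value


size = latent_frame_bytes(4096, 4096, downsample=8, channels=4, bytes_per_value=2)
print(size, size / 2**20)   # 2097152 2.0
```

A 4096×4096 frame downsampled by 8 in each dimension gives a 512×512 latent grid, and 512×512×4×2 bytes is 2,097,152 bytes.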

[0158] This application embodiment achieves local storage by physically integrating the computing module with a private, stacked high-bandwidth memory module. The computation result of each patch can therefore be written directly to local memory, reducing the consumption of shared network bandwidth from the outset. Although the mathematical computation of the diffusion model itself is unchanged, this application embodiment introduces a ring-based communication algorithm on top of the chip architecture, reducing the complexity of the full collection operation from the non-linear growth of the traditional architecture to linear complexity. Across the preset N-1 rounds of cyclic exchange, the instantaneous communication load borne by each node remains constant, so that the system can maintain high throughput even when processing latent frame data at the 2 MB level. In addition, the hierarchical synchronization mechanism confines a large share of the communication to the lower-latency interior of each chip, significantly reducing long-distance data transport across chips. Data processing under this architecture not only reduces the equivalent total power consumption of the system, but also reduces the time computing modules sit idle waiting for data, helping the video generation task remain consistent and coherent in both spatial layout and temporal sequence.

[0159] Therefore, in this embodiment, through the physical integration of the computing module and the memory module, the relevant data generated by the computing module when processing local data blocks can be directly stored in its own bound private memory. This eliminates the need for each computing module to compete for shared memory and on-chip network bandwidth during local computing phases, fundamentally solving the communication congestion problem caused by a large number of computing units accessing the same memory resource at high frequency. Secondly, by using the cluster shared memory of the same computing cluster as a high-speed, centralized data exchange area, the frequent copying of data between multiple private memories by computing modules is avoided, significantly reducing the synchronization latency during multi-unit collaboration. Furthermore, the global synchronization operation is decomposed into multiple layers within and between the cluster. Through this layered All Gather operation, large-scale, highly complex communication tasks are transformed into multiple small-scale local communications, significantly reducing communication complexity. Therefore, this embodiment can improve the overall efficiency of data collection operations in the computing chip.

[0160] In the computing chip, data processing method, apparatus, device, and storage medium provided in this application embodiment, the computing chip includes at least one computing cluster. Each computing cluster includes: multiple in-memory computing units, each including a computing module and a memory module physically integrated with the computing module; a cluster center; cluster shared memory, communicatively connected to the cluster center and, via the cluster center, to each computing module in the multiple in-memory computing units; and an on-chip network, with the cluster center of each computing cluster connected to the on-chip network. This application embodiment first reduces the energy consumption of local data access through the physical integration of computing modules and memory modules. Then, through the cluster-level cluster shared memory design, it constructs a two-tier storage system combining local storage and cluster shared storage to optimize the data access path. When an in-memory computing unit needs to interact with other in-memory computing units within the cluster, it does not need to obtain data across clusters via the global on-chip network. Instead, it can retrieve the data directly from the cluster shared memory within the cluster, turning what was originally a transfer over the global on-chip network into a direct intra-cluster transfer. This significantly shortens the transmission distance and latency of cross-storage access and significantly improves the data transmission efficiency of the computing modules. Meanwhile, local access and shared memory access within the cluster do not consume bandwidth resources of the on-chip network. Data is transmitted through the on-chip network only when an access goes beyond the scope of the cluster shared memory, effectively reducing the number of concurrent access requests to the on-chip network and alleviating network congestion at its source.
Furthermore, the on-chip network does not need to handle massive random access, significantly simplifying its routing design and flow control mechanisms, thereby reducing chip area and hardware costs. In addition, each cluster center is individually connected to the on-chip network, allowing the on-chip network to handle only core data interactions between clusters, reducing invalid data transmission and lowering network access traffic. Therefore, this embodiment achieves efficient access to local data through in-memory computing physical integration, optimizes cross-storage access paths through cluster-level shared memory, simplifies the interaction logic of the on-chip network through a layered communication architecture, and significantly improves the data transmission efficiency of the computing modules. Simultaneously, by reducing on-chip network access requests and simplifying network design, it achieves a dual reduction in on-chip network access traffic and power consumption, while also reducing chip area and hardware costs, ultimately forming a computing chip architecture that integrates high-efficiency computing, low-power design, and cost optimization.

[0161] This application also provides a data processing apparatus applied to a computing chip and capable of implementing the above-described data processing method. Referring to Figure 15, the apparatus includes:

[0162] The local computing module 1510 is used to process the allocated local data blocks in parallel within each computing module at each time step, and store the obtained computing results into the corresponding memory module.

[0163] The cluster aggregation module 1520 is used to transfer the calculation results of the corresponding computing modules to the cluster shared memory to obtain the corresponding cluster aggregated data within the same computing cluster.

[0164] The on-chip synchronization module 1530 is used to synchronize the cluster aggregated data in the shared memory of the cluster with the on-chip network through each cluster center to obtain global data.

[0165] The global data access module 1540 is used to store global data in the shared memory of each cluster, so that each computing module in the computing cluster can directly access the global data.

[0166] The specific implementation of the data processing device in this embodiment is basically the same as the specific implementation of the data processing method described above, and will not be repeated here.

[0167] This application also provides an electronic device, including:

[0168] At least one memory;

[0169] At least one processor;

[0170] At least one program;

[0171] The program is stored in a memory, and the processor executes the at least one program to implement the data processing method described above in this application. The electronic device can be any smart terminal, including mobile phones, tablets, personal digital assistants (PDAs), in-vehicle computers, etc.

[0172] Referring to Figure 16, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:

[0173] The processor 1601 can be implemented using a general-purpose central processing unit (CPU), microprocessor, application specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.

[0174] The memory 1602 can be implemented as a read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory 1602 can store the operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1602 and is called by the processor 1601 to execute the data processing method of the embodiments of this application.

[0175] The input / output interface 1603 is used to implement information input and output;

[0176] The communication interface 1604 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB or Ethernet cable) or wireless means (such as mobile network, Wi-Fi, or Bluetooth).

[0177] Bus 1605 transmits information between various components of the device (e.g., processor 1601, memory 1602, input / output interface 1603, and communication interface 1604);

[0178] The processor 1601, memory 1602, input / output interface 1603 and communication interface 1604 are connected to each other within the device via bus 1605.

[0179] This application embodiment also provides a storage medium that stores a computer program, which, when executed by a processor, implements the above-described data processing method.

[0180] Memory, as a non-transitory storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0181] In the computing chip, data processing method, apparatus, device, and storage medium proposed in the embodiments of this application, the computing chip includes at least one computing cluster. Each computing cluster comprises: multiple in-memory computing units, each including a computing module and a memory module physically integrated with the computing module; a cluster center; a cluster shared memory, communicatively connected to the cluster center and, via the cluster center, to each computing module in the multiple in-memory computing units; and an on-chip network, with the cluster center of each computing cluster connected to the on-chip network. The embodiments first reduce the energy consumption of local data access through the physical integration of computing modules and memory modules. Then, through the cluster-level shared memory design, they construct a two-tier storage system combining local storage and cluster shared storage to optimize the data access path. When an in-memory computing unit needs to interact with other in-memory computing units within the same cluster, it does not need to obtain the data via the global on-chip network; it can retrieve the data directly from the cluster shared memory within the cluster. This turns what would otherwise be a global on-chip network transfer into a direct intra-cluster transfer, significantly shortening the transmission distance and latency of cross-storage access and significantly improving the data transmission efficiency of the computing modules. Meanwhile, local access and cluster shared memory access do not consume bandwidth of the on-chip network. Data is transmitted through the on-chip network only when the access falls outside the range of the cluster shared memory, which effectively reduces the number of concurrent access requests to the on-chip network and alleviates network congestion at its source.
Furthermore, the on-chip network does not need to handle massive random access, significantly simplifying its routing design and flow control mechanisms, thereby reducing chip area and hardware costs. In addition, each cluster center is individually connected to the on-chip network, allowing the on-chip network to handle only core data interactions between clusters, reducing invalid data transmission and lowering network access traffic. Therefore, this embodiment achieves efficient access to local data through in-memory computing physical integration, optimizes cross-storage access paths through cluster-level shared memory, simplifies the interaction logic of the on-chip network through a layered communication architecture, and significantly improves the data transmission efficiency of the computing modules. Simultaneously, by reducing on-chip network access requests and simplifying network design, it achieves a dual reduction in on-chip network access traffic and power consumption, while also reducing chip area and hardware costs, ultimately forming a computing chip architecture that integrates high-efficiency computing, low-power design, and cost optimization.
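
The tiered access behavior summarized above can be sketched as a simple address-range check. This is a minimal illustration under stated assumptions: the class `ClusterAddressMap`, the helper `traffic_by_path`, and the address ranges are hypothetical and do not come from the embodiments. In hardware, the private path would be resolved inside the in-memory computing unit without involving the cluster center; the model folds all three decisions into one function for readability.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ClusterAddressMap:
    """Hypothetical address layout as seen by one computing module."""
    private_base: int   # start of the module's own memory module
    private_size: int
    shared_base: int    # start of the cluster shared memory
    shared_size: int

    def route(self, addr: int) -> str:
        """Return which access path a request to `addr` would take."""
        if self.private_base <= addr < self.private_base + self.private_size:
            return "private"            # never leaves the in-memory computing unit
        if self.shared_base <= addr < self.shared_base + self.shared_size:
            return "cluster_shared"     # via the cluster center, stays in the cluster
        return "on_chip_network"        # forwarded by the cluster center to the NoC


def traffic_by_path(amap: ClusterAddressMap, addrs: Iterable[int]) -> Counter:
    """Count how many requests take each path (illustrative only)."""
    return Counter(amap.route(a) for a in addrs)


if __name__ == "__main__":
    amap = ClusterAddressMap(private_base=0x0000, private_size=0x1000,
                             shared_base=0x1000, shared_size=0x0800)
    print(traffic_by_path(amap, [0x0010, 0x0020, 0x1400, 0x2000]))
```

Only requests that fall outside both ranges reach the on-chip network in this model, which is the property the paragraph relies on when it describes reduced network traffic.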

[0182] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0183] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, combine certain steps, or include different steps.

[0184] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0185] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.

[0186] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0187] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0188] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0189] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0190] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0191] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0192] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.

Claims

1. A computing chip, characterized by comprising: at least one computing cluster; the computing cluster comprising: multiple in-memory computing units, each of the in-memory computing units including a computing module and a memory module physically integrated with the computing module; a cluster center; a cluster shared memory, communicatively connected to the cluster center and, through the cluster center, to each computing module in the plurality of in-memory computing units; and an on-chip network, with the cluster center of each computing cluster connected to the on-chip network; wherein the cluster shared memory of the computing cluster is an on-chip memory, which has a physical address space independent of the private cache hierarchy of the computing modules, and is configured to be directly accessible by all computing modules within the computing cluster; the computing chip has a three-level data access path, including: a private access path for the computing module to access its corresponding memory module; an intra-cluster shared path for the computing module to access the cluster shared memory through the cluster center; and a global access path for the computing module to access resources outside the computing cluster by accessing the on-chip network through the cluster center; and the cluster shared memory is configured to store data shared by multiple computing modules within the computing cluster, and the computing modules can access the cluster shared memory without going through the on-chip network.

2. The computing chip according to claim 1, characterized in that the computing module and the memory module are vertically integrated using three-dimensional stacking technology, and the memory module is configured as the private memory of the computing module.

3. The computing chip according to claim 1, characterized in that the cluster shared memory of the computing cluster is connected to the on-chip network via the cluster center.

4. The computing chip according to claim 1, characterized in that the on-chip network includes a switch corresponding to each computing cluster, and the cluster center is connected to the on-chip network through the switch.

5. The computing chip according to claim 1, characterized in that the computing chip comprises at least two wafers, and each wafer comprises at least one computing cluster.

6. The computing chip according to claim 5, characterized in that each of the wafers has a chip center, which is communicatively connected to all the cluster centers within the corresponding wafer; the chip center is configured to perform address routing for cross-cluster access requests and to uniformly access the on-chip network.

7. The computing chip according to claim 1, characterized in that the cluster center is equipped with an address discrimination circuit, which responds to an access request initiated by the computing module and, according to the target address, directs the access request to the cluster shared memory or forwards it to the on-chip network.

8. The computing chip according to claim 7, characterized in that the address discrimination circuit is specifically configured as follows: if the target address of the access request initiated by the computing module is located in the cluster shared memory of the computing cluster, the access request is directed to the cluster shared memory; if the target address is located outside the computing cluster, the access request is forwarded to the on-chip network for routing transmission.

9. The computing chip according to any one of claims 1 to 8, characterized in that the memory module is 3D-DRAM, and the cluster shared memory is SRAM.

10. The computing chip according to claim 1, characterized in that the path for the computing module to access its corresponding memory module does not pass through the cluster center or the on-chip network.

11. A data processing method, applied to a computing chip as described in any one of claims 1 to 10, characterized by comprising: within each time step, processing the allocated local data blocks in parallel in each of the computing modules, and storing the obtained calculation results in the corresponding memory modules; within the same computing cluster, transferring the calculation results of the corresponding computing modules to the cluster shared memory to obtain the corresponding cluster aggregated data; synchronizing the cluster aggregated data in the cluster shared memory over the on-chip network through each cluster center to obtain global data; and storing the global data in each cluster shared memory, so that each computing module in the computing cluster can directly access the global data.

12. The data processing method according to claim 11, characterized in that the cluster shared memory includes a shared buffer for storing the cluster aggregated data and the global data.

13. The data processing method according to claim 11 or 12, characterized in that the method is applied to image or video generation tasks based on a diffusion model, where the local data block is a latent representation patch of the image or video in the latent space.

14. The data processing method according to claim 13, characterized in that the calculation result is a denoised latent representation patch, and the cluster aggregated data is a latent frame composed of multiple denoised latent representation patches.

15. The data processing method according to claim 14, characterized in that, depending on the generation stage, the latent frame includes at least one of a global latent frame, a window latent frame, and a current latent frame; the global latent frame contains the complete latent representation of the current target object to be generated; the window latent frame consists of latent frames from a preset number of time steps prior to the current time step; and the current latent frame is the latent representation to be processed at the current time step.

16. The data processing method according to claim 11, characterized in that the step of synchronizing the cluster aggregated data in the cluster shared memory over the on-chip network through each cluster center to obtain global data includes: sending the cluster aggregated data in the cluster shared memory to the corresponding cluster center; and performing an all-gather operation among the cluster centers over the on-chip network, so that each cluster center obtains the cluster aggregated data from the other computing clusters.

17. The data processing method according to claim 16, characterized in that the all-gather operation employs a ring-based communication algorithm, using each of the cluster centers participating in the synchronization as a communication node, and performing a preset number of cyclic exchanges to enable each cluster center to obtain the complete global data.

18. The data processing method according to claim 11, characterized in that, when the computing chip is the computing chip as described in claim 6, the step of synchronizing the cluster aggregated data over the on-chip network through each of the cluster centers to obtain global data includes: sending, by the cluster center, the obtained cluster aggregated data to the chip center of the corresponding chip, the chip center performing address routing for cross-cluster access requests; and uniformly accessing the on-chip network through each of the chip centers, so that the cluster aggregated data collected by each chip center is sent to the other chips through the on-chip network for data synchronization, whereby each cluster shared memory contains all the cluster aggregated data, which forms the global data.

19. A data processing apparatus, applied to a computing chip as described in any one of claims 1 to 10, characterized by comprising: a local computing module, used to process the allocated local data blocks in parallel within each computing module at each time step, and store the obtained calculation results in the corresponding memory module; a cluster aggregation module, used to transfer, within the same computing cluster, the calculation results of the corresponding computing modules to the cluster shared memory to obtain the corresponding cluster aggregated data; an on-chip synchronization module, used to synchronize the cluster aggregated data in the cluster shared memory over the on-chip network through each of the cluster centers to obtain global data; and a global data access module, used to store the global data in each cluster shared memory, so that each computing module in the computing cluster can directly access the global data.

20. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the data processing method according to any one of claims 11 to 18.

21. A storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the data processing method according to any one of claims 11 to 18.