Connectivity-aware near-memory computing apparatus and system for large-scale graph computation

By dividing graph data into three types of triangles in the near-memory computing module and adopting an index computing and dual-CAM heterogeneous processing architecture, the problem of low triangle counting efficiency in the existing technology is solved, and efficient triangle counting and improved resource utilization are achieved.

CN122240965A · Pending · Publication Date: 2026-06-19 · BEIHANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIHANG UNIV
Filing Date
2026-03-13
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing near-memory computing schemes ignore the density distribution differences of graph structures when performing triangle counting, resulting in low DRAM access efficiency and performance bottlenecks in both space and time. This is especially true when dealing with long adjacency lists, causing cache thrashing and resource contention.

Method used

A connectivity-aware near-memory computing module is adopted. The graph data is pre-divided into three types of triangles. An index computing module is used to replace the set intersection operation to count the first type of triangles. A dual-CAM heterogeneous processing architecture is adopted to physically separate the counting of the second and third types of triangles, thereby improving resource utilization and processing efficiency.

Benefits of technology

It removes the performance bottleneck of traditional solutions, improves triangle counting efficiency in large-scale graph computation, avoids cache thrashing and memory overflow, and improves data throughput and processing speed.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the field of near-memory computing and discloses a connectivity-aware near-memory computing device and system for large-scale graph computing. The near-memory computing module includes a decoder, an index calculation module, a first CAM, and a second CAM. The decoder reads graph data of the target graph from an external first memory; the index calculation module calculates the indices of hub edges using arithmetic and logical operations; the decoder queries the first bitmap based on the edge indices to count the number of first-type triangles, which contain at least two hub vertices; the first CAM traverses all non-hub vertices in the graph data and counts the number of second-type triangles, which consist of three non-hub vertices; the second CAM traverses each non-hub vertex in the graph data and counts the number of third-type triangles, which consist of two non-hub vertices and one hub vertex. This solution can efficiently handle triangle counting in large-scale graph structures, resolving cache thrashing and performance bottlenecks.

Description

Technical Field

[0001] This application relates to the field of near-memory computing, specifically to a connectivity-aware near-memory computing device and system for large-scale graph computing.

Background Technology

[0002] Triangle counting, a crucial technique in graph computation, is widely used in fields such as social network analysis, spam detection, and complex relationship mining. By quantifying the number of triadic closure structures in a graph, it can measure the tightness of user relationships, discover communities, and identify abnormal patterns. With the rapid increase in graph data size, triangle counting has been introduced into near-memory computing schemes to alleviate the performance bottlenecks of traditional processor-centric architectures (CPU/GPU). Near-memory computing integrates computing units within or near memory (such as DRAM), utilizing the high bandwidth within memory to reduce data movement between the processor and main memory, thus overcoming the "memory wall" bottleneck.

[0003] Current near-memory computing schemes often ignore the density distribution differences in graph structures when performing triangle counting. Because graph data is inherently sparse and accessed randomly, these schemes incur numerous non-contiguous memory accesses during set intersection operations, leading to low DRAM (Dynamic Random Access Memory) access efficiency and creating performance bottlenecks in both space and time. Furthermore, reading long adjacency lists can cause severe cache thrashing, and in some cases the capacity of the content-addressable memory (CAM) may be insufficient to hold long adjacency lists. Therefore, finding a near-memory computing scheme capable of efficiently counting triangles in large-scale graphs is a pressing issue.

Summary of the Invention

[0004] The purpose of this application is to provide a connectivity-aware near-memory computing device and system for large-scale graph computing, so as to achieve efficient triangle counting for large-scale graphs.

[0005] To achieve the above objectives, the technical solution of this application is as follows. In a first aspect, embodiments of this application provide a connectivity-aware near-memory computing module for large-scale graph computing, the module comprising: a decoder configured to read graph data of a target graph from an external first memory, where the graph data of the target graph includes information on all points in the target graph and their corresponding adjacent points, and where, in the graph data, the indices of all edges consisting of two Hub points are stored in a first bitmap; an index calculation module configured to traverse all points in the graph data and calculate, through arithmetic logic, the index of the edge formed by the two adjacent Hub points corresponding to each point, where the Hub points are the points within a first proportion at the top of the order obtained by sorting all points by degree from high to low; the decoder being further configured to query the first bitmap based on the index of the edge and count the number of first-type triangles containing at least two Hub points in the graph data; a first CAM configured to, after the number of first-type triangles has been counted, traverse all non-Hub points in the graph data and count the number of second-type triangles composed of three non-Hub points in the graph data; and a second CAM configured to, after the number of second-type triangles has been counted, traverse each non-Hub point in the graph data and count the number of third-type triangles formed by two non-Hub points and one Hub point in the graph data.

[0006] Optionally, the module further includes a controller; the index calculation module is configured to calculate the index of the edge formed by the two adjacent Hub points corresponding to each point, including: taking the two adjacent Hub points corresponding to the target point being traversed as the first vertex and the second vertex respectively, and calculating the index of the edge formed by the first vertex and the second vertex according to the index of the first vertex and the index of the second vertex. The decoder is configured to query the first bitmap based on the edge index and count the number of first-type triangles containing at least two Hub points in the graph data, including: querying the first bitmap based on the edge index; and sending a first signal to the controller when the value at the corresponding position in the first bitmap is 1. The controller is configured to perform a count of the first type of triangles based on the first signal.
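The index calculation and bitmap query described above can be sketched in Python. The source does not give the index formula, so the triangular packing `idx = hi*(hi-1)/2 + lo` used here is an assumption, as are the names `hub_edge_index`, `build_bitmap`, `bitmap_has_edge` and `count_type1`; Hub points are assumed to be numbered 0..H-1 by degree rank.

```python
from itertools import combinations

def hub_edge_index(i: int, j: int) -> int:
    """Map an unordered pair of distinct Hub indices to a bit position.

    Assumed layout (not specified in the source): triangular packing,
    idx = hi*(hi-1)/2 + lo, computed with arithmetic only.
    """
    if i == j:
        raise ValueError("an edge needs two distinct Hub points")
    lo, hi = (i, j) if i < j else (j, i)
    return hi * (hi - 1) // 2 + lo

def build_bitmap(num_hubs: int, hub_edges) -> bytearray:
    """Set one bit per existing Hub-Hub edge (1 = edge, 0 = no edge)."""
    bits = bytearray((num_hubs * (num_hubs - 1) // 2 + 7) // 8)
    for i, j in hub_edges:
        idx = hub_edge_index(i, j)
        bits[idx >> 3] |= 1 << (idx & 7)
    return bits

def bitmap_has_edge(bits: bytearray, i: int, j: int) -> bool:
    """Decoder query: read the single bit addressed by the edge index."""
    idx = hub_edge_index(i, j)
    return bool((bits[idx >> 3] >> (idx & 7)) & 1)

def count_type1(hub_adj: dict, bits: bytearray) -> int:
    """hub_adj maps each point to the indices of its adjacent Hub points."""
    count = 0
    for hubs in hub_adj.values():
        for first, second in combinations(hubs, 2):
            if bitmap_has_edge(bits, first, second):
                count += 1  # corresponds to the "first signal" to the controller
    return count
```

No adjacency list of a Hub point is read or intersected in this sketch; each candidate triangle costs one index calculation and one bit lookup.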

[0007] Optionally, the module further includes a controller; the first CAM is specifically configured to query the first non-Hub point adjacency list of the target non-Hub point currently being traversed, and the second non-Hub point adjacency list corresponding to each non-Hub point in the first non-Hub point adjacency list, and determine whether there is an edge between the target non-Hub point and the non-Hub points in the second non-Hub point adjacency list. If an edge exists between the target non-Hub point and a non-Hub point in the second non-Hub point adjacency list, a second signal is sent to the controller. The controller is also configured to perform a count of the second type of triangles based on the second signal.
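A minimal software model of this second-type (NNN) count is given below. It is a sketch under assumptions: the `Cam` class is a hypothetical stand-in that models content lookup with a Python set, and the adjacency lists are assumed to be directed from lower-degree to higher-degree points so that each triangle is found once.

```python
class Cam:
    """Toy content-addressable memory: load entries, then search by content."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = set()

    def load(self, keys) -> None:
        keys = list(keys)
        if len(keys) > self.capacity:
            raise OverflowError("adjacency list exceeds CAM capacity")
        self._entries = set(keys)

    def match(self, key) -> bool:
        return key in self._entries

def count_type2(non_hub_adj: dict, cam: Cam) -> int:
    """non_hub_adj maps each non-Hub point to its non-Hub adjacent points."""
    count = 0
    for target, first_list in non_hub_adj.items():
        cam.load(first_list)  # first non-Hub point adjacency list
        for neighbor in first_list:
            # second non-Hub point adjacency list, streamed through the CAM
            for candidate in non_hub_adj.get(neighbor, ()):
                if cam.match(candidate):
                    count += 1  # corresponds to the "second signal"
    return count
```

For example, with directed edges D→E, D→F and E→F, loading D's list {E, F} and streaming E's list [F] produces exactly one match, the triangle D, E, F.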

[0008] Optionally, the module further includes a controller; the second CAM is specifically configured to obtain the base edges, each formed by the target non-Hub point and one of its non-Hub adjacent points, as found by querying the first CAM; based on the base edges, query the Hub point adjacency list of the non-Hub adjacent points, and determine whether there is an edge between the target non-Hub point and a Hub point in the Hub point adjacency list. If an edge exists between the target non-Hub point and a Hub point in the Hub point adjacency list, a third signal is sent to the controller. The controller is also configured to perform a count of the third type of triangles based on the third signal.
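The third-type (HNN) count can be sketched in the same style. This is an illustration only: the function name `count_type3` and the dictionary layout are assumptions, and a Python set stands in for the contents of the second CAM.

```python
def count_type3(non_hub_adj: dict, hub_adj: dict) -> int:
    """Count triangles with one Hub point and two non-Hub points.

    non_hub_adj: non-Hub point -> its non-Hub adjacent points (base edges)
    hub_adj:     non-Hub point -> its adjacent Hub points
    """
    count = 0
    for target, neighbors in non_hub_adj.items():
        # Hub adjacency list of the target, as it would sit in the second CAM
        target_hubs = set(hub_adj.get(target, ()))
        for neighbor in neighbors:  # base edge (target, neighbor)
            for hub in hub_adj.get(neighbor, ()):
                if hub in target_hubs:
                    count += 1  # corresponds to the "third signal"
    return count
```

Each base edge is closed into a triangle by every Hub point that is adjacent to both of its endpoints.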

[0009] Optionally, the graph data further includes: a first list obtained by sorting all points in the target graph according to their degree from high to low; the controller is further configured to perform the following steps: traverse all points in the first list, control the decoder to read the indices of the Hub adjacent points of each point from the first memory in sequence, and transmit them to the index calculation module; after traversing all points in the first list, traverse the non-Hub points in the first list again, control the decoder to read the non-Hub point adjacency list of each target non-Hub point and the non-Hub point adjacency list of each non-Hub adjacent point of the target non-Hub point from the first memory in sequence, and send them to the first CAM; after the number of second-type triangles has been counted, control the decoder to read from the first memory, in turn, the Hub point adjacency list of each base edge, where each base edge is formed by a target non-Hub point and its corresponding non-Hub adjacent point as found by the first CAM, and send it to the second CAM.

[0010] Optionally, the module further includes an on-chip cache submodule; the decoder is also configured to perform the following steps: Before reading the information of the adjacent points corresponding to the currently traversed point from the first memory, query whether the information of the adjacent points is cached in the on-chip cache submodule; If the information of the adjacent points is stored in the on-chip cache submodule, the graph data is read from the on-chip cache submodule; If the information of the adjacent point is not stored in the on-chip cache submodule, the information of the adjacent point is read from the first memory and written to the on-chip cache submodule.
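The cache check performed by the decoder behaves like a read-through cache, which can be modelled as follows. The class name `Decoder`, the `cache_lines` parameter, and the LRU eviction policy are assumptions made for this sketch; the source does not state a replacement policy.

```python
from collections import OrderedDict

class Decoder:
    """Read adjacency information, checking the on-chip cache first."""

    def __init__(self, first_memory: dict, cache_lines: int):
        self.first_memory = first_memory   # stands in for external DRAM
        self.cache = OrderedDict()         # on-chip cache submodule
        self.cache_lines = cache_lines
        self.memory_reads = 0

    def read_adjacency(self, point):
        if point in self.cache:            # hit: serve from the cache
            self.cache.move_to_end(point)
            return self.cache[point]
        self.memory_reads += 1             # miss: read the first memory
        data = self.first_memory[point]
        self.cache[point] = data           # write into the cache
        if len(self.cache) > self.cache_lines:
            self.cache.popitem(last=False) # assumed LRU eviction
        return data
```

Counting `memory_reads` makes the effect visible: repeated reads of the same point cost one access to the first memory until the entry is evicted.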

[0011] In a second aspect, embodiments of this application provide a connectivity-aware near-memory computing device for large-scale graph computing, comprising: multiple processing units; wherein each processing unit includes: a buffer module and multiple first memories; the buffer module includes: a memory controller and multiple near-memory computing modules as provided in the first aspect of embodiments of this application. The memory controller is configured to perform the following steps: receive multiple batches of graph data for a target graph and corresponding batch processing tasks sent by an external processor, wherein each batch of graph data includes the first bitmap and information on the adjacency lists of all points in the target graph; write the multiple batches of graph data and the corresponding batch processing tasks into the first memory; control each near-memory computing module to process each batch processing task, reading from the first memory and counting the number of triangles contained in the graph data of the corresponding batch to obtain local statistical results; and calculate the total number of triangles contained in the target graph based on the local statistical results of each near-memory computing module.

[0012] In a third aspect, embodiments of this application provide a connectivity-aware near-memory computing system for large-scale graph computing, including: a processor and at least one near-memory computing device as provided in the second aspect of embodiments of this application. The processor is configured to segment the graph to be processed based on the number of near-memory computing devices, generate graph data corresponding to multiple batches of target graphs, and construct corresponding batch processing tasks, which are then distributed to each near-memory computing device; wherein the graph data of each batch includes: the first bitmap and information on the adjacency lists of all points in the target graph. Each near-memory computing device is configured to count the number of triangles contained in the graph data of the corresponding batch based on the received batch processing task, and return the statistical results to the processor. The processor is also configured to calculate the total number of triangles contained in the target graph based on the statistical results returned by each near-memory computing device.

[0013] Optionally, the processor is configured to segment the graph to be processed based on the number of near-memory computing devices, generate graph data corresponding to multiple batches of target graphs, and construct corresponding batch processing tasks, including: sorting the points in the graph to be processed from highest to lowest degree to generate a first list, wherein the degree of each point is related to the number of edges connected to the point; according to the first proportion, selecting the top-ranked points from the first list and determining them as Hub points; specifying the direction of each edge in the graph to be processed as pointing from the endpoint with the lower degree to the endpoint with the higher degree; constructing the first bitmap based on all edges consisting of two Hub points; constructing the Hub point adjacency list and the non-Hub point adjacency list for each point in the graph to be processed; dividing all Hub points into multiple Hub groups, and dividing all non-Hub points into multiple non-Hub groups; constructing a batch of graph data based on the first bitmap and at least one of the following: an adjacency list of a Hub group, or an adjacency list of a non-Hub group; and constructing corresponding batch processing tasks based on the graph data of each batch.
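The preprocessing steps above (sort by degree, select Hub points, orient edges, split adjacency lists) can be sketched as follows. The function name `preprocess`, the alphabetical tie-break between points of equal degree, and rounding the Hub count down are assumptions of this sketch, not details from the source.

```python
def preprocess(edges, hub_ratio: float):
    """Return (first_list, num_hubs, hub_edges, hub_adj, non_hub_adj)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # First list: points sorted by degree, high to low (ties by name: assumption).
    first_list = sorted(degree, key=lambda p: (-degree[p], p))
    rank = {p: r for r, p in enumerate(first_list)}
    num_hubs = int(len(first_list) * hub_ratio)  # top fraction become Hub points

    hub_adj = {p: [] for p in first_list}
    non_hub_adj = {p: [] for p in first_list}
    hub_edges = set()  # (rank, rank) pairs to be stored in the first bitmap

    for u, v in edges:
        # Direct each edge from the lower-degree point to the higher-degree point.
        src, dst = (u, v) if rank[u] > rank[v] else (v, u)
        if rank[dst] < num_hubs:
            hub_adj[src].append(dst)
            if rank[src] < num_hubs:
                hub_edges.add((rank[dst], rank[src]))
        else:
            non_hub_adj[src].append(dst)
    return first_list, num_hubs, hub_edges, hub_adj, non_hub_adj
```

Because every edge is stored only at its lower-degree endpoint, each edge is read once during counting.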

[0014] Optionally, the processor is configured to divide all Hub points into multiple Hub groups, specifically including: based on the number of all near-memory computing modules contained in the system, allocating corresponding Hub points to each near-memory computing module in a round-robin manner to obtain the Hub group corresponding to each near-memory computing module; All non-Hub points are divided into multiple non-Hub groups. Specifically, this involves dividing all non-Hub points into multiple non-Hub groups based on the capacity threshold of the CAM in each near-memory computing device and the data volume of the Hub adjacency list and the non-Hub adjacency list of each non-Hub point. In each non-Hub group, the data volume of the adjacency list of each first non-Hub point and the total data volume of the adjacency list of each second Hub point are both less than or equal to the capacity threshold.
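One way to read the grouping rules is sketched below. The round-robin assignment follows the text directly; the greedy packing of non-Hub points is only one possible interpretation of the capacity constraint, and the names `round_robin_groups` and `pack_non_hub_groups`, along with measuring data volume in list entries, are assumptions.

```python
def round_robin_groups(hub_points, num_modules: int):
    """Assign Hub points to near-memory computing modules in turn."""
    groups = [[] for _ in range(num_modules)]
    for i, hub in enumerate(hub_points):
        groups[i % num_modules].append(hub)
    return groups

def pack_non_hub_groups(points, non_hub_len: dict, hub_len: dict, capacity: int):
    """Greedily pack non-Hub points so that, per group, the non-Hub list
    volume and the Hub list volume each stay within the CAM capacity."""
    groups, current, used_n, used_h = [], [], 0, 0
    for p in points:
        n, h = non_hub_len[p], hub_len[p]
        if n > capacity or h > capacity:
            raise ValueError(f"adjacency list of {p!r} exceeds CAM capacity")
        if current and (used_n + n > capacity or used_h + h > capacity):
            groups.append(current)           # close the full group
            current, used_n, used_h = [], 0, 0
        current.append(p)
        used_n += n
        used_h += h
    if current:
        groups.append(current)
    return groups
```

Tracking the two volumes separately mirrors the dual-CAM design, in which non-Hub lists and Hub lists occupy different memories.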

[0015] This application pre-stores the indices of edges formed by two Hub points in the target graph in the first bitmap. When processing graph data, the near-memory computing module performs arithmetic logic calculations through the index calculation module to obtain the corresponding edge indices, and then queries the first bitmap through the decoder to determine whether a first-type triangle exists. Compared to traditional solutions, this solution replaces set intersection operations with arithmetic logic index calculations, saving a large number of memory access operations and conserving memory resources. It improves processing efficiency while avoiding cache thrashing.

[0016] Furthermore, this solution employs a dual-CAM heterogeneous processing architecture. By physically isolating the two CAMs and using pipelined processing, the second type of triangle counting (without Hub points) is physically separated from the third type of triangle counting (with Hub points). This avoids long and short adjacency lists competing for resources in the same memory, improving data throughput and processing efficiency. This method can solve the performance bottleneck of traditional near-memory computing schemes and efficiently handle triangle counting tasks in large-scale graphs.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram of a connectivity-aware near-memory computing module for large-scale graph computing proposed in an embodiment of this application;
Figure 2 is a schematic diagram showing the connection of points in the graph structure;
Figure 3 is a schematic diagram of the architecture of the Rank-level processing unit in a DDR4 memory according to an embodiment of this application;
Figure 4 is a schematic diagram of the operation of a connectivity-aware near-memory computing module for large-scale graph computing deployed in a DDR4 memory according to one embodiment of this application;
Figure 5 is a schematic diagram of a connectivity-aware near-memory computing system for large-scale graph computing distributing batch computing tasks in one embodiment of this application;
Figure 6 is a comparison chart of the end-to-end runtime of different near-memory computing schemes;
Figure 7 is a comparison chart of data access volume for different near-memory computing schemes;
Figure 8 is a comparison chart of runtime for different data distribution strategies used in multi-rank architectures.

Detailed Implementation

[0019] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0020] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.

[0021] In the various embodiments of this application, it should be understood that the sequence number of each process described below does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0022] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects as detailed in this application.

[0023] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other.

[0024] Current near-memory computing schemes typically ignore density differences in graph structures when counting triangles. Due to the sparsity and randomness of graph data, intersecting adjacency lists requires frequent non-contiguous memory accesses, leading to inefficient DRAM access. Furthermore, the adjacency lists of Hub vertices (highly connected vertices) in the graph are extremely long, and processing these vertices quickly fills and overflows the on-chip cache, causing severe cache thrashing and preventing valid data from residing in the cache. Schemes such as DIMMining / ProMiner have high computational complexity and lack optimization for regions with different densities, resulting in significant data movement.

[0025] Furthermore, during triangle counting, the set intersection operation consumes a significant amount of computation time. Merge-based methods such as DIMMining / ProMiner have low parallelism, while content-addressable memory (CAM)-based methods such as CLAP / CRISP are limited by CAM capacity and struggle to handle the extremely long adjacency lists of Hub vertices.

[0026] In this application, the graph structure is pre-divided according to the type of triangles to obtain three types of graph data. Then, multiple calculation methods are used to count the number of triangles from each type of graph data. For Hub points with long adjacency lists, an index calculation module is used to perform arithmetic logic calculations instead of set intersection operations, saving resource overhead and avoiding cache overflow. At the same time, a dual-CAM heterogeneous processing architecture is adopted to physically separate the counting of the second type of triangles that do not contain Hub points from the counting of the third type of triangles that do contain Hub points, improving the utilization of CAM memory resources.

[0027] The present application will now be described in detail with reference to the accompanying drawings and embodiments.

[0028] Figure 1 is a schematic diagram of a connectivity-aware near-memory computing module 100 for large-scale graph computing according to an embodiment of this application. As shown in Figure 1, the near-memory computing module includes: a decoder 101 configured to read graph data of a target graph from an external first memory, where the graph data of the target graph includes information on all points in the target graph and their corresponding adjacent points, and where, in the graph data, the indices of all edges consisting of two Hub points are stored in a first bitmap; an index calculation module 102 configured to traverse all points in the graph data and calculate, through arithmetic logic, the index of the edge formed by the two adjacent Hub points corresponding to each point, where the Hub points are the points within a first proportion at the top of the order obtained by sorting all points by degree from high to low; the decoder 101 being further configured to query the first bitmap based on the edge index and count the number of first-type triangles containing at least two Hub points in the graph data; a first CAM 103 configured to, after the number of first-type triangles has been counted, traverse all non-Hub points in the graph data and count the number of second-type triangles formed by three non-Hub points in the graph data; and a second CAM 104 configured to, after the number of second-type triangles has been counted, traverse each non-Hub point in the graph data and count the number of third-type triangles formed by two non-Hub points and one Hub point in the graph data.

[0029] In this embodiment, a connectivity partitioning strategy is used to divide the target graph data into three parts, corresponding to three types of triangles. The triangle counts for the three parts are then processed by the index calculation module, the first CAM, and the second CAM within the near-memory computing module, respectively. Specifically, a first proportion of points with the highest degrees are selected from the graph to be processed and designated as Hub points, while the remaining points are designated as non-Hub points. The degree of a point is the number of edges directly connected to it. In this scheme, all points are pre-sorted according to their degree from high to low, and the points within the first proportion at the top are selected as Hub points.

[0030] Figure 2 is a schematic diagram showing the connection of points in a graph structure. As shown in Figure 2, in one embodiment, the points "A, B, C, G", which make up the top 33% of points by degree (i.e., the number of edges directly connected to each point), are selected as Hub points. The remaining points "D, E, F, H, I, J, K, L" are non-Hub points. Since Hub points have many adjacent points, their adjacency lists are quite long. For large-scale graph structures, using traditional set intersection operations results in a large number of memory accesses when processing Hub points, leading to low computational efficiency, cache thrashing, and the risk of memory overflow. In this embodiment, triangles are classified into three types based on Hub points: the first type of triangle: HHN triangles containing 2 Hub points and HHH triangles containing 3 Hub points; the second type of triangle: NNN triangles containing 3 non-Hub points (i.e., without Hub points); the third type of triangle: HNN triangles containing one Hub point and two non-Hub points.
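The classification rule depends only on how many of a triangle's three points are Hub points, which the following sketch makes explicit. The function name `triangle_type` is hypothetical, and the point triples in the usage below are illustrative combinations of the labels from Figure 2, not triangles confirmed to exist in that figure.

```python
def triangle_type(points, hubs) -> int:
    """Return 1, 2 or 3 according to the number of Hub points in a triangle."""
    hub_count = sum(p in hubs for p in points)
    if hub_count >= 2:
        return 1  # HHN or HHH: handled by index calculation and the first bitmap
    if hub_count == 0:
        return 2  # NNN: handled by the first CAM
    return 3      # HNN: handled by the second CAM

hubs = {"A", "B", "C", "G"}  # Hub points named in the Figure 2 example
kinds = [triangle_type(t, hubs) for t in (("A", "B", "D"), ("D", "E", "F"), ("B", "H", "I"))]
```

The three cases are mutually exclusive and cover every triangle, so summing the three counts gives the total.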

[0031] Because the connections between Hub points are extremely dense and their adjacency lists are quite long, this embodiment counts the first type of triangles using index-based arithmetic logic calculations in place of the traditional set intersection operation on adjacency lists. This improves computational efficiency and avoids the cache thrashing caused by a large number of frequent memory accesses, as well as the risk of overflowing the CAM memory.

[0032] Before performing triangle counting, all points and adjacency lists in the large-scale graph to be processed are pre-divided and reorganized according to the three types of triangles mentioned above, generating two kinds of graph data: one is the first bitmap, which stores the indices of edges formed by two Hub points and is used to count the number of first-type triangles using arithmetic logic; the other is the data of each vertex and its corresponding adjacency list stored in CSR format, used to count the number of second-type and third-type triangles using set intersection operations in the CAM memory. In this embodiment, a bitmap is used to store the edges formed by Hub points, where each edge occupies only 1 bit (1 indicates the existence of an edge, and 0 indicates the absence of an edge), which greatly compresses memory usage and improves data access speed compared to storing extremely long adjacency list data.
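To give a feel for the storage cost of one bit per Hub pair, the sketch below computes the bitmap size and, for comparison, the size of the same edges stored as an ID list. It assumes a triangular bitmap layout (one bit per unordered Hub pair) and 4-byte vertex IDs; neither figure comes from the source, and the function names are hypothetical.

```python
def bitmap_bytes(num_hubs: int) -> int:
    """Bytes needed for one bit per unordered pair of Hub points."""
    pairs = num_hubs * (num_hubs - 1) // 2
    return (pairs + 7) // 8  # round up to whole bytes

def adjacency_bytes(num_hub_edges: int, id_bytes: int = 4) -> int:
    """Bytes needed to store each Hub-Hub edge once as a neighbor ID."""
    return num_hub_edges * id_bytes
```

Under these assumptions, 1,000 Hub points need a 62,438-byte bitmap regardless of how many Hub-Hub edges exist, whereas 100,000 such edges stored as 4-byte IDs take 400,000 bytes. The bitmap pays off when the Hub subgraph is dense, which is the situation the text describes.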

[0033] Large-scale graphs can be processed in parallel using multiple near-memory computing modules. Before tasks are issued, the complete graph is pre-segmented to obtain graph data for multiple target graphs (region graphs). The graph data of a target graph includes information about all points within that region and their corresponding adjacent points (such as an adjacency list). Then, each target graph is assigned to a corresponding near-memory computing module. The complete first bitmap is copied to each near-memory computing module that takes part in parallel processing, enabling each module to perform arithmetic and logical operations when counting first-type triangles.

[0034] In each near-memory computing module, the index calculation module calculates, for each target point (including Hub points and non-Hub points) in the acquired graph data, the index of the edge formed by its adjacent Hub points. The decoder then queries the first bitmap based on the edge index to check whether the corresponding edge exists, thus counting the number of first-type triangles. Each near-memory computing module counts the number of triangles in the corresponding target graph, and finally, the local statistical results of all near-memory computing modules are integrated to achieve a complete triangle count for the entire graph. During triangle counting, the decoder inside the near-memory computing module first reads the graph data of the corresponding target graph from the external first memory (e.g., DRAM), including: the first bitmap, all points in the target graph (region graph), and the adjacency lists consisting of their corresponding adjacent points. The index calculation module traverses each point and all points in its corresponding adjacency list, and calculates the edge index corresponding to the two adjacent Hub points of each target point (including Hub points and non-Hub points). The decoder queries the first bitmap based on this edge index, thereby counting the first-type triangles formed by the target point and its two adjacent Hub points.

[0035] For the second and third types of triangles, this embodiment performs set intersection operations using heterogeneous dual CAM memories deployed in the near-memory computing module. In this embodiment, the adjacency lists corresponding to all points in the target graph are pre-divided according to the following partitioning rules and stored as corresponding CSR data: (1) Define the direction of every edge formed by adjacent vertices in the target graph so that the edge points from the vertex with the lower degree to the vertex with the higher degree. For example, in Figure 2, the direction of edge BI is from point I (lower degree) to point B (higher degree). Specifying the direction of edges effectively avoids repeatedly reading and processing the same edge. (2) Store the information of the edges of all points in the target graph pointing to their non-Hub adjacent points as CSR graph data of the non-Hub point adjacency lists, which is used to count the number of second-type triangles; (3) Store the information of the edges of all points in the target graph pointing to their Hub adjacent points as CSR graph data of the Hub point adjacency lists, which is used to count the number of first-type and third-type triangles.
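CSR (compressed sparse row) storage of the directed adjacency lists can be illustrated as follows. The helper names `to_csr` and `neighbors` are hypothetical, and points are assumed to be identified by their position in the degree-sorted order.

```python
def to_csr(adj: dict, num_points: int):
    """Pack per-point adjacency lists into (row_ptr, col_idx) arrays."""
    row_ptr, col_idx = [0], []
    for point in range(num_points):
        col_idx.extend(adj.get(point, ()))
        row_ptr.append(len(col_idx))  # end offset of this point's list
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, point: int):
    """Slice one point's adjacency list back out of the CSR arrays."""
    return col_idx[row_ptr[point]:row_ptr[point + 1]]

# Separate CSR structures for Hub and non-Hub adjacent points (toy data).
hub_csr = to_csr({2: [0, 1], 3: [1]}, num_points=4)
non_hub_csr = to_csr({3: [2]}, num_points=4)
```

Keeping two CSR structures means the lists streamed to each CAM are contiguous in memory, so a list can be fetched with sequential reads.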

[0036] Based on the above rules, the adjacency list data corresponding to non-Hub points is divided into graph data containing Hub point information and graph data not containing Hub point information. Then, two CAM memories are used to process the second type of triangle counts that do not contain Hub points and the third type of triangle counts that do contain Hub points, respectively, to achieve physical separation of the two types of triangle counts, avoid long and short adjacency lists competing for resources in the same memory, and improve data throughput and processing efficiency.

[0037] In this embodiment, the counting operations for the three types of triangles are pipelined serially. Before counting the second and third types of triangles, it is necessary to first count all first-type triangles. After counting the number of first-type triangles, the first CAM (Non-Hub CAM) traverses all non-Hub points and their corresponding graph data in the graph data to count the number of second-type triangles (i.e., NNN triangles). After counting the second-type triangles, the second CAM (Hub CAM) counts the third-type triangles (i.e., HNN triangles) in the graph data, which are formed by using a non-Hub point as the target point, one adjacent Hub point, and one adjacent non-Hub point. After counting the number of first, second, and third-type triangles, the counts of the three types of triangles are integrated to obtain the total number of triangles in the target graph. It is worth noting that in large-scale graph computing scenarios, if multiple parallel near-memory computing modules are used to process a portion of the complete graph data (target graph) separately, it is also necessary to integrate the triangle counts of the corresponding target graphs output by each near-memory computing module to finally obtain the total number of triangles in the complete graph.
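The three-stage flow can be checked end to end in software. The sketch below is a functional model only: a Python set stands in for the first bitmap and for CAM lookups, the tie-break by name and the rounding of the Hub count are assumptions, and `brute_force` exists purely to confirm that the three partial counts add up to the true total.

```python
from itertools import combinations

def count_triangles(edges, hub_ratio: float):
    """Return (type1, type2, type3) counts for an undirected edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    order = sorted(degree, key=lambda p: (-degree[p], p))  # ties by name: assumption
    rank = {p: r for r, p in enumerate(order)}
    n, num_hubs = len(order), int(len(order) * hub_ratio)

    hub_adj = {r: [] for r in range(n)}
    non_hub_adj = {r: [] for r in range(n)}
    hub_edges = set()  # stands in for the first bitmap
    for u, v in edges:
        src, dst = max(rank[u], rank[v]), min(rank[u], rank[v])  # low -> high degree
        if dst < num_hubs:
            hub_adj[src].append(dst)
            if src < num_hubs:
                hub_edges.add((dst, src))
        else:
            non_hub_adj[src].append(dst)

    # Stage 1: pairs of adjacent Hub points checked against the bitmap stand-in.
    type1 = sum((min(a, b), max(a, b)) in hub_edges
                for hubs in hub_adj.values() for a, b in combinations(hubs, 2))
    # Stage 2: non-Hub lists intersected (first CAM).
    type2 = 0
    for v in range(num_hubs, n):
        loaded = set(non_hub_adj[v])
        for u in non_hub_adj[v]:
            type2 += sum(w in loaded for w in non_hub_adj[u])
    # Stage 3: Hub lists of each base edge intersected (second CAM).
    type3 = 0
    for v in range(num_hubs, n):
        loaded = set(hub_adj[v])
        for u in non_hub_adj[v]:
            type3 += sum(h in loaded for h in hub_adj[u])
    return type1, type2, type3

def brute_force(edges) -> int:
    """Reference count: test every triple of points."""
    present = {frozenset(e) for e in edges}
    points = sorted({p for e in edges for p in e})
    return sum(all(frozenset(pair) in present for pair in combinations(t, 2))
               for t in combinations(points, 3))
```

On a toy graph with two high-degree points X and Y joined to each other and to a, b, c, d, e, plus the edges ab, bc, ac, a Hub ratio of 0.3 selects X and Y as Hub points and the model yields 5 first-type, 1 second-type, and 6 third-type triangles, matching the brute-force total of 12.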

[0038] The solution provided in this embodiment can effectively handle large-scale graph structures. Based on the difference in the density of hub and non-hub points in the graph, the graph data is pre-divided. The counting of the first type of triangles for densely distributed hub points is changed from the traditional set intersection operation on adjacency lists to arithmetic and logical operations based on bitmap address indexes. This eliminates the need for set intersection operations on long adjacency lists and for access to large amounts of memory data, improving counting efficiency and avoiding cache thrashing and memory overflow problems. For sparsely distributed non-hub points, heterogeneous dual CAMs are used to count the second and third types of triangles respectively. The parallel search capability of the CAMs is used to efficiently complete the set intersection processing of the sparsely distributed non-hub point adjacency lists and to obtain the number of triangles of the corresponding types. This not only solves the single-CAM capacity bottleneck in existing solutions, but also realizes pipelined counting of the various types of triangles, improving computational efficiency.

[0039] Optionally, in practical applications, the degree of a point in a graph can be determined by incorporating weights in addition to the number of edges connecting the point. For example, in a social graph, each edge has a corresponding social weight to represent the closeness of the relationship between two points. When calculating the degree of each point, the degree is calculated by combining the number of edges connected to each point and the weight of each edge, thus taking into account the closeness of the social relationship during the selection of Hub points.
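
As a non-limiting sketch of the optional weighted degree, the following Python code combines the number of edges of each point with the sum of its edge weights. The embodiment does not state how the two quantities are combined; the linear mix and the parameter `alpha` are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def weighted_degree(weighted_edges, alpha=0.5):
    """Combine edge count and summed edge weight for every point.
    alpha is a hypothetical mixing factor, not taken from the text."""
    count = defaultdict(int)
    weight = defaultdict(float)
    for u, v, w in weighted_edges:
        for p in (u, v):
            count[p] += 1
            weight[p] += w
    return {p: alpha * count[p] + (1 - alpha) * weight[p] for p in count}

# Hypothetical social graph: the a-c relationship is closer than a-b.
scores = weighted_degree([("a", "b", 1.0), ("a", "c", 3.0)])
```

Sorting the points by such a score instead of the plain edge count would then drive the selection of Hub points.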

[0040] In this embodiment, the triangle counting in the near-memory computing module employs a three-stage pipelined process. The near-memory computing module also includes a controller, which, according to the pipelined processing strategy, controls the index computing module, the first CAM, and the second CAM to sequentially count the number of triangles of each type. The controller receives counting signals from the index computing module, the first CAM, and the second CAM, and controls the registers to count the three types of triangles.

[0041] As one embodiment of this application, the connectivity-aware near-memory computing module for large-scale graph computing further includes a controller; the index computing module is configured to calculate the index of the edge formed by two adjacent Hub points corresponding to each point, including: taking two adjacent Hub points of the target point being traversed as the first vertex and the second vertex respectively, and calculating the index of the edge formed by the first vertex and the second vertex according to the index of the first vertex and the index of the second vertex. The decoder is configured to query the first bitmap based on the edge index and count the number of first-type triangles containing at least two Hub points in the graph data, including: querying the first bitmap based on the edge index; and sending a first signal to the controller when the value at the corresponding position in the first bitmap is 1. The controller is configured to perform a count of the first type of triangles based on the first signal.

[0042] In one embodiment, in the first stage of pipelined processing, the index calculation module and the decoder count the first type of triangles. The index calculation module traverses all points in the target graph, taking the currently traversed target point (Hub point or non-Hub point) as the root node, and using two Hub points in the target point's Hub point adjacency list as the first vertex and the second vertex, respectively. The index calculation module is equipped with arithmetic logic circuitry that directly maps an input pair of Hub point indices to the physical read address of the first bitmap, thereby skipping the set intersection operation. Specifically, based on the index of the first vertex and the index of the second vertex, arithmetic logic is used to calculate the index (address) h of the edge formed by the first vertex and the second vertex [the formula and its accompanying ordering assumption are missing from the source text], where the two operands are the indices of the first vertex and the second vertex, respectively.

[0043] The decoder uses the edge index h output by the index calculation module to query the corresponding bit in the first bitmap stored in external memory. If the bit is 1, an edge between the first and second vertices exists in the target graph; if the bit is 0, no such edge exists. When the decoder finds that the corresponding bit in the first bitmap is 1, it sends a first signal to the controller. After receiving the first signal, the controller increments the count of the first type of triangles (i.e., HHH triangles and HHN triangles) by 1 in the control register.
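
As a non-limiting sketch of how a pair of Hub indices can be turned into a bit address and looked up, the following Python code packs the Hub-to-Hub edges into a bitmap. The formula for h is not legible in this text, so the strictly lower-triangular packing h = i(i-1)/2 + j for i > j used here is an assumption (one common layout), and the names `edge_index`, `build_bitmap` and `has_edge` are hypothetical.

```python
def edge_index(i, j):
    """Map an unordered pair of distinct Hub indices to a bit position.
    Assumed layout: h = i*(i-1)/2 + j for i > j; the embodiment's own
    formula may differ."""
    if i == j:
        raise ValueError("an edge needs two distinct Hub points")
    if i < j:
        i, j = j, i
    return i * (i - 1) // 2 + j

def build_bitmap(num_hubs, hub_edges):
    """Set one bit per Hub-to-Hub edge."""
    bits = bytearray((num_hubs * (num_hubs - 1) // 2 + 7) // 8)
    for a, b in hub_edges:
        h = edge_index(a, b)
        bits[h >> 3] |= 1 << (h & 7)
    return bits

def has_edge(bits, i, j):
    """Query the bit at the computed address; True corresponds to the first signal."""
    h = edge_index(i, j)
    return bool((bits[h >> 3] >> (h & 7)) & 1)

# Four hypothetical Hub points with three edges among them.
bitmap = build_bitmap(4, [(0, 1), (1, 3), (2, 3)])
```

Under this assumed layout the existence test is a shift, a mask and one memory read, with no traversal of either Hub point's adjacency list.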

[0044] Since this stage only performs arithmetic and logical calculations on addresses through the index calculation module, without performing costly set intersection operations, it can greatly reduce access to external memory.

[0045] As one embodiment of this application, the connectivity-aware near-memory computing module for large-scale graph computing further includes a controller; the first CAM is specifically configured to query the first non-Hub point adjacency list of the target non-Hub point currently being traversed, and the second non-Hub point adjacency list corresponding to each non-Hub point in the first non-Hub point adjacency list, and determine whether there is an edge between the target non-Hub point and the non-Hub points in the second non-Hub point adjacency list. If an edge exists between the target non-Hub point and a non-Hub point in the second non-Hub point adjacency list, a second signal is sent to the controller. The controller is also configured to perform a count of the second type of triangles based on the second signal.

[0046] In one embodiment, in the second stage of the pipelined processing, the first CAM counts the second type of triangles. After the index calculation module has traversed all points in the target graph, the second stage of triangle counting begins. The first CAM traverses all non-Hub points in the target graph, reads the first non-Hub point adjacency list corresponding to the target non-Hub point, and reads the corresponding second non-Hub point adjacency list based on each non-Hub point in the first non-Hub point adjacency list. The first CAM uses set intersection operation to check whether there is an edge between the non-Hub points in the two lists, thereby determining whether there exists a second type of triangle (i.e., an NNN triangle) formed by the currently traversed target non-Hub point I and its two corresponding non-Hub adjacent points. For example, if the non-Hub adjacent point K of the currently traversed target non-Hub point I has a non-Hub adjacent point J, and there is an edge between the target non-Hub point I and the non-Hub point J, it indicates that there exists a second type of triangle formed by the three non-Hub points I, J, and K.

[0047] Then, upon detecting an edge, the first CAM sends a second signal to the controller. After receiving the second signal, the controller increments the count of the second type of triangles (i.e., NNN triangles) by 1 in its control register.
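
As a non-limiting software sketch of the second stage, the following Python code counts NNN triangles from directed non-Hub adjacency lists. A Python `set` stands in for the parallel match of the first CAM; the function name `count_nnn` and the example data are hypothetical.

```python
def count_nnn(non_hub_adj):
    """Count triangles made of three non-Hub points.
    non_hub_adj maps each non-Hub point to its directed non-Hub neighbours."""
    count = 0
    for u, first_list in non_hub_adj.items():
        neighbours_of_u = set(first_list)       # plays the role of the first CAM
        for v in first_list:                    # base edge (u, v)
            for w in non_hub_adj.get(v, ()):    # second non-Hub adjacency list
                if w in neighbours_of_u:        # edge (u, w) exists: second signal
                    count += 1
    return count

# I points to K and J, and K points to J, so I, J, K form one NNN triangle.
non_hub_adj = {"I": ["K", "J"], "K": ["J"], "J": [], "L": ["I"]}
nnn_total = count_nnn(non_hub_adj)
```

Because every edge is stored in one direction only, each triangle is reached from exactly one root point and is therefore counted once.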

[0048] As one embodiment of this application, the connectivity-aware near-memory computing module for large-scale graph computing also includes a controller. The second CAM is specifically configured to obtain the base edges, each formed by the target non-Hub point and one of its non-Hub adjacent points, produced by the query of the first CAM; and, based on the base edges, to query the Hub point adjacency list of the non-Hub adjacent points to determine whether there is an edge between the target non-Hub point and a Hub point in that Hub point adjacency list. If an edge exists between the target non-Hub point and a Hub point in the Hub point adjacency list, a third signal is sent to the controller. The controller is also configured to perform a count of the third type of triangles based on the third signal.

[0049] In one embodiment, the third stage of the pipelined processing involves the second CAM counting third-type triangles. After the first CAM has traversed all non-Hub points in the target graph, the third stage of triangle counting begins. The second CAM, based on the base edges sequentially read by the first CAM from the target non-Hub points and their non-Hub adjacent points, queries the Hub adjacency list of those non-Hub adjacent points. The second CAM uses set intersection operations to check whether there is an edge between the target non-Hub point and a Hub point in the Hub adjacency list, thereby determining whether a third-type triangle (i.e., an HNN triangle) containing the currently traversed target non-Hub point exists. For example, as shown in Figure 2, the target non-Hub point H and its non-Hub adjacent point E form a base edge, and the Hub point G exists in the Hub adjacency list of the non-Hub adjacent point E; if an edge is detected between the target non-Hub point H and the Hub point G (i.e., a successful match in the second CAM), there is a third type of triangle composed of H, E, and G.

[0050] Furthermore, upon detecting an edge, the second CAM sends a third signal to the controller. Upon receiving the third signal, the controller increments the count of the third type of triangle (i.e., HNN triangle) in its control register. During the counting phase of the third type of triangle, the second CAM reuses the edge data (i.e., base edges) composed of two non-Hub points read by the first CAM, thus saving the step of rereading relevant data from the external first memory, reducing memory access and improving the processing efficiency of triangle counting. Moreover, this embodiment, through reuse, allows the system to only query the short adjacency list of non-Hub points without querying the long adjacency list of Hub points (e.g., the non-Hub point adjacency list of Hub points), improving processing efficiency while effectively avoiding buffer overflow problems.
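
As a non-limiting software sketch of the third stage, the following Python code counts HNN triangles by reusing the non-Hub base edges and consulting only the short Hub adjacency lists of non-Hub points. A `set` stands in for the second CAM, and the sketch assumes that every Hub point has a degree at least as high as any non-Hub point, so that edges to Hub points always point towards the Hub point. The function name `count_hnn` is hypothetical; the vertices H, E and G follow the example given for Figure 2, and B is a hypothetical extra Hub point.

```python
def count_hnn(non_hub_adj, hub_adj):
    """Count triangles made of two non-Hub points and one Hub point."""
    count = 0
    for u, non_hub_neighbours in non_hub_adj.items():
        hubs_of_u = set(hub_adj.get(u, ()))   # contents of the second CAM for u
        for v in non_hub_neighbours:          # base edge (u, v) reused from stage two
            for h in hub_adj.get(v, ()):      # short Hub adjacency list of v
                if h in hubs_of_u:            # edge (u, h) exists: third signal
                    count += 1
    return count

# Base edge H-E, with Hub point G adjacent to both H and E.
non_hub_adj = {"H": ["E"], "E": []}
hub_adj = {"H": ["G"], "E": ["G", "B"]}
hnn_total = count_hnn(non_hub_adj, hub_adj)
```

Note that the long adjacency lists of the Hub points themselves are never read in this loop.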

[0051] Optionally, within the near-memory computing module, a hash-based hardware accelerator can be used to replace the first and second CAMs in performing the counting operations for the second and third types of triangles. For example, linear probing hardware logic circuits can be used to implement the set intersection operation function of the first and second CAMs, thereby further reducing the hardware area and saving hardware costs.
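
As a non-limiting sketch of the hash-based alternative, the following Python class models a fixed-capacity open-addressing table with linear probing that offers the same membership test the CAMs provide. The class name `LinearProbingSet`, its capacity, and the use of Python's built-in `hash` are assumptions for illustration; a hardware implementation would use its own hash function and slot width.

```python
class LinearProbingSet:
    """Fixed-capacity open-addressing set with linear probing (no deletion)."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity

    def _probe(self, key):
        # Visit every slot once, starting at the hashed position.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def add(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is None or self.slots[pos] == key:
                self.slots[pos] = key
                return
        raise OverflowError("table full")

    def __contains__(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is None:   # empty slot ends the probe sequence
                return False
            if self.slots[pos] == key:
                return True
        return False

# Store two (u, v) entries and search for a present and an absent one.
table = LinearProbingSet(capacity=4)
table.add((1, 2))
table.add((1, 3))
```

Unlike a CAM, a lookup here may take several sequential probes, which is the trade-off against the smaller hardware area.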

[0052] In one embodiment of this application, the graph data further includes: a first list obtained by sorting all points in the target graph according to their degree from high to low; the controller is further configured to perform the following steps: traverse all points in the first list, control the decoder to read the index of the Hub adjacent point of each point from the first memory in sequence, and transmit it to the index calculation module; after traversing all points in the first list, traverse the non-Hub points in the first list again, and control the decoder to read, from the first memory in sequence, the non-Hub point adjacency list of each target non-Hub point and the non-Hub point adjacency list of each non-Hub adjacent point of the target non-Hub point, and send them to the first CAM; and after counting the number of the second type of triangles, control the decoder to read, from the first memory in sequence, the Hub point adjacency list for each base edge, where each base edge is formed by a target non-Hub point and its corresponding non-Hub adjacent point as queried by the first CAM, and send it to the second CAM.

[0053] In one embodiment, after segmenting the large-scale graph to be processed, all points in each target graph are pre-sorted according to their degree from high to low to obtain a corresponding first list. When processing the graph data of the target graph, the controller in the near-memory computing module controls the decoder to read the information of the points and their adjacency lists in a pipelined traversal manner according to the order of the points in the first list.

[0054] In this embodiment, based on a three-stage pipeline processing sequence, the controller first traverses all points in the first list and reads the Hub adjacency list of each point from the external first memory, obtains the indices of its Hub adjacent points, and transmits them to the index calculation module for calculating the index of the edge formed by two Hub adjacent points. Then, the controller controls the decoder to read the first bitmap in the external memory. Based on the edge index calculated by the index calculation module, the decoder checks whether a corresponding edge exists in the first bitmap. After traversing the first list, the counting stage for the second type of triangles begins. The controller controls the decoder to sequentially read each target non-Hub point and its associated non-Hub adjacency list from the external first memory based on the first list, and sends them to the first CAM for parallel querying. After traversing all non-Hub points in the first list, the counting stage for the third type of triangles begins. The controller controls the decoder to sequentially read the Hub adjacency lists of the non-Hub adjacent points from the external first memory, based on the base edges formed by each target non-Hub point and its non-Hub adjacent points as queried from the first CAM, and sends them to the second CAM for parallel querying of the third type of triangles.

[0055] As one embodiment of this application, the connectivity-aware near-memory computing module for large-scale graph computation further includes an on-chip cache submodule for caching the first bitmap and adjacency list data read from an external first memory; the decoder is further configured to perform the following steps: before reading the information of the adjacent points corresponding to the currently traversed point from the first memory, query whether the information of the adjacent points is cached in the on-chip cache submodule; if the information of the adjacent points is stored in the on-chip cache submodule, read it from the on-chip cache submodule; if the information of the adjacent points is not stored in the on-chip cache submodule, read it from the first memory and write it to the on-chip cache submodule.

[0056] In one embodiment, the near-memory computing module further includes an on-chip cache submodule (i.e., a cache) to cache frequently accessed data read by the decoder from the external first memory, such as the first bitmap and adjacency list data (e.g., data of points in the adjacency list). Before the decoder reads the information of the currently traversed point and its related adjacent points from the external first memory, the decoder first checks whether the data of that point is already cached in the on-chip cache submodule. If it is, there is no need to request the data from the external memory again; instead, the corresponding point data is read directly from the on-chip cache submodule. For example, in the second stage, after the decoder reads the adjacency lists of non-Hub points, the on-chip cache submodule retains some information of those non-Hub points. When the second CAM processes graph data in the third stage, the non-Hub point data cached in the on-chip cache submodule can be read directly without accessing the external first memory through the decoder again. This approach reduces the number of times the near-memory computing module requests data from the external memory, further improving the processing efficiency of graph data.

[0057] Based on the same inventive concept, one embodiment of this application provides a connectivity-aware near-memory computing device for large-scale graph computing, comprising: multiple processing units; wherein each processing unit includes: a buffer module and multiple first memories; the buffer module includes: a memory controller and multiple near-memory computing modules as provided in the above embodiment. The memory controller is configured to perform the following steps: receive multiple batches of graph data for a target graph and corresponding batch processing tasks sent by an external processor, wherein each batch of graph data includes: the first bitmap and information on the adjacency lists of all points in the target graph; write the multiple batches of graph data and the corresponding batch processing tasks into the first memory; control each near-memory computing module to process each batch processing task, reading from the first memory and counting the number of triangles contained in the graph data of the corresponding batch to obtain local statistical results; and, based on the local statistical results of each near-memory computing module, calculate the total number of triangles contained in the target graph. In this embodiment, the near-memory computing device can be a dynamic random access memory, such as DDR4 (Double Data Rate 4th-generation Synchronous DRAM), which contains multiple Rank-level parallel processing units (NMP Units). The near-memory computing module provided in the above embodiment is deployed in each parallel processing unit to achieve parallel processing of batch tasks.

[0058] Figure 3 is a schematic diagram of the architecture of the Rank-level processing unit in a DDR4 memory according to one embodiment of this application. As shown in Figure 3, the processing unit (NMP unit) contains one buffer chip and four DRAM chips. The buffer chip contains a first memory controller and a near-memory computing unit. The first memory controller receives batch processing tasks and controls the near-memory computing unit to execute the corresponding triangle counting. Each near-memory computing unit contains multiple parallel near-memory computing modules (PEs). In the DDR4 memory, the processing efficiency of the near-memory computing device is improved by deploying two levels of parallelism. When executing the triangle counting task, the multiple parallel processing units (NMP units) within the DDR4 memory, and the multiple parallel near-memory computing modules within each processing unit, run simultaneously, each processing points in specific regions of the graph data and generating local results, which are then integrated to obtain the final statistical result.

[0059] Figure 4 is a schematic diagram illustrating the operation of a connectivity-aware near-memory computing module deployed in a DDR4 memory for large-scale graph computation in one embodiment of this application. After receiving the currently assigned batch of tasks, the near-memory computing unit distributes the tasks to the near-memory computing modules (PEs) within it in a round-robin manner. Subsequently, the controller in each PE controls the decoder to read graph data from the external first memory (i.e., DRAM chip) and load it into the corresponding device to perform the pipelined triangle counting operations. Simultaneously, the decoder caches the read graph data in an on-chip cache submodule, which uses an LRU (Least Recently Used) caching mechanism for data eviction and update. During the triangle counting process, the decoder reads data in an orderly manner from the shared on-chip cache submodule or the external first memory according to the corresponding processing stage and transmits it to the corresponding device (including the index calculation module, the first CAM, and the second CAM).
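
As a non-limiting software sketch of the decoder's cache lookup with LRU eviction, the following Python class checks the on-chip cache first and reads from the external first memory only on a miss. The class name `LruCache`, the use of `OrderedDict`, the capacity counted in cached adjacency lists, and the dictionary standing in for DRAM are assumptions for illustration.

```python
from collections import OrderedDict

class LruCache:
    """Read-through cache with least-recently-used eviction."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store   # stands in for the first memory (DRAM)
        self.lines = OrderedDict()           # oldest entry first
        self.misses = 0

    def read(self, key):
        if key in self.lines:                # hit: no external access needed
            self.lines.move_to_end(key)
            return self.lines[key]
        self.misses += 1                     # miss: fetch from the first memory
        value = self.backing_store[key]
        self.lines[key] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used entry
        return value

# Hypothetical adjacency lists held in external memory.
dram = {"I": ["J", "K"], "J": ["K"], "K": []}
cache = LruCache(capacity=2, backing_store=dram)
for point in ["I", "J", "I", "K", "J"]:
    cache.read(point)
```

In this access sequence the second read of I is served from the cache, while J has been evicted by the time it is requested again.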

[0060] As shown in Figure 4, in DDR4 memory, the three-stage process of triangle counting performed internally by each near-memory computing module is as follows: (1) In stage one, the index calculation module and the decoder count the first type of triangles (HHH / HHN) based on the first bitmap. Each point in the target graph is traversed, and each traversed point is taken as the root node u. The decoder retrieves the Hub point adjacency list HE.N(u) of the root node u from the local DRAM, along with the first bitmap, and stores them in the on-chip cache submodule. The adjacency list is streamed to the Index Computing Unit (ICU), which calculates the index address of the corresponding edge for each unique pair of Hub neighbors (h1, h2). This edge index address represents a request for a specific location in the first bitmap (H2H bitmap). The decoder then searches the on-chip cache submodule to check whether the relevant part of the first bitmap is stored there. If it is, the decoder further checks whether the corresponding bit value is 1. If the relevant part of the first bitmap is not stored in the on-chip cache submodule, the decoder requests it from the DRAM outside the near-memory computing module. If the corresponding bit value in the first bitmap is found to be 1, a triangle is determined to exist, and the match is successful. The decoder sends the first signal to the controller, which increments the count of the first type of triangle (i.e., HHH triangle or HHN triangle) according to the type of the root node u.

[0061] Optionally, since the root node includes both Hub nodes and non-Hub nodes, in large-scale graph structure scenarios, the graph data can be further divided into those with Hub nodes as the root node and those with non-Hub nodes, and processed in batches. It should be noted that regardless of whether the root node u is a Hub node or a non-Hub node, the Hub node adjacency table HE.N(u) of the root node must first be obtained. Then, the table is traversed to extract the indices of any two Hub adjacent nodes for pairing, and the physical index of the edge formed by these two Hub nodes is calculated using the index calculation module.
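
As a non-limiting software sketch of stage one, the following Python code traverses each root node u, pairs up the Hub neighbours in HE.N(u), and classifies each hit by the type of the root node. The function name `count_first_type` is hypothetical, and the H2H bitmap is modelled by a callable backed by a Python set rather than by a real bit array.

```python
from itertools import combinations

def count_first_type(hub_adj, hubs, h2h_has_edge):
    """Count HHH and HHN triangles from the Hub adjacency list of every root."""
    counts = {"HHH": 0, "HHN": 0}
    for u, hub_neighbours in hub_adj.items():            # root node u, list HE.N(u)
        for h1, h2 in combinations(hub_neighbours, 2):   # each unique pair (h1, h2)
            if h2h_has_edge(h1, h2):                     # bit is 1: first signal
                counts["HHH" if u in hubs else "HHN"] += 1
    return counts

# Hypothetical graph: Hub points A, B, C are fully connected;
# x, y, z are non-Hub points. Lists hold directed edges to Hub points.
hubs = {"A", "B", "C"}
hub_edges = {frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("A", "C"))}
hub_adj = {"C": ["A", "B"], "B": ["A"], "x": ["A", "B"], "y": ["B", "C"], "z": ["A"]}
counts = count_first_type(hub_adj, hubs, lambda a, b: frozenset((a, b)) in hub_edges)
```

The root z contributes nothing because a single Hub neighbour cannot form a pair.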

[0062] (2) In stage two, the first CAM (the NH-CAM in Figure 4) counts the second type of triangles (NNN triangles). After the count of first-type triangles in the current batch is completed, the near-memory computing module processes the non-Hub points in the target graph of the corresponding batch, batch by batch. For a given batch, the PE traverses each non-Hub point and concatenates the currently traversed target non-Hub point u with each of its non-Hub neighbors v to construct entries (u, v), which are preloaded into the NH-CAM. After the NH-CAM is loaded, the PE executes the following microarchitecture pipeline steps: 1) the NH-CAM operates as a conventional memory, from which an entry (u, v) is read sequentially to establish the base edge; 2) the non-Hub adjacent node v is passed to the decoder, which retrieves its non-Hub adjacency list NHE.N(v) from the cache; 3) for each non-Hub point w obtained from that adjacency list, the PE constructs an entry (u, w) and sends it as a search term to the NH-CAM search line for querying; 4) if the query matches, a counting signal (i.e., the second signal) is generated and sent to the controller.

[0063] (3) In stage three, the second CAM (the H-CAM in Figure 4) counts the third type of triangles (i.e., HNN triangles). After the count of the NNN triangles in the current batch is completed, for the non-Hub root nodes u in the same batch, the PE first concatenates each root node u with each of its Hub adjacent nodes h to construct entries (u, h), which are preloaded into the H-CAM. After H-CAM loading is complete, the PE executes the following microarchitecture pipeline steps: 1) the NH-CAM is run again as a regular memory, from which an entry (u, v) is read sequentially to re-establish the base edge; 2) for this base edge, the non-Hub neighbor v of the target non-Hub point u is transmitted to the decoder, and the decoder retrieves its corresponding Hub point adjacency list HE.N(v) from the cache; 3) for each Hub adjacent point h obtained, the PE constructs an entry (u, h) and sends it as a search term to the search line of the H-CAM for querying; 4) if the query matches, a counting signal (i.e., the third signal) is generated and sent to the controller.

[0064] After the triangle counting task for a batch of target graphs is completed, the data in the first and second CAMs is cleared. The controller then controls the decoder to continue reading the data for the next batch of target graphs from external DRAM, and repeats the above steps until all batches of graph data allocated to the near-memory computing module have been processed.
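
As a non-limiting software model of the preload, sequential read, search and clear steps of stages two and three, the following Python code treats each CAM as a store of concatenated entries that can be read in order or searched by exact match. The class `Cam`, the function `process_batch`, the capacity counted in entries, and the example data are assumptions for illustration; a real CAM performs the search across all entries in parallel in hardware.

```python
class Cam:
    """Software model of a CAM holding concatenated (u, v) entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []        # readable in order, like a regular memory
        self._match = set()      # models the parallel search lines

    def preload(self, entries):
        entries = list(entries)
        if len(entries) > self.capacity:
            raise OverflowError("batch exceeds CAM capacity")
        self.entries, self._match = entries, set(entries)

    def search(self, entry):
        return entry in self._match

    def clear(self):
        self.entries, self._match = [], set()

def process_batch(batch, non_hub_adj, hub_adj, capacity=1024):
    """Run stages two and three for one batch of non-Hub root points."""
    nh_cam, h_cam = Cam(capacity), Cam(capacity)
    # Stage two: preload base edges (u, v), then search (u, w) for w in NHE.N(v).
    nh_cam.preload((u, v) for u in batch for v in non_hub_adj.get(u, ()))
    nnn = sum(nh_cam.search((u, w))
              for u, v in nh_cam.entries
              for w in non_hub_adj.get(v, ()))
    # Stage three: preload (u, h), re-read base edges, search (u, h) for h in HE.N(v).
    h_cam.preload((u, h) for u in batch for h in hub_adj.get(u, ()))
    hnn = sum(h_cam.search((u, h))
              for u, v in nh_cam.entries
              for h in hub_adj.get(v, ()))
    nh_cam.clear()               # both CAMs are emptied before the next batch
    h_cam.clear()
    return nnn, hnn

non_hub_adj = {"I": ["K", "J"], "K": ["J"], "J": [], "H": ["E"], "E": []}
hub_adj = {"H": ["G"], "E": ["G"], "I": ["G"]}
result = process_batch(["I", "K", "J", "H", "E"], non_hub_adj, hub_adj)
```

The base edges held in the first model CAM are read twice, once per stage, which corresponds to the reuse of base edges between the second and third stages.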

[0065] Based on the same inventive concept, one embodiment of this application provides a connectivity-aware near-memory computing system for large-scale graph computing, including: a processor and at least one near-memory computing device as provided in the above embodiment. The processor is configured to segment the graph to be processed based on the number of near-memory computing devices, generate graph data corresponding to multiple batches of target graphs, and construct corresponding batch processing tasks, which are then distributed to each near-memory computing device; wherein the graph data of each batch includes: the first bitmap and information on the adjacency lists of all points in the target graph. Each near-memory computing device is configured to count the number of triangles contained in the graph data of the corresponding batch based on the received batch processing task, and return the statistical results to the processor. The processor is also configured to calculate the total number of triangles contained in the graph to be processed based on the statistical results returned by each near-memory computing device.

[0066] In one embodiment, the near-memory computing system consists of a host processor and multiple near-memory computing devices (e.g., DDR4 memories). When processing triangle counting in a large-scale graph, the processor segments the graph to be processed according to the number of near-memory computing devices in the system, obtaining graph data corresponding to multiple local graphs (i.e., target graphs). Batch processing tasks are then constructed based on the graph data of each target graph and distributed to each near-memory computing device. Considering the dense distribution of hub points and their long adjacency lists, and to facilitate the index calculation module's query of the bitmap, this embodiment copies the first bitmap constructed from the complete graph to be processed into the graph data of each batch processing task. That is, the graph data of each batch includes the complete first bitmap, as well as information on the adjacency lists of all hub points and non-hub points in the target graph of that batch.

[0067] Figure 5 is a schematic diagram illustrating the distribution of batch computing tasks by a near-memory computing system in one embodiment of this application. As shown in Figure 5, after the host generates batch processing tasks, it distributes each batch processing task to different DDR4 memories. The second memory controller deployed in each DDR4 memory distributes the received batch processing tasks to the Rank-level processing units within that DDR4 memory (the NMP Units in Figure 5), and the tasks of the corresponding batch are stored in the multiple first memories within each processing unit (e.g., the DRAM chips in Figure 3). A first memory controller, deployed within the buffer module of each processing unit, controls the multiple near-memory computing modules (PEs) on that buffer module to read, in parallel from the local first memory, the multiple batches of triangle counting tasks sent to that processing unit, obtaining the statistical results for each batch. Each near-memory computing device (DDR4 memory) returns its multiple local statistical results to the host processor, which integrates all the local statistical results to obtain the total number of triangles present in the complete graph.
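
As a non-limiting sketch of the distribute-then-integrate flow, the following Python code hands batches to devices and sums the local results on the host. The round-robin assignment at the host level is an assumption borrowed from the way tasks are spread over PEs; the function names `distribute` and `total_triangles` and the per-batch counts are hypothetical.

```python
def distribute(batches, num_devices):
    """Assign batches to devices in round-robin order (assumed policy)."""
    queues = [[] for _ in range(num_devices)]
    for i, batch in enumerate(batches):
        queues[i % num_devices].append(batch)
    return queues

def total_triangles(queues, count_batch):
    """Each device sums its own batches; the host sums the local results."""
    local_results = [sum(count_batch(b) for b in queue) for queue in queues]
    return sum(local_results), local_results

# Hypothetical batches, each already annotated with its local triangle count.
batches = [{"id": 0, "triangles": 5}, {"id": 1, "triangles": 2},
           {"id": 2, "triangles": 4}]
queues = distribute(batches, num_devices=2)
total, per_device = total_triangles(queues, lambda b: b["triangles"])
```

The same two functions could be applied again inside a device to model the second level of distribution from a memory controller to its processing units.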

[0068] Because bitmap storage is extremely compact (e.g., only a small portion of the 5.2GB Twitter graph), this solution employs an asymmetric graph partitioning strategy of "full bitmap copying and full subgraph partitioning." The small first bitmap is copied to the processing units of every near-memory computing device, while the massive HNN-related and NNN-related graph data is divided into multiple batches and distributed to different Rank-level processing units. This effectively solves the memory overflow problem that traditional solutions employing full-graph copying or uniform partitioning strategies encounter in large-scale graph structure scenarios. Furthermore, in contrast to the large-scale cross-node data movement caused by uniform graph partitioning strategies in traditional solutions, this solution uses bitmap-index-based arithmetic and logic calculations to replace the set intersection operations on graph data for the first type of triangles. This saves significant communication overhead and prevents the system's computational performance from decreasing drastically as the graph size increases, maintaining high processing speed even in large-scale graph structure scenarios. Since the connection information between all Hub points is available in the first bitmap, this full-broadcast approach, which exploits the fact that the bitmap is extremely small and frequently accessed, can eliminate cross-node communication when processing the dominant workload (HHH triangles, accounting for an average of about 75.9%), greatly improving processing efficiency and scalability for large-scale graph structures.

[0069] In one embodiment of this application, the processor is configured to segment the graph to be processed based on the number of connectivity-aware near-memory computing devices for large-scale graph computation, generate graph data corresponding to multiple batches of target graphs, and construct corresponding batch processing tasks, including: sorting the points in the graph to be processed from highest to lowest degree to generate a first list, wherein the degree of each point is related to the number of edges connected to the point; according to a first ratio, selecting multiple top-ranked points from the first list and determining them as Hub points; specifying the direction of each edge in the graph to be processed as pointing from the lower-degree point of the edge to the higher-degree point; constructing the first bitmap based on all edges consisting of two Hub points; constructing the Hub point adjacency list and the non-Hub point adjacency list for each point in the graph to be processed; dividing all Hub points into multiple Hub groups, and dividing all non-Hub points into multiple non-Hub groups; constructing a batch of graph data based on the first bitmap and at least one of the following: an adjacency list of a Hub group, or an adjacency list of a non-Hub group; and constructing corresponding batch processing tasks based on the graph data of each batch.

[0070] In one embodiment, the host processor pre-segments the graph to be processed to obtain multiple target graphs, and constructs corresponding batch processing tasks based on the graph data of each target graph. The specific process is as follows: (1) Calculate the degree of each point in the graph to be processed from the number of edges connected to it, and sort the points by degree to obtain the first list. Select the top-ranked points within the first ratio from the first list and designate them as Hub points. Specify the direction of each edge in the graph as running from the endpoint that appears later in the first list to the endpoint that appears earlier in the first list. (2) Filter out all edges whose two endpoints are both Hub points and store them in a bitmap to obtain the first bitmap. Also, based on the graph to be processed, construct the Hub point adjacency list and the non-Hub point adjacency list for each point. (3) Divide all Hub points in the graph into multiple Hub groups and all non-Hub points into multiple non-Hub groups. Construct each batch of graph data from the first bitmap and at least one of the following: the adjacency lists of a Hub group, or the adjacency lists of a non-Hub group. (4) Construct the corresponding batch processing task from the graph data generated for each batch. As one embodiment of this application, the processor is configured to divide all Hub points into multiple Hub groups by allocating Hub points to each near-memory computing module in a round-robin manner, based on the number of all near-memory computing modules contained in the system, to obtain the Hub group corresponding to each near-memory computing module. The processor divides all non-Hub points into multiple non-Hub groups based on the capacity threshold of the CAM in each near-memory computing device and the data volume of the Hub point adjacency list and the non-Hub point adjacency list of each non-Hub point. In each non-Hub group, the total data volume of the non-Hub point adjacency lists and the total data volume of the Hub point adjacency lists are each less than or equal to the capacity threshold.
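
Steps (1) and (2) of the process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the names `preprocess` and `hub_edge_index`, the tie-breaking by vertex id, the rounding of the first ratio with `ceil`, and the triangular layout of the bitmap index are all hypothetical. The application only states that the edge index is computed from the indices of the two Hub points by arithmetic and logical operations. The input is assumed to be a simple undirected graph given as a list of edges.

```python
from collections import defaultdict
from math import ceil


def hub_edge_index(i, j):
    """Bitmap position of the Hub-Hub edge between hub ranks i < j.

    Assumed triangular layout; the source does not give the formula.
    """
    return j * (j - 1) // 2 + i


def preprocess(edges, hub_ratio):
    """Host-side sketch: rank by degree, select hubs, orient edges, build bitmap."""
    # Step (1): the degree of a vertex is the number of edges incident to it.
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # The "first list": vertices sorted by degree, highest first.
    # Ties are broken by vertex id (an assumption) so the order is total.
    first_list = sorted(degree, key=lambda x: (-degree[x], x))
    rank = {v: i for i, v in enumerate(first_list)}

    # The top `hub_ratio` fraction of the first list become Hub points.
    num_hubs = ceil(hub_ratio * len(first_list))

    # Step (2): one bit per unordered pair of hubs, plus split adjacency lists.
    hub_bitmap = [0] * (num_hubs * (num_hubs - 1) // 2)
    hub_adj = defaultdict(list)
    non_hub_adj = defaultdict(list)
    for u, v in edges:
        # Orient the edge from the later-ranked endpoint to the earlier-ranked one.
        src, dst = (u, v) if rank[u] > rank[v] else (v, u)
        if rank[dst] < num_hubs:
            hub_adj[src].append(dst)
            if rank[src] < num_hubs:
                # Both endpoints are hubs: record the edge in the first bitmap.
                hub_bitmap[hub_edge_index(rank[dst], rank[src])] = 1
        else:
            # dst is a non-hub, so src (ranked later) is a non-hub as well.
            non_hub_adj[src].append(dst)
    return first_list, num_hubs, hub_bitmap, dict(hub_adj), dict(non_hub_adj)
```

For the edge list `[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 4)]` with `hub_ratio=0.5`, vertices 0, 1 and 2 become Hub points, all three bits of the bitmap are set, and the only non-Hub edge is stored as `{4: [3]}`.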

[0071] In the above embodiments, different grouping strategies are used for Hub points and non-Hub points. For Hub points, a round-robin approach is used to allocate Hub points to each PE based on the number of all near-memory computing modules (PEs) deployed in the system; all Hub points allocated to a given PE are then defined as one Hub group. For non-Hub points, the capacity of the CAMs deployed in the PEs must be considered. Based on the capacity thresholds of the first and second CAMs deployed in each PE in the system (the first and second CAMs usually have the same capacity), and the data volume of the Hub point adjacency list and the non-Hub point adjacency list of each point, all non-Hub points are divided so that the data volume of each non-Hub group does not exceed the CAM capacity threshold. This ensures that each batch of graph data can be processed normally by the first and second CAMs and avoids cache overflow.
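
The two grouping strategies can be illustrated with the following Python sketch. It is an assumption-laden example rather than the method of this application: the function names are hypothetical, data volume is approximated by the number of adjacency list entries, and a simple greedy pass is used for the non-Hub points because the application does not specify the packing algorithm.

```python
def group_hubs_round_robin(hubs, num_pes):
    """Assign Hub points to PEs in round-robin order; one Hub group per PE."""
    groups = [[] for _ in range(num_pes)]
    for i, hub in enumerate(hubs):
        groups[i % num_pes].append(hub)
    return groups


def group_non_hubs_by_capacity(non_hubs, hub_adj, non_hub_adj, cam_capacity):
    """Greedy packing (an assumption): close the current group as soon as the
    total Hub list size or the total non-Hub list size would exceed capacity."""
    groups, current = [], []
    hub_total = non_hub_total = 0
    for v in non_hubs:
        h = len(hub_adj.get(v, ()))
        n = len(non_hub_adj.get(v, ()))
        if h > cam_capacity or n > cam_capacity:
            # A single list that cannot fit needs separate handling,
            # which this sketch does not model.
            raise ValueError(f"adjacency list of vertex {v} exceeds CAM capacity")
        if current and (hub_total + h > cam_capacity
                        or non_hub_total + n > cam_capacity):
            groups.append(current)
            current, hub_total, non_hub_total = [], 0, 0
        current.append(v)
        hub_total += h
        non_hub_total += n
    if current:
        groups.append(current)
    return groups
```

With two PEs, `group_hubs_round_robin([10, 11, 12, 13, 14], 2)` yields `[[10, 12, 14], [11, 13]]`. The capacity check uses "less than or equal to", matching the threshold condition stated for the non-Hub groups.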

[0072] To further verify the effectiveness of this scheme, this embodiment conducted multi-dimensional comparative tests on eight real-world graph datasets: Email-Enron (EE), Astro (AS), YouTube (YT), LiveJournal (LJ), Orkut (OK), Mico (MC), Wiki (WK), and Twitter (TT). Mainstream near-memory computing architectures (such as the DIMMining, CLAP, and CRISP schemes) were selected as baselines. The specific test data and results are shown in Figure 6, Figure 7, and Figure 8.

[0073] Figure 6 is a comparison chart of the end-to-end runtime of different near-memory computing schemes. As shown in Figure 6, the speedup results are average processing times normalized to the proposed scheme (PRISM). Compared to the DIMMining scheme, the proposed scheme achieves an average speedup of 2.21 times; compared to the CLAP scheme, 2.05 times; and compared to the CRISP scheme, 1.56 times. Particularly noteworthy is its ability to handle extremely large-scale graph structures containing 61.6M vertices (such as the Twitter dataset): the CRISP scheme failed to complete the computation because it ran out of memory (OoM), whereas the proposed scheme, with its connectivity-aware partitioning and arithmetic logic operations, completed the processing of the large-scale graph data stably and efficiently.

[0074] Figure 7 is a comparison chart of the data access volume of different near-memory computing schemes. As shown in Figure 7, in terms of DRAM data access volume (memory traffic), the proposed scheme (PRISM) reduces DRAM access volume by an average of 39.49% compared to the CLAP scheme and by an average of 32.10% compared to the CRISP scheme. This reduction is mainly attributed to two key optimizations in this application. First, this application uses the compact first bitmap and arithmetic logic operations to handle the dense Hub points, achieving cache-friendly data lookups and significantly reducing memory access traffic for long adjacency lists. Second, this application splits each adjacency list into a separate Hub point set and non-Hub point set (the non-Hub point sets are loaded and processed by the first CAM, and the Hub point sets by the second CAM), which reduces the data volume of subsequent set intersection operations and prevents irrelevant data from polluting the on-chip cache, thereby greatly improving memory access efficiency. This advantage is particularly evident on the ultra-large-scale graph datasets (such as Wiki and Twitter), where the reductions in DRAM access volume compared to the CLAP scheme reach 64.65% and 68.00%, respectively.
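
The division of work between the first bitmap and the two CAMs can be mirrored in a small software reference model. The Python sketch below is illustrative only and makes several assumptions: vertices are ids `0..n-1` already relabelled by descending degree (so ids below `num_hubs` are Hub points), the bitmap uses a triangular index layout, and Python set intersections stand in for the CAM lookups. The names `hub_edge_index` and `count_triangles` are hypothetical.

```python
from itertools import combinations


def hub_edge_index(i, j):
    """Bitmap position of the Hub-Hub edge (i, j); triangular layout is assumed."""
    lo, hi = (i, j) if i < j else (j, i)
    return hi * (hi - 1) // 2 + lo


def count_triangles(edges, num_hubs):
    """Return (type1, type2, type3) triangle counts for a simple undirected graph."""
    hub_adj, non_hub_adj = {}, {}
    bitmap = [0] * (num_hubs * (num_hubs - 1) // 2)
    for u, v in edges:
        # Orient each edge from the lower-degree (larger id) endpoint
        # to the higher-degree (smaller id) endpoint.
        src, dst = max(u, v), min(u, v)
        if dst < num_hubs:
            hub_adj.setdefault(src, set()).add(dst)
            if src < num_hubs:
                bitmap[hub_edge_index(dst, src)] = 1
        else:
            non_hub_adj.setdefault(src, set()).add(dst)

    # Type 1 (at least two Hub points): for every vertex, each pair of its
    # Hub neighbours closes a triangle if the bitmap bit for that pair is set.
    # No set intersection is needed, only an index computation and a lookup.
    type1 = 0
    for hubs in hub_adj.values():
        for a, b in combinations(sorted(hubs), 2):
            type1 += bitmap[hub_edge_index(a, b)]

    # Types 2 and 3 both start from a base edge (u, v) between two non-Hub points.
    type2 = type3 = 0
    for u, nbrs in non_hub_adj.items():
        for v in nbrs:
            # Type 2 (three non-Hub points): role of the first CAM.
            type2 += len(nbrs & non_hub_adj.get(v, set()))
            # Type 3 (two non-Hub points, one Hub point): role of the second CAM.
            type3 += len(hub_adj.get(u, set()) & hub_adj.get(v, set()))
    return type1, type2, type3
```

For a complete graph on vertices 0 to 4 plus a pendant edge `(0, 5)`, with `num_hubs=2`, the sketch returns `(3, 1, 6)`, which sums to the 10 triangles of the complete graph. Because every edge is oriented toward the higher-degree endpoint, each triangle is counted exactly once, at its lowest-ranked vertex.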

[0075] Figure 8 is a comparison chart of the runtime of different data distribution strategies in a multi-rank architecture. As shown in Figure 8, this embodiment compares the normalized runtime of three strategies: full graph copy, graph split, and the bitmap-based Hub-subgraph copy strategy adopted in this application. For large-scale graphs exceeding the memory capacity of a single rank (such as Wiki and Twitter), the traditional full graph copy strategy directly leads to Out of Memory (OoM), while the graph split strategy generates extremely high communication overhead (manifested as significantly longer runtime) due to frequent cross-node interactions. The Hub-subgraph copy strategy adopted in this application copies only the very small first bitmap and splits the remaining subgraphs, which avoids OoM while minimizing cross-node communication overhead and ensures the scalability of the system for ultra-large-scale graph structures.
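
A rough storage model helps show why copying only the first bitmap scales better than copying the full graph. The Python sketch below is a back-of-the-envelope illustration, not a measurement: the function names, the one-bit-per-Hub-pair layout, and the simplification that the remaining graph data is split evenly across ranks are all assumptions, and any values passed to it are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def hub_bitmap_bytes(num_hubs):
    """Bytes needed for one bit per unordered pair of Hub points (assumed layout)."""
    bits = num_hubs * (num_hubs - 1) // 2
    return ceil_div(bits, 8)


def per_rank_bytes(strategy, graph_bytes, num_hubs, num_ranks):
    """Approximate storage needed on each rank under the three strategies."""
    if strategy == "full_copy":
        # Every rank holds the whole graph.
        return graph_bytes
    if strategy == "split":
        # The graph is partitioned; nothing is replicated.
        return ceil_div(graph_bytes, num_ranks)
    if strategy == "hub_bitmap_copy":
        # Only the Hub-Hub bitmap is replicated; the rest is partitioned.
        # Using graph_bytes for the remainder is a deliberate upper bound.
        return hub_bitmap_bytes(num_hubs) + ceil_div(graph_bytes, num_ranks)
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumptions, the replicated part grows with the square of the number of Hub points rather than with the size of the graph, so a small first ratio keeps the per-rank footprint close to that of the pure split strategy.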

[0076] In summary, the test results show that the technical solution of this application effectively resolves the memory overflow, cache thrashing, and cross-node communication bottlenecks of near-memory computation on large-scale graph structures, and significantly improves the computational efficiency of triangle counting and the overall scalability of the system.

[0077] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

[0078] For the sake of simplicity, the method embodiments are described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and components involved are not necessarily essential to this application.

[0079] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this application can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of this application can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0080] This application describes embodiments with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of this application. It should be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0081] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0082] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0083] Although preferred embodiments of the embodiments of this application have been described, those skilled in the art, once they understand the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, this application is to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of this application.

[0084] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0085] The connectivity-aware near-memory computing device and system for large-scale graph computing provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A connectivity-aware near-memory computing module for large-scale graph computation, characterized in that it comprises: a decoder configured to read graph data of a target graph from an external first memory, wherein the graph data of the target graph includes information on all points in the target graph and their corresponding adjacent points, and wherein, in the graph data, the indices of all edges consisting of two Hub points are stored in a first bitmap; an index calculation module configured to traverse all points in the graph data and calculate, through arithmetic logic operations, the index of the edge formed by the two adjacent Hub points corresponding to each point, wherein the Hub points are the points within the first ratio at the top of the order obtained by sorting all points by degree from high to low; the decoder being further configured to query the first bitmap based on the index of the edge and count the number of first-type triangles containing at least two Hub points in the graph data; a first CAM configured to, after the number of first-type triangles has been counted, traverse all non-Hub points in the graph data and count the number of second-type triangles composed of three non-Hub points in the graph data; and a second CAM configured to, after the number of second-type triangles has been counted, traverse each non-Hub point in the graph data and count the number of third-type triangles formed by two non-Hub points and one Hub point in the graph data.

2. The connectivity-aware near-memory computing module for large-scale graph computation according to claim 1, characterized in that it further comprises a controller; the index calculation module being configured to calculate the index of the edge formed by the two adjacent Hub points corresponding to each point includes: taking the two adjacent Hub points corresponding to the target point being traversed as a first vertex and a second vertex respectively, and calculating the index of the edge formed by the first vertex and the second vertex according to the index of the first vertex and the index of the second vertex; the decoder being configured to query the first bitmap based on the index of the edge and count the number of first-type triangles containing at least two Hub points in the graph data includes: querying the first bitmap based on the index of the edge, and sending a first signal to the controller when the value at the corresponding position in the first bitmap is 1; and the controller is configured to count the first-type triangles based on the first signal.

3. The connectivity-aware near-memory computing module for large-scale graph computation according to claim 1, characterized in that it further comprises a controller; the first CAM is specifically configured to query the first non-Hub point adjacency list of the target non-Hub point currently being traversed, and the second non-Hub point adjacency list corresponding to each non-Hub point in the first non-Hub point adjacency list, and determine whether there is an edge between the target non-Hub point and the non-Hub points in the second non-Hub point adjacency list; if an edge exists between the target non-Hub point and a non-Hub point in the second non-Hub point adjacency list, a second signal is sent to the controller; and the controller is further configured to count the second-type triangles based on the second signal.

4. The connectivity-aware near-memory computing module for large-scale graph computation according to claim 3, characterized in that it includes the controller; the second CAM is specifically configured to obtain the base edges formed by the target non-Hub point and its corresponding non-Hub adjacent points as found by the query of the first CAM, and, based on the base edges, query the Hub point adjacency list of the non-Hub adjacent points to determine whether there is an edge between the target non-Hub point and a Hub point in the Hub point adjacency list; if an edge exists between the target non-Hub point and a Hub point in the Hub point adjacency list, a third signal is sent to the controller; and the controller is further configured to count the third-type triangles based on the third signal.

5. The connectivity-aware near-memory computing module for large-scale graph computation according to any one of claims 2-4, characterized in that the graph data further includes a first list in which all points in the target graph are sorted by degree from highest to lowest; and the controller is further configured to perform the following steps: traversing all points in the first list, and controlling the decoder to read the Hub adjacent point indices of each point from the first memory in sequence and transmit them to the index calculation module; after all points in the first list have been traversed, traversing the non-Hub points in the first list again, and controlling the decoder to read, from the first memory in sequence, the non-Hub point adjacency list of each target non-Hub point and the non-Hub point adjacency list of each non-Hub adjacent point of the target non-Hub point, and send them to the first CAM; and after the number of second-type triangles has been counted, controlling the decoder to read, from the first memory in sequence, the Hub point adjacency list of each base edge, based on the base edges formed by each target non-Hub point and its corresponding non-Hub adjacent points as found by the first CAM, and send it to the second CAM.

6. The connectivity-aware near-memory computing module for large-scale graph computation according to claim 5, characterized in that it further comprises an on-chip cache submodule; and the decoder is further configured to perform the following steps: before reading the information of the adjacent points corresponding to the currently traversed point from the first memory, querying whether the information of the adjacent points is cached in the on-chip cache submodule; if the information of the adjacent points is stored in the on-chip cache submodule, reading the graph data from the on-chip cache submodule; and if the information of the adjacent points is not stored in the on-chip cache submodule, reading the information of the adjacent points from the first memory and writing it to the on-chip cache submodule.

7. A connectivity-aware near-memory computing device for large-scale graph computation, characterized in that it comprises: multiple processing units, wherein each processing unit includes a buffer module and multiple first memories, and the buffer module includes a memory controller and multiple near-memory computing modules as described in any one of claims 1-6; the memory controller is configured to perform the following steps: receiving multiple batches of graph data of a target graph and the corresponding batch processing tasks sent by an external processor, wherein each batch of graph data includes the first bitmap and information on the adjacency lists of all points in the target graph; writing the multiple batches of graph data and the corresponding batch processing tasks into the first memories; controlling each near-memory computing module to process each batch processing task, read the graph data of the corresponding batch from the first memory, and count the number of triangles it contains to obtain local statistical results; and calculating the total number of triangles contained in the target graph based on the local statistical results of each near-memory computing module.

8. A connectivity-aware near-memory computing system for large-scale graph computation, characterized in that it comprises: a processor and at least one near-memory computing device as described in claim 7; the processor is configured to segment the graph to be processed based on the number of near-memory computing devices, generate graph data corresponding to multiple batches of target graphs, construct corresponding batch processing tasks, and distribute them to each near-memory computing device, wherein the graph data of each batch includes the first bitmap and information on the adjacency lists of all points in the target graph; each near-memory computing device is configured to count the number of triangles contained in the graph data of the corresponding batch based on the received batch processing task, and return the statistical results to the processor; and the processor is further configured to calculate the total number of triangles contained in the target graph based on the statistical results returned by each near-memory computing device.

9. The connectivity-aware near-memory computing system for large-scale graph computation according to claim 8, characterized in that the processor being configured to segment the graph to be processed based on the number of near-memory computing devices, generate graph data corresponding to multiple batches of target graphs, and construct corresponding batch processing tasks includes: sorting the points in the graph to be processed by degree from highest to lowest to generate a first list, wherein the degree of each point is related to the number of edges connected to that point; selecting, according to a first ratio, the top-ranked points from the first list and designating them as Hub points; specifying the direction of each edge in the graph to be processed as running from the endpoint with the lower degree to the endpoint with the higher degree; constructing the first bitmap from all edges consisting of two Hub points; constructing a Hub point adjacency list and a non-Hub point adjacency list for each point in the graph to be processed; dividing all Hub points into multiple Hub groups and all non-Hub points into multiple non-Hub groups; constructing each batch of graph data from the first bitmap and at least one of the following: the adjacency lists of a Hub group, or the adjacency lists of a non-Hub group; and constructing the corresponding batch processing task from the graph data of each batch.

10. The connectivity-aware near-memory computing system for large-scale graph computation according to claim 9, characterized in that the processor being configured to divide all Hub points into multiple Hub groups specifically includes: based on the number of all near-memory computing modules contained in the system, allocating Hub points to each near-memory computing module in a round-robin manner to obtain the Hub group corresponding to each near-memory computing module; and the processor being configured to divide all non-Hub points into multiple non-Hub groups specifically includes: dividing all non-Hub points into multiple non-Hub groups based on the capacity threshold of the CAM in each near-memory computing device and the data volume of the Hub point adjacency list and the non-Hub point adjacency list of each non-Hub point, wherein, in each non-Hub group, the total data volume of the non-Hub point adjacency lists and the total data volume of the Hub point adjacency lists are each less than or equal to the capacity threshold.