Data-triggered heterogeneous server system and its task scheduling method

By introducing direct hardware interconnection between the data trigger control layer and the computing node layer in a heterogeneous server system, the problem of low data scheduling efficiency in traditional server architecture is solved, achieving low-latency and deterministic task processing, which is suitable for real-time task scheduling in high-concurrency scenarios.

CN122309092A · Pending · Publication Date: 2026-06-30 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-29
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Traditional server architectures struggle to achieve closed-loop processing where data arrives and is executed immediately within a microsecond-level response window. This results in low data scheduling efficiency across heterogeneous servers. In particular, under high-concurrency scenarios, software-layer interrupt overhead and memory copying further amplify latency fluctuations, making task triggering and response unpredictable.

Method used

A data-triggered heterogeneous server system is adopted, which directly interconnects the computing node layer and the data triggering control layer through hardware, bypassing the operating system scheduling, and directly sending task instructions to the target processing core to achieve dynamic matching of data features and on-demand execution, avoiding resource idleness and overload.

Benefits of technology

It achieves low-latency and efficient task scheduling, eliminates the uncertainty of operating system scheduling and memory copy overhead, improves the determinism of task response and throughput, and supports real-time processing in high-concurrency scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a data-triggered heterogeneous server system and its task scheduling method, relating to the field of artificial intelligence technology. The system includes: a computing node layer comprising at least one computing node, each computing node including a processing core combination and corresponding core shared memory, wherein the processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores; and a data-triggered control layer comprising at least one control node, the control node being connected to the at least one general-purpose processing core and the multiple accelerated processing cores of the corresponding computing node. When received reference data meets the triggering conditions, the control node directly sends task instructions to multiple target processing cores in the connected computing node, wherein the processing core combination includes the multiple target processing cores. This solves the technical problem of low data scheduling efficiency in heterogeneous servers in related technologies.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence, and in particular to a data-triggered heterogeneous server system and its task scheduling method.

Background Technology

[0002] Traditional server architectures, with their fixed general-purpose cores (such as CPUs) and dedicated cores (such as GPUs) working together and static resource allocation mechanisms, struggle to achieve closed-loop processing where data is executed as soon as it arrives within a microsecond-level response window, severely hindering the industrialization of real-time intelligent systems.

[0003] While current mainstream server architectures have adopted high-speed interconnect technologies to improve hardware bandwidth, their task scheduling and data parsing still heavily rely on the operating system kernel, virtualization management platform, or software framework. Specifically, after data enters from the external interface, it must sequentially pass through: driver layer reception, kernel protocol stack parsing, user-mode scheduling framework matching, virtual memory mapping, task queue enqueueing, CPU scheduling context switching, and finally, wake-up of the acceleration unit for execution. The average latency of this entire chain is as high as 1-5ms. In high-concurrency scenarios, software-layer interrupt overhead, lock contention, and memory copying further amplify latency fluctuations, leading to unpredictable task triggering and response. In other words, heterogeneous servers in related technologies suffer from low data scheduling efficiency.

Summary of the Invention

[0004] This application provides a data-triggered heterogeneous server system and its task scheduling method to at least solve the problem of low data scheduling efficiency in heterogeneous servers in related technologies.

[0005] This application provides a data-triggered heterogeneous server system, comprising: a computing node layer, including at least one computing node, each computing node including a processing core combination and core shared memory corresponding to the processing core combination, wherein the processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores; and a data-triggered control layer, including at least one control node, the control node being connected to the at least one general-purpose processing core and multiple accelerated processing cores of the corresponding computing node, wherein, when the received reference data meets the triggering conditions, the control node directly sends task instructions to multiple target processing cores in the connected computing node, wherein the processing core combination includes multiple target processing cores.

[0006] This application also provides a task processing method for a heterogeneous server system based on data triggering, comprising: determining the reference data as target data when the reference data written to the core shared memory meets the preset triggering conditions, and parsing the target data to obtain data characteristics, wherein the core shared memory is the shared memory of a combination of processing cores; determining multiple target processing cores in the combination of processing cores according to the data characteristics, wherein the combination of processing cores includes at least one general-purpose processing core and multiple accelerated processing cores, and the combination of processing cores includes multiple target processing cores; generating task instructions according to the multiple target processing cores, and writing the task instructions into the registers corresponding to the multiple target processing cores.

[0007] This application also provides a task processing device for a data-triggered heterogeneous server system, comprising: a first determining module, configured to determine reference data as target data when reference data written to core shared memory meets preset triggering conditions, and to parse the target data to obtain data features, wherein the core shared memory is the shared memory of a processing core combination; a second determining module, configured to determine multiple target processing cores in the processing core combination according to the data features, wherein the processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores, and the processing core combination includes multiple target processing cores; and a generating module, configured to generate task instructions according to the multiple target processing cores and write the task instructions into the registers corresponding to each of the multiple target processing cores.

[0008] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of any of the above-described data-triggered heterogeneous server system task processing methods.

[0009] This application also provides a computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, it implements the steps of any of the above-described data-triggered heterogeneous server system task processing methods.

[0010] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the above-described data-triggered heterogeneous server system task processing methods.

[0011] The system provided in this application includes a computing node layer comprising at least one computing node, each computing node including a processing core combination and corresponding core shared memory. The processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores. A data triggering control layer includes at least one control node, which is connected to the corresponding computing node containing at least one general-purpose processing core and multiple accelerated processing cores. When the received reference data meets the triggering conditions, the control node directly sends task instructions to multiple target processing cores in the connected computing node. The processing core combination includes multiple target processing cores. Through direct hardware interconnection between the control node and the general-purpose cores and multiple accelerated cores within the computing node, task instructions are directly sent to the target processing core combination, bypassing operating system scheduling, when the triggering conditions are met. This achieves precise allocation of computing power to tasks and dynamically matches the combination of general-purpose cores and accelerated cores based on data characteristics, avoiding resource idleness and overload, and realizing low-latency processing with data triggering and on-demand execution. Therefore, it can solve the problem of low data scheduling efficiency in heterogeneous servers in related technologies.

Attached Figure Description

[0012] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a schematic diagram of the hardware environment of an optional data-triggered heterogeneous server system according to an embodiment of this application;

[0014] Figure 2 is a schematic diagram of an optional data-triggered heterogeneous server system according to an embodiment of this application;

[0015] Figure 3 is a schematic diagram of another optional data-triggered heterogeneous server system according to an embodiment of this application;

[0016] Figure 4 is a flowchart of an optional data-triggered task processing method for a heterogeneous server system according to an embodiment of this application;

[0017] Figure 5 is a structural block diagram of an optional data-triggered heterogeneous server system task processing device according to an embodiment of this application.

Detailed Implementation

[0018] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.

[0019] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.

[0020] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0021] According to one aspect of the embodiments of this application, a task processing method for a data-triggered heterogeneous server system is provided. As an optional implementation, the above-described task processing method can be applied to, but is not limited to, the hardware environment illustrated in Figure 1, which includes a data-triggered heterogeneous server system. This data-triggered heterogeneous server system may include:

[0022] The computing node layer 102 includes at least one computing node. The computing node includes a processing core combination and core shared memory corresponding to the processing core combination. It may also include a general processing unit, a switching module and a shared memory unit. The processing core combination includes at least one general processing core and multiple acceleration processing cores. The switching module can be used to connect cross-node links.

[0023] The data trigger control layer 104 includes at least one control node. The control node is connected to at least one general-purpose processing core and multiple accelerated processing cores of the corresponding computing node. When the received reference data meets the triggering conditions, the control node directly sends task instructions to multiple target processing cores in the connected computing node. The processing core combination includes multiple target processing cores.

[0024] It should be noted that the compute node layer refers to the physical unit within the server that executes computational tasks. Each compute node is an independent hardware module containing a set of heterogeneous processing cores, such as general-purpose cores and accelerated cores, along with their dedicated shared memory. This memory is a local cache or DRAM area shared by all processing cores within the node. A processing core combination can refer to a heterogeneous computing power unit within a compute node consisting of at least one general-purpose processing core, such as a CPU, and multiple accelerated processing cores (such as GPUs, NPUs, or FPGAs). The data trigger control layer is a control unit independent of the compute nodes. It can be a control node composed of at least one FPGA or dedicated control chip, directly establishing hardware connections with all processing cores within the compute node, thereby enabling real-time monitoring, feature analysis, and instruction issuance capabilities.

[0025] The control node does not schedule tasks through an operating system or software queue. Instead, it directly senses data-triggered events and, based on preset conditions, sends hardware-level task instructions in parallel to multiple processing cores (general or accelerated) within the target computing node. These instructions directly reach the instruction buffers of each core, achieving low-latency execution without middleware intervention. It's important to note that core shared memory serves as a local data carrier, ensuring that the target core can access the required data immediately after the instruction is issued.

[0026] In an optional implementation, the control node is implemented using an FPGA, with built-in data monitoring, feature parsing, and task triggering modules. After external data is written to the core shared memory via RDMA or CXL, the control node monitors write events to a specified address range in real time through the CXL3.0 management link. When data characteristics, such as format or QoS, meet the triggering conditions (e.g., latency less than or equal to 100μs and bandwidth greater than or equal to 5 GB/s), the control node identifies the target processing core combination (e.g., 8 NPUs and 1 CPU) in parallel using hardware logic, generates a 64-byte fixed-format hardware instruction containing the target core ID and data address, and writes it directly to the 1KB instruction buffer of each target core through a dedicated CXL3.0 channel. The instruction skips the OS kernel and virtualization layer; after completing its current instruction, each core immediately reads the new task from the buffer and processes the data in the shared memory. The control node ensures the visibility of instructions and data addresses through the CXL coherence protocol. The core shared memory can be implemented using DDR5 or HBM2E, with a capacity of 16–64GB, supporting multi-core concurrent read/write.
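
The delivery step described in paragraph [0026] can be sketched in software as follows. This is a minimal Python model for illustration, not the hardware design: the class name InstructionBuffer, the function dispatch, the FIFO ordering, and the reject-when-full policy are all assumptions. Only the 64-byte instruction size and the 1KB per-core buffer come from the text, which together imply 16 instruction slots per core.

```python
from collections import deque

INSTR_SIZE = 64                      # bytes per instruction (from the text)
BUFFER_SIZE = 1024                   # bytes per core instruction buffer (from the text)
SLOTS = BUFFER_SIZE // INSTR_SIZE    # 16 instructions fit in one buffer


class InstructionBuffer:
    """Hypothetical software model of one core's instruction buffer (FIFO assumed)."""

    def __init__(self):
        self._slots = deque()

    def write(self, instr: bytes) -> bool:
        """Control-node side: returns False when the buffer is full (assumed policy)."""
        if len(instr) != INSTR_SIZE:
            raise ValueError("instruction must be exactly 64 bytes")
        if len(self._slots) >= SLOTS:
            return False
        self._slots.append(instr)
        return True

    def read_next(self):
        """Core side: called after the current instruction completes."""
        return self._slots.popleft() if self._slots else None


def dispatch(instr: bytes, target_ids, buffers) -> list:
    """Write the same instruction to every target core; return the IDs that accepted it."""
    return [cid for cid in target_ids if buffers[cid].write(instr)]
```

With nine buffers (for example, 8 NPUs and 1 CPU), a call such as dispatch(instr, range(9), buffers) models the parallel write to all target cores; each core then calls read_next() on its own buffer without involving a scheduler.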

[0027] The system provided in this application includes a computing node layer comprising at least one computing node, each computing node including a processing core combination and corresponding core shared memory. The processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores. A data triggering control layer includes at least one control node, which is connected to the corresponding computing node containing at least one general-purpose processing core and multiple accelerated processing cores. When the received reference data meets the triggering conditions, the control node directly sends task instructions to multiple target processing cores in the connected computing node. The processing core combination includes multiple target processing cores. Through direct hardware interconnection between the control node and the general-purpose cores and multiple accelerated cores within the computing node, task instructions are directly sent to the target processing core combination, bypassing operating system scheduling, when the triggering conditions are met. This achieves precise allocation of computing power to tasks and dynamically matches the combination of general-purpose cores and accelerated cores based on data characteristics, avoiding resource idleness and overload, and realizing low-latency processing with data triggering and on-demand execution. Therefore, it can solve the problem of low data scheduling efficiency in heterogeneous servers in related technologies.

[0028] In an optional implementation, the control node includes: a data monitoring module, used to collect reference data and perform hardware comparison based on a preset parameter set, and generate a trigger signal when the triggering conditions are met; a feature parsing module, used to extract data features from the reference data upon receiving the trigger signal, and look up target task configuration information in a first mapping table based on the data features; and a task triggering module, used to generate task instructions upon receiving target task configuration information, and send the task instructions to the corresponding computing node.

[0029] It should be noted that the control node is a hardware control unit independent of the computing core, implemented by an FPGA or dedicated coprocessor. It integrates a data monitoring module, a feature parsing module, and a task triggering module, which are connected in series via an internal high-speed bus such as AXI4-Stream, forming a low-latency hardware pipeline. The data monitoring module is responsible for real-time acquisition of reference data from external input or shared memory, and performing hardware-level comparisons based on preset parameters such as traffic thresholds, priorities, and data formats. The feature parsing module performs protocol-level parsing on the received reference data, extracting structured features such as data format, processing requirements, and QoS parameters, and locating the corresponding processing task based on a mapping table between data features and task configuration. The task triggering module generates standardized hardware task instructions based on the mapping results and directly sends them to the target computing node via CXL or a dedicated link.

[0030] The control node autonomously completes the entire process of monitoring, parsing, matching, and triggering without the need for CPU or OS intervention. The data monitoring module determines the triggering conditions in real time through a hardware comparator array; the feature parsing module extracts data semantics using a state machine and lookup table circuit; and the task triggering module directly generates and sends instructions, bypassing the operating system's scheduling queue and interrupt handling, forming an end-to-end hardware closed loop from data arrival to task initiation.

[0031] In an optional implementation, the control node is deployed on an FPGA chip. The data monitoring module is configured with 64 parallel monitoring channels, each bound to a specific memory address range. Hardware counters and threshold comparators are used to monitor data flow, priority, and write events in real time. A high-level trigger signal is output when conditions are met. The trigger signal activates the feature parsing module, whose internal CRC check circuit verifies data integrity. The state machine parses header fields such as JSON format, processing requirement code 0x02, and latency limit of 100μs. Then, it performs precise and fuzzy matching through a built-in 1024-entry dual-port RAM mapping table, outputting target task configuration information. Upon receiving this information, the task triggering module generates a 64-byte fixed-format instruction, which may include instruction type, target core ID, task parameter address, priority, and checksum. This instruction is directly written to the instruction buffer of the target computing node via the CXL3.0 management link.

[0032] Through the above-described embodiments of this application, a single node can process over one million trigger events per second through hardware parallel comparison and mapping; at the same time, feature parsing and task matching are entirely executed by dedicated logic, thereby avoiding software polling overhead and reducing server power consumption; furthermore, it also eliminates the dependence on operating systems and drivers, enhancing the determinism of the system.

[0033] In an optional implementation, the data monitoring module includes: multiple monitoring channels that listen to reference data and generate a first-level signal, wherein the first-level signal is used to indicate the presence of data; a parameter acquisition circuit including a counter, a sampling register, and management interface logic, which reads the interface throughput register and priority field through the management link and stores them in a local register; a hardware comparator array including multiple independent hardware comparators, which are respectively connected to the local register and the output signal of each monitoring channel, for real-time comparison of the acquired signal parameters with the corresponding preset judgment parameters; and a logic combination unit connected to the output of each hardware comparator, for performing combined operations on multiple comparison results according to a preset logic relationship to generate a trigger signal.

[0034] It should be noted that the data monitoring module is the core unit in the control node that implements hardware-based data sensing. It consists of multiple monitoring channels, parameter acquisition circuits, a hardware comparator array, and a logic combination unit, all implemented using FPGA logic without software intervention. The monitoring channels are independent hardware circuits dedicated to monitoring write activity to specified memory address ranges and outputting a high-level data presence signal. The parameter acquisition circuit integrates counters, sampling registers, and management interface logic to collect dynamic parameters such as interface throughput and data priority in real time and cache them in local registers. The hardware comparator array consists of multiple parallel comparators, each corresponding to a type of preset judgment parameter such as flow threshold or priority level. The logic combination unit is a configurable combinational logic circuit that integrates multiple comparison results and outputs the final trigger signal.
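
As a rough illustration of how monitoring channels bound to address ranges could raise a data presence signal, the following Python sketch maps one write to a bitmask with one bit per channel. The names Channel and presence_mask, the base/size representation of a range, and the overlap rule are assumptions made for the example; the 64-channel count is taken from the optional implementation described elsewhere in this application.

```python
from dataclasses import dataclass

NUM_CHANNELS = 64  # channel count used in the optional implementation


@dataclass(frozen=True)
class Channel:
    base: int   # first byte address watched by this channel (hypothetical)
    size: int   # length of the watched range in bytes (hypothetical)

    def hit(self, addr: int, length: int) -> bool:
        """True when the write [addr, addr + length) overlaps the watched range."""
        return addr < self.base + self.size and self.base < addr + length


def presence_mask(channels, addr: int, length: int) -> int:
    """Return a bitmask with bit i set when channel i observes the write."""
    mask = 0
    for i, ch in enumerate(channels):
        if ch.hit(addr, length):
            mask |= 1 << i
    return mask
```

In hardware, every channel would evaluate its range check concurrently; the loop here only stands in for that parallel comparison.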

[0035] Furthermore, multiple monitoring channels listen to different data sources in parallel, the parameter acquisition circuit captures dynamic operating parameters in real time, the hardware comparator array achieves real-time comparison at the micro-nanosecond level, and the logic combination unit supports complex trigger condition combinations, such as triggering only under high priority and high traffic conditions, without the need for software polling and interrupt handling, thus achieving a low-latency response throughout the entire process from data arrival to trigger signal generation.

[0036] In an optional implementation, 64 independent monitoring channels are deployed in the FPGA, each channel bound to a shared memory address segment. When a write operation is detected in that address segment, the channel immediately outputs a first-level signal held at a continuous high level. The parameter acquisition circuit periodically reads the throughput register and the priority field (P0–P4) of the data packet header from the CXL3.0 management link interface and writes them to a local 64-bit sampling register, while a counter synchronously counts the number of accesses within 1ms. The hardware comparator array contains 8 independent comparators, which check preset thresholds such as whether the traffic is greater than or equal to 10 GB/s, whether the priority is greater than or equal to P1, whether the access frequency is greater than or equal to 500 times, and whether the cache usage is less than or equal to 80%. The logic combination unit uses programmable combinational circuits. For example, if "traffic greater than or equal to the threshold and priority equal to P0" is configured as the valid trigger condition, a trigger signal is generated and output to the feature parsing module only when both corresponding comparators output a high level.
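
The comparator array and logic combination unit of paragraph [0036] can be modeled as a set of boolean comparisons followed by a configurable AND. The sketch below is only a software analogy: the Sample field names, the comparator labels, and the encoding of P0 to P4 as the integers 0 to 4 are assumptions, while the thresholds (10 GB/s, 500 accesses per 1ms, 80% cache usage) are the example values from the text.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Parameters latched in the local sampling register (field names are hypothetical)."""
    traffic_gbps: float      # interface throughput in GB/s
    priority: int            # 0..4 standing for P0..P4 (assumed encoding)
    accesses_per_ms: int     # counter value over a 1 ms window
    cache_usage_pct: float   # cache occupancy in percent


def comparator_outputs(s: Sample) -> dict:
    """One boolean per comparator, using the example thresholds from the text."""
    return {
        "traffic_ge_10": s.traffic_gbps >= 10.0,
        "priority_is_p0": s.priority == 0,
        "access_ge_500": s.accesses_per_ms >= 500,
        "cache_le_80": s.cache_usage_pct <= 80.0,
    }


def trigger(s: Sample, required=("traffic_ge_10", "priority_is_p0")) -> bool:
    """Logic combination unit: AND of the selected comparator outputs."""
    outs = comparator_outputs(s)
    return all(outs[name] for name in required)
```

Changing the required tuple corresponds to reprogramming the combinational logic, for instance requiring high traffic and high access frequency instead of high traffic and P0 priority.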

[0037] Through the above-described embodiments of this application, the structure supports 64-channel parallel monitoring, and a single module can process over ten million data events per second; hardware comparison and logical combination completely eliminate software latency, and triggering accuracy reaches the byte level; multi-parameter joint judgment, such as combining priority, traffic, and access frequency, effectively filters instantaneous jitter and noise, reducing the false trigger rate; at the same time, combined with the management interface, dynamic parameter updates are supported, providing the system with a configurable and highly robust hardware-aware kernel.

[0038] In an optional implementation, the feature parsing module includes: a preprocessing circuit for performing integrity verification on received data packets and discarding data packets that fail verification based on verification logic; a feature extraction circuit, including a state machine circuit and a field parsing unit, wherein the state machine circuit is used to parse the data packet header according to a preset protocol format and extract data format identifiers, processing requirement codes, and parameter fields; a dual-port memory for storing a first mapping table, which consists of multiple entries, each entry containing a mapping relationship between a combination of data format, processing requirements, and performance parameters and the corresponding target core type, core quantity, cache quota, and link priority; a parallel matching array, consisting of multiple hardware comparators, for simultaneously matching multiple entries in the mapping table, wherein the matching process includes: precise matching of data format and processing requirements, requiring complete field consistency; fuzzy matching of performance parameters, allowing deviations from the target value within a preset error range; and an address decoding unit for decoding the resource pointers corresponding to successfully matched entries into target task configuration information and outputting it to the task triggering module.

[0039] It should be noted that the feature parsing module is a hardware circuit unit in the control node that implements data semantic recognition and task mapping. It consists of a preprocessing circuit, a feature extraction circuit, a dual-port memory, a parallel matching array, and an address decoding unit, all implemented based on FPGA logic without the involvement of an operating system. The preprocessing circuit is responsible for data packet integrity verification; the feature extraction circuit parses the data header according to the protocol through a state machine and a field parsing unit; the dual-port memory is a high-speed RAM that stores a mapping table between data features and task configurations, supporting parallel read and write operations; the parallel matching array consists of multiple sets of hardware comparators to achieve concurrent matching of multiple entries; and the address decoding unit converts successfully matched entries into executable task instruction parameters.

[0040] After the data packet undergoes integrity verification, the state machine extracts the data format, processing requirements, and QoS parameters according to the protocol structure. It performs concurrent comparisons on thousands of rules in the mapping table, supporting both exact and fuzzy matching strategies to quickly locate the optimal task configuration. The matching results are directly decoded into information such as the target core type, quantity, cache, and priority parameters that can be executed by the hardware, and output to the task triggering module. The entire process is performed without software intervention, realizing end-to-end hardwareization of data input, semantic parsing, and task mapping.
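
The two-stage matching in paragraph [0040] (exact on format and processing requirement, fuzzy on performance parameters) can be sketched as follows. This is an illustrative Python model under stated assumptions: the Entry fields, the 10% relative tolerance, and the first-match policy are hypothetical, and the sequential loop stands in for the parallel comparators that would test all entries at once in hardware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    fmt: int            # data format identifier (exact match)
    req: int            # processing requirement code (exact match)
    latency_us: float   # target latency limit (fuzzy match)
    gbps: float         # target throughput (fuzzy match)
    config: str         # stand-in for the resource pointer / task configuration


def within(value: float, target: float, tol: float) -> bool:
    """Fuzzy comparison: deviation from the target within a relative tolerance."""
    return abs(value - target) <= tol * target


def lookup(table, fmt, req, latency_us, gbps, tol=0.10) -> Optional[Entry]:
    """Return the first entry whose exact fields agree and whose performance
    parameters fall inside the tolerance, or None when nothing matches."""
    for e in table:
        if (e.fmt == fmt and e.req == req
                and within(latency_us, e.latency_us, tol)
                and within(gbps, e.gbps, tol)):
            return e
    return None
```

A request with format 1, requirement code 0x02, a 95 μs latency limit, and 5.2 GB/s throughput would match an entry registered for 100 μs and 5 GB/s under this tolerance, whereas an 80 μs limit would not.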

[0041] In an optional implementation, received data packets are verified for integrity by a CRC-32C check circuit, and erroneous packets are immediately discarded. The feature extraction circuit uses a state machine to parse the header according to JSON or a custom protocol, extracting data format identifiers, processing requirement codes, QoS latency limits, and throughput requirements, and storing them in a register. The dual-port memory can store multiple mapping entries. The parallel matching array consists of multiple sets of hardware comparators, each independently comparing one entry. First, precise matching of the format and requirement fields is performed, followed by fuzzy matching of latency and throughput. Upon successful matching, the address decoding unit reads the resource pointer from the entry, decodes it into 32-bit structured target configuration information, which may include core ID ranges, cache address segments, priority codes, etc., and then outputs it to the task triggering module via the AXI4-Stream bus.
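
The integrity check performed by the preprocessing circuit in paragraph [0041] can be illustrated with a bitwise CRC-32C (Castagnoli) routine. The sketch below is a software stand-in for the check circuit; the framing, in which the packet ends with a 4-byte little-endian CRC trailer, and the function name preprocess are assumptions, since the text does not specify where the checksum is carried.

```python
import struct

CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C: reflected, initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def preprocess(packet: bytes):
    """Return the payload if the trailing CRC matches, else None (packet discarded).
    The 4-byte little-endian trailer is an assumed framing, not taken from the source."""
    if len(packet) < 4:
        return None
    payload, trailer = packet[:-4], packet[-4:]
    (expected,) = struct.unpack("<I", trailer)
    return payload if crc32c(payload) == expected else None
```

Only packets that pass preprocess would be handed on to header parsing; a failed check corresponds to the immediate discard described above.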

[0042] Through the above-described implementation methods of this application, the single-match latency is lower than that of software table lookup; it supports a hybrid strategy of precise and fuzzy matching to adapt to real-world business QoS fluctuations; the hardware parallel architecture avoids serial scanning, significantly improving throughput and determinism; the mapping table is dynamically updatable, thereby supporting online adjustment of business strategies; and the complete hardware implementation eliminates OS scheduling jitter, ensuring the predictability of task scheduling in scenarios such as AI inference and video analysis.

[0043] In an optional implementation, the task triggering module includes: an instruction generation unit, used to generate task instructions based on the input target task configuration information, wherein each task instruction is a hardware instruction that includes an instruction type field, a target core identifier field, a task parameter address field, a task priority field, and a checksum field; and an instruction distribution unit, coupled to the instruction generation unit, used to directly write the task instructions into the instruction buffer of the target core through a dedicated control link.

[0044] It should be noted that the task triggering module is a hardware unit in the control node responsible for converting the matched task configuration into executable instructions and sending them to the computing core. It consists of an instruction generation unit and an instruction distribution unit. The instruction generation unit synthesizes a standardized hardware instruction according to a fixed format based on the input target task configuration information. The instruction distribution unit is a dedicated hardware circuit that directly writes the instruction into the instruction buffer of the target processing core through a low-latency control link, bypassing the operating system scheduling mechanism.

[0045] The task triggering module does not rely on software queues or interrupts. Instead, it is based on hardware instruction format. The instruction generation unit constructs 64-byte instructions according to preset field specifications, and the instruction distribution unit directly injects the instructions into the target core's local instruction buffer through a dedicated CXL or high-speed serial link. This achieves a zero-scheduling closed loop of configuration-to-execution, avoiding the problems of kernel switching, virtualization overhead, and queue queuing delays in traditional software scheduling paths.

[0046] In an optional implementation, the instruction generation unit receives target task configuration information from the address decoding unit, such as 8 NPUs, 128MB cache, and P1 priority. It fills in fields according to a fixed 64-byte format: instruction type is 0x01 (task start), target core identifier is a 4-byte core ID range, such as 0x00000001–0x00000008, task parameter address is a 32-byte shared memory physical starting address, task priority is a 1-byte P0–P4 encoded field, and the checksum is generated by hardware CRC8 calculation. After instruction generation, the instruction distribution unit directly writes the instructions to the 1KB instruction buffer of each NPU / GPU in the target compute node via two redundant CXL3.0 management links (primary / backup) using DMA. Each core is equipped with an independent buffer; instruction writing does not require CPU intervention. After completing the current instruction, the core automatically reads the new task from the buffer and executes it.
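As an illustration of the fixed 64-byte format, the following sketch packs and verifies one instruction. The field sizes (1-byte type, 4-byte core ID range, 32-byte parameter address, 1-byte priority, CRC8 checksum) come from the paragraph above; everything else is an assumption: the field order, the split of the 4-byte range into two 16-bit core IDs, zero padding, placing the checksum in the last byte, and the CRC-8 polynomial 0x07, none of which the text specifies.

```python
import struct

INSTR_SIZE = 64          # fixed instruction length stated in the text
TYPE_TASK_START = 0x01   # instruction type for "task start"


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, initial value 0 (polynomial is an assumption)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def build_instruction(first_core: int, last_core: int,
                      param_addr: int, priority: int) -> bytes:
    """Pack one task-start instruction using the assumed layout."""
    if not 0 <= priority <= 4:  # P0..P4
        raise ValueError("priority must be in 0..4")
    body = struct.pack(
        ">BHH32sB",
        TYPE_TASK_START,
        first_core, last_core,            # assumed encoding of the 4-byte range
        param_addr.to_bytes(32, "big"),   # 32-byte task parameter address
        priority,
    )
    body = body.ljust(INSTR_SIZE - 1, b"\x00")  # zero padding (assumed)
    return body + bytes([crc8(body)])           # checksum in last byte (assumed)


def verify(instr: bytes) -> bool:
    """Check length and checksum, as a receiving core might."""
    return len(instr) == INSTR_SIZE and crc8(instr[:-1]) == instr[-1]
```

For example, `build_instruction(1, 8, 0x1000, 1)` corresponds to the 8-NPU, P1 case mentioned in the paragraph, and `verify` rejects a copy in which any single byte has been altered.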

[0047] Through the above-described embodiments of this application, the structure improves task startup latency compared with traditional software scheduling, eliminating the unpredictable delays caused by operating system kernel scheduling, virtualization layer interrupts, and queuing; the hardware instruction format ensures accurate delivery, and the checksum field guarantees transmission reliability; batch delivery of a single instruction to multiple cores (SIMC mode) is supported, improving concurrent scheduling efficiency; and the dedicated control link, directly connected to the instruction buffers, gives the system deterministic real-time response capabilities.

[0048] In an optional implementation, the computing node includes: at least one general-purpose processing core, multiple accelerated processing cores, a photonic switching module, and a core shared memory unit; the general-purpose processing core establishes a fully connected interconnect topology with the multiple accelerated processing cores through the photonic switching module, and the multiple accelerated processing cores are directly connected to the photonic switching module through a single-hop path, and the photonic switching module is also directly connected to the core shared memory unit; wherein, each accelerated processing core is configured with two interconnect channels: the first channel is used for direct connection between accelerated chips of the same type; the second channel is used to establish a cache-coherent access channel with the photonic switching module and the core shared memory unit.

[0049] It should be noted that a compute node is a heterogeneous computing unit within a high-performance server, comprising at least one general-purpose processing core (such as a CPU), multiple accelerated processing cores (such as GPUs or NPUs), a photonic switching module, and a core shared memory unit. The photonic switching module utilizes silicon photonics technology to achieve high-speed optical interconnection, supporting multi-port fully connected topologies and featuring low latency and high bandwidth. The core shared memory unit is a local DRAM or HBM memory pool shared by all processing cores, achieving cache-coherent access via the CXL protocol. Each accelerated core is equipped with two physical interconnect channels: the first channel is a dedicated direct link between accelerators of the same type, such as NVLink; the second channel is a cache-coherent channel connected to the general-purpose core and shared memory via the photonic switching module.

[0050] The general-purpose core and the acceleration core achieve full connectivity and single-hop direct connection through the photonic switching module, ensuring that communication between any cores does not require relay. The acceleration core also has a high-speed direct connection channel between chips of the same type and a unified cache consistency access channel, realizing dual optimization of computing power collaboration and memory sharing, breaking through the traditional PCIe bottleneck and cache inconsistency latency, and forming a local computing power pool with low latency and high bandwidth.

[0051] In an optional implementation, a single computing node deploys one CPU, 16 NPUs, and one photonic switching module. The photonic switching module integrates 32 silicon photonic transceiver ports. The CPU and all NPUs are directly connected to the module via a single optical fiber, achieving a cross-core communication latency of less than or equal to 1 μs. Each NPU is equipped with two physical channels: the first channel is a direct NVLink 5.0 connection between NPUs for parameter synchronization during training; the second channel connects to the photonic switching module via the CXL 3.0 protocol to establish cache-consistent access with the CPU and core shared memory. The consistency protocol uses hardware directory management, with a single cache access latency of less than or equal to 20 ns. The photonic switching module is also directly connected to the core shared memory controller, supporting all cores to access memory through a unified virtual address space without data copying. The system supports hot-swapping; when a new NPU is added, the photonic module automatically allocates ports and updates the topology.

[0052] Through the above-described embodiments of this application, the architecture is an improvement over the traditional PCIe architecture; the dual-channel design takes into account both the collaboration of similar cores such as parameter synchronization between NPUs and the unified memory access of heterogeneous cores, thereby improving resource utilization; cache consistency is guaranteed by hardware, eliminating software synchronization overhead and improving AI training / inference throughput; photonic interconnect supports thousand-node-level expansion, providing a high-density, low-power, and scalable local computing power interconnect foundation for supernode servers, which is superior to the traditional electronic switching architecture.

[0053] In an optional implementation, the computing node includes: upon receiving a task instruction, directly writing the instruction into a reserved instruction buffer within the computing node via a hardware instruction dispatch circuit; when the current instruction cycle executed by the computing node ends, reading the task instruction from the instruction buffer via hardware priority arbitration logic, and accessing the physical address specified in the core shared memory unit according to the task parameter address field to obtain the input data and running parameters required by the task; determining multiple target cores in the processing core combination according to the task instruction, and instructing the target cores to execute the corresponding task.

[0054] It should be noted that a compute node is a heterogeneous computing unit with local instruction processing capabilities, comprising an instruction buffer, a hardware instruction dispatch circuit, hardware priority arbitration logic, and multiple processing cores, such as a CPU, NPU, or GPU. The instruction buffer is a hardware queue reserved for each core to temporarily store pending task instructions; the hardware instruction dispatch circuit is a dedicated logic circuit used to directly write externally issued instructions into the buffer; the hardware priority arbitration logic dynamically schedules the instruction reading order according to instruction priority, ensuring that high-priority tasks are executed first.

[0055] After receiving a task instruction, the computing node directly writes it into the local instruction buffer via hardware circuitry. After the current instruction is executed, the highest priority instruction is automatically selected for reading through hardware priority arbitration logic. The node then directly accesses shared memory to obtain data based on the parameter address embedded in the instruction. Finally, the designated core combination is started to execute the task, realizing a fully hardware closed loop of instruction arrival, buffering, arbitration, data retrieval, and execution, without the need for software scheduling and memory copying.

[0056] In an optional implementation, after the task triggering module issues a 64-byte hardware instruction to the compute node, the instruction dispatch circuit directly writes the instruction into the 1KB instruction buffer of the target core (such as an NPU) via the CXL3.0 management link, without the involvement of the CPU or OS. When the current instruction execution cycle ends, the hardware priority arbitration logic reads the priority fields of all pending instructions in the buffer. It uses a multi-level hard-coded priority queue structure in which P0 instructions take precedence over P1, and so on; instructions with the same priority are processed in FIFO order. After selecting the highest-priority instruction, the arbitration logic parses its task parameter address field, directly accesses the core shared memory unit through the cache consistency channel, and reads the input data and runtime parameters. Subsequently, the target core identifier field in the instruction triggers the multi-core collaboration logic, and the hardware scheduler synchronously wakes up the designated 8 NPUs. Each core loads the instructions and data and begins parallel execution of tasks such as classification and inference, without the need for a virtualization layer or context switching.
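The arbitration rule (lower priority number first, FIFO among equal priorities) can be modeled in a few lines. This is a software sketch of the ordering only, not of the hard-coded hardware queues; the class and method names are hypothetical, and the capacity of 16 slots is simply the stated 1KB buffer divided by the stated 64-byte instruction size.

```python
import heapq
import itertools


class InstructionBuffer:
    """Per-core buffer: P0 before P1 and so on, FIFO within a priority."""

    def __init__(self, capacity: int = 16):  # 1 KB / 64 B = 16 slots
        self._heap = []
        self._seq = itertools.count()  # arrival order, used as FIFO tie-breaker
        self._capacity = capacity

    def write(self, priority: int, instr) -> bool:
        """Store an instruction; return False if the buffer is full."""
        if len(self._heap) >= self._capacity:
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), instr))
        return True

    def next_instruction(self):
        """Called when the current instruction cycle ends; None if empty."""
        if not self._heap:
            return None
        _, _, instr = heapq.heappop(self._heap)
        return instr
```

Because the arrival counter is unique, two instructions with the same priority are always ordered by arrival, which reproduces the FIFO behavior described for equal priorities.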

[0057] Through the above-described implementation methods of this application, the mechanism achieves improved task initiation compared to software scheduling; hardware arbitration ensures zero-latency preemption of high-priority tasks, with system response determinism reaching the microsecond level; instructions and data are directly accessed through shared memory, eliminating the overhead of traditional memory copying; multi-core collaboration is automatically triggered by hardware, supporting fine-grained task splitting and improving computing power utilization; and the fully hardware-based execution path eliminates OS jitter and interrupt latency.

[0058] In an optional implementation, the core shared memory is interconnected with multiple reference processing cores in the processing core combination in the computing node through a connection protocol, enabling unified access to the core shared memory by all processing cores and supporting concurrent read and write operations by multiple computing cores. The consistency of memory access is maintained through hardware mechanisms. The core shared memory is directly interconnected with the control layer, enabling the control layer to directly read, write, or schedule data in the core shared memory.

[0059] It should be noted that the core shared memory is a unified physical memory pool accessed by multiple processing cores such as CPUs, GPUs, and NPUs within a compute node. It is directly connected to the processing cores via CXL or photonic interconnect protocols, supporting concurrent read and write operations. The connection protocol is a high-speed interconnect protocol such as CXL 3.0 that supports cache coherency. The control layer refers to the data trigger control node, which includes the FPGA array and management link, and has hardware permissions to directly access the shared memory.

[0060] All processing cores are directly connected to shared memory via a high-speed interconnect protocol, enabling low-latency concurrent access; memory consistency is automatically maintained by a hardware directory or Snoop mechanism, without the need for software intervention; the control layer is directly connected to shared memory via a dedicated link and can bypass the processing cores to directly read, write, and schedule data, achieving hardware-level collaboration among control, memory, and computing and breaking the bottleneck of separated control and data paths in traditional architectures.

[0061] In an optional implementation, the core shared memory employs a hybrid architecture of 128GB HBM3 and persistent Flash memory. It establishes a fully interconnected connection with the 16 processing cores (including CPU and NPU) in the compute node via the CXL3.0 protocol, with all cores sharing a unified virtual address space. Memory consistency is achieved by a hardware directory controller: each cache block maintains a status bit (Modified / Exclusive / Shared / Invalid). When a core initiates access, the directory controller completes owner lookup and cache invalidation within 20ns, ensuring multi-core read / write consistency. The control layer is directly connected to the shared memory controller via an independent CXL3.0 management link, allowing the FPGA to directly initiate DMA read / write requests without going through the CPU. For example, the data monitoring module can directly read the access frequency of data blocks to be processed in the shared memory, the feature parsing module can directly write task parameter addresses, and the task triggering module can directly schedule data migration. The shared memory supports concurrent access from 256 cores.
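To make the directory-based consistency step concrete, here is a deliberately simplified sketch of a directory that tracks the Modified / Exclusive / Shared / Invalid state of each cache block per core. It models only state transitions on reads and writes (no write-back, timing, or DMA path), and the class and method names are hypothetical rather than taken from the application.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Directory:
    """Tracks, per cache block, which cores hold it and in which state."""

    def __init__(self):
        self._holders = {}  # block address -> {core id: State}

    def state(self, block: int, core: int) -> State:
        return self._holders.get(block, {}).get(core, State.INVALID)

    def read(self, block: int, core: int) -> State:
        holders = self._holders.setdefault(block, {})
        others = [c for c in holders if c != core]
        if others:
            # Another core holds the block: every holder drops to Shared.
            for c in others + [core]:
                holders[c] = State.SHARED
        else:
            # Sole holder: keep the current state, or take the block Exclusive.
            holders.setdefault(core, State.EXCLUSIVE)
        return holders[core]

    def write(self, block: int, core: int) -> list:
        """Owner lookup and invalidation; returns the invalidated cores."""
        holders = self._holders.setdefault(block, {})
        invalidated = [c for c in holders if c != core]
        for c in invalidated:
            del holders[c]
        holders[core] = State.MODIFIED
        return invalidated
```

The list returned by `write` corresponds to the cache invalidations the directory controller would send before granting the writer ownership.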

[0062] Through the above-described embodiments of this application, the structure enables low-latency and high-concurrency access to memory by all processing cores and the control layer, thereby improving system throughput; the hardware consistency mechanism avoids software synchronization overhead and reduces cache miss rate; the direct connection mechanism of the control layer compresses the latency of operations such as task scheduling, cache migration, and prefetch triggering; it supports the decoupling of control and computation, realizes a data-driven architecture closed loop, and provides a unified and efficient memory resource pool for the supernode server.

[0063] In an optional implementation, the system further includes: a shared memory layer, which includes a memory interconnect network for interconnecting the core shared memory of at least one computing node in the computing node layer; when the number of nodes in the computing node layer does not exceed a preset number, memory access requests are broadcast through links and all cores monitor the cache status; when the number of nodes exceeds the preset number, hardware logic temporarily stores access requests through a temporary cache queue; wherein, the shared memory layer includes: an address translation module for dynamically mapping the physical address of the memory to a unified virtual address space of the system.

[0064] It should be noted that the shared memory layer is a global memory interconnect system spanning multiple compute nodes, comprising a memory interconnect network and an address translation module. The memory interconnect network aggregates the core shared memory of each node through RDMA or CXL interconnect technology, forming a unified memory pool. The address translation module is a hardware circuit that dynamically maps the physical addresses of heterogeneous storage media such as Flash and DRAM to the system's unified virtual address. The preset number is a system-configured node threshold (e.g., 8 nodes) used to trigger consistency protocol switching.

[0065] Figure 2 is a schematic diagram of an optional data-triggered heterogeneous server system according to an embodiment of this application. The data triggering control layer contains an Ethernet switching module and data triggering control node_1, data triggering control node_2, data triggering control node_3, ..., data triggering control node_n. The Ethernet switching module is connected to each of the data triggering control nodes_1,_2,_3, ...,_n. The supernode computing core layer contains an RDMA interconnect network and supernode computing nodes_1,_2,_3, ...,_4. The RDMA interconnect network is connected to the shared memory units of each of the supernode computing nodes_1,_2,_3, ...,_4. Furthermore, the data triggering control nodes and the supernode computing nodes are connected via photonic interconnects supporting the CXL protocol. When the number of nodes is small, the Snoop broadcast mechanism is used to achieve low-latency consistency. When the number of nodes exceeds the threshold, the system automatically switches to a directory-based consistency protocol and avoids congestion by temporarily storing requests in a hardware queue. At the same time, a hardware address translation module uniformly maps physical storage resources to a virtual address space, breaking through the single-node memory capacity limit and realizing seamless cross-node memory access.

[0066] In an optional implementation, the shared memory layer interconnects 16 compute nodes via a RoCE v3 RDMA switch, with each node's core shared memory registered in a globally unified virtual address space. When the number of nodes is 8 or less, memory access requests are broadcast via the CXL link, and all nodes listen for and respond to Snoop requests. When the number of nodes is greater than 8 and the hardware logic detects that the link utilization is greater than 60% or the latency is greater than 20ns, a protocol switch is triggered immediately: access requests are temporarily stored in a 512-entry temporary cache queue within each node's memory controller, and the directory controller enables a 64MB hardware directory table that records the ownership and status of each cache block, so that requests are sent only to the cache owner, reducing link traffic. The address translation module has a built-in 64MB ATC cache that dynamically maps Flash physical addresses to unified virtual addresses, for example mapping 0x1000000000 to 0x8000000000, and the mapping table is stored in the FPGA SRAM. When a new node is added, the address management unit automatically allocates a new address range and updates the global mapping table, supporting expansion of a single Pod to 1024TB of shared memory.
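The switching condition can be written as a small decision function. The thresholds (8 nodes, 60% link utilization, 20 ns latency) are the figures given above. The paragraph does not say what happens above 8 nodes when neither congestion condition holds, so remaining in Snoop mode in that case is an assumption of this sketch, as are the function and parameter names.

```python
def select_protocol(node_count: int, link_utilization: float, latency_ns: float,
                    node_threshold: int = 8, util_threshold: float = 0.60,
                    latency_threshold_ns: float = 20.0) -> str:
    """Return "snoop" or "directory" for the current operating point."""
    if node_count <= node_threshold:
        return "snoop"      # small system: broadcast requests to all nodes
    congested = (link_utilization > util_threshold
                 or latency_ns > latency_threshold_ns)
    if congested:
        return "directory"  # queue requests and forward only to the block owner
    return "snoop"          # assumed behavior; not stated in the text
```

With 16 nodes and 75% link utilization, for instance, the function selects directory mode, while an 8-node system stays on Snoop broadcast regardless of load.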

[0067] Through the above-described embodiments of this application, the architecture scales smoothly from 8 nodes to 256 nodes; adaptive switching of the consistency protocol reduces network congestion and improves system throughput; the address translation module breaks through the single-node memory bottleneck, supports unified addressing of heterogeneous storage, and improves memory utilization; cross-node memory access latency is reduced, and RDMA zero-copy access is supported, reducing data migration overhead. The architecture provides a globally scalable, low-latency, large-capacity memory pool for supernode servers and is the core supporting architecture for large-scale AI clusters and distributed real-time computing.

[0068] In an optional implementation, the shared memory layer includes: an address translation module with a built-in cache mapping table. When the target processing core initiates an access request, if the cache mapping table is not hit, an entry missing interrupt is triggered, and the address management unit dynamically loads the entry from the global address mapping table distributed in each node to the cache mapping table.

[0069] It should be noted that the shared memory layer is a global memory resource pool spanning multiple computing nodes. The address translation module is a hardware circuit unit used to map virtual addresses in access requests to actual physical addresses. The cache mapping table is a high-speed SRAM cache inside the address translation module, storing frequently used address mapping entries. The global address mapping table is a system-wide address mapping database distributed in the Flash memory of each node, recording the physical addresses and home nodes of all storage blocks. The address management unit is a dedicated hardware controller responsible for responding to interrupts and loading missing mapping entries from distributed storage.

[0070] When the address of the target processing core access does not hit the local cache mapping table, the hardware triggers an interrupt. The address management unit automatically loads the required mapping entries from the global address mapping table distributed across each node, enabling transparent access across nodes and large-capacity memory without the need for operating system intervention or pre-allocation of address space.

[0071] In an optional implementation, the address translation module incorporates a 64MB SRAM cache mapping table. Each entry is a mapping record from a 64-bit virtual address to a 64-bit physical address, including a storage level identifier and a validity bit. When the processing core initiates a virtual address access, such as 0x8000000000, the hardware queries the cache mapping table. If a cache miss occurs, an interrupt signal for a missing entry is immediately generated and sent directly to the address management unit. The address management unit resolves the page to which the target address belongs, locates its corresponding global address mapping table fragment (stored in a 1TB distributed mapping table in the Flash memory of each node, stored by node fragment hash), directly accesses the target node's Flash memory via an RDMA Read operation, reads the entry, and writes it to the local cache mapping table via the CXL link. After loading, the address translation module continues the original access request, and subsequent accesses to the same address directly hit the cache.
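The miss-handling flow (query the cache mapping table, load the missing entry from the home node's shard on a miss, then continue the access) is sketched below. Several details are assumptions made only to keep the example runnable: a 4 KiB page size, modulo hashing to pick the home node, LRU eviction, and a plain dictionary lookup standing in for the RDMA Read. The class names are hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KiB pages; the text does not give a page size


class GlobalMappingTable:
    """Mapping entries sharded across nodes by a hash of the virtual page."""

    def __init__(self, node_count: int):
        self.shards = [dict() for _ in range(node_count)]

    def home_node(self, vpage: int) -> int:
        return vpage % len(self.shards)  # assumed hash function

    def register(self, vpage: int, ppage: int) -> None:
        self.shards[self.home_node(vpage)][vpage] = ppage

    def remote_read(self, vpage: int) -> int:
        # Stands in for the RDMA Read against the home node's Flash.
        return self.shards[self.home_node(vpage)][vpage]


class AddressTranslator:
    """Local cache mapping table with load-on-miss."""

    def __init__(self, global_table: GlobalMappingTable, capacity: int = 4):
        self.global_table = global_table
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpage = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpage in self.cache:
            self.cache.move_to_end(vpage)        # refresh LRU position
        else:
            self.misses += 1                     # "entry missing" event
            self.cache[vpage] = self.global_table.remote_read(vpage)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
        return (self.cache[vpage] << PAGE_SHIFT) | offset
```

After the first access to a page loads its entry, later accesses to the same page hit the local table and do not increment `misses`, matching the behavior described at the end of the paragraph.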

[0072] Through the above-described implementation methods of this application, the mechanism achieves low-latency, pre-allocation-free, and transparent access to large-capacity shared memory, breaking through the limitations of fixed page tables and address spaces in traditional memory virtualization; hardware-based dynamic loading can avoid the high jitter of software page table traversal; the distributed global mapping table supports hot expansion of nodes, and new nodes do not require global configuration and automatically register mappings; the system supports concurrent access to tens of millions of memory blocks, providing scalable, high-concurrency, and low-overhead global memory addressing capabilities for heterogeneous supernode architectures.

[0073] Example 1:

[0074] Figure 3 is a schematic diagram of another optional data-triggered heterogeneous server system according to an embodiment of this application. Figure 3(a) shows the data trigger control node, and Figure 3(b) shows the supernode computing node. The data trigger control node can interconnect with other data trigger control nodes via Ethernet and is also connected to the FPGA's internal bus via the SoC. This bus carries four functional modules: a data monitoring module, a feature parsing module, a task triggering module, and a CXL protocol processing module. The CXL protocol processing module can interconnect with the CXL protocol-enabled photonic switching module within the supernode computing node.

[0075] The core of the supernode compute node is the acceleration computing unit. Inside the acceleration computing unit are 16 GPUs, arranged in two rows of eight, all interconnected with a photonic switching module supporting the CXL protocol. Simultaneously, all GPUs are also interconnected with a bottom-level switching module supporting the NVLink protocol, forming a high-speed interconnect network between GPUs. Outside the acceleration computing unit, a CPU and DRAM are deployed on top of the supernode compute node. The CPU is interconnected with the DRAM and with a photonic switching module supporting the CXL protocol. This photonic switching module is also interconnected with a CXL protocol processing module.

[0076] The CXL protocol processing module acts as a bridge between the supernode compute node and the Pod-level shared memory unit, interacting with multiple components within the Pod-level shared memory unit. Within the Pod-level shared memory unit, DDR and DDRCtrl are interconnected, DDRCtrl is interconnected with the data processing and cache scheduling module, and the data processing and cache scheduling module is interconnected with RDMA. Simultaneously, the CXL protocol processing module interacts directly with the data processing and cache scheduling module. The data processing and cache scheduling module is also connected to the address translation module via a bus, which extends downwards to multiple Flash controllers (Flash controller 1, Flash controller 2 to Flash controller n). Each Flash controller is connected to its corresponding Nand timing control module (Nand timing control module_1, Nand timing control module_2 to Nand timing control module_n), and each Nand timing control module is ultimately connected to the corresponding storage area (storage area 1, storage area 2 to storage area n) in the large-capacity Flash storage below. Each storage area contains multiple Flash memory chips.

[0077] In addition, RDMA interconnects with shared cache units of other computing nodes to enable cross-node data interaction.

[0078] Upon system startup, the data trigger control node layer, serving as the "nerve center" of the entire architecture, is initialized first. This layer consists of multiple FPGA acceleration nodes, each integrating a SoC processing unit and a hardware-based data processing pipeline that includes a data monitoring module, a feature parsing module, and a task triggering module. The FPGA's internal bus can also connect to the CXL protocol processing module. The FPGA nodes are interconnected via 100Gbps Ethernet, attached through each node's SoC, forming a distributed redundant control network. When raw data streams are input from external sensors, AI cameras, or network interfaces, the data monitoring module captures data arrival events in real time through 64 parallel channels and performs hardware-level comparisons against preset trigger conditions (such as data priority P0, traffic threshold, and cache utilization). To avoid false triggers caused by instantaneous fluctuations, the system opens a configurable verification window of 100ns to 1μs, samples the data fluctuations, and determines whether they are within the tolerable range. Only when a valid trigger is confirmed does it generate a high-level trigger activation signal and a 32-byte trigger data frame. The data frame is synchronized to the shared memory controller and core configuration management module of the supernode computing core layer via the CXL3.0 management link, completing cache reservation and resource pre-allocation.
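The verification window acts as a debounce step: a trigger is accepted only if the monitored value stays above its threshold and stays stable across the samples taken in the window. The sketch below shows one plausible reading of that check; the function name, the use of max-minus-min as the fluctuation measure, and the sample values are assumptions, not details from the application.

```python
from typing import Sequence


def confirm_trigger(samples: Sequence[float], threshold: float,
                    tolerance: float) -> bool:
    """Decide whether samples from one verification window form a valid trigger.

    samples   -- values of the monitored quantity sampled inside the window
    threshold -- preset trigger condition (for example, cache utilization in %)
    tolerance -- largest fluctuation (max - min) still considered stable
    """
    if not samples:
        return False
    condition_held = all(s >= threshold for s in samples)
    stable = max(samples) - min(samples) <= tolerance
    return condition_held and stable
```

A momentary spike or dip inside the window fails one of the two checks, so it does not raise the trigger activation signal.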

[0079] Meanwhile, the supernode computing core layer is ready. This layer consists of 16 to 256 heterogeneous processing cores, including general-purpose CPUs, AI inference NPUs, and image processing GPUs. The CPUs can also connect to DRAM. All cores form a fully connected topology through photonic switching modules supporting the CXL protocol. The switching module uses silicon photonics chip technology, with a single-port bidirectional bandwidth of 3.0Tb / s, supporting single-hop direct access between all cores and the core shared memory units. Each computing core is equipped with two interconnect channels: one is a vendor-specific high-speed direct link such as NVLink or HiLink for parameter synchronization between cores of the same type; the other is a CXL3.0 consistency channel connected to the shared memory unit to ensure cache consistency. The shared memory unit is a hybrid of HBM3 and persistent Flash, and hardware-level directory consistency management is implemented through the CXL protocol processing module, supporting dynamic switching between Snoop and Directory modes. When the number of system nodes does not exceed 8, Snoop broadcast mode is enabled; when the number of nodes exceeds this threshold, the system switches to directory mode, in which the hardware directory controller tracks the cache block status in real time, and the latency of a single consistency check is less than 20ns.

[0080] Upon arrival of the control layer trigger signal, the task triggering module immediately generates a 64-byte hardware-based task instruction, containing fields such as the target core ID, task priority, and parameter address. This instruction is then directly written to the target core's instruction buffer via the CXL3.0 management link, bypassing the operating system kernel and virtualization layer. Once the target core completes its current instruction cycle, the hardware priority arbitration logic automatically reads the instruction from the buffer. Based on the 32-byte task parameter address field in the instruction, it directly accesses the shared memory unit through the CXL consistency channel to obtain input data and runtime parameters. The entire process requires no memory copying or context switching.

[0081] At the data access level, the shared memory layer further achieves global expansion through an address translation module. This module has a built-in 64MB hardware-mapped cache. When a virtual address for core access misses the cache, a table miss interrupt is immediately triggered. The address management unit then precisely loads the required mapping entry from the global address mapping table distributed in the Flash memory of each node, dynamically filling the local cache and enabling transparent byte-level access to petabyte-level storage resources. Simultaneously, the shared memory layer constructs a congestion-free interconnect network through an RDMA switch, supporting direct cross-node memory read / write. Combined with congestion prediction and dynamic routing mechanisms, this ensures that data transmission latency remains below a preset threshold.

[0082] The system also has a built-in data classification storage and cache scheduling module, which automatically classifies data as hot, warm, or cold through hardware counters and TTL logic and dynamically migrates it to the DRAM or Flash level. The prefetch engine predicts access patterns based on an N-Gram model and loads data in advance, with a prefetch hit rate that remains above 80%. The fault recovery unit implements RAID5 redundancy through CRC checks and hardware XOR logic to ensure data reliability.
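The XOR logic behind RAID5-style recovery is easy to show in isolation: the parity block is the byte-wise XOR of the data blocks, and any single lost block equals the XOR of the surviving blocks with the parity. The sketch below covers only that step; it leaves out parity rotation across devices and the CRC check that would identify which block is bad, and the function names are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity(data_blocks: Sequence[bytes]) -> bytes:
    """Parity block for one stripe."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks: Sequence[bytes], parity_block: bytes) -> bytes:
    """Rebuild the single missing data block of a stripe."""
    return xor_blocks([*surviving_blocks, parity_block])
```

For a stripe of three data blocks, dropping any one of them and calling `recover` with the other two plus the parity block returns the dropped block unchanged.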

[0083] The embodiments of this application provide a task processing method for a data-triggered heterogeneous server system. Figure 4 is a flowchart of an optional task processing method for a data-triggered heterogeneous server system according to an embodiment of this application. As shown in Figure 4, the task processing method of this data-triggered heterogeneous server system includes:

[0084] S402, when the reference data written to the core shared memory meets the preset triggering conditions, the reference data is determined as the target data, and the target data is parsed to obtain data characteristics, wherein the core shared memory is the shared memory of the processing core combination;

[0085] S404, based on the data characteristics, multiple target processing cores are determined in the processing core combination, wherein the processing core combination includes at least one general processing core and multiple acceleration processing cores, and the multiple target processing cores are included in the processing core combination;

[0086] S406, task instructions are generated based on the multiple target processing cores, and the task instructions are written into the registers corresponding to each of the multiple target processing cores.

[0087] It should be noted that the core shared memory is a unified memory pool accessed by multiple processing cores. It achieves cache coherence through the CXL3.0 protocol, supports multi-core concurrent read/write, and serves as the central storage hub for data flow. Reference data refers to the raw input data written to the core shared memory, such as sensor sample values, image frames, or AI inference requests. Preset trigger conditions are hardware-configured trigger thresholds, including parameters such as data priority (P0–P4), data size, access frequency, and cache utilization. Data characteristics are structured attributes extracted from the reference data, including data format (e.g., JSON, Tensor), processing requirement identifiers (e.g., "classification," "filtering"), and QoS constraints (latency limits, throughput requirements). The processing core combination is a collection of heterogeneous computing units within the system, including general-purpose CPUs and dedicated acceleration cores (e.g., NPU, GPU, FPGA). The target processing cores are the processing cores, matched based on the data characteristics, that are best suited to execute the task. The task instruction is a 64-byte hardware instruction containing fields such as target core ID, task parameter address, and priority, used to directly initiate core execution. Registers refer to the instruction buffers inside each processing core. They are hardware-level instruction queues that support direct writing and fast reading.

[0088] When the reference data written to the core shared memory meets the preset trigger conditions, the system automatically identifies it as target data, and the hardware module parses its data characteristics. Based on the characteristics, it accurately matches multiple target processing cores in the heterogeneous core combination, and finally generates hardware instructions and writes them directly into the instruction register of each target core. This achieves a complete process of data readiness, feature recognition, core matching, and instruction issuance without software involvement.

[0089] In an optional implementation, during system operation, external data flows through the network interface or sensor acquisition unit into the core shared memory and is written to a designated address range. The data monitoring module monitors the write status and attributes of each data block in the shared memory in real time through FPGA hardware logic to determine whether preset trigger conditions are met. For example, if a data block is marked as P0 priority and its size exceeds 1MB, or its access frequency exceeds 100 times within 1ms, the trigger condition is met, and the data block is identified as the target data. Subsequently, the feature parsing module starts the hardware pipeline: first, the data integrity is verified by the CRC check circuit, and corrupted frames are discarded; then, the state machine parses the data header and extracts feature fields such as format (e.g., "JSON"), processing requirements (e.g., "0x02=classification"), and QoS parameters (e.g., latency limit of 100μs, throughput requirement of 10GB / s). After all features are verified by a legality lookup table, a structured data feature vector is formed and sent to the core matching engine.

[0090] The core matching engine compares the data features against a core mapping table stored in dual-port RAM. Table entries are formatted as "[Format][Requirement][Latency][Throughput], [Core Type][Quantity][Cache Quota]", such as "[JSON][Classification][100μs][10GB/s], [NPU][8][128MB]". The matching process uses a hardware parallel comparator array to compare 1024 mapping entries simultaneously, employing a hybrid strategy of exact and fuzzy matching: the format and requirement must match exactly, while QoS parameters may deviate within a preset tolerance. If a match is successful, the system immediately locks a set of idle cores that meet the conditions, such as 8 NPU cores, and confirms that their computing power, cache, and bandwidth resources are not occupied.

[0091] The task triggering module then generates a 64-byte hardware task instruction. The instruction fields include: instruction type (0x01 = Start), target core ID (4 bytes, encoded as 0x00000001–0x00000008), task parameter address (32 bytes, pointing to the physical starting address of the target data in shared memory), task priority (P1), and checksum (1 byte). This instruction bypasses the operating system or virtualization layer, instead being directly written to the 1KB instruction buffer corresponding to each of the eight target NPU cores via the CXL3.0 management link and the hardware instruction dispatch circuit. Each core's instruction buffer is a dedicated hardware queue, supporting one-sided writing and lock-free reading. Once the core completes the current instruction cycle, the hardware priority arbitration logic automatically reads the instruction from the buffer, parses the parameter address, and directly accesses the core's shared memory to obtain input data, without requiring DMA copying, thus starting the computation task.

[0092] It should be noted that this embodiment achieves a complete hardware closed loop from data writing to core execution. Through hardware feature parsing and mapping matching, the system can accurately identify heterogeneous tasks such as AI inference, video analysis, and real-time control, automatically matching the optimal core combination, avoiding inefficient processing of dedicated workloads by general-purpose CPUs, and improving computing power utilization. Instructions are directly written to registers, bypassing OS scheduling and interrupt overhead, reducing latency caused by context switching. At the same time, this mechanism supports parallel loading of multiple target cores and unified parameter access, enabling efficient collaborative execution of high-concurrency tasks, providing industrial-grade deterministic computing power delivery capabilities for scenarios such as edge AI quality inspection, industrial control, and low-latency transactions.

[0093] In an optional implementation, multiple target processing cores are determined in the processing core combination based on data characteristics, including: searching in a first mapping table based on data characteristics to obtain target task configuration information, wherein the first mapping table is used to indicate the relationship between data characteristics and processing task requirements; and determining multiple target processing cores in a core state database based on the target task configuration information, wherein the core state database is used to indicate the current core capabilities of the processing core combination.

[0094] It should be noted that data features are structured attributes parsed from the target data by the hardware, including data format, processing requirement identifiers (such as "classification" and "filtering"), and QoS constraints (latency limits, throughput requirements). The processing core combination is a collection of heterogeneous processing units within the system, including general-purpose CPUs, AI inference NPUs, and image acceleration GPUs. The first mapping table is a static mapping table stored in hardware, used to directly associate data features with task configuration requirements. The core state database is a real-time updated hardware database that records the current state of each processing core (such as idle, busy, or faulty), performance parameters (computing power, latency), and resource usage (cache allocation, bandwidth usage).

[0095] First, the data features are converted into abstract task configuration requirements through the first mapping table. Then, combined with the real-time resource information of the core state database, multiple target processing cores that meet the task requirements and have the best resources are dynamically selected. This achieves two-level decision-making covering requirement matching and resource adaptation, ensuring that tasks are scheduled to suitable and idle core combinations, thus avoiding resource contention and performance waste.

[0096] In an optional implementation, after the feature parsing module extracts data features, it inputs fields such as format, requirements, and QoS into the first mapping table of the dual-port RAM built into the FPGA. This table supports parallel lookup of 1024 entries, and the matching process is completed within 10ns by a hardware comparator array. For example, when the data features are "JSON format, classification requirements, latency limit 105μs, throughput 10GB / s", the system matches the entry "[JSON][Classification][100μs±10%][10GB / s], [NPU][8][128MB L3 cache][P1 priority]" in the mapping table and outputs the target task configuration information. Subsequently, the task triggering module sends this configuration information to the core state database, which is maintained in real time by hardware logic. It polls the status registers of each core every 10μs to update their idle resources (such as the number of available NPUs, remaining cache capacity, and bandwidth utilization). Based on the target task configuration information, the system selects candidate combinations from the idle cores that meet the requirements of "8 NPUs, 128MB cache, P1 priority". It then calculates the resource utilization of each combination based on a greedy algorithm and selects the one with the highest utilization (such as 8 NPUs with 75% cache allocated) as the target core, excluding resource fragmentation or inefficient combinations.

[0097] Through the above-described implementation methods of this application, the mechanism achieves an intelligent decision-making closed loop from function matching to resource adaptation, avoiding resource mismatch caused by traditional allocation based solely on type. Combined with real-time perception of the core state database, resource utilization is improved. Task scheduling no longer relies on a global scheduler, eliminating queuing and arbitration delays, ensuring that high-priority tasks obtain optimal computing power within milliseconds, and providing deterministic and highly efficient heterogeneous computing power allocation capabilities for real-time AI inference and industrial control.

[0098] In an optional implementation, multiple target processing cores are determined in the core state database based on the target task configuration information, including: determining the current core capabilities corresponding to each of the multiple reference processing cores in the processing core combination based on the core state database, wherein the current core capabilities include the status of the reference processing core, the performance parameters of the reference processing core, and the resource quota of the reference processing core; determining at least one reference processing core combination based on the current core capabilities corresponding to each of the multiple reference processing cores and the target task configuration information; and determining the multiple reference processing cores included in the reference processing core combination with the highest resource utilization as multiple target processing cores.

[0099] It should be noted that the target task configuration information is an abstract task requirement output from the first mapping table, including the required core type (e.g., NPU), quantity (e.g., 8), cache quota (e.g., 128MB), priority (e.g., P1), etc. The core state database is a real-time database maintained by the hardware, recording the current core capabilities of each reference processing core, including status (idle, busy, or faulty), performance parameters (e.g., TOPS, latency μs), and resource quotas (allocated cache, occupied bandwidth). The reference processing cores are all currently available candidate cores in the processing core combination. The reference processing core combination is a subset of cores selected from the reference cores that meet the task requirements. Resource utilization is the ratio of task-required resources to the total available core resources, used to measure scheduling efficiency.

[0100] The system first obtains real-time capability data of all candidate cores from the core state database, then enumerates all reference core combinations that meet the type, quantity, and performance requirements based on the target task configuration information, and finally selects the combination with the highest resource utilization as the target core combination to ensure that computing resources are used efficiently.

[0101] In an optional implementation, the core state database is implemented using the FPGA's internal SRAM. Each entry stores 64 bits of state information for one core, including: a 1-bit status flag, a 16-bit computing power value, a 16-bit latency value, a 16-bit allocated cache value, and a 5-bit bandwidth usage value. When the target task configuration information is 8 NPUs, 128MB cache, and P1 priority, the hardware enumerator traverses the core state database and selects all reference cores that are idle, of type NPU, and have a single-core cache capacity greater than or equal to 16MB (128MB / 8), forming a candidate pool. Subsequently, the hardware combination generator enumerates in parallel all 8-core combinations that meet the quantity requirement (e.g., combination A: NPU1–NPU8; combination B: NPU2–NPU9) and calculates the resource utilization for each combination: resource utilization = task required resources (128MB cache) / total core resources of the combination (the sum of the cache capacities of the 8 cores). For example, combination A has a total cache capacity of 130MB and a utilization rate of 128 / 130 ≈ 98.5%; combination B has a total cache capacity of 150MB and a utilization rate of 128 / 150 ≈ 85.3%. The hardware comparator array completes the utilization calculation for all combinations within 50ns and outputs the combination with the highest utilization rate (combination A) as the target core combination.
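As a non-limiting illustration, the enumeration and utilization comparison described above can be modeled in software as follows. This is a minimal sketch, not the hardware implementation: the function name `best_combination`, the dictionary `EXAMPLE_POOL`, and the individual per-core cache values are hypothetical and were chosen only so that NPU1–NPU8 total 130MB and NPU2–NPU9 total 150MB, as in the example. The pool is assumed to contain only idle NPU cores.

```python
from itertools import combinations

# Hypothetical free cache (MB) per idle NPU core; values chosen so that
# NPU1-NPU8 sum to 130 MB and NPU2-NPU9 sum to 150 MB, as in the example.
EXAMPLE_POOL = {1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 17, 8: 17, 9: 36}


def best_combination(free_cache_mb, cores_needed, cache_needed_mb):
    """Return (core_ids, utilization) for the combination with the highest
    resource utilization, or None when no combination can be formed."""
    per_core_min = cache_needed_mb / cores_needed  # 128 / 8 = 16 MB
    # Candidate pool: cores whose cache meets the per-core minimum.
    pool = {cid: mb for cid, mb in free_cache_mb.items() if mb >= per_core_min}
    best = None
    for combo in combinations(sorted(pool), cores_needed):
        total = sum(pool[cid] for cid in combo)
        utilization = cache_needed_mb / total  # required / total of combination
        if best is None or utilization > best[1]:
            best = (combo, utilization)
    return best


if __name__ == "__main__":
    combo, utilization = best_combination(EXAMPLE_POOL, 8, 128)
    print(combo, f"{utilization:.1%}")
```

With the assumed values, the selected combination is NPU1–NPU8 at 128 / 130 utilization, matching combination A. The software loop visits combinations one at a time, whereas the embodiment describes a parallel hardware comparator array.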

[0102] Through the above-described implementation methods of this application, hardware parallel enumeration and real-time status awareness are used to avoid software scheduling delays and jitter, ensuring that the optimal core combination selection is completed within a preset time. The system can still accurately match idle resources under high load, preventing high-priority tasks from failing to start due to core fragmentation.

[0103] In an optional implementation, determining at least one reference processing core combination based on the current core capabilities and target task configuration information corresponding to each of the multiple reference processing cores includes: generating a first abnormal signal when the reference processing core combination cannot be determined based on the current core capabilities and target task configuration information corresponding to each of the multiple reference processing cores, wherein the first abnormal signal is used to indicate that the processing core combination resources are insufficient.

[0104] It should be noted that the current core capabilities refer to the real-time status (idle / busy / faulty), performance parameters (such as computing power, latency), and resource quotas (allocated cache, bandwidth usage) of each reference processing core. The target task configuration information includes hard constraints such as the minimum core type, number, cache capacity, and priority required by the task. The reference processing core combination is the set of candidate cores that meet the task configuration requirements. The first abnormal signal is a hardware-generated system-level alarm signal indicating that no subset of the current processing core combination can meet the task requirements, i.e., that resources are insufficient.

[0105] After the system traverses all available cores based on the target task configuration information, if it cannot find a core combination that meets the requirements of quantity, type, and resource quota, it will actively trigger the first abnormal signal to realize the hardware-level real-time perception and reporting of the state where no resources are available for allocation, thereby avoiding long-term task waiting or software timeout failure and improving the robustness and controllability of the system.

[0106] In an optional implementation, the task triggering module receives the target task configuration information, such as a requirement for 8 NPUs, each with at least 16MB of cache, at P1 priority. It then starts the hardware enumeration engine to select all idle cores of type NPU from the core state database. The engine checks the cache quota of each core one by one, retaining only candidate cores with a cache greater than or equal to 16MB. Subsequently, the hardware combination verifier attempts to form a valid combination of 8 cores from the candidate cores. If the total number of candidate cores is less than 8, or if every possible 8-core combination contains at least one core with a cache less than 16MB, then no valid combination can be determined. At this time, the hardware logic immediately generates the first abnormal signal, which is a 1-bit high-level pulse, sent to the exception handling unit of the data trigger control node layer through a dedicated interrupt line, and synchronously written to the system log register, recording the task ID, configuration requirements, and failure timestamp. This process is completed by a dedicated state machine within 200ns without software intervention, ensuring deterministic response.
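As a non-limiting illustration, the filtering and signalling flow described above can be sketched in software as follows. The names `CoreEntry`, `AbnormalSignal`, and `select_or_signal` are hypothetical, and the log record is modeled as a plain object rather than a hardware register. The sketch relies on the observation that, once every candidate has passed the per-core cache check, any 8 candidates form a valid combination, so failure reduces to having fewer than 8 candidates.

```python
import time
from dataclasses import dataclass


@dataclass
class CoreEntry:
    core_id: int
    core_type: str   # e.g. "NPU"
    idle: bool
    cache_mb: int    # cache quota available on this core


@dataclass
class AbnormalSignal:
    """Software stand-in for the first abnormal signal and its log record."""
    task_id: int
    requirement: str
    timestamp_ns: int


def select_or_signal(cores, task_id, core_type, count, min_cache_mb):
    """Return (selected_core_ids, None) on success or (None, signal) on failure."""
    candidates = [
        c.core_id
        for c in cores
        if c.idle and c.core_type == core_type and c.cache_mb >= min_cache_mb
    ]
    if len(candidates) < count:
        requirement = f"{count}x{core_type}, >={min_cache_mb}MB cache each"
        return None, AbnormalSignal(task_id, requirement, time.monotonic_ns())
    return candidates[:count], None
```

A caller would inspect the second element of the returned pair and, if it is not `None`, hand it to whatever exception handling path is configured.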

[0107] Through the above-described implementation method of this application, the system can immediately activate backup strategies via hardware-level anomaly reporting, such as triggering resource preemption to release low-priority tasks, notifying the scheduler to migrate the tasks to other Pods, or sending an alarm to the operation and maintenance platform to avoid task backlog. This design enhances the supernode server's adaptability and fault tolerance under high concurrency and heterogeneous loads, ensuring that critical tasks (such as P0 industrial control) can still obtain clear feedback and recovery paths when resources are scarce, thereby improving the overall system reliability and maintainability.

[0108] In an optional implementation, parsing the target data to obtain data features includes: identifying the data format type of the reference data, wherein the data format type is used to indicate the data format of the reference data; extracting processing requirement indicator bits of the reference data, wherein the processing requirement indicator bits are determined as first indicator information when the reference data has classification requirements, as second indicator information when the reference data has filtering requirements, and as third indicator information when the reference data has inference requirements; determining the performance requirements of the target data, and determining the data format type, processing requirement indicator bits, and performance requirements as data features.

[0109] It should be noted that the data format type refers to the structured encoding form of the target data, such as JSON, Tensor, CSV, binary stream, etc., used to identify the data organization method. The processing requirement indicator is a 2-bit identifier field embedded in the data header, used to uniquely represent the task type: the first indicator information (01) corresponds to classification, the second indicator information (10) corresponds to filtering, and the third indicator information (11) corresponds to inference. Performance requirements include QoS parameters such as latency limit, throughput requirement, and accuracy requirement, defined by the packet header or metadata fields. Data characteristics are a set of structured metadata composed of the comprehensive data format type, processing requirement indicator, and performance requirements, used to drive subsequent core matching and task scheduling.

[0110] By using three-stage structured parsing (format type, processing requirement indicator, and performance requirements), the original target data is transformed into standardized and matchable feature vectors, achieving accurate conversion from unstructured data streams to abstract task instructions, and providing a clear decision-making basis for heterogeneous core scheduling.

[0111] In an optional implementation, after the data monitoring module confirms that the target data meets the triggering conditions, the feature parsing module starts the hardware pipeline: First, the format recognition unit reads the first 16 bytes of the data packet through a dedicated state machine, compares it with the preset format signature (e.g., "{" for JSON, "7F 45 4C 46" for ELF binary, "0x00000000" for Tensor header), and outputs the format type field (e.g., "JSON" or "Tensor"); Second, the processing requirement indicator extraction module reads bytes 20-21 of the data packet and, according to a predefined mapping: if the value is 01, it is determined to be a classification requirement, and the first indicator information is output; if it is 10, the second indicator information (filtering) is output; if it is 11, the third indicator information (inference) is output; Finally, the performance requirement parsing module reads the latency limit and throughput requirement from the QoS field of the data packet, such as latency less than or equal to 100μs and throughput greater than or equal to 10GB / s, and converts this value into a standardized code. The three are combined by a hardware register into a 64-bit data feature for subsequent mapping and matching. The entire process is completed within 50ns by pipeline logic, and erroneous data packets are automatically discarded and logged.
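As a non-limiting illustration, the format recognition and indicator extraction stages can be sketched in software as follows. The function name `parse_header` and the tables `SIGNATURES` and `INDICATORS` are hypothetical. The sketch assumes that the 2-bit indicator occupies the two least significant bits of the big-endian value held in bytes 20–21, which the description does not specify, and it omits the QoS stage because the position of the QoS field is not given.

```python
# Format signatures taken from the example above, checked in order.
SIGNATURES = [
    (b"\x7fELF", "ELF"),             # 7F 45 4C 46
    (b"{", "JSON"),
    (b"\x00\x00\x00\x00", "Tensor"),  # 0x00000000 Tensor header
]

# 2-bit processing requirement indicator codes.
INDICATORS = {0b01: "classification", 0b10: "filtering", 0b11: "inference"}


def parse_header(packet: bytes):
    """Return (format_type, requirement), or None for a packet to discard."""
    if len(packet) < 22:
        return None  # too short to contain bytes 20-21
    fmt = next((name for sig, name in SIGNATURES if packet.startswith(sig)), None)
    # Assumption: the indicator is the low 2 bits of bytes 20-21 (big-endian).
    code = int.from_bytes(packet[20:22], "big") & 0b11
    requirement = INDICATORS.get(code)
    if fmt is None or requirement is None:
        return None
    return fmt, requirement
```

Packets with an unknown signature or with indicator code 00 are treated as erroneous here, mirroring the statement that erroneous data packets are discarded.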

[0112] Through the above-described embodiments of this application, the mechanism compresses the millisecond-level latency of traditional software parsing, achieving deterministic and low-jitter extraction of data features, and ensuring the real-time performance and consistency of task scheduling. By standardizing formats and defining indicator bits, the system can accurately distinguish heterogeneous tasks such as classification, filtering, and inference, reducing semantic ambiguity and improving core matching accuracy. Hardware-based parsing does not rely on an operating system or general-purpose CPU, reducing scheduling overhead and providing support for the supernode server to achieve end-to-end closed-loop processing of data as instructions.

[0113] In an optional implementation, the target task configuration information is obtained by searching in the first mapping table based on data characteristics, including: performing precise matching in the first mapping table based on data format type and processing requirement indicator bit to obtain a first matching result, wherein the first matching result is a completely consistent matching result; performing fuzzy matching in the first mapping table based on performance requirements to obtain a second matching result, wherein the second matching result is a matching result with deviation within a preset range; and combining the first matching result and the second matching result to determine the target task configuration information.

[0114] It should be noted that the first mapping table is a static mapping table stored in the FPGA hardware SRAM. The table entry format is "[Data Format Type][Processing Requirement Indicator Bit][Performance Requirement Range], [Target Task Configuration Information]", used to map data characteristics to specific task execution configurations. Exact matching means that the data format type and processing requirement indicator bit must be consistent. Fuzzy matching means that performance requirements (such as latency and throughput) are allowed to match within a preset error range, such as latency ±10% and throughput ±15%. The target task configuration information consists of the execution parameters output by the mapping, which may include the target core type, quantity, cache quota, link priority, etc.

[0115] In an optional implementation, the data features output by the feature parsing module include format type (e.g., "JSON"), processing requirement indicator bits (e.g., "01=Classification"), and performance requirements (e.g., latency 95μs, throughput 9.8GB/s). The matching engine initiates a two-level matching process in parallel: First, the precise matching unit searches the first mapping table for entries whose format and requirement indicator bits completely match, such as "[JSON][01], NPU×8, L3 cache 128MB, P1 priority". If found, the first matching result is output. Second, the fuzzy matching unit performs a range comparison of the performance requirements, with a preset latency error of ±10% (i.e., 90–110μs) and a throughput error of ±15% (i.e., 8.3–11.3GB/s). If the performance requirements fall within the performance range of any matching entry, a second matching result is generated. For example, the entry "[JSON][01][80–120μs][8–12GB/s], NPU×8, L3 cache 128MB, P1" satisfies the fuzzy matching. The system prioritizes the first matching result. If no first matching result is found, the system selects the only valid second matching result. If multiple second matching results exist, the entry with the closest performance range is selected as the final target task configuration information to ensure uniqueness and optimality of the matching.
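As a non-limiting illustration, the combination of exact and fuzzy matching can be sketched in software as follows. The names `MapEntry` and `match` are hypothetical. The sketch assumes that each entry stores nominal latency and throughput values, that the tolerances (±10% latency, ±15% throughput) are applied relative to those nominal values, and that "closest" means the smallest sum of relative deviations; the description does not fix these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapEntry:
    fmt: str                # data format type, e.g. "JSON"
    indicator: int          # 2-bit requirement code, e.g. 0b01
    latency_us: float       # nominal latency limit
    throughput_gb_s: float  # nominal throughput requirement
    config: str             # target task configuration information


def match(entries, fmt, indicator, latency_us, throughput_gb_s,
          lat_tol=0.10, thr_tol=0.15):
    """Return the best matching entry, or None when nothing matches."""
    def deviation(entry):
        return (abs(latency_us - entry.latency_us) / entry.latency_us,
                abs(throughput_gb_s - entry.throughput_gb_s) / entry.throughput_gb_s)

    # Stage 1: format and indicator must match exactly.
    exact = [e for e in entries if e.fmt == fmt and e.indicator == indicator]
    # Stage 2: performance requirements may deviate within the tolerances.
    within = [e for e in exact
              if deviation(e)[0] <= lat_tol and deviation(e)[1] <= thr_tol]
    if not within:
        return None
    # Several candidates: keep the one with the smallest total deviation.
    return min(within, key=lambda e: sum(deviation(e)))
```

For a request of "JSON, 01, 95μs, 9.8GB/s" against entries with nominal values of 100μs and 10GB/s, both deviations (5% and 2%) fall inside the tolerances, so the entry is returned.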

[0116] Through the above-described implementation methods of this application, this mechanism overcomes the rigid limitations of traditional static matching, enabling the system to accurately trigger tasks even when faced with minor performance deviations such as network jitter and data sampling fluctuations, thereby improving the matching success rate. Precise matching ensures unambiguous task types, while fuzzy matching expands the generalization capability of the mapping table, reduces redundant configuration entries, and lowers storage overhead.

[0117] In an optional implementation, generating task instructions based on multiple target processing cores includes: determining instruction types based on the multiple target processing cores, wherein the instruction type is used to indicate the running state corresponding to each of the multiple target processing cores, and the running state may include starting, pausing, or terminating; determining the target number set and the physical starting address of the core shared memory corresponding to each of the multiple target processing cores; determining the priority of reference data, and generating task instructions based on the instruction type, target number set, physical starting address, and priority.

[0118] It should be noted that the instruction type is a 1-byte encoding indicating task control actions, such as 0x01 = start, 0x02 = pause, 0x03 = terminate, used to uniformly manage the running status of the target core. The target number set is a unique set of identifiers for multiple target processing cores in the system, supporting single or batch specification, such as {0x00000001, 0x00000002, ..., 0x00000008}. The physical starting address of the core shared memory is the starting physical address of the task data in the shared memory pool, allocated by the cache scheduling module, used for direct core access to input / output data. Priority is a five-level QoS identifier (P0–P4), used for scheduling links and cache resource allocation.

[0119] In an optional implementation, the task triggering module receives the target core combination (e.g., 8 NPUs numbered 0x00000001–0x00000008), the physical starting address returned by the cache scheduling module (e.g., 0x10000000), and the priority (e.g., P1) transmitted by the feature parsing module, and sets the instruction type to 0x01 (startup) based on the current task status (first execution). The instruction generation unit assembles the instruction in a fixed 64-byte format: the first byte is the instruction type (0x01), the next 4 bytes are the batch target number (using range representation: start ID 0x00000001, end ID 0x00000008), the next 32 bytes are the shared memory physical starting address (the first address of the 128MB cache address range), 1 byte is the priority (P1=0x01), the last byte is the CRC-8 checksum, and the remaining 20 bytes are reserved. The instruction is encapsulated by hardware logic within 5ns and written directly to the target core's instruction buffer via the CXL3.0 management link. It supports SIMC (Single Instruction Multiple Core) mode, allowing all eight target cores to be started with a single instruction. The address field and cache binding information in the instruction are updated synchronously to ensure that the core can access data immediately after startup.
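As a non-limiting illustration, the assembly of a 64-byte instruction can be sketched in software as follows. The function names `crc8` and `build_instruction` are hypothetical. Several details are assumptions: the 4-byte target field is split into a 2-byte start ID and a 2-byte end ID, the CRC-8 polynomial is taken as 0x07 with a zero initial value, and, because the field sizes listed above add up to 59 bytes, the sketch uses 25 reserved bytes so that the total is exactly 64.

```python
import struct


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0 (polynomial is an assumption)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def build_instruction(instr_type: int, start_id: int, end_id: int,
                      phys_addr: int, priority: int) -> bytes:
    """Pack type, target range, 32-byte address, priority, padding, and CRC."""
    body = struct.pack(
        ">BHH32sB25x",               # 1 + 2 + 2 + 32 + 1 + 25 = 63 bytes
        instr_type,
        start_id,
        end_id,
        phys_addr.to_bytes(32, "big"),
        priority,
    )
    return body + bytes([crc8(body)])  # checksum in the last byte


if __name__ == "__main__":
    instr = build_instruction(0x01, 1, 8, 0x10000000, 0x01)
    print(len(instr), instr.hex())
```

A receiver can validate the instruction by recomputing the CRC over the first 63 bytes and comparing it with byte 63.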

[0120] Through the above-described embodiments of this application, a unified instruction format is used to avoid instruction parsing errors and improve system reliability. Batch instruction issuance is supported, reducing communication overhead and increasing instruction throughput. The integrated encapsulation of priority and shared addresses ensures the determinism and consistency of the task execution environment, providing core support for the end-to-end closed loop of data triggering, direct instruction delivery, and core parallelism in the supernode architecture.

[0121] In an optional implementation, if the reference data written to the core shared memory meets the preset triggering conditions, the reference data is determined as the target data, including: determining the data information of the reference data, wherein the data information is used to indicate the access interface identifier, traffic rate, priority mark and initial throughput of the reference data; comparing the data information of the reference data with the preset triggering parameter set, and if the comparison conditions are met, the reference data is determined as the target data.

[0122] It should be noted that the reference data is the raw input data stream that has not yet been determined as a valid trigger object. The data information is four-dimensional metadata acquired by the hardware, including the access interface identifier (such as PCIe / CXL port ID), traffic rate (Mbps), priority flag (P0–P4), and initial throughput (GB / s). The preset trigger parameter set is a set of dynamic configuration rules stored in the dual-port RAM inside the FPGA, divided by scenario (such as AI inference, industrial control), including comparison parameters such as interface threshold, traffic upper limit, and priority lower limit. The target data is high-value data that meets the trigger conditions and is identified as needing immediate processing.

[0123] In an optional implementation, the data monitoring module captures the access interface ID (e.g., 0x03), traffic rate (e.g., 1.2Gbps), priority flag (e.g., P1), and initial throughput (e.g., 1.5GB / s) of the reference data in real time via the CXL3.0 management link. The system loads a preset set of trigger parameters based on the current operating scenario (e.g., "AI quality inspection"): interface ID = 0x03, traffic rate greater than or equal to 1Gbps, priority greater than or equal to P1, and throughput greater than or equal to 1GB / s. A hardware comparator array performs parallel four-dimensional comparison: the interface ID is matched to a dedicated logic verification port; the traffic rate and throughput are compared in real time with a threshold register via a hardware multiplier; and the priority flag is determined by an unsigned comparator to be greater than or equal to P1. When all four-dimensional parameters meet the conditions, the AND gate outputs a trigger compliance signal, and the hardware latch immediately locks the reference data as the target data, writing its cache address and data length into the pending queue.
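The four-dimensional comparison described above can be modeled in software roughly as follows. This is a minimal sketch that mirrors the comparator array and AND gate with ordinary comparisons; it is not the hardware circuit itself. The class and function names are hypothetical, and it is assumed that the priority flag is encoded as an unsigned integer (P1 = 1) and compared numerically, as the unsigned comparator in the description suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerParams:
    """One scenario's preset trigger parameter set (field names are assumptions)."""
    interface_id: int
    min_rate_gbps: float
    min_priority: int
    min_throughput_gbs: float


@dataclass(frozen=True)
class DataInfo:
    """Four-dimensional metadata captured for the reference data."""
    interface_id: int
    rate_gbps: float
    priority: int
    throughput_gbs: float


def meets_trigger(info: DataInfo, params: TriggerParams) -> bool:
    """Return True only when all four comparisons pass (the AND-gate output)."""
    checks = (
        info.interface_id == params.interface_id,            # interface ID match
        info.rate_gbps >= params.min_rate_gbps,              # traffic rate threshold
        info.priority >= params.min_priority,                # unsigned priority compare
        info.throughput_gbs >= params.min_throughput_gbs,    # throughput threshold
    )
    return all(checks)


# Example values from the description: the "AI quality inspection" scenario.
AI_QUALITY_INSPECTION = TriggerParams(0x03, 1.0, 1, 1.0)
sample = DataInfo(interface_id=0x03, rate_gbps=1.2, priority=1, throughput_gbs=1.5)
triggered = meets_trigger(sample, AI_QUALITY_INSPECTION)
```

In hardware the four checks run in parallel; the tuple of booleans here only stands in for the comparator outputs feeding the AND gate.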

[0124] Through the above-described implementation methods of this application, hardware parallel comparison ensures that high-priority tasks are activated first, improving the system's responsiveness to critical data. Combined with dynamic configuration of scenario-based parameters, it supports cross-application adaptive triggering strategies, providing an underlying guarantee for the supernode architecture to achieve accurate, low-jitter, and high-throughput data-driven processing.

[0125] In an optional implementation, if the reference data written to the core shared memory meets the preset triggering conditions, the reference data is determined as the target data, including: determining the window time according to the task requirements of the reference data, wherein the window time is a preset time; collecting reference data information of the reference data according to the preset sampling rate to obtain the reference data sequence; calculating the maximum value, minimum value and average value of the reference data sequence, and obtaining the fluctuation percentage according to the maximum value, minimum value and average value of the reference data sequence; and clearing the target data if the fluctuation percentage is greater than a preset threshold.

[0126] It should be noted that the window time is a stable monitoring period set by the hardware, used to observe the dynamic fluctuations of the reference data. Its length is preset according to the task type (e.g., 100ns for industrial control, 1μs for AI inference). The reference data information consists of real-time data parameters collected within the window period at a preset sampling rate, including flow rate, throughput, priority, etc. The reference data sequence is a set of multiple sampling points arranged in chronological order within the window. The fluctuation percentage is a measure of data stability, calculated as (maximum value - minimum value) / average value × 100%. The target data is valid data that has been verified and confirmed to trigger processing; if the fluctuation exceeds the limit, it is cleared to avoid false triggering.

[0127] When reference data is written to shared memory and the triggering conditions are initially met, the triggering module loads the window time and sampling rate for the task type (e.g., "real-time industrial control") from the configuration register, for example a window time of 100ns and a sampling rate of 10GHz, so that 1000 sampling points are collected within the 100ns window. The hardware sampler synchronously collects the flow rate and throughput of each sampling point and stores them in a FIFO buffer with a depth of 512. After sampling, the hardware statistics unit calculates the maximum, minimum, and average values of the sequence in parallel: the maximum value is obtained within 5ns using a comparison tree circuit, the minimum value is obtained in the same way, and the average value is calculated jointly by an accumulator and a divider. The fluctuation percentage is calculated in real time by the hardware subtractor and divider: (maximum value - minimum value) ÷ average value × 100%. If the fluctuation percentage exceeds a preset threshold (e.g., 5%), it is judged as a transient disturbance, triggering the clearing logic: the hardware control unit immediately releases the cached flag of the target data in shared memory, clears its task association state, sets the invalid trigger flag, and sends a cancellation signal to the feature parsing module to prevent the subsequent processing flow from starting.
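The window sampling and fluctuation check can be illustrated with the following sketch, which follows the formula (maximum value - minimum value) / average value × 100% given in paragraph [0126]. The function names and the 5% default threshold are assumptions for illustration, and the software arithmetic merely stands in for the hardware comparison tree, accumulator and divider.

```python
from typing import Sequence


def sample_count(rate_hz: int, window_ns: int) -> int:
    """Number of sampling points collected in one window (integer arithmetic)."""
    return rate_hz * window_ns // 1_000_000_000


def fluctuation_percent(samples: Sequence[float]) -> float:
    """(max - min) / mean * 100 over the reference data sequence."""
    if not samples:
        raise ValueError("reference data sequence is empty")
    mean = sum(samples) / len(samples)
    if mean == 0:
        raise ValueError("average value is zero; fluctuation is undefined")
    return (max(samples) - min(samples)) * 100.0 / mean


def is_stable(samples: Sequence[float], threshold_percent: float = 5.0) -> bool:
    """True if the data may stay marked as target data; False means clear it."""
    return fluctuation_percent(samples) <= threshold_percent


points = sample_count(10_000_000_000, 100)      # 10 GHz over a 100 ns window
steady = is_stable([98.0, 100.0, 102.0])        # 4% fluctuation
disturbed = is_stable([90.0, 100.0, 110.0])     # 20% fluctuation
```

Here `points` evaluates to 1000, the first sequence stays below the 5% threshold, and the second exceeds it and would be cleared as a transient disturbance.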

[0128] Through the above-described embodiments of this application, real-time hardware statistics are used to achieve microsecond-level fluctuation determination, avoiding the latency accumulation caused by traditional software filtering, ensuring that only stable and continuous high-value data triggers task execution, and guaranteeing the determinism and security of execution in key scenarios such as industrial control and autonomous driving.

[0129] In an optional implementation, after generating task instructions based on multiple target processing cores, the process includes: sending a cache allocation request to the core shared memory, determining the cache capacity, cache type, and data access permissions in the core shared memory; determining the physical address allocation range based on the cache allocation request; receiving a first signal sent by the core shared memory, and instructing the target cores to read target data from the core shared memory according to the task instructions, wherein the first signal is used to indicate that the core shared memory is ready.

[0130] It should be noted that the core shared memory is a globally unified memory pool interconnected via the CXL protocol, consisting of DRAM and persistent storage, supporting multi-core hardware access. A cache allocation request is a resource request instruction issued by the task triggering module, including the required cache capacity, type (DRAM / Flash), and access permissions (read / write / read-write). The physical address allocation range is a contiguous physical address interval allocated within the shared memory, used for storing and accessing the target data. The first signal is a hardware confirmation signal returned by the shared memory controller to the task triggering module after completing cache allocation and data readiness, used to synchronize task execution timing.

[0131] After generating a 64-byte task instruction, the task triggering module immediately sends a cache allocation request to the shared memory controller in parallel. The request explicitly specifies the required capacity (e.g., 128MB), type (DRAM), and permissions (read / write). Upon receiving the request, the shared memory controller's hardware allocation unit selects a contiguous address range (e.g., 0x10000000–0x17FFFFFF, which spans 128MB) from the free cache blocks and updates the address mapping table and access permission table. After allocation, the controller sends a first signal (active high) to the task triggering module via a dedicated hardware signal line, simultaneously writing the allocated physical address into the reserved field of the task instruction. Upon receiving the first signal, the task triggering module immediately sends the task instruction with the filled address to the target core's instruction buffer via the CXL3.0 management link. After completing the current instruction, the target core automatically reads the new task from the buffer, finds that the instruction contains a valid physical address, and immediately initiates a memory read via the CXL coherence protocol, without waiting for the operating system or software interrupts.
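The allocation handshake can be sketched as follows: a contiguous range is reserved, the mapping table is updated, a ready flag (standing in for the first signal) is returned, and only then is the address filled into the instruction. This is an illustrative model under stated assumptions, not the controller of this application. The class names, the simple bump-pointer allocation policy, the dictionary form of the instruction, and the reading of "128MB" as 128 × 1024 × 1024 bytes are all assumptions.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class Allocation:
    start: int          # first physical address of the range
    end: int            # last physical address of the range (inclusive)
    mem_type: str       # e.g. "DRAM"
    permissions: str    # e.g. "rw"
    ready: bool         # stands in for the first signal (active high)


class SharedMemoryController:
    """Hypothetical bump-pointer allocator over one contiguous region."""

    def __init__(self, base: int, size: int) -> None:
        self._next = base
        self._limit = base + size
        self.mapping = {}  # start address -> Allocation (address mapping table)

    def allocate(self, capacity: int, mem_type: str = "DRAM",
                 permissions: str = "rw") -> Allocation:
        if capacity <= 0 or self._next + capacity > self._limit:
            raise MemoryError("no contiguous range large enough")
        start = self._next
        self._next += capacity
        alloc = Allocation(start, start + capacity - 1, mem_type, permissions, True)
        self.mapping[start] = alloc
        return alloc


def dispatch(controller: SharedMemoryController, instruction: dict,
             capacity: int) -> dict:
    """Fill the address field only after the controller reports ready."""
    alloc = controller.allocate(capacity)
    if not alloc.ready:
        raise RuntimeError("shared memory not ready")
    return {**instruction, "phys_addr": alloc.start}


controller = SharedMemoryController(base=0x10000000, size=1024 * MIB)
first = controller.allocate(128 * MIB)
```

For a 128 MiB request starting at 0x10000000, `first.end` is 0x17FFFFFF, since 128 MiB equals 0x08000000 bytes.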

[0132] Through the above-described embodiments of this application, the mechanism achieves hardware-coordinated scheduling of task instructions and data caches, avoiding the waiting delay in traditional architectures where instructions are issued first but data has not yet arrived. This allows the core to directly access the target data after startup, shortening task execution preparation time. A hardware signal synchronization mechanism ensures strong consistency between instructions and data, preventing idling and resource contention.

[0133] Example 2:

[0134] During system operation, the memory address space is pre-divided into 1024 independent address segments. Each address segment is bound to a specific storage level, such as DRAM, NVDIMM, or Flash, and uniquely associated with a specific task type. For example, address segments 0x0000–0x0FFF correspond to video stream analysis tasks, and 0x1000–0x1FFF correspond to industrial quality inspection AI inference tasks. When external data is written to shared memory via CXL or Ethernet interfaces, the data monitoring module monitors the data writing behavior of each address segment in real time, collecting metadata such as access interface ID, initial throughput, data flow rate, and priority flags, and makes preliminary judgments based on preset thresholds. For example, when the data flow written to address segments 0x1000–0x1FFF continuously exceeds 8GB / s and has a priority of P1, the system determines that the triggering conditions for the task are met and immediately initiates the pre-trigger verification process.
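The binding between address segments and task types can be illustrated with a small lookup sketch. It assumes that all 1024 segments have the same size of 0x1000, which is what the two example segments suggest but is not stated for the remaining segments. The task names, function names and the pre-trigger rule for the second segment (flow above 8 GB/s with priority P1) are taken from the example or are hypothetical labels.

```python
from typing import Optional

SEGMENT_SIZE = 0x1000    # assumed uniform segment size
SEGMENT_COUNT = 1024     # number of pre-divided address segments

# Segment index -> task type; only the two segments from the example are filled in.
TASK_BY_SEGMENT = {
    0: "video_stream_analysis",
    1: "industrial_quality_inspection_inference",
}


def segment_index(address: int) -> int:
    """Map a written address to its segment index."""
    index = address // SEGMENT_SIZE
    if not 0 <= index < SEGMENT_COUNT:
        raise ValueError("address outside the pre-divided address space")
    return index


def task_for_address(address: int) -> Optional[str]:
    """Return the task type bound to the segment, or None if none is configured."""
    return TASK_BY_SEGMENT.get(segment_index(address))


def pre_trigger(address: int, throughput_gbs: float, priority: int) -> bool:
    """Preliminary judgment for the inspection segment: flow above 8 GB/s and P1."""
    return (
        task_for_address(address) == "industrial_quality_inspection_inference"
        and throughput_gbs > 8.0
        and priority == 1
    )
```

A write to address 0x1ABC, for instance, falls in segment 1 and is therefore associated with the industrial quality inspection inference task.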

[0135] Based on the window time configured for the task type (e.g., 1μs for AI inference tasks), the pre-trigger verification module collects traffic fluctuation values of 1000 consecutive data points within the address range at 10 times the base sampling rate (10GHz). It calculates the maximum, minimum, and average values and obtains the fluctuation percentage. If the fluctuation is less than a preset threshold, the data is considered a valid data flow and a valid trigger signal is generated; conversely, if the fluctuation is severe, the trigger state is cleared to avoid misscheduling due to momentary interference. The valid trigger signal is synchronously sent to the feature parsing module, triggering it to start the hardware pipeline processing flow. The pipeline consists of three stages: preprocessing, feature extraction, and feature matching. First, the CRC check circuit verifies the integrity of the data packets and discards erroneous frames. Then, the state machine parses the data format (such as JSON) and extracts the processing requirement identifier (such as "classification") and the QoS parameters (such as a latency limit of 100μs and a throughput of 10GB / s). Finally, the lookup table circuit performs a "precise + fuzzy" hybrid matching in the 1024-item mapping table: precise matching of data format and requirement identifier, and fuzzy matching of QoS parameters, outputting a unique matching result within 10ns, such as the target core: 8 NPUs, cache quota of 128MB, link priority P1.
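The "precise + fuzzy" hybrid matching can be sketched as follows. This is a simplified software model: the hardware array compares all entries in parallel, whereas the sketch scans them in order and returns the first hit. The entry layout, the function names and the 10% relative tolerance for the QoS fields are assumptions; the description only states that the deviation must lie within a preset range.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MappingEntry:
    """One row of the mapping table (field names are assumptions)."""
    data_format: str
    requirement: str
    latency_us: float
    throughput_gbs: float
    core_type: str
    core_count: int
    cache_mb: int
    link_priority: str


def within(value: float, target: float, tolerance: float) -> bool:
    """Fuzzy comparison: relative deviation from the target within the tolerance."""
    return abs(value - target) <= tolerance * target


def match(table: Iterable[MappingEntry], data_format: str, requirement: str,
          latency_us: float, throughput_gbs: float,
          tolerance: float = 0.10) -> Optional[MappingEntry]:
    """Exact match on format and requirement, fuzzy match on the QoS parameters."""
    for entry in table:
        exact = (entry.data_format == data_format
                 and entry.requirement == requirement)
        fuzzy = (within(latency_us, entry.latency_us, tolerance)
                 and within(throughput_gbs, entry.throughput_gbs, tolerance))
        if exact and fuzzy:
            return entry
    return None


table = [
    MappingEntry("JSON", "classification", 100.0, 10.0, "NPU", 8, 128, "P1"),
]
hit = match(table, "JSON", "classification", latency_us=95.0, throughput_gbs=10.5)
```

In this example `hit` is the single table entry (8 NPUs, 128 MB cache quota, link priority P1), because both QoS values deviate from the entry by no more than 10%.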

[0136] The matching result is encapsulated into a 64-byte data frame containing features and tasks, which is forwarded by the feature parsing module to the task triggering module. The task triggering module immediately generates a structured hardware instruction, including the instruction type (0x01 = Start), the target core number set (0x00000001–0x00000008), the physical address of the task parameters (e.g., 0x10000000), priority (P1), and checksum, and writes it directly to the target core's instruction buffer via the CXL3.0 management link. Simultaneously, the task triggering module sends a task start instruction to the cache scheduling management module, explicitly requesting 128MB of DRAM allocation, read / write permissions, and specifying the starting address 0x10000000. Within 10ns, the cache scheduling module completes address mapping updates, cache block locking, and access permission configuration, generating a cache-ready hardware signal, which is fed back to the task triggering module. Upon receiving this signal, the task triggering module confirms that both the instruction and data environment are ready, and then sends the start instruction to the target core. After completing the current instruction cycle, the core immediately reads the new task from the instruction buffer and directly accesses the specified physical address through the CXL coherence protocol without the need for operating system intervention.

[0137] During task execution, the task triggering module continuously reads the core status register to monitor task progress and anomalies. If a core failure or data verification error occurs, the system automatically triggers a migration mechanism, redirecting the task to a backup core, and the cache scheduling module initiates a data reload process. Upon task completion, the task triggering module sends a resource release command, and the cache scheduling module immediately releases the corresponding address segment, updates it to an idle state, and notifies the data monitoring module. Simultaneously, the cache scheduling module provides real-time feedback to the data monitoring module on the utilization rate of each storage level. For example, if the DRAM utilization rate exceeds 85%, the monitoring module automatically increases the low-frequency data trigger threshold to reduce the response probability of non-critical tasks. When the cache scheduling module detects hot data migrating to DRAM, it sends a migration signal to the monitoring module, which then suspends trigger monitoring of the original Flash address segment to prevent duplicate activation. The entire system, from data writing, verification, parsing, command issuance, cache binding to execution and recycling, is driven entirely by hardware logic without software intervention.

[0138] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.

[0139] Embodiments of this application also provide a task processing device for a data-triggered heterogeneous server system. Figure 5 is a structural block diagram of an optional task processing device for a data-triggered heterogeneous server system according to an embodiment of this application. As shown in Figure 5, the device includes:

[0140] The first determining module 502 is used to determine the reference data as target data when the reference data written to the core shared memory meets the preset triggering conditions, and to parse the target data to obtain data characteristics, wherein the core shared memory is the shared memory for processing core combinations.

[0141] The second determining module 504 is used to determine multiple target processing cores in the processing core combination according to data characteristics, wherein the processing core combination includes at least one general processing core and multiple accelerated processing cores, and the processing core combination includes multiple target processing cores.

[0142] The generation module 506 is used to generate task instructions based on multiple target processing cores and write the task instructions into the registers corresponding to each of the multiple target processing cores.

[0143] Optionally, the second determining module 504 is used to search in the first mapping table according to the data characteristics to obtain the target task configuration information, wherein the first mapping table is used to indicate the relationship between the data characteristics and the processing task requirements; and to determine multiple target processing cores in the core state library based on the target task configuration information, wherein the core state library is used to indicate the current core capabilities of the processing core combination.

[0144] Optionally, the second determining module 504 is used to determine the current core capabilities corresponding to each of the multiple reference processing cores in the processing core combination based on the core state library, wherein the current core capabilities include the state of the reference processing core, the performance parameters of the reference processing core, and the resource quota of the reference processing core; determine at least one reference processing core combination according to the current core capabilities corresponding to each of the multiple reference processing cores and the target task configuration information; and determine the multiple reference processing cores included in the reference processing core combination with the highest resource utilization as multiple target processing cores.

[0145] Optionally, the second determining module 504 is used to generate a first abnormal signal when the combination of reference processing cores cannot be determined based on the current core capabilities and target task configuration information corresponding to each of the multiple reference processing cores, wherein the first abnormal signal is used to indicate that the processing core combination resources are insufficient.
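The selection of target processing cores described in paragraphs [0144] and [0145] can be sketched as follows. This is only one possible reading. Resource utilization is assumed to mean the requested cache quota divided by the total quota of the chosen cores; only idle cores of the required type are considered; performance parameters are left out for brevity; and the exception `InsufficientCoresError` is a hypothetical stand-in for the first abnormal signal. Exhaustive enumeration is used for clarity and would not scale to large core counts.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence


@dataclass(frozen=True)
class Core:
    """One record of the core state library (field names are assumptions)."""
    core_id: int
    core_type: str
    state: str            # "idle" or "busy"
    cache_quota_mb: int


class InsufficientCoresError(Exception):
    """Stands in for the first abnormal signal (insufficient core resources)."""


def select_cores(cores: Sequence[Core], core_type: str, count: int,
                 cache_mb: int) -> List[Core]:
    """Pick the feasible combination with the highest resource utilization."""
    idle = [c for c in cores if c.core_type == core_type and c.state == "idle"]
    best, best_util = None, -1.0
    for combo in combinations(idle, count):
        total = sum(c.cache_quota_mb for c in combo)
        if total == 0 or total < cache_mb:
            continue  # this combination cannot satisfy the cache quota
        utilization = cache_mb / total
        if utilization > best_util:
            best, best_util = combo, utilization
    if best is None:
        raise InsufficientCoresError("no reference core combination satisfies the task")
    return list(best)


library = [
    Core(1, "NPU", "idle", 64),
    Core(2, "NPU", "idle", 64),
    Core(3, "NPU", "idle", 128),
    Core(4, "NPU", "busy", 64),
]
chosen = select_cores(library, "NPU", count=2, cache_mb=128)
```

With this state library, cores 1 and 2 are chosen: together they provide exactly 128 MB, whereas any pair containing core 3 would leave part of its quota unused.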

[0146] Optionally, the second determining module 504 is used to identify the data format type of the reference data, wherein the data format type is used to indicate the data format of the reference data; extract the processing requirement indicator bit of the reference data, wherein the processing requirement indicator bit is determined as the first indicator information when the reference data is for classification, the processing requirement indicator bit is determined as the second indicator information when the reference data is for filtering, and the processing requirement indicator bit is determined as the third indicator information when the reference data is for inference; determine the performance requirements of the target data, and determine the data format type, processing requirement indicator bit, and performance requirements as data features.

[0147] Optionally, the second determining module 504 is used to perform precise matching in the first mapping table according to the data format type and processing requirement indicator bit to obtain a first matching result, wherein the first matching result is a completely consistent matching result; perform fuzzy matching in the first mapping table according to performance requirements to obtain a second matching result, wherein the second matching result is a matching result with deviation within a preset range; and combine the first matching result and the second matching result to determine the target task configuration information.

[0148] Optionally, the second determining module 504 is used to determine the instruction type based on multiple target processing cores, wherein the instruction type is used to indicate the running state corresponding to each of the multiple target processing cores, and the running state includes start, pause, and termination; determine the target number set and the physical starting address of the core shared memory corresponding to each of the multiple target processing cores; determine the priority of the reference data, and generate task instructions based on the instruction type, target number set, physical starting address and priority.

[0149] Optionally, the first determining module 502 is used to determine the data information of the reference data, wherein the data information is used to indicate the access interface identifier, traffic rate, priority mark and initial throughput of the reference data; and to compare the data information of the reference data with the preset trigger parameter set, and determine the reference data as the target data if the comparison conditions are met.

[0150] Optionally, the first determining module 502 is used to determine the window time according to the task requirements of the reference data, wherein the window time is a preset time; collect reference data information of the reference data according to the preset sampling rate to obtain the reference data sequence; calculate the maximum value, minimum value and average value of the reference data sequence, and obtain the fluctuation percentage according to the maximum value, minimum value and average value of the reference data sequence; and clear the target data if the fluctuation percentage is greater than a preset threshold.

[0151] Optionally, the generation module 506 is used to send a cache allocation request to the core shared memory, determine the cache capacity, cache type and data access permissions in the core shared memory; determine the physical address allocation range according to the cache allocation request; receive a first signal sent by the core shared memory, and instruct the target core to read target data from the core shared memory according to the task instructions, wherein the first signal is used to indicate that the core shared memory is ready.

[0152] For a description of the features of the task processing device for a data-triggered heterogeneous server system in the corresponding embodiment, please refer to the relevant description of the task processing method for a data-triggered heterogeneous server system in the corresponding embodiment, which will not be repeated here.

[0153] Embodiments of this application also provide an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the steps in any of the above embodiments of the task processing method for a data-triggered heterogeneous server system.

[0154] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above embodiments of the task processing method for a data-triggered heterogeneous server system at runtime.

[0155] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0156] The embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above-described embodiments of the task processing method for a data-triggered heterogeneous server system.

[0157] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described embodiments of the task processing method for a data-triggered heterogeneous server system.

[0158] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be executed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microcontroller unit (MCU), etc. The terms "system," "computing device," or "apparatus" as used herein encompass various means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.

[0159] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0160] The above provides a detailed description of a data-triggered heterogeneous server system and its task scheduling method provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and core ideas of this application. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.

Claims

1. A data-triggered heterogeneous server system, characterized in that it includes: a computing node layer, which includes at least one computing node, wherein the computing node includes a processing core combination and core shared memory corresponding to the processing core combination, wherein the processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores; and a data trigger control layer, which includes at least one control node connected to the at least one general-purpose processing core and the multiple accelerated processing cores of the corresponding computing node, wherein, when the received reference data meets the triggering conditions, the control node directly sends task instructions to multiple target processing cores in the connected computing node, wherein the processing core combination includes the multiple target processing cores.

2. The system according to claim 1, characterized in that the control node includes: a data monitoring module, used to collect the reference data, perform hardware comparison according to the preset parameter set, and generate a trigger signal when the triggering conditions are met; a feature parsing module, used to extract data features from the reference data when the trigger signal is received, and to look up the target task configuration information in the first mapping table based on the data features; and a task triggering module, used to generate task instructions upon receiving the target task configuration information and send the task instructions to the corresponding computing node.

3. The system according to claim 2, characterized in that, The data monitoring module includes: Multiple monitoring channels are provided, which listen to the reference data and generate a first-level signal, wherein the first-level signal is used to indicate the presence of data; The parameter acquisition circuit includes a counter, a sampling register, and management interface logic. It reads the interface throughput register and priority field through the management link and stores them in the local register. The hardware comparator array includes multiple independent hardware comparators, which are respectively connected to the local register and the output signal of each monitoring channel, and are used to compare the acquired signal parameters with the corresponding preset judgment parameters in real time. The logic combination unit is connected to the output of each hardware comparator and is used to perform combination operations on multiple comparison results according to a preset logic relationship to generate the trigger signal.

4. The system according to claim 2, characterized in that, The feature parsing module includes: The preprocessing circuit is used to perform integrity verification on the received data packets and discard data packets that fail the verification based on the verification logic. The feature extraction circuit includes a state machine circuit and a field parsing unit. The state machine circuit is used to parse the data packet header according to a preset protocol format and extract the data format identifier, processing requirement code and parameter field. A dual-port memory is used to store the first mapping table, which consists of multiple entries. Each entry contains a mapping relationship between a combination of data format, processing requirements, and performance parameters and the corresponding target core type, core quantity, cache quota, and link priority. The parallel matching array, consisting of multiple hardware comparators, is used to simultaneously match multiple entries in the mapping table. The matching process includes: precise matching of data format and processing requirements, requiring fields to be completely consistent; and fuzzy matching of performance parameters, allowing deviations from the target value within a preset error range. The address decoding unit is used to decode the resource pointer corresponding to the successfully matched table entry into the target task configuration information and output it to the task triggering module.

5. The system according to claim 2, characterized in that the task triggering module includes: an instruction generation unit, used to generate the task instruction based on the input target task configuration information, wherein the task instruction includes an instruction type field, a target core identifier field, a task parameter address field, a task priority field, and a checksum field; and an instruction distribution unit, coupled to the instruction generation unit and used to directly write the task instruction into the instruction buffer of the target core through a dedicated control link.

6. The system according to claim 1, characterized in that the computing node includes: at least one general-purpose processing core, multiple accelerated processing cores, a photonic switching module, and a core shared memory unit; the general-purpose processing core establishes a fully connected interconnection topology with the multiple accelerated processing cores through the photonic switching module; the multiple accelerated processing cores are directly connected to the photonic switching module through a single-hop path, and the photonic switching module is directly connected to the core shared memory unit; each of the accelerated processing cores is configured with two interconnect channels: the first channel is used for direct connection between acceleration chips of the same type; the second channel is used to establish a cache-coherent access channel with the photonic switching module and the core shared memory unit.

7. The system according to claim 6, characterized in that the computing node is configured to: upon receiving the task instruction, write the task instruction directly into the reserved instruction buffer inside the computing node through the hardware instruction distribution circuit; when the current instruction cycle executed by the computing node ends, read the task instruction from the instruction buffer through the hardware priority arbitration logic, and access the physical address specified in the core shared memory unit according to the task parameter address field to obtain the input data and running parameters required by the task; and, according to the task instruction, determine multiple target cores in the processing core combination and instruct the target cores to execute the corresponding tasks.

8. The system according to claim 1, characterized in that the core shared memory is interconnected with multiple reference processing cores in the processing core combination of the computing node through a connection protocol, so as to provide all processing cores with unified access to the core shared memory, support concurrent reads and writes by multiple computing cores, and maintain memory access consistency through hardware mechanisms; and the core shared memory is directly interconnected with the control layer, enabling the control layer to directly read, write, or schedule data in the core shared memory.

9. The system according to claim 8, characterized in that the system further includes a shared memory layer, the shared memory layer including a memory interconnect network used to interconnect the core shared memory corresponding to each of the at least one computing node included in the computing node layer; when the number of nodes in the computing node layer does not exceed a preset number, memory access requests are broadcast over the links and all cores monitor the cache status; when the number of nodes exceeds the preset number, hardware logic temporarily stores access requests in a temporary cache queue; and the shared memory layer includes an address translation module, which dynamically maps the physical addresses of the memory to a unified virtual address space of the system.

10. The system according to claim 9, characterized in that the address translation module of the shared memory layer has a built-in cache mapping table; when the target processing core initiates an access request and the request misses in the cache mapping table, a table-entry-miss interrupt is triggered, and the address management unit dynamically loads the entry from the global address mapping table, which is distributed across the nodes, into the cache mapping table.
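
The miss handling in claim 10 resembles a translation cache backed by a larger table. The following sketch models it under several assumptions that are not in the claim: a 4096-byte page granularity, a least-recently-used eviction policy, a single dictionary standing in for the distributed global address mapping table, and a counter standing in for the table-entry-miss interrupt. All names are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page granularity; the claim does not give one


class AddressTranslator:
    def __init__(self, global_table: dict, capacity: int = 2):
        self.global_table = global_table  # virtual page -> physical page
        self.cache = OrderedDict()        # built-in cache mapping table (LRU, assumed)
        self.capacity = capacity
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.cache:
            self.cache.move_to_end(vpage)  # refresh LRU position on a hit
        else:
            self.misses += 1               # models the table-entry-miss interrupt
            # Load the entry from the global table; raises KeyError if unmapped.
            self.cache[vpage] = self.global_table[vpage]
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used entry
        return self.cache[vpage] * PAGE_SIZE + offset
```

Repeated accesses to the same page hit the cache mapping table and leave `misses` unchanged; only a first touch, or a touch after eviction, goes back to the global table.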

11. A task processing method for a data-triggered heterogeneous server system, characterized in that the method includes: if the reference data written to the core shared memory meets the preset triggering conditions, determining the reference data as the target data, and parsing the target data to obtain data features, wherein the core shared memory is the shared memory of the processing core combination; determining, based on the data features, multiple target processing cores in the processing core combination, wherein the processing core combination includes at least one general-purpose processing core and multiple accelerated processing cores, and the processing core combination includes the multiple target processing cores; and generating task instructions based on the plurality of target processing cores, and writing the task instructions into the registers corresponding to each of the plurality of target processing cores.

12. The method according to claim 11, characterized in that the step of determining multiple target processing cores in the processing core combination based on the data features includes: searching a first mapping table based on the data features to obtain target task configuration information, wherein the first mapping table indicates the relationship between data features and processing task requirements; and determining, based on the target task configuration information, the plurality of target processing cores in a core state database, wherein the core state database indicates the current core capabilities of the processing core combination.

13. The method according to claim 12, characterized in that the step of determining the plurality of target processing cores in the core state database based on the target task configuration information includes: determining, based on the core state database, the current core capabilities corresponding to each of the multiple reference processing cores in the processing core combination, wherein the current core capabilities include the state of the reference processing core, the performance parameters of the reference processing core, and the resource quota of the reference processing core; determining at least one reference processing core combination based on the current core capabilities corresponding to each of the plurality of reference processing cores and the target task configuration information; and identifying the multiple reference processing cores included in the reference processing core combination with the highest resource utilization rate as the multiple target processing cores.

14. The method according to claim 13, characterized in that the step of determining at least one reference processing core combination based on the current core capabilities corresponding to each of the plurality of reference processing cores and the target task configuration information includes: if no reference processing core combination can be determined based on the current core capabilities corresponding to each of the plurality of reference processing cores and the target task configuration information, generating a first abnormal signal, wherein the first abnormal signal indicates that the processing core combination has insufficient resources.
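
The selection described in claims 13 and 14 can be sketched as a search over candidate core combinations. This is an illustration under assumptions, not the claimed implementation: core state is reduced to an idle flag, performance to a single throughput number, and "resource utilization rate" is taken to be the required quota divided by the combined quota of the chosen cores. The exhaustive search is only practical for small core counts. `Core`, `select_cores`, and `InsufficientResources` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Core:
    core_id: int
    idle: bool       # state of the core
    throughput: int  # performance parameter (arbitrary units)
    quota: int       # resource quota still available (arbitrary units)


class InsufficientResources(Exception):
    """Stands in for the 'first abnormal signal' of claim 14."""


def select_cores(cores, need_cores, need_throughput, need_quota):
    candidates = [c for c in cores if c.idle]
    best, best_util = None, -1.0
    for combo in combinations(candidates, need_cores):
        if sum(c.throughput for c in combo) < need_throughput:
            continue
        quota = sum(c.quota for c in combo)
        if quota < need_quota:
            continue
        # Assumed metric: share of the combined quota the task would actually use.
        util = need_quota / quota if quota else 1.0
        if util > best_util:
            best, best_util = combo, util
    if best is None:
        raise InsufficientResources("no core combination satisfies the task")
    return [c.core_id for c in best]
```

Preferring the tightest feasible fit keeps larger cores free for later tasks, which is one plausible reading of "highest resource utilization rate".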

15. The method according to claim 12, characterized in that the step of parsing the target data to obtain data features includes: identifying the data format type of the reference data, wherein the data format type indicates the data format of the reference data; extracting the processing requirement indicator bit of the reference data, wherein, when the reference data is for classification, the processing requirement indicator bit is determined as the first indicator information; when the reference data is for filtering, the processing requirement indicator bit is determined as the second indicator information; and when the reference data is for inference, the processing requirement indicator bit is determined as the third indicator information; and determining the performance requirements of the target data, and defining the data format type, the processing requirement indicator bit, and the performance requirements as the data features.

16. The method according to claim 15, characterized in that the step of searching the first mapping table based on the data features to obtain the target task configuration information includes: performing an exact match in the first mapping table based on the data format type and the processing requirement indicator bit to obtain a first matching result, wherein the first matching result is a completely consistent matching result; performing a fuzzy match in the first mapping table based on the performance requirements to obtain a second matching result, wherein the second matching result is a matching result with a deviation within a preset range; and determining the target task configuration information by combining the first matching result and the second matching result.
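
The two-stage lookup in claim 16 can be sketched as an exact filter, a tolerance filter, and an intersection. The sketch below makes assumptions the claim leaves open: the performance requirement is reduced to one latency figure in microseconds, the "preset range" is a relative tolerance, and the two results are combined by intersection with the closest performance level winning. `MappingEntry` and `lookup` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingEntry:
    data_format: str   # data format type
    indicator: int     # processing requirement indicator bit
    latency_us: float  # performance level this entry is tuned for (assumed metric)
    task_config: str   # target task configuration information


def lookup(table, data_format, indicator, latency_us, tolerance=0.2):
    # First matching result: exact match on format type and indicator bit.
    exact = [e for e in table
             if (e.data_format, e.indicator) == (data_format, indicator)]
    # Second matching result: performance within a relative tolerance.
    fuzzy = [e for e in table
             if abs(e.latency_us - latency_us) <= tolerance * latency_us]
    # Combine: keep entries present in both results.
    both = [e for e in exact if e in fuzzy]
    if not both:
        return None
    # Among the survivors, take the closest performance level.
    return min(both, key=lambda e: abs(e.latency_us - latency_us)).task_config
```

Returning `None` when the intersection is empty leaves the caller to decide how to handle data whose features match no configured task.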

17. The method according to claim 12, characterized in that the step of generating task instructions based on the plurality of target processing cores includes: determining the instruction type based on the plurality of target processing cores, wherein the instruction type indicates the running state corresponding to each of the plurality of target processing cores, and the running state includes start, pause, and termination; determining the target number set corresponding to each of the plurality of target processing cores and the physical starting address of the core shared memory; and determining the priority of the reference data, and generating the task instruction based on the instruction type, the target number set, the physical starting address, and the priority.

18. The method according to any one of claims 11 to 17, characterized in that the step of determining the reference data as target data when the reference data written to the core shared memory meets the preset triggering conditions includes: determining the data information of the reference data, wherein the data information indicates the access interface identifier, traffic rate, priority marker, and initial throughput of the reference data; and comparing the reference data with the preset set of trigger parameters, and, if the comparison conditions are met, determining the reference data as the target data.

19. The method according to claim 18, characterized in that the step of determining the reference data as target data when the reference data written to the core shared memory meets the preset triggering conditions includes: determining the window time based on the task requirements of the reference data, wherein the window time is a preset time; collecting reference data information from the reference data at a preset sampling rate to obtain a reference data sequence; calculating the maximum, minimum, and average values of the reference data sequence, and obtaining the fluctuation percentage based on the maximum, minimum, and average values of the reference data sequence; and, if the fluctuation percentage exceeds a preset threshold, clearing the target data.
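
Claim 19 derives a fluctuation percentage from the maximum, minimum, and average of the sampled sequence but does not give the formula. The sketch below assumes one common choice, the peak-to-peak range relative to the mean; this formula and the function names are assumptions for illustration only.

```python
def fluctuation_percent(samples):
    """Peak-to-peak range as a percentage of the mean (assumed formula)."""
    if not samples:
        raise ValueError("empty reference data sequence")
    mean = sum(samples) / len(samples)
    if mean == 0:
        raise ValueError("mean is zero; fluctuation percentage is undefined")
    return (max(samples) - min(samples)) * 100.0 / abs(mean)


def keep_target(samples, threshold_percent):
    """False means the target data would be cleared under claim 19."""
    return fluctuation_percent(samples) <= threshold_percent
```

For a window sampled as `[90, 100, 110]` the range is 20 and the mean is 100, so a threshold of 25 percent keeps the target data while a threshold of 10 percent clears it.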

20. The method according to any one of claims 11 to 17, characterized in that, after generating task instructions based on the plurality of target processing cores, the method further includes: sending a cache allocation request to the core shared memory, and determining the cache capacity, cache type, and data access permissions in the core shared memory; determining the physical address allocation range based on the cache allocation request; and receiving a first signal from the core shared memory and instructing the target core to read the target data from the core shared memory according to the task instructions, wherein the first signal indicates that the core shared memory is ready.