A method and system for DSA-oriented instruction-level performance analysis and a storage medium
By employing a hardware-software co-operational instruction-level performance analysis method, the challenge of acquiring fine-grained operational status data at the instruction level in the NPU chip's dataflow accelerator was solved. This enabled high-precision, low-overhead instruction performance statistics, improving system stability and analysis efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 奕行智能科技(广州)有限公司
- Filing Date
- 2026-05-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies make it difficult to accurately acquire fine-grained operational status data at the instruction level of the data stream accelerator in NPU chips, resulting in inaccurate program performance evaluation and bottleneck location, and also have limitations in versatility, flexibility and real-time performance.
The invention proposes an instruction-level performance analysis method for DSA (Data Stream Accelerator). Through a hardware-software cooperative trace triggering and acquisition mechanism, it adopts trace acquisition logic that is independent of the main execution pipeline and collects events on a bypass path. Combined with a scalable trace event data encoding format and storage strategy, it reduces storage space and system bandwidth consumption, and it coordinates instruction scheduling with trace acquisition according to the hardware execution unit configuration so that normal system operation is not disturbed.
It enables high-precision acquisition of fine-grained instruction-level runtime data without affecting system performance, reduces storage and bandwidth consumption, improves the accuracy of performance statistics and system stability, and supports instruction execution and analysis in high-concurrency environments.
Figure CN122240441A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence chip performance analysis technology, and to an instruction-level performance analysis method, system and storage medium for a DSA (Data Stream Accelerator), and more particularly to an instruction-level performance analysis method, system and storage medium for a DSA in an NPU (Neural Processing Unit).
Background Technology
[0002] With the rapid evolution of artificial intelligence algorithms and deep learning models, AI chips continue to develop towards specialization and heterogeneity, and the NPU has become the core hardware carrier for AI computing both in the cloud and at the edge. Within the NPU, the DSA (Data Stream Accelerator) is the key unit responsible for data transfer, scheduling, and execution, and it directly interfaces with heterogeneous execution components such as the Matrix Calculation Unit (ME), Vector Calculation Unit (VE), and Data Transfer Unit (TE / TEC). Its instruction pipeline efficiency, data throughput, and timing scheduling performance determine the overall computing power utilization and model execution efficiency of the NPU. Throughout NPU chip design, compiler development, and model network optimization, instruction-level performance analysis is the core means of locating bottlenecks, debugging anomalies, and optimizing scheduling strategies. By collecting DSA instruction execution sequences, tracing instruction lifecycles, and statistically analyzing instruction execution cycles and dependencies, problems such as pipeline stalls, data conflicts, bandwidth bottlenecks, and idle units can be accurately identified. With finer-grained instruction data that exposes the instruction execution process, chip designers, compiler developers, and model network developers can better analyze program performance and optimize model programs more efficiently.
[0003] Current mainstream instruction performance analysis schemes include: 1) Methods and apparatus for processor trace triggering, suitable for real-time instruction flow tracing and dynamic performance analysis in multi-core SoC chips. It employs multiple trigger units as a programmable point-triggered mechanism, which can originate from the debug unit or performance event unit. It supports real-time sampling of hardware instruction flow (trace) and provides transparent, low-overhead analysis of performance bottlenecks in critical code segments. However, this scheme requires the integration of a large number of trigger units, tracing circuits, and performance monitoring units (PMUs), increasing chip area and design complexity. Multi-trigger source collaboration may introduce timing conflicts, requiring complex arbitration logic. Dynamically adjusting trigger conditions requires frequent register reads and writes, potentially affecting real-time performance. 2) Methods, computing devices, computer-readable storage media, and computer program products for determining the instruction performance of GPU kernel programs. This involves setting multiple performance counters in each execution unit (EUI) of the GPU and mapping these counters to consecutive bit fields of the kernel program. When instruction execution is in a waiting state, the performance counter increments by one every clock cycle; otherwise, the performance counter increments with the program counter. However, this approach is limited by fixed bit fields in terms of analysis granularity. The size of consecutive bit fields affects the analysis accuracy, requiring multiple runs of the program to cover all instructions. The fixed mapping between the performance counter and the program counter may lead to resource waste, and the impact of inter-instruction dependencies on performance is not clearly addressed. 
3) Methods and devices for instruction performance analysis, targeting superscalar processors, improve analysis accuracy by analyzing the effective sampling number of instruction groups and correcting the CPU index of each instruction. However, this approach has high computational complexity, requires processing a large amount of statistical data and distributed computation, consumes a lot of resources, relies on hardware registers to record group information, and has limited versatility. 4) Methods, electronic devices, media, and programs for static analysis of NPU instruction performance employ simulation methods, calculate the number of clock cycles required for instruction execution, handle inter-instruction dependencies, and estimate execution time using predetermined formulas. However, the execution time accuracy of this approach is limited by the accuracy of the preset formulas, may ignore actual hardware details, has complex dependency handling logic, may increase simulation overhead, and lacks versatility.
[0004] In summary, existing technologies generally suffer from limitations in versatility, flexibility, and real-time performance. They also suffer from low analysis efficiency, high bandwidth utilization, insufficient online performance analysis capabilities, and difficulty in accurately acquiring fine-grained runtime status data for each DSA instruction with low overhead. This results in inaccurate program performance evaluation and bottleneck identification, affecting users' targeted optimization.
Summary of the Invention
[0005] The technical problems to be solved by this invention are: 1) to build a general instruction tracing triggering and acquisition mechanism that integrates software and hardware, so as to achieve fine-grained acquisition of data flow accelerator instruction-level runtime data and fully display the instruction execution process without affecting system performance; 2) to design a dedicated and scalable trace event data encoding format for the data flow accelerator, optimize the event data storage strategy, and reduce storage space occupation and system bus bandwidth consumption; 3) to automatically collect and save trace events as needed at key time nodes of data flow accelerator instruction execution, so as to provide users with a low-overhead and high-precision instruction performance statistics solution.
[0006] To address the aforementioned technical problems, this invention provides an instruction-level performance analysis method, system, and storage medium for DSA, specifically providing an instruction-level performance analysis method, system, and storage medium for DSA in an NPU.
[0007] The first aspect of this invention provides an instruction-level performance analysis method for DSA, comprising the following steps: tracing the instruction execution process of the DSA and encoding and recording trace events to obtain trace data for the entire instruction lifecycle; and coordinating instruction scheduling with trace acquisition based on the hardware execution unit configuration, so that trace event encoding and recording, trace data storage, and bandwidth control do not interfere with normal system operation. Specifically, trace acquisition logic independent of the DSA main execution pipeline collects instruction lifecycle events on a bypass path. The collected trace events are written to a buffer queue, and an asynchronous write-out unit writes them to memory according to a preset bus arbitration priority and / or bandwidth control strategy, so that trace event encoding and recording do not block the normal execution of DSA instructions.
[0008] Furthermore, the preset bus arbitration priority is lower than the DSA service memory access priority.
[0009] Furthermore, when the buffer queue is at risk of overflow, an overflow flag is recorded and the DSA main execution process continues to run.
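The non-blocking behavior described in [0007] to [0009] can be sketched as a bounded queue whose producer never waits. This is a minimal software model under stated assumptions, not the hardware described in the source: the class name `TraceBuffer`, its fields, and the policy of dropping the newest event once the queue is full are hypothetical. The source only states that an overflow flag is recorded and that the DSA main execution continues.

```python
from collections import deque


class TraceBuffer:
    """Bounded trace queue; push() never blocks the caller (hypothetical model)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: deque[int] = deque()
        self.overflowed = False   # sticky overflow flag, as in [0009]
        self.dropped = 0          # assumption: count of events lost on overflow

    def push(self, word: int) -> bool:
        """Called from the acquisition side; returns immediately in every case."""
        if len(self.queue) >= self.capacity:
            # Assumption: drop the new event and record the overflow,
            # rather than stalling the main execution pipeline.
            self.overflowed = True
            self.dropped += 1
            return False
        self.queue.append(word)
        return True

    def drain(self, max_words: int) -> list[int]:
        """Called by the asynchronous write-out side when the bus grants a slot."""
        out = []
        while self.queue and len(out) < max_words:
            out.append(self.queue.popleft())
        return out
```

A write-out loop would call `drain` only when the lower-priority trace traffic wins bus arbitration, so service memory accesses are served first.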
[0010] Furthermore, the encoding and recording of trace events includes the following steps: An extensible trace event data encoding format is adopted, in which an event type code identifies the key behaviors in the DSA instruction lifecycle. These key behaviors are the start (START), end (END), and restart (RESTART) states of a DSA instruction, and each event corresponds to a unique state (e.g., 0: START, 1: END, ...). This design distinguishes the event types at very low resource overhead, avoids the event ambiguity and localization difficulty of traditional acquisition designs, and provides clear base data for subsequent performance analysis and scheduling. A timestamp is assigned to each DSA trace event to provide nanosecond-level timing traceability over a span of up to 64 seconds, so that instruction performance statistics are both fine-grained and usable on very long computation paths; in chip environments with dense instruction execution and high event concurrency, absolute timestamps prevent the misattribution of performance that results from out-of-order trace data and confused causality. The instruction identifier is formed by combining the instruction type identifier, the execution unit identifier, and the high and low bits of the instruction address. When an instruction end event and an instruction start event generated by the same execution unit are received within the same acquisition cycle, and their instruction type identifiers (IType) and the high-order bits of their instruction addresses are the same, the two events are compressed and encoded into a single instruction restart event. This further reduces event storage and system bandwidth usage while preserving data integrity, serves the practical needs of instruction attribution and per-cycle statistics, and improves the instruction-level traceability of trace event data.
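The merge condition at the end of [0010] can be written as a small predicate. The sketch below is illustrative only: the `RawEvent` record, its `unit` and `cycle` fields, and the function name `try_merge` are assumptions introduced here, and the value RESTART = 2 follows the detailed description in [0034].

```python
from dataclasses import dataclass
from typing import Optional

START, END, RESTART = 0, 1, 2  # event type codes; RESTART = 2 per [0034]


@dataclass(frozen=True)
class RawEvent:
    unit: int    # execution unit that produced the event (hypothetical field)
    cycle: int   # acquisition cycle in which the event was captured
    etype: int   # START / END / RESTART
    itype: int   # instruction type identifier (IType)
    pc: int      # full 32-bit instruction address before segmentation


def try_merge(end_ev: RawEvent, start_ev: RawEvent) -> Optional[RawEvent]:
    """Return one RESTART event if the END/START pair meets the merge rule."""
    mergeable = (
        end_ev.etype == END
        and start_ev.etype == START
        and end_ev.unit == start_ev.unit            # same execution unit
        and end_ev.cycle == start_ev.cycle          # same acquisition cycle
        and end_ev.itype == start_ev.itype          # same instruction type
        and (end_ev.pc >> 16) == (start_ev.pc >> 16)  # same high address bits
    )
    if not mergeable:
        return None
    # The merged event keeps the address of the instruction that is starting.
    return RawEvent(start_ev.unit, start_ev.cycle, RESTART,
                    start_ev.itype, start_ev.pc)
```

When `try_merge` returns `None`, the two events are simply emitted separately, so no lifecycle information is lost.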
[0011] Furthermore, the event type code is 3 bits, and the time stamp is 36 bits.
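As a rough check of the 36-bit timestamp width, the snippet below computes how long the counter runs before wrapping and how an interval can be measured across one wrap. The 1 ns tick period is an assumption (the source only says "nanosecond-level"), and the function names are hypothetical.

```python
TIMESTAMP_BITS = 36
TIMESTAMP_MOD = 1 << TIMESTAMP_BITS  # number of distinct timestamp values


def timestamp_span_seconds(tick_ns: float = 1.0) -> float:
    """Time covered by the counter before it wraps (tick period is assumed)."""
    return TIMESTAMP_MOD * tick_ns * 1e-9


def elapsed_ticks(start_ts: int, end_ts: int) -> int:
    """Interval between two timestamps, tolerating a single wrap-around."""
    return (end_ts - start_ts) % TIMESTAMP_MOD
```

With the assumed 1 ns tick, the counter covers roughly 68.7 seconds, which is consistent with the stated tracking span of up to 64 seconds.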
[0012] Furthermore, the instruction address is split into high and low halves: start and restart events carry the lower 16 bits of the instruction address, while the end event carries the upper 16 bits. When an end event arrives, its upper 16 bits are combined with the lower 16 bits of the matching start event held in the event cache (the start event with the same execution unit, instruction type, or sequence number) to form the full 32-bit instruction address. This provides complete traceability of the instruction lifecycle while significantly reducing storage and bandwidth usage.
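The address reconstruction in [0012] can be illustrated with a small post-processing routine. This is a sketch under assumptions: events are modeled as plain tuples, the matching key `(unit, itype)` is one of the options the text mentions, and RESTART events are left out for brevity. The name `pair_events` is hypothetical.

```python
START, END = 0, 1  # event type codes from the encoding format


def pair_events(events):
    """Rebuild (pc, start_ts, end_ts) records from START/END trace events.

    Each input event is a tuple (unit, etype, itype, pc_seg, timestamp).
    START carries PC[15:0] and END carries PC[31:16].
    """
    pending = {}   # event cache: (unit, itype) -> (low 16 bits, start time)
    records = []
    for unit, etype, itype, pc_seg, ts in events:
        key = (unit, itype)
        if etype == START:
            pending[key] = (pc_seg & 0xFFFF, ts)
        elif etype == END and key in pending:
            low, start_ts = pending.pop(key)
            full_pc = ((pc_seg & 0xFFFF) << 16) | low  # 32-bit address
            records.append((full_pc, start_ts, ts))
    return records
```

For example, a START event with segment 0x5678 followed by an END event with segment 0x1234 on the same unit yields the address 0x12345678 together with both timestamps.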
[0013] Furthermore, the method also includes: allocating storage resources to the trace data and controlling its bus bandwidth, so as to reduce storage usage and bus bandwidth consumption.
[0014] Furthermore, the storage resource allocation and bus bandwidth management include the following steps: Local storage resources are allocated in a quantitative manner based on the type of artificial intelligence model in order to achieve storage quota management by model; The system calculates the actual utilization of storage bandwidth by the AI network and allocates bandwidth quotas for writing tracking events based on this data. It sets bandwidth limits and dynamically schedules these quotas to prevent tracking data write traffic from preempting or blocking the main service link. Simultaneously, it assesses the flow control of the on-chip network bus (CNOC bus), prioritizing on-chip cache for local storage of tracking data, and only writing to peripherals via the on-chip network when data resources exceed limits or the on-chip cache overflows. The tracking events are transmitted in batches using data packing alignment and burst writing. When the bandwidth utilization of the on-chip interconnect network exceeds a preset saturation threshold, the tracking event write rate is reduced or the queue cache is triggered.
[0015] Furthermore, the quantitative allocation of local storage resources based on the type of artificial intelligence model includes: quantifying the required local storage resources based on the actual rate of tracking events generated by the artificial intelligence model type and the number of chip cores, prioritizing the allocation of storage space according to different artificial intelligence models and setting usage limits to avoid resource waste and increased costs due to excessive tracking data.
[0016] Furthermore, the hardware execution unit includes a vector calculation unit, a data transfer unit, and a matrix calculation unit; the collaborative optimization of instruction scheduling and tracking acquisition based on the hardware execution unit configuration includes the following steps: For the Vector Computing Unit (VE), Data Transfer Unit (TE / TEC), and Matrix Computing Unit (ME), the number of configuration instructions required for each type of unit to complete the corresponding operation is statistically summarized (e.g., VE requires 8 instructions, TE / TEC requires 10 instructions, and ME requires 15 instructions). Combined with the chip's dual-issue hardware structure (i.e., it can issue two instructions at a time), the minimum startup cycle and maximum issue rate of each type of instruction are determined (e.g., VE can start once every 4 clock cycles, TE / TEC can start once every 5 clock cycles, and ME can start once every 8 clock cycles). Based on the number and architectural characteristics of the hardware execution units, the parallelism of the hardware execution units is determined. Each trace event is saved within two clock cycles. Based on this, it is calculated that during maximum concurrent writes of trace events, 50% of the bus bandwidth will be consumed. The bandwidth quota and write window for trace data are designed accordingly to ensure real-time data acquisition while minimizing resource contention on the main service path, thus reducing system operational risks. Simultaneously, the maximum instruction issue rate for the entire DSA system is defined, fully utilizing the hardware's maximum bandwidth and setting clear boundary conditions for trace acquisition and performance statistics, providing hardware-level support for simplified system scheduling and bandwidth security. 
Based on the hardware execution unit configuration (e.g., 1 matrix calculation unit ME, 2 vector calculation units VE, 12 data transfer units TE, 2 data transfer units TEC, and 1 synchronization unit fence.em), the system's scheduling mechanism, and the execution unit's state machine, the number of tracking channels, local cache capacity, and upper limit of tracking event write bandwidth are determined according to the maximum instruction issuance rate and hardware execution unit parallelism (e.g., the system must contain at least 1 ME, 2 VE, 4 TE, 2 TEC, and 1 synchronization unit fence). This design can support the simultaneous parallel execution of up to 4 TE instructions and 2 TEC instructions. Combined with the dual vector processing unit (VPU) architecture, the system parallelism can be dynamically adjusted, which can improve both instruction execution throughput and tracking event acquisition efficiency. At the same time, a hardware and software linkage strategy is formulated to address resource contention and data bandwidth allocation issues in high-concurrency tracking operations, ensuring stable system operation.
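The example startup cycles in [0016] are consistent with dividing each unit's configuration instruction count by the dual-issue width and rounding up. The source does not state this formula explicitly, so the sketch below is an inference that merely reproduces its example figures; the names are hypothetical.

```python
import math

ISSUE_WIDTH = 2  # dual-issue: two instructions per clock cycle
CONFIG_INSTRUCTIONS = {"VE": 8, "TE/TEC": 10, "ME": 15}  # example counts from [0016]


def min_startup_cycles(config_instructions: int,
                       issue_width: int = ISSUE_WIDTH) -> int:
    """Fewest cycles needed to issue all configuration instructions of one launch."""
    return math.ceil(config_instructions / issue_width)


startup = {unit: min_startup_cycles(n) for unit, n in CONFIG_INSTRUCTIONS.items()}
# The shortest interval bounds how often any unit can be launched.
fastest_launch_interval = min(startup.values())
```

Under these assumptions the result is 4 cycles for VE, 5 for TE / TEC, and 8 for ME, and the shortest interval of 4 cycles matches the issue-rate constraint of one instruction every four clock cycles.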
[0017] Furthermore, the maximum instruction issue rate constraint is one instruction issued every four clock cycles.
[0018] A second aspect of the present invention provides an instruction-level performance analysis system for DSA, comprising: a trace encoding module configured to implement trace event encoding, timestamping, and segmented recording of the instruction address; a storage bandwidth management module configured to perform storage quota allocation, cache-first saving, bandwidth quota control, and burst writes; and a scheduling and coordination optimization module configured to establish a hardware execution unit cycle model, set the instruction issue rate, and determine the maximum number of concurrent instructions, the number of trace channels, and the local cache capacity.
[0019] Furthermore, the tracking encoding module is also configured to perform merge processing on consecutive instruction end events and start events to reduce data volume and bandwidth usage.
[0020] Furthermore, the storage bandwidth management module is also configured to dynamically allocate storage space according to different artificial intelligence models and reduce the write rate when the bus pressure exceeds the limit.
[0021] A third aspect of the present invention provides a storage medium having a computer program stored thereon, the computer program implementing the steps of the above-described method when executed by a processor.
[0022] This invention designs a hardware-software cooperative DSA instruction-level performance tracing and event counting method. It accurately counts and records the various performance behaviors at the DSA instruction level in real time. A 64-bit, scalable event data encoding format is defined. By layering the event type, timestamp, and instruction address within this format, the pressure that trace events place on storage resources and system bandwidth is effectively reduced. The data format allows the fields carried at each event acquisition point to be determined dynamically, which supports on-demand sampling, differentiation between data points, and efficient back-end attribution analysis, thus improving the accuracy of DSA performance statistics. Especially in scenarios with massive event volumes, long durations, and highly differentiated instructions, the method encodes and stores the key behavioral data of each instruction concisely, facilitating efficient identification and tracing. It keeps chip-level trace acquisition low in resource consumption and bus-friendly, so that acquisition does not interfere with the bandwidth and storage used by normal applications. It also takes into account the versatility, efficiency, and convenience of software-side analysis of instruction performance. The acquired information is ultimately rendered as a visual pipeline, improving the efficiency of compiler and network developers in program analysis and optimization.
[0023] The present invention has at least the following beneficial effects: 1) By encoding and recording fine-grained tracking events during the execution process of DSA instructions, and generating tracking events at lifecycle nodes such as DSA instruction start, end, and restart, and by timestamping and encoding these tracking events, fine-grained timing data of the instruction lifecycle can be obtained; 2) By allocating storage resources and dynamically managing bus bandwidth for tracking data, the present invention effectively reduces on-chip storage occupation and bus bandwidth consumption during the tracking process, avoiding preemption or blocking of the main service link by tracking traffic; 3) By co-optimizing instruction scheduling and tracking acquisition based on hardware execution unit configuration, the present invention ensures parallel compatibility between performance analysis and normal system service execution without changing the original hardware computing path or adding additional critical path latency, thereby achieving chip instruction-level performance visibility. 
While performing analysis, the high throughput and low latency characteristics of the neural network processor are maintained, improving the stability and observability of the artificial intelligence chip in actual deployment; 4) Through the triple collaborative mechanism of "storage space allocation based on actual application scenarios, bandwidth occupancy limit control and dynamic burst writing", this invention meets the requirements of complete tracking data collection and analysis, while achieving precise control and optimization of storage and bus resource consumption, ensuring the non-intrusiveness of the tracking subsystem to the main business and other high-priority channels, and ensuring the overall stability of the system operation; 5) The data encoding format design of this invention is based on concise event encoding, high-precision timing and flexible instruction positioning, combined with event merging algorithm, to achieve efficient, low-loss and scalable tracking of the entire process of DSA instruction behavior under the premise of storage and bandwidth optimization, meeting the engineering statistical needs of high-concurrency environments such as AI chips. Furthermore, this invention pertains to a hardware-software collaborative tracking acquisition and data processing solution in the field of artificial intelligence chip performance analysis technology. Specifically, it is a specific application in artificial intelligence chip performance analysis. The resulting technical effects are that it greatly improves the storage / bandwidth friendliness and real-time processing capabilities of the tracking system. 
While achieving fine-grained performance tracking at the DSA instruction level, it significantly reduces the storage and bus bandwidth overhead caused by the tracking process, prevents tracking from interfering with the normal business operation of the chip, and ensures that the artificial intelligence chip maintains efficient and stable execution while providing strong performance analysis capabilities.
Description of the Drawings
[0024] To further illustrate the above and other advantages and features of the various embodiments of the present invention, a more specific description of the embodiments of the invention will be presented with reference to the accompanying drawings. It is to be understood that these drawings depict only typical embodiments of the invention and are therefore not intended to limit its scope. In the drawings, identical or corresponding parts will be indicated by identical or similar reference numerals for clarity.
[0025] Figure 1 shows the 64-bit scalable trace event data encoding format for DSA trace events in some embodiments of the present invention.
Detailed Implementation
[0026] It should be noted that the components in the accompanying drawings may be shown exaggerated for illustrative purposes and may not be to scale.
[0027] In this invention, the various embodiments are merely intended to illustrate the solutions of the invention and should not be construed as limiting.
[0028] In this invention, unless otherwise specified, the quantifiers “a” and “one” do not exclude scenarios involving multiple elements.
[0029] In this invention, the modules of the system according to the invention can be implemented using software, hardware, firmware, or a combination thereof. When a module is implemented using software, its function can be implemented through computer program flow. For example, the module can be implemented using code segments (such as code segments in languages like C and C++) stored in a storage device (such as a hard disk, memory, etc.), wherein the corresponding function of the module can be implemented when the code segment is executed by a processor. When a module is implemented using hardware, its function can be implemented by setting up a corresponding hardware structure. For example, the module's function can be implemented by hardware programming a programmable device such as a field-programmable gate array (FPGA), or by designing an application-specific integrated circuit (ASIC) that includes multiple transistors, resistors, capacitors, and other electronic devices, or by implementing it through neural network processor hardware logic. When a module is implemented using firmware, the module's function can be written in the form of program code into a read-only memory of the device, such as an EPROM or EEPROM, and the corresponding function of the module can be implemented when the program code is executed by a processor. In addition, some functions of the module may need to be implemented by separate hardware or by cooperating with said hardware.
[0030] It should also be noted that, in the embodiments of the present invention, only a portion of the parts or components may be shown for clarity and simplicity. However, those skilled in the art will understand that, under the teachings of the present invention, the required parts or components can be added as needed for specific scenarios.
[0031] It should also be noted that within the scope of this invention, the terms "same", "equal", and "equal to" do not mean that the two values are absolutely equal, but allow for a certain reasonable error. In other words, the terms also cover "substantially the same", "substantially equal", and "substantially equal to".
[0032] Furthermore, the embodiments of the present invention describe the process steps in a specific order; however, this is only for the convenience of distinguishing each step, and is not intended to limit the order of the steps. In different embodiments of the present invention, the order of each step can be adjusted according to the process.
[0033] The following embodiment provides an instruction-level performance analysis method for DSA, including the following steps: the instruction execution process of the DSA is traced, and trace events are encoded and recorded to obtain trace data for the entire lifecycle of the instruction; storage resource allocation and bus bandwidth management are applied to the trace data to reduce storage usage and bus bandwidth consumption; and instruction scheduling and trace acquisition are coordinated based on the hardware execution unit configuration, so that trace event encoding and recording, trace data storage, and bandwidth control do not interfere with the normal operation of the system. The hardware execution unit includes a vector computing unit (VE), a data transfer unit (TE / TEC) and a matrix computing unit (ME).
[0034] The event encoding and recording step defines the DSA instruction trace events and describes how the data encoding format expresses and stores them efficiently and on demand. It includes the following steps: A concise and scalable trace event data encoding format is defined, which precisely divides and marks data events, covers the full lifecycle, and identifies the key behaviors in the DSA instruction lifecycle. These key behaviors are the start (START), end (END), and restart (RESTART) states of a DSA instruction, which are marked with a 3-bit event type code. Each event carries a unique state (e.g., 0: START, 1: END, ...). This design distinguishes the event types at very low resource overhead, avoids the event ambiguity and localization difficulty of traditional acquisition designs, and provides clear base data for subsequent performance analysis and scheduling. Timing information and cross-cycle instruction tracing: each DSA trace event is assigned a 36-bit high-precision timestamp to achieve nanosecond-level timing tracing over a span of up to 64 seconds, thereby enabling accurate tracing of instruction execution timing. Instruction performance statistics are therefore both fine-grained and usable on very long computation paths. In chip environments with dense instruction execution and high event concurrency, absolute timestamps prevent the misattribution of performance that results from out-of-order trace data and confused causality. Flexible instruction address mapping and compact event data: to locate and correlate each instruction uniquely, the instruction address is carried in segments. Start and restart events carry the lower 16 bits of the instruction address, while end events carry the upper 16 bits. Combined, the two halves form a unique 32-bit instruction identifier, enabling complete traceability of the entire instruction lifecycle while significantly reducing storage and bandwidth consumption. When the previous event (an instruction end event) and the current event (an instruction start event) arrive simultaneously, and their instruction type identifier (IType) and high address bits are consistent, the two events are merged into a single RESTART event (EType=2). This further reduces event storage and system bandwidth usage while preserving data integrity, serves the practical needs of instruction attribution and per-cycle statistics, and improves the instruction-level traceability of trace event data.
[0035] Figure 1 shows the 64-bit scalable trace event data encoding format. The encoding is 64 bits long in total and consists of an event type field, an instruction type field, a program counter segment field, and a timestamp field. Specifically, the event type field (EType) occupies bits 63 to 61 (3 bits) and indicates the lifecycle type of the current trace event; the instruction type field (IType) occupies bits 60 to 52 (9 bits) and identifies the DSA instruction category of the current event; the program counter segment field occupies bits 51 to 36 (16 bits) and records part of the program counter (PC) address; the timestamp field (Timestamp) occupies bits 35 to 0 (36 bits) and records when the event occurred. In the start event encoding, EType is 0 and the program counter segment field records the lower 16 bits of the PC (PC[15:0]); in the end event encoding, EType is 1 and the program counter segment field records the upper 16 bits of the PC (PC[31:16]). By matching the start and end events of the same DSA instruction, the complete program counter address and the corresponding time interval can be obtained, enabling fine-grained performance analysis of the entire lifecycle of the DSA instruction. With this address segmentation, complete instruction identification information is preserved without increasing the total bit width of a single trace event, and subsequent event merging, restart identification, and cross-cycle timing analysis are supported.
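The field layout in [0035] maps directly onto shift-and-mask operations. The following sketch packs and unpacks one 64-bit trace word using the bit positions given in the text; the `TraceEvent` tuple and the function names are hypothetical helpers introduced for illustration, and RESTART = 2 is taken from [0034].

```python
from typing import NamedTuple

ETYPE_START, ETYPE_END, ETYPE_RESTART = 0, 1, 2

# (shift, width) of each field, per the layout described in [0035].
ETYPE_FIELD = (61, 3)    # bits 63..61
ITYPE_FIELD = (52, 9)    # bits 60..52
PCSEG_FIELD = (36, 16)   # bits 51..36
TS_FIELD = (0, 36)       # bits 35..0


class TraceEvent(NamedTuple):
    etype: int
    itype: int
    pc_seg: int
    timestamp: int


def _put(value: int, field: tuple) -> int:
    shift, width = field
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value} does not fit in {width} bits")
    return value << shift


def _get(word: int, field: tuple) -> int:
    shift, width = field
    return (word >> shift) & ((1 << width) - 1)


def encode_event(ev: TraceEvent) -> int:
    """Pack the four fields into a single 64-bit word."""
    return (_put(ev.etype, ETYPE_FIELD) | _put(ev.itype, ITYPE_FIELD)
            | _put(ev.pc_seg, PCSEG_FIELD) | _put(ev.timestamp, TS_FIELD))


def decode_event(word: int) -> TraceEvent:
    """Recover the four fields from a 64-bit word."""
    return TraceEvent(_get(word, ETYPE_FIELD), _get(word, ITYPE_FIELD),
                      _get(word, PCSEG_FIELD), _get(word, TS_FIELD))
```

Because the four widths sum to 64, an event with every field at its maximum value encodes to a word with all 64 bits set, which is a quick sanity check on the layout.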
[0036] This embodiment describes the design principles and expected behavior of storage resource allocation and bus bandwidth management from the perspectives of space utilization, bandwidth scheduling, and burst protection. The storage resource allocation and bus bandwidth management process includes the following steps. (1) Storage planning and space constraint design: local storage resources are allocated quantitatively according to mainstream artificial intelligence model types such as GPTJ and Bert (e.g., 8 MB per core for GPTJ, less than 200 KB per core for Bert), with model-specific allocation and quota management given priority, to avoid the resource waste and cost increase caused by excessive trace event data. (2) Bus bandwidth optimization and dynamic bandwidth usage control: the actual storage bandwidth occupancy of each AI network is calculated (e.g., approximately 0.44% for GPTJ and approximately 0.31% for single-core Bert), and bandwidth quotas for trace event writes are allocated on that basis. Bandwidth limits are set and dynamically scheduled so that trace data write traffic does not preempt or block the main service link. At the same time, the flow control status of the on-chip network bus (CNOC bus) is assessed, and trace data is preferentially stored in the on-chip cache. Data is written out to peripherals only when data resources exceed their limits or the on-chip cache overflows, which limits continuous bus writes at the source and reduces the risk of overall system bandwidth degradation caused by occasional events. (3) Burst write mechanism and abnormal bandwidth protection: the burst characteristics of the CNOC (burst length = 128 bytes) are exploited through data packing alignment and burst writes, so that multiple trace events are written in a single bus burst; this improves bandwidth utilization and shortens the continuous bus occupancy per unit of data written. In addition, a dynamic pruning mechanism governs the burst window and write cadence: when the bandwidth utilization of the on-chip interconnect network exceeds a preset saturation threshold, the trace event write rate is reduced or queue buffering is triggered, so that bursty and occasional bandwidth pressure is handled dynamically and normal application services are not affected by trace operations.
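The burst write and throttling behavior can be sketched as below. With a 128-byte burst and 64-bit (8-byte) trace events, 16 events fit in one burst; that count is derived here and is not stated in the source. The function names, the little-endian byte order, and the all-or-nothing throttle are assumptions for illustration only; the description says merely that the write rate is reduced or queue buffering is triggered above a preset threshold.

```python
import struct

BURST_BYTES = 128                               # CNOC burst length from the description
EVENT_BYTES = 8                                 # one 64-bit trace event
EVENTS_PER_BURST = BURST_BYTES // EVENT_BYTES   # 16 events per burst (derived)


def pack_bursts(events):
    """Pack complete groups of 16 events into 128-byte bursts.

    Returns (bursts, leftover). Leftover events stay buffered on chip
    until enough arrive to fill another aligned burst.
    """
    full = len(events) - len(events) % EVENTS_PER_BURST
    bursts = [
        # Assumption: little-endian unsigned 64-bit words.
        struct.pack("<16Q", *events[i:i + EVENTS_PER_BURST])
        for i in range(0, full, EVENTS_PER_BURST)
    ]
    return bursts, events[full:]


def bursts_allowed(utilization, threshold, pending_bursts):
    """Hypothetical throttle: hold pending bursts in the queue while the
    interconnect utilization is above the saturation threshold."""
    return 0 if utilization > threshold else pending_bursts
```

A real design could lower the write rate gradually instead of pausing it; the binary decision here only keeps the sketch short.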
[0037] This embodiment describes how the DSA instruction configuration is analyzed to optimize trace acquisition and scheduling resources. The collaborative optimization of instruction scheduling and tracing based on hardware execution unit configuration includes the following steps. (1) Modeling the number of configuration instructions and the minimum startup cycle by type: for the heterogeneous asynchronous execution units, such as vector computing units (VE), data transfer units (TE / TEC), and matrix computing units (ME), the number of configuration instructions each type of unit requires to complete its operation is tallied (e.g., 8 instructions for VE, 10 for TE / TEC, and 15 for ME). Combined with the chip's dual-issue hardware structure (i.e., two instructions can be issued at a time), the minimum startup cycle and maximum issue rate of each instruction type are determined (e.g., VE can start once every 4 clock cycles, TE / TEC once every 5 clock cycles, and ME once every 8 clock cycles). The parallelism of the hardware execution units is then determined from their number and architectural characteristics. (2) Bandwidth usage modeling and allocation for trace event acquisition: saving one trace event is specified to take two clock cycles. On this basis, writing trace events at maximum concurrency is calculated to consume 50% of the bus bandwidth, and the bandwidth quotas and write windows for trace data are designed accordingly, ensuring real-time trace acquisition while minimizing contention with the main service path and thus reducing operational risk. At the same time, the maximum instruction issue rate for the entire DSA system is constrained to one instruction every four clock cycles. This makes full use of the maximum hardware bandwidth and sets clear boundary conditions for trace acquisition and performance statistics, providing hardware-level support for simplified system scheduling and bandwidth safety. (3) Design of heterogeneous parallel unit scale and maximum instruction parallelism: the number of each type of asynchronous unit supported by the system is specified (e.g., 1 matrix computation unit ME, 2 vector computation units VE, 12 data transfer units TE, 2 data transfer units TEC, and 1 synchronization unit fence.em). Combined with the system's scheduling mechanism and the execution units' state machines, the number of trace channels, the local cache capacity, and the upper limit of trace event write bandwidth are determined from the maximum instruction issue rate and the hardware execution unit parallelism (e.g., the system must contain at least 1 ME, 2 VE, 4 TE, 2 TEC, and 1 synchronization unit fence). This design supports the simultaneous execution of up to 4 TE instructions and 2 TEC instructions. Combined with a dual vector processing unit (VPU) architecture, system parallelism can be adjusted dynamically, improving both instruction execution throughput and trace acquisition efficiency. Furthermore, hardware and software cooperate to resolve resource contention and data bandwidth allocation in high-concurrency tracing, ensuring stable system operation.
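The example figures in this paragraph can be reproduced with a small calculation. The sketch assumes that the minimum startup cycle is the configuration instruction count divided by the issue width and rounded up, which matches the quoted values of 4, 5, and 8 cycles. It also assumes that one trace event is written per issued instruction at peak rate, which is one reading consistent with the stated 50% figure (2 cycles per event against one instruction every 4 cycles). Both assumptions, and the function names, are illustrative rather than taken from the source.

```python
import math

ISSUE_WIDTH = 2  # dual-issue hardware structure

# Example configuration instruction counts from the description.
CONFIG_INSTRUCTIONS = {"VE": 8, "TE/TEC": 10, "ME": 15}

CYCLES_PER_EVENT = 2     # saving one trace event takes two clock cycles
MAX_ISSUE_INTERVAL = 4   # at most one DSA instruction every four clock cycles


def min_startup_cycles(config_instructions, issue_width=ISSUE_WIDTH):
    """Cycles needed to issue all configuration instructions of one unit.

    Assumption: instructions issue back to back at full issue width.
    """
    return math.ceil(config_instructions / issue_width)


def peak_trace_bandwidth_share(events_per_instruction=1):
    """Fraction of bus cycles spent on trace writes at the peak issue rate.

    Assumption: one trace event per issued instruction, for example when
    consecutive end and start events are merged into restart events.
    """
    return events_per_instruction * CYCLES_PER_EVENT / MAX_ISSUE_INTERVAL
```

Under these assumptions, `min_startup_cycles` gives 4, 5, and 8 cycles for VE, TE / TEC, and ME, and `peak_trace_bandwidth_share()` gives 0.5.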
[0038] While some embodiments of the present invention have been described in this application, those skilled in the art will understand that these embodiments are merely illustrative. Numerous variations, alternatives, and improvements will occur to those skilled in the art under the teachings of this invention without departing from its scope. The appended claims are intended to define the scope of the invention and thereby cover methods and structures within the scope of the claims themselves and their equivalents.
Claims
1. A method for DSA-oriented instruction-level performance analysis, characterized in that it includes the following steps: tracing the instruction execution process of the DSA, and encoding and recording trace events to obtain trace data for the entire instruction lifecycle; and optimizing instruction scheduling and trace acquisition in a coordinated manner based on the hardware execution unit configuration, so that the encoding and recording of trace events does not interfere with the normal operation of the system.
2. The DSA-oriented instruction-level performance analysis method of claim 1, wherein the trace event encoding and recording includes the following steps: adopting an extensible trace event data encoding format in which event type codes identify key behaviors in the DSA instruction lifecycle, the key behaviors including the start, end, and restart states of a DSA instruction, with each event corresponding to a unique state; assigning a timestamp to each DSA trace event to achieve nanosecond-level timing traceability, supporting cross-cycle tracing of up to 64 seconds; forming the instruction identifier by combining the instruction type identifier, the execution unit identifier, and the high and low bits of the instruction address; and, when an instruction end event and an instruction start event generated by the same execution unit are received within the same acquisition cycle and their instruction type identifier and high-order instruction address field are the same, compressing and encoding the instruction end event and the instruction start event into an instruction restart event; and / or the event type code is 3 bits and the timestamp is 36 bits.
3. The DSA-oriented instruction-level performance analysis method of claim 2, wherein the instruction address is segmented into high and low bits: the start and restart events both carry the low 16 bits of the instruction address, while the end event carries the high 16 bits of the instruction address; and the high 16 bits and the low 16 bits of the instruction address are combined to form a 32-bit instruction address.
4. The DSA-oriented instruction-level performance analysis method of claim 2, wherein the method also includes: allocating storage resources for the trace data and controlling bus bandwidth to reduce storage usage and bus bandwidth consumption.
5. The DSA-oriented instruction-level performance analysis method according to claim 4, characterized in that the storage resource allocation and bus bandwidth management include the following steps: allocating local storage resources quantitatively based on the type of artificial intelligence model, so as to achieve per-model storage quota management; calculating the actual storage bandwidth utilization of the AI network, allocating bandwidth quotas for trace event writes on that basis, and setting and dynamically scheduling bandwidth limits, while also assessing the flow control status of the on-chip network bus, preferentially storing trace data locally in the on-chip cache, and writing to peripherals via the on-chip network only when data resources exceed their limits or the on-chip cache overflows; and transmitting trace events in batches using data packing alignment and burst writes, and reducing the trace event write rate or triggering queue buffering when the bandwidth utilization of the on-chip interconnect network exceeds a preset saturation threshold.
6. The DSA-oriented instruction-level performance analysis method of claim 5, wherein the quantitative allocation of local storage resources based on the type of artificial intelligence model includes: quantifying the required local storage resources based on the actual rate at which the artificial intelligence model type generates trace events and on the number of chip cores, prioritizing the allocation of storage space according to the different artificial intelligence models, and setting usage limits to avoid resource waste and increased costs due to excessive trace data.
7. The DSA-oriented instruction-level performance analysis method of claim 1, wherein the hardware execution unit includes a vector computing unit, a data transfer unit, and a matrix computing unit; and the coordinated optimization of instruction scheduling and trace acquisition based on hardware execution unit configuration includes the following steps: for the vector computing units, data transfer units, and matrix computing units, tallying the number of configuration instructions each type of unit requires to complete its operation, determining the minimum startup cycle and maximum issue rate of each instruction type in combination with the dual-issue hardware structure of the chip, and determining the parallelism of the hardware execution units based on their number and architectural characteristics; specifying that saving each trace event requires 2 clock cycles, calculating on this basis that writing trace events at maximum concurrency uses 50% of the bus bandwidth, designing the bandwidth quota and write window for trace data accordingly, and defining the maximum instruction issue rate for the entire DSA system; and, by combining the hardware execution unit configuration, the system scheduling mechanism, and the execution unit state machine, determining the number of trace channels, the local cache capacity, and the upper limit of trace event write bandwidth based on the maximum instruction issue rate and the parallelism of the hardware execution units.
8. The DSA-oriented instruction-level performance analysis method according to claim 7, characterized in that the maximum instruction issue rate constraint is one instruction issued every four clock cycles.
9. A DSA-oriented instruction-level performance analysis system, characterized in that it includes: a trace encoding module configured to implement trace event encoding, timestamping, and segmented recording of instruction addresses; a storage bandwidth management module configured to perform storage quota allocation, cache-first storage, bandwidth quota management, and burst writes; and a scheduling and coordination optimization module configured to establish a hardware execution unit cycle model, set the instruction issue rate, and determine the maximum number of concurrent instructions, the number of trace channels, and the local cache capacity.
10. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.