Methods, apparatus, media, and products for multi-engine asynchronous parallel computing architecture

The actual execution time of a target program segment on a multi-engine asynchronous parallel computing architecture is determined by analyzing the times at which the asynchronous engines send notifications. This avoids the loss of parallelism caused by the instrumentation measurement methods of existing technologies and provides an accurate basis for performance analysis and optimization.

CN121880038B · Active · Publication Date: 2026-06-19 · MOXIN ARTIFICIAL INTELLIGENCE TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
MOXIN ARTIFICIAL INTELLIGENCE TECH (SHENZHEN) CO LTD
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In multi-engine asynchronous parallel computing architectures, existing technologies use instrumentation measurement methods to determine execution time, but inserting waiting operations disrupts the parallelism and execution order of asynchronous engines, leading to inaccurate performance analysis.

Method used

By analyzing the timing of notifications sent by the asynchronous engine to the control core, the actual execution time of the target program segment can be determined without inserting additional waiting operation instructions, thus maintaining the parallel execution order of the asynchronous engine.

Benefits of technology

It enables accurate performance analysis of multi-engine asynchronous parallel computing architecture, providing a reliable data foundation and precise basis for subsequent optimization.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method, apparatus, medium, and product for a multi-engine asynchronous parallel computing architecture. The method includes: acquiring interval marker pairs, which indicate the start and end of a target program segment in a preset program; in response to the interval marker pairs, determining the boundary notifications corresponding to the target program segment from among the multiple notifications sent to the control core when each asynchronous engine related to the target program segment performs an operation; and determining the actual execution time of the target program segment based on the sending times of all boundary notifications sent by all asynchronous engines related to the target program segment.

Description

Technical Field

[0001] This invention relates to the field of computers, and more particularly to a method, apparatus, medium, and product for a multi-engine asynchronous parallel computing architecture.

Background Technology

[0002] With the development of heterogeneous computing architectures, single processing cores have gradually evolved into complex systems comprising multiple asynchronous engines that can execute in parallel. In such architectures, each asynchronous engine typically executes in parallel through its own independent instruction buffer and expresses execution dependencies through event synchronization mechanisms. While this mechanism fully leverages the hardware's parallel capabilities, it also significantly increases the complexity of runtime behavior, making performance analysis of runtime behavior much more difficult.

Summary of the Invention

[0003] According to an embodiment of the present invention, a method for a multi-engine asynchronous parallel computing architecture includes a control core and multiple asynchronous engines. The control core issues operation instructions to at least one of the multiple asynchronous engines based on a preset program, causing the at least one asynchronous engine to perform an operation. The method includes: acquiring interval marker pairs, the interval marker pairs indicating the start and end of a target program segment in the preset program; in response to the interval marker pairs, determining a boundary notification corresponding to the target program segment from multiple notifications sent to the control core when each asynchronous engine related to the target program segment performs an operation, the boundary notification including a start notification sent when the asynchronous engine starts executing the operation corresponding to the target program segment and an end notification sent when the asynchronous engine finishes executing the operation corresponding to the target program segment; and determining the actual execution time of the target program segment based on the sending times of all boundary notifications sent by all asynchronous engines related to the target program segment.

[0004] An apparatus for a multi-engine asynchronous parallel computing architecture according to an embodiment of the present invention includes: a processor; and a memory storing computer-executable instructions thereon, wherein the computer-executable instructions, when executed by the processor, cause the processor to perform the method described above.

[0005] According to an embodiment of the present invention, a computer-readable storage medium stores computer-executable instructions thereon, wherein, when executed by a processor, the computer-executable instructions cause the processor to perform the method described above.

[0006] A computer program product according to an embodiment of the present invention includes computer-executable instructions, wherein, when executed by a processor, these computer-executable instructions cause the processor to perform the method described above.

[0007] The method for a multi-engine asynchronous parallel computing architecture according to embodiments of the present invention is non-intrusive. It does not insert additional waiting operation instructions into the preset program. Instead, it determines the actual execution time of the target program segment by analyzing the sending time of notifications sent by each asynchronous engine to the control core. Because no additional waiting operation instructions are inserted, subsequent instruction issuance is not blocked, and the execution order of asynchronous engines is not changed, resulting in a more realistic and accurate actual execution time.

Attached Figure Description

[0008] The invention can be better understood from the following description of specific embodiments of the invention in conjunction with the accompanying drawings, wherein:

[0009] Figure 1 shows a schematic block diagram of an example of a multi-engine asynchronous parallel computing architecture.

[0010] Figure 2 shows a schematic flowchart of the existing instrumentation measurement method.

[0011] Figure 3 shows a schematic flowchart of a method for a multi-engine asynchronous parallel computing architecture according to an embodiment of the present invention.

[0012] Figure 4 shows a schematic flowchart of a specific example of a method for a multi-engine asynchronous parallel computing architecture according to an embodiment of the present invention.

[0013] Figure 5 shows a schematic diagram of a computer system that can implement the method and apparatus for a multi-engine asynchronous parallel computing architecture according to embodiments of the present invention.

Detailed Implementation

[0014] The features and exemplary embodiments of various aspects of the present invention will now be described in detail. Numerous specific details are set forth in the following detailed description to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention may be practiced without requiring some of these specific details. The following description of embodiments is merely intended to provide a better understanding of the invention by illustrating examples of the invention. The invention is by no means limited to any specific configurations and algorithms presented below, but covers any modifications, substitutions, and improvements to elements, components, and algorithms without departing from the spirit of the invention. Well-known structures and techniques are not shown in the drawings and the following description in order to avoid unnecessarily obscuring the invention.

[0015] With the development of heterogeneous computing architectures, a single processing core has gradually evolved into a complex system that includes multiple asynchronous engines that can execute in parallel, which can also be called a multi-engine asynchronous parallel computing architecture. Figure 1 shows a schematic block diagram of an example of such an architecture. As shown in Figure 1, the multi-engine asynchronous parallel computing architecture includes: a control core such as an RVV (RISC-V Vector Extension) control core; multiple asynchronous engines including, but not limited to, DMA (Direct Memory Access), SPU (Sparse Processing Unit), SE (Sort Engine), TE (Transpose Engine), and VPU (Vector Processing Unit); a GLB (Global Buffer) implemented in SRAM (Static Random-Access Memory); and external memory implemented in DDR (Double Data Rate Synchronous Dynamic Random-Access Memory). The DMA engine transfers data between the DDR and the GLB, while the other asynchronous engines access only the GLB. Each asynchronous engine has an independent instruction buffer, enabling parallel operation without shared execution resources.

[0016] In this architecture, all instructions for the asynchronous engines are issued by the control core. The control core schedules the execution of the asynchronous engines by writing operation instructions to the instruction buffer of each asynchronous engine. Each asynchronous engine supports two types of synchronization instructions. One type is a synchronous event (sync_event), which the current asynchronous engine sends to a target asynchronous engine or to the control core; the other type is a synchronous request (sync_request), which the current engine uses to wait for a synchronous event issued by a target engine. In addition, when an asynchronous engine executes a specific operation instruction, it can send a synchronous interrupt notification to the control core. The control core can identify the engine that sent the synchronous interrupt notification and can also enter a waiting state through low-power waiting instructions (such as WFI).

[0017] In this type of architecture, each asynchronous engine typically executes in parallel through its own independent instruction buffer and expresses execution dependencies through event synchronization mechanisms. While this mechanism fully leverages the hardware's parallel capabilities, it significantly increases the complexity of runtime behavior, thus making performance analysis of runtime behavior even more urgent. Performance analysis here includes observing the actual execution of the overall architecture and performing analysis based on that actual execution.

[0018] However, as the complexity of runtime behavior increases, the difficulty of performance analysis also rises, especially in observing actual execution. Currently, the actual execution time of the asynchronous engines is typically determined using instrumentation measurement methods. Figure 2 shows a schematic flowchart of a prior-art instrumentation measurement method. As shown in Figure 2, during the execution of the preset program by the control core, when the actual execution progress of the control core reaches the marker indicating the start of the target program segment, the control core actively inserts a synchronous wait operation; after waiting until all related asynchronous engines are idle, the control core begins to execute the target program segment and records the start time; when the actual execution progress of the control core reaches the marker indicating the end of the target program segment, the control core inserts a synchronous wait operation again; after waiting for all engines related to the target program segment to finish executing, the end time is recorded. Although this method formally obtains an execution time for the target program segment, the explicitly inserted wait operations disrupt the original instruction execution sequence of the asynchronous engines corresponding to the preset program. The parallelism and global timeline of the asynchronous engines are disturbed, which changes the execution time, scheduling order, and resource utilization of the target program segment, so the measurement fails to reflect the actual running behavior of the asynchronous engines during the execution of the preset program.
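A toy timing model, which is not taken from the patent, can make this distortion concrete. The sketch below assumes that each engine runs its own work back to back, ignores inter-engine dependencies, and uses made-up durations in timer ticks; the names `Workload`, `undisturbed_span`, and `instrumented_span` are hypothetical.

```python
from typing import Dict, Tuple

# engine -> (busy ticks before the target segment, ticks spent on the segment)
Workload = Dict[str, Tuple[int, int]]


def undisturbed_span(workload: Workload) -> Tuple[int, int]:
    """Span of the segment when every engine simply runs its queue in order."""
    start = min(before for before, _ in workload.values())
    end = max(before + segment for before, segment in workload.values())
    return start, end


def instrumented_span(workload: Workload) -> Tuple[int, int]:
    """Span when a wait is inserted until all engines are idle before the segment."""
    barrier = max(before for before, _ in workload.values())
    end = max(barrier + segment for _, segment in workload.values())
    return barrier, end


# Made-up numbers: the DMA engine reaches the segment early, the VPU late.
workload: Workload = {"dma": (50, 100), "vpu": (120, 80)}

real = undisturbed_span(workload)       # (50, 200): 150 ticks
measured = instrumented_span(workload)  # (120, 220): 100 ticks
```

In this model the inserted wait delays the DMA engine until the VPU is idle, so the measured span is both shifted and shorter than the span the program would actually occupy.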

[0019] The method for a multi-engine asynchronous parallel computing architecture according to embodiments of the present invention is non-intrusive. It does not insert additional waiting operation instructions into the preset program. Instead, it determines the actual execution time of the target program segment by analyzing the sending time of notifications sent by each asynchronous engine to the control core. Because no additional waiting operation instructions are inserted, subsequent instruction issuance is not blocked, and the execution order of asynchronous engines is not changed, resulting in a more realistic and accurate actual execution time.

[0020] Figure 3 shows a schematic flowchart of a method for a multi-engine asynchronous parallel computing architecture according to an embodiment of the present invention. This method can be implemented by a control core or by an independent module that interacts with the control core. As shown in Figure 3, the method 300 for a multi-engine asynchronous parallel computing architecture includes steps S301-S303.

[0021] As shown in Figure 3, step S301 includes obtaining interval marker pairs. Here, interval marker pairs are used to indicate the start and end of a target program segment in a preset program. The preset program is a program to be executed by a multi-engine asynchronous parallel computing architecture. In this architecture, the control core sends operation instructions to the relevant asynchronous engines based on the preset program, causing those engines to perform operations. The target program segment is a continuous program segment in the preset program whose execution time is to be measured, and may contain execution instructions corresponding to different asynchronous engines. An interval marker pair may include an associated start marker `prof_begin` and end marker `prof_end`, where the start marker `prof_begin` indicates the start position of the target program segment, and the end marker `prof_end` indicates the end position of the target program segment. The execution entity implementing this embodiment of the invention can determine the position of the target program segment in the preset program based on the interval marker pairs.
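As a minimal sketch of step S301, the preset program can be modeled as a list of instruction strings in which `prof_begin` and `prof_end` appear as ordinary entries. The `<engine>.<operation>` instruction names and the helper `find_target_segment` are hypothetical and serve only to illustrate how a marker pair locates the target program segment and the engines it involves.

```python
from typing import List, Tuple


def find_target_segment(program: List[str],
                        begin: str = "prof_begin",
                        end: str = "prof_end") -> Tuple[int, int]:
    """Return the indices of the first and last instruction between the markers."""
    begin_idx = program.index(begin)            # ValueError if the marker is missing
    end_idx = program.index(end, begin_idx + 1)
    return begin_idx + 1, end_idx - 1


# Hypothetical preset program; the instruction names are invented for illustration.
program = [
    "dma.load_a",
    "prof_begin",
    "dma.load_b",
    "vpu.add",
    "te.transpose",
    "prof_end",
    "dma.store",
]

first, last = find_target_segment(program)
segment = program[first:last + 1]
# Engines related to the target program segment, taken from the name prefix.
engines = sorted({instr.split(".")[0] for instr in segment})
```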

[0022] As shown in Figure 3, step S302 includes, in response to the interval marker pair, determining the boundary notifications corresponding to the target program segment from among the multiple notifications sent to the control core when each asynchronous engine related to the target program segment performs an operation. The boundary notifications include a start notification, sent when the asynchronous engine begins executing the operation corresponding to the target program segment, and an end notification, sent when the asynchronous engine completes the operation corresponding to the target program segment. In a multi-engine asynchronous parallel computing architecture, the control core issues operation instructions to the relevant asynchronous engines based on a preset program, and the asynchronous engines perform the related operations, which include sending notifications to the control core. The method according to an embodiment of the present invention uses this behavior to identify, among all notifications sent to the control core, the boundary notifications that delimit the target program segment.

[0023] As shown in Figure 3, step S303 includes determining the actual execution period of the target program segment based on the sending times of all boundary notifications sent by all asynchronous engines related to the target program segment. Since the asynchronous engines related to the target program segment execute the operations corresponding to the target program segment in parallel, the actual execution start point of the target program segment is the earliest time at which any of these engines begins executing its operation, and the actual execution end point is the latest time at which any of these engines finishes executing its operation. In some embodiments, determining the actual execution period of the target program segment may include: determining the earliest sending time of all start notifications as the start time of the actual execution period, and the latest sending time of all end notifications as the end time of the actual execution period. Here, the sending times of the notifications are typically recorded by the control core or by a high-precision time synchronization module (such as a timer). Since the asynchronous engines execute in parallel, taking the "earliest start point" and the "latest end point" covers the entire span of the target program segment on the physical timeline and reflects the total time the asynchronous engines spend collaboratively completing the computational task of this segment.
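The "earliest start, latest end" rule of step S303 can be sketched as follows. The record type `BoundaryNotification`, the function `actual_execution_period`, and the tick values are illustrative assumptions; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class BoundaryNotification:
    engine: str    # engine that sent the notification
    kind: str      # "start" or "end"
    sent_at: int   # sending time in timer ticks


def actual_execution_period(
        notifications: Iterable[BoundaryNotification]) -> Tuple[int, int]:
    """Earliest start notification and latest end notification across all engines."""
    notes = list(notifications)
    starts = [n.sent_at for n in notes if n.kind == "start"]
    ends = [n.sent_at for n in notes if n.kind == "end"]
    if not starts or not ends:
        raise ValueError("need at least one start and one end notification")
    return min(starts), max(ends)


# Made-up sending times for three engines working on the same segment.
observed = [
    BoundaryNotification("dma", "start", 100),
    BoundaryNotification("vpu", "start", 120),
    BoundaryNotification("te", "start", 150),
    BoundaryNotification("dma", "end", 180),
    BoundaryNotification("te", "end", 240),
    BoundaryNotification("vpu", "end", 260),
]

period = actual_execution_period(observed)   # (100, 260)
duration = period[1] - period[0]             # 160 ticks
```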

[0024] In the method according to embodiments of the present invention, each asynchronous engine performs the following actions: sending a start notification to the control core when it starts to execute the operation corresponding to the target program segment, and sending an end notification to the control core when it completes that operation. The instructions for sending the start notification and the end notification are typically placed at the beginning and end of the target program segment. In some embodiments, the target program segment may include: a start notification instruction, used to cause each asynchronous engine associated with the target program segment to send a start notification when starting to execute the operation corresponding to the target program segment; and an end notification instruction, used to cause each asynchronous engine associated with the target program segment to send an end notification when completing the operation corresponding to the target program segment. The start notification instruction and the end notification instruction may be instructions that already exist in the target program segment. The notifications sent by such instructions have their own function or purpose, but in this embodiment they can also serve as boundary notifications. The start notification instruction and the end notification instruction may also be additional instructions added for the purposes of this embodiment. They can be manually inserted into the target program segment by developers when writing the preset program, automatically inserted by analysis tools according to performance analysis requirements, or inserted into the target program segment by the execution entity of the method of this embodiment after obtaining the interval marker pairs.

[0025] In the method according to an embodiment of the present invention, in the multi-engine asynchronous parallel computing architecture, during the execution process, asynchronous engines may frequently send various types of notifications to the control core, such as status reports, synchronization events, interrupt notifications, start notifications, and end notifications, etc. Identifying the boundary notifications corresponding to the target program segment for each asynchronous engine among this series of notifications is a key step in the embodiment of the present invention. In terms of specific identification methods, the present invention provides multiple embodiments to cope with different hardware supports and application scenarios.

[0026] In some embodiments, determining the boundary notification may include: determining the boundary notification sent by the asynchronous engine based on the expected sending order of notifications when each asynchronous engine performs an operation, where the expected sending order is determined by the preset program. Since the instruction sequence in the preset program is fixed, the order in which each asynchronous engine sends notifications to the control core during execution is also fixed and known in advance. For example, for a given target program segment, it can be known in advance that a specific asynchronous engine participating in this target program segment will send its Nth notification when it starts executing the segment and its Mth notification when it finishes executing the segment, corresponding to the start notification and the end notification respectively, where M and N are positive integers and N < M. Therefore, the sequence numbers of the notifications actually sent by each asynchronous engine are sufficient to identify the corresponding boundary notifications. This method places no additional requirements on the notification content itself and is applicable to hardware platforms with fixed notification formats.
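A minimal sketch of this order-based identification, assuming the sending times of each engine's notifications have been logged in order. The function name `boundaries_by_order`, the log layout, and the numbers are hypothetical; N and M are 1-based, as in the paragraph above.

```python
from typing import Dict, List, Tuple


def boundaries_by_order(sent_times: List[int], n: int, m: int) -> Tuple[int, int]:
    """Pick the Nth and Mth notifications of one engine as start and end."""
    if not 1 <= n < m <= len(sent_times):
        raise ValueError("expected 1 <= N < M <= number of notifications")
    return sent_times[n - 1], sent_times[m - 1]


# Made-up per-engine logs of notification sending times, in sending order.
log: Dict[str, List[int]] = {
    "dma": [10, 40, 95, 130, 170],
    "vpu": [55, 120],
}
# Expected (N, M) per engine, assumed to be derived from the preset program.
expected: Dict[str, Tuple[int, int]] = {"dma": (2, 4), "vpu": (1, 2)}

boundaries = {
    engine: boundaries_by_order(log[engine], *expected[engine])
    for engine in log
}
# boundaries == {"dma": (40, 130), "vpu": (55, 120)}
```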

[0027] In some embodiments, determining the boundary notification may include: determining whether a notification is a boundary notification by checking whether the identifier it carries is the preset identifier corresponding to a boundary notification. Here, the start notification instruction and the end notification instruction are configured such that, when the asynchronous engine executes them, the generated notification carries a preset identifier, such as a specific numerical value, string, or flag bit. According to the method of the embodiment of the present invention, checking the identifier of each received notification directly determines whether it is a boundary notification. This approach is more direct, is unaffected by changes in notification order caused by changes in the program execution path or by hardware anomalies, and is therefore more robust.
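The identifier-based variant might look like the sketch below. The identifier values `START_ID` and `END_ID`, the `Notification` record, and the sample stream are assumptions made for illustration; real hardware would define its own encoding.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

START_ID = 0xA1   # hypothetical preset identifier for start notifications
END_ID = 0xA2     # hypothetical preset identifier for end notifications


@dataclass(frozen=True)
class Notification:
    engine: str
    ident: int     # identifier carried by the notification
    sent_at: int   # sending time in timer ticks


def boundary_kind(note: Notification) -> Optional[str]:
    """Classify a notification by its identifier; None means 'not a boundary'."""
    if note.ident == START_ID:
        return "start"
    if note.ident == END_ID:
        return "end"
    return None


# Made-up stream mixing ordinary notifications with boundary notifications.
stream = [
    Notification("dma", 0x01, 10),
    Notification("dma", START_ID, 40),
    Notification("vpu", START_ID, 55),
    Notification("dma", 0x02, 70),
    Notification("dma", END_ID, 90),
    Notification("vpu", END_ID, 120),
]

boundaries: List[Tuple[str, str, int]] = [
    (n.engine, kind, n.sent_at)
    for n in stream
    if (kind := boundary_kind(n)) is not None
]
```

Because classification depends only on the identifier, the result does not change if ordinary notifications are added, dropped, or reordered.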

[0028] Here, the action of determining boundary notifications or the action of determining the actual execution period can occur after the preset program or the target program segment has finished executing, that is, after the boundary notifications have been sent to the control core. In that case, the boundary notifications of each asynchronous engine and the final execution period are determined from all the notifications received. Alternatively, these actions can occur during the execution of the preset program. As notifications are continuously sent to the control core, the notification stream is analyzed in real time to determine whether each notification currently received by the control core is a boundary notification, and the actual execution period is determined as the boundary notifications are identified. For example, in some embodiments, determining the boundary notifications sent by an asynchronous engine based on the expected sending order of notifications when each asynchronous engine performs an operation may include: during the execution of each asynchronous engine's operations, monitoring the actual sending sequence number of each notification sent by the asynchronous engine against the expected sending sequence numbers of the start and end notifications in the expected sending order, and thereby determining the start and end notifications sent by the asynchronous engine.

[0029] Figure 4 shows a schematic flowchart of a specific example of a method for a multi-engine asynchronous parallel computing architecture according to an embodiment of the present invention. As shown in Figure 4, the control core serves as the execution entity in this embodiment: during the execution of the preset program, it determines the boundary notifications and, from them, the actual execution period. The process includes steps S401-S407. S401 includes, when the start notification instruction is executed, inserting into each asynchronous engine associated with the target program segment an instruction to send a notification to the control core. This instruction only sends the notification and does not cause the engine to wait. S402 includes taking a snapshot of the counter corresponding to each asynchronous engine associated with the target program segment and saving it as a start trigger count set. S403 includes continuing to issue the target program segment and using each counter to count, in real time, the notifications sent to the control core by the corresponding asynchronous engine (the count value corresponds to the actual sending sequence number of the latest notification). S404 includes determining that a notification is a start notification when the count value of any counter reaches the expected sending sequence number of the corresponding start notification, and recording the sending time of that notification as the start time of the target program segment. S405 includes, when the end notification instruction is executed, inserting into each asynchronous engine associated with the target program segment an instruction to send a notification to the control core. S406 includes taking a snapshot of each counter and saving it as an end trigger count set. S407 includes determining each end notification when the count values of all the above counters reach the expected sending sequence numbers of the corresponding end notifications, and recording the sending time of the last end notification as the end time of the target program segment. The start trigger count set and the end trigger count set provide a technical baseline for the notifications sent to the control core when the target program segment is entered or exited. Through counters and snapshots, the start and end times of each engine's execution of the target program segment, as well as the actual execution time of the target program segment, can be captured dynamically without interfering with the parallel execution of the asynchronous engines. The advantage of this embodiment is that it inserts only notification instructions, without any blocking or waiting instructions, so the original parallelism and scheduling order are preserved and non-intrusive real-time performance monitoring is achieved.
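The counter-and-snapshot flow of steps S402-S407 can be approximated in software as below. This is a sketch under several assumptions: the class `SegmentMonitor` and its method names are hypothetical, the expected sequence numbers are supplied as absolute per-engine counts derived from the preset program, notifications are delivered in sending-time order, and the snapshots are only stored as reference sets because the text does not spell out how they are consumed.

```python
from typing import Dict, Optional


class SegmentMonitor:
    """Toy control-core-side monitor for one target program segment."""

    def __init__(self, expected_start: Dict[str, int],
                 expected_end: Dict[str, int]) -> None:
        self.expected_start = expected_start
        self.expected_end = expected_end
        self.counters = {engine: 0 for engine in expected_start}
        self.start_snapshot: Optional[Dict[str, int]] = None
        self.end_snapshot: Optional[Dict[str, int]] = None
        self.start_time: Optional[int] = None
        self.end_time: Optional[int] = None
        self._end_times: Dict[str, int] = {}

    def on_begin_marker(self) -> None:
        # S402: save the current counters as the start trigger count set.
        self.start_snapshot = dict(self.counters)

    def on_end_marker(self) -> None:
        # S406: save the current counters as the end trigger count set.
        self.end_snapshot = dict(self.counters)

    def on_notification(self, engine: str, sent_at: int) -> None:
        # S403: count every notification received from this engine.
        self.counters[engine] += 1
        count = self.counters[engine]
        # S404: the first counter to reach its expected start number sets the start.
        if count == self.expected_start[engine] and self.start_time is None:
            self.start_time = sent_at
        # S407: once every engine has sent its end notification, keep the latest.
        if count == self.expected_end[engine]:
            self._end_times[engine] = sent_at
            if len(self._end_times) == len(self.expected_end):
                self.end_time = max(self._end_times.values())


monitor = SegmentMonitor(expected_start={"dma": 2, "vpu": 1},
                         expected_end={"dma": 3, "vpu": 2})
monitor.on_notification("dma", 10)    # notification from before the segment
monitor.on_begin_marker()
monitor.on_notification("dma", 40)    # DMA start notification
monitor.on_notification("vpu", 55)    # VPU start notification
monitor.on_end_marker()
monitor.on_notification("dma", 90)    # DMA end notification
monitor.on_notification("vpu", 120)   # VPU end notification
```

Nothing in the monitor blocks or delays an engine; it only observes notifications that the engines send anyway.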

[0030] Compared with the existing instrumentation measurement method shown in Figure 2, the embodiments of the present invention do not insert into the preset program any operation instructions that would change the execution order or introduce additional waiting, and are therefore non-intrusive. Although the target program segment includes start notification instructions and end notification instructions, these merely use the inherent notification mechanism of the asynchronous engines; they do not block the issuance of other instructions, nor do they force the control core or any asynchronous engine into an idle waiting state. Therefore, the actual execution time obtained by the embodiments of the present invention reflects the performance of the program in a real running environment, providing a reliable data foundation for subsequent performance analysis and optimization.

[0031] In some embodiments, the number of interval marker pairs can be one or more. When multiple interval marker pairs exist, each pair may include a start marker prof[id]_begin and an end marker prof[id]_end, and different interval marker pairs can be distinguished by different id values. Typically, the two target program segments indicated by any two interval marker pairs are either nested (i.e., one program segment is completely contained inside the other) or completely separate (i.e., the two program segments do not overlap in the logical execution flow). This design allows developers to perform layered or segmented fine-grained performance profiling of preset programs. For example, it allows measuring the overall execution time of a large functional module while simultaneously measuring the execution time of key components within that module.
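The nested-or-separate constraint can be checked with a stack, as in the sketch below. The concrete spelling `prof<id>_begin` / `prof<id>_end` with a numeric id, the helper `collect_intervals`, and the sample program are assumptions for illustration.

```python
import re
from typing import Dict, List, Tuple

# Assumed concrete marker spelling: prof0_begin, prof0_end, prof1_begin, ...
MARKER = re.compile(r"prof(\d+)_(begin|end)$")


def collect_intervals(program: List[str]) -> Dict[int, Tuple[int, int]]:
    """Map each marker id to (begin index, end index); reject partial overlap."""
    open_stack: List[Tuple[int, int]] = []   # (id, index of the begin marker)
    intervals: Dict[int, Tuple[int, int]] = {}
    for idx, instr in enumerate(program):
        match = MARKER.match(instr)
        if match is None:
            continue
        marker_id, kind = int(match.group(1)), match.group(2)
        if kind == "begin":
            open_stack.append((marker_id, idx))
            continue
        # An end marker must close the innermost open segment.
        if not open_stack or open_stack[-1][0] != marker_id:
            raise ValueError(f"prof{marker_id}_end does not close the innermost segment")
        _, begin_idx = open_stack.pop()
        intervals[marker_id] = (begin_idx, idx)
    if open_stack:
        raise ValueError("unclosed interval marker")
    return intervals


# Segment 1 is nested inside segment 0; instruction names are invented.
program = [
    "prof0_begin", "dma.load",
    "prof1_begin", "vpu.add", "prof1_end",
    "dma.store", "prof0_end",
]
intervals = collect_intervals(program)   # {1: (2, 4), 0: (0, 6)}
```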

[0032] After obtaining the precise actual execution time period and the specific time of each boundary notification, embodiments of the present invention can also provide intuitive performance analysis output. In some embodiments, the method may further include: constructing a visual Gantt chart that displays the execution sequence of the asynchronous engines related to the target program segment, based on the actual execution time period and the sending times of all boundary notifications sent by those engines. This Gantt chart can clearly show which asynchronous engines start working and when, how long they work, the degree of parallelism between asynchronous engines, and whether there are mutual waiting dependencies between them (e.g., whether one engine can start executing an instruction only after another engine has finished executing another instruction). Visualized Gantt charts can improve the efficiency of performance analysis and optimization, and provide a precise basis for tuning scheduling strategies and the execution pipeline of the asynchronous engines. By observing such visualizations, developers and system architects can quickly locate scheduling bottlenecks, resource idle periods, or unreasonable execution dependencies in the system, and then optimize task partitioning, instruction scheduling strategies, or hardware resource configuration in a targeted manner.
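As one possible rendering, the sketch below draws a plain-text Gantt chart from per-engine (start, end) sending times. The function `render_gantt`, the tick resolution, and the sample spans are illustrative assumptions; a real tool would more likely use a plotting library.

```python
from typing import Dict, Tuple


def render_gantt(spans: Dict[str, Tuple[int, int]], tick: int = 10) -> str:
    """Draw one row per engine; '#' marks busy time, '.' marks idle time."""
    origin = min(start for start, _ in spans.values())
    finish = max(end for _, end in spans.values())
    width = (finish - origin + tick - 1) // tick
    label_width = max(len(name) for name in spans)
    rows = []
    for name, (start, end) in spans.items():
        lead = (start - origin) // tick        # idle cells before the engine starts
        busy = max(1, (end - start) // tick)   # at least one cell per engine
        bar = ("." * lead + "#" * busy).ljust(width, ".")
        rows.append(f"{name.ljust(label_width)} |{bar}|")
    return "\n".join(rows)


# Made-up (start, end) boundary notification times per engine, in ticks.
spans = {"dma": (100, 180), "vpu": (120, 260), "te": (150, 240)}
chart = render_gantt(spans)
print(chart)
# dma |########........|
# vpu |..##############|
# te  |.....#########..|
```

Even this coarse view shows the overlap between engines and the idle time at either end of each row.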

[0033] It should be noted that the asynchronous engine referred to in the embodiments of this invention refers to any hardware functional unit that can execute operation instructions independently of other engines in parallel under the unified scheduling of the control core. Although DMA, SPU, SE, TE, and VPU have been used as examples in the above specific embodiments, those skilled in the art should understand that these examples do not constitute a limitation on the type of asynchronous engine. Any hardware unit that can receive operation instructions issued by the control core, execute operations independently, and send notifications to the control core, regardless of its specific function, such as data processing, storage access, encryption / decryption, compression / decompression, or other dedicated or general-purpose computing tasks, can constitute the asynchronous engine described in the embodiments of this invention.

[0034] Accordingly, the operations performed by the asynchronous engine are not limited to the data movement, sparse processing, sorting, transposition, vector operations, etc., mentioned in the above specific embodiments. Any operation dispatched by the control core through instructions and executed autonomously by the asynchronous engine, as long as its execution status or progress can be identified by sending notifications during execution, can be applied to the method provided in the embodiments of the present invention.

[0035] Furthermore, the control core in this embodiment of the invention is not limited to the RVV control core, but can be any processor core, microcontroller, or hardware scheduler capable of issuing commands, receiving notifications, and performing scheduling management. "Notification" is not limited to synchronous interrupt notifications, but can also be any form of signal, message, event, or status identifier, as long as it can be recognized by the control core and its arrival time recorded.
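The following Python sketch illustrates how little the method needs to assume about the form of a notification: any payload can be logged as long as its arrival time is recorded. The names `Notification`, `NotificationLog`, and `on_notify`, and the use of a monotonic nanosecond clock, are hypothetical choices for this example rather than part of the embodiments.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Notification:
    engine: str    # which asynchronous engine sent it
    payload: Any   # interrupt number, message, event, status flag, ...
    arrival: int   # time at which the control core observed it


@dataclass
class NotificationLog:
    # The clock is injectable so that a hardware cycle counter could be used instead.
    clock: Callable[[], int] = time.monotonic_ns
    records: list = field(default_factory=list)

    def on_notify(self, engine: str, payload: Any) -> Notification:
        """Record a notification of any form together with its arrival time."""
        note = Notification(engine, payload, self.clock())
        self.records.append(note)
        return note


log = NotificationLog()
log.on_notify("DMA", {"irq": 3})       # an interrupt-style notification
log.on_notify("TE", "transpose-done")  # a message-style notification
```

Because the log stores only the sender, an opaque payload, and a timestamp, the later analysis steps do not depend on whether the notification was an interrupt, a message, or a status flag.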

[0036] Furthermore, the performance analysis and visualization methods in this embodiment, based on the actual execution time period and the sending time of boundary notifications sent by the asynchronous engine related to the target program segment, are not limited to the Gantt charts mentioned in the specific embodiments above. Any data presentation format that can reflect the execution sequence, parallelism, dependencies, or resource utilization of the asynchronous engine, including but not limited to various types of time series charts, bar charts, pie charts, tables, text reports, and raw or structured data output through application programming interfaces (APIs), falls within the spirit and scope of this embodiment. Those skilled in the art can choose appropriate analysis dimensions and display methods according to actual needs.
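As one hedged illustration of a non-graphical presentation format, the sketch below turns per-engine boundary notification times into a structured report that could be returned through an API. The function name `build_report`, the field names, and the definition of utilization as busy time divided by the actual execution time period are assumptions made for this example.

```python
import json


def build_report(spans):
    """spans maps an engine name to its (start, end) boundary notification times.

    Assumes every engine has end > start.
    """
    t0 = min(start for start, _ in spans.values())
    t1 = max(end for _, end in spans.values())
    total = t1 - t0

    # Sweep over boundary events to find the peak number of concurrently
    # active engines; at equal times an end (-1) sorts before a start (+1),
    # so back-to-back spans are not counted as overlapping.
    events = sorted(
        [(start, 1) for start, _ in spans.values()]
        + [(end, -1) for _, end in spans.values()]
    )
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)

    engines = {
        name: {
            "start": start,
            "end": end,
            "utilization": (end - start) / total if total else 0.0,
        }
        for name, (start, end) in spans.items()
    }
    return {
        "period": {"start": t0, "end": t1},
        "peak_parallelism": peak,
        "engines": engines,
    }


report = build_report({"DMA": (0, 40), "TE": (20, 100), "VPU": (60, 80)})
print(json.dumps(report, indent=2))
```

The same dictionary could equally be written out as a table or a text report; JSON is used here only because it is a common structured output format.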

[0037] The scope of protection of this invention should be determined by the appended claims, and not limited to the specific hardware architecture, instruction type, implementation details, or analytical methods shown in the above specific embodiments. Any modifications, equivalent substitutions, improvements, or extensions made within the spirit and principles of this invention should be included within the scope of protection of this invention.

[0038] Figure 5 shows a schematic diagram of a computer system that can implement the method and apparatus for a multi-engine asynchronous parallel computing architecture according to embodiments of the present invention. It should be understood that the computer system 500 shown in Figure 5 is merely an example and should not impose any limitation on the functionality and scope of use of the methods and apparatus for multi-engine asynchronous parallel computing architectures according to embodiments of the present invention.

[0039] As shown in Figure 5, the computer system 500 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the computer system 500. The processing device 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input / output (I / O) interface 505 is also connected to the bus 504.

[0040] Typically, the following devices can be connected to the I / O interface 505: input devices 506 including, for example, touchscreens, touchpads, cameras, accelerometers, gyroscopes, sensors, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, motors, electronic speed controllers, etc.; storage devices 508 including, for example, flash cards; and communication devices 509. The communication device 509 allows the computer system 500 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 5 shows a computer system 500 with various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided. Each box shown in Figure 5 can represent one device or multiple devices as needed.

[0041] In particular, according to some embodiments of the present invention, the processes described above with reference to the flowcharts can be implemented as computer programs. For example, a computer-readable medium is provided having a computer program stored thereon, the computer program comprising program code for executing the method for a multi-engine asynchronous parallel computing architecture shown in Figure 3. In such an embodiment, the computer program can be downloaded and installed from a network via communication device 509, or installed from storage device 508, or installed from ROM 502. When the computer program is executed by processing device 501, it implements the aforementioned functional units defined in the apparatus for a multi-engine asynchronous parallel computing architecture according to an embodiment of the present invention.

[0042] It should be noted that the computer-readable medium according to embodiments of the present invention may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. A computer-readable storage medium according to embodiments of the present invention may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. Additionally, a computer-readable signal medium according to embodiments of the present invention may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (Radio Frequency), etc., or any suitable combination thereof.

[0043] Computer program code for performing operations according to embodiments of the present invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0044] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0045] This invention can be implemented in other specific forms without departing from its spirit and essential characteristics. For example, the algorithm described in a particular embodiment can be modified without departing from the basic spirit of the invention. Therefore, the present embodiments are to be regarded as exemplary rather than limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description, and all changes falling within the meaning and scope of the claims and their equivalents are thus included within the scope of the invention.

Claims

1. A method for a multi-engine asynchronous parallel computing architecture, the multi-engine asynchronous parallel computing architecture including a control core and multiple asynchronous engines, wherein the control core issues operation instructions to at least one of the multiple asynchronous engines based on a preset program, so that the at least one asynchronous engine performs an operation, the method being characterized by comprising: obtaining interval marker pairs, which are used to indicate the start and end of a target program segment in the preset program; in response to the interval marker pairs, determining, from among the multiple notifications sent to the control core when each asynchronous engine performs an operation related to the target program segment, the boundary notifications corresponding to the target program segment, wherein the boundary notifications include a start-point notification sent when the asynchronous engine starts executing the operation corresponding to the target program segment, and an end-point notification sent when the asynchronous engine finishes executing the operation corresponding to the target program segment; and determining the actual execution period of the target program segment based on the sending times of all boundary notifications sent by all asynchronous engines related to the target program segment; wherein the target program segment includes: a start-point notification instruction, used to cause each asynchronous engine associated with the target program segment to send the start-point notification when it begins to execute the operation corresponding to the target program segment; and an end-point notification instruction, used to cause each asynchronous engine associated with the target program segment to send the end-point notification when it has completed the operation corresponding to the target program segment.

2. The method of claim 1, wherein determining the actual execution period of the target program segment comprises: determining the earliest sending time among all start-point notifications as the start time of the actual execution period, and determining the latest sending time among all end-point notifications as the end time of the actual execution period.

3. The method according to claim 1, characterized in that determining the boundary notifications comprises: determining the boundary notifications sent by each asynchronous engine based on the expected sending order of notifications when that asynchronous engine performs an operation, wherein the expected sending order is determined by the preset program.

4. The method of claim 3, wherein determining the boundary notifications sent by each asynchronous engine based on the expected sending order of notifications when that asynchronous engine performs an operation comprises: during the execution of each asynchronous engine's operation, monitoring the actual sending sequence number of each notification sent by the asynchronous engine and, based on the expected sending sequence numbers corresponding to the start-point notification and the end-point notification in the expected sending order, determining the start-point notification and the end-point notification sent by the asynchronous engine.

5. The method of claim 1, wherein determining the boundary notifications comprises: identifying whether each notification is a boundary notification by recognizing whether the identifier carried by that notification is a preset identifier corresponding to a boundary notification.

6. The method of claim 1, wherein there are multiple interval marker pairs, and for any two interval marker pairs, the two target program segments they indicate are either nested, with one completely containing the other, or completely separate from each other.

7. The method of claim 1, further comprising: constructing, based on the actual execution period and the sending times of all boundary notifications sent by the asynchronous engines related to the target program segment, a visual Gantt chart to display the execution sequence of the asynchronous engines related to the target program segment.

8. An apparatus for a multi-engine asynchronous parallel computing architecture, comprising: a processor; and a memory having stored thereon computer-executable instructions, wherein, when executed by the processor, the computer-executable instructions cause the processor to perform the method of any one of claims 1 to 7.

9. A computer-readable storage medium having stored thereon computer-executable instructions, wherein, when executed by a processor, the computer-executable instructions cause the processor to perform the method of any one of claims 1 to 7.

10. A computer program product comprising computer-executable instructions, characterized in that, when executed by a processor, the computer-executable instructions cause the processor to perform the method of any one of claims 1 to 7.