Data transmission performance prediction method, apparatus, device, and storage medium

By segmenting the data transmission process into data blocks adapted to the target storage unit and using a hardware simulation model for transformation analysis, the problem of inaccurate data transmission performance prediction in existing technologies is solved, and the overall performance evaluation accuracy of the visual task model is improved.

WO2026138172A1 · PCT designated stage · Publication Date: 2026-07-02 · SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
Filing Date
2025-11-03
Publication Date
2026-07-02


Abstract

The present application relates to the technical field of computers, and discloses a data transmission performance prediction method, an apparatus, a device, and a storage medium. The method comprises: on the basis of the storage capacity of a target storage unit, and data information of data to be transmitted in a source storage unit, segmenting said data according to a scheduling mechanism of a data access unit to obtain a data block set; transforming the expression of the data block set to obtain a target data set, such that the expression of the target data set is suitable for transform analysis; performing transform analysis on the target data set to obtain a source-shape stride, a target-shape stride, and a transpose type; and by using a hardware simulation model of the data access unit, and on the basis of the hardware characteristic that the data access unit occupies bandwidth when executing transpose and shape transformation steps, predicting data transmission performance under the scheduling mechanism on the basis of the source-shape stride, the target-shape stride, an original bandwidth, and the transpose type. Therefore, the accuracy of a performance prediction result is improved.

Description

Data transmission performance prediction method, apparatus, device and storage medium

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a method, apparatus, device and storage medium for predicting data transmission performance.

[0002] This application claims priority to Chinese Patent Application No. 202411965365.8, filed on December 26, 2024, entitled "Data Transmission Performance Prediction Method, Apparatus, Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Background Technology

[0003] Random access memory (RAM) includes static random access memory (SRAM) and dynamic random access memory (DRAM). SRAM is fast, has small capacity, and is more expensive, making it suitable for high-speed caching. DRAM is less expensive, has large capacity, and is suitable for large-capacity storage. The data access unit is responsible for transferring data between different media. During data processing, the data access unit needs to divide and move the large amount of data stored in DRAM to SRAM.

[0004] When moving data, it is necessary to evaluate the data transmission performance of the data access unit between different media in advance in order to select a better data moving strategy and reduce power consumption.

[0005] Currently, a simple method to roughly estimate data transfer performance is to divide the amount of data to be transferred by the maximum bandwidth of the memory. However, in complex scenarios where data in memory is not aligned to specific boundaries or undergoes transformations during transmission, the data is not transmitted at the maximum bandwidth, resulting in a smaller actual bandwidth. In such cases, the accuracy of performance predictions using this rough estimation method is low.

Technical issues

[0006] This application provides a data transmission performance prediction method, apparatus, device, and storage medium, which can improve the accuracy of performance prediction results. The technical solution is as follows:

[0007] In a first aspect, a data transmission performance prediction method is provided. The method includes: dividing the data to be transmitted into a data block set according to the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, and according to the scheduling mechanism of the data access unit; the data access unit is used for data transfer between the source storage unit and the target storage unit; transforming the expression of the data block set to obtain a target dataset, so that the expression of the target dataset is applicable to transformation analysis; performing transformation analysis on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type; and using the hardware simulation model of the data access unit, predicting the data transmission performance based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type.

[0008] In a second aspect, a data transmission performance prediction device is provided. The device includes: a segmentation module, used to segment the data to be transmitted according to the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, according to the scheduling mechanism of the data access unit, to obtain a data block set; the data access unit is used for data transfer between the source storage unit and the target storage unit; a transformation module, used to transform the expression of the data block set to obtain a target dataset, so that the expression of the target dataset is applicable to transformation analysis; a transformation analysis module, used to perform transformation analysis on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type; and a prediction module, used to predict the data transmission performance using the hardware simulation model of the data access unit, based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type.

[0009] In a third aspect, a computer device is provided, the computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program implementing the method described in the first aspect when executed by the processor.

[0010] In a fourth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in the first aspect.

[0011] In a fifth aspect, a computer program product containing instructions is provided that, when run on a computer, causes the computer to perform the method described in the first aspect.

[0012] This application provides a data transmission performance prediction method, apparatus, device, and storage medium. According to the solution provided in this application, based on the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, the data to be transmitted is segmented according to the scheduling mechanism of the data access unit to obtain a data block set. The data access unit is used for data transfer between the source storage unit and the target storage unit. The scheduling mechanism divides the data to be transmitted stored in the source storage unit into multiple smaller data blocks (tiles), ensuring that each data block can adapt to the storage capacity of the target storage unit, thus solving the problem of limited target storage unit capacity. The data block set is transformed to obtain a target dataset, making the expression of the target dataset suitable for transformation analysis. Transformation analysis is performed on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type. Transformation analysis includes shape transformation analysis and transpose analysis. By performing transformation analysis on the target dataset, key parameters that need to be configured during data transmission (i.e., the step size of the source shape, the step size of the target shape, and the transpose type) can be obtained. By utilizing a hardware simulation model of the data access unit, data transmission performance is predicted based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type. This solution transforms complex multidimensional data handling scenarios into operations such as data segmentation, shape transformation, and transpose. 
By establishing a hardware simulation model, and considering the hardware characteristics of the data access unit consuming bandwidth during transpose and shape transformation steps, the data transmission performance under this scheduling mechanism is predicted, improving the accuracy of performance prediction results. Moreover, data transmission performance can be predicted without directly running data handling on the hardware, reducing operational complexity and resource consumption.

Attached Figure Description

[0013] Figure 1 is a flowchart of a data transmission performance prediction method provided in an embodiment of this application;

[0014] Figure 2 is a flowchart of another data transmission performance prediction method provided in an embodiment of this application;

[0015] Figure 3 is a schematic diagram illustrating the relationship between transpose type and bandwidth according to an embodiment of this application;

[0016] Figure 4 is a flowchart of a data transmission method provided in an embodiment of this application;

[0017] Figure 5 is a schematic diagram of a data transmission performance prediction device provided in an embodiment of this application;

[0018] Figure 6 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.

Embodiments of the present invention

[0019] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0020] Before providing a detailed explanation of the embodiments of this application, the application scenarios and related technologies of the embodiments of this application will be described first.

[0021] The data transmission performance prediction method provided in this application can be applied to data transfer scenarios in the data processing process. The data processing process can be for visual tasks such as target detection, target recognition, and semantic segmentation. This application does not limit this process. Any data transfer scenario involving different media (also known as data transmission scenario) is within the protection scope of this application.

[0022] In this embodiment, DRAM can be double data rate synchronous dynamic random access memory (DDR), typically used to store large amounts of data and program code; SRAM can be data memory (DM), which is located close to the internal computing unit and can be integrated inside the chip for on-chip storage. The data access unit can be direct memory access (DMA), used to handle data transfer between different media (e.g., DDR and DM).

[0023] When evaluating the overall performance of visual task models (e.g., object detection models, object recognition models, and semantic segmentation models), it is necessary to consider not only the processing performance of the visual task model during the execution of the visual task, but also the data transfer performance between different media used in the visual task model. This data transfer can be from DDR to DM or from DM to DDR. The accuracy of the data transfer performance estimation directly affects the overall performance of the visual task model. In other words, inaccurate data transfer performance evaluation can lead to a significant deviation in the overall performance assessment of the visual task model, thus preventing the acquisition of a better-performing visual task model and affecting the processing effect of the visual task.

[0024] In current coarse estimation schemes, if data in DDR or DM is not aligned to specific boundaries, or if data undergoes transpose or concat operations between different dimensions during transmission, then data transmission in these complex scenarios does not run at the DMA's maximum bandwidth, and the actual transmission bandwidth is smaller than the maximum bandwidth. Therefore, the accuracy of data transmission performance prediction using coarse estimation schemes in such scenarios is not high.
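
The coarse estimate described above amounts to a single division. The following is a minimal sketch of it; the function name, the 1 GB/s peak bandwidth, and the buffer size are illustrative assumptions and are not taken from this application.

```python
def coarse_transfer_time(num_bytes: int, max_bandwidth_bytes_per_s: float) -> float:
    """Coarse estimate: transfer time = data volume / maximum bandwidth."""
    if max_bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / max_bandwidth_bytes_per_s


# Illustrative values only: a 32 x 48 float32 buffer and an assumed 1 GB/s peak bandwidth.
buffer_bytes = 32 * 48 * 4
estimate_s = coarse_transfer_time(buffer_bytes, 1e9)
```

Because this estimate ignores alignment, step size, and transpose effects, it underestimates the transfer time whenever the achieved bandwidth falls below the peak.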

[0025] Another related technology involves building an artificial intelligence (AI) model. This AI model learns from numerous real-world transmission tests. The trained AI model is then used to predict the transmission performance of different data types in DDR and DM. While this approach improves the accuracy of data transmission performance estimation compared to the coarse estimation scheme, it requires building the AI model and training it with a large amount of transmission test data, resulting in higher costs and greater resource consumption.

[0026] To address the aforementioned technical problem of low accuracy in data transmission performance prediction, this application provides a data transmission performance prediction method, as shown in Figure 1. Figure 1 is a flowchart of a data transmission performance prediction method provided by this application, which includes:

[0027] S101. Based on the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, the data to be transmitted is divided according to the scheduling mechanism of the data access unit to obtain a set of data blocks; the data access unit is used for data transfer between the source storage unit and the target storage unit.

[0028] In this embodiment of the application, taking DMA as an example, the DMA Schedule is used to solve the problem of limited capacity of the target storage unit (e.g., DM). It divides a large data buffer (e.g., the data to be transferred stored in DDR) into multiple smaller data tiles, so that each data tile fits within the storage capacity of the target storage unit (e.g., DM).

[0029] The DMA Schedule analyzes the shape and size of the buffer containing the data to be transmitted, obtaining data information (including dimensional information and overall storage requirements). Based on the storage capacity constraints of the target storage unit (e.g., DM), it generates data block sets for different tilings, each set containing multiple data blocks. The DMA Schedule supports flexible partitioning across different dimensions, generating data block sets of varying sizes for each dimension. Large data buffers can be divided (also known as splitting, slicing, or dicing) into multiple smaller tiles, ensuring that the size of each tile does not exceed the capacity limit of the target storage unit (e.g., DM).
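
The capacity check and tile count described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `tile_fits` and `tiles_per_dimension` and the 64-byte capacity are hypothetical, and the application does not specify how the DMA Schedule chooses tile sizes.

```python
from math import prod


def tile_fits(tile_shape, bytes_per_element, capacity_bytes):
    """Return True if a single tile fits within the target storage capacity."""
    return prod(tile_shape) * bytes_per_element <= capacity_bytes


def tiles_per_dimension(shape, tile_shape):
    """Number of tiles along each dimension, using ceiling division."""
    return tuple(-(-extent // tile) for extent, tile in zip(shape, tile_shape))


shape = (32, 48)        # buffer shape used in the Copy example
tile_shape = (4, 4)     # candidate tile
capacity_bytes = 64     # hypothetical DM capacity, chosen so a 4 x 4 float32 tile just fits

fits = tile_fits(tile_shape, 4, capacity_bytes)
tile_counts = tiles_per_dimension(shape, tile_shape)
```

With these assumed values, a 4 x 4 float32 tile occupies exactly 64 bytes, so it fits, and the buffer is covered by 8 tiles along the first dimension and 12 along the second.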

[0030] For example, take the Copy operator as the DMA-related operator, DDR as the source storage unit, and DM as the target storage unit. The Copy operator completely copies the data of array A into array B, where A is the original buffer (i.e., array A is stored in DDR), and B is the target buffer (i.e., array B is stored in DM). In other words, the Copy operator moves the data to be transferred from DDR to DM.

[0031] The basic form of the Copy operator is as follows:

[0032] for i in range(32):

[0033]     for j in range(48):

[0034]         B[i, j] = A[i, j]

[0035] This can be viewed as 32 sub-data points distributed along the i-axis (horizontal axis) and 48 sub-data points distributed along the j-axis (vertical axis), for a total of 32 * 48 = 1536 sub-data points. Using a DMA Schedule, the data along both the i-axis and the j-axis are divided with a split factor of 4. The division format is as follows:

[0036] for i_outer in range(8):  # 32 // 4

[0037]     for j_outer in range(12):  # 48 // 4

[0038]         for ij_inner in range(16):  # 4 * 4

[0039]             i = i_outer * 4 + (ij_inner // 4)

[0040]             j = j_outer * 4 + (ij_inner % 4)

[0041] This can be viewed as dividing the 32 sub-data points on the i-axis into 32 // 4 = 8 parts of 4 data points each and, similarly, dividing the 48 sub-data points on the j-axis into 48 // 4 = 12 parts of 4 data points each, consistent with the loop bounds above. Each data block contains 4 * 4 = 16 sub-data points. Thus, i takes values from 0 to 31, j from 0 to 47, i_outer from 0 to 7, j_outer from 0 to 11, and ij_inner from 0 to 15. Within a data block, the quotient of ij_inner divided by 4 gives the offset along the i-axis, and the remainder gives the offset along the j-axis.
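
As a quick sanity check of the index arithmetic above, the tiled loop nest can be enumerated in plain Python to confirm that it visits every (i, j) position exactly once. The generator name `tiled_indices` is introduced here only for illustration.

```python
def tiled_indices():
    """Yield (i, j) in the order visited by the tiled loop nest."""
    for i_outer in range(8):              # 32 // 4 tiles along the i-axis
        for j_outer in range(12):         # 48 // 4 tiles along the j-axis
            for ij_inner in range(16):    # 4 * 4 elements per tile
                i = i_outer * 4 + ij_inner // 4
                j = j_outer * 4 + ij_inner % 4
                yield i, j


visited = list(tiled_indices())
expected = {(i, j) for i in range(32) for j in range(48)}

# Every position is covered, and none is visited twice.
covers_all_once = set(visited) == expected and len(visited) == len(expected)
```

The tiling changes only the order in which elements are visited, not the set of elements copied.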

[0042] The mathematical expression for the data block has been introduced above. Next, the mathematical expression will be converted into a mathematical formal language.

[0043] @T.prim_func

[0044] def func(A: T.Buffer((32, 48), "float32"), B: T.Buffer((32, 48), "float32")):

[0045]     T.func_attr({"global_symbol": "main"})

[0046]     for i_outer in T.serial(8):  # 32 // 4

[0047]         for j_outer in T.serial(12):  # 48 // 4

[0048]             for ij_inner in T.serial(16):  # 4 * 4

[0049]                 i = i_outer * 4 + (ij_inner // 4)

[0050]                 j = j_outer * 4 + (ij_inner % 4)

[0051]                 B[i, j] = A[i, j]

[0052] Here, `float32` indicates that the data type is floating-point, using 32 bits (IEEE 754 single precision) to represent a floating-point number. Of course, the data type can also be integer (int), double-precision floating-point (double), etc., and this embodiment of the application does not impose any restrictions on this. `global_symbol` represents the attribute information of each data tile. The original index can be calculated using the above format.

[0053] S102. Transform the expression of the data block set to obtain the target dataset, so that the expression of the target dataset is applicable to transformation analysis.

[0054] In this embodiment, the data block set is expressed in a mathematical formal language, which needs to be converted into the compiler's own formal language so that the compiler's subsequent operation modules (passes) can perform transformation analysis on the target dataset. Here, a pass is a structured technique used by the compiler to perform functions such as analysis, optimization, or transformation of the compiled object.

[0055] For example, the compiler could be a tensor virtual machine (TVM), a deep learning compiler used to optimize and compile deep learning models. Tensor intermediate representations (tir) describe TVM tensor computations, providing various operators and functions. The language in tir source form is closer to the hardware. The mathematical formal language corresponding to the data block set is converted into tir source language, which is suitable for subsequent compiler passes to perform operations such as shape transformation and transpose.

[0056] It should be noted that the compiler can also be an accelerated linear algebra compiler (TensorFlow XLA), Tensor Comprehensions (TC), Glow, a compiler based on multi-level intermediate representation (MLIR), etc., and the embodiments of this application do not limit this.

[0057] S103. Perform transformation analysis on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type.

[0058] In this embodiment, the transformation analysis includes shape transformation analysis and transpose analysis. DMA hardware characteristics indicate that shape transformation involves step size and transpose processes, which are related to bandwidth usage. Based on this, transformation analysis of the target dataset can be performed using multiple passes in the TVM compiler to obtain key parameters that need to be configured during data transmission (i.e., the step size of the source shape, the step size of the target shape, and the transpose type).
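
To make the idea of extracting step sizes concrete, the sketch below derives per-loop step sizes from a flat index expression by evaluating it at unit offsets. This is not the pass pipeline of this application: the helper `loop_strides`, the split of `ij_inner` into separate `i_inner` and `j_inner` loops (so that the index stays affine), the transposed 48 x 32 destination layout, and the rule used to flag a transpose are all assumptions made for illustration.

```python
def loop_strides(index_fn, num_loops):
    """Step size of a flat index per loop variable, assuming index_fn is affine."""
    origin = index_fn(*([0] * num_loops))
    strides = []
    for k in range(num_loops):
        unit = [0] * num_loops
        unit[k] = 1
        strides.append(index_fn(*unit) - origin)
    return tuple(strides)


def src_index(i_outer, j_outer, i_inner, j_inner):
    # Row-major 32 x 48 source buffer, 4 x 4 tiles.
    return (i_outer * 4 + i_inner) * 48 + (j_outer * 4 + j_inner)


def dst_index_transposed(i_outer, j_outer, i_inner, j_inner):
    # Hypothetical destination: the same data stored transposed as 48 x 32, row-major.
    return (j_outer * 4 + j_inner) * 32 + (i_outer * 4 + i_inner)


src_strides = loop_strides(src_index, 4)              # (192, 4, 48, 1)
dst_strides = loop_strides(dst_index_transposed, 4)   # (4, 128, 1, 32)

# Assumed rule: the copy is a transpose if the unit-stride loop differs between source and destination.
is_transposed = src_strides.index(1) != dst_strides.index(1)
```

In this toy setting the source is contiguous along `j_inner` while the destination is contiguous along `i_inner`, which is what the assumed rule reports as a transpose.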

[0059] S104. Using the hardware simulation model of the data access unit, predict the data transmission performance based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type.

[0060] In this embodiment, a hardware simulation model (e.g., a cost model) of the data access unit is used to simulate the data transmission process of the data access unit. Based on key parameters affecting the data transmission process (i.e., the step size of the source shape, the step size of the target shape, and the transpose type), the bandwidth of the data transmission process is predicted. Due to the influence of the key parameters, this bandwidth will be smaller than the original bandwidth. Therefore, the data transmission performance is calculated based on the bandwidth and the amount of data to be transmitted, thereby simulating the data transmission process.
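
A toy cost model in the spirit of this step might look like the following. It is a sketch under loud assumptions: the efficiency table `TRANSPOSE_EFFICIENCY`, the burst-based stride penalty, the transpose type labels, and all numeric values are hypothetical placeholders, since the application does not disclose the concrete formulas of its hardware simulation model here.

```python
from dataclasses import dataclass

# Hypothetical efficiency per transpose type; a real model would be calibrated to the hardware.
TRANSPOSE_EFFICIENCY = {"none": 1.0, "2d": 0.5}


@dataclass
class TransferParams:
    src_stride: int        # step size of the source shape, in elements
    dst_stride: int        # step size of the target shape, in elements
    transpose_type: str    # key into TRANSPOSE_EFFICIENCY


def stride_efficiency(stride: int, burst_elems: int) -> float:
    """Assumed rule: only 1 of min(stride, burst_elems) fetched elements is useful."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return 1.0 / min(stride, burst_elems)


def predict_transfer_time(num_bytes, original_bandwidth, params, burst_elems=8):
    """Scale the original bandwidth by the assumed penalties, then divide."""
    efficiency = min(
        stride_efficiency(params.src_stride, burst_elems),
        stride_efficiency(params.dst_stride, burst_elems),
    ) * TRANSPOSE_EFFICIENCY[params.transpose_type]
    effective_bandwidth = original_bandwidth * efficiency
    return num_bytes / effective_bandwidth


num_bytes = 32 * 48 * 4                                   # 32 x 48 float32 buffer
t_contiguous = predict_transfer_time(num_bytes, 1e9, TransferParams(1, 1, "none"))
t_strided = predict_transfer_time(num_bytes, 1e9, TransferParams(2, 1, "2d"))
```

With these placeholder penalties, the contiguous, untransposed transfer reduces to the coarse estimate, while the strided, transposed transfer is predicted to take four times as long. The point is the structure (parameters in, reduced bandwidth out), not the specific numbers.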

[0061] This application provides a method for evaluating the data transfer performance of DMA operators based on a TVM compiler, taking DMA as the data access unit and tir source language expression as the target dataset as an example. Combining the tir source language provided by the TVM compiler, the DMA data transfer operators (e.g., copy operators) are converted into tir source language expressions. By constructing multiple passes of the TVM compiler, the tir source language expressions are converted into key parameters that DMA needs to configure during data transfer (i.e., the step size of the source shape, the step size of the target shape, and the transpose type). These key parameters are then input into the DMA cost model. After obtaining these key parameters, the DMA cost model adapts to different data transfer scenarios (i.e., segmenting the data to be transferred in DDR according to different DMA schedules) to predict the DMA data transfer performance. Compared to a coarse estimation scheme that directly uses the maximum bandwidth of DMA (i.e., the original bandwidth), this method improves the accuracy of performance prediction results. Moreover, it obtains data transfer performance without directly running data transfer on the hardware, reducing operational difficulty and resource consumption.

[0062] It should be noted that the data transmission performance prediction method provided in this application embodiment can also be used to predict the transmission process from DM to DDR. In this transmission process, there is no segmentation process, and the above S102-S103 can be executed. Alternatively, the data to be transmitted in DM can be taken as data blocks. Since the data to be transmitted in DM is continuously generated, a set of data blocks to be transmitted can be obtained, and then the above S102-S103 can be executed. Taking the source storage unit as DM and the target storage unit as DDR as an example, the expression of the set of data blocks to be transmitted corresponding to DM is transformed to obtain the target dataset to be transmitted, so that the expression of the target dataset to be transmitted is applicable to transformation analysis; transformation analysis is performed on the target dataset to be transmitted to obtain the step size of the source shape, the step size of the target shape, and the transpose type; using the hardware simulation model of the data access unit, the data transmission performance is predicted based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type.

[0063] According to the scheme provided in this application, based on the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, the data to be transmitted is divided into a set of data blocks according to the scheduling mechanism of the data access unit. The data access unit is used for data transfer between the source storage unit and the target storage unit. The scheduling mechanism divides the data to be transmitted stored in the source storage unit into multiple smaller data blocks (tiles), so that each data block can be adapted to the storage capacity of the target storage unit, thereby solving the problem of limited capacity of the target storage unit. The set of data blocks is transformed to obtain the target dataset, so that the expression of the target dataset is applicable to transformation analysis. Transformation analysis is performed on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type. Transformation analysis includes shape transformation analysis and transpose analysis. By performing transformation analysis on the target dataset, the key parameters that need to be configured during data transmission (i.e., the step size of the source shape, the step size of the target shape, and the transpose type) can be obtained. Using the hardware simulation model of the data access unit, the data transmission performance is predicted based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type. This solution transforms complex multidimensional data handling scenarios into operations such as data segmentation, shape transformation, and transposition. 
By establishing a hardware simulation model, and based on the hardware characteristics of the data access unit consuming bandwidth during transposition and shape transformation steps, it predicts the data transmission performance under this scheduling mechanism, improving the accuracy of performance prediction results. Furthermore, it predicts data transmission performance without directly running data handling on the hardware, reducing operational complexity and resource consumption.

[0064] In some embodiments, the source storage unit is a dynamic memory, the target storage unit is a static memory, the storage capacity of the target storage unit is smaller than the storage capacity of the source storage unit, the transmission efficiency of the target storage unit is greater than the transmission efficiency of the source storage unit, and the original bandwidth is the transmission bandwidth of the source storage unit.

[0065] In this embodiment, taking DDR as the source storage unit and DM as the target storage unit as an example, the storage capacity of DM is smaller than that of DDR, and the transmission efficiency of DM is greater than that of DDR. In the transmission process from DDR to DM or from DM to DDR, since the transmission efficiency of DDR is lower than that of DM, the transmission bandwidth of DDR is the upper limit of the transfer, so the transmission bandwidth of DDR is used as the original bandwidth. Data transmission performance is calculated based on the transmission bandwidth of DDR and key parameters (i.e., the step size of the source shape, the step size of the target shape, and the transpose type). In this way, compared to the coarse estimation scheme that directly uses the maximum bandwidth of DMA (i.e., the original bandwidth), the accuracy of the performance prediction results can be improved.

[0066] In some embodiments, both the source storage unit and the target storage unit are static memories, and the original bandwidth is the transmission bandwidth of either the source storage unit or the target storage unit.

[0067] In this embodiment of the application, taking the source storage unit and the target storage unit as both being DMs as an example, since there is no difference in transmission efficiency during the transmission process from DM to DM, the transmission bandwidth of any DM is taken as the original bandwidth. The data transmission performance is calculated based on the transmission bandwidth of any DM and key parameters (i.e., the step size of the source shape, the step size of the target shape, and the transpose type). In this way, compared with the rough estimation scheme of directly using the maximum bandwidth of DMA (i.e., the original bandwidth), the accuracy of the performance prediction results can be improved.
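
The choice of original bandwidth in the two embodiments above can be summarized in a small helper. The function name, the `"DDR"`/`"DM"` labels as dictionary keys, and the bandwidth figures are assumptions for illustration only.

```python
def select_original_bandwidth(src_kind, dst_kind, bandwidth_by_kind):
    """Pick the original bandwidth for a transfer between two storage units.

    Assumed rule, following the two embodiments above: if DDR is involved,
    its bandwidth is the limit; for DM-to-DM transfers the DM bandwidth is used.
    """
    kinds = {src_kind, dst_kind}
    if not kinds <= {"DDR", "DM"}:
        raise ValueError(f"unsupported storage kinds: {sorted(kinds)}")
    if "DDR" in kinds:
        return bandwidth_by_kind["DDR"]
    return bandwidth_by_kind["DM"]


# Illustrative bandwidths in bytes per second, not taken from this application.
bandwidths = {"DDR": 8e9, "DM": 32e9}

ddr_to_dm = select_original_bandwidth("DDR", "DM", bandwidths)
dm_to_dm = select_original_bandwidth("DM", "DM", bandwidths)
```

The selected value is the starting point that the key parameters (step sizes and transpose type) then reduce.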

[0068] In some embodiments, S102 in Figure 1 above can be implemented in the following way: Hardware language conversion is performed on the data block set to obtain a preset format dataset; the preset format dataset is a multi-dimensional index; dimensional transformation is performed on the preset format dataset to obtain a target dataset; the target dataset is a one-dimensional index.

[0069] In this embodiment, the data block set is a mathematical formal language, and the preset form can be the tir source language expression form corresponding to the TVM compiler. Hardware language conversion is performed on the data block set to generate the tir source language expression form of DMA-related operators. The tir source language expression form is a hardware language, enabling the subsequent compiler pass to perform transformation analysis on the target dataset. The tir source language expression form is a multi-dimensional index; before the pass performs transformation analysis on it, a dimensionality conversion is required to transform the tir source language expression form into a one-dimensional index.

[0070] According to the syntax provided by the TVM compiler, the set of data blocks obtained after division by the DMA schedule is converted into the TVM tir source expression form. The following code illustrates this.

[0071] @T.prim_func

[0072] def main(A: T.Buffer((32, 48), "float32"), B: T.Buffer((32, 48), "float32")) -> None:

[0073]     # function attr dict

[0074]     T.func_attr({"global_symbol": "main"})

[0075]     A_buf = T.match_buffer(A, (32, 48), "float32")

[0076]     B_buf = T.match_buffer(B, (32, 48), "float32")

[0077]     # body

[0078]     with T.block("root"):

[0079]         T.reads(A_buf[0:32, 0:48])

[0080]         T.writes(B_buf[0:32, 0:48])

[0081]         for i_outer in range(8):

[0082]             for j_outer in range(12):

[0083]                 for ij_inner in range(16):

[0084]                     with T.block("B"):

[0085]                         # block vars

[0086]                         i = T.axis.spatial(32, i_outer * 4 + ij_inner // 4)

[0087]                         j = T.axis.spatial(48, j_outer * 4 + ij_inner % 4)

[0088]                         # block attributes

[0089]                         T.reads(A_buf[i, j])

[0090]                         T.writes(B_buf[i, j])

[0091]                         # block body

[0092]                         B_buf[i, j] = A_buf[i, j]

[0093] Here, `float32` indicates that the data type is floating-point. `A: T.Buffer((32, 48)` and `T.reads(A_buf[0:32, 0:48])` represent the buffer of array A in the source memory location, and `B: T.Buffer((32, 48)` and `T.writes(B_buf[0:32, 0:48])` represent the buffer of array B in the destination memory location. `global_symbol` represents the attribute information of each data tile. Using the above code, the mathematical formal language can be converted into the TVM's `tir` source language expression form, which is closer to the hardware language.

[0094] The `tir` source code expression in the TVM compiler is a multidimensional index. Therefore, the multidimensional data buffer corresponding to the `tir` source code expression needs to be flattened, i.e., converted to a one-dimensional index. This embodiment sets a buffer flattening converter in the backend of the TVM compiler, which can be understood as a backend pass execution. The following code demonstrates how to convert the multidimensional data buffer corresponding to the above `tir` source code expression into a one-dimensional index using the buffer flattening converter.

[0095] @T.prim_func

[0096] def main(A_flat: T.Buffer((1536,), "float32"),  # 32 * 48 = 1536

[0097]          B_flat: T.Buffer((1536,), "float32")) -> None:

[0098]     # function attr dict

[0099]     T.func_attr({"global_symbol": "main"})

[0100]     # body

[0101]     with T.block("root"):

[0102]         T.reads(A_flat[0:1536])

[0103]         T.writes(B_flat[0:1536])

[0104]         # i_outer: [0, 8)

[0105]         for i_outer in range(8):  # 32 // 4 = 8

[0106]             # j_outer: [0, 12)

[0107]             for j_outer in range(12):  # 48 // 4 = 12

[0108]                 # ij_inner: [0, 16)

[0109]                 for ij_inner in range(16):  # 4 * 4 = 16

[0110]                     B_flat[(i_outer * 4 + ij_inner // 4) * 48 + (j_outer * 4 + ij_inner % 4)] = A_flat[(i_outer * 4 + ij_inner // 4) * 48 + (j_outer * 4 + ij_inner % 4)]

[0111] Here, A_flat and B_flat represent the flattened buffers. Each step along the i-axis (i.e., the horizontal axis) spans 48 elements along the j-axis, so the i index is multiplied by 48. The target dataset mentioned above includes the loops i_outer in range(8), j_outer in range(12), and ij_inner in range(16).
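As an illustrative check, the flattened index can be enumerated in plain Python (not TVM) to confirm that the three loops visit each of the 32 * 48 = 1536 elements exactly once. This is only a sketch; the helper name `flat_index` and the constants `ROWS`, `COLS`, and `TILE` are chosen here for illustration and do not appear in the embodiment.

```python
ROWS, COLS, TILE = 32, 48, 4  # buffer shape and tile edge taken from the example above


def flat_index(i_outer, j_outer, ij_inner):
    """One-dimensional index used by the flattened copy (hypothetical helper)."""
    i = i_outer * TILE + ij_inner // TILE   # row inside the 32 x 48 buffer
    j = j_outer * TILE + ij_inner % TILE    # column inside the 32 x 48 buffer
    return i * COLS + j                     # each row spans 48 elements


# Enumerate the loop nest in the same order as the flattened code.
visited = [
    flat_index(i_outer, j_outer, ij_inner)
    for i_outer in range(ROWS // TILE)
    for j_outer in range(COLS // TILE)
    for ij_inner in range(TILE * TILE)
]

# True when every element of the flattened buffer is touched exactly once.
covers_all = sorted(visited) == list(range(ROWS * COLS))
```

The enumeration order differs from plain row-major order because each 4 x 4 tile is completed before moving to the next tile, which is what the later split, transpose, and fuse analysis recovers.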

[0112] This buffer flattening converter can be considered a component of the entire vector processing operation, transforming complex multidimensional data buffer operations into simple one-dimensional physical memory operations. This allows multiple subsequent compiler backend passes to perform transformation analysis (including shape transformation analysis and transpose analysis) on the program, improving processing efficiency. These passes can be appropriately configured by those skilled in the art based on the actual situation, and are used to analyze which loop transformations (i.e., shape transformation operations) and rearrangement operations (i.e., transpose operations) are implemented by the relevant operators of the segmented DMA.

[0113] In some embodiments, S103 in Figure 1 above can also be implemented in the following way, as shown in Figure 2, which is a flowchart of another data transmission performance prediction method provided by an embodiment of this application.

[0114] S1031. Detect shape transformations in the target dataset and obtain the detection results.

[0115] In this embodiment, shape transformation includes transformations corresponding to split operations and transformations corresponding to fuse operations. The target dataset is stored in a flattened buffer, and the fuse, split, and transpose operations contained in the flattened buffer are detected. The fuse and split operations can be considered as cyclic (i.e., loop) transformation pattern operations. A cyclic transformation pattern detection device can be set up, including a split pattern detection module for recognizing split operations and a fuse pattern detection module for recognizing fuse operations. The split pattern detection module identifies split operations in the target dataset to obtain segmentation recognition results; the fuse pattern detection module identifies fuse operations in the target dataset to obtain fusion recognition results.

[0116] In some embodiments, shape transformation includes transformations corresponding to segmentation (split) operations and transformations corresponding to fusion (fuse) operations. Correspondingly, the detection results include segmentation recognition results and fusion recognition results. S1031 in Figure 2 above can be implemented in the following way: If the index of the target dataset contains floor-division operations and modulo operations, the presence of a segmentation operation is taken as the segmentation recognition result; if the index of the target dataset contains multiplication and addition operations and includes multiple loop variables, the presence of a fusion operation is taken as the fusion recognition result.

[0117] In this embodiment, the split pattern detection method identifies floordiv and modulo operation pairs, detecting whether floordiv and modulo operators exist in the index of the target dataset. If they exist, it indicates that a split operation exists. Floordiv yields the integer quotient of two numbers, while modulo yields the remainder.

[0118] For example, the split pattern detection module detects the FloorDiv(var, factor) pattern and looks for the corresponding FloorMod(var, factor), where var represents a variable (e.g., ij_inner in the code above) and factor represents a constant (e.g., 4 in the code above). For instance, if it detects a pairing of `ij_inner // 4` and `ij_inner % 4`, it considers this a split operation and writes it as: ij_inner -> (i_inner, j_inner).

[0119] In this embodiment of the application, the fuse pattern detection method identifies the multiplication (*) operator and the addition operator (+), detects whether there are multiplication and addition operators in the index of the target dataset, and detects whether there are multiple loop variables. If they exist, it indicates that there is a fusion operation.

[0120] For example, the fuse pattern detection module extracts the addition terms and analyzes the expression `outer * factor + inner`, where `outer` and `inner` represent loop variables, and `factor` represents a constant (e.g., 4 in the code above). It identifies the outer loop variable (i.e., how many data blocks the data is divided into) and the inner loop variable (i.e., how many sub-data items each data block contains). It constructs the fuse operation, which includes the univariate case `outer * factor + 0` and the multivariate case `outer * factor + inner`. In this embodiment, the multivariate case is detected. For example, fuse detects `i_outer * 4 + i_inner -> i` (that is, `i = i_outer * 4 + i_inner`) and `j_outer * 4 + j_inner -> j` (that is, `j = j_outer * 4 + j_inner`).

[0121] In this embodiment, the presence of a split operation is identified by detecting specific operators (i.e., the floor-division and modulo operators), and the presence of a fuse operation is identified by detecting specific operators (i.e., the multiplication and addition operators) together with multiple loop variables. This allows the compiler to analyze which loop transformations (i.e., shape transformation operations) are implemented by the relevant DMA operators based on the detection results, determine the key parameters affecting data transmission, and thus predict data transmission performance, improving the accuracy of performance prediction results.
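The two detection rules can be sketched on the flattened index expression using Python's standard `ast` module. This is a minimal illustration of the pattern matching only, not the TVM backend pass of the embodiment; the function names `find_split_pairs` and `find_fuse_terms` are hypothetical.

```python
import ast


def find_split_pairs(index_expr):
    """Return (var, factor) pairs for which both var // factor and var % factor occur."""
    floordivs, mods = set(), set()
    for node in ast.walk(ast.parse(index_expr, mode="eval")):
        if (isinstance(node, ast.BinOp)
                and isinstance(node.left, ast.Name)
                and isinstance(node.right, ast.Constant)):
            key = (node.left.id, node.right.value)
            if isinstance(node.op, ast.FloorDiv):
                floordivs.add(key)
            elif isinstance(node.op, ast.Mod):
                mods.add(key)
    return sorted(floordivs & mods)   # a split needs the matching pair


def find_fuse_terms(index_expr):
    """Return (outer, factor, inner_vars) for each `outer * factor + inner` term."""
    found = []
    for node in ast.walk(ast.parse(index_expr, mode="eval")):
        if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)):
            continue
        left = node.left
        if (isinstance(left, ast.BinOp) and isinstance(left.op, ast.Mult)
                and isinstance(left.left, ast.Name)
                and isinstance(left.right, ast.Constant)):
            # Loop variables appearing in the added (inner) term.
            inner_vars = sorted({n.id for n in ast.walk(node.right)
                                 if isinstance(n, ast.Name)})
            if inner_vars:            # multivariate case: more than one loop variable
                found.append((left.left.id, left.right.value, inner_vars))
    return sorted(found)


index = "(i_outer * 4 + ij_inner // 4) * 48 + (j_outer * 4 + ij_inner % 4)"
split_pairs = find_split_pairs(index)
fuse_terms = find_fuse_terms(index)
```

On the example index, `split_pairs` contains the single pair `("ij_inner", 4)`, and `fuse_terms` contains one entry for `i_outer` and one for `j_outer`, each with factor 4.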

[0122] S1032. Generate the target sequence based on the detection results.

[0123] The aforementioned cyclic transformation pattern detection device also includes a sequence generation module for generating change sequences (e.g., fuse sequences or split sequences); the sequence generation module is used to generate target sequences based on segmentation recognition results and fusion recognition results.

[0124] In some embodiments, S1032 in Figure 2 above can be implemented in the following way: A first sequence is generated based on the segmentation recognition result; the first sequence indicates the index of the sub-data included in each data block in the target dataset; a second sequence and a third sequence are generated based on the fusion recognition result; the second sequence indicates the index of the data block included in the target dataset along the horizontal axis, and the third sequence indicates the index of the data block included in the target dataset along the vertical axis; the target sequence includes the first sequence, the second sequence, and the third sequence.

[0125] A complete change sequence (i.e., the target sequence) is generated based on the detection results. A split sequence (i.e., the first sequence) is generated based on the segmentation recognition results. The split sequence is: ij_inner(16) -> [i_inner(4), j_inner(4)], which indicates the index of the sub-data included in each data block. Two fuse sequences (i.e., the second sequence and the third sequence) are generated based on the fusion recognition results. The fuse sequence is: [i_outer(8), i_inner(4)] -> i(32), which indicates the index of the data block included on the horizontal axis (i.e., the i-axis); the fuse sequence is: [j_outer(12), j_inner(4)] -> j(48), which indicates the index of the data block included on the vertical axis (i.e., the j-axis).

[0126] In this embodiment, a split sequence is generated by combining the split identification result and two fuse sequences are generated by combining the fuse identification result. The split sequence and the two fuse sequences are used as the target sequence. The target sequence is used to deduce key parameters affecting the data transmission process, thereby predicting the data transmission performance and improving the accuracy of the performance prediction results.

[0127] S1033. Derive shape transformation information and transpose information based on the target sequence.

[0128] Both the split and fuse operations are equivalent to the reshape operation. The shape transformation information and transpose information can be derived from the split sequence and two fuse sequences.

[0129] In some embodiments, S1033 in Figure 2 above can be implemented in the following manner: First shape transformation information is determined based on the target sequence and the segmentation sequence; the segmentation sequence is obtained by segmenting the target sequence; transpose information is determined based on the segmentation sequence and the transposed sequence; the transposed sequence is obtained by transposing the segmentation sequence based on the dimension information in the segmentation sequence; second shape transformation information is determined based on the transposed sequence and the merged sequence; the merged sequence is obtained by merging adjacent dimensions in the transposed sequence, and the shape transformation information includes the first shape transformation information and the second shape transformation information.

[0130] In this embodiment, the initial shape is [8, 12, 16], corresponding to the sequence [i_outer, j_outer, ij_inner]. This sequence includes the first sequence [ij_inner], the outer part i_outer of the second sequence [i_outer, i_inner], and the outer part j_outer of the third sequence [j_outer, j_inner]. After segmenting the sequence [i_outer, j_outer, ij_inner] (i.e., the first reshape), the segmented sequence [i_outer, j_outer, i_inner, j_inner] is obtained. Based on this sequence and the segmented sequence, the first shape transformation information is determined, i.e., reshape ij_inner -> [i_inner, j_inner]. The first shape transformation information is represented as [8, 12, 16] -> [8, 12, 4, 4].

[0131] Based on the dimensional information (including dimensions i and j) in the segmentation sequence, the segmentation sequence [i_outer, j_outer, i_inner, j_inner] is transposed, with the i-dimensional elements grouped together and the j-dimensional elements grouped together, resulting in the transposed sequence [i_outer, i_inner, j_outer, j_inner]. This is essentially a dimensional rearrangement of the segmentation sequence: [i_outer, j_outer, i_inner, j_inner] -> [i_outer, i_inner, j_outer, j_inner]. The transpose information is determined based on the segmentation sequence and the transposed sequence. The transpose information is represented as [8, 12, 4, 4] -> [8, 4, 12, 4], i.e., transpose = [0, 2, 1, 3], indicating that position 2 is transposed with position 1.

[0132] By grouping the relevant dimensions together as described above, it is easier to perform subsequent fusion operations. Adjacent dimensions in the transposed sequence are merged to obtain a merged sequence, and the adjacent dimensions are reshaped (i.e., a second reshape) to obtain the final shape. The second shape transformation information is determined based on the transposed sequence and the merged sequence, i.e., [i_outer(8), i_inner(4), j_outer(12), j_inner(4)] -> [i(32), j(48)], and the second shape transformation information is represented as [8, 4, 12, 4]->[32, 48].

[0133] The first shape transformation information, the transpose information, and the second shape transformation information together form the final three-segment representation information: (1) the initial reshape, which expands ij_inner: {type: RESHAPE, values: [8, 12, 4, 4]}, that is, [8, 12, 16] -> [8, 12, 4, 4]; (2) the transpose: {type: TRANSPOSE, values: [0, 2, 1, 3]}, that is, [8, 12, 4, 4] -> [8, 4, 12, 4]; (3) the final reshape: {type: RESHAPE, values: [32, 48]}, that is, [8, 4, 12, 4] -> [32, 48].
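The three-segment representation can be checked with a short plain-Python sketch: starting from a view of shape [8, 12, 16] in which each entry holds the flat destination index written by that loop iteration, applying the reshape, transpose, and reshape segments in order should yield a [32, 48] array in ordinary row-major order. The variable names below are illustrative assumptions, not part of the embodiment.

```python
I_OUTER, J_OUTER, TILE = 8, 12, 4  # loop extents from the example above

# Loop-order view, shape [8, 12, 16]: flat destination index of each iteration.
loop_view = [[[(io * TILE + ij // TILE) * 48 + (jo * TILE + ij % TILE)
               for ij in range(TILE * TILE)]
              for jo in range(J_OUTER)]
             for io in range(I_OUTER)]

# Segment 1, RESHAPE: [8, 12, 16] -> [8, 12, 4, 4] (ij_inner -> i_inner, j_inner).
reshaped = [[[[loop_view[io][jo][ii * TILE + jj] for jj in range(TILE)]
              for ii in range(TILE)]
             for jo in range(J_OUTER)]
            for io in range(I_OUTER)]

# Segment 2, TRANSPOSE [0, 2, 1, 3]: [8, 12, 4, 4] -> [8, 4, 12, 4].
transposed = [[[[reshaped[io][jo][ii][jj] for jj in range(TILE)]
                for jo in range(J_OUTER)]
               for ii in range(TILE)]
              for io in range(I_OUTER)]

# Segment 3, RESHAPE: [8, 4, 12, 4] -> [32, 48] (fuse adjacent dimensions).
final = [[transposed[i // TILE][i % TILE][j // TILE][j % TILE] for j in range(48)]
         for i in range(32)]

# True when element (i, j) of the result holds flat index i * 48 + j.
matches_row_major = all(final[i][j] == i * 48 + j
                        for i in range(32) for j in range(48))
```

The check succeeds because grouping i_outer with i_inner and j_outer with j_inner before fusing reproduces exactly the index arithmetic of the flattened loop nest.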

[0134] In this embodiment, based on the split sequence and two fuse sequences, the shape transformation information corresponding to the cyclic transformation operation and the transpose information corresponding to the dimensional rearrangement operation are determined. The shape transformation information includes the first shape transformation information corresponding to the first reshape and the second shape transformation information corresponding to the second reshape. The shape transformation information corresponding to the cyclic transformation operation and the transpose information corresponding to the dimensional rearrangement operation can be used to determine key parameters affecting the data transmission process, thereby predicting data transmission performance and improving the accuracy of performance prediction results.

[0135] S1034. Based on the shape transformation information and transpose information, determine the step size (i.e., stride) of the source shape, the step size of the target shape, and the transpose type, respectively.

[0136] In this embodiment, the source shape (src_shape) and its stride (src_strides), as well as the target shape (dst_shape) and its stride (dst_strides), are calculated using the three-segment representation information (including first shape transformation information, transpose information, and second shape transformation information). The transpose information includes the transpose type.

[0137] For example, the first expression (corresponding to reshape), i.e., the first shape transformation information, can be represented as [8, 12, 16] -> [8, 12, 4, 4], src_shape = [8, 12, 4, 4], src_strides = [192, 16, 4, 1]. Here, 192 = 12 * 4 * 4 indicates that each index along the dimension of size 8 spans 192 sub-data items; 16 = 4 * 4 indicates that each index along the dimension of size 12 spans 16 sub-data items; 4 indicates that each index along the first dimension of size 4 spans 4 sub-data items; and 1 indicates that each index along the innermost dimension of size 4 spans 1 sub-data item. The second expression (corresponding to transpose), i.e., the transpose information, can be represented as [8, 12, 4, 4] -> [8, 4, 12, 4], with transpose type transpose = [0, 2, 1, 3], indicating that dimensions 1 and 2 are rearranged, i.e., position 2 is transposed with position 1. The third expression (corresponding to reshape), i.e., the second shape transformation information, can be represented as [8, 4, 12, 4] -> [32, 48], dst_shape = [32, 48], dst_strides = [48, 1], where 48 indicates that each index along the dimension of size 32 spans 48 sub-data items, and 1 indicates that each index along the dimension of size 48 spans 1 sub-data item.
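The stride values above follow the usual row-major rule, in which the stride of a dimension is the product of the sizes of all dimensions to its right. A minimal sketch, assuming row-major layout and using the hypothetical helper name `row_major_strides`:

```python
def row_major_strides(shape):
    """Stride of each dimension, in elements, for a row-major (C-order) layout."""
    strides, step = [], 1
    for extent in reversed(shape):   # walk from the innermost dimension outward
        strides.append(step)
        step *= extent
    return strides[::-1]


src_shape = [8, 12, 4, 4]
dst_shape = [32, 48]
src_strides = row_major_strides(src_shape)   # [192, 16, 4, 1]
dst_strides = row_major_strides(dst_shape)   # [48, 1]
```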

[0138] In this embodiment, the split and fuse operations in the target dataset are detected separately to obtain split recognition results and fuse recognition results. A target sequence is generated based on the split and fuse recognition results. Shape transformation information and transpose information are derived from the target sequence. Based on the shape transformation information and transpose information, key parameters affecting the data transmission process (including src_strides, dst_strides, and transpose type) are determined. These key parameters affecting the data transmission process are used to predict data transmission performance and improve the accuracy of performance prediction results.

[0139] In some embodiments, data transmission performance indicates data transmission duration; S104 in Figure 1 above can be implemented in the following way: The bandwidth of the data access unit is predicted by a hardware simulation model using the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type, and the transmission bandwidth is output; the data transmission duration is determined based on the transmission bandwidth and the amount of data to be transmitted.

[0140] In this embodiment, after obtaining the step size of the source shape, the step size of the target shape, and the transpose type, this information is input into the cost model of DMA. The cost model analyzes different media (including DM and DDR) and analyzes whether transpose exists and the dimensional position of transpose, thereby predicting the transmission bandwidth of the relevant operators of DMA under the current scheduling mechanism (corresponding to the partitioning strategy).

[0141] For example, if a transpose operation is performed on the data, the entire data transfer is discontinuous, or only continuous in a small local area. For example, position 0 represents the outermost dimension, and position 3 represents the innermost dimension. `transpose = [0, 1, 2, 3]` indicates that no dimension rearrangement occurred, and the data transfer is continuous. `transpose = [0, 2, 1, 3]` indicates that the two middle dimensions (dimension 1 and dimension 2) were rearranged. In this scenario, the continuous data size transferred is only the size of the innermost dimension, and the entire data transfer is discontinuous.

[0142] Based on this, a mapping table is established in this embodiment according to the rearranged dimensions, as shown in Figure 3, which is a schematic diagram of the relationship between transpose type and bandwidth provided by this embodiment. Both the strides of the shapes in the three-segment representation information and the layers between which dimensions are transposed affect the bandwidth. By inputting the three-segment representation information into the cost model for simulation, the transmission bandwidth can be obtained. The mapping table shown in Figure 3 is a theoretical analysis of some transpose types and their transmission bandwidths, and it is consistent with the simulation results. As can be seen from Figure 3, transpose type [0, 1, 2, 3] involves no transposition and accesses the buffer continuously; its DDR bandwidth is optimal (the maximum bandwidth). Transpose type [1, 0, 2, 3] transposes the two outermost layers, with a large stride span; its DDR bandwidth is relatively good (medium bandwidth). Transpose type [0, 2, 1, 3] transposes the second outermost and second innermost layers, with a large stride span; its DDR bandwidth is relatively poor (small bandwidth). Transpose type [3, 1, 2, 0] transposes the outermost and innermost layers and is completely discontinuous; its DDR bandwidth is the worst (the minimum bandwidth).
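One way to see why these four transpose types rank in this order is to count how many elements remain contiguous after the rearrangement, i.e., the product of the innermost dimensions that the permutation leaves in place. The sketch below is only an illustrative proxy under that assumption; it is not the cost model or the mapping table of the embodiment, and the function name `contiguous_run` is hypothetical.

```python
def contiguous_run(shape, perm):
    """Number of elements that stay adjacent: product of trailing dims left in place."""
    run = 1
    for axis in range(len(perm) - 1, -1, -1):   # scan from the innermost dimension
        if perm[axis] != axis:
            break                               # first moved dimension ends the run
        run *= shape[axis]
    return run


shape = [8, 12, 4, 4]   # source shape from the example above
runs = {
    perm: contiguous_run(shape, perm)
    for perm in [(0, 1, 2, 3), (1, 0, 2, 3), (0, 2, 1, 3), (3, 1, 2, 0)]
}
```

For this shape, the contiguous run is 1536 elements for [0, 1, 2, 3], 16 for [1, 0, 2, 3], 4 for [0, 2, 1, 3], and 1 for [3, 1, 2, 0], which is the same ordering as the bandwidth classes described for Figure 3.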

[0143] It should be noted that Figure 3 above only shows the relationship between some transpose types and bandwidth. These transpose types are representative. It is understood that there are other forms of transpose types, such as [0, 1, 3, 2], [2, 1, 0, 3], [1, 0, 3, 2], [3, 2, 1, 0], etc. Other forms of transpose types also have their own corresponding bandwidths, which are not limited in this embodiment of the application.

[0144] Data transmission performance is related not only to the transpose type, as shown in Figure 3 above, but also to the actual size of each dimension. Based on this, the actual data transmission performance can be simulated according to the actual size of each dimension and the mapping relationship between the transpose type and the bandwidth. That is, the transpose type, src_strides, and dst_strides are simulated using the cost model, the transmission bandwidth is output, and the amount of data to be transmitted is divided by the transmission bandwidth to obtain the cycle (i.e., the data transmission duration).

[0145] In this embodiment, key parameters affecting data transmission are input into the DMA cost model. The cost model is used to predict the DMA bandwidth and output the transmission bandwidth. Then, the amount of data to be transmitted is divided by the transmission bandwidth to obtain the cycle. The cost model is a hardware simulation model based on software logic. It can predict data transmission performance without directly running data transfer on the hardware, reducing operational difficulty and resource consumption.
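The final division step can be sketched as follows. The efficiency factors and the original bandwidth below are made-up placeholder numbers, not values from Figure 3 or from the embodiment, and this toy version keys only on the transpose type, whereas the cost model described above also takes src_strides and dst_strides into account.

```python
# Hypothetical efficiency factors per transpose type (placeholders for illustration).
TOY_EFFICIENCY = {
    (0, 1, 2, 3): 1.0,     # no transpose: maximum bandwidth
    (1, 0, 2, 3): 0.5,     # medium bandwidth
    (0, 2, 1, 3): 0.25,    # small bandwidth
    (3, 1, 2, 0): 0.125,   # minimum bandwidth
}


def predict_cycle(data_bytes, original_bandwidth, transpose):
    """Cycle = amount of data to be transmitted / predicted transmission bandwidth."""
    bandwidth = original_bandwidth * TOY_EFFICIENCY[tuple(transpose)]
    return data_bytes / bandwidth


data_bytes = 32 * 48 * 4          # 1536 float32 elements
original_bandwidth = 64.0         # bytes per cycle, an assumed figure
cycle_contiguous = predict_cycle(data_bytes, original_bandwidth, [0, 1, 2, 3])
cycle_transposed = predict_cycle(data_bytes, original_bandwidth, [0, 2, 1, 3])
```

With these placeholder numbers, the transposed transfer takes four times as many cycles as the contiguous one, since its assumed bandwidth is one quarter of the original bandwidth.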

[0146] In some embodiments, data transmission performance indicates data transmission duration; this application also provides a data transmission method (also known as a data moving optimization method). After S104 in Figure 1 above, the data to be transmitted is further segmented according to other scheduling mechanisms of the data access unit, and the data transmission duration corresponding to each other scheduling mechanism is predicted; among the data transmission durations corresponding to multiple scheduling mechanisms, the scheduling mechanism corresponding to the minimum data transmission duration is selected, so that the data access unit moves the data to be transmitted from the source storage unit to the target storage unit according to the scheduling mechanism corresponding to the minimum data transmission duration.

[0147] In this embodiment, the data to be transmitted is split according to different partitioning strategies using other DMA scheduling mechanisms. Then, cycle prediction is performed. Each scheduling mechanism corresponds to a partitioning strategy and a cycle. After repeatedly executing S101-S104, the smallest cycle can be selected from multiple cycles, and the smallest cycle and its corresponding scheduling mechanism (i.e., partitioning strategy) are saved. The data transmission performance corresponding to this scheduling mechanism is relatively good, and this scheduling mechanism is used as the optimal partitioning strategy for the relevant DMA operators (i.e., data transfer operators). Thus, after the DMA partitions the data to be transmitted in DDR according to this scheduling mechanism and transfers it from DDR to DM, data transmission efficiency can be improved.

[0148] The characteristics of DMA hardware are that the transpose type and step size (including the step size of the source shape and the step size of the target shape) affect the transmission performance. Based on this, this application provides a data transfer optimization method based on the characteristics of DMA hardware. This data transfer optimization method realizes the conversion process from high-level data (corresponding to the data to be transferred) to low-level DMA (corresponding to the three-segment representation information). Through automatic segmentation, automatic detection and optimization of cyclic transformation modes, and then through the constructed DMA cost model, it accurately predicts the data transfer performance under different data transfer scenarios, automatically generates the optimal segmentation strategy, reduces the optimization difficulty, and effectively avoids the losses caused by manual trial and error. It not only improves development efficiency but also ensures the reliability and stability of the optimization results. When dealing with complex multidimensional data transfer scenarios, the data transfer method provided by this application can automatically complete dimension segmentation, cyclic transformation operations, and dimension rearrangement operations, reducing the time required for manual optimization. Moreover, it can predict the data transfer performance without directly running data transfer on the hardware, thus improving the data transfer efficiency.

[0149] The following will describe an exemplary application of the embodiments of this application in a real-world application scenario.

[0150] Figure 4 is a flowchart of a data transmission method provided in an embodiment of this application. Taking the data transfer operator as an example of the DMA-related operator, the data transmission method is described. The data transfer operator indicates the operator operation (OP), which may include reindex operation, concat operation, split operation, etc. The reindex operation includes floor operations and modulo operations (i.e., remainder operations), and the concat operation is used to join two or more arrays.

[0151] S201. Use the DMA scheduling mechanism to divide the data to be transmitted into a data block set.

[0152] The DMA schedule is used to split the data to be transferred, generating different data tiles. The data block set includes multiple data tiles.

[0153] S202. Perform hardware language conversion on the data block set to generate the tir source language expression form of the data transfer operator.

[0154] S203. Flatten the multidimensional data buffer corresponding to the tir source language expression form.

[0155] S204. Detect the segmentation and fusion operations contained in the flattened buffer.

[0156] The split and fuse operations contained in the flattened buffer are detected, and the split recognition results and fuse recognition results are obtained respectively.

[0157] S205. Generate shape transformation sequence and transpose sequence based on segmentation recognition results and fusion recognition results.

[0158] Based on the split recognition results and fuse recognition results, a reshape sequence (i.e., shape transformation information) and a transpose sequence (i.e., transpose information) are generated.

[0159] S206. Based on the shape transformation sequence and transpose sequence, calculate the step sizes of the source shape and target shape of the data transfer operator in different media, as well as the transpose type.

[0160] The strides of `src_shape` and `dst_shape` of the data transfer operator in different media are calculated based on the reshape sequence. These strides include the strides of `src_shape` in DDR (i.e., `src_strides`) and the strides of `dst_shape` in DM (i.e., `dst_strides`). The transpose type is calculated based on the transpose sequence. The transpose type indicates the effective transfer size in different dimensions.

[0161] S207. Input the step size of the source shape and target shape of the data transfer operator in different media, as well as the transpose type, into the hardware simulation model to predict the data transmission time.

[0162] By inputting the src_strides, dst_strides, and transpose type of the data transfer operators into the DMA cost model, cycle prediction can be achieved. This allows for the prediction of the actual data transfer performance of the DMA.

[0163] The DMA scheduling mechanism is used to split the data to be transmitted according to different partitioning strategies, and then cycle prediction is performed, that is, the above S201-S207 are executed repeatedly to obtain multiple cycles.

[0164] S208. Determine whether the preset number of loops has been reached.

[0165] If yes, continue with S209; otherwise, execute S201-S207 again. After each loop, save the smaller of the current predicted cycle and the previously saved cycle, along with its corresponding tile information (i.e., scheduling mechanism or partitioning strategy).

[0166] The more loops are executed, the greater the resource consumption (or loop duration), and the higher the DMA transfer efficiency that can be achieved. Based on this, the aforementioned preset number of loops (i.e., the number of cycle predictions) can be a fixed value, such as 20, 50, or 70. Alternatively, the preset number of loops can be appropriately set by those skilled in the art according to actual conditions, balancing resource consumption (or loop duration) and data transfer efficiency. Alternatively, the preset number of loops can cover all segmentation strategies for the data to be transferred; this embodiment does not limit this.

[0167] S209. Save the minimum data transfer time and its corresponding scheduling mechanism, and use it as the optimal DMA transfer method for the data transfer operator.

[0168] Following steps S201-S208, the tile information corresponding to the smallest cycle (i.e., the segmentation strategy) can be selected from multiple cycles. Using this tile information as the optimal DMA transfer method for the data transfer operator, the DMA transfers the data to be transferred from DDR to DM according to this method, improving transfer efficiency and thus data processing efficiency.
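The selection loop of S201-S209 can be sketched as below. The candidate tile shapes and their predicted cycles are placeholder values invented for illustration, and `select_schedule` is a hypothetical name; in the embodiment the prediction for each candidate would come from the DMA cost model.

```python
def select_schedule(candidates, predict_cycle, max_attempts):
    """Return (schedule, cycle) with the smallest predicted cycle among the tried candidates."""
    best_schedule, best_cycle = None, float("inf")
    for attempt, schedule in enumerate(candidates):
        if attempt >= max_attempts:          # S208: preset number of loops reached
            break
        cycle = predict_cycle(schedule)      # S201-S207: segment, analyze, predict
        if cycle < best_cycle:               # keep the smaller cycle and its tile info
            best_schedule, best_cycle = schedule, cycle
    return best_schedule, best_cycle         # S209: saved as the transfer method


# Placeholder predictions keyed by tile shape (made-up numbers).
toy_cycles = {(4, 4): 384.0, (8, 8): 192.0, (16, 16): 256.0}

best = select_schedule(list(toy_cycles), toy_cycles.__getitem__, max_attempts=50)
first_only = select_schedule(list(toy_cycles), toy_cycles.__getitem__, max_attempts=1)
```

Limiting `max_attempts` trades search cost against the quality of the selected schedule: with a single attempt, only the first candidate is evaluated and returned.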

[0169] It should be noted that the data transmission method provided in this application embodiment can also be used to predict the transmission process from DM to DDR. In this transmission process, there is no segmentation process or multiple cyclic prediction process. The above S202-S207 can be executed. Alternatively, the data to be transmitted in DM can be taken as a data block. Since the data to be transmitted in DM is continuously generated, a set of data blocks to be transmitted can be obtained, and then the above S202-S207 can be executed. Taking DM as the source storage unit and DDR as the target storage unit as an example, the set of data blocks to be transferred corresponding to DM is converted into hardware language to generate the tir source language expression form of the data transfer operator. The multi-dimensional data buffer corresponding to the tir source language expression form is flattened, and the split and fuse operations contained in the flattened buffer are detected. Based on the split recognition results and fuse recognition results, a reshape sequence (i.e., shape transformation information) and a transpose sequence (i.e., transpose information) are generated. The strides of src_shape in DM and dst_shape in DDR corresponding to the data transfer operator are calculated based on the reshape sequence. The transpose type is calculated based on the transpose sequence. The strides of src_shape and dst_shape of the data transfer operator in different media and the transpose type are input into the cost model of DMA to predict the cycle.

[0170] This application embodiment constructs a DMA cost model based on the chip's DMA hardware module. This cost model integrates data transfer in different data transmission scenarios (i.e., segmenting the data to be transferred in DDR according to different DMA schedules). In each data transmission scenario, the cost model can predict the bandwidth, and thus calculate the data transmission performance (i.e., data transmission duration) for each scenario. Based on this cost model, a DMA schedule for segmenting the data to be transferred is constructed. When segmenting DMA-related operators, the dimensions and the segmentation size for each dimension are selected, generating a multi-level nested loop structure (see the three for loops following the flattening operation of the multi-dimensional data buffer mentioned above). Large data transmission tasks are decomposed into multiple data tiles suitable for DM capacity. Through optimization of loop transformation operations and dimension rearrangement operations, the source shape and its step size, the target shape and its step size, and the transpose type are obtained. This information is input into the DMA cost model, outputting the transmission bandwidth under this schedule. The amount of data to be transferred is divided by this transmission bandwidth to obtain the cycle. The cycle prediction process under different schedules is executed repeatedly, and the schedule corresponding to the smallest cycle is selected as the optimal DMA transfer method. The DMA then uses this transfer method to move the data to be transferred from DDR to DM, improving transfer efficiency and consequently data processing efficiency.

[0171] Based on the data transmission performance prediction method provided in the above embodiments, Figure 5 is a schematic diagram of a data transmission performance prediction device provided in an embodiment of this application. This device can be implemented as part or all of a computer device by software, hardware, or a combination of both. Referring to Figure 5, the data transmission performance prediction device 50 includes: a segmentation module 501, used to segment the data to be transmitted according to the scheduling mechanism of the data access unit, based on the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, to obtain a data block set, where the data access unit is used for data transfer between the source storage unit and the target storage unit; a conversion module 502, used to convert the expression of the data block set to obtain a target dataset, so that the expression of the target dataset is applicable to transformation analysis; a transformation analysis module 503, used to perform transformation analysis on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type; and a prediction module 504, used to predict the data transmission performance using the hardware simulation model of the data access unit, based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type.

[0172] Optionally, the conversion module 502 is also used to perform hardware language conversion on the data block set to obtain a preset format dataset; the preset format dataset is a multi-dimensional index; and to perform dimensional conversion on the preset format dataset to obtain a target dataset; the target dataset is a one-dimensional index.
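The conversion from a multi-dimensional index to a one-dimensional index can be sketched as below, assuming a row-major layout. The function name `flatten_index` is hypothetical; the application does not specify the layout or the exact conversion routine.

```python
from typing import Sequence


def flatten_index(index: Sequence[int], shape: Sequence[int]) -> int:
    """Map a multi-dimensional index to a one-dimensional row-major offset."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same rank")
    offset = 0
    for i, extent in zip(index, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of range for its dimension")
        offset = offset * extent + i   # accumulate one dimension at a time
    return offset


# Element (1, 2, 3) of a 2 x 3 x 4 buffer sits at offset 1*12 + 2*4 + 3 = 23.
offset = flatten_index((1, 2, 3), (2, 3, 4))
```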

[0173] Optionally, the transformation analysis module 503 is also used to detect shape transformations in the target dataset and obtain detection results; generate a target sequence based on the detection results; derive shape transformation information and transpose information based on the target sequence; and determine the step size of the source shape, the step size of the target shape, and the transpose type based on the shape transformation information and transpose information.

[0174] Optionally, the shape transformation includes the transformation corresponding to the segmentation operation and the transformation corresponding to the fusion operation. Correspondingly, the detection result includes the segmentation recognition result and the fusion recognition result. The transformation analysis module 503 is also used to take the existence of a segmentation operation as the segmentation recognition result when the index of the target dataset contains a rounding operator and a modulo operator; and to take the existence of a fusion operation as the fusion recognition result when the index of the target dataset contains a multiplication operator and an addition operator and includes multiple loop variables.
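The two recognition rules above can be sketched on index expressions written as Python strings. This is an illustration only: it assumes the rounding operator corresponds to floor division (`//`), uses Python's standard `ast` module in place of whatever intermediate representation the embodiments operate on, and the function name `classify_index` is hypothetical.

```python
import ast
from typing import Dict, Set


def classify_index(expr: str, loop_vars: Set[str]) -> Dict[str, bool]:
    """Report whether a flattened index expression looks like a split or a fuse."""
    tree = ast.parse(expr, mode="eval")
    # Collect the binary operator types and the loop variables that appear.
    ops = {type(node.op) for node in ast.walk(tree) if isinstance(node, ast.BinOp)}
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)} & loop_vars
    return {
        # Split: floor division and modulo both appear in the index.
        "split": ast.FloorDiv in ops and ast.Mod in ops,
        # Fuse: multiplication and addition appear with more than one loop variable.
        "fuse": ast.Mult in ops and ast.Add in ops and len(used) > 1,
    }


split_result = classify_index("(i // 4) * 8 + i % 4", {"i"})
fuse_result = classify_index("i * 16 + j", {"i", "j"})
```

Here `split_result` reports a split only (a single loop variable is decomposed), while `fuse_result` reports a fuse only (two loop variables are combined into one index).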

[0175] Optionally, the detection results include segmentation recognition results and fusion recognition results; the transformation analysis module 503 is further configured to generate a first sequence based on the segmentation recognition results, where the first sequence indicates the index of the sub-data included in each data block in the target dataset; and to generate a second sequence and a third sequence based on the fusion recognition results, where the second sequence indicates the index of the data block included on the horizontal axis in the target dataset, and the third sequence indicates the index of the data block included on the vertical axis in the target dataset; the target sequence includes the first sequence, the second sequence, and the third sequence.

[0176] Optionally, the target sequence includes a first sequence, a second sequence, and a third sequence; the transformation analysis module 503 is further configured to determine first shape transformation information based on the target sequence and a segmentation sequence, where the segmentation sequence is obtained by segmenting the target sequence; determine the transpose information based on the segmentation sequence and a transpose sequence, where the transpose sequence is obtained by transposing the segmentation sequence based on the dimension information in the segmentation sequence; and determine second shape transformation information based on the transpose sequence and a merged sequence, where the merged sequence is obtained by merging adjacent dimensions in the transpose sequence; the shape transformation information includes the first shape transformation information and the second shape transformation information.
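The segment, transpose, and merge steps above can be sketched on plain shape lists. This is a simplified reading that tracks shapes only and ignores memory contiguity; the helper names `split_axis`, `transpose_axes`, and `merge_axes` and the example shape `[8, 6]` are hypothetical.

```python
from typing import List, Sequence


def split_axis(shape: Sequence[int], axis: int, factor: int) -> List[int]:
    """Split shape[axis] into (shape[axis] // factor, factor)."""
    if shape[axis] % factor != 0:
        raise ValueError("factor must divide the axis extent")
    shape = list(shape)
    return shape[:axis] + [shape[axis] // factor, factor] + shape[axis + 1:]


def transpose_axes(shape: Sequence[int], order: Sequence[int]) -> List[int]:
    """Permute the axes of `shape` according to `order`."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the axes")
    return [shape[axis] for axis in order]


def merge_axes(shape: Sequence[int], axis: int) -> List[int]:
    """Merge the adjacent axes `axis` and `axis + 1` into one."""
    shape = list(shape)
    return shape[:axis] + [shape[axis] * shape[axis + 1]] + shape[axis + 2:]


target_seq = [8, 6]
split_seq = split_axis(target_seq, axis=0, factor=4)   # [2, 4, 6]
order = [0, 2, 1]
transpose_seq = transpose_axes(split_seq, order)       # [2, 6, 4]
merged_seq = merge_axes(transpose_seq, axis=0)         # [12, 4]

first_reshape = (target_seq, split_seq)       # first shape transformation information
transpose_info = order                        # transpose information
second_reshape = (transpose_seq, merged_seq)  # second shape transformation information
```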

[0177] Optionally, the data transmission performance indicates the data transmission duration; the prediction module 504 is also used to input the step size of the source shape, the step size of the target shape, the original bandwidth and the transpose type into the hardware simulation model, predict the bandwidth of the data access unit through the hardware simulation model, and output the transmission bandwidth; and determine the data transmission duration based on the transmission bandwidth and the amount of data to be transmitted.
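The relationship between the predicted transmission bandwidth and the data transmission duration can be sketched as below. The function `toy_bandwidth` is a placeholder with assumed penalty factors of 0.5; it is not the hardware simulation model of this application, whose internals are not reproduced here. Only the final division of data amount by transmission bandwidth follows the description directly.

```python
from typing import Sequence


def toy_bandwidth(original_bandwidth: float,
                  src_strides: Sequence[int],
                  dst_strides: Sequence[int],
                  transpose_type: str) -> float:
    """Placeholder for the hardware simulation model (assumed penalties only)."""
    efficiency = 1.0
    if src_strides and src_strides[-1] != 1:   # non-contiguous reads from the source
        efficiency *= 0.5
    if dst_strides and dst_strides[-1] != 1:   # non-contiguous writes to the target
        efficiency *= 0.5
    if transpose_type != "none":               # transposing occupies extra bandwidth
        efficiency *= 0.5
    return original_bandwidth * efficiency


def transfer_duration(data_amount: float, transmission_bandwidth: float) -> float:
    """Duration = amount of data to be transmitted / transmission bandwidth."""
    if transmission_bandwidth <= 0:
        raise ValueError("transmission bandwidth must be positive")
    return data_amount / transmission_bandwidth


# Assumed units: bytes and bytes per cycle, so the result is in cycles.
contiguous = transfer_duration(12800, toy_bandwidth(64.0, [6, 1], [4, 1], "none"))
transposed = transfer_duration(12800, toy_bandwidth(64.0, [1, 6], [4, 1], "2d"))
```

Under these assumed penalties the contiguous transfer takes 200 cycles and the transposed one 800 cycles, which illustrates why the strides and the transpose type are inputs to the model alongside the original bandwidth.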

[0178] Optionally, the source storage unit is a dynamic memory, the target storage unit is a static memory, the storage capacity of the target storage unit is smaller than the storage capacity of the source storage unit, the transmission efficiency of the target storage unit is greater than the transmission efficiency of the source storage unit, and the original bandwidth is the transmission bandwidth of the source storage unit.

[0179] Optionally, both the source storage unit and the target storage unit are static memories, and the original bandwidth is the transmission bandwidth of either the source storage unit or the target storage unit.

[0180] Optionally, the data transmission performance indicates the data transmission duration; the data transmission performance prediction device 50 further includes a filtering module 505, which is used to continue to segment the data to be transmitted according to other scheduling mechanisms of the data access unit, and predict the data transmission duration corresponding to each other scheduling mechanism; among the data transmission durations corresponding to multiple scheduling mechanisms, the scheduling mechanism corresponding to the minimum data transmission duration is selected, so that the data access unit moves the data to be transmitted from the source storage unit to the target storage unit according to the scheduling mechanism corresponding to the minimum data transmission duration.
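The selection performed by the filtering module can be sketched as a minimum search over predicted durations. The function name `select_schedule`, the schedule labels, and the duration values are hypothetical placeholders; in practice the predictor would be the hardware simulation model described above.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

S = TypeVar("S")


def select_schedule(schedules: Iterable[S],
                    predict_duration: Callable[[S], float]) -> Tuple[S, float]:
    """Return the schedule with the smallest predicted transmission duration."""
    best: Optional[Tuple[S, float]] = None
    for schedule in schedules:
        duration = predict_duration(schedule)
        if best is None or duration < best[1]:
            best = (schedule, duration)
    if best is None:
        raise ValueError("at least one schedule is required")
    return best


# Assumed predictions (in cycles) for three candidate scheduling mechanisms.
predicted = {"tile_32x64": 800.0, "tile_16x128": 650.0, "tile_64x32": 900.0}
best_schedule, best_duration = select_schedule(predicted, predicted.__getitem__)
```

With these assumed values the search returns `tile_16x128` with a duration of 650 cycles; ties are resolved in favor of the first candidate encountered.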

[0181] It should be noted that the data transmission performance prediction device provided in the above embodiments is only illustrated by the division of the above functional modules when predicting data transmission performance. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0182] The functional units and modules in the above embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of the embodiments of this application.

[0183] The data transmission performance prediction device and the data transmission performance prediction method provided in the above embodiments belong to the same concept. The specific working process and technical effects of the units and modules in the above embodiments can be found in the method embodiments section, and will not be repeated here.

[0184] Based on the data transmission performance prediction method provided in the above embodiments, Figure 6 is a schematic diagram of the structure of a computer device provided in an embodiment of this application. As shown in Figure 6, the computer device 60 includes: a processor 601, a memory 602, and a computer program 603 stored in the memory 602 and executable on the processor 601. When the processor 601 executes the computer program 603, it implements the steps in the data transmission performance prediction method in the above embodiments.

[0185] Computer device 60 can be a general-purpose computer device or a special-purpose computer device. In specific implementations, computer device 60 can be a desktop computer, portable computer, network server, handheld computer, mobile phone, tablet computer, wireless terminal device, communication device, or embedded device. The embodiments of this application do not limit the type of computer device 60. Those skilled in the art will understand that Figure 6 is merely an example of computer device 60 and does not constitute a limitation on computer device 60. Computer device 60 may include more or fewer components than shown, combine certain components, or use different components, and may further include, for example, input/output devices and network access devices.

[0186] Processor 601 can be a Central Processing Unit (CPU), or it can be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0187] In some embodiments, memory 602 may be an internal storage unit of computer device 60, such as a hard disk or memory of computer device 60. In other embodiments, memory 602 may be an external storage device of computer device 60, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on computer device 60. Furthermore, memory 602 may include both an internal storage unit and an external storage device of computer device 60. Memory 602 is used to store an operating system, application programs, a boot loader, data, and other programs. Memory 602 may also be used to temporarily store data that has been output or will be output.

[0188] This application also provides a computer device, which includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor executes the computer program to implement the steps in any of the above method embodiments.

[0189] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, can implement the steps in the various method embodiments described above.

[0190] This application provides a computer program product that, when run on a computer, causes the computer to perform the steps described in the various method embodiments above.

[0191] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or some intermediate form. The computer-readable medium can include at least: any entity or device capable of carrying the computer program code to a photographing device / terminal device, a recording medium, a computer memory, ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, and optical data storage devices. The computer-readable storage medium mentioned in this application can be a non-volatile storage medium; in other words, it can be a non-transitory storage medium.

Claims

1. A method for predicting data transmission performance, characterized in that the method includes: based on the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, the data to be transmitted is divided into a set of data blocks according to the scheduling mechanism of the data access unit, where the data access unit is used for data transfer between the source storage unit and the target storage unit; the expression of the data block set is transformed to obtain the target dataset, so that the expression of the target dataset is applicable to transformation analysis; transformation analysis is performed on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type; and, using the hardware simulation model of the data access unit, the data transmission performance is predicted based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type.

2. The method as described in claim 1, characterized in that transforming the expression of the data block set to obtain the target dataset includes: converting the data block set to a hardware language to obtain a preset format dataset, where the preset format dataset is a multi-dimensional index; and performing a dimensional transformation on the preset format dataset to obtain the target dataset, where the target dataset is a one-dimensional index.

3. The method as described in claim 1, characterized in that performing transformation analysis on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type includes: detecting the shape transformations in the target dataset to obtain detection results; generating a target sequence based on the detection results; deriving shape transformation information and transpose information based on the target sequence; and determining the step size of the source shape, the step size of the target shape, and the transpose type based on the shape transformation information and the transpose information.

4. The method as described in claim 3, characterized in that the shape transformation includes the transformation corresponding to the segmentation operation and the transformation corresponding to the fusion operation; correspondingly, the detection result includes the segmentation recognition result and the fusion recognition result; detecting the shape transformations in the target dataset to obtain detection results includes: when the index of the target dataset contains a rounding operator and a modulo operator, taking the existence of a segmentation operation as the segmentation recognition result; and when the index of the target dataset contains a multiplication operator and an addition operator and includes multiple loop variables, taking the existence of a fusion operation as the fusion recognition result.

5. The method as described in claim 3 or 4, characterized in that the detection results include segmentation recognition results and fusion recognition results; generating the target sequence based on the detection results includes: generating a first sequence based on the segmentation recognition results, where the first sequence indicates the index of the sub-data included in each data block in the target dataset; and generating a second sequence and a third sequence based on the fusion recognition results, where the second sequence indicates the index of the data block included on the horizontal axis in the target dataset, and the third sequence indicates the index of the data block included on the vertical axis in the target dataset; the target sequence includes the first sequence, the second sequence, and the third sequence.

6. The method as described in claim 3 or 4, characterized in that the target sequence includes a first sequence, a second sequence, and a third sequence; deriving shape transformation information and transpose information based on the target sequence includes: determining first shape transformation information based on the target sequence and a segmentation sequence, where the segmentation sequence is obtained by segmenting the target sequence; determining the transpose information based on the segmentation sequence and a transpose sequence, where the transpose sequence is obtained by transposing the segmentation sequence based on the dimension information in the segmentation sequence; and determining second shape transformation information based on the transpose sequence and a merged sequence, where the merged sequence is obtained by merging adjacent dimensions in the transpose sequence; the shape transformation information includes the first shape transformation information and the second shape transformation information.

7. The method according to any one of claims 1-3, characterized in that the data transmission performance indicates the data transmission duration; predicting the data transmission performance using the hardware simulation model of the data access unit, based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type, includes: inputting the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type into the hardware simulation model; predicting the bandwidth of the data access unit through the hardware simulation model and outputting the transmission bandwidth; and determining the data transmission duration based on the transmission bandwidth and the amount of data to be transmitted.

8. The method according to any one of claims 1-3, characterized in that the source storage unit is a dynamic memory, the target storage unit is a static memory, the storage capacity of the target storage unit is smaller than the storage capacity of the source storage unit, and the transmission efficiency of the target storage unit is greater than the transmission efficiency of the source storage unit; the original bandwidth is the transmission bandwidth of the source storage unit.

9. The method according to any one of claims 1-3, characterized in that the data transmission performance indicates the data transmission duration; after predicting the data transmission performance using the hardware simulation model of the data access unit based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type, the method further includes: continuing to segment the data to be transmitted according to other scheduling mechanisms of the data access unit, and predicting the data transmission duration corresponding to each of the other scheduling mechanisms; and, among the data transmission durations corresponding to the multiple scheduling mechanisms, selecting the scheduling mechanism corresponding to the minimum data transmission duration, so that the data access unit moves the data to be transmitted from the source storage unit to the target storage unit according to the scheduling mechanism corresponding to the minimum data transmission duration.

10. A data transmission performance prediction device, characterized in that the device includes: a segmentation module, used to segment the data to be transmitted according to the scheduling mechanism of the data access unit, based on the storage capacity of the target storage unit and the data information of the data to be transmitted in the source storage unit, to obtain a set of data blocks, where the data access unit is used for data transfer between the source storage unit and the target storage unit; a transformation module, used to transform the expression of the data block set to obtain the target dataset, so that the expression of the target dataset is suitable for transformation analysis; a transformation analysis module, used to perform transformation analysis on the target dataset to obtain the step size of the source shape, the step size of the target shape, and the transpose type; and a prediction module, used to predict the data transmission performance based on the step size of the source shape, the step size of the target shape, the original bandwidth, and the transpose type using the hardware simulation model of the data access unit.

11. A computer device, characterized in that the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method as described in any one of claims 1-9.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method as described in any one of claims 1-9.