An online generation method, device and equipment of a segmentation strategy and a storage medium
By identifying target operators in the AI platform and combining them with hardware specifications and time cost models, a segmentation strategy is determined, which solves the problem of not being able to identify efficient segmentation strategies in real time in the online state, reduces maintenance costs and improves computing performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI SUIYUAN TECH CO LTD
- Filing Date
- 2022-11-29
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies cannot determine the operator segmentation strategy of an AI platform in real time while online, resulting in underutilization of computing performance and high maintenance costs.
By acquiring the target operator from the machine learning model, combining it with the hardware specifications and time cost model of the AI platform, multiple alternative segmentation patterns are determined, and the time cost of each segmentation pattern is evaluated. Finally, the target segmentation strategy is obtained by combining them, and online computation is achieved.
It enables online real-time identification of efficient segmentation strategies, reduces human intervention and maintenance costs, improves operator execution efficiency, and fully leverages the computing performance of the AI platform.
Description
Technical Field
[0001] The embodiments of the present invention relate to computer hardware technology, and in particular to an online generation method, apparatus, device and storage medium for segmentation strategies.
Background Technology
[0002] Once developed, AI platforms generally possess theoretical computational performance. However, the software stack of an AI platform contains multiple built-in operators. During the implementation of these operators, different segmentation strategies for the input data stream determine the data storage method and bandwidth requirements, thus significantly impacting the computational performance of the AI platform.
[0003] In existing technologies, segmentation strategies designed manually or generated automatically are primarily verified through field testing. However, these tests all require offline execution. When online, these methods only cover a subset of operators and cannot provide segmentation strategies for operators that were not searched offline. Therefore, they are unsuitable for online environments. Furthermore, the maintenance cost of these methods is high. When modifications are needed (such as adjusting the frequency of some hardware units or changing the software scheduling of hardware), the entire segmentation strategy search must be performed again. Additionally, the segmentation strategies determined offline require manual integration into the AI platform by developers, resulting in high labor costs. Therefore, determining operator segmentation strategies online and in real time to fully leverage the computational performance of the AI platform is a pressing issue that needs to be addressed.
Summary of the Invention
[0004] This invention provides an online generation method, apparatus, device, and storage medium for segmentation strategies, enabling real-time online determination of operator segmentation strategies and fully leveraging the computational performance of AI platforms.
[0005] In a first aspect, embodiments of the present invention provide an online method for generating a segmentation strategy, comprising:
[0006] The system retrieves the machine learning model currently loaded onto the AI platform and identifies the target operator within the machine learning model. The AI platform includes multi-level storage space, at least one DMA unit for data transfer between the multi-level storage spaces, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0007] Based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, multiple alternative partitioning patterns matching the target operator are determined. The partitioning pattern includes the partitioning method of each tensor dimension of each operator parameter in each level of storage space and the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0008] Obtain a time cost model that matches the AI platform, and use the time cost model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop method.
[0009] Based on the time cost, determine the target segmentation pattern and the target operation loop method, and combine the target segmentation pattern and the target operation loop method to obtain the target segmentation strategy;
[0010] During the execution of the machine learning model, online computation of the target operator is performed according to the target segmentation strategy.
[0011] In a second aspect, embodiments of the present invention also provide an online generation apparatus for segmentation strategies, the apparatus comprising:
[0012] The target operator identification module is used to acquire the machine learning model currently loaded into the AI platform and identify the target operator in the machine learning model. The AI platform includes multi-level storage space, at least one DMA unit for data transfer between multi-level storage spaces, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0013] The segmentation pattern determination module is used to determine multiple alternative segmentation patterns that match the target operator based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform. The segmentation pattern includes the segmentation method of each tensor dimension of each operator parameter in each level of storage space and the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0014] The time cost assessment module is used to obtain a time cost model that matches the AI platform, and to use the time cost model to assess the time cost of each alternative segmentation pattern under at least one operation loop method.
[0015] The segmentation strategy determination module is used to determine the target segmentation pattern and the target operation loop method based on the time cost, and to combine the target segmentation pattern and the target operation loop method to obtain the target segmentation strategy;
[0016] The online computation module is used to perform online computations on the target operator according to the target segmentation strategy during the execution of the machine learning model.
[0017] In a third aspect, embodiments of the present invention also provide an electronic device, the electronic device comprising:
[0018] At least one processor; and
[0019] A memory communicatively connected to the at least one processor; wherein,
[0020] The memory stores a computer program that can be executed by the at least one processor, which enables the at least one processor to perform an online generation method for a segmentation strategy as described in any embodiment of the present invention.
[0021] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer instructions, which are used to cause a processor to execute an online generation method for a segmentation strategy as described in any embodiment of the present invention.
[0022] This invention obtains the machine learning model currently loaded onto the AI platform and identifies the target operator within it. Based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, it determines multiple alternative segmentation patterns matching the target operator. It then obtains a time cost model matching the AI platform and uses this model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop method. Based on the time cost, it determines the target segmentation pattern and the target operation loop method and combines them into the target segmentation strategy. Finally, during the execution of the machine learning model, online computation is performed on the target operator according to the target segmentation strategy. This solves the problem in existing technologies that efficient segmentation strategies for each operator cannot be identified online in real time and that maintenance costs are high. It meets the need to generate operator segmentation strategies online in real time, minimizes human intervention and maintenance costs, and integrates the determination and implementation of the operator segmentation strategy.
Description of the Drawings
[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a flowchart of an online generation method for a segmentation strategy according to Embodiment 1 of the present invention;
[0025] Figure 2 is a schematic diagram of the structure of an AI platform provided according to Embodiment 1 of the present invention;
[0026] Figure 3 is a flowchart of an online generation method for a segmentation strategy according to Embodiment 2 of the present invention;
[0027] Figure 4 is a flowchart of an online generation method for a segmentation strategy according to Embodiment 3 of the present invention;
[0028] Figure 5 is a schematic diagram of the structure of an online generation device for a segmentation strategy according to Embodiment 4 of the present invention;
[0029] Figure 6 is a schematic diagram of the structure of an electronic device that implements the online generation method of the segmentation strategy in the embodiments of the present invention.
Detailed Implementation
[0030] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0031] It should be noted that the terms "first," "second," "target," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0032] Embodiment 1
[0033] Figure 1 is a flowchart of an online segmentation strategy generation method provided in Embodiment 1 of the present invention. This embodiment is applicable to identifying, online and in real time, segmentation strategies with favorable cost for the operators in machine learning models loaded onto an AI (Artificial Intelligence) platform. This method can be executed by an online segmentation strategy generation device, which can be implemented in hardware and/or software and can be configured within the AI platform. As shown in Figure 1, the method includes:
[0034] S110. Obtain the machine learning model currently loaded onto the AI platform and identify the target operator in the machine learning model.
[0035] The AI platform includes multi-level storage space, at least one DMA (Direct Memory Access) unit for data transfer between the multi-level storage space, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0036] The AI platform is an integrated hardware and software platform used to load and execute a predefined machine learning model; it is also known as the AI inference (training) platform. Shared storage space refers to storage space that can be accessed by multiple DMA units. Exclusive storage space refers to storage space that can be accessed by only one computing unit. Specifically, this computing unit can also be called a general-purpose scalable neural processor (SIP).
[0037] As an example rather than a limitation, Figure 2 illustrates the structure of an AI platform applicable to Embodiment 1 of the present invention. The on-chip memory of the AI platform has a three-level storage structure: L3 is a global storage space; L2 is a shared storage space accessible by multiple DMA units; and L1 is a dedicated storage space, each L1 space being accessible by only one computing unit. Data transfer between these storage levels is performed by DMA units. A local DMA sits between L2 and L1, with each computing unit corresponding to one local DMA; a shared DMA (called the global DMA unit below) sits between L3 and L2 and is shared by multiple computing units. Each storage level has its own storage capacity: L3 is the largest, followed by L2, and then L1. Computation is performed by the computing units, and the L1 storage space is the storage space that the computing units can access most efficiently.
[0038] Before operator execution, input data is uniformly moved from outside the AI platform to the L3 layer storage space within the AI platform by the upper-layer framework. Therefore, the main operation flow on the operator side is as follows: input data is moved from the L3 layer to the L2 layer by a global DMA unit, and then from the L2 layer to the L1 layer by multiple local DMA units; the computing unit directly accesses the L1 layer, loads the input data into its registers for computation, and writes the result back to the L1 layer; the local DMA unit moves the computation result from the L1 layer back to the L2 layer, and the global DMA unit then moves it from the L2 layer to the L3 layer. At this point, the operator execution process is complete. The operator output data is then either moved out of the AI platform by the upper-layer framework or passed to the next operator for subsequent computation.
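The operator-side flow described above can be written down as an ordered list of transfers. The following is only a minimal sketch for illustration; the `Transfer` type and the `operator_data_flow` function are hypothetical names, not part of the described platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One DMA move between two storage levels (hypothetical helper type)."""
    src: str
    dst: str
    dma: str  # "global" (between L3 and L2) or "local" (between L2 and L1)


def operator_data_flow():
    """Ordered transfers for one operator execution, following the text above."""
    return [
        Transfer("L3", "L2", "global"),  # input: global storage -> shared storage
        Transfer("L2", "L1", "local"),   # input: shared storage -> dedicated storage
        # The computing unit reads L1, computes in registers, writes back to L1.
        Transfer("L1", "L2", "local"),   # result: dedicated storage -> shared storage
        Transfer("L2", "L3", "global"),  # result: shared storage -> global storage
    ]


if __name__ == "__main__":
    for step in operator_data_flow():
        print(f"{step.src} -> {step.dst} via {step.dma} DMA")
```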
[0039] In this context, a machine learning model can refer to an algorithmic model pre-loaded into an AI platform, containing at least one operator, used to implement the computation function of a defined scenario, such as object detection or facial recognition. A target operator can refer to an operator in the machine learning model; an operator can refer to a symbol that performs mapping, transformation, or computation on a function or parameter, for example, an addition operator, an integration operator, or a probability operator.
[0040] S120. Based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, determine multiple alternative segmentation patterns that match the target operator.
[0041] The partitioning pattern includes the partitioning method of each tensor dimension of each operator parameter of the target operator in each level of storage space, as well as the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0042] Operator parameters refer to the data items that the operator relies on to complete the corresponding calculation. Optionally, operator parameters may include the input data and output results of the target operator. Tensor dimensions refer to the data size of the operator parameters. For example, taking matrix multiplication as the target operator, the input data consists of left and right operands. The tensor shape of the left operand can be M*K, and the tensor shape of the right operand can be K*N. That is, the left operand has two tensor dimensions: the first tensor dimension has M data elements, and the second tensor dimension has K data elements; the right operand also has two tensor dimensions: the first tensor dimension has K data elements, and the second tensor dimension has N data elements.
[0043] It is understandable that the data form of operator parameters is generally a tensor with a set number of dimensions, and each tensor dimension has a corresponding data size.
[0044] Here, hardware specifications can refer to the storage capacity of each storage level in the AI platform, for example, the upper limit of the storage capacity of each storage level. The partitioning method can refer to how each tensor dimension of the target operator is partitioned in each storage level. For example, it can be whether the dimension is partitioned or not, a specific partitioning ratio (e.g., M/3 or N/2), or the specific dimension value obtained after partitioning (e.g., 512 or 256). Parallelism description information can refer to the description of whether each tensor dimension of each operator parameter is executed in parallel across multiple computing units. For example, it can be parallel or not parallel, or the specific degree of parallelism. Alternative partitioning patterns can refer to the partitioning patterns that match the target operator, that is, those among all possible partitioning patterns of the target operator that meet the hardware specifications of the AI platform.
[0045] Specifically, we can first arrange and combine all the partitioning methods of each tensor dimension of each operator parameter in each level of storage space and various parallelism descriptions of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units to generate all the possible partitioning patterns corresponding to the target operator. Then, we can use the hardware specifications of the AI platform to filter all the possible partitioning patterns, thereby obtaining the candidate partitioning patterns that meet the hardware specifications of the AI platform.
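The enumerate-then-filter step can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: it considers a single storage level, uses divisors of each dimension as the possible slice sizes, and uses a hypothetical matrix multiplication footprint (`matmul_footprint`) counted in data elements. The strict "less than capacity" test follows the condition given for S240.

```python
from itertools import product


def divisors(n):
    """All slice sizes that divide a dimension of size n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_patterns(dims):
    """Yield every (slice sizes, parallel flags) combination for the dimensions."""
    names = list(dims)
    size_options = [divisors(dims[name]) for name in names]
    for sizes in product(*size_options):
        for flags in product((False, True), repeat=len(names)):
            yield dict(zip(names, sizes)), dict(zip(names, flags))


def matmul_footprint(sizes):
    """Elements held for one [m,k] x [k,n] block: left + right + output (assumed)."""
    m, k, n = sizes["M"], sizes["K"], sizes["N"]
    return m * k + k * n + m * n


def candidate_patterns(dims, capacity):
    """Keep only the patterns whose footprint is below the storage capacity."""
    return [
        (sizes, flags)
        for sizes, flags in enumerate_patterns(dims)
        if matmul_footprint(sizes) < capacity
    ]


if __name__ == "__main__":
    dims = {"M": 4, "K": 2, "N": 4}
    print(len(list(enumerate_patterns(dims))), "patterns before filtering")
    print(len(candidate_patterns(dims, capacity=9)), "patterns after filtering")
```

Even this toy operator shows why the later embodiment groups patterns into categories first: the unfiltered space grows multiplicatively with every dimension and storage level.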
[0046] S130. Obtain a time cost model that matches the AI platform, and use the time cost model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop method.
[0047] The time cost model can refer to a pre-built model that evaluates the time overhead of alternative segmentation patterns. Time overhead can refer to the time spent by the target operator performing data transport and computation according to the alternative segmentation patterns.
[0048] The operation loop method refers to the loop method that completes all operations on the operator parameters according to the segmentation method of the target operator's operator parameters. It can be understood that the number of data transfers will differ under different operation loop methods, and thus the time cost corresponding to different operation loop methods will also be different. Therefore, the operation loop method is also an important parameter in the segmentation strategy.
[0049] For example, suppose the target operator is the matrix multiplication [M,K]x[K,N]=[M,N], and the partitioning method splits the left operand into 3 pieces along the M dimension and the right operand into 2 pieces along the N dimension, so that the result of one block computation has shape [M/3,N/2]. Because of the nature of matrix multiplication, each left-operand piece must be combined with each right-operand piece. Therefore, each left-operand piece is used twice and each right-operand piece is used three times. If input data is not reused, 6 pieces of size [M/3,K] and 6 pieces of size [K,N/2] are moved. If each left-operand piece is reused while the right-operand piece is updated, 3 pieces of size [M/3,K] and 6 pieces of size [K,N/2] are moved. Similarly, if each right-operand piece is reused while the left-operand piece is updated, 6 pieces of size [M/3,K] and 2 pieces of size [K,N/2] are moved. The operation loop method can therefore be determined by the traversal order of the left and right operands and by which operand is reused.
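The piece counts in this example can be reproduced with a short calculation. The sketch below is illustrative only; the function names and the `reuse` labels ("none", "left", "right") are assumptions introduced here, and the sizes M=6, K=4, N=4 are arbitrary sample values.

```python
def tile_moves(m_pieces, n_pieces, reuse):
    """Return (left pieces moved, right pieces moved) for a tiled matmul.

    reuse is "none", "left" (hold a left piece while right pieces change)
    or "right" (hold a right piece while left pieces change).
    """
    total = m_pieces * n_pieces  # every left piece meets every right piece
    if reuse == "none":
        return total, total
    if reuse == "left":
        return m_pieces, total
    if reuse == "right":
        return total, n_pieces
    raise ValueError(f"unknown reuse mode: {reuse}")


def moved_elements(M, K, N, m_pieces, n_pieces, reuse):
    """Total input elements moved, assuming the pieces divide M and N evenly."""
    left_moves, right_moves = tile_moves(m_pieces, n_pieces, reuse)
    left_piece = (M // m_pieces) * K   # shape [M/m_pieces, K]
    right_piece = K * (N // n_pieces)  # shape [K, N/n_pieces]
    return left_moves * left_piece + right_moves * right_piece


if __name__ == "__main__":
    for mode in ("none", "left", "right"):
        print(mode, tile_moves(3, 2, mode), moved_elements(6, 4, 4, 3, 2, mode))
```

Which reuse mode moves the least data depends on the piece sizes, which is why the loop method is evaluated together with the segmentation pattern rather than fixed in advance.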
[0050] S140. Based on the time cost, determine the target segmentation pattern and the target operation loop method, and combine the target segmentation pattern and the target operation loop method to obtain the target segmentation strategy.
[0051] The target segmentation pattern can refer to the segmentation pattern among the candidate segmentation patterns that meets the time cost requirements. The target operation loop mode can refer to the operation loop mode corresponding to the target segmentation pattern.
[0052] Specifically, the target segmentation strategy may include: segmentation method, parallelism description information, and operation loop method. Once the above three parameters are determined, a deterministic segmentation process can be performed on the target operator.
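One way to picture S140 is as a search over (pattern, loop method) pairs. The sketch below assumes that "meeting the time cost requirements" means having the lowest estimated time, which this section does not state explicitly; the `Strategy` type, `select_strategy`, and the toy cost function are hypothetical, and the real time cost model is treated as an opaque callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple


@dataclass(frozen=True)
class Strategy:
    """Segmentation pattern (method + parallelism info) plus loop method."""
    pattern: Any
    loop_method: str


def select_strategy(
    patterns: Iterable[Any],
    loop_methods: Iterable[str],
    time_cost: Callable[[Any, str], float],
) -> Tuple[Strategy, float]:
    """Evaluate every pair and return the cheapest one with its cost."""
    loops = list(loop_methods)
    scored = [(time_cost(p, l), p, l) for p in patterns for l in loops]
    cost, pattern, loop = min(scored, key=lambda item: item[0])
    return Strategy(pattern, loop), cost


if __name__ == "__main__":
    # Toy cost: number of pieces moved for (m_pieces, n_pieces) patterns.
    def toy_cost(pattern, loop):
        m, n = pattern
        return {"none": 2 * m * n, "left": m + m * n, "right": m * n + n}[loop]

    best, cost = select_strategy([(3, 2), (1, 2)], ["none", "left", "right"], toy_cost)
    print(best, cost)
```

`min` raises `ValueError` when no candidate pattern survives filtering, so a caller would need a fallback for that case.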
[0053] S150. During the execution of the machine learning model, online computation of the target operator is performed according to the target segmentation strategy.
[0054] Specifically, after obtaining the target segmentation strategy generated by combining the target segmentation pattern and the target operation loop method, online calculations targeting the target operator can be performed during the execution of the machine learning model according to the target segmentation pattern and the target operation loop method in the target segmentation strategy. This can significantly improve the execution efficiency of the target operator and fully utilize the computing resources of the AI platform.
[0055] This invention obtains the machine learning model currently loaded onto the AI platform and identifies the target operator within it. Based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, it determines multiple alternative segmentation patterns matching the target operator. It then obtains a time cost model matching the AI platform and uses this model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop method. Based on the time cost, it determines the target segmentation pattern and the target operation loop method and combines them into the target segmentation strategy. Finally, during the execution of the machine learning model, online computation is performed on the target operator according to the target segmentation strategy. This solves the problem in existing technologies that efficient segmentation strategies for each operator cannot be identified online in real time and that maintenance costs are high. It meets the need to generate operator segmentation strategies online in real time, minimizes human intervention and maintenance costs, and integrates the determination and implementation of the operator segmentation strategy.
[0056] Embodiment 2
[0057] Figure 3 is a flowchart of an online generation method for a segmentation strategy provided in Embodiment 2 of the present invention. This embodiment refines the above embodiment; specifically, it refines the operation of determining multiple candidate segmentation patterns matching the target operator based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform. The operation may include: obtaining multiple segmentation categories matching the target operator, wherein each segmentation category defines whether each tensor dimension of each operator parameter of the target operator is segmented in each level of storage space, together with parallelism description information indicating whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units; calculating the target data volume required by the target operator in each level of storage space under each segmentation category based on the tensor dimensions of each operator parameter in the target operator; identifying at least one target segmentation category that meets the hardware specification conditions among all segmentation categories based on the target data volume and the hardware specifications of the AI platform; and determining multiple candidate segmentation patterns matching the target operator in each target segmentation category. As shown in Figure 3, the method includes:
[0058] S210. Obtain the machine learning model currently loaded onto the AI platform and identify the target operator in the machine learning model.
[0059] The AI platform includes multi-level storage space, at least one DMA unit for data transfer between the multi-level storage space, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0060] S220. Obtain multiple segmentation categories that match the target operator.
[0061] The segmentation category defines whether each tensor dimension of each operator parameter of the target operator is segmented in each level of storage space, together with parallelism description information indicating whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0062] In this embodiment, the search space of segmentation patterns is very large, and searching it directly is slow, which is not conducive to online compilation. The segmentation patterns are therefore classified first: the large number of segmentation patterns is divided into a small number of segmentation categories, and a preliminary screening is performed per segmentation category.
[0063] In this embodiment, the dimensional information, storage structure, and parallelism description information of the operators can be used comprehensively to describe the segmentation categories. The number of dimensions of the operators is fixed and finite, and each dimension has only two choices when considering segmentation: segment or not segment. Therefore, the category of the segmented pattern can be determined based on the combination of segmenting or not segmenting for each dimension. Because the dimensions are finite, the categories of segmented patterns are also finite.
[0064] Similarly, the storage capacity of each level of storage space differs, with L3 having the largest, followed by L2, and L1 the smallest. When considering whether to slice a certain dimension, we can also consider the memory level where the current slice resides. For example, we can choose not to slice a certain dimension at L2 but slice it at L1, or we can choose to slice it at both L2 and L1. In summary, classifying according to the above aspects can basically cover the entire segmentation search space in the AI platform.
[0065] For ease of explanation, in a specific example, assume the target operator includes only one operator parameter A, which has only one tensor dimension M, and the AI platform has only one level of storage space L1. Then, for this target operator, there are four segmentation categories. Segmentation category 1: the tensor dimension M of A is segmented on L1, and M is executed in parallel across multiple computing units. Segmentation category 2: the tensor dimension M of A is segmented on L1, and M is not executed in parallel across multiple computing units. Segmentation category 3: the tensor dimension M of A is not segmented on L1, and M is executed in parallel across multiple computing units. Segmentation category 4: the tensor dimension M of A is not segmented on L1, and M is not executed in parallel across multiple computing units.
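The four categories above are simply the combinations of two yes/no choices, and the same enumeration extends to more dimensions and storage levels. The following sketch is illustrative; the dictionary layout and the name `segmentation_categories` are assumptions made here.

```python
from itertools import product


def segmentation_categories(param_dims, levels):
    """Yield every segment/no-segment and parallel/not-parallel combination.

    param_dims: list of (operator parameter, tensor dimension) pairs.
    levels: list of storage level names.
    """
    split_keys = [(pd, level) for pd in param_dims for level in levels]
    for split_flags in product((True, False), repeat=len(split_keys)):
        for parallel_flags in product((True, False), repeat=len(param_dims)):
            yield {
                "split": dict(zip(split_keys, split_flags)),
                "parallel": dict(zip(param_dims, parallel_flags)),
            }


if __name__ == "__main__":
    # One parameter A with one dimension M, one storage level L1.
    for number, cat in enumerate(segmentation_categories([("A", "M")], ["L1"]), 1):
        print(f"category {number}: {cat}")
```

With d parameter dimensions and s storage levels this yields 2^(d*s) * 2^d categories, which is finite but grows quickly, consistent with the remark that the categories are finite because the dimensions are finite.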
[0066] S230. Based on the tensor dimensions of each operator parameter in the target operator, calculate the amount of target data required by the target operator in each level of storage space under each segmentation category.
[0067] The target data size refers to the minimum amount of data required by the target operator in a given storage space under a given segmentation category. For example, the target data size can be the sum of the minimum input data slice size and the minimum output data slice size in each level of storage space for each segmentation category. Typically, the target data size varies depending on the segmentation category.
[0068] In an optional implementation, taking a specific segmentation category as an example, the method of calculating the amount of target data required by the target operator in each level of storage space under each segmentation category, based on the tensor dimensions of each operator parameter in the target operator, is specified as follows:
[0069] Obtain the minimum slice size of input data and the minimum slice size of output data for the target operator in each storage space under the current segmentation category; obtain the sum of the minimum slice size of input data and the minimum slice size of output data in the dedicated storage space, as the target data amount required by the target operator in the dedicated storage space under the current segmentation category; according to the parallelism description information in the current segmentation category, divide the minimum slice size of input data in each shared storage space into dedicated data amount and shared data amount; based on the dedicated data amount, shared data amount, minimum slice size of output data in each shared storage space, and the number of computing units available to the target operator in the AI platform, calculate the target data amount required by the target operator in each shared storage space under the current segmentation category.
[0070] The current segmentation category refers to the segmentation category currently being screened among the segmentation categories corresponding to the target operator. The minimum slice size refers to the amount of data under the smallest slice dimension. For example, for a dimension of size 1024, the slice dimension can take values such as 1, 2, 512, or 1024, and the minimum slice size can be 1. The minimum slice size of input data refers to the amount of input data of the target operator in each level of storage space under the smallest segmentation dimension. The minimum slice size of output data refers to the amount of output data of the target operator in each level of storage space under the smallest segmentation dimension. The dedicated data amount refers to the amount of data sent independently to one specific dedicated storage space. The shared data amount refers to the amount of data sent simultaneously to multiple dedicated storage spaces.
[0071] In a specific example, suppose that, under the current segmentation category, the tensor dimension a1 of operator parameter A in the target operator is executed in parallel across multiple computing units (e.g., 4), where the number of computing units is the number available to the target operator in the AI platform. Also suppose that the AI platform has a shared storage space L2 and a dedicated storage space L1.
[0072] According to the current segmentation category, the minimum slice size of input data a11 and the minimum slice size of output data a12 of the tensor dimension a1 of operator parameter A in L2 are determined, and a11 is divided into the dedicated data amount a111 and the shared data amount a112. The target data amount required by the target operator in L2 under the current segmentation category can then be calculated using the formula: a111*4 + a112 + a12*4. Similarly, after the minimum slice size of input data a21 and the minimum slice size of output data a22 of the tensor dimension a1 of operator parameter A in L1 are determined according to the current segmentation category, the target data amount required by the target operator in L1 under the current segmentation category can be calculated using the formula: a21 + a22.
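As a non-limiting illustration, the two formulas of this example can be written as a minimal Python sketch. The function names and the byte values are hypothetical and chosen only to make the arithmetic concrete; they are not taken from the embodiment.

```python
def shared_space_demand(dedicated_in: int, shared_in: int,
                        out_slice: int, num_units: int) -> int:
    """Target data amount in a shared storage space such as L2.

    Each computing unit needs its own dedicated input part and its own
    output slice, while the shared input part is stored only once.
    """
    return dedicated_in * num_units + shared_in + out_slice * num_units


def dedicated_space_demand(in_slice: int, out_slice: int) -> int:
    """Target data amount in a dedicated storage space such as L1."""
    return in_slice + out_slice


# Hypothetical slice sizes in bytes (a111, a112, a12, a21, a22).
l2_demand = shared_space_demand(dedicated_in=256, shared_in=512,
                                out_slice=128, num_units=4)
l1_demand = dedicated_space_demand(in_slice=768, out_slice=128)
print(l2_demand, l1_demand)
```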
[0073] S240. When the target data volume calculated by the target operator for each level of storage space under the current segmentation category is less than the upper limit of the storage space capacity of that level of storage space, the current segmentation category is determined to be a target segmentation category that meets the hardware specification conditions.
[0074] Continuing from the previous example, after determining the first target data volume in L1 and the second target data volume in L2 for each tensor dimension of each operator parameter of the target operator based on the current segmentation category, the current segmentation category can only be determined as a target segmentation category that meets the hardware specifications if it is determined that each first target data volume is less than the upper limit of the storage space capacity of L1 and each second target data volume is less than the upper limit of the storage space capacity of L2.
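The check described in S240 can be sketched as follows, under the assumption that the target data amounts have already been computed per storage level. The dictionary layout and the name `meets_hardware_spec` are illustrative only.

```python
from __future__ import annotations


def meets_hardware_spec(demand_by_level: dict[str, list[int]],
                        capacity_by_level: dict[str, int]) -> bool:
    """Return True only if every target data amount is strictly below
    the capacity upper limit of its storage level."""
    return all(
        demand < capacity_by_level[level]
        for level, demands in demand_by_level.items()
        for demand in demands
    )


# Hypothetical amounts and capacities in bytes.
demands = {"L1": [896], "L2": [2048]}
capacities = {"L1": 1024, "L2": 4096}
print(meets_hardware_spec(demands, capacities))
```

A segmentation category for which any single amount reaches or exceeds the capacity of its level is rejected.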
[0075] In this embodiment, since the target data volume is calculated as the minimum amount of data required by the target operator in the set storage space under the set segmentation category, the final target segmentation category that meets the hardware specifications must contain an alternative segmentation pattern that meets the hardware specifications.
[0076] It should be noted that, in addition to storage space capacity, the capacity of the registers used to store temporary results in the computing unit can also be considered. Specifically, the automatic instruction generation technology in the AI platform produces temporary results that need to be cached in registers. The amount of cached data cannot exceed the upper limit of register capacity, and the amount of cached data is determined by the size of the data slice. Therefore, the minimum slice size also needs to meet the requirement of not exceeding the upper limit of register capacity.
[0077] S250. In each target segmentation category, determine the range of values for each tensor dimension of each operator parameter of the target operator in each level of storage space.
[0078] The value range of the segmentation dimension refers to the values that the slice size can take along the corresponding tensor dimension of each operator parameter. For example, if the tensor dimensions of an operator parameter of the target operator are 1024×768, then the value range of the segmentation dimension along the 1024 dimension can be 1 to 1023. Similarly, the value range of the segmentation dimension along the 768 dimension can be 1 to 767.
[0079] S260. According to the value range of the segmentation dimension corresponding to each target segmentation category and the parallelism description information in each target segmentation category, multiple combined segmentation patterns are obtained, and multiple candidate segmentation patterns that meet the hardware specifications of the AI platform are selected from each combined segmentation pattern.
[0080] Among them, the combined segmentation pattern can refer to the segmentation pattern generated by combining the value range of the segmentation dimension and the parallelism description information in each target segmentation category.
[0081] Specifically, after determining the range of values for each tensor dimension of each operator parameter in each level of storage space and the parallelism description information in each target segmentation category, the range of values for each segmentation dimension and the parallelism description information in each target segmentation category can be combined to generate a combined segmentation pattern. Thus, the segmentation category, segmentation dimension, and parallelism description information can be comprehensively considered to select the candidate segmentation pattern that makes the cost of each operator in the machine learning model more favorable.
[0082] S270. Obtain a time cost model that matches the AI platform, and use the time cost model to evaluate the time cost of each candidate segmentation pattern under at least one operation loop mode.
[0083] Specifically, after the segmentation category, segmentation dimension, and parallelism description information have been comprehensively considered to select candidate segmentation patterns that meet the hardware specifications of the AI platform, the time cost model can be used to evaluate the time cost of each candidate segmentation pattern under at least one operation loop mode. Thus, in addition to the segmentation category, segmentation dimension, and parallelism description information, the operation loop mode is also taken into account to ensure the performance of the target segmentation pattern.
[0084] S280. Based on the time cost, determine the target segmentation pattern and the target operation loop mode, and combine the target segmentation pattern and the target operation loop mode to obtain the target segmentation strategy.
[0085] Specifically, based on the time cost of each candidate segmentation pattern under at least one operation loop mode, the target segmentation pattern that meets the time cost requirement and the target operation loop mode corresponding to the target segmentation pattern can be determined, and then a target segmentation strategy containing the target segmentation pattern and the target operation loop mode can be constructed.
[0086] S290. During the execution of the machine learning model, online computation of the target operator is performed according to the target segmentation strategy.
[0087] Specifically, when a machine learning model containing target operators is executed, online computation of the target operators can be performed according to the target segmentation strategy. In this way, the target segmentation pattern and target operation loop method with the best cost for each target operator in the machine learning model can be identified online in real time.
[0088] This invention acquires the machine learning model currently loaded onto the AI platform and identifies the target operator within it. Then, it acquires multiple segmentation categories matching the target operator and calculates the target data volume required by the target operator in each level of storage space under each segmentation category, based on the tensor dimensions of each operator parameter. When the target data volume calculated for each storage level under the current segmentation category is less than the upper limit of the storage capacity of that level, the current segmentation category is determined to be a target segmentation category that meets the hardware specifications. Furthermore, within each target segmentation category, the value range of the segmentation dimension of each tensor dimension of each operator parameter in each storage level is determined. Multiple combined segmentation patterns are obtained by combining the value range of the segmentation dimension corresponding to each target segmentation category with the parallelism description information in each target segmentation category, and multiple candidate segmentation patterns that meet the hardware specifications of the AI platform are selected from the combined segmentation patterns. Further, a time cost model matching the AI platform is obtained and used to evaluate the time cost of each candidate segmentation pattern under at least one operation loop mode. Based on the time cost, the target segmentation pattern and the target operation loop mode are determined and combined to obtain the target segmentation strategy. Finally, during the execution of the machine learning model, online computation of the target operator is performed according to the target segmentation strategy. This solves the problem in existing technologies where efficient segmentation strategies corresponding to each operator cannot be identified online in real time, resulting in high maintenance costs. It can identify the segmentation strategies with the best cost for each operator in the machine learning model online in real time, ensuring that the power consumption constraints of the AI platform are met and significantly improving the execution efficiency of the operators.
[0089] Example 3
[0090] Figure 4 is a flowchart of an online generation method for a segmentation strategy provided in Embodiment 3 of the present invention. This embodiment is a refinement based on the above embodiment. Specifically, this embodiment refines the operation of obtaining a time cost model matching the AI platform, which may include: obtaining a pre-built standard cost model; setting parameters of the standard cost model according to the hardware description parameters of the AI platform to obtain a time cost model matching the AI platform. As shown in Figure 4, the method includes:
[0091] S310. Obtain the machine learning model currently loaded onto the AI platform and identify the target operator in the machine learning model.
[0092] The AI platform includes multi-level storage space, at least one DMA unit for data transfer between the multi-level storage space, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0093] S320. Obtain multiple segmentation categories that match the target operator.
[0094] The segmentation category defines whether each tensor dimension of each operator parameter of the target operator is segmented in each level of storage space, together with parallelism description information indicating whether each tensor dimension of each operator parameter is executed in parallel across multiple computing units.
[0095] S330. Based on the tensor dimensions of each operator parameter in the target operator, calculate the amount of target data required by the target operator in each level of storage space under each segmentation category.
[0096] Specifically, the minimum slice size of input data and the minimum slice size of output data for the target operator in each level of storage space are obtained for each segmentation category. Then, the sum of the minimum slice sizes of input data and output data in the dedicated storage space is taken as the target data volume required by the target operator in the dedicated storage space for each segmentation category. Further, based on the parallelism description information in the segmentation category, the minimum slice size of input data in each shared storage space is divided into dedicated data volume and shared data volume. Combined with the number of computing units available to the target operator in the AI platform, the target data volume required by the target operator in the shared storage space for each segmentation category is calculated.
[0097] S340. When the target data volume calculated by the target operator for each level of storage space under the current segmentation category is less than the upper limit of the storage space capacity of that level of storage space, the current segmentation category is determined to be a target segmentation category that meets the hardware specification conditions.
[0098] Specifically, after calculating the target data amount required by the target operator in each level of storage space under each segmentation category, for the current segmentation category, if the target data amounts required by the target operator in the shared storage space and in the exclusive storage space are less than the upper limits of the storage space capacity of the shared storage space and the exclusive storage space, respectively, then the current segmentation category is taken as a target segmentation category.
[0099] S350. In each target segmentation category, determine the range of values for each tensor dimension of each operator parameter of the target operator in each level of storage space.
[0100] S360. Under the current target segmentation category, obtain the current segmentation dimension value range of each tensor dimension of each operator parameter of the target operator in each level of storage space.
[0101] The current segmentation dimension value range refers to the value range of the segmentation dimension of each tensor dimension of each operator parameter of the target operator in each level of storage space, once the current target segmentation category has been determined.
[0102] S370. Within the value range of each current segmentation dimension, filter the value of each current segmentation dimension once, and sort the value of each current segmentation dimension within the value range of each current segmentation dimension according to at least one priority sorting rule.
[0103] The priority sorting rule refers to a rule for ordering the filtered values of the current segmentation dimension by priority, so that values with higher priority are searched first.
[0104] In an optional implementation, filtering the values of each current segmentation dimension within the range of values of each current segmentation dimension may include: obtaining the register hardware specifications of the AI platform and filtering the values of each current segmentation dimension according to the register hardware specifications; the priority sorting rules include: factor priority rule, integer multiple priority rule of the number of computing units that the target operator can use in the AI platform, and large number priority rule.
[0105] The register hardware specification can refer to the instruction specification of the register. For example, if each instruction of the computing unit operates on 64 bytes of register data, then integer multiples of 64, such as 64 and 128, need to be selected as segmentation dimension values to preserve the segmentation granularity.
[0106] The factor priority rule refers to the rule that, within the current segmentation dimension value range of each operator parameter of the target operator, values that are factors of the corresponding dimension size have higher priority. For example, if the current segmentation dimension value ranges of an operator parameter of the target operator are [1,1023] and [1,767] (that is, the dimension sizes are 1024 and 768), the factors within [1,1023] can include 2, 4, 8, 16, 32, 64, 128, 256 and 512, and the factors within [1,767] can include 2, 4, 6, 8, 12, 24, 32, 64, 96, 128, 192 and 384. These factors then have higher priority than the other values in their respective value ranges.
[0107] The rule prioritizing integer multiples of the number of computing units available to the target operator in the AI platform refers to the rule that partitioning dimensions that are integer multiples of the number of computing units have higher priority. For example, if the number of computing units available to the target operator in the AI platform is 4, then partitioning dimensions of 4, 8, 12, 16, or 20 have higher priority.
[0108] The large number priority rule refers to the rule that a larger value of the current segmentation dimension has higher priority.
[0109] Specifically, firstly, the values of each current segmentation dimension are filtered according to the hardware specifications of the platform's registers, and the values of the current segmentation dimension that are integer multiples of the instruction specifications of the registers are selected. Then, the filtered values of the current segmentation dimension are sorted according to the priority sorting rules.
[0110] In other words, in the optional implementation of this embodiment, after determining which dimensions need to be segmented, the next step is to determine how to segment them, i.e., to define the search range of the segmentation dimensions. For example, if the size of a dimension is N, the segmentation value can be an integer between 1 and N-1. To minimize invalid searches, some restrictions can be imposed on the segmentation values. First, factors of N are prioritized, i.e., the dimension is divided evenly, which avoids remainder handling. Second, the hardware specifications are considered when selecting segmentation values: for example, if each instruction of the computing unit uses 64-byte aligned data, segmentation values that are not integer multiples of this granularity are filtered out. Third, if a larger segmentation value can be used for a dimension, smaller values can be discarded, because each slice carries a fixed overhead and the number of slices should be minimized. Finally, considering the parallelism between multiple computing units, the segmentation is also biased towards integer multiples of the number of computing units that the target operator can use in the AI platform, so that no resources are wasted.
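As a non-limiting illustration, the primary filtering and priority sorting can be sketched in Python. Two points are assumptions of this sketch and not statements of the embodiment: the granularity is expressed in elements rather than bytes, and the three priority rules are combined lexicographically (factor first, then multiple of the number of computing units, then larger value). The function name is hypothetical.

```python
from __future__ import annotations


def candidate_split_values(dim_size: int, granularity: int,
                           num_units: int) -> list[int]:
    """Filter and order the segmentation values of one tensor dimension."""
    # Search range: integers between 1 and dim_size - 1.
    values = range(1, dim_size)
    # Primary filtering: keep only multiples of the instruction granularity.
    kept = [v for v in values if v % granularity == 0]
    # Priority sorting (False sorts before True, so "is a factor" comes first).
    return sorted(
        kept,
        key=lambda v: (
            dim_size % v != 0,   # factors of the dimension size first
            v % num_units != 0,  # then multiples of the computing-unit count
            -v,                  # then larger values first
        ),
    )


ordered = candidate_split_values(dim_size=1024, granularity=64, num_units=4)
print(ordered[:5])
```

With these inputs the four factors 512, 256, 128 and 64 are placed ahead of the non-factor multiples of 64.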
[0111] S380. Using the value range of each of the current segmentation dimensions after priority sorting, construct a multi-level nested loop to perform the search.
[0112] In a specific example, suppose the target operator includes only one operator parameter A, which has two tensor dimensions M and N. Also suppose the AI platform only has one level of storage space L1. Assume that after priority sorting, the current partitioning dimension M of the target operator's operator parameter A in storage space L1 has a range of values {M1; M2; M3}, and the current partitioning dimension N of the target operator's operator parameter A in storage space L1 has a range of values {N1; N2; N3}.
[0113] Correspondingly, a two-level nested loop can be constructed. The outer loop iterates through {M1; M2; M3}, and the inner loop iterates through {N1; N2; N3}, so as to achieve the search for 9 possible combinations of 3*3: {M1, N1}, {M1, N2}, {M1, N3}, {M2, N1}, {M2, N2}, {M2, N3}, {M3, N1}, {M3, N2}, and {M3, N3}.
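The two-level traversal of this example can be sketched in Python with `itertools.product`, which iterates the last sequence fastest and therefore reproduces the outer-M, inner-N order. The string labels stand in for the actual segmentation values.

```python
from itertools import product

# Priority-sorted value ranges of the two segmentation dimensions.
m_values = ["M1", "M2", "M3"]  # outer loop
n_values = ["N1", "N2", "N3"]  # inner loop

# product() varies the last sequence fastest, like a nested for loop.
combinations = list(product(m_values, n_values))
for combo in combinations:
    print(combo)
```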
[0114] S390. Based on the parallelism description information of the current target segmentation category, match the combined segmentation pattern generated each time during the search process with the hardware specifications of the AI platform to obtain the successfully matched alternative segmentation pattern.
[0115] Specifically, during each search process, the current segmentation value in each multi-level nested loop is combined with the parallelism description information of the current target segmentation category to generate a combined segmentation pattern. Then, the combined segmentation pattern is matched with the hardware specifications of the AI platform, and the successfully matched combined segmentation pattern is used as a candidate segmentation pattern.
[0116] In this embodiment, the implementation of S390 is the same as that of S260. That is, for each combined segmentation pattern, it is checked whether the target data amount required in each level of storage space is less than the upper limit of the storage space capacity of that level of storage space. If so, the combined segmentation pattern is determined as a candidate segmentation pattern.
[0117] S3100. Based on the sorting position of each segmentation dimension value in each level of the multi-level nested loop in each successfully matched candidate segmentation pattern, perform secondary filtering on the segmentation dimension values that have not been searched in the multi-level nested loop until the traversal process of the multi-level nested loop is completed.
[0118] Specifically, after obtaining the successfully matched candidate segmentation patterns, the segmentation dimension values of the outer loop immediately adjacent to the inner loop can be filtered a second time based on the segmentation dimension values of the inner loop in the candidate segmentation patterns.
[0119] It is worth noting that when, in a successfully matched candidate segmentation pattern, the value of at least one inner loop of the multi-level nested loop is the highest-priority value (that is, the first value in the priority-sorted value range of that segmentation dimension), the values of the immediately adjacent outer loop that have lower priority than the current segmentation dimension value of that outer loop can be filtered out.
[0120] In an optional implementation, based on the sorting position of each segmentation dimension value in each level of the multi-level nested loop in each successfully matched candidate segmentation pattern, a secondary filtering of the unsearched segmentation dimension values in the multi-level nested loop may be performed, which may include:
[0121] Obtain the sorting position of each target segmentation dimension value in each level of the multi-level nested loop in the currently successfully matched candidate segmentation pattern; if, based on the sorting position of each target segmentation dimension value, it is determined that, starting from the innermost loop, the target segmentation dimension value in at least one consecutive inner loop is located in the first sorting position, then filter all unsearched segmentation dimension values after the target segmentation dimension values in the target outer loop immediately adjacent to the consecutive inner loops.
[0122] Here, "continuous inner loop" can refer to a continuous inner loop structure within a multi-level nested loop. "Target segmentation dimension value" can refer to the segmentation dimension value corresponding to the candidate segmentation pattern within the continuous inner loop. "Target outer loop" can refer to the outer loop corresponding to the target segmentation dimension value.
[0123] Specifically, after determining the candidate segmentation patterns, the sorting position of each target segmentation dimension value in the candidate segmentation patterns in each level of the multi-level nested loop is obtained. If the target segmentation dimension value in at least one consecutive inner loop is in the first sorting position, then all unsearched segmentation dimension values after the target segmentation dimension value in the target outer loop immediately adjacent to the consecutive inner loop can be filtered. This achieves secondary filtering and avoids resource waste.
[0124] Continuing the previous example, suppose the outer loop in a two-level nested loop includes {M1; M2; M3}, and the inner loop includes {N1; N2; N3}. If, when traversing to {M2, N1}, this combination is determined to be a candidate segmentation pattern, then M3 in the outer loop can be directly filtered out, without continuing to traverse and search {M3, N1}, {M3, N2}, and {M3, N3}.
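As a non-limiting illustration, the secondary filtering for the two-level case can be sketched in Python. The capacity test `fits`, the numeric values, and the choice to finish the inner loop for the current outer value before stopping are assumptions of this sketch.

```python
from __future__ import annotations

from typing import Callable, Sequence


def search_with_pruning(outer: Sequence[int], inner: Sequence[int],
                        fits: Callable[[int, int], bool]) -> list[tuple[int, int]]:
    """Two-level nested search over priority-sorted values.

    When a match occurs with the inner value in the first sorting position,
    all outer values after the current one are filtered out.
    """
    candidates = []
    for m in outer:
        prune_remaining_outer = False
        for position, n in enumerate(inner):
            if fits(m, n):
                candidates.append((m, n))
                if position == 0:
                    prune_remaining_outer = True
        if prune_remaining_outer:
            break  # skip the unsearched, lower-priority outer values
    return candidates


# Hypothetical values: a combination fits if its element count is below a limit.
found = search_with_pruning([512, 256, 128], [384, 192, 96],
                            fits=lambda m, n: m * n < 100_000)
print(found)
```

Here (256, 384) matches with the inner value in first position, so the outer value 128 is never searched.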
[0125] S3110. Obtain a pre-built standard cost model.
[0126] Here, the standard cost model can refer to a standard time cost calculation model. For example, the standard cost model can include the DMA cost model and the computing unit cost model.
[0127] As a medium for data exchange between different storage tiers, DMA operations have one of the highest time overheads. Operator operations involve multiple DMA operations, and the overhead of each DMA operation is related to factors such as data transfer efficiency and memory tier. Therefore, it is necessary to first establish a cost model for DMA on the AI platform.
[0128] DMA bandwidth has a theoretical upper limit, but not every operation can reach this limit. It's necessary to consider the factors affecting the bandwidth rate and the extent of their influence. For example, DMA rate is affected by access continuity: on an AI platform, for continuous data transfer operations, the theoretical upper limit of DMA transfers per clock cycle is 128 bytes. If the data access continuity is not a multiple of 128 bytes, the DMA rate will decrease linearly. Furthermore, the total amount of data transferred also has a linear relationship with DMA bandwidth; within a certain range, the larger the transfer volume, the higher the rate. Therefore, this linear relationship can be expressed by a formula to fit an overhead close to actual measurements within a certain range. On the other hand, DMA operation overhead is also affected by memory access latency. DMA incurs a fixed latency overhead when accessing memory, and the degree of latency varies across different memory levels. During evaluation, it's necessary to consider which memory tiers the data transfer occurs between and add the corresponding fixed overhead to the DMA cost model.
[0129] Therefore, when the DMA transfer action is [A,B,C,D]->[a,b,c,d], the amount of data transferred is a*b*c*d*bpe bytes. This transfer action indicates that the DMA can transfer at least (d*bpe) bytes of data contiguously. Here, bpe represents the number of bytes per element, such as 4 bytes for a value in FP32 and 2 bytes for a value in FP16.
[0130] The DMA cost model can be: Total DMA time = Total data transfer volume / Data transfer efficiency + Fixed latency overhead, that is:
[0131] Total DMA time = (a*b*c*d*bpe) / (dma_bw*dma_bw_efficiency_factor1*dma_bw_efficiency_factor2) + dma_latency.
[0132] Where dma_bw can represent the DMA bandwidth rate, in bytes / cycle; dma_bw_efficiency can represent the DMA transfer efficiency, which is mainly related to the transfer continuity and the amount of data transferred; dma_bw_efficiency_factor1 can represent the primary factor affecting DMA transfer efficiency, namely the transfer continuity; dma_bw_efficiency_factor1 = ((d*bpe)%dma_bw) / dma_bw.
[0133] It is worth noting that, generally, higher transfer continuity corresponds to higher DMA transfer efficiency. When d*bpe is an integer multiple of dma_bw, the transfer continuity should theoretically be at its highest, i.e., dma_bw_efficiency_factor1 should be 1. However, directly applying the above formula in this case yields dma_bw_efficiency_factor1 = 0.
[0134] Therefore, the following piecewise function can be constructed to calculate dma_bw_efficiency_factor1:
[0135]
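One piecewise form that is consistent with the formula in [0132] and the boundary case described in [0133] is given below. It is stated as an assumption inferred from those two paragraphs only.

```latex
\[
\text{dma\_bw\_efficiency\_factor1} =
\begin{cases}
1, & (d \cdot bpe) \bmod dma\_bw = 0,\\[4pt]
\dfrac{(d \cdot bpe) \bmod dma\_bw}{dma\_bw}, & \text{otherwise.}
\end{cases}
\]
```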
[0136] dma_bw_efficiency_factor2 represents the second factor affecting DMA transfer efficiency, namely the amount of data transferred; dma_bw_efficiency_factor2 = (a*b*c*d*bpe) / bytesforbestperf, where bytesforbestperf represents the amount of data transferred in a single operation. Generally, the larger the amount of data transferred in a single operation, the higher the DMA efficiency, but there are usually upper and lower limits, so the value of dma_bw_efficiency_factor2 is typically clamped by thresholds. For example, if dma_bw_efficiency_factor2 > 1, it can be assigned the value 1; similarly, if dma_bw_efficiency_factor2 < 0.8, it can be assigned the value 0.8. dma_latency represents the fixed latency overhead: each DMA operation typically has a fixed latency overhead, which is independent of the transfer amount and is measured in cycles.
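As a non-limiting illustration, the DMA cost model can be sketched in Python. The handling of the exact-multiple case (factor 1 set to 1) is an assumption inferred from the surrounding description. The clamp bounds 0.8 and 1 are the example thresholds mentioned above, and all numeric inputs are hypothetical.

```python
from __future__ import annotations


def dma_time_cycles(shape: tuple[int, int, int, int], bpe: int, dma_bw: int,
                    bytes_for_best_perf: int, dma_latency: float,
                    factor2_min: float = 0.8, factor2_max: float = 1.0) -> float:
    """Estimated DMA time in cycles for transferring a block [a, b, c, d]."""
    a, b, c, d = shape
    total_bytes = a * b * c * d * bpe
    # Factor 1: transfer continuity, based on the contiguous run d * bpe.
    remainder = (d * bpe) % dma_bw
    factor1 = 1.0 if remainder == 0 else remainder / dma_bw  # assumed piecewise form
    # Factor 2: transfer volume, clamped to the example thresholds.
    factor2 = total_bytes / bytes_for_best_perf
    factor2 = min(factor2_max, max(factor2_min, factor2))
    return total_bytes / (dma_bw * factor1 * factor2) + dma_latency


# FP32 data (bpe = 4); 32 contiguous elements give a 128-byte run.
aligned = dma_time_cycles((1, 1, 8, 32), bpe=4, dma_bw=128,
                          bytes_for_best_perf=1024, dma_latency=100)
# Same volume, but only a 64-byte contiguous run.
unaligned = dma_time_cycles((1, 1, 16, 16), bpe=4, dma_bw=128,
                            bytes_for_best_perf=1024, dma_latency=100)
print(aligned, unaligned)
```

Both calls move 1024 bytes; the second is slower in this sketch only because its contiguous run covers half of the per-cycle bandwidth.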
[0137] Furthermore, computational overhead is a major aspect of operator operation. The computational power of an AI platform is related to the way instructions are used and the pipeline layout. Generally, different types of operators have different computational costs due to their different computational methods. The computational unit cost model mainly consists of three parts: data loading, data computation, and data write-back. Among them, data loading, data computation, and data write-back can be performed simultaneously on the AI platform. Therefore, the computational unit cost model can be: Total computation time = First data loading time + max(data loading time, data computation time, data write-back time) + Last data write-back time.
[0138] Furthermore, there are some fixed time overheads during operator operations, such as DMA configuration time and the waiting time for synchronization between computing units. Although this part of the time accounts for a small percentage, it still has a significant impact on the segmentation pattern. For example, if the same data can be segmented into 1 / 2 size and run twice, or segmented into 1 / 4 size and run four times, the calculation results are the same, but the actual time consumption is different because the method of running four times will have two more DMA configuration times than the method of running two times. Similarly, the synchronization time of computing units is also affected by the number of synchronizations. Therefore, when building a standard cost model, some fixed time overheads also need to be considered.
[0139] Therefore, by summarizing the DMA cost model, the computing unit cost model, and the fixed time overhead, a standard cost model can be constructed.
[0140] As an example, and not a limitation, the following formula can be used to simply describe how to calculate the time cost of an optional standard cost model:
[0141] The time cost of the standard cost model = first DMA input configuration time + first DMA input transfer time + (max(one DMA input transfer time + one DMA output transfer time, one computation time) + one DMA input configuration time + one DMA output configuration time) * (total number of loops - 1) + last computation time + last DMA output configuration time + last DMA output transfer time.
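As a non-limiting illustration, this formula can be sketched in Python under the simplifying assumption that every loop iteration has the same configuration, transfer, and computation times. The class and field names and the cycle counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepTimes:
    """Per-iteration times in cycles (assumed identical for every iteration)."""
    cfg_in: float   # configure DMA input transfer
    dma_in: float   # DMA input transfer
    calc: float     # computation
    cfg_out: float  # configure DMA output transfer
    dma_out: float  # DMA output transfer


def standard_cost(t: StepTimes, total_loops: int) -> float:
    """Time cost when DMA transfers and computation overlap in the steady state."""
    if total_loops < 1:
        raise ValueError("total_loops must be at least 1")
    prologue = t.cfg_in + t.dma_in
    steady = max(t.dma_in + t.dma_out, t.calc) + t.cfg_in + t.cfg_out
    epilogue = t.calc + t.cfg_out + t.dma_out
    return prologue + steady * (total_loops - 1) + epilogue


times = StepTimes(cfg_in=5, dma_in=20, calc=30, cfg_out=5, dma_out=10)
print(standard_cost(times, 2), standard_cost(times, 4))
```

In this sketch the configuration times are not hidden by the overlap, so each additional iteration adds them in full.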
[0142] In this example, in a real-world application scenario where the DMA transfer process and the computation process of the computing unit are executed in parallel, the expected time overhead for each alternative segmentation pattern under at least one computation cycle mode can be estimated.
[0143] At the same time, it should be noted again that the above formula is only an example. In practical applications, technicians can construct other types of standard cost model time cost calculation methods according to different segmentation categories or different AI platform parameters. This embodiment does not limit this.
[0144] S3120. Based on the hardware description parameters of the AI platform, the standard cost model is parameterized to obtain a time cost model that matches the AI platform.
[0145] In one optional implementation, the hardware description parameters include at least one of the following: the storage hierarchy of the AI platform, the direct memory access DMA unit architecture, the bandwidth rate of each DMA unit, the latency of each DMA unit, the amount of data transferred in a single DMA unit, the configuration time of each DMA unit, the synchronization waiting time between computing units, the execution mode of each DMA unit, and the execution mode of the computing unit.
[0146] Specifically, after obtaining the hardware description parameters of the AI platform, the standard cost model is set according to the hardware description parameters of the AI platform to obtain a time cost model that matches the AI platform.
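As a non-limiting illustration, parameter setting can be sketched in Python as binding a set of hardware description parameters to a generic cost function. The deliberately simplified cost expression, the field names, and the numeric values are assumptions of this sketch and do not reproduce the full standard cost model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HardwareDescription:
    """A subset of hardware description parameters (hypothetical values)."""
    dma_bw: int           # bytes per cycle
    dma_latency: int      # cycles per DMA operation
    dma_config_time: int  # cycles per DMA configuration
    sync_wait_time: int   # cycles spent on synchronization


def bind_cost_model(hw: HardwareDescription) -> Callable[[int, int], float]:
    """Return a time cost function specialized for one platform."""
    def time_cost(bytes_per_transfer: int, num_transfers: int) -> float:
        per_transfer = (hw.dma_config_time
                        + bytes_per_transfer / hw.dma_bw
                        + hw.dma_latency)
        return num_transfers * per_transfer + hw.sync_wait_time
    return time_cost


platform = HardwareDescription(dma_bw=128, dma_latency=100,
                               dma_config_time=20, sync_wait_time=50)
cost = bind_cost_model(platform)
# Same 2048 bytes moved as two halves or as four quarters.
print(cost(1024, 2), cost(512, 4))
```

The second call is more expensive because the fixed configuration and latency overheads are paid four times instead of twice.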
[0147] S3130. Use a time cost model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop.
[0148] S3140. Based on each time cost, obtain the target segmentation pattern corresponding to the minimum time cost, and obtain the target operation loop mode of the target segmentation pattern under the minimum time cost.
[0149] Specifically, after calculating the time cost of each candidate segmentation pattern under at least one operation loop mode, the minimum time cost can be selected from the time costs. The candidate segmentation pattern corresponding to the minimum time cost is taken as the target segmentation pattern, and the operation loop mode corresponding to the target segmentation pattern is taken as the target operation loop mode.
[0150] S3150. The target segmentation pattern and the target operation loop method are combined to obtain the target segmentation strategy.
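Steps S3140 and S3150 amount to a minimum search over (segmentation pattern, operation loop mode) pairs. The sketch below assumes the time costs have already been evaluated and stored in a dictionary; the function name and the string identifiers for patterns and loop modes are hypothetical.

```python
from typing import Dict, Tuple


def select_target_strategy(costs: Dict[Tuple[str, str], float]) -> Tuple[str, str]:
    """Return the (pattern, loop mode) pair with the minimum time cost.

    The returned pair is the combination of target segmentation pattern and
    target operation loop mode, i.e. the target segmentation strategy.
    On ties, the first pair in insertion order is kept.
    """
    if not costs:
        raise ValueError("no candidate segmentation pattern was evaluated")
    return min(costs, key=costs.get)
```

For example, given costs of 12.0 for `("p0", "loop_a")`, 9.5 for `("p0", "loop_b")`, and 10.0 for `("p1", "loop_a")`, the function returns `("p0", "loop_b")`.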
[0151] S3160. During the execution of the machine learning model, online computation of the target operator is performed according to the target segmentation strategy.
[0152] This invention acquires the machine learning model currently loaded onto the AI platform, identifies the target operator in the model, and obtains multiple segmentation categories matching the target operator. Then, based on the tensor dimensions of each operator parameter in the target operator, it calculates the target data volume required by the target operator in each level of storage space under each segmentation category. When the target data volume calculated for each level of storage space under the current segmentation category is less than the upper limit of the storage capacity of that level, the current segmentation category is determined to be a target segmentation category that meets the hardware specification conditions. Within each target segmentation category, the segmentation dimension value range of each tensor dimension of each operator parameter of the target operator in each level of storage space is determined. Under the current target segmentation category, the current segmentation dimension value ranges are obtained; within each range, the values are filtered once (primary filtering) and then sorted according to at least one priority sorting rule. A multi-level nested loop is constructed from the sorted value ranges and searched. According to the parallelism description information of the current target segmentation category, each combined segmentation pattern generated during the search is matched against the hardware specifications of the AI platform to obtain the successfully matched candidate segmentation patterns. Then, based on the sorting position of each segmentation dimension value in each level of the multi-level nested loop within each successfully matched candidate segmentation pattern, a secondary filtering is performed on the segmentation dimension values not yet searched in the multi-level nested loop, until the traversal of the multi-level nested loop is complete. Finally, a pre-built standard cost model is obtained and parameterized according to the hardware description parameters of the AI platform, yielding a time cost model that matches the AI platform. This time cost model is used to evaluate the time cost of each candidate segmentation pattern under at least one operation loop mode. Based on these time costs, the target segmentation pattern corresponding to the minimum time cost is obtained, together with the target operation loop mode of the target segmentation pattern under the minimum time cost. The target segmentation pattern and the target operation loop mode are combined to obtain the target segmentation strategy, and during the execution of the machine learning model, online computation is performed on the target operator according to the target segmentation strategy. This solves the problem that existing technologies cannot identify, online and in real time, an efficient segmentation strategy for each operator, which results in high maintenance costs. The segmentation strategy with the best cost for each operator in the machine learning model can be identified online in real time, so that the power consumption constraints of the AI platform can be met and the execution efficiency of the target operators is significantly improved.
[0153] Example 4
[0154] Figure 5 is a schematic diagram of an online generation device for a segmentation strategy provided in Embodiment 4 of the present invention. As shown in Figure 5, the device includes: a target operator identification module 410, a segmentation pattern determination module 420, a time cost evaluation module 430, a segmentation strategy determination module 440, and an online calculation module 450;
[0155] The target operator identification module 410 is used to acquire the machine learning model currently loaded into the AI platform and identify the target operator in the machine learning model. The AI platform includes multi-level storage space, at least one DMA unit for data transfer between multi-level storage spaces, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0156] The segmentation pattern determination module 420 is used to determine multiple alternative segmentation patterns that match the target operator based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform. The segmentation pattern includes the segmentation method of each tensor dimension of each operator parameter in each level of storage space and the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0157] The time cost assessment module 430 is used to obtain a time cost model that matches the AI platform, and to use the time cost model to assess the time cost of each alternative segmentation pattern under at least one operation loop.
[0158] The segmentation strategy determination module 440 is used to determine the target segmentation pattern and the target operation loop method based on the time cost, and to combine the target segmentation pattern and the target operation loop method to obtain the target segmentation strategy;
[0159] The online computation module 450 is used to perform online computations on the target operator according to the target segmentation strategy during the execution of the machine learning model.
[0160] This invention acquires the machine learning model currently loaded onto an AI platform and identifies the target operator within it. Then, based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, it determines multiple candidate segmentation patterns matching the target operator. Further, it acquires a time cost model matching the AI platform and uses this model to evaluate the time cost of each candidate segmentation pattern under at least one computational loop. Finally, based on the time cost, it determines the target segmentation strategy obtained by combining the target segmentation pattern and the target computational loop. Ultimately, during the execution of the machine learning model, online computation is performed on the target operator according to the target segmentation strategy. This solves the business requirement of online, real-time generation of operator segmentation strategies, minimizes human intervention, reduces maintenance costs, and integrates the determination and implementation processes of the operator segmentation strategy.
[0161] Optionally, the segmentation pattern determination module 420 includes: a segmentation category acquisition unit, a target data volume calculation unit, a target segmentation category identification unit, and a candidate segmentation pattern determination unit;
[0162] The segmentation category acquisition unit is used to acquire multiple segmentation categories that match the target operator. The segmentation category defines whether each tensor dimension of each operator parameter of the target operator is segmented in each level of storage space, together with the parallelism description information indicating whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0163] The target data volume calculation unit is used to calculate the target data volume required by the target operator in each level of storage space under each segmentation category, based on the tensor dimensions of each operator parameter in the target operator.
[0164] The target segmentation category identification unit is used to identify at least one target segmentation category that meets the hardware specification conditions among all segmentation categories, based on the target data volume and the hardware specifications of the AI platform.
[0165] The alternative segmentation pattern determination unit is used to determine multiple alternative segmentation patterns that match the target operator in each of the target segmentation categories.
[0166] Optionally, the target data volume calculation unit may be specifically used to:
[0167] Obtain the minimum slice size of input data and the minimum slice size of output data in each level of storage space for the target operator under the current segmentation category;
[0168] The sum of the minimum slice size of input data and the minimum slice size of output data in the exclusive storage space is obtained as the target data amount required by the target operator in the exclusive storage space under the current segmentation category.
[0169] Based on the parallelism description information in the current segmentation category, the minimum slice data size of the input data in each shared storage space is divided into exclusive data size and shared data size;
[0170] Based on the exclusive data size, the shared data size, the minimum slice size of output data in each shared storage space, and the number of computing units available to the target operator in the AI platform, calculate the target data amount required by the target operator in each shared storage space under the current segmentation category.
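The data volume calculation above can be sketched as follows. The exclusive-space sum follows paragraph [0168] directly. The shared-space formula is an assumption made for illustration (shared input data stored once, while exclusive input data and output data are replicated per computing unit), since the text names the inputs of the calculation but not the expression itself. All function names are hypothetical.

```python
def exclusive_space_demand(in_min: int, out_min: int) -> int:
    """Target data amount in an exclusive storage space (paragraph [0168])."""
    return in_min + out_min


def shared_space_demand(exclusive_in: int, shared_in: int,
                        out_min: int, num_units: int) -> int:
    """Target data amount in a shared storage space.

    ASSUMPTION: shared input data is stored once, and each of the
    num_units computing units needs its own exclusive input and output slice.
    """
    return shared_in + num_units * (exclusive_in + out_min)


def meets_capacity(demand: int, capacity: int) -> bool:
    """A storage level is acceptable when the demand is strictly below its capacity."""
    return demand < capacity
```

A segmentation category would then qualify as a target segmentation category only if `meets_capacity` holds for every level of storage space.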
[0171] Optionally, the alternative segmentation pattern determination unit may specifically include: a segmentation dimension value range determination subunit and an alternative segmentation pattern determination subunit;
[0172] The segmentation dimension value range determination subunit is used to determine, in each target segmentation category, the segmentation dimension value range of each tensor dimension of each operator parameter of the target operator in each level of storage space;
[0173] The alternative segmentation pattern determination subunit is used to obtain multiple combined segmentation patterns by combination, according to the segmentation dimension value range corresponding to each target segmentation category and the parallelism description information in each target segmentation category, and to select, from the combined segmentation patterns, multiple alternative segmentation patterns that meet the hardware specifications of the AI platform.
[0174] Optionally, the alternative segmentation pattern determination subunit may be specifically used to: obtain the current segmentation dimension value range of each tensor dimension of each operator parameter of the target operator in each level of storage space under the current target segmentation category;
[0175] Within the value range of each current segmentation dimension, the value of each current segmentation dimension is filtered once, and the value of each current segmentation dimension within the value range is sorted according to at least one priority sorting rule.
[0176] The search is performed by constructing multi-level nested loops based on the value ranges of each current segmentation dimension after priority sorting.
[0177] Based on the parallelism description information of the current target segmentation category, the combined segmentation pattern generated each time during the search process is matched with the hardware specifications of the AI platform to obtain the successfully matched alternative segmentation pattern;
[0178] Based on the sorting position of each segmentation dimension value in each level of the multi-level nested loop in each successfully matched candidate segmentation pattern, a secondary filtering is performed on the segmentation dimension values that have not been searched in the multi-level nested loop until the traversal process of the multi-level nested loop is completed.
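The filter, sort, search, and match sequence above can be sketched as follows. This is a simplified illustration: the secondary filtering step is omitted, and the three callables (`primary_filter`, `priority_key`, `matches_hardware`) are hypothetical placeholders for the platform-specific rules. `itertools.product` is used because it enumerates combinations in the same order as a multi-level nested loop, with the last dimension innermost.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def search_candidates(
    value_ranges: Dict[str, Sequence[int]],
    primary_filter: Callable[[str, int], bool],
    priority_key: Callable[[str, int], tuple],
    matches_hardware: Callable[[Dict[str, int]], bool],
) -> List[Dict[str, int]]:
    """Enumerate combined segmentation patterns and keep those that match."""
    names = list(value_ranges)
    # Primary filtering, then priority sorting, per segmentation dimension.
    ordered = [
        sorted(
            (v for v in value_ranges[name] if primary_filter(name, v)),
            key=lambda v: priority_key(name, v),
        )
        for name in names
    ]
    candidates = []
    # Equivalent to one nested for-loop per segmentation dimension.
    for combo in product(*ordered):
        pattern = dict(zip(names, combo))
        if matches_hardware(pattern):
            candidates.append(pattern)
    return candidates
```

Because each value list is sorted by priority before the search, the candidates are produced in priority order, which is what makes a later position-based secondary filtering meaningful.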
[0179] Optionally, the alternative segmentation pattern determination subunit may be specifically used to:
[0180] Obtain the register hardware specifications of the AI platform, and filter the values of each current segmentation dimension according to the register hardware specifications;
[0181] Priority sorting rules include: factor priority rule, integer multiple priority rule of the number of computing units available for the target operator in the AI platform, and large number priority rule.
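One possible reading of the three priority sorting rules is sketched below as a single sort key. The interpretation of each rule (a value that evenly divides the dimension size is preferred; a value that is an integer multiple of the number of available computing units is preferred; a larger value is preferred) and the order in which the rules are applied are assumptions for illustration. The function name is hypothetical.

```python
from typing import List, Sequence


def priority_sort(values: Sequence[int], dim_size: int, num_units: int) -> List[int]:
    """Sort segmentation dimension values by three assumed priority rules."""

    def key(v: int) -> tuple:
        is_factor = dim_size % v == 0          # factor priority rule
        is_unit_multiple = v % num_units == 0  # computing-unit multiple rule
        # False sorts before True, so each preference is negated;
        # -v places larger values first (large number priority rule).
        return (not is_factor, not is_unit_multiple, -v)

    return sorted(values, key=key)
```

For a dimension of size 24 and 4 computing units, the values `[3, 4, 5, 6, 8, 10, 12, 16]` are ordered as `[12, 8, 4, 6, 3, 16, 10, 5]` under these assumptions.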
[0182] Optionally, the alternative segmentation pattern determination subunit may be specifically used to:
[0183] Obtain the sorting position of each target segmentation dimension value in each level of the multi-level nested loop in the currently successfully matched candidate segmentation pattern;
[0184] If, based on the sorting position of the target segmentation dimension values, it is determined that, starting from the innermost loop, the target segmentation dimension values in at least one consecutive inner loop are all in the first sorting position, then all unsearched segmentation dimension values following the target segmentation dimension values in the target outer loop immediately adjacent to the consecutive inner loops will be filtered out.
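The secondary filtering rule can be illustrated with the following sketch, which takes the sorting positions of a successfully matched pattern (outermost loop first, position 0 meaning the first sorting position) and reports which outer loop is truncated. Taking the longest run of inner loops at the first position is one reading of the rule and is an assumption here, as are the function names.

```python
from typing import List, Optional, Sequence, Tuple


def prune_level(positions: Sequence[int]) -> Optional[int]:
    """Return the index of the outer loop to truncate, or None.

    Counts, from the innermost loop outwards, how many consecutive loops
    are at sorting position 0, always leaving at least one outer loop.
    """
    n = len(positions)
    run = 0
    while run < n - 1 and positions[n - 1 - run] == 0:
        run += 1
    if run == 0:
        return None  # the innermost loop is not at its first position
    return n - 1 - run


def filtered_values(levels: Sequence[Sequence[int]],
                    positions: Sequence[int]) -> Optional[Tuple[int, List[int]]]:
    """Return (loop index, values dropped from that loop), or None."""
    target = prune_level(positions)
    if target is None:
        return None
    # Everything after the matched value in the target outer loop is dropped.
    return target, list(levels[target][positions[target] + 1:])
```

The intuition is that, with values sorted by priority, a match that already uses the first-ranked value in every inner loop cannot be improved by moving to a lower-ranked value in the adjacent outer loop, so those values need not be searched.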
[0185] Optionally, the time cost assessment module 430 may specifically include: a standard cost model acquisition unit and a parameter setting unit;
[0186] The standard cost model acquisition unit is used to acquire a pre-built standard cost model.
[0187] The parameter setting unit is used to set parameters for the standard cost model according to the hardware description parameters of the AI platform, so as to obtain a time cost model that matches the AI platform.
[0188] Optionally, the hardware description parameters include at least one of the following: the storage level of the AI platform, the direct memory access DMA unit architecture, the bandwidth rate of each DMA unit, the latency of each DMA unit, the amount of data transferred in a single DMA unit, the configuration time of each DMA unit, the synchronization waiting time between computing units, the execution mode of each DMA unit, and the execution mode of the computing unit.
[0189] Optionally, the time cost evaluation module 430 may be specifically used to:
[0190] The current candidate segmentation pattern, and at least one computational loop mode that matches the current candidate segmentation pattern, are input into the time cost model;
[0191] Using the time cost model, calculate the number of repeated handling operations (repeated data transfers) of the current candidate segmentation pattern under each operation loop mode, based on the current candidate segmentation pattern and each operation loop mode;
[0192] Using the time cost model, calculate the time cost of the current candidate segmentation pattern under each operation loop mode based on the number of repeated handling operations.
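How the operation loop mode influences the number of repeated handling operations can be illustrated with a simplified counting model. The sketch assumes that each tensor keeps a single slice on chip and that a slice is re-transferred whenever any loop at or outside the innermost loop it depends on advances; this counting rule, the function names, and the example tensors `A` and `B` are assumptions for illustration only.

```python
from math import prod
from typing import Dict, Sequence, Set


def transfer_counts(loop_order: Sequence[str],
                    trips: Dict[str, int],
                    depends_on: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count slice transfers per tensor for one loop order (outermost first)."""
    counts = {}
    for tensor, dims in depends_on.items():
        # Depth of the innermost loop whose index selects this tensor's slice.
        depth = max((i for i, d in enumerate(loop_order) if d in dims), default=-1)
        # The slice is re-transferred on every iteration of the loops down to that depth.
        counts[tensor] = prod(trips[d] for d in loop_order[: depth + 1])
    return counts


def transfer_time(counts: Dict[str, int],
                  slice_bytes: Dict[str, int],
                  bandwidth: float) -> float:
    """Total transfer time, assuming time = bytes / bandwidth."""
    return sum(counts[t] * slice_bytes[t] for t in counts) / bandwidth
```

With 4 slices along `m` and 3 along `n`, where `A` depends on `m` and `B` on `n`, the order `("m", "n")` gives 4 transfers of `A` and 12 of `B`, while `("n", "m")` gives 12 of `A` and 3 of `B`. The same segmentation pattern therefore has a different time cost under each operation loop mode.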
[0193] Optionally, the segmentation strategy determination module 440 can be used to: obtain the target segmentation pattern corresponding to the minimum time cost based on each time cost, and obtain the target operation loop mode of the target segmentation pattern under the minimum time cost.
[0194] The online generation device for segmentation strategies provided in this embodiment of the invention can execute the online generation method for segmentation strategies provided in any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
[0195] Example 5
[0196] Figure 6 shows a schematic diagram of an electronic device 510 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.
[0197] As shown in Figure 6, the electronic device 510 includes at least one processor 520 and a memory, such as a read-only memory (ROM) 530 or a random access memory (RAM) 540, communicatively connected to the at least one processor 520. The memory stores computer programs executable by the at least one processor. The processor 520 can perform various appropriate actions and processes based on the computer program stored in the ROM 530 or loaded into the RAM 540 from storage unit 590. The RAM 540 can also store various programs and data required for the operation of the electronic device 510. The processor 520, ROM 530, and RAM 540 are interconnected via a bus 550. An input/output (I/O) interface 560 is also connected to the bus 550.
[0198] Multiple components in electronic device 510 are connected to I / O interface 560, including: input unit 570, such as keyboard, mouse, etc.; output unit 580, such as various types of displays, speakers, etc.; storage unit 590, such as disk, optical disk, etc.; and communication unit 5100, such as network card, modem, wireless transceiver, etc. Communication unit 5100 allows electronic device 510 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0199] Processor 520 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 520 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 520 performs the various methods and processes described above, such as the online generation method of the segmentation strategy.
[0200] The method includes:
[0201] The system retrieves the machine learning model currently loaded onto the AI platform and identifies the target operator within the machine learning model. The AI platform includes multi-level storage space, at least one DMA unit for data transfer between the multi-level storage spaces, and at least one computing unit for computation. The multi-level storage space includes shared storage space and exclusive storage space.
[0202] Based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, multiple alternative partitioning patterns matching the target operator are determined. The partitioning pattern includes the partitioning method of each tensor dimension of each operator parameter in each level of storage space and the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units.
[0203] Obtain a time cost model that matches the AI platform, and use the time cost model to evaluate the time cost of each alternative segmentation pattern under at least one computation cycle.
[0204] Based on the time cost, determine the target segmentation pattern and the target operation loop method, and combine the target segmentation pattern and the target operation loop method to obtain the target segmentation strategy;
[0205] During the execution of the machine learning model, online computation of the target operator is performed according to the target segmentation strategy.
[0206] In some embodiments, the online generation method for the segmentation strategy can be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 590. In some embodiments, part or all of the computer program can be loaded and / or installed on electronic device 510 via ROM 530 and / or communication unit 5100. When the computer program is loaded into RAM 540 and executed by processor 520, one or more steps of the online generation method for the segmentation strategy described above can be performed. Alternatively, in other embodiments, processor 520 can be configured to perform the online generation method for the segmentation strategy by any other suitable means (e.g., by means of firmware).
[0207] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0208] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0209] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0210] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0211] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0212] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.
[0213] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.
[0214] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. An online generation method for a segmentation strategy, characterized in that it comprises: acquiring the machine learning model currently loaded onto the AI platform and identifying the target operator in the machine learning model, wherein the AI platform includes multi-level storage space, at least one Direct Memory Access (DMA) unit for data transfer between the multi-level storage spaces, and at least one computing unit for computation, and the multi-level storage space includes shared storage space and exclusive storage space; determining, based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, multiple alternative segmentation patterns matching the target operator, wherein the segmentation pattern includes the segmentation method of each tensor dimension of each operator parameter in each level of storage space and the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units; obtaining a time cost model that matches the AI platform, and using the time cost model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop mode; determining, based on the time costs, the target segmentation pattern and the target operation loop mode, and combining the target segmentation pattern and the target operation loop mode to obtain the target segmentation strategy; and, during the execution of the machine learning model, performing online computation of the target operator according to the target segmentation strategy; wherein using the time cost model to evaluate the time cost of each alternative segmentation pattern under at least one operation loop mode comprises: inputting the current alternative segmentation pattern, and at least one operation loop mode matching the current alternative segmentation pattern, into the time cost model; calculating, by the time cost model, the number of repeated handling operations of the current alternative segmentation pattern under each operation loop mode, based on the current alternative segmentation pattern and each operation loop mode; and calculating, by the time cost model, the time cost of the current alternative segmentation pattern under each operation loop mode, based on the number of repeated handling operations.
2. The method according to claim 1, characterized in that determining, based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, multiple alternative segmentation patterns matching the target operator comprises: obtaining multiple segmentation categories that match the target operator, wherein the segmentation category defines whether each tensor dimension of each operator parameter of the target operator is segmented in each level of storage space, together with the parallelism description information of whether each tensor dimension of each operator parameter is executed in parallel in multiple computing units; calculating, based on the tensor dimensions of each operator parameter in the target operator, the target data amount required by the target operator in each level of storage space under each segmentation category; identifying, based on the target data amount and the hardware specifications of the AI platform, at least one target segmentation category that meets the hardware specification conditions among all segmentation categories; and determining, in each target segmentation category, multiple alternative segmentation patterns that match the target operator.
3. The method according to claim 2, characterized in that calculating, based on the tensor dimensions of each operator parameter in the target operator, the target data amount required by the target operator in each level of storage space under each segmentation category comprises: obtaining the minimum slice size of input data and the minimum slice size of output data in each level of storage space for the target operator under the current segmentation category; taking the sum of the minimum slice size of input data and the minimum slice size of output data in the exclusive storage space as the target data amount required by the target operator in the exclusive storage space under the current segmentation category; dividing, based on the parallelism description information in the current segmentation category, the minimum slice size of the input data in each shared storage space into an exclusive data size and a shared data size; and calculating, based on the exclusive data size, the shared data size, the minimum slice size of output data in each shared storage space, and the number of computing units available to the target operator in the AI platform, the target data amount required by the target operator in each shared storage space under the current segmentation category.
4. The method according to claim 3, characterized in that identifying, based on the target data amount and the hardware specifications of the AI platform, at least one target segmentation category that meets the hardware specification conditions among all segmentation categories comprises: when the target data amount calculated for the target operator in each level of storage space under the current segmentation category is less than the upper limit of the storage capacity of that level of storage space, determining the current segmentation category to be a target segmentation category that meets the hardware specification conditions.
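The data-amount calculation of claim 3 and the capacity check of claim 4 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the byte units, the `(size, per_unit)` representation of the parallelism description information, and in particular the formula used for shared storage space (shared data counted once, exclusive data and output data counted once per computing unit) are assumptions, since the claims do not state the formula.

```python
from typing import Dict, List, Tuple


def exclusive_space_amount(in_min: int, out_min: int) -> int:
    """Target data amount in an exclusive storage space: the sum of the
    minimum input slice and the minimum output slice (claim 3)."""
    return in_min + out_min


def shared_space_amount(
    input_slices: List[Tuple[int, bool]], out_min: int, num_units: int
) -> int:
    """Target data amount in a shared storage space.

    input_slices holds (size_in_bytes, per_unit) pairs; per_unit=True marks
    input data that each computing unit needs its own copy of (the exclusive
    data amount). ASSUMED formula: shared data is stored once, exclusive data
    and output data are stored once per computing unit.
    """
    exclusive = sum(size for size, per_unit in input_slices if per_unit)
    shared = sum(size for size, per_unit in input_slices if not per_unit)
    return shared + num_units * (exclusive + out_min)


def meets_hardware_spec(amounts: Dict[str, int], capacities: Dict[str, int]) -> bool:
    """Claim 4: every level's target data amount must be strictly less than
    the capacity upper limit of that level."""
    return all(amounts[level] < capacities[level] for level in amounts)


amounts = {
    "exclusive": exclusive_space_amount(1024, 512),
    "shared": shared_space_amount([(1024, True), (2048, False)], 512, num_units=4),
}
print(amounts, meets_hardware_spec(amounts, {"exclusive": 65536, "shared": 16384}))
```

A segmentation category that fails this check for any storage level is discarded before any concrete segmentation values are enumerated.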
5. The method according to claim 2, characterized in that determining, within each target segmentation category, multiple candidate segmentation patterns that match the target operator comprises: determining, within each target segmentation category, the value range of each tensor dimension of each operator parameter of the target operator in each level of storage space; and obtaining multiple combined segmentation patterns based on the value range of the segmentation dimensions corresponding to each target segmentation category and the parallelism description information in each target segmentation category, and selecting, from the combined segmentation patterns, multiple candidate segmentation patterns that meet the hardware specifications of the AI platform.
6. The method according to claim 5, characterized in that obtaining multiple combined segmentation patterns based on the value range of the segmentation dimensions corresponding to each target segmentation category and the parallelism description information in each target segmentation category, and selecting, from the combined segmentation patterns, multiple candidate segmentation patterns that meet the hardware specifications of the AI platform comprises: obtaining, under the current target segmentation category, the value range of each current segmentation dimension, a current segmentation dimension being a tensor dimension of an operator parameter of the target operator that is segmented in a level of storage space; performing, within the value range of each current segmentation dimension, a primary filtering on the values of that current segmentation dimension, and sorting the remaining values within the value range according to at least one priority sorting rule; constructing multi-level nested loops from the priority-sorted value ranges of the current segmentation dimensions and searching through them; matching, based on the parallelism description information of the current target segmentation category, each combined segmentation pattern generated during the search against the hardware specifications of the AI platform to obtain the successfully matched candidate segmentation patterns; and performing, based on the sorting position of each segmentation dimension value at each level of the multi-level nested loop in each successfully matched candidate segmentation pattern, a secondary filtering on the segmentation dimension values that have not yet been searched in the multi-level nested loop, until the traversal of the multi-level nested loop is completed.
7. The method according to claim 6, characterized in that performing, within the value range of each current segmentation dimension, the primary filtering on the values of that current segmentation dimension comprises: obtaining the register hardware specifications of the AI platform, and filtering the values of each current segmentation dimension according to the register hardware specifications; and the priority sorting rules include: a factor priority rule, a priority rule for integer multiples of the number of computing units available to the target operator in the AI platform, and a large-number priority rule.
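The primary filtering and priority sorting of claim 7 might look like the sketch below. The claims do not say how the register specification constrains a value or in what order the three rules are combined, so both are assumptions here: a value is kept only if it is a multiple of a hypothetical register lane count `reg_lanes`, and the rules are applied lexicographically (factors of the dimension size first, then multiples of the computing-unit count, then larger values).

```python
from typing import Iterable, List


def filter_and_sort(
    values: Iterable[int], dim_size: int, num_units: int, reg_lanes: int
) -> List[int]:
    """Primary filtering followed by priority sorting of one segmentation
    dimension's candidate values.

    ASSUMPTIONS: reg_lanes stands in for the register hardware specification;
    the three priority rules are combined as a lexicographic sort key.
    """
    # Primary filtering: drop values the registers cannot handle evenly.
    kept = [v for v in values if v % reg_lanes == 0]
    return sorted(
        kept,
        key=lambda v: (
            dim_size % v != 0,   # factor priority: exact divisors first
            v % num_units != 0,  # multiples of the computing-unit count next
            -v,                  # large-number priority: bigger values first
        ),
    )


print(filter_and_sort(range(1, 17), dim_size=24, num_units=4, reg_lanes=2))
# -> [12, 8, 4, 6, 2, 16, 14, 10]
```

Sorting the most promising values first is what makes the later position-based secondary filtering meaningful: a match at the first sorting position is a match at the preferred value.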
8. The method according to claim 6, characterized in that performing, based on the sorting position of each segmentation dimension value at each level of the multi-level nested loop in each successfully matched candidate segmentation pattern, the secondary filtering on the segmentation dimension values that have not yet been searched in the multi-level nested loop comprises: obtaining the sorting position, at each level of the multi-level nested loop, of each target segmentation dimension value in the currently successfully matched candidate segmentation pattern; and if it is determined from the sorting positions of the target segmentation dimension values that, starting from the innermost loop, the target segmentation dimension values in at least one consecutive inner loop are all at the first sorting position, filtering out all unsearched segmentation dimension values that follow the target segmentation dimension value in the target outer loop immediately adjacent to the consecutive inner loops.
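The nested-loop search of claim 6 with the secondary filtering of claim 8 can be sketched as a recursive traversal. The names `pruned_search` and `fits` are hypothetical; `fits` stands in for matching a combined segmentation pattern against the hardware specifications, and each entry of `ranges` is assumed to be already priority-sorted, ordered from the outermost loop to the innermost.

```python
from typing import Callable, List, Sequence, Tuple


def pruned_search(
    ranges: Sequence[Sequence[int]], fits: Callable[[Tuple[int, ...]], bool]
) -> List[Tuple[int, ...]]:
    """Search priority-sorted value ranges (outermost loop first) and apply
    position-based secondary filtering after each successful match."""
    n = len(ranges)
    matched: List[Tuple[int, ...]] = []
    cut = [False] * n  # cut[k]: skip the remaining values of loop k

    def visit(level: int, idx: Tuple[int, ...]) -> None:
        if level == n:
            pattern = tuple(ranges[l][i] for l, i in enumerate(idx))
            if fits(pattern):
                matched.append(pattern)
                # Walk outward from the innermost loop while the matched
                # values sit at the first sorting position.
                k = n - 1
                while k >= 0 and idx[k] == 0:
                    k -= 1
                # k is now the outer loop adjacent to those inner loops;
                # its remaining unsearched values are filtered out.
                if 0 <= k < n - 1:
                    cut[k] = True
            return
        for i in range(len(ranges[level])):
            visit(level + 1, idx + (i,))
            if cut[level]:
                cut[level] = False
                break

    visit(0, ())
    return matched


found = pruned_search([[8, 4, 2, 1], [4, 2]], lambda p: p[0] * p[1] <= 16)
print(found)  # -> [(8, 2), (4, 4), (4, 2)]
```

In this run the match `(4, 4)` has its inner value at the first sorting position, so the outer values `2` and `1` are never searched.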
9. The method according to any one of claims 1-8, characterized in that obtaining a time cost model that matches the AI platform comprises: obtaining a pre-built standard cost model; and parameterizing the standard cost model based on the hardware description parameters of the AI platform to obtain the time cost model that matches the AI platform.
10. The method according to claim 9, characterized in that the hardware description parameters include at least one of the following: the storage hierarchy of the AI platform, the architecture of the direct memory access (DMA) units, the bandwidth rate of each DMA unit, the latency of each DMA unit, the single data transfer volume of each DMA unit, the configuration time of each DMA unit, the synchronization waiting time between computing units, the execution mode of each DMA unit, and the execution mode of the computing units.
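One way to read claims 9 and 10 is that the standard cost model is a generic function and the hardware description parameters bind its constants for a given platform. The sketch below shows this for a single DMA unit only. The `DmaParams` fields, the units, and the cost formula (per-transfer configuration time and latency plus a bandwidth term) are all assumptions made for illustration; the claims list the parameters but not how they enter the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DmaParams:
    """Hypothetical subset of the hardware description parameters."""
    bandwidth_bytes_per_us: float  # bandwidth rate of the DMA unit
    latency_us: float              # latency of the DMA unit
    config_time_us: float          # configuration time of the DMA unit
    max_bytes_per_transfer: int    # single data transfer volume


def parameterize(dma: DmaParams) -> Callable[[int], float]:
    """Bind platform parameters into a generic transfer-cost formula and
    return a platform-specific time cost function (ASSUMED formula)."""

    def transfer_time_us(num_bytes: int) -> float:
        # Ceiling division: number of DMA transfers needed.
        transfers = -(-num_bytes // dma.max_bytes_per_transfer)
        fixed = transfers * (dma.config_time_us + dma.latency_us)
        return fixed + num_bytes / dma.bandwidth_bytes_per_us

    return transfer_time_us


model = parameterize(DmaParams(1000.0, 2.0, 1.0, 4096))
print(model(8192))  # two transfers: 2 * (1 + 2) + 8192 / 1000
```

Because only the parameter values change between platforms, the same standard model can be reused without rewriting the cost logic.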
11. The method according to claim 1, characterized in that determining the target segmentation pattern and the target operation loop mode based on the time costs comprises: obtaining, from among the time costs, the target segmentation pattern corresponding to the minimum time cost, and obtaining the target operation loop mode of the target segmentation pattern under the minimum time cost.
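The selection step of claim 11 amounts to taking the minimum over all evaluated (segmentation pattern, operation loop mode) pairs. A minimal sketch, assuming the time costs are held in a dictionary keyed by such pairs (the data layout and the name `pick_strategy` are illustrative):

```python
from typing import Dict, Hashable, Tuple


def pick_strategy(
    costs: Dict[Tuple[Hashable, Hashable], float]
) -> Tuple[Hashable, Hashable]:
    """Return the (segmentation pattern, loop mode) pair with minimum cost."""
    return min(costs, key=costs.get)


costs = {
    ((8, 2), "m-n-k"): 128.0,
    ((8, 2), "n-k-m"): 80.0,
    ((4, 4), "m-n-k"): 96.0,
}
print(pick_strategy(costs))  # -> ((8, 2), 'n-k-m')
```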
12. An online generation device for a segmentation strategy, characterized in that it comprises: a target operator identification module, used to acquire the machine learning model currently loaded into the artificial intelligence (AI) platform and identify the target operator in the machine learning model, wherein the AI platform includes multi-level storage space, at least one direct memory access (DMA) unit for data transfer between the levels of storage space, and at least one computing unit for computation, and the multi-level storage space includes shared storage space and exclusive storage space; a segmentation pattern determination module, used to determine multiple alternative segmentation patterns that match the target operator based on the tensor dimensions of each operator parameter in the target operator and the hardware specifications of the AI platform, wherein each segmentation pattern includes the segmentation method of each tensor dimension of each operator parameter in each level of storage space and the parallelism description information indicating whether each tensor dimension of each operator parameter is executed in parallel on multiple computing units; a time cost assessment module, used to obtain a time cost model that matches the AI platform, and to use the time cost model to assess the time cost of each alternative segmentation pattern under at least one operation loop mode; a segmentation strategy determination module, used to determine the target segmentation pattern and the target operation loop mode based on the time costs, and to combine the target segmentation pattern and the target operation loop mode to obtain the target segmentation strategy; and an online computation module, used to perform online computation on the target operator according to the target segmentation strategy during the execution of the machine learning model.
Specifically, the time cost assessment module is used to: input the current alternative segmentation pattern and at least one operation loop mode matching the current alternative segmentation pattern into the time cost model; calculate, based on the current alternative segmentation pattern and each operation loop mode, the number of repeated handling (data transfer) operations of the current alternative segmentation pattern under each operation loop mode; and calculate, based on the number of repeated handling operations, the time cost of the current alternative segmentation pattern under each operation loop mode.
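The dependence of the repeated handling count on the operation loop mode can be illustrated with a simple reuse model. This sketch assumes that a loop mode is an ordering of tile loops (outermost first), that each operand holds one tile at a time in the target storage level, and that time cost is proportional to bytes moved; none of these details is stated in the claims, and the names `repeat_count` and `time_cost` are hypothetical.

```python
from math import prod
from typing import Dict, Sequence, Tuple


def repeat_count(
    loop_order: Sequence[str], trips: Dict[str, int], operand_dims: Sequence[str]
) -> int:
    """How many times each distinct tile of an operand is re-transferred.

    A tile stays resident while only loops inside the innermost loop that
    indexes the operand advance; every enclosing loop that does NOT index
    the operand forces the same tiles to be moved again.
    """
    inner = max(
        (i for i, d in enumerate(loop_order) if d in operand_dims and trips[d] > 1),
        default=-1,
    )
    return prod(trips[d] for d in loop_order[: inner + 1] if d not in operand_dims)


def time_cost(
    loop_order: Sequence[str],
    trips: Dict[str, int],
    operands: Sequence[Tuple[Sequence[str], int]],
    bytes_per_us: float,
) -> float:
    """ASSUMED cost: total bytes moved for all operands divided by bandwidth."""
    total = 0
    for dims, tile_bytes in operands:
        distinct_tiles = prod(trips[d] for d in dims)
        total += tile_bytes * distinct_tiles * repeat_count(loop_order, trips, dims)
    return total / bytes_per_us


# Tiled matrix multiply: A is indexed by (m, k), B by (k, n).
trips = {"m": 4, "n": 2, "k": 8}
operands = [(("m", "k"), 1024), (("k", "n"), 1024)]
print(time_cost(("m", "n", "k"), trips, operands, 1024.0))  # -> 128.0
print(time_cost(("n", "k", "m"), trips, operands, 1024.0))  # -> 80.0
```

The same segmentation pattern costs 128 or 80 time units depending only on the loop order, which is why the time cost is evaluated per pattern and per loop mode.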
13. An electronic device, characterized in that the electronic device comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the online generation method of the segmentation strategy according to any one of claims 1-11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed, cause a processor to perform the online generation method of the segmentation strategy according to any one of claims 1-11.