Method and apparatus for processing computing tasks
By obtaining partitioning information from the operator partitioning information database, the graph optimizer automatically partitions the input and output tensors of operators. This solves the problem that graph optimization cannot be decoupled from operator optimization and enables computational tasks to be computed in parallel on multiple computing resources, making the operator partitioning strategy more flexible and efficient.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2022-01-28
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, graph optimizers need to be based on the principles of specific operators when performing operator optimization, and cannot achieve automatic partitioning of operator input tensors. This results in the inability to decouple graph optimization and operator optimization, limiting the parallel computation of computational tasks on multiple computing resources.
The graph optimizer obtains partitioning information from the operator partitioning information library and automatically partitions the operator input and output tensors, achieving complete decoupling between the graph optimizer and the operator optimization module. It utilizes different types of axes for partitioning strategies to adapt to parallel computing of multiple computing resources.
It achieves complete decoupling of the graph optimizer and operator optimization module, improves the parallel computing capability of computing tasks on multiple computing resources, enhances the generalization ability of operator partitioning strategies, and reduces the dependence on operator mathematical semantics and underlying implementation.
Figure CN116888601B_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence, and more specifically, to a method and apparatus for processing computational tasks.
Background Technology
[0002] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making. Research in the field of AI includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and fundamental AI theories.
[0003] Open-source deep learning frameworks such as TensorFlow, PyTorch, and MXNet provide user-friendly programming environments for deep learning models, enabling users to easily deploy their designed deep learning models on general-purpose computer hardware platforms such as central processing units (CPUs) and graphics processing units (GPUs). If a designed deep learning model is deployed on a specific device, it typically uses the forward inference framework from that device vendor; for example, TensorRT is used on NVIDIA GPUs. If a designed deep learning model needs to run on multiple different types of devices, a deep learning compiler can be used to generate, from the model described by the deep learning framework, code that runs efficiently on each type of device.
[0004] Deep learning compilers typically improve model performance on different hardware through graph optimization and operator optimization. These two optimization methods are usually relatively decoupled and independent. However, graph optimization often relies on the principles of the operators themselves to obtain suitable parallel optimization strategies. Therefore, how to perform operator partitioning automatically in a graph optimizer without relying on the principles of specific operators is a pressing problem.
Summary of the Invention
[0005] This application provides a method and apparatus for processing computational tasks, enabling a graph optimizer to automatically partition operator input and output tensors without being based on the principles of specific operators, thereby achieving complete decoupling between the graph optimizer and the operator optimization module, and allowing the operators corresponding to the computational task to be computed in parallel on multiple computing resources.
[0006] In a first aspect, a method for processing computational tasks is provided, which is executed by a graph optimizer, comprising: determining a first operator for performing the computational task, the first operator comprising N separable axes, where N is a positive integer greater than or equal to 1; obtaining partitioning information of the first operator from an operator partitioning information database, the partitioning information of the first operator comprising the axis type of the nth separable axis among the N separable axes in the first operator and first position information, wherein the first position information indicates the position of the nth separable axis in the input tensor of the first operator, where n=1,…,N; partitioning the input tensor of the first operator according to the partitioning information of the first operator to obtain K sets of input tensors, where K is a positive integer greater than or equal to 2; and sending the K sets of input tensors to K target computational resources respectively, so that the K target computational resources complete the computational task.
[0007] It should be understood that the N separable axes included in the first operator mean that the input tensor of the first operator includes N separable axes.
[0008] It should also be understood that the computational task can be a computational task in the field of artificial intelligence, such as image processing, video processing, speech processing, natural language processing, etc.; the computational task can also be a computational task in the field of big data processing, or a computational task in the field of high-performance computing (HPC), and this application does not impose any limitations on this. Accordingly, the input tensor of the first operator corresponding to the computational task can be the input tensor corresponding to the computational task in any of the above-mentioned fields. For example, when the computational task is an image processing task, the input tensor of the first operator represents image-related data.
[0009] Currently, the partitioning of operator input tensors is determined by algorithm engineers at the application layer using scripting languages based on the partitioning axes included in a particular operator type. Therefore, automatic partitioning of operator input tensors is not possible. However, in this embodiment, the graph optimizer obtains operator partitioning information from an operator partitioning information database. Since the partitioning information for each operator can be directly obtained from this database, the graph optimizer does not need to be aware of the mathematical semantics and underlying implementation of each operator to automatically partition the operator input tensors. This achieves complete decoupling between graph optimization and operator optimization, enabling the operators corresponding to the computational tasks to be computed in parallel on multiple computational resources.
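The flow described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `AxisInfo`, `SPLIT_INFO_DB`, `split_evenly`, and `partition_inputs` are hypothetical, tensors are modelled as one-dimensional Python lists (so the stored axis position is always 0), and the "add" entry is an assumed database record.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AxisInfo:
    axis_type: str          # "element", "reduction" or "sliding_window"
    input_positions: dict   # input tensor index -> position of the axis in that tensor

# Hypothetical database: operator name -> partitioning information of its separable axes.
SPLIT_INFO_DB = {
    "add": [AxisInfo("element", {0: 0, 1: 0})],
}

def split_evenly(values, k):
    """Split a list into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks

def partition_inputs(op_name, inputs, k):
    """Build k groups of inputs from the database entry alone, without operator semantics."""
    info = SPLIT_INFO_DB[op_name][0]            # use the first separable axis
    groups = [list(inputs) for _ in range(k)]   # inputs without the axis stay whole
    for tensor_idx in info.input_positions:
        for group, chunk in zip(groups, split_evenly(inputs[tensor_idx], k)):
            group[tensor_idx] = chunk
    return groups

groups = partition_inputs("add", [[1, 2, 3, 4], [10, 20, 30, 40]], k=2)
# Each target resource would run the unchanged operator on its own group.
partial_outputs = [[a + b for a, b in zip(x, y)] for x, y in groups]
```

The partitioning function never inspects what "add" computes; it only reads the axis type and positions recorded for the operator.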
[0010] In one possible implementation, the type of the separable axis is one of the following: element axis, reduction axis, and sliding window axis; wherein, the axis in which the elements in the input tensor and output tensor of the operator have a point-to-point mapping relationship is the element axis; if the input tensor of the operator has a first axis, but the output tensor of the operator does not have a first axis, then the first axis is the reduction axis; the axis in which the operator performs a sliding window scan operation on the elements in the input tensor of the operator is the sliding window axis.
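The three axis definitions can be illustrated with a small sketch. In the method described here the axis types are read from the database rather than inferred, so the `classify` helper below is purely a hypothetical illustration of the definitions; the axis names `n`, `c`, and `w` are assumptions.

```python
from enum import Enum

class AxisType(Enum):
    ELEMENT = "element"
    REDUCTION = "reduction"
    SLIDING_WINDOW = "sliding_window"

def classify(input_axes, output_axes, window_axes=()):
    """Label each input axis according to the three definitions above."""
    result = {}
    for axis in input_axes:
        if axis not in output_axes:
            result[axis] = AxisType.REDUCTION        # present in input, absent in output
        elif axis in window_axes:
            result[axis] = AxisType.SLIDING_WINDOW   # scanned by a sliding window
        else:
            result[axis] = AxisType.ELEMENT          # point-to-point mapping
    return result

relu_axes = classify(["n", "c"], ["n", "c"])                      # element-wise operator
sum_axes = classify(["n", "c"], ["n"])                            # sum over axis "c"
conv_axes = classify(["n", "w"], ["n", "w"], window_axes=("w",))  # 1-D convolution over "w"
```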
[0011] In one possible implementation, a target splitting axis is determined, which is one of N splittable axes; based on the splitting information of the first operator, the splitting method corresponding to the axis type of the target splitting axis in the first operator is determined; based on the splitting method corresponding to the axis type of the target splitting axis in the first operator, the input tensor of the first operator is split to obtain K sets of input tensors.
[0012] In the embodiments of this application, the graph optimizer performs single-operator segmentation based on the types of the axes in the operator's input tensors and the segmentation method corresponding to each axis type. This enables the graph optimizer to automatically obtain different single-operator segmentation strategies without relying on the principles of a specific operator, thereby achieving complete decoupling between the graph optimizer and the operator optimization module.
[0013] In one possible implementation, segmenting the input tensor of the first operator according to the segmentation method corresponding to the axis type of the target segmentation axis in the first operator to obtain K sets of input tensors includes: determining, according to the segmentation method, the Q first input tensors of the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segmenting each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator and the number of target computing resources K to obtain Q groups of second input tensors; and obtaining the K sets of input tensors based on the Q groups of second input tensors and the input tensors of the first operator that are not segmented.
[0014] Each of the Q groups of second input tensors includes K second input tensors. The qth group is the result of splitting the qth of the Q first input tensors into K segments, where q=1,…,Q.
[0015] The kth set among the K sets of input tensors includes the kth second input tensor from each of the Q groups of second input tensors, together with the input tensors of the first operator that are not segmented.
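How the K sets are assembled from the Q groups can be sketched as below. This is a minimal sketch under stated assumptions: tensors are one-dimensional lists, the split is even and contiguous, and the names `build_input_sets`, `split_evenly`, and `split_indices` are hypothetical.

```python
def split_evenly(values, k):
    """Split a list into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks

def build_input_sets(inputs, split_indices, k):
    """inputs: all input tensors; split_indices: the Q tensors that carry the target axis."""
    # Q groups of second input tensors, each group holding K pieces.
    q_groups = {i: split_evenly(inputs[i], k) for i in split_indices}
    sets = []
    for kk in range(k):
        # The kth set takes the kth piece of every split tensor and the unsplit tensors whole.
        sets.append([q_groups[i][kk] if i in q_groups else t
                     for i, t in enumerate(inputs)])
    return sets

x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 1]
scale = [2]                      # does not carry the target axis, so it is not split
input_sets = build_input_sets([x, y, scale], split_indices=[0, 1], k=3)
```

Here Q is 2 and K is 3, so each of the three sets holds one piece of `x`, one piece of `y`, and the whole of `scale`.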
[0016] In one possible implementation, the operators used to perform the computational task further include a second operator, and the second operator includes P separable axes, which are a subset of the N separable axes. In this case, segmenting the input tensor of the first operator according to the partitioning information of the first operator to obtain K sets of input tensors includes: obtaining partitioning information of the second operator from the operator partitioning information database, where the partitioning information of the second operator includes the axis type of the p-th separable axis among the P separable axes in the second operator and second position information, the second position information indicates the position of the p-th separable axis in the input tensor of the second operator, the input tensor of the second operator is the output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p=1,…,P; determining P pieces of segmentation reference information according to the partitioning information of the first operator and the partitioning information of the second operator, where the p-th piece includes the axis type of the p-th separable axis in the first operator, the axis type of the p-th separable axis in the second operator, and the position of the p-th separable axis in the input tensor of the first operator; determining P groups of candidate segmentation methods according to the P pieces of segmentation reference information, where the p-th group includes at least one segmentation method; determining a target segmentation method according to the time required to complete the computational task under each segmentation method in the P groups of candidate segmentation methods; and segmenting the input tensor of the first operator according to the target segmentation method to obtain the K sets of input tensors.
[0017] The segmentation methods included in the p-th group of candidate segmentation methods are determined based on the p-th segmentation reference information among the P segmentation reference information and the number of computing resources M.
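Selecting the target segmentation method by completion time can be sketched as follows. The candidate names and the timings in `ESTIMATED_MS` are invented for illustration only; how the time is actually obtained (measurement or a cost model) is not specified in this passage, and `choose_target_method` is a hypothetical helper.

```python
def choose_target_method(candidate_groups, estimate_time):
    """Flatten the P groups of candidates and return the one with the lowest time."""
    candidates = [method for group in candidate_groups for method in group]
    return min(candidates, key=estimate_time)

# P = 2 groups of candidates, one group per separable axis (hypothetical names).
candidate_groups = [
    [("axis_0", "element_split")],
    [("axis_1", "overlapping_split"), ("axis_1", "non_overlapping_split")],
]

# Made-up completion times in milliseconds, standing in for a real measurement.
ESTIMATED_MS = {
    ("axis_0", "element_split"): 4.0,
    ("axis_1", "overlapping_split"): 3.2,
    ("axis_1", "non_overlapping_split"): 3.6,
}

target_method = choose_target_method(candidate_groups, ESTIMATED_MS.get)
```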
[0018] In this embodiment, the graph optimizer automatically segments the operator input and output tensors according to different types of axes. For the graph optimizer, it is not necessary to segment the input and output tensors based on the specific principles of the operators; it only needs to segment them based on the operator segmentation methods corresponding to different types of axes. For the operators, segmenting the input and output tensors does not change the operator's calculation formula; only some parameters of the operator are changed. This achieves complete decoupling between graph optimization and the specific operator principles. Furthermore, the generalization ability of segmenting the first input tensor of the operator based on different types of axes is stronger. In addition, based on the axis type of the separable axis and the position information of the separable axis on the operator's input and output tensors included in the operator's segmentation information, a suitable operator segmentation method can be flexibly selected.
[0019] In one possible implementation, segmenting the input tensor of the first operator according to the target segmentation method to obtain K sets of input tensors includes: determining, according to the target segmentation method, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segmenting each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number of target computational resources K to obtain Q groups of second input tensors, where each of the Q groups includes K second input tensors and the q-th group is the result of segmenting the q-th of the Q first input tensors into K segments, q=1,…,Q; and obtaining the K sets of input tensors based on the Q groups of second input tensors and the input tensors of the first operator that are not segmented.
[0020] The kth set among the K sets of input tensors includes the kth second input tensor from each of the Q groups of second input tensors, together with the input tensors of the first operator that are not segmented.
[0021] In one possible implementation, when the axis type of the target splitting axis in the first operator is an element axis or a sliding window axis, and the axis type of the target splitting axis in the second operator is also an element axis or a sliding window axis, splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number of target computational resources K to obtain Q groups of second input tensors includes: determining, based on the first position information and the second position information of the target splitting axis, L first output tensors in the first operator that include the target splitting axis and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; using the first input length as the input to the forward shape derivation function corresponding to the axis type of the target splitting axis in the first operator to obtain a third input length, where the first input length is the length of the target splitting axis in each first input tensor, and this length is equal in every first input tensor; using the third input length as the input to the forward shape derivation function corresponding to the axis type of the target splitting axis in the second operator to obtain a first output length; splitting the L first output tensors along the target splitting axis based on the first output length and the number of target computing resources K to obtain L groups of second output tensors, where each group includes K second output tensors and the l-th group is the result of splitting the l-th of the L first output tensors into K segments; using the K second output lengths corresponding to the target splitting axis in each of the L groups of second output tensors as inputs to the inverse derivation function corresponding to the axis type of the target splitting axis in the second operator to obtain the K third input lengths corresponding to the target splitting axis in each of the Q groups of fifth input tensors, where the length of the target splitting axis in the k-th second output tensor is equal across the L groups of second output tensors, and the length of the target splitting axis in the k-th fifth input tensor is equal across the Q groups of fifth input tensors; using the K third input lengths corresponding to the target splitting axis in each of the Q groups of fifth input tensors as inputs to the inverse derivation function corresponding to the axis type of the target splitting axis in the first operator to obtain the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, where the length of the target splitting axis in the k-th second input tensor is equal across the Q groups of second input tensors; and splitting each of the Q first input tensors along the target splitting axis based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors.
[0022] In this embodiment of the application, the segmented input tensor is subjected to continuous operator operations on the same target computing resource, which enables parallel computing of multiple target computing resources.
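The length derivation across two consecutive operators can be sketched as below. The shape functions are assumptions: they use the common no-padding, no-dilation sliding window formula, and an element axis corresponds to window 1 and stride 1. The helper names and the two example operators are hypothetical.

```python
def forward_len(in_len, window, stride):
    """Assumed forward shape derivation: input length -> output length."""
    return (in_len - window) // stride + 1

def inverse_len(out_len, window, stride):
    """Assumed inverse derivation: output length -> minimal input length."""
    return (out_len - 1) * stride + window

def split_lengths(total, k):
    """Divide a length into k parts whose sizes differ by at most one."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]

first_op = (3, 1)    # hypothetical first operator: window 3, stride 1
second_op = (2, 2)   # hypothetical second operator: window 2, stride 2
K = 2

first_input_len = 12
third_input_len = forward_len(first_input_len, *first_op)      # length entering the second operator
first_output_len = forward_len(third_input_len, *second_op)    # length leaving the second operator

second_output_lens = split_lengths(first_output_len, K)        # split the final output
third_input_lens = [inverse_len(n, *second_op) for n in second_output_lens]
second_input_lens = [inverse_len(n, *first_op) for n in third_input_lens]
```

With these numbers the two resources need input lengths that sum to more than the original length, which reflects the overlap a sliding window axis introduces.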
[0023] In one possible implementation, when the target splitting axis in the first operator is an element axis or a sliding window axis, the first position information of the target splitting axis is also used to indicate the position of the target splitting axis in the output tensor of the first operator. In this case, splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the number of target computational resources K to obtain Q groups of second input tensors includes: determining, based on the first position information of the target splitting axis, L first output tensors in the first operator that include the target splitting axis and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; using the first input length as the input to the forward shape derivation function of the target splitting axis to obtain the first output length, where the first input length is the length of the target splitting axis in each first input tensor, and this length is equal in every first input tensor; splitting the L first output tensors along the target splitting axis based on the first output length and the number of target computing resources K to obtain L groups of second output tensors, where each group includes K second output tensors; using the K second output lengths corresponding to the target splitting axis in each group of second output tensors as inputs to the inverse derivation function of the target splitting axis to obtain the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors; and splitting each of the Q first input tensors along the target splitting axis based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors.
[0024] The l-th group among the L groups of second output tensors is the result of splitting the l-th of the L first output tensors into K segments.
[0025] The target splitting axis has the same length in the kth second output tensor of each of the L groups of second output tensors, and the same length in the kth second input tensor of each of the Q groups of second input tensors.
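For a single operator, the forward and inverse derivation can be sketched as follows. The formulas are assumptions (the usual no-padding, no-dilation sliding window relation); the passage above does not give them, and the function names are hypothetical.

```python
def forward_len(in_len, window, stride):
    """Assumed forward shape derivation for a sliding window axis."""
    return (in_len - window) // stride + 1

def inverse_len(out_len, window, stride):
    """Assumed inverse derivation: input length needed to produce out_len outputs."""
    return (out_len - 1) * stride + window

def split_lengths(total, k):
    """Divide a length into k parts whose sizes differ by at most one."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]

window, stride, K = 3, 1, 2
first_input_len = 10
first_output_len = forward_len(first_input_len, window, stride)
second_output_lens = split_lengths(first_output_len, K)
second_input_lens = [inverse_len(n, window, stride) for n in second_output_lens]

# For an element axis the same helpers reduce to the identity (window 1, stride 1).
element_input_lens = [inverse_len(n, 1, 1) for n in split_lengths(forward_len(10, 1, 1), K)]
```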
[0026] In one possible implementation, when the target split axis in the first operator is an element axis, each of the Q first input tensors is split according to the target split axis based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors. This includes: based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, the first split function is scheduled to split each of the Q first input tensors according to the target split axis to obtain the Q groups of second input tensors.
[0027] Here, the elements along the target split axis in each second input tensor of the q-th group among the Q groups of second input tensors are a subset of the elements along the target split axis in the q-th of the Q first input tensors. The second input tensors of the q-th group share no elements along the target split axis, and the union of their elements along that axis equals the elements along the target split axis in the q-th first input tensor.
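The disjoint-and-complete property of an element-axis split can be checked with a short sketch. The helper `split_by_lengths` is a hypothetical stand-in for the first split function, and the tensor is modelled as a one-dimensional list.

```python
def split_by_lengths(values, lengths):
    """Cut a list into consecutive, non-overlapping pieces of the given lengths."""
    pieces, start = [], 0
    for n in lengths:
        pieces.append(values[start:start + n])
        start += n
    return pieces

first_input = list(range(10))
second_inputs = split_by_lengths(first_input, [5, 5])   # K = 2 second input lengths

# No element appears in two pieces, and together the pieces rebuild the original.
no_overlap = not (set(second_inputs[0]) & set(second_inputs[1]))
rebuilt = second_inputs[0] + second_inputs[1]
```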
[0028] In one possible implementation, when the target splitting axis in the first operator is a sliding window axis, each of the Q first input tensors is split according to the target splitting axis based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors. This includes: based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, the first slicing function is scheduled to perform overlapping splitting on each of the Q first input tensors according to the target splitting axis, to obtain the Q groups of second input tensors.
[0029] Here, the elements along the target split axis in each second input tensor of the q-th group among the Q groups of second input tensors are a subset of the elements along the target split axis in the q-th of the Q first input tensors. The second input tensors of the q-th group share some elements along the target split axis, and the union of their elements along that axis equals the elements along the target split axis in the q-th first input tensor.
[0030] In one possible implementation, when the target splitting axis in the first operator is a sliding window axis, splitting each of the Q first input tensors along the target splitting axis based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: splitting each of the Q first input tensors along the target splitting axis by scheduling a second splitting function to obtain Q groups of third input tensors, where each group includes K third input tensors; slicing, by scheduling a second slicing function and based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, the K third input tensors in each of the Q groups of third input tensors along the target splitting axis to obtain Q groups of fourth input tensors; and concatenating, by scheduling a concatenation function, the k-th fourth input tensor in the q-th group of fourth input tensors with the k-th third input tensor in the q-th group of third input tensors along the target splitting axis to obtain the Q groups of second input tensors.
[0031] Here, the elements along the target split axis in each third input tensor of the q-th group among the Q groups of third input tensors are a subset of the elements along the target split axis in the q-th of the Q first input tensors. The third input tensors of the q-th group share no elements along the target split axis, and the union of their elements along that axis equals the elements along the target split axis in the q-th first input tensor.
[0032] The elements along the target splitting axis in the k-th second input tensor of the q-th group among the Q groups of second input tensors are contiguous.
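An overlapping split along a sliding window axis can be sketched as below. The window operator (`window_sum`) and the way piece offsets are computed are assumptions chosen for illustration; `overlapping_split` is a hypothetical stand-in for the first slicing function.

```python
def overlapping_split(values, out_lens, window, stride):
    """Cut overlapping pieces so that piece k yields out_lens[k] window outputs."""
    pieces, out_start = [], 0
    for n in out_lens:
        start = out_start * stride            # first input element this piece needs
        length = (n - 1) * stride + window    # assumed inverse shape derivation
        pieces.append(values[start:start + length])
        out_start += n
    return pieces

def window_sum(values, window, stride):
    """Toy sliding window operator: sum over each window (no padding)."""
    return [sum(values[i:i + window])
            for i in range(0, len(values) - window + 1, stride)]

x = list(range(10))
pieces = overlapping_split(x, out_lens=[4, 4], window=3, stride=1)

full_output = window_sum(x, 3, 1)
partial_outputs = [window_sum(p, 3, 1) for p in pieces]
```

The two pieces share elements 4 and 5, and concatenating the partial outputs reproduces the output of the unsplit operator.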
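The split, slice, and concatenate sequence can be sketched as follows. Several details are assumptions not stated above: the disjoint split follows the output split scaled by the stride, each missing boundary region is sliced from the head of the next piece, and that region fits inside a single neighbour. The function names are hypothetical.

```python
def disjoint_split(values, out_lens, stride):
    """Second splitting function (assumed): non-overlapping third input tensors."""
    pieces, start = [], 0
    for i, n in enumerate(out_lens):
        end = len(values) if i == len(out_lens) - 1 else start + n * stride
        pieces.append(values[start:end])
        start = end
    return pieces

def boundary_slices(third, second_input_lens):
    """Second slicing function (assumed): the part each piece lacks, cut from its neighbour."""
    fourth = []
    for k, piece in enumerate(third):
        missing = second_input_lens[k] - len(piece)
        neighbour = third[k + 1] if k + 1 < len(third) else []
        fourth.append(neighbour[:missing] if missing > 0 else [])
    return fourth

window, stride = 3, 1
out_lens = [4, 4]
second_input_lens = [(n - 1) * stride + window for n in out_lens]

x = list(range(10))
third = disjoint_split(x, out_lens, stride)
fourth = boundary_slices(third, second_input_lens)
# Concatenation step: the kth third tensor followed by the kth fourth tensor.
second = [t + f for t, f in zip(third, fourth)]
```

Each resulting second input tensor holds a contiguous run of elements, consistent with the contiguity condition stated above.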
[0033] In the embodiments of this application, the non-overlapping partitioning method for the sliding window axis is suitable for scenarios where frequent data synchronization is required between different computing resources, such as multi-die parallelism. The concatenation function serves as a data synchronization node between different dies. This avoids recomputing overlapping data and prevents the amount of overlapping data from growing, which effectively reduces the computational and storage pressure on the computing resources.
[0034] In one possible implementation, when the target splitting axis is a reduction axis in the first operator, each of the Q first input tensors is split according to the axis type of the target splitting axis in the first operator and the number of target computing resources K, to obtain Q sets of second input tensors. This includes: according to the number of target computing resources K, by calling the third splitting function, each of the Q first input tensors is split, to obtain Q sets of second input tensors.
[0035] Here, the elements along the target split axis in each second input tensor of the q-th group among the Q groups of second input tensors are a subset of the elements along the target split axis in the q-th of the Q first input tensors. The second input tensors of the q-th group share no elements along the target split axis, and the union of their elements along that axis equals the elements along the target split axis in the q-th first input tensor.
[0036] In this embodiment, because the reduction axis type already determines the specific partitioning method, the graph optimizer can reasonably partition the input tensors of a specific operator that includes a reduction axis without relying on the principles of that operator. By contrast, the traditional operator partitioning method starts from the output tensor of the operator. Since a reduction axis either does not appear in the output tensor or has length 1 there, the traditional method cannot partition an axis of the input tensor that has the characteristics of a reduction axis.
[0037] In one possible implementation, the reduction axis includes a first type of reduction axis and a second type of reduction axis. The first type of reduction axis is a reduction axis in which the operator performs a reduction operation on the elements in the input tensor of the operator, and the second type of reduction axis is a reduction axis in which the operator does not perform a reduction operation on the elements in the input tensor of the operator.
[0038] In one possible implementation, the first type of reduction axis includes any one of the following: reduction sum axis, reduction maximum axis, reduction minimum axis, and reduction average axis; wherein, the reduction sum axis is the reduction axis for the operator to perform a summation reduction operation on the elements in the operator's input tensor; the reduction maximum axis is the reduction axis for the operator to perform a maximum reduction operation on the elements in the operator's input tensor; the reduction minimum axis is the reduction axis for the operator to perform a minimum reduction operation on the elements in the operator's input tensor; and the reduction average axis is the reduction axis for the operator to perform an average reduction operation on the elements in the operator's input tensor.
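Splitting along the four reduction axes of the first type can be sketched as below. How the partial results are merged is not described in this passage, so the `combine` step is an assumption; in particular, the average is carried as a (sum, count) pair so that unequal pieces are weighted correctly. All function names are hypothetical.

```python
def split_evenly(values, k):
    """Split a list into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks

def partial_reduce(chunk, kind):
    """Reduction computed by one resource on its own piece."""
    if kind == "sum":
        return sum(chunk)
    if kind == "max":
        return max(chunk)
    if kind == "min":
        return min(chunk)
    if kind == "mean":
        return (sum(chunk), len(chunk))   # keep the count for a correct merge
    raise ValueError(kind)

def combine(partials, kind):
    """Assumed merge step applied to the K partial results."""
    if kind == "sum":
        return sum(partials)
    if kind == "max":
        return max(partials)
    if kind == "min":
        return min(partials)
    if kind == "mean":
        return sum(s for s, _ in partials) / sum(n for _, n in partials)
    raise ValueError(kind)

x = [4, 1, 7, 3, 9, 2]
chunks = split_evenly(x, 3)
results = {kind: combine([partial_reduce(c, kind) for c in chunks], kind)
           for kind in ("sum", "max", "min", "mean")}
```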
[0039] In one possible implementation, the second type of reduction axis includes a reduction acquisition axis, which is the axis of the operator's input tensor along which data is indexed according to the addresses indicated by the elements of the operator's index input tensor.
[0040] As one possible implementation, computing resources include one of the following types: graphics processing unit (GPU), central processing unit (CPU), die, or chip.
[0041] In a second aspect, an apparatus for processing computational tasks is provided. This apparatus is applied to a graph optimizer and includes a processor and a transmission interface. The processor is configured to: determine a first operator for performing the computational task, the first operator including N separable axes, where N is a positive integer greater than or equal to 1; obtain partitioning information of the first operator from an operator partitioning information database, the partitioning information including the axis type of the nth separable axis among the N separable axes in the first operator and first position information, wherein the first position information indicates the position of the nth separable axis in the input tensor of the first operator, where n=1,…,N; and partition the input tensor of the first operator according to the partitioning information to obtain K sets of input tensors, where K is a positive integer greater than or equal to 2. The transmission interface is configured to send the K sets of input tensors to K target computational resources respectively, so that the K target computational resources complete the computational task.
[0042] It should be understood that the N separable axes included in the first operator mean that the input tensor of the first operator includes N separable axes.
[0043] It should also be understood that the computational task can be a computational task in the field of artificial intelligence, such as image processing, video processing, speech processing, natural language processing, etc.; the computational task can also be a computational task in the field of big data processing, or a computational task in the field of high-performance computing (HPC), and this application does not impose any limitations on this. Accordingly, the input tensor of the first operator corresponding to the computational task can be the input tensor corresponding to the computational task in any of the above-mentioned fields. For example, when the computational task is an image processing task, the input tensor of the first operator represents image-related data.
[0044] Currently, the partitioning of operator input tensors is determined by algorithm engineers at the application layer using scripting languages based on the partitioning axes included in a particular operator type. Therefore, automatic partitioning of operator input tensors is not possible. However, in this embodiment, the graph optimizer obtains operator partitioning information from an operator partitioning information database. Since the partitioning information for each operator can be directly obtained from this database, the graph optimizer does not need to be aware of the mathematical semantics and underlying implementation of each operator to automatically partition the operator input tensors. This achieves complete decoupling between graph optimization and operator optimization, enabling the operators corresponding to the computational tasks to be computed in parallel on multiple computational resources.
[0045] In one possible implementation, the type of the separable axis is one of the following: element axis, reduction axis, and sliding window axis; wherein, the axis in which the elements in the input tensor and output tensor of the operator have a point-to-point mapping relationship is the element axis; if the input tensor of the operator has a first axis, but the output tensor of the operator does not have a first axis, then the first axis is the reduction axis; the axis in which the operator performs a sliding window scan operation on the elements in the input tensor of the operator is the sliding window axis.
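As a minimal sketch of what one entry of the operator separable information database could look like, the Python below records, for each separable axis, its axis type and its position in each input tensor. The operator names, axis names, and the `SPLIT_INFO_DB` layout are hypothetical illustrations, not the format used by the application.

```python
from dataclasses import dataclass
from enum import Enum


class AxisType(Enum):
    ELEMENT = "element"                # point-to-point mapping between input and output
    REDUCTION = "reduction"            # present in the input, absent from the output
    SLIDING_WINDOW = "sliding_window"  # scanned by a sliding window


@dataclass
class SeparableAxis:
    name: str
    axis_type: AxisType
    # Position information: input tensor index -> dimension of this axis in that input.
    input_positions: dict


# Hypothetical database keyed by operator type (all entries are assumptions).
SPLIT_INFO_DB = {
    # Input 0 has shape (N, C); the operator sums over C.
    "ReduceSum": [
        SeparableAxis("N", AxisType.ELEMENT, {0: 0}),
        SeparableAxis("C", AxisType.REDUCTION, {0: 1}),
    ],
    # Input 0 has shape (N, H, W, C); the pooling window slides over H and W.
    "MaxPool2D": [
        SeparableAxis("N", AxisType.ELEMENT, {0: 0}),
        SeparableAxis("H", AxisType.SLIDING_WINDOW, {0: 1}),
        SeparableAxis("W", AxisType.SLIDING_WINDOW, {0: 2}),
        SeparableAxis("C", AxisType.ELEMENT, {0: 3}),
    ],
}


def lookup_split_info(op_type):
    """Return the separable axes of an operator without inspecting its math."""
    return SPLIT_INFO_DB[op_type]


def axes_of_type(op_type, axis_type):
    """Names of the separable axes of the given type for one operator."""
    return [a.name for a in lookup_split_info(op_type) if a.axis_type is axis_type]


print(axes_of_type("MaxPool2D", AxisType.SLIDING_WINDOW))
```

The point of the lookup is that the caller only reads axis types and positions; it never needs the operator's formula.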
[0046] In one possible implementation, the processor is specifically used to: determine the target splitting axis, which is one of N splittable axes; determine the splitting method corresponding to the axis type of the target splitting axis in the first operator based on the splitting information of the first operator; and split the input tensor of the first operator according to the splitting method corresponding to the axis type of the target splitting axis in the first operator to obtain K sets of input tensors.
[0047] In one possible implementation, the processor is specifically used to: determine, based on the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the Q first input tensors including the target segmentation axis and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segment each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator and the number of target computing resources K, to obtain Q sets of second input tensors; and obtain K sets of input tensors based on the Q sets of second input tensors and the input tensors of the unsegmented first operator.
[0048] In this context, each of the Q groups of second input tensors includes K second input tensors. The qth group of second input tensors in the Q groups is the result of splitting the qth first input tensor in the Q first input tensors into K segments, where q=1,…,Q.
[0049] Here, the k-th group of input tensors among the K groups of input tensors includes the k-th second input tensor in each of the Q groups of second input tensors, together with the input tensors of the first operator that are not split.
[0050] In the embodiments of this application, the graph optimizer performs single-operator segmentation based on the types of the different axes in the operator's input tensor and the segmentation method corresponding to each axis type. This enables the graph optimizer to obtain different single-operator segmentation strategies automatically, without relying on the principle of any specific operator, thereby achieving complete decoupling between the graph optimizer and the operator optimization module.
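The assembly of the K groups of input tensors from the Q groups of second input tensors can be sketched as follows. Tensors are plain nested lists, and the helper names (`split_evenly`, `split_along_axis`, `build_input_groups`) and the even-split policy are assumptions made for illustration only.

```python
def split_evenly(values, k):
    """Split a list into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks


def split_along_axis(tensor, axis, k):
    """Split a nested-list tensor into k pieces along the given axis."""
    if axis == 0:
        return split_evenly(tensor, k)
    # Split every sub-tensor along axis - 1, then regroup by piece index.
    per_row = [split_along_axis(row, axis - 1, k) for row in tensor]
    return [[pieces[i] for pieces in per_row] for i in range(k)]


def build_input_groups(inputs, axis_positions, k):
    """inputs: all input tensors of the operator.
    axis_positions: {input index: position of the target axis} for the Q inputs
    that contain the target axis. Inputs not listed are passed through unsplit."""
    second = {i: split_along_axis(inputs[i], pos, k)
              for i, pos in axis_positions.items()}
    return [[second[i][j] if i in second else tensor
             for i, tensor in enumerate(inputs)]
            for j in range(k)]


x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # contains the target axis at position 0
bias = [10, 20]                        # does not contain the target axis
groups = build_input_groups([x, bias], {0: 0}, k=2)
print(groups[0])  # first group: first piece of x plus the unsplit bias
print(groups[1])
```

Here Q = 1 (only `x` is split) and each of the K = 2 groups carries one piece of `x` together with the whole of `bias`.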
[0051] In one possible implementation, the operators used to perform the computational task also include a second operator, and the second operator includes P separable axes, which are a subset of the N separable axes. Specifically, the processor is used to: obtain the separability information of the second operator from the operator separability information database, where the separability information of the second operator includes the axis type and second position information of the p-th separable axis among the P separable axes in the second operator, the second position information indicates the position of the p-th separable axis in the input tensor of the second operator, the input tensor of the second operator is the output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, ..., P; determine P pieces of segmentation reference information based on the segmentation information of the first operator and the segmentation information of the second operator, where the p-th piece of segmentation reference information includes: the axis type of the p-th segmentable axis in the first operator, the axis type of the p-th segmentable axis in the second operator, and the position of the p-th segmentable axis in the input tensor of the first operator; determine P groups of candidate segmentation methods based on the P pieces of segmentation reference information, where the p-th group of candidate segmentation methods includes at least one segmentation method; determine the target segmentation method based on the time required for each segmentation method in the P groups of candidate segmentation methods to complete the computational task; and segment the input tensor of the first operator based on the target segmentation method to obtain K groups of input tensors.
[0052] As one possible implementation, the segmentation methods included in the p-th group of candidate segmentation methods are determined based on the p-th segmentation reference information among the P segmentation reference information and the number of computing resources M.
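The selection of a target segmentation method from the P groups of candidates can be sketched as a minimum over estimated completion times. The cost model in `estimate_time`, its constants, and the dictionary keys are entirely hypothetical; a real implementation would measure or model the time on the actual computing resources.

```python
def candidate_methods(reference_info, num_resources):
    """One candidate per split count k = 2..M for a single segmentable axis."""
    return [dict(reference_info, k=k) for k in range(2, num_resources + 1)]


def estimate_time(method, work=1000.0):
    """Hypothetical cost model: ideal speed-up plus axis-type-specific overhead."""
    axis_types = {method["type_in_first_op"], method["type_in_second_op"]}
    overhead = 0.0
    if "sliding_window" in axis_types:
        overhead += 20.0 * method["k"]  # assumed cost of recomputing overlapping data
    if "reduction" in axis_types:
        overhead += 50.0                # assumed cost of combining partial results
    return work / method["k"] + overhead


def choose_target_method(reference_infos, num_resources):
    """Pick the candidate with the smallest estimated completion time."""
    candidates = [m for info in reference_infos
                  for m in candidate_methods(info, num_resources)]
    return min(candidates, key=estimate_time)


reference_infos = [
    {"axis": "N", "type_in_first_op": "element",
     "type_in_second_op": "element", "position": 0},
    {"axis": "H", "type_in_first_op": "sliding_window",
     "type_in_second_op": "element", "position": 1},
]
target = choose_target_method(reference_infos, num_resources=4)
print(target["axis"], target["k"])
```

Under these assumed costs, splitting the element axis across all four resources wins because it carries no overlap or combination overhead.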
[0053] In this embodiment, the graph optimizer automatically segments the operator input and output tensors according to different types of axes. For the graph optimizer, it is not necessary to segment the input and output tensors based on the specific principles of the operators; it only needs to segment them based on the operator segmentation methods corresponding to different types of axes. For the operators, segmenting the input and output tensors does not change the operator's calculation formula; only some parameters of the operator are changed. This achieves complete decoupling between graph optimization and the specific operator principles. Furthermore, the generalization ability of segmenting the first input tensor of the operator based on different types of axes is stronger. In addition, based on the axis type of the separable axis and the position information of the separable axis on the operator's input and output tensors included in the operator's segmentation information, a suitable operator segmentation method can be flexibly selected.
[0054] In one possible implementation, the processor is specifically configured to: determine, based on the target segmentation method, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors including the target segmentation axis in the first operator, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segment each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number of target computing resources K, to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors, and the q-th group of the Q groups of second input tensors is the result of segmenting the q-th first input tensor of the Q first input tensors into K segments, where q=1,…,Q; and obtain K groups of input tensors based on the Q groups of second input tensors and the input tensors of the first operator that are not segmented.
[0055] Here, the k-th group of input tensors among the K groups of input tensors includes the k-th second input tensor in each of the Q groups of second input tensors, together with the input tensors of the first operator that are not segmented.
[0056] In one possible implementation, if the axis type of the target splitting axis in the first operator is element axis or sliding window axis, and the axis type of the target splitting axis in the second operator is element axis or sliding window axis, then the processor is specifically configured to: determine, based on the first position information and the second position information of the target splitting axis, the L first output tensors including the target splitting axis in the first operator, and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use the first input length as the input to the forward shape derivation function corresponding to the axis type of the target splitting axis in the first operator to obtain the third input length, where the first input length is the length of the target split axis in each first input tensor, and this length is equal in every first input tensor; use the third input length as the input to the forward shape derivation function corresponding to the axis type of the target split axis in the second operator to obtain the first output length; based on the first output length and the number of target computing resources K, split the L first output tensors along the target split axis to obtain L groups of second output tensors, where each of the L groups includes K second output tensors, and the l-th group of second output tensors is the result of splitting the l-th of the L first output tensors into K segments; use the K second output lengths corresponding to the target split axis in each of the L groups of second output tensors as inputs to the inverse derivation function corresponding to the axis type of the target split axis in the second operator, to obtain the K third input lengths corresponding to the target split axis in each of the Q groups of fifth input tensors, where the lengths corresponding to the target split axis in the k-th second output tensor of each of the L groups of second output tensors are equal, and the lengths corresponding to the target split axis in the k-th fifth input tensor of each of the Q groups of fifth input tensors are equal; use the K third input lengths corresponding to the target split axis in each of the Q groups of fifth input tensors as inputs to the inverse derivation function corresponding to the axis type of the target split axis in the first operator, to obtain the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, where the lengths corresponding to the target split axis in the k-th second input tensor of each of the Q groups of second input tensors are equal; and based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, split each of the Q first input tensors along the target split axis to obtain the Q groups of second input tensors.
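The chain of forward shape derivation followed by inverse derivation through two operators can be sketched as below. The concrete formulas are assumptions: an element axis keeps its length, and a sliding window axis is modeled like an unpadded window with a given kernel size and stride. The function and parameter names are hypothetical.

```python
def forward_len(length, axis_type, kernel=1, stride=1):
    """Forward shape derivation: input length -> output length along one axis."""
    if axis_type == "element":
        return length
    if axis_type == "sliding_window":
        return (length - kernel) // stride + 1  # assumed: no padding
    raise ValueError(axis_type)


def inverse_len(length, axis_type, kernel=1, stride=1):
    """Inverse derivation: output length -> input length needed to produce it."""
    if axis_type == "element":
        return length
    if axis_type == "sliding_window":
        return (length - 1) * stride + kernel
    raise ValueError(axis_type)


def split_lengths(total, k):
    """Split a length into k parts whose sizes differ by at most one."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def derive_input_lengths(first_input_len, first_op, second_op, k):
    third_input_len = forward_len(first_input_len, **first_op)    # output of op 1
    first_output_len = forward_len(third_input_len, **second_op)  # output of op 2
    second_output_lens = split_lengths(first_output_len, k)
    third_input_lens = [inverse_len(n, **second_op) for n in second_output_lens]
    return [inverse_len(n, **first_op) for n in third_input_lens]


first_op = {"axis_type": "sliding_window", "kernel": 3, "stride": 1}
second_op = {"axis_type": "element"}
print(derive_input_lengths(10, first_op, second_op, k=2))
```

With an input length of 10 and K = 2, each piece needs 6 input elements, so the two pieces overlap by 2 elements along the sliding window axis.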
[0057] In this embodiment of the application, each segmented input tensor undergoes the consecutive operator operations on the same target computing resource, which enables the multiple target computing resources to compute in parallel.
[0058] In one possible implementation, when the axis type of the target split axis in the first operator is element axis or sliding window axis, the first position information of the target split axis is also used to indicate the position of the target split axis in the output tensor of the first operator. Specifically, the processor is used to: determine, based on the first position information of the target split axis, the L first output tensors including the target split axis in the first operator, and the position of the target split axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use the first input length as the input to the forward shape derivation function of the target split axis to obtain the first output length, where the first input length is the length of the target split axis in each first input tensor, and this length is equal in every first input tensor; based on the first output length and the number K of target computing resources, split the L first output tensors along the target split axis to obtain L groups of second output tensors, where each group of second output tensors includes K second output tensors, and the l-th group of the L groups is the result of splitting the l-th first output tensor of the L first output tensors into K segments; use the K second output lengths corresponding to the target split axis in each of the L groups of second output tensors as inputs to the inverse derivation function of the target split axis, to obtain the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, where the lengths corresponding to the target split axis in the k-th second output tensor of each of the L groups of second output tensors are equal, and the lengths corresponding to the target split axis in the k-th second input tensor of each of the Q groups of second input tensors are also equal; and based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, split each of the Q first input tensors along the target split axis to obtain the Q groups of second input tensors.
[0059] In one possible implementation, when the target split axis in the first operator is of type element axis, the processor is specifically used to: based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, by scheduling the first split function, split each of the Q first input tensors according to the target split axis to obtain the Q groups of second input tensors.
[0060] Here, the elements along the target split axis in each second input tensor of the q-th group of the Q groups of second input tensors are a subset of the elements along the target split axis in the q-th first input tensor of the Q first input tensors. The elements along the target split axis in the different second input tensors of the q-th group do not intersect, and their union is the set of elements along the target split axis in the q-th first input tensor.
[0061] In one possible implementation, when the target split axis in the first operator is a sliding window axis, the processor is specifically used to: based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, by scheduling the first slicing function, to perform overlapping splitting on each of the Q first input tensors according to the target split axis, thereby obtaining the Q groups of second input tensors.
[0062] Here, the elements along the target split axis in each second input tensor of the q-th group of the Q groups of second input tensors are a subset of the elements along the target split axis in the q-th first input tensor of the Q first input tensors. The elements along the target split axis in the different second input tensors of the q-th group intersect, and their union is the set of elements along the target split axis in the q-th first input tensor.
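A minimal sketch of overlapping splitting along a sliding window axis follows, using a one-dimensional list and a windowed sum as a stand-in operator. The window formula (unpadded, fixed kernel and stride) and the names `overlapping_split` and `sliding_sum` are assumptions for illustration.

```python
def sliding_sum(values, kernel, stride):
    """Stand-in sliding window operator: sum over each window (no padding)."""
    return [sum(values[i:i + kernel])
            for i in range(0, len(values) - kernel + 1, stride)]


def overlapping_split(values, output_lens, kernel, stride):
    """Slice the input so that piece k yields exactly output_lens[k] outputs."""
    pieces, out_start = [], 0
    for out_len in output_lens:
        in_start = out_start * stride
        in_len = (out_len - 1) * stride + kernel  # inverse derivation of the length
        pieces.append(values[in_start:in_start + in_len])
        out_start += out_len
    return pieces


values = list(range(10))
pieces = overlapping_split(values, output_lens=[4, 4], kernel=3, stride=1)
print(pieces)  # neighbouring pieces share elements along the axis

# Each piece is processed independently; concatenating the partial outputs
# reproduces the result of running the operator on the whole input.
partial = [y for piece in pieces for y in sliding_sum(piece, 3, 1)]
print(partial == sliding_sum(values, 3, 1))
```

The two pieces share the elements 4 and 5, which is the intersection described above, and together they cover every element of the original axis.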
[0063] In one possible implementation, when the axis type of the target splitting axis in the first operator is sliding window axis, the processor is specifically used to: split each of the Q first input tensors along the target splitting axis by scheduling the second splitting function to obtain Q groups of third input tensors, where each of the Q groups of third input tensors includes K third input tensors; based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, slice each of the K third input tensors in each of the Q groups of third input tensors along the target splitting axis by scheduling the second slicing function to obtain Q groups of fourth input tensors; and concatenate, by scheduling the concatenation function, the k-th fourth input tensor in the q-th group of the Q groups of fourth input tensors with the k-th third input tensor in the q-th group of the Q groups of third input tensors along the target splitting axis to obtain the Q groups of second input tensors.
[0064] Here, the elements along the target split axis in each third input tensor of the q-th group of the Q groups of third input tensors are a subset of the elements along the target split axis in the q-th first input tensor of the Q first input tensors. The elements along the target split axis in the different third input tensors of the q-th group do not intersect, and their union is the set of elements along the target split axis in the q-th first input tensor.
[0065] Here, the elements along the target splitting axis of the k-th second input tensor in the q-th group of the Q groups of second input tensors are contiguous.
[0066] In the embodiments of this application, the non-overlapping segmentation method for the sliding window axis is suitable for scenarios that require frequent data synchronization between different computing resources, such as multi-die parallelism. The concatenation function serves as the data synchronization node between different dies. In this way, overlapping data is not computed repeatedly and the amount of overlapping data does not keep growing, which effectively reduces the computational and storage pressure on the computing resources.
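The split, slice, and concatenate sequence for the non-overlapping method can be sketched as follows on a one-dimensional list. Two choices here are assumptions not stated in the text: the boundary data that a piece is missing is sliced from the head of the next piece, and that boundary data fits entirely inside the next piece. The function names are hypothetical.

```python
def disjoint_split(values, output_lens, stride):
    """Second splitting function (sketch): non-overlapping 'third' pieces."""
    pieces, start = [], 0
    for i, out_len in enumerate(output_lens):
        last = i == len(output_lens) - 1
        end = len(values) if last else start + out_len * stride
        pieces.append(values[start:end])
        start = end
    return pieces


def slice_and_concat(third, kernel, stride):
    """Slice the boundary data ('fourth' pieces) and concatenate it onto each
    third piece to form the 'second' pieces used for computation."""
    halo = max(kernel - stride, 0)  # elements a window needs past the boundary
    second = []
    for k, piece in enumerate(third):
        # Assumption: the missing boundary data comes from the next piece.
        fourth = third[k + 1][:halo] if k + 1 < len(third) else []
        second.append(piece + fourth)  # concatenation acts as the sync point
    return second


values = list(range(10))
third = disjoint_split(values, output_lens=[4, 4], stride=1)
second = slice_and_concat(third, kernel=3, stride=1)
print(third)
print(second)
```

The third pieces are disjoint, only the small boundary slice is exchanged, and each resulting second piece is contiguous along the axis.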
[0067] In one possible implementation, when the target splitting axis is a reduction axis in the first operator, the processor is specifically used to: split each of the Q first input tensors by calling the third splitting function according to the number K of target computing resources, to obtain Q groups of second input tensors.
[0068] Here, the elements along the target split axis in each second input tensor of the q-th group of the Q groups of second input tensors are a subset of the elements along the target split axis in the q-th first input tensor of the Q first input tensors. The elements along the target split axis in the different second input tensors of the q-th group do not intersect, and their union is the set of elements along the target split axis in the q-th first input tensor.
[0069] In this embodiment, because the type of the reduction axis already determines the specific partitioning method, the graph optimizer can reasonably partition the input tensor of a specific operator that includes a reduction axis without relying on the principle of that operator. By contrast, the traditional operator partitioning method starts from the output tensor of the specific operator. Because a reduction axis either does not appear in the output tensor or has length 1 in the output tensor, the traditional method cannot partition an axis of the input tensor that has the characteristics of a reduction axis.
[0070] In one possible implementation, the reduction axis includes a first type of reduction axis and a second type of reduction axis. The first type of reduction axis is a reduction axis in which the operator performs a reduction operation on the elements in the input tensor of the operator, and the second type of reduction axis is a reduction axis in which the operator does not perform a reduction operation on the elements in the input tensor of the operator.
[0071] In one possible implementation, the first type of reduction axis includes any one of the following: reduction sum axis, reduction maximum axis, reduction minimum axis, and reduction average axis; wherein, the reduction sum axis is the reduction axis for the operator to perform a summation reduction operation on the elements in the operator's input tensor; the reduction maximum axis is the reduction axis for the operator to perform a maximum reduction operation on the elements in the operator's input tensor; the reduction minimum axis is the reduction axis for the operator to perform a minimum reduction operation on the elements in the operator's input tensor; and the reduction average axis is the reduction axis for the operator to perform an average reduction operation on the elements in the operator's input tensor.
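The following sketch shows why the kind of reduction matters when a reduction axis is split: each piece is reduced separately and the partial results must then be combined in a way that matches the reduction. The combination step shown here is a common approach and an assumption on our part; the function names are hypothetical.

```python
def split_evenly(values, k):
    """Split a list into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks


def reduce_in_pieces(values, k, kind):
    """Reduce each of k pieces separately, then combine the partial results.
    Assumes k <= len(values) so that no piece is empty."""
    pieces = split_evenly(values, k)
    if kind == "sum":
        return sum(sum(p) for p in pieces)
    if kind == "max":
        return max(max(p) for p in pieces)
    if kind == "min":
        return min(min(p) for p in pieces)
    if kind == "mean":
        # Averaging the per-piece means would be wrong when piece sizes differ,
        # so combine partial sums and element counts instead.
        return sum(sum(p) for p in pieces) / sum(len(p) for p in pieces)
    raise ValueError(kind)


data = [3, 1, 4, 1, 5, 9, 2, 6]
for kind in ("sum", "max", "min", "mean"):
    print(kind, reduce_in_pieces(data, 3, kind))
```

For sum, maximum, and minimum the partial results combine with the same operation, while the average needs the partial sums and counts.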
[0072] In one possible implementation, the second type of reduction axis includes a reduction acquisition axis, which is the axis along which the operator indexes data in the operator's input tensor according to the addresses indicated by the elements of the operator's index input tensor.
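One way to picture splitting along a reduction acquisition axis is a gather whose data tensor is sharded: each shard answers only the indices that fall in its own range, and the per-shard answers are merged. This merging scheme is an assumption chosen for illustration, and the names `gather` and `gather_sharded` are hypothetical.

```python
def gather(data, indices):
    """Reference gather: pick data[i] for every index i."""
    return [data[i] for i in indices]


def gather_sharded(data, indices, k):
    """Gather from data that is split into k contiguous shards along the
    acquisition axis. Each shard resolves only the indices it owns."""
    base, extra = divmod(len(data), k)
    result = [None] * len(indices)
    start = 0
    for s in range(k):
        size = base + (1 if s < extra else 0)
        shard = data[start:start + size]  # the part held by one computing resource
        for pos, idx in enumerate(indices):
            if start <= idx < start + size:
                result[pos] = shard[idx - start]  # translate to a local address
        start += size
    return result


data = [10, 20, 30, 40, 50]
indices = [4, 0, 3, 3]
print(gather_sharded(data, indices, k=2))
print(gather_sharded(data, indices, k=2) == gather(data, indices))
```

No arithmetic reduction is performed on the elements, which matches the description of the second type of reduction axis.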
[0073] As one possible implementation, the target computing resources include one of the following types: graphics processing unit (GPU), central processing unit (CPU), die, or chip.
[0074] In one possible implementation, the device may further include a memory storing instructions, and a processor for executing the instructions stored in the memory. When the instructions are executed, the processor is used to perform the method in any of the implementations of the first aspect.
[0075] In a third aspect, a computer-readable medium is provided that stores program code, the program code including instructions for performing the method in any implementation of the first aspect.
Attached Figure Description
[0076] Figure 1 is a schematic diagram of a deep learning compiler architecture provided in an embodiment of this application;
[0077] Figure 2 is a schematic diagram of operator segmentation provided in an embodiment of this application;
[0078] Figure 3 is a schematic flowchart of a method for processing computing tasks provided in an embodiment of this application;
[0079] Figure 4 is a flowchart illustrating an operator segmentation method for a single operator to complete a computation task, as provided in an embodiment of this application;
[0080] Figure 5 is a schematic flowchart of another method for processing computing tasks provided in an embodiment of this application;
[0081] Figure 6 is a flowchart illustrating an operator segmentation method for completing a computation task using multiple operators, as provided in an embodiment of this application;
[0082] Figure 7 is a schematic diagram of an element axis segmentation method provided in an embodiment of this application;
[0083] Figure 8 is a schematic diagram of a reduction sum axis segmentation method provided in an embodiment of this application;
[0084] Figure 9 is a schematic diagram of a reduction maximum axis segmentation method provided in an embodiment of this application;
[0085] Figure 10 is a schematic diagram of a reduction average axis segmentation method provided in an embodiment of this application;
[0086] Figure 11 is a schematic diagram of a reduction acquisition axis segmentation method provided in an embodiment of this application;
[0087] Figure 12 is a schematic diagram of a sliding window axis segmentation method provided in an embodiment of this application;
[0088] Figure 13 is a schematic diagram of another sliding window axis segmentation method provided in an embodiment of this application;
[0089] Figure 14 is a schematic diagram illustrating the position information of an operator's separable axis in the operator's input and output tensors, provided in an embodiment of this application;
[0090] Figure 15 is a schematic diagram illustrating a specific application of operator segmentation provided in an embodiment of this application;
[0091] Figure 16 is a schematic diagram illustrating another specific application of operator segmentation provided in an embodiment of this application;
[0092] Figure 17 is a schematic diagram illustrating yet another specific application of operator segmentation provided in an embodiment of this application;
[0093] Figure 18 is a schematic diagram of a tensor provided in an embodiment of this application;
[0094] Figure 19 is a schematic diagram of a device for processing computing tasks provided in an embodiment of this application.
Detailed Implementation
[0095] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings.
[0096] The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to be limiting of this application. As used in the specification and appended claims of this application, the singular expressions “a,” “an,” “the,” and “this” are intended to also include expressions such as “one or more,” unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of this application, “at least one” and “one or more” refer to one, two, or more than two. The term “and/or” describes the relationship between related objects and indicates that three relationships may exist; for example, A and/or B can indicate: A alone, both A and B, or B alone, where A and B can be singular or plural. The character “/” generally indicates that the preceding and following related objects are in an “or” relationship.
[0097] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0098] To facilitate understanding of the technical solution of this application, the concepts involved in this application will be briefly introduced first.
[0099] (1) Deep learning model
[0100] A deep learning model is a machine learning model that includes a deep neural network structure. Algorithm engineers use deep learning frameworks to build models, tune and train them to optimize their parameters, and then save the final network parameters and model structure together. The resulting file is the model file that can be used for forward inference.
[0101] The format of model files trained by different deep learning frameworks is not exactly the same, but a complete model file generally contains information such as tensor data, computational units, and computation graphs.
[0102] (2) Tensor
[0103] A tensor is a data container in deep learning systems. It can be understood as an extension of a matrix to any number of dimensions. A tensor containing only a single number is called a scalar, scalar tensor, zero-dimensional tensor, or 0D tensor; an array of numbers is called a vector, one-dimensional tensor, or 1D tensor; an array of vectors is called a matrix, two-dimensional tensor, or 2D tensor; combining multiple matrices into a new array yields a three-dimensional tensor, which can be intuitively understood as a cube composed of numbers; combining multiple three-dimensional tensors into an array creates a four-dimensional tensor, and so on. Deep learning typically processes tensors from 0D to 4D, but 5D tensors may be encountered when processing video data.
[0104] The shape of a tensor gives the number of elements along each dimension. For example, [[[1,2,3]],[[7,8,9]]] is a three-dimensional tensor with shape (2,1,3): the size of the 0th axis is 2, the size of the 1st axis is 1, and the size of the 2nd axis is 3. As another example, Figure 18 is a schematic diagram of a tensor provided in an embodiment of this application, and the shape of the tensor shown in Figure 18 is (4,20,20,3). Assuming the tensor shown in Figure 18 represents a feature map, the physical meanings of the entries of the shape, from left to right, are as follows: the batch size N of the feature map is 4, which means 4 images; the height H of the feature map is 20; the width W of the feature map is 20, which means each image has 20*20=400 pixels; and the number of channels C of the feature map is 3, which means RGB channels.
[0105] The axes of a tensor are defined relative to its shape: an axis number is the subscript of an entry in the shape. For example, [[[1,2],[3,4]],[[5,6],[7,8]]] is a three-dimensional tensor with a shape of (2,2,2). The 0-axis represents the first dimension: the matrices [[1,2],[3,4]] and [[5,6],[7,8]]; the 1-axis represents the second dimension: [1,2], [3,4], [5,6], and [7,8]; and the 2-axis represents the third dimension: 1, 2, 3, 4, 5, 6, 7, and 8. As another example, the tensor shown in Figure 18 has a shape of (4,20,20,3), where the 0-axis represents the batch size of the feature map, the 1-axis represents the height of the feature map, the 2-axis represents the width of the feature map, and the 3-axis represents the channels of the feature map.
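The shape and axis examples above can be checked with a few lines of Python that treat a tensor as a nested list. The helper `shape` is a hypothetical name and assumes the nested list is rectangular.

```python
def shape(tensor):
    """Shape of a rectangular nested-list tensor: one entry per axis."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]  # descend to the next axis
    return tuple(dims)


a = [[[1, 2, 3]], [[7, 8, 9]]]
b = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

print(shape(a))    # sizes of axis 0, axis 1, axis 2
print(shape(b))
print(b[0])        # indexing axis 0 selects a matrix
print(b[0][1])     # indexing axis 1 selects a row vector
print(b[0][1][0])  # indexing axis 2 selects a single number
```

A scalar has shape `()`, a vector has one entry in its shape, and each additional level of nesting adds one axis.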
[0106] (3) Operator
[0107] An operator, also known as a computational unit or operator symbol, represents a symbolic computation process and is the basic unit of mainstream deep learning frameworks, i.e., a node in a graph. The input and output of a computational unit are tensors. All transformations learned by deep networks can be simplified to tensor operations on numerical data tensors.
[0108] Common computational units include add units, batch normalization units, convolution units, gated recurrent units, local response normalization (LRN) units, long short-term memory (LSTM) units, max pooling units, rectified linear units (ReLU), recurrent neural network (RNN) units, and the softmax function.
[0109] (4) Computation graph
[0110] A computation graph, also known as a data flow graph, is defined as a directed acyclic graph (DAG). Tensors and operational units are objects in the graph; operational units are the nodes, and tensors are the data flowing along the edges. Acyclic means the graph cannot have cycles; for example, a tensor x cannot be the input to any layer that generates x. The only allowed processing cycle (i.e., a loop connection) is an inner loop within a loop layer.
[0111] Most deep learning frameworks can be described using a directed acyclic graph (DAG), where each node represents a neuron, and two nodes share an edge if the output of one node is used as the input of the other. In other words, nodes in this computation graph represent operators, and edges between nodes represent data dependencies between them.
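As an illustrative sketch of a computation graph as a DAG, the following Python code stores each operator together with the operators whose outputs it consumes, derives an execution order, and detects a cycle. The operator names and the helper functions are hypothetical; only the standard-library `graphlib` module is used.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(graph):
    """Return the operators in an order that respects all data dependencies."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """Report whether the graph is a valid DAG."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


# Hypothetical graph: each key is an operator node, each value is the set of
# operators whose output tensors flow into it along an edge.
graph = {
    "conv": set(),
    "relu": {"conv"},
    "softmax": {"relu"},
}
order = execution_order(graph)

# A tensor fed back into the operator chain that produced it forms a cycle.
cyclic = {"a": {"b"}, "b": {"a"}}
```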
[0112] (5) Operator partitioning
[0113] Operator partitioning (also referred to below as operator segmentation or operator splitting) is the partitioning of the input tensor and output tensor of an operator.
[0114] Figure 1 is a schematic diagram of a deep learning compiler architecture provided in an embodiment of this application. Deep learning compilers are briefly introduced below with reference to Figure 1.
[0115] A deep learning compiler can be divided into a compiler frontend, a compiler middleend, and a compiler backend. The compiler frontend interfaces with the application layer, meaning it interacts with the deep learning model. The compiler frontend includes a parser, which primarily converts models trained under different frameworks into a hardware-recognizable internal format. For example, it converts the computation graph of frameworks like TensorFlow or Caffe2 into a computation graph in a format recognizable by the compiler. The compiler middleend includes a graph optimizer and operator information. The graph optimizer can also be called a graph optimization module. The compiler middleend allocates different computational tasks to different computing resources (e.g., CPU, GPU) for subsequent model execution. The compiler backend primarily generates code instructions that match different hardware. The compiler backend includes an operator compiler and operator libraries.
[0116] Deep learning compilers typically improve model performance on different devices through two levels: graph optimization and operator optimization. Graph optimization and operator optimization are relatively decoupled and independent. Graph optimization is a general optimization strategy, which is independent of specific operator types, while operator optimization is a specific optimization strategy related to the specific operator type.
[0117] Typical operator optimization strategies include compute optimization and scheduling optimization. These strategies, either manually or automatically, aim to optimize a specific operator for a particular hardware platform. For example, for the general matrix-matrix multiplication (GEMM) operator, manual scheduling techniques such as blocking, vectorization, loop permutation, packing, and multi-core, multi-threaded parallelism are typically used to optimize the scheduling of the GEMM operator, resulting in performance gains of tens of times on the CPU.
[0118] A typical graph optimization strategy is constant folding. With constant folding, if all the input tensors that an operator depends on are constants, the operator node does not depend on any runtime input of the model and can be computed in advance at compile time, thereby saving runtime overhead.
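Constant folding can be sketched as follows. This is a minimal illustration under stated assumptions: scalars stand in for tensors, the graph encoding (tuples keyed by node name, listed in execution order) and the names `OPS` and `fold_constants` are hypothetical, and real compilers apply the same idea to tensor-valued nodes.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(graph):
    """Replace every operator whose inputs are all constants by a constant node.

    `graph` maps a node name to ("const", value), ("input",) or
    (op_name, input_name, input_name); nodes are listed in execution order.
    """
    folded = {}
    for name, node in graph.items():
        kind = node[0]
        if kind in OPS and all(folded[arg][0] == "const" for arg in node[1:]):
            values = [folded[arg][1] for arg in node[1:]]
            folded[name] = ("const", OPS[kind](*values))   # computed at compile time
        else:
            folded[name] = node                            # left for runtime
    return folded


graph = {
    "two": ("const", 2),
    "three": ("const", 3),
    "x": ("input",),
    "six": ("mul", "two", "three"),   # depends only on constants
    "y": ("add", "x", "six"),         # depends on a runtime input
}
folded = fold_constants(graph)
```

After folding, the node `six` is a constant, while `y` remains an operator because one of its inputs is only known at runtime.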
[0119] Currently, there are many other graph optimization strategies, such as graph partitioning and execution order optimization, multi-die parallelism, multi-threaded parallelism, and chip-level parallelism. These graph optimization strategies all need to be based on the principles of the operators themselves. Without being based on the principles of the operators themselves, it is impossible to express parallel optimization strategies in the computation graph.
[0120] For example, graph partitioning and execution order optimization is a graph optimization strategy that reduces the memory constraints of operator execution. Specifically, it involves uniformly partitioning the outer loop iteration variables of operators. Based on this, the subsequent execution order of operators is adjusted, allowing operators to perform a large number of iterative operations locally, reducing the operator's memory requirements. More intermediate results generated by local operations are stored in the L2 cache, thereby reducing the memory constraints of subsequent operator execution and ultimately optimizing the overall network model's performance. Therefore, the method of operator partitioning is particularly important.
[0121] For example, in multi-die parallel processing, a die is the chip before it is packaged. The chip uses advanced packaging technology to accumulate computing power. In order to fully utilize the chip's performance, an operator can be split across multiple dies for computation, minimizing data interaction between different dies. Therefore, the method of operator splitting is particularly important.
[0122] For example, in multithreaded parallelism, a subgraph containing different operators is treated as a basic unit of computation. When a subgraph is distributed across multiple threads for parallel execution, the operators need to be partitioned. Similarly, when a subgraph is distributed across different computing resources for parallel execution, such as on different CPUs, the operators within the subgraph need to be partitioned. Since data synchronization between multiple threads signifies the completion of a subgraph's execution, multithreaded parallel execution also aims to minimize interactions between different threads. Therefore, the partitioning of operators within a subgraph is particularly important.
[0123] Figure 2 is a schematic diagram of operator segmentation provided in an embodiment of this application. A tensor of the same operator can be segmented into different slices, and these different slices can run on different threads, different dies, or different chips. For example, as shown in Figure 2, the input tensor of operator 1 is divided into slice 1 and slice 2 of operator 1, and the input tensor of operator 2 is divided into slice 1 and slice 2 of operator 2. Slice 1 of operator 1 runs on computing resource 1 and passes the running result to computing resource 2 corresponding to slice 1 of operator 2 for running. At the same time, slice 2 of operator 1 runs on computing resource 3 and passes the running result to computing resource 4 corresponding to slice 2 of operator 2 for running. Finally, the running results on computing resource 2 and computing resource 4 are concatenated.
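The data flow of Figure 2 can be sketched in a few lines of Python. The two operators below are hypothetical row-wise functions chosen only so that the slices are independent; the calls run sequentially here, whereas in the scheme described above each call would run on its own computing resource.

```python
def operator_1(rows):
    """Hypothetical row-wise operator: scale every element by 2."""
    return [[2 * v for v in row] for row in rows]


def operator_2(rows):
    """Hypothetical row-wise operator: add 1 to every element."""
    return [[v + 1 for v in row] for row in rows]


tensor = [[1, 2], [3, 4], [5, 6], [7, 8]]
slice_1, slice_2 = tensor[:2], tensor[2:]      # split along axis 0

# Slice 1: operator 1 (resource 1) feeds operator 2 (resource 2).
result_1 = operator_2(operator_1(slice_1))
# Slice 2: operator 1 (resource 3) feeds operator 2 (resource 4).
result_2 = operator_2(operator_1(slice_2))

concatenated = result_1 + result_2             # concatenate along axis 0
unsplit = operator_2(operator_1(tensor))       # reference result without splitting
```

For these operators the concatenated result equals the unsplit result, which is the property that makes the slices safe to run in parallel.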
[0124] Most current graph optimization schemes that involve operator partitioning rely on the principles of the operators themselves; for example, graph optimization requires partitioning operators based on the properties of their iterative variables. At the same time, the architectures of operator optimization and graph optimization need to remain relatively decoupled. One current approach is therefore for algorithm engineers to manually classify the iterative variables in the necessary operators and summarize the changes that result from partitioning along the axis of each type of iterative variable, which assists the efficient generation of graph optimization strategies. Because the efficient generation of graph optimization strategies then relies heavily on this manual classification, it does not allow arbitrary operators to be partitioned and executed automatically.
[0125] Currently, there is a method that uses a separable axis (e.g., sample axis, parameter axis, and attribute axis) based on the operator output to split the input tensor of the operator at the application layer, thereby achieving the effect of parallel computation on multiple GPUs through operator splitting.
[0126] Specifically, the sample axis, parameter axis, and attribute axis are three types of splittable axes on the operator output. The sample axis segments the operator input tensor based on samples, meaning it segments the operator input tensor along the sample dimension. The operator input tensor segments along the sample axis are then allocated to different computational resources for data parallelism. The parameter axis segments the operator input tensor based on parameters, meaning it segments the operator input tensor along the parameter dimension. The operator input tensor segments along the parameter axis are then allocated to different computational resources for model parallelism. The attribute axis is the axis in the operator output other than the sample and parameter axes. The operator input tensor is segmented based on the attribute axis of the samples, meaning it segments the operator input tensor samples along the attribute dimension.
[0127] Based on these three axes, the operator input tensor can be partitioned across different computational resources for operation. Partitioning can be performed individually based on each axis, or in combination, achieving parallel computation across multiple resources. While this partitioning method currently enables a certain degree of automatic operator partitioning at the application layer, it still has limitations. First, it currently only defines three dimensions for matrix multiplication operators based on the axes in the output tensor, failing to cover all possible partitioning axes and partitioning methods. Second, the current definition of these three axes is determined by the type of axes in the operator output tensor; that is, if an axis is not present in the operator output tensor, partitioning will not be performed based on the actual partitionable axes of the operator input tensor. This results in a coarse partitioning of the operator input tensor, making it impossible to accurately partition the operator and allocate it to different computing resources for computation. Finally, this method still defines the partitioning axis and partitioning method at the application layer. In other words, algorithm engineers determine the partitioning method at the application layer using scripting languages based on the partitioning axis included in a certain type of operator. This still cannot achieve automatic partitioning of the input and output of different operators, and it cannot achieve complete decoupling between graph optimization and operator optimization.
[0128] To address the aforementioned problems, this application proposes a method and apparatus for processing computational tasks, which are described in detail below with reference to Figures 3 to 19.
[0129] Figure 3 is a schematic flowchart of a method for processing computing tasks provided in an embodiment of this application.
[0130] S301, determine the first operator for performing the computation task, the first operator includes N separable axes, where N is a positive integer greater than or equal to 1.
[0131] It should be understood that the N separable axes included in the first operator mean that the input tensor of the first operator includes N separable axes.
[0132] S302, obtain the segmentation information of the first operator from the operator segmentation information database. The segmentation information of the first operator includes the axis type of the nth segmentable axis among N segmentable axes in the first operator and the first position information. The first position information is used to indicate the position of the nth segmentable axis in the input tensor of the first operator, where n=1, ...,N.
[0133] In other words, the segmentation information of the first operator indicates the axis type that each of the N segmentable axes has in the first operator. The different axis types and their corresponding segmentation methods are explained in detail later with reference to Figures 7 to 17. The segmentation information of the first operator can also indicate which input tensors each separable axis appears in and which axis it occupies within those input tensors. For example, based on the position information of separable axis 1 in the first operator, we can know that separable axis 1 appears in input tensors 1 and 2 of the first operator, and that separable axis 1 appears on the 0 axis of input tensor 1 and on the 0 axis of input tensor 2.
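One way to picture the segmentation information obtained in S302 is as a small record per segmentable axis. The following Python sketch is only an assumed layout: the class `AxisInfo`, the dictionary `PARTITION_LIBRARY`, the operator name `"add"`, and the axis-type strings are all hypothetical and are not defined by this application.

```python
from dataclasses import dataclass, field


@dataclass
class AxisInfo:
    """Segmentation information for one segmentable axis (hypothetical layout)."""
    axis_type: str                                   # e.g. "elementwise", "reduce", "sliding_window"
    # Position information: input tensor number -> axis index in that tensor.
    positions: dict = field(default_factory=dict)


# Hypothetical operator segmentation information library, keyed by operator name.
PARTITION_LIBRARY = {
    # Segmentable axis 1 appears on axis 0 of input tensor 1 and axis 0 of input tensor 2.
    "add": [AxisInfo("elementwise", {1: 0, 2: 0})],
}


def lookup(op_name):
    """Return the segmentation information of an operator without inspecting its semantics."""
    return PARTITION_LIBRARY[op_name]


def tensors_containing(axis):
    """Return the numbers of the input tensors in which the axis appears."""
    return sorted(axis.positions)


axis_1 = lookup("add")[0]
```

With such a record, a graph optimizer only needs the axis type and the positions; it does not need to know what the operator computes.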
[0134] S303, based on the segmentation information of the first operator, the input tensor of the first operator is segmented to obtain K sets of input tensors, where K is a positive integer greater than or equal to 2.
[0135] It should be understood that the number of input tensors included in each of the K sets of input tensors is the same as the number of input tensors included in the first operator.
[0136] As one possible implementation, the input tensor of the first operator is divided according to the segmentation information of the first operator and the number of computing resources M to obtain K sets of input tensors, where M is a positive integer greater than or equal to 2.
[0137] Although the number of available computing resources is M, the graph optimizer does not necessarily need to use all of them. For example, the required target number of computing resources K can be estimated based on the size of the computing task, or the target number of computing resources K can be randomly determined. This application embodiment does not impose any restrictions on this.
[0138] It should also be understood that each of the K sets of input tensors is the input tensor required by each target computing resource. For example, if a single computing resource used to complete the computing task requires a input tensors before the input tensor of the first operator is split, then after the input tensor of the first operator is split, each target computing resource used to complete the computing task will also require a input tensors.
[0139] As one possible implementation, the target splitting axis is determined based on the splitting information of the first operator, and the input tensor of the first operator is split according to the target splitting axis to obtain K sets of input tensors. The operator segmentation process for completing a computation task using a single operator is explained in detail later with reference to Figure 4.
[0140] It should be noted that segmenting the input tensor of the first operator does not mean segmenting all input tensors of the first operator, but rather segmenting the input tensor that includes the target segmentation axis, while the input tensor that does not include the target segmentation axis is sent as shared input data to each target computing resource.
[0141] As one possible implementation, if a second operator is needed to perform the computation task, a candidate segmentation space is determined based on the segmentation information of the first operator, the segmentation information of the second operator, and the number of computing resources M. Then, based on the candidate segmentation space, a target segmentation method is determined. According to the target segmentation method, the input tensor of the first operator is segmented to obtain K sets of input tensors. The partitioning method for completing a computation task using multiple operators is explained in detail later with reference to Figure 5.
[0142] S304, send K sets of input tensors to K target computing resources respectively, so that the K target computing resources can complete the computing tasks.
[0143] It should be understood that the target number of computing resources K is determined based on the number of computing resources M.
[0144] In this embodiment, the graph optimizer obtains the segmentation information of operators from the operator segmentation information library. Since the segmentation information of each operator can be directly obtained from the operator segmentation information library, the graph optimizer does not need to be aware of the mathematical semantics and underlying implementation of each operator. It can automatically segment the input tensor of the operator, thereby achieving complete decoupling between graph optimization and operator optimization.
[0145] Figure 4 is a schematic diagram of an operator segmentation method for a single operator to complete a computation task, provided in an embodiment of this application. Figure 4 describes one possible implementation of S303 in detail.
[0146] S401, Determine the target splitting axis, which is one of N splittable axes.
[0147] As one possible implementation, the graph optimizer randomly selects a separable axis as the target separable axis. For example, the first axis of the input tensor of the first operator can be used as the target separable axis. The first axis can be a batch axis.
[0148] As one possible implementation, the graph optimizer selects the separable axis that appears in the largest number of input tensors of the first operator as the target separable axis. For example, if the first operator has 3 input tensors, where separable axis 1 appears in 3 input tensors and separable axis 2 appears in 2 input tensors, then separable axis 1 can be used as the target separable axis.
[0149] As one possible implementation, based on the computation time required to complete the computation task according to the segmentation method corresponding to each segmentable axis, the segmentable axis with the shortest computation time is selected as the target segmentation axis.
[0150] One possible implementation is to determine the target partition axis based on the computation time required to complete the computation task using the partitioning method corresponding to each partitionable axis and the target number of computing resources K. For example, if the computation time required to complete the computation task using partition axis 1 and the partitioning method corresponding to b target computing resources is the same as the computation time required to complete the computation task using partition axis 2 and the partitioning method corresponding to c target computing resources, but the number of target computing resources b corresponding to partition axis 1 is less than the number of target computing resources c corresponding to partition axis 2, then partition axis 1 is selected as the target partition axis, and the number of target computing resources corresponding to partition axis 1 is b.
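Two of the selection strategies for S401 described above can be sketched as follows: choosing the axis that appears in the most input tensors, and choosing by computation time with ties resolved in favour of fewer target computing resources. The function names, the dictionary encodings, and the time values are assumptions made for this illustration; how the times are obtained is outside the sketch.

```python
def axis_in_most_inputs(positions_by_axis):
    """Pick the axis that appears in the largest number of input tensors.

    `positions_by_axis` maps an axis name to {input tensor number: axis index}.
    """
    return max(positions_by_axis, key=lambda axis: len(positions_by_axis[axis]))


def fastest_axis(times):
    """Pick the (axis, k) pair with the shortest time; ties go to the smaller k.

    `times` maps (axis name, number of target resources k) to a computation time.
    """
    return min(times, key=lambda choice: (times[choice], choice[1]))


# Axis 1 appears in three input tensors, axis 2 in two.
positions_by_axis = {
    "axis_1": {1: 0, 2: 0, 3: 0},
    "axis_2": {1: 1, 2: 1},
}
by_count = axis_in_most_inputs(positions_by_axis)

# Hypothetical times: axis 1 with 2 resources ties with axis 2 with 4 resources.
times = {("axis_1", 2): 5.0, ("axis_2", 4): 5.0, ("axis_2", 2): 8.0}
by_time = fastest_axis(times)
```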
[0151] S402, based on the segmentation information of the first operator, determine the segmentation method corresponding to the axis type of the target segmentation axis in the first operator.
[0152] S403, according to the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the input tensor of the first operator is segmented to obtain K sets of input tensors.
[0153] As one possible implementation, based on the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the Q first input tensors including the target segmentation axis and the position of the target segmentation axis in each of the Q first input tensors are determined, where Q is a positive integer greater than or equal to 1.
[0154] It should be understood that the Q first input tensors are the input tensors of the first operator including the target split axis.
[0155] It should be understood that the position of the target split axis in each of the Q first input tensors indicates which axis the target split axis is on in each first input tensor. For example, the target split axis is on the 0 axis of the first first input tensor and the 0 axis of the second first input tensor.
[0156] As one possible implementation, based on the axis type of the target splitting axis in the first operator and the number of target computational resources K, each of the Q first input tensors is split to obtain Q groups of second input tensors, where each of the Q groups of second input tensors includes K second input tensors.
[0157] It should be understood that the q-th group of second input tensors among the Q groups is the result of splitting the q-th first input tensor among the Q first input tensors into K parts, where q=1,…,Q.
[0158] It should be understood that when the number of target computing resources is K, each first input tensor including the target partition axis will be partitioned according to the target partition axis into K second input tensors. The K second input tensors serve as input tensors for the K target computing resources. Therefore, when there are Q first input tensors including the target partition axis, Q groups of second input tensors will be formed.
[0159] It should be understood that the target splitting axis can be an element-wise axis, a sliding window axis, or a reduce axis. The splitting methods for these different axis types are explained in detail below with reference to Figures 7 to 13.
[0160] As one possible implementation, K sets of input tensors are obtained based on the Q groups of second input tensors and the unsplit input tensors of the first operator.
[0161] The k-th set of input tensors among the K sets includes the k-th second input tensor in each of the Q groups of second input tensors and the unsplit input tensors of the first operator.
[0162] It should be understood that each of the K sets of input tensors includes the input tensor of the undivided first operator as shared data and the second input tensor of the divided first operator corresponding to each target computational resource.
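The assembly of the K sets of input tensors in S403 can be sketched as follows. The sketch assumes nested-list tensors and contiguous, near-equal parts along the target axis; the application does not prescribe how the parts are sized, and the function names `split_along` and `build_input_sets` are hypothetical.

```python
def split_along(tensor, axis, k):
    """Split a nested-list tensor into k contiguous parts along `axis`."""
    if axis > 0:
        # Split every sub-tensor, then regroup the pieces part by part.
        pieces = [split_along(sub, axis - 1, k) for sub in tensor]
        return [[p[i] for p in pieces] for i in range(k)]
    size = len(tensor)
    bounds = [size * i // k for i in range(k + 1)]
    return [tensor[bounds[i]:bounds[i + 1]] for i in range(k)]


def build_input_sets(inputs, positions, k):
    """Return k sets of input tensors.

    `positions` maps the index of each input tensor that contains the target
    axis to the index of that axis; all other inputs are shared unsplit.
    """
    # Q groups of second input tensors, one group per tensor containing the axis.
    groups = {i: split_along(inputs[i], ax, k) for i, ax in positions.items()}
    return [
        [groups[i][part] if i in groups else inputs[i] for i in range(len(inputs))]
        for part in range(k)
    ]


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[10, 20], [30, 40], [50, 60], [70, 80]]
scale = [2]                                  # does not contain the target axis
sets = build_input_sets([a, b, scale], {0: 0, 1: 0}, k=2)
```

Each of the two sets contains three tensors, the same number as the original operator input, with `scale` shared unchanged between them.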
[0163] Figure 5 is a schematic diagram of another method for processing computing tasks provided in an embodiment of this application. Figure 5 describes another possible implementation of S303 in detail.
[0164] When the operator used to perform the computational task also includes a second operator, the second operator includes P separable axes, and the P separable axes are a subset of the N separable axes.
[0165] S501, Obtain the segmentation information of the second operator from the operator segmentation information database. The segmentation information of the second operator includes the axis type and second position information of the p-th segmentable axis among P segmentable axes in the second operator. The second position information is used to indicate the position of the p-th segmentable axis in the input tensor of the second operator. The input tensor of the second operator is the output tensor of the first operator. P is a positive integer greater than or equal to 1 and less than or equal to N, p=1,…,P.
[0166] It should be understood that P separable axes represent a subset of N separable axes. The P separable axes appear in the output tensor of the first operator, and the output tensor of the first operator serves as the input tensor of the second operator. In other words, the P separable axes of the second operator also appear in the N separable axes of the first operator.
[0167] S502, based on the segmentation information of the first operator and the segmentation information of the second operator, determine P segmentation reference information. The p-th segmentation reference information among the P segmentation reference information includes: the axis type of the p-th segmentable axis in the first operator, the axis type of the p-th segmentable axis in the second operator, and the position of the p-th segmentable axis in the input tensor of the first operator.
[0168] S503, based on P segmentation reference information and the amount of computing resources M, determine P groups of candidate segmentation methods, wherein the p-th group of candidate segmentation methods in the P groups includes at least one segmentation method.
[0169] The segmentation methods included in the p-th group of candidate segmentation methods are determined based on the p-th segmentation reference information among the P segmentation reference information and the number of computing resources M.
[0170] It should be understood that each group of candidate segmentation methods is a candidate segmentation method corresponding to each segmentation reference information, that is, the segmentation reference information corresponding to each of the P segmentable axes. The fact that each group of candidate segmentation methods includes at least one segmentation method can also be understood as each group of candidate segmentation methods including M-1 segmentation methods. For example, when the number of computing resources is 4, the target number of computing resources can be 2, 3, or 4, meaning there are 3 possible target numbers of computing resources. Therefore, each group of candidate segmentation methods includes 3 segmentation methods.
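The size of the candidate space described in S503 can be made concrete with a short sketch. It pairs each piece of segmentation reference information with every possible target number of computing resources from 2 to M, which yields M-1 candidates per group. The tuple layout of the reference information and the function name `candidate_methods` are assumptions for illustration.

```python
def candidate_methods(reference_info, m):
    """Return one group of candidate segmentation methods per segmentable axis.

    Each candidate pairs an axis's reference information with a target
    resource count k in 2..m, giving m - 1 candidates per group.
    """
    return [[(info, k) for k in range(2, m + 1)] for info in reference_info]


# Hypothetical reference information for P = 2 segmentable axes:
# (axis type in first operator, axis type in second operator, positions in first operator).
reference_info = [
    ("elementwise", "elementwise", {1: 0, 2: 0}),
    ("elementwise", "reduce", {1: 1}),
]
groups = candidate_methods(reference_info, m=4)
resource_counts = [k for _, k in groups[0]]
```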
[0171] S504, determine the target segmentation method based on the time required for each segmentation method in the P groups of candidate segmentation methods to complete the computation task.
[0172] As one possible approach, the segmentation method with the shortest computation time among the P groups of candidate segmentation methods is determined as the target segmentation method.
[0173] Specifically, when the total number of segmentation methods in the P groups of candidate segmentation methods is not large, the P groups of candidate segmentation methods are traversed to obtain the time required by each candidate segmentation method to complete the computation task, and the segmentation method with the shortest time is selected as the target segmentation method. The time can be obtained through simulation, theoretical calculation, or running on actual hardware. This application embodiment does not limit the traversal method.
[0174] Specifically, when the total number of segmentation methods in the P groups of candidate segmentation methods is large, the target segmentation method is searched for within the P groups of candidate segmentation methods. Various search methods can be used, such as a Markov chain Monte Carlo algorithm or a genetic algorithm. The embodiments of this application do not limit the search method.
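The distinction between traversal and search can be sketched as follows. This is not the search procedure of the application: uniform random sampling is used here as a deliberately simple stand-in for a Markov chain Monte Carlo or genetic search, and the threshold `exhaustive_limit`, the sample count, the function name `choose_method`, and the cost values are all assumptions.

```python
import random


def choose_method(candidates, estimate_time, exhaustive_limit=64, samples=32, seed=0):
    """Pick the candidate with the shortest estimated time.

    Small candidate spaces are traversed exhaustively. Larger ones are sampled
    uniformly at random, a simple stand-in for the search methods in the text.
    """
    if len(candidates) > exhaustive_limit:
        count = min(samples, len(candidates))
        candidates = random.Random(seed).sample(candidates, count)
    return min(candidates, key=estimate_time)


# Hypothetical times, e.g. obtained by simulation or by running on hardware.
times = {("method_1", 2): 7.0, ("method_1", 4): 4.0, ("method_2", 2): 6.0}
target = choose_method(list(times), times.get)
```

When the space is sampled rather than traversed, the returned method is the best among the sampled candidates and is not guaranteed to be the global optimum.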
[0175] As one possible implementation, the target partitioning method is determined based on the time required for each partitioning method in the P groups of candidate partitioning methods to complete the computation task and the target number of computing resources K. For example, if the computation time required to complete the computation task using partitioning method 1 and d target computing resources is the same as the computation time required to complete the computation task using partitioning method 2 and e target computing resources, but the number of target computing resources d corresponding to partitioning method 1 is less than the number of target computing resources e corresponding to partitioning method 2, then partitioning method 1 is selected as the target partitioning method, and the number of target computing resources corresponding to the target partitioning method is d.
[0176] S505, according to the target segmentation method, the input tensor of the first operator is segmented to obtain K sets of input tensors.
[0177] Figure 6 is a flowchart illustrating an operator segmentation method for completing a computation task using multiple operators, as provided in an embodiment of this application. S505 is explained in detail below with reference to Figure 6.
[0178] S601, based on the target segmentation method, determine the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors including the target segmentation axis in the first operator, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.
[0179] It should be understood that the interpretation of the Q first input tensors in S601 is similar to that in S402. For the sake of brevity, please refer to the description in S402 for details, which will not be repeated here.
[0180] S602, based on the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number of target computing resources K, each of the Q first input tensors is split to obtain Q groups of second input tensors. Each of the Q groups of second input tensors includes K second input tensors. The q-th group of second input tensors is the result of splitting the q-th first input tensor among the Q first input tensors into K parts, where q=1,…,Q.
[0181] It should be understood that the interpretation of the Q groups of second input tensors in S602 is similar to that in S403. For the sake of brevity, please refer to the description in S403 for details, which will not be repeated here.
[0182] It should be noted that obtaining the Q groups of second input tensors in S602 requires considering both the axis type of the target splitting axis in the first operator and its axis type in the second operator. The specific splitting method is illustrated with an example in conjunction with Figure 15.
[0183] S603, based on the Q groups of second input tensors and the unsplit input tensors of the first operator, obtain K groups of input tensors.
[0184] The k-th group of input tensors among the K groups includes the k-th second input tensor in each of the Q groups of second input tensors and the unsplit input tensors of the first operator.
[0185] It should be understood that the interpretation of the K groups of input tensors in S603 is similar to that in S404. For the sake of brevity, please refer to the description in S404 for details, which will not be repeated here.
[0186] It should be noted that completing the computation task may also include more operators. In this embodiment of the application, the completion of the computation task includes the first operator and the second operator as an example for detailed explanation. When the completion of the computation task requires operators other than the first operator and the second operator, the graph optimizer also needs to obtain the segmentation information of other operators to obtain candidate segmentation methods in order to determine the target segmentation method.
[0187] The axis types of the operator input tensors, the position information of the separable axes in the input and output tensors of the operator, and the operator segmentation methods corresponding to different axis types in the embodiments of this application are described in detail below with reference to Figures 7 to 17.
[0188] Axis type represents the data dependency between the operator input tensor and the output. In other words, the graph optimizer can determine the corresponding partitioning method based on the axis type of the input tensor. Therefore, different operator inputs that include the same axis type can have the same operator partitioning method.
[0189] As one possible implementation, the axis type of the operator input tensor may include divisible axes such as element axis, reduction axis, and sliding window axis, and may also include other types of divisible axes. This application embodiment does not limit this.
[0190] The element axis, the reduction axis, and the sliding window axis are explained in detail below with reference to Figures 7 to 13. It should be noted that Figures 7 to 13 are all schematic diagrams of operator segmentation methods for single-operator computation tasks. Operators A, B, and C in Figures 7 to 13 can each represent the first operator. The name of the first operator is not limited in the embodiments of this application.
[0191] Element axis: If an iteration variable in the input tensor of operator A is an element axis, the element axis is an axis along which the elements of the input tensor and the output tensor of operator A have a point-to-point mapping. That is, a point in the output tensor has the same position as the point in the input tensor on which it depends. For example, consider a four-dimensional input tensor of shape (5,7,9,3), whose axis 3 has a length of 3 and contains the data a0, a1, and a2, and an output tensor of shape (4,6,8,3), whose axis 3 also has a length of 3 and contains the data b0, b1, and b2. Since the positions of a0 and b0 correspond, the positions of a1 and b1 correspond, and the positions of a2 and b2 correspond, axis 3 of both the input tensor and the output tensor is an element axis.
[0192] Figure 7 is a schematic diagram of an element axis segmentation method provided in an embodiment of this application. Figure 7 shows the steps for segmenting the input tensor of operator A along the element axis. In Figure 7, operator A is an activation function operator by way of example; the type of operator A is not limited in this embodiment. It should be noted that Figure 7 illustrates the activation function operator with a single input tensor and a single output tensor; the number of input and output tensors of the operator is not limited in the embodiments of this application.
[0193] Specifically, the target split axis of the activation function operator is of the element axis type. From the position information of the target split axis, it can be determined that the element axis appears on the 0 axis of the first input tensor of the activation function operator, that is, the 0 axis with a length of 8 is the element axis. Based on the length of the element axis of the first input tensor, the length of the element axis of the first output tensor is obtained through the forward shape derivation function y=f_1(x), where x is the length of the element axis of the first input tensor and y is the length of the element axis of the first output tensor. The forward derivation logic of the element axis is that these two lengths are equal. As shown in Figure 7(a), the first input tensor of the activation function operator is (8,56,56,64) and its 0 axis is the element axis. Since the length of the element axis of the output tensor equals that of the input tensor, the length of the 0 axis of the first output tensor is also 8, that is, the first output tensor is (8,56,56,64).
[0194] Based on the number of target computing resources, the first output tensor is partitioned along its element axis to obtain the second output tensor of the activation function operator on each target computing resource. Subsequently, based on the element axis length corresponding to the second output tensor of the operator on each computing resource, the element axis length of each second input tensor is derived in reverse using the inverse shape derivation function x=f_1^(-1)(y). A splitting function is used when partitioning the first input tensor along its element axis. After computation on different computing resources, the second output tensors on different target computing resources are concatenated using a concat function to obtain the first output tensor.
[0195] As shown in Figure 7(b), two target computing resources are used for the activation function operator. The length of the 0 axis of the first output tensor is 8, so the length of the element axis of the second output tensor on each computing resource is 4. That is, the first second output tensor is (4,56,56,64), and the second second output tensor is (4,56,56,64). The two second output tensors are synchronized into the first output tensor by a concatenation function. The elements on the 0 axis of the first second output tensor and the elements on the 0 axis of the second second output tensor do not intersect. Then, based on the inverse shape derivation function of the element axis, the length of the element axis of each second input tensor is derived to be 4. That is, the first second input tensor is (4,56,56,64) and the second second input tensor is (4,56,56,64). Therefore, to obtain the second input tensors, the graph optimizer calls the first splitting function to split the first input tensor along the element axis, that is, along the 0 axis of the first input tensor, into two second input tensors.
[0196] It should be noted that in Figure 7(b) the lengths of the 0 axes of the two second output tensors are equal, and the lengths of the 0 axes of the two second input tensors are also equal. However, the second input tensors corresponding to different target computing resources only need to satisfy the following conditions: the elements on the 0 axes of the two second input tensors are subsets of the elements on the 0 axis of the first input tensor, they have no intersection, and their union is the set of elements on the 0 axis of the first input tensor. This application does not limit whether the lengths of the 0 axes of the second input tensors obtained on different computing resources are equal.
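The element axis derivation above can be illustrated with a minimal C sketch. The type AxisRange and the functions element_axis_forward, element_axis_inverse, and element_axis_slice are hypothetical names introduced only for this illustration and do not appear in the application. The sketch assumes that f_1 is the identity, as stated above, and that the output axis is divided as evenly as possible among the K target computing resources, which is one admissible choice since equal lengths are not required.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: half-open range [begin, end) on the split axis. */
typedef struct {
    size_t begin;
    size_t end;
} AxisRange;

/* Forward shape derivation of an element axis: y = f_1(x) = x. */
static size_t element_axis_forward(size_t input_len)
{
    return input_len;
}

/* Inverse shape derivation of an element axis: x = f_1^(-1)(y) = y. */
static size_t element_axis_inverse(size_t output_len)
{
    return output_len;
}

/* Range of the output element axis assigned to resource k of K.
 * The K ranges are disjoint and their union is [0, output_len).
 * When output_len is not divisible by K, the first ranges are one longer. */
static AxisRange element_axis_slice(size_t output_len, size_t k, size_t K)
{
    assert(K > 0 && k < K);
    size_t base = output_len / K;
    size_t rem = output_len % K;
    AxisRange r;
    r.begin = k * base + (k < rem ? k : rem);
    r.end = r.begin + base + (k < rem ? 1 : 0);
    return r;
}
```

With an element axis of length 8 and two resources, as in Figure 7(b), the sketch assigns [0, 4) to the first resource and [4, 8) to the second, and the inverse derivation gives an input element axis length of 4 for each.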
[0197] Reduction axis: If an iteration variable in the input tensor of operator B is a reduction axis, the reduction axis is an axis that exists in the input tensor of the operator but either does not exist in the output tensor of the operator or has a length of 1 there.
[0198] Specifically, reduction axes fall into two categories. The first type is a reduction axis along which operator B reduces the elements of the input tensor. For example, if the shape of the input tensor of operator B is (2,3,4,5) and the 0 axis, with a length of 2, is the reduction axis, then after the input tensor is processed by operator B the shape of the output tensor is (,3,4,5), where the empty first position indicates that the 0 axis no longer exists, or (1,3,4,5).
[0199] The second type is a reduction axis along which operator B does not reduce the elements of the input tensor. Although operator B performs no reduction along an axis of the second type, the axis still appears in the input tensor and does not appear in the output tensor. An example is the reduce-gather axis, which is described in detail with reference to Figure 11.
[0200] The first type of reduction axis can include the reduce-sum (reduceSum) axis, the reduce-max (reduceMax) axis, the reduce-min (reduceMin) axis, and the reduce-mean (reduceMean) axis. These axes all share the general characteristics of a reduction axis. They differ in the type of function that is called, after operator B has processed the split first input tensor on the different target computing resources, to obtain a first output tensor equivalent to the one before splitting. The specific segmentation methods for the different first-type reduction axes are described below with reference to Figures 8 to 10.
[0201] Figure 8 is a schematic diagram of a reduce-sum axis segmentation method provided in an embodiment of this application. Figure 8 shows the steps for segmenting the first input tensor of operator B along the reduce-sum axis. In Figure 8, operator B is a reduce-sum operator by way of example; the type of operator B is not limited in the embodiments of this application. It should be noted that Figure 8 illustrates the reduce-sum operator with a single input tensor and a single output tensor; the number of input and output tensors of the operator is not limited in the embodiments of this application.
[0202] Specifically, the target split axis of the reduce-sum operator is a reduce-sum axis. From the position information of the target split axis, it can be determined that the reduce-sum axis appears on the 0 axis of the first input tensor of the reduce-sum operator. As shown in Figure 8(a), the first input tensor of the reduce-sum operator is (8,56,56,64), and the axis type of its 0 axis is the reduce-sum axis, that is, the 0 axis with a length of 8 is the reduce-sum axis. Therefore, according to the characteristics of the reduce-sum axis, the reduce-sum axis either has a length of 1 in the first output tensor or does not appear in it; here the first output tensor is (,56,56,64).
[0203] Based on the number of target computing resources and the length of the reduce-sum axis in the first output tensor, the third splitting function is called to split the first input tensor along the reduce-sum axis into two second input tensors. The two second input tensors are sent to the two target computing resources for computation, which yields two second output tensors: the first second output tensor is (,56,56,64), and the second second output tensor is (,56,56,64). The data of the two target computing resources is then synchronized by the AddN function to obtain the first output tensor. As shown in Figure 8(b), two target computing resources are available for the reduce-sum operator. The second output tensors of the operator on the computing resources are combined by the addition operator into the first output tensor, so each second output tensor has the same shape as the first output tensor, namely (,56,56,64). Because there are two computing resources, the splitting operator divides the reduce-sum axis of the first input tensor to obtain the second input tensors, each with a reduce-sum axis of length 4, that is, each second input tensor has the shape (4,56,56,64).
[0204] It should be noted that in Figure 8(b) the lengths of the 0 axes of the two second output tensors are equal, and the lengths of the 0 axes of the two second input tensors are also equal. However, the second input tensors corresponding to different target computing resources only need to satisfy the following conditions: the elements on the 0 axes of the two second input tensors are subsets of the elements on the 0 axis of the first input tensor, they have no intersection, and their union is the set of elements on the 0 axis of the first input tensor. This application does not limit whether the lengths of the 0 axes of the second input tensors obtained on different computing resources are equal.
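The reduce-sum splitting described above can be checked on a small example. The following C sketch is illustrative only: it flattens the tensor to a row-major (rows x cols) array in which the rows play the role of the reduce-sum axis, and the function names reduce_sum_rows and add_n2 are hypothetical stand-ins for the reduce-sum operator and for the AddN synchronization of two partial results.

```c
#include <assert.h>
#include <stddef.h>

/* Reduce-sum over rows [row_begin, row_end) of a row-major (rows x cols)
 * tensor. The reduced axis disappears: the result has only `cols` entries. */
static void reduce_sum_rows(const float *in, size_t row_begin, size_t row_end,
                            size_t cols, float *out)
{
    assert(row_begin <= row_end);
    for (size_t c = 0; c < cols; ++c)
        out[c] = 0.0f;
    for (size_t r = row_begin; r < row_end; ++r)
        for (size_t c = 0; c < cols; ++c)
            out[c] += in[r * cols + c];
}

/* Hypothetical stand-in for AddN with two inputs: element-wise addition of
 * the second output tensors produced on two computing resources. */
static void add_n2(const float *a, const float *b, size_t n, float *out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Calling reduce_sum_rows on two disjoint row ranges that together cover all rows and then combining the two results with add_n2 gives the same values as a single reduce_sum_rows over all rows, which is the equivalence that the splitting relies on.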
[0205] Figure 9 is a schematic diagram of a reduce-max axis segmentation method provided in an embodiment of this application. Figure 9 shows the steps for segmenting the first input tensor of operator B along the reduce-max axis. In Figure 9, operator B is a reduce-max operator by way of example; the type of operator B is not limited in this embodiment. It should be noted that Figure 9 illustrates the reduce-max operator with a single input tensor and a single output tensor; the number of input and output tensors of the operator is not limited in the embodiments of this application.
[0206] For a first input tensor that includes a reduce-max axis, the main steps follow the same general approach as segmenting the first input tensor of the reduce-sum operator along the reduce-sum axis in Figure 8. Refer to the description of Figure 8; the details are not repeated here.
[0207] It should be noted that, as shown in Figure 9, the type of function called after operator B processes the first input tensor split along the reduce-sum axis differs from that called after it processes the first input tensor split along the reduce-max axis. For the reduce-sum axis, the second output tensors produced by the reduce-sum operation on the second input tensors on the different target computing resources are synchronized by calling the addition function to obtain the first output tensor. For the reduce-max axis, the second output tensors produced by the reduce-max operation on the second input tensors on the different computing resources are synchronized by calling the maximum function to obtain the first output tensor.
[0208] For a first input tensor that includes a reduce-min axis, the main steps follow the same general approach as segmenting the first input tensor of the reduce-sum operator along the reduce-sum axis in Figure 8. Refer to the description of Figure 8; the details are not repeated here.
[0209] It should be noted that the type of function called after operator B processes the first input tensor split along the reduce-sum axis differs from that called after it processes the first input tensor split along the reduce-min axis. For the reduce-sum axis, the second output tensors produced by the reduce-sum operation on the second input tensors on the different target computing resources are synchronized by calling the addition function to obtain the first output tensor. For the reduce-min axis, the second output tensors produced by the reduce-min operation on the second input tensors on the different computing resources are synchronized by calling the minimum function to obtain the first output tensor.
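The difference in the synchronization function for the reduce-max and reduce-min axes can be sketched in C on a one-dimensional array, where the single axis is the reduction axis. The function names max_over, min_over, sync_max, and sync_min are hypothetical and are used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduce-max over the elements [begin, end) of the reduction axis. */
static float max_over(const float *v, size_t begin, size_t end)
{
    assert(begin < end);
    float m = v[begin];
    for (size_t i = begin + 1; i < end; ++i)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Reduce-min over the elements [begin, end) of the reduction axis. */
static float min_over(const float *v, size_t begin, size_t end)
{
    assert(begin < end);
    float m = v[begin];
    for (size_t i = begin + 1; i < end; ++i)
        if (v[i] < m)
            m = v[i];
    return m;
}

/* Synchronization node for a reduce-max axis: maximum of two partial results. */
static float sync_max(float a, float b)
{
    return a > b ? a : b;
}

/* Synchronization node for a reduce-min axis: minimum of two partial results. */
static float sync_min(float a, float b)
{
    return a < b ? a : b;
}
```

Applying max_over to each half of the array and combining the two partial results with sync_max gives the same value as max_over on the whole array; the same holds for min_over with sync_min. Combining the partial results by addition, as for the reduce-sum axis, would not.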
[0210] Figure 10 is a schematic diagram of a reduce-mean axis segmentation method provided in an embodiment of this application. Figure 10 shows the steps for segmenting the input tensor of operator B along the reduce-mean axis. In Figure 10, operator B is a reduce-mean operator by way of example; the type of operator B is not limited in the embodiments of this application. It should be noted that Figure 10 illustrates the reduce-mean operator with a single input tensor and a single output tensor; the number of input and output tensors of the operator is not limited in the embodiments of this application.
[0211] For a first input tensor that includes a reduce-mean axis, the main steps follow the same general approach as segmenting the first input tensor of the reduce-sum operator along the reduce-sum axis in Figure 8. Refer to the description of Figure 8; the details are not repeated here.
[0212] It should be noted that, as shown in Figure 10, the number of functions called after operator B processes the first input tensor split along the reduce-mean axis differs from that for the first input tensor split along the reduce-sum axis. The second output tensors produced by the reduce-mean operation on the second input tensors on the different target computing resources are synchronized by calling the addition function to obtain an intermediate output tensor, and the multiplication function must then be called to obtain the first output tensor. The addition function is a synchronization node that sums the second output tensors of the different computing resources along the reduce-mean axis, and the multiplication function multiplies the synchronized intermediate output tensor by 1/group to obtain the first output tensor, where group is the number of target computing resources. For example, in Figure 10 there are 2 target computing resources, so group is 2.
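The addition followed by multiplication by 1/group can be sketched in C as follows. The function names mean_over and sync_mean are hypothetical. The sketch assumes that every computing resource receives a slice of the same length along the reduce-mean axis, as in Figure 10; with slices of unequal length, scaling the sum of the partial means by 1/group would not reproduce the overall mean.

```c
#include <assert.h>
#include <stddef.h>

/* Reduce-mean over the elements [begin, end) of the reduction axis. */
static float mean_over(const float *v, size_t begin, size_t end)
{
    assert(begin < end);
    float sum = 0.0f;
    for (size_t i = begin; i < end; ++i)
        sum += v[i];
    return sum / (float)(end - begin);
}

/* Synchronization for a reduce-mean axis: add the partial means from all
 * resources (addition node), then multiply by 1/group (multiplication node).
 * Assumes all partial means were computed over slices of equal length. */
static float sync_mean(const float *partial_means, size_t group)
{
    assert(group > 0);
    float sum = 0.0f;
    for (size_t g = 0; g < group; ++g)
        sum += partial_means[g];
    return sum * (1.0f / (float)group);
}
```

For the values 1 to 8 split into two halves, the partial means are 2.5 and 6.5; their sum is 9, and multiplying by 1/2 gives 4.5, which equals the mean of all eight values.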
[0213] The second type of reduction axis includes the reduce-gather axis, which is the axis along which operator B indexes the elements of its input tensor according to the addresses given by the elements of its index input tensor. That is, when the first input tensor contains a reduce-gather axis, the data for the 0 axis of the first output tensor is looked up on the reduce-gather axis of the first input tensor according to the addresses in the first index input tensor. Figure 11 is a schematic diagram of a reduce-gather axis segmentation method provided in an embodiment of this application, described below using a gather (gather2) operator as operator B by way of example. This application does not limit the type of operator B. It should be noted that Figure 11 illustrates the gather operator with one index input tensor and one first input tensor as inputs and one first output tensor as output; the number of first input tensors and first output tensors of the operator is not limited in the embodiments of this application.
[0214] Specifically, as shown in Figure 11(a), the gather operator has two input tensors: a first input tensor and a first index input tensor. The first input tensor is a data input tensor with a shape of (80,64), and the first index input tensor is an input tensor that contains index addresses and has a shape of (20,). Based on the segmentation information of the gather operator, the target split axis is determined to be the reduce-gather axis, which appears on the 0 axis of the first input tensor. According to the characteristics of the reduce-gather axis, each entry on the 0 axis of the first output tensor is the data element found on the 0 axis of the first input tensor at the index address given by the first index input tensor. Therefore, the shape of the first output tensor is (20,64).
[0215] Based on the number of target computing resources and the length of the reduce-gather axis of the first input tensor, the third splitting function is called to split the first input tensor along the reduce-gather axis into two second input tensors. Each target computing resource receives its corresponding second input tensor, as well as an index input tensor obtained by applying a bias (offset) function to the first index input tensor. The gather operator on each target computing resource then produces its own second output tensor. By calling an addition function, the second output tensors from the different target computing resources are added together, which synchronizes the data and yields the first output tensor. As shown in Figure 11(b), two target computing resources are used for the gather operator. The reduce-gather axis does not appear in the first output tensor, and the length of the 0 axis of the first output tensor equals the length of the 0 axis of the first index input tensor. Because there are two computing resources, the third splitting function is called to divide the reduce-gather axis of the first input tensor into the second input tensors, each with a reduce-gather axis of length 40, that is, the first second input tensor has the shape (40,64) and the second second input tensor also has the shape (40,64). Each computing resource receives the same first index input tensor. Since the gather operator on each computing resource holds only half of the data on the reduce-gather axis of the first input tensor, namely the first or the second second input tensor, the first index input tensor must be processed by the bias operator to ensure that the second output tensor produced by the gather operator on each computing resource is correct.
[0216] It should be noted that, because the first input tensor is divided into two parts, the gather operator on a computing resource may find that an address in the first index input tensor does not exist on the reduce-gather axis of its second input tensor when it looks up the data for the 0 axis of its second output tensor. In this case, 0 is taken as the lookup result. Finally, the second output tensors produced by the gather operator on the two computing resources are added together to obtain the first output tensor.
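The reduce-gather splitting described above (offset the indices, return 0 for addresses that fall outside the local slice, then add the partial outputs) can be sketched in C. The function names gather_local and add_outputs are hypothetical, and the bias function is modeled here, as an assumption, by subtracting the first row held by the resource from each index.

```c
#include <assert.h>
#include <stddef.h>

/* Gather on one computing resource that holds rows
 * [shard_begin, shard_begin + shard_rows) of the first input tensor.
 * `shard` points at the local rows (row-major, `cols` values per row).
 * Each index is offset by shard_begin; an index that falls outside the
 * local rows produces a row of zeros. */
static void gather_local(const float *shard, size_t shard_begin,
                         size_t shard_rows, size_t cols,
                         const size_t *indices, size_t n, float *out)
{
    for (size_t i = 0; i < n; ++i) {
        size_t idx = indices[i];
        int hit = idx >= shard_begin && idx - shard_begin < shard_rows;
        for (size_t c = 0; c < cols; ++c)
            out[i * cols + c] =
                hit ? shard[(idx - shard_begin) * cols + c] : 0.0f;
    }
}

/* Addition node: element-wise sum of two second output tensors. */
static void add_outputs(const float *a, const float *b, size_t n, float *out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Because the row ranges of the resources are disjoint, each index hits exactly one resource, so for every output row one partial result holds the data and the other holds zeros. Their sum therefore equals the output of a gather over the unsplit first input tensor.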
[0217] In this embodiment, because the type of the reduction axis already determines the specific segmentation method, the graph optimizer can reasonably segment the input tensor of a specific operator that includes a reduction axis without relying on the principle of that operator. By contrast, the traditional operator segmentation method starts from the output tensor of the specific operator. Because a reduction axis either does not appear in the output tensor or has a length of 1 there, the traditional method cannot segment an axis of the input tensor that has the characteristics of a reduction axis.
[0218] Sliding window axis: If an iteration variable in the input tensor of operator C is a sliding window axis, the sliding window axis is the axis along which operator C performs a sliding window scan over the elements of its input tensor. If the sliding window is larger than the stride, the windows of two adjacent scans overlap.
[0219] If the first output tensor is split along a sliding window axis and there are two target computing resources, the elements on the sliding window axis of the first output tensor are divided equally. In this case, some data on the sliding window axis of the two parts of the first output tensor depends on the same data on the sliding window axis of the first input tensor. Therefore, there are two ways to split a first input tensor that contains a sliding window axis, which are described in detail below with reference to Figure 12 and Figure 13.
[0220] It should be noted that the forward shape derivation function y=f_2(x) of the sliding window axes of the first input tensor and the first output tensor is used to derive, in the forward direction, the length of the sliding window axis of the first output tensor, where x represents the length of the sliding window axis of the first input tensor and y represents the length of the sliding window axis of the first output tensor. f_2() is related to the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
[0221] The inverse shape derivation function x=f_2^(-1)(y) of the sliding window axes of the first input tensor and the first output tensor is used to derive in reverse, from the length of the sliding window axis of the first output tensor, the appropriate segmentation method, so as to obtain the second output tensor and the second input tensor for each computing resource. Here, f_2^(-1)(y) is also related to the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
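The application does not give closed forms for f_2 and f_2^(-1). As an illustration only, the following C sketch uses the standard convolution output-length arithmetic as an assumed form of these functions. The type WindowParams, its fields, and the functions window_forward and window_inverse are hypothetical names, and the padding values used in the note after the code are likewise an assumption chosen so that the numbers match a 56 to 28 reduction with a kernel size of 3 and a stride of 2.

```c
#include <assert.h>

/* Hypothetical parameter bundle for one sliding window axis. */
typedef struct {
    long kernel;   /* convolution kernel size along the axis    */
    long stride;   /* convolution stride along the axis         */
    long dilation; /* convolution kernel dilation coefficient   */
    long pad_lo;   /* padding before the first element          */
    long pad_hi;   /* padding after the last element            */
} WindowParams;

/* Assumed forward derivation y = f_2(x): number of window positions. */
static long window_forward(long x, WindowParams p)
{
    long span = p.dilation * (p.kernel - 1) + 1; /* extent of one window */
    assert(p.stride > 0 && x + p.pad_lo + p.pad_hi >= span);
    return (x + p.pad_lo + p.pad_hi - span) / p.stride + 1;
}

/* Assumed inverse derivation x = f_2^(-1)(y): length of the (padded) input
 * span needed to produce y consecutive outputs. */
static long window_inverse(long y, WindowParams p)
{
    long span = p.dilation * (p.kernel - 1) + 1;
    assert(y > 0);
    return (y - 1) * p.stride + span;
}
```

With kernel = 3, stride = 2, dilation = 1, pad_lo = 0, and pad_hi = 1, window_forward(56, p) gives 28, and window_inverse(14, p) gives 29, which is the input length needed on one resource when the output axis of length 28 is divided between two resources.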
[0222] Figure 12 is a schematic diagram of a sliding window axis segmentation method provided in an embodiment of this application. Figure 12 shows the overlapping segmentation of the input tensor of operator C along the sliding window axis. In Figure 12, operator C is a convolution operator by way of example; the type of operator C is not limited in the embodiments of this application. It should be noted that Figure 12 illustrates the convolution operator with a single input tensor and a single output tensor; the number of input and output tensors of the operator is not limited in the embodiments of this application.
[0223] Specifically, the target split axis of the convolution operator is a sliding window axis. From the position information of the target split axis, it can be determined that the sliding window axis appears on axis 1 of the first input tensor of the convolution operator, that is, axis 1 with a length of 56 is the sliding window axis. Therefore, the length of the sliding window axis of the first output tensor can be derived in the forward direction from the forward shape derivation function of the sliding window axis and the length of the sliding window axis of the first input tensor. As shown in Figure 12(a), the first input tensor of the operator is (1,56,56,64). Based on the forward shape derivation function of the sliding window axes of the first input tensor and the first output tensor, with a convolution stride of 2 and a convolution kernel size of 3, the first output tensor is obtained as (1,28,56,64).
[0224] Based on the number K of target computing resources and the length of the sliding window axis of the first output tensor, the first output tensor is segmented along the sliding window axis to obtain K second output tensors. Then, using the sliding window axis inverse shape derivation function, the length of the sliding window axis of the second input tensor for each target computing resource is derived in reverse. Based on the length of the sliding window axis in each second output tensor, the first input tensor can be segmented along the sliding window axis by calling the first slicing function to obtain the second input tensors. After obtaining the second output tensors through operations on different target computing resources, the equivalent first output tensor before segmentation can be obtained by calling the concatenation function.
[0225] As shown in Figure 12(b), two target computing resources are used for the convolution operator. Axis 1 of the first output tensor is the sliding window axis, with a length of 28. Therefore, before the concatenation function is called, the length of axis 1 of the second output tensor on each computing resource is 14. Then, based on the inverse shape derivation logic of the sliding window axis, since the convolution stride is 2 and the convolution kernel size is 3, the length of the sliding window axis of the second input tensor on each computing resource is 29. By calling the first slicing function 1 and the first slicing function 2, the first input tensor is sliced along axis 1 into two second input tensors whose axis 1 has a length of 29. These two second input tensors contain overlapping data. The data on axis 1 of one second input tensor ranges from 0 to 28 on axis 1 of the first input tensor, and the data on axis 1 of the other second input tensor ranges from 28 to 56 on axis 1 of the first input tensor. The 29th data point on axis 1 of the first input tensor is the part shared by the two second input tensors.
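The overlapping slices of this example can be reproduced with a short C sketch. It is illustrative only: the type Range and the function input_range_for_outputs are hypothetical names, the dilation is taken as 1, and the padding before the first element (pad_lo) is assumed to be 0, which the application does not state. Ranges are half-open, so [0, 29) covers the positions 0 to 28.

```c
#include <assert.h>

/* Hypothetical helper type: half-open range [begin, end) on the sliding
 * window axis of the first input tensor. */
typedef struct {
    long begin;
    long end;
} Range;

/* Input positions needed to compute the outputs [out_begin, out_end),
 * clamped to the real (unpadded) extent [0, input_len) of the axis.
 * Positions cut off by the clamp correspond to padding. */
static Range input_range_for_outputs(long out_begin, long out_end,
                                     long kernel, long stride, long pad_lo,
                                     long input_len)
{
    assert(out_begin < out_end && stride > 0 && kernel > 0);
    Range r;
    r.begin = out_begin * stride - pad_lo;
    r.end = (out_end - 1) * stride - pad_lo + kernel;
    if (r.begin < 0)
        r.begin = 0;
    if (r.end > input_len)
        r.end = input_len;
    return r;
}
```

For an input axis of length 56, a kernel size of 3, and a stride of 2, the outputs [0, 14) need the input range [0, 29) and the outputs [14, 28) need the input range starting at 28. The two ranges share position 28, that is, the 29th data point, which matches the overlap described above. Under the assumed padding, the second range reaches one position past the end of the real data, and that position is supplied by padding.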
[0226] The overlapping segmentation method shown in Figure 12 is suitable for scenarios in which the segmented input tensors, after being processed by operators on different computing resources, do not require frequent data synchronization, such as multi-threaded parallelism in which different threads are completely independent and can run in a pipeline. However, in some scenarios the segmented input tensors require frequent data synchronization after being processed on different computing resources. The overlapping parts of the output tensors obtained on the different computing resources are then spliced repeatedly, so the overlap keeps growing and causes unnecessary redundant computation. Therefore, this application also provides another segmentation method in which the sliding window axis is split without overlap, as shown in Figure 13.
[0227] Figure 13 is a schematic diagram of another sliding window axis segmentation method provided in an embodiment of this application. Figure 13 shows the steps for segmenting the input tensor of operator C along the sliding window axis without overlap. In Figure 13, operator C is a convolution operator by way of example. It should be noted that, as in Figure 12, Figure 13 illustrates the convolution operator with a single input tensor and a single output tensor; the number of input and output tensors of the operator is not limited in the embodiments of this application.
[0228] Specifically, the steps in Figure 13 for deriving the length of the sliding window axis of the second input tensor for each target computing resource are the same as in the overlapping segmentation of Figure 12; refer to the description of Figure 12. Figure 13 differs from Figure 12 in the process of splitting the first input tensor into K second input tensors.
[0229] Specifically, in Figure 13(b), the second splitting function divides the first input tensor equally along the sliding window axis into third input tensors. The second slicing function 1 and the second slicing function 2 extract, as fourth input tensors, the overlapping part of the sliding window axis on which the sliding window axis data of the second output tensors computed on different target computing resources jointly depend. The first concatenation function 1 and the first concatenation function 2 concatenate the third input tensors obtained by the second splitting function with the fourth input tensors obtained by the second slicing function 1 and the second slicing function 2 along the sliding window axis to obtain the second input tensor of each target computing resource. The second concatenation function concatenates the second output tensors produced by the convolution operator on the different target computing resources to obtain the first output tensor.
[0230] Specifically, as shown in Figure 13(b), there are two target computing resources, namely computing resource 1 and computing resource 2. The shape of the first output tensor is (1,28,56,64), where the 1 axis of the first input tensor is the sliding window axis.
[0231] By calling the second splitting function, the first input tensor is split equally along axis 1 into two third input tensors, namely the first third input tensor and the second third input tensor, each with the shape (1, 28, 56, 64). By calling the second slicing function 1, the first third input tensor is sliced along axis 1 to obtain the second fourth input tensor with the shape (1, 1, 56, 64). The data of the second fourth input tensor along axis 1 is the last data point of the first third input tensor along the sliding window axis, which is the 28th data point on axis 1 of the first input tensor. By calling the second slicing function 2, the second third input tensor is sliced along axis 1 to obtain the first fourth input tensor with the shape (1, 1, 56, 64). The data of the first fourth input tensor along axis 1 is the first data point of the second third input tensor along the sliding window axis, which is the 29th data point on axis 1 of the first input tensor.
[0232] By calling the first concatenation function 1, the first third input tensor and the first fourth input tensor are concatenated along axis 1 to obtain the first second input tensor, whose shape is (1, 29, 56, 64). The data range of axis 1 in this second input tensor is from 0 to 28 in axis 1 of the first input tensor. Similarly, by calling the first concatenation function 2, the second third input tensor and the second fourth input tensor are concatenated along axis 1 to obtain the second second input tensor, whose shape is (1, 29, 56, 64). The data range of axis 1 in the second second input tensor is from 28 to 56 in axis 1 of the first input tensor.
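The segmentation, slicing, and concatenation steps above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it models only the sliding window axis as a one-dimensional vector, and the names `Axis`, `slice`, `concat`, `SecondInputs`, and `split_with_overlap` are hypothetical. Indices in the code are 0-based, so the 28th and 29th data points of the text are indices 27 and 28.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: only the sliding window axis is represented, as a 1-D
// vector whose values are the element positions along that axis.
using Axis = std::vector<int>;

// Slice function: copy `count` elements starting at index `begin`.
Axis slice(const Axis& a, std::size_t begin, std::size_t count) {
    assert(begin + count <= a.size());
    const auto first = a.begin() + static_cast<std::ptrdiff_t>(begin);
    return Axis(first, first + static_cast<std::ptrdiff_t>(count));
}

// Concatenation function: append `b` to a copy of `a`.
Axis concat(const Axis& a, const Axis& b) {
    Axis out(a);
    out.insert(out.end(), b.begin(), b.end());
    return out;
}

// The second input tensors handed to the two target computing resources.
struct SecondInputs {
    Axis for_resource_1;
    Axis for_resource_2;
};

SecondInputs split_with_overlap(const Axis& first_input, std::size_t overlap) {
    assert(first_input.size() % 2 == 0);
    const std::size_t half = first_input.size() / 2;
    assert(overlap <= half);

    // Segmentation function: two equal, non-overlapping third input tensors.
    const Axis third_1 = slice(first_input, 0, half);
    const Axis third_2 = slice(first_input, half, half);

    // Slice functions: boundary data that the neighboring resource depends on.
    const Axis tail_of_third_1 = slice(third_1, half - overlap, overlap);
    const Axis head_of_third_2 = slice(third_2, 0, overlap);

    // Concatenation functions: each half plus the boundary data of its neighbor.
    return {concat(third_1, head_of_third_2), concat(tail_of_third_1, third_2)};
}
```

With an axis of length 56 and an overlap of one element, each resulting second input tensor has length 29 along the sliding window axis, matching the shape (1, 29, 56, 64) given above.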
[0233] In the embodiments of this application, the non-overlapping partitioning method of the sliding window axis is suitable for scenarios where frequent data synchronization is required between different computing resources, such as multi-die parallelism. The splicing function is used as a data synchronization node between different dies. In this way, the repeated calculation of overlapping data will not be caused, and the overlapping data will not continue to increase. This can effectively reduce the computational and storage pressure of computing resources.
[0234] In the embodiments of this application, the graph optimizer performs single-operator segmentation based on the types of the different axes in the operator's input tensor and the segmentation method corresponding to each axis type. This enables the graph optimizer to automatically obtain different single-operator segmentation strategies without relying on the principle of a specific operator, thereby achieving complete decoupling between the graph optimizer and the operator optimization module.
[0235] The above content provides a detailed explanation of different types of axes and their corresponding partitioning methods. These different types of axes can be represented using the following data structures:
[0236] typedef enum AXIS_TYPE
[0237] {
[0238] UNSPLIT = 0x0, // Indicates that this axis cannot be split.
[0239] ELEMENTWISE = 0x1, // Indicates that this axis is an element axis.
[0240] REDUCESUM = 0x2, // Indicates that this axis is a sum reduction axis.
[0241] REDUCEMAX = 0x3, // Indicates that this axis is a maximum reduction axis.
[0242] REDUCEMIN = 0x4, // Indicates that this axis is a minimum reduction axis.
[0243] REDUCEMEAN = 0x5, // Indicates that this axis is a mean reduction axis.
[0244] REDUCEGATHER = 0x6, // Indicates that this axis is an index axis.
[0245] SLIDINGWINDOW = 0x7, // Indicates that this axis is a sliding window axis.
[0246] // Definitions of other tensor axes…
[0247] } AXIS_TYPE;
[0248] It should be noted that the type of tensor axis is not limited to those listed in the embodiments of this application. There may be other tensor axes and their corresponding operator segmentation methods. The embodiments of this application do not limit this.
[0249] It should be noted that computing resources can be GPUs, CPUs, dies, or chips, etc. This application embodiment does not limit the type of computing resources, nor does it limit the number of computing resources. The two computing resources in this application embodiment are only one example.
[0250] In this embodiment, the graph optimizer automatically segments the operator input and output tensors according to different types of axes. For the graph optimizer, it is not necessary to segment the input and output tensors based on the specific principles of the operators; it only needs to segment them based on the operator segmentation methods corresponding to different types of axes. For the operators, segmenting the input and output tensors does not change the operator's calculation formula; only some parameters of the operator are changed. This achieves complete decoupling between graph optimization and the specific operator principles. Furthermore, the generalization ability of segmenting the first input tensor of the operator based on different types of axes is stronger.
[0251] The above content provides a specific explanation of the partitioning methods for the input and output tensors of a single operator. The graph optimizer in this embodiment can determine the partitioning method based on the axis type of the input tensor of a single operator and the position information of the target partitioning axis within the input and output tensors of the single operator. Moreover, when multiple operators are needed to complete the computation task, the position information of the partitionable axis within the input and output tensors of multiple operators makes it possible for the graph optimizer to cascade different operator partitioning methods into subgraphs. The following explains, with reference to Figure 14, the position information of the operator's separable axis in the operator's input and output tensors.
[0252] Figure 14 is a schematic diagram illustrating the position information of the operator's separable axis in the operator's input and output tensors, provided in an embodiment of this application.
[0253] The position information of the operator's separable axis in the input and output tensors indicates which input tensors and which output tensors the same separable axis lies on, and the specific position of the same separable axis in the input and output tensors. Each separable axis is of one of the different types of axes mentioned above.
[0254] It should be understood that the embodiments of this application do not limit the number of input tensors and output tensors of the operator. Multiple input tensors can be processed by the operator to obtain multiple output tensors. Furthermore, the embodiments of this application do not limit the number of first tensor axes in the input tensors and output tensors.
[0255] As shown in Figure 14, taking the convolution operator as an example, the convolution operator has two input tensors: the first feature map input tensor and the first weight input tensor, with corresponding shapes of (8, 56, 56, 64) and (4, 3, 3, 64), respectively. Axes 0, 1, and 2 of the first feature map input tensor are sliding window axes, and axis 3 is a reduction axis. Axis 0 of the first weight input tensor is an element axis, and axes 1, 2, and 3 are reduction axes. Since reduction axes do not appear in the output tensor, the tensor axes appearing in the first output tensor are axes 0, 1, and 2 of the first feature map input tensor and axis 0 of the first weight input tensor. Therefore, the shape of the first output tensor is (8, 56, 56, 4).
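The rule that reduction axes do not appear in the output tensor can be illustrated with a small sketch. The names `AxisKind`, `AxisDesc`, `surviving_axes`, and `conv_example` are hypothetical. The sketch also assumes that every non-reduction axis keeps its input length in the output, which holds for the shapes in this example; in a general convolution the output length of a sliding window axis would depend on the kernel size, stride, and padding.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified axis kinds for this illustration only.
enum class AxisKind { Element, Reduction, SlidingWindow };

struct AxisDesc {
    int length;
    AxisKind kind;
};

// Collect, in order, the lengths of all axes that are not reduction axes.
// Assumption: each surviving axis keeps its input length in the output.
std::vector<int> surviving_axes(const std::vector<std::vector<AxisDesc>>& inputs) {
    std::vector<int> shape;
    for (const auto& tensor : inputs) {
        for (const AxisDesc& axis : tensor) {
            if (axis.kind != AxisKind::Reduction) shape.push_back(axis.length);
        }
    }
    return shape;
}

// The convolution example: feature map (8, 56, 56, 64) and weight (4, 3, 3, 64).
std::vector<std::vector<AxisDesc>> conv_example() {
    const std::vector<AxisDesc> feature_map = {
        {8, AxisKind::SlidingWindow},
        {56, AxisKind::SlidingWindow},
        {56, AxisKind::SlidingWindow},
        {64, AxisKind::Reduction}};
    const std::vector<AxisDesc> weight = {
        {4, AxisKind::Element},
        {3, AxisKind::Reduction},
        {3, AxisKind::Reduction},
        {64, AxisKind::Reduction}};
    return {feature_map, weight};
}
```

Calling `surviving_axes(conv_example())` yields the lengths 8, 56, 56, and 4, which is the output shape stated above.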
[0256] The above, combined with the figures, provides a detailed explanation of the position information of the separable axis in the operator's input and output tensors. Below, two specific data structures for the position information of the separable axis in the operator are given: a data structure centered on the separable axis and a data structure centered on the input and output tensors.
[0257] As one possible implementation, the data structure centered on the separable axis includes the type of the separable axis, the input and output tensors in which the separable axis appears, and the position of the separable axis in each of those tensors:
[0258] dim_slice_infos: vector<SliceInfo> // Splitting information for multiple splittable axes
[0259] SliceInfo {
[0260] type: AXIS_TYPE; // The type of this tensor axis
[0261] relate_inputs: vector<pair<int, vector<int>>>; // The input tensors in which this tensor axis appears, and the axis it occupies within each input tensor
[0262] relate_outputs: vector<pair<int, vector<int>>>; // The output tensors in which this tensor axis appears, and the axis it occupies within each output tensor
[0263] }
[0264] Specifically, taking the addition operator as an example, if one input tensor has the shape (3,1,5) and another input tensor has the shape (3,4,1), the output tensor obtained after the addition operator is (3,4,5). The specific data structure with the separable axis as the center can be represented as follows:
[0265] dim_slice_infos: [
[0266] {type: elementwise,
[0267] relate_inputs: [{0, {0}}, {1, {0}}],
[0268] relate_outputs: [{0, {0}}]
[0269] }, // This axis, with a length of 3, is of type elementwise and appears on axis 0 of the first input tensor, axis 0 of the second input tensor, and axis 0 of the first output tensor.
[0270] {
[0271] type: elementwise,
[0272] relate_inputs: [{1, {1}}],
[0273] relate_outputs: [{0, {1}}]
[0274] }, // This axis, with a length of 4, is of type elementwise and appears on axis 1 of the second input tensor and axis 1 of the first output tensor.
[0275] {
[0276] type: elementwise,
[0277] relate_inputs: [{0, {2}}],
[0278] relate_outputs: [{0, {2}}]
[0279] }] // This axis, with a length of 5, is of type elementwise and appears on axis 2 of the first input tensor and axis 2 of the first output tensor.
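As a hedged illustration, the axis-centered structure and the addition example could be rendered in C++ roughly as follows. The names `AxisType`, `AxisPositions`, `add_slice_infos`, and `axis_in_tensor` are hypothetical choices for this sketch, and the lookup helper is an assumption about how a graph optimizer might query the structure rather than part of the embodiment.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical C++ rendering of the AXIS_TYPE values listed in this application.
enum class AxisType {
    Unsplit, Elementwise, ReduceSum, ReduceMax,
    ReduceMin, ReduceMean, ReduceGather, SlidingWindow
};

// Each entry is (tensor index, axes occupied within that tensor).
using AxisPositions = std::vector<std::pair<int, std::vector<int>>>;

struct SliceInfo {
    AxisType type;                 // type of this splittable axis
    AxisPositions relate_inputs;   // where the axis lies in the input tensors
    AxisPositions relate_outputs;  // where the axis lies in the output tensors
};

// Addition operator with input shapes (3,1,5) and (3,4,1), output (3,4,5).
std::vector<SliceInfo> add_slice_infos() {
    return {
        {AxisType::Elementwise, {{0, {0}}, {1, {0}}}, {{0, {0}}}},  // length 3
        {AxisType::Elementwise, {{1, {1}}}, {{0, {1}}}},            // length 4
        {AxisType::Elementwise, {{0, {2}}}, {{0, {2}}}},            // length 5
    };
}

// Returns the first axis that the splittable axis occupies in tensor
// `tensor_index`, or -1 if the axis does not appear in that tensor.
int axis_in_tensor(const AxisPositions& positions, int tensor_index) {
    for (const auto& entry : positions) {
        if (entry.first == tensor_index && !entry.second.empty()) {
            return entry.second.front();
        }
    }
    return -1;
}
```

For the axis of length 4, `axis_in_tensor` returns -1 for the first input tensor, which signals that this tensor is not split when that axis is chosen as the target segmentation axis.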
[0280] As one possible implementation, the data structure centered on input and output tensors includes the number of each axis in each input tensor, the number of each axis in each output tensor, and the type of the axis corresponding to each number:
[0281] input_dim_name_defs: vector<vector<int>> // Indicates the number of each axis in each input.
[0282] output_dim_name_defs: vector<vector<int>> // Indicates the number of each axis in each output.
[0283] dim_slice_types: map<int, AXIS_TYPE> // Indicates the type of the axis corresponding to each number.
[0284] Specifically, taking the addition operator as an example, if one input tensor has the shape (3,1,5) and another input tensor has the shape (3,4,1), the output tensor obtained after the addition operator is (3,4,5). The specific data structure with the tensor axis as the center can be represented as follows:
[0285] input_dim_name_defs: {{n1, n2, n3}, {n1, n4, n5}} // The first input tensor contains three axes n1, n2, and n3; the second input tensor contains three axes n1, n4, and n5.
[0286] output_dim_name_defs: {{n1, n4, n3}} // The first output tensor contains three axes n1, n4, and n3.
[0287] dim_slice_types: {
[0288] n1: element-wise // The n1 axis is an element-wise axis.
[0289] n2: can't split // The n2 axis is an indivisible axis.
[0290] n3: element-wise // The n3 axis is an element-wise axis.
[0291] n4: element-wise // The n4 axis is an element-wise axis.
[0292] n5: can't split // The n5 axis is an indivisible axis.
[0293] }
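The tensor-centered structure above can be sketched in C++ as shown below. This is an illustrative assumption, not the embodiment's code: the axis numbers n1 to n5 are encoded as the integers 1 to 5, and `TensorCentricInfo`, `add_example`, `find_axis`, and `is_splittable` are hypothetical names. The sketch shows how the axis-centered position information can be recovered from the tensor-centered form by searching for an axis number.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical subset of the axis types, sufficient for the addition example.
enum class AxisType { Unsplit, Elementwise };

struct TensorCentricInfo {
    std::vector<std::vector<int>> input_dim_name_defs;   // axis numbers per input
    std::vector<std::vector<int>> output_dim_name_defs;  // axis numbers per output
    std::map<int, AxisType> dim_slice_types;             // type of each axis number
};

// Addition operator with input shapes (3,1,5) and (3,4,1); n1..n5 become 1..5.
TensorCentricInfo add_example() {
    TensorCentricInfo info;
    info.input_dim_name_defs = {{1, 2, 3}, {1, 4, 5}};
    info.output_dim_name_defs = {{1, 4, 3}};
    info.dim_slice_types = {
        {1, AxisType::Elementwise}, {2, AxisType::Unsplit},
        {3, AxisType::Elementwise}, {4, AxisType::Elementwise},
        {5, AxisType::Unsplit}};
    return info;
}

// All (tensor index, axis index) positions at which axis number `name` occurs.
std::vector<std::pair<int, int>> find_axis(
        const std::vector<std::vector<int>>& defs, int name) {
    std::vector<std::pair<int, int>> hits;
    for (std::size_t t = 0; t < defs.size(); ++t) {
        for (std::size_t a = 0; a < defs[t].size(); ++a) {
            if (defs[t][a] == name) {
                hits.emplace_back(static_cast<int>(t), static_cast<int>(a));
            }
        }
    }
    return hits;
}

// An axis is splittable if it is known and not marked as indivisible.
bool is_splittable(const TensorCentricInfo& info, int name) {
    const auto it = info.dim_slice_types.find(name);
    return it != info.dim_slice_types.end() && it->second != AxisType::Unsplit;
}
```

Searching the input definitions for axis number 1 returns positions (0, 0) and (1, 0), which corresponds to the `relate_inputs` entry of the first axis in the axis-centered representation.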
[0294] The graph optimizer segments different operators and cascades them into subgraphs based on the axis types in the input tensors and the position information of each axis in the input and output tensors. Different axis position information lends itself to different applications. The specific applications of the operator segmentation method for processing computational tasks in the embodiments of this application are described in detail below with reference to Figures 15 to 17.
[0295] Figure 15 is a schematic diagram illustrating a specific application of operator segmentation provided in an embodiment of this application. Scenario 1: The output tensor of the first operator, which includes a separable axis, is used as the input tensor of the second operator, so that the segmentation of the input tensors of multiple consecutive operators can be optimized.
[0296] As one possible approach, the segmentation method of the first input tensor is determined based on the axis type of the target segmentation axis in different operators and the position information of the first input tensor and the first output tensor in different operators.
[0297] Specifically, Figure 15 contains two activation function operators, ReLU and TanH. The graph optimizer obtains the partitioning information for the ReLU and TanH operators. The target partitioning axis corresponding to the target partitioning method determined by the graph optimizer is an element axis. For the ReLU operator, this element axis appears on axis 0 of the first input tensor and axis 0 of the first output tensor. The shape of the first input tensor is (8, 56, 56, 64). According to the partitioning method corresponding to the element axis, the shape of the first output tensor is also (8, 56, 56, 64). For the TanH operator, this element axis appears on axis 0 of the first input tensor and axis 0 of the first output tensor. The shape of the first input tensor is (8, 56, 56, 64). Therefore, according to the partitioning method corresponding to the element axis, the shape of the first output tensor is also (8, 56, 56, 64).
[0298] If the position of the element axis in the input and output tensors of the ReLU and TanH operators is unknown, the second output tensors produced by the ReLU operator on each target computational resource need to be concatenated and synchronized to obtain the first input tensor of the TanH operator. The synchronized first input tensor of the TanH operator is then split to obtain the second input tensors for the different target computational resources. As shown in Figure 15(a), completing the ReLU and TanH operator operations in this way requires calling the splitting function twice and the concatenation function twice.
[0299] Since the graph optimizer knows the position information of the target splitting axis in the first input tensor and the first output tensor of different operators, the intermediate splicing operator and intermediate splitting operator generated by splitting the tensors of the ReLU operator and the TanH operator can be omitted when the two operators are executed consecutively. As shown in Figure 15(b), the element axis appears on axis 0 of the input and output tensors of both the ReLU and TanH operators, so only one splitting operator node and one splicing operator node are needed to realize continuous operation of the ReLU and TanH operators.
[0300] Specifically, based on the element axis segmentation method described above, by calling the segmentation function once, the first input tensor can be segmented according to the element axis to obtain two equally divided second input tensors for the ReLU operator. The second output tensor of the ReLU operator is obtained by performing the ReLU operator operation on each computing resource. The second output tensor of the ReLU operator serves as the second input tensor of the TanH operator. The second output tensor of the TanH operator is obtained by performing the TanH operator operation on each target computing resource. Finally, a concatenation operator operation is performed to obtain the final first output tensor.
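The difference between the two execution plans can be sketched as follows. This is a simplified illustration under stated assumptions: the tensor is reduced to a one-dimensional vector along the element axis, the target computing resources are modeled as iterations of a sequential loop, and the names `Tensor`, `CallCounts`, `split_equal`, `concat`, `run_with_sync`, and `run_cascaded` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;  // hypothetical 1-D stand-in for the element axis

struct CallCounts {
    int splits = 0;
    int concats = 0;
};

// Splitting function: divide `x` into `k` equal parts and count the call.
std::vector<Tensor> split_equal(const Tensor& x, std::size_t k, CallCounts& counts) {
    assert(k > 0 && x.size() % k == 0);
    ++counts.splits;
    const std::size_t part = x.size() / k;
    std::vector<Tensor> parts;
    for (std::size_t i = 0; i < k; ++i) {
        const auto first = x.begin() + static_cast<std::ptrdiff_t>(i * part);
        parts.emplace_back(first, first + static_cast<std::ptrdiff_t>(part));
    }
    return parts;
}

// Concatenation function: join the parts in order and count the call.
Tensor concat(const std::vector<Tensor>& parts, CallCounts& counts) {
    ++counts.concats;
    Tensor out;
    for (const Tensor& p : parts) out.insert(out.end(), p.begin(), p.end());
    return out;
}

Tensor relu(Tensor x) {
    for (float& v : x) v = std::max(v, 0.0f);
    return x;
}

Tensor tanh_op(Tensor x) {
    for (float& v : x) v = std::tanh(v);
    return x;
}

// Plan of Figure 15(a): synchronize (concatenate, then split again) between operators.
Tensor run_with_sync(const Tensor& x, std::size_t k, CallCounts& counts) {
    std::vector<Tensor> parts = split_equal(x, k, counts);
    for (Tensor& p : parts) p = relu(p);
    const Tensor synced = concat(parts, counts);
    parts = split_equal(synced, k, counts);
    for (Tensor& p : parts) p = tanh_op(p);
    return concat(parts, counts);
}

// Plan of Figure 15(b): one split, both operators on each part, one concatenation.
Tensor run_cascaded(const Tensor& x, std::size_t k, CallCounts& counts) {
    std::vector<Tensor> parts = split_equal(x, k, counts);
    for (Tensor& p : parts) p = tanh_op(relu(p));
    return concat(parts, counts);
}
```

Both plans produce the same result, but the cascaded plan records one split and one concatenation where the synchronized plan records two of each.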
[0301] It should be noted that, in the embodiments of this application, the axis type of the target segmentation axis in the continuous operator is not limited. Here, the example is given that the axis type of the target segmentation axis is the same in the first operator and the second operator, and both are element axes. The axis type of the target segmentation axis in the continuous operator can be the same or different. The embodiments of this application do not limit this.
[0302] In this embodiment of the application, the segmented input tensor is subjected to continuous operator operations on the same target computing resource, which enables parallel computing of multiple target computing resources.
[0303] Figure 16 is a schematic diagram illustrating another specific application of operator segmentation provided in this application embodiment. Scenario 2: The segmentable axis appears on multiple input tensors and a single output tensor of a single operator.
[0304] As one possible implementation, the segmentation method in the input tensor of the first operator is determined based on the segmentation information of the first operator and the target number of computing resources K.
[0305] Specifically, taking the addition operator as an example, the addition operator has two first input tensors. The shape of the first first input tensor x is (m,n), and the shape of the second first input tensor y is (m,).
[0306] According to the segmentation information of the addition operator, the type of the segmentable axis 1 is the element axis, which appears on the 0 axis of the first input tensor x and the 0 axis of the first input tensor y, and has a length of m; the type of the segmentable axis 2 is the element axis, which appears on the 1 axis of the first input tensor x, and has a length of n.
[0307] Based on the segmentation information of the first operator, two operator segmentation methods can be determined. The first method is to segment the input tensor that includes a segmentable axis 1 of length m, and the second method is to segment the input tensor that includes a segmentable axis 2 of length n.
[0308] As shown in Figure 16(a), the input tensors that include separable axis 1 of length m are segmented. Separable axis 1 of length m is determined as the target segmentation axis. Based on the position information of separable axis 1 within the input tensors of the addition operator, it is determined that the first input tensor x will be segmented along axis 0, and the first input tensor y will be segmented along axis 0. According to the segmentation method for separable axis 1 as an element axis, the first input tensor x can be equally divided into two second input tensors x0 and x1 along axis 0, and the first input tensor y can be equally divided into two second input tensors y0 and y1 along axis 0. These are then sent to two target computing resources for addition operator operations to obtain the second output tensors. Subsequently, the first output tensor is obtained by calling the concatenation function.
[0309] As shown in Figure 16(b), the input tensor that includes separable axis 2 of length n is segmented. Separable axis 2 of length n is determined as the target segmentation axis. Based on the position information of separable axis 2 within the input tensors of the addition operator, the first input tensor x is determined to be segmented along axis 1. According to the segmentation method for separable axis 2 as an element axis, the first input tensor x can be equally divided into two second input tensors x0' and x1' along axis 1. Since the first input tensor y does not contain separable axis 2, it is sent as shared data to the different target computing resources. Subsequently, the addition operator operation is performed on each target computing resource to obtain the second output tensors. Then, the first output tensor is obtained by calling the concatenation function.
[0310] As one possible approach, each target computing resource can obtain the first input tensor y by addressing, or the first input tensor y can be copied to each target computing resource. This application embodiment does not limit the sharing method of the first input tensor y.
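The two segmentation methods for the addition operator can be compared in a short sketch. It assumes that the addition broadcasts y across axis 1 of x (so that out[i][j] = x[i][j] + y[i]), uses two target computing resources modeled as two sequential calls, and shares y by passing the same object to both calls. The names `Matrix`, `Vector`, `add`, `add_split_axis0`, and `add_split_axis1` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // first input tensor x, shape (m, n)
using Vector = std::vector<int>;               // first input tensor y, shape (m,)

// Addition operator with y broadcast along axis 1: out[i][j] = x[i][j] + y[i].
Matrix add(const Matrix& x, const Vector& y) {
    assert(x.size() == y.size());
    Matrix out(x);
    for (std::size_t i = 0; i < out.size(); ++i) {
        for (int& v : out[i]) v += y[i];
    }
    return out;
}

// Method of Figure 16(a): split both x and y along axis 0 (length m).
Matrix add_split_axis0(const Matrix& x, const Vector& y) {
    assert(x.size() == y.size() && x.size() % 2 == 0);
    const auto half = static_cast<std::ptrdiff_t>(x.size() / 2);
    const Matrix x0(x.begin(), x.begin() + half), x1(x.begin() + half, x.end());
    const Vector y0(y.begin(), y.begin() + half), y1(y.begin() + half, y.end());
    Matrix out = add(x0, y0);         // target computing resource 1
    const Matrix out1 = add(x1, y1);  // target computing resource 2
    out.insert(out.end(), out1.begin(), out1.end());  // concatenate along axis 0
    return out;
}

// Method of Figure 16(b): split only x along axis 1 (length n); y is shared.
Matrix add_split_axis1(const Matrix& x, const Vector& y) {
    Matrix x0, x1;
    for (const auto& row : x) {
        assert(row.size() % 2 == 0);
        const auto half = static_cast<std::ptrdiff_t>(row.size() / 2);
        x0.emplace_back(row.begin(), row.begin() + half);
        x1.emplace_back(row.begin() + half, row.end());
    }
    Matrix out = add(x0, y);         // target computing resource 1, shared y
    const Matrix out1 = add(x1, y);  // target computing resource 2, shared y
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i].insert(out[i].end(), out1[i].begin(), out1[i].end());  // axis 1
    }
    return out;
}
```

Both methods return the same tensor as the unsplit `add`; they differ only in which inputs are divided and which are shared.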
[0311] In the embodiments of this application, a suitable operator segmentation method can be flexibly selected based on the axis type of the segmentable axis and the position information of the segmentable axis on the input tensor and output tensor of the operator included in the segmentation information of the operator.
[0312] Figure 17 is a schematic diagram illustrating another specific application of operator segmentation provided in this application embodiment. Scenario 3: The position of separable axis 1 in the first input tensor of the first operator differs from its position in the first output tensor.
[0313] As shown in Figure 17(a), taking the transpose operator as an example, the graph optimizer obtains the segmentation information of the transpose operator. Based on this segmentation information, it determines that separable axis 1 is an element axis, that separable axis 1 is axis 0 in the first input tensor, and that separable axis 1 is axis 1 in the first output tensor. Based on the forward shape derivation function of the element axis, the shape of the first output tensor (56,8,56,64) can be derived from the shape of the first input tensor (8,56,56,64).
[0314] For the specific segmentation method, as shown in Figure 17(b), there are two target computing resources available for the transpose operator operation. The first output tensor is divided along axis 1 to obtain two second output tensors whose axis 1 length is 4. Then, according to the reverse shape derivation function of the element axis, the axis 0 length of the two second input tensors is determined to be 4. By calling the division function, the first input tensor, whose axis 0 length is 8, is divided along axis 0 to obtain two second input tensors.
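The reverse shape derivation for the transpose example can be sketched as follows. The sketch assumes that an element axis has the same length wherever it appears, so the length of the divided output axis can be copied back to the corresponding input axis; `Shape`, `AxisPosition`, `divide`, and `derive_input_part` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Shape = std::vector<int>;

// Position of one splittable axis in the operator's input and output tensors.
struct AxisPosition {
    std::size_t input_axis;
    std::size_t output_axis;
};

// Shape of each of the `k` equal parts when `shape` is divided along `axis`.
Shape divide(const Shape& shape, std::size_t axis, int k) {
    assert(axis < shape.size() && k > 0 && shape[axis] % k == 0);
    Shape part(shape);
    part[axis] /= k;
    return part;
}

// Reverse shape derivation for an element axis. Assumption: the axis has the
// same length in the input and output, so the output part's length along
// `output_axis` becomes the input part's length along `input_axis`.
Shape derive_input_part(const Shape& input, const Shape& output_part,
                        AxisPosition pos) {
    assert(pos.input_axis < input.size());
    assert(pos.output_axis < output_part.size());
    Shape part(input);
    part[pos.input_axis] = output_part[pos.output_axis];
    return part;
}
```

For the shapes above, dividing the output (56, 8, 56, 64) along axis 1 by two gives (56, 4, 56, 64), and the derived shape of each second input tensor is (4, 56, 56, 64).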
[0315] In this embodiment, the graph optimizer only needs to know the type of the separable axis of the operator's input tensor and the position information of the separable axis in the input and output tensors. Without needing to be based on the specific type of operator, it can appropriately partition the input and output tensors of the operator, thus achieving complete decoupling between operator optimization and graph optimization.
[0316] The above content describes the method for processing computational tasks according to embodiments of this application. The apparatus for processing computing tasks of the embodiments of this application is described below with reference to Figure 19. It should be understood that the apparatus described below can perform the methods of the foregoing embodiments of this application. To avoid unnecessary repetition, repeated descriptions are appropriately omitted when describing the apparatus of the embodiments of this application below.
[0317] Figure 19 is a schematic diagram of a computing task processing device provided in an embodiment of this application. The device 1900 is applied to a graph optimizer and includes a processor 1901 and a transmission interface 1902. Optionally, the device may further include a memory 1903 and a bus 1904.
[0318] The memory 1903, processor 1901, and transmission interface 1902 communicate with each other via bus 1904.
[0319] The memory 1903 can be a ROM, a static storage device, or RAM. The memory 1903 can store programs, and when the program stored in the memory 1903 is executed by the processor 1901, the processor 1901 and the transmission interface 1902 are used to execute the various steps of the method for processing computing tasks of the embodiments of this application.
[0320] For example, processor 1901 is configured to determine a first operator for performing a computational task, the first operator comprising N separable axes, where N is a positive integer greater than or equal to 1;
[0321] The processor 1901 is used to obtain the segmentation information of the first operator from the operator segmentation information library. The segmentation information of the first operator includes the axis type of the nth segmentable axis among N segmentable axes in the first operator and the position information of the nth segmentable axis in the first operator. The position information of the nth segmentable axis in the first operator is used to indicate the position of the nth segmentable axis in the input tensor of the first operator, where n=1,…,N.
[0322] The processor 1901 is used to segment the input tensor of the first operator according to the segmentation information of the first operator, and determine K groups of input tensors, where K is a positive integer greater than or equal to 2.
[0323] The transmission interface 1902 is used to send K sets of input tensors to K target computing resources respectively, so that the K target computing resources can complete the computing tasks.
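The steps performed by the processor 1901 can be sketched as a lookup followed by a division into K groups. This is a minimal sketch under stated assumptions: the operator segmentation information library is modeled as a map keyed by a hypothetical operator name, the first splittable axis found is taken as the target, and when the axis length is not divisible by K the remainder is spread over the first groups (the embodiments above only describe equal division). The names `AxisInfo`, `SegmentationLibrary`, `Range`, `pick_target_axis`, and `split_ranges` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical subset of the axis types used in this illustration.
enum class AxisType { Unsplit, Elementwise, SlidingWindow };

// Axis type plus the position of the axis in the operator's input tensor.
struct AxisInfo {
    AxisType type;
    int input_axis;
};

// Hypothetical model of the operator segmentation information library.
using SegmentationLibrary = std::map<std::string, std::vector<AxisInfo>>;
using Range = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

// Returns the input axis of the first splittable axis of the operator,
// or -1 if the operator is unknown or has no splittable axis.
int pick_target_axis(const SegmentationLibrary& library, const std::string& op_name) {
    const auto it = library.find(op_name);
    if (it == library.end()) return -1;
    for (const AxisInfo& axis : it->second) {
        if (axis.type != AxisType::Unsplit) return axis.input_axis;
    }
    return -1;
}

// Divides an axis of `length` elements into K contiguous groups, K >= 2.
// Assumption: any remainder is distributed over the first groups.
std::vector<Range> split_ranges(std::size_t length, std::size_t k) {
    assert(k >= 2 && k <= length);
    const std::size_t base = length / k;
    const std::size_t extra = length % k;
    std::vector<Range> ranges;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < k; ++i) {
        const std::size_t size = base + (i < extra ? 1u : 0u);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

Each returned range identifies the slice of the target axis that would be sent to one of the K target computing resources.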
[0324] It should be understood that the above description is merely illustrative. The processing computing task apparatus is used to execute the methods or steps mentioned in the foregoing method embodiments. Therefore, the processing computing task apparatus corresponds to the foregoing method embodiments. For details, please refer to the description of the foregoing method embodiments, which will not be repeated here.
[0325] The processor 1901 may be a general-purpose CPU, microprocessor, ASIC, GPU, or one or more integrated circuits, used to execute relevant programs to achieve the functions required by the units in the processing computing task apparatus of this application embodiment, or to execute the processing computing task method of this application embodiment.
[0326] The processor 1901 can also be an integrated circuit chip with signal processing capabilities. In implementation, each step of the processing computation task method of this application embodiment can be completed by the integrated logic circuitry in the processor 1901 or by software instructions.
[0327] The processor 1901 described above can also be a general-purpose processor, DSP, ASIC, FPGA, or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory 1903. The processor 1901 reads the information in memory 1903 and, in conjunction with its hardware, completes the functions required by the units included in the processing computing task apparatus of the embodiments of this application, or executes the processing computing task method of the method embodiments of this application.
[0328] The transmission interface 1902 uses transceiver devices, such as, but not limited to, transceivers, to enable communication between the device 1900 and other devices or communication networks. For example, an image to be processed can be acquired through the transmission interface 1902.
[0329] Bus 1904 may include a pathway for transmitting information between various components of device 1900 (e.g., memory 1903, processor 1901, transmission interface 1902).
[0330] It should be noted that although the above-described device 1900 only shows a memory, processor, and transmission interface, those skilled in the art should understand that in specific implementations, device 1900 may also include other devices necessary for normal operation. Furthermore, depending on specific needs, those skilled in the art should understand that device 1900 may also include hardware devices for implementing other additional functions. Moreover, those skilled in the art should understand that device 1900 may include only the devices necessary for implementing the embodiments of this application, and need not include all the devices shown in Figure 19.
[0331] It should be understood that the processor in the embodiments of this application can be a central processing unit (CPU), or it can be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.
[0332] It should also be understood that the memory in the embodiments of this application can be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of random access memory (RAM) are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate synchronous DRAM (DDR SDRAM), enhanced synchronous DRAM (ESDRAM), synchronous linked DRAM (SLDRAM), and direct rambus RAM (DR RAM).
[0333] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless (e.g., infrared, radio, or microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.
[0334] This application provides a computer-readable storage medium for storing a computer program that, when run on a computer, causes the computer to perform a method for processing computational tasks as described in the foregoing method embodiments.
[0335] This application provides a computer program product, which includes computer program code. When the computer program code is run, it implements the method for processing computing tasks as described in the foregoing method embodiments.
[0336] It should be understood that the term "and/or" in this article merely describes the relationship between related objects, indicating that three relationships can exist. For example, A and/or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. A and B can be singular or plural. Additionally, the character "/" in this article generally indicates an "or" relationship between the preceding and following related objects, but it may also indicate an "and/or" relationship. Please refer to the context for a more accurate understanding.
[0337] In this application, "at least one" means one or more, and "more than one" means two or more. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or multiple items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.
[0338] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0339] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0340] Those skilled in the art will understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above are the same as the corresponding processes in the foregoing method embodiments and are not repeated here.
[0341] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0342] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0343] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0344] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0345] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method of processing a computing task, characterized in that the method is executed by a graph optimizer and includes: determining a first operator for performing the computing task, the first operator comprising N segmentable axes, where N is a positive integer greater than or equal to 1; obtaining segmentation information of the first operator from an operator segmentation information library, wherein the segmentation information of the first operator includes the axis type of the nth segmentable axis among the N segmentable axes in the first operator and first position information, the first position information indicating the position of the nth segmentable axis in the input tensor of the first operator, where n = 1, ..., N; segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain K sets of input tensors, where K is a positive integer greater than or equal to 2; and sending the K sets of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task; wherein the axis type of a segmentable axis is one of the following: element axis, reduction axis, and sliding window axis; an axis along which the elements in the input tensor and the output tensor of an operator have a point-to-point mapping relationship is the element axis; if the input tensor of an operator has a first axis but the output tensor of the operator does not have the first axis, the first axis is the reduction axis; and an axis along which an operator performs a sliding window scan operation on the elements in the input tensor of the operator is the sliding window axis.
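As a non-limiting illustration of claim 1, the following sketch models the operator segmentation information library as a plain dictionary and cuts one input tensor into K parts along a looked-up axis. The names `AxisType`, `AxisInfo`, `SEGMENTATION_LIBRARY`, and `split_for_resources`, as well as the `relu` entry, are hypothetical and do not appear in this application; tensors are represented as 1-D Python lists for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class AxisType(Enum):
    ELEMENT = "element"                # point-to-point mapping between input and output
    REDUCTION = "reduction"            # present in the input, absent from the output
    SLIDING_WINDOW = "sliding_window"  # scanned by a sliding window


@dataclass
class AxisInfo:
    axis_type: AxisType
    # Position of the axis in each input tensor: {input index: dimension index}.
    input_positions: dict


# Hypothetical library content: operator name -> list of segmentable axes.
SEGMENTATION_LIBRARY = {
    "relu": [AxisInfo(AxisType.ELEMENT, {0: 0})],
}


def chunk_sizes(length, k):
    """Cut `length` into k near-equal sizes; the first chunks absorb the remainder."""
    base, extra = divmod(length, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def split_for_resources(op_name, tensor, k):
    """Return K input tensors, one per target computing resource (1-D case only)."""
    info = SEGMENTATION_LIBRARY[op_name][0]
    assert info.input_positions[0] == 0, "this sketch handles dimension 0 only"
    parts, start = [], 0
    for size in chunk_sizes(len(tensor), k):
        parts.append(tensor[start:start + size])
        start += size
    return parts


parts = split_for_resources("relu", [-2, -1, 0, 1, 2], 2)
# Each resource applies the operator to its own part independently.
outputs = [[max(0, v) for v in part] for part in parts]
```

The graph optimizer in this sketch never inspects what `relu` computes; it relies only on the axis type and position recorded in the library entry.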
2. The method of claim 1, wherein, The step of segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain K sets of input tensors includes: Determine the target segmentation axis, which is one of the N segmentable axes; Based on the segmentation information of the first operator, determine the segmentation method corresponding to the axis type of the target segmentation axis in the first operator; Based on the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the input tensor of the first operator is segmented to obtain the K sets of input tensors.
3. The method of claim 2, wherein the step of segmenting the input tensor of the first operator based on the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, to obtain the K sets of input tensors, includes: determining, according to the segmentation method, the Q first input tensors in the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segmenting each of the Q first input tensors based on the axis type of the target segmentation axis in the first operator and the number K of the target computing resources, to obtain Q groups of second input tensors, wherein each of the Q groups of second input tensors includes K second input tensors; and obtaining the K sets of input tensors based on the Q groups of second input tensors and the unsegmented input tensors of the first operator.
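The assembly step of claim 3 can be sketched as follows, under the assumption that tensors are nested Python lists. Inputs that contain the target axis are cut into K pieces, and inputs that do not contain it are passed whole to every resource. The helper names `axis_length`, `split_along`, and `build_input_sets` are hypothetical.

```python
def axis_length(tensor, axis):
    """Length of a nested-list tensor along dimension `axis`."""
    for _ in range(axis):
        tensor = tensor[0]
    return len(tensor)


def split_along(tensor, axis, sizes):
    """Cut a nested-list tensor along `axis` into pieces of the given sizes."""
    if axis == 0:
        pieces, start = [], 0
        for size in sizes:
            pieces.append(tensor[start:start + size])
            start += size
        return pieces
    # Cut every row along the inner axis, then regroup the pieces by index.
    per_row = [split_along(row, axis - 1, sizes) for row in tensor]
    return [[row_pieces[i] for row_pieces in per_row] for i in range(len(sizes))]


def build_input_sets(inputs, axis_positions, k):
    """axis_positions maps an input index to the dimension holding the target axis."""
    lengths = {axis_length(inputs[i], dim) for i, dim in axis_positions.items()}
    assert len(lengths) == 1, "the target axis must have one length in all inputs"
    base, extra = divmod(lengths.pop(), k)
    sizes = [base + (1 if j < extra else 0) for j in range(k)]
    split = {i: split_along(inputs[i], dim, sizes) for i, dim in axis_positions.items()}
    # Set j holds piece j of every segmented input plus each unsegmented input as is.
    return [
        [split[i][j] if i in split else inputs[i] for i in range(len(inputs))]
        for j in range(k)
    ]


x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # contains the target axis at dimension 0 (Q = 1)
bias = [10, 20]                       # does not contain the target axis
input_sets = build_input_sets([x, bias], {0: 0}, k=2)
```

Here `input_sets[0]` pairs the first two rows of `x` with the full `bias`, and `input_sets[1]` pairs the last two rows with the same `bias`.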
4. The method of claim 1, wherein the operators used to perform the computing task further include a second operator, the second operator includes P segmentable axes, and the P segmentable axes are a subset of the N segmentable axes; and the step of segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain K sets of input tensors includes: obtaining segmentation information of the second operator from the operator segmentation information library, wherein the segmentation information of the second operator includes the axis type and second position information of the p-th segmentable axis among the P segmentable axes in the second operator, the second position information indicates the position of the p-th segmentable axis in the input tensor of the second operator, the input tensor of the second operator is the output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, ..., P; determining P pieces of segmentation reference information based on the segmentation information of the first operator and the segmentation information of the second operator, wherein the p-th piece of segmentation reference information includes: the axis type of the p-th segmentable axis in the first operator, the axis type of the p-th segmentable axis in the second operator, and the position of the p-th segmentable axis in the input tensor of the first operator; determining P groups of candidate segmentation methods based on the P pieces of segmentation reference information, wherein the p-th group of candidate segmentation methods includes at least one segmentation method; determining a target segmentation method based on the time required to complete the computing task under each candidate segmentation method in the P groups of candidate segmentation methods; and segmenting the input tensor of the first operator according to the target segmentation method to obtain the K sets of input tensors.
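The selection step of claim 4 amounts to choosing the candidate with the shortest completion time. A minimal sketch follows, assuming the time of each candidate is already available from some estimator or measurement. The `Candidate` record, the method labels, and the millisecond values are made up for illustration and are not taken from this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    axis: str    # which segmentable axis is cut
    method: str  # how tensors are cut along that axis


def select_target_method(candidate_groups, time_of):
    """Pick the candidate with the smallest completion time across all P groups."""
    candidates = [c for group in candidate_groups for c in group]
    return min(candidates, key=time_of)


# P = 2 groups, one per segmentable axis shared by the two operators.
groups = [
    [Candidate("batch", "plain split")],
    [Candidate("height", "overlapping slice"),
     Candidate("height", "split, slice, splice")],
]
# Illustrative numbers only; a real system would estimate or measure these.
estimated_ms = {groups[0][0]: 4.1, groups[1][0]: 3.2, groups[1][1]: 3.6}
target = select_target_method(groups, estimated_ms.__getitem__)
```

Passing the time source in as a function keeps the selection logic independent of how the time is obtained.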
5. The method of claim 4, wherein the step of segmenting the input tensor of the first operator according to the target segmentation method to obtain K sets of input tensors includes: determining, based on the target segmentation method, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors in the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segmenting each of the Q first input tensors based on the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number K of the target computing resources, to obtain Q groups of second input tensors, wherein each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is the result of segmenting the q-th first input tensor of the Q first input tensors into K pieces, where q = 1, ..., Q; and determining the K sets of input tensors based on the Q groups of second input tensors and the unsegmented input tensors of the first operator.
6. The method of claim 3, wherein, when the axis type of the target segmentation axis in the first operator is the element axis or the sliding window axis, the first position information of the target segmentation axis is also used to indicate the position of the target segmentation axis in the output tensor of the first operator, and the step of segmenting each of the Q first input tensors based on the axis type of the target segmentation axis in the first operator and the number K of the target computing resources, to obtain Q groups of second input tensors, includes: determining, based on the first position information of the target segmentation axis, the L first output tensors in the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; using a first input length as the input of the forward shape derivation function of the target segmentation axis to obtain a first output length, wherein the first input length is the length of the target segmentation axis in each of the first input tensors, and the length of the target segmentation axis is equal in all of the first input tensors; segmenting the L first output tensors along the target segmentation axis based on the first output length and the number K of the target computing resources, to obtain L groups of second output tensors, wherein each of the L groups of second output tensors includes K second output tensors; using the K second output lengths corresponding to the target segmentation axis in each of the L groups of second output tensors as inputs of the reverse derivation function of the target segmentation axis, to obtain the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors; and segmenting each of the Q first input tensors along the target segmentation axis based on the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
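The length bookkeeping in claim 6 can be made concrete for a sliding window axis. The sketch below assumes a window of size `kernel` moved with step `stride` and no padding, so the forward derivation is the usual `(in_len - kernel) // stride + 1`; these formulas and all function names are assumptions chosen for illustration. With `kernel = 1` and `stride = 1` the same code degenerates to the element axis case, where input and output lengths coincide.

```python
def forward_shape(in_len, kernel, stride):
    """Forward derivation: output length produced by an input of length in_len."""
    return (in_len - kernel) // stride + 1


def reverse_shape(out_len, kernel, stride):
    """Reverse derivation: input length needed to produce out_len outputs."""
    return (out_len - 1) * stride + kernel


def chunk_sizes(length, k):
    """Cut `length` into k near-equal sizes; the first chunks absorb the remainder."""
    base, extra = divmod(length, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def second_input_lengths(in_len, k, kernel, stride):
    out_len = forward_shape(in_len, kernel, stride)  # first output length
    out_lens = chunk_sizes(out_len, k)               # K second output lengths
    return [reverse_shape(o, kernel, stride) for o in out_lens]  # K second input lengths


sliding = second_input_lengths(in_len=11, k=2, kernel=3, stride=2)
element = second_input_lengths(in_len=10, k=3, kernel=1, stride=1)
```

For the sliding window example the two input lengths sum to more than the original length of 11, which reflects the overlap between neighbouring pieces.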
7. The method of claim 6, wherein, When the target splitting axis in the first operator is of the element axis type, the step of splitting each of the Q first input tensors according to the target splitting axis based on the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors, includes: Based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, the first split function is scheduled to split each of the Q first input tensors according to the target split axis to obtain the Q groups of second input tensors.
8. The method of claim 6, wherein, when the axis type of the target segmentation axis in the first operator is the sliding window axis, the step of segmenting each of the Q first input tensors along the target segmentation axis based on the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors, includes: based on the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, scheduling the first slicing function to perform overlapping slicing on each of the Q first input tensors along the target segmentation axis, to obtain the Q groups of second input tensors.
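Overlapping slicing as recited in claim 8 can be sketched on a 1-D tensor. The example assumes an unpadded sliding window sum as a stand-in operator; `window_sum` and `overlapping_slices` are hypothetical names. Each slice starts where its first output window starts and is long enough to cover its last output window, so adjacent slices share elements.

```python
def window_sum(x, kernel, stride):
    """Stand-in sliding window operator: sum over each window, no padding."""
    return [sum(x[i:i + kernel]) for i in range(0, len(x) - kernel + 1, stride)]


def overlapping_slices(x, out_lens, kernel, stride):
    """Slice x so that slice j yields exactly out_lens[j] consecutive outputs."""
    slices, out_start = [], 0
    for out_len in out_lens:
        start = out_start * stride               # first input element of this piece
        in_len = (out_len - 1) * stride + kernel  # input length this piece needs
        slices.append(x[start:start + in_len])
        out_start += out_len
    return slices


x = list(range(11))
full = window_sum(x, kernel=3, stride=2)
slices = overlapping_slices(x, out_lens=[3, 2], kernel=3, stride=2)
# Each resource runs the operator on its own slice; outputs are concatenated.
merged = [v for s in slices for v in window_sum(s, kernel=3, stride=2)]
```

Element 6 appears in both slices, which is the overlap that lets each resource compute its windows without communicating with its neighbour.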
9. The method of claim 6, wherein, when the axis type of the target segmentation axis in the first operator is the sliding window axis, the step of segmenting each of the Q first input tensors along the target segmentation axis based on the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors, includes: scheduling the second segmentation function to segment each of the Q first input tensors along the target segmentation axis, to obtain Q groups of third input tensors, wherein each of the Q groups of third input tensors includes K third input tensors; based on the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, scheduling the second slicing function to slice the K third input tensors in each of the Q groups of third input tensors along the target segmentation axis, to obtain Q groups of fourth input tensors; and scheduling the splicing function to splice the k-th fourth input tensor in the q-th group of fourth input tensors with the k-th third input tensor in the q-th group of third input tensors along the target segmentation axis, to obtain the Q groups of second input tensors.
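The split, slice, and splice sequence of claim 9 can be illustrated as follows. This is a simplified sketch that assumes a stride of 1 and takes the boundary region to be the first `kernel - 1` elements of the next piece, rather than deriving it from the K second input lengths as the claim does. The names `split_even` and `split_slice_splice` are hypothetical, and every piece is assumed to hold at least `kernel - 1` elements.

```python
def window_sum(x, kernel):
    """Stand-in sliding window operator with stride 1 and no padding."""
    return [sum(x[i:i + kernel]) for i in range(len(x) - kernel + 1)]


def split_even(x, k):
    """Non-overlapping split into k near-equal pieces."""
    base, extra = divmod(len(x), k)
    pieces, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        pieces.append(x[start:start + size])
        start += size
    return pieces


def split_slice_splice(x, k, kernel):
    thirds = split_even(x, k)  # "third" tensors: plain non-overlapping pieces
    halo = kernel - 1
    seconds = []
    for i, third in enumerate(thirds):
        # "Fourth" tensor: boundary elements sliced from the next piece
        # (empty for the last piece, which has no right-hand neighbour).
        fourth = thirds[i + 1][:halo] if i + 1 < k else []
        seconds.append(third + fourth)  # splice along the target axis
    return seconds


x = list(range(10))
seconds = split_slice_splice(x, k=2, kernel=3)
merged = [v for s in seconds for v in window_sum(s, kernel=3)]
```

Compared with overlapping slicing of the whole tensor, this variant moves only the small boundary region between neighbouring pieces after an ordinary split.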
10. The method of claim 3, wherein, When the target splitting axis in the first operator is the reduction axis, according to the axis type of the target splitting axis in the first operator and the number K of the target computing resources, each of the Q first input tensors is split to obtain Q groups of second input tensors, including: Based on the number K of the target computing resources, the third segmentation function is called to segment each of the Q first input tensors to obtain Q groups of second input tensors.
11. The method of claim 10, wherein, The reduction axes include a first type of reduction axis and a second type of reduction axis. The first type of reduction axis is the reduction axis in which the operator performs a reduction operation on the elements in the input tensor of the operator, and the second type of reduction axis is the reduction axis in which the operator does not perform a reduction operation on the elements in the input tensor of the operator.
12. The method of claim 11, wherein, The first type of reduction axis includes any one of the following: reduction sum axis, reduction maximum value axis, reduction minimum value axis, and reduction average value axis; Wherein, the reduction sum axis is the reduction axis of the operator performing a summation and reduction operation on the elements in the input tensor of the operator; The reduction maximum axis is the reduction axis of the operator performing a maximum reduction operation on the elements in the input tensor of the operator; The reduction minimum axis is the reduction axis of the operator performing a minimum reduction operation on the elements in the input tensor of the operator; The reduction average axis is the reduction axis of the operator performing an average reduction operation on the elements in the input tensor of the operator.
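For the first type of reduction axis in claim 12, each resource can reduce its own piece and the partial results can then be combined. The combination step below is an assumption that follows from the arithmetic of each reduction and is not recited in the claim. In particular, partial averages must be weighted by piece size, because a plain average of averages is wrong when the pieces differ in length. The names `partial` and `combine` are hypothetical.

```python
def partial(kind, chunk):
    """Reduction performed by one resource on its own piece."""
    if kind == "sum":
        return sum(chunk)
    if kind == "max":
        return max(chunk)
    if kind == "min":
        return min(chunk)
    if kind == "mean":
        return sum(chunk) / len(chunk)
    raise ValueError(kind)


def combine(kind, partials, counts):
    """Merge the K partial results into the result of the unsegmented operator."""
    if kind == "sum":
        return sum(partials)
    if kind == "max":
        return max(partials)
    if kind == "min":
        return min(partials)
    if kind == "mean":
        # Weight each partial mean by the number of elements it covers.
        return sum(p * n for p, n in zip(partials, counts)) / sum(counts)
    raise ValueError(kind)


x = [3, 3, 3, 9, 12]
chunks = [x[:3], x[3:]]  # K = 2 pieces along the reduction axis
counts = [len(c) for c in chunks]
results = {
    kind: combine(kind, [partial(kind, c) for c in chunks], counts)
    for kind in ("sum", "max", "min", "mean")
}
```

In this example the two partial means are 3.0 and 10.5; their unweighted average would be 6.75, whereas the weighted combination gives the true mean of 6.0.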
13. The method of claim 11, wherein the second type of reduction axis includes a reduction acquisition axis, which is the axis along which the operator indexes data among the elements of the operator's input tensor according to the addresses indicated by the elements of the operator's index input tensor.
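The axis described in claim 13 reads like the indexed axis of a gather-style lookup. One possible way to segment such an axis, shown here purely as an assumption and not as the method of this application, is to give each resource a contiguous shard of the data, let it answer only the indices that fall inside its shard, emit zero elsewhere, and sum the K partial outputs. The names `gather` and `sharded_gather` are hypothetical.

```python
def gather(data, indices):
    """Unsegmented lookup: read data at each address in the index tensor."""
    return [data[i] for i in indices]


def sharded_gather(data, indices, k):
    base, extra = divmod(len(data), k)
    partials, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        shard = data[start:start + size]
        # A resource answers only the indices inside its own shard.
        partials.append(
            [shard[i - start] if start <= i < start + size else 0 for i in indices]
        )
        start += size
    # Exactly one resource contributes a value at each output position.
    return [sum(column) for column in zip(*partials)]


data = [10, 20, 30, 40, 50]
indices = [4, 0, 2, 2]
result = sharded_gather(data, indices, k=2)
```

The index tensor is sent whole to every resource in this sketch; only the data tensor is cut along the indexed axis.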
14. The method according to any one of claims 1 to 13, characterized in that, The target computing resources include one of the following types: Graphics processing unit (GPU), central processing unit (CPU), die, or chip.
15. An apparatus for processing a computing task, characterized in that the apparatus is used in a graph optimizer and includes a processor and a transmission interface, wherein: the processor is configured to determine a first operator for performing the computing task, the first operator comprising N segmentable axes, where N is a positive integer greater than or equal to 1; the processor is further configured to obtain segmentation information of the first operator from an operator segmentation information library, wherein the segmentation information of the first operator includes the axis type of the nth segmentable axis among the N segmentable axes in the first operator and first position information, the first position information indicating the position of the nth segmentable axis in the input tensor of the first operator, where n = 1, ..., N; the processor is further configured to segment the input tensor of the first operator according to the segmentation information of the first operator to obtain K sets of input tensors, where K is a positive integer greater than or equal to 2; the transmission interface is configured to send the K sets of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task; the axis type of a segmentable axis is one of the following: element axis, reduction axis, and sliding window axis; an axis along which the elements in the input tensor and the output tensor of an operator have a point-to-point mapping relationship is the element axis; if the input tensor of an operator has a first axis but the output tensor of the operator does not have the first axis, the first axis is the reduction axis; and an axis along which an operator performs a sliding window scan operation on the elements in the input tensor of the operator is the sliding window axis.
16. The apparatus of claim 15, wherein, The processor is configured to segment the input tensor of the first operator according to the segmentation information of the first operator, and obtain K groups of input tensors, including: The processor is used for: Determine the target segmentation axis, which is one of the N segmentable axes; Based on the segmentation information of the first operator, determine the segmentation method corresponding to the axis type of the target segmentation axis in the first operator; Based on the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the input tensor of the first operator is segmented to obtain the K sets of input tensors.
17. The apparatus of claim 16, wherein, The processor is specifically used for, Based on the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, determine the axis type in the first operator, the Q first input tensors in the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; Based on the axis type of the target splitting axis in the first operator and the number K of the target computing resources, each of the Q first input tensors is split to obtain Q groups of second input tensors, wherein each group of the Q groups of second input tensors includes K second input tensors; The K sets of input tensors are obtained based on the Q-group second input tensor and the input tensor of the undivided first operator.
18. The apparatus of claim 15, wherein, If the operator used to perform the computational task further includes a second operator, the second operator includes P separable axes, wherein the P separable axes are a subset of the N separable axes. The processor is specifically used for: The segmentation information of the second operator is obtained from the operator segmentation information database. The segmentation information of the second operator includes the axis type and second position information of the p-th segmentable axis among the P segmentable axes in the second operator. The second position information is used to indicate the position of the p-th segmentable axis in the input tensor of the second operator. The input tensor of the second operator is the output tensor of the first operator. P is a positive integer greater than or equal to 1 and less than or equal to N, p = 1, ..., P. Based on the segmentation information of the first operator and the segmentation information of the second operator, P segmentation reference information is determined. The p-th segmentation reference information among the P segmentation reference information includes: the axis type of the p-th segmentable axis in the first operator, the axis type of the p-th segmentable axis in the second operator, and the position of the p-th segmentable axis in the input tensor of the first operator. Based on the P segmentation reference information, P groups of candidate segmentation methods are determined, wherein the p-th group of candidate segmentation methods in the P groups of candidate segmentation methods includes at least one segmentation method; Based on the time required to complete the computation task for each of the candidate segmentation methods in group P, determine the target segmentation method; According to the target segmentation method, the input tensor of the first operator is segmented to obtain K sets of input tensors.
19. The apparatus of claim 18, wherein the processor is specifically configured to: determine, based on the target segmentation method, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors in the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segment each of the Q first input tensors based on the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number K of the target computing resources, to obtain Q groups of second input tensors, wherein each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is the result of segmenting the q-th first input tensor of the Q first input tensors into K pieces, where q = 1, ..., Q; and obtain the K sets of input tensors based on the Q groups of second input tensors and the unsegmented input tensors of the first operator.
20. The apparatus of claim 17, wherein, when the axis type of the target segmentation axis in the first operator is the element axis or the sliding window axis, the first position information of the target segmentation axis is also used to indicate the position of the target segmentation axis in the output tensor of the first operator, and the processor is specifically configured to: determine, based on the first position information of the target segmentation axis, the L first output tensors in the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use a first input length as the input of the forward shape derivation function of the target segmentation axis to obtain a first output length, wherein the first input length is the length of the target segmentation axis in each of the first input tensors, and the length of the target segmentation axis is equal in all of the first input tensors; segment the L first output tensors along the target segmentation axis based on the first output length and the number K of the target computing resources, to obtain L groups of second output tensors, wherein each of the L groups of second output tensors includes K second output tensors, and the l-th group of second output tensors is the result of segmenting the l-th first output tensor of the L first output tensors into K pieces; use the K second output lengths corresponding to the target segmentation axis in each of the L groups of second output tensors as inputs of the reverse derivation function of the target segmentation axis, to obtain the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, wherein the lengths corresponding to the target segmentation axis in the k-th second output tensor of each of the L groups of second output tensors are equal, and the lengths corresponding to the target segmentation axis in the k-th second input tensor of each of the Q groups of second input tensors are also equal; and segment each of the Q first input tensors along the target segmentation axis based on the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
21. The apparatus of claim 20, wherein, When the target segmentation axis is of the element axis type in the first operator, the processor is specifically used for: Based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, the first split function is scheduled to split each of the Q first input tensors according to the target split axis to obtain the Q groups of second input tensors.
22. The apparatus of claim 20, wherein, When the target segmentation axis in the first operator is of the type of the sliding window axis, the processor is specifically used for: Based on the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, the first slicing function is scheduled to perform overlapping slicing on each of the Q groups of first input tensors according to the target split axis to obtain the Q groups of second input tensors.
23. The apparatus of claim 20, wherein, When the target segmentation axis in the first operator is of the type of the sliding window axis, the processor is specifically used for: By scheduling the second segmentation function, each of the Q first input tensors is segmented according to the target segmentation axis to obtain Q groups of third input tensors, wherein the Q groups of third input tensors include K third input tensors; Based on the K second input lengths corresponding to the target split axis in each group of the second input tensors of the Q-group, the second slicing function is used to slice the K third input tensors in each group of the third input tensors of the Q-group according to the target split axis to obtain the fourth input tensor of the Q-group. By scheduling the splicing function, the kth fourth input tensor in the qth group of the fourth input tensor of the Q group and the kth third input tensor in the qth group of the third input tensor of the Q group are spliced together according to the target splitting axis to obtain the second input tensor of the Q group.
24. The apparatus of claim 17, wherein, when the axis type of the target segmentation axis in the first operator is the reduction axis, the processor is specifically configured to: based on the number K of the target computing resources, call the third segmentation function to segment each of the Q first input tensors, to obtain Q groups of second input tensors.
25. The apparatus of claim 24, wherein, The reduction axes include a first type of reduction axis and a second type of reduction axis. The first type of reduction axis is the reduction axis in which the operator performs a reduction operation on the elements in the input tensor of the operator, and the second type of reduction axis is the reduction axis in which the operator does not perform a reduction operation on the elements in the input tensor of the operator.
26. The apparatus of claim 25, wherein, The first type of reduction axis includes any one of the following: reduction sum axis, reduction maximum value axis, reduction minimum value axis, and reduction average value axis; Wherein, the reduction sum axis is the reduction axis of the operator performing a summation and reduction operation on the elements in the input tensor of the operator; The reduction maximum axis is the reduction axis of the operator performing a maximum reduction operation on the elements in the input tensor of the operator; The reduction minimum axis is the reduction axis of the operator performing a minimum reduction operation on the elements in the input tensor of the operator; The reduction average axis is the reduction axis of the operator performing an average reduction operation on the elements in the input tensor of the operator.
27. The apparatus of claim 26, wherein the second type of reduction axis includes a reduction acquisition axis, which is the axis along which the operator indexes data among the elements of the operator's input tensor according to the addresses indicated by the elements of the operator's index input tensor.
28. The apparatus of any one of claims 15 to 27, wherein, The target computing resources include one of the following types: Graphics processing unit (GPU), central processing unit (CPU), die, or chip.
29. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and the program code includes instructions for performing the method of any one of claims 1 to 14.
Citation Information
Patent Citations
Tensor processing method and processing system based on parallel branches and tensor segmentation
CN113485837A
Parallel computing scheme generation for neural networks
WO2021190761A1