Information processing method and device, equipment and storage medium

By flexibly allocating tensor parallel resources for the pre-filling and decoding stages of artificial intelligence models, the problem of low device resource utilization efficiency is solved, achieving efficient resource utilization and improved computing efficiency.

CN122309113A · Pending · Publication Date: 2026-06-30 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2024-12-31
Publication Date
2026-06-30

Figure CN122309113A_ABST

Abstract

This application discloses an information processing method, apparatus, device, and storage medium, belonging to the field of artificial intelligence technology. The method includes: acquiring first input information of a prediction model, the prediction model including a first network and a second network cascaded sequentially; invoking the first network to perform calculations on the first input information using resources in a first tensor parallel domain to obtain first feature information, the first tensor parallel domain including a first number of tensor parallel resources provided by the computer device; invoking the second network to perform calculations on the first feature information using resources in a second tensor parallel domain to obtain first output information; the second tensor parallel domain including a second number of tensor parallel resources provided by the computer device; wherein there is an overlap between the first tensor parallel domain and the second tensor parallel domain, and the first number and the second number are different.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to an information processing method, apparatus, device and storage medium.

Background Technology

[0002] With the development of artificial intelligence technology, the scale of parameters in artificial intelligence models is increasing day by day.

[0003] In related technologies, in order to improve the processing efficiency of artificial intelligence models, the prefill and decoding stages of the model are separated and deployed on different devices, such as on different graphics processing units (GPUs) or neural processing units (NPUs), to avoid mutual interference during computation or data transmission.

[0004] However, when the model pre-filling and decoding stages are deployed on the same device, how to improve the resource utilization efficiency of the device is an urgent problem to be solved.

Summary of the Invention

[0005] This application provides an information processing method, apparatus, device, and storage medium to improve the resource utilization rate of the device. The technical solution is as follows:

[0006] According to one aspect of this application, an information processing method is provided, the method being executed by a computer device, the method comprising: acquiring first input information of a prediction model, the prediction model including a first network and a second network cascaded together; invoking the first network to perform calculations on the first input information using resources in a first tensor parallel domain to obtain first feature information, the first tensor parallel domain including a first number of tensor parallel resources provided by the computer device, the first number being positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the first network to perform calculations; invoking the second network to perform calculations on the first feature information using resources in a second tensor parallel domain to obtain first output information; the second tensor parallel domain including a second number of tensor parallel resources provided by the computer device, the second number being positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the second network to perform calculations; wherein there is an overlap between the first tensor parallel domain and the second tensor parallel domain, and the first number and the second number are different.
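The two conditions at the end of [0006] can be illustrated with a minimal sketch. It assumes that each tensor parallel resource is identified by an integer device index; the `TPDomain` type, the `satisfies_claim` helper, and the eight-device layout are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TPDomain:
    """A tensor parallel domain: a named set of device indices (hypothetical type)."""
    name: str
    devices: frozenset

    @property
    def size(self) -> int:
        return len(self.devices)


def satisfies_claim(first: TPDomain, second: TPDomain) -> bool:
    """Check the two conditions stated in [0006]: the domains overlap,
    and they contain different numbers of tensor parallel resources."""
    overlap = first.devices & second.devices
    return bool(overlap) and first.size != second.size


# Illustrative layout: 8 devices for the first network, 2 of them for the second.
first_domain = TPDomain("first", frozenset(range(8)))
second_domain = TPDomain("second", frozenset({0, 1}))
```

With this layout, `satisfies_claim(first_domain, second_domain)` holds because devices 0 and 1 are shared and the sizes differ.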

[0007] This application invokes the first network and the second network with different numbers of tensor parallel resources, according to the resources each network requires for computation, and thus allocates tensor parallel resources to the two networks flexibly. Compared with providing the same tensor parallel resources to every network in the prediction model, this meets each network's computational needs while avoiding the waste caused by providing unnecessary tensor parallel resources, thereby ensuring efficient utilization of tensor parallel resources.

[0008] In one optional embodiment of this application, the second quantity is less than the first quantity; after calling the first network and using resources in the first tensor parallel domain to perform calculations on the first input information to obtain the first feature information, the method further includes: obtaining the second input information, and calling the prediction model and using resources in the third tensor parallel domain to perform calculations on the second input information to obtain the second output information; wherein, there is an overlap between the third tensor parallel domain and the first tensor parallel domain, and the intersection between the third tensor parallel domain and the second tensor parallel domain is empty.

[0009] This application improves the utilization rate of tensor parallel resources in the first tensor parallel domain by utilizing the idle resources in the third tensor parallel domain to perform calculations on the second input information during the calculation of the second network, avoids the waste caused by idle tensor parallel resources, and improves the resource utilization rate of computer equipment.
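One way to read [0008] and [0009] is that the third tensor parallel domain is drawn from the resources of the first domain that the smaller second domain leaves idle. The sketch below takes the set difference as that idle pool; this choice, the function name `idle_third_domain`, and the device indices are assumptions for illustration only.

```python
def idle_third_domain(first: set, second: set) -> set:
    """Resources of the first domain left idle while the second network runs.

    Assumes the second domain is the smaller one ([0008]). The result overlaps
    the first domain and has an empty intersection with the second domain.
    """
    if len(second) >= len(first):
        raise ValueError("expected the second domain to be smaller than the first")
    return first - second


first = set(range(8))   # illustrative: 8 devices used for the first network
second = {0, 1}         # illustrative: 2 devices used for the second network
third = idle_third_domain(first, second)
```

Here `third` contains devices 2 through 7, which can serve the second input information while devices 0 and 1 run the second network.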

[0010] In an optional embodiment of this application, after calling the second network to perform calculations on the first feature information using resources in the second tensor parallel domain to obtain first output information, the method further includes: obtaining third input information; calling the first network to perform calculations on the third input information using resources in the first tensor parallel domain to obtain third feature information; and calling the second network to perform calculations on the third feature information using resources in the second tensor parallel domain to obtain third output information.

[0011] This application achieves the function of repeatedly calling the first and second networks using different amounts of tensor parallel resources by obtaining the third input information after obtaining the first output information. This avoids the waste of resources caused by providing the network with extra tensor parallel resources on the basis of meeting the network's computing needs, and ensures the utilization efficiency of tensor parallel resources.

[0012] In one optional embodiment of this application, invoking the first network to perform calculations on the third input information using resources in the first tensor parallel domain to obtain the third feature information includes: when the computation of the first network is being performed using first tensor parallel resources in the third tensor parallel domain, invoking the first network to perform calculations on the third input information and the second input information in the same batch, using the second tensor parallel domain and the first tensor parallel resources, to obtain the third feature information; and / or, when the computation of the second network is being performed using second tensor parallel resources in the third tensor parallel domain, pausing the computation of the second network and invoking the first network to perform calculations on the third input information, using the second tensor parallel domain and the second tensor parallel resources, to obtain the third feature information.

[0013] By performing calculations on the third input information and the second input information in the same batch, and / or pausing the computation of the second network, this application specifies how resources are used once the third input information is acquired, covering the different computational states of the tensor parallel domains. The resources in the first tensor parallel domain can therefore be used to perform calculations on the third input information, which provides a basis for repeatedly invoking the first and second networks with different numbers of tensor parallel resources.
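The two branches of [0012] amount to a small scheduling decision. The sketch below is one possible reading, assuming the scheduler knows which network the third-domain resources are currently running for the second input information; the `Phase` enum, the `plan_third_input` function, and the keys of the returned dictionary are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    FIRST_NETWORK = "first"    # e.g. prefill
    SECOND_NETWORK = "second"  # e.g. decoding


def plan_third_input(third_domain_phase: Phase) -> dict:
    """Decide how to free the full first domain for a newly arrived input.

    third_domain_phase is what the third-domain resources are currently
    running for the second input. The returned keys are hypothetical.
    """
    if third_domain_phase is Phase.FIRST_NETWORK:
        # Same network on both sides: merge the two inputs into one batch.
        return {"pause_second_network": False, "batch": ["third_input", "second_input"]}
    # Third-domain resources are busy with the second network: pause it,
    # then run the first network on the third input alone.
    return {"pause_second_network": True, "batch": ["third_input"]}
```

In both branches the first network ends up running on the second tensor parallel domain plus the third-domain resources, which together stand in for the first tensor parallel domain.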

[0014] In one optional embodiment of this application, invoking the second network to perform calculations on the third feature information using resources in the second tensor parallel domain to obtain the third output information includes: when the computation of the second network is being performed using the second tensor parallel resources, invoking the second network to perform calculations on the third feature information and the second feature information in the same batch, using the second tensor parallel domain and the second tensor parallel resources, to obtain the third output information.

[0015] By performing calculations on the third feature information and the second feature information in the same batch, this application uses the resources in the second tensor parallel domain to perform calculations on the third feature information. This provides a basis for repeatedly invoking the first network and the second network with different numbers of tensor parallel resources.

[0016] In one optional embodiment of this application, the first quantity is less than the second quantity; the method further includes: obtaining second input information; invoking the prediction model, using resources in the third tensor parallel domain to perform calculations on the second input information, and obtaining second output information; wherein, there is an overlap between the third tensor parallel domain and the second tensor parallel domain, and the intersection between the third tensor parallel domain and the first tensor parallel domain is empty.

[0017] This application improves the utilization rate of tensor parallel resources in the second tensor parallel domain by utilizing the idle resources in the third tensor parallel domain to perform calculations on the second input information during the calculation of the first network, avoids the waste caused by idle tensor parallel resources, and improves the resource utilization rate of computer equipment.

[0018] In one optional embodiment of this application, invoking the second network to perform calculations on the first feature information using resources in the second tensor parallel domain to obtain the first output information includes: when the computation of the first network is being performed using third tensor parallel resources in the third tensor parallel domain, pausing the computation of the first network and invoking the second network to perform calculations on the first feature information, using the first tensor parallel domain and the third tensor parallel resources, to obtain the first output information; and / or, when the computation of the second network is being performed using fourth tensor parallel resources in the third tensor parallel domain, invoking the second network to perform calculations, in the same batch, on the first feature information and the second feature information corresponding to the second input information, using the first tensor parallel domain and the fourth tensor parallel resources, to obtain the first output information.

[0019] This application provides a resource utilization method after obtaining the first feature information for all situations in the third tensor parallel domain, by performing calculations on the first feature information and the second feature information in the same batch and / or pausing the calculation of the first network. This enables the use of resources in the second tensor parallel domain to perform calculations on the first feature information.

[0020] In an optional embodiment of this application, the step of invoking the prediction model and utilizing resources in the third tensor parallel domain to perform calculations on the second input information to obtain second output information includes: invoking the first network and utilizing resources in the third tensor parallel domain to perform calculations on the second input information to obtain second feature information; and invoking the second network and utilizing resources in the third tensor parallel domain to perform calculations on the second feature information to obtain second output information. This application provides a specific implementation method for performing calculations on the second input information by sequentially invoking the first and second networks after obtaining the second input information.

[0021] In one optional embodiment of this application, the first network includes a prefill network, and the second network includes a decoding network. By implementing the first network as a prefill network and the second network as a decoding network, this application provides a specific implementation of invoking the prediction model, for example invoking a large language model to perform computation.

[0022] In an optional embodiment of this application, the second quantity is less than the first quantity; the method further includes: when the load information of calling the second network to perform calculations on the first feature information exceeds a first load threshold, adding at least one tensor parallel resource to a waiting queue, wherein the at least one tensor parallel resource belongs to the first tensor parallel domain and does not belong to the second tensor parallel domain; wherein the at least one tensor parallel resource in the waiting queue is used to be added to the second tensor parallel domain when the first output information is obtained.

[0023] This application expands the capacity of the second tensor parallel domain when the resources provided by the second tensor parallel domain for the second network are under high load by adding at least one tensor parallel resource to the waiting queue. This ensures that the resources provided by the second tensor parallel domain can meet the needs of the second network and guarantee the computational efficiency of the second network.
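The waiting-queue mechanism of [0022] and [0023] can be sketched as a two-step process: resources are queued when the load exceeds the threshold, and they join the second tensor parallel domain only when the first output information is obtained. The `DomainScaler` class, its method names, and the representation of load as a single float are assumptions made for illustration.

```python
from collections import deque


class DomainScaler:
    """Queues resources for a deferred expansion of the second domain (sketch)."""

    def __init__(self, first: set, second: set, load_threshold: float):
        self.first = set(first)
        self.second = set(second)
        self.load_threshold = load_threshold
        self.waiting = deque()

    def observe_load(self, load: float, count: int = 1) -> None:
        """If load exceeds the threshold, queue up to `count` resources that
        belong to the first domain but not to the second."""
        if load <= self.load_threshold:
            return
        candidates = sorted(self.first - self.second - set(self.waiting))
        self.waiting.extend(candidates[:count])

    def on_first_output(self) -> None:
        """Once the first output information is available, move queued
        resources into the second domain."""
        while self.waiting:
            self.second.add(self.waiting.popleft())
```

Deferring the move until the output is obtained means the second domain is not resized in the middle of a computation.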

[0024] In an optional embodiment of this application, the first quantity is less than the second quantity; the method further includes: when the load information of calling the first network to perform calculations on the first input information exceeds a second load threshold, adding at least one tensor parallel resource to a waiting queue, wherein the at least one tensor parallel resource belongs to the second tensor parallel domain and does not belong to the first tensor parallel domain; wherein the at least one tensor parallel resource in the waiting queue is used to be added to the first tensor parallel domain when the first feature information is obtained.

[0025] This application expands the capacity of the first tensor parallel domain when the resources provided by the first tensor parallel domain for the first network are under high load by adding at least one tensor parallel resource to the waiting queue. This ensures that the resources provided by the first tensor parallel domain can meet the needs of the first network and guarantee the computational efficiency of the first network.

[0026] In an optional embodiment of this application, the prediction model includes a third number of attention heads based on an attention mechanism, and the method further includes: determining the first number of tensor parallel resources in the first tensor parallel domain based on the third number of attention heads.

[0027] This application ensures that each tensor parallel resource provides resources for the same number of attention heads, or that each attention head call utilizes resources provided by the same number of tensor parallel resources, by determining the first quantity as a multiple or factor of the third quantity; thus guaranteeing the load balancing of resources provided by tensor parallel resources when calling the first network.
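The load-balancing argument in [0027] rests on simple divisibility, which the following sketch makes explicit. The function name `head_assignment` and the example head counts are hypothetical.

```python
def head_assignment(num_heads: int, tp_size: int) -> str:
    """Describe how attention heads map onto tensor parallel resources.

    If tp_size is a factor of num_heads, every resource serves the same
    number of heads; if it is a multiple, every head is served by the same
    number of resources. Any other size would leave the load unbalanced.
    """
    if num_heads % tp_size == 0:
        return f"{num_heads // tp_size} head(s) per resource"
    if tp_size % num_heads == 0:
        return f"{tp_size // num_heads} resource(s) per head"
    raise ValueError("tp_size is neither a factor nor a multiple of num_heads")
```

For example, 32 heads on 8 resources gives 4 heads per resource, while 6 heads on 4 resources is rejected because some resources would have to carry more heads than others.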

[0028] In an optional embodiment of this application, determining the first quantity of tensor parallel resources in the first tensor parallel domain based on the third quantity of attention heads includes: determining at least two candidate resource quantities for the first tensor parallel domain based on the third quantity, wherein each of the at least two candidate resource quantities is a factor or multiple of the third quantity, and each candidate resource quantity is a factor of the total number of tensor parallel resources provided by the computer device; obtaining a first latency threshold for the first network, and the device computing resources and model computing resources required by the first network corresponding to the at least two candidate resource quantities respectively; determining the batch processing quantity corresponding to the at least two candidate resource quantities based on the first latency threshold, and the device computing resources and model computing resources corresponding to the at least two candidate resource quantities respectively; determining the first candidate resource quantity among the at least two candidate resource quantities as the first quantity, wherein the first batch processing quantity corresponding to the first candidate resource quantity is the maximum value among the batch processing quantities corresponding to the at least two candidate resource quantities respectively.

[0029] By selecting as the first quantity the first candidate resource quantity, whose corresponding batch processing quantity is the largest among those of the at least two candidate resource quantities, this application maximizes the batch size when invoking the first network and thereby guarantees the computational efficiency of the first network.
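The selection described in [0028] and [0029] can be sketched as follows. The application does not give a formula for the batch processing quantity, so the sketch assumes a simple model in which compute time equals batch size times the compute required per request divided by the compute the devices provide. The functions `max_batch` and `pick_tp_size` and all numbers in the example are hypothetical.

```python
import math
from typing import Callable, Sequence


def max_batch(latency_s: float, device_flops: float, flops_per_request: float) -> int:
    """Largest batch whose estimated compute time fits the latency threshold,
    assuming time = batch * flops_per_request / device_flops."""
    return math.floor(latency_s * device_flops / flops_per_request)


def pick_tp_size(
    candidates: Sequence[int],
    latency_s: float,
    device_flops: Callable[[int], float],
    flops_per_request: Callable[[int], float],
) -> int:
    """Return the candidate size with the largest feasible batch."""
    return max(
        candidates,
        key=lambda n: max_batch(latency_s, device_flops(n), flops_per_request(n)),
    )


# Made-up per-candidate figures: compute provided by n devices, and compute
# one request needs when the first network is split n ways.
provided = {2: 150e12, 4: 280e12, 8: 400e12}
required = {2: 1.0e12, 4: 1.1e12, 8: 3.5e12}
best = pick_tp_size([2, 4, 8], 0.5, provided.__getitem__, required.__getitem__)
```

With these made-up figures the candidate 4 is selected, since its feasible batch is larger than that of 2 or 8. The candidate list itself is assumed to have been filtered beforehand so that each entry is a factor or multiple of the head count and a factor of the total number of resources.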

[0030] In an alternative embodiment of this application, the prediction model includes a third number of attention heads based on an attention mechanism, and the method further includes: determining a second number of tensor parallel resources in the second tensor parallel domain based on the third number of attention heads.

[0031] This application ensures that each tensor parallel resource provides resources for the same number of attention heads, or that each attention head call utilizes resources provided by the same number of tensor parallel resources, by determining the second quantity as a multiple or factor of the third quantity; thus guaranteeing the load balancing of resources provided by tensor parallel resources when calling the second network.

[0032] In an optional embodiment of this application, determining the second quantity of tensor parallel resources in the second tensor parallel domain based on the third quantity of attention heads includes: determining at least two candidate quantities of resources in the second tensor parallel domain based on the third quantity, wherein each of the at least two candidate quantities of resources is a factor or multiple of the third quantity, and each candidate quantity of resources is a factor of the total number of tensor parallel resources provided by the computer device; obtaining a second latency threshold of the second network, and the device transmission resources and model transmission resources required by the second network corresponding to the at least two candidate quantities of resources respectively; determining the batch processing quantity corresponding to the at least two candidate quantities of resources based on the second latency threshold, and the device transmission resources and model transmission resources corresponding to the at least two candidate quantities of resources respectively; determining the second candidate quantity of resources among the at least two candidate quantities of resources as the second quantity, wherein the second batch processing quantity corresponding to the second candidate quantity of resources is the maximum value among the batch processing quantities corresponding to the at least two candidate quantities of resources.

[0033] By selecting as the second quantity the second candidate resource quantity, whose corresponding batch processing quantity is the largest among those of the at least two candidate resource quantities, this application maximizes the batch size when invoking the second network and thereby guarantees the computational efficiency of the second network.

[0034] According to another aspect of this application, an information processing apparatus is provided, the apparatus comprising:

[0035] An acquisition module, configured to acquire first input information of a prediction model, the prediction model including a first network and a second network cascaded together; a processing module, configured to invoke the first network and use resources in a first tensor parallel domain to perform calculations on the first input information to obtain first feature information, the first tensor parallel domain including a first number of tensor parallel resources provided by the computer device, the first number being positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the first network to perform calculations; the processing module being further configured to invoke the second network and use resources in a second tensor parallel domain to perform calculations on the first feature information to obtain first output information; the second tensor parallel domain including a second number of tensor parallel resources provided by the computer device, the second number being positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the second network to perform calculations; wherein, there is an overlap between the first tensor parallel domain and the second tensor parallel domain, and the first number and the second number are different.

[0036] In one alternative embodiment of this application, the second quantity is less than the first quantity;

[0037] The acquisition module is also used to acquire second input information;

[0038] The processing module is also used to call the prediction model and use the resources in the third tensor parallel domain to perform calculations on the second input information to obtain the second output information; wherein, there is an overlap between the third tensor parallel domain and the first tensor parallel domain, and the intersection between the third tensor parallel domain and the second tensor parallel domain is empty.

[0039] In an optional embodiment of this application, the acquisition module is further configured to acquire third input information;

[0040] The processing module is further configured to invoke the first network and use the resources in the first tensor parallel domain to perform calculations on the third input information to obtain the third feature information;

[0041] The processing module is further configured to invoke the second network and utilize the resources in the second tensor parallel domain to perform calculations on the third feature information to obtain the third output information.

[0042] In an optional embodiment of this application, the processing module is further configured to: when the computation of the first network is performed using the first tensor parallel resource in the third tensor parallel domain, call the first network and perform computation on the third input information and the second input information in the same batch using the second tensor parallel domain and the first tensor parallel resource to obtain the third feature information;

[0043] And / or, when the computation of the second network is performed using the second tensor parallel resources in the third tensor parallel domain, the computation of the second network is paused, and the first network is invoked to perform computation on the third input information using the second tensor parallel domain and the second tensor parallel resources to obtain the third feature information.

[0044] In an optional embodiment of this application, the processing module is further configured to: when the computation of the second network is performed using the second tensor parallel resource, invoke the second network, and use the second tensor parallel domain and the second tensor parallel resource to perform computation on the third feature information and the second feature information in the same batch to obtain the third output information.

[0045] In one alternative embodiment of this application, the first quantity is less than the second quantity;

[0046] The acquisition module is also used to acquire second input information;

[0047] The processing module is further configured to call the prediction model and use the resources in the third tensor parallel domain to perform calculations on the second input information to obtain the second output information; wherein, there is an overlap between the third tensor parallel domain and the second tensor parallel domain, and the intersection between the third tensor parallel domain and the first tensor parallel domain is empty.

[0048] In an alternative embodiment of this application, the processing module is further configured to:

[0049] When the computation of the first network is being performed using the third tensor parallel resource in the third tensor parallel domain, pause the computation of the first network, and invoke the second network to perform computation on the first feature information, using the first tensor parallel domain and the third tensor parallel resource, to obtain the first output information; and / or, when the computation of the second network is being performed using the fourth tensor parallel resource in the third tensor parallel domain, invoke the second network to perform computation, in the same batch, on the first feature information and the second feature information corresponding to the second input information, using the first tensor parallel domain and the fourth tensor parallel resource, to obtain the first output information.

[0050] In an optional embodiment of this application, the processing module is further configured to: invoke the first network to perform calculations on the second input information using resources in the third tensor parallel domain to obtain second feature information; and invoke the second network to perform calculations on the second feature information using resources in the third tensor parallel domain to obtain second output information.

[0051] In one optional embodiment of this application, the first network includes a prefill network, and the second network includes a decoding network.

[0052] In an alternative embodiment of this application, the second quantity is less than the first quantity; the device further includes:

[0053] The configuration module is configured to add at least one tensor parallel resource to a waiting queue when the load information of calling the second network to perform calculations on the first feature information exceeds a first load threshold. The at least one tensor parallel resource belongs to the first tensor parallel domain and does not belong to the second tensor parallel domain. The at least one tensor parallel resource in the waiting queue is used to be added to the second tensor parallel domain when the first output information is obtained.

[0054] In an alternative embodiment of this application, the first quantity is less than the second quantity; the device further includes:

[0055] The configuration module is configured to add at least one tensor parallel resource to a waiting queue when the load information of the first network performing calculations on the first input information exceeds a second load threshold. The at least one tensor parallel resource belongs to the second tensor parallel domain and does not belong to the first tensor parallel domain. The at least one tensor parallel resource in the waiting queue is used to be added to the first tensor parallel domain when the first feature information is obtained.

[0056] In an alternative embodiment of this application, the prediction model includes a third number of attention heads based on an attention mechanism, and the apparatus further includes: a determination module, configured to determine the first number of tensor parallel resources in the first tensor parallel domain based on the third number of attention heads.

[0057] In an optional embodiment of this application, the determining module is further configured to: determine at least two candidate resource quantities for the first tensor parallel domain based on the third quantity, wherein each of the at least two candidate resource quantities is a factor or multiple of the third quantity, and each candidate resource quantity is a factor of the total number of tensor parallel resources provided by the computer device; obtain a first latency threshold for the first network, and the device computing resources and model computing resources required by the first network corresponding to the at least two candidate resource quantities respectively; determine the batch processing quantity corresponding to the at least two candidate resource quantities based on the first latency threshold, and the device computing resources and model computing resources corresponding to the at least two candidate resource quantities respectively; determine the first candidate resource quantity among the at least two candidate resource quantities as the first quantity, wherein the first batch processing quantity corresponding to the first candidate resource quantity is the maximum value among the batch processing quantities corresponding to the at least two candidate resource quantities respectively.

[0058] In an alternative embodiment of this application, the prediction model includes a third number of attention heads based on an attention mechanism, and the apparatus further includes: a determination module, configured to determine the second number of tensor parallel resources in the second tensor parallel domain based on the third number of attention heads.

[0059] In an optional embodiment of this application, the determining module is further configured to: determine at least two candidate resource quantities for the second tensor parallel domain based on the third quantity, wherein each of the at least two candidate resource quantities is a factor or multiple of the third quantity, and each candidate resource quantity is a factor of the total number of tensor parallel resources provided by the computer device; obtain a second latency threshold for the second network, and the device transmission resources and model transmission resources required by the second network corresponding to the at least two candidate resource quantities respectively; determine the batch processing quantity corresponding to the at least two candidate resource quantities based on the second latency threshold, and the device transmission resources and model transmission resources corresponding to the at least two candidate resource quantities respectively; determine the second candidate resource quantity among the at least two candidate resource quantities as the second quantity, wherein the second batch processing quantity corresponding to the second candidate resource quantity is the maximum value among the batch processing quantities corresponding to the at least two candidate resource quantities respectively.

[0060] According to another aspect of this application, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction, at least one program, code set or instruction set, the at least one instruction, the at least one program, the code set or instruction set being loaded and executed by the processor to implement the information processing method as described above.

[0061] According to another aspect of this application, a computer-readable storage medium is provided, wherein at least one instruction, at least one program, code set, or instruction set is stored therein, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the information processing method described above.

[0062] According to another aspect of this application, a computer program product is provided, the computer program product including computer instructions stored in a computer-readable storage medium, wherein a processor reads from the computer-readable storage medium and executes the computer instructions to implement the information processing method described above.

[0063] It should be understood that the beneficial effects of the aforementioned information processing apparatus, computer equipment, computer-readable storage medium, and computer program products, and their corresponding possible implementations, can be found in the technical effects of the first aspect and its corresponding possible implementations described above, and will not be repeated here.

Attached Figure Description

[0064] Figure 1 is a schematic diagram of a computer system provided in an exemplary embodiment of this application;

[0065] Figure 2 is a schematic diagram of an information processing method provided in an exemplary embodiment of this application;

[0066] Figure 3 is a flowchart of an exemplary embodiment of the information processing method provided in this application;

[0067] Figure 4 is a flowchart of an exemplary embodiment of the information processing method provided in this application;

[0068] Figure 5 is a flowchart of an exemplary embodiment of the information processing method provided in this application;

[0069] Figure 6 is a flowchart of an exemplary embodiment of the information processing method provided in this application;

[0070] Figure 7 is a schematic diagram of tensor parallel resources provided in an exemplary embodiment of this application;

[0071] Figure 8 is a flowchart of an exemplary embodiment of the information processing method provided in this application;

[0072] Figure 9 is a structural block diagram of an information processing apparatus provided in an exemplary embodiment of this application;

[0073] Figure 10 is a schematic diagram of the structure of an information processing device provided in an exemplary embodiment of this application;

[0074] Figure 11 is a schematic diagram of the structure of an information processing device provided in an exemplary embodiment of this application.

Detailed Implementation

[0075] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0076] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0077] Figure 1 shows a schematic diagram of a computer system provided in one embodiment of this application. This computer system can serve as the system architecture for implementing the information processing method. The computer system may include: a terminal 100 and a server 200.

[0078] Terminal 100 can be an electronic device such as a mobile phone, tablet computer, or PC (personal computer). A client of the target application can be installed and run on terminal 100. The target application can be an information processing application or another application that provides information processing functions. This application does not limit the form of the target application, which includes but is not limited to an app or mini-program installed on terminal 100, and can also be a web page.

[0079] Server 200 can be a standalone physical server, a server cluster or distributed system consisting of multiple physical servers, or a cloud server providing cloud computing services. Server 200 can be the backend server for the aforementioned target application, used to provide backend services to the clients of the target application.

[0080] The information processing method provided in the embodiments of this application can be executed by a computer device, which refers to an electronic device with data computing, processing, and storage capabilities. Taking the implementation environment shown in Figure 1 as an example, the information processing method can be executed by the terminal 100 (for example, by the client of the target application installed and running on the terminal 100), by the server 200, or by the terminal 100 and the server 200 in cooperation. This application does not limit this.

[0081] Furthermore, the technical solution of this application can be combined with blockchain technology. For example, in the information processing method disclosed in this application, some data (such as input information, feature information, etc.) can be stored on the blockchain. The terminal 100 and the server 200 can communicate through a network, such as a wired or wireless network.

[0082] Figure 2 shows a schematic diagram of an information processing method provided in an exemplary embodiment of this application.

[0083] Server 200 includes n graphics processing units (GPUs); for example, n GPUs are installed in server 200 via hardware interfaces, or server 200 includes n computing modules, each with one GPU as its computing core; n is an integer greater than 1. For example, server 200 is used to train and/or run artificial neural networks, and is also referred to as a rack for training and/or running artificial neural networks. Each GPU corresponds to one tensor parallel resource, which includes at least one of computing resources, storage resources, and transmission resources.

[0084] In this embodiment, a large language model 310 implemented as an artificial neural network is used as an example. The large language model 310 includes a prefill network 312 and a decoding network 314 cascaded in sequence.

[0085] At time T1, the first input information 302 is obtained. The first input information 302 is natural language information input to the large language model 310, and the large language model 310 is used to predict a response statement for the first input information 302.

[0086] The prefill network 312 is invoked, and the n tensor parallel resources provided by server 200 are used to perform the prefill operation on the first input information 302, obtaining the first feature information 302a. The first feature information 302a is the feature representation of the first input information 302 in the hidden-layer space. In some examples, the first input information 302 includes multiple pieces of information input to the large language model 310 at different times and/or from different input directions, and the n tensor parallel resources process these pieces of information as a batch. Because the pieces of information differ in word length, the shorter ones need to be padded. When the prefill network 312 is invoked, a parallel strategy is adopted: the decision to use n tensor parallel resources for the prefill operation on the first input information 302 is based on at least one of the computing resources, storage resources, and transmission resources that the prefill network 312 requires for computation.
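The padding of inputs with different lengths mentioned above can be illustrated with a short Python sketch. It assumes right-padding with a hypothetical pad token id of 0 and also returns a mask marking the real tokens; neither detail is specified in this application.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token sequences to a common length and build a validity mask."""
    max_len = max(len(seq) for seq in sequences)
    # Append pad_id until every sequence reaches the longest length in the batch.
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that later computation should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded, mask = pad_batch([[5, 6, 7], [8]])
```

Here the second sequence is extended from one token to three so that both sequences can be processed together as one batch.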

[0087] At time T2, the decoding network 314 is invoked, and the first feature information 302a is decoded using x tensor parallel resources to obtain the first output information 302b. Also at time T2, the second input information 304 is obtained, the large language model 310 is invoked, and the second input information 304 is processed using the remaining n - x tensor parallel resources to obtain the second output information 304b.

[0088] Time T2 is a time after the first feature information 302a has been computed by invoking the prefill network 312; that is, time T2 is after time T1.

[0089] By using the n - x tensor parallel resources to perform calculations on the second input information 304, the idle resources that are not needed by the decoding network 314 are put to use, while the x tensor parallel resources still meet the calculation requirements of the decoding network 314.

[0090] The intersection between the x tensor parallel resources used to perform calculations on the first feature information 302a and the n - x tensor parallel resources used to perform calculations on the second input information 304 is empty. This ensures that, after the first feature information 302a is obtained, each tensor parallel resource performs either the calculation of the first feature information 302a or the calculation of the second input information 304, thus avoiding conflicts in which one tensor parallel resource must process two pieces of information simultaneously.
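The split of the n tensor parallel resources at time T2 can be sketched as follows. The sketch assumes that resources are identified by integer ranks 0 to n - 1 and that the first x ranks are assigned to decoding; both choices are illustrative assumptions, and any disjoint split of the same sizes would serve equally well.

```python
from typing import List, Tuple


def split_at_t2(n: int, x: int) -> Tuple[List[int], List[int]]:
    """Split n resource ranks into x ranks for decoding and n - x ranks for the next input."""
    if not 0 < x < n:
        raise ValueError("x must satisfy 0 < x < n")
    ranks = list(range(n))
    decode_ranks = ranks[:x]  # used by the decoding network on the first feature information
    idle_ranks = ranks[x:]    # no longer needed after prefill; used for the second input
    return decode_ranks, idle_ranks


decode_ranks, idle_ranks = split_at_t2(n=8, x=2)
# The two groups never share a rank, so no resource serves two requests at once.
assert set(decode_ranks).isdisjoint(idle_ranks)
```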

[0091] The information processing method will now be described through the following examples.

[0092] Figure 3 shows a flowchart of an exemplary embodiment of the information processing method provided in this application. This method can be executed by a computer device and includes, but is not limited to, steps 510-530.

[0093] Step 510: Obtain the first input information for the prediction model.

[0094] The prediction model can be implemented as an artificial neural network (ANN). The prediction model includes a first network and a second network cascaded together. The network structure of either the first network or the second network includes, but is not limited to, at least one of the following: convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), generative adversarial network (GAN), and deep learning model based on attention mechanism (such as the transformer model, also known as the transformation model).

[0095] The first input information is the input parameter of the prediction model. The first input information can be natural language information, vector information, or matrix information; there are no restrictions on the format of the first input information. The prediction model can be used to respond to natural language information, predict the label of the first input information, predict media information such as images, etc.; there are no restrictions on the predictive capabilities of the prediction model. The first input information can be entered into the computer device through human-computer interaction, obtained by the computer device through a network connection, or read from a database storing training data of artificial neural networks.

[0096] Step 520: Invoke the first network and use the resources in the first tensor parallel domain to perform calculations on the first input information to obtain the first feature information.

[0097] The first tensor parallel domain includes a first number of tensor parallel (TP) resources provided by the computer device. In one example, the computer device is implemented as a server that includes multiple graphics processing units (GPUs), each corresponding to one tensor parallel resource; accordingly, the first network is invoked and the first number of GPUs in the server are used to perform computations on the first input information to obtain the first feature information.

[0098] For example, tensor parallel resources include at least one of computing resources, storage resources, and transmission resources. Computing resources indicate the computational capability available to the prediction model, storage resources indicate the capability to cache model parameters or model output information of the prediction model, and transmission resources indicate the capability to read and write cached model parameters, model input information, or model output information.

[0099] The first quantity is positively correlated with at least one of the computing resources, storage resources, and transmission resources required for the first network to perform computation; as the computing resources, storage resources, and transmission resources required for the first network to perform computation increase, the number of resources in the first tensor parallel domain increases accordingly.

[0100] The first feature information is the result of the first network's computation on the first input information. It can be implemented as at least one of a feature value, a feature vector, and a feature matrix. The first feature information is the input to the second network in the prediction model; that is, it is the intermediate information computed by the prediction model from the first input information.

[0101] Step 530: Call the second network and use the resources in the second tensor parallel domain to perform calculations on the first feature information to obtain the first output information.

[0102] The second tensor parallel domain includes a second number of tensor parallel resources provided by the computer device. The second number is positively correlated with at least one of the computing resources, storage resources, and transmission resources required for the second network to perform computation. As the computing resources, storage resources, and transmission resources required for the second network to perform computation increase, the number of resources in the second tensor parallel domain increases accordingly.

[0103] There is an overlap between the first and second tensor parallel domains, and the first and second quantities are different. For example, the tensor parallel resources provided by the intersection of the first and second tensor parallel domains are first used to execute the computation of the first network, and then used to execute the computation of the second network. In one example, the first and second quantities are different because at least one of the computational resources, storage resources, and transmission resources required for the first and second networks to perform computations is different; this application does not limit the size relationship between the first and second quantities.

[0104] The first output information is the computation result of the prediction model on the first input information. For example, when the prediction model is used to respond to natural language information, as with a large language model (LLM), the first output information is natural language information. When the prediction model is used to predict the label of the first input information, the first output information is the predicted label. When the prediction model is used to predict an image indicated by the first input information, as with a text-to-image model, the first output information is image information.

[0105] The method provided in this embodiment calls upon the first and second networks to perform computations using different amounts of tensor parallel resources based on the resource requirements of the first and second networks. This achieves flexible provision of tensor parallel resources to the first and second networks. Compared to providing the same tensor parallel resources to each network in the prediction model, this method avoids resource waste caused by providing redundant tensor parallel resources to the network after meeting its computational needs, thus ensuring the utilization efficiency of tensor parallel resources.
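Steps 510 to 530 can be sketched as follows. The `TPDomain` class, the representation of a domain as a set of integer resource ranks, and the stand-in network callables are hypothetical names introduced for illustration; a real system would dispatch each computation to the devices in the corresponding domain.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet


@dataclass(frozen=True)
class TPDomain:
    """A tensor parallel domain, modeled here as a set of resource ranks."""
    ranks: FrozenSet[int]

    @property
    def size(self) -> int:
        return len(self.ranks)


# A network is modeled as a callable taking its input and the domain it runs on.
Network = Callable[[Any, TPDomain], Any]


def check_domains(first: TPDomain, second: TPDomain) -> None:
    """Check the two stated conditions: the domains overlap and their sizes differ."""
    if not (first.ranks & second.ranks):
        raise ValueError("the two domains must overlap")
    if first.size == second.size:
        raise ValueError("the first and second quantities must differ")


def run_prediction(
    first_input: Any,
    first_network: Network,
    second_network: Network,
    first_domain: TPDomain,
    second_domain: TPDomain,
) -> Any:
    check_domains(first_domain, second_domain)
    first_feature = first_network(first_input, first_domain)  # step 520
    return second_network(first_feature, second_domain)       # step 530
```

The overlapping resources serve the first network first and the second network afterwards, which is why a single call chain suffices in this sketch.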

[0106] Referring to Figure 4, on the basis of the embodiment shown in Figure 3, the information processing method provided in this application further includes step 542, which is executed after step 520; further, it also includes steps 544, 546, and 548.

[0107] Step 542: Obtain the second input information and call the prediction model to perform calculations on the second input information using the resources in the third tensor parallel domain to obtain the second output information.

[0108] In this embodiment, in step 542, the second quantity is less than the first quantity, and the resources required for the first network to perform computation are greater than the resources required for the second network to perform computation. When performing computation on the second network, there are idle resources in the first tensor parallel domain other than the second tensor parallel domain. The second tensor parallel domain is the resource used when performing computation on the second network.

[0109] By utilizing the resources in the third tensor parallel domain to perform computations on the second input information, while ensuring that the resources in the second tensor parallel domain can meet the computational needs of the second network, the idle resources not used for computation in the second network are fully utilized to perform computations on the second input information. In some examples, the second tensor parallel domain is a subset of the first tensor parallel domain; this improves the utilization rate of tensor parallel resources in the first tensor parallel domain, avoids waste caused by idle tensor parallel resources, and improves the resource utilization rate of computer equipment.

[0110] The intersection between the third tensor parallel domain and the second tensor parallel domain is empty. This ensures that, after the first feature information is obtained, each tensor parallel resource performs either the calculation of the first feature information or the calculation of the second input information, avoiding conflicts in which one tensor parallel resource must process two pieces of information simultaneously. Furthermore, the third tensor parallel domain overlaps the first tensor parallel domain, ensuring that the resources used to perform the calculation of the second input information are idle resources that, after the first feature information is obtained, are not used to calculate it. Further, the third tensor parallel domain is a subset of the first tensor parallel domain. The execution periods of steps 542 and 530 overlap at at least one moment, but this application does not restrict the order of their start and end times. Because different tensor parallel resources simultaneously perform the calculation of the first feature information and the calculation of the second input information, the processing efficiency of information input to the prediction model is improved compared with the related technologies, in which the first and second networks use the same tensor parallel resources and the calculation of a later input begins only after the output for the preceding input has been obtained.
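The concurrent execution of steps 530 and 542 on disjoint resources can be sketched as follows. Python threads stand in for the two groups of devices, and the third tensor parallel domain is taken to be all of the first domain that lies outside the second domain; both are simplifying assumptions, as are the function and parameter names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, FrozenSet, Tuple

Stage = Callable[[Any, FrozenSet[int]], Any]


def idle_domain(first: FrozenSet[int], second: FrozenSet[int]) -> FrozenSet[int]:
    """Resources of the first domain that the second network does not use."""
    third = first - second
    if not third:
        raise ValueError("no idle resources are available")
    return third


def run_steps_530_and_542(
    second_network: Stage,
    first_feature: Any,
    second_domain: FrozenSet[int],
    prediction_model: Stage,
    second_input: Any,
    third_domain: FrozenSet[int],
) -> Tuple[Any, Any]:
    if second_domain & third_domain:
        raise ValueError("the second and third domains must not share resources")
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Step 530: decode the first feature information on the second domain.
        decode = pool.submit(second_network, first_feature, second_domain)
        # Step 542: process the second input on the otherwise idle third domain.
        other = pool.submit(prediction_model, second_input, third_domain)
        return decode.result(), other.result()
```

Because the two domains are disjoint, neither task waits for the other to release a resource.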

[0111] As described above, the prediction model includes a first network and a second network cascaded together; correspondingly, calling the prediction model can be achieved through the following two steps:

[0112] The first network is invoked, and the resources in the third tensor parallel domain are used to perform calculations on the second input information to obtain the second feature information.

[0113] The second network is invoked, and the resources in the third tensor parallel domain are used to perform calculations on the second feature information to obtain the second output information.

[0114] Similar to the first feature information and the first output information mentioned above, the second feature information is the calculation result of the first network on the second input information, that is, the intermediate information calculated by the prediction model from the second input information. The second output information is the calculation result of the prediction model on the second input information.

[0115] Step 544: Obtain the third input information.

[0116] The third input information is obtained after the first output information is received; step 544 is also executed after step 530. The third input information, similar to the first input information, is the input parameter of the prediction model.

[0117] Step 546: Call the first network and use the resources in the first tensor parallel domain to perform calculations on the third input information to obtain the third feature information.

[0118] The computation of the third input information by the first network uses the resources in the first tensor parallel domain, which are the same resources used for the computation of the first input information.

[0119] As described above, there is at least one overlap in the execution timing of steps 542 and 530, and step 546 is executed after step 530. To utilize the tensor parallel resources in the first tensor parallel domain to perform the computation of the third input information, step 546 can be implemented as at least one of the following two sub-steps:

[0120] Sub-step 1: While using the first tensor parallel resources in the third tensor parallel domain to perform the computation of the first network, call the first network, and use the second tensor parallel domain and the first tensor parallel resources to perform the computation on the third input information and the second input information in the same batch to obtain the third feature information;

[0121] When the first tensor parallel resources in the third tensor parallel domain are being used to perform the computation of the first network, they are performing computation on the second input information; that is, the first network has been invoked and is using the first tensor parallel resources to perform computation on the second input information.

[0122] By using the second tensor parallel domain and the first tensor parallel resources to compute the third input information and the second input information in the same batch, the computation of the third input information uses the same resources as the computation of the first input information.

[0123] The first tensor parallel resource can be part or all of the resources in the third tensor parallel domain.

[0124] Furthermore, the first network is invoked, utilizing the second tensor parallel domain and the first tensor parallel resources to perform batch computations on the third and second input information simultaneously, thereby obtaining the second feature information corresponding to the second input information. In other words, by utilizing the second tensor parallel domain and the first tensor parallel resources, batch processing of the third and second input information is achieved, simultaneously obtaining the computation results of the first network on the third and second input information.

[0125] Sub-step 2: While using the second tensor parallel resources in the third tensor parallel domain to perform the computation of the second network, pause the computation of the second network and call the first network to perform computation on the third input information using the second tensor parallel domain and the second tensor parallel resources to obtain the third feature information.

[0126] When the computation of the second network is performed using the second tensor parallel resources in the third tensor parallel domain, the second tensor parallel resources provide resources for calling the second network to compute the second feature information; for example, calling the second network and using the second tensor parallel resources to perform computation on the second feature information.

[0127] By pausing the computation of the second network and utilizing the second tensor parallel domain and second tensor parallel resources to perform computation on the third input information, the computation of the second feature information is paused, thus achieving the same resource utilization for the computation of the first and third input information.

[0128] The second tensor parallel resource can be part or all of the resources in the third tensor parallel domain.
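The choice between sub-step 1 and sub-step 2 can be sketched as a small decision function. The `Stage` enum, the `Plan` record, and the string labels for the inputs are hypothetical and serve only to make the two cases explicit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class Stage(Enum):
    FIRST_NETWORK = 1   # busy resources are running the first network on the second input
    SECOND_NETWORK = 2  # busy resources are running the second network on the second feature


@dataclass(frozen=True)
class Plan:
    resources: FrozenSet[int]   # resources that run the first network on the third input
    batch: Tuple[str, ...]      # inputs computed together in one batch
    pause_second_network: bool  # whether second-network work on the busy resources is paused


def plan_step_546(second_domain: FrozenSet[int], busy: FrozenSet[int], stage: Stage) -> Plan:
    """Decide how step 546 obtains resources from the third tensor parallel domain."""
    resources = second_domain | busy
    if stage is Stage.FIRST_NETWORK:
        # Sub-step 1: join the in-flight first-network computation as one batch.
        return Plan(resources, ("third_input", "second_input"), False)
    # Sub-step 2: pause the second network and reuse its resources for the third input.
    return Plan(resources, ("third_input",), True)
```

In both cases the third input is computed on the second tensor parallel domain together with the busy resources; only the treatment of the second input's in-flight work differs.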

[0129] Step 548: Call the second network and use the resources in the second tensor parallel domain to perform calculations on the third feature information to obtain the third output information.

[0130] After the third feature information is obtained, the second network is invoked and computes the third feature information using the resources in the second tensor parallel domain, which ensures that these resources can meet the computational needs of the second network.

[0131] As described above, optionally, when using the second tensor parallel resources in the third tensor parallel domain to perform the computation of the second network, in order to ensure that the resources used for the computation of the first input information and the third input information are the same, the computation of the second network is paused; step 548 is implemented as follows:

[0132] While the second tensor parallel resources are being used to perform the computation of the second network, the second network is invoked, and the third feature information and the second feature information are computed in the same batch using the second tensor parallel domain and the second tensor parallel resources to obtain the third output information.

[0133] Furthermore, the second network is invoked, utilizing the second tensor parallel domain and the second tensor parallel resources to perform batch calculations on the third and second feature information, thus obtaining the second output information. In other words, by utilizing the second tensor parallel domain and the second tensor parallel resources, batch processing of the third and second feature information is achieved, simultaneously obtaining the calculation results of the second network on the third and second feature information.

[0134] Optionally, after step 546, the method further includes: obtaining the fourth input information and calling the prediction model to perform calculations on the fourth input information using resources in the third tensor parallel domain to obtain the fourth output information;

[0135] Referring to step 542 above, when performing the computation of the second network, there are idle resources in the first tensor parallel domain other than the second tensor parallel domain. By utilizing the resources in the third tensor parallel domain to perform computation on the fourth input information, while ensuring that the resources in the second tensor parallel domain can meet the computational needs of the second network, the idle resources that have not been used for computation of the second network are fully utilized to perform computation on the fourth input information.

[0136] In summary, the method provided in this embodiment, by obtaining the third input information after obtaining the first output information, enables the use of different amounts of tensor parallel resources to call the first and second networks in a repetitive loop. This avoids the waste of resources caused by providing the network with excess tensor parallel resources on the basis of meeting the network's computing needs, and ensures the utilization efficiency of tensor parallel resources.

[0137] Figure 5 shows a flowchart of an exemplary embodiment of the information processing method provided in this application. This method can be executed by a computer device. On the basis of the embodiment shown in Figure 3, the method further includes steps 525 and 526; furthermore, step 530 in the embodiment shown in Figure 3 can be implemented as at least one of steps 532 and 534.

[0138] Step 525: Obtain the second input information.

[0139] In this embodiment, the execution order between steps 525 and 510 is not restricted; step 525 can be executed before, after, or simultaneously with step 510. Similar to the first input information, the second input information is the input parameters of the prediction model. This embodiment does not exclude the possibility of merging steps 510 and 525 into the same step, determining a portion of the acquired input information as the first input information, and the other portion as the second input information.

[0140] Step 526: Invoke the prediction model and use the resources in the third tensor parallel domain to perform calculations on the second input information to obtain the second output information.

[0141] In this embodiment, step 526 is executed after step 510; in step 526, the first quantity is less than the second quantity, and the resources required for the first network to perform computation are less than the resources required for the second network to perform computation; when performing computation of the first network, there are idle resources in the second tensor parallel domain other than the first tensor parallel domain, and the first tensor parallel domain is the resource used when performing computation of the first network.

[0142] By utilizing resources in the third tensor parallel domain to perform computations on the second input information, while ensuring that the resources in the first tensor parallel domain can meet the computational needs of the first network, the idle resources not used for computation in the first network are fully utilized to perform computations on the second input information. In some examples, the first tensor parallel domain is a subset of the second tensor parallel domain; this improves the utilization rate of tensor parallel resources in the second tensor parallel domain, avoids waste caused by idle tensor parallel resources, improves the resource utilization rate of computer equipment, and improves the processing efficiency of information input to the prediction model.

[0143] The intersection between the third tensor parallel domain and the first tensor parallel domain is empty, ensuring that a tensor parallel resource is used to perform either the computation of the first input information or the computation of the second input information, avoiding conflicts where a tensor parallel resource needs to process computations of two pieces of information simultaneously. Furthermore, there is an overlap between the third and second tensor parallel domains, ensuring that the resource used to perform computations on the second input information is an idle resource that has not yet been used to perform computations on the first input information. Moreover, the third tensor parallel domain is a subset of the second tensor parallel domain.

[0144] In this embodiment, there is at least one overlap in the execution timing of steps 526 and 520, but no restrictive provisions are made regarding the order of the start and end execution times of steps 526 and 520. By utilizing different tensor parallel resources, the calculation of the first input information and the calculation of the second input information are performed simultaneously. Compared with the related technology where the first network and the second network use the same tensor parallel resources, and the calculation of the second input information is performed after the output result of the first input information is obtained, the processing efficiency of the information input to the prediction model is improved.

[0145] As described above, the prediction network includes a first network and a second network cascaded together. Step 526 can be implemented by calling the first network and the second network in sequence. For the specific implementation method, please refer to the description in step 542 above, which will not be repeated here.

[0146] Step 532: While the first network is being computed using the third tensor parallel resources in the third tensor parallel domain, the computation of the first network is paused, and the second network is invoked to compute the first feature information using the first tensor parallel domain and the third tensor parallel resources to obtain the first output information.

[0147] As described above, there is at least one overlap in the execution timing of steps 526 and 520, and the calculation of the first feature information is performed after step 520. In this embodiment, the calculation of the first feature information using resources in the second tensor parallel domain is implemented as at least one of steps 532 and 534.

[0148] When the computation of the first network is performed using the third tensor parallel resources in the third tensor parallel domain, the third tensor parallel resources are used to perform computation of the second input information, such as calling the first network and using the first tensor parallel resources to perform computation on the second input information.

[0149] By pausing the computation of the first network, the computation of the first feature information is performed using the first tensor parallel domain and the third tensor parallel resources. In this way, the computation of the second input information is paused, and the computation of the first feature information is performed using the third tensor parallel resources in the third tensor parallel domain.

[0150] The third tensor parallel resource can be some or all of the resources in the third tensor parallel domain.

[0151] Step 534: When the computation of the second network is performed using the fourth tensor parallel resource in the third tensor parallel domain, the second network is invoked, and the computation of the second feature information corresponding to the first feature information and the second input information is performed in batches using the first tensor parallel domain and the fourth tensor parallel resource to obtain the first output information.

[0152] When the computation of the second network is performed using the fourth tensor parallel resource in the third tensor parallel domain, the fourth tensor parallel resource is used to perform the computation of the second feature information. For example, the second network is invoked, and the computation of the second feature information is performed using the first tensor parallel resource. The second feature information is the computation result of the first network on the second input information, and the second feature information is the intermediate information calculated by the prediction network on the second input information.

[0153] By utilizing the first tensor parallel domain and the fourth tensor parallel resource, the first feature information and the second feature information are computed in batches, providing a way for the fourth tensor parallel resource to provide resources for the second network.

[0154] The fourth tensor parallel resource can be some or all of the resources in the third tensor parallel domain.

[0155] In one alternative implementation, the second network is invoked, and the first tensor parallel domain and the fourth tensor parallel resource are used to perform batch computation on the first feature information and on the second feature information corresponding to the second input information, thereby also obtaining the second output information. That is, by utilizing the first tensor parallel domain and the fourth tensor parallel resource, the first feature information and the second feature information are processed in one batch, and the calculation results of the second network on both are obtained at the same time.
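Steps 532 and 534 can be read as two branches of one scheduling decision. The sketch below is a minimal illustration under assumptions: the `Stage` enum, the function name, and the returned dictionary are hypothetical and do not come from this application.

```python
from enum import Enum


class Stage(Enum):
    FIRST_NETWORK = "first_network"    # borrowed resources are running the first network
    SECOND_NETWORK = "second_network"  # borrowed resources are running the second network


def plan_second_network_call(first_domain, borrowed, borrowed_stage):
    """Decide how to run the second network on the first feature information.

    first_domain   -- resource ids of the first tensor parallel domain
    borrowed       -- resource ids taken from the third tensor parallel domain
    borrowed_stage -- what the borrowed resources are doing for the second input
    """
    resources = sorted(set(first_domain) | set(borrowed))
    if borrowed_stage is Stage.FIRST_NETWORK:
        # Step 532: pause the first network's work on the second input.
        return {"action": "pause_first_network", "resources": resources,
                "batch": ["first_feature"]}
    # Step 534: run both pieces of feature information as one batch.
    return {"action": "batch", "resources": resources,
            "batch": ["first_feature", "second_feature"]}


plan = plan_second_network_call({0, 1}, {2, 3}, Stage.SECOND_NETWORK)
print(plan["action"], plan["resources"])
```

In both branches the second network ends up with the first tensor parallel domain plus the borrowed resources; only the handling of the second input differs.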

[0156] In another alternative implementation, after step 532 and / or step 534, the following is also included:

[0157] Obtain the third input information; call the first network and use the resources in the first tensor parallel domain to perform calculations on the third input information to obtain the third feature information; call the second network and use the resources in the second tensor parallel domain to perform calculations on the third feature information to obtain the third output information.

[0158] As mentioned above, the first quantity is less than the second quantity. For details of the third input information and the third feature information that are not covered here, please refer to steps 544 to 548 in the embodiment corresponding to Figure 3 above.

[0159] Further optionally, the computation of the third input information by calling the first network can be implemented as follows: when the computation of the first network is performed using the third tensor parallel resources in the third tensor parallel domain, the first network is called, and the computation of the third input information and the second input information is performed in batches using the first tensor parallel domain and the third tensor parallel resources to obtain the third feature information.

[0160] Furthermore, the first network is invoked, utilizing the first tensor parallel domain and the third tensor parallel resources to perform batch calculations on the third and second input information, thereby obtaining the second feature information. In other words, by utilizing the first and third tensor parallel domains, batch processing of the third and second input information is achieved, simultaneously obtaining the calculation results of the first network on the third and second input information.

[0161] In summary, the method provided in this embodiment improves the utilization rate of tensor parallel resources in the second tensor parallel domain by utilizing idle resources in the third tensor parallel domain to perform calculations on the second input information during the calculation of the first network. This avoids waste caused by idle tensor parallel resources and improves the resource utilization rate of the computer equipment. By performing calculations on the first and second feature information in the same batch and / or pausing the calculation of the first network, a way of using the tensor parallel resources in the third tensor parallel domain is provided for the different computation states that may hold once the first feature information is obtained, thus realizing the use of resources in the second tensor parallel domain to perform calculations on the first feature information.

[0162] In one alternative design of this application, the first network in the prediction network includes a pre-filling network, and the second network includes a decoding network; furthermore, the prediction network can be implemented as a large language model with the ability to respond to natural language information. Figure 6 shows a flowchart of the information processing method provided in an exemplary embodiment of this application. The method can be executed by a computer device. The method includes steps 510, 520a, and 530a; further, it also includes step 535a.

[0163] Step 510: Obtain the first input information for the prediction model.

[0164] The first input information is the input parameters of the prediction model; when the prediction model is implemented as a large language model, the first input information is natural language information. Natural language information is the input parameter of the large language model, and the large language model predicts the response information in natural language form based on the natural language information.

[0165] Step 520a: Call the pre-filling network and use the resources in the first tensor parallel domain to pre-fill the first input information to obtain the first feature information.

[0166] In the prediction network, the pre-filling network is used to preprocess and encode the input natural language information. In some examples, for natural language processing (NLP), the pre-filling network typically includes a word embedding layer and a positional encoding layer. The word embedding layer converts the first input information into a vector representation, capturing the semantic information of the words in the first input information. The positional encoding layer adds positional information to the words in the first input information, indicating the order of the words in the natural language sequence of the first input information. In some examples, step 520a is an optional implementation of step 520 in the various embodiments described above.

[0167] Step 530a: Call the decoding network and use the resources in the second tensor parallel domain to decode the first feature information to obtain the first output information.

[0168] In the prediction network, the decoding network is used to decode the first feature information output by the pre-filling network. In some examples, for natural language processing, the decoding network typically includes one or more decoder layers (such as transformer layers, also called transformation layers), which progressively generate the first output information, for example, generating the words in the first output information in word order from beginning to end. The transformation layer includes a multi-head self-attention mechanism and a feed-forward neural network to achieve the generation of words in the first output information from beginning to end. In one example, step 530a is an optional implementation of step 530 in the various embodiments above.

[0169] Step 535a: If the load information of calling the decoding network to perform decoding on the first feature information exceeds the first load threshold, add at least one tensor parallel resource to the waiting queue.

[0170] If the load of calling the decoding network to decode the first feature information exceeds the first load threshold, the resources provided by the second tensor parallel domain to the decoding network are under high load. To reduce the load of calling the decoding network to perform decoding, at least one tensor parallel resource is added to the waiting queue. Specifically, the at least one tensor parallel resource in the waiting queue is to be added to the second tensor parallel domain once the first output information is obtained.

[0171] In this embodiment, the second quantity is less than the first quantity; the resources provided for computation by the pre-filling network exceed the resources provided for computation by the decoding network. The at least one tensor parallel resource added to the waiting queue belongs to the first tensor parallel domain and not to the second tensor parallel domain, so that additional resources can be provided to the decoding network.

[0172] In one example, the load information for calling the decoding network to decode the first feature information is the load information of the storage resources when the decoding network is called. Load information exceeding the first load threshold indicates that the storage resources are under high load when the decoding network is called. Figure 7 shows a schematic diagram of tensor parallel resources provided in an exemplary embodiment of this application. A decoding network is invoked, and x tensor parallel resources are used to decode first feature information to obtain first output information. Second input information is then obtained, and a large language model is invoked, using nx tensor parallel resources to perform computation on the second input information to obtain second output information. Exceeding the first load threshold indicates that storage resources are under high load when the decoding network is invoked; there is a risk that insufficient storage resources will increase the computation time of the decoding network. At least one tensor parallel resource is added to a waiting queue; for example, the nth tensor parallel resource is added to the waiting queue. The nth tensor parallel resource is the last tensor parallel resource used for computation among the nx tensor parallel resources.

[0173] Upon obtaining the first output information, the nth tensor parallel resource is added to the second tensor parallel domain to enable the use of the nth tensor parallel resource to call the decoding network, thus providing more storage resources for calling the decoding network.
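The waiting-queue behaviour of step 535a can be sketched as follows. This is an illustration under assumptions, not the implementation of this application: the class and method names are hypothetical, load is modelled as a single number, and choosing the highest-indexed eligible resource merely mirrors the "nth resource" example.

```python
from collections import deque


class DomainExpander:
    """Hypothetical helper: queue resources, attach them once output is ready."""

    def __init__(self, first_domain, second_domain, load_threshold):
        self.first_domain = set(first_domain)
        self.second_domain = set(second_domain)
        self.load_threshold = load_threshold
        self.waiting = deque()

    def report_load(self, load):
        """Queue one resource if the second network's load exceeds the threshold."""
        if load <= self.load_threshold:
            return None
        # Eligible: in the first domain, not in the second, not already queued.
        eligible = sorted(self.first_domain - self.second_domain - set(self.waiting))
        if not eligible:
            return None
        resource = eligible[-1]  # assumption: take the last resource, as in the example
        self.waiting.append(resource)
        return resource

    def on_first_output(self):
        """Move every queued resource into the second domain."""
        while self.waiting:
            self.second_domain.add(self.waiting.popleft())


expander = DomainExpander(first_domain=range(8), second_domain=range(2),
                          load_threshold=0.9)
expander.report_load(0.95)   # load above threshold: resource 7 is queued
expander.on_first_output()   # first output obtained: resource 7 joins the domain
print(sorted(expander.second_domain))
```

Deferring the move until `on_first_output` is called keeps a queued resource out of the second domain until the first output information is available, which is the ordering the text describes.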

[0174] In one alternative design of this application, similar to the description in step 535a, the second quantity is less than the first quantity. On the basis of the embodiment corresponding to Figure 3 above, the method further includes:

[0175] If the load information of calling the second network to perform calculations on the first feature information exceeds the first load threshold, at least one tensor parallel resource will be added to the waiting queue.

[0176] Wherein, at least one tensor parallel resource belongs to the first tensor parallel domain and not to the second tensor parallel domain; at least one tensor parallel resource in the waiting queue is used to be added to the second tensor parallel domain when the first output information is obtained.

[0177] For information on load, waiting queues, etc., please refer to step 535a above.

[0178] In summary, the method provided in this embodiment expands the second tensor parallel domain when the resources provided by the second tensor parallel domain for the second network are under high load by adding at least one tensor parallel resource to the waiting queue. This ensures that the resources provided by the second tensor parallel domain can meet the needs of the second network and guarantee the computational efficiency of the second network.

[0179] In another alternative design of this application, the first quantity is less than the second quantity. On the basis of the embodiment corresponding to Figure 3 above, the method further includes:

[0180] If the load information of calling the first network to perform calculations on the first input information exceeds the second load threshold, at least one tensor parallel resource will be added to the waiting queue.

[0181] If the load of the first network performing computation on the first input information exceeds the second load threshold, the resources provided by the first tensor parallel domain to the first network are under high load. To reduce the load of the first network performing computation, at least one tensor parallel resource is added to a waiting queue. Specifically, the at least one tensor parallel resource in the waiting queue is to be added to the first tensor parallel domain once the first feature information is obtained.

[0182] The at least one tensor parallel resource added to the waiting queue belongs to the second tensor parallel domain but not to the first tensor parallel domain, so that additional resources can be provided to the first network.

[0183] In summary, the method provided in this embodiment expands the capacity of the first tensor parallel domain when the resources provided by the first tensor parallel domain for the first network are under high load by adding at least one tensor parallel resource to the waiting queue, thus ensuring that the resources provided by the first tensor parallel domain can meet the needs of the first network and guarantee the computational efficiency of the first network.

[0184] Next, we will introduce the first number of tensor parallel resources in the first tensor parallel domain and the second number of tensor parallel resources in the second tensor parallel domain.

[0185] Figure 8 shows a flowchart of the information processing method provided in an exemplary embodiment of this application. This method can be executed by a computer device. That is, on the basis of the embodiment shown in Figure 3, at least one of steps 502 and 504 is also included.

[0186] Step 502: Determine the first number of tensor parallel resources in the first tensor parallel domain based on the third number of attention heads.

[0187] The prediction model includes a third number of attention heads based on the attention mechanism; in one example, the third number is the number of attention heads included in the first network of the prediction model. Further, the first number is a multiple or factor of the third number; further still, the first number is an integer greater than 1. By determining the first number to be a multiple or factor of the third number, it ensures that when the first network is invoked, each tensor parallel resource provides resources for the same number of attention heads, or that each invocation of an attention head utilizes resources provided by the same number of tensor parallel resources.
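The balancing argument in [0187] can be made concrete with a short sketch. The function name and return format are hypothetical; the sketch only assumes what the paragraph states, namely that the quantity is an integer greater than 1 and either a factor or a multiple of the number of attention heads.

```python
def head_layout(num_heads, tp_size):
    """Describe how attention heads map onto tensor parallel resources.

    Returns ("heads_per_resource", k) when tp_size is a factor of num_heads,
    and ("resources_per_head", k) when tp_size is a multiple of num_heads.
    """
    if tp_size <= 1:
        raise ValueError("the text requires an integer greater than 1")
    if num_heads % tp_size == 0:
        return ("heads_per_resource", num_heads // tp_size)
    if tp_size % num_heads == 0:
        return ("resources_per_head", tp_size // num_heads)
    raise ValueError("tp_size is neither a factor nor a multiple of num_heads")


print(head_layout(16, 4))    # factor: every resource serves the same number of heads
print(head_layout(16, 32))   # multiple: every head uses the same number of resources
```

A quantity such as 6 with 16 heads raises an error in this sketch, because the heads could not be spread evenly.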

[0188] In one alternative implementation, step 502 can be implemented as sub-steps three through six as follows:

[0189] Sub-step three: Based on the third quantity, determine at least two resource candidate quantities for the first tensor parallel domain;

[0190] Each of the at least two resource candidate quantities is a factor or multiple of the third quantity, and each resource candidate quantity is a factor of the total number of tensor parallel resources provided by the computer device;

[0191] In one example, the computer device is implemented as a server that includes multiple graphics processing units (GPUs), and each resource candidate quantity is a factor of the number of GPUs. This guarantees that the server can simultaneously provide resources for an integer number of first-network computations without leaving any GPU idle, thus guaranteeing the resource utilization of the server in the GPU dimension.

[0192] For example, if the number of attention heads is 16, the resource candidate quantity can be a multiple of 16, or a power of 2 that divides the number of heads, such as 16, 8, 4 or 2. For example, the at least two resource candidate quantities of the first tensor parallel domain are denoted as $P_{pT}$.
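Sub-step three can be sketched as a simple enumeration. The function name and the example values (16 heads, 8 tensor parallel resources) are hypothetical; starting the search at 2 reflects the statement in [0187] that the quantity is an integer greater than 1.

```python
def candidate_quantities(num_heads, total_resources):
    """List tensor parallel sizes compatible with the head count and the device."""
    candidates = []
    for size in range(2, total_resources + 1):
        # Must be a factor of the total number of tensor parallel resources.
        divides_device = total_resources % size == 0
        # Must be a factor or a multiple of the number of attention heads.
        fits_heads = num_heads % size == 0 or size % num_heads == 0
        if divides_device and fits_heads:
            candidates.append(size)
    return candidates


print(candidate_quantities(num_heads=16, total_resources=8))
```

With 12 resources instead of 8, the sizes 3, 6 and 12 divide the device count but do not fit 16 heads, so only 2 and 4 remain.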

[0193] Sub-step four: Obtain the first latency threshold of the first network, and the device computing resources and model computing resources required by the first network corresponding to at least two resource candidate quantities;

[0194] The first latency threshold is the latency allowed for the first network to perform computation when the prediction model is called, such as the latency allowed by the service-level objective (SLO) for calling a large language model. The device computing resources corresponding to a resource candidate quantity indicate the computing capacity that this quantity of tensor parallel resources provides for the prediction model; the model computing resources required by the first network indicate the computing power required when the first network is called using this quantity of tensor parallel resources.

[0195] Sub-step five: Based on the first delay threshold and the device computing resources and model computing resources corresponding to at least two resource candidate quantities, determine the batch processing quantity corresponding to each of the at least two resource candidate quantities;

[0196] Taking one of the at least two resource candidate quantities as an example, the batch processing quantity is the product of a first latency threshold and a first ratio, where the first ratio is the quotient of the device computing resources divided by the model computing resources. The batch processing quantity corresponding to each of the at least two resource candidate quantities is calculated separately.

[0197] In one example, the batch size corresponding to the number of resource candidates is determined by using a second ratio that is less than or equal to a first delay threshold as a constraint; where the second ratio is the product of the batch size and the model's computing resources, divided by the device's computing resources. In one example:

[0198]

[0199] Where $B_p$ is the batch processing quantity (also known as the batch size) for calling the first network (e.g., the pre-filling network);

[0200] $L$ is the number of layers in the prediction network (e.g., the number of layers in a large language model);

[0201] $\text{sum}_{\text{pcomp-attn}}$ is the computing power required to compute a single attention layer in the first network;

[0202] $\text{sum}_{\text{pcomp-mlp}}$ is the computing power required to compute a single multilayer perceptron (MLP) in the first network;

[0203] $\text{sum}_{\text{pcomp-inputEm}}$ is the computing power required to compute the input embedding layer in the first network;

[0204] $\text{sum}_{\text{pcomp-outEm}}$ is the computing power required to compute the output embedding layer in the first network;

[0205] $C_{\text{comp}}$ is the total computing power of the resource candidate quantity of tensor parallel resources; for example, its unit is TFLOPS, i.e. one trillion (10^12) floating-point operations per second;

[0206] $\eta_{\text{comp}}$ is the overall computing efficiency of the resource candidate quantity of tensor parallel resources;

[0207] $T_p$ is the first delay threshold.
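The formula of [0198] is not reproduced in this text. Read together with [0196], [0197] and the symbol definitions above, the constraint may take a form like the following. This is a reconstruction under assumptions rather than the published formula: in particular, it assumes that the per-layer attention and MLP terms are multiplied by $L$, that the two embedding terms are counted once, and that the device computing resources equal $C_{\text{comp}} \cdot \eta_{\text{comp}}$.

```latex
B_p \cdot
\frac{L\left(\text{sum}_{\text{pcomp-attn}} + \text{sum}_{\text{pcomp-mlp}}\right)
      + \text{sum}_{\text{pcomp-inputEm}} + \text{sum}_{\text{pcomp-outEm}}}
     {C_{\text{comp}} \cdot \eta_{\text{comp}}}
\le T_p
```

Under these assumptions, taking the constraint with equality gives the product form of [0196]: the first delay threshold multiplied by the quotient of the device computing resources and the model computing resources.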

[0208] Sub-step six: Determine the first resource candidate quantity from at least two resource candidate quantities as the first quantity;

[0209] The first batch processing quantity corresponding to the first resource candidate quantity is the maximum value among the batch processing quantities corresponding to at least two resource candidate quantities. With the goal of maximizing the batch processing quantity, the first resource candidate quantity is selected from at least two resource candidate quantities and determined as the first quantity.
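Sub-steps five and six can be sketched together: compute a batch processing quantity for each candidate as the latency threshold times the ratio of device to model computing resources, then keep the candidate with the largest value. All names and numbers below are hypothetical; in particular, the per-candidate compute and efficiency figures are invented for illustration, and rounding down to an integer batch is an assumption.

```python
import math


def batch_quantity(latency_threshold, device_compute, efficiency, model_compute):
    """Largest batch whose estimated compute time stays within the threshold."""
    return math.floor(latency_threshold * device_compute * efficiency / model_compute)


def select_first_quantity(profiles, latency_threshold, model_compute):
    """Return (candidate, batch) for the candidate with the largest batch quantity."""
    batches = {
        size: batch_quantity(latency_threshold, compute, eff, model_compute)
        for size, (compute, eff) in profiles.items()
    }
    best = max(batches, key=batches.get)
    return best, batches[best]


# Hypothetical profiles: candidate quantity -> (total TFLOPS, compute efficiency).
profiles = {2: (200.0, 0.75), 4: (400.0, 0.5), 8: (800.0, 0.125)}
best, batch = select_first_quantity(profiles, latency_threshold=0.5, model_compute=10.0)
print(best, batch)
```

In this toy profile the largest candidate does not win, because its assumed efficiency drops sharply; whether that happens on real hardware depends on measurements that the sketch does not model.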

[0210] Step 504: Determine the second number of tensor parallel resources in the second tensor parallel domain based on the third number of attention heads.

[0211] The prediction model includes a third number of attention heads based on the attention mechanism. In one example, the third number is the number of attention heads included in the second network of the prediction model. Further, the second number is a multiple or factor of the third number; further still, the second number is an integer greater than 1. By determining the second number to be a multiple or factor of the third number, it ensures that when the second network is invoked, each tensor parallel resource provides resources for the same number of attention heads, or that each invocation of an attention head utilizes resources provided by the same number of tensor parallel resources.

[0212] In one alternative implementation, step 504 can be implemented as sub-steps seven through ten as follows:

[0213] Sub-step seven: Based on the third quantity, determine at least two candidate resource quantities for the second tensor parallel domain;

[0214] Each of the at least two resource candidate quantities is a factor or multiple of the third quantity, and each resource candidate quantity is a factor of the total number of tensor parallel resources provided by the computer device.

[0215] Furthermore, the at least two resource candidate quantities are determined under the constraint that each resource candidate quantity is less than the first quantity; this corresponds to the case, described in various embodiments of this application, where the first quantity is greater than the second quantity. It is understood that each of the at least two resource candidate quantities is a factor of the first quantity, and the second quantity is an integer greater than 1. For example, the at least two resource candidate quantities of the second tensor parallel domain are denoted as $P_{dT}$.

[0216] Sub-step eight: Obtain the second latency threshold of the second network, and the device transmission resources corresponding to at least two resource candidate quantities and the model transmission resources required by the second network;

[0217] The second latency threshold is the latency allowed for the second network to perform computation when the prediction model is called, such as the latency allowed by the service-level objective (SLO) for calling a large language model. The device transmission resources corresponding to a resource candidate quantity indicate the read / write capability of this quantity of tensor parallel resources for the cached model parameters, model input information, or model output information of the prediction model; the model transmission resources required by the second network indicate the read / write capability required when the second network is called using this quantity of tensor parallel resources.

[0218] Sub-step nine: Based on the second delay threshold and the device transmission resources and model transmission resources corresponding to the at least two resource candidate quantities respectively, determine the batch processing quantity corresponding to the at least two resource candidate quantities respectively;

[0219] Taking one of the at least two resource candidate quantities as an example, the batch size is the product of the second latency threshold and the third ratio, where the third ratio is the quotient of the device transmission resources divided by the model transmission resources. The batch size corresponding to each of the at least two resource candidate quantities is calculated separately.

[0220] In one example, the batch size corresponding to a resource candidate quantity is determined under the constraint that a fourth ratio is less than or equal to the second delay threshold, where the fourth ratio is the product of the batch size and the model transmission resources, divided by the device transmission resources. In one example:

[0221]

[0222] Where $B_d$ is the batch processing quantity (batch size) for calling the second network (e.g., the decoding network);

[0223] $L$ is the number of layers in the prediction network (e.g., the number of layers in a large language model);

[0224] $\text{sum}_{\text{dw-inEm}}$ is the parameter size of the input embedding layer in the second network;

[0225] $\text{sum}_{\text{dw-attn}}$ is the parameter size of a single attention mechanism layer in the second network;

[0226] $\text{sum}_{\text{dw-mlp}}$ is the parameter size of a single multilayer perceptron in the second network;

[0227] $\text{sum}_{\text{dw-outEm}}$ is the parameter size of the output embedding layer in the second network;

[0228] $\text{sum}_{\text{dkv-attn}}$ is the size of the key-value cache (KV Cache) generated by each word in a single attention mechanism layer in the second network;

[0229] $N_d$ is the size of the output sequence of the prediction model (such as a large language model);

[0230] $B_{\text{mem}}$ is the memory access bandwidth of the resource candidate quantity of tensor parallel resources; for example, its unit is terabytes per second (TB / s);

[0231] $\eta_{\text{mem}}$ is the memory access efficiency of the resource candidate quantity of tensor parallel resources;

[0232] $T_d$ is the second delay threshold.
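The formula of [0221] is not reproduced in this text. Following the wording of [0219] and [0220], the constraint has the general shape below, where $M_d$ is a stand-in symbol (not used in this application) for the model transmission resources required by the second network. How the parameter sizes, the per-word KV cache size and $N_d$ combine into $M_d$ cannot be recovered from this text and is left open; treating the device transmission resources as $B_{\text{mem}} \cdot \eta_{\text{mem}}$ is also an assumption.

```latex
\frac{B_d \cdot M_d}{B_{\text{mem}} \cdot \eta_{\text{mem}}} \le T_d
\quad\Longrightarrow\quad
B_d \le \frac{T_d \cdot B_{\text{mem}} \cdot \eta_{\text{mem}}}{M_d}
```

The right-hand bound matches the product form of [0219]: the second delay threshold multiplied by the quotient of the device transmission resources and the model transmission resources.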

[0233] Sub-step ten: Determine the second resource candidate quantity from at least two resource candidate quantities as the second quantity;

[0234] The second batch size corresponding to the second resource candidate size is the maximum value among the batch sizes corresponding to at least two resource candidate sizes. With the goal of maximizing the batch size, a second resource candidate size is selected from at least two resource candidate sizes and determined as the second size. Optionally, the constraint condition for the second resource candidate size also includes that the second resource candidate size is less than the first size.
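For the second network the same selection is driven by memory bandwidth rather than computing power, with the extra constraints of [0215] and [0234]. The following is a minimal sketch under assumptions: the function name, the bandwidth and efficiency figures, and the single `model_bytes` value standing in for the model transmission resources are all hypothetical.

```python
import math


def select_second_quantity(first_quantity, profiles, latency_threshold, model_bytes):
    """Pick the quantity for the second network with the largest batch size.

    profiles maps candidate -> (memory bandwidth in TB/s, access efficiency).
    model_bytes is the data volume in TB moved per batched request; how it is
    composed from parameter and KV cache sizes is an assumption of this sketch.
    """
    best, best_batch = None, -1
    for size, (bandwidth, eff) in sorted(profiles.items()):
        # Constraints from the text: greater than 1, less than the first
        # quantity, and a factor of the first quantity.
        if size <= 1 or size >= first_quantity or first_quantity % size != 0:
            continue
        batch = math.floor(latency_threshold * bandwidth * eff / model_bytes)
        if batch > best_batch:
            best, best_batch = size, batch
    return best, best_batch


# Hypothetical profiles: candidate quantity -> (TB/s, memory access efficiency).
profiles = {2: (4.0, 0.5), 4: (8.0, 0.5), 8: (16.0, 0.5)}
result = select_second_quantity(8, profiles, latency_threshold=0.25, model_bytes=0.0625)
print(result)
```

The candidate 8 is skipped here because it is not less than the first quantity, even though it would otherwise give the largest batch.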

[0235] In summary, the method provided in this embodiment ensures that each tensor parallel resource provides resources for the same number of attention heads, or that each attention head call utilizes resources provided by the same number of tensor parallel resources, by determining the first and second quantities as multiples or factors of the third quantity. This guarantees load balancing of the resources provided by the tensor parallel resources when calling the first network. Based on the resource requirements of the first and second networks for computation, the first and second networks utilize different quantities of tensor parallel resources. This achieves flexible provision of tensor parallel resources for the first and second networks, avoiding resource waste caused by providing excess tensor parallel resources to the networks while meeting their computational needs, thus ensuring the utilization efficiency of tensor parallel resources. Compared to related technologies that provide the same tensor parallel resources to each network in the prediction model, this method avoids the waste of tensor parallel resources.

[0236] Those skilled in the art will understand that the above embodiments can be implemented independently, or the above embodiments can be freely combined to create new embodiments to implement the information processing method of this application.

[0237] The above describes the information processing method provided by the embodiments of this application. Corresponding to the above method, the embodiments of this application also provide an information processing apparatus. This apparatus is applied to a computer device. The apparatus uses the modules shown in Figure 9 to perform the information processing methods performed by the computer device in Figures 3-6 or Figure 8 above. The information processing apparatus provided in this embodiment includes the following modules:

[0238] The acquisition module 810 is used to acquire the first input information of the prediction model, which includes a first network and a second network cascaded together.

[0239] Processing module 820 is used to call the first network and use the resources in the first tensor parallel domain to perform calculations on the first input information to obtain the first feature information. The first tensor parallel domain includes a first number of tensor parallel resources provided by the computer device. The first number is positively correlated with at least one of the computing resources, storage resources and transmission resources required by the first network to perform calculations.

[0240] The processing module 820 is also used to call the second network and use the resources in the second tensor parallel domain to perform calculations on the first feature information to obtain the first output information; the second tensor parallel domain includes a second number of tensor parallel resources provided by the computer device, and the second number is positively correlated with at least one of the computing resources, storage resources and transmission resources required by the second network to perform calculations;

[0241] There is an overlap between the first tensor parallel domain and the second tensor parallel domain, and the first quantity and the second quantity are different.
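The division of work between acquisition module 810 and processing module 820 can be sketched as two small classes. This is only an illustration under assumptions: the class names, the callable-based networks, and the toy data are hypothetical, and the constructor checks simply restate the two conditions of [0241].

```python
class AcquisitionModule:
    """Hypothetical stand-in for acquisition module 810."""

    def acquire(self, source):
        return source


class ProcessingModule:
    """Hypothetical stand-in for processing module 820."""

    def __init__(self, first_network, second_network, first_domain, second_domain):
        first_domain, second_domain = set(first_domain), set(second_domain)
        if not first_domain & second_domain:
            raise ValueError("the two tensor parallel domains must overlap")
        if len(first_domain) == len(second_domain):
            raise ValueError("the first and second quantities must differ")
        self.first_network, self.second_network = first_network, second_network
        self.first_domain, self.second_domain = first_domain, second_domain

    def process(self, first_input):
        # First network on the first domain, then second network on the second.
        feature = self.first_network(first_input, self.first_domain)
        return self.second_network(feature, self.second_domain)


# Toy networks that only record how many resources they were given.
def toy_first(x, domain):
    return (x, len(domain))


def toy_second(feature, domain):
    return feature + (len(domain),)


module = ProcessingModule(toy_first, toy_second, range(8), range(2))
output = module.process(AcquisitionModule().acquire("prompt"))
print(output)
```

Passing two disjoint domains, or two domains of equal size, makes the constructor raise an error in this sketch.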

[0242] In one optional implementation of this embodiment, the second quantity is less than the first quantity;

[0243] The acquisition module 810 is also used to acquire the second input information;

[0244] The processing module 820 is also used to call the prediction model, use the resources in the third tensor parallel domain to perform calculations on the second input information, and obtain the second output information.

[0245] There is an overlap between the third tensor parallel domain and the first tensor parallel domain, and the intersection between the third tensor parallel domain and the second tensor parallel domain is empty.

[0246] In an optional implementation of this embodiment, the acquisition module 810 is further configured to acquire third input information;

[0247] The processing module 820 is also used to call the first network and use the resources in the first tensor parallel domain to perform calculations on the third input information to obtain the third feature information;

[0248] The processing module 820 is also used to call the second network, use the resources in the second tensor parallel domain to perform calculations on the third feature information, and obtain the third output information.

[0249] In an optional implementation of this embodiment, the processing module 820 is further configured to:

[0250] When the computation of the first network is performed using the first tensor parallel resources in the third tensor parallel domain, the first network is invoked, and the third input information and the second input information are computed in batches using the second tensor parallel domain and the first tensor parallel resources to obtain the third feature information.

[0251] And / or, when the computation of the second network is performed using the second tensor parallel resources in the third tensor parallel domain, the computation of the second network is paused, and the first network is invoked to perform computation on the third input information using the second tensor parallel domain and the second tensor parallel resources to obtain the third feature information.

[0252] In an optional implementation of this embodiment, the processing module 820 is further configured to:

[0253] When the computation of the second network is performed using the second tensor parallel resources, the second network is invoked, and the third feature information and the second feature information are computed in batches using the second tensor parallel domain and the second tensor parallel resources to obtain the third output information.

[0254] In one optional implementation of this embodiment, the first number is less than the second number;

[0255] The acquisition module 810 is also used to acquire the second input information;

[0256] The processing module 820 is also used to call the prediction model, use the resources in the third tensor parallel domain to perform calculations on the second input information, and obtain the second output information;

[0257] There is an overlap between the third tensor parallel domain and the second tensor parallel domain, and the intersection between the third tensor parallel domain and the first tensor parallel domain is empty.

[0258] In an optional implementation of this embodiment, the processing module 820 is further configured to:

[0259] When the computation of the first network is performed using the third tensor parallel resources in the third tensor parallel domain, the computation performed by the first network is paused, and the second network is invoked to perform computation on the first feature information using the first tensor parallel domain and the third tensor parallel resources to obtain the first output information.

[0260] And / or, when the computation of the second network is performed using the fourth tensor parallel resource in the third tensor parallel domain, the second network is invoked, and the computation of the second feature information corresponding to the first feature information and the second input information is performed in batches using the first tensor parallel domain and the fourth tensor parallel resource to obtain the first output information.

[0261] In an optional implementation of this embodiment, the processing module 820 is further configured to:

[0262] The first network is invoked, and the resources in the third tensor parallel domain are used to perform calculations on the second input information to obtain the second feature information.

[0263] The second network is invoked, and the resources in the third tensor parallel domain are used to perform calculations on the second feature information to obtain the second output information.

[0264] In an optional implementation of this embodiment, the first network includes a pre-filling network, and the second network includes a decoding network.

[0265] In one optional implementation of this embodiment, the second number is less than the first number; the apparatus further includes:

[0266] The setting module 830 is used to add at least one tensor parallel resource to the waiting queue when the load information of calling the second network to perform calculation on the first feature information exceeds the first load threshold. The at least one tensor parallel resource belongs to the first tensor parallel domain and does not belong to the second tensor parallel domain.

[0267] The at least one tensor parallel resource in the waiting queue is to be added to the second tensor parallel domain once the first output information is obtained.
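The waiting-queue behaviour described above can be sketched as follows. This is a minimal illustration, not the setting module itself: the class `DomainScheduler`, its method names, the `count` parameter and the numeric load values are all assumptions made for the example.

```python
from collections import deque


class DomainScheduler:
    """Toy model of growing the second domain through a waiting queue."""

    def __init__(self, first, second, load_threshold):
        self.first = set(first)
        self.second = set(second)
        self.load_threshold = load_threshold
        self.waiting = deque()

    def on_load(self, load, count=1):
        """Queue up to `count` resources that are in the first domain but not the second."""
        if load <= self.load_threshold:
            return
        already_queued = set(self.waiting)
        candidates = sorted(self.first - self.second - already_queued)
        self.waiting.extend(candidates[:count])

    def on_first_output(self):
        """Once the first output information is obtained, move queued resources into the second domain."""
        while self.waiting:
            self.second.add(self.waiting.popleft())


# Hypothetical values: four resources, two of them in the second domain.
scheduler = DomainScheduler(first={0, 1, 2, 3}, second={0, 1}, load_threshold=0.8)
scheduler.on_load(0.9)
scheduler.on_first_output()
```

Deferring the change until the output is obtained means the domain is only resized between computations, never in the middle of one.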

[0268] In one optional implementation of this embodiment, the first number is less than the second number; the apparatus further includes:

[0269] The setting module 830 is used to add at least one tensor parallel resource to the waiting queue when the load information of calling the first network to perform calculations on the first input information exceeds the second load threshold. The at least one tensor parallel resource belongs to the second tensor parallel domain and does not belong to the first tensor parallel domain.

[0270] The at least one tensor parallel resource in the waiting queue is to be added to the first tensor parallel domain once the first feature information is obtained.

[0271] In an optional implementation of this embodiment, the prediction model includes a third number of attention heads based on an attention mechanism, and the device further includes:

[0272] The determination module 840 is used to determine the first number of tensor parallel resources in the first tensor parallel domain based on the third number of attention heads.

[0273] In an optional implementation of this embodiment, the determining module 840 is further configured to:

[0274] Based on the third number, at least two resource candidate quantities for the first tensor parallel domain are determined, each of the at least two resource candidate quantities being a factor or multiple of the third number, and each resource candidate quantity being a factor of the total number of tensor parallel resources provided by the computer device.

[0275] Obtain the first latency threshold of the first network, and the device computing resources and model computing resources required by the first network corresponding to at least two resource candidate quantities;

[0276] Based on the first latency threshold, and the device computing resources and model computing resources corresponding to each of the at least two resource candidate quantities, determine the batch processing quantity corresponding to each of the at least two resource candidate quantities;

[0277] The first resource candidate quantity among the at least two resource candidate quantities is determined as the first number, where the first batch processing quantity corresponding to the first resource candidate quantity is the maximum among the batch processing quantities corresponding to the at least two resource candidate quantities.
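The selection steps above can be sketched as follows. The application does not give a formula for the batch processing quantity, so `max_batch` uses a toy cost model assumed for this example only; the function names and all numeric values are likewise hypothetical.

```python
def candidate_sizes(num_heads, total_resources):
    """Sizes that divide the total resource count and are a factor or multiple of num_heads."""
    return [
        n
        for n in range(1, total_resources + 1)
        if total_resources % n == 0 and (num_heads % n == 0 or n % num_heads == 0)
    ]


def max_batch(latency_threshold, device_compute, model_compute):
    """Toy estimate: largest batch b with b * model_compute <= latency_threshold * device_compute."""
    return (latency_threshold * device_compute) // model_compute


def choose_first_number(candidates, latency_threshold, device_compute, model_compute):
    """Return the candidate with the largest estimated batch, and that batch."""
    if len(candidates) < 2:
        raise ValueError("the text assumes at least two resource candidate quantities")
    batches = {
        n: max_batch(latency_threshold, device_compute[n], model_compute[n])
        for n in candidates
    }
    best = max(batches, key=batches.get)
    return best, batches[best]


# Hypothetical inputs in arbitrary integer units, keyed by candidate size.
sizes = candidate_sizes(num_heads=8, total_resources=8)
device_compute = {1: 10, 2: 20, 4: 40, 8: 80}
model_compute = {1: 100, 2: 110, 4: 150, 8: 400}
first_number, first_batch = choose_first_number(sizes, 100, device_compute, model_compute)
```

With these made-up numbers the per-candidate model cost grows faster than the device compute at the largest size, so the largest candidate is not the one chosen.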

[0278] In an optional implementation of this embodiment, the prediction model includes a third number of attention heads based on an attention mechanism, and the device further includes:

[0279] The determination module 840 is used to determine the second number of tensor parallel resources in the second tensor parallel domain based on the third number of attention heads.

[0280] In an optional implementation of this embodiment, the determining module 840 is further configured to:

[0281] Based on the third number, at least two resource candidate quantities for the second tensor parallel domain are determined, each of the at least two resource candidate quantities being a factor or multiple of the third number, and each resource candidate quantity being a factor of the total number of tensor parallel resources provided by the computer device.

[0282] Obtain the second latency threshold of the second network, and the device transmission resources corresponding to at least two resource candidate quantities and the model transmission resources required by the second network, respectively;

[0283] Based on the second latency threshold and the device transmission resources and model transmission resources corresponding to each of the at least two resource candidate quantities, the batch processing quantity corresponding to each of the at least two resource candidate quantities is determined.

[0284] The second resource candidate quantity among the at least two resource candidate quantities is determined as the second number, where the second batch processing quantity corresponding to the second resource candidate quantity is the maximum among the batch processing quantities corresponding to the at least two resource candidate quantities.
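For the second network, a transmission-bound variant of the estimate can be sketched in the same spirit. This is only a toy model under assumptions introduced here: `bytes_per_ms` stands in for the device transmission resources of a candidate, while `weight_bytes` and `kv_bytes` stand in for the model transmission resources. Neither these names nor the cost formula come from the application.

```python
def decode_batch(latency_ms, bytes_per_ms, weight_bytes, kv_bytes_per_request):
    """Toy estimate of the largest batch that fits the latency threshold.

    Assumes one step must move the weights once plus one cache slice
    per request, all within `latency_ms`.
    """
    budget = latency_ms * bytes_per_ms - weight_bytes
    return max(0, budget // kv_bytes_per_request)


def choose_second_number(candidates, latency_ms, bytes_per_ms, weight_bytes, kv_bytes):
    """Pick the candidate size whose estimated batch is largest."""
    batches = {
        n: decode_batch(latency_ms, bytes_per_ms[n], weight_bytes[n], kv_bytes[n])
        for n in candidates
    }
    best = max(batches, key=batches.get)
    return best, batches[best]


# Hypothetical numbers in arbitrary units, keyed by candidate size.
candidates = [1, 2, 4]
bytes_per_ms = {1: 1000, 2: 1900, 4: 3400}
weight_bytes = {1: 40000, 2: 40000, 4: 40000}
kv_bytes = {1: 100, 2: 100, 4: 100}
second_number, second_batch = choose_second_number(
    candidates, 50, bytes_per_ms, weight_bytes, kv_bytes
)
```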

[0285] It should be understood that the beneficial effects of the apparatus provided in the above embodiments are the same as those of the information processing method described above when implementing its functions, and will not be repeated here. Furthermore, the information processing apparatus described above is only illustrated by the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process is detailed in the method embodiments, and will not be repeated here.

[0286] Figure 10 shows a schematic diagram of an exemplary information processing device 1200 of this application. The information processing device 1200 includes at least one processor 1201, a memory 1203, and at least one network interface 1204. The information processing device 1200 is also referred to as a computer device.

[0287] Processor 1201 may be, for example, a general-purpose central processing unit (CPU), a digital signal processor (DSP), a network processor (NP), a graphics processing unit (GPU), a neural-network processing unit (NPU), a data processing unit (DPU), a microprocessor, or one or more integrated circuits or application-specific integrated circuits (ASICs), programmable logic devices (PLDs), other general-purpose processors or other programmable logic devices, discrete gates, transistor logic devices, discrete hardware components, or any combination thereof for implementing the scheme of this application. A PLD may be, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. A general-purpose processor may be a microprocessor or any conventional processor. It is worth noting that the processor may be a processor supporting the advanced RISC machine (ARM) architecture, where RISC stands for reduced instruction set computer. The processor can implement or execute the various logic blocks, modules, and circuits described in conjunction with the disclosure of this application. The processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.

[0288] Optionally, the information processing device 1200 also includes a bus 1202. The bus 1202 is used to transmit information between the components of the information processing device 1200. The bus 1202 can be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus 1202 can be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one line is used to represent it in the figure, but this does not mean that there is only one bus or one type of bus.

[0289] The memory 1203 may be, for example, volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache.

[0290] By way of example, but not limitation, many forms of ROM and RAM are available. For example, the ROM may be a compact disc read-only memory (CD-ROM). RAM includes, but is not limited to, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous linked dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).

[0291] The memory 1203 can also be other types of storage devices capable of storing static information and instructions. Alternatively, it can be other types of dynamic storage devices capable of storing information and instructions. It can also be optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures that can be accessed by a computer, but is not limited thereto. The memory 1203 may, for example, exist independently and be connected to the processor 1201 via the bus 1202. The memory 1203 may also be integrated with the processor 1201.

[0292] Network interface 1204 uses any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, radio access network (RAN), or wireless local area network (WLAN). Network interface 1204 may include wired network interfaces and wireless network interfaces. Specifically, network interface 1204 can be an Ethernet interface, such as Fast Ethernet (FE), Gigabit Ethernet (GE), Asynchronous Transfer Mode (ATM), WLAN, cellular network, or combinations thereof. The Ethernet interface can be an optical interface, an electrical interface, or a combination thereof. In some embodiments of this application, network interface 1204 can be used for information processing device 1200 to communicate with other devices.

[0293] In specific implementations, as some embodiments, processor 1201 may include one or more CPUs, such as CPU0 and CPU1 shown in the figure. Each of these processors may be a single-core processor or a multi-core processor. Here, "processor" may refer to one or more devices, circuits, and / or processing cores for processing data (e.g., computer program instructions).

[0294] In specific implementations, as some embodiments, the information processing device 1200 may include multiple processors, such as processor 1201 and processor 1205 shown in the figure. Each of these processors may be a single-core processor or a multi-core processor. Here, "processor" may refer to one or more devices, circuits, and / or processing cores used for processing data (such as computer program instructions).

[0295] In some embodiments, the memory 1203 is used to store program instructions 1210 for executing the present application solution, and the processor 1201 can execute the program instructions 1210 stored in the memory 1203. That is, the information processing device 1200 can implement the method provided in the method embodiment through the processor 1201 and the program instructions 1210 in the memory 1203. The program instructions 1210 may include one or more software modules. Optionally, the processor 1201 itself may also store program instructions for executing the present application solution.

[0296] In specific implementation, the information processing device 1200 of this application can correspond to the first network element device for executing the above method. The processor 1201 in the information processing device 1200 reads the instructions in the memory 1203, so that the information processing device 1200 can execute all or part of the steps in the method embodiment.

[0297] The information processing device 1200 can also correspond to the information processing apparatus described above, where each functional module is implemented using software from the information processing device 1200. In other words, the functional modules included in the information processing apparatus are generated by the processor 1201 of the information processing device 1200 reading the program instructions 1210 stored in the memory 1203.

[0298] Each step of the information processing method is completed through integrated logic circuits in the hardware or instructions in the software form of the processor in the information processing device 1200. The steps of the method embodiments disclosed in this application can be directly implemented by the hardware processor, or implemented by a combination of hardware and software modules in the processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. Since the storage medium is located in memory, the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method embodiments; to avoid repetition, these will not be described in detail here.

[0299] Figure 11 shows a schematic diagram of an exemplary information processing device 1300 of this application. The information processing device 1300 includes a main control board 1310 and an interface board 1330. The information processing device 1300 is used to perform the operations involved in the information processing method described above. This information processing device 1300 is, for example, a server. The information processing device 1300 is also called a computer device.

[0300] The main control board 1310, also known as the main processing unit (MPU) or route processor card, is used to control and manage the various components in the information processing device 1300, including routing calculation, device management, device maintenance, and protocol processing functions. The main control board 1310 includes a central processing unit 1311 and a memory 1312.

[0301] Interface board 1330 is also known as a line processing unit (LPU), linecard, or service board. Interface board 1330 provides various service interfaces and implements packet forwarding. Service interfaces include, but are not limited to, Ethernet interfaces, POS (Packet over SONET / SDH) interfaces, etc., with Ethernet interfaces including, for example, flexible Ethernet clients (FlexE Clients). Interface board 1330 includes: a central processing unit 1331, a network processor 1332, a forwarding table entry memory 1334, and a physical interface card (PIC) 1333.

[0302] The central processing unit 1331 on the interface board 1330 is used to control and manage the interface board 1330 and communicate with the central processing unit 1311 on the main control board 1310.

[0303] Network processor 1332 is used to implement packet forwarding processing. Network processor 1332 can be in the form of a forwarding chip. Specifically, network processor 1332 forwards received packets based on the forwarding table stored in forwarding table entry memory 1334. If the destination address of the packet is the address of information processing device 1300, the packet is sent to the CPU (such as central processing unit 1311) for processing; if the destination address of the packet is not the address of information processing device 1300, the next hop and outgoing interface corresponding to the destination address are looked up in the forwarding table according to the destination address, and the packet is forwarded to the outgoing interface corresponding to the destination address. Uplink packet processing includes: packet ingress interface processing, forwarding table lookup; downlink packet processing includes forwarding table lookup, etc.
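The forwarding decision described above can be sketched as a small lookup function. This is a simplified illustration under stated assumptions: the forwarding table is modelled as an exact-match dictionary, the function name `forward` and the addresses are hypothetical, and the "drop" result for a missing entry is an assumption because the text does not say what happens on a lookup miss.

```python
def forward(dest, local_address, forwarding_table):
    """Decide what to do with a packet based on its destination address.

    Returns ("cpu", None) for packets addressed to this device, or
    ("forward", (next_hop, out_interface)) when the table has an entry.
    """
    if dest == local_address:
        return ("cpu", None)
    entry = forwarding_table.get(dest)
    if entry is None:
        # Assumption: behaviour on a lookup miss is not specified in the text.
        return ("drop", None)
    return ("forward", entry)


# Hypothetical table: destination -> (next hop, outgoing interface).
table = {"10.0.0.2": ("10.0.1.1", "eth1")}
local_result = forward("10.0.0.1", "10.0.0.1", table)
transit_result = forward("10.0.0.2", "10.0.0.1", table)
```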

[0304] The physical interface card 1333 is used to implement physical layer interfacing functions. Raw traffic enters the interface board 1330 through this card, and processed packets are sent out from the physical interface card 1333. The physical interface card 1333, also known as a daughter card, can be installed on the interface board 1330. It is responsible for converting photoelectric signals into packets, performing validity checks on the packets, and forwarding them to the network processor 1332 for processing. In some implementations, the central processing unit can also perform the functions of the network processor 1332, such as implementing software forwarding based on a general-purpose CPU, thus eliminating the need for the network processor 1332 in the physical interface card 1333.

[0305] Optionally, the information processing device 1300 includes multiple interface boards. For example, the information processing device 1300 also includes an interface board 1340, which includes a central processing unit 1341, a network processor 1342, a forwarding table entry memory 1344, and a physical interface card 1343.

[0306] Optionally, the information processing device 1300 also includes a switching fabric board 1320. The switching fabric board 1320 can also be referred to as a switch fabric unit (SFU). When the information processing device has multiple interface boards 1330, the switching fabric board 1320 is used to complete data exchange between the interface boards. For example, interface boards 1330 and 1340 can communicate via the switching fabric board 1320.

[0307] The main control board 1310 and the interface board 1330 are coupled. For example, the main control board 1310, interface board 1330, interface board 1340, and switching fabric board 1320 communicate with each other via a system bus connected to the system backplane. In one possible implementation, an inter-process communication (IPC) channel is established between the main control board 1310 and the interface board 1330, and the main control board 1310 and the interface board 1330 communicate with each other through the IPC channel.

[0308] Logically, the information processing device 1300 includes a control plane and a forwarding plane. The control plane includes a main control board 1310 and a central processing unit 1331, while the forwarding plane includes various components that perform forwarding, such as a forwarding table entry memory 1334, a physical interface card 1333, and a network processor 1332. The control plane performs functions such as router operation, generating forwarding tables, processing signaling and protocol messages, and configuring and maintaining the device's status. The control plane distributes the generated forwarding tables to the forwarding plane. In the forwarding plane, the network processor 1332 uses the forwarding tables distributed by the control plane to look up and forward messages received by the physical interface card 1333. The forwarding tables distributed by the control plane can be stored in the forwarding table entry memory 1334. In some embodiments, the control plane and the forwarding plane can be completely separated and not on the same device.

[0309] It is worth noting that there may be one or more main control boards; when there are multiple, they may include a primary main control board and a backup main control board. There may be one or more interface boards; the stronger the data processing capability of the information processing device, the more interface boards it provides. Each interface board may also have one or more physical interface cards. There may be no switching fabric board, or one or more; when there are multiple, they can share the load and provide redundancy. In a centralized forwarding architecture, the information processing device may not need a switching fabric board, and the interface board handles the service data processing of the entire system. In a distributed forwarding architecture, the information processing device can have at least one switching fabric board, which enables data exchange between multiple interface boards and provides high-capacity data exchange and processing capabilities. Therefore, the data access and processing capabilities of an information processing device with a distributed architecture are greater than those of a device with a centralized architecture. Alternatively, the information processing device can also take the form of a single board without a switching fabric board, with the functions of the interface board and the main control board integrated on that board. In this case, the central processing unit on the interface board and the central processing unit on the main control board can be combined into a single central processing unit that performs the combined functions of both. This type of device has lower data exchange and processing capabilities (e.g., low-end switches or routers). The specific architecture adopted depends on the specific network deployment scenario, and no restrictions are imposed here.

[0310] In an exemplary embodiment, a computer program (product) is provided, comprising: computer program code, which, when executed by a computer, causes the computer to perform an information processing method.

[0311] In an exemplary embodiment, a computer-readable storage medium is provided that stores a program or instructions, which, when executed on a computer, cause the computer to perform the aforementioned information processing method.

[0312] In an exemplary embodiment, a chip is provided, including a processor for calling and executing instructions stored in a memory, such that a computer on which the chip is mounted performs the method shown in the figure.

[0313] In an exemplary embodiment, another chip is provided, including: an input interface, an output interface, a processor, and a memory. The input interface, the output interface, the processor, and the memory are connected through an internal connection path. The processor is used to execute code in the memory. When the code is executed, a computer with the chip installed performs an information processing method.

[0314] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium accessible to a computer or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive).

[0315] In this application, the terms "first," "second," etc., are used to distinguish identical or similar items with substantially the same function. It should be understood that there is no logical or temporal dependency between "first," "second," and "nth," nor does it limit the quantity or order of execution. It should also be understood that although the following description uses the terms "first," "second," etc., to describe various elements, these elements should not be limited by the terms. These terms are merely used to distinguish one element from another.

[0316] It should also be understood that, in the various embodiments of this application, the sequence number of each process does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0317] In this application, the term "at least one" means one or more, and the term "multiple" means two or more. For example, multiple second devices means two or more second devices. The terms "system" and "network" are often used interchangeably herein.

[0318] It should be understood that the terminology used in the description of the various examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various examples and the appended claims, the singular forms “a” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise.

[0319] It should also be understood that the term "and / or" as used herein refers to and covers any and all possible combinations of one or more of the associated listed items. The term "and / or" describes an association between related objects, indicating that three relationships can exist; for example, A and / or B can represent: A alone, A and B simultaneously, or B alone. Additionally, the character " / " in this application generally indicates that the preceding and following related objects are in an "or" relationship.

[0320] It should also be understood that the term “if” can be interpreted as meaning “when” or “upon”, or “in response to determining” or “in response to detecting”. Similarly, depending on the context, the phrases “if it is determined…” or “if [the stated condition or event] is detected” can be interpreted as meaning “upon determining…”, “in response to determining…”, “upon detecting [the stated condition or event]”, or “in response to detecting [the stated condition or event]”.

[0321] The above description is merely an embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the principles of this application should be included within the protection scope of this application.

Claims

1. An information processing method, characterized in that the method is performed by a computer device and includes: obtaining first input information of a prediction model, the prediction model including a first network and a second network cascaded in sequence; invoking the first network to perform calculations on the first input information using resources in a first tensor parallel domain to obtain first feature information, wherein the first tensor parallel domain includes a first number of tensor parallel resources provided by the computer device, and the first number is positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the first network to perform calculations; invoking the second network to perform calculations on the first feature information using resources in a second tensor parallel domain to obtain first output information, wherein the second tensor parallel domain includes a second number of tensor parallel resources provided by the computer device, and the second number is positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the second network to perform calculations; wherein there is an overlap between the first tensor parallel domain and the second tensor parallel domain, and the first number and the second number are different.

2. The method according to claim 1, characterized in that the second number is less than the first number; and after invoking the first network to perform calculations on the first input information using resources in the first tensor parallel domain to obtain the first feature information, the method further comprises: obtaining second input information, and invoking the prediction model to perform calculations on the second input information using resources in a third tensor parallel domain to obtain second output information; wherein there is an overlap between the third tensor parallel domain and the first tensor parallel domain, and the intersection of the third tensor parallel domain and the second tensor parallel domain is empty.
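
As a non-limiting illustration of the set relationships recited in claim 2, the sketch below derives a third tensor parallel domain from the resources of the first domain that the second domain does not use, and then checks the recited conditions. Choosing the third domain as exactly this set difference is an assumption made for the example; the function names and device ids are likewise hypothetical.

```python
from typing import Set


def remaining_resources(first: Set[int], second: Set[int]) -> Set[int]:
    """Resources of the first domain that the second domain leaves unused."""
    return first - second


def third_domain_is_valid(first: Set[int], second: Set[int], third: Set[int]) -> bool:
    """Check: second smaller than first, third overlaps first, third disjoint from second."""
    return (
        len(second) < len(first)
        and bool(third & first)
        and not (third & second)
    )


# Assumed example with 8 devices.
first_domain = set(range(8))      # used by the first network
second_domain = {0, 1, 2, 3}      # used by the second network
third_domain = remaining_resources(first_domain, second_domain)
```

Under these assumptions, devices 4 to 7 can serve the second input information while devices 0 to 3 are occupied by the second network.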

3. The method according to claim 2, characterized in that after invoking the second network to perform calculations on the first feature information using resources in the second tensor parallel domain to obtain the first output information, the method further comprises: obtaining third input information; invoking the first network to perform calculations on the third input information using resources in the first tensor parallel domain to obtain third feature information; and invoking the second network to perform calculations on the third feature information using resources in the second tensor parallel domain to obtain third output information.

4. The method according to claim 3, characterized in that invoking the first network to perform calculations on the third input information using resources in the first tensor parallel domain to obtain the third feature information comprises: when a first tensor parallel resource in the third tensor parallel domain is being used to perform the computation of the first network, invoking the first network to perform calculations in batches on the third input information and the second input information using the second tensor parallel domain and the first tensor parallel resource to obtain the third feature information; and/or, when a second tensor parallel resource in the third tensor parallel domain is being used to perform the computation of the second network, pausing the computation of the second network, and invoking the first network to perform calculations on the third input information using the second tensor parallel domain and the second tensor parallel resource to obtain the third feature information.

5. The method according to claim 4, characterized in that invoking the second network to perform calculations on the third feature information using resources in the second tensor parallel domain to obtain the third output information comprises: when the second tensor parallel resource is being used to perform the computation of the second network, invoking the second network to perform calculations in batches on the third feature information and the second feature information using the second tensor parallel domain and the second tensor parallel resource to obtain the third output information.

6. The method according to claim 1, characterized in that the first number is less than the second number; and the method further comprises: obtaining second input information; and invoking the prediction model to perform calculations on the second input information using resources in a third tensor parallel domain to obtain second output information; wherein there is an overlap between the third tensor parallel domain and the second tensor parallel domain, and the intersection of the third tensor parallel domain and the first tensor parallel domain is empty.

7. The method according to claim 6, characterized in that invoking the second network to perform calculations on the first feature information using resources in the second tensor parallel domain to obtain the first output information comprises: when a third tensor parallel resource in the third tensor parallel domain is being used to perform the computation of the first network, pausing the computation of the first network, and invoking the second network to perform calculations on the first feature information using the first tensor parallel domain and the third tensor parallel resource to obtain the first output information; and/or, when a fourth tensor parallel resource in the third tensor parallel domain is being used to perform the computation of the second network, invoking the second network to perform calculations in batches on the first feature information and on second feature information corresponding to the second input information using the first tensor parallel domain and the fourth tensor parallel resource to obtain the first output information.

8. The method according to any one of claims 2 to 7, characterized in that invoking the prediction model to perform calculations on the second input information using resources in the third tensor parallel domain to obtain the second output information comprises: invoking the first network to perform calculations on the second input information using resources in the third tensor parallel domain to obtain second feature information; and invoking the second network to perform calculations on the second feature information using resources in the third tensor parallel domain to obtain the second output information.

9. The method according to any one of claims 1 to 8, characterized in that the first network includes a pre-filling network, and the second network includes a decoding network.

10. The method according to any one of claims 1 to 5, 8 or 9, characterized in that the second number is less than the first number; and the method further comprises: if load information for invoking the second network to perform calculations on the first feature information exceeds a first load threshold, adding at least one tensor parallel resource to a waiting queue, wherein the at least one tensor parallel resource belongs to the first tensor parallel domain and does not belong to the second tensor parallel domain, and the at least one tensor parallel resource in the waiting queue is to be added to the second tensor parallel domain upon obtaining the first output information.
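
As a non-limiting illustration of the waiting-queue behaviour recited in claim 10, the sketch below queues resources that belong to the first tensor parallel domain but not to the second when a reported load exceeds a threshold, and moves the queued resources into the second domain once the first output information has been obtained. The class `SecondDomainScaler`, its method names, the numeric load scale, and the `count` parameter are assumptions introduced for this example.

```python
from collections import deque
from typing import Deque, Iterable, Set


class SecondDomainScaler:
    """Hypothetical helper that grows the second domain via a waiting queue."""

    def __init__(self, first_domain: Iterable[int], second_domain: Iterable[int],
                 load_threshold: float) -> None:
        self.first_domain: Set[int] = set(first_domain)
        self.second_domain: Set[int] = set(second_domain)
        self.load_threshold = load_threshold
        self.waiting: Deque[int] = deque()

    def report_load(self, load: float, count: int = 1) -> None:
        """Queue up to `count` resources when the load exceeds the threshold."""
        if load <= self.load_threshold:
            return
        # Eligible: in the first domain, not in the second, not already queued.
        eligible = sorted(self.first_domain - self.second_domain - set(self.waiting))
        self.waiting.extend(eligible[:count])

    def on_first_output_obtained(self) -> None:
        """Add every queued resource to the second domain."""
        while self.waiting:
            self.second_domain.add(self.waiting.popleft())


# Assumed example: 8 devices in the first domain, 4 of them in the second.
scaler = SecondDomainScaler(range(8), range(4), load_threshold=0.8)
```

Deferring the change of domain until the output is obtained means, in this sketch, that the second domain is not resized while a computation on it is still in progress.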

11. The method according to claim 1 or any one of claims 6 to 9, characterized in that the first number is less than the second number; and the method further comprises: if load information for invoking the first network to perform calculations on the first input information exceeds a second load threshold, adding at least one tensor parallel resource to a waiting queue, wherein the at least one tensor parallel resource belongs to the second tensor parallel domain and does not belong to the first tensor parallel domain, and the at least one tensor parallel resource in the waiting queue is to be added to the first tensor parallel domain upon obtaining the first feature information.

12. The method according to any one of claims 1 to 11, characterized in that the prediction model includes a third number of attention heads based on an attention mechanism, and the method further comprises: determining the first number of tensor parallel resources in the first tensor parallel domain based on the third number of attention heads.

13. The method according to claim 12, characterized in that determining the first number of tensor parallel resources in the first tensor parallel domain based on the third number of attention heads comprises: determining, based on the third number, at least two resource candidate quantities for the first tensor parallel domain, each of the at least two resource candidate quantities being a factor or a multiple of the third number and also a factor of the total number of tensor parallel resources provided by the computer device; obtaining a first latency threshold of the first network, and the device computing resources and model computing resources required by the first network for each of the at least two resource candidate quantities; determining, based on the first latency threshold and the device computing resources and model computing resources corresponding to each of the at least two resource candidate quantities, a batch processing quantity corresponding to each of the at least two resource candidate quantities; and determining a first resource candidate quantity among the at least two resource candidate quantities as the first number, wherein a first batch processing quantity corresponding to the first resource candidate quantity is the maximum of the batch processing quantities corresponding to the at least two resource candidate quantities.
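
As a non-limiting illustration of the selection recited in claim 13, the sketch below enumerates resource candidate quantities that are a factor or a multiple of the number of attention heads and also a factor of the total number of tensor parallel resources, estimates a batch processing quantity for each candidate, and selects the candidate with the largest batch processing quantity. The claim does not state how the batch processing quantity is computed; the formula in `batch_quantity` (latency threshold multiplied by device computing resources, divided by model computing resources per sample, rounded down) is an assumption, as are all function names and the example figures.

```python
import math
from typing import Dict, List


def candidate_quantities(num_heads: int, total_resources: int) -> List[int]:
    """Factors of total_resources that are a factor or a multiple of num_heads."""
    return [
        n for n in range(1, total_resources + 1)
        if total_resources % n == 0 and (num_heads % n == 0 or n % num_heads == 0)
    ]


def batch_quantity(latency_threshold: float, device_compute: float,
                   model_compute: float) -> int:
    """Assumed model: the largest batch that finishes within the latency threshold."""
    return math.floor(latency_threshold * device_compute / model_compute)


def choose_first_number(num_heads: int, total_resources: int,
                        latency_threshold: float,
                        device_compute: Dict[int, float],
                        model_compute: Dict[int, float]) -> int:
    """Pick the candidate with the maximum batch quantity (first one wins ties)."""
    candidates = candidate_quantities(num_heads, total_resources)
    if len(candidates) < 2:
        raise ValueError("at least two resource candidate quantities are required")
    return max(
        candidates,
        key=lambda n: batch_quantity(latency_threshold,
                                     device_compute[n], model_compute[n]),
    )


# Assumed figures: 8 attention heads, 4 tensor parallel resources in total.
device_compute = {1: 100.0, 2: 190.0, 4: 340.0}  # compute available per candidate
model_compute = {1: 10.0, 2: 10.0, 4: 20.0}      # compute required per sample
chosen = choose_first_number(8, 4, 0.5, device_compute, model_compute)
```

With these assumed figures the largest candidate is not selected, because its per-sample model computing cost is assumed to grow faster than the device computing resources it adds.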

14. The method according to any one of claims 1 to 13, characterized in that the prediction model includes a third number of attention heads based on an attention mechanism, and the method further comprises: determining the second number of tensor parallel resources in the second tensor parallel domain based on the third number of attention heads.

15. The method according to claim 14, characterized in that determining the second number of tensor parallel resources in the second tensor parallel domain based on the third number of attention heads comprises: determining, based on the third number, at least two resource candidate quantities for the second tensor parallel domain, each of the at least two resource candidate quantities being a factor or a multiple of the third number and also a factor of the total number of tensor parallel resources provided by the computer device; obtaining a second latency threshold of the second network, and the device transmission resources and model transmission resources required by the second network for each of the at least two resource candidate quantities; determining, based on the second latency threshold and the device transmission resources and model transmission resources corresponding to each of the at least two resource candidate quantities, a batch processing quantity corresponding to each of the at least two resource candidate quantities; and determining a second resource candidate quantity among the at least two resource candidate quantities as the second number, wherein a second batch processing quantity corresponding to the second resource candidate quantity is the maximum of the batch processing quantities corresponding to the at least two resource candidate quantities.

16. An information processing apparatus, characterized in that the apparatus comprises: an acquisition module, configured to obtain first input information of a prediction model, the prediction model including a first network and a second network cascaded sequentially; and a processing module, configured to invoke the first network to perform calculations on the first input information using resources in a first tensor parallel domain to obtain first feature information, wherein the first tensor parallel domain includes a first number of tensor parallel resources provided by a computer device, and the first number is positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the first network to perform calculations; the processing module being further configured to invoke the second network to perform calculations on the first feature information using resources in a second tensor parallel domain to obtain first output information, wherein the second tensor parallel domain includes a second number of tensor parallel resources provided by the computer device, and the second number is positively correlated with at least one of the computing resources, storage resources, and transmission resources required by the second network to perform calculations; wherein there is an overlap between the first tensor parallel domain and the second tensor parallel domain, and the first number and the second number are different.

17. A computer device, characterized in that the computer device comprises a processor and a memory, wherein the memory stores at least one program, and the processor is configured to execute the at least one program in the memory to implement the information processing method according to any one of claims 1 to 15.

18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores executable instructions, and the executable instructions are loaded and executed by a processor to implement the information processing method according to any one of claims 1 to 15.

19. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, and a processor reads the computer instructions from the computer-readable storage medium and executes them to implement the information processing method according to any one of claims 1 to 15.