Vector embedding method, apparatus, device and medium based on heterogeneous computing architecture
A heterogeneous computing architecture assigns the network layers of a vector embedding model to general-purpose and intensive computing units according to per-layer execution efficiency indicators. This resolves the resource mismatch that arises when the whole model runs on a single computing unit, improving the system's resource utilization and processing efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN RES INST OF BIG DATA
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, a vector embedding model is deployed on a single computing unit. Parts of the model do not match that unit's computing characteristics, which leaves resources idle or creates processing bottlenecks, and this reduces the system's resource utilization and the processing efficiency of large-scale vector embedding tasks.
A heterogeneous computing architecture is adopted, in which the network layers of the pre-trained vector embedding model are divided between general-purpose computing units, which handle lightweight computing layers, and dense computing units, which handle dense computing layers. The processing latency of each network layer is measured on both kinds of computing unit, an execution efficiency index is calculated from those latencies to guide the division, and vector embedding tasks are then processed in parallel.
It improves the system's resource utilization and processing efficiency for large-scale vector embedding tasks, avoids idle computing power and bottlenecks in computing units, and realizes hierarchical pipelined parallel processing across devices.
Figure CN122197894A_ABST
Description
Technical Field
[0001] This application relates to the field of vector processing technology, and in particular to a vector embedding method, apparatus, device and medium based on a heterogeneous computing architecture.
Background Technology
[0002] With the rapid development of artificial intelligence technology, vector embedding technology has become a core support for semantic understanding and information retrieval. Specifically, vector embedding technology transforms massive amounts of text data into high-dimensional semantic vectors in real time, achieving semantic-level information matching through similarity calculation between vectors. This places extremely high demands on the processing capabilities of vector embedding services. Therefore, it is necessary to improve the processing efficiency of vector embedding services to meet the ever-increasing business needs.
[0003] In related technologies, inference is generally performed by deploying a pre-trained vector embedding model on a single computing unit. However, the structural characteristics of the model result in different parts having different requirements for computing resources. When the computing unit processes the part of the model that does not match its computing power characteristics, it either has excess computing power, resulting in idle resources, or insufficient computing power, forming a processing bottleneck. As a result, the resource utilization rate of the system and the processing efficiency for large-scale vector embedding tasks are reduced.
Summary of the Invention
[0004] This application proposes a vector embedding method, apparatus, device, and medium based on a heterogeneous computing architecture, which can improve the resource utilization of the system and the processing efficiency of large-scale vector embedding tasks.
[0005] To achieve the above objectives, a first aspect of this application proposes a vector embedding method based on a heterogeneous computing architecture, wherein the heterogeneous computing architecture includes general-purpose computing units and intensive computing units, and the method includes: determining the multiple network layers contained in the pre-trained vector embedding model, and running each network layer to process preset test samples through the general-purpose computing unit and the intensive computing unit respectively, to obtain the first processing latency and the second processing latency corresponding to each network layer; calculating the execution efficiency index corresponding to each network layer based on the ratio of the first processing latency to the second processing latency of that network layer; dividing the multiple network layers, based on the execution efficiency index corresponding to each network layer, into a set of lightweight computing layers executed by the general-purpose computing unit and a set of dense computing layers executed by the intensive computing unit; acquiring multiple vector embedding tasks, and sequentially running each network layer in the set of dense computing layers through the intensive computing unit to process the multiple vector embedding tasks in parallel, thereby obtaining intermediate data corresponding to each vector embedding task; and sequentially running each network layer in the set of lightweight computing layers through the general-purpose computing unit to process the multiple intermediate data in parallel, thereby obtaining the semantic vector corresponding to each intermediate data.
[0006] Accordingly, a second aspect of this application proposes a vector embedding device based on a heterogeneous computing architecture, wherein the heterogeneous computing architecture includes general-purpose computing units and intensive computing units, and the device includes: The determination module is used to determine the multiple network layers contained in the pre-trained vector embedding model, and to run each network layer to process preset test samples through the general computing unit and the dense computing unit respectively, so as to obtain the first processing delay and the second processing delay corresponding to each network layer. The calculation module is used to calculate the execution efficiency index corresponding to each network layer based on the ratio between the first processing latency and the second processing latency of each network layer. The partitioning module is used to divide the multiple network layers into a set of lightweight computing layers executed by the general-purpose computing unit and a set of dense computing layers executed by the dense computing unit, based on the execution efficiency index corresponding to each network layer. The running module is used to acquire multiple vector embedding tasks and sequentially run each network layer in the set of dense computing layers through the dense computing unit to process the multiple vector embedding tasks in parallel and obtain intermediate data corresponding to each vector embedding task. The processing module is used to sequentially run each network layer in the lightweight computing layer set through the general-purpose computing unit to process multiple intermediate data in parallel and obtain the semantic vector corresponding to each intermediate data.
[0007] In some implementations, the operating module is further configured to: Obtain the first queue task threshold corresponding to the intensive computing unit. When the first number of first tasks to be processed in the first queue is less than the first queue task threshold, allocate the multiple vector embedding tasks to the first queue. The intensive computing unit sequentially runs each network layer in the intensive computing layer set to perform parallel processing on the multiple vector embedding tasks contained in the first queue, thereby obtaining intermediate data corresponding to each vector embedding task.
[0008] In some implementations, the operating module is further configured to: The number of concurrent tasks is increased by a preset step size. The average processing latency of the intensive computing unit for processing the preset test tasks is collected under different numbers of concurrent tasks until the target average processing latency corresponding to the target number of concurrent tasks exceeds the preset service threshold, and multiple sets of target sample data are obtained. Each set of target sample data includes the number of concurrent tasks and the corresponding average processing latency; The least squares method is used to perform linear regression fitting on the multiple sets of target sample data to obtain the delay coefficient and base delay of the dense computing unit; Based on the target average processing latency, the latency coefficient, and the base latency, the first queue task threshold corresponding to the intensive computing unit is calculated.
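As a non-limiting illustration of the fitting described above, the following Python sketch fits a straight line to (number of concurrent tasks, average processing latency) samples by ordinary least squares and derives a queue task threshold from it. The function names are hypothetical, and the final step (inverting the fitted line at the target latency and rounding down) is an assumption, since the text here states only which three quantities the threshold is based on.

```python
import math


def fit_latency_line(samples):
    """Least-squares fit of: latency = coeff * concurrency + base.

    `samples` is a list of (concurrency, average_latency) pairs.
    Returns (coeff, base), i.e. the latency coefficient and base latency.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("concurrency values must differ")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    coeff = sxy / sxx
    base = mean_y - coeff * mean_x
    return coeff, base


def queue_task_threshold(target_latency, coeff, base):
    """ASSUMPTION: largest task count whose predicted latency stays within
    target_latency, obtained by inverting the fitted line and rounding down."""
    if coeff <= 0:
        raise ValueError("latency coefficient must be positive")
    return max(0, math.floor((target_latency - base) / coeff))
```

For example, samples collected at 1, 2, 3 and 4 concurrent tasks can be passed to `fit_latency_line`, and the resulting pair passed to `queue_task_threshold` together with the target average processing latency.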
[0009] In some implementations, the operating module is further configured to: obtain the second number of first pending tasks currently contained in the first queue; when the second number reaches the preset batch processing threshold, select from the first queue, in order of arrival time, a number of target vector embedding tasks equal to the preset batch processing threshold; identify the multiple processing cores of the intensive computing unit and assign a corresponding target vector embedding task to each processing core; and, on each processing core, sequentially run each network layer in the set of dense computing layers to process the corresponding target vector embedding task and obtain the intermediate data corresponding to that task.
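The batching behaviour of this implementation can be sketched as follows. This is an illustrative Python fragment with hypothetical names. The round-robin assignment used when a batch holds more tasks than there are processing cores is an assumption, as the text only says that each core is assigned a corresponding task.

```python
from collections import deque


def take_batch(first_queue, batch_threshold):
    """Remove and return `batch_threshold` tasks in arrival (FIFO) order,
    or an empty list if the queue has not yet reached the threshold."""
    if len(first_queue) < batch_threshold:
        return []
    return [first_queue.popleft() for _ in range(batch_threshold)]


def assign_to_cores(batch, core_ids):
    """Map each core id to the tasks it will run.

    ASSUMPTION: tasks are dealt out round-robin when the batch is larger
    than the number of cores.
    """
    if not core_ids:
        raise ValueError("at least one processing core is required")
    plan = {core: [] for core in core_ids}
    for index, task in enumerate(batch):
        plan[core_ids[index % len(core_ids)]].append(task)
    return plan
```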
[0010] In some embodiments, the vector embedding device based on a heterogeneous computing architecture further includes an allocation module for: When the first number of multiple first tasks to be processed contained in the first queue is greater than or equal to the first queue task threshold, and the third number of multiple second tasks to be processed contained in the second queue corresponding to the general computing unit is less than the preset second queue task threshold, the multiple vector embedding tasks are assigned to the second queue. The general-purpose computing unit sequentially runs each network layer in the dense computing layer set to perform parallel processing on the multiple vector embedding tasks, thereby obtaining intermediate data corresponding to each vector embedding task. The general-purpose computing unit sequentially runs each network layer in the lightweight computing layer set to process multiple intermediate data in parallel, thereby obtaining the semantic vector corresponding to each intermediate data.
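The routing rule in this embodiment, together with the first-queue condition of the earlier implementation, reduces to two threshold comparisons. The Python sketch below is illustrative only and its names are hypothetical. The text does not say what happens when both queues are at or above their thresholds, so the sketch returns a placeholder value for that case rather than guessing a policy.

```python
def route_tasks(first_queue_len, first_threshold,
                second_queue_len, second_threshold):
    """Decide which queue receives newly arrived vector embedding tasks.

    Returns "dense" for the first queue (intensive computing unit),
    "general" for the second queue (general-purpose computing unit), or
    "unspecified" when both queues are full (behaviour not given in the text).
    """
    if first_queue_len < first_threshold:
        return "dense"
    if second_queue_len < second_threshold:
        return "general"
    return "unspecified"
```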
[0011] In some implementations, the intermediate data includes intermediate representative data and intermediate combined data, and the running module is further configured to: Clustering is performed on the multiple vector embedding tasks to obtain multiple task sets; For each task set, a representative task is selected, and each network layer in the dense computing layer set is run sequentially through the dense computing unit to process the representative task and obtain the intermediate representative data corresponding to the representative task. For each task set, the semantic difference between each remaining vector embedding task and the representative task is obtained, and based on the semantic difference corresponding to each remaining vector embedding task and the intermediate representative data, intermediate combined data corresponding to each remaining vector embedding task is constructed.
[0012] In some embodiments, the processing module is further configured to: for each piece of intermediate combined data, calculate the corresponding target intermediate data based on the corresponding intermediate representative data and the corresponding semantic difference; and run each network layer in the set of lightweight computing layers through the general-purpose computing unit to process the intermediate representative data and the target intermediate data in parallel, thereby obtaining the semantic vectors corresponding to the intermediate representative data and the target intermediate data, respectively.
[0013] Accordingly, a third aspect of the embodiments of this application proposes a computer device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the vector embedding method based on a heterogeneous computing architecture according to any one of the embodiments of the first aspect of this application.
[0014] Accordingly, a fourth aspect of the embodiments of this application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the vector embedding method based on a heterogeneous computing architecture according to any one of the embodiments of the first aspect of this application.
[0015] In the embodiments of this application, the multiple network layers contained in the pre-trained vector embedding model are determined, and each network layer is run to process preset test samples through the general-purpose computing unit and the intensive computing unit respectively, obtaining a first processing latency and a second processing latency corresponding to each network layer; the execution efficiency index corresponding to each network layer is calculated based on the ratio between its first processing latency and second processing latency; based on the execution efficiency index corresponding to each network layer, the multiple network layers are divided into a set of lightweight computing layers executed by the general-purpose computing unit and a set of dense computing layers executed by the intensive computing unit; multiple vector embedding tasks are obtained, and each network layer in the set of dense computing layers is run sequentially by the intensive computing unit to process the multiple vector embedding tasks in parallel, obtaining intermediate data corresponding to each vector embedding task; and each network layer in the set of lightweight computing layers is run sequentially by the general-purpose computing unit to process the multiple intermediate data in parallel, obtaining the semantic vector corresponding to each intermediate data. In this way, layer-level hardware-adaptive scheduling matches each computing task to the computing power characteristics of a computing unit. Specifically, this application obtains the execution efficiency index of each network layer on the different computing units through testing. On that basis, it allocates computationally intensive network layers to the intensive computing units, which are good at parallel computing, and allocates lightweight network layers to the general-purpose computing units. This overcomes the resource mismatch caused by binding the entire model to a single computing unit, so that each computing unit handles the layers that match its computing power characteristics. It avoids idle computing power on intensive computing units when they handle lightweight layers and computing bottlenecks on general-purpose computing units when they handle dense layers. At the same time, it puts the idle resources of general-purpose computing units to work on vector embedding computation, forming cross-device, layer-wise pipelined parallel processing. In summary, this application can improve the system's resource utilization and the processing efficiency of large-scale vector embedding tasks.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of the architecture of a vector embedding system based on a heterogeneous computing architecture provided in an embodiment of this application;
Figure 2 is a flowchart of a vector embedding method based on a heterogeneous computing architecture provided in an embodiment of this application;
Figure 3 is an overview diagram of the fine-tuning reuse provided in the embodiments of this application;
Figure 4 is a flowchart illustrating the dynamic scheduling of vector embedding tasks under a heterogeneous computing architecture provided in an embodiment of this application;
Figure 5 is a schematic diagram of cross-device dynamic decision-making and pipeline execution based on hierarchical efficiency ratio provided in an embodiment of this application;
Figure 6 is a schematic diagram of the functional modules of the vector embedding device based on a heterogeneous computing architecture provided in the embodiments of this application;
Figure 7 is a schematic diagram of the hardware structure of the computer device provided in the embodiments of this application.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0018] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, and the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0019] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0020] With the rapid development of artificial intelligence technology, vector embedding technology has become a core support for semantic understanding and information retrieval. Specifically, vector embedding technology transforms massive amounts of text data into high-dimensional semantic vectors in real time, achieving semantic-level information matching through similarity calculation between vectors. This places extremely high demands on the processing capabilities of vector embedding services. Therefore, it is necessary to improve the processing efficiency of vector embedding services to meet the ever-increasing business needs.
[0021] In related technologies, inference is generally performed by deploying a pre-trained vector embedding model on a single computing unit. However, the structural characteristics of the model result in different parts having different requirements for computing resources. When the computing unit processes the part of the model that does not match its computing power characteristics, it either has excess computing power, resulting in idle resources, or insufficient computing power, forming a processing bottleneck. As a result, the resource utilization rate of the system and the processing efficiency for large-scale vector embedding tasks are reduced.
[0022] Based on this, embodiments of this application provide a vector embedding method, apparatus, device, and medium based on a heterogeneous computing architecture, which can improve the resource utilization of the system and the processing efficiency of large-scale vector embedding tasks.
[0023] The vector embedding method, apparatus, device, and medium based on heterogeneous computing architecture provided in this application are specifically described through the following embodiments. First, the vector embedding system based on heterogeneous computing architecture in this application embodiment is described.
[0024] Referring to Figure 1, in some implementations, embodiments of this application provide a vector embedding system based on a heterogeneous computing architecture, including a terminal 11 and a server 12.
[0025] In some implementations, terminal 11 can be used to initiate vector embedding tasks and receive the final semantic vector results. For example, it can be a smartphone, personal computer, IoT device, or application server deployed at the business front end. Terminal 11 can send a vector embedding request containing the text to be processed to server 12 through a built-in client module or application programming interface (API), and wait to receive the high-dimensional semantic vector returned by server 12 to support subsequent processing by upper-layer applications (such as retrieval-augmented generation, semantic matching, or text classification).
[0026] Furthermore, the server 12 can be used to split the pre-trained vector embedding model into layers and execute large-scale vector embedding tasks in parallel based on a heterogeneous computing architecture. For example, it can be a standalone server, server cluster, or cloud computing node equipped with general-purpose computing units (such as CPUs) and intensive computing units (such as GPUs or NPUs).
[0027] For example, the server 12 can run modules such as a device detector, a layer splitting and allocation engine, a queue manager, and a linear regression queue depth estimator to perform layer-granular scheduling and hardware adaptation on the received vector embedding task. It can allocate the computationally intensive layer to the intensive computing unit for parallel processing to obtain intermediate data, and then pass the intermediate data to the general-purpose computing unit to complete the processing of the lightweight computing layer, and finally generate semantic vectors and return them to the terminal 11.
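The hand-off from the intensive computing unit to the general-purpose computing unit described above behaves like a two-stage pipeline. The following Python sketch is a minimal illustration under stated assumptions: each stage is stood in for by an ordinary callable running on its own thread, and the names `run_pipeline`, `dense_stage` and `light_stage` are hypothetical. It does not model real devices or data transfer.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the task stream


def run_pipeline(tasks, dense_stage, light_stage):
    """Run dense_stage then light_stage on every task, overlapping the stages.

    dense_stage stands in for the dense computing layers (intensive unit);
    light_stage stands in for the lightweight layers (general-purpose unit).
    Results are returned in task order.
    """
    handoff = queue.Queue()  # carries intermediate data between the stages
    results = []

    def dense_worker():
        for task in tasks:
            handoff.put(dense_stage(task))
        handoff.put(_STOP)

    def light_worker():
        while True:
            item = handoff.get()
            if item is _STOP:
                break
            results.append(light_stage(item))

    workers = [threading.Thread(target=dense_worker),
               threading.Thread(target=light_worker)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

While the second stage is finishing one task, the first stage can already be working on the next, which is the overlap that pipelined processing across the two computing units relies on.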
[0028] In some implementations, the terminal 11 and the server 12 can communicate via a network (such as a local area network or the Internet). The terminal 11, as the business entry point, is responsible for initiating requests and receiving results, while the server 12, as the core computing platform, is responsible for efficient inference of the vector embedding model and collaborative scheduling of heterogeneous resources. The two interact through RESTful APIs or remote procedure call protocols, thereby achieving an organic unity between the business flexibility on the terminal side and the computing efficiency on the server side.
[0029] The vector embedding method based on heterogeneous computing architecture in this application can be illustrated through the following embodiments.
[0030] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent will be obtained first. Furthermore, the collection, use, and processing of this data will comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user will be obtained through pop-ups or redirects to confirmation pages. Only after obtaining the user's separate permission or consent will the necessary user-related data for the normal operation of the embodiments of this application be obtained.
[0031] This embodiment is described from the perspective of a vector embedding device based on a heterogeneous computing architecture. The device can be integrated into a computer device, and the heterogeneous computing architecture includes general-purpose computing units and intensive computing units. Referring to Figure 2, which is a flowchart of the vector embedding method based on a heterogeneous computing architecture provided in this embodiment, and taking the integration of the device into a terminal or server as an example, the specific process when the processor on the terminal or server executes the program instructions corresponding to the method is as follows:
Step 101: Determine the multiple network layers contained in the pre-trained vector embedding model, and run each network layer to process preset test samples through general-purpose computing units and dense computing units respectively, to obtain the first processing delay and the second processing delay corresponding to each network layer.
[0032] In some implementations, to achieve refined and adaptive partitioning and collaborative processing of vector embedding tasks in heterogeneous computing architectures, the latency of each network layer in the pre-trained model processing test samples on general-purpose computing units and intensive computing units can be measured separately, and the ratio between the two can be calculated to construct a quantitative execution efficiency index. This allows for precise task allocation based on the actual computational characteristics of the network layers on different hardware (such as memory-intensive or computationally intensive) rather than solely on the network layer type. This provides a scientific basis for subsequent pipelined parallel processing in heterogeneous architectures, maximizing the computational advantages of different computing units.
[0033] The vector embedding model can be any pre-trained deep learning model used to convert input data (such as text, images, audio, and video) into low-dimensional dense semantic vectors. Examples include BERT and RoBERTa models based on the Transformer architecture, or ResNet and EfficientNet models based on the convolutional neural network architecture. Internally, it can consist of multiple cascaded network layers with different computational characteristics, used to capture the deep semantic features of the input data.
[0034] The network layer can be the basic computational unit that constitutes the vector embedding model, such as the multi-head attention layer or feedforward neural network layer in the Transformer structure, or the convolutional layer, pooling layer, or fully connected layer in the convolutional neural network. Each network layer performs specific mathematical transformations in the vector embedding model.
[0035] The general-purpose computing unit can be a processor in a heterogeneous computing architecture that is responsible for handling complex logic control, irregular data structures and lightweight parallel computing, such as a central processing unit (CPU). It is characterized by strong single-core performance and a complex cache system, and can be used to execute the network layers in the model that are memory-access intensive or have complex control flow.
[0036] The intensive computing unit can be a processor in a heterogeneous computing architecture that is designed for high-throughput, regular, large-scale parallel computing, such as a graphics processing unit (GPU) or a neural network processing unit (NPU). It has thousands of computing cores and can be used to execute computationally intensive network layers in the model (such as large matrix multiplications) to significantly improve computing efficiency.
[0037] The preset test samples can be representative single or small amounts of input data, such as a standard-length text or a standard-sized image, used to benchmark the processing performance of each network layer on different computing units before model deployment.
[0038] The first processing latency can be the time taken from the input data entering the layer to the completion of the calculation when a specific network layer is run using only a general-purpose computing unit (such as a CPU) to process a preset test sample. It is used to quantify the computation time of the network layer on general-purpose hardware.
[0039] The second processing latency can be the time taken from the input data entering the layer to the completion of the computation when only an intensive computing unit (such as a GPU) is used to run the same specific network layer to process the same preset test sample. It is used to quantify the computation time of the network layer on dedicated parallel hardware.
[0040] For example, an ordered list of network layers can be obtained by parsing the configuration file or network structure definition file corresponding to the vector embedding model. For instance, for a standard BERT model, its network layers may include an input embedding layer, multiple Transformer encoder layers (each encoder layer contains a multi-head self-attention sub-layer and a feedforward neural network sub-layer), a layer normalization layer, and an output layer.
[0041] Furthermore, each determined network layer in the heterogeneous computing architecture is run using both a general-purpose computing unit and a dense computing unit to process a preset test sample, in order to measure the processing latency of each network layer on different hardware. This heterogeneous computing architecture includes at least one general-purpose computing unit and one dense computing unit.
[0042] Specifically, general-purpose computing units can be central processing units (CPUs), which are good at handling complex logic control and sparse computation; intensive computing units can be graphics processing units (GPUs) or neural network processing units (NPUs), which have a large number of computing cores and are good at performing large-scale parallel numerical computations, such as matrix multiplication.
[0043] In some implementations, the preset test sample can be representative, lightweight input data used to test the network layers. The same preset test sample can be used for every network layer, or different samples can be used for different network layers and different models. For example, for a text processing model, the preset test sample can be a text fragment containing a fixed number (e.g., 128) of tokens; for an image processing model, it can be a standard-sized (e.g., 224x224) image.
[0044] In one implementation, to obtain the processing latency of each network layer separately, the entire pre-trained vector embedding model can be loaded into memory first. Then, each network layer is instantiated and run on a general-purpose computing unit (such as a CPU), a preset test sample (e.g., sample a) is input, and the time taken from data input to the completion of computation by that layer is recorded as the first processing latency. Similarly, the network layer can be instantiated and run on an intensive computing unit (such as a GPU), the same preset test sample (e.g., sample a) is input, and its computation time is recorded as the second processing latency. To ensure the accuracy of the measurement, the operation of each network layer on each computing unit can be measured multiple times (e.g., 10 or 100 times), and the average value is taken as the final latency value to reduce the influence of system noise.
[0045] In some implementations, the first and second processing latencies corresponding to each network layer can be recorded at the code level using a high-precision timer (such as std::chrono::high_resolution_clock). Specifically, the timer can be started before the forward propagation function call of the network layer and stopped after the function returns. Taking a network layer named TransformerLayer as an example, its first processing latency on the CPU (general-purpose computing unit) is the timestamp at which the layer finishes processing on the CPU minus the timestamp at which it started. The second processing latency of the same network layer for the same preset test sample on the GPU (intensive computing unit) is calculated in the same way and will not be repeated here.
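The repeated, timer-based measurement described above can be sketched in Python, using `time.perf_counter` as the high-resolution clock. This is a minimal sketch under stated assumptions: each layer is represented by two ordinary callables (one per computing unit), the names `mean_latency` and `profile_layers` are hypothetical, and the untimed warm-up call is an addition not mentioned in the text. On accelerators that launch work asynchronously, the callable would also need to wait for the device to finish before returning, otherwise the timer stops too early.

```python
import time


def mean_latency(layer_fn, sample, repeats=10, warmup=1):
    """Average wall-clock time of layer_fn(sample) over `repeats` runs.

    ASSUMPTION: `warmup` untimed calls are made first so that one-off
    start-up costs do not distort the average.
    """
    for _ in range(warmup):
        layer_fn(sample)
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()            # start timer before the forward call
        layer_fn(sample)
        total += time.perf_counter() - start   # stop timer after it returns
    return total / repeats


def profile_layers(layers, sample, repeats=10):
    """Measure every layer on both units.

    `layers` maps a layer name to (general_fn, dense_fn). Returns a dict
    mapping the name to (first_latency, second_latency) in seconds.
    """
    return {
        name: (mean_latency(general_fn, sample, repeats),
               mean_latency(dense_fn, sample, repeats))
        for name, (general_fn, dense_fn) in layers.items()
    }
```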
[0046] In some implementations, in addition to measuring each network layer independently and sequentially as described above, a segmented measurement approach can be used to obtain the combined latency of groups of network layers, which more realistically reflects the model's performance under pipeline parallelism. Specifically, a sliding window can be set, treating multiple consecutive network layers (e.g., two or four consecutive layers) as one measurement granularity. Each group of layers is then run on both the CPU and the GPU to process the preset test samples, and the combined processing latency is recorded. By analyzing the combined latency of different layer groups on different hardware, micro-pipeline segments that are better executed together can be identified. Consequently, in subsequent task partitioning, allocation is based not on individual network layers but on the optimal network layer groups, which further reduces the number of cross-unit calls and the data transfer overhead, thereby improving overall computational efficiency.
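As a rough illustration of the sliding-window measurement, the following Python sketch enumerates groups of consecutive layers and times each group as a whole. The window is assumed to advance one layer at a time, which the text does not specify, and the toy layers are hypothetical placeholders.

```python
import time


def sliding_groups(layers, window):
    """All runs of `window` consecutive layers, advancing one layer at a time."""
    return [layers[i:i + window] for i in range(len(layers) - window + 1)]


def combined_latency(group, sample):
    """Wall-clock time (seconds) to pass `sample` through every layer in `group`."""
    start = time.perf_counter()
    data = sample
    for layer in group:
        data = layer(data)
    return time.perf_counter() - start


# Hypothetical stand-ins for four consecutive network layers.
layers = [lambda x, k=k: [v + k for v in x] for k in range(4)]
groups = sliding_groups(layers, window=2)
latencies = [combined_latency(g, list(range(128))) for g in groups]
```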
[0047] By using the above methods, objective performance data of the same network layer on different computing units can be obtained. The execution efficiency index calculated based on this data can accurately reflect the computing preference of the network layer for specific hardware. This provides a data-driven decision basis for task partitioning under heterogeneous computing architecture, and lays the foundation for achieving optimal matching of computing resources and computing tasks, improving the overall throughput of vector embedding tasks, and reducing service latency.
[0048] Step 102: Calculate the execution efficiency index corresponding to each network layer based on the ratio between the first processing latency and the second processing latency of each network layer.
[0049] In some implementations, in order to objectively quantify and distinguish the relative computational efficiency of each network layer on general-purpose and intensive computing units, an execution efficiency index reflecting the hardware affinity of each network layer can be calculated, providing a reliable data foundation for subsequently adaptively dividing the network layers into sets of lightweight computing layers and sets of intensive computing layers.
[0050] The execution efficiency index can be a quantitative value calculated based on the ratio of the first processing latency of the same network layer on a general-purpose computing unit to the second processing latency on a dense computing unit. For example, when the first processing latency is much greater than the second processing latency, the ratio is much greater than 1, indicating that the network layer is more suitable for running on a dense computing unit. Conversely, if the ratio is close to or less than 1, it indicates that the network layer is more efficient or has little difference in execution on a general-purpose computing unit, thus providing a core decision-making basis for accurately dividing multiple network layers into different computing units for execution.
[0051] In some implementations, for the i-th network layer, its execution efficiency index $E_i$ can be determined as follows: $E_i = T_i^{\mathrm{CPU}} / T_i^{\mathrm{GPU}}$, where $T_i^{\mathrm{CPU}}$ represents the first processing latency consumed by the i-th network layer in processing the preset test sample on a general-purpose computing unit (such as a CPU), and $T_i^{\mathrm{GPU}}$ represents the second processing latency consumed by the i-th network layer in processing the same preset test sample on an intensive computing unit (such as a GPU).
[0052] For example, if the execution efficiency index $E_i$ of a network layer is much greater than 1, this indicates that the network layer's processing speed on the GPU (intensive computing unit) far exceeds that on the CPU, exhibiting strong computationally intensive characteristics and a very high affinity for the GPU. If $E_i$ is close to 1, this indicates that the processing efficiency of the network layer on the two computing units is comparable, possibly because it is limited by memory access bandwidth or other factors. If $E_i$ is less than 1, this indicates that the network layer is actually more efficient on the CPU (general-purpose computing unit), possibly due to its smaller computational load, more complex control logic, or the presence of a large number of sparse operations, making it more suitable for execution on the CPU. In this way, the raw latency data can be transformed into a unified efficiency index with clear physical meaning, providing accurate data support for subsequent network layer partitioning.
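The ratio described above can be computed directly from the two measured latencies. The following is a minimal Python sketch; the layer names and latency values (in milliseconds) are invented for illustration and are not measurements from the application.

```python
def efficiency_index(first_latency, second_latency):
    """Ratio of the general-purpose latency to the intensive-unit latency."""
    if second_latency <= 0:
        raise ValueError("second_latency must be positive")
    return first_latency / second_latency


# Hypothetical (first, second) latencies in milliseconds for two layers.
measured = {
    "attention": (8.0, 1.0),     # much faster on the intensive unit
    "layer_norm": (0.25, 0.5),   # faster on the general-purpose unit
}
indices = {name: efficiency_index(cpu, gpu) for name, (cpu, gpu) in measured.items()}
print(indices)  # {'attention': 8.0, 'layer_norm': 0.5}
```

Because both latencies carry the same unit, the index is dimensionless and can be compared across layers.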
[0053] In some implementations, to consider the computational characteristics of network layers more comprehensively, the computational density of the network layers can be introduced as an auxiliary factor to adjust the execution efficiency metric. Specifically, the computational cost (e.g., in floating-point operations, FLOPs) and the parameter count of each network layer can be obtained first. Then, combined with its processing latency on different computing units, its actual computational throughput (FLOPs / second) on different hardware can be calculated. The execution efficiency metric can then be defined as a combined function of the throughput ratio and the theoretical peak value, in which the actual computational throughputs achieved by the i-th network layer on the CPU and on the GPU are combined using two preset weighting coefficients. In this way, the metric reflects not only the latency difference but also the efficiency of hardware resource utilization. It can avoid incorrectly assigning to the CPU certain network layers that have low latency ratios but can make better use of the parallel capabilities of the GPU, thereby achieving more refined task scheduling.
[0054] By using the above methods, the measured raw latency data can be transformed into a dimensionless efficiency index with clear physical meaning. This eliminates the interference of the absolute performance differences of different hardware on the judgment, and allows the hardware affinity of each network layer to be quantitatively presented. This lays a precise data foundation for the subsequent scientific and accurate division of lightweight computing layer sets and dense computing layer sets.
[0055] Step 103: Based on the execution efficiency index corresponding to each network layer, the multiple network layers are divided into a set of lightweight computing layers executed by general-purpose computing units and a set of dense computing layers executed by dense computing units.
[0056] In some implementations, in order to achieve precise matching between computing tasks and hardware resources in a heterogeneous computing architecture, the network layers can be divided into two sets, one adapted to general-purpose computing units and the other to intensive computing units, based on the execution efficiency index corresponding to each network layer. This allows network layers with different characteristics to run on the most suitable hardware in subsequent processing, thereby maximizing the overall computational efficiency of vector embedding.
[0057] The lightweight computing layer set can be a collection of network layers with low computational intensity or complex control logic that are divided from the pre-trained vector embedding model based on execution efficiency indicators. Examples include layer normalization layers, activation function layers, and residual connection layers. These layers can be subsequently executed by general-purpose computing units (such as CPUs) to leverage their advantages in task scheduling, branch prediction, and low-latency processing.
[0058] The set of dense computing layers can be a collection of network layers with high computational intensity and pronounced data parallelism that are divided from the pre-trained vector embedding model according to the execution efficiency index, for example, network layers containing large-scale matrix multiplication, such as multi-head self-attention layers and feedforward neural network layers. These layers can subsequently be executed by dense computing units (such as GPUs), whose large number of computing cores provides high-throughput parallel acceleration.
[0059] In some implementations, an efficiency threshold $\theta$ can be preset (for example, $\theta = 1.5$). For the i-th network layer, if its execution efficiency index $E_i$ is greater than or equal to this threshold, i.e., $E_i \ge \theta$, the network layer is determined to be computationally intensive and is classified into the set of dense computing layers. Conversely, if $E_i < \theta$, it is determined to be a lightweight computing layer and is included in the lightweight computing layer set. For example, for a BERT model containing 12 Transformer blocks, calculations revealed that the multi-head attention sublayers (QKV projection) of the first four layers had generally high $E_i$ values (e.g., >2.0), and the feedforward network sublayers (FFN) of the last 8 layers had relatively low $E_i$ values (e.g., 1.2). Meanwhile, the $E_i$ values of all layer normalization layers and residual connection layers may all be less than 1.0. Based on the preset threshold $\theta = 1.5$, all multi-head attention sublayers and the portion of the feedforward network sublayers with $E_i \ge \theta$ can be allocated to the GPU for execution, while the normalization layers, residual connection layers, and the feedforward network sublayers with $E_i < \theta$ are assigned to the CPU for execution. In this way, precise and adaptive allocation of network layers with different computational characteristics within the model is achieved on heterogeneous hardware.
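The single-threshold division can be sketched in Python as follows. The function name partition_layers, the layer names and the index values are hypothetical and chosen only to mirror the ranges mentioned in the example above; the default threshold of 1.5 is the example value from the text.

```python
def partition_layers(indices, threshold=1.5):
    """Split layers into (lightweight, dense) sets by execution efficiency index."""
    dense = [name for name, e in indices.items() if e >= threshold]
    lightweight = [name for name, e in indices.items() if e < threshold]
    return lightweight, dense


# Hypothetical execution efficiency indices for a few sublayers.
indices = {
    "attention_0": 2.3,
    "ffn_0": 1.8,
    "ffn_8": 1.2,
    "layer_norm_0": 0.7,
    "residual_0": 0.6,
}
lightweight, dense = partition_layers(indices)
print(dense)        # ['attention_0', 'ffn_0']
print(lightweight)  # ['ffn_8', 'layer_norm_0', 'residual_0']
```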
[0060] In some implementations, a multi-threshold or dynamic threshold partitioning strategy can be employed to balance the load more finely or to meet specific latency targets. For example, a high threshold (e.g., 2.0) and a low threshold (e.g., 1.2) can be set. Layers whose execution efficiency index $E_i$ is above the high threshold are forcibly allocated to the GPU, layers whose $E_i$ is below the low threshold are forcibly allocated to the CPU, and layers in the middle range (between the two thresholds), referred to as fuzzy layers, can be dynamically assigned based on the real-time load of the two computing units. If the GPU load is low, some fuzzy layers can be allocated to the GPU for acceleration; if the CPU queue is backlogged, they can be left for CPU processing. This gives the system stronger adaptability and load balancing capabilities when facing fluctuating online requests.
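One possible reading of the dual-threshold strategy is sketched below in Python. Several details are assumptions not stated in the text: the boundary values are treated as belonging to the GPU side at the high threshold and to the fuzzy range at the low threshold, the GPU load is modeled as a fraction between 0 and 1, and gpu_load_limit is a hypothetical cut-off for deciding that the GPU load is low.

```python
def assign_unit(index, gpu_load, low=1.2, high=2.0, gpu_load_limit=0.5):
    """Pick a computing unit for one layer under a dual-threshold policy."""
    if index >= high:
        return "GPU"  # clearly compute-intensive: always on the GPU
    if index < low:
        return "CPU"  # clearly lightweight: always on the CPU
    # Fuzzy layer: decide from the current GPU load (assumed to be in 0..1).
    return "GPU" if gpu_load < gpu_load_limit else "CPU"


print(assign_unit(2.5, gpu_load=0.9))  # GPU
print(assign_unit(0.8, gpu_load=0.1))  # CPU
print(assign_unit(1.5, gpu_load=0.2))  # GPU
print(assign_unit(1.5, gpu_load=0.9))  # CPU
```

A fuller version could also consult the CPU queue depth, as the text suggests, before moving a fuzzy layer.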
[0061] By using the above methods, network layers with different computational characteristics in the model can be scientifically classified and divided according to measured hardware efficiency indicators. This ensures that each network layer can be assigned to the computational unit it is best at, avoiding resource waste or performance bottlenecks caused by mismatch between tasks and hardware characteristics.
[0062] Step 104: Obtain multiple vector embedding tasks, and sequentially run each network layer in the set of dense computing layers through dense computing units to process multiple vector embedding tasks in parallel, and obtain intermediate data corresponding to each vector embedding task.
[0063] In some implementations, in order to fully utilize the parallel architecture of the intensive computing unit and achieve true concurrent processing of multiple vector embedding tasks, different vector embedding tasks can be processed in parallel by the intensive computing unit to execute all the intensive computing processes of multiple vector embedding tasks at the same time, thereby greatly improving the parallelism of task processing and overall throughput, and quickly producing intermediate data corresponding to all tasks.
[0064] Vector embedding tasks can be raw input data units that need to be converted into low-dimensional dense semantic vectors, such as user query text, product images to be retrieved, and user behavior sequences for which similarity needs to be calculated.
[0065] The intermediate data can be the feature representations generated by multiple vector embedding tasks after forward computation of all network layers in the dense computation layer set. For example, the feature tensor obtained after processing by several Transformer layers of the BERT model has completed the computationally intensive part of the processing but has not yet passed through the remaining lightweight computation layers. It can be used as input data for subsequent processing by general-purpose computing units.
[0066] For example, vector embedding tasks can originate from various application scenarios. In information retrieval systems, a vector embedding task could be a query text submitted by a user through a search engine; in image search applications, the task could be a product image uploaded by a user; and in recommendation systems, the task could be a sequence of user behaviors or item features requiring similarity calculation. In practical systems, these tasks are typically generated in real-time by front-end services and encapsulated into a unified request object, placed in a global pending queue, such as an Apache Kafka message queue or a Redis task cache. The system then pulls multiple tasks from this queue at a preset rate or batch size, preparing for subsequent heterogeneous computing.
[0067] Specifically, a computationally intensive unit (such as a GPU) can contain multiple processing cores (such as streaming multiprocessors, SMs), each capable of independently executing computational tasks. In some implementations, different vector embedding tasks can be assigned to different processing cores. For example, assuming there are 32 vector embedding tasks and the GPU has 16 SMs, the tasks can be evenly distributed, with each SM handling two tasks. Subsequently, each SM independently runs a complete pipeline of intensive computational layers for its assigned tasks. Taking an intensive pipeline of 12 network layers as an example, when SM1 processes its two assigned tasks, it first performs the computation of layer 1 (such as a multi-head attention layer) on the data for these two tasks, obtaining the layer 1 outputs for these two tasks; then, it performs the computation of layer 2 (such as a feedforward layer) on the layer 1 outputs of these two tasks, and so on, until all 12 layers have been run. Finally, each SM outputs the feature representations of its assigned tasks after they have passed through all intensive computational layers, i.e., the intermediate data. The intermediate data produced by all SMs are aggregated and transferred to the memory of the general-purpose computing unit, awaiting the execution of the remaining lightweight computational layers. This multi-core parallel approach enables true concurrent processing of multiple vector embedding tasks, significantly improving system throughput.
[0068] In some implementations, an SM (streaming multiprocessor) of a computationally intensive unit can run all network layers in the computationally intensive layer set to process a single vector embedding task and obtain its intermediate data, after which the SM can continue with the next vector embedding task. Multiple SMs can process multiple vector embedding tasks in parallel.
[0069] By using the above methods, the powerful parallel computing capabilities of dense computing units can be utilized to batch process a large number of vector embedding tasks, thereby significantly shortening the overall computing time required to process a large number of tasks and avoiding idle waiting of computing resources.
[0070] In some implementations, to achieve load awareness and adaptive scheduling of intensive computing units in a heterogeneous computing architecture, and to avoid a sharp increase in processing latency when a sudden surge in tasks overloads the unit, a first queue task threshold based on the unit's processing capacity model can be obtained in advance. Newly arrived vector embedding tasks are assigned to the corresponding first queue only when the number of currently pending tasks is below this threshold. This keeps the unit operating within its optimal load range, so that its parallel computing capabilities are fully utilized while processing latency remains stable. For example, step 104, "running each network layer in the intensive computing layer set sequentially through the intensive computing unit to process multiple vector embedding tasks in parallel and obtain intermediate data corresponding to each vector embedding task," may include: (104.A1) Obtain the first queue task threshold corresponding to the intensive computing unit. When the first quantity, i.e., the number of first tasks to be processed in the first queue, is less than the first queue task threshold, allocate the multiple vector embedding tasks to the first queue. (104.A2) Run each network layer in the set of dense computing layers sequentially through the dense computing unit to process the multiple vector embedding tasks in the first queue in parallel, and obtain intermediate data corresponding to each vector embedding task.
[0071] The first queue task threshold can be the maximum number of concurrent tasks calculated through performance modeling based on the processing capacity of the intensive computing unit. It is used as an admission control limit to determine whether new tasks can be allocated to the first queue corresponding to the intensive computing unit, so as to dynamically protect the unit from overload when the system load fluctuates.
[0072] The multiple first tasks to be processed can be a set of vector embedding tasks that have been assigned to a first queue and are waiting for the intensive computing unit to execute the intensive computing layer set portion of them. For example, it can be a batch of text vectorization requests that have arrived but have not yet started or are being queued for GPU processing, serving as a source of tasks for batch scheduling by the intensive computing unit.
[0073] The first quantity can be the total number of the first pending tasks contained in the first queue at the current moment.
[0074] The first queue can be a first-in-first-out (or priority-based scheduling) buffer set up at the front end of the intensive computing unit to cache vector embedding tasks to be processed, such as the GPU task scheduling queue, to smooth out the instantaneous mismatch between the task arrival rate and the GPU processing rate, and to provide a data aggregation place for batch processing of tasks.
[0075] In some implementations, the first queue task threshold can be obtained by stress testing the intensive computing unit, which represents the maximum number of queued tasks that the intensive computing unit can stably process under current processing power without causing a breach of the service level agreement.
[0076] Furthermore, during task allocation, the system can monitor in real time the first quantity of first tasks to be processed in the first queue (i.e., the task buffer at the front end of the intensive computing unit), which indicates the current queue depth. When this quantity is less than the preset first queue task threshold, the intensive computing unit is currently lightly loaded and has sufficient remaining processing capacity to accept new tasks. At this time, newly arriving vector embedding tasks can be allocated to the first queue for processing. This dynamically protects the intensive computing unit, preventing overload due to a sudden surge in tasks and ensuring the stability of service quality.
[0077] For example, suppose the first queue task threshold is 64 and there are already 40 tasks in the first queue (first quantity = 40). The first quantity is then less than the first queue task threshold, so when 20 new vector embedding tasks arrive, the system adds these vector embedding tasks to the first queue.
[0078] In some implementations, the sum of the number of the multiple vector embedding tasks and the first quantity can be calculated and compared with the first queue task threshold to determine whether the multiple vector embedding tasks can be added to the first queue. Continuing with the example above, the sum of the 20 vector embedding tasks and the first quantity (40) is 60, which is less than the first queue task threshold of 64. Therefore, these vector embedding tasks can be added to the first queue.
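The admission check in the example above can be sketched in Python as follows. The function name try_enqueue and the task labels are hypothetical; the strict "less than" comparison follows the wording of the example, and simply returning False when the check fails is an assumption, since the handling of rejected tasks is not described here.

```python
from collections import deque


def try_enqueue(first_queue, new_tasks, queue_threshold):
    """Add new_tasks only if the queue depth stays below the threshold."""
    first_quantity = len(first_queue)  # current queue depth
    if first_quantity + len(new_tasks) < queue_threshold:
        first_queue.extend(new_tasks)
        return True
    return False  # assumption: caller decides what to do with rejected tasks


first_queue = deque(f"task-{i}" for i in range(40))  # first quantity = 40
admitted = try_enqueue(first_queue, [f"new-{i}" for i in range(20)], 64)
rejected = try_enqueue(first_queue, [f"late-{i}" for i in range(10)], 64)
print(admitted, rejected, len(first_queue))  # True False 60
```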
[0079] In some implementations, the system retrieves all pending tasks (e.g., 60) from a first queue and allocates tasks based on the number of SMs (e.g., 20), with each SM responsible for handling 3 tasks. Subsequently, each SM independently runs a complete intensive computational layer pipeline for its assigned tasks. Taking a 12-layer dense set as an example, SM1, when processing its 3 assigned tasks, employs a task-interleaved execution approach: first, it calls upon its hundreds of internal CUDA cores to execute the first layer of task A in parallel, obtaining the first layer output of task A; then it quickly switches to task B to execute its first layer; then it executes the first layer of task C. Once the first layers of all three tasks are completed, SM1 restarts from task A, executing the second layer computation on its first layer output, and so on, until all 12 layers are completed. Through this multi-SM parallel, task-interleaved execution mechanism, the intensive computational unit can efficiently process large batches of tasks, ultimately producing intermediate data corresponding to all tasks and transferring it to the memory of the general-purpose computational unit.
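The execution order described above can be modeled with a short sequential Python sketch. It only reproduces the ordering of work (all tasks through layer 1, then all tasks through layer 2, and so on) and is not GPU code. The round-robin split across cores and the arithmetic toy layers are assumptions made for illustration.

```python
def split_across_cores(tasks, num_cores):
    """Distribute tasks round-robin so each core gets an equal share."""
    return [tasks[i::num_cores] for i in range(num_cores)]


def run_interleaved(tasks, layers):
    """Run every task through layer 1, then every task through layer 2, etc."""
    data = list(tasks)
    trace = []  # (layer number, task slot) in execution order
    for depth, layer in enumerate(layers, start=1):
        for slot in range(len(data)):
            data[slot] = layer(data[slot])
            trace.append((depth, slot))
    return data, trace


# 60 pending tasks, 20 cores, 12 toy layers (layer k adds k to its input).
groups = split_across_cores(list(range(60)), num_cores=20)
layers = [lambda x, k=k: x + k for k in range(1, 13)]
outputs, trace = run_interleaved(groups[0], layers)
print(groups[0])  # [0, 20, 40]
print(outputs)    # [78, 98, 118]
print(trace[:4])  # [(1, 0), (1, 1), (1, 2), (2, 0)]
```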
[0080] By using the above methods, the allocation path of new tasks can be dynamically determined based on the real-time processing capability of the intensive computing unit and the current load status, thereby effectively preventing the loss of control over processing delays caused by task backlog and ensuring the stability of service quality. At the same time, by temporarily storing tasks in a queue for subsequent batch processing, the throughput of the intensive computing unit can be maximized, thus providing a key guarantee for the stable and efficient operation of the entire heterogeneous computing system in high-concurrency scenarios.
[0081] In some implementations, to accurately quantify the load processing capacity of intensive computing units, a first queue task threshold to ensure quality of service can be calculated for the intensive computing units, thereby providing the system with a scientific and adaptive overload protection mechanism. For example, "obtaining the first queue task threshold corresponding to the intensive computing unit" in (104.A1) may include: (104.A1.1) Increment the number of concurrent tasks according to a preset step size, collect the average processing latency of the intensive computing unit for processing the preset test tasks under different numbers of concurrent tasks, until the target average processing latency corresponding to the target number of concurrent tasks exceeds the preset service threshold, and obtain multiple sets of target sample data. Each set of target sample data includes the number of concurrent tasks and the corresponding average processing latency; (104.A1.2) The least squares method is used to perform linear regression fitting on multiple sets of target sample data to obtain the delay coefficient and base delay of the dense computing unit; (104.A1.3) Calculate the first queue task threshold corresponding to the intensive computing unit based on the target average processing latency, latency coefficient and base latency.
[0082] The preset step size can be a fixed interval value for increasing the number of concurrent tasks each time when stress testing intensive computing units, such as increasing by 4 or 8 concurrent tasks each time.
[0083] The number of concurrent tasks can be the number of tasks submitted to the intensive computing unit for processing simultaneously within the same time period, such as the number of texts submitted to the GPU for vector embedding calculations at the same time, used to simulate different levels of load pressure to test the processing capacity of the unit.
[0084] The test tasks can be standardized tasks specifically designed for performance benchmarking of intensive computing units. For example, they can be a set of vector embedding computation tasks of fixed length and format, with a computational load similar to the actual task, used to perform repeatable and stable measurements of the unit's processing capability under different concurrency conditions.
[0085] The average processing latency can be the statistical average of the time consumed by a computationally intensive unit to complete all concurrent test tasks under a specific number of concurrent tasks. For example, it can be the total time consumed when processing 32 tasks simultaneously, divided by the number of tasks, and it is used to quantify the unit's response speed under different load levels.
[0086] The target concurrent task count can be the specific number of concurrent tasks that first causes the average processing latency to exceed the preset service threshold during the stress test. For example, when the number of concurrent tasks increases from 24 to 28, the average latency first exceeds the service threshold of 50 milliseconds. Then 28 is the target concurrent task count, marking the critical point when the unit enters the overload state.
[0087] The target average processing latency can be a specific average processing latency value that is measured under the target number of concurrent tasks and is just above or close to the service threshold, such as 50.3 milliseconds.
[0088] The service threshold can be the maximum average processing latency allowed by the system, which is preset according to business needs, such as 50 milliseconds, and is used as a standard to judge whether the intensive computing unit is overloaded.
[0089] The target sample data can be paired data of each group of concurrent tasks and their corresponding average processing latency collected during the stress test, such as {concurrency: 16, latency: 20ms}, {concurrency: 20, latency: 28ms}, etc., which can be used as the basic dataset for subsequent linear regression analysis.
[0090] The latency coefficient can be the slope of the regression line obtained by performing a linear regression fit on the target sample data using the least squares method. It is used to quantify the rate of increase in average processing latency as the number of concurrent tasks increases. It can be used to characterize the sensitivity of intensive computing units to load changes.
[0091] The base delay can be the intercept on the vertical axis of the regression line obtained by performing a linear regression fit on the target sample data using the least squares method. It can be used to represent the basic time required to process a single test task under ideal conditions without task queuing or competition, that is, the inherent processing speed of a computationally intensive unit.
[0092] Specifically, the number of concurrent test tasks sent to the intensive computing unit can be gradually increased according to a preset step size (e.g., a step size of 2 concurrent tasks) using a device detector or a linear regression queue depth estimator. The intensive computing unit performs a complete forward computation on the set of intensive computing layers it is responsible for (e.g., layers 1 to 20 of the Transformer model handled by the NPU), and the total time taken to process the test tasks at the current number of concurrent tasks is obtained. The average processing latency for that number of concurrent tasks is then obtained as the ratio of the total time taken to the number of concurrent tasks.
[0093] For example, when the number of concurrent tasks is 2, the intensive computing unit processes a specified segment of two queries simultaneously (e.g., the NPU executes layers 1-20), and the total processing latency is recorded as 0.6 seconds. The average processing latency is then 0.6 seconds divided by 2, i.e., 0.3 seconds per task. The system increments the number of concurrent tasks according to the preset step size and continues to collect the corresponding average processing latency until the target average processing latency (e.g., 1.1 seconds) corresponding to a certain target number of concurrent tasks first exceeds the preset service level objective (SLO, e.g., 1 second). At this point, the increment stops, and the target number of concurrent tasks together with the corresponding target average processing latency is taken as one set of target sample data. The data collected between the initial number of concurrent tasks and the target number of concurrent tasks are likewise included, giving multiple sets of target sample data that cover the complete range from low load to slightly exceeding the load threshold.
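The stepwise collection of sample data can be sketched in Python as below. The probe function fake_probe is a hypothetical stand-in that returns a synthetic latency; in practice it would submit that many concurrent test tasks to the intensive computing unit and divide the total time by the number of tasks. The max_concurrency guard is an added safety assumption so the loop always terminates.

```python
def collect_samples(avg_latency_at, step, service_threshold, max_concurrency=1024):
    """Increase concurrency by `step` until the average latency exceeds the threshold."""
    samples = []
    concurrency = step
    while concurrency <= max_concurrency:
        latency = avg_latency_at(concurrency)
        samples.append((concurrency, latency))
        if latency > service_threshold:
            break  # first point above the service threshold: stop here
        concurrency += step
    return samples


def fake_probe(concurrency):
    # Synthetic latency in seconds; stands in for a real measurement.
    return 0.27 + 0.018 * concurrency


samples = collect_samples(fake_probe, step=2, service_threshold=1.0)
print(len(samples), samples[-1][0])  # 21 42
```

The last collected point is the first one above the service threshold, and the earlier points span the range from low load up to that point.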
[0094] Based on this, to establish a mathematical model of the relationship between concurrency and latency, an estimator can use the least squares method to perform linear regression fitting on the multiple sets of target sample data. The principle behind this fitting is that, before system resources become a bottleneck, the average processing latency and the concurrency are usually approximately linearly related. In some implementations, the latency model for intensive computing units can be expressed by the following formula: $T_d(n) = k_d \cdot n + b_d$, where $T_d(n)$ is the average processing latency of intensive computing unit d at a concurrency of $n$; $k_d$ is the latency coefficient of the intensive computing unit, representing the increase in average processing latency caused by each additional concurrent task and reflecting the unit's sensitivity to load changes; and $b_d$ is the base delay to be fitted, representing the basic time required to process a single task under ideal conditions (i.e., no queuing, no competition) and reflecting the inherent processing speed of the unit. The specific values of $k_d$ and $b_d$ are determined by minimizing the sum of squared errors between the fitted line and each sample point using the least squares method.
[0095] Furthermore, based on the established latency model and the service level objective, the maximum queue depth at which this intensive computing unit still ensures service quality can be deduced. Specifically, the preset service threshold (i.e., the SLO value that the target average processing latency must not exceed) can be substituted into the above latency model as the average processing latency, and the corresponding concurrency number is solved for; this is the first queue task threshold. In some implementations, the formula for calculating this threshold can be expressed as: $Q_d^{\max} = (T_{\mathrm{SLO}} - b_d) / k_d$, where $T_{\mathrm{SLO}}$ is the preset service threshold, $k_d$ is the latency coefficient, and $b_d$ is the base delay. For example, if fitting gives a latency coefficient of $k_d = 0.018$ seconds per concurrent task and a base delay of $b_d = 0.27$ seconds for a certain NPU, and the service level objective (SLO) is set to 1 second, then substituting these values into the formula yields the maximum queue depth $Q_d^{\max} = (1 - 0.27) / 0.018 \approx 40$ after rounding down to an integer. This means that when the number of concurrent tasks does not exceed 40, the average processing latency of the NPU queue can be kept within 1 second.
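The fitting and threshold calculation can be illustrated with the following Python sketch, which uses the closed-form least squares solution for a straight line. The sample data are synthetic points generated from the example values in the text (0.018 seconds per task, 0.27 seconds base delay), and rounding the result down with math.floor is an assumption, since the text only gives the approximate value 40.

```python
import math


def fit_line(samples):
    """Least squares fit of latency = k * concurrency + b; returns (k, b)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    k = sxy / sxx            # latency coefficient (slope)
    b = mean_y - k * mean_x  # base delay (intercept)
    return k, b


def queue_threshold(service_threshold, k, b):
    """Largest integer concurrency whose modeled latency stays within the threshold."""
    return math.floor((service_threshold - b) / k)


# Synthetic (concurrency, average latency in seconds) samples.
samples = [(n, 0.27 + 0.018 * n) for n in range(2, 44, 2)]
k, b = fit_line(samples)
threshold = queue_threshold(1.0, k, b)
print(round(k, 3), round(b, 2), threshold)  # 0.018 0.27 40
```

With real measurements the points will not lie exactly on a line, so the fitted values and the resulting threshold depend on the measurement noise and on how far into the saturated region the samples extend.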
[0096] For example, in addition to using a linear regression model to fit the sample data, piecewise linear regression or nonlinear regression models (such as polynomial regression or logarithmic regression) can be introduced to handle the nonlinear inflection points that some hardware exhibits when the load is close to saturation. Specifically, when the collected sample data shows a significant change in curvature in the high-concurrency region (i.e., a sudden acceleration in the rise of latency), a single linear model may overestimate or underestimate the actual maximum queue depth. In this case, a concurrency threshold can be set, and piecewise fitting can be performed using linear models with different slopes on either side of the concurrency threshold. In this way, the behavior of the hardware in the critical saturation region can be captured more accurately, avoiding SLO violations or insufficient resource utilization due to model mismatch, and further improving the accuracy of queue threshold calibration.
[0097] By using the above methods, the complex performance characteristics of intensive computing units can be abstracted into a concise linear mathematical model, thereby quantifying the deterministic relationship between load and latency. This avoids the problem of resource waste or insufficient protection caused by empirically setting thresholds, and ensures the accuracy and effectiveness of system overload protection.
[0098] In some implementations, to maximize the utilization of computing resources on the intensive computing unit and reduce task scheduling overhead, when the number of pending tasks in the first queue reaches a preset batch processing threshold, that number of tasks can be retrieved at once and distributed in parallel across the multiple processing cores of the intensive computing unit. Each core then independently executes the complete set of intensive computing layers in a pipelined manner, so that intermediate data for all target tasks is produced efficiently in a highly parallel batch. For example, (104.A2) may include: (104.A2.1) Obtain the second quantity corresponding to the multiple first pending tasks currently contained in the first queue; (104.A2.2) When the second quantity in the first queue reaches the preset batch processing threshold, select from the first queue, in order of arrival time, a number of target vector embedding tasks equal to the preset batch processing threshold; (104.A2.3) Identify multiple processing cores of the intensive computing unit and assign a corresponding target vector embedding task to each processing core; (104.A2.4) For each processing core, run each network layer in the set of dense computation layers in sequence to process the corresponding target vector embedding task and obtain the intermediate data corresponding to the target vector embedding task.
[0099] The second quantity can be the total number of first pending tasks currently held in the first queue and waiting to be processed, i.e., the real-time depth of the first queue, representing the total number of tasks currently waiting to be processed by the intensive computing unit. For example, the real-time number of queued tasks in the GPU task queue is used as the condition for triggering batch processing.
[0100] The preset batch processing threshold can be an optimal batch size pre-set based on the hardware characteristics (such as memory size and number of cores) and computing efficiency of intensive computing units, such as 32 or 64 tasks. This threshold is used to start batch processing when the threshold is reached to balance processing latency and throughput, and to avoid insufficient utilization of computing units due to excessively small batches or resource overflow due to excessively large batches.
[0101] The target vector embedding task can be a specific task selected from the first queue in order of arrival time and prepared for submission to the intensive computing unit in the current round of batch processing. For example, the 32 earliest-arriving text vectorization requests can be the actual processing objects of this round of parallel computing.
[0102] The processing core can be an independent computing entity within a dense computing unit, such as a streaming multiprocessor (SM) or CUDA core in a GPU. Each core can execute instruction streams in parallel to process the assigned target vector embedding task in parallel, thereby accelerating multiple tasks simultaneously.
[0103] For example, the system can obtain the second quantity in real time by maintaining an atomic counter or calling the API provided by the queue. Specifically, the counter can be updated each time a task is added to or removed from the queue, thereby accurately grasping the current backlog.
[0104] Understandably, the batch processing threshold can be an optimal batch size preset based on the hardware characteristics of the intensive computing unit (such as memory capacity, number of streaming multiprocessors, and total number of computing cores) and its computing efficiency. For example, for a GPU with 24 streaming multiprocessors and 16 GB of memory, experiments may show that computing resource utilization is highest at a batch size of 32 tasks, with no performance degradation due to memory overflow. The batch processing threshold can therefore be preset to 32.
[0105] In some implementations, the system can continuously monitor the second quantity in the first queue. When the second quantity reaches the preset batch processing threshold, a batch processing operation is triggered. At this time, the system selects from the head of the first queue, in order of arrival time (first-in, first-out), a number of tasks equal to the batch processing threshold. These tasks are marked as target vector embedding tasks and removed from the first queue. For example, when the number of tasks in the first queue accumulates to 32, the system immediately retrieves the 32 earliest-arriving tasks as target vector embedding tasks and submits them to the intensive computing unit for processing.
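The threshold-triggered batch retrieval described above can be sketched as follows. This is a minimal illustration: the class name `BatchingQueue` and its methods are hypothetical, and a lock-protected `deque` stands in for the atomic counter or queue API mentioned earlier.

```python
import threading
from collections import deque

class BatchingQueue:
    """FIFO queue that releases tasks only in full batches."""

    def __init__(self, batch_threshold: int):
        self._tasks = deque()
        self._lock = threading.Lock()
        self.batch_threshold = batch_threshold

    def put(self, task) -> None:
        with self._lock:
            self._tasks.append(task)

    def depth(self) -> int:
        """Real-time queue depth (the 'second quantity')."""
        with self._lock:
            return len(self._tasks)

    def try_take_batch(self):
        """Return the earliest `batch_threshold` tasks in arrival order,
        or None while the queue is still below the threshold."""
        with self._lock:
            if len(self._tasks) < self.batch_threshold:
                return None
            return [self._tasks.popleft() for _ in range(self.batch_threshold)]

queue = BatchingQueue(batch_threshold=32)
for task_id in range(31):
    queue.put(task_id)
below_threshold = queue.try_take_batch()  # None: only 31 tasks are queued
queue.put(31)
queue.put(32)
batch = queue.try_take_batch()            # the 32 earliest tasks, ids 0..31
```

Checking the depth and removing the batch under the same lock keeps the two steps consistent when several producers enqueue concurrently.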
[0106] It should be noted that the processing core can refer to parallel units at different levels at the hardware level, therefore there are multiple implementation methods for this step. These will be listed one by one below.
[0107] In some implementations, the streaming multiprocessors (SMs) of the intensive computing unit can be used as the basic unit for task allocation. For example, for a GPU with 16 processing cores (here, 16 SMs), 32 target tasks can be distributed evenly among the 16 SMs, with each SM handling 2 tasks. A round-robin approach can be used to allocate tasks one by one to successive SMs, ensuring load balancing.
[0108] In some implementations, task allocation can be more granular. When a GPU kernel is launched, the system defines a thread grid and thread blocks. Each target task can be assigned to an independent thread block, and each thread block is then scheduled to execute on a specific streaming multiprocessor (SM). In this case, 32 target tasks correspond to 32 thread blocks, and the GPU's runtime system is responsible for dynamically scheduling these thread blocks onto available SMs for execution, achieving automatic load balancing.
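The round-robin allocation can be sketched as below. The function name `round_robin_assign` is hypothetical, and the cores are modeled simply as list indices rather than real hardware units.

```python
def round_robin_assign(tasks, num_cores):
    """Deal tasks to cores one at a time, wrapping around after the last core."""
    assignment = [[] for _ in range(num_cores)]
    for index, task in enumerate(tasks):
        assignment[index % num_cores].append(task)
    return assignment

# 32 target tasks spread over 16 processing cores: 2 tasks per core.
assignment = round_robin_assign(list(range(32)), num_cores=16)
print(assignment[0])  # [0, 16]
```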
[0109] In some implementations, to further optimize load balancing, the complexity of the target task can be estimated (e.g., based on the length of the input sequence) before allocation, and then tasks with higher computational requirements can be combined with tasks with lower computational requirements and allocated to the same SM. That is, the computational requirements of the tasks processed by each processing core are balanced to ensure that the processing time of each SM is as close as possible.
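One possible way to pair heavy and light tasks is a greedy longest-processing-time-first heuristic, sketched below. The heuristic is a swapped-in technique chosen for illustration, not one named in the application; the function name `balanced_assign` is hypothetical, and input sequence length is assumed to be the cost estimate.

```python
import heapq

def balanced_assign(task_costs, num_cores):
    """Assign tasks, heaviest first, to whichever core has the least load.

    task_costs maps a task id to its estimated cost (e.g. sequence length).
    """
    heap = [(0, core) for core in range(num_cores)]  # (total cost, core index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_cores)]
    by_cost = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    for task_id, cost in by_cost:
        load, core = heapq.heappop(heap)
        assignment[core].append(task_id)
        heapq.heappush(heap, (load + cost, core))
    return assignment

costs = {"a": 512, "b": 384, "c": 128, "d": 256}
plan = balanced_assign(costs, num_cores=2)
print(plan)  # [['a', 'c'], ['b', 'd']]
```

Here the longest task "a" shares a core with the shortest task "c", so both cores end up with a total estimated cost of 640.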
[0110] Specifically, each processing core (or each SM / thread block) can independently run a complete pipeline of dense computational layers for its assigned target vector embedding tasks. For example, an SM might be assigned two tasks (task A and task B). The SM first processes layer 1 of task A (e.g., a multi-head attention layer), using the hundreds of CUDA cores within it to perform matrix operations in parallel, and obtains the layer 1 output of task A. It then quickly switches to task B, executes its layer 1, and obtains the layer 1 output of task B. Once layer 1 is complete for both tasks, the SM returns to task A and applies layer 2 (e.g., a feedforward layer) to its layer 1 output, and so on, until all network layers (e.g., 12 layers) in the dense computational layer set are completed. Finally, the SM outputs the intermediate data of task A and task B after all dense layer computations. The intermediate data produced by all SMs is aggregated and transferred to the memory of the general-purpose computing unit, where it awaits execution of the remaining lightweight computational layers.
[0111] In some implementations, a task-serialization approach can be adopted. Each streaming multiprocessor (SM), after being assigned one or more target vector embedding tasks, first runs all network layers in its dense computation layer set for the first assigned task, executing layers 1 through M sequentially to obtain and temporarily store the complete intermediate data for that task. Then, it begins processing the next task, running the entire dense computation layer set from beginning to end to obtain the intermediate data for the second task, and so on, until all assigned tasks are completed. This approach is suitable for scenarios with strong data dependencies between tasks or limited shared memory resources, simplifying task scheduling complexity and reducing register resource contention when multiple tasks execute simultaneously.
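The difference between the layer-interleaved approach and the task-serialization approach comes down to execution order on a single core, which can be sketched as follows. Both function names are hypothetical, and the sketch only lists (task, layer) steps rather than running real layers.

```python
def interleaved_order(tasks, num_layers):
    """Layer-major order: run layer 1 for every task, then layer 2, and so on."""
    return [(task, layer)
            for layer in range(1, num_layers + 1)
            for task in tasks]

def serialized_order(tasks, num_layers):
    """Task-major order: run all layers of one task before starting the next."""
    return [(task, layer)
            for task in tasks
            for layer in range(1, num_layers + 1)]

print(interleaved_order(["A", "B"], 2))
# [('A', 1), ('B', 1), ('A', 2), ('B', 2)]
print(serialized_order(["A", "B"], 2))
# [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
```

Both orders perform the same set of layer computations; they differ only in how many tasks' activations must be held at once, which is why the serialized order suits cores with limited shared memory.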
[0112] By using the above methods, dynamic batch processing based on queue depth can be implemented on intensive computing units, thereby aggregating scattered tasks into computing batches of optimal size, avoiding the additional overhead caused by frequent small-scale task scheduling; at the same time, by evenly distributing tasks within a batch to multiple processing cores for parallel execution, the parallel computing potential of the hardware can be fully explored, thereby bringing higher processing throughput and better resource utilization to the overall system.
[0113] In some implementations, to ensure system robustness and service continuity when the intensive computing unit cannot accept new tasks due to overload, it can be determined whether the second queue corresponding to the general-purpose computing unit still has idle processing capacity. If the number of tasks in the second queue is below its threshold, the vector embedding tasks that would otherwise have been processed by the intensive computing unit are temporarily scheduled onto the general-purpose computing unit, which then completes both the intensive computing layer set and the lightweight computing layer set in sequence. This heterogeneous backup keeps tasks flowing and prevents them from being dropped or blocked due to single-point overload. For example, the vector embedding method based on a heterogeneous computing architecture further includes: (A.1) When the first quantity of the multiple first pending tasks contained in the first queue is greater than or equal to the first queue task threshold, and the third quantity of the multiple second pending tasks contained in the second queue corresponding to the general-purpose computing unit is less than the preset second queue task threshold, assign the multiple vector embedding tasks to the second queue; (A.2) Run each network layer in the dense computing layer set in sequence through the general-purpose computing unit to process the multiple vector embedding tasks in parallel and obtain the intermediate data corresponding to each vector embedding task; (A.3) Run each network layer in the lightweight computing layer set in sequence through the general-purpose computing unit to process the multiple intermediate data in parallel and obtain the semantic vector corresponding to each intermediate data.
[0114] Among them, the multiple second tasks to be processed can be a set of vector embedding tasks that have been assigned to the second queue and are waiting for the general-purpose computing unit to perform computation on their relevant network layers.
[0115] The third quantity can be the total number of second pending tasks contained in the second queue at the current moment.
[0116] The preset second queue task threshold can be a maximum allowed number of queued tasks set according to the processing capacity of the general-purpose computing unit. Its calculation method is the same as that of the first queue task threshold. Therefore, the calculation process of the first queue task threshold can be referred to, and will not be repeated here.
[0117] The second queue can be a first-in-first-out (or priority-first-out) buffer set up at the front end of a general-purpose computing unit to cache tasks to be processed, such as the CPU's task scheduling queue.
[0118] For example, the system can detect in real time the first quantity (currently queued tasks) of the first queue (front-end buffer of intensive computing units) and the third quantity of the second queue (front-end buffer of general-purpose computing units). When the first quantity is detected to be greater than or equal to the task threshold of the first queue, it indicates that the intensive computing unit is overloaded. If tasks are continued to be assigned to it, it will lead to a sharp increase in processing latency or even system crash. At this time, the system can further check whether the third quantity of the second queue is less than the preset task threshold of the second queue, that is, to determine whether the general-purpose computing unit still has idle processing capacity. If the third quantity is less than the task threshold of the second queue, it means that the general-purpose computing unit is currently lightly loaded and can handle additional computing tasks. Under this condition, the system can assign multiple newly arrived vector embedding tasks to the second queue instead of continuing to try to join the full first queue. In this way, load balancing and disaster recovery backup between heterogeneous computing units can be achieved. When the main processing path (intensive computing unit) is busy, tasks are automatically and seamlessly switched to the backup processing path (general-purpose computing unit), avoiding tasks being dropped or blocked due to single-point overload, thereby ensuring the continuity and stability of services.
[0119] For example, if the task threshold of the first queue is 64, and there are already 64 tasks in the first queue, the first queue is full, while the task threshold of the second queue is 32, and there are only 10 tasks in the second queue (the third quantity = 10 < 32), then the 20 newly arrived tasks can be assigned to the second queue.
[0120] Specifically, in the above scenario, the general-purpose computing unit can take over the set of intensive computation layers that would normally be executed by an intensive computing unit. The second queue can also have a preset batch processing threshold, which can be the same as or different from that of the first queue. The system can then retrieve from the second queue a number of vector embedding tasks corresponding to that preset batch processing threshold (for example, the 20 tasks mentioned above), and have the general-purpose computing unit (such as a CPU) sequentially run each network layer in the intensive computation layer set to process these tasks in parallel. Although general-purpose computing units are less efficient than intensive computing units at large-scale parallel matrix operations, they offer strong single-core performance and a sophisticated cache hierarchy, allowing them to serve as a fallback processing path when the GPU is overloaded and ensuring that tasks are not discarded.
[0121] For example, when processing these tasks, the CPU can utilize its multi-core architecture for task-level parallelism. This involves assigning different tasks to different CPU cores, with each core independently running a complete pipeline of intensive computational layers for its assigned task, ultimately obtaining intermediate data for each task. This intermediate data is temporarily stored in CPU memory for later use.
[0122] Furthermore, the remaining lightweight computing layers can be completed by the general-purpose computing units. After obtaining the intermediate data for all tasks, each network layer in the lightweight computing layer set can be run sequentially by the general-purpose computing units to process this intermediate data in parallel. Similarly, the CPU can leverage its multi-core advantage to allocate different intermediate data to different cores, with each core independently running the complete lightweight computing layer set for its assigned intermediate data, ultimately obtaining the semantic vector corresponding to each intermediate data. Thus, despite the overload of the intensive computing units, the system successfully produced the final semantic vectors for all tasks by scheduling tasks to the general-purpose computing units, which then fully execute both the intensive and lightweight layer computations, achieving smooth service degradation and continuous availability.
[0123] In some implementations, multi-priority heterogeneous scheduling can be achieved based on queue status. Specifically, the first priority is as follows: If the NPU queue (first queue) is not full, the task is assigned to the NPU, which performs pipeline computation on the intensive computation layer set (e.g., layers 1-20), outputs intermediate data, and then hands it over to the CPU to perform lightweight computation layer set (e.g., layers 21-24), forming a heterogeneous pipeline between the NPU and the CPU. The second priority is as follows: If the NPU queue is full but the CPU queue is not full, computation offloading is triggered, and all tasks are assigned to the CPU queue (second queue), where the CPU independently completes the complete computation of the intensive and lightweight layers, reusing idle CPU resources in exchange for degraded throughput. The third priority is as follows: If both queues are full, overload protection is triggered, and a service busy response is returned to avoid system avalanche caused by queue overflow.
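The three-priority decision can be sketched as a small routing function. This is a minimal illustration: the names `Route` and `choose_route` are hypothetical, and queue depths and thresholds are passed in as plain integers.

```python
from enum import Enum

class Route(Enum):
    HETEROGENEOUS_PIPELINE = "dense unit runs dense layers, general unit runs lightweight layers"
    GENERAL_UNIT_ONLY = "general unit runs both layer sets"
    REJECT_BUSY = "return a service-busy response"

def choose_route(first_depth, first_threshold, second_depth, second_threshold):
    """Pick a processing path from the current depth of both queues."""
    if first_depth < first_threshold:
        return Route.HETEROGENEOUS_PIPELINE   # first priority
    if second_depth < second_threshold:
        return Route.GENERAL_UNIT_ONLY        # second priority: offload
    return Route.REJECT_BUSY                  # third priority: overload protection

# First queue full (64 of 64), second queue lightly loaded (10 of 32).
print(choose_route(64, 64, 10, 32).name)  # GENERAL_UNIT_ONLY
```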
[0124] In some implementations, during scheduling, the queue manager can read in real time the hardware status indicators fed back by the device detector and adjust the scheduling strategy accordingly. For example, when CPU utilization exceeds 80%, the queue task threshold of each queue (such as the first queue and / or the second queue) can be temporarily lowered to avoid CPU overload. This keeps the scheduling strategy flexible.
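A threshold adjustment of this kind could look like the sketch below. The 80% utilization trigger comes from the text; the function name `effective_threshold` and the 50% reduction factor are assumptions made purely for illustration.

```python
def effective_threshold(base_threshold, cpu_utilization,
                        high_water=0.80, reduction=0.5):
    """Temporarily lower a queue task threshold while CPU utilization is high.

    `reduction` is an assumed scaling factor, not a value from the application.
    """
    if cpu_utilization > high_water:
        return max(1, int(base_threshold * reduction))
    return base_threshold

print(effective_threshold(32, cpu_utilization=0.85))  # 16
print(effective_threshold(32, cpu_utilization=0.50))  # 32
```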
[0125] By using the above methods, a cross-unit collaborative overload protection mechanism can be established in a heterogeneous computing architecture. When the main processing path (intensive computing unit) is busy, the task can be automatically and seamlessly switched to the backup processing path (general-purpose computing unit). This avoids service interruption or task failure caused by single hardware overload, thereby significantly improving the robustness and availability of the entire vector embedding system in high-concurrency scenarios and ensuring service continuity and stability.
[0126] Step 105: Run each network layer in the lightweight computing layer set sequentially through the general-purpose computing unit to process multiple intermediate data in parallel and obtain the semantic vector corresponding to each intermediate data.
[0127] In some implementations, in order to complete the entire process of vector embedding computation in a heterogeneous computing architecture, each layer in a pre-divided set of lightweight computing layers can be executed sequentially in a pipeline manner by general-purpose computing units, and multiple intermediate data can be processed in parallel batches. This efficiently transforms the feature representations preprocessed by dense computing units into the final semantic vectors, so as to output low-dimensional dense vector representations that can be directly used by upper-layer applications.
[0128] The semantic vector can be a low-dimensional dense vector that represents the deep semantic information of the original input data, and is ultimately output by the vector embedding model after processing through the entire heterogeneous computing process. For example, it can be a 512-dimensional floating-point vector that can be used for downstream tasks such as text similarity calculation, image retrieval, or item representation in recommendation systems.
[0129] In some implementations, a general-purpose computing unit can take over the intermediate data produced after preprocessing by the intensive computing unit and complete the final computation of the vector embedding process. Specifically, the system acquires multiple intermediate data points generated by the intensive computing unit (such as a GPU) and transferred to the memory of the general-purpose computing unit. These intermediate data points represent the feature representations of each vector embedding task after passing through a set of intensive computational layers (such as the first 20 layers of a Transformer model). Subsequently, the general-purpose computing unit (such as a CPU) runs each network layer in the set of lightweight computational layers in a predetermined order, processing these intermediate data points in parallel to obtain the semantic vector corresponding to each intermediate data point.
[0130] In some implementations, the parallel processing mechanism of the general-purpose computing unit when processing this intermediate data is similar to that of the aforementioned intensive computing unit. The main difference lies in that the processing object changes from the original task data to intermediate data, and the task source changes from the first queue corresponding to the intensive computing unit to the second queue corresponding to the general-purpose computing unit. Specifically, the general-purpose computing unit can utilize its multi-core architecture to obtain a batch of intermediate data from the second queue and allocate tasks according to the number of cores. Each CPU core independently runs a complete lightweight computing layer pipeline for its assigned intermediate data, such as sequentially executing layer normalization, residual connections, and output layers, ultimately producing a low-dimensional dense semantic vector corresponding to the intermediate data. Specific scheduling details regarding task retrieval from the queue, threshold-based batch processing triggering, and task allocation and parallel execution among multiple cores can be found in the description of the intensive computing unit above. The only differences are that the execution hardware changes from the intensive computing unit to the general-purpose computing unit, the processing object changes from the original task data to intermediate data, and the queue changes from the first queue to the second queue, etc., and will not be elaborated further here. Through the above process, the system can complete the entire heterogeneous computation process from the original input to the final semantic vector.
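The per-core processing of intermediate data can be sketched as follows. This is a toy illustration under stated assumptions: `layer_norm` and `scale_output` are simplified stand-ins for real lightweight layers, all function names are hypothetical, and a thread pool is used only to show the task-level fan-out (in standard CPython, pure-Python work on threads is not truly parallel, so a real system would rely on native kernels or separate processes).

```python
import math
from concurrent.futures import ThreadPoolExecutor

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

def scale_output(vector, factor=0.5):
    """Stand-in for an output layer: a fixed elementwise scaling."""
    return [x * factor for x in vector]

LIGHTWEIGHT_LAYERS = [layer_norm, scale_output]

def run_lightweight_layers(intermediate):
    """Run every lightweight layer in order on one piece of intermediate data."""
    for layer in LIGHTWEIGHT_LAYERS:
        intermediate = layer(intermediate)
    return intermediate

def process_batch(intermediates, num_workers=4):
    """Hand each piece of intermediate data to a worker; results keep input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run_lightweight_layers, intermediates))

vectors = process_batch([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
```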
[0131] This application embodiment determines the multiple network layers contained in the pre-trained vector embedding model, and runs each network layer on preset test samples through the general-purpose computing unit and the intensive computing unit respectively, obtaining a first processing latency and a second processing latency corresponding to each network layer; based on the ratio between the first processing latency and the second processing latency of each network layer, it calculates the execution efficiency index corresponding to each network layer; based on the execution efficiency index corresponding to each network layer, the multiple network layers are divided into a set of lightweight computing layers executed by the general-purpose computing unit and a set of dense computing layers executed by the intensive computing unit; multiple vector embedding tasks are obtained, and each network layer in the dense computing layer set is run sequentially by the intensive computing unit to process the multiple vector embedding tasks in parallel, obtaining intermediate data corresponding to each vector embedding task; each network layer in the lightweight computing layer set is run sequentially by the general-purpose computing unit to process the multiple intermediate data in parallel, obtaining the semantic vector corresponding to each intermediate data. In this way, computing tasks can be precisely matched to the computing power characteristics of each computing unit through layer-level hardware-adaptive scheduling. Specifically, this application obtains the execution efficiency index of each network layer on different computing units through testing. Based on this, it allocates computationally intensive network layers to intensive computing units that are good at parallel computing, and allocates lightweight network layers to general-purpose computing units.
This overcomes the resource mismatch caused by binding the entire model to a single computing unit, so that each computing unit handles the layers that match its computing power characteristics. It avoids both idle computing power on intensive computing units when they handle lightweight layers and computing bottlenecks on general-purpose computing units when they handle dense layers. At the same time, it brings the idle resources of general-purpose computing units into vector embedding computation, forming cross-device, layer-level pipelined parallel processing. In summary, this application can improve the system's resource utilization and the processing efficiency of large-scale vector embedding tasks.
[0132] In some implementations, to further reduce computational redundancy and improve processing efficiency on intensive computing units, multiple vector embedding tasks can be clustered to mine semantic similarity between tasks. Then, only representative tasks are selected from each task set to execute the complete intensive computation layer. Based on the semantic differences between the representative task and the remaining tasks, and combined with the intermediate computation results of the representative task, intermediate data for the remaining tasks is quickly constructed. This significantly reduces the total number of tasks that the intensive computing unit needs to process through computational substitution, improving computational efficiency and saving computing resources. For example, the intermediate data may include intermediate representative data and intermediate combined data. Step 104, "running each network layer in the intensive computation layer set sequentially through the intensive computing unit to process multiple vector embedding tasks in parallel and obtain intermediate data corresponding to each vector embedding task," may further include: (104.B1) Clustering is performed on multiple vector embedding tasks to obtain multiple task sets; (104.B2) For each task set, a representative task is selected, and each network layer in the set of dense computing layers is run sequentially through dense computing units to process the representative task and obtain the intermediate representative data corresponding to the representative task. (104.B3) For each set of tasks, obtain the semantic difference between each remaining vector embedding task and the representative task, and construct the intermediate combined data corresponding to each remaining vector embedding task based on the semantic difference and intermediate representative data.
[0133] The intermediate representative data can be an intermediate feature representation generated after the task has undergone a complete set of intensive computing layers by intensive computing units.
[0134] The intermediate combined data can be an approximate intermediate feature representation for a vector embedding task in the task set that is not the representative task. It is generated by combination or reconstruction, based on the semantic difference between that task and the representative task together with the intermediate representative data of the representative task. For example, it can be an intermediate feature tensor obtained by operating on the semantic difference vector and the intermediate representative data, so that the subsequent general-purpose computing unit can process it directly to obtain the corresponding semantic vector; or it can consist of the semantic difference vector and the intermediate representative data themselves, on which the general-purpose computing unit operates to obtain the intermediate feature tensor before further processing it into the corresponding semantic vector.
[0135] The task set can be a subset of tasks with high semantic similarity, obtained by clustering multiple vector embedding tasks. For example, it can be a batch of semantically similar query texts.
[0136] The representative task can be a typical task selected from each task set to perform a complete intensive computation process. For example, it can be the task closest to the cluster center or a task randomly selected from the task set. Its computation result can be used as the computation basis for other tasks in the set to achieve the sharing and reuse of computation results.
[0137] Semantic differences can be quantified differences between each non-representative task and the representative task in the input space or shallow feature space. For example, it can be the difference between the bag-of-words vectors of two texts or the difference vector obtained after lightweight encoding, which is used to recover approximate intermediate features of non-representative tasks by combining the intermediate computation results of the representative task during the intermediate data construction stage.
[0138] Specifically, features can be extracted from the input data of each vector embedding task to obtain a low-dimensional feature vector that represents its semantics. For example, for text tasks, a lightweight bag-of-words model or a pre-trained Sentence-BERT model can be used to quickly generate a text embedding vector; for image tasks, color histograms can be extracted or feature maps can be generated through the first few layers of a small CNN network. Then, based on these feature vectors, clustering algorithms such as K-means, DBSCAN, or hierarchical clustering are used to group semantically similar tasks into the same task set. For example, after clustering 100 text query tasks, 10 task sets may be obtained, such as queries related to electronic products being clustered into one category, and queries related to food being clustered into another category.
[0139] In some implementations, to achieve fast and lightweight clustering of multiple vector embedding tasks, a Hamming distance clustering method based on the SimHash algorithm can be used. Specifically, when a batch of query texts (e.g., 16-32 texts) enters the batch processing queue of an intensive computing unit (such as an NPU), a fixed-length semantic hash code (e.g., a 64-bit or 128-bit binary fingerprint) is first generated for each text using the SimHash algorithm. The SimHash algorithm maps similar content to hash codes that are separated by a small Hamming distance. Subsequently, the Hamming distance between each pair of text hash codes (i.e., the number of bit positions in which the two binary strings differ) is calculated, and clustering is performed based on a preset distance threshold (e.g., Hamming distance ≤ 3), grouping texts with similar hash codes into the same task set. In this way, no high-dimensional vector similarity calculation is needed; clustering can be completed through bitwise operations alone, providing an efficient basis for the subsequent representative task selection and computational reuse.
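SimHash fingerprinting followed by Hamming-distance grouping can be sketched as below. This is a simplified illustration under several assumptions: tokens are split on whitespace with equal weight, BLAKE2b supplies the per-token hash, fingerprint widths are multiples of 8 up to 512 bits, and clustering is a greedy single pass against each cluster's first member. None of these choices is specified by the application, and all function names are hypothetical.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def cluster_by_hamming(texts, max_distance=3):
    """Greedy pass: join the first cluster whose seed is within max_distance."""
    clusters = []  # list of (seed fingerprint, [text indices])
    for index, text in enumerate(texts):
        fingerprint = simhash(text)
        for seed, members in clusters:
            if hamming(seed, fingerprint) <= max_distance:
                members.append(index)
                break
        else:
            clusters.append((fingerprint, [index]))
    return [members for _, members in clusters]

queries = ["cheap wireless phone charger", "cheap wireless phone charger",
           "best ramen restaurant nearby"]
task_sets = cluster_by_hamming(queries)
```

Because the fingerprint only aggregates per-token votes, word order does not affect it, and identical queries always land in the same task set.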
[0140] For example, representative tasks can be selected in several ways, such as choosing the task closest to the cluster center or randomly selecting a task as the representative. After selecting a representative task, it can be submitted to a computationally intensive unit (such as a GPU). The GPU processes this representative task by running each network layer in the set of intensive computational layers in sequence, and finally outputs the feature representation of the task after passing through all intensive layers, i.e., the intermediate representative data. It should be noted that the process of generating intermediate representative data is the same as the process of generating intermediate data described above, and will not be repeated here.
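Choosing the task closest to the cluster center can be sketched as follows. The function name `pick_representative` is hypothetical, and squared Euclidean distance over the clustering feature vectors is an assumed distance measure.

```python
def pick_representative(feature_vectors):
    """Return the index of the vector closest to the centroid of the set."""
    count = len(feature_vectors)
    dim = len(feature_vectors[0])
    centroid = [sum(vec[d] for vec in feature_vectors) / count for d in range(dim)]

    def squared_distance(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(range(count), key=lambda i: squared_distance(feature_vectors[i]))

task_set = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(pick_representative(task_set))  # 1
```

In this example the centroid is (2, 2), so the second task is nearest and would be the one submitted to the intensive computing unit.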
[0141] In some implementations, the specific process of the intensive computing unit processing a single vector embedding task (representative task) can be found in the detailed description above. The core of this approach lies in utilizing the streaming multiprocessor (SM) within the intensive computing unit and its hundreds of CUDA cores to execute matrix operations at each layer in parallel, which will not be elaborated upon here. In this way, each task set only needs to perform a complete intensive computation once, rather than performing computation on all tasks within the set, thereby significantly reducing the load on the intensive computing unit.
[0142] In some implementations, there can be multiple ways to obtain semantic differences. For example, based on the feature vectors generated during clustering, the difference between the feature vector of each remaining task and the feature vector of the representative task can be calculated as the semantic difference. For example, suppose the feature vector of representative task A is $v_A$, and one of the remaining tasks, B, has the feature vector $v_B$. The semantic difference between the two is $\Delta_{AB} = v_B - v_A$. Then, based on this semantic difference and the intermediate representative data $H_A$ of the representative task, intermediate combined data $\tilde{H}_B$ is constructed for task B. For example, a simple linear combination approach can be used: $\tilde{H}_B = H_A + f(\Delta_{AB})$, where $f(\cdot)$ is a mapping function used to map low-dimensional semantic differences to the same feature space as the intermediate data. This mapping function can be a pre-trained shallow neural network or a simple linear transformation matrix. This construction method allows for the rapid acquisition of approximate intermediate data for task B without performing a full, computationally intensive process.
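As a worked illustration of the linear combination described above, the following sketch builds the intermediate combined data for a non-representative task from the representative task's intermediate data plus a mapped semantic difference. The mapping function is assumed here to be a plain linear transformation matrix `W` (one of the two options the text mentions); the names `semantic_difference`, `linear_map`, and `combine`, and the direction of the subtraction (remaining task minus representative task), are assumptions made for the example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def semantic_difference(v_task: Vector, v_rep: Vector) -> Vector:
    """Element-wise difference between a remaining task and the representative."""
    return [t - r for t, r in zip(v_task, v_rep)]


def linear_map(W: Matrix, delta: Vector) -> Vector:
    """Map a low-dimensional difference into the intermediate feature space.
    W has one row per intermediate dimension (hypothetical projection)."""
    return [sum(w * d for w, d in zip(row, delta)) for row in W]


def combine(h_rep: Vector, W: Matrix, delta: Vector) -> Vector:
    """Intermediate combined data = representative data + mapped difference."""
    mapped = linear_map(W, delta)
    return [h + m for h, m in zip(h_rep, mapped)]


# Toy values: 2-dimensional features, 3-dimensional intermediate data.
v_a, v_b = [1.0, 2.0], [1.5, 1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_a = [10.0, 20.0, 30.0]
h_b_approx = combine(h_a, W, semantic_difference(v_b, v_a))
```

In practice the projection would be learned offline; the fixed matrix here only shows the data flow and the shapes involved.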
[0143] In some implementations, the semantic differences corresponding to each remaining vector embedding task and the set corresponding to the intermediate representative data can be used as intermediate combined data. The general-purpose computing unit can then fuse the two data and finally calculate the corresponding semantic vector.
[0144] In some implementations, a multi-level difference compensation mechanism can be introduced to improve the quality of the intermediate combined data. Specifically, after constructing the intermediate combined data for the remaining tasks based on semantic differences and intermediate representative data, this data can be input into a lightweight difference compensation network. This network, pre-trained, can fine-tune the intermediate combined data according to the feature differences in the input, making it closer to the result obtained from performing the complete computation in reality. For example, a residual network consisting of several fully connected layers can be constructed, with the intermediate combined data as input and the compensated, refined intermediate data as output. During training, the residual mapping between the real intermediate data and the intermediate combined data is learned using the real intermediate data as the target and the intermediate combined data as the input. By introducing this compensation mechanism, the computational accuracy of the non-representative tasks can be maximized while maintaining computational efficiency, achieving a better balance between efficiency and accuracy.
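The forward pass of such a compensation network can be sketched as a small residual block. This is an assumption-laden illustration: the two-layer shape, the ReLU activation, and the names `dense`, `relu`, and `compensate` are chosen for the example, and the weights would in practice come from the pre-training described above rather than being set by hand.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Fully connected layer: W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def compensate(h: Vector, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector) -> Vector:
    """Residual refinement: output = input + learned correction.
    With all-zero weights the block leaves the combined data unchanged."""
    correction = dense(W2, b2, relu(dense(W1, b1, h)))
    return [hi + ci for hi, ci in zip(h, correction)]
```

Because the block predicts only a correction that is added back to its input, an untrained or zero-initialized block degrades to the identity, which keeps the uncompensated intermediate combined data as a safe fallback.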
[0145] By leveraging the semantic similarity between tasks, the results of intensive computation layers can be efficiently reused, thereby reducing the number of tasks that intensive computation units need to process from the total number of tasks to the number of representative tasks after clustering, significantly reducing the core computational load. At the same time, by combining semantic differences with intermediate representative data, approximate intermediate data for non-representative tasks can be quickly obtained, maintaining the continuity of the overall processing flow with a small computational cost. This enables the system to achieve a leap in throughput when processing large-scale, highly similar tasks.
[0146] In some implementations, to accurately restore the approximate feature representation carried by the intermediate combined data into a complete data form that can be directly processed by a lightweight computing layer, each intermediate combined data can be recalculated based on the intermediate representative data of its respective task set and pre-calculated semantic differences. This yields accurate target intermediate data equivalent to actually performing full intensive computation. Then, a general-purpose computing unit performs unified parallel batch processing on the intermediate representative data and all target intermediate data to efficiently produce the final semantic vectors corresponding to all tasks. For example, step 105 may include: (105.1) For each intermediate combination of data, the corresponding target intermediate data is obtained by calculating based on the corresponding intermediate representative data and the corresponding semantic difference; (105.2) Run each network layer in the lightweight computing layer set through a general-purpose computing unit to process the intermediate representative data and the target intermediate data in parallel, and obtain the semantic vectors corresponding to the intermediate representative data and the target intermediate data respectively.
[0147] The target intermediate data can be an accurate intermediate feature obtained by restoring and calculating the intermediate combined data, which is equivalent to the feature representation output by the actual execution of the complete dense computing layer.
[0148] For example, for each intermediate composite data constructed through a computation reuse strategy, a reconstruction computation is performed based on the intermediate representative data of its respective task set and the corresponding semantic differences to obtain accurate target intermediate data equivalent to actually performing a complete intensive computation. This reconstruction computation can be performed on general-purpose computing units (such as CPUs) or intensive computing units (such as GPUs), depending on the system's resource scheduling strategy.
[0149] In some implementations, for the k-th non-representative task in the j-th task set, its target intermediate data $H_{j,k}$ can be determined as follows: $H_{j,k} = H_j^{\mathrm{rep}} + f(\Delta_{j,k})$, where $H_j^{\mathrm{rep}}$ represents the intermediate representative data obtained after the representative task of this task set has been processed by the intensive computing unit, $\Delta_{j,k}$ represents the semantic difference vector between the non-representative task and the representative task, and $f(\cdot)$ is a mapping function used to map a low-dimensional semantic difference vector to the same feature space as the intermediate data. This mapping function can be a pre-trained shallow neural network or a simple linear transformation matrix.
[0150] Through the above restoration calculation, the intermediate combined data previously constructed by approximation can be restored into accurate target intermediate data that is equivalent to the actual calculation and can be directly processed by the subsequent lightweight computing layer.
[0151] Furthermore, all the restored target intermediate data, along with the original intermediate representative data, can be submitted to a general-purpose computing unit. This unit then runs each network layer in the lightweight computing layer set to process these data in parallel, ultimately obtaining the semantic vector corresponding to each data point. When processing this data, the general-purpose computing unit (such as a CPU) employs the same parallel processing mechanism described above: utilizing its multi-core architecture, it retrieves a batch of intermediate data (including intermediate representative data and target intermediate data) from a second queue and allocates tasks based on the number of cores. Each CPU core independently runs the complete lightweight computing layer set pipeline for its assigned data, such as sequentially executing layer normalization, residual connections, and output layers, ultimately producing a low-dimensional dense semantic vector corresponding to each intermediate data point. Specific scheduling details regarding task retrieval from the queue, threshold-based batch processing triggering, and task allocation and parallel execution among multiple cores can be found in the detailed description above and will not be repeated here.
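The batch stage described above can be sketched with a worker pool in which each worker runs the whole lightweight pipeline for one item. The sketch assumes the lightweight layer set consists only of layer normalization followed by an output projection, and the names `layer_norm`, `output_projection`, `lightweight_pipeline`, and `process_batch` are hypothetical. Note that a thread pool in CPython illustrates the task distribution but does not by itself give multi-core speedup for pure-Python arithmetic; a real deployment would rely on native kernels or processes.

```python
import math
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale for v in x]


def output_projection(x: Vector, W: Matrix) -> Vector:
    """Project to the final low-dimensional semantic vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def lightweight_pipeline(h: Vector, W: Matrix) -> Vector:
    """Complete lightweight layer set for one intermediate data item."""
    return output_projection(layer_norm(h), W)


def process_batch(batch: List[Vector], W: Matrix, workers: Optional[int] = None) -> List[Vector]:
    """Distribute items across workers; results keep the input order."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: lightweight_pipeline(h, W), batch))
```

The batch passed to `process_batch` would contain both the intermediate representative data and the target intermediate data, so that both kinds of input share one pass through the lightweight layers.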
[0152] Referring to Figure 3, in some implementations, the fine-tuning and reuse implementation described above is summarized as follows. First, when a batch of vector embedding tasks containing multiple query texts (e.g., 16-32 texts) enters the batch processing queue of an intensive computing unit (such as an NPU), a fixed-length semantic hash code (e.g., a 64-bit or 128-bit binary fingerprint) is generated for each query text using the SimHash algorithm. Then, clustering is performed based on the Hamming distance between any two hash codes, grouping semantically similar tasks whose Hamming distance does not exceed a preset threshold (e.g., ≤3) into the same task set, thus forming multiple task sets, each typically containing 2-8 query texts. Next, for each task set, a representative task is selected (e.g., selected randomly or taken as the task closest to the cluster center), and this representative task is submitted to the intensive computing unit. The unit then sequentially runs each network layer (e.g., a multi-head attention layer) in the intensive computing layer set to perform a complete computation, obtaining intermediate representative data corresponding to the representative task.
[0153] Subsequently, for the remaining non-representative tasks in each task set, intermediate combined data is constructed based on their semantic differences with the representative tasks (e.g., determined by hash code differences or feature vector differences). Specifically, the intermediate representative data is transmitted from the intensive computing unit to the general-purpose computing unit (such as the CPU) via a heterogeneous data bus. On the CPU side, for each non-representative task, the intermediate representative data is fine-tuned and reused based on its semantic differences with the representative task to construct the intermediate combined data corresponding to each non-representative task. To ensure accuracy, a cosine similarity threshold (e.g., requiring a recall decrease of <2% due to reuse) can be set as a constraint to control the quality of fine-tuning and reuse. For each intermediate combined data, calculation reconstruction is performed by combining its corresponding intermediate representative data and semantic differences to obtain target intermediate data equivalent to the actual computation.
[0154] Furthermore, each network layer in the lightweight computing layer set (such as layer normalization and output layers subsequently allocated to the CPU) can be run by a general-purpose computing unit to perform parallel batch processing on the intermediate representative data and all target intermediate data, respectively obtaining the final semantic vector corresponding to each intermediate representative data and target intermediate data. In this way, a heterogeneous computing reuse mechanism is realized that significantly reduces the load on intensive computing units and improves the overall processing efficiency while ensuring controllable accuracy.
[0155] By employing the above methods, the computational resources saved through computation reuse strategies in the early stages can be transformed into a guarantee of final accuracy. This significantly reduces the load on intensive computing units while ensuring that all vector embedding tasks (including non-representative tasks) can obtain accurate semantic vectors equivalent to those obtained through full computation. At the same time, by merging intermediate representative data and target intermediate data and submitting them to a general-purpose computing unit for unified batch processing, the parallel efficiency of the lightweight computing stage can be maximized, thereby enabling the system to provide a vector embedding service with high throughput, low overhead, and no loss of accuracy.
[0156] Referring to Figure 4 and Figure 5, in some implementations, the overall process of this application is described below.
[0157] First, the offline preparation phase is performed. As shown in Figure 5, the system can perform offline layer-by-layer micro-benchmark tests on the pre-loaded vector embedding model. It runs each network layer on both a general-purpose computing unit (CPU) and an intensive computing unit (NPU) to process preset test samples, measures the execution latency of each layer on the CPU and on the NPU, and obtains the layer efficiency ratio, i.e., the ratio between the first processing latency (on the CPU) and the second processing latency (on the NPU). This efficiency metric provides a quantitative basis for subsequent runtime decisions.
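The offline measurement and the resulting layer split can be sketched as follows. The timing helper uses wall-clock averaging, and the split rule (a layer goes to the intensive unit when its CPU-to-NPU latency ratio exceeds a cutoff of 1.0) is an assumption made for illustration; this application does not state a specific cutoff in this passage. The names `measure_latency`, `efficiency_ratio`, and `partition_layers` are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


def measure_latency(run_layer: Callable[[], object], repeats: int = 5) -> float:
    """Average wall-clock seconds for one execution of a layer on one device."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_layer()
    return (time.perf_counter() - start) / repeats


def efficiency_ratio(cpu_latency: float, npu_latency: float) -> float:
    """First processing latency (CPU) divided by second processing latency (NPU)."""
    return cpu_latency / npu_latency


def partition_layers(
    latencies: Dict[str, Tuple[float, float]], cutoff: float = 1.0
) -> Tuple[List[str], List[str]]:
    """Split layers into (lightweight, intensive) sets by efficiency ratio.
    cutoff is an assumed tuning parameter, not a value from the source."""
    lightweight: List[str] = []
    intensive: List[str] = []
    for name, (cpu, npu) in latencies.items():
        if efficiency_ratio(cpu, npu) > cutoff:
            intensive.append(name)
        else:
            lightweight.append(name)
    return lightweight, intensive


# Illustrative latencies in milliseconds as (CPU, NPU); the numbers are made up.
measured = {"attention": (8.0, 1.0), "layer_norm": (0.2, 0.5), "output": (0.3, 0.6)}
light_set, dense_set = partition_layers(measured)
```

A ratio well above 1 means the NPU is much faster for that layer, while a ratio below 1 means the transfer to the NPU would not pay off and the layer is better kept on the CPU.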
[0158] Next, the online service phase begins. As shown in Figure 4, the system receives vector embedding tasks in real time and obtains the current hardware status through device detection and instance initialization. Simultaneously, it estimates the queue depth to obtain the current first number of tasks in the first queue (NPU queue) and the current third number of tasks in the second queue (CPU queue). The system determines whether the first number is less than the task threshold of the first queue. If it is, the task is assigned to the first queue and waits for the batch processing conditions to be met before being distributed to the embedding instance for execution. If the first number is greater than or equal to the task threshold of the first queue, the system further determines whether heterogeneous mode is enabled and whether the third number is less than the task threshold of the second queue. If these conditions are met, the task is assigned to the second queue. If both queues are full, a service busy response is returned to the business system to prevent queue overflow from causing uncontrolled latency.
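The admission decision described above reduces to a short routing function. The sketch below follows the stated order of checks; the `Route` enum and the function name `route_task` are hypothetical, and returning a busy response when heterogeneous mode is disabled and the first queue is full is an assumption, since the text only states the busy case for both queues being full.

```python
from enum import Enum


class Route(Enum):
    FIRST_QUEUE = "first_queue"    # NPU queue
    SECOND_QUEUE = "second_queue"  # CPU queue
    BUSY = "busy"                  # service busy response


def route_task(
    first_count: int,
    first_threshold: int,
    third_count: int,
    second_threshold: int,
    heterogeneous_enabled: bool,
) -> Route:
    """Choose a queue for an incoming vector embedding task."""
    if first_count < first_threshold:
        return Route.FIRST_QUEUE
    if heterogeneous_enabled and third_count < second_threshold:
        return Route.SECOND_QUEUE
    # Assumed: any remaining case is treated as overload.
    return Route.BUSY
```

Rejecting early with a busy response keeps the waiting time of admitted tasks bounded by the two thresholds instead of letting the queues grow without limit.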
[0159] When the batch processing conditions are met, the system distributes the task batch to the embedding instance for vector generation; this process incorporates the runtime dynamic decision-making illustrated in Figure 5. Based on the layer efficiency ratios measured offline, the embedding instance employs a greedy strategy combined with real-time hardware load adjustments to dynamically determine the device allocation for each network layer: compute-intensive layers (such as multi-head attention layers) are preferentially allocated to intensive computing units (NPUs), while lightweight layers (such as layer normalization and output projection layers) are preferentially allocated to general-purpose computing units (CPUs). For a single query, a cross-device pipeline is formed: the NPU first executes the assigned intensive layers, the resulting intermediate data is transmitted to the CPU via a heterogeneous data bus, and the CPU then executes the remaining lightweight layers, ultimately producing the complete semantic vector output.
[0160] Finally, the system measures the actual processing latency of each vector generation to determine whether it meets the Service Level Objective (SLO). If the actual latency is less than the SLO, the semantic vector is returned normally; if the actual latency exceeds the SLO, the semantic vector is still returned, and the anomaly is recorded for subsequent dynamic threshold adjustment and system optimization. By combining offline testing with online scheduling, this application achieves refined and adaptive processing of vector embedding tasks under a heterogeneous computing architecture.
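The latency check described above can be sketched as a small helper that always lets the result through and only records violations. The function name `check_slo`, the millisecond unit, the in-memory list used as the anomaly log, and the treatment of a latency exactly equal to the SLO as compliant are all assumptions made for the example.

```python
from typing import List, Tuple


def check_slo(
    task_id: str,
    latency_ms: float,
    slo_ms: float,
    anomalies: List[Tuple[str, float]],
) -> bool:
    """Return True if the latency meets the SLO; otherwise record the
    violation so thresholds can be adjusted later. The semantic vector
    is returned to the caller in both cases."""
    met = latency_ms <= slo_ms  # equality treated as compliant (assumption)
    if not met:
        anomalies.append((task_id, latency_ms))
    return met


anomaly_log: List[Tuple[str, float]] = []
ok_fast = check_slo("q1", 12.0, 50.0, anomaly_log)
ok_slow = check_slo("q2", 80.0, 50.0, anomaly_log)
```

The recorded entries could then feed the dynamic threshold adjustment mentioned above, for example by lowering a queue task threshold when violations accumulate.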
[0161] Referring to Figure 6, this application also provides a vector embedding device based on a heterogeneous computing architecture, which can implement the above-mentioned vector embedding method based on a heterogeneous computing architecture. The heterogeneous computing architecture includes general-purpose computing units and intensive computing units. The vector embedding device based on the heterogeneous computing architecture includes: a determination module 61, used to determine the multiple network layers contained in the pre-trained vector embedding model, and to run each network layer on a general computing unit and on a dense computing unit, respectively, to process preset test samples and obtain the first processing latency and the second processing latency corresponding to each network layer; a calculation module 62, used to calculate the execution efficiency index corresponding to each network layer based on the ratio between the first processing latency and the second processing latency of each network layer; a partitioning module 63, used to divide the multiple network layers into a set of lightweight computing layers executed by general-purpose computing units and a set of dense computing layers executed by dense computing units, based on the execution efficiency index corresponding to each network layer; a running module 64, used to acquire multiple vector embedding tasks and sequentially run each network layer in the set of dense computing layers through the dense computing unit to process the multiple vector embedding tasks in parallel and obtain intermediate data corresponding to each vector embedding task; and a processing module 65, used to sequentially run each network layer in the lightweight computing layer set through a general-purpose computing unit, perform parallel processing on the multiple intermediate data, and obtain the semantic vector corresponding to each intermediate data.
[0162] The specific implementation of this vector embedding device based on a heterogeneous computing architecture is basically the same as the specific embodiment of the vector embedding method based on a heterogeneous computing architecture described above, and will not be repeated here. Subject to meeting the requirements of the embodiments of this application, the vector embedding device based on a heterogeneous computing architecture may also be equipped with other functional modules to implement the vector embedding method based on a heterogeneous computing architecture described above.
[0163] This application also provides a computer device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the aforementioned vector embedding method based on a heterogeneous computing architecture. This computer device can be any smart terminal, including tablet computers, in-vehicle computers, etc.
[0164] Referring to Figure 7, Figure 7 illustrates the hardware structure of a computer device according to another embodiment. The computer device includes: a processor 71, which can be implemented using a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application; a memory 72, which can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 72 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 72 and is called and executed by the processor 71 to perform the vector embedding method based on a heterogeneous computing architecture of the embodiments of this application; an input/output interface 73, used to implement information input and output; a communication interface 74, used to enable communication and interaction between this device and other devices, where communication can be achieved through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, Wi-Fi, Bluetooth, etc.); and a bus 75, which transmits information between the various components of the device (e.g., processor 71, memory 72, input/output interface 73, and communication interface 74). The processor 71, memory 72, input/output interface 73, and communication interface 74 are connected to each other within the device via bus 75.
[0165] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described vector embedding method based on a heterogeneous computing architecture.
[0166] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0167] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0168] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.
[0169] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0170] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0171] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0172] It should be understood that in this application, "at least one" and "several" refer to one or more, and "multiple" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0173] In the embodiments provided in this application, it should be understood that the disclosed systems and methods can be implemented in other ways. For example, the system embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or other forms.
[0174] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0175] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0176] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0177] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A vector embedding method based on a heterogeneous computing architecture, characterized in that the heterogeneous computing architecture includes general-purpose computing units and intensive computing units, and the method includes: determining the multiple network layers contained in a pre-trained vector embedding model, and running each network layer on the general computing unit and on the dense computing unit, respectively, to process preset test samples, so as to obtain the first processing latency and the second processing latency corresponding to each network layer; calculating the execution efficiency index corresponding to each network layer based on the ratio of the first processing latency to the second processing latency of each network layer; dividing the multiple network layers, based on the execution efficiency index corresponding to each network layer, into a set of lightweight computing layers executed by the general-purpose computing unit and a set of dense computing layers executed by the dense computing unit; obtaining multiple vector embedding tasks, and sequentially running each network layer in the set of dense computing layers through the dense computing unit to process the multiple vector embedding tasks in parallel, thereby obtaining intermediate data corresponding to each vector embedding task; and sequentially running, through the general-purpose computing unit, each network layer in the lightweight computing layer set to process the multiple intermediate data in parallel, thereby obtaining the semantic vector corresponding to each intermediate data.
2. The vector embedding method based on heterogeneous computing architecture according to claim 1, characterized in that, The process involves sequentially running each network layer in the set of dense computing layers through the dense computing unit to perform parallel processing on the multiple vector embedding tasks, obtaining intermediate data corresponding to each vector embedding task, including: Obtain the first queue task threshold corresponding to the intensive computing unit. When the first number of first tasks to be processed in the first queue is less than the first queue task threshold, allocate the multiple vector embedding tasks to the first queue. The intensive computing unit sequentially runs each network layer in the intensive computing layer set to perform parallel processing on the multiple vector embedding tasks contained in the first queue, thereby obtaining intermediate data corresponding to each vector embedding task.
3. The vector embedding method based on heterogeneous computing architecture according to claim 2, characterized in that, The step of obtaining the first queue task threshold corresponding to the intensive computing unit includes: The number of concurrent tasks is increased by a preset step size. The average processing latency of the intensive computing unit for processing the preset test tasks is collected under different numbers of concurrent tasks until the target average processing latency corresponding to the target number of concurrent tasks exceeds the preset service threshold, and multiple sets of target sample data are obtained. Each set of target sample data includes the number of concurrent tasks and the corresponding average processing latency; The least squares method is used to perform linear regression fitting on the multiple sets of target sample data to obtain the delay coefficient and base delay of the dense computing unit; Based on the target average processing latency, the latency coefficient, and the base latency, the first queue task threshold corresponding to the intensive computing unit is calculated.
4. The vector embedding method based on heterogeneous computing architecture according to claim 2, characterized in that the step of sequentially running each network layer in the dense computing layer set through the dense computing unit to perform parallel processing on the multiple vector embedding tasks included in the first queue, obtaining intermediate data corresponding to each vector embedding task, includes: obtaining the second number of first pending tasks currently contained in the first queue; when the second number in the first queue reaches the preset batch processing threshold, selecting from the first queue, in order of arrival time, a number of target vector embedding tasks equal to the preset batch processing threshold; identifying the multiple processing cores of the intensive computing unit, and assigning a corresponding target vector embedding task to each processing core; and, for each processing core, sequentially running each network layer in the set of dense computing layers to process the corresponding target vector embedding task and obtain the intermediate data corresponding to the target vector embedding task.
5. The vector embedding method based on a heterogeneous computing architecture according to claim 2, characterized in that the method further includes: when the first number of first pending tasks contained in the first queue is greater than or equal to the first queue task threshold, and a third number of second pending tasks contained in the second queue corresponding to the general-purpose computing unit is less than a preset second queue task threshold, assigning the multiple vector embedding tasks to the second queue; sequentially running each network layer in the dense computing layer set through the general-purpose computing unit to process the multiple vector embedding tasks in parallel and obtain the intermediate data corresponding to each vector embedding task; and sequentially running each network layer in the lightweight computing layer set through the general-purpose computing unit to process the multiple intermediate data in parallel and obtain the semantic vector corresponding to each intermediate data.
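The overflow routing in claim 5 can be illustrated with a small decision function. This is a sketch under assumptions: the "first" branch reflects the presumed complementary case in which the first queue still has room, and the `None` result for both queues being saturated is a placeholder, because the claim does not say what happens then. The names `choose_queue`, `run_layers`, and `embed_on_general_unit` are hypothetical.

```python
def choose_queue(first_count, first_threshold, second_count, second_threshold):
    """Pick the queue for incoming vector embedding tasks."""
    if first_count < first_threshold:
        return "first"   # assumed default: dense computing unit has room
    if second_count < second_threshold:
        return "second"  # claim 5: overflow to the general-purpose unit
    return None          # both saturated: behavior not specified in the claim


def run_layers(data, layers):
    """Run a list of layers sequentially."""
    for layer in layers:
        data = layer(data)
    return data


def embed_on_general_unit(task, dense_layers, light_layers):
    """On overflow, the general-purpose unit runs the dense layer set first,
    then the lightweight layer set, producing the semantic vector."""
    intermediate = run_layers(task, dense_layers)
    return run_layers(intermediate, light_layers)
```

The point of the fallback is that a saturated dense computing unit does not stall new tasks while the general-purpose unit sits below its own limit.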
6. The vector embedding method based on a heterogeneous computing architecture according to claim 1, characterized in that the intermediate data includes intermediate representative data and intermediate combined data, and the step of sequentially running each network layer in the dense computing layer set through the dense computing unit to perform parallel processing on the multiple vector embedding tasks, obtaining the intermediate data corresponding to each vector embedding task, includes: clustering the multiple vector embedding tasks to obtain multiple task sets; for each task set, selecting a representative task, and sequentially running each network layer in the dense computing layer set through the dense computing unit to process the representative task and obtain the intermediate representative data corresponding to the representative task; and, for each task set, obtaining the semantic difference between each remaining vector embedding task and the representative task, and constructing the intermediate combined data corresponding to each remaining vector embedding task based on that task's semantic difference and the intermediate representative data.
7. The vector embedding method based on a heterogeneous computing architecture according to claim 6, characterized in that the step of sequentially running each network layer in the lightweight computing layer set through the general-purpose computing unit to process multiple intermediate data in parallel, obtaining the semantic vector corresponding to each intermediate data, includes: for each intermediate combined data, calculating the corresponding target intermediate data based on the corresponding intermediate representative data and the corresponding semantic difference; and sequentially running each network layer in the lightweight computing layer set through the general-purpose computing unit to process the intermediate representative data and the target intermediate data in parallel, thereby obtaining the semantic vectors corresponding to the intermediate representative data and the target intermediate data, respectively.
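Claims 6 and 7 together describe running the dense layers only once per task set and reconstructing the rest from differences. The sketch below illustrates that flow under strong assumptions that the claims do not state: each task set is already clustered with its representative in first position, the semantic difference is an element-wise difference of input vectors, the dense layers preserve vector dimension, and the target intermediate data is the intermediate representative data plus the difference. All names are hypothetical.

```python
def semantic_difference(task, representative):
    """Assumed definition: element-wise difference of two input vectors."""
    return [t - r for t, r in zip(task, representative)]


def process_task_set(tasks, dense_forward):
    """Run the dense layers once, on the representative (tasks[0]) only.
    Returns the intermediate representative data and, for each remaining
    task, the difference that forms its intermediate combined data."""
    representative, rest = tasks[0], tasks[1:]
    rep_data = dense_forward(representative)
    differences = [semantic_difference(t, representative) for t in rest]
    return rep_data, differences


def target_intermediate(rep_data, difference):
    """Assumed reconstruction: representative data plus the difference."""
    return [r + d for r, d in zip(rep_data, difference)]
```

Under these assumptions, a task set of m tasks costs one dense pass instead of m, at the price of an approximation for the non-representative tasks.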
8. A vector embedding device based on a heterogeneous computing architecture, characterized in that the heterogeneous computing architecture includes a general-purpose computing unit and a dense computing unit, and the device includes: a determination module, used to determine the multiple network layers contained in a pre-trained vector embedding model, and to run each network layer on preset test samples through the general-purpose computing unit and the dense computing unit respectively, so as to obtain the first processing latency and the second processing latency corresponding to each network layer; a calculation module, used to calculate the execution efficiency index corresponding to each network layer based on the ratio between the first processing latency and the second processing latency of that network layer; a partitioning module, used to divide the multiple network layers, based on the execution efficiency index corresponding to each network layer, into a lightweight computing layer set executed by the general-purpose computing unit and a dense computing layer set executed by the dense computing unit; a running module, used to acquire multiple vector embedding tasks and sequentially run each network layer in the dense computing layer set through the dense computing unit to process the multiple vector embedding tasks in parallel and obtain the intermediate data corresponding to each vector embedding task; and a processing module, used to sequentially run each network layer in the lightweight computing layer set through the general-purpose computing unit to process the multiple intermediate data in parallel and obtain the semantic vector corresponding to each intermediate data.
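The calculation and partitioning modules of claim 8 can be sketched as below. The claim defines the execution efficiency index as a ratio between the two latencies but fixes neither the direction of the ratio nor a cut-off, so this sketch assumes index = general-purpose latency / dense latency and a hypothetical `index_threshold` above which a layer goes to the dense computing unit. The layer names in the usage example are illustrative only.

```python
def efficiency_index(general_latency, dense_latency):
    """Assumed direction: how many times faster the dense computing unit
    runs this layer than the general-purpose computing unit."""
    if dense_latency <= 0:
        raise ValueError("dense latency must be positive")
    return general_latency / dense_latency


def partition_layers(latencies, index_threshold=1.5):
    """Split layers into (lightweight_set, dense_set).
    `latencies` maps layer name -> (first latency, second latency), measured
    on the general-purpose and dense computing units respectively.
    `index_threshold` is a hypothetical cut-off, not taken from the claim."""
    lightweight, dense = [], []
    for name, (t_general, t_dense) in latencies.items():
        if efficiency_index(t_general, t_dense) >= index_threshold:
            dense.append(name)
        else:
            lightweight.append(name)
    return lightweight, dense
```

For example, with latencies `{"embedding": (1.0, 0.9), "attention": (8.0, 1.0), "pooling": (0.5, 0.6)}`, only the attention layer gains enough from the dense computing unit to be placed in the dense computing layer set.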
9. A computer device, characterized in that the computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the vector embedding method based on a heterogeneous computing architecture as described in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the vector embedding method based on a heterogeneous computing architecture as described in any one of claims 1 to 7.