Function batch calling method and electronic device

By queuing and batch processing function call requests for computing tasks within the cluster, the problem of low efficiency and resource waste caused by high-frequency calls to structured query language engines and artificial intelligence functions is solved, achieving more efficient data processing and reliable results.

CN122240215A · Pending · Publication Date: 2026-06-19 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, calls to artificial intelligence functions from structured query language engines carry high overhead, and the high frequency of calls leads to inefficiency and wasted resources. Furthermore, artificial intelligence models are prone to attention decay and reduced result quality when processing requests that contain a large number of tokens.

Method used

By queuing up function call requests for computing tasks within the cluster, requests with the same model inference service and function type are aggregated, and when preset conditions are met, they are encapsulated into model inference service requests to achieve batch processing, reduce high-frequency independent calls, and alleviate resource waste and attention decay.

Benefits of technology

It improves data processing efficiency, reduces computing power and token consumption, avoids resource waste, and ensures the stability and reliability of function call results.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method and electronic device for batch function invocation, relating to the field of big data technology. The method divides function call requests from the computing tasks within a cluster into queues, so that requests in the same queue correspond to the same model inference service and have the same function type, thereby aggregating and classifying similar requests. When a queue's batch control parameters meet preset conditions, multiple function call requests are encapsulated into a single model inference service request and sent to the corresponding model inference service. This achieves batch processing and batch return of multiple function call requests, reducing the number of calls to the model inference service, lowering computing power consumption, token consumption, and time consumption, improving query processing efficiency, and avoiding resource waste for both the caller and the callee.

Description

Technical Field

[0001] This application relates to the field of big data technology, and in particular to a method for batch function invocation and an electronic device.

Background Technology

[0002] The development of Artificial Intelligence (AI) technology has driven the deep integration of data processing and intelligent analysis. Mainstream Structured Query Language (SQL) engines have combined SQL processing with AI capabilities to compensate for the shortcomings of traditional engines in intelligent analysis and to raise the level of intelligent data analysis. Currently, the two are mainly integrated by introducing AI functions, which support intelligent capabilities such as sentiment analysis and intent recognition. AI functions are used in much the same way as user-defined functions, which lowers the barrier to adoption. However, due to hardware and model technology limitations, AI model calls are costly, and high-frequency calls over massive amounts of data can easily lead to low efficiency and wasted resources. At the same time, when processing long texts, models suffer from attention decay and from degraded performance once the token limit is exceeded.

Summary of the Invention

[0003] This application provides a method and electronic device for batch function invocation, which at least solves the problems of high overhead and low efficiency and resource waste caused by high frequency of function invocation in related technologies.

[0004] This application provides a method for batch function invocation, comprising: obtaining function invocation requests from various computing tasks within a cluster, and dividing the function invocation requests into queues; wherein, function invocation requests assigned to the same queue correspond to the same model inference service and have the same function type; for any queue, encapsulating multiple function invocation requests corresponding to its batch control parameters when they meet preset conditions into a model inference service request; sending the model inference service request to the corresponding model inference service, so that the model inference service can batch process the multiple function invocation requests contained in the model inference service request and return the function invocation results in batches.

[0005] This application also provides a function batch invocation apparatus, including: a request queue module, used to obtain function call requests from various computing tasks within the cluster and to divide the function call requests into queues, wherein function call requests assigned to the same queue correspond to the same model inference service and have the same function type; a batch encapsulation module, used to encapsulate, for any queue, the multiple function call requests corresponding to that queue into a model inference service request when the queue's batch control parameters meet preset conditions; and a sending module, used to send the model inference service request to the corresponding model inference service, so that the model inference service can process the multiple function call requests contained in the model inference service request in batches and return the function call results in batches.

[0006] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for implementing the above-described function batch call method when executing the computer program.

[0007] This application also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above-described batch function call method.

[0008] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described batch function call method.

[0009] This application achieves aggregation and categorization of similar requests by dividing function call requests from various computing tasks within the cluster into queues based on the same model inference service and the same function type. Then, when a queue's batch control parameters meet the preset conditions, multiple function call requests are uniformly encapsulated into a model inference service request and sent to the corresponding model inference service. This enables batch processing and batch return of multiple function call requests, reduces the number of calls to the model inference service, reduces computing power consumption, token consumption and time consumption, improves query processing efficiency, and avoids resource waste for both the caller and the callee.

Attached Figure Description

[0010] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0011] Figure 1 is a schematic diagram of a specific hardware architecture on which the execution of the batch function invocation method depends, provided in an embodiment of this application; Figure 2 is a flowchart of a method for batch function invocation provided in an embodiment of this application; Figure 3 is a schematic diagram of request and response queue mapping provided in an embodiment of this application; Figure 4 is a schematic diagram of parallel sampling with multiple result queues provided in an embodiment of this application; Figure 5 is a flowchart of another method for batch function invocation provided in an embodiment of this application; Figure 6 is a schematic diagram of the structure of a function batch invocation apparatus provided in an embodiment of this application; Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0012] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.

[0013] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.

[0014] To more clearly illustrate the embodiments of this application, the technical terms used in the embodiments will be briefly introduced below: The SQL engine is a core component of a database management system. It is responsible for parsing, optimizing, and executing user-submitted structured query language statements, enabling operations such as retrieval, insertion, updating, and deletion of data in the database. After receiving the SQL statement, it generates an execution plan through lexical analysis, syntax analysis, and query optimization. Finally, it interacts with the storage engine to complete data processing, serving as a crucial bridge connecting user operations and the underlying data storage.

[0015] AI functions are a special type of function integrated into data processing engines (such as SQL engines and computing engines). They encapsulate the capabilities of pre-trained artificial intelligence models (such as sentiment analysis, intent recognition, text classification, entity extraction, etc.). Their calling method is consistent with that of regular functions (such as user-defined functions). Users do not need to master complex AI model development knowledge to directly call AI capabilities during data query or calculation, thereby realizing the integration of data and intelligent analysis and lowering the application threshold of AI technology in data processing scenarios.

[0016] User-defined functions are reusable functions that users define according to their business needs, supplementing the shortcomings of built-in functions in databases or computing engines. They allow users to define specific logic using programming languages such as SQL, Java, and Python, and can be called in query statements like built-in functions, supporting parameter input and result return.

[0017] Prefix caching is a caching optimization technique for frequently repeated prefix requests. Its core is to cache the same prefix part (such as text sequence, query statement prefix, model input prefix) in multiple requests in memory or high-speed storage medium. When subsequent requests contain the prefix, the prefix calculation result in the cache (such as the intermediate state of model inference, the pre-result of query parsing) is directly reused to avoid repeated calculation.

[0018] The cross encoder is a text matching model based on the Transformer architecture. Unlike a dual-tower model, which encodes the two texts separately, it concatenates the two input texts (such as query text and candidate text) and inputs them into a single encoder. It captures the interaction features between the two texts through a self-attention mechanism and finally outputs the matching score of the text pair.

[0019] Flink is an open-source, distributed computing framework that unifies stream and batch processing. Its core is based on a dataflow programming model, supporting high-throughput, low-latency real-time data processing while also being compatible with batch data processing. Flink AI functions are built-in native features that allow users to call large models for real-time inference using SQL, achieving a simplified integration of streaming data and AI.

[0020] With the rapid development of artificial intelligence technology, the need for the integration of data processing and intelligent analysis is becoming increasingly prominent. Mainstream structured query language engines are gradually introducing a deep integration of structured query language processing capabilities and artificial intelligence capabilities to adapt to the data processing needs of the new era. Traditional structured query language engines have significant shortcomings in the field of intelligent analysis, struggling to achieve complex data semantic understanding, deep feature mining, and other intelligent processing functions. The integration of artificial intelligence technology precisely fills this gap, significantly improving the data analysis capabilities and intelligence level of structured query language engines.

[0021] Currently, the core approach to integrating mainstream structured query language (SQL) computing engines with artificial intelligence (AI) technology involves introducing AI functions. These AI functions encompass a variety of common intelligent processing capabilities, including sentiment analysis, intent recognition, intelligent classification, and language translation. The usage of these AI functions is highly similar to that of user-defined functions. Users do not need to switch programming languages or master complex AI principles; they can directly invoke them within SQL query statements. This effectively breaks down the technical barriers between data processing and AI applications, significantly lowering the user threshold and enhancing the convenience of intelligent data analysis.

[0022] However, under current technological conditions, the combined application of structured query language engines and artificial intelligence (AI) technology still faces numerous unresolved technical challenges, severely hindering its application effectiveness and widespread adoption. On one hand, limitations imposed by current hardware performance and the development level of AI model technology result in significant overall overhead when calling AI models, manifested in high computational resource consumption, excessive token consumption, and long response times. On the other hand, structured query language queries typically involve massive data processing, and a single query often requires separate AI function calls for many data records. This leads to frequent triggering of AI models, resulting not only in low data processing efficiency but also severe resource waste for both the structured query language engine (the caller) and the AI model service (the callee).

[0023] Furthermore, when processing queries that contain a large number of tokens, AI models are prone to attention decay due to the characteristics of their own attention mechanisms, leading to decreased accuracy and deterioration in output quality. If the number of tokens generated by a single query exceeds the maximum processing limit of the AI model, it will have a more severe negative impact on the quality of the output, even causing the results to become invalid. In addition, related structured query language engines lack effective quality monitoring mechanisms for the results of AI function calls, failing to promptly detect and report quality issues in the call results, further affecting the reliability and usability of the data processing results.

[0024] To address all or at least some of the aforementioned technical problems, this application provides a method for batch function invocation. This method effectively solves the technical problems of low efficiency, resource waste, and insufficient result reliability when combining SQL engines with AI functions. The method divides function call requests from computing tasks within the cluster into queues based on the same model inference service and function type, thus aggregating and classifying similar requests. Then, based on batch control parameters, multiple function call requests that meet the conditions are encapsulated into a unified model inference service request and sent in batches to the corresponding model inference service for processing. This solution reduces the overhead of high-frequency independent calls to AI models, avoids resource waste caused by repeated calls in massive data scenarios, and simultaneously reduces the frequency of model calls through batch processing, alleviating the attention decay and token-limit overruns caused by requests with many tokens. This improves data processing efficiency and system resource utilization at the call process level, indirectly ensuring the stability and reliability of function call results.

[0025] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0026] As shown in Figure 1, the specific hardware architecture upon which the batch function call method depends includes: a Flink distributed cluster, a model inference service cluster, a distributed cache storage layer, and a network communication architecture.

[0027] Among them, the Flink distributed cluster provides computing power and distributed task scheduling support for the interception, batching, and process control of AI function requests.

[0028] The model inference service cluster consists of inference server nodes. Each inference server node deploys a large model inference framework, receives batch AI function inference requests from the Flink cluster, parses the structured request data, and schedules accelerator cards to execute model inference computations. It also encapsulates the inference results in a structured format and returns them to the Flink cluster. Furthermore, it monitors the running status of inference tasks and provides feedback on call failures such as exceeding token limits or API timeouts. The inference server nodes deploy AI model accelerator cards, enabling rapid forward inference computations of large models, reducing inference latency for single / batch AI function requests, and supporting KV cache space allocation to achieve cache reuse with shared prefixes.

[0029] The distributed cache storage layer comprises a distributed in-memory cache cluster and persistent storage nodes. The distributed in-memory cache cluster is responsible for caching batches of AI function requests (request cache) and batch results returned by large model inference (result cache) after the Flink cluster has accumulated them. It also supports organizing and managing cached data by queue dimension, achieving precise mapping between requests and results. The cluster employs master-slave replication and sharding mechanisms to ensure high availability and horizontal scalability of cached data, preventing data loss due to single-node failures. The persistent storage nodes are responsible for persistently storing results after successful sampling and marking failed calls. Once the batch results pass the accuracy sampling module verification, the Flink cluster writes the results to the corresponding database / table in this layer for long-term data storage. Simultaneously, AI function request data marked as failed is stored separately, providing a data foundation for subsequent data replenishment and reruns.

[0030] The network communication architecture provides support for low latency, high bandwidth, and high reliability network transmission.

[0031] The embodiments of this application provide a method for batch function invocation, which can be applied to enterprise-level big data infrastructure platforms and big data all-in-one machines such as vertical knowledge management all-in-one machines.

[0032] As shown in Figure 2, the method includes the following steps: S201. Obtain the function call requests of each computing task in the cluster and divide the function call requests into queues.

[0033] This application abandons the traditional one-by-one real-time invocation mode and instead intercepts the AI function call requests from all jobs within the Flink cluster. All AI function call requests are temporarily stored in the request cache module and then enter a unified scheduling process. Here, a "job" refers to an independent computational task unit submitted to and executed by the Flink cluster. In the distributed architecture, the Flink cluster consists of one master node and multiple worker nodes. Users drive computation by submitting jobs. Each job corresponds to a complete computational logic, encompassing the entire process from data reading and transformation to result output.

[0034] Function call requests are divided into queues, with requests in the same queue corresponding to the same model inference service and having the same function type. The division can be keyed on both the model's application programming interface (API) and the AI function type. Each API corresponds to a unique inference service with standardized technical metrics such as maximum token count and inference rate. The same function type implies a fixed system prompt, ensuring consistent inference requirements and format. Function call requests within the same queue therefore point to the same model inference service and share the same category of system prompt.
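As an illustrative sketch only, the queue division described above can be expressed as grouping by a two-part key. The class name `FunctionCallRequest`, its fields, and the function `partition_into_queues` are hypothetical names introduced here for illustration and do not appear in this application.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCallRequest:
    # All field names are assumptions made for this sketch.
    job_id: str         # computing task (job) that issued the call
    model_api: str      # identifies the model inference service
    function_type: str  # e.g. "sentiment" or "intent"
    payload: str        # field value passed to the AI function


def partition_into_queues(requests):
    """Group requests so that each queue holds exactly one
    (model_api, function_type) combination, preserving arrival order."""
    queues = defaultdict(list)
    for req in requests:
        queues[(req.model_api, req.function_type)].append(req)
    return dict(queues)


if __name__ == "__main__":
    incoming = [
        FunctionCallRequest("job-1", "api-a", "sentiment", "great phone"),
        FunctionCallRequest("job-2", "api-a", "intent", "cancel my order"),
        FunctionCallRequest("job-3", "api-a", "sentiment", "poor battery"),
    ]
    for key, members in partition_into_queues(incoming).items():
        print(key, [m.payload for m in members])
```

Keying on both parts means that every request in a queue targets one service with one system prompt, which is what allows a single set of batch control parameters per queue.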

[0035] Grouping function call requests with the same model inference service and the same function type into the same queue aims to achieve unified queue-level management of configuration parameters. Since the model service characteristics and inference optimization techniques corresponding to the same queue are completely consistent, unified batch control parameters can be configured for the queue. This allows the batching logic of all batches to be accurately matched with the technical indicators of the model service, avoiding problems such as token limit overruns, low cache reuse rate, and degraded inference quality caused by inconsistent parameters.

[0036] In some embodiments, during the process of queuing function call requests, different queue division methods are used based on the different data processing modes of the computing tasks within the cluster. These data processing modes include batch processing and stream processing. Batch processing is one of the two basic computing modes supported by Flink, and it is a processing method for bounded data (i.e., datasets with a fixed amount of data and clear start and end boundaries) that performs computations in a single operation. Stream processing is Flink's core computing mode, and it is a processing method for unbounded data (i.e., data streams that are continuously generated and have no clear end boundary) that performs continuous computations.

[0037] If the data processing mode of the computing task is batch processing, the function call requests are first sorted by their field values, and then function call requests with matching prefix fields are grouped into the same queue. The prefix field of the field value indicates the model inference service and function type of the function call request. Because batch processing mode involves large amounts of data and has no real-time pressure, this application performs job-level batching and sorting to maximize the use of prefix caching technology and improve resource utilization.

[0038] Specifically, function call requests are sorted by their field values, so that function call requests with the same or similar prefixes are arranged adjacently. This ensures that such function call requests fall into the same batch or adjacent batches during batching, so that the model inference of each batch can fully reuse the KV Cache, minimizing the time to first token and the redundant consumption of computing power.
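The effect of sorting can be illustrated with a small sketch. It assumes plain lexicographic ordering of the field values and uses shared leading characters as a rough stand-in for shared leading tokens; both are simplifications made here, and the helper names are hypothetical.

```python
import os


def sort_for_prefix_reuse(field_values):
    """Lexicographic sort: values that share a prefix end up adjacent."""
    return sorted(field_values)


def adjacent_shared_prefix(values):
    """Total number of leading characters shared by neighbouring values.
    A larger total suggests more opportunity for prefix cache reuse."""
    return sum(
        len(os.path.commonprefix([a, b])) for a, b in zip(values, values[1:])
    )


if __name__ == "__main__":
    arrivals = [
        "Review: great phone",
        "Ticket: refund please",
        "Review: great laptop",
        "Ticket: refund denied",
    ]
    ordered = sort_for_prefix_reuse(arrivals)
    print(adjacent_shared_prefix(arrivals))  # interleaved arrival order
    print(adjacent_shared_prefix(ordered))   # after sorting
```

In the interleaved arrival order no neighbouring pair shares a prefix, while after sorting the two "Review: great " values and the two "Ticket: refund " values sit next to each other.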

[0039] In batch processing mode, requests are first sorted by the field values of the function call requests. Then, requests with the same or similar prefix fields and the same function type are grouped into the same queue. This further improves the accuracy of request aggregation and the fit of each batch. It makes the data features within each batch-encapsulated model inference service request more uniform and the processing logic more consistent. This reduces inference anomalies and result deviations caused by differences in request features, makes more intensive use of tokens and computing resources, effectively alleviates the attention decay problem in long-text or multi-field scenarios, and improves the accuracy and stability of inference results. At the same time, the ordered aggregation and grouping of similar requests reduces the complexity of scheduling and distribution, reduces invalid calls and redundant calculations, and further improves the overall efficiency of data processing and system resource utilization in batch processing scenarios.

[0040] Meanwhile, considering the large data volume characteristics of batch processing mode, the batch granularity is limited to the job level. Function call requests within the same computing task (job) are batched and scheduled, and function call requests are not aggregated across computing tasks. This avoids additional cluster resource overhead and batch delay caused by cross-job scheduling of massive data, and ensures the lightweight and efficient batching process.

[0041] If the data processing mode of the computation task is stream processing, the batching granularity in this mode is at the cluster level, which involves queuing function call requests from all computation tasks in the cluster. Even if function call requests in the same batch come from different computation tasks, cluster-level aggregation can fully aggregate scattered requests, improving the batching rate and batch processing efficiency of large model calls. Since stream processing data arrives continuously and requires low latency, this application adopts cluster-level batching without sorting, aiming to integrate scattered requests across jobs to improve efficiency, while strictly controlling latency through waiting time thresholds to prevent impact on business operations.

[0042] To address the issue of processing delays caused by request aggregation during cluster-level batching, this application configures low-frequency computing tasks with stringent real-time requirements as high-priority tasks. Function call requests from these computing tasks skip the batching process and are invoked one by one in real time in the traditional manner, thereby minimizing request processing latency and meeting real-time business needs.
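The two batching granularities and the high-priority bypass can be summarized in a short routing sketch. The names `Route` and `route_request`, the mode strings, and the `high_priority` flag are assumptions introduced for illustration; the application does not specify how priority is configured or represented.

```python
from typing import NamedTuple, Optional, Tuple


class Route(NamedTuple):
    batched: bool               # False: invoke the model directly, one call per request
    queue_key: Optional[Tuple]  # None when the request bypasses batching


def route_request(mode, job_id, model_api, function_type, high_priority=False):
    """Decide where a function call request goes (hypothetical helper).

    - high-priority tasks skip batching entirely (described in the text
      in the context of cluster-level batching);
    - batch mode keeps requests inside their own job (job-level granularity);
    - stream mode pools requests from all jobs (cluster-level granularity).
    """
    if high_priority:
        return Route(batched=False, queue_key=None)
    if mode == "batch":
        return Route(True, (job_id, model_api, function_type))
    if mode == "stream":
        return Route(True, (model_api, function_type))
    raise ValueError(f"unknown data processing mode: {mode!r}")


if __name__ == "__main__":
    print(route_request("batch", "job-1", "api-a", "sentiment"))
    print(route_request("stream", "job-1", "api-a", "sentiment"))
    print(route_request("stream", "job-9", "api-a", "sentiment", high_priority=True))
```

Including `job_id` in the batch-mode key is one simple way to prevent cross-job aggregation; omitting it in stream mode lets scattered requests from different jobs share a batch.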

[0043] Subsequently, function requests in each queue are continuously collected and batched. For batch processing requests, the data is sorted according to the field values of the AI function calls before batching, so that data with the same or similar prefixes are arranged adjacently, maximizing the use of prefix caching optimization technology; for stream processing requests, they are batched directly in the order of receipt, without performing data sorting operations.

[0044] S202. For any queue, when its batch control parameters meet the preset conditions, encapsulate the corresponding multiple function call requests into a model inference service request.

[0045] Each queue is configured with batch control parameters, including a batch capacity limit and a waiting time threshold. The batch capacity limit is the maximum number of function call requests within a batch. The core purpose of setting this threshold is to impose a hard limit on the amount of data, preventing excessive batch data from causing the number of tokens in a single call to exceed the capacity of the large model inference service. It also prevents issues such as attention decay and deterioration in inference result quality caused by excessive token counts. The waiting time threshold is the longest batch accumulation waiting time, starting from the first data item entering the batch. Accumulation ends when the timeout is reached. The core purpose of this threshold is to avoid indefinitely waiting for subsequent data in pursuit of larger batches, controlling batch accumulation latency from a time perspective, preventing excessive delays in the overall data processing chain, and balancing the scale effect of batch accumulation with processing timeliness.

[0046] For any queue, when the number of function call requests reaches the batch capacity limit, multiple function call requests that have reached the batch capacity limit are encapsulated into a model inference service request; or, for any queue, if the waiting time calculated from the first function call request entering the queue exceeds the waiting time threshold, multiple function call requests within the waiting time threshold are encapsulated into a model inference service request.

[0047] Specifically, the status of batch data in each queue is monitored in real time. A batch is considered complete when either of the following conditions is met, and the multiple function call requests in that batch are encapsulated into a single model inference service request: 1) the number of function call requests in the batch reaches the preset batch capacity limit; 2) the waiting time, measured from when the first function call request entered the batch, exceeds the preset waiting time threshold. After batch completion is determined, the batch data is marked as pending transmission, the queue batching progress is updated synchronously, and the data collection process for the next batch is initiated.
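The two completion conditions can be sketched as a small accumulator. This is a minimal illustration under stated assumptions: the class name `BatchAccumulator`, its method names, and the practice of passing the current time in as an argument (rather than reading a clock) are choices made here for clarity and testability, not details taken from this application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchAccumulator:
    capacity: int       # batch capacity limit (maximum requests per batch)
    max_wait_s: float   # waiting time threshold, in seconds
    _items: List[str] = field(default_factory=list, init=False)
    _first_arrival: Optional[float] = field(default=None, init=False)

    def add(self, request: str, now: float) -> Optional[List[str]]:
        """Add a request; return the completed batch if capacity is reached."""
        if not self._items:
            self._first_arrival = now  # the wait is timed from the first item
        self._items.append(request)
        if len(self._items) >= self.capacity:
            return self._flush()
        return None

    def poll(self, now: float) -> Optional[List[str]]:
        """Return the batch if the first request has waited past the threshold."""
        if self._items and now - self._first_arrival > self.max_wait_s:
            return self._flush()
        return None

    def _flush(self) -> List[str]:
        # Hand the batch over and start collecting the next one.
        batch, self._items, self._first_arrival = self._items, [], None
        return batch


if __name__ == "__main__":
    acc = BatchAccumulator(capacity=3, max_wait_s=0.5)
    acc.add("r1", now=0.0)
    acc.add("r2", now=0.1)
    print(acc.add("r3", now=0.2))  # capacity reached
    acc.add("r4", now=1.0)
    print(acc.poll(now=1.6))       # waiting time threshold exceeded
```

A scheduler would call `poll` periodically for each queue so that a slow trickle of requests is still dispatched once the waiting time threshold has passed.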

[0048] It's important to note that a queue can hold multiple batches with the same rules, and these batches share unified batch control parameters. A batch is the smallest unit in the queue that actually carries data and initiates calls. Prefix caching technology, adapted for large model inference, allows multiple function call requests within the same batch to fully reuse cache resources.
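To make the idea of one batch carrying many function call requests more concrete, the following sketch wraps a batch into a single request body and splits a batched response back into per-request results. The JSON schema (`system_prompt`, `inputs`, `outputs`, `index`, `result`) is entirely hypothetical; this application does not define a wire format, and a real model inference service would dictate its own.

```python
import json


def encapsulate(function_type, system_prompt, requests):
    """Wrap many function call requests into one request body.
    The schema is an assumption made for this sketch."""
    return json.dumps({
        "function_type": function_type,
        "system_prompt": system_prompt,  # sent once for the whole batch
        "inputs": [{"index": i, "text": r} for i, r in enumerate(requests)],
    })


def split_results(response_body, n_requests):
    """Map a batched response back to per-request results, ordered by index."""
    items = json.loads(response_body)["outputs"]
    by_index = {item["index"]: item["result"] for item in items}
    missing = [i for i in range(n_requests) if i not in by_index]
    if missing:
        raise ValueError(f"missing results for indexes {missing}")
    return [by_index[i] for i in range(n_requests)]


if __name__ == "__main__":
    body = encapsulate("sentiment", "Classify the sentiment.", ["good", "bad"])
    print(body)
    # A made-up response, standing in for what a service might return.
    fake_response = json.dumps({"outputs": [
        {"index": 1, "result": "negative"},
        {"index": 0, "result": "positive"},
    ]})
    print(split_results(fake_response, 2))
```

Because every request in a queue shares one function type, the fixed system prompt needs to appear only once per batch in this sketch, rather than once per request.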

[0049] The above embodiments, by setting dual batch control parameters—a maximum batch capacity and a waiting time threshold—for each queue, achieve a flexible and balanced scheduling mechanism between request quantity and waiting latency. When function call requests rapidly accumulate to the maximum batch capacity, batch encapsulation and inference request sending can be triggered, ensuring sufficient batch processing and efficient resource utilization, and avoiding request backlog and resource idleness. Conversely, when request generation is slow, starting with the first request, batch encapsulation is proactively triggered after the waiting time expires, preventing processing blockage and excessive latency caused by a persistently low request quantity. This mechanism not only ensures the batch processing efficiency of the model inference service and reduces system overhead and resource waste caused by high-frequency calls, but also achieves controllable and stable processing latency, avoiding response delays in extreme scenarios. Thus, it achieves an optimal balance between batch benefits and real-time performance, improving the overall scheduling rationality and system processing stability.

[0050] After batching, function call requests are temporarily stored in the request cache module. The request cache module receives the returned batch results after the function calls are completed and caches these results, enabling centralized management of function call requests and inference results. Within the request cache module, batches that have been batched and are awaiting delivery are sorted in order of their batching completion time, ensuring that batches are packaged sequentially according to their generation order and avoiding scheduling chaos. For the returned function call results, associated pointers are added, which directly map to the corresponding function call requests, achieving precise association between individual results, batch results, and function call requests.

[0051] As shown in Figure 3, the request cache and the result cache have a one-to-one queue mapping. The sending queue corresponds to the batch request queue in the request cache module. Requests 1, 2, 3, 4, etc., in this queue are batch requests for the AI functions to be called; they belong to the same queue after grouping by the batching module and are arranged in order of batching completion time. The receiving queue corresponds to the batch result queue in the request cache module and belongs to the same queue as the sending queue. Responses a, b, c, d, etc., in the receiving queue are the batch inference results returned after the batch calling module sends the requests to the large model inference service. The correspondence between the two is established by adding an association pointer to each batch response. This pointer maps directly to the corresponding batch request in the sending queue, ensuring that each batch inference result is matched to its original AI function call batch. This solves the problem of matching multiple requests to multiple results in batch calling mode, provides the association needed for the subsequent accuracy sampling and result splitting, and keeps requests and results within the queue ordered and traceable.
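The pairing of the sending queue and the receiving queue through an association pointer might look like the following Python sketch. Using an integer batch identifier as the pointer and pairing requests with results by position are assumptions made for illustration; the application does not specify how the pointer is represented.

```python
from typing import Dict, List


class RequestCache:
    """Sending queue of batch requests plus receiving queue of batch responses."""

    def __init__(self) -> None:
        self.send_queue: Dict[int, List[str]] = {}   # batch id -> requests, in completion order
        self.receive_queue: List[dict] = []          # responses, each carrying a pointer
        self._next_id = 0

    def enqueue_batch(self, requests: List[str]) -> int:
        """Store a completed batch; dict insertion order reflects batching completion order."""
        batch_id = self._next_id
        self._next_id += 1
        self.send_queue[batch_id] = list(requests)
        return batch_id

    def store_response(self, batch_id: int, results: List[str]) -> None:
        """Cache a batch response together with a pointer to its originating batch."""
        self.receive_queue.append({"pointer": batch_id, "results": list(results)})

    def pair(self, response: dict) -> Dict[str, str]:
        """Follow the pointer and pair each request with its result by position (assumption)."""
        requests = self.send_queue[response["pointer"]]
        return dict(zip(requests, response["results"]))


cache = RequestCache()
b1 = cache.enqueue_batch(["req1", "req2"])
b2 = cache.enqueue_batch(["req3"])
cache.store_response(b2, ["resp_c"])             # responses may arrive out of order
cache.store_response(b1, ["resp_a", "resp_b"])
paired = [cache.pair(r) for r in cache.receive_queue]
```

Because each response carries its own pointer, the match does not depend on the order in which responses arrive.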

[0052] In some embodiments, when performing step S202, for any queue, multiple function call requests corresponding to the batch control parameters meeting preset conditions are uniformly encapsulated using a structured data format, and combined with the function type corresponding to the queue, a model inference service request is assembled.

[0053] Specifically, all function call requests within the same batch are uniformly encapsulated in a structured data format. At the same time, the system prompts for the corresponding functions in that batch are combined with the structured data to form a single complete model inference service request, ensuring that the large model service can accurately identify each inference request and task requirement within the batch.

[0054] The above embodiments encapsulate multiple function call requests that meet the batch conditions using a unified structured data format, and assemble model inference service requests by combining the corresponding function types of the queue. This makes the format of inference requests more standardized and the parsing more efficient, reducing the parsing cost and error probability of the model inference service. Unified encapsulation can also further improve the transmission efficiency of batch requests, reduce data redundancy, and ensure that the model inference service can quickly match the processing logic by accurately binding with the queue function type, avoiding inference anomalies caused by format chaos or type mismatch. This not only improves the stability and reliability of the overall processing flow, but also enhances the execution efficiency of batch calls, further reducing system interaction overhead and resource consumption.
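As an illustration of the encapsulation step, the sketch below packs a batch into JSON and attaches a system prompt selected by function type. The choice of JSON, the field names (`system`, `user`, `items`, `id`, `input`), and the prompt text are all assumptions; the application only requires a structured data format combined with the function type of the queue.

```python
import json
from typing import Dict, List

# Hypothetical system prompts keyed by function type.
SYSTEM_PROMPTS: Dict[str, str] = {
    "sentiment": "Classify the sentiment of each item. Reply with a JSON array of {id, result}.",
}


def build_inference_request(function_type: str, inputs: List[str]) -> Dict[str, str]:
    """Encapsulate one batch of function call inputs into a single inference request."""
    # Each item gets an id so that the batch result can later be split per request.
    items = [{"id": i, "input": text} for i, text in enumerate(inputs)]
    return {
        "system": SYSTEM_PROMPTS[function_type],
        "user": json.dumps({"function": function_type, "items": items}, ensure_ascii=False),
    }


request = build_inference_request("sentiment", ["great product", "arrived broken"])
payload = json.loads(request["user"])
```

Giving every item an explicit `id` is one possible way to let the model inference service return results that can be mapped back to individual function call requests.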

[0055] S203. Send the model inference service request to the corresponding model inference service so that the model inference service can process the multiple function call requests contained in the model inference service request in batches and return the function call results in batches.

[0056] The assembled model inference service request is sent to the corresponding model inference service. This model inference service processes multiple function call requests contained in the model inference service request in batches and returns the function call results in batches. The batch function call results can be returned in a structured data format.

[0057] If any function call result indicates success, it means the batch request to the model inference service was successful. The batch results returned by the model inference service are then written to the result cache of the request cache module via an associated pointer, matching them with the original batch function call requests in the request cache. The associated pointer maps to the corresponding original batch function call requests in the request cache, achieving a precise one-to-one binding between the batch results and the original function requests.

[0058] For call failures, the cause of the failure is determined first, and the corresponding fault tolerance strategy is then executed. If any function call result indicates a failure caused by the number of tokens exceeding the limit, meaning that the batch is too large and has hit a hard limit, the encapsulated model inference service request is split into multiple inference service sub-requests, and the batch capacity limit configured for the queue is reduced. The multiple inference service sub-requests are then resent to the model inference service, so that the model inference service processes the function call requests contained in these sub-requests sequentially.

[0059] For example, the current batch can be divided into two sub-batches, and an asynchronous call can be re-initiated for each sub-batch. At the same time, the batch capacity limit of the queue can be halved so that subsequent batching follows the adjusted parameter, which addresses the root cause and prevents later batches from exceeding the token limit again.

[0060] When a call failure is detected as being caused by the number of tokens requested in the batch exceeding the model's limit, the original batch inference request is automatically split into multiple smaller inference sub-requests. Simultaneously, the batch capacity limit of the corresponding queue is reduced, and the split sub-requests are resubmitted to the model inference service for sequential processing. This approach avoids call failures and task interruptions caused by excessively large batches, ensuring the continuity and stability of the data processing flow. Furthermore, by dynamically adjusting the batch capacity, the probability of triggering the token limit again is reduced from the source, improving the success rate and robustness of the model inference service call. At the same time, invalid requests are prevented from repeatedly occupying system resources, further optimizing the overall reliability and processing efficiency of the system.
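The split-and-shrink strategy can be sketched as follows. The exception `TokenLimitExceeded`, the `send` callable, and the recursive handling of a sub-batch that is still too large are assumptions added for illustration; the application itself describes splitting into sub-requests and reducing the capacity limit, with halving given as an example.

```python
from typing import Callable, List, Tuple


class TokenLimitExceeded(Exception):
    """Raised by the (hypothetical) inference client when a request has too many tokens."""


def call_with_split(
    batch: List[str],
    send: Callable[[List[str]], List[str]],
    capacity: int,
) -> Tuple[List[str], int]:
    """Send a batch; on token overflow, split it in two, halve the capacity, and resend.

    Returns the results in request order and the adjusted capacity limit.
    """
    try:
        return send(batch), capacity
    except TokenLimitExceeded:
        if len(batch) == 1:
            raise  # a single request cannot be split any further
        capacity = max(1, capacity // 2)
        mid = len(batch) // 2
        results: List[str] = []
        for sub_batch in (batch[:mid], batch[mid:]):   # sub-requests are sent sequentially
            sub_results, capacity = call_with_split(sub_batch, send, capacity)
            results.extend(sub_results)
        return results, capacity


def fake_send(batch: List[str]) -> List[str]:
    """Stand-in for the model inference service: counts words as tokens, limit of 6."""
    if sum(len(r.split()) for r in batch) > 6:
        raise TokenLimitExceeded()
    return [r.upper() for r in batch]


results, new_capacity = call_with_split(["a b c", "d e f", "g h", "i"], fake_send, capacity=4)
```

Here the four-request batch exceeds the simulated limit, is split into two sub-batches that both succeed, and the capacity limit for later batches drops from 4 to 2.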

[0061] If any function call result indicates a failure, and the cause is an interface timeout, service unavailability, or another reason, an exponential backoff retry strategy is adopted to re-initiate the call. That is, as the number of retries increases, the interval between successive retries is gradually lengthened, which reduces the pressure that frequent retries within a short period would place on the large model service and cluster resources. If the number of consecutive failures reaches a preset threshold, it is determined that the batch cannot be called successfully for the time being, retries are stopped, and a pre-configured special identifier value (failure identifier) is written to the result cache as the call result of that batch, so that the failed data can be located precisely.
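A minimal sketch of the retry strategy is given below. The marker value `FAILURE_MARKER`, the doubling schedule, the default parameters, and writing one marker per request are assumptions; the application only states that the interval grows with the number of retries and that a pre-configured failure identifier is written once the threshold is reached.

```python
import time
from typing import Callable, List

FAILURE_MARKER = "__CALL_FAILED__"  # hypothetical pre-configured failure identifier


def call_with_backoff(
    batch: List[str],
    send: Callable[[List[str]], List[str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Retry transient failures with exponentially growing delays, then give up."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except (TimeoutError, ConnectionError):
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)   # 0.5 s, 1 s, 2 s, ...
    # Threshold reached: record the failure identifier as the result for each request.
    return [FAILURE_MARKER] * len(batch)


delays: List[float] = []
attempts = {"n": 0}


def flaky_send(batch: List[str]) -> List[str]:
    """Simulated service that times out twice and then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return [f"ok:{r}" for r in batch]


def dead_send(batch: List[str]) -> List[str]:
    """Simulated service that is permanently unavailable."""
    raise ConnectionError("simulated outage")


recovered = call_with_backoff(["r1", "r2"], flaky_send, sleep=delays.append)
gave_up = call_with_backoff(["r3"], dead_send, max_attempts=3, sleep=lambda s: None)
```

Passing `sleep` as a parameter keeps the example fast to run; in the first call the recorded delays show the doubling interval.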

[0062] In some embodiments, after executing step S203, the validity of the batch call results can be verified by random sampling. This allows for adjusting the amount of function call requests processed in the same batch based on the success or failure of the batch call, i.e., adjusting the maximum batch capacity of each queue. The success of the batch function call can be determined by the correlation between any target function call result in the function call results and its corresponding independent inference result. Here, the independent inference result is obtained by the model inference service processing the target function call request independently, and the target function call request is the function call request corresponding to the target function call result in the batch results.

[0063] Specifically, a target function call request is randomly selected from multiple function call requests and sent separately to the model inference service. The model inference service processes this target function call request independently and returns an independent inference result. The target function call result corresponding to the target function call request is determined from the function call results, and then the correlation between the target function call result and the independent inference result is calculated. The correlation between the two results can be calculated using methods such as vector similarity scoring, cross-encoder semantic comparison, and large language model scoring. Vector similarity scoring measures the semantic matching degree by vectorizing the two results and calculating a similarity value; cross-encoder semantic comparison captures the interaction features of the two results using a deep learning model and outputs a similarity score; and large language model scoring directly evaluates the consistency of the two results by calling a large model.
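Of the three scoring options, vector similarity is the simplest to illustrate. The sketch below uses bag-of-words count vectors and cosine similarity as a stand-in for a real embedding model; that substitution, and the function names, are assumptions made so that the example is self-contained.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy vectorization: lowercase word counts (a real system would use embeddings)."""
    return Counter(text.lower().split())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two result strings."""
    va, vb = bow_vector(a), bow_vector(b)
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


same = cosine_similarity("positive review", "Positive review")      # close to 1.0
partial = cosine_similarity("positive review", "negative review")   # close to 0.5
disjoint = cosine_similarity("positive", "negative")                # no shared words
```

The resulting score plays the role of the correlation that is compared against the preset threshold.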

[0064] If the relevance is greater than or equal to the preset relevance threshold, the relevance is considered acceptable, and the batch function call is confirmed to be successful. If the relevance is less than the preset relevance threshold, the relevance is considered unacceptable, and the batch function call is confirmed to have failed. In this case, the multiple function call requests contained in the model inference service request are sent separately to the model inference service for individual processing.
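The threshold decision and the fallback to individual calls can be sketched as follows. The function `spot_check`, its parameters, and the toy `exact_match` scorer are hypothetical names introduced for this example; `infer_one` stands in for an independent call to the model inference service.

```python
import random
from typing import Callable, List, Tuple


def spot_check(
    requests: List[str],
    batch_results: List[str],
    infer_one: Callable[[str], str],
    similarity: Callable[[str, str], float],
    threshold: float,
    rng: random.Random,
) -> Tuple[bool, List[str]]:
    """Verify one randomly sampled result; on failure, redo every request individually."""
    idx = rng.randrange(len(requests))            # random target function call request
    independent = infer_one(requests[idx])        # independent inference result
    if similarity(batch_results[idx], independent) >= threshold:
        return True, list(batch_results)          # batch call confirmed successful
    return False, [infer_one(r) for r in requests]  # batch call failed: individual processing


def exact_match(a: str, b: str) -> float:
    """Toy scorer: 1.0 for identical strings, otherwise 0.0."""
    return 1.0 if a == b else 0.0


rng = random.Random(0)
requests = ["a", "b", "c"]
ok_good, kept = spot_check(requests, ["A", "B", "C"], str.upper, exact_match, 0.8, rng)
ok_bad, repaired = spot_check(requests, ["?", "?", "?"], str.upper, exact_match, 0.8, rng)
```

In the second call every batch result is degraded, so whichever request is sampled the check fails and all requests are re-run one by one.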

[0065] The above embodiments verify the validity of batch call results through a random sampling mechanism, using the correlation between batch call results and independent inference results as the criterion. This accurately identifies result quality degradation issues caused by aggregation processing during batch inference, ensuring the accuracy and reliability of batch calls. By comparing the correlation with a preset threshold, the success of batch calls can be clearly determined. When the correlation is acceptable, the relevant batch strategy is maintained; when the correlation is unacceptable, it automatically switches to a single independent call mode. This ensures the efficiency advantage of batch processing in high-quality scenarios while avoiding the risk of result failure when quality is substandard. Simultaneously, dynamically adjusting the upper limit of the queue's batch capacity based on the verification results enables adaptive optimization of the batch processing scale, forming a continuously iterative balance mechanism between inference quality and processing efficiency, further improving the overall stability, robustness, and resource utilization of the system.

[0066] The batch results that have been matched in the result buffer are checked sequentially according to the queue order. The checking process of different queues is executed in parallel without mutual blocking.

[0067] As shown in Figure 4, result queues A, B, and C are independent batch result queues within the request cache module. Each queue corresponds to a different model inference service or a different function type, and the queues are unrelated to one another. This application processes different result queues in parallel, so that the sampling progress of one queue does not interfere with that of another; quality verification can run simultaneously across queues, which makes full use of cluster computing power and improves the overall sampling efficiency for multiple batches of results. For batch inference results within the same result queue (such as result a1, result a2, result a3, etc.), a serial sampling strategy is adopted, and sampling is executed sequentially in the order in which the results were cached. This avoids resource contention caused by simultaneous sampling within one queue, keeps the sampling results consistent, and allows dynamic adjustments of the batch capacity to proceed in an orderly manner at queue granularity. This design balances sampling efficiency and ordering, and is the key execution logic for monitoring batch result quality.
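The combination of parallel processing across queues and serial processing within a queue could be sketched with a thread pool, as below. The use of threads, the function names, and the toy check function are assumptions; the application does not prescribe a concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def check_queue(batches: List[str], check: Callable[[str], bool]) -> List[bool]:
    """Check the batches of one result queue serially, in caching order."""
    return [check(batch) for batch in batches]


def check_all_queues(
    result_queues: Dict[str, List[str]],
    check: Callable[[str], bool],
) -> Dict[str, List[bool]]:
    """Run one serial checker per queue; different queues run in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(result_queues))) as pool:
        futures = {
            name: pool.submit(check_queue, batches, check)
            for name, batches in result_queues.items()
        }
        return {name: future.result() for name, future in futures.items()}


result_queues = {"A": ["a1", "a2", "a3"], "B": ["b1"], "C": ["c1", "c2"]}
# Toy check: pretend that every batch whose name ends in "2" fails the spot check.
outcome = check_all_queues(result_queues, lambda batch: not batch.endswith("2"))
```

Submitting one task per queue keeps the order of checks within a queue fixed while letting slow queues proceed without blocking the others.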

[0068] This can be understood as follows: for the results of a single batch, one data item is randomly selected as the sample for inspection, and a function call request is initiated separately to obtain the independent inference result for that sample. The independent inference result of the sample is then compared with the result for the same sample in the batch results to obtain their correlation. If the correlation is greater than or equal to a preset threshold, the batch is determined to have passed the spot check; otherwise, it is determined to have failed.

[0069] If the batch sampling is successful, its batch inference results are determined to be of reliable quality. The batch results are then split according to the calling source and mapped to the Flink job and corresponding data table of the original function call, completing the downstream writing of the results. After successful writing, the storage space of the batch in the request cache and result cache is released, completing the cache cleanup.
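Splitting a verified batch result by calling source might look like the sketch below. The request fields `job`, `table`, and `row_id`, and the assumption that results come back in request order, are hypothetical; they stand in for whatever source information the Flink job attaches to each function call.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_source(
    requests: List[dict],
    results: List[str],
) -> Dict[Tuple[str, str], List[Tuple[int, str]]]:
    """Group batch results by (job, table) so each caller receives only its own rows."""
    if len(requests) != len(results):
        raise ValueError("batch result count does not match request count")
    routed: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)
    for request, result in zip(requests, results):
        routed[(request["job"], request["table"])].append((request["row_id"], result))
    return dict(routed)


batch_requests = [
    {"job": "job-1", "table": "reviews", "row_id": 7},
    {"job": "job-2", "table": "tickets", "row_id": 3},
    {"job": "job-1", "table": "reviews", "row_id": 8},
]
routed = split_by_source(batch_requests, ["pos", "urgent", "neg"])
```

After the grouped rows are written downstream, the corresponding entries in the request cache and result cache can be released.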

[0070] If a batch fails the spot check, its batch inference results are deemed unreliable. The batch results in the result cache are discarded, all data of the original batch is read from the request cache and broken down into individual non-batch requests, and a new model inference service call is initiated for each of them. This ensures result quality at the source and prevents degraded results from flowing downstream. At the same time, the batch capacity limit of the queue is halved, reducing the number of requests in a single batch and preventing large batches from causing further degradation in model inference quality.

[0071] By comparing the correlation between the target function call result and the corresponding independent inference result, the effectiveness of batch function calls is determined, enabling effective monitoring of the quality of AI call results. This ensures the accuracy and reliability of function call results, avoids problems such as attention decay and result quality degradation caused by long requests, and improves the stability and usability of collaborative processing between the structured query language engine and the AI model.

[0072] The results of the spot checks are linked to the dynamic adjustment of the queue-level batch capacity limit. In some embodiments, for any queue, if the number of consecutive successful function batch calls reaches a preset number, the batch capacity limit of that queue is increased by a preset adjustment range.

[0073] Batch capacity expansion is driven by the number of consecutive successful spot checks in a queue: if a queue passes N consecutive spot checks, its batch capacity limit is expanded by a preset adjustment range. The expanded capacity is the original batch capacity limit × (1 + adjustment range). The adjustment range can be configured as needed, for example 10%, to avoid the fluctuations in result quality that an overly large adjustment would cause.
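The expansion rule, together with the halving on a failed spot check described earlier, can be combined into one small controller. The class name, the rounding of the expanded value, the lower bound of 1, and resetting the success streak after an expansion are assumptions not stated in the application.

```python
class CapacityController:
    """Adjusts one queue's batch capacity limit from spot-check outcomes."""

    def __init__(self, capacity: int, successes_needed: int,
                 adjustment: float = 0.10, min_capacity: int = 1) -> None:
        self.capacity = capacity                  # current batch capacity limit
        self.successes_needed = successes_needed  # N consecutive successes
        self.adjustment = adjustment              # adjustment range, e.g. 10%
        self.min_capacity = min_capacity
        self.streak = 0                           # consecutive successful spot checks

    def record(self, ok: bool) -> int:
        """Record one spot-check outcome and return the updated capacity limit."""
        if ok:
            self.streak += 1
            if self.streak >= self.successes_needed:
                # new capacity = original capacity x (1 + adjustment range)
                self.capacity = round(self.capacity * (1 + self.adjustment))
                self.streak = 0
        else:
            self.streak = 0
            self.capacity = max(self.min_capacity, self.capacity // 2)
        return self.capacity


controller = CapacityController(capacity=100, successes_needed=3)
history = [controller.record(ok) for ok in (True, True, True, False)]
```

With a starting limit of 100 and N = 3, the third consecutive success raises the limit to 110, and the following failure halves it to 55.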

[0074] The above embodiments monitor the number of consecutive successful function batch calls in the queue. When the number of consecutive successful calls reaches a preset threshold, the maximum batch capacity of the queue is automatically increased by a fixed adjustment range, enabling adaptive and step-by-step optimization of batch processing scale. This positive feedback mechanism based on consecutive successes can gradually increase the number of requests per batch while ensuring stable calls and satisfactory result quality. This further enhances the batch aggregation benefits of model inference services, reduces the overall number of calls and system interaction overhead, and avoids problems such as token overruns and quality degradation caused by blindly expanding batches. It ensures that the batch capacity always matches the current system operating status and model processing capacity, continuously improving data processing efficiency and system resource utilization while guaranteeing processing stability and result reliability.

[0075] Figure 5 is a flowchart of another function batch calling method provided in an embodiment of this application. As shown in Figure 5, the process begins when a function call is initiated. First, the large model request batching module intercepts and accumulates the request. It then determines whether a batch triggering condition is met: the batch capacity limit is reached or the waiting time threshold is exceeded. If either condition is met, batch collection is complete, and the batch calling module asynchronously sends the model inference service request. If the call fails, the process first determines whether the failure was caused by an excessive token count. If so, the batch is split and the batch capacity limit of the queue is halved; otherwise, an exponential backoff retry strategy is executed. If the call succeeds, the function call result is written to the result cache and queued for a spot check. The spot check is the core of quality verification. If the spot check fails, the batch result is discarded, the batch is broken down into individual requests that are called again, and the batch capacity limit of the queue is halved. If the spot check succeeds, the result is written to the corresponding data table, completing the batch processing. The process also includes a capacity expansion mechanism triggered by N consecutive successful spot checks, which dynamically optimizes the batch capacity. The entire process not only solves the core problems of high-frequency calls and the lack of result quality monitoring, but also achieves a dynamic balance among batch capacity, result quality, and request latency.

[0076] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.

[0077] As shown in Figure 6, embodiments of this application also provide a function batch invocation apparatus, the apparatus comprising: a request queue module 601, used to obtain function call requests from the computing tasks within the cluster and to divide the function call requests into queues, wherein function call requests divided into the same queue correspond to the same model inference service and have the same function type; a batch encapsulation module 602, used to encapsulate, for any queue whose batch control parameters meet preset conditions, the corresponding multiple function call requests into a model inference service request; and a sending module 603, used to send the model inference service request to the corresponding model inference service, so that the model inference service processes the multiple function call requests contained in the model inference service request in batches and returns the function call results in batches.

[0078] As an optional implementation provided in this application embodiment, the request queue module 601 is used to: obtain the function call requests of each computing task in the cluster; if the data processing mode of the computing task is batch processing mode, sort the function call requests according to their field values, wherein the prefix field of the field value indicates the model inference service and the function type of the function call request; and divide the function call requests with matching prefix fields into the same queue.
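The sorting and prefix-based grouping can be illustrated with the sketch below. The field value layout `service|function_type|payload` and the separator are hypothetical; the application only states that the prefix field identifies the model inference service and the function type.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def queue_key(field_value: str, sep: str = "|") -> Tuple[str, str]:
    """Extract (model inference service, function type) from the prefix of a field value."""
    service, function_type, _payload = field_value.split(sep, 2)
    return service, function_type


def partition(field_values: List[str]) -> Dict[Tuple[str, str], List[str]]:
    """Sort by field value, then place requests with matching prefixes in the same queue."""
    queues: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for value in sorted(field_values):   # sorting makes equal prefixes adjacent
        queues[queue_key(value)].append(value)
    return dict(queues)


incoming = [
    "llm-a|sentiment|text 2",
    "llm-a|summarize|doc 1",
    "llm-a|sentiment|text 1",
    "llm-b|sentiment|text 3",
]
queues = partition(incoming)
```

Requests that share both the service and the function type end up in one queue, while a different service or a different function type yields a separate queue.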

[0079] As an optional implementation provided in this application, each queue is configured with batch control parameters, including a batch capacity limit and a waiting time threshold; the batch encapsulation module 602 is used to: for any queue, when the number of function call requests reaches the batch capacity limit, encapsulate multiple function call requests that have reached the batch capacity limit into a model inference service request; or, for any queue, if the waiting time calculated from the first function call request entering the queue exceeds the waiting time threshold, then encapsulate multiple function call requests within the waiting time threshold into a model inference service request.

[0080] As an optional implementation method provided in this application, the batch encapsulation module 602 is used to: uniformly encapsulate multiple function call requests corresponding to the batch control parameters of any queue when they meet preset conditions using a structured data format, and assemble them into a model inference service request by combining the function type corresponding to any queue.

[0081] As an optional implementation provided in this application, the sending module 603 is further configured to: if any function call result indicates that the call failed because the number of tokens exceeded the upper limit, split the model inference service request into multiple inference service sub-requests and reduce the batch capacity limit of the queue; and send the multiple inference service sub-requests to the model inference service corresponding to the model inference service request.

[0082] As an optional implementation provided in this application, the device further includes a result quality sampling module, used to: determine whether the batch function call is successful based on the correlation between the target function call result and its corresponding independent inference result; the success of the batch function call affects the number of batch function call requests processed; wherein, the target function call result is any one of the function call results, corresponding to the target function call request; the independent inference result is obtained by the model inference service processing the target function call request separately.

[0083] As an optional implementation provided in this application, the device further includes a result quality sampling module, used for: randomly sampling target function call requests from multiple function call requests; sending the target function call requests individually to the model inference service, so that the model inference service can process the target function call requests individually and return independent inference results; determining the target function call results corresponding to the target function call requests from the function call results; calculating the correlation between the target function call results and the independent inference results; and determining that the batch function calls were successful if the correlation is greater than or equal to a preset correlation threshold.

[0084] As an optional implementation provided in this application, the result quality sampling module, after calculating the correlation between the target function call result and the independent inference result, is further configured to: determine that the batch function call failed if the correlation is less than the preset correlation threshold; send the multiple function call requests contained in the model inference service request to the model inference service separately, so that the model inference service processes the multiple function call requests individually; and reduce the batch capacity limit of the queue.

[0085] As an optional implementation provided in this application, the result quality sampling module, after determining whether the batch function call is successful based on the correlation between the target function call result and its corresponding independent inference result, is further configured to: for any queue, if the number of consecutive successful batch function calls reaches a preset number, increase the batch capacity limit of that queue by a preset adjustment range.

[0086] For a description of the features in the embodiment corresponding to the function batch calling device, please refer to the relevant description of the embodiment corresponding to the function batch calling method, which will not be repeated here.

[0087] As shown in Figure 7, embodiments of this application also provide an electronic device, including a memory 702 and a processor 701. The memory 702 stores a computer program, and the processor 701 is configured to run the computer program to perform the steps in any of the above-described function batch call method embodiments.

[0088] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above-described function batch invocation method embodiments at runtime.

[0089] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0090] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above-described methods for batch function invocation.

[0091] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described function batch call method embodiments.

[0092] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0093] The above provides a detailed description of a method for batch function invocation and an electronic device provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only intended to help understand the method and core ideas of this application. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.

Claims

1. A method for batch function invocation, characterized in that the method comprises: obtaining function call requests from each computing task within a cluster, and dividing the function call requests into queues, wherein function call requests divided into the same queue correspond to the same model inference service and have the same function type; for any queue, when its batch control parameters meet preset conditions, encapsulating the corresponding multiple function call requests into a model inference service request; and sending the model inference service request to the corresponding model inference service, so that the model inference service processes the multiple function call requests contained in the model inference service request in batches and returns the function call results in batches.

2. The method according to claim 1, characterized in that the step of obtaining function call requests from each computing task within the cluster and dividing the function call requests into queues includes: obtaining function call requests from each computing task within the cluster; if the data processing mode of the computing task is batch processing mode, sorting the function call requests according to their field values, wherein the prefix field of the field value indicates the model inference service and the function type of the function call request; and dividing function call requests with matching prefix fields into the same queue.

3. The method according to claim 1, characterized in that each queue is configured with batch control parameters, including a batch capacity limit and a waiting time threshold; and the step of encapsulating, for any queue, the corresponding multiple function call requests into a model inference service request when its batch control parameters meet preset conditions includes: for any queue, when the number of function call requests reaches the batch capacity limit, encapsulating the multiple function call requests that have reached the batch capacity limit into the model inference service request; or, for any queue, if the waiting time calculated from the first function call request entering the queue exceeds the waiting time threshold, encapsulating the multiple function call requests within the waiting time threshold into the model inference service request.

4. The method according to claim 1, characterized in that, For any given queue, when its batch control parameters meet preset conditions, multiple function call requests are encapsulated into a model inference service request, including: For any queue, multiple function call requests corresponding to the batch control parameters meeting preset conditions are uniformly encapsulated using a structured data format, and combined with the function type corresponding to any queue, the model inference service request is assembled.

5. The method according to claim 1, characterized in that, after sending the model inference service request to the corresponding model inference service, so that the model inference service processes the multiple function call requests contained in the model inference service request in batches and returns the function call results in batches, the method further includes: if any function call result indicates that the call failed because the number of tokens exceeded the limit, splitting the model inference service request into multiple inference service sub-requests and reducing the batch capacity limit of the queue; and sending the multiple inference service sub-requests to the model inference service corresponding to the model inference service request.

6. The method according to claim 1, characterized in that, After sending the model inference service request to the corresponding model inference service, so that the model inference service can process multiple function call requests contained in the model inference service request in batches and return the function call results in batches, the method further includes: The success of the batch function call is determined by the correlation between the result of the target function call and its corresponding independent inference result; the success of the batch function call affects the number of batch function call requests processed. The target function call result is any one of the function call results, corresponding to the target function call request; the independent inference result is obtained by the model inference service processing the target function call request separately.

7. The method according to claim 6, characterized in that, determining whether the batch function call is successful according to the relevance between the target function call result and its corresponding independent inference result includes: randomly selecting the target function call request from the multiple function call requests; sending the target function call request separately to the model inference service, so that the model inference service processes the target function call request separately and returns the independent inference result; determining, from the function call results, the target function call result corresponding to the target function call request; calculating the relevance between the target function call result and the independent inference result; and if the relevance is greater than or equal to a preset relevance threshold, determining that the batch function call is successful.
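The spot check in claim 7 can be sketched as follows. The claim does not specify how the relevance is calculated; the character-level ratio from Python's `difflib.SequenceMatcher` is used here purely as a stand-in. The names `relevance`, `spot_check`, and `infer_single`, and the threshold value 0.8, are assumptions.

```python
import random
from difflib import SequenceMatcher


def relevance(a, b):
    """Stand-in relevance measure: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def spot_check(requests, batch_results, infer_single, threshold=0.8, rng=random):
    """Re-run one randomly chosen request alone and compare the two results.

    infer_single is a hypothetical callable that sends one request to the
    model inference service and returns its independent inference result.
    """
    index = rng.randrange(len(requests))       # random target request
    independent = infer_single(requests[index])
    target_result = batch_results[index]       # result from the batch call
    return relevance(target_result, independent) >= threshold


# Toy service: the batched results agree with the independent results.
answers = {"q1": "positive", "q2": "negative"}
ok = spot_check(["q1", "q2"], ["positive", "negative"], answers.__getitem__)
```

This sketch assumes the batch results are returned in the same order as the requests; a real implementation could instead match them by an identifier.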

8. The method according to claim 7, characterized in that, after calculating the relevance between the target function call result and the independent inference result, the method further includes: if the relevance is less than the preset relevance threshold, determining that the batch function call has failed; sending the multiple function call requests contained in the model inference service request to the model inference service one by one, so that the model inference service processes the multiple function call requests individually; and reducing the maximum batch capacity of the corresponding queue.

9. The method according to claim 7, characterized in that, after determining whether the batch function call is successful according to the relevance between the target function call result and its corresponding independent inference result, the method further includes: for any queue, if the number of consecutive successful batch function calls reaches a preset number, increasing the maximum batch capacity of that queue by a preset adjustment amount.
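The capacity adjustments in claims 8 and 9 can be combined into a small per-queue controller, sketched below. The class name `BatchCapacityController`, the default values, halving on failure, and resetting the success counter after each increase are all assumptions; the claims only state that the capacity is reduced on failure and increased by a preset amount after a preset number of consecutive successes.

```python
class BatchCapacityController:
    """Tracks the maximum batch capacity of one queue (illustrative only)."""

    def __init__(self, capacity, success_target=3, step=2, min_capacity=1):
        self.capacity = capacity              # current maximum batch capacity
        self.success_target = success_target  # preset number of successes
        self.step = step                      # preset adjustment amount
        self.min_capacity = min_capacity
        self.consecutive_successes = 0

    def record(self, success):
        """Update the capacity after one batch function call and return it."""
        if success:
            self.consecutive_successes += 1
            if self.consecutive_successes >= self.success_target:
                self.capacity += self.step
                self.consecutive_successes = 0  # assumed: restart the count
        else:
            self.consecutive_successes = 0
            # Assumed reduction policy: halve, but never go below the minimum.
            self.capacity = max(self.min_capacity, self.capacity // 2)
        return self.capacity


controller = BatchCapacityController(capacity=8)
history = [controller.record(s) for s in (True, True, True, False)]
```

With these assumed defaults, three consecutive successes raise the capacity from 8 to 10, and the following failure halves it to 5.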

10. An electronic device, characterized in that it comprises: a memory, configured to store a computer program; and a processor, configured to implement the steps of the function batch calling method according to any one of claims 1 to 9 when executing the computer program.