Large model testing methods and related devices, electronic equipment and storage media

By generating task objects and using an isolated runner and multimodal large model to identify anomaly categories, the data crosstalk problem in multi-task parallel testing of large models is solved, the self-healing capability is improved, and stable concurrent processing is achieved.

CN122309366A · Pending · Publication Date: 2026-06-30 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
IFLYTEK CO LTD
Filing Date
2026-03-26
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies are prone to data crosstalk in large-scale multi-task parallel testing and lack self-healing capabilities in abnormal scenarios.

Method used

Task objects are generated and stored in a task queue. Mutually isolated runners execute the tasks by entering the prompt into a browser page and copying the response content. If no response is obtained before the timeout, the runner calls a multimodal large model to identify the anomaly category, performs the matching handling action, and re-executes the task.

Benefits of technology

It reduces data crosstalk between concurrent tasks, improves self-healing capabilities in abnormal scenarios, and achieves stable concurrent processing.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122309366A_ABST

Abstract

This application discloses a large-scale model testing method and related devices, electronic devices, and storage media. The large-scale model testing method includes: generating task objects based on test cases of the target large-scale model; controlling the runner to execute task objects in the task queue; in response to the runner failing to obtain a response after exceeding a target time threshold, controlling the runner to call a multimodal large-scale model to identify the page image of the browser page, obtain the target anomaly category, and controlling the runner to execute the target handling action matching the target anomaly category; and controlling the runner to re-execute the currently unsuccessful task object. This solution can minimize data crosstalk between different tasks and improve self-healing capabilities under abnormal scenarios while achieving concurrent testing of large-scale models with multiple tasks.

Description

Technical Field

[0001] This application relates to the field of automated testing technology, and in particular to a large-scale model testing method and related apparatus, electronic devices, and storage media.

Background Technology

[0002] With the widespread application of large models in R&D, evaluation, and engineering implementation, the industry generally has a need for high-frequency, long-term, and batch interactions with large models in processes such as model iteration, prompt word system construction, alignment / security evaluation, tool call verification, and online service capacity assessment.

[0003] Currently, existing technologies typically implement large-scale model testing based on desktop simulation and clipboard data collection. However, this approach is prone to data crosstalk when multiple tasks run in parallel, making it difficult to achieve high concurrency on a single machine. Furthermore, the script is prone to crashing or entering infinite loops when unexpected situations such as network jitter occur. Therefore, how to minimize data crosstalk between different tasks and improve self-healing capabilities under abnormal scenarios while achieving multi-task concurrent testing of large models has become an urgent problem to be solved.

Summary of the Invention

[0004] The main technical problem addressed by this application is to provide a large-scale model testing method and related devices, electronic devices, and storage media that can minimize data crosstalk between different tasks and improve self-healing capabilities under abnormal scenarios while enabling concurrent testing of large-scale models across multiple tasks.

[0005] To address the aforementioned technical problems, the first aspect of this application provides a large-scale model testing method, comprising: generating task objects based on test cases of a target large-scale model, wherein each task object contains field values of several task fields, the task fields include at least a prompt text and a task timeout threshold, and the task object is stored in a task queue; controlling a runner to execute the task objects in the task queue, wherein different runners are isolated from each other, the runner runs a browser page containing the target large-scale model, and the runner fills the prompt text parsed from the task object into the input box of the browser page and copies, from the browser page, the response content generated by the target large-scale model in response to the prompt text; and in response to the runner failing to obtain the response content after a target timeout threshold is exceeded, controlling the runner to call a multimodal large-scale model to identify the page image of the browser page and obtain a target anomaly category, controlling the runner to execute the target handling action matching the target anomaly category, and controlling the runner to re-execute the task object whose current execution was unsuccessful, wherein the target timeout threshold is the field value of the task timeout threshold in that task object.

[0006] To address the aforementioned technical problems, a second aspect of this application provides a large-scale model testing device, comprising: a task generation module, a task execution module, and an exception handling module. The task generation module generates task objects based on test cases of a target large-scale model. Each task object contains field values for several task fields, including at least a prompt text and a task timeout threshold. The task objects are stored in a task queue. The task execution module controls a runner to execute the task objects in the task queue. Different runners are isolated from each other. Each runner runs a browser page containing the target large-scale model. The runner fills the prompt text parsed from the task object into an input box on the browser page and copies, from the browser page, the response content generated by the target large-scale model in response to the prompt text. The exception handling module, in response to the runner failing to obtain the response content after a target timeout threshold is exceeded, controls the runner to call a multimodal large-scale model to identify the page image of the browser page and obtain a target exception category, and controls the runner to execute a target handling action matching the target exception category. It also controls the runner to re-execute the task object whose current execution was unsuccessful. The target timeout threshold is the field value of the task timeout threshold in that task object.

[0007] To address the aforementioned technical problems, a third aspect of this application provides an electronic device comprising at least a memory and a processor coupled to each other, wherein the memory stores at least program instructions, and the processor executes the program instructions to implement the large model testing method described in the first aspect.

[0008] To address the aforementioned technical problems, a fourth aspect of this application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being used to implement the large-scale model testing method of the first aspect described above.

[0009] The above scheme generates task objects based on test cases of the target large model. Each task object contains field values for several task fields, including at least the prompt text and a task timeout threshold. The task objects are stored in a task queue. Runners then execute the task objects in the queue, and different runners are isolated from each other. Each runner runs a browser page containing the target large model. The runner fills the prompt text parsed from the task object into the input box on the browser page and copies the response content generated by the target large model in response to the prompt text. If the runner fails to obtain the response content after the target duration threshold is exceeded, the runner calls the multimodal large model to identify the page image of the browser page, obtains the target anomaly category, and executes the target handling action matching the target anomaly category. The runner is then controlled to re-execute the task object whose current execution was unsuccessful, and the target duration threshold is the field value of the task timeout threshold in that task object. On the one hand, since the task objects in the task queue are executed by different runners, and the different runners are isolated from each other, result crosstalk and session pollution between concurrent tasks are avoided at the source as far as possible, and stable concurrent processing that can be horizontally scaled is achieved. On the other hand, if the runner has not obtained the response content after the target duration threshold is exceeded, the runner calls the multimodal large model to identify the page image and obtain the target anomaly category, executes the matching target handling action, and then re-executes the unsuccessful task object, which improves the self-healing capability in abnormal scenarios. Therefore, while realizing multi-task concurrent testing of large models, the scheme can minimize data crosstalk between different tasks and improve the self-healing capability in abnormal scenarios.

Attached Figure Description

[0010] Figure 1 is a flowchart illustrating an embodiment of the large-scale model testing method of this application; Figure 2a is a schematic diagram of a process of an embodiment of the large model testing method of this application; Figure 2b is a schematic diagram of an embodiment of the result stability determination and pull-down reachability mechanism of this application; Figure 3 is a schematic diagram of the framework of an embodiment of the large model testing device of this application; Figure 4 is a schematic diagram of the framework of an embodiment of the electronic device of this application; Figure 5 is a schematic diagram of the framework of an embodiment of the computer-readable storage medium of this application.

Detailed Implementation

[0011] The embodiments of this application will now be described in detail with reference to the accompanying drawings.

[0012] In the following description, specific details such as particular system architectures, interfaces, and technologies are presented for illustrative purposes rather than for limiting purposes, in order to provide a thorough understanding of this application.

[0013] In this application, the terms "system" and "network" are often used interchangeably. The term "and / or" describes an association between related objects and covers three cases. For example, A and / or B can represent: A alone, both A and B, or B alone. Additionally, the slash " / " generally indicates an "or" relationship between the objects before and after it. Furthermore, "multiple" in this application means two or more.

[0014] Please refer to Figure 1 and Figure 2a. Figure 1 is a flowchart illustrating an embodiment of the large-scale model testing method of this application. Figure 2a is a schematic diagram illustrating a process of an embodiment of the large-scale model testing method of this application. It should be noted that the process operations in this embodiment can be executed by an electronic device with computing capabilities or by related equipment containing such an electronic device. The specific structure and type of the electronic device and the related equipment are not limited herein. Specifically, this embodiment may include the following steps:

Step S11: Generate task objects based on test cases of the target large model.

[0015] In this embodiment of the disclosure, the task object may contain field values of several task fields, which may include at least a prompt text and a task timeout threshold. The task object may be stored in a task queue. For example, this embodiment of the disclosure may include a queue service for caching tasks to be processed (i.e., task objects) and processing results (i.e., result objects). That is, the queue service may include at least a task queue and a result queue.

[0016] In one implementation scenario, the target large model can be a large model to be tested. For example, the target large model can be an open-source large model, a large model obtained by fine-tuning and training an open-source large model based on a specific dataset, or a custom large model. The specific source of the target large model is not limited here.

[0017] In one implementation scenario, test cases for the target large model can be predefined in test files such as JSON or JSONL. For example, the test file can contain several test cases to facilitate the construction of task objects corresponding to each test case. As a possible example, large model instructions can be constructed based on the test cases. These instructions instruct the large language model to generate task objects for the test cases according to a specific format, which may include various task fields of the task objects. Based on this, the task objects output by the large language model in response to the large model instructions can be obtained. It should be noted that the large language model used to generate the task objects here can be a different large model from the target large model.

[0018] In one implementation scenario, several task fields may include, but are not limited to: task identifier (task id), maximum number of retries (max retry), task priority, target large model (model), etc. The specific content of the task objects is not limited here. For example, the task identifier is a unique identifier for the task object, used for idempotency, tracking, and deduplication. In addition, several task fields may also include metadata such as source file, line number, and tags, which will not be elaborated further here.
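As an illustration of the task fields described above, the following is a minimal sketch of building task objects from a JSONL test file. The class name `TaskObject`, the function `build_tasks`, the key names (`prompt`, `timeout_sec`, `max_retry`, and so on), and all default values are assumptions chosen for this example; the application does not prescribe a concrete schema.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class TaskObject:
    """Hypothetical task object; field names are illustrative only."""
    prompt: str                  # prompt text (required task field)
    timeout_sec: int             # task timeout threshold (required task field)
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    max_retry: int = 3           # maximum number of retries (assumed default)
    priority: int = 0            # task priority (assumed default)
    model: str = "target-model"  # placeholder name for the target large model
    meta: dict = field(default_factory=dict)  # source file, line number, tags


def build_tasks(jsonl_text: str, source_file: str = "cases.jsonl") -> list:
    """Turn each non-empty JSONL line (one test case) into a TaskObject."""
    tasks = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines but keep original line numbering
        case = json.loads(line)
        tasks.append(TaskObject(
            prompt=case["prompt"],
            timeout_sec=int(case.get("timeout_sec", 300)),
            max_retry=int(case.get("max_retry", 3)),
            priority=int(case.get("priority", 0)),
            model=case.get("model", "target-model"),
            meta={
                "source_file": source_file,
                "line_no": line_no,
                "tags": case.get("tags", []),
            },
        ))
    return tasks
```

Generating `task_id` once at construction time is what would let it serve for idempotency, tracking, and deduplication when a task object is returned to the queue.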

[0019] Step S12: Control the runner to execute the task objects in the task queue.

[0020] In this embodiment, different runners can be isolated from each other. Each runner can run a browser page containing the target large model. The runner can fill the prompt text (i.e., the field value of the task field "prompt text") parsed from the task object into the input box of the browser page to copy the response content generated by the target large model responding to the prompt text from the browser page. For example, the response content can be used as the field value of the model response field (answer) in the result object corresponding to the task object. The result object can be stored in a result queue, and the result object also includes at least one of the following result fields: execution status field (status), number of attempts field (attempt), execution duration field (duration ms), prompt text (query), error code (error code) / message (error msg), etc., which are not limited here. The execution status field can indicate whether the runner successfully executed the task object (i.e., its field value can be "success (or SUCCESS)" or "failure (or FAIL)"); the error code / message can be extracted from the browser page if the task object fails to execute. Other result fields can refer to the relevant descriptions of the task fields mentioned above, and will not be repeated here. In addition, the result object may also include result fields similar to or the same as the task object, such as task ID and target model. Of course, the above example is only one possible example of the result object, and other possible situations are not limited here, nor will they be listed one by one.

[0021] In an implementation scenario, the runner can be of various types, including but not limited to: containers (e.g., worker containers), virtual machines, etc. The specific type of runner is not limited here. Taking a container as an example, multiple containers (e.g., worker containers) can be orchestrated to consume various task objects in the task queue in parallel. It should be noted that different runners execute different task objects, and if a runner fails to execute a task object, the task object can be returned to the task queue for other runners to retrieve and execute.

[0022] In one implementation scenario, different runners can be isolated from each other in the following ways: browser sessions, runtime directories, clipboard environments, etc., without further limitation. That is to say, in actual applications, different runners can run on the same hardware platform or distributed across different hardware platforms, but regardless of the method adopted, different runners are isolated from each other.
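One way to picture the isolation described above is to give each runner its own set of directories. The sketch below is only an assumption about how such a layout could look on a single machine; the function `make_runner_workspace` and the directory names are hypothetical, and real isolation of browser sessions and clipboards would typically also rely on containers or virtual machines.

```python
import tempfile
from pathlib import Path


def make_runner_workspace(runner_id: int, root: str) -> dict:
    """Create private directories for one runner (hypothetical layout)."""
    base = Path(root) / f"runner-{runner_id}"
    paths = {
        name: base / name
        for name in ("browser_profile", "runtime", "clipboard")
    }
    for path in paths.values():
        # exist_ok=False makes accidental sharing between runners fail loudly
        path.mkdir(parents=True, exist_ok=False)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        first = make_runner_workspace(0, root)
        second = make_runner_workspace(1, root)
        # No directory is shared between the two runners.
        print(set(first.values()).isdisjoint(second.values()))
```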

[0023] In one implementation scenario, the runner can retrieve task objects from the task queue in a blocking manner and parse the field values of each task object. That is, when there are no available task objects in the task queue, the runner can pause and wait until a task object appears in the queue, instead of immediately returning an empty result or an error.
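The blocking retrieval described above can be sketched with Python's standard `queue.Queue`, whose `get()` blocks by default until an item is available. The in-process queue, the `runner_loop` function, the dictionary task layout, and the `None` stop token are assumptions for illustration; an actual queue service could equally be an external message broker.

```python
import queue
import threading


def runner_loop(task_queue: queue.Queue, results: list, stop_token=None) -> None:
    """Consume tasks until the stop token arrives (illustrative runner loop)."""
    while True:
        task = task_queue.get()      # blocks while the queue is empty
        if task is stop_token:
            task_queue.task_done()
            break
        # Parse the field values; real execution would drive the browser here.
        results.append((task["task_id"], task["prompt"]))
        task_queue.task_done()


if __name__ == "__main__":
    q = queue.Queue()
    collected = []
    worker = threading.Thread(target=runner_loop, args=(q, collected))
    worker.start()                   # the runner waits: nothing is queued yet
    q.put({"task_id": "t1", "prompt": "hello"})
    q.put(None)                      # stop token ends the loop
    worker.join()
    print(collected)
```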

[0024] In one implementation scenario, as mentioned earlier, several task fields may include a maximum number of retries, and a result field may include an attempt count field. In this case, the attempt count field can be initialized to 0. If it is less than the maximum number of retries field value (which can be called the target count threshold for easy distinction), then the runner executes the task object; otherwise, the execution status field value can be determined to indicate execution failure (e.g., "failed," "FAIL," etc.). For example, the target count threshold can be set to any value between 2 and 10. Of course, the above example is only one possible example of the target count threshold, and the specific value of the target count threshold is not limited here, nor will it be listed in detail.
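The attempt-count gate described above can be sketched as follows. The function `run_with_retries`, the dictionary layout, and the convention that `execute` returns `None` on failure are assumptions made for this example.

```python
def run_with_retries(task: dict, execute) -> dict:
    """Execute a task while the attempt count is below max_retry."""
    result = {
        "task_id": task["task_id"],
        "attempt": 0,        # attempt count field, initialized to 0
        "status": "FAIL",    # stays FAIL unless an attempt succeeds
        "answer": None,
    }
    while result["attempt"] < task["max_retry"]:
        result["attempt"] += 1
        answer = execute(task)  # assumed to return None on failure
        if answer is not None:
            result["status"] = "SUCCESS"
            result["answer"] = answer
            break
    return result
```

With `max_retry` set to 3, a task that fails twice and then succeeds ends with `attempt == 3` and status `SUCCESS`; a task that never succeeds ends with status `FAIL` after three attempts.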

[0025] In one implementation scenario, while the runner is controlled to execute a task object, a first recognition result can be obtained based on the page image of the browser page (i.e., the browser page run by the runner). In response to the first recognition result including a first position of a first control used for content copying, the runner can be controlled to trigger the first control based on the first position to obtain the response content at the current moment. Furthermore, a target count value can be updated based on the content length of the current response content, and the target count value can be used to determine whether the current response content is the final generated response content. The target count value represents the stability of the current response content. By monitoring the content length and updating a count that reflects stability, this method minimizes problems such as truncated, incomplete, or empty results caused by streaming output.

[0026] In a specific implementation scenario, before executing the task object, the runner can initialize an image template library and regions of interest (ROIs) for several controls. The image template library may contain a first template for matching and locating the first control (e.g., the first template may specifically be an image template containing the position layout of the first control on the browser page). To obtain the first recognition result based on the browser page image, local recognition can first be performed within the ROI of the first control in the page image, yielding a local recognition result for the first control. The ROI of the first control is located in advance by matching the browser page image against the first template of the first control. On this basis, if the local recognition result includes the first position of the first control, the local recognition result can be taken as the first recognition result. Conversely, if the local recognition result indicates that the first control was not recognized, global recognition can be performed on the page image (i.e., searching for the first control across the entire page image) to obtain the first recognition result. It should be noted that the first control is not limited to a single button; it may include, but is not limited to, a right-click menu, a shortcut key, or another equivalent copy / export entry point. Performing local recognition within the ROI first, and falling back to global recognition only when the control is not found locally, helps reduce mismatches and the time spent on matching and localization, and improves throughput and stability.
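The local-then-global recognition order can be sketched as below. To stay self-contained, the example uses a toy exact-match search over nested lists of integers as a stand-in for real template matching (which in practice would more likely be a tolerance-based matcher such as OpenCV's `matchTemplate`); the names `match_template` and `locate_control` and the `(top, left, height, width)` box layout are assumptions.

```python
from typing import Optional, Sequence, Tuple

Image = Sequence[Sequence[int]]
Box = Tuple[int, int, int, int]  # (top, left, height, width), assumed layout


def match_template(image: Image, template: Image, region: Box) -> Optional[Tuple[int, int]]:
    """Return the (row, col) of the first exact match inside region, else None."""
    top, left, height, width = region
    th, tw = len(template), len(template[0])
    for r in range(top, top + height - th + 1):
        for c in range(left, left + width - tw + 1):
            if all(list(image[r + i][c:c + tw]) == list(template[i]) for i in range(th)):
                return (r, c)
    return None


def locate_control(image: Image, template: Image, roi: Optional[Box]):
    """Try the control's ROI first; fall back to the whole page image."""
    if roi is not None:
        position = match_template(image, template, roi)
        if position is not None:
            return position, "local"
    full_page = (0, 0, len(image), len(image[0]))
    return match_template(image, template, full_page), "global"
```

The ROI search touches far fewer candidate positions than the full-page search, which is where the time saving described above would come from when the control has not moved.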

[0027] In a specific implementation scenario, after triggering the first control to obtain the response content at the current moment, the target count value can be updated based on the content length of that response content. Specifically, if the response content at the current moment is empty, the target count value stable_cnt can be reset to zero. If the content length at the current moment has increased compared with the content length at the historical moment, stable_cnt can likewise be reset to zero. If the content length at the current moment is unchanged compared with the content length at the historical moment, stable_cnt can be increased (e.g., stable_cnt+1). It should be noted that the historical moment refers to the moment before the current moment. For example, the response content can be copied repeatedly at a certain frequency to obtain the response content at each moment, in which case the historical moment is the sampling moment immediately preceding the current one. By updating the target count value differently for each of these cases, the method lets the target count value accurately represent the stability of the response content at the current moment.
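The update rule for stable_cnt can be written as a small pure function. This is a sketch: the function name `update_stable_cnt` is hypothetical, and treating a shrinking length as "not stable" is an assumption, since the text only covers the empty, increased, and unchanged cases.

```python
def update_stable_cnt(stable_cnt: int, prev_len: int, cur_len: int) -> int:
    """Return the new stability counter after one copy of the response."""
    if cur_len == 0:
        return 0               # empty response: not stable
    if cur_len > prev_len:
        return 0               # still growing (streaming output): not stable
    if cur_len == prev_len:
        return stable_cnt + 1  # unchanged since the previous sample
    return 0                   # shorter than before: assumed not stable
```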

[0028] In a specific implementation scenario, after the target count value is updated, it can be used to determine whether the current response content is the final generated response content. As one possible example, it can be checked whether the target count value has reached the counting threshold STABLE_THRESHOLD (e.g., it can be set to any value between 2 and 5; the specific value of the counting threshold is not limited here). If the target count value is not less than STABLE_THRESHOLD, the current response content can be determined to be the final generated response content; if the target count value is less than STABLE_THRESHOLD, the current response content can be determined not to be the final generated response content. As another possible example, it can be checked both whether the target count value has reached the counting threshold and whether the current response content reaches a minimum length threshold. If the target count value is not less than STABLE_THRESHOLD and the current response content reaches the minimum length threshold, the current response content can be determined to be the final generated response content; otherwise, it can be determined not to be. Of course, these are only a few possible ways of making this determination based on the target count value in practical applications. Other possible detection methods are not limited here, nor will they be listed one by one.
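Both variants of the finality check can be sketched in a single function. The value 3 for `STABLE_THRESHOLD` is one choice from the 2 to 5 range mentioned above, while `MIN_LENGTH`, the function name `is_final`, and the `require_min_length` flag are assumptions for illustration.

```python
STABLE_THRESHOLD = 3  # one value from the 2..5 range given in the text
MIN_LENGTH = 1        # assumed minimum length threshold


def is_final(stable_cnt: int, content: str,
             require_min_length: bool = False,
             min_length: int = MIN_LENGTH) -> bool:
    """Decide whether the current response is the final generated response."""
    if stable_cnt < STABLE_THRESHOLD:
        return False  # not yet stable for enough consecutive samples
    if require_min_length and len(content) < min_length:
        return False  # second variant: stable but too short
    return True
```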

[0029] In a specific implementation scenario, the following situations may arise in practice: the current reply content exceeds the display area of the browser page (e.g., because it is too long), the first recognition result indicates that the first control was not recognized, or the first control has failed to be recognized several times in a row (e.g., the number of times can be set to any value between 2 and 8; the specific number is not limited here). In these cases, the runner can be controlled to pull the browser page down until the bottom of the page is reached, and then return to performing recognition on the page image of the browser page, looping in this way until the first recognition result is obtained. Similar to the aforementioned copy operation, to implement the page pull-down, the runner initializes an image template library and ROIs for several controls before executing the task object. The image template library can also contain a second template for matching and locating a second control that implements the page pull-down (e.g., down.png, PageDown, PageEnd, etc.). Local recognition can then be performed within the ROI of the second control in the page image to obtain the local recognition result of the second control. The ROI of the second control can be located in advance by matching the browser page image against the second template (see the earlier description of the ROI of the first control, which is not repeated here). If the local recognition result includes the second position of the second control, it can be taken as the second recognition result. If the local recognition result indicates that the second control was not recognized, global recognition can be performed on the page image to obtain a second recognition result that includes the second position. On this basis, the runner can be controlled to trigger the second control at its second position to pull the browser page down. In addition, other controls on the browser page, such as the input box, can be identified in the same or a similar way: local recognition within the ROI first, then global recognition on the page image if the control is not found locally. This is not elaborated further here.

[0030] In a specific implementation scenario, as mentioned earlier, some task fields may also include the maximum number of retries (max retry). If the total number of times the runner re-executes a currently unsuccessful task object exceeds a target threshold, execution failure can be determined. It should be noted that the target threshold can specifically be the value of the maximum number of retries field. In addition, as mentioned earlier, in the event of execution failure, the task object can be returned to the task queue for another runner to retrieve and execute.

[0031] In a specific implementation scenario, please refer to Figure 2b, which is a schematic diagram illustrating an embodiment of the result stability determination and pull-down reachability mechanism of this application. As shown in Figure 2b, the dashed box on the left represents the pull-down reachability sub-process, and the dashed box on the right represents the stability determination sub-process. Specifically, once the runner has retrieved and parsed the task object and begun executing it, it enters the monitoring loop. First, it can determine whether a timeout has occurred (i.e., whether the aforementioned target duration threshold, the field value of the task field "task timeout threshold", has been exceeded). If so, a timeout exception can be triggered and handed over to the exception self-healing module (see the related operations described later in this embodiment). Otherwise, it can go on to determine whether the first control used for content copying has been identified. If not, the second control used for page pull-down can be identified and triggered until the bottom of the page is reached, after which the runner waits for the next monitoring interval (e.g., any value between 2 and 15 seconds, such as 10 seconds; the specific interval is not limited here). If the first control has been identified, it can be triggered to obtain the current reply content (during this process, generation status verification can also be performed, for example through button status changes, output area changes, or one or more equivalent features). The target count value is then updated based on the content length of the reply content: for example, it is reset to zero when the content is empty or invalid or its length has increased, and it is increased (e.g., by 1) when the length is unchanged. On this basis, it can be determined whether the target count value has reached the counting threshold. If not, the runner waits for the next interval. If so, the reply content at the current moment can be determined to be the final generated reply content.
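The monitoring loop of Figure 2b can be sketched as a single function with the browser interactions injected as callables. Everything named here (`monitor`, `TaskTimeout`, `copy_reply`, `scroll_down`, the injectable `clock` and `sleep`) is hypothetical; `copy_reply` is assumed to return `None` when the copy control is not recognized, and `scroll_down` is assumed to pull the page down to the bottom.

```python
import time


class TaskTimeout(Exception):
    """Raised so that an exception self-healing step can take over."""


def monitor(copy_reply, scroll_down, timeout_sec: float,
            interval_sec: float = 10.0, stable_threshold: int = 3,
            clock=time.monotonic, sleep=time.sleep) -> str:
    """Poll the page until the reply is stable or the task times out."""
    start = clock()
    stable_cnt, prev_len = 0, 0
    while True:
        if clock() - start > timeout_sec:
            raise TaskTimeout("no final reply within the task timeout threshold")
        reply = copy_reply()          # None: copy control not recognized
        if reply is None:
            scroll_down()             # pull-down reachability sub-process
        else:
            cur_len = len(reply)      # stability determination sub-process
            if cur_len > 0 and cur_len == prev_len:
                stable_cnt += 1
            else:
                stable_cnt = 0        # empty or changed length: not stable
            prev_len = cur_len
            if stable_cnt >= stable_threshold:
                return reply
        sleep(interval_sec)           # wait for the next monitoring interval
```

Injecting `clock` and `sleep` is a design choice of this sketch that keeps the loop testable without real waiting; a production runner would simply use the defaults.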

[0032] Step S13: In response to the fact that the runner has not obtained a response after exceeding the target time threshold, the runner is controlled to call the multimodal large model to identify the page image of the browser page, obtain the target anomaly category, and the runner is controlled to execute the target handling action that matches the target anomaly category, and the runner is controlled to re-execute the currently unsuccessful task object.

[0033] In this embodiment of the disclosure, the target duration threshold can be the field value of the task timeout threshold (timeout sec) in the task object whose current execution was unsuccessful. For example, the target duration threshold can be set to any value between 60 and 900 seconds. Of course, this is merely one possible example of the target duration threshold in practical applications; other possible values are not limited here, nor will they be listed in detail.

[0034] In one implementation scenario, if the runner fails to receive a response after exceeding the target time threshold, an anomaly can be considered to have occurred. At this point, the runner can be controlled to invoke a multimodal large model to identify the browser page image and determine the target anomaly category. It should be noted that the multimodal large model can include, but is not limited to, open-source large models such as Qwen-VL and Intern-S1, or it can be a fine-tuned open-source large model based on a specific dataset, or it can be a custom large model. The specific type of multimodal large model is not limited here. Of course, in practical applications, besides using a multimodal large model to identify browser page images to obtain the target anomaly category, image templates (e.g., image templates for various anomaly categories, which can be included in the previously initialized image template library), and text keywords (e.g., text keywords for various anomaly categories) can also be used to identify browser page images to obtain the target anomaly category. The specific methods for identifying the target anomaly category are not limited here, nor will they be listed in detail.

[0035] In one implementation scenario, in response to the target anomaly category being any of page exception, generation exception, or copy exception, the runner can be controlled to perform a forced reset of the browser page. In other words, the target handling action matching these anomaly categories is a forced reset of the browser page. It should be noted that the forced reset is mainly used to return the browser page to its initial state and may include, but is not limited to, opening a new session on the browser page. The specific implementation of the forced reset is not limited here and is not listed in detail.

[0036] In one implementation scenario, in response to the target anomaly category being service exception, the runner can be controlled to perform a backoff wait (e.g., fixed backoff or exponential backoff) and then retry. It should be noted that service exceptions include, but are not limited to, rate limiting, service busy, and service unavailable. The specific types of service exception are not limited here and are not listed individually. Furthermore, when the target anomaly category is service exception, the runner can also be controlled to perform operations such as model switching and model downgrading on the target large model of the browser page; this is not limited here.
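
The fixed and exponential backoff mentioned above can be illustrated with the following minimal sketch. The base delay, growth factor, cap, and the function names are assumptions chosen for illustration; the application does not specify them.

```python
import time
from typing import Callable, List


def backoff_delays(
    max_retries: int,
    base_sec: float = 1.0,
    factor: float = 2.0,
    cap_sec: float = 60.0,
) -> List[float]:
    """Delay before each retry; factor=1.0 yields a fixed backoff."""
    return [min(cap_sec, base_sec * factor ** attempt) for attempt in range(max_retries)]


def retry_after_backoff(
    action: Callable[[], bool],
    delays: List[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait, then retry, until the action succeeds or the delays run out."""
    for delay in delays:
        sleep(delay)
        if action():
            return True
    return False
```

With `factor=2.0` the waits double on each retry up to `cap_sec`, which gives a rate-limited or busy service progressively more time to recover.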

[0037] In one implementation scenario, in response to the target anomaly category being model exception, the runner can be controlled to select, according to a preset priority, a backup large model of the target large model to continue executing the task object. As a possible example, `model_used` can also be recorded in this case for tracking purposes.
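
The correspondence between anomaly categories and handling actions described in the preceding paragraphs can be summarized in a small lookup table, together with a priority-based choice of backup model. This is a sketch under stated assumptions: the enum, the action labels, the model names, and the convention that a smaller number means a higher priority are all hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable, Optional


class Anomaly(Enum):
    PAGE = "page"
    GENERATION = "generation"
    COPY = "copy"
    SERVICE = "service"
    MODEL = "model"


# Action labels are illustrative; the runner would map each to a real routine.
HANDLING_ACTIONS: Dict[Anomaly, str] = {
    Anomaly.PAGE: "force_reset",
    Anomaly.GENERATION: "force_reset",
    Anomaly.COPY: "force_reset",
    Anomaly.SERVICE: "backoff_then_retry",
    Anomaly.MODEL: "switch_to_backup_model",
}


def pick_backup_model(priorities: Dict[str, int], failed: Iterable[str]) -> Optional[str]:
    """Choose the highest-priority model that has not failed.

    Assumption: a smaller number means a higher priority.
    """
    failed_set = set(failed)
    candidates = [name for name in priorities if name not in failed_set]
    if not candidates:
        return None
    return min(candidates, key=lambda name: priorities[name])
```

The value returned by `pick_backup_model` is what could be recorded as `model_used` for later tracking.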

[0038] In one implementation scenario, please continue to refer to Figure 2a. As shown in Figure 2a, as a possible implementation in practical applications, test files in formats such as JSON and JSONL can be obtained first. Task intake and construction (e.g., validation, splitting, numbering, parameter completion) can then be performed based on the test files to obtain several task objects, which are added to the task queue of the queue service. On this basis, when worker containers are used as runners, N worker containers can be launched concurrently through container orchestration to consume the task objects in the task queue in parallel. Specifically, in the worker execution cluster, each worker container can have its own independent session, clipboard, and working directory (i.e., the worker containers are isolated from each other). Within each worker container, web page interaction is carried out and the regions of interest (ROIs) are initialized through image recognition and localization. Then, through the aforementioned stability determination, scroll-down reachability, and anomaly self-healing mechanisms, the corresponding result object can be obtained and added to the result queue. Finally, the result objects can be written to disk and then aggregated.
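
The flow from test file to task queue to result queue can be sketched as follows. This is a simplified illustration, not the implementation of the application: threads stand in for worker containers (so they do not provide the session, clipboard, and directory isolation described above), and the field names `task_id`, `prompt`, `timeout_sec`, the default timeout, and the `execute` callable are assumptions.

```python
import json
import queue
import threading
from typing import Callable, Dict, List

DEFAULT_TIMEOUT_SEC = 300  # assumed default used for parameter completion


def build_tasks(jsonl_text: str) -> List[Dict]:
    """Validate, number, and complete task objects from JSONL test cases."""
    tasks: List[Dict] = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        if not case.get("prompt"):
            continue  # validation: skip cases without prompt text
        tasks.append({
            "task_id": len(tasks) + 1,
            "prompt": case["prompt"],
            "timeout_sec": case.get("timeout_sec", DEFAULT_TIMEOUT_SEC),
        })
    return tasks


def run_workers(tasks: List[Dict], n_workers: int,
                execute: Callable[[Dict], str]) -> List[Dict]:
    """Consume the task queue with n_workers threads and collect results."""
    task_q: "queue.Queue" = queue.Queue()
    result_q: "queue.Queue" = queue.Queue()
    for task in tasks:
        task_q.put(task)
    for _ in range(n_workers):
        task_q.put(None)  # one stop sentinel per worker

    def worker(worker_id: int) -> None:
        while True:
            task = task_q.get()  # blocking retrieval
            if task is None:
                break
            result_q.put({
                "task_id": task["task_id"],
                "worker_id": worker_id,
                "model_response": execute(task),
            })

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return sorted(results, key=lambda r: r["task_id"])
```

In a containerized deployment the in-process queues would be replaced by a queue service shared by the worker containers; the sentinel-based shutdown shown here is one common way to stop blocking consumers.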

[0039] In the above scheme, task objects are generated based on test cases of the target large model. Each task object contains field values of several task fields, which include at least the prompt text and the task timeout threshold, and the task objects are stored in a task queue. Runners are controlled to execute the task objects in the task queue, and different runners are isolated from each other. Each runner runs a browser page of the target large model and fills the prompt text parsed from the task object into the input box of the browser page, so as to copy from the browser page the response content generated by the target large model in response to the prompt text. In response to the runner failing to obtain the response content after the target duration threshold has been exceeded, the runner is controlled to call the multimodal large model to recognize the page image of the browser page and obtain the target anomaly category, to execute the target handling action matching the target anomaly category, and to re-execute the currently unsuccessful task object, where the target duration threshold is the field value of the task timeout threshold in the currently unsuccessful task object. On the one hand, because the task objects in the task queue are executed by different runners that are isolated from each other, result crosstalk and session pollution between concurrent tasks are avoided at the root as far as possible, and stable, horizontally scalable concurrent processing is achieved. On the other hand, when the runner has not obtained the response content after the target duration threshold has been exceeded, it calls the multimodal large model to recognize the page image, obtains the target anomaly category, executes the matching target handling action, and then re-executes the currently unsuccessful task object, which improves the self-healing capability in abnormal scenarios. Therefore, while concurrent multi-task testing of large models is realized, data crosstalk between different tasks is reduced as much as possible and the self-healing capability in abnormal scenarios is improved.

[0040] Please refer to Figure 3, which is a schematic framework diagram of an embodiment of the large model testing device of this application. The large model testing device 30 includes a task generation module 31, a task execution module 32, and an exception handling module 33. The task generation module 31 is used to generate task objects based on test cases of the target large model, wherein each task object contains field values of several task fields, the several task fields include at least the prompt text and the task timeout threshold, and the task objects are stored in a task queue. The task execution module 32 is used to control the runner to execute the task objects in the task queue, wherein different runners are isolated from each other, the runner runs a browser page of the target large model, and the runner fills the prompt text parsed from the task object into the input box of the browser page, so as to copy from the browser page the response content generated by the target large model in response to the prompt text. The exception handling module 33 is used to, in response to the runner failing to obtain the response content after the target duration threshold has been exceeded, control the runner to call the multimodal large model to recognize the page image of the browser page and obtain the target anomaly category, control the runner to execute the target handling action matching the target anomaly category, and control the runner to re-execute the currently unsuccessful task object, wherein the target duration threshold is the field value of the task timeout threshold in the currently unsuccessful task object.

[0041] In the above scheme, the large model testing device 30 generates task objects based on test cases of the target large model. Each task object contains field values of several task fields, which include at least the prompt text and the task timeout threshold, and the task objects are stored in a task queue. Runners are controlled to execute the task objects in the task queue, and different runners are isolated from each other. Each runner runs a browser page of the target large model and fills the prompt text parsed from the task object into the input box of the browser page, so as to copy from the browser page the response content generated by the target large model in response to the prompt text. In response to the runner failing to obtain the response content after the target duration threshold has been exceeded, the runner is controlled to call the multimodal large model to recognize the page image of the browser page and obtain the target anomaly category, to execute the target handling action matching the target anomaly category, and to re-execute the currently unsuccessful task object, where the target duration threshold is the field value of the task timeout threshold in the currently unsuccessful task object. On the one hand, because the task objects in the task queue are executed by different runners that are isolated from each other, result crosstalk and session pollution between concurrent tasks are avoided at the root as far as possible, and stable, horizontally scalable concurrent processing is achieved. On the other hand, when the runner has not obtained the response content after the target duration threshold has been exceeded, it calls the multimodal large model to recognize the page image, obtains the target anomaly category, executes the matching target handling action, and then re-executes the currently unsuccessful task object, which improves the self-healing capability in abnormal scenarios. Therefore, while concurrent multi-task testing of large models is realized, data crosstalk between different tasks is reduced as much as possible and the self-healing capability in abnormal scenarios is improved.

[0042] In some disclosed embodiments, the exception handling module 33 includes a first handling submodule, used to control the runner to perform a forced reset of the browser page in response to the target anomaly category being any of page exception, generation exception, or copy exception; the exception handling module 33 includes a second handling submodule, used to control the runner to perform a backoff wait and then retry in response to the target anomaly category being service exception, wherein the service exception includes at least one of the following: rate limiting, busy, unavailable; the exception handling module 33 includes a third handling submodule, used to control the runner to select, according to a preset priority, a backup large model of the target large model to continue executing the task object in response to the target anomaly category being model exception.

[0043] In some disclosed embodiments, the task execution module 32 includes a first recognition submodule, used to perform recognition on the page image of the browser page to obtain a first recognition result; the task execution module 32 includes a copy determination submodule, used to, in response to the first recognition result including a first position of a first control for implementing content copying, control the runner to trigger the first control based on the first position to obtain the response content at the current moment, update the target count value based on the content length of the response content at the current moment, and determine, based on the target count value, whether the response content at the current moment is the finally generated response content; wherein the target count value characterizes the stability of the response content at the current moment.

[0044] In some disclosed embodiments, before executing the task object, the runner initializes an image template library and the regions of interest (ROIs) of several controls. The image template library contains a first template for matching and locating the first control. The first recognition submodule includes a first local recognition unit, used to perform local recognition within the ROI of the first control in the page image to obtain a local recognition result of the first control; the ROI of the first control is obtained in advance by matching and locating the first template of the first control on the page image of the browser page. The first recognition submodule includes a first result selection unit, used to select the local recognition result of the first control as the first recognition result in response to the local recognition result including the first position of the first control. The first recognition submodule includes a first global recognition unit, used to perform global recognition on the page image to obtain the first recognition result in response to the local recognition result indicating that the first control was not recognized.
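
The local-first, global-fallback recognition described above can be illustrated with a deliberately simplified sketch. It uses exact pixel matching on small integer grids rather than a real image-matching library, and the ROI layout `(x0, y0, x1, y1)` with exclusive upper bounds as well as the function names are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]
Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def match_template(image: Image, template: Image,
                   region: Optional[Region] = None) -> Optional[Tuple[int, int]]:
    """Return the top-left (x, y) of the first exact match inside the region."""
    th, tw = len(template), len(template[0])
    x0, y0, x1, y1 = region if region else (0, 0, len(image[0]), len(image))
    for y in range(y0, y1 - th + 1):
        for x in range(x0, x1 - tw + 1):
            if all(image[y + dy][x + dx] == template[dy][dx]
                   for dy in range(th) for dx in range(tw)):
                return (x, y)
    return None


def locate_control(image: Image, template: Image,
                   roi: Region) -> Tuple[Optional[Tuple[int, int]], str]:
    """Search the ROI first; fall back to the whole page image if needed."""
    position = match_template(image, template, roi)
    if position is not None:
        return position, "local"
    # Local recognition failed; the position is None if the global search fails too.
    return match_template(image, template), "global"
```

Searching the ROI first keeps the common case cheap, while the global pass covers situations in which the control has moved, for example after the page layout changes.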

[0045] In some disclosed embodiments, the copy determination submodule includes a first maintenance unit, used to keep the target count value at zero in response to the content length of the response content at the current moment indicating that the content is empty; the copy determination submodule includes a second maintenance unit, used to keep the target count value at zero in response to the content length of the response content at the current moment having increased compared with the content length of the response content at a historical moment; the copy determination submodule includes a count increase unit, used to increase the target count value in response to the content length of the response content at the current moment being unchanged compared with the content length of the response content at the historical moment; wherein the historical moment is the moment before the current moment.
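
The three update rules above can be written as a single function, shown here as a minimal sketch. The function name and the step size of one are assumptions; the case in which the content becomes shorter is not covered by the text, so resetting to zero in that case is likewise an assumption.

```python
def update_count(count: int, current_len: int, previous_len: int) -> int:
    """Update the stability counter from two consecutive content lengths."""
    if current_len == 0:
        return 0  # nothing generated yet
    if current_len > previous_len:
        return 0  # content is still growing
    if current_len == previous_len:
        return count + 1  # content unchanged since the previous moment
    # Shorter content is not addressed by the source; treated as unstable here.
    return 0
```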

[0046] In some disclosed embodiments, the exception handling module 33 includes a page scroll-down submodule, used to, in response to the response content at the current moment extending beyond the display area of the browser page or the first recognition result indicating that the first control was not recognized, control the runner to scroll down the browser page until the bottom of the browser page is reached, and then return to the step of performing recognition on the page image of the browser page to obtain the first recognition result.

[0047] In some disclosed embodiments, before executing the task object, the runner initializes an image template library and the regions of interest (ROIs) of several controls. The image template library contains a second template for matching and locating a second control, and the second control is used to scroll down the page. The page scroll-down submodule includes a second local recognition unit, used to perform local recognition within the ROI of the second control in the page image to obtain a local recognition result of the second control; the ROI of the second control is obtained in advance by matching and locating the second template of the second control on the page image of the browser page. The page scroll-down submodule includes a second result selection unit, used to select the local recognition result of the second control as the second recognition result of the second control in response to the local recognition result including the second position of the second control. The page scroll-down submodule includes a second global recognition unit, used to perform global recognition on the page image to obtain the second recognition result of the second control in response to the local recognition result indicating that the second control was not recognized; the second recognition result of the second control includes the second position of the second control. The page scroll-down submodule includes a second control triggering unit, used to control the runner to trigger the second control based on the second position so as to scroll down the browser page.

[0048] In some disclosed embodiments, the copy determination submodule includes a counting detection unit, used to detect whether the target count value has reached a counting threshold; the copy determination submodule includes a first response unit, used to determine that the response content at the current moment is the finally generated response content in response to the target count value being not less than the counting threshold; the copy determination submodule includes a second response unit, used to determine that the response content at the current moment is not the finally generated response content in response to the target count value being less than the counting threshold.
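
How the counting threshold turns a series of copied snapshots into a final/not-final decision can be sketched as follows. The function takes the content lengths observed at successive moments and returns the index of the first moment at which the content is judged final; the function name, the sample threshold, and the treatment of any length change as a reset are assumptions for illustration.

```python
from typing import Iterable, Optional


def first_final_index(lengths: Iterable[int], count_threshold: int) -> Optional[int]:
    """Index of the first snapshot whose count is not less than the threshold."""
    count = 0
    previous = 0
    for index, current in enumerate(lengths):
        if current == 0 or current != previous:
            count = 0  # empty or changed content: not yet stable
        else:
            count += 1  # same length as the previous moment
        previous = current
        if count >= count_threshold:
            return index
    return None  # never stabilized within the observed snapshots
```

For example, with lengths `[0, 5, 12, 12, 12, 12]` and a threshold of 3, the content is judged final at index 5, after its length has stayed at 12 for three consecutive comparisons.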

[0049] In some disclosed embodiments, the several task fields further include at least one of: task identifier, maximum number of retries, and task priority; and/or, when the several task fields include the maximum number of retries, execution failure is determined in response to the total number of times the runner has re-executed the currently unsuccessful task object exceeding a target threshold, the target threshold being the field value of the maximum number of retries; and/or, the runner is either a container or a virtual machine; and/or, different runners are isolated from each other with respect to: browser session, execution directory, and clipboard environment; and/or, the runner retrieves task objects from the task queue in a blocking manner; and/or, the response content serves as the field value of the model response field in the result object corresponding to the task object, the result object is stored in a result queue, and the result object further includes at least one of the following result fields: execution status field, number of attempts field, execution duration field, prompt text, and error code/message.
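
The retry limit and the result fields listed above can be illustrated with the following minimal sketch. The key names (`max_retries`, `status`, `attempts`, `duration_sec`, `model_response`, `error`) and the `run_once` callable are assumptions; the sketch interprets the maximum number of retries as the number of re-executions allowed after the first attempt.

```python
import time
from typing import Callable, Dict


def execute_with_retries(task: Dict, run_once: Callable[[Dict], str],
                         clock: Callable[[], float] = time.monotonic) -> Dict:
    """Run a task, re-executing it at most task['max_retries'] times."""
    start = clock()
    attempts = 0
    error = ""
    # One initial attempt plus up to max_retries re-executions.
    while attempts <= task["max_retries"]:
        attempts += 1
        try:
            response = run_once(task)
        except Exception as exc:  # any failure counts as an unsuccessful attempt
            error = str(exc)
            continue
        return {
            "task_id": task["task_id"], "status": "success",
            "attempts": attempts, "duration_sec": clock() - start,
            "prompt": task["prompt"], "model_response": response, "error": "",
        }
    return {
        "task_id": task["task_id"], "status": "failed",
        "attempts": attempts, "duration_sec": clock() - start,
        "prompt": task["prompt"], "model_response": "", "error": error,
    }
```

In the scheme described here, the anomaly handling action would run between a failed attempt and the next re-execution; it is omitted from the sketch to keep the retry accounting visible.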

[0050] Please refer to Figure 4, which is a schematic framework diagram of an embodiment of the electronic device of this application. The electronic device 40 includes at least a memory 41 and a processor 42 coupled to each other. The memory 41 stores at least program instructions, and the processor 42 is used to execute the program instructions to implement the steps in any of the above-described large model testing method embodiments. For details, please refer to the foregoing disclosed embodiments, which are not repeated here. By way of example, the electronic device 40 may include, but is not limited to, servers, desktop computers, and laptops; the specific type of the electronic device 40 is not limited here.

[0051] Specifically, processor 42 controls itself and memory 41 to implement the steps in any of the above-described large model testing method embodiments. Processor 42 can also be referred to as a CPU (Central Processing Unit). Processor 42 may be an integrated circuit chip with signal processing capabilities. Processor 42 can also be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor. Furthermore, processor 42 can be implemented using integrated circuit chips.

[0052] In the above scheme, the electronic device 40 generates task objects based on test cases of the target large model. Each task object contains field values of several task fields, which include at least the prompt text and the task timeout threshold, and the task objects are stored in a task queue. Runners are controlled to execute the task objects in the task queue, and different runners are isolated from each other. Each runner runs a browser page of the target large model and fills the prompt text parsed from the task object into the input box of the browser page, so as to copy from the browser page the response content generated by the target large model in response to the prompt text. In response to the runner failing to obtain the response content after the target duration threshold has been exceeded, the runner is controlled to call the multimodal large model to recognize the page image of the browser page and obtain the target anomaly category, to execute the target handling action matching the target anomaly category, and to re-execute the currently unsuccessful task object, where the target duration threshold is the field value of the task timeout threshold in the currently unsuccessful task object. On the one hand, because the task objects in the task queue are executed by different runners that are isolated from each other, result crosstalk and session pollution between concurrent tasks are avoided at the root as far as possible, and stable, horizontally scalable concurrent processing is achieved. On the other hand, when the runner has not obtained the response content after the target duration threshold has been exceeded, it calls the multimodal large model to recognize the page image, obtains the target anomaly category, executes the matching target handling action, and then re-executes the currently unsuccessful task object, which improves the self-healing capability in abnormal scenarios. Therefore, while concurrent multi-task testing of large models is realized, data crosstalk between different tasks is reduced as much as possible and the self-healing capability in abnormal scenarios is improved.

[0053] Please refer to Figure 5, which is a schematic framework diagram of an embodiment of the computer-readable storage medium of this application. The computer-readable storage medium 50 stores program instructions 51 that can be executed by a processor. The program instructions 51 are used to implement the steps in any of the above-described large model testing method embodiments.

[0054] In the above scheme, when the program instructions 51 stored in the computer-readable storage medium 50 are executed, task objects are generated based on test cases of the target large model. Each task object contains field values of several task fields, which include at least the prompt text and the task timeout threshold, and the task objects are stored in a task queue. Runners are controlled to execute the task objects in the task queue, and different runners are isolated from each other. Each runner runs a browser page of the target large model and fills the prompt text parsed from the task object into the input box of the browser page, so as to copy from the browser page the response content generated by the target large model in response to the prompt text. In response to the runner failing to obtain the response content after the target duration threshold has been exceeded, the runner is controlled to call the multimodal large model to recognize the page image of the browser page and obtain the target anomaly category, to execute the target handling action matching the target anomaly category, and to re-execute the currently unsuccessful task object, where the target duration threshold is the field value of the task timeout threshold in the currently unsuccessful task object. On the one hand, because the task objects in the task queue are executed by different runners that are isolated from each other, result crosstalk and session pollution between concurrent tasks are avoided at the root as far as possible, and stable, horizontally scalable concurrent processing is achieved. On the other hand, when the runner has not obtained the response content after the target duration threshold has been exceeded, it calls the multimodal large model to recognize the page image, obtains the target anomaly category, executes the matching target handling action, and then re-executes the currently unsuccessful task object, which improves the self-healing capability in abnormal scenarios. Therefore, while concurrent multi-task testing of large models is realized, data crosstalk between different tasks is reduced as much as possible and the self-healing capability in abnormal scenarios is improved.

[0055] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. For the specific implementation, refer to the description of the above method embodiments; for brevity, it is not repeated here.

[0056] The descriptions of the various embodiments above tend to emphasize the differences between the embodiments. For the parts that are the same or similar, the embodiments can be referred to one another; for brevity, they are not repeated here.

[0057] In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus implementations described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

[0058] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0059] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0060] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods of various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0061] If the technical solution of this application involves personal information, the product using this technical solution has clearly informed the user of the personal information processing rules and obtained the user's voluntary consent before processing the personal information. If the technical solution of this application involves sensitive personal information, the product using this technical solution has obtained the user's separate consent before processing the sensitive personal information, and also meets the requirement of "express consent". For example, at personal information collection devices such as cameras, clear and prominent signs are set up to inform users that they have entered the scope of personal information collection and that personal information will be collected. If an individual voluntarily enters the collection scope, it is deemed that they have agreed to the collection of their personal information; or on the personal information processing device, with clear signs / information informing users of the personal information processing rules, authorization is obtained from the individual through pop-up information or by asking the individual to upload their personal information; wherein, the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.

Claims

1. A large model testing method, characterized in that it comprises: generating a task object based on a test case of a target large model, wherein the task object contains field values of several task fields, the several task fields include at least prompt text and a task timeout threshold, and the task object is stored in a task queue; controlling a runner to execute the task objects in the task queue, wherein different runners are isolated from each other, the runner runs a browser page of the target large model, and the runner fills the prompt text parsed from the task object into an input box of the browser page, so as to copy from the browser page the response content generated by the target large model in response to the prompt text; and in response to the runner failing to obtain the response content after a target duration threshold has been exceeded, controlling the runner to call a multimodal large model to recognize a page image of the browser page and obtain a target anomaly category, controlling the runner to execute a target handling action matching the target anomaly category, and controlling the runner to re-execute the currently unsuccessful task object, wherein the target duration threshold is the field value of the task timeout threshold in the currently unsuccessful task object.

2. The method according to claim 1, characterized in that controlling the runner to execute the target handling action matching the target anomaly category includes: in response to the target anomaly category being any of page exception, generation exception, or copy exception, controlling the runner to perform a forced reset of the browser page; in response to the target anomaly category being service exception, controlling the runner to perform a backoff wait and then retry, wherein the service exception includes at least one of the following: rate limiting, busy, unavailable; and in response to the target anomaly category being model exception, controlling the runner to select, according to a preset priority, a backup large model of the target large model to continue executing the task object.

3. The method according to claim 1, characterized in that controlling the runner to execute the task objects in the task queue includes: performing recognition on the page image of the browser page to obtain a first recognition result; and in response to the first recognition result including a first position of a first control for implementing content copying, controlling the runner to trigger the first control based on the first position to obtain the response content at the current moment, updating a target count value based on the content length of the response content at the current moment, and determining, based on the target count value, whether the response content at the current moment is the finally generated response content; wherein the target count value characterizes the stability of the response content at the current moment.

4. The method according to claim 3, characterized in that, before executing the task object, the runner initializes an image template library and regions of interest of several controls, the image template library containing a first template for matching and locating the first control; and performing recognition on the page image of the browser page to obtain the first recognition result includes: performing local recognition within the region of interest of the first control in the page image to obtain a local recognition result of the first control, wherein the region of interest of the first control is obtained in advance by matching and locating the first template of the first control on the page image of the browser page; in response to the local recognition result of the first control including the first position of the first control, selecting the local recognition result of the first control as the first recognition result; and in response to the local recognition result of the first control indicating that the first control was not recognized, performing global recognition on the page image to obtain the first recognition result.

5. The method according to claim 3, characterized in that updating the target count value based on the content length of the response content at the current moment includes: in response to the content length of the response content at the current moment indicating that the content is empty, keeping the target count value at zero; in response to the content length of the response content at the current moment having increased compared with the content length of the response content at a historical moment, keeping the target count value at zero; and in response to the content length of the response content at the current moment being unchanged compared with the content length of the response content at the historical moment, increasing the target count value; wherein the historical moment is the moment before the current moment.

6. The method according to claim 3, characterized in that the method further includes: in response to the response content at the current moment exceeding the display area of the browser page, or the first recognition result indicating that the first control was not recognized, controlling the runner to pull down the browser page until the bottom of the browser page is reached, and then returning to the step of recognizing the page image of the browser page to obtain the first recognition result.
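
The fallback in claim 6 can be sketched as follows. The four callables are hypothetical stand-ins for the runner's screen recognition and page control, and the `max_scrolls` bound is an added safeguard that the claim does not mention.

```python
from typing import Callable, Optional, Tuple

Position = Tuple[int, int]


def recognize_after_scrolling(recognize: Callable[[], Optional[Position]],
                              content_overflows: Callable[[], bool],
                              at_bottom: Callable[[], bool],
                              scroll_down: Callable[[], None],
                              max_scrolls: int = 100) -> Optional[Position]:
    """Recognize the copy control, pulling the page down to the bottom if needed."""
    position = recognize()
    if position is not None and not content_overflows():
        return position
    scrolls = 0
    # Pull the page down until the bottom is reached (bounded as a safeguard).
    while not at_bottom() and scrolls < max_scrolls:
        scroll_down()
        scrolls += 1
    # Return to the recognition step on the repositioned page.
    return recognize()
```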

7. The method according to claim 6, characterized in that, before executing the task object, the runner initializes an image template library and regions of interest for several controls, the image template library containing a second template for matching and locating a second control, the second control being used to implement a page pull-down; and controlling the runner to pull down the browser page includes: performing local recognition based on the region of interest of the second control in the page image to obtain a local recognition result of the second control, wherein the region of interest of the second control is obtained in advance by matching and locating the second template of the second control against the page image of the browser page; in response to the local recognition result of the second control including a second position of the second control, selecting the local recognition result of the second control as a second recognition result of the second control; in response to the local recognition result of the second control indicating that the second control was not recognized, performing global recognition based on the page image to obtain the second recognition result of the second control, wherein the second recognition result of the second control includes the second position of the second control; and, based on the second position, controlling the runner to trigger the second control to pull down the browser page.

8. The method according to claim 3, characterized in that determining, based on the target count value, whether the response content at the current moment is the final generated response content includes: detecting whether the target count value is not less than a count threshold; in response to the target count value being not less than the count threshold, determining that the response content at the current moment is the final generated response content; and in response to the target count value being less than the count threshold, determining that the response content at the current moment is not the final generated response content.
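
The threshold decision in claim 8 can be sketched as a polling loop that stops once the count reaches the count threshold. The function name `wait_for_final_response`, the `copy_response` callable, and the `max_polls` bound are assumptions for illustration; the delay between polls is omitted, and any change in length is assumed to reset the count.

```python
from typing import Callable, Optional


def wait_for_final_response(copy_response: Callable[[], str],
                            count_threshold: int,
                            max_polls: int) -> Optional[str]:
    """Poll the copied content until its length has been stable long enough."""
    count = 0
    previous_len = 0
    for _ in range(max_polls):
        content = copy_response()
        current_len = len(content)
        if current_len > 0 and current_len == previous_len:
            count += 1        # same non-empty length as the previous poll
        else:
            count = 0         # empty or changed: not stable yet
        previous_len = current_len
        if count >= count_threshold:   # "not less than the count threshold"
            return content
    return None               # never stabilized within max_polls
```

A caller might treat a `None` result the same way as a timeout, since the content never stabilized within the allowed number of polls.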

9. The method according to any one of claims 1 to 8, characterized in that the several task fields further include at least one of the following: task identifier, maximum number of retries, and task priority; and/or, where the several task fields include the maximum number of retries, in response to the total number of times the runner re-executes the currently unsuccessful task object exceeding a target number threshold, an execution failure is determined, the target number threshold being the field value of the maximum number of retries; and/or, the runner is either a container or a virtual machine; and/or, different runners are isolated from each other with respect to: browser session, runtime directory, and clipboard environment; and/or, the runner retrieves the task object from the task queue in a blocking manner; and/or, the response content serves as the field value of the model response field in the result object corresponding to the task object, the result object is stored in a result queue, and the result object further includes at least one of the following result fields: execution status field, number of attempts field, execution duration field, prompt text, and error code/message.
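
The task and result objects of claim 9, together with blocking retrieval and the retry limit, can be sketched with the Python standard library. All class, field, and function names here are hypothetical, the `execute` callable stands in for the runner's browser interaction, and task priority is declared but not used for ordering in this sketch.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaskObject:
    prompt_text: str
    task_timeout_s: float
    task_id: str = ""
    max_retries: int = 0
    priority: int = 0


@dataclass
class ResultObject:
    task_id: str
    prompt_text: str
    status: str                       # "success" or "failed"
    attempts: int
    duration_s: float
    model_response: Optional[str] = None
    error_message: Optional[str] = None


def run_next_task(task_queue: "queue.Queue[TaskObject]",
                  result_queue: "queue.Queue[ResultObject]",
                  execute: Callable[[TaskObject], Optional[str]]) -> None:
    """Take one task (blocking), retry up to max_retries times, store the result."""
    task = task_queue.get()           # blocks until a task object is available
    start = time.monotonic()
    attempts = 0
    response = None
    # One initial attempt plus at most max_retries re-executions.
    while response is None and attempts <= task.max_retries:
        attempts += 1
        response = execute(task)      # None means no response was obtained
    result_queue.put(ResultObject(
        task_id=task.task_id,
        prompt_text=task.prompt_text,
        status="success" if response is not None else "failed",
        attempts=attempts,
        duration_s=time.monotonic() - start,
        model_response=response,
        error_message=None if response is not None else "max retries exceeded",
    ))
```

Each runner would call `run_next_task` in its own loop, so an idle runner simply waits on the queue instead of polling it.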

10. A large-scale model testing device, characterized in that it includes: a task generation module, used to generate task objects based on test cases of the target large model, wherein the task object contains field values of several task fields, the several task fields include at least prompt text and a task timeout threshold, and the task object is stored in the task queue; a task execution module, used to control the runner to execute the task objects in the task queue, wherein different runners are isolated from each other, the runner runs a browser page of the target large model, and the runner fills the prompt text parsed from the task object into the input box of the browser page in order to copy, from the browser page, the response content generated by the target large model in response to the prompt text; and an exception handling module, used to, in response to the runner failing to obtain the response content after a target time threshold is exceeded, control the runner to call a multimodal large model to recognize the page image of the browser page and obtain a target anomaly category, control the runner to execute a target handling action matching the target anomaly category, and control the runner to re-execute the currently unsuccessful task object, wherein the target time threshold is the field value of the task timeout threshold in the currently unsuccessful task object.
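
The exception handling flow of claim 10 can be sketched as follows: on timeout, classify the page image, run the matching handling action, then re-execute the task. Every callable is a hypothetical stand-in; in particular `classify_anomaly` represents the call to the multimodal large model, whose interface and anomaly categories are not specified in this section, and the `max_retries` bound is borrowed from claim 9.

```python
from typing import Callable, Dict, Optional, Tuple


def execute_with_self_healing(run_task: Callable[[], Optional[str]],
                              capture_page_image: Callable[[], bytes],
                              classify_anomaly: Callable[[bytes], str],
                              handlers: Dict[str, Callable[[], None]],
                              max_retries: int) -> Tuple[Optional[str], int]:
    """Return (response, attempts); response is None if every attempt failed."""
    attempts = 0
    for attempts in range(1, max_retries + 2):
        # run_task is assumed to return None when no response content is
        # obtained within the task timeout threshold.
        response = run_task()
        if response is not None:
            return response, attempts
        # Timeout: ask the (stand-in) multimodal model what went wrong.
        category = classify_anomaly(capture_page_image())
        handler = handlers.get(category)
        if handler is not None:
            handler()         # target handling action for this category
        # The loop then re-executes the unsuccessful task.
    return None, attempts
```

Keeping the handling actions in a dictionary keyed by category lets new anomaly categories be supported by registering another handler, without changing the retry loop.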

11. An electronic device, characterized in that it includes at least a memory and a processor coupled to each other, wherein the memory stores at least program instructions, and the processor is used to execute the program instructions to implement the large model testing method according to any one of claims 1 to 9.

12. A computer-readable storage medium, characterized in that it stores program instructions executable by a processor, the program instructions being used to implement the large model testing method according to any one of claims 1 to 9.