Multi-step inference agent system and operation method thereof

The multi-stage inference agent system addresses computational bottlenecks and user customization issues by dynamically reconstructing user personas, executing tasks with self-correction, and verifying information reliability, enhancing productivity and efficiency.

WO2026121927A1 · PCT designated stage · Publication Date: 2026-06-11 · LG MANAGEMENT DEV INST CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
LG MANAGEMENT DEV INST CO LTD
Filing Date
2025-12-08
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Existing language models face computational bottlenecks and memory issues because computational load and memory usage grow quadratically with context length. They also lack user customization, fail to self-correct errors, and provide unreliable information, hindering complex reasoning and business adoption.

Method used

A multi-stage inference agent system that dynamically reconstructs user personas, executes tasks in a sandbox environment with self-correction, verifies information reliability, and processes long contexts efficiently using a hybrid attention mechanism.

Benefits of technology

Provides hyper-personalized assistance, ensures error-free complex task execution, and enhances productivity by reducing user intervention and secondary processing, while maintaining computational efficiency.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure KR2025020994_11062026_PF_FP_ABST

Abstract

The present disclosure relates to a method and a system for providing a multi-step inference artificial intelligence agent service. According to an embodiment, the system may dynamically reconfigure an agent persona on the basis of user meta information and decompose a query into a plurality of subtasks to establish an execution plan. In addition, a data analysis task is executed in a sandbox environment, and execution integrity can be ensured through a self-correction loop that corrects its own code when an error is detected. Furthermore, work automation efficiency can be maximized by providing reliability-based search results and analysis content as artifacts in the form of editable native office objects.

Description

Multi-stage inference agent system and operation method thereof

[0001] The present invention relates to a multi-stage inference agent system and a method of operation thereof, and to a system for providing an artificial intelligence model service platform that includes such an agent system, and a method of operation thereof.

[0002] With the recent advancement of deep learning technology, large language models (LLMs) are being utilized in various fields. These language models demonstrate excellent performance in general conversation, simple information retrieval, and sentence generation.

[0003] However, existing standard Transformer models adopt a global attention method that computes the interactions between all token pairs within a sequence. As a result, computational load and memory usage grow quadratically (O(N^2)) with the context length (N), causing severe bottlenecks in agent systems that must analyze long documents or maintain long-term conversation histories. Furthermore, indiscriminately increasing the number of parameters to improve inference performance sharply increases the compute cost (FLOPs) of training and inference, and simple Mixture-of-Experts (MoE) models suffer from training instability, in which the load concentrates on specific experts during the early stages of training, or from knowledge fragmentation.
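The distinction between quadratic and exponential growth matters here: doubling the context length quadruples the number of token pairs that global attention must score, whereas a sliding window of fixed width W (as in the hybrid architecture of paragraph [0019]) only doubles it. A short worked comparison, counting attention-score entries per layer and head and ignoring the factor of roughly one half from causal masking, with an illustrative window W = 4,096 that is an assumption rather than a value given in this disclosure:

```latex
\begin{aligned}
C_{\text{global}}(N) &= N^2, & C_{\text{global}}(2N) &= 4\,C_{\text{global}}(N),\\
C_{\text{local}}(N) &\approx N W, & C_{\text{local}}(2N) &\approx 2\,C_{\text{local}}(N),\\
N = 2^{17}\ (128\text{K}),\ W = 2^{12}: \quad N^2 &= 2^{34} \approx 1.7 \times 10^{10}, & N W &= 2^{29} \approx 5.4 \times 10^{8}.
\end{aligned}
```

At this length a sliding-window layer scores N/W = 32 times fewer pairs than a global layer.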

[0004] Meanwhile, existing services or chatbot systems utilizing language models remain at the level of processing user queries in a single turn or presenting simple information retrieval results. In particular, for tasks requiring complex reasoning, such as "analyze market trends and suggest a strategy," existing systems fail to establish specific execution plans and merely list fragmentary information.

[0005] Above all, because these systems cannot self-correct errors that occur when executing code or analyzing data with external tools, the entire process frequently comes to a halt. Furthermore, inadequate fact-verification mechanisms for confirming whether generated results are grounded in facts lead to persistent hallucination, which is a major factor hindering adoption in corporate work environments.

[0006] Therefore, there is an urgent need for an integrated agent system that is built on a high-efficiency model architecture capable of efficiently processing long contexts, so that vast background knowledge and logs can be digested, and that provides reliable, structured results by recognizing the user's business context and persona, logically decomposing complex tasks, and self-correcting execution errors.

[0007] The present disclosure is designed to overcome the limitations of the aforementioned background technology and aims to solve the following specific technical problems.

[0008] First, the multi-stage inference agent system and method of operation according to the present disclosure aim to address the lack of user-customized context recognition and inference. Existing chatbot systems have limitations in that they cannot remember user meta-information, such as the user's job, department, or preferred response style, and only perform fragmentary question-and-answer interactions each time. The present disclosure aims to provide a customized inference service that grasps even the user's implicit intentions by retrieving stored meta-information based on user identification information and dynamically injecting it into a system prompt to reconstruct an agent persona.

[0009] Furthermore, the multi-stage inference agent system and the method of operation according to the present disclosure aim to prevent execution errors and interruptions during the performance of complex tasks. When complex data analysis or coding tasks are performed, the entire process is frequently halted if syntax errors or data format inconsistencies occur in the generated code. The present disclosure aims to ensure execution completeness without user intervention by executing code in an isolated sandbox environment and introducing a self-correction loop that analyzes the cause of an error and regenerates corrected code.

[0010] Furthermore, the multi-stage inference agent system and method of operation according to the present disclosure aim to verify the reliability of generated information and maximize its usability. Existing services had low business utility because they provided information with unclear sources or presented only simple text-based answers. The present disclosure aims to enhance verifiability by selectively searching for only reliable documents based on source control parameters set by the user and by providing a bidirectional cross-referencing function between the generated answers and the original documents. Moreover, it aims to improve business productivity by providing analysis results as artifacts in the form of native office objects that the user can immediately edit.
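The form of the source control parameters is not specified further here. As a minimal sketch, assuming they consist of a blocked-domain set, a preferred-domain set, and a hypothetical `preferred_only` switch, a search result filter might look like the following; all class names, field names, and domains are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple
from urllib.parse import urlparse

@dataclass
class SourceControl:
    blocked_domains: Set[str] = field(default_factory=set)    # untrusted, never used
    preferred_domains: Set[str] = field(default_factory=set)  # e.g. academic / patent DBs
    preferred_only: bool = False                               # hypothetical strict mode

@dataclass
class SearchHit:
    url: str
    score: float  # relevance score from the (unspecified) search backend

def _in_domains(host: str, domains: Set[str]) -> bool:
    """True if host equals a listed domain or is one of its subdomains."""
    return any(host == d or host.endswith("." + d) for d in domains)

def apply_source_control(hits: List[SearchHit], ctl: SourceControl) -> List[SearchHit]:
    kept: List[Tuple[bool, SearchHit]] = []
    for hit in hits:
        host = (urlparse(hit.url).hostname or "").lower()
        if _in_domains(host, ctl.blocked_domains):
            continue  # filter out untrusted domains
        preferred = _in_domains(host, ctl.preferred_domains)
        if ctl.preferred_only and not preferred:
            continue
        kept.append((preferred, hit))
    # Preferred sources first, then higher relevance first.
    kept.sort(key=lambda item: (not item[0], -item[1].score))
    return [hit for _, hit in kept]
```

Prioritizing rather than hard-filtering by default keeps general web results available while still surfacing the user's trusted sources first.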

[0011] A method for providing an artificial intelligence agent service executed by a computer according to one embodiment of the present disclosure for solving the above-described problem includes the following steps.

[0012] (a) User Context-Based Persona Reconstruction Step: The processor of the computing device retrieves user metadata stored in the database using the user identifier received from the user terminal. Based on this information, the system prompt is dynamically reconstructed to initialize an agent persona optimized for the corresponding session.
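As a minimal sketch of step (a), assuming the user metadata is a small record keyed by user identifier, persona reconstruction reduces to looking up the record and templating it into the system prompt. The `UserMeta` fields, the in-memory `USER_DB`, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UserMeta:
    job: str
    department: str
    response_style: str
    interests: List[str] = field(default_factory=list)

# Hypothetical stand-in for the user metadata database, keyed by user identifier.
USER_DB: Dict[str, UserMeta] = {
    "u-001": UserMeta("financial analyst", "Strategy Office", "concise, tables first", ["semiconductors"]),
}

BASE_PROMPT = "You are a multi-stage reasoning assistant."

def build_system_prompt(user_id: str, db: Dict[str, UserMeta] = USER_DB) -> str:
    """Reconstruct the session persona by injecting stored metadata into the system prompt."""
    meta = db.get(user_id)
    if meta is None:
        return BASE_PROMPT  # unknown user: fall back to the default persona
    lines = [
        BASE_PROMPT,
        f"The user is a {meta.job} in the {meta.department}.",
        f"Preferred response style: {meta.response_style}.",
    ]
    if meta.interests:
        lines.append("Ongoing topics of interest: " + ", ".join(meta.interests) + ".")
    return "\n".join(lines)
```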

[0013] (b) Complex Query Decomposition and Execution Plan Formulation Phase: The processor analyzes the complexity of the user query, decomposes it into multiple subtasks to derive a logically complete answer, and formulates an execution plan that considers data dependencies between each task. At this time, if deep search mode is activated in response to a user request, the inference budget—which is the allocation of thought tokens for inference and verification—is increased to enhance the precision of the plan.
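One way to derive an execution plan that respects data dependencies is a topological sort that groups mutually independent subtasks into stages. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the subtask names, the base thinking-token budget, and the deep-search multiplier are illustrative assumptions, since no concrete values are given here.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import List, Tuple

BASE_THINKING_TOKENS = 2_048  # hypothetical default inference budget
DEEP_SEARCH_MULTIPLIER = 4    # hypothetical scale-up when deep search mode is on

@dataclass(frozen=True)
class SubTask:
    name: str
    depends_on: Tuple[str, ...] = ()

def build_plan(subtasks: List[SubTask], deep_search: bool = False) -> Tuple[List[List[str]], int]:
    """Return (stages, thinking-token budget); subtasks within one stage are independent."""
    sorter = TopologicalSorter({t.name: t.depends_on for t in subtasks})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    budget = BASE_THINKING_TOKENS * (DEEP_SEARCH_MULTIPLIER if deep_search else 1)
    return stages, budget

tasks = [
    SubTask("collect_market_news"),
    SubTask("load_sales_data"),
    SubTask("analyze_trends", ("collect_market_news", "load_sales_data")),
    SubTask("draft_strategy", ("analyze_trends",)),
]
stages, budget = build_plan(tasks, deep_search=True)
# stages == [["collect_market_news", "load_sales_data"], ["analyze_trends"], ["draft_strategy"]]
# budget == 8192
```

Grouping by stage makes the parallelism explicit: both data-collection subtasks can be dispatched at once, while the analysis waits for both.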

[0014] (c) Sandbox-based execution and self-correction phase: The processor routes each subtask to expert modules, such as web search, code interpreters, and high-performance inference models, according to the execution plan. In particular, for data analysis tasks, code is executed in a secure virtual sandbox environment; upon error detection, the code is self-debugged and re-executed through a self-correction loop to derive valid results. For search tasks, user-configured source control parameters are applied to filter out untrusted domains or prioritize searches of specific academic and patent databases.
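The self-correction loop of step (c) can be sketched as follows. This is a hedged illustration, not the disclosed implementation: a separate Python process started with the `-I` (isolated mode) flag, a throwaway working directory, and a timeout stand in for the secure virtual sandbox, which in practice would need container- or VM-level isolation; the `repair` callback is a hypothetical placeholder for the model call that regenerates code from the captured error log.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Tuple

@dataclass
class ExecutionResult:
    ok: bool
    stdout: str
    stderr: str
    attempts: int

def run_isolated(code: str, timeout_s: float = 10.0) -> Tuple[int, str, str]:
    """Run code in a fresh interpreter inside a temporary directory (not a real sandbox)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],
                cwd=workdir, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return -1, "", f"TimeoutExpired: exceeded {timeout_s} s"
        return proc.returncode, proc.stdout, proc.stderr

def execute_with_self_correction(
    code: str,
    repair: Callable[[str, str], str],  # hypothetical: (code, error_log) -> corrected code
    max_attempts: int = 3,
) -> ExecutionResult:
    stdout, stderr = "", ""
    for attempt in range(1, max_attempts + 1):
        returncode, stdout, stderr = run_isolated(code)
        if returncode == 0:
            return ExecutionResult(True, stdout, stderr, attempt)
        if attempt < max_attempts:
            code = repair(code, stderr)  # feed the captured error log back for regeneration
    return ExecutionResult(False, stdout, stderr, max_attempts)

if __name__ == "__main__":
    buggy = "total = sum([1, 2, 3])\nprint(totl)\n"  # raises NameError on the first run

    def toy_repair(src: str, error_log: str) -> str:
        # Trivial stand-in for an LLM repair step.
        return src.replace("totl", "total") if "NameError" in error_log else src

    result = execute_with_self_correction(buggy, toy_repair)
    print(result.ok, result.attempts, result.stdout.strip())  # prints: True 2 6
```

Capping the number of attempts keeps a persistently failing task from looping forever; the last error log is returned so it can still be surfaced to the user.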

[0015] (d) Verifiable Result Presentation and Artifact Provision Step: The processor presents the final execution result in the user interface and enables reliability verification by providing a bidirectional cross-referencing function based on an index that maps locations between answer sentences and the source document. Additionally, it converts numerical data or analysis results into native office objects in the format requested by the user and provides them as downloadable artifacts.
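A minimal sketch of such a bidirectional index is shown below, assuming that source locations are character offsets and that answer sentences are identified by their position in the answer (both are assumptions; the disclosure does not fix a data model). One direction answers "which sources back sentence k?", the other "which sentences cite this spot in the document?".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    start: int  # character offset where the cited passage begins
    end: int    # exclusive end offset

class CrossReferenceIndex:
    """Bidirectional map between answer sentences and source-document spans."""

    def __init__(self) -> None:
        self._answer_to_source: Dict[int, List[SourceSpan]] = defaultdict(list)
        self._source_to_answer: Dict[str, List[Tuple[SourceSpan, int]]] = defaultdict(list)

    def link(self, sentence_idx: int, span: SourceSpan) -> None:
        self._answer_to_source[sentence_idx].append(span)
        self._source_to_answer[span.doc_id].append((span, sentence_idx))

    def sources_for(self, sentence_idx: int) -> List[SourceSpan]:
        """Answer -> source: spans cited by one answer sentence."""
        return list(self._answer_to_source.get(sentence_idx, []))

    def sentences_for(self, doc_id: str, offset: int) -> List[int]:
        """Source -> answer: sentences whose cited span covers the given offset."""
        return sorted({idx for span, idx in self._source_to_answer.get(doc_id, [])
                       if span.start <= offset < span.end})
```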

[0016] The multi-stage inference agent system and method of operation according to the present disclosure can provide a hyper-personalized work assistant experience. Since the agent remembers the user's job and preferences and operates by reflecting them, high-quality responses that align with the work context can be obtained without the user having to give specific instructions every time. This reduces the burden of creating prompts for the user and increases work engagement.

[0017] Furthermore, the multi-stage inference agent system and the method of operation according to the present disclosure can guarantee the completeness and stability of business automation. Through a self-correcting mechanism in which the system independently recognizes and corrects exceptions occurring during code execution, even non-developer personnel can perform complex data analysis or visualization tasks without errors. This dramatically improves the practical availability of the agent service.

[0018] In addition, the multi-stage inference agent system and the method of operation according to the present disclosure can ensure information transparency and minimize secondary processing tasks. The reliability of AI answers can be ensured through a cross-reference interface that allows immediate verification of the location in the original document that serves as the basis for an answer. Furthermore, by directly generating editable file artifacts rather than plain text, it drastically reduces the time the user needs to reprocess results into reports, thereby maximizing work productivity.

[0019] Furthermore, the multi-stage inference agent system and the method of operation according to the present disclosure enable efficient processing of long contexts through a hybrid architecture. By applying a hybrid model architecture that combines sliding window-based local attention and global attention, memory usage can be reduced and computation speed maintained even when processing documents spanning tens of pages or long conversation histories.
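The window size and the ratio of local to global layers are not specified here. The sketch below, with a hypothetical window of 3 tokens and one global layer every third layer, builds causal attention masks in plain Python and counts attended token pairs, illustrating why sliding-window layers scale with N·W rather than N².

```python
from typing import List

Mask = List[List[bool]]

def global_causal_mask(n: int) -> Mask:
    """Global attention: token i attends to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def local_causal_mask(n: int, window: int) -> Mask:
    """Sliding-window local attention: token i attends only to i - window < j <= i."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def layer_schedule(num_layers: int, global_every: int) -> List[str]:
    """Hypothetical interleaving: every `global_every`-th layer is global, the rest local."""
    return ["global" if (k + 1) % global_every == 0 else "local" for k in range(num_layers)]

def attended_pairs(mask: Mask) -> int:
    return sum(sum(row) for row in mask)

def total_pairs(n: int, window: int, schedule: List[str]) -> int:
    per_layer = {
        "global": attended_pairs(global_causal_mask(n)),
        "local": attended_pairs(local_causal_mask(n, window)),
    }
    return sum(per_layer[kind] for kind in schedule)

if __name__ == "__main__":
    n, window = 8, 3
    schedule = layer_schedule(6, global_every=3)
    print(schedule)  # ['local', 'local', 'global', 'local', 'local', 'global']
    print(total_pairs(n, window, schedule), total_pairs(n, window, ["global"] * 6))  # 156 216
```

A further consequence, under the same assumptions, is that local layers only need to keep the most recent W keys and values in the KV cache, which is where most of the memory saving for long contexts would come from.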

[0020] FIG. 1 illustrates an example of a block diagram of a computing system implementing a multi-stage inference service according to one embodiment of the present disclosure.

[0021] FIG. 2 illustrates an example of a block diagram of a computing device implementing a multi-stage inference service according to one embodiment of the present disclosure.

[0022] FIG. 3 illustrates an example of a block diagram in another aspect of a computing device implementing a multi-stage inference service according to one embodiment of the present disclosure.

[0023] FIG. 4 illustrates the concept of a hybrid attention mechanism according to one embodiment of the present disclosure.

[0024] FIG. 5 is a conceptual diagram illustrating a sandwich-structured hybrid Mixture-of-Experts (MoE)/dense architecture according to one embodiment of the present disclosure, and the operating mechanism of the shared experts and routed experts inside an MoE block.

[0025] FIG. 6 is a block diagram of a reordered layer normalization (QK-Reorder-LN) structure according to one embodiment of the present disclosure, and FIG. 7 is a block diagram illustrating a Pre-LN structure.

[0026] FIG. 8 is a flowchart illustrating the learning process of a generative artificial intelligence model according to one embodiment of the present disclosure.

[0027] FIG. 9 is a flowchart illustrating an initial operation process in which at least one processor of a computer device constituting a system analyzes a user query and determines an execution mode, according to an embodiment of the present disclosure.

[0028] FIG. 10 is a flowchart illustrating the process of a processor of a system that has entered a multistep inference mode, according to an embodiment of the present disclosure, breaking down a user query into a plurality of sub-tasks and establishing an execution plan.

[0029] FIG. 11 is a flowchart illustrating the process of a processor routing and executing each subtask to an appropriate expert module based on an established execution plan, according to an embodiment of the present disclosure.

[0030] FIG. 12 is a flowchart illustrating a process of integrating the execution results of each subtask to generate and provide a final answer to a user according to an embodiment of the present disclosure.

[0031] FIG. 13 is a flowchart illustrating the process of a processor assigned a data analysis task during a multistep inference process establishing a specific analysis plan before generating actual code, according to an embodiment of the present disclosure.

[0032] FIG. 14 is a flowchart illustrating the process of a processor generating an executable script and performing a first execution in an isolated environment after an analysis plan is determined, according to an embodiment of the present disclosure.

[0033] FIG. 15 is a flowchart illustrating the process of a processor performing a self-correcting loop when an error occurs in the result of a first code execution, according to an embodiment of the present disclosure.

[0034] FIG. 16 is a flowchart illustrating a process of deriving a final answer by analyzing the execution result after the code has been executed normally through a self-correction process, according to an embodiment of the present disclosure.

[0035] FIG. 17 is a drawing illustrating a user-driven source control and search mode setting interface according to one embodiment of the present disclosure.

[0036] FIG. 18 is an example screen illustrating how the reliability of information is verified by providing a bidirectional cross-referencing function between a generated answer and a source document according to one embodiment of the present disclosure.

[0037] FIG. 19 is an example screen diagram showing a download interface for multimodal analysis results and structured outputs according to one embodiment of the present disclosure.

[0038] The present disclosure is capable of various modifications and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail in the detailed description.

[0039] The effects and features of the present disclosure, and the methods for achieving them, will become clear by referring to the embodiments described below in detail together with the drawings. However, the present disclosure is not limited to the embodiments disclosed below and can be implemented in various forms. In the following embodiments, terms such as "first," "second," etc., are used not in a limiting sense but for the purpose of distinguishing one component from another. Furthermore, singular expressions include plural expressions unless the context clearly indicates otherwise. Additionally, terms such as "include" or "have" mean that the features or components described in the specification exist, and do not preclude the possibility that one or more other features or components may be added. Furthermore, in the drawings, the size of components may be exaggerated or reduced for convenience of explanation. For example, the size and thickness of each component shown in the drawings are depicted arbitrarily for convenience of explanation, so the present disclosure is not necessarily limited to what is depicted.

[0040] Hereinafter, embodiments of the present disclosure will be described in detail with reference to the attached drawings. When describing with reference to the drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions thereof will be omitted.

[0041]

[0042] [Exemplary system providing multi-stage inference services]

[0043] Hereinafter, an exemplary system implementing a multi-stage inference service is described in detail with reference to the attached drawings. The following description of the above system may also be applied to an exemplary system implementing an artificial intelligence model service platform provision service.

[0044] FIG. 1 illustrates an example of a block diagram of a computing system implementing a multi-stage inference service according to one embodiment of the present disclosure.

[0045] Referring to FIG. 1, a computing system (1000) implementing the multi-stage inference service of the present disclosure includes a user computing device (110), a server computing system (130), and a training computing system (150), and each device and system is connected to communicate through a network (170).

[0046] A multi-stage inference system according to one embodiment of the present disclosure may be implemented and provided locally by the user computing device (110), implemented and provided in the form of a web service by the server computing system (130) communicating with the user computing device (110), or implemented and provided by the user computing device (110) and the server computing system (130) in conjunction with each other.

[0047] In this embodiment, the user computing device (110) and / or the server computing system (130) can train a machine learning model (120 and / or 140) through interaction with a training computing system (150) that is communicatively connected via a network (170). The training computing system (150) may be separate from the server computing system (130) or may be part of the server computing system (130).

[0048] The artificial intelligence model with which the agent service performs multi-stage inference and self-correction may be 1) trained directly and locally on the user computing device (110), 2) trained through interaction between the server computing system (130) and the user computing device (110) over the network (170), or 3) trained by a separate training computing system (150) using various training and learning techniques. The model may also be provided or updated by transmitting the artificial intelligence model trained by the training computing system (150) to the user computing device (110) and / or the server computing system (130) through the network (170).

[0049] In some embodiments, the training computing system (150) may be part of the server computing system (130) or part of the user computing device (110).

[0050] - User Computing Device (110: User Computing Device)

[0051] The user computing device (110) may be any type of computing device, such as a smartphone, a mobile phone, a digital broadcasting device, a PDA (personal digital assistant), a PMP (portable multimedia player), a desktop, a wearable device, an embedded computing device and / or a tablet PC.

[0052] Additionally, in the embodiment, the user computing device (110) may further include a predetermined server computing device that provides a multi-stage inference service environment.

[0053] This user computing device (110) includes at least one processor (111) and memory (112).

[0054] Here, the processor (111) of the user computing device (110) may be composed of at least one of a central processing unit (CPU), a graphics processing unit (GPU), ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), controllers, microcontrollers, microprocessors, and / or other electrical units for performing functions, or a plurality of electrically connected processors.

[0055] In an embodiment, the processor may have a heterogeneous computing structure that includes a central processing unit (CPU), which manages control flow throughout the system, and at least one accelerator optimized for large-scale matrix operations and parallel processing, such as a graphics processing unit (GPU), a neural processing unit (NPU), or a tensor processing unit (TPU).

[0056] In particular, according to the embodiment, this processor (111) may be configured based on a Field Programmable Gate Array (FPGA) implementation and / or an Application Specific Integrated Circuit (ASIC), which is a hardware technology for implementing a certain digital circuit.

[0057] Here, a field programmable gate array (FPGA) can refer to a flexible digital circuit that is programmable according to user needs.

[0058] As an example, a field programmable gate array implementation may include: a register that temporarily stores data and controls the flow and timing of signals, holding intermediate results or state information of operations to support synchronized operation of the FPGA; programmable logic that can be configured, according to user needs, into logic circuits that perform specific functions or operations within the FPGA; and an input interface that serves as a channel for receiving data from outside the FPGA, receiving signals from external devices or sensors and passing them to internal circuits.

[0059] Through the combination of the above components, a field-programmable gate array implementation can provide flexible and various types of digital circuits.

[0060] Meanwhile, an Application-Specific Integrated Circuit (ASIC) can refer to a custom integrated circuit that is fixedly designed to perform a specific use or function.

[0061] As an example, the application-specific integrated circuit may include: a register, which is a small memory element that temporarily stores and manages data and supports rapid ASIC operation by holding intermediate calculation results or state information; a microprocessor, which is a central processing unit that performs control and computation within the ASIC and coordinates the operation of the entire system by performing various operations or generating control signals as needed; and an input block, which is an interface that receives the data to be processed by the ASIC from outside, including various input data from connected sensors or external devices, and passes it internally.

[0062] Through the combination of the components mentioned above, an application-specific integrated circuit can perform specific purpose tasks in an optimized manner.

[0063] In addition, the processor (111) according to the embodiment may have a cluster structure connected by high-bandwidth memory (HBM) and an ultra-high-speed interconnect (NVLink, etc.) to efficiently process the sliding window operation of local attention and the sparse operation of global attention described later.

[0064] In addition, the processor (111) according to the embodiment can perform the role of creating a virtual environment (Sandbox) isolated from the main system and allocating resources when executing code generated by the agent, and capturing error logs that occur during execution and feeding them back to the model.

[0065] Returning to the description, the memory (112) of the user computing device (110) may include one or more non-transitory / transitory computer-readable storage media such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof, and may include web storage of a server that performs memory storage functions on the internet. This memory (112) may store data (113) and instructions (114) necessary for the at least one processor (111) to perform functional operations, such as training an artificial intelligence model or executing a multi-stage inference service through an artificial intelligence model.

[0066] Specifically, the memory (112) according to the embodiment can store and manage at least one of the following data: model parameters and a KV cache, episodic memory, and intermediate artifacts.

[0067] Specifically, as the model parameters and KV cache, the memory (112) can store the hundreds of billions of parameters of the hybrid model and the key-value (KV) cache generated during inference. In particular, it may have a hierarchical structure to which paging or memory pooling techniques are applied for processing long contexts of 128K tokens or more.
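The paging technique is not detailed further. The following is a toy block allocator in the spirit of PagedAttention-style KV caching (the approach popularized by vLLM), offered as an assumption-laden sketch: block size, pool size, and class names are hypothetical, and only the bookkeeping is shown, not the key/value tensors themselves.

```python
from typing import Dict, List, Tuple

class PagedKVCache:
    """Maps each sequence's logical token positions onto fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))  # shared pool
        self.block_tables: Dict[str, List[int]] = {}           # seq_id -> physical block ids
        self.lengths: Dict[str, int] = {}                      # seq_id -> cached token count

    def append_tokens(self, seq_id: str, n_tokens: int) -> None:
        table = self.block_tables.get(seq_id, [])
        new_length = self.lengths.get(seq_id, 0) + n_tokens
        blocks_needed = -(-new_length // self.block_size) - len(table)  # ceiling division
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("KV cache block pool exhausted")
        for _ in range(blocks_needed):
            table.append(self.free_blocks.pop())  # allocate blocks only on demand
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = new_length

    def physical_slot(self, seq_id: str, pos: int) -> Tuple[int, int]:
        """Return (physical block id, offset within block) for a cached token position."""
        if not 0 <= pos < self.lengths.get(seq_id, 0):
            raise IndexError(f"position {pos} not cached for {seq_id!r}")
        return self.block_tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks need not be contiguous, a long context grows one block at a time instead of reserving memory for its maximum length up front.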

[0068] In addition, the memory (112) can serve as an episodic memory that implements long-term memory by storing the user's past conversation history, preferences, and work context information in the form of vector embeddings.
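Recall from such an episodic memory is commonly a nearest-neighbour search over the stored embeddings. The sketch below assumes cosine similarity and a brute-force scan; in practice the embeddings would come from an embedding model and a vector index would replace the scan. The three-dimensional vectors in the usage below are toy values.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class MemoryItem:
    text: str
    embedding: Tuple[float, ...]

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0.0 or norm_b == 0.0 else dot / (norm_a * norm_b)

class EpisodicMemory:
    """Stores past interactions as (text, embedding) pairs and recalls the closest ones."""

    def __init__(self) -> None:
        self.items: List[MemoryItem] = []

    def remember(self, text: str, embedding: Sequence[float]) -> None:
        self.items.append(MemoryItem(text, tuple(embedding)))

    def recall(self, query: Sequence[float], k: int = 3) -> List[str]:
        ranked = sorted(self.items, key=lambda item: cosine_similarity(query, item.embedding), reverse=True)
        return [item.text for item in ranked[:k]]

memory = EpisodicMemory()
memory.remember("prefers answers as tables", (1.0, 0.0, 0.0))
memory.remember("works in the finance team", (0.0, 1.0, 0.0))
memory.remember("asked about Q3 sales last week", (0.7, 0.7, 0.0))
top_two = memory.recall((1.0, 0.1, 0.0), k=2)
```

Here `top_two` is `["prefers answers as tables", "asked about Q3 sales last week"]`, since those two vectors point closest to the query direction.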

[0069] In addition, the memory (112) can store temporary files such as code snippets, data charts, and analysis reports generated by the agent during execution as intermediate outputs.

[0070] In one embodiment, the user computing device (110) can perform various deep learning operations for the multi-stage inference service in conjunction with a deep learning neural network.

[0071]

[0072] Here, the deep learning neural network according to the embodiment may include a Convolutional Neural Network (CNN), R-CNN (Regions with CNN features), Fast R-CNN, Faster R-CNN, Mask R-CNN, etc., and may include any deep learning neural network that includes an algorithm capable of performing the embodiments described below, and the embodiments of the present disclosure do not limit or restrict such deep learning neural networks themselves.

[0073] At this time, according to the embodiment, the deep learning neural network may be installed directly on the server computing system (130) or operate as a separate device from the server computing system (130) to perform deep learning for the multi-stage inference service.

[0074] Additionally, in one embodiment, the user computing device (110) can store at least one machine learning model (120).

[0075] For example, the machine learning model (120) stored in the user computing device (110) may be any of various machine learning models, such as multiple neural networks (e.g., deep neural networks) that perform the multi-stage inference service based on structured / quantitative data, or other types of machine learning models including non-linear and / or linear models, or a combination thereof.

[0076] For example, machine learning models may include linear regression, decision trees, random forests, gradient boosting, pre-trained language models and / or deep learning models. Neural networks may include at least one of feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks and / or other forms of neural networks.

[0077] Additionally, according to an embodiment, the user computing device (110) may store a model to be used in each process and a prompt template that serves as the basis for input to the model in order to perform at least part of the process for a multi-stage inference service through a large-scale language model (LLM).

[0078] In one embodiment, a user computing device (110) may receive at least one machine learning model (120) from a server computing system (130) through a network (170), store it in memory (112), and then execute the stored machine learning model (120) by a processor (111) to perform a multi-stage inference service provision method.

[0079] In another embodiment, the user computing device (110) may provide the multi-stage inference service to the user by performing operations through at least one machine learning model (140) in conjunction with the server computing system (130) and exchanging related data with it.

[0080] For example, a user computing device (110) can perform a multi-stage inference service in such a way that a server computing system (130) provides an output for the user's input using a machine learning model (140) via the web.

[0081] Additionally, the artificial intelligence model can be implemented in such a way that at least some of the machine learning models (120 and / or 140) are executed on a user computing device (110) and the rest are executed on a server computing system (130).

[0082] Additionally, the user computing device (110) may include at least one input component (121) that detects user input.

[0083] For example, the user input component (121) may include a touch sensor (e.g., a touch screen and / or a touch pad, etc.) that detects a touch of the user's input medium (e.g., a finger or a stylus), an image sensor that detects the user's motion input, a microphone that detects the user's voice input, a button, a mouse and / or a keyboard, etc.

[0084] Here, the image sensor may include an image processing module. Specifically, the image sensor may process still images or video obtained by an image sensor device (e.g., CMOS or CCD).

[0085] In addition, the image sensor can process a still image or video acquired through the image sensor device using an image recognition process (e.g., OCR, etc.) and / or an image processing module to extract necessary information and transmit the extracted information to a processor.

[0086] Additionally, the input component (121) can receive input from an external controller (e.g., mouse, keyboard, etc.) based on an interface module, and in this case, may include an external output device (e.g., speaker).

[0087] At this time, the interface module may be configured to include at least one of a wired / wireless headset port, an external charger port, a wired / wireless data port, a memory card port, a port for connecting a device equipped with an identification module, an audio I / O (Input / Output) port, a video I / O (Input / Output) port, an earphone port, a power amplifier, an RF circuit, a transceiver, and other communication circuits.

[0088] In addition, the external output device may include a display system that outputs various information related to a multi-stage inference service as a graphic image.

[0089] Such a display system may be implemented by including at least one of a liquid crystal display (LCD), a thin film transistor-liquid crystal display (TFT LCD), an organic light-emitting diode (OLED), a flexible display, a 3D display, and an e-ink display.

[0090] Meanwhile, the user computing device (110) including the above-described components may further perform at least some of the functional operations performed by the server computing system (130) described later.

[0091] - Server Computing System (130: Server Computing System)

[0092] The server computing system (130) can perform a series of processes to provide multi-stage inference services.

[0093] In detail, in an embodiment, the server computing system (130) can provide the multi-stage inference service by exchanging data necessary to enable the multi-stage inference service process to be executed on an external device, such as a user computing device (110), with said external device.

[0094] More specifically, in an embodiment, the server computing system (130) can provide an environment in which an application can run on a user computing device (110).

[0095] To this end, the server computing system (130) may include an application program, data and / or instructions, etc. for the application to operate, and may transmit and receive various data based thereon with the external device.

[0096] Additionally, the server computing system (130) includes at least one processor (131) and memory (132).

[0097] Here, the processor (131) of the server computing system (130) may be composed of at least one of a central processing unit (CPU), a graphics processing unit (GPU), ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), controllers, microcontrollers, microprocessors, and / or other electrical units for performing functions, or a plurality of electrically connected processors.

[0098] In particular, depending on the embodiment, such a processor (131) may be configured based on a Field Programmable Gate Array (FPGA) implementation and / or an Application Specific Integrated Circuit (ASIC), which are hardware technologies for implementing a specific digital circuit. A detailed description thereof is omitted, as the description of the FPGA and ASIC given above applies equally.

[0099] The processor (131) according to the embodiment may have a cluster structure connected by high-bandwidth memory (HBM) and an ultra-high-speed interconnect (NVLink, etc.) to efficiently process the sliding window operation of local attention and the sparse operation of global attention described later.

[0100] Additionally, when executing code generated by the agent, the processor (131) according to the embodiment can create a virtual environment (sandbox) isolated from the main system and allocate resources to it, and can capture error logs that occur during execution and feed them back to the model.
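As an illustration of the sandbox execution and error-log feedback described above, the following Python sketch runs agent-generated code in a separate interpreter process with a timeout, captures stderr on failure, and hands that log to a revision hook before retrying. The process boundary is only a stand-in for the isolated virtual environment of the disclosure, and `run_in_sandbox`, `self_correcting_execute`, and the `revise` callback (which would be a model call in practice) are hypothetical names.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable


def run_in_sandbox(code: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Execute `code` in a separate Python process; return (succeeded, output_or_error_log).

    A child process with a timeout only approximates an isolated sandbox; container-,
    seccomp- or VM-level isolation and resource quotas are outside this sketch.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode (no user site-packages, ignores PYTHON* env vars)
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"TimeoutExpired: execution exceeded {timeout_s} s"
    finally:
        os.remove(path)
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr  # the error log (traceback) that is fed back to the model


def self_correcting_execute(
    code: str,
    revise: Callable[[str, str], str],
    max_attempts: int = 3,
) -> tuple[bool, str, str, int]:
    """Run code; on failure pass (code, error_log) to `revise` and retry.

    Returns (succeeded, final_code, output_or_last_error, attempts_used).
    """
    log = ""
    for attempt in range(1, max_attempts + 1):
        ok, log = run_in_sandbox(code)
        if ok:
            return True, code, log, attempt
        if attempt < max_attempts:
            code = revise(code, log)  # hypothetical hook: a model call would rewrite the code here
    return False, code, log, max_attempts


def toy_revise(code: str, log: str) -> str:
    # stand-in for the model: fix a misspelled builtin reported in the error log
    return code.replace("prnt", "print") if "NameError" in log else code


ok, final_code, output, attempts = self_correcting_execute("prnt('report ready')", toy_revise)
# first attempt fails with NameError, the revised code succeeds on attempt 2
```

The retry budget (`max_attempts`) bounds how long the loop can run when the revision hook cannot fix the error.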

[0101] And the memory (132) may include one or more non-transitory / transitory computer-readable storage media such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. This memory (132) may store data (133) and instructions (134) necessary for the processor (131) to perform functional operations, such as training an artificial intelligence model or executing a multi-stage inference service provision method through the artificial intelligence model.

[0102] Specifically, the memory (132) can store and manage the following data structures: according to the embodiment, it may include at least one of model parameters and a KV cache, an episodic memory, and intermediate artifacts.

[0103] Specifically, as the model parameters and KV cache, the memory (132) can store the hundreds of billions of parameters of the hybrid model and the key-value (KV) cache generated during inference. In particular, this storage may have a hierarchical structure to which paging or memory pooling techniques are applied so that long contexts of 128K tokens or more can be processed.
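The paging technique mentioned above can be pictured with a block-table allocator of the kind used by paged KV-cache managers such as vLLM's PagedAttention. The sketch below is an illustration under assumptions rather than the disclosed memory layout: `PagedKVCache` is a hypothetical class that only tracks which fixed-size physical block holds each token's keys and values, so a long sequence never needs one contiguous allocation and a finished sequence returns its blocks to a shared pool.

```python
class PagedKVCache:
    """Toy block allocator: a shared pool of fixed-size KV blocks plus a per-sequence block table.

    Only the bookkeeping is modeled; the key/value tensors stored in each block are omitted.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # pool of unused physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids, in order
        self.lengths: dict[int, int] = {}             # seq_id -> number of cached tokens

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full: take a new one
            if not self.free_blocks:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id: int) -> None:
        """Return every block of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):  # 20 tokens need two 16-token blocks
    last_slot = cache.append_token(0)
```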

[0104] In addition, the memory (132) may include an episodic memory that implements long-term memory by storing the user's past conversation history, preferences, and work context information in the form of vector embeddings.
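A minimal sketch of such a vector-embedding episodic memory, assuming retrieval by cosine similarity: the `EpisodicMemory` class and its methods are hypothetical, and the embeddings are supplied by the caller because the embedding model is not specified here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Episode:
    text: str               # e.g. a past conversation turn, a preference, or a work-context note
    embedding: list[float]  # vector from some embedding model (not specified in this description)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0  # zero vectors are treated as unrelated


@dataclass
class EpisodicMemory:
    episodes: list[Episode] = field(default_factory=list)

    def store(self, text: str, embedding: list[float]) -> None:
        self.episodes.append(Episode(text, list(embedding)))

    def recall(self, query_embedding: list[float], k: int = 3) -> list[str]:
        """Return the texts of the k stored episodes most similar to the query."""
        ranked = sorted(self.episodes, key=lambda e: cosine(query_embedding, e.embedding), reverse=True)
        return [e.text for e in ranked[:k]]


mem = EpisodicMemory()
mem.store("prefers results as editable Excel charts", [1.0, 0.0, 0.0])
mem.store("works on the finance team", [0.0, 1.0, 0.0])
mem.store("asked about Q3 revenue last week", [0.9, 0.1, 0.0])
```

A production system would typically replace the linear scan in `recall` with an approximate nearest-neighbour index once the number of stored episodes grows.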

[0105] In addition, the memory (132) can store, as intermediate artifacts, temporary files generated by the agent during execution, such as code snippets, data charts, and analysis reports.

[0106] In one embodiment, the server computing system (130) may be implemented to include at least one computing device. For example, the server computing system (130) may be implemented to operate a plurality of computing devices according to a sequential computing architecture, a parallel computing architecture, or a combination thereof. Additionally, the server computing system (130) may include a plurality of computing devices connected to a network (170).

[0107] Additionally, the server computing system (130) may store at least one machine learning model (140). For example, the server computing system (130) may include a neural network and / or other multi-layer non-linear model as the machine learning model (140). Exemplary neural networks may include a feed-forward neural network, a deep neural network, a recurrent neural network, and a convolutional neural network.

[0108] In an embodiment, the server computing system (130) may further include a data store computing system (hereinafter, data store) which is a storage for continuously storing and managing raw data that forms the basis of a multi-stage inference service.

[0109] Such data stores may include various forms of data storage, ranging from file systems to cloud storage. For example, a data store may include at least one of: a relational database that uses the structured query language (SQL) to define and manipulate data; a NoSQL database designed for flexibility and scalability in processing unstructured and semi-structured data; a data warehouse, a system used for reporting and data analysis that centralizes large volumes of data from multiple sources and is optimized for querying and analysis; a data lake that stores large volumes of raw data, including structured, semi-structured, and unstructured data, in its native format; and a local storage device or Network Attached Storage (NAS) that stores data as files in a format generally accessible by a computer operating system.

[0110]

[0111] - Training Computing System (150)

[0112] The training computing system (150) includes at least one processor (151) and memory (152).

[0113] Here, the processor (151) of the training computing system (150) may be composed of at least one of a central processing unit (CPU), a graphics processing unit (GPU), ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), controllers, microcontrollers, microprocessors, and / or other electrical units for performing functions, or a plurality of electrically connected processors.

[0114] In particular, depending on the embodiment, this processor (151) may be configured based on a Field Programmable Gate Array (FPGA) implementation and / or an Application Specific Integrated Circuit (ASIC), which are hardware technologies for implementing a specific digital circuit. A detailed description thereof is omitted, as the description of the FPGA and ASIC given above applies equally.

[0115] And the memory (152) may include one or more non-transitory / transitory computer-readable storage media such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. This memory (152) may store data (153) and instructions (154) necessary for the processor (151) to train an artificial intelligence model, etc.

[0116] For example, the training computing system (150) may include a model trainer (160) that trains a machine learning model (120 and / or 140) stored in a user computing device (110) and / or a server computing system (130) using various training or learning techniques, such as back propagation of error (according to the framework illustrated in FIG. 7).

[0117] For example, such a model trainer (160) can perform updates to one or more parameters of a machine learning model (120 and / or 140) for a multi-stage inference service in a backpropagation manner based on a defined loss function.

[0118] In some embodiments, performing backpropagation of the error may include performing truncated backpropagation through time. The model trainer (160) may apply a number of generalization techniques (e.g., weight decay, dropout and / or knowledge distillation, etc.) to improve the generalization ability of the machine learning model (120 and / or 140) being trained.

[0119] Additionally, the model trainer (160) can train a machine learning model (120 and / or 140) based on a series of training data (161). Here, the training data (161) may include data of different forms, such as images, audio samples and / or text, for example. Examples of image types that may be used may include video frames, LiDAR point clouds, X-ray images, computed tomography scans, hyperspectral images and / or various other forms of images.

[0120] These training data (161) may be provided by a user computing device (110) and / or a server computing system (130). When the training computing system (150) trains a machine learning model (120 and / or 140) on data specific to the user computing device (110), the machine learning model (120 and / or 140) may be characterized as a personalized model.

[0121] And the model trainer (160) includes computer logic that is utilized to provide the desired function.

[0122] Additionally, the model trainer (160) may be implemented as hardware, firmware, and / or software that controls a general-purpose processor. In one embodiment, the model trainer (160) may include a program file that is stored in a storage device, loaded into memory (152), and executed by one or more processors (151). In another embodiment, the model trainer (160) includes one or more sets of computer-executable data (153) and instructions (154) stored in a tangible computer-readable storage medium, such as RAM, a hard disk, or an optical or magnetic medium.

[0123] Network (170) includes, but is not limited to, a 3GPP (3rd Generation Partnership Project) network, LTE (Long Term Evolution) network, WiMAX (Worldwide Interoperability for Microwave Access) network, Internet, LAN (Local Area Network), Wireless LAN (Wireless Local Area Network), WAN (Wide Area Network), PAN (Personal Area Network), Bluetooth network, satellite broadcasting network, analog broadcasting network and / or DMB (Digital Multimedia Broadcasting) network.

[0124] Generally, communication through the network (170) can be performed using any type of wired and / or wireless connection through various communication protocols (e.g., TCP / IP, HTTP, SMTP and / or FTP, etc.), encodings or formats (e.g., HTML and / or XML, etc.), and / or protection schemes (e.g., VPN, Secure HTTP and / or SSL, etc.).

[0125] FIG. 2 illustrates an example of a block diagram of a computing device implementing a multi-stage inference service according to one embodiment of the present disclosure.

[0126] Referring to FIG. 2, the computing device (100) included in the user computing device (110), server computing system (130), and training computing system (150) includes a plurality of applications (e.g., Application 1 to Application N). Each application may include a machine learning library and one or more machine learning models. For example, the applications may include an image processing application (e.g., Detection, Classification and / or Segmentation, etc.), a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application and / or a chat-bot application, etc.

[0127] In an embodiment, the computing device (100) may include a model trainer (160) for training an artificial intelligence model, and by storing and operating the trained artificial intelligence model, it may provide output data according to a predetermined input data (e.g., a user's multimodal query input).

[0128] Each application of the computing device (100) can communicate with a number of other components of the computing device (100), such as, for example, at least one sensor, a context manager, a device state component, and / or additional components. In one embodiment, each application can communicate with each device component using an API (e.g., a public API). In one embodiment, the API used by each application may be specific to that application.

[0129] FIG. 3 illustrates an example of a block diagram in another aspect of a computing device implementing a multi-stage inference service according to one embodiment of the present disclosure.

[0130] Referring to FIG. 3, the computing device (200) includes a plurality of applications (e.g., Application 1 to Application N). Each application can communicate with a central intelligence layer. For example, the applications may include an image processing application, a text messaging application, an email application, a dictation application, a virtual keyboard application and / or a browser application. In one embodiment, each application can communicate with the central intelligence layer (and a model stored therein) using an API (e.g., a common API across all applications).

[0131] The central intelligence layer may include a number of machine learning models. For example, as illustrated in FIG. 3, at least some of the machine learning models may be provided for each application and managed by the central intelligence layer. In another embodiment, two or more applications may share a single machine learning model. For example, in some embodiments, the central intelligence layer may provide a single model for all applications. In some embodiments, the central intelligence layer may be included within the operating system of the computing device (200) or otherwise implemented.

[0132] The central intelligence layer can communicate with the central device data layer. The central device data layer may be a centralized data store for the computing device (200). As illustrated in FIG. 3, the central device data layer can communicate with a number of other components of the computing device (200), such as, for example, one or more sensors, a context manager, a device state component, and / or additional components. In some embodiments, the central device data layer can communicate with each device component using an API (e.g., a private API).

[0133] The technology described herein may refer to servers, databases, software applications, and other computer-based systems, as well as actions taken and information transmitted to or from said systems. It will be recognized that the inherent flexibility of computer-based systems allows for a wide range of possible configurations, combinations, division of tasks, and functionality between and from components. For example, the processes described herein may be implemented using a single device or component or multiple devices or components operating in combination. Databases and applications may be implemented in a single system or in a distributed system across multiple systems. Distributed components may operate sequentially or in parallel.

[0134]

[0135] Based on the physical configuration of the computing device and system described above, the following specifically describes a model architecture to which a hybrid mechanism, which is a core technical feature of the present disclosure, is applied, and the operation process of a multi-stage inference agent utilizing the same.

[0136] In the detailed description of the invention that follows, major processes such as computation according to the structural features of the artificial intelligence model, toolchain routing, and self-correcting loops are described as being performed primarily by at least one processor (131) of the server computing system (130) for clarity of explanation. However, this does not limit the scope of the rights of the present disclosure to server-side operations only, and those skilled in the art will understand that, depending on the embodiment, at least some steps of the process (e.g., inference of the lightweight model, preprocessing, or rendering of sandbox results, etc.) may be performed by at least one processor (111) of the user computing device (110) or through collaboration between the two processors in order to efficiently distribute the computational load (Load Balancing), maintain security, or minimize network latency.

[0137]

[0138] [Model Architecture with Hybrid Mechanism]

[0139] The basis for the agent system according to this embodiment to perform high-level reasoning and long context understanding lies in a model architecture that applies a hybrid mechanism to at least one model performing reasoning. This architecture is designed to preserve the deep thinking ability of the model while drastically reducing computational complexity.

[0140] That is, at least one processor of at least one computing device of the agent system according to the embodiment may operate at least one artificial intelligence model to perform at least one step of the inference process, and at least one of the at least one artificial intelligence model operated may be implemented as a model architecture to which a hybrid mechanism is applied.

[0141] Specifically, the at least one processor can perform local attention in three consecutive layers to quickly process local grammatical relationships or adjacent information, and then integrate information across the entire sequence in a subsequent global layer. In this case, a strategy of not applying Rotary Position Embedding (RoPE) to the global layer may be adopted to prevent positional bias.

[0142] A model incorporating such a hybrid mechanism can physically support the agent's memory and situational awareness capabilities by processing vast contexts without loss when the computing device's processor drives the agent to process dozens of pages of search results or long code execution logs.

[0143] Through this, the processor of a computing device according to one embodiment of the present disclosure can perform natural language processing and data generation tasks by driving a generative artificial intelligence model to which a hybrid mechanism is applied.

[0144] The artificial intelligence model executed by the above processor can be based on a decoder-only transformer structure that autoregressively predicts the next token from an input sequence. That is, the artificial intelligence model according to the embodiment can be an architecture optimized for text generation tasks, stacking only decoder blocks to increase the parameter efficiency of the model.

[0145] In another embodiment, those skilled in the art will readily understand that the decoder block according to the embodiment can also be applied to the decoder of a structure that includes both an encoder and a decoder.

[0146] Specifically, the processor may tokenize input text data and convert it into an embedding vector, and perform operations by passing it through N stacked decoder blocks. Each decoder block may include an attention sublayer that performs an attention mechanism and a Feed-Forward Network (FFN) sublayer that performs feed-forward operations. At this time, the processor may preserve causality by applying Causal Masking or Look-ahead Masking during the attention operation to restrict the token at the current time point from referencing token information at a future time point.
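The causal (look-ahead) masking described above can be sketched for a single attention head in plain Python. This is a generic scaled dot-product attention, not the disclosed kernel, and the function name `causal_attention` is illustrative. Computing scores only over positions 0..t is equivalent to setting the scores of future positions to negative infinity before the softmax.

```python
import math


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal (look-ahead) mask.

    q, k, v: lists of T vectors (lists of floats); q and k share dimension d.
    Position t attends only to positions 0..t, never to future tokens.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # scores over the visible prefix only; future positions are masked out entirely
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d) for s in range(t + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - m) for x in scores]
        z = sum(weights)
        out.append([sum(w * v[s][j] for s, w in enumerate(weights)) / z for j in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
```

Changing the value vector of the last token leaves the outputs at earlier positions untouched, which is exactly the causality property the mask is meant to preserve.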

[0147] A processor of a computing device according to one embodiment of the present disclosure can effectively perform data processing using a generative artificial intelligence model in response to a user request.

[0148] Specifically, the processor may be configured to operate based on a Transformer architecture with enhanced computational efficiency and learning stability to effectively support a non-inference mode for fast response and an inference mode for deep thinking within a single model. That is, the artificial intelligence model according to the embodiment may be a hybrid model having billions or more parameters.

[0149] The architecture of the hybrid model driven by the above processor may include a hybrid attention mechanism to resolve computational bottlenecks occurring during long context processing, and a query-key reordering layer normalization (QK-Reorder-LN) structure to resolve learning instability of the deep neural network.

[0150] FIG. 4 illustrates the concept of a hybrid attention mechanism according to one embodiment of the present disclosure.

[0151] Referring to FIG. 4, the processor can perform operations through a hybrid structure that mixes a first layer group performing global attention and a second layer group performing local attention.

[0152] As the length of the context that the processor according to the embodiment must process through the hybrid model increases, for example, when it exceeds 128K tokens, the existing method of performing global attention in all layers may impair efficiency by causing O(N^2) computational complexity and memory usage. To solve this, the processor may perform operations by setting the ratio of local attention layers to global attention layers among all transformer layers to a preset ratio, for example, 3 to 1. That is, the processor may be configured to repeat a pattern (LLLG) of performing three local attention layers consecutively followed by one global attention layer.
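The 3-to-1 LLLG pattern, together with the RoPE choice described earlier, can be expressed as a per-layer configuration. In this sketch the names `AttnLayerConfig` and `layer_schedule` are hypothetical; applying RoPE in the local layers is an assumption (the text states only that global layers omit it), and the 4,096-token default window is the preferred value given in this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttnLayerConfig:
    kind: str              # "local" (sliding-window) or "global"
    window: Optional[int]  # window size in tokens; None = attend to the whole causal prefix
    use_rope: bool         # global layers skip RoPE; local layers are assumed to use it


def layer_schedule(num_layers: int, local_per_global: int = 3, window: int = 4096) -> list[AttnLayerConfig]:
    """Repeat `local_per_global` local layers followed by one global layer (LLLG for 3)."""
    period = local_per_global + 1
    return [
        AttnLayerConfig("global", None, False) if (i + 1) % period == 0
        else AttnLayerConfig("local", window, True)
        for i in range(num_layers)
    ]


print("".join("G" if c.kind == "global" else "L" for c in layer_schedule(8)))  # prints LLLGLLLG
```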

[0153] When performing the local attention described above, the processor may apply the Sliding Window Attention method. This is a method that performs attention operations only on adjacent tokens within a pre-set window size relative to the current token.

[0154] Unlike existing chunk attention methods, which are unstable in that contextual information can be lost at chunk boundaries, this approach has the theoretical advantage of producing stable attention scores across the entire sequence because the window moves continuously. It is also easy to implement, as it is well supported by open-source frameworks.

[0155] In one embodiment of the invention, the processor can perform operations by setting the window size within a range of 1,024 tokens or more and 16,384 tokens or less. If the window size is less than the lower limit (1,024 tokens), there is a concern that the ability to grasp local context may be reduced due to an excessive lack of reference information regarding adjacent contexts. Conversely, if the window size exceeds the upper limit (16,384 tokens), the computational reduction effect intended for the introduction of local attention is reduced, which may lead to a decrease in the inference speed and memory efficiency of the entire model.

[0156] Preferably, the processor may set the window size to between 2,048 tokens and 8,192 tokens to achieve an optimal balance of computational efficiency and short context processing performance. Most preferably, the processor according to an embodiment of the present disclosure may set the window size to approximately 4,096 (4K) tokens, thereby reliably preventing degradation of short context processing performance while reducing the amount of computation in long sequences to a linear O(N) level.
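To make the complexity argument concrete, the sketch below defines the visibility rule of causal sliding-window attention and counts the attention scores one layer computes. The helper names are illustrative, and the score count is only a proxy for compute and memory.

```python
from typing import Optional


def visible(t: int, s: int, window: Optional[int]) -> bool:
    """May query position t attend to key position s? (causal, optionally windowed)"""
    if s > t:
        return False  # causal mask: never look ahead
    return window is None or t - s < window  # keep only the `window` most recent positions


def score_entries(seq_len: int, window: Optional[int]) -> int:
    """Number of (query, key) attention scores one layer computes over a sequence."""
    if window is None:
        return seq_len * (seq_len + 1) // 2  # full causal attention: O(N^2)
    w = min(window, seq_len)
    # the first w positions see t + 1 keys; every later position sees exactly w keys: O(N * w)
    return w * (w + 1) // 2 + (seq_len - w) * w


n = 128 * 1024
full, local = score_entries(n, None), score_entries(n, 4096)
print(f"{full:,} vs {local:,} ({full / local:.1f}x fewer)")
# prints: 8,590,000,128 vs 528,484,352 (16.3x fewer)
```

For a 128K-token sequence with a 4,096-token window, a local layer therefore computes roughly 16 times fewer scores than a full causal layer. The saving applies to the local layers only; each global layer in the 3-to-1 pattern still computes the full causal score matrix.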

[0157]

[Sandwich Mixture-of-Experts Architecture]

[0159] To increase the parameter efficiency of a model to which the hybrid mechanism according to this embodiment is applied, at least one model can be implemented with a sandwich architecture in which dense blocks are placed at the input and output ends and mixture-of-experts (MoE) blocks are placed in the deep computation section between them.

[0160] This structure can perform particularly well when the processor according to the embodiment drives an agent that uses various tools. For example, if a user requests "Write Python code," the router can assign the corresponding tokens to a routing expert specializing in coding, and if the user requests "Review legal documents," it can assign them to a routing expert specializing in law. This can maximize the performance of a multi-purpose agent within a single model.

[0161] Specifically, a processor of a computing device according to one embodiment of the present disclosure can run an artificial intelligence model based on a sandwich-structured hybrid architecture to secure computational efficiency and training stability at the same time. In this architecture, dense layers are placed at the input and output ends of the model, and MoE layers are concentrated in the deep computation section between them.

[0162] In an embodiment, the model may be composed of a total of 56 transformer blocks. Specifically, it may have a structure in which a first dense block group for processing input data, an MoE block group responsible for complex inference and knowledge processing, and a second dense block group for organizing and outputting the final features are stacked in sequence. This arrangement can serve as an anchor that prevents the representation collapse an MoE model may experience during the initial training phase and ensures stable embedding transformation at the input / output stages.

[0163] More specifically, as illustrated in FIG. 5, the processor can perform operations based on an integrated decoder block in which a first sublayer performing an attention mechanism and a second sublayer performing a feedforward operation are organically combined.

[0164] The above processor can simultaneously secure the efficiency of context processing and the expertise of knowledge processing within a single model structure by selectively controlling the operation mode of each sub-layer according to the layer depth within the model or a preset architecture ratio.

[0165] The first sublayer according to the embodiment may include an attention module that analyzes the correlation between tokens within an input sequence. Depending on the position of the block currently being processed, the processor may operate the attention module in either a sliding window attention mode or a global attention mode.

[0166] For example, the processor can perform a 3-to-1 hybrid pattern in which it processes local context with low complexity by activating sliding window attention in three consecutive blocks and integrates the entire context by activating global attention in the following block. In this case, the input signal undergoes layer normalization before performing the attention operation, and the output of the attention module can be summed with the input signal through residual connection and passed to the second sub-layer.

[0167] Next, the second sub-layer according to the embodiment may include a feedforward module that processes knowledge by non-linearly transforming the features of each token. If the block belongs to the input / output end of the model, the processor may operate the feedforward module as a dense feedforward network in which all parameters are activated to ensure the stability of the representation. On the other hand, if the block belongs to the deep computation section, the processor may operate the feedforward module as a mixture-of-experts (MoE) network.
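The per-block switching of attention mode and feedforward mode can be sketched as a simple assignment over the 56-block stack. The split of two dense blocks at each end (`dense_in`, `dense_out`) is a hypothetical choice because the description does not give the sizes of the dense block groups, starting the 3-to-1 pattern at the first block is likewise assumed, and `build_blocks` is an illustrative name.

```python
def build_blocks(num_blocks: int = 56, dense_in: int = 2, dense_out: int = 2,
                 local_per_global: int = 3) -> list[tuple[str, str]]:
    """Return (attention_mode, ffn_mode) for each decoder block of a sandwich-style stack.

    dense_in / dense_out: hypothetical sizes of the first and second dense block groups.
    """
    blocks = []
    for i in range(num_blocks):
        # first sublayer: every (local_per_global + 1)-th block uses global attention
        attn = "global" if (i + 1) % (local_per_global + 1) == 0 else "sliding_window"
        # second sublayer: dense FFN at the input/output ends, MoE in the deep section
        dense = i < dense_in or i >= num_blocks - dense_out
        blocks.append((attn, "dense" if dense else "moe"))
    return blocks


blocks = build_blocks()
```

With these placeholder sizes the stack contains sliding-window/MoE, global/MoE, and global/dense blocks (the combinations named in this description) as well as sliding-window/dense blocks at the ends.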

[0168] In the MoE network mode, the processor can perform computations using multiple experts and a router instead of the existing single feedforward network.

[0169] The present disclosure may be characterized in particular by dividing the experts into shared experts and routing experts.

[0170] A shared expert is always active regardless of the characteristics of the input token, and can prevent the knowledge fragmentation caused by the sparsity of the MoE model by processing general knowledge or contextual information commonly required by all tokens. A routing expert, on the other hand, is activated only for the tokens selected by the router, and can process knowledge specialized for a specific domain or detailed task.

[0171] In the specific operation process, the processor can calculate the gating score $G(x)$ for each expert through the inner product between the input token $x$ and a learnable routing matrix $W_r$, which can be expressed as $G(x) = x W_r$. Through this, the processor can select and activate the top $K$ routing experts with the highest gating scores among the multiple experts. At the same time, the processor can input the token in parallel to the shared experts without a separate selection process.

[0172] Finally, the processor generates the final output of the second sublayer by weighted summing the output of the shared expert and the output of the selected routing experts, which can then be summed with the output of the first sublayer through residual connection and passed to the next block.
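The shared-plus-routed computation described above can be sketched for a single token. The function `moe_ffn` and its arguments are hypothetical. The gating score is the inner product of the token with each routed expert's router row, as stated above, while normalizing the selected top-K scores with a softmax and giving shared experts a fixed weight of 1.0 are assumptions borrowed from common MoE designs, not details confirmed by this description.

```python
import math
from typing import Callable, Sequence

Vector = list[float]
Expert = Callable[[Vector], Vector]


def moe_ffn(x: Vector, router: Sequence[Vector], shared: Sequence[Expert],
            routed: Sequence[Expert], top_k: int = 2) -> tuple[Vector, list[int]]:
    """Shared-plus-routed MoE feedforward for one token vector `x`.

    router[i] is the (hypothetical) router row of routed expert i.
    """
    # gating score of each routed expert: inner product of x with that expert's router row
    scores = [sum(a * b for a, b in zip(x, row)) for row in router]
    # select the top-K routed experts by gating score
    chosen = sorted(range(len(routed)), key=lambda i: scores[i], reverse=True)[:top_k]
    # assumption: a softmax over the chosen scores gives the mixing weights
    m = max(scores[i] for i in chosen)
    exp_scores = {i: math.exp(scores[i] - m) for i in chosen}
    z = sum(exp_scores.values())

    out = [0.0] * len(x)
    for expert in shared:  # shared experts process every token; weight 1.0 (assumption)
        out = [o + y for o, y in zip(out, expert(x))]
    for i in chosen:  # routed experts contribute in proportion to their gate weight
        w = exp_scores[i] / z
        out = [o + w * y for o, y in zip(out, routed[i](x))]
    return out, chosen


routed = [lambda v, i=i: [float(i)] * len(v) for i in range(4)]  # toy expert i outputs all-i
shared = [lambda v: [10.0] * len(v)]
router = [[0.0, 0.0], [5.0, 0.0], [1.0, 0.0], [-3.0, 0.0]]
out, chosen = moe_ffn([1.0, 0.0], router, shared, routed, top_k=2)
```

The returned vector would then be added to the first sublayer's output through the residual connection.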

[0173] In addition, the processor may add an auxiliary load balancing loss to the objective function during the learning process to prevent bottlenecks where tokens are concentrated only on specific experts, and may introduce a router Z-loss to suppress excessive growth of the router's logit value, thereby preventing divergence in the early stages of learning and accelerating the convergence speed.
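The description does not give formulas for the two auxiliary terms, so the sketch below uses widely used formulations as assumptions: the Switch Transformer load-balancing loss E * sum_i f_i * P_i, where f_i is the fraction of token-to-expert assignments that go to expert i and P_i is the mean router probability for expert i, and the ST-MoE router z-loss, the mean squared log-sum-exp of the router logits. In training they would be added to the task loss with small weighting coefficients, whose values are not specified here.

```python
import math


def softmax(row: list[float]) -> list[float]:
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [v / z for v in e]


def load_balancing_loss(logits: list[list[float]], top_k: int = 1) -> float:
    """E * sum_i f_i * P_i; equals 1.0 when both assignments and probabilities are uniform."""
    num_tokens, num_experts = len(logits), len(logits[0])
    counts = [0] * num_experts
    for row in logits:
        for i in sorted(range(num_experts), key=lambda j: row[j], reverse=True)[:top_k]:
            counts[i] += 1
    f = [c / (num_tokens * top_k) for c in counts]  # fraction of assignments per expert
    probs = [softmax(row) for row in logits]
    p = [sum(pr[i] for pr in probs) / num_tokens for i in range(num_experts)]  # mean router prob
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def router_z_loss(logits: list[list[float]]) -> float:
    """Mean over tokens of (log-sum-exp of the router logits)^2; penalizes large logits."""
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(logits)


balanced = [[5.0 if j == t else 0.0 for j in range(4)] for t in range(4)]  # each token prefers a different expert
skewed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]                       # every token prefers expert 0
```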

[0174] Consequently, the processor of the present disclosure can configure an optimized computation pipeline by placing blocks of various combinations, such as sliding-window-attention/MoE, global-attention/MoE, or global-attention/dense blocks, at appropriate positions through the switching mechanism of FIG. 5. This complementary sandwich architecture can reduce computation costs by keeping the active parameters during inference to tens of billions even at a scale of hundreds of billions of parameters, ensure training stability through shared experts and dense layers, and improve the ability to solve complex inference tasks through specialized experts in deep layers.

[Reordering Layer Normalization]

[0176] Meanwhile, to address the training instability that may occur as the depth increases in a model to which the hybrid mechanism and / or the sandwich MoE architecture according to the embodiment is applied, the model may be implemented with a reordering layer normalization (QK-Reorder-LN) structure that first applies layer normalization (RMSNorm) to the query and key vectors within the attention operation.

[0177] This is intended to structurally suppress phenomena such as text collapse or logical leaps that can occur when the processor drives the agent to generate a long thought process to solve a complex problem.

[0178] Specifically, in the periodically deployed global attention layers, the processor can integrate global contextual information by referencing all tokens of the entire sequence. In particular, the processor can be configured not to apply Rotary Position Embeddings (RoPE) during global attention layer operations. Excluding positional information from global attention, unlike standard attention, is intended to prevent the model from becoming excessively biased toward specific positions when processing sequences longer than its pre-training length. This allows the model to maintain a view of the global context regardless of the sequence length.

[0179] This is intended to prevent the model from being biased toward a specific context length and to allow the model to maintain a global view of the entire context. Through this hybrid structure, the processor can reduce KV cache memory usage during training and inference while still securing the ability to handle long contexts of up to 128K tokens.
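The RoPE choice can be sketched as follows, under the assumption that local layers use the standard rotary embedding with base 10,000 (a common default, not a value given here); `apply_rope` and `encode_positions` are illustrative names. The property that motivates the design is relative-position dependence: with RoPE, a query-key score depends only on the offset between the two positions, whereas a global layer that skips RoPE scores tokens with no positional term at all.

```python
import math


def apply_rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotary position embedding: rotate each pair (vec[2j], vec[2j+1]) by pos * base**(-2j/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("RoPE needs an even head dimension")
    out = list(vec)
    for i in range(0, d, 2):  # i == 2j
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def encode_positions(q, k, q_pos, k_pos, use_rope):
    """Local layers rotate q and k; global layers pass them through unchanged (no RoPE)."""
    if not use_rope:
        return q, k
    return apply_rope(q, q_pos), apply_rope(k, k_pos)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k = [0.3, -1.2, 0.7, 2.0], [1.1, 0.4, -0.5, 0.9]
near = dot(*encode_positions(q, k, 10, 7, use_rope=True))     # offset 3
far = dot(*encode_positions(q, k, 103, 100, use_rope=True))   # same offset 3, far along the sequence
```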

[0180] FIG. 6 is a block diagram of a reordering layer normalization (QK-Reorder-LN) structure according to one embodiment of the present disclosure, and FIG. 7 is a block diagram illustrating a Pre-LN structure.

[0181] Referring to FIGS. 6 and 7, as the depth of a Transformer model increases, the Pre-LN structure may experience a problem in which the variance of the output grows exponentially, making training unstable. To solve this, the processor of the present disclosure can perform operations based on a reordering layer normalization structure in which the position at which layer normalization is applied is redesigned as shown in FIG. 6.

[0182] The order of operations performed by the processor within the attention block can be configured as follows. First, the processor can generate query, key, and value vectors from the input tensor through an input projection step. Then, the processor can perform a first normalization step that applies layer normalization to the generated query and key vectors, respectively. The normalization method used at this time may be Root Mean Square Normalization (RMSNorm). By calculating the attention scores using the normalized queries and keys, the processor can prevent the inner products from becoming excessively large.

[0183] In the subsequent attention operation step, the processor can generate an attention output using normalized queries and keys, and unnormalized values. Subsequently, the processor can perform a second normalization step to apply layer normalization (RMSNorm) once again to the generated attention output. This serves to adjust the scale of values before they are added to residual connections, ensuring that the signal magnitude remains stable even in deep layers. While this structure may result in a slight increase in computational load compared to existing methods, it can improve training stability in deep neural networks and contribute to performance improvements in downstream tasks.
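A minimal NumPy sketch of the operation order in [0182] and [0183]: project, apply RMSNorm to the queries and keys, attend with unnormalized values, then apply RMSNorm to the attention output before the residual addition. It uses a single causal head with gains initialized to one; the single head, the causal mask, and the absence of any normalization on the block input are simplifying assumptions, not details taken from the disclosure.

```python
import numpy as np


def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: divide by the root-mean-square over the last axis, then apply a gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def qk_reorder_ln_block(x, w_q, w_k, w_v, w_o, g_q, g_k, g_out):
    """Single-head causal attention in QK-Reorder-LN order. x: (seq_len, d_model)."""
    seq_len = x.shape[0]
    # 1) Input projection.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # 2) First normalization: RMSNorm on queries and keys; values stay unnormalized.
    q, k = rms_norm(q, g_q), rms_norm(k, g_k)
    # With unit gains, |score| <= sqrt(d_head), so the dot products cannot blow up.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # 3) Attention output from normalized q/k and raw v.
    attn_out = softmax(scores) @ v @ w_o
    # 4) Second normalization before the residual connection.
    return x + rms_norm(attn_out, g_out)


rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((5, d))
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
ones = np.ones(d)
y = qk_reorder_ln_block(x, w_q, w_k, w_v, w_o, ones, ones, ones)
y_big = qk_reorder_ln_block(100 * x, w_q, w_k, w_v, w_o, ones, ones, ones)
# The residual update keeps the same scale even when the input is 100x larger.
print(np.allclose(y_big - 100 * x, y - x, atol=1e-4))  # True
```

The last line illustrates the stabilizing effect described in [0183]: because both the scores and the block output are normalized, the size of the residual update does not track the size of the input.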

[0184] In addition to the above features, the processor according to the embodiment of the present disclosure may combine further technical elements to maximize performance. For example, the processor may enhance the model's ability to represent non-linearity by using SwiGLU (Swish-Gated Linear Unit) as the activation function of the feed-forward network (FFN) block.
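The NumPy sketch below shows the commonly used SwiGLU formulation of an FFN block, with the Swish parameter fixed to 1 (SiLU) and no bias terms. The dimensions are arbitrary illustrative values and are not taken from the disclosure.

```python
import numpy as np


def silu(x):
    # Swish with beta = 1 (also called SiLU): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: (SiLU(x @ W_gate) * (x @ W_up)) @ W_down.

    x: (seq_len, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model).
    """
    gate = silu(x @ w_gate)   # smooth, input-dependent gate
    up = x @ w_up             # linear "value" path
    return (gate * up) @ w_down


rng = np.random.default_rng(0)
d_model, d_ff = 8, 32          # illustrative sizes only
x = rng.standard_normal((3, d_model))
out = swiglu_ffn(x, rng.standard_normal((d_model, d_ff)),
                 rng.standard_normal((d_model, d_ff)),
                 rng.standard_normal((d_ff, d_model)))
print(out.shape)  # (3, 8)
```

Compared with a plain ReLU FFN, the block uses three weight matrices instead of two; the elementwise product of the gate and the linear path is what adds the extra non-linearity.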

[0185] In addition, the processor may adopt a Grouped Query Attention (GQA) method in which multiple query heads share a single key-value head to improve inference speed and memory efficiency. For example, in the case of a model with 32B parameters, the processor may be configured to perform attention with 40 query heads and 8 key-value heads, so that every five query heads share one key-value head (a 5:1 ratio).
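A minimal NumPy sketch of grouped-query attention using the 40 query heads and 8 key-value heads from [0185]. Masking is omitted for brevity, and the sequence length and head size are arbitrary illustrative values.

```python
import numpy as np


def gqa_attention(q, k, v):
    """Grouped-query attention (no mask, for brevity).

    q: (n_q_heads, seq_len, d_head); k, v: (n_kv_heads, seq_len, d_head).
    Query head h reads key/value head h // group, with group = n_q_heads // n_kv_heads.
    """
    n_q, n_kv = q.shape[0], k.shape[0]
    if n_q % n_kv != 0:
        raise ValueError("number of query heads must be a multiple of KV heads")
    group = n_q // n_kv
    # np.repeat duplicates each KV head `group` times consecutively: 0,0,0,0,0,1,1,...
    k_rep = np.repeat(k, group, axis=0)
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep


# Head configuration from [0185]; sequence length and head size are arbitrary.
n_q_heads, n_kv_heads, seq_len, d_head = 40, 8, 6, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq_len, d_head))
k = rng.standard_normal((n_kv_heads, seq_len, d_head))
v = rng.standard_normal((n_kv_heads, seq_len, d_head))
out = gqa_attention(q, k, v)
print(out.shape)                # (40, 6, 16)
print(n_kv_heads / n_q_heads)   # 0.2 -> KV cache is 1/5 of full multi-head attention
```

Only the 8 key-value heads need to be cached, so the KV cache shrinks by the same 5:1 factor while all 40 query heads keep their own projections.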

[0186] In addition, the processor can utilize a byte-level BPE-based tokenizer as a tokenizer and can build and use a vocabulary set optimized for multilingual token processing, including Korean and English. The aforementioned architecture serves as a foundation for the processor to accommodate both non-inference mode and inference mode, thereby enabling the efficient provision of various artificial intelligence services.
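To show why a byte-level BPE tokenizer can cover Korean and English with one vocabulary, the toy Python sketch below starts from UTF-8 bytes and repeatedly merges the most frequent adjacent pair. It omits pre-tokenization, special tokens, and the multilingual vocabulary optimization mentioned in [0186]; the corpus is a made-up example.

```python
from collections import Counter


def train_byte_bpe(corpus, num_merges):
    """Toy byte-level BPE: learn `num_merges` merges over UTF-8 byte sequences."""
    # Every string becomes a sequence of single-byte tokens, so no input is out of vocabulary.
    seqs = [tuple(bytes([b]) for b in text.encode("utf-8")) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])  # concatenate the two byte strings
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(tuple(out))
        seqs = new_seqs
    return merges, seqs


corpus = ["한국어 한국어", "English English"]
merges, seqs = train_byte_bpe(corpus, num_merges=10)
print(len("한".encode("utf-8")))          # 3 bytes per Hangul syllable
print(b"".join(seqs[0]).decode("utf-8"))  # 한국어 한국어 (merging is lossless)
```

Because a Hangul syllable costs three bytes before any merges, a vocabulary that learns frequent Korean byte sequences directly shortens Korean inputs.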

[0187]

[0188] [Global Advantage Policy Optimization-Based Training and Thinking Tokens]

[0189] Meanwhile, the model according to the embodiment can be trained with the Asymmetric Sampling and Global Advantage Policy Optimization (AGAPO) algorithm so that, beyond simple next-token prediction, it acquires the planning and correction capabilities of an agent. In the embodiment, the generative artificial intelligence model can be trained with the asymmetric sampling and global advantage policy optimization algorithm on at least one processor of a training computing device.

[0190] Here, asymmetric sampling means that during training, not only correct samples but also incorrect samples are retained as training data rather than discarded. A model trained on such data can learn, through negative feedback from incorrect samples, how to avoid errors and explore a logical path toward the correct answer. This can maximize the agent's self-correction ability.

[0191] In addition, by removing the clipping technique of the existing Proximal Policy Optimization (PPO), the model can be allowed to explore creative or new solutions that deviate significantly from the existing policy.

[0192] In addition, the model according to the embodiment can be trained to generate, before producing the final answer, an internal monologue for problem solving within <think> tags. In this thought-token segment, the model can break down the problem, verify its logic, and formulate a plan.
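A minimal Python sketch of how a downstream component might separate the thought segment from the final answer. It assumes the segment is closed by a matching `</think>` tag; the text above names only the `<think>` tag, so the closing form is an assumption.

```python
import re

# Assumed output format: "<think> ... </think>" followed by the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(output: str) -> tuple[str, str]:
    """Return (thought, answer) from a model output with a <think>...</think> segment."""
    match = THINK_RE.search(output)
    if match is None:
        return "", output.strip()     # no thought segment present
    thought = match.group(1).strip()
    answer = (output[:match.start()] + output[match.end():]).strip()
    return thought, answer


raw = "<think>Sub-goals: find share data, then compare prices.</think>\nBrand A leads."
print(split_thinking(raw))
# ('Sub-goals: find share data, then compare prices.', 'Brand A leads.')
```

The non-greedy `.*?` with `re.DOTALL` captures a multi-line monologue while stopping at the first closing tag.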

[0193] The generative artificial intelligence model according to this embodiment can be trained to selectively perform a non-inference mode and an inference mode according to a user's request by passing through a three-stage post-training pipeline performed by a processor of a computing device.

[0194] FIG. 8 is a flowchart illustrating the learning process of a generative artificial intelligence model according to one embodiment of the present disclosure. In the following description, each step may be understood as being performed by the processor. The learning method may be broadly configured to include an integrated-mode supervised fine-tuning step, an inference reinforcement learning step, and a preference learning step.

[0195] Referring to FIG. 8, the processor prepares a base model for learning a hybrid model according to an embodiment (S11), and can perform Supervised Fine-Tuning (SFT) so that the prepared base model simultaneously acquires the ability to execute various instructions and the ability to reason professionally (S13).

[0196] In this embodiment, the processor can train a single model by integrating non-inference data and inference data. At this time, the ratio setting between the two data types can have a significant impact on the performance of the model.

[0197] Specifically, if the ratio of inference data is excessively high, a bias may occur in which the model undergoes an overly complex thought process even in non-inference mode; conversely, if it is too low, inference ability may not be fully manifested. Accordingly, the processor can perform training by setting the ratio of tokens of inference data to non-inference data to a preset range, for example, between 1.2 to 1 and 1.8 to 1, thereby enabling the model to maintain a balance between the two modes and achieve optimal performance. Experimental results confirm that the most desirable ratio is approximately 1.5 to 1.
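The ratio in [0197] can be turned into per-type token counts with simple arithmetic. The Python sketch below assumes a hypothetical total SFT budget of 10B tokens and rejects ratios outside the preset 1.2 to 1.8 range.

```python
def split_token_budget(total_tokens: int, ratio: float = 1.5,
                       allowed: tuple = (1.2, 1.8)) -> tuple:
    """Split an SFT token budget into (reasoning, non_reasoning) token counts.

    ratio is reasoning tokens / non-reasoning tokens and must lie in `allowed`.
    """
    low, high = allowed
    if not low <= ratio <= high:
        raise ValueError(f"ratio {ratio} is outside the preset range {allowed}")
    reasoning = round(total_tokens * ratio / (1 + ratio))
    return reasoning, total_tokens - reasoning


# Hypothetical 10B-token budget at the 1.5:1 ratio reported as most desirable.
print(split_token_budget(10_000_000_000))  # (6000000000, 4000000000)
```

At 1.5:1, reasoning data makes up 60% of the tokens (1.5 / 2.5).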

[0198] In addition, the processor can construct training data including multiple domains such as world knowledge, mathematics, coding, logic, agent tool usage, long context, and multilingualism. In particular, to cultivate agent tool usage ability, the processor can construct tool call data including not only single turns but also multi-turns and long-horizon goals and utilize it for training.

[0199] Next, after supervised fine-tuning, the processor can perform reinforcement learning to enhance the inference ability of the model (S15).

[0200] In this embodiment, the processor can perform calculations using the Asymmetric Sampling and Global Advantage Policy Optimization (AGAPO) algorithm to overcome the limitations of existing algorithms.

[0201] A processor using the above-mentioned asymmetric sampling and global advantage policy optimization algorithm can perform operations having the following four technical characteristics.

[0202] First, the processor can update the policy using an objective function from which clipping has been removed. In the existing method, clipping can hinder creative branch learning during the inference process by interfering with the gradient updates of low-probability exploratory tokens. To overcome this, the processor can allow important exploratory tokens to contribute to learning by removing clipping and using a standard policy gradient loss.

[0203] Specifically, the processor performs learning in a direction that maximizes the objective function defined by the following [Mathematical Formula 1].

[0204] [Mathematical Formula 1]

[0205]

[0206] Here, the symbols in [Mathematical Formula 1] denote, respectively, the policy model being trained, a reference model, the sequence-level cumulative KL penalty coefficient, and the group size.

[0207] Furthermore, the processor can perform asymmetric sampling. The processor can include samples in the learning process without excluding them, even if all responses within a generated group of responses are incorrect. Through this, the processor can provide negative feedback to the model regarding incorrect inference paths, thereby enabling the model to learn the ability to avoid errors.

[0208] In addition, the processor can calculate the advantage in two stages: at the group level and at the global level. First, the processor can calculate the group advantage within response group G using the Leave-One-Out (LOO) method, as shown in the following formula.

[0209] [Mathematical Formula 2]

[0210]

[0211] Next, the processor can normalize over the entire mini-batch to produce the final global advantage, as shown in the following formula.

[0212] [Mathematical Formula 3]

[0213]

[0214] Here, the reward term in the above formulas represents the verifiable reward value of the i-th response.
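The formula images for [Mathematical Formula 2] and [Mathematical Formula 3] did not survive extraction, so the NumPy sketch below assumes the common leave-one-out form, in which each response's reward is compared with the mean reward of the other G-1 responses in its group, followed by standardization with the mini-batch mean and standard deviation. Treat it as an illustration of the two-stage idea in [0208] to [0214], not as the disclosed equations.

```python
import numpy as np


def loo_group_advantages(rewards):
    """Group stage (assumed LOO form): r_i minus the mean reward of the other G-1 responses."""
    r = np.asarray(rewards, dtype=float)
    g = r.size
    if g < 2:
        raise ValueError("leave-one-out needs a group size of at least 2")
    return r - (r.sum() - r) / (g - 1)


def global_advantages(groups, eps=1e-8):
    """Global stage (assumed): standardize all group advantages across the mini-batch."""
    a = np.concatenate([loo_group_advantages(g) for g in groups])
    return (a - a.mean()) / (a.std() + eps)


# Two prompts, four sampled responses each, binary verifiable rewards (toy values).
batch = [[1, 0, 0, 1], [1, 1, 0, 1]]
print(loo_group_advantages(batch[0]))  # [2/3, -2/3, -2/3, 2/3]
print(global_advantages(batch).round(3))
```

Using a leave-one-out baseline keeps a response's own reward out of the value it is compared against, and the batch-level step puts advantages from easy and hard prompts on a common scale.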

[0215] In calculating the above verifiable reward, the processor may apply different verification logic depending on the domain of the query. For example, in the mathematical domain, the processor may determine the correctness of the final answer through a rule-based verifier. In the code domain, the processor may determine correctness based on whether the generated code block passes all associated test cases. In the scientific domain, the processor may first perform rule-based verification and, if the answer is judged incorrect, perform a more flexible second verification through an LLM-based judge whose input is composed according to a preset prompt. In the instruction-following domain, the processor may grant a discrete reward (e.g., 1 or 0) depending on whether all given constraints are satisfied.
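A Python sketch of the domain-dependent verification in [0215]. The function and parameter names, the toy answer normalization, and the `llm_judge` callback are hypothetical; code is executed with `exec` only for illustration, whereas a real system would run it in an isolated sandbox.

```python
from typing import Callable, Optional, Sequence


def _norm(answer: str) -> str:
    # Toy rule-based normalization (hypothetical): ignore spaces, case, trailing period.
    return answer.strip().rstrip(".").replace(" ", "").lower()


def verifiable_reward(domain: str, response: str, reference: str = "",
                      checks: Sequence[Callable] = (),
                      llm_judge: Optional[Callable[[str, str], bool]] = None) -> float:
    """Return a binary reward using domain-specific verification."""
    if domain == "math":
        return float(_norm(response) == _norm(reference))
    if domain == "code":
        namespace: dict = {}
        try:
            exec(response, namespace)  # illustration only; use a sandbox in practice
            return float(bool(checks) and all(test(namespace) for test in checks))
        except Exception:
            return 0.0  # code that fails to run or crashes a test earns nothing
    if domain == "science":
        if _norm(response) == _norm(reference):
            return 1.0
        # Rule check failed: fall back to the (hypothetical) LLM judge.
        return float(llm_judge is not None and llm_judge(response, reference))
    if domain == "instruction":
        return float(all(constraint(response) for constraint in checks))
    raise ValueError(f"unknown domain: {domain}")


code = "def add(a, b):\n    return a + b\n"
tests = [lambda ns: ns["add"](2, 3) == 5, lambda ns: ns["add"](-1, 1) == 0]
print(verifiable_reward("code", code, checks=tests))                        # 1.0
print(verifiable_reward("math", "42.", reference="42"))                      # 1.0
print(verifiable_reward("instruction", "Short answer", checks=[lambda r: len(r) < 5]))  # 0.0
```

Keeping every branch binary matches the discrete-reward style described for the instruction-following domain and makes the rewards directly usable by the advantage computation.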

[0216] Finally, the processor may apply a sequence-level cumulative Kullback-Leibler (KL) divergence penalty. To prevent the trained language model from deviating excessively from its original distribution, the processor may apply this KL penalty to the objective function at the sequence level rather than at the token level.
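A NumPy sketch of a clip-free policy-gradient loss with a sequence-level cumulative KL penalty, in the spirit of [0202] and [0216]. It only evaluates the loss value; in a real trainer the same expression would be written in an autodiff framework with the advantages treated as constants. Using the summed per-token log-ratio as the KL estimate and averaging over responses are assumptions, since the exact form of [Mathematical Formula 1] is not recoverable here.

```python
import numpy as np


def clip_free_seq_kl_loss(logp_policy, logp_ref, advantages, beta=0.01):
    """Loss value to minimize; one entry per sampled response.

    logp_policy / logp_ref: per-token log-probabilities of each response under
    the trained policy and the frozen reference model.
    advantages: one scalar (e.g. a global advantage) per response.
    """
    losses = []
    for lp, lr, adv in zip(logp_policy, logp_ref, advantages):
        lp, lr = np.asarray(lp, dtype=float), np.asarray(lr, dtype=float)
        seq_logp = lp.sum()          # log pi_theta(o_i | q) for the whole response
        seq_kl = (lp - lr).sum()     # assumed cumulative (sequence-level) KL estimate
        # Plain policy-gradient term: no probability ratio and no clipping, so
        # low-probability exploratory tokens keep their full gradient weight.
        losses.append(-adv * seq_logp + beta * seq_kl)
    return float(np.mean(losses))


loss = clip_free_seq_kl_loss([[-1.0, -1.0], [-2.0]], [[-1.5, -1.0], [-2.0]],
                             advantages=[1.0, -1.0], beta=0.1)
print(round(loss, 6))  # 0.025
```

Summing the log-ratio over the whole response penalizes drift once per sequence, rather than averaging it token by token, which is the distinction [0216] draws.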

[0217] Consequently, by using the final objective function of asymmetric sampling and global advantage policy optimization, which combines the above elements, the processor can control the model to optimize its inference policy in a direction that increases the accuracy rate.

[0218] In addition, the processor can perform preference learning (S17) to align the model's responses with human preferences while maintaining the reasoning ability enhanced through reinforcement learning. Depending on the embodiment, this step may be performed separately from the reinforcement learning step.

[0219] First, the processor can perform learning based on accuracy and conciseness. To improve the token efficiency of the model, among equally correct responses the processor can label the one with the shortest token length as preferred and long, verbose responses as dispreferred. Through this, the processor can reduce unnecessary computation and guide the model to follow only the core thought process even in inference mode.
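The conciseness labeling in [0219] can be sketched in Python as pairing the shortest correct response (chosen) with each longer correct response (rejected). The `Sample` record and its fields are hypothetical; incorrect responses are simply left out of these pairs.

```python
from typing import NamedTuple


class Sample(NamedTuple):   # hypothetical record for one sampled response
    text: str
    correct: bool
    num_tokens: int


def conciseness_pairs(samples):
    """Return (chosen, rejected) text pairs among equally correct responses."""
    correct = [s for s in samples if s.correct]
    if len(correct) < 2:
        return []
    shortest = min(correct, key=lambda s: s.num_tokens)
    return [(shortest.text, s.text) for s in correct
            if s.num_tokens > shortest.num_tokens]


samples = [Sample("A", True, 50), Sample("B", True, 120),
           Sample("C", False, 10), Sample("D", True, 300)]
print(conciseness_pairs(samples))  # [('A', 'B'), ('A', 'D')]
```

The short incorrect response "C" never becomes a chosen example, so conciseness is rewarded only among answers that are already correct.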

[0220] Next, the processor can perform learning based on language consistency and preference scores (S19). The processor can perform final fine-tuning using language consistency and general preference scores as rewards to align the quality and style of responses with human standards. At this time, the processor can reuse a portion of the preferred/dispreferred labeled training data from the previous stage to stabilize training.

[0221] The model of the present disclosure, having undergone such a learning pipeline, can internalize both non-inference capabilities corresponding to rapid intuition and reasoning capabilities corresponding to deep thinking within a single model, and this can serve as a core foundation for enabling the reasoning level adjustment service described below.

[0222] The model trained with the hybrid attention and sandwich mixture-of-experts architecture described above, together with the Asymmetric Sampling and Global Advantage Policy Optimization (AGAPO) algorithm, functions not merely as a text generator but as the brain of an intelligent agent capable of solving complex problems. In particular, the thought tokens generated by the model can be used as key control signals in the agent system described later, both to convert ambiguous user queries into concrete execution plans and to analyze and self-correct errors that occur during execution.

[0223] Below, the specific operation process of an agent system that controls the actual toolchain and performs multistep inference based on this model is described.

[0224] In the following description, the foundation model, inference model, or hybrid model mentioned herein shall be interpreted as meaning or including an artificial intelligence model having the aforementioned hybrid attention and mixture-of-experts architecture, unless otherwise noted.

[0225]

[0226] [Multi-stage inference service provision method]

[0227] Based on the aforementioned hardware and model architecture, a processor according to one embodiment of the present disclosure can perform a cyclic process of planning, execution, verification, and modification from receiving a user's query until a final result is produced.

[0228] Below, the implementation of the multi-stage inference service is described in detail, stage by stage.

[0229] FIG. 9 may be a flowchart illustrating an initial operation process in which at least one processor (hereinafter referred to as the processor) of a computer device constituting a system analyzes a user query and determines an execution mode, according to an embodiment of the present disclosure. In this embodiment, the processor may perform a control operation to interpret the user's natural language input and determine an optimal pipeline by loading and executing an orchestrator stored in memory.

[0230] Below, the series of processes performed by the above-mentioned processor before entering multi-step reasoning is explained in detail step by step.

[0231] First, the processor can perform a user input reception and preprocessing step (S101).

[0232] Specifically, the processor may perform the step of receiving a query in the form of natural language from a user computing device through an input / output interface. At this time, the input query may include various modalities such as voice and images, in addition to text. The processor may generate refined query data for subsequent analysis and store it in memory by performing preprocessing operations on the received query, such as stop word removal, typo correction, or anonymization.
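As an illustration of the anonymization part of the preprocessing in [0232], the Python sketch below collapses whitespace and masks e-mail addresses and Korean-style phone numbers with regular expressions. The patterns and placeholder tokens are assumptions; stop-word removal and typo correction are not shown.

```python
import re

# Hypothetical patterns: a simple e-mail shape and Korean-style "010-1234-5678" numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")


def preprocess_query(text: str) -> str:
    """Normalize whitespace and mask personal identifiers before analysis."""
    text = " ".join(text.split())          # collapse runs of whitespace
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(preprocess_query("Send it to  kim@example.com or call 010-1234-5678 "))
# Send it to [EMAIL] or call [PHONE]
```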

[0233] At this time, according to one embodiment of the present disclosure, the processor may perform meta-information retrieval and prompt reconstruction processes to understand the user's business context, going beyond simple text reception.

[0234] Specifically, the processor can retrieve user meta-information stored in memory or a database, using as a key the user identifier transmitted along with the received query. The meta-information may include long-term memory extracted from at least one of the user's department, job function, preferred response style, and past conversation history.

[0235] Furthermore, the processor can initialize the agent's persona by dynamically prepending the retrieved meta-information to the system prompt.

[0236] In one embodiment, if the meta-information indicates that the user belongs to a finance-related department and prefers summary reports, the processor may, without any separate user instruction, implicitly inject system instructions that assign the agent the role of a professional financial analyst and require all responses to be written in a bulleted format emphasizing key figures for executive reporting. Through this, the agent can be prepared to generate customized responses that reflect even the user's implicit intentions.
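A minimal Python sketch of the persona reconstruction in [0235] and [0236]: instructions derived from meta-information are placed in front of the base system prompt. The meta-information keys, their values, and the instruction wording are hypothetical.

```python
def build_system_prompt(base_prompt: str, meta: dict) -> str:
    """Prepend persona instructions derived from (hypothetical) user meta-information."""
    persona = []
    if meta.get("department") == "finance":
        persona.append("You are a professional financial analyst.")
    if meta.get("preferred_style") == "summary":
        persona.append("Write every answer as concise bullet points that "
                       "highlight key figures for executive reporting.")
    if meta.get("job_function"):
        persona.append(f"The user's job function is {meta['job_function']}.")
    # Meta-derived instructions come first, followed by the base system prompt.
    return "\n".join(persona + [base_prompt])


prompt = build_system_prompt("You are a helpful enterprise assistant.",
                             {"department": "finance", "preferred_style": "summary"})
print(prompt.splitlines()[0])  # You are a professional financial analyst.
```

The user never writes these instructions; they are injected implicitly, which is what lets the agent reflect intentions the user did not state.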

[0237] Next, the processor can perform a query intent analysis and complexity determination step (S103) based on the refined query data and a prompt (e.g., persona) reconstructed based on meta-information extracted from long-term memory.

[0238] Specifically, the processor may perform a step of analyzing the semantic intent of the refined query data using a foundation model. In this step, the processor can go beyond simply matching keywords to identify whether the form of the result requested by the user is simple information, a report, code, or a creative work, and can perform an operation to evaluate the complexity of the task to be solved.

[0239] In an embodiment, the processor may use a foundation model to analyze the semantic intent of a query and perform operations to evaluate the complexity of the task to be solved. In this step, the processor may determine the mode for processing the query as either a non-inference mode or an inference mode.

[0240] Specifically, the processor can classify input queries according to pre-set criteria. For example, the processor can classify cases that do not require separate external information or logical reasoning, such as greetings or expressions of emotion, as simple conversations. Additionally, the processor can classify cases requiring verification of clear facts, such as "What is the weather like in Seoul today?", as single information searches. Furthermore, the processor can classify cases requiring the combination of multiple stages of information gathering, comparison, and logical judgment, such as "Analyze market trends in Brazil and formulate an entry strategy," as complex reasoning tasks.

[0241] For example, if it is necessary to verify clear facts, the processor may classify this as a single information search and control it to enter an immediate answer mode to save computational resources. On the other hand, if multiple stages of information collection, comparison, and logical judgment must be combined, such as a request to "analyze in depth the technical challenges and latest solutions of the secondary battery market," the processor may classify this as a complex reasoning task and activate a deep analysis mode (or deep dive mode).

[0242] In the embodiment, when the deep analysis mode is activated, the processor may allocate an inference budget capable of generating a larger amount of thought tokens compared to the normal mode so that the model can undergo a sufficient thought process before generating an answer.

[0243] In addition, the processor may perform a planning step of breaking down a complex query into executable subtasks and generating an execution graph by analyzing data dependencies between each task. As an example of a query related to secondary batteries, the processor may establish subtasks consisting of the steps of extracting keywords for technical challenges, searching for relevant academic papers and patents, comparing and analyzing technical solutions by company, and deriving final insights and writing a report.

[0244] Next, the processor may perform an execution mode determination and routing step (S105) based on the analyzed intent and complexity. Specifically, in this step the processor may determine the execution mode and route each subtask to the optimal tool for performing it according to the established plan.

[0245] In an embodiment, based on the analyzed intent and complexity, the processor may determine an execution mode to process the corresponding query and perform a step of routing the process flow with logic corresponding to the mode.

[0246] If the query is classified as a simple conversation, the processor may operate in Direct Answer Mode, which generates an immediate response using only built-in knowledge, without calling any external tool, and then terminates the process. This may be intended to avoid unnecessary consumption of the processor's computational resources and to minimize response latency.

[0247] If the query is classified as a single information search, the processor may operate in a single-step search mode in which it calls an external search engine or a RAG (Retrieval-Augmented Generation) module once to generate an answer.

[0248] In the present disclosure, if the query is determined to be a complex reasoning task, the processor may immediately suspend answer generation and trigger entry into a Multi-step Reasoning Mode. As specific criteria for determining that complex reasoning is necessary, the processor may jointly evaluate whether the user's request can be broken down into two or more sub-questions, whether expert tools such as data analysis or coding are essential, and whether evidence supporting the answer must be presented.

[0249] When the multistep inference mode is activated in this way, the processor can stop the simple answer generation logic and operate to perform a specific planning step to solve a complex problem.
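The mode routing in [0246] to [0249] can be sketched in Python as below. The category labels follow [0240], but the `QueryAnalysis` fields are hypothetical, and treating any one of the three criteria in [0248] as sufficient (a logical OR) is an assumption, since the disclosure only says the criteria are evaluated jointly.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DIRECT_ANSWER = "direct_answer"
    SINGLE_STEP_SEARCH = "single_step_search"
    MULTI_STEP_REASONING = "multi_step_reasoning"


@dataclass
class QueryAnalysis:             # hypothetical output of the intent analysis step
    category: str                # "simple_conversation", "single_search", or "complex"
    num_subquestions: int = 1
    needs_expert_tools: bool = False
    needs_evidence: bool = False


def select_mode(a: QueryAnalysis) -> Mode:
    if a.category == "simple_conversation":
        return Mode.DIRECT_ANSWER          # built-in knowledge only, no tool calls
    # Assumption: any one of the [0248] criteria is enough to trigger multi-step mode.
    complex_signals = (a.num_subquestions >= 2, a.needs_expert_tools, a.needs_evidence)
    if a.category == "complex" or any(complex_signals):
        return Mode.MULTI_STEP_REASONING   # stop direct answering and start planning
    return Mode.SINGLE_STEP_SEARCH         # one call to a search engine or RAG module


print(select_mode(QueryAnalysis("single_search")))  # Mode.SINGLE_STEP_SEARCH
```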

[0250] In addition, a processor according to one embodiment of the present disclosure may apply a reliability-based search augmentation generation mechanism that goes beyond simple tool calls during search-based inference and selectively processes only information that meets reliability criteria set by the user.

[0251] Specifically, when the processor calls a search agent, it may also transmit source control parameters received through a user interface. For example, in the case of a deep analysis mode where the user requests an analysis based on academic materials, the processor may limit the search targets to specific academic databases or verified patent databases, and perform logic to filter out or lower the ranking of non-professional sources with low credibility, such as blogs or Wikipedia, from the search results.
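The source control in [0251] can be sketched in Python as a post-search filter that either restricts results to user-selected domains or drops or down-ranks low-credibility sources. The domain lists, the 0.5 penalty, and the result format (dicts with `url` and `score`) are hypothetical.

```python
from urllib.parse import urlparse


def _matches(host: str, domains) -> bool:
    # A host matches a domain if it equals it or is one of its subdomains.
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_by_reliability(results, allowed=None, low_trust=(), drop_low_trust=True,
                          penalty=0.5):
    """results: dicts with 'url' and 'score'. Returns results sorted by adjusted score."""
    kept = []
    for r in results:
        host = urlparse(r["url"]).hostname or ""
        if allowed is not None and not _matches(host, allowed):
            continue  # outside the user-selected source scope
        low = _matches(host, low_trust)
        if low and drop_low_trust:
            continue
        kept.append({**r, "score": r["score"] * (penalty if low else 1.0)})
    return sorted(kept, key=lambda r: r["score"], reverse=True)


hits = [{"url": "https://en.wikipedia.org/wiki/Battery", "score": 0.9},
        {"url": "https://patents.google.com/patent/X", "score": 0.7},
        {"url": "https://blog.example.com/post", "score": 0.95}]
ranked = filter_by_reliability(hits, low_trust=("wikipedia.org", "blog.example.com"),
                               drop_low_trust=False)
print([urlparse(r["url"]).hostname for r in ranked])
# ['patents.google.com', 'blog.example.com', 'en.wikipedia.org']
```

Setting `drop_low_trust=True` corresponds to filtering such sources out, and `False` corresponds to lowering their ranking; both behaviors are mentioned in [0251].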

[0252] In addition, the processor can perform dynamic routing tailored to the nature of each subtask. It can route to a web search agent when the latest market trends or news data are required, and to an in-house knowledge search agent when internal corporate regulations or past project history are needed.

[0253] Furthermore, if numerical data analysis or visualization is required, the processor can route it to a code generation model. In this case, the processor can convert table or chart images within the retrieved documents into structured data using a visual encoder and provide it as input to a code interpreter. Through this, the agent can integrate and analyze not only text but also visual information to generate an output in the format requested by the user.

[0254] In this way, a processor according to one embodiment of the present disclosure can perform the role of a key control unit that provides high-quality customized artificial intelligence services by converting a user's simple query into specific technical commands and designing an optimal execution path in accordance with user context and reliability criteria.

[0255] FIG. 10 may be a flowchart illustrating the process of a processor of a system that has entered a multistep inference mode, according to an embodiment of the present disclosure, breaking down a user query into a plurality of sub-tasks and establishing an execution plan.

[0256] When executing multistep inference mode, the processor can first perform a subtask decomposition and definition step (S201).

[0257] Specifically, when the multistep inference mode is enabled, the processor can identify the intermediate steps required to achieve the final goal of a user query and perform an operation to decompose them into executable unit tasks called sub-tasks.

[0258] For example, when a user inputs a query such as "Analyze the competitive landscape of the Brazilian washing machine market and suggest an entry strategy for LG Electronics," the processor can use a built-in inference model to break down the query into essential elements for constructing a logically complete answer.

[0259] Specifically, the processor can decompose the query into a first subtask for searching the laundry culture and lifestyle patterns of local Brazilian consumers, a second subtask for securing a list of major washing machine manufacturers and market share data within the Brazilian market, a third subtask for comparing and analyzing the company's products with the key specifications and price ranges of competitors' products, and a fourth subtask for synthesizing the analysis results to derive market opportunity factors and generate an entry strategy. At this time, the processor can determine the scope of the task by explicitly defining the input and expected output for each subtask.

[0260] The process of subtask decomposition and execution planning according to one embodiment of the present disclosure may be performed not by simply listing tasks, but through directed acyclic graph (DAG) generation and a topological sorting algorithm that mathematically models the causal relationships of data flow. To this end, at least one processor of a computing device may execute a series of computational logic steps including dependency parsing, graph construction, parallel segment identification, and circular reference verification.

[0261] First, the processor may perform a graph construction step of converting a natural language query into a set of atomic subtask nodes (V) and generating directed edges (E) by analyzing the input-output relationships between the nodes. Specifically, when the output data of a first subtask (e.g., market share text) is an essential input parameter of a second subtask (e.g., an argument of a visualization function), the processor may generate a directed edge from the first subtask node to the second subtask node, thereby explicitly defining the data dependency. In this case, the processor can prevent infinite loops or deadlocks in advance by constraining the graph G = (V, E) to be a directed acyclic graph (DAG) that contains no cycles.

[0262] Next, the processor may perform a scheduling step that linearizes the execution order by applying a topological sort algorithm (e.g., Kahn's algorithm or a DFS-based sort) to the constructed graph. The processor may identify nodes with an in-degree of 0, i.e., nodes that have no predecessors or whose predecessors have all completed, and insert them into an execution queue. Through this, the processor can ensure logical consistency by converting intricately intertwined dependencies into an executable sequential pipeline.
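A Python sketch of Kahn's algorithm as described in [0262], grouped by level so that each level is a set of mutually independent subtasks that could be dispatched in parallel. It also performs the cycle check from [0264] by detecting nodes that never reach in-degree 0. The subtask names are hypothetical.

```python
from collections import defaultdict


def topological_levels(nodes, edges):
    """Kahn's algorithm, grouped into levels of mutually independent subtasks.

    edges: (u, v) pairs meaning subtask v consumes the output of subtask u.
    Raises ValueError if the dependencies contain a cycle.
    """
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    ready = sorted(n for n in nodes if indegree[n] == 0)  # no unfinished predecessors
    levels, scheduled = [], 0
    while ready:
        levels.append(ready)
        scheduled += len(ready)
        next_ready = []
        for u in ready:
            for v in children[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    next_ready.append(v)
        ready = sorted(next_ready)
    if scheduled != len(nodes):
        stuck = sorted(n for n, d in indegree.items() if d > 0)
        raise ValueError(f"circular dependency among subtasks: {stuck}")
    return levels


# Hypothetical subtasks: market search and FX lookup are independent.
nodes = ["market_search", "fx_lookup", "share_table", "strategy"]
edges = [("market_search", "share_table"), ("share_table", "strategy"),
         ("fx_lookup", "strategy")]
print(topological_levels(nodes, edges))
# [['fx_lookup', 'market_search'], ['share_table'], ['strategy']]
```

Nodes in the same level cannot reach each other, because any dependency path forces the later node into a strictly later level.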

[0263] Furthermore, the processor can perform parallel segment identification and critical path analysis to reduce the total execution time. Within the topologically sorted graph hierarchy, the processor can identify nodes that are located at the same topological level or are mutually unreachable, and group them into parallel execution groups. For example, if market information retrieval and exchange rate lookup have no data dependency on each other, the processor can dispatch them simultaneously as separate threads or asynchronous processes. Moreover, the processor can calculate the critical path of the graph using the estimated time required for each subtask as a weight, and preferentially allocate more computing resources to the tasks on that path.
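The critical path analysis in [0263] amounts to finding the longest duration-weighted path through the DAG. The Python sketch below does this with a single pass over a topological order; the subtask names and estimated durations are hypothetical.

```python
def critical_path(order, edges, duration):
    """Longest duration-weighted path through a DAG.

    order: the nodes in any topological order; duration: estimated time per node.
    Returns (path, total_time).
    """
    preds = {n: [] for n in order}
    for u, v in edges:
        preds[v].append(u)
    finish, via = {}, {}
    for n in order:
        start, via[n] = 0.0, None
        for p in preds[n]:
            if finish[p] > start:           # wait for the slowest predecessor
                start, via[n] = finish[p], p
        finish[n] = start + duration[n]
    node = max(order, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = via[node]
    return path[::-1], total


# Hypothetical estimated durations in seconds.
order = ["fx_lookup", "market_search", "share_table", "strategy"]
edges = [("market_search", "share_table"), ("share_table", "strategy"),
         ("fx_lookup", "strategy")]
duration = {"fx_lookup": 5, "market_search": 30, "share_table": 20, "strategy": 40}
print(critical_path(order, edges, duration))
# (['market_search', 'share_table', 'strategy'], 90.0)
```

Speeding up `fx_lookup` would not shorten the 90-second plan, which is why extra resources are better spent on the nodes of the critical path.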

[0264] Finally, the processor may perform circular reference detection logic to verify the validity of the planned graph. If a circular dependency is discovered during the execution plan formulation phase, where Task A requires Task B and Task B simultaneously requires Task A, the processor may identify this as a logical error and perform an automatic correction procedure to resolve the dependency by requesting additional information from the user or using a default value to break the circular loop.

[0265] Next, the processor can perform a dependency analysis and order determination step (S203) between multiple decomposed tasks.

[0266] Specifically, the processor can perform a step of analyzing the logical precedence relationship and data dependency among a plurality of decomposed subtasks.

[0267] Specifically, the processor can identify that market share data, which is the result of the second sub-task (data investigation), is required prior to performing the third sub-task (comparative analysis). Accordingly, the processor can distinguish between tasks that can be processed in parallel, such as general information retrieval, and tasks that must be processed sequentially, such as strategy proposal after data analysis, and generate an execution graph or workflow representing the flow of the entire operation.

[0268] And the processor can perform an optimal expert module assignment (Tool Assignment) step (S205). Specifically, the processor can perform a step of analyzing the nature of each subtask and mapping a module that can most effectively perform the corresponding task among a plurality of expert modules available in the system.

[0269] For example, since the first subtask (lifestyle pattern search) requires timeliness and local information, the processor may assign it to a Web Search Agent. Additionally, since the second subtask (data investigation) and the third subtask (comparative analysis) require the accurate processing and visualization of numerical data, the processor may assign them to a Code Interpreter or a Data Analysis Agent. In one embodiment, the operation of the processor routing a specific subtask (e.g., data analysis) to a Code Interpreter may be coupled with a model-level operation within the hybrid model in which tokens related to the task are assigned to and processed by a coding-specialized routing expert.

[0270] In addition, the processor may grant the data analysis agent the authority to generate and execute programming code, such as Python. Furthermore, since the fourth subtask (strategy proposal) requires a high level of reasoning ability, the processor may assign it to the high-performance reasoning model with the largest parameters within the system.

[0271] Next, the processor may perform the step of confirming and storing an execution plan (S207). The processor may synthesize the results of steps S201 through S205 to generate a final execution plan in a structured data format (e.g., JSON format) and store it in memory. The execution plan may include the ID of each subtask, execution order, assigned agent information, or input / output parameters, and may serve as a standard for sequential or parallel execution in subsequent steps.
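A Python sketch of the structured execution plan in [0271] for the Brazilian washing machine example, serialized as JSON. The field names and agent identifiers are hypothetical, and the agent assignment follows [0269] and [0270].

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SubTask:
    id: str
    goal: str
    agent: str                                      # hypothetical agent identifier
    depends_on: list = field(default_factory=list)  # ids whose outputs are inputs
    output: str = ""                                # name of the produced artifact


plan = [
    SubTask("t1", "Survey Brazilian laundry culture and lifestyle", "web_search_agent",
            output="lifestyle_notes"),
    SubTask("t2", "Collect manufacturers and market-share data", "data_analysis_agent",
            output="share_table"),
    SubTask("t3", "Compare key specs and price ranges", "data_analysis_agent",
            ["t2"], "comparison_table"),
    SubTask("t4", "Derive opportunities and an entry strategy", "reasoning_llm",
            ["t1", "t3"], "strategy_report"),
]

plan_json = json.dumps({"subtasks": [asdict(t) for t in plan]}, indent=2)
print(json.loads(plan_json)["subtasks"][3]["depends_on"])  # ['t1', 't3']
```

Storing dependencies rather than a fixed sequence lets later steps derive both sequential and parallel execution from the same plan.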

[0272] FIG. 11 may be a flowchart illustrating the process of a processor routing and executing each subtask to an appropriate expert module based on an established execution plan, according to an embodiment of the present disclosure.

[0273] Referring to FIG. 11, the processor can perform context injection and prompt reconstruction steps (S301).

[0274] More specifically, the processor may perform the step of initiating the execution of subtasks sequentially or in parallel according to a determined execution plan. Specifically, prior to executing a specific subtask (e.g., subtask N), the processor may perform an operation to read the execution result of a preceding subtask (e.g., subtask N-1) by referring to shared memory.

[0275] Furthermore, the processor may perform a step of dynamically reconstructing an execution prompt or control instruction in a form that the expert module can understand by combining the read-out prior result data with the goal of the subtask currently to be performed.

[0276] For example, to perform an analysis of the competitive status of the Brazilian washing machine market (Sub-task 3), the processor can inject text data of the market share of major manufacturers collected by the web search agent in the previous step as context and generate a specific instruction prompt to convey to the data analysis agent, "structure the text data into a table and generate visualization code."
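A minimal Python sketch of the context injection in [0274] to [0276]: results of preceding subtasks are read from shared memory and combined with the current goal into one instruction. The block labels and the plain-dict shared memory are assumptions, and the stored value below is a placeholder, not real market data.

```python
def build_subtask_prompt(goal, depends_on, shared_memory):
    """Combine results of preceding subtasks (read from shared memory) with the current goal."""
    blocks = []
    for dep in depends_on:
        if dep not in shared_memory:
            raise KeyError(f"result of subtask {dep!r} is not available yet")
        blocks.append(f"[Result of {dep}]\n{shared_memory[dep]}")
    blocks.append(f"[Task]\n{goal}")
    return "\n\n".join(blocks)


memory = {"t2": "<market-share text collected by the web search agent>"}  # placeholder
prompt = build_subtask_prompt(
    "Structure the text data into a table and generate visualization code.", ["t2"], memory)
print(prompt)
```

Raising an error for a missing dependency surfaces scheduling mistakes early instead of sending an incomplete prompt to the expert module.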

[0277] Next, the processor can perform a dynamic routing step (S303) to the expert modules.

[0278] Specifically, the processor may perform the step of selecting one of a plurality of predefined communication protocols based on the attributes of the current subtask and routing it to the optimal expert module to perform the corresponding task.

[0279] At this time, if a subtask requires real-time information or external data (e.g., exchange rate information, news search), the processor can select a web search and external information routing path (Web / API Routing), activate a search API interface, and control the transmission of a query to an external search engine or an internal document database (RAG).

[0280] Additionally, if a subtask requires numerical calculation, data visualization, or structured data processing (e.g., drawing a market share graph), the processor can select a code execution and data analysis path (Code Execution Routing) and route it to a code interpreter module. At this time, the processor can securely transmit not only natural language commands but also the source data (CSV, Excel, etc.) to be analyzed to a sandbox environment.

[0281] In addition, if a subtask requires a logical judgment or strategy proposal based on collected information, the processor can select a high-performance inference and generation path (Inference Routing) and route the subtask to the high-performance LLM within the system that has the largest parameter scale and the strongest inference ability (e.g., a 32B-parameter model).
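A minimal sketch of the routing decision in paragraphs [0278] to [0281], assuming each subtask carries hypothetical boolean attributes (`needs_external_data`, `needs_computation`). The precedence order (external data first, then computation, otherwise inference) is an assumption; the disclosure does not specify how subtasks with mixed attributes are resolved.

```python
from enum import Enum


class Route(Enum):
    WEB_API = "web_api"       # real-time or external data (search API, RAG)
    CODE_EXECUTION = "code"   # calculation, visualization, structured data
    INFERENCE = "inference"   # judgment or strategy over collected information


def select_route(subtask: dict) -> Route:
    """Choose a routing path from hypothetical subtask attribute flags."""
    if subtask.get("needs_external_data"):
        return Route.WEB_API
    if subtask.get("needs_computation"):
        return Route.CODE_EXECUTION
    return Route.INFERENCE
```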

[0282] And the processor can perform an asynchronous execution and state monitoring step (S305).

[0283] Specifically, the processor may perform a step of requesting work from each expert module and waiting for the completion of the work by the corresponding module. At this time, the processor may perform parallel routing in an asynchronous manner for subtasks that do not depend on each other in order to utilize resources efficiently.

[0284] In addition, the processor can monitor the execution status (Running, Completed, Failed) of each module in real time, and if a response is delayed or a time-out occurs in a specific module, it can perform exception handling such as reassigning the corresponding task or providing a delay notification to the user.
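The asynchronous dispatch and status tracking described in paragraphs [0283] and [0284] might look like this with Python's `asyncio`. It is a sketch, assuming each expert module is exposed as a coroutine factory; `run_with_monitoring` is a hypothetical name, and reassignment or user notification is left to the caller, which can inspect the returned statuses.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


async def run_with_monitoring(
    tasks: Dict[str, Callable[[], Awaitable[str]]],
    timeout_s: float,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run independent subtasks concurrently and track each one's status."""
    status = {name: "Running" for name in tasks}
    results: Dict[str, str] = {}

    async def guarded(name: str, factory: Callable[[], Awaitable[str]]) -> None:
        try:
            results[name] = await asyncio.wait_for(factory(), timeout=timeout_s)
            status[name] = "Completed"
        except Exception:
            # Time-out or module error: a candidate for reassignment or a delay notice
            status[name] = "Failed"

    await asyncio.gather(*(guarded(name, factory) for name, factory in tasks.items()))
    return results, status
```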

[0285] Next, the processor can perform the result reception and shared memory synchronization step (S307).

[0286] Specifically, when execution results (e.g., searched text, generated graph images, return values of executed code) are received from each expert module, the processor may perform the step of converting them into a standardized format and storing (updating) them in shared memory. By doing so, the processor can synchronize the context of the entire agent system so that subsequent subtasks can refer to the results of the preceding task.
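One possible shape for the standardized format mentioned in paragraph [0286] is sketched below. The `ResultRecord` fields and the JSON serialization of return values are assumptions made for illustration, not a format defined by the disclosure.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ResultRecord:
    """Hypothetical standardized format for one expert module's output."""
    subtask_id: int
    module: str
    kind: str      # e.g. "text", "image", "code_return"
    payload: str   # text, a file reference, or a serialized return value
    timestamp: float


def store_result(memory: dict, subtask_id: int, module: str, kind: str, payload: Any) -> ResultRecord:
    """Normalize a raw module output and write (or overwrite) it in shared memory."""
    if not isinstance(payload, str):
        payload = json.dumps(payload)  # serialize code return values as JSON
    record = ResultRecord(subtask_id, module, kind, payload, time.time())
    memory[subtask_id] = asdict(record)
    return record
```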

[0287] FIG. 12 may be a flowchart illustrating a process of integrating the execution results of each subtask to generate and provide a final answer to a user, according to an embodiment of the present disclosure.

[0288] Referring to FIG. 12, the processor can perform a result aggregation and conflict resolution step (S401).

[0289] Specifically, when the execution of all subtasks is completed, the processor may perform a step of aggregating the output contexts of each expert module, which are distributed and stored in shared memory.

[0290] At this time, the processor can perform operations to verify whether there are logical contradictions or data inconsistencies among the aggregated information. For example, if the market size figures collected by a web search agent differ from the figures extracted from a CSV file by a data analysis agent, the processor can perform conflict resolution logic to determine priority based on the data's Source Reliability Score or Recency, or to request confirmation from the user if necessary.
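The conflict resolution logic of paragraph [0290] can be illustrated with a small rule: prefer the clearly more reliable source, fall back to recency when reliability scores are close, and otherwise defer to the user. The score scale, the `tolerance` and `margin` thresholds, and the `Claim` type are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Claim:
    source: str
    value: float
    reliability: float  # hypothetical Source Reliability Score in [0, 1]
    as_of: date         # recency of the underlying data


def resolve_conflict(claims, tolerance=0.01, margin=0.1):
    """Return the preferred claim, or None to request confirmation from the user."""
    values = [c.value for c in claims]
    if max(values) - min(values) <= tolerance * max(abs(v) for v in values):
        return claims[0]  # values agree within tolerance: no conflict
    ranked = sorted(claims, key=lambda c: c.reliability, reverse=True)
    best, second = ranked[0], ranked[1]
    if best.reliability - second.reliability >= margin:
        return best  # a clearly more reliable source wins
    if best.as_of != second.as_of:
        return max(best, second, key=lambda c: c.as_of)  # otherwise prefer recency
    return None  # no clear winner: ask the user
```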

[0291] The processor can then perform a final answer synthesis and formatting step (S403).

[0292] Specifically, the processor can take the verified result data as input and generate a final answer that corresponds to the intent of the user's initial query. Rather than simply listing each result, the processor can use an embedded language model to restructure the results into coherent prose with an introduction, body, and conclusion.

[0293] In particular, the processor can perform the step of formatting a highly readable report by placing a visualization image or code block generated by a data analysis agent in an appropriate location between text descriptions.

[0294] Next, the processor can perform a result provision step (S405) of presenting the formatted output through a user interface (UI).

[0295] The above processor may perform the step of outputting the generated final answer to a display device of a user computing device. At this time, the result screen provided may be configured with various layouts depending on the nature of the user's request.

[0296] Specific examples of final output artifacts (e.g., reports) generated according to embodiments of the present disclosure are described below.


[0298] [Final Artifact Creation Process]

[0299] In one embodiment, the final output may be a comprehensive market analysis report.

[0300] When a user requests "Current Status of Competition and Entry Strategy in the Brazilian Washing Machine Market," the processor may include, under the title "In-depth Analysis of the Brazilian Washing Machine Market and Proposal of LG Electronics' Entry Strategy," consumer lifestyle analysis information obtained from a web search agent (e.g., a preference for large capacity due to extended family culture, sensitivity to energy efficiency, etc.). Additionally, the processor may insert competitor market share status and a visualized pie chart image obtained from a data analysis agent, and include competitor price positioning information analyzed by a code interpreter. Furthermore, the processor may generate the report so that it proposes opportunity factors and entry strategies derived from an inference model (e.g., focusing on 12 kg-class large-capacity models).

[0301] In addition, in another embodiment, the final output report may include data analysis and visualization results.

[0302] When a user requests, "Analyze the defect rate trend in the attached process data (CSV)," the processor may include data summary information (total number of data items, period, average defect rate, etc.) under the title "Process Data-Based Defect Rate Trend and Outlier Detection." Additionally, the processor may insert a monthly line graph generated using Python Matplotlib or similar tools and specify specific details, such as a surge in the defect rate in August (Red Alert). Furthermore, the processor may include the results of cause analysis derived through self-correction (e.g., correction of date format errors, verification of correlation with temperature variables, etc.).

[0303] In another embodiment, the final output may include a code generation and execution guide.

[0304] For such a guide, when a user requests "Build a web crawler with Python," the processor may include an explanation of the implementation logic (e.g., using the requests and BeautifulSoup libraries) under the title "Web crawler for collecting news articles." Additionally, the processor may display the actual generated Python code block and demonstrate that the code works correctly by including a preview of the sandbox execution result (logs of the title and body extracted from a test URL).

[0305] As shown in the examples above, the processor of the present disclosure can significantly improve work productivity by generating and providing to the user a multi-modal report that organically combines charts, graphs, code execution results, and logical suggestions, going beyond simple text responses.


[0307] FIG. 13 is a flowchart illustrating a process in which a processor assigned a data analysis task during the multi-step inference process establishes a specific analysis plan before generating actual code, according to an embodiment of the present disclosure.

[0308] In this embodiment, the analysis/planning agent can operate by receiving the analysis goals for the source data from the orchestrator, and can perform a systematic pre-design step to prevent ad hoc code execution and increase the accuracy of the analysis.

[0309] Referring to FIG. 13, the processor can perform a metadata extraction and schema analysis step (S407). The processor can access the data file to be analyzed (e.g., CSV, Excel, DB table, etc.) and perform a metadata scan that reads only the file header and some sample data (e.g., the top 5 rows) without loading the entire dataset.

[0310] By analyzing the data schema based on the retrieved information, the processor can identify the following items. Specifically, the processor can identify the data structure, including the number of rows and columns. In addition, the processor can determine the data type of each column (numerical, categorical, date, text, etc.) and check whether the data contains missing (empty) values. Furthermore, the processor can identify the dependent variable that serves as the target of the analysis by matching the user query (e.g., "Predict the quality") against column names (e.g., Quality, Target).
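A stdlib-only sketch of the metadata scan in paragraphs [0309] and [0310], assuming a CSV input. Only the header and the first five rows are kept in memory; the remaining rows are streamed solely to count them. The type-inference rules, the ISO date check, and the `target_hints` matching are simplified assumptions, and missing values are detected only within the sample.

```python
import csv
from itertools import islice


def _is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def _infer_type(values):
    """Classify a column from its sample values."""
    present = [v for v in values if v.strip()]
    if not present:
        return "unknown"
    if all(_is_number(v) for v in present):
        return "numerical"
    if all(len(v) == 10 and v[4] == "-" and v[7] == "-" for v in present):
        return "date"  # crude ISO-8601 (YYYY-MM-DD) check
    return "categorical"


def scan_schema(fileobj, sample_rows=5, target_hints=("quality", "target")):
    """Keep the header and top rows; stream the rest only to count rows."""
    reader = csv.reader(fileobj)
    header = next(reader)
    sample = list(islice(reader, sample_rows))
    n_rows = len(sample) + sum(1 for _ in reader)  # counted without storing rows
    columns = list(zip(*sample)) if sample else [() for _ in header]
    schema = {
        name: {"type": _infer_type(col), "has_missing": any(not v.strip() for v in col)}
        for name, col in zip(header, columns)
    }
    target = next((h for h in header if h.lower() in target_hints), None)
    return {"n_rows": n_rows, "n_columns": len(header), "schema": schema, "target": target}
```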

[0311] Next, the processor may perform the steps of defining analysis goals and selecting methodologies. The processor may combine the user's natural language query with data characteristics identified in the previous step to define specific analysis goals and select a statistical or machine learning methodology suitable for them.

[0312] For example, if a user requests, "Create a model that predicts quality by analyzing production process dimension data," the processor may perform the following inference process. First, the processor may identify that the Quality column is categorical data such as OK / NG or 0 / 1, and define the analysis objective as a binary classification problem. Subsequently, the processor may select Logistic Regression, Random Forest, Support Vector Machine (SVM), etc., as suitable algorithm candidates to perform binary classification. If the target variable is continuous numerical data (e.g., price, temperature), the processor may define the analysis objective as regression analysis and select an appropriate algorithm.
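The problem-framing step of paragraph [0312] can be approximated with a simple heuristic on the target column's distinct values. This sketch assumes that exactly two distinct values means binary classification and that an all-numeric column with more values means regression; the returned names are scikit-learn estimator classes listed as candidates only, and a real system would apply more careful checks (for example, integer-coded multiclass labels would be misread as regression here).

```python
CLASSIFIERS = ["LogisticRegression", "RandomForestClassifier", "SVC"]
REGRESSORS = ["LinearRegression", "RandomForestRegressor", "SVR"]


def frame_problem(target_values):
    """Map target-column values to an analysis objective and candidate estimators."""
    distinct = set(target_values)
    numeric = all(isinstance(v, (int, float)) for v in distinct)
    if len(distinct) == 2:
        return "binary_classification", CLASSIFIERS   # e.g. OK/NG or 0/1
    if not numeric:
        return "multiclass_classification", CLASSIFIERS
    return "regression", REGRESSORS                   # continuous, e.g. price or temperature
```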

[0313] Additionally, the processor may perform a step of generating a step-by-step execution sequence. The processor may perform a step of generating an analysis plan that arranges a series of work procedures necessary to execute the selected methodology in a logical order. The analysis plan may serve as a blueprint for the code generation agent to refer to in subsequent steps.

[0314] Specifically, the processor can generate the following standardized sequences. First, the processor can plan a data loading step that loads the entire data into memory using a library such as pandas. Second, the processor can plan a preprocessing step that encodes identified categorical variables (e.g., Quality) into numerical values, handles missing values, and splits the dataset into training and evaluation sets. Third, the processor can plan a model training step that initializes three selected models (Logistic Regression, Random Forest, SVM) and trains them using training data. Fourth, the processor can plan an evaluation and visualization step that calculates the accuracy, precision, recall, F1-score, etc., of each model using test data, and derives the optimal model by visualizing the confusion matrix or ROC curve.

[0315] The processor can then perform a user confirmation and plan finalization step. The processor can convert the generated analysis plan into a natural language summary and present it to the user. For example, the processor can generate and output a message such as, "After checking the data, this is a classification problem that predicts quality. We will train three models, such as Random Forest, and compare their performance. Shall we proceed?"

[0316] When approval input is received from the user, or when the system is in automatic approval mode according to the system settings, the processor may finalize the analysis plan and transfer the process to the next step, the code generation and execution step.

[0317] FIG. 14 may be a flowchart illustrating the process in which, according to an embodiment of the present disclosure, after an analysis plan is determined, a processor (specifically, a code generation and execution agent) generates an executable script and performs a first execution in an isolated environment.

[0318] Referring to FIG. 14, the processor can perform a code generation step (S501).

[0319] Specifically, the processor can take the finalized analysis plan and data schema information as input and perform the step of generating an execution script based on a programming language (e.g., Python) capable of performing each analysis step.

[0320] At this time, the processor can automatically identify the external library required to implement the methodology specified in the analysis plan and write import code.

[0321] For example, pandas for data processing, numpy for numerical computation, matplotlib or seaborn for visualization, and scikit-learn for machine learning modeling may be imported. Additionally, to prevent path issues that may occur during data loading, the processor may configure its code to reference an absolute path within the sandbox environment described below (e.g., /mnt/data/file.csv).
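The library identification and absolute-path setup in paragraphs [0320] and [0321] might be sketched as below. The `STEP_LIBRARIES` mapping from plan-step labels to import lines is hypothetical; only the library names and the `/mnt/data` path come from the text.

```python
# Hypothetical mapping from plan-step labels to import lines.
STEP_LIBRARIES = {
    "load": ["import pandas as pd"],
    "numeric": ["import numpy as np"],
    "visualize": ["import matplotlib.pyplot as plt"],
    "model": [
        "from sklearn.model_selection import train_test_split",
        "from sklearn.ensemble import RandomForestClassifier",
    ],
}
SANDBOX_DATA_DIR = "/mnt/data"  # absolute data directory inside the sandbox


def script_header(plan_steps, filename):
    """Emit de-duplicated import lines followed by an absolute data path."""
    lines = []
    for step in plan_steps:
        for import_line in STEP_LIBRARIES.get(step, []):
            if import_line not in lines:
                lines.append(import_line)
    lines.append(f'DATA_PATH = "{SANDBOX_DATA_DIR}/{filename}"')
    return "\n".join(lines)
```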

[0322] The processor can then perform a sandbox environment initialization and data transfer step (S503).

[0323] Before executing the generated code, the processor may perform a step of initializing an isolated sandbox or virtual container environment to maintain the security of the main system and prevent the code execution from affecting the entire system.

[0324] At this time, the processor can copy or mount the raw data file to be analyzed from the main storage to a designated directory inside the sandbox, thereby creating a state where the generated code can access the data.

[0325] Next, the processor can perform a primary code execution and resource monitoring step (S505).

[0326] Specifically, the processor may perform the step of transmitting the execution script generated in step S501 to an interpreter within the sandbox environment to initiate execution.

[0327] While the code is executing, the processor can monitor the CPU utilization, memory usage, or runtime of the process in real time through a resource monitoring module. If the runtime exceeds a preset threshold (e.g., because of an infinite loop) or memory usage exceeds the allowed range, the processor can forcibly terminate the execution and activate a safety mechanism that reports a resource error to the user.

[0328] Hereinafter, the detailed configuration of a multi-layer defense-based sandbox security and resource control architecture according to an embodiment of the present disclosure is described.

[0329] A sandbox environment according to one embodiment of the present disclosure may be implemented as a multi-layered defense architecture including an execution isolation layer, a resource control layer, a data protection layer, and a network security layer to protect a host system from malicious acts or unintentional errors of the generated source code. To this end, at least one processor of a computing device may perform lightweight virtualization, kernel-level resource allocation, immutable data mounting, and traffic control logic.

[0330] First, the processor can dynamically select and apply isolation technologies that constitute a hybrid execution isolation layer based on the complexity of the code to be executed and the required security level. Specifically, for tasks requiring high-performance computation or access to system libraries, the processor can execute code by instantiating a lightweight micro-virtual machine based on a kernel-based virtual machine (KVM). This provides the effect of fundamentally blocking container escape attacks by physically separating the host kernel and the guest kernel through hardware-assisted virtualization technology. On the other hand, when fast execution speed is required, the processor can use a sandbox runtime that intercepts host kernel system calls and processes them in user space. Through this, the processor can secure execution speeds comparable to native performance while minimizing the system call attack surface. Furthermore, when light computation or execution in a browser environment is required, the processor can control the code to be compiled into a WebAssembly (WASM) binary and executed within an isolated runtime where memory safety is guaranteed.

[0331] Next, the processor performs kernel-level resource control to prevent generated code from falling into an infinite loop or occupying excessive resources and thereby impairing system availability. The processor can apply control group (cgroups) and namespace technologies to set thresholds that strictly limit the maximum CPU usage, memory allocation, and number of processes created for each sandbox instance. If a running process exceeds these thresholds, the processor can preserve system stability by immediately invoking the kernel-level Out of Memory (OOM) killer or suspending the process's CPU scheduling. Additionally, the processor can operate a monitoring daemon that starts a kernel timer at the same moment code execution begins and, if execution does not complete within a preset time (e.g., 30 seconds), forcibly terminates the container and reclaims its allocated resources.

[0332] Furthermore, the processor may apply an immutable mount mechanism to ensure data integrity. By mounting the source data to be analyzed as a read-only volume within the sandbox, the processor can allow the executing code to read the data while blocking, at the file system level, any attempt to modify or delete the original. If data processing or the creation of temporary files is required, the processor may provide a copy-on-write file system that overlays a temporary writable layer on top of the read-only layer. All file changes made during code execution are recorded only in this temporary layer, and the processor can preserve data confidentiality by destroying the temporary layer immediately when the sandbox terminates.
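As a process-level stand-in for the cgroup and timer controls of paragraph [0331], the following sketch runs generated Python in a child process with POSIX resource limits (`resource.setrlimit`) and a wall-clock timeout. It assumes a Linux host; it does not provide the namespace or virtual-machine isolation described in paragraph [0330], and the 1 GiB memory cap is an arbitrary example value.

```python
import resource
import subprocess
import sys


def _cap(limit: int, value: int) -> None:
    """Lower the soft limit to `value` without exceeding the existing hard limit."""
    _soft, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, hard))


def run_limited(code: str, timeout_s: float = 30.0,
                mem_bytes: int = 1 << 30, cpu_s: int = 30) -> dict:
    """Run generated Python in a child process with memory, CPU, and wall-clock caps."""
    def apply_limits() -> None:  # runs in the child just before exec
        _cap(resource.RLIMIT_AS, mem_bytes)  # address-space (memory) cap
        _cap(resource.RLIMIT_CPU, cpu_s)     # CPU-time cap; SIGXCPU when exceeded

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True,
            timeout=timeout_s,  # wall-clock cap; subprocess.run kills the child on expiry
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return {"returncode": None, "stdout": "",
                "stderr": "resource error: time limit exceeded", "timed_out": True}
    return {"returncode": proc.returncode, "stdout": proc.stdout,
            "stderr": proc.stderr, "timed_out": False}
```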

[0333] Finally, the processor can implement network access control policies to control external network access within the sandbox. The processor may apply a whitelist policy that blocks all external network access by default and allows it only when necessary. Specifically, to prevent data leakage or the download of malicious packages that may occur during code execution, the processor may allow only traffic destined for pre-approved repositories or API endpoints, and block all other outbound packets at the firewall level.
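The default-deny allowlist of paragraph [0333] is normally enforced at the firewall, but its decision rule can be shown as a simple check. The hosts in `ALLOWED_HOSTS` are hypothetical examples (`api.example.internal` is a placeholder), and requiring HTTPS is an added assumption.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of pre-approved repositories and API endpoints.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org", "api.example.internal"}


def egress_allowed(url: str) -> bool:
    """Default deny: permit only HTTPS requests to allowlisted hosts."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
```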

[0334] The processor can then perform a result capture step (S507).

[0335] Specifically, the processor may perform a step of capturing and collecting output data from the sandbox when code execution ends (normal termination or error termination). The output data may include standard output (STDOUT), which is text result recorded by output functions within the code; standard error (STDERR), which is exception information and traceback messages that occurred during code execution; and generated artifacts, which are image files or result data files generated while the code is executed.

[0336] The processor can determine whether standard error exists in the captured results and branch to a subsequent step of result verification or self-debugging.
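A minimal sketch of the capture-and-branch step in paragraphs [0335] and [0336], assuming generated artifacts are written to a single output directory. Following the text, the branch depends only on whether standard error is non-empty; in practice warnings are also written to standard error, so a real system might filter them first.

```python
from pathlib import Path


def capture_result(stdout: str, stderr: str, output_dir: str) -> dict:
    """Collect STDOUT, STDERR, and generated files, then pick the next step."""
    artifacts = sorted(p.name for p in Path(output_dir).iterdir() if p.is_file())
    next_step = "self_debugging" if stderr.strip() else "result_verification"
    return {"stdout": stdout, "stderr": stderr, "artifacts": artifacts, "next_step": next_step}
```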

[0337] FIG. 15 may be a flowchart illustrating the process of a processor performing a self-debugging loop when an error occurs in the result of a first code execution, according to an embodiment of the present disclosure.

[0338] A self-correcting mechanism according to one embodiment of the present disclosure may be implemented as a closed-loop feedback control system that detects an error signal from an execution result and updates the source code, which is a control variable, in a direction that minimizes the error, rather than simply repeating re-execution. To this end, at least one processor of a computing device may perform a series of technical processes including structural error parsing and fault location tracking, reflection-based diagnosis, deterministic patch generation and verification, and empirical learning memory updating.

[0339] Referring to FIG. 15, the processor can perform an execution result analysis and error determination step (S601). Specifically, the processor can run the generated execution code in an isolated environment, and self-detect and correct errors occurring during execution to finally generate a result (hereinafter referred to as an artifact) in the form requested by the user.

[0340] Specifically, the processor may instantiate a secure sandbox environment and transmit the generated code to an interpreter inside the sandbox to initiate execution. At this time, the processor may ensure data integrity by mounting source data required for code execution (e.g., CSV, PDF, etc.) as a read-only volume within the sandbox.

[0341] In addition, the processor may operate a safety mechanism that monitors CPU and memory usage in real time at the kernel level and forcibly terminates the process when a threshold is exceeded, in order to prevent infinite loops or excessive memory occupation that may occur during code execution.

[0342] Specifically, the processor may perform a step of analyzing the existence and content of a standard error (STDERR) stream among the execution result data received from the sandbox.

[0343] In addition, the processor can treat a run as a logical error, and enter self-correction mode, not only when an error message is present but also when the standard output (STDOUT) is empty (Empty Output) or the format of the result data differs from what was expected (e.g., an image should have been generated but only text was returned).

[0344] More specifically, in an embodiment, the processor may perform preprocessing to convert the raw log returned from the sandbox execution environment into a structured error object using a regular expression and an abstract syntax tree parser, rather than treating it as text.
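The preprocessing of paragraph [0344] can be sketched with Python's `ast` and `re` modules: syntax errors are found by parsing the source, and runtime errors by parsing the traceback. The fields of the structured object are assumptions, and `script_name` defaults to `"<string>"`, the file name Python reports for code run with `python -c`.

```python
import ast
import re

# Traceback frame line, e.g.:  File "<string>", line 2, in <module>
FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+)')
# Final "ExceptionType: message" line of a traceback
EXC_RE = re.compile(r"^(?P<type>[A-Za-z_][\w.]*): ?(?P<message>.*)$")


def parse_error(stderr: str, source: str, script_name: str = "<string>") -> dict:
    """Turn a raw error log into a structured error object."""
    try:
        ast.parse(source)  # syntax errors are detectable without running the code
    except SyntaxError as err:
        return {"category": "syntax", "type": "SyntaxError", "message": err.msg, "line": err.lineno}
    own_lines = [int(line) for file, line in FRAME_RE.findall(stderr) if file == script_name]
    log_lines = [ln for ln in stderr.splitlines() if ln.strip()]
    match = EXC_RE.match(log_lines[-1].strip()) if log_lines else None
    return {
        "category": "runtime",
        "type": match["type"] if match else None,
        "message": match["message"] if match else stderr.strip(),
        "line": own_lines[-1] if own_lines else None,  # innermost frame in the generated script
    }
```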

[0345] In this way, the processor can analyze the standard error stream output by the interpreter and classify the types of errors into syntax errors, runtime errors, and logical errors. In this case, even if an error message is not explicitly generated, the processor may define a case where the result is not a number (NaN) or is empty as a silent failure, treat it as an error, and initiate a self-correction loop.

[0346] Furthermore, the processor can parse the traceback information to identify the exact line number where the error occurred and apply a spectrum-based fault localization algorithm to generate a heatmap of suspicious code sections that are highly likely to have caused the error.
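The disclosure does not name a specific spectrum-based formula; Ochiai is one widely used choice and is shown here as an assumption. Given per-run line coverage and pass/fail outcomes (for example, from generated unit tests), each line's suspiciousness is ef / sqrt(F * (ef + ep)), where ef and ep count the failing and passing runs that executed the line and F is the total number of failing runs. The resulting per-line scores are the kind of data a heatmap would display.

```python
import math


def ochiai_scores(coverage, outcomes):
    """Rank lines by Ochiai suspiciousness, most suspicious first.

    coverage: one set of executed line numbers per run.
    outcomes: True where that run failed.
    """
    total_failed = sum(outcomes)
    lines = set().union(*coverage)
    scores = {}
    for line in lines:
        ef = sum(1 for cov, failed in zip(coverage, outcomes) if failed and line in cov)
        ep = sum(1 for cov, failed in zip(coverage, outcomes) if not failed and line in cov)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```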

[0347] In an additional embodiment, when code execution ends, the processor may determine whether the execution was successful by capturing standard output (STDOUT) and standard error (STDERR) streams. According to one embodiment of the present disclosure, the processor may perform multidimensional error detection logic to verify not only the presence or absence of error messages but also the completeness of the output.

[0348] The processor can perform multidimensional error detection logic that covers syntax and runtime errors, logical errors, and empty output or artifact format mismatches.

[0349] Specifically, through first, syntax and runtime error detection logic, the processor can identify defects in the code itself, such as missing libraries or variable type mismatches.

[0350] In addition, through second, logical error and empty output detection logic, the processor may treat a run as an error when the code terminates normally but the result value is 0, or the generated chart image is blank, because a data filtering condition is incorrect.

[0351] In addition, through third, artifact format mismatch detection logic, the processor can identify a task failure when, for example, the user requests an Excel file but the code outputs only a plain text result.
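The three detection checks can be combined into one classifier, sketched below. The order of the checks, the category names, and treating any result value of 0 as suspicious are simplifying assumptions (a legitimately zero result would be flagged); blank-image detection is omitted.

```python
import math
from pathlib import Path
from typing import List, Optional


def detect_failure(stderr: str, stdout: str, artifacts: List[str],
                   expected_ext: Optional[str] = None, result_value=None) -> Optional[str]:
    """Return a failure category, or None if the run looks successful."""
    if stderr.strip():
        return "syntax_or_runtime"          # first: explicit interpreter error
    if not stdout.strip() and not artifacts:
        return "empty_output"               # second: silent failure, nothing produced
    if isinstance(result_value, float) and math.isnan(result_value):
        return "logical_nan"                # second: result is not a number
    if result_value == 0:
        return "logical_zero"               # second: e.g. a wrong filter condition
    if expected_ext and not any(Path(a).suffix.lower() == expected_ext for a in artifacts):
        return "artifact_format_mismatch"   # third: e.g. .xlsx requested, text only
    return None
```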

[0352] If an error is detected, the processor may activate a self-correcting loop.

[0353] Specifically, the processor can perform an error log-based cause inference step (S603). More specifically, the processor can input the original source code and the captured error log into a foundation model to infer the cause of the error.

[0354] The processor can analyze the cause of an error by feeding the captured error log and the original source code back into the hybrid model. Here, the hybrid model uses its long-context processing capability to analyze the entire error traceback, which may run to thousands of lines, and identify the root cause. For example, if an error occurs during PowerPoint (PPT) generation because an image exceeds the allowed size, the processor can recognize this through the hybrid model and formulate a correction plan that adds preprocessing code to resize the image to fit the slide specifications.

[0355] As another example, if an error message such as ValueError: could not convert string to float: '12,390' occurs, the processor can analyze it, identify the cause as "the number conversion function (float()) cannot process strings containing commas (,)," and locate the offending data as "the GNP or income column in the CSV file contains thousands separators (,)."

[0356] More specifically, the processor can perform a cause inference step that linguistically specifies a correction strategy through a tracing process, using the identified error type and suspected section as input.

[0357] The processor can perform pruning logic that excludes previously failed modification methods by comparing the current error log with the history of past attempts stored in the short-term memory of the current session.

[0358] In addition, the processor can enable the self-criticism function of the foundation model to generate natural language feedback, such as "operation error due to variable type mismatch." By injecting the generated natural language feedback into the prompt context for generating code in the next turn, the processor can achieve a lightweight reinforcement learning effect without updating the model's weights.

[0359] The processor can then perform a modification plan establishment and code regeneration (Re-generation) step (S605).

[0360] Specifically, the processor may establish a fix plan to resolve the identified cause and perform the step of regenerating the executable code by reflecting it.

[0361] The processor can generate modified code with defects removed based on an established modification plan. The processor can send the modified code back to the sandbox environment for re-execution and repeat the analysis-modification-execution process within a preset number of times (e.g., 3 times) until a normal result is produced.

[0362] In addition, when self-correction succeeds, the processor can store the error pattern paired with the solution code in episode memory, so that when a similar error occurs in the future the solution can be applied immediately without repeating the reasoning process.

[0363] Specifically, in the above example, the processor may establish an internal plan stating, "An error occurred during data preprocessing. I will try again after removing the commas," and generate modified code by adding string replacement logic to the existing code. For example, code such as data['GNP'] = data['GNP'].astype(str).str.replace(',', '').astype(float) may be added.

[0364] More specifically, the processor may perform a code modification step that generates a patch to modify the actual source code to resolve the diagnosed cause, while applying safeguards to ensure integrity.

[0365] If the cause of the error is a missing library (e.g., ModuleNotFoundError), the processor may perform infrastructure-level healing by first executing a command to install the corresponding package in the sandbox environment instead of modifying the source code, or by dynamically modifying system path environment variables.
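The infrastructure-level healing of paragraph [0365] could start by turning the `ModuleNotFoundError` message into an install command. `PACKAGE_ALIASES` is a small hypothetical table for import names that differ from their package names; the command is returned rather than executed, so it can be checked against the sandbox's network allowlist first.

```python
import re
import sys
from typing import List, Optional

MISSING_RE = re.compile(r"ModuleNotFoundError: No module named '(?P<name>[\w.]+)'")
# Hypothetical alias table: import names whose pip package names differ.
PACKAGE_ALIASES = {"sklearn": "scikit-learn", "bs4": "beautifulsoup4", "cv2": "opencv-python"}


def healing_command(stderr: str) -> Optional[List[str]]:
    """Build a pip install command for a missing module, or return None."""
    match = MISSING_RE.search(stderr)
    if match is None:
        return None
    top_level = match["name"].split(".")[0]  # 'bs4.element' -> 'bs4'
    package = PACKAGE_ALIASES.get(top_level, top_level)
    return [sys.executable, "-m", "pip", "install", package]
```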

[0366] In addition, the processor may dynamically generate unit test cases for the modified function and run them alongside the code to verify that the modification does not break existing functionality.

[0367] The processor can then perform a recursive execution and loop control step (S607).

[0368] Specifically, the processor may perform the step of requesting re-execution by sending the modified code back to the sandbox environment. At this time, the processor may control the maximum number of retries (Max Retry Count) by setting it (e.g., 3 times) to prevent an infinite loop.

[0369] If the modified code executes successfully and produces a result, the processor can exit the loop and proceed to the next step (result interpretation). Conversely, if an error occurs again, the processor can return to step S603 to analyze the new error message. If the number of retries reaches a threshold, the processor can output an error report to the user along with the message "Automatic correction failed" and terminate the process.
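The bounded loop of paragraphs [0361], [0368], and [0369] can be sketched with the execute, diagnose, and regenerate steps passed in as callables (in the described system these would be the sandbox and the foundation model). The function name, callable signatures, and return format are hypothetical; passing earlier diagnoses to `regenerate` is one way to support the pruning of paragraph [0357].

```python
from typing import Callable, List, Optional, Tuple


def self_correct(
    code: str,
    execute: Callable[[str], Optional[str]],
    diagnose: Callable[[str, str], str],
    regenerate: Callable[[str, str, List[str]], str],
    max_retries: int = 3,
) -> dict:
    """Bounded execute -> diagnose -> regenerate loop.

    `execute` returns None on success or an error log on failure.
    """
    history: List[Tuple[str, str]] = []  # (diagnosis, patched code) for this session
    error = execute(code)
    while error is not None:
        if len(history) >= max_retries:
            return {"status": "Automatic correction failed", "code": code,
                    "error": error, "history": history}
        diagnosis = diagnose(code, error)
        tried = [d for d, _ in history]  # earlier diagnoses, so regenerate can prune them
        code = regenerate(code, diagnosis, tried)
        history.append((diagnosis, code))
        error = execute(code)  # re-run the patched code in the sandbox
    return {"status": "ok", "code": code, "history": history}
```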

[0370] Finally, the processor may perform a debugging result learning and context update step (S609). Specifically, when the self-correcting loop terminates successfully, the processor may perform a result generation and context update step in which the query representing the problem situation, the resulting error log, and the patch code that resolved it are stored together in episode memory in the form of vector embeddings.

[0371] If self-correction is successful, the processor may perform a step of summarizing information (Debugging History) regarding "what error occurred and how it was corrected" and storing it in shared memory. This can be utilized as a context to prevent the same error in advance (Pre-correction) when performing similar tasks in the future (e.g., analysis of income data from other years).

[0372] Through this, if a similar error pattern is detected in the future, the processor can optimize the computational efficiency of the system by performing case-based reasoning, which immediately retrieves and applies the corresponding solution from memory without undergoing a complex reasoning process.
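Case-based recall from episode memory can be sketched as below. The disclosure stores vector embeddings; this sketch swaps in `difflib.SequenceMatcher` over normalized error signatures so that it needs no embedding model, and the 0.8 similarity threshold is an arbitrary assumption.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def error_signature(log: str) -> str:
    """Keep the final log line and mask quoted literals and numbers."""
    lines = log.strip().splitlines()
    last = lines[-1] if lines else ""
    last = re.sub(r"'[^']*'", "'<val>'", last)
    return re.sub(r"\d+", "<n>", last)


class EpisodeMemory:
    """Stores (query, error signature, patch) episodes for case-based reuse."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.episodes: List[Tuple[str, str, str]] = []
        self.threshold = threshold

    def store(self, query: str, error_log: str, patch: str) -> None:
        self.episodes.append((query, error_signature(error_log), patch))

    def recall(self, error_log: str) -> Optional[str]:
        """Return the patch of the most similar past error, if similar enough."""
        signature = error_signature(error_log)
        best_patch, best_score = None, 0.0
        for _query, past_signature, patch in self.episodes:
            score = SequenceMatcher(None, signature, past_signature).ratio()
            if score > best_score:
                best_patch, best_score = patch, score
        return best_patch if best_score >= self.threshold else None
```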

[0373] FIG. 16 may be a flowchart illustrating a process in which, according to an embodiment of the present disclosure, after code is executed normally through a self-correction process, a processor analyzes the execution result to derive a final answer.

[0374] Referring to FIG. 16, the processor can perform an execution output acquisition and parsing step (S701).

[0375] Specifically, if the code execution is ultimately successful, the processor may perform the step of converting and packaging the execution result into an artifact of the format requested by the user.

[0376] In an embodiment, the processor can generate an editable file (e.g., PPTX, XLSX, DOCX) by combining text, numerical data, chart images, etc., produced through code execution. For example, when a financial report is requested, the processor can complete a PPT file by automatically placing sales trend graphs associated with analyzed text summaries on each slide.

[0377] In addition, the processor may generate heterogeneous artifacts in the format requested by the user. In an embodiment, the processor may create and provide an interactive visualization object based on structured data. In addition to static images, the processor may create and provide an interactive chart object (e.g., HTML/JavaScript-based) that allows the user to click on data points or zoom in and out in a web browser.

[0378] In this way, the processor can support users in obtaining high-quality results that can be immediately utilized for work without separate coding knowledge or subsequent editing work, through autonomous error correction and artifact generation processes within the sandbox.

[0379] To explain this process more specifically, when code execution in the sandbox environment successfully terminates (Success), the processor can perform a step of receiving and parsing all outputs generated inside the sandbox to the main system.

[0380] The output obtained at this time may include standard output (STDOUT), which is a text result recorded by output functions within the code, and generated artifacts, which are image files or data files generated as a result of code execution. The processor may extract key metrics from the text output and convert them into structured data, and may convert the binary output into a URL link or an embedded image format that can be displayed to a user.
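A sketch of the parsing in paragraph [0380], assuming metrics are printed as `Name: number` lines and images arrive as PNG bytes. Metrics become a dictionary, and images become base64 data URIs that a user interface can display inline; both conventions are assumptions.

```python
import base64
import re

# One "Name: number" metric per line, e.g. "Accuracy: 0.97"
METRIC_RE = re.compile(
    r"^[ \t]*(?P<name>[A-Za-z][\w -]*?)[ \t]*:[ \t]*(?P<value>-?\d+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_outputs(stdout: str, images: dict):
    """Extract numeric metrics from text and embed image bytes as data URIs."""
    metrics = {m["name"]: float(m["value"]) for m in METRIC_RE.finditer(stdout)}
    embedded = {
        name: "data:image/png;base64," + base64.b64encode(data).decode("ascii")
        for name, data in images.items()
    }
    return metrics, embedded
```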

[0381] The processor can then perform a result interpretation and answer generation step (S703) based on a foundation model (e.g., an LLM and/or a LAM).

[0382] As one of the key features of the present disclosure, the processor may not merely list execution results, but may also perform a step of interpreting the meaning of numerical results in natural language using a foundation model.

[0383] The processor can configure the user's initial question, the written source code, and the code execution results (text outputs) as a prompt context and input it into the LLM. Based on this context, the LLM can generate response text that includes explanations of and insights into the results.
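A minimal sketch of assembling that prompt context follows. The section headings and the instruction wording are hypothetical; the disclosure does not specify a prompt template, only that the question, the code, and the execution output are combined.

```python
def build_interpretation_prompt(question: str, source_code: str, stdout: str) -> str:
    """Combine the question, the executed code, and its output into one prompt context."""
    # Hypothetical section layout; any template that keeps the three inputs distinct would do.
    sections = [
        ("User question", question),
        ("Source code executed in the sandbox", source_code),
        ("Execution output (STDOUT)", stdout),
        ("Instruction",
         "Explain in plain language what these results mean for the user's question, "
         "including any insights or caveats."),
    ]
    return "\n\n".join(f"### {title}\n{body.strip()}" for title, body in sections)


prompt = build_interpretation_prompt(
    "How accurate is the quality prediction model?",
    "print(f'Accuracy: {accuracy:.2f}')",
    "Accuracy: 0.97\n",
)
```

Keeping the original question in the context is what lets the model turn a bare number such as "Accuracy: 0.97" into an answer framed around the user's actual goal.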

[0384] For example, if the execution result is "Accuracy: 0.97", the processor can convert this into a contextually rich answer such as, "As a result of training a binary classification model that predicts quality based on production process dimension data, the prediction accuracy was found to be 97%. This implies very high confidence and is a level applicable to actual processes."

[0385] The processor can then perform a multimodal result synthesis step (S705).

[0386] Specifically, the processor can construct a highly readable multimodal response by inserting graph or chart images of heterogeneous formats at appropriate locations that follow the flow of the text description, in accordance with the requested format. For example, the corresponding heatmap image can be placed immediately after the sentence "The results of the confusion matrix analysis are as follows."
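The placement rule in the example above can be sketched as follows, assuming each image is tied to a hypothetical anchor phrase and that the text is already generated. Splitting sentences on terminal punctuation is a simplification chosen for brevity.

```python
import re


def place_images(text: str, anchors: dict[str, str]) -> list[dict]:
    """Interleave text sentences and images, placing each image after the first
    sentence that contains its anchor phrase."""
    # Naive sentence split on terminal punctuation; a real system would use a proper segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    placed = set()
    blocks = []
    for sentence in sentences:
        blocks.append({"type": "text", "content": sentence})
        for phrase, image_ref in anchors.items():
            if phrase not in placed and phrase in sentence:
                blocks.append({"type": "image", "src": image_ref})
                placed.add(phrase)
    return blocks


report = ("The model reached 97% accuracy. "
          "The results of the confusion matrix analysis are as follows. "
          "Most errors are false negatives.")
blocks = place_images(report, {"confusion matrix": "heatmap.png"})
print([b["type"] for b in blocks])  # ['text', 'text', 'image', 'text']
```

The resulting block list is format-neutral, so the same sequence could be rendered as HTML, slides, or a document.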

[0387] That is, the processor can integrate the execution results of the plurality of subtasks and the self-corrected output, present the final result through a user interface, and provide a cross-reference function.

[0388] For example, the processor may perform a step of synthesizing a final result report by combining the interpretation text generated in step S703 and the visualization image obtained in step S701.

[0389] First, the processor can perform an operation to collect result data from each distributed subtask and synthesize it into a single complete answer. At this time, the processor can perform conflict resolution logic to verify whether there are contradictions or numerical discrepancies between information obtained from different sources (e.g., web search results and an internal database).

[0390] In one embodiment, the processor may prioritize which data to adopt based on the reliability score or recency of each information source, or may generate an answer that states the uncertainty explicitly, such as "Source A describes it as X, while Source B describes it as Y."
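The conflict resolution logic of paragraphs [0389] and [0390] can be sketched for a single numeric fact as follows. The `SourcedValue` record, the 0-to-1 reliability score, and the 1% relative tolerance used to decide that two sources disagree are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class SourcedValue:
    source: str         # e.g. a web search result or an internal database
    value: float
    reliability: float  # hypothetical reliability score in [0, 1]
    published: date


def has_conflict(items: list[SourcedValue], rel_tol: float = 0.01) -> bool:
    """True if any two sources disagree by more than the relative tolerance."""
    return any(
        not math.isclose(a.value, b.value, rel_tol=rel_tol)
        for i, a in enumerate(items)
        for b in items[i + 1:]
    )


def resolve(items: list[SourcedValue], prefer: str = "reliability",
            state_uncertainty: bool = False, rel_tol: float = 0.01) -> str:
    """Adopt one value by priority, or spell out the disagreement."""
    if not has_conflict(items, rel_tol):
        return f"{items[0].value} (consistent across {len(items)} sources)"
    if state_uncertainty:
        return ", and ".join(f"{it.source} reports {it.value}" for it in items) + "."
    if prefer == "recency":
        best = max(items, key=lambda it: it.published)
    else:
        best = max(items, key=lambda it: it.reliability)
    return f"{best.value} (adopted from {best.source})"


readings = [
    SourcedValue("Web search", 120.0, reliability=0.6, published=date(2025, 3, 1)),
    SourcedValue("Internal database", 135.0, reliability=0.9, published=date(2024, 1, 1)),
]
print(resolve(readings))                          # 135.0 (adopted from Internal database)
print(resolve(readings, prefer="recency"))        # 120.0 (adopted from Web search)
print(resolve(readings, state_uncertainty=True))  # Web search reports 120.0, and Internal database reports 135.0.
```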

[0391] Additionally, according to one embodiment of the present disclosure, the processor may perform an indexing step that stores, in a database, the correspondence between the generated answer text and the original source documents, to ensure the transparency and reliability of the answer. Specifically, the processor may divide the generated answer into sentence units or semantic units and extract, for each unit, the identifier of the original document it references and the specific location within that document (e.g., page number, paragraph coordinates, text span index). Based on the extracted information, the processor may create a mapping table that defines the correspondence between each answer unit and its source location, and store the table in memory.
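A minimal sketch of building such a mapping table follows. It assumes, hypothetically, that the answer generator emits an inline marker such as `[[doc=q3.pdf;page=3;span=10-40]]` before each sentence's final punctuation; the marker syntax, the `MappingRow` fields, and the sentence splitting are illustrative, since the disclosure does not say how references are extracted.

```python
import re
from dataclasses import dataclass

# Hypothetical inline citation marker the generator is assumed to emit before a
# sentence's final punctuation, e.g. "[[doc=q3.pdf;page=3;span=10-40]]".
CITATION = re.compile(r"\s*\[\[doc=([^;\]]+);page=(\d+);span=(\d+)-(\d+)\]\]")


@dataclass(frozen=True)
class MappingRow:
    sentence_id: int
    sentence: str          # answer sentence with the citation markers removed
    doc_id: str            # identifier of the referenced original document
    page: int
    span: tuple[int, int]  # character offsets of the supporting text on that page


def build_mapping_table(answer: str) -> list[MappingRow]:
    """Split the answer into sentences and record one row per cited source location."""
    rows = []
    for sentence_id, raw in enumerate(re.split(r"(?<=[.!?])\s+", answer.strip())):
        clean = CITATION.sub("", raw).strip()
        for doc_id, page, start, end in CITATION.findall(raw):
            rows.append(MappingRow(sentence_id, clean, doc_id, int(page), (int(start), int(end))))
    return rows


answer = ("Sales rose 12% [[doc=q3.pdf;page=3;span=10-40]]. "
          "Margins fell [[doc=q3.pdf;page=5;span=0-25]].")
for row in build_mapping_table(answer):
    print(row.sentence_id, row.sentence, row.doc_id, row.page, row.span)
```

A sentence that cites several documents simply produces several rows with the same `sentence_id`.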

[0392] Furthermore, the processor can render a page in the form of a highly readable multimodal report by inserting a visualization chart or image object generated in the preceding step at a contextually appropriate location between text descriptions.

[0393] Based on the mapping table, the processor can implement a bidirectional cross-reference interface on the display of a user terminal that allows for intuitive comparison of the answer and the original text.

[0394] The processor may provide a split layout that displays the generated final answer in a first area of the screen and the referenced original document (e.g., in a PDF viewer) in a second area.

[0395] Additionally, when input is detected in which a user selects (clicks or hovers) a specific sentence among the answer texts in the first area, the processor may refer to the mapping table to move the scroll to the corresponding page of the original document displayed in the second area and generate a control signal to render visual highlighting on the paragraph or sentence that serves as the basis.

[0396] Conversely, when a user selects a specific document from the original document list in the second area, the processor can trace back and visually highlight all sentences in the answer text in the first area that cite that document, so that the user can immediately identify which parts of the answer the document contributed to.
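The two lookup directions of paragraphs [0395] and [0396] can be sketched as an index built once from the mapping table. The row layout `(sentence_id, doc_id, page, (start, end))` and the returned dictionaries standing in for UI control signals are hypothetical.

```python
from collections import defaultdict


class CrossReferenceIndex:
    """Bidirectional lookup over hypothetical mapping-table rows of the form
    (sentence_id, doc_id, page, (start, end))."""

    def __init__(self, rows):
        self._by_sentence = defaultdict(list)  # sentence -> source locations
        self._by_document = defaultdict(set)   # document -> citing sentences
        for sentence_id, doc_id, page, span in rows:
            self._by_sentence[sentence_id].append((doc_id, page, span))
            self._by_document[doc_id].add(sentence_id)

    def on_sentence_selected(self, sentence_id):
        """Answer to source: where to scroll and what to highlight in the second area."""
        return [
            {"action": "scroll_and_highlight", "doc_id": d, "page": p, "span": s}
            for d, p, s in self._by_sentence.get(sentence_id, [])
        ]

    def on_document_selected(self, doc_id):
        """Source to answer: which sentences to highlight in the first area."""
        return sorted(self._by_document.get(doc_id, set()))


index = CrossReferenceIndex([
    (0, "q3.pdf", 3, (10, 40)),
    (1, "q3.pdf", 5, (0, 25)),
    (2, "memo.pdf", 1, (5, 60)),
])
print(index.on_sentence_selected(1))
print(index.on_document_selected("q3.pdf"))  # [0, 1]
```

Using `.get` rather than indexing keeps lookups for unknown sentences or documents from inserting empty entries into the `defaultdict`.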

[0397] In addition, the processor can convert high-quality artifacts (e.g., editable PPT, Excel files, etc.) generated in the preceding self-correction step into link objects that the user can download and provide them on one side of the interface. Furthermore, if the user requests additional modification or data updates regarding the generated result, the processor can control an iterative interaction flow that creates a new subtask while maintaining the current context and returns to the planning stage.

[0398] In this way, the system of the present disclosure does not merely generate and list information, but provides an environment where the source of the information can be clearly visualized and verified, thereby providing practical utility that allows users to trust the output of artificial intelligence and immediately utilize it for actual business decision-making.

[0399] Finally, the processor can perform subtask completion processing and a higher-level reporting step (S707).

[0400] When the final answer generation is complete, the processor may perform the step of changing the status of the corresponding subtask (e.g., data analysis task) to Completed.

[0401] Subsequently, the processor can update the generated final answer and output data in the shared memory and transmit a task completion signal to the orchestrator described in the embodiment. Through this, the orchestrator can proceed with the next sequence of subtasks or, if all tasks are completed, continue the control flow to output the final result to the user.
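The completion hand-off in paragraphs [0400] and [0401] can be sketched as below. The `Orchestrator` class, the dictionary used as shared memory, and the queue used for completion signals are hypothetical stand-ins for whatever mechanism the system actually uses.

```python
import queue
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"


@dataclass
class Subtask:
    task_id: str
    status: Status = Status.PENDING


class Orchestrator:
    """Hypothetical orchestrator that owns the plan, shared memory, and completion signals."""

    def __init__(self, plan):
        self.plan = plan              # subtasks in execution order
        self.shared_memory = {}       # task_id -> final answer and output data
        self.signals = queue.Queue()  # task-completion signals

    def complete(self, task, answer, outputs):
        """Mark the subtask Completed, publish its results, and signal completion."""
        task.status = Status.COMPLETED
        self.shared_memory[task.task_id] = {"answer": answer, "outputs": outputs}
        self.signals.put(task.task_id)

    def next_action(self):
        """Consume one completion signal, then dispatch the next subtask or finish."""
        self.signals.get_nowait()
        for task in self.plan:
            if task.status is not Status.COMPLETED:
                return ("dispatch", task.task_id)
        return ("output_final_result", None)


plan = [Subtask("web_search"), Subtask("data_analysis")]
orchestrator = Orchestrator(plan)
orchestrator.complete(plan[0], "Summary of sources", [])
print(orchestrator.next_action())  # ('dispatch', 'data_analysis')
orchestrator.complete(plan[1], "Accuracy was 97%.", ["confusion_matrix.png"])
print(orchestrator.next_action())  # ('output_final_result', None)
```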

[0402]

[0403] Hereinafter, with reference to FIGS. 17 to 19, a specific service interface and an example of utilization in which the agent system described above is implemented on a user terminal will be explained.

[0404] FIG. 17 shows an example of a search setting interface through which a user can directly control the source range and analysis depth of information, according to one embodiment of the present disclosure.

[0405] Referring to FIG. 17, the processor may provide a settings panel for controlling the scope and depth of the search through a user interface. Specifically, when a user enters an input to activate a deep analysis mode (deep dive mode), the processor may internally increase the inference budget and apply an allowlist policy that limits the search targets to reliable academic journals or patent databases.
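The effect of the deep dive toggle can be sketched as a small policy object. The token budgets and the source-type labels below are hypothetical values chosen for illustration; the disclosure states only that the budget increases and that an allowlist restricts the search targets.

```python
from dataclasses import dataclass
from typing import Optional

NORMAL_BUDGET = 2_048       # hypothetical thought-token budget in normal mode
DEEP_DIVE_BUDGET = 16_384   # hypothetical larger budget for deep analysis mode
TRUSTED_SOURCE_TYPES = frozenset({"academic_journal", "patent_db"})


@dataclass(frozen=True)
class SearchPolicy:
    inference_budget: int = NORMAL_BUDGET
    allowlist: Optional[frozenset] = None  # None: every source type may be searched


def policy_for(deep_dive: bool) -> SearchPolicy:
    """Return the search policy implied by the deep dive toggle."""
    if deep_dive:
        return SearchPolicy(inference_budget=DEEP_DIVE_BUDGET, allowlist=TRUSTED_SOURCE_TYPES)
    return SearchPolicy()


def is_searchable(policy: SearchPolicy, source_type: str) -> bool:
    return policy.allowlist is None or source_type in policy.allowlist


deep = policy_for(True)
print(deep.inference_budget, is_searchable(deep, "blog"))  # 16384 False
```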

[0406] Conversely, if a user wishes to exclude non-professional sources such as blogs or wikis, the processor may identify specific internet addresses entered in the domain exclusion field and perform logic to apply a weight penalty to documents from the corresponding domain source or filter them from the result list during the search result reordering process.
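Both options in this paragraph (filtering excluded domains out, or applying a weight penalty during reordering) can be sketched with the standard library. The `(url, score)` result shape, the 0.5 penalty factor, and treating subdomains as part of an excluded domain are assumptions.

```python
from urllib.parse import urlparse


def _host(entry: str) -> str:
    """Normalize a URL or bare domain to a lowercase host without a leading 'www.'."""
    entry = entry.strip().lower()
    host = urlparse(entry).hostname if "//" in entry else entry
    return (host or "").removeprefix("www.")


def _is_excluded(url: str, excluded: set[str]) -> bool:
    host = _host(url)
    # Matches the excluded domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in excluded)


def rerank(results: list[tuple[str, float]], excluded_domains: list[str],
           mode: str = "filter", penalty: float = 0.5) -> list[tuple[str, float]]:
    """Drop (mode="filter") or down-weight (mode="penalty") results from excluded domains."""
    excluded = {_host(d) for d in excluded_domains}
    reranked = []
    for url, score in results:
        if _is_excluded(url, excluded):
            if mode == "filter":
                continue
            score *= penalty
        reranked.append((url, score))
    return sorted(reranked, key=lambda item: item[1], reverse=True)


results = [("https://www.myblog.com/post", 0.9),
           ("https://journal.org/paper", 0.8),
           ("https://en.wikipedia.org/wiki/X", 0.7)]
print(rerank(results, ["myblog.com", "https://wikipedia.org"]))
```

Filtering guarantees that excluded sources never reach the answer, while the penalty keeps them available as a fallback when few other documents match.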

[0407] In addition, when a large volume of unstructured documents is uploaded through the file upload area, the processor can perform a preprocessing process of dividing them into chunks of a preset size and indexing them in a vector database.
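A minimal sketch of the preprocessing step follows: fixed-size character chunks with overlap, indexed for similarity search. A bag-of-words term-frequency vector stands in for a real embedding model and an in-memory list stands in for the vector database; the chunk size of 500 and the overlap of 50 are hypothetical defaults.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` characters, each overlapping the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: a bag-of-words term-frequency vector.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # (doc_id, chunk_no, text, vector)

    def add_document(self, doc_id: str, text: str, size: int = 500, overlap: int = 50):
        for n, piece in enumerate(chunk(text, size, overlap)):
            self.entries.append((doc_id, n, piece, embed(piece)))

    def search(self, query: str, k: int = 3):
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: cosine(q, e[3]), reverse=True)
        return [(doc_id, n, text) for doc_id, n, text, _ in scored[:k]]


index = VectorIndex()
index.add_document("hr.pdf", "Vacation policy and annual leave days for employees.")
index.add_document("finance.pdf", "Quarterly revenue and operating margin by region.")
print(index.search("operating margin", k=1))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.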

[0408] FIG. 18 is an example screen diagram that verifies the reliability of information by providing a bidirectional cross-referencing function between a generated answer and a source document according to one embodiment of the present disclosure.

[0409] Referring to FIG. 18, the processor can control the display screen to be divided into a first area and a second area. At this time, the first area may display the final answer text generated by the artificial intelligence model, and the second area may display the original document viewer that serves as the basis for the answer.

[0410] When the processor detects an interaction in which a user selects (clicks or hovers the mouse over) a specific sentence or numerical data in the first area, it can refer to the pre-established mapping table to automatically scroll the original document viewer in the second area to the page where the corresponding information is located. Furthermore, the processor can generate a control signal to visually highlight the corresponding section of the original text.

[0411] Through this, the processor can provide an environment where the user can immediately verify the authenticity of generated information by comparing it with the original text, thereby alleviating the user's anxiety regarding the phenomenon of artificial intelligence models generating false information and ensuring data transparency.

[0412] FIG. 19 is an example screen diagram showing a download interface for multimodal analysis results and structured outputs according to one embodiment of the present disclosure.

[0413] Referring to FIG. 19, the processor can provide a visualization chart, which is the result of data analysis, by inserting it into a text response at a position that fits the sentence flow. Furthermore, the processor can convert the entire analyzed result into an editable office document object for the user and provide it in response to the user's request.

[0414] For example, when input is received in which a user selects the "Get as presentation file" object, the processor can convert the generated text into a slide body, convert the chart image into a chart object editable in the corresponding presentation software, generate a file (e.g., .pptx), and provide a download link.

[0415] Through such a configuration, the processor can provide the effect of enabling the user to immediately utilize the output of artificial intelligence as a business report, etc., without any separate processing work.

[0416]

[0417] The embodiments according to the present disclosure described above may be implemented in the form of program instructions that can be executed through various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, etc., either alone or in combination. The program instructions recorded on the computer-readable recording medium may be those specifically designed and configured for the present disclosure or those known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices may be configured to operate as one or more software modules to perform the processing according to the present disclosure, and vice versa.

[0418] The specific embodiments described in this disclosure are examples and do not limit the scope of this disclosure in any way. For brevity of the specification, descriptions of conventional electronic configurations, control systems, software, and other functional aspects of such systems may be omitted. Additionally, the connecting lines or connecting members between components shown in the drawings illustrate functional connections and/or physical or circuit connections, and in an actual device they may be replaced or supplemented by various functional, physical, or circuit connections. Furthermore, unless a component is specifically described with terms such as "essential" or "important," it may not be strictly necessary for the application of this disclosure.

[0419] Furthermore, although the detailed description of the present disclosure has been explained with reference to preferred embodiments of the present disclosure, those skilled in the art or those with ordinary knowledge in the art will understand that the present disclosure can be modified and changed in various ways without departing from the spirit and technical scope of the present disclosure as set forth in the claims below. Accordingly, the technical scope of the present disclosure should not be limited to the contents described in the detailed description of the specification but should be determined by the claims.

[0420] The present disclosure is applicable across the entire AI-based chatbot, data analysis, and business automation platform service industry. In particular, it enables the efficient automation of corporate decision support and report generation tasks through the logical decomposition of complex tasks and self-correcting code capabilities. Furthermore, through hybrid attention technology, it can contribute to reducing operating costs in the cloud and AI infrastructure industries that require large-scale context processing.

Claims

1. A method for providing an artificial intelligence agent service executed by a computer, the method comprising: (a) receiving, by at least one processor of the computer, a natural language query and a user identifier from a user terminal; (b) retrieving, by the at least one processor, user meta-information corresponding to the user identifier and dynamically injecting it into a system prompt to reconstruct an agent persona; (c) decomposing, by the at least one processor, the query into a plurality of subtasks based on the reconstructed persona and the complexity of the query so as to construct a logically complete answer, and establishing an execution plan; and (d) routing, by the at least one processor, the plurality of subtasks to corresponding expert modules for execution according to the execution plan, and integrating the respective execution results to provide a final answer to the user terminal.

2. The method of claim 1, wherein step (d) comprises: generating, by the expert module, executable source code for a task requiring data analysis or computation among the plurality of subtasks; and transmitting the generated source code to a virtual sandbox environment isolated from the main system for execution, and obtaining the execution result.

3. The method of claim 2, wherein step (d) further comprises performing a self-correction loop in which, when an error is detected during execution of the source code in the sandbox environment, an error log is analyzed to establish a correction strategy, and the code is regenerated and re-executed according to the correction strategy.

4. The method of claim 1, wherein providing the final answer comprises analyzing the execution results of the subtasks, converting them into a native office object in a format requested by the user, and implementing the object in the user interface as a downloadable artifact.

5. The method of claim 1, wherein step (d) further comprises recalculating the reliability scores of documents collected by a search agent among the expert modules, or filtering those documents, based on source control parameters entered through a user interface.

6. The method of claim 1, further comprising, prior to step (c), receiving from the user a request to activate a deep search mode, wherein step (c) comprises, when the deep search mode is activated, allocating an inference budget, which is the limit on the thought tokens that the artificial intelligence model can generate for establishing the execution plan and performing logical verification, at a level greater than a preset threshold.

7. The method of claim 1, further comprising, after step (d): generating a position mapping index between the sentences constituting the final answer and the text segments within the original document serving as the basis for the answer; and providing a bidirectional reference interface that, in response to an input in which a user selects a specific sentence of the final answer, scrolls to the corresponding page of the original document and visually highlights the corresponding text section.

8. The method of claim 1, wherein the artificial intelligence model used in steps (c) and (d) has a hybrid attention architecture that performs sliding-window-based local attention in a plurality of consecutive layers and performs global attention in periodically placed layers.

9. A method for providing a search-based artificial intelligence agent service executed by a computer, the method comprising: (a) providing, by at least one processor of the computer, a search setting interface on a display of a user terminal, and receiving from the user a setting input including source control parameters and whether to enable a deep search mode; (b) when a natural language query is received from the user, establishing, by the at least one processor, a deep analysis execution plan that decomposes the query into a plurality of hierarchical subtasks and determines their execution order, based on whether the deep search mode is enabled and on the complexity of the query; (c) while driving a search agent that constitutes a toolchain according to the execution plan, executing, by the at least one processor, a reliability-based retrieval-augmented generation process that applies the received source control parameters to limit the scope of the databases searched or to evaluate the reliability of the collected documents; and (d) integrating, by the at least one processor, the source documents collected by the search agent and the final answer generated by the artificial intelligence model, and implementing on the user terminal a verification interface through which the answer content can be compared with the source documents.

10. The method of claim 9, wherein the search setting interface of step (a) comprises: a source filter object for selecting, for each type of information including academic journals, patents, news, or blogs, whether that type is included in the search; a domain exclusion input field for receiving the domain address of a specific website that the user wishes to exclude from the search results; and a toggle switch object for activating or deactivating the deep search mode.

11. The method of claim 9, wherein the reliability-based retrieval-augmented generation process of step (c) comprises: analyzing metadata of each of a plurality of candidate documents collected by the search agent; a negative filtering step of removing, from the candidate group, documents whose source corresponds to the domain exclusion information entered by the user; and reordering the candidates by assigning weights to documents of a type selected in the source filter or to documents satisfying a preset reliability criterion.

12. The method of claim 9, wherein step (b) comprises: when the deep search mode is activated, setting the number of logical reasoning steps that the artificial intelligence model performs before generating an answer, or the allocation of thought tokens it can generate, to be greater than in a normal mode; and generating an execution plan including multiple iterative searches by expanding associated keywords derivable from the natural language query.

13. The method of claim 9, wherein the verification interface of step (d) is provided by dividing the screen into a first area displaying the final answer and a second area displaying the source documents, and wherein the processor generates an index that maps the sentences of the final answer to citation sections within the source documents and provides a bidirectional cross-referencing function that, in response to an input in which a user selects a specific sentence in the first area, moves the source document in the second area to the corresponding location and visually highlights the citation section.

14. The method of claim 9, wherein step (a) further comprises receiving a document file uploaded by the user through the search setting interface, and step (c) comprises parsing the content of the uploaded document file and storing it in a vector database, and performing, by the search agent, a hybrid search that integrates external web search results with the content of the uploaded document file to construct a context.