A dynamic sandbox data analysis method and system based on a large language model

By generating and executing data analysis code in a dynamic sandbox using a large language model, the problem of non-technical personnel being unable to perform independent analysis and the associated security risks are solved. This enables a secure and interactive data analysis process, improving efficiency and adaptability.

CN122197000A · Pending · Publication Date: 2026-06-12 · ZHEJIANG MEIRI HUDONG NETWORK TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG MEIRI HUDONG NETWORK TECH CO LTD
Filing Date
2026-04-16
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing data analysis technologies are difficult for non-technical personnel to use, and they suffer from security risks, poor result presentation, insufficient environment isolation, and inadequate error correction mechanisms.

Method used

A dynamic sandbox data analysis method based on a large language model is adopted. Data analysis code is generated from natural language instructions combined with historical session context and is executed in an isolated environment created by a dynamic sandbox scheduling system. The results are displayed interactively and the code is corrected based on execution feedback, so as to achieve secure isolation and closed-loop correction.

Benefits of technology

It improves the security, automation, and efficiency of data analysis, reduces the cost of manual intervention, adapts to different deployment scenarios, and provides interactive result display and closed-loop correction mechanisms.

✦ Generated by Eureka AI based on patent content.

Figure CN122197000A_ABST

Abstract

The application discloses a dynamic sandbox data analysis method based on a large language model. The method comprises the following steps: receiving a user natural language instruction, combining a historical conversation context, and generating data analysis code through a large language model; encapsulating the generated data analysis code as an execution request and sending it to a dynamic sandbox scheduling system; the dynamic sandbox scheduling system responds to the request, dynamically creates an isolated running environment and executes the data analysis code; obtains code execution results and feedback, renders and interactively displays the results, corrects the code based on the feedback and re-executes until the analysis task is completed. Through the whole process design of instruction conversion-request scheduling-isolation execution-closed loop correction, the method takes into account the convenience, safety and accuracy of data analysis, reduces the use threshold of non-professional users, and improves the coherence and reliability of task processing.

Description

Technical Field

[0001] This invention relates to the field of computer data processing technology, and in particular to a dynamic sandbox data analysis method and system based on a large language model.

Background Technology

[0002] With the deep integration of big data technology and artificial intelligence, data analysis has become a core support for decision-making across various industries. Currently, traditional data analysis methods heavily rely on technical personnel manually writing code in programming languages such as Python and SQL, and then using libraries like Pandas and NumPy to process data and generate results. This approach not only demands a high level of technical expertise from users, making it difficult for non-technical personnel to independently complete data analysis tasks, but also suffers from low coding efficiency, long iteration cycles, and an inability to adapt to rapidly changing business data analysis scenarios.

[0003] To lower the barrier to data analysis, existing technologies have begun to combine Large Language Models (LLMs) with data analysis workflows, using LLMs to convert user natural language commands into executable code, achieving semi-automation of data analysis. However, such solutions have exposed many shortcomings in practical applications: First, security risks are prominent. When the code generated by LLMs is run directly in a local or general environment, it may contain dangerous operations such as file deletion, malicious network requests, and permission violations, which can easily lead to data leaks, system attacks, and other security incidents. Second, the result display experience is poor. LLMs usually only return code execution results in text form and cannot directly display rich media content such as interactive tables and visualizations. Users need to download files or switch tools to view the analysis results, creating a fragmented experience of generation, execution, and viewing. Third, there is insufficient environment isolation and resource management. Some solutions use a pre-built static sandbox environment to run the code. Multiple tasks sharing the same environment can easily lead to cross-contamination of data, and long-term idle static environments will cause a large waste of computing resources. At the same time, it is difficult to adapt to the needs of different deployment scenarios such as single machines and clusters.

[0004] Furthermore, existing automated data analysis solutions generally lack robust error correction mechanisms. The process terminates when an error occurs, requiring manual user intervention to troubleshoot and modify the code. This fails to create a closed-loop process of generation-execution-feedback-correction, further reducing the automation level and efficiency of data analysis. Therefore, there is an urgent need for an end-to-end data analysis solution that balances security, automation, interactivity, and resource adaptability to address the aforementioned technical pain points of existing technologies and improve the efficiency and security of the entire data analysis process.

Summary of the Invention

[0005] To address the aforementioned technical problems, the technical solution adopted by this invention is as follows: According to a first aspect of the present invention, a dynamic sandbox data analysis method based on a large language model is provided, the method comprising the following steps: S100: Receive the user's natural language instructions and generate data analysis code based on the natural language instructions and historical conversation context using a large language model.

[0006] S200: The data analysis code is encapsulated into an execution request and sent to the dynamic sandbox scheduling system.

[0007] S300: The dynamic sandbox scheduling system responds to the execution request by dynamically creating an isolated running environment and executing the data analysis code in that running environment.

[0008] S400: Obtain the execution result and execution feedback of the data analysis code, render and interactively display the execution result; and based on the execution feedback, correct and re-execute the data analysis code until the analysis task is completed.
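As a non-limiting illustration, the closed loop formed by steps S100 to S400 can be sketched as follows. The names `run_analysis`, `ExecutionFeedback`, `generate`, `execute`, and the retry bound `max_rounds` are hypothetical and are not defined by this description; the large language model and the sandbox are represented by stub callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExecutionFeedback:
    ok: bool                # True when the code ran without error
    result: Optional[str]   # execution result on success
    error: Optional[str]    # error message on failure


def run_analysis(
    instruction: str,
    history: List[str],
    generate: Callable[[str, List[str]], str],     # S100 (hypothetical LLM call)
    execute: Callable[[str], ExecutionFeedback],   # S200 + S300 (hypothetical sandbox call)
    max_rounds: int = 3,                           # assumed bound on correction rounds
) -> ExecutionFeedback:
    """Generate code, execute it, and regenerate on failure (S400)."""
    feedback = ExecutionFeedback(False, None, "not executed")
    for _ in range(max_rounds):
        code = generate(instruction, history)
        feedback = execute(code)
        history.append(code)
        if feedback.ok:
            return feedback
        # Feed the error back so the next generation round can correct it.
        history.append(f"error: {feedback.error}")
    return feedback


# Stubs standing in for the model and the sandbox.
def fake_generate(instruction: str, history: List[str]) -> str:
    return "good" if any(h.startswith("error:") for h in history) else "bad"


def fake_execute(code: str) -> ExecutionFeedback:
    if code == "good":
        return ExecutionFeedback(True, "42", None)
    return ExecutionFeedback(False, None, "NameError")


demo_history: List[str] = []
demo_feedback = run_analysis("sum sales", demo_history, fake_generate, fake_execute)
```

In this sketch the first round fails, the error is appended to the history, and the second round succeeds, which mirrors the generation-execution-feedback-correction loop.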

[0009] According to a second aspect of the present invention, a dynamic sandbox data analysis system based on a large language model is provided, comprising: The instruction processing module is used to receive natural language instructions from users and generate data analysis code based on the instructions and historical conversation context through a large language model.

[0010] The request scheduling module, connected to the instruction processing module, is used to encapsulate the data analysis code into an execution request and send it to the dynamic sandbox scheduling system.

[0011] The sandbox execution module is used to dynamically create and manage an isolated runtime environment in response to the execution request, and execute the data analysis code in the runtime environment.

[0012] The feedback correction module is used to obtain the execution result and execution feedback of the data analysis code, render and interactively display the execution result; and based on the execution feedback, control the instruction processing module to correct and re-execute the data analysis code until the analysis task is completed.

[0013] The present invention has at least the following beneficial effects: This invention leverages the synergy of historical conversation context and a large language model to effectively compensate for the information limitations of single natural language commands, improving the accuracy of command semantic understanding and the relevance of code generation, ensuring that the generated code remains consistent with historical operation logic. Through the encapsulation and scheduling mechanism of execution requests, it achieves seamless integration between code generation and sandbox execution, providing a solid foundation for subsequent secure execution. Utilizing a dynamically created isolated runtime environment, it physically avoids data cross-contamination and resource competition between different tasks, blocking the potential impact of dangerous code on the system and significantly improving the security of the code execution process. Coupled with interactive display of execution results and a closed-loop correction design, it not only improves the efficiency of user interpretation of analysis results but also enables automatic iterative code correction based on execution feedback, reducing manual intervention costs and ensuring accurate implementation of analysis tasks. The overall process balances convenience, security, and reliability in data analysis and is widely adaptable to various general data analysis scenarios.

[0014] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Brief Description of the Drawings

[0015] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart of a dynamic sandbox data analysis method based on a large language model, provided as an embodiment of the present invention.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of this invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.

[0019] It should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the steps as sequential processes, many of these steps can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the steps can be rearranged. A process can be terminated when its operation is complete, but it may also have additional steps not included in the figures. A process can correspond to a method, function, procedure, subroutine, subprogram, etc.

[0020] (Example 1) This invention provides a dynamic sandbox data analysis method based on a large language model, aiming to solve problems in existing data analysis techniques such as biased semantic understanding of natural language instructions, complex data source adaptation, high code execution security risks, poor task flow coherence, and low efficiency in large-scale data processing. As shown in Figure 1, the method includes the following steps: S100: Receive the user's natural language instruction and, based on the instruction and the historical conversation context, generate data analysis code using a large language model. A natural language instruction is a data analysis request expressed by the user in a non-programming form. It can cover various data analysis needs such as data query, statistical calculation, result visualization, and anomaly detection. Its wording is not subject to strict grammatical restrictions and can fully reflect the user's core data processing needs and expected output. Users do not need to master professional programming skills to initiate data analysis requests, which lowers the threshold for data analysis.

[0021] Specifically, this step is implemented through the following sub-steps to accurately identify the data analysis target and generate logically coherent data analysis code: S101: Instruction parsing and intent recognition. Intent recognition is performed on the natural language instruction to determine the data analysis target. The received instruction is first semantically parsed: a semantic parsing algorithm performs word segmentation, part-of-speech tagging, and core semantic extraction, accurately identifying key information in the instruction such as data objects, processing actions, filtering conditions, and output requirements. This provides structured semantic support for subsequent intent recognition and code generation and avoids information loss caused by the ambiguity of natural language.

[0022] After semantic parsing, the structured semantic data is input into a trained and optimized intent classification model, which outputs the corresponding data analysis objective. This intent classification model uses a large number of diverse data analysis scenario samples as training data and, through iterative optimization, possesses strong scenario adaptability. Its core function is to accurately map the input structured semantic data to preset intent categories, including but not limited to data filtering, summary statistics, trend analysis, data visualization, and outlier detection, thereby clarifying the user's core data analysis objective. For scenarios prone to ambiguity, such as semantic ambiguity or multiple overlapping intents, supplementary judgments are made by calling historical session context and relying on previous interaction information to eliminate ambiguous intents, ensuring the accuracy and uniqueness of intent recognition results and providing precise guidance for subsequent code generation.

[0023] S102, Generate data analysis code that matches the data analysis target by combining the historical session context. Historical session context association is achieved through a session identifier binding mechanism. Each analysis task has a unique session identifier. The historical session context includes all interactive data such as the preceding natural language instructions corresponding to the session identifier, the generated data analysis code, the code execution results, user feedback on the results, and the task execution status. Data at each stage is stored in association with the session identifier to ensure the uniqueness, integrity, and traceability of the context data.
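As a non-limiting illustration, the session identifier binding mechanism can be sketched as an in-memory store keyed by session identifier. The class and field names (`SessionStore`, `ContextRecord`, `kind`, `payload`) are hypothetical, and a practical deployment would presumably use persistent storage rather than a dictionary.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextRecord:
    kind: str       # e.g. "instruction", "code", "result", "feedback", "status"
    payload: str
    timestamp: float = field(default_factory=time.time)


class SessionStore:
    """Keeps every interaction record under its session identifier."""

    def __init__(self) -> None:
        self._records: Dict[str, List[ContextRecord]] = defaultdict(list)

    def append(self, session_id: str, record: ContextRecord) -> None:
        self._records[session_id].append(record)

    def load(self, session_id: str) -> List[ContextRecord]:
        # Return a copy so callers cannot alter the stored history.
        return list(self._records.get(session_id, []))


store = SessionStore()
store.append("s1", ContextRecord("instruction", "keep only 2024 orders", 1.0))
store.append("s1", ContextRecord("code", "df = df[df.year == 2024]", 2.0))
store.append("s2", ContextRecord("instruction", "unrelated task", 3.0))
```

Loading `"s1"` returns only the two records bound to that identifier, in insertion order, so records of different analysis tasks never mix.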

[0024] During the code generation phase, the historical session context data corresponding to the current analysis task is first loaded using the session identifier. Then, the data analysis target determined in S101 is deeply integrated with the loaded context data. Here, the current analysis task refers to the complete data analysis process centered on the user's current natural language command, encompassing command parsing, code generation, execution, and result feedback. The current data analysis code is the core execution carrier of this task.

[0025] To accurately quantify the impact weight of historical data across various dimensions on current code generation and eliminate interference from invalid historical data, a context fusion weight formula is constructed to calculate the contribution of each piece of historical context data, thereby achieving weighted fusion:

$$W_i = \frac{T_{\mathrm{current}} - T_i}{T_{\mathrm{current}} - T_{\min}} \times \alpha_i \times \beta$$

[0026] Here, $W_i$ is the fusion weight of the i-th historical context data, with a value range of [0, 1]. A higher weight indicates a greater impact of that historical data on the current code generation decision, and a weight of 0 indicates invalid data that can be removed. $T_{\mathrm{current}}$ is the timestamp at which the current code generation request is made; $T_i$ is the generation timestamp of the i-th historical context data; $T_{\min}$ is the timestamp of the first session data in the current analysis task. The ratio $(T_{\mathrm{current}} - T_i) / (T_{\mathrm{current}} - T_{\min})$ constitutes a time decay factor, enabling dynamic adjustment so that newer data has a higher weight. $\alpha_i$ is the type weight of the i-th historical data, preset to fixed values based on data importance (preceding code $\alpha = 0.9$, execution result $\alpha = 0.7$, user feedback $\alpha = 0.8$) and dynamically fine-tunable according to business scenarios. $\beta$ is the context credibility coefficient, with a value range of [0.8, 1.0]: $\beta = 1.0$ when the data is complete, and $\beta$ takes a value of 0.8 to 0.9 when some fields are missing, to avoid fusion bias caused by incomplete data. The index $i$ runs from 1 to $n$, where $n$ is the number of historical context data in the current analysis task.
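As a non-limiting illustration, the context fusion weight formula can be evaluated as sketched below. The function name `fusion_weight` and the handling of the degenerate case where the current timestamp equals the first timestamp are assumptions. The sketch implements the formula exactly as written; evaluated literally, the time factor equals 1 for the record stamped at the first timestamp and 0 for a record stamped at the current timestamp.

```python
# Type weights as stated in the description.
TYPE_WEIGHTS = {"code": 0.9, "result": 0.7, "feedback": 0.8}


def fusion_weight(t_current: float, t_i: float, t_min: float,
                  alpha: float, beta: float = 1.0) -> float:
    """W_i = (T_current - T_i) / (T_current - T_min) * alpha_i * beta."""
    if not 0.8 <= beta <= 1.0:
        raise ValueError("beta must lie in [0.8, 1.0]")
    span = t_current - t_min
    if span <= 0:
        # Assumption: with no elapsed time the factor is undefined; return 0.
        return 0.0
    time_factor = (t_current - t_i) / span
    return time_factor * alpha * beta


# Worked example: T_current = 100, T_min = 0, T_i = 40, preceding code.
# Time factor = 60 / 100 = 0.6, so W_i is approximately 0.6 * 0.9 * 1.0 = 0.54.
example_weight = fusion_weight(100.0, 40.0, 0.0, TYPE_WEIGHTS["code"], beta=1.0)
```

Records whose weight evaluates to 0 can then be dropped before the remaining context is passed to the model.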

[0027] The large language model uses weighted and fused contextual data and data analysis objectives as comprehensive input conditions to generate data analysis code that conforms to the current data analysis objectives and is consistent with historical operation logic. During code generation, it automatically detects implicit data source requirements in user commands (such as mentions of "reading tables" or "calling database data"), supports access to heterogeneous data sources such as CSV, Excel, relational databases, and time-series databases, automatically identifies the data source type, and generates corresponding data reading adaptation code, shielding access protocols and format differences between different data sources to achieve seamless data source access. If the data source type cannot be identified or access fails, it provides immediate feedback to guide the user to supplement data source details, eliminating the need for manual configuration of data reading rules and ensuring that the code can directly adapt to the target data source.
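As a non-limiting illustration, identifying the data source type and emitting matching reading code can be sketched as follows. The function name `reader_snippet`, the extension table, and the list of database URL schemes are assumptions; the emitted strings use the pandas functions `read_csv`, `read_excel`, and `read_sql`, and `query` is a placeholder variable in the generated code. Time-series databases are omitted from the sketch.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed mapping from file extension to a pandas reader call template.
FILE_READERS = {
    ".csv": "pd.read_csv({src!r})",
    ".xlsx": "pd.read_excel({src!r})",
    ".xls": "pd.read_excel({src!r})",
}
# Assumed set of relational database URL schemes.
DB_SCHEMES = {"postgresql", "mysql", "sqlite"}


def reader_snippet(source: str) -> str:
    """Return a line of data-reading code for the given source string."""
    scheme = urlparse(source).scheme.split("+")[0].lower()
    if scheme in DB_SCHEMES:
        return f"pd.read_sql(query, {source!r})"
    suffix = PurePosixPath(source).suffix.lower()
    template = FILE_READERS.get(suffix)
    if template is None:
        # Mirrors the described feedback that asks the user for more details.
        raise ValueError(f"unrecognized data source {source!r}; please specify its type")
    return template.format(src=source)
```

For example, `reader_snippet("sales.csv")` yields a `pd.read_csv` call, while an unknown extension raises an error that can be surfaced to the user as a request for data source details.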

[0028] The data analysis code can be executable code, such as Python / Pandas data analysis code. Through the above-mentioned fusion mechanism, the results of previous operations can be automatically reused when generating code. For example, subsequent statistical analysis code can be generated based on the data range that has been filtered by previous instructions, without requiring the user to repeatedly mention the basic conditions. This effectively avoids task interruption or repetitive operations, and improves the continuity of interaction and code generation efficiency.

[0029] By leveraging the synergy between intent recognition and historical session context, the generated code can precisely match the user's actual needs, effectively avoiding the generation of irrelevant code. This not only improves the targeting, accuracy, and execution efficiency of data analysis but also compensates for the incompleteness of single natural language instruction information, reduces code generation bias, and lays a reliable foundation for subsequent code security verification and sandbox execution.

[0030] In addition, while generating data analysis code, structured comments are added simultaneously. These comments include code function descriptions, explanations of key parameters, and annotations of core logic nodes, clearly presenting the code design ideas and providing convenient support for users to modify and optimize the code.

[0031] In this invention, the large language model adopts a hierarchical modular structure design to adapt to the collaborative needs of semantic parsing, intent mapping, and code generation. The specific structure is as follows: The first layer is the semantic understanding layer, which includes a semantic parsing submodule and a context association submodule. The semantic parsing submodule integrates the aforementioned semantic parsing algorithms and is responsible for performing word segmentation, part-of-speech tagging, and core semantic extraction on the input natural language instructions, transforming unstructured instructions into structured semantic vectors to provide standardized data for subsequent processing. The context association submodule calls an interface through a session identifier to load the corresponding historical session context data, transforming it into feature data that can be fused with the current semantic vector to achieve a structured representation of contextual information.

[0032] The second layer is the intent mapping layer, which includes a trained and optimized intent classification sub-model and an ambiguity resolution sub-module. The intent classification sub-model is the aforementioned intent classification model, which adopts the Transformer architecture and is iteratively trained through a large number of data analysis scenario samples. It has the ability to accurately map semantic vectors to preset intent categories and outputs clear data analysis targets. The ambiguity resolution sub-module is designed for semantically ambiguous scenarios and multiple intents. It constructs context feature matching rules, calculates the similarity between the current semantic vector and historical context feature data, eliminates ambiguous intents, and outputs a unique and valid target.
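As a non-limiting illustration, the similarity-based ambiguity resolution can be sketched with cosine similarity. The description does not specify the similarity measure or the vector format, so cosine similarity, the function names, and the toy two-dimensional vectors are all assumptions.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def resolve_intent(candidates: Dict[str, Sequence[float]],
                   history_features: Sequence[float]) -> str:
    """Pick the candidate intent whose vector best matches the history."""
    return max(candidates, key=lambda name: cosine(candidates[name], history_features))


# Two competing intents for an ambiguous instruction (toy vectors).
candidates = {"trend_analysis": [1.0, 0.0], "outlier_detection": [0.0, 1.0]}
chosen = resolve_intent(candidates, history_features=[0.9, 0.1])
```

Here the historical context features lie closer to the trend analysis vector, so that intent is retained and the other is discarded.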

[0033] The third layer is the code generation layer, which consists of a logic generation submodule and an optimization submodule. The logic generation submodule, based on the data analysis objectives output by the intent mapping layer and combined with the feature data after context fusion, generates a data analysis code framework that conforms to syntax specifications and preceding operation logic. The optimization submodule performs syntax validation, logic simplification, and compatibility adaptation on the code framework (validation is only performed on code syntax, logic, and compatibility, forming a layered protection with the S200's special verification for security risks), ensuring that the generated code can be directly executed. Simultaneously, it generates the aforementioned structured comments to improve code readability and modifiability.

[0034] In addition, the large language model is also equipped with a training optimization module, which can incrementally train the intent classification sub-model and code generation logic based on data such as code execution feedback and user correction operations in real applications, continuously improving the accuracy of semantic understanding, the precision of intent mapping and the adaptability of code generation, and strengthening the robustness of the model in various data analysis scenarios.

[0035] S200, the data analysis code is encapsulated into an execution request and sent to the dynamic sandbox scheduling system. This step encapsulates the data analysis code generated by S100 into standardized execution requests and sends them to the dynamic sandbox scheduling system, providing preparatory work for the secure and isolated execution of the code. This step is implemented through a process of security verification, code encapsulation, request feature definition, and sandbox adaptation specifications, detailed as follows: After generating data analysis code using a large language model, a security verification process must be performed first. This is a necessary pre-processing step before code encapsulation and delivery, aiming to mitigate potential security risks during code execution from the outset. The security verification module is an independent pre-processing module deployed between S100 code generation and S200 request encapsulation, specifically responsible for code security detection.

[0036] Specifically, this security verification module performs a full scan of the generated data analysis code, focusing on file operations and network request behaviors within the code. A pre-defined whitelist defines the scope of legal behavior; this whitelist can be configured based on business needs, specifying allowed file read / write paths, network request domains and ports, and permission boundaries. During verification, each file operation instruction and network call statement in the code is compared with the whitelist rules to filter out dangerous operations exceeding the whitelist's limits. These dangerous operations include, but are not limited to, unauthorized file read / write, access to sensitive directories, initiating malicious network requests, tampering with system permissions, and script injection. If a dangerous operation is detected, the code will be rejected for encapsulation, and a verification exception will be reported. A regeneration mechanism can also be triggered, allowing the large language model to correct the code based on the security verification results. Only after the data analysis code passes this security verification can it proceed to the subsequent encapsulation stage. This security verification step effectively blocks the execution of dangerous code, preventing attacks, data leaks, or tampering, ensuring the security and system stability of the data analysis process, and providing a pre-emptive security barrier for subsequent code execution in the sandbox environment.
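As a non-limiting illustration, a whitelist-based scan of generated code can be sketched with Python's standard `ast` module. The whitelist contents, the banned builtin names, and the function name `find_violations` are assumptions. A static scan of this kind covers only simple patterns, which is consistent with the description's use of the sandbox as a further layer of protection.

```python
import ast
from typing import List

ALLOWED_IMPORTS = {"pandas", "numpy", "math"}   # assumed import whitelist
BANNED_CALLS = {"eval", "exec", "__import__"}   # assumed banned builtins
ALLOWED_PATH_PREFIXES = ("/data/",)             # assumed readable directory


def find_violations(code: str) -> List[str]:
    """Return a description of each operation that falls outside the whitelist."""
    violations: List[str] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import of '{alias.name}' is not whitelisted")
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import from '{module}' is not whitelisted")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append(f"call to '{node.func.id}' is not allowed")
            elif node.func.id == "open":
                path = node.args[0] if node.args else None
                allowed = (
                    isinstance(path, ast.Constant)
                    and isinstance(path.value, str)
                    and path.value.startswith(ALLOWED_PATH_PREFIXES)
                )
                if not allowed:
                    violations.append("open() targets a path outside the whitelist")
    return violations
```

An empty list means the code may proceed to encapsulation; a non-empty list can be reported as a verification exception and passed back to the model to trigger regeneration.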

[0037] After the code passes the security verification, it is encapsulated into a standardized execution request using JupyterTool. As the core tool for code encapsulation and request construction, JupyterTool formats code blocks and standardizes request parameters so that the dynamic sandbox scheduling system can recognize and parse the execution request. The encapsulated execution request contains the following core elements. (1) Code block: the complete data analysis code that has passed the security verification, syntactically formatted by JupyterTool so that it can be executed directly in the sandbox environment. (2) Timeout setting: a preset threshold on code execution time (the threshold required by the timeout control mechanism in S300), which can be fixed according to the task type or customized by the user; it prevents code from looping indefinitely or occupying resources for a long time, and the sandbox system automatically terminates execution once the threshold is exceeded. (3) Session context: the session identifier of the current analysis task together with a summary of the associated prior interaction data, which the sandbox system uses to associate historical session information and keep task execution contextually coherent.
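As a non-limiting illustration, an execution request carrying the code block, timeout setting, and session context can be sketched as a small data class serialized to JSON. The description does not define the wire format produced by JupyterTool, so the class name `ExecutionRequest`, the field names, and the use of JSON are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ExecutionRequest:
    code: str                    # verified data analysis code
    timeout_seconds: int         # execution time threshold
    session_id: str              # identifier of the current analysis task
    context_summary: List[str] = field(default_factory=list)  # prior interaction summary

    def __post_init__(self) -> None:
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


request = ExecutionRequest(
    code="print(1)",
    timeout_seconds=30,
    session_id="s1",
    context_summary=["filtered to 2024 orders"],
)
payload = request.to_json()
```

Validating the timeout at construction time keeps malformed requests from reaching the scheduling system.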

[0038] Meanwhile, the execution request also includes a code execution priority configuration element. This priority can be preset based on business scenario requirements or specified by the user, specifically divided into three levels: high, medium, and low (the number of levels can be expanded according to actual application scenarios). Different levels correspond to different resource allocation weights. After receiving the execution request, the dynamic sandbox scheduling system will prioritize identifying the code execution priority and allocate computing resources to the sandbox runtime environment based on this priority. High-priority tasks will be allocated preferentially to core resources such as CPU and memory, ensuring that data analysis tasks for urgent business are executed first, thereby improving the rationality and efficiency of overall task processing.
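As a non-limiting illustration, priority-aware handling of execution requests can be sketched with a heap-based queue from the standard library. The description states only that resources are allocated according to priority; ordering requests in a queue, the rank values, and the class name `RequestQueue` are assumptions.

```python
import heapq
import itertools
from typing import List, Tuple

# Lower rank is served first; the three levels follow the description.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


class RequestQueue:
    """Serves high priority first and keeps arrival order within a level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()

    def push(self, request_id: str, priority: str) -> None:
        rank = PRIORITY_RANK[priority]  # raises KeyError for an unknown level
        heapq.heappush(self._heap, (rank, next(self._arrival), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = RequestQueue()
for request_id, priority in [("a", "low"), ("b", "high"), ("c", "medium"), ("d", "high")]:
    queue.push(request_id, priority)
served = [queue.pop() for _ in range(len(queue))]
```

The arrival counter breaks ties, so two high-priority requests are served in the order they were submitted. Adding a level only requires a new entry in `PRIORITY_RANK`.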

[0039] The dynamic sandbox scheduling system is adapted to containerized operating environments and has cross-deployment scenario adaptability. It supports multiple deployment modes such as single-machine container environments and Kubernetes cluster environments, and adopts corresponding container management methods for different deployment environments: in a single-machine environment, the container lifecycle is controlled through container management tools; in a Kubernetes cluster environment, independent tasks are created through the Job component and network access support is provided through the Service component to achieve clustered container scheduling and management.

[0040] S300, in response to the execution request, the dynamic sandbox scheduling system dynamically creates an isolated runtime environment and executes the data analysis code within that environment. First, the execution request is parsed to extract core information such as code blocks, timeout settings, session context, execution priority, and resource requirements. This information serves as the basis for creating the runtime environment container. Based on the parsing results, a dynamic instantiation mode of creating upon request is adopted to independently create a dedicated runtime environment container (i.e., the runtime environment described in this invention) for the current analysis task, without reserving fixed container resources to avoid resource idleness and waste.

[0041] Different creation logic is adopted for different deployment scenarios: In a single-machine container environment, independent container instances are dynamically generated by calling the interface through container management tools, and a dedicated network namespace and file system are configured; in a Kubernetes cluster environment, container creation tasks are submitted through the Job component, cluster node resources are automatically allocated, and network access rules are preset by the Service component to ensure that containers can be called securely. A unique identifier is assigned to each runtime environment container and bound to the session identifier in the execution request to achieve precise association between tasks and containers and ensure isolation between different tasks.
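As a non-limiting illustration, the Kubernetes Job submitted for one analysis task can be sketched as a manifest built in Python. The fields used (`activeDeadlineSeconds`, `backoffLimit`, `restartPolicy`, `resources.limits`, `securityContext`) are standard fields of the `batch/v1` Job API; the function name, the `sandbox-` name prefix, the `session-id` label, the default quotas, and the image name are assumptions. The session identifier is assumed to be a valid lowercase DNS label.

```python
from typing import Any, Dict


def build_job_manifest(session_id: str, image: str, timeout_seconds: int,
                       cpu: str = "1", memory: str = "1Gi") -> Dict[str, Any]:
    """Build a batch/v1 Job manifest for a single sandbox run."""
    labels = {"session-id": session_id}  # binds the container to the session
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"sandbox-{session_id}", "labels": labels},
        "spec": {
            "activeDeadlineSeconds": timeout_seconds,  # enforce the timeout
            "backoffLimit": 0,                         # do not retry failed runs
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "runner",
                        "image": image,
                        "resources": {"limits": {"cpu": cpu, "memory": memory}},
                        "securityContext": {
                            "allowPrivilegeEscalation": False,
                            "runAsNonRoot": True,
                        },
                    }],
                },
            },
        },
    }


manifest = build_job_manifest("s1", "example/analysis-runner:latest", timeout_seconds=30)
```

The resulting dictionary can be serialized to YAML or JSON and submitted with any Kubernetes client. A Service selecting the `session-id` label could then provide the network access described above.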

[0042] For large-scale dataset scenarios, after creating the runtime environment container, the original data analysis task is automatically split into multiple independent subtasks based on data size and task complexity. These subtasks are then allocated to different sandbox containers for parallel execution according to resource quotas and execution priorities, significantly improving data processing efficiency. To optimize the balance of subtask allocation and avoid some containers being overloaded while others are idle, a container load balancing formula is constructed as the core criterion for subtask allocation (the formula itself is not reproduced in this text).

[0043] Here, $\eta$ is the container load balancing degree of the runtime environment, with a value range of [0, 1]; the closer $\eta$ is to 1, the more balanced the distribution. By default, $\eta \ge 0.8$ is treated as the optimal distribution state, and a lower value triggers iterative adjustment of the subtasks. $m$ is the number of sandbox containers participating in parallel execution; it is determined automatically from the total resource quota and the carrying capacity of a single container, and an upper limit threshold can be configured. $L_j$ is the total load of the subtasks allocated to the j-th runtime environment container, in MB, calculated as subtask data volume × task complexity coefficient (the coefficient ranges from 1 to 5 and is determined automatically), with $j$ ranging from 1 to $m$. $\mathrm{AvgL}$ is the average load of all runtime environment containers and serves as the benchmark reference value for load balancing.

[0044] During allocation, the task is first split into subtasks and the initial load of each container is calculated. The loads are then substituted into the container load balancing formula to obtain the load balancing degree η. If η < 0.8, the subtask assignment is adjusted iteratively (some subtasks are migrated from high-load containers to low-load containers) until η reaches the optimal state of η ≥ 0.8. The subtasks are then assigned to the corresponding sandbox containers for parallel execution. After all subtasks have finished, the partial results of each container are associated through the session identifier and aggregated, and data integration and consistency verification are performed. Finally, a unified data analysis result is generated, so that parallel execution does not compromise the integrity or accuracy of the result.
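The iterative adjustment can be sketched as follows. The exact expression for η is not reproduced in this text, so the sketch assumes a stand-in metric: one minus the mean absolute deviation of the container loads divided by AvgL, clamped to [0, 1]. The function names, the greedy migration rule, and the `max_moves` cap are likewise assumptions for illustration only.

```python
def balance_degree(loads):
    """Assumed stand-in for eta: 1 - mean(|L_j - AvgL|) / AvgL, clamped to [0, 1]."""
    avg = sum(loads) / len(loads)
    if avg == 0:
        return 1.0
    deviation = sum(abs(load - avg) for load in loads) / (len(loads) * avg)
    return max(0.0, 1.0 - deviation)


def subtask_load(data_mb, complexity):
    """Load rule from the text: data volume (MB) x complexity coefficient (1 to 5)."""
    return data_mb * complexity


def rebalance(assignments, threshold=0.8, max_moves=1000):
    """Migrate subtasks from the most loaded to the least loaded container.

    assignments: one list of subtask loads per container.
    Returns a new assignment; the input is not modified.
    """
    assignments = [list(container) for container in assignments]
    for _ in range(max_moves):
        loads = [sum(container) for container in assignments]
        if balance_degree(loads) >= threshold:
            break
        src = loads.index(max(loads))
        dst = loads.index(min(loads))
        gap = loads[src] - loads[dst]
        # Only a subtask smaller than the gap reduces the imbalance when moved.
        movable = [t for t in assignments[src] if 0 < t < gap]
        if not movable:
            break
        task = min(movable, key=lambda t: abs(t - gap / 2))
        assignments[src].remove(task)
        assignments[dst].append(task)
    return assignments
```

For example, `rebalance([[100, 200, 300, 400], [50], [50]])` migrates two subtasks and ends with container loads of 300, 450, and 350 MB, for which the assumed metric is about 0.85.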

[0045] Secondly, full lifecycle management is implemented for the runtime environment containers, covering the entire process of creation, monitoring, and destruction. This is the main embodiment; it focuses on monitoring container resource metrics to keep container operation controllable and data securely isolated. During the creation phase, independent resource quotas are allocated to the container based on the resource requirements and priority in the execution request. These quotas cover the number of CPU cores, memory capacity, and storage space, and can be preset within a reasonable range according to business needs, so that the container has sufficient resources to run while no single container consumes excessive system resources. At the same time, a least-privilege execution policy is configured for the container, granting only the permissions necessary for code execution and further reducing security risks. During the monitoring phase, only container operation metrics are monitored. Container monitoring components collect data in real time, such as CPU utilization, memory usage, network I/O, disk I/O, and code execution progress, and compare the collected metrics with preset security thresholds. When a metric exceeds its threshold (e.g., memory overflow or excessively high CPU utilization), an alarm is generated immediately. Dynamic resource adjustment is supported only in cluster environments, where it can be triggered as needed so that overall operational stability is not affected. During the destruction phase, the container destruction process is triggered automatically when code execution completes or terminates abnormally. First, all running processes in the container are terminated; then the task-related data, temporary files, environment configurations, and dependency libraries in the container are thoroughly cleared; finally, all allocated resources are released so that no data remains, further strengthening the isolation between tasks and ensuring data security.
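The threshold comparison in the monitoring phase can be illustrated with a very small sketch. The metric names and threshold values below are hypothetical; the source does not define them, and a real monitoring component would obtain the metrics from the container runtime rather than from a dictionary.

```python
def check_metrics(metrics: dict, thresholds: dict) -> list:
    """Compare collected container metrics with preset security thresholds.

    Returns one alarm message per metric that exceeds its threshold.
    Metric names are illustrative assumptions.
    """
    alarms = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alarms.append(f"{name} = {value} exceeds threshold {limit}")
    return alarms


# Example with made-up values: only CPU utilization is above its threshold.
sample_alarms = check_metrics(
    {"cpu_percent": 97.0, "memory_mb": 300},
    {"cpu_percent": 90.0, "memory_mb": 512},
)
```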

[0046] As another independent embodiment, S300 can additionally include a code execution timeout control mechanism that runs in parallel with the resource metric monitoring logic of the main embodiment. This mechanism specifically controls code execution time to prevent resources from being occupied for extended periods. To address the poor adaptability of fixed timeout thresholds, an adaptive timeout threshold formula is constructed based on task complexity and data volume. It dynamically generates a dedicated preset timeout threshold for each task, avoiding both wasted resources and erroneous task termination caused by thresholds that are too short: $T_{timeout} = T_0 \times \left[1 + k \times \log_{10}(1 + D / D_0)\right] \times \gamma$.

[0047] Where $T_{timeout}$ is the adaptive timeout threshold for the current task, measured in seconds (s), serving as the core value for the timeout setting in the execution request; $T_0$ is the basic timeout threshold, ranging from 30 to 300 seconds, preset according to the task type ($T_0$ = 30 seconds for simple statistical tasks, $T_0$ = 180 seconds for complex modeling tasks; it can be manually calibrated by the user); k is the data volume impact coefficient, ranging from 0.2 to 0.5, with higher k values indicating a greater impact of data volume on execution time (e.g., k = 0.2 for text data, k = 0.5 for multimedia-related data); D is the size of the dataset processed by the current task, measured in megabytes (MB); $D_0$ is the baseline dataset size, fixed at 100 MB as the normalization baseline to eliminate calculation bias caused by differences in data volume units; γ is the task complexity coefficient, ranging from 1.0 to 2.0, automatically determined based on code logic (γ = 1.0 to 1.2 for simple query tasks with ≤2 nested loops and no complex calculations; γ = 1.8 to 2.0 for computationally intensive tasks with >2 nested loops and matrix operations).
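The formula can be evaluated directly. The sketch below is a minimal transcription of it, with range checks taken from the parameter definitions above; the function and argument names are chosen here for illustration and are not from the source.

```python
import math


def adaptive_timeout(t0, k, d_mb, gamma, d0_mb=100.0):
    """T_timeout = T0 * [1 + k * log10(1 + D / D0)] * gamma, in seconds."""
    if not 30 <= t0 <= 300:
        raise ValueError("T0 is expected in the range 30 to 300 seconds")
    if not 0.2 <= k <= 0.5:
        raise ValueError("k is expected in the range 0.2 to 0.5")
    if not 1.0 <= gamma <= 2.0:
        raise ValueError("gamma is expected in the range 1.0 to 2.0")
    if d_mb < 0:
        raise ValueError("dataset size must be non-negative")
    return t0 * (1 + k * math.log10(1 + d_mb / d0_mb)) * gamma


# Simple statistical task on 900 MB of text data:
# log10(1 + 900 / 100) = 1, so the threshold is 30 * 1.2 * 1.0, about 36 s.
simple_task_timeout = adaptive_timeout(t0=30, k=0.2, d_mb=900, gamma=1.0)
```

Because the data volume enters through a logarithm, a tenfold increase in D adds roughly k × T0 × γ seconds rather than multiplying the threshold by ten.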

[0048] The specific implementation is as follows. Based on the timeout setting in the execution request, $T_{timeout}$ is calculated with the adaptive timeout threshold formula and used as the default value; users can also set a custom value according to their actual needs (a custom threshold takes priority over the formula-calculated value). The resulting value is the preset timeout threshold for code execution in the current analysis task. When the code execution process starts, an independent timer is started at the same time; it continuously tracks the execution time and compares it with the preset timeout threshold. When the execution time exceeds the preset timeout threshold, an independent termination and cleanup process is triggered immediately: first, the code execution process within the container is forcibly terminated; then the runtime environment container destruction process is initiated, thoroughly clearing the container and task-related data and releasing all resources. At the same time, a timeout alarm is generated, informing the user of the reason for the timeout and the status of resource release. This embodiment can be implemented independently of the main embodiment (controlling only execution time), or together with the resource metric monitoring logic of the main embodiment (controlling both time and resources), adapting to different scenario requirements and further improving the controllability of code execution and system resource utilization.
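A minimal sketch of the timeout control follows, using a local Python subprocess as a stand-in for the code execution process inside a container (an assumption; container destruction and data cleanup are not modeled). `subprocess.run` with a `timeout` argument kills the child process when the limit is exceeded and raises `TimeoutExpired`. The function names and the shape of the returned dictionary are illustrative.

```python
import subprocess
import sys


def effective_timeout(formula_value, custom_value=None):
    """A user-defined threshold takes priority over the formula-calculated one."""
    return custom_value if custom_value is not None else formula_value


def run_with_timeout(code, timeout_s):
    """Run code in a child interpreter and terminate it if it exceeds timeout_s."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child has already been killed by subprocess.run at this point.
        return {
            "status": "timeout",
            "alarm": f"execution exceeded {timeout_s} s; process terminated",
        }
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

For instance, `run_with_timeout("import time; time.sleep(5)", effective_timeout(36.0, custom_value=0.5))` returns a dictionary whose status is `"timeout"`, because the custom value of 0.5 s overrides the formula value.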

[0049] After completing the creation and full lifecycle management configuration of the runtime environment container, S300 also includes a dynamic port mapping step to securely expose services within the container for external calls, providing a secure transmission channel for S400 to obtain code execution results and runtime feedback information in real time. Specifically, a dynamic random port allocation strategy is adopted to assign a dedicated temporary port to each runtime environment container. Through port mapping / forwarding technology, a mapping relationship is established between the core services within the container (such as the code execution kernel and temporary web services) and the temporary ports on the host machine, generating a unique access link. This access link is bound to the container identifier and is only open to the session of the current analysis task, providing strict access control capabilities. When the runtime environment container is destroyed according to the process, the corresponding port mapping relationship is automatically released, and the access link is simultaneously invalidated, effectively preventing unauthorized access or long-term port occupation, ensuring the security of service calls and resource utilization.
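The port allocation and session binding can be sketched as a small in-memory registry. Binding a socket to port 0 asks the operating system for a free ephemeral port; the class name, the link format, and the use of the loopback address are assumptions, and a real deployment would configure the mapping in the container runtime rather than only record it.

```python
import socket


class PortRegistry:
    """Track temporary host ports bound to a container and its session."""

    def __init__(self):
        self._by_container = {}

    def allocate(self, container_id, session_id, host="127.0.0.1"):
        # Port 0 lets the OS pick a free port. The socket is closed again here,
        # so another process could in principle claim the port before it is used.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))
            port = sock.getsockname()[1]
        link = f"http://{host}:{port}/"
        self._by_container[container_id] = {
            "session_id": session_id,
            "port": port,
            "link": link,
        }
        return link

    def resolve(self, container_id, session_id):
        """Return the access link only to the session that owns the container."""
        entry = self._by_container.get(container_id)
        if entry is None or entry["session_id"] != session_id:
            return None
        return entry["link"]

    def release(self, container_id):
        """Called on container destruction; the access link becomes invalid."""
        self._by_container.pop(container_id, None)
```

Resolving the link with a different session identifier, or after `release`, returns `None`, which mirrors the access control and automatic invalidation described above.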

[0050] Finally, after the runtime environment container is created and the port mapping is configured, the security-verified data analysis code is passed into the container, and the code execution kernel in the container is invoked to start the execution process. If only the main embodiment is used, execution is linked to the resource metric monitoring of the monitoring phase; if the other embodiment is used, execution is linked to the independent timer and, depending on the implementation scenario, also to resource metric monitoring, so that the code runs securely and controllably in an isolated environment. The results, alarms, and anomaly feedback produced during execution are all transmitted back to subsequent modules in real time, supporting subsequent result rendering and code correction.

[0051] S400: Obtain the execution result and execution feedback of the data analysis code; render and interactively display the execution result; and, based on the execution feedback, correct and re-execute the data analysis code until the analysis task is completed. S400 follows on from the code execution flow of S300. Its core implementation includes acquiring and interactively displaying the execution results and feedback of the data analysis code, and a closed loop for code correction based on that feedback. This ensures that data analysis tasks are carried out accurately and form a complete business chain. The specific process is as follows. First, relying on the dynamic port mapping link and session identifier association mechanism established in S300 (reusing the stateful session binding logic established in S100), the execution results and feedback of the data analysis code in the runtime environment container are obtained in real time, achieving secure and accurate data transmission. Execution results and feedback are collected and stored by type, and both are bound to the session identifier of the current analysis task, ensuring a unique and traceable correspondence between data and tasks. Execution results are the core data products generated by executing the data analysis code; they cover forms such as tabular data, image data, and text statistical results, adapting to the output needs of different data analysis scenarios. Execution feedback describes the code execution status, including status receipts after normal execution, and error messages and trace logs generated when execution exceptions occur. The error messages cover types such as syntax errors, logical errors, data source access exceptions, and insufficient-permission errors. The trace logs record the complete code execution path, error location, stack information, and runtime environment parameters, providing comprehensive data support for subsequent error localization.

[0052] Secondly, the acquired execution results are rendered with rich media and displayed interactively, adapting to the display characteristics of different data types to improve user interaction experience and data interpretation efficiency. For tabular data (such as data details, statistical summary tables, etc.), native rendering is used for loading and display, while integrating online interactive functions. Users can set filtering conditions based on any field dimension and customize sorting rules (ascending or descending order), allowing for quick filtering of target data and adjustment of data display order without re-executing code. For image data (such as trend charts, bar charts, heatmaps, scatter plots, etc.), original-size rendering and adaptive scaling are supported. Users can adjust the image display ratio as needed to view details, and local saving is also provided, supporting the export of image data to common format files (such as PNG, JPG). All of the above interactive functions are embedded in the dialog stream to achieve WYSIWYG, without the need to switch tools or download files. All operation results are synchronized to the current session context in real time, ensuring consistency with the logic of previous operations.

[0053] Sensitive data (such as ID card numbers, mobile phone numbers, bank card numbers, etc.) in the execution results are automatically anonymized. To ensure that the anonymized data does not affect the accuracy of the analysis results, an anonymization retention formula is constructed to quantitatively verify the anonymization rules and constrain the rationality of the anonymization operation: $R = (S_r / S_t) \times \delta$.

[0054] Where R is the analytical retention rate of the anonymized data, with an acceptable range of [0.9, 1.0]: R ≥ 0.9 is required so that core data analysis tasks such as statistical analysis, trend judgment, and data comparison are not affected. S_r is the number of effective data features retained after anonymization, defined according to the data type (numerical data retains magnitude and interval features, string data retains length and format features, date data retains year and month features, etc.). S_t is the total number of features of the original sensitive data, that is, all feature dimensions of the data before anonymization that can be used for analysis. δ is the feature effectiveness coefficient, with a value range of 0.95 to 1.0, determined by the type of anonymization rule (mask anonymization only hides sensitive fields, δ = 1.0; format replacement may slightly affect feature presentation, δ = 0.95).
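As a small worked example of the retention formula, the sketch below computes R and compares it with the 0.9 requirement. The function names are illustrative, and the feature counts in the examples are made up rather than taken from the source.

```python
def retention_rate(s_r, s_t, delta):
    """R = (S_r / S_t) * delta for one anonymization rule."""
    if s_t <= 0 or not 0 <= s_r <= s_t:
        raise ValueError("expected 0 <= S_r <= S_t and S_t > 0")
    if not 0.95 <= delta <= 1.0:
        raise ValueError("delta is expected in the range 0.95 to 1.0")
    return (s_r / s_t) * delta


def rule_is_acceptable(s_r, s_t, delta, minimum=0.9):
    """True if the rule keeps enough analytical features (R >= 0.9 by default)."""
    return retention_rate(s_r, s_t, delta) >= minimum


# Format replacement that keeps all 4 of 4 features: R = 0.95, accepted.
accepted = rule_is_acceptable(4, 4, 0.95)
# Masking that keeps only 3 of 4 features: R = 0.75, rejected.
rejected = rule_is_acceptable(3, 4, 1.0)
```

In the second case the user would be prompted to adjust the rule, for example by hiding fewer features.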

[0055] The anonymization rules can be customized by users according to business compliance requirements and cover methods such as field masking, format replacement, and content encryption. Before an anonymization operation is executed, the system first verifies the rule with the anonymization retention formula: if the calculated R ≥ 0.9, the operation is executed; if R < 0.9, the user is immediately prompted to adjust the rule (for example, by reducing the number of hidden features or changing the anonymization method). This ensures that anonymization targets only sensitive information and does not change the statistical characteristics or analytical dimensions of the data, so the accuracy of the analysis results is preserved. In parallel, error localization and code correction are carried out based on the execution feedback, forming a self-healing closed loop of execution, feedback, correction, and re-execution that keeps the task progressing. When the execution feedback is a normal status receipt, task progress is determined by the user's confirmation of the displayed results. If the results meet the data analysis objectives and the user confirms that they are valid, the loop terminates. If the user provides no feedback within the preset confirmation period (which can be customized), the results are assumed to meet the requirements and the loop terminates. The user can also terminate the process manually, in which case an archive of the interim results is generated. If the user requests adjustments, the request is converted into a new instruction and fed back to S100 to regenerate suitable code. When the execution feedback includes error messages and trace logs, precise error localization is first performed based on both: the error location, stack trace, and environment parameters are extracted from the trace logs, combined with the error message type, and associated with the current code snippet, session context, and sandbox runtime environment configuration to identify the root cause of the error, including but not limited to syntax errors, data format mismatches, function call errors, data source connection anomalies, and resource requests exceeding quotas.

[0056] After error localization is completed, the error information, tracking logs, and localization results are fed back to the large language model in S100. Based on the model's configured training and optimization module (combining iterative data such as historical code execution feedback and user correction operations), and the current analysis task's session context (including preceding instructions, generated code version, data analysis goals, and historical execution results), a targeted code correction plan is generated. The correction plan must balance error repair, consistency with the original data analysis goals, and adaptability to the sandbox runtime environment, avoiding the introduction of new security risks or resource issues. The corrected data analysis code undergoes security verification and is encapsulated into an execution request according to the S200 process: if the original runtime environment container has been destroyed due to an exception or timeout, an isolated runtime environment container is recreated based on the session identifier, repeating the S300-S400 process; if the runtime environment container is still alive, it can be reused to execute the corrected code, reducing environment creation overhead and task time.

[0057] Repeat the entire process of code execution → result and feedback acquisition → display and correction until the code executes normally, generates results that meet the data analysis objectives, and satisfies the closed-loop termination conditions. After the task is completed, archive all data from this task, including the original natural language instructions, all versions of data analysis code, execution results, feedback logs, correction records, and session context information. Associate the session identifier for long-term storage to facilitate subsequent task traceability, review, and problem troubleshooting.
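The execution, feedback, and correction loop can be sketched as follows. The sandbox and the large language model are replaced by stub functions, and the `max_rounds` cap is an assumption added so the loop always ends; the source describes termination by successful execution and user confirmation, which is not modeled here. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    ok: bool
    result: Optional[str] = None
    error: Optional[str] = None


def run_until_done(code, execute, correct, max_rounds=5):
    """Execute code, feed errors back for correction, and retry.

    Returns the last feedback and the history of (code, feedback) pairs,
    which corresponds to the per-task archive of code versions and logs.
    """
    history = []
    feedback = Feedback(ok=False, error="not executed")
    for _ in range(max_rounds):
        feedback = execute(code)
        history.append((code, feedback))
        if feedback.ok:
            break
        code = correct(code, feedback, history)
    return feedback, history


def execute_stub(code):
    """Stand-in for sandbox execution: evaluates one expression, only sum() allowed."""
    try:
        value = eval(code, {"__builtins__": {"sum": sum}})
        return Feedback(ok=True, result=str(value))
    except Exception as exc:
        return Feedback(ok=False, error=f"{type(exc).__name__}: {exc}")


def correct_stub(code, feedback, history):
    """Stand-in for model-driven correction: repairs one known bracket typo."""
    return code.replace("3)", "3])")


final, rounds = run_until_done("sum([1, 2, 3)", execute_stub, correct_stub)
```

Here the first round fails with a syntax error, the stub correction repairs the expression, and the second round succeeds with the result `"6"`; both code versions remain in `rounds`.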

[0058] Furthermore, this invention is based on a stateful session mechanism (i.e., retaining the historical interaction state of the same analysis task, rather than processing each operation independently, and associating the entire process data through a unique session identifier to achieve context reuse and logical coherence). By using a unique session identifier, the entire process steps of the same analysis task, such as code generation, execution, feedback, and correction, are associated and bound together, and the session context data (including preceding natural language instructions, generated code and version, execution results, user feedback, task execution status, etc.) are continuously synchronized. This ensures that the operation logic of each link is coherent and the data is interconnected, avoiding context loss due to process interruption or cross-step calls. It provides consistent support for sandbox execution and code correction closed loop, and also provides technical guarantee for the full data archiving and traceability of the above tasks.

[0059] This invention is widely applicable to various general data analysis scenarios, especially suitable for business scenarios with high requirements for data security, ease of operation, and process continuity. Typical applications include: Enterprise daily office data analysis scenarios, enabling non-professional office personnel to quickly process structured data in Excel, CSV tables, and business databases through natural language commands, achieving data statistics, trend visualization, and anomaly detection without manual coding, significantly improving office efficiency; Scientific research data analysis scenarios, adapting to heterogeneous data sources such as time-series databases and experimental datasets, processing large-scale experimental data in parallel through dynamic sandboxes, and optimizing analysis logic with a closed-loop correction mechanism to ensure the accuracy and security of data processing; Operation and maintenance data analysis scenarios, generating data collection and analysis code based on natural language commands, executing it in an isolated sandbox environment to avoid system risks, and relying on a stateful session mechanism to trace the entire process operation, facilitating problem review and fault location; Financial and government data processing scenarios, through sandbox isolation protection and sensitive data desensitization capabilities, compliantly processing business data containing privacy information, achieving a full-link security closed loop of data statistics, risk screening, and result display, balancing efficiency and compliance requirements. In addition, this solution can also be adapted to data analysis needs in multiple fields such as education and scientific research, and internet operations, flexibly responding to data processing tasks of different scales and types, and has strong scenario scalability.

[0060] In summary, the dynamic sandbox data analysis method based on a large language model provided by the embodiments of the present invention has the following specific technical effects: 1. Improve the accuracy of semantic understanding and code generation: By associating full-process data through a stateful session mechanism and combining context fusion weight formula to quantify the influence weight of historical data, the problem of semantic ambiguity and context loss is effectively avoided; the hierarchical large language model with training optimization module can continuously iterate based on execution feedback, achieve a high degree of coherence between intent recognition, code generation and historical operation logic, significantly reduce code deviation rate and make up for the deficiency of incomplete single instruction information.

[0061] 2. Enable seamless access to heterogeneous data sources: Automatically identify heterogeneous data sources such as CSV, Excel, relational databases, and time-series databases, generate adaptive reading code, shield data source access protocol and format differences, eliminate the need for manual user configuration, significantly reduce data source adaptation costs, and broaden the application scenarios of the solution.

[0062] 3. Enhance code execution security and isolation: Through pre-security verification (whitelist filtering) and dynamic sandbox container isolation, coupled with the least privilege running policy and full lifecycle management of containers, dangerous code execution is blocked from the source, avoiding data cross-contamination and resource competition, while ensuring that no data remains after task execution, thus guaranteeing data security and system stability.

[0063] 4. Optimize the efficiency of large-scale data processing: Based on the container load balancing formula, subtasks are evenly distributed, supporting parallel execution of multiple containers. Combined with adaptive timeout thresholds, task complexity and data volume are dynamically adapted to avoid resource idleness or excessive occupation, which greatly improves the analysis efficiency of large-scale datasets compared with the traditional serial processing mode.

[0064] 5. Lower the barrier to entry and improve process flexibility: Support non-programming users to initiate data analysis requests through natural language, interactive result display and self-healing correction loop reduce manual intervention; the system is compatible with single-machine and Kubernetes cluster deployment scenarios, and can be adapted to different business needs through user-defined rules (de-identification, timeout, weight, etc.), with strong flexibility and scalability.

[0065] (Embodiment 2) This embodiment provides a dynamic sandbox data analysis system based on a large language model to implement the data analysis process of the aforementioned method embodiment. Through modular design, the system achieves a closed-loop end-to-end process of instruction processing, request scheduling, sandbox execution, and feedback correction. Specifically, it includes the following functional modules: Instruction Processing Module: This module receives natural language instructions input by the user and, combined with historical session context data, generates data analysis code that meets the requirements through a hierarchical, modularly designed large language model. Its implementation includes three core operations: semantic parsing, intent recognition, and code generation. First, the natural language instruction is segmented into words, tagged with parts of speech, and its core semantics are extracted; the data analysis target is determined through a trained and optimized intent classification model. Then, relying on the stateful session mechanism, the historical session context (including preceding instructions, code versions, execution results, etc.) is loaded, and the influence weight of historical data in each dimension is quantified through the context fusion weight formula, achieving deep fusion of the context and the current instruction. Finally, executable data analysis code (such as Python / Pandas code) with structured annotations, adapted to heterogeneous data sources, is generated. At the same time, the training and optimization module is linked to continuously improve model accuracy based on historical feedback data.

[0066] The request scheduling module communicates with the instruction processing module, serving as an intermediate link between instruction processing and sandbox execution. Its core function is to perform pre-processing security checks on the data analysis code generated by the instruction processing module. It filters dangerous operations such as file privilege violations and malicious network requests using a preset operation whitelist, and only standardizes and encapsulates the verified code to generate an execution request containing core elements such as code blocks, adaptive timeout thresholds, session context, and execution priority. This execution request is then sent to the dynamic sandbox scheduling system, simultaneously synchronizing the session identifier to ensure the contextual consistency between the request and subsequent execution stages.
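One possible form of the pre-execution whitelist check is a static scan of the generated Python code with the standard `ast` module. The allowed import set below is a hypothetical example, and the short list of blocked built-in calls is an addition of this sketch rather than something stated in the source. A static scan of this kind is easy to circumvent on its own, which is why it is paired with sandbox isolation.

```python
import ast

# Hypothetical whitelist and blocked-call list for illustration only.
ALLOWED_IMPORTS = {"pandas", "numpy", "math", "statistics"}
BLOCKED_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def verify(code: str) -> list:
    """Return a list of violations; an empty list means the code passes."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            names = []
        for name in names:
            if name.split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import not on whitelist: {name}")
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            violations.append(f"blocked call: {node.func.id}")
    return violations
```

Only code for which `verify` returns an empty list would be encapsulated into an execution request; for example, `verify("import os")` reports that `os` is not on the whitelist.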

[0067] The sandbox execution module responds to execution requests sent by the request scheduling module, dynamically creating and managing isolated runtime environment containers (i.e., the isolated runtime environment of this system) throughout their entire lifecycle, achieving secure isolated code execution. This module is adaptable to cross-deployment scenarios, supporting both single-machine container environments and Kubernetes cluster environments, employing differentiated container management strategies for different environments. For large-scale datasets, it automatically splits the original task into multiple sub-tasks, optimizing sub-task allocation through container load balancing formulas to achieve parallel execution of multiple containers. It also integrates dynamic port mapping, resource metric monitoring, and timeout control mechanisms, collecting container runtime status data in real time to ensure controllable code execution and reasonable resource allocation. After execution, it transmits execution results and feedback information back via port mapping links.

[0068] Feedback and Correction Module: Establishes communication connections with both the sandbox execution module and the instruction processing module, forming a self-healing closed loop of execution-feedback-correction. On one hand, through the dynamic port mapping link established by the sandbox execution module, it obtains code execution results and execution feedback (including normal status receipts, error messages, and trace logs) in real time, and performs rich media rendering and interactive display on the execution results—table data supports online filtering and sorting, image data supports scaling and saving, and operation results are synchronized to the session context in real time. On the other hand, it performs error localization based on execution feedback, extracts error locations and stack information from the trace logs to determine the root cause of the error, and feeds it back to the instruction processing module to trigger the code correction process. The instruction processing module generates corrected code and repeats the aforementioned process until the analysis task meets the termination conditions (code execution is normal, results meet the target and user confirmation or no feedback after timeout).

[0069] This system embodiment can be used to perform the entire process of the aforementioned method embodiment shown in Figure 1. Therefore, for the specific implementation details, interaction logic, and technical effects of each functional module, please refer to the corresponding description of the method embodiment, which will not be repeated here.

[0070] It should be noted that the system embodiments and the corresponding method embodiments are based on the same inventive concept. The related technical features, implementation paths and technical effects involved in the method embodiments, such as semantic parsing, adaptive timeout management, container load balancing, and sensitive data desensitization, are all applicable to the system embodiments, and the two can corroborate and complement each other.

[0071] Those skilled in the art will understand that the above-mentioned functional modules can be implemented by hardware, software, firmware, or any combination thereof. When implemented by software or firmware, the system should also include a memory and a processing unit: the memory is used to store computer programs, session context data, model parameters, security whitelist rules, and task archive data, etc., to ensure data traceability and reuse; the processing unit, as the core control unit, is communicatively connected to each functional module and the memory, and when executing the computer program stored in the memory, it drives each module to work together to realize all the steps of the aforementioned data analysis method.

[0072] This invention also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the method described in this invention.

[0073] This invention also provides a computer-readable storage medium storing computer-executable instructions for performing the methods described in this invention.

[0074] It should be understood that the various forms of processes shown above can be used to reorder, add, or delete steps. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this invention can be achieved, and this is not limited herein.

[0075] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A dynamic sandbox data analysis method based on a large language model, characterized in that the method includes the following steps: S100: receiving the user's natural language instruction and generating data analysis code based on the natural language instruction and the historical session context using a large language model; S200: encapsulating the data analysis code into an execution request and sending it to the dynamic sandbox scheduling system; S300: the dynamic sandbox scheduling system responding to the execution request by dynamically creating an isolated runtime environment and executing the data analysis code in that runtime environment; S400: obtaining the execution result and execution feedback of the data analysis code, and rendering and interactively displaying the execution result; and, based on the execution feedback, correcting and re-executing the data analysis code until the analysis task is completed.

2. The method according to claim 1, characterized in that, S100 specifically includes: performing intent recognition on the natural language instruction, determining the data analysis target, and generating data analysis code that matches the data analysis target by combining the historical session context.

3. The method according to claim 1, characterized in that the execution request includes a code execution priority, and the dynamic sandbox scheduling system allocates computing resources to the runtime environment based on the code execution priority.
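By way of non-limiting illustration, priority-based allocation as recited in claim 3 might be sketched as a priority queue whose entries are mapped to resource quotas. The tier numbers, quota values, and the names `Quota`, `ExecutionRequest`, and `Scheduler` are assumptions made for this sketch only; the application does not specify them.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Quota:
    cpu_cores: float
    memory_mb: int


# Hypothetical tiers: a lower number means a more urgent request.
QUOTA_BY_PRIORITY = {
    0: Quota(cpu_cores=2.0, memory_mb=4096),
    1: Quota(cpu_cores=1.0, memory_mb=2048),
    2: Quota(cpu_cores=0.5, memory_mb=512),
}


@dataclass(order=True)
class ExecutionRequest:
    priority: int
    sequence: int                       # tie-breaker: arrival order
    code: str = field(compare=False)    # payload is not part of the ordering


class Scheduler:
    def __init__(self) -> None:
        self._queue: List[ExecutionRequest] = []
        self._counter = 0

    def submit(self, code: str, priority: int) -> None:
        """Queue an execution request carrying its code execution priority."""
        heapq.heappush(self._queue, ExecutionRequest(priority, self._counter, code))
        self._counter += 1

    def dispatch(self) -> Tuple[ExecutionRequest, Quota]:
        """Pop the most urgent request and pick the quota for its environment."""
        request = heapq.heappop(self._queue)
        # Unknown priorities fall back to the smallest tier.
        quota = QUOTA_BY_PRIORITY.get(request.priority, QUOTA_BY_PRIORITY[2])
        return request, quota
```

In this sketch the priority influences both the dispatch order and the size of the quota; an implementation could equally use only one of the two.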

4. The method according to claim 1, characterized in that the dynamic sandbox scheduling system is adapted to containerized runtime environments and, for each analysis task, independently creates, monitors, and destroys a corresponding runtime environment container, thereby implementing full lifecycle management.

5. The method according to claim 4, characterized in that the full lifecycle management includes: allocating an independent resource quota to the runtime environment container during the creation phase, monitoring resource usage in real time during the monitoring phase, and clearing all task-related data and environment configurations during the destruction phase.
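By way of non-limiting illustration, the three lifecycle phases of claim 5 might be sketched with a Python context manager. A temporary directory stands in for the container here, and usage samples are passed in by the caller rather than measured; `Environment` and `managed_environment` are hypothetical names introduced for this sketch.

```python
import shutil
import tempfile
from contextlib import contextmanager
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterator, List


@dataclass
class Environment:
    task_id: str
    workdir: Path
    memory_limit_mb: int                              # quota fixed at creation
    config: Dict[str, str] = field(default_factory=dict)
    samples: List[float] = field(default_factory=list)

    def record_usage(self, memory_mb: float) -> None:
        """Monitoring phase: store a usage sample and enforce the quota."""
        self.samples.append(memory_mb)
        if memory_mb > self.memory_limit_mb:
            raise MemoryError(
                f"task {self.task_id} exceeded {self.memory_limit_mb} MB"
            )


@contextmanager
def managed_environment(task_id: str, memory_limit_mb: int) -> Iterator[Environment]:
    # Creation phase: a private working directory plus an independent quota.
    workdir = Path(tempfile.mkdtemp(prefix=f"{task_id}-"))
    env = Environment(task_id, workdir, memory_limit_mb)
    try:
        yield env
    finally:
        # Destruction phase: remove task data and environment configuration,
        # even if the task failed or exceeded its quota.
        shutil.rmtree(env.workdir, ignore_errors=True)
        env.config.clear()
        env.samples.clear()
```

Because the cleanup sits in the `finally` branch, the working directory and configuration are removed whether the task completes normally or is aborted by the quota check.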

6. The method according to claim 1, characterized in that S300 further comprises: temporarily exposing services within the runtime environment for invocation through dynamic port mapping, and disabling the corresponding access link after the runtime environment is destroyed.
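By way of non-limiting illustration, the host side of dynamic port mapping as recited in claim 6 might be sketched as follows. Binding to port 0 lets the operating system choose a free port; the listening socket is only a stand-in for a real forward into the runtime environment, and `PortMapper` is a hypothetical name introduced for this sketch.

```python
import socket
from typing import Dict


class PortMapper:
    """Tracks one temporary host port per runtime environment."""

    def __init__(self) -> None:
        self._listeners: Dict[str, socket.socket] = {}

    def expose(self, env_id: str) -> int:
        """Open a temporary access point and return the host port chosen."""
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind(("127.0.0.1", 0))    # port 0: the OS picks a free port
        listener.listen()
        self._listeners[env_id] = listener
        return listener.getsockname()[1]

    def is_exposed(self, env_id: str) -> bool:
        return env_id in self._listeners

    def revoke(self, env_id: str) -> None:
        """Disable the access link once the environment is destroyed."""
        listener = self._listeners.pop(env_id, None)
        if listener is not None:
            listener.close()
```

Calling `revoke` from the same code path that destroys the runtime environment keeps the access link from outliving the environment it points to.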

7. The method according to claim 1, characterized in that the rendering and interactive display of the execution result includes: providing online filtering and sorting functions for tabular data, and/or providing zooming and saving functions for image data.

8. The method according to claim 1, characterized in that a stateful session mechanism is used to associate all code generation, execution, and correction steps of the same analysis task, so as to maintain contextual consistency.

9. The method according to claim 1 or 8, characterized in that the execution feedback includes error information and tracing logs generated during execution of the data analysis code, and errors are located and code correction plans are generated based on the error information and tracing logs.
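By way of non-limiting illustration, locating an error from the traceback of failed code, as recited in claim 9, might be sketched as follows. The function name `collect_feedback` and the keys of the returned dictionary are assumptions made for this sketch; the code runs in the current process purely for brevity, whereas the method executes it in the isolated runtime environment.

```python
import traceback
from typing import Any, Dict, Optional

SOURCE_NAME = "<analysis>"   # hypothetical label for the generated code


def collect_feedback(code: str) -> Optional[Dict[str, Any]]:
    """Run code and return structured execution feedback, or None on success."""
    try:
        exec(compile(code, SOURCE_NAME, "exec"), {})
    except Exception as exc:
        if isinstance(exc, SyntaxError):
            lineno = exc.lineno                 # raised by compile(), no frame
        else:
            frames = [
                frame
                for frame in traceback.extract_tb(exc.__traceback__)
                if frame.filename == SOURCE_NAME
            ]
            # The last matching frame is the failing line in the generated code.
            lineno = frames[-1].lineno if frames else None
        lines = code.splitlines()
        source = (
            lines[lineno - 1].strip()
            if lineno is not None and 0 < lineno <= len(lines)
            else None
        )
        return {
            "error_type": type(exc).__name__,
            "message": str(exc),
            "line": lineno,
            "source": source,
            "trace": traceback.format_exc(),    # full tracing log for the model
        }
    return None
```

The returned dictionary could be passed back to the code generator as the basis of a correction plan, for example by quoting the failing line and the error message in the next prompt.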

10. A dynamic sandbox data analysis system based on a large language model, characterized in that the system comprises:
an instruction processing module, used to receive a natural language instruction from a user and to generate data analysis code with a large language model based on the natural language instruction and a historical conversation context;
a request scheduling module, connected to the instruction processing module and used to encapsulate the data analysis code into an execution request and send the execution request to a dynamic sandbox scheduling system;
a sandbox execution module, used to dynamically create and manage an isolated runtime environment in response to the execution request, and to execute the data analysis code in the runtime environment; and
a feedback correction module, used to obtain an execution result and execution feedback of the data analysis code, to render and interactively display the execution result, and, based on the execution feedback, to control the instruction processing module to correct and re-execute the data analysis code until the analysis task is completed.