An AI inference resource decoupling scheduling method and system based on a two-dimensional orthogonal state space

By decoupling storage and computing resources through a two-dimensional orthogonal state space model, and combining metadata-level atomic operations and edge encryption, the problem caused by resource bundling in AI inference services is solved, achieving low-latency, high-efficiency resource scheduling and data sovereignty protection.

CN122309167A · Pending · Publication Date: 2026-06-30 · DONGGUAN INTENT RESONANCE TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: DONGGUAN INTENT RESONANCE TECHNOLOGY CO LTD
Filing Date: 2026-04-06
Publication Date: 2026-06-30

Application Information

Patent Timeline

06 Apr 2026: Application

30 Jun 2026: Publication (CN122309167A)

IPC: G06F9/50; G06F9/52; G06N5/04; H04L9/06; H04L9/08

AI Tagging

Technology Topics

Physical system Scheduling (computing)




AI Technical Summary

Technical Problem

Existing AI inference services suffer from resource deadlock, switching delays, and ambiguous decision-making due to the bundling of storage and computing resources. Furthermore, the lack of user data sovereignty prevents the implementation of fine-grained resource scheduling and low-latency state switching.

Method used

A two-dimensional orthogonal state space model is adopted to independently monitor the storage rights dimension and the computing power rights dimension. Resource decoupling is achieved through event-driven monitoring and discrete quota counting. Combined with metadata-level atomic operations and edge encryption, data sovereignty and low-latency switching are ensured.

Benefits of technology

It decouples storage resources from computing resources, keeps user data accessible, achieves low latency in state switching, ensures that data sovereignty belongs to the user, meets compliance requirements, and improves the efficiency and security of resource scheduling.

✦ Generated by Eureka AI based on patent content.


Patent Text Reader

Abstract

This invention discloses an AI inference resource decoupling scheduling method and system based on a two-dimensional orthogonal state space. User AI service rights are decomposed into storage rights dimension σ and computing power rights dimension φ, constructing an orthogonal state space S=(σ,φ), where σ and φ change independently and have no causal dependency. Four service modes are derived from the state space: full service, read-only memory, temporary session, and guided recovery. State transitions in the computing power rights dimension are triggered by critical conditions of discrete quota counting: the cost r for a single token generation is defined as the lower limit of physical resources required for a single forward propagation of the model decoder. When the remaining quota Q(t) is less than r, the physical system cannot complete the next calculation, and the state will inevitably transition and terminate the computation flow. State switching is achieved through metadata-level atomic operations, and the execution time is independent of the user data scale. This invention clarifies the inevitability of the technical effect through logical deduction, is applicable to software or hardware-software co-implementation, and solves the problems of resource deadlock, switching latency, and service interruption under subscription systems.


Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method and system for dynamic resource scheduling and data management for large language model inference services. The method described in this invention operates on a standard computing device including a processor, memory, and network interface, and achieves dynamic monitoring and scheduling control of resource status through software program instructions and underlying hardware collaboration.

Background Technology

[0002] With the widespread application of large language models, AI inference services face increasingly severe resource management challenges. In existing technologies, user interaction with AI typically employs subscription or pay-per-use models, which suffer from the following technical drawbacks:

[0003] Resource deadlock: Subscription-based systems tie storage and computing resources to a single equity dimension. When computing power quotas are exhausted, users simultaneously lose access to their historical data, leading to service interruption. This flaw stems from the fact that a univariate state space cannot represent the intermediate state of "storage being available but computing power being insufficient."

[0004] Switching latency: When a user's status changes (such as renewal or expiration), the existing solution requires a full data migration. Let the user data size be D and the bandwidth be B, then the migration time T_migrate = D / B, which is directly proportional to the data size. As D increases, the service interruption window becomes unacceptable.

[0005] Ambiguous decision-making: Existing billing systems rely on post-event deductions or empirical thresholds, lacking clear critical switching conditions. When computing power quotas are nearing exhaustion, the system cannot definitively predict whether the next computation will exceed the budget, leading to either premature termination (wasting quotas) or post-event overspending (damaging user experience).

[0006] Lack of data sovereignty: Platforms' centralized control over user conversation data not only violates the "minimum necessity principle" of the Personal Information Protection Law, but also hinders the formation of a data element market.

[0007] Therefore, there is a need for an AI service system and method that can use user memory as a core digital asset, while achieving refined resource scheduling, dynamic access control, and state switching optimization.

Summary of the Invention

[0008] Technical problems to be solved

[0009] The technical problem to be solved by this invention is: how to decouple storage resources and computing resources in AI inference services, so that the accessibility of user memory data can still be maintained when computing resources are insufficient, while achieving low latency in state switching and efficient release of resources, and ensuring that data sovereignty belongs to the user.

Technical solution

[0010] This invention provides a method for decoupling and scheduling AI inference resources based on a two-dimensional orthogonal state space, comprising the following steps:

[0011] Step 1: Construct a two-dimensional orthogonal state space. Define a storage rights dimension σ∈{0,1} (0 indicates storage resource failure, 1 indicates availability) and a computing power rights dimension φ∈{0,1} (0 indicates insufficient computing power resources, 1 indicates sufficient computing power resources), forming a two-dimensional state vector S=(σ,φ). σ and φ change independently and have no causal dependency, meaning that the state change events of the two dimensions are independent of each other in the probability space.

[0012] Four mutually exclusive service modes can be derived from this state space:
S11=(1,1): storage is available and computing power is sufficient, providing full service;
S10=(1,0): storage is available but computing power is insufficient, providing read-only memory service;
S01=(0,1): storage has failed but computing power is sufficient, providing temporary session service;
S00=(0,0): both are unavailable, providing boot recovery service.

[0013] A two-dimensional state space can represent four modes, while a univariate bundled model can only represent two, so the decoupled architecture doubles the state representation capability. It should be noted that the binarization of φ is based on a deterministic criterion of physical resource limits: as long as the remaining quota can cover at least one forward propagation, the system considers the computing power sufficient; otherwise, it immediately trips the circuit breaker. This binarization is a design choice to ensure unambiguous decision-making, not a technical defect.
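The state space of Step 1 can be sketched in a few lines of Python. This is a minimal illustration only; the names `State`, `ServiceMode`, and `service_mode` are hypothetical and do not appear in the specification.

```python
from enum import Enum
from typing import NamedTuple


class State(NamedTuple):
    """State vector S = (sigma, phi); both components are binary."""
    sigma: int  # storage rights: 1 = available, 0 = failed/unavailable
    phi: int    # computing power rights: 1 = sufficient, 0 = insufficient


class ServiceMode(Enum):
    """The four mutually exclusive modes, keyed by (sigma, phi)."""
    FULL_SERVICE = (1, 1)
    READ_ONLY_MEMORY = (1, 0)
    TEMPORARY_SESSION = (0, 1)
    BOOT_RECOVERY = (0, 0)


def service_mode(state: State) -> ServiceMode:
    """Look up the service mode that corresponds to a state vector."""
    return ServiceMode((state.sigma, state.phi))


# Changing one dimension leaves the other untouched (orthogonality).
s = State(sigma=1, phi=1)
s_after_quota_exhaustion = s._replace(phi=0)
print(service_mode(s_after_quota_exhaustion).name)
```

Updating φ through `_replace` never reads or writes σ, which mirrors the requirement that the two dimensions change independently.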

[0014] Step 2: Establish the state-service mapping function. Establish a mapping relationship f between the logical access path P_logic and the physical resource pool P_phys (including the hot storage layer, cold storage layer, and computing accelerator), where f is a function of the state vector S, i.e., f = F(S). The output of the mapping function includes:

[0015] F(1,1): P_logic points to both the hot storage layer and the computing accelerator;
F(1,0): P_logic points only to the cold storage layer (read-only), severing the connection with the computing accelerator;
F(0,1): P_logic points only to the computing accelerator (temporary session), cutting off the write path to the storage layer;
F(0,0): P_logic points to the bootstrap service unit.
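The mapping F(S) listed above is a pure lookup from the state vector to a set of physical targets. A minimal sketch follows; the string labels and the `ROUTES` table are illustrative assumptions standing in for real storage and accelerator endpoints.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical labels for the members of the physical resource pool P_phys.
HOT_STORAGE = "hot_storage"
COLD_STORAGE_READ_ONLY = "cold_storage_read_only"
ACCELERATOR = "compute_accelerator"
BOOTSTRAP_UNIT = "bootstrap_service_unit"

# F(S): which physical targets the logical access path P_logic points to.
ROUTES: Dict[Tuple[int, int], FrozenSet[str]] = {
    (1, 1): frozenset({HOT_STORAGE, ACCELERATOR}),
    (1, 0): frozenset({COLD_STORAGE_READ_ONLY}),
    (0, 1): frozenset({ACCELERATOR}),
    (0, 0): frozenset({BOOTSTRAP_UNIT}),
}


def F(sigma: int, phi: int) -> FrozenSet[str]:
    """Return the physical targets for state S = (sigma, phi)."""
    return ROUTES[(sigma, phi)]


print(sorted(F(1, 0)))
```

Because the table is fixed and has four entries, recomputing the mapping after a state transition is a constant-time operation.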

[0016] Step 3: Event-driven state monitoring and transitions. The update of the storage rights dimension σ is driven by business events: in response to storage renewal events, storage expiration events, or storage failure recovery events, the system updates the σ value in real time.

[0017] The update of the computing power rights dimension φ is based on discrete quota counting: the remaining quota Q(t) is defined as Q0 - Σrᵢ, where Q0 is the initial quota and rᵢ is the discrete count value of the i-th token consumption. The single token generation cost r is defined as the lower bound of physical resources required for a single forward propagation of the model decoder, i.e., the minimum indivisible resource unit required to perform this operation on a given hardware. This value can be measured in an isolated environment through offline micro-benchmarking. For homogeneous hardware clusters and resource isolation guaranteed by hardware partitioning (such as MIG) or exclusive streams, r is a stable constant.

[0018] It should be noted that the stability of the physical resource lower limit r is based on a deployment environment where computing resources are physically isolated or exclusively used (e.g., independent instances partitioned using multi-instance GPU technology MIG, or dedicated computing units of a dedicated AI inference chip). In such environments, the physical overhead of a single forward propagation is fixed as a constant, thus ensuring the physical inevitability of criticality determination. For non-isolated environments, a pure software polling scheme can be used as an alternative implementation.

[0019] When Q(t) < r is detected, a state transition is triggered, and φ is updated to 0.

[0020] The physical basis for this critical condition is: if Q(t) ≥ r, then the system is capable of completing the next token generation; if Q(t) < r, then the physical system is absolutely unable to complete the next forward propagation. In this case, the state transition is a physical inevitability, rather than an empirical threshold judgment.
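The critical condition can be illustrated with a small counter that tracks Q(t) = Q0 - Σrᵢ and derives φ from the comparison with r. This is a sketch under the assumption that quota and cost are expressed in the same integer units; the class name `QuotaCounter` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuotaCounter:
    q0: int                      # initial quota Q0
    r: int                       # cost of one forward pass (lower bound)
    consumed: List[int] = field(default_factory=list)  # the r_i values

    def remaining(self) -> int:
        """Q(t) = Q0 - sum of r_i."""
        return self.q0 - sum(self.consumed)

    def phi(self) -> int:
        """phi = 1 while Q(t) >= r, otherwise 0."""
        return 1 if self.remaining() >= self.r else 0

    def charge(self, r_i: int) -> None:
        """Record the consumption of the i-th token."""
        if r_i > self.remaining():
            raise ValueError("consumption would exceed the remaining quota")
        self.consumed.append(r_i)


counter = QuotaCounter(q0=10, r=3)
for _ in range(3):
    counter.charge(3)
print(counter.remaining(), counter.phi())
```

With Q0 = 10 and r = 3, three tokens leave Q(t) = 1, which is below r, so φ becomes 0 without any empirical threshold being involved.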

[0021] Step 4: Metadata-level state switching. When the state vector S transitions, the mapping relationship is recalculated according to the mapping function F, and the physical node pointed to by P_logic is dynamically adjusted through atomic pointer swapping operations. This atomic operation only modifies the metadata of the access path (such as file system inode pointers, object storage metadata, and memory page table entries), and does not involve the relocation of data content. Its execution time does not increase with the increase of the total amount of user data.

[0022] Step 5: Real-time quota circuit breaker. In the token generation loop of AI inference, the remaining quota Q(t) is checked after every N tokens are generated, where N is a preset number. When Q(t) < N·r is detected:

[0023] The system calls the hardware accelerator driver interface (such as cudaStreamDestroy in CUDA) to destroy the current computing stream and terminate subsequent token generation; it returns an interrupt flag (such as [QUOTA_EXHAUSTED]) to the client while retaining the partial results already generated; and the video memory occupied by the stream is marked as pending reclamation, so that subsequent memory allocation requests can reuse it.
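The control flow of the circuit breaker can be sketched as a host-side simulation. No accelerator API is called here: the early return stands in for the stream teardown, and `generate`, `next_token`, and the integer quota units are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

QUOTA_EXHAUSTED = "[QUOTA_EXHAUSTED]"


def generate(
    next_token: Callable[[int], str],
    max_tokens: int,
    quota: int,
    r: int,
    n: int,
) -> Tuple[List[str], Optional[str], int]:
    """Generate up to max_tokens, checking the quota once per batch of n.

    Returns (partial results, interrupt flag or None, remaining quota).
    """
    tokens: List[str] = []
    while len(tokens) < max_tokens:
        # Critical condition from Step 5: Q(t) < N * r.
        if quota < n * r:
            # A real system would tear down the compute stream here.
            return tokens, QUOTA_EXHAUSTED, quota
        for _ in range(min(n, max_tokens - len(tokens))):
            tokens.append(next_token(len(tokens)))
            quota -= r  # each token consumes one unit of cost r
    return tokens, None, quota


out, flag, left = generate(lambda i: f"tok{i}", max_tokens=20, quota=10, r=1, n=4)
print(len(out), flag, left)
```

In this run the loop stops after two batches (8 tokens) with 2 units of quota left, which shows the trade-off of checking every N tokens: fewer checks, at the cost of leaving up to N·r - 1 units unused.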

[0024] Step 6: Data sovereignty protection. The user's master key K_client is generated and stored only in a secure area on the client side (such as a TEE or operating system keystore), while the encrypted form of the user's memory is persistently stored in the cloud.

[0025] In response to a user query request, the client uses K_client to decrypt the relevant memory fragments and submits them to the server along with the query; alternatively, the client can authorize a temporary derived key to the server, and the server can decrypt the subset of memory required for the query within a trusted execution environment.

[0026] After a query session terminates, the server immediately clears the plaintext memory and temporary key. The cloud does not hold the master key and cannot independently decrypt any user data. Specifically, when the system is in read-only memory mode (σ=1, φ=0), the client can choose not to provide the decryption key to the server according to user policies, thus preventing the server from accessing the user's plaintext memory and further enhancing the user's control over the data. This feature reflects the functional synergy between state space and key management.
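The specification does not state how the temporary derived key is computed. One plausible approach, shown here purely as an assumption, is to derive a per-session key from K_client with HKDF-SHA256 (RFC 5869), so that the master key itself never leaves the client. The function names and the `memory-query:` label are hypothetical.

```python
import hashlib
import hmac
import os


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (extract-then-expand) with HMAC-SHA256."""
    if length > 255 * 32:
        raise ValueError("requested length too large for HKDF-SHA256")
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()          # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                     # expand
        block = hmac.new(
            prk, block + info + bytes([counter]), hashlib.sha256
        ).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_session_key(k_client: bytes, session_id: bytes) -> bytes:
    """Derive a 256-bit temporary key bound to one query session."""
    return hkdf_sha256(k_client, salt=bytes(32), info=b"memory-query:" + session_id)


k_client = os.urandom(32)  # in practice held in a TEE or OS keystore
session_key = derive_session_key(k_client, b"session-0001")
print(len(session_key))
```

Because the derivation is one-way, a server holding only `session_key` cannot recover K_client, and a different session identifier yields an unrelated key.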

[0027] The master key may be backed up using a secure recovery mechanism (such as a mnemonic phrase or distributed key backup) to prevent user data from becoming permanently inaccessible due to damage to the client-side device.

[0028] Step 7: Secure cross-device synchronization. When synchronizing across devices, the following protocol is executed:
The source device generates a temporary ECDH key pair and encodes the public key into a QR code;
The target device scans the QR code to obtain the public key, generates a local temporary key pair, and calculates the shared key;
The target device uses the shared key to encrypt K_client to obtain ciphertext, and sends it to the cloud for forwarding to the source device;
The source device decrypts the data using the shared key, then re-encrypts it before returning it to the target device;
The target device decrypts and obtains K_client, then immediately removes the shared key and temporary key pair from memory.

[0029] The cloud only forwards encrypted text and does not access the plaintext master key.

[0030] Step 8: Exception handling mechanism. A transient state is introduced during state transitions. During this transient state, new requests enter a queue and wait for the lock to be released; if the wait times out (e.g., 5 seconds), the system transitions to an error state and returns a 503 status code, suggesting the client retry. Error states can also be triggered by log replay failures or timeouts. In these cases, the system automatically rolls back to the most recent consistent snapshot and triggers an alert, supporting manual or automatic recovery. The Write-Ahead Log (WAL) mechanism guarantees the atomicity and idempotency of the transition operation.
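The waiting behaviour of the transient state can be sketched with a `threading.Event` that is cleared while a transition is in progress. This is a simplified single-process illustration; the class `TransitionGate` is hypothetical, and rollback and alerting are omitted.

```python
import threading
from typing import Callable, Tuple


class TransitionGate:
    """Holds incoming requests while a state transition is in progress."""

    def __init__(self, wait_timeout: float = 5.0) -> None:
        self.wait_timeout = wait_timeout
        self._stable = threading.Event()
        self._stable.set()  # no transition in progress initially

    def begin_transition(self) -> None:
        self._stable.clear()  # enter the transient state

    def end_transition(self) -> None:
        self._stable.set()    # leave the transient state

    def handle_request(self, handler: Callable[[], str]) -> Tuple[int, str]:
        # Wait for the transition to finish; give up after the timeout.
        if not self._stable.wait(timeout=self.wait_timeout):
            return 503, "state transition in progress, please retry"
        return 200, handler()


gate = TransitionGate(wait_timeout=0.05)
gate.begin_transition()
print(gate.handle_request(lambda: "ok")[0])
gate.end_transition()
print(gate.handle_request(lambda: "ok")[0])
```

The short timeout in the example is only to keep the demonstration fast; the specification mentions 5 seconds as an example value.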

[0031] Beneficial effects. The technical effects of this invention are based on the deterministic principles of the underlying architecture of computer systems and can be reasonably expected through logical deduction. Those skilled in the art, based on the mapping relationships and metadata-level atomic operation mechanisms described in the specification, can reasonably expect and verify the following effects:

[0032] State representation capability is doubled: The cardinality of the two-dimensional state space is 4, while that of the univariate model is 2. The decoupled architecture achieves a doubling of state representation capability.

[0033] Physical determinism of critical switching: Based on the conservation relationship of discrete quota counting, r is defined as the lower limit of physical resources. When the remaining quota is less than this lower limit, the system will inevitably be unable to complete the next calculation in a physical sense. The state transition becomes a deterministic choice, avoiding the ambiguity of empirical threshold judgment.

[0034] Switching time is decoupled from data scale: Metadata-level atomic operations only modify the access path and do not involve data migration. Their execution time does not increase with the increase of the total amount of user data, overcoming the linear overhead defect of traditional full migration solutions.

[0035] Data sovereignty and compliance assurance: Building upon decoupled resource scheduling, this invention further integrates edge encryption and blind memory architecture. This mechanism works in conjunction with the state space: for example, in read-only memory mode, the client can choose not to provide the decryption key, thereby completely blocking the inference service's access to the user's memory. This design provides an indispensable compliance foundation for the legal deployment of the system under laws such as the Personal Information Protection Law, constituting a security defense dimension independent of scheduling performance.

[0036] Flexible implementation through hardware and software integration: The core concept of this invention can be fully implemented through a pure software protocol stack, or enhanced through hardware collaboration. In the hardware collaboration scheme, the quota discrete counting logic at the software level and the interrupt triggering mechanism of the hardware comparator functionally support each other and have an interactive relationship: the software logic defines the critical transition condition, and the hardware mechanism compresses the judgment delay of this condition from the millisecond level of software polling to the microsecond level of hardware interruption. The two work together to achieve deterministic ultra-low latency circuit breaking.

[0037] The hardware collaboration scheme described in this embodiment aims to provide logical specifications for the hardware design of AI inference accelerators. For chip design engineers, it is obvious that the corresponding functions can be implemented in application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) based on the logical relationships such as register mapping, comparator triggering, and interrupt reporting described in this embodiment.

Attached Figure Description

[0038] Figure 1 is a state transition diagram of the two-dimensional orthogonal state space of the present invention, showing the four basic states (S00, S01, S10, S11) and the state transition conditions (storage renewal, token exhaustion, etc.).

[0039] Figure 2 is a system architecture diagram of the present invention, showing the connection relationships between the client, gateway, state machine maintenance module, routing control module, quota verification module, hot storage layer, cold storage layer, inference engine (GPU accelerator), and key management module.

[0040] Figure 3 is a flowchart of the real-time quota circuit breaker mechanism of the present invention, showing quota detection, critical condition judgment, computation stream destruction, and interrupt flag return in the token generation loop.

[0041] Figure 4 is a schematic diagram of the metadata-level atomic switching of the present invention, illustrating three implementation methods: file system mount point redirection, object storage metadata update, and memory address remapping.

[0042] Figure 5 is a flowchart of the cross-device key negotiation process of the present invention, illustrating the steps of QR code transmission, ECDH negotiation, ciphertext forwarding, and key clearing.

Detailed Implementation

[0043] The present invention will now be described in detail with reference to the accompanying drawings and embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention.

[0044] Example 1: Construction and transition of the two-dimensional state space. As shown in Figure 1, the system maintains four basic states as well as extended transient and error states.

[0045] State transitions are driven by business events: when a user completes the storage renewal operation, the system receives a payment callback, updates σ=1, and the state transitions from S00 to S10 or from S01 to S11; when the computing power quota falls below the single generation cost r, the system detects the critical condition, updates φ=0, and the state transitions from S11 to S10 or from S01 to S00.

[0046] State transition operations are executed sequentially using distributed locks or single-threaded serialization. The system employs a write-ahead logging mechanism: after recording the change intent in the persistent log, a success response is immediately returned, and the log is asynchronously replayed in the background to complete physical state alignment. Multiple executions of the same operation are idempotent.
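The write-ahead logging behaviour described above can be illustrated with a file-backed log whose replay skips operations it has already applied. This is a single-process sketch; the record layout, the `op_id` field, and the class name `StateWAL` are assumptions, and distributed locking is out of scope.

```python
import json
import os
from typing import Dict, Set


class StateWAL:
    """Append change intents to a log; apply them later, idempotently."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.applied: Set[str] = set()           # op_ids already replayed
        self.state: Dict[str, int] = {"sigma": 0, "phi": 0}

    def log_intent(self, op_id: str, dimension: str, value: int) -> str:
        """Persist the intent, then acknowledge immediately."""
        record = {"op_id": op_id, "dimension": dimension, "value": value}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())                  # durable before the ack
        return "accepted"

    def replay(self) -> Dict[str, int]:
        """Apply logged intents in order; duplicates are ignored."""
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["op_id"] in self.applied:
                    continue
                self.state[record["dimension"]] = record["value"]
                self.applied.add(record["op_id"])
        return dict(self.state)
```

A caller would invoke `log_intent` on the request path and run `replay` in the background. Logging the same `op_id` twice, or replaying the log twice, leaves the state unchanged after the first application.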

[0047] Example 2: Software implementation of metadata-level atomic switching. As shown in Figure 2 and Figure 4, at the single-machine level, the system achieves path switching through atomic mount point redirection of the file system (for example, performing an atomic symbolic link replacement on a Linux system to redirect the logical path to the cold storage mount point).

[0048] The system maintains two mount views: a hot view pointing to the local SSD path and a cold view pointing to a remote object storage path (mounted via FUSE or NFS). This operation only modifies file system metadata and does not involve data migration.
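On a POSIX system, the symbolic link replacement can be sketched by creating a new link under a temporary name and renaming it over the logical path. The rename replaces the old link in a single step and touches no user data. The function name `switch_view` and the temporary-name scheme are illustrative assumptions.

```python
import os


def switch_view(logical_path: str, target: str) -> None:
    """Atomically repoint the symlink at logical_path to target."""
    tmp = f"{logical_path}.tmp.{os.getpid()}"
    if os.path.lexists(tmp):
        os.remove(tmp)            # clean up a leftover from a failed run
    os.symlink(target, tmp)       # build the new link under a temp name
    os.replace(tmp, logical_path) # rename over the old link in one step
```

For example, `switch_view("/srv/users/u42/memory", "/mnt/cold/u42")` would redirect a hypothetical per-user logical path to the cold view. Readers that resolve the path see either the old target or the new one, never a missing link, and the cost of the call does not depend on how much data sits behind either target.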

[0049] At the cluster level, the system maintains a global state view through a configuration center (such as etcd). When a user's state transitions, the state machine maintenance module writes the new state to etcd, and each inference node receives the change notification through a listening mechanism and independently performs the atomic switch locally. etcd's Raft consensus protocol guarantees the sequential consistency of state changes. Although the distributed consensus protocol introduces a small latency, this latency is independent of the user data scale and is much smaller than the time required for a full data migration.

[0050] In another implementation, a similar effect can be achieved by modifying the metadata of the object storage (such as S3's x-amz-metadata-directive). In this case, no local file system operations are required; the access policy of the storage bucket can be changed directly through the API.

[0051] Example 3: Hardware-software coordinated implementation of the real-time quota circuit breaker. As shown in Figure 3, in the pure software implementation, after the system generates N tokens, it obtains Q(t) by reading the memory counter. When Q(t) < N·r, it calls cudaStreamDestroy to destroy the CUDA stream.

[0052] In the hardware-coordinated implementation, the GPU driver exposes a quota counter register. The hardware comparator compares the register value with a critical threshold r. When Q(t) < r, a PCIe MSI interrupt is triggered. The interrupt service routine directly destroys the computation flow and returns an interrupt flag. In this hardware-coordinated scheme, the software-level quota discrete counting logic and the hardware comparator's interrupt triggering mechanism are functionally mutually supportive and interactive: the software logic defines the critical transition condition, and the hardware mechanism compresses the judgment latency of this condition from the millisecond level of software polling to the microsecond level of hardware interrupt. Together, they achieve deterministic, extremely low-latency circuit breaking.

[0053] The hardware collaboration scheme aims to provide logical specifications for the hardware design of AI inference accelerators. For chip design engineers, it is obvious that the corresponding functions can be implemented in application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) based on the above-mentioned register mapping, comparator triggering, interrupt reporting and other logical relationships.

[0054] Example 4: Data sovereignty and cross-device synchronization. As shown in Figure 5, when a user uses the service for the first time, the client generates a master key K_client in a local secure area (such as a TEE or operating system keystore), encrypts the stored data using AES-256-GCM, and then uploads the ciphertext. The system can also optionally generate a set of recovery mnemonic phrases for the user, which the user can securely save. The cloud only stores the hash value of the mnemonic phrases for verifying recovery requests, not the mnemonic phrases themselves. When a user's device is damaged, K_client can be recovered on a new device using the mnemonic phrases, thereby decrypting the stored data in the cloud.

[0055] During cross-device synchronization, the ECDH protocol is executed: the source device generates a temporary key pair and encodes the public key into a QR code; the target device scans the QR code, generates a local temporary key pair, and calculates the shared key; the target device uses the shared key to encrypt K_client to obtain ciphertext and sends it to the cloud, which forwards it to the source device; the source device decrypts it, re-encrypts it, and returns it to the target device. The shared key is immediately cleared from memory after transmission is complete, and the cloud only forwards the ciphertext.

[0056] Example 5: Logical derivation of technical effects. The technical effects of this invention are based on the deterministic principles of the underlying architecture of computer systems and can be reasonably expected through logical deduction. Those skilled in the art, based on the mapping relationships and metadata-level atomic operation mechanisms described in the specification, can reasonably expect and verify the following effects:

[0057] Switching latency is decoupled from data scale: Metadata-level atomic operations only modify the access path and do not involve data content migration. Their execution time does not increase with the total amount of user data. Let the amount of user data be D. The I/O time of a traditional full migration is T_migrate = D / B, which grows linearly with D, while the execution time of the switching operation in this invention is independent of D. For user data of any practical size, the switching time of this invention is therefore much shorter than the full migration time, and the advantage becomes more pronounced as D increases.
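As a worked illustration of the relation T_migrate = D / B, the following uses hypothetical figures that do not come from the specification (a bandwidth of 1 GB/s and two data sizes), with T_switch denoting the constant metadata switching time:

```latex
T_{\text{migrate}} = \frac{D}{B}
\qquad\Longrightarrow\qquad
\begin{aligned}
D = 100\ \text{GB},\; B = 1\ \text{GB/s} &: \quad T_{\text{migrate}} = 100\ \text{s} \\
D = 1000\ \text{GB},\; B = 1\ \text{GB/s} &: \quad T_{\text{migrate}} = 1000\ \text{s}
\end{aligned}
\qquad
T_{\text{switch}} = c \ \ (\text{independent of } D)
```

A tenfold increase in D multiplies the migration time by ten, while the switching time c stays the same.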

[0058] Critical switching determinism: Quota consumption is a discrete event, with a remaining amount Q(t) and a single consumption of r. The conditions Q(t) ≥ r and Q(t) < r form a mutually exclusive and complete set of events. When Q(t) < r, the next generation will inevitably exceed the budget, therefore the state transition is a deterministic choice.

[0059] Those skilled in the art can build prototype systems to verify the above-mentioned technical effects based on the embodiments described in the specification and existing open-source tools (such as FUSE file system, CUDA Toolkit, etcd distributed storage).

Claims

1. An AI inference resource scheduling method, characterized in that it comprises: constructing a two-dimensional orthogonal state space: decomposing the user's AI service rights into a storage rights dimension σ ∈ {0, 1} and a computing power rights dimension φ ∈ {0, 1}, forming a state vector S = (σ, φ), where σ and φ vary independently and have no causal dependency; establishing a state-service mapping: according to the state vector S, mapping the service state to a full service mode (σ = 1, φ = 1), a read-only memory mode (σ = 1, φ = 0), a temporary session mode (σ = 0, φ = 1), or a boot recovery mode (σ = 0, φ = 0); performing event-driven state monitoring: the storage rights dimension σ is updated in response to a storage renewal event, a storage expiration event, or a storage failure recovery event, and the computing power rights dimension φ is updated in response to a quota recharge event or a quota exhaustion event; and performing metadata-level state switching: in response to a change of the state vector, dynamically adjusting the mapping relationship between the logical access path and the physical resource pool through an atomic pointer swap operation, wherein the atomic operation only modifies the metadata of the access path and does not involve the relocation of user data content, and its execution time does not increase with the total amount of user data.

2. The method according to claim 1, characterized in that the update of the computing power rights dimension φ includes: defining the remaining quota Q(t) = Q0 - Σrᵢ, where Q0 is the initial quota and rᵢ is the discrete count value consumed by the i-th token; defining the cost r of a single token generation as the lower limit of the physical resources required for a single forward propagation of the model decoder, that is, the smallest indivisible resource unit required to execute this operation on a given hardware, wherein this value can be determined by offline measurement or hardware specifications and is a stable constant in an isolated environment; and, when it is detected that Q(t) is less than r, triggering a state transition, updating φ to 0, and terminating the current AI inference computation stream.

3. The method according to claim 1, characterized in that the service state also includes: a transient state, which is used during the state switching waiting period, wherein new requests enter a queue and, if the wait times out, the system transitions to the error state and returns a retryable error code; and an error state, which is entered when log replay fails or a timeout is triggered, wherein the system automatically rolls back to the most recent consistent snapshot and triggers an alarm, supporting manual or automatic recovery.

4. The method according to claim 1, characterized in that the state switching adopts a write-ahead log mechanism to ensure application-layer atomicity: the change intent is recorded in the persistent log and a success response is returned immediately; the log is asynchronously replayed in the background to complete the physical state alignment; and multiple executions of the same operation are idempotent.

5. The method according to claim 1, characterized in that the method is implemented through a pure software protocol stack, in which a distributed key-value store maintains the state vector, a publish-subscribe event bus listens for rights changes, and the file system performs atomic mount point redirection; or the method is implemented through hardware cooperation, in which the GPU driver exposes a quota counter register, and a hardware comparator compares the register value with the critical threshold r and triggers a PCIe MSI interrupt when Q(t) < r. In this hardware cooperation solution, the discrete quota counting logic at the software level and the interrupt triggering mechanism of the hardware comparator support each other functionally and interact with each other: the software logic defines the critical transition condition, and the hardware mechanism compresses the judgment delay of this condition from the millisecond level of software polling to the microsecond level of hardware interrupt. The two cooperate to achieve deterministic, extremely low-latency circuit breaking.

6. The method according to claim 2, characterized in that the termination of the current AI inference computation stream includes: calling the hardware accelerator driver interface to destroy the current computation stream, returning an interrupt flag to the client while retaining the partial results already generated, and marking the video memory resources occupied by the stream as pending reclamation.
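The termination steps in claim 6 can be modelled without real accelerator calls. In the sketch below, `ComputeStream` and `Terminator` are hypothetical stand-ins: setting `alive = False` takes the place of the driver call that destroys the stream, and the reclamation list takes the place of the driver's deferred memory release.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeStream:
    """Hypothetical stand-in for an accelerator computation stream."""
    stream_id: int
    memory_bytes: int
    alive: bool = True


@dataclass
class Terminator:
    # (stream_id, memory_bytes) pairs awaiting reclamation
    pending_reclaim: list = field(default_factory=list)

    def terminate(self, stream: ComputeStream, partial_tokens: list) -> dict:
        """Destroy the stream, keep partial output, defer memory release."""
        stream.alive = False  # stands in for the driver's destroy call
        self.pending_reclaim.append((stream.stream_id, stream.memory_bytes))
        # The client receives an interrupt flag plus what was generated so far.
        return {"interrupted": True, "partial_result": list(partial_tokens)}
```

Returning the partial result alongside the flag lets the client distinguish an interrupted response from a complete one without losing the Tokens already produced.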

7. The method according to claim 1, characterized in that the method further includes data sovereignty protection: the user's master key is generated and stored only in a secure area on the client side, while the encrypted form of the user's memory is persistently stored in the cloud; in response to a user query request, the client either uses the master key to decrypt the relevant memory fragments and submits them with the query, or authorizes the server to temporarily decrypt, within a trusted execution environment, the subset of memories required for the query; after the query session terminates, the server immediately clears the plaintext memory and the temporary authorization; and the master key can optionally be backed up through a secure recovery mechanism to prevent the data from becoming inaccessible if the client-side device is damaged. Specifically, when the system is in read-only memory mode (σ=1, φ=0), the client can, according to the user's policy, choose not to provide the decryption key to the server, thereby preventing the server from accessing the user's plaintext memory and further enhancing the user's control over the data.

8. The method according to claim 7, characterized in that, during cross-device synchronization: the source device generates a temporary key pair and encodes the public key into a QR code; the target device scans the QR code to obtain the public key, generates a local temporary key pair, and calculates the shared key; the target device uses the shared key to encrypt the master key to obtain ciphertext and sends it to the cloud, which forwards it to the source device; the source device uses the shared key to decrypt the ciphertext and re-encrypts it before returning it to the target device; and the shared key is cleared from memory immediately after the transmission is completed, the cloud only forwarding the ciphertext without accessing the plaintext master key.

9. An AI inference resource scheduling system, characterized in that it includes: a processor and a memory; a state machine maintenance module configured to maintain the two-dimensional state space and monitor changes of σ and φ in real time; a routing control module configured to generate a mapping relationship based on the state vector S and adjust the logical access path by atomic pointer exchange; a quota verification module configured to detect quota exhaustion events and trigger state transitions; a key management module configured to generate and store user master keys on the client side; and an audit log module configured to atomically record decision-making process data.
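The routing control module in claim 9 can be sketched as a lookup table that is rebuilt from the state vector S = (σ, φ) and then published by replacing a single reference. The route names and the `build_routes` mapping are assumptions: only the read-only memory mode for (σ=1, φ=0) comes from claim 7, and the other entries are placeholders. The sketch relies on CPython rebinding an attribute as one reference store, so a reader holds either the old table or the new one.

```python
import threading
from types import MappingProxyType


def build_routes(sigma: int, phi: int) -> dict:
    """Hypothetical mapping from state vector S = (sigma, phi) to access paths."""
    if sigma == 1 and phi == 1:
        memory = "/memory/rw"
    elif sigma == 1:
        memory = "/memory/ro"      # read-only memory mode (sigma=1, phi=0)
    else:
        memory = "/memory/none"    # placeholder for states without storage rights
    inference = "enabled" if phi == 1 else "disabled"
    return {"memory": memory, "inference": inference}


class RoutingControl:
    def __init__(self):
        self._table = MappingProxyType({})  # immutable view, never edited in place
        self._write_lock = threading.Lock()

    def publish(self, sigma: int, phi: int) -> None:
        """Build a complete new table, then swap the reference in one step."""
        new_table = MappingProxyType(build_routes(sigma, phi))
        with self._write_lock:   # serializes writers; readers take no lock
            self._table = new_table

    def resolve(self, name: str) -> str:
        table = self._table      # one snapshot per lookup
        return table[name]
```

Because each table is built in full before it is published and is never mutated afterwards, a lookup cannot observe a mixture of routes from two different state vectors.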

10. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method of any one of claims 1 to 8.