An emotional agent based on a state machine and an implementation method and device of a memory system thereof, an electronic device, a storage medium and a process
By combining a multidimensional emotional state machine and a hierarchical memory system with a streaming response mechanism, the problems of emotional continuity and multimodal synchronization in virtual companion systems are solved, achieving low-latency multimodal interaction and improving the naturalness and immersion of human-computer interaction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 上海臻广信息科技有限公司
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-26
AI Technical Summary
Existing virtual companion or chatbot systems lack emotional continuity, have limited memory management, insufficient real-time performance, and fragmented multimodal communication, making it difficult to achieve low-latency multimodal synchronous interaction.
Employing a multidimensional emotional state machine, a hierarchical memory system, and a streaming response mechanism, this system achieves emotional continuity and low-latency multimodal interaction by driving multimodal synchronization through emotional states and combining emotionally bound hierarchical memory with low-latency streaming response.
It achieves efficient memory and low-latency response of emotional intelligent agents, possesses personality continuity and natural interaction, supports long-term interaction records, and enhances the naturalness and immersion of human-computer interaction.
Abstract
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and human-computer interaction technology, specifically to a method for implementing a virtual companion intelligent agent based on an emotional state machine and a hierarchical memory architecture, which is particularly suitable for AI systems that require long-term emotional continuity, real-time voice interaction, and multimodal expression.
Background Technology
[0002] Existing virtual companion or chatbot technologies (such as Character.ai, Replika, etc.) have the following limitations: 1. Lack of emotional continuity: Most systems generate responses based solely on the current conversational context, lacking tracking of historical emotional states, leading to inconsistencies in personality performance.
[0003] 2. Limited memory management: Traditional methods typically place all historical records into the context window, which is bounded by the LLM's token limit. As a result, they cannot handle long-term interaction data effectively, or they lack a memory filtering mechanism based on emotional importance.
[0004] 3. Insufficient real-time performance: In voice interaction scenarios, LLM inference latency is high, making it difficult to achieve the response time of less than 500ms that natural dialogue requires.
[0005] 4. Multimodal fragmentation: Text generation, speech synthesis (TTS), and character animation (Live2D / VRM) often operate independently, lacking a collaborative drive based on a unified emotional state.
[0006] Therefore, there is an urgent need for an intelligent agent architecture that can simulate the patterns of human emotional changes, possess efficient memory compression and retrieval capabilities, and achieve low-latency multimodal synchronization.
Summary of the Invention
[0007] This invention aims to provide a method for implementing an emotional intelligent agent and its memory system based on a state machine. By constructing a multidimensional emotional state machine, a hierarchical memory system, and a streaming response mechanism, a virtual companion with "personality continuity" can be realized.
[0008] Core Innovation Points
1. Multidimensional emotional state machine: Defines a state vector containing basic emotions (happiness, sadness, etc.) and higher-level emotions (affection, trust, energy), and introduces a decay mechanism and an empathy adjustment algorithm.
2. Hierarchical memory system with emotional binding: Emotional states are bound to memory entries as metadata, and a three-level architecture of short-term (Working), medium-term (Session), and long-term (Long-term) memory is adopted, combined with importance scoring for dynamic compression and archiving.
3. Layered streaming response mechanism: The response process is broken down into two layers, "rapid emotional feedback" and "substantive content generation". Streaming output and speculative execution are used to reduce perceived latency.
4. Multimodal emotion synchronization: With the emotional state as the core driving force, the system synchronously controls LLM tone, TTS intonation, and virtual avatar animation.
Detailed Implementation
[0009] This system adopts a hybrid architecture and mainly includes the following core modules:
1. AI Inference Layer (Brain): Responsible for LLM integration and decision-making, supporting a unified abstract interface for various model providers (Deepseek, Qwen, local models, etc.).
2. Memory Core: Responsible for data storage, retrieval, compression, and emotion annotation, using an embedded database (such as DuckDB / Pglite) for on-device persistence.
3. Emotional Engine: Maintains the global emotional state machine and executes state update and decay logic.
4. Perception and Expression Layer:
Ears: Speech recognition (STT) and voice activity detection (VAD).
[0010] Mouth: Emotional speech synthesis (TTS) with lip-sync.
[0011] Body: Motion driving for the virtual avatar (VRM / Live2D), including blinking, gaze, and emotional expressions.
[0012] II. Core Implementation Method
1. A state machine-based emotion modeling method
The system defines a multidimensional emotional state object, EmotionalState, which includes a basic emotional dimension ($E_{base}$) and a higher-level emotional dimension ($E_{adv}$):
Basic dimensions: happiness, sadness, anger, fear, surprise, disgust (normalized to 0.0-1.0).
[0013] Advanced dimensions: Affection, Trust, and Energy.
[0014] State update logic:
Input analysis: Perform sentiment analysis on user input to extract the user emotion $U_{emotion}$.
[0015] Empathy adjustment: Adjust the agent's state based on $U_{emotion}$. For example, when the user is happy, increase the agent's happiness and liking levels; when the user is sad, increase the sadness level and decrease the energy level.
[0016] Natural decay: Introducing a time factor $\Delta t$ and a decay rate $decayRate$, each frame executes $Value_{new} = \max(0.5,\ Value_{old} - decayRate \times \Delta t)$. This ensures that emotions do not become permanently fixed, mimicking the natural decline of human emotions.
[0017] Boundary constraints: All dimension values are forced to be limited to the range [0, 1].
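The update logic in [0014] to [0017] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the initial value of 0.5, the `gain` step size, and the string labels for user emotions are assumptions. The source formula only describes values above 0.5 decaying downward; the symmetric rise for values below 0.5 is an added assumption so that a lowered dimension (such as energy) does not snap back in a single frame.

```python
from dataclasses import dataclass, fields

BASELINE = 0.5  # floor used by the decay formula in [0016]


@dataclass
class EmotionalState:
    # Basic dimensions (initial values are an assumption)
    happiness: float = 0.5
    sadness: float = 0.5
    anger: float = 0.5
    fear: float = 0.5
    surprise: float = 0.5
    disgust: float = 0.5
    # Advanced dimensions
    affection: float = 0.5
    trust: float = 0.5
    energy: float = 0.5


def clamp(value, low=0.0, high=1.0):
    """Boundary constraint: keep a value inside [0, 1]."""
    return max(low, min(high, value))


def apply_empathy(state, user_emotion, gain=0.1):
    """Empathy adjustment; `gain` is a hypothetical step size."""
    if user_emotion == "happy":
        state.happiness = clamp(state.happiness + gain)
        state.affection = clamp(state.affection + gain)
    elif user_emotion == "sad":
        state.sadness = clamp(state.sadness + gain)
        state.energy = clamp(state.energy - gain)


def decay(state, decay_rate, dt):
    """Move every dimension toward BASELINE by decay_rate * dt."""
    step = decay_rate * dt
    for f in fields(state):
        old = getattr(state, f.name)
        if old >= BASELINE:
            # Source formula: max(0.5, Value_old - decayRate * dt)
            new = max(BASELINE, old - step)
        else:
            # Assumption (not in the source): rise symmetrically
            new = min(BASELINE, old + step)
        setattr(state, f.name, clamp(new))
```

Calling `apply_empathy(state, "sad")` followed by repeated `decay(state, 0.05, dt)` calls raises sadness, lowers energy, and then lets both drift back to 0.5.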
[0018] 2. Implementation of a layered memory system based on emotional binding
The memory system is divided into three layers, each with its own storage strategy and lifecycle:
Short-term memory (working memory): Stores the 10-20 most recent interaction messages.
[0019] It resides in memory and is used to construct the immediate context of the current conversation.
[0020] When the number of messages exceeds this threshold, the compression mechanism is triggered.
[0021] Mid-term memory (Session Memory): Stores all interactions and real-time emotional trajectories in the current session.
[0022] Used to maintain the continuity of a single, long conversation.
[0023] Long-term memory: Data structure: Each memory entry (MemoryEntry) contains content, timestamp, snapshot of the emotional state at the time, user emotion tag, importance score, and label.
[0024] Storage media: Use an embedded analytical database (such as DuckDB WASM) or a lightweight relational database (Pglite).
[0025] Writing strategy: High-importance memories (Importance>0.8) are written directly to the long-term storage; low-importance memories are accumulated in the short-term storage first.
[0026] Compression and archiving: The LLM is called periodically to summarize short-term memory into concise text that retains key facts and emotions; the summary is stored in the long-term database and the raw data is archived.
[0027] Retrieval augmentation (RAG): When responding to a user, relevant historical memories are retrieved based on vector similarity and emotion tags and injected into the prompt, enabling the AI to recall, for example, "you seemed excited when you last mentioned..."
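The entry structure and write strategy described in [0018] to [0026] might look like the sketch below. It is an illustration under stated assumptions: the class and field names are hypothetical, plain Python lists stand in for the embedded database, `summarize` is a hypothetical callable wrapping the LLM summary call, and the choice to give a summary the last entry's emotion snapshot and the batch's highest importance is not specified in the source.

```python
import time
from collections import deque
from dataclasses import dataclass, field

IMPORTANCE_THRESHOLD = 0.8  # from [0025]
WORKING_CAPACITY = 20       # upper end of the 10-20 range in [0018]


@dataclass
class MemoryEntry:
    content: str
    emotion_snapshot: dict  # agent's emotional state when the entry was made
    user_emotion: str
    importance: float
    tags: list = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)


class LayeredMemory:
    def __init__(self, summarize):
        self.working = deque()      # short-term layer, held in memory
        self.long_term = []         # stand-in for the embedded database
        self.archive = []           # raw entries kept after compression
        self.summarize = summarize  # hypothetical LLM summary callable

    def write(self, entry):
        """Route an entry by importance, compressing when over capacity."""
        if entry.importance > IMPORTANCE_THRESHOLD:
            self.long_term.append(entry)
            return
        self.working.append(entry)
        if len(self.working) > WORKING_CAPACITY:
            self.compress()

    def compress(self):
        """Summarize working memory into one long-term entry."""
        batch = list(self.working)
        self.working.clear()
        summary = MemoryEntry(
            content=self.summarize([e.content for e in batch]),
            emotion_snapshot=batch[-1].emotion_snapshot,  # assumption
            user_emotion=batch[-1].user_emotion,          # assumption
            importance=max(e.importance for e in batch),  # assumption
            tags=["summary"],
        )
        self.long_term.append(summary)
        self.archive.extend(batch)
```

For simplicity this sketch sends a high-importance entry only to long-term storage; a fuller version would presumably also keep it in the working context.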
[0028] 3. Low-latency streaming response process
To address the inference latency issue in LLM, a hierarchical response strategy is implemented:
Phase 1: Immediate emotional feedback (<200ms)
Once the user's voice activity (VAD) is detected to have ended, nonverbal feedback (such as "Mmm" or "I'm listening") or a corresponding listening animation is immediately sent based on the current emotional state.
[0029] Phase 2: Streaming content generation and preloading
The LLM is called asynchronously for streaming generation (Stream Chat).
[0030] Speculative execution: Upon receiving the first text chunk, the TTS engine is immediately started for audio synthesis, without waiting for the full text to be generated.
[0031] Parallel processing: Text output, audio synthesis, UI rendering, and animation driving run in parallel.
[0032] Phase 3: Multimodal synchronization
The phonemes output by TTS drive the lip-sync of the virtual avatar in real time.
[0033] The current emotional state dynamically adjusts the TTS tone parameters (such as a rising tone when happy) and animated expressions (such as the amplitude of a smile).
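The three phases above can be sketched with Python's `asyncio`. This is only an illustration of the ordering: `stream_llm` and `synthesize` are hypothetical stand-ins for a streaming LLM client and a TTS engine, the sleep durations are arbitrary, and the mapping from happiness to `pitch_shift` is an assumption.

```python
import asyncio


async def stream_llm(prompt):
    """Stand-in for a streaming LLM call; yields text chunks with a delay."""
    for chunk in ["I'm glad ", "you're back. ", "How was your day?"]:
        await asyncio.sleep(0.01)
        yield chunk


async def synthesize(chunk, pitch_shift, events):
    """Stand-in for a TTS engine; records when synthesis starts and ends."""
    events.append(("tts_start", chunk))
    await asyncio.sleep(0.005)
    events.append(("audio", chunk, pitch_shift))


async def respond(prompt, happiness):
    events = []
    # Phase 1: immediate feedback, before any LLM work starts.
    events.append(("feedback", "Mmm"))
    # Hypothetical mapping from emotional state to a TTS pitch parameter.
    pitch_shift = happiness - 0.5
    tts_tasks = []
    # Phase 2: stream text and start TTS on each chunk as soon as it arrives,
    # without waiting for the full reply.
    async for chunk in stream_llm(prompt):
        events.append(("text", chunk))
        task = asyncio.create_task(synthesize(chunk, pitch_shift, events))
        tts_tasks.append(task)
    # Phase 3 would consume the audio (and its phonemes) for lip-sync.
    await asyncio.gather(*tts_tasks)
    return events


events = asyncio.run(respond("hello", happiness=0.8))
```

In the resulting `events` list the feedback entry comes first, and synthesis of the first chunk begins while later text chunks are still being generated.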
[0034] 4. Cross-platform adaptive deployment
Web-based: WebGPU is used for graphics rendering, WebAssembly runs lightweight inference or database operations, and Web Workers handle background tasks.
[0035] Desktop / Mobile: Utilizes native GPU acceleration (CUDA / Metal) to support higher-precision local model inference and more complex physical rendering.
[0036] Dynamic degradation: Detects device performance and automatically switches model precision (FP16 / INT8) or disables some effects to ensure smoothness.
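The dynamic degradation step could be expressed as a small selection function. The sketch below is hypothetical throughout: the `DeviceProfile` fields, the 4 GB and 20% thresholds, and the rule for disabling effects are assumptions chosen for illustration; only the FP16 / INT8 choice and the idea of disabling effects come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    has_gpu: bool
    free_memory_gb: float
    battery_level: float  # 0.0-1.0


@dataclass(frozen=True)
class RuntimeConfig:
    precision: str         # "FP16" or "INT8"
    effects_enabled: bool  # optional visual effects


def select_config(device):
    """Pick model precision and effects from detected device performance."""
    low_power = device.battery_level < 0.2       # assumed threshold
    low_memory = device.free_memory_gb < 4.0     # assumed threshold
    constrained = low_power or low_memory or not device.has_gpu
    return RuntimeConfig(
        precision="INT8" if constrained else "FP16",
        effects_enabled=device.has_gpu and not low_power,
    )
```

A desktop with a GPU, 16 GB free, and a full battery would get FP16 with effects on; the same machine at 10% battery would drop to INT8 with effects off.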
[0037] III. Devices and Equipment
The present invention also provides an apparatus for implementing the above method, comprising:
Processor: Used to perform emotional state updates, memory management, and LLM inference scheduling.
[0038] Memory: Stores program code and hierarchical memory database.
[0039] Input units: Microphone array (voice), camera (visual optional), keyboard / touchscreen.
[0040] Output units: Speaker (voice), display screen (virtual avatar rendering).
[0041] Communication module: Used to connect to cloud-based LLM services or synchronize data across multiple devices.
[0042] IV. Beneficial Effects
1. Strong sense of personality authenticity: Through the binding of the emotional state machine and memory, the intelligent agent exhibits stable personality traits and emotional memories, and is no longer a cold question-and-answer machine.
[0043] 2. Long-term companionship capability: Layered memory and compression mechanisms break through the limitations of context length, supporting continuous interaction records for months or even years.
[0044] 3. Natural and smooth interaction: Layered streaming response reduces the perceived latency to an acceptable range for humans, achieving a conversational rhythm similar to that of a real person.
[0045] 4. Privacy and Security: Supports on-device databases and local inference; sensitive memory data can be completely stored locally on the user's device.
[0046] Example 1: Specific Implementation of the Emotional State Machine
In this embodiment, the Emotional State Machine (ESM) maintains a state vector $S_t = \{e_1, e_2, \ldots, e_n\}$, where $e_i$ represents the intensity value of the $i$-th emotion dimension (normalized to 0.0-1.0).
[0047] The state update logic is as follows:
1. Natural decay: For each frame or at fixed intervals $\Delta t$, all emotion values are multiplied by $(1 - \lambda \cdot \Delta t)$, where $\lambda$ is the decay rate, so that the emotion gradually returns to the baseline value (e.g., 0.5).
[0048] 2. User empathy: Analyze the emotion expressed in the user's input, $E_{user}$. If the user is happy, the agent's happiness value increases by $\alpha \cdot E_{user}$, where $\alpha$ is the empathy coefficient.
[0049] 3. Memory trigger: Retrieve historical memories $M_{history}$ related to the current context. If pleasant memories are recalled, the happiness value increases by $\beta \cdot M_{history}$.
[0050] 4. Boundary constraints: Use the Clip function to ensure that all values are within the range [0, 1].
[0051] The system runs this update cycle at a high frequency (e.g., 60Hz) to ensure the smoothness of emotional changes. When a certain dimension exceeds a preset threshold, specific performance behaviors are triggered (such as changes in voice tone or switching between Live2D / VRM facial expression animations).
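One update cycle from this example can be sketched as below. Two interpretive assumptions are made and should be read as such: the factor $(1 - \lambda \cdot \Delta t)$ is applied to each value's deviation from the baseline, since multiplying the raw value would drive it toward 0 rather than 0.5; and $M_{history}$ is reduced to a single scalar `memory_valence`. The parameter defaults and the 0.8 trigger threshold are also hypothetical.

```python
BASELINE = 0.5  # example baseline value from [0047]


def clip(x):
    """Boundary constraint: keep a value inside [0, 1]."""
    return max(0.0, min(1.0, x))


def update(state, dt, decay_rate, user_emotion=None, alpha=0.3,
           memory_valence=0.0, beta=0.2):
    """Run one cycle: decay, user empathy, memory trigger, clip.

    state:          dict of emotion name -> intensity
    user_emotion:   dict of emotion name -> intensity E_user (may be None)
    memory_valence: scalar stand-in for M_history (assumption)
    """
    # 1. Natural decay of the deviation from baseline (interpretation).
    factor = max(0.0, 1.0 - decay_rate * dt)
    new = {k: BASELINE + (v - BASELINE) * factor for k, v in state.items()}
    # 2. User empathy: add alpha * E_user to the matching dimension.
    for name, intensity in (user_emotion or {}).items():
        if name in new:
            new[name] += alpha * intensity
    # 3. Memory trigger: pleasant memories raise happiness by beta * M.
    if "happiness" in new:
        new["happiness"] += beta * memory_valence
    # 4. Boundary constraints.
    return {k: clip(v) for k, v in new.items()}


def triggered(state, threshold=0.8):
    """Names of dimensions above the (hypothetical) behaviour threshold."""
    return sorted(k for k, v in state.items() if v > threshold)
```

Running `update` at 60Hz means calling it with `dt=1/60`; any name returned by `triggered` would then be mapped to a change in voice tone or an expression animation.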
[0052] Example 2: Layered Memory and Emotional Binding
The memory system employs a three-level storage strategy:
1. Working Memory: Implemented using a circular buffer, it stores the original text and metadata of the most recent N rounds (e.g., 20 rounds) of dialogue. The oldest data is automatically discarded each time a new round is added to a full buffer.
[0053] 2. Session Memory: At the end of the session, the short-term memory is compressed into a "session summary" using a summary model, and the average emotional state during the session is recorded.
[0054] 3. Long-term memory:
Storage: When the importance score of an interaction (calculated based on explicit user feedback or semantic strength) exceeds a threshold, its content is embedded as a vector, concatenated with the current $S_t$ vector, and stored in a vector database (such as DuckDB WASM or Pglite). The data structure includes: content, timestamp, emotional state vector, user emotion label, and importance score.
[0055] Retrieval: When responding to user queries, the system uses semantic similarity and also incorporates an "emotional resonance" factor that prioritizes historical memories whose emotional state is similar to the agent's current one. For example, when the agent is in a "sad" state, it more readily recalls past sad experiences and offers comfort.
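The "emotional resonance" retrieval can be illustrated with a combined score. The source does not give a scoring formula, so the weighted sum of two cosine similarities below, the `resonance_weight` parameter, and the dictionary layout of a memory are all assumptions; the embedding and emotion vector are kept as separate fields here rather than concatenated, purely for readability.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, current_emotion, memories, k=3, resonance_weight=0.3):
    """Rank memories by semantic similarity blended with emotional resonance.

    Each memory is a dict with "embedding" (content vector) and
    "emotion" (the emotional state vector stored with the memory).
    """
    def score(memory):
        semantic = cosine(query_vec, memory["embedding"])
        resonance = cosine(current_emotion, memory["emotion"])
        return (1 - resonance_weight) * semantic + resonance_weight * resonance

    return sorted(memories, key=score, reverse=True)[:k]
```

With two memories that match the query equally well, the one stored in an emotional state closer to the agent's current state ranks first.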
[0056] Example 3: Real-time Response Optimization Process
1. The moment the user sends a message, the system locks the current emotional state $S_{current}$.
[0057] 2. At time T0: The front-end UI immediately plays a listening animation that matches $S_{current}$, or emits preset audio such as "Mmm" or "I'm listening" (within 50ms).
3. At time T0+100ms: The background thread starts memory retrieval and large model inference.
[0058] 4. At time T0+1000ms: The large model starts to output the first token. The system streams the output to the front end over WebSocket, and the TTS engine synthesizes and plays the audio as tokens arrive while driving lip-sync.
[0059] 5. If low device performance is detected (e.g., insufficient battery power on a mobile device), the system automatically switches to a low-precision quantized model or reduces the volume of memory retrieval to ensure response speed.
[0060] Example 4: Cross-platform deployment architecture
The system adopts an architecture of "unified core logic and separate rendering and adaptation".
[0061] The core layer, including affective computing, memory management, and decision-making logic, is written in TypeScript / Rust and compiled into WebAssembly or native libraries to ensure consistency across multiple platforms.
[0062] Adapter layer:
Web-based (Stage Web): Utilizes Vue 3 + WebGPU to accelerate inference, Service Worker to manage offline memory, and DuckDB WASM for local storage.
[0063] Desktop version (Stage Tamagotchi): Utilizes the Electron framework to call local CUDA / Metal interfaces to accelerate the running of large models and supports higher precision rendering.
[0064] Mobile (Stage Pocket): Utilizes the Capacitor container to perform model quantization adaptation for the NPU and supports PWA installation.
[0065] Through the above-described embodiments, the present invention can effectively construct an intelligent agent with highly anthropomorphic characteristics, significantly improving the naturalness and immersion of human-computer interaction.
Attached Figure Description
[0066] Figure 1 is a schematic diagram of the overall system architecture of the present invention.
Figure 2 is a schematic diagram illustrating the state transition and decay principle of the emotional state machine of the present invention.
Figure 3 is a diagram showing the structure and data flow of the hierarchical memory system of the present invention.
Figure 4 is a timing diagram of the real-time two-layer response processing of the present invention.
Figure 5 is a diagram of the cross-platform adaptation logic architecture of the present invention.
Claims
1. A method for implementing an emotional intelligent agent based on a state machine, characterized in that it includes the following steps:
Initializing the emotional state machine, constructing an emotional state space containing multiple basic emotional dimensions and at least one advanced emotional state, and setting the initial values and decay parameters for each emotional dimension;
Receiving user input information, wherein the input information includes at least one of text, voice, or image;
Parsing the input information to extract user emotional features and semantic content;
Updating the current emotional state in the emotional state machine based on the user emotional features, the preset time decay mechanism, and historical memory data;
Invoking the hierarchical memory system to retrieve relevant historical memory fragments based on the semantic content and the current emotional state;
Generating response content based on the updated current emotional state and the retrieved historical memory fragments;
Outputting the response content, wherein the output includes an immediate emotional feedback signal and a subsequently generated semantic response signal.
2. The method according to claim 1, characterized in that the emotional state space includes:
Basic emotional dimensions, comprising one or more of happiness, sadness, anger, fear, surprise, and disgust;
Advanced emotional states, comprising one or more of liking for the user, trust in the user, and energy level;
The value of each emotional dimension changes dynamically within a preset range and is controlled by the decay rate parameter so that it naturally returns to the baseline value over time.
3. The method according to claim 1, characterized in that the hierarchical memory system comprises a three-layer architecture:
Short-term memory layer: used to store the most recent interaction records of the current dialogue context, using an in-memory storage mechanism;
Mid-term memory layer: used to store the complete sequence of interactions and emotional changes within the current conversation cycle;
Long-term memory layer: used to store highly important memory fragments with compressed summaries, supporting vector similarity-based retrieval;
The method further includes an emotional memory binding step: when a memory fragment is stored, the emotional state vector at the moment the memory is generated is associated with the content of the memory fragment and stored together with it.
4. The method according to claim 1, characterized in that generating the response content includes:
Triggering the first-layer response: within a preset first time threshold after receiving the input information, generating a non-semantic or brief semantic emotional feedback signal based on the current emotional state;
Triggering the second-layer response: initiating parallel inference of the large language model, generating a substantive content response based on the complete context, and outputting it word by word through streaming processing.
5. The method according to claim 1, characterized in that the method also includes a cross-platform adaptive step:
Detecting the hardware performance parameters and device type of the current operating environment;
Dynamically adjusting, based on the hardware performance parameters, the update frequency of the emotional state machine, the vector dimension precision of memory retrieval, or the quantization level of the large language model;
Calling graphics rendering, audio processing, or local storage resources under different operating systems through a unified platform abstraction interface layer.
6. A state machine-based emotional intelligent agent device, characterized in that it includes:
An emotion computing module, used to execute the emotional state machine logic as described in any one of claims 1 to 2;
A memory management module, used to perform the hierarchical memory storage and retrieval as described in claim 3;
An interactive processing module, used to receive multimodal inputs and generate two-layer response outputs;
A platform adaptation module, used to achieve cross-platform resource scheduling and performance self-adaptation.