Agent question and answer interaction method based on large model and generative artificial intelligence
By employing a hierarchical memory management and dynamic update mechanism, and combining time and topic weights to evaluate the importance of information unit packages, the problem of information loss and low efficiency in long dialogues and cross-session interactions of traditional intelligent agents is solved, achieving efficient question-and-answer interaction and contextual coherence.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: SHANDONG JINSANTONG DIGITAL TECHNOLOGY CO LTD
- Filing Date: 2025-09-08
- Publication Date: 2026-06-19
AI Technical Summary
Traditional intelligent agents fail to adequately consider the dynamic changes in information value when processing dialogue history, leading to the accidental deletion of key historical information, increased computational burden, low response efficiency, and difficulty in maintaining contextual coherence in long dialogues and cross-session interactions.
We employ an agent-based question-and-answer interaction method based on large models and generative artificial intelligence. Through hierarchical memory management and dynamic update mechanisms, we evaluate the importance of information unit packages by combining time weight and topic weight, and dynamically adjust the short-term memory capacity to achieve efficient screening and compressed storage.
It enhances the agent's contextual understanding and responsiveness in long dialogues and multi-turn interactions, ensuring that key information is not lost, optimizing resource utilization efficiency, and improving interaction coherence and system stability.
[Figure: abstract drawing, CN121168645B]
Description
Technical Field
[0001] This application relates to the field of artificial intelligence question-answering interaction technology, and specifically to an intelligent agent question-answering interaction method based on large models and generative artificial intelligence.
Background Technology
[0002] Generative artificial intelligence, as an important branch of artificial intelligence, can creatively generate new content based on existing data and is widely used in fields such as text, image, audio, and video generation. With the development of deep learning technology and computing hardware, generative artificial intelligence has become particularly prominent in the field of natural language processing. Generative dialogue models, represented by ChatGPT, have demonstrated excellent dialogue capabilities and interaction results, laying the technological foundation for intelligent agent question-and-answer interaction.
[0003] However, traditional intelligent agents have significant shortcomings in processing dialogue history. On the one hand, they either truncate the dialogue history to a fixed-length context that is fed to the model or manage information with a single storage module, without considering how the value of information changes over time. On the other hand, when generating summaries, traditional agents often reprocess all historical content and fail to establish a hierarchical, incremental memory transfer mechanism. In long dialogue scenarios, this approach often leads to the accidental deletion of key historical information because of model input length limits; in cross-session interactions, user preferences and core information are difficult to persist; and repeatedly summarizing the entire history both increases the computational burden and reduces the agent's response efficiency. The result is context gaps, omission of key information, and low interaction efficiency in question-and-answer responses.
Summary of the Invention
[0004] To address the aforementioned technical problems, this application provides an intelligent agent question-answering interaction method based on large models and generative artificial intelligence to solve existing issues.
[0005] The intelligent agent question-answering interaction method based on large models and generative artificial intelligence in this application adopts the following technical solution:
[0006] During the current conversation between the user and the AI agent, the question content at the current moment and all question and answer content before the current moment are obtained. Each question and answer content is formed into multiple information unit packages, and the embedding vectors of the question content and all information unit packages are obtained.
[0007] The working memory, short-term memory, and long-term memory in the agent are dynamically updated based on information unit packets. The update process for the short-term memory capacity threshold is as follows:
[0008] Based on the time interval between each information unit package and the question content, the time weight of each information unit package is determined; based on the similarity of the embedding vectors between each information unit package and the question content, the topic weight of each information unit package is determined; and combined with the time weight, the importance score of each information unit package is determined, so as to select information unit packages that need to be stored in short-term memory from the information unit packages that are eliminated from working memory within a preset time period before the current moment.
[0009] The agent's latency at the current moment is obtained, and combined with the preset target latency, the agent's load characteristic value at the current moment is determined. Based on the distribution of importance scores of all information unit packets in the short-term memory within a preset time period before the current moment, the short-term memory score characteristic value at the current moment is determined, and combined with the load characteristic value, the short-term memory capacity threshold at the current moment is adjusted to retain necessary information unit packets.
[0010] Based on the information unit packets stored in the agent's working memory, short-term memory, and long-term memory at the current moment, candidate information unit packets are selected to answer the question at the current moment.
[0011] Preferably, the expression for the time weight of each information unit packet is: $w^{\mathrm{time}}_i = \alpha \cdot \exp(-\lambda \cdot \Delta t_i)$. In the formula, $w^{\mathrm{time}}_i$ represents the time weight of the i-th information unit packet; $\Delta t_i$ represents the time interval between the i-th information unit packet and the question content; $\alpha$ represents the preset time weighting factor; $\lambda$ represents the preset time decay coefficient; exp() represents an exponential function with the natural constant as the base.
[0012] Preferably, the expression for the topic weight of each information unit packet is: $w^{\mathrm{topic}}_i = \beta \cdot \frac{s_i}{s_{\max}}$. In the formula, $w^{\mathrm{topic}}_i$ represents the topic weight of the i-th information unit packet; $s_i$ represents the similarity between the embedding vectors of the i-th information unit packet and the question content; $s_{\max}$ represents the maximum value of the embedding vector similarity between all information unit packets within a preset time period prior to the current moment and the question content at the current moment; $\beta$ represents the preset topic weight factor.
[0013] Preferably, the importance score of each information unit package is the normalized value of the sum of the time weight and the topic weight of each information unit package.
[0014] Preferably, the step of selecting information unit packets to be stored in short-term memory from information unit packets that have been eliminated from working memory within a preset time period prior to the current moment includes:
[0015] Among the information unit packets eliminated from working memory within a preset time period before the current moment, those whose importance score is greater than a preset threshold are stored in short-term memory.
[0016] Preferably, the load characteristic value of the agent at the current moment is the ratio of the agent's delay at the current moment to the preset target delay.
[0017] Preferably, the scoring characteristic value of the short-term memory at the current moment is the average of the importance scores of all information unit packets in the short-term memory within a preset time period prior to the current moment.
[0018] Preferably, adjusting the short-term memory capacity threshold at the current moment includes:
[0019] The short-term memory capacity threshold at the current moment is computed by an expression (not legible in the source text) over the following quantities: the agent's load characteristic value at the current moment; the short-term memory score characteristic value at the current moment; the preset baseline capacity limit; the preset adjustment factor; and norm[ ], a normalization function.
[0020] Preferably, the candidate information unit package includes:
[0021] All information unit packets in short-term memory and long-term memory are sorted in descending order of the similarity between their embedding vectors and that of the question content at the current moment. The first preset number of information unit packets in the sorted result, together with all information unit packets in working memory at the current moment, are used as the candidate information unit packets at the current moment.
[0022] Preferably, answering the question at the current moment includes:
[0023] The product of the importance score of each candidate information unit package at the current moment and a preset weight is calculated and recorded as the comprehensive score of that candidate information unit package. All candidate information unit packages at the current moment are sorted in descending order of comprehensive score. Multiple information unit packages are selected from the sorted candidates, spliced into a streaming context, and transmitted to the agent's large language model to answer the question at the current moment.
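The candidate selection and ranking described in [0021] to [0023] can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the `Packet` class, the `select_candidates` and `build_context` helpers, and the values of `top_k`, `preset_weight`, and `keep` are all hypothetical, and embeddings are assumed to be L2-normalized so that a dot product equals the cosine similarity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Packet:
    """Hypothetical information unit packet with a precomputed importance score."""
    text: str
    embedding: Sequence[float]  # assumed to be L2-normalized
    importance: float


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_candidates(question_vec: Sequence[float],
                      working: List[Packet],
                      short_term: List[Packet],
                      long_term: List[Packet],
                      top_k: int = 2) -> List[Packet]:
    """Take the top_k most similar packets from short-term and long-term
    memory, then add every packet currently in working memory."""
    stored = list(short_term) + list(long_term)
    ranked = sorted(stored,
                    key=lambda p: dot(p.embedding, question_vec),
                    reverse=True)
    return ranked[:top_k] + list(working)


def build_context(candidates: List[Packet],
                  preset_weight: float = 1.0,
                  keep: int = 3) -> str:
    """Rank candidates by importance * preset_weight and join the best ones."""
    ranked = sorted(candidates,
                    key=lambda p: p.importance * preset_weight,
                    reverse=True)
    return "\n".join(p.text for p in ranked[:keep])
```

The string returned by `build_context` stands in for the spliced context that would be passed to the large language model; how that context is streamed to the model is outside this sketch.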
[0024] One embodiment of this application provides an intelligent agent question-answering interaction method based on large models and generative artificial intelligence, the method comprising the following steps:
[0025] This application has at least the following beneficial effects:
[0026] This application dynamically evaluates the importance score of information unit packages by combining time weight and topic weight, achieving intelligent filtering and compressed storage of content eliminated from working memory. This preserves high-value information and, through differential summarization and dynamic capacity control, optimizes the utilization efficiency of short-term memory, thereby improving the agent's contextual understanding and responsiveness in long dialogues and multi-turn interactions. Furthermore, this application intelligently adjusts the short-term memory capacity by dynamically fusing the system load characteristic value and the score characteristic value. It can shrink the capacity under high load to ensure system response efficiency and expand the capacity to retain key content when high-value information is concentrated, striking a dynamic balance between controllable resource consumption and information integrity. This significantly improves the contextual coherence and operational stability of intelligent agents in complex dialogue scenarios. In summary, this application realizes efficient question-and-answer interaction for intelligent agents based on large models and generative artificial intelligence through hierarchical memory and dynamic update mechanisms. By allocating multimodal information to three memory layers (working, short-term, and long-term) according to importance, and dynamically filtering and compressing content based on time and topic weights, it avoids contextual redundancy and ensures that key information is not lost. At the same time, by dynamically adjusting the memory capacity through the load and score characteristic values, it balances resource utilization and information integrity. Ultimately, this significantly improves interaction coherence, response speed, and question-and-answer stability in long dialogues and cross-session scenarios.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions and advantages in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is a flowchart of the steps of an intelligent agent question-answering interaction method based on large models and generative artificial intelligence, provided in one embodiment of this application;
[0029] Figure 2 is a flowchart of the short-term memory capacity threshold update steps provided in one embodiment of this application.
Detailed Implementation
[0030] To further illustrate the technical means and effects adopted by this application to achieve the intended inventive objective, the following, in conjunction with the accompanying drawings and preferred embodiments, details the specific implementation, structure, features, and effects of the intelligent agent question-answering interaction method based on large models and generative artificial intelligence proposed in this application. In the following description, different "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, specific features, structures, or characteristics in one or more embodiments can be combined in any suitable form.
[0031] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.
[0032] The following description, in conjunction with the accompanying drawings, details the specific scheme of the intelligent agent question-answering interaction method based on large models and generative artificial intelligence provided in this application.
[0033] This application provides an embodiment of an intelligent agent question-answering interaction method based on large models and generative artificial intelligence. Referring to Figure 1, the method includes the following steps:
[0034] Step S1: During the current conversation between the user and the AI agent, obtain the question content at the current moment and all information unit packets before the current moment, and obtain the embedding vectors of the question content and all information unit packets.
[0035] User input is usually continuous multimodal content, such as long text, coherent speech, or continuous images. If these contents are treated as a whole, many problems will arise: for example, long text may exceed the input length limit of the model; due to compatibility issues, content such as images and speech cannot be directly interpreted by the text model; and a single piece of content may contain multiple topics. These problems will lead to low memory management and retrieval efficiency of artificial intelligence. Therefore, it is necessary to perform structured processing first so that artificial intelligence can adapt to the input requirements of a multi-layer memory system.
[0036] First, determine whether the user input is text, image, or speech. If the input is not text, perform basic conversions on it. For example, use Automatic Speech Recognition (ASR) to convert speech into text and use Optical Character Recognition (OCR) to extract text information from images. At the same time, filter out noise in the input. For example, delete invalid symbols in text, silent parts in speech, and redundant frames in images to ensure that the original input is usable.
[0037] Secondly, semantic segmentation is performed to generate information units. When processing text, the BERT model is used to divide long texts into semantically complete short sentences or paragraphs. When processing content that is sequential in time, such as speech and images, the BERT model is used to segment them into independent units according to time intervals and semantic coherence. Therefore, in this embodiment, during the current user's conversation with the AI agent, the question content at the current moment and all question and answer content before the current moment are obtained, and each question and answer content is segmented into multiple independent units.
[0038] Furthermore, multimodal vector embeddings are generated. Since the segmented independent units come from different sources, including text, audio, and images, the methods used to extract their embedding vectors also differ. For independent units belonging to text, this embodiment uses a pre-trained Transformer encoder for word segmentation and encoding, outputting fixed-dimensional text vectors as their embedding vectors. For independent units belonging to audio, this embodiment uses Wav2Vec 2.0 to convert the speech spectrum into speech semantic vectors, which are used as their embedding vectors. For independent units belonging to images, this embodiment uses the visual encoder of the CLIP model to extract visual feature vectors, which are used as their embedding vectors. Thus, the embedding vectors of the independent units are obtained. To ensure that the embedding vectors of different types of independent units can be compared with each other, L2 normalization is performed on the embedding vectors of all independent units.
[0039] Obtaining text vectors with a pre-trained Transformer encoder, converting the speech spectrum into speech semantic vectors with Wav2Vec 2.0, extracting visual feature vectors with the CLIP visual encoder, and standardizing embedding vectors with L2 normalization are all well-known techniques, and their specific processes will not be elaborated here.
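As a brief illustration of the L2 normalization step, the following sketch scales a vector to unit length. The function name `l2_normalize` is hypothetical, and the sketch assumes the encoders already output vectors of the same dimension; normalization only equalizes magnitude and does not by itself align vectors produced by different encoders.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]
```

After normalization, the dot product of two vectors equals their cosine similarity, which simplifies the similarity calculations used later.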
[0040] Finally, data is labeled for each independent unit, and metadata such as timestamps and modality types are added to each independent unit. This data is then bound to an embedding vector to form an information unit package, which serves as input for memory management.
[0041] The modality type metadata is "text" for independent units belonging to text, "audio" for independent units belonging to audio, and "image" for independent units belonging to images.
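One possible way to represent an information unit package is sketched below. The class name `InfoUnitPackage`, its field names, and the choice of seconds since the epoch for the timestamp are assumptions made for illustration; the embodiment only specifies that a timestamp, a modality type, and an embedding vector are bound together.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_MODALITIES = ("text", "audio", "image")


@dataclass
class InfoUnitPackage:
    """Hypothetical container binding an independent unit to its metadata."""
    content: str            # text of the unit (or its transcription)
    modality: str           # one of "text", "audio", "image"
    timestamp: float        # assumed unit: seconds since the epoch
    embedding: List[float]  # L2-normalized embedding vector

    def __post_init__(self) -> None:
        # Reject modality labels other than the three named in the description.
        if self.modality not in ALLOWED_MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
```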
[0042] Step S2: Dynamically update the working memory, short-term memory, and long-term memory in the agent based on the information unit package.
[0043] Traditional single-context windows have a problem: limited by the model input length, they cannot simultaneously handle the latest information in long dialogues and core content across conversations. Furthermore, storing all content can be resource-intensive. Using only a single layer of memory also struggles to balance the timeliness and persistence of information. Timeliness refers to the latest dialogue content, while persistence refers to long-term user preferences. Therefore, this embodiment uses a hierarchical management approach to balance the timeliness and persistence of information, specifically:
[0044] This embodiment divides memory into three layers, as follows:
[0045] Working memory is an immediate window-level memory that primarily stores the original information from the most recent N rounds of dialogue. It can retain the original, uncompressed details and has a fixed capacity limit, such as 4KB. If new information comes in and exceeds the capacity, the oldest content will be deleted and the content will be dynamically adjusted. This ensures that the focus remains on the latest dialogue content, providing the most timely and complete context for the current interaction. This is also the key to ensuring timely dialogue response.
[0046] Short-term memory is a summary-level transitional memory that stores high-value information selected from discarded working memory. It is a concise summary that uses incremental differential summarization to summarize only newly discarded segments, avoiding redundant calculations. Its capacity can also be dynamically adjusted. The content mainly consists of key events, decisions, and temporary user needs from recent conversations. As an intermediate layer between working memory and long-term memory, it compresses redundant information while retaining important recent clues. It provides lightweight support for quick retrieval of earlier historical information, balancing storage efficiency and information integrity.
[0047] Long-term memory is a knowledge base-level persistent memory that stores key information that needs to be retained for a long time, such as core facts across sessions, stable user preferences, and persistent knowledge. It relies on scalable vector databases, such as FAISS, for storage and indexing. The content is extracted from short-term memory through event triggers or periodic filtering and can be stored for a long time across time and sessions. Its core function is to keep long-term interactions coherent. When users ask about historical facts, recurring needs, or preference-related content, key information can be quickly found through vector retrieval.
[0048] S201: Based on the time interval between each information unit package and the question content, determine the time weight of each information unit package; based on the similarity of the embedding vectors between each information unit package and the question content, determine the topic weight of each information unit package, and combine the time weights to determine the importance score of each information unit package, so as to select the information unit packages that need to be stored in short-term memory from the information unit packages that are eliminated from working memory within a preset time period before the current moment.
[0049] First, information unit packets are preferentially stored in working memory, added sequentially according to time. At the same time, a fixed capacity limit is maintained in real time. If the capacity is exceeded after a new information unit is added, a first-in, first-out mechanism is triggered, deleting the earliest stored information unit based on the timestamp. This process is repeated iteratively, so that working memory can consistently retain the original details of the most recent N rounds of dialogue. The value of N is not fixed and depends on the capacity of working memory.
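The first-in, first-out behavior of working memory can be sketched as follows. The `WorkingMemory` class is hypothetical, capacity is assumed to be measured in UTF-8 bytes of stored text, and the sketch keeps a single oversized packet rather than evicting it, a case the description does not address.

```python
from collections import deque
from typing import Deque, List


class WorkingMemory:
    """Fixed-capacity FIFO buffer; returns evicted packets for later scoring."""

    def __init__(self, capacity_bytes: int = 4096) -> None:
        self.capacity_bytes = capacity_bytes
        self.items: Deque[str] = deque()
        self._size = 0

    def add(self, text: str) -> List[str]:
        """Append a packet, then evict the oldest packets until the capacity
        limit holds again. The newest packet is never evicted."""
        self.items.append(text)
        self._size += len(text.encode("utf-8"))
        evicted: List[str] = []
        while self._size > self.capacity_bytes and len(self.items) > 1:
            oldest = self.items.popleft()
            self._size -= len(oldest.encode("utf-8"))
            evicted.append(oldest)
        return evicted
```

The list returned by `add` corresponds to the eliminated packets that are then evaluated for short-term memory.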
[0050] Fragments deleted from working memory immediately enter the short-term memory processing flow. First, the time weight of each information unit packet is determined based on the time interval between that packet and the question content. Then, the topic weight of each information unit packet is determined based on the similarity between the embedding vectors of that packet and the question content. The time weight and topic weight are combined to determine the importance score of each information unit packet, which represents its comprehensive value. If the score exceeds a preset threshold, a differential incremental summarization model, such as a lightweight pointer-generator network, is invoked to perform differential compression on the fragments. This compression generates concise summaries only for new content that has not been summarized before, without reprocessing all historical content. The generated summaries are placed into short-term memory. At the same time, the storage limit of short-term memory is adjusted in real time using a dynamic capacity threshold formula. If the limit is exceeded, the summaries are sorted by importance score from high to low, and those with low scores are deleted so that the total capacity of short-term memory remains within a controllable range.
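The final trimming step, in which low-scoring summaries are deleted once the limit is exceeded, might look like the sketch below. The function name `trim_short_term` is hypothetical, and capacity is assumed to be counted in number of summaries; the description does not state the unit.

```python
from typing import List, Tuple

Summary = Tuple[float, str]  # (importance score, summary text)


def trim_short_term(summaries: List[Summary],
                    capacity: int) -> Tuple[List[Summary], List[Summary]]:
    """Sort summaries by importance score, highest first, and split them into
    the ones that fit within the capacity and the ones that are dropped."""
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(summaries, key=lambda s: s[0], reverse=True)
    return ranked[:capacity], ranked[capacity:]
```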
[0051] The specific process for determining the importance score is as follows:
[0052] First, this embodiment determines the time weight of each information unit packet based on the time interval between each information unit packet and the question content. Specifically:
[0053] As a specific implementation, in this embodiment the time weight $w^{\mathrm{time}}_i$ of the i-th information unit packet is expressed as: $w^{\mathrm{time}}_i = \alpha \cdot \exp(-\lambda \cdot \Delta t_i)$. In the formula, $\Delta t_i$ represents the time interval between the i-th information unit packet and the question content; $\alpha$ represents the preset time weighting factor; $\lambda$ represents the preset time decay coefficient; exp() represents an exponential function with the natural constant as the base.
[0054] It should be noted that the values of the preset time weighting factor and the preset time decay coefficient are both set manually. In this embodiment, the preset time weighting factor is set to 0.5 and the preset decay coefficient is set to 0.5. In actual applications, as other implementation methods, implementers can also set them according to specific circumstances. This embodiment does not impose any special restrictions.
[0055] The time weight can be understood as follows: the longer the time that has passed, that is, the greater the time interval between the i-th information unit packet and the question content, the smaller the time weight of the i-th information unit packet, indicating that older information has a lower timeliness score. Conversely, the closer in time the i-th information unit packet is to the question content, the greater its time weight, indicating a higher timeliness score and a greater influence on and contribution to the current context.
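The decay behavior described above can be illustrated with a short sketch. It assumes the time weight has the form of the preset weighting factor multiplied by an exponential decay of the time interval, which matches the variable definitions and the monotonic behavior in this embodiment. The function name `time_weight` and the unit of the interval (minutes) are assumptions; the defaults of 0.5 come from this embodiment.

```python
import math


def time_weight(interval: float, alpha: float = 0.5, decay: float = 0.5) -> float:
    """Time weight = alpha * exp(-decay * interval).

    interval: elapsed time between the packet and the question
              (assumed to be in minutes; the unit is not stated in the source).
    """
    if interval < 0:
        raise ValueError("interval must be non-negative")
    return alpha * math.exp(-decay * interval)
```

With these defaults, a packet created at the same moment as the question receives the full factor of 0.5, and the weight shrinks toward zero as the interval grows.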
[0056] Furthermore, this embodiment determines the topic weight of each information unit package based on the similarity of the embedded vectors between each information unit package and the question content, specifically as follows:
[0057] In this embodiment, the topic weight $w^{\mathrm{topic}}_i$ of the i-th information unit packet is expressed as: $w^{\mathrm{topic}}_i = \beta \cdot \frac{s_i}{s_{\max}}$. In the formula, $s_i$ represents the similarity between the embedding vectors of the i-th information unit packet and the question content; $s_{\max}$ represents the maximum value of the embedding vector similarity between all information unit packets within a preset time period prior to the current moment and the question content at the current moment; $\beta$ represents the preset topic weight factor.
[0058] It should be noted that the preset topic weight factor is set manually, and the sum of the preset time weight factor and the preset topic weight factor is 1. In this embodiment, the preset topic weight factor is 0.5. In practical applications, implementers can set the value according to specific circumstances; this embodiment does not impose any special restrictions. It should also be noted that there are many methods for measuring the similarity between vectors. In this embodiment, the cosine similarity between the embedding vector of the i-th information unit packet and that of the question content is used as their similarity. In practical applications, implementers can also use other measures, such as the reciprocal of the Euclidean distance, according to specific circumstances; this embodiment does not impose any special restrictions on the choice of similarity measure.
[0059] The method for calculating cosine similarity is a well-known technique, and its specific calculation process will not be elaborated here.
[0060] The topic weight reflects the semantic relevance between an information unit package and the current question content. The larger the ratio of the similarity between the embedding vector of the i-th information unit package and the question content to the maximum such similarity over all information unit packages within the preset time period before the current moment, the higher the topic weight and the stronger the semantic relevance between the i-th information unit package and the current question content. This indicates that the i-th information unit package is more helpful to the current question-and-answer interaction and should be given priority in the context.
[0061] Conversely, the smaller this ratio, the weaker the semantic relevance between the i-th information unit package and the current question content, indicating that the i-th information unit package is of little help to the current question-and-answer interaction and can be considered later or eliminated from the context.
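The ratio-based topic weight can be sketched as follows, using cosine similarity as in this embodiment. The function names are hypothetical, and clamping negative similarities to zero is an added assumption to keep the weights non-negative; the description does not say how negative similarities are handled.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_weights(question_vec: Sequence[float],
                  packet_vecs: Sequence[Sequence[float]],
                  beta: float = 0.5) -> List[float]:
    """Topic weight = beta * (similarity / maximum similarity in the window)."""
    # Assumption: negative similarities are clamped to zero.
    sims = [max(0.0, cosine_similarity(v, question_vec)) for v in packet_vecs]
    max_sim = max(sims, default=0.0)
    if max_sim == 0.0:
        return [0.0 for _ in sims]
    return [beta * s / max_sim for s in sims]
```

The packet most similar to the question always receives the full factor `beta`, and every other packet is scaled relative to it.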
[0062] Furthermore, in this embodiment, the importance score of each information unit package is determined by combining the topic weight of each information unit package with the time weight, so as to select the information unit packages that need to be stored in short-term memory from the information unit packages that are eliminated from working memory within a preset time period before the current moment. Specifically:
[0063] In this embodiment, the normalized value of the sum of the time weight and the topic weight of each information unit package is used as the importance score of each information unit package.
[0064] Based on the importance scores of each information unit package, it can be understood that the importance scores dynamically evaluate the comprehensive value of the information unit packages from both the time and semantic dimensions. If the time weight of the current information unit package is greater, it means that the current information unit package is more important in the time dimension and is more likely to be reserved or called first, and the corresponding importance score is larger. At the same time, if the topic weight of the current information unit package is greater, it means that the current information unit package has a greater contribution to the topic and is more in line with the needs of the current question in the semantic dimension. The current information unit package is more likely to be given priority for generating answers or context reconstruction, and therefore, the current information unit package has a larger importance score.
[0065] Conversely, if the time weight of the current information unit package is smaller, it means that the current information unit package no longer has an advantage in the time dimension and is more likely to be delayed for retention or elimination, and the corresponding importance score is smaller. At the same time, if the topic weight of the current information unit package is smaller, it means that the topic contribution of the current information unit package is weaker and its semantic fit with the needs of the current question content is lower. The current information unit package is less likely to be given priority for generating answers or context reconstruction, and therefore, the importance score of the current information unit package is smaller.
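A minimal sketch of the importance score follows. The embodiment states only that the score is the normalized value of the sum of the two weights; min-max normalization over the packets being evaluated is an assumption made here, as are the function name `importance_scores` and the choice to return 1.0 for every packet when all sums are equal.

```python
from typing import List, Sequence


def importance_scores(time_weights: Sequence[float],
                      topic_weights: Sequence[float]) -> List[float]:
    """Sum the two weights per packet and rescale the sums to [0, 1].

    Assumption: min-max normalization; the source does not name the method.
    """
    if len(time_weights) != len(topic_weights):
        raise ValueError("weight lists must have the same length")
    raw = [t + s for t, s in zip(time_weights, topic_weights)]
    if not raw:
        return []
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # Degenerate case: all packets are equally important (arbitrary choice).
        return [1.0 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]
```

Because min-max normalization is relative to the batch, the lowest-scoring packet always maps to 0; an implementer who prefers absolute scores could instead rely on the two weight factors summing to 1.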
[0066] Furthermore, in this embodiment, among the information unit packets that are eliminated from working memory within a preset time period before the current moment, information unit packets with an importance score greater than a preset threshold are stored in short-term memory.
[0067] Here, the information unit packets eliminated from working memory are those deleted according to the first-in-first-out mechanism when a new information unit packet is stored in working memory and the stored content exceeds the capacity limit of working memory.
[0068] It should be noted that the preset duration and preset threshold are both set manually. In this embodiment, the preset duration is 10 minutes and the preset threshold is 0.4. In actual applications, as other implementation methods, implementers can also set them according to specific circumstances. This embodiment does not impose any special restrictions.
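The eviction and screening steps above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names `Package`, `WorkingMemory`, and `select_for_short_term` are hypothetical, and the working memory capacity is counted in packets, which is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

PRESET_DURATION_S = 10 * 60   # preset duration from the embodiment: 10 minutes
PRESET_THRESHOLD = 0.4        # preset importance threshold from the embodiment


@dataclass
class Package:
    text: str
    importance: float
    evicted_at: float = 0.0   # seconds; set when the packet leaves working memory


class WorkingMemory:
    """First-in-first-out store; capacity in packets is an assumption."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: Deque[Package] = deque()
        self.evicted: List[Package] = []

    def add(self, pkg: Package, now: float) -> None:
        self.items.append(pkg)
        # Delete the oldest packets once the capacity limit is exceeded.
        while len(self.items) > self.capacity:
            old = self.items.popleft()
            old.evicted_at = now
            self.evicted.append(old)


def select_for_short_term(evicted: List[Package], now: float) -> List[Package]:
    """Keep packets evicted within the preset duration whose score exceeds the threshold."""
    return [p for p in evicted
            if now - p.evicted_at <= PRESET_DURATION_S
            and p.importance > PRESET_THRESHOLD]
```

A packet evicted more than 10 minutes ago, or one scoring 0.4 or lower, is not carried into short-term memory in this sketch.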
[0069] Thus, this embodiment achieves intelligent filtering and compressed storage of content to be eliminated from working memory by dynamically evaluating the importance score of information unit packages by combining time weight and topic weight. This not only preserves high-value information, but also optimizes the utilization efficiency of short-term memory through differential summarization and dynamic capacity control, thereby improving the agent's contextual understanding and response capabilities in long dialogues and multi-turn interactions.
[0070] S202: Obtain the agent's latency at the current moment and, combining it with the preset target latency, determine the agent's load characteristic value at the current moment; determine the short-term memory's score characteristic value at the current moment based on the distribution of the importance scores of all information unit packets in the short-term memory within the preset time period before the current moment, and adjust the short-term memory's capacity threshold at the current moment by combining it with the load characteristic value, so as to retain the necessary information unit packets.
[0071] If the upper limit of short-term memory is fixed, two problems arise. First, when the system load suddenly increases, for example when many users are conversing simultaneously or computing resources are insufficient, a fixed capacity may leave rarely used stored content consuming too much memory and computation during retrieval, slowing down responses. Second, when the conversation contains a lot of high-value information, such as core user preferences and key decisions, a fixed capacity may be insufficient, forcing the deletion of important content and affecting the coherence of subsequent conversations.
[0072] Therefore, this embodiment determines the agent's load characteristic value at the current moment by obtaining the agent's latency and combining it with a preset target latency; based on the distribution of importance scores of all information unit packets in short-term memory within a preset time period before the current moment, it determines the short-term memory's score characteristic value at the current moment, and adjusts the short-term memory's capacity threshold at the current moment in combination with the load characteristic value. A dynamic adjustment mechanism is adopted, integrating system load indicators and the average importance of information. When the system load is high, the capacity is automatically reduced to release resources, ensuring efficient system operation; when high-value information is dense, the capacity is appropriately increased to retain key content, thus finding a dynamic balance between controllable resource consumption and no loss of important information. Short-term memory effectively supports the contextual coherence of recent dialogues without slowing down system performance. The specific process is as follows:
[0073] In this embodiment, by obtaining the agent's latency at the current moment and combining it with a preset target latency, the load characteristic value of the agent at the current moment is determined. Specifically:
[0074] In this embodiment, the ratio of the agent's latency at the current moment to the preset target latency is used as the agent's load characteristic value at the current moment. This value assesses the agent's current workload, reflects the agent's resource stress level, and serves as the basis for dynamically adjusting the short-term memory capacity. If the load characteristic value is greater than 1, the agent is overloaded, and the short-term memory capacity needs to be reduced. Conversely, if the load characteristic value is less than 1, the agent's latency is below the target, indicating that the load is lower than expected, and the short-term memory capacity can be appropriately increased.
[0075] It should be noted that the preset target latency is set manually. In this embodiment, the preset target latency is set to 200 ms. In actual application, the implementer can also set it according to the specific situation. This embodiment does not impose any special restrictions.
[0076] Furthermore, this embodiment determines the score characteristic value of short-term memory at the current moment based on the distribution of importance scores of all information unit packets in short-term memory within a preset duration prior to the current moment, specifically as follows:
[0077] In this embodiment, the average importance score of all information unit packets in short-term memory within a preset time period before the current moment is used as the score characteristic value of short-term memory at the current moment. It characterizes the overall importance of the information in short-term memory and is the other key factor in dynamically adjusting memory capacity. The importance score of a single information unit packet reflects only that packet's own value, whereas the score characteristic value reflects the overall importance of the information in short-term memory during this period. The larger the score characteristic value, the higher the overall value of the information currently in short-term memory and the more the capacity limit can be relaxed. This setting avoids deleting high-value information too early because of capacity limitations and better matches the needs of actual use cases.
[0078] Based on the score characteristic value, the agent can appropriately relax the capacity limit of short-term memory when high-value information is concentrated, so that important information is not deleted prematurely.
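The two quantities that feed the capacity adjustment can be sketched as follows. The function names are hypothetical, and returning 0.0 for an empty short-term memory is an assumption; the 200 ms default is the preset target latency of this embodiment.

```python
from typing import Sequence


def load_characteristic(latency_ms: float,
                        target_latency_ms: float = 200.0) -> float:
    """Ratio of current latency to the preset target latency.

    A value above 1 indicates overload; a value below 1 indicates headroom.
    """
    if target_latency_ms <= 0:
        raise ValueError("target latency must be positive")
    return latency_ms / target_latency_ms


def score_characteristic(importance_scores: Sequence[float]) -> float:
    """Mean importance score of the packets currently in short-term memory.

    Assumption: an empty short-term memory yields 0.0.
    """
    if not importance_scores:
        return 0.0
    return sum(importance_scores) / len(importance_scores)


load = load_characteristic(300.0)                 # latency above target
mean_score = score_characteristic([0.5, 0.7, 0.9])
```

In this example the latency of 300 ms exceeds the 200 ms target, so the load characteristic value is greater than 1 and the embodiment would reduce the short-term memory capacity.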
[0079] Finally, this embodiment combines the score characteristic value of short-term memory at the current moment with the load characteristic value to adjust the capacity threshold of short-term memory at the current moment, so as to retain the necessary information unit packets, specifically:
[0080] As one implementation method, in this embodiment, the capacity threshold of short-term memory at the current moment is expressed in terms of the following quantities: the agent's load characteristic value at the current moment; I, the short-term memory score characteristic value at the current moment; the preset baseline capacity limit; the preset adjustment factor; and norm[ ], the normalization function.
[0081] Preferably, the flowchart of the short-term memory capacity threshold update step provided in this embodiment is shown in Figure 2.
[0082] It should be noted that the preset baseline capacity limit and the preset adjustment factor are set manually. In this embodiment, the preset baseline capacity limit is 10 KB, and the preset adjustment factor, which must be greater than 1, is 2. In actual applications, as other implementation methods, implementers can also set them according to specific circumstances. This embodiment does not impose any special restrictions. The preset adjustment factor must be chosen so that the resulting capacity does not exceed what the actual hardware can support.
[0083] Furthermore, if a certain information unit package in short-term memory is accessed more than M times, it is selected and stored in long-term memory for persistent storage. The specific process is as follows: first, a global index containing the embedding vector and metadata is generated for this information unit package through the vector database FAISS; then, the information unit package is stored in a distributed vector library; in addition, long-term memory periodically performs full index optimization, that is, it clusters similar entries, removes duplicates, and merges entries of the same type, so that key information can be found quickly through vector retrieval in cross-session scenarios.
[0084] It should be noted that the value of M in this embodiment is 10. In actual application, as other implementation methods, implementers can also set it according to specific circumstances. This embodiment does not impose any special restrictions.
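The access-count rule above can be sketched as follows. This is an illustration under stated assumptions: the class and method names are hypothetical, a plain dictionary stands in for the FAISS index and the distributed vector library, and keeping the package in short-term memory after promotion is an assumption, since the text does not say whether it is removed.

```python
from dataclasses import dataclass
from typing import Dict, List

M = 10  # access-count limit from the embodiment


@dataclass
class ShortTermEntry:
    package_id: str
    embedding: List[float]
    access_count: int = 0


class MemoryTiers:
    def __init__(self) -> None:
        self.short_term: Dict[str, ShortTermEntry] = {}
        # Stand-in for the vector index: package id -> embedding vector.
        self.long_term: Dict[str, List[float]] = {}

    def put_short_term(self, entry: ShortTermEntry) -> None:
        self.short_term[entry.package_id] = entry

    def access(self, package_id: str) -> ShortTermEntry:
        """Record one access; promote once the count exceeds M."""
        entry = self.short_term[package_id]
        entry.access_count += 1
        if entry.access_count > M and package_id not in self.long_term:
            self.long_term[package_id] = list(entry.embedding)
        return entry
```

Because the condition is "more than M times", the package is promoted on its eleventh access when M is 10.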
[0085] Among them, the vector database FAISS, the distributed vector library, and the full index optimization are all well-known technologies, and their specific operating principles and processes will not be elaborated here.
[0086] Thus, this embodiment achieves intelligent adjustment of short-term memory capacity by dynamically fusing system load characteristic values and scoring characteristic values. It can shrink the capacity under high load to ensure system response efficiency, and expand the capacity to retain key content when high-value information is dense. This achieves a dynamic balance between controllable resource consumption and information integrity, significantly improving the contextual coherence and operational stability of the agent in complex dialogue scenarios.
[0087] Step S3: Based on the information unit packets stored in the agent's working memory, short-term memory, and long-term memory at the current moment, filter candidate information unit packets and answer the question at the current moment.
[0088] Step S2 completes the layering and updating of working memory, short-term memory, and long-term memory in the agent. However, the information unit packets are scattered across the three levels of working, short-term, and long-term memory, and much of their content is unrelated to the current dialogue; transmitting all of it directly to the large model would either make the context redundant or cause key information to be missed. Therefore, it is necessary to first retrieve the fragments most relevant to the current dialogue and then logically reorganize them into a concise and complete context, so as to solve the problems of information dispersion and insufficient relevance. The specific process is as follows:
[0089] First, using the embedding vector of the question currently entered by the user as the query, the vector database is called to perform a fast KNN retrieval on the information unit packets in short-term memory and long-term memory. Specifically, in this embodiment, all information unit packets in short-term memory and long-term memory are sorted in descending order of the similarity between their embedding vectors and that of the question content at the current moment. The top preset number of information unit packets in the sorted result, together with all information unit packets in the working memory at the current moment, are used as candidate information unit packets at the current moment.
[0090] The similarity calculation process is similar to the topic weight calculation process, and will not be elaborated further.
[0091] It should be noted that the preset quantity is set manually. In this embodiment, the preset quantity is 50. In actual application, as other implementation methods, implementers can also set it according to specific circumstances. This embodiment does not impose any special restrictions.
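The candidate selection described above can be sketched as follows. This is a brute-force illustration rather than a call into a vector database: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and package identifiers are assumed to be unique across the three memory levels.

```python
import math
from typing import Dict, List, Sequence

PRESET_QUANTITY = 50  # preset quantity from the embodiment


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_candidates(query: Sequence[float],
                      working_ids: List[str],
                      short_term: Dict[str, Sequence[float]],
                      long_term: Dict[str, Sequence[float]],
                      k: int = PRESET_QUANTITY) -> List[str]:
    """All working-memory packets plus the top-k most similar stored packets."""
    pool = {**long_term, **short_term}
    ranked = sorted(pool,
                    key=lambda pid: cosine_similarity(query, pool[pid]),
                    reverse=True)
    return list(working_ids) + ranked[:k]
```

For large memories, an approximate nearest-neighbor index would replace the full sort used here.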
[0092] Furthermore, basic weights are assigned to the candidate information unit packages according to their levels: working memory 0.6, short-term memory 0.3, and long-term memory 0.1. The product of the importance score of each candidate information unit package at the current moment and the preset weight, which is the basic weight of the corresponding level, is calculated and recorded as the comprehensive score of that candidate at the current moment. All candidate information unit packages at the current moment are then sorted in descending order of comprehensive score. Following the order "working memory, short-term memory, long-term memory", n1 information unit packages are selected from working memory, n2 from short-term memory, and n3 from long-term memory. A prompt template, such as "Answer the question based on the following historical information", is added, and a streaming context prompt is constructed. The construction process of the streaming context is a well-known technology and will not be described in detail.
[0093] The values of n1, n2, and n3 are all set manually. In this embodiment, their values are 5, 3, and 2, respectively. In actual applications, implementers can also set them according to specific circumstances. This embodiment does not impose any special restrictions.
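The scoring and selection steps above can be sketched as follows. The level weights, the values of n1, n2, and n3, and the template sentence come from this embodiment; the names `Candidate`, `composite_score`, and `build_prompt`, as well as the exact layout of the prompt text, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

LEVEL_WEIGHTS = {"working": 0.6, "short_term": 0.3, "long_term": 0.1}
QUOTAS = {"working": 5, "short_term": 3, "long_term": 2}  # n1, n2, n3
TEMPLATE = "Answer the question based on the following historical information"


@dataclass
class Candidate:
    text: str
    level: str         # "working", "short_term", or "long_term"
    importance: float


def composite_score(c: Candidate) -> float:
    """Importance score multiplied by the basic weight of the package's level."""
    return c.importance * LEVEL_WEIGHTS[c.level]


def build_prompt(candidates: List[Candidate], question: str) -> str:
    ranked = sorted(candidates, key=composite_score, reverse=True)
    lines = [TEMPLATE + ":"]
    # Take the top packages per level, in the order working, short-term, long-term.
    for level in ("working", "short_term", "long_term"):
        picked = [c for c in ranked if c.level == level][:QUOTAS[level]]
        lines.extend(f"- {c.text}" for c in picked)
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Because the selection is made per level, a long-term package is still included within its quota even when every working-memory package has a higher comprehensive score.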
[0094] Furthermore, the streaming context is transmitted to the agent's large language model, and a response to the question at the current moment is generated using the model's semantic understanding and generation capabilities. Memory updates are tailored to the characteristics of each level. Working memory uses a first-in-first-out mechanism: it stores the original content of the current round and, when full, deletes the oldest content that exceeds its capacity. Short-term memory performs incremental summarization on the fragments evicted from working memory, processing only the newly evicted parts to reduce computational load. After the dialogue ends, long-term memory persistently stores the core information from short-term memory that needs to be retained across sessions. By generating answers in real time and updating memory incrementally at each level, efficient interaction can be achieved, and memory can be extended indefinitely.
[0095] The process of using a large language model for question-and-answer interaction is a well-known technique and will not be elaborated further.
[0096] Thus, this embodiment achieves efficient question-and-answer interaction for intelligent agents based on large models and generative artificial intelligence through a hierarchical memory and dynamic update mechanism. By allocating multimodal information to three layers of memory (working, short-term, and long-term) according to importance, and dynamically filtering and compressing content based on time and topic weights, it avoids contextual redundancy and ensures that key information is not lost. At the same time, by dynamically adjusting memory capacity through the load characteristic value and the score characteristic value, a balance is achieved between resource utilization and information integrity. Ultimately, this significantly improves interaction coherence, response speed, and system stability in long dialogues and cross-session scenarios.
[0097] It should be noted that the order of the embodiments described above is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. Specific embodiments of this specification have been described above. The processes depicted in the accompanying drawings do not necessarily require the specific order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are possible or may be advantageous.
[0098] The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.
[0099] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them; modifications to the technical solutions described in the foregoing embodiments, or equivalent substitutions of some of the technical features, do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. An agent-based question-answering interaction method based on large models and generative artificial intelligence, characterized in that the method includes the following steps: during the current conversation between the user and the AI agent, the question content at the current moment and all question and answer content before the current moment are obtained; each question and answer content is formed into multiple information unit packages, and the embedding vectors of the question content and of all information unit packages are obtained; the working memory, short-term memory, and long-term memory in the agent are dynamically updated based on the information unit packages, wherein the update process for the short-term memory capacity threshold is as follows: based on the time interval between each information unit package and the question content, the time weight of each information unit package is determined; based on the similarity of the embedding vectors between each information unit package and the question content, the topic weight of each information unit package is determined, and, combined with the time weight, the importance score of each information unit package is determined, so as to select the information unit packages that need to be stored in short-term memory from the information unit packages that are eliminated from working memory within a preset time period before the current moment; the agent's latency at the current moment is obtained and, combined with a preset target latency, the agent's load characteristic value at the current moment is determined;
based on the distribution of the importance scores of all information unit packets in short-term memory within the preset time period before the current moment, the short-term memory's score characteristic value at the current moment is determined and, combined with the load characteristic value, the short-term memory's capacity threshold at the current moment is adjusted, including: the short-term memory capacity threshold at the current moment is expressed in terms of the agent's load characteristic value at the current moment; I, the short-term memory score characteristic value at the current moment; the preset baseline capacity limit; the preset adjustment factor; and norm[ ], the normalization function; the score characteristic value of the short-term memory at the current moment is the mean of the importance scores of all information unit packets in the short-term memory within the preset time period before the current moment; based on the information unit packets stored in the agent's working memory, short-term memory, and long-term memory at the current moment, candidate information unit packets are selected to answer the question at the current moment.
2. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the time weight of the i-th information unit packet is expressed in terms of: the time interval between the i-th information unit packet and the question content; the preset time weighting factor; the preset time decay coefficient; and exp(), the exponential function with the natural constant as the base.
3. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the topic weight of the i-th information unit packet is expressed in terms of: the similarity of the embedding vectors between the i-th information unit packet and the question content; the maximum value of the embedding vector similarity between all information unit packets within a preset time period prior to the current moment and the question content at the current moment; and the preset topic weight factor.
4. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the importance score of each information unit package is the normalized value of the sum of the time weight and the topic weight of that information unit package.
5. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the step of selecting information unit packets to be stored in short-term memory from the information unit packets that have been eliminated from working memory within a preset time period prior to the current moment includes: among the information unit packets eliminated from working memory within the preset time period before the current moment, those whose importance score is greater than a preset threshold are stored in short-term memory.
6. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the load characteristic value of the agent at the current moment is the ratio of the agent's latency at the current moment to the preset target latency.
7. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the candidate information unit packets are obtained as follows: all information unit packets in short-term memory and long-term memory are sorted in descending order of the similarity between their embedding vectors and that of the question content at the current moment; the top preset number of information unit packets in the sorted result, together with all information unit packets in the working memory at the current moment, are used as the candidate information unit packets at the current moment.
8. The agent-based question-answering interaction method based on large models and generative artificial intelligence as described in claim 1, characterized in that the answering of the question at the current moment includes: calculating the product of the importance score of each candidate information unit package at the current moment and the preset weight, and recording it as the comprehensive score of that candidate at the current moment; sorting all candidate information unit packages at the current moment in descending order of comprehensive score; and selecting multiple information unit packages from the sorted candidate information unit packages, splicing them into a streaming context, and transmitting it to the agent's large language model to answer the question at the current moment.