Data processing method and system for retrieval generation

The method uses three modules. A sensory memory module performs lightweight pre-compression and hybrid topic segmentation. A short-term memory module performs topic-aware summarization. A long-term memory module performs decoupled, deferred deep integration. Together they address the wasted computational resources and inaccurate generation that large language models show in long dialogues. The approach achieves efficient semantic aggregation and dynamic maintenance of the knowledge base, which improves the overall performance of the retrieval generation system.

CN122240829A · Pending · Publication Date: 2026-06-19 · BEIJING JIZHI DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING JIZHI DIGITAL TECH CO LTD
Filing Date: 2026-04-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Large language models struggle to maintain long-term states in long dialogues or multi-turn interactions, leading to wasted computational resources and the generation of inaccurate or incomplete memory entries. Existing external memory systems lack modeling and deep maintenance of semantic connections.

Method used

The system uses three modules:
- a sensory memory module for lightweight pre-compression and hybrid topic segmentation;
- a short-term memory module for topic-aware summary generation;
- a long-term memory module for decoupled, deferred deep integration.

The "lightweight pre-compression + hybrid topic segmentation" solution filters redundant data and aggregates semantic topics. This provides high-quality input for downstream memory processing.

Benefits of technology

It significantly improves the data quality and semantic density of the knowledge base in retrieval generation tasks, optimizes the response speed and recall accuracy of the retrieval system, ensures the timeliness, consistency and scalability of the knowledge base, and enhances the overall performance of the retrieval generation system.

✦ Generated by Eureka AI based on patent content.

Figure CN122240829A_ABST

Abstract

This application discloses a data processing method and system for retrieval generation, relating to the technical field of retrieval generation. The data processing method for retrieval generation includes: acquiring raw interaction data generated by a user interacting with a dialogue model; performing pre-compression and topic segmentation and aggregation processing on the raw interaction data to obtain semantic aggregated fragment data; performing summary generation processing on the semantic aggregated fragment data to obtain structured data; and performing update and maintenance processing on the structured data to obtain target structured data for retrieval generation.

Description

Technical Field

[0001] This application relates to the technical field of retrieval generation, and in particular to a data processing method and system for retrieval generation.

Background Technology

[0002] Large Language Models (LLMs) are limited by their fixed context windows, making it difficult to maintain long-term states in long dialogues or multi-turn interactions. External memory systems are a key technology to solve this problem.

[0003] One related technique for external memory systems is augmented generation based on sequential summarization. When the dialogue context window is full, the historical information is condensed into a summary, and that summary becomes the new context for continuing the dialogue. However, sequential summarization summarizes mechanically by round or chronological order. It does no information filtering and has no awareness of topic semantics, so redundant information remains and computational resources are wasted.

Another related technique is RAG (Retrieval-Augmented Generation), which stores historical dialogue knowledge in a vector database. When needed, relevant context is retrieved by vector retrieval (similarity calculation) and fed into the LLM as part of the prompt. This approach has several weaknesses:
- it typically stores raw or simply segmented text directly;
- it lacks deep maintenance of long-term memory, such as reorganization, deduplication, and conflict resolution;
- its updates are serial and tightly coupled;
- it builds memories at a fixed granularity of rounds or whole conversations without modeling semantic connections, which mixes topic semantics and produces inaccurate or incomplete memory entries.

Summary of the Invention

[0004] The embodiments of this application aim to at least partially solve one of the technical problems in the related art. Therefore, the purpose of the embodiments of this application is to provide a data processing method, system, device, and medium for retrieval generation, which improves data processing efficiency and quality, thereby improving the accuracy of retrieval generation.

[0005] This application provides a data processing method for retrieval generation, comprising: acquiring raw interaction data generated by a user interacting with a dialogue model; performing pre-compression and topic segmentation and aggregation processing on the raw interaction data to obtain semantic aggregated fragment data; performing summary generation processing on the semantic aggregated fragment data to obtain structured data; and performing update and maintenance processing on the structured data to obtain target structured data for retrieval generation.

For example, the raw interaction data includes token data. In that case, pre-compression and topic segmentation and aggregation of the raw interaction data includes:
- classifying the token data based on a pre-trained compression model to obtain a target compressed sequence, and storing the target compressed sequence in a first buffer;
- when the target compressed sequence in the first buffer reaches a preset capacity, performing attention calculation on it to obtain an attention boundary, and performing similarity calculation on it to obtain a similarity boundary;
- when the first buffer has not reached the preset capacity, continuing to acquire target compressed data;
- performing topic segmentation and aggregation based on the attention boundary and the similarity boundary to obtain semantic aggregated fragment data.

[0006] For example, classifying the token data based on the pre-trained compression model to obtain the target compressed sequence includes the following steps:
- Classify the token data with the pre-trained compression model to obtain an original score vector for each token.
- Normalize the original score vector to obtain the probability distribution and retention probability data of each token.
- Perform a first filtering process on the token data, based on a first preset threshold and the retention probability data, to obtain a first compressed sequence. The first preset threshold is determined from a preset compression ratio.
- Calculate conditional entropy from the probability distribution and real label data to obtain conditional entropy data. The real label data is the encoded label of each token in the first compressed sequence.
- Perform a second filtering process on the first compressed sequence, based on a second preset threshold and the conditional entropy data, to obtain the target compressed sequence.

[0007] For example, the pre-trained compression model includes a high-level attention layer, and the raw interaction data includes N rounds of data. The i-th and j-th round data are user data or dialogue model data, where i = 1, 2, ..., N and j = 1, 2, ..., N.

Performing attention calculation on the target compressed data to obtain the attention boundary includes:
- obtaining the attention data of the N rounds of data output by the high-level attention layer;
- performing pairwise attention processing on the i-th and j-th round data based on the attention data to obtain attention score data;
- obtaining an attention matrix from the attention score data;
- determining the attention boundary from the attention matrix.

And/or, performing similarity calculation on the target compressed sequence to obtain the similarity boundary includes:
- embedding the N rounds of data to obtain original interaction vectors;
- calculating, based on the attention boundary, the similarity between the original interaction vectors of the N rounds to obtain similarity data;
- determining the similarity boundary from the similarity data.

For example, performing summary generation processing on the semantic aggregated fragment data to obtain structured data includes:
- obtaining topic data based on the semantic aggregated fragment data;
- obtaining index data for each piece of semantic aggregated fragment data, based on the topic data and the fragment data, and storing the index data in a second buffer;
- when the index data in the second buffer reaches a preset number, performing summary generation processing on the index data to obtain structured data;
- when the index data in the second buffer has not reached the preset number, continuing to acquire index data.

The preset number is determined based on the compressed data.

[0008] For example, performing summary generation processing on index data to obtain structured data includes: extracting core information from the index data to obtain summary data, embedding the summary data to obtain a summary embedding vector; and obtaining structured data based on the index data, summary data, and summary embedding vector.
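The structured record in [0008] can be sketched minimally as follows. This sketch assumes the index data is a topic plus its turn pairs. The class names `IndexData` and `StructuredRecord` and the `summarize` and `embed` callables are hypothetical stand-ins. In the described system, the summary would come from a large language model and the vector from an embedding model, and neither model is specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TurnPair = Tuple[str, str]  # (user input, dialogue-model response)


@dataclass
class IndexData:
    """Hypothetical index entry: a topic and the turn pairs filed under it."""
    topic: str
    turn_pairs: List[TurnPair]


@dataclass
class StructuredRecord:
    """Structured data = index data + summary data + summary embedding vector."""
    index: IndexData
    summary: str
    summary_embedding: List[float]


def build_record(index: IndexData,
                 summarize: Callable[[IndexData], str],
                 embed: Callable[[str], List[float]]) -> StructuredRecord:
    # Extract core information from the index data, then embed that summary.
    summary = summarize(index)
    return StructuredRecord(index=index, summary=summary, summary_embedding=embed(summary))
```

In use, `summarize` and `embed` would wrap whatever models a deployment chooses. Keeping them as parameters leaves the record layout independent of those choices.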

[0009] For example, updating and maintaining the structured data to obtain target structured data for retrieval generation includes:
- acquiring time information of the structured data, and storing the structured data and the time information in a memory bank;
- in response to a preset trigger instruction, performing offline update processing on the structured data to obtain an update sequence;
- performing deduplication, merging, and discarding on the update sequence and the structured data to obtain the target structured data;
- updating the memory bank with the target structured data, so that online inference can use the target structured data corresponding to the topic data and the summary embedding vector.

For example, the structured data contains multiple memory entries. In that case, performing offline update processing in response to the preset trigger instruction to obtain the update sequence includes:
- determining the related memory entries of each memory entry based on the time information;
- performing parallel similarity calculation on the summary embedding vectors of each memory entry and its related memory entries to obtain the update sequence.
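The offline update step in [0009] can be sketched under two assumptions that the text leaves open. First, "related" memory entries are those whose timestamps fall within a time window of each other. Second, similarity is cosine similarity between summary embedding vectors. The names `related_ids`, `update_sequence`, `window`, and `min_sim` are hypothetical, and a thread pool stands in for the parallel computation.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def related_ids(times, i, window):
    """Hypothetical relation: entries whose timestamps are within `window` of entry i."""
    return [j for j, t in enumerate(times) if j != i and abs(t - times[i]) <= window]


def update_sequence(embeddings, times, window, min_sim, workers=4):
    """Return candidate pairs (i, j, cosine similarity), most similar first."""
    vecs = np.asarray(embeddings, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize for cosine

    def score(i):
        pairs = []
        for j in related_ids(times, i, window):
            if j > i:  # score each unordered pair once
                sim = float(unit[i] @ unit[j])
                if sim >= min_sim:
                    pairs.append((i, j, sim))
        return pairs

    # Score every entry against its related entries in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_entry = list(pool.map(score, range(len(times))))
    return sorted((p for ps in per_entry for p in ps), key=lambda p: -p[2])
```

Sorting by similarity puts the strongest deduplication and merge candidates first, ready for the maintenance pass.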

[0010] For example, performing topic segmentation and aggregation based on the attention boundary and the similarity boundary to obtain semantic aggregated fragment data includes:
- determining intersection data of the attention boundary and the similarity boundary;
- when intersection data exists, segmenting the first buffer using the intersection data as the segmentation boundary to obtain semantic aggregated fragment data;
- when no intersection data exists, performing attention calculation and similarity calculation on the target compressed data of newly acquired raw interaction data to obtain new attention and similarity boundaries, then determining their intersection data again;
- when the raw interaction data ends and still no intersection data exists, segmenting the first buffer using the similarity boundary as the segmentation boundary;
- when no similarity boundary exists either, segmenting the first buffer using the attention boundary as the segmentation boundary.
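The fallback order in [0010] can be sketched as a small decision function. This assumes boundaries are expressed as round indices in the first buffer, so the "intersection" of the two boundary sets is an ordinary set intersection. The names `choose_split_points` and `split_buffer` are hypothetical.

```python
from typing import List, Optional, Set


def choose_split_points(attention_bounds: Set[int],
                        similarity_bounds: Set[int],
                        dialogue_ended: bool) -> Optional[Set[int]]:
    """Return the round indices at which to split the first buffer,
    or None if the caller should acquire more data and recompute."""
    intersection = attention_bounds & similarity_bounds
    if intersection:
        return intersection          # both signals agree
    if not dialogue_ended:
        return None                  # wait for new interaction data
    if similarity_bounds:
        return similarity_bounds     # dialogue over: similarity boundary first
    return attention_bounds          # otherwise fall back to attention boundary


def split_buffer(rounds: List, split_points: Set[int]) -> List[List]:
    """Cut buffered rounds into fragments; split point k starts a new fragment at round k."""
    fragments, start = [], 0
    for k in sorted(p for p in split_points if 0 < p < len(rounds)):
        fragments.append(rounds[start:k])
        start = k
    fragments.append(rounds[start:])
    return fragments
```

If the dialogue ends with neither kind of boundary, the function returns an empty set. In that case `split_buffer` keeps the whole buffer as one fragment.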

[0011] Another embodiment of this application provides a data processing system for retrieval generation, including:
- an acquisition module for acquiring raw interaction data generated by a user interacting with a dialogue model;
- a first processing module for performing pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantic aggregated fragment data;
- a second processing module for performing summary generation on the semantic aggregated fragment data to obtain structured data;
- a module for updating and maintaining the structured data to obtain target structured data for retrieval generation.

Another embodiment of this application provides an electronic device having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of the above embodiments.

[0012] Another embodiment of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of the above embodiments.

[0013] In the above embodiments, the data processing method for retrieval generation includes: acquiring raw interaction data generated by the user's dialogue interaction with the dialogue model; performing pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantic aggregated fragment data; performing summary generation on the semantic aggregated fragment data to obtain structured data; and updating and maintaining the structured data to obtain target structured data for retrieval generation.

The method follows a progressive flow: "raw interaction data acquisition → pre-compression and topic segmentation and aggregation → semantic fragment summary generation → structured data update and maintenance". This flow efficiently purifies massive, unstructured dialogue data and reconstructs it as knowledge. It addresses the inherent defects of raw interaction data, such as high noise, redundancy, and semantic dispersion. Low-quality corpora become topic-focused, semantically concise structured knowledge fragments, which improves the data quality and semantic density of the knowledge base in retrieval generation tasks.

The target structured data also has clear topic boundaries and summary representations, so it can be matched against user query intent more accurately and quickly. This improves the response speed, recall accuracy, and result relevance of the retrieval system. Structured storage and dynamic maintenance keep the knowledge base timely, consistent, and scalable, ultimately improving the overall performance of the retrieval generation system at the data source.

Attached Figure Description

[0014] Figure 1 is a flowchart of a data processing method for retrieval generation according to an embodiment of this application;
Figure 2 is a flowchart of a method for pre-compression and topic segmentation and aggregation of raw interaction data according to an embodiment of this application;
Figure 3 is a flowchart of a method for generating summaries from semantic aggregated fragment data according to an embodiment of this application;
Figure 4 is a flowchart of a method for updating and maintaining structured data according to an embodiment of this application;
Figure 5 is a block diagram of a data processing system for retrieval generation according to another embodiment of this application;
Figure 6 is a block diagram of an electronic device according to another embodiment of this application.

Detailed Implementation

[0015] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.

[0016] Large Language Models (LLMs) are limited by their fixed context windows, making it difficult to maintain long-term states in long dialogues or multi-turn interactions. When dealing with long-term, multi-turn interactive dialogues, they suffer from inefficiency and insufficient consistency. External memory systems are a key technology for solving this problem.

[0017] Specifically, existing external memory systems have the following drawbacks:
(1) Redundant information causes large computational overhead. The preprocessing stage does not properly handle redundant and low-value tokens in the original input. This inflates token consumption, wastes downstream computing resources, and lowers inference efficiency.
(2) Fixed granularity makes memory units inaccurate. Memories are built at a fixed granularity of rounds or sessions, without modeling semantic connections. Topic semantics therefore become mixed, and memory entries are inaccurate or incomplete.
(3) Real-time online updates cause high inference latency. Memory updates and forgetting are performed in real time during inference or task execution. This tightly coupled, sequential update mechanism adds significant inference latency.

[0018] One related technique for external memory systems is augmented generation based on sequential summarization. When the dialogue context window is full, the historical information is condensed into a summary, and that summary becomes the new context for continuing the dialogue. However, sequential summarization summarizes mechanically by round or chronological order. It does no information filtering and has no awareness of topic semantics, so redundant information remains and computational resources are wasted.

Another related technique is RAG (Retrieval-Augmented Generation), which stores historical dialogue knowledge in a vector database. When needed, relevant context is retrieved by vector retrieval (similarity calculation) and fed into the LLM as part of the prompt. This approach has several weaknesses:
- it typically stores raw or simply segmented text directly;
- it lacks deep maintenance of long-term memory, such as reorganization, deduplication, and conflict resolution;
- its updates are serial and tightly coupled;
- it builds memories at a fixed granularity of rounds or whole conversations without modeling semantic connections, which mixes topic semantics and produces inaccurate or incomplete memory entries.

[0019] In view of this, this application proposes a data processing method for retrieval generation. The method is inspired by how the human brain processes memory. It divides memory processing into three parts: sensory memory (rapid filtering), short-term memory (active integration), and long-term memory (persistent storage and consolidation).

The method takes a lightweight, efficient memory-enhancement approach built from three decoupled modules:
- a cognitively inspired sensory memory module that rapidly filters and selects dialogue content;
- a topic-aware short-term memory module that efficiently integrates and refines content;
- a long-term memory module that performs decoupled, deferred deep integration.

The "lightweight pre-compression + hybrid topic segmentation" solution filters redundant data and aggregates semantic topics. This provides high-quality input for downstream memory processing while reducing computational overhead.

[0020] Figure 1 is a flowchart of a data processing method for retrieval generation according to an embodiment of this application.

[0021] As shown in Figure 1, the data processing method 100 for retrieval generation provided in this application includes, for example, steps S110 to S140.

[0022] Step S110: Obtain the raw interaction data generated by the user's dialogue interaction with the dialogue model. For example, the interaction data (original content) from the user's conversation with the dialogue model is segmented against a preset vocabulary, for example by a tokenizer. The resulting token data is used as the raw interaction data. The dialogue model is, for example, a large language model. The user interacts with it in rounds: each user input is one round, and each response of the large language model is another round.
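Step S110 can be sketched under two assumptions. First, each message is one round. Second, greedy longest-match against a preset vocabulary stands in for the dialogue model's real tokenizer. Real LLM tokenizers, such as BPE or SentencePiece, work differently. The names `segment` and `raw_interaction_data` are hypothetical.

```python
from typing import List, Set, Tuple


def segment(text: str, vocab: Set[str], max_len: int = 8) -> List[str]:
    """Greedy longest-match segmentation against a preset vocabulary.
    Characters not covered by the vocabulary become single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match (or 1 char).
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab or n == 1:
                tokens.append(piece)
                i += n
                break
    return tokens


def raw_interaction_data(messages: List[Tuple[str, str]],
                         vocab: Set[str]) -> List[Tuple[str, List[str]]]:
    """messages: [(role, text), ...]; every message becomes one tokenized round."""
    return [(role, segment(text, vocab)) for role, text in messages]
```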

[0023] Step S120: Perform pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantic aggregated fragment data. For example, this processing is performed by the sensory memory module, which includes a compression model among other components.

Pre-compression uses a pre-trained compression model such as LLMLingua-2. The model classifies the token data to obtain the probability distribution and retention probability data of each token. Conditional entropy data is then calculated from the probability distribution. The token data is filtered on the retention probability data and the conditional entropy data to obtain the target compressed sequence, which is stored in the first buffer.

When the target compressed sequence in the first buffer reaches a preset capacity (e.g., 512 tokens), topic segmentation is performed on it. Attention and similarity calculations on the sequence yield the attention boundary (reflecting semantic changes) and the similarity boundary, and both are used to segment the sequence into topics:
- When the similarity boundary and the attention boundary intersect, the first buffer is segmented at the intersection.
- When they do not intersect, the system continues to acquire raw interaction data between the user and the dialogue model until an intersection appears.
- If no intersection exists when the dialogue ends, the target compressed sequence in the first buffer is segmented at the similarity boundary.
- If there is no similarity boundary either, the first buffer is segmented at the attention boundary.

[0024] Step S130: Perform summary generation on the semantic aggregated fragment data to obtain structured data. For example, summary generation is performed by a short-term memory module, which includes a large language model among other components.

The semantic aggregated fragment data is input into the short-term memory module, which pairs each user input with the dialogue model's response to form multiple turn pairs. The large language model generates topic data for the semantic aggregated fragment data of these turns. For example, one topic may cover several turn pairs, each containing the semantic aggregated fragment data of a user input and the corresponding model response.

Index data is determined from the topic data and the turn pairs and stored in a second buffer. When the index data in the second buffer reaches a preset number (e.g., 512 tokens), core information is extracted from it to obtain summary data. The summary data is then embedded to obtain a summary embedding vector. The index data, summary data, and summary embedding vector together form the structured data.
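The turn-pair and second-buffer logic in [0024] can be sketched as follows. This sketch assumes rounds arrive as `(role, text)` tuples. It measures buffer size in whitespace-separated words as a rough proxy for tokens, and flushes once the preset size is reached. The names `to_turn_pairs` and `SecondBuffer` are hypothetical. Topic generation by the large language model is not shown.

```python
from typing import List, Optional, Tuple

TurnPair = Tuple[str, str]  # (user input, dialogue-model response)


def to_turn_pairs(rounds: List[Tuple[str, str]]) -> List[TurnPair]:
    """Pair each user round with the model round that follows it."""
    pairs: List[TurnPair] = []
    pending_user: Optional[str] = None
    for role, text in rounds:
        if role == "user":
            pending_user = text  # a later user round replaces an unanswered one
        elif role == "model" and pending_user is not None:
            pairs.append((pending_user, text))
            pending_user = None
    return pairs


class SecondBuffer:
    """Holds (topic, turn pairs) index entries until a preset size is reached."""

    def __init__(self, capacity_tokens: int = 512):
        self.capacity = capacity_tokens
        self.items: List[Tuple[str, List[TurnPair]]] = []
        self.size = 0

    def add(self, topic: str, pairs: List[TurnPair]):
        """Add one index entry; return the whole batch for summarization when full."""
        self.items.append((topic, pairs))
        self.size += sum(len(u.split()) + len(m.split()) for u, m in pairs)
        if self.size >= self.capacity:
            batch, self.items, self.size = self.items, [], 0
            return batch
        return None
```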

[0025] Step S140: Update and maintain the structured data to obtain the target structured data used for retrieval generation. For example, update and maintenance are performed by a long-term memory module. The structured data is input into this module, which stores it with its time information in a memory bank.

When a preset trigger instruction is received (for example, a scheduled task that runs at 1 a.m. every day), the structured data is updated offline to obtain an update sequence. The update sequence and the structured data are then maintained through deduplication, merging, and discarding to obtain the target structured data. The offline update may include similarity calculation, among other operations.
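The text names deduplication, merging, and discarding but gives no criteria for them. The sketch below applies a hypothetical policy to candidate pairs ordered by similarity:
- Pairs at or above `dup_sim` are treated as duplicates, and the older entry is discarded.
- Pairs between `merge_sim` and `dup_sim` are merged into the newer entry.

Both thresholds and the record layout (`summary`, `time`) are assumptions.

```python
from typing import Dict, List, Tuple


def maintain(records: List[Dict], pairs: List[Tuple[int, int, float]],
             dup_sim: float = 0.95, merge_sim: float = 0.8) -> List[Dict]:
    """records: memory entries with 'summary' and 'time' keys.
    pairs: candidate (i, j, similarity) tuples, most similar first."""
    alive = {i: dict(r) for i, r in enumerate(records)}  # copies; input untouched
    for i, j, sim in pairs:
        if i not in alive or j not in alive:
            continue  # one side was already discarded or merged away
        newer, older = (i, j) if alive[i]["time"] >= alive[j]["time"] else (j, i)
        if sim >= dup_sim:
            del alive[older]  # deduplicate: discard the older near-duplicate
        elif sim >= merge_sim:
            # Merge: fold the older summary into the newer entry, then drop the older one.
            alive[newer]["summary"] = alive[older]["summary"] + " " + alive[newer]["summary"]
            del alive[older]
    return [alive[k] for k in sorted(alive)]
```

Processing the most similar pairs first means that an entry removed as a duplicate is never merged again later in the same pass.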

[0026] In the above embodiments, the method follows a progressive flow: "raw interaction data acquisition → pre-compression and topic segmentation and aggregation → semantic fragment summary generation → structured data update and maintenance". This flow efficiently purifies massive, unstructured dialogue data and reconstructs it as knowledge. It addresses the inherent defects of raw interaction data, such as high noise, redundancy, and semantic dispersion. Low-quality corpora become topic-focused, semantically concise structured knowledge fragments, which improves the data quality and semantic density of the knowledge base in retrieval generation tasks.

The target structured data also has clear topic boundaries and summary representations, so it can be matched against user query intent more accurately and quickly. This improves the response speed, recall accuracy, and result relevance of the retrieval system. Structured storage and dynamic maintenance keep the knowledge base timely, consistent, and scalable, ultimately improving the overall performance of the retrieval generation system at the data source.

Inspired by cognition, the sensory memory module rapidly filters and selects dialogue content. Its "lightweight pre-compression + hybrid topic segmentation" solution filters redundant data and aggregates semantic topics. This provides high-quality input for downstream memory processing while reducing computational overhead.

[0027] Regarding step S120, this application provides a possible implementation of the sensory memory module. In one example, the raw interaction data includes token data. Pre-compression and topic segmentation and aggregation of the raw interaction data then includes:
- classifying the token data based on a pre-trained compression model to obtain a target compressed sequence, and storing it in a first buffer;
- when the target compressed sequence in the first buffer reaches a preset capacity, performing attention calculation on it to obtain an attention boundary, and similarity calculation on it to obtain a similarity boundary;
- when the first buffer has not reached the preset capacity, acquiring more target compressed data;
- performing topic segmentation and aggregation based on the attention boundary and the similarity boundary to obtain semantic aggregated fragment data.

[0028] Specifically, a compression model such as LLMLingua-2 obtains retention probability data and a probability distribution through classification. Conditional entropy data is derived from the probability distribution. Filtering on the conditional entropy data and the retention probability data determines the target compressed sequence, which is stored in the first buffer.

When the target compressed sequence in the first buffer reaches a preset capacity (e.g., 512 tokens), topic segmentation is performed on it. Attention and similarity calculations on the sequence yield two kinds of boundaries:
- attention boundaries, which reflect semantic changes;
- similarity boundaries, which reflect changes in dialogue structure or tone between rounds, such as in literal expression, word frequency, or syntactic structure.

Topic segmentation based on these boundaries then produces the semantic aggregated fragment data.
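[0007] and [0028] derive the attention boundary from a round-level attention matrix and the similarity boundary from round embeddings. They do not say how a boundary is read off either one. The sketch below assumes simple thresholding of adjacent rounds. The names `round_attention_matrix`, `attention_boundaries`, and `similarity_boundaries`, and both thresholds, are hypothetical. The sketch also omits the step in [0007] that conditions the similarity calculation on the attention boundary.

```python
import numpy as np


def round_attention_matrix(token_attn: np.ndarray, spans: list) -> np.ndarray:
    """Aggregate token-level attention (T x T, e.g. averaged over the heads of a
    high-level layer) into rounds: A[i, j] is the mean attention from the tokens
    of round i to the tokens of round j. Each span (start, end) must be non-empty."""
    n = len(spans)
    A = np.zeros((n, n))
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            A[i, j] = token_attn[si:ei, sj:ej].mean()
    return A


def attention_boundaries(A: np.ndarray, threshold: float) -> set:
    """Hypothetical rule: a boundary lies before round k when round k
    attends to round k-1 less than `threshold`."""
    return {k for k in range(1, A.shape[0]) if A[k, k - 1] < threshold}


def similarity_boundaries(vecs: np.ndarray, threshold: float) -> set:
    """Hypothetical rule: a boundary lies before round k when the cosine
    similarity of the embeddings of rounds k-1 and k is below `threshold`."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    cos = np.sum(unit[1:] * unit[:-1], axis=1)  # cos[k-1] = sim(round k-1, round k)
    return {k for k in range(1, len(vecs)) if cos[k - 1] < threshold}
```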

[0029] For example, raw interaction data is acquired from scenarios such as dialogues. This data may be token data containing user inputs and dialogue model responses, as shown in Table 1.

Table 1 Examples of raw interaction data

[0030] The compressed, valid content (target compressed data) is stored in the sensory memory buffer (first buffer), which temporarily holds information awaiting processing. The buffer has a fixed capacity of 512 tokens. This matches the context window of LLMLingua-2, so attention is always computed over a complete window. If the buffer has not reached capacity, it continues to receive new target compressed data. Once capacity is reached, topic segmentation and aggregation are triggered.
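The first buffer in [0030] can be sketched minimally as follows. This sketch assumes tokens are plain strings. It also assumes that, after segmentation, the consumed prefix is removed while any unsegmented tail stays for the next window. The names `SensoryBuffer`, `push`, and `consume` are hypothetical.

```python
from typing import List, Optional


class SensoryBuffer:
    """First buffer: holds compressed tokens until a fixed capacity is reached."""

    def __init__(self, capacity: int = 512):
        self.capacity = capacity
        self.tokens: List[str] = []

    def push(self, compressed: List[str]) -> Optional[List[str]]:
        """Append newly compressed tokens. When the capacity is reached, return one
        full window for topic segmentation; otherwise return None."""
        self.tokens.extend(compressed)
        if len(self.tokens) >= self.capacity:
            return self.tokens[: self.capacity]
        return None

    def consume(self, n: int) -> None:
        """Drop the first n tokens once they have been segmented into fragments."""
        del self.tokens[:n]
```

Capping the returned window at `capacity` keeps each segmentation pass within the compression model's 512-token context, as [0030] requires. Any overflow simply waits in the buffer.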

[0031] In one example, classifying the token data with the pre-trained compression model to obtain the target compressed sequence includes the following steps:
- Classify the token data with the pre-trained compression model to obtain an original score vector for each token.
- Normalize the original score vector to obtain the probability distribution and retention probability data of each token.
- Perform a first filtering process on the token data, based on a first preset threshold and the retention probability data, to obtain a first compressed sequence. The first preset threshold is determined from a preset compression ratio.
- Calculate conditional entropy from the probability distribution and real label data to obtain conditional entropy data. The real label data is the encoded label of each token in the first compressed sequence.
- Perform a second filtering process on the first compressed sequence, based on a second preset threshold and the conditional entropy data, to obtain the target compressed sequence.
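The two filtering stages of [0031], detailed in [0032] to [0035], can be sketched with NumPy. The sketch assumes the compression model's per-token logits are already available as a (T, 2) array ordered [discard, retain]. It makes two literal readings of the source, which a deployment may need to change:
- Entropy: the cross-entropy of formula (2) against a one-hot "retain" label reduces to -ln p_retain (natural log assumed). The second stage therefore keeps tokens whose retention probability is below e^-0.8 ≈ 0.45. If a different entropy is intended, only the `entropy` line changes.
- Quantile: τ is the r-quantile of the ascending probabilities, as in [0034], so r = 0.7 keeps roughly the top 30%. Use `1 - r` instead if r is meant as the fraction kept.

The names `softmax_rows` and `two_stage_filter` are hypothetical.

```python
import numpy as np


def softmax_rows(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax over [discard, retain] logits, as in formula (1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def two_stage_filter(tokens, logits, r=0.7, entropy_threshold=0.8):
    probs = softmax_rows(np.asarray(logits, dtype=float))  # shape (T, 2)
    p_retain = probs[:, 1]
    # First filter: tau is the r-quantile of all retention probabilities.
    tau = np.quantile(p_retain, r)
    first = [i for i in range(len(tokens)) if p_retain[i] > tau]
    # Second filter: cross-entropy against a one-hot "retain" label,
    # which reduces to -ln p_retain for each surviving token.
    entropy = -np.log(p_retain[first])
    return [tokens[i] for i, h in zip(first, entropy) if h > entropy_threshold]
```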

[0032] Specifically, LLMLingua-2 treats "whether to retain the token" as a binary classification task (label 0 = discard, 1 = retain). It computes the retention probability from the logit vector output by the model. The core logic is as follows. For each token $x_i$, the model outputs a logit vector $z_i = [z_{i,0}, z_{i,1}]$ (the original score vector), where $z_{i,0}$ is the unnormalized score for "discard" and $z_{i,1}$ is the unnormalized score for "retain". Normalizing the original score vector with the softmax function yields the retention probability, as shown in formula (1), where $e$ is the natural constant (approximately 2.718). The discard probability is obtained in the same way, and the probability distribution consists of the discard and retention probabilities:

$$p_{\text{retain}}(x_i) = \frac{e^{z_{i,1}}}{e^{z_{i,0}} + e^{z_{i,1}}}, \qquad p_{\text{discard}}(x_i) = 1 - p_{\text{retain}}(x_i), \qquad P(x_i) = \left[p_{\text{discard}}(x_i),\ p_{\text{retain}}(x_i)\right] \tag{1}$$

LLMLingua-2 has a pre-trained ability to recognize redundant tokens. It generally assigns a retention probability below 0.3 to interjections (such as "ya" and "oh") and repeated conjunctions (such as "and also"). It assigns a retention probability above 0.7 to core information (such as "Scenic Spot B" and "Scenic Spot D"). Examples are shown in Table 2.

Table 2 Examples of Retention Probabilities

[0033] Only tokens whose retention probability is higher than τ (the first preset threshold) are retained, so that redundant information is removed. The first preset threshold τ is set to the r quantile of the retention scores of all tokens, where r is the compression ratio (the preset compressed data); for example, with r = 0.7, τ is the critical value of the top 70% of retention probabilities.

[0034] For example, collect the retention probabilities of all tokens in the current input sequence (token data), sort them in ascending order, take τ = the "r quantile" of the retention probability sequence, and only retain tokens with retention probabilities greater than τ to obtain the first compressed sequence.
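The quantile filter of [0033]-[0034] can be sketched as follows, with linear interpolation between order statistics as an assumed quantile definition (the source does not specify one). The parameter `q` is the quantile level: following [0034] literally (ascending sort, keep tokens above the r quantile) gives `q = r`, which keeps roughly the top (1 - r) share of tokens, whereas reading "top 70%" in [0033] as the retained share corresponds to `q = 1 - r`. The token names and probabilities are made up for illustration.

```python
def quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of values sorted ascending."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def first_filter(tokens, keep_probs, q):
    """Keep tokens whose retention probability is strictly greater than tau."""
    tau = quantile(keep_probs, q)
    kept = [(t, p) for t, p in zip(tokens, keep_probs) if p > tau]
    return tau, kept


# Hypothetical retention probabilities for ten tokens t0..t9
probs = [0.05, 0.9, 0.2, 0.75, 0.1, 0.8, 0.95, 0.25, 0.6, 0.4]
tokens = [f"t{i}" for i in range(len(probs))]
tau, kept = first_filter(tokens, probs, q=0.3)
print(round(tau, 3), [t for t, _ in kept])
```

With `q=0.3` seven of the ten tokens survive; with `q=0.7` only three do, which is why the choice of reading matters in practice.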

[0035] Conditional entropy is calculated based on the probability distribution and the real label data to obtain the conditional entropy data. The real label data is the encoded data corresponding to the token data. For example, a token retained in the first compressed sequence has the real label "retain", and its encoded data is [1, 0]: labels are one-hot encoded, with the value at the retained position set to 1 and all other positions set to 0. The specific process is as follows: the cross-entropy (conditional entropy) between the model's predicted distribution (the probability distribution output by the pre-trained compression model) and the real token label (the real label data) is calculated as shown in formula (2):

$$H_i = -\sum_{x \in \{\text{keep},\, \text{drop}\}} q_i(x)\, \log p_i(x) \qquad (2)$$

where $q_i(x)$ (the real label data) is the one-hot distribution of the real label of the i-th token and $p_i(x)$ is the probability distribution predicted for that token, i.e., $p_i(\text{keep}) = p_i^{\text{keep}}$ and $p_i(\text{drop}) = p_i^{\text{drop}}$. Filtering is then performed with a second preset threshold (e.g., 0.8): the token data in the first compressed sequence whose conditional entropy data is greater than 0.8 is retained. High-entropy tokens carry strong semantic uncertainty and are crucial for memory construction. Examples of conditional entropy are shown in Table 3 (Examples of Conditional Entropy), and the resulting target compressed sequence is shown in Table 4.
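A sketch of the second filtering step, assuming the natural logarithm in formula (2) (the source does not state the base) and the label order [retain, discard], so that a retained token's one-hot label is [1, 0]. With that label the sum reduces to -ln p_keep, so a cut-off of 0.8 keeps only tokens whose retention probability is below about e^-0.8 ≈ 0.45; the sketch follows the rule exactly as stated. Function names are hypothetical.

```python
import math


def cross_entropy(q, p, eps=1e-12):
    """H(q, p) = -sum_x q(x) * ln p(x); eps guards against log(0)."""
    return -sum(qx * math.log(max(px, eps)) for qx, px in zip(q, p))


def second_filter(first_seq, threshold=0.8):
    """first_seq: (token, [p_keep, p_drop]) pairs that survived the first filter.

    Every such token carries the real label "retain", one-hot encoded as [1, 0].
    Tokens whose cross-entropy exceeds the threshold are kept.
    """
    retain_label = [1.0, 0.0]
    result = []
    for token, dist in first_seq:
        h = cross_entropy(retain_label, dist)
        if h > threshold:
            result.append((token, h))
    return result


print(second_filter([("a", [0.9, 0.1]), ("b", [0.4, 0.6])]))
```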

[0036] Table 4 Examples of Target Compressed Sequences

[0037] In one example, the pre-trained compression model includes a high-level attention layer, and the original interaction data includes N rounds of data, where the i-th and j-th rounds of data are user data or dialogue model data, i = 1, 2, ..., N, j = 1, 2, ..., N. Performing attention calculation on the target compressed data to obtain the attention boundary includes: obtaining the attention data of the N rounds of data after processing by the high-level attention layer; performing pairwise attention processing on the i-th and j-th rounds of data based on the attention data to obtain attention score data; obtaining the attention matrix based on the attention score data; and determining the attention boundary based on the attention matrix. Specifically, as shown in Table 4, the user and the large language model engage in 5 dialogue rounds (N = 5), so the constructed attention matrix has dimension "number of rounds × number of rounds" (5 × 5 in the example).

[0038] The attention data of the N rounds of data is taken from the high-level attention layers (layers 8-11) of LLMLingua-2, which avoids the local noise of the low-level attention. For any two rounds $s_i$ and $s_j$, token-level pairwise attention is calculated first: for each token in round $s_i$, its attention to each token in round $s_j$ is taken (self-attention when i = j, cross-attention otherwise), giving multiple attention scores per token. Averaging these scores yields the attention score between rounds $s_i$ and $s_j$, which is normalized to obtain the attention score data for $s_i$ and $s_j$. The attention score data for all other pairs of rounds, including the within-round self-attention scores, is obtained in the same way, and the attention matrix is assembled from the round-level attention score data.

[0039] Note that when calculating attention scores, the first three and last three tokens of each round can be masked to avoid attention traps. An attention trap (also known as an attention sink) occurs when certain tokens in the sequence, such as the first and last tokens of a dialogue sequence, attract a disproportionate share of attention during the attention calculation even though they carry no core semantic value and are not reasonable targets of attention. This interferes with normal judgments of attention association, wastes attention on meaningless tokens, and consequently degrades the accuracy of downstream steps such as topic segmentation and memory unit construction.
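The round-level attention matrix of [0038]-[0039] might be computed as below. This is a sketch under several assumptions: `attn` is a token-by-token attention matrix already averaged over the heads of the high-level layers, each round is given as a half-open token span, and "normalizing" is taken to mean dividing each row by its sum. Rounds too short to mask keep all of their tokens. None of these names come from LLMLingua-2.

```python
def round_attention_matrix(attn, spans, mask_edge=3):
    """Average token-level attention into an N x N round-level matrix.

    attn[a][b]: attention weight from token a to token b (assumed already averaged
    over the heads of the high-level layers); spans: one half-open (start, end)
    token range per round.
    """

    def unmasked(span):
        start, end = span
        idx = list(range(start, end))
        # mask the first and last `mask_edge` tokens of the round (attention traps);
        # if the round is too short, fall back to all of its tokens
        inner = idx[mask_edge:len(idx) - mask_edge]
        return inner or idx

    n = len(spans)
    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        rows = unmasked(spans[i])
        for j in range(n):
            cols = unmasked(spans[j])
            scores = [attn[a][b] for a in rows for b in cols]
            raw[i][j] = sum(scores) / len(scores)

    # normalize each row to sum to 1 (the normalization scheme is an assumption)
    matrix = []
    for row in raw:
        total = sum(row)
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix


# Two rounds of 8 tokens each with uniform attention
uniform = [[1.0] * 16 for _ in range(16)]
print(round_attention_matrix(uniform, [(0, 8), (8, 16)]))
```

Because the edge tokens are excluded before averaging, a token that soaks up attention at the start or end of a round cannot inflate the round-level score.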

[0040] Because there are 5 dialogue rounds (N = 5), the core sequence of interest is the sub-diagonal $[M_{2,1}, M_{3,2}, M_{4,3}, M_{5,4}]$, whose elements give the association strength of rounds 1-2, 2-3, 3-4, and 4-5, respectively. $M_{2,1}$ represents the attention (attention score data) of the current round 2 to the previous round 1, and the other elements are defined likewise. Partial attention score data from the attention matrix is shown in Table 5 (Examples of Partial Attention Score Data from the Attention Matrix).

[0041] A sub-diagonal element $M_{k,k-1}$, which represents the attention (attention score data) of the current round k to the previous round k-1, is considered a "local maximum" if it satisfies both of the following conditions. Condition 1: it is greater than the previous adjacent sub-diagonal element, $M_{k,k-1} > M_{k-1,k-2}$ (the association strength of the current pair of adjacent rounds exceeds that of the previous pair). Condition 2: it is greater than the next adjacent sub-diagonal element, $M_{k,k-1} > M_{k+1,k}$ (the association strength of the current pair of adjacent rounds exceeds that of the next pair). In the example above, $M_{4,3} = 0.91$ is a local maximum, corresponding to the round pair {4/3}, so the attention boundary B1 is {4/3}, i.e., the boundary between round 3 and round 4.
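A sketch of the local-maximum test on the sub-diagonal, using 0-based Python indices internally and reporting boundaries as 1-based (earlier round, later round) pairs. Both conditions need a neighbour on each side, so only interior sub-diagonal entries are tested. Apart from 0.91 for M_{4,3}, the matrix values in the example are invented for illustration, and the helper `with_subdiagonal` is hypothetical.

```python
def attention_boundaries(M):
    """Return 1-based round pairs (k-1, k) where M[k][k-1] is a strict local maximum
    of the sub-diagonal (Conditions 1 and 2 of [0041])."""
    # sub[t] = M[t+1][t] (0-based) = M_{t+2,t+1} (1-based): rounds t+1 and t+2
    sub = [M[k][k - 1] for k in range(1, len(M))]
    bounds = []
    for t in range(1, len(sub) - 1):
        if sub[t] > sub[t - 1] and sub[t] > sub[t + 1]:
            bounds.append((t + 1, t + 2))
    return bounds


def with_subdiagonal(values):
    """Build an (n+1) x (n+1) matrix whose sub-diagonal holds `values` (others zero)."""
    n = len(values) + 1
    M = [[0.0] * n for _ in range(n)]
    for k, v in enumerate(values, start=1):
        M[k][k - 1] = v
    return M


# Hypothetical 5-round example: only M_{4,3} = 0.91 comes from the text
M = with_subdiagonal([0.55, 0.62, 0.91, 0.70])
print(attention_boundaries(M))  # boundary between rounds 3 and 4
```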

[0042] And/or, similarity calculation is performed on the target compressed sequence to obtain the similarity boundary, including: embedding the N rounds of data to obtain the original interaction vectors; calculating, based on the attention boundary, the similarity of the original interaction vectors corresponding to the N rounds of data to obtain similarity data; and determining the similarity boundary based on the similarity data. Specifically, an embedding model (such as text-embedding-v2) converts the dialogue rounds (the N rounds of data) into vectors to obtain the original interaction vectors. The sentence similarity of adjacent rounds near the attention boundary B1 is then calculated, and each position where the similarity falls below a preset threshold (e.g., 0.4) is marked as a similarity boundary B2, as shown in Table 6 (Similarity Data Examples).

[0043] Here sim(·,·) denotes the semantic similarity between vectors, and B2 is the set of round positions k satisfying

$$\text{B2} = \{\, k \mid \mathrm{sim}(s_{k-1}, s_k) < \theta \,\}$$

where θ = 0.4 in the example. Since the similarity between rounds 2 and 3 is 0.32 < 0.4, B2 = {3}.
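A sketch of the similarity-boundary rule, assuming cosine similarity as sim(·,·) (the source does not name the measure) and embedding vectors supplied by some embedding model; the vectors below are toy values, not text-embedding-v2 output. The optional `candidates` argument stands in for restricting the check to rounds near B1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_boundaries(vectors, theta=0.4, candidates=None):
    """vectors[k-1] embeds round k (1-based). Return every k with
    sim(s_{k-1}, s_k) < theta, optionally only for k in `candidates`."""
    ks = candidates if candidates is not None else range(2, len(vectors) + 1)
    return [k for k in ks if cosine(vectors[k - 2], vectors[k - 1]) < theta]


# Toy embeddings for 5 rounds: the topic shifts between rounds 2 and 3
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]
print(similarity_boundaries(vectors))
```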

[0044] In one example, topic segmentation and aggregation are performed based on the attention boundary and the similarity boundary to obtain semantic aggregated fragment data. This includes: determining the intersection data of the attention boundary and the similarity boundary; when the intersection data exists, segmenting the first buffer using the intersection data as the segmentation boundary to obtain the semantic aggregated fragment data; when the intersection data does not exist, performing attention and similarity calculations on the target compressed data of newly acquired original interaction data to obtain new attention and similarity boundaries, and hence their intersection data; when the original interaction data ends and no intersection of the attention and similarity boundaries exists, segmenting the first buffer using the similarity boundary as the segmentation boundary; and, when the similarity boundary does not exist either, segmenting the first buffer using the attention boundary as the segmentation boundary.

[0045] Specifically, the intersection of B1 and B2 is used as the topic boundary, which ensures semantic coherence within each segmented fragment and a clear topic distinction between fragments. In the example, B1 and B2 do not intersect. When there is no intersection, the dialogue continues to accumulate until a new intersecting round pair appears, at which the dialogue structure or tone changes and the semantics also change, yielding new attention and similarity boundaries. When the original interaction data ends and the attention and similarity boundaries still do not intersect, B2 is used as the boundary result; if B2 does not exist, B1 is used instead. The first buffer is then segmented at the boundary result to obtain the semantic aggregated fragment data, which is passed directly to the short-term memory module as non-redundant, highly semantically related input for subsequent structured processing.
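The boundary-selection rule of [0044]-[0045] can be sketched as follows. Boundaries are encoded as the later round index k (a cut between round k-1 and round k), so B1 = {4/3} becomes {4} and B2 = {3} stays {3}; returning `None` means "keep accumulating rounds". The behaviour when neither boundary exists is not described in the source; here the buffer is assumed to stay as one fragment. Both function names are hypothetical.

```python
def choose_boundaries(b1, b2, dialogue_ended):
    """b1, b2: sets of cut positions k (between round k-1 and round k).
    Returns sorted cut positions, or None to wait for more rounds."""
    both = b1 & b2
    if both:
        return sorted(both)  # preferred: attention and similarity agree
    if not dialogue_ended:
        return None  # no intersection yet: accumulate rounds, recompute B1 and B2
    if b2:
        return sorted(b2)  # dialogue over: fall back to the similarity boundary
    if b1:
        return sorted(b1)  # then to the attention boundary
    return []  # assumption: no boundary at all, keep the buffer as one fragment


def split_rounds(rounds, cuts):
    """Split a list of rounds (round k is rounds[k-1]) at each cut position k."""
    pieces, start = [], 0
    for k in cuts:
        pieces.append(rounds[start:k - 1])
        start = k - 1
    pieces.append(rounds[start:])
    return [p for p in pieces if p]


rounds = ["r1", "r2", "r3", "r4", "r5"]
print(choose_boundaries({4}, {3}, dialogue_ended=False))  # None: keep waiting
print(split_rounds(rounds, choose_boundaries({4}, {3}, dialogue_ended=True)))
```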

[0046] The short-term memory module uses a "topic-round indexing + threshold-triggered summary generation" technical solution to balance the efficiency of memory construction (reducing API (Application Programming Interface) calls) and accuracy (avoiding topic confusion) to generate structured memory entries.

[0047] Regarding step S130, this application provides a possible implementation of the short-term memory module. In one example, the semantic aggregated fragment data is summarized to obtain structured data. This includes: obtaining topic data based on the semantic aggregated fragment data; obtaining index data for each semantic aggregated fragment based on the topic data and the fragment data, and storing the index data in a second buffer; when the index data in the second buffer reaches a preset number, summarizing the index data to obtain structured data; and, when the index data in the second buffer has not reached the preset number, continuing to acquire index data, where the preset number is determined based on the compressed data.

[0048] Specifically, the semantic aggregated fragment data output by the sensory memory module after pre-compression and topic segmentation is obtained, and a "topic data-dialogue turn pair" index is constructed, with a basic index (index data) built for each semantic aggregated fragment. Each fragment corresponds to {topic data, dialogue turn pairs}, where the topic data is generated automatically by the large language model. With the topic data of the fragment as the core key, all dialogue turn pairs within the fragment (i.e., the one-to-one correspondence between user inputs and dialogue model responses) are associated with it, so that the semantics of each turn are bound to the topic and semantic fragmentation in subsequent processing is avoided. The index data format is shown in Table 7 (Example of Index Data).

[0049] The index data of "topic data - dialogue turn pairs" is stored in the short-term memory buffer (second buffer). The trigger threshold th (preset quantity) of the buffer is the token quantity threshold (configurable, such as 512 tokens). The threshold th needs to be dynamically adjusted in conjunction with the compression ratio r (compressed data) of the sensory memory module. When r=0.7, th=512 is optimal, and when r=0.8, th=1024 is optimal, balancing efficiency and accuracy. If the number of tokens in the second buffer does not reach th, new index data continues to be temporarily stored; if th is reached, the summary generation process is triggered.
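A sketch of the threshold-triggered short-term buffer. The (r, th) pairs come from the text; falling back to 512 for other ratios is an assumption, as are the class names and the idea that each index entry carries a precomputed token count.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class IndexEntry:
    topic: str                          # topic data (the core key)
    turn_pairs: List[Tuple[str, str]]   # (user input, model response) pairs
    n_tokens: int                       # token count of this entry (assumed precomputed)


def threshold_for_ratio(r: float) -> int:
    # pairs stated in the text: r = 0.7 -> 512, r = 0.8 -> 1024;
    # the 512 fallback for other ratios is an assumption
    return {0.7: 512, 0.8: 1024}.get(r, 512)


@dataclass
class ShortTermBuffer:
    th: int
    entries: List[IndexEntry] = field(default_factory=list)

    def add(self, entry: IndexEntry) -> Optional[List[IndexEntry]]:
        """Buffer an index entry; once th tokens are held, flush and return the batch."""
        self.entries.append(entry)
        if sum(e.n_tokens for e in self.entries) >= self.th:
            batch, self.entries = self.entries, []
            return batch  # hand this batch to summary generation
        return None


buf = ShortTermBuffer(th=threshold_for_ratio(0.7))
print(buf.add(IndexEntry("trip", [("Go to B?", "Sure.")], 300)))  # below th: None
print(len(buf.add(IndexEntry("hotel", [("Book one?", "Done.")], 250))))  # 550 >= 512: 2
```

Batching by token count rather than by fragment is what reduces the number of summarizer calls: one call covers every entry accumulated since the last flush.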

[0050] In one example, the index data is processed to generate a summary, resulting in structured data. This process includes: extracting core information from the index data to obtain summary data, embedding the summary data to obtain a summary embedding vector, and then generating structured data based on the index data, summary data, and summary embedding vector.

[0051] Specifically, an LLM (such as deepseek-r1) is used as the summarization model to generate a summary sum_i (summary data) for each "topic data-dialogue turn pair" combination. It is important to note that when inputting, only a subset of non-empty dialogue turns is selected (i.e., empty turns without valid content are filtered out) to avoid invalid summaries caused by empty input. The goal of summarization generation is to extract the core information in the turns (such as user needs and key model responses) to ensure that the summary can represent the turn semantics under the topic.

[0052] For each summary sum_i, embedding is performed to generate a "summary embedding vector" (used for similarity retrieval in the long-term memory module), ultimately forming a structured memory entry (structured data): containing "topic data" (related semantic classification), "summary embedding vector" (supporting subsequent retrieval), "original user input (pre-compressed user input)", and "original model response (pre-compressed model response)" (preserving details and avoiding loss of key information in the summary).

[0053] The structured memory entries (structured data) are passed directly to the long-term memory module as the basic unit of long-term storage, where memory entry = {topic data, summary data, summary embedding vector, pre-compressed user input, pre-compressed model response}, as shown in Table 8 (Examples of Structured Data).
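A sketch of how a structured memory entry could be assembled. `summarize` and `embed` are placeholders for the LLM summarizer (e.g., deepseek-r1) and an embedding model; their signatures are hypothetical. Dropping a turn pair only when both sides are blank, and joining the original turns into single strings, are assumptions about details the text leaves open.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class MemoryEntry:
    topic: str
    summary: str
    summary_embedding: List[float]
    user_input: str       # original (pre-compression) user input
    model_response: str   # original (pre-compression) model response


def build_entry(
    topic: str,
    turn_pairs: Sequence[Tuple[str, str]],
    summarize: Callable[[str, List[Tuple[str, str]]], str],
    embed: Callable[[str], List[float]],
) -> Optional[MemoryEntry]:
    # filter out empty turns so the summarizer never sees empty input
    non_empty = [(u, m) for u, m in turn_pairs if u.strip() or m.strip()]
    if not non_empty:
        return None
    summary = summarize(topic, non_empty)
    return MemoryEntry(
        topic=topic,
        summary=summary,
        summary_embedding=embed(summary),
        user_input="\n".join(u for u, _ in non_empty),
        model_response="\n".join(m for _, m in non_empty),
    )


# Stand-in summarizer and embedder for demonstration only
entry = build_entry(
    "travel plan",
    [("Is Scenic Spot B open tomorrow?", "Yes, from 9 AM."), ("", "")],
    summarize=lambda topic, pairs: f"{topic}: {len(pairs)} turn(s)",
    embed=lambda text: [float(len(text))],
)
print(entry)
```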

[0054] The long-term memory module uses a technical solution of "soft update during testing + offline parallel update" to decouple memory maintenance from online inference, reducing latency, while ensuring memory accuracy through timestamps and similarity constraints.

[0055] Regarding step S140, this application provides a possible implementation of the long-term memory module. In one example, the structured data is updated and maintained to obtain target structured data for retrieval. This includes: acquiring time information for the structured data and storing the structured data together with the time information in memory; in response to a preset trigger command, performing offline update processing on the structured data to obtain an update sequence; and performing deduplication, merging, and discarding on the update sequence and the structured data to obtain the target structured data for retrieval, then updating the memory with the target structured data so that online inference can use the target structured data corresponding to the topic data and the summary embedding vector. Specifically, after receiving the structured memory entries (structured data) output by the short-term memory module, the long-term memory module performs a "soft update": it attaches to each memory entry a precise timestamp (time information) recording when the entry was generated and received, and inserts the memory entry (each item of the structured data) together with its time information directly into a long-term memory vector database such as Milvus, without any time-consuming search, comparison, recombination, or merging.
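A minimal sketch of the online "soft update", with an in-memory list standing in for the vector database; `LongTermStore` and `soft_insert` are hypothetical names and do not reflect the Milvus client API. The point it shows is that the online path only timestamps and appends.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StoredEntry:
    entry: dict
    timestamp: float  # time information recorded at insertion


@dataclass
class LongTermStore:
    """In-memory stand-in for a vector database such as Milvus (not the Milvus API)."""
    rows: List[StoredEntry] = field(default_factory=list)

    def soft_insert(self, entry: dict, now: Optional[float] = None) -> float:
        # online path: timestamp and append only, with no search, comparison, or merge
        ts = time.time() if now is None else now
        self.rows.append(StoredEntry(entry, ts))
        return ts


store = LongTermStore()
store.soft_insert({"topic": "trip", "summary": "plans a visit to Scenic Spot B"})
print(len(store.rows))
```

Even an exact duplicate is appended at this stage; cleaning it up is deferred to the offline maintenance pass.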

[0056] Upon receiving a preset update trigger signal (trigger command) (such as a scheduled task at 1 AM daily), a relevant memory set queue that needs to be reorganized is determined for each entry in the long-term memory module (for each entry, the relevant memory set queue consists of all other entries inserted after the current entry's insertion time). The structured data is then processed offline based on the relevant memory set queue to obtain the update sequence. This offline update processing is executed in parallel with online inference, without consuming interactive resources.

[0057] Based on the update sequence, three maintenance operations are performed on each memory entry (each item of the structured data) to keep the long-term memory bank accurate and lightweight: (1) Deduplication: delete entries in the update sequence that are semantically identical to the current entry (such as two identical user preference records) to reduce redundancy in the bank; (2) Merging: merge into the current entry any information in the update sequence that is related to it but does not conflict with it (for example, if the user first mentions "planning to travel to area a" and then asks about "transportation in area b", these are merged into "planning to travel to area a + transportation information in area b"), avoiding information fragmentation; (3) Forgetting (discarding): discard entries in the update sequence that are old and semantically irrelevant (such as irrelevant dialogue records from several months ago), which controls the capacity of the long-term memory bank and improves the efficiency of subsequent retrieval. The final target structured data is thus obtained.
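The three maintenance operations leave their decision criteria open. One possible sketch, assuming a similarity score between each update-sequence candidate and the current entry plus the candidate's age, uses hypothetical thresholds (0.95 for duplicates, 0.6 for related, 90 days for old). It does not check for conflicts before merging, which a real implementation would need.

```python
DAY = 86400.0  # seconds


def maintenance_action(sim, age_seconds, dup_sim=0.95, rel_sim=0.6, max_age=90 * DAY):
    """Classify one update-sequence candidate relative to the current entry.

    sim: summary-embedding similarity to the current entry; age_seconds: time since
    the candidate was inserted. All thresholds are hypothetical.
    """
    if sim >= dup_sim:
        return "dedupe"   # semantically identical: delete the candidate
    if sim >= rel_sim:
        return "merge"    # related: fold its information into the current entry
    if age_seconds > max_age:
        return "forget"   # old and semantically irrelevant: discard
    return "keep"         # unrelated but recent: leave untouched


for sim, age in [(0.97, 0), (0.7, 0), (0.1, 200 * DAY), (0.1, DAY)]:
    print(sim, age, maintenance_action(sim, age))
```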

[0058] After all entries have been maintained, the long-term memory is updated. During subsequent online inference, relevant memory entries (structured data) can be located quickly through "topic data matching + summary embedding vector similarity retrieval", providing long-term context support for the LLM.
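Online retrieval by "topic data matching + summary embedding vector similarity" might look like this sketch, which assumes exact topic-string matching, cosine similarity, and a fallback to the whole memory when no topic matches; all three are assumptions, and the memory records are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query_topic, query_vec, memory, top_n=3):
    """memory: dicts with 'topic' and 'embedding' keys."""
    # 1) topic data matching; fall back to all entries if nothing matches
    pool = [m for m in memory if m["topic"] == query_topic] or memory
    # 2) rank the remaining entries by summary-embedding similarity
    return sorted(pool, key=lambda m: cosine(query_vec, m["embedding"]), reverse=True)[:top_n]


memory = [
    {"id": 1, "topic": "travel", "embedding": [1.0, 0.0]},
    {"id": 2, "topic": "travel", "embedding": [0.0, 1.0]},
    {"id": 3, "topic": "food", "embedding": [1.0, 0.1]},
]
print([m["id"] for m in retrieve("travel", [0.9, 0.1], memory, top_n=1)])
```

Filtering by topic first shrinks the candidate set before the more expensive similarity ranking.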

[0059] In one example, the structured data consists of multiple memory data points. In response to a preset trigger command, the structured data is updated offline to obtain an update sequence, which includes: determining the related memory data for each memory data point based on time information; and performing parallel similarity calculations based on the summary embedding vectors corresponding to each memory data point and the related memory data points to obtain the update sequence.

[0060] Specifically, the offline update processing of the structured data proceeds as follows. For each memory entry (memory data) in the long-term memory bank, an "update sequence" is calculated according to the following selection rule: the relevant memory set queue to be reorganized is determined for the entry, consisting of all other memory data inserted after the current entry's insertion time; the "summary embedding vector similarity" between the current entry and each entry in this queue is calculated; and the top-k (e.g., top-5) most similar entries are selected as the update sequence. Restricting candidates to later entries prevents old information from overwriting new information. Because the update sequences of different entries are independent of one another, they are computed in parallel with multiple threads, which significantly reduces the total offline update time.
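A sketch of the per-entry update-sequence computation, assuming entries are (timestamp, summary embedding) pairs and cosine similarity. A `ThreadPoolExecutor` mirrors the "independent per entry, computed in parallel" structure; note that in CPython pure-Python arithmetic is serialized by the GIL, so real speed-ups would need a process pool or a vectorized similarity library.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def update_sequence(idx, entries, k=5):
    """Indices of the k entries inserted after entries[idx] with the most similar
    summary embeddings; entries are (timestamp, embedding) pairs."""
    ts, vec = entries[idx]
    later = [j for j, (t, _) in enumerate(entries) if t > ts]  # only newer entries
    later.sort(key=lambda j: cosine(vec, entries[j][1]), reverse=True)
    return later[:k]


def all_update_sequences(entries, k=5, workers=4):
    # each entry's sequence is independent, so they can be computed concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: update_sequence(i, entries, k), range(len(entries))))


entries = [(1.0, [1.0, 0.0]), (2.0, [1.0, 0.1]), (3.0, [0.0, 1.0]), (4.0, [0.9, 0.1])]
print(all_update_sequences(entries, k=2))
```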

[0061] Figure 2 is a flowchart of the method for pre-compressing and topic segmentation/aggregation of the original interaction data provided in this application. As shown in Figure 2, the method includes steps S201-S211.

[0062] S201, Obtain raw interaction data.

[0063] For example, the interaction data (original content) generated as the user interacts with the dialogue model is segmented according to a preset vocabulary, for example by a tokenizer, to obtain the token data corresponding to the interaction data, and this token data is used as the original interaction data. The dialogue model is, for example, a large language model. The user interacts with the large language model in rounds: for example, one user input is one round, and the large language model's response is another round.

[0064] S202, Call the LLMLingua-2 model for pre-compression.

[0065] For example, the pre-compression is implemented with a pre-trained compression model such as LLMLingua-2.

[0066] S203, calculate the retention probability of each token.

[0067] For example, LLMLingua-2 treats "whether to retain the token" as a binary classification task (label: 0 = discard, 1 = retain). The retention probability data is calculated from the logit vector output by the higher layers of the model, and the discard probability data is obtained similarly; the probability distribution is then obtained from the retention and discard probability data.

S204, Set a dynamic threshold.

For example, the dynamic threshold (first preset threshold) τ is the r quantile of the retention scores of all tokens, where r is the compression ratio (the preset compressed data); for example, with r = 0.7, τ is the critical value of the top 70% of retention probabilities.

S205, Filter the tokens by retention probability against the dynamic threshold to generate a preliminary compressed sequence.

[0068] For example, only tokens with a probability higher than the dynamic threshold (first preset threshold) are retained, and redundant information is removed to obtain a preliminary compressed sequence (first compressed sequence).

[0069] S206, perform cross-entropy filtering on the preliminary compressed sequence to obtain the target compressed data.

[0070] For example, conditional entropy data is obtained by calculating the conditional entropy based on the probability distribution and the real label data, and the token data in the first compressed sequence whose conditional entropy is greater than 0.8 is retained to obtain the target compressed sequence.

S207, Store the target compressed sequence in the sensory memory buffer.

[0071] For example, the target compressed data is stored in a first buffer (sensory memory buffer).

[0072] S208, Determine whether the sensory memory buffer has reached its maximum capacity. If yes, proceed to step S209; otherwise, continue to acquire target compressed data.

[0073] For example, when the target compressed data in the first buffer reaches a preset capacity (the maximum capacity, e.g., 512 tokens), topic segmentation is performed on the target compressed sequence. If the sensory memory buffer has not reached the maximum capacity, new target compressed data continues to be received.

[0074] S209, perform topic segmentation on the sensory memory buffer to obtain the attention boundary and similarity boundary.

[0075] For example, by performing attention and similarity calculations on the target compressed sequence, attention boundaries (semantic changes) and similarity boundaries (changes in dialogue structure or tone, such as literal expression, word frequency, or syntactic structure between rounds) can be obtained.

[0076] S210, determine the final boundary based on the attention boundary and the similarity boundary.

[0077] For example, the intersection of attention boundary B1 and similarity boundary B2 is taken as the topic boundary (final boundary), which ensures semantic coherence within each segmented fragment and a clear topic distinction between fragments. When there is no intersection, the dialogue continues to accumulate until a new intersecting round pair appears, at which the dialogue structure or tone changes and the semantics also change, yielding new attention and similarity boundaries. When the original interaction data ends and the attention and similarity boundaries still do not intersect, B2 is used as the boundary result; if B2 does not exist, B1 is used instead. The first buffer is then segmented at the boundary result to obtain the semantic aggregated fragment data.

S211, Output the topic-aggregated semantic fragments to the short-term memory module.

[0078] For example, the semantic aggregated fragment data (the topic-aggregated semantic fragments) is passed directly to the short-term memory module, providing non-redundant, highly semantically related input for subsequent structured processing. In the above embodiments, the sensory memory module performs "pre-compression and redundancy removal + topic segmentation" on the original input, quickly filtering out irrelevant information and grouping the rest by semantics, which reduces the burden on subsequent memory processing. Low-value tokens are eliminated by a lightweight compression model, and topic boundaries are then determined by combining attention with semantic similarity to form structured fragments, a cognitively inspired form of rapid filtering and selection of dialogue content.

[0079] Figure 3 is a flowchart of the method for generating summaries from semantic aggregated fragment data provided in this application. As shown in Figure 3, the method includes steps S301-S310.

[0080] S301, Obtain the topic semantic fragments output by the sensory memory module.

[0081] For example, the semantic aggregate fragment data (topic semantic fragment) output by the sensory memory module is obtained after pre-compression and topic segmentation.

[0082] S302, construct the topic-dialogue turn pair index data to obtain the index data.

[0083] For example, a topic data-dialogue turn pair index is constructed, and a basic index (index data) is constructed for each semantic aggregate fragment data. Each semantic aggregate fragment data corresponds to: {topic data, dialogue turn pair} (“topic data” is automatically generated by the large language model). The “topic data” corresponding to the semantic aggregate fragment data is used as the core key to associate all “dialogue turn pairs” (i.e., the one-to-one correspondence between user input and dialogue model response) within the fragment.

[0084] S303, store the index data into the short-term memory buffer.

[0085] For example, the "topic data-dialogue turn pair" index data is stored in a short-term memory buffer (second buffer).

[0086] S304, determine whether the number of tokens in the short-term memory buffer has reached the threshold. If yes, proceed to S305; otherwise, continue to retrieve index data.

[0087] For example, the trigger threshold th (preset quantity) of the buffer is the token quantity threshold (configurable, such as 512 tokens); the threshold th needs to be dynamically adjusted in conjunction with the compression ratio r (compressed data) of the sensory memory module. When r=0.7, th=512 is optimal, and when r=0.8, th=1024 is optimal, balancing efficiency and accuracy. If the number of tokens in the second buffer does not reach th, new index data continues to be temporarily stored; if th is reached, proceed to S305.

[0088] S305, Call the LLM to generate a summary (summary data), and embed the summary to obtain a summary embedding vector.

[0089] For example, an LLM (such as deepseek-r1) is called as the summarization model to generate a summary sum_i (summary data) for each "topic data-dialogue turn pair" combination. Each summary sum_i is then embedded to generate a "summary embedding vector" (used for similarity retrieval in the long-term memory module).

[0090] S306, Construct memory entries.

[0091] For example, the final structured memory entries (structured data) include "topic data" (associated semantic classification), "summary embedding vector" (supporting subsequent retrieval), "raw user input (pre-compressed user input)", and "raw model response (pre-compressed model response)" (preserving details and avoiding loss of key information in the summary).

[0092] S307, Output memory entry.

[0093] For example, structured memory entries (structured data) are passed directly to the long-term memory module as the basic unit of long-term storage, where memory entry = {topic data, summary data, summary embedding vector, pre-compressed user input, pre-compressed model response}.

[0094] In the above embodiments, the short-term memory module uses a technical solution of "topic-round indexing + threshold-triggered summary generation" to balance the efficiency (reducing API calls) and accuracy (avoiding topic confusion) of memory construction, generate structured memory entries, and achieve efficient integration and refinement of topic awareness.

[0095] Figure 4 is a flowchart of the method for updating and maintaining structured data provided in this application. As shown in Figure 4, the method includes steps S401-S409.

[0096] S401, Obtain the structured memory entries output by the short-term memory module.

[0097] For example, the structured memory entries (structured data) output by the short-term memory module are obtained.

[0098] S402, a soft update is performed during testing.

[0099] For example, a "soft update" is performed: a precise timestamp (time information) recording when the memory entry was generated and received is attached to the entry, and the entry (each item of the structured data) is inserted together with its time information directly into a long-term memory vector database such as Milvus, without any time-consuming search, comparison, recombination, or merging.

[0100] S403, insert the memory entry into the long-term memory bank, along with a timestamp.

[0101] S404, continue online reasoning.

[0102] For example, offline update processing and online inference are executed in parallel without consuming interactive resources (non-blocking).

[0103] S405, determine whether all entries have been inserted or a trigger signal has arrived. If yes, proceed to S406; otherwise, continue to retrieve memory entries.

[0104] For example, when a preset update trigger signal (trigger instruction) is received (such as a scheduled task at 1 a.m. every day), the process proceeds to S406, or when all memory entries have been inserted, the process proceeds to S406.

[0105] S406, Perform offline parallel updates.

[0106] For example, a relevant memory set queue that needs to be reorganized is determined for each entry in the long-term memory module (for each entry, the relevant memory set queue is all other entries after the current entry's insertion time), and offline parallel updates are performed (updates during sleep).

[0107] S407, calculate the update queue for each memory entry.

[0108] For example, the structured data is updated offline based on the relevant memory set queue to obtain the update sequence: the "summary embedding vector similarity" between the current entry and all other entries inserted after it is calculated, and the top-k (e.g., top-5) most similar entries are selected as the update sequence, which prevents old information from overwriting new information. Because the update sequences of different entries are independent of one another, multi-threaded parallel computing is used to significantly reduce the total offline update time.

[0109] S408, perform memory maintenance operation.

[0110] For example, based on the update sequence, three maintenance operations are performed on each memory entry (each item of the structured data) to keep the long-term memory bank accurate and lightweight: (1) Deduplication: delete entries in the update sequence that are semantically identical to the current entry (such as two identical user preference records) to reduce redundancy in the bank; (2) Merging: merge into the current entry any information in the update sequence that is related to it but does not conflict with it (for example, if the user first mentions "planning to travel to area a" and then asks about "transportation in area b", these are merged into "planning to travel to area a + transportation information in area b"), avoiding information fragmentation; (3) Forgetting (discarding): discard entries in the update sequence that are old and semantically irrelevant (such as irrelevant dialogue records from several months ago), which controls the capacity of the long-term memory bank and improves the efficiency of subsequent retrieval. The final target structured data is thus obtained.

S409, update the long-term memory bank so that it can support subsequent retrieval and reasoning.

[0112] For example, after all entries have been maintained, the long-term memory bank is updated. During subsequent online inference, relevant memory entries (structured data) can be located quickly through "topic data matching + summary embedding vector similarity retrieval", providing long-range context support for the LLM.
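
The two-step lookup of [0112] might look like the following. The `StoredMemory` record, exact string matching on topics, cosine ranking, and the fallback to the whole bank when no topic matches are assumptions of this sketch rather than details given in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredMemory:
    topic: str
    summary: str
    summary_embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(bank: list[StoredMemory], query_topic: str,
             query_embedding: list[float], top_n: int = 3) -> list[StoredMemory]:
    # Step 1: topic data matching narrows the candidate set.
    candidates = [m for m in bank if m.topic == query_topic]
    if not candidates:
        candidates = list(bank)  # assumption: fall back to the whole bank
    # Step 2: rank the candidates by summary-embedding similarity.
    candidates.sort(key=lambda m: cosine(query_embedding, m.summary_embedding),
                    reverse=True)
    return candidates[:top_n]
```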

[0113] In the above embodiments, the long-term memory module uses a "test-time soft update + offline parallel update" scheme that decouples memory maintenance from online inference, reducing latency, while timestamps and similarity constraints preserve memory accuracy. Test-time soft update: in the online phase, short-term memory entries are merely inserted directly into the long-term memory bank with timestamps, without real-time maintenance, so online inference is not blocked. Offline parallel update: in the offline phase (for example, a scheduled task at 1 AM), an "update queue" is calculated for each memory entry (candidate entries are filtered by semantic similarity and timestamps), and deduplication, merging, and forgetting are performed in parallel to maintain the long-term memory bank, thereby decoupling latency from deep integration.
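
To make the online/offline split of [0113] concrete, the sketch below separates a constant-time append path from a batch maintenance path. `LongTermMemoryBank`, `soft_insert`, `offline_update`, and the injected `maintain` callable are hypothetical names; scheduling (for example, a 1 AM job) is left to whatever task scheduler a deployment uses.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LongTermMemoryBank:
    entries: list[dict] = field(default_factory=list)
    pending: int = 0  # entries soft-inserted since the last offline pass

    def soft_insert(self, record: dict, now: Optional[float] = None) -> None:
        # Online path: attach a timestamp and append. No deduplication,
        # merging or forgetting happens here, so inference is never blocked.
        stamp = time.time() if now is None else now
        self.entries.append({**record, "inserted_at": stamp})
        self.pending += 1

    def offline_update(self, maintain: Callable[[list[dict]], list[dict]]) -> None:
        # Offline path (e.g. triggered by a nightly scheduler): run the
        # expensive maintenance over the whole bank in one batch.
        self.entries = maintain(self.entries)
        self.pending = 0
```

Keeping `maintain` injectable lets the same bank be paired with any offline policy, such as the Top-k queue and deduplicate/merge/forget steps described in [0108] and [0110].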

[0114] The data processing method and system for retrieval generation proposed in this application adopt the following: (1) a lightweight front-end pre-compression method based on information entropy or binary classification: a lightweight model, combined with a mechanism based on token conditional cross-entropy (information content) or binary classification, efficiently filters redundant tokens out of the original input, optimizing the source information at low cost and high efficiency. (2) a hybrid dynamic topic segmentation technique that integrates the attention mechanism and semantic similarity: boundaries found from local peaks of inter-token attention and from an inter-round semantic similarity threshold are intersected to determine topic boundaries dynamically, ensuring that memory entries are constructed from accurate semantic units. (3) a decoupled architecture of online soft update and offline parallel update for long-term memory: a "sleep time" mechanism is introduced, real-time updating is simplified to direct insertion into the LTM (soft update) to ensure low latency, and expensive memory maintenance operations (deduplication, merging, forgetting) are moved to offline parallel execution, resolving the tension between real-time performance and deep processing in long-term memory systems.

[0115] Figure 5 is a block diagram of a data processing system for retrieval generation provided by another embodiment of this application. As shown in Figure 5, another embodiment of this application provides a data processing system 500 for retrieval generation, which includes an acquisition module 510, a first processing module 520, a second processing module 530, and an obtaining module 540.

[0116] The acquisition module 510 is used to acquire the raw interaction data generated by the user's dialogue interaction with the dialogue model.

[0117] The first processing module 520 is used to perform pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantically aggregated fragment data.

[0118] The second processing module 530 is used to perform summary generation on the semantically aggregated fragment data to obtain structured data.

[0119] The obtaining module 540 is used to update and maintain the structured data to obtain target structured data for retrieval.

[0120] For example, the raw interaction data includes token data. The first processing module 520 is further configured to: classify the token data based on a pre-trained compression model to obtain a target compressed sequence, and store it as target compressed data in a first buffer; when the target compressed data in the first buffer reaches a preset capacity, perform attention calculation on the target compressed data to obtain an attention boundary, and perform similarity calculation on the target compressed sequence to obtain a similarity boundary; when the target compressed data in the first buffer has not reached the preset capacity, continue acquiring target compressed data; and perform topic segmentation and aggregation based on the attention boundary and the similarity boundary to obtain semantically aggregated fragment data.

For example, the first processing module 520 is further configured to: classify the token data based on the pre-trained compression model to obtain an original score vector corresponding to the token data; normalize the original score vector to obtain a probability distribution and retention probability data corresponding to the token data; perform a first filtering process on the token data based on a first preset threshold and the retention probability data to obtain a first compressed sequence, wherein the first preset threshold is determined based on preset compressed data; calculate conditional entropy based on the probability distribution and real label data to obtain conditional entropy data, wherein the real label data is the encoded data corresponding to the token data of the first compressed sequence; and perform a second filtering process on the first compressed sequence based on a second preset threshold and the conditional entropy data to obtain the target compressed data.
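
The two-stage filter described above can be sketched as follows. The text does not fix the exact outputs of the compression model, so this sketch assumes two per token: a keep/drop logit pair, whose softmax gives the retention probability, and next-token logits over a vocabulary, from which the information content -log p(actual token | context) stands in for the "conditional entropy" against the real label data. Deriving the first threshold from a target keep ratio (`threshold_for_ratio`) is likewise an assumed reading of "determined based on preset compressed data"; `tau2` is a hypothetical second threshold.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def threshold_for_ratio(values: list[float], keep_ratio: float) -> float:
    # Hypothetical rule: choose tau1 so that about keep_ratio of tokens pass.
    ranked = sorted(values, reverse=True)
    k = max(1, math.ceil(keep_ratio * len(ranked)))
    return ranked[k - 1]


def compress(tokens: list[str], keep_logits: list[list[float]],
             vocab_logits: list[list[float]], token_ids: list[int],
             keep_ratio: float = 0.5, tau2: float = 1.0) -> list[str]:
    # Stage 1: binary keep/drop head gives a retention probability per token.
    retention = [softmax(pair)[1] for pair in keep_logits]  # index 1 = "keep"
    tau1 = threshold_for_ratio(retention, keep_ratio)
    first = [i for i, p in enumerate(retention) if p >= tau1]

    # Stage 2: information content -log p(actual token | context). Tokens the
    # model predicts easily carry little information and are dropped.
    target = []
    for i in first:
        dist = softmax(vocab_logits[i])
        info = -math.log(dist[token_ids[i]])
        if info >= tau2:
            target.append(tokens[i])
    return target
```

In this reading, stage 1 enforces the compression budget cheaply, and stage 2 only needs to score the survivors.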
For example, the pre-trained compression model includes a high-level attention layer, and the raw interaction data includes N rounds of data, wherein the i-th round data and the j-th round data include user data or dialogue model data, i = 1, 2, ..., N, j = 1, 2, ..., N. The first processing module 520 is further configured to: acquire attention data of the N rounds of data after processing by the high-level attention layer; perform pairwise attention processing on the i-th round data and the j-th round data based on the attention data to obtain attention score data; obtain an attention matrix based on the attention score data; and determine the attention boundary based on the attention matrix; and/or embed the N rounds of data to obtain original interaction vectors; calculate, based on the attention boundary, the similarity of the original interaction vectors corresponding to the N rounds of data to obtain similarity data; and determine the similarity boundary based on the similarity data.

For example, the second processing module 530 is further configured to: obtain topic data based on the semantically aggregated fragment data; obtain index data corresponding to each piece of semantically aggregated fragment data based on the topic data and the semantically aggregated fragment data, and store the index data in a second buffer; when the index data in the second buffer reaches a preset number, perform summary generation on the index data to obtain structured data; and when the index data in the second buffer has not reached the preset number, continue acquiring index data, wherein the preset number is determined based on the compressed data.
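
The attention boundary and similarity boundary described above can be sketched as below, under assumptions the text leaves open: round-level attention is the mean token-to-token weight from one high-level layer with heads already averaged; the attention boundary before round i is a local peak of 1 - A[i][i-1] (round i paying little attention to round i-1); and the similarity boundary is a cosine similarity between consecutive round embeddings that falls below a hypothetical `threshold`. For simplicity, similarity is computed for every consecutive pair rather than only around attention boundaries.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def round_attention(token_attn: list[list[float]], round_ids: list[int],
                    n_rounds: int) -> list[list[float]]:
    """Average token-level attention into an n_rounds x n_rounds matrix.

    token_attn[q][k] is the weight token q puts on token k in one high-level
    layer; round_ids[t] is the dialogue round that token t belongs to.
    """
    sums = [[0.0] * n_rounds for _ in range(n_rounds)]
    counts = [[0] * n_rounds for _ in range(n_rounds)]
    for q, row in enumerate(token_attn):
        for k, w in enumerate(row):
            i, j = round_ids[q], round_ids[k]
            sums[i][j] += w
            counts[i][j] += 1
    return [[s / c if c else 0.0 for s, c in zip(srow, crow)]
            for srow, crow in zip(sums, counts)]


def attention_boundaries(A: list[list[float]]) -> list[int]:
    # Boundary score before round i: how little round i attends to round i-1.
    score = [1.0 - A[i][i - 1] for i in range(1, len(A))]
    bounds = []
    for idx, s in enumerate(score):
        left = score[idx - 1] if idx > 0 else -math.inf
        right = score[idx + 1] if idx + 1 < len(score) else -math.inf
        if s > left and s > right:  # local peak of the boundary score
            bounds.append(idx + 1)  # cut before round idx + 1
    return bounds


def similarity_boundaries(embeddings: list[list[float]],
                          threshold: float = 0.5) -> list[int]:
    # Cut before round i when consecutive round embeddings are dissimilar.
    return [i for i in range(1, len(embeddings))
            if cosine(embeddings[i - 1], embeddings[i]) < threshold]
```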

[0121] For example, the second processing module 530 is further configured to perform core information extraction processing on the index data to obtain summary data, and to perform embedding processing on the summary data to obtain a summary embedding vector; and to obtain structured data based on the index data, summary data and summary embedding vector.
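
The second-buffer flow of [0120] and [0121] (collect index data until a preset number is reached, then summarize, embed, and emit structured data) can be sketched as follows. `summarize` and `embed` are injected, hypothetical callables standing in for an LLM summarizer and an embedding model, and grouping fragments by topic is an assumption about how the index data is organized.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StructuredRecord:
    topic: str                      # topic data
    fragments: list[str]            # index data grouped under the topic
    summary: str                    # summary data
    summary_embedding: list[float]  # summary embedding vector


class SummaryBuffer:
    """Second buffer: accumulate (topic, fragment) index data, summarize in batches."""

    def __init__(self, capacity: int,
                 summarize: Callable[[list[str]], str],
                 embed: Callable[[str], list[float]]):
        self.capacity = capacity    # the "preset number"
        self.summarize = summarize  # hypothetical summarization call
        self.embed = embed          # hypothetical embedding call
        self.items: list[tuple[str, str]] = []

    def add(self, topic: str, fragment: str) -> list[StructuredRecord]:
        self.items.append((topic, fragment))
        if len(self.items) < self.capacity:
            return []               # below the preset number: keep collecting
        return self.flush()

    def flush(self) -> list[StructuredRecord]:
        by_topic: dict[str, list[str]] = {}
        for topic, fragment in self.items:
            by_topic.setdefault(topic, []).append(fragment)
        self.items = []
        records = []
        for topic, fragments in by_topic.items():
            summary = self.summarize(fragments)
            records.append(StructuredRecord(topic, fragments, summary, self.embed(summary)))
        return records
```

Batching amortizes the cost of summarization calls, which is the apparent motivation for the preset number in the text.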

[0122] For example, the obtaining module 540 is further configured to: obtain time information of the structured data, and store the structured data and the time information in the memory bank; in response to receiving a preset trigger instruction, perform offline update processing on the structured data to obtain an update sequence; and perform deduplication, merging, and discarding on the update sequence and the structured data to obtain the target structured data for retrieval, and update the memory bank based on the target structured data, so that online inference can be performed based on the target structured data corresponding to the topic data and the summary embedding vector. For example, the structured data includes multiple pieces of memory data; the obtaining module 540 is further configured to determine the related memory data of each piece of memory data based on the time information, and to perform similarity calculation in parallel based on the summary embedding vectors corresponding to each piece of memory data and its related memory data to obtain the update sequence.

[0123] For example, the first processing module 520 is further configured to determine the intersection of the attention boundary and the similarity boundary. When the intersection exists, the first buffer is segmented using the intersection as the segmentation boundary to obtain semantically aggregated fragment data. When the intersection does not exist, attention calculation and similarity calculation are performed on the target compressed data of newly acquired raw interaction data to obtain new attention boundaries and similarity boundaries, from which the intersection is determined again. When the raw interaction data ends and no intersection of the attention boundary and the similarity boundary exists, the first buffer is segmented using the similarity boundary as the segmentation boundary to obtain semantically aggregated fragment data; if the similarity boundary does not exist either, the first buffer is segmented using the attention boundary as the segmentation boundary to obtain semantically aggregated fragment data.

Figure 6 is a block diagram of an electronic device provided by another embodiment of this application.
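
The boundary-selection rules of [0123] reduce to a small decision function. This sketch returns the cut positions together with an action; `choose_boundaries` and `split` are hypothetical helper names, and the `"no_cut"` outcome, for when neither boundary exists after the data ends, is an assumption because the text does not cover that case.

```python
def choose_boundaries(attn_bounds: list[int], sim_bounds: list[int],
                      stream_ended: bool) -> tuple[list[int], str]:
    """Return (boundaries, action) for the first buffer.

    action is "segment" (cut now), "wait" (acquire more interaction data
    and recompute both boundary sets) or "no_cut".
    """
    both = sorted(set(attn_bounds) & set(sim_bounds))
    if both:
        return both, "segment"                 # intersection exists
    if not stream_ended:
        return [], "wait"                      # no intersection yet: keep reading
    if sim_bounds:
        return sorted(sim_bounds), "segment"   # data ended: use similarity boundary
    if attn_bounds:
        return sorted(attn_bounds), "segment"  # no similarity boundary: use attention
    return [], "no_cut"                        # assumption: keep buffer as one fragment


def split(buffer: list, boundaries: list[int]) -> list[list]:
    # Cut the buffer before each boundary index.
    cuts = [0, *boundaries, len(buffer)]
    return [buffer[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]
```

For example, `split(["r0", "r1", "r2", "r3"], [2])` yields two fragments of two rounds each.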

[0124] Another embodiment of this application provides an electronic device having a computer program stored thereon, which, when executed by a processor, implements the steps of the method of any of the above embodiments.

[0125] As shown in Figure 6, for ease of understanding, the embodiments of this application illustrate a specific electronic device 600.

[0126] Electronic device 600 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic device 600 may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0127] As shown in Figure 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the electronic device 600. The computing unit 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

[0128] Multiple components in the electronic device 600 are connected to the input/output (I/O) interface 605, including: an input unit 606, such as a keyboard or mouse; an output unit 607, such as various types of displays or speakers; a storage unit 608, such as a magnetic disk or optical disc; and a communication unit 609, such as a network interface card (NIC), modem, or wireless transceiver. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0129] The computing unit 601 can be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods described above. For example, in some embodiments, any one or more of the various methods described above can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program can be loaded and/or installed on the electronic device 600 via ROM 602 and/or communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of any one or more of the various methods described above can be performed. Alternatively, in other embodiments, the computing unit 601 can be configured to perform any one or more of the various methods described above by any other suitable means (e.g., by means of firmware).

[0130] This application provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the method in any of the above embodiments.

[0131] It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this application, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and portable compact disc read-only memory (CD-ROM). Furthermore, a computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.

[0132] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0133] In the description of this application, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this application, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0134] In the description of this application, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc., indicating the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application.

[0135] Furthermore, the terms "first," "second," etc., used in the embodiments of this application are for descriptive purposes only and should not be construed as indicating or implying relative importance, or implicitly specifying the number of technical features indicated in this embodiment. Therefore, features defined with terms such as "first" and "second" in the embodiments of this application can explicitly or implicitly indicate that the embodiment includes at least one of those features. In the description of this application, the word "multiple" means at least two, such as two, three, four, etc., unless otherwise explicitly and specifically defined in the embodiments.

[0136] In this application, unless otherwise explicitly specified or limited in the embodiments, the terms "installation," "connection," "joining," and "fixing" appearing in the embodiments should be interpreted broadly. For example, a connection can be a fixed connection, a detachable connection, or an integral part; it can also be a mechanical connection, an electrical connection, etc. Of course, it can also be a direct connection, or an indirect connection through an intermediate medium, or it can be the internal communication between two components, or the interaction between two components. Those skilled in the art can understand the specific meaning of the above terms in this application based on the specific implementation.

[0137] In this application, unless otherwise expressly specified and limited, a first feature being "above" or "below" a second feature can mean that the first feature is in direct contact with the second feature, or that the first feature is in indirect contact with the second feature through an intermediate medium. Furthermore, a first feature being "above," "on top of," or "over" a second feature can mean that the first feature is directly above or diagonally above the second feature, or simply that the first feature is at a higher horizontal level than the second feature. A first feature being "below," "beneath," or "under" a second feature can mean that the first feature is directly below or diagonally below the second feature, or simply that the first feature is at a lower horizontal level than the second feature.

[0138] Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of this application.

Claims

1. A data processing method for retrieval generation, characterized in that the method includes: acquiring raw interaction data generated by a user's dialogue interaction with a dialogue model; performing pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantically aggregated fragment data; performing summary generation on the semantically aggregated fragment data to obtain structured data; and updating and maintaining the structured data to obtain target structured data for retrieval.

2. The method of claim 1, wherein the raw interaction data includes token data, and performing pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantically aggregated fragment data includes: classifying the token data based on a pre-trained compression model to obtain a target compressed sequence, and storing it as target compressed data in a first buffer; when the target compressed data in the first buffer reaches a preset capacity, performing attention calculation on the target compressed data to obtain an attention boundary, and performing similarity calculation on the target compressed sequence to obtain a similarity boundary; when the target compressed data in the first buffer has not reached the preset capacity, continuing to acquire target compressed data; and performing topic segmentation and aggregation based on the attention boundary and the similarity boundary to obtain the semantically aggregated fragment data.

3. The method of claim 2, wherein classifying the token data based on the pre-trained compression model to obtain the target compressed sequence includes: classifying the token data based on the pre-trained compression model to obtain an original score vector corresponding to the token data; normalizing the original score vector to obtain a probability distribution and retention probability data corresponding to the token data; performing a first filtering process on the token data based on a first preset threshold and the retention probability data to obtain a first compressed sequence, wherein the first preset threshold is determined based on preset compressed data; calculating conditional entropy based on the probability distribution and real label data to obtain conditional entropy data, wherein the real label data is the encoded data corresponding to the token data of the first compressed sequence; and performing a second filtering process on the first compressed sequence based on a second preset threshold and the conditional entropy data to obtain the target compressed data.

4. The method of claim 2, wherein the pre-trained compression model includes a high-level attention layer, and the raw interaction data includes N rounds of data, wherein the i-th round data and the j-th round data include the user data or the dialogue model data, i = 1, 2, ..., N, j = 1, 2, ..., N; performing attention calculation on the target compressed data to obtain the attention boundary includes: obtaining attention data of the N rounds of data after processing by the high-level attention layer; performing pairwise attention processing on the i-th round data and the j-th round data based on the attention data to obtain attention score data; obtaining an attention matrix based on the attention score data; and determining the attention boundary based on the attention matrix; and/or performing similarity calculation on the target compressed sequence to obtain the similarity boundary includes: embedding the N rounds of data to obtain original interaction vectors; calculating, based on the attention boundary, the similarity of the original interaction vectors corresponding to the N rounds of data to obtain similarity data; and determining the similarity boundary based on the similarity data.

5. The method of claim 3, wherein performing summary generation on the semantically aggregated fragment data to obtain structured data includes: obtaining topic data based on the semantically aggregated fragment data; obtaining, based on the topic data and the semantically aggregated fragment data, index data corresponding to each piece of semantically aggregated fragment data, and storing the index data in a second buffer; when the index data in the second buffer reaches a preset quantity, performing summary generation on the index data to obtain the structured data; and when the index data in the second buffer has not reached the preset quantity, continuing to acquire index data, wherein the preset quantity is determined based on the compressed data.

6. The method according to claim 5, characterized in that performing summary generation on the index data to obtain structured data includes: extracting core information from the index data to obtain summary data, and embedding the summary data to obtain a summary embedding vector; and obtaining the structured data based on the index data, the summary data, and the summary embedding vector.

7. The method according to claim 6, characterized in that updating and maintaining the structured data to obtain the target structured data for retrieval includes: obtaining time information of the structured data, and storing the structured data and the time information in a memory bank; in response to receiving a preset trigger command, performing an offline update on the structured data to obtain an update sequence; and performing deduplication, merging, and discarding based on the update sequence and the structured data to obtain the target structured data for retrieval, and updating the memory bank based on the target structured data, so that online inference can be performed based on the target structured data corresponding to the topic data and the summary embedding vector.

8. The method according to claim 7, characterized in that the structured data includes multiple pieces of memory data, and performing the offline update on the structured data in response to receiving the preset trigger command to obtain the update sequence includes: determining, based on the time information, related memory data for each piece of memory data; and performing similarity calculation in parallel based on the summary embedding vectors corresponding to each piece of memory data and its related memory data to obtain the update sequence.

9. The method according to claim 2, characterized in that performing topic segmentation and aggregation based on the attention boundary and the similarity boundary to obtain semantically aggregated fragment data includes: determining intersection data of the attention boundary and the similarity boundary; when the intersection data exists, segmenting the first buffer using the intersection data as the segmentation boundary to obtain the semantically aggregated fragment data; when the intersection data does not exist, performing attention calculation and similarity calculation on the target compressed data of newly acquired raw interaction data to obtain new attention boundaries and similarity boundaries, so as to obtain intersection data of the attention boundaries and the similarity boundaries; when the raw interaction data ends and no intersection data of the attention boundaries and the similarity boundaries exists, segmenting the first buffer using the similarity boundaries as the segmentation boundaries to obtain the semantically aggregated fragment data; and when the similarity boundaries do not exist, segmenting the first buffer using the attention boundaries as the segmentation boundaries to obtain the semantically aggregated fragment data.

10. A data processing system for retrieval generation, characterized in that the system includes: an acquisition module, used to acquire raw interaction data generated by a user's dialogue interaction with a dialogue model; a first processing module, used to perform pre-compression and topic segmentation and aggregation on the raw interaction data to obtain semantically aggregated fragment data; a second processing module, used to perform summary generation on the semantically aggregated fragment data to obtain structured data; and an obtaining module, used to update and maintain the structured data to obtain target structured data for retrieval.