A structured-intent-aware-memory-based dialogue state tracking method

By constructing a three-layer memory architecture and designing weighted label density retrieval, an intent confidence decay and rebound mechanism, and a conflict-aware prediction strategy, the method addresses intent understanding in long dialogue scenarios and improves intent accuracy and retrieval quality.

CN122196137A · Pending · Publication Date: 2026-06-12 · TIANJIN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: TIANJIN UNIV
Filing Date: 2026-04-01
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies in long dialogue scenarios suffer from problems such as flattened memory organization, reliance on continuous similarity matching for retrieval mechanisms, lack of explicit modeling of dynamic changes in intent activity, and lack of posterior verification in the prediction process, leading to challenges in intent understanding.

Method used

A three-layer hierarchical memory architecture is constructed, comprising a working memory layer, a session memory layer, and a long-term profile layer. Weighted label density retrieval and an intent confidence decay and rebound mechanism are adopted. A conflict-aware two-stage prediction strategy is designed, and the dialogue state is managed through structured intent-aware memory.

Benefits of technology

It improves intent accuracy in long dialogue scenarios, solves cross-intent false recall and format errors, significantly improves retrieval quality and prediction accuracy, and approaches the performance of supervised methods.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a dialogue state tracking method based on structured intent-aware memory. The method receives the current user utterance and the historical dialogue record and comprises the following steps: constructing a three-layer hierarchical memory architecture comprising a working memory layer, a session memory layer and a long-term profile layer; labeling the current dialogue turn with event types, the event types including demand expression, information provision, confirmation feedback, clarification request and topic change; performing weighted label density retrieval, in which, given the intent, event type and topic scope of the current query, each memory entry is discretely matched in three dimensions, the matching score of the intent dimension is weighted by the decayed confidence of the intent, and the top-k memory entries are taken in descending order of label density; maintaining, through an intent confidence decay and rebound mechanism, a confidence for each recognized intent that changes from round to round and rebounds when the user mentions the intent again; and generating the dialogue state with a conflict-aware two-stage prediction strategy.

Description

Technical Field

[0001] This invention belongs to the fields of artificial intelligence, natural language processing, and dialogue systems. Specifically, it relates to a dialogue state tracking method for task-oriented dialogue systems, which is particularly suitable for understanding user intent in long dialogue scenarios involving multiple turns of interaction.

Background Technology

[0002] Dialogue State Tracking (DST) is a fundamental component of task-oriented dialogue systems. It maintains a structured representation of the dialogue state after each user utterance, containing the user's current intent and mentioned slot information to support downstream dialogue strategy decisions and response generation. With the advancements in large language models for natural language understanding, zero-shot DST methods based on large language models have gradually become a research hotspot. However, directly applying large language models to multi-turn DST still has several shortcomings.

[0003] In terms of context management, existing methods mainly employ two strategies for handling dialogue history. One approach is to concatenate the entire history into the input context. This method suffers from the "Lost in the Middle" phenomenon after more than 10 dialogue rounds, meaning the model's utilization of information in the middle of the context decreases significantly, and key information from earlier rounds is difficult to recall. Experimental data shows that without a memory mechanism, the intent accuracy of a standard large language model decreases from 84.7% to 73.6% as the number of dialogue rounds increases from 5 to over 20 rounds. The other approach uses a vector database to store historical fragments and retrieves them through semantic similarity. This alleviates the input length problem, but vector similarity is not aware of the semantic structure of the dialogue. For example, "book a restaurant" and "cancel a booking" are very close in vector space but represent completely opposite intents. Similarly, "expressing a need" and "confirming feedback" within the same domain cannot be distinguished. A representative recent work, MemGuide, introduces intent alignment signals in memory retrieval, but its intent modeling is coarse-grained, only performing semantic matching between intent and description, and the intent signal only applies to the retrieval stage and does not permeate the memory storage process.

[0004] In terms of intent modeling, user intent is not static in multi-turn dialogues. A user might switch from "checking train information" to "booking a hotel," and then back to "confirming train information," demonstrating a lifecycle of intent from emergence to dormancy and re-emergence. However, existing general memory frameworks, such as MemGPT's page scheduling, MemoryBank's forgetting curve decay, ProMem's self-questioning retrieval, or Zep's temporal knowledge graph, organize memory based on data volume, temporal order, graph structure, or reinforcement learning strategies. The decay granularity targets the entire memory item or knowledge triple, without tracking the activity state of the intent itself. For example, if a user asks "check train information" in round 3, shifts to hotel discussion in rounds 4-9, and says "confirm train information again" in round 10, these methods cannot perceive the dormancy and revival of the "train" intent, and the ranking of related memories in retrieval cannot be adjusted in a timely manner.

[0005] Regarding prediction strategies, existing methods mostly employ end-to-end direct prediction, where a large language model outputs both intent and slot values in a single call. In zero-shot settings, this approach is prone to output format errors or field confusion, and a misjudged intent propagates directly to slot extraction, creating cascading errors. More importantly, the prediction process does not check whether its results are plausible. Suppose the model misjudges intent because of keyword ambiguity: a user discusses hotel bookings and then says "confirm the time," and the model assigns the utterance to the train domain based on the keyword "time." The system has no mechanism to identify and correct such anomalous jumps.

[0006] In summary, the shortcomings of existing technologies are concentrated in four aspects: the memory organization is flat and is not managed hierarchically around intent semantics; the retrieval mechanism relies on continuous similarity matching and cannot distinguish between segments with opposite intents or different semantic roles; there is no explicit modeling of the dynamic changes in intent activity; and the prediction process lacks a posterior verification step.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of existing technologies and solve the following problems: how to construct a hierarchical memory architecture organized around intent semantics, so that dialogue information can be managed separately according to time scale and semantic level; how to design a discretized, intent-aware retrieval mechanism that replaces continuous vector similarity matching and avoids false recall across intents; how to track the dynamic changes of activity at the intent level, so that the memory ranking is automatically restored when a dormant intent is brought up again; and how to embed posterior verification in the prediction process to detect and correct abnormal intent jumps. To this end, this invention provides a dialogue state tracking method based on structured intent-aware memory. Drawing on human memory theory from cognitive science, it constructs a hierarchical memory architecture and a precise memory retrieval mechanism to solve the problem of understanding user intent in long dialogue scenarios.

[0008] The objective of this invention is achieved through the following technical solution: a dialogue state tracking method based on structured intent-aware memory, which includes receiving the current user utterance and the historical dialogue record, with the following steps:

A three-layer hierarchical memory architecture is constructed, comprising a working memory layer, a session memory layer, and a long-term profile layer. The working memory layer maintains the immediate context of the current dialogue window and has a limited capacity. The session memory layer uses structured events as storage units, records them as memory entries, and maintains the intent state history. The long-term profile layer records the user's slot preference distribution, the intent occurrence frequency, and the intent transition matrix, and accumulates cross-session behavioral patterns.

The current dialogue is labeled with event types, including demand expression, information provision, confirmation feedback, clarification request, and topic change.

Weighted label density retrieval is performed. Given the intent, event type, and topic scope of the current query, each memory entry is discretely matched in three dimensions. The matching score of the intent dimension is weighted by the decayed confidence of the intent. Finally, the top-k memory entries are selected in descending order of label density.

Based on the intent confidence decay and rebound mechanism, a confidence that changes dynamically from round to round is maintained for each identified intent, and the confidence rebounds when the user mentions the intent again.

A two-stage prediction strategy based on conflict perception is used to generate the dialogue state. In the first stage, the user's intent is confirmed by chain-of-thought reasoning based on the retrieved memory entries. Between the first and second stages, the intent transition matrix in the long-term profile layer is used to perform conflict detection and verification to determine whether the intent jump is plausible. In the second stage, after the intent is confirmed, only the valid slots in the corresponding domain are extracted, and the dialogue state containing the intent and slot values is generated and output.

[0009] Furthermore, the working memory layer retains only the original information from the most recent k rounds of dialogue and updates it using a first-in, first-out (FIFO) strategy. Each memory entry in the session memory layer is a quadruple $e = (c, I_e, E_e, T_e)$, where $c$ is the event content, a summary of the user's utterance; $I_e$ is the intent tag set; $E_e$ is the event type set; and $T_e$ is the topic scope. The session memory layer also maintains an intent state history $\mathcal{H}_t = \{(I_i, \mathrm{conf}_i, \Delta S_i)\}_{i=1}^{t}$, which records the intent $I_i$, the confidence $\mathrm{conf}_i$ and the incremental slot values $\Delta S_i$ of each dialogue round, where $i$ indexes the dialogue rounds and $t$ denotes the t-th (current) dialogue round.

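[0010] Furthermore, in the weighted label density retrieval, the label density $d(q, e)$ is calculated as: $d(q, e) = \mathrm{conf}_t(I_q) \cdot \mathbb{1}[I_q \in I_e] + \mathbb{1}[E_q \cap E_e \neq \emptyset] + \mathbb{1}[T_q = T_e]$, where $I_q$ is the query intent, $E_q$ is the set of query event types, and $T_q$ is the query topic scope; $I_e$ is the intent tag set of the memory entry, $E_e$ is the event type set of the memory entry, and $T_e$ is the topic scope of the memory entry; $\mathbb{1}[\cdot]$ is the indicator function; and $\mathrm{conf}_t(I_q) \in [0, 1]$ is the decayed confidence of the query intent in round $t$.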

[0011] Furthermore, in the aforementioned intent confidence decay and rebound mechanism, the confidence of intent $I$ in round $t$, $\mathrm{conf}_t(I)$, starts from the initial confidence $c_0$ in round $t_0$, the round in which the intent first occurs; it decays exponentially (with base $e$, the mathematical constant) at the decay rate $\lambda$, and it is raised by the rebound increment $\alpha$ for the rounds in $R_I$, the set of rounds in which intent $I$ is revisited non-consecutively. The confidence is clipped to [0.05, 1.0], so the confidence of a dormant intent never falls completely to zero.

[0012] Furthermore, the slot preference distribution includes the historical occurrence frequency of each slot value, the intent occurrence frequency records the number of times each intent occurs, and the intent transition matrix records the historical frequency of transitions between intents.

[0013] Furthermore, the hierarchical judgment process for conflict detection and verification is as follows: determine whether the predicted intent of the current round belongs to the same domain as that of the previous round, or whether the corresponding historical frequency in the intent transition matrix is greater than zero. If yes, trust the currently predicted intent. If no, check whether the utterance contains switching signal words or new domain keywords. If it does, trust the predicted intent; if it contains neither switching signal words nor new domain keywords, call the large language model for secondary confirmation. If the jump is confirmed to be unreasonable, revert to the intent of the previous round.

[0014] The present invention also provides a dialogue state tracking system based on structured intent-aware memory, comprising:

The hierarchical memory module, used to construct a three-layer hierarchical memory architecture comprising a working memory layer, a session memory layer, and a long-term profile layer. The working memory layer maintains the immediate context of the current dialogue window and has a limited capacity. The session memory layer uses structured events as storage units, records them as memory entries, and maintains the intent state history. The long-term profile layer records the user's slot preference distribution, intent occurrence frequency, and intent transition matrix, and accumulates cross-session behavioral patterns.

The event type labeling module, used to label the event type of the current dialogue. The event types include demand expression, information provision, confirmation feedback, clarification request, and topic change.

The label density retrieval module, used to perform weighted label density retrieval. Given the intent, event type and topic scope of the current query, it discretely matches each memory entry in three dimensions. The matching score of the intent dimension is weighted by the decayed confidence of the intent. Finally, the top-k memory entries are selected in descending order of label density.

The confidence decay and rebound module, used to maintain a dynamically changing confidence for each identified intent based on the intent confidence decay and rebound mechanism, and to make the confidence rebound when the user mentions the intent again.

The dialogue generation module, used to generate dialogue states using a two-stage prediction strategy based on conflict perception. In the first stage, the user's intent is confirmed by chain-of-thought reasoning based on the retrieved memory entries. Between the first and second stages, the intent transition matrix in the long-term profile layer is used to perform conflict detection and verification to determine whether the intent jump is plausible. In the second stage, after the intent is confirmed, only the valid slots in the corresponding domain are extracted, and the dialogue state containing the intent and slot values is generated and output.

[0015] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of a dialogue state tracking method based on structured intent-aware memory.

[0016] The present invention also provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of a dialogue state tracking method based on structured intent-aware memory.

[0017] Compared with the prior art, the beneficial effects of the technical solution of the present invention are: 1. This invention employs a three-layer hierarchical memory architecture (working memory layer, conversation memory layer, and long-term profile layer); it alleviates the "lost in the middle" phenomenon that occurs when standard large language models process long dialogues and solves the problem of mutual interference of information. It improves memory management efficiency in long dialogue scenarios, and the intent accuracy in medium-to-long dialogues steadily increases with the number of rounds, and is significantly better than the baseline without memory in ultra-long dialogues.

[0018] 2. This invention implements weighted label density retrieval (combined with structured event types); overcoming the shortcomings of vector retrieval relying on continuous similarity matching, which cannot distinguish between segments with opposite intents or different semantic roles. It completely eliminates the cross-intent false recall phenomenon in vector retrieval, and effectively filters out memories with similar semantics but inconsistent intents through discrete hard constraints, significantly improving retrieval quality.

[0019] 3. This invention introduces an intent confidence decay and rebound mechanism; it solves the problem that existing general memory frameworks cannot track dynamic changes in activity at the intent level; it realizes dynamic tracking of intent lifecycle, so that recently active intents are naturally ranked higher, and re-emerged dormant intents can adaptively rebound to restore their ranking, and can cope with the switching and regression of dialogue intents without additional sorting strategies.

[0020] 4. This invention designs a conflict-aware two-stage prediction strategy. It solves the problems of format errors, cascading errors, and aberrant intent jumps caused by a lack of posterior validation in direct end-to-end prediction. It reduces the prediction difficulty and shrinks the slot search space of the large language model, while correcting aberrant intent jumps through the historical transition patterns accumulated in the long-term profile. Under zero-shot conditions, the method of this invention achieves a joint goal accuracy (JGA) of 52.6% on the MultiWOZ 2.4 dataset, significantly outperforming existing zero-shot methods and even approaching the performance of the best supervised method.

Attached Figure Description

[0021] Figure 1 is a flowchart illustrating the method of the present invention.

[0022] Figure 2 is a schematic diagram of the specific process framework of the method of the present invention in Example 1.

[0023] Figure 3 is a technical evolution diagram of the technical solution of the present invention.

Detailed Implementation

[0024] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only for explaining the present invention and are not intended to limit the present invention.

[0025] Example 1

This embodiment provides a dialogue state tracking method based on structured intent-aware memory, which includes receiving the current user utterance and the historical dialogue record (see Figure 1). The steps are as follows: First, a three-layer hierarchical memory architecture is constructed. Drawing inspiration from the Atkinson-Shiffrin multi-store model in cognitive science, dialogue information is organized into three layers based on time scale and granularity: working memory, session memory, and long-term profile. The working memory layer maintains the original text of the current dialogue in a finite-capacity window (the most recent k rounds), using a first-in, first-out (FIFO) strategy. The session memory layer uses structured events as storage units, recorded as memory entries. Each memory entry is labeled with an intent tag, event type, and topic scope, which encodes semantic structure into the memory entry. The long-term profile layer records the user's slot preference distribution, intent frequency, and intent transition matrix, accumulating cross-session behavioral patterns.

[0026] Secondly, a structured event representation and weighted label density retrieval mechanism were designed. Five event types (demand expression, information provision, confirmation feedback, clarification request, and topic transition) were defined to describe the semantic roles of user discourse, and a rule-first, large language model-backup strategy was adopted to complete the annotation.

[0027] Building upon the previous approach, a weighted label density retrieval algorithm is designed: given the intent, event type, and topic range of the current query, each memory entry is discretely matched across three dimensions. The matching score for the intent dimension is weighted by the decaying confidence of that intent, and finally, the top-k entries are selected in descending order of density. This mechanism replaces continuous vector similarity with hard constraints, filtering out memories with inconsistent intents.

[0028] A confidence decay and rebound mechanism for intents is then introduced. A confidence level is maintained for each identified intent, decaying exponentially with each round: an initial confidence level is obtained when the intent first appears, decaying exponentially as the conversation moves to other intents; and the confidence level rebounds when the user mentions it again. This confidence level directly participates in the intent dimension weighting of tag density retrieval, naturally ranking recently active intents higher in memory, while the memory ranking of inactive intents decreases but does not completely disappear. Unlike the MemoryBank's decay of memory entries based on overall importance, this embodiment decays the active confidence level of the intent itself. That is, the same intent may correspond to multiple memories; decaying at the intent level better suits the needs of intent lifecycle management in conversations.

[0029] Finally, a two-stage prediction strategy based on conflict perception is adopted. Dialogue state tracking is decoupled into three steps: intent confirmation, conflict detection, and slot extraction. In the first stage, the retrieved memory context is used to confirm the user's intent through chain-of-thought (CoT) reasoning. Between the first and second stages, a hierarchical verification is performed using the intent transition matrix from the long-term profile. This involves checking whether the transition pattern has historical precedents, checking whether the user's utterance contains topic switching signals, and, if necessary, calling a large language model for secondary confirmation. If the transition is confirmed to be unreasonable, the process reverts to the previous intent. In the second stage, after the intent is confirmed, only the valid slots in the corresponding domain are extracted, reducing the search space from all 30 slots to 5-10.

[0030] Specifically: This embodiment uses the multi-domain task-oriented dialogue dataset MultiWOZ 2.4 as an example to illustrate the implementation process (see Figure 2). MultiWOZ 2.4 contains 10,438 multi-turn dialogues covering five domains (restaurants, hotels, trains, attractions, and taxis), with a total of 30 slots. The base model is DeepSeek-V3.2.

[0031] 1. Task formalization: Given a multi-turn dialogue sequence $D = \{(u_1, r_1), \ldots, (u_T, r_T)\}$, where $u_t$ is the user utterance in round $t$, $t = 1, 2, \ldots, T$, $T$ is the total number of dialogue rounds, and $r_t$ is the response of the LLM system that converses with the user. The gold dialogue state is defined as $B_t = (I_t, S_t)$, where $I_t = (d_t, a_t)$ is the user's current intent, $d_t$ denotes the domain type, and $a_t$ denotes the action type; $S_t = \{(s_j, v_j)\}$ is the cumulative slot-value set, in which slot name $s_j$ corresponds to slot value $v_j$. Cumulativity is a core feature of DST: $S_t$ includes not only the slot values newly mentioned in round $t$, but also all valid slot values confirmed in previous rounds. The DST task requires predicting the dialogue state $\hat{B}_t$ after each user utterance $u_t$. The target mapping function $f$ in this embodiment is: $\hat{B}_t = f(u_t, D_{<t}, \mathcal{M})$, where $\hat{B}_t$ is the predicted dialogue state; $D_{<t} = \{(u_i, r_i)\}_{i=1}^{t-1}$ is the dialogue history, in which $i$ indexes the dialogue round and $r_i$ is the LLM response in the i-th round; and $\mathcal{M}$ is the three-layer memory.

[0032] 2. Three-tiered hierarchical memory architecture: The working memory layer maintains the original text of the most recent k rounds of dialogue (k=10 in this example), using a first-in, first-out (FIFO) strategy. When a new dialogue round arrives, the oldest round is automatically dequeued, providing immediate context for subsequent intent confirmation and slot extraction, while capacity limits ensure that the prompt length is controllable.

[0033] The session memory layer uses structured events as its storage unit. Each memory entry is a quadruple $e = (c, I_e, E_e, T_e)$, whose components are the event content $c$ (a summary of the user's utterance), the intent tag set $I_e$, the event type set $E_e$, and the topic scope $T_e$. Additionally, the layer maintains the intent state history $\mathcal{H}_t = \{(I_i, \mathrm{conf}_i, \Delta S_i)\}_{i=1}^{t}$, which records the intent $I_i$, confidence $\mathrm{conf}_i$ and incremental slot values $\Delta S_i$ of each dialogue round, where $i$ indexes the dialogue round, providing data for conflict detection and confidence decay.

[0034] The long-term profile layer records behavioral patterns across sessions and comprises three components. The slot preference distribution $P = \{(s, v) \mapsto n_{s,v}\}$ records the historical frequency of each slot value, where $s$ is the slot name, $v$ is the slot value, and $n_{s,v}$ is the frequency of that slot value in the dialogue history. $s$ and $v$ range over the semantic space predefined by the dataset according to the task ontology: let $\mathcal{S} = \{s_1, \ldots, s_N\}$ denote the predefined global slot set of the system, where $N$ is the total number of slots; for each slot $s_j$, the corresponding candidate slot value space $V_j$ is the set of all legal values that the slot can take. For each predefined slot $s_j$, the system counts how often each candidate slot value $v \in V_j$ appears in historical dialogues. The intent frequency $F$ records the number of occurrences of each intent. The intent transition matrix $A$ records the frequency of historical transitions between intents, for use in conflict detection.
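The three layers described in [0032] to [0034] can be sketched as plain data structures. This is only an illustrative sketch, not the implementation of the embodiment: the names `MemoryEntry`, `HierarchicalMemory` and `write` are hypothetical, and the choice of a `deque` with `maxlen` for the FIFO window and nested counters for the profile statistics is an assumption.

```python
from collections import Counter, defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    """One structured event in the session memory layer (hypothetical layout)."""
    content: str       # summary of the user's utterance
    intents: set       # intent tag set
    event_types: set   # event type set
    topic: str         # topic scope
    turn: int          # dialogue round, used here as the timestamp


@dataclass
class HierarchicalMemory:
    k: int = 10  # working-memory window (k=10 in this embodiment)
    working: deque = field(init=False)
    session: list = field(default_factory=list)
    slot_pref: dict = field(default_factory=lambda: defaultdict(Counter))
    intent_freq: Counter = field(default_factory=Counter)
    transitions: dict = field(default_factory=lambda: defaultdict(Counter))

    def __post_init__(self):
        # A bounded deque drops the oldest round automatically (FIFO).
        self.working = deque(maxlen=self.k)

    def write(self, utterance, entry, prev_intent=None, slots=None):
        """Store one round in all three layers."""
        self.working.append(utterance)
        self.session.append(entry)
        for intent in entry.intents:
            self.intent_freq[intent] += 1
            if prev_intent is not None:
                self.transitions[prev_intent][intent] += 1
        for name, value in (slots or {}).items():
            self.slot_pref[name][value] += 1
```

With `k=3`, writing five rounds leaves only the last three utterances in `working`, while `session` and the long-term counters keep all five.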

[0035] 3. Event representation and label density retrieval: The event type describes the semantic role of an utterance in the dialogue flow. This embodiment defines five types: demand expression, such as "I would like to book a Chinese restaurant"; information provision, such as "two people, Friday evening at 7 pm"; confirmation feedback, such as "Okay, this is fine" or "No, let's choose another one"; clarification request, such as "What is the phone number?"; and topic change, such as "Also, could you help me find nearby hotels?"

[0036] During annotation, keyword matching is applied first: utterances containing "I want", "looking for", "book", etc. are labeled as demand expression; "yes", "ok", "no", etc. are labeled as confirmation feedback; and a change of domain is labeled as topic change. These rules cover about 80% of the cases; for the remaining cases, the large language model selects from the five types.
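The rule-first, model-backup labeling in [0036] might look roughly like the following. The keyword lists contain only the examples quoted in the text plus two obvious variants, the label strings are hypothetical identifiers, and `llm_fallback` is a placeholder callable standing in for the large language model call, which is not specified here.

```python
import re

# Keyword rules taken from the examples in the text; "i would like" and
# "okay" are added variants, and real rule sets would be larger.
EVENT_RULES = {
    "demand_expression": ("i want", "i would like", "looking for", "book"),
    "confirmation_feedback": ("yes", "ok", "okay", "no"),
}


def label_event_types(utterance, current_domain, previous_domain, llm_fallback=None):
    """Return the set of event types for one utterance, rules first."""
    text = utterance.lower()
    labels = set()
    for label, keywords in EVENT_RULES.items():
        # Word boundaries keep "no" from matching inside "know".
        if any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords):
            labels.add(label)
    if previous_domain is not None and current_domain != previous_domain:
        labels.add("topic_change")
    if not labels and llm_fallback is not None:
        # No rule fired: let the model choose among the five types.
        labels = set(llm_fallback(utterance))
    return labels
```

An utterance can receive several labels at once, which matches the use of a set of event types per memory entry.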

[0037] The topic scope $T$ is used to mark topic boundaries. When a topic change event is detected, a new topic tag is generated based on the current intent domain; otherwise, the topic scope from the previous round is reused. This provides a third matching dimension for retrieval.

[0038] The weighted label density is calculated as follows. Given a query $q = (I_q, E_q, T_q)$ (respectively, the query intent, the set of query event types, and the query topic scope) and a memory entry $e = (c, I_e, E_e, T_e)$, the label density is: $d(q, e) = \mathrm{conf}_t(I_q) \cdot \mathbb{1}[I_q \in I_e] + \mathbb{1}[E_q \cap E_e \neq \emptyset] + \mathbb{1}[T_q = T_e]$, where $\mathbb{1}[\cdot]$ is the indicator function, $\mathrm{conf}_t(I_q)$ is the decayed confidence of the query intent in round $t$ (values in [0, 1]), and $\emptyset$ denotes the empty set. The intent dimension is weighted by confidence, the event type and topic scope dimensions each contribute 0 or 1 point, and the density lies in [0, 3]. During retrieval, the top k results are selected in descending order of density; if densities are equal, the entry with the newest timestamp is preferred.

[0039] The specific process is as follows: obtain the confidence $\mathrm{conf}_t(I_q)$ of the query intent; traverse each memory entry in the session memory and check, in turn, the intent match (a match adds $\mathrm{conf}_t(I_q)$), the intersection of event types (a non-empty intersection adds 1), and topic consistency (a match adds 1); compute the density $d$ and add the entry to the candidate set; sort by (d, timestamp) in descending order and take the first k entries.
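The retrieval procedure of [0038] and [0039] can be sketched in a few lines. The `Entry` class and the function names are hypothetical, and the dialogue round is assumed to serve as the timestamp used for tie-breaking.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Minimal memory entry for the sketch (hypothetical layout)."""
    content: str
    intents: frozenset
    event_types: frozenset
    topic: str
    turn: int  # assumed to serve as the timestamp


def label_density(q_intent, q_events, q_topic, conf, entry):
    """Discrete three-dimension match; only the intent term is weighted."""
    d = conf if q_intent in entry.intents else 0.0
    d += 1.0 if q_events & entry.event_types else 0.0
    d += 1.0 if q_topic == entry.topic else 0.0
    return d


def retrieve(entries, q_intent, q_events, q_topic, conf, k=3):
    """Top-k entries by (density, timestamp), both descending."""
    scored = [
        (label_density(q_intent, q_events, q_topic, conf, e), e.turn, e)
        for e in entries
    ]
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [e for _, _, e in scored[:k]]
```

Because the event type and topic terms are 0 or 1, an entry with the wrong intent can still be returned when it matches on the other two dimensions; the intent term only changes the ranking in proportion to the current confidence.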

[0040] 4. Confidence decay and rebound mechanism: the confidence of intent $I$ in round $t$, $\mathrm{conf}_t(I)$, starts from the initial confidence $c_0$ (in this example, $c_0 = 0.9$) in round $t_0$, the round in which the intent first occurs; it decays exponentially (with base $e$, the mathematical constant) at the decay rate $\lambda$ (in this example, $\lambda = 0.2$, a decay of approximately 18% per round, since $1 - e^{-0.2} \approx 0.18$); and it is raised by the rebound increment $\alpha$ (in this example, $\alpha = 0.5$) for the rounds in $R_I$, the set of rounds in which intent $I$ is revisited non-consecutively. The confidence is clipped to [0.05, 1.0], so the confidence of a dormant intent never falls completely to zero.

[0041] This confidence level directly participates in the intent dimension weighting of the above tag density retrieval: recently active intents (conf ≈ 1.0) are ranked higher in memory, while dormant intents (conf → 0.05) are ranked lower but do not disappear, and rebound after being brought up again.
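To make the decay and rebound behaviour concrete, the sketch below traces the confidence of a single intent over several rounds with the example parameters $c_0 = 0.9$, $\lambda = 0.2$ and $\alpha = 0.5$. The exact closed form of the confidence is not reproduced in this text, so the per-round update used here (multiply by $e^{-\lambda}$ each round, add $\alpha$ in a revisit round, then clip) is an assumption, and `confidence_trace` is a hypothetical name.

```python
import math


def clip(x, lo=0.05, hi=1.0):
    """Keep the confidence inside [0.05, 1.0]."""
    return max(lo, min(hi, x))


def confidence_trace(first_turn, revisit_turns, last_turn,
                     c0=0.9, lam=0.2, alpha=0.5):
    """Confidence of one intent per round under an ASSUMED update rule.

    Assumption: every round after the first multiplies the value by
    exp(-lam); a round in revisit_turns then adds alpha; the result is
    clipped. The source defines the symbols but not this exact form.
    """
    trace = {}
    value = c0
    for t in range(first_turn, last_turn + 1):
        if t > first_turn:
            value *= math.exp(-lam)
        if t in revisit_turns:
            value += alpha
        value = clip(value)
        trace[t] = value
    return trace


# Train intent appears in round 3, lies dormant in rounds 4-9,
# and is brought up again in round 10.
trace = confidence_trace(first_turn=3, revisit_turns={10}, last_turn=10)
```

Under this assumed rule the confidence falls from 0.9 in round 3 to about 0.27 in round 9 and rebounds to about 0.72 in round 10, which reproduces the dormancy and revival pattern described for the "train" example.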

[0042] 5. A two-stage prediction strategy based on conflict perception is used to generate dialogue states: Phase one is intent confirmation. A preliminary intent $I_t^{0}$ is inferred using keyword rules (for example, if "restaurant" is detected, the Restaurant domain is inferred). The recent conversation context provided by working memory, the retrieval results $R$, the task definition and the current utterance $u_t$ are concatenated into a prompt, and chain-of-thought reasoning is introduced, requiring the large language model to output its reasoning process before outputting the intent label $\hat{I}_t$. A conflict detection stage is inserted between phase one and phase two, using the intent transition matrix $A$ from the long-term profile layer for hierarchical judgment. If the predicted intent $\hat{I}_t$ of the current round belongs to the same domain as that of the previous round, or if the transition matrix shows $A[I_{t-1}, \hat{I}_t] > 0$, the currently predicted intent $\hat{I}_t$ is trusted directly, where $I_{t-1}$ is the intent established in the previous round. If $A[I_{t-1}, \hat{I}_t] = 0$, check whether the utterance contains switching signal words ("in addition," "also," "by the way," etc.) or new domain keywords; if so, trust $\hat{I}_t$. If it contains neither switching signal words nor new domain keywords, ...
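The hierarchical check described here and in [0013] can be sketched as a short function. The "Domain-Action" intent format follows the "Hotel-Request" label used in Example 2; the function and parameter names are hypothetical, the keyword tables are placeholders, and `llm_confirm` stands in for the secondary confirmation call to the large language model.

```python
SWITCH_WORDS = ("in addition", "also", "by the way")


def domain_of(intent):
    """'Hotel-Request' -> 'Hotel' (assumes a Domain-Action label format)."""
    return intent.split("-")[0]


def verify_intent(predicted, previous, utterance, transitions,
                  domain_keywords, llm_confirm):
    """Return the intent to keep after conflict detection."""
    # Level 1: same domain, or a transition already seen in the history.
    if domain_of(predicted) == domain_of(previous):
        return predicted
    if transitions.get(previous, {}).get(predicted, 0) > 0:
        return predicted
    # Level 2: explicit switching signal or keyword of the new domain.
    text = utterance.lower()
    if any(word in text for word in SWITCH_WORDS):
        return predicted
    if any(kw in text for kw in domain_keywords.get(domain_of(predicted), ())):
        return predicted
    # Level 3: ask the model; revert to the previous intent if it rejects.
    return predicted if llm_confirm(utterance, previous, predicted) else previous
```

For the "confirm the time" case from the background section, a prediction of a train intent after a hotel intent passes neither level 1 nor level 2, so the decision falls to the secondary confirmation and reverts to the hotel intent when the jump is rejected. The model is only called in that last case.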

[0043] Phase two is slot extraction. After the intent is confirmed, only the list of valid slots for the corresponding domain (e.g., the Restaurant domain includes food, area, price range, name, book people, book day, book time) is retrieved and placed in the prompt of the large language model together with the slot values accumulated so far for that domain. The large language model extracts the newly added or updated slot values and outputs them in JSON format. The final cumulative predicted slot values are then computed as $\hat{S}_t = \hat{S}_{t-1} \oplus \Delta\hat{S}_t$, where $\oplus$ merges the two sets and lets a new value overwrite the old value with the same slot name, $\hat{S}_{t-1}$ is the cumulative predicted slot values up to round $t-1$, and $\Delta\hat{S}_t$ is the incremental slot values extracted in round $t$.
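The domain restriction and the cumulative merge in phase two can be illustrated as follows. The slot list is the Restaurant example from the text; the function names and the dictionary representation of slot values are assumptions, and the model call that produces the incremental values is left out.

```python
# Valid slots per domain; only the Restaurant list is given in the text.
DOMAIN_SLOTS = {
    "restaurant": ["food", "area", "price range", "name",
                   "book people", "book day", "book time"],
}


def filter_to_domain(domain, raw_slots):
    """Drop any extracted slot that is not valid for the confirmed domain."""
    valid = set(DOMAIN_SLOTS[domain])
    return {name: value for name, value in raw_slots.items() if name in valid}


def merge_slots(accumulated, delta):
    """Cumulative update: a new value overwrites the old one of the same name."""
    merged = dict(accumulated)
    merged.update(delta)
    return merged


previous = {"food": "chinese", "book people": "4"}
delta = filter_to_domain(
    "restaurant",
    {"book people": "2", "book day": "friday", "departure": "cambridge"},
)
current = merge_slots(previous, delta)
```

Here `current` keeps `food`, replaces `book people` with the newer value, adds `book day`, and ignores the out-of-domain `departure` slot.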

[0044] 6. Single-round reasoning process. Given the current user utterance u_t, the system performs the following processing: write u_t to the working memory layer, synchronously create a session memory entry e, and update the preference statistics and the transition matrix in the long-term profile; label the event types of u_t; perform weighted tag density retrieval based on the preliminarily inferred intent, the event types, and the topic scope; in Phase one, confirm the intent through chain-of-thought reasoning; in the conflict detection stage, verify the plausibility of the intent using the transition matrix; in Phase two, extract the slot values and compute the cumulative predicted slot values; finally, merge and output the dialogue state.

[0045] Example 2
This embodiment provides a supplementary explanation of the dialogue state tracking method based on structured intent-aware memory, using a specific dialogue as an example. Assume the conversation is currently in round 7 and that the history includes the restaurant recommendation request in round 5 and the hotel search in round 6. The following steps are performed on the round-7 utterance u7 = "Book the restaurant from just now, for two people, Friday night":
S1, Memory Update:
● Input: User utterance u7.
● Processing: Write u7 to the tail of the working memory queue (if the queue is full, the oldest entry is dequeued); create a new memory entry for u7 in the session memory layer (its content is completed after step S2); in the long-term profile layer, record the transition from the previous intent Hotel-Request to this round's preliminarily inferred intent in the transition matrix, and update the intent frequency statistics.

[0046] ● Output: The working memory contains the original dialogue text from rounds 5-7; a new memory entry, still to be filled, has been added to the session memory; the transition matrix and frequency statistics in the long-term profile have been updated.
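Step S1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `memory_update`, `working_memory`, `session_memory`, `intent_counts`, and `transitions` are hypothetical, the window size of 3 is chosen only so that rounds 5-7 remain in the queue, and the intent used for round 4 is a placeholder that does not come from the example.

```python
from collections import Counter, deque

WINDOW_K = 3  # assumed window size so that rounds 5-7 remain in working memory

working_memory = deque(maxlen=WINDOW_K)  # FIFO queue: oldest turn drops out when full
session_memory = []                      # structured memory entries, one per turn
intent_counts = Counter()                # intent frequency statistics (long-term profile)
transitions = Counter()                  # (previous intent, next intent) -> frequency


def memory_update(round_no, utterance, prev_intent, prelim_intent):
    """Write one utterance into all three memory layers."""
    # Working memory layer: append the raw text to the tail of the queue.
    working_memory.append((round_no, utterance))
    # Session memory layer: create an entry whose fields are filled in after S2.
    entry = {"round": round_no, "content": None,
             "intents": set(), "events": set(), "topic": None}
    session_memory.append(entry)
    # Long-term profile layer: update frequency statistics and transition matrix.
    intent_counts[prelim_intent] += 1
    if prev_intent is not None:
        transitions[(prev_intent, prelim_intent)] += 1
    return entry


# Placeholder history for rounds 4-6, then the round-7 utterance of the example.
memory_update(4, "<turn 4>", None, "General-Inform")  # placeholder intent
memory_update(5, "<turn 5>", "General-Inform", "Restaurant-Request")
memory_update(6, "<turn 6>", "Restaurant-Request", "Hotel-Request")
entry7 = memory_update(
    7, "Book the restaurant from just now, for two people, Friday night",
    "Hotel-Request", "Restaurant-Inform")

print([r for r, _ in working_memory])
print(transitions[("Hotel-Request", "Restaurant-Inform")])
```

A `deque` with `maxlen` gives the first-in, first-out behaviour directly, so no explicit dequeue step is needed when the window is full.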

[0047] S2, Event Type Labeling:
● Input: User utterance u7.
● Processing: Keyword rule scanning detects "book," which matches the demand rule, and "two people, Friday night," which contains specific values and matches the information rule; the previous intent domain was Hotel, but the current utterance mentions "restaurant," shifting the domain to Restaurant and thus matching the topic transition rule. Because the rules already cover the utterance, the large language model does not need to be called.

[0048] ● Output: Event type set E(u7) = {Demand, Information, Transition}; the topic scope is updated from Hotel to Restaurant (because a topic transition was detected). The session memory entry created in step S1 is completed as e7 = (c7, {Restaurant-Inform}, {Demand, Information, Transition}, Restaurant), with the utterance summary c7 = "The user requests to book a previously recommended restaurant, providing the number of people and the time."
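The rule-based labeling of step S2 can be sketched as follows. The keyword lists, the regular expression, and the function name `label_events` are hypothetical and chosen only to reproduce this example. Only three of the five event types are covered, and the fallback call to the large language model for utterances that no rule covers is omitted.

```python
import re

# Hypothetical keyword rules; a real rule set would be larger.
DEMAND_WORDS = ("book", "reserve", "i want", "i need")
DOMAIN_WORDS = {"Restaurant": ("restaurant",), "Hotel": ("hotel",)}
VALUE_PATTERN = re.compile(
    r"\d|\b(?:one|two|three|four|five|monday|tuesday|wednesday|"
    r"thursday|friday|saturday|sunday)\b")


def label_events(utterance, prev_domain):
    """Return (event type set, topic scope) for one utterance."""
    text = utterance.lower()
    events = set()
    # Demand rule: the utterance expresses a request.
    if any(word in text for word in DEMAND_WORDS):
        events.add("Demand")
    # Information rule: the utterance contains concrete values.
    if VALUE_PATTERN.search(text):
        events.add("Information")
    # Topic transition rule: a domain keyword differs from the previous domain.
    domain = prev_domain
    for name, words in DOMAIN_WORDS.items():
        if any(word in text for word in words):
            domain = name
            break
    if domain != prev_domain:
        events.add("Transition")
    return events, domain


events, topic = label_events(
    "Book the restaurant from just now, for two people, Friday night", "Hotel")
print(sorted(events), topic)
```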

[0049] S3, Weighted Tag Density Retrieval:
● Input: Query q = (I_q = Restaurant-Inform, E_q = {Demand, Information, Transition}, T_q = Restaurant), where I_q, E_q, and T_q denote the query intent, the query event type set, and the query topic scope; the existing memory entries in the session memory (including the restaurant recommendation event in round 5, the hotel search event in round 6, etc.).

[0050] ● Processing: The query intent is preliminarily inferred from keyword rules as Restaurant-Inform ("restaurant" + "book" detected). The current confidence of Restaurant-Inform is obtained; this intent first appeared in round 5, and after 2 rounds of decay, conf ≈ 0.67. The tag density is then calculated for each entry in the session memory:
Round 5 event e5 = ("User requests a Chinese restaurant recommendation", {Restaurant-Request}, {Demand}, Restaurant): in the intent dimension, the query intent Restaurant-Inform is not in the intent set {Restaurant-Request} of e5, so the indicator function is 0 and the score is 0; in the event type dimension, Demand in the query's event type set overlaps with {Demand} in e5, scoring 1; in the topic dimension, Restaurant = Restaurant, scoring 1. Density d = 0 + 1 + 1 = 2.
Round 6 event e6 = ("User requests to find nearby hotels", {Hotel-Request}, {Demand, Transition}, Hotel): the intent dimension does not match, scoring 0; in the event type dimension, Demand and Transition in the query's event type set overlap with {Demand, Transition} in e6, scoring 1; in the topic dimension, Restaurant ≠ Hotel, scoring 0. Density d = 0 + 1 + 0 = 1.
Sorted in descending order of (density, timestamp), e5 ranks higher.

[0051] ● Output: Search results R = {e5, ...} (The top 5 results after ranking are taken; only key entries are shown here).
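The density calculation of step S3 can be reproduced with a short Python sketch. It assumes that the intent dimension contributes the decayed confidence when the query intent is in the entry's intent set, and that the event type and topic dimensions each contribute 1 on a match, as in the worked example above. The names `MemoryEntry`, `tag_density`, and `retrieve` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    round_no: int        # used as the timestamp for tie-breaking
    content: str
    intents: frozenset
    events: frozenset
    topic: str


def tag_density(entry, q_intent, q_events, q_topic, conf):
    """Discrete three-dimension match; the intent score is weighted by conf."""
    intent_score = conf if q_intent in entry.intents else 0.0
    event_score = 1.0 if q_events & entry.events else 0.0
    topic_score = 1.0 if q_topic == entry.topic else 0.0
    return intent_score + event_score + topic_score


def retrieve(entries, q_intent, q_events, q_topic, conf, top_k=5):
    """Return the top-k entries in descending order of (density, timestamp)."""
    scored = [(tag_density(e, q_intent, q_events, q_topic, conf), e.round_no, e)
              for e in entries]
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [e for _, _, e in scored[:top_k]]


e5 = MemoryEntry(5, "User requests a Chinese restaurant recommendation",
                 frozenset({"Restaurant-Request"}), frozenset({"Demand"}),
                 "Restaurant")
e6 = MemoryEntry(6, "User requests to find nearby hotels",
                 frozenset({"Hotel-Request"}),
                 frozenset({"Demand", "Transition"}), "Hotel")
query_events = frozenset({"Demand", "Information", "Transition"})

ranked = retrieve([e6, e5], "Restaurant-Inform", query_events, "Restaurant", 0.67)
print([e.round_no for e in ranked])
```

Because every dimension is a hard match rather than a similarity score, an entry from another domain such as e6 can never outrank a same-topic entry on the topic dimension, which is the behaviour the example relies on.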

[0052] S4, Phase One, Intent Confirmation: ● Input: Working Memory (Original dialogue from rounds 5-7), search results R, task definition TaskDef, current utterance u7.

[0053] ● Processing: The above content is assembled into a prompt according to the following framework, and the large language model is then invoked for chain-of-thought reasoning. The core structure of the prompt is as follows:
"[Recent Conversations] (the most recent rounds of original dialogue provided by working memory);
[Memory Context] (related memory entries and user preferences returned by tag density retrieval);
Based on the above dialogue history and memory context, please determine the intent of the current user's statement.

[0054] List of legal intents: Restaurant-Inform, Restaurant-Inquire, Hotel-Inform, Hotel-Inquire, Train-Inform, Train-Inquire, Attraction-Inform, Attraction-Inquire, Taxi-Inform, Taxi-Inquire, General-Inform, General-Inquire;
Judgment rules:
- "Inform" indicates that the user actively provides information or expresses a need.
- "Inquire" indicates that the user requests information from the system (such as phone number, address, price, etc.).
- Judge based on the context of the conversation: if the user has been discussing hotels and says "Okay, let's book," the domain is still Hotel.
- Resolve ambiguity by using the memory context.
Current user statement: "Book the restaurant from just now, for two people, Friday night."
Preliminary inference: Restaurant-Inform;
Please analyze the reasoning process step by step, and output the intent label on the last line.
Output format: "Intent: Domain-Action""
The instruction "Please analyze the reasoning process step by step" guides the large language model to output its reasoning process first, with the conclusion given in the format "Intent: Domain-Action" on the last line. "Preliminary inference" carries the result of the keyword rules from step S3; it is provided for the model's reference and is not binding. The "List of legal intents" enumerates all legal intent labels and thereby constrains the output range. The system extracts the label after "Intent:" from the last line of the model's output as the confirmed intent.

[0055] In this example, the model outputs the following reasoning process: "The user mentioned 'that restaurant,' which refers to the restaurant recommended in round 5; 'book' is an inform action; 'two people, Friday night' is specific slot value information." The last line outputs "Intent: Restaurant-Inform".

[0056] ● Output: Confirmed intent Î7 = Restaurant-Inform, and the reasoning chain text.
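Extracting the confirmed intent from the last line of the model output can be sketched as follows. The function name `parse_intent` and the regular expression are hypothetical, and falling back to the preliminary inference when the last line is malformed or the label is not in the legal list is an assumption that the description does not state.

```python
import re

DOMAINS = ("Restaurant", "Hotel", "Train", "Attraction", "Taxi", "General")
ACTIONS = ("Inform", "Inquire")
LEGAL_INTENTS = {f"{d}-{a}" for d in DOMAINS for a in ACTIONS}

# Accepts "Intent: Restaurant-Inform" as well as "Intent: restaurant - inform".
INTENT_LINE = re.compile(r"^Intent:\s*([A-Za-z]+)\s*-\s*([A-Za-z]+)\s*$")


def parse_intent(model_output, fallback):
    """Read the intent label from the last non-empty line of the output."""
    lines = [ln.strip() for ln in model_output.splitlines() if ln.strip()]
    if not lines:
        return fallback
    match = INTENT_LINE.match(lines[-1])
    if match is None:
        return fallback  # assumed behaviour: keep the preliminary inference
    label = f"{match.group(1).capitalize()}-{match.group(2).capitalize()}"
    return label if label in LEGAL_INTENTS else fallback


output = (
    "The user mentioned 'that restaurant', which refers to round 5.\n"
    "Intent: Restaurant - Inform"
)
confirmed = parse_intent(output, fallback="General-Inform")
print(confirmed)
```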

[0057] S5, Conflict Detection Stage:
● Input: Currently predicted intent Î7 = Restaurant-Inform, previous intent I6 = Hotel-Request, intent transition matrix T, user utterance u7.

[0058] ● Processing: First, perform the level-1 check: the domain of Î7 (Restaurant) ≠ the domain of I6 (Hotel), indicating a cross-domain jump that requires further evaluation. Look up T(Hotel-Request → Restaurant-Inform) in the transition matrix. Assuming this transition has occurred historically (users often return to a restaurant reservation after checking hotel information), T > 0. The jump is therefore considered legitimate; the current prediction is trusted, and no level-2 or level-3 check is needed.

[0059] ● Output: Final intent I7 = Restaurant-Inform (unmodified).
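The hierarchical judgment of the conflict detection stage can be sketched as follows, following the three levels described in claim 6. The function name `detect_conflict`, the keyword table, and the `confirm_jump` callback that stands in for the secondary confirmation by the large language model are hypothetical. Reverting to the previous intent when no callback is supplied is an assumption.

```python
SWITCH_WORDS = ("in addition", "also", "by the way")
DOMAIN_KEYWORDS = {"Restaurant": ("restaurant",), "Hotel": ("hotel",)}  # hypothetical


def domain_of(intent):
    """Return the domain part of a 'Domain-Action' label."""
    return intent.split("-", 1)[0]


def detect_conflict(pred, prev, transitions, utterance, confirm_jump=None):
    """Return the intent to trust after the three-level check."""
    # Level 1: same domain, or the transition has historical precedent.
    if domain_of(pred) == domain_of(prev) or transitions.get((prev, pred), 0) > 0:
        return pred
    # Level 2: switching signal words or keywords of the new domain.
    text = utterance.lower()
    has_switch = any(word in text for word in SWITCH_WORDS)
    has_domain = any(word in text
                     for word in DOMAIN_KEYWORDS.get(domain_of(pred), ()))
    if has_switch or has_domain:
        return pred
    # Level 3: secondary confirmation; revert if the jump is not confirmed.
    if confirm_jump is not None and confirm_jump(prev, pred, utterance):
        return pred
    return prev


history = {("Hotel-Request", "Restaurant-Inform"): 1}
final = detect_conflict(
    "Restaurant-Inform", "Hotel-Request", history,
    "Book the restaurant from just now, for two people, Friday night")
print(final)
```

In the example of this embodiment the check ends at level 1, because the transition from Hotel-Request to Restaurant-Inform already has a nonzero count.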

[0060] S6, Phase Two, Slot Extraction:
● Input: The list of valid slots corresponding to the confirmed intent Restaurant-Inform, {food, area, price range, name, book people, book day, book time}; the cumulative slot values for this domain up to round 6, S_6(Restaurant) = {Restaurant-food: Chinese} (from round 5); the working memory context; and the current utterance u7.

[0061] ● Processing: The above information is assembled into a prompt according to the following framework, and the large language model is called to extract the new slot values. The core structure of the prompt is as follows:
"[Recent Conversations] (text of the most recent rounds of dialogue);
Slot values extracted so far: Restaurant - Cuisine = Chinese food;
Please extract slot information for the restaurant domain from the current user utterance.

[0062] Current user utterance: "Book the restaurant from just now, for two people, Friday night";
Valid slots in the restaurant domain: cuisine, region, price range, name, number of people booked, booking date, booking time;
Extraction rules:
1. Only use the names from the list of valid slots above;
2. Only extract values that newly appear in the current utterance (do not repeat slot values that already exist);
3. Use the original wording from the utterance, in lowercase;
4. If the user updates an existing slot, extract the new value;
5. If there are no new slot values in the current utterance, return an empty {};
Output only JSON; do not output any other content.

[0063] Examples:
Utterance: "I want to eat Italian food, for two people" Domain: Restaurant Output: {"Cuisine": "Italian", "Number of people": "2"};
Utterance: "I need a train to London, departing at 9:00" Domain: Train Output: {"Destination": "London", "Departure Time": "9:00"};
Your output:"

[0064] The prompt standardizes the JSON output through the following mechanisms: slot names are restricted to the "valid slots" list to prevent the model from inventing its own fields; the rule "only extract values that newly appear in the current utterance" limits extraction to incremental information and avoids repeating historical slot values; the instruction "output only JSON; do not output any other content" constrains the output format and excludes extra text; and a few examples demonstrate the expected key-value format. The system parses the model output as JSON and resolves aliases in the output through a slot name normalization mapping table (e.g., mapping "cuisine" to "food" and "number of reservations" to "book people") to ensure that the final slot names are consistent with the standard naming of the dataset.

[0065] In this example, the model output is: {"Number of reservations": "2", "Reservation date": "Friday"}.

[0066] ● Output: Incremental slot values ΔS_7 = {book people: 2, book day: friday}.
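The parsing and slot name normalization applied to the model output can be sketched as follows. The mapping entries for "cuisine" and "number of reservations" come from the description; the entries for "number of people" and "reservation date", the table and function names, and the choice to drop unknown slot names and return an empty result on invalid JSON are assumptions.

```python
import json

SLOT_ALIASES = {
    "cuisine": "food",                        # from the description
    "number of reservations": "book people",  # from the description
    "number of people": "book people",        # assumed alias
    "reservation date": "book day",           # assumed alias
}
VALID_SLOTS = {
    "Restaurant": {"food", "area", "price range", "name",
                   "book people", "book day", "book time"},
}


def normalize_slots(raw_json, domain):
    """Parse the model output and map slot names to the dataset naming."""
    try:
        raw = json.loads(raw_json)
    except json.JSONDecodeError:
        return {}  # assumed behaviour: treat unparsable output as no new slots
    if not isinstance(raw, dict):
        return {}
    valid = VALID_SLOTS.get(domain, set())
    result = {}
    for key, value in raw.items():
        name = key.strip().lower()
        name = SLOT_ALIASES.get(name, name)
        if name in valid:  # drop fields outside the valid slot list
            result[name] = str(value).strip().lower()
    return result


delta = normalize_slots(
    '{"Number of reservations": "2", "Reservation date": "Friday"}',
    "Restaurant")
print(delta)
```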

[0067] S7, State Merging and Output:
● Input: Confirmed intent Î7 = Restaurant-Inform; historical cumulative slot values S_6(Restaurant) = {Restaurant-food: Chinese}; incremental slot values ΔS_7 = {book people: 2, book day: friday}.

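[0068] ● Processing: Merge S_7 = S_6(Restaurant) ∪ ΔS_7, where S_6(Restaurant) is the historical cumulative slot values and ΔS_7 is the incremental slot values; for slots with the same name, the new value overwrites the old value.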

[0069] ● Output: The complete dialogue state in round 7 is (Restaurant-Inform, {Restaurant-food: Chinese, book people: 2, book day: friday}).
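The merge in step S7 amounts to a dictionary update in which incremental values take precedence. A minimal Python sketch, with the hypothetical helper name `merge_slots` and the slot keys written as they appear in this example:

```python
def merge_slots(accumulated, delta):
    """Union of two slot dictionaries; values in `delta` overwrite old ones."""
    merged = dict(accumulated)  # copy, so the history is not mutated
    merged.update(delta)
    return merged


accumulated_6 = {"Restaurant-food": "Chinese"}
delta_7 = {"book people": "2", "book day": "friday"}

state_7 = ("Restaurant-Inform", merge_slots(accumulated_6, delta_7))
print(state_7)

# A later correction of an existing slot replaces the old value.
print(merge_slots(state_7[1], {"book people": "3"})["book people"])
```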

[0070] The hyperparameter settings involved in this embodiment are as follows: working memory window k=10, retrieval top-k=5, decay rate λ=0.2, rebound increment α=0.5, minimum confidence level 0.05, and inference temperature 0.1.

[0071] Preferably, see Figure 3, which is a schematic diagram illustrating the technical evolution from the prior art to the technical solution of this invention.

[0072] As shown in Table 1, among zero-shot methods the method of this invention is the best on all four metrics, with JGA reaching 52.6%, which is 6.1 percentage points higher than MemGuide and 12.3 percentage points higher than Standard LLM. It surpasses the supervised method SimpleTOD and is close to TripPy.

[0073] Table 1. Comparison of experimental results with other methods

Preferably, as shown in Table 2, the ablation results of the method of the present invention on a subset of 100 dialogues show that the hierarchical architecture and the event type system have the greatest impact, followed by two-stage prediction and the long-term profile, while confidence decay, conflict detection, and chain-of-thought reasoning each contribute 1-2 percentage points.

[0074] Table 2. Ablation experiment results

In summary, the method of this invention uses a three-layer memory architecture to store immediate context, cross-turn event information, and long-term user preferences separately by time scale, avoiding interference between information of different granularities. Ablation experiments show that joint goal accuracy decreases significantly after the hierarchical architecture is removed, verifying the necessity of hierarchical management. In tests stratified by dialogue length, the intent accuracy of this invention increases slightly with the number of turns in medium-to-long dialogues, because the structured information that accumulates in the session memory and the long-term profile gradually improves retrieval and matching quality; it also maintains high accuracy in very long dialogues, significantly outperforming the baseline without memory.

[0075] Weighted label density retrieval replaces continuous vector similarity matching with discrete hard constraints across three dimensions: intent label, event type, and topic scope. This mechanism prevents semantically similar but contradictory segments from being incorrectly associated. Statistical analysis shows a significant difference in intent accuracy between complete three-dimensional matching and no matching, indicating that the more comprehensive the multi-dimensional label matching, the greater the auxiliary role of retrieved memories in intent understanding. Ablation experiments reveal a significant performance drop after removing the event type system, further validating the contribution of structured event labels to retrieval quality.

[0076] The intent confidence decay and rebound mechanism models activity changes at the intent level rather than the memory item level. When linked to tag density retrieval, recently active intents naturally rank higher, while inactive intents decline in ranking but do not completely disappear. When a user mentions an intent again, confidence rebounds, and the ranking recovers. This mechanism allows the system to adaptively handle intent switching and regression without additional ranking strategies, and ablation experiments validate its positive contribution to intent accuracy.

[0077] Two-stage prediction decouples intent confirmation from slot extraction, reducing the slot search space from all domains to the current domain. This reduces ambiguity in the structured output of the large language model and allows errors in intent judgment and slot extraction to be diagnosed independently. Conflict detection is embedded between the two stages as a posterior verification, using the historical patterns accumulated in the intent transition matrix to identify unreasonable domain jumps; when there is neither a historical precedent nor a textual switching signal, it triggers a rollback correction. Ablation experiments show that this strategy contributes significantly to joint goal accuracy.

[0078] Example 3
Based on the same inventive concept, this application also provides a dialogue state tracking system based on structured intent-aware memory, which can be used to implement the methods described in the above embodiments and specifically includes the following modules:
● The hierarchical memory module is used to construct a three-layer hierarchical memory architecture, which includes a working memory layer, a session memory layer, and a long-term profile layer. The working memory layer maintains the immediate context of the current dialogue window and has a limited capacity. The session memory layer uses structured events as storage units, records them as memory entries, and maintains the intent state history. The long-term profile layer records the user's slot preference distribution, intent occurrence frequency, and intent transition matrix, and accumulates cross-session behavioral patterns.
● The event type labeling module is used to label the event type of the current dialogue. The event types include demand expression, information provision, confirmation feedback, clarification request, and topic change.
● The tag density retrieval module is used to perform weighted tag density retrieval. Given the intent, event type, and topic scope of the current query, it performs discrete matching on each memory entry in three dimensions. The matching score of the intent dimension is weighted by the decayed confidence of the intent. Finally, the top-k memory entries are selected in descending order of tag density.
● The confidence decay and rebound module is used to maintain, for each identified intent, a confidence that changes dynamically with the dialogue rounds based on the intent confidence decay and rebound mechanism, and to make the confidence rebound when the user mentions the intent again.
● The dialogue generation module is used to generate the dialogue state using a two-stage prediction strategy with conflict awareness: in the first stage, the user's intent is confirmed by chain-of-thought reasoning based on the retrieved memory entries; between the first and second stages, the intent transition matrix in the long-term profile layer is used for conflict detection and verification to judge the plausibility of the intent jump; in the second stage, after the intent is confirmed, only the valid slots of the corresponding domain are extracted, and the dialogue state containing the intent and slot values is generated and output.

[0079] Preferably, embodiments of this application also provide a specific implementation of an electronic device capable of implementing all steps of the dialogue state tracking method based on structured intent-aware memory in the above embodiments. The electronic device specifically includes a processor, a memory, a communication interface, and a bus; the processor, the memory, and the communication interface communicate with each other via the bus; the communication interface is used for information transmission between server-side devices, metering devices, and user-side devices.

[0080] The processor is used to call a computer program in memory. When the processor executes the computer program, it implements all the steps in the dialogue state tracking method based on structured intent-aware memory in the above embodiments.

[0081] Embodiments of this application also provide a computer-readable storage medium capable of implementing all steps of the dialogue state tracking method based on structured intent-aware memory in the above embodiments. The computer-readable storage medium stores a computer program that, when executed by a processor, implements all steps of the dialogue state tracking method based on structured intent-aware memory in the above embodiments.

[0082] While this application provides method operation steps as shown in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps listed in the embodiments is merely one possible execution order among many and does not represent the only execution order. In actual device or client product execution, the method can be executed in the order shown in the embodiments or drawings or in parallel (e.g., in a parallel processor or multi-threaded processing environment).

[0083] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0084] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0085] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0086] This invention is not limited to the embodiments described above. The above description of specific embodiments is intended to illustrate and explain the technical solutions of this invention. The specific embodiments described above are merely illustrative and not restrictive. Without departing from the spirit and scope of the claims, those skilled in the art can make many specific modifications based on the teachings of this invention, and these modifications all fall within the scope of protection of this invention.

Claims

1. A dialogue state tracking method based on structured intent-aware memory, comprising receiving the current user's utterance and historical dialogue records, characterized in that the steps are as follows:
constructing a three-layer hierarchical memory architecture comprising a working memory layer, a session memory layer, and a long-term profile layer, wherein the working memory layer maintains the immediate context of the current dialogue window and has a limited capacity; the session memory layer uses structured events as storage units, records them as memory entries, and maintains the intent state history; and the long-term profile layer records the user's slot preference distribution, intent occurrence frequency, and intent transition matrix, and accumulates cross-session behavioral patterns;
labeling the current dialogue with event types, the event types including demand expression, information provision, confirmation feedback, clarification request, and topic change;
performing weighted label density retrieval: given the intent, event type, and topic scope of the current query, performing discrete matching on each memory entry in three dimensions, wherein the matching score of the intent dimension is weighted by the decayed confidence of the intent, and finally selecting the top-k memory entries in descending order of label density;
based on the intent confidence decay and rebound mechanism, maintaining for each identified intent a confidence that changes dynamically with the dialogue rounds, the confidence rebounding when the user mentions the intent again;
generating the dialogue state using a two-stage prediction strategy with conflict awareness: in the first stage, the retrieved memory entries are combined and the user's intent is confirmed through chain-of-thought reasoning; between the first and second stages, the intent transition matrix in the long-term profile layer is used for conflict detection and verification to judge the plausibility of the intent jump; in the second stage, after the intent is confirmed, only the valid slots of the corresponding domain are extracted, and a dialogue state containing the intent and slot values is generated and output.

2. The dialogue state tracking method according to claim 1, characterized in that the working memory layer retains only the original information from the most recent k rounds of dialogue and updates it using a first-in, first-out (FIFO) strategy; each memory entry in the session memory layer is a quadruple e = (c, I, E, T), where c is the event content, representing a summary of the user's utterance, I is the intent tag set, E is the event type set, and T is the topic scope; and an intent state history is maintained that records, for each dialogue round, the intent, its confidence, and the incremental slot values, where i denotes the dialogue round and t denotes the t-th dialogue round.

3. The dialogue state tracking method according to claim 1, characterized in that, in the weighted label density retrieval, the label density d(q, e) is calculated as: d(q, e) = conf_t(I_q) · 1[I_q ∈ I_e] + 1[E_q ∩ E_e ≠ ∅] + 1[T_q = T_e]; where I_q is the query intent, E_q is the query event type set, and T_q is the query topic scope; I_e is the intent tag set of the memory entry, E_e is the event type set of the memory entry, and T_e is the topic scope of the memory entry; 1[·] is the indicator function; and conf_t(I_q) ∈ [0, 1] is the decayed confidence of the query intent in round t.

4. The dialogue state tracking method according to claim 1, characterized in that, in the intent confidence decay and rebound mechanism, the confidence of an intent in round t is computed from the initial confidence, the round in which the intent first occurred, the decay rate λ, the rebound increment α, and the set of rounds in which the intent is revisited non-consecutively, where e denotes the mathematical constant; the confidence is clipped to [0.05, 1.0], so that the confidence of an inactive intent never drops completely to zero.

5. The dialogue state tracking method according to claim 1, characterized in that the slot preference distribution includes the historical frequency of each slot value, the intent occurrence frequency records the number of times each intent occurs, and the intent transition matrix records the historical frequency of transitions between intents.

6. The dialogue state tracking method according to claim 1, characterized in that the hierarchical judgment process for conflict detection and verification is as follows: determine whether the predicted intent of the current round belongs to the same domain as that of the previous round, or whether the corresponding historical frequency in the intent transition matrix is greater than zero; if so, trust the currently predicted intent; if not, check whether the utterance contains switching signal words or new domain keywords; if it does, trust the prediction; if it contains neither switching signal words nor new domain keywords, call the large language model for secondary confirmation, and if the jump is confirmed to be unreasonable, revert to the intent of the previous round.

7. A dialogue state tracking system based on structured intent-aware memory, characterized in that it comprises:
a hierarchical memory module, used to construct a three-layer hierarchical memory architecture comprising a working memory layer, a session memory layer, and a long-term profile layer, wherein the working memory layer maintains the immediate context of the current dialogue window and has a limited capacity; the session memory layer uses structured events as storage units, records them as memory entries, and maintains the intent state history; and the long-term profile layer records the user's slot preference distribution, intent occurrence frequency, and intent transition matrix, and accumulates cross-session behavioral patterns;
an event type labeling module, used to label the event type of the current dialogue, the event types including demand expression, information provision, confirmation feedback, clarification request, and topic change;
a tag density retrieval module, used to perform weighted tag density retrieval: given the intent, event type, and topic scope of the current query, it performs discrete matching on each memory entry in three dimensions, wherein the matching score of the intent dimension is weighted by the decayed confidence of the intent, and finally selects the top-k memory entries in descending order of tag density;
a confidence decay and rebound module, used to maintain, for each identified intent, a confidence that changes dynamically with the dialogue rounds based on the intent confidence decay and rebound mechanism, and to make the confidence rebound when the user mentions the intent again;
a dialogue generation module, used to generate the dialogue state using a two-stage prediction strategy with conflict awareness: in the first stage, the retrieved memory entries are combined and the user's intent is confirmed through chain-of-thought reasoning; between the first and second stages, the intent transition matrix in the long-term profile layer is used for conflict detection and verification to judge the plausibility of the intent jump; in the second stage, after the intent is confirmed, only the valid slots of the corresponding domain are extracted, and a dialogue state containing the intent and slot values is generated and output.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the dialogue state tracking method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the dialogue state tracking method according to any one of claims 1 to 6.