Multimodal Contextual Interaction Control Method and Device Based on Pregenerated Resource Map
By using a multimodal scenario interaction control method based on pre-generated resource maps, the problem of contradiction between high realism and training logic in existing technologies is solved, achieving a low-latency, highly immersive training experience and strict adherence to logic.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI LINGJI INFORMATION TECH CO LTD
- Filing Date
- 2026-01-22
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies struggle to simultaneously achieve a high degree of realism and immersion while strictly adhering to teaching/training logic when using generative artificial intelligence for contextual interactive training, leading to distorted training results.
A pre-built contextual interaction graph containing state nodes and directed edges is loaded as a pre-generated resource graph, with each state node associated with visual resources. Hybrid intent recognition processing and a global variable set are used to ensure that the training logic strictly follows the teaching requirements.
It achieves a low-latency, highly immersive training experience, reduces the risk of AI models developing hallucinations, and ensures strict adherence to training logic and accuracy of training results.
Figure CN122019714B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer interaction technology, and in particular to a multimodal scenario interaction control method and apparatus based on a pre-generated resource map.
Background Technology
[0002] With the rapid development of artificial intelligence and multimedia technologies, scenario-based interactive training has been widely applied in fields such as language learning, vocational skills training, and psychological counseling simulation. Traditional interactive training systems mainly rely on preset static options (such as ABCD multiple-choice questions) or simple keyword matching technology. When users face a virtual character on the screen, they can only choose from a limited number of options. This "cramming" or "mechanical" interaction method cannot simulate the openness and uncertainty of real dialogue, resulting in severely distorted training effects and making it difficult for users to experience immersion.
[0003] In recent years, with the explosion of Large Language Model (LLM) and Generative Artificial Intelligence (AIGC) technologies, some interactive solutions based on real-time generation have emerged. These solutions allow users to input freely, with the LLM generating response text in real time and a text-to-speech (TTS) engine then generating spoken feedback in real time. However, this approach has significant technical shortcomings in practical applications. A process that relies purely on large models for generation often lacks constraints. In professional skills training such as air traffic control communications and medical emergency procedures, specific standard operating procedures must be strictly followed. LLMs are prone to hallucinations, producing guidance that deviates from the teaching objectives or is even erroneous, leading to training failure. Therefore, existing technical solutions cannot reconcile high realism and immersion with strict adherence to teaching / training logic. The industry urgently needs a new technical solution that can leverage the efficiency and immersion offered by generative artificial intelligence while ensuring logical controllability.
Summary of the Invention
[0004] This invention provides a multimodal scenario interaction control method and device based on a pre-generated resource map, which solves the problem of difficulty in controlling the training logic when using artificial intelligence technology in the prior art, and achieves the effect of maintaining a high degree of realism and immersion while strictly following the teaching / training logic.
[0005] This invention provides a multimodal scenario interaction control method based on a pre-generated resource map, comprising:
[0006] Load a pre-built contextual interaction graph, which includes several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature.
[0007] In response to the start command of the interactive session, the first state node is activated according to the current state pointer, and the visual resources associated with the first state node are rendered on the user terminal.
[0008] Collect the user's multimodal input signal during playback at the first state node, and convert the multimodal input signal into a text sequence;
[0009] The text sequence is subjected to hybrid intent recognition processing to calculate the matching degree with the standard intent features associated with each directed edge emitted by the current state node, and the target intent and target directed edge are determined.
[0010] Based on the second state node pointed to by the directed edge of the target, update the current state pointer and trigger state transition logic to switch the rendering of the visual resources associated with the second state node on the user terminal.
[0011] According to the present invention, a multimodal contextual interaction control method based on a pre-generated resource map is provided, wherein the pre-generated visual resources are constructed through the following process:
[0012] Receive structured script data containing character settings, dialogue scripts, and action descriptions;
[0013] The structured script data is parsed, and for each script segment, a generative artificial intelligence interface is called to generate corresponding digital human motion video or 3D animation data;
[0014] The generated video or animation data is stored in the content delivery network, and a unique Uniform Resource Locator (URL) is generated.
[0015] Establish a mapping relationship between the unified resource locator and the corresponding state node in the context interaction graph.
[0016] According to the present invention, a multimodal scenario interaction control method based on a pre-generated resource map is provided, the method further comprising:
[0017] Real-time monitoring of user intent recognition matching score and response time across N consecutive state nodes;
[0018] If the average matching score is lower than the first threshold or the average response time is higher than the second threshold, the interaction parameters of subsequent state nodes will be automatically adjusted. The adjustment of the interaction parameters includes: lowering the matching threshold of subsequent intent recognition, or switching to a simplified scenario sub-graph containing more auxiliary prompts.
[0019] According to the present invention, a multimodal scenario interaction control method based on a pre-generated resource graph is provided, wherein the pre-built scenario interaction graph further includes a global variable set, and the triggering state transition logic includes:
[0020] Parse the preset attribute change instructions for the directed edge of the target;
[0021] According to the attribute change instruction, update the values in the global variable set;
[0022] Determine whether the updated global variable value satisfies the admission criteria of the second state node;
[0023] If the conditions are met, the switching rendering is executed; if not, the system is redirected to the preset default branch node or error message node.
[0024] According to the present invention, a multimodal scenario interaction control method based on a pre-generated resource map is provided, wherein the target intent is determined through the following process:
[0025] The text sequence is converted into a user input vector using a pre-trained sentence embedding model;
[0026] Obtain the reference corpus vectors of all associated standard intents for the current state node;
[0027] Calculate the cosine similarity between the user input vector and each reference corpus vector to obtain an initial similarity score;
[0028] The initial similarity score is weighted and corrected based on a preset keyword library to obtain the final matching score;
[0029] The standard intent with the highest final matching score that exceeds a preset threshold is selected as the target intent.
[0030] According to the present invention, a multimodal scenario interaction control method based on a pre-generated resource map is provided, wherein the initial similarity score is weighted and corrected based on a preset keyword library, specifically including:
[0031] The text sequence is segmented and part-of-speech tagging is performed to extract the user's content word set;
[0032] Retrieve the set of strongly related keywords predefined by the current standard intent;
[0033] Determine whether the user's content word set contains elements from the set of strongly related keywords;
[0034] If it does, the initial similarity score is weighted and amplified according to a preset enhancement coefficient; if it does not, the initial similarity score remains unchanged or is penalized and attenuated.
[0035] According to the present invention, a multimodal contextual interaction control method based on a pre-generated resource graph performs part-of-speech tagging on the text sequence, including:
[0036] Natural language processing tools are used to perform part-of-speech tagging on text sequences to identify verb conjugation forms and noun plural forms.
[0037] The verb conjugation forms are restored to their infinitive forms, and the noun plural forms are restored to their singular forms, in order to generate a standardized word form sequence for keyword matching.
[0038] According to the multimodal contextual interaction control method based on a pre-generated resource map provided by the present invention, before acquiring the multimodal input signal of the user during playback at the first state node, the method further includes:
[0039] Determine the playback progress of the visual resources of the current first state node;
[0040] If playback has completed and no user input has been detected by the time the preset time threshold is reached, a timeout handling mechanism is triggered. The timeout handling mechanism includes: displaying a timeout message, automatically playing the prompt audio resource associated with the current state node, or automatically transitioning the state along a preset default silence path.
[0041] The present invention also provides a multimodal scenario interaction control device based on a pre-generated resource map, comprising:
[0042] The loading module is used to load a pre-built context interaction graph, which includes several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature.
[0043] The response module is used to respond to the start command of the interactive session, activate the first state node according to the current state pointer, and render the visual resources associated with the first state node on the user terminal.
[0044] The acquisition module is used to acquire the user's multimodal input signals during the playback process at the first state node, and convert the multimodal input signals into a text sequence;
[0045] The first processing module is used to perform mixed intent recognition processing on the text sequence, calculate the matching degree with the standard intent features associated with each directed edge issued by the current state node, and determine the target intent and the target directed edge.
[0046] The second processing module is used to update the current state pointer based on the second state node pointed to by the target directed edge, and trigger state transition logic to switch the rendering of the visual resources associated with the second state node on the user terminal.
[0047] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the multimodal scenario interaction control method based on pre-generated resource maps as described above.
[0048] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the multimodal scenario interaction control method based on a pre-generated resource map as described above.
[0049] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the multimodal scenario interaction control method based on a pre-generated resource map as described above.
[0050] The multimodal scenario interaction control method and device based on pre-generated resource graphs provided by this invention load a pre-constructed scenario interaction graph in the pre-generated resource graph mode, which reduces training latency. Intent recognition takes only milliseconds, and video playback is also instantaneous and highly immersive. At the same time, since the direction of training nodes is strictly limited by the directed edges of the graph, the risk of the artificial intelligence model hallucinating and giving incorrect guidance is reduced, thereby maintaining a high degree of realism and immersion while strictly following the teaching / training logic.
Attached Figure Description
[0051] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0052] Figure 1 is the first flowchart of the multimodal scenario interaction control method based on pre-generated resource maps provided by the present invention;
[0053] Figure 2 is the second flowchart of the multimodal scenario interaction control method based on pre-generated resource maps provided by the present invention;
[0054] Figure 3 is a schematic diagram of the structure of the multimodal scenario interaction control device based on pre-generated resource maps provided by the present invention;
[0055] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0056] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0057] The multimodal scenario interaction control method and apparatus based on pre-generated resource maps according to the present invention are described below with reference to Figures 1 to 4.
[0058] This invention provides a detailed description of the basic architecture and operational flow of the multimodal scenario interaction control method based on a pre-generated resource map. This method is primarily executed by the client (user terminal). As shown in Figure 1, the multimodal scenario interaction control method based on a pre-generated resource map in this invention mainly includes steps 110-150.
[0059] Step 110: Load the pre-built contextual interaction graph.
[0060] The contextual interaction graph contains several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature.
[0061] The context interaction graph is a directed graph. Each state node in the graph represents a specific moment or scene in the plot development. In physical storage, a node object can contain: a node ID, a node type (e.g., normal dialogue or branch selection), an associated visual resource ID (e.g., a pre-generated video), and an action script for entering that node. Directed edges connecting two nodes represent the possibilities of plot progression. Each edge points from one node to another and carries a trigger condition. In this implementation, the trigger condition can be a standard intent feature. A standard intent feature is a pre-defined matching template on the directed edges, containing not only simple text labels but also a reference corpus, essential keywords, and an intent vector. The context interaction graph can be configured based on specific training scenarios; no restrictions are placed here.
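The node and edge objects described above can be sketched as plain data classes. This is a minimal illustration only: the class names, field names, and the example.com URLs are assumptions chosen for readability and are not taken from the invention.

```python
from dataclasses import dataclass, field


@dataclass
class IntentFeature:
    """Matching template carried by a directed edge (field names are assumed)."""
    label: str
    reference_corpus: list[str]
    keywords: set[str]
    intent_vector: list[float]


@dataclass
class Edge:
    target_id: str
    intent: IntentFeature


@dataclass
class StateNode:
    node_id: str
    node_type: str            # e.g. "dialogue" or "branch"
    visual_resource_id: str   # e.g. URL of a pre-generated video
    entry_script: str = ""
    edges: list[Edge] = field(default_factory=list)


@dataclass
class InteractionGraph:
    nodes: dict[str, StateNode] = field(default_factory=dict)

    def add_node(self, node: StateNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source_id: str, edge: Edge) -> None:
        # Reject edges that point at a node the graph does not contain.
        if edge.target_id not in self.nodes:
            raise KeyError(f"unknown target node: {edge.target_id}")
        self.nodes[source_id].edges.append(edge)

    def outgoing(self, node_id: str) -> list[Edge]:
        return self.nodes[node_id].edges


graph = InteractionGraph()
graph.add_node(StateNode("n1", "dialogue", "https://cdn.example.com/n1.mp4"))
graph.add_node(StateNode("n2", "dialogue", "https://cdn.example.com/n2.mp4"))
graph.add_edge("n1", Edge("n2", IntentFeature(
    label="provide_boarding_pass",
    reference_corpus=["Here is my boarding pass"],
    keywords={"boarding", "pass"},
    intent_vector=[0.1, 0.9],   # toy vector, not a real embedding
)))
```

Keeping the outgoing edges on the source node means that the intent classifier only has to look at `graph.outgoing(current_id)` at each turn.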
[0062] Visual resource identifiers point to the addresses, such as URLs, of multimedia files stored in a CDN (Content Delivery Network) or local storage. Upon system startup, data packets for a specific script are first read from a database such as the Neo4j graph database or the SequoiaDB document database. Graph data can be deserialized into memory and built into a node tree. Each node pre-loads metadata for its associated video, facilitating rapid subsequent indexing.
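The deserialization step might look like the following sketch, which assumes the script package has already been read from the database as a JSON string. The JSON schema, the field names, and the URLs are hypothetical, and an index keyed by node ID stands in for the in-memory node structure.

```python
import json

# Hypothetical script package; the schema is an assumption for illustration.
RAW_PACKAGE = """
{
  "nodes": [
    {"id": "start", "type": "dialogue",
     "video_url": "https://cdn.example.com/start.mp4", "duration_s": 8.0},
    {"id": "baggage", "type": "dialogue",
     "video_url": "https://cdn.example.com/baggage.mp4", "duration_s": 6.5}
  ],
  "edges": [
    {"source": "start", "target": "baggage", "intent": "provide_boarding_pass"}
  ]
}
"""


def load_graph(raw: str) -> dict[str, dict]:
    """Deserialize a script package into an in-memory index keyed by node ID."""
    data = json.loads(raw)
    # Each node keeps its video metadata so it can be looked up without I/O.
    index = {node["id"]: {**node, "edges": []} for node in data["nodes"]}
    for edge in data["edges"]:
        if edge["source"] not in index or edge["target"] not in index:
            raise ValueError(f"edge references unknown node: {edge}")
        index[edge["source"]]["edges"].append(edge)
    return index


graph = load_graph(RAW_PACKAGE)
```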
[0063] In some embodiments, the pre-generated visual resources are constructed through the following process: receiving structured script data containing character settings, dialogue scripts, and action descriptions; parsing the structured script data, and for each script segment, calling a generative artificial intelligence interface to generate corresponding digital human action video or 3D animation data; storing the generated video or animation data in a content delivery network and generating a unique resource locator; and establishing a mapping relationship between the resource locator and the corresponding state node in the context interaction graph.
[0064] Traditional scripts are Word documents, whereas in this embodiment of the invention the script can be received as structured data, for example in XML or JSON format.
[0065] The system can have a built-in batch processing engine (BPE) that automatically executes the following steps.
[0066] TTS generation: Call the speech synthesis API to convert text into an audio file with specific emotions.
[0067] Motion driving: A lip-sync algorithm and a body motion generation model are used. The model takes audio and a static character portrait / 3D model as input and outputs dynamic facial video of the character speaking.
[0068] Scene compositing: Using computer graphics techniques, dynamic character layers are overlaid with background layers. If complex actions are involved, a video generation model may be invoked to generate a specific action video.
[0069] Encoding and Compression: The generated raw video sequence is encoded into H.264 or H.265 format, and the bitrate is optimized for network transmission. The generated video file is uploaded to cloud object storage and accelerated via CDN, thus obtaining a unique URL.
[0070] It will be understood that, during backfill mapping, the system can automatically locate the corresponding state node in the context interaction graph based on the script fragment's ID and write the URL into the node's attributes. If the curriculum research staff modify a single sentence in the script, the system can automatically trigger the entire process of regeneration, re-upload, and remapping, reducing content production costs.
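The backfill mapping can be illustrated with a short sketch. It assumes that each script fragment ID is also the key of its state node, and it detects edited fragments by comparing a content hash; the hash check, the `fake_generate` stand-in for the generation and upload pipeline, and the URLs are all assumptions for illustration.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fake_generate(segment_id: str, text: str) -> str:
    """Stand-in for TTS, motion driving, compositing, encoding and CDN upload."""
    return f"https://cdn.example.com/{segment_id}-{content_hash(text)[:8]}.mp4"


def backfill(segments: dict[str, str], nodes: dict[str, dict],
             generate: Callable[[str, str], str]) -> list[str]:
    """Regenerate changed fragments and write each URL back to its node."""
    updated = []
    for segment_id, text in segments.items():
        node = nodes[segment_id]            # locate the node by fragment ID
        digest = content_hash(text)
        if node.get("script_hash") == digest:
            continue                        # fragment unchanged, keep old video
        node["video_url"] = generate(segment_id, text)
        node["script_hash"] = digest
        updated.append(segment_id)
    return updated


nodes = {"seg-1": {}, "seg-2": {}}
segments = {"seg-1": "Hello.", "seg-2": "Your boarding pass, please."}
first_run = backfill(segments, nodes, fake_generate)

segments["seg-2"] = "May I see your boarding pass?"
second_run = backfill(segments, nodes, fake_generate)
```

In this sketch the first run regenerates both fragments, while the second run regenerates only the edited one.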
[0071] Traditional training software relies on live-action filming, and any script modifications require hiring new actors and setting up new studios, resulting in extremely high costs. This invention utilizes AIGC technology to pre-process and automate content production through software, enabling the creation of massive amounts of branching storylines, even those with thousands of tiny branches, greatly enriching the density of the interaction graph.
[0072] Step 120: In response to the start command of the interactive session, activate the first state node according to the current state pointer, and render the visual resources associated with the first state node on the user terminal.
[0073] When the user clicks "Start Training," a start command is received and the state pointer can be set to the start node of the graph. The client player then plays the pre-generated digital human video located at that node's resource URL. The video is not rendered in real time; it is a smooth, pre-produced high-definition MP4 / WebM stream, which achieves movie quality (1080p / 4K) while placing almost no strain on the client's GPU.
[0074] Step 130: Collect the multimodal input signal of the user during the playback process in the first state node, and convert the multimodal input signal into a text sequence.
[0075] During or after video playback, microphone access can be granted to capture the user's voice stream. An automatic speech recognition engine can be invoked to convert the audio signal into a text sequence in real time. In addition to voice, if the device supports a camera, facial expressions such as confusion or confidence can also be captured as supplementary metadata.
[0076] Step 140: Perform hybrid intent recognition processing on the text sequence, calculate the matching degree with the standard intent features associated with each directed edge emitted by the current state node, and determine the target intent and target directed edge.
[0077] In this implementation, after receiving the text sequence, the server does not directly generate a response as a conventional large language model would, but instead performs a classification task. It can traverse all directed edges emitted by the current node. Assuming there are three directed edges, it can calculate the matching degree between the user's input text and the standard intent corresponding to each of the three edges, thereby determining the matching target directed edge.
[0078] In some embodiments, the target intent is determined through the following process: converting the text sequence into a user input vector using a pre-trained sentence embedding model; obtaining reference corpus vectors of all associated standard intents of the current state node; calculating the cosine similarity between the user input vector and each reference corpus vector to obtain an initial similarity score; weighting and correcting the initial similarity score based on a preset keyword library to obtain a final matching score; and selecting the standard intent with the highest final matching score that exceeds a preset threshold as the target intent.
[0079] First, a pre-trained deep learning model, such as BERT, can be used to transform the user's natural language input into a high-dimensional numerical vector. Then, cosine similarity calculation is performed between the user's input vector and a pre-defined reference corpus vector. This gives the system strong generalization ability; even a sentence that has never appeared before will be close in the vector space if its meaning is similar, thus achieving a high initial similarity score. Vector models excel at capturing semantics, but may not be sensitive enough in distinguishing key details; a keyword database can be used for correction.
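The cosine similarity step can be sketched in plain Python. The two-dimensional vectors below are toy values standing in for the output of a sentence embedding model, and taking the maximum over an intent's reference vectors is an assumption; the text does not say how multiple reference sentences are combined.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def initial_scores(user_vec: list[float],
                   reference_vecs: dict[str, list[list[float]]]) -> dict[str, float]:
    """Best similarity between the user vector and each intent's references."""
    return {
        intent: max(cosine_similarity(user_vec, ref) for ref in refs)
        for intent, refs in reference_vecs.items()
    }


# Toy vectors; a real system would obtain these from the embedding model.
user_vector = [1.0, 0.0]
references = {
    "provide_boarding_pass": [[1.0, 0.0], [0.6, 0.8]],
    "ask_for_help": [[0.0, 1.0]],
}
scores = initial_scores(user_vector, references)
```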
[0080] To compensate for the insensitivity of pure vector models in handling technical terms, negation words, or specific instructions, a weighting / penalty mechanism can be introduced. The initial similarity score is weighted and corrected based on a pre-defined keyword library. Specifically, this involves: segmenting the text sequence and performing part-of-speech tagging to extract the user's content word set; retrieving the pre-defined set of strongly related keywords for the current standard intent; determining whether the user's content word set contains elements from the strongly related keyword set; if so, applying a weighted amplification to the initial similarity score according to a pre-defined enhancement coefficient; if not, maintaining the initial similarity score or applying a penalized decay.
[0081] If the system detects that the user input contains pre-defined strongly related keywords, it considers the intent highly credible and significantly improves the final score by multiplying it by an enhancement factor (e.g., 1.2 times), ensuring a correct match. If necessary key features are missing, or mutually exclusive words appear, the system can maintain the original score or reduce it to prevent seemingly plausible but ultimately incorrect matches.
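A minimal sketch of the keyword correction and threshold selection follows. The enhancement factor of 1.2 comes from the example above; the penalty factor of 0.9, the threshold of 0.8, and the intent names are assumptions for illustration.

```python
from typing import Optional


def corrected_score(initial: float, content_words: set[str],
                    strong_keywords: set[str],
                    boost: float = 1.2, penalty: float = 0.9) -> float:
    """Amplify the score on a keyword hit, otherwise attenuate it."""
    if content_words & strong_keywords:
        return initial * boost
    return initial * penalty  # use penalty=1.0 to leave the score unchanged


def select_intent(initial_scores: dict[str, float], content_words: set[str],
                  keyword_library: dict[str, set[str]],
                  threshold: float = 0.8) -> Optional[str]:
    """Return the intent with the highest final score above the threshold."""
    best_intent, best_score = None, threshold
    for intent, initial in initial_scores.items():
        keywords = keyword_library.get(intent, set())
        final = corrected_score(initial, content_words, keywords)
        if final > best_score:
            best_intent, best_score = intent, final
    return best_intent


keyword_library = {
    "provide_boarding_pass": {"boarding", "pass"},
    "ask_for_help": {"help"},
}
similarities = {"provide_boarding_pass": 0.70, "ask_for_help": 0.72}

matched = select_intent(similarities, {"boarding", "pass"}, keyword_library)
unmatched = select_intent(similarities, {"hello"}, keyword_library)
```

Here the keyword hit lifts `provide_boarding_pass` above the threshold even though its raw similarity is slightly lower than that of `ask_for_help`, while an input without any strong keyword matches nothing.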
[0082] In some embodiments, part-of-speech tagging of the text sequence includes: using natural language processing tools to tag the text sequence with parts of speech, identifying verb conjugation forms and noun plural forms; restoring verb conjugation forms to their base forms and noun plural forms to their singular forms, in order to generate a standardized word form sequence for keyword matching.
[0083] To ensure that the second-layer keyword matching does not fail due to changes in grammatical form, the system performs a standardized cleaning process before matching. Natural language processing tools can be used for part-of-speech tagging. The system uniformly restores the varied verb tenses and noun plurals in user speech to their root forms. Regardless of the tense or singular / plural form used by the user, as long as the core meaning is correct, it can be accurately captured by the keyword database, greatly improving the accuracy of recognition.
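The normalization step can be illustrated as follows. This sketch replaces the natural language processing tool with a tiny hand-written lemma table and a stop-word list, both of which are hypothetical; a production system would rely on a real part-of-speech tagger and lemmatizer.

```python
import re

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMA_TABLE = {
    "lost": "lose", "losing": "lose", "loses": "lose",
    "bags": "bag", "passes": "pass",
    "is": "be", "has": "have",
}
# Function words dropped so that only content words remain (assumed list).
STOP_WORDS = {"i", "he", "my", "the", "a", "be", "have"}


def normalize(text: str) -> list[str]:
    """Lower-case, tokenize, restore base forms, and keep content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lemmas = [LEMMA_TABLE.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]


past_tense = normalize("I lost my bags")
present_tense = normalize("He loses the bag")
```

Both sentences reduce to the same standardized sequence, so a keyword set containing `lose` and `bag` matches either of them.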
[0084] Step 150: Based on the second state node pointed to by the target directed edge, update the current state pointer and trigger the state transition logic to switch the rendering of the visual resources associated with the second state node on the user terminal.
[0085] Once the target directed edge is determined, the state pointer can be updated to the next node pointed to by the target directed edge. The client can immediately receive the video URL of the new node, thus switching the rendering of the visual resources associated with the second state node on the user terminal. To ensure a good user experience, the player can use double buffering technology to preload new videos in the background. Once the logic is confirmed, the foreground switches immediately, achieving a seamless transition similar to a real conversation.
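The preloading idea can be sketched as bookkeeping around two buffers. The class below does not touch any real media API: it only tracks which URL is in the foreground and which candidate URLs have been preloaded, and its names and URLs are assumptions for illustration.

```python
class DoubleBufferedPlayer:
    """Foreground buffer plays; background buffer holds preloaded candidates."""

    def __init__(self, first_url: str) -> None:
        self.front = first_url
        self.back: dict[str, str] = {}

    def preload(self, candidates: dict[str, str]) -> None:
        """Preload the video of every node reachable from the current one."""
        self.back = dict(candidates)

    def switch(self, node_id: str, fallback_url: str) -> bool:
        """Bring the target node's video to the front.

        Returns True if the video had been preloaded, False if the
        fallback URL had to be used instead.
        """
        hit = node_id in self.back
        self.front = self.back.get(node_id, fallback_url)
        self.back = {}  # candidates of the previous node are now stale
        return hit


player = DoubleBufferedPlayer("https://cdn.example.com/n1.mp4")
player.preload({
    "n2": "https://cdn.example.com/n2.mp4",
    "n3": "https://cdn.example.com/n3.mp4",
})
hit = player.switch("n2", "https://cdn.example.com/n2.mp4")
```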
[0086] The multimodal scenario interaction control method based on pre-generated resource graphs provided in this embodiment of the invention loads a pre-constructed scenario interaction graph in the pre-generated resource graph mode, which reduces training latency. Intent recognition takes only milliseconds, and video playback is also instantaneous and highly immersive. At the same time, since the direction of training nodes is strictly limited by the directed edges of the graph, the risk of the artificial intelligence model hallucinating and giving incorrect guidance is reduced, thereby maintaining a high level of realism and immersion while strictly following the teaching / training logic.
[0087] In some embodiments, the multimodal contextual interaction control method based on pre-generated resource maps of the present invention further includes: real-time monitoring of the user's intent recognition matching score and response time in N consecutive state nodes; if the average matching score is lower than a first threshold or the average response time is higher than a second threshold, then the interaction parameters of subsequent state nodes are automatically adjusted; the adjustment of interaction parameters includes: lowering the matching threshold of subsequent intent recognition, or switching to a simplified contextual sub-map containing more auxiliary prompts.
[0088] A sliding window queue can be maintained in memory to record the interaction data of the N most recent nodes. The interaction data can include intent recognition matching score and response time. The intent recognition matching score represents the confidence level of the user's input intent matching, such as 0.85, 0.45, etc. The response time represents the time from when the user hears the question to when they speak, such as 1.5s, 2.0s, etc.
[0089] If the average matching score of the last 5 interactions is below 0.6, which is the first threshold, it indicates that the user's language ability is insufficient and they have difficulty expressing their accurate intentions. The entry threshold for subsequent nodes can be lowered. A branch that originally required a similarity of 0.8 to trigger can now be triggered with 0.6, preventing frequent user frustration. Simplified scenario subgraphs can also be dynamically loaded. For example, a complex plot can be automatically and seamlessly switched to a simpler branch, or subtitles can be automatically overlaid in subsequent videos.
[0090] If a user's average thinking time exceeds 10 seconds (the second threshold), it indicates that the user has difficulty with auditory comprehension. The system can automatically adjust the playback speed of subsequent pre-generated videos to 0.9x, or automatically trigger a character retelling mechanism.
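The sliding window and the two threshold checks can be sketched as below, using the example values from the text (a window of 5, a first threshold of 0.6, a second threshold of 10 seconds, a relaxed matching threshold of 0.6, and 0.9x playback). Waiting until the window is full before adjusting, and the names of the returned settings, are assumptions for illustration.

```python
from collections import deque
from statistics import mean


class AdaptiveMonitor:
    def __init__(self, window: int = 5, score_threshold: float = 0.6,
                 time_threshold_s: float = 10.0) -> None:
        # deque with maxlen drops the oldest record automatically.
        self.records: deque = deque(maxlen=window)
        self.score_threshold = score_threshold
        self.time_threshold_s = time_threshold_s

    def record(self, match_score: float, response_time_s: float) -> None:
        self.records.append((match_score, response_time_s))

    def adjustments(self) -> dict:
        """Return parameter overrides for subsequent nodes, if any."""
        if len(self.records) < self.records.maxlen:
            return {}  # not enough data yet (assumed behaviour)
        avg_score = mean(score for score, _ in self.records)
        avg_time = mean(time_s for _, time_s in self.records)
        overrides: dict = {}
        if avg_score < self.score_threshold:
            overrides["match_threshold"] = 0.6
            overrides["use_simplified_subgraph"] = True
        if avg_time > self.time_threshold_s:
            overrides["playback_rate"] = 0.9
            overrides["repeat_prompt"] = True
        return overrides


struggling = AdaptiveMonitor()
for _ in range(5):
    struggling.record(0.5, 2.0)

slow = AdaptiveMonitor()
for _ in range(5):
    slow.record(0.9, 12.0)
```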
[0091] It will be understood that, through real-time monitoring and feedback loops, the system is no longer a cold, impersonal player but an intelligent training companion that can sense how the user is coping. It can ensure that users with lower skill levels do not get stuck while still providing high-level users with fast-paced challenges, thereby improving the training effect.
[0092] In some embodiments, as shown in Figure 2, the pre-built scenario interaction graph also includes a global variable set, and triggering the state transition logic includes steps 210, 220, 230 and 240.
[0093] Step 210: Parse the preset attribute change instructions for the target directed edge;
[0094] Step 220: Update the values in the global variable set according to the attribute change instruction;
[0095] Step 230: Determine whether the updated global variable value satisfies the admission conditions of the second state node;
[0096] Step 240: If the conditions are met, switch rendering; otherwise, redirect to the preset default branch node or error message node.
[0097] The global variable set is like a user's status bar or scoreboard, recording data accumulated throughout the interaction, such as the current score. This data exists and changes continuously throughout the training process.
[0098] The second state node is the next node the user originally intended to enter. A global variable set can be used as a parameter for admission criteria; for example, only users with a score greater than 80 can switch to the next node.
[0099] Attribute change commands are bound to the selection of different nodes. Update operations include specific mathematical addition, subtraction, or logical assignment operations. It's possible to modify the score data in the backend. For example, if the original score was 50, and the user selected the correct dialogue option, the score value in memory can be updated to 60.
[0100] A set of key-value pairs can be maintained in session-level memory as a global variable set. Each directed edge is bound not only to an intent but also to a script instruction.
[0101] For example, in an airport scenario where a user is searching for their luggage, when the user successfully answers "Here is my boarding pass" and triggers the "Provide boarding pass" edge, this edge carries an instruction that can then trigger a condition node. Certain nodes in the graph related to boarding passes are marked as condition nodes. When a user attempts to trigger an intent leading to that node, logical operations can be performed first, checking the global variable table.
[0102] If the conditions are met, a video showing the search for baggage corresponding to the flight number on that boarding pass will be played. If the conditions are not met, even if the intent matches correctly, the user will be forcibly redirected to another node, where a video saying "Sorry, you cannot use the baggage search service at this time" will be played.
[0103] Every choice a user makes changes their attributes, and these attributes determine what kind of plot the user will trigger in the future. In this way, global variables can be used to make the interaction coherent and causal, solving the problem of memory loss in traditional dialogue systems.
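Steps 210 to 240 and the airport example can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Node`, `Edge`, `apply_instructions`, and `transition`, the `(variable, operation, operand)` instruction format, the variable names, and the `cdn.example.com` URLs are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Variables = Dict[str, Any]          # session-level global variable set
Instruction = Tuple[str, str, Any]  # (variable, operation, operand); hypothetical format


@dataclass
class Node:
    node_id: str
    video_url: str
    # Admission condition over the global variable set; None means "always admit".
    admit: Optional[Callable[[Variables], bool]] = None
    # Default branch or error message node used when admission fails.
    fallback_id: Optional[str] = None


@dataclass
class Edge:
    intent: str
    target_id: str
    instructions: List[Instruction] = field(default_factory=list)


def apply_instructions(variables: Variables, instructions: List[Instruction]) -> None:
    """Steps 210-220: parse each instruction and update the global variable set."""
    for name, op, operand in instructions:
        if op == "add":
            variables[name] = variables.get(name, 0) + operand
        elif op == "sub":
            variables[name] = variables.get(name, 0) - operand
        elif op == "set":
            variables[name] = operand
        else:
            raise ValueError(f"unknown operation: {op}")


def transition(nodes: Dict[str, Node], edge: Edge, variables: Variables) -> Node:
    """Steps 230-240: check admission, then return the node whose video is rendered."""
    apply_instructions(variables, edge.instructions)
    target = nodes[edge.target_id]
    if target.admit is None or target.admit(variables):
        return target
    if target.fallback_id is None:
        raise ValueError(f"node {target.node_id} has no fallback node")
    return nodes[target.fallback_id]


# Hypothetical airport graph fragment; URLs are placeholders.
NODES = {
    "counter": Node("counter", "https://cdn.example.com/counter.mp4"),
    "baggage_search": Node(
        "baggage_search",
        "https://cdn.example.com/baggage_search.mp4",
        admit=lambda v: v.get("has_boarding_pass") is True,
        fallback_id="service_unavailable",
    ),
    "service_unavailable": Node(
        "service_unavailable", "https://cdn.example.com/service_unavailable.mp4"
    ),
}
PROVIDE_PASS = Edge(
    "provide_boarding_pass",
    "counter",
    [("has_boarding_pass", "set", True), ("score", "add", 10)],
)
REQUEST_SEARCH = Edge("request_baggage_search", "baggage_search")
```

In this sketch, calling `transition(NODES, REQUEST_SEARCH, variables)` before `PROVIDE_PASS` has been triggered returns the `service_unavailable` node even though the intent matched, which mirrors the forced redirection described above.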
[0104] In some embodiments, before the user's multimodal input signal is acquired during playback at the first state node, the multimodal scenario interaction control method based on the pre-generated resource map further includes: determining the playback progress of the visual resources of the current first state node; and, if playback has completed and no user input is detected before a preset time threshold elapses, triggering a timeout handling mechanism. The timeout handling mechanism includes: displaying a timeout message, automatically playing the prompt audio resource associated with the current state node, or automatically transitioning the state along a preset default silence path.
[0105] A timer or progress listener that synchronizes with the video player can be built into the client to monitor the playback percentage of the current state node in real time.
[0106] A time threshold can be set, for example 5 seconds after the video ends. If the system does not detect a valid voice input signal during this period, the interaction is considered to have timed out.
[0107] In this situation, the system can either display a simple error message or timeout warning, or execute preset recovery or natural-flow logic.
[0108] For example, the system can automatically play a preset prompt sound to guide the user to speak, or treat silence as a specific user input and automatically follow the default path in the graph to the next plot node, thereby preserving the continuity of the immersive experience and simulating how awkward silences are handled in real conversations.
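The timeout decision can be sketched as a pure function in Python. This is only an illustration under stated assumptions: the names `TimeoutAction`, `NodeTimeoutConfig`, and `check_timeout` are hypothetical, and the priority order among the three handling options (silence path, then prompt audio, then message) is chosen here for the example and is not specified by the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TimeoutAction(Enum):
    SHOW_MESSAGE = "show_message"
    PLAY_PROMPT_AUDIO = "play_prompt_audio"
    FOLLOW_SILENCE_PATH = "follow_silence_path"


@dataclass
class NodeTimeoutConfig:
    # Optional per-node resources; field names are assumptions for this sketch.
    prompt_audio_url: Optional[str] = None
    silence_target_id: Optional[str] = None


def check_timeout(
    playback_fraction: float,
    seconds_since_end: float,
    input_detected: bool,
    config: NodeTimeoutConfig,
    threshold_s: float = 5.0,
) -> Optional[TimeoutAction]:
    """Return the timeout action to take, or None if no timeout has occurred."""
    # No timeout while the video is still playing or once the user has spoken.
    if playback_fraction < 1.0 or input_detected:
        return None
    # Playback finished, but the silence has not yet lasted long enough.
    if seconds_since_end < threshold_s:
        return None
    # Assumed priority: default silence path, then prompt audio, then a message.
    if config.silence_target_id is not None:
        return TimeoutAction.FOLLOW_SILENCE_PATH
    if config.prompt_audio_url is not None:
        return TimeoutAction.PLAY_PROMPT_AUDIO
    return TimeoutAction.SHOW_MESSAGE
```

A client-side progress listener would call `check_timeout` periodically with the current playback fraction and the time elapsed since the video ended, and act on the first non-`None` result.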
[0109] The following describes the multimodal scenario interaction control device based on pre-generated resource maps provided by the present invention. The device described below and the multimodal scenario interaction control method based on pre-generated resource maps described above may be read with reference to each other.
[0110] As shown in Figure 3, the multimodal scenario interaction control device based on pre-generated resource maps in this embodiment of the invention mainly includes a loading module 310, a response module 320, an acquisition module 330, a first processing module 340, and a second processing module 350.
[0111] The loading module 310 is used to load a pre-built context interaction graph. The context interaction graph contains several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature.
[0112] The response module 320 is used to respond to the start command of the interactive session, activate the first state node according to the current state pointer, and render the visual resources associated with the first state node on the user terminal.
[0113] The acquisition module 330 is used to acquire the multimodal input signal of the user during playback at the first state node, and convert the multimodal input signal into a text sequence.
[0114] The first processing module 340 is used to perform hybrid intent recognition processing on the text sequence, calculate the matching degree with the standard intent features associated with each directed edge emitted by the current state node, and determine the target intent and the target directed edge.
[0115] The second processing module 350 is used to update the current state pointer based on the second state node pointed to by the target directed edge, and trigger the state transition logic to switch the rendering of the visual resources associated with the second state node on the user terminal.
[0116] The multimodal scenario interaction control device based on a pre-generated resource graph provided in this embodiment of the invention loads a pre-built scenario interaction graph, which reduces latency during training: intent recognition takes only milliseconds, and video playback is also instantaneous, so the experience is highly immersive. At the same time, because the direction of the training nodes is strictly constrained by the directed edges of the graph, the risk of the artificial intelligence model producing hallucinations and giving incorrect guidance is reduced, so that a high level of realism and immersion is maintained while the teaching/training logic is strictly followed.
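How the five modules cooperate can be illustrated with a minimal Python sketch. It rests on several assumptions: the class and method names are hypothetical, speech-to-text conversion is replaced by plain text clean-up, and the word-overlap score in `overlap_score` is only a simple stand-in for the hybrid intent recognition processing, not the recognition method of the invention.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Edge:
    intent: str
    reference: str  # reference utterance standing in for the standard intent feature
    target_id: str


@dataclass
class Node:
    node_id: str
    resource_url: str
    edges: List[Edge]


def overlap_score(text: str, reference: str) -> float:
    """Jaccard word overlap; a placeholder for the real matching degree."""
    a, b = set(text.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


class InteractionController:
    def __init__(self, threshold: float = 0.5) -> None:
        self.graph: Dict[str, Node] = {}
        self.current_id: Optional[str] = None  # current state pointer
        self.threshold = threshold

    def load(self, nodes: List[Node], start_id: str) -> None:
        """Loading module 310: load the pre-built graph."""
        self.graph = {n.node_id: n for n in nodes}
        self.current_id = start_id

    def start(self) -> str:
        """Response module 320: return the resource to render for the first node."""
        return self.graph[self.current_id].resource_url

    def acquire(self, raw_input: str) -> str:
        """Acquisition module 330: here the input is assumed to be text already."""
        return raw_input.strip()

    def recognize(self, text: str) -> Optional[Edge]:
        """First processing module 340: score only the current node's outgoing edges."""
        edges = self.graph[self.current_id].edges
        if not edges:
            return None
        best = max(edges, key=lambda e: overlap_score(text, e.reference))
        return best if overlap_score(text, best.reference) >= self.threshold else None

    def advance(self, edge: Edge) -> str:
        """Second processing module 350: move the pointer and switch rendering."""
        self.current_id = edge.target_id
        return self.graph[self.current_id].resource_url
```

Because `recognize` only considers edges leaving the current node, the next node is always one the graph author defined, which is the constraint the paragraph above relies on.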
[0117] Figure 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 4, the electronic device may include: a processor 410, a communication interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 can call logic instructions in the memory 430 to execute a multimodal contextual interaction control method based on a pre-generated resource graph. This method includes: loading a pre-built contextual interaction graph, which contains several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature. In response to a start command for an interactive session, the processor activates a first state node based on the current state pointer and renders the visual resources associated with the first state node on the user terminal. The processor collects multimodal input signals from the user during playback at the first state node and converts these signals into a text sequence. The processor performs mixed intent recognition processing on the text sequence, calculates the matching degree with the standard intent features associated with each directed edge emitted by the current state node, and determines the target intent and target directed edge. Based on the second state node pointed to by the target directed edge, the processor updates the current state pointer and triggers state transition logic, switching the rendering of the visual resources associated with the second state node on the user terminal.
[0118] Furthermore, the logical instructions in the aforementioned memory 430 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0119] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the multimodal scenario interaction control method based on a pre-generated resource graph provided by the above methods. The method includes: loading a pre-constructed scenario interaction graph, which contains several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature. In response to the start command of the interaction session, a first state node is activated according to the current state pointer, and the visual resources associated with the first state node are rendered on the user terminal. Multimodal input signals of the user during the playback process of the first state node are collected and converted into a text sequence. The text sequence is subjected to mixed intent recognition processing, and the matching degree with the standard intent features associated with each directed edge issued by the current state node is calculated to determine the target intent and the target directed edge. Based on the second state node pointed to by the target directed edge, the current state pointer is updated, and state transition logic is triggered to switch the rendering of the visual resources associated with the second state node on the user terminal.
[0120] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the multimodal contextual interaction control method based on a pre-generated resource graph provided by the above methods. The method includes: loading a pre-constructed contextual interaction graph, which contains several state nodes and directed edges connecting the state nodes. Each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature. In response to a startup command of an interactive session, a first state node is activated according to the current state pointer, and the visual resource associated with the first state node is rendered on the user terminal. Multimodal input signals of the user during playback at the first state node are collected and converted into a text sequence. The text sequence is subjected to mixed intent recognition processing, and the matching degree with the standard intent features associated with each directed edge emitted by the current state node is calculated to determine the target intent and the target directed edge. Based on the second state node pointed to by the target directed edge, the current state pointer is updated, and state transition logic is triggered to switch the rendering of the visual resource associated with the second state node on the user terminal.
[0121] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0122] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0123] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A multimodal scenario interaction control method based on a pre-generated resource map, characterized in that the method includes: loading a pre-built contextual interaction graph, which includes several state nodes and directed edges connecting the state nodes, wherein each state node is associated with a pre-generated visual resource identifier and each directed edge is associated with a preset standard intent feature; in response to a start command of an interactive session, activating a first state node according to a current state pointer, and rendering the visual resources associated with the first state node on a user terminal; collecting the user's multimodal input signal during playback at the first state node, and converting the multimodal input signal into a text sequence; performing hybrid intent recognition processing on the text sequence, calculating the matching degree with the standard intent features associated with each directed edge emitted by the current state node, and determining a target intent and a target directed edge; and, based on a second state node pointed to by the target directed edge, updating the current state pointer and triggering state transition logic to switch the rendering of the visual resources associated with the second state node on the user terminal; wherein the pre-generated visual resources are constructed through the following process: receiving structured script data containing character settings, dialogue scripts, and action descriptions; parsing the structured script data and, for each script segment, calling a generative artificial intelligence interface to generate corresponding digital human motion video or 3D animation data; storing the generated video or animation data in a content delivery network and generating a unique Uniform Resource Locator (URL); and establishing a mapping relationship between the Uniform Resource Locator and the corresponding state node in the contextual interaction graph.
2. The multimodal scenario interaction control method based on pre-generated resource maps according to claim 1, characterized in that the method further includes: monitoring in real time the user's intent recognition matching score and response time across N consecutive state nodes; and, if the average matching score is lower than a first threshold or the average response time is higher than a second threshold, automatically adjusting the interaction parameters of subsequent state nodes; wherein the adjustment of the interaction parameters includes: lowering the matching threshold of subsequent intent recognition, or switching to a simplified scenario sub-graph containing more auxiliary prompts.
3. The multimodal scenario interaction control method based on pre-generated resource maps according to claim 1, characterized in that the pre-built scenario interaction graph also includes a global variable set, and triggering the state transition logic includes: parsing the preset attribute change instructions of the target directed edge; updating the values in the global variable set according to the attribute change instructions; determining whether the updated global variable values satisfy the admission conditions of the second state node; and, if the conditions are met, executing the rendering switch; otherwise, redirecting to a preset default branch node or error message node.
4. The multimodal scenario interaction control method based on pre-generated resource maps according to claim 1, characterized in that the target intent is determined through the following process: converting the text sequence into a user input vector using a pre-trained sentence embedding model; obtaining the reference corpus vectors of all standard intents associated with the current state node; calculating the cosine similarity between the user input vector and each reference corpus vector to obtain an initial similarity score; applying a weighted correction to the initial similarity score based on a preset keyword library to obtain a final matching score; and selecting, as the target intent, the standard intent with the highest final matching score that exceeds a preset threshold.
5. The multimodal scenario interaction control method based on pre-generated resource maps according to claim 4, characterized in that the weighted correction of the initial similarity score based on a preset keyword library specifically includes: segmenting the text sequence into words and performing part-of-speech tagging to extract the user's content word set; retrieving the set of strongly related keywords predefined for the current standard intent; determining whether the user's content word set contains elements from the set of strongly related keywords; if it does, weighting and amplifying the initial similarity score according to a preset enhancement coefficient; and, if it does not, leaving the initial similarity score unchanged or applying a penalty attenuation to it.
6. The multimodal scenario interaction control method based on pre-generated resource maps according to claim 5, characterized in that performing part-of-speech tagging on the text sequence includes: using natural language processing tools to perform part-of-speech tagging on the text sequence to identify verb conjugation forms and noun plural forms; and restoring the verb conjugation forms to their infinitive forms and the noun plural forms to their singular forms, in order to generate a standardized word form sequence for keyword matching.
7. The multimodal scenario interaction control method based on pre-generated resource maps according to claim 1, characterized in that, before acquiring the user's multimodal input signal during playback at the first state node, the method further includes: determining the playback progress of the visual resources of the current first state node; and, if playback has completed and no user input is detected before a preset time threshold elapses, triggering a timeout handling mechanism; wherein the timeout handling mechanism includes: displaying a timeout message, automatically playing the prompt audio resource associated with the current state node, or automatically transitioning the state along a preset default silence path.
8. A multimodal scenario interaction control device based on a pre-generated resource map, characterized in that the device includes: a loading module, used to load a pre-built context interaction graph, which includes several state nodes and directed edges connecting the state nodes, wherein each state node is associated with a pre-generated visual resource identifier and each directed edge is associated with a preset standard intent feature; a response module, used to respond to a start command of an interactive session, activate a first state node according to a current state pointer, and render the visual resources associated with the first state node on a user terminal; an acquisition module, used to acquire the user's multimodal input signals during playback at the first state node, and convert the multimodal input signals into a text sequence; a first processing module, used to perform hybrid intent recognition processing on the text sequence, calculate the matching degree with the standard intent features associated with each directed edge emitted by the current state node, and determine a target intent and a target directed edge; and a second processing module, used to update the current state pointer based on a second state node pointed to by the target directed edge, and trigger state transition logic to switch the rendering of the visual resources associated with the second state node on the user terminal; wherein the pre-generated visual resources are constructed through the following process: receiving structured script data containing character settings, dialogue scripts, and action descriptions; parsing the structured script data and, for each script segment, calling a generative artificial intelligence interface to generate corresponding digital human action videos or 3D animation data; storing the generated video or animation data in a content delivery network and generating a unique Uniform Resource Locator (URL); and establishing a mapping relationship between the Uniform Resource Locator and the corresponding state node in the context interaction graph.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, characterized in that, when the processor executes the program, it implements the multimodal scenario interaction control method based on a pre-generated resource map as described in any one of claims 1 to 7.