Augmented reality-based interactive display method for three-dimensional model of animated character
By constructing a physical space coordinate framework and using speech semantic parsing, the problems of inaccurate positioning and imprecise interaction of 3D animated character models in existing technologies have been solved, achieving more accurate virtual-real fusion and richer dynamic interactive display effects.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUNAN VOCATIONAL COLLEGE OF SCI & TECH
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing 3D model displays of animated characters in augmented reality scenarios suffer from inaccurate model positioning, poor virtual-real fusion, and imprecise voice interaction. They also lack spatial feature recognition and semantic parsing, resulting in insufficient flexibility and accuracy in interactive displays.
By constructing a physical space coordinate framework using image sensors, we can perform marker point detection and 3D model coordinate anchoring. Combined with voice keyword matching and semantic structure analysis, we can parse user interaction intent and drive the 3D model of the animated character to perform dynamic behavior in the augmented reality scene.
It improves the positioning accuracy of 3D models in real-world scenes and enhances the fusion of virtual and real elements, improves the smoothness and precision of interactive displays, enriches the dynamic expression of models, and enhances the user's interactive experience.
Figure CN122023741B_ABST
Description
Technical Field
[0001] This invention belongs to the field of augmented reality model interaction technology, and specifically relates to a method for interactive display of 3D animated character models based on augmented reality.
Background Technology
[0002] Current augmented reality (AR) scenarios typically load a 3D model directly and match it to a single marker point to overlay virtual and real elements. Voice interaction relies solely on simple keyword matching for command recognition, and model dynamics are triggered by preset programs. This approach has two defects. First, anchoring the 3D model to physical space depends on single-point matching, with no spatial feature recognition and no systematic coordinate framework for the physical space. This leads to spatial misalignment and inaccurate positioning of the model in the real-world scene, resulting in poor virtual-real fusion. Second, voice interaction limited to keyword matching does not analyze semantic structure, making it difficult to accurately capture the user's true interaction intent. The model cannot execute corresponding behavioral logic operations based on different command types, resulting in limited dynamic behavior and insufficient flexibility and accuracy in interactive displays.
[0003] It is therefore necessary to construct a physical space coordinate framework and to complete the spatial anchoring and alignment of the model coordinate origin with the marker points. At the same time, the interaction intent is parsed through voice keyword matching and semantic structure analysis, and, based on the instruction type, the behavioral logic operation is executed to drive the dynamic behavior of the model, thereby solving the positioning and interaction problems of the current technology.
Summary of the Invention
[0004] This invention aims to solve at least one of the technical problems existing in the prior art;
[0005] Therefore, this invention proposes an interactive display method for 3D animated character models based on augmented reality, including:
[0006] The video stream data of the user's physical space is collected in real time by an image sensor, and spatial feature recognition and marker point detection are performed on the video stream data to construct a physical space coordinate framework.
[0007] Load the preset 3D model file of the animated character, and spatially anchor and align the origin of the preset 3D model file of the animated character with the position of the detected marker point established in the physical space coordinate frame;
[0008] Voice command signals are obtained from the user's voice input channel, and voice keyword matching and semantic structure analysis are performed on the voice command signals to parse out the interaction commands representing the user's interaction intentions.
[0009] On an augmented reality display device, based on the location of the marker point, the model corresponding to the preset 3D model file of the animated character is rendered into the real scene generated by the video stream data, forming a composite augmented reality scene that blends the virtual and the real.
[0010] Based on the type of the interaction instruction, the corresponding behavioral logic operation is performed on the preset 3D model file of the animated character to drive the preset 3D model file of the animated character to perform corresponding dynamic behaviors in the composite augmented reality screen.
[0011] Furthermore, spatial feature recognition and marker point detection are performed on the video stream data to construct a physical space coordinate framework, including:
[0012] Corner point extraction and edge contour detection are performed on each frame of the video stream data to generate a scene key feature point cloud;
[0013] Feature point tracking and matching are performed on the key feature point cloud of the scene in multiple consecutive frame images, and the spatial displacement and viewpoint change parameters of the feature points between adjacent frame images are calculated.
[0014] Based on the spatial displacement and viewpoint change parameters, a three-dimensional point cloud map of the physical space is reconstructed using the structure-from-motion algorithm, and the motion pose of the image sensor is estimated.
[0015] In the three-dimensional point cloud map, image regions that meet the preset geometric shape and pattern features are searched and identified as usable augmented reality markers. Each augmented reality marker is assigned a unique spatial identifier and three-dimensional position coordinates in the three-dimensional point cloud map.
[0016] By integrating the spatial coordinates and identifiers of all augmented reality markers and combining them with the real-time motion pose of the image sensor, a physical spatial coordinate framework is formed that includes the mapping relationship between the absolute coordinate system, the marker coordinate system, and the screen coordinate system.
[0017] Further, the coordinate origin of the preset 3D model file of the animated character is spatially anchored and aligned with the position of the detected marker point established in the physical space coordinate frame, including:
[0018] Read the coordinate data of the model center point of the preset 3D model file of the animated character, and obtain the coordinate value of the model center point coordinate data in the local coordinate system of the preset 3D model file of the animated character;
[0019] From the physical space coordinate frame, query the spatial identifier of the augmented reality marker selected as the display reference, and obtain the three-dimensional world coordinates of the augmented reality marker in the physical space coordinate system;
[0020] Calculate the coordinate transformation matrix from the center point coordinates of the preset 3D model file of the animated character to the augmented reality marker point. The coordinate transformation matrix includes a translation vector, a rotation vector, and a scaling factor.
[0021] The coordinate transformation matrix is applied to the vertex data of the preset 3D model file of the animated character, so that the spatial position of the model center point of the preset 3D model file of the animated character completely coincides with the spatial position of the augmented reality marker point.
[0022] A dynamic coordinate mapping relationship is established between the local coordinate system and the physical space coordinate system of the preset 3D model file of the animated character, so as to ensure that when the pose of the image sensor changes, the preset 3D model file of the animated character can maintain the anchoring relationship with the physical marker points in the augmented reality screen according to the dynamic coordinate mapping relationship.
[0023] Furthermore, the voice command signal is subjected to voice keyword matching and semantic structure analysis to parse out the interaction commands representing the user's interaction intent, including:
[0024] The acquired raw speech command signal is subjected to noise reduction and gain control processing to extract a clean speech waveform signal;
[0025] Speech endpoint detection is performed on the clean speech waveform signal to segment out speech segments containing valid instructions;
[0026] The acoustic feature vectors of the speech segments are extracted, and the extracted acoustic feature vectors are input into a pre-trained acoustic model to calculate the probability distribution of the phoneme sequence.
[0027] The phoneme sequence is decoded based on a statistical language model to generate a text instruction string corresponding to the voice instruction signal;
[0028] The text instruction string is matched with a preset interactive instruction keyword library to identify keyword combinations containing core action verbs and target object nouns;
[0029] Dependency parsing is performed on the identified keyword combinations to determine the syntactic relationship between actions and objects, thereby parsing out structured interaction instructions, which include action type, action target object, and optional parameters.
[0030] Furthermore, the text instruction string is matched with a preset interactive instruction keyword library to identify keyword combinations containing core action verbs and target object nouns, including:
[0031] Load an interactive instruction keyword library containing multiple categories of action instructions. The interactive instruction keyword library includes instruction trigger phrases, a list of action verbs, a list of model component names, and a list of parameter descriptive words.
[0032] The text instruction string is scanned using a forward maximum matching algorithm to find the longest matching phrase in the interactive instruction keyword library. This phrase is then marked as the instruction trigger phrase.
[0033] After identifying the fragment of the text instruction string that contains the instruction trigger phrase, continue scanning the subsequent string to find a matching action verb in the list of action verbs;
[0034] After the action verb is identified, the matching model component name is searched in the list of model component names within the nearby string range;
[0035] If a matching model component name is found, further search the nearby strings for words from the parameter descriptor list that describe the movement direction, movement amplitude, or number of repetitions.
[0036] The successfully matched command trigger phrase, action verb, model component name, and parameter descriptor are combined to form the keyword combination for this voice interaction.
[0037] Furthermore, on the augmented reality display device, using the location of the marker point as a reference, the model corresponding to the preset 3D model file of the animated character is rendered onto the real-world scene generated by the video stream data to form a composite augmented reality scene that blends virtual and real elements, including:
[0038] The system receives the latest video stream data frames captured by the image sensor in real time and decodes them into real-world texture bitmaps.
[0039] Based on the physical space coordinate frame and the current pose of the image sensor, calculate the virtual camera view matrix and projection matrix corresponding to the current video frame;
[0040] Based on the anchoring position of the pre-set 3D model file of the animated character in physical space and the dynamic coordinate mapping relationship, calculate the vertex coordinate transformation data of the pre-set 3D model file of the animated character under the current virtual camera view.
[0041] The calculated vertex coordinate transformation data is input into the graphics rendering pipeline, and combined with the material texture and skeletal animation data of the loaded animated character preset 3D model file, a rendered image of the animated character preset 3D model file in the current view is generated.
[0042] The generated rendered image is pixel-level fused with the real-world texture bitmap, wherein opaque pixels in the rendered image cover the corresponding pixels in the real-world texture bitmap, while transparent pixels display the original content of the real-world texture bitmap, thereby synthesizing the composite augmented reality image containing virtual animated character models superimposed on a real scene, and outputting it to an augmented reality display device.
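The pixel-level fusion rule in [0042] can be illustrated with a minimal sketch. It assumes that both images are same-sized grids of 8-bit channel tuples and, as an extension not stated in the text, blends partially transparent pixels linearly. The function name `composite_frame` and the data layout are illustrative assumptions.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def composite_frame(rendered: List[List[RGBA]],
                    background: List[List[RGB]]) -> List[List[RGB]]:
    """Overlay a rendered RGBA image on a camera frame of the same size."""
    if len(rendered) != len(background):
        raise ValueError("image heights differ")
    out = []
    for row_fg, row_bg in zip(rendered, background):
        if len(row_fg) != len(row_bg):
            raise ValueError("image widths differ")
        out_row = []
        for (r, g, b, a), bg in zip(row_fg, row_bg):
            if a == 255:
                # Opaque: the virtual pixel covers the real one.
                out_row.append((r, g, b))
            elif a == 0:
                # Transparent: the real pixel shows through.
                out_row.append(bg)
            else:
                # Partial alpha (assumption): integer linear blend.
                out_row.append(tuple(
                    (fg_c * a + bg_c * (255 - a)) // 255
                    for fg_c, bg_c in zip((r, g, b), bg)
                ))
        out.append(out_row)
    return out
```

In a real renderer this step would normally run on the GPU; the nested loops here only make the per-pixel rule explicit.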
[0043] Furthermore, based on the type of the interaction instruction, corresponding behavioral logic operations are performed on the preset 3D model file of the animated character to drive the preset 3D model file of the animated character to perform corresponding dynamic behaviors in the composite augmented reality scene, including:
[0044] Based on the action type in the parsed interaction instructions, the preset behavior logic rule library is queried to obtain the corresponding model behavior logic script. The model behavior logic script defines the behavior sequence, behavior parameters and behavior triggering conditions that the 3D model should execute.
[0045] Extract the action target object from the interaction command. The action target object is a specific body part, prop, or preset action fragment of the 3D model of the animated character.
[0046] Based on the model behavior logic script, calculate the values required for the behavior parameters, where the values are derived from parameters carried in the interaction instructions or default values obtained from the current model state.
[0047] The 3D model animation engine is invoked, and based on the model behavior logic script, the action target object, and the calculated behavior parameters, the skeletal animation controller, material transformer, or space transformer of the preset 3D model file of the animated character is scheduled.
[0048] The skeletal animation controller drives the model's skeletal system to generate motion, the material transformer changes the model's surface texture or color, and the space transformer changes the model's position, orientation, or scaling.
[0049] The execution results of the aforementioned controller and transformer are applied in real time to the preset 3D model file of the animated character being rendered, so that it exhibits dynamic behavior changes corresponding to the interactive commands in the composite augmented reality screen.
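The lookup and parameter resolution in [0044] to [0046] can be sketched as follows. The rule library contents, the field names (`action_type`, `target`, `parameters`), and the fallback order (instruction value, then current model state, then script default) are all hypothetical choices made for illustration.

```python
from typing import Any, Dict

# Hypothetical behavior logic rule library: action type -> behavior script.
BEHAVIOR_RULES: Dict[str, Dict[str, Any]] = {
    "rotate": {
        "sequence": ["skeletal_animation"],
        "defaults": {"direction": "left", "angle_deg": 30.0},
    },
    "move": {
        "sequence": ["space_transform"],
        "defaults": {"direction": "forward", "steps": 1},
    },
}


def resolve_behavior(instruction: Dict[str, Any],
                     model_state: Dict[str, Any]) -> Dict[str, Any]:
    """Look up the script for an instruction and fill in its parameters."""
    action = instruction["action_type"]
    if action not in BEHAVIOR_RULES:
        raise KeyError(f"no behavior script for action type {action!r}")
    script = BEHAVIOR_RULES[action]
    supplied = instruction.get("parameters", {})
    params = {}
    for name, default in script["defaults"].items():
        if name in supplied:            # value carried by the instruction
            params[name] = supplied[name]
        elif name in model_state:       # value taken from the current model state
            params[name] = model_state[name]
        else:                           # script default (assumption)
            params[name] = default
    return {
        "sequence": list(script["sequence"]),
        "target": instruction.get("target", "whole_model"),
        "parameters": params,
    }
```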
[0050] Furthermore, the 3D model animation engine is invoked, and based on the model behavior logic script, the target object of the action, and the calculated behavior parameters, the skeletal animation controller, material transformer, or space transformer of the preset 3D model file of the animated character are scheduled, including:
[0051] The behavior sequence definition in the model behavior logic script is parsed, and the behavior sequence is decomposed into multiple independent behavior units ordered by the time axis. Each behavior unit is associated with a specific controller type identifier.
[0052] For each behavior unit, a corresponding controller object is instantiated in the 3D model animation engine according to its associated controller type identifier. The controller object includes a skeletal animation controller object, a material transformer object, or a space transformer object.
[0053] The action target object is mapped to the internal node tree structure of the preset 3D model file of the animated character, and the bone joint node, mesh material node or spatial transformation node corresponding to the action target object is located.
[0054] The calculated behavior parameters are injected into the instantiated controller object. Specifically, bone rotation quaternions or translation vectors are injected into the skeletal animation controller object, texture coordinate offsets or color blending factors are injected into the material transformer object, and world coordinate translation matrices or Euler angle rotation data are injected into the space transformer object.
[0055] The scheduling queue of the 3D model animation engine is started. Based on the preset timestamp of each behavior unit in the behavior sequence, the corresponding controller object is activated in sequence, and the controller object is triggered to perform data writing operation on the model node it is bound to.
[0056] Before rendering each frame, the intermediate state data output by all active controller objects is read synchronously. The intermediate state data is then merged and written into the rendering state buffer of the preset 3D model file of the animated character to complete the real-time driving of the preset 3D model file of the animated character.
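The timestamp-ordered scheduling in [0051] to [0056] can be sketched with a priority queue. The class names `BehaviorUnit` and `Scheduler`, the string controller identifiers, and the dictionary used as a rendering state buffer are assumptions for illustration; a real animation engine would bind controllers to scene-graph nodes instead.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(order=True)
class BehaviorUnit:
    timestamp: float                                # preset start time, in seconds
    controller: str = field(compare=False)          # e.g. "skeletal", "material", "space"
    node: str = field(compare=False)                # model node the controller is bound to
    value: Tuple[float, ...] = field(compare=False)  # injected behavior parameters


class Scheduler:
    def __init__(self, units: Iterable[BehaviorUnit]):
        self._queue: List[BehaviorUnit] = list(units)
        heapq.heapify(self._queue)                  # earliest timestamp first
        self.render_state: Dict[str, Tuple[float, ...]] = {}

    def tick(self, now: float) -> List[BehaviorUnit]:
        """Activate every unit due at `now` and merge its output into the state buffer."""
        fired = []
        while self._queue and self._queue[0].timestamp <= now:
            unit = heapq.heappop(self._queue)
            self.render_state[f"{unit.controller}:{unit.node}"] = unit.value
            fired.append(unit)
        return fired
```

Calling `tick` once before each rendered frame matches the "read before rendering each frame" step; units that are not yet due stay in the queue.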
[0057] Furthermore, it also includes collision detection and response processing for the model's interaction with the environment in physical space:
[0058] Within the physical space coordinate framework, a simplified collision body mesh matching the outer surface of the model is generated from the preset 3D model file of the animated character.
[0059] Before the augmented reality display device renders each frame of the composite augmented reality image, the expected position and posture data of the preset 3D model file of the animated character at the next moment are obtained;
[0060] Extract high-density point cloud regions representing physical obstacles from a 3D point cloud map and convert them into obstacle collision body meshes.
[0061] The simplified collision body mesh and the obstacle collision body mesh are subjected to spatial intersection detection calculation to determine whether the preset 3D model file of the animated character will clip through or intersect with the physical obstacle in the next moment;
[0062] If a collision is detected, the model position offset vector and rotation adjustment amount required to avoid penetration are calculated based on the collision location and surface normal information.
[0063] The model position offset vector and rotation adjustment amount are applied to the coordinate transformation matrix of the preset 3D model file of the animated character, so that the model avoids physical obstacles or slides along the surface of obstacles during the final rendering, thus achieving an interactive display that conforms to the laws of physics.
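The intersection test and position offset in [0061] and [0062] can be illustrated with a deliberately simplified sketch. It assumes that both collision bodies are approximated by axis-aligned bounding boxes and computes only a translation offset along one axis; the rotation adjustment and the use of surface normals from a real mesh are not covered. The function name `penetration_offset` is hypothetical.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]
AABB = Tuple[Vec3, Vec3]  # (min corner, max corner)


def penetration_offset(model: AABB, obstacle: AABB) -> Optional[Vec3]:
    """Return the smallest single-axis shift that moves `model` out of `obstacle`.

    Returns None when the two boxes do not overlap.
    """
    (mmin, mmax), (omin, omax) = model, obstacle
    best_axis, best_shift = -1, 0.0
    for axis in range(3):
        push_pos = omax[axis] - mmin[axis]  # shift needed in the + direction
        push_neg = mmax[axis] - omin[axis]  # shift needed in the - direction
        if push_pos <= 0 or push_neg <= 0:
            return None                     # separated (or just touching) on this axis
        shift = push_pos if push_pos < push_neg else -push_neg
        if best_axis < 0 or abs(shift) < abs(best_shift):
            best_axis, best_shift = axis, shift
    offset = [0.0, 0.0, 0.0]
    offset[best_axis] = best_shift
    return (offset[0], offset[1], offset[2])
```

The returned vector would then be added to the translation part of the model's coordinate transformation before rendering.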
[0064] Furthermore, it also includes adaptive adjustment of model interaction focus based on user gaze tracking:
[0065] By using an eye-tracking sensor integrated into the augmented reality display device, the coordinates of the user's gaze point on the composite augmented reality image are obtained in real time;
[0066] Transform the coordinates of the gaze point from the screen coordinate system to the physical space coordinate system to obtain the focal position coordinates of the user's current gaze in three-dimensional physical space.
[0067] Calculate the distance between the coordinates of the focus position and each component of the preset 3D model file of the animated character in 3D space, and identify the closest model component as the potential interactive focus;
[0068] Determine whether the potential interaction focus remains within the user's field of vision for more than a preset gaze time threshold;
[0069] If the gaze time threshold is exceeded, visual enhancement processing is performed on the model component corresponding to the potential interaction focus, and the deep interaction instruction set associated with the model component is unlocked.
[0070] When an interaction command is received subsequently, the interaction action corresponding to the currently visually enhanced model component is matched first from the deep interaction command set.
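The nearest-component selection and dwell-time check in [0067] to [0069] can be sketched as follows. Each component is assumed to be represented by a single 3D reference point, and the threshold value is left to the caller because the text does not specify it. `GazeFocusTracker` and `nearest_component` are hypothetical names.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def nearest_component(focus: Vec3, components: Dict[str, Vec3]) -> str:
    """Name of the component whose reference point is closest to the gaze focus."""
    return min(components, key=lambda name: math.dist(focus, components[name]))


class GazeFocusTracker:
    def __init__(self, dwell_threshold_s: float):
        self.dwell_threshold_s = dwell_threshold_s
        self._candidate: Optional[str] = None
        self._since = 0.0

    def update(self, t: float, focus: Vec3,
               components: Dict[str, Vec3]) -> Optional[str]:
        """Return the component to enhance once the gaze has dwelt long enough."""
        name = nearest_component(focus, components)
        if name != self._candidate:
            # The potential interaction focus changed: restart the dwell timer.
            self._candidate, self._since = name, t
        if t - self._since > self.dwell_threshold_s:
            return name
        return None
```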
[0071] Compared with the prior art, the beneficial effects of the present invention are:
[0072] Spatial feature recognition and marker point detection are performed on the video stream data of the user's physical space collected in real time by the image sensor. A physical space coordinate frame is constructed, and the coordinate origin of the preset 3D model file of the animated character is spatially anchored and aligned with the position of the detected marker point established in the physical space coordinate frame. This can clarify the positioning benchmark of the 3D model in physical space, make the spatial correspondence between the model and the real scene more accurate, reduce the positional offset and misalignment of the model in the virtual-real fusion scene, enhance the placement stability of the 3D model in the real scene, optimize the visual presentation effect of virtual-real fusion, and avoid the interruption of the interactive experience caused by inaccurate positioning.
[0073] Voice command signals are acquired from the user's voice input channel. Voice keyword matching and semantic structure analysis are performed on these signals to parse out the interactive commands representing the user's intentions. Based on the type of interactive command, corresponding behavioral logic operations are executed on the pre-set 3D model file of the animated character. This drives the pre-set 3D model of the animated character to perform corresponding dynamic behaviors in the composite augmented reality scene. This expands the analytical dimensions of voice commands, accurately captures the user's true interactive intentions, and ensures a precise correspondence between the dynamic behavior of the 3D model and the user's commands. It enriches the dynamic expression of the model, strengthens the connection and adaptability between commands and behaviors during interaction, improves the smoothness and accuracy of the interaction, and enhances the user's interactive experience with the 3D model of the animated character.
Attached Figure Description
[0074] Figure 1 is a flowchart illustrating the steps of the interactive display method for 3D animated character models based on augmented reality as described in this invention;
[0075] Figure 2 is a flowchart for parsing interactive commands from voice command signals;
[0076] Figure 3 is a heatmap of the 4×4 pose transformation matrix for an AR camera;
[0077] Figure 4 shows the accuracy and response time performance of different interactive commands;
[0078] Figure 5 is a statistical chart showing the performance indicators of each module in an augmented reality 3D modeling system.
Detailed Implementation
[0079] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0080] Referring to Figure 1, an interactive display method for 3D animated character models based on augmented reality is proposed, and its overall implementation scheme is as follows:
[0081] The system acquires real-time video stream data of the user's physical space using an image sensor, and performs spatial feature recognition and marker detection on the video stream data to construct a physical space coordinate framework. A pre-set 3D model file of the animated character is loaded, and its coordinate origin is spatially anchored and aligned with the detected marker points within the physical space coordinate framework. Voice command signals are acquired from the user's voice input channel, and voice keyword matching and semantic structure analysis are performed to parse the interaction commands representing the user's intentions. On the augmented reality display device, using the marker point locations as a reference, the model corresponding to the pre-set 3D model file of the animated character is rendered onto the real-world scene generated from the video stream data, forming a composite augmented reality image that blends virtual and real elements. Based on the type of interaction command, corresponding behavioral logic operations are executed on the pre-set 3D model file of the animated character to drive corresponding dynamic behaviors within the composite augmented reality image.
[0082] In one embodiment of the present invention, the video stream data is acquired by an image sensor integrated into the augmented reality display device. The image sensor has a sampling rate of 30 frames per second and a resolution of 1920x1080 pixels. The video stream data is presented as a continuous sequence of frames in RGB format. Corner extraction and edge contour detection are performed on each frame of the video stream data. Corner extraction uses the Shi-Tomasi algorithm, and edge contour detection uses the Canny operator. The generated scene key feature point cloud is a set containing image pixel coordinates and feature descriptors. Feature point tracking and matching are performed on the scene key feature point clouds of five consecutive frames. Feature point tracking uses the Lucas-Kanade (LK) optical flow method to calculate the spatial displacement and viewpoint change parameters of successfully matched feature points between adjacent frames. The spatial displacement is the pixel offset on the two-dimensional image plane, and the viewpoint change parameters are expressed as a rotation matrix and a translation vector. Based on the spatial displacement and viewpoint change parameters, a three-dimensional point cloud map of the physical space is reconstructed using the structure-from-motion (SfM) algorithm. The embodiment adopts incremental SfM and estimates the six-degree-of-freedom motion pose of the image sensor relative to the origin of the reconstructed three-dimensional point cloud map at each time step. The pose is represented as a 4x4 homogeneous transformation matrix containing rotation and translation components.
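To make the corner extraction step more concrete, the following sketch computes the Shi-Tomasi corner response (the smaller eigenvalue of the gradient structure tensor) at one pixel. It is a pure-Python illustration on a grayscale image stored as nested lists, using central differences and an unweighted window; these simplifications, and the function name `shi_tomasi_score`, are assumptions. A production system would typically use an optimized library routine instead.

```python
import math
from typing import List

Image = List[List[float]]


def shi_tomasi_score(img: Image, y: int, x: int, half_window: int = 1) -> float:
    """Smaller eigenvalue of the gradient structure tensor around pixel (y, x)."""
    h, w = len(img), len(img[0])
    sxx = syy = sxy = 0.0
    for v in range(y - half_window, y + half_window + 1):
        for u in range(x - half_window, x + half_window + 1):
            if not (0 < v < h - 1 and 0 < u < w - 1):
                continue  # skip pixels that lack both neighbours
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0  # horizontal gradient
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0  # vertical gradient
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] are (trace +/- root) / 2.
    trace = sxx + syy
    root = math.sqrt((sxx - syy) ** 2 + 4.0 * sxy * sxy)
    return (trace - root) / 2.0
```

A pixel is kept as a corner when this score is high: on a straight edge one eigenvalue is near zero, so only true corners, where the gradient varies in two directions, score well.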
[0083] In some embodiments, image regions satisfying preset geometric shapes and pattern features are searched within the 3D point cloud map. The preset geometric shapes include squares, circles, or specific QR code outlines, and the preset pattern features are predefined binary image templates. Regions meeting these conditions are determined to be usable augmented reality markers. A unique spatial identifier is assigned to each augmented reality marker; this identifier is a globally unique identifier of 16 bytes. The 3D position coordinates of each augmented reality marker in the 3D point cloud map are recorded; these coordinates are floating-point numbers in meters. The spatial position coordinates and identifiers of all augmented reality markers are integrated, combined with the real-time motion pose of the image sensor, to form a physical spatial coordinate framework that includes mappings between an absolute coordinate system, a marker coordinate system, and a screen coordinate system. The origin of the absolute coordinate system is located at the starting point of the 3D point cloud map. The marker coordinate system has its origin at the center of each augmented reality marker, and the screen coordinate system is based on the imaging plane of the image sensor. The mapping relationships are defined by a series of rigid body transformation matrices.
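The marker bookkeeping described above (a 16-byte globally unique identifier plus floating-point coordinates in meters) can be sketched as a small registry. Generating the identifier with `uuid.uuid4()` is an assumption; the text only specifies the identifier's size. `Marker` and `MarkerRegistry` are hypothetical names.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Marker:
    marker_id: bytes                        # 16-byte globally unique identifier
    position_m: Tuple[float, float, float]  # position in the point cloud map, in meters


class MarkerRegistry:
    def __init__(self) -> None:
        self._markers: Dict[bytes, Marker] = {}

    def register(self, position_m: Sequence[float]) -> Marker:
        """Assign a fresh identifier to a detected marker and store its position."""
        x, y, z = (float(c) for c in position_m)
        marker = Marker(uuid.uuid4().bytes, (x, y, z))
        self._markers[marker.marker_id] = marker
        return marker

    def lookup(self, marker_id: bytes) -> Marker:
        """Return the marker stored under `marker_id` (raises KeyError if unknown)."""
        return self._markers[marker_id]
```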
[0084] In practice, the system reads the center point coordinates of the animated character's preset 3D model file. This file is an FBX format 3D mesh file, and the center point coordinates are stored in the metadata area of the file header. The system then retrieves the center point coordinates within the local coordinate system of the animated character's preset 3D model file; each coordinate is a 3D vector. From the physical space coordinate framework, the system queries the spatial identifier of the selected augmented reality marker point as the display reference. This identifier is determined by user selection on the screen or by system preset rules. The system then retrieves the 3D world coordinates of this augmented reality marker point within the physical space coordinate system. Finally, the system calculates the coordinate transformation matrix from the center point coordinates of the animated character's preset 3D model file to the augmented reality marker point. This matrix is obtained using a point set registration algorithm that solves a least-squares problem. The transformation matrix includes a translation vector, a rotation vector, and a scaling factor. The translation vector is a 3D vector, the rotation vector is a unit quaternion, and the scaling factor is a scalar. The coordinate transformation matrix is applied to the vertex data of the preset 3D model file of the animated character. The application operation is to perform a linear transformation on the position of each vertex, so that the spatial position of the model center point of the preset 3D model file of the animated character completely coincides with the spatial position of the augmented reality marker point, with an overlap error within 1 mm.
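Applying a transformation made of a translation vector, a unit quaternion, and a scalar factor to the vertex data can be sketched as below. The sketch takes the three components as given and does not perform the least-squares point set registration that produces them. It assumes the (w, x, y, z) quaternion convention and that scaling and rotation are applied about the model center point; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), assumed unit length


def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate v by the unit quaternion q using v' = v + w*t + q_vec x t, t = 2*(q_vec x v)."""
    w, x, y, z = q
    tx = 2.0 * (y * v[2] - z * v[1])
    ty = 2.0 * (z * v[0] - x * v[2])
    tz = 2.0 * (x * v[1] - y * v[0])
    return (
        v[0] + w * tx + (y * tz - z * ty),
        v[1] + w * ty + (z * tx - x * tz),
        v[2] + w * tz + (x * ty - y * tx),
    )


def anchor_vertices(vertices: Sequence[Vec3], center: Vec3, marker: Vec3,
                    rotation: Quat, scale: float) -> List[Vec3]:
    """Scale and rotate about the model center, then move the center onto the marker."""
    out = []
    for v in vertices:
        local = (scale * (v[0] - center[0]),
                 scale * (v[1] - center[1]),
                 scale * (v[2] - center[2]))
        r = rotate(rotation, local)
        out.append((r[0] + marker[0], r[1] + marker[1], r[2] + marker[2]))
    return out
```

With this construction the model center point always maps exactly onto the marker position, whatever rotation and scale are chosen.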
[0085] It can be understood that establishing a dynamic coordinate mapping relationship between the local coordinate system of the pre-set 3D model file of the animated character and the physical space coordinate system is a function mapping. Its input is the real-time pose matrix of the image sensor, and its output is the projection matrix of the pre-set 3D model file of the animated character in the screen coordinate system. When the pose of the image sensor changes, the pre-set 3D model file of the animated character can maintain its anchoring relationship with the physical marker points in the augmented reality scene according to the dynamic coordinate mapping relationship. The anchoring relationship means that the position, rotation, and scaling of the virtual model in the 3D world remain constant relative to the physical marker points. The calculation of the dynamic coordinate mapping relationship involves the camera intrinsic matrix, the camera extrinsic matrix, and the 3D coordinates of the marker points. Its mapping formula is expressed as:
[0086] $$\mathbf{p}_{\text{screen}} = K \, T_{\text{ext}} \, T_{\text{marker}} \, T_{\text{model}} \, \mathbf{p}_{\text{model}}$$
[0087] where $\mathbf{p}_{\text{screen}}$ denotes the homogeneous coordinates of a model vertex in the screen coordinate system; $K$ is the intrinsic parameter matrix of the image sensor; $T_{\text{ext}}$ is the extrinsic parameter matrix of the image sensor, determined by the current motion pose; $T_{\text{marker}}$ is the transformation matrix from the physical space coordinate system to the marker point coordinate system, calculated from the three-dimensional world coordinates of the marker point; $T_{\text{model}}$ is the transformation matrix from the marker point coordinate system to the model's local coordinate system, defined by the coordinate transformation matrix; and $\mathbf{p}_{\text{model}}$ denotes the homogeneous coordinates of the model vertex in the model's local coordinate system.
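As a worked illustration of the dynamic coordinate mapping, the sketch below chains an intrinsic matrix, an extrinsic matrix, and the two marker-related transformation matrices in the order in which the text lists them, then divides by the homogeneous coordinate to obtain pixel coordinates. The 3x4 form of the intrinsic matrix, the variable names, and the numeric values used in the usage note are assumptions.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def identity4() -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    m = identity4()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m


def intrinsics(fx: float, fy: float, cx: float, cy: float) -> Matrix:
    """3x4 intrinsic matrix acting on homogeneous camera-space points."""
    return [[fx, 0.0, cx, 0.0],
            [0.0, fy, cy, 0.0],
            [0.0, 0.0, 1.0, 0.0]]


def project_vertex(K: Matrix, extrinsic: Matrix, marker_tf: Matrix,
                   model_tf: Matrix, vertex: Sequence[float]) -> Tuple[float, float]:
    """Map a model-space vertex to pixel coordinates through the matrix chain."""
    chain = matmul(matmul(matmul(K, extrinsic), marker_tf), model_tf)
    column = [[vertex[0]], [vertex[1]], [vertex[2]], [1.0]]
    u, v, w = (row[0] for row in matmul(chain, column))
    return (u / w, v / w)
```

For example, with a hypothetical focal length of 1000 pixels, a principal point at the center of a 1920x1080 frame, an identity extrinsic matrix, and a marker 2 m in front of the camera, a vertex 0.1 m to the right of the marker projects 50 pixels to the right of the image center. When the sensor pose changes, only the extrinsic matrix needs to be updated for the model to stay anchored.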
[0088] In one embodiment of the present invention, referring to Figure 2, the acquired raw speech command signal is subjected to noise reduction and gain control processing. The raw speech command signal is a single-channel PCM audio stream with a 16-bit sample depth and a 16 kHz sampling rate. Noise reduction is performed using spectral subtraction, and gain control is performed using an automatic gain control algorithm. The signal-to-noise ratio of the processed clean speech waveform signal is improved to over 20 dB. Speech endpoint detection is performed on the clean speech waveform signal using a dual-threshold method based on short-time energy and zero-crossing rate to segment speech segments containing valid commands. The start and end points of the speech segments are marked with timestamps. Acoustic feature vectors are extracted from the speech segments. The acoustic feature vectors are 39-dimensional Mel-frequency cepstral coefficients computed over 25-millisecond frames with a 10-millisecond frame shift. The extracted acoustic feature vectors are input into a pre-trained acoustic model, which is a deep feedforward neural network with 5 hidden layers. The probability distribution of the phoneme sequence is calculated and represented as a sequence of probability vectors whose dimension equals the number of phonemes. The phoneme sequence is decoded based on a statistical language model, which is a trigram model. The decoding process uses the Viterbi algorithm to generate a text instruction string corresponding to the speech command signal. The text instruction string is a Unicode-encoded character sequence.
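The dual-threshold endpoint detection step can be sketched as follows. Frames whose short-time energy reaches a high threshold seed a speech segment, which is then extended in both directions while the energy stays above a low threshold or the zero-crossing rate stays above its own threshold. The threshold values are left to the caller because the text does not give them, and the function names are hypothetical; noise reduction, gain control, and feature extraction are outside this sketch.

```python
from typing import List, Sequence, Tuple

Features = List[Tuple[float, float]]  # (short-time energy, zero-crossing rate) per frame


def frame_features(samples: Sequence[float], frame_len: int, hop: int) -> Features:
    """Compute short-time energy and zero-crossing rate for each frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


def detect_endpoints(feats: Features, energy_high: float, energy_low: float,
                     zcr_low: float) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) index pairs of detected speech segments."""
    def loose(k: int) -> bool:
        # Relaxed test used to extend a segment around its high-energy core.
        return feats[k][0] > energy_low or feats[k][1] > zcr_low

    segments = []
    i, n = 0, len(feats)
    while i < n:
        if feats[i][0] < energy_high:
            i += 1
            continue
        start = end = i
        while start > 0 and loose(start - 1):
            start -= 1
        while end + 1 < n and loose(end + 1):
            end += 1
        segments.append((start, end))
        i = end + 1
    return segments
```

With 16 kHz audio, a 25 ms frame and a 10 ms shift correspond to `frame_len=400` and `hop=160`; multiplying a frame index by the hop and dividing by the sampling rate gives the timestamp of a segment boundary.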
[0089] In some embodiments, text instruction strings are matched against a preset interactive instruction keyword library, stored in an SQLite database, which includes instruction trigger phrases, a list of action verbs, a list of model component names, and a list of parameter descriptors. Keyword combinations containing core action verbs and target object nouns are identified. These keyword combinations are tuple structures containing instruction trigger phrases, action verbs, model component names, and parameter descriptors. Dependency parsing is performed on the identified keyword combinations using a transition-based dependency parser to determine the syntactic relationship between actions and objects, thereby parsing structured interactive instructions. These structured interactive instructions are JSON-formatted data objects containing action type, action target object, and optional parameter fields.
[0090] In the specific implementation, an interactive instruction keyword library containing multiple categories of action commands is loaded. This library includes instruction trigger phrases such as "hello," "jump," "turn around," "wave," and "pick up"; action verbs such as "move," "rotate," "play," "change," and "hide"; model component names such as "head," "left hand," "right hand," "body," and "cloak"; and parameter descriptors such as "left," "right," "fast," "slow," and "twice." A forward maximum matching algorithm with a scanning window of 8 characters scans the text instruction string and searches the interactive instruction keyword library for the longest matching phrase at each position; for example, a matched "hello" is marked as the instruction trigger phrase. Once the instruction trigger phrase has been located, the remainder of the string is scanned for a matching action verb in the action verb list; in "move forward two steps," for example, the action verb "move" is matched. After the action verb is identified, the nearby string range is searched for a matching model component name in the model component name list. If no model component name is found, the target object of the action defaults to the entire 3D model of the animated character. If a model component name is found, the nearby strings are further searched for parameter descriptors that may accompany the action, such as movement direction, amplitude, or frequency. For example, in "turn head to the left," the model component name "head" and the parameter descriptor "left" are matched after the action verb "turn." The successfully matched instruction trigger phrase, action verb, model component name, and parameter descriptor are combined to form the keyword combination for this voice interaction, which is encapsulated as a structure variable.
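As an illustration of the forward maximum matching step, the following sketch scans an English command against the example keyword lists. It is a simplified reading rather than the patented implementation: the window is widened from 8 to 12 characters because the English sample phrases are longer than their Chinese originals, a word-boundary check is added for English text, and the helper names (`forward_maximum_match`, `build_keyword_combination`) are hypothetical.

```python
import json

# Example entries taken from the description, grouped into the four lists
# of the interactive instruction keyword library.
KEYWORD_LIBRARY = {
    "trigger": ["hello", "jump", "turn around", "wave", "pick up"],
    "verb": ["move", "rotate", "play", "change", "hide"],
    "component": ["head", "left hand", "right hand", "body", "cloak"],
    "parameter": ["left", "right", "fast", "slow", "twice"],
}
LEXICON = {phrase: cat for cat, phrases in KEYWORD_LIBRARY.items() for phrase in phrases}

# Assumption: 12 instead of the 8-character window used for Chinese text.
MAX_WINDOW = 12


def _on_word_boundary(text, start, end):
    """True if text[start:end] is not glued to letters or digits on either side."""
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok


def forward_maximum_match(text, lexicon=LEXICON, max_window=MAX_WINDOW):
    """Return (category, phrase, start_index) for each longest match, left to right."""
    text = text.lower()
    matches, i = [], 0
    while i < len(text):
        found = None
        # Try the widest window first and shrink until a library entry matches.
        for size in range(min(max_window, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon and _on_word_boundary(text, i, i + size):
                found = candidate
                break
        if found:
            matches.append((lexicon[found], found, i))
            i += len(found)
        else:
            i += 1
    return matches


def build_keyword_combination(text):
    """Keep the first match per category; default the target to the whole model."""
    combo = {"trigger": None, "verb": None, "component": None, "parameter": None}
    for category, phrase, _ in forward_maximum_match(text):
        if combo[category] is None:
            combo[category] = phrase
    if combo["component"] is None:
        combo["component"] = "entire model"
    return combo


combo = build_keyword_combination("Hello, move the left hand fast")
print(json.dumps(combo))
```

Because the longest candidate is tried first, "left hand" is matched as a model component name before the shorter parameter descriptor "left" can claim the same position.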
[0091] It should be understood that a formula for calculating keyword matching confidence is defined in the voice keyword matching process. The calculation considers the length of the matched words in the text instruction string, the frequency of the words in the keyword library, and the position of the words in the sentence. Keyword matching confidence is used to select the most likely match from multiple candidate results. The calculation of keyword matching confidence follows the formula:
[0092]
[0093] Where, in the order the terms appear in the formula: the first is the confidence score of the current matching keyword combination; the second is the total number of characters covered by the currently matched keyword combination in the text instruction string; the third is the total character length of the text instruction string; the fourth is the frequency count of the currently matched keyword combination in the historical usage records of the interactive instruction keyword library; the fifth is the starting index of the matched keyword combination in the text instruction string; and the last is the preset reference index for the expected location of the instruction trigger phrase, typically the sentence-initial index 0. This formula quantifies the reliability of a single match, and the highest-scoring keyword combination is adopted as the final parsing result. For the example "move forward two steps," the parsed structured interaction command sets the action type to "move," the action target object to "the entire model," and the parameter fields to "direction: forward, steps: 2."
[0094] In one embodiment of the present invention, on an augmented reality display device, using the location of a marker point as a reference, the model corresponding to the preset 3D model file of the animated character is rendered onto the real-world scene generated by the video stream data, forming a composite augmented reality scene that blends virtual and real elements. The process involves receiving the latest video stream data frames captured by the image sensor in real time and decoding them into a real-world texture bitmap. Based on the physical space coordinate frame and the current pose of the image sensor, the virtual camera view matrix and projection matrix corresponding to the current video frame are calculated. Based on the anchoring position of the preset 3D model file of the animated character in physical space and the dynamic coordinate mapping relationship, the vertex coordinate transformation data of the preset 3D model file of the animated character under the current virtual camera view is calculated. The calculated vertex coordinate transformation data is input into the graphics rendering pipeline, and combined with the loaded material textures and skeletal animation data of the preset 3D model file of the animated character, a rendered image of the preset 3D model file of the animated character under the current view is generated. The generated rendered image is pixel-level fused with the real-world texture bitmap, where opaque pixels in the rendered image cover the corresponding pixels in the real-world texture bitmap, while transparent pixels display the original content of the real-world texture bitmap, thereby synthesizing a composite augmented reality image that includes virtual animated character models superimposed on a real scene, and outputting it to an augmented reality display device.
[0095] In practice, the latest video stream data frames captured by the image sensor are received in real time. The image sensor outputs an uncompressed video stream in YUY2 format at 60 frames per second, which is decoded into a real-world texture bitmap; decoding includes color space conversion from YUY2 to RGBA and yields a bitmap with the same resolution as the sensor output. The virtual camera view matrix and projection matrix corresponding to the current video frame are then calculated from the physical space coordinate frame, which includes a transformation matrix from the world coordinate system to the marker point coordinate system, and from the current pose of the image sensor, which is a 4x4 pose matrix. The view matrix is calculated from the camera position and orientation, and the projection matrix is calculated from the camera's focal length, imaging plane size, and near and far clipping plane distances, with the near clipping plane at 0.1 meters and the far clipping plane at 100 meters. Finally, the vertex coordinate transformation data of the preset 3D model file of the animated character under the current virtual camera view is calculated from the anchoring position of the model in physical space, which is a 3D world coordinate, and from the dynamic coordinate mapping relationship, which is the transformation function from the model's local coordinate system to the world coordinate system. The vertex coordinate transformation data includes the homogeneous coordinates after model transformation, view transformation, and projection transformation.
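The chain of model, view, and projection transformations can be sketched as follows. The sketch assumes an OpenGL-style convention (camera looking down the negative Z axis, normalized depth in [-1, 1]); the focal length and imaging plane size are hypothetical values, and only the 0.1 m and 100 m clipping distances come from the description.

```python
def perspective_matrix(focal_length, plane_width, plane_height, near=0.1, far=100.0):
    """OpenGL-style projection built from focal length and imaging plane size.

    focal_length, plane_width and plane_height must share one unit (e.g. mm).
    """
    return [
        [2.0 * focal_length / plane_width, 0.0, 0.0, 0.0],
        [0.0, 2.0 * focal_length / plane_height, 0.0, 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def translation_matrix(tx, ty, tz):
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, v):
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def to_ndc(clip):
    """Perspective divide: clip coordinates to normalized device coordinates."""
    return [clip[i] / clip[3] for i in range(3)]


# Hypothetical camera: 4.0 mm focal length, 6.4 mm x 3.6 mm imaging plane.
projection = perspective_matrix(4.0, 6.4, 3.6)
view = translation_matrix(0.0, 0.0, 0.0)      # camera at the world origin
model = translation_matrix(0.0, 0.0, -2.0)    # model anchored 2 m in front

mvp = mat_mul(projection, mat_mul(view, model))
vertex_clip = transform(mvp, [0.0, 0.0, 0.0, 1.0])   # model-space origin
vertex_ndc = to_ndc(vertex_clip)
print(vertex_ndc)
```

Points on the near and far clipping planes map to normalized depths of -1 and 1 respectively, so a vertex anchored between 0.1 m and 100 m from the camera survives clipping.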
[0096] In some embodiments, the calculated vertex coordinate transformation data is input into the graphics rendering pipeline, which follows the standard OpenGL ES 3.0 rendering pipeline and includes stages such as vertex shader, primitive assembly, rasterization, and fragment shader. Combining the material textures and skeletal animation data of the loaded pre-defined 3D model file of the animated character (the material textures are PNG images with a resolution of 2048x2048, and the skeletal animation data contains a sequence of keyframe bone transformation matrices), a rendered image of the pre-defined 3D model file of the animated character is generated from the current viewpoint. The rendered image is an RGBA pixel array in an off-screen rendering buffer. The generated rendered image is then pixel-wise blended with the real-world texture bitmap. This pixel-wise blending operation is performed in the fragment shader, where opaque pixels in the rendered image cover the corresponding pixels in the real-world texture bitmap, while transparent pixels display the original content of the real-world texture bitmap. The opaqueness of a pixel is determined by its alpha channel value being greater than a threshold of 0.95. The composite augmented reality image, which includes virtual animated character models superimposed on a real scene, has a resolution of 1920x1080 and is output to an augmented reality display device, which is an optical see-through head-mounted display.
[0097] In practice, the specific calculation of vertex coordinate transformation involves the final transformation matrix of the model in world space. This matrix is formed by concatenating the coordinate transformation matrix determined during spatial anchoring with possible dynamic behavior adjustment matrices. The fusion process of the rendered image and the real-world texture bitmap follows a specific transparency blending equation. The mathematical expression of this pixel blending operation is as follows:
[0098] $$C_{out}(x, y) = \alpha_{r}(x, y)\, C_{r}(x, y) + \bigl(1 - \alpha_{r}(x, y)\bigr)\, C_{b}(x, y)$$
[0099] Where: $C_{out}(x, y)$ is the final color value of the output composite augmented reality image at pixel coordinates $(x, y)$, a four-dimensional vector containing the four RGBA channels; $C_{r}(x, y)$ is the color value of the rendered image at pixel coordinates $(x, y)$; $\alpha_{r}(x, y)$ is the alpha channel value of the rendered image at pixel coordinates $(x, y)$, which ranges from 0 to 1; and $C_{b}(x, y)$ is the color value of the real-world texture bitmap at pixel coordinates $(x, y)$. The blending process is performed independently on each pixel. When the alpha channel value of the rendered image pixel is 1, the final color is entirely determined by the rendered image. When the alpha channel value is 0, the final color is entirely determined by the real-world texture bitmap.
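A minimal sketch of this per-pixel transparency blending is shown below. Pixels are represented as RGBA tuples with channel values in [0, 1]; the function names are illustrative, and a production pipeline would perform the same arithmetic in the fragment shader rather than in Python.

```python
def blend_pixel(render_rgba, real_rgba):
    """Blend one rendered pixel over one real-world pixel using the rendered alpha."""
    alpha = render_rgba[3]
    return tuple(alpha * r + (1.0 - alpha) * b
                 for r, b in zip(render_rgba, real_rgba))


def blend_image(render, real):
    """Apply blend_pixel independently to every pixel of two equally sized images."""
    return [[blend_pixel(rp, bp) for rp, bp in zip(render_row, real_row)]
            for render_row, real_row in zip(render, real)]


# A 1x3 example: opaque red, fully transparent, and half-transparent red
# rendered over a solid blue real-world background.
render = [[(1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.5)]]
real = [[(0.0, 0.0, 1.0, 1.0)] * 3]
composite = blend_image(render, real)
print(composite)
```

The first pixel keeps the rendered color, the second keeps the real-world color, and the third is an even mix of the two.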
[0100] It should be understood that when the graphics rendering pipeline processes model vertex data, it performs vertex skinning calculations based on the skeletal animation data. Vertex skinning applies a weighted deformation to the static model vertices according to the transformation matrices of their associated bones. The data used for skinning comes from the bone weights and bone index information defined in the preset 3D model file of the animated character. During the rendering of each frame, the vertex shader interpolates the transformation matrix of each bone from the skeletal animation data based on the current animation timestamp and applies the transformation to the vertex. Table 1 shows an example of the bones and weights associated with individual vertices in a simplified model.
[0101] Table 1. Vertex Bone Weights of the Model
[0102]
[0103] In Table 1, the vertex index is a unique identifier for a vertex in the model mesh. Associated bone indices 1 to 4 point to specific bones in the bone list of the preset 3D model file for the animated character. Associated bone weights 1 to 4 represent the influence weight of the corresponding bone on the vertex position, with the sum of all weights being 1. An index value of -1 indicates that the vertex is not associated with this bone slot. Skinning calculations, based on these weights, superimpose the transformations of multiple bones onto the vertex to achieve smooth joint deformation animation. The vertex data after skinning calculations and view projection transformations are combined with texture-sampled material colors to finally generate the rendered image.
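The weighted superposition of bone transformations can be sketched as linear blend skinning. The bone matrices, indices, and weights below are hypothetical values chosen for illustration and are not the contents of Table 1; the only conventions taken from the description are the four bone slots, weights summing to 1, and -1 marking an unused slot.

```python
def skin_vertex(position, bone_indices, bone_weights, bone_matrices):
    """Linear blend skinning: weighted sum of the vertex transformed by each bone."""
    v = (position[0], position[1], position[2], 1.0)
    out = [0.0, 0.0, 0.0]
    for index, weight in zip(bone_indices, bone_weights):
        if index == -1:          # unused bone slot
            continue
        m = bone_matrices[index]
        for row in range(3):
            out[row] += weight * sum(m[row][col] * v[col] for col in range(4))
    return tuple(out)


IDENTITY = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
SHIFT_X_2 = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Hypothetical vertex influenced 75% by bone 0 and 25% by bone 1.
skinned = skin_vertex(
    position=(1.0, 0.0, 0.0),
    bone_indices=(0, 1, -1, -1),
    bone_weights=(0.75, 0.25, 0.0, 0.0),
    bone_matrices=[IDENTITY, SHIFT_X_2],
)
print(skinned)
```

Bone 0 leaves the vertex at x = 1 and bone 1 moves it to x = 3, so the blended position lies a quarter of the way between them.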
[0104] Referring to Figure 3, which shows a heatmap of the 4×4 pose transformation matrix of the AR camera at the spatial coordinate calculation stage, the matrix fully describes the rigid body transformation of the AR camera from its local coordinate system to the world coordinate system and is the core data foundation of the virtual-real fusion rendering process. Specifically, this 4×4 matrix follows the homogeneous coordinate transformation convention. The first 3 rows and first 3 columns constitute the rotation submatrix; the first 3 elements of the fourth row are all 0, which keeps the transformation affine; the first 3 rows of the fourth column form the translation vector; and the 1.00 in the fourth row and fourth column is the homogeneous coordinate placeholder that keeps matrix operations closed. Characteristics of the rotation submatrix: the diagonal values in row 1 column 1 (0.98), row 2 column 2 (0.99), and row 3 column 3 (0.98) are close to 1, and the absolute values of the off-diagonal elements (such as 0.05 in row 1 column 2 and -0.15 in row 1 column 3) are small, indicating that the camera pose in the current frame is close to the identity rotation with only a small angular deflection, which is consistent with stable camera motion in physical space. Characteristics of the translation vector: row 1 column 4 (1.25), row 2 column 4 (0.62), and row 3 column 4 (2.10) constitute the camera's position offset in the world coordinate system. The maximum value, 2.10 in row 3 column 4, corresponds to the main displacement in the depth direction, consistent with the camera facing the marker point in the AR scene. Physical meaning: this matrix is used directly to calculate the virtual camera view matrix, which converts the local coordinates of the animated character's 3D model into clip coordinates from the camera's perspective.
This is a key geometric constraint for achieving accurate overlay of the model and the real-world scene. The color gradient of the heatmap intuitively reflects the numerical distribution of the matrix elements: red areas (such as Row 3, Column 4) represent larger translation components, blue areas represent rotation or translation components close to 0, and gray areas represent diagonal elements close to 1, providing a visual numerical reference for coordinate transformation calculations in the rendering pipeline. In practical applications of the rendering pipeline, this pose matrix is cascaded with the transformation matrix from the model's local coordinate system to the world coordinate system, jointly completing the complete transformation of vertex coordinates from model space to clipping space, ensuring stable anchoring and real-time rendering of the virtual animated character model within the physical space coordinate framework.
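To illustrate how a camera pose matrix of this kind yields a view matrix, the sketch below inverts a rigid transform using the transpose of its rotation block. The translation values are those quoted for Figure 3; the rotation is a hypothetical 8-degree turn about the Y axis, because the description does not list every matrix element.

```python
import math


def invert_rigid(pose):
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] as [R^T | -R^T t; 0 0 0 1]."""
    r = [row[:3] for row in pose[:3]]
    t = [row[3] for row in pose[:3]]
    r_t = [[r[c][row] for c in range(3)] for row in range(3)]   # transpose of R
    neg = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [neg[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Camera-to-world pose: small rotation about Y (assumed) plus the quoted offset.
angle = math.radians(8.0)
c, s = math.cos(angle), math.sin(angle)
camera_pose = [
    [c, 0.0, s, 1.25],
    [0.0, 1.0, 0.0, 0.62],
    [-s, 0.0, c, 2.10],
    [0.0, 0.0, 0.0, 1.0],
]

view_matrix = invert_rigid(camera_pose)          # world-to-camera transform
identity_check = mat_mul(view_matrix, camera_pose)
```

Using the transpose instead of a general matrix inverse is valid only while the upper-left block is a pure rotation, which is the rigid-body case the paragraph describes.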
[0105] In one embodiment of the present invention, a preset behavior logic rule library is queried based on the action type in the parsed interaction command, where the action type is a string identifier such as "move," "rotate," "play animation," or "change color." The behavior logic rule library is a collection of rules stored in a JSON file. The query returns the corresponding model behavior logic script, which describes the behavior sequence, behavior parameters, and behavior triggering conditions in a custom XML structure. The action target object is then extracted from the interaction command; for example, the field value "right hand" corresponds to a specific body part of the animated character's 3D model. Based on the model behavior logic script, the required values for the behavior parameters are calculated. These values come either from parameters carried in the interaction command or from default values obtained from the current model state. For example, if the interaction command carries the parameter "amplitude: large," and the model behavior logic script defines "large" as a rotation angle of 60 degrees, the rotation angle is set to 60 degrees. The 3D model animation engine, a real-time animation system based on a scene graph, is then invoked. Based on the model behavior logic script, the action target object, and the calculated behavior parameters, it schedules the skeletal animation controller, material transformer, or space transformer of the preset 3D model file of the animated character. The skeletal animation controller drives the model's skeletal system to generate movement, the material transformer changes the model's surface texture or color, and the space transformer changes the model's position, orientation, or scale.
The execution results of the controller and transformers are applied in real time to the preset 3D model file of the animated character being rendered, causing it to exhibit dynamic behavior changes corresponding to interactive commands in the composite augmented reality scene. For example, the model's right-hand skeleton rotates, causing the right-hand mesh model to make a waving gesture.
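The rule lookup and parameter resolution described above might look like the following sketch. The JSON layout, the script file names, and the "small" and "medium" angles are assumptions made for illustration; only the mapping of "large" to 60 degrees is taken from the description.

```python
import json

# Hypothetical rule library content. Only "large" -> 60 comes from the text.
RULE_LIBRARY_JSON = """
{
  "rotate": {
    "script": "rotate.xml",
    "amplitude_degrees": {"small": 15, "medium": 30, "large": 60},
    "default_amplitude": "medium"
  },
  "move": {"script": "move.xml"}
}
"""


def resolve_behavior(command, rules):
    """Look up the behavior script and turn symbolic parameters into numbers."""
    rule = rules[command["action_type"]]
    params = command.get("parameters", {})
    result = {
        "script": rule["script"],
        "target": command.get("target", "entire model"),
    }
    if "amplitude_degrees" in rule:
        # Use the parameter carried by the command, else the rule's default.
        amplitude = params.get("amplitude", rule["default_amplitude"])
        result["rotation_degrees"] = rule["amplitude_degrees"][amplitude]
    return result


rules = json.loads(RULE_LIBRARY_JSON)
command = {
    "action_type": "rotate",
    "target": "right hand",
    "parameters": {"amplitude": "large"},
}
resolved = resolve_behavior(command, rules)
print(resolved)
```

Keeping the symbolic-to-numeric mapping inside the rule entry means a new amplitude level can be added without touching the parsing code.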
[0106] In practical implementation, the behavior sequence definitions in the model behavior logic script are parsed. The behavior sequence defined by the <sequence> tag in the model behavior logic script is decomposed into multiple independent behavior units ordered along the timeline. Each behavior unit is associated with a controller type identifier, one of "BONE_ANIM", "MATERIAL", or "TRANSFORM". For each behavior unit, a corresponding controller object is instantiated in the 3D model animation engine based on its controller type identifier: a skeletal animation controller object, a material transformer object, or a space transformer object. Instantiation calls the engine's factory method to create an object of the corresponding class. The action target object is mapped to the internal node tree structure of the preset 3D model file of the animated character. The mapping is achieved by string matching on node names, locating the skeletal joint node, mesh material node, or space transformer node that corresponds to the action target object. For example, the action target object "right hand" is mapped to the skeletal joint node named "RightHand". The calculated behavior parameters are injected into the instantiated controller objects: bone rotation quaternions or translation vectors for skeletal animation controller objects, texture coordinate offsets or color blending factors for material transformer objects, and world coordinate translation matrices or Euler angle rotation data for space transformer objects. The scheduling queue of the 3D model animation engine is then activated. This queue is a priority-based timed task queue that, according to the preset timestamps of each behavior unit in the behavior sequence, activates the corresponding controller objects in order and triggers them to write data to their bound model nodes.
Before rendering each frame, intermediate state data output by all active controller objects is read synchronously. This intermediate state data represents the model state data that has not yet been submitted to the final rendering buffer. The intermediate state data is merged and uniformly written into the rendering state buffer of the animated character's preset 3D model file, thus completing the real-time driving of the animated character's preset 3D model file.
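A priority-based timed task queue of the kind described can be sketched with Python's `heapq`. The class names and the two example behavior units are hypothetical; "RightHand" is the node name used in the description, while "RightArm" and all timings are invented for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class BehaviorUnit:
    start_ms: int                              # the only field used for ordering
    duration_ms: int = field(compare=False)
    controller: str = field(compare=False)     # "BONE_ANIM", "MATERIAL" or "TRANSFORM"
    target_node: str = field(compare=False)


class BehaviorScheduler:
    """Activates behavior units at their start time and retires them when done."""

    def __init__(self, units):
        self._pending = list(units)
        heapq.heapify(self._pending)           # earliest start time on top
        self._active = []

    def tick(self, now_ms):
        """Return the (controller, node) pairs that are active at now_ms."""
        while self._pending and self._pending[0].start_ms <= now_ms:
            self._active.append(heapq.heappop(self._pending))
        self._active = [u for u in self._active
                        if now_ms < u.start_ms + u.duration_ms]
        return [(u.controller, u.target_node) for u in self._active]


scheduler = BehaviorScheduler([
    BehaviorUnit(500, 1000, "BONE_ANIM", "RightHand"),
    BehaviorUnit(0, 500, "BONE_ANIM", "RightArm"),
])
at_start = scheduler.tick(0)
mid_wave = scheduler.tick(600)
finished = scheduler.tick(2000)
print(at_start, mid_wave, finished)
```

Calling `tick` once per frame before rendering gives the set of controllers whose intermediate state should be merged for that frame.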
[0107] In some embodiments, the model behavior logic script defines the timeline of the behavior sequence and the specific parameters of each behavior unit. Referring to Table 2, a sequence of behavior units is shown after parsing a simplified model behavior logic script fragment for the "waving" action.
[0108] Table 2: Behavioral Unit Sequence Table of the "Waving" Action Model Behavioral Logic Script
[0109]
[0110] In Table 2, the behavior unit identifier is a unique identifier defined in the script. The controller type identifier "BONE_ANIM" indicates that the behavior unit is executed by a skeletal animation controller object. The bound target node is the specific bone node name in the node tree within the preset 3D model file of the animated character. The start time and duration define the activation period of this behavior unit on the global timeline. The injection parameter description describes the specific actions that the controller object needs to perform in natural language, which are converted into specific numerical values, such as quaternions, during actual injection. The 3D model animation engine uses this table to schedule the activation and execution of each controller object at the specified time.
[0111] It should be understood that state blending is necessary when multiple controller objects modify the state of the same model node. State blending occurs during the intermediate state data merging phase and combines the output values of multiple controllers for the same attribute into a final value according to their weights. For example, the rotation of a skeleton node might be affected by both a waving animation controller and a breathing idle animation controller. The general calculation for state blending follows the formula:
[0112] $$V_{final} = \sum_{i=1}^{N} w_i\, V_i, \qquad \sum_{i=1}^{N} w_i = 1$$
[0113] Where: $V_{final}$ is the final calculated value applied to the target model node attribute, which is a vector or scalar; $N$ is the number of active controller objects that contribute to this attribute in the current frame; $w_i$ is the blend weight of the $i$-th controller object in the current frame, defined by the model behavior logic script or determined by the time curve of the controller object itself, with all weights summing to 1; and $V_i$ is the intermediate value calculated by the $i$-th controller object and expected to be applied to the target model node attribute. For skeletal rotation, $V_i$ is a quaternion, and the blending operation uses quaternion spherical linear interpolation. For position or color, $V_i$ is a 3D or 4D vector, and the blending operation uses linear interpolation. The translation and rotation data output by space transformer objects are also converted into vector or quaternion form to participate in this blending calculation. The merged $V_{final}$ is written to the rendering state buffer for subsequent rendering.
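The two blending modes can be sketched as follows: a weighted linear combination for positions or colors, and spherical linear interpolation for a pair of rotation quaternions. The sketch assumes quaternions in (w, x, y, z) order and blends only two rotations at a time; how more than two rotation controllers are combined is not specified in the description.

```python
import math


def blend_vectors(values, weights):
    """Weighted linear combination of equally sized vectors; weights sum to 1."""
    size = len(values[0])
    return tuple(sum(w * v[k] for v, w in zip(values, weights)) for k in range(size))


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # take the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:                   # nearly parallel: normalized linear blend
        mixed = tuple(a + t * (b - a) for a, b in zip(q0, q1))
        norm = math.sqrt(sum(c * c for c in mixed))
        return tuple(c / norm for c in mixed)
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))


# Position example: two controllers with weights 0.7 and 0.3.
blended_position = blend_vectors([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.7, 0.3])

# Rotation example: idle pose (identity) and a 90-degree turn about Z,
# blended with equal weight, which should give a 45-degree turn about Z.
idle = (1.0, 0.0, 0.0, 0.0)
wave = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
blended_rotation = slerp(idle, wave, 0.5)
print(blended_position, blended_rotation)
```

For two controllers, the interpolation parameter `t` plays the role of the second controller's blend weight, with the first controller receiving `1 - t`.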
[0114] Referring to Figure 4, two key performance indicators, recognition accuracy and average response time, were quantitatively evaluated for the interactive system based on augmented reality 3D animated character models. In terms of recognition accuracy, the "wave" command ranked first at approximately 98.1%, followed by "play animation" (approximately 97.5%) and "rotate" (approximately 95.8%). The accuracy rates for "move" (approximately 94.2%) and "change color" (approximately 93.6%) were lower. This distribution reflects differences in the complexity of the commands with respect to speech keyword matching and syntactic parsing: "wave" is a highly concrete, single action whose keyword combination maps closely onto its behavior logic script, whereas commands such as "move" and "change color" carry more parameter descriptors (such as direction, amplitude, and color value range) and are more prone to semantic ambiguity, which slightly lowers recognition accuracy. In terms of average response time, the reported values are approximately 50 ms for "change color," 48 ms for "move," 45 ms for "rotate," 42 ms for "play animation," and 41 ms for "wave." The description attributes the differences in response time to the scheduling overhead of the different behavior logic scripts: "change color" only requires calling the material transformer object and injecting texture or color blending factors, whereas commands such as "play animation" and "rotate" require scheduling the skeletal animation controller or space transformer, which involves skeletal node state blending and matrix transformation calculations.
Considering both metrics, the system maintains a recognition accuracy above 93% for all commands while keeping the average response time at or below approximately 50 ms, which meets the real-time requirements of augmented reality interaction scenarios and supports the effectiveness and engineering feasibility of the proposed voice command parsing and model behavior driving method.
[0115] In one embodiment of the present invention, within the physical space coordinate framework, a simplified collision body mesh matching the outer surface of the preset 3D model file of the animated character is generated. This simplified collision body mesh is produced by convex hull decomposition and voxelization simplification of the high-precision mesh of the preset 3D model file, yielding a low-polygon bounding volume mesh composed of multiple convex polyhedra. Before the augmented reality display device renders each frame of the composite augmented reality scene, the expected position and posture data of the preset 3D model file of the animated character at the next moment are obtained. This expected position and posture data is the model transformation matrix of the next frame, predicted by interpolation from the interaction command and the current motion state of the model. High-density point cloud regions representing physical obstacles are extracted from the 3D point cloud map. These regions are defined as continuous spatial regions with a point cloud density exceeding 1000 points per cubic meter, and they are converted into obstacle collision body meshes using the marching cubes algorithm, which generates closed triangular mesh surfaces from the high-density point cloud regions. Spatial intersection detection is then performed between the simplified collision body mesh and the obstacle collision body mesh. The detection uses the separating axis theorem to determine whether the preset 3D model file of the animated character will penetrate or intersect the physical obstacle at the next moment. If a collision is expected, the model position offset vector and rotation adjustment amount required to avoid penetration are calculated from the collision position and surface normal information.
The surface normal information is obtained from the normal of the triangular facet of the obstacle collision body mesh at the collision point. The model position offset vector points along the surface normal at the collision point, toward free space and therefore opposite to the direction of penetration, and its magnitude equals the penetration depth. The model position offset vector and rotation adjustment amount are applied to the coordinate transformation matrix of the preset 3D model file of the animated character, which is the transformation matrix from the model's local coordinate system to the world coordinate system. The application superimposes corrections onto the translation and rotation components of the coordinate transformation matrix so that, in the final rendering, the model avoids the physical obstacle or slides along the obstacle surface.
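The extraction of high-density point cloud regions can be approximated by counting points per voxel and keeping the voxels whose density exceeds the 1000 points per cubic meter threshold. This is a simplified stand-in for the region extraction step: the 0.5 m voxel size and the function name are assumptions, and the subsequent meshing and separating axis tests are not shown.

```python
import math
from collections import Counter


def dense_voxels(points, voxel_size=0.5, density_threshold=1000.0):
    """Return the integer voxel indices whose point density exceeds the threshold.

    points: iterable of (x, y, z) coordinates in meters.
    density_threshold: points per cubic meter (1000 in the description).
    """
    counts = Counter(
        tuple(math.floor(c / voxel_size) for c in p) for p in points
    )
    volume = voxel_size ** 3
    return {voxel for voxel, n in counts.items() if n / volume > density_threshold}


# 200 points packed into one voxel (1600 points per cubic meter) and
# 50 points in another (400 points per cubic meter).
obstacle_points = [(0.1 + 0.001 * i, 0.2, 0.3) for i in range(200)]
sparse_points = [(1.1, 0.2, 0.3)] * 50
dense = dense_voxels(obstacle_points + sparse_points)
print(dense)
```

Adjacent dense voxels can then be grouped into continuous regions before a surface is extracted from them.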
[0116] In practical implementation, an eye-tracking sensor integrated into the augmented reality display device, a near-infrared camera with a sampling rate of 120 Hz, acquires in real time the coordinates of the user's gaze point on the composite augmented reality screen. These gaze point coordinates are two-dimensional pixel coordinates in the screen coordinate system. The gaze point coordinates are transformed from the screen coordinate system to the physical space coordinate system: the screen coordinates are first converted to normalized device coordinates, and the inverse projection matrix and inverse view matrix of the virtual camera are then used to calculate the focal position coordinates of the user's current gaze in three-dimensional physical space. These focal position coordinates are three-dimensional coordinates in the world coordinate system. The distances between the focal position coordinates and the components of the preset 3D model file of the animated character are calculated in three-dimensional space. The components are pre-divided into multiple collision body sub-regions, and the distance is the Euclidean distance from the focal position coordinates to the center of the bounding sphere of each collision body sub-region. The closest model component is identified as the potential interaction focus. The system then determines whether the potential interaction focus remains within the user's gaze range for longer than a preset gaze time threshold. The gaze time threshold is set to 500 milliseconds, and the gaze range is a spherical space with a radius of 0.1 meters centered on the focal position. If the gaze time threshold is exceeded, visual enhancement processing is applied to the model component corresponding to the potential interaction focus.
This visual enhancement includes attaching a high-brightness emissive material or rendering an outer glow contour to the model component, and unlocking the deep interaction command set associated with that model component. The deep interaction command set is a list of extended interaction commands bound to that model component, stored in a database. When subsequent interaction commands are received, the system prioritizes matching the interaction action corresponding to the currently visually enhanced model component from the deep interaction command set.
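The gaze dwell logic can be sketched as a small state machine. It reads the 0.1 m gaze range as a requirement that the nearest component's bounding sphere center lie within 0.1 m of the focal position, which is one possible interpretation; the class name and the component coordinates are hypothetical.

```python
import math


def nearest_component(focus, centers):
    """Name of the component whose bounding sphere center is closest to focus."""
    return min(centers, key=lambda name: math.dist(focus, centers[name]))


class GazeDwellDetector:
    """Reports a component once the gaze has rested on it longer than the threshold."""

    def __init__(self, threshold_ms=500, radius_m=0.1):
        self.threshold_ms = threshold_ms
        self.radius_m = radius_m
        self._candidate = None
        self._since_ms = None

    def update(self, t_ms, focus, centers):
        name = nearest_component(focus, centers)
        if math.dist(focus, centers[name]) > self.radius_m:
            name = None                      # gaze is not on any component
        if name != self._candidate:
            # Candidate changed: restart the dwell timer.
            self._candidate, self._since_ms = name, t_ms
            return None
        if name is not None and t_ms - self._since_ms > self.threshold_ms:
            return name                      # dwell threshold exceeded
        return None


# Hypothetical bounding sphere centers in world coordinates (meters).
centers = {"head": (0.0, 1.6, 2.0), "right hand": (0.4, 1.0, 2.0)}
detector = GazeDwellDetector()
results = [detector.update(t, (0.02, 1.6, 2.0), centers) for t in range(0, 700, 100)]
print(results)
```

The component is reported only after the gaze has stayed on it for more than 500 ms, and looking away resets the timer.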
[0117] In some embodiments, calculating the model position offset vector required to avoid penetration involves an exact solution for the penetration depth. The penetration depth is defined as the maximum distance by which a vertex of the simplified collision body mesh penetrates along the collision direction at the expected position and posture. Assume that the simplified collision body mesh of the preset 3D model file of the animated character intersects, at the expected position, a planar obstacle collision body mesh representing a wall, and that the surface normal of the planar obstacle is $\mathbf{n}$. The model position offset vector $\Delta\mathbf{p}$ is then given by the formula:
[0118] $$\Delta\mathbf{p} = \max\Bigl(0,\; -\min_{i}\bigl(\mathbf{n} \cdot (\mathbf{v}_i - \mathbf{p}_0)\bigr)\Bigr)\, \mathbf{n}$$
[0119] Where: $\Delta\mathbf{p}$ is the calculated model position offset vector, a three-dimensional vector; $\mathbf{n}$ is the unit normal vector of the obstacle collision body mesh at the collision point, pointing towards free space; $\mathbf{v}_i$ is the expected position of the $i$-th vertex of the simplified collision body mesh of the preset 3D model file of the animated character, that is, the coordinates of the vertex in the world coordinate system after the expected position and posture data have been applied; $\mathbf{p}_0$ is the coordinates of any point on the planar obstacle; and the operator "$\cdot$" denotes the vector dot product. The inner minimum is taken over all vertices of the simplified collision body mesh and gives the minimum signed distance from a vertex to the planar obstacle along the plane normal; a negative value indicates that the vertex has crossed the plane. The $\max(0, \cdot)$ function ensures that when the minimum distance is negative (penetration), the magnitude of the offset vector is the absolute value of the penetration and its direction is the normal direction, and that when there is no penetration, the offset vector is the zero vector. The calculated model position offset vector $\Delta\mathbf{p}$ is added directly to the translation component of the coordinate transformation matrix of the preset 3D model file of the animated character to achieve collision avoidance.
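The offset calculation for a planar obstacle can be sketched directly from the description: find the minimum signed distance of the collision body vertices to the plane along the normal, and push the model out along the normal by the penetration depth if that distance is negative. The vertex coordinates below are hypothetical, and the function name is illustrative.

```python
def penetration_offset(vertices, plane_point, normal):
    """Offset vector that moves all vertices back to the free side of a plane.

    vertices: expected world-space positions of the collision body vertices.
    plane_point: any point on the planar obstacle.
    normal: unit normal of the plane, pointing towards free space.
    """
    def signed_distance(v):
        return sum(n * (a - b) for n, a, b in zip(normal, v, plane_point))

    min_distance = min(signed_distance(v) for v in vertices)
    depth = max(0.0, -min_distance)      # zero when nothing has crossed the plane
    return tuple(depth * n for n in normal)


# Wall at x = 0 with free space on the +x side.
wall_point = (0.0, 0.0, 0.0)
wall_normal = (1.0, 0.0, 0.0)

# Hypothetical collision body vertices; one has crossed the wall by 0.2 m.
penetrating = [(0.5, 0.0, 0.0), (-0.2, 1.0, 0.0), (0.1, 0.0, 1.0)]
clear = [(0.5, 0.0, 0.0), (0.1, 1.0, 0.0)]

offset = penetration_offset(penetrating, wall_point, wall_normal)
no_offset = penetration_offset(clear, wall_point, wall_normal)
print(offset, no_offset)
```

Adding the returned vector to the translation component of the model's coordinate transformation matrix moves the deepest vertex back onto the wall surface.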
[0120] Referring to Figure 5, the performance statistics of the modules of the augmented reality 3D model system show significant differences between modules and clear performance stratification in the two core indicators, frame rate (FPS) and recognition accuracy (%). Specifically, the coordinate transformation and model rendering modules perform best in terms of frame rate, with a median FPS of approximately 120 for the coordinate transformation module and approximately 98 for the model rendering module, both in the high range of the system, demonstrating efficient real-time computing and graphics rendering capabilities. In contrast, the median FPS of the speech recognition and spatial positioning modules is only about 25 and 32, respectively; these modules are the system's performance bottlenecks, reflecting the high computational complexity of speech signal processing and spatial feature reconstruction. In terms of recognition accuracy, the median accuracy of the collision detection and coordinate transformation modules is close to 100%, demonstrating extremely high robustness and reliability. The median accuracy of the gaze tracking module is approximately 97.5%, slightly lower than that of the former two. The median accuracy of the speech recognition module is approximately 96%, the lowest among all modules, which is closely related to the susceptibility of speech signals to environmental noise. The dispersion of the box plot shows that the model rendering and coordinate transformation modules have a wider FPS box range with distinct upper and lower extrema, indicating that their performance fluctuates significantly with scene complexity and model load. In contrast, the spatial positioning and speech recognition modules show a narrower box range, indicating relatively stable performance. Overall, the system maintains high real-time performance and accuracy in its core interaction and rendering modules.
However, there is still room for performance optimization in the speech recognition and spatial positioning modules, which can be further improved through algorithm lightweighting and hardware acceleration to enhance the overall system smoothness.
[0121] The above embodiments are only used to illustrate the technical methods of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical methods of the present invention without departing from the spirit and scope of the technical methods of the present invention.
Claims
1. A method for interactive display of 3D animated character models based on augmented reality, characterized in that, The method includes: The video stream data of the user's physical space is collected in real time by an image sensor, and spatial feature recognition and marker point detection are performed on the video stream data to construct a physical space coordinate framework. Load the preset 3D model file of the animated character, and spatially anchor and align the origin of the preset 3D model file of the animated character with the position of the detected marker point established in the physical space coordinate frame; Voice command signals are obtained from the user's voice input channel, and voice keyword matching and semantic structure analysis are performed on the voice command signals to parse out the interaction commands representing the user's interaction intentions. On an augmented reality display device, based on the location of the marker point, the model corresponding to the preset 3D model file of the animated character is rendered into the real scene generated by the video stream data, forming a composite augmented reality scene that blends the virtual and the real. According to the type of the interaction instruction, the corresponding behavioral logic operation is performed on the preset 3D model file of the animated character to drive the preset 3D model file of the animated character to make corresponding dynamic behaviors in the composite augmented reality screen. 
The video stream data is subjected to spatial feature recognition and marker point detection to construct a physical space coordinate framework, including: Corner point extraction and edge contour detection are performed on each frame of the video stream data to generate a scene key feature point cloud; Feature point tracking and matching are performed on the scene key feature point cloud across multiple consecutive frame images, and the spatial displacement and viewpoint change parameters of the feature points between adjacent frame images are calculated; Based on the spatial displacement and viewpoint change parameters, a three-dimensional point cloud map of the physical space is reconstructed using a structure-from-motion algorithm, and the motion pose of the image sensor is estimated; In the three-dimensional point cloud map, image regions that meet the preset geometric shape and pattern features are searched for and identified as usable augmented reality markers; Each augmented reality marker is assigned a unique spatial identifier and three-dimensional position coordinates in the three-dimensional point cloud map; By integrating the spatial coordinates and identifiers of all augmented reality markers and combining them with the real-time motion pose of the image sensor, a physical space coordinate framework is formed that includes the mapping relationship between the absolute coordinate system, the marker coordinate system, and the screen coordinate system.
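As an illustration of the mapping between the marker, absolute (world), and screen coordinate systems named at the end of claim 1, the following Python sketch chains two rigid transforms and a pinhole projection. The pinhole camera model, the intrinsic parameters `fx`, `fy`, `cx`, `cy`, and all function names are assumptions chosen for the example; the claim does not specify a camera model.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]

IDENTITY: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def rigid_transform(R: Mat3, t: Vec3, p: Vec3) -> Vec3:
    """Apply p' = R * p + t."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p_cam: Vec3, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-space point to pixel coordinates."""
    if p_cam[2] <= 0:
        return None  # point is behind the camera and cannot be drawn
    return (fx * p_cam[0] / p_cam[2] + cx, fy * p_cam[1] / p_cam[2] + cy)


def marker_to_screen(p_marker, marker_pose, camera_pose, intrinsics):
    """Marker coordinates -> world coordinates -> camera coordinates -> screen."""
    p_world = rigid_transform(marker_pose[0], marker_pose[1], p_marker)
    p_cam = rigid_transform(camera_pose[0], camera_pose[1], p_world)
    return project(p_cam, *intrinsics)


# Marker placed 4 units in front of a camera sitting at the world origin.
marker_pose = (IDENTITY, (0.0, 0.0, 4.0))   # marker frame -> world frame
camera_pose = (IDENTITY, (0.0, 0.0, 0.0))   # world frame -> camera frame
print(marker_to_screen((1.0, 0.5, 0.0), marker_pose, camera_pose, (800.0, 800.0, 320.0, 240.0)))
```

In a running system the camera pose would be refreshed every frame from the estimated motion pose of the image sensor, while the marker poses would stay fixed in the point cloud map.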
2. The interactive display method for 3D animated character models based on augmented reality according to claim 1, characterized in that, Spatial anchoring and alignment are performed between the origin of the preset 3D model file of the animated character and the position of the detected marker point established in the physical space coordinate frame, including: Read the coordinate data of the model center point of the preset 3D model file of the animated character, and obtain the coordinate value of the model center point coordinate data in the local coordinate system of the preset 3D model file of the animated character; From the physical space coordinate frame, query the spatial identifier of the augmented reality marker selected as the display reference, and obtain the three-dimensional world coordinates of the augmented reality marker in the physical space coordinate system; Calculate the coordinate transformation matrix from the center point coordinates of the preset 3D model file of the animated character to the augmented reality marker point. The coordinate transformation matrix includes a translation vector, a rotation vector, and a scaling factor. The coordinate transformation matrix is applied to the vertex data of the preset 3D model file of the animated character, so that the spatial position of the model center point of the preset 3D model file of the animated character completely coincides with the spatial position of the augmented reality marker point. A dynamic coordinate mapping relationship is established between the local coordinate system and the physical space coordinate system of the preset 3D model file of the animated character, so as to ensure that when the pose of the image sensor changes, the preset 3D model file of the animated character can maintain the anchoring relationship with the physical marker points in the augmented reality screen according to the dynamic coordinate mapping relationship.
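The anchoring step of claim 2 can be sketched as follows. This is a simplified illustration: the rotation is restricted to a yaw angle about the vertical axis and the scale is uniform, whereas the claim allows a general rotation vector. The function name `anchor_vertices` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def anchor_vertices(vertices: Sequence[Vec3], center: Vec3, marker: Vec3,
                    scale: float = 1.0, yaw_rad: float = 0.0) -> List[Vec3]:
    """Map model vertices so that the model center lands on the marker.

    Each vertex is shifted so the center is at the origin, rotated about
    the Y axis, scaled uniformly, and then translated to the marker.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    anchored = []
    for x, y, z in vertices:
        lx, ly, lz = x - center[0], y - center[1], z - center[2]
        rx = c * lx + s * lz          # rotation about the Y axis
        rz = -s * lx + c * lz
        anchored.append((marker[0] + scale * rx,
                         marker[1] + scale * ly,
                         marker[2] + scale * rz))
    return anchored


center = (2.0, 1.0, 1.0)
marker = (10.0, 0.0, 5.0)
# The model center itself maps exactly onto the marker position.
print(anchor_vertices([center], center, marker, scale=2.0))
```

Recomputing the view matrix from the sensor pose each frame, while keeping this model-to-world transform fixed, is one way to realize the dynamic coordinate mapping that keeps the model attached to the marker as the camera moves.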
3. The interactive display method for 3D animated character models based on augmented reality according to claim 2, characterized in that, The voice command signal is subjected to voice keyword matching and semantic structure analysis to parse out the interactive commands representing the user's interaction intent, including: The acquired raw speech command signal is subjected to noise reduction and gain control processing to extract a clean speech waveform signal; Speech endpoint detection is performed on the clean speech waveform signal to segment out speech segments containing valid instructions; The acoustic feature vectors of the speech segments are extracted, and the extracted acoustic feature vectors are input into a pre-trained acoustic model to calculate the probability distribution of the phoneme sequence. The phoneme sequence is decoded based on a statistical language model to generate a text instruction string corresponding to the voice instruction signal; The text instruction string is matched with a preset interactive instruction keyword library to identify keyword combinations containing core action verbs and target object nouns; Dependency parsing is performed on the identified keyword combinations to determine the syntactic relationship between actions and objects, thereby parsing out structured interaction instructions, which include action type, action target object, and optional parameters.
4. The interactive display method for 3D animated character models based on augmented reality according to claim 3, characterized in that, The text instruction string is matched against a preset interactive instruction keyword library to identify keyword combinations containing core action verbs and target object nouns, including: Load an interactive instruction keyword library containing multiple categories of action instructions. The interactive instruction keyword library includes instruction trigger phrases, a list of action verbs, a list of model component names, and a list of parameter descriptive words. The text instruction string is scanned using a forward maximum matching algorithm to find the longest matching phrase in the interactive instruction keyword library, and this phrase is marked as the instruction trigger phrase. After the instruction trigger phrase has been identified in the text instruction string, scanning continues over the subsequent string to find a matching action verb in the list of action verbs; After the action verb is identified, a matching model component name is searched for in the list of model component names within the nearby string range; If a matching model component name is found, the nearby string is further searched for parameter descriptive words from the list of parameter descriptive words that specify the movement direction, movement amplitude, or number of repetitions. The successfully matched instruction trigger phrase, action verb, model component name, and parameter descriptive words are combined to form the keyword combination for this voice interaction.
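The forward maximum matching scan in claim 4 can be sketched in Python as below. The sketch works on whitespace-separated English tokens rather than on a character string, and the lexicon entries, category labels, and function names are all hypothetical examples rather than the contents of the patent's keyword library.

```python
from typing import Dict, List, Tuple


def forward_max_match(tokens: List[str], lexicon: Dict[str, str],
                      max_len: int = 4) -> List[Tuple[str, str]]:
    """Greedy longest-match scan from left to right.

    At each position the longest phrase found in the lexicon wins;
    tokens that start no lexicon phrase are skipped.
    """
    i, matches = 0, []
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                matches.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            i += 1  # no lexicon entry begins here
    return matches


def keyword_combination(matches: List[Tuple[str, str]]) -> Dict[str, str]:
    """Keep the first match of each category as the keyword combination."""
    combo: Dict[str, str] = {}
    for phrase, category in matches:
        combo.setdefault(category, phrase)
    return combo


LEXICON = {
    "hey mascot": "trigger",
    "raise": "verb",
    "arm": "component",
    "left arm": "component",
    "twice": "parameter",
}

tokens = "hey mascot raise left arm twice please".split()
print(keyword_combination(forward_max_match(tokens, LEXICON)))
```

Because longer phrases are tried first, "left arm" is matched as a single component name instead of the shorter entry "arm", which is the main reason for preferring maximum matching over single-keyword lookup.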
5. The method for interactive display of 3D animated character models based on augmented reality according to claim 4, characterized in that, On the augmented reality display device, using the location of the marker point as a reference, the model corresponding to the preset 3D model file of the animated character is rendered onto the real-world scene generated by the video stream data, forming a composite augmented reality scene that blends the virtual and real worlds, including: The system receives the latest video stream data frames captured by the image sensor in real time and decodes them into real-world texture bitmaps. Based on the physical space coordinate frame and the current pose of the image sensor, calculate the virtual camera view matrix and projection matrix corresponding to the current video frame; Based on the anchoring position of the pre-set 3D model file of the animated character in physical space and the dynamic coordinate mapping relationship, calculate the vertex coordinate transformation data of the pre-set 3D model file of the animated character under the current virtual camera view. The calculated vertex coordinate transformation data is input into the graphics rendering pipeline, and combined with the material texture and skeletal animation data of the loaded animated character preset 3D model file, a rendered image of the animated character preset 3D model file in the current view is generated. The generated rendered image is pixel-level fused with the real-world texture bitmap, wherein opaque pixels in the rendered image cover the corresponding pixels in the real-world texture bitmap, while transparent pixels display the original content of the real-world texture bitmap, thereby synthesizing the composite augmented reality image containing virtual animated character models superimposed on a real scene, and outputting it to an augmented reality display device.
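The pixel-level fusion at the end of claim 5 can be illustrated with a small pure-Python sketch. Images are represented as nested lists of 8-bit tuples, which is an assumption made for readability; a real renderer would perform this step on the GPU. The claim only describes fully opaque and fully transparent pixels, so the linear blend used here for intermediate alpha values is an added assumption.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def composite(rendered: List[List[RGBA]], background: List[List[RGB]]) -> List[List[RGB]]:
    """Overlay a rendered RGBA image on a camera frame of the same size.

    Alpha 255 fully covers the camera pixel, alpha 0 leaves it untouched,
    and values in between are blended linearly (an extension to the claim).
    """
    out = []
    for render_row, bg_row in zip(rendered, background):
        row = []
        for (r, g, b, a), (br, bg, bb) in zip(render_row, bg_row):
            inv = 255 - a
            row.append(((r * a + br * inv) // 255,
                        (g * a + bg * inv) // 255,
                        (b * a + bb * inv) // 255))
        out.append(row)
    return out


rendered = [[(255, 0, 0, 255), (0, 0, 0, 0)]]   # one opaque and one transparent pixel
camera = [[(10, 20, 30), (40, 50, 60)]]
print(composite(rendered, camera))
```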
6. The interactive display method for 3D animated character models based on augmented reality according to claim 5, characterized in that, Based on the type of the interaction instruction, corresponding behavioral logic operations are performed on the preset 3D model file of the animated character to drive the preset 3D model file of the animated character to perform corresponding dynamic behaviors in the composite augmented reality scene, including: Based on the action type in the parsed interaction instructions, the preset behavior logic rule library is queried to obtain the corresponding model behavior logic script. The model behavior logic script defines the behavior sequence, behavior parameters and behavior triggering conditions that the 3D model should execute. Extract the action target object from the interaction command. The action target object is a specific body part, prop, or preset action fragment of the 3D model of the animated character. Based on the model behavior logic script, calculate the values required for the behavior parameters, where the values are derived from parameters carried in the interaction instructions or default values obtained from the current model state. The 3D model animation engine is invoked, and based on the model behavior logic script, the action target object, and the calculated behavior parameters, the skeletal animation controller, material transformer, or space transformer of the preset 3D model file of the animated character is scheduled. The skeletal animation controller drives the model's skeletal system to generate motion, the material transformer changes the model's surface texture or color, and the space transformer changes the model's position, orientation, or scaling. 
The execution results of the aforementioned controller and transformer are applied in real time to the preset 3D model file of the animated character being rendered, so that it exhibits dynamic behavior changes corresponding to the interactive commands in the composite augmented reality screen.
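The rule lookup and parameter resolution in claim 6 can be sketched as a dictionary query followed by a fallback to the current model state. The structure of the rule library, the field names, and the function `resolve_behavior` are hypothetical and serve only to illustrate the described flow.

```python
from typing import Any, Dict, Optional


def resolve_behavior(instruction: Dict[str, Any],
                     rule_library: Dict[str, Dict[str, Any]],
                     model_state: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Look up the behavior script for an action and fill in its parameters.

    A parameter carried by the instruction takes priority; otherwise the
    value is taken from the current model state as a default.
    """
    script = rule_library.get(instruction["action"])
    if script is None:
        return None  # no behavior script is registered for this action type
    carried = instruction.get("params", {})
    params = {name: carried[name] if name in carried else model_state[name]
              for name in script["params"]}
    return {"controller": script["controller"],
            "target": instruction["target"],
            "params": params}


RULES = {"wave": {"controller": "skeleton", "params": ["repeat", "speed"]}}
STATE = {"repeat": 1, "speed": 1.0}
command = {"action": "wave", "target": "right_hand", "params": {"repeat": 3}}
print(resolve_behavior(command, RULES, STATE))
```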
7. The interactive display method for 3D animated character models based on augmented reality according to claim 6, characterized in that, The process involves invoking a 3D model animation engine and, based on the model behavior logic script, the target object of the action, and the calculated behavior parameters, scheduling the skeletal animation controller, material transformer, or spatial transformer of the preset 3D model file of the animated character, including: The behavior sequence definition in the model behavior logic script is parsed, and the behavior sequence is decomposed into multiple independent behavior units ordered by the time axis. Each behavior unit is associated with a specific controller type identifier. For each behavior unit, a corresponding controller object is instantiated in the 3D model animation engine according to its associated controller type identifier. The controller object includes a skeletal animation controller object, a material transformer object, or a space transformer object. The motion target object is mapped to the internal node tree structure of the preset 3D model file of the animated character, and the corresponding bone joint node, mesh material node or spatial transformation node of the motion target object is located. The calculated behavior parameters are injected into the instantiated controller object. Specifically, bone rotation quaternions or translation vectors are injected into the skeletal animation controller object, texture coordinate offsets or color blending factors are injected into the material transformer object, and world coordinate translation matrices or Euler angle rotation data are injected into the space transformer object. The scheduling queue of the 3D model animation engine is started. 
Based on the preset timestamp of each behavior unit in the behavior sequence, the corresponding controller object is activated in sequence, and the controller object is triggered to perform data writing operation on the model node it is bound to. Before rendering each frame, the intermediate state data output by all active controller objects is read synchronously. The intermediate state data is then merged and written into the rendering state buffer of the preset 3D model file of the animated character to complete the real-time driving of the preset 3D model file of the animated character.
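The timestamp-ordered scheduling queue and the per-frame merge of controller outputs described in claim 7 can be sketched with a priority queue. The class names, the fields of `BehaviorUnit`, and the dictionary used as the rendering state buffer are assumptions for illustration; an actual animation engine would write to its own node and buffer types.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(order=True)
class BehaviorUnit:
    start: float                                   # preset timestamp in seconds
    controller: str = field(compare=False)         # "skeleton", "material" or "space"
    node: str = field(compare=False)               # model node the controller is bound to
    value: Tuple[float, ...] = field(compare=False)


class Scheduler:
    """Activates behavior units in timestamp order and merges their output."""

    def __init__(self, units: Iterable[BehaviorUnit]) -> None:
        self.queue: List[BehaviorUnit] = list(units)
        heapq.heapify(self.queue)
        self.active: List[BehaviorUnit] = []

    def tick(self, now: float) -> Dict[Tuple[str, str], Tuple[float, ...]]:
        # Activate every unit whose timestamp has been reached.
        while self.queue and self.queue[0].start <= now:
            self.active.append(heapq.heappop(self.queue))
        # Merge the output of all active controllers into one state buffer.
        return {(u.node, u.controller): u.value for u in self.active}


scheduler = Scheduler([
    BehaviorUnit(0.5, "material", "body_mesh", (1.0, 0.0, 0.0)),
    BehaviorUnit(0.0, "skeleton", "left_arm", (0.0, 0.0, 0.7, 0.7)),
])
print(scheduler.tick(0.0))   # only the skeleton unit is active
print(scheduler.tick(0.6))   # both units are now active
```

Calling `tick` once before each rendered frame corresponds to the synchronous read of intermediate state data in the claim; this sketch never retires finished units, which a fuller implementation would need to handle.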
8. The interactive display method for 3D animated character models based on augmented reality according to claim 7, characterized in that, It also includes collision detection and response handling for the model's interaction with the environment in physical space: Within the physical space coordinate framework, a simplified collision body mesh matching the outer surface of the model is generated from the preset 3D model file of the animated character. Before the augmented reality display device renders each frame of the composite augmented reality image, the expected position and posture data of the preset 3D model file of the animated character at the next moment are obtained; Extract high-density point cloud regions representing physical obstacles from a 3D point cloud map and convert them into obstacle collision body meshes. The simplified collision body mesh and the obstacle collision body mesh are subjected to spatial intersection detection calculation to determine whether the preset 3D model file of the animated character will clip through or intersect with the physical obstacle in the next moment; If a collision is detected, the model position offset vector and rotation adjustment amount required to avoid penetration are calculated based on the collision location and surface normal information. The model position offset vector and rotation adjustment amount are applied to the coordinate transformation matrix of the preset 3D model file of the animated character, so that the model avoids physical obstacles or slides along the surface of obstacles during the final rendering, thus achieving an interactive display that conforms to the laws of physics.
9. The interactive display method for 3D animated character models based on augmented reality according to claim 8, characterized in that, It also includes adaptive adjustment of model interaction focus based on user eye tracking: By using an eye-tracking sensor integrated into the augmented reality display device, the coordinates of the user's gaze point on the composite augmented reality image are obtained in real time; Transform the coordinates of the gaze point from the screen coordinate system to the physical space coordinate system to obtain the focal position coordinates of the user's current gaze in three-dimensional physical space. Calculate the distance between the coordinates of the focus position and each component of the preset 3D model file of the animated character in 3D space, and identify the closest model component as the potential interactive focus; Determine whether the potential interaction focus remains within the user's field of vision for more than a preset gaze time threshold; If the gaze time threshold is exceeded, visual enhancement processing is performed on the model component corresponding to the potential interaction focus, and the deep interaction instruction set associated with the model component is unlocked. When an interaction command is received subsequently, the interaction action corresponding to the currently visually enhanced model component is matched first from the deep interaction command set.
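The focus selection in claim 9 (nearest model component, confirmed once the gaze has dwelt on it longer than a threshold) can be sketched as a small state machine. The class `GazeFocus`, the component names, and the one-second threshold are hypothetical, and the gaze point is assumed to have already been transformed into physical space coordinates.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


class GazeFocus:
    """Tracks which model component the user has been looking at."""

    def __init__(self, components: Dict[str, Vec3], dwell_threshold: float = 1.0) -> None:
        self.components = components          # component name -> position in space
        self.dwell_threshold = dwell_threshold
        self.candidate: Optional[str] = None
        self.since = 0.0

    def update(self, gaze_point: Vec3, now: float) -> Optional[str]:
        """Return the focused component once the dwell threshold is exceeded."""
        nearest = min(self.components,
                      key=lambda name: math.dist(gaze_point, self.components[name]))
        if nearest != self.candidate:
            # The gaze moved to another component, so the dwell timer restarts.
            self.candidate, self.since = nearest, now
        if now - self.since > self.dwell_threshold:
            return nearest
        return None


focus = GazeFocus({"head": (0.0, 1.6, 0.0), "hand": (0.4, 1.0, 0.0)})
print(focus.update((0.05, 1.5, 0.0), now=0.0))   # dwell has only just started
print(focus.update((0.0, 1.55, 0.0), now=1.2))   # threshold exceeded on "head"
```

The component returned by `update` is the one that would receive the visual enhancement and whose deep interaction instruction set would be consulted first when the next interaction command arrives.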