An extended reality-based interactive product selection method and system

By determining the scene reference coordinates and ambient light characteristics in extended reality scenarios, and combining the mapping and fusion rules of gaze, gesture and voice data, stable product selection decisions are generated, solving the problems of unstable interaction and inconsistent decisions in extended reality scenarios, and realizing an efficient product selection method.

CN121120201B · Active · Publication Date: 2026-06-19 · GUANGZHOU LANGZUN SOFTWARE TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: GUANGZHOU LANGZUN SOFTWARE TECH CO LTD
Filing Date: 2025-09-01
Publication Date: 2026-06-19

Application Information

IPC: G06Q30/0601; G06T17/00; G06T19/00

AI Tagging

Application Domain

Commerce 3D modelling

Figure CN121120201B_ABST

Abstract

This application discloses an interactive product selection method and system based on extended reality (XR), relating to the field of XR interaction technology. The method includes: acquiring 3D information of a target product and corresponding product attribute information; determining the scene reference coordinates of the XR scene and the ambient light features matching the scene reference coordinates based on the spatial perception data of the terminal device, and generating an XR presentation object of the target product in the XR scene; acquiring interaction behavior data based on user interaction actions in the XR scene; associating and matching the interaction behavior data with the product attribute information to generate a candidate product feature set; performing fusion on the candidate product feature set according to a preset fusion criterion to obtain a target product selection decision, and outputting interactive feedback information corresponding to the target product selection decision. This application achieves robust fusion and reliable decision-making of multimodal interaction information, improving the accuracy and consistency of the XR product selection process.

Description

Technical Field

[0001] This application relates to the field of extended reality interaction technology, and in particular to an interactive product selection method and system based on extended reality.

Background Technology

[0002] Currently, product selection interactions in extended reality scenarios typically rely on single-modal or loosely structured multimodal rules for decision-making, such as using only gaze or gesture as the primary triggering condition. Limited by factors such as ambient light variations, device motion, occlusion, and noise, single-modal approaches are insufficiently stable, leading to interaction positioning errors, ambiguous selections, and delayed feedback. Existing multimodal methods often employ fixed thresholds or weighted averages for fusion, lack joint constraints on temporal and spatial consistency, and do not formally model the collaborative relationships between different modalities, making it difficult to maintain reliability in dynamic scenarios. Furthermore, the association between product 3D information and attribute information often remains a static mapping that is difficult to analyze in conjunction with interaction timing sequences, which easily leads to conflicts between attribute triggering and variant selection and reduces interaction efficiency and decision accuracy. Therefore, related technologies suffer from instability in multimodal interaction under complex lighting and motion conditions, difficulty in adapting fusion rules to time-varying scenarios, and insufficient reliability and consistency in product selection decisions.

Summary of the Invention

[0003] This application is proposed in view of the aforementioned problems.

[0004] Therefore, this application provides an interactive product selection method and system based on extended reality, which can solve the problems mentioned in the background art.

[0005] To solve the above-mentioned technical problems, this application provides the following technical solution:

[0006] In a first aspect, this application provides an interactive product selection method based on extended reality, comprising: obtaining three-dimensional information of a target product and corresponding product attribute information from a product information database; wherein the product attribute information includes attribute identifiers for driving interactive feedback and variant identifiers for referencing candidate variants, the attribute identifiers and the variant identifiers being preset identifiers read from the product information database; determining the scene reference coordinates of an extended reality scene and ambient light features matching the scene reference coordinates based on spatial perception data of a terminal device, and generating an extended reality presentation object of the target product in the extended reality scene; obtaining interactive behavior data based on user interaction actions in the extended reality scene; wherein the interactive behavior data includes interactive markers corresponding to the interaction position, interaction method, and interaction timing sequence of the target product; associating and matching the interactive behavior data with the product attribute information to generate a candidate product feature set; performing fusion on the candidate product feature set according to a preset fusion criterion to obtain a target product selection decision, and outputting interactive feedback information corresponding to the target product selection decision; wherein the target product selection decision is used to indicate the product selection conclusion for the target product or the screening result for candidate variants of the target product.

[0007] Preferably, determining the scene reference coordinates of the extended reality scene based on the spatial perception data of the terminal device includes: obtaining planar features and depth features from the spatial perception data; if a stable plane that meets preset constraints is detected, then the stable plane is used as the reference plane for the scene reference coordinates; if the preset constraints are not met, then the set of reference points obtained by repositioning is used as the set of reference points for the scene reference coordinates; determining the scene reference coordinates based on the reference plane or the set of reference points, and performing posture correction on the extended reality presentation object.

[0008] Preferably, the step of obtaining interactive behavior data based on the user's interactive actions in the extended reality scene includes: obtaining gesture channel data, gaze channel data, and voice channel data from the input channels of the terminal device; mapping the gesture channel data, gaze channel data, and voice channel data into gesture interaction markers, gaze interaction markers, and voice interaction markers respectively according to a preset mapping rule; wherein the gesture interaction markers, gaze interaction markers, and voice interaction markers each include the corresponding interaction position and interaction method with respect to the target product; aligning the gesture interaction markers, gaze interaction markers, and voice interaction markers by execution time to obtain the interaction timing sequence; wherein the preset mapping rule includes: calculating a gaze stability score, a gesture matching score, and a semantic similarity score based on the gaze channel data, the gesture channel data, and the voice channel data, respectively; under the observation sequence determined according to the interaction timing sequence and a preset partial order relation, generating the gaze interaction marker, gesture interaction marker, and voice interaction marker, together with the corresponding confidence scores, based on the maximal element; based on the consistency of the gaze interaction marker, gesture interaction marker, and voice interaction marker within the observation sequence, and on the hand-eye coordination angle relationship calculated from the gaze channel data and the gesture channel data, applying coordination gain or confidence suppression to the associated interaction intent; and, when the hand-eye coordination angle relationship does not meet the coordination condition, expanding the accumulated samples of the observation sequence according to the interaction timing sequence.

[0009] Preferably, the step of associating and matching the interactive behavior data with the product attribute information includes: spatially locating the interactive position according to the scene reference coordinates to determine the target part in the product's three-dimensional information; if the target part has the attribute identifier, then combining the attribute identifier with the interaction method to obtain a first candidate feature; if it does not have the attribute identifier, then combining the interaction method with the interaction time sequence to obtain a second candidate feature; and adding the first candidate feature and the second candidate feature in parallel to the candidate product feature set.

[0010] Preferably, the step of performing fusion on the candidate product feature set according to the preset fusion criteria includes: dividing the candidate product feature set into a spatial feature subset and a semantic feature subset according to the feature source; performing fusion on the spatial feature subset and the semantic feature subset respectively according to the preset fusion criteria to obtain the corresponding subset fusion results; performing secondary fusion on the subset fusion results, and performing spatial alignment according to the scene reference coordinates and temporal alignment according to the interaction time sequence to obtain the target product selection decision.

[0011] Preferably, the interactive feedback information corresponding to the target product selection decision includes: replacing the appearance or switching the parameters of the extended reality object in the extended reality scene to form visual feedback; overlaying a preset prompt layer template read from the rendering resources in the extended reality scene, mapping the prompt content according to the target product selection decision; adjusting the brightness and hue parameters of the preset prompt layer template according to the ambient light characteristics to generate and render a prompt layer to form prompt feedback; and presenting the visual feedback and the prompt feedback as the interactive feedback information.

[0012] Preferably, if the target product selection decision indicates the screening result of the candidate variant, a variant switching instruction is generated, and in response to the variant switching instruction, the product 3D information corresponding to the candidate variant is obtained from the product information database based on the variant identifier to replace the presentation content of the extended reality object; if the target product selection decision does not indicate the screening result of the candidate variant, the presentation content of the extended reality object remains unchanged.

[0013] In a second aspect, this application also provides an extended reality-based interactive product selection system, comprising: an information acquisition module for acquiring three-dimensional information of a target product and corresponding product attribute information from a product information database; a scene construction module for determining the scene reference coordinates of an extended reality scene and ambient light features matching the scene reference coordinates based on the spatial perception data of a terminal device, and generating an extended reality presentation object of the target product in the extended reality scene; an interaction acquisition module for acquiring interaction behavior data based on user interaction actions in the extended reality scene; a feature association module for associating and matching the interaction behavior data with the product attribute information to generate a candidate product feature set; and a decision fusion module for performing fusion on the candidate product feature set according to a preset fusion criterion to obtain a target product selection decision and outputting interactive feedback information corresponding to the target product selection decision.

[0014] In a third aspect, this application also provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to perform the following steps: obtaining three-dimensional information of a target product and product attribute information corresponding to the three-dimensional information from a product information database; wherein the product attribute information includes an attribute identifier for driving interactive feedback and a variant identifier for referencing candidate variants, the attribute identifier and the variant identifier being preset identifiers read from the product information database; determining the scene reference coordinates of an extended reality scene and the ambient light features matching the scene reference coordinates based on the spatial perception data of the terminal device, and generating an extended reality presentation object of the target product in the extended reality scene; acquiring interactive behavior data based on user interaction actions in the extended reality scene; wherein the interactive behavior data includes interactive markers corresponding to the interaction position, interaction method, and interaction timing sequence of the target product; associating and matching the interactive behavior data with the product attribute information to generate a candidate product feature set; performing fusion on the candidate product feature set according to a preset fusion criterion to obtain a target product selection decision, and outputting interactive feedback information corresponding to the target product selection decision; wherein the target product selection decision is used to indicate the product selection conclusion for the target product or the screening result for candidate variants of the target product.

[0015] In a fourth aspect, this application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the following steps: obtaining three-dimensional information of a target product and product attribute information corresponding to the three-dimensional information from a product information database; wherein the product attribute information includes an attribute identifier for driving interactive feedback and a variant identifier for referencing candidate variants, the attribute identifier and the variant identifier being preset identifiers read from the product information database; determining the scene reference coordinates of an extended reality scene and the ambient light features matching the scene reference coordinates based on the spatial perception data of a terminal device, and generating an extended reality presentation object of the target product in the extended reality scene; obtaining interactive behavior data based on user interaction actions in the extended reality scene; wherein the interactive behavior data includes interactive markers corresponding to the interaction position, interaction method, and interaction timing sequence of the target product; associating and matching the interactive behavior data with the product attribute information to generate a candidate product feature set; performing fusion on the candidate product feature set according to a preset fusion criterion to obtain a target product selection decision, and outputting interactive feedback information corresponding to the target product selection decision; wherein the target product selection decision is used to indicate the product selection conclusion for the target product or the screening result for candidate variants of the target product.

[0016] Implementing this application has the following beneficial effects: This application provides an interactive product selection method and system based on extended reality. By jointly determining the scene reference coordinates and ambient light features, it provides a stable spatial and lighting reference for the extended reality presentation object, reducing pose and appearance rendering errors. By introducing a mapping rule based on the observation sequence and a preset partial order relation, it formally determines the gaze stability score, gesture matching score, and semantic similarity score, and generates interaction markers and confidence scores from the maximal elements, thereby suppressing noise and local uncertainty. It uses the consistency of gaze, gesture, and voice within the observation sequence, together with the hand-eye coordination angle relationship, to apply coordination gain or confidence suppression, and expands the accumulated samples of the observation sequence when the coordination condition is not met, so that decisions converge adaptively in dynamic scenarios. In the feature association stage, interaction location, interaction method, and temporal attributes are structurally associated with product attribute identifiers and variant identifiers to distinguish between spatial reach and semantic triggering, improving the discriminability of candidate features. In the fusion stage, consistency assessment and alignment are performed on the spatial and semantic feature subsets respectively, candidate selection is completed through a conflict graph, and a robust decision strategy based on temporal continuity scoring reduces decision fluctuations caused by boundary samples. In the feedback stage, based on the target product selection decision, the appearance of the extended reality object is replaced, parameters are switched, and prompt layers are rendered, with display parameters adjusted according to the ambient light features to improve feedback visibility and consistency. This application achieves robust fusion and reliable decision-making for multimodal interaction under lighting, motion, and occlusion conditions, reducing accidental touches and ambiguity, and improving product selection efficiency and interaction consistency.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram of the extended reality-based interactive product selection method of this application.

[0019] Figure 2 is a flowchart illustrating the execution of the preset mapping rule in the extended reality-based interactive product selection method of this application.

[0020] Figure 3 is a schematic diagram of the overall structure of the extended reality-based interactive product selection system of this application.

[0021] Figure 4 is a diagram of the computer device for the extended reality-based interactive product selection method of this application.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0023] The product selection method and system based on extended reality in this application are applicable to operation on head-mounted display devices, mobile terminals, or other computing devices with spatial awareness capabilities. Extended reality includes three aspects: spatial modeling of real-world scenes, 3D rendering of target objects, and interactive feedback. The goal is to enable terminal devices to stably perceive scenes in dynamic environments, accurately interpret user multimodal interactions, and map the interaction results into product selection decisions and feedback for target products.

[0024] Currently, common sources of interactive information in extended reality scenarios include gaze, gestures, and voice. The gaze channel provides a continuous trajectory of the user's gaze area, but is susceptible to ambient light and terminal movement. The gesture channel provides direction and operational intent through keypoint and skeleton estimation, but its stability decreases when occlusion and depth relationships are uncertain. The voice channel can express location and attribute semantics, but ambiguity exists in noisy scenes and with polysemous expressions. Traditional approaches often use fixed thresholds or simple weighting for fusion, making it difficult to simultaneously consider spatial consistency, temporal continuity, and cross-modal collaborative relationships. This leads to unstable target location, conflicts between semantic triggering and spatial reach, and decision fluctuations.

[0025] Based on the above problems, the technical solution provided in this application includes the following. First, the three-dimensional information and attribute information of a product are read from a product information database, wherein the attribute information includes attribute identifiers for driving interactive feedback and variant identifiers for referencing candidate variants. Second, the scene reference coordinates of the extended reality scene and the matching ambient light features are determined based on the spatial perception data of the terminal device, and an extended reality presentation object of the target product is generated in the scene. Subsequently, based on a preset mapping rule, the gesture channel data, gaze channel data, and voice channel data are mapped into gesture interaction markers, gaze interaction markers, and voice interaction markers, wherein the markers include the corresponding interaction position, interaction method, and interaction time sequence with respect to the target product; the multimodal scores are formally determined, and coordination gain or suppression is applied to them, through the observation sequence and the partial order relation. On this basis, the interactive behavior data is associated and matched with the product attribute information to generate a candidate product feature set that distinguishes between spatial reach and semantic triggering. Finally, according to the preset fusion criteria, consistency scores are calculated within the spatial and semantic feature subsets and spatiotemporal alignment is completed; mutual exclusion relationships are resolved through a conflict graph, and the target product selection decision is determined by combining temporal continuity scores, so that interactive feedback information corresponding to the decision is output, including appearance replacement, parameter switching, and prompt layer rendering. These steps can be executed on computer devices with data processing capabilities, such as head-mounted displays, mobile terminals, or servers; the logical order of the process can be adjusted according to the specific application scenario.

[0026] This embodiment provides an interactive product selection method based on extended reality, which can be used with the aforementioned computer device. Figure 1 is a flowchart of the interactive product selection method based on extended reality according to an embodiment of the present invention. As shown in Figure 1, the process includes steps 202 to 210:

[0027] Step 202: Obtain the three-dimensional information of the target product and the product attribute information corresponding to the three-dimensional information from the product information database.

[0028] The product attribute information includes attribute identifiers used to drive interactive feedback and variant identifiers used for candidate variant references. The attribute identifiers and variant identifiers are preset identifiers read from the product information database.

[0029] Specifically, the product information database pre-stores digital information about various products. The product's 3D information includes geometric model data, texture mapping data, and material parameter data. Geometric model data defines the product's 3D shape, texture mapping data defines the product's surface appearance, and material parameter data defines the product's optical properties.

[0030] Attribute identifiers correspond to interactive parts of the product. For example, for clothing, the neckline has a first attribute identifier, and the cuffs have a second attribute identifier. Each attribute identifier is associated with a corresponding interactive response rule. Variant identifiers are used to distinguish different configurations of the same product. For example, products of different colors use different color variant identifiers, and products of different sizes use different size variant identifiers.

[0031] The process of retrieving target product information from the product information database includes: determining the unique identifier of the target product based on the product query request; retrieving the corresponding product record in the database using the unique identifier; extracting the storage path of the product's 3D information from the product record; and reading the 3D model file and texture file pointed to by the storage path. Simultaneously, an attribute information table is extracted from the product record. This attribute information table records the correspondence between each attribute identifier and a specific part in the 3D model, as well as the correspondence between each variant identifier and an alternative 3D model.
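The retrieval flow in paragraph [0031] can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the `ProductRecord` layout, the in-memory `DATABASE` dictionary, the file paths, and the identifier strings are all hypothetical stand-ins for the product information database.

```python
from dataclasses import dataclass, field


@dataclass
class ProductRecord:
    """Hypothetical record layout; field names are illustrative only."""
    product_id: str
    model_path: str      # storage path of the 3D model file
    texture_path: str    # storage path of the texture file
    # Attribute information table: attribute identifier -> part of the 3D model.
    attribute_parts: dict = field(default_factory=dict)
    # Attribute information table: variant identifier -> alternative 3D model.
    variant_models: dict = field(default_factory=dict)


def fetch_product(database: dict, query: dict) -> ProductRecord:
    """Resolve the query to a unique identifier and return the matching record."""
    product_id = query["product_id"]  # unique identifier from the query request
    if product_id not in database:
        raise KeyError(f"no product record for {product_id!r}")
    return database[product_id]


# Toy in-memory stand-in for the product information database (assumed data).
DATABASE = {
    "shirt-001": ProductRecord(
        product_id="shirt-001",
        model_path="models/shirt-001.glb",
        texture_path="textures/shirt-001.png",
        attribute_parts={"ATTR_1": "neckline", "ATTR_2": "cuff"},
        variant_models={"COLOR_RED": "models/shirt-001-red.glb"},
    )
}

record = fetch_product(DATABASE, {"product_id": "shirt-001"})
```

Keeping both correspondences on the record means the later association and variant-switching steps can look up parts and alternative models without a second database query.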

[0032] Step 204: Determine the scene reference coordinates of the extended reality scene and the ambient light features that match the scene reference coordinates based on the spatial perception data of the terminal device, and generate the extended reality presentation object of the target product in the extended reality scene.

[0033] Specifically, the terminal device collects spatial perception data through depth sensors, motion sensors, and illumination sensors. The depth sensor outputs a sequence of depth images, with the pixel value of each depth image representing the distance from the corresponding spatial point to the sensor. The motion sensor outputs pose change data of the device. The illumination sensor outputs ambient light data.

[0034] In step 204, determining the scene reference coordinates of the extended reality scene based on the spatial perception data of the terminal device includes steps A1 and A2:

[0035] Step A1: Obtain planar features and depth features from the spatial perception data.

[0036] Specifically, planar features are obtained by running a plane detection algorithm on the depth image. The algorithm converts the depth image into a 3D point cloud and identifies planar structures in the point cloud. Depth features include the depth value of each pixel in the depth image and the 3D coordinates calculated from the camera intrinsics. Candidate plane detection algorithms include RANSAC (Random Sample Consensus), region growing, clustering methods (such as Euclidean Cluster Extraction) combined with plane fitting, the Hough transform for point clouds, least-squares plane fitting, and deep learning-based methods (such as PlaneNet and PlaneRCNN). This embodiment uses RANSAC, which is widely used in robotics, SLAM, AR/VR, and other fields to extract planes from point clouds converted from depth images.
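As an illustration of the RANSAC option named above, the following is a minimal pure-Python sketch of plane fitting on a point cloud. The function names, the iteration count, the inlier tolerance, and the synthetic point cloud are assumptions made for this example; the application does not specify them.

```python
import math
import random


def plane_from_points(p, q, r):
    """Return (unit normal, d) of the plane n.x + d = 0 through three points,
    or None when the points are (nearly) collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-9:
        return None
    n = [c / norm for c in n]
    d = -sum(n[i] * p[i] for i in range(3))
    return n, d


def ransac_plane(points, iterations=200, tolerance=0.01, seed=0):
    """Keep the candidate plane supported by the most inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [pt for pt in points
                   if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic cloud: a 10 x 10 floor grid at z = 0 plus 20 points above it.
noise = random.Random(1)
floor = [(0.1 * i, 0.1 * j, 0.0) for i in range(10) for j in range(10)]
clutter = [(noise.uniform(0, 1), noise.uniform(0, 1), noise.uniform(0.2, 1.0))
           for _ in range(20)]
plane, inliers = ransac_plane(floor + clutter)
```

In practice the inlier set would then feed the area, direction, and stability checks that decide whether the plane can serve as the reference plane.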

[0037] If a stable plane that satisfies the preset constraints is detected, the stable plane is used as the reference plane for the scene reference coordinates. It should be noted that the preset constraints are used to determine whether a detected plane is suitable as a reference plane. The preset constraints include: a plane area constraint, requiring the plane area to be greater than a first area threshold; a plane direction constraint, requiring the angle between the plane's normal vector and the direction of gravity to be less than a first angle threshold; and a plane stability constraint, requiring the change in the plane's position over multiple consecutive frames to be less than a first displacement threshold. The first area threshold is, for example, set to 0.3 square meters, 0.5 square meters, or another value determined according to the application scenario; the first angle threshold is, for example, set to 3 degrees, 5 degrees, or 10 degrees; and the first displacement threshold is, for example, set to 0.01 meters, 0.02 meters, or 0.05 meters.
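The three preset constraints can be expressed as a single predicate. The sketch below is illustrative only: `PlaneObservation` and its fields are hypothetical, the defaults use example thresholds from the text (0.5 square meters, 5 degrees, 0.02 meters), the angle test ignores the sign of the normal, and "change in position" is interpreted as the displacement of the plane centre between consecutive frames. Both interpretations are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class PlaneObservation:
    area_m2: float   # detected plane area in square meters
    normal: tuple    # unit normal vector of the plane
    centers: list    # plane centre position in each consecutive frame


def is_stable_plane(obs, gravity=(0.0, 0.0, -1.0),
                    min_area_m2=0.5, max_angle_deg=5.0, max_shift_m=0.02):
    """Return True when the area, direction, and stability constraints all hold."""
    # Area constraint: area must exceed the first area threshold.
    if obs.area_m2 <= min_area_m2:
        return False
    # Direction constraint: angle between normal and gravity axis (sign ignored).
    dot = abs(sum(n * g for n, g in zip(obs.normal, gravity)))
    angle_deg = math.degrees(math.acos(min(1.0, dot)))
    if angle_deg >= max_angle_deg:
        return False
    # Stability constraint: frame-to-frame displacement stays under the threshold.
    shifts = [math.dist(a, b) for a, b in zip(obs.centers, obs.centers[1:])]
    return all(shift < max_shift_m for shift in shifts)


table_top = PlaneObservation(
    area_m2=0.8,
    normal=(0.0, 0.0, 1.0),
    centers=[(0.0, 0.0, 0.70), (0.0, 0.005, 0.70), (0.001, 0.005, 0.701)],
)
accepted = is_stable_plane(table_top)
```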

[0038] A stable plane that satisfies preset constraints is selected as the reference plane, and the scene reference coordinate system is established based on the reference plane. The origin of the scene reference coordinate system is set at the geometric center of the reference plane, the Z-axis direction is set as the normal vector direction of the reference plane, and the X-axis and Y-axis are determined in the reference plane according to the right-hand coordinate system rule.
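A minimal sketch of building this coordinate system from the reference plane follows. The text fixes only the origin, the Z-axis, and right-handedness; the choice of in-plane X direction via a helper axis is an assumption of this example, as are the function names.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def frame_from_plane(center, normal):
    """Return (origin, x_axis, y_axis, z_axis) for a right-handed frame.

    The origin is the plane's geometric center and Z is the plane normal.
    X is obtained from a helper axis that is not parallel to Z (assumed rule).
    """
    z_axis = normalize(normal)
    helper = (1.0, 0.0, 0.0) if abs(z_axis[0]) < 0.9 else (0.0, 1.0, 0.0)
    x_axis = normalize(cross(helper, z_axis))
    y_axis = cross(z_axis, x_axis)  # guarantees x_axis x y_axis == z_axis
    return tuple(center), x_axis, y_axis, z_axis


origin, x_axis, y_axis, z_axis = frame_from_plane((0.2, 0.3, 0.7), (0.0, 0.0, 2.0))
```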

[0039] If the preset constraints are not met, the set of reference points obtained by repositioning is used as the reference point set for the scene reference coordinates. It should be noted that when a stable plane cannot be detected, a feature-point-based repositioning method is used. Feature points are extracted from the current image and matched with feature points in a pre-stored environment map. When the number of successfully matched feature points exceeds a second threshold, the spatial pose of the device is calculated based on the matched feature points. The reference point set consists of the successfully matched 3D feature points. The second threshold is set, for example, to 15, 20, or 30.

[0040] Step A2: Determine the scene reference coordinates based on the reference plane or the reference point set, and perform pose correction on the extended reality presentation object.

[0041] Specifically, when using a reference plane, the scene reference coordinates are established according to the aforementioned method. When using a reference point set, the coordinate system is determined by calculating the statistical characteristics of the reference point set, for example, with the centroid of the point set as the origin and the principal direction of the point set as the coordinate axis direction.
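For the reference-point-set case, the centroid and principal direction can be computed as sketched below. The example uses power iteration on the 3x3 covariance matrix to find the dominant direction; that numerical method, the function names, and the sample points are assumptions of this sketch rather than details given in the application.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def covariance(points, center):
    """3x3 covariance matrix of the point set about its centroid."""
    n = len(points)
    return [[sum((p[i] - center[i]) * (p[j] - center[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def principal_direction(cov, iterations=100):
    """Dominant eigenvector of the covariance matrix via power iteration."""
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm < 1e-12:  # degenerate point set: keep the current estimate
            break
        v = [c / norm for c in w]
    return tuple(v)


# Matched 3D feature points spread mostly along the x direction (assumed data).
reference_points = [(float(i), 0.1 * (-1) ** i, 0.0) for i in range(10)]
frame_origin = centroid(reference_points)
frame_axis = principal_direction(covariance(reference_points, frame_origin))
```

The remaining two axes can then be completed into a right-handed frame in the same way as for the reference plane.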

[0042] Pose correction for extended reality presentation objects involves transforming the 3D model of the product from the model coordinate system to the scene reference coordinate system and adjusting the orientation of the presented object according to the user's perspective. The transformation is implemented using homogeneous coordinate transformation matrices, including translation, rotation, and scaling transformations.
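As an illustration of the homogeneous transformation, the sketch below composes scaling, rotation, and translation into one 4x4 matrix and applies it to a model-space point. It is a sketch under stated assumptions: the helper names are hypothetical, rotation is restricted to the Z axis for brevity, and the composition order (scale, then rotate, then translate) is a common convention that the text does not specify.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def rotation_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def apply(m, point):
    """Apply a 4x4 homogeneous matrix to a 3D point (w is taken as 1)."""
    v = (*point, 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


# Scale the model to half size, turn it 90 degrees, lift it 0.1 m above the plane.
model_to_scene = matmul(translation(0, 0, 0.1), matmul(rotation_z(90), scaling(0.5)))
point = apply(model_to_scene, (1.0, 0.0, 0.0))
```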

[0043] Ambient light characteristics are obtained by analyzing illumination information in spatial perception data, including ambient light intensity values and the direction of the main light source. Ambient light intensity values are read from illumination sensors, and the direction of the main light source is inferred by analyzing the brightness distribution in the scene image.

[0044] The process of generating an extended reality representation object of the target product in an extended reality scene includes: determining the placement position of the product based on the scene's reference coordinates, with the default position set at a preset height above the reference plane or at the center of the reference point set, such as 0.05 meters, 0.1 meters, or 0.2 meters; loading the product's 3D information to create a 3D rendering object; and configuring rendering parameters based on ambient light characteristics to visually blend the virtual product with the real environment.

[0045] Step 206: Obtain interaction behavior data based on the user's interactive actions in the extended reality scenario.

[0046] The interactive behavior data includes interaction markers corresponding to the interaction location, interaction method, and interaction time sequence with the target product. Specifically, users interact with the target product in various ways within an extended reality scenario. The interaction location refers to the specific part of the product's 3D model where the user's interaction action is directed. Interaction methods include operation types such as clicking, dragging, and zooming. The interaction time sequence records the distribution of interaction actions over time. Interaction markers are structured descriptions obtained after processing the original interaction signals.

[0047] Step 206 includes steps 2061 to 2063:

[0048] Step 2061: Obtain gesture channel data, gaze channel data, and voice channel data from the input channel of the terminal device, respectively.

[0049] Specifically, gesture channel data is acquired through depth cameras or hand tracking sensors, including a sequence of 3D coordinates of key hand points. Each key point corresponds to a specific location on the hand, such as the fingertips, joints, and palm. Gaze channel data is acquired through eye-tracking devices, including pupil position, gaze direction vector, and gaze point coordinates. Voice channel data is acquired through a microphone, including the temporal waveform and spectral characteristics of the audio signal.

[0050] The three channels have different acquisition frequencies. For example, the gesture channel data is acquired at 30, 60, or 120 frames per second. The gaze channel data is acquired at 60, 90, or 120 times per second. The voice channel data is sampled at 16 kHz, 24 kHz, or 48 kHz. Data acquired at different frequencies needs to be synchronized in subsequent processing.

[0051] Step 2062: Map the gesture channel data, gaze channel data, and voice channel data to gesture interaction markers, gaze interaction markers, and voice interaction markers respectively according to the preset mapping rules.

[0052] The gesture interaction markers, gaze interaction markers, and voice interaction markers each include information on the corresponding interaction location and interaction method with the target product.

[0053] It should be noted that the preset mapping rules define the conversion process from raw channel data to interaction markers. Gesture interaction markers include gesture type, pointing position, and amplitude information. Gaze interaction markers include gaze position, gaze duration, and gaze stability information. Voice interaction markers include recognized command words, target location words, and operation verbs.

[0054] Step 2063: Align the execution times of gesture interaction markers, gaze interaction markers, and voice interaction markers to obtain the interaction timing sequence.

[0055] Specifically, the time alignment process unifies the interaction markers from different acquisition frequencies to the same time base. First, a unified timestamp sequence is determined, with timestamp intervals set to, for example, 33 milliseconds, 50 milliseconds, or 100 milliseconds. Then, the interaction markers for each channel are mapped to the nearest timestamp. For data between timestamps, linear interpolation or the nearest neighbor method is used for padding. The interaction time sequence records the interaction marker status of each channel at each timestamp.
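The nearest-neighbor variant of this alignment can be sketched as follows. The function name and the sample data are hypothetical; a 50-millisecond step is one of the example intervals given in the text, and linear interpolation is not shown.

```python
def align_to_timeline(samples, start_ms, end_ms, step_ms=50):
    """Map (timestamp_ms, marker) samples onto a uniform timeline.

    Each tick of the unified timeline takes the marker of the sample whose
    timestamp is nearest to it (nearest-neighbor padding).
    """
    aligned = []
    for tick in range(start_ms, end_ms + 1, step_ms):
        nearest = min(samples, key=lambda s: abs(s[0] - tick))
        aligned.append((tick, nearest[1]))
    return aligned


# Hypothetical gaze markers sampled at irregular times (milliseconds).
gaze = [(0, "cuff"), (16, "cuff"), (33, "collar"), (49, "collar"), (66, "collar"), (99, "cuff")]
aligned = align_to_timeline(gaze, 0, 100, step_ms=50)
```

Running the same function over the gesture and voice markers yields three sequences that share one time base and can be compared tick by tick.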

[0056] The preset mapping rules include: calculating gaze stability score, gesture matching score and semantic similarity score based on gaze channel data, gesture channel data and voice channel data respectively; and generating gaze interaction markers, gesture interaction markers and voice interaction markers and corresponding confidence scores based on the maximum element under the observation sequence determined according to the interaction time sequence and the preset partial order relationship.

[0057] The gaze stability score reflects how steady the user's gaze is and is obtained from the change in gaze position between consecutive frames. For example, for the n consecutive frames of an observation sequence, the Euclidean distance between the gaze points of the i-th frame and the following frame is calculated, and the gaze stability score is defined as:

[0058] ;

[0059] Here n is the total number of frames in the observation sequence. When the gaze point changes little, the distance between adjacent frames is small and the summed distance is also small, so the gaze stability score is higher; conversely, if the gaze jumps frequently, the score is lower. The score has a bounded value range, and the closer the value is to 1, the more stable the gaze.
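The exact formula is not legible in this text, so the sketch below uses one form that satisfies the stated properties: it decreases as the summed frame-to-frame distance grows and equals 1 for a perfectly steady gaze. The form 1 / (1 + sum of distances) and the function name are assumptions, not the patent's definition.

```python
import math


def gaze_stability(points):
    """Assumed form: 1 / (1 + total distance between consecutive gaze points)."""
    total = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return 1.0 / (1.0 + total)


# Hypothetical gaze points, one (x, y, z) tuple per frame.
steady = [(0.0, 0.0, 1.0)] * 5
jumpy = [(0.0, 0.0, 1.0), (0.3, 0.0, 1.0), (0.0, 0.4, 1.0), (0.3, 0.0, 1.0)]
```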

[0060] Gesture matching scores reflect the similarity between a gesture and a predefined gesture template, obtained by comparing key point configurations. For example, the gesture matching score is calculated as follows: to eliminate the influence of differences in hand size and detection position among different users, the coordinates of the detected hand key points are first spatially normalized (e.g., using the wrist joint as the origin, or scaling based on the hand bounding box). Then, the spatial deviation between each normalized key point and its corresponding point in the predefined template gesture is calculated.

[0061] Gesture matching score is defined as:

[0062] ;

[0063] Here, the gesture matching score has a bounded value range: the closer the value is to 1, the more similar the detected gesture is to the template, and the smaller the value, the greater the deviation. K is the total number of hand keypoints involved in the matching (e.g., the 21 MediaPipe keypoints) and is a positive integer. j is the keypoint index, ranging from 1 to K, used to traverse all keypoints. The detected position of the j-th keypoint is given by its two-dimensional or three-dimensional coordinates in the current frame (depending on the dimension of the input data), and the template position of the j-th keypoint is given by its standard coordinates in the template gesture. The squared Euclidean distance between the detected point and the template point measures the local deviation. The distance tolerance parameter (in the same units as the coordinates, such as pixels or normalized length units) controls the sensitivity of the score to deviation: the larger its value, the greater the allowable deviation and the slower the score decays.
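The formula itself is not legible here, so the following sketch assumes a common form consistent with the description: the mean over keypoints of a Gaussian kernel on the squared distance, with the tolerance parameter as the kernel width. The kernel form, the name `sigma`, and the normalization (wrist at the origin, scaling by the largest wrist-to-keypoint distance) are all assumptions.

```python
import math


def normalize(keypoints):
    """Move the wrist (index 0) to the origin and scale by the largest wrist distance."""
    wrist = keypoints[0]
    shifted = [tuple(c - w for c, w in zip(p, wrist)) for p in keypoints]
    scale = max(math.hypot(*p) for p in shifted) or 1.0
    return [tuple(c / scale for c in p) for p in shifted]


def gesture_match(detected, template, sigma=0.2):
    """Assumed form: mean of exp(-d_j^2 / (2 * sigma^2)) over all keypoints."""
    det, tpl = normalize(detected), normalize(template)
    terms = [math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for p, q in zip(det, tpl)]
    return sum(terms) / len(terms)


# Hypothetical 3-keypoint gestures in 2D (a real hand would use e.g. 21 keypoints).
template = [(0, 0), (1, 0), (1, 1)]
same_shape = [(5, 5), (7, 5), (7, 7)]   # same pose, shifted and twice as large
different = [(0, 0), (0, 1), (-1, 1)]   # a rotated pose
```

Because of the normalization step, `same_shape` scores close to 1 despite its different position and size, which is the effect the spatial normalization is meant to achieve.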

[0064] Semantic similarity scoring reflects the semantic relevance between a voice command and the name of a target body part, and is calculated through word vector similarity. For example, the semantic similarity score is calculated by converting the body part words (such as "knee" and "shoulder") output by the speech recognition system into semantic vector representations, and simultaneously mapping the semantic labels of the target interactive body part (such as preset keywords) into vector forms in the same semantic space. The degree of semantic closeness is measured by calculating the directional consistency between the two in the semantic space.

[0065] The semantic similarity score is defined as:

[0066] $$S_{sem} = \frac{\mathbf{v}_w \cdot \mathbf{v}_t}{\lVert \mathbf{v}_w \rVert \, \lVert \mathbf{v}_t \rVert}$$

[0067] where $S_{sem}$ is the semantic similarity score, with a value range of $[-1, 1]$. In practical applications a positive correlation is usually expected, so a high score (close to 1) indicates high semantic similarity, while a low score (close to -1) indicates semantic opposition or irrelevance; a preset value serves as the semantic matching threshold. $\mathbf{v}_w$ is the word embedding of the part word in the speech recognition result, with dimension d (e.g., 300), generated by a pre-trained language model (such as Word2Vec, GloVe, or FastText). $\mathbf{v}_t$ is the vector representation, in the same embedding space, of the standard semantic label (e.g., "elbow") corresponding to the target interaction part; it must come from the same word vector model as $\mathbf{v}_w$. The numerator is the vector dot product; $\lVert \cdot \rVert$ is the L2 norm (Euclidean length) of a vector; the denominator normalizes the result so that it reflects only the angle between the vectors and is not affected by their lengths.
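The dot product divided by the product of the L2 norms is the cosine similarity, which can be sketched in a few lines. The toy vectors in the assertions below are hypothetical stand-ins for real word embeddings, and returning 0.0 for a zero vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)
```

Scaling either vector leaves the score unchanged, which is the length-independence the text attributes to the denominator.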

[0068] The observation sequence is a continuous segment of samples selected from the interaction time sequence, with the number of samples being, for example, 5, 10, or 20. The preset partial order relation defines the ordering between score values. A maximal element is an element for which no other element is greater under the partial order relation. When a score is a maximal element, the corresponding interaction marker is generated.

[0069] Based on the consistency of gaze interaction markers, gesture interaction markers, and voice interaction markers within the observation sequence, and the hand-eye coordination angle relationship calculated from gaze channel data and gesture channel data, coordination gain or confidence suppression is applied to the associated interaction intent. When the hand-eye coordination angle relationship does not meet the coordination condition, the cumulative sample of the observation sequence is expanded according to the interaction time sequence.

[0070] Consistency is determined by comparing whether different interaction markers point to the same target part. For example, the consistency determination method is as follows: extract the target part number corresponding to the gaze interaction marker, the gesture interaction marker, and the voice interaction marker. If the three numbers are equal, the markers are considered consistent; if two numbers are equal and the third is different, they are considered partially consistent; if all three differ, they are considered inconsistent.
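The three-way determination reduces to counting distinct target numbers. A minimal sketch, with a hypothetical function name and hypothetical part numbers in the assertions:

```python
def classify_consistency(gaze_part, gesture_part, voice_part):
    """Classify agreement between the three channels' target part numbers."""
    distinct = len({gaze_part, gesture_part, voice_part})
    return {1: "consistent", 2: "partially consistent", 3: "inconsistent"}[distinct]
```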

[0071] The hand-eye coordination angle is the angle between the line of sight and the direction of the gesture. The coordination condition requires that the angle be less than a third angle threshold, such as 10, 15, or 20 degrees. Coordination gain is achieved by increasing the confidence of the relevant interaction markers. Specifically, when it is determined that coordination gain needs to be applied, the original confidence C of the relevant interaction markers is multiplied by the gain factor k to obtain the enhanced confidence C', expressed by the formula:

[0072] $$C' = k \cdot C$$

[0073] The gain factor k is determined by the degree of coordination: it is 1.5 for complete coordination and 1.2 for partial coordination. Confidence suppression is achieved by reducing the confidence of inconsistent interaction markers. Extending the observation sequence refers to increasing the number of observation samples to obtain more stable judgment results.
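Applying the gain or suppression is a single multiplication. In the sketch below, the gain factors 1.5 and 1.2 come from the text and the suppression factor 0.8 is one of the example values given in [0086]; the state labels, the function name, and the clamp to 1.0 are assumptions.

```python
GAIN = {"consistent": 1.5, "partially consistent": 1.2}


def adjust_confidence(confidence, consistency, suppression=0.8):
    """Multiply by a gain factor for (partial) agreement, otherwise suppress."""
    factor = GAIN.get(consistency, suppression)
    return min(1.0, confidence * factor)  # assumption: confidence is capped at 1.0
```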

[0074] For example, as shown in Figure 2, the process of mapping gesture channel data, gaze channel data, and voice channel data to gesture interaction markers, gaze interaction markers, and voice interaction markers according to the preset mapping rules in this application includes:

[0075] A gaze ray is generated based on gaze channel data, and geometric intersection is performed between the gaze ray and the target part in the three-dimensional information of the product. The hit ratio and gaze variability are calculated based on the observation sequence determined by the interaction time sequence, and a gaze stability score is calculated by combining ambient light characteristics and the motion state of the terminal device. If the gaze stability score belongs to the set of maximal elements determined by the observation sequence as the domain and according to the preset partial order relationship, a gaze interaction mark is generated, and the confidence of the gaze interaction mark is determined based on the hit ratio, the landing point dispersion calculated based on the gaze landing point, and the pupil tracking confidence calculated based on the gaze channel data.

[0076] The gaze ray originates from the center of the eyeball, and its direction is determined by the pupil position and corneal reflection. Geometric intersection is achieved using an intersection test algorithm between the ray and the triangular mesh. The hit ratio is the ratio of the number of frames in which the gaze falls within the target area to the total number of frames in the observation sequence. The gaze variability is the average angular change of the gaze direction vector between consecutive frames. Ambient light characteristics affect pupil size and eye-tracking accuracy, while the motion state of the terminal device affects the calculation of the relative position of the gaze. The method for calculating the landing point dispersion is as follows: first, calculate the centroid position (i.e., the average position) of all gaze landing points:

[0077] $$\bar{\mathbf{p}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{p}_i$$

[0078] Then calculate the sum of squared distances from each landing point to the centroid. The landing point dispersion is defined as:

[0079] $$D = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{p}_i - \bar{\mathbf{p}} \rVert^2}$$

[0080] where $\mathbf{p}_i$ is the three-dimensional spatial coordinates of the i-th gaze landing point (unit: millimeters or normalized coordinates); i is the index of the landing point, ranging from 1 to N; N is the total number of landing points involved in the calculation; $\bar{\mathbf{p}}$ is the geometric centroid of all landing points and represents the average gaze position; $\lVert \cdot \rVert$ is the L2 norm (i.e., Euclidean distance) of a vector; and D is the landing point dispersion, with units consistent with the coordinates, which reflects the spatial dispersion of the landing points: the smaller the value, the more concentrated the gaze. The pupil tracking confidence score is directly output by the eye-tracking algorithm. For example, when detecting the pupil, the algorithm generates a confidence score based on the clarity of the pupil outline, the regularity of the pupil shape, and the continuity of the pupil position between consecutive frames. When the pupil outline is complete and the edges are clear, the confidence score is close to 1.0; when the pupil is partially obscured or the image is blurred, resulting in an incomplete outline, the confidence score decreases to between 0.5 and 0.8; when the pupil cannot be detected, the confidence score is 0.
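The hit ratio and the landing point dispersion can be sketched as below. The hit ratio follows the definition in the text directly. For the dispersion, the root-mean-square distance to the centroid is assumed, since the text states that its units match the coordinates; the function names are hypothetical.

```python
import math


def hit_ratio(hits):
    """hits: one boolean per frame, True when the gaze falls within the target area."""
    return sum(hits) / len(hits)


def landing_dispersion(points):
    """Assumed form: root-mean-square distance from each landing point to the centroid."""
    n = len(points)
    centroid = tuple(sum(c) / n for c in zip(*points))
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in points) / n
    return math.sqrt(mean_sq)
```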

[0081] The gesture pointing vector is calculated based on the gesture channel data, and the vector pointing from the center of the hand to the centroid of the target part and the depth relationship between the hand and the target part are determined based on the scene reference coordinates. The gesture matching score is calculated based on the gesture channel data. The gesture matching score is jointly determined by the angle relationship and depth relationship between the gesture pointing vector and the vector, and the confidence of the gesture key point detection. If the gesture matching score belongs to the set of maximal elements determined by the observation sequence as the domain and according to the preset partial order relationship, a gesture interaction mark is generated. The gesture interaction mark is matched with the gesture template library based on the gesture channel data to determine the gesture category and its confidence.

[0082] The gesture pointing vector points from the palm to the tip of the index finger. The hand center is obtained by calculating the geometric center of all hand keypoints. The target part's centroid is obtained by calculating the weighted average position of all vertices within the part. The depth relationship represents the relative position of the hand and the target part in the scene's depth direction. The angular relationship is the angle between the gesture pointing vector and the vector from the hand to the target part. The gesture template library pre-stores keypoint configuration patterns for commonly used gestures. Gesture categories include pointing, grasping, and swiping.

[0083] Speech parsing is performed based on speech channel data to obtain part words and attribute words, and semantic similarity scores are calculated based on the part semantic labels and part words of the target part. If the semantic similarity score is equal to the supremum of the current candidate word set, a speech interaction tag is generated. If a gaze interaction tag or a gesture interaction tag hits the same target part in the observation sequence, the priority and confidence of the speech interaction tag are increased.

[0084] Speech parsing uses an automatic speech recognition algorithm to convert audio signals into text. Part word extraction uses named entity recognition. Attribute word extraction uses keyword extraction. Part semantic labels are predefined for each product part. The semantic similarity score is obtained by calculating the cosine similarity of word vectors. The supremum is the maximum similarity value in the candidate word set. Priority enhancement is achieved by increasing the processing order weight of voice interaction markers.

[0085] If, within the observation sequence, the gaze interaction marker and the gesture interaction marker hit the same target part and the voice interaction marker is semantically consistent with the target part, then the confidence of the interaction intent associated with the three is increased; if the target parts corresponding to the gaze interaction marker and the gesture interaction marker are inconsistent, and the angle relationship calculated based on the gesture channel data and the gaze channel data does not belong to the minimal set of elements determined according to the preset partial order relationship with the candidate part pair set as the domain, then the confidence of the relevant interaction marker is suppressed.

[0086] Consistency among all three indicates a clear user interaction intent. The confidence level is increased by, for example, 1.2, 1.5, or 2 times the original value. Inconsistency in target locations indicates potential interaction ambiguity. The set of minimal elements contains the pair of locations with the smallest angles. The confidence level is suppressed by, for example, 0.8, 0.5, or 0.3 times the original value.

[0087] The hand-eye coordination angle relationship is calculated based on gaze channel data and gesture channel data, and a coordination angle reference value is determined based on the interaction time sequence in the vicinity. If the current hand-eye coordination angle relationship is equal to the coordination angle reference value, or if the current hand-eye coordination angle relationship belongs to the minimum element set in the set determined by the observation sequence according to the preset partial order relationship, then the gesture matching score and gaze stability score are increased accordingly. If the current hand-eye coordination angle relationship does not belong to the minimum element set under the partial order relationship, then the gesture matching score and gaze stability score are decreased accordingly, or the cumulative sample of the observation sequence is expanded to obtain more observations.

[0088] The hand-eye coordination angle relationship is obtained by calculating the angle between the gaze direction vector and the gesture pointing vector. The nearest time point refers to several time windows before and after the current time point, with window sizes such as 100 milliseconds, 200 milliseconds, or 500 milliseconds. The coordination angle reference value is the median or mean of the angles within the nearest time points. The scoring gain is obtained by multiplying it by a gain factor, such as 1.1, 1.3, or 1.5. The scoring loss is obtained by multiplying it by a decay factor, such as 0.9, 0.7, or 0.5.
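A minimal sketch of the angle computation and the reference value follows. The 15-degree threshold is one of the example values given for the coordination condition and the median is one of the two options named in the text; the function names and the sample angles are hypothetical.

```python
import math
from statistics import median


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_theta))


def is_coordinated(gaze_dir, pointing_dir, max_angle_deg=15.0):
    """Coordination condition: the hand-eye angle is below the threshold."""
    return angle_deg(gaze_dir, pointing_dir) < max_angle_deg


# Hypothetical hand-eye angles observed in the surrounding time window (degrees).
recent_angles = [8.0, 12.0, 9.5, 30.0, 10.0]
reference = median(recent_angles)
```

The median is less sensitive than the mean to a single outlier such as the 30-degree sample above.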

[0089] The observation sequences are selected based on the interaction time-series, with the number of samples chosen accordingly. For example, the number of samples in the observation sequence can be determined according to the following rules: for unimodal interaction (using only one of gaze, gesture, or voice), the number of samples is set to 5 to 10; for bimodal interaction (using both input methods simultaneously), the number of samples is set to 10 to 15; and for trimodal interaction (using gaze, gesture, and voice simultaneously), the number of samples is set to 15 to 20. When unstable interaction signals are detected, such as frequent gaze shifts or low gesture recognition confidence, the number of samples is increased by 5 to obtain a more reliable judgment result. The calculation of both gaze stability score and gesture matching score involves conditional quantities provided by ambient light characteristics and the motion state of the terminal device.
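The sample-count rule can be written as a small lookup. The ranges and the increment of 5 come from the text; starting from the lower bound of each range and the function name are assumptions.

```python
def observation_sample_count(active_modalities, unstable=False):
    """Pick the observation window length from the number of active input channels."""
    ranges = {1: (5, 10), 2: (10, 15), 3: (15, 20)}
    low, _high = ranges[active_modalities]
    count = low  # assumption: start from the lower bound of the range
    if unstable:
        count += 5  # widen the window when signals are unstable
    return count
```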

[0090] Ambient light characteristics, as a conditional variable, affect the noise level of gaze tracking. The motion state of the terminal device, as a conditional variable, is used to compensate for coordinate offsets caused by device movement. The ambient light variable enters the calculation by normalizing the ambient light intensity to the [0,1] interval to obtain the ambient light condition factor:

[0091] ;

[0092] Here, the ambient light condition factor has a value range of [0,1]; the current ambient light intensity value is expressed in lux; and the maximum light intensity value is set to the maximum supported by the system, for example, 10000 lux.

[0093] The device motion condition factor is obtained by normalizing the device's motion speed:

[0094] ;

[0095] Here, the device motion condition factor has a value range of (0,1]; the current speed of the device is expressed in meters per second; and the reference speed is set, for example, to 0.5 meters per second. When the device is stationary, the speed is zero and the factor takes its maximum value of 1; when the device moves quickly, the speed increases and the factor decreases.

[0096] The complete process of incorporating the condition factors into the score calculation is as follows: first, the gaze stability score or the gesture matching score is calculated using the aforementioned method and taken as the original score, which is then adjusted according to ambient light intensity and device motion. For gaze interaction markers, the gaze stability score is used as the original score; for gesture interaction markers, the gesture matching score is used as the original score.

[0097] The final score is calculated using a piecewise adjustment function.

[0098] For example, first calculate the comprehensive condition factor $F_{total}$:

[0099] ;

[0100] The first case corresponds to extremely low ambient light or rapid device movement, in which a reliable interaction signal cannot be obtained, and the comprehensive condition factor $F_{total}$ is set to 0; the second case corresponds to moderate environmental conditions, in which $F_{total}$ decreases linearly; the third case corresponds to favorable environmental conditions, in which $F_{total}$ is kept above 0.7.

[0101] Then calculate the final score:

[0102] ;

[0103] Here, the terms of the formula are the final score, the original score, the comprehensive condition factor $F_{total}$, and the minimum retention score, which is set to 0.1 to ensure that some interactive responsiveness is preserved under poor conditions. The final score is used to determine whether to generate the corresponding interaction marker.

[0104] Step 208: Associate and match the interaction behavior data with the product attribute information to generate a set of candidate product features.

[0105] Specifically, interactive behavior data records user interaction information with products, including interaction location, interaction method, and interaction sequence. Product attribute information defines the interactive attributes of various parts of the product. By associating and matching the two, the user's product selection intent can be identified.

[0106] Based on the scene's reference coordinates, the interaction location is spatially positioned to determine the target part in the product's 3D information.

[0107] Interaction locations are extracted from the interaction markers of each channel. Gaze interaction markers provide the 3D coordinates of the gaze point, gesture interaction markers provide the 3D coordinates of the gesture pointing point, and voice interaction markers provide the name of the part referred to by the voice command. This location information is then transformed from the device coordinate system to the scene reference coordinate system to obtain a unified spatial location representation.

[0108] Target parts are determined through spatial matching. For gaze and gesture interactions, the shortest distance from the interaction location to each part of the product's 3D model is calculated, and the part with the smallest distance is identified as the target part. For voice interactions, semantic matching maps part names to the corresponding parts of the product's 3D model. When multiple interaction markers point to the same part, that part is identified as the target part; when they point to different parts, the most likely target part is determined based on the confidence level of each interaction marker.

[0109] If the target part has an attribute identifier, the attribute identifier is combined with the interaction method to obtain the first candidate feature.

[0110] The determination of attribute identifiers is achieved by querying the product attribute information table. When the target part has a corresponding record in the attribute information table, it indicates that the part has predefined interactive attributes. For example, the cuff part of the clothing has the attribute identifier cuff_002, indicating that the part supports length adjustment function.

[0111] The process of generating the first candidate feature is as follows: extract the attribute identifier of the target area, identify the current interaction method (such as click, drag, zoom), and combine the two to form a feature description. For example, if the attribute identifier is collar_001 and the interaction method is click, the generated first candidate feature is collar_001_click, indicating that the user clicked the collar area.

[0112] If no attribute identifier is available, the interaction method and interaction timing sequence are combined to obtain the second candidate feature.

[0113] When the target area has no corresponding record in the attribute information table, it indicates that the area does not have predefined interactive attributes. In this case, it is necessary to determine the user's intent through the timing pattern of the interaction.

[0114] The process of generating the second candidate feature involves extracting the interaction method and analyzing the duration, repetition frequency, and change pattern in the interaction time sequence. For example, if a user performs three consecutive clicks on an area without attribute labels, with an interval of less than 500 milliseconds, the generated second candidate feature is `triple_click_fast`, indicating a rapid triple-click operation. Similarly, if a user gazes at the same area for more than 2 seconds, the generated second candidate feature is `gaze_hold_long`, indicating a prolonged gaze.
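The two kinds of candidate features can be sketched as follows. The feature names `collar_001_click`, `triple_click_fast`, and `gaze_hold_long` and the 500-millisecond and 2-second thresholds are from the text; the function names and parameter layout are hypothetical.

```python
def first_candidate(attribute_id, method):
    """Attribute-driven feature: attribute identifier joined with the interaction method."""
    return f"{attribute_id}_{method}"


def second_candidate(method, timestamps_ms=(), duration_s=0.0):
    """Behavior-driven feature derived from the timing pattern, or None if no pattern fits."""
    if method == "click" and len(timestamps_ms) == 3:
        gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
        if all(gap < 500 for gap in gaps):
            return "triple_click_fast"
    if method == "gaze" and duration_s > 2.0:
        return "gaze_hold_long"
    return None
```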

[0115] The first and second candidate features are added to the candidate product feature set.

[0116] The candidate feature set is stored in a list structure, with each element containing feature type, feature value, confidence level, and timestamp. The first candidate feature is labeled as attribute-driven, and the second candidate feature is labeled as behavior-driven.

[0117] When a feature is added to the set, deduplication is performed. If a newly generated feature is identical to an existing feature in the set, the confidence score and timestamp of the existing feature are updated, and it is not added again. The confidence score is updated by taking the maximum value of the old and new confidence scores, and the timestamp is updated to the latest interaction time.

[0118] The candidate feature set maintains a fixed-size buffer, for example, storing a maximum of 20 features. When the number of features exceeds the limit, the oldest feature is deleted according to the first-in-first-out principle, or the feature with the lowest confidence is deleted according to the confidence level.
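The list structure, deduplication rule, and fixed-size buffer can be sketched together. The class and field names are hypothetical, and only the first-in-first-out eviction option is shown; deletion by lowest confidence would be the alternative described in the text.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    kind: str          # "attribute-driven" or "behavior-driven"
    value: str         # e.g. "collar_001_click"
    confidence: float
    timestamp_ms: int


class CandidateFeatureSet:
    def __init__(self, capacity=20):
        self.capacity = capacity
        self.items = []  # insertion order, oldest first

    def add(self, feature):
        for existing in self.items:
            if existing.kind == feature.kind and existing.value == feature.value:
                # Duplicate: keep the higher confidence and the latest timestamp.
                existing.confidence = max(existing.confidence, feature.confidence)
                existing.timestamp_ms = max(existing.timestamp_ms, feature.timestamp_ms)
                return
        self.items.append(feature)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # first-in-first-out eviction


features = CandidateFeatureSet(capacity=2)
features.add(Feature("attribute-driven", "collar_001_click", 0.6, 100))
features.add(Feature("attribute-driven", "collar_001_click", 0.9, 200))
features.add(Feature("behavior-driven", "gaze_hold_long", 0.7, 300))
features.add(Feature("behavior-driven", "triple_click_fast", 0.5, 400))
```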

[0119] Step 210: Perform fusion on the candidate product feature set according to the preset fusion criteria to obtain the target product decision, and output the interactive feedback information corresponding to the target product decision.

[0120] The target product selection decision indicates the conclusion of product selection for the target product or the screening results of candidate variations of the target product. Specifically, the candidate product feature set contains multiple features from different interaction channels, which must be disambiguated through a fusion process to form a unified product selection decision. The product selection conclusion includes confirming the selection of the current product, viewing product details, comparing different styles, etc. The screening results of candidate variations include selecting product variations with specific colors, sizes, or configurations.

[0121] Step 210 involves fusing the candidate product feature set according to a preset fusion criterion, including steps B1 to B3:

[0122] Step B1: Divide the candidate product feature set into a spatial feature subset and a semantic feature subset according to the feature source.

[0123] The segmentation process involves examining how each feature is generated. Features derived from the coordinates of the gaze point or the coordinates of the gesture are categorized into the spatial feature subset, while features derived from speech recognition results or attribute label matching are categorized into the semantic feature subset. For example, features generated from gazing at the cuff for 2 seconds belong to the spatial feature subset, while features generated from the voice command to select red belong to the semantic feature subset.

[0124] The spatial feature subset includes features based on interaction location, while the semantic feature subset includes features based on interaction method and attribute identifier.

[0125] Step B2: Perform fusion on the spatial feature subset and the semantic feature subset according to the preset fusion criteria to obtain the corresponding subset fusion results.

[0126] For the spatial feature subset, a spatial consistency score is calculated based on the confidence scores of the gaze interaction markers and gesture interaction markers. Penalty terms are determined based on the projection error, the pose residual, and the degree of occlusion, and are applied to the score to obtain the subset fusion result of the spatial feature subset.

[0127] The spatial consistency score is calculated as follows. The confidence values of all gaze interaction markers and gesture interaction markers are extracted from the spatial feature subset. When the gaze and the gesture point to the same spatial region, they are treated as spatially consistent, and the score is the arithmetic mean of the two confidence values. When they point to different regions, they are treated as spatially divergent, and the score is the product of the higher confidence value and a decay coefficient. The decay coefficient is determined by the distance between the two regions: the greater the distance, the stronger the decay.

[0128] The projection error is obtained by calculating the distance from the interaction position to the nearest point on the product surface. When the interaction position falls exactly on the product surface, the projection error is zero; the greater the deviation, the larger the projection error. The pose residual is obtained by comparing the product's pose at adjacent time points. When the product remains stationary, the pose residual is zero; the greater the change in pose when the product rotates or translates, the larger the pose residual.

[0129] The degree of occlusion is calculated from the proportion of the product that is visible. When the product is fully visible, the degree of occlusion is zero; under partial occlusion, it is calculated from the occluded area. Each penalty term lowers the spatial consistency score, and the final fusion result of the spatial feature subset is the spatial consistency score minus the sum of all penalty terms.
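
The spatial subset fusion described above can be sketched as follows. The exponential form of the decay coefficient, the `decay_rate` parameter, and the per-penalty weights are assumptions made for illustration; the description only requires that decay grows with distance and that the penalties are subtracted from the score.

```python
import math


def spatial_consistency(gaze_conf, gesture_conf, same_region,
                        region_distance=0.0, decay_rate=1.0):
    """Spatial consistency score from gaze and gesture confidence values."""
    if same_region:
        # Consistent case: arithmetic mean of the two confidence values.
        return (gaze_conf + gesture_conf) / 2.0
    # Divergent case: higher confidence times a decay coefficient.
    # exp(-rate * distance) is an assumed form; it shrinks as distance grows.
    decay = math.exp(-decay_rate * region_distance)
    return max(gaze_conf, gesture_conf) * decay


def spatial_subset_fusion(score, projection_error, pose_residual, occlusion,
                          weights=(1.0, 1.0, 1.0)):
    """Subtract the sum of the (assumed weighted) penalty terms from the score."""
    w_proj, w_pose, w_occ = weights
    penalties = (w_proj * projection_error
                 + w_pose * pose_residual
                 + w_occ * occlusion)
    return score - penalties
```

For example, confidences of 0.8 and 0.6 pointing at the same region give a score of about 0.7, and penalties of 0.05, 0.1, and 0.2 reduce it to about 0.35.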

[0130] For a subset of semantic features, a semantic consistency score is calculated based on attribute semantic similarity and part semantic similarity. Amplification or attenuation is determined based on the consistency between gaze interaction markers, gesture interaction markers and voice interaction markers, thereby obtaining the subset fusion result of the semantic feature subset.

[0131] Attribute semantic similarity is calculated by comparing the attribute identifiers in different features. Similarity is high when multiple features contain the same attribute identifier, moderate when the attribute identifiers differ but belong to the same category, and low when the attribute identifiers are completely unrelated. Part semantic similarity is calculated by comparing the part name mentioned in the voice command with the part actually being interacted with. Similarity is highest when the name matches exactly, lower for a synonym match, and lowest when there is no association.

[0132] The consistency of the three interaction markers is determined by comparing the semantic content they point to. When gaze, gesture, and speech all point to the same semantic concept, amplification is applied, increasing the semantic consistency score to 1.5 times its original value. When only two markers are consistent, the original score is maintained. When all three markers are inconsistent, attenuation is applied, reducing the score to 0.6 times its original value.
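
A minimal sketch of the amplification and attenuation rule follows. The factors 1.5, 1.0, and 0.6 come from the description; combining the attribute and part similarities by a plain mean is an assumption, since the description does not state how the two similarities form the base score.

```python
def consistency_factor(gaze_concept, gesture_concept, voice_concept):
    """Scale factor based on how many of the three markers agree semantically."""
    distinct = len({gaze_concept, gesture_concept, voice_concept})
    if distinct == 1:
        return 1.5  # all three markers agree: amplification
    if distinct == 2:
        return 1.0  # exactly two markers agree: score unchanged
    return 0.6      # all three disagree: attenuation


def semantic_subset_fusion(attribute_similarity, part_similarity,
                           gaze_concept, gesture_concept, voice_concept):
    """Semantic subset fusion result: base score times the consistency factor."""
    # Assumed base score: mean of attribute and part semantic similarity.
    base = (attribute_similarity + part_similarity) / 2.0
    return base * consistency_factor(gaze_concept, gesture_concept, voice_concept)
```

Because the amplified score can exceed 1, a downstream step may need to normalize or clip it; the description does not say whether it does.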

[0133] Step B3: Perform secondary fusion on the subset fusion results according to the preset conflict resolution criteria, and perform spatial alignment according to the scene reference coordinates and temporal alignment according to the interaction time sequence to obtain the target product selection decision.

[0134] It should be noted that the preset conflict resolution criteria include:

[0135] Spatial alignment is performed on the subset fusion results of spatial feature subsets based on scene reference coordinates, and temporal alignment is performed on the subset fusion results of semantic feature subsets based on interaction time sequence, so as to obtain subset fusion results with spatiotemporal consistency.

[0136] Spatial alignment transforms the fusion results of spatial features obtained at different times into a unified scene reference coordinate system. Because users and devices may move, the coordinates of the same part of a product may differ at different times. By using a coordinate transformation matrix to unify all spatial positions to the scene reference coordinate system, spatial features from different times can be directly compared.
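
The coordinate unification can be illustrated with a homogeneous transform. This is a sketch assuming the coordinate transformation matrix is a 4x4 row-major matrix mapping device-frame points into the scene reference coordinate system; the description does not specify the matrix layout.

```python
def to_scene_frame(transform, point):
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3D point."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    out = [sum(transform[row][col] * homogeneous[col] for col in range(4))
           for row in range(4)]
    w = out[3]
    return (out[0] / w, out[1] / w, out[2] / w)


# Example: the device moved, so its frame is offset by (1, 2, 3) from the scene frame.
translate = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
cuff_in_scene = to_scene_frame(translate, (0.0, 0.0, 0.0))
```

Once every observation is expressed in the scene frame, positions of the same part captured at different times can be compared directly.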

[0137] Temporal alignment arranges the fused semantic features in chronological order to identify semantic evolution patterns. For example, if a user first says "look at the cuffs" and then "change styles," temporal alignment can identify the shift in intent from browsing to selection. It analyzes the temporal relationships of semantic features through a sliding time window, with the window size adjusted according to interaction density.
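
The sliding-window analysis might look like the sketch below. The rule that shrinks the window as interaction density rises, along with the `base_seconds` and `min_seconds` values, is an assumption; the description says only that the window size is adjusted according to interaction density.

```python
def window_size(events_per_second, base_seconds=4.0, min_seconds=1.0):
    """Assumed rule: denser interaction yields a shorter window, bounded below."""
    density = max(events_per_second, 1.0)
    return max(min_seconds, base_seconds / density)


def sliding_windows(events, size_seconds):
    """Order (timestamp, label) events by time and group labels per window.

    One window starts at each event and covers the following size_seconds.
    """
    ordered = sorted(events, key=lambda event: event[0])
    windows = []
    for index, (start, _) in enumerate(ordered):
        windows.append([label for t, label in ordered[index:]
                        if t - start <= size_seconds])
    return windows
```

For the example in the text, a window containing "look at the cuffs" followed by "change styles" is the kind of ordered pair from which a shift from browsing to selection could be inferred.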

[0138] A posterior score is calculated based on the subset fusion results of spatial feature subsets, the subset fusion results of semantic feature subsets, and the temporal continuity score obtained from the interaction time sequence. A conflict graph is constructed with candidate product features as nodes and mutually exclusive relationships as edges to complete the candidate selection.

[0139] The posterior score integrates information from three dimensions: the fusion result of the spatial feature subset reflects the certainty of the spatial interaction, the fusion result of the semantic feature subset reflects the accuracy of semantic understanding, and the temporal continuity score reflects the stability of the interaction intent. These three dimensions are combined in a preset ratio, for example, 0.4 for the spatial dimension, 0.4 for the semantic dimension, and 0.2 for the temporal dimension. The temporal continuity score is calculated by counting how many times a feature recurs within a continuous time window; the more recurrences, the higher the score.
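
As a sketch, the weighted combination with the example ratio 0.4 / 0.4 / 0.2 can be written as below. Normalizing the recurrence count by the window length is an assumption made so that the temporal score stays in the range 0 to 1.

```python
def temporal_continuity(window, feature):
    """Share of observations in the window that repeat the given feature."""
    if not window:
        return 0.0
    # Assumed normalization: recurrence count divided by window length.
    return window.count(feature) / len(window)


def posterior_score(spatial, semantic, temporal, weights=(0.4, 0.4, 0.2)):
    """Combine the three dimensions in a preset ratio (example ratio from the text)."""
    w_spatial, w_semantic, w_temporal = weights
    return w_spatial * spatial + w_semantic * semantic + w_temporal * temporal
```

For instance, a feature that appears in two of three recent observations has a temporal continuity of about 0.67 under this normalization.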

[0140] The conflict graph is constructed to identify logically mutually exclusive feature pairs. For example, selecting red and selecting blue are mutually exclusive, as is zooming in and zooming out. Each candidate feature is treated as a node in the graph, and mutual exclusion relationships are treated as edges. By finding the maximum independent set in the graph, a set of non-conflicting features is selected as candidate decisions.
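
The conflict graph step can be sketched with an exhaustive search for a maximum independent set. Brute force is used here only because candidate sets in a single interaction are assumed to be small; the description does not say which search algorithm is used, and the tie-breaking (first set found in input order) is likewise an assumption.

```python
from itertools import combinations


def maximum_independent_set(nodes, exclusive_pairs):
    """Return a largest subset of nodes containing no mutually exclusive pair.

    Exhaustive search, exponential in len(nodes); suitable for small graphs only.
    """
    conflicts = {frozenset(pair) for pair in exclusive_pairs}
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if not any(frozenset(pair) in conflicts
                       for pair in combinations(subset, 2)):
                return list(subset)
    return []


features = ["select_red", "select_blue", "zoom_in", "zoom_out"]
exclusive = [("select_red", "select_blue"), ("zoom_in", "zoom_out")]
chosen = maximum_independent_set(features, exclusive)
```

With these inputs no conflict-free set of three exists, so the result contains two features, one color selection and one zoom action.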

[0141] The fluctuation set is determined based on the sequence of posterior score differences formed by the most recent decisions. If the difference between the two highest posterior scores belongs to the fluctuation set, the target product selection decision of the previous round is maintained and the temporal continuity score is accumulated, until the posterior score difference no longer belongs to the fluctuation set or the accumulated sample size reaches the sample size determined based on the interaction time sequence.

[0142] The fluctuation set reflects the uncertainty range of the decision. The differences between the two highest posterior scores from the most recent 5 to 10 decisions are collected, and the mean and standard deviation of the difference sequence are calculated. The fluctuation set is defined as the interval between the mean and one standard deviation. When a new score difference falls within this interval, it indicates that the decision is not sufficiently clear and more observations are needed.
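
A sketch of the fluctuation set computation follows. It assumes the interval is one standard deviation on either side of the mean and uses the population standard deviation; the text leaves both choices open, so these are illustrative assumptions.

```python
from statistics import mean, pstdev


def fluctuation_set(recent_differences):
    """Interval (low, high) built from recent top-two posterior score differences.

    Assumed to be mean minus one standard deviation to mean plus one.
    """
    if not recent_differences:
        raise ValueError("at least one recent difference is required")
    center = mean(recent_differences)
    spread = pstdev(recent_differences)
    return center - spread, center + spread


def in_fluctuation_set(difference, recent_differences):
    """True when a new score difference falls inside the fluctuation interval."""
    low, high = fluctuation_set(recent_differences)
    return low <= difference <= high
```

With recent differences of 0.1, 0.2, 0.3, 0.2, and 0.2, a new difference of 0.25 falls inside the interval and calls for more observations, while 0.5 falls outside it.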

[0143] During the accumulation of samples, new interaction data continues to be collected, the candidate feature set is updated, and the posterior score is recalculated. The upper limit of the number of samples is set according to the complexity of the interaction, with 10 samples for simple interactions and 20 samples for complex interactions.

[0144] When either of the aforementioned stopping conditions is met: if the posterior score difference still belongs to the fluctuation set, the target product selection decision is determined by the maximum element of the temporal continuity score in the partial order relation with the candidate set as the domain; otherwise, the target product selection decision is determined by the maximum element of the posterior score in the partial order relation with the candidate set as the domain. If the difference between the two highest posterior scores does not belong to the fluctuation set in the first place, the target product selection decision is determined directly by the maximum element of the posterior score in the partial order relation with the candidate set as the domain.

[0145] The decision-making process is as follows. First, calculate the posterior score for all candidates and sort them by score. Then compare the difference between the highest and second-highest scores. If the difference is greater than the upper bound of the fluctuation set, the user's intent is clear, and the highest-scoring candidate is selected directly as the target product selection decision. If the difference is within the fluctuation set, more information is needed, and observations continue to be accumulated. If the difference is still within the fluctuation set after accumulation, the temporal continuity score is used as the decision criterion, and the candidate that is most stable over time is selected.
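
The decision flow can be sketched as a single function. The argument names and the convention of returning the previous decision while samples are still being accumulated are assumptions for illustration; candidates are plain strings and scores are dictionaries keyed by candidate.

```python
def decide(posterior, temporal, fluctuation, samples_used, max_samples, previous=None):
    """Pick a target product selection decision from scored candidates.

    posterior, temporal: dicts mapping candidate name to score.
    fluctuation: (low, high) interval for the top-two posterior score difference.
    """
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    if not ranked:
        return previous
    if len(ranked) == 1:
        return ranked[0]
    gap = posterior[ranked[0]] - posterior[ranked[1]]
    low, high = fluctuation
    if not (low <= gap <= high):
        # Clear intent: the gap is outside the fluctuation set.
        return ranked[0]
    if samples_used < max_samples:
        # Unclear intent: hold the previous decision and keep accumulating.
        return previous
    # Sample budget exhausted and still unclear: fall back to temporal continuity.
    return max(temporal, key=temporal.get)
```

A caller would re-run `decide` after each new batch of interaction data, using an upper limit such as 10 samples for simple interactions and 20 for complex ones.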

[0146] The parameters used to calculate the spatial consistency score, semantic consistency score, and posterior score are determined based on ambient light characteristics, the motion state of the terminal device, and the degree of occlusion. When ambient light is sufficient, gaze tracking is accurate, and the weight of the spatial consistency score is increased accordingly. When the device is stable, gesture recognition is reliable, and its weight is maintained at the normal level. When occlusion exists, the weight of features related to the occluded part is reduced.

[0147] If the candidate corresponding to the posterior score contains candidate variants and the semantic feature subset indicates the candidate variants, then a variant switching instruction is generated when determining the target product selection decision. The variant switching instruction contains the identifier of the target variant and switching parameters.

[0148] Step 210 outputs interactive feedback information corresponding to the target product selection decision, including:

[0149] In the extended reality scene, the appearance of the extended reality presentation object is replaced or its parameters are switched to generate visual feedback.

[0150] Appearance replacement is achieved by updating the product's texture map. The system pre-stores texture files for various variations, and selects the corresponding texture based on the target product to replace the currently displayed texture. Parameter switching is achieved by modifying rendering parameters, including adjusting transparency to reveal internal structure, modifying reflectivity to change gloss effects, and adjusting roughness to change surface texture.

[0151] A preset prompt layer template read from the rendering resources is overlaid in the extended reality scene, and the prompt content is mapped according to the target product selection decision.

[0152] Preset prompt layer templates are stored in the rendering resource library, including information card templates, operation prompt templates, and status indicator templates. The prompt content mapping selects the appropriate template and fills in the specific content based on the decision type. For example, when selecting a color variant, a color swatch template is used to display available colors; when confirming a product selection, a confirmation card is used to display product information.

[0153] The brightness and hue parameters of the preset prompt layer template are adjusted according to the ambient light characteristics to generate and render the prompt layer, thus creating prompt feedback.

[0154] Brightness adjustment ensures that the prompt layer remains clearly visible under varying lighting conditions: brightness is increased in strong light so that the layer is not washed out by ambient light, and decreased in low light to avoid glare. Hue adjustment harmonizes the prompt layer with the color temperature of the ambient light: a warm tone is used in warm light environments and a cool tone in cool light environments.
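
The adjustment can be sketched as below. Every numeric threshold and scale factor here (the lux limits, the color temperature limits in kelvin, and the 1.3 and 0.7 multipliers) is a hypothetical placeholder; the description gives only the direction of each adjustment.

```python
def adjust_prompt_layer(base_brightness, ambient_lux, color_temp_kelvin):
    """Return (brightness, tone) for the prompt layer given ambient light features.

    All thresholds and factors are illustrative assumptions.
    """
    if ambient_lux >= 10000:      # strong light: raise brightness, capped at 1.0
        brightness = min(1.0, base_brightness * 1.3)
    elif ambient_lux <= 50:       # low light: lower brightness to avoid glare
        brightness = base_brightness * 0.7
    else:
        brightness = base_brightness

    if color_temp_kelvin < 4000:  # warm ambient light
        tone = "warm"
    elif color_temp_kelvin > 5500:  # cool ambient light
        tone = "cool"
    else:
        tone = "neutral"
    return brightness, tone
```

In practice the thresholds would be tuned per device, since display luminance and ambient light sensors differ between terminals.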

[0155] Visual feedback and prompt feedback are presented as interactive feedback information.

[0156] Visual feedback acts directly on the product model, allowing users to immediately see changes in the product's appearance. Prompt feedback appears as a semi-transparent interface around the product, without obscuring the product itself. Both types of feedback are displayed simultaneously, jointly conveying the product selection decision to the user.

[0157] If the target product selection decision indicates the screening results of the candidate variants, a variant switching instruction is generated. In response to the variant switching instruction, the product 3D information corresponding to the candidate variant is retrieved from the product information database based on the variant identifier to replace the presentation content of the extended reality object.

[0158] The variant switching process includes: querying the product information database based on the variant identifier to obtain the variant's model file path and texture file path; loading the variant's 3D model and texture data into memory; executing a gradient transition animation to smoothly transition from the current product appearance to the variant appearance; and updating the extended reality rendering objects in the scene to complete the variant switching.

[0159] If the target product selection decision does not indicate the results of the candidate variant selection, then the presentation content of the extended reality object remains unchanged.

[0160] Non-variant product selection decisions include actions such as viewing details, rotating, and zooming in or out. These decisions do not change the product itself, only the viewing method or the displayed information. In this case, the product model and texture remain unchanged; the relevant information or the result of the action is displayed to the user only through prompt feedback.

[0161] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0162] Based on the same inventive concept, this application also provides an interactive product selection system based on extended reality. The solution provided by this system is similar to the solution described in the above method. Therefore, the specific limitations of one or more embodiments of the interactive product selection system based on extended reality provided below can be found in the limitations of the interactive product selection method based on extended reality described above, and will not be repeated here.

[0163] In one exemplary embodiment, as shown in Figure 3, an interactive product selection system based on extended reality is provided, including:

[0164] The information acquisition module retrieves the three-dimensional information of the target product and the product attribute information corresponding to the three-dimensional information from the product information database;

[0165] The scene construction module determines the scene reference coordinates of the extended reality scene and the ambient light features that match the scene reference coordinates based on the spatial perception data of the terminal device, and generates the extended reality presentation object of the target product in the extended reality scene.

[0166] The interactive data acquisition module acquires interactive behavior data based on the user's interactive actions in the extended reality scenario;

[0167] The feature association module associates and matches the interactive behavior data with the product attribute information to generate a set of candidate product features.

[0168] The decision fusion module performs fusion on the candidate product feature set according to preset fusion criteria to obtain the target product selection decision and outputs interactive feedback information corresponding to the target product selection decision.

[0169] The modules in the aforementioned extended reality-based interactive product selection system can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the computer device's memory as software, so that the processor can invoke and execute the corresponding operations of each module.

[0170] In one exemplary embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 4. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When the computer program is executed by the processor, it implements an interactive product selection method based on extended reality. The display unit of the computer device forms a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0171] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0172] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0173] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0174] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0175] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0176] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0177] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0178] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An interactive product selection method based on extended reality, characterized by comprising:
retrieving, from a product information database, three-dimensional information of a target product and product attribute information corresponding to the three-dimensional information of the product; wherein the product attribute information includes an attribute identifier for driving interactive feedback and a variant identifier for referencing candidate variants, and the attribute identifier and the variant identifier are preset identifiers read from the product information database;
determining, based on spatial perception data of a terminal device, scene reference coordinates of an extended reality scene and ambient light features that match the scene reference coordinates, and generating an extended reality presentation object of the target product in the extended reality scene;
obtaining interaction behavior data based on the user's interactive actions in the extended reality scene, including: obtaining gesture channel data, gaze channel data, and voice channel data from the input channels of the terminal device, respectively; mapping, according to a preset mapping rule, the gesture channel data, the gaze channel data, and the voice channel data to gesture interaction markers, gaze interaction markers, and voice interaction markers, respectively, wherein the gesture interaction markers, gaze interaction markers, and voice interaction markers each include information on the interaction position and interaction method with respect to the target product; and obtaining an interaction time sequence by aligning the execution times of the gesture interaction markers, the gaze interaction markers, and the voice interaction markers;
wherein the preset mapping rule includes: calculating a gaze stability score, a gesture matching score, and a semantic similarity score based on the gaze channel data, the gesture channel data, and the voice channel data, respectively; generating, under the observation sequence determined according to the interaction time sequence and a preset partial order relation, the gaze interaction marker, the gesture interaction marker, and the voice interaction marker, together with the corresponding confidence scores, according to the maximum element; applying collaborative gain or confidence suppression to the associated interaction intent based on the consistency of the gaze interaction markers, gesture interaction markers, and voice interaction markers within the observation sequence and on the hand-eye coordination angle relationship calculated from the gaze channel data and the gesture channel data; and, when the hand-eye coordination angle relationship does not meet the collaborative condition, expanding the cumulative samples of the observation sequence according to the interaction time sequence;
wherein the interaction behavior data includes interaction markers corresponding to the interaction position, interaction method, and interaction time sequence with respect to the target product;
associating and matching the interaction behavior data with the product attribute information, including: spatially locating the interaction position based on the scene reference coordinates to determine the target part in the three-dimensional information of the product; if the target part has the attribute identifier, combining the attribute identifier with the interaction method to obtain a first candidate feature; if the attribute identifier is not present, combining the interaction method with the interaction time sequence to obtain a second candidate feature; and adding the first candidate feature and the second candidate feature to a candidate product feature set in parallel;
fusing the candidate product feature set according to a preset fusion criterion to obtain a target product selection decision, and outputting interactive feedback information corresponding to the target product selection decision; wherein the target product selection decision is used to indicate the product selection conclusion for the target product or the screening result for the candidate variants of the target product.

2. The interactive product selection method based on extended reality as described in claim 1, characterized in that: the step of determining the scene reference coordinates of the extended reality scene based on the spatial perception data of the terminal device includes: obtaining planar features and depth features from the spatial perception data; if a stable plane that satisfies the preset constraints is detected, using the stable plane as the reference plane for the scene reference coordinates; if the preset constraints are not met, using the set of reference points obtained by relocation as the set of reference points for the scene reference coordinates; and determining the scene reference coordinates based on the reference plane or the set of reference points, and correcting the pose of the extended reality presentation object.

3. The interactive product selection method based on extended reality as described in claim 2, characterized in that: The step of fusing the feature set of candidate products according to a preset fusion criterion includes: The candidate product feature set is divided into a spatial feature subset and a semantic feature subset based on the feature source; According to the preset fusion criteria, the spatial feature subset and the semantic feature subset are fused respectively to obtain the corresponding subset fusion results; A secondary fusion is performed on the subset fusion result, and spatial alignment is performed based on the scene reference coordinates, and temporal alignment is performed based on the interaction time sequence to obtain the target product selection decision.

4. The interactive product selection method based on extended reality as described in claim 3, characterized in that: the outputting of interactive feedback information corresponding to the target product selection decision includes: in the extended reality scene, replacing the appearance of the extended reality presentation object or switching its parameters to generate visual feedback; overlaying, in the extended reality scene, a preset prompt layer template read from the rendering resources, and mapping the prompt content according to the target product selection decision; adjusting the brightness and hue parameters of the preset prompt layer template according to the ambient light features to generate and render the prompt layer, thus providing prompt feedback; and presenting the visual feedback and the prompt feedback as the interactive feedback information.

5. The interactive product selection method based on extended reality as described in claim 4, characterized in that: If the target product selection decision indicates the screening result of the candidate variant, a variant switching instruction is generated, and in response to the variant switching instruction, the product 3D information corresponding to the candidate variant is obtained from the product information database based on the variant identifier to replace the presentation content of the extended reality presentation object; If the target product selection decision does not indicate the screening result of the candidate variant, then the presentation content of the extended reality object remains unchanged.

6. An interactive product selection system based on extended reality, employing the interactive product selection method based on extended reality as described in any one of claims 1 to 5, characterized by comprising: an information acquisition module, which retrieves the three-dimensional information of the target product and the product attribute information corresponding to the three-dimensional information from the product information database; a scene construction module, which determines the scene reference coordinates of the extended reality scene and the ambient light features that match the scene reference coordinates based on the spatial perception data of the terminal device, and generates the extended reality presentation object of the target product in the extended reality scene; an interactive data acquisition module, which acquires interaction behavior data based on the user's interactive actions in the extended reality scene; a feature association module, which associates and matches the interaction behavior data with the product attribute information to generate a candidate product feature set; and a decision fusion module, which performs fusion on the candidate product feature set according to preset fusion criteria to obtain the target product selection decision and outputs interactive feedback information corresponding to the target product selection decision.

7. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the interactive product selection method based on extended reality as described in any one of claims 1 to 5.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the interactive product selection method based on extended reality as described in any one of claims 1 to 5.