Indoor scene point cloud model three-dimensional reconstruction method and system based on eye control interaction

By using an eye-tracking interactive indoor scene point cloud model 3D reconstruction system, which utilizes eye-tracking calculation, line-of-sight mapping calibration, and anti-shake processing, combined with a deep learning reconstruction network, ALS patients can autonomously complete 3D reconstruction and manipulation. This solves the problems of line-of-sight mapping mismatch and dizziness in traditional technologies, and provides accurate spatial perception and planning capabilities.

CN122199834A · Pending · Publication Date: 2026-06-12 · CHANGCHUN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHANGCHUN UNIV
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing indoor scene 3D reconstruction technology cannot meet the operational needs of patients with severe limb movement disorders such as ALS. Traditional eye-tracking technology has problems such as mismatch of gaze mapping, tremors and dizziness in 3D spatial interaction. Existing systems cannot achieve accurate spatial perception and planning.

Method used

An indoor scene point cloud model 3D reconstruction system based on eye-controlled interaction is adopted, including an eye-tracking solution module, a gaze-space mapping calibration module, an anti-shake smoothing module, a multi-view 3D reconstruction module, and a streaming point cloud rendering module. It generates a high-precision 3D point cloud model through a deep learning reconstruction network and realizes 3D interaction that can be controlled by gaze.

Benefits of technology

It provides patients with severe limb movement disorders such as ALS with the ability to independently complete 3D reconstruction of indoor scenes and manipulate point cloud models, solving the problems of visual tremors and dizziness, achieving accurate spatial perception and planning, and lowering the threshold for use.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the technical field of human-computer interaction and provides a method and system for three-dimensional reconstruction of indoor scene point cloud models based on eye-controlled interaction. The system comprises a hardware layer, a data processing layer, and an interactive application layer. The data processing layer comprises an eye-tracking solution module, a gaze-space mapping calibration module, and an anti-jitter smoothing module; the interactive application layer comprises a multi-view 3D reconstruction module and a streaming point cloud rendering module. The multi-view 3D reconstruction module generates the 3D point cloud model. After eye-tracking solution, gaze-space calibration mapping, and anti-jitter smoothing, the streaming point cloud rendering module renders the model in real time according to the 3D virtual camera control parameters mapped from the smoothed gaze point coordinates, so that people with limb movement disorders can control the 3D point cloud model accurately and stably using only their line of sight.

Description

Technical Field

[0001] This invention belongs to the field of human-computer interaction technology, and in particular relates to a method and system for three-dimensional reconstruction of point cloud models of indoor scenes based on eye-tracking interaction.

Background Technology

[0002] Currently, indoor scene digitization mainly relies on multi-view stereo (MVS) and structure-from-motion (SfM) technologies. These technologies process multi-angle image sequences to extract feature points, calculate camera poses, and generate dense point cloud models containing geometric and textural information. However, existing reconstruction software and point cloud interaction tools are designed entirely around physical contact input. Users must use a mouse, keyboard, or touchscreen to adjust complex parameters and to translate, rotate, and scale the 3D view. Furthermore, for common indoor areas with weak or repetitive textures, such as white walls, existing algorithms often fail to match features, resulting in point cloud models with holes, breaks, or geometric distortions that cannot meet the needs of refined display and interaction.

[0003] Eye-tracking technology calculates two-dimensional (x, y) coordinates on the screen by capturing eye movements and is currently mainly used in planar interaction scenarios such as text input and web browsing. However, there are significant technical gaps when directly applying this technology to three-dimensional (3D) space interaction: on the one hand, two-dimensional gaze coordinates cannot be directly mapped to depth propagation and six degrees of freedom (6-DoF) control in a 3D environment, lacking spatial interaction logic; on the other hand, the inherent physiological nystagmus and unconscious saccades of the human eye are amplified into severe screen shaking when driving a 3D camera, easily causing motion sickness and visual fatigue in users. Currently, 3D reconstruction and eye-tracking technologies have not yet achieved deep integration, and there is a lack of dedicated interaction interfaces adapted to specific user groups.

[0004] As the disease progresses, patients with severe limb movement disorders such as ALS gradually lose their limb movement abilities, eventually retaining only the normal function of the eye muscles. This makes it difficult for them to meet their core living needs, such as home space planning, environmental modification, and daily assistive services, through independent operation.

[0005] Currently, most eye-tracking devices on the market are limited to everyday communication scenarios, supporting only basic two-dimensional interactions such as text input and simple command triggering. There is a lack of eye-tracking systems adapted for 3D reconstruction and point cloud model manipulation of indoor scenes, failing to provide patients with accurate spatial perception and planning capabilities. On the other hand, traditional 3D reconstruction and point cloud interaction tools heavily rely on physical operations such as keyboards, mice, or touchscreens, resulting in high barriers to entry and limited interaction methods, completely unsuitable for the operational capabilities of ALS patients. Furthermore, directly applying existing eye-tracking technology to 3D interaction often faces technical bottlenecks such as viewpoint jitter, insufficient positioning accuracy, and spatial mapping mismatch; while conventional indoor scene reconstruction technologies are prone to defects such as texture loss and blurred details, rendering the reconstructed spatial model unusable for accurate planning and subsequent services. These problems collectively prevent ALS patients from independently completing digital modeling and personalized planning of their home spaces, and also make it difficult to provide accurate spatial commands to service robots, severely restricting the in-depth application of assistive technology in the fields of spatial digitization and intelligent assistance.

Summary of the Invention

[0006] The purpose of this invention is to provide a method and system for 3D reconstruction of point cloud models of indoor scenes based on eye-controlled interaction, in order to solve the above-mentioned technical problems.

[0007] This invention is implemented as follows: a 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction, comprising a hardware layer, a data processing layer, and an interactive application layer, wherein the data processing layer includes:

[0008] The eye-tracking solution module is used to acquire the coordinates of the gaze point in the user's eye-tracking data in real time and output the original normalized gaze coordinates.

[0009] The gaze-space mapping calibration module is used to calibrate and map the original normalized gaze coordinates and output the mapped screen gaze point coordinates.

[0010] The anti-jitter smoothing module is used to perform anti-jitter smoothing on the mapped screen gaze point coordinates and output the smoothed gaze point coordinates.

[0011] The interactive application layer includes:

[0012] The multi-view 3D reconstruction module is used to generate a 3D point cloud model of an indoor scene based on a multi-view RGB image sequence of the indoor scene.

[0013] The streaming point cloud rendering module is used to render and display the 3D point cloud model of the indoor scene in real time based on the 3D virtual camera control parameters generated by the smoothed gaze point coordinate mapping.

[0014] Furthermore, the eye-tracking solution module specifically includes:

[0015] The facial feature point extraction unit is used to extract key facial feature points based on the face mesh detection algorithm.

[0016] The iris center localization unit is used to select key feature points in the left and right eye regions of the face to determine the iris center.

[0017] The relative position calculation unit is used to construct the eye coordinate system, calculate the relative proportion of the iris center within the eye socket, and output the original normalized line-of-sight coordinates.

[0018] Furthermore, the line-of-sight-space mapping calibration module specifically includes:

[0019] The calibration point acquisition unit is used to acquire the original normalized gaze coordinates and screen target coordinates when the user is looking at the screen anchor point.

[0020] Polynomial Fitting Unit: Used to establish a mapping relationship using a quadratic polynomial regression model, and output the mapped screen gaze coordinates.

[0021] Furthermore, the anti-jitter smoothing processing module includes:

[0022] The confidence filtering unit is used to monitor the confidence of eye-tracking data in real time. When the confidence is lower than a preset threshold, the eye-tracking data of the current frame is discarded.

[0023] Exponentially weighted moving average smoothing unit: used to perform temporal smoothing on the mapped screen gaze coordinates according to a preset smoothing coefficient, and output the smoothed gaze coordinates.

[0024] Furthermore, the anti-jitter smoothing processing module also includes:

[0025] The dynamic dead zone and soft start unit is used to set the dead zone range at the center of the screen. When the line of sight falls within the dead zone range, the output control quantity is zero. When the line of sight moves out of the dead zone range, the output control quantity increases non-linearly.

[0026] Furthermore, the multi-view 3D reconstruction module uses a deep learning reconstruction network that incorporates frequency domain information compensation and confidence guidance mechanisms to generate 3D point cloud models of indoor scenes, specifically including:

[0027] The multi-scale feature extraction and fusion unit is used to extract feature maps of the image at different resolutions using a feature pyramid network architecture, and to concatenate deep semantic features with shallow texture features to generate feature maps containing contextual information.

[0028] The frequency domain high-frequency information compensation unit is used to convert the feature map from the spatial domain to the frequency domain, enhance the high-frequency components in the frequency domain, and then restore it back to the spatial domain to obtain the compensated feature map.

[0029] The confidence-guided adaptive depth hypothesis unit is used to predict and generate pixel-level confidence maps based on the compensated feature maps. It reduces the depth hypothesis search radius in high-confidence regions and expands the depth hypothesis search radius in low-confidence regions to obtain an adaptive range. Combined with the adaptive range, a dynamic depth hypothesis plane is constructed through homography transformation.

[0030] The cascaded cost volume construction and depth inference unit is used to construct a matching cost volume based on the depth hypothesis plane, perform regularization processing based on 3D convolutional network and Softmax operation on the cost volume to regress the depth map, and refine it step by step from coarse to fine, and perform cascaded guidance.

[0031] The depth map fusion and point cloud generation unit is used to project the depth maps generated from all views into a 3D space, perform geometric consistency verification and probabilistic filtering, remove outliers, and fuse them to generate a 3D point cloud model of the indoor scene.

[0032] Furthermore, the control parameters of the 3D virtual camera include rotation angle, translation vector, and scaling factor; the streaming point cloud rendering module specifically includes:

[0033] The streaming loading unit is used to read the 3D point cloud model of the indoor scene in blocks;

[0034] The rendering unit is used to convert the horizontal and / or vertical displacement of the gaze on the screen into the rotation angle of the 3D virtual camera based on the smoothed gaze coordinates.

[0035] The ray picking unit is used to calculate the intersection of the 3D ray corresponding to the center of the gaze and the point cloud based on the ray projection algorithm, so as to realize the interactive function of selecting by gazing.
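The gaze-based selection described for the ray picking unit can be illustrated with a minimal sketch. It assumes the gaze ray has already been unprojected into world space and that "intersection" with a point cloud means the nearest point along the ray lying within a small tolerance of it; the function name `pick_point` and the `max_distance` tolerance are hypothetical and do not come from the source.

```python
from typing import Optional

import numpy as np


def pick_point(origin: np.ndarray, direction: np.ndarray,
               points: np.ndarray, max_distance: float = 0.05) -> Optional[int]:
    """Return the index of the first point hit by the ray, or None.

    A point counts as hit when its perpendicular distance to the ray is at
    most max_distance (an assumed tolerance, since points have no area).
    """
    d = direction / np.linalg.norm(direction)
    rel = points - origin                      # vectors from ray origin to points
    t = rel @ d                                # distance along the ray
    perp = np.linalg.norm(rel - np.outer(t, d), axis=1)
    mask = (t > 0) & (perp <= max_distance)    # in front of the camera and close
    if not mask.any():
        return None
    candidates = np.flatnonzero(mask)
    return int(candidates[np.argmin(t[candidates])])
```

For million-point clouds a spatial index such as an octree would normally replace this linear scan; the sketch keeps the brute-force form only for readability.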

[0036] Another objective of this invention is to provide a method for 3D reconstruction of indoor scene point cloud models based on eye-tracking interaction, implemented using the aforementioned 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction, comprising the following steps:

[0037] Acquire the gaze point coordinates from the user's eye-tracking data in real time and output the original normalized gaze coordinates.

[0038] Calibrate and map the original normalized gaze coordinates and output the mapped screen gaze point coordinates.

[0039] Perform anti-jitter smoothing on the mapped screen gaze point coordinates and output the smoothed gaze point coordinates.

[0040] Generate a 3D point cloud model of the indoor scene based on the multi-view RGB image sequence of the indoor scene;

[0041] Based on the control parameters of the 3D virtual camera generated by the smoothed gaze point coordinate mapping, the 3D point cloud model of the indoor scene is rendered and displayed in real time.

[0042] The present invention provides a 3D reconstruction system for indoor scene point cloud models based on eye-controlled interaction, in which the multi-view 3D reconstruction module generates a 3D point cloud model. After eye-tracking calculation, gaze-space calibration mapping, and anti-shake smoothing, the model is rendered in real time by the streaming point cloud rendering module based on the 3D virtual camera control parameters generated by the smoothed gaze coordinate mapping. This allows people with limb movement disorders to accurately and stably control the 3D point cloud model using only their gaze. This invention provides an eye-controlled spatial digitization solution for people with severe limb movement disorders such as ALS, solving the problem of accurately controlling and flexibly interacting with 3D point cloud models through gaze in indoor scenes, and filling the tool gap in the field of spatial digitization for special populations.

Attached Figure Description

[0043] Figure 1 is a schematic diagram of the structure of the 3D reconstruction system for indoor scene point cloud models based on eye-controlled interaction provided in an embodiment of the present invention.

[0044] Figure 2 is the network structure diagram of DFMVSNet provided in an embodiment of the present invention.

[0045] Figure 3 is a flowchart of the adaptive depth hypothesis provided in an embodiment of the present invention.

[0046] Figure 4 is a schematic diagram of the usage process of the eye-controlled interaction-based indoor scene point cloud model 3D reconstruction system provided in an embodiment of the present invention.

Detailed Implementation

[0047] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0048] As shown in Figure 1, in one embodiment of the present invention, a 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction is provided, including a hardware layer 10, a data processing layer 20, and an interactive application layer 30. This system, based on ordinary monocular / binocular cameras and combined with computer vision and graphics technologies, achieves seamless interaction throughout the entire process, from eye-tracking and command mapping to streaming rendering of the 3D point cloud model.

[0049] The hardware layer 10 includes a regular USB camera (or a built-in computer camera) for capturing user facial images, a computer terminal (PC / laptop) for image processing and rendering, and a monitor for displaying 3D point cloud models.

[0050] The data processing layer 20 specifically includes:

[0051] The eye-tracking solution module 21 is used to acquire the coordinates of the gaze point in the user's eye-tracking data in real time and output the original normalized gaze coordinates.

[0052] The gaze-space mapping calibration module 22 is used to calibrate and map the original normalized gaze coordinates and output the mapped screen gaze point coordinates.

[0053] The anti-jitter smoothing module 23 is used to perform anti-jitter smoothing on the mapped screen gaze point coordinates and output the smoothed gaze point coordinates.

[0054] Interactive application layer 30 includes:

[0055] The multi-view 3D reconstruction module 31 is used to generate a 3D point cloud model of the indoor scene based on the multi-view RGB image sequence of the indoor scene.

[0056] The streaming point cloud rendering module 32 is used to render and display the 3D point cloud model of the indoor scene in real time based on the 3D virtual camera control parameters generated by the smoothed gaze point coordinate mapping.

[0057] In this embodiment of the invention, the system, through deep optimization of eye-tracking 3D interaction algorithms and indoor scene reconstruction technology, enables users (such as those with severe limb movement disorders like ALS) to precisely control a 3D point cloud model of an indoor scene using only their gaze. Users can autonomously mark target locations on the 3D point cloud model, providing accurate spatial data support for subsequent eye-controlled robots and enabling more precise daily assistance services. This invention not only provides an efficient solution for people with severe limb movement disorders like ALS to autonomously plan their home environment and improve the adaptability of their living spaces, but also fills a gap in spatial digital tools within the field of disability assistance technology, possessing significant practical value and high real-world significance.

[0058] In a preferred embodiment of the present invention, the eye-tracking solution module 21 is responsible for acquiring the coordinates of the user's gaze point in real time, specifically including a facial feature point extraction unit, an iris center localization unit, and a relative position solution unit.

[0059] The facial feature point extraction unit is used to extract key feature points of the face based on the face mesh detection algorithm. Specifically, the system first acquires the RGB video stream from the camera and uses the face mesh detection algorithm to extract 468 key feature points of the face.

[0060] The iris center localization unit is used to select key feature points in the left and right eye regions of the face, calculate the coordinates of the iris center point, and determine the location of the iris center.

[0061] The relative position calculation unit is used to construct an eye coordinate system based on the inner and outer corners of the eye socket, calculate the relative proportion of the iris center within the eye socket, and output the original normalized gaze coordinates $(x_{raw}, y_{raw})$.

[0062] It should be noted that the system supports blink detection, which determines blinking by calculating the aspect ratio of the upper and lower eyelids, and is used to trigger specific commands.
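The relative-position calculation and the blink test can be sketched as follows. This is only an illustration under stated assumptions: landmarks are taken to be 2D pixel coordinates already produced by a face mesh detector, the eye coordinate system is assumed to have its x-axis along the line from the inner to the outer eye corner, and the blink threshold of 0.2 is a hypothetical value. The function names are not from the source.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def normalized_gaze(inner: Point, outer: Point, iris: Point) -> Tuple[float, float]:
    """Express the iris center relative to the eye corners.

    x_raw is the fraction of the way from the inner to the outer corner;
    y_raw is the signed perpendicular offset divided by the eye width.
    """
    ax, ay = outer[0] - inner[0], outer[1] - inner[1]
    width = math.hypot(ax, ay)
    if width == 0:
        raise ValueError("eye corners coincide")
    ux, uy = ax / width, ay / width            # unit vector along the eye axis
    dx, dy = iris[0] - inner[0], iris[1] - inner[1]
    x_raw = (dx * ux + dy * uy) / width        # projection onto the eye axis
    y_raw = (-dx * uy + dy * ux) / width       # projection onto its normal
    return x_raw, y_raw


def eye_aspect_ratio(upper: Point, lower: Point, inner: Point, outer: Point) -> float:
    """Eyelid opening divided by eye width; it drops toward 0 as the eye closes."""
    return math.dist(upper, lower) / math.dist(inner, outer)


def is_blink(aspect_ratio: float, threshold: float = 0.2) -> bool:
    """Assumed rule: the eye counts as closed below a fixed threshold."""
    return aspect_ratio < threshold
```

Dividing by the eye width makes both outputs independent of how far the user sits from the camera, which is presumably why the source speaks of a relative proportion rather than a pixel offset.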

[0063] In a preferred embodiment of the present invention, the gaze-space mapping calibration module 22 can solve the problems of camera position deviation and individual eye differences, specifically including a calibration point acquisition unit and a polynomial fitting unit.

[0064] The calibration point acquisition unit is used to acquire the original normalized gaze coordinates and screen target coordinates when the user gazes at the screen anchor point. Specifically, the system provides 5-point and 9-point calibration interfaces. When the user gazes at a specific anchor point on the screen, the system acquires the corresponding original normalized gaze coordinates $(x_{raw}, y_{raw})$ and screen target coordinates $(u_{target}, v_{target})$.

[0065] The polynomial fitting unit is used to establish a mapping relationship using a quadratic polynomial regression model, outputting the mapped screen gaze point coordinates. Compared to traditional affine transformations, this embodiment of the invention uses a quadratic polynomial regression model to effectively correct distortion at the edges of wide-angle lenses; specifically, the expression for the mapping relationship is as follows:

[0066] ;

[0067] ;

[0068] In the formula, $(u, v)$ are the mapped screen gaze coordinates; $(x, y)$ are the uncalibrated original normalized gaze coordinates; $w_{xi}$ and $w_{yi}$ $(i = 1, 2, 3, 4, 5)$ are all calibration coefficients. It should be noted that the multiple sets of acquired original normalized gaze coordinates $(x_{raw}, y_{raw})$ and their corresponding screen target coordinates $(u_{target}, v_{target})$ can be substituted into the expression for the mapping relationship, and the least squares method can then be used to fit and solve for each calibration coefficient.
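A least-squares fit of a quadratic mapping can be sketched with NumPy as below. The exact term layout of the source's expressions is not reproduced in this text, so the sketch assumes the general six-term quadratic basis (1, x, y, xy, x², y²) per screen axis, which may differ from the five indexed coefficients the source names. `design_matrix`, `fit_calibration`, and `map_gaze` are hypothetical names.

```python
import numpy as np


def design_matrix(xy: np.ndarray) -> np.ndarray:
    """Build the assumed quadratic basis [1, x, y, xy, x^2, y^2] for each sample."""
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * y, x ** 2, y ** 2])


def fit_calibration(raw_xy, target_uv) -> np.ndarray:
    """Solve for the coefficients by least squares; returns an array of shape (6, 2).

    Column 0 holds the coefficients for u, column 1 those for v.
    """
    a = design_matrix(np.asarray(raw_xy, dtype=float))
    b = np.asarray(target_uv, dtype=float)
    coeffs, *_ = np.linalg.lstsq(a, b, rcond=None)
    return coeffs


def map_gaze(coeffs: np.ndarray, raw_xy) -> np.ndarray:
    """Map one or more raw gaze samples to screen coordinates."""
    pts = np.atleast_2d(np.asarray(raw_xy, dtype=float))
    return design_matrix(pts) @ coeffs
```

With this six-term basis a 9-point calibration gives an overdetermined system, whereas a 5-point calibration leaves it underdetermined and `lstsq` then returns the minimum-norm solution.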

[0069] In a preferred embodiment of the present invention, the anti-jitter smoothing module 23 can solve the problem of screen shaking caused by physiological nystagmus of the eyeball, specifically including a confidence filtering unit, an exponentially weighted moving average (EMA) smoothing unit, and a dynamic dead zone and soft start unit.

[0070] The confidence filtering unit is used to monitor the confidence of eye-tracking data in real time. When the confidence is lower than a preset threshold (such as 0.35), the eye-tracking data of the current frame is discarded to prevent accidental operation.

[0071] The exponentially weighted moving average smoothing unit is used to perform temporal smoothing on the mapped screen gaze point coordinates according to a preset smoothing coefficient, and outputs the smoothed gaze point coordinates; specifically, the formula for temporal smoothing is as follows:

[0072] $\hat{P}_t = \alpha P_t + (1 - \alpha)\,\hat{P}_{t-1}$;

[0073] where $\alpha$ is the preset smoothing coefficient (e.g., set to 0.15-0.3), with a smaller value giving a smoother picture at the cost of slightly increased latency; $\hat{P}_t$ is the screen gaze point coordinates of the current frame after smoothing; $P_t$ is the original screen gaze point coordinates input for the current frame; and $\hat{P}_{t-1}$ is the screen gaze point coordinates output after the previous frame was smoothed.
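The confidence gate and the temporal smoothing can be combined in one small sketch. It assumes the conventional exponentially weighted moving average in which the smoothing coefficient weights the current frame, and it assumes that a discarded frame simply repeats the previous output; the class name `GazeSmoother` and its parameter names are hypothetical.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


class GazeSmoother:
    """Confidence-gated exponentially weighted moving average of gaze points."""

    def __init__(self, alpha: float = 0.2, min_confidence: float = 0.35) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.min_confidence = min_confidence
        self._state: Optional[Point] = None

    def update(self, point: Point, confidence: float) -> Optional[Point]:
        """Feed one frame; returns the smoothed point (None until a valid frame arrives)."""
        if confidence < self.min_confidence:
            return self._state                 # discard the frame, hold the last output
        if self._state is None:
            self._state = (float(point[0]), float(point[1]))
        else:
            a = self.alpha
            self._state = (a * point[0] + (1 - a) * self._state[0],
                           a * point[1] + (1 - a) * self._state[1])
        return self._state
```

Under this convention a coefficient of 0.15 lets each new frame contribute only 15% of the output, which is where both the smoothness and the added latency come from.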

[0074] The dynamic dead zone and soft-start unit is used to preset a dead zone range at the center of the screen to filter out minute, unconscious eye movements. Specifically, when the real-time gaze point falls within the dead zone range, the system output control quantity is set to zero, keeping the 3D viewpoint stationary; when the gaze point moves out of the dead zone range, the output control quantity increases non-linearly (i.e., soft-start). This soft-start mechanism non-linearly maps the distance and direction of the gaze's deviation from the dead zone boundary to the rotation speed and direction of the 3D virtual camera, thereby simulating the natural browsing experience when the human eye and head rotate in tandem, effectively avoiding instantaneous changes in viewpoint and screen tearing; the specific formula is as follows:

[0075] $v_{out} = \begin{cases} 0, & d_{in} \le D_{th} \\ K\,(d_{in} - D_{th})^{\gamma}, & d_{in} > D_{th} \end{cases}$;

[0076] where $v_{out}$ is the output rotation control parameter of the 3D virtual camera (such as the rotational angular velocity); $d_{in}$ is the distance or angle by which the current gaze point deviates from the center of the screen; $D_{th}$ is the preset dead zone threshold, which defines the size of the central dead zone; $K$ is the preset global gain coefficient, which adjusts the overall rotation sensitivity; and $\gamma$ is the preset nonlinear exponential factor, which controls the curvature of the soft-start phase so that the response speed increases the further the gaze moves from the dead zone edge.
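A dead zone with a soft start can be sketched per screen axis as follows. The power-law form, gain times the overshoot beyond the dead zone raised to an exponent, is an assumption consistent with the described behavior rather than a confirmed transcription of the source formula; `dead_zone_control` and its default values are hypothetical.

```python
import math


def dead_zone_control(offset: float, dead_zone: float,
                      gain: float = 1.0, gamma: float = 2.0) -> float:
    """Map a signed gaze offset from screen center to a signed rotation speed.

    Inside the dead zone the output is zero; outside it, the magnitude grows
    as gain * (|offset| - dead_zone) ** gamma and the sign follows the offset.
    """
    magnitude = abs(offset)
    if magnitude <= dead_zone:
        return 0.0
    return math.copysign(gain * (magnitude - dead_zone) ** gamma, offset)
```

With an exponent above 1 the output starts from zero with zero slope at the dead zone boundary, so the camera eases into motion instead of jumping when the gaze crosses the edge.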

[0077] In a preferred embodiment of the present invention, the multi-view 3D reconstruction module 31 is the core computing unit of this system, responsible for converting the acquired multi-view RGB image sequence into a high-precision 3D point cloud model. Specifically, it includes a multi-scale feature extraction and fusion unit, a frequency domain high-frequency information compensation unit, a confidence-guided adaptive depth hypothesis unit, a cascaded cost volume construction and depth inference unit, and a depth map fusion and point cloud generation unit.

[0078] The multi-scale feature extraction and fusion unit is used to extract feature maps of the image at different resolutions using the Feature Pyramid Network (FPN) architecture, and to concatenate deep semantic features with shallow texture features to generate feature maps containing contextual information.

[0079] The frequency domain high-frequency information compensation unit is used to transform the feature map from the spatial domain to the frequency domain, enhance the high-frequency components in the frequency domain, and then restore it back to the spatial domain to obtain the compensated feature map.

[0080] The confidence-guided adaptive depth hypothesis unit is used to predict and generate pixel-level confidence maps based on the compensated feature maps. It reduces the depth hypothesis search radius in high-confidence regions and expands the depth hypothesis search radius in low-confidence regions to obtain an adaptive range. Combined with the adaptive range, a dynamic depth hypothesis plane is constructed through homography transformation.

[0081] The cascaded cost volume construction and depth inference unit is used to construct a matching cost volume based on the depth hypothesis plane, perform regularization processing based on 3D convolutional network and Softmax operation on the cost volume to regress the depth map, and refine it step by step from coarse to fine, and perform cascaded guidance.

[0082] The depth map fusion and point cloud generation unit is used to project the depth maps generated from all views into a 3D space, perform geometric consistency verification and probabilistic filtering, remove outliers, and fuse them to generate a 3D point cloud model of the indoor scene.

[0083] In practical applications, for weakly textured areas (such as white walls, textureless desktops, etc.) and complex geometric structures commonly found in indoor scenes, this invention employs an improved deep learning reconstruction network, DFMVSNet (Dynamic Feature-based MVS Network). This network achieves coarse-to-fine cascaded depth estimation by introducing frequency domain information compensation and confidence guidance mechanisms. As shown in Figures 2 and 3, the method for reconstructing a 3D point cloud model includes the following steps:

[0084] (1) Multi-scale feature extraction and fusion: The network first performs feature encoding on each input source image and reference image, specifically including:

[0085] Feature pyramid construction: adopts FPN architecture, including downsampling and upsampling paths; the network extracts feature maps of the image at different resolutions (Stage 0, Stage 1, Stage 2) through multiple convolution and pooling operations.

[0086] Multi-scale feature fusion: In order to take into account both the overall structure of the indoor scene (low-frequency information) and the details of the object edges (high-frequency information), the network introduces a multi-scale feature fusion method in the feature extraction stage, which concatenates the deep semantic features with the shallow texture features to generate a feature map containing rich contextual information.

[0087] (2) Frequency Domain High-Frequency Information Compensation: To address the problem that traditional convolutional neural networks easily lose high-frequency details during downsampling, this network innovatively introduces a frequency domain processing mechanism in the feature aggregation stage, specifically including:

[0088] FFT Transform: The Fast Fourier Transform (FFT) is used to transform the feature map from the spatial domain to the frequency domain.

[0089] Information compensation: High-frequency components representing object edges and texture details are enhanced in the frequency domain, and then restored back to the spatial domain using inverse Fourier transform (IFFT) to obtain the compensated feature map. This step effectively compensates for information loss during feature extraction, significantly improving the sharpness of reconstructed edges and fine structures of indoor furniture.
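The FFT, enhance, IFFT round trip can be illustrated on a single 2D feature channel. The source does not state how the high-frequency components are weighted, so the fixed radial cutoff and constant gain below are assumptions chosen for clarity; in a trained network the weighting would more plausibly be learned. `boost_high_frequencies` is a hypothetical name.

```python
import numpy as np


def boost_high_frequencies(feature: np.ndarray, cutoff: float = 0.25,
                           gain: float = 1.5) -> np.ndarray:
    """Amplify spatial frequencies above `cutoff` (cycles per sample) by `gain`."""
    h, w = feature.shape
    fy = np.fft.fftfreq(h)[:, None]            # vertical frequency of each bin
    fx = np.fft.fftfreq(w)[None, :]            # horizontal frequency of each bin
    radius = np.sqrt(fx ** 2 + fy ** 2)
    weights = np.where(radius > cutoff, gain, 1.0)
    spectrum = np.fft.fft2(feature)            # spatial domain -> frequency domain
    compensated = np.fft.ifft2(spectrum * weights)
    return compensated.real                    # imaginary part is rounding noise
```

Because the weight mask depends only on the frequency magnitude, it is symmetric about the origin and the inverse transform of a real input stays real, which is why only the real part is returned.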

[0090] (3) Adaptive depth hypothesis based on confidence level, specifically including:

[0091] Confidence map generation: The network automatically predicts a pixel-level confidence map based on the texture richness and matching consistency of the compensated feature map.

[0092] Adaptive sampling radius calculation: In high-confidence regions, the network automatically reduces the search radius of the depth hypothesis, concentrates computing resources for fine-grained searching, and improves the accuracy of depth estimation; in low-confidence regions, the network automatically expands the search radius, uses a wider range of contextual geometric information to infer the depth of the point, thereby effectively filling in gaps in areas such as white walls.

[0093] Dynamic depth hypothesis construction: Combining the above adaptive range, a dynamic depth hypothesis plane is constructed through homography transformation, which avoids the computational redundancy and mismatch caused by uniform sampling across the entire depth range in traditional methods.
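The confidence-dependent search radius can be sketched as below. It assumes a prior depth map is available (for example from a coarser stage) and uses a hard confidence threshold with fixed shrink and expand factors, all of which are illustrative choices not given in the source; the homography warping that would consume these hypotheses is omitted. `adaptive_depth_hypotheses` is a hypothetical name.

```python
import numpy as np


def adaptive_depth_hypotheses(prior_depth: np.ndarray, confidence: np.ndarray,
                              base_radius: float, num_planes: int = 8,
                              shrink: float = 0.5, expand: float = 2.0,
                              threshold: float = 0.5) -> np.ndarray:
    """Return per-pixel depth hypotheses of shape (num_planes, H, W).

    High-confidence pixels search a narrow band around the prior depth;
    low-confidence pixels search a wider one.
    """
    prior_depth = np.asarray(prior_depth, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    radius = np.where(confidence >= threshold,
                      base_radius * shrink, base_radius * expand)
    offsets = np.linspace(-1.0, 1.0, num_planes)   # evenly spaced within the band
    return prior_depth[None] + offsets[:, None, None] * radius[None]
```

Keeping the plane count fixed while varying the radius means confident pixels get finer depth resolution and uncertain pixels get wider coverage at the same cost.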

[0094] (4) Cascaded cost volume construction and depth inference: The network adopts a coarse-to-fine cascaded architecture (Stage 2→Stage 1→Stage 0) for depth inference, specifically including:

[0095] Cost volume construction: Based on the depth hypothesis plane constructed above, a matching cost volume is constructed using the feature maps of the reference view and the source view. The photometric consistency between multiple views is measured by calculating the variance between features.

[0096] 3D Convolutional Regularization: Regularizes the cost volume using a 3D convolutional network to smooth noise and enhance geometric continuity.

[0097] Depth Regression: The cost volume is converted into a depth probability distribution through the Softmax operation, and then the current depth map is regressed.

[0098] Cascaded guidance: The depth map generated at the coarsest scale (Stage 2) is upsampled and used as prior information to guide the depth hypothesis range of the next level (Stage 1). The depth map is refined step by step in this way, and a high-precision depth map at the original resolution is finally output (Stage 0).
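For illustration only, the variance-based matching cost and the Softmax depth regression of step (4) can be sketched in NumPy. The function names and array layouts are assumptions, the 3D convolutional regularization is omitted, and the raw cost is negated directly before the Softmax so that a lower matching cost yields a higher probability.

```python
import numpy as np

def variance_cost(warped_features):
    """Variance across views, averaged over channels.

    warped_features: (V, C, D, H, W), the features of V views already warped
    onto D depth hypothesis planes. Returns a cost of shape (D, H, W).
    """
    return warped_features.var(axis=0).mean(axis=0)

def regress_depth(cost, depth_values):
    """Softmax over the depth axis followed by an expected-value regression.

    cost: (D, H, W), lower means better photometric consistency.
    depth_values: (D,) shared planes or (D, H, W) per-pixel hypotheses.
    """
    depth_values = np.asarray(depth_values, dtype=float)
    if depth_values.ndim == 1:
        depth_values = depth_values[:, None, None]
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)              # depth probability distribution
    depth = (prob * depth_values).sum(axis=0)            # expected depth per pixel
    return depth, prob
```

In a cascaded setting, the depth returned at one stage would be upsampled and used as the centre of the hypothesis range at the next, finer stage.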

[0099] (5) Depth map fusion and point cloud generation: The network projects the depth maps generated from all views into the three-dimensional space, performs geometric consistency verification and probability filtering, removes outliers, and fuses them to generate the final three-dimensional point cloud model of the indoor scene, which is output in PLY file format for subsequent streaming point cloud rendering module 32 to call.
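A simplified sketch of the projection and probability filtering in step (5) is given below. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and a hypothetical threshold `min_prob`; the cross-view geometric consistency check and the fusion of multiple views are not shown, and the PLY writer emits only the minimal ASCII variant with x, y, z properties.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy, prob=None, min_prob=0.5):
    """Lift a depth map to camera-space 3D points, dropping unreliable pixels."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel row (v) and column (u) indices
    valid = depth > 0
    if prob is not None:
        valid &= prob >= min_prob        # probability filtering
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # shape (N, 3)

def to_ascii_ply(points):
    """Serialize an (N, 3) array as a minimal ASCII PLY string."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    body = [f"{x:.6f} {y:.6f} {z:.6f}" for x, y, z in points]
    return "\n".join(header + body) + "\n"
```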

[0100] In a preferred embodiment of the present invention, the control parameters of the three-dimensional virtual camera include rotation angle, translation vector and scaling factor. To allow a three-dimensional point cloud model containing millions of points to be viewed smoothly on an ordinary PC, the present invention provides an OpenGL-based streaming point cloud rendering module 32, which specifically includes a streaming loading unit, a rendering unit and a ray picking unit.

[0101] The streaming loading unit is used to read the 3D point cloud model (PLY file) of the indoor scene in blocks. In practical applications, the system starts an independent thread to read the PLY file in blocks, which can avoid the main interface freezing due to reading a large amount of data into memory at once.
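The block-wise loading on an independent thread can be sketched with the Python standard library. In this illustration an in-memory array stands in for the PLY file, and the names `stream_blocks`, `load_streaming`, and `block_size` are assumptions; an actual implementation would read and decode successive chunks of the file instead.

```python
import queue
import threading
import numpy as np

def stream_blocks(points, block_size, out_queue):
    """Producer: push consecutive blocks of points, then a None sentinel."""
    for start in range(0, len(points), block_size):
        out_queue.put(points[start:start + block_size])
    out_queue.put(None)

def load_streaming(points, block_size=100_000):
    """Consumer-side generator: yields blocks as the worker thread produces them."""
    blocks = queue.Queue(maxsize=4)      # bounded, so the loader cannot run far ahead
    worker = threading.Thread(
        target=stream_blocks, args=(points, block_size, blocks), daemon=True
    )
    worker.start()
    while True:
        block = blocks.get()
        if block is None:                # sentinel: loading finished
            break
        yield block                      # e.g. upload this block to the GPU and redraw
    worker.join()
```

The interface thread only ever handles one block at a time, which is what allows the point cloud to appear progressively rather than after a single long blocking read.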

[0102] The rendering unit is used to convert the horizontal and / or vertical displacement of the gaze on the screen into the rotation angle of the 3D virtual camera based on the smoothed gaze coordinates. Specifically, the system can convert the horizontal and vertical displacement of the gaze on the screen into the yaw and pitch angles of the 3D virtual camera; in addition, to avoid spatial disorientation during eye-tracking, the system restricts the degree of freedom of the Roll axis and adopts a "fixed-axis orbital rotation" mode.
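One possible form of this gaze-to-rotation mapping is sketched below. The normalized screen coordinates, the per-frame `gain_deg`, the pitch clamp, and the sign conventions are assumptions of the sketch rather than values given in the source; the only properties taken from the text are that horizontal and vertical gaze displacement drive yaw and pitch and that roll stays fixed.

```python
def update_orbit_camera(yaw, pitch, gaze_x, gaze_y,
                        gain_deg=2.0, pitch_limit=89.0):
    """Advance a fixed-axis orbit camera by one frame.

    gaze_x, gaze_y: smoothed gaze in normalized screen coordinates, where
    (0.5, 0.5) is the screen centre and y grows downward (assumed).
    Returns (yaw, pitch, roll) in degrees.
    """
    dx = (gaze_x - 0.5) / 0.5            # -1 at left edge, +1 at right edge
    dy = (gaze_y - 0.5) / 0.5            # -1 at top edge, +1 at bottom edge
    yaw = (yaw + gain_deg * dx) % 360.0
    # Clamp pitch so the camera never flips over the vertical axis.
    pitch = max(-pitch_limit, min(pitch_limit, pitch - gain_deg * dy))
    roll = 0.0                           # roll degree of freedom is locked
    return yaw, pitch, roll
```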

[0103] The ray picking unit is used to calculate the intersection of the 3D ray corresponding to the center of the gaze and the point cloud based on the ray projection algorithm, so as to realize the interactive function of selecting by gazing, which can be used to view local details or make markings.
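Since a ray almost never passes exactly through a point of a point cloud, a practical picker usually selects the nearest point lying within a small distance of the ray. The following sketch takes that approach; the tolerance `max_dist` and the function name `pick_point` are assumptions, and a production renderer would typically accelerate the search with a spatial index rather than test every point.

```python
import numpy as np

def pick_point(points, origin, direction, max_dist=0.05):
    """Return the index of the closest point along the ray, or None.

    points: (N, 3) array. origin, direction: 3-vectors of the gaze ray.
    A point counts as hit if it lies in front of the origin and within
    `max_dist` of the ray.
    """
    points = np.asarray(points, dtype=float)
    origin = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    rel = points - origin
    t = rel @ d                                          # distance along the ray
    perp = np.linalg.norm(rel - np.outer(t, d), axis=1)  # distance from the ray
    hits = np.where((t > 0) & (perp <= max_dist))[0]
    if hits.size == 0:
        return None
    return int(hits[np.argmin(t[hits])])                 # nearest hit to the viewer
```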

[0104] This embodiment of the invention abandons the traditional "gaze as cursor" mode. Instead, it establishes a nonlinear mapping between the two-dimensional gaze coordinates and the six-degree-of-freedom parameters of the three-dimensional virtual camera, so that the three-dimensional point cloud model can be rotated, translated, scaled, and roamed using gaze alone, which resolves the conflict between eye movement tremors and three-dimensional rendering. To address the amplification of physiological eye tremors in three-dimensional space, a smoothing algorithm based on Kalman filtering and a dynamic dead zone control mechanism are designed to eliminate screen jitter and accidental triggering, thereby reducing the dizziness of eye-controlled three-dimensional interaction at its source.
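The anti-shake behaviour can be illustrated with a short sketch. Note that it uses the exponentially weighted moving average and confidence filtering recited in claim 4, together with the dead zone and non-linear soft start of claim 5, rather than the Kalman filter mentioned above; the values of `alpha`, `min_confidence`, `dead_zone`, and `exponent` are illustrative assumptions.

```python
import math

class EwmaSmoother:
    """Exponentially weighted moving average over 2D gaze samples."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha               # smoothing coefficient in (0, 1]
        self.state = None

    def update(self, x, y, confidence=1.0, min_confidence=0.5):
        if confidence < min_confidence:
            return self.state            # discard the frame, keep the last estimate
        if self.state is None:
            self.state = (x, y)
        else:
            sx, sy = self.state
            a = self.alpha
            self.state = (a * x + (1 - a) * sx, a * y + (1 - a) * sy)
        return self.state

def dead_zone_gain(offset, dead_zone=0.1, exponent=2.0):
    """Map a gaze offset in [-1, 1] from the screen centre to a control value.

    Inside the dead zone the output is zero; outside it the output rises
    non-linearly (soft start) and reaches +/-1 at the screen edge.
    """
    magnitude = abs(offset)
    if magnitude <= dead_zone:
        return 0.0
    scaled = min((magnitude - dead_zone) / (1.0 - dead_zone), 1.0)
    return math.copysign(scaled ** exponent, offset)
```

With an exponent greater than 1, small excursions just outside the dead zone produce very slow motion, which is what keeps residual tremor from being amplified into visible camera shake.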

[0105] In addition, the embodiments of the present invention designed a streaming point cloud rendering and interaction architecture: a lightweight streaming point cloud rendering module 32 was developed, which optimizes the loading and rendering of large-scale dense point cloud data in blocks, ensuring low latency and high frame rate real-time feedback under eye control command input, and solving the problem of stuttering and frame drop in traditional solutions under eye control operation.

[0106] As shown in Figure 4, the usage process of the above system is as follows:

[0107] Step 1: System Initialization and Data Import: The user starts the system and loads the 3D point cloud model (PLY file) of the indoor scene through the "Import" function; the background streaming point cloud rendering module 32 starts the streaming loading thread, the interface displays the loading progress in real time, and gradually renders the point cloud from a low-resolution preview to a high-resolution full view.

[0108] Step 2, Eye Tracking Calibration: The user clicks the "Start Eye Tracking" button; if a large cursor deviation is found, the system enters calibration mode; the user follows the system prompts and looks at the 9 calibration points on the screen in sequence. The system automatically calculates the calibration coefficients of the polynomial fitting and activates the "Use Calibration Mapping" function.

[0109] Step 3, Mode Selection and Interaction: The system provides three eye-tracking interaction modes, which the user can switch between by gazing at the function buttons at the edge of the screen, as follows:

[0110] Rotation Mode: When the user looks at the left side of the screen, the scene rotates to the left; when looking at the right side, the scene rotates to the right. When looking up or down, the scene adjusts its tilt angle accordingly. When looking at the dead zone in the center of the screen, the scene stops rotating, making it easy to freeze and observe.

[0111] Translation mode: The line-of-sight offset is converted into the translation vector of the 3D virtual camera, enabling roaming of long corridors or large living rooms.

[0112] Target selection mode: The interface displays a crosshair. When the user looks at a piece of furniture and keeps their gaze fixed for more than a set time, the system automatically triggers a "selected" event, highlights the point cloud of that area, and automatically zooms in to show details.

[0113] Step 4: Parameter fine-tuning: During the interaction, users can adjust the "sensitivity", "smoothness coefficient" and "dead zone range" sliders through eye control or with the help of an assistant to adapt to the eye movement characteristics under different fatigue states, ensuring the comfort and accuracy of operation.
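The dwell-based trigger of the target selection mode in Step 3 can be sketched as a small state machine. The class name `DwellSelector`, the dwell time, and the tolerance radius are assumptions; the source only states that a "selected" event fires when the gaze stays fixed for longer than a set time.

```python
import math

class DwellSelector:
    """Fire a single 'selected' event when the gaze rests in one place long enough."""

    def __init__(self, dwell_time=1.0, radius=0.05):
        self.dwell_time = dwell_time     # seconds the gaze must stay put
        self.radius = radius             # tolerated drift, normalized screen units
        self.anchor = None
        self.start = 0.0
        self.fired = False

    def update(self, x, y, t):
        """Feed one smoothed gaze sample at time t (seconds). Returns True once per dwell."""
        moved = (
            self.anchor is None
            or math.hypot(x - self.anchor[0], y - self.anchor[1]) > self.radius
        )
        if moved:
            # Gaze left the tolerance circle: restart the dwell timer here.
            self.anchor, self.start, self.fired = (x, y), t, False
            return False
        if not self.fired and t - self.start >= self.dwell_time:
            self.fired = True
            return True
        return False
```

Exposing `dwell_time` and `radius` as adjustable values is consistent with the parameter fine-tuning in Step 4, since a fatigued user may need a longer dwell or a larger tolerance.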

[0114] In practical applications, taking the example of an ALS patient independently browsing a 3D point cloud model of their living room, the specific implementation scheme is as follows:

[0115] 1. Implementation environment and hardware configuration:

[0116] Hardware: A regular home laptop computer, 24-inch monitor.

[0117] Data preparation: A 3D point cloud model of the living room of the home, which was collected and reconstructed in advance by family members or technicians.

[0118] Software environment: The system provided in the above embodiments.

[0119] 2. Implementation steps:

[0120] System startup and calibration: The patient sits in front of the screen, and the system automatically starts and guides the patient to look at 9 calibration points on the screen. It takes about 30 seconds to complete the gaze calibration mapping and establish personalized eye movement mapping parameters.

[0121] Model loading: The system automatically loads a 3D point cloud model of the living room, with the initial viewpoint placed in the center of the room.

[0122] Rotating viewpoint: The patient looks to the left edge of the screen, and the 3D virtual camera smoothly rotates to the left to show the left wall and window; when the patient's gaze returns to the center of the screen, the rotation stops.
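The 9-point calibration used in the startup step can be illustrated with an ordinary least-squares fit of a quadratic polynomial, in line with the quadratic polynomial regression model recited in claim 3. The choice of monomials (1, x, y, xy, x², y²) and the function names are assumptions of this sketch.

```python
import numpy as np

def design_matrix(gaze):
    """Quadratic monomials of raw normalized gaze coordinates, shape (N, 6)."""
    gaze = np.asarray(gaze, dtype=float)
    x, y = gaze[:, 0], gaze[:, 1]
    return np.stack([np.ones_like(x), x, y, x * y, x ** 2, y ** 2], axis=1)

def fit_calibration(raw_gaze, screen_targets):
    """Least-squares fit from raw gaze to screen coordinates.

    raw_gaze: (N, 2) samples recorded while the user looks at each anchor.
    screen_targets: (N, 2) known anchor positions, e.g. in pixels.
    Returns a (6, 2) coefficient matrix, one column per screen axis.
    """
    coeffs, *_ = np.linalg.lstsq(
        design_matrix(raw_gaze), np.asarray(screen_targets, dtype=float), rcond=None
    )
    return coeffs

def apply_calibration(coeffs, raw_gaze):
    """Map raw normalized gaze samples to screen coordinates."""
    return design_matrix(raw_gaze) @ coeffs
```

Nine anchors on a 3×3 grid give nine equations for six unknowns per axis, so the fit is overdetermined and averages out some measurement noise.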

[0123] In another embodiment of the present invention, a method for 3D reconstruction of indoor scene point cloud models based on eye-tracking interaction is also provided, which is implemented using the above-mentioned 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction, and includes the following steps:

[0124] S1. Acquire gaze coordinates from the user's eye-tracking data in real time, and output the original normalized gaze coordinates;

[0125] S2. Calibrate and map the original normalized gaze coordinates, and output the mapped screen gaze point coordinates;

[0126] S3. Perform anti-jitter smoothing on the mapped screen gaze point coordinates, and output the smoothed gaze point coordinates;

[0127] S4. Generate a 3D point cloud model of the indoor scene based on the multi-view RGB image sequence of the indoor scene;

[0128] S5. Generate 3D virtual camera control parameters by mapping the smoothed gaze point coordinates, and render and display the 3D point cloud model of the indoor scene in real time based on these parameters.

[0129] In summary, the method and system for 3D reconstruction of indoor scene point cloud models based on eye-controlled interaction provided in this invention have the following beneficial technical effects compared to the prior art:

[0130] 1. Breaking down spatial interaction barriers for special populations: This invention addresses the pain point of people with limb movement disorders such as ALS and high-level paraplegia being unable to use a mouse and keyboard to operate 3D models. By constructing a dedicated eye-controlled 3D interaction system, patients can freely and smoothly browse digital models of their home environment without any physical movement, simply by looking at the screen. This empowers them to independently view environmental details and participate in home space renovation decisions, significantly enhancing their autonomy and sense of social participation.

[0131] 2. Eliminating dizziness and achieving precise control: Addressing the issue of dizziness easily caused by directly applying traditional eye-tracking technology to 3D space, this invention effectively suppresses over 90% of physiological eye-tracking tremors through dynamic dead zones and filtering smoothing algorithms. This results in smooth and stable rotation and movement of the 3D image, similar to cinematic camera work. Simultaneously, the non-linear velocity mapping mechanism allows users to quickly view the overall scene while fine-tuning the perspective to observe details, providing the operational precision required for interior design observations.

[0132] 3. High smoothness under low computing power: Through streaming point cloud rendering optimization and gaze-focused rendering enhancement technology, the system can smoothly run 3D point cloud models containing millions of points on ordinary consumer PCs. The end-to-end response latency of eye-tracking commands is less than 100 ms, and the rendering frame rate is stable at over 30 fps, ensuring a "what you see is what you get" real-time interactive experience without the need for expensive workstation hardware.

[0133] It should be noted that the above modules and units can be implemented as a computer program, which can run on a computer device. The computer device's memory can store the computer program that makes up the modules and units, enabling the processor to execute the various steps of the above method.

[0134] It should be understood that although the steps in the flowcharts of the various embodiments of the present invention are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the various embodiments may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the sub-steps or stages of other steps.

[0135] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory.

[0136] The above embodiments merely illustrate several implementation methods of the present invention, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be determined by the appended claims.

Claims

1. A 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction, comprising a hardware layer, a data processing layer, and an interactive application layer, characterized in that the data processing layer includes: an eye-tracking solution module, used to acquire the gaze point coordinates from the user's eye-tracking data in real time and output the original normalized gaze coordinates; a gaze-space mapping calibration module, used to calibrate and map the original normalized gaze coordinates and output the mapped screen gaze point coordinates; and an anti-jitter smoothing module, used to perform anti-jitter smoothing on the mapped screen gaze point coordinates and output the smoothed gaze point coordinates; and the interactive application layer includes: a multi-view 3D reconstruction module, used to generate a 3D point cloud model of an indoor scene based on a multi-view RGB image sequence of the indoor scene; and a streaming point cloud rendering module, used to render and display the 3D point cloud model of the indoor scene in real time based on 3D virtual camera control parameters generated by mapping the smoothed gaze point coordinates.

2. The three-dimensional reconstruction system for indoor scene point cloud models based on eye-tracking interaction according to claim 1, characterized in that the eye-tracking solution module specifically includes: a facial feature point extraction unit, used to extract key facial feature points based on a face mesh detection algorithm; an iris center localization unit, used to select key feature points in the left and right eye regions of the face to determine the iris center; and a relative position calculation unit, used to construct the eye coordinate system, calculate the relative position of the iris center within the eye socket as a proportion, and output the original normalized gaze coordinates.

3. The 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction according to claim 1, characterized in that the gaze-space mapping calibration module specifically includes: a calibration point acquisition unit, used to acquire the original normalized gaze coordinates and the screen target coordinates while the user is looking at a screen anchor point; and a polynomial fitting unit, used to establish a mapping relationship using a quadratic polynomial regression model and output the mapped screen gaze point coordinates.

4. The three-dimensional reconstruction system for indoor scene point cloud models based on eye-tracking interaction according to claim 1, characterized in that the anti-jitter smoothing module includes: a confidence filtering unit, used to monitor the confidence of the eye-tracking data in real time and discard the eye-tracking data of the current frame when the confidence is lower than a preset threshold; and an exponentially weighted moving average smoothing unit, used to perform temporal smoothing on the mapped screen gaze point coordinates according to a preset smoothing coefficient and output the smoothed gaze point coordinates.

5. The three-dimensional reconstruction system for indoor scene point cloud models based on eye-tracking interaction according to claim 4, characterized in that the anti-jitter smoothing module further includes: a dynamic dead zone and soft start unit, used to set a dead zone range at the center of the screen, wherein the output control quantity is zero when the gaze falls within the dead zone range and increases non-linearly when the gaze moves out of the dead zone range.

6. The three-dimensional reconstruction system for indoor scene point cloud models based on eye-tracking interaction according to claim 1, characterized in that the multi-view 3D reconstruction module uses a deep learning reconstruction network that incorporates frequency domain information compensation and confidence guidance mechanisms to generate the 3D point cloud model of the indoor scene, and specifically includes: a multi-scale feature extraction and fusion unit, used to extract feature maps of the image at different resolutions using a feature pyramid network architecture, and to concatenate deep semantic features with shallow texture features to generate feature maps containing contextual information; a frequency domain high-frequency information compensation unit, used to convert the feature map from the spatial domain to the frequency domain, enhance the high-frequency components in the frequency domain, and then restore it to the spatial domain to obtain the compensated feature map; a confidence-guided adaptive depth hypothesis unit, used to predict a pixel-level confidence map based on the compensated feature map, reduce the depth hypothesis search radius in high-confidence regions and expand it in low-confidence regions to obtain an adaptive range, and construct a dynamic depth hypothesis plane through homography transformation based on the adaptive range; a cascaded cost volume construction and depth inference unit, used to construct a matching cost volume based on the depth hypothesis plane, regularize the cost volume using a 3D convolutional network, regress the depth map through a Softmax operation, and refine the depth map step by step from coarse to fine under cascaded guidance; and a depth map fusion and point cloud generation unit, used to project the depth maps generated from all views into 3D space, perform geometric consistency verification and probabilistic filtering, remove outliers, and fuse the results to generate the 3D point cloud model of the indoor scene.

7. The three-dimensional reconstruction system for indoor scene point cloud models based on eye-tracking interaction according to claim 1, characterized in that the control parameters of the 3D virtual camera include rotation angle, translation vector, and scaling factor; and the streaming point cloud rendering module specifically includes: a streaming loading unit, used to read the 3D point cloud model of the indoor scene in blocks; a rendering unit, used to convert the horizontal and / or vertical displacement of the gaze on the screen into the rotation angle of the 3D virtual camera based on the smoothed gaze point coordinates; and a ray picking unit, used to calculate the intersection of the 3D ray corresponding to the center of the gaze with the point cloud based on a ray projection algorithm, so as to realize the interactive function of selecting by gazing.

8. A method for 3D reconstruction of indoor scene point cloud models based on eye-tracking interaction, implemented using the 3D reconstruction system for indoor scene point cloud models based on eye-tracking interaction as described in any one of claims 1-7, characterized in that the method includes the following steps: acquiring gaze coordinates from the user's eye-tracking data in real time and outputting the original normalized gaze coordinates; calibrating and mapping the original normalized gaze coordinates and outputting the mapped screen gaze point coordinates; performing anti-jitter smoothing on the mapped screen gaze point coordinates and outputting the smoothed gaze point coordinates; generating a 3D point cloud model of the indoor scene based on the multi-view RGB image sequence of the indoor scene; and rendering and displaying the 3D point cloud model of the indoor scene in real time based on 3D virtual camera control parameters generated by mapping the smoothed gaze point coordinates.