A middle ear structure multi-modal reconstruction method based on a graph neural network

By using a graph neural network-based approach, the problems of multimodal data fusion and topological relationship understanding of the middle ear structure were solved, achieving high-precision and biologically reasonable middle ear structure reconstruction, improving reconstruction accuracy and computational efficiency, and adapting to anatomical differences among individuals.

CN122199796A · Pending · Publication Date: 2026-06-12 · GUANGDONG POLYTECHNIC NORMAL UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GUANGDONG POLYTECHNIC NORMAL UNIV
Filing Date: 2026-03-06
Publication Date: 2026-06-12


IPC: G06T17/00; G06N3/042; G06N3/08; G06T7/30; G06T5/70; G06T5/40

AI Tagging

Application Domain

Image enhancement; Image analysis


Figure CN122199796A_ABST


Abstract

The application discloses a middle ear structure multi-modal reconstruction method based on a graph neural network, which comprises the following steps: preprocessing obtained multi-modal original middle ear data to obtain multi-modal data; converting the multi-modal data into a graph structure; inputting the graph structure into a graph neural network, performing end-to-end feature learning through a message passing mechanism and a multi-head attention mechanism, and adopting a dynamic weighted fusion strategy to integrate multi-modal features to generate fused node representations; training the graph neural network, introducing a topological constraint and a geometric prior as a regularization term of a loss function, wherein the topological constraint forces the reconstruction result to conform to the topological relationship of middle ear anatomy, the geometric prior comprises a smoothness constraint and a curvature constraint, and a trained graph neural network is obtained; and using the trained graph neural network to infer new multi-modal input data to generate a three-dimensional reconstruction result of the middle ear structure. The application significantly improves the accuracy and robustness of middle ear structure reconstruction.


Description

Technical Field

[0001] This invention relates to the field of medical image processing and three-dimensional reconstruction technology, and in particular to a method for multimodal reconstruction of the middle ear structure based on graph neural networks.

Background Technology

[0002] The middle ear, as a sophisticated acoustic conduction system, possesses an extremely complex geometry with significant individual variations. The middle ear cavity contains multiple interconnected chambers and spaces, the ossicular chain formed by the three ossicles together with its ligaments and muscles, the round window membrane, and the oval window membrane, among other intricate structures. The morphological characteristics and spatial relationships of these tissue elements directly affect acoustic conduction efficiency and biomechanical properties. Traditional middle ear structural research primarily relies on anatomical section observation, microscopic image analysis, and two-dimensional tomography. These methods acquire data with limited modalities and coverage. In recent years, with advancements in medical imaging technology, high-resolution computed tomography (CT), magnetic resonance imaging (MRI), and three-dimensional optical microscopy have been increasingly applied to the observation and research of middle ear structures, making it possible to obtain multi-dimensional data on the complex middle ear structure. However, existing acquisition, processing, and reconstruction methods still suffer from multiple systemic technical bottlenecks, severely restricting the accuracy and efficiency of digitizing the fine structures of the middle ear.

[0003] Current practices in acquiring and processing middle ear structures primarily employ medical image-based segmentation and reconstruction, point cloud processing, and geometric modeling. Specifically, high-resolution CT scans can clearly present the three-dimensional morphological features of bony structures and calcified areas; MRI can effectively identify soft tissue boundaries and edema areas; and three-dimensional laser scanning or structured light imaging can obtain high-precision geometric point cloud data of the surface. Meanwhile, ultrasound imaging, being real-time and free of ionizing radiation, has unique advantages in observing the in vivo middle ear and capturing the kinematic features of the ossicular chain. However, these methods have several limitations in practical applications:

1. Challenges in Data Modality Isolation and Multi-Source Fusion: Data acquired by different imaging techniques is heterogeneous: CT emphasizes the density characteristics of bony structures, MRI focuses on soft tissue boundaries, ultrasound provides dynamic motion information, and optical imaging captures surface geometric details. Existing processing methods are mostly based on independent segmentation and reconstruction of single-modal data and lack effective cross-modal information fusion mechanisms. When using data from multiple imaging sources, achieving lossless feature fusion while maintaining the complementarity of information from each modality becomes a major technical challenge. Poor spatial registration accuracy, data redundancy, and inconsistencies between modalities increase the complexity of subsequent analysis, so existing methods often fall back on conservative strategies such as modality selection or sequential fusion, making it difficult to fully exploit the combined advantages of multi-source data.

[0004] 2. Insufficient Geometric Topological Relationship Recognition: The middle ear structure contains complex topological relationships—the ossicular chain, composed of the malleus, incus, and stapes connected by joints and ligaments, is difficult to accurately identify in two-dimensional tomography or isolated point clouds. Existing methods based on voxel segmentation or independent point cloud processing are essentially voxel-by-voxel or point-by-point local decisions, lacking global topological constraints and structural relationship reasoning. This often leads to anatomically unreasonable reconstruction results, such as separated bony bridges, incorrect spatial adjacency relationships, or geometric shapes that do not conform to the morphology of the ossicular chain. Furthermore, existing methods are completely incapable of handling delicate structures that are not significant enough in any single modality and require the integration of multiple modal cues for identification (such as ligament attachment points and micro-fistulas).

[0005] 3. Challenges in Co-modeling Multi-Scale Structures: The middle ear comprises multi-scale structures ranging from millimeter-scale macroscopic chambers to micrometer-scale extracellular matrix. Different imaging techniques have different resolutions; for example, high-resolution CT can reach 0.1 mm, while optical microscopy can reach 1 μm. Representing and processing this multi-scale information, which spans a scale range of more than 1000 times, within a unified mathematical framework while ensuring geometric and topological consistency across scales remains an open problem in the field of geometric processing. Traditional methods are either limited to a single scale or employ manual stitching after layered processing, which easily leads to discontinuities and mismatches at scale boundaries.

[0006] 4. Low level of automation in morphological feature extraction and inference: Current methods for middle ear structure analysis mainly rely on manually designed geometric features (such as curvature, principal axis length, and volume) or shallow machine learning feature engineering in the feature extraction stage. Designing these features often requires expert knowledge, and the features are difficult to adapt to structural variability. Although deep learning can learn features automatically, existing convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have limited ability to process point clouds or mesh structures: they often rely on preprocessing steps such as projection, voxelization, or conversion to regular grids, which lead to information loss and computational redundancy. For non-Euclidean structural data, existing deep learning methods lack natural and efficient representation and inference paradigms.

[0007] 5. Insufficient Integration Depth of Multimodal Data at the Algorithm Layer: Although some research has begun to explore multimodal fusion, current methods mostly remain at simple fusion in the decision layer or the feature concatenation layer, lacking designs for deep feature interaction and complementary learning at the intermediate representation layer. Existing multimodal algorithm frameworks (such as early fusion and late fusion) struggle to effectively transfer and utilize the structural constraints and geometric relationships between modalities when processing highly structured and relationally dense data such as the middle ear. Furthermore, existing methods often employ fixed weights or simple attention and lack adaptive mechanisms for handling conflicting, missing, or redundant information between modalities, making it difficult to capture dynamic cross-modal correlations.

[0008] Given the systemic shortcomings of existing technologies, there is an urgent need to develop an innovative technical solution that can fundamentally address a series of key issues, such as multimodal data fusion, topological relationship understanding, and precise reconstruction of the complex structure of the middle ear.

Summary of the Invention

[0009] To address the technical problems existing in the prior art, this invention proposes a multimodal reconstruction method for the middle ear structure based on graph neural networks. This method utilizes the message passing mechanism of graph neural networks to perform deep feature learning and fusion on the graph, while introducing topological constraints and geometric priors, thereby achieving accurate reconstruction of the complex three-dimensional structure of the middle ear.

[0010] To achieve the above objectives, this invention provides a method for multimodal reconstruction of the middle ear structure based on graph neural networks, comprising:
preprocessing the acquired multimodal raw middle ear data to obtain multimodal data with a consistent spatial reference frame and a uniform intensity range;
converting the multimodal data into a graph structure, in which nodes represent feature points or regions of the middle ear structure, edges encode the spatial adjacency or anatomical connection relationships between nodes, and nodes are constructed using a multi-source fusion strategy;
inputting the graph structure into a graph neural network, performing end-to-end feature learning through a message passing mechanism and a multi-head attention mechanism, and integrating multimodal features with a dynamic weighted fusion strategy to generate fused node representations;
training the graph neural network with topological constraints and geometric priors introduced as regularization terms of the loss function, wherein the topological constraints force the reconstruction results to conform to the anatomical topological relationships of the middle ear, and the geometric priors include smoothness constraints and curvature constraints; and
using the trained graph neural network to perform inference on new multimodal input data and generate a three-dimensional reconstruction of the middle ear structure.

[0011] Preferably, preprocessing the acquired multimodal raw middle ear data includes modality-specific denoising, spatial alignment, scale normalization, and multimodal registration.

Anisotropic diffusion filtering is used for denoising, specifically:

$$\frac{\partial I}{\partial t} = \nabla \cdot \big( D(\nabla I)\, \nabla I \big)$$

where $D(\nabla I)$ is the diffusion tensor adaptively adjusted according to the gradient direction, $I$ is the image intensity value, $t$ is the diffusion iteration time, and $\nabla I$ is the gradient of the image intensity.

Normalization is performed using percentage truncation and histogram matching, specifically:

$$I_m^{\text{norm}} = \operatorname{HistMatch}\big( \operatorname{Normalize}( I_m^{\text{raw}} ) \big)$$

where $\operatorname{HistMatch}$ is the histogram matching operation, $\operatorname{Normalize}$ is percentage truncation followed by standard deviation normalization, $I_m^{\text{norm}}$ is the normalized image data of the $m$-th modality, and $I_m^{\text{raw}}$ is the raw image data of the $m$-th modality.

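[0012] Preferably, nodes are constructed using a multi-source fusion strategy, specifically:

$$\mathbf{a}_i = \big[\, \mathbf{p}_i \,\|\, \mathbf{f}_i^{(1)} \,\|\, \cdots \,\|\, \mathbf{f}_i^{(K)} \,\big]$$

where $\mathbf{p}_i$ is the three-dimensional spatial coordinate of node $i$, $\mathbf{f}_i^{(k)}$ is the feature output of the $k$-th modality encoder at the node location, $\|$ denotes concatenation, and $\mathbf{a}_i$ is the multimodal attribute vector of node $i$.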

[0013] Preferably, in the graph structure, edges are constructed based on a dual criterion. The dual criterion includes: a spatial proximity criterion, under which an edge is established when the Euclidean distance between two nodes is less than a set threshold; and an anatomical relationship criterion, under which an edge is established when two nodes belong to the same anatomical structure.

[0014] Preferably, the dynamic weighted fusion strategy is as follows:

$$\mathbf{h}_i = \sum_{k=1}^{K} \alpha_i^{(k)}\, \mathbf{h}_i^{(k)}$$

where $\mathbf{h}_i^{(k)}$ is the feature representation of node $i$ learned from the $k$-th modality, $\alpha_i^{(k)}$ is the dynamically calculated fusion weight, $\mathbf{h}_i$ is the feature representation of node $i$ after fusion, and $K$ is the total number of modalities.
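The weighted sum above can be illustrated with a small NumPy sketch. The source does not state how the fusion weights are computed, so the softmax gating used here, the function name `dynamic_fusion`, and the scoring vector `gate` are all assumptions made for illustration only.

```python
import numpy as np

def dynamic_fusion(modal_feats, gate):
    """Fuse per-modality node features with per-node softmax weights.

    modal_feats: (N, K, d) array, features of N nodes from K modalities.
    gate: (d,) scoring vector (hypothetical stand-in for a learned gating layer).
    """
    scores = modal_feats @ gate                           # (N, K) relevance scores
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)             # weights sum to 1 per node
    fused = (alpha[..., None] * modal_feats).sum(axis=1)  # (N, d) weighted sum
    return fused, alpha

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4, 8))   # 5 nodes, 4 modalities, 8-dim features
fused, alpha = dynamic_fusion(feats, rng.normal(size=8))
```

With an all-zero `gate` the weights become uniform and the fusion reduces to a plain average over modalities, which is the fixed-weight baseline that the dynamic strategy is meant to improve on.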

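[0015] Preferably, the loss function is:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_1 \mathcal{L}_{\text{topo}} + \lambda_2 \mathcal{L}_{\text{smooth}} + \lambda_3 \mathcal{L}_{\text{curv}}$$

where $\mathcal{L}_{\text{rec}}$ is the reconstruction loss, $\mathcal{L}_{\text{topo}}$ is the topological consistency loss, $\mathcal{L}_{\text{smooth}}$ is the Laplacian smoothing loss, $\mathcal{L}_{\text{curv}}$ is the curvature constraint loss, $\mathcal{L}$ is the total loss function, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyperparameters.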

[0016] Preferably, the topological consistency loss $\mathcal{L}_{\text{topo}}$ is calculated from the difference between the reconstructed adjacency matrix and the expected adjacency matrix, wherein the expected adjacency matrix is predefined according to the anatomical topology of the middle ear.

[0017] Preferably, the method further includes online incremental learning, comprising: performing incremental training based on new input data and feedback results, and optimizing model parameters through gradient updates while setting stability constraints to prevent performance degradation.

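[0018] Preferably, the online incremental learning adjusts the parameters for new samples by gradient update, specifically:

$$\theta_{\text{new}} = \theta - \eta\, \nabla_{\theta}\, \mathcal{L}\big( f_{\theta}(x_{\text{new}}),\, y_{\text{new}} \big)$$

where $\eta$ is the learning rate, $f_{\theta}$ is the GNN model with parameters $\theta$, $\theta_{\text{new}}$ is the updated model parameters, $\theta$ is the current model parameters, $\nabla_{\theta}\mathcal{L}$ is the gradient of the loss function, $x_{\text{new}}$ is the newly input multimodal data, and $y_{\text{new}}$ is the label corresponding to the new input data.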
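The update rule can be sketched on a toy problem. To stay self-contained, this example assumes a linear least-squares model in place of the GNN, and it models the stability constraint as a cap on the step norm. Both are assumptions: the source does not specify how the stability constraint is implemented, and the names `incremental_update` and `max_step` are illustrative.

```python
import numpy as np

def mse(theta, x, y):
    """Mean squared error of a linear model (stand-in for the GNN loss)."""
    return float(np.mean((x @ theta - y) ** 2))

def incremental_update(theta, x_new, y_new, lr=0.05, max_step=0.1):
    """One gradient step on new samples, with the step norm capped at max_step."""
    residual = x_new @ theta - y_new
    grad = 2.0 * x_new.T @ residual / len(y_new)   # gradient of the MSE
    step = lr * grad
    norm = np.linalg.norm(step)
    if norm > max_step:                            # assumed stability constraint
        step = step * (max_step / norm)
    return theta - step

rng = np.random.default_rng(0)
x_new = rng.normal(size=(20, 3))
y_new = x_new @ np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
theta_next = incremental_update(theta, x_new, y_new)
```

Capping the step keeps a single batch of new data from moving the parameters far from their previous values, which is one simple way to limit performance degradation on earlier data.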

[0019] Compared with the prior art, the present invention has the following advantages and technical effects: (1) This invention fully utilizes the complementary advantages of each modality through multimodal fusion. The high precision of CT modality in representing bone structures, the high contrast of MRI in soft tissue regions, the ultra-high precision of optical point clouds in surface geometry, and the advantages of ultrasound in dynamic characteristics are all fully integrated in the unified graphical representation framework of this invention. Based on theoretical analysis and expected experimental verification, this invention can achieve sub-millimeter level reconstruction accuracy, which is an order of magnitude higher than the millimeter level accuracy of single-modal methods. At the same time, through explicit topological constraints and geometric priors, the biological rationality of the reconstruction results can be guaranteed, avoiding the unreasonable morphologies that may occur in purely data-driven methods.

[0020] (2) The GNN architecture of this invention performs message passing on a graph structure, which has higher computational efficiency compared to performing convolution operations on full-size three-dimensional volume data. Graph representation avoids a large amount of redundant computation by selectively retaining key nodes. The introduction of the incremental learning mechanism further optimizes the computational cost. When optimization is required for new data, there is no need to retrain the entire model; only gradient updates are needed, reducing training time from hours to minutes. This optimization enables the system to quickly adapt to new data characteristics in practical applications, improving the system's practicality.

[0021] (3) This invention ensures the reliability of the reconstruction results through explicit topological and geometric constraints. Compared with purely data-driven methods, the results of this invention are not only accurate but also biologically plausible. The incremental learning and online optimization mechanism provides adaptive capabilities, enabling continuous performance improvement based on new data and adapting to the characteristics of data from different sources and individuals. The stability constraint mechanism ensures that online learning does not lead to performance degradation, guaranteeing long-term reliability.

Description of the Drawings

[0022] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings: Figure 1 is a flowchart of a multimodal reconstruction method for middle ear structure based on graph neural networks according to an embodiment of the present invention.

Detailed Description

[0023] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0024] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.

[0025] This embodiment proposes a multimodal reconstruction method for the middle ear structure based on graph neural networks which, as shown in Figure 1, comprises:
preprocessing the acquired multimodal raw middle ear data to obtain multimodal data with a consistent spatial reference frame and a uniform intensity range;
converting the multimodal data into a graph structure, in which nodes represent feature points or regions of the middle ear structure, edges encode the spatial adjacency or anatomical connection relationships between nodes, and nodes are constructed using a multi-source fusion strategy;
inputting the graph structure into a graph neural network, performing end-to-end feature learning through a message passing mechanism and a multi-head attention mechanism, and integrating multimodal features with a dynamic weighted fusion strategy to generate fused node representations;
training the graph neural network with topological constraints and geometric priors introduced as regularization terms of the loss function, wherein the topological constraints force the reconstruction results to conform to the anatomical topological relationships of the middle ear, and the geometric priors include smoothness constraints and curvature constraints; and
using the trained graph neural network to perform inference on new multimodal input data and generate a three-dimensional reconstruction of the middle ear structure.

[0026] Specifically, this embodiment first constructs a general graph representation framework that uniformly converts raw or segmented data from multiple modalities such as CT, MRI, optical, and ultrasound into a node-edge-attribute graph structure. Nodes represent anatomically significant feature points or regions, edges encode spatial adjacency or anatomical relationships, and the attribute vectors of nodes and edges carry multidimensional feature information from different modalities. Second, a GNN architecture dedicated to the middle ear structure is designed, including a multi-layer message passing mechanism, a structure-aware attention module, and a cross-modal feature interaction layer. This enables the network to perform end-to-end feature learning on the graph and automatically learn the importance weights of each modality at different structural locations. Third, topological constraints and geometric priors are introduced as regularization terms to ensure that the reconstruction results meet the basic requirements of middle ear anatomy. Finally, through a closed-loop learning mechanism, the network continuously optimizes the graph construction method, node partitioning granularity, and fusion weight strategy based on feedback from reconstruction accuracy, achieving adaptive evolution of the model.

[0027] Furthermore, preprocessing the acquired multimodal raw middle ear data includes modality-specific denoising, spatial alignment, scale normalization, and multimodal registration.

[0028] Specifically, this stage is at the forefront of the entire technical solution. Its main task is to standardize the multimodal raw data from different acquisition devices to eliminate differences between data sources and lay the foundation for subsequent unified representation.

[0029] The raw data from different imaging modalities exhibit significant differences in intensity distribution, spatial resolution, and noise characteristics. To effectively fuse these heterogeneous data, dedicated preprocessing for each modality is necessary. This embodiment employs a modality-specific preprocessing pipeline, optimized for the characteristics of each modality's data.

[0030] First, modality-specific denoising and enhancement are performed. Because the noise characteristics generated by different acquisition devices vary significantly, using a uniform denoising method can lead to excessive suppression of useful features in certain modalities or excessive retention of noise. This embodiment employs a dedicated denoising scheme for each modality. For image data, anisotropic diffusion filtering is used, with the basic equation being:

$$\frac{\partial I}{\partial t} = \nabla \cdot \big( D(\nabla I)\, \nabla I \big)$$

where $D(\nabla I)$ is the diffusion tensor adaptively adjusted according to the gradient direction, $I$ is the image intensity value, $t$ is the diffusion iteration time, and $\nabla I$ is the gradient of the image intensity.

[0031] The advantage of this method lies in its ability to adjust the diffusion intensity based on the local image gradient direction: diffusion is suppressed along the gradient direction (usually corresponding to structural edges), while diffusion is promoted perpendicular to the gradient direction. This effectively smooths noise while preserving structural boundaries, making it particularly suitable for medical image processing.
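As a rough illustration of edge-preserving diffusion, the sketch below implements the classic Perona-Malik scheme on a 2D image. Note the simplification: Perona-Malik uses a scalar, gradient-dependent conductance in place of the full diffusion tensor described above, and the parameter names and values (`kappa`, `dt`, `n_iter`) are assumptions chosen for this toy example.

```python
import numpy as np

def perona_malik(image, n_iter=20, kappa=0.1, dt=0.2):
    """Scalar-conductance anisotropic diffusion (Perona-Malik) on a 2D image."""
    img = image.astype(np.float64).copy()
    for _ in range(n_iter):
        # Differences to the four neighbours (zero flux across the image border).
        north = np.zeros_like(img); north[1:, :] = img[:-1, :] - img[1:, :]
        south = np.zeros_like(img); south[:-1, :] = img[1:, :] - img[:-1, :]
        west = np.zeros_like(img); west[:, 1:] = img[:, :-1] - img[:, 1:]
        east = np.zeros_like(img); east[:, :-1] = img[:, 1:] - img[:, :-1]
        # Conductance is close to 1 in flat regions and close to 0 across strong edges.
        update = sum(np.exp(-(d / kappa) ** 2) * d for d in (north, south, west, east))
        img += dt * update   # dt <= 0.25 keeps the explicit scheme stable
    return img

# Noisy step edge: left half near 0, right half near 1.
rng = np.random.default_rng(0)
step = np.zeros((16, 16)); step[:, 8:] = 1.0
noisy = step + 0.02 * rng.normal(size=step.shape)
smoothed = perona_malik(noisy)
```

On this input the noise inside each half is reduced, while the jump between the two halves is left essentially intact because the conductance across it is nearly zero.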

[0032] Secondly, spatial alignment and coordinate system unification are required. Multimodal data comes from different acquisition methods and devices, resulting in spatial, temporal, and scale differences. Registration alone is insufficient to completely eliminate these differences; feature layer alignment and normalization are also necessary.

[0033] The goal of spatial alignment is to transform all modal data into a common reference coordinate system. During the acquisition of different modalities, patients may be in different positions, and the imaging perspective may also differ. This embodiment establishes a unified reference coordinate framework by calculating the principal directions of all modal data (based on principal component analysis). Subsequently, each modal data is transformed into this unified framework, ensuring that the spatial coordinates of all nodes are consistent within the unified reference system.
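A minimal sketch of PCA-based alignment is shown below for a single point set. The function names are illustrative, and how the source combines the principal directions of several modalities into one shared frame is not specified, so this shows only the per-dataset step.

```python
import numpy as np

def principal_frame(points):
    """Return the centroid and principal axes (columns, by descending variance)."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return centroid, eigvecs[:, ::-1]

def to_principal_frame(points):
    """Express points in their own centroid-centred principal-axis frame."""
    centroid, axes = principal_frame(points)
    return (points - centroid) @ axes

rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5]) + 10.0
aligned = to_principal_frame(cloud)
```

PCA leaves the sign of each axis undetermined, so in practice an extra convention (for example, anatomical landmarks) would be needed to make frames from different modalities agree.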

[0034] Third, scale normalization is performed. Scale normalization addresses the resolution differences between acquisition methods. Voxel size in CT scans, slice thickness in MRI, and sampling interval in optical imaging can all vary, which leads to differences in intensity dynamic range, feature scale, and so on. This embodiment uses percentage truncation and histogram matching for normalization:

$$I_m^{\text{norm}} = \operatorname{HistMatch}\big( \operatorname{Normalize}( I_m^{\text{raw}} ) \big)$$

where $\operatorname{HistMatch}$ is the histogram matching operation, which adjusts the histogram of each modality to a unified reference histogram, $\operatorname{Normalize}$ is percentage truncation followed by standard deviation normalization, $I_m^{\text{norm}}$ is the normalized image data of the $m$-th modality, and $I_m^{\text{raw}}$ is the raw image data of the $m$-th modality.

[0035] This process ensures that the intensity dynamic range of each modality is consistent, which is beneficial for subsequent feature encoding and fusion.
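The two normalization operations can be sketched in NumPy as follows. The percentile bounds, the function names, and the choice to apply truncation before histogram matching are assumptions; the reference volume here is random data standing in for a real reference histogram.

```python
import numpy as np

def normalize(volume, lo_pct=1.0, hi_pct=99.0):
    """Clip to a percentile range, then shift to zero mean and unit std."""
    lo, hi = np.percentile(volume, [lo_pct, hi_pct])
    clipped = np.clip(volume, lo, hi)
    std = clipped.std()
    return (clipped - clipped.mean()) / (std if std > 0 else 1.0)

def hist_match(source, reference):
    """Map source intensities so their empirical CDF matches the reference's."""
    src_vals, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    matched = np.interp(src_cdf, ref_cdf, ref_vals)   # CDF-to-CDF lookup
    return matched[src_idx].reshape(source.shape)

rng = np.random.default_rng(0)
raw = rng.normal(100.0, 20.0, size=(8, 8, 8))      # toy single-modality volume
reference = rng.normal(0.0, 1.0, size=(8, 8, 8))   # toy reference volume
normalized = hist_match(normalize(raw), reference)
```

After this step the output intensities lie within the reference range and follow the reference distribution, regardless of the original scale of `raw`.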

[0036] Finally, multimodal registration is performed. Multimodal registration is a crucial step to ensure the effectiveness of subsequent fusion. For image modalities (CT, MRI), this embodiment uses mutual information as the similarity metric for rigid body registration. The advantage of mutual information is that it does not assume any particular intensity relationship between the images, so it can handle images with different contrasts and is particularly suitable for registration between different modalities. The specific formula is:

$$MI(A, B) = \sum_{a} \sum_{b} p_{AB}(a, b)\, \log \frac{p_{AB}(a, b)}{p_A(a)\, p_B(b)}$$

where $MI(A, B)$ is the mutual information between image A and image B, $p_{AB}(a, b)$ is the joint probability distribution of image A and image B, $p_A(a)$ is the marginal probability distribution of image A, $p_B(b)$ is the marginal probability distribution of image B, $a$ is the grayscale value in image A, and $b$ is the grayscale value in image B.
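Mutual information between two images can be estimated from a joint intensity histogram, as in the sketch below. The bin count and function name are assumptions, and the optimization over transform parameters that a registration would wrap around this metric is not shown.

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram-based estimate of mutual information (in nats) between two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                  # joint distribution
    p_a = p_ab.sum(axis=1, keepdims=True)       # marginal of image A
    p_b = p_ab.sum(axis=0, keepdims=True)       # marginal of image B
    nz = p_ab > 0                               # skip empty bins to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

rng = np.random.default_rng(0)
image_a = rng.random((32, 32))
inverted = 1.0 - image_a           # same structure, opposite contrast
unrelated = rng.random((32, 32))   # no shared structure
mi_inverted = mutual_information(image_a, inverted)
mi_unrelated = mutual_information(image_a, unrelated)
```

The contrast-inverted copy scores much higher than the unrelated image even though its intensities are anti-correlated with the original, which is why the metric suits pairs such as CT and MRI.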

[0037] By optimizing the B-spline transform parameters, this embodiment can control the registration error between the two modal images to within 0.5mm, ensuring spatial consistency during subsequent fusion. This precision is crucial for millimeter-scale middle ear structures.

[0038] After preprocessing including denoising, spatial alignment, scale normalization, and multimodal registration, a multimodal dataset with a consistent spatial reference frame, uniform intensity range, and accurate spatial correspondence is obtained. This preprocessed data serves as the direct input for the next stage (graph construction).

[0039] Furthermore, the graph construction stage receives the preprocessed multimodal data, and its main task is to encode the heterogeneous multimodal information into a unified graph structure while integrating anatomical topological relationships. The graph structure not only preserves the rich information of multimodal features, but also provides a natural computational platform for subsequent graph neural network processing.

[0040] In middle ear reconstruction, data from different imaging modalities exhibit heterogeneity. Computed tomography (CT) can clearly present the three-dimensional morphology of bony structures, but has limited contrast for soft tissues; magnetic resonance imaging (MRI) offers high resolution in soft tissue regions, clearly displaying membranous structures and fluid-filled cavities; three-dimensional optical point cloud imaging provides ultra-high precision geometric information of surfaces; and dynamic ultrasound imaging can capture the motion characteristics of the ossicular chain under acoustic stimulation. A single modality cannot fully characterize all the anatomical details of the middle ear. Therefore, reconstruction methods need to construct a unified graphical representation framework to organize information from all modalities in a consistent manner.

[0041] Specifically, this stage includes feature encoding and node construction. Dedicated encoders are applied to the preprocessed data. For volumetric imaging data (CT, MRI), an improved 3D convolutional neural network architecture is used, which can effectively learn spatial features in 3D volumetric data. For point cloud data, the PointNet++ network is used, a deep learning architecture that directly processes unordered point sets and captures geometric features at different levels through multi-scale feature extraction. These encoders map heterogeneous multimodal data to a unified feature space, so that features from different modalities can be compared and fused.

[0042] Given raw datasets from $K$ different imaging modalities, let $E_k$ be the feature encoder for the $k$-th modality, which maps the raw data to a unified feature space. Node construction employs a multi-source fusion strategy, and the multimodal attribute vector of a node is defined as follows:

$$\mathbf{a}_i = \big[\, \mathbf{p}_i \,\|\, \mathbf{f}_i^{(1)} \,\|\, \cdots \,\|\, \mathbf{f}_i^{(K)} \,\big]$$

where $\mathbf{p}_i$ is the three-dimensional spatial coordinate of node $i$, $\mathbf{f}_i^{(k)}$ is the feature output of the $k$-th modality encoder $E_k$ at the node location, $\|$ denotes concatenation, and $\mathbf{a}_i$ is the multimodal attribute vector of node $i$.

[0043] This design allows each node to simultaneously carry feature information from all modalities, laying the foundation for subsequent cross-modal fusion. The number and density of nodes are adaptively adjusted according to the anatomical complexity of the middle ear, increasing node density in areas with dense fine structures such as the ossicular chain and reducing the number of nodes in simpler areas such as the cavities, thereby achieving efficient utilization of computing resources.
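A minimal sketch of node attribute construction is given below, assuming the per-modality encoder outputs have already been sampled at the node locations. The function name, feature dimensions, and random data are hypothetical.

```python
import numpy as np

def build_node_attributes(coords, modality_features):
    """Concatenate 3D coordinates with per-modality features for every node.

    coords: (N, 3) node positions.
    modality_features: list of K arrays, each (N, d_k), one per modality encoder.
    """
    return np.concatenate([coords, *modality_features], axis=1)

rng = np.random.default_rng(0)
coords = rng.random((10, 3))        # 10 nodes
ct_feats = rng.random((10, 16))     # hypothetical CT encoder output
mri_feats = rng.random((10, 16))    # hypothetical MRI encoder output
optical_feats = rng.random((10, 8)) # hypothetical point-cloud encoder output
attributes = build_node_attributes(coords, [ct_feats, mri_feats, optical_feats])
```

Each row of `attributes` then carries the node position followed by the features of every modality, so later layers can weigh modalities per node.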

[0044] Furthermore, in the graph structure, edges are constructed based on a dual criterion. The dual criterion includes: a spatial proximity criterion, under which an edge is established when the Euclidean distance between two nodes is less than a set threshold; and an anatomical relationship criterion, under which an edge is established when two nodes belong to the same anatomical structure.

[0045] Specifically, the construction of edges follows two criteria to ensure that the graph topology can represent both local geometric relationships and encode global anatomical structures.

[0046] The first criterion is spatial proximity: if the Euclidean distance between two nodes is less than a set threshold, the nodes are considered spatially close and an edge is established between them. This criterion allows nodes to exchange local features, thereby capturing local geometric characteristics.

[0047] The second criterion is anatomical relationship: if two nodes belong to the same anatomical structure (such as the three bones of the ossicular chain connected by joints), an edge is established even if they are far apart, in order to maintain anatomical connectivity.

[0048] The combination of these two criteria ensures that the graph can capture both local geometric details and encode global anatomical topological relationships, enabling the network to learn detailed features while maintaining the consistency of the global structure during feature learning.
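The dual criterion can be sketched as a boolean adjacency matrix. This assumes each node carries an integer label for the anatomical structure it belongs to; the function name, labels, and threshold value are illustrative.

```python
import numpy as np

def build_edges(coords, structure_ids, radius):
    """Adjacency from spatial proximity OR shared anatomical structure."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    near = dist < radius                                    # spatial proximity criterion
    same = structure_ids[:, None] == structure_ids[None, :] # anatomical relationship criterion
    adj = near | same
    np.fill_diagonal(adj, False)                            # no self-loops
    return adj

coords = np.array([[0.0, 0, 0], [0.5, 0, 0], [5.0, 0, 0], [9.0, 0, 0]])
structure_ids = np.array([0, 1, 0, 2])   # nodes 0 and 2 share a structure
adj = build_edges(coords, structure_ids, radius=1.0)
```

Here nodes 0 and 1 are linked because they are close, nodes 0 and 2 are linked because they share a structure despite being far apart, and node 3 stays unconnected.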

[0049] The constructed graph structure is G = (V, E, A), where V is the set of nodes containing multimodal features, E is the set of edges capturing local geometry and global topological relationships, and A is the node attribute matrix. This graph structure integrates all modal information and encodes the topological relationships of the structure, serving as the direct input to the graph neural network.

[0050] Furthermore, this stage receives the graph structure; its main task is to perform end-to-end feature learning and cross-modal fusion through the message passing mechanism of the graph neural network. This stage is the core of the solution: through network learning it automatically discovers the relationships between different modalities and achieves deep information integration.

[0051] Specifically, in traditional deep learning, convolutional neural networks process Euclidean data (such as images) through local convolutional operations. However, the topological relationships of the middle ear structure are complex non-Euclidean structures, and representing them with a regular grid would result in significant computational redundancy and information loss. Graph neural networks (GNNs), through message passing mechanisms, perform feature learning on graphs and can directly process non-Euclidean structures, making them particularly suitable for complex topological structures like the middle ear.

[0052] This embodiment employs an improved GraphSAGE framework for message passing. The basic idea of this framework is that each node aggregates information from its neighboring nodes, continuously expanding its receptive field through multiple iterations to ultimately achieve global feature learning. The message passing process in layer ℓ involves aggregating the latent representations of neighboring nodes, then transforming them using a learnable weight matrix and activation function to generate a new representation for the node in layer ℓ+1.
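The aggregate-then-transform step can be sketched as follows. This is a plain mean-aggregator layer in the style of standard GraphSAGE, not the improved variant of this embodiment, whose details are not given here; the names `sage_layer`, `w_self`, and `w_neigh`, and the use of ReLU as the activation, are illustrative assumptions.

```python
def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def relu(vector):
    return [max(0.0, x) for x in vector]

def sage_layer(features, neighbors, w_self, w_neigh):
    """One mean-aggregator message-passing step (layer l -> layer l+1).

    features: list of node feature vectors at layer l
    neighbors: dict mapping node index -> list of neighbor indices
    w_self, w_neigh: weight matrices (hypothetical names; learned in practice)
    """
    updated = []
    for i, h_i in enumerate(features):
        nbrs = neighbors.get(i, [])
        if nbrs:
            # Aggregate: element-wise mean of neighbor representations.
            mean = [sum(features[j][d] for j in nbrs) / len(nbrs)
                    for d in range(len(h_i))]
        else:
            mean = [0.0] * len(h_i)
        # Transform: weight matrices followed by a nonlinearity.
        combined = [a + b for a, b in
                    zip(matvec(w_self, h_i), matvec(w_neigh, mean))]
        updated.append(relu(combined))
    return updated

features = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
layer1 = sage_layer(features, neighbors, identity, identity)
```

Stacking several such layers is what widens each node's receptive field: after two layers, a node has received information from its neighbors' neighbors.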

[0053] Multi-head attention and dynamic weights: The multi-head attention mechanism is a crucial component of the GNN architecture in this embodiment. The attention mechanism automatically adjusts the importance of neighboring nodes based on the content of a node, rather than treating all neighboring nodes equally. Multi-head attention allows the network to learn the relationships between nodes simultaneously from multiple perspectives; different heads can learn different interaction patterns. This design is particularly suitable for the fusion of multimodal data because the relationships between different modalities are diverse. The structure corresponding to some nodes may be significant in one modality but weak in another; the multi-head mechanism can capture such varying correlations.
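A simplified sketch of multi-head attention over a node's neighbors is given below. To stay short it assumes each head works on its own slice of the feature vector and scores neighbors by scaled dot product, with no learned projection matrices; the function name `multi_head_aggregate` and the slicing scheme are illustrative assumptions rather than the embodiment's actual design.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_aggregate(features, neighbor_ids, node, head_slices):
    """Aggregate neighbor features for one node with several heads.

    head_slices: list of (start, stop) index pairs, one per head (assumption:
    heads partition the feature vector instead of using learned projections).
    """
    output = []
    for start, stop in head_slices:
        query = features[node][start:stop]
        scale = math.sqrt(stop - start)
        # Content-based score of each neighbor under this head.
        scores = [
            sum(q * k for q, k in zip(query, features[j][start:stop])) / scale
            for j in neighbor_ids
        ]
        weights = softmax(scores)
        # Weighted sum of neighbor features, restricted to this head's slice.
        for d in range(start, stop):
            output.append(sum(w * features[j][d]
                              for w, j in zip(weights, neighbor_ids)))
    return output

features = [
    [1.0, 0.0, 0.0, 1.0],  # node 0, the node being updated
    [1.0, 0.0, 2.0, 0.0],  # node 1
    [1.0, 0.0, 0.0, 2.0],  # node 2
]
aggregated = multi_head_aggregate(features, [1, 2], node=0,
                                  head_slices=[(0, 2), (2, 4)])
```

Here the first head sees the two neighbors as equally relevant, while the second head favors node 2, which shows how different heads can pick up different interaction patterns for the same node.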

[0054] To address the unique characteristics of the middle ear structure, this embodiment introduces a structure-aware mechanism into the edge weights. Static edge weights (based solely on distance) cannot adapt to complex anatomical relationships. Dynamic edge weight adjustment allows the network to automatically learn which edge relationships are more important in the current context based on edge characteristics (distance, anatomical type, etc.). This enables the network to adaptively model the topology, employing different connection strategies for different anatomical sites.

[0055] Cross-modal fusion attention mechanism: The key to multimodal fusion lies in how to effectively integrate information from different sources. Simple concatenation or averaging can lead to information loss or excessive suppression of information from a particular modality. This embodiment designs an explicit cross-modal fusion mechanism that allows the network to automatically adjust the importance of different modalities based on the content of the data. During the GNN learning process, the fused representation of each node is calculated through dynamic weighted summation: $h_i^{\mathrm{fused}} = \sum_{k=1}^{K} \alpha_i^{(k)} h_i^{(k)}$. In the formula, $h_i^{(k)}$ is the feature representation of node $i$ learned from the $k$-th modality, $\alpha_i^{(k)}$ is the dynamically calculated fusion weight, $h_i^{\mathrm{fused}}$ is the feature representation of node $i$ after fusion, and $K$ is the total number of modalities.

[0056] The weights are dynamically calculated based on feature relevance, rather than being fixed. This means the network can learn different modal weights at different nodes. For example, in the ossicular chain bone region, CT images receive higher weights due to their high contrast; while in soft tissue regions, MRI images may receive higher weights. This adaptive weighted fusion fully leverages the strengths of each modality.

[0057] The calculation of fusion weights employs a softmax mechanism, ensuring that the sum of the weights equals 1, which provides a probabilistic interpretation of the fusion process. The weights are not rigidly assigned but are learned soft weights, allowing the network to continuously optimize the fusion strategy during training. This flexibility enables the algorithm to adapt to anatomical differences among individuals, as the middle ear structure can vary significantly across different populations.
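The softmax-weighted fusion for a single node can be sketched as follows. The raw relevance scores would come from a learned scoring function in practice; here they are supplied by hand, and the names `fuse_modalities` and `relevance_scores` are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_modalities(modality_features, relevance_scores):
    """Fuse K per-modality feature vectors of one node.

    modality_features: list of K feature vectors of equal length
    relevance_scores: list of K raw scores (assumed output of a learned scorer)
    """
    weights = softmax(relevance_scores)
    dim = len(modality_features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, modality_features))
             for d in range(dim)]
    return weights, fused

# Toy node in a bony region: the CT feature gets the larger raw score.
ct_feature, mri_feature = [1.0, 0.0], [0.0, 1.0]
weights, fused = fuse_modalities([ct_feature, mri_feature], [2.0, 0.0])
```

Because the weights are recomputed per node, a node in a soft tissue region could receive the opposite weighting from the same function.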

[0058] Through multi-layer iteration and cross-modal fusion in the GNN, the fused node feature representation is obtained. Each node's representation contains multimodal information from the global receptive field. These fused features serve as the basis for subsequent constrained optimization and reconstruction.

[0059] Furthermore, this stage receives the fused feature representation learned by the GNN. Its main task is to combine data-driven learning with prior knowledge by introducing explicit topological and geometric constraints, ensuring that the reconstruction result is consistent with both data observations and structural knowledge. This stage guides the network to learn a representation that is both data-consistent and anatomically sound through an optimized loss function.

[0060] Specifically, topological and geometric constraints are crucial because data-driven learning alone can easily lead to networks learning reconstructions that do not conform to anatomy. For example, the three bones of the ossicular chain are connected by joints to form a rigid topological relationship, which is a fundamental feature of the middle ear. If the network learns incorrect topological relationships (such as bone separation), the reconstruction will not conform to biological facts. Therefore, this embodiment explicitly introduces topological constraints and geometric priors.

[0061] Hard topological constraints force the network to maintain the correct anatomical topology. The basic topology of the middle ear is anatomically deterministic, and the desired adjacency relationships can be predefined. These constraints are added to the loss function through a regularization term, penalizing reconstructions that violate the topological rules.

[0062] Specifically, the topology consistency loss is defined over the following quantities: the adjacency matrix inferred from the reconstruction results; the expected adjacency matrix, which is predefined based on anatomical knowledge; the feature representation of node i; and the total number of nodes.

[0063] This term is added to the total loss as a regularization term, forcing the network to maintain correct topological connectivity.
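One plausible form of such a term is sketched below. It is an assumption, not the formula of this embodiment: the inferred adjacency is taken as the sigmoid of the dot product of node features, and the loss is the mean squared difference between the inferred and expected adjacency matrices. The function names are hypothetical.

```python
import math

def inferred_adjacency(features):
    """Soft adjacency from node features (assumed form: sigmoid of dot product)."""
    n = len(features)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dot = sum(a * b for a, b in zip(features[i], features[j]))
                adjacency[i][j] = 1.0 / (1.0 + math.exp(-dot))
    return adjacency

def topology_loss(predicted, expected):
    """Mean squared difference between two N x N adjacency matrices."""
    n = len(expected)
    return sum((predicted[i][j] - expected[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Expected topology of the ossicular chain: malleus - incus - stapes.
expected = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
# A reconstruction in which the incus-stapes connection has been lost.
broken = [[0, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
loss_ok = topology_loss(expected, expected)
loss_broken = topology_loss(broken, expected)
soft = inferred_adjacency([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
```

The broken chain is penalized while the anatomically correct one is not, which is the behavior the regularization term is meant to provide.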

[0064] Geometric soft constraints ensure that the reconstruction results are geometrically reasonable. Laplacian smoothing regularization encourages similar features between adjacent nodes, thereby generating a smooth reconstruction surface and avoiding unnecessary bumps and spikes. Curvature constraints further ensure that the curvature of the reconstruction surface is within a reasonable range, conforming to the known curvature characteristics of various middle ear structures. For example, the surface of the ossicular chain should be relatively smooth, and there should be no biologically inconsistent high curvature regions.
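The Laplacian smoothing term can be sketched as a sum of squared feature differences across edges, which is the standard graph-Laplacian quadratic form for an unweighted graph. Whether this embodiment normalizes or weights the term is not stated, so the unnormalized sum below is an assumption, as is the function name.

```python
def laplacian_smoothness(features, edges):
    """Sum over edges (i, j) of the squared distance between node features.

    For an unweighted graph this equals trace(H^T L H), where L is the graph
    Laplacian and H stacks the node features row by row.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        for i, j in edges
    )

edges = [(0, 1), (1, 2)]
rough = laplacian_smoothness([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]], edges)
smooth = laplacian_smoothness([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], edges)
```

Minimizing this quantity pulls adjacent nodes toward similar features, which is why a weight that is too large would over-smooth fine structures.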

[0065] The organic combination of these constraints ensures that the reconstruction results conform to both data observation (through data-driven learning) and anatomical knowledge (through explicit constraints). This combination avoids anatomical inconsistencies caused by purely data-driven approaches and over-smoothing caused by purely knowledge-driven approaches.

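[0066] Integrated loss function and training strategy: This stage drives network learning through a weighted combination of loss terms. The overall reconstruction loss function is: $\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{rec}} + \lambda_2 \mathcal{L}_{\mathrm{topo}} + \lambda_3 \mathcal{L}_{\mathrm{lap}} + \lambda_4 \mathcal{L}_{\mathrm{curv}}$. In the formula, $\mathcal{L}_{\mathrm{rec}}$ is the reconstruction loss; $\mathcal{L}_{\mathrm{topo}}$ is the topology consistency loss; $\mathcal{L}_{\mathrm{lap}}$ is the Laplacian smoothing loss; $\mathcal{L}_{\mathrm{curv}}$ is the curvature constraint loss; $\mathcal{L}_{\mathrm{total}}$ is the total loss function; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are four weight hyperparameters, determined by performance tuning on the validation set, typically using grid search or Bayesian optimization.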
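The weighted combination and a grid search over its weights can be sketched as follows. The dictionary keys, the candidate grid, and the stand-in `toy_validation_score` are all assumptions for illustration; in practice each score would come from training and evaluating the model on the validation set.

```python
from itertools import product

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

def grid_search(candidates, validation_score):
    """Try every weight combination and keep the one with the lowest score."""
    best_weights, best_score = None, float("inf")
    names = list(candidates)
    for combo in product(*(candidates[name] for name in names)):
        weights = dict(zip(names, combo))
        score = validation_score(weights)
        if score < best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

terms = {"rec": 2.0, "topo": 1.0, "smooth": 0.5, "curv": 0.25}
example = total_loss(terms, {"rec": 1.0, "topo": 0.5, "smooth": 2.0, "curv": 4.0})

def toy_validation_score(weights):
    # Stand-in for "train with these weights, then measure validation error".
    return (abs(weights["topo"] - 0.1) + abs(weights["smooth"] - 0.01)
            + abs(weights["curv"] - 0.1))

candidates = {"rec": [1.0], "topo": [0.01, 0.1, 1.0],
              "smooth": [0.01, 0.1], "curv": [0.01, 0.1]}
best_weights, best_score = grid_search(candidates, toy_validation_score)
```

Grid search costs one full training run per combination, which is the usual reason to switch to Bayesian optimization when the grid grows.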

[0067] Training employs the Adam optimizer with an initial learning rate of 1e−3. The learning rate decays at regular intervals to accelerate convergence. The batch size is set to 32 to balance memory requirements with training efficiency. The loss on the validation set is monitored throughout training, and training stops when the validation loss no longer decreases (early stopping), which prevents overfitting.
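The learning-rate schedule and the stopping rule can be sketched as below. Only the initial learning rate of 1e−3 comes from the text; the decay interval, the decay factor, and the patience value are assumptions, and both function names are hypothetical.

```python
def step_decay(initial_lr, epoch, decay_every, factor):
    """Learning rate after decaying by `factor` once every `decay_every` epochs."""
    return initial_lr * (factor ** (epoch // decay_every))

def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

lr_at_epoch_25 = step_decay(1e-3, epoch=25, decay_every=10, factor=0.5)
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73], patience=3)
```

A patience greater than one avoids stopping on a single noisy validation reading.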

[0068] In practical applications, the system may need to continuously improve reconstruction performance based on newly acquired data. This embodiment designs an online feedback mechanism to achieve adaptive optimization of the model through incremental learning.

[0069] After the system processes new multimodal input data, the validation system can evaluate and correct the reconstruction results. This correction information, along with the corresponding multimodal input data, is saved as new samples. After accumulating a certain number of samples (usually tens to hundreds), these samples are organized into a labeled dataset, triggering the incremental learning process.

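[0070] Incremental learning uses a gradient update method that adjusts parameters using only the new samples: $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} - \eta \, \nabla_{\theta} \mathcal{L}\left(f_{\theta}(x_{\mathrm{new}}), y_{\mathrm{new}}\right)$. In the formula, $\eta$ is the learning rate, $f_{\theta}$ is the GNN model with parameters $\theta$, $\theta_{\mathrm{new}}$ denotes the updated model parameters, $\theta_{\mathrm{old}}$ denotes the current model parameters, $\nabla_{\theta} \mathcal{L}$ is the gradient of the loss function, $x_{\mathrm{new}}$ is the newly input multimodal data, and $y_{\mathrm{new}}$ is the label corresponding to the new input data.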

[0071] The advantage of this incremental learning is that it only requires gradient updates for new samples, avoiding the high computational cost of full retraining. At the same time, the knowledge learned by the original model from historical data is fully preserved, and it can gradually adapt to new data distribution characteristics through incremental updates.

[0072] To prevent performance degradation caused by incremental learning, this embodiment implements a stability constraint mechanism. The updated model must meet performance tolerance requirements on the retained validation set to ensure that the new model does not significantly degrade its performance on historical data. Only updated models that meet this condition can be officially deployed. This ensures that the system maintains stable performance metrics while continuously optimizing, achieving reliable online learning.
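The update-then-gate logic can be sketched with a one-parameter toy model standing in for the GNN. The linear model, the mean squared error loss, the tolerance value, and the function names are all assumptions chosen to keep the example small; only the overall pattern (gradient step on new samples, accept only if the retained validation set does not degrade beyond a tolerance) follows the text.

```python
def mse(w, samples):
    """Mean squared error of the toy model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

def mse_gradient(w, samples):
    """Derivative of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

def incremental_update(w, new_samples, validation_samples, lr, tolerance):
    """Take one gradient step on new samples; deploy only if validation holds.

    Returns (parameter_to_deploy, accepted_flag).
    """
    candidate = w - lr * mse_gradient(w, new_samples)
    baseline = mse(w, validation_samples)
    if mse(candidate, validation_samples) <= baseline + tolerance:
        return candidate, True
    return w, False  # reject the update and keep the current model

validation = [(1.0, 2.0), (2.0, 4.0)]  # retained historical data, y = 2x
good_w, good_ok = incremental_update(1.0, [(1.0, 2.0), (3.0, 6.0)],
                                     validation, lr=0.05, tolerance=0.1)
bad_w, bad_ok = incremental_update(1.0, [(1.0, -5.0)],
                                   validation, lr=0.05, tolerance=0.1)
```

The second call shows the gate at work: a mislabeled new sample pushes the parameter in the wrong direction, validation error rises, and the current model is kept.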

[0073] The model with the parameters obtained after training has learned the features and patterns in the data while also being constrained to satisfy structural priors. It can be used in the subsequent inference stage to directly process new multimodal input data.

[0074] Furthermore, the inference and multi-format output stage is at the end of the entire workflow. It receives new multimodal input data, uses the trained model for inference, generates reconstruction results, and provides multiple output formats to meet different application needs.

[0075] Specifically, this stage includes the following. Inference phase: Given new multimodal input data, the data is processed according to the complete process described above: first, multimodal data preprocessing and registration are performed to obtain standardized multimodal data; second, a unified graph representation is constructed, integrating multimodal features and topological relationships; third, feature learning and fusion are performed by the trained GNN to generate fused node representations; finally, a decoder generates the reconstructed 3D structure. The entire inference process can be executed in real time on a CPU or GPU to process new patient data.

[0076] Multiple output formats: To meet the needs of different application scenarios, multiple output formats are provided: The first type is a volumetric rendering 3D model, which directly displays the reconstructed complete 3D structure and its internal details. This format is intuitive and easy to understand, facilitating a quick grasp of the overall picture and spatial relationships of the reconstruction results.

[0077] The second type is the extracted surface mesh model (STL / OBJ format), which supports subsequent processing such as 3D printing, finite element analysis, and virtual simulation. Users can utilize this model for various downstream analyses.

[0078] The third type is the reconstruction result superimposed on the original multimodal image. By projecting the reconstructed contours or surfaces onto the original data, users can intuitively verify the accuracy and completeness of the reconstruction.

[0079] The fourth type is the quantitative parameter report, which includes geometric parameters such as volume, surface area, center coordinates, and boundary dimensions of each reconstructed structure. These parameters can be used for quantitative comparative analysis between structures or to track changes during the reconstruction process.
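A quantitative report of this kind can be computed directly from a closed triangle mesh, as in the sketch below. The report keys, the use of the vertex mean as the center, and the function name `mesh_report` are assumptions; the volume uses the standard signed-tetrahedron sum, which requires a closed, consistently oriented mesh.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def mesh_report(vertices, triangles):
    """Volume, surface area, center, and bounding box of a closed mesh."""
    area = 0.0
    signed_volume = 0.0
    for i, j, k in triangles:
        a, b, c = vertices[i], vertices[j], vertices[k]
        normal = cross(sub(b, a), sub(c, a))
        area += 0.5 * math.sqrt(dot(normal, normal))
        # Signed volume of the tetrahedron (origin, a, b, c).
        signed_volume += dot(a, cross(b, c)) / 6.0
    xs, ys, zs = zip(*vertices)
    n = len(vertices)
    return {
        "volume": abs(signed_volume),
        "surface_area": area,
        "center": (sum(xs) / n, sum(ys) / n, sum(zs) / n),  # vertex mean
        "bounds": ((min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))),
    }

# Unit right tetrahedron with outward-facing triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
triangles = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
report = mesh_report(vertices, triangles)
```

The same vertex and triangle lists are what an STL or OBJ export would contain, so the report can be derived from the surface mesh output without extra data.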

[0080] The system outputs various forms of reconstruction results based on different needs, completing the entire end-to-end workflow.

[0081] This embodiment, through an innovative technical solution, achieves significant performance improvements and functional expansions in multiple aspects compared to existing technologies. These advantages are reflected not only in technical specifications but also in the feasibility and prospects of practical applications, specifically including: (1) Significant improvement in reconstruction accuracy and completeness: Traditional single-modal reconstruction methods are limited by the amount of information available from a single data source and often fail to accurately capture all the details of a 3D structure. This embodiment fully utilizes the complementary advantages of each modality through multimodal fusion. The high precision of the CT modality in representing bony structures, the high contrast of MRI in soft tissue regions, the ultra-high precision of optical point clouds in surface geometry, and the advantages of ultrasound in dynamic characteristics are all integrated within the unified graph representation framework of this embodiment. Based on theoretical analysis and expected experimental verification, sub-millimeter level reconstruction accuracy can be achieved, an order of magnitude improvement over the millimeter-level accuracy of single-modal methods. Simultaneously, explicit topological constraints and geometric priors ensure the biological plausibility of the reconstruction results, avoiding the implausible morphologies that may occur with purely data-driven methods.

[0082] Furthermore, the completeness of the reconstruction is significantly improved in this embodiment. The complementarity of multimodal data compensates for the information gaps in certain areas of single-modality data. For example, CT provides clear visualization of the bony ossicular chain but is weak in displaying membranous structures, whereas MRI provides high contrast in soft tissue regions. This complementarity is fully utilized within the framework of this technical solution, ensuring the completeness and comprehensiveness of the reconstruction results.

[0083] (2) Significant optimization of processing efficiency and computational cost: End-to-end deep learning methods significantly reduce processing time compared to traditional multi-step registration and fusion processes. The GNN architecture in this embodiment performs message passing on a graph structure, which is more computationally efficient than performing convolution operations on full-size 3D volume data. Graph representation avoids a large amount of redundant computation by selectively retaining key nodes. Through optimized network structure design, real-time or near real-time processing speeds can be achieved on standard GPUs, which is crucial for practical applications.

[0084] Meanwhile, the introduction of the incremental learning mechanism further optimizes the system's computational cost. When optimization is needed for new data, there is no need to retrain the entire model; only gradient updates are required, reducing training time from hours to minutes. This optimization enables the system to quickly adapt to new data characteristics in practical applications, improving its usability.

[0085] (3) System reliability and self-adaptability: Explicit topological and geometric constraints ensure the reliability of the reconstruction results. Compared to purely data-driven methods, the results of this embodiment are not only more accurate but also biologically plausible. This reliability is particularly important for practical applications, as inaccurate results can lead to errors in subsequent analyses.

[0086] Incremental learning and online optimization mechanisms provide adaptive capabilities, enabling continuous performance improvement based on new data and adapting to the characteristics of data from different sources and from different individuals. This adaptive capability allows the system to maintain good performance over long periods in practical applications without requiring frequent offline retraining. Simultaneously, stability constraint mechanisms ensure that online learning does not lead to performance degradation, guaranteeing the long-term reliability of the system.

[0087] (4) Broad application prospects and promotional value: This technical solution has a wide range of applications. In the field of medical research, accurate middle ear structure reconstruction can support research in auditory biophysics and ossicular chain biomechanics analysis. In the field of medical education, it can be used to construct virtual anatomical models to help students understand complex three-dimensional structures. In the fields of clinical diagnosis and treatment planning, accurate three-dimensional models can help medical professionals develop more precise treatment plans.

[0088] In 3D printing applications, the generated high-precision mesh models can be directly used for 3D printing, providing a foundation for the customized design of medical devices. In the field of virtual simulation, biomechanical simulation models can be built to predict the physiological responses of structures. These application areas all have significant practical value and commercial potential.

[0089] Furthermore, the multimodal fusion technology framework of this technical solution is universal and can be extended to the three-dimensional reconstruction of other complex biological structures, such as the inner ear, temporal bone, and head and neck, thus broadening the application scope and market space of this invention.

[0090] This technical solution, through innovative technical approach and system design, improves performance while fully considering the needs of practical applications, demonstrating excellent technological advancement and universal feasibility.

[0091] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for multimodal reconstruction of middle ear structure based on graph neural networks, characterized in that the method comprises: preprocessing the acquired multimodal raw middle ear data to obtain multimodal data with a consistent spatial reference frame and a uniform intensity range; converting the multimodal data into a graph structure, wherein nodes represent feature points or regions of the middle ear structure, edges encode the spatial adjacency or anatomical connection relationships between nodes, and a multi-source fusion strategy is used to construct the nodes; inputting the graph structure into a graph neural network, performing end-to-end feature learning through a message passing mechanism and a multi-head attention mechanism, and using a dynamic weighted fusion strategy to integrate multimodal features and generate fused node representations; training the graph neural network by introducing topological constraints and geometric priors as regularization terms in the loss function, wherein the topological constraints force the reconstruction results to conform to the anatomical topological relationship of the middle ear, and the geometric priors include smoothness constraints and curvature constraints; and using the trained graph neural network to perform inference on new multimodal input data and generate a three-dimensional reconstruction of the middle ear structure.

2. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 1, characterized in that preprocessing the acquired multimodal raw middle ear data includes: modality-specific denoising, spatial alignment, scale normalization, and multimodal registration; wherein anisotropic diffusion filtering is used for denoising, specifically: $\frac{\partial I}{\partial t} = \nabla \cdot \left( D(\nabla I) \, \nabla I \right)$; in the formula, $D(\nabla I)$ is the diffusion tensor adaptively adjusted according to the gradient direction, $I$ is the image intensity value, $t$ is the diffusion iteration time, and $\nabla I$ is the gradient of the image intensity; and normalization is performed using percentile truncation and histogram matching, specifically: $I_m^{\mathrm{norm}} = \mathrm{HistMatch}\left(\mathrm{Normalize}\left(I_m^{\mathrm{raw}}\right)\right)$; in the formula, $\mathrm{HistMatch}$ is the histogram matching operation, $\mathrm{Normalize}$ is percentile truncation and standard deviation normalization, $I_m^{\mathrm{norm}}$ is the normalized image data of the $m$-th modality, and $I_m^{\mathrm{raw}}$ is the raw image data of the $m$-th modality.

3. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 1, characterized in that nodes are constructed using a multi-source fusion strategy in which the multimodal attribute vector of node i is formed from the three-dimensional spatial coordinates of node i and, for each modality k, the feature output of the k-th modality encoder at the node location.

4. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 3, characterized in that, in the graph structure, edges are constructed based on a dual criterion, the dual criterion comprising: a spatial proximity criterion, under which an edge is established when the Euclidean distance between two nodes is less than a set threshold; and an anatomical relationship criterion, under which an edge is established when two nodes belong to the same anatomical structure.

5. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 1, characterized in that the dynamic weighted fusion strategy is as follows: $h_i^{\mathrm{fused}} = \sum_{k=1}^{K} \alpha_i^{(k)} h_i^{(k)}$; in the formula, $h_i^{(k)}$ is the feature representation of node $i$ learned from the $k$-th modality, $\alpha_i^{(k)}$ is the dynamically calculated fusion weight, $h_i^{\mathrm{fused}}$ is the feature representation of node $i$ after fusion, and $K$ is the total number of modalities.

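6. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 1, characterized in that the loss function is: $\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{rec}} + \lambda_2 \mathcal{L}_{\mathrm{topo}} + \lambda_3 \mathcal{L}_{\mathrm{lap}} + \lambda_4 \mathcal{L}_{\mathrm{curv}}$; in the formula, $\mathcal{L}_{\mathrm{rec}}$ is the reconstruction loss, $\mathcal{L}_{\mathrm{topo}}$ is the topology consistency loss, $\mathcal{L}_{\mathrm{lap}}$ is the Laplacian smoothing loss, $\mathcal{L}_{\mathrm{curv}}$ is the curvature constraint loss, $\mathcal{L}_{\mathrm{total}}$ is the total loss function, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are all hyperparameters.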

7. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 6, characterized in that the topology consistency loss is calculated based on the difference between the reconstructed adjacency matrix and the expected adjacency matrix, wherein the expected adjacency matrix is predefined according to the anatomical topology of the middle ear.

8. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 1, characterized in that the method further comprises online incremental learning, including: performing incremental training based on new input data and feedback results, and optimizing model parameters through gradient updates, while setting stability constraints to prevent performance degradation.

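9. The method for multimodal reconstruction of middle ear structure based on graph neural networks according to claim 8, characterized in that the online incremental learning employs a gradient update method to adjust parameters for new samples, specifically as follows: $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} - \eta \, \nabla_{\theta} \mathcal{L}\left(f_{\theta}(x_{\mathrm{new}}), y_{\mathrm{new}}\right)$; in the formula, $\eta$ is the learning rate, $f_{\theta}$ is the GNN model with parameters $\theta$, $\theta_{\mathrm{new}}$ denotes the updated model parameters, $\theta_{\mathrm{old}}$ denotes the current model parameters, $\nabla_{\theta} \mathcal{L}$ is the gradient of the loss function, $x_{\mathrm{new}}$ is the newly input multimodal data, and $y_{\mathrm{new}}$ is the label corresponding to the new input data.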