Orthodontic Treatment Monitoring Method and Device Based on Oral Scan Video

By reconstructing and segmenting tooth models using deep learning-based methods, and utilizing heterogeneous feature interaction and point cloud registration technology, the problem of low efficiency in existing orthodontic treatment monitoring is solved, and efficient and automated monitoring and adjustment of tooth posture is achieved.

CN116153530B · Active · Publication Date: 2026-06-30 · ZHEJIANG GONGSHANG UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG GONGSHANG UNIVERSITY
Filing Date: 2023-02-22
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing orthodontic treatment monitoring methods are inefficient and it is difficult to effectively utilize point cloud segmentation and point cloud registration for tooth posture monitoring through deep learning technology.

Method used

A deep learning-based approach is adopted to reconstruct a 3D jawbone model by acquiring RGB-D oral scan videos, train a 3D tooth instance segmentation model, and achieve tooth instance segmentation and pose estimation by utilizing a heterogeneous feature interaction module and point cloud registration technology.

Benefits of technology

It improves the efficiency and accuracy of orthodontic treatment monitoring, enabling real-time detection of changes in tooth posture and supporting automated orthodontic treatment adjustments.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a method and device for monitoring orthodontic treatment based on oral scan video. The method employs a deep learning-based framework to measure the degree of orthodontic correction, and can reconstruct, segment, and estimate the pose of individual teeth. First, a tooth instance segmentation model is constructed, within which a novel heterogeneous feature interaction module is built. This module extends graph attention and effectively fuses heterogeneous data by propagating information across different graphs. Then, a tooth point cloud registration model is constructed, and a quadruplet loss function is designed for the point cloud registration process to mine structural information. This function explores the relationships between mismatched points by measuring the diversity in negative samples. Finally, the predicted tooth pose is compared with the target tooth pose to check whether the orthodontic treatment result is consistent with the plan.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a method and device for monitoring orthodontic treatment based on oral cavity scanning video.

Background Technology

[0002] Orthodontic treatment involves periodically wearing braces to gradually change the position and posture of the teeth.

[0003] Currently, monitoring during orthodontic treatment relies primarily on patients' regular clinic visits for checkups. Doctors visually assess orthodontic progress at each stage of treatment. If the doctor finds that the results are not progressing as planned, the braces must be adjusted to keep the treatment on track. However, this method of monitoring is inefficient and fails to meet user needs. Recently, significant progress has been made in deep learning-based dental technologies, especially point cloud segmentation and point cloud registration. Orthodontic treatment monitoring from oral scan videos is a new direction in dental digitization, but few studies have explored the application of point cloud segmentation and point cloud registration to it.

Summary of the Invention

[0004] The purpose of this invention is to address the shortcomings of existing technologies by providing a method and device for monitoring orthodontic treatment based on oral scanning video.

[0005] The objective of this invention is achieved through the following technical solution: a method for monitoring orthodontic treatment based on oral cavity scan video, the method comprising the following steps:

[0006] S1, acquire the RGB-D oral cavity scan video of the patient at the current stage;

[0007] S2, reconstruct the oral scan video into a 3D jawbone model;

[0008] S3, based on the 3D jawbone model, train a 3D tooth instance segmentation model to obtain the instance label of each tooth in the 3D jawbone model and its corresponding 3D region;

[0009] S4. The point cloud data, including position and color information, of the 3D region to which the target tooth belongs in the current stage is registered with the point cloud data of the tooth in the 3D jawbone model before orthodontics to obtain the transformed pose of the target tooth.
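Step S2 turns RGB-D frames into a colored 3D model. The elementary operation behind any such reconstruction is back-projecting each depth pixel into 3D. The sketch below shows only that single-frame step under a pinhole camera model; the function name, the intrinsics `fx, fy, cx, cy`, and the toy frame are assumptions for illustration, and camera tracking and frame fusion are not shown.

```python
import numpy as np

def backproject_rgbd(depth, rgb, fx, fy, cx, cy):
    """Back-project one RGB-D frame into a colored point cloud (camera frame).

    depth: (H, W) depths in metres, 0 where invalid.
    rgb:   (H, W, 3) per-pixel colors.
    fx, fy, cx, cy: pinhole intrinsics (assumed known from calibration).
    Returns (N, 3) points and (N, 3) colors for the valid pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    colors = rgb[valid]
    return points, colors

# Toy 2x2 frame with one invalid pixel (hypothetical values).
depth = np.array([[0.0, 0.05], [0.05, 0.05]])
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
points, colors = backproject_rgbd(depth, rgb, fx=500.0, fy=500.0, cx=0.5, cy=0.5)
```

Each output point carries both position and color, which matches the two input modalities that the segmentation backbone consumes in S3.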

[0010] Furthermore, the 3D tooth instance segmentation model is a point cloud instance segmentation model implemented based on the SoftGroup model.

[0011] Furthermore, the 3D tooth instance segmentation model contains a heterogeneous feature interaction module, which uses graph attention to propagate local information between different graphs, so that heterogeneous features interact and the contextual feature extraction capability in instance segmentation is enhanced.

[0012] Furthermore, the 3D tooth instance segmentation model includes a backbone network, a heterogeneous feature interaction module, a foreground semantic segmentation branch, a center offset prediction branch, a feature extraction module, and a semantic segmentation module;

[0013] The color and three-dimensional coordinates of the points in the 3D jawbone model reconstructed by S2 are used as inputs to the backbone network, and multi-layer color features and multi-layer three-dimensional coordinate features are output to the heterogeneous feature interaction module respectively.

[0014] The heterogeneous feature interaction module uses the K-nearest neighbor method to construct two sets of multi-layer adjacency graphs, one from the multi-layer color features and one from the multi-layer 3D coordinate features; it uses multilayer perceptron (MLP) mapping to obtain local feature information within and between graphs in the same layer to update the point cloud features in the 3D jawbone model; it uses graph attention to aggregate local features within and between graphs in the same layer to obtain context features that integrate heterogeneous features, which are then input into the foreground semantic segmentation branch and the center offset prediction branch, respectively.

[0015] The foreground semantic segmentation branch is used to predict the tooth point cloud region on the 3D dental model, and the center offset prediction branch is used to predict the offset of each point on the 3D dental model from the center of its corresponding tooth.

[0016] The initial tooth instance segmentation result is obtained based on the outputs of the foreground semantic segmentation branch and the center offset prediction branch and then input into the feature extraction module.

[0017] The feature extraction module is used to extract point cloud features for each initial tooth instance;

[0018] The semantic segmentation module is used to improve the initial tooth instance segmentation results, obtaining the instance label of each tooth and its corresponding 3D region.
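The two branches described above combine naturally: foreground points are shifted by their predicted offsets so that points of the same tooth collapse toward one center, and the shifted points are then grouped. The following is a minimal sketch of that idea using a simple radius-based flood fill; it is not the SoftGroup implementation, and the function name, the `radius` parameter, and the toy data are assumptions.

```python
import numpy as np

def group_by_shifted_centers(points, offsets, foreground, radius):
    """Cluster foreground points after shifting each by its predicted offset.

    Returns an (N,) int array: -1 for background, otherwise an instance id.
    """
    labels = np.full(len(points), -1, dtype=int)
    shifted = points + offsets            # each point moves toward its tooth center
    idx = np.flatnonzero(foreground)      # only foreground (tooth) points are grouped
    next_id = 0
    for seed in idx:
        if labels[seed] != -1:
            continue
        labels[seed] = next_id
        stack = [seed]
        while stack:                      # flood fill over shifted points within `radius`
            cur = stack.pop()
            d = np.linalg.norm(shifted[idx] - shifted[cur], axis=1)
            for nb in idx[d < radius]:
                if labels[nb] == -1:
                    labels[nb] = next_id
                    stack.append(nb)
        next_id += 1
    return labels

# Two toy "teeth" on a line plus one background point (hypothetical values).
points = np.array([[0.0, 0, 0], [1.0, 0, 0], [10.0, 0, 0], [11.0, 0, 0], [5.0, 0, 0]])
offsets = np.array([[0.5, 0, 0], [-0.5, 0, 0], [0.5, 0, 0], [-0.5, 0, 0], [0.0, 0, 0]])
foreground = np.array([True, True, True, True, False])
labels = group_by_shifted_centers(points, offsets, foreground, radius=0.1)
```

The resulting groups play the role of the initial tooth instances that the feature extraction and refinement modules then process.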

[0019] Furthermore, the specific implementation of the heterogeneous feature interaction module is as follows:

[0020] Using the K-nearest neighbor method, a multi-layer adjacency graph $G_1(V^{1,l}, E^{1,l})$ is constructed from the multi-layer color features, and a multi-layer adjacency graph $G_2(V^{2,l}, E^{2,l})$ is constructed from the multi-layer three-dimensional coordinate features, where:

[0021] $V^{1,l}$ and $E^{1,l}$ are the sets of nodes and edges of $G_1(V^{1,l}, E^{1,l})$; $m_i^{1,l}$ denotes the i-th node in layer l of $G_1(V^{1,l}, E^{1,l})$, with corresponding feature $f_i^{1,l}$; the edge between the i-th node and the j-th node in layer l of $G_1(V^{1,l}, E^{1,l})$ has feature $f_{ij}^{1,l}$; N denotes the number of nodes, $l \in \{1, 2, \ldots, L_{max}\}$, and $L_{max}$ denotes the number of layers;

[0022] $V^{2,l}$ and $E^{2,l}$ are the sets of nodes and edges of $G_2(V^{2,l}, E^{2,l})$; the i-th node in layer l of $G_2(V^{2,l}, E^{2,l})$ has corresponding feature $f_i^{2,l}$; the edge between the i-th node and the j-th node in layer l of $G_2(V^{2,l}, E^{2,l})$ has feature $f_{ij}^{2,l}$;
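The graph construction above can be illustrated with a small K-nearest-neighbor routine applied once per modality. This is a sketch only: the function name `knn_graph` and the toy features are assumptions, and a brute-force distance matrix is used for clarity rather than efficiency.

```python
import numpy as np

def knn_graph(features, k):
    """Return an (N, k) array whose row i lists the k nearest neighbours of node i (excluding i)."""
    sq = np.sum(features ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T  # squared Euclidean distances
    np.fill_diagonal(dist2, np.inf)                                  # a node is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k]

# Four toy points: close in space in pairs (0,1) and (2,3),
# but close in color in pairs (0,2) and (1,3).
coords = np.array([[0.0, 0, 0], [1.0, 0, 0], [5.0, 0, 0], [6.0, 0, 0]])
colors = np.array([[1.0, 0, 0], [0.0, 0, 1.0], [0.9, 0, 0], [0.0, 0, 0.9]])

nbr_coord = knn_graph(coords, k=1)   # edges of the coordinate graph
nbr_color = knn_graph(colors, k=1)   # edges of the color graph
```

The two graphs connect different node pairs for the same points, which is why propagating information between them can add context that neither graph provides alone.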

[0023] The adjacency graph $G_1(V^{1,l}, E^{1,l})$ representing color features and the adjacency graph $G_2(V^{2,l}, E^{2,l})$ representing three-dimensional coordinate features are jointly analyzed as follows:

[0024] a) By using multilayer perceptron (MLP) mapping, node features are updated using local features within and between graphs in the same layer, as shown in the following formula:

[0025]

[0026]

[0027]

[0028]

[0029] Here [·] denotes the feature concatenation operation, and i, j, and k denote node indices. The four formulas above give, respectively: the color features updated using color features within the same-layer graph; the color features updated using 3D coordinate features between graphs in the same layer; the 3D coordinate features updated using 3D coordinate features within the same-layer graph; and the 3D coordinate features updated using color features between graphs in the same layer. The weight matrices of the multilayer perceptron in these formulas are learned;

[0030] b) Calculate the multidimensional attention weights within and between graphs, using the following formula:

[0031]

[0032]

[0033]

[0034]

[0035] The four formulas above give, respectively: the color attention weights obtained using color features within the same-layer graph; the color attention weights obtained using 3D coordinate features between graphs in the same layer; the 3D coordinate attention weights obtained using 3D coordinate features within the same-layer graph; and the 3D coordinate attention weights obtained using color features between graphs in the same layer. The weight matrices of the multilayer perceptron in these formulas are learned;

[0036] c) Aggregate local features using the multidimensional attention weights of intra- and inter-graph elements in layer l to obtain intra- and inter-graph context features in layer l+1, as shown in the following formula:

[0037]

[0038]

[0039]

[0040]

[0041] Here ⊙ denotes the element-wise product, and i and j denote node indices. The four formulas above give, respectively: the color context features of layer l+1 obtained using the color features of layer l; the color context features of layer l+1 obtained using the 3D coordinate features of layer l; the 3D coordinate context features of layer l+1 obtained using the 3D coordinate features of layer l; and the 3D coordinate context features of layer l+1 obtained using the color features of layer l.

[0042] d) The color and 3D coordinates are fused using intra-graph context features, inter-graph context features, and node features, respectively. The specific formulas are as follows:

[0043]

[0044]

[0045] Here $f_i^{1,l+1}$ is the color feature of the i-th node in layer l+1, obtained by updating the color feature $f_i^{1,l}$ of the i-th node in layer l with the color context features; $f_i^{2,l+1}$ is the 3D coordinate feature of the i-th node in layer l+1, obtained by updating the 3D coordinate feature $f_i^{2,l}$ of the i-th node in layer l with the 3D coordinate context features. The weight matrices of the multilayer perceptron in these formulas are learned.
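Steps a) to d) follow a familiar pattern: build edge features by concatenation and an MLP, turn them into per-channel attention weights, aggregate with an element-wise product, then fuse the result with the node feature. The formulas themselves are not reproduced in this text, so the sketch below is only one plausible reading. The single-layer ReLU "MLP", the softmax normalization over neighbors, the sharing of weights between the intra-graph and inter-graph paths, and all names (`attend`, `w_edge`, `w_att`, `w_fuse`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query_feat, source_feat, nbrs, w_edge, w_att):
    """Channel-wise attention aggregation over each node's neighbours.

    query_feat:  (N, C) features of the node being updated.
    source_feat: (N, C) features read from the neighbours (same modality for
                 the intra-graph path, the other modality for the inter-graph path).
    nbrs:        (N, k) neighbour indices.
    Returns the (N, C) context features and the (N, k, C) attention weights.
    """
    k = nbrs.shape[1]
    q = np.repeat(query_feat[:, None, :], k, axis=1)                    # (N, k, C)
    s = source_feat[nbrs]                                               # (N, k, C)
    edge = np.maximum(np.concatenate([q, s], axis=-1) @ w_edge, 0.0)    # step a): concat + MLP
    weights = softmax(edge @ w_att, axis=1)                             # step b): per-channel weights
    context = (weights * edge).sum(axis=1)                              # step c): element-wise product, sum
    return context, weights

n, c, k = 6, 4, 3
f_color = rng.normal(size=(n, c))
f_coord = rng.normal(size=(n, c))
nbr_color = rng.integers(0, n, size=(n, k))   # stand-in for the color kNN graph
nbr_coord = rng.integers(0, n, size=(n, k))   # stand-in for the coordinate kNN graph
w_edge = rng.normal(size=(2 * c, c))          # weights shared across paths for brevity
w_att = rng.normal(size=(c, c))
w_fuse = rng.normal(size=(3 * c, c))

intra, w_intra = attend(f_color, f_color, nbr_color, w_edge, w_att)   # color from color
inter, w_inter = attend(f_color, f_coord, nbr_coord, w_edge, w_att)   # color from coordinates
# Step d): fuse node feature, intra-graph context and inter-graph context.
f_color_next = np.maximum(np.concatenate([f_color, intra, inter], axis=1) @ w_fuse, 0.0)
```

The coordinate branch would be updated symmetrically, with the roles of the two modalities swapped.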

[0046] Further, in step S4, the point cloud data, including position and color information, corresponding to the 3D region to which the target tooth belongs in the current stage is used as the target point cloud T, and the point cloud data corresponding to the tooth in the pre-orthodontic 3D jawbone model is used as the source point cloud S. The specific steps for point cloud registration are as follows:

[0047] (1) Local geometric features of the source point cloud S and the target point cloud T are extracted by the registration backbone network;

[0048] Specifically, the registration backbone network can use KPFCN. With the decoding units of KPFCN removed, the registration backbone network produces downsampled versions of the source point cloud S and the target point cloud T, and extracts local geometric features with positional encoding from each of them.

[0049] (2) The transformation matrix from the source point cloud to the target point cloud is predicted by inputting the output of the registration backbone network into two cascaded TMP blocks; the specific operations in a TMP block are as follows:

[0050] (a) Non-local context features of the source point cloud S and the target point cloud T are obtained through attention layers, and information fusion between the source point cloud S and the target point cloud T is achieved;

[0051] Specifically, the local geometric features of the two point clouds are input into the self-attention layer to obtain non-local context features, and the outputs of the self-attention layer for the two point clouds are then input into the cross-attention layer, thereby achieving information fusion between the source point cloud S and the target point cloud T;

[0052] (b) Position-aware feature matching;

[0053] Specifically, based on the positional output of the attention layer, the matching matrix Z between the source point cloud S and the target point cloud T is calculated, as shown in the following formula:

[0054]

[0055] Here Z(i,j) is the matching score between point i in the source point cloud S and point j in the target point cloud T, d is the number of channels of the output feature of the attention layer, and <·,·> denotes the inner product. The two projection matrices in the formula are learned and map the features output by the attention layer; Θ is a block diagonal matrix:

[0056]

[0057] where $M_k$ is the k-th block of the block diagonal matrix,

[0058]

[0059] where the remaining symbol encodes the index within the feature channels, and x, y, z are the three-dimensional coordinates of the point being matched;

[0060] (c) Obtain the confidence matrix using the normalized exponential function (Softmax);

[0061] Specifically, the normalized exponential function Softmax is applied along the two dimensions Z(i,·) and Z(·,j) respectively to obtain the confidence matrix C, as shown in the following formula:

[0062] $C(i,j) = \mathrm{Softmax}(Z(i,\cdot))_j \cdot \mathrm{Softmax}(Z(\cdot,j))_i$

[0063] The confidence C(i,j) measures the degree of matching between the feature of point i in the source point cloud S and the feature of point j in the target point cloud T; C(i,j) is then normalized.
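The dual-Softmax step in (c) can be written in a few lines. The sketch below assumes the matching matrix Z has already been computed; the function names and the toy scores are illustrative.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(z):
    """Confidence matrix C(i, j) = Softmax(Z(i, .))_j * Softmax(Z(., j))_i."""
    row = softmax(z, axis=1)   # normalise over target points for each source point
    col = softmax(z, axis=0)   # normalise over source points for each target point
    return row * col

# Toy 2x2 matching scores where source i clearly matches target i.
z = np.array([[5.0, 0.0], [0.0, 5.0]])
conf = dual_softmax(z)
```

Multiplying the two normalizations means an entry is large only when the pair is a good match from both the source side and the target side, which suppresses one-to-many matches.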

[0064] (d) Obtain matching points to learn the transformation matrix from the source point cloud to the target point cloud;

[0065] Specifically, based on the confidence matrix C, the n matching points with the highest matching scores are selected to form the matching set $K_{soft}$. The transformation matrix includes the rigid rotation matrix R and the translation vector t;

[0066] The rigid rotation matrix R is computed from the singular value decomposition (SVD) of the matrix $H = U \Sigma V^T$, where the columns of U are the eigenvectors of $HH^T$ and the columns of V are the eigenvectors of $H^T H$. H is obtained by the following formula:

[0067]

[0068] The formula for calculating R is:

[0069] $R = U\,\mathrm{diag}(1, 1, \det(UV^T))\,V$

[0070] The formula for calculating the translation vector t is:

[0071]
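Step (d) amounts to picking the most confident correspondences and solving a weighted Procrustes problem by SVD. The sketch below uses the common Kabsch convention, $H = \sum_i w_i (s_i - \mu_s)(t_i - \mu_t)^T$ and $R = V\,\mathrm{diag}(1, 1, \det(VU^T))\,U^T$, which may order U and V differently from the patent's notation. The function names and test data are assumptions.

```python
import numpy as np

def top_matches(conf, n):
    """Indices (i, j) and confidences of the n highest entries of the confidence matrix."""
    flat = np.argsort(conf, axis=None)[::-1][:n]
    i, j = np.unravel_index(flat, conf.shape)
    return i, j, conf[i, j]

def weighted_procrustes(src, tgt, w):
    """Rigid (R, t) minimising the weighted squared error of R @ src_i + t against tgt_i."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_t = (w[:, None] * tgt).sum(axis=0)
    h = (src - mu_s).T @ (w[:, None] * (tgt - mu_t))   # 3x3 weighted cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))             # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_t - r @ mu_s
    return r, t

# Synthetic check: rotate and translate random points, then recover the transform.
rng = np.random.default_rng(0)
angle = np.deg2rad(30.0)
r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
src = rng.normal(size=(10, 3))
tgt = src @ r_true.T + t_true
conf = 0.9 * np.eye(10)                                # point i matches point i

i_idx, j_idx, w = top_matches(conf, n=8)
r_est, t_est = weighted_procrustes(src[i_idx], tgt[j_idx], w)
src_repositioned = src @ r_est.T + t_est               # repositioning, as in step (e)
```

The last line applies the estimated transform to the source points, which is the kind of repositioning performed before the second TMP block.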

[0072] (e) Repositioning;

[0073] Specifically, after the rigid rotation matrix R and translation vector t output by the first TMP block are obtained, the source point cloud features are repositioned by applying the rotation and translation, and are then input into the second TMP block. The repositioning is given by the following formula:

[0074]

[0075] Furthermore, a structure-aware quadruplet loss function is designed for the point cloud registration process, in which each sample is learned using one positive sample and two negative samples simultaneously.

[0076] The structure-aware quadruplet loss function is derived from the triplet loss function used to optimize the relationship between matched and unmatched points. The triplet loss for the pair (i,j) is given by the following formula:

[0077]

[0078] Here i denotes a point in the source point cloud and j indexes points in the target point cloud; $C(i, p_i)$ is the confidence between point i in the source point cloud and its matching point $p_i$ in the target point cloud; the confidence between point i and a non-matching point in the target point cloud is defined in the same way; the negative sample with the highest confidence is the hardest negative sample; $p_i$ denotes the matching point of point i within the target point cloud, and the j-th non-matching point of point i in the target point cloud is indexed by j;
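Hardest-negative mining on a confidence matrix can be sketched as follows. The triplet formula is not reproduced in this text, so the hinge form with a `margin` parameter is an assumption (a common choice for triplet losses), as are the function name and the toy values. The second negative sample of the quadruplet loss is not modelled here.

```python
import numpy as np

def hardest_negative_triplet(conf, gt_match, margin=0.1):
    """Hinge-style triplet loss with the hardest negative per source point.

    conf:     (N, M) confidence matrix C(i, j).
    gt_match: (N,) index p_i of the true match of source point i in the target.
    margin:   assumed hinge margin (hypothetical value).
    Returns the mean loss and the index of the hardest negative for each i.
    """
    rows = np.arange(conf.shape[0])
    pos = conf[rows, gt_match]              # C(i, p_i)
    neg = conf.copy()
    neg[rows, gt_match] = -np.inf           # exclude the positive from the search
    hardest = neg.argmax(axis=1)            # non-matching point with the highest confidence
    loss = np.maximum(0.0, margin + neg[rows, hardest] - pos)
    return loss.mean(), hardest

conf = np.array([[0.80, 0.15, 0.05],
                 [0.50, 0.40, 0.10]])
gt_match = np.array([0, 1])
loss, hardest = hardest_negative_triplet(conf, gt_match)
```

In the toy example the first row is already well separated and contributes nothing, while the second row is penalized because a wrong match has higher confidence than the true one.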

[0079] The specific implementation of the structure-aware quadruplet loss function is as follows:

[0080] The following mechanism is designed to improve the diversity and utilization of negative samples by constraining the matching relationship with information from multiple negative samples, thereby making full use of the information in the negative samples. Specifically:

[0081] The first negative sample is the hardest negative sample. The second negative sample is determined by the following constraint: its correlation with point i in the source point cloud is strong, but its correlation with the hardest negative sample is weak. The specific formula is:

[0082]

[0083] where the confidence term in the formula is the confidence between the hardest negative sample and a candidate negative sample point;

[0084] Therefore, the structure-aware quadruplet loss function is given by the following formula:

[0085]

[0086] where the two confidence terms are, respectively, the confidence between point i in the source point cloud and the first negative sample, and the confidence between point i in the source point cloud and the second negative sample;

[0087] The total loss $L_{total}$ of point cloud registration is a linear combination of the matching loss, the distortion loss, and the structure-aware quadruplet loss, given by the formula:

[0088]

[0089] where $\lambda_m$ and $\lambda_w$ balance the contributions of the matching loss and the distortion loss, respectively.

[0090] Furthermore, the formula for the matching loss is:

[0091]

[0092] Here α and γ are the default parameters of Focal loss, used to balance the large difference between the numbers of positive and negative samples; $K_{gt}$ denotes the set of true matching point pairs between the source point cloud and the target point cloud. During training, it is obtained by applying the true rigid rotation matrix R and translation vector t and collecting the nearest-neighbor pairs between the source point cloud and the target point cloud whose distance is below a threshold.
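The construction of $K_{gt}$ and a focal-style matching loss over it can be sketched as below. The exact loss formula is not reproduced in this text, so the form $-\alpha (1 - C)^{\gamma} \log C$ averaged over $K_{gt}$ is an assumption, as are the values `alpha=0.25` and `gamma=2.0` (commonly used Focal loss defaults), the function names, and the toy data.

```python
import numpy as np

def ground_truth_matches(src, tgt, r, t, threshold):
    """Pairs (i, j) where tgt[j] is the nearest target point to R @ src[i] + t
    and the distance is below `threshold`."""
    moved = src @ r.T + t
    d = np.linalg.norm(moved[:, None, :] - tgt[None, :, :], axis=2)   # (N, M) distances
    nn = d.argmin(axis=1)
    keep = d[np.arange(len(src)), nn] < threshold
    return np.flatnonzero(keep), nn[keep]

def focal_matching_loss(conf, i_idx, j_idx, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal-style loss over the confidences of the true matches (assumed form)."""
    c = np.clip(conf[i_idx, j_idx], eps, 1.0)
    return float(np.mean(-alpha * (1.0 - c) ** gamma * np.log(c)))

src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
tgt = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.01], [9.0, 9.0, 9.0]])
r_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 1.0])

i_idx, j_idx = ground_truth_matches(src, tgt, r_gt, t_gt, threshold=0.05)
loss_confident = focal_matching_loss(np.full((3, 3), 1.0), i_idx, j_idx)
loss_uncertain = focal_matching_loss(np.full((3, 3), 0.5), i_idx, j_idx)
```

The third source point has no target point within the threshold after the true transform, so it is excluded from $K_{gt}$ and does not contribute to the loss.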

[0093] Furthermore, the formula for the distortion loss is:

[0094]

[0095] where the set in the formula is the set of overlapping points in the source point cloud and the target point cloud, and the distortion function applies the true rigid rotation matrix R and translation vector t to transform the source point cloud so that it coincides with the target point cloud.
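One way to read the distortion loss is as the average displacement between overlapping source points moved by the predicted transform and the same points moved by the true transform. The formula is not reproduced in this text, so the L1 distance, the function name, and the toy inputs below are assumptions.

```python
import numpy as np

def distortion_loss(src_overlap, r_pred, t_pred, r_gt, t_gt):
    """Mean L1 distance between predicted and ground-truth positions of the overlapping points."""
    pred = src_overlap @ r_pred.T + t_pred   # points under the predicted transform
    gt = src_overlap @ r_gt.T + t_gt         # points under the true transform
    return float(np.mean(np.abs(pred - gt).sum(axis=1)))

src_overlap = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
identity = np.eye(3)
loss_exact = distortion_loss(src_overlap, identity, np.zeros(3), identity, np.zeros(3))
loss_shifted = distortion_loss(src_overlap, identity, np.array([0.1, 0.0, 0.0]), identity, np.zeros(3))
```

Unlike the matching loss, this term penalizes errors in the estimated rotation and translation directly, independent of how confident the individual correspondences are.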

[0096] The present invention also provides an orthodontic treatment monitoring device based on oral scan video, including a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it is used to implement the above-mentioned orthodontic treatment monitoring method based on oral scan video.

[0097] Compared with the prior art, the beneficial effects of this invention are as follows:

[0098] (1) The present invention designs a deep learning-based framework for measuring orthodontic treatment, which can reconstruct, segment and estimate the posture of individual teeth.

[0099] (2) The present invention constructs a new heterogeneous feature interaction module for point cloud segmentation, which extends graph attention and effectively fuses heterogeneous data by propagating information across different graphs.

[0100] (3) The present invention designs a new loss function in point cloud registration, which explores the relationship between unmatched points by measuring the diversity in negative samples.

[0101] (4) Experiments on tooth segmentation datasets and reconstructed tooth pose estimation datasets verified the effectiveness of the method of the present invention.

Attached Figure Description

[0102] Figure 1 is a schematic diagram of the overall framework provided in an embodiment of the present invention;

[0103] Figure 2 is an overall flowchart of point cloud segmentation provided in an embodiment of the present invention;

[0104] Figure 3 is a schematic diagram illustrating the implementation of the heterogeneous feature interaction module provided in an embodiment of the present invention;

[0105] Figure 4 is a flowchart of the point cloud registration process provided in an embodiment of the present invention;

[0106] Figure 5 is a diagram illustrating different contrastive loss functions provided in embodiments of the present invention;

[0107] Figure 6 is a structural diagram of the orthodontic treatment monitoring device based on oral scanning video provided in an embodiment of the present invention.

Detailed Implementation

[0108] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0109] This invention provides a method for monitoring orthodontic treatment based on oral cavity scan video, comprising the following steps:

[0110] S1, acquire the RGB-D oral cavity scan video of the patient at the current stage;

[0111] S2, The oral scan video is reconstructed into a 3D jawbone model using the Simultaneous Localization and Mapping (SLAM) method;

[0112] S3, based on the 3D jawbone model, train a 3D tooth instance segmentation model to obtain the instance label (e.g., first molar, first canine, first incisor, etc.) and its corresponding 3D region for each tooth in the 3D jawbone model;

[0113] In one implementation, the 3D tooth instance segmentation model is a point cloud instance segmentation model based on the SoftGroup model. The 3D tooth instance segmentation model incorporates a heterogeneous feature interaction module, which uses graph attention to propagate local information between different graphs, so that heterogeneous features interact and the contextual feature extraction capability in instance segmentation is enhanced. Specifically, the 3D tooth instance segmentation model includes a backbone network, a heterogeneous feature interaction module, a foreground semantic segmentation branch, a center offset prediction branch, a feature extraction module, and a semantic segmentation module.

[0114] The color and three-dimensional coordinates of the points in the 3D jawbone model reconstructed by S2 are used as inputs to the backbone network, and multi-layer color features and multi-layer three-dimensional coordinate features are output to the heterogeneous feature interaction module respectively.

[0115] The heterogeneous feature interaction module uses the K-nearest neighbor method to construct two sets of multi-layer adjacency graphs, one from the multi-layer color features and one from the multi-layer 3D coordinate features; it uses multilayer perceptron (MLP) mapping to obtain local feature information within and between graphs in the same layer to update the point cloud features in the 3D jawbone model; it uses graph attention to aggregate local features within and between graphs in the same layer to obtain context features that integrate heterogeneous features, which are then input into the foreground semantic segmentation branch and the center offset prediction branch, respectively.

[0116] The foreground semantic segmentation branch is used to predict the tooth point cloud region on the 3D dental model, and the center offset prediction branch is used to predict the offset of each point on the 3D dental model from the center of its corresponding tooth.

[0117] The initial tooth instance segmentation result is obtained based on the outputs of the foreground semantic segmentation branch and the center offset prediction branch and then input into the feature extraction module.

[0118] The feature extraction module is used to extract point cloud features for each initial tooth instance;

[0119] The semantic segmentation module is used to improve the initial tooth instance segmentation results, obtaining the instance label of each tooth and its corresponding 3D region.

[0120] In one implementation, the heterogeneous feature interaction module is specifically implemented as follows:

[0121] Using the K-nearest neighbor method, a multi-layer adjacency graph $G_1(V^{1,l}, E^{1,l})$ is constructed from the multi-layer color features, and a multi-layer adjacency graph $G_2(V^{2,l}, E^{2,l})$ is constructed from the multi-layer three-dimensional coordinate features, where:

[0122] $V^{1,l}$ and $E^{1,l}$ are the sets of nodes and edges of $G_1(V^{1,l}, E^{1,l})$; the i-th node in layer l of $G_1(V^{1,l}, E^{1,l})$ has corresponding feature $f_i^{1,l}$; the edge between the i-th node and the j-th node in layer l of $G_1(V^{1,l}, E^{1,l})$ has feature $f_{ij}^{1,l}$; N denotes the number of nodes, $l \in \{1, 2, \ldots, L_{max}\}$, and $L_{max}$ denotes the number of layers;

[0123] $V^{2,l}$ and $E^{2,l}$ are the sets of nodes and edges of $G_2(V^{2,l}, E^{2,l})$; the i-th node in layer l of $G_2(V^{2,l}, E^{2,l})$ has corresponding feature $f_i^{2,l}$; the edge between the i-th node and the j-th node in layer l of $G_2(V^{2,l}, E^{2,l})$ has feature $f_{ij}^{2,l}$;

[0124] The adjacency graph $G_1(V^{1,l}, E^{1,l})$ representing color features and the adjacency graph $G_2(V^{2,l}, E^{2,l})$ representing three-dimensional coordinate features are jointly analyzed as follows:

[0125] a) By using multilayer perceptron (MLP) mapping, node features are updated using local features within and between graphs in the same layer, as shown in the following formula:

[0126]

[0127]

[0128]

[0129]

[0130] Here [·] denotes the feature concatenation operation, and i, j, and k denote node indices. The four formulas above give, respectively: the color features updated using color features within the same-layer graph; the color features updated using 3D coordinate features between graphs in the same layer; the 3D coordinate features updated using 3D coordinate features within the same-layer graph; and the 3D coordinate features updated using color features between graphs in the same layer. The weight matrices of the multilayer perceptron in these formulas are learned;

[0131] b) Calculate the multidimensional attention weights within and between graphs, using the following formula:

[0132]

[0133]

[0134]

[0135]

[0136] The four formulas above give, respectively: the color attention weights obtained using color features within the same-layer graph; the color attention weights obtained using 3D coordinate features between graphs in the same layer; the 3D coordinate attention weights obtained using 3D coordinate features within the same-layer graph; and the 3D coordinate attention weights obtained using color features between graphs in the same layer. The weight matrices of the multilayer perceptron in these formulas are learned;

[0137] c) Aggregate local features using the multidimensional attention weights of intra- and inter-graph elements in layer l to obtain intra- and inter-graph context features in layer l+1, as shown in the following formula:

[0138]

[0139]

[0140]

[0141]

[0142] Here ⊙ denotes the element-wise product, and i and j denote node indices. The four formulas above give, respectively: the color context features of layer l+1 obtained using the color features of layer l; the color context features of layer l+1 obtained using the 3D coordinate features of layer l; the 3D coordinate context features of layer l+1 obtained using the 3D coordinate features of layer l; and the 3D coordinate context features of layer l+1 obtained using the color features of layer l.

d) The color and 3D coordinates are fused using intra-graph context features, inter-graph context features, and node features, respectively, with the specific formulas as follows:

[0143]

[0144]

[0145] Here $f_i^{1,l+1}$ is the color feature of the i-th node in layer l+1, obtained by updating the color feature $f_i^{1,l}$ of the i-th node in layer l with the color context features; $f_i^{2,l+1}$ is the 3D coordinate feature of the i-th node in layer l+1, obtained by updating the 3D coordinate feature $f_i^{2,l}$ of the i-th node in layer l with the 3D coordinate context features. The weight matrices of the multilayer perceptron in these formulas are learned.

[0146] S4, the point cloud data, including position and color information, corresponding to the 3D region to which the target tooth belongs in the current stage is registered with the point cloud data corresponding to the tooth in the pre-orthodontic 3D jawbone model to obtain the transformed pose of the target tooth. In one embodiment, the point cloud data, including position and color information, corresponding to the 3D region to which the target tooth belongs in the current stage is used as the target point cloud T, and the point cloud data corresponding to the tooth in the pre-orthodontic 3D jawbone model is used as the source point cloud S. The specific steps for point cloud registration are as follows:

[0147] (1) Local geometric features of the source point cloud S and the target point cloud T are extracted by the registration backbone network;

[0148] Specifically, the registration backbone network can use KPFCN. With the decoding units of KPFCN removed, the registration backbone network produces downsampled versions of the source point cloud S and the target point cloud T, and extracts local geometric features with positional encoding from each of them.

[0149] (2) The transformation matrix from the source point cloud to the target point cloud is predicted by inputting the output of the registration backbone network into two cascaded TMP blocks; the specific operations in a TMP block are as follows:

[0150] (a) Non-local context features of the source point cloud S and the target point cloud T are obtained through attention layers, and information fusion between the source point cloud S and the target point cloud T is achieved;

[0151] Specifically, the local geometric features of the two point clouds are input into the self-attention layer to obtain non-local context features, and the outputs of the self-attention layer for the two point clouds are then input into the cross-attention layer, thereby achieving information fusion between the source point cloud S and the target point cloud T;

[0152] (b) Position-aware feature matching;

[0153] Specifically, based on the positional output of the attention layer, the matching matrix Z between the source point cloud S and the target point cloud T is calculated, as shown in the following formula:

[0154]

[0155] where Z(i,j) is the matching score between point i in the source point cloud S and point j in the target point cloud T, d is the number of channels of the features output by the attention layer, and <·,·> denotes the inner product. The formula also contains two learnable projection matrices, which map the features output by the attention layer, and Θ, a block diagonal matrix:

[0156]

[0157] where $M_k$ is the k-th block of the block diagonal matrix:

[0158]

[0159] where the index of the feature channel is encoded in the block, and x, y, z are the three-dimensional coordinates of the point being matched;
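
Since the formulas for Z, Θ, and the blocks are not reproduced above, the following is only a sketch of one common way to realize such a position-aware matching score. It assumes a rotary-style encoding in which each block rotates three channel pairs by angles proportional to x, y, and z, a frequency of `base ** (-6k/d)` for block k, and a `1/sqrt(d)` scaling. All of these choices, and the names `rotary_encode` and `matching_scores`, are assumptions rather than the patented formula.

```python
import numpy as np

def rotary_encode(feats, coords, base=10000.0):
    # feats: (N, d) with d divisible by 6; coords: (N, 3) holding x, y, z.
    n, d = feats.shape
    if d % 6 != 0:
        raise ValueError("feature dimension must be divisible by 6")
    k = np.arange(d // 6)
    theta = base ** (-6.0 * k / d)                 # assumed per-block frequency
    # Block k rotates three channel pairs by x*theta_k, y*theta_k, z*theta_k.
    angles = (coords[:, None, :] * theta[None, :, None]).reshape(n, d // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    pairs = feats.reshape(n, d // 2, 2)
    a, b = pairs[..., 0], pairs[..., 1]
    rotated = np.stack([a * cos - b * sin, a * sin + b * cos], axis=-1)
    return rotated.reshape(n, d)

def matching_scores(feat_s, pos_s, feat_t, pos_t, w_s, w_t):
    # w_s and w_t stand in for the two learnable projection matrices.
    d = feat_s.shape[1]
    q = rotary_encode(feat_s @ w_s.T, pos_s)
    k = rotary_encode(feat_t @ w_t.T, pos_t)
    return q @ k.T / np.sqrt(d)                    # Z has shape (|S|, |T|)
```

With this kind of encoding the score depends on the positions only through their difference, so shifting both point clouds by the same offset leaves Z unchanged.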

[0160] (c) Obtain the confidence matrix using the normalized exponential function (Softmax);

[0161] Specifically, the normalized exponential function Softmax is applied to the two dimensions Z(i,·) and Z(·,j) respectively to obtain the confidence matrix C, as shown in the following formula:

[0162] $C(i,j)=\mathrm{Softmax}(Z(i,\cdot))_{j}\cdot\mathrm{Softmax}(Z(\cdot,j))_{i}$

[0163] where the confidence C(i,j) measures the degree of matching between the feature of point i in the source point cloud S and the feature of point j in the target point cloud T; after normalization, C(i,j) is referred to as the normalized confidence;
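
The dual Softmax of step (c) is short enough to write out directly. The sketch below assumes Z is a dense NumPy array of shape (|S|, |T|); the function name `confidence_matrix` is hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_matrix(z):
    # axis=1 normalizes each row Z(i, .); axis=0 normalizes each column Z(., j).
    # The element-wise product is large only when i and j prefer each other.
    return softmax(z, axis=1) * softmax(z, axis=0)
```

Because both factors lie in (0, 1), every entry of C lies in (0, 1), and an entry approaches 1 only when point j is the best match for point i and point i is also the best match for point j.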

[0164] (d) Obtain matching points to learn the transformation matrix from the source point cloud to the target point cloud;

[0165] Specifically, based on the confidence matrix C, the n matching points with the highest matching scores are selected to form the matching set $K_{soft}$. The transformation matrix consists of the rigid rotation matrix R and the translation vector t;

[0166] The rigid rotation matrix R is computed from the singular value decomposition (SVD) $H = U\Sigma V^{T}$ of the matrix H, where the columns of U are the eigenvectors of $HH^{T}$ and the columns of V are the eigenvectors of $H^{T}H$. H is obtained by the following formula:

[0167]

[0168] The formula for calculating R is:

[0169] $R = U\,\mathrm{diag}(1,1,\det(UV^{T}))\,V^{T}$

[0170] The formula for calculating the translation vector t is:

[0171]
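
Step (d) can be illustrated with a confidence-weighted rigid fit. The formulas for H and t are not reproduced above, so the sketch below assumes the standard weighted Kabsch construction: H is the weighted cross-covariance of the centered target and source points, chosen so that the rotation takes the form $R = U\,\mathrm{diag}(1,1,\det(UV^{T}))\,V^{T}$, and t maps the weighted source centroid onto the weighted target centroid. The names `top_n_matches` and `weighted_rigid_fit` are hypothetical.

```python
import numpy as np

def top_n_matches(conf, n):
    # Indices and confidences of the n highest-scoring (source, target) pairs.
    n = min(n, conf.size)
    flat = np.argsort(conf, axis=None)[::-1][:n]
    src_idx, tgt_idx = np.unravel_index(flat, conf.shape)
    return src_idx, tgt_idx, conf[src_idx, tgt_idx]

def weighted_rigid_fit(src, tgt, w):
    # src, tgt: (n, 3) matched points; w: (n,) non-negative confidences.
    w = w / w.sum()
    src_c = (w[:, None] * src).sum(axis=0)         # weighted source centroid
    tgt_c = (w[:, None] * tgt).sum(axis=0)         # weighted target centroid
    # Assumed H = sum_k w_k (t_k - t_c)(s_k - s_c)^T, a 3x3 matrix.
    h = (tgt - tgt_c).T @ (w[:, None] * (src - src_c))
    u, _, vt = np.linalg.svd(h)                    # h = u @ diag(s) @ vt
    # The det term flips the last axis if needed so that R is a proper rotation.
    r = u @ np.diag([1.0, 1.0, np.linalg.det(u @ vt)]) @ vt
    t = tgt_c - r @ src_c
    return r, t
```

A typical call would pass `src[src_idx]`, `tgt[tgt_idx]`, and the returned confidences as weights, so that low-confidence matches contribute less to the estimated pose.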

[0172] (e) Repositioning;

[0173] Specifically, after the first TMP block outputs the rigid rotation matrix R and the translation vector t, the source point cloud features are repositioned by applying this rotation and translation and are then input into the second TMP block. The repositioning is performed using the following formula:

[0174]
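
The repositioning formula is not reproduced above. As a rough illustration, the sketch below assumes that repositioning applies the rigid transform p -> Rp + t to the source coordinates, and shows how the estimates of the two cascaded TMP blocks would compose into one overall pose. The helper names `reposition`, `compose`, and `rot_z` are hypothetical.

```python
import numpy as np

def reposition(points, r, t):
    # Apply a rigid transform to an (N, 3) array of points: p -> R p + t.
    return points @ r.T + t

def compose(r1, t1, r2, t2):
    # Single transform equivalent to applying (r1, t1) first, then (r2, t2).
    return r2 @ r1, r2 @ t1 + t2

def rot_z(angle):
    # Rotation about the z axis, used here only to build a small example.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

If the first block returns (R1, t1) and the second block, working on the repositioned source, returns (R2, t2), then `compose(R1, t1, R2, t2)` gives the overall transform from the original source point cloud to the target point cloud.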

[0175] This invention designs a structure-aware quaternary loss function during the point cloud registration process, where each sample is learned using one positive sample and two negative samples simultaneously.

[0176] The structure-aware quaternary loss function is derived from the ternary (triplet) loss function, which is used to optimize the relationship between matched and unmatched points. The ternary loss function for a pair (i,j) is given by the following formula:

[0177]

[0178] where i denotes a point in the source point cloud and j indexes points in the target point cloud; $C(i,p_i)$ is the confidence between point i in the source point cloud and its matching point $p_i$ in the target point cloud; the negative term is the confidence between point i in the source point cloud and a non-matching point in the target point cloud, and the negative sample with the highest confidence is the most difficult (hardest) negative sample; $p_i$ denotes the matching point of point i in the target point cloud, and the non-matching points of point i in the target point cloud are indexed by j;
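
Because the ternary loss formula itself is not reproduced above, the following is a sketch of a conventional hinge-style triplet loss with hardest-negative mining on a confidence matrix. The hinge form, the `margin` value, and the function name are assumptions made for illustration only.

```python
import numpy as np

def hardest_negative_triplet_loss(conf, pos_idx, margin=0.1):
    # conf: (|S|, |T|) confidences; pos_idx[i] is the target index matching source i.
    pos_idx = np.asarray(pos_idx)
    rows = np.arange(conf.shape[0])
    pos = conf[rows, pos_idx]                      # C(i, p_i)
    neg = conf.astype(float)                       # astype returns a copy
    neg[rows, pos_idx] = -np.inf                   # exclude the true match
    hardest = neg.max(axis=1)                      # highest-confidence non-match
    # Penalize rows whose hardest negative comes within `margin` of the positive.
    return np.maximum(0.0, margin + hardest - pos).mean()
```

Since confidence is a similarity rather than a distance, the loss pushes the confidence of the true match above that of the hardest negative, rather than pulling a distance below it.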

[0179] The specific implementation of the structure-aware quaternary loss function is as follows:

[0180] The following mechanism is designed to effectively improve the diversity and utilization of negative samples by constraining the matching relationship with information from multiple negative samples, thereby making full use of the information in the negative samples. Specifically:

[0181] The first negative sample is the most difficult negative sample. The second negative sample is determined by the following constraint: its correlation with point i in the source point cloud is strong, but its correlation with the most difficult negative sample is very weak. The specific formula is:

[0182]

[0183] where the additional term is the confidence between the most difficult negative sample and a candidate negative sample point;

[0184] Therefore, the specific formula of the structure-aware quaternary loss function is:

[0185]

[0186] where the two negative terms are, respectively, the confidence between point i in the source point cloud and the first negative sample, and the confidence between point i in the source point cloud and the second negative sample;
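
The selection rule and loss formula are not reproduced above, so the sketch below is only one possible reading of the description. It assumes that the second negative maximizes "confidence with point i minus similarity to the hardest negative", that the similarity between target points is supplied as a separate matrix `tgt_sim`, and that the loss is a sum of two hinge terms with margins `m1` and `m2`. Every one of these choices, and the function name, is hypothetical.

```python
import numpy as np

def structure_aware_quadruplet_loss(conf, tgt_sim, pos_idx, m1=0.1, m2=0.05):
    # conf: (|S|, |T|) confidences; tgt_sim: (|T|, |T|) target-to-target similarity.
    # Requires at least three target points so that two negatives exist.
    pos_idx = np.asarray(pos_idx)
    rows = np.arange(conf.shape[0])
    pos = conf[rows, pos_idx]
    neg = conf.astype(float)                       # astype returns a copy
    neg[rows, pos_idx] = -np.inf                   # mask the positive
    first = neg.argmax(axis=1)                     # hardest negative per source point
    # Second negative: confident with i, but dissimilar to the hardest negative.
    score = neg - tgt_sim[first]
    score[rows, first] = -np.inf
    second = score.argmax(axis=1)
    c1, c2 = conf[rows, first], conf[rows, second]
    loss = np.maximum(0.0, m1 + c1 - pos) + np.maximum(0.0, m2 + c2 - pos)
    return loss.mean(), first, second
```

In this reading the second negative adds diversity: a candidate that is nearly identical to the hardest negative is skipped in favor of one that carries different structural information.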

[0187] The total loss $L_{total}$ of point cloud registration is a linear combination of the matching loss, the distortion loss, and the structure-aware quaternary loss, given by the formula:

[0188]

[0189] where $\lambda_m$ and $\lambda_w$ are weights that balance the contributions of the matching loss and the distortion loss, respectively.

[0190] The formula for the matching loss is:

[0191]

[0192] Here, α and γ are the standard Focal loss parameters, typically set to α = 0.25 and γ = 2, which offset the large imbalance between the numbers of positive and negative samples; $K_{gt}$ is the set of true matching points between the source point cloud and the target point cloud. During training, $K_{gt}$ is obtained by applying the true rigid rotation matrix R and translation vector t and collecting the nearest-neighbor pairs between the source point cloud and the target point cloud whose distance is below the distance threshold.
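
The matching loss formula is not reproduced above. The sketch below assumes the standard binary Focal loss applied to the confidence matrix, with pairs in $K_{gt}$ treated as positives and all other entries as negatives. Whether the patented formula also sums over negatives, and how it is normalized, are not stated, so this form and the name `focal_matching_loss` are assumptions.

```python
import numpy as np

def focal_matching_loss(conf, gt_pairs, alpha=0.25, gamma=2.0, eps=1e-6):
    # conf: (|S|, |T|) confidences in [0, 1]; gt_pairs: iterable of (i, j) in K_gt.
    conf = np.clip(conf, eps, 1.0 - eps)           # keep the logarithms finite
    mask = np.zeros(conf.shape, dtype=bool)
    for i, j in gt_pairs:
        mask[i, j] = True
    if not mask.any() or mask.all():
        raise ValueError("need at least one positive and one negative entry")
    # Positives: down-weight matches that are already confident.
    pos = -alpha * (1.0 - conf[mask]) ** gamma * np.log(conf[mask])
    # Negatives: down-weight non-matches that already have low confidence.
    neg = -(1.0 - alpha) * conf[~mask] ** gamma * np.log(1.0 - conf[~mask])
    return pos.mean() + neg.mean()
```

The factor raised to the power γ shrinks the contribution of easy examples, which matters here because non-matching pairs vastly outnumber true matches in a confidence matrix.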

[0193] The formula for the distortion loss is:

[0194]

[0195] where the formula uses the set of overlapping points between the source point cloud and the target point cloud, together with a distortion function that applies the true rigid rotation matrix R and translation vector t to transform the source point cloud so that it coincides with the target point cloud.
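
As a rough illustration of how the loss terms fit together, the sketch below assumes that the distortion loss is the mean L1 distance between the overlapping source points warped by the predicted transform and by the true transform, and that the quaternary loss enters the total with unit weight. Neither formula is reproduced above, so both forms, the default weights, and the function names are assumptions.

```python
import numpy as np

def distortion_loss(src_overlap, r_pred, t_pred, r_true, t_true):
    # src_overlap: (N, 3) source points that lie in the overlap region.
    warped_pred = src_overlap @ r_pred.T + t_pred
    warped_true = src_overlap @ r_true.T + t_true
    # Assumed form: mean L1 distance between the two warped point sets.
    return np.abs(warped_pred - warped_true).sum(axis=1).mean()

def total_loss(l_match, l_distort, l_quad, lam_m=1.0, lam_w=1.0):
    # lam_m and lam_w weight the matching and distortion losses respectively;
    # the quaternary loss is assumed to enter with unit weight.
    return lam_m * l_match + lam_w * l_distort + l_quad
```

The distortion term is zero exactly when the predicted transform moves every overlapping point to the same place as the true transform.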

[0196] The orthodontic treatment monitoring method provided in this embodiment can accurately detect the changing posture of current teeth compared to historical teeth, thereby determining the current orthodontic effect. The process of monitoring teeth in intraoral RGB-D video data using this embodiment includes two parts: training and testing. The orthodontic monitoring method used in this embodiment is described below with reference to the accompanying drawings.

[0197] Figure 1 is a schematic diagram of the overall framework provided by an embodiment of the present invention; it includes 3D reconstruction, tooth instance segmentation, and tooth pose estimation. In the diagram, $R_h$ and $t_h$ denote the rotation and translation of tooth h.

[0198] Figure 2 is an overall flowchart of point cloud segmentation provided in an embodiment of the present invention.

[0199] The testing method in this embodiment is as follows: given a test 3D jawbone model, input it into the trained tooth instance segmentation model and perform a forward propagation to obtain the test results based on the tooth instance segmentation model proposed in this embodiment.

[0200] Figure 3 is a schematic diagram of the implementation of the heterogeneous feature interaction module provided in an embodiment of the present invention. For simplicity, only the part that propagates information from graphs $G_1$ and $G_2$ to graph $G_1$ is visualized, as marked by the dashed box. The figure also marks feature concatenation; SIM stands for similarity measurement, and MLP stands for multilayer perceptron.

[0201] Figure 4 is a flowchart of the point cloud registration process provided in an embodiment of the present invention.

[0202] Figure 5 is a schematic diagram illustrating the different contrastive loss functions provided in embodiments of the present invention, where (a) shows the ternary loss, (b) the quaternary loss, and (c) the structure-aware quaternary loss. '+' denotes a positive point, '-' denotes a negative point, and 'P' is the anchor point. Dashed arrows denote the structural relationships between negative points.

[0203] The results show that the method proposed in this invention is more competitive than other advanced point cloud registration methods and achieves the purpose of orthodontic monitoring.

[0204] Corresponding to the aforementioned embodiments of the orthodontic treatment monitoring method based on oral scan video, the present invention also provides embodiments of an orthodontic treatment monitoring device based on oral scan video.

[0205] Referring to Figure 6, the present invention provides an orthodontic treatment monitoring device based on oral scan video, including a memory and one or more processors. The memory stores executable code, and the processors, when executing the executable code, implement the orthodontic treatment monitoring method based on oral scan video of the above embodiments.

[0206] The embodiments of the orthodontic treatment monitoring device based on oral scan video of the present invention can be applied to any device with data processing capabilities, such as a computer. The device embodiments can be implemented in software, in hardware, or in a combination of both. Taking software implementation as an example, the device in the logical sense is formed when the processor of the data processing device on which it resides loads the corresponding computer program instructions from non-volatile memory into memory and executes them. At the hardware level, Figure 6 shows a hardware structure diagram of the data processing device on which the orthodontic treatment monitoring device based on oral scan video of the present invention resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 6, the data processing device may include other hardware depending on its actual function, which is not described in detail here.

[0207] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0208] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0209] This invention also provides a computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the orthodontic treatment monitoring method based on oral scan video described in the above embodiments.

[0210] The computer-readable storage medium can be an internal storage unit of any data processing device as described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of any data processing device, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc., equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units and external storage devices of any data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.

[0211] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

[0212] The specific embodiments described above illustrate the technical solution and beneficial effects of the present invention in detail. It should be understood that the above description is only the most preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention should be included within the protection scope of the present invention.

Claims

1. An orthodontic treatment monitoring method based on oral scan video, characterized in that it includes:
S1, acquiring the RGB-D oral scan video of the patient at the current stage;
S2, reconstructing the oral scan video into a 3D jawbone model;
S3, based on the 3D jawbone model, training a 3D tooth instance segmentation model to obtain the instance label of each tooth in the 3D jawbone model and its corresponding 3D region;
S4, registering the point cloud data, including position and color information, corresponding to the 3D region of the target tooth in the current stage with the point cloud data of the corresponding tooth in the pre-orthodontic 3D jawbone model to obtain the transformed pose of the target tooth; specifically, the point cloud data, including position and color information, corresponding to the 3D region of the target tooth in the current stage is used as the target point cloud T, and the point cloud data of the corresponding tooth in the pre-orthodontic 3D jawbone model is used as the source point cloud S.
The specific steps for point cloud registration are as follows:
(1) Local geometric features of the source point cloud S and the target point cloud T are extracted by the registration backbone network; specifically, the registration backbone network adopts KPFCN, and by deleting the decoding units in KPFCN, the registration backbone network obtains downsampled versions of the source point cloud S and the target point cloud T and extracts local geometric features with positional encoding from each of them;
(2) The transformation matrix from the source point cloud to the target point cloud is predicted by inputting the output of the registration backbone network into two cascaded TMP blocks; the specific operations in each TMP block are as follows:
(a) Non-local context features of the source point cloud S and the target point cloud T are obtained through attention layers, and information fusion between the source point cloud S and the target point cloud T is achieved;
(b) Location-aware feature matching, including: based on the positional output of the attention layer, calculating the matching matrix Z between the source point cloud S and the target point cloud T, where Z(i,j) is the matching score between point i in the source point cloud S and point j in the target point cloud T, d is the number of channels of the features output by the attention layer, <·,·> denotes the inner product, two learnable projection matrices map the features output by the attention layer, and Θ is a block diagonal matrix whose k-th block is $M_k$, in which the index of the feature channel is encoded and x, y, z are the three-dimensional coordinates of the point being matched;
(c) Obtaining the confidence matrix using the normalized exponential function, including: applying the normalized exponential function Softmax to the two dimensions Z(i,·) and Z(·,j) respectively to obtain the confidence matrix C, as shown in the following formula: $C(i,j)=\mathrm{Softmax}(Z(i,\cdot))_{j}\cdot\mathrm{Softmax}(Z(\cdot,j))_{i}$, where the confidence C(i,j) measures the degree of matching between the feature of point i in the source point cloud S and the feature of point j in the target point cloud T; after normalization, C(i,j) is referred to as the normalized confidence;
(d) Obtaining matching points to learn the transformation matrix from the source point cloud to the target point cloud, including: based on the confidence matrix C, selecting the n matching points with the highest matching scores to form a matching set; the transformation matrix consists of the rigid rotation matrix R and the translation vector t; the rigid rotation matrix R is computed from the singular value decomposition $H = U\Sigma V^{T}$ of the matrix H, where the columns of U are the eigenvectors of $HH^{T}$ and the columns of V are the eigenvectors of $H^{T}H$; R is calculated as $R = U\,\mathrm{diag}(1,1,\det(UV^{T}))\,V^{T}$; the matrix H and the translation vector t are each calculated by their respective formulas;
(e) After the first TMP block outputs the rigid rotation matrix R and the translation vector t, the source point cloud features are repositioned by applying this rotation and translation and are then input into the second TMP block.

2. The method according to claim 1, characterized in that the 3D tooth instance segmentation model is a point cloud instance segmentation model implemented based on the SoftGroup model.

3. The method according to claim 2, characterized in that the 3D tooth instance segmentation model includes a heterogeneous feature interaction module, which uses graph attention to propagate local information between different graphs, so that heterogeneous features interact and the extraction of contextual features in instance segmentation is enhanced.

4. The method according to claim 3, characterized in that the 3D tooth instance segmentation model includes a backbone network, a heterogeneous feature interaction module, a foreground semantic segmentation branch, a center offset prediction branch, a feature extraction module, and a semantic segmentation module.
The color and three-dimensional coordinates of the points in the 3D jawbone model reconstructed in S2 are used as inputs to the backbone network, which outputs multi-layer color features and multi-layer three-dimensional coordinate features to the heterogeneous feature interaction module.
The heterogeneous feature interaction module uses the K-nearest neighbor method to construct two multi-layer adjacency graph sets, one from the multi-layer color features and one from the multi-layer 3D coordinate features; it uses multilayer perceptron mapping to obtain local feature information within and between graphs in the same layer and thereby update the point cloud features of the 3D jawbone model; and it uses graph attention to aggregate local features within and between graphs in the same layer to obtain context features that integrate the heterogeneous features, which are then input into the foreground semantic segmentation branch and the center offset prediction branch.
The foreground semantic segmentation branch is used to predict the tooth point cloud region on the 3D jawbone model, and the center offset prediction branch is used to predict the offset of each point on the 3D jawbone model from the center of its corresponding tooth. The initial tooth instance segmentation result is obtained from the outputs of the foreground semantic segmentation branch and the center offset prediction branch and is then input into the feature extraction module.
The feature extraction module is used to extract point cloud features for each initial tooth instance; the semantic segmentation module is used to refine the initial tooth instance segmentation result, obtaining the instance label of each tooth and its corresponding 3D region.

5. The method according to claim 4, characterized in that the heterogeneous feature interaction module is implemented as follows:
Using the K-nearest neighbor method, one multi-layer adjacency graph is constructed from the multi-layer color features and another multi-layer adjacency graph is constructed from the multi-layer 3D coordinate features. For each of the two adjacency graphs, a pair of matrices represents its set of nodes and its set of edges; the i-th node in each layer has a node feature, the edge between the i-th node and the j-th node in each layer has an edge feature, and the number of nodes and the number of layers each have their own symbol. The adjacency graph of color features and the adjacency graph of 3D coordinate features are analyzed jointly as follows:
a) Using multilayer perceptron mapping, node features are updated with local features within and between graphs in the same layer. The formula uses a feature concatenation operation and node indices, and produces four updates: color features updated using color features within the graph of the same layer; color features updated using 3D coordinate features between graphs in the same layer; 3D coordinate features updated using 3D coordinate features within the graph of the same layer; and 3D coordinate features updated using color features between graphs in the same layer. Four weight matrices of the multilayer perceptrons are learned;
b) The multidimensional attention weights within and between graphs are calculated, giving four sets of weights: color attention weights obtained using color features within the graph of the same layer; color attention weights obtained using 3D coordinate features between graphs in the same layer; 3D coordinate attention weights obtained using 3D coordinate features within the graph of the same layer; and 3D coordinate attention weights obtained using color features between graphs in the same layer. Four weight matrices of the multilayer perceptrons are learned;
c) Using the multidimensional attention weights within and between graphs of a given layer, local features are aggregated to obtain the intra-graph context features and inter-graph context features of that layer. The formula uses the element-wise product and node indices, and yields: the color context features of the layer obtained from the color features of the layer; the color context features of the layer obtained from the 3D coordinate features of the layer; the 3D coordinate context features of the layer obtained from the 3D coordinate features of the layer; and the 3D coordinate context features of the layer obtained from the color features of the layer;
d) The intra-graph context features, inter-graph context features, and node features are fused separately for color and for 3D coordinates. The color feature of each node in a layer is combined with the two color context features to give the updated color context feature of that layer; the 3D coordinate feature of each node in a layer is combined with the two 3D coordinate context features to give the updated 3D coordinate context feature of that layer. Two weight matrices of the multilayer perceptrons are learned.

6. The method according to claim 1, characterized in that a structure-aware quaternary loss function is designed for the point cloud registration process, and each sample is learned using one positive sample and two negative samples simultaneously.
The structure-aware quaternary loss function is an extension of the ternary loss function, which is used to optimize the relationship between matched and unmatched points. In the ternary loss function, i denotes a point in the source point cloud and j indexes points in the target point cloud; the positive term is the confidence between point i in the source point cloud and its matching point in the target point cloud; the negative term is the confidence between point i in the source point cloud and a non-matching point in the target point cloud; and the negative sample with the highest confidence is the most difficult negative sample.
The structure-aware quaternary loss function is implemented as follows: a mechanism is designed that constrains the matching relationship with information from multiple negative samples, thereby improving the diversity and utilization of negative samples and making full use of the information they carry. Specifically, the first negative sample is the most difficult negative sample. The second negative sample is determined by the following constraint: its correlation with point i in the source point cloud is strong, but its correlation with the most difficult negative sample is very weak; the constraint uses the confidence between the most difficult negative sample and a candidate negative sample point. The structure-aware quaternary loss function uses the confidence between point i in the source point cloud and the first negative sample and the confidence between point i in the source point cloud and the second negative sample.
The total loss of point cloud registration is a linear combination of the matching loss, the distortion loss, and the structure-aware quaternary loss, in which two weights balance the contributions of the matching loss and the distortion loss, respectively.

7. The method according to claim 6, characterized in that the matching loss is a Focal loss whose default parameters α and γ offset the large imbalance between the numbers of positive and negative samples; the set of true matching points between the source point cloud and the target point cloud is obtained during training by applying the true rigid rotation matrix R and translation vector t and collecting the nearest-neighbor pairs between the source point cloud and the target point cloud whose distance is below the distance threshold.

8. The method according to claim 6, characterized in that the distortion loss is computed over the set of overlapping points between the source point cloud and the target point cloud, using a distortion function that applies the true rigid rotation matrix R and translation vector t to transform the source point cloud so that it coincides with the target point cloud.

9. An orthodontic treatment monitoring device based on oral scan video, comprising a memory and one or more processors, wherein the memory stores executable code, characterized in that when the processors execute the executable code, they implement the orthodontic treatment monitoring method based on oral scan video as described in any one of claims 1-8.