Diffusion model-based data augmentation method for micro-quenching structure of titanium alloy
By using a data augmentation method based on a diffusion model and employing metallographic mask images as strong prior constraints, microscopic quenching microstructure images of titanium alloys that conform to metallurgical principles are generated. This solves the problems of insufficient data and inadequate model generalization ability in existing technologies, and achieves high-quality microstructure image generation.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: 西部超导材料科技股份有限公司 (Western Superconducting Technologies Co., Ltd.)
- Filing Date: 2026-03-03
- Publication Date: 2026-06-12
AI Technical Summary
Existing technologies face challenges in the automatic analysis of micro-quenched microstructure of titanium alloys due to insufficient data scale and diversity, resulting in inadequate model generalization ability. Furthermore, the scarcity of high-quality labeled data and the high threshold for sample preparation hinder model iterative optimization, and traditional data augmentation methods cannot generate new samples that conform to the laws of metallurgy.
A data augmentation method based on a diffusion model is adopted. By using the ControlNet+Stable Diffusion generation network and metallographic mask images as strong prior constraints, physically consistent micro-quenching microstructure images of titanium alloys are generated. The method includes a variational autoencoder module, a ControlNet branch module, and a latent space diffusion and noise reduction backbone module, which enables precise control and generation of microstructures.
The generated micro-quenching structure images are highly realistic at the visual level and strictly follow the laws of metallurgy at the physical level, which improves the generalization ability and generation quality of the model and provides high-quality data support for the quality inspection of aerospace titanium alloy components.
Figure CN122199294A_ABST
Description
Technical Field
[0001] This invention belongs to the field of microstructure image processing, and specifically relates to a diffusion model-based data augmentation method for the micro-quenched microstructure of titanium alloys.
Background Technology
[0002] In aerospace manufacturing, titanium alloys have become irreplaceable materials for key load-bearing components of modern aircraft, owing to their superior specific strength (a strength-to-density ratio up to 1.3 times that of steel), excellent high- and low-temperature resistance (operating temperature range of -253℃ to 600℃), and outstanding corrosion resistance. High-performance titanium alloys, represented by Ti-6Al-4V (TC4) and Ti-1100, are widely used in core components such as high-pressure compressor blades of aero-engines, main load-bearing frames of fuselages, and landing gear joints. These components operate for extended periods under extreme conditions such as high-cycle fatigue loads, temperature cycling, and corrosive media. The characteristics of their microstructure (including phase composition, grain size, and interface morphology) directly determine the final mechanical properties, damage tolerance, and service life of the components. The β-transformation temperature of a titanium alloy is a core technical indicator for heat treatment process formulation, forging parameter optimization, and microstructure and property control. When a titanium alloy is heated above the β-transformation temperature, the microstructure changes from an α+β dual-phase structure to a single-phase β structure, and the microstructure formed after cooling directly affects key properties such as strength and toughness. According to GB/T 23605-2020 "Method for Determination of β-Transformation Temperature of Titanium Alloys", the β-transformation temperature is determined from the micro-quenched microstructure, based on the α-phase content decreasing from 1% to 0%.
[0003] However, current deep learning-based automated analysis of quenched microstructures faces severe data bottlenecks. First, the scarcity of both data scale and data diversity directly weakens model generalization. Public datasets typically contain only a few hundred images, far fewer than deep learning models require. The high cost and long cycle of sample preparation also limit the range of working conditions a dataset can cover: it is difficult to include the microstructure characteristics of different grades of aerospace titanium alloys under different quenching temperatures around the β-transformation temperature, or to cover the diverse morphologies of the α-phase. Models trained on such data adapt only to the narrow range of conditions in the training set. When batch differences and process fluctuations change the microstructure in actual engineering, they are prone to phase identification errors and deviations in quantitative results. Second, the scarcity of high-quality labeled data restricts segmentation accuracy. Pixel-level semantic segmentation annotation is the foundation on which deep learning models distinguish the α-phase, the β-matrix, and grain boundaries, yet annotating a single image takes 90-120 minutes, so even professional teams struggle to complete large-scale, high-precision annotation. Third, the high barrier and long cycle of sample preparation hinder iterative model optimization. Preparing a single metallographic specimen requires more than 10 precise processes, including cutting, multi-temperature heat treatment, mounting, multi-stage grinding, precision polishing, and targeted etching, and takes 6-8 hours; each step relies on experienced metallographic technicians. This costly, inefficient mode of data acquisition limits both the size and the diversity of existing public datasets, which severely restricts the generalization ability and robustness of deep learning models.
[0004] Traditional data augmentation methods (such as geometric transformation, color jittering, and GAN-based generation) have significant limitations in solving this problem: geometric transformation cannot add new microstructure patterns; color adjustment struggles to preserve the physical authenticity of micro-quenched structures; and although generative adversarial networks (GANs) can generate new samples, they lack precise control over the microstructure topology and often produce artifacts that do not conform to metallurgical principles (such as unreasonable phase interfaces and phase distributions that violate thermodynamic equilibrium).
Summary of the Invention
[0005] The purpose of this invention is to provide a diffusion model-based data augmentation method for the micro-quenched microstructure of titanium alloys, so that high-quality, physically consistent alloy microstructure images can be generated at low cost and with high efficiency.
[0006] The technical solution adopted in this invention is a diffusion model-based data augmentation method for the micro-quenched microstructure of titanium alloys, which is implemented according to the following steps:
[0007] Step 1: Collect images of the quenched microstructure of titanium alloy after etching, annotate the microstructure images to obtain metallographic mask images, construct a dataset from the two types of images, and after preprocessing divide it into a training set, a validation set, and a test set;
Step 2: Construct a controlled titanium alloy quenching microstructure image generation network based on ControlNet+Stable Diffusion;
Step 3: Input the training set images and the corresponding metallographic mask images into the controlled titanium alloy quenching microstructure image generation network, so that new samples are generated automatically from the alloy microstructure images and their labeled mask information. During training, the validation set is fed to the network to select and save the optimal weights; the test set images are then input to generate alloy microstructure images.
[0008] The invention is further characterized in that, in step 1, the preprocessing is as follows: the images are augmented using any one or more of random rotation, random scaling, random cropping, random brightness adjustment, random contrast enhancement, and random noise addition.
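The paired augmentation can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: it assumes images and masks are 2-D lists of grayscale values, restricts rotation to multiples of 90 degrees, and uses a hypothetical gain range of 0.8 to 1.2 for brightness. The point it shows is that geometric transforms must be applied identically to the image and its mask, while photometric changes touch the image only.

```python
import random


def rotate90(grid):
    """Rotate a 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def crop(grid, top, left, size):
    """Cut a size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def augment_pair(image, mask, crop_size, rng):
    """Augment an (image, mask) pair so that the mask stays aligned.

    Geometric transforms (rotation, crop) are applied to both arrays.
    The brightness gain is applied to the image only, because the mask
    holds class labels rather than intensities.
    """
    for _ in range(rng.randrange(4)):           # random rotation: 0, 90, 180 or 270 degrees
        image, mask = rotate90(image), rotate90(mask)
    top = rng.randint(0, len(image) - crop_size)
    left = rng.randint(0, len(image[0]) - crop_size)
    image = crop(image, top, left, crop_size)   # random crop, same window for both
    mask = crop(mask, top, left, crop_size)
    gain = rng.uniform(0.8, 1.2)                # hypothetical brightness range
    image = [[min(255, max(0, round(p * gain))) for p in row] for row in image]
    return image, mask
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.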
[0009] In step 2, the controlled titanium alloy quenching microstructure image generation network includes a variational autoencoder module, a ControlNet branch module, and a latent space diffusion and denoising backbone module. The variational autoencoder module serves as the underlying support, compressing the high-dimensional metallographic image into a continuous latent space through nonlinear transformation. The ControlNet branch module freezes all network weights of the pre-trained LDM and constructs a trainable feature extraction branch. The latent space diffusion and denoising backbone module simultaneously receives the noisy latent variables from the diffusion process and the multi-scale control feature maps from ControlNet. In the U-Net decoding path, the control feature maps are fused element-wise with the feature maps of the backbone network through an addition operator, and the network finally outputs the predicted noise. On this basis, the controlled metallographic microstructure is accurately reconstructed through a reverse diffusion process.
[0010] The processing procedure of the variational autoencoder module is as follows: the input is an RGB image of the quenched titanium alloy microstructure of size B×3×H×W, where B is the batch size and H×W is the image spatial resolution. First, the image passes through a 3×3 convolutional layer that maps it to the base number of feature channels C, and the output feature tensor size becomes B×C×H×W. It then enters the four-level hierarchical encoding structure. The first-level encoding stage receives the B×C×H×W feature map and processes it sequentially through two ResNet blocks for local feature extraction. Each ResNet block contains two 3×3 convolutions with GroupNorm normalization and SiLU activation, and gradient stability is maintained through residual connections. After this stage the feature map keeps its H×W resolution, but the number of channels is usually expanded to [value missing]. A 3×3 convolutional downsampling layer then reduces the spatial resolution to H/2×W/2, and the output feature map size is [size missing]. The second-level encoding stage receives this feature map as input, processes it through two structurally identical ResNet blocks for high-level semantic feature modeling, and then applies a stride-2 3×3 convolution for a second spatial compression, reducing the resolution to H/4×W/4; the number of channels is usually expanded to [value missing]. The third-level encoding stage repeats this pattern, compressing the resolution to H/8×W/8 and increasing the number of channels to [value missing]. Finally, feature integration is performed through an intermediate residual block, after which two parallel convolutional layers generate the mean and log-variance of the latent variable distribution, each of size [size missing]. The latent variable z is then obtained through the reparameterization trick.
[0011] The processing procedure of the ControlNet branch module is as follows: the input is a metallographic mask image of size [size missing]. A 3×3 convolution first maps the mask to the base number of channels, and the output size becomes [size missing]. This tensor then enters a four-level hierarchical encoding structure that is completely isomorphic to the backbone. In the first level, the feature map resolution remains [value missing] and the number of channels is [value missing]. This level typically contains three consecutive encoding blocks. The internal structure of each encoding block is as follows: the input feature map is first normalized by GroupNorm and activated by SiLU, and then passed through a 3×3 convolution for feature extraction; GroupNorm and SiLU activation are applied again, followed by a second 3×3 convolution that completes local feature modeling; a convolution then performs residual mapping, or an identity residual connection is used directly, forming a standard residual unit. After the last encoding block of this level, a 3×3 convolution performs spatial downsampling, reducing the resolution from [value missing] to [value missing], while the number of channels is usually expanded to [value missing]. At this point, the feature tensor size is [size missing]. The second level has an input size of [size missing] and contains three encoding blocks, each with the same structure as described above, within which the number of channels remains the same. At the end of this level, a stride-2 3×3 convolution performs spatial compression again, reducing the resolution to [value missing] and expanding the number of channels to [value missing]; the output size is [size missing]. The third level has an input size of [size missing] and contains three encoding blocks with a fixed number of feature channels. The downsampling convolution at the end of this level compresses the resolution to [value missing] and expands the number of channels to [value missing]; the output size is [size missing]. The fourth level has an input size of [size missing] and contains three encoding blocks, but spatial downsampling is no longer performed after this level; the resolution remains unchanged. Then comes the intermediate block, which consists of one ResNet block, one self-attention or cross-attention module, and another ResNet block; its input and output sizes are identical, at [size missing]. At the output position of each level, a Zero Convolution layer is inserted. This layer is a 1×1 convolution whose weights and biases are set to 0 at initialization, so its initial output is a zero tensor.
[0012] The processing procedure of the latent space diffusion and denoising backbone module is as follows: the input consists of two parts. The first is the noisy latent variable z_t, that is, the latent representation from the VAE encoder after noise has been added at diffusion time step t. The second is the embedding vector of the diffusion time step t, together with the multi-scale conditional feature maps passed from the ControlNet branch. First, an initial 3×3 convolutional layer maps the channels of the noisy latent variable z_t to the base number of channels of the backbone network. Meanwhile, the discrete time step t undergoes sinusoidal position encoding and multilayer perceptron processing to produce a temporal embedding vector. This vector is injected into the features of each subsequent residual block through scale and shift operations, so that the network can distinguish different noise intensities. The feature map then enters the downsampling path, which has four levels. In the first level, the input feature map size is [size missing]. This level contains several ResNet blocks; within each ResNet block, the temporal embedding is added to the normalized features. The zero-convolution output features of the ControlNet branch at this level are added directly to the current feature map of the backbone network, so that the structural information of the metallographic mask is preserved in the shallow features. In the second level, the number of channels is [value missing]; after processing by ResNet blocks with temporal embedding injection and fusion with the zero-convolution output of the second ControlNet level, the feature map is downsampled again to a size of [size missing]. The third level has an input size of [size missing] and follows the same procedure; after fusion with the conditional features of the corresponding level, the downsampled feature map size is [size missing]. The fourth level is the last level of the downsampling path, with an input size of [size missing]. High-level semantic features are further refined through multiple ResNet blocks, spatial attention modules, and channel attention modules, and are then fused with the conditional output of the fourth ControlNet level before entering the intermediate block, whose output feature map size is [size missing]. The feature map then enters the upsampling path. Finally, the feature map from the upsampling path passes through an output head: GroupNorm normalization and SiLU activation, followed by a 3×3 convolutional layer that maps the number of channels back to the number of latent space channels. This yields the predicted noise, which serves as the estimate of the true noise at the current time step.
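The sinusoidal encoding of the time step can be sketched as below. This is a minimal illustration of the commonly used sine/cosine timestep encoding; the embedding dimension and the `max_period` value of 10000 are assumptions, since the text gives neither, and the subsequent multilayer perceptron is omitted.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal encoding of a discrete diffusion time step.

    Returns `dim` values (dim is assumed even): the first half are sines,
    the second half cosines, over geometrically spaced frequencies.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

Because each sine/cosine pair lies on the unit circle, nearby time steps get nearby embeddings while distant ones are clearly separated, which is what lets the residual blocks condition on noise intensity.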
[0013] During the training phase of the network, the forward diffusion process gradually adds Gaussian noise to the latent variable $z_0$. Specifically, the noisy latent variable at time step $t$ is expressed as:
$$ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon $$
[0014] where $\epsilon \sim \mathcal{N}(0, I)$ follows a standard Gaussian distribution and $\bar{\alpha}_t$ is the diffusion scheduling function. The optimization objective of the network is to minimize the mean square error between the predicted noise and the actual noise, i.e.:
$$ \mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t,\, c} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right] $$
[0015] where $\epsilon_\theta(z_t, t, c)$ is the output of the diffusion backbone network with ControlNet conditional injection, and $c$ denotes the metallographic mask condition. During the inference phase, an initial noise $z_T \sim \mathcal{N}(0, I)$ is first sampled in the latent space, and reverse denoising updates are then performed step by step over the time steps. At each time step, the current noise is predicted by the diffusion backbone network, and the latent variable is updated according to the reverse diffusion formula. After multiple iterations, the final denoised latent variable $z_0$ is obtained. This latent variable is then fed into the decoder of the variational autoencoder; the decoder recovers the spatial resolution layer by layer through a symmetrical upsampling structure and finally outputs the generated microstructure image.
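The forward noising step and the noise-prediction loss can be illustrated with a short sketch. It assumes the standard DDPM closed form z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε and a linear β schedule from 1e-4 to 0.02; the schedule and its endpoints are illustrative assumptions, not values from the text. Latents are flat lists of floats, and the "predicted" noise is supplied by the caller rather than by a trained network.

```python
import math


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def q_sample(z0, eps, a_bar):
    """Forward diffusion: mix the clean latent z0 with noise eps at level a_bar."""
    return [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e for z, e in zip(z0, eps)]


def predict_z0(zt, eps_hat, a_bar):
    """Invert q_sample using a noise estimate eps_hat."""
    return [(z - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar) for z, e in zip(zt, eps_hat)]


def mse(a, b):
    """Mean squared error between predicted and actual noise (the training loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

If the network predicted the noise exactly, `mse` would be zero and `predict_z0` would return the clean latent, which is why minimizing the noise-prediction error is enough to learn the reverse process.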
[0016] In step 3, a real-time monitoring mechanism based on validation set metrics is integrated: at the end of each epoch, a complete inference evaluation is performed on the validation set, and the current FID and SSIM values are recorded and compared with the historical best values. If the current FID is lower and the SSIM is higher, the current EMA weights are saved as the optimal weights. Training terminates under the following combined strategy: (1) Early stopping: if the validation set FID has not decreased for 20 consecutive epochs, the model is judged to have converged and training is terminated; (2) Maximum number of training epochs: the maximum is set to 500 epochs, and once it is reached, training is forcibly terminated regardless of the convergence status.
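The checkpointing and termination rules can be sketched as a small control loop. This is an illustrative sketch only: `run_training` and its return tuple are hypothetical names, the per-epoch (FID, SSIM) values are supplied by the caller instead of being computed, and "historical best" is interpreted here as the metrics of the most recently saved checkpoint, which is an assumption.

```python
def run_training(val_metrics, patience=20, max_epochs=500):
    """Replay per-epoch validation metrics and apply the stopping rules.

    val_metrics yields (fid, ssim) once per epoch.
    Returns (epoch_of_saved_weights, last_epoch_run, stop_reason).
    """
    saved = {"epoch": None, "fid": float("inf"), "ssim": float("-inf")}
    lowest_fid = float("inf")
    stall = 0          # consecutive epochs without a new lowest FID
    last_epoch = 0
    for epoch, (fid, ssim) in enumerate(val_metrics, start=1):
        last_epoch = epoch
        # Save weights only when FID is lower AND SSIM is higher.
        if fid < saved["fid"] and ssim > saved["ssim"]:
            saved = {"epoch": epoch, "fid": fid, "ssim": ssim}
        # Early stopping tracks FID alone.
        if fid < lowest_fid:
            lowest_fid = fid
            stall = 0
        else:
            stall += 1
        if stall >= patience:
            return saved["epoch"], last_epoch, "early_stop"
        if epoch >= max_epochs:
            return saved["epoch"], last_epoch, "max_epochs"
    return saved["epoch"], last_epoch, "metrics_exhausted"
```

In a real training loop the save branch would write the EMA weights to disk; here only the epoch number is recorded.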
[0017] The beneficial effects of this invention are as follows: the invention provides a diffusion model-based data augmentation method for titanium alloy microstructures, addressing the difficulty of augmenting titanium alloy microstructure data and providing a data foundation for subsequent automated analysis of alloy microstructures. The core innovation lies in introducing the ControlNet conditional control architecture, using the topological labels (labels / masks) of the alloy microstructure as strong prior constraints to precisely guide the pre-trained Stable Diffusion model in generating physically plausible images. Specifically, semantic segmentation mask images of the microstructure are first extracted from a small number of labeled real metallographic images, using manual annotation or semi-automatic segmentation algorithms. These mask images not only accurately encode the geometric contours, size distributions, and spatial topological relationships of the different phase structures, but also retain key metallurgical features (such as phase interface curvature and grain orientation correlation). During the training phase, ControlNet, through a learnable cross-attention mechanism, deeply couples the input conditional mask with the latent diffusion process of Stable Diffusion, achieving both spatial constraint and semantic guidance of the generated content. This dual control mechanism ensures that the generated images have high visual realism while adhering to metallurgical principles at the physical level.
Description of the Drawings
[0018] Figure 1 is a schematic flowchart of the data augmentation method for the micro-quenched microstructure of titanium alloys according to the present invention.
[0019] Figure 2 is a schematic diagram of the structure of the controlled titanium alloy micro-quenching microstructure image generation network according to the present invention.
[0020] Figure 3 shows a final image generated by the controlled titanium alloy micro-quenching microstructure image generation network proposed in this invention.
Detailed Implementation
[0021] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0022] Example 1: This invention relates to a diffusion model-based data augmentation method for the micro-quenched microstructure of titanium alloys. As shown in Figure 1, it is implemented according to the following steps: Step 1: Collect images of the quenched microstructure of titanium alloy after etching, annotate the microstructure images to obtain metallographic mask images, construct a dataset from the two types of images, and after preprocessing divide it into a training set, a validation set, and a test set. The preprocessing is as follows: the dataset samples are first expanded by any combination of random rotation, random scaling, random cropping, and random noise addition, and are then divided into a training set, a validation set, and a test set in proportions of 80%, 15%, and 5%, respectively.
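The 80% / 15% / 5% division can be sketched as follows. The helper name `split_dataset`, the fixed seed, and the choice to give any rounding remainder to the test set are illustrative assumptions; the items are taken to be (image, mask) pairs so that each image stays with its own mask.

```python
import random


def split_dataset(pairs, ratios=(0.80, 0.15, 0.05), seed=0):
    """Shuffle (image, mask) pairs and split them into train / val / test.

    The train and validation sizes are rounded from the ratios; the test
    set receives whatever remains, so no sample is dropped or duplicated.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)   # seeded shuffle for reproducibility
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

For 200 pairs this yields 160 training, 30 validation, and 10 test samples.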
[0023] Step 2: Construct a controlled titanium alloy quenching microstructure image generation network based on ControlNet+Stable Diffusion, and train the network using the training set images. This network employs a conditional generation architecture that integrates ControlNet and Stable Diffusion, combining ControlNet's precise control over structural prior constraints with Stable Diffusion's generation capability in high-fidelity image synthesis. This strengthens the model's ability to capture the physical consistency of complex titanium alloy microstructures. By using semantic segmentation masks as strong prior conditions, it achieves precise control over both the spatial distribution of microstructures and the surface texture. A metallurgical constraint-guided strategy is also designed to explicitly model the controlled diffusion process of structure preservation and texture generation. This ensures that key metallurgical parameters in the generated images, such as the morphological characteristics of residual α phases, the grain boundary distribution, and the phase ratio, conform to physical laws, and alleviates the structural distortion and texture inconsistency found in traditional generation methods. The physical realism and engineering usability of the generated micro-quenched microstructures are thereby improved, providing high-quality synthetic data for the quality inspection of aerospace titanium alloy components.
[0024] As shown in Figure 2, the controlled titanium alloy quenching microstructure image generation network includes a variational autoencoder module, a ControlNet branch module, and a latent space diffusion and denoising backbone module. The variational autoencoder module serves as the underlying support, compressing high-dimensional metallographic images into a continuous latent space through nonlinear transformations. The encoder consists of four downsampling stages, each containing two ResNet blocks and one downsampling convolutional layer. The specific parameter configuration is as follows: the downsampling convolutional layers uniformly use 3×3 kernels, a stride of 2, and padding of 1, achieving an 8x downsampling ratio. The decoder is structurally symmetrical to the encoder, employing hierarchical upsampling convolutions (3×3 kernels, stride 1) combined with nearest-neighbor interpolation or transposed convolutions. Finally, a 3×3 convolutional layer maps the features back to the three RGB channels, ensuring a high-fidelity visual representation of the microstructure.
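The resolution arithmetic behind these parameters can be checked with a short sketch. It uses the standard convolution output-size formula, floor((n + 2p − k) / s) + 1. The 512×512 input and the count of three stride-2 layers are assumptions made for illustration: three halvings are what an 8x ratio implies, so under this reading the last encoder stage would not reduce resolution.

```python
def conv_out(n, kernel=3, stride=1, padding=1):
    """Output length along one axis of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def encoder_resolutions(h, w, num_down=3):
    """Spatial sizes after each stride-2, 3x3, padding-1 downsampling conv."""
    sizes = [(h, w)]
    for _ in range(num_down):
        h, w = conv_out(h, stride=2), conv_out(w, stride=2)
        sizes.append((h, w))
    return sizes
```

A 3×3 kernel with stride 1 and padding 1 leaves the size unchanged, while stride 2 halves it, so `encoder_resolutions(512, 512)` steps from 512 to 256, 128, and 64.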
[0025] The ControlNet branch module freezes all network weights of the pre-trained LDM and constructs a trainable feature extraction branch. The core of this branch contains 12 encoding blocks and 1 intermediate block, with internal convolutional parameters strictly consistent with the backbone U-Net: 3×3 convolutional kernels, with the stride switching between 1 and 2 depending on the feature map scale. To encode the metallographic mask accurately, the branch introduces a zero-convolution layer after each feature extraction level: this layer uses a 1×1 convolutional kernel with a stride of 1, and its weights and biases are initialized to 0. Because the initial output of this layer is zero, the mask condition does not disturb the pre-trained backbone at the start of training and is injected into it with very little noise. As training proceeds, the zero-convolution layers gradually learn to extract from the mask the strong constraint information about topological structures such as grain boundaries and phase boundaries.
[0026] Latent space diffusion and denoising backbone module: this module adopts the classic U-Net architecture, consisting of four downsampling levels, one intermediate block, and four upsampling levels. Each level integrates a ResNet module and a Transformer block. The ResNet module uses two sets of 3×3 convolutions, a GroupNorm layer, and the SiLU activation function to process spatial features, while the Transformer block captures global dependencies through a cross-attention mechanism. At the parameter level, the convolutional kernels of the Transformer block are mainly used for feature projection and have a size of 1×1; the spatial attention layer processes the latent variables through a multi-head mechanism. In the decoding path, multi-scale feature maps from ControlNet are fused with the corresponding levels of the backbone network through an addition operator and are then reconstructed through 3×3 convolutions, finally outputting the predicted noise, so that the nonlinear distribution of the complex titanium alloy microstructure is modeled accurately.
[0027] In summary, the generation of controlled titanium alloy metallographic structures relies on a multi-stage, continuously coupled process, encompassing the entire workflow from original image compression to controlled feature reconstruction. First, in the latent feature extraction stage, the original training image is input into the encoder of a variational autoencoder for spatial dimension downsampling compression. The output of this stage is a latent variable feature map, representing the compressed representation of the metallographic structure in a high-dimensional abstract feature space. Subsequently, in the forward diffusion and noise construction stage, the latent variable enters the diffusion module. The system progressively applies Gaussian noise to it according to the defined Markov chain formula, ultimately generating a noisy latent variable. This feature map no longer possesses the visual contours of the original structure but serves as the core input to the next stage's denoising backbone network. Simultaneously, in the conditional feature extraction stage, the semantic or geometric mask of the titanium alloy structure is input as an external constraint into the ControlNet branch. Through parallel processing of multiple convolutions within this branch, the system extracts and generates multiple sets of spatially guiding control feature maps in real time. Finally, in the feature fusion and reconstruction stage, the denoising backbone network (U-Net) simultaneously receives noisy latent variables from the diffusion process and multi-scale control feature maps from ControlNet. In the U-Net's decoding path, these control feature maps are fused pixel-by-pixel with the backbone network's internal feature maps using an additive operator. Through this end-to-end joint optimization, the network ultimately outputs predicted noise, which is then used to achieve accurate reconstruction of the controlled metallographic structure via a reverse diffusion process.
[0028] The processing procedure of the variational autoencoder module is as follows: the input is an RGB image of the quenched titanium alloy microstructure of size B×3×H×W, where B is the batch size and H×W is the original image spatial resolution. First, the image passes through a 3×3 convolutional layer (stride 1, padding 1) that maps it to the base number of feature channels C, and the output feature tensor size becomes B×C×H×W. This step changes only the channel dimension, not the spatial resolution, and completes the initial projection from the color space to the feature space.
[0029] The features then enter the four-level hierarchical encoding structure. The first-level encoding stage receives the B×C×H×W feature map and processes it sequentially through two ResNet blocks for local feature extraction. Each ResNet block contains two 3×3 convolutions with GroupNorm normalization and SiLU activation, and gradient stability is maintained through residual connections. After this stage the feature map keeps its H×W resolution, but the number of channels is usually expanded to [value missing]. A 3×3 convolutional downsampling layer (stride 2, padding 1) then reduces the spatial resolution to H/2×W/2, and the output feature map size is [size missing].
[0030] The second-level encoding stage receives this feature map as input, processes it through two structurally identical ResNet blocks for high-level semantic feature modeling, and then applies a stride-2 3×3 convolution for a second spatial compression, reducing the resolution to H/4×W/4; the number of channels is usually expanded to [value missing]. The third-level encoding stage repeats this pattern, compressing the resolution to H/8×W/8 and increasing the number of channels to [value missing]. This completes an 8x spatial compression. Finally, feature integration is performed by an intermediate residual block, after which two parallel convolutional layers generate the mean and log-variance of the latent variable distribution, each of size [size missing]. The latent variable z is then obtained through the reparameterization trick. This latent variable is the input representation of the diffusion model.
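The reparameterization step can be illustrated with a minimal sketch: given the mean and log-variance produced by the two parallel convolutions, the latent is sampled as z = μ + exp(0.5·log σ²)·ε with ε drawn from a standard Gaussian. The flat-list representation and the function name are assumptions made for illustration.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample z = mean + std * eps element-wise, with std = exp(0.5 * log_var).

    Drawing eps separately keeps the sample a deterministic function of
    (mean, log_var), which is what makes the encoder trainable by backprop.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]
```

Predicting the log-variance rather than the variance keeps the standard deviation positive without any explicit constraint on the network output.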
[0031] The ControlNet branch module is topologically identical to the U-Net encoding path of the pre-trained LDM, and its input is of size [size missing]. The metallographic mask image is first obtained through a... Convolution (with stride of 1 and padding of 1) maps the mask to the base number of channels. The output size becomes This tensor is then incorporated into a four-level hierarchical encoding structure that is completely isomorphic to the backbone.
[0032] In the first level, the feature map resolution remains at [resolution value]. The number of channels is This layer typically contains three consecutive Encoding Blocks. The internal structure of each Encoding Block is as follows: the input feature map is first normalized by GroupNorm and activated by SiLU, then passed through a... Feature extraction is performed using convolution (stride 1, padding 1); then GroupNorm and SiLU activation are applied again, followed by a second... Convolution completes local feature modeling. Then, through... Convolution performs residual mapping or directly uses identity residual connections, thus forming a standard residual unit structure. After the last encoding block of this layer, a... Convolution (stride 2, padding 1) performs spatial downsampling, reducing the resolution from... Down to Meanwhile, the number of channels is usually expanded to At this point, the feature tensor size is ; After entering the second level, the input size is This layer also contains 3 EncodingBlocks, each with the same structure as described above, but the number of channels remains the same. At the end of this level, a step of 2 is used. Convolution performs spatial compression again, reducing the resolution to [a lower value]. The number of channels has been expanded to The output size is .
[0033] The third level has an input size of [size missing]. It also contains three Encoding Blocks with exactly the same structure, and the number of feature channels stays fixed within the level. The downsampling convolution at the end of the level compresses the resolution to [value missing] and expands the number of channels to [value missing]; the output size is [size missing].
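The effect of the three downsampling convolutions described so far can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the kernel size of 3 and the 64-pixel input are assumptions, since the corresponding values are not given in the text, and the helper names are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_chain(size, stages, kernel=3, stride=2, padding=1):
    """Apply `stages` strided convolutions and return every intermediate size."""
    sizes = [size]
    for _ in range(stages):
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Hypothetical 64x64 input feature map passed through three downsampling stages.
sizes = downsample_chain(64, stages=3)
# With stride 1 and padding 1, an assumed 3x3 kernel keeps the size unchanged.
same = conv_out_size(64, kernel=3, stride=1, padding=1)
```

Under these assumptions each stride-2 stage halves the resolution, so three stages give an overall 8× reduction, while the stride-1 convolutions inside each Encoding Block leave the resolution unchanged.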
[0034] The fourth level has an input size of [size missing]. This level typically contains three Encoding Blocks, but no spatial downsampling is performed after it; the resolution remains unchanged. It is followed by the Middle Block, which consists of a ResNet Block, a self-attention or cross-attention module (Transformer Block), and another ResNet Block. Its input and output sizes are identical, namely [size missing]. The Transformer part uses 1×1 convolutions for the Q, K, and V projections and establishes long-range dependencies across the spatial dimensions through a multi-head attention mechanism. At the output position of each level (that is, once the Encoding Blocks of the level have finished and before downsampling), a Zero Convolution layer is inserted. This layer is a 1×1 convolution (stride 1) whose weights and biases are initialized to 0, so its initial output is a zero tensor. The output size of each Zero Convolution is exactly the same as that of the features at the current level. These Zero Convolution outputs are then fused element-wise with the features of the corresponding scale in the backbone U-Net, injecting the structural condition accurately at multiple scales.
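The zero-convolution behavior described above can be seen in a small pure-Python sketch: a 1×1 convolution with all-zero weights and biases outputs zeros, so adding it to the backbone features initially leaves them unchanged. The helper names, tensor layout ([channels][height][width]), and numeric values are hypothetical, and element-wise addition is assumed as the fusion operator.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution, stride 1. x: [C_in][H][W], weight: [C_out][C_in]."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for o, row in enumerate(weight):
        plane = [[bias[o] + sum(row[c] * x[c][i][j] for c in range(len(x)))
                  for j in range(w)] for i in range(h)]
        out.append(plane)
    return out


def add_features(a, b):
    """Element-wise addition of two [C][H][W] feature maps."""
    return [[[a[c][i][j] + b[c][i][j] for j in range(len(a[c][i]))]
             for i in range(len(a[c]))] for c in range(len(a))]


channels = 2
control = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]   # ControlNet feature
backbone = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]  # backbone feature

zero_w = [[0.0] * channels for _ in range(channels)]  # weights initialized to 0
zero_b = [0.0] * channels                             # biases initialized to 0

injected = conv1x1(control, zero_w, zero_b)   # all zeros at initialization
fused = add_features(backbone, injected)      # equals the backbone feature
```

This is why the control branch does not disturb the pre-trained backbone at the start of training: its contribution grows from zero only as the zero-convolution parameters are updated.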
[0035] Latent space diffusion and noise reduction backbone module: The input of this module consists of two parts. The first is the noisy latent variable [symbol missing], that is, the latent representation obtained by adding noise to the latent variable from the VAE encoder at diffusion time step [symbol missing]; the second is the embedding vector of the diffusion time step [symbol missing]. In addition, the module receives the multi-scale conditional features passed from the ControlNet branch. The core of this module is a denoising network based on the U-Net architecture, which predicts and removes noise in the latent space to recover a clean latent variable representation.
[0036] First, the noisy latent variable [symbol missing] passes through an initial 3×3 convolutional layer (stride 1, padding 1), which maps the number of channels to the base number of channels of the backbone network [value missing]. Meanwhile, the discrete time step [symbol missing] is processed by sinusoidal position encoding and a multilayer perceptron to generate a temporal embedding vector [symbol missing]. This vector is injected into the features in each subsequent residual block through scaling and offset operations, so that the network can distinguish different noise intensities. The feature map then enters the downsampling path, which comprises four levels that correspond strictly to the hierarchical structure of the ControlNet branch.
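A minimal sketch of the time-step encoding described above follows. It assumes the common sinusoidal form with geometrically spaced frequencies and a scale-and-shift injection of the form h · (1 + scale) + shift; the multilayer perceptron is omitted, and the function names, the embedding width of 8, and the `max_period` value are hypothetical choices rather than details from the original text.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of an integer time step t into `dim` values.

    The first half holds cosines and the second half sines over
    geometrically spaced frequencies. `dim` must be even.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


def scale_shift(features, scale, shift):
    """Inject conditioning into normalized features: h * (1 + scale) + shift."""
    return [h * (1.0 + s) + b for h, s, b in zip(features, scale, shift)]


emb_0 = timestep_embedding(0, dim=8)      # cosines are 1, sines are 0 at t = 0
emb_500 = timestep_embedding(500, dim=8)  # a different step gives a different vector
modulated = scale_shift([2.0], [0.5], [1.0])
```

In a full network the scale and shift values would be produced from the embedding by learned layers; here they are passed in directly to keep the example short.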
[0037] In the first level, the input feature map size is [size missing]. This level contains several ResNet Blocks. Within each ResNet Block, the temporal embedding [symbol missing] is added to the normalized features. The key step is that the output features of the zero convolutional layer of the ControlNet branch at this level are added directly to the current feature map of the backbone network, which ensures that the structural information of the metallographic mask is preserved in the shallow features.
[0038] In the second level, the number of channels is [value missing]. The features are likewise processed by ResNet Blocks with temporal embedding injection and fused with the zero-convolution output of the second level of ControlNet. They are then downsampled again to obtain a feature map of size [size missing]. The third level has an input size of [size missing], and its processing logic is the same as above. After the conditional features of the corresponding level are fused, the downsampled feature map has size [size missing].
[0039] The fourth level is the last level of the downsampling path, with an input size of [size missing]. High-level semantic features are further refined through multiple ResNet Blocks and spatial and channel attention modules, then fused with the conditional output of ControlNet's fourth level and fed into the Middle Block, whose output feature map size is [size missing]. The Middle Block consists of two ResNet Blocks sandwiching a Transformer Block. The Transformer Block integrates features globally using a self-attention mechanism, and the output of the ControlNet's zero convolutional layer is also added to the backbone features here to keep the global structural constraints consistent. The features then enter the upsampling path, which gradually restores the spatial resolution to the original input size. Finally, the feature map from the upsampling path passes through an output head: GroupNorm normalization and the SiLU activation function are applied first, and a 3×3 convolutional layer then maps the channel count back to the latent space channel count, yielding the predicted noise [symbol missing]. This prediction is used to estimate the true noise at the current time step. During the training phase, the diffusion process gradually adds Gaussian noise to the latent variable [symbol missing]. Specifically, at time step [symbol missing], the noisy latent variable is expressed as follows: [formula missing]
[0040] where [symbol missing] follows a standard Gaussian distribution and [symbol missing] is the diffusion scheduling function. The optimization objective of the backbone network is to minimize the mean square error between the predicted noise and the actual noise, i.e.: [formula missing]
[0041] where [symbol missing] is the output of the diffusion backbone network with ControlNet conditional injection, and m represents the metallographic mask. During the inference phase, [symbol missing] is first sampled in the latent space as the initial noise input. Reverse denoising updates are then performed step by step over the time steps [range missing]. At each time step, the current noise is predicted by the diffusion backbone network, and the latent variable representation is updated according to the reverse diffusion formula. After multiple iterations, the final denoised latent variable [symbol missing] is obtained. This latent variable is then fed into the decoder of the variational autoencoder. The decoder recovers the spatial resolution level by level through a symmetric upsampling structure, gradually restoring it to the original resolution [value missing], and finally outputs the generated microstructure image. Throughout the generation process, only a structural mask is required as input and no real microstructure image is needed, which achieves automatic microstructure generation under structural constraints.
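For orientation, the training and inference procedure described in the preceding paragraphs matches the standard noise-prediction (DDPM-style) formulation of latent diffusion, sketched below. This is an assumed reconstruction, not the original formulas: the symbols z_0, z_t, \epsilon, \alpha_t, \bar{\alpha}_t, \sigma_t and \epsilon_\theta are conventional notation chosen here, with m denoting the metallographic mask as in the text.

```latex
% Assumed notation: z_0 clean latent, z_t noisy latent at step t, m mask,
% \alpha_t = 1 - \beta_t, \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s (noise schedule),
% \epsilon_\theta the backbone with ControlNet conditional injection.
\begin{align}
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
      \qquad \epsilon \sim \mathcal{N}(0, I) \\
\mathcal{L} &= \mathbb{E}_{z_0,\, m,\, \epsilon,\, t}
      \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, m) \right\|_2^2 \right] \\
z_{t-1} &= \frac{1}{\sqrt{\alpha_t}}
      \left( z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,
      \epsilon_\theta(z_t, t, m) \right) + \sigma_t\, \xi,
      \qquad \xi \sim \mathcal{N}(0, I)
\end{align}
```

The first line is the forward noising step, the second is the mean-square-error objective between actual and predicted noise, and the third is one common form of the reverse update applied from t = T down to t = 1, starting from z_T sampled from a standard Gaussian.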
[0042] Step 3: Input the training set images and metallographic mask images into the controlled titanium alloy quenching microstructure image generation network, so that new samples are generated automatically from the alloy microstructure images and their labeled mask information. During training, the validation set data is fed to the model so that the best model weights are identified and saved in real time. Finally, the test set images are input to obtain the generated alloy microstructure images.
[0043] A real-time monitoring mechanism based on validation set metrics (FID and SSIM) is integrated: at the end of each epoch, the system performs a complete inference evaluation on the validation set, records the current FID and SSIM values, and compares them with the historical best values. If the current FID is lower and the current SSIM is higher, the current EMA weights are saved as the optimal model.
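The checkpoint-selection rule described above can be sketched as follows. The function names and the per-epoch metric values are hypothetical and serve only to illustrate the rule; a real training loop would save the EMA weights at the marked point.

```python
def is_new_best(fid, ssim, best_fid, best_ssim):
    """True when FID is strictly lower AND SSIM is strictly higher than the best so far."""
    return fid < best_fid and ssim > best_ssim


def track_best(history):
    """Return the 1-based epoch whose EMA weights would be kept, plus its metrics."""
    best_epoch, best_fid, best_ssim = None, float("inf"), float("-inf")
    for epoch, (fid, ssim) in enumerate(history, start=1):
        if is_new_best(fid, ssim, best_fid, best_ssim):
            best_epoch, best_fid, best_ssim = epoch, fid, ssim
            # A real loop would save the current EMA weights here.
    return best_epoch, best_fid, best_ssim


# Hypothetical (FID, SSIM) values per epoch, for illustration only.
history = [(80.0, 0.50), (60.0, 0.55), (55.0, 0.54), (50.0, 0.60)]
best = track_best(history)
```

Because both conditions must hold, an epoch that improves only one of the two metrics (the third epoch in this made-up history) does not replace the saved weights.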
[0044] Example 2 Furthermore, the training termination condition employs the following composite strategy: (1) Early stopping mechanism: If the FID on the validation set does not decrease for 20 consecutive epochs, the model is considered to have converged and training is terminated.
[0045] (2) Maximum number of training epochs: The maximum number of training epochs is set to 500. Once this maximum is reached, training is forcibly terminated regardless of the convergence status.
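The composite termination strategy in (1) and (2) can be sketched as a small function. The patience of 20 and the cap of 500 epochs come from the text; the function name, the return format, and the FID curve are hypothetical.

```python
def epochs_until_stop(val_fid_per_epoch, patience=20, max_epochs=500):
    """Return (stop_epoch, reason) under the composite termination rule.

    Training stops when validation FID has not decreased for `patience`
    consecutive epochs, or when `max_epochs` is reached, whichever is first.
    """
    best_fid = float("inf")
    stale = 0
    epoch = 0
    for epoch, fid in enumerate(val_fid_per_epoch[:max_epochs], start=1):
        if fid < best_fid:
            best_fid, stale = fid, 0   # improvement resets the counter
        else:
            stale += 1                 # no decrease this epoch
        if stale >= patience:
            return epoch, "early_stop"
    reason = "max_epochs" if epoch == max_epochs else "data_exhausted"
    return epoch, reason


# Hypothetical FID curve: improves for 30 epochs, then plateaus.
curve = [100.0 - e for e in range(30)] + [71.0] * 100
result = epochs_until_stop(curve)
```

With this made-up curve the last improvement occurs at epoch 30, so the early-stopping branch fires 20 epochs later; a curve that keeps improving would instead run until the 500-epoch cap.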
[0046] Example 3 Furthermore, during the testing phase, the metallographic mask images of the test set are input one by one into the model loaded with the optimal EMA weights. After ControlNet feature extraction, U-Net progressive denoising, and VAE decoding, the corresponding titanium alloy microstructure images are output directly, and no real metallographic images are needed at any point in the process.
[0047] Example 4 Pix2Pix, CycleGAN, StyleGAN2, and standard LDM (without ControlNet constraints) were selected as baseline methods and compared with the controlled generative network proposed in this invention. Evaluation metrics: The Fréchet Inception Distance (FID, the lower the better) and the Structural Similarity Index (SSIM, the higher the better) were used for comprehensive evaluation.
[0048] Table 1. Comparison of generation quality of different methods on the test set
[0049] The quantitative evaluation results are shown in Table 1. Compared with the standard LDM, the method of the present invention reduces FID by 46.6% and improves SSIM by 23.1%, indicating that the ControlNet mask constraint significantly improves the fidelity of the generated microstructure morphology.
[0050] Example 5 Figure 3 demonstrates the augmentation effect of the method of this invention on typical titanium alloy microstructure samples. Specifically, it shows high-fidelity, structurally clear titanium alloy microstructure images generated automatically by the Stable Diffusion backbone model when an arbitrary titanium alloy microstructure semantic mask is given to ControlNet as the control condition. The mask provides spatial prior information about the microstructure categories, guiding the generation process to preserve realistic microstructure boundaries, morphological features, and texture details. Figure 3 shows that the generated new samples exhibit strong detail, with the texture of the α phase in the microstructure clearly reproduced, outperforming traditional interpolation or denoising methods. At the same time, the generated samples are semantically consistent with the input mask: the results adhere strictly to the microstructure distribution specified by the mask, without class confusion or implausible structures, indicating that ControlNet strongly constrains the generation process.
[0051] Example 6 In summary, the method of this invention can effectively integrate prior semantic information and generative modeling capabilities, and achieve high-quality and high-reliability microstructure image enhancement while preserving the semantics of the real titanium alloy microstructure, providing a reliable data foundation for subsequent model training and quantitative analysis.
Claims
1. A diffusion model-based data augmentation method for the quenched microstructure of titanium alloy, characterized in that the specific steps are as follows: Step 1: Collect images of the quenched microstructure of titanium alloy after etching treatment, annotate the microstructure images to obtain metallographic mask images, construct a dataset from the two types of images, and, after preprocessing, divide it into a training set, a validation set, and a test set; Step 2: Construct a controlled titanium alloy quenching microstructure image generation network based on ControlNet+Stable Diffusion; Step 3: Input the training set images and metallographic mask images into the controlled titanium alloy quenching microstructure image generation network, so that new samples are generated automatically from the alloy microstructure images and their labeled mask information; during training, the validation set data is input into the network to save the optimal weights; the test set images are input to generate alloy microstructure images.
2. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 1, characterized in that, in step 1, the preprocessing is as follows: the images are augmented, and the augmentation uses any one or more of the following: random rotation, random scaling, random cropping, random brightness adjustment, random contrast enhancement, and random noise addition.
3. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 1, characterized in that, in step 2, the controlled titanium alloy quenching microstructure image generation network includes a variational autoencoder module, a ControlNet branch module, and a latent space diffusion and noise reduction backbone module; the variational autoencoder module serves as the underlying support, compressing the high-dimensional metallographic image into a continuous latent space through nonlinear transformation; the ControlNet branch module fully locks the network weights of the pre-trained LDM and constructs a trainable feature extraction branch; the latent space diffusion and noise reduction backbone module simultaneously receives noisy latent variables from the diffusion process and multi-scale control feature maps from ControlNet; in the decoding path of the U-Net, the control feature maps are fused element by element with the feature maps of the backbone network through an additive operator, and the predicted noise is finally output; on this basis, accurate reconstruction of the controlled metallographic structure is achieved through the reverse diffusion process.
4. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 3, characterized in that the processing procedure of the variational autoencoder module is as follows: the input is an RGB image of the quenched microstructure of titanium alloy of size [size missing], where [symbol missing] is the batch size and [symbol missing] is the image spatial resolution; first, a [kernel size missing] convolutional layer maps the image to the base number of feature channels [value missing], and the output feature tensor size becomes [size missing]; the tensor then enters the four-level hierarchical encoding structure; the first-level encoding stage receives a feature map of size [size missing] and processes it sequentially through two ResNet Blocks for local feature extraction; each ResNet Block contains two [kernel size missing] convolutions with GroupNorm normalization and SiLU activation, and maintains gradient stability through residual connections; after this stage, the feature map keeps the [value missing] resolution, but the number of channels is usually expanded to [value missing]; a [kernel size missing] convolutional downsampling layer then reduces the spatial resolution to [value missing], and the output feature map size is [size missing]; the second-level encoding stage receives the aforementioned feature maps as input and processes them through two structurally identical ResNet Blocks for high-level semantic feature modeling, after which a stride-2 [kernel size missing] convolution performs a second spatial compression, reducing the resolution to [value missing], with the number of channels usually expanded to [value missing]; the third-level encoding stage repeats this pattern, compressing the resolution to [value missing] and increasing the number of channels to [value missing]; finally, features are integrated by an intermediate residual block and then passed to two parallel [kernel size missing] convolutional layers, which generate the mean and log-variance of the latent variable distribution, each of size [value missing]; the latent variable [symbol missing] is then obtained through the reparameterization trick.
5. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 4, characterized in that the processing procedure of the ControlNet branch module is as follows: the input is a metallographic mask image of size [size missing]; a [kernel size missing] convolution first maps the mask to the base number of channels [value missing], and the output size becomes [size missing]; this tensor is then fed into a four-level hierarchical encoding structure that is completely isomorphic to the backbone; in the first level, the feature map resolution remains at [value missing] and the number of channels is [value missing]; this level typically contains three consecutive Encoding Blocks; the internal structure of each Encoding Block is as follows: the input feature map is first normalized by GroupNorm and activated by SiLU, and then passed through a [kernel size missing] convolution for feature extraction; GroupNorm and SiLU activation are applied again, followed by a second [kernel size missing] convolution that completes local feature modeling; a [kernel size missing] convolution then performs the residual mapping, or an identity residual connection is used directly, forming a standard residual unit; after the last Encoding Block of this level, a [kernel size missing] convolution performs spatial downsampling, reducing the resolution from [value missing] to [value missing], while the number of channels is usually expanded to [value missing]; at this point, the feature tensor size is [size missing]; in the second level, the input size is [size missing]; this level contains three Encoding Blocks, each with the same structure as described above, and the number of channels stays the same within the level; at the end of this level, a stride-2 [kernel size missing] convolution performs spatial compression again, reducing the resolution to [value missing] and expanding the number of channels to [value missing], and the output size is [size missing]; the third level has an input size of [size missing] and contains three Encoding Blocks with the number of feature channels fixed within the level; the downsampling convolution at the end of the level compresses the resolution to [value missing] and expands the number of channels to [value missing], and the output size is [size missing]; the fourth level has an input size of [size missing] and contains three Encoding Blocks, but no spatial downsampling is performed after it and the resolution remains unchanged; it is followed by the intermediate block, which consists of a ResNet Block, a self-attention or cross-attention module, and another ResNet Block, with identical input and output sizes of [size missing]; at the output position of each level, a Zero Convolution layer is inserted; this layer is a 1×1 convolution whose weights and biases are set to 0 at initialization, so its initial output is a zero tensor.
6. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 5, characterized in that the processing procedure of the latent space diffusion and noise reduction backbone module is as follows: the input consists of two parts; the first is the noisy latent variable [symbol missing], that is, the latent representation obtained by adding noise to the latent variable from the VAE encoder at diffusion time step [symbol missing]; the second is the diffusion time step embedding vector [symbol missing], together with the multi-scale conditional feature maps passed from the ControlNet branch; first, the noisy latent variable [symbol missing] passes through an initial 3×3 convolutional layer, which maps the number of channels to the base number of channels of the backbone network [value missing]; meanwhile, the discrete time step [symbol missing] is processed by sinusoidal position encoding and a multilayer perceptron to generate a temporal embedding vector [symbol missing]; this vector is injected into the features in each subsequent residual block through scaling and offset operations to distinguish different noise intensities; the feature map then enters the downsampling path, which comprises four levels; in the first level, the input feature map size is [size missing], and this level contains several ResNet Blocks; in each ResNet Block, the temporal embedding [symbol missing] is added to the normalized features; the output features of the zero convolutional layer of the ControlNet branch at this level are added directly to the current feature map of the backbone network, so that the structural information of the metallographic mask is preserved in the shallow features; in the second level, the number of channels is [value missing]; after processing by ResNet Blocks with temporal embedding injection and fusion with the zero-convolution output of the second level of ControlNet, the features are downsampled again to obtain a feature map of size [size missing]; the third level has an input size of [size missing], and its processing procedure is the same as above; after the conditional features of the corresponding level are fused, the downsampled feature map has size [size missing]; the fourth level is the last level of the downsampling path, with an input size of [size missing]; high-level semantic features are further refined through multiple ResNet Blocks, spatial attention modules, and channel attention modules, and then fused with the conditional output of ControlNet's fourth level before entering the intermediate block, whose output feature map size is [size missing]; the feature map then enters the upsampling path; finally, the feature map from the upsampling path passes through an output head, first undergoing GroupNorm normalization and SiLU activation, and then passing through a 3×3 convolutional layer that maps the channel count back to the latent space channel count, yielding the predicted noise [symbol missing]; this prediction is used to estimate the true noise at the current time step.
7. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 6, characterized in that, during the training phase of the network, the diffusion process gradually adds Gaussian noise to the latent variable [symbol missing]; specifically, at time step [symbol missing], the noisy latent variable is expressed as: [formula missing], where [symbol missing] follows a standard Gaussian distribution and [symbol missing] is the diffusion scheduling function; the optimization objective of the network is to minimize the mean square error between the predicted noise and the actual noise, i.e.: [formula missing], where [symbol missing] is the output of the diffusion backbone network with ControlNet conditional injection; during the inference phase, [symbol missing] is first sampled in the latent space as the initial noise input, and reverse denoising updates are then performed step by step over the time steps; at each time step, the current noise is predicted by the diffusion backbone network, and the latent variable representation is updated according to the reverse diffusion formula; after multiple iterations, the final denoised latent variable is obtained; this latent variable is then fed into the decoder of the variational autoencoder; the decoder recovers the spatial resolution level by level through a symmetric upsampling structure and finally outputs the generated microstructure image.
8. The diffusion model-based data augmentation method for the quenched microstructure of titanium alloy according to claim 7, characterized in that, in step 3, a real-time monitoring mechanism based on validation set metrics is integrated: a complete inference evaluation is performed on the validation set at the end of each epoch, and the current FID and SSIM values are recorded and compared with the historical best values; if the current FID is lower and the current SSIM is higher, the current EMA weights are saved as the optimal weights; the training termination condition adopts the following combined strategy: (1) Early stopping mechanism: if the FID on the validation set does not decrease for 20 consecutive epochs, the model is determined to have converged and training is terminated; (2) Maximum number of training epochs: the maximum number of training epochs is set to 500, and once this maximum is reached, training is forcibly terminated regardless of the convergence status.