A dam twin scene modeling method based on human-machine coupling

A human-computer coupled dam twin scene modeling method combines speech-to-text conversion with a generator and a discriminator to solve the problem of aligning text information with image regions. It generates high-quality dam twin scenes in which the images are realistic and semantically consistent with the text.

CN119919540B · Active · Publication Date: 2026-06-26 · HOHAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HOHAI UNIV
Filing Date: 2025-01-02
Publication Date: 2026-06-26

Abstract

The application discloses a dam twin scene modeling method based on human-computer coupling and relates to the field of text-to-image generation in artificial intelligence. First, human speech is converted into text by a speech-to-text method, and the text is then passed to the model. Through text-image feature fusion, the generator in the model retains more key text information and maps the semantics into image regions, generating high-quality, realistic images that match the overall sentence semantics and achieve good text-image semantic consistency. Through the attention mechanism, the discriminator better identifies the image regions most related to the text information, which promotes the text-image semantic consistency of the generated image; the stronger discriminator in turn pushes the generator to produce higher-quality images. The introduction of contrastive learning further promotes the realism and semantic consistency of the generated image.

Description

Technical Field

[0001] This invention relates to the field of image generation from text in artificial intelligence, specifically to a method for modeling a dam twin scene based on human-computer coupling.

Background Technology

[0002] With the increasing demands of global climate change and water resource management, dams, as important water conservancy engineering facilities, play a crucial role. The design, construction, operation, and maintenance of dams directly affect human life safety, the ecological environment, and economic development. Therefore, how to effectively manage and optimize dam operation to ensure its long-term safety and sustainability has become one of the key issues in the field of water conservancy engineering. Dam modeling not only requires accurate modeling of the physical environment but also necessitates the integration of the professional knowledge and experience of operators. This necessitates a human-computer interaction mechanism that acts as a bridge in the modeling process. Through human-computer coupling, the system can not only provide real-time feedback and adjustments based on expert input but also gradually improve and optimize the dam model through multiple rounds of interaction, thereby achieving "human-centered" dynamic modeling. By combining technologies such as speech recognition, natural language processing, and intelligent image generation, the dam twin scene modeling method based on human-computer coupling allows engineers to directly input, modify, and update the dam model via voice or text, greatly improving modeling efficiency and accuracy. This intelligent modeling method allows experts to quickly generate modeling scenarios that meet actual needs when facing complex engineering environments, and adjust model parameters in real time according to changes to ensure the accuracy and flexibility of the model.

[0003] The development of generative adversarial networks (GANs) has contributed significantly to the field of text-to-image generation. Under the GAN architecture, text is input and an image corresponding to the text is generated. GANs generate images quickly in text-to-image tasks and possess a smooth latent space, allowing for more controllable synthesis, so they are widely used for this task. The task of text-to-image generation is to generate images consistent with a text description; the aim is to ensure both semantic consistency with the text and the realism of the images.

[0004] In recent years, research on text-to-image generation has made considerable progress, but some difficulties remain unresolved: (1) Key text information cannot be aligned with the corresponding image regions. (2) In existing models, the discriminator can only extract features of the entire image and concatenate them with the text features to judge the semantic consistency of text and image; it cannot determine which local image features are more relevant to the text information. How to better align text features with image features is an unresolved problem. In generating twin scenes of dams, special attention must be paid to the text description, because the type of dam and the description of the surrounding natural environment affect the generated twin scene.

Summary of the Invention

[0005] Therefore, this invention provides a method for modeling a dam twin scenario based on human-machine coupling to solve the problems mentioned in the background art.

[0006] To achieve the above objectives, the present invention provides the following technical solution: a method for modeling a dam twin scene based on human-computer coupling, comprising the following steps:

[0007] S1: Collect information related to dam modeling, including the dam's modeling structure and graphic style, to build a text-image pair dataset;

[0008] S2: Combine the collected data with the scene description to form text, and preprocess and augment the text and images to build a text-image pair dataset. Divide the text-image pair dataset into training set and validation set in a 7:3 ratio.

[0009] S3: Constructing a speech-to-text method;

[0010] S4: Build a generator. The text converted from speech will be input into the generator to generate an image.

[0011] S5: Construct a discriminator. The image generated by the generator is input into the discriminator for discrimination.

[0012] S6: Train the text-to-image generation model using the dataset, calculate the adversarial loss, and optimize the parameters;

[0013] S7: Using the trained text-to-image model, the user's verbal description is converted into text, and then the text is input into the text-to-image model to generate the model.

[0014] Preferably, the dam modeling-related information collected in S1 is as follows:

[0015] S11: The information collected for dam modeling includes the dam's structural features, functional features, and environmental factors. The structural features include the dam type, such as concrete gravity dams, arch dams, and earth-rock dams. The functional features include the dam's operating water level, minimum operating water level, and flood control high water level, as well as the dam's purpose, including power generation, irrigation, flood control, and drinking water supply. The environmental factors include the specific geographical location of the dam, including rivers, lakes, and valleys; hydrological characteristics; the surrounding vegetation, riverbanks, and natural landscape; and weather and climate, including local climate conditions and potential impacts such as rainfall and temperature.

[0016] S12: Using point cloud technology, the 3D information of the dam obtained in S11 is captured. After generating a 3D model, the key cross-sections, top view, and planar contour of the dam are extracted to provide input for text-based planar model generation.

[0017] Preferably, the specific content of constructing text and preprocessing text and images in S2 is as follows:

[0018] S21: Construct text. The text should cover the geometric features, structural details, design parameters (height, width, material type), and environmental conditions (topography and climate around the dam) of the dam as much as possible, so that the image generation model can understand the complete scene. The same dam can be described in different ways, including technical parameter descriptions, scene descriptions, and use cases, to increase the diversity of the training set and help the model learn different styles of text expression.

[0019] S22: Text preprocessing, removing useless symbols, emojis, and special characters, and generating different text expressions through synonym replacement to enhance the diversity of the dataset;

[0020] S23: Image preprocessing, adjusting all images to a uniform size of 256x256 pixels, and performing data augmentation on the images, including rotation, translation, and flipping.

[0021] Preferably, the speech-to-text settings in S3 are as follows:

[0022] S31: Use Baidu Speech Recognition to implement online speech recognition and convert real-time speech into text;

[0023] S32: The context is updated through a recurrent neural network (RNN) so that the system can understand and remember the task state and the user's historical input; at the same time, when the user makes a modification to the task, the system uses the RNN to understand the modification based on the historical context and combines it with the current state of the task to generate new task content.

[0024] S33: The updated text content will be used by the system to generate new dam modeling images or models to reflect the user's latest requirements.

[0025] Preferably, in step S32, the context is updated using a recurrent neural network (RNN), and the user proposes modifications to the task settings as follows:

[0026] S321: The recurrent neural network (RNN) uses the hidden state $h_t$ to gradually accumulate and store the user's historical input information. At time step $t$, the RNN receives the current input $x_t$ and the hidden state of the previous time step $h_{t-1}$, and generates the current hidden state $h_t$. The expression is:

[0027] $$h_t = \sigma\left(W_x x_t + W_h h_{t-1} + b\right);$$

[0028] Here $h_t$ is the hidden state at time step $t$, containing accumulated information about the current input and past inputs; $W_x$ is the weight matrix of the current input; $W_h$ is the weight matrix of the previous hidden state; $b$ is the bias term; and $\sigma$ is the tanh activation function, used to introduce nonlinearity. The hidden state $h_t$ contains all historical information, meaning that each update integrates the previous hidden state with the new input. As the time step $t$ increases, the RNN gradually accumulates the user's historical input and thus remembers the previous input content.

[0029] S322: When a user requests modifications to the task content, the system generates updated task content based on the current context state and the new input; the expression is:

[0030] $$h_t = \sigma\left(W_x x'_t + W_h h_{t-1} + b\right);$$

[0031] In this formula, the hidden state $h_t$ combines the modified input $x'_t$ with the previous context $h_{t-1}$ to form the updated context;

[0032] S323: To generate a task description that meets the latest user requirements, the system uses the updated hidden state $h_t$ to generate the new output $y_t$:

[0033] $$y_t = g\left(W_y h_t + b\right);$$

[0034] Here $y_t$ is the task description text generated at the current time step; $W_y$ is the weight matrix from the hidden state to the output; $b$ is the bias term; and $g$ is the sigmoid activation function of the output layer. The output $y_t$ is the updated task content, generated based on the latest context and the user's latest input, reflecting the user's intention to modify.
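The context update of S321 to S323 can be illustrated with a minimal sketch in plain Python. It assumes a single-layer recurrent cell with a tanh hidden activation and a sigmoid output layer, as the text describes; the names `rnn_step`, `rnn_output`, `W_x`, `W_h`, `W_y`, `b_y`, the toy dimensions, and all weight values are hypothetical and chosen only for illustration.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent update: tanh(W_x x_t + W_h h_prev + b)."""
    a = matvec(W_x, x_t)
    r = matvec(W_h, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]

def rnn_output(h_t, W_y, b_y):
    """Output layer: sigmoid(W_y h_t + b_y)."""
    pre = matvec(W_y, h_t)
    return [1.0 / (1.0 + math.exp(-(v + bi))) for v, bi in zip(pre, b_y)]

# Toy sizes (assumed): 2-d input embeddings, 3-d hidden state, 2-d output.
W_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]
W_y = [[0.2, -0.1, 0.4], [0.3, 0.3, -0.2]]
b_y = [0.0, 0.0]

h = [0.0, 0.0, 0.0]
# Two earlier user inputs followed by one modification request.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x in inputs:
    h = rnn_step(x, h, W_x, W_h, b)  # the context accumulates the history
y = rnn_output(h, W_y, b_y)          # updated task content
```

Because the modification is fed through the same update as earlier inputs, the final hidden state depends on the whole input history and not only on the last request.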

[0035] Preferably, the components and settings of the generator in step S4 are as follows:

[0036] S41: The text encoder uses a bidirectional LSTM network structure;

[0037] S42: The text-image feature fusion block consists of LSTM, a semantic mapping module, and MLP;

[0038] S43: The generator loss consists of adversarial loss and DAMSM loss, expressed as follows:

[0039] $$L_G = -\mathbb{E}_{G(z)\sim p_g}\left[D(G(z), s)\right] + \lambda_{DA} L_{DAMSM};$$

[0040] Here $s$ is the text description, $G(z)$ is the generated image, $D(\cdot)$ is the discriminator's judgment on whether the input image matches the input sentence, and $\lambda_{DA}$ is the weight of the DAMSM loss; the DAMSM loss measures the semantic consistency between text and images.

[0041] Preferably, the composition and settings of each part of the text image feature fusion block in step S42 are as follows:

[0042] S421: The LSTM first obtains its initial hidden state by initialization from the noise z, and then obtains new hidden states through the input gate, forget gate, and output gate, thereby retaining the more important text information. It also establishes long-term dependencies between fusion blocks and reduces the difficulty of skip training.

[0043] S422: The semantic mapping module contains, in order, upsample, conv, BN, ReLU, conv, and upsample layers, and is used to generate the semantic graph mapping $p_i$;

[0044] S423: The MLP contains, in order, Linear, ReLU, and Linear layers;

[0045] S424: The generated semantic graph mapping $p_i$ is added to the affine transformation.

[0046] Preferably, the LSTM is configured as follows in step S421:

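[0047] S4211: Initialize the LSTM using noise $z$; the expression is:

[0048] $$h_0 = \mathrm{MLP}_1(z), \quad c_0 = \mathrm{MLP}_2(z);$$

[0049] $h_0$ is the initial hidden state of the recurrent neural network (RNN), and $c_0$ is the initial cell state;

[0050] S4212: The update rules for the gates, the cell state $c_t$, and the hidden state $h_t$; the expressions are:

[0051] $$\begin{pmatrix} i_t \\ f_t \\ o_t \\ u_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}\left(T\begin{pmatrix} s \\ h_{t-1} \end{pmatrix}\right);$$

[0052] $$c_t = f_t \odot c_{t-1} + i_t \odot u_t;$$

[0053] $$h_t = o_t \odot \tanh(c_t);$$

[0054] $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate, respectively; $s$ is the sentence vector; $c_t$ is the cell state; $u_t$ is the candidate update computed from the linear combination $T$ of the input and the hidden state at the current time step $t$; $h_t$ is the hidden state of the recurrent neural network (RNN); and $\sigma$ is the sigmoid activation function.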
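As an illustration of S4211 and S4212, the following plain-Python sketch initializes an LSTM state from noise and then applies the gated update once per fusion block, feeding the same sentence vector at every step. It is a sketch under stated assumptions, not the patented implementation: the standard LSTM gating form, single linear layers standing in for the initialization networks, the helper names (`lstm_step`, `linear`, `random_matrix`), the dimensions, and the random weights are all hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def lstm_step(s, h_prev, c_prev, T_W, T_b):
    """One LSTM update conditioned on the sentence vector s."""
    d = len(h_prev)
    pre = linear(T_W, T_b, s + h_prev)            # 4d pre-activations
    i = [sigmoid(v) for v in pre[0:d]]            # input gate
    f = [sigmoid(v) for v in pre[d:2 * d]]        # forget gate
    o = [sigmoid(v) for v in pre[2 * d:3 * d]]    # output gate
    u = [math.tanh(v) for v in pre[3 * d:4 * d]]  # candidate values
    c = [fk * ck + ik * uk for fk, ck, ik, uk in zip(f, c_prev, i, u)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

rng = random.Random(0)
noise_dim, sent_dim, hidden_dim = 4, 3, 2         # assumed toy sizes
z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
s = [0.2, -0.1, 0.4]

# Initial hidden and cell states derived from the noise z
# (single linear layers are an assumption made for brevity).
h = linear(random_matrix(hidden_dim, noise_dim, rng), [0.0] * hidden_dim, z)
c = linear(random_matrix(hidden_dim, noise_dim, rng), [0.0] * hidden_dim, z)

T_W = random_matrix(4 * hidden_dim, sent_dim + hidden_dim, rng)
T_b = [0.0] * (4 * hidden_dim)

states = []
for _ in range(7):  # one step per fusion block; S62 uses 7 fusion blocks
    h, c = lstm_step(s, h, c, T_W, T_b)
    states.append(h)
```

Each entry of `states` would condition one fusion block, so later blocks see a state that carries information from the earlier ones.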

[0055] Preferably, the MLP is set as follows in step S423:

[0056] S4231: Two MLPs predict the text-conditioned channel-wise scaling parameter $\gamma$ and shift parameter $\beta$, respectively, with the following expressions:

[0057] $$\gamma = \mathrm{MLP}_1(h_t), \quad \beta = \mathrm{MLP}_2(h_t);$$

[0058] S4232: First, the scaling parameter $\gamma$ scales $x$ channel by channel, then the shift parameter $\beta$ shifts it channel by channel; the expression is:

[0059] $$\mathrm{AFF}(x_i) = \gamma_i \cdot x_i + \beta_i;$$

[0060] where $x_i$ is the $i$-th channel of the visual feature map, and $\gamma_i$ and $\beta_i$ are the scaling and shifting parameters for the $i$-th channel of the visual feature map.

[0061] Preferably, the expression by which the semantic graph mapping $p_i$ generated in step S424 is added to the affine transformation is as follows:

[0062] S4241: The expression is:

[0063] $$\tilde{x}_{c,h,w} = p_{i,(h,w)}\left(\gamma_c(h_t)\, x_{c,h,w} + \beta_c(h_t)\right);$$

[0064] $p_i$ is the semantic graph mapping generated in the semantic mapping module; used as a weight, it adds semantic information at the more important positions and determines how much the text information is amplified. $\gamma_c$ is the scaling parameter generated by the MLP, $\beta_c$ is the offset parameter generated by the MLP, $h_t$ is the hidden state of the recurrent neural network (RNN), $x_{c,h,w}$ is the image feature, and $\tilde{x}_{c,h,w}$ is the image feature after the affine transformation; $c$, $h$, and $w$ index the channel, height, and width of the image feature, respectively.
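A small numeric sketch may help show how a spatial weight map interacts with a channel-wise affine transformation. It assumes that the semantic mapping multiplies the scaled and shifted feature at each spatial position, which is one reading of "used as a weight" in the text; the function name `weighted_affine` and all values are hypothetical.

```python
def weighted_affine(x, gamma, beta, p):
    """Compute p[i][j] * (gamma[c] * x[c][i][j] + beta[c]) for every element."""
    return [
        [
            [p[i][j] * (gamma[c] * x[c][i][j] + beta[c])
             for j in range(len(x[c][i]))]
            for i in range(len(x[c]))
        ]
        for c in range(len(x))
    ]

# Toy feature map: 2 channels, 2 x 2 spatial positions.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.5, 0.5], [0.5, 0.5]]]
gamma = [2.0, -1.0]           # per-channel scaling (would come from an MLP)
beta = [0.5, 1.0]             # per-channel shift (would come from an MLP)
p = [[1.0, 0.0], [0.5, 1.0]]  # spatial weights from the semantic mapping module

out = weighted_affine(x, gamma, beta, p)
```

Positions where the weight is 0 receive no text-conditioned signal, while positions with weight 1 receive the full affine output.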

[0065] Preferably, the discriminator in step S5 is composed and configured as follows:

[0066] S51: Downsampling block; downsampling uses a convolutional layer with a stride of 2; the residual block consists of a 4x4 convolutional layer, a ReLU layer, a 3x3 convolutional layer, and a ReLU layer;

[0067] S52: Attention mechanism; First, the input image is downsampled by the downsampling block of S51 to extract the image features G, which are then concatenated with the spatially copied sentence vector s and input into the attention mechanism; The attention mechanism is incorporated into the discriminator, and a stronger discriminator in turn promotes the generator to generate images with stronger semantic consistency with the text, thereby promoting the semantic consistency between the generated image and the text.

[0068] S53: Cross-modal alignment; In the task of text-to-image generation, cross-modal alignment is particularly important. By performing cross-modal training on sentence vectors s and images, as well as generated fake images and real images, and introducing contrastive loss to align the semantics and images in the common space, the generated images have higher semantic consistency and realism.

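[0069] S54: The discriminator loss uses the adversarial loss combined with the matching-aware gradient penalty (MA-GP):

[0070] $$\begin{aligned} L_D = {} & -\mathbb{E}_{x\sim p_{data}}\left[\min\left(0, -1 + D(x, s)\right)\right] - \tfrac{1}{2}\,\mathbb{E}_{G(z)\sim p_g}\left[\min\left(0, -1 - D(G(z), s)\right)\right] \\ & - \tfrac{1}{2}\,\mathbb{E}_{x\sim p_{data}}\left[\min\left(0, -1 - D(x, \hat{s})\right)\right] + k\,\mathbb{E}_{x\sim p_{data}}\left[\left(\left\|\nabla_x D(x, s)\right\| + \left\|\nabla_s D(x, s)\right\|\right)^{p}\right]; \end{aligned}$$

[0071] Here $s$ is the sentence vector, $\hat{s}$ is a sentence vector that does not match the image, $x$ is the real image corresponding to $s$, $G(z)$ is the generated image, and $D(\cdot)$ is the discriminator's judgment on whether the input image matches the input sentence; $p$ and $k$ are hyperparameters of MA-GP; $p_{data}$ is the real data distribution, $p_g$ is the distribution of generated images, and $\mathbb{E}$ denotes the mathematical expectation.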
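To make the structure of such a loss concrete, the sketch below evaluates a hinge-style adversarial loss with a gradient penalty from already-computed discriminator scores and gradient norms. The hinge form, the equal 1/2 weighting of fake and mismatched pairs, and the default values `k=2.0` and `p=6.0` are assumptions made for illustration and are not stated in the source; in a real model the gradient norms would come from automatic differentiation.

```python
def hinge_real(score):
    """Penalty when a real, matching pair scores below +1."""
    return max(0.0, 1.0 - score)

def hinge_fake(score):
    """Penalty when a fake or mismatched pair scores above -1."""
    return max(0.0, 1.0 + score)

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake, d_mismatch,
                       grad_norms_x, grad_norms_s, k=2.0, p=6.0):
    """Hinge adversarial terms plus a gradient penalty on real, matching pairs."""
    adversarial = (mean([hinge_real(v) for v in d_real])
                   + 0.5 * mean([hinge_fake(v) for v in d_fake])
                   + 0.5 * mean([hinge_fake(v) for v in d_mismatch]))
    penalty = k * mean([(gx + gs) ** p
                        for gx, gs in zip(grad_norms_x, grad_norms_s)])
    return adversarial + penalty

# Confident, correct scores and flat gradients give zero loss.
loss_good = discriminator_loss([2.0], [-2.0], [-2.0], [0.0], [0.0])
# Undecided scores are penalized by every hinge term.
loss_undecided = discriminator_loss([0.0], [0.0], [0.0], [0.0], [0.0])
```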

[0072] Preferably, the attention mechanism in step S52 is composed and configured as follows:

[0073] S521: The attention mechanism generates an attention map of sentence vectors that suppresses irrelevant parts of the image and text; the expressions are:

[0074] $$e_{w,h} = \mathrm{MLP}\left(G_{w,h}, s\right);$$

[0075] $$\alpha_{w,h} = \frac{\exp\left(e_{w,h}\right)}{\sum_{w',h'} \exp\left(e_{w',h'}\right)};$$

[0076] $$S_{w,h} = \alpha_{w,h} \cdot s;$$

[0077] $S_{w,h}$ is the feature channel of the sentence vector $s$ at position $\{w,h\}$. First, the extracted image features $G_{w,h}$ and the sentence vector $s$ are fed together into the MLP to generate the energy $e_{w,h}$. The energy values $e_{w,h}$ are then used to calculate the attention weights $\alpha_{w,h}$, and finally $\alpha_{w,h}$ is combined with the sentence vector $s$ to obtain $S_{w,h}$. The attended sentence features $S$ are concatenated with $G$ and passed through a downsampling block to extract image features, which are then combined with the copied sentence vector $s$ to calculate the adversarial loss; this loss evaluates the realism of the generated image and its consistency with the text. The discriminator, enhanced by the attention mechanism, can better determine which local image features are more relevant to the text information, and the stronger discriminator, in turn, can promote a stronger generator.
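The three attention steps described above (energy from an MLP, normalization into weights, weighting of the sentence vector) can be sketched in plain Python as follows. The two-layer scoring network, the softmax normalization over all positions, the function names, and every numeric value are hypothetical choices made for this illustration.

```python
import math

def mlp_energy(g, s, w1, w2):
    """Tiny two-layer MLP scoring one image feature g against sentence vector s."""
    v = g + s  # concatenate image feature and sentence vector
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def spatial_attention(G, s, w1, w2):
    """Return attention weights and attended sentence features for a feature grid."""
    energies = [[mlp_energy(g, s, w1, w2) for g in row] for row in G]
    peak = max(e for row in energies for e in row)  # for numerical stability
    exps = [[math.exp(e - peak) for e in row] for row in energies]
    total = sum(e for row in exps for e in row)
    alpha = [[e / total for e in row] for row in exps]
    S = [[[a * sk for sk in s] for a in row] for row in alpha]
    return alpha, S

# 2 x 2 grid of 2-d image features and a 3-d sentence vector (toy values).
G = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.0, 0.0]]]
s = [1.0, -1.0, 0.5]
w1 = [[0.3, -0.2, 0.1, 0.4, 0.0], [-0.1, 0.5, 0.2, 0.0, 0.3]]
w2 = [1.0, -0.5]

alpha, S = spatial_attention(G, s, w1, w2)
```

The weights in `alpha` sum to 1 over the grid, so positions judged more relevant to the sentence receive a larger share of the sentence vector in `S`.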

[0078] Preferably, the cross-modal alignment in step S53 is composed and configured as follows:

[0079] S531: Establish cross-modal alignment at the image-image and image-text levels; the expression is:

[0080] $$\cos(u, v) = \frac{u^{T} v}{\|u\|\,\|v\|};$$

[0081] Cosine similarity is used as the distance metric, where $u$ is the embedding vector of an image, $v$ is the embedding vector of the text or of another image, and $T$ denotes the transpose in matrix operations;

[0082] The contrastive loss function is:

[0083] $$L(u_i, v_i) = -\log \frac{\exp\left(\cos(u_i, v_i)/\tau\right)}{\sum_{j=1}^{m} \exp\left(\cos(u_i, v_j)/\tau\right)};$$

[0084] $$L_{sr} = L\left(f(x_i), s_i\right);$$

[0085] $$L_{sf} = L\left(f(\hat{x}_i), s_i\right);$$

[0086] $$L_{rf} = L\left(f(x_i), f(\hat{x}_i)\right);$$

[0087] $m$ is the mini-batch size of the input samples; $u_i$ is the $i$-th sample of $u$ in the mini-batch; $v_i$ is the $i$-th sample of $v$ in the mini-batch; $v_j$ is the $j$-th sample of $v$; $\tau$ is the temperature hyperparameter; and exp is the exponential function. $f(\cdot)$ represents the image embedding extraction process in the discriminator, which projects the input image into the image embedding; $s$ is the sentence vector; $\hat{x}$ represents the generated fake image, and $x$ represents the real image. $L_{sr}$ is the loss function between the sentence and the real image; $L_{sf}$ is the loss function between the sentence and the generated fake image; and $L_{rf}$ is the loss function between real and fake images;
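The following plain-Python sketch shows cosine similarity and a batch contrastive loss of the kind described above, in which each sample's matching partner competes against the other samples in the mini-batch. The softmax-over-batch form, the temperature default `tau=0.1`, averaging over the batch, and the toy embeddings are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(U, V, tau=0.1):
    """Mean over i of -log softmax_j(cos(U[i], V[j]) / tau), evaluated at j = i."""
    m = len(U)
    total = 0.0
    for i in range(m):
        logits = [cosine(U[i], V[j]) / tau for j in range(m)]
        peak = max(logits)  # log-sum-exp trick for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / m

# Toy embeddings: matched pairs point in similar directions.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb_matched = [[2.0, 0.1], [0.1, 3.0]]
text_emb_swapped = [text_emb_matched[1], text_emb_matched[0]]

loss_matched = contrastive_loss(image_emb, text_emb_matched)
loss_swapped = contrastive_loss(image_emb, text_emb_swapped)
```

Correctly paired embeddings yield a lower loss than swapped ones, which is the signal that pulls matching text and image embeddings together in the common space.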

[0088] S532: The expression for the cross-modal alignment objective function is:

[0089] ;

[0090] ;

[0091] $L_C^G$ is the contrastive loss of the generator, and $L_C^D$ is the contrastive loss of the discriminator.

[0092] Preferably, the training model in step S6 specifically includes the following steps:

[0093] S61: Text description input text encoder generates sentence vectors and word features;

[0094] S62: Input the normally distributed noise vector into the fully connected layer to reshape it to the required size, and then input it together with the sentence vector into 7 text image feature fusion blocks to generate an image;

[0095] S63: In the discriminator, the generated image is further discriminated by downsampling blocks and attention mechanisms to calculate adversarial loss, and the extracted image features are aligned across modalities.

[0096] S64: The final objective function is a weighted combination of the adversarial loss and the contrastive loss, where $\lambda_1$ and $\lambda_2$ are the coefficient weights; the overall objective functions are:

[0097] $$L_G^{total} = L_G + \lambda_1 L_C^G;$$

[0098] $$L_D^{total} = L_D + \lambda_2 L_C^D.$$

[0099] The present invention has the following advantages:

[0100] 1. This invention can generate twin scenes of various dams, and can generate dam twin scenes that meet the requirements through human description, which has good reference value for dam modeling.

[0101] 2. The model designed in this invention is a single-stage end-to-end GAN model. After achieving human-computer interaction through speech-to-text conversion, the model can understand complex scenes and key semantics in the text. It can also generate natural and realistic twin scenes even with cluttered backgrounds. The generator can retain more critical text information and map semantics to image regions through text-image feature fusion, generating high-quality and realistic images. The generated images match the overall sentence semantics, achieving a good text-image semantic consistency effect. The discriminator can better identify image regions that are more relevant to the text information through the attention mechanism, promoting the text-image semantic consistency of the generated images. The stronger discriminator also promotes the generator to generate higher-quality images. Furthermore, the introduction of contrastive learning promotes the realism and semantic consistency of the generated images.

Attached Figure Description

[0102] Figure 1 is a flowchart illustrating the method of the present invention.

[0103] Figure 2 is a schematic diagram of the text-to-image generation model structure of the present invention.

[0104] Figure 3 is a schematic diagram of the text-image feature fusion block structure.

[0105] Figure 4 is a schematic diagram illustrating the training process of the text-to-image model.

Detailed Implementation

[0106] The following specific embodiments illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0107] As shown in Figures 1-4, this invention provides a method for modeling a dam twin scene based on human-machine coupling, which includes the following steps:

[0108] S1: Collect information related to dam modeling, including the dam's modeling structure and graphic style, to build a text-image pair dataset;

[0109] The relevant information for dam modeling was collected as follows:

[0110] S11: The information collected for dam modeling includes the dam's structural characteristics, functional characteristics, and environmental factors. Structural characteristics include the dam type, such as concrete gravity dams, arch dams, and earth-rock dams. Functional characteristics include: the dam's operating water level, minimum operating water level, and flood control high water level; the dam's purpose, including power generation, irrigation, flood control, and drinking water supply. Environmental factors include: geographical location, the specific geographical location of the dam, including rivers, lakes, and valleys; hydrological characteristics; surrounding vegetation, riverbanks, and natural landscapes; and weather and climate, local climate conditions, and potential impacts such as rainfall and temperature.

[0111] S12: Using point cloud technology to capture the 3D information of the dam, after generating a 3D model, extract key 2D information such as the key cross-sections, top view, and planar outline of the dam, providing rich input for text-based planar model generation.
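The 2D extraction in S12 can be pictured with a minimal sketch that derives a top view, one cross-section, and a crude planar outline from a list of 3D points. The axis convention (z vertical, y along the dam axis), the slicing tolerance, the bounding-box outline, and the function names are all assumptions for illustration; a real pipeline would operate on a dense point cloud with a dedicated library.

```python
def top_view(points):
    """Project 3D points (x, y, z) onto the horizontal plane."""
    return [(x, y) for x, y, _ in points]

def cross_section(points, y0, tolerance):
    """Keep points within `tolerance` of the plane y = y0 and return (x, z)."""
    return [(x, z) for x, y, z in points if abs(y - y0) <= tolerance]

def bounding_outline(points_2d):
    """Axis-aligned bounding box, used here as a crude planar outline."""
    xs = [pt[0] for pt in points_2d]
    ys = [pt[1] for pt in points_2d]
    return (min(xs), min(ys), max(xs), max(ys))

# Toy cloud: a triangular dam profile repeated at two stations along y.
cloud = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 0.0, 30.0),
         (0.0, 50.0, 0.0), (10.0, 50.0, 0.0), (5.0, 50.0, 30.0)]

plan = top_view(cloud)
section = cross_section(cloud, y0=0.0, tolerance=0.5)
outline = bounding_outline(plan)
```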

[0112] S2: Combine the collected data with the scene description to form text, and preprocess and augment the text and images to build a text-image pair dataset. Divide the dataset into training set and validation set in a 7:3 ratio.

[0113] The specific steps for constructing text and preprocessing text and images are as follows:

[0114] S21: Construct text. The text should cover the geometric features, structural details, design parameters (height, width, material type), and environmental conditions (topography and climate around the dam) of the dam as much as possible, so that the image generation model can understand the complete scene. The same dam can be described in different ways, including technical parameter descriptions, scene descriptions, and use cases, to increase the diversity of the training set and help the model learn different styles of text expression.

[0115] S22: Text preprocessing, removing useless symbols, emojis, and special characters, and generating different text expressions through synonym replacement to enhance the diversity of the dataset;

[0116] S23: Image preprocessing, adjust all images to a uniform size of 256x256 pixels, and perform data augmentation on the images (rotation, translation, flipping).
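The text cleaning of S22 and the 7:3 split of S2 can be sketched with the Python standard library as follows. The set of characters kept by the regular expression, the fixed shuffle seed, the sample descriptions, and the file names are hypothetical; synonym replacement and image augmentation are omitted because they depend on external resources.

```python
import random
import re

def clean_text(text):
    """Replace symbols outside a small allowed set with spaces, then collapse whitespace."""
    kept = re.sub(r"[^\w\s.,;:()\-]", " ", text)
    return re.sub(r"\s+", " ", kept).strip()

def split_pairs(pairs, train_ratio=0.7, seed=0):
    """Shuffle text-image pairs and split them into training and validation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical (description, image file) pairs with noisy symbols.
pairs = [
    (clean_text(f"Arch dam #{i} @@, height {50 + i} m!!"), f"dam_{i}.png")
    for i in range(10)
]
train, val = split_pairs(pairs)
```

With ten pairs, the split yields seven training pairs and three validation pairs, and a fixed seed keeps the partition reproducible between runs.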

[0117] S3: Constructing a speech-to-text method;

[0118] The components and settings of the speech-to-text method are as follows:

[0119] S31: Use Baidu Speech Recognition to implement online speech recognition and convert real-time speech into text;

[0120] S32: The system updates the context through a recurrent neural network (RNN), enabling it to understand and remember the task state and the user's historical input. When the user modifies the task, the system uses the RNN network structure to understand the modification based on the historical context and combines it with the current state of the task to generate new task content.

[0121] S321: The RNN uses the hidden state $h_t$ to progressively accumulate and store the user's historical input information. At time step $t$, the RNN receives the current input $x_t$ and the hidden state of the previous time step $h_{t-1}$, and generates the current hidden state $h_t$. The expression is:

[0122] $$h_t = \sigma\left(W_x x_t + W_h h_{t-1} + b\right);$$

[0123] Here $h_t$ is the hidden state at time step $t$, containing accumulated information about the current input and past inputs; $W_x$ is the weight matrix of the current input; $W_h$ is the weight matrix of the previous hidden state; $b$ is the bias term; and $\sigma$ is the tanh activation function, used to introduce nonlinearity. The hidden state $h_t$ contains all historical information, that is, each update integrates the previous hidden state with the new input. In this way, as the time step increases, the RNN gradually accumulates the user's historical input and "remembers" the previous input content.

[0124] S322: When a user requests modifications to the task content, the system generates updated task content based on the current context state and the new input; the expression is:

[0125] $$h_t = \sigma\left(W_x x'_t + W_h h_{t-1} + b\right);$$

[0126] In this formula, the hidden state $h_t$ combines the modified input $x'_t$ with the previous context $h_{t-1}$ to form the updated context;

[0127] S323: To generate a task description that meets the user's latest needs, the system uses the updated hidden state $h_t$ to generate the new output $y_t$:

[0128] $$y_t = g\left(W_y h_t + b\right);$$

[0129] Here $y_t$ is the task description text generated at the current time step; $W_y$ is the weight matrix from the hidden state to the output; $b$ is the bias term; and $g$ is the sigmoid activation function of the output layer. The output $y_t$ is the updated task content, generated based on the latest context and the user's latest input, reflecting the user's intention to modify.

[0130] S33: The updated text content will be used by the system to generate new dam modeling images or models that reflect the user's latest requirements.

[0131] S4: Build the generator;

[0132] The generator's components and settings are as follows:

[0133] S41: The text encoder uses a bidirectional LSTM network structure;

[0134] S42: The text-image feature fusion block consists of LSTM, a semantic mapping module, and MLP;

[0135] The composition and settings of each part of the text-image feature fusion block are as follows:

[0136] S421: In this embodiment, the LSTM first obtains its initial hidden state by initialization from the noise z, and then obtains new hidden states through the input gate, forget gate, and output gate. This not only retains the more important text information but also establishes long-term dependencies between fusion blocks and reduces the difficulty of skip training.

[0137] The LSTM settings are as follows:

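[0138] S4211: Initialize the LSTM using noise $z$; the expression is:

[0139] $$h_0 = \mathrm{MLP}_1(z), \quad c_0 = \mathrm{MLP}_2(z);$$

[0140] $h_0$ is the initial hidden state of the recurrent neural network (RNN), and $c_0$ is the initial cell state;

[0141] S4212: The update rules for the gates, the cell state $c_t$, and the hidden state $h_t$; the expressions are:

[0142] $$\begin{pmatrix} i_t \\ f_t \\ o_t \\ u_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}\left(T\begin{pmatrix} s \\ h_{t-1} \end{pmatrix}\right);$$

[0143] $$c_t = f_t \odot c_{t-1} + i_t \odot u_t;$$

[0144] $$h_t = o_t \odot \tanh(c_t);$$

[0145] $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate, respectively; $s$ is the sentence vector; $c_t$ is the cell state; $u_t$ is the candidate update computed from the linear combination $T$ of the input and the hidden state at the current time step $t$; $h_t$ is the hidden state of the RNN; and $\sigma$ is the sigmoid activation function.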

[0146] S422: The semantic mapping module contains, in order, upsample, conv, BN, ReLU, conv, and upsample layers, and is used to generate the semantic graph mapping $p_i$;

[0147] S423: The MLP contains, in order, Linear, ReLU, and Linear layers;

[0148] The MLP settings are as follows:

[0149] S4231: Two MLPs predict the text-conditioned channel-wise scaling parameter $\gamma$ and shift parameter $\beta$, respectively, with the following expressions:

[0150] $$\gamma = \mathrm{MLP}_1(h_t), \quad \beta = \mathrm{MLP}_2(h_t);$$

[0151] S4232: First, the scaling parameter $\gamma$ scales $x$ channel by channel, then the shift parameter $\beta$ shifts it channel by channel; the expression is:

[0152] $$\mathrm{AFF}(x_i) = \gamma_i \cdot x_i + \beta_i;$$

[0153] where $x_i$ is the $i$-th channel of the visual feature map, and $\gamma_i$ and $\beta_i$ are the scaling and shifting parameters for the $i$-th channel of the visual feature map.

[0154] S424: The generated semantic graph mapping $p_i$ is added to the affine transformation.

[0155] S4241: The expression is:

[0156] $$\tilde{x}_{c,h,w} = p_{i,(h,w)}\left(\gamma_c(h_t)\, x_{c,h,w} + \beta_c(h_t)\right);$$

[0157] $p_i$ is the semantic graph mapping generated in the semantic mapping module; used as a weight, it adds semantic information at the more important positions and determines how much the text information is amplified. $\gamma_c$ is the scaling parameter generated by the MLP, $\beta_c$ is the offset parameter generated by the MLP, $h_t$ is the hidden state of the recurrent neural network (RNN), $x_{c,h,w}$ is the image feature, and $\tilde{x}_{c,h,w}$ is the image feature after the affine transformation; $c$, $h$, and $w$ index the channel, height, and width of the image feature, respectively.

[0158] S43: The generator loss consists of adversarial loss and DAMSM loss, expressed as follows:

[0159] $$L_G = -\mathbb{E}_{G(z)\sim p_g}\left[D(G(z), s)\right] + \lambda_{DA} L_{DAMSM};$$

[0160] Here $s$ is the text description, $G(z)$ is the generated image, $D(\cdot)$ is the discriminator's judgment on whether the input image matches the input sentence, and $\lambda_{DA}$ is the weight of the DAMSM loss; the DAMSM loss measures the semantic consistency between text and images.

[0161] S5: Construct the discriminator;

[0162] S51: Downsampling block; downsampling uses a convolutional layer with a stride of 2; the residual block consists of a 4x4 convolutional layer, a ReLU layer, a 3x3 convolutional layer, and a ReLU layer;

[0163] S52: Attention mechanism; First, the input image is downsampled by a series of downsampling blocks to extract the image features G, which are then concatenated with the spatially copied sentence vector s and input into the attention mechanism; In order to promote the semantic consistency between the generated image and the text, this embodiment incorporates the attention mechanism into the discriminator, so that a stronger discriminator will in turn promote the generator to generate images with stronger semantic consistency with the text.

[0164] S521: The attention mechanism generates an attention map of sentence vectors that suppresses irrelevant parts of the image and text; the expression is:

[0165] ;

[0166] ;

[0167] ;

[0168] Here $s_{w,h}$ is the feature channel of the sentence vector s at position {w, h}. First, the extracted image features G and the sentence vector s are fed together into the MLP to generate energy values; the energy values are then used to calculate the attention weights; finally, the attention weights are combined with the sentence vector s to obtain the attention-weighted sentence vector. This sentence vector is concatenated with G and then processed through a downsampling block to extract image features. These features are then combined with the copied sentence vector s to calculate the adversarial loss, which evaluates the realism of the generated image and the consistency between the generated image and the text. The discriminator, enhanced by the attention mechanism, can better determine which local image features are more relevant to the text information, and the stronger discriminator, in turn, promotes a stronger generator.
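The energy, weight, and combination steps of S521 can be sketched in plain Python. Several details here are assumptions: the weights are normalized with a softmax over all spatial positions, the combination is an element-wise scaling of s by the weight at each position, and the `score` callable is a hypothetical stand-in for the MLP.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of energies."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, s, score):
    """Return attention weights and the weighted sentence vector per position.

    features: list of per-position image feature vectors (flattened H*W grid)
    s:        sentence vector, copied to every position
    score:    callable standing in for the MLP; maps the concatenated
              [feature; sentence] vector to a scalar energy
    """
    energies = [score(g + s) for g in features]  # list + list concatenates
    weights = softmax(energies)
    attended = [[a * v for v in s] for a in weights]
    return weights, attended


# Two positions with identical features receive identical weights.
weights, attended = attend([[0.0], [0.0]], [1.0, 2.0], score=sum)
```

A real discriminator would operate on tensors and use a learned MLP; the sketch only shows the order of operations.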

[0169] S53: Cross-modal alignment. Cross-modal alignment is particularly important in text-to-image generation. To give the generated images higher semantic consistency and realism, this embodiment performs cross-modal training between the sentence vector s and images, and between generated fake images and real images. It aligns semantics and images in a common space by introducing a contrastive loss;

[0170] S531: This embodiment establishes cross-modal alignment at the image-image and image-text levels; the expression is:

[0171] $\mathrm{sim}(u, v) = \dfrac{u^{T} v}{\lVert u \rVert \, \lVert v \rVert}$;

[0172] Cosine similarity is used as a distance metric, where u represents the embedding vector of the image, v represents the embedding vector of the text or another image, and T denotes the transpose in matrix operations;
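For reference, cosine similarity between two embedding vectors can be computed as in the short sketch below. The function name `cosine_similarity` and the use of plain lists are illustrative choices, not part of the method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

The value lies in [-1, 1]: parallel embeddings score 1, orthogonal embeddings score 0, and opposite embeddings score -1.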

[0173] The contrastive loss function is:

[0174] ;

[0175] ;

[0176] ;

[0177] ;

[0178] Here m is the mini-batch size of the input samples; $u_i$ is the i-th sample in the mini-batch u; $v_i$ is the i-th sample in the mini-batch v; $v_j$ is the j-th sample in the mini-batch v; the temperature is a hyperparameter; and exp is the exponential function. The image embedding extraction process in the discriminator projects the input image into the image embedding, and s denotes the sentence vector. The expressions distinguish the generated fake image from the real image x. The three losses are, respectively, the loss between the sentence and the real image, the loss between the sentence and the generated fake image, and the loss between real and fake images;
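A contrastive loss over a mini-batch of paired embeddings can be sketched as below. The sketch assumes the widely used InfoNCE-style form, in which each $u_i$ is scored against every $v_j$ in the batch and the matching pair $(u_i, v_i)$ is treated as the positive; averaging over the batch and the default temperature `tau=0.1` are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(us, vs, tau=0.1):
    """Mean InfoNCE-style loss; (us[i], vs[i]) are the positive pairs."""
    m = len(us)
    total = 0.0
    for i in range(m):
        logits = [cosine(us[i], vs[j]) / tau for j in range(m)]
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / m


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The same function could serve for the sentence-to-real-image, sentence-to-fake-image, and real-to-fake-image pairs by changing which embeddings are passed as `us` and `vs`.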

[0179] S532: The expression for the cross-modal alignment objective function is:

[0180] ;

[0181] ;

[0182] These are the contrastive loss of the generator and the contrastive loss of the discriminator, respectively;

[0183] S54: The discriminator loss is an adversarial loss combined with the Match-Aware Gradient Penalty (MA-GP):

[0184] ;

[0185] where s is the sentence vector, the mismatched counterpart of s is the sentence vector that does not match the text, and x is the real image corresponding to s. In the generated-image term the input image is the generated image, and D(·) is the discriminator's judgment of whether the input image matches the input sentence. p and a second coefficient are the hyperparameters of MA-GP, and the mathematical expectation is taken over the data distribution pdata.
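The structure of a hinge adversarial loss with MA-GP can be sketched as follows. The sketch assumes the form used in DF-GAN, where MA-GP was introduced: a hinge term on real images with matching text, half-weighted hinge terms on generated images and on real images with mismatched text, and a penalty on the gradient norm raised to the power p. The name `k` for the second hyperparameter, the default values, and the precomputed `grad_norms` argument (a framework with automatic differentiation would normally supply these) are all assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(real_match, fake_match, real_mismatch, grad_norms, k=2.0, p=6.0):
    """Hinge discriminator loss with a match-aware gradient penalty (assumed form).

    real_match:    D(x, s) on real images with matching sentences
    fake_match:    D(generated image, s) on generated images
    real_mismatch: D(x, mismatched s) on real images with wrong sentences
    grad_norms:    gradient norms of D at (x, s), computed elsewhere
    k, p:          MA-GP hyperparameters (placeholder defaults)
    """
    loss_real = mean([max(0.0, 1.0 - d) for d in real_match])
    loss_fake = mean([max(0.0, 1.0 + d) for d in fake_match])
    loss_mismatch = mean([max(0.0, 1.0 + d) for d in real_mismatch])
    penalty = k * mean([g ** p for g in grad_norms])
    return loss_real + 0.5 * (loss_fake + loss_mismatch) + penalty


confident = discriminator_loss([2.0], [-2.0], [-2.0], [0.0])
undecided = discriminator_loss([0.0], [0.0], [0.0], [1.0])
```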

[0186] S6: Train the text-to-image generation model using the dataset, calculate the adversarial loss, and optimize the parameters;

[0187] S61: The text description is input into the text encoder to generate sentence vectors and word features;

[0188] S62: The normally distributed noise vector is input into the fully connected layer and reshaped to the required size, and is then input together with the sentence vector into seven text-image feature fusion blocks to generate an image;

[0189] S63: In the discriminator, the generated image is further discriminated by downsampling blocks and attention mechanisms to calculate adversarial loss, and the extracted image features are aligned across modalities.

[0190] S64: The final objective function is obtained by a weighted combination of the adversarial loss and the contrastive loss, in which the two coefficients are the weights. The overall objective function is:

[0191] ;

[0192] ;
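Following the description in S64, the overall objectives can be written in the form below. All symbols are assumptions introduced for illustration: $L_G$ and $L_D$ stand for the generator and discriminator losses, $L_C^{G}$ and $L_C^{D}$ for the generator and discriminator contrastive losses, and $\lambda_1$ and $\lambda_2$ for the coefficient weights. Which coefficient multiplies which term is likewise assumed.

```latex
\[
\begin{aligned}
L_G^{\mathrm{total}} &= L_G + \lambda_1 L_C^{G} \\
L_D^{\mathrm{total}} &= L_D + \lambda_2 L_C^{D}
\end{aligned}
\]
```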

[0193] S7: Using the trained text-to-image model, the user's verbal description is converted into text, and then the text is input into the text-to-image model to generate the model.

[0194] Although the present invention has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to it, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the present invention.

Claims

1. A method for modeling a dam twin scenario based on human-machine coupling, characterized in that it includes the following steps:
S1: Collect information related to dam modeling, including the dam's modeling structure and graphic style, to build a text-image pair dataset;
S2: Combine the collected data with the scene description to form text, and preprocess and augment the text and images to build the text-image pair dataset; divide the text-image pair dataset into a training set and a validation set in a 7:3 ratio;
S3: Construct a speech-to-text method;
S4: Build a generator; the text converted from speech is input into the generator to generate an image;
S5: Construct a discriminator; the image generated by the generator is input into the discriminator for discrimination; the discriminator is composed and configured as follows:
S51: Downsampling block; downsampling uses a convolutional layer with a stride of 2; the residual block consists of a 4x4 convolutional layer, a ReLU layer, a 3x3 convolutional layer, and a ReLU layer;
S52: Attention mechanism; first, the input image is downsampled by the downsampling block of S51 to extract the image features G, which are then concatenated with the spatially copied sentence vector s and input into the attention mechanism; the attention mechanism is incorporated into the discriminator to promote semantic consistency between the generated image and the text, and a stronger discriminator in turn promotes the generator to generate images with stronger semantic consistency with the text; the attention mechanism consists of the following components and settings:
S521: The attention mechanism generates an attention map of sentence vectors that suppresses irrelevant parts of the image and text; the expression is: ; ; ; where $s_{w,h}$ is the feature channel of the sentence vector s at position {w, h}; first, the extracted image features G and the sentence vector s are fed together into the MLP to generate energy values; the energy values are then used to calculate the attention weights; finally, the attention weights are combined with the sentence vector s to obtain the attention-weighted sentence vector; this sentence vector is concatenated with G and then processed through a downsampling block to extract image features; these features are then combined with the copied sentence vector s to calculate the adversarial loss, which evaluates the realism of the generated image and the consistency between the generated image and the text; the discriminator, enhanced by the attention mechanism, can better determine which local image features are more relevant to the text information, and the stronger discriminator, in turn, promotes a stronger generator;
S53: Cross-modal alignment; by performing cross-modal training between the sentence vector s and images, and between generated fake images and real images, and by introducing a contrastive loss to align the semantics and images in a common space, the generated images have higher semantic consistency and realism; the cross-modal alignment components and settings are as follows:
S531: Establish cross-modal alignment at the image-image and image-text levels; the expression is: ; cosine similarity is used as a distance metric, where u represents the embedding vector of the image, v represents the embedding vector of the text or another image, and T denotes the transpose in matrix operations; the contrastive loss function is: ; ; ; ; where m is the size of the mini-batch of input samples; $u_i$ is the i-th sample in the mini-batch u; $v_i$ is the i-th sample in the mini-batch v; $v_j$ is the j-th sample in the mini-batch v; the temperature is a hyperparameter; exp is the exponential function; the image embedding extraction process in the discriminator projects the input image into the image embedding; s denotes the sentence vector; the expressions distinguish the generated fake image from the real image x; the three losses are, respectively, the loss between the sentence and the real image, the loss between the sentence and the generated fake image, and the loss between real and fake images;
S532: The expression for the cross-modal alignment objective function is: ; ; these are the contrastive loss of the generator and the contrastive loss of the discriminator, respectively;
S54: The discriminator loss is an adversarial loss combined with the match-aware gradient penalty (MA-GP): ; where s is the sentence vector, the mismatched counterpart of s is the sentence vector that does not match the text, and x is the real image corresponding to s; in the generated-image term the input image is the generated image, and D(·) is the discriminator's judgment of whether the input image matches the input sentence; p and a second coefficient are the hyperparameters of MA-GP; the mathematical expectation is taken over the data distribution pdata;
S6: Train the text-to-image generation model using the dataset, calculate the adversarial loss, and optimize the parameters;
S7: Using the trained text-to-image model, the user's verbal description is converted into text, and then the text is input into the text-to-image model to generate the model.

2. The method for modeling a dam twin scenario based on human-machine coupling according to claim 1, characterized in that: The speech-to-text settings in S3 are as follows: S31: Use Baidu Speech Recognition to implement online speech recognition and convert real-time speech into text; S32: The context is updated through a recurrent neural network (RNN) so that the system can understand and remember the task state and the user's historical input; at the same time, when the user makes a modification to the task, the system uses the RNN to understand the modification based on the historical context and combines it with the current state of the task to generate new task content. S33: The updated text content will be used by the system in S6 to generate new dam modeling images or models to reflect the user's latest requirements.

3. The method for modeling a dam twin scenario based on human-machine coupling according to claim 2, characterized in that: in step S32, the context is updated using a recurrent neural network (RNN) and the user proposes modifications to the task, with the following settings:
S321: The recurrent neural network (RNN) uses its hidden state to gradually accumulate and store the user's historical input information. At time step t, the RNN receives the current input and the hidden state of the previous time step and generates the current hidden state; the expression is: ; where $h_t$ is the hidden state at time step t, containing accumulated information about the current input and past inputs; the two weight matrices are those of the current input and of the previous hidden state; b is the bias term; and σ is the tanh activation function, used to introduce nonlinearity. The hidden state contains all historical information, meaning that each update integrates the previous hidden state with the new input. As the time step t increases, the RNN gradually accumulates the user's historical input and thus remembers the previous input content;
S322: When a user requests modifications to the task content, the system generates updated task content based on the current context state and the new input; the expression is: ; in this formula, the hidden state is combined with the modified input and the previous context to form the updated context;
S323: To generate a task description that meets the latest user requirements, the system generates a new output from the hidden state, based on the updated settings: ; where the output is the task description text generated at the current time step, the weight matrix maps the hidden state to the output, b is the bias term, and g is the sigmoid activation function of the output layer; the output is the updated task content, generated based on the latest context and the user's latest input, reflecting the user's intention to modify.

4. The dam twin scene modeling method based on human-machine coupling according to claim 3, characterized in that: the components and settings of the generator in step S4 are as follows: S41: The text encoder uses a bidirectional LSTM network structure; S42: The text-image feature fusion block consists of an LSTM, a semantic mapping module, and an MLP; S43: The generator loss consists of the adversarial loss and the DAMSM loss, expressed as follows: ; where s is the text description, the input image is the generated image, D(·) is the discriminator's judgment of whether the input image matches the input sentence, and the weighting coefficient is the weight of the DAMSM loss; the DAMSM loss measures the semantic consistency between text and images.

5. The method for modeling a dam twin scenario based on human-machine coupling according to claim 4, characterized in that: the composition and settings of each part of the text-image feature fusion block in step S42 are as follows: S421: The LSTM first obtains the initial hidden state of the recurrent neural network (RNN) through initialization with the noise z, and then obtains the new hidden state through the subsequent input gate, forget gate, and output gate, thus retaining the more important text information; it can also establish long-term dependencies between fusion blocks and reduce the difficulty of skip training; S422: The semantic mapping module contains structures in the order upsample, conv, BN, ReLU, conv, upsample, and is used to generate the semantic graph mapping $p_i$; S423: The MLP contains structures in the order Linear, ReLU, Linear; S424: The finally generated semantic graph mapping $p_i$ is added to the affine transformation.

6. The dam twin scene modeling method based on human-machine coupling according to claim 5, characterized in that: the LSTM settings in step S421 are as follows: S4211: Initialize the LSTM using the noise z; the expression is: , ; the two initialized quantities are the initial hidden state of the recurrent neural network (RNN) and the initial cell state; S4212: The update rules; the expression is: ; ; ; the three gates are the input gate, forget gate, and output gate, respectively; s is the sentence vector; the cell state is updated from a linear combination of the input and the hidden state at the current time step t, where $h_t$ is the hidden state of the recurrent neural network (RNN); and $\gamma_i$ and $\beta_i$ are the scaling and shifting parameters for the i-th channel of the visual feature map.

7. The method for modeling a dam twin scenario based on human-machine coupling according to claim 6, characterized in that: the MLP settings in step S423 are as follows: S4231: Two MLPs predict the channel-wise scaling parameter γ and shift parameter β, respectively, conditioned on the language input, with the following expressions: ; S4232: First, the parameter γ is used to scale x channel-wise, and then the shift parameter β is used to shift it channel-wise; the expression is: ; where $x_i$ is the information of the i-th channel of the visual feature map.

8. A method for modeling a dam twin scenario based on human-machine coupling according to claim 5, characterized in that: in step S424, the expression by which the finally generated semantic graph mapping $p_i$ is added to the affine transformation is as follows: S4241: The expression is: ; where $p_i$ is the semantic graph mapping generated in the semantic mapping module; used as a weight, it adds semantic information at the more important positions and determines how strongly the text information is amplified; $\gamma$ and $\beta$ are the scaling parameters and offset parameters generated by the MLP; $h_t$ is the hidden state of the recurrent neural network (RNN); the remaining two terms are the image features before and after the affine transformation; t, h, and w represent the number of channels, height, and width of the image, respectively.

9. A method for modeling a dam twin scenario based on human-machine coupling according to claim 1, characterized in that: the training of the model in step S6 specifically includes the following steps: S61: The text description is input into the text encoder to generate sentence vectors and word features; S62: The normally distributed noise vector is input into the fully connected layer and reshaped to the required size, and is then input together with the sentence vector into seven text-image feature fusion blocks to generate an image; S63: In the discriminator, the generated image is further discriminated by the downsampling blocks and the attention mechanism to calculate the adversarial loss, and the extracted image features are aligned across modalities; S64: Finally, the overall objective function is obtained by a weighted combination of the adversarial loss and the contrastive loss, in which the two coefficients are the weights; the overall objective function is: ; ; where the terms are the generator loss, the discriminator loss, and the generator contrastive loss from the cross-modal alignment objective function expression.