Method and system for optimizing a multimodal semantic parsing model of oilfield industrial images based on multi-scale feature fusion

The multimodal semantic parsing model for oilfield industrial images, which integrates multi-scale feature fusion and cross-semantic interaction, solves the problems of insufficient feature extraction for large facilities and minor defects in oilfield scenarios and inaccurate cross-modal semantic interaction, achieving efficient semantic parsing and model generalization.

CN122265786A · Pending · Publication Date: 2026-06-23 · DAQING ANRUIDA TECH DEV CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DAQING ANRUIDA TECH DEV CO LTD
Filing Date
2026-04-03
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing general models lack specific design for semantic parsing of oilfield industrial images, resulting in insufficient feature extraction of large facilities and minor defects in oilfield scenes, inaccurate cross-modal semantic interaction, and data scarcity and high cost limiting the generalization ability of the models.

Method used

A multi-scale feature fusion model for multimodal semantic parsing of oilfield industrial images is adopted. A four-level downsampling is performed through a window attention-based Transformer architecture. Combined with a spatial attention enhancement mechanism and a cross-semantic interaction module, an oilfield-specific semantic bias is introduced, and a dual-source data processing flow of synthetic and real data is constructed.

Benefits of technology

It significantly improves the accuracy of oilfield equipment and defect identification and semantic parsing, enhances the model's generalization ability, and solves the problems of multi-scale feature fusion and cross-modal semantic interaction in oilfield scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a multi-scale feature fusion oilfield industrial image multi-modal semantic analysis model optimization method and system, and belongs to the field of oilfield industrial intelligent detection. The method solves the problem of insufficient adaptability of the interaction mechanism of the existing cross-modal model, such as the model based on the CLIP architecture, to the knowledge of the oilfield. The method comprises the following steps: preprocessing the oilfield industrial image and obtaining corresponding text features; performing feature extraction through a multi-scale feature extraction network to obtain a visual feature set containing at least two different scales; processing each scale visual feature in the visual feature set by using a spatial attention enhancement mechanism to obtain enhanced visual features and performing multi-scale feature fusion to obtain high-dimensional fused visual features; interacting the high-dimensional visual features and the text features through a cross-semantic interaction module; and analyzing the features after the interaction to output the semantic information of the target in the oilfield industrial image. The method is used in the field of oilfield industrial image semantic analysis.

Description

Technical Field

[0001] This invention belongs to the field of intelligent detection in the oilfield industry, and in particular relates to a method for optimizing a multimodal semantic parsing model of oilfield industrial images through multi-scale feature fusion.

Background Technology

[0002] Intelligent semantic analysis of oilfield industrial images—that is, automatically identifying equipment types (such as storage tanks, pipelines, and valves), defect categories (such as cracks, corrosion, and leaks), and their location information in images—is a core technology for realizing intelligent inspection, predictive maintenance, and safe production in oilfields. With the development of computer vision and deep learning technologies, general object detection and image segmentation models based on convolutional neural networks (CNN) and Transformer architectures, as well as cross-modal understanding models combining visual and textual information, have achieved significant results in many general scenarios. However, when these existing technologies are directly applied to the highly specialized and complex industrial scenario of oilfields, they still face a series of inherent core defects stemming from scenario specificity, resulting in insufficient accuracy, robustness, and practicality of semantic analysis.

[0003] First, at the spatial feature extraction level, existing general models lack specific designs tailored to the unique characteristics of oilfield scenes. Targets in oilfield images, such as equipment welds, pipeline cracks, and corrosion patches, typically exhibit strong location sensitivity and significant edge and texture features. However, whether it's classic CNN-based models like ResNet and YOLO series or the Vision Transformer based on global self-attention, their spatial feature extraction or attention mechanisms are often designed for general objects, failing to consider the high-contrast edges and specific structural patterns of oilfield targets. For example, general self-attention mechanisms calculate the relationships between all regions in an image equally, failing to specifically enhance attention to key local structures such as pipeline alignment and weld continuity. This results in insufficient spatial feature representation of subtle cracks, making it difficult to achieve pixel-level or sub-pixel-level accurate localization and segmentation.

[0004] Secondly, in terms of multi-scale feature fusion, existing methods struggle to effectively handle the significant differences in target scale within oilfield images. Oilfield scenes simultaneously encompass large facilities such as storage tanks and long-distance pipelines, medium-sized equipment like valves and flanges, and minute defects such as weld cracks and pinhole leaks. Mainstream multi-scale fusion architectures, such as Feature Pyramid Networks (FPN) and their variants, fuse features from different levels through top-down and bottom-up paths. However, these general fusion strategies are prone to losing or obscuring shallow, high-resolution, fine-grained features representing minute defects during the fusion process. Especially in the context of oilfields, texture information characterizing corrosion initiation points or micro-cracks is easily submerged or smoothed out by noise during multiple downsampling and upsampling processes, leading to an imbalance in the model's perception of multi-scale targets and a significant reduction in recall for small-scale defects.

[0005] Furthermore, in terms of cross-modal semantic interaction between vision and text, existing cross-modal models, such as those based on the CLIP architecture, lack sufficient adaptability to oilfield domain knowledge. These models are typically pre-trained on large-scale general image-text pairs, and their attention mechanisms learn the visual-textual associations of general objects. When directly applied to oilfields, these models struggle to accurately establish strong semantic associations between specific grayscale variation regions and stress corrosion cracks, or between high-temperature oxidation spots and overheating defects. Because the general cross-modal attention weights lack biases for specific semantics related to oilfield equipment, operating conditions, and defect patterns, the models cannot fully utilize prior textual information to guide and correct visual understanding, thus limiting the accuracy and reliability of cross-modal parsing.

[0006] Finally, regarding the data foundation for model training, the scarcity of real-world image data from oilfields and the high cost of annotation constitute a significant bottleneck. Due to the stringent safety regulations, harsh environment, and confidentiality of equipment at oilfield sites, obtaining a large number of high-quality real-world images covering various equipment states, defect types, and lighting and weather conditions is extremely difficult. Existing technical solutions typically rely solely on limited real-world data for training, resulting in poor model generalization ability and an inability to adapt to unfamiliar new scenarios or equipment. Furthermore, the lack of a standardized processing workflow that seamlessly integrates synthetic and real-world data further limits the possibility of using simulation techniques to expand datasets and improve model robustness.

Summary of the Invention

[0007] In view of this, the present invention aims to propose a method and system for optimizing multimodal semantic parsing models of oilfield industrial images by multi-scale feature fusion, in order to solve the problem that the interaction mechanism of existing cross-modal models, such as models based on CLIP architecture, is not adaptable to knowledge in the oilfield domain.

[0008] To achieve the above objectives, the present invention adopts the following technical solution: a method for optimizing a multimodal semantic parsing model of oilfield industrial images based on multi-scale feature fusion, the method comprising: preprocessing the oilfield industrial images and obtaining the corresponding text features; performing feature extraction on the preprocessed oilfield industrial images using a multi-scale feature extraction network to obtain a visual feature set containing at least two different scales; processing the visual features at each scale in the visual feature set with a spatial attention enhancement mechanism to obtain enhanced visual features; fusing the enhanced visual features at each scale to obtain fused high-dimensional visual features; having the high-dimensional visual features and the text features interact through a cross-semantic interaction module, which introduces an oilfield-specific semantic bias when calculating attention weights; and parsing the features after the interaction to output the semantic information of the targets in the oilfield industrial image.

[0009] Furthermore, a preferred approach is proposed, wherein the multi-scale feature extraction network adopts a window attention-based Transformer architecture to perform four levels of downsampling on the input image, obtaining visual features at scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32, respectively, which constitute the visual feature set.

[0010] Furthermore, a preferred approach is proposed, wherein the spatial attention enhancement mechanism specifically comprises: performing adaptive pooling on the input visual feature map along the height and width dimensions to obtain height-dimension pooled features and width-dimension pooled features; concatenating the height-dimension pooled features with the width-dimension pooled features that have undergone dimension permutation to obtain the fused features; subjecting the fused features to convolution, normalization, and activation; splitting the activated features into height-dimension features and width-dimension features, and applying a convolution and an activation function to each to obtain a height-dimension attention weight map and a width-dimension attention weight map; multiplying the height-dimension attention weight map and the width-dimension attention weight map by the oilfield edge enhancement coefficient to obtain the corrected attention weight map; and multiplying the corrected attention weight map by the input visual feature map to obtain the spatial attention-enhanced feature map.

[0011] Furthermore, a preferred method is proposed, wherein fusing the enhanced visual features at each scale to obtain the fused high-dimensional visual features includes: applying convolutional dimensionality reduction to the spatial attention-enhanced visual features at each scale to obtain the lateral enhancement features; using a feature pyramid network (FPN) structure to perform top-down feature fusion on the lateral enhancement features to obtain a first fused feature set; using a path aggregation network (PAN) structure to perform bottom-up feature fusion on the first fused feature set, with the spatial attention enhancement mechanism applied after each downsampling fusion, to obtain a second fused feature set; and, after resizing the features at each scale in the second fused feature set to the same size, concatenating and convolving them to obtain the high-dimensional visual features.

[0012] Furthermore, a preferred approach is proposed, in which the cross-semantic interaction module performs the following operations: projecting the high-dimensional visual features and the text features onto a unified hidden dimension to obtain visual query features, text key features, and text value features; splitting the visual query features, text key features, and text value features into multiple attention heads; adding the oilfield-specific semantic bias during the attention weight calculation for each attention head; concatenating the outputs of the attention heads and projecting them back to the visual feature dimension; and adding the projected features to the input high-dimensional visual features through a residual connection weighted by a residual coefficient to obtain the cross-modal fusion features.

[0013] Furthermore, a preferred embodiment is proposed, wherein the method further includes: automatically generating a synthetic oilfield image dataset with semantic annotations through a synthetic data generation unit; loading and parsing the real oilfield images and their annotation files in a standardized manner using a real data loading unit to obtain a standardized real oilfield image dataset; and, during model training or inference, selecting a data source from the synthetic oilfield image dataset or the real oilfield image dataset.

[0014] Based on the same inventive concept, this invention also proposes a system for optimizing a multimodal semantic parsing model of oilfield industrial images based on multi-scale feature fusion, the system being used to implement the method described in any of the above, the system comprising: a data input and preprocessing module, used to load and preprocess oilfield industrial images and obtain the corresponding text features; a multi-scale feature extraction module, used to extract a multi-scale visual feature set from the preprocessed image; an oilfield-specific spatial attention enhancement module, used to enhance the spatial attention of the features in the multi-scale visual feature set; a multi-scale feature fusion module, used to fuse the enhanced multi-scale visual features; a cross-modal semantic interaction module, used to make the fused visual features and the text features interact while introducing an oilfield-specific semantic bias; a semantic parsing module, used to parse the semantic information of the target based on the features after the interaction; and a visualization module, used to provide a human-computer interaction interface, control the operation of the system, and display the analysis results.

[0015] Furthermore, a preferred embodiment is proposed, wherein the data input and preprocessing module includes a dual-source data processing unit, the dual-source data processing unit comprising: a synthetic data generation unit, used to automatically generate synthetic images containing oilfield equipment, defects, and corresponding semantic annotations based on preset oilfield scene parameters; and a real data loading unit, used to load real oilfield image files and parse the associated annotation files to obtain standardized semantic information.

[0016] Based on the same inventive concept, the present invention also proposes a computer device, including a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes an optimization method for a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion as described in any of the preceding claims.

[0017] Based on the same inventive concept, the present invention also proposes a computer-readable storage medium storing a computer program, which, when run by a processor, executes the steps of the multi-scale feature fusion oilfield industrial image multimodal semantic parsing model optimization method as described in any one of the above.

[0018] Compared with the prior art, the beneficial effects of the present invention are as follows. This invention designs a spatial attention mechanism that combines dimension decoupling with an oilfield edge enhancement coefficient. This mechanism enables the model to specifically enhance its ability to perceive the strong positional sensitivity and significant edge features of oilfield equipment and defects. As a result, in tests on both public and field datasets, the model demonstrates a significant improvement in the accuracy of identifying these edge features compared to general spatial attention mechanisms.

[0019] This invention effectively solves the challenge of feature fusion and utilization in oilfield images, ranging from large storage tanks to minute defects. By deeply integrating spatial attention enhancement mechanisms into the FPN-PAN multi-scale feature fusion process and performing uniform-size stitching at the end, this scheme can adaptively retain and enhance key features at each scale, especially fine-grained features that are easily lost and represent minute defects. Experiments show that the feature representation capability and multi-scale target detection performance of this fusion method are significantly improved.

[0020] This invention achieves precise alignment of cross-modal semantic interaction in the oilfield professional field. By introducing an oilfield-specific semantic bias into the visual-text cross-modal attention computation and supplementing it with residual connections with adjustable coefficients, this technique strengthens the semantic association between visual features and oilfield professional text descriptions such as equipment type and defect name. This enables the model to more accurately understand and resolve the visual-text correspondences unique to oilfield scenes, thereby improving the accuracy of cross-modal semantic parsing.

A four-level downsampling network is constructed using a window attention-based Transformer architecture. The four scales (1 / 4, 1 / 8, 1 / 16, and 1 / 32) are designed to specifically match the feature extraction needs of large equipment, medium-sized equipment and large-area defects, and fine-grained micro-defects in oilfield scenes, respectively. This design avoids the feature redundancy or loss problems that may occur when extracting oilfield targets in general architectures and reduces computational overhead with window attention.

By constructing a dual-source data processing mechanism of automatic generation of synthetic data and standardized loading of real data, synthetic oilfield images with complete and accurate annotations can be generated in batches and seamlessly switched with real data. This not only significantly improves the efficiency of data preparation and processing, but more importantly, it effectively enhances the model's generalization ability by expanding high-quality training data.

Description of the Drawings

[0021] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings: Figure 1 is a flowchart of the optimization method for a multi-scale feature fusion oilfield industrial image multimodal semantic parsing model according to the present invention; Figure 2 is a schematic diagram of the oilfield multi-scale feature fusion process described in this invention; Figure 3 is a schematic diagram of the cross-modal semantic interaction process for oilfield adaptation described in this invention.

Detailed Implementation

[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of the present invention can be combined with each other, and the described embodiments are only some embodiments of the present invention, not all embodiments.

[0023] Implementation Method 1: This implementation method addresses the problem of insufficient adaptability of existing cross-modal models, such as those based on the CLIP architecture, to oilfield domain knowledge by proposing a multi-scale feature fusion-based optimization method for multimodal semantic parsing models of oilfield industrial images. The method includes: preprocessing the oilfield industrial images and obtaining the corresponding text features; performing feature extraction on the preprocessed oilfield industrial images using a multi-scale feature extraction network to obtain a visual feature set containing at least two different scales; processing the visual features at each scale in the visual feature set with a spatial attention enhancement mechanism to obtain enhanced visual features; fusing the enhanced visual features at each scale to obtain fused high-dimensional visual features; having the high-dimensional visual features and the text features interact through a cross-semantic interaction module, which introduces an oilfield-specific semantic bias when calculating attention weights; and parsing the features after the interaction to output the semantic information of the targets in the oilfield industrial image.

[0024] In this embodiment, the multi-scale feature extraction network adopts a window attention-based Transformer architecture to perform four levels of downsampling on the input image, obtaining visual features at scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32, respectively, which constitute the visual feature set.

[0025] In this embodiment, the spatial attention enhancement mechanism is specifically as follows: adaptive pooling is performed on the input visual feature map along the height and width dimensions to obtain height-dimension pooled features and width-dimension pooled features; the height-dimension pooled features are concatenated with the width-dimension pooled features that have undergone dimension permutation to obtain the fused features; the fused features are subjected to convolution, normalization, and activation; the activated features are split into height-dimension features and width-dimension features, and a convolution and an activation function are applied to each to obtain a height-dimension attention weight map and a width-dimension attention weight map; the height-dimension attention weight map and the width-dimension attention weight map are multiplied by the oilfield edge enhancement coefficient to obtain the corrected attention weight map; and the corrected attention weight map is multiplied by the input visual feature map to obtain the spatial attention-enhanced feature map.

[0026] In this embodiment, the step of fusing the enhanced visual features at each scale to obtain the fused high-dimensional visual features includes: applying convolutional dimensionality reduction to the spatial attention-enhanced visual features at each scale to obtain the lateral enhancement features; using a feature pyramid network (FPN) structure to perform top-down feature fusion on the lateral enhancement features to obtain a first fused feature set; using a path aggregation network (PAN) structure to perform bottom-up feature fusion on the first fused feature set, with the spatial attention enhancement mechanism applied after each downsampling fusion, to obtain a second fused feature set; and, after resizing the features at each scale in the second fused feature set to the same size, concatenating and convolving them to obtain the high-dimensional visual features.

[0027] In this embodiment, the cross-semantic interaction module performs the following operations: projecting the high-dimensional visual features and the text features onto a unified hidden dimension to obtain visual query features, text key features, and text value features; splitting the visual query features, text key features, and text value features into multiple attention heads; adding the oilfield-specific semantic bias during the attention weight calculation for each attention head; concatenating the outputs of the attention heads and projecting them back to the visual feature dimension; and adding the projected features to the input high-dimensional visual features through a residual connection weighted by a residual coefficient to obtain the cross-modal fusion features.

[0028] In this embodiment, the method further includes: automatically generating a synthetic oilfield image dataset with semantic annotations through a synthetic data generation unit; loading and parsing the real oilfield images and their annotation files in a standardized manner using a real data loading unit to obtain a standardized real oilfield image dataset; and, during model training or inference, selecting a data source from the synthetic oilfield image dataset or the real oilfield image dataset.

[0029] Unlike general, indiscriminate spatial attention, this implementation proposes an attention generation principle that integrates dimensional decoupling and scene coefficient enhancement. First, compression and feature extraction are performed along both the height and width spatial dimensions to decouple and independently perceive position-sensitive features in different directions, such as horizontal pipes and vertical welds. Then, a learnable enhancement coefficient specific to oilfield edge features is introduced to perform scene-based correction on the attention weights. This embeds prior knowledge of the oilfield scene into the attention map generation process, enabling targeted enhancement of the core features of the scene in principle.

[0030] Unlike traditional FPN / PAN methods that simply add or stitch together indiscriminate features, this implementation uses spatial attention enhancement as a core operator, deeply embedded in every key path of multi-scale fusion, namely lateral connectivity and bottom-up fusion. Before each feature transfer and fusion, it first uses the aforementioned scenario-optimized attention mechanism to reweight the importance of features at each scale, ensuring that features crucial for oilfield target identification, especially fine-grained features from shallow layers, are preserved and enhanced during the fusion process, rather than being smoothed or diluted in simple convolution and downsampling.
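The fusion order described above (attention, lateral reduction, top-down FPN pass, bottom-up PAN pass with attention, same-size concatenation, final convolution) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names, the `attend` hook (identity by default, standing in for the spatial attention enhancement mechanism), the use of nearest-neighbour upsampling and 2x2 average pooling for resampling, and the choice of the finest scale as the common size are all assumptions that the source does not specify.

```python
import numpy as np

def conv1x1(x, weight):
    # 1x1 convolution as channel mixing. x: (C_in, H, W), weight: (C_out, C_in).
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    # Nearest-neighbour upsampling by a factor of 2 (assumed resampling choice).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # 2x2 average pooling (assumed stand-in for a stride-2 convolution).
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fpn_pan_fuse(features, lateral_weights, out_weight, attend=lambda f: f):
    """Fuse feature maps ordered fine -> coarse, each half the size of the last.

    attend is a hypothetical hook for the spatial attention enhancement step.
    """
    # Lateral path: attention first, then 1x1 channel reduction.
    laterals = [conv1x1(attend(f), w) for f, w in zip(features, lateral_weights)]

    # Top-down (FPN): coarse semantics flow into finer levels.
    top_down = [None] * len(laterals)
    top_down[-1] = laterals[-1]
    for i in range(len(laterals) - 2, -1, -1):
        top_down[i] = laterals[i] + upsample2x(top_down[i + 1])

    # Bottom-up (PAN): fine detail flows back up; attention after each fusion.
    bottom_up = [top_down[0]]
    for i in range(1, len(top_down)):
        bottom_up.append(attend(top_down[i] + downsample2x(bottom_up[-1])))

    # Resize every level to the finest resolution, concatenate, and convolve.
    target = bottom_up[0].shape[1]
    resized = []
    for f in bottom_up:
        factor = target // f.shape[1]
        resized.append(f.repeat(factor, axis=1).repeat(factor, axis=2))
    return conv1x1(np.concatenate(resized, axis=0), out_weight)

# Four levels standing in for the 1/4, 1/8, 1/16 and 1/32 scales of a 64x64 input.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((c, s, s)) for c, s in [(8, 16), (16, 8), (32, 4), (64, 2)]]
laterals_w = [rng.standard_normal((8, f.shape[0])) for f in feats]
fused = fpn_pan_fuse(feats, laterals_w, rng.standard_normal((8, 32)))
print(fused.shape)
```

Passing a real attention function as `attend` reweights each level before it is merged, which is the point at which the text says shallow fine-grained features are protected from being diluted.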

[0031] Unlike general cross-modal models that rely solely on data-driven learning of visual-text associations, this implementation explicitly injects domain-specific biases into the model's structure. Specifically, it introduces a trainable, oilfield-specific semantic bias term into the core formula of multi-head attention computation. This bias term functions as a static, oilfield-domain-related prior guide, enhancing the association strength between visual-text feature pairs related to oilfield equipment and defects during attention weight allocation. This forces the model to focus more on semantic alignment within the domain during learning, thus fundamentally solving the problem of divergent association capture in specialized domains for general models.
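A minimal NumPy sketch of the cross-semantic interaction described in this embodiment is given below. The source states that a trainable oilfield-specific bias is introduced into the multi-head attention computation but does not give its shape or exact position, so this sketch assumes an additive bias of shape (heads, visual tokens, text tokens) applied to the scaled dot-product scores before the softmax. The parameter names (`w_q`, `w_k`, `w_v`, `w_o`), the residual coefficient name `gamma`, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_semantic_interaction(visual, text, params, bias, num_heads=4, gamma=0.5):
    """visual: (N_v, D_v) visual tokens; text: (N_t, D_t) text tokens.

    bias:  (num_heads, N_v, N_t) assumed additive oilfield semantic bias.
    gamma: assumed scalar residual coefficient.
    """
    # Project both modalities onto a unified hidden dimension.
    q = visual @ params["w_q"]   # visual queries, (N_v, D_hid)
    k = text @ params["w_k"]     # text keys,      (N_t, D_hid)
    v = text @ params["w_v"]     # text values,    (N_t, D_hid)
    n_v, d_hid = q.shape
    d_head = d_hid // num_heads

    def split(t):
        # (N, D_hid) -> (heads, N, d_head)
        return t.reshape(t.shape[0], num_heads, d_head).transpose(1, 0, 2)

    qh, kh, vh = split(q), split(k), split(v)
    # Scaled dot-product scores plus the domain bias, per head.
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head) + bias
    weights = softmax(scores, axis=-1)           # (heads, N_v, N_t)
    heads = weights @ vh                         # (heads, N_v, d_head)
    # Concatenate heads and project back to the visual feature dimension.
    merged = heads.transpose(1, 0, 2).reshape(n_v, d_hid)
    projected = merged @ params["w_o"]           # (N_v, D_v)
    # Residual connection scaled by the residual coefficient.
    return visual + gamma * projected, weights

rng = np.random.default_rng(0)
d_v, d_t, d_hid, n_heads = 16, 12, 8, 4
params = {
    "w_q": rng.standard_normal((d_v, d_hid)),
    "w_k": rng.standard_normal((d_t, d_hid)),
    "w_v": rng.standard_normal((d_t, d_hid)),
    "w_o": rng.standard_normal((d_hid, d_v)),
}
visual = rng.standard_normal((5, d_v))
text = rng.standard_normal((3, d_t))
no_bias = np.zeros((n_heads, 5, 3))
fused, attn = cross_semantic_interaction(visual, text, params, no_bias, n_heads)
print(fused.shape, attn.shape)
```

Under this additive reading, raising the bias entry for a given visual-text pair strictly increases that pair's attention weight, which matches the stated role of the bias as a prior that strengthens domain-relevant associations.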

[0032] Unlike general backbone networks that use fixed downsampling strategies (such as the standard 1 / 32 final scale), this implementation method is based on prior analysis of the physical scale distribution of oilfield targets. It reverse-engineers the scale structure of the feature extraction network, explicitly mapping four levels of feature maps to four categories of oilfield targets: large, medium, small, and micro. This ensures that each network level is responsible for extracting the optimal features of targets within a specific scale range. Furthermore, it uses window attention instead of global attention, aligning with the characteristic that oilfield targets typically occupy only a local area of the image. This reduces computational cost while forcing the model to focus on feature interactions within a local window, thus extracting local target features more efficiently and accurately.

[0033] Unlike solutions that rely solely on limited real-world data or simple data augmentation, this implementation method employs a standardized pipeline with seamless switching between simulation and reality channels. The synthetic data channel generates data procedurally according to predefined oilfield visual and semantic rules, such as background color, equipment type, and defect morphology, ensuring data diversity and absolute labeling accuracy from the source. The real-world data channel, through a standardized interface, unifies heterogeneous real-world data into a format that the model can process. The two channels achieve standardization and unification at the data interface level, ensuring the model can robustly learn from two complementary data sources—ideal simulation and complex reality—enhancing its generalization capabilities.
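The dual-channel idea can be illustrated with a small pure-Python sketch in which both channels emit the same record type, so the consumer can switch sources without changing code. Everything here is hypothetical: the `Sample` record, the label schema (`equipment`, `defect`, `bbox`), the JSON annotation format, and the toy procedural generator are assumptions chosen for illustration and are not taken from the source.

```python
import json
import random
from dataclasses import dataclass

EQUIPMENT = ("storage_tank", "pipeline", "valve")
DEFECTS = ("crack", "corrosion", "leak")

@dataclass
class Sample:
    image: list    # H x W grid of grayscale values
    label: dict    # unified schema: equipment, defect, bbox [x, y, w, h]
    source: str    # "synthetic" or "real"

def generate_synthetic(n, size=8, seed=0):
    """Procedurally generate toy images whose labels are exact by construction."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        background = rng.randint(0, 64)
        image = [[background] * size for _ in range(size)]
        x0, y0 = rng.randrange(size - 2), rng.randrange(size - 2)
        for y in range(y0, y0 + 2):          # paint a 2x2 bright "defect" patch
            for x in range(x0, x0 + 2):
                image[y][x] = 255
        label = {
            "equipment": rng.choice(EQUIPMENT),
            "defect": rng.choice(DEFECTS),
            "bbox": [x0, y0, 2, 2],
        }
        samples.append(Sample(image, label, "synthetic"))
    return samples

def load_real(records):
    """Normalize (image, annotation_json) pairs into the unified schema."""
    samples = []
    for image, annotation_json in records:
        raw = json.loads(annotation_json)
        label = {
            "equipment": raw["equipment"],
            "defect": raw.get("defect"),     # None when no defect is annotated
            "bbox": list(raw["bbox"]),
        }
        samples.append(Sample(image, label, "real"))
    return samples

def select_source(name, synthetic, real):
    """Pick the data source used for training or inference."""
    sources = {"synthetic": synthetic, "real": real}
    if name not in sources:
        raise ValueError(f"unknown data source: {name}")
    return sources[name]

synthetic = generate_synthetic(3)
real = load_real([([[0, 0], [0, 0]],
                   '{"equipment": "valve", "defect": "leak", "bbox": [0, 0, 1, 1]}')])
for sample in select_source("synthetic", synthetic, real):
    print(sample.source, sample.label)
```

Because both loaders return `Sample` objects with the same label keys, downstream training code only depends on the interface, which is the standardization at the data interface level that the paragraph describes.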

[0034] Implementation Method 2: Referring to Figure 1, Figure 2, and Figure 3, this embodiment provides a complete example of the multi-scale feature fusion-based multimodal semantic parsing model optimization method for oilfield industrial images described in Implementation Method 1, including: preprocessing the oilfield industrial images and obtaining the corresponding text features; performing feature extraction on the preprocessed oilfield industrial images using a multi-scale feature extraction network to obtain a visual feature set containing at least two different scales; processing the visual features at each scale in the visual feature set with a spatial attention enhancement mechanism to obtain enhanced visual features; fusing the enhanced visual features at each scale to obtain fused high-dimensional visual features; having the high-dimensional visual features and the text features interact through a cross-semantic interaction module, which introduces an oilfield-specific semantic bias when calculating attention weights; and parsing the features after the interaction to output the semantic information of the targets in the oilfield industrial image.

[0035] This implementation addresses the location sensitivity and edge feature saliency of oilfield equipment / defects by designing a spatial attention enhancement mechanism based on dimensional decoupling. It adaptively pools the height and width dimensions of the input feature map, calculates attention weights after fusing dimensional features, and strengthens edge feature weights using oilfield-specific coefficients. This ultimately enhances the spatial attention of the feature map, highlighting the core features of oilfield equipment / defects. It breaks through the indiscriminate feature enhancement mode of general spatial attention, achieving decoupled attention calculation for height and width dimensions. The addition of oilfield-specific edge feature enhancement coefficients makes the attention weights more sensitive to the location and edge features of oilfield equipment and defects. Tests on publicly available oilfield industrial image datasets and field-collected datasets show that the spatial feature representation capability of the spatial attention enhancement mechanism is improved by ≥35%, and the accuracy of identifying edge features such as oilfield welds and pipeline cracks is improved by more than 15% compared to existing general spatial attention mechanisms.

[0036] The spatial attention enhancement mechanism works as follows. Let the input visual feature map have shape B×C×H×W, where B is the batch size, C is the number of channels, H is the height, and W is the width. Dimension-decoupled pooling is applied to the feature map. First, adaptive average pooling along the height dimension yields the height-dimension pooled feature.

[0037] Adaptive average pooling along the width dimension then yields the width-dimension pooled feature, which undergoes dimension permutation so that it can be concatenated with the height-dimension pooled feature.

[0038] The pooled features are concatenated to obtain the fused feature F_cat.

[0039] The fused feature is then subjected to convolution, normalization, and activation: F_act = Hardswish(BN(Conv(F_cat))), where BN is batch normalization and the Hardswish activation function is used to adapt to the high-contrast features of oilfield images.

[0040] The activated feature is split back into height-dimension and width-dimension parts. The basic attention weights are obtained through convolution followed by the activation σ and are then corrected by the oilfield edge enhancement coefficient α to obtain the final attention weight map.

[0041] Here, σ is the Sigmoid activation function and α is the oilfield-specific edge enhancement coefficient, with a value range of 0.1-0.3, used to increase the weight of edge features such as welds and pipeline cracks. According to ablation experiments on the oilfield industrial image dataset, when α is in the range 0.1-0.3, the spatial attention enhancement mechanism improves the spatial feature representation of oilfield equipment and defects by ≥35%, which is significantly better than the indiscriminate enhancement of general spatial attention mechanisms.
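The steps in [0036] to [0040] can be written out as the following sketch. The formulas themselves are not reproduced in this text, so everything except F_cat, F_act, σ, α, BN, Conv, and Hardswish is assumed notation: X (input), z^h and z^w (pooled features), F^h and F^w (split features), A^h, A^w, and A (weight maps), Conv_h and Conv_w (per-direction convolutions), and Y (output). The sketch also assumes that the height-dimension feature keeps the height axis and averages over the width, and that α enters as a plain multiplicative factor, which is one reading of "multiplied by the oilfield edge enhancement coefficient".

```latex
\begin{aligned}
z^{h}_{b,c,i} &= \frac{1}{W}\sum_{j=1}^{W} X_{b,c,i,j}, & z^{h} &\in \mathbb{R}^{B\times C\times H\times 1},\\
z^{w}_{b,c,j} &= \frac{1}{H}\sum_{i=1}^{H} X_{b,c,i,j}, & z^{w} &\in \mathbb{R}^{B\times C\times 1\times W},\\
F_{cat} &= \mathrm{Concat}\big(z^{h},\ \mathrm{Permute}(z^{w})\big), & F_{cat} &\in \mathbb{R}^{B\times C\times (H+W)\times 1},\\
F_{act} &= \mathrm{Hardswish}\big(\mathrm{BN}(\mathrm{Conv}(F_{cat}))\big),\\
(F^{h},\ F^{w}) &= \mathrm{Split}(F_{act}),\\
A^{h} &= \sigma\big(\mathrm{Conv}_{h}(F^{h})\big), \qquad A^{w} = \sigma\big(\mathrm{Conv}_{w}(\mathrm{Permute}^{-1}(F^{w}))\big),\\
A &= \alpha\,\big(A^{h}\odot A^{w}\big),\\
Y &= A \odot X .
\end{aligned}
```

Here ⊙ denotes element-wise multiplication with broadcasting, so A^h (one weight per row) and A^w (one weight per column) combine into a full H×W weight map per channel.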

[0042] The multi-scale feature extraction network in this implementation method is based on the window attention architecture of the Swin Transformer. It is built as a four-level network that progressively downsamples the oilfield image and extracts deep visual features at four scales: 1/4, 1/8, 1/16, and 1/32. This matches the feature extraction requirements of equipment and defects of different sizes in the oilfield image. The equipment includes storage tanks, pipelines, and valves; the defects include cracks, corrosion, and leaks.

[0043] The feature extraction process of the multi-scale feature extraction network includes: downsampling the input oilfield image with a 4×4 strided convolution to obtain the initial feature F1 at 1/4 scale; applying 2×2 strided convolution downsampling and window attention processing to F1 in sequence to obtain the 1/8-scale feature F2, the 1/16-scale feature F3, and the 1/32-scale feature F4; and outputting the multi-scale feature set {F1, F2, F3, F4} as the input for subsequent feature fusion.
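As a worked example of the four-level downsampling, the short sketch below computes the spatial size of each feature map for a 512×512 input. The function name `pyramid_sizes` is hypothetical, and the sketch assumes the input size is divisible by 32 so that every level divides evenly.

```python
def pyramid_sizes(height, width, first_stride=4, later_stride=2, levels=4):
    """Return (name, scale, height, width) for each pyramid level.

    The first level divides the input by `first_stride`; every later
    level divides the previous one by `later_stride`.
    """
    sizes = []
    stride = first_stride
    for level in range(1, levels + 1):
        sizes.append((f"F{level}", f"1/{stride}", height // stride, width // stride))
        stride *= later_stride
    return sizes


for name, scale, h, w in pyramid_sizes(512, 512):
    print(name, scale, h, w)
```

For a 512×512 image this gives 128×128, 64×64, 32×32, and 16×16 maps for F1 to F4.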

[0044] This implementation method replaces global attention with window attention, which suits oilfield images because their content is dominated by local equipment and defects. This significantly improves the accuracy of local feature extraction while reducing the computational load. The four downsampling scales (1/4, 1/8, 1/16, and 1/32) in the multi-scale feature extraction network are well matched to the target scale distribution of oilfield industrial images. Specifically, the 1/4 scale is suitable for overall feature extraction of large targets such as large oilfield storage tanks and long-distance pipelines; the 1/8 and 1/16 scales are suitable for medium-sized equipment such as valves and pumps, as well as large-area corrosion and deformation defects; and the 1/32 scale is suitable for fine-grained defects such as weld cracks and micro-leakage points. Experimental verification on an oilfield industrial image dataset shows that this four-scale design improves the feature extraction accuracy for the various oilfield targets by ≥40% compared with general two- or three-scale feature extraction architectures, and effectively avoids feature redundancy for large-scale targets and feature loss for small-scale defects.

[0045] This implementation method performs multi-scale feature fusion on the enhanced visual features at each scale. It combines the multi-scale fusion logic of FPN (top-down) and PAN (bottom-up) and incorporates the oilfield-specific spatial attention enhancement module at each fusion step. First, lateral convolution and attention enhancement are applied to the features at each scale. Then top-down upsampling fusion and bottom-up downsampling fusion are performed. Finally, the features at all scales are brought to a common size and concatenated to obtain the high-dimensional fused feature. Specifically, let the multi-scale feature set be {F1, F2, F3, F4}, each scale with its own number of channels, and let the target number of channels after fusion be C_out.

[0046] Convolutional dimensionality reduction and spatial attention enhancement are applied to the features at each scale to obtain the lateral enhanced features {F_lat-1, F_lat-2, F_lat-3, F_lat-4}.

[0047] Here, the enhancement operation is the oilfield-specific spatial attention enhancement.

[0048] Starting with the largest scale feature, feature fusion is performed through 2× upsampling to obtain the FPN fused feature set.

[0049]

[0050]

[0051] Here, the upsampling operation is 2× upsampling by bilinear interpolation.

[0052] Starting with the smallest scale feature, feature fusion is performed through a 3×3 convolution with 2× downsampling, followed by attention enhancement, to obtain the PAN fused feature set.

[0053]

[0054] Here, the downsampling operation is a 3×3 convolution with a stride of 2 and padding of 1, which downsamples by a factor of 2.

[0055] All PAN fused features are adaptively average-pooled to the same size, concatenated, and then convolved to obtain the final fused feature F_fusion.

[0056]

[0057] This implementation method departs from the traditional FPN / PAN feature fusion mode, which has no attention guidance. It embeds the oilfield-specific spatial attention enhancement throughout the lateral connections and the bottom-up downsampling fusion in PAN, which effectively preserves the fine-grained features of minute defects in oilfield images. In addition, by concatenating the features after bringing them to a uniform size, it resolves the inconsistency of feature dimensions across scales and significantly improves the fusion efficiency and utilization of multi-scale features. In tests on an oilfield industrial image dataset, the representation capability of the fused features output by this module is improved by ≥45%, and the F1 score for detecting multi-scale oilfield targets is improved by more than 20% compared with the traditional FPN / PAN fusion architecture.
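The fusion steps in [0045] to [0055] can be summarized in the following sketch. The formulas are not reproduced in this text, so the notation is assumed: 𝒜 stands for the oilfield-specific spatial attention enhancement, P_i for the FPN fused features, N_i for the PAN fused features, Up and Down for the 2× bilinear upsampling and the 3×3 stride-2 convolution, and Pool for adaptive average pooling to a common size. Combining branches by element-wise addition, and starting the top-down path at the 1/32-scale feature, follow the usual FPN / PAN convention and are assumptions here.

```latex
\begin{aligned}
F_{lat\text{-}i} &= \mathcal{A}\big(\mathrm{Conv}(F_i)\big), & i &= 1,\dots,4,\\
P_4 &= F_{lat\text{-}4}, \qquad P_i = F_{lat\text{-}i} + \mathrm{Up}_{2\times}(P_{i+1}), & i &= 3,2,1,\\
N_1 &= P_1, \qquad N_{i+1} = \mathcal{A}\big(P_{i+1} + \mathrm{Down}_{2\times}(N_i)\big), & i &= 1,2,3,\\
F_{fusion} &= \mathrm{Conv}\big(\mathrm{Concat}(\mathrm{Pool}(N_1),\dots,\mathrm{Pool}(N_4))\big).
\end{aligned}
```

Each lateral convolution maps its input to C_out channels, so the additions in the second and third lines are between tensors of matching shape.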

[0058] To address visual-text cross-modal semantic alignment, this implementation method designs an oilfield-adaptive cross-modal attention mechanism. It projects the visual and text features onto a unified hidden dimension, adds an oilfield-specific bias to strengthen the semantic association between oilfield equipment and defects, and introduces a residual connection to improve the stability of model training and inference, thereby achieving efficient semantic interaction between visual and text features. Specifically, let the visual features be F_v, characterized by a visual feature sequence length and a visual feature dimension, and let the text features be F_t, characterized by a text feature sequence length and the text feature dimension D_t. Let D_h be the hidden layer dimension, N_h the number of attention heads, and B the batch size. The visual and text features are first projected onto the unified hidden dimension.

[0059] Here, the visual features pass through a visual feature projection layer and the text features pass through a text feature projection layer. The projected features are split into multiple heads, the oilfield-specific semantic bias is added when the attention weights are calculated, and the outputs of the heads are then concatenated.

[0060]

[0061]

[0062]

[0063] Here, the added term is the oilfield-specific semantic bias, with a value range of 0.05-0.15 and an optimal value of 0.1. It is used to strengthen the semantic association between oilfield equipment and defects. Experimental results show that adding this bias improves the semantic association between visual features and text features in oilfield scenes by ≥28%, which effectively addresses the insufficient capture of oilfield-specific semantics by general cross-modal models.

[0064] The attention output is projected back to the visual feature dimension and combined with the input through a residual connection with a coefficient, which yields the cross-modal fusion feature F_cross.

[0065] Here, the projection is performed by the output projection layer, and β is the residual coefficient, ranging from 0.5 to 0.8 with a preferred value of 0.8, used to improve the stability of model training and inference. The oilfield-specific semantic bias added to the cross-modal attention calculation addresses the insufficient capture of semantic associations in oilfield scenes by general cross-modal models. The residual connection with an adjustable coefficient helps avoid vanishing gradients during model training and significantly improves the stability and accuracy of cross-modal interaction. Experimental verification shows that the cross-modal semantic alignment accuracy of this implementation method is improved by ≥30%, and that the semantic alignment accuracy in oilfield scenes is improved by more than 25% compared with existing general visual-text cross-modal attention models. The oilfield semantic bias design can improve the recognition accuracy of oilfield equipment and defects by more than 20%.
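The cross-modal interaction in [0058] to [0065] can be sketched as follows. The formulas are not reproduced in this text, so the notation is assumed: Proj_v, Proj_t, and Proj_o for the visual, text, and output projection layers; Q, K, V for the query, key, and value features; S_k for the attention logits of head k; and b_oil for the oilfield-specific semantic bias. Three modelling choices are also assumptions: the keys and values share one text projection, the bias is added to the scaled logits before the softmax (the text does not state whether it is a scalar or a per-pair term), and β scales the skip branch rather than the attention branch.

```latex
\begin{aligned}
Q &= \mathrm{Proj}_v(F_v), \qquad K = V = \mathrm{Proj}_t(F_t),\\
S_k &= \frac{Q_k K_k^{\top}}{\sqrt{D_h / N_h}} + b_{oil}, \qquad k = 1,\dots,N_h,\\
O &= \mathrm{Concat}\big(\mathrm{softmax}(S_1)V_1,\ \dots,\ \mathrm{softmax}(S_{N_h})V_{N_h}\big),\\
F_{cross} &= \beta\,F_v + \mathrm{Proj}_o(O).
\end{aligned}
```

Q_k, K_k, and V_k are the slices of Q, K, and V assigned to head k, each of width D_h / N_h, and Proj_o maps the concatenated heads back to the visual feature dimension so that the sum with F_v is well defined.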

[0066] This implementation method also proposes a dual-source data supply mechanism that automatically generates synthetic oilfield images and loads real oilfield images in a standardized way. The synthetic source simulates oilfield scenarios to generate oilfield images with complete semantic annotations covering equipment, defects, and location information. The real source supports standardized loading of mainstream image formats and parsing of their semantic annotations. The two sources can be switched seamlessly and together provide sufficient data for model training. Specifically, for synthetic data, the background color of the oilfield industrial image is preset to the yellowish-brown and blue tones common at oilfield sites, and the generated image size is fixed at 512×512. Based on common oilfield scenes, core oilfield equipment such as pipelines, storage tanks, valves, pumps, and wellhead devices is randomly generated, and common industrial defects such as cracks, corrosion, leakage, weld defects, and deformation are randomly added at the corresponding locations on the equipment. Semantic annotation files matching the images are generated automatically at the same time; the annotation content includes defect type, equipment type, target location coordinates, and defect severity level. The output is a synthetic oilfield image dataset with complete semantic annotations. For real data, mainstream industrial image formats such as JPG, PNG, and JPEG are supported. Loaded real oilfield images are automatically resized to a standard size and their pixel values are normalized. The text annotation file with the same name as each image is parsed at the same time to extract the defect type, equipment type, location information, and annotation level; the annotation information is then matched against and verified with the image data, and a standardized real oilfield image dataset is output. During both the model training and inference phases, users can freely choose the synthetic data, real data, or mixed data mode through a visual interface. The data input interface adopts a standardized design, and the output format of the different data sources is completely unified, which enables seamless switching between data sources.
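The annotation side of the synthetic data source can be illustrated with the minimal sketch below, which generates one annotation record per synthetic image. The record layout, the field names, the box size ranges, and the three-level severity scale are all hypothetical; the text only specifies that annotations carry defect type, equipment type, target location coordinates, and defect severity level, and that images are 512×512. Image rendering itself is left out.

```python
import json
import random

EQUIPMENT = ["pipeline", "storage_tank", "valve", "pump", "wellhead"]
DEFECTS = ["crack", "corrosion", "leakage", "weld_defect", "deformation"]
IMAGE_SIZE = 512  # fixed synthetic image size stated in the text


def make_annotation(rng, image_size=IMAGE_SIZE):
    """Generate one hypothetical annotation record for a synthetic image."""
    # Equipment box: random size and position, kept inside the image.
    w = rng.randint(64, 256)
    h = rng.randint(64, 256)
    x = rng.randint(0, image_size - w)
    y = rng.randint(0, image_size - h)
    # Defect box: placed inside the equipment box so it lies on the equipment.
    dw = rng.randint(8, w // 2)
    dh = rng.randint(8, h // 2)
    dx = rng.randint(x, x + w - dw)
    dy = rng.randint(y, y + h - dh)
    return {
        "equipment_type": rng.choice(EQUIPMENT),
        "equipment_box": [x, y, x + w, y + h],
        "defect_type": rng.choice(DEFECTS),
        "defect_box": [dx, dy, dx + dw, dy + dh],
        "severity": rng.randint(1, 3),  # assumed three-level severity scale
    }


rng = random.Random(0)  # seeded so a generated dataset is reproducible
print(json.dumps(make_annotation(rng), indent=2))
```

Passing a seeded `random.Random` instance rather than using the module-level generator keeps batches reproducible, which makes it easier to regenerate the same synthetic dataset for a later training run.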

[0067] To address the industry pain points of scarce real oilfield image data and difficult annotation, the synthetic data source simulates oilfield scenarios with different equipment, defects, and locations in batches and generates the semantic annotations automatically, while the real data source automates image preprocessing and annotation parsing. Together they improve data processing efficiency by ≥50% and model generalization ability by ≥40%.

[0068] In practical applications, the method for optimizing a multimodal semantic parsing model of oilfield industrial images through multi-scale feature fusion proposed in this implementation method is carried out through the following steps. Step S1, system initialization and parameter configuration: start the multimodal semantic parsing and visualization system, complete hardware environment adaptation (automatic CPU / GPU identification), and initialize all core modules; configure the core parameters of the model in the visualization interface, namely the oilfield edge enhancement coefficient α (recommended value 0.1-0.3), the residual coefficient β (recommended value 0.5-0.8), the number of fusion target channels C_out (recommended value 256), and the number of attention heads N_h (recommended value 8); and map the numerical results output by the model to the actual defect types (cracks / corrosion / leakage) and equipment types (pipelines / tanks / valves) in the oilfield.
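The parameter configuration in Step S1 can be captured in a small validated configuration object, sketched below. The class name `ModelConfig` and its field names are hypothetical; the ranges and the recommended values for C_out and N_h come from the text, the default β of 0.8 is the preferred value stated for the residual coefficient, and the default α of 0.2 is simply an assumed midpoint of the recommended range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Core model parameters with the recommended ranges from Step S1."""

    alpha: float = 0.2   # edge enhancement coefficient; 0.2 is an assumed midpoint
    beta: float = 0.8    # residual coefficient
    c_out: int = 256     # number of fusion target channels
    num_heads: int = 8   # number of attention heads

    def __post_init__(self):
        # Reject values outside the recommended ranges before any model is built.
        if not 0.1 <= self.alpha <= 0.3:
            raise ValueError(f"alpha must be in [0.1, 0.3], got {self.alpha}")
        if not 0.5 <= self.beta <= 0.8:
            raise ValueError(f"beta must be in [0.5, 0.8], got {self.beta}")
        if self.c_out <= 0 or self.num_heads <= 0:
            raise ValueError("c_out and num_heads must be positive")


config = ModelConfig(alpha=0.15)
print(config)
```

Validating at construction time means an out-of-range value entered in the interface fails immediately rather than partway through training.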

[0069] Step S2, preparation and loading of dual-source data: select the data source in the visualization interface, either synthetic oilfield images or real oilfield images. If synthetic data is selected, set the number of samples; the system automatically generates a synthetic oilfield image dataset with complete semantic annotations and saves it to the specified directory. If real data is selected, select the dataset directory; the system automatically loads the images, parses the semantic annotations, and completes the standardization preprocessing. The processed dataset is passed to the model training module, the training hyperparameters (batch size, learning rate, number of training epochs) are set, the oilfield multimodal semantic parsing model is trained, and the optimized model is saved.
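Because both data sources emit records in one unified format, switching between them can be reduced to a single dispatch function, as in the sketch below. The function name `select_records`, the record keys in `REQUIRED_KEYS`, the even split used for the mixed mode, and the stand-in `fake_source` are all hypothetical; real sources would generate or load images rather than return placeholder file names.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, object]
# Assumed unified record schema shared by both data sources.
REQUIRED_KEYS = {"image", "equipment_type", "defect_type", "location", "severity"}


def select_records(mode: str, count: int,
                   synthetic: Callable[[int], List[Record]],
                   real: Callable[[int], List[Record]],
                   seed: int = 0) -> List[Record]:
    """Return `count` records from the chosen source, checking the schema."""
    if mode == "synthetic":
        records = synthetic(count)
    elif mode == "real":
        records = real(count)
    elif mode == "mixed":
        half = count // 2
        records = synthetic(half) + real(count - half)
        random.Random(seed).shuffle(records)
    else:
        raise ValueError(f"unknown data mode: {mode!r}")
    for record in records:
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
    return records


def fake_source(tag: str) -> Callable[[int], List[Record]]:
    """Stand-in source that returns placeholder records for demonstration."""
    def make(n: int) -> List[Record]:
        return [{"image": f"{tag}_{i}.png", "equipment_type": "valve",
                 "defect_type": "crack", "location": [0, 0, 10, 10],
                 "severity": 1} for i in range(n)]
    return make


batch = select_records("mixed", 5, fake_source("syn"), fake_source("real"))
print([record["image"] for record in batch])
```

The schema check is what makes the switch safe: downstream training code can rely on every record having the same keys regardless of which source produced it.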

[0070] Step S3, oilfield image preprocessing and input: select the single oilfield image to be analyzed in the visualization interface; the system automatically preprocesses the image by resizing it to 512×512, converting it to a Tensor, and normalizing it (mean [0.485, 0.456, 0.406], standard deviation [0.229, 0.224, 0.225]). Simulated text feature embeddings are generated, or text features are loaded from actual LLM outputs, as the text input for cross-modal semantic interaction. The preprocessed visual and text features are then input into the trained parsing model.
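The normalization in Step S3 is a per-channel affine transform, illustrated below for a single RGB pixel. The function name `normalize_pixel` is hypothetical, and the sketch assumes that conversion to a Tensor first rescales 8-bit values to the range 0 to 1, as common image-to-tensor conversions do; the mean and standard deviation are the values given in the text.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Rescale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - mean) / std
                 for value, mean, std in zip(rgb, MEAN, STD))


# A mid-gray pixel ends up close to zero in every channel.
print(normalize_pixel((124, 116, 104)))
```

The same transform is applied to every pixel of the resized 512×512 image, so the model sees inputs on the scale it was trained with.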

[0071] Step S4, multi-scale feature extraction and spatial attention enhancement: the multi-scale feature extraction module performs four levels of downsampling on the preprocessed image to extract the deep visual feature set {F1, F2, F3, F4} at scales of 1/4, 1/8, 1/16, and 1/32. Oilfield-specific spatial attention enhancement is applied to the features at each scale to obtain the lateral enhanced feature set {F_lat-1, F_lat-2, F_lat-3, F_lat-4}.

[0072] Step S5, fusion of the multi-scale oilfield features: the lateral enhanced features are upsampled and fused top-down through FPN to obtain the FPN fused feature set; the FPN fused features undergo PAN bottom-up downsampling and attention-enhanced fusion to obtain the PAN fused feature set; and all PAN fused features are brought to the same size, concatenated, and convolved to obtain the final high-dimensional fused feature F_fusion.
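The spatial sizes in Step S5 can be checked with a little arithmetic. The sketch below traces a 512×512 input through the lateral features, the top-down 2× upsampling path, and the bottom-up path built from a 3×3 convolution with stride 2 and padding 1, using the standard convolution output-size formula. The function names are hypothetical, and only sizes are traced, not feature values.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output size of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_fusion_sizes(input_size=512):
    """Trace spatial sizes through the lateral, FPN, and PAN stages."""
    lateral = [input_size // s for s in (4, 8, 16, 32)]  # F1..F4

    # Top-down path: start at the 1/32 map and upsample by 2 at each step.
    fpn = [0, 0, 0, 0]
    fpn[3] = lateral[3]
    for i in (2, 1, 0):
        upsampled = fpn[i + 1] * 2
        if upsampled != lateral[i]:
            raise ValueError("upsampled size does not match the lateral feature")
        fpn[i] = lateral[i]

    # Bottom-up path: start at the 1/4 map and apply the stride-2 convolution.
    pan = [fpn[0]]
    for i in (1, 2, 3):
        downsampled = conv_out(pan[-1])
        if downsampled != fpn[i]:
            raise ValueError("downsampled size does not match the FPN feature")
        pan.append(downsampled)

    return {"lateral": lateral, "fpn": fpn, "pan": pan}


print(trace_fusion_sizes(512))
```

With padding 1, the 3×3 stride-2 convolution halves an even input size exactly (128 to 64, 64 to 32, 32 to 16), which is why the bottom-up features line up with the FPN features at every level.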

[0073] Step S6, cross-modal semantic interaction and core semantic parsing: the fused feature F_fusion is flattened to obtain the visual feature sequence F_v; the cross-modal semantic interaction module applies projection, multi-head attention computation (with the oilfield semantic bias), and a residual connection to F_v and the text features F_t to obtain the cross-modal fusion feature F_cross; global average pooling is applied to F_cross to obtain the global feature F_global; and the semantic parsing head applies multilayer perceptron classification to F_global and outputs numerical parsing results, which are converted through mapping rules into three core items of semantic information: defect type, equipment type, and location code.
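The final mapping from numerical results to semantic labels in Step S6 can be sketched as an argmax over each group of classifier scores. The label lists follow the defect and equipment types named in Step S1; the function names, the assumption that the parsing head emits one score vector per output, and the treatment of the location code as a plain class index are hypothetical, since the text does not define the mapping rules in detail.

```python
DEFECT_TYPES = ["crack", "corrosion", "leakage"]
EQUIPMENT_TYPES = ["pipeline", "tank", "valve"]


def argmax(values):
    """Index of the largest value in a non-empty sequence."""
    return max(range(len(values)), key=values.__getitem__)


def decode(defect_scores, equipment_scores, location_scores):
    """Map raw classifier scores to the three core semantic items."""
    return {
        "defect_type": DEFECT_TYPES[argmax(defect_scores)],
        "equipment_type": EQUIPMENT_TYPES[argmax(equipment_scores)],
        "location_code": argmax(location_scores),  # assumed to be a class index
    }


print(decode([0.1, 2.0, -1.0], [3.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]))
```

Keeping the label tables separate from the model means the same trained parsing head can be remapped if the site uses different names for the same classes.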

[0074] Implementation Method 3: This implementation method proposes a multi-scale feature fusion-based multimodal semantic parsing model optimization system for oilfield industrial images. The system is used to implement the method described in any one of Implementation Methods 1 to 2 and includes: a data input and preprocessing module, used to load and preprocess oilfield industrial images and obtain the corresponding text features; a multi-scale feature extraction module, used to extract a multi-scale visual feature set from the preprocessed image; an oilfield-specific spatial attention enhancement module, used to enhance the spatial attention of the features in the multi-scale visual feature set; a multi-scale feature fusion module, used to fuse the enhanced multi-scale visual features; a cross-modal semantic interaction module, used to perform interaction between the fused visual features and the text features while introducing the oilfield-specific semantic bias; a semantic parsing module, used to parse the semantic information of the targets from the features after interaction; and a visualization module, used to provide a human-computer interaction interface, control the operation of the system, and display the parsing results.

[0075] In this embodiment, the data input and preprocessing module includes a dual-source data processing unit, which includes: The synthetic data generation unit is used to automatically generate synthetic images containing oilfield equipment, defects, and corresponding semantic annotations based on preset oilfield scene parameters; The real data loading unit is used to load real oilfield image files and parse the associated annotation files to obtain standardized semantic information.

[0076] Implementation Method 4: This implementation method proposes a computer device, including a memory and a processor. The memory stores a computer program. When the processor runs the computer program stored in the memory, the processor executes an optimization method for a multi-scale feature fusion oilfield industrial image multimodal semantic parsing model according to any one of Implementation Methods 1 to 2.

[0077] Implementation Method 5: This implementation method proposes a computer-readable storage medium storing a computer program. When the computer program is run by a processor, it executes the steps of the multi-scale feature fusion oilfield industrial image multimodal semantic parsing model optimization method as described in any one of Implementation Methods 1 to 2.

[0078] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, systems, or computer program products. Therefore, this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this disclosure can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0079] This disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It will be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram. The computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0080] These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0081] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this disclosure and not to limit its protection scope. Although this disclosure has been described in detail with reference to the above embodiments, those skilled in the art should understand that after reading this disclosure, they can still make various changes, modifications or equivalent substitutions to the specific implementation of the invention, but these changes, modifications or equivalent substitutions are all within the protection scope of the pending claims.

Claims

1. A method for optimizing a multimodal semantic parsing model of oilfield industrial images through multi-scale feature fusion, characterized in that the method includes: preprocessing the oilfield industrial images and obtaining the corresponding text features; extracting features from the preprocessed oilfield industrial images with a multi-scale feature extraction network to obtain a visual feature set containing at least two different scales; processing the visual features at each scale in the visual feature set with a spatial attention enhancement mechanism to obtain enhanced visual features; fusing the enhanced visual features at each scale to obtain fused high-dimensional visual features; passing the high-dimensional visual features and the text features through a cross-semantic interaction module, which introduces an oilfield-specific semantic bias when calculating attention weights; and parsing the features after interaction to output the semantic information of the targets in the oilfield industrial image.

2. The method for optimizing a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion according to claim 1, characterized in that, The multi-scale feature extraction network adopts a window attention-based Transformer architecture to perform four levels of downsampling on the input image, obtaining visual features at scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32, respectively, which constitute the visual feature set.

3. The method for optimizing a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion according to claim 1, characterized in that the spatial attention enhancement mechanism is specifically as follows: adaptive pooling is performed on the input visual feature map in the height and width dimensions respectively to obtain a height-dimension pooled feature and a width-dimension pooled feature; the height-dimension pooled feature is concatenated with the width-dimension pooled feature that has undergone dimension permutation to obtain the fused feature; the fused feature is subjected to convolution, normalization, and activation processing; the activated feature is split into a height-dimension feature and a width-dimension feature, and convolution and an activation function are applied to each of them to obtain a height-dimension attention weight map and a width-dimension attention weight map; the height-dimension attention weight map and the width-dimension attention weight map are multiplied by the oilfield edge enhancement coefficient to obtain the corrected attention weight map; and the corrected attention weight map is multiplied by the input visual feature map to obtain the spatial attention-enhanced feature map.

4. The method for optimizing a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion according to claim 1, characterized in that the process of fusing the enhanced visual features at each scale to obtain the fused high-dimensional visual features includes: subjecting the visual features at each scale enhanced by spatial attention to convolutional dimensionality reduction to obtain the lateral enhanced features; using a feature pyramid network (FPN) structure to perform top-down feature fusion on the lateral enhanced features to obtain a first fused feature set; using a path aggregation network (PAN) structure to perform bottom-up feature fusion on the first fused feature set, with the spatial attention enhancement mechanism applied after each downsampling fusion, to obtain a second fused feature set; and, after adjusting the features at each scale in the second fused feature set to the same size, concatenating and convolving them to obtain the high-dimensional visual features.

5. The method for optimizing a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion according to claim 1, characterized in that, The cross-semantic interaction module performs the following operations: The high-dimensional visual features and the text features are projected onto a unified hidden dimension to obtain visual query features, text key features, and text value features. The visual query features, text key features, and text value features are split into multiple attention heads; During the attention weight calculation process for each attention head, the oilfield semantic-specific bias is added; The outputs of multiple attention heads are concatenated and projected back to the visual feature dimension; The projected features are added to the input high-dimensional visual features through a residual connection with residual coefficients to obtain cross-modal fusion features.

6. The method for optimizing a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion according to claim 1, characterized in that, The method further includes: A synthetic oilfield image dataset with semantic annotations is automatically generated through a synthetic data generation unit. The real oilfield images and their annotation files are loaded and parsed in a standardized manner using the real data loading unit to obtain a standardized real oilfield image dataset. During model training or inference, a data source is selected from the synthetic oilfield image dataset or the real oilfield image dataset.

7. A multi-scale feature fusion system for optimizing a multimodal semantic parsing model of oilfield industrial images, characterized in that, The system is used to implement the method according to any one of claims 1-6, the system comprising: The data input and preprocessing module is used to load and preprocess oilfield industrial images and obtain corresponding text features; The multi-scale feature extraction module is used to extract multi-scale visual feature sets from the preprocessed image; An oilfield-specific spatial attention enhancement module is used to enhance the spatial attention of features in the multi-scale visual feature set. A multi-scale feature fusion module is used to fuse enhanced multi-scale visual features; The cross-modal semantic interaction module is used to interact with the fused visual features and text features, and introduces oilfield-specific semantic biases. The semantic parsing module is used to parse the semantic information of the target based on the features after the interaction; The visualization module is used to provide a human-computer interaction interface, control the operation of the system, and display the analysis results.

8. The multi-scale feature fusion-based multimodal semantic parsing model optimization system for oilfield industrial images according to claim 7, characterized in that, The data input and preprocessing module includes a dual-source data processing unit, which includes: The synthetic data generation unit is used to automatically generate synthetic images containing oilfield equipment, defects, and corresponding semantic annotations based on preset oilfield scene parameters; The real data loading unit is used to load real oilfield image files and parse the associated annotation files to obtain standardized semantic information.

9. A computer device, characterized in that the computer device includes a memory and a processor; the memory stores a computer program; and when the processor runs the computer program stored in the memory, the processor executes the method for optimizing a multi-modal semantic parsing model of oilfield industrial images based on multi-scale feature fusion as described in any one of claims 1-6.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of a method for optimizing a multimodal semantic parsing model of oilfield industrial images based on multi-scale feature fusion as described in any one of claims 1-6.