A multimodal feature fusion method, apparatus and electronic device

By using convolutional coding networks and cross-modal attention or gating mechanisms to fuse multimodal features, the problems of low fusion accuracy and poor adaptability in existing technologies are solved, and high-precision adaptive multimodal feature fusion is achieved.

CN122196873A · Pending · Publication Date: 2026-06-12 · ZHEJIANG WANLI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ZHEJIANG WANLI UNIV
Filing Date: 2026-02-11
Publication Date: 2026-06-12

IPC: G06F18/25; G06F18/24; G06V10/44; G06V10/54; G06V10/82; G06N3/0464; G06N3/045; G06N3/0455; G06N3/042; G06N3/0442

AI Tagging

Application Domain

Character and pattern recognition; Biological models

AI Technical Summary

⚠Technical Problem

Existing multimodal feature fusion methods lack complementary information, have low fusion accuracy, and are fixed in pattern, making them difficult to adapt to various application scenarios.

⚗Method used

By acquiring sensor information from different types of sensors, a convolutional coding network is used to encode it into feature vectors. This is combined with cross-modal attention or gating mechanisms for feature interaction, and dynamic weights are generated based on confidence for weighted fusion, thereby achieving adaptive adjustment of multimodal features.

🎯Benefits of technology

It improves the accuracy of multimodal feature fusion and adapts flexibly to various application scenarios.

✦ Generated by Eureka AI based on patent content.

Figure CN122196873A_ABST

Abstract

This disclosure provides a multimodal feature fusion apparatus and electronic device. Sensor information about the surface of an object under test is acquired from different types of sensors. Each piece of sensor information is input into a first encoder, which encodes it into a feature vector based on a convolutional coding network, with one convolutional coding network per piece of sensor information. The feature vectors then interact through a cross-modal attention mechanism or a gating mechanism to yield corresponding complementary feature vectors. Based on the confidence of the complementary feature vectors, they are input into a weight generation network that determines multiple weight coefficients, and the complementary feature vectors are weighted and fused with these coefficients to obtain a multimodal fused feature. In summary, this disclosure can fuse complementary information, improve the accuracy of multimodal feature fusion, and adapt flexibly to various application scenarios.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a multimodal feature fusion device and electronic device.

Background Technology

[0002] Currently, most existing multimodal feature fusion methods employ fixed-weight fusion, early input concatenation, or late-stage decision voting. However, fixed-weight fusion pre-defines the contribution ratio of each modality, making it unable to dynamically adjust based on the actual quality of the input samples or differences in the application scenario. Early direct input concatenation involves simply concatenating multimodal information at the raw data level and then uniformly inputting it into the model, failing to fully consider the format differences of heterogeneous data and unable to achieve deep interaction and information complementarity between features. Late-stage decision voting involves simply integrating the features at the decision output layer using averaging or voting after each modality completes its inference independently, potentially losing the correlation information of features in intermediate layers and failing to fully utilize the collaborative relationships between modalities. Thus, these methods lack the ability to dynamically evaluate and adaptively adjust the quality of each modality's features and struggle to effectively mine complementary correlation information between modalities. In summary, existing multimodal feature fusion methods lack complementary information, have low fusion accuracy, use fixed patterns, and are difficult to adapt to various application scenarios.

Summary of the Invention

[0003] This disclosure provides a multimodal feature fusion apparatus and electronic device to address, to some extent, the problems of existing multimodal feature fusion methods, such as lack of complementary information, low fusion accuracy, fixed patterns, and difficulty in adapting to various application scenarios.

[0004] According to one aspect of this disclosure, a multimodal feature fusion method is provided. The method includes: acquiring sensor information from different types of sensors on the surface of an object to be measured; the different types of sensors include at least one of the following: visible light imaging sensor, infrared imaging sensor, three-dimensional point cloud acquisition sensor, and spectral curve acquisition sensor; inputting the multiple pieces of sensor information into a first encoder to obtain corresponding multiple feature vectors; the first encoder is used to encode the sensor information into feature vectors based on a convolutional coding network; each piece of sensor information corresponds to one convolutional coding network; the convolutional coding network includes at least one of the following: convolutional neural network, graph convolutional network, and one-dimensional convolutional network; interacting the multiple feature vectors based on a cross-modal attention mechanism or gating mechanism to obtain corresponding multiple complementary feature vectors; inputting the multiple complementary feature vectors into a weight generation network based on the confidence of the multiple complementary feature vectors to determine multiple weight coefficients; and using the multiple weight coefficients to perform weighted fusion of the multiple complementary feature vectors to obtain multimodal fused features.

[0005] Furthermore, according to one aspect of the method disclosed herein, the method further includes: inputting multimodal fusion features into a classifier or segmentation decoder, and outputting defect detection results of the surface of the object under test; the defect detection results include at least one of the following: defect type, defect location coordinates, and detection confidence.

[0006] Furthermore, according to one aspect of the method disclosed herein, sensor information of different types of sensors on the surface of the object under test is obtained, including: obtaining the mapping relationship between the sensor coordinates and standard coordinates of all different types of sensors; based on the mapping relationship, collecting data from different types of sensors at the same location on the surface of the object under test at the same time to obtain multiple sensor information.

[0007] Furthermore, according to one aspect of the method of this disclosure, when the sensor is a visible light imaging sensor and / or an infrared imaging sensor, multiple sensor information is input into a first encoder to obtain multiple corresponding feature vectors, including: normalizing the visible light imaging of the visible light imaging sensor and / or the infrared imaging of the infrared imaging sensor, and inputting them into a convolutional neural network in the first encoder; the normalization process includes at least one of the following: pixel value normalization and mean-variance normalization; performing network forward propagation and dimensionality reduction integration processing using the convolutional neural network to obtain feature vectors; the network forward propagation is used to extract hierarchical features from local texture to global semantics of the visible light imaging and / or the infrared imaging; the dimensionality reduction integration processing is implemented through a global pooling layer in the convolutional neural network.
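The preprocessing and pooling named in this paragraph can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names, the 8-bit input assumption, and the random test image are all hypothetical, and the global pooling is applied directly to the normalized image as a stand-in for the final feature map of a convolutional neural network.

```python
import numpy as np

def normalize_pixels(image: np.ndarray) -> np.ndarray:
    """Pixel value normalization: scale 8-bit intensities into [0, 1] (8-bit input is an assumption)."""
    return image.astype(np.float32) / 255.0

def normalize_mean_variance(image: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Mean-variance normalization: zero mean and unit variance per channel."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)

def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Dimensionality reduction: collapse an (H, W, C) map into a C-dimensional vector."""
    return feature_map.mean(axis=(0, 1))

# Hypothetical 32x32 RGB image standing in for a visible light capture.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

scaled = normalize_pixels(rgb)
standardized = normalize_mean_variance(scaled)
# In a real encoder the pooling would follow the convolutional layers;
# here it is applied to the normalized image only to show the shape change.
feature_vector = global_average_pool(standardized)
```

The two normalizations are shown chained, but the text allows either one on its own ("at least one of the following").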

[0008] Furthermore, according to one aspect of the method of this disclosure, when the sensor is a three-dimensional point cloud acquisition sensor, multiple sensor information is input into a first encoder to obtain corresponding multiple feature vectors, including: performing a first preprocessing on the point cloud information of the three-dimensional point cloud acquisition sensor and inputting it into a graph convolutional network in the first encoder; the first preprocessing includes at least one of the following: noise point removal and coordinate normalization; the graph convolutional network includes: a derived graph convolutional network; using the graph convolutional network to extract the three-dimensional geometric shape and spatial distribution features of the point cloud information; and integrating the three-dimensional geometric shape and spatial distribution features to obtain feature vectors.
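A rough sketch of the point cloud branch follows, under several assumptions that the text does not specify: noise points are removed with a simple statistical test on k-nearest-neighbour distances, coordinates are normalized into the unit sphere, and the graph convolution is a single mean-aggregation layer over a k-nearest-neighbour graph with untrained weights. All names and parameter values are hypothetical.

```python
import numpy as np

def remove_noise_points(points, k=4, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is unusually large."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, 1:k + 1]  # column 0 is the distance to the point itself
    mean_knn = knn.mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    return points[mean_knn <= threshold]

def normalize_coordinates(points):
    """Centre the cloud at the origin and scale it into the unit sphere."""
    centred = points - points.mean(axis=0)
    return centred / np.linalg.norm(centred, axis=1).max()

def graph_conv_layer(points, features, weight, k=4):
    """One mean-aggregation graph convolution over a k-nearest-neighbour graph, followed by ReLU."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbours = np.argsort(dists, axis=1)[:, :k + 1]  # each point plus its k neighbours
    aggregated = features[neighbours].mean(axis=1)
    return np.maximum(aggregated @ weight, 0.0)

# Hypothetical cloud: 50 clustered surface points plus one far-away noise point.
rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(scale=0.1, size=(50, 3)), [[10.0, 10.0, 10.0]]])

cleaned = remove_noise_points(cloud)
normalized = normalize_coordinates(cleaned)
weight = rng.normal(size=(3, 8))  # untrained, illustrative parameters
point_features = graph_conv_layer(normalized, normalized, weight)
feature_vector = point_features.max(axis=0)  # integrate per-point features into one vector
```

Max pooling is used here to integrate the per-point features; the text only says the features are integrated, so other reductions would fit equally well.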

[0009] Furthermore, according to one aspect of the method of this disclosure, when the sensor is a spectral curve acquisition sensor, multiple sensor information is input into a first encoder to obtain multiple corresponding feature vectors, including: performing a second preprocessing on the one-dimensional spectral curve information of the spectral curve acquisition sensor and inputting it into a one-dimensional convolutional network in the first encoder; the second preprocessing includes at least one of the following: baseline correction and band denoising; using the one-dimensional convolutional network to obtain the absorption and reflection features of the one-dimensional spectral curve information; and integrating the absorption and reflection features to obtain a feature vector.
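The spectral branch can be sketched in the same spirit. The choices below are assumptions rather than details from the disclosure: the baseline is modelled as a low-order polynomial, denoising is a moving average, and the one-dimensional convolution uses random, untrained kernels. The synthetic spectrum with an absorption dip near 700 is invented for illustration.

```python
import numpy as np

def baseline_correct(wavelengths, spectrum, degree=1):
    """Subtract a low-order polynomial fitted to the curve (a simple baseline model)."""
    coeffs = np.polyfit(wavelengths, spectrum, degree)
    return spectrum - np.polyval(coeffs, wavelengths)

def denoise(spectrum, window=5):
    """Band denoising by moving-average smoothing along the band axis."""
    kernel = np.ones(window) / window
    return np.convolve(spectrum, kernel, mode="same")

def conv1d(signal, kernels):
    """Valid 1-D convolution (cross-correlation) with ReLU; kernels has shape (n_kernels, width)."""
    windows = np.lib.stride_tricks.sliding_window_view(signal, kernels.shape[1])
    return np.maximum(windows @ kernels.T, 0.0)

# Synthetic curve: sloped baseline plus one absorption dip centred at 700.
wavelengths = np.linspace(400.0, 1000.0, 128)
spectrum = 0.5 + 0.002 * wavelengths - np.exp(-((wavelengths - 700.0) / 20.0) ** 2)

corrected = baseline_correct(wavelengths, spectrum)
smoothed = denoise(corrected)
rng = np.random.default_rng(0)
kernels = rng.normal(size=(4, 7))  # untrained, illustrative kernels
spectral_feature = conv1d(smoothed, kernels).max(axis=0)  # integrate responses into one vector
```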

[0010] Furthermore, according to one aspect of the method disclosed herein, interacting the multiple feature vectors based on a cross-modal attention mechanism or gating mechanism to obtain multiple complementary feature vectors includes: taking any feature vector as a query and concatenating all feature vectors except the query as the key and value; determining the cross-attention weights between the query and the key; using the cross-attention weights to perform weighted aggregation of the values to obtain an aggregation result; fusing the aggregation result with the feature vector corresponding to the query to obtain a complementary feature vector, and repeating until complementary feature vectors corresponding to all feature vectors are obtained; or, concatenating all feature vectors and inputting them into a gated recurrent unit for feature transformation to obtain multiple corresponding modulation vectors, and multiplying each feature vector by its corresponding modulation vector to obtain multiple complementary feature vectors.

[0011] Furthermore, according to one aspect of the method of this disclosure, based on the confidence of multiple complementary feature vectors, multiple complementary feature vectors are input into a weight generation network to determine multiple weight coefficients, including: for any complementary feature vector, obtaining a quality index of the complementary feature vector; the quality index includes at least one of the following: feature sharpness and signal strength; determining a confidence level based on the quality index; inputting the complementary feature vector into the weight generation network, and performing feature evaluation based on the confidence level to obtain a weight vector; the weight generation network is a two-layer fully connected structure.

[0012] According to another aspect of the method of this disclosure, this disclosure also provides a multimodal feature fusion device, the device comprising: an acquisition unit for acquiring sensor information of different types of sensors on the surface of an object to be measured; the different types of sensors include at least one of the following: visible light imaging sensor, infrared imaging sensor, three-dimensional point cloud acquisition sensor, and spectral curve acquisition sensor; an encoding unit for inputting multiple sensor information into a first encoder to obtain corresponding multiple feature vectors; the first encoder is used to encode the sensor information into feature vectors based on a convolutional coding network; one sensor information corresponds to one convolutional coding network; the convolutional coding network includes at least one of the following: convolutional neural network, graph convolutional network, and one-dimensional convolutional network; an interaction unit for interacting multiple feature vectors based on a cross-modal attention mechanism or a gating mechanism to obtain corresponding multiple complementary feature vectors; a determination unit for inputting multiple complementary feature vectors into a weight generation network based on the confidence of multiple complementary feature vectors to determine multiple weight coefficients; and a fusion unit for using multiple weight coefficients to perform weighted fusion of multiple complementary feature vectors to obtain multimodal fused features.

[0013] According to another aspect of this disclosure, an electronic device is provided, comprising: a memory for storing computer-readable instructions; and a processor for executing the computer-readable instructions, causing the electronic device to perform the method as described in any embodiment of one aspect.

[0014] This disclosure provides a multimodal feature fusion apparatus and electronic device. The disclosure acquires sensor information from different types of sensors on the surface of an object under test; these different types of sensors include at least one of the following: visible light imaging sensors, infrared imaging sensors, three-dimensional point cloud acquisition sensors, and spectral curve acquisition sensors; multiple sensor information is input into a first encoder to obtain corresponding multiple feature vectors; the first encoder is used to encode the sensor information into feature vectors based on a convolutional coding network; one sensor information corresponds to one convolutional coding network; the convolutional coding network includes at least one of the following: convolutional neural network, graph convolutional network, and one-dimensional convolutional network; based on a cross-modal attention mechanism or gating mechanism, multiple feature vectors are interacted to obtain corresponding multiple complementary feature vectors; based on the confidence of multiple complementary feature vectors, multiple complementary feature vectors are input into a weight generation network to determine multiple weight coefficients; using the multiple weight coefficients, the multiple complementary feature vectors are weighted and fused to obtain multimodal fused features. This addresses the shortcomings of existing fusion methods, such as fixed modes, difficulty in mining modal complementary information, and inability to dynamically adapt to scenarios. This disclosure enables unified deep representation of heterogeneous features through a dedicated convolutional coding network. Furthermore, it fully mines and fuses complementary correlation information between modalities using cross-modal interaction mechanisms or gating mechanisms, and adaptively adjusts the contribution of each modality through dynamic weight generation. In summary, the technical solution provided by this disclosure can fuse complementary information, improve the accuracy of multimodal feature fusion, and is adaptively and flexibly adjustable, making it suitable for various application scenarios.

[0015] It should be understood that both the foregoing general description and the following detailed description are exemplary and intended to provide further illustration of the claimed technology.

Attached Figure Description

[0016] The above and other objects, features, and advantages of this disclosure will become more apparent from the more detailed description of the embodiments thereof in conjunction with the accompanying drawings. The drawings are provided to further illustrate the embodiments of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the disclosure and do not constitute a limitation thereof. In the drawings, the same reference numerals generally represent the same components or steps.

[0017] Figure 1 is a flowchart of a multimodal feature fusion method provided in an embodiment of this disclosure;
Figure 2 is a flowchart of another complete multimodal feature fusion method provided in an embodiment of this disclosure;
Figure 3 is a structural block diagram of a multimodal feature fusion device provided in an embodiment of this disclosure;
Figure 4 is a hardware block diagram of an electronic device provided in an embodiment of this disclosure.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this disclosure more apparent, exemplary embodiments according to this disclosure will now be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this disclosure, and not all embodiments of this disclosure. It should be understood that this disclosure is not limited to the exemplary embodiments described herein.

[0019] Currently, most existing multimodal feature fusion methods employ fixed-weight fusion, early input concatenation, or late-stage decision voting. However, fixed-weight fusion pre-defines the contribution ratio of each modality, making it unable to dynamically adjust based on the actual quality of the input samples or differences in the application scenario. Early direct input concatenation involves simply concatenating multimodal information at the raw data level and then uniformly inputting it into the model, failing to fully consider the format differences of heterogeneous data and unable to achieve deep interaction and information complementarity between features. Late-stage decision voting involves simply integrating the features at the decision output layer using averaging or voting after each modality completes its inference independently, potentially losing the correlation information of features in intermediate layers and failing to fully utilize the collaborative relationships between modalities. Thus, these methods lack the ability to dynamically evaluate and adaptively adjust the quality of each modality's features and struggle to effectively mine complementary correlation information between modalities. In summary, existing multimodal feature fusion methods lack complementary information, have low fusion accuracy, use fixed patterns, and are difficult to adapt to various application scenarios.

[0020] Therefore, to address the aforementioned problems, this disclosure provides a multimodal feature fusion method that overcomes the shortcomings of existing fusion methods, such as fixed patterns, difficulty in mining complementary modal information, and inability to dynamically adapt to different scenarios. This disclosure achieves unified deep representation of heterogeneous features through a dedicated convolutional coding network. Simultaneously, it fully mines and fuses complementary correlation information between modalities using cross-modal interaction mechanisms or gating mechanisms, and adaptively adjusts the contribution of each modality through dynamic weight generation. In summary, the technical solution provided by this disclosure can fuse complementary information, improve the accuracy of multimodal feature fusion, and is adaptively flexible, making it suitable for various application scenarios.

[0021] First, this disclosure provides a multimodal feature fusion method. Referring to Figure 1, which is a flowchart of a multimodal feature fusion method provided in an embodiment of this disclosure, the method includes:
In step S101, sensor information of different types of sensors on the surface of the object to be tested is acquired; the different types of sensors include at least one of the following: visible light imaging sensor, infrared imaging sensor, three-dimensional point cloud acquisition sensor, and spectral curve acquisition sensor;
In step S102, multiple sensor information is input into the first encoder to obtain multiple corresponding feature vectors; the first encoder is used to encode the sensor information into feature vectors based on a convolutional coding network; one sensor information corresponds to one convolutional coding network; the convolutional coding network includes at least one of the following: convolutional neural network, graph convolutional network, and one-dimensional convolutional network;
In step S103, based on a cross-modal attention mechanism or gating mechanism, multiple feature vectors are interacted to obtain multiple corresponding complementary feature vectors;
In step S104, based on the confidence of multiple complementary feature vectors, multiple complementary feature vectors are input into the weight generation network to determine multiple weight coefficients;
In step S105, multiple complementary feature vectors are weighted and fused using multiple weight coefficients to obtain multimodal fusion features.

[0022] In this disclosure, different types of sensors can be understood as heterogeneous sensing devices used to capture information about the surface of an object in different dimensions. Specifically, a visible light imaging sensor can be understood as a device that captures signals in the visible light band to create an image; the corresponding sensor information can be an RGB image or a grayscale image of the object's surface. An infrared imaging sensor can be understood as a device that captures thermal radiation or near-infrared reflection signals from an object to create an image; the corresponding sensor information can be an infrared thermogram or a near-infrared image. A three-dimensional point cloud acquisition sensor can be understood as a device that acquires three-dimensional spatial information of an object based on technologies such as laser triangulation and structured light; the corresponding sensor information can be point cloud data or a depth map. A spectral curve acquisition sensor can be understood as a device that captures absorption / reflection signals from an object in different spectral bands; the corresponding sensor information can be one-dimensional spectral curve data.

[0023] In this disclosure, the first encoder can be understood as a feature extraction unit composed of multiple dedicated convolutional coding sub-networks. Each sub-network is adapted to a specific modality of sensor information, and its core function is to convert heterogeneous raw sensor information into homogeneous deep feature representations. The feature vector obtained by the first encoder can be understood as a unified fixed-dimensional vector formed after deep encoding of the raw sensor information. The convolutional coding network in the first encoder can be understood as a dedicated neural network structure designed for the characteristics of different modal data, possessing feature extraction and dimension mapping capabilities, and serving as the basic network unit for achieving unified encoding of heterogeneous data. Specifically, the convolutional neural network can be understood as a convolutional coding network suitable for two-dimensional image data, capable of effectively extracting two-dimensional spatial features such as texture, contour, and semantics from visible light and infrared images. The graph convolutional network can be understood as a convolutional coding network suitable for unstructured three-dimensional point cloud data, capable of modeling the spatial topological relationships of point cloud data based on graph structures, performing convolution operations on irregular point sets, and effectively extracting the three-dimensional geometric shape and spatial distribution features of the point cloud data. One-dimensional convolutional networks can be understood as convolutional coding networks suitable for one-dimensional sequential spectral data. They use one-dimensional convolutional kernels to perform convolution operations along the spectral band dimension, which can capture the absorption and reflection features of spectral curves and extract the sequential semantic information of spectral data, as detailed below.

[0024] In this disclosure, the cross-modal attention mechanism can be understood as an interactive mechanism that enables different modal features to actively mine and associate complementary information with other modal features. It can achieve accurate information matching and fusion by taking any modal feature as a query and the other modal features as keys and values and calculating cross-attention weights, as detailed below.
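The query/key/value interaction described here can be sketched as follows. This is one plausible reading, not the disclosed implementation: "concatenating" the other modalities is interpreted as stacking them into a matrix of keys and values, the attention weights come from a scaled dot product with no learned projections, and the aggregation result is fused with the query by simple addition. The function names and the toy feature values are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(x - x.max())
    return exp / exp.sum()

def cross_modal_attention(features):
    """features: (M, d) float array, one row per modality. Returns (M, d) complementary features."""
    n_modalities, dim = features.shape
    complementary = np.empty_like(features)
    for i in range(n_modalities):
        query = features[i]
        others = np.delete(features, i, axis=0)   # keys and values: every other modality
        scores = others @ query / np.sqrt(dim)    # scaled dot-product similarity
        weights = softmax(scores)                 # cross-attention weights
        aggregated = weights @ others             # weighted aggregation of the values
        complementary[i] = query + aggregated     # fuse with the query (residual sum, an assumption)
    return complementary

# Three toy modalities with two-dimensional features.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
complementary = cross_modal_attention(features)
```

For the third modality the two other rows score equally against the query, so they are averaged and the result is [1.5, 1.5]. A trained system would normally add learned query, key, and value projections, which are omitted here to keep the sketch short.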

[0025] In this disclosure, the gating mechanism can be understood as an interactive mechanism for controlling and enhancing the information flow of multimodal features. It can filter and enhance each modal feature by generating modulation vectors to achieve effective information retention, as detailed below.
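A minimal sketch of the gating idea is given below. It deliberately swaps in a simpler technique than the one the disclosure names: instead of a recurrent unit, a single sigmoid-activated linear layer maps the concatenated features to the modulation vectors. The parameters are random and untrained, and all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_interaction(features, gate_weight, gate_bias):
    """features: (M, d). Returns (M, d) features scaled by gates computed from all modalities."""
    n_modalities, dim = features.shape
    joint = features.reshape(-1)                            # concatenate all modalities: (M*d,)
    gates = sigmoid(gate_weight @ joint + gate_bias)        # one gate value per feature element
    modulation = gates.reshape(n_modalities, dim)           # one modulation vector per modality
    return features * modulation                            # element-wise filtering

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))
gate_weight = rng.normal(scale=0.1, size=(24, 24))  # untrained, illustrative parameters
gate_bias = np.zeros(24)
complementary = gated_interaction(features, gate_weight, gate_bias)
```

Because every gate lies strictly between 0 and 1, this simplified form can only attenuate feature elements; enhancement relative to the input would need a different gate range or an additive path.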

[0026] In this disclosure, complementary feature vectors can be understood as enhanced feature vectors obtained through cross-modal interaction. They can not only retain the core information of their own modality, but also integrate complementary information from other modalities, making them more representative than the original feature vectors.

[0027] In this disclosure, confidence level can be understood as an indicator that measures the quality of each complementary feature vector and its contribution to the current fusion task, and is mainly based on a comprehensive judgment of factors such as feature clarity and signal strength.

[0028] In this disclosure, the weight generation network can be understood as a lightweight sub-network used to dynamically evaluate the contribution of each modality and output weight coefficients. Its input is a complementary feature vector, and its output is a normalized weight vector. The weight coefficient can be understood as a value representing the importance of each complementary feature vector in the final fusion process. The larger the weight, the higher the contribution of the corresponding modality feature to the current task, and the sum of all weight coefficients is 1.
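One way to picture such a weight generation network is sketched below. The disclosure does not say how the confidence enters the network, so this example makes its own assumptions: signal strength is taken as the mean absolute activation, the confidence is that value squashed into [0, 1), it is appended to each complementary feature vector as an extra input, and a softmax over the per-modality scores enforces the sum-to-one property. Parameters are random and untrained.

```python
import numpy as np

def confidence_from_quality(feature):
    """Hypothetical quality index: mean absolute activation, squashed into [0, 1)."""
    strength = np.abs(feature).mean()
    return strength / (1.0 + strength)

def weight_generation_network(features, w1, b1, w2, b2):
    """Two fully connected layers score each modality; softmax normalizes the scores."""
    conf = np.array([confidence_from_quality(f) for f in features])
    inputs = np.concatenate([features, conf[:, None]], axis=1)  # (M, d + 1)
    hidden = np.maximum(inputs @ w1 + b1, 0.0)                   # first layer with ReLU
    scores = (hidden @ w2 + b2).ravel()                          # second layer: one score per modality
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                                       # weight coefficients summing to 1

rng = np.random.default_rng(0)
complementary = rng.normal(size=(3, 8))
w1 = rng.normal(scale=0.1, size=(9, 16))   # untrained, illustrative parameters
b1 = np.zeros(16)
w2 = rng.normal(scale=0.1, size=(16, 1))
b2 = np.zeros(1)
weights = weight_generation_network(complementary, w1, b1, w2, b2)
```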

[0029] In this disclosure, multimodal fusion features can be understood as a unified feature representation obtained by weighting and summing complementary feature vectors based on dynamic weight coefficients. It can integrate complementary information from all modalities and can also adaptively adjust the weights of each modality according to sample quality and scene differences.
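The weighted sum itself is short enough to show directly. In this sketch the weight coefficients are supplied by hand rather than produced by a network, and the feature values are invented for illustration.

```python
import numpy as np

def weighted_fusion(complementary, weights):
    """Scale each complementary feature vector by its weight coefficient and sum over modalities."""
    return (weights[:, None] * complementary).sum(axis=0)

complementary = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # three modalities, d = 2
weights = np.array([0.2, 0.3, 0.5])                              # must sum to 1
fused = weighted_fusion(complementary, weights)                  # [3.6, 4.6]
```

Lowering one modality's weight, for example when its image is blurred, shrinks its share of the fused feature without any change to the other steps.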

[0030] Specifically, multimodal feature fusion can include the following steps:
Step 1: Preprocess the acquired sensor information such as visible light images, infrared images, point cloud data, and spectral curves (e.g., noise removal, normalization, baseline correction, etc.). At the same time, based on the preset coordinate system mapping relationship, complete the spatial alignment and time synchronization of each modal data to ensure that all sensor information corresponds to the same area of the object under test at the same time.
Step 2: Input the preprocessed sensor information of each modality into the corresponding dedicated convolutional coding sub-network in the first encoder (such as a convolutional neural network for visible light / infrared images, a graph convolutional network for point cloud data, and a one-dimensional convolutional network for spectral curves). Through network forward propagation and feature extraction, multiple original feature vectors of the same dimension are output.
Step 3: Construct a feature interaction module using a cross-modal attention mechanism or gating mechanism, and input all original feature vectors into this module to enhance features through information interaction, resulting in multiple complementary feature vectors. Specifically, if an attention mechanism is used, any modal feature is used as the query, the remaining features are concatenated as the key and value, cross-attention is calculated, and complementary information is fused. If a gating mechanism is used, all features are concatenated to generate modulation vectors, and feature enhancement is achieved through element-wise multiplication. In either case, multiple complementary feature vectors are output, as detailed below.
Step 4: Input the obtained complementary feature vectors into the weight generation network. The network analyzes the quality and confidence of each feature vector and outputs an initial weight vector, which can then be normalized by a function to obtain multiple weight coefficients that sum to 1.
Step 5: Multiply each complementary feature vector element-wise by its corresponding weight coefficient, and then sum all the products to obtain a multimodal fusion feature that integrates the core complementary information of each modality, which can be used for subsequent specific tasks such as defect detection.

[0031] The following will describe specific application scenarios of the fusion method disclosed herein, including: The multimodal fusion features are input into a classifier or segmentation decoder to output the defect detection results on the surface of the object under test; the defect detection results include at least one of the following: defect type, defect location coordinates, and detection confidence.

[0032] In this disclosure, a classifier can be understood as a lightweight neural network module used to determine the defect category of multimodal fusion features. It is typically composed of multiple fully connected layers, activation functions, and normalization layers, and can map multimodal fusion features to a preset defect category space and output the probability distribution of the corresponding category.

[0033] In this disclosure, the segmentation decoder can be understood as a decoding network module used to realize pixel-level or region-level localization of surface defects of the object under test. It is usually composed of deconvolution layers, upsampling layers, and skip connection layers, and can gradually restore the multimodal fusion features to the input image size, and output pixel-by-pixel defect segmentation mask and boundary information.

[0034] In this disclosure, the defect detection result can be understood as comprehensive judgment information obtained based on the multimodal fusion features of this disclosure, used to characterize the surface defect state of the object under test. Specifically, the defect type identifies the category of defects present on the surface of the object under test, such as scratches, cracks, dents, stains, deformations, etc.; the defect location coordinates characterize the specific spatial location of the defect on the surface of the object under test or in the corresponding image coordinate system, which can be represented by pixel coordinates or world coordinates; and the detection confidence score characterizes the reliability of the model's current defect judgment result, typically a value between 0 and 1, with higher values indicating stronger reliability of the judgment result.

[0035] Specifically, defect classification based on the multimodal fusion features includes: inputting the obtained multimodal fusion features into a pre-trained classifier; performing nonlinear transformation and feature mapping on the fusion features through the fully connected layers inside the classifier to obtain an output value for each preset defect category; processing the output values with a softmax function to obtain the predicted probability of each defect category; and selecting the category with the highest predicted probability as the final defect type, with that maximum probability serving as the detection confidence of the current detection result, thereby completing the defect classification task.
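The classification step described above can be illustrated with a minimal Python sketch. It assumes the classifier has already produced one output value per preset defect category; the label names and the example output values are hypothetical and are not taken from this disclosure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return (defect type, detection confidence) from classifier output values."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical output values for four preset defect categories.
labels = ["scratch", "crack", "dent", "stain"]
logits = [2.0, 0.5, -1.0, 0.1]
defect_type, confidence = classify(logits, labels)
```

The maximum probability doubles as the detection confidence, so it always lies strictly between 0 and 1.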

[0036] The acquisition of sensor information from the different types of sensors is explained in detail below and includes: obtaining the mapping relationship between the sensor coordinates and the standard coordinates for every type of sensor; and, based on the mapping relationship, having the different types of sensors collect data at the same location on the surface of the object under test at the same time, yielding multiple pieces of sensor information.

[0037] In this disclosure, the mapping relationship can be understood as the spatial position transformation relationship between each sensor's own local coordinate system and a pre-selected unified reference coordinate system, which is the core basis for realizing multimodal data spatial alignment.

[0038] Specifically, acquiring sensor information may include the following steps: First, a precise mapping relationship between the local coordinate system and the reference coordinate system of each sensor can be established through a calibration algorithm. Then, by using a unified hardware clock or external encoder trigger signal, all different types of sensors are synchronously controlled to ensure that all sensors simultaneously acquire data on the same surface state of the object under test, guaranteeing consistency in acquisition timing. Next, the raw data synchronously acquired by each sensor is transformed and resampled in real time using pre-stored mapping relationships, uniformly converting all sensor data to the reference coordinate system to obtain pixel-level or voxel-level aligned multimodal data.
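The coordinate conversion in the last step can be sketched as follows. This is only an illustration: it assumes the pre-stored mapping relationship is a 4x4 homogeneous transformation matrix (rotation plus translation) obtained by calibration, and the matrix values below are hypothetical.

```python
def to_reference(points, transform):
    """Map 3D points from a sensor's local frame into the reference frame.

    transform is a 4x4 row-major homogeneous matrix (assumed form of the
    stored mapping relationship); points is a list of (x, y, z) tuples.
    """
    mapped = []
    for x, y, z in points:
        p = (x, y, z, 1.0)
        mapped.append(
            tuple(sum(transform[r][c] * p[c] for c in range(4)) for r in range(3))
        )
    return mapped


# Hypothetical calibration result: rotate 90 degrees about z, then shift x by 1.
sensor_to_reference = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
aligned = to_reference([(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], sensor_to_reference)
```

Applying each sensor's own matrix to its synchronously acquired data places all modalities in the same reference coordinate system before resampling.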

[0039] The following explains how the sensor information from the different types of sensors is encoded to obtain feature vectors. The first scenario: when the sensor is a visible light imaging sensor and / or an infrared imaging sensor, inputting the multiple pieces of sensor information into the first encoder to obtain the corresponding feature vectors includes: normalizing the visible light imaging from the visible light imaging sensor and / or the infrared imaging from the infrared imaging sensor and inputting the result into the convolutional neural network in the first encoder, where the normalization includes at least one of pixel value normalization and mean-variance normalization; and performing network forward propagation and dimensionality reduction integration with the convolutional neural network to obtain the feature vectors. Forward propagation extracts hierarchical features, from local texture to global semantics, from the visible light imaging and / or infrared imaging. Dimensionality reduction integration is achieved through a global pooling layer in the convolutional neural network.

[0040] In this disclosure, visible light imaging can be understood as two-dimensional image data, including color images or grayscale images, formed after a visible light imaging sensor captures the reflected and transmitted signals of visible light bands on the surface of the object under test.

[0041] In this disclosure, infrared imaging can be understood as two-dimensional image data formed after an infrared imaging sensor captures thermal radiation or near-infrared band signals from the surface of an object under test.

[0042] In this disclosure, normalization can be understood as a preprocessing operation that standardizes and adjusts the pixel values of visible light imaging and / or infrared imaging. The purpose is to eliminate numerical differences between different devices and under different lighting conditions, thereby improving the stability of network training and feature extraction. Specifically, pixel value normalization linearly maps image pixel values from their original range to a preset interval; mean-variance normalization subtracts the global mean of each channel from that channel's pixel values and then divides the result by the channel's global standard deviation, so that the pixel values of each channel have a mean of 0 and a variance of 1.
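Both normalization options can be sketched in a few lines of Python. The sketch operates on the flattened pixel values of a single channel; the target interval [0, 1] and the small constant `eps` that guards against division by zero are assumptions made for illustration.

```python
import math


def min_max_normalize(pixels, lo=0.0, hi=1.0):
    """Pixel value normalization: linearly map values into [lo, hi]."""
    p_min, p_max = min(pixels), max(pixels)
    if p_max == p_min:  # constant channel: nothing to stretch
        return [lo for _ in pixels]
    scale = (hi - lo) / (p_max - p_min)
    return [lo + (p - p_min) * scale for p in pixels]


def standardize(pixels, eps=1e-8):
    """Mean-variance normalization: zero mean, unit variance per channel."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(variance)
    return [(p - mean) / (std + eps) for p in pixels]


channel = [0, 64, 128, 255]  # hypothetical 8-bit pixel values of one channel
scaled = min_max_normalize(channel)
standardized = standardize(channel)
```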

[0043] In this disclosure, network forward propagation can be understood as the process by which a convolutional neural network performs feature extraction, nonlinear transformation, and dimensionality compression on normalized image data in the order of input layer, convolutional layer, activation layer, pooling layer to fully connected layer, gradually extracting from local texture features to global semantic features.

[0044] In this disclosure, dimensionality reduction and integration can be understood as the operation of converting high-dimensional spatial features extracted by a convolutional neural network into a one-dimensional feature vector with a fixed dimension. Its global pooling layer can be understood as the core network layer for achieving dimensionality reduction and integration. By performing global average / max pooling on the spatial dimension of the high-dimensional feature map, the features of each channel are compressed into a single value, and finally, the values of all channels are concatenated to form a one-dimensional feature vector.
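As a minimal sketch of the global pooling described above, the function below compresses each channel of a feature map into a single value and collects the results into a one-dimensional feature vector. The nested-list layout (channels, then rows, then columns) and the example values are assumptions for illustration.

```python
def global_pool(feature_map, mode="avg"):
    """Compress a C x H x W feature map into a length-C feature vector.

    mode="avg" applies global average pooling; mode="max" applies global
    max pooling. Each channel contributes exactly one value.
    """
    vector = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        if mode == "max":
            vector.append(max(values))
        else:
            vector.append(sum(values) / len(values))
    return vector


# Hypothetical feature map with 2 channels of size 2 x 2.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
avg_vector = global_pool(feature_map, mode="avg")
max_vector = global_pool(feature_map, mode="max")
```

The output length depends only on the number of channels, which is why inputs of different spatial size still yield feature vectors of a fixed dimension.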

[0045] Specifically, when the acquired sensor information is visible light imaging and / or infrared imaging, encoding it into a feature vector may include the following steps. Step 1: Read the raw image data of the visible light imaging / infrared imaging, and select pixel value normalization or mean-variance normalization according to actual needs to standardize the image data. Step 2: Input the normalized image data into a convolutional neural network adapted to two-dimensional images, extract local texture features through convolutional layers, perform nonlinear transformations through activation layers, and compress feature dimensions through pooling layers, thereby gradually extracting hierarchical features from low-order texture to high-order semantics. Step 3: Input the high-dimensional feature map obtained from the forward propagation of the network into the global pooling layer, and compress the spatial dimension through global average pooling or global max pooling to output a feature vector of uniform dimension, thus completing the feature encoding of the visible light / infrared imaging.

[0046] The second scenario: When the sensor is a 3D point cloud acquisition sensor, multiple sensor information is input into the first encoder to obtain multiple corresponding feature vectors, including: The point cloud information from the 3D point cloud acquisition sensor undergoes a first preprocessing step and is then input into the graph convolutional network in the first encoder. The first preprocessing step includes at least one of the following: noise point removal and coordinate normalization. The graph convolutional network includes a derived graph convolutional network. Graph convolutional networks are used to extract the 3D geometric shape and spatial distribution features of point cloud information; By integrating the three-dimensional geometric shape and spatial distribution features, a feature vector is obtained.

[0047] In this disclosure, point cloud information can be understood as an unstructured data set consisting of a large number of three-dimensional spatial coordinate points captured by a three-dimensional point cloud acquisition sensor, which can characterize the three-dimensional geometric shape and spatial distribution of the surface of the object under test.

[0048] In this disclosure, the first preprocessing can be understood as a preprocessing operation performed on the original point cloud information, with the aim of removing noise interference, unifying the coordinate scale, and improving the accuracy of feature extraction by graph convolutional networks. Specifically, noise point removal uses statistical filtering, radius filtering, or outlier detection algorithms to eliminate abnormal points caused by sensor errors or environmental interference. Coordinate normalization subtracts the center coordinates of the point cloud from the coordinates of every point and then scales the result to a preset spatial range (e.g., within a unit sphere) to eliminate the influence of differences in the placement and size of the object under test.
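The first preprocessing can be sketched as follows, assuming radius filtering is chosen for noise point removal and the unit sphere is the preset spatial range. The radius, the neighbor threshold, and the example point cloud are hypothetical values, and the brute-force neighbor search is only suitable for small illustrations.

```python
import math


def radius_filter(points, radius, min_neighbors):
    """Keep points that have at least min_neighbors other points within radius."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def normalize_to_unit_sphere(points):
    """Subtract the centroid, then scale so the farthest point has norm 1."""
    n = len(points)
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    centered = [tuple(p[k] - centroid[k] for k in range(3)) for p in points]
    scale = max(math.hypot(*p) for p in centered) or 1.0  # avoid dividing by 0
    return [tuple(c / scale for c in p) for p in centered]


# Hypothetical cloud: four surface points plus one far-away noise point.
cloud = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (50, 50, 50)]
cleaned = radius_filter(cloud, radius=2.0, min_neighbors=2)
unit_cloud = normalize_to_unit_sphere(cleaned)
```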

[0049] In this disclosure, graph convolutional networks may include graph convolutional architectures suitable for unstructured point cloud data, such as PointNet++ derived graph convolutional networks and PointCNN graph convolutional networks. The core is to construct a graph structure based on the spatial topological relationship of point clouds and perform convolution operations on irregular point sets.

[0050] In this disclosure, three-dimensional geometry can be understood as the macroscopic / microscopic geometric morphological features of the surface of the object under test represented by point cloud information, such as spatial shape attributes like concavity / convexity, curvature, edges, and holes.

[0051] In this disclosure, spatial distribution characteristics can be understood as features that characterize the spatial arrangement pattern, such as the relative position of each coordinate point in the point cloud data, neighborhood density, and topological connectivity.

[0052] Specifically, when the acquired sensor information is point cloud information, encoding it into a feature vector may include the following steps. Step 1: Read the raw point cloud information, first remove outliers using a noise point removal algorithm, and then normalize the coordinates of the remaining valid points to obtain regular point cloud data. Step 2: Input the preprocessed point cloud data into a graph convolutional network (such as PointNet++). The network first constructs the local neighborhood graph structure of the point cloud, then extracts the neighborhood features of each point through graph convolution operations, and gradually aggregates them to obtain the global three-dimensional geometric features and spatial distribution features. Step 3: Concatenate the extracted three-dimensional geometric features and spatial distribution features, align their dimensions, and compress them to a unified dimension through a fully connected layer to output the final feature vector, thus completing the feature encoding of the point cloud information.

[0053] The third scenario: When the sensor is a spectral curve acquisition sensor, multiple sensor information is input into the first encoder to obtain multiple corresponding feature vectors, including: The one-dimensional spectral curve information of the spectral curve acquisition sensor is subjected to a second preprocessing and input into the one-dimensional convolutional network in the first encoder; the second preprocessing includes at least one of the following: baseline correction and band denoising; One-dimensional convolutional networks are used to obtain the absorption and reflection characteristics of one-dimensional spectral curves. The absorption and reflection features are integrated to obtain the feature vector.

[0054] In this disclosure, one-dimensional spectral curve information can be understood as a numerical sequence of reflectance / absorbance of each sampling point on the surface of the object under test in different spectral bands, captured by the spectral curve acquisition sensor.

[0055] In this disclosure, the second preprocessing can be understood as a preprocessing operation performed on the original one-dimensional spectral curve information, with the aim of eliminating baseline drift and noise interference while retaining effective spectral features. Specifically, baseline correction eliminates baseline shifts in the spectral curve through algorithms such as polynomial fitting and adaptive iterative reweighting; band denoising removes random noise in the spectral bands and improves the smoothness of the spectral curve through wavelet transform, moving average filtering, and other methods.
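A simplified sketch of the second preprocessing is given below. It assumes the baseline is a straight line (a degree-1 polynomial fitted by least squares over the whole curve) and uses a centered moving average for band denoising; practical baseline correction often fits a higher-order polynomial or restricts the fit to peak-free regions, which this sketch does not attempt.

```python
def subtract_linear_baseline(spectrum):
    """Fit y = slope * x + intercept by least squares and subtract it."""
    n = len(spectrum)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(spectrum) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, spectrum))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, spectrum)]


def moving_average(spectrum, window=3):
    """Centered moving average; the window shrinks at both ends of the curve."""
    half = window // 2
    smoothed = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        smoothed.append(sum(spectrum[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical reflectance curve with a linear drift across the bands.
raw = [1.0, 2.0, 3.0, 4.0]
corrected = subtract_linear_baseline(raw)
denoised = moving_average([0.0, 3.0, 0.0], window=3)
```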

[0056] In this disclosure, absorption features and reflection features can be understood as the core features in a one-dimensional spectral curve that characterize the material properties of the object under test; absorption features refer to the characteristic peaks at which absorbance increases significantly in a specific spectral band, and reflection features refer to the characteristic peaks at which reflectance increases significantly in a specific spectral band. Different materials correspond to unique absorption / reflection spectral patterns.

[0057] Specifically, when the acquired sensor information is one-dimensional spectral curve information, encoding it into a feature vector may include the following steps. Step 1: Read the original one-dimensional spectral curve information, first eliminate baseline drift through baseline correction, and then remove random noise through band denoising to obtain regular spectral curve data. Step 2: Input the preprocessed spectral curve data into a one-dimensional convolutional network, and slide one-dimensional convolution kernels of different sizes along the band dimension to capture the correlation features between bands and extract the absorption and reflection features in the spectral curve. Step 3: Concatenate the extracted absorption and reflection features, compress their dimensions, and convert them into a feature vector of uniform dimension through a global pooling layer or a fully connected layer, thus completing the feature encoding of the one-dimensional spectral curve information.

[0058] The following elaborates on how the feature vectors interact to obtain complementary feature vectors. One option includes: using any one feature vector as the query, and concatenating all feature vectors other than the query to form the key and the value; determining the cross-attention weights between the query and the key; weighting and aggregating the value with the cross-attention weights to obtain an aggregation result; and fusing the aggregation result with the feature vector corresponding to the query to obtain a complementary feature vector, repeating this process until the complementary feature vectors corresponding to all feature vectors are obtained. Alternatively: concatenating all feature vectors and inputting them into a gated recurrent unit for feature transformation to obtain multiple corresponding modulation vectors; and multiplying each feature vector by its corresponding modulation vector to obtain multiple complementary feature vectors.

[0059] In one embodiment of this disclosure, a cross-modal attention mechanism based on a lightweight decoder layer can be used to achieve feature interaction. Specifically, this includes: using the feature vector corresponding to each modality sequentially as the query vector; concatenating the feature vectors of all other modalities along their channel dimensions to form a key vector and a value vector; calculating the similarity between the query vector and the key vector to obtain a cross-attention weight; then weighting and summing the value vectors according to this cross-attention weight to obtain an aggregation result; finally, performing residual fusion with the original feature vector used as the query to obtain complementary enhanced features that incorporate complementary information from other modalities; repeating the above operations across all modal feature vectors to finally obtain complementary feature vectors corresponding to all modalities.
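The cross-modal attention interaction can be sketched as below. This is an illustration under simplifying assumptions rather than the disclosed network: the feature vectors of the other modalities are treated as a short sequence that serves as both key and value, similarity is a scaled dot product, no learned projection matrices are used, and residual fusion is a plain element-wise addition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, others):
    """Aggregate the other modalities for one query and add the residual."""
    dim = len(query)
    # Scaled dot-product similarity between the query and each key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in others
    ]
    weights = softmax(scores)
    # Weighted sum of the values (here the values equal the keys).
    aggregated = [
        sum(w * value[i] for w, value in zip(weights, others)) for i in range(dim)
    ]
    # Residual fusion with the original query feature vector.
    return [q + a for q, a in zip(query, aggregated)]


def complementary_features(features):
    """Use each modality in turn as the query against all other modalities."""
    return [
        cross_attend(f, features[:i] + features[i + 1:])
        for i, f in enumerate(features)
    ]


# Hypothetical 2-dimensional feature vectors for three modalities.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
complementary = complementary_features(features)
```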

[0060] In another embodiment of this disclosure, a gating mechanism based on a gated recurrent unit can be used to achieve feature interaction. Specifically, this includes: concatenating all feature vectors along their feature dimensions to obtain concatenated features; inputting the concatenated features into a preset gated recurrent unit or multilayer perceptron; through nonlinear transformation and feature learning, outputting modulation vectors that correspond one-to-one with each original feature vector; and then performing element-wise multiplication of each original feature vector with its corresponding modulation vector to achieve information flow control and effective information enhancement for each modality, eliminating redundant noise features, and finally obtaining multiple complementary feature vectors.
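The gating alternative can be sketched in the same spirit. The sketch assumes the simplest possible gate network, a single linear layer followed by a sigmoid, standing in for the gated recurrent unit or multilayer perceptron; the weights and bias are hypothetical placeholders for learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_features(features, weight, bias):
    """Modulate each feature vector with a gate computed from all modalities.

    weight has one row per element of the concatenated feature (assumed
    layout); every gate value lies in (0, 1).
    """
    concat = [v for f in features for v in f]
    gates = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weight, bias)
    ]
    gated, offset = [], 0
    for f in features:
        modulation = gates[offset:offset + len(f)]
        # Element-wise multiplication with the modulation vector.
        gated.append([x * g for x, g in zip(f, modulation)])
        offset += len(f)
    return gated


# Hypothetical parameters: all-zero weights give a gate of 0.5 everywhere.
features = [[2.0, 4.0], [6.0, 8.0]]
weight = [[0.0] * 4 for _ in range(4)]
bias = [0.0] * 4
gated = gate_features(features, weight, bias)
```

Because every gate depends on the concatenation of all modalities, each modality's modulation vector can reflect what the other modalities observed.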

[0061] The determination of the weight coefficients is explained in detail below and includes: for any complementary feature vector, obtaining a quality index of the complementary feature vector, where the quality index includes at least one of feature sharpness and signal strength; determining the confidence level based on the quality index; and inputting the complementary feature vector into the weight generation network and performing feature evaluation based on the confidence level to obtain the weight vector, where the weight generation network is a two-layer fully connected structure.

[0062] In this disclosure, quality indicators can be understood as parameters used to quantitatively evaluate the representational ability and effective information content of each complementary feature vector. Among them, feature sharpness characterizes how completely the complementary feature vector expresses the key semantic information of the defect. The higher the sharpness, the more accurately the feature depicts core information such as the texture, geometry, and spectrum of the defect. Signal strength characterizes the magnitude of the effective signal carried by the complementary feature vector. The higher the signal strength, the less the feature is affected by environmental noise and sensor interference, and the higher the proportion of effective information.

[0063] Specifically, dynamically determining the weight coefficients may include the following steps. Step 1: For each complementary feature vector, extract quality indicators such as feature sharpness and signal strength. Calculate a comprehensive score based on the value of each quality indicator, normalize the comprehensive score, and use it as the confidence score of the complementary feature vector. The higher the confidence score, the higher the reliability and importance of that modality's feature in the current sample. Step 2: Concatenate all complementary feature vectors along their feature dimensions to obtain the concatenated overall feature, and input the overall feature into a lightweight weight generation network with a two-layer fully connected structure. Step 3: The weight generation network learns and evaluates the contribution of each modality's feature based on the confidence level corresponding to each complementary feature vector and the consistency relationship between the features of different modalities. Through the nonlinear mapping of the fully connected layers, it outputs an initial weight vector whose length equals the number of modalities. Step 4: Normalize the initial weight vector with a softmax function so that all coefficients sum to 1, thus obtaining multiple weight coefficients that characterize the importance of each modality's feature.
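The four steps above can be sketched as follows. Several details are assumptions made only for illustration: the comprehensive score is the plain average of two quality indicators, normalization divides each score by the sum over all modalities, the confidences are appended to the concatenated feature as extra network inputs, and the hidden layer uses a ReLU activation. The parameter values are hypothetical placeholders for learned weights.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def confidences_from_quality(quality):
    """quality: one (sharpness, signal_strength) pair per modality.

    Assumes at least one indicator is positive so the sum is non-zero.
    """
    scores = [(sharpness + strength) / 2 for sharpness, strength in quality]
    total = sum(scores)
    return [s / total for s in scores]


def dense(x, weight, bias):
    """One fully connected layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def generate_weights(features, confidences, w1, b1, w2, b2):
    """Two-layer fully connected network followed by softmax normalization."""
    x = [v for f in features for v in f] + list(confidences)
    hidden = [max(0.0, h) for h in dense(x, w1, b1)]  # ReLU (assumed)
    initial = dense(hidden, w2, b2)  # one value per modality
    return softmax(initial)


# Hypothetical inputs: two modalities with 2-dimensional complementary features.
features = [[0.2, 0.4], [0.6, 0.8]]
confidences = confidences_from_quality([(0.9, 0.9), (0.2, 0.4)])
# Hypothetical parameters that simply pass the two confidences through.
w1 = [[0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
b1 = [0.0, 0.0]
w2 = [[1, 0], [0, 1]]
b2 = [0.0, 0.0]
weights = generate_weights(features, confidences, w1, b1, w2, b2)
```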

[0064] For example, Figure 2 is a flowchart of another complete multimodal feature fusion method provided by embodiments of this disclosure. As can be seen from Figure 2: First, four heterogeneous types of modal information (visible light imaging, infrared imaging, 3D point cloud data, and spectral curve data) are input into their corresponding dedicated encoding sub-networks. Specifically, visible light and infrared imaging data are fed into convolutional neural networks or a hybrid architecture combining them with a visual Transformer; 3D point cloud data is fed into graph convolutional networks or point-set-based deep networks; and spectral curve data is fed into one-dimensional convolutional networks or temporal Transformers. This achieves unified deep encoding of heterogeneous features and outputs feature vectors of the same dimension. Second, the feature vectors from all modalities are fed into a feature interaction module, where cross-modal attention or gating mechanisms enable information complementarity and enhancement between features, resulting in context-aware complementary feature vectors. Next, the complementary feature vectors are input into a weight generation sub-network, which dynamically outputs normalized weight coefficients by analyzing the quality and consistency of each modality's feature. Subsequently, an adaptive weighted fusion module uses these weight coefficients to compute a weighted sum of the complementary feature vectors, generating the final multimodal fusion feature. Finally, the fusion feature is fed into a lightweight task-specific network (such as a classifier or segmentation decoder) to output the defect detection results.
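The adaptive weighted fusion step reduces to a weighted sum of the complementary feature vectors. A minimal sketch, assuming all vectors share the same dimension and the weight coefficients already sum to 1 (the example values are hypothetical):

```python
def fuse(features, weights):
    """Weighted sum of equally sized feature vectors, one weight per modality."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# Hypothetical complementary feature vectors and normalized weight coefficients.
complementary = [[1.0, 2.0], [3.0, 4.0]]
weights = [0.25, 0.75]
fused = fuse(complementary, weights)
```

A modality with a low weight, for example a visible light image degraded by strong light, contributes proportionally less to the fused feature.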
This disclosure also provides a comparison of experimental results between the method of this disclosure and other methods, as shown in Table 1 below:

Table 1

| Method | Average accuracy (mAP) | Accuracy under strong light interference | Accuracy under partial occlusion |
| --- | --- | --- | --- |
| Visible light only (VL) | 88.5% | 65.2% | 70.1% |
| Multimodal feature splicing | 94.3% | 85.7% | 88.9% |
| Multimodal average weight fusion | 95.1% | 87.4% | 90.2% |
| This disclosure (dynamic weighted fusion) | 98.2% | 96.5% | 95.8% |

As shown in Table 1, the dynamic weighted fusion method proposed in this disclosure outperforms the comparative methods on all indicators. It achieves an average accuracy (mAP) of 98.2%, which is 9.7 percentage points higher than the single-modal method using only visible light, and 3.9 and 3.1 percentage points higher than multimodal feature splicing and average weight fusion, respectively. Its advantages are even more pronounced in complex scenarios: it achieves an accuracy of 96.5% under strong light interference and 95.8% under partial occlusion, exceeding the next-best method (average weight fusion) by 9.1 and 5.6 percentage points, respectively. This indicates that the dynamic weighted fusion mechanism of this disclosure can effectively cope with scene interference, adaptively favoring the most reliable modal information during fusion and improving the accuracy of defect detection.

[0065] This disclosure also provides a multimodal feature fusion apparatus. Figure 3 is a structural block diagram of a multimodal feature fusion apparatus provided in an embodiment of the present disclosure. As shown in Figure 3, the multimodal feature fusion apparatus 300 includes:
The acquisition unit 301, used to acquire sensor information from different types of sensors on the surface of the object under test; the different types of sensors include at least one of the following: a visible light imaging sensor, an infrared imaging sensor, a three-dimensional point cloud acquisition sensor, and a spectral curve acquisition sensor;
The encoding unit 302, used to input the multiple pieces of sensor information into the first encoder to obtain multiple corresponding feature vectors; the first encoder encodes the sensor information into feature vectors based on a convolutional coding network; each piece of sensor information corresponds to one convolutional coding network; the convolutional coding network includes at least one of the following: a convolutional neural network, a graph convolutional network, and a one-dimensional convolutional network;
The interaction unit 303, used to make the multiple feature vectors interact based on a cross-modal attention mechanism or a gating mechanism to obtain multiple corresponding complementary feature vectors;
The determination unit 304, used to input the multiple complementary feature vectors into the weight generation network based on the confidence of the multiple complementary feature vectors to determine multiple weight coefficients;
The fusion unit 305, used to weight and fuse the multiple complementary feature vectors using the multiple weight coefficients to obtain the multimodal fusion features.

[0066] In one exemplary embodiment, the fusion unit 305 is further configured to: input the multimodal fusion features into a classifier or a segmentation decoder, and output the defect detection results of the surface of the object under test; the defect detection results include at least one of the following: defect type, defect location coordinates, and detection confidence.

[0067] In one exemplary embodiment, the acquisition unit 301 is specifically used to: acquire the mapping relationship between the sensor coordinates and standard coordinates of all different types of sensors; based on the mapping relationship, collect data from different types of sensors at the same location on the surface of the object to be measured at the same time to obtain multiple sensor information.

[0068] In one exemplary embodiment, the encoding unit 302 is specifically configured to: normalize the visible light imaging from the visible light imaging sensor and / or the infrared imaging from the infrared imaging sensor, and input them into the convolutional neural network in the first encoder; the normalization processing includes at least one of the following: pixel value normalization and mean-variance normalization; perform network forward propagation and dimensionality reduction integration processing using the convolutional neural network to obtain feature vectors; the network forward propagation is used to extract hierarchical features from local texture to global semantics for visible light imaging and / or infrared imaging; the dimensionality reduction integration processing is implemented through a global pooling layer in the convolutional neural network.

[0069] In one exemplary embodiment, the encoding unit 302 is specifically used to: perform a first preprocessing on the point cloud information of the three-dimensional point cloud acquisition sensor and input it into the graph convolutional network in the first encoder; the first preprocessing includes at least one of the following: noise point removal and coordinate normalization; the graph convolutional network includes: a derived graph convolutional network; extract the three-dimensional geometric shape and spatial distribution features of the point cloud information using the graph convolutional network; and integrate the three-dimensional geometric shape and spatial distribution features to obtain a feature vector.

[0070] In one exemplary embodiment, the encoding unit 302 is specifically used to: perform a second preprocessing on the one-dimensional spectral curve information of the spectral curve acquisition sensor and input it into the one-dimensional convolutional network in the first encoder; the second preprocessing includes at least one of the following: baseline correction and band denoising; use the one-dimensional convolutional network to obtain the absorption features and reflection features of the one-dimensional spectral curve information; and integrate the absorption features and reflection features to obtain a feature vector.

[0071] In one exemplary embodiment, the interaction unit 303 is specifically configured to: take any feature vector as a query, and concatenate all feature vectors except the query as a key and a value; determine the cross-attention weight between the query and the key; use the cross-attention weight to perform weighted aggregation on the value to obtain an aggregation result; fuse the aggregation result with the feature vector corresponding to the query to obtain a complementary feature vector, and repeat until multiple complementary feature vectors corresponding to all feature vectors are obtained; or, concatenate all feature vectors and input them into the gated recurrent unit for feature transformation to obtain multiple corresponding modulation vectors; multiply each feature vector by its corresponding modulation vector to obtain multiple complementary feature vectors.

[0072] In one exemplary embodiment, the determining unit 304 is specifically used to: for any complementary feature vector, obtain a quality index of the complementary feature vector; the quality index includes at least one of the following: feature sharpness and signal strength; determine the confidence level based on the quality index; input the complementary feature vector into the weight generation network, and perform feature evaluation based on the confidence level to obtain a weight vector; the weight generation network is a two-layer fully connected structure.

[0073] Figure 4 is a hardware block diagram of an electronic device provided according to an embodiment of the present disclosure. The electronic device 400 according to an embodiment of the present disclosure includes at least a processor and a memory for storing computer-readable instructions. When the computer-readable instructions are loaded and executed by the processor, the processor performs the multimodal feature fusion method described in any of the preceding embodiments of the present disclosure.

[0074] The electronic device 400 illustrated in Figure 4 specifically includes a central processing unit (CPU) 401, a graphics processing unit (GPU) 402, and a memory 403. These units are interconnected via a bus 404. The CPU 401 and / or GPU 402 can function as the aforementioned processor, and the memory 403 can function as the aforementioned memory for storing computer-readable instructions. Furthermore, the electronic device 400 may also include a communication unit 405, a storage unit 406, an output unit 407, and an input unit.

In summary, this disclosure provides a multimodal feature fusion apparatus and electronic device. This disclosure acquires sensor information from different types of sensors on the surface of the object under test; these different types of sensors include at least one of the following: visible light imaging sensors, infrared imaging sensors, three-dimensional point cloud acquisition sensors, and spectral curve acquisition sensors. The multiple pieces of sensor information are input into a first encoder to obtain multiple corresponding feature vectors; the first encoder encodes the sensor information into feature vectors based on a convolutional coding network; each piece of sensor information corresponds to one convolutional coding network; the convolutional coding network includes at least one of the following: convolutional neural networks, graph convolutional networks, and one-dimensional convolutional networks. Based on a cross-modal attention mechanism or a gating mechanism, the multiple feature vectors interact to obtain multiple corresponding complementary feature vectors; based on the confidence of the multiple complementary feature vectors, the complementary feature vectors are input into a weight generation network to determine multiple weight coefficients; using the multiple weight coefficients, the complementary feature vectors are weighted and fused to obtain the multimodal fusion features.
This addresses the shortcomings of existing fusion methods, such as fixed fusion patterns, difficulty in mining complementary information between modalities, and inability to adapt dynamically to different scenarios. This disclosure enables unified deep representation of heterogeneous features through dedicated convolutional coding networks. Furthermore, it mines and fuses complementary correlation information between modalities using cross-modal attention or gating mechanisms, and adaptively adjusts the contribution of each modality through dynamic weight generation. Overall, the technical solution provided by this disclosure can fuse complementary information, improve the accuracy of multimodal feature fusion, and adjust adaptively and flexibly, making it suitable for various application scenarios.

[0075] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this disclosure.

[0076] The basic principles of this disclosure have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this disclosure are merely examples and not limitations, and should not be considered essential to every embodiment of this disclosure. Furthermore, the specific details disclosed above are provided only for illustration and ease of understanding, not as limitations; they do not mean that this disclosure must be implemented using those specific details.

[0077] The block diagrams of the components, apparatuses, devices, and systems disclosed herein are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to,” and can be used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and/or,” and can be used interchangeably with it unless the context clearly indicates otherwise. The term “such as” as used herein means “such as but not limited to,” and can be used interchangeably with it.

[0078] Additionally, as used herein, an "or" in a list of items beginning with "at least one" indicates a disjunctive list, so that, for example, "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not imply that the described example is preferred or better than other examples.

[0079] It should also be noted that in the systems and methods of this disclosure, the components or steps can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of this disclosure.

[0080] Various changes, substitutions, and modifications can be made to the technology described herein without departing from the teachings defined by the appended claims. Furthermore, the scope of the claims of this disclosure is not limited to the specific aspects of the processes, machines, manufactures, events, means, methods, and actions described above. Currently existing or later-developed processes, machines, manufactures, events, means, methods, or actions that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein can be utilized. Therefore, the appended claims include such processes, machines, manufactures, events, means, methods, or actions within their scope.

[0081] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of this disclosure. Therefore, this disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0082] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this disclosure to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations thereof.

Claims

1. A multimodal feature fusion method, characterized in that the method includes:
acquiring, from different types of sensors, sensor information on the surface of the object under test, wherein the different types of sensors include at least one of the following: a visible light imaging sensor, an infrared imaging sensor, a three-dimensional point cloud acquisition sensor, and a spectral curve acquisition sensor;
inputting the multiple pieces of sensor information into a first encoder to obtain multiple corresponding feature vectors, wherein the first encoder is used to encode the sensor information into the feature vectors based on a convolutional coding network, one piece of sensor information corresponds to one convolutional coding network, and the convolutional coding network includes at least one of the following: a convolutional neural network, a graph convolutional network, and a one-dimensional convolutional network;
performing interaction among the multiple feature vectors based on a cross-modal attention mechanism or a gating mechanism to obtain multiple corresponding complementary feature vectors;
inputting the multiple complementary feature vectors into a weight generation network, based on the confidence levels of the multiple complementary feature vectors, to determine multiple weight coefficients; and
performing weighted fusion of the multiple complementary feature vectors using the multiple weight coefficients to obtain multimodal fusion features.

2. The method according to claim 1, characterized in that the method further includes:
inputting the multimodal fusion features into a classifier or a segmentation decoder to output defect detection results for the surface of the object under test, wherein the defect detection results include at least one of the following: defect type, defect location coordinates, and detection confidence.

3. The method according to claim 1, characterized in that acquiring the sensor information on the surface of the object under test from the different types of sensors includes:
obtaining, for each of the different types of sensors, the mapping relationship between sensor coordinates and standard coordinates; and
based on the mapping relationships, using the different types of sensors to collect data at the same location on the surface of the object under test at the same time, thereby obtaining the multiple pieces of sensor information.
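
As an illustration of the mapping relationship in claim 3, the sketch below converts sensor coordinates to a shared standard frame. The claim does not state the form of the mapping; a 2-D affine transform is assumed here purely for illustration, and `to_standard`, `mappings`, and all numeric values are hypothetical. Time synchronisation of the acquisitions is not shown.

```python
def to_standard(point, affine):
    """Map a 2-D sensor coordinate into the shared standard frame.

    `affine` is ((a, b, tx), (c, d, ty)); the affine model is an assumption.
    """
    (a, b, tx), (c, d, ty) = affine
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

# Hypothetical calibration results: one mapping per sensor type.
mappings = {
    "visible": ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),     # already in the standard frame
    "infrared": ((2.0, 0.0, 10.0), (0.0, 2.0, -5.0)),  # half resolution, offset mount
}

# The same physical location as seen by each sensor.
visible_xy = to_standard((110.0, 35.0), mappings["visible"])
infrared_xy = to_standard((50.0, 20.0), mappings["infrared"])
```

With these values both sensors resolve to the same standard coordinate, which is the condition for treating their readings as observations of one surface location.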

4. The method according to claim 1, characterized in that, when the sensor is the visible light imaging sensor and/or the infrared imaging sensor, inputting the multiple pieces of sensor information into the first encoder to obtain the multiple corresponding feature vectors includes:
normalizing the visible light image from the visible light imaging sensor and/or the infrared image from the infrared imaging sensor and inputting the normalized image into the convolutional neural network in the first encoder, wherein the normalization includes at least one of the following: pixel value normalization and mean-variance normalization; and
obtaining the feature vector by performing forward propagation and dimensionality-reduction integration using the convolutional neural network, wherein the forward propagation is used to extract hierarchical features, from local texture to global semantics, from the visible light image and/or the infrared image, and the dimensionality-reduction integration is implemented through a global pooling layer in the convolutional neural network.
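
The two normalization options and the global pooling step named in claim 4 can be sketched as below. This is a hedged illustration only: the convolutional layers themselves are omitted, global average pooling is assumed as the pooling type (the claim says only "global pooling"), and the function names and the 8-bit value range are assumptions.

```python
import math

def pixel_normalize(image, max_value=255.0):
    """Pixel value normalization: scale raw values to [0, 1] (8-bit range assumed)."""
    return [[p / max_value for p in row] for row in image]

def mean_variance_normalize(image, eps=1e-8):
    """Mean-variance normalization: zero mean, unit variance over the whole image."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(var) + eps  # eps avoids division by zero on flat images
    return [[(p - mean) / std for p in row] for row in image]

def global_average_pool(feature_maps):
    """Collapse each 2-D feature map (one per channel) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [p for row in fmap for p in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Hypothetical 2x2 image and two hypothetical CNN output channels.
scaled = pixel_normalize([[0, 255], [51, 102]])
standardized = mean_variance_normalize([[0.0, 2.0], [2.0, 4.0]])
feature_vector = global_average_pool([[[1.0, 3.0], [5.0, 7.0]],
                                      [[2.0, 2.0], [2.0, 2.0]]])
```

Global pooling yields one value per channel regardless of image size, so visible and infrared images of different resolutions can still produce feature vectors of equal length.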

5. The method according to claim 1, characterized in that, when the sensor is the three-dimensional point cloud acquisition sensor, inputting the multiple pieces of sensor information into the first encoder to obtain the multiple corresponding feature vectors includes:
subjecting the point cloud information from the three-dimensional point cloud acquisition sensor to a first preprocessing and inputting it into the graph convolutional network in the first encoder, wherein the first preprocessing includes at least one of the following: noise point removal and coordinate normalization, and the graph convolutional network includes a derived graph convolutional network;
extracting the three-dimensional geometric shape features and the spatial distribution features of the point cloud information using the graph convolutional network; and
integrating the three-dimensional geometric shape features and the spatial distribution features to obtain the feature vector.
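
A rough sketch of the point cloud branch in claim 5 follows. It is not the disclosed network: the graph is assumed to be a k-nearest-neighbour graph, the single graph-convolution step is an untrained neighbourhood average, and integration is assumed to be global max pooling. All function names and the sample points are hypothetical, and noise point removal is not shown.

```python
import math

def normalize_coordinates(points):
    """Centre the cloud at the origin and scale it into the unit sphere."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centred = [[p[i] - centroid[i] for i in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centred) or 1.0
    return [[c / radius for c in p] for p in centred]

def knn_graph(points, k):
    """Indices of each point's k nearest neighbours, excluding the point itself."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(others[:k])
    return neighbours

def graph_conv_mean(features, neighbours):
    """One untrained graph-convolution step: average each node with its neighbours."""
    out = []
    for i, nbrs in enumerate(neighbours):
        group = [features[i]] + [features[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def global_max_pool(features):
    """Integrate per-point features into one vector (assumed integration step)."""
    return [max(col) for col in zip(*features)]

# Hypothetical point cloud: four corners of a square in the z = 0 plane.
cloud = normalize_coordinates([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                               [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]])
graph = knn_graph(cloud, k=2)
feature_vector = global_max_pool(graph_conv_mean(cloud, graph))
```

Here the point coordinates double as the initial node features; a trained graph convolutional network would replace the plain average with learned weights.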

6. The method according to claim 1, characterized in that, when the sensor is the spectral curve acquisition sensor, inputting the multiple pieces of sensor information into the first encoder to obtain the multiple corresponding feature vectors includes:
subjecting the one-dimensional spectral curve information from the spectral curve acquisition sensor to a second preprocessing and inputting it into the one-dimensional convolutional network in the first encoder, wherein the second preprocessing includes at least one of the following: baseline correction and band denoising;
obtaining the absorption and reflection features of the one-dimensional spectral curve information using the one-dimensional convolutional network; and
integrating the absorption and reflection features to obtain the feature vector.
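
The spectral branch in claim 6 can be illustrated with the following sketch. The claim does not define the preprocessing algorithms, so a straight-line baseline and a moving-average filter are assumed; the convolution kernel is hand-picked instead of learned, and all names and sample values are hypothetical.

```python
def baseline_correct(curve):
    """Subtract the straight line joining the first and last samples (assumed model)."""
    n = len(curve)
    if n < 2:
        return [0.0] * n
    slope = (curve[-1] - curve[0]) / (n - 1)
    return [y - (curve[0] + slope * i) for i, y in enumerate(curve)]

def moving_average(curve, window=3):
    """Simple band denoising: centred moving average, window clipped at the ends."""
    half = window // 2
    out = []
    for i in range(len(curve)):
        segment = curve[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out

def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, the operation used in convolutional layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hypothetical spectral curve: a sloping baseline with one absorption dip.
raw = [1.0, 1.5, 2.0, 1.0, 3.0, 3.5, 4.0]
clean = moving_average(baseline_correct(raw))
# Hand-picked curvature kernel; a trained network would learn its own kernels.
response = conv1d(clean, [0.5, -1.0, 0.5])
feature_vector = [max(response), sum(response) / len(response)]
```

The last line stands in for the integration step by combining the peak and mean of the filter response; the actual integration method is not specified in the claim.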

7. The method according to claim 1, characterized in that performing interaction among the multiple feature vectors based on the cross-modal attention mechanism or the gating mechanism to obtain the multiple corresponding complementary feature vectors includes:
using any one of the feature vectors as a query, and concatenating all the feature vectors other than the query to form a key and a value;
determining the cross-attention weights between the query and the key;
performing weighted aggregation of the value using the cross-attention weights to obtain an aggregation result; and
fusing the aggregation result with the feature vector corresponding to the query to obtain the complementary feature vector, and repeating this process until the complementary feature vectors corresponding to all the feature vectors are obtained;
or,
concatenating all the feature vectors and inputting them into a gated recurrent unit for feature transformation to obtain multiple corresponding modulation vectors; and
multiplying each feature vector by its corresponding modulation vector to obtain the multiple complementary feature vectors.
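
Both branches of claim 7 can be sketched as follows, under several stated assumptions. For the attention branch, no learned projections are used, the attention weights are a scaled dot-product softmax, and the final fusion is a residual addition; the claim fixes none of these. For the gating branch, a single sigmoid layer per modality stands in for the gated recurrent unit, so it is a simplification and not a GRU. All names and parameters are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(features):
    """Return one complementary vector per input feature vector."""
    if len(features) < 2:
        raise ValueError("at least two modalities are required")
    dim = len(features[0])
    complementary = []
    for q_idx, query in enumerate(features):
        # All other modalities serve as both keys and values.
        others = [f for i, f in enumerate(features) if i != q_idx]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in others]
        attn = softmax(scores)
        aggregated = [sum(a * v[i] for a, v in zip(attn, others)) for i in range(dim)]
        # Assumed fusion: residual addition of the query and the aggregation result.
        complementary.append([q + g for q, g in zip(query, aggregated)])
    return complementary

def sigmoid_gate(features, gate_weights, gate_biases):
    """Simplified gating branch: a sigmoid layer per modality replaces the GRU."""
    concat = [x for f in features for x in f]
    gated = []
    for f, weights, biases in zip(features, gate_weights, gate_biases):
        modulation = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, concat)) + b)))
                      for row, b in zip(weights, biases)]
        gated.append([x * m for x, m in zip(f, modulation)])
    return gated

# Hypothetical feature vectors for three modalities.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
complementary = cross_modal_attention(feats)
```

Each output vector keeps the query's own features and adds information drawn from the other modalities in proportion to their similarity to the query.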

8. The method according to claim 1, characterized in that inputting the multiple complementary feature vectors into the weight generation network, based on the confidence levels of the multiple complementary feature vectors, to determine the multiple weight coefficients includes:
for any one of the complementary feature vectors, obtaining a quality indicator of the complementary feature vector, wherein the quality indicator includes at least one of the following: feature sharpness and signal strength;
determining the confidence level based on the quality indicator; and
inputting the complementary feature vector into the weight generation network and performing feature evaluation based on the confidence level to obtain the weight vector, wherein the weight generation network has a two-layer fully connected structure.
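
One possible reading of the weight generation network in claim 8 is sketched below. The assumptions are considerable and are not taken from the claim: the confidence level is the average of two quality indicators scaled to [0, 1], the confidence is appended to the feature vector as an extra input, one network is shared by all modalities, and a softmax across modalities produces the final weights. The parameters are hand-picked for illustration, not trained.

```python
import math

def dense(x, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_from_quality(sharpness, signal_strength):
    """Assumed rule: average of two quality indicators already scaled to [0, 1]."""
    return 0.5 * (sharpness + signal_strength)

def generate_weights(features, confidences, w1, b1, w2, b2):
    """Score each modality with a two-layer network, then normalise across modalities."""
    scores = []
    for feature, conf in zip(features, confidences):
        hidden = relu(dense(feature + [conf], w1, b1))  # first fully connected layer
        scores.append(dense(hidden, w2, b2)[0])          # second layer, scalar output
    return softmax(scores)

# Hypothetical hand-picked parameters: input size 3 (2 features + confidence), hidden size 2.
w1 = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0]]
b2 = [0.0]

features = [[0.2, 0.4], [0.2, 0.4]]
confidences = [confidence_from_quality(0.9, 0.9), confidence_from_quality(0.4, 0.2)]
weights = generate_weights(features, confidences, w1, b1, w2, b2)
```

With these parameters two modalities with identical features receive different weights purely because their confidence levels differ, which shows how a low-quality modality can be down-weighted.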

9. A multimodal feature fusion device, characterized in that the device includes:
an acquisition unit, used to acquire, from different types of sensors, sensor information on the surface of the object under test, wherein the different types of sensors include at least one of the following: a visible light imaging sensor, an infrared imaging sensor, a three-dimensional point cloud acquisition sensor, and a spectral curve acquisition sensor;
an encoding unit, used to input the multiple pieces of sensor information into a first encoder to obtain multiple corresponding feature vectors, wherein the first encoder is used to encode the sensor information into the feature vectors based on a convolutional coding network, one piece of sensor information corresponds to one convolutional coding network, and the convolutional coding network includes at least one of the following: a convolutional neural network, a graph convolutional network, and a one-dimensional convolutional network;
an interaction unit, used to perform interaction among the multiple feature vectors based on a cross-modal attention mechanism or a gating mechanism to obtain multiple corresponding complementary feature vectors;
a determining unit, used to input the multiple complementary feature vectors into a weight generation network, based on the confidence levels of the multiple complementary feature vectors, to determine multiple weight coefficients; and
a fusion unit, used to perform weighted fusion of the multiple complementary feature vectors using the multiple weight coefficients to obtain multimodal fusion features.

10. An electronic device, characterized in that the electronic device includes:
a memory, used to store computer-readable instructions; and
a processor, used to execute the computer-readable instructions so that the electronic device performs the method according to any one of claims 1-8.