Multi-modal feature collaborative fusion method and device based on physical topology information theory
By adopting a multimodal feature collaborative fusion method based on physical topological information theory, the problems of computational resource dependence and insufficient feature fusion in existing image recognition technologies are solved, achieving efficient image recognition on edge devices and improving recognition accuracy and robustness.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: STATE GRID BEIJING ELECTRIC POWER CO
- Filing Date: 2026-03-17
- Publication Date: 2026-06-19
AI Technical Summary
Existing image recognition technologies rely on large-scale labeled data and expensive computing resources. Fixed input size limitations lead to the loss of image edge details, and it is difficult to achieve real-time applications on edge computing devices. Furthermore, feature extraction and utilization are limited, and multi-dimensional information fusion is insufficient, resulting in insufficient recognition accuracy and robustness in complex environments.
A multimodal feature collaborative fusion method based on physical topological information theory is adopted. Through physical prior-driven adaptive preprocessing, correlational multimodal feature extraction, information synergy calculation and dynamic fusion weight, a physical rule space is constructed for feature projection and weighted fusion, and a lightweight classification network is used for image recognition.
It improves the accuracy and robustness of image recognition in complex environments, solves the problems of insufficient feature representation and poor illumination adaptability in traditional methods, maximizes the utilization of complementary information between features and suppresses redundant information, reduces computational complexity, and provides conditions for lightweight classification.
Figure CN122244515A_ABST
Description
Technical Field
[0001] This invention belongs to the field of image recognition technology, specifically relating to a multimodal feature collaborative fusion method and device based on physical topological information theory.
Background Technology
[0002] With the rapid development of artificial intelligence technology, image recognition has been widely used in fields such as power line inspection and industrial testing. However, existing mainstream image recognition technologies still have the following core problems: On the one hand, deep learning methods, represented by convolutional neural networks, are highly dependent on large-scale labeled data and expensive computing resources. Their fixed input size limitation can easily lead to the loss of image edge details. Furthermore, the training and deployment of deep networks have demanding hardware requirements, making it difficult to achieve real-time applications on edge computing devices.
[0003] On the other hand, existing methods have limitations in feature extraction and utilization, focusing on single-level or single-dimensional feature representation, making it difficult to effectively integrate multi-dimensional information such as shape, texture, and spatial distribution, resulting in insufficient recognition accuracy and robustness in complex lighting and occlusion environments.
Summary of the Invention
[0004] The purpose of this invention is to provide a multimodal feature collaborative fusion method and apparatus based on physical topological information theory, so as to at least solve or improve one of the problems existing in the prior art.
[0005] To achieve the above objectives, the present invention adopts the following technical solution: In a first aspect, the present invention provides a multimodal feature collaborative fusion method based on physical topological information theory, comprising:
Step 1: Obtain the image to be recognized and perform physical prior-driven adaptive preprocessing on it to obtain the enhanced preprocessed image;
Step 2: Based on the preprocessed image, perform correlational multimodal feature extraction under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels;
Step 3: Construct a physical rule space, and project the shape feature set, texture feature set, and spatial feature set into the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features;
Step 4: Calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy. The information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space;
Step 5: Based on the dynamic fusion weights, perform weighted fusion of the shape feature set, texture feature set, and spatial feature set to obtain a comprehensive feature vector;
Step 6: Input the comprehensive feature vector into the pre-trained lightweight classification network and output the image recognition result.
[0006] The aforementioned scheme constructs a physical rule space, projects shape, texture, and spatial feature sets onto this space, calculates information synergy, and generates dynamic fusion weights based on this information synergy for multimodal feature weighted fusion. This solves the technical problems of insufficient single-dimensional feature representation and lack of scientific basis for multimodal feature fusion in existing image recognition technologies, leading to low recognition accuracy. By fusing statistical dependence and topological stability through information synergy, it maximizes the utilization of complementary information between features and effectively suppresses redundant information, thereby improving the accuracy and robustness of image recognition in complex environments.
[0007] Furthermore, step 1 specifically includes:
Step 1-1: After obtaining the image to be recognized, match the target device type in the image and call the preset device physical attribute library;
Step 1-2: Based on the physical size constraints in the device physical attribute library, dynamically generate foreground seed points and background seed points, and use the GrabCut algorithm to segment the image to be recognized to obtain the target region image;
Step 1-3: Calculate the actual physical size of the target in the target region image and the percentage of image pixels it occupies, dynamically determine the adaptive scaling ratio based on that percentage, and scale the target region image to obtain a size-normalized target image;
Step 1-4: Perform scene classification on the size-normalized target image, dynamically adjust the Gaussian filter scales and weights of the multi-scale Retinex enhancement algorithm based on the classification result, and perform illumination enhancement on the target image to obtain the enhanced preprocessed image.
[0008] The above scheme achieves physical prior-driven adaptive preprocessing by matching the device type to call the physical attribute library, dynamically generating seed points based on physical size for target segmentation, adaptively scaling according to actual physical size and pixel ratio, and dynamically adjusting Retinex enhancement parameters according to scene classification. This solves the technical problems of traditional preprocessing methods, such as loss of key details and poor lighting adaptability caused by fixed parameters. Foreground-background segmentation guided by physical size constraints ensures accurate extraction of the target region; dynamic scaling based on the physical size ratio magnifies details during long-distance shooting and crops redundancy during close-up shooting, preserving key features such as bolts and wire connectors; and scene-adaptive lighting enhancement effectively recovers texture information in shadow and highlight areas under complex lighting conditions such as cloudy days and backlighting, providing high-quality input images for subsequent feature extraction.
[0009] Furthermore, step 2 specifically includes:
Step 2-1: Convert the preprocessed image to grayscale, scale it to the standard size according to the device's physical size, calculate the Zernike moments, verify and correct the Zernike moment calculation result through the physical rule verification layer, and output a shape feature set with physical labels;
Step 2-2: Based on the physical labels in the shape feature set, mark the physically vulnerable and non-vulnerable regions in the image, perform directional Gabor filtering and weighted LBP encoding on the vulnerable regions, and output the texture feature set related to the equipment defects;
Step 2-3: Based on the equipment structure information in the shape feature set, divide the image into multi-scale blocks according to the physical regions, calculate the spatial statistical features for each scale block, and remove false features through physical consistency verification to output the spatial feature set.
[0010] The above scheme achieves correlated multimodal feature extraction under physical rule constraints by correcting Zernike moments through a physical rule verification layer, using shape-label-guided directional Gabor filtering and weighted LBP encoding, and dividing the physical region into multi-scale blocks and performing physical consistency verification. This solves the technical problems of high feature redundancy, severe pseudo-feature interference, and lack of correlation among multimodal features in traditional feature extraction. By using shape-guided texture extraction, the Gabor filter direction is fixed to a preset angle matching the defect direction, reducing computational redundancy while focusing on key defect areas such as cracks and corrosion. Physical consistency verification eliminates pseudo-features that do not match the physical state of the equipment, ensuring that the output feature set is strongly correlated with the actual physical state, thus improving the reliability and discriminative power of the features.
[0011] Furthermore, step 3 specifically includes: Construct a high-dimensional manifold space consisting of at least one physical rule threshold as the physical rule space; The shape feature set, texture feature set, and spatial feature set are respectively input into the preset physical rule embedding function and mapped to the physical rule space to obtain the corresponding shape projection feature, texture projection feature, and spatial projection feature.
[0012] The above scheme constructs a high-dimensional manifold space composed of physical rule thresholds as the physical rule space, and uses a physical rule embedding function to map the original features to this space, realizing the projection of features from the original pixel domain to the physical constraint domain. This scheme solves the technical problem that features of different modalities are difficult to directly compare and fuse due to differences in dimensions and distributions. By uniformly projecting features to a rule space with clear physical meaning, shape, texture, and spatial features are aligned at the same physical semantic scale, providing a unified metric for the subsequent calculation of information synergy, and ensuring the physical rationality and interpretability of multimodal feature fusion.
[0013] Furthermore, the dynamic fusion weights in step 4 are calculated using the following formula: For shape projection feature s, texture projection feature t, and spatial projection feature p, the corresponding dynamic fusion weights Ωs, Ωt, and Ωp are:
[0014] Wherein, the subscript f takes the values s, t, and p in turn to calculate Ωs, Ωt, and Ωp accordingly; ‖f‖₂ is the L2 norm of the projected feature f, used to characterize the physical rule embedding strength of the feature set itself; β is the dynamic temperature coefficient, used to adjust the distribution concentration of the dynamic fusion weights; and Coh(f,g) is the information synergy between projection features f and g, used to characterize the synergistic effect of their statistical dependence and topological stability in the physical rule space.
[0015] The above scheme achieves adaptive adjustment of fusion weights by introducing a dynamic fusion weight calculation formula based on L2 norm and information synergy. This scheme solves the technical problem that fixed-weight fusion cannot adapt to the dynamic changes in feature importance under different scenarios. The weight calculation formula simultaneously considers the physical rule embedding strength of the feature itself (L2 norm) and the synergistic effect of statistical dependence and topological stability between features (information synergy), so that the texture feature weight is automatically increased when cracks are obvious and the shape feature weight is automatically enhanced when the structure is deformed. This achieves an optimal fusion strategy that is adaptive to the scene, significantly improving the expressive power and generalization performance of the fused features.
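The weight formula itself is not reproduced in the text above, so the following is only a minimal sketch of one form consistent with the stated ingredients: a softmax over scores built from the L2 norm of each projected feature, its summed information synergy with the other two features, and a temperature β. The softmax form, the function name `dynamic_fusion_weights`, and the way β enters (as a divisor) are all assumptions for illustration, not the patent's formula.

```python
import math

def dynamic_fusion_weights(norms, coh, beta=1.0):
    """Hypothetical softmax-style weighting (an assumption, not the source formula).

    norms: dict mapping modality name ("s", "t", "p") to the L2 norm of its projection.
    coh:   dict mapping frozenset({f, g}) to the information synergy Coh(f, g).
    beta:  temperature; smaller values concentrate the weights on the top score.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scores = {}
    for f in norms:
        # Sum the synergy of f with every other modality.
        synergy = sum(coh[frozenset((f, g))] for g in norms if g != f)
        scores[f] = norms[f] * synergy / beta
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores.values())
    exps = {f: math.exp(score - top) for f, score in scores.items()}
    total = sum(exps.values())
    return {f: e / total for f, e in exps.items()}

weights = dynamic_fusion_weights(
    {"s": 1.0, "t": 2.0, "p": 1.0},
    {
        frozenset(("s", "t")): 0.5,
        frozenset(("s", "p")): 0.1,
        frozenset(("t", "p")): 0.5,
    },
)
```

In this toy input the texture projection has both the largest norm and the largest summed synergy, so it receives the largest weight, which matches the behaviour described for scenes where cracks are obvious.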
[0016] Furthermore, the information synergy Coh(f,g) is calculated through the following steps: Calculate the normalized mutual information of the projected features f and g in the physical rule space to obtain the statistical dependency measure MI(f,g); Perform persistent homology analysis on the combination of projection features f and g in the physical rule space, and calculate the maximum persistent homology Betti number PH(f,g) at the preset physical effective scale to obtain the topological stability measure; Multiply the statistical dependency measure by the topological stability measure to obtain the information synergy Coh(f,g) = MI(f,g)·PH(f,g).
[0017] The above scheme achieves a joint measure of statistical dependency and topological stability by multiplying the normalized mutual information by the persistent homology Betti number to obtain the information synergy. This scheme solves the technical problem that traditional feature correlation measures focus only on statistical associations and ignore spatial topological structure. Normalized mutual information captures the statistical dependencies between features, while the persistent homology Betti number characterizes the stable topological structure formed by feature combinations in the physical rule space (such as the ring-shaped discontinuity formed by cracks). The product of the two ensures that there is a strong correlation between features and that this correlation is based on a stable structure that conforms to physical laws. This allows for the selection of truly physically meaningful feature synergy patterns and avoids misfusion caused by noise or spurious correlations.
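The product Coh(f,g) = MI(f,g)·PH(f,g) can be sketched as follows. This is an illustrative example under several assumptions: features are quantized into equal-width bins before estimating mutual information, the normalization divides by the geometric mean of the two entropies (one of several common conventions), and the Betti number is passed in as a precomputed integer, since computing persistent homology would normally be delegated to a dedicated topology library. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def quantize(values, bins=4):
    """Map real values to equal-width bin indices (an assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]

def normalized_mutual_information(x, y):
    """NMI = I(X;Y) / sqrt(H(X) * H(Y)); returns 0.0 if either entropy is zero."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    denom = math.sqrt(hx * hy)
    return (hx + hy - hxy) / denom if denom > 0 else 0.0

def information_synergy(f, g, betti_number, bins=4):
    """Coh(f, g) = MI(f, g) * PH(f, g), with PH supplied by the caller."""
    mi = normalized_mutual_information(quantize(f, bins), quantize(g, bins))
    return mi * betti_number

coh_same = information_synergy([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], betti_number=2)
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the two factors are multiplied, a pair of features with high mutual information but no stable topological structure (Betti number 0) contributes no synergy, which is the filtering effect the paragraph above describes.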
[0018] Furthermore, step 5 specifically includes: Based on the dynamic fusion weights, the shape feature set, texture feature set, and spatial feature set are weighted and summed to obtain the initial fusion features; Feature pruning is performed on feature dimensions whose dynamic fusion weights are lower than a preset threshold in the initial fusion features to obtain a dimensionally compressed comprehensive feature vector.
[0019] The above scheme obtains the initial fusion features through weighted summation, and then prunes feature dimensions with weights below a preset threshold, achieving dimensionality compression of the fusion features. This scheme solves the technical problem of increased computational burden and redundant information interfering with recognition performance due to dimensionality explosion after multimodal feature fusion. Through a secondary screening mechanism of dynamic fusion weights, redundant dimensions with low physical rule embedding strength and poor synergy with other features are effectively removed, while retaining the core feature dimensions with the strongest discriminative power. This reduces feature dimensionality and computational complexity while ensuring recognition accuracy, creating conditions for the real-time deployment of lightweight classification networks.
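A minimal sketch of the fuse-then-prune idea is given below. The source describes a weighted summation followed by pruning of low-weight dimensions but does not specify how feature sets of different lengths are combined, so this example swaps in a simpler technique: each feature set is scaled by its weight and concatenated, and whole feature blocks whose weight falls below the threshold are dropped. The function name, the threshold value, and the block-level pruning are assumptions.

```python
def fuse_and_prune(features, weights, threshold=0.1, order=("s", "t", "p")):
    """Scale each feature set by its fusion weight, concatenate, and prune.

    features:  dict mapping modality name to a list of floats.
    weights:   dict mapping modality name to its dynamic fusion weight.
    threshold: blocks with a weight below this value are pruned (assumed rule).
    Returns the fused vector and the list of modalities that were kept.
    """
    fused, kept = [], []
    for name in order:
        w = weights[name]
        if w < threshold:
            continue  # prune the whole low-weight block
        kept.append(name)
        fused.extend(w * value for value in features[name])
    return fused, kept

fused_vector, kept_modalities = fuse_and_prune(
    {"s": [1.0, 2.0], "t": [3.0], "p": [4.0, 5.0]},
    {"s": 0.5, "t": 0.45, "p": 0.05},
)
```

Here the spatial block is dropped because its weight (0.05) is below the threshold, so the fused vector has three dimensions instead of five.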
[0020] In a second aspect, the present invention provides a multimodal feature collaborative fusion device based on physical topological information theory, comprising:
The preprocessing module, used to acquire the image to be recognized and perform physical prior-driven adaptive preprocessing on it to obtain the enhanced preprocessed image;
The feature extraction module, used to perform correlational multimodal feature extraction on the preprocessed image under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels;
The projection module, used to construct a physical rule space and project the shape feature set, texture feature set, and spatial feature set onto the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features;
The weight generation module, used to calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and to generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy. The information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space;
The fusion module, used to perform weighted fusion of the shape feature set, texture feature set, and spatial feature set according to the dynamic fusion weights to obtain a comprehensive feature vector;
The classification module, used to input the comprehensive feature vector into a pre-trained lightweight classification network and output the image recognition result.
[0021] In a third aspect, the present invention provides an electronic device including a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the multimodal feature collaborative fusion method based on physical topological information theory as described above.
[0022] In a fourth aspect, the present invention provides a computer-readable storage medium storing at least one instruction that, when executed by a processor, implements the multimodal feature collaborative fusion method based on physical topological information theory as described above.
Attached Figure Description
[0023] The accompanying drawings, which form part of this application, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Figure 1 is a schematic diagram of a multimodal feature collaborative fusion method based on physical topological information theory according to an embodiment of the present invention;
Figure 2 is a structural block diagram of a multimodal feature collaborative fusion device based on physical topological information theory according to an embodiment of the present invention;
Figure 3 is a structural block diagram of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0024] The present invention will now be described in detail with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other.
[0025] The following detailed description is exemplary and intended to provide further detailed explanation of the invention. Unless otherwise specified, all technical terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The terminology used in this invention is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention.
[0026] Example 1: As shown in Figure 1, a multimodal feature collaborative fusion method based on physical topological information theory includes the following steps: Step 1: Obtain the image to be recognized and perform physical prior-driven adaptive preprocessing on it to obtain the enhanced preprocessed image; This step focuses on the physical laws of the inspection scenario, constructing a preprocessing mechanism based on prior constraints and dynamic adjustments. Specifically, the image to be recognized can be a 512×512 pixel JPG or PNG image collected by the power inspection robot. The preprocessing includes four sub-steps: equipment type matching, target segmentation, adaptive scaling, and illumination enhancement.
[0027] Step 2: Based on the preprocessed image, perform correlational multimodal feature extraction under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels; This step uses the physical association of devices as a link to achieve linked extraction, in which shape guides texture extraction and texture verifies the spatial features, avoiding feature redundancy and false feature interference.
[0028] Step 3: Construct a physical rule space, and project the shape feature set, texture feature set, and spatial feature set into the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features; Step 4: Calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy. The information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space. The core of this step lies in combining information theory with topological data analysis to construct a multimodal feature cooperation and inhibition model based on persistent homology and mutual information. This is specifically achieved as follows: First, for any two projected features f and g (f,g∈{s,t,p}), calculate their information synergy Coh(f,g) = MI(f,g)·PH(f,g), that is, the product of the statistical dependency metric and the topological stability metric. MI(f,g) is the normalized mutual information of projected features f and g in the physical rule space R, used to characterize their statistical dependency; PH(f,g) is the maximum persistent homology Betti number (e.g., the H1 Betti number, representing the number of ring structures) obtained by persistent homology analysis of the combination of projected features f and g in the physical rule space R at a preset physical effective scale, used to characterize the topological stability of the combination of f and g under physical rule constraints. Persistent homology analysis characterizes the topological lifecycle of data at different scales. The higher the PH(f,g) value, the more stable the combination of f and g is in the physical rule space, forming a topological structure that conforms to defect characteristics (e.g., ring discontinuities formed by cracks).
[0029] Then, based on the information synergy among all projected features, dynamic fusion weights are generated for each feature set.
[0030] Step 5: Based on the dynamic fusion weights, the shape feature set, texture feature set, and spatial feature set are weighted and fused to obtain a comprehensive feature vector; Step 6: Input the comprehensive feature vector into the pre-trained lightweight classification network and output the image recognition result.
[0031] In one embodiment, step 1 specifically includes: Step 1-1: After acquiring the image to be identified, match the target device type in the image and call the preset device physical attribute library; the device physical attribute library contains physical size parameters and structural feature parameters of different device types; For example, the physical properties of a 110kV insulator include: 6-8 sheds, a single shed diameter of 30-35cm, and a shed spacing of ≥10mm; for a 220kV tower, the physical properties include: a crossarm length of 2-3m and a tower diameter of 0.5-0.8m. These physical parameters provide constraints for subsequent processing.
[0032] Step 1-2: Based on the physical size constraints in the device physical attribute library, foreground seed points and background seed points are dynamically generated, and the GrabCut algorithm is used to segment the image to be recognized to obtain the target region image. Traditional GrabCut relies on manual annotation of foreground seed points. This step dynamically generates seed points based on physical attributes: First, the arc contour area of the target device, such as the arc of the insulator skirt, is located through an edge detection algorithm; then, according to physical size constraints, such as the skirt spacing ≥10mm, continuous contour areas that meet the size range are selected as foreground seed points; at the same time, areas that deviate from the physical size constraints by more than 20%, such as background trees and clouds, are marked as background seed points.
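The seed selection rule can be sketched as a simple size filter over candidate contour regions. This is an illustrative example only: the function name, the region representation (an identifier plus one measured size in millimetres), and the choice to measure deviation relative to the nearest bound of the allowed range are assumptions. Regions that are outside the range but within the 20% tolerance are left undecided here, since the text does not say how they are treated.

```python
def classify_seed_regions(regions, size_range_mm, max_deviation=0.20):
    """Split candidate contour regions into foreground and background seeds.

    regions:       iterable of (region_id, measured_size_mm) pairs.
    size_range_mm: (low, high) physical size range from the attribute library.
    max_deviation: relative deviation beyond which a region becomes background.
    """
    low, high = size_range_mm
    foreground, background, undecided = [], [], []
    for region_id, size_mm in regions:
        if low <= size_mm <= high:
            foreground.append(region_id)
            continue
        # Deviation is measured against the nearest bound (assumed convention).
        nearest = low if size_mm < low else high
        deviation = abs(size_mm - nearest) / nearest
        if deviation > max_deviation:
            background.append(region_id)
        else:
            undecided.append(region_id)
    return foreground, background, undecided

# Skirt diameter range of 30-35 cm from the example above, expressed in mm.
fg, bg, unknown = classify_seed_regions(
    [("a", 320.0), ("b", 360.0), ("c", 500.0), ("d", 200.0)],
    size_range_mm=(300.0, 350.0),
)
```

In a full pipeline the resulting foreground and background pixels would typically be written into the mask passed to a GrabCut implementation that supports mask-based initialization.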
[0033] Step 1-3: Calculate the actual physical size of the target in the target region image and the percentage of image pixels it occupies, dynamically determine the adaptive scaling ratio based on that percentage, and scale the target region image to obtain a size-normalized target image. This invention dynamically determines the scaling ratio based on the physical properties of the device and the proportion R_obj of the target in the image. Specifically, there are three cases: Scenario 1, shooting from a distance, R_obj < 15%: the target occupies a small portion of the image, resulting in insufficient detail. The image is then magnified according to the actual device size / image pixel ratio to ensure that critical details such as bolts and wire connectors remain clear. For example, if a 110kV tower is actually 20m high, and its height occupies 100 pixels in the image, the scaling ratio is set to 200 pixels / m, magnifying the image until the target height occupies 400 pixels (corresponding to a 2m height range in the actual size), thus preserving sufficient detail resolution.
[0034] Scenario 2, shooting from a moderate distance, 15% ≤ R_obj ≤ 40%: the target occupies a moderate proportion of the image, which encompasses the complete device structure and retains a certain amount of background context. This solution employs a physical feature-preserving scaling strategy: the target scaling size is determined based on the smallest identifiable feature size in the device's physical attribute library. For example, for an insulator, its smallest identifiable feature is a crack at the edge of the skirt (actual width approximately 0.2-0.5mm). To ensure the crack occupies at least 3-5 pixels in the image for subsequent identification, the required scaling ratio is calculated as scale = 3 pixels / 0.2mm = 15 pixels / mm, thus determining the target image scaling size. If the calculated size is larger than the original image size, the image is moderately enlarged; if smaller, the original size is maintained or slightly compressed, ensuring that the complete device structure and necessary background are preserved without losing critical details.
[0035] Scenario 3, close-up shooting, R_obj > 40%: the target occupies a large proportion of the image and the background information is redundant. Cropping only the background area while preserving the complete device structure is sufficient; no additional scaling is needed.
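The three scenarios and the resolution arithmetic above can be sketched as follows. The thresholds (15% and 40%) and the worked numbers come from the text; the function names and return labels are hypothetical and chosen only for illustration.

```python
def scaling_scenario(r_obj):
    """Pick the scaling strategy from the target's share of image pixels (0..1)."""
    if r_obj < 0.15:
        return "magnify"             # Scenario 1: shot from a distance
    if r_obj <= 0.40:
        return "feature_preserving"  # Scenario 2: moderate distance
    return "crop_only"               # Scenario 3: close-up, no extra scaling

def required_resolution(min_feature_size, min_feature_pixels=3):
    """Pixels per unit length needed so the smallest feature spans enough pixels."""
    return min_feature_pixels / min_feature_size

def zoom_factor(current_resolution, target_resolution):
    """Magnification needed to go from the current to the target resolution."""
    return target_resolution / current_resolution

# Scenario 2 example: a 0.2 mm crack should span at least 3 pixels.
crack_resolution = required_resolution(0.2)          # pixels per mm

# Scenario 1 example: a 20 m tower spans 100 pixels, target is 200 pixels per m.
tower_zoom = zoom_factor(100 / 20, 200)
```

Under these numbers the tower image would need a 40x magnification to reach 200 pixels per metre, which suggests that in practice the magnified view covers only part of the structure, as the 2 m height range in the example indicates.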
[0036] Step 1-4: Perform scene classification on the size-normalized target image, dynamically adjust the Gaussian filter scales and weights of the multi-scale Retinex enhancement algorithm based on the classification result, and perform illumination enhancement on the target image to obtain the enhanced preprocessed image.
[0037] To address complex lighting scenarios such as overcast skies and twilight backlighting during power line inspections, this invention dynamically adjusts the Gaussian filter scales of the multi-scale Retinex (MSR) algorithm and their weights. When the scene classification result is a cloudy scene and the average image brightness is lower than the first preset threshold (e.g., 120), the shadow areas of the device (e.g., below the insulator skirt) are emphasized and enhanced: the scales are set to 10 (small scale, enhances details), 60 (medium scale, balances brightness), and 200 (large scale, suppresses noise), and the weight of the small scale is increased (e.g., to 0.4) to avoid the loss of detail in shadow areas caused by traditional fixed weights. When the scene classification result is a backlit scene and the image brightness difference is higher than the second preset threshold (e.g., 80), the large-scale weight is reduced from 1 / 3 to 1 / 5 for bright areas of the device (e.g., reflections from wires), and a physical brightness threshold is introduced: if the brightness of the reflective area of the conductor is greater than 220, additional grayscale compression is performed to prevent texture information loss due to overexposure. The MSR formula used in this step is optimized as follows:
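[0038] $$R_{\mathrm{MSR}}(x,y)=\sum_{k=1}^{3} w_k\left[\log I(x,y)-\log\big(G_k(x,y)*I(x,y)\big)\right]$$ where $w_k$ are the scene-adaptive weights (small-scale weight 0.4 for cloudy scenes, large-scale weight 0.2 for backlit scenes), $G_k(x,y)$ is the Gaussian filter function at the $k$-th scale, $I(x,y)$ is the input image, and $*$ denotes convolution.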
[0039] In a more specific embodiment, generating the foreground and background seed points in step 1-2 includes: The circular contour area of the target device is located using an edge detection algorithm; Based on the physical size constraints in the device's physical attribute library, continuous contour regions that meet the size range requirements are selected as foreground seed points; Regions that deviate from the physical size constraints by more than a preset threshold are marked as background seed points.
[0040] In a more specific embodiment, adjusting the Gaussian filter scales and weights of the multi-scale Retinex enhancement algorithm based on the classification results in step 1-4 includes: When the scene classification result is a cloudy scene and the average brightness of the image is lower than the first preset threshold, the weight of the small-scale Gaussian filter is increased, and the Gaussian filter scales are set to a combination adapted to enhance shadow details. When the scene classification result is a backlit scene and the image brightness difference is higher than the second preset threshold, the weight of the large-scale Gaussian filter is reduced, and grayscale compression is performed on regions that exceed the physical brightness threshold.
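The scene-dependent parameter selection can be sketched as a small lookup function. The scales (10, 60, 200), the thresholds (120 and 80), the small-scale weight of 0.4, and the large-scale weight of 1/5 come from the embodiment above. How the remaining weight is distributed is not stated, so splitting it equally across the other two scales so that the weights sum to 1 is an assumption, as are the function name and the reuse of the same three scales in every scene.

```python
def msr_parameters(scene, mean_brightness=None, brightness_diff=None,
                   cloudy_threshold=120, backlit_threshold=80):
    """Return (scales, weights) for multi-scale Retinex given the scene class.

    Weights are ordered (small, medium, large) to match the scales.
    """
    scales = (10, 60, 200)
    weights = [1 / 3, 1 / 3, 1 / 3]  # traditional fixed weights
    if (scene == "cloudy" and mean_brightness is not None
            and mean_brightness < cloudy_threshold):
        # Emphasize the small scale to recover detail in shadow areas.
        weights = [0.4, 0.3, 0.3]
    elif (scene == "backlit" and brightness_diff is not None
            and brightness_diff > backlit_threshold):
        # Reduce the large-scale weight from 1/3 to 1/5 for bright areas.
        weights = [0.4, 0.4, 0.2]
    return scales, weights

cloudy_scales, cloudy_weights = msr_parameters("cloudy", mean_brightness=100)
_, backlit_weights = msr_parameters("backlit", brightness_diff=95)
_, default_weights = msr_parameters("cloudy", mean_brightness=150)
```

A cloudy scene whose mean brightness is above the threshold falls back to the fixed weights, which mirrors the two-part condition in the text (scene class and brightness test must both hold).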
[0041] In one embodiment, step 2 specifically includes: Step 2-1: Convert the preprocessed image into a grayscale image, scale it to the standard size according to the physical size of the device, calculate the Zernike moments, verify and correct the Zernike moment calculation result through the physical rule verification layer, and output the shape feature set S with physical labels; Traditional Zernike moments only describe the global shape. This invention adds a physical rule verification layer: First, the preprocessed image is converted to grayscale and scaled to a standard size according to the device's physical dimensions (e.g., insulators are scaled to 256×256 to ensure each skirt occupies 30-40 pixels). After calculating the Zernike moments, valid shape features are filtered through physical rule verification. For example, for an insulator, if the calculated number of skirts is less than 6 or the diameter deviation of a single skirt is greater than 20%, the result is judged to be a shape abnormality (e.g., image blurring, partial occlusion). In this case, the edge detection results from step 1-2 are automatically called, edge contour weights are added, and the Zernike moments are recalculated until a shape feature set S that conforms to the physical rules is output. The final output S has physical labels, such as insulator - number of skirts 7 - no deformation - single skirt diameter 32cm, rather than a simple numerical vector.
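The insulator check described above can be sketched as a small validation function. The minimum skirt count (6), the diameter range (30-35 cm), and the 20% tolerance come from the text; measuring deviation against the midpoint of the diameter range, the dictionary layout, and the function name are assumptions made for illustration. The Zernike moment computation and the recalculation loop are not modelled here.

```python
# Hypothetical attribute-library entry for a 110kV insulator.
INSULATOR_110KV = {"min_skirts": 6, "skirt_diameter_cm": (30.0, 35.0)}

def verify_shape(skirt_count, skirt_diameters_cm, spec=INSULATOR_110KV,
                 max_deviation=0.20):
    """Return (is_valid, issues) for a measured insulator shape.

    A shape is flagged as abnormal if too few skirts were detected or if any
    single skirt diameter deviates from the nominal value by more than 20%.
    """
    issues = []
    if skirt_count < spec["min_skirts"]:
        issues.append("too_few_skirts")
    low, high = spec["skirt_diameter_cm"]
    nominal = (low + high) / 2  # assumed reference value for the deviation
    if any(abs(d - nominal) / nominal > max_deviation for d in skirt_diameters_cm):
        issues.append("diameter_deviation")
    return (not issues), issues

ok_result = verify_shape(7, [32.0] * 7)
occluded_result = verify_shape(5, [32.0] * 5)
deformed_result = verify_shape(7, [32.0] * 6 + [45.0])
```

When the check fails, the text calls for recomputing the Zernike moments with added edge contour weights, so a caller would loop on this function until it returns a valid result or a retry limit is reached.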
[0042] Step 2-2: Based on the physical labels in the shape feature set S, mark the physically vulnerable and non-vulnerable regions in the image, perform directional Gabor filtering and weighted LBP encoding on the vulnerable regions, and output the texture feature set T related to equipment defects. This step avoids the inefficiency of full-image Gabor filtering by concentrating texture extraction on the physically vulnerable areas of the equipment, guided by the shape features S. Specifically, vulnerable areas are marked based on the physical labels in S, such as the edges of insulator skirts (high-risk cracking areas) and the crossarm connecting bolts of towers (high-risk corrosion areas), while non-vulnerable areas (such as the central insulator rod) receive only basic texture sampling. The Gabor filter parameters are adjusted according to the physical defect characteristics of the vulnerable areas. For example, if cracks on the edges of insulator skirts are mostly radially distributed, the filter direction angle is fixed at 0° and 90° (covering the radial direction), and the frequency is set to 0.3-0.5 (matching the texture period of crack widths of 0.2-0.5 mm), avoiding the computational redundancy of traditional multi-directional filtering (8 directions). Physical defect association weights are then assigned to the filtered areas. For example, if the LBP value of the insulator skirt edge matches the crack texture pattern (local pixel difference > 15), the weight is set to 1.2, and the weight for non-vulnerable areas is set to 0.6. The final output texture feature set T retains only key texture information related to equipment defects.
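A minimal sketch of the directional filter bank and the region weighting follows. It assumes the frequency range 0.3-0.5 is in cycles per pixel, samples it at three illustrative values, and uses hypothetical `sigma`, `gamma`, and kernel-size settings. The weight of 1.0 for a vulnerable region that does not match the crack pattern is also an assumption, because the text only specifies 1.2 and 0.6.

```python
import math

def gabor_kernel(size, theta, frequency, sigma=2.0, gamma=0.5):
    """Real part of a Gabor kernel as a size x size nested list."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * frequency * xr))
        kernel.append(row)
    return kernel

def directional_bank(size=9):
    """Two fixed directions (0 and 90 degrees) instead of eight."""
    thetas = [0.0, math.pi / 2]
    frequencies = [0.3, 0.4, 0.5]   # assumed sampling of the 0.3-0.5 range
    return [gabor_kernel(size, t, f) for t in thetas for f in frequencies]

def region_weight(is_vulnerable, local_pixel_difference, crack_threshold=15):
    """Physical defect association weight for one region."""
    if not is_vulnerable:
        return 0.6
    if local_pixel_difference > crack_threshold:
        return 1.2
    return 1.0  # assumption: the text gives no value for this case
```

With these settings the bank holds 6 kernels, compared with 24 for the same three frequencies over 8 directions, which is where the reduction in computation comes from.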
[0043] Step 2-3: Based on the equipment structure information in the shape feature set S, divide the area into multi-scale blocks according to the physical region, calculate the spatial statistical features for each scale block, and remove false features through physical consistency verification to output the spatial feature set P.
[0044] Traditional spatial feature calculations only consider grayscale mean and HOG (Histogram of Oriented Gradients) statistics. This step additionally incorporates verification based on the physical location of the equipment. Specifically, the equipment structure in S (e.g., insulator skirts - center rod - connecting hardware) is divided into physical regions rather than fixed pixel blocks. For example, the skirt region is divided into 16×16 pixel blocks based on the size of a single skirt, and the connecting hardware region is divided into 8×8 pixel blocks based on the bolt size. After the grayscale mean, standard deviation, and HOG are calculated for each sub-region, a physical consistency check is performed to remove false features. For instance, if the grayscale mean of the insulator skirt region deviates from that of the center rod by more than 30 (the normal deviation is about 10-20) and the HOG gradient direction matches the crack direction, the result is considered a valid spatial feature. If the deviation is greater than 30 but the gradient direction is normal (e.g., background light spots), it is marked as a false feature and removed. The final output spatial feature set P is strongly correlated with the physical state of the equipment.
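The physical consistency check in this example reduces to a small decision rule, sketched below. The function names and the three labels are hypothetical, and treating a deviation of 30 or less as "normal" is an assumption, since the text only describes the two cases above 30.

```python
def classify_spatial_feature(region_mean, reference_mean,
                             gradient_matches_defect, max_normal_deviation=30):
    """Label one block's spatial feature as 'normal', 'valid', or 'false'."""
    deviation = abs(region_mean - reference_mean)
    if deviation <= max_normal_deviation:
        return "normal"      # assumption: within the usual grayscale spread
    # Large deviation: keep it only if the gradient direction agrees
    # with the expected defect direction (e.g., the crack direction).
    return "valid" if gradient_matches_defect else "false"

def remove_false_features(blocks, reference_mean):
    """Drop blocks labelled 'false'; blocks are (mean, matches_defect) pairs."""
    return [b for b in blocks
            if classify_spatial_feature(b[0], reference_mean, b[1]) != "false"]
```

For a center rod mean of 100, a skirt block with mean 140 is kept when its gradient direction matches the crack direction and removed when it does not, as with a background light spot.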
[0045] In a more specific embodiment, in step 2-1, verifying and correcting the Zernike moment calculation results through the physical rule verification layer specifically includes: based on the physical dimension parameters in the equipment physical property library, verify whether the equipment structural features calculated from the Zernike moments conform to the preset physical rules; if they do not, the edge detection results from step 1-2 are called, the edge contour weights are increased, and the Zernike moments are recalculated until a shape feature set S that conforms to the physical rules is obtained.
[0046] In a more specific embodiment, step 2-2, which involves directional Gabor filtering and weighted LBP encoding of the vulnerable region, specifically includes: Based on the defect feature direction of the physically vulnerable area, the direction angle of the Gabor filter is fixed to a preset angle that matches the defect direction, and the frequency is set to a preset frequency range that matches the defect texture period. The filtered vulnerable regions are assigned physical defect association weights, which are greater than the weights of non-vulnerable regions. Output the weighted texture feature set T, whose dimension is lower than that of the full image texture features.
[0047] In a more specific embodiment, in step 2-3, dividing the area into multi-scale blocks according to physical regions specifically includes: based on the device structure information in the shape feature set S, different physical regions are divided into pixel blocks of different scales according to their own physical dimensions; for each pixel block, the grayscale mean, standard deviation, and histogram of oriented gradients features are calculated to obtain preliminary spatial features; through physical consistency verification, false features that do not conform to the physical state of the equipment are eliminated, and the spatial feature set P is output.
[0048] In one embodiment, step 3 specifically includes: A high-dimensional manifold space consisting of at least one physical rule threshold is constructed as the physical rule space R; the physical rule threshold is derived from the physical rule library preset in step 4 (e.g., an insulator crack length ≥ 5mm is considered a defect).
[0049] The shape feature set S, texture feature set T, and spatial feature set P are respectively input into the preset physical rule embedding function Φrule(·), which maps them to the physical rule space R, resulting in the corresponding shape projection feature s = Φrule(S), texture projection feature t = Φrule(T), and spatial projection feature p = Φrule(P). The physical rule embedding function Φrule(·) is a preset nonlinear transformation function that maps the original feature space to the physical rule constraint space, and it can be implemented through a shallow neural network or a kernel function.
[0050] In one embodiment, in step 4, the dynamic fusion weights are calculated using the following formula: For shape projection feature s, texture projection feature t, and spatial projection feature p, the corresponding dynamic fusion weights Ωs, Ωt, and Ωp are:
[0051] Wherein, the subscript f takes the values s, t, and p respectively, to calculate Ωs, Ωt, and Ωp accordingly; ‖f‖₂ is the L2 norm of the projection feature f, used to characterize the physical rule embedding strength of the feature set itself; β is the dynamic temperature coefficient, used to adjust the distribution concentration of the dynamic fusion weights; Coh(f,g) is the information synergy between projection features f and g, used to characterize the synergistic effect of their statistical dependence and topological stability in the physical rule space.
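The weight formula itself is not reproduced in this text, so the following is only one plausible form that uses the three quantities defined here: a temperature-scaled softmax over the score ‖f‖₂ · Σ Coh(f,g), summed over the other two features g. The softmax form, the function name, and the dictionary layout are all assumptions, chosen because a larger β then flattens the weights and a smaller β concentrates them, as described.

```python
import math

def fusion_weights(norms, coh, beta):
    """Assumed form: softmax of (L2 norm * summed synergy) / beta.

    norms: {'s': float, 't': float, 'p': float}, L2 norms of the projections
    coh:   {frozenset({'s', 't'}): float, ...}, pairwise information synergy
    beta:  dynamic temperature coefficient (> 0)
    """
    scores = {}
    for f in norms:
        synergy = sum(coh[frozenset((f, g))] for g in norms if g != f)
        scores[f] = norms[f] * synergy / beta
    peak = max(scores.values())               # subtract max for stability
    exps = {f: math.exp(v - peak) for f, v in scores.items()}
    total = sum(exps.values())
    return {f: e / total for f, e in exps.items()}
```

Under this assumed form the three weights always sum to 1, and the feature with the highest combined norm and synergy receives the largest weight.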
[0052] In a more specific embodiment, β is dynamically determined based on the frequency of physical rule violations detected during physical consistency checks. Let the base temperature coefficient be β0 and the adjustment coefficient be α, then:
[0053] Here, the violation frequency is formed from two counts: the number of physical rule violation events detected within a preset time window, and the total number of checks. When the frequency of violation events exceeds a preset threshold, β increases, and the weight distribution tends to be more uniform; conversely, β decreases, and the weight distribution tends to be more concentrated.
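The formula for β is not reproduced in this text. As a sketch only, the function below assumes a linear dependence on the violation frequency, β = β0 · (1 + α · violations / checks), which makes β rise and fall with the frequency as described; the linear form and the function name are assumptions.

```python
def dynamic_temperature(beta0, alpha, violations, total_checks):
    """Assumed linear schedule: beta grows with the violation frequency."""
    if total_checks <= 0:
        return beta0                      # no checks yet: use the base value
    frequency = violations / total_checks
    return beta0 * (1.0 + alpha * frequency)
```

With β0 = 1.0 and α = 2.0, five violations in ten checks give β = 2.0, while a window with no violations leaves β at its base value.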
[0054] In a more specific embodiment, the information synergy Coh(f,g) is calculated through the following steps: calculate the normalized mutual information of the projection features f and g in the physical rule space R to obtain the statistical dependency measure MI(f,g); perform a persistent homology analysis on the combination of projection features f and g in the physical rule space R, and calculate the maximum persistent homology Betti number PH(f,g) under the preset physical effective scale to obtain the topological stability measure; multiply the statistical dependency measure by the topological stability measure to obtain the information synergy Coh(f,g) = MI(f,g) · PH(f,g).
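The product Coh(f,g) = MI(f,g) · PH(f,g) can be sketched as below. Two assumptions are made: the projection features have already been discretized into bin labels, and the normalization is 2·I(X;Y) / (H(X) + H(Y)), which is one common choice that the text does not specify. The persistent homology value is taken as an input, since computing it requires a dedicated topology library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(x, y):
    """NMI = 2 * I(X;Y) / (H(X) + H(Y)); returns 0.0 if both are constant."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))        # joint entropy
    denom = hx + hy
    return 0.0 if denom == 0 else 2.0 * (hx + hy - hxy) / denom

def information_synergy(x_bins, y_bins, ph_value):
    """Coh(f, g) = MI(f, g) * PH(f, g), with PH supplied by the caller."""
    return normalized_mutual_information(x_bins, y_bins) * ph_value
```

Two identical label sequences give an NMI of 1, so their synergy equals the supplied PH value; two independent sequences give a synergy near 0 regardless of PH.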
[0055] In a more specific embodiment, the dynamic temperature coefficient β is dynamically determined based on the frequency of physical rule violations detected during the physical consistency check in step 2: When the frequency of physical rule violation events exceeds a preset threshold, the dynamic temperature coefficient β is increased to make the distribution of dynamic fusion weights more uniform. When the frequency of physical rule violation events is lower than a preset threshold, the dynamic temperature coefficient β is reduced to make the distribution of dynamic fusion weights more concentrated.
[0056] In one embodiment, step 5 specifically includes: Based on the dynamic fusion weights Ωs, Ωt, and Ωp, the shape feature set S, texture feature set T, and spatial feature set P are weighted and summed to obtain the initial fusion feature $F_{\text{init}}$: $F_{\text{init}} = \Omega_s S + \Omega_t T + \Omega_p P$. For the initial fusion feature $F_{\text{init}}$, feature dimensions with dynamic fusion weights below a preset threshold are pruned to remove redundant dimensions, resulting in a dimensionally compressed comprehensive feature vector F.
[0057] In a more specific embodiment, for the initial fusion feature $F_{\text{init}}$, feature dimensions whose dynamic fusion weights are below 0.15 are pruned. For example, if the weight Ωs of the shape feature set S is reduced to 0.1, only the core dimension of S (the number of skirts) is retained, while conflicting dimensions such as the roundness of the skirts are deleted, thus obtaining the dimensionally compressed comprehensive feature vector F.
[0058] In one embodiment, in step 6, the lightweight classification network includes an input layer, a feature extraction layer, a physical rule layer, and a classification layer. The input layer concatenates the comprehensive feature vector F with the physical label of the equipment (e.g., 110kV insulator - defect-free) to provide physical prior guidance for the network; the feature extraction layer uses depthwise separable convolution, with approximately 1/20th the number of parameters of ResNet-50 (approximately 8 million parameters), and the convolution kernel size is optimized for power equipment features (e.g., 3×3 for texture, 5×5 for shape); the physical rule layer embeds physical rules for power equipment defects, such as only insulator cracks ≥5mm in length being considered defects, and loose tower bolts being accompanied by deformation of surrounding textures, and the physical rules are transformed into numerical constraints through a rule mapping function; the classification layer uses a softmax activation function to output the target category (e.g., insulator - defect-free, or insulator - crack defect) and its confidence level.
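The parameter saving from depthwise separable convolution follows from simple counting, shown here for a single layer. The channel counts in the example are illustrative assumptions, and bias terms are omitted.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k per input channel, then a 1 x 1 pointwise mix."""
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)          # 73728
separable = depthwise_separable_params(3, 64, 128)   # 8768
```

For this layer the separable form uses roughly 8.4 times fewer weights; the overall ratio for a full network depends on its layer configuration.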
[0059] During the training phase, a total loss function incorporating physical rule constraints is used. The total loss function is: $L_{\text{total}} = L_{\text{ce}} + \lambda L_{\text{phys}}$, where $L_{\text{ce}}$ is the cross-entropy loss function; $L_{\text{phys}}$ is the physical rule loss function, used to measure the deviation between the network prediction result and the preset physical rule threshold; and λ is the weight coefficient, used to balance the cross-entropy loss and the physical rule loss.
[0060] In a more specific embodiment, the physical rule loss function $L_{\text{phys}}$ is calculated using the following formula:
[0061] Where: M is the number of physical rules, and for each rule m the formula compares the feature metric value related to the m-th physical rule in the network prediction results with the preset threshold for the m-th physical rule.
[0062] In a more specific embodiment, the physical rules include at least one of the following: Insulator crack length defect rule: when the network prediction result is an insulator crack defect, if the actual physical length corresponding to the pixel length of the predicted crack region is less than a preset threshold, the physical rule loss is triggered; Tower bolt loosening defect rule: when the network prediction result is a tower bolt loosening defect, if the texture change feature value around the bolt area is less than the preset threshold, the physical rule loss is triggered; Insulator skirt damage defect rule: when the network prediction result is an insulator skirt damage defect, if the shape change feature value of the skirt area is less than the preset threshold, the physical rule loss is triggered.
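The total loss, cross-entropy plus λ times the physical rule loss, can be sketched as follows. The formula for the physical rule loss is not reproduced in this text, so the hinge form below (a mean over M rules of how far each metric falls short of its threshold) is an assumption, motivated by the rules above, which all trigger when a metric is less than its threshold. The function names and the use of `None` for rules not triggered by the predicted class are also hypothetical.

```python
import math

def cross_entropy(probabilities, true_index):
    """Cross-entropy for one sample given softmax probabilities."""
    return -math.log(probabilities[true_index])

def physical_rule_loss(metrics, thresholds):
    """Assumed hinge form: mean over M rules of max(0, threshold - metric).

    metrics[m] is None when rule m is not triggered by the predicted class.
    """
    m_rules = len(thresholds)
    if m_rules == 0:
        return 0.0
    penalties = [max(0.0, t - v)
                 for v, t in zip(metrics, thresholds) if v is not None]
    return sum(penalties) / m_rules

def total_loss(probabilities, true_index, metrics, thresholds, lam=0.1):
    """Cross-entropy plus lam times the physical rule loss."""
    return (cross_entropy(probabilities, true_index)
            + lam * physical_rule_loss(metrics, thresholds))
```

For instance, a predicted crack whose physical length is 3 mm against a 5 mm threshold contributes a shortfall of 2 to the rule loss, while a rule whose metric meets its threshold contributes nothing.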
[0063] In a preferred embodiment, during the inference phase, step 6 further includes: after the comprehensive feature vector is input into the lightweight classification network, the physical rule layer first quickly verifies whether the intermediate features of the network conform to the basic physical laws. If they do not, an invalid-prediction result is output directly and subsequent calculations are terminated.
[0064] Embodiment 2: As shown in Figure 2, based on the same inventive concept as the above embodiments, the present invention also provides a multimodal feature collaborative fusion device based on physical topological information theory, comprising: a preprocessing module, used to acquire the image to be recognized and perform physical prior-driven adaptive preprocessing on it to obtain the enhanced preprocessed image; a feature extraction module, used to perform correlational multimodal feature extraction on the preprocessed image under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels; a projection module, used to construct a physical rule space and project the shape feature set, texture feature set, and spatial feature set onto the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features; a weight generation module, used to calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and to generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy, where the information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space; a fusion module, used to perform weighted fusion of the shape feature set, texture feature set, and spatial feature set according to the dynamic fusion weights to obtain a comprehensive feature vector; and a classification module, used to input the comprehensive feature vector into a pre-trained lightweight classification network and output the image recognition result.
[0065] Embodiment 3: As shown in Figure 3, the present invention also provides an electronic device 100 for implementing a multimodal feature collaborative fusion method based on physical topological information theory. The electronic device 100 includes a memory 101, at least one processor 102, a computer program 103 stored in the memory 101 and executable on the at least one processor 102, and at least one communication bus 104.
[0066] The memory 101 can be used to store the computer program 103. The processor 102 implements the steps of the multimodal feature collaborative fusion method based on physical topology information theory in Embodiment 1 by running or executing the computer program stored in the memory 101 and calling the data stored in the memory 101.
[0067] The memory 101 may primarily include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created based on the use of the electronic device 100 (such as audio data). In addition, the memory 101 may include random access memory (RAM) as well as non-volatile memory, such as a hard disk, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.
[0068] At least one processor 102 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. Processor 102 may be a microprocessor or any conventional processor. Processor 102 is the control center of electronic device 100, connecting various parts of electronic device 100 via various interfaces and lines.
[0069] The memory 101 in the electronic device 100 stores multiple instructions to implement a multimodal feature collaborative fusion method based on physical topology information theory, and the processor 102 can execute the multiple instructions to achieve the following: Step 1: Obtain the image to be recognized and perform physical prior-driven adaptive preprocessing on the image to be recognized to obtain the enhanced preprocessed image; Step 2: Based on the preprocessed image, perform correlational multimodal feature extraction under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels; Step 3: Construct a physical rule space, and project the shape feature set, texture feature set, and spatial feature set into the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features; Step 4: Calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy, where the information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space; Step 5: Based on the dynamic fusion weights, perform weighted fusion of the shape feature set, texture feature set, and spatial feature set to obtain a comprehensive feature vector; Step 6: Input the comprehensive feature vector into the pre-trained lightweight classification network and output the image recognition result.
[0070] Embodiment 4: If the modules/units integrated in the electronic device 100 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, and read-only memory (ROM).
[0071] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0072] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0073] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0074] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0075] In the description of this specification, references to terms such as "an embodiment," "example," "specific example," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0076] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation of the present invention. Any modifications or equivalent substitutions that do not depart from the spirit and scope of the present invention should be covered within the scope of protection of the claims of the present invention.
Claims
1. A multi-modal feature collaborative fusion method based on physical topology information theory, characterized in that it includes the following steps: Step 1: Obtain the image to be recognized and perform physical prior-driven adaptive preprocessing on the image to be recognized to obtain the enhanced preprocessed image; Step 2: Based on the preprocessed image, perform correlational multimodal feature extraction under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels; Step 3: Construct a physical rule space, and project the shape feature set, texture feature set, and spatial feature set into the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features; Step 4: Calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy, where the information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space; Step 5: Based on the dynamic fusion weights, perform weighted fusion of the shape feature set, texture feature set, and spatial feature set to obtain a comprehensive feature vector; Step 6: Input the comprehensive feature vector into the pre-trained lightweight classification network and output the image recognition result.
2. The physical topology information theory based multi-modal feature collaborative fusion method according to claim 1, characterized in that, Step 1 specifically includes: Step 1-1: After obtaining the image to be recognized, match the target device type in the image and call the preset device physical attribute library; Step 1-2: Based on the physical size constraints in the device physical attribute library, dynamically generate foreground seed points and background seed points, and use the GrabCut algorithm to segment the image to be recognized to obtain the target region image; Step 1-3: Calculate the actual physical size of the target in the target region image and the percentage of image pixels it occupies, dynamically determine the adaptive scaling ratio based on the percentage, and scale the target region image to obtain a size-normalized target image; Step 1-4: Perform scene classification on the size-normalized target image, dynamically adjust the Gaussian filter scales and weights of the multi-scale Retinex enhancement algorithm based on the classification result, and perform illumination enhancement processing on the target image to obtain the enhanced preprocessed image.
3. The physical topology information theory based multi-modal feature collaborative fusion method according to claim 1, characterized in that, Step 2 specifically includes: Step 2-1: Convert the preprocessed image to grayscale, scale it to the standard size according to the device's physical size, calculate the Zernike moments, verify and correct the Zernike moment calculation result through the physical rule verification layer, and output a shape feature set with physical labels; Step 2-2: Based on the physical labels in the shape feature set, mark the physically vulnerable and non-vulnerable regions in the image, perform directional Gabor filtering and weighted LBP encoding on the vulnerable regions, and output the texture feature set related to equipment defects; Step 2-3: Based on the equipment structure information in the shape feature set, divide the area into multi-scale blocks according to the physical region, calculate the spatial statistical features for each scale block, and remove false features through physical consistency verification to output the spatial feature set.
4. The physical topology information theory based multi-modal feature collaborative fusion method according to claim 1, characterized in that, Step 3 specifically includes: Construct a high-dimensional manifold space consisting of at least one physical rule threshold as the physical rule space; The shape feature set, texture feature set, and spatial feature set are respectively input into the preset physical rule embedding function and mapped to the physical rule space to obtain the corresponding shape projection feature, texture projection feature, and spatial projection feature.
5. The physical topology information theory based multi-modal feature collaborative fusion method according to claim 1, characterized in that, The dynamic fusion weights in step 4 are calculated using the following formula: For shape projection feature s, texture projection feature t, and spatial projection feature p, the corresponding dynamic fusion weights Ωs, Ωt, and Ωp are: wherein the subscript f takes the values s, t, and p respectively, corresponding to the calculation of Ωs, Ωt, and Ωp; ‖f‖₂ is the L2 norm of the projection feature f, used to represent the physical rule embedding strength of the feature set itself; β is a dynamic temperature coefficient, used to adjust the distribution concentration of the dynamic fusion weights; Coh(f,g) is the information synergy between the projection features f and g, used to represent the synergistic effect of the statistical dependence and topological stability of the two in the physical rule space.
6. The multimodal feature collaborative fusion method based on physical topological information theory according to claim 5, characterized in that, The information synergy Coh(f,g) is calculated through the following steps: calculate the normalized mutual information of the projection features f and g in the physical rule space to obtain the statistical dependency measure MI(f,g); perform a persistent homology analysis on the combination of projection features f and g in the physical rule space, and calculate the maximum persistent homology Betti number PH(f,g) under the preset physical effective scale to obtain the topological stability measure; multiply the statistical dependency measure by the topological stability measure to obtain the information synergy Coh(f,g) = MI(f,g)·PH(f,g).
7. The multimodal feature collaborative fusion method based on physical topological information theory according to claim 1, characterized in that, Step 5 specifically includes: Based on the dynamic fusion weights, the shape feature set, texture feature set, and spatial feature set are weighted and summed to obtain the initial fusion features; Feature pruning is performed on feature dimensions whose dynamic fusion weights are lower than a preset threshold in the initial fusion features to obtain a dimensionally compressed comprehensive feature vector.
8. A multimodal feature collaborative fusion device based on physical topological information theory, characterized in that it includes: a preprocessing module, used to acquire the image to be recognized and perform physical prior-driven adaptive preprocessing on it to obtain the enhanced preprocessed image; a feature extraction module, used to perform correlational multimodal feature extraction on the preprocessed image under physical rule constraints to obtain a shape feature set, a texture feature set, and a spatial feature set with physical labels; a projection module, used to construct a physical rule space and project the shape feature set, texture feature set, and spatial feature set onto the physical rule space respectively to obtain the corresponding shape projection features, texture projection features, and spatial projection features; a weight generation module, used to calculate the information synergy between any two projection features based on the shape projection features, texture projection features, and spatial projection features, and to generate dynamic fusion weights for each feature set in multimodal fusion based on the information synergy, where the information synergy characterizes the synergistic effect of the statistical dependence and topological stability of two projection features in the physical rule space; a fusion module, used to perform weighted fusion of the shape feature set, texture feature set, and spatial feature set according to the dynamic fusion weights to obtain a comprehensive feature vector; and a classification module, used to input the comprehensive feature vector into a pre-trained lightweight classification network and output the image recognition result.
9. An electronic device, characterized in that it includes a processor and a memory, the processor being used to execute a computer program stored in the memory to implement the multimodal feature collaborative fusion method based on physical topological information theory as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores at least one instruction, which, when executed by a processor, implements the multimodal feature collaborative fusion method based on physical topological information theory as described in any one of claims 1 to 7.