An industrial endoscope defect automatic identification system based on deep learning

By combining multimodal image acquisition and a multi-expert sub-network architecture, dynamic matching feature extraction and confidence level calibration solve the problem of misjudgment in the recognition of industrial endoscopes under dynamic working conditions, and achieve highly reliable and traceable defect detection.

CN122199446A · Pending · Publication Date: 2026-06-12 · SHENZHEN SIMU AUTOMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHENZHEN SIMU AUTOMATION TECH CO LTD
Filing Date: 2026-03-10
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing deep learning-based industrial endoscope defect identification systems suffer from a strong coupling conflict between model generalization ability and adaptability under dynamic working conditions. They cannot effectively cope with domain shifts caused by factors such as light fluctuations, reflections, and media obstruction, and lack uncertainty perception and adaptive calibration mechanisms, leading to frequent misjudgments.

Method used

A closed-loop processing flow is formed by employing a multimodal image acquisition module, a physical constraint-driven preprocessing module, a domain-aware feature extraction module, a confidence calibration module, and a defect classification and localization output module. Combined with a multi-expert sub-network architecture and a dual uncertainty assessment mechanism, the feature extraction strategy is dynamically matched and the confidence is calibrated.

Benefits of technology

It effectively mitigates performance degradation caused by domain offset, ensures high reliability and traceability of single inference results, and protects model integrity and data confidentiality through a security element chip, adapting to the detection needs of complex industrial scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

The application belongs to the technical field of artificial intelligence, and particularly relates to a deep learning-based automatic defect identification system for industrial endoscopes, which comprises a multimodal image acquisition module, a physical constraint-driven preprocessing module, a domain-aware feature extraction module, a confidence calibration module, and a defect classification and localization output module. Context metadata is generated through multimodal sensor fusion and physical imaging model normalization to drive dynamic switching among multi-expert sub-networks, and confidence calibration and conservative decision making are realized in combination with dual uncertainty assessment. By coupling multimodal sensing with the physical imaging model, environmental disturbance factors are explicitly encoded as context metadata, avoiding the blind dependence on implicit features found in traditional purely data-driven methods. The domain-aware multi-expert sub-network architecture dynamically matches the optimal feature extraction strategy within a single inference, effectively alleviating the performance degradation caused by domain shift.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence technology, and specifically relates to a deep learning-based automatic defect identification system for industrial endoscopes.

Background Technology

[0002] With the improvement of industrial automation and intelligence, the demand for real-time monitoring of the internal structural health status of key equipment is becoming increasingly urgent. As a non-destructive testing tool, industrial endoscopes are widely used in the internal defect detection of high-value and high-risk equipment such as aero-engines and nuclear power plants. They can obtain internal visual information without disassembling the equipment. Breakthroughs in deep learning technology in the field of computer vision have driven the introduction of models such as convolutional neural networks into endoscopic image analysis, enabling automatic defect localization, classification and quantitative assessment, significantly reducing the subjectivity and labor intensity of manual interpretation, and providing data support for predictive maintenance.

[0003] Current mainstream deep learning-based industrial endoscope defect recognition systems employ a linear architecture of image acquisition, preprocessing, feature extraction, and classification decision. Images are acquired through high-resolution lenses and CMOS sensors, and their quality is improved through preprocessing such as denoising and contrast enhancement. Feature extraction and classification are then performed by deep neural networks such as ResNet and U-Net variants. Training relies on a supervised learning paradigm with a large number of labeled samples. This approach outperforms traditional methods in identifying typical defects against regularly textured backgrounds. Being data-driven, it removes the reliance on prior knowledge of defect morphology and shifts the paradigm from manually designed features to features learned autonomously by the model, which was a reasonable engineering choice for its stage of development.

[0004] Existing technologies suffer from a fundamental structural contradiction: a strong coupling conflict between model generalization ability and adaptability to dynamic working conditions. The performance of deep learning models depends on the coverage of training data distribution, while endoscopic imaging in industrial settings is affected by factors such as light fluctuations, reflections, and media obstruction, resulting in significant domain shifts and a sharp degradation in model performance. Models trained in idealized environments are highly vulnerable in real-world complex scenarios, and endoscopic inspections are mostly one-off, non-repeatable operations, making it impossible to correct misjudgments. The core limitation lies in the lack of uncertainty awareness and adaptive calibration mechanisms in the architecture, making the model prone to outputting high-confidence erroneous results. This contradiction stems from the fundamental mismatch between static models and dynamic working conditions and cannot be resolved through local optimization.

[0005] Therefore, the present invention provides an automatic defect identification system for industrial endoscopes based on deep learning.

Summary of the Invention

[0006] To overcome the shortcomings of the prior art, the present invention solves at least one of the technical problems raised in the background section.

[0007] The technical solution adopted by the present invention to solve its technical problem is as follows: The present invention provides an automatic identification system for defects in industrial endoscopes based on deep learning, which includes a multimodal image acquisition module, a physical constraint-driven preprocessing module, a domain-aware feature extraction module, a confidence calibration module, and a defect classification and localization output module; the modules are coupled to each other through a standardized data interface and a status signal bus to form a closed-loop processing flow with feedforward and feedback dual paths.

[0008] Preferably, the multimodal image acquisition module comprises a high dynamic range CMOS image sensor, a miniature inertial measurement unit, an ambient light intensity detector, and a lens focal length encoder. The CMOS image sensor acquires raw visible light image frame sequences, the inertial measurement unit records the attitude angular velocity and acceleration information of the endoscope probe in real time, the ambient light intensity detector synchronously acquires illuminance values under current lighting conditions, and the lens focal length encoder outputs discrete zoom level indicators corresponding to the current optical zoom state. The above four types of sensor data are uniformly packaged into a composite sensor tuple with a time stamp via an internal SPI bus and transmitted to the subsequent preprocessing module via an RS485 industrial communication bus.

[0009] Preferably, the physical constraint-driven preprocessing module receives composite sensor tuples from the multimodal image acquisition module and performs image normalization based on physical priors according to the geometric optical model and surface reflection characteristics of endoscopic imaging. Specifically, the module first calculates the deflection angle of the current viewing angle relative to the standard frontal direction based on the attitude information provided by the inertial measurement unit, and constructs a projection transformation matrix from three-dimensional space to a two-dimensional image plane by combining the zoom parameters output by the lens focal length encoder. Subsequently, it performs inverse illumination compensation on the original image using the illuminance value provided by the ambient light intensity detector to eliminate brightness non-uniformity caused by changes in the distance of the light source or occlusion. Finally, the image after geometric correction and illumination normalization is sent to the feature extraction module, and the set of physical parameters used is appended as context metadata to the end of the image tensor channel dimension for subsequent modules to call.

[0010] Preferably, the domain-aware feature extraction module adopts a multi-expert hybrid architecture comprising at least three feature extraction sub-networks deployed in parallel, which are specifically optimized for low-light blurred scenes, strong reflective interference scenes, and complex texture background scenes, respectively. Each sub-network is designed based on an improved U-Net encoder-decoder structure; the encoders share the weights of the underlying convolutional kernels, while the high-level semantic branches are trained independently. In addition to the normalized image, the input of all sub-networks also includes the context metadata from the preprocessing module. The module has a domain discrimination controller, which dynamically activates the corresponding scene-adapted sub-network path and suppresses the gradient updates of the other paths based on the combination of physical parameters in the context metadata. The decision logic of the domain discrimination controller is implemented based on a preset rule mapping table, which uniquely maps different combinations of light intensity range, viewing angle range, and focal length to specific sub-network indices.

[0011] Preferably, the output of the domain discrimination controller is used not only to select the activated sub-network, but also to simultaneously generate a domain consistency score signal, which reflects the degree of matching between the current input image and the training domain of each sub-network. When the matching scores of all sub-networks are lower than a preset safety threshold, the domain discrimination controller triggers an anomaly flag, records the flag along with the intermediate feature map, and passes both to the next module.

[0012] Preferably, the confidence calibration module receives intermediate feature maps, activation subnetwork identifiers, and domain consistency score signals from the feature extraction module, and constructs a dual uncertainty assessment mechanism based on these. This mechanism includes a cognitive uncertainty assessment unit and a random uncertainty assessment unit. The cognitive uncertainty assessment unit uses the Monte Carlo Dropout sampling method to run the same subnetwork multiple times during the inference phase, and statistically analyzes the variance distribution of the defect category prediction results for each pixel, thereby quantifying the degree of confidence the model lacks in recognizing samples outside the current input distribution. The random uncertainty assessment unit assesses the observation uncertainty caused by noise or blurring in local areas of the image based on the spatial gradient magnitude and local contrast index of the feature map. The two uncertainty indices are fused into a comprehensive uncertainty heatmap, and then weighted element-wise with the original defect probability map to generate the final calibrated defect confidence map.

[0013] Preferably, the confidence calibration module has built-in feedback control logic. When the share of regions in the comprehensive uncertainty heatmap whose uncertainty value is above the preset limit exceeds a preset proportion, the system automatically activates the conservative decision mode. In this mode, the defect classification and localization output module only marks the regions whose confidence values are higher than the dual threshold (i.e., the dynamic threshold resulting from the combined effect of the original probability threshold and the uncertainty suppression factor), and the remaining regions are marked as pending verification. At the same time, the system caches the context metadata, uncertainty heatmap, and original image of this inference to the local non-volatile memory, and sends an early warning event packet to the remote operation and maintenance terminal through the power line carrier communication interface, requesting manual intervention or supplementary sampling instructions.

[0014] Preferably, the defect classification and localization output module adopts a dual-branch output structure. The main branch is responsible for generating pixel-level defect semantic segmentation maps, and the auxiliary branch is responsible for outputting defect type classification labels and quantization parameters. The main branch is based on a fully convolutional network architecture, receives the calibrated defect confidence map as input, restores the spatial resolution through upsampling and skip connections, and finally outputs a defect mask with the same size as the input image. The auxiliary branch extracts global pooling vectors from the high-level semantic feature map of the feature extraction module, completes multi-class discrimination of defect types through a multilayer perceptron, and simultaneously calculates the area ratio, maximum extension length, and morphological complexity index of the defect region. All output results are accompanied by timestamps and device IDs, and are encapsulated into structured data frames according to the extended control specifications conforming to the IEC61850 standard, and uploaded to the central monitoring platform via the Ethernet interface.

[0015] Preferably, the system integrates a security element chip at the hardware level to store model encryption keys, device identity certificates, and secure boot firmware. Each time the system powers on, the security element chip performs an integrity verification process to verify whether the digital signatures of each software module match the pre-stored public key. If the verification fails, loading the deep learning model is prohibited and the system enters a secure lock state. In addition, all sensitive data is temporarily stored in memory in encrypted form and is only decrypted and used in computation within a trusted execution environment to prevent plaintext leakage to the general operating system.

[0016] Preferably, the multi-expert sub-network adopts a joint optimization strategy during the training phase: first, it is pre-trained on a large-scale synthetic dataset, which simulates combinations of different materials, lighting, viewpoints, and defect morphologies through a physical rendering engine; then, it is fine-tuned on a small sample dataset collected from real industrial scenes. During the fine-tuning process, a domain adversarial loss function is introduced, forcing each sub-network to enhance its ability to extract common features while retaining scene specificity; after training, the weight parameters of each sub-network are fixed to a read-only storage area, and version replacement is only allowed through a secure firmware update mechanism.

[0017] Preferably, the system supports online incremental learning, but this function is subject to strict access control. When the remote operation and maintenance terminal confirms that it has received the sample to be reviewed and has been manually labeled, it can send an encrypted labeling package to the local device. After the system verifies the validity of the labeling package signature on the security element chip, it temporarily stores it in the isolation sandbox area. During the device's idle period, the system starts the incremental learning process, only fine-tuning the parameters of the high-level classification layer of the currently activated sub-network, while keeping the bottom shared feature extraction layer frozen. The incremental learning process is subject to the dual constraints of the maximum number of iterations and the loss reduction rate. Once either condition is not met, the training is terminated and the system is rolled back to the original model version.

[0018] Preferably, a dual-port RAM shared buffer is provided between the physical constraint-driven preprocessing module and the domain-aware feature extraction module, and a write operation is triggered by the DMA controller when an image frame arrives. This buffer is simultaneously accessed by the physical parameter calculation unit of the preprocessing module and the domain discrimination controller of the feature extraction module, ensuring zero-latency transmission of context metadata. In addition, the confidence calibration module and the defect classification and location output module implement event-driven data flow through an interrupt triggering mechanism. The output module is only allowed to read the calibrated confidence map after the comprehensive uncertainty assessment is completed and the decision mode is determined.

[0019] The beneficial effects of this invention are as follows: The present invention discloses a deep learning-based automatic defect identification system for industrial endoscopes. By coupling multimodal sensing with physical imaging models, it explicitly encodes environmental disturbance factors into context metadata, avoiding the blind reliance on implicit features found in traditional purely data-driven methods. It uses a domain-aware multi-expert sub-network architecture to dynamically match the optimal feature extraction strategy within a single inference, effectively mitigating the performance degradation caused by domain shift. It implements a dynamic confidence calibration mechanism based on dual uncertainty assessment, enabling the system to proactively make less aggressive decisions and trigger early warning processes when facing out-of-distribution samples. It ensures high reliability and traceability of single inference results in industrial scenarios where detection cannot be repeated. Through the cooperation of the security element chip and the trusted execution environment, it guarantees the integrity and confidentiality of models and data throughout their entire lifecycle. Through hardware-level buffer sharing and interrupt triggering mechanisms, it maintains data synchronization and temporal consistency among the modules, avoiding state corruption caused by communication delays.

Attached Figure Description

[0020] The invention will now be further described with reference to the accompanying drawings.

[0021] Figure 1 is a structural block diagram of the deep learning-based automatic defect identification system for industrial endoscopes described in this invention.

Detailed Implementation

[0022] To make the technical means, creative features, objectives and effects of this invention easier to understand, the invention will be further described below in conjunction with specific embodiments.

[0023] As shown in Figure 1, in an embodiment of the present invention, an automatic defect identification system for industrial endoscopes based on deep learning includes a multimodal image acquisition module, a physical constraint-driven preprocessing module, a domain-aware feature extraction module, a confidence calibration module, and a defect classification and localization output module connected in sequence. The modules are coupled to one another through a standardized data interface and a status signal bus, forming a closed-loop processing flow with both feedforward and feedback paths. The system integrates a security element chip at the hardware level and deploys a joint optimization training strategy and an online incremental learning mechanism at the software level to ensure the reliability, security, and continuous adaptability of the model.

[0024] The multimodal image acquisition module consists of a high dynamic range CMOS image sensor, a miniature inertial measurement unit (IMU), an ambient light intensity detector, and a lens focal length encoder. The CMOS image sensor is used to continuously acquire raw visible light image frame sequences. The miniature inertial measurement unit integrates a three-axis gyroscope and a three-axis accelerometer to record the endoscope probe's attitude angular velocity (rad/s) and linear acceleration (m/s²) in three-dimensional space in real time. The ambient light intensity detector is used to synchronously acquire illuminance values under current lighting conditions. The lens focal length encoder is a rotary photoelectric encoder, outputting eight discrete zoom levels (0-7), corresponding to optical zoom magnifications from 1x to 4x, with a step interval of 0.5x. The above four types of sensor data are uniformly packaged into composite sensor tuples with time stamps via an internal SPI bus at a clock frequency of 10 MHz. Each image frame corresponds to one tuple, containing a timestamp (unit: μs), an image raw data pointer, an IMU six-dimensional vector, an illuminance value (unit: lux), and a focal length level (integer 0-7). This tuple is transmitted to the physical constraint-driven preprocessing module via an RS485 industrial communication bus at a baud rate of 115200 bps.
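The composite sensor tuple described above can be sketched as a small data structure. This is only an illustration: the class name `SensorTuple`, the field names, and the use of an integer as a stand-in for the raw image data pointer are assumptions, not part of the specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SensorTuple:
    """One composite sensor tuple per image frame (names are illustrative)."""
    timestamp_us: int        # timestamp in microseconds
    image_ref: int           # stand-in for the raw image data pointer
    imu: Tuple[float, ...]   # 3 angular rates (rad/s) + 3 accelerations (m/s^2)
    illuminance_lux: float   # ambient illuminance in lux
    focal_level: int         # discrete zoom level, integer 0-7

    def __post_init__(self) -> None:
        # Reject tuples that violate the ranges stated in the description.
        if len(self.imu) != 6:
            raise ValueError("IMU vector must have six components")
        if not 0 <= self.focal_level <= 7:
            raise ValueError("focal level must be an integer in 0..7")
        if self.illuminance_lux < 0:
            raise ValueError("illuminance cannot be negative")


sample = SensorTuple(
    timestamp_us=1_000_000,
    image_ref=0,
    imu=(0.01, -0.02, 0.0, 0.0, 0.0, 9.81),
    illuminance_lux=850.0,
    focal_level=3,
)
```

Validating the ranges at construction time is a design choice of this sketch; it keeps malformed tuples from reaching the preprocessing stage.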

[0025] After receiving the composite sensor tuple, the physical constraint-driven preprocessing module first parses the IMU data and the focal length level, and calculates the deflection angle of the current viewing angle relative to the standard orthographic direction (i.e., the optical axis perpendicular to the surface being inspected). Taking the standard orthographic direction as the reference coordinate system, integrating the angular velocity output by the IMU yields the current Euler angles (θ, φ, ψ), where θ is the pitch angle, φ is the yaw angle, and ψ is the roll angle. Combining these with the actual focal length f = 1.0 + 0.5 × fidx (unit: mm) mapped from the focal length level fidx ∈ {0, 1, ..., 7}, and based on the endoscope lens calibration parameters (principal point (cx, cy), focal length fx = fy = f × s, where s is the pixel size, taken as 3.75 μm), the projection transformation matrix H from the three-dimensional space point P = (X, Y, Z) to the two-dimensional image point p = (u, v) is constructed (the expression for H and its accompanying definitions are not reproduced in this text). The module uses a 3×3 rotation matrix generated from the Euler angles to perform inverse geometric correction on the original image and generates a viewpoint-normalized image Igeo by bilinear interpolation resampling, making it equivalent to the imaging result under a standard orthographic viewpoint. The module then reads the illuminance value E (unit: lux) output by the ambient light intensity detector and performs inverse illumination compensation on Igeo according to Lambert's cosine law and the point light source attenuation model. Assuming the ideal uniform illuminance is E0 = 1000 lux, the compensation factor is γ = E0 / E. If E < 50 lux or E > 5000 lux, γ is clamped to the interval [0.2, 5.0] to avoid overcompensation. Finally, the compensated image Inorm = γ × Igeo is normalized to the floating-point range [0, 1].
The module appends the physical parameter set Φ = {θ, φ, ψ, f, E} as context metadata to the end of the channel dimension of Inorm, forming a five-channel tensor Tin = [Inorm, θ, φ, ψ, f, E], where the angle is expressed in radians, f is in millimeters, and E is in lux. This tensor is written to the dual-port RAM shared buffer through the DMA controller for zero-latency access by subsequent modules.
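The illumination compensation and focal length mapping in this paragraph can be illustrated with a minimal sketch. It follows a literal reading of the text: γ = E0 / E is clamped only when E < 50 lux or E > 5000 lux. The function names are hypothetical, and clipping the scaled pixel values to [0, 1] is an assumption about how the final normalization is carried out.

```python
E0_LUX = 1000.0                    # ideal uniform illuminance from the text
GAMMA_MIN, GAMMA_MAX = 0.2, 5.0    # clamping interval from the text


def compensation_factor(e_lux: float) -> float:
    """Return gamma = E0 / E, clamped when E < 50 lux or E > 5000 lux."""
    if e_lux <= 0:
        raise ValueError("illuminance must be positive")
    gamma = E0_LUX / e_lux
    if e_lux < 50.0 or e_lux > 5000.0:
        gamma = min(max(gamma, GAMMA_MIN), GAMMA_MAX)
    return gamma


def focal_length_mm(fidx: int) -> float:
    """Map the focal length level fidx in 0..7 to f = 1.0 + 0.5 * fidx (mm)."""
    if not 0 <= fidx <= 7:
        raise ValueError("fidx must be in 0..7")
    return 1.0 + 0.5 * fidx


def normalize_pixels(pixels, gamma):
    """Apply Inorm = gamma * Igeo to values in [0, 1]; clipping is assumed."""
    return [min(max(gamma * p, 0.0), 1.0) for p in pixels]


gamma_dark = compensation_factor(10.0)       # very dark scene, clamped
gamma_bright = compensation_factor(10000.0)  # very bright scene, clamped
```

Under this literal reading, illuminance values between 50 and 200 lux produce factors above 5.0 that are not clamped; an implementation might instead clamp unconditionally, which the text does not settle.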

[0026] The domain-aware guided feature extraction module adopts a multi-expert hybrid architecture, which includes three parallel feature extraction sub-networks: SubNetL (low-light blur scene expert), SubNetR (strong reflective interference scene expert) and SubNetT (texture complex background scene expert). Each sub-network is based on the improved U-Net encoder-decoder architecture. The encoder consists of four downsampling stages, each containing two 3×3 convolutional layers (ReLU activation) and one 2×2 max pooling layer. The weights of the bottom convolutional kernels (first stage) are shared among the three sub-networks, while the high-level semantic branches (second to fourth stages) are trained independently. The decoder consists of four upsampling stages. Each stage restores the resolution through transposed convolution and is fused with the feature maps of the corresponding stage of the encoder through skip connections. The input of all subnetworks is a five-channel tensor Tin, and the output is a 256-channel intermediate feature map Fmid∈R^{H / 4×W / 4×256}.

[0027] This module contains a domain discrimination controller, implemented as a lightweight fully connected network. Its input is the context metadata Φ = {θ, φ, ψ, f, E}, and its output is a three-dimensional weight vector which, after Softmax normalization, represents the activation probability of each sub-network. The controller's decision logic can also be implemented using a preset rule mapping table: when E ≤ 100 lux and |θ| + |φ| ≥ 0.5 rad, SubNetL is activated; when E ≥ 3000 lux and the local image gradient variance > 0.15, SubNetR is activated; when the local second-derivative energy of the image (Laplacian variance) > 0.25 and the texture entropy > 6.0 bit/pixel, SubNetT is activated; in all other cases, SubNetT is activated by default. During the inference phase, only the sub-network path with the highest activation probability is enabled, while gradient updates for the other paths are suppressed (implemented via PyTorch's `torch.no_grad()` context manager). The domain discrimination controller simultaneously calculates the domain consistency score Sdomain (its defining expression is not reproduced in this text). If Sdomain < 0.65, the anomaly flag flaganomaly is set to 1; otherwise flaganomaly = 0. The intermediate feature map Fmid, the activated sub-network identifier idxactive ∈ {0, 1, 2}, and flaganomaly are transmitted to the confidence calibration module via the high-speed AXI bus.
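The rule mapping table variant of the controller can be sketched as a plain function. The thresholds are taken from the paragraph above; the function names, the argument names, and the assumption that the rules are evaluated in the order listed are illustrative only. The domain consistency score is taken as an input here because its formula is not given in the text.

```python
def select_subnet(e_lux, theta, phi, grad_var, lap_var, texture_entropy):
    """Return the sub-network chosen by the preset rule table.

    Rules are checked in the order they are listed in the description;
    that ordering is an assumption of this sketch.
    """
    if e_lux <= 100.0 and abs(theta) + abs(phi) >= 0.5:
        return "SubNetL"   # low light with large viewing-angle deflection
    if e_lux >= 3000.0 and grad_var > 0.15:
        return "SubNetR"   # strong reflective interference
    if lap_var > 0.25 and texture_entropy > 6.0:
        return "SubNetT"   # complex texture background
    return "SubNetT"       # default case stated in the description


def anomaly_flag(s_domain, threshold=0.65):
    """Return 1 when the domain consistency score is below the threshold."""
    return 1 if s_domain < threshold else 0


choice = select_subnet(e_lux=80.0, theta=0.4, phi=0.2,
                       grad_var=0.05, lap_var=0.1, texture_entropy=4.0)
```

The explicit third rule is redundant with the default branch, but keeping it makes the sketch mirror the rule table one to one.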

[0028] The confidence calibration module constructs a dual uncertainty assessment mechanism. The cognitive (epistemic) uncertainty assessment unit performs Monte Carlo Dropout sampling on the activated sub-network: during the inference phase, the sub-network that produced Fmid is run N = 10 times with a Dropout rate p = 0.3, generating a pixel-level defect probability map Pi ∈ [0,1]^{H×W}, i = 1, …, 10, each time. The prediction variance of each pixel is calculated as Uepistemic(x,y) = (1/N) ∑_{i=1}^{N} (Pi(x,y) − P̄(x,y))², where P̄(x,y) = (1/N) ∑_{i=1}^{N} Pi(x,y) is the average predicted probability. The random (aleatoric) uncertainty assessment unit calculates the spatial gradient magnitude G and the local contrast C based on Fmid: G = ||∇Fmid||2, and C = σlocal / μlocal (the ratio of local standard deviation to local mean, window size 7×7). The random uncertainty Ualeatoric is defined in terms of G and C with the empirical coefficients α = 0.6, β = 2.0, and γ = 1.5 (the defining expression is not reproduced in this text). The comprehensive uncertainty heatmap is Utotal = λUepistemic + (1−λ)Ualeatoric with λ = 0.7. The original defect probability map Praw is obtained from a single forward pass (without Dropout), and the calibrated confidence map is Pcalib = Praw ⊙ (1 − Utotal), where ⊙ denotes element-wise multiplication.
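The fusion and calibration steps can be illustrated on flattened maps using plain Python lists. This is a sketch under stated assumptions: the per-pixel variance uses the population form (division by N), the aleatoric map is passed in as an input because its formula is not available in the text, and the function names are hypothetical.

```python
def epistemic_variance(samples):
    """Per-pixel variance over N Monte Carlo Dropout probability maps.

    `samples` is a list of N flattened maps of equal length.
    Division by N (population variance) is an assumption of this sketch.
    """
    n = len(samples)
    num_pixels = len(samples[0])
    variances = []
    for k in range(num_pixels):
        values = [s[k] for s in samples]
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    return variances


def calibrate(p_raw, u_epistemic, u_aleatoric, lam=0.7):
    """Return (Utotal, Pcalib) with Utotal = lam*Ue + (1-lam)*Ua."""
    u_total = [lam * ue + (1.0 - lam) * ua
               for ue, ua in zip(u_epistemic, u_aleatoric)]
    p_calib = [p * (1.0 - u) for p, u in zip(p_raw, u_total)]
    return u_total, p_calib


# Two dropout passes over a two-pixel map: the first pixel is unstable.
u_e = epistemic_variance([[0.0, 0.5], [1.0, 0.5]])
u_tot, p_cal = calibrate([0.8, 0.8], u_e, [0.5, 0.0])
```

In the example, the unstable first pixel has its confidence reduced, while the stable second pixel keeps its original probability.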

[0029] This module has built-in feedback control logic: it calculates the proportion ru of pixels in Utotal that exceed the threshold τu = 0.4; if ru > 0.3 (i.e., the uncertainty is too high over more than 30% of the region), the conservative decision mode is enabled. In this mode, the defect classification and localization output module only labels regions that satisfy Pcalib(x,y) > τp and Utotal(x,y) < τu, where τp = 0.5 is the original probability threshold; the remaining regions are marked as pending verification. The system caches the context metadata Φ, Utotal, the raw image Iraw, and flaganomaly of this inference to the local eMMC non-volatile memory (capacity 32 GB), and sends an early warning event packet conforming to the IEC 61850-7-420 standard to the remote operation and maintenance terminal through the power line carrier communication (PLC) interface. The packet includes the device ID, timestamp, exception type, and data pointer.
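The feedback control logic can be sketched as follows for flattened maps. The thresholds come from the paragraph above. The behaviour outside conservative mode is not spelled out in the text, so labeling by Pcalib > τp alone in that case, together with the label strings and the function name, is an assumption of this sketch.

```python
TAU_U = 0.4   # uncertainty threshold
TAU_P = 0.5   # original probability threshold
R_MAX = 0.3   # share of high-uncertainty pixels that triggers conservative mode


def decide(p_calib, u_total):
    """Return (mode, labels) for flattened confidence and uncertainty maps."""
    r_u = sum(1 for u in u_total if u > TAU_U) / len(u_total)
    if r_u > R_MAX:
        # Conservative mode: label only confident, low-uncertainty pixels.
        labels = ["defect" if p > TAU_P and u < TAU_U else "pending"
                  for p, u in zip(p_calib, u_total)]
        return "conservative", labels
    # Assumed behaviour otherwise: threshold on the calibrated confidence.
    labels = ["defect" if p > TAU_P else "background" for p in p_calib]
    return "aggressive", labels


mode, labels = decide([0.9, 0.9, 0.2, 0.6], [0.1, 0.5, 0.5, 0.1])
```

Here half of the pixels exceed τu, so the conservative mode applies and the two uncertain pixels are left pending rather than classified.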

[0030] The defect classification and localization output module adopts a dual-branch structure. The main branch is a fully convolutional network. The input Pcalib is upsampled three times (transposed convolutional kernel 4×4, stride 2) and fused with skip connections to output a defect mask Mseg∈{0,1}^{H×W} of the same size as the original image. The auxiliary branch performs global average pooling from Fmid to obtain a 256-dimensional vector vg, which is input to a three-layer MLP (hidden layers 128→64→K, where K is the number of defect categories, K=5 in this embodiment: crack, corrosion, pit, scratch, foreign object), and outputs the category probability distribution pclass∈[0,1]^K; Based on Mseg, the following quantization parameters are calculated: Area ratio Aratio = (∑Mseg) / (H×W); Maximum extension length Lmax is obtained through skeletonization and the longest path algorithm (unit: mm, which needs to be combined with the pixel-physical scale conversion factor spx = 0.05 mm / pixel); Morphological complexity index Cmorph = perimeter² / (4π×area), used to distinguish between regular and irregular defects. All output results are accompanied by an IEEE 1588 precise timestamp (accuracy ±1μs) and a unique device ID (provided by the security element chip), encapsulated as a data frame of the IEC 61850-9-2LE extended control specification, and uploaded to the central monitoring platform via a gigabit Ethernet interface.
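The quantization parameters defined above lend themselves to a short worked example. The sketch below computes the area ratio from a binary mask and the morphological complexity index from a given perimeter and area; measuring the perimeter and the skeleton-based maximum length is left out, and the function names are hypothetical.

```python
import math

S_PX_MM = 0.05  # pixel-to-physical scale factor from the text (mm / pixel)


def area_ratio(mask):
    """Aratio = (sum of mask) / (H * W) for a binary mask given as rows."""
    h, w = len(mask), len(mask[0])
    return sum(sum(row) for row in mask) / (h * w)


def morphological_complexity(perimeter, area):
    """Cmorph = perimeter^2 / (4 * pi * area); equals 1 for an ideal circle."""
    return perimeter ** 2 / (4.0 * math.pi * area)


def length_mm(length_px):
    """Convert a length in pixels to millimetres using spx."""
    return length_px * S_PX_MM


ratio = area_ratio([[0, 1], [1, 1]])                        # 3 of 4 pixels set
circle = morphological_complexity(2 * math.pi, math.pi)     # unit circle
square = morphological_complexity(8.0, 4.0)                 # 2 x 2 square
```

A circle gives the minimum value of 1, while the square gives 4/π, roughly 1.27; elongated or ragged defects such as cracks would score considerably higher, which is what makes the index useful for separating regular from irregular shapes.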

[0031] The system integrates an Infineon OPTIGA™ Trust M secure element chip at the hardware level to store the AES-256 model encryption key, the X.509 device identity certificate, and the SHA-256 secure boot firmware hash value. Upon each power-on, the secure element performs integrity verification: it calculates the SHA-256 digest of each software module (including the deep learning model binary file) and compares it against the signature, which is verified with the pre-stored public key. If any module fails verification, loading the model is prohibited and the system enters a security lock state, from which recovery is possible only through the authorized firmware update interface. All sensitive data (including original images, feature maps, and model weights) are stored in DDR4 memory encrypted in AES-GCM mode, and are only decrypted and used in computation within the ARM TrustZone Trusted Execution Environment (TEE) to prevent plaintext leakage to the general-purpose Linux operating system.
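The digest comparison at the heart of the power-on check can be illustrated with Python's standard `hashlib` and `hmac` modules. This is a simplified sketch, not the secure element's actual procedure: the table of trusted digests stands in for the values recovered from verified signatures, and the function names and state strings are hypothetical.

```python
import hashlib
import hmac


def sha256_digest(blob: bytes) -> bytes:
    """Return the SHA-256 digest of a software module image."""
    return hashlib.sha256(blob).digest()


def verify_modules(modules, trusted_digests):
    """Return True only if every module matches its trusted digest."""
    for name, blob in modules.items():
        expected = trusted_digests.get(name)
        # compare_digest performs a constant-time comparison.
        if expected is None or not hmac.compare_digest(
                sha256_digest(blob), expected):
            return False
    return True


def boot(modules, trusted_digests):
    """Load the model only when all modules verify; otherwise lock."""
    if verify_modules(modules, trusted_digests):
        return "MODEL_LOADED"
    return "SECURITY_LOCK"


modules = {"model.bin": b"weights", "app": b"code"}
trusted = {name: sha256_digest(blob) for name, blob in modules.items()}
state = boot(modules, trusted)
```

A module that is missing from the trusted table is treated as a failure, so unknown code cannot be loaded alongside verified modules.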

[0032] The multi-expert sub-network adopts a joint optimization strategy during the training phase. During the pre-training phase, a synthetic dataset generated by NVIDIA Omniverse Replicator is used, containing 100,000 images that simulate three materials: carbon steel, stainless steel, and aluminum alloy. The illumination intensity is 50-10000 lux, the viewing angle is −1.0 to +1.0 rad, the zoom magnification is 1x to 4x, and the defect morphology covers 5 classes with 20,000 samples for each class. During the fine-tuning phase, 2000 images (manually annotated at the pixel level) are acquired using a real industrial pipeline endoscope, and a domain adversarial loss function of the form Ltotal = LDice + λadv·Ldomain is employed, where LDice is the Dice loss, Ldomain is the domain classification cross-entropy loss computed after the gradient reversal layer (GRL), and the weighting coefficient λadv = 0.1. After training, the weights of each sub-network are fixed in the QSPI Flash read-only storage area, and version replacement is only allowed through the secure firmware update mechanism (which requires the OPTIGA chip to verify an ECDSA signature).

[0033] The system supports online incremental learning, subject to strict access control. Once the remote maintenance terminal confirms receipt of a sample to be reviewed and issues a manual annotation package (including the pixel mask and the category label), the secure element chip verifies the package's ECDSA signature; if the signature is valid, the package is temporarily stored in an isolated sandbox area (based on ARM MMU memory protection). During device idle periods (CPU load < 20% and no new image input for 5 seconds), the system initiates the incremental learning process: fine-tuning is performed only on the MLP classification layer (auxiliary branch) of the currently activated subnetwork, while the underlying shared encoder remains frozen. The optimizer is AdamW with a learning rate of 1e-5, training runs for a maximum of 50 iterations, and early stopping is triggered when the validation loss fails to decrease by at least 0.01 for 5 consecutive epochs. If any condition is violated (for example, the loss diverges or the iteration limit is exceeded), training is terminated and the system rolls back to the original model version.
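The gating and rollback logic of the incremental learning process can be sketched as follows. This is one possible reading, not the system's actual control code: it assumes one validation loss per iteration, treats early stopping with a net improvement as success, and rolls back on a non-finite loss, on reaching the 50-iteration limit without converging, or when no net improvement was achieved. The names `is_idle` and `supervise` are hypothetical.

```python
import math

CPU_IDLE_THRESHOLD = 0.20  # CPU load below 20 %
IDLE_SECONDS = 5.0         # no new image input for 5 seconds
MAX_ITERS = 50             # maximum number of iterations
PATIENCE = 5               # consecutive checks without sufficient decrease
MIN_DELTA = 0.01           # required decrease of the validation loss


def is_idle(cpu_load: float, seconds_since_last_frame: float) -> bool:
    """Idle condition that must hold before fine-tuning may start."""
    return cpu_load < CPU_IDLE_THRESHOLD and seconds_since_last_frame >= IDLE_SECONDS


def supervise(val_losses) -> str:
    """Return 'commit' or 'rollback' for a sequence of validation losses."""
    best = math.inf
    first = None
    stale = 0
    for step, loss in enumerate(val_losses, start=1):
        if not math.isfinite(loss):
            return "rollback"          # loss divergence
        if first is None:
            first = loss
        if best - loss >= MIN_DELTA:
            best, stale = loss, 0      # sufficient improvement
        else:
            stale += 1
        if stale >= PATIENCE:
            # early stop: keep the new weights only if they actually helped
            return "commit" if first - best >= MIN_DELTA else "rollback"
        if step >= MAX_ITERS:
            return "rollback"          # iteration limit reached, not converged
    return "rollback"                  # training ended without convergence
```

Keeping the decision in a pure function of the loss history makes the rollback rule easy to audit independently of the optimizer.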

[0034] A dual-port RAM shared buffer (capacity 8 MB) is provided between the physical constraint-driven preprocessing module and the domain-aware feature extraction module. The DMA controller triggers a write operation each time an image frame arrives. The buffer is accessed simultaneously, in a non-blocking manner, by the physical parameter calculation unit of the preprocessing module and the domain discrimination controller of the feature extraction module, keeping the context metadata transmission delay below 100 μs. The confidence calibration module and the defect classification and location output module implement an event-driven data flow through a GPIO interrupt triggering mechanism: the interrupt signal is raised only after the U_total calculation is complete and the decision mode (aggressive / conservative) has been determined, at which point the output module is allowed to read P_calib, which avoids invalid reads.
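The event-driven handshake between the confidence calibration module and the output module can be mimicked in software. The sketch below is only an analogy for the GPIO interrupt: a `threading.Event` plays the role of the interrupt line, and the class name `CalibrationMailbox` and its methods are hypothetical.

```python
import threading


class CalibrationMailbox:
    """Releases the calibrated map only once both preconditions hold."""

    def __init__(self):
        self._ready = threading.Event()  # stands in for the interrupt line
        self._lock = threading.Lock()
        self._p_calib = None
        self._mode = None

    def finish_u_total(self, p_calib):
        """Called when the uncertainty computation has produced P_calib."""
        with self._lock:
            self._p_calib = p_calib
            self._raise_if_complete()

    def set_mode(self, mode):
        """Called when the decision mode has been determined."""
        if mode not in ("aggressive", "conservative"):
            raise ValueError("unknown decision mode: %r" % (mode,))
        with self._lock:
            self._mode = mode
            self._raise_if_complete()

    def _raise_if_complete(self):
        # The "interrupt" fires only when both conditions are satisfied.
        if self._p_calib is not None and self._mode is not None:
            self._ready.set()

    def read(self, timeout=0.0):
        """Return (mode, p_calib), or None if the interrupt has not fired."""
        if not self._ready.wait(timeout):
            return None
        with self._lock:
            return self._mode, self._p_calib
```

A reader that polls before both conditions are met simply gets `None`, which corresponds to the invalid reads the interrupt mechanism is meant to prevent.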

[0035] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of the present invention is defined by the appended claims and their equivalents.

Claims

1. A deep learning-based automatic defect identification system for industrial endoscopes, applied to industrial endoscopes, characterized in that it comprises:
a multimodal image acquisition module, used to acquire the original visible-light image frame sequence and simultaneously acquire the attitude angular velocity and acceleration of the endoscope probe, the ambient illuminance value, and the lens focal length, and to package them into a time-stamped composite sensor tuple;
a physical constraint-driven preprocessing module, which constructs a projection transformation matrix from three-dimensional space to the two-dimensional image plane based on the attitude information and focal length parameters in the composite sensor tuple, performs geometric correction on the original image, performs inverse illumination compensation based on the ambient illuminance value, generates a normalized image, and appends the physical parameters used, as context metadata, to the end of the channel dimension of the image tensor;
a domain-aware guided feature extraction module, which includes at least three feature extraction sub-networks deployed in parallel and optimized for low-light blur, strong reflective interference, and complex texture background scenes, respectively, whose low-level convolutional kernel weights are shared while their high-level semantic branches are independent; the domain-aware guided feature extraction module is equipped with a domain discrimination controller, used to dynamically activate the corresponding scene-adaptive sub-network path based on the context metadata and to generate a domain consistency score signal; when the matching scores of all sub-network paths are below a preset safety threshold, an anomaly flag is triggered;
a confidence calibration module, which receives the intermediate feature maps, the identifier of the activated sub-network, and the domain consistency score signal from the feature extraction module; and
a defect classification and localization output module, which adopts a dual-branch structure: the main branch generates pixel-level defect masks based on a fully convolutional network, while the auxiliary branch extracts global pooling vectors from the high-level semantic feature maps to output defect type labels and quantitative parameters.

2. The system according to claim 1, characterized in that, in the defect classification and location output module, when the regions of the comprehensive uncertainty heatmap whose uncertainty value is higher than the convergence judgment condition exceed a preset proportion, the system activates the conservative decision mode: it marks only the regions whose confidence level is higher than the dynamic threshold, marks the remaining regions as pending review, caches the relevant data, and sends a warning event packet through the communication interface.

3. The system according to claim 1, characterized in that the physical constraint-driven preprocessing module constructs rotation and projection matrices based on the geometric optical model of endoscopic imaging, using the Euler angles obtained by integrating the attitude angular velocity together with the focal length parameter, and performs inverse geometric correction on the original image; based on Lambert's cosine law and a point light source attenuation model, inverse illumination compensation is performed according to the illuminance value, generating an image that is normalized in both viewpoint and brightness.

4. The system according to claim 1, characterized in that the domain discrimination controller in the domain-aware guided feature extraction module uses a preset rule mapping table to select sub-networks: when the ambient illuminance is ≤ 100 lux and the sum of the viewing angles is ≥ 0.5 rad, the low-light blur scene sub-network is activated; when the illuminance is ≥ 3000 lux and the local image gradient variance is > 0.15, the strong reflective interference scene sub-network is activated; when the image Laplacian variance is > 0.25 and the texture entropy is > 6.0 bits / pixel, the complex texture background scene sub-network is activated.

5. The system according to claim 1, characterized in that, in the confidence calibration module, a cognitive (epistemic) uncertainty assessment unit and an accidental (aleatoric) uncertainty assessment unit are constructed: the cognitive uncertainty assessment unit uses the prediction variance obtained from Monte Carlo Dropout sampling to quantify the model's lack of confidence in out-of-distribution samples, while the accidental uncertainty assessment unit assesses the uncertainty caused by observation noise based on the spatial gradient magnitude and local contrast of the feature map; the outputs of the two units are fused into a comprehensive uncertainty heatmap, which is then used to weight and correct the original defect probability map element by element to generate a calibrated defect confidence map.

6. The system according to claim 1, characterized in that the quantitative parameters output by the auxiliary branch of the defect classification and location output module include the defect area ratio, the maximum extension length, and a morphological complexity index; the maximum extension length is calculated by combining skeletonization with a longest-path algorithm and a pixel-to-physical scale conversion factor.

7. The system according to claim 1, characterized in that it further includes a secure element chip integrated at the hardware level, used to store the model encryption key, the device identity certificate, and the secure boot firmware hash value; when the system is powered on, the secure element chip performs an integrity check, verifying the digital signature of each software module; if the verification fails, loading of the deep learning model is prohibited and the system enters a security lockout state; all sensitive data is held in memory in encrypted form and is decrypted for use in computation only within a trusted execution environment.

8. The system according to claim 1, characterized in that the training phase of the multi-expert subnetwork includes the following: pre-training is performed on a large-scale synthetic dataset generated by a physically based rendering engine; after pre-training is completed, the sub-network weights are written to read-only storage.

9. The system according to claim 1, characterized in that it supports controlled online incremental learning: when the remote operation and maintenance terminal issues a manual annotation package whose signature has been verified, the system fine-tunes only the high-level classification layer of the currently active sub-network during the device's idle period, while the low-level shared feature extraction layer remains frozen; the incremental learning process is constrained by both the maximum number of iterations and the rate of loss reduction; if either condition is not met, training is terminated and the model is rolled back to the original version.

10. The system according to claim 1, characterized in that a dual-port RAM shared buffer is provided between the physical constraint-driven preprocessing module and the domain-aware feature extraction module; the buffer is written by the DMA controller when an image frame arrives, ensuring zero-latency transmission of the context metadata; the confidence calibration module and the defect classification and location output module achieve an event-driven data flow through an interrupt triggering mechanism: when the comprehensive uncertainty assessment is completed and the decision mode is determined, the output module reads the calibrated confidence map.
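For illustration only, the preset rule mapping table recited in claim 4 can be sketched as a small lookup function. The thresholds are those stated in the claim; the function name, the sub-network labels, the evaluation order when several rules match, and the use of `None` to represent the case where no rule matches are all assumptions.

```python
from typing import Optional


def select_subnetwork(
    lux: float,
    view_angle_sum_rad: float,
    gradient_variance: float,
    laplacian_variance: float,
    texture_entropy_bits: float,
) -> Optional[str]:
    """Return the label of the sub-network to activate, or None if no rule fires.

    Rules are checked in the order they appear in the claim; the claim
    itself does not state a priority, so this ordering is an assumption.
    """
    rules = [
        ("low_light_blur", lux <= 100 and view_angle_sum_rad >= 0.5),
        ("strong_reflection", lux >= 3000 and gradient_variance > 0.15),
        ("complex_texture", laplacian_variance > 0.25 and texture_entropy_bits > 6.0),
    ]
    for label, matched in rules:
        if matched:
            return label
    return None  # caller may treat this as a candidate for the anomaly flag
```

For example, `select_subnetwork(50, 0.6, 0.0, 0.0, 0.0)` selects the low-light blur path, while mid-range illuminance with a plain texture matches no rule and returns `None`.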