Cloth defect detection method and system based on direction perception and fine-grained enhancement

The DFG-DETR method, a fabric defect detection method with orientation-aware fine-grained enhancement, solves the problem that fabric defect detection models have difficulty identifying subtle defects in complex texture backgrounds, and achieves high-precision, low-latency fabric defect detection.

CN122222902A · Pending · Publication Date: 2026-06-16 · ZHEJIANG SCI-TECH UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: ZHEJIANG SCI-TECH UNIV
Filing Date: 2026-02-02
Publication Date: 2026-06-16

IPC: G06T7/00; G06V10/774; G06V10/80; G06V10/764; G06V10/82; G06V10/77; G06N3/045; G06N3/0464

Abstract

The application discloses a cloth defect detection method and system based on direction perception and fine-grained enhancement, comprising the following steps: step 1, obtaining cloth images to form a cloth defect dataset, and dividing and preprocessing the dataset; step 2, inputting the preprocessed image into a backbone network for multi-level feature extraction, wherein the backbone network uses a direction-aware detail-preserving downsampling module and an adaptive feature recalibration module is embedded in the HGBlock; step 3, feeding the last-layer features of the multi-level features output in step 2 into an encoder constructed from a dynamic sparse selective attention module for context feature enhancement; step 4, jointly inputting the high-level features enhanced in step 3 and the remaining layer features output in step 2 into a path aggregation feature pyramid network for multi-scale fusion, and outputting a fused feature representation; and step 5, inputting the fused features into a D-Fine detection head, and outputting the defect class, position, and confidence predictions.

Description

Technical Field

[0001] This invention belongs to the field of fabric defect detection technology, specifically relating to a fabric defect detection method and system based on direction perception and fine-grained enhancement.

Background Technology

[0002] Automated fabric defect detection is a crucial step in achieving high-quality production in the textile industry. Early methods relied on manual visual inspection or hand-designed image features, which generally suffered from low efficiency, high subjectivity, and poor generalization ability. With the development of deep learning, convolutional neural network detectors, represented by Faster R-CNN and the YOLO series, have been widely used in this task, significantly improving the level of automation. However, these methods mainly rely on local convolution operations, making it difficult to effectively distinguish weak anomalies from normal patterns against a background of strongly periodic fabric textures. Furthermore, the standard downsampling process easily destroys the geometric integrity of fine-grained linear defects, thus limiting detection accuracy.

[0003] To overcome the limitations of local modeling, Detection Transformer-type models introduce long-range context modeling capabilities through a global self-attention mechanism, and their application in industrial defect detection has been gradually explored. Addressing their high computational complexity and slow convergence, researchers have proposed various lightweight and acceleration strategies, promoting the application of end-to-end Transformer detectors in real-time scenarios. Among them, the DEIM model, with its efficient matching mechanism and compact network structure, has demonstrated excellent overall performance on multiple industrial visual inspection benchmarks and has become a representative baseline for current real-time defect detection.

[0004] However, DEIM is still designed for general object detection tasks and does not fully consider the uniqueness of orientation-sensitive structures in fabric images. Its feature extraction process lacks explicit modeling of prior information on horizontal and vertical orthogonal structures, making it difficult to effectively enhance the feature representation of regions with orientation perturbations. Furthermore, in complex texture backgrounds, the model's response to low-contrast, weak local perturbations is limited. In addition, its global attention mechanism has high computational complexity, which may create inference efficiency bottlenecks in resource-constrained industrial online inspection scenarios. Therefore, there is an urgent need in this field for an efficient detection technology tailored to fabric defect detection tasks.

Summary of the Invention

[0005] Existing fabric defect detection models struggle to robustly detect subtle, multi-scale defects against complex textured backgrounds. To address this technical challenge, this invention provides a fabric defect detection method and system based on orientation awareness and fine-grained enhancement. Specifically, fabric defects often manifest as perturbation regions whose local structural responses are inconsistent in the horizontal and vertical directions, and they are hard to distinguish from the background because of an extremely low signal-to-noise ratio. Existing methods easily misclassify subtle defects as normal texture variations. Furthermore, general detection models lack explicit modeling of fabric orientation priors, making it difficult to effectively enhance the feature responses in low-contrast regions. To solve these problems, this invention provides DFG-DETR, a high-precision, low-latency, orientation-aware, fine-grained fabric defect detection method. The method simultaneously preserves background texture and enhances perturbation responses through orientation-sensitive dual-path downsampling, and integrates adaptive feature recalibration and dynamic sparse attention mechanisms to achieve robust detection of fabric defects.

[0006] The present invention adopts the following technical solution:

[0007] A fabric defect detection method based on orientation perception and fine-grained enhancement includes the following steps:

[0008] Step 1: Obtain fabric images to form a fabric defect dataset, and then divide and preprocess the fabric defect dataset.

[0009] Step 2: Input the preprocessed image into the backbone network for multi-level feature extraction. The backbone network uses a direction-aware detail-preserving downsampling module to replace the standard downsampling layer, and embeds an adaptive feature recalibration module in HGBlock to extract highly discriminative fine-grained feature representations.

[0010] Step 3: The last layer of features output from Step 2 is fed into the encoder constructed by the Dynamic Sparse Selective Attention (DSSA) module for contextual feature enhancement.

[0011] Step 4: Input the enhanced high-level features from Step 3 and the remaining layer features output from Step 2 into the path aggregation feature pyramid network for multi-scale fusion and output the fused feature representation.

[0012] Step 5, Defect Detection: Input the fused features into the D-Fine detection head and output the defect category, location, and confidence prediction results.

[0013] Preferably, step 1, data acquisition, dataset partitioning, and data preprocessing, is as follows:

[0014] Data acquisition: The fabrics on the production line are photographed using a camera to obtain initial fabric images, which are then uniformly adjusted to the same size. The defects on these fabric images are then labeled and filtered, and only images with defects are retained. The images are then cropped at an appropriate resolution to prevent defects from being lost after scaling, and finally, the required fabric defect dataset is obtained.

[0015] Dataset partitioning: The fabric defect dataset is divided into training set, validation set, and test set according to a preset ratio.

[0016] Data preprocessing: The input image and its corresponding bounding box annotations undergo joint augmentation. For training, Mosaic augmentation is enabled with a certain probability, stitching four random images into one synthetic image. Random cropping based on intersection over union (IoU), photometric distortion, random horizontal flipping, and image resizing are then applied sequentially. The bounding box coordinates are updated synchronously with every geometric transformation. Invalid annotations are removed, image pixel values are normalized, and the bounding box format is converted to a normalized representation of center point coordinates plus width and height. For testing and evaluation, to speed up recognition, preprocessing consists only of resizing the images to a resolution of 640×640: the longest side of the image is scaled to 640, the shorter side is scaled proportionally, and the remaining area is padded.
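The test-time resizing and the box format conversion described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the corner-format input box `(x1, y1, x2, y2)` is an assumption, and the sketch only computes the padding amounts; where the padding is placed (bottom/right or centered) is not stated in the text.

```python
def letterbox_params(width, height, target=640):
    """Scale the longest side to `target` and report the padding still needed.

    Returns (scale, new_width, new_height, pad_width, pad_height).
    """
    scale = target / max(width, height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    # Whatever is left on the shorter side must be filled with padding.
    return scale, new_w, new_h, target - new_w, target - new_h


def xyxy_to_normalized_cxcywh(box, img_w, img_h):
    """Convert an absolute (x1, y1, x2, y2) box to normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h


if __name__ == "__main__":
    # A 2048x1696 crop scaled to fit inside a 640x640 canvas.
    print(letterbox_params(2048, 1696))
    print(xyxy_to_normalized_cxcywh((160, 320, 480, 640), 640, 640))
```

Keeping the aspect ratio and padding the remainder avoids stretching thin linear defects, which is the concern the paragraph raises about scaling.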

[0017] Preferably, step 2, feature extraction, is as follows: The preprocessed image is input into the HGNet backbone network for multi-level feature extraction. A direction-aware detail-preserving downsampling (DirDown) module, whose architecture is shown in Figure 2(d), explicitly models the structural response in the horizontal and vertical orthogonal directions to preserve the orientation-sensitive details of the orientation perturbation region. Then, the HGBlock module with the embedded Adaptive Feature Recalibration (AFR) module fuses local context and global amplitude modulation to enhance the feature discriminability of the orientation perturbation region. Finally, a highly discriminative feature representation with fine-grained resolution is output.

[0018] Preferably, in step 2, the orientation-aware detail-preserving downsampling module extracts orientation priors from the input feature map to generate orientation prior features, and generates a spatial attention map based on the input feature map to modulate the orientation prior features. Finally, dual-path downsampling is performed, and the two downsampled features are concatenated in the channel dimension and then fused and shuffled in the grouped channels to output the downsampling result.

[0019] Preferably, in step 2, the directional prior feature extraction includes performing horizontal average pooling and vertical average pooling on the input feature map respectively, and then summing the results.

[0020] Preferably, in step 2, the spatial attention map is generated by calculating the mean value of the input feature map along the channel dimension to obtain a single-channel grayscale image, and then applying a lightweight convolution and the Sigmoid activation function to that image.

[0021] Preferably, in step 2, in the dual-path downsampling structure, one path uses depthwise convolution for downsampling, and the other path uses max pooling for downsampling.

[0022] Preferably, in step 2, the adaptive feature recalibration module generates a spatial prior feature map based on the input features and performs weighting, predicts a global control factor through global average pooling and continuous convolution, maps the input features using 1×1 convolution and modulates the amplitude using the global control factor, and finally generates local context-aware weights based on depthwise separable convolution, and reweights the modulated features to output enhanced features.

[0023] Preferably, in step 2, the global control factor includes a global gain factor and a global gating factor. The global gain factor is used to alleviate the imbalance in optimization between samples caused by the significant difference in defects, and the global gating factor is used to dynamically control the enhancement strength of the local context-aware weights.

[0024] Preferably, in step 2, the channel mapping is achieved through 1×1 convolution, and the amplitude modulation is accomplished by multiplying the mapping features with the global gain factor.

[0025] Preferably, in step 2, the local context-aware weights are generated by depthwise separable convolutions and modulated by the global gating factor to perform non-inhibitory reweighting on the feature map.

[0026] Preferably, step 2 specifically includes the following steps:

[0027] Step 2.1: Generate a low-dimensional feature map from the input image through the initial convolutional layer of HGNet; this operation provides the basic representation for subsequent multi-scale feature extraction.

[0028] Step 2.2 uses the DirDown module to replace the standard downsampling layer for explicit modeling of directional perturbation regions. "Directional perturbation regions" refer to spatial areas in an image or feature map where the local response amplitudes in the horizontal and vertical directions differ significantly, typically including real defects and background patterns / textures. This step is detailed below:

[0029] Step 2.2.1: Perform horizontal average pooling with a kernel size of 1×3 and vertical average pooling with a kernel size of 3×1 on the input feature map, and add the two together to obtain the directional prior features. The differences in their internal responses constitute the directional perturbation region.

[0030] Step 2.2.2: Calculate the mean along the channel dimension of the input feature map to generate a single-channel grayscale image. Obtain a spatial attention map through lightweight convolution and Sigmoid activation, and use it to enhance the directional prior features to form attention-modulated directional features.

[0031] Step 2.2.3: Construct a dual-path downsampling structure: one path uses depthwise convolution for detail-preserving downsampling, and the other path uses max pooling for downsampling of strong response regions;

[0032] Step 2.2.4: After concatenating the two outputs along the channel dimension, group the concatenated features along the channel dimension and apply a 1×1 convolution independently to each group to achieve intra-group compression and interaction. Inter-group information fusion is promoted through channel shuffling, so that the output downsampled features contain both spatial detail information and local strong response features.

[0033] This step explicitly models the warp and weft structure of the fabric while reducing the resolution, effectively preventing directional perturbation regions from breaking or becoming blurred during downsampling.
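The directional prior of step 2.2.1 (a 1×3 horizontal average pooling plus a 3×1 vertical average pooling, summed) can be sketched on a single-channel feature map in plain Python. This is an illustration only: stride 1 is assumed, and at the borders the average is taken over the in-range elements, neither of which is specified in the text. The function names are hypothetical.

```python
def avg_pool_1d_same(values, kernel=3):
    """Stride-1 average pooling that keeps the length unchanged.

    Border windows are averaged over in-range elements only (an assumption).
    """
    half = kernel // 2
    n = len(values)
    pooled = []
    for i in range(n):
        window = values[max(0, i - half):min(n, i + half + 1)]
        pooled.append(sum(window) / len(window))
    return pooled


def directional_prior(feature):
    """Sum of horizontal (1x3) and vertical (3x1) average pooling.

    `feature` is a 2D list of shape H x W holding one channel.
    """
    horizontal = [avg_pool_1d_same(row) for row in feature]
    columns = [list(col) for col in zip(*feature)]
    vertical_cols = [avg_pool_1d_same(col) for col in columns]
    vertical = [list(row) for row in zip(*vertical_cols)]
    return [
        [h + v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


if __name__ == "__main__":
    # A thin vertical line: vertical pooling keeps it intact,
    # horizontal pooling spreads it to the neighbouring columns.
    line = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
    for row in directional_prior(line):
        print(row)
```

On the vertical-line input the two pooling directions respond differently, which is the kind of horizontal/vertical inconsistency the text calls a directional perturbation region.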

[0034] Step 2.3 involves embedding an AFR (Adaptive Feature Recalibration Module) into each HGBlock, specifically including the following steps:

[0035] Step 2.3.1: Perform the same operation as step 2.2.2 to obtain the spatial prior feature map;

[0036] Step 2.3.2: Perform dynamic scaling prediction on the spatial prior feature map, predicting the global gain factor and the global gating factor through global average pooling and successive convolution operations. Both factors are scalars. The global gain factor is used to mitigate the imbalance in inter-sample optimization caused by significant differences between defects, and the global gating factor is used to suppress local false responses against complex texture backgrounds;

[0037] Step 2.3.3: Map the enhanced feature map using a 1×1 convolution, then broadcast the global gain factor obtained in step 2.3.2 to all channels and perform uniform amplitude modulation on the mapped features;

[0038] Step 2.3.4: Implement a local context-aware channel attention mechanism using depthwise separable convolution. The generated weights are modulated by the global gating factor and then used to reweight the feature map, completing adaptive feature recalibration.

[0039] This mechanism can dynamically enhance the feature response of areas with minor defects while suppressing interference from complex background textures.
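The roles of the two global scalars in steps 2.3.3 and 2.3.4 can be illustrated with a small Python sketch. The exact recalibration formula is not reproduced in this text, so the form used here, `gain * x * (1 + gate * w)`, is an assumption chosen only because it never scales a response below its gain-modulated value when `gate` and `w` are non-negative, matching the "non-inhibitory reweighting" wording. The function name and the flat-list representation are hypothetical.

```python
def afr_modulate(mapped, local_weights, gain, gate):
    """Apply global gain, then gated local reweighting (assumed form).

    mapped:        flat list of feature values after the 1x1 mapping
    local_weights: flat list of local context weights, assumed to lie in [0, 1]
    gain:          global gain factor (scalar, broadcast to every value)
    gate:          global gating factor (scalar, controls enhancement strength)
    """
    recalibrated = []
    for x, w in zip(mapped, local_weights):
        modulated = gain * x                 # uniform amplitude modulation
        recalibrated.append(modulated * (1.0 + gate * w))  # assumed residual reweighting
    return recalibrated


if __name__ == "__main__":
    features = [1.0, 2.0]
    weights = [0.0, 0.5]
    print(afr_modulate(features, weights, gain=2.0, gate=1.0))
    # With the gate closed, only the global gain remains.
    print(afr_modulate(features, weights, gain=2.0, gate=0.0))
```

Under this assumed form, setting the gating factor to zero switches off the local enhancement entirely, which is one way a scalar gate could suppress false local responses.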

[0040] Step 2.4: Output multi-scale high-discriminative feature maps step by step. These features have both orientation sensitivity and anomaly detection capabilities, which can be used for subsequent encoding and detection.

[0041] Preferably, step 3, context enhancement, is as follows: The last layer of features in the multi-level high discriminative features output in step 2 is fed into the encoder constructed by the Dynamic Sparse Selective Attention (DSSA) module. This module processes multiple directional sensitive features in parallel and retains the original high-frequency detail paths, thereby achieving long-range context modeling of directional perturbation regions with linear complexity.

[0042] Preferably, in step 3, the dynamic sparse selective attention module divides the input features into multiple groups of sub-features along the channel dimension, performs multi-scale context extraction, direct transmission, or high-frequency detail retention on different sub-features, and sends the first three processing results into the linear attention unit, splices all the path outputs, and performs cross-group feature interaction and response sparsification through the adaptive feature recalibration module.

[0043] Preferably, in step 3, the multi-scale context extraction includes processing the two sets of sub-features using 3×3 and 7×7 depthwise separable convolutions respectively, and high-frequency detail preservation is achieved by dynamically scaling the fourth set of sub-features by applying learnable weights.

[0044] Preferably, in step 3, the specific processing procedure of the DSSA module includes the following steps:

[0045] Step 3.1: Divide the input features into four groups along the channel dimension, used for local modeling under different receptive fields and for original information preservation: two groups extract multi-scale context through 3×3 and 7×7 depthwise separable convolutions respectively, one group is used directly for efficient attention calculation, and the last group serves as the original path to preserve high-frequency details.

[0046] Step 3.2: The first three groups of features are fed into an efficient linear attention unit based on random feature mapping. As shown in Figure 4, the efficient attention in this model is a linear attention mechanism: it achieves long-range context modeling with linear complexity by restructuring the normalization of the query and key and the order of matrix multiplication. Meanwhile, the fourth group serves as the original high-frequency detail path and undergoes no attention transformation. Instead, it is dynamically scaled by a learnable scalar weight activated by Sigmoid to adaptively retain or suppress the original detail components. This prevents the loss of minor defect information due to excessive smoothing during multi-path fusion. Finally, the four outputs are concatenated along the channel dimension to generate an enhanced feature representation.

[0047] Step 3.3: The four-way concatenated enhanced features are processed through another instantiated AFR module for cross-group feature interaction and response sparsification, outputting the final context-enhanced high-level features.
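The four-way channel split of step 3.1 and the Sigmoid-gated detail path of step 3.2 can be sketched as follows. This is a simplified illustration: the convolution and attention branches are omitted, each channel is represented as a flat list of values, equal group sizes are assumed, and the function names are hypothetical.

```python
import math


def split_channels(channels, groups=4):
    """Split a list of channels into `groups` equally sized consecutive groups."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(channels) // groups
    return [channels[i * size:(i + 1) * size] for i in range(groups)]


def gate_detail_path(group, raw_weight):
    """Scale the high-frequency detail group by sigmoid(raw_weight).

    `raw_weight` stands in for the learnable scalar; the gate lies in (0, 1).
    """
    gate = 1.0 / (1.0 + math.exp(-raw_weight))
    return [[gate * value for value in channel] for channel in group]


if __name__ == "__main__":
    # Eight toy channels, two values each.
    channels = [[float(c), float(c) + 0.5] for c in range(8)]
    small_ctx, large_ctx, direct, detail = split_channels(channels)
    print(gate_detail_path(detail, raw_weight=0.0))  # gate = 0.5
```

Because the detail group bypasses attention and is only rescaled, its fine structure is passed to the concatenation step without being smoothed.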

[0048] Preferably, step 4, multi-scale fusion, is as follows: the enhanced high-level features and the other level features output in step 2 are input into the path aggregation feature pyramid network, and multi-scale fusion is performed through bidirectional paths from top to bottom and from bottom to top, finally outputting a fusion feature representation that is semantically rich and has strong discriminative power against directional perturbation regions.

[0049] This invention also discloses a fabric defect detection system based on direction perception and fine-grained enhancement, used to perform the above method, comprising the following modules:

[0050] Dataset partitioning and preprocessing module: used to acquire fabric images, form a fabric defect dataset, and partition and preprocess the fabric defect dataset;

[0051] Multi-level feature extraction module: used to input the preprocessed image into the backbone network for multi-level feature extraction; the backbone network adopts a direction-aware detail-preserving downsampling module and embeds an adaptive feature recalibration module in HGBlock;

[0052] Contextual Feature Enhancement Module: This module takes the last layer of features from the multi-level feature extraction module and feeds it into the encoder constructed by the dynamic sparse selective attention module to enhance the contextual features.

[0053] The fusion module is used to input the high-level features enhanced by the context feature enhancement module and the remaining features output by the multi-level feature extraction module into the path aggregation feature pyramid network for multi-scale fusion and output fused feature representation.

[0054] The results output module is used to input the fused features into the D-Fine detection head and output the defect category, location, and confidence prediction results.

[0055] Compared with existing technologies, this invention provides a high-precision, low-latency direction-aware fine-grained fabric defect detection method and system, DFG-DETR. It achieves robust detection of fabric defects by simultaneously preserving background texture and enhancing perturbation response through direction-sensitive dual-path downsampling, and by integrating adaptive feature recalibration and a dynamic sparse attention mechanism. Furthermore, its preferred scheme has the following advantages:

[0056] 1. This invention designs a DirDown module, which combines horizontal and vertical pooling with spatial attention modulation and constructs a dual-path downsampling structure. While reducing the resolution, it simultaneously preserves the original fabric background texture and the enhanced directional perturbation response. This not only avoids the weakening of local perturbation signals, but also provides a complete background reference for subsequent feature recalibration and context modeling. This enables the model to distinguish between real defects and normal texture changes based on local-global consistency, thereby generating a cleaner defect feature response map.

[0057] 2. This invention embeds a lightweight AFR module into the HGNet backbone network and encoder. By fusing local context consistency awareness and global amplitude modulation, it adaptively enhances the response of defect features and suppresses the interference of normal fabric texture background across the entire image, significantly improving the ability to distinguish low contrast and weak local perturbations.

[0058] 3. This invention constructs a DSSA encoding structure, which integrates multi-scale local context with gated original high-frequency pathways, and reuses an adaptive recalibration mechanism for cross-group feature interaction. While enhancing high-level semantic expression, it effectively controls model complexity, providing a high-precision and high-efficiency solution for industrial online detection scenarios.

Description of the Drawings

[0059] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0060] Figure 1 is a flowchart of a fabric defect detection method based on direction perception and fine-grained enhancement according to a preferred embodiment of the present invention.

[0061] Figure 2 is a schematic diagram of the network structure (i.e., the backbone network based on HGNet) proposed in a preferred embodiment of the present invention; the structure of the DirDown module is shown in Figure 2(d).

[0062] Figure 3 is a schematic diagram of the AFR module structure proposed in a preferred embodiment of the present invention.

[0063] Figure 4 is a schematic diagram of the DSSA module structure proposed in a preferred embodiment of the present invention.

[0064] Figure 5 is a comparison chart of the per-category detection accuracy of the present invention on the fabric dataset.

[0065] Figure 6 is a comparison chart of the detection results of D-Fine, DEIM, and the present invention on a fabric dataset.

[0066] Figure 7 is a comparison chart of the MAL loss of the present invention on a fabric dataset.

[0067] Figure 8 is a block diagram of a fabric defect detection system based on direction perception and fine-grained enhancement, according to a preferred embodiment of the present invention.

Detailed Implementation

[0068] To enable those skilled in the art to better understand the present invention, a detailed description will be provided below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are merely some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0069] As shown in Figure 1, this embodiment of a fabric defect detection method based on direction perception and fine-grained enhancement includes the following steps:

[0070] Step 1 is as follows:

[0071] Collect fabric defect image samples. This embodiment uses the fabric defect dataset released by Alibaba Cloud Tianchi platform. This dataset contains a variety of common industrial fabric defect types and has high annotation quality.

[0072] The dataset undergoes preliminary preprocessing and partitioning. First, the original annotations are uniformly converted to COCO format; then, they are divided into training, validation, and test sets in an 8:1:1 ratio. Considering the original image resolution of 4096×1696, with a severely unbalanced aspect ratio, directly scaling it to the model input size of 640×640 would lead to excessive distortion of linear defects. To address this issue, in this embodiment, the original image is first cropped in half along its width to 2048×1696, and then resampled using a 640×640 sliding window to reconstruct the training samples. This strategy maintains a consistent input size while mitigating signal attenuation caused by multiple downsampling of small-scale defects and is compatible with image stitching-based Mosaic data augmentation methods.
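The 8:1:1 partition and the 640×640 sliding-window resampling of a 2048×1696 crop can be sketched in Python as follows. The text does not give the window stride or how the image border is handled, so the non-overlapping stride of 640 and the extra window clamped to the border are assumptions, as are the function names and the fixed shuffle seed.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle and split items into train/validation/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def window_origins(length, window=640, stride=640):
    """Start offsets of sliding windows along one axis.

    A final window is clamped to the border so the whole axis is covered
    (an assumption; it overlaps the previous window).
    """
    if length <= window:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] != length - window:
        origins.append(length - window)
    return origins


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
    xs = window_origins(2048)
    ys = window_origins(1696)
    print(xs, ys, len(xs) * len(ys))
```

Cropping at native resolution keeps small defects at their original pixel scale instead of shrinking them by the roughly 3× factor that direct resizing of a 2048-wide crop to 640 would impose.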

[0073] End-to-end data preprocessing of the dataset images is performed, specifically including: first, with a certain probability, a Mosaic augmentation strategy stitches four images together and applies geometric perturbation combining rotation, translation, and scaling, while synchronously updating the corresponding bounding box annotations; then, photometric distortion, random background expansion, and random cropping based on intersection over union (IoU) are applied sequentially, and invalid annotations that are too small or out of bounds are removed after each geometric transformation; next, random horizontal flipping is performed, the images are uniformly resized to an input size of 640×640, and invalid bounding boxes are cleaned up again; finally, pixel values are normalized to floating-point type, and bounding boxes are converted to a normalized format of center point coordinates plus width and height to meet the input requirements of the detection model.

[0074] Step 2: Input the preprocessed image into the backbone network for feature extraction. In this embodiment, HGNet is selected as the basic backbone network. HGNet is lightweight and achieves efficient feature fusion by concatenating the different semantic features output by multiple convolution levels, but it has limitations in handling subtle defects. Specifically, this indiscriminate feature aggregation may submerge some key detail signals, making them difficult to identify in subsequent operations. In addition, conventional downsampling usually relies on ordinary convolution, which over-smooths the output features of the previous scale and widens the gap between features of different intensities. As a result, the model tends to identify salient features that differ greatly from the background while ignoring weak signals and small targets, which reduces detection accuracy.

[0075] To overcome the problems of current convolutional downsampling methods, this embodiment designs a DirDown module. As shown in Figure 2, this module first captures directional structural information by applying one-dimensional average pooling to the input feature map along the horizontal and vertical directions. Then, the mean of the input feature map is calculated along the channel dimension to generate a single-channel grayscale image, which is passed through a lightweight convolution and Sigmoid activation to obtain a spatial attention map. This attention map dynamically enhances the directional features, as follows:

[0076] Here, the product is element-wise (with spatial dimension broadcasting). This non-suppressive weighting design ensures that the response at every location in the directional features is preserved or enhanced, avoiding information loss due to suppression operations.

[0077] To retain more detailed information, the module constructs a dual-path structure: one path further refines the features through depthwise convolution, while the other path enhances salient responses through max pooling. Finally, the outputs of these two paths are concatenated, and the final downsampled feature map is obtained through grouped channel fusion (CGF).

[0078] The grouped channel fusion operation divides the concatenated features into several groups along the channel dimension, applies a 1×1 convolution to each group independently to achieve intra-group compression and interaction, and promotes inter-group information fusion through channel shuffling.
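The channel shuffling used after the per-group 1×1 convolutions can be illustrated with a short Python sketch. The text does not specify the shuffle pattern; the interleaving below (view the channels as a groups × channels-per-group grid, transpose, and flatten) is the common ShuffleNet-style pattern and is an assumption here, as is the function name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each output block mixes all groups.

    Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    # Channels 0-2 could come from the depthwise path, 3-5 from the max-pooling path.
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
```

After the shuffle, neighbouring channels originate from different groups, so a later grouped operation sees information from both downsampling paths.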

[0079] This design not only maintains the geometric integrity of orthogonal defect information that is common in fabrics, but also improves discriminability, which helps to enhance the detection capability of small and low-contrast defects.

[0080] To further enhance the model's sensitivity to weak anomalies, this embodiment introduces an AFR module into the HGBlock of HGNet. As shown in Figure 3, this module first generates an attention map from a channel-compressed spatial prior to highlight abnormal regions, yielding the weighted features. Channel mapping is then applied to these features, and a minimal dynamic gating mechanism predicts the global gain factor and the global gating factor. Furthermore, by combining local context-aware recalibration (LCR) based on depthwise separable convolution, joint modulation of the channel and spatial dimensions is achieved:

[0081]

[0082] This mechanism can selectively amplify subtle defect features while effectively suppressing background interference caused by periodic fabric textures, thereby enhancing the model's fine-grained recognition capabilities in complex scenes.

[0083] Step 3: In order to fully explore the discriminative information of fabric defects at different scales and semantic levels, this embodiment designs a lightweight and efficient hybrid feature encoding and fusion mechanism that organically unifies global context modeling capabilities with multi-scale feature interaction.

[0084] Specifically, based on the multi-level feature maps output by the backbone network, this embodiment constructs a DSSA encoder to enhance the semantic expressive power of high-level features. As shown in Figure 4, the encoder first divides the input features into four groups along the channel dimension: two groups extract local structural information with different receptive fields through 3×3 and 7×7 depthwise separable convolutions, respectively; the third group is kept in its original form; and the fourth group serves as a high-frequency detail path.

[0085] Subsequently, the first three groups of features are input into an efficient linear attention unit. This unit applies Softmax normalization to the key along the spatial dimension and to the query along the channel dimension, and reorders the matrix multiplications, achieving long-range context modeling with linear complexity. The fourth group is dynamically scaled by a sigmoid-activated learnable scalar weight to preserve high-frequency components sensitive to minor defects. The output of the efficient linear attention unit is calculated as follows:

[0086] Here, the inputs are a query matrix, a key matrix, and a value matrix, and the size parameter in their definition denotes the number of spatial locations in the feature map.
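A minimal sketch of such a linear attention unit is given below, assuming the commonly used form ρ_q(Q)(ρ_k(K)ᵀV), where ρ_q is a Softmax over channels and ρ_k is a Softmax over spatial locations. This form matches the description above but is an assumption, since the formula itself is not reproduced in this text. The matrices are plain nested lists with one row per spatial location, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def linear_attention(q, k, v):
    """q, k: N x d_k, v: N x d_v, where N is the number of spatial locations.
    Computing K^T V first costs O(N * d_k * d_v) instead of O(N^2)."""
    q_norm = [softmax(row) for row in q]                        # over channels, per location
    k_norm = transpose([softmax(col) for col in transpose(k)])  # over locations, per channel
    context = matmul(transpose(k_norm), v)                      # d_k x d_v global context
    return matmul(q_norm, context)                              # N x d_v

def quadratic_attention(q, k, v):
    """Same normalization, but with the N x N attention map formed explicitly."""
    q_norm = [softmax(row) for row in q]
    k_norm = transpose([softmax(col) for col in transpose(k)])
    return matmul(matmul(q_norm, transpose(k_norm)), v)

q = [[0.1, 0.4], [0.3, -0.2], [1.0, 0.0]]
k = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = linear_attention(q, k, v)
```

The two functions return the same values by associativity of matrix multiplication; only the order of operations, and therefore the cost, differs.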

[0087] The four groups of features are then concatenated and jointly modulated by the AFR module to achieve synergistic enhancement of the spatial and channel dimensions:

[0088] This design significantly reduces computational complexity while effectively balancing local details, directional structure, and global semantics, making it particularly suitable for characterizing small, low-contrast, and highly directional defects in fabrics.

[0089] Step 4: Input the enhanced high-level features from Step 3, together with the mid- and low-level features output from the backbone network, into the Path Aggregation Feature Pyramid Network (PAFPN), whose architecture is shown in Figure 2(b) as the structure connected by the upward and downward arrows. The PAFPN performs multi-scale fusion and outputs a fused feature representation. It employs a bidirectional cross-scale connection structure: on the one hand, it progressively transmits high-level semantic information to the shallow layers via a top-down path, improving the detection of small targets; on the other hand, it feeds high-resolution details back to the deep layers via a bottom-up path, improving the localization accuracy of large target boundaries. Through this fusion mechanism, the features received by the detection head at each scale carry rich semantic discriminative power while retaining fine spatial structural information, thereby significantly improving the model's overall detection performance for multi-scale fabric defects.
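The bidirectional data flow can be sketched with a toy example. This is a heavily simplified illustration, not the network of this embodiment: it assumes single-channel maps, nearest-neighbour upsampling, 2×2 average downsampling, and fusion by addition in place of the learned convolutional fusion blocks a real PAFPN would use. All names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 in both directions."""
    return [[v for v in row for _ in range(2)] for row in fmap for _ in range(2)]

def downsample2x(fmap):
    """2x2 average pooling with stride 2 (height and width must be even)."""
    return [[(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def pafpn(levels):
    """levels: feature maps ordered from shallow (high resolution) to deep (low resolution)."""
    # Top-down path: semantic information flows from deep to shallow levels.
    top_down = [None] * len(levels)
    top_down[-1] = levels[-1]
    for i in range(len(levels) - 2, -1, -1):
        top_down[i] = add(levels[i], upsample2x(top_down[i + 1]))
    # Bottom-up path: high-resolution detail flows from shallow to deep levels.
    outputs = [top_down[0]]
    for i in range(1, len(levels)):
        outputs.append(add(top_down[i], downsample2x(outputs[-1])))
    return outputs

levels = [[[1.0] * 4 for _ in range(4)], [[2.0] * 2 for _ in range(2)], [[4.0]]]
fused = pafpn(levels)
```

Every output level keeps its input resolution, and each one receives contributions from all other levels after the two passes.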

[0090] Step 5: Input the multi-scale fused features output from Step 4 into the D-Fine detection head (its architecture is shown in Figure 2(c)) to perform end-to-end defect prediction, outputting the defect category, location, and confidence prediction results. The D-Fine detection head contains a multi-layer decoder structure, which adopts a localization distribution prediction mechanism that models the bounding box coordinates of each defect target as a probability distribution to capture localization uncertainty. In addition, this detection head introduces an inter-layer self-distillation strategy, using the high-confidence localization distribution output by the last decoder layer as a soft target to supervise the predictions of the preceding intermediate decoder layers. This allows the shallow decoder layers to inherit deep semantic information during training, thereby improving early discrimination of weak defects. Finally, based solely on the output of the last decoder layer, a one-to-one assignment with the ground-truth annotations is performed using the Hungarian matching algorithm, and the distribution regression loss and classification loss are jointly optimized to achieve high-precision end-to-end detection without non-maximum suppression. After the current training round is completed, the results on the evaluation set are compared with those of the best weights so far. If the current weights perform better, they are retained as the weights of the optimal model. Finally, the number of completed training epochs is checked; if the maximum number of training epochs has been reached, training is stopped.
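Two ideas from this step, decoding a box edge from a predicted distribution and one-to-one assignment, can be illustrated with a short sketch. It rests on several assumptions: the bin values are hypothetical, the decoding is a plain Softmax expectation rather than the specific scheme of the D-Fine head, and the assignment is found by exhaustive search as a stand-in that returns the same minimum-cost matching as the Hungarian algorithm on small inputs.

```python
import math
from itertools import permutations

def expected_offset(logits, bin_values):
    """Decode one box edge: Softmax over discrete bins, then the expected bin value."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sum(e / total * b for e, b in zip(exps, bin_values))

def one_to_one_match(cost):
    """Minimum-cost one-to-one assignment by exhaustive search.
    cost[i][j]: cost of assigning prediction i to ground truth j.
    Requires at least as many predictions as ground truths; small inputs only."""
    n_pred, n_gt = len(cost), len(cost[0])
    best = min(permutations(range(n_pred), n_gt),
               key=lambda perm: sum(cost[p][g] for g, p in enumerate(perm)))
    return [(p, g) for g, p in enumerate(best)]

# A flat distribution over bins 0, 1, 2 decodes to the middle value.
offset = expected_offset([0.0, 0.0, 0.0], [0.0, 1.0, 2.0])
# Three predictions, two ground-truth defects.
pairs = one_to_one_match([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
```

A sharply peaked distribution decodes close to a single bin value, while a flat one signals an uncertain edge. Each ground truth receives exactly one prediction, which is what removes the need for non-maximum suppression.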

[0091] The optimal weights obtained from training are evaluated on the test set, and the resulting test metrics can be systematically compared with those of other models.
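The weight retention logic described above can be sketched as a small loop. The callables `train_one_epoch` and `evaluate` are hypothetical placeholders for the actual training and evaluation routines, and a higher metric is assumed to be better.

```python
def train_with_best_checkpoint(train_one_epoch, evaluate, max_epochs):
    """Keep the weights with the best evaluation metric; stop at max_epochs."""
    best_metric, best_weights = float("-inf"), None
    for epoch in range(max_epochs):
        weights = train_one_epoch(epoch)   # returns the weights after this epoch
        metric = evaluate(weights)         # e.g. mAP on the evaluation set
        if metric > best_metric:           # retain only strictly better weights
            best_metric, best_weights = metric, weights
    return best_weights, best_metric

# Toy stand-ins: the "weights" are just the epoch index, with a fixed metric per epoch.
scores = [0.3, 0.5, 0.4]
best_weights, best_metric = train_with_best_checkpoint(
    lambda epoch: epoch, lambda w: scores[w], max_epochs=3)
```

In this toy run the second epoch gives the highest score, so its weights are the ones carried forward to testing.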

[0092] The following experiments verify the technical advantages of the present invention.

[0093] The following experiments, conducted on the Alibaba Cloud Tianchi Fabric Defect Dataset, systematically compare the method of this invention with current mainstream object detectors to verify its comprehensive performance advantages. The methods compared include YOLOv8n, YOLOv8s, RT-DETR, D-Fine, and DEIM. All models use the same input size (640×640), training strategy, and evaluation metrics. The evaluation metrics include the number of parameters, floating-point operations (FLOPs), frames per second (FPS), and mean average precision (mAP). As shown in Table 1, this invention achieves higher average precision without significantly increasing model complexity, while maintaining high inference efficiency.

[0094] Table 1 Comparison of Mainstream Object Detection Algorithms

[0095] To further verify the invention's ability to detect minute defects, a fine-grained comparison was conducted with the current state-of-the-art lightweight method, DEIM. As shown in Figure 5, the detection accuracy of this invention is superior or equal to that of DEIM across all defect categories, with a particularly significant improvement in small, low-contrast defect categories such as "holes" and "creases." This result demonstrates that this invention can reliably address the challenges of defect identification against highly interfering backgrounds.

[0096] Furthermore, visualizing the detection results intuitively reflects the robustness of the model in real-world scenarios. As shown in Figure 6, against a complex texture background, D-Fine (Figure 6, left) and DEIM (Figure 6, middle) show obvious missed detections and false detections, while the present invention (Figure 6, right) completely detects and accurately locates minute defects, significantly reducing false predictions. This result demonstrates that, when faced with the complex textures and minute defects commonly found in industrial fabrics, this invention exhibits stronger detection robustness and accuracy.

[0097] The following experiments were conducted on the Alibaba Cloud Tianchi Fabric Defect Dataset to verify the contribution of the key technologies in this invention to the final detection performance. All experiments were based on the same training configuration and used DEIM as the baseline model.

[0098] Table 2 Ablation Results of DFG-DETR

[0099] As shown in Table 2, introducing the direction-aware detail-preserving downsampling module (DirDown), the adaptive feature recalibration module (AFR), or the dynamic sparse selective attention module (DSSA) individually improves detection accuracy in each case; among them, AFR brings the most significant gain. When DirDown and AFR are integrated together, the performance improves further, and the complete model (containing all three modules) achieves the highest mAP@50, while the model complexity (number of parameters and FLOPs) remains at a lightweight level.

[0100] To further verify the effectiveness of the training strategy, this invention employs Matchability-Aware Loss (MAL) in the classification branch. This loss function uses the IoU between the predicted bounding box and the ground-truth bounding box as a soft label to improve the consistency between confidence and localization quality. As shown in Figure 7, the MAL loss of this invention converges faster and reaches a lower value than that of the baseline DEIM, indicating that the overall model architecture contributes to the effective optimization of the loss function and reflects a stronger feature learning ability. These results demonstrate that the various design elements of this invention work synergistically to improve the accuracy and efficiency of fabric defect detection.
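The idea of using IoU as a soft classification label can be sketched as follows. The exact MAL formula is not given in this text, so the loss below is an assumed form: a binary cross-entropy against the target IoU raised to the power `gamma` for matched predictions, and a confidence-weighted penalty for unmatched ones. The value `gamma=1.5` and the function names are likewise assumptions.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_label_loss(p, q, is_positive, gamma=1.5, eps=1e-7):
    """p: predicted confidence in (0, 1); q: IoU of the matched box pair."""
    p = min(max(p, eps), 1.0 - eps)
    if is_positive:
        target = q ** gamma   # soft label derived from localization quality
        return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return -(p ** gamma) * math.log(1.0 - p)

overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
overconfident = soft_label_loss(0.9, 0.2, True)   # high confidence, poor box
calibrated = soft_label_loss(0.1, 0.2, True)      # low confidence, poor box
```

With this form, a matched prediction with a poorly localized box is penalized for being overconfident, which pushes the confidence score toward the localization quality.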

[0101] As shown in Figure 8, this embodiment discloses a fabric defect detection system based on direction perception and fine-grained enhancement, used to perform the above method, including the following modules:

[0102] Dataset partitioning and preprocessing module: used to acquire fabric images, form a fabric defect dataset, and partition and preprocess the fabric defect dataset;

[0103] Multi-level feature extraction module: used to input the preprocessed image into the backbone network for multi-level feature extraction; the backbone network adopts a direction-aware detail-preserving downsampling module and embeds an adaptive feature recalibration module in HGBlock;

[0104] Contextual feature enhancement module: used to feed the last layer of features output by the multi-level feature extraction module into the encoder constructed from the dynamic sparse selective attention module for contextual feature enhancement;

[0105] Fusion module: used to input the high-level features enhanced by the contextual feature enhancement module and the remaining layer features output by the multi-level feature extraction module into the path aggregation feature pyramid network for multi-scale fusion, and to output a fused feature representation;

[0106] Result output module: used to input the fused features into the D-Fine detection head and output the defect category, location, and confidence prediction results.

[0107] Other aspects of this embodiment can be found in the above method embodiments.

[0108] In summary, this invention belongs to the field of fabric defect detection technology and specifically discloses a direction-aware and fine-grained enhancement method for fabric defect detection. To address the challenge of robustly detecting small, low-contrast defects against complex texture backgrounds, this invention employs a direction-aware detail-preserving downsampling module in the backbone network to explicitly model fine-grained features sensitive to horizontal and vertical structures, and combines this with an adaptive feature recalibration module to enhance local contextual representation. In addition, the standard Transformer block in the encoder is replaced with a dynamic sparse selective attention module, achieving efficient fusion of local details and global semantics while reducing computational overhead. Finally, end-to-end defect localization and recognition are completed through the detection head. This invention effectively improves the detection accuracy and robustness for multi-scale, weakly salient defects, making it suitable for real-time industrial quality inspection scenarios.

[0109] The preferred embodiments and principles of the present invention have been described in detail above. For those skilled in the art, there may be changes in the specific implementation based on the ideas provided by the present invention, and these changes should also be considered within the scope of protection of the present invention.

Claims

1. A fabric defect detection method based on direction perception and fine-grained enhancement, characterized in that it comprises the following steps: Step 1: Obtain fabric images to form a fabric defect dataset, and then divide and preprocess the fabric defect dataset; Step 2: Input the preprocessed image into the backbone network for multi-level feature extraction; the backbone network adopts a direction-aware detail-preserving downsampling module and embeds an adaptive feature recalibration module in HGBlock; Step 3: Feed the last layer of features output from Step 2 into the encoder constructed from the dynamic sparse selective attention module for contextual feature enhancement; Step 4: Input the enhanced high-level features from Step 3 and the remaining layer features output from Step 2 into the path aggregation feature pyramid network for multi-scale fusion, and output the fused feature representation; Step 5: Input the fused features into the D-Fine detection head and output the defect category, location, and confidence prediction results.

2. The fabric defect detection method according to claim 1, characterized in that: In step 1, the fabrics on the production line are photographed using a camera to obtain initial fabric images, which are then uniformly adjusted to the same size. Next, the defects on the fabric images are marked and filtered, and only images with defects are retained. The images are then cropped according to resolution to obtain a fabric defect dataset.

3. The fabric defect detection method according to claim 1, characterized in that: In step 1, the fabric defect dataset is divided into training set, validation set, and test set according to a preset ratio.

4. The fabric defect detection method according to any one of claims 1-3, characterized in that: In step 1, the preprocessing is as follows: the input image and its corresponding bounding box annotations are jointly augmented; Mosaic augmentation is enabled, stitching four random images into a composite image; random cropping based on intersection over union (IoU), photometric distortion, random horizontal flipping, and image resizing are applied in sequence; the bounding box coordinates are updated synchronously for all geometric transformations; invalid annotations are removed; the image pixel values are normalized; and the bounding box format is converted into a normalized representation of center point coordinates, width, and height.

5. The fabric defect detection method according to claim 1, characterized in that: Step 2 is as follows: Step 2.1: Generate low-dimensional feature maps by passing the input image through the initial convolutional layers of the backbone network HGNet; Step 2.2: Use the direction-aware detail-preserving downsampling module to explicitly model the directional perturbation region; Step 2.3: Embed an adaptive feature recalibration module in each HGBlock to obtain a highly discriminative, fine-grained feature representation; Step 2.4: Output multi-scale, highly discriminative feature maps stage by stage.

6. The fabric defect detection method according to claim 5, characterized in that: Step 2.2 specifically includes the following steps: Step 2.2.1: Perform horizontal average pooling with a kernel size of 1×3 and vertical average pooling with a kernel size of 3×1 on the input feature map, and add the two results together to obtain the directional prior features, whose internal response differences constitute the directional perturbation region; Step 2.2.2: Calculate the mean along the channel dimension of the input feature map to generate a single-channel grayscale image, obtain a spatial attention map through lightweight convolution and sigmoid activation, and use it to enhance the directional prior features to form attention-modulated directional features; Step 2.2.3: Construct a dual-path downsampling structure: one path uses depthwise convolution for detail-preserving downsampling, and the other path uses max pooling for downsampling of strong response regions; Step 2.2.4: After concatenating the two outputs along the channel dimension, group the concatenated features along the channel dimension, apply a 1×1 convolution independently to each group to achieve intra-group compression and interaction, and promote inter-group information fusion through channel shuffling, so that the output downsampled features contain both spatial detail information and local strong response features.

7. The fabric defect detection method according to claim 5, characterized in that: Step 2.3 specifically includes the following steps: Step 2.3.1: Calculate the mean along the channel dimension of the input feature map to generate a single-channel grayscale image, obtain a spatial attention map through lightweight convolution and sigmoid activation, and enhance the input features in a non-suppressive weighted manner to form attention-modulated enhanced features; Step 2.3.2: Perform dynamic scaling prediction on the enhanced feature map, predicting a global gain factor and a global gating factor through global average pooling and successive convolution operations, wherein the global gain factor is used to alleviate the optimization imbalance between samples caused by differences in defect saliency, and the global gating factor is used to dynamically adjust the enhancement strength of the local context-aware weights; Step 2.3.3: Map the enhanced features using a 1×1 convolution to obtain the mapped features, then broadcast the global gain factor obtained in step 2.3.2 to all channels of the mapped features and perform uniform amplitude modulation on them to obtain the global modulation features; Step 2.3.4: Based on the global modulation features, use depthwise separable convolution to implement a local context-aware channel attention mechanism, modulate the resulting attention weights by the global gating factor, and then reweight the enhanced features in a non-suppressive manner to complete adaptive feature recalibration.

8. The fabric defect detection method according to claim 1, characterized in that: In step 3, the processing procedure of the dynamic sparse selective attention module includes the following steps: Step 3.1: Divide the input features into four groups along the channel dimension, which are used for local modeling with different receptive fields and for preserving the original information, respectively: two groups pass through 3×3 and 7×7 depthwise separable convolutions to extract multi-scale context, one group is used directly for attention calculation, and the remaining group serves as the original path to preserve high-frequency details; Step 3.2: Input the first three groups of features into the linear attention unit, which normalizes the query and key and reorders the matrix multiplications to achieve long-range context modeling; meanwhile, the fourth group, as the original high-frequency detail path, is dynamically scaled by a sigmoid-activated learnable scalar weight; finally, the four outputs are concatenated along the channel dimension to generate the enhanced feature representation; Step 3.3: Process the four-way concatenated enhanced features through a separately instantiated adaptive feature recalibration module for cross-group feature interaction and response sparsification, outputting the final context-enhanced high-level features.

9. A fabric defect detection system based on direction perception and fine-grained enhancement, used to perform the method as described in any one of claims 1-8, characterized in that it comprises the following modules: Dataset partitioning and preprocessing module: used to acquire fabric images, form a fabric defect dataset, and partition and preprocess the fabric defect dataset; Multi-level feature extraction module: used to input the preprocessed image into the backbone network for multi-level feature extraction; the backbone network adopts a direction-aware detail-preserving downsampling module and embeds an adaptive feature recalibration module in HGBlock; Contextual feature enhancement module: used to feed the last layer of features output by the multi-level feature extraction module into the encoder constructed from the dynamic sparse selective attention module for contextual feature enhancement; Fusion module: used to input the high-level features enhanced by the contextual feature enhancement module and the remaining layer features output by the multi-level feature extraction module into the path aggregation feature pyramid network for multi-scale fusion, and to output a fused feature representation; Result output module: used to input the fused features into the D-Fine detection head and output the defect category, location, and confidence prediction results.
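The dataset division of claim 3 and the bounding box handling of claim 4 can be illustrated with a short sketch. The 8:1:1 ratio, the fixed random seed, the corner-format input boxes, and the function names are assumptions chosen for illustration; the claims only specify a preset ratio and a normalized center, width, and height output format.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into training, validation, and test sets by a preset ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

def hflip_box(box, img_w):
    """Update an (x1, y1, x2, y2) box after a horizontal image flip."""
    x1, y1, x2, y2 = box
    return (img_w - x2, y1, img_w - x1, y2)

def to_normalized_cxcywh(box, img_w, img_h):
    """Convert (x1, y1, x2, y2) in pixels to normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h)

train, val, test = split_dataset(range(10))
flipped = hflip_box((10, 20, 30, 40), img_w=100)
normalized = to_normalized_cxcywh(flipped, img_w=100, img_h=80)
```

Applying the geometric update to the box at the same time as the image transform is what keeps the annotations aligned with the augmented image.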