Unsupervised cigarette package defect detection method based on structure reconstruction
By using multi-scale feature fusion and a prototype-guided local-global attention module, the problem of misjudgment of texture changes in cigarette packaging inspection was solved, enabling accurate detection of complex structures and fine-grained defects, thus improving detection efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA TOBACCO HENAN IND CO LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for detecting defects in cigarette packaging often struggle to distinguish between material texture variations and actual defects when dealing with complex structures and fine-grained defects. This results in a high false detection rate, and the decoder is insensitive to local differences, making it difficult to achieve accurate localization.
By extracting multi-scale latent feature maps, fusing frequency domain and spatial domain information, and introducing a prototype-guided local-global attention module, the decoder's ability to model global semantic consistency and local structural details is enhanced. The reconstruction process is guided by normal prototype features, achieving more accurate defect detection.
It improves the accuracy and efficiency of cigarette packaging defect detection, effectively distinguishing between normal texture fluctuations and real defects, and is suitable for simultaneous quality inspection of minor damage and large-scale structural defects in key parts on high-speed production lines.
Description
Technical Field
[0001] This invention relates to the field of cigarette packaging defect detection technology, and in particular to an unsupervised cigarette packaging defect detection method based on structure reconstruction.
Background Technology
[0002] As a crucial carrier of tobacco products, the appearance quality of cigarette packaging directly impacts brand image and market compliance. On high-speed automated production lines, processes such as printing, hot stamping, and embossing can easily generate various appearance defects, including damaged seals, damaged inner lining paper, and surface stains. Traditional manual visual inspection methods are insufficient to meet the inspection speed and consistency requirements of hundreds of packs per minute on modern production lines, making machine vision-based intelligent inspection technology an industry necessity. In recent years, deep learning-driven unsupervised defect detection methods have demonstrated significant potential in cigarette packaging quality inspection. Among these, reconstruction-based methods learn the distribution characteristics of normal samples to establish a low-dimensional latent representation of the input data and reconstruct the input image accordingly. During the inference phase, normal areas can be accurately reconstructed, while defective areas, deviating from the normal distribution, exhibit significant reconstruction errors, thus achieving defect localization. This type of method requires only normal samples for training, effectively avoiding the labeling dilemma caused by the scarcity and high diversity of defect samples, and has significant practical value in cigarette packaging appearance quality inspection scenarios.
[0003] Although existing methods have made significant progress in modeling normal patterns and identifying defects, reconstruction-based methods still have shortcomings in latent feature modeling and in their reconstruction mechanisms, which limits their ability to characterize complex structures and fine-grained defects. First, they often rely on a single spatial-domain representation and lack effective decoupling and modeling of multi-band, multi-scale structural cues in the latent space, making it difficult to construct a discriminative latent distribution of normal patterns. Second, in the reconstruction stage, existing methods typically rely directly on the encoder's output feature representation, which is primarily semantically oriented and insensitive to local differences. They lack auxiliary latent feature representations that can give the decoder a reference for modeling normal structure, so the decoder is prone to over-smoothing or structural blurring when faced with local shape changes or texture destruction and responds insufficiently to small-scale defect regions, which weakens defect localization. Third, the uniform global attention mechanism commonly used during reconstruction cannot collaboratively model global context and local structural details, which constrains the decoder's ability to jointly model global semantic consistency and local structural detail.
[0004] This invention was developed through in-depth research into the aforementioned issues. It is particularly important to note that in actual tobacco production, cigarette packaging materials are diverse, including cellophane, aluminum foil, and white cardboard. Printing processes (such as gravure and offset printing) and post-printing processes (such as hot stamping and embossing) produce drastically different visual textures on different materials. For example, inner lining paper typically has a metallic luster and a specific embossed texture, while the seal is made of paper and requires extremely high precision in its patterns. Existing general-purpose defect detection models often struggle to distinguish normal variations caused by material texture changes from genuine physical damage (such as wrinkles in the inner lining paper or damaged seals), leading to a high false detection rate. In particular, for complex defects such as "seal warping," which alters the local structure and may be accompanied by changes in reflectivity, single-spatial-domain reconstruction models are prone either to misjudging uneven lighting as a defect or to smoothing out minor warping, which results in missed detections. Therefore, the industry urgently needs a detection solution that can deeply understand the unique structural information of cigarette packaging and accurately distinguish normal texture fluctuations from genuine defects.
Summary of the Invention
[0005] In view of the above, the present invention aims to provide a reconstruction scheme based on prototype-enhanced latent features. Structurally sensitive feature representations are extracted from the latent space and injected into the decoder as auxiliary information, guiding the decoder to take into account both global semantic consistency and local structural details during reconstruction and thereby achieving more coordinated and accurate reconstruction modeling. Because the scheme is trained without defect samples, it also addresses the dual challenges of scarce and diverse defect data in cigarette packaging production processes.
[0006] The technical solution adopted in this invention is as follows:
[0007] An unsupervised method for detecting defects in cigarette packaging based on structure reconstruction, comprising:
[0008] Multi-scale latent feature maps are extracted from the original image data of cigarette packaging, and the multi-scale latent feature maps are aggregated to obtain aggregated feature maps;
[0009] Normal prototype features representing the characteristics of normal areas of cigarette packaging are extracted from the aggregated feature map;
[0010] The decoding process is guided by the normal prototype features, and the aggregated feature map is reconstructed to obtain the normal features reconstructed by the decoder;
[0011] Based on the difference between the normal features reconstructed by the decoder and the multi-scale latent feature map, the final defect detection result is obtained, including at least one of the following predetermined cigarette packaging defects: folded seal, reversed seal, damaged seal, ruptured box packaging paper, and damaged inner lining paper.
[0012] In at least one possible implementation, the extraction of normal prototype features characterizing normal regions of cigarette packaging from the aggregated feature map includes:
[0013] Wavelet transform is performed on the aggregated feature map to obtain multiple sub-bands containing low-frequency and high-frequency components;
[0014] The multiple sub-bands are fused to obtain wavelet transform enhanced features;
[0015] By using a learnable prototype token and a cross-attention mechanism, normal patterns are aggregated from the wavelet transform enhanced features to obtain initial normal prototype features.
[0016] The initial normal prototype features are input into the prototype refinement hybrid expert system. The outputs of multiple expert networks are dynamically selected and aggregated through the routing network to obtain the refined normal prototype features.
[0017] In at least one possible implementation, the method further includes computing a normal prototype aggregation loss, defined as the cosine distance between the feature of a single normal image patch and its nearest normal prototype feature; minimizing this loss constrains the learnable prototype tokens to capture only normal patterns.
[0018] In at least one possible implementation, the step of guiding the decoding process with the normal prototype features to reconstruct the aggregated feature map and obtain the decoder-reconstructed normal features includes:
[0019] The decoder's multiple attention heads are divided into global attention heads and local attention heads;
[0020] The global attention head is used to model the global spatial dependencies and semantic consistency of cigarette packaging layout;
[0021] The local attention head enhances sensitivity to local neighborhood structures by introducing convolution operations with different receptive fields into the query vector branch.
[0022] The outputs of the global attention head and the local attention head are fused to obtain the normal features reconstructed by the decoder.
[0023] In at least one of the possible implementations, the local attention head processes the query vector using convolutional kernels of different sizes, and concatenates the processed vectors to reorganize and aggregate local neighborhood structure information.
[0024] In at least one possible implementation, the extraction of the multi-scale latent feature map from the original cigarette packaging image data includes:
[0025] The original image of the cigarette packaging is segmented into a patch sequence, and after adding position encoding, it is input into the encoder;
[0026] Intermediate outputs are extracted from different depth layers of the encoder and recombined to form two-dimensional spatial feature maps corresponding to different scales, thus constituting a multi-scale latent feature representation.
[0027] In at least one possible implementation, the aggregation of the multi-scale latent feature maps to obtain an aggregated feature map includes:
[0028] The multi-scale latent feature maps are summed pixel by pixel along the layer dimension and averaged to obtain an aggregated feature map that serves as a compact representation of the image.
[0029] In at least one possible implementation, obtaining the final defect detection result based on the difference between the normal features reconstructed by the decoder and the multi-scale latent feature map includes:
[0030] During the testing phase, the cosine distance in the spatial dimension between the multi-scale final reconstructed feature map output by the decoder and the multi-scale latent feature map of the corresponding layer of the encoder is calculated layer by layer to generate the initial defect response map at multiple scales.
[0031] The initial defect response maps at multiple scales are stitched together along the channel dimension and global average pooling is performed to obtain a defect detection heatmap aligned with the spatial resolution of the input image.
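The scoring steps in [0030] and [0031] can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names, the nearest-neighbor upsampling used to align scales, and the toy shapes are all assumptions made for illustration; the source only states that per-layer cosine distances are concatenated along the channel dimension and averaged.

```python
import numpy as np

def cosine_distance_map(rec, enc, eps=1e-8):
    """1 - cosine similarity at every spatial location of two (H, W, C) maps."""
    num = (rec * enc).sum(axis=-1)
    den = np.linalg.norm(rec, axis=-1) * np.linalg.norm(enc, axis=-1) + eps
    return 1.0 - num / den

def upsample_nearest(m, factor):
    """Nearest-neighbor upsampling (assumed here; the source does not name a method)."""
    return np.repeat(np.repeat(m, factor, axis=0), factor, axis=1)

def defect_heatmap(recs, encs, out_size):
    """Per-layer distance maps are upsampled, stacked along a channel axis and averaged."""
    maps = []
    for rec, enc in zip(recs, encs):
        dist = cosine_distance_map(rec, enc)
        maps.append(upsample_nearest(dist, out_size // dist.shape[0]))
    return np.stack(maps, axis=-1).mean(axis=-1)

# Toy example: one 4x4 scale, with a single location where reconstruction fails.
rng = np.random.default_rng(0)
enc = rng.standard_normal((4, 4, 8))
rec = enc.copy()
rec[1, 2] = -enc[1, 2]                 # opposite direction, so cosine distance is about 2
heat = defect_heatmap([rec], [enc], out_size=16)
print(heat.shape)
```

In this toy case the heatmap is near zero everywhere except the 4x4 block of output pixels that corresponds to the perturbed patch.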
[0032] In at least one possible implementation, the method further includes a total loss function during the training phase, the total loss function being a weighted sum of reconstruction loss and normal prototype aggregation loss, wherein the reconstruction loss is the average of the sum of cosine distances between the decoder's multi-scale final reconstructed features and the corresponding layers of the encoder's multi-scale latent features.
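One way to write the training objective described above is sketched below. The symbols K (number of encoder layers used), F_k and \hat{F}_k (encoder latent features and decoder reconstructed features at layer k), and the weights \lambda_1 and \lambda_2 are assumptions introduced here for illustration; only CS(·) and L_c appear in the source, and the source does not give the weight values.

```latex
\mathcal{L}_{\mathrm{rec}} = \frac{1}{K}\sum_{k=1}^{K}\Bigl(1 - \mathrm{CS}\bigl(\hat{F}_k,\, F_k\bigr)\Bigr),
\qquad
\mathcal{L}_{\mathrm{total}} = \lambda_{1}\,\mathcal{L}_{\mathrm{rec}} + \lambda_{2}\,L_{c}
```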
[0033] The advantages and effects embodied in the main design concept of this invention are described below:
[0034] First, this invention introduces a frequency domain enhanced prototype extraction module, which uses wavelet transform to decompose aggregated features into low-frequency and high-frequency components. For cigarette packaging, the low-frequency components correspond to the overall layout and background color of large printed areas such as trademarks and warning labels, while the high-frequency components precisely capture key structural information such as the edges of the seal, the outline of the hot-stamped text, and the embossed texture of the inner lining paper. By fusing and refining these frequency domain components, the model can construct normal prototype features that include both global layout priors and sensitivity to local texture changes. This provides an essential basis for distinguishing between features such as "inner lining paper damage" (damaging high-frequency textures) and "normal aluminum foil paper reflection" (only changing low-frequency brightness).
[0035] Secondly, this invention designs a prototype-guided local-global attention module in which the attention heads are divided into groups. One group is responsible for maintaining global semantic consistency, ensuring that the spatial layout of key identifiers such as patterns and logos on the cigarette packaging is not disrupted during reconstruction, thus avoiding distortion of the overall brand image due to local defects. The other group enhances local detail perception through convolutions with multiple receptive fields, enabling it to capture subtle anomalies such as minute edge displacements caused by "seal warping" and local texture breaks caused by "seal damage." This global-local collaborative mechanism effectively overcomes the over-smoothing problem of traditional decoders when faced with subtle structural changes, so that the reconstructed feature map shows a pronounced and accurately located response in defective areas.
[0036] In summary, this invention provides an unsupervised defect detection scheme with strong structural sensitivity and precise positioning by closely combining the unique structural characteristics and defect types of cigarette packaging. It is particularly suitable for the synchronous online quality inspection of minor damage and deformation of key parts such as seals and inner lining paper, as well as large-scale structural defects on the packaging surface, on high-speed production lines. It is of great significance for improving the quality control level of tobacco products leaving the factory. For specific details of the scheme, please refer to the following text, which will not be repeated here.
Attached Figure Description
[0037] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described below with reference to the accompanying drawings, wherein:
[0038] Figure 1(a) is a schematic diagram of the steps and modules of the unsupervised cigarette packaging defect detection method based on structure reconstruction provided in an embodiment of the present invention;
[0039] Figure 1(b) is an overall architecture diagram of the unsupervised cigarette packaging defect detection scheme based on structure reconstruction provided in the embodiment of the present invention;
[0040] Figure 2(a) is a schematic diagram of wavelet transform multi-scale enhancement provided in an embodiment of the present invention;
[0041] Figure 2(b) is a schematic diagram of the hybrid expert system provided in an embodiment of the present invention;
[0042] Figure 2(c) is a schematic diagram of a prototype-guided local-global attention module provided in an embodiment of the present invention;
[0043] Figure 3 shows representative cases of cigarette packaging defects detected using the solution of this invention.
Detailed Implementation
[0044] Embodiments of the present invention are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0045] This invention proposes an unsupervised method for detecting defects in cigarette packaging based on structure reconstruction. It is used to detect typical cigarette packaging defects such as those shown in Figure 3, including folded seals, reversed seals, damaged seals, burst box packaging paper, and damaged inner lining paper. Detection follows this logic: First, multi-scale latent feature maps are extracted from defect-free original image data covering multiple regions and processes of the cigarette packaging. Then, the multi-scale latent feature maps are averaged pixel by pixel to obtain an aggregated feature map. Next, vectors representing the features of normal areas of the cigarette packaging are extracted from the aggregated features as normal prototype features. These normal prototype features then guide decoding, which takes into account both the global structure and the local details of the cigarette packaging and reconstructs only normal features. Finally, the difference between the features reconstructed by the decoder and the multi-scale features extracted by the encoder is calculated to obtain the final defect detection result, which can identify various appearance defects of cigarette packaging.
[0046] In a specific implementation, a flowchart of an unsupervised cigarette packaging defect detection method based on structure reconstruction can be seen in Figure 1(a). The main process involves the following four modules: feature extraction module S101, frequency domain enhanced prototype extraction module S102, prototype-guided local-global attention module S103, and defect detection module S104. Therefore, the original image data of the cigarette packaging is first input into the feature extraction module S101 to obtain an aggregated feature map. Specifically, a pre-trained encoder is used to extract multi-scale latent feature maps from the original cigarette packaging image data, and the multi-scale latent feature maps are averaged pixel-by-pixel. Next, the obtained aggregated feature map is input into the frequency domain enhanced prototype extraction module S102 to obtain a vector representing the normal region features of the cigarette packaging, i.e., normal prototype features. Then, the obtained normal prototype features and aggregated feature map are input into the prototype-guided local-global attention module S103 to obtain the normal features reconstructed by the decoder. Finally, the obtained normal features reconstructed by the decoder and the multi-scale latent feature map extracted by the encoder are input into the defect detection module S104 to obtain the final defect detection result that can identify various appearance defects of cigarette packaging, including whether there is one or more of the following specific defect types: seal folding, seal inverted, seal damage, box packaging paper bursting, inner lining paper damage; and, if a defect exists, its specific location on the cigarette box packaging can be located.
[0047] Based on the core concept proposed by the present invention and in conjunction with the complete scheme framework shown in Figure 1(b), the implementation of each link in the scheme will be described in detail below.
[0048] Step 1: Input the original image data of the cigarette packaging into the feature extraction module to obtain the aggregated feature map.
[0049] In a specific implementation, the input image is first segmented into a fixed-size patch sequence and positional encoding is added. Then, it sequentially passes through each layer of the Transformer encoder, extracting intermediate output sequences from encoder layers of different depths. Shallow outputs contain more spatial detail information, while deeper outputs contain higher-level semantic features, perceiving the global consistency semantic information of the cigarette packaging surface. The multi-level sequence features characterizing the printing quality and packaging integrity of the cigarette packaging surface are recombined to form a two-dimensional spatial feature map, ultimately constituting a set of multi-scale latent feature representations. This provides the multi-level information foundation required for cigarette packaging defect detection in subsequent feature aggregation and decoding processes. Then, the multi-scale features are added pixel-by-pixel along the layer dimensions and averaged to obtain an aggregated feature map. This aggregated feature map serves as a compact representation of the entire image, containing both highly discriminative normal patterns, such as standard cigarette packaging printing quality standards, and potential defect responses. Those skilled in the art will understand that in actual production lines, cigarette packaging images typically contain multiple key areas, such as, but not limited to, the main trademark on the front, warning text on the back, side seals, and inner lining paper. Through multi-scale feature extraction, shallow features can preserve the details of the brushstrokes and the clarity of the embossing lines of the gold foil text, while deep features can capture whether the overall layout is compliant, such as whether the warning text is complete and whether the brand logo is centered. 
This multi-level information lays a solid foundation for distinguishing normal texture fluctuations (such as the natural reflection of gold foil paper) from real defects (such as scratches on the inner lining paper).
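The reshaping and averaging described in Step 1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the three encoder depths, the 4x4 patch grid and the channel width are hypothetical, and random arrays stand in for the intermediate outputs of a pre-trained Transformer encoder.

```python
import numpy as np
from typing import Sequence

def tokens_to_map(tokens: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    """Reshape a patch-token sequence (N, C) into a 2D feature map (H, W, C)."""
    n, c = tokens.shape
    assert n == grid_h * grid_w, "token count must match the patch grid"
    return tokens.reshape(grid_h, grid_w, c)

def aggregate_layers(layer_tokens: Sequence[np.ndarray], grid_h: int, grid_w: int) -> np.ndarray:
    """Sum the per-layer feature maps element-wise and divide by the layer count."""
    maps = np.stack([tokens_to_map(t, grid_h, grid_w) for t in layer_tokens], axis=0)
    return maps.mean(axis=0)

# Hypothetical setup: 3 encoder depths, a 4x4 patch grid, 8 channels.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 8)) for _ in range(3)]
agg = aggregate_layers(layers, grid_h=4, grid_w=4)
print(agg.shape)  # (4, 4, 8)
```

The per-layer maps are kept alongside the aggregate in practice, since the later defect scoring compares the decoder output with each encoder layer separately.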
[0050] Step 2: Input the aggregated feature map obtained in Step 1 into the frequency domain enhanced prototype extraction module to obtain the vector representing the normal region features of the cigarette packaging feature map, i.e., the normal prototype features.
[0051] In a specific implementation, as shown in Figure 2(a), the frequency-domain-aware prototype feature extractor first uses the discrete wavelet transform to decouple the aggregated features X extracted from the cigarette packaging surface into four sub-bands: $X_{HH}$, $X_{HL}$, $X_{LH}$ and $X_{LL}$. $X_{LL}$ is the low-frequency component, which reflects the overall structure of the cigarette packaging printing and the basic morphological characteristics of trademark patterns at a coarse-grained level. $X_{HH}$, $X_{HL}$ and $X_{LH}$ are the high-frequency components, which retain texture details, edge information, and local structural changes at a fine-grained level. The four sub-bands are convolved and then concatenated to obtain wavelet-transform-enhanced features that fuse multi-level frequency-domain information from the cigarette packaging: low-frequency information reflecting the overall structural form, and high-frequency information depicting local textures and structural changes. As shown in Figure 3, compared with encoding methods that rely solely on spatial features, this frequency-domain representation exhibits more pronounced response differences near defective areas such as seal defects and packaging paper damage. This provides the model with a more structurally sensitive embedding of cigarette packaging surface quality, which can be used for attention inference in subsequent defect localization. Specifically, in cigarette packaging, low-frequency components correspond to large areas of background color, such as white cardstock, gold foil, and the main graphics, while high-frequency components capture the fine edges of hot stamping patterns, the embossed texture of the seal, and the microscopic wrinkles of the inner lining paper. 
Through wavelet transform, the model can decouple the textures that are inherent in normal printing (such as fine dots) from actual damage (such as torn edges) in the frequency domain, thereby more accurately focusing on anomalous changes.
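To make the sub-band decomposition concrete, the following sketch applies a single-level 2D Haar transform to one feature channel. The choice of the Haar wavelet, the LH/HL naming convention and the function names are assumptions; the source does not state which wavelet is used. The learned convolution applied to each sub-band before concatenation is omitted, so the sub-bands are simply stacked.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-frequency approximation
    lh = (a + b - c - d) / 2.0            # difference between rows
    hl = (a - b + c - d) / 2.0            # difference between columns
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; reconstructs the (2H, 2W) input exactly."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

rng = np.random.default_rng(0)
channel = rng.standard_normal((8, 8))           # one channel of the aggregated map
ll, lh, hl, hh = haar_dwt2(channel)
enhanced = np.stack([ll, lh, hl, hh], axis=-1)  # concatenation step, conv omitted
restored = haar_idwt2(ll, lh, hl, hh)
print(enhanced.shape, np.allclose(restored, channel))
```

A uniform region produces zero high-frequency sub-bands, which matches the text's point that a change in overall brightness stays in the low-frequency component while torn edges show up in the high-frequency ones.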
[0052] The wavelet-transform-enhanced features are transformed into key and value vectors by a linear layer, and the normal feature patterns extracted from defect-free cigarette packaging samples are then aggregated through cross-attention into M learnable prototype tokens $T = \{t_1, \dots, t_M \mid t_m \in \mathbb{R}^C\}$. Each prototype token encodes a compact representation of the normal pattern of the surface features of one region of the cigarette packaging. The features processed by cross-attention are fused with the features processed by the inverse discrete wavelet transform to obtain features that combine spatial structure information with frequency-domain features, as shown in the attached formula (1). Through this joint encoding, the model establishes explicit information interaction between the frequency domain and the spatial domain, which enhances its ability to discriminate printing texture defects while preserving the geometric structure of the cigarette packaging surface, and strengthens its modeling of structural consistency with the standard cigarette packaging template. This means that the model can learn a library of normal patterns for different brands of cigarette packaging, such as the frequency-domain features of the red background and main building pattern of brand A, and the diamond logo and brushstroke details of brand B, so as to accurately identify anomalies that do not conform to the standard template during inference.
[0053]
[0054] Where I represents the aggregated feature map; DWT represents the discrete wavelet transform; Conv(·) represents the convolution operation; Concat(·) represents the concatenation operation; T represents the learnable tokens; Q, K and V represent the query, key and value matrices of the cross-attention, respectively; $W^Q$, $W^K$ and $W^V$ are the linear transformation weight matrices of Q, K and V, respectively; A is the scaling factor; softmax is the activation function that normalizes the similarity matrix into attention weights; $\mathrm{head}_i$ represents the output feature of the i-th cross-attention head; $W^O$ represents the linear transformation weight matrix of the multi-head cross-attention output; Dropout represents the dropout layer that randomly deactivates some neurons; LN represents layer normalization; F represents the final output feature of the multi-head cross-attention; IDWT represents the inverse discrete wavelet transform; Y represents the output feature of the wavelet transform branch; and $T_c$ represents the initial normal prototype features.
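The cross-attention aggregation can be sketched as follows. This is a single-head simplification under stated assumptions: the function name and toy shapes are hypothetical, the scaling factor is taken to be the square root of the feature dimension, random matrices stand in for learned weights, and the multi-head split, output projection, dropout, layer normalization and fusion with the inverse-wavelet branch mentioned in the text are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_prototypes(tokens, feats, w_q, w_k, w_v):
    """Prototype tokens (M, C) act as queries over patch features (N, C)."""
    q = tokens @ w_q                           # queries from the learnable tokens
    k = feats @ w_k                            # keys from the enhanced features
    v = feats @ w_v                            # values from the enhanced features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (M, N) weights
    return attn @ v, attn                      # (M, C) aggregated prototypes

rng = np.random.default_rng(0)
n_patches, n_protos, dim = 16, 4, 8
feats = rng.standard_normal((n_patches, dim))    # stand-in for wavelet-enhanced features
tokens = rng.standard_normal((n_protos, dim))    # stand-in for learnable tokens T
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
t_c, attn = aggregate_prototypes(tokens, feats, w_q, w_k, w_v)
print(t_c.shape, attn.shape)
```

Each row of `attn` is a distribution over patches, so every prototype is a weighted average of patch values, which is the sense in which a token "aggregates" normal patterns.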
[0055] Furthermore, as shown in Figure 2(b), the learnable tokens representing the standard pattern of cigarette packaging are further processed by the prototype refinement hybrid expert system to generate more accurate and discriminative normal prototype features, as shown in the attached formula (2). Specifically, a set of N parameterized expert networks $\{E_1, E_2, \dots, E_N\}$ is predefined. Each expert independently performs a nonlinear mapping from the input feature space to the output feature space; for example, different experts may specialize in different printed units of the cigarette packaging, such as background areas, seal markings, and box packaging text. The experts employ a lightweight feedforward neural network structure, and a differentiated initialization strategy in the early training phase gives them the potential for diverse representations suited to the varied features of cigarette packaging surfaces. In addition, a lightweight routing network with shared parameters is designed. This network takes as input the learnable tokens, which represent the standard normal pattern of the cigarette packaging surface aggregated by the aforementioned cross-attention. Through two fully connected layers and the GELU activation function, it outputs an activation contribution score for each expert, and a Top-k strategy activates only the experts whose features best match the current cigarette packaging image region to participate in the update. The activated experts transform the input features in parallel, and their outputs are weighted by the routing scores and aggregated to form a refined normal prototype feature representation corresponding to a specific feature pattern. 
As each expert gradually specializes in a specific type of feature pattern during training, the routing network assigns lower activation scores to features that do not conform to their professional domain, thereby reducing their impact during the aggregation phase. Collaborative work among multiple experts can enhance key attributes of prototype features from different perspectives, constructing more sensitive representations of normal features. Simultaneously, a gating network dynamically allocates input to different experts, achieving efficient feature division and fusion. Those skilled in the art will understand that this mechanism is particularly important in cigarette packaging scenarios because the texture differences between different areas are significant: for example, the paper texture of the seal area is completely different from the metallic brushed texture of the inner lining paper; even the matte and glossy areas on the same packaging need to be modeled separately. The hybrid expert system automatically assigns features to the most specialized experts through a routing network, such as those specializing in recognizing the gloss changes of hot stamping text or focusing on detecting the continuity of embossing lines, thereby ensuring that the prototype features of each area receive the most refined representation. When encountering defects such as seal warping mentioned earlier, the expert responsible for edge detection will produce a high response, while the expert responsible for the background color will have a lower response. Ultimately, weighted aggregation highlights the true defective areas.
[0056]
[0057] Where $W_1$ and $W_2$ represent the weight matrices of the two linear layers of the routing network; $b_1$ and $b_2$ represent the bias vectors of the corresponding linear layers; GELU is the non-linear activation function; R(·) represents the lightweight routing network; s represents the activation contribution score vector over the experts; Top-k selects the expert indices corresponding to the k highest scores in the score vector; ActI represents the set of activated experts; x represents the feature input to the expert network, which is $T_c$ in this case; $\mathrm{FFN}_i$ represents the i-th independent feedforward neural network; $E_i(\cdot)$ represents the feedforward network output of the i-th expert; and $T_{re}$ represents the refined normal prototype features.
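The routing and aggregation just defined can be sketched as below. Several details are assumptions rather than statements from the source: the gate weights are obtained by a softmax over the k selected scores, GELU is computed with its tanh approximation, the experts are two-layer feedforward networks with random weights, and all function names and sizes are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def make_expert(rng, dim, hidden):
    """A lightweight feedforward expert with its own random initialization."""
    w_in = rng.standard_normal((dim, hidden)) * 0.1
    w_out = rng.standard_normal((hidden, dim)) * 0.1
    return lambda x: gelu(x @ w_in) @ w_out

def refine_prototype(x, experts, w1, b1, w2, b2, k=2):
    """Score experts with a two-layer router, keep the top k, mix their outputs."""
    scores = gelu(x @ w1 + b1) @ w2 + b2          # one score per expert
    top = np.argsort(scores)[-k:]                 # indices of the k highest scores
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                            # assumed: softmax over selected scores
    refined = sum(g * experts[i](x) for i, g in zip(top, gate))
    return refined, top, gate

rng = np.random.default_rng(0)
dim, hidden, n_experts = 8, 16, 4
experts = [make_expert(rng, dim, hidden) for _ in range(n_experts)]
w1, b1 = rng.standard_normal((dim, hidden)), np.zeros(hidden)
w2, b2 = rng.standard_normal((hidden, n_experts)), np.zeros(n_experts)
t_c = rng.standard_normal(dim)                    # stand-in for one initial prototype
t_re, top, gate = refine_prototype(t_c, experts, w1, b1, w2, b2, k=2)
print(t_re.shape, sorted(int(i) for i in top))
```

Only the selected experts are evaluated, which is what keeps a hybrid expert layer cheap relative to running every expert on every prototype.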
[0058] Furthermore, as shown in Formula (3) below, in order to ensure the consistent representation of normal prototype features and normal features, the cosine distance between the normal image patch features of a single cigarette packaging surface and the corresponding nearest normal prototype feature is calculated as the normal prototype aggregation loss. By minimizing this loss, the learnable token is constrained to capture only the normal patterns representing the features of the cigarette packaging surface. Training uses only normal samples without defects on the cigarette packaging surface. This loss forces the learnable token to become the cluster center of normal features, ensuring that the learnable token purely represents the normal patterns.
[0059]
[0060] Where y represents the feature of a single normal image patch, t_m represents the normal prototype feature, CS(·) denotes cosine similarity, M represents the total number of normal image patches in the current batch, and L_c represents the normal prototype aggregation loss.
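As an illustration of this loss, the sketch below computes, for each normal patch feature, the cosine distance 1 − CS(y, t) to its nearest prototype and averages over the M patches. The averaging over M and the function names are assumptions made for this example; the text only states that the loss is the cosine distance to the nearest normal prototype feature.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)  # small epsilon guards against zero vectors

def prototype_aggregation_loss(patches, prototypes):
    """Mean cosine distance between each patch feature and its nearest prototype."""
    total = 0.0
    for y in patches:
        nearest_similarity = max(cosine_similarity(y, t) for t in prototypes)
        total += 1.0 - nearest_similarity
    return total / len(patches)

# Toy example: two patch features and two prototypes.
patches = [[1.0, 0.0], [1.0, 1.0]]
prototypes = [[2.0, 0.0], [0.0, 3.0]]
loss = prototype_aggregation_loss(patches, prototypes)
```

The first patch is parallel to a prototype and contributes almost nothing, while the second lies between the two prototypes and accounts for nearly all of the loss; minimizing the loss would pull a prototype toward such patches.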
[0061] Step 3: Input the normal prototype features obtained in Step 2 and the aggregated feature map obtained in Step 1 into the prototype-guided local-global attention module to obtain the normal pattern features reconstructed by the decoder.
[0062] In a specific implementation, as shown in Figure 2(c), this module divides the eight attention heads inside the decoder into two groups: six global heads that maintain the global structural knowledge of the cigarette packaging layout, and two local heads that enhance local spatial sensitivity. This division is necessary because cigarette packaging design usually follows strict layout specifications, such as fixed warning areas and precise brand logo positions, while the seal, inner lining paper and other parts contain rich, fine textures. Referring to the attached formula (4), the six global attention heads keep the computational form of the standard Transformer decoder without introducing additional structural constraints. They mainly model global spatial dependencies and stably preserve the encoder's prior knowledge of the global structure of the cigarette box packaging layout, together with the semantic consistency of the brand logo layout, so that the model's understanding of the overall layout does not deviate even if there are minor local anomalies in the image. The two local attention heads are enhanced to meet the modeling needs of the local structure of the cigarette packaging surface. By introducing convolution operations with different receptive fields in the query vector branch, local neighborhood information is reorganized and aggregated to capture subtle changes such as slight curling at the edge of the seal or breaks in the indentation on the inner lining paper. This grouping design directly addresses the two main types of cigarette packaging defects: defects affecting the overall structure (such as inverted seals or torn packaging paper) and defects affecting local texture (such as damaged seals or scratches on the inner lining paper).
Therefore, the global heads focus on the consistency of the brand image, such as whether warning labels are complete and seals are affixed correctly, while the local heads focus on micro-quality, such as whether hot stamping lettering is missing strokes or whether crease lines are continuous. Working together, they allow the model both to determine at the macroscopic level whether the packaging is compliant and to detect hidden flaws at the microscopic level.
[0063]
[0064] Where Q_g, K_g and V_g represent the query matrix, key matrix and value matrix of global attention, respectively; W^Q_g, W^K_g and W^V_g represent the linear transformation weight matrices of Q_g, K_g and V_g, respectively; D_i represents the decoder reconstruction features; A^i_g represents the output feature of the i-th global attention head; W_g represents the linear transformation weight matrix of the multi-head global attention output; and F_g represents the output feature of global attention.
[0065] Specifically, the key vector for cross-attention is generated from the normal prototype features, and the query vector is generated from the aggregated feature map through a linear layer. In the local attention heads, before attention is calculated, the query vector is processed by convolutional kernels of different sizes and the processed vectors are concatenated, which reorganizes and aggregates local neighborhood structure information and introduces convolutional enhancement with different receptive fields, as shown in the attached formula (5). For example, a 3x3 convolutional kernel can capture the local gradient changes at the seal edge, while a 5x5 convolutional kernel covers a larger area and can perceive the extension direction of folds in the inner lining paper. In cigarette packaging, folds in the inner lining paper often extend linearly, and a larger receptive field helps to track their direction, whereas seal breakage may be a small hole or tear, whose boundary a smaller receptive field can locate more accurately. Through the fusion of multi-scale convolutions, the local heads can adapt to defects of different sizes and shapes.
[0066]
[0067] Where Q_l, K_l and V_l represent the query matrix, key matrix and value matrix of local attention, respectively; W^Q_l, W^K_l and W^V_l represent the linear transformation weight matrices of Q_l, K_l and V_l, respectively; A^i_l represents the output feature of the i-th local attention head; W_l represents the linear transformation weight matrix of the multi-head local attention output; F_l represents the local attention output feature; and D_{i+1} represents the output feature of the (i+1)-th layer of the decoder.
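The multi-scale convolutional enhancement of the local query branch can be illustrated with the following sketch, which filters a grid of query vectors with a 3x3 and a 5x5 kernel and concatenates the results along the channel dimension. Several details are assumptions made for this example: uniform averaging kernels stand in for learned convolution weights, zero padding is used at the borders, the same kernel is applied to every channel, and the names `depthwise_conv` and `multi_scale_query` are hypothetical. The text also does not say how the channel dimension is restored after concatenation, so that step is omitted.

```python
def box_kernel(size):
    # Uniform averaging kernel; a stand-in for a learned convolution kernel
    return [[1.0 / (size * size)] * size for _ in range(size)]

def depthwise_conv(grid, kernel):
    """Zero-padded 'same' convolution that applies one 2-D kernel to every channel."""
    H, W, C = len(grid), len(grid[0]), len(grid[0][0])
    r = len(kernel) // 2
    out = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        w = kernel[di + r][dj + r]
                        for c in range(C):
                            out[i][j][c] += w * grid[ii][jj][c]
    return out

def multi_scale_query(grid, sizes=(3, 5)):
    """Convolve the query grid at each kernel size and concatenate along channels."""
    branches = [depthwise_conv(grid, box_kernel(s)) for s in sizes]
    H, W = len(grid), len(grid[0])
    return [[[v for b in branches for v in b[i][j]] for j in range(W)]
            for i in range(H)]

# Toy example: a 5x5 grid of 2-channel query vectors, all ones.
queries = [[[1.0, 1.0] for _ in range(5)] for _ in range(5)]
enhanced = multi_scale_query(queries)
```

Each enhanced query now carries one response per receptive field, so a position near a small tear and a position on a long fold can differ in the 3x3 channels, the 5x5 channels, or both.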
[0068] Furthermore, the global consistency information and local structural responses obtained by the two groups of attention heads are concatenated along the head dimension, and the fused decoder output is obtained through a linear mapping.
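A minimal sketch of this fusion step is given below: each head attends from its query tokens to the normal prototypes, and the per-head outputs are concatenated token by token before a linear mapping. It assumes standard scaled dot-product attention and assumes that the value vectors are also derived from the prototypes, which the text does not state. The toy example uses one global and one local head rather than six and two, and the projection matrix `W_out` is hypothetical.

```python
import math

def softmax(v):
    peak = max(v)
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention of each query token over the prototype keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(logits)
        out.append([sum(wi * v[c] for wi, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

def fuse_heads(head_outputs, W_out):
    """Concatenate per-head outputs token by token, then apply a linear mapping."""
    fused = []
    for t in range(len(head_outputs[0])):
        concat = [a for head in head_outputs for a in head[t]]
        fused.append([sum(w * a for w, a in zip(row, concat)) for row in W_out])
    return fused

# Toy example: one query token, two prototypes, one global and one local head.
prototypes = [[2.0, 0.0], [0.0, 2.0]]
global_out = attend([[0.0, 0.0]], prototypes, prototypes)   # zero query: uniform weights
local_out = attend([[50.0, 0.0]], prototypes, prototypes)   # strongly matches prototype 0
W_out = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]        # hypothetical 2x4 projection
fused = fuse_heads([global_out, local_out], W_out)
```

Because every output is a weighted combination of prototype-derived values, the reconstructed feature stays close to normal patterns, which is what lets defective regions stand out later as reconstruction differences.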
[0069] As shown in Figure 3, for defects that affect the overall layout, such as inverted seals or ruptured packaging paper, the six global attention heads anchor the global spatial dependencies of the cigarette label pattern, warning text, and brand logo, stably preserving the global structural priors of the cigarette box packaging. This ensures that the semantic consistency and rationality of the layout are not disrupted during defect analysis, avoiding distortion of the overall layout features due to local defects. For local defects such as folded seals, damaged seals, and damaged inner lining paper, the two local attention heads accurately capture minute deformations of the seal edge, textural anomalies caused by seal surface damage, and textural changes caused by local damage to the inner lining paper through convolutional branches with different receptive fields. This enhances the sensitivity to local structural changes and fine-grained defects, thereby improving the accuracy of defect identification and positioning. Through the synergy of the global and local attention heads, this module maintains the global structural integrity of the cigarette packaging layout while efficiently perceiving various minor packaging defects, providing reliable feature support for subsequent defect localization.
[0070] Step 4: During the training phase, calculate the total loss and optimize the model parameters.
[0071] In a specific implementation, during the training phase, the cosine distance between the decoder's multi-scale final reconstructed features and the encoder's multi-scale latent features at the corresponding layers is calculated, and the average of these cosine distances over the layers is used as the global loss. Simultaneously, the normal prototype aggregation loss from Step 2 is introduced as an auxiliary loss. The total loss is the weighted sum of the two, as shown in the attached formula (6). Model parameters are optimized by minimizing the total loss. During training, the model uses only a large number of defect-free cigarette packaging images. These images cover products from different brands and batches, including various normal printing deviations (such as slight color differences and overprinting errors) and mechanical indentations within acceptable limits. By optimizing the total loss, the model learns to encode all these normal variations into compact prototype features and to reconstruct them accurately during decoding. As a result, any anomaly exceeding the normal range, whether global or local, produces a significant reconstruction error during testing.
[0072]
[0073] Where I_i represents the multi-layer latent feature maps of the encoder, N represents the number of feature layers in the encoder-decoder, L_g represents the global loss, λ represents the weight of the normal prototype aggregation loss, and L_total represents the total loss.
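The loss computation can be sketched as follows, assuming the form L_total = L_g + λ·L_c, where L_g averages the cosine distance between encoder and decoder features first over the tokens of a layer and then over the N layers. The order of averaging, the value λ = 0.5, and the function names are assumptions made for this example and are not given in the text.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)  # small epsilon guards against zero vectors

def global_loss(encoder_feats, decoder_feats):
    """Average cosine distance over tokens, then over the N feature layers."""
    per_layer = []
    for enc, dec in zip(encoder_feats, decoder_feats):
        dists = [cosine_distance(e, d) for e, d in zip(enc, dec)]
        per_layer.append(sum(dists) / len(dists))
    return sum(per_layer) / len(per_layer)

def total_loss(l_g, l_c, lam):
    # Weighted sum: lam weights the normal prototype aggregation loss
    return l_g + lam * l_c

# Toy example: N = 2 layers, 2 tokens per layer, 2-dimensional features.
enc = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]]
dec = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
l_g = global_loss(enc, dec)
l_total = total_loss(l_g, l_c=0.1, lam=0.5)  # lam = 0.5 is a hypothetical value
```

Here the first layer is reconstructed perfectly and only one token of the second layer is wrong, so the global loss is driven entirely by that token.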
[0074] Step 5: In the testing phase, the normal pattern features reconstructed by the decoder in Step 3 and the multi-scale latent feature maps extracted by the encoder are input into the defect detection module to obtain the final defect detection results, which identify the various appearance defects of cigarette packaging.
[0075] In a specific implementation, during the testing phase, the cosine distance between the multi-scale final reconstructed feature maps output by the decoder and the multi-scale latent feature maps of the corresponding encoder layers is calculated layer by layer at each spatial position to generate multi-scale initial defect response maps. These defect response maps of different scales are then concatenated along the channel dimension, and a global average pooling operation is performed along this dimension to fuse the defect information of all scales.
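The test-time fusion can be sketched as follows: a cosine-distance response map is computed per layer, the maps are averaged across scales, and the result is upsampled to the input resolution. Nearest-neighbor upsampling, the assumption that all layers share one spatial grid, and the function names are choices made for this example; the text does not specify the interpolation method.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)  # small epsilon guards against zero vectors

def response_map(enc, dec):
    """Per-position cosine distance between encoder and decoder feature maps."""
    return [[cosine_distance(e, d) for e, d in zip(enc_row, dec_row)]
            for enc_row, dec_row in zip(enc, dec)]

def fuse_scales(maps):
    """Average the stacked response maps along the scale (channel) dimension."""
    n, H, W = len(maps), len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / n for j in range(W)] for i in range(H)]

def upsample_nearest(m, factor):
    # Nearest-neighbor upsampling to the input resolution (assumed method)
    return [[m[i // factor][j // factor] for j in range(len(m[0]) * factor)]
            for i in range(len(m) * factor)]

# Toy example: two scales, each a 1x2 grid of 2-dimensional features.
enc = [[[1.0, 0.0], [0.0, 1.0]]]
dec_scale1 = [[[1.0, 0.0], [1.0, 0.0]]]   # second position is badly reconstructed
dec_scale2 = [[[1.0, 0.0], [0.0, 1.0]]]   # both positions are reconstructed well
maps = [response_map(enc, dec_scale1), response_map(enc, dec_scale2)]
heatmap = upsample_nearest(fuse_scales(maps), factor=2)
```

The high values in the right half of `heatmap` mark the position that the decoder failed to reconstruct at one of the scales, which is how defective regions appear as bright areas.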
[0076] Finally, referring to Figure 3, common defects in cigarette packaging production, such as seal defects, box packaging paper defects, and inner lining paper defects, are used as test cases. After processing through the aforementioned multi-scale feature fusion process, a defect detection heatmap aligned with the spatial resolution of the input cigarette packaging image is obtained. The high-response areas of this heatmap correspond accurately to the actual locations of various packaging defects. The entire model can effectively capture large-scale structural defects that affect the overall layout, and can also sensitively identify fine-grained local micro-defects, achieving the perception and localization of surface defects in cigarette packaging. For example, for a defect like "seal reversed," the heatmap will show a large area of high brightness in the seal area, indicating that the seal may be reversed; while for "inner lining paper damage," the heatmap will accurately outline thin bright lines along the scratches or wrinkles. The detection results can be directly superimposed on the original image for quality inspectors to quickly confirm, and can also serve as the basis for automated rejection systems, thereby significantly improving the quality inspection efficiency and accuracy of the cigarette packaging production line.
[0077] In summary, the unsupervised cigarette packaging defect detection method based on structure reconstruction disclosed in this invention specifically involves extracting multi-scale latent feature maps from the original image data of cigarette packaging, and averaging the multi-scale latent feature maps pixel by pixel to obtain an aggregated feature map; extracting vectors representing normal region features from the aggregated features; guiding the decoding process with normal prototype features to reconstruct only normal features; and calculating the difference between the decoder-reconstructed features and the multi-scale features extracted by the encoder to obtain the final defect detection result. This invention significantly enhances the structural sensitivity expression of normal features through a synergistic mechanism of frequency domain enhancement and prototype guidance, effectively improving the modeling ability of global structure and local details. It solves the problems of insufficient latent space discriminability and excessive smoothing of the decoder in traditional reconstruction methods, achieving efficient and accurate detection of cigarette packaging defects.
[0078] In this invention, when directional terms are mentioned, they are relative concepts based on the embodiments. Furthermore, "at least one" refers to one or more, and "more than one" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent the existence of A alone, A and B simultaneously, or B alone. A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects have an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, and c can represent: a, b, c, a and b, a and c, b and c, or a and b and c, where a, b, and c can be single or multiple.
[0079] The above description of the structure, features, and effects of the present invention is based on the embodiments shown in the figures. However, the above are only preferred embodiments of the present invention. It should be noted that the technical features involved in the above embodiments and their preferred methods can be reasonably combined and matched by those skilled in the art to form a variety of equivalent solutions without departing from or changing the design concept and technical effects of the present invention. Therefore, the present invention is not limited to the scope of implementation shown in the figures. Any changes made in accordance with the concept of the present invention, or modifications to equivalent embodiments, that do not exceed the spirit covered by the specification and figures, should be within the protection scope of the present invention.
Claims
1. An unsupervised method for detecting defects in cigarette packaging based on structural reconstruction, characterized in that it includes: Multi-scale latent feature maps are extracted from the original image data of cigarette packaging, and the multi-scale latent feature maps are aggregated to obtain an aggregated feature map; Normal prototype features representing the characteristics of normal areas of cigarette packaging are extracted from the aggregated feature map; The decoding process is guided by the normal prototype features, and the aggregated feature map is reconstructed to obtain the normal features reconstructed by the decoder; Based on the difference between the normal features reconstructed by the decoder and the multi-scale latent feature maps, the final defect detection result is obtained, including at least one of the following predetermined cigarette packaging defects: folded seal, reversed seal, damaged seal, ruptured box packaging paper, and damaged inner lining paper.
2. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 1, characterized in that, The extraction of normal prototype features characterizing normal areas of cigarette packaging from the aggregated feature map includes: Wavelet transform is performed on the aggregated feature map to obtain multiple sub-bands containing low-frequency and high-frequency components; The multiple sub-bands are fused to obtain wavelet transform enhanced features; By using a learnable prototype token and a cross-attention mechanism, normal patterns are aggregated from the wavelet transform enhanced features to obtain initial normal prototype features. The initial normal prototype features are input into the prototype refinement hybrid expert system. The outputs of multiple expert networks are dynamically selected and aggregated through the routing network to obtain the refined normal prototype features.
3. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 2, characterized in that, The cigarette packaging defect detection method further includes calculating the normal prototype aggregation loss, which is the cosine distance between a single normal image patch feature and its corresponding nearest normal prototype feature. By minimizing this loss, the learnable prototype token is constrained to capture only normal patterns.
4. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 1, characterized in that, The process of guiding the decoding process with the normal prototype features, reconstructing the aggregated feature map to obtain the decoder-reconstructed normal features includes: The decoder's multiple attention heads are divided into global attention heads and local attention heads; The global attention head is used to model the global spatial dependencies and semantic consistency of cigarette packaging layout; The local attention head enhances sensitivity to local neighborhood structures by introducing convolution operations with different receptive fields into the query vector branch. The outputs of the global attention head and the local attention head are fused to obtain the normal features reconstructed by the decoder.
5. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 4, characterized in that, The local attention head processes the query vector using convolution kernels of different sizes, and then concatenates the processed vectors to reorganize and aggregate local neighborhood structure information.
6. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 1, characterized in that, The multi-scale latent feature map extracted from the original image data of cigarette packaging includes: The original image of the cigarette packaging is segmented into a patch sequence, and after adding position encoding, it is input into the encoder; Intermediate outputs are extracted from different depth layers of the encoder and recombined to form two-dimensional spatial feature maps corresponding to different scales, thus constituting a multi-scale latent feature representation.
7. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 1, characterized in that, The step of aggregating the multi-scale latent feature maps to obtain an aggregated feature map includes: The multi-scale latent feature maps are summed pixel by pixel along the layer dimension and averaged to obtain an aggregated feature map that serves as a compact representation of the image.
8. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to claim 1, characterized in that, The step of obtaining the final defect detection result based on the difference between the normal features reconstructed by the decoder and the multi-scale latent feature map includes: During the testing phase, the cosine distance in the spatial dimension between the multi-scale final reconstructed feature map output by the decoder and the multi-scale latent feature map of the corresponding layer of the encoder is calculated layer by layer to generate the initial defect response map at multiple scales. The initial defect response maps at multiple scales are stitched together along the channel dimension and global average pooling is performed to obtain a defect detection heatmap aligned with the spatial resolution of the input image.
9. The unsupervised cigarette packaging defect detection method based on structure reconstruction according to any one of claims 1 to 8, characterized in that, The method also includes a total loss function during the training phase, which is a weighted sum of reconstruction loss and normal prototype aggregation loss. The reconstruction loss is the average of the sum of the cosine distances between the decoder's multi-scale final reconstructed features and the corresponding layers of the encoder's multi-scale latent features.