A carton counting method

By employing an attention-based image enhancement and convolutional network feature extraction method, the problems of image brightness and blurring in cigarette box counting were solved, achieving accurate cigarette box quantity detection and improving the efficiency and accuracy of automated counting.

CN117218108B · Active · Publication Date: 2026-06-19 · CHINA TOBACCO ZHEJIANG IND CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TOBACCO ZHEJIANG IND CO LTD
Filing Date
2023-10-12
Publication Date
2026-06-19


Abstract

This invention provides a method for counting cigarette boxes, comprising the following steps: step S1, capturing image data from any angle within a tobacco warehouse and preprocessing the image data; step S2, using a trained attention-based image enhancement module to denoise and deblur the preprocessed image data; step S3, using a trained convolutional network that combines bottom-up and top-down structures as a feature extraction module to extract features from the image data output in step S2; and step S4, inputting the image data output in step S2 and the feature map output in step S3 into a trained counting module with similarity comparison and feature enhancement functions to calculate the number of cigarette boxes. The method improves the accuracy of cigarette box count detection.

Description

Technical Field

[0001] This invention belongs to the field of counting technology, and specifically relates to a method for counting cigarette boxes.

Background Technology

[0002] In modern warehouse management systems, although most operations have been automated, manual intervention is still required in certain critical stages, such as outbound loading. This reliance on manpower is not only prone to errors but also consumes significant time and labor. To overcome these problems, tobacco warehouses have begun adopting automated cigarette carton counting systems. Currently, using image technology for counting is a popular solution.

[0003] Some manufacturers have begun to experiment with unmanned aerial vehicles (UAVs) to improve the efficiency of inventory counting. UAVs, controlled by radio and onboard programs, carry cameras that capture high-resolution images and sensors that locate goods, so that the quantity of tobacco products in stock can be calculated. However, these methods still face challenges in image quality. Limited lighting in the shooting environment can leave captured images too dark; poor lighting or the image sensor itself can introduce noise; and vehicle movement during loading can blur the image. Furthermore, variations in the shape or sales style of cigarette cartons from different batches can also affect the counting results. In practice, therefore, counting cigarette cartons requires brightness adjustment, noise elimination, blur removal, and adaptation to changes in carton shape, and existing counting methods cannot accurately determine the quantity of cigarette cartons. Providing a cigarette carton counting method that achieves better counting results has thus become a pressing technical problem for those skilled in the art.

Summary of the Invention

[0004] The purpose of this invention is to provide a method for counting cigarette boxes to solve the above-mentioned technical problems in the prior art.

[0005] To achieve the above objectives, the present invention provides the following technical solution:

[0006] A method for counting cigarette cartons includes the following steps:

[0007] Step S1: Capture image data from any angle inside the tobacco warehouse and preprocess the image data;

[0008] Step S2: Use the trained attention-based image enhancement module to perform noise reduction and deblurring on the preprocessed image data;

[0009] Step S3: Use the trained convolutional network combining bottom-up and top-down structures as a feature extraction module to extract features from the image data output in step S2.

[0010] Step S4: Input the image data output in step S2 and the image data output in step S3 into the trained counting module with similarity comparison function and feature enhancement function to calculate the number of cigarette boxes.

[0011] Preferably, during the training phase, image data of the tobacco warehouse are collected from various shooting angles, and each image is preprocessed by standardization and zero-mean normalization to construct an image data training set.

[0012] Preferably, the image data standardization process is performed according to the following formula:

[0013] $x^{*} = \dfrac{x - \mu}{\sigma}$

[0014] Where x is a data sample in the training set, μ is the mean of the data samples, σ is the standard deviation of the data samples, and x* is the standardized data sample.

[0015] Preferably, zero-mean normalization of image data involves subtracting, in each dimension, the average value of the image data in that dimension from the standardized image data.

[0016] Preferably, the attention-based image enhancement module includes:

[0017] The noise reduction module is used to reduce noise in the preprocessed image data;

[0018] The deblur module is used to deblur the denoised image data.

[0019] Preferably, the noise reduction module includes a feature extraction module, a feature learning module based on residual structure, and an image reconstruction module connected in sequence. The feature extraction module in the noise reduction module includes three convolutional layers connected in series; the feature learning module includes three convolutional layers connected in series, each of which is followed by an activation function; the image reconstruction module includes two convolutional layers and a batch normalization layer connected in sequence.

[0020] Preferably, the deblurring module uses an RNN-based network structure, which also incorporates a classic LSTM network and an attention mechanism.

[0021] Preferably, the deblurring module includes two convolutional blocks and an activation function layer arranged in series. The first convolutional block includes an LSTM module, with three convolutional layers placed before the LSTM module and three after it. The second convolutional block includes an attention mechanism layer and a convolutional layer, and its input and output are connected through a residual connection. The feature map output after the activation function layer is denoted as feature map C.

[0022] Preferably, in step S3, the feature extraction module includes four residual blocks connected in series from top to bottom; each residual block consists of three convolutional layers followed by an activation function, and the input and output of the residual block are connected through a residual connection; each residual block halves the size of the feature map; the top-down network structure of the feature extraction module is as follows: the four residual blocks process the feature map C in sequence from top to bottom, producing four feature maps denoted C1, C2, C3, and C′4 in sequence. The input to the feature extraction module is then passed through a pooling layer to obtain a feature map of the same size as C′4. These two feature maps of the same size are superimposed by averaging the values at each pixel position to obtain the final output C4 of the bottom residual block;

[0023] The bottom-up network structure of the feature extraction module is as follows: first, the feature map C4 obtained in the previous step is upsampled to obtain a feature map of the same size as C3. This feature map and C3 are then superimposed to obtain feature map C′3. Feature maps C′2 and C′1 are obtained in the same way, and C′1, as the final output of the feature extraction module, is passed to the subsequent counting module for further processing.

[0024] Preferably, the specific content of step S4 is as follows:

[0025] Step S41, Similarity Comparison: Use the image data output in step S2 as the support image, and use the feature map C′1 output in step S3 as the query image; then, use the support image as the convolution kernel to perform a convolution operation on the query image to obtain feature map D1. Next, pass feature map D1 through a convolution block consisting of two convolutional layers with a dilated convolutional layer between them to obtain feature map D2; finally, use D2 as the convolution kernel to perform a convolution operation on the query image to obtain feature map D3.

[0026] Step S42, Feature Enhancement: The support image is passed through a downsampling convolution operation and used as the convolution kernel to perform a convolution operation on feature map D3 to obtain feature map D4; then the query image is passed through a downsampling convolution layer to obtain a feature map of the same size as feature map D4, and then this feature map is concatenated with feature map D4 to obtain feature map D5; finally, feature map D5 is passed through a convolution structure that first downsamples and then upsamples to obtain a density map D with the same size as the support image.

[0027] Step S43, Counting: Sum the pixel densities of the cigarette boxes in the density map D to obtain the number of cigarette boxes.

[0028] The beneficial effects of this invention are as follows:

[0029] The cigarette box counting method of the present invention employs an image enhancement module based on an attention mechanism to perform noise reduction and deblurring on the image. It also uses a convolutional network combining bottom-up and top-down structures as a feature extraction module to extract features from the image data. Furthermore, it uses a counting module with similarity comparison and feature enhancement functions to calculate the number of cigarette boxes. This improves the accuracy of cigarette box quantity detection and thus yields a more accurate number of cigarette boxes.

Attached Figure Description

[0030] To illustrate the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings used in the embodiments are briefly described below, and the specific embodiments of the present invention are described in further detail with reference to the drawings, wherein:

[0031] Figure 1 is a flowchart of the cigarette box counting method provided in an embodiment of the present invention;

[0032] Figure 2 is a structural diagram of the noise reduction module provided in an embodiment of the present invention;

[0033] Figure 3 is a structural diagram of the deblurring module provided in an embodiment of the present invention;

[0034] Figure 4 is a structural diagram of the feature extraction module provided in an embodiment of the present invention;

[0035] Figure 5 is a structural diagram of the counting module provided in an embodiment of the present invention.

Detailed Implementation

[0036] To enable those skilled in the art to better understand the technical solution of the present invention, the present solution will be further described in detail below with reference to specific embodiments.

[0037] As shown in Figure 1, this embodiment of the invention provides a method for counting cigarette cartons, which includes the following steps:

[0038] Step S1: Capture image data from any angle inside the tobacco warehouse and preprocess the image data;

[0039] Step S2: Use the trained attention-based image enhancement module to perform noise reduction and deblurring on the preprocessed image data;

[0040] Step S3: Use the trained convolutional network combining bottom-up and top-down structures as a feature extraction module to extract features from the image data output in step S2.

[0041] Step S4: Input the image data output in step S2 and the image data output in step S3 into the trained counting module with similarity comparison function and feature enhancement function to calculate the number of cigarette boxes.
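
The data flow between steps S1 to S4 can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the function name `count_cartons` and the four stage callables are hypothetical placeholders for the trained modules. The point it shows is that step S4 receives both the enhanced image from step S2 and the feature output from step S3.

```python
from typing import Any, Callable


def count_cartons(
    raw_image: Any,
    preprocess: Callable[[Any], Any],        # step S1 (hypothetical stand-in)
    enhance: Callable[[Any], Any],           # step S2: denoise, then deblur
    extract_features: Callable[[Any], Any],  # step S3
    count: Callable[[Any, Any], float],      # step S4
) -> float:
    """Chain steps S1 to S4; every stage is supplied by the caller."""
    preprocessed = preprocess(raw_image)      # S1
    enhanced = enhance(preprocessed)          # S2
    features = extract_features(enhanced)     # S3 consumes the S2 output
    return count(enhanced, features)          # S4 consumes the S2 and S3 outputs
```

Passing the stages in as arguments keeps the sketch independent of any particular network library; in practice each callable would wrap a trained model.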

[0042] The cigarette box counting method provided in this invention employs an image enhancement module based on an attention mechanism to denoise and deblur the image. It also uses a convolutional network combining bottom-up and top-down structures as a feature extraction module to extract features from the image data. Furthermore, it uses a counting module with similarity comparison and feature enhancement functions to calculate the number of cigarette boxes in the cigarette warehouse. This effectively improves the accuracy of cigarette box quantity detection, achieves better counting results, and ultimately obtains a more accurate number of cigarette boxes.

[0043] Furthermore, during the training phase, image data of the tobacco warehouse are collected from various shooting angles, and each image is preprocessed by standardization and zero-mean normalization to construct an image data training set for training the image enhancement module, the feature extraction module, and the counting module. It will be understood that, in the implementation phase, the preprocessing in step S1 likewise consists of standardizing and zero-mean normalizing the image data.

[0044] Specifically, the standardization of image data is performed according to the following formula:

[0045] $x^{*} = \dfrac{x - \mu}{\sigma}$

[0046] Where x is a data sample in the training set, μ is the mean of the data samples, σ is the standard deviation of the data samples, and x* is the standardized data sample.

[0047] Specifically, zero-mean normalization of image data involves subtracting, from the standardized data, the average value of the image data in each dimension. This centers every dimension of the input image data at 0, as shown below:

[0048] $x'_i = x_i - \dfrac{1}{N}\sum_{j=1}^{N} x_i^{(j)}$

[0049] Where N is the total number of data samples in the training set, $x_i$ is the value of the i-th dimension of the data being processed, $x_i^{(j)}$ is the value of the i-th dimension of the j-th data sample in the training set, and $x'_i$ is the value of the i-th dimension after zero-mean normalization.
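
The two preprocessing operations described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the function names are hypothetical, the training set is assumed to be an array whose first axis indexes the N samples, and μ and σ are assumed to be scalars computed over the whole set.

```python
import numpy as np


def standardize(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Standardize a sample: (x - mu) / sigma."""
    return (x - mu) / sigma


def zero_mean(batch: np.ndarray) -> np.ndarray:
    """Subtract the per-dimension mean taken over the N samples (axis 0)."""
    return batch - batch.mean(axis=0, keepdims=True)


def preprocess(batch: np.ndarray) -> np.ndarray:
    # Assumption: mu and sigma are scalars computed over the whole batch.
    mu, sigma = batch.mean(), batch.std()
    return zero_mean(standardize(batch, mu, sigma))
```

After `preprocess`, the mean of every dimension across the samples is 0, which is the centering property the zero-mean step is meant to provide.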

[0050] Further, in step S2, the attention-based image enhancement module includes:

[0051] The noise reduction module is used to reduce noise in the preprocessed image data;

[0052] The deblur module is used to deblur the denoised image data.

[0053] As shown in Figure 2, the denoising module includes a feature extraction module, a residual structure-based feature learning module, and an image reconstruction module connected in sequence. The feature extraction module comprises three convolutional layers in series; the feature learning module comprises three convolutional layers in series, each followed by an activation function; the image reconstruction module comprises two convolutional layers and a batch normalization layer connected in sequence. With this structure, the network can denoise cigarette box images with few parameters and high computational efficiency, making it well suited to real-time denoising in practical applications. Preferably, the activation function is the ReLU activation function.

[0054] Specifically, the deblurring module uses an RNN-based network structure that also incorporates the classic LSTM module, thereby enhancing the representational power of the feature maps. The structure additionally employs an attention mechanism, which makes better use of global information and focuses on the more informative features. The LSTM module is in common use and is not elaborated upon here.

[0055] As shown in Figure 3, the deblurring module includes two convolutional blocks connected in series and an activation function layer. The first convolutional block includes an LSTM module, with three convolutional layers before the LSTM module and three after it. The second convolutional block includes an attention mechanism layer and a convolutional layer, and its input and output are connected through a residual connection. After the activation function layer, the module outputs a feature map, denoted as feature map C, expressed as follows:

[0056] C = ConvA(ConvL(X))

[0057] Here, X is the input of the deblurring module, ConvL() is the convolutional block containing the LSTM module, and ConvA() is the convolutional block containing the attention mechanism. It will be understood that the deblurring module uses an RNN (Recurrent Neural Network)-based network structure to gradually restore the sharp image in the pyramid-structured feature extraction layers, thus achieving image deblurring. Since the LSTM module is a special type of RNN module, the deblurring module of this invention is an extension of that module. Three convolutional layers precede the LSTM module and three follow it. The first three sequentially decrease the size of the input image data, while the last three sequentially increase it until it matches the input size, thereby improving the stability and effectiveness of the deblurring module.

[0058] It will be understood that, during the training of the image enhancement module, the L2 loss can be used to measure the difference between the output image and the real image. The loss function is expressed as follows:

[0059] $\mathrm{loss1} = \dfrac{1}{M}\sum_{i=1}^{M}\lVert y_i - \hat{y}_i \rVert_2^{2}$

[0060] Where M is the number of image data, $y_i$ and $\hat{y}_i$ are the deblurred image and the real image, respectively, and loss1 is the L2 loss function.
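
As an illustration of the L2 training loss described above, the following NumPy sketch averages the squared L2 norm of the difference between each deblurred image and its real counterpart over the M images. The averaging over M and the function name `l2_loss` are assumptions made for this sketch; the text specifies only that an L2 loss is used.

```python
import numpy as np


def l2_loss(outputs: np.ndarray, targets: np.ndarray) -> float:
    """Mean over M images of the squared L2 norm of (output - target)."""
    if outputs.shape != targets.shape:
        raise ValueError("outputs and targets must have the same shape")
    m = outputs.shape[0]                               # M: number of images
    diff = (outputs - targets).reshape(m, -1)          # flatten each image
    return float(np.mean(np.sum(diff ** 2, axis=1)))   # average over images
```

The loss is 0 when every output matches its target exactly and grows with the squared pixel-wise error.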

[0061] Furthermore, in step S3, the feature extraction module includes four residual blocks connected in series from top to bottom, as shown in Figure 4. Each residual block consists of three convolutional layers followed by an activation function, and the input and output of the residual block are connected through a residual connection. Each residual block halves the size of the feature map. The top-down network structure of the feature extraction module is as follows: the four residual blocks process the feature map C in sequence from top to bottom, producing four feature maps denoted C1, C2, C3, and C′4 respectively. The input of the feature extraction module is then passed through a pooling layer to obtain a feature map of the same size as C′4. These two feature maps of the same size are superimposed by averaging the values at each pixel position to obtain the final output C4 of the bottom residual block, thereby reducing information loss during feature extraction.

[0062] The bottom-up network structure of the feature extraction module is as follows: First, the feature map C4 obtained in the previous step is upsampled to obtain a feature map with the same size as C3. Then, this feature map and C3 are superimposed to obtain feature map C′3. In this way, feature maps C′2 and C′1 are obtained in turn. C′1 is used as the final output of the feature extraction module and output to the subsequent counting module for subsequent operations.

[0063] This approach uses a convolutional network combining bottom-up and top-down structures as the feature extraction module, which can effectively focus on accurate target location information at low levels while also focusing on rich semantic information at high levels.
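
The size bookkeeping of the two passes described above can be sketched in NumPy. This is a simplified illustration with several assumptions: a 2×2 average pool stands in for each trained residual block (no learned weights), "superimposed" in the second pass is taken to mean element-wise addition, the input is a single-channel square map whose side is divisible by 16, and the function names are hypothetical.

```python
import numpy as np


def avg_pool(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a 2-D map by an integer factor (sides must be divisible)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample2(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour upsampling by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def extract_features(c: np.ndarray) -> np.ndarray:
    # First pass: four stages, each halving the map. A 2x2 average pool
    # stands in for a trained residual block (assumption).
    c1 = avg_pool(c, 2)
    c2 = avg_pool(c1, 2)
    c3 = avg_pool(c2, 2)
    c4_raw = avg_pool(c3, 2)                 # C'4 in the text
    # Pool the module input to the size of C'4 and average the two maps.
    c4 = (c4_raw + avg_pool(c, 16)) / 2.0    # C4
    # Second pass: upsample and merge with the next larger map.
    # Assumption: "superimposed" means element-wise addition here.
    c3_merged = upsample2(c4) + c3           # C'3
    c2_merged = upsample2(c3_merged) + c2    # C'2
    c1_merged = upsample2(c2_merged) + c1    # C'1, the module output
    return c1_merged
```

With a 32×32 input, the intermediate maps are 16×16, 8×8, 4×4, and 2×2, and the output C′1 is 16×16, half the input size.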

[0064] As shown in Figure 5, the specific content of step S4 is as follows:

[0065] Step S41, Similarity Comparison: The image data output in step S2 is used as the support image, and the image data output in step S3, i.e., feature map C′1, is used as the query image. Then, the support image is used as the convolution kernel to perform a convolution operation on the query image, where the stride can be 1, to obtain feature map D1. Next, feature map D1 is passed through a convolution block consisting of two convolutional layers with a dilated convolutional layer between them to obtain feature map D2. Finally, D2 is used as the convolution kernel to perform a convolution operation on the query image to obtain feature map D3.

[0066] Step S42, Feature Enhancement: The support image is passed through a downsampling convolution operation and used as the convolution kernel to perform a convolution operation on feature map D3 to obtain feature map D4; then the query image is passed through a downsampling convolution layer to obtain a feature map of the same size as feature map D4, and then this feature map is concatenated with feature map D4 to obtain feature map D5; finally, feature map D5 is passed through a convolution structure that first downsamples and then upsamples to obtain a density map D with the same size as the support image.

[0067] Step S43, Counting: Sum the pixel densities of the cigarette boxes in the density map D to obtain the number of cigarette boxes in the tobacco warehouse.
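
The core operation of the similarity comparison in step S41, using one map as the convolution kernel on another, can be sketched as follows. This is a toy illustration under stated assumptions: both maps are single-channel 2-D arrays, the kernel is smaller than the query, no padding is applied, and the kernel is not flipped (the cross-correlation convention common in neural-network libraries). The function name `correlate_valid` is hypothetical.

```python
import numpy as np


def correlate_valid(query: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide `kernel` over `query` (no padding) and sum element-wise products."""
    qh, qw = query.shape
    kh, kw = kernel.shape
    out_h = (qh - kh) // stride + 1
    out_w = (qw - kw) // stride + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(query[r:r + kh, c:c + kw] * kernel)
    return out
```

The response is largest where the query region most resembles the kernel, which is what makes the result usable as a similarity map.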

[0068] It will be understood that, during the training phase of the counting module, the generated density map is compared with the corresponding true density map, and the mean squared error (MSE) is used to measure the difference between them. Minimizing this loss through backpropagation improves the accuracy and quality of cigarette box counting. The loss function is expressed as follows:

[0069] $\mathrm{loss2} = \dfrac{1}{Q}\sum_{i=1}^{Q}\left(y_i - y'_i\right)^{2}$

[0070] Where Q is the total number of pixels, $y_i$ and $y'_i$ are the true value and the predicted value of the i-th pixel, respectively, and loss2 is the mean squared error loss function.
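
The counting step and the training loss of the counting module can be illustrated with a short NumPy sketch. The function names are hypothetical, and the density maps are assumed to be 2-D arrays in which each pixel holds a fractional object density, so that the sum over all pixels gives the estimated count.

```python
import numpy as np


def count_from_density(density_map: np.ndarray) -> float:
    """Estimated object count: the sum of all pixel densities."""
    return float(density_map.sum())


def mse_loss(true_map: np.ndarray, predicted_map: np.ndarray) -> float:
    """Mean squared error over the Q pixels of the two density maps."""
    if true_map.shape != predicted_map.shape:
        raise ValueError("density maps must have the same shape")
    return float(np.mean((true_map - predicted_map) ** 2))
```

A predicted map that places too little density on each box yields both an undercount and a positive MSE, which is the signal that training minimizes.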

[0071] The above are merely preferred embodiments of the present invention. It should be noted that these embodiments are only used to illustrate the present invention and are not intended to limit the scope of the present invention. Moreover, after reading the contents of the present invention, those skilled in the art can make various modifications or alterations to the present invention, and these equivalent forms also fall within the scope defined by the appended claims.

Claims

1. A carton counting method, characterized in that it includes the following steps:
Step S1: Capture image data from any angle inside the tobacco warehouse and preprocess the image data;
Step S2: Use the trained attention-based image enhancement module to perform noise reduction and deblurring on the preprocessed image data;
Step S3: Use the trained convolutional network combining bottom-up and top-down structures as a feature extraction module to extract features from the image data output in step S2 to obtain a feature map;
Step S4: Input the image data output in step S2 and the feature map output in step S3 into the trained counting module with similarity comparison and feature enhancement functions to calculate the number of cigarette boxes;
in step S2, the attention-based image enhancement module includes: a noise reduction module, used to reduce noise in the preprocessed image data; and a deblurring module, used to deblur the noise-reduced image data;
the noise reduction module includes a feature extraction module, a residual structure-based feature learning module, and an image reconstruction module connected in sequence; the feature extraction module in the noise reduction module includes three convolutional layers connected in series; the feature learning module includes three convolutional layers connected in series, each of which is followed by an activation function; the image reconstruction module includes two convolutional layers and a batch normalization layer connected in sequence;
the deblurring module uses an RNN-based network structure, which also incorporates a classic LSTM network and an attention mechanism; the deblurring module includes two convolutional blocks and an activation function layer arranged in series; the first convolutional block includes an LSTM module, with three convolutional layers before the LSTM module and three after it; the first three convolutional layers sequentially decrease the size of the input image data, while the last three convolutional layers sequentially increase the size of the image data until it matches the input size, thus improving the stability of the deblurring module; the second convolutional block includes an attention mechanism layer and a convolutional layer, the input and output of this convolutional block are connected through a residual connection, and the output after the activation function layer is a feature map, denoted as feature map C;
the feature extraction module in step S3 includes four residual blocks connected in series from top to bottom; each residual block consists of three convolutional layers followed by an activation function, and the input and output of the residual block are connected through a residual connection; each residual block halves the size of the feature map; the top-down network structure of the feature extraction module in step S3 is as follows: the four residual blocks process the feature map C in sequence from top to bottom, producing four feature maps denoted C1, C2, C3, and C′4 in sequence; the input to the feature extraction module is then passed through a pooling layer to obtain a feature map of the same size as C′4; the two feature maps of the same size are combined by averaging them pixel-wise to obtain the final output C4 of the bottommost residual block;
the bottom-up network structure of the feature extraction module in step S3 is as follows: first, the feature map C4 obtained in the previous step is upsampled to obtain a feature map of the same size as C3, and this feature map and C3 are superimposed to obtain feature map C′3; feature maps C′2 and C′1 are obtained in the same way, and C′1, as the final output of the feature extraction module in step S3, is passed to the subsequent counting module for further operations;
the specific content of step S4 is as follows:
Step S41, Similarity Comparison: Use the image data output in step S2 as the support image, and the feature map C′1 output in step S3 as the query image; then, use the support image as the convolution kernel to perform a convolution operation on the query image to obtain feature map D1; next, pass feature map D1 through a convolution block consisting of two convolutional layers with a dilated convolutional layer between them to obtain feature map D2; finally, use D2 as the convolution kernel to perform a convolution operation on the query image to obtain feature map D3;
Step S42, Feature Enhancement: The support image is passed through a downsampling convolution operation and then used as the convolution kernel to perform a convolution operation on feature map D3 to obtain feature map D4; then, the query image is passed through a downsampling convolutional layer to obtain a feature map of the same size as feature map D4, and this feature map is concatenated with feature map D4 to obtain feature map D5; finally, feature map D5 is passed through a convolutional structure that first downsamples and then upsamples to obtain a density map D with the same size as the support image;
Step S43, Counting: Sum the pixel densities of the cigarette boxes in the density map D to obtain the number of cigarette boxes;
during the training phase, image data of the tobacco warehouse are collected from various shooting angles, and each image is preprocessed by standardization and zero-mean normalization to construct an image data training set;
during the training of the image enhancement module, the L2 loss is used to measure the difference between the output image and the real image, and the loss function is expressed as follows: $\mathrm{loss1} = \dfrac{1}{M}\sum_{i=1}^{M}\lVert y_i - \hat{y}_i \rVert_2^{2}$; where M is the number of image data, $y_i$ and $\hat{y}_i$ are the deblurred image and the real image, respectively, and loss1 is the L2 (two-norm) loss function.

2. The carton counting method of claim 1, wherein image data standardization is performed according to the following formula: $x^{*} = \dfrac{x - \mu}{\sigma}$; where x is a data sample in the training set, μ is the mean of the data samples, σ is the standard deviation of the data samples, and x* is the standardized data sample.

3. The carton counting method of claim 2, wherein zero-mean normalization of image data involves subtracting, in each dimension, the average value of the image data in that dimension from the standardized image data.