DH-YOLO-based method for collaborative detection of Saccharomyces cerevisiae and miscellaneous bacteria
An improved DH-YOLO network model that combines the Dynamic-HGNetV2 backbone structure, the dynamic convolutional hybrid module DIMM, and C2PSA-RFFN resolves the trade-off between accuracy and resource requirements in the detection of Saccharomyces cerevisiae and other microorganisms, achieving efficient and accurate detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIANGNAN UNIV
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for detecting brewer's yeast and other microorganisms face a trade-off between detection accuracy and computational resource requirements, resulting in low detection efficiency and difficulty in deployment on the production site. In particular, they are prone to missed or false detections when cells are morphologically similar or differ only subtly.
An improved DH-YOLO network model is adopted, which enhances feature extraction and detection capabilities, reduces computational load, and improves detection accuracy through the Dynamic-HGNetV2 backbone structure, the Dynamic Convolution Hybrid Module (DIMM), and the integrated Sharp Frequency Feedforward Module (C2PSA-RFFN).
The method significantly improves the accuracy and efficiency of synergistic detection of Saccharomyces cerevisiae and other microorganisms, reduces the number of model parameters, and adapts to the resource limitations of embedded detection terminals.
Figure CN122243932A_ABST
Description
Technical Field
[0001] This invention relates to a DH-YOLO-based method for the synergistic detection of brewer's yeast and other microorganisms, belonging to the interdisciplinary technical field of microbial detection and image processing.
Background Technology
[0002] As the core functional microorganism in the fermentation process of alcoholic beverages, the activity and purity of brewing yeast directly determine the flavor, yield, and stability of the fermentation process. However, due to factors such as instrument leakage and overnight microbial contamination of the starter culture, contamination by other microorganisms (such as Candida albicans, lactic acid bacteria, and acetic acid bacteria) is common during fermentation. This can lead to abnormal fermentation, spoilage of the alcoholic beverage, and even food safety risks. Furthermore, once the fermentation system is contaminated, these contaminants will compete with the brewing yeast for limited nutrients, thereby inhibiting the normal metabolism and proliferation of the brewing yeast and weakening its fermentation capacity. In cases of severe contamination, the growth rate of contaminants can even exceed that of the brewing yeast, ultimately leading to fermentation failure. Therefore, achieving rapid and accurate detection of the types and quantities of brewing yeast and other contaminants is not only a crucial link in the quality control of the alcoholic beverage production process and the evaluation of the finished product quality, but also a significant technical challenge facing the industrial fermentation field.
[0003] In recent years, with the rapid development of artificial intelligence, computer vision, and image processing technologies, deep learning-based detection methods have shown great potential, and researchers have applied them to detecting yeast cells during fermentation. These models automatically learn the morphology, texture, color, and other discriminative features of cells from large numbers of microscopic images and automatically identify cell location and type. Contaminants in the fermentation process can thus be detected promptly, supporting the dynamic adjustment of fermentation strategies and intelligent control of the production process.
[0004] While microbial detection is crucial for ensuring the quality of industrial fermentation products, current detection technologies still face several bottlenecks in practical applications. Firstly, contaminating microorganisms and brewer's yeast are highly similar in cell morphology, with only subtle differences between cells at different growth stages. Furthermore, the small size of budding cells leads to frequent false detections and missed detections with traditional methods. To ensure reliable results, personnel must manually verify large numbers of suspected targets, which increases labor costs, extends the detection cycle, and reduces overall production efficiency. Secondly, while existing research has improved the detection accuracy of deep learning models to some extent through algorithm optimization, such improvements often involve a significant increase in model parameters and computational load, requiring high-performance GPUs for inference. However, embedded or portable detection terminals used in actual production environments are limited by size, power consumption, and cost, so their storage and computing resources cannot meet the deployment requirements of complex models. This contradiction prevents many advanced detection algorithms from being effectively deployed in production, severely hindering the advancement of intelligent fermentation processes.
Summary of the Invention
[0005] To address the shortcomings of the YOLOv11 network in capturing fine-grained cell edge texture features and in handling microscopic defocus when jointly detecting Saccharomyces cerevisiae and other microorganisms across their life cycles, this invention provides a DH-YOLO-based method for co-detection of Saccharomyces cerevisiae and other microorganisms, comprising:
Step 1: Prepare the dataset by dividing the labeled Saccharomyces cerevisiae and miscellaneous bacteria cell image dataset into a training set and a validation set.
Step 2: Construct a DH-YOLO model based on the improved YOLOv11. The DH-YOLO model includes a backbone feature extraction network, a Neck feature fusion network, and a head detection network.
Step 3: Train the constructed DH-YOLO model using the training set prepared in Step 1, and set early stopping conditions to obtain the early-stop model.
Step 4: Input the images of Saccharomyces cerevisiae and other microorganisms in the validation set into the early-stop model obtained through training for validation, and then save the best model.
Step 5: Use the best model to detect the images of the yeast and other microorganisms to be detected, and obtain the visual detection results for those images.
[0006] Optionally, in Step 2, the Backbone uses the Dynamic-HGNetV2 network for feature extraction and reduces model parameters through an hourglass structure and depthwise separable convolution, avoiding the vanishing-gradient problem while enhancing feature extraction capabilities; the Neck uses a PANet+FPN structure, which has top-down and bottom-up feature fusion paths to fuse feature information between feature maps of different sizes; the Head completes the regression and classification tasks through decoupled heads, directly predicting the center point or boundary point of the target.
[0007] Optionally, Step 2 includes:
Step 2.1: Select Dynamic-HGNetV2 as the backbone feature extraction network. Dynamic-HGNetV2 includes the multi-scale hourglass HG-Stem module, n sequentially executed dynamic gated frequency-domain modules (Dynamic-HGblock), the dynamic convolution module Dynamic-Conv, the multi-scale pyramid pooling layer module SPPF, and the integrated sharp-frequency feedforward module C2PSA-RFFN.
Step 2.2: Design the structure of the HG-Stem module, the Dynamic-HGblock module, the Dynamic-Conv dynamic convolution module, and the C2PSA-RFFN integrated sharp-frequency feedforward module in the Dynamic-HGNetV2 network.
Step 2.3: In the feature fusion network (Neck), design an embedded dynamic convolutional hybrid module (DIMM) to replace the original C3k2 module of YOLOv11.
[0008] Optionally, the multi-scale hourglass HG-Stem module performs 4x downsampling of the input image through two parallel branches of max pooling and convolution. One branch uses two max pooling and convolution operations, while the other uses one double max pooling and convolution operation, so that the two branches produce differentiated results. The n sequentially executed Dynamic-HGblock modules perform dynamic convolution and standard convolution to extract cell image features. The dynamic convolution module Dynamic-Conv includes an average pooling layer, a fully connected layer, a sigmoid function, a routed convolution module CondConv2d, an ordinary convolution, residual connections, and a CBS module for dimensionality expansion. The routed convolution module adaptively activates the corresponding expert convolution kernels according to the cell size, thereby constructing multi-scale information and performing channel fusion; after compression and restoration, the features required for the next block are obtained.
[0009] Optionally, the integrated sharp-frequency feedforward module processes the input feature map as follows. First, a 1×1 convolution expands the channel dimension, doubling the number of channels. A 3×3 depthwise separable convolution then aggregates spatial context information, and its output is divided evenly into two feature maps. A gating mechanism is applied to these two feature maps by element-wise multiplication to obtain the gated features, whose dimensions are the number of image channels, the width, and the height. The gated features are calibrated by a 1×1 convolutional layer. Next, frequency-domain adaptive filtering is performed: the spatial-domain output is padded so that its height and width are both multiples of the patch size and is divided by a reshape operation into non-overlapping image patches, giving an output feature map with a transformed height and width. A two-dimensional real-valued Fourier transform is performed on each patch to convert the signal from the spatial domain to the frequency domain and obtain the spectrum. In the frequency domain, a learnable, input-independent complex filter modulates the spectrum element by element to obtain the modulated spectrum. The filtered spectrum is converted back to the spatial domain by an inverse two-dimensional real-valued Fourier transform, then reshaped and cropped to the original input size to obtain the enhanced feature map.
[0010] Optionally, the dynamic convolutional hybrid module (DIMM) first applies batch normalization to the input feature map and divides it evenly along the channel dimension into two parts. The two sub-feature maps are fed into two dynamic perception hybrid modules (DID) that have identical structure but independent parameters. The outputs of the two paths are concatenated along the channel dimension, then fused and restored to the original number of channels by a 1×1 convolution to obtain the fused features. To ensure stable training of deep networks and promote gradient flow, the module introduces a layer scaling parameter: a learnable vector with very small initial values scales the output of the residual branch channel by channel to obtain the scaled fused features. The scaled fused features are added to the input to form the first residual sub-block. The resulting feature then passes to a second sub-block of the same form, which contains another BN layer and a convolutional GELU feedforward network; its output is likewise scaled and added through a residual connection to obtain the final output.
[0011] Optionally, the dynamic perception hybrid module (DID) first applies global average pooling to the input feature map. The pooled result is expanded in dimension by a 1×1 convolution and reshaped, and softmax is then applied to obtain the soft attention weights. Simultaneously, the original feature map is processed by 3×3, 1×11, and 11×1 convolutions to extract cell edge texture features at different scales. Finally, the output of each branch is multiplied element-wise by the soft attention weights generated from the global context, and the results are summed along the first dimension to obtain the module output.
[0012] Optionally, the early stopping condition set in step 3 is: the model's mAP value does not improve during 100 training rounds.
[0013] The present invention also provides a method for detecting the growth stage of miscellaneous bacteria during the fermentation process of alcoholic beverages. The method is based on the above-mentioned method for synergistic detection of brewing yeast and miscellaneous bacteria to detect miscellaneous bacteria, and then determines the growth stage of miscellaneous bacteria based on the morphological change rate of miscellaneous bacteria.
[0014] The present invention also provides the application of the above method in the fermentation process of alcoholic beverages.
[0015] The beneficial effects of this invention are as follows. This invention provides a DH-YOLO-based method for the collaborative detection of Saccharomyces cerevisiae and other microorganisms. It improves upon YOLOv11 by proposing a lightweight DH-YOLO network model. First, DH-YOLO replaces the backbone structure of YOLOv11 with a lightweight Dynamic-HGNetV2 network, which compresses computation through depthwise separable convolutions and a lightweight bottleneck module and stacks multi-scale hourglass structures to enhance the ability to distinguish features of adherent cells. Next, a dynamic convolutional hybrid module (DIMM) is embedded in the neck network in place of the original C3k2 module. Through a dynamic kernel selection mechanism, this module adaptively allocates the optimal receptive field based on the feature context of different spatial locations, enhancing the ability to distinguish subtle inter-class differences among cells. Finally, a sharp-frequency feedforward (RFFN) module is integrated into the C2PSA module branch of the YOLOv11 model. This module performs a frequency-domain transformation on the feature map and uses a learnable spectral weight matrix to adaptively suppress low-frequency components representing blurred backgrounds while enhancing high-frequency information representing cell edge details, improving the accuracy of the model in the collaborative detection of Saccharomyces cerevisiae and other microorganisms.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a flowchart of the overall process for the synergistic detection of brewer's yeast and other microorganisms based on DH-YOLO, provided in one embodiment of the present invention.
Figure 2 is a diagram of the overall structure of the DH-YOLO model provided in one embodiment of the present invention.
Figure 3 is a structural diagram of the DHG-block module in the Dynamic-HGNetV2 network of the DH-YOLO model provided in one embodiment of the present invention.
Figure 4 is a structural diagram of the Dynamic-Conv module in the DH-YOLO model provided in one embodiment of the present invention.
Figure 5 is a structural diagram of the integrated sharp-frequency feedforward module C2PSA-RFFN in the DH-YOLO model provided in one embodiment of the present invention.
Figure 6 is a structural diagram of the dynamic convolutional hybrid module (DIMM) in the DH-YOLO model provided in one embodiment of the present invention.
Figure 7 is a structural diagram of the dynamic perception hybrid module (DID) in the DH-YOLO model provided in one embodiment of the present invention.
Figure 8 is a visualization of the results of detecting brewer's yeast and other microorganisms using the existing YOLOv11m model, provided in one embodiment of the present invention.
Figure 9 is a visualization of the results of detecting brewer's yeast and other microorganisms using the DH-YOLO model of this invention, provided in one embodiment of the present invention.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
[0019] Example 1
This embodiment provides a DH-YOLO-based method for the synergistic detection of Saccharomyces cerevisiae and other microorganisms. As shown in Figure 1, the method includes:
Step 1: Prepare the dataset by dividing the labeled Saccharomyces cerevisiae and miscellaneous bacteria cell image dataset into a training set and a validation set.
Step 2: Construct the DH-YOLO model (Dynamic-HGNet Hybrid YOLO) based on the improved YOLOv11. This model includes a backbone feature extraction network, a Neck feature fusion network, and a head detection network.
Step 2.1: Select Dynamic-HGNetV2 as the backbone feature extraction network. Specifically, as shown in Figure 2, Dynamic-HGNetV2 replaces the original backbone feature extraction network of YOLOv11. The HG-Stem module in the Dynamic-HGNetV2 network is designed as follows: it first performs 4x downsampling of the input image through two parallel branches of max pooling and convolution. One branch uses two max pooling and convolution operations, while the other uses one double max pooling and convolution operation, so that the two branches produce differentiated results. Compared with the original HG-Stem module, which uses convolutional layers to extract features and then downsamples through multi-scale max pooling, this application designs parallel branches with a multi-scale hourglass structure to enhance the feature discrimination of adherent cells. Subsequently, the Dynamic-HGNetV2 network alternately stacks downsampling convolutional layers and integrates n sequentially executed dynamic gated frequency-domain modules (Dynamic-HGblock modules, abbreviated as DHG-block in Figure 2) to capture multi-scale information of images. As shown in Figure 3, each Dynamic-HGblock module performs dynamic convolution and standard convolution to extract cell image features.
For the dynamic convolution module Dynamic-Conv (abbreviated as DwConv in Figure 2), this application specially designs the structure shown in Figure 4, which includes an average pooling layer, a fully connected layer, a sigmoid function, a routed convolution CondConv2d, a regular convolution, residual connections, and a CBS for dimensionality expansion. The routed convolution module adaptively activates the corresponding expert convolution kernels according to the cell size, thereby constructing multi-scale information and performing channel fusion, and then compresses and restores the result to obtain the features required for the next block.
Step 2.2: In the backbone feature extraction network, an integrated sharp-frequency feedforward module C2PSA-RFFN is designed to replace the C2PSA module in YOLOv11 for collaborative feature extraction from images of miscellaneous bacteria cells and Saccharomyces cerevisiae cells. This module performs a frequency-domain transformation on the feature map and uses a learnable spectral weight matrix to adaptively suppress the low-frequency components that represent the blurred background while enhancing the high-frequency information that represents the details of cell edges, thereby improving the accuracy of the model in the collaborative detection of Saccharomyces cerevisiae and miscellaneous bacteria.
[0020] As shown in Figure 5, the integrated sharp-frequency feedforward module C2PSA-RFFN processes the input feature map as follows. First, a 1×1 convolution expands the channel dimension, doubling the number of channels. A 3×3 depthwise separable convolution then aggregates spatial context information, and its output is divided evenly into two feature maps. A gating mechanism is applied to these two feature maps by element-wise multiplication to obtain the gated features, whose dimensions are the number of image channels, the width, and the height. The gated features are calibrated by a 1×1 convolutional layer. Next, frequency-domain adaptive filtering is performed: the spatial-domain output is padded so that its height and width are both multiples of the patch size and is divided by a reshape operation into non-overlapping image patches, giving an output feature map with a transformed height and width. A two-dimensional real-valued Fourier transform (RFFT2) is performed on each patch to transform the signal from the spatial domain to the frequency domain. In the frequency domain, a learnable, input-independent complex filter modulates the spectrum element by element. The filtered spectrum is transformed back to the spatial domain by an inverse two-dimensional real-valued Fourier transform (IRFFT2), then reshaped and cropped to the original input size to obtain the enhanced feature map.
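The frequency-domain stage described above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions, not the implementation of this application: the function name `patch_frequency_filter`, zero padding, the patch size of 8, and a filter shared by all patches of a channel are assumptions, and the gating and 1×1 calibration stages are omitted.

```python
import numpy as np

def patch_frequency_filter(x, filt, patch=8):
    """Filter a (C, H, W) real feature map patch by patch in the frequency domain.

    filt is a complex array of shape (C, patch, patch // 2 + 1); it stands in for
    the learnable, input-independent filter (shape and sharing are assumptions).
    """
    c, h, w = x.shape
    # Pad height and width up to the next multiple of the patch size (zero padding assumed).
    pad_h, pad_w = (-h) % patch, (-w) % patch
    xp = np.pad(x, ((0, 0), (0, pad_h), (0, pad_w)))
    hp, wp = xp.shape[1:]
    # Reshape into non-overlapping patches: (C, nH, nW, patch, patch).
    blocks = xp.reshape(c, hp // patch, patch, wp // patch, patch).transpose(0, 1, 3, 2, 4)
    # 2-D real FFT of every patch, element-wise modulation, inverse transform.
    spec = np.fft.rfft2(blocks, axes=(-2, -1))
    spec = spec * filt[:, None, None, :, :]
    out = np.fft.irfft2(spec, s=(patch, patch), axes=(-2, -1))
    # Undo the patch layout and crop back to the original size.
    out = out.transpose(0, 1, 3, 2, 4).reshape(c, hp, wp)
    return out[:, :h, :w]
```

With an all-ones filter the sketch returns its input unchanged, and setting the zero-frequency entry of the filter to zero removes the mean of every patch, which is the simplest case of suppressing a low-frequency component.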
[0021] Step 2.3: In the feature fusion network (i.e., the Neck), an embedded dynamic convolutional hybrid module (DIMM) is designed to replace the original C3k2 module of YOLOv11. Through a dynamic kernel selection mechanism, this module adaptively allocates the optimal receptive field according to the feature context of different spatial locations. Specifically, as shown in Figure 6, the DIMM employs an asymmetric multi-scale convolutional kernel structure. By using convolutional kernels of different sizes in parallel, it extracts edge features and surface texture information of cells at multiple scales. Building on this, an attention-guided dynamic fusion mechanism adaptively integrates feature responses from different receptive fields, significantly enhancing the model's robustness to differences in cell-scale distribution and to cell adhesion, thereby improving the detection accuracy for densely packed cell targets.
[0022] The core design of the dynamic convolutional hybrid module (DIMM) is a residual block based on dynamic perception and stable optimization. First, the input feature map is batch-normalized and divided evenly along the channel dimension into two parts. The two sub-feature maps are fed into two DID modules that have the same structure but independent parameters. The outputs of the two paths are concatenated along the channel dimension, then fused and restored to the original number of channels by a 1×1 convolution. To ensure stable training of deep networks and promote gradient flow, the module introduces a layer scaling parameter: a learnable vector with very small initial values scales the output of the residual branch channel by channel. The scaled features are added to the input to form the first residual sub-block. The resulting feature then passes to a second sub-block of the same form, which contains another BN layer and a convolutional GELU feedforward network; its output is likewise scaled and added through a residual connection to obtain the final output.
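The first residual sub-block of the DIMM can be sketched as follows. This is a simplified NumPy illustration, not the implementation of this application: `did_a` and `did_b` are hypothetical callables standing in for the two DID modules, `standardize` is a stand-in for batch normalization, and the 1×1 convolution is written as a matrix product over channels.

```python
import numpy as np

def standardize(x, eps=1e-5):
    # Stand-in for batch normalization (assumption): per-channel standardization.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dimm_first_subblock(x, did_a, did_b, w_fuse, gamma):
    """x: (C, H, W); w_fuse: (C, C) weights of the 1x1 fusion; gamma: (C,) layer scale."""
    z = standardize(x)
    half = z.shape[0] // 2
    # Split along channels and process each half with its own DID module.
    a = did_a(z[:half])
    b = did_b(z[half:])
    cat = np.concatenate([a, b], axis=0)
    # A 1x1 convolution is a linear map over the channel axis at every pixel.
    fused = np.einsum("oc,chw->ohw", w_fuse, cat)
    # Layer scale: channel-wise scaling of the residual branch, then the skip connection.
    return x + gamma[:, None, None] * fused
```

Because `gamma` starts with very small values, the sub-block is close to an identity mapping at the beginning of training, which is the usual motivation for layer scaling in deep residual networks.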
[0023] As shown in Figure 7, the dynamic perception hybrid module (DID) first applies global average pooling to the input feature map. The pooled result is expanded in dimension by a 1×1 convolution and reshaped, and softmax is then applied to obtain the soft attention weights. Simultaneously, the original feature map is processed by 3×3, 1×11, and 11×1 convolutions to extract cell edge texture features at different scales. Finally, the output of each branch is multiplied element-wise by the soft attention weights generated from the global context, and the results are summed along the first dimension to obtain the module output. In Figure 6, the outputs of the two DID modules, which have identical structures but independent parameters, serve as the outputs of the two paths.
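The soft-attention fusion step of the DID can be sketched in NumPy as follows. The sketch covers only the weighting and summation: the outputs of the 3×3, 1×11, and 11×1 branches are passed in as a precomputed array, and the function name `fuse_branches`, the weight matrix `w_expand`, and the per-channel shape of the attention weights are assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_branches(x, branch_outputs, w_expand):
    """x: (C, H, W); branch_outputs: (K, C, H, W); w_expand: (K * C, C)."""
    k, c = branch_outputs.shape[:2]
    pooled = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    logits = (w_expand @ pooled).reshape(k, c)   # 1x1 conv on a pooled vector, then reshape
    attn = softmax(logits, axis=0)               # weights sum to 1 across the K branches
    # Weight each branch and sum along the first (branch) dimension.
    fused = (attn[:, :, None, None] * branch_outputs).sum(axis=0)
    return fused, attn
```

When all logits are equal, the weights are uniform and the fused output reduces to the plain average of the branches; learned weights shift this average toward the kernel shape that best fits the global context.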
[0024] Step 3: Input the images of Saccharomyces cerevisiae and other microorganisms in the training set into the constructed DH-YOLO model for training to obtain the early-stop model. This embodiment sets an early stopping condition that terminates training if the model's mAP value does not improve within 100 training rounds.
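The early stopping rule of Step 3 can be sketched as a plain training loop. This is a minimal illustration rather than the training code of this embodiment: `evaluate_map` is a hypothetical callback that trains for one round and returns the validation mAP, and `max_epochs` is an assumed upper bound.

```python
def train_with_early_stopping(evaluate_map, max_epochs=1000, patience=100):
    """Stop once mAP has not improved for `patience` consecutive rounds.

    Returns (best_epoch, best_map, last_epoch).
    """
    best_map = float("-inf")
    best_epoch = -1
    epoch = -1
    for epoch in range(max_epochs):
        current = evaluate_map(epoch)  # hypothetical: one training round + validation
        if current > best_map:
            # A real pipeline would save the model weights here as the best model.
            best_map, best_epoch = current, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_map, epoch
```

For example, if the mAP rises until round 50 and then stays flat, the loop stops at round 150 and reports round 50 as the best model.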
[0025] Step 4: Input the images of Saccharomyces cerevisiae and other microorganisms in the validation set into the early-stop model obtained through training for validation, and then save the best model. Step 5: Use the optimal model to detect the images of the Saccharomyces cerevisiae cells to be detected, and obtain the visual detection results of the images of the Saccharomyces cerevisiae and other microorganisms to be detected.
[0026] This invention provides a DH-YOLO-based method for the collaborative detection of Saccharomyces cerevisiae and other microorganisms. The proposed DH-YOLO network model has a lightweight Dynamic-HGNetV2 backbone, which reduces computational load through depthwise separable convolutions and enhances the features of adherent cells through a multi-scale hourglass structure. The neck uses a dynamic convolutional hybrid module (DIMM) to adaptively adjust the receptive field, highlighting subtle inter-class differences. RFFN is embedded in the C2PSA branch to suppress low-frequency blurring and enhance high-frequency edges in the frequency domain. Together, these changes significantly improve the accuracy and efficiency of the collaborative detection of Saccharomyces cerevisiae and other microorganisms.
[0027] Example 2
This embodiment provides a DH-YOLO-based method for the collaborative detection of Saccharomyces cerevisiae and other microorganisms. The method constructs a YOLOv11 object detection network and improves it using Dynamic-HGNetV2, the RFFN module, and a dynamic convolutional hybrid module (DIMM). Then, using a self-built dataset of Saccharomyces cerevisiae and other microorganism cells, the training-set images are input into the improved YOLOv11 network for training, and the network parameters are optimized using the validation set. Finally, the best model weights are saved, and the validation-set images are fed into the network for detection. Specifically, the method includes:
Step 1: Create a dataset. Manually label the collected Saccharomyces cerevisiae cell image dataset and divide it into training and validation sets. The dataset is constructed as follows:
1-1: More than 6,000 images of Saccharomyces cerevisiae cells were collected from the School of Bioengineering of a university and divided into a training set and a validation set in an 8:2 ratio.
[0028] 1-2: Under the guidance of experts, LabelImg annotation software was used to manually annotate the individual cells of the seven categories of Saccharomyces cerevisiae and other microorganisms contained in the dataset.
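The 8:2 division of Step 1 can be sketched with the Python standard library. The function name `split_dataset`, the fixed random seed, and the file names in the usage note are hypothetical and serve only to illustrate the ratio; the embodiment does not state how the split was randomized.

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Shuffle the image paths reproducibly and split them into training and validation sets."""
    paths = sorted(image_paths)          # sort first so the result does not depend on input order
    random.Random(seed).shuffle(paths)   # seeded shuffle (assumption: random split)
    n_train = round(len(paths) * train_ratio)
    return paths[:n_train], paths[n_train:]
```

Applied to 6,000 hypothetical file names such as `cell_0000.jpg` to `cell_5999.jpg`, the sketch yields 4,800 training images and 1,200 validation images with no overlap.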
[0029] Step 2: Construct a DH-YOLO model, including a backbone feature extraction network, a feature fusion network, and a detection head, wherein Dynamic-HGNetV2 is selected as the backbone feature extraction network. Specifically, the YOLOv11 model is selected as the baseline network and improved to form an object detection model with enhanced feature extraction and feature fusion, namely the DH-YOLO model. As shown in Figure 2, it includes:
Step 2-1: The Backbone uses the Dynamic-HGNetV2 network for feature extraction. An hourglass structure and depthwise separable convolutions reduce the model parameters, which avoids the vanishing-gradient problem and enhances the feature extraction capability.
[0030] Step 2-2: The Neck uses the PANet+FPN structure, which has top-down and bottom-up feature fusion paths to fuse feature information between feature maps of different sizes.
[0031] Step 2-3: The Head completes the regression and classification tasks separately through a decoupled head, which can directly predict the center point or boundary point of the target and simplifies the model training process.
[0032] Furthermore, in the constructed DH-YOLO model, Dynamic-HGNetV2 first downsamples the input image by a factor of 4 using parallel max-pooling and convolution branches. The network then alternately stacks downsampling convolutional layers and integrates n sequentially executed Dynamic-HGblock modules to capture multi-scale information from the image. Each Dynamic-HGblock module performs dynamic convolution and standard convolution to extract cell image features: the appropriate expert convolution kernel is adaptively activated according to the cell size, thereby constructing multi-scale information and performing channel fusion, which is then compressed and restored to obtain the features required for the next block.
[0033] Dynamic convolution reduces the dimensionality of the input feature map and flattens it. Multiple expert convolutional kernels are then dynamically fused according to routing weights to generate customized convolutional weights for each input sample. The fused kernel convolves the input, and the result passes through CBS to give the final feature map. CBS consists of a convolution, batch normalization (BN), and the SiLU activation function. This design reduces the number of model parameters while improving model accuracy, thus optimizing resource utilization.
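The kernel fusion described above can be sketched in NumPy for a single input sample. This is an illustrative sketch under stated assumptions, not the Dynamic-Conv module itself: the names `routing_weights`, `fuse_expert_kernels`, and `conv2d_same` are hypothetical, the routing uses average pooling, a fully connected layer, and a sigmoid as listed for the module, and the residual connection and CBS stage are omitted.

```python
import numpy as np

def routing_weights(x, w_fc, b_fc):
    """x: (C_in, H, W); w_fc: (E, C_in); b_fc: (E,). Returns one weight per expert."""
    pooled = x.mean(axis=(1, 2))                        # average pooling and flattening
    return 1.0 / (1.0 + np.exp(-(w_fc @ pooled + b_fc)))  # fully connected layer + sigmoid

def fuse_expert_kernels(experts, r):
    """experts: (E, C_out, C_in, k, k); r: (E,). Weighted sum of the expert kernels."""
    return np.tensordot(r, experts, axes=1)

def conv2d_same(x, kernel):
    """Stride-1 cross-correlation with zero padding, as used in deep learning layers."""
    c_out, c_in, k, _ = kernel.shape
    p = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) over all channels.
            out += np.einsum("oc,chw->ohw", kernel[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out
```

Because convolution is linear in the kernel, convolving once with the fused kernel gives the same result as convolving with every expert and mixing the outputs, at the cost of a single convolution per sample.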
[0034] The integrated sharp frequency feedforward module C2PSA-RFFN consists of two parts. The first is a spatial-domain gating mechanism applied to the input feature map. A 1×1 convolution first expands the channel dimension, doubling the number of channels. A 3×3 depthwise separable convolution then aggregates spatial context information, and its output is divided evenly into two feature maps, to which the gating mechanism is applied. The gated features are calibrated by a 1×1 convolutional layer to obtain the spatial-domain output. The second part is frequency-domain adaptive filtering. The spatial-domain output is padded so that its height and width are both multiples of the patch size, and it is divided into non-overlapping image patches. A two-dimensional real-valued Fourier transform (RFFT2) is applied to each patch to convert the signal from the spatial domain to the frequency domain. In the frequency domain, a learnable, input-independent complex filter modulates the spectrum element by element. An inverse two-dimensional real-valued Fourier transform (IRFFT2) converts the filtered spectrum back to the spatial domain, and the result is rearranged and cropped to the original input size to obtain the enhanced feature map.
In the feature fusion network Neck, a dynamic convolutional hybrid module (DIMM) is proposed to further enhance information fusion between high-level and low-level features. It uses dynamic perception modules with different kernel sizes and combines them with a gating mechanism. The core of this module is a residual block based on dynamic perception and stable optimization. The input feature map first passes through batch normalization and is divided evenly along the channel dimension into two parts. The two sub-feature maps are fed into two DID modules, which have the same structure but independent parameters. The outputs of the two paths are concatenated along the channel dimension, and a 1×1 convolution then fuses them and restores the number of channels. To keep the training of deep networks stable and to promote gradient flow, the module introduces layer scaling parameters: a learnable vector with very small initial values scales the output of the residual branch channel by channel (channel-wise multiplication). The scaled features are added to the input to form the first residual sub-block.
[0035] The resulting feature then enters a second sub-block of the same form, which contains another BN layer and a convolutional GELU feedforward network and is scaled in the same way. After the residual connection, the output of the module is obtained.
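The frequency-domain adaptive filtering step of C2PSA-RFFN described in [0034] can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation of this application: zero padding, one complex filter per channel shared by all patches, and the names `frequency_filter`, `filt`, and `patch` are all hypothetical choices. The spatial gating part is omitted because the text does not specify its exact form.

```python
import numpy as np

def frequency_filter(x, filt, patch):
    """Patch-wise RFFT2 -> complex filter -> IRFFT2.

    x:    real feature map of shape (C, H, W)
    filt: complex filter of shape (C, patch, patch // 2 + 1), assumed shared by all patches
    """
    c, h, w = x.shape
    ph, pw = (-h) % patch, (-w) % patch            # padding needed to reach a multiple of patch
    xp = np.pad(x, ((0, 0), (0, ph), (0, pw)))     # zero padding (an assumption)
    hp, wp = h + ph, w + pw

    # Split into non-overlapping patches: (C, nH, nW, patch, patch).
    blocks = xp.reshape(c, hp // patch, patch, wp // patch, patch)
    blocks = blocks.transpose(0, 1, 3, 2, 4)

    spec = np.fft.rfft2(blocks, axes=(-2, -1))     # to the frequency domain
    spec = spec * filt[:, None, None, :, :]        # element-wise modulation
    out = np.fft.irfft2(spec, s=(patch, patch), axes=(-2, -1))

    # Rearrange the patches back into a map and crop to the original size.
    out = out.transpose(0, 1, 3, 2, 4).reshape(c, hp, wp)
    return out[:, :h, :w]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 7))                 # sizes that are not multiples of the patch
identity = np.ones((2, 4, 3), dtype=complex)       # all-pass filter for patch = 4
y = frequency_filter(x, identity, patch=4)
```

With an all-pass filter the function returns its input unchanged, which is a convenient check that the padding, patch split, and crop are mutually consistent before a learned filter is plugged in.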
[0036] The dynamic convolutional hybrid module (DIMM) incorporates a dynamic perception hybrid module (DID). The DID first applies global average pooling to the input feature map, then a 1×1 convolution that expands the dimension, then a reshape, and finally a softmax to obtain the soft attention weights. In parallel, the original feature map is processed by a 3×3 convolution, a 1×11 convolution, and an 11×1 convolution to extract cell edge texture features at different scales. Finally, the output of each branch is multiplied element-wise by the soft attention weights generated from the global context, and the results are summed along the first dimension to obtain the module output.
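The weighting and fusion step of the DID module in [0036] can be sketched as follows. The sketch takes the three branch outputs as given and only shows how the soft attention weights are formed and applied. The softmax axis (across branches, per channel), the projection shape, and all names are assumptions for illustration rather than details confirmed by the text.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def did_fuse(x, branch_outputs, proj_w):
    """Fuse K branch outputs with soft attention derived from the global context.

    x:              input feature map, shape (C, H, W)
    branch_outputs: stacked branch results, shape (K, C, H, W)
    proj_w:         weight of the 1x1 expansion, shape (K * C, C) (hypothetical layout)
    """
    k, c = branch_outputs.shape[:2]
    context = x.mean(axis=(1, 2))                  # global average pooling -> (C,)
    logits = (proj_w @ context).reshape(k, c)      # expand, then reshape to (K, C)
    attn = softmax(logits, axis=0)                 # weights over branches, per channel
    fused = (attn[:, :, None, None] * branch_outputs).sum(axis=0)
    return fused, attn

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))
branches = rng.standard_normal((3, 4, 5, 5))       # stand-ins for the 3x3, 1x11, 11x1 branches
proj_w = rng.standard_normal((12, 4))
fused, attn = did_fuse(x, branches, proj_w)
```

Since the weights sum to one over the branches, the fused map is a per-channel convex combination of the branch outputs, so the output scale stays comparable to that of a single branch.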
[0037] Step 3: Train the constructed DH-YOLO model using the training set prepared in Step 1. This embodiment trains a DH-YOLO model based on an improved YOLOv11 and sets an early-stopping condition: if the model's mAP value does not improve within 100 training epochs, training ends.
[0038] Step 4: Input the images of Saccharomyces cerevisiae and other microorganisms in the validation set into the model obtained from training for validation, and then save the best model. Step 5: Use the best model to detect the images of Saccharomyces cerevisiae and other microorganisms to be detected, and obtain visualized detection results for those images.
[0039] In this embodiment of the application, all experiments are conducted on the same computer, and the experimental environment configuration is shown in Table 1: Table 1 Experimental Environment Configuration
[0040] When images are fed into the model for training, they are compressed into... During training, the initial learning rate was set to 0.01, the learning rate decay factor to 0.0005, and the momentum to 0.937, using the SGD optimizer with a batch size of 32. Training ran for 300 epochs with 8 workers to accelerate the training process. Early stopping was enabled: training stops automatically if the model's mAP value does not improve for 100 epochs, which saves training time.
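The early-stopping rule in [0040] can be sketched in plain Python. The dictionary keys and the function name are hypothetical labels for the values stated above, and the loop consumes a precomputed list of per-epoch mAP values instead of running real training.

```python
# Values from the text; the key names are illustrative, not a real framework's API.
HYPERPARAMS = {
    "initial_lr": 0.01,
    "lr_decay_factor": 0.0005,
    "momentum": 0.937,
    "optimizer": "SGD",
    "batch_size": 32,
    "epochs": 300,
    "workers": 8,
    "patience": 100,
}

def run_with_early_stopping(map_per_epoch, max_epochs=300, patience=100):
    """Return (best_epoch, best_map, last_epoch) for a sequence of validation mAP values."""
    best_map = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, value in enumerate(map_per_epoch[:max_epochs]):
        last_epoch = epoch
        if value > best_map:
            best_map, best_epoch = value, epoch    # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for `patience` epochs
    return best_epoch, best_map, last_epoch

# Synthetic history: mAP peaks at epoch 2 and never improves afterwards.
history = [0.5, 0.6, 0.7] + [0.65] * 200
result = run_with_early_stopping(
    history, HYPERPARAMS["epochs"], HYPERPARAMS["patience"]
)
```

In this synthetic run the best value is found at epoch 2 and training halts at epoch 102, well before the 300-epoch limit, which is the time saving the paragraph refers to.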
[0041] In this invention, precision, recall, mean average precision (mAP), the number of model parameters (Params), and frames per second (FPS) are used to evaluate the performance of the proposed method. Precision is the percentage of correctly predicted positive samples (TP) among all predicted positive samples (TP+FP) and reflects the extent of false detections of Saccharomyces cerevisiae cells. Recall is the percentage of correctly predicted positive samples among all labeled samples (TP+FN) and reflects the extent of missed detections of Saccharomyces cerevisiae cells. Mean average precision (mAP) considers precision and recall together: the PR curve, with recall on the x-axis and precision on the y-axis, is integrated to obtain the AP value of each category; the AP values of all categories are then summed and divided by the total number of categories k to obtain mAP. These indicators are calculated by the following formulas:
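[0042] $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
[0043] $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
[0044] $AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \dfrac{1}{k}\sum_{i=1}^{k} AP_i$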
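The metrics defined in [0041] can be computed with a few lines of plain Python. This is a minimal sketch: the function names and the sample numbers are illustrative, and the AP integral is approximated with the trapezoidal rule, whereas common detection benchmarks use interpolated variants of the PR curve.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of labeled objects that are found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def average_precision(recalls, precisions):
    """Area under the PR curve (recall on x, precision on y), trapezoidal rule."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def mean_average_precision(ap_per_class):
    """Sum of per-class AP values divided by the number of classes k."""
    return sum(ap_per_class) / len(ap_per_class)

# Made-up counts and curve points, used only to exercise the functions.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
m = mean_average_precision([ap, 0.625])
```

Here 90 true positives with 10 false positives give a precision of 0.9, and the same 90 true positives with 30 missed objects give a recall of 0.75, which shows how the two metrics separate false detections from missed detections.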
[0045] To verify whether the Dynamic-HGNetV2, the introduced RFFN module, and the DIMM module in the DH-YOLO network of this invention can effectively improve the detection accuracy of the model and enhance the recognition ability of Saccharomyces cerevisiae and other microbial cells, the embodiments of this invention conduct ablation experiments on different modules and objectively compare and analyze the different modules according to the evaluation indicators.
[0046] The experiment combined the three modules into 6 groups: Dynamic-HGNetV2, RFFN, DIMM, Dynamic-HGNetV2+RFFN, Dynamic-HGNetV2+DIMM, and Dynamic-HGNetV2+RFFN+DIMM. The mAP@0.5, recall, and precision results of the ablation experiment are shown in Table 2, where a √ mark indicates that the module is used.
[0047] Table 2 Ablation Experiment Results
[0048] As shown in Table 2, introducing the lightweight backbone network Dynamic-HGNetV2 alone reduced the model parameters from 20.04 MB to 17.1 MB while improving mAP@0.5 by 0.7 percentage points, indicating that Dynamic-HGNetV2 reduces model complexity while enhancing feature extraction. Introducing DIMM alone gave the highest recall, and mAP@0.5:0.95 improved by 1.9 percentage points. Through its multi-scale dynamic receptive field structure, the DIMM module enhances the model's ability to perceive cells in complex backgrounds and reduces false negatives. Introducing the RFFN module alone improved mAP@0.5 by 1.9 percentage points and precision by 3.4 percentage points over the baseline model (the original YOLOv11m). This indicates that the RFFN module, by using frequency-domain analysis, improves feature quality and discrimination, allowing the model to distinguish cells from the background more accurately and thereby reducing false positives.
[0049] With all optimization modules integrated, the proposed solution achieved the highest mAP@0.5 and mAP@0.5:0.95, improvements of 2.6 and 2.5 percentage points respectively over the baseline model, while reducing the parameters by 3.5 MB. This shows that the three modules are functionally complementary and that together they achieve the best balance between accuracy and efficiency among the tested configurations. Therefore, in addition to detecting contaminating microorganisms, DH-YOLO can determine the different growth stages of each contaminating microorganism and monitor the degree of sample contamination in real time during fermentation. To evaluate the performance advantages of this method comprehensively, DH-YOLO was compared with mainstream object detection models; the experimental results are shown in Table 3.
[0050] Table 3 Comparison of experimental results
[0051] This experiment used the well-balanced and widely used YOLOv11m as the baseline model. DH-YOLO achieved an mAP@0.5 of 93.7%, 2.6 percentage points higher than YOLOv11m's 91.1%. On mAP@0.5:0.95, DH-YOLO achieved 81.6%, 2.5 percentage points higher than YOLOv11m's 79.1%. This metric requires more precise bounding box localization; the improvement indicates that the model not only classifies accurately but also generates prediction boxes that fit cell edges more closely, which is crucial for subsequent morphological analysis. The DH-YOLO model presented in this application achieved the highest precision among all compared models, indicating few false positives and high reliability. Its recall was second only to that of YOLOv13l, which has more parameters, indicating a low probability of missed detections. These results show the practicality of the DH-YOLO model for monitoring contaminants during microbial fermentation, where it can provide reliable technical support for real-time, accurate quality control of the fermentation process.
[0052] Figure 8 is a visualization of the detection results for images of Saccharomyces cerevisiae and other microorganisms using the existing YOLOv11m model. Figure 9 is a visualization of the detection results for images of Saccharomyces cerevisiae and other microorganisms using the DH-YOLO model provided in this invention. Comparing Figure 8 and Figure 9 shows that the method proposed in this invention has stronger feature fusion and multi-scale feature extraction capabilities. Compared with the YOLOv11m network, it improves the detection accuracy for Saccharomyces cerevisiae and other bacteria in scenarios such as complex backgrounds, image boundaries, and overlapping cells, with a reduced number of parameters.
[0053] Some steps in the embodiments of the present invention can be implemented using software, and the corresponding software program can be stored in a readable storage medium, such as an optical disc or a hard disk.
[0054] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for the synergistic detection of Saccharomyces cerevisiae and other microorganisms based on DH-YOLO, characterized in that the method includes: Step 1: Prepare the dataset by dividing the labeled image dataset of Saccharomyces cerevisiae and miscellaneous bacteria cells into a training set and a validation set; Step 2: Construct a DH-YOLO model based on the improved YOLOv11, the DH-YOLO model including a Backbone feature extraction network, a Neck feature fusion network, and a Head detection network; Step 3: Train the constructed DH-YOLO model using the training set prepared in Step 1, and set an early stopping condition to obtain the early-stop model; Step 4: Input the images of Saccharomyces cerevisiae and other microorganisms in the validation set into the early-stop model obtained through training for validation, and then save the best model; Step 5: Use the best model to detect the images of Saccharomyces cerevisiae and other microorganisms to be detected, and obtain visualized detection results for those images.
2. The method according to claim 1, characterized in that, in Step 2, the Backbone uses the Dynamic-HGNetV2 network for feature extraction, whose hourglass structure and depthwise separable convolution reduce the model parameters and avoid the vanishing gradient problem while enhancing the feature extraction capability; the Neck uses the PANet+FPN structure, which has top-down and bottom-up feature fusion paths to fuse feature information between feature maps of different sizes; and the Head uses a decoupled head that separates the regression and classification tasks and directly predicts the center point or boundary points of the target.
3. The method according to claim 2, characterized in that Step 2 includes: Step 2.1: Select Dynamic-HGNetV2 as the backbone feature extraction network, Dynamic-HGNetV2 including a multi-scale hourglass structure HG-Stem module, n sequentially executed dynamic gated frequency domain modules Dynamic-HGblock, a dynamic convolution module Dynamic-Conv, a multi-scale pyramid pooling layer module SPPF, and an integrated sharp frequency feedforward module C2PSA-RFFN; Step 2.2: Design the structure of the HG-Stem module, the Dynamic-HGblock module, the Dynamic-Conv dynamic convolution module, and the C2PSA-RFFN integrated sharp frequency feedforward module in the Dynamic-HGNetV2 network; Step 2.3: In the feature fusion network Neck, design an embedded dynamic convolutional hybrid module DIMM to replace the original C3k2 module of YOLOv11.
4. The method according to claim 3, characterized in that the HG-Stem module of the multi-scale hourglass structure performs 4× downsampling of the input image through two parallel branches of max pooling and convolution. One of the parallel branches uses two max pooling and convolution operations, while the other uses one double max pooling and convolution operation to form differentiated results. The n sequentially executed Dynamic-HGblock modules perform dynamic convolution and standard convolution to extract cell image features. The Dynamic-Conv module includes an average pooling layer, a fully connected layer, a sigmoid function, a routed convolution module CondConv2d, ordinary convolution, residual connections, and a CBS module for dimensionality expansion. The routed convolution module adaptively activates the corresponding expert convolution kernels according to the cell size, thereby constructing multi-scale information and performing channel fusion; after compression and restoration, the features required by the next block are obtained.
5. The method according to claim 4, characterized in that the integrated sharp frequency feedforward module processes the input feature map as follows. A 1×1 convolution first expands the channel dimension, doubling the number of channels. A 3×3 depthwise separable convolution then aggregates spatial context information, and its output is divided evenly into two feature maps, to which the gating mechanism is applied by element-wise multiplication to obtain the gated features, the feature maps being described by their number of channels, width, and height. The gated features are calibrated by a 1×1 convolutional layer to obtain the spatial-domain output. Next, frequency-domain adaptive filtering is performed: the spatial-domain output is padded so that its height and width are both multiples of the patch size, and it is divided by a reshape into non-overlapping image patches, the height and width here being those of the transformed feature map. A two-dimensional real-valued Fourier transform is applied to each patch to convert the signal from the spatial domain to the frequency domain and obtain the spectrum. In the frequency domain, a learnable, input-independent complex filter modulates the spectrum element by element to obtain the modulated spectrum. The filtered spectrum is subjected to an inverse two-dimensional real-valued Fourier transform to convert it back to the spatial domain, and the result is reshaped and cropped to the original input size to obtain the enhanced feature map.
6. The method according to claim 5, characterized in that the dynamic convolutional hybrid module DIMM first passes the input feature map through batch normalization and divides it evenly along the channel dimension into two parts; the two sub-feature maps are then fed into two dynamic perception hybrid modules (DID) with the same structure but independent parameters; the outputs of the two paths are concatenated along the channel dimension, and a 1×1 convolution then fuses them and restores the number of channels to obtain the fused features. To keep the training of deep networks stable and to promote gradient flow, the module introduces a layer scaling parameter: a learnable vector with very small initial values scales the output of the residual branch channel by channel. The scaled fused features are added to the input to form the first residual sub-block. The resulting feature then enters a second sub-block of the same form, which contains another BN layer and a convolutional GELU feedforward network and is scaled in the same way. After the residual connection, the output of the module is obtained.
7. The method according to claim 6, characterized in that the dynamic perception hybrid module DID first applies global average pooling to the input feature map, then a 1×1 convolution that expands the dimension, then a reshape, and finally a softmax to obtain the soft attention weights. In parallel, the original feature map is processed by a 3×3 convolution, a 1×11 convolution, and an 11×1 convolution to extract cell edge texture features at different scales. Finally, the output of each branch is multiplied element-wise by the soft attention weights generated from the global context, and the results are summed along the first dimension to obtain the module output.
8. The method according to claim 7, characterized in that the early stopping condition set in Step 3 is that the model's mAP value does not improve within 100 training epochs.
9. A method for detecting the growth stage of miscellaneous bacteria during the fermentation process of alcoholic beverages, characterized in that the method detects miscellaneous bacteria based on the method of any one of claims 1-8, and then determines the growth stage of the miscellaneous bacteria based on their morphological change rate.
10. The application of the method according to any one of claims 1-9 in the fermentation process of alcoholic beverages.