A multimodal dynamic guidance and topology-based real-time counting method for sugarcane tails
By combining multimodal dynamic guidance with topological techniques, the method addresses color-similarity interference, dense occlusion, and the lack of physical plausibility in sugarcane tail counting, achieving high-precision counting and improving the model's adaptability and interpretability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGXI YUEGUI GUANGYE HOLDINGS CO LTD
- Filing Date
- 2025-03-27
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies for counting sugarcane tails suffer from problems such as color similarity interference, insufficient modeling of dense occlusion, lack of physical rationality, and insufficient utilization of multimodal information, resulting in poor counting accuracy and adaptability.
The method adopts multimodal dynamic guidance and topology. A camera acquires the original image, which is then multimodally encoded; image features are extracted and fused with the text encoding and coordinate encoding to generate a predicted density map. Block-based persistent homology features and transposed convolution are used to generate the initial density map, and the counting model is optimized by combining a maximum stacking density constraint with filtering.
It significantly improves the model's robustness to changes in illumination and metallic reflection, suppresses false detections due to color confusion, ensures the topological connectivity of the sugarcane tail distribution and the physical rationality of the counting, and enhances the interpretability and verification efficiency of the counting results.
Figure CN120339201B_ABST
Description
Technical Field
[0001] This invention relates to the field of image processing, and in particular to a method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology.
Background Technology
[0002] In the sugar production process, sugarcane tail counting is mainly used to optimize raw material processing and quality control. The tail of the sugarcane typically has a lower sugar content and higher fiber content, which may affect sugar production efficiency and finished product quality. With the development of automation in the sugar industry and intelligent quality inspection of agricultural products, vision-based automatic sugarcane tail counting technology has become a key link in improving the efficiency of raw material statistics. However, current mainstream dense target counting methods still face significant technical bottlenecks in industrial-grade sugarcane tail detection scenarios.
[0003] Color similarity interference: Traditional convolutional neural networks (CNNs) rely on RGB color space feature extraction, which makes it difficult to effectively distinguish the subtle color difference between sugarcane tails (yellow) and the reflection from the metal trough (bright yellow), leading to the reflection area being misidentified as sugarcane tail clumps. Existing improvement schemes use HSV color enhancement, but cannot dynamically adapt to changes in lighting conditions.
[0004] Insufficient modeling of dense occlusion: Density map regression-based methods, such as CSRNet, are prone to feature confusion in areas with severe overlap of sugarcane tails. Existing spatial attention mechanisms only focus on local texture differences and lack explicit constraints on the topological connectivity of the target, resulting in a fragmented distribution of the predicted density map.
[0005] Lack of physical rationality: Mainstream counting models, such as MCNN, directly regress density values without introducing physical prior constraints such as maximum packing density. The prediction results may exceed the actual material packing limit, such as a single pixel prediction of more than 5 sugarcane tails, which amplifies the cumulative counting error.
[0006] Insufficient utilization of multimodal information: Existing industrial inspection solutions mostly use a single visual input and do not effectively integrate scene description text, such as "high density on the right side of the conveyor belt," with equipment coordinate information, making it difficult for the model to quickly adapt to changes in camera position or sudden changes in regional density distribution.
[0007] Poor adaptability to dynamic scenes: Traditional frequency domain enhancement methods, such as wavelet decomposition, use fixed filter banks and cannot adapt to changes in the sugarcane tail distribution pattern. While recent improvements, such as frequency domain attention networks, introduce learnable frequency band selection, they are not jointly optimized with topological persistence features, making it difficult to balance noise suppression and detail preservation.
Summary of the Invention
[0008] To address the aforementioned issues, this invention provides a real-time sugarcane tail counting method based on multimodal dynamic guidance and topology, which can simultaneously improve the ability to suppress color confusion under complex lighting conditions, the rationality of the distribution of densely occluded areas, and the counting accuracy of sugarcane tails in industrial scenarios.
[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0010] A real-time sugarcane tail counting method based on multimodal dynamic guidance and topology includes the following steps:
[0011] S1. Take a picture of the sugarcane tails on the conveyor belt using a camera to obtain the original image of the sugarcane tails;
[0012] S2. Perform multimodal encoding on the original image to obtain a preprocessed image with text encoding and coordinate encoding;
[0013] S3. Extract multi-scale features from the preprocessed image to obtain image features, and concatenate the image features with the text encoding and the coordinate encoding in the channel dimension to obtain a predicted density map;
[0014] S4. After dividing the predicted density map into blocks, calculate persistent homology features to filter topological features; fuse the block-based topological features to generate a spatial attention weight map;
[0015] S5. Progressively upsample the attention weight map to the original resolution by transposed convolution to generate an initial density map; apply a maximum stacking density constraint to the density value of each pixel in the initial density map to obtain a constrained density map; filter the constrained density map to obtain a filtered density map;
[0016] S6. Accumulate the filtered density map at the pixel level to output the total number of sugarcane tails in the current frame;
[0017] S7. Collect multiple industrial site images, label the center point of the sugarcane tail in the images, and generate density map labels using a Gaussian kernel; associate the scene description text and associated camera coordinate parameters with the images to construct a multimodal dataset;
[0018] S8. Construct an initial counting model through steps S2-S6, and train the model using the multimodal dataset to obtain an optimized counting model.
[0019] Further, in step S1, the original image includes an RGB image of size $H \times W \times 3$, a text prompt, and physical parameters, wherein $H$ is the height of the image and $W$ is the width of the image; the text prompt is a scene description used to provide semantic prior information; the physical parameters include slot size, conveyor belt speed, and maximum packing density.
[0020] Further, in step S2, the original image is normalized before multimodal encoding. The multimodal encoding includes text encoding and coordinate encoding. The text encoding inputs a scene description based on the position of the original image, and converts the scene description into a 512-dimensional semantic vector through a CLIP text encoder. The coordinate encoding converts the pixel coordinates of the original image into normalized relative coordinates based on the installation position of the camera, and maps the relative coordinates into 512-dimensional position encoding through a fully connected layer.
[0021] Further, in step S2, the original image is normalized according to the mean and standard deviation of the dataset:
[0022] $I_{\text{norm}} = \dfrac{I - \mu}{\sigma}$  Formula (1)
[0023] where $I$ is the original image, with pixel values in the range [0, 255]; $\mu$ is the mean of all images in the dataset over the three RGB channels; $\sigma$ is the standard deviation of all images in the dataset over the three RGB channels; $I_{\text{norm}}$ is the normalized original image;
[0024] The text encoding method is as follows:
[0025] $t = \mathrm{CLIP}_{\text{text}}(\text{Prompt})$  Formula (2)
[0026] where Prompt is a natural language prompt; $\mathrm{CLIP}_{\text{text}}$ is a pre-trained CLIP text encoder; $t$ is a 512-dimensional semantic vector that encodes the semantic information of the text;
[0027] The encoding method for the coordinates is as follows:
[0028] $p(x, y) = \left( \dfrac{x}{W}, \dfrac{y}{H} \right)$  Formula (3)
[0029] where $p(x, y)$ is the normalized coordinate mapping; $(x, y)$ are the input coordinates of the camera; $H$ is the height of the RGB image of the original image; $W$ is the width of the RGB image of the original image.
[0030] Further, in step S3, the preprocessed image is input into a ResNet-50 network to extract image features:
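[0031] $f_{\text{img}} = \mathrm{ResNet50}(I_{\text{norm}}) \in \mathbb{R}^{H' \times W' \times 2048}$  Formula (4)
[0032] where $f_{\text{img}}$ is the image feature map; $H' \times W'$ is the output resolution, and the number of channels is 2048;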
[0033] The text encoding is performed using a text encoder to obtain text features:
[0034] $f_{\text{text}} = \mathrm{CLIP}_{\text{text}}(\text{Prompt})$  Formula (5)
[0035] where $f_{\text{text}}$ is the text feature vector obtained after encoding the natural language prompt;
[0036] The coordinate encoding generates coordinate features through coordinate projection:
[0037] $f_{\text{coord}} = W_c \, p(x, y) + b_c$  Formula (6)
[0038] where $W_c$ is a learnable weight matrix; $b_c$ is a learnable bias vector; $f_{\text{coord}}$ is the output 512-dimensional coordinate embedding vector, which represents the spatial information of a location in the image;
[0039] The image features, text features, and coordinate features are concatenated along the channel dimension to obtain a predicted density map:
[0040] $f_{\text{fused}} = \mathrm{Concat}(f_{\text{img}}, f_{\text{text}}, f_{\text{coord}})$  Formula (7)
[0041] where $f_{\text{fused}}$ is the feature map after multimodal fusion.
[0042] Further, in step S4, the predicted density map is divided into 8×8 blocks:
[0043] $D = \bigcup_{k=1}^{K} D_k$  Formula (8)
[0044] where $D$ is the predicted density map; $D_k$ is the $k$-th sub-block; $K$ is the number of blocks.
[0045] For each block, persistent homology features are calculated, and long-lived connected components are selected while short-lived noise is suppressed:
[0046] $PD_k = \{ (b_i, d_i) \}$  Formula (9)
[0047] where $PD_k$ is the persistence diagram of sub-block $D_k$; $(b_i, d_i)$ are the birth and death of a connected component, each point $(b_i, d_i)$ corresponding to one Betti-0 connected component that is born at threshold $b_i$ and disappears at threshold $d_i$;
[0048] The lifetimes $l_i$ of all connected components of sub-block $D_k$ are summed with weights to obtain the topological feature vector $v_k$ of sub-block $D_k$:
[0049] $v_k = \sum_{i=1}^{n} \phi(l_i) \, l_i, \quad l_i = d_i - b_i$  Formula (10)
[0050] $\phi(l_i) = \exp\left( -\dfrac{(l_i - \mu)^2}{2\sigma^2} \right)$  Formula (11)
[0051] where $v_k$ is the topological feature vector; $l_i$ is the lifetime of a connected component of the sub-block; $n$ is the dimension of the persistent homology vector; $\phi$ is a Gaussian kernel weighting function; $\mu$ is the lifetime mean; $\sigma$ is the standard deviation, used to control the smoothness of the weighting function.
[0052] When $l_i \approx \mu$ and $\phi \approx 1$, the component is a long-lifetime connected component, indicating a stable sugarcane tail cluster; when $l_i$ and $\mu$ differ greatly, $\phi \approx 0$ or $\phi$ is very small, which suppresses short-lifetime noise such as metallic reflection.
[0053] Furthermore, in step S4, attention weights are calculated for the selected topological features, and a spatial attention weight map is obtained by element-wise multiplication of the features:
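[0054] $\alpha = \mathrm{Sigmoid}(W_a v + b_a)$  Formula (12)
[0055] $f_{\text{att}} = \alpha \odot f_{\text{fused}}$  Formula (13)
[0056] where $\alpha$ is the final attention map; Sigmoid is the Sigmoid function, used to restrict the attention value at each pixel location to the range [0, 1]; $v$ is the concatenation of the topological feature vectors $v_k$ of all blocks; $W_a$ is a learnable weight tensor; $b_a$ is a learnable bias term with the same dimension as $\alpha$; $f_{\text{att}}$ is the output spatial attention weight map; $f_{\text{fused}}$ is the feature map obtained from the previous stage of multimodal fusion, where $C$ is its number of channels.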
[0057] Furthermore, in step S5, the attention weight map is progressively upsampled to the original resolution to obtain the initial density map:
[0058]
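[0059] $D_{\text{init}} = \mathrm{ConvTranspose}(f_{\text{att}};\ k,\ s,\ p)$  Formula (14)
[0060] where $D_{\text{init}}$ is the initial density map; ConvTranspose is the transposed convolution operation; $f_{\text{att}}$ is the spatial attention weight map; $k$ is the kernel size; $s = 2$ is the stride; $p$ is the padding applied during the transposed convolution;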
[0061] A maximum packing density constraint is applied to the density value of each pixel in the initial density map:
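[0062] $D_{\text{clip}} = \min(D_{\text{init}}, \rho_{\max})$  Formula (15)
[0063] where $D_{\text{clip}}$ is the constrained density map; $\rho_{\max}$ is defined as the maximum physical packing density; the clip operation is applied to all pixel values, and any value exceeding the limit is pulled back to $\rho_{\max}$;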
[0064] Gaussian filtering is applied to the constrained density map along the horizontal movement direction of the conveyor belt to obtain a filtered density map.
[0065] Furthermore, in step S8, the initial counting model is trained sequentially through basic training, topology fine-tuning, and joint optimization.
[0066] The basic training is conducted by minimizing the pixel-level error of the density map and constraining the maximum density value;
[0067] The topology fine-tuning is performed by freezing the image encoder weights and introducing a persistent homology loss to optimize the density map topology during training.
[0068] The joint optimization unfreezes all parameters and trains by minimizing the pixel-level error of the density map, constraining the maximum density value, introducing a persistent homology loss to optimize the density map topology, and performing cross-modal alignment; training continues until convergence to obtain an optimized counting model.
[0069] The beneficial effects of this invention are:
[0070] Multimodal encoding fuses textual semantics, image features, and spatial coordinate information. Through a cross-modal dynamic guidance mechanism, the robustness of the model to interference such as lighting changes and metallic reflections is significantly improved, effectively suppressing false detections caused by color confusion. Block-based persistent homology feature extraction explicitly constrains the topological connectivity of the sugarcane tail distribution, avoiding the density map fragmentation that severely occluded areas cause in traditional methods and keeping the count reasonable in densely stacked scenes. The maximum stacking density threshold and directional post-processing filtering force the predicted density values to conform to the actual stacking characteristics of the material, solving the problem of predictions exceeding physical limits because traditional models lack physical rules. Text prompts and coordinate encoding jointly guide the model's attention distribution, achieving dynamic alignment between semantic priors and visual features and improving both the interpretability of the counting results and the efficiency of manual verification.
Attached Figure Description
[0071] Figure 1 is a flowchart of a preferred embodiment of the present invention, a real-time sugarcane tail counting method based on multimodal dynamic guidance and topology.
Detailed Implementation
[0072] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0073] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0074] Referring to Figure 1, a preferred embodiment of the present invention provides a real-time sugarcane tail counting method based on multimodal dynamic guidance and topology, comprising the following steps:
[0075] S1. Take a picture of the sugarcane tails on the conveyor belt using a camera to obtain the original image of the sugarcane tails.
[0076] In step S1, the original image includes an RGB image of size $H \times W \times 3$, a text prompt, and physical parameters, where $H$ is the height of the image and $W$ is the width of the image. The text prompt is a scene description used to provide semantic prior information, for example "high density of sugarcane tails in the right third of the conveyor belt". The physical parameters include the slot size, conveyor belt speed, and maximum packing density, which are used to constrain the physical rationality of the model.
[0077] S2. Perform multimodal encoding on the original image to obtain a preprocessed image with text encoding and coordinate encoding.
[0078] In step S2, the original image is normalized before multimodal encoding. Multimodal encoding includes text encoding and coordinate encoding. Text encoding inputs a scene description based on the position of the original image, and the scene description is converted into a 512-dimensional semantic vector through the CLIP text encoder. Coordinate encoding converts the pixel coordinates of the original image into normalized relative coordinates based on the installation position of the camera, and the relative coordinates are mapped into 512-dimensional position codes through a fully connected layer.
[0079] In step S2, the original image is normalized according to the mean and standard deviation of the dataset:
[0080] $I_{\text{norm}} = \dfrac{I - \mu}{\sigma}$  Formula (1)
[0081] where $I$ is the original image, with pixel values in the range [0, 255]; $\mu$ is the mean of all images in the dataset over the three RGB channels; $\sigma$ is the standard deviation of all images in the dataset over the three RGB channels; $I_{\text{norm}}$ is the normalized original image. Normalizing the original image reduces illumination differences and accelerates model convergence.
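As an illustration of the normalization in Formula (1), the following minimal Python sketch standardizes an RGB frame channel by channel. The function name `normalize_image` and the mean and standard deviation values are hypothetical placeholders, not statistics from the dataset described here.

```python
import numpy as np

def normalize_image(image, mean, std):
    """Per-channel standardization: (I - mean) / std for an H x W x 3 image."""
    image = np.asarray(image, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32).reshape(1, 1, 3)
    std = np.asarray(std, dtype=np.float32).reshape(1, 1, 3)
    return (image - mean) / std

# Hypothetical dataset statistics on the 0-255 scale (illustration only).
DATASET_MEAN = [120.0, 110.0, 90.0]
DATASET_STD = [60.0, 55.0, 50.0]

# A uniform 4 x 6 test frame with one fixed RGB value per pixel.
frame = np.full((4, 6, 3), [120.0, 165.0, 40.0], dtype=np.float32)
normalized = normalize_image(frame, DATASET_MEAN, DATASET_STD)
print(normalized[0, 0])
```

In this sketch a pixel equal to the channel mean maps to 0, and a pixel one standard deviation above or below the mean maps to +1 or -1.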
[0082] The text encoding method is as follows:
[0083] $t = \mathrm{CLIP}_{\text{text}}(\text{Prompt})$  Formula (2)
[0084] where Prompt is a natural language prompt, such as "high-density sugarcane tails at the right third of the conveyor belt"; $\mathrm{CLIP}_{\text{text}}$ is a pre-trained CLIP text encoder, which uses a Transformer architecture; $t$ is a 512-dimensional semantic vector that encodes the semantic information of the text;
[0085] The coordinate encoding method is as follows:
[0086] $p(x, y) = \left( \dfrac{x}{W}, \dfrac{y}{H} \right)$  Formula (3)
[0087] where $p(x, y)$ is the normalized coordinate mapping; $(x, y)$ are the input coordinates of the camera; $H$ is the height of the RGB image of the original image; $W$ is the width of the RGB image of the original image.
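The coordinate normalization of Formula (3) can be sketched as follows, assuming it amounts to dividing each pixel coordinate by the image width and height. The helper name `normalized_coordinate_grid` is hypothetical, and any additional offset tied to the camera installation position is not modeled.

```python
import numpy as np

def normalized_coordinate_grid(height, width):
    """Return an H x W x 2 array holding (x / W, y / H) for every pixel."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([xs / width, ys / height], axis=-1).astype(np.float32)

grid = normalized_coordinate_grid(4, 8)
print(grid[2, 4])  # pixel at x=4, y=2 in a 4 x 8 image
```

Because the coordinates are relative to the image size, the same embedding layer can in principle be reused when the image resolution changes.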
[0088] S3. Extract multi-scale features from the preprocessed image to obtain image features, and concatenate the image features with the text encoding and coordinate encoding in the channel dimension to obtain a predicted density map.
[0089] In step S3, the preprocessed image is input into the ResNet-50 network to extract image features:
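[0090] $f_{\text{img}} = \mathrm{ResNet50}(I_{\text{norm}}) \in \mathbb{R}^{H' \times W' \times 2048}$  Formula (4)
[0091] where $f_{\text{img}}$ is the image feature map; $H' \times W'$ is the output resolution, and the number of channels is 2048;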
[0092] The residual structure of ResNet-50 in this embodiment effectively alleviates the gradient vanishing problem in deep networks, making it suitable for processing high-resolution surveillance images. Shallow layers (stages 1-2) capture local details of the sugarcane tail, such as color and texture; deep layers (stages 4-5) identify global distribution patterns, such as dense regions and occlusion relationships. This embodiment optimizes output resolution, preserving key spatial information through downsampling while reducing computational load; a large receptive field models the contrast between the sugarcane tail and the background (metal trough, conveyor belt); and the complementary nature of shallow layer details and deep layer semantics enhances the robustness of occluded region detection.
[0093] Text encoding uses a text encoder to obtain text features:
[0094] $f_{\text{text}} = \mathrm{CLIP}_{\text{text}}(\text{Prompt})$  Formula (5)
[0095] where $f_{\text{text}}$ is the text feature vector obtained after encoding the natural language prompt, which represents the position of the text in the semantic space. CLIP's contrastive pre-training aligns text and image features in a shared space; for example, the text features of "high-density sugarcane tails" lie close to the visual features of dense areas in the image. This embodiment enhances sensitivity to the sugarcane tail color feature (yellow), reduces false positives caused by metallic reflection, and allows text-image similarity to be visualized to verify the effectiveness of the semantic cues.
[0096] Coordinate encoding generates coordinate features through coordinate projection:
[0097] $f_{\text{coord}} = W_c \, p(x, y) + b_c$  Formula (6)
[0098] where $W_c$ is a learnable weight matrix; $b_c$ is a learnable bias vector; $f_{\text{coord}}$ is the output 512-dimensional coordinate embedding vector, which represents the spatial information of a location in the image.
[0099] This embodiment distinguishes the distribution differences between the right 1 / 3 of the conveyor belt and the center of the slot, and maintains spatial perception consistency when the camera position changes.
[0100] Image features, text features, and coordinate features are concatenated along the channel dimension to obtain a predicted density map:
[0101] $f_{\text{fused}} = \mathrm{Concat}(f_{\text{img}}, f_{\text{text}}, f_{\text{coord}})$  Formula (7)
[0102] where $f_{\text{fused}}$ is the feature map after multimodal fusion. In this embodiment, the image features (2048 dimensions) are concatenated with the text features and coordinate features (512 dimensions each). $f_{\text{fused}}$ combines features from the image encoder, text encoder, and coordinate projection layer so that subsequent networks can learn cross-modal interaction information.
[0103] The multimodal feature fusion in this embodiment exhibits modal complementarity: image features include sugarcane tail shape and texture; text features include semantic constraints such as high density; and coordinate features include spatial priors such as the right-side region. Furthermore, the multimodal features collaboratively correct biases when illumination changes.
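The coordinate projection and channel-wise concatenation of step S3 can be sketched as below. This is a shape-level illustration only: random arrays stand in for the ResNet-50 and CLIP outputs, the 8 x 8 spatial size is arbitrary, and tiling the 512-dimensional text and coordinate vectors over every spatial location is an assumption about how vectors are aligned with the feature map before concatenation. The names `project_coordinates`, `fuse_features`, `W_c`, and `b_c` are hypothetical.

```python
import numpy as np

def project_coordinates(p, W_c, b_c):
    """Fully connected projection of normalized coordinates to an embedding."""
    return W_c @ p + b_c

def fuse_features(f_img, f_text, f_coord):
    """Tile the text and coordinate vectors over the spatial grid and
    concatenate them with the image features along the channel axis."""
    h, w, _ = f_img.shape
    text_map = np.broadcast_to(f_text, (h, w, f_text.shape[0]))
    coord_map = np.broadcast_to(f_coord, (h, w, f_coord.shape[0]))
    return np.concatenate([f_img, text_map, coord_map], axis=-1)

rng = np.random.default_rng(0)
f_img = rng.standard_normal((8, 8, 2048)).astype(np.float32)  # stand-in image features
f_text = rng.standard_normal(512).astype(np.float32)          # stand-in text vector
W_c = rng.standard_normal((512, 2)).astype(np.float32)        # untrained projection weights
b_c = np.zeros(512, dtype=np.float32)
f_coord = project_coordinates(np.array([0.5, 0.5], dtype=np.float32), W_c, b_c)

f_fused = fuse_features(f_img, f_text, f_coord)
print(f_fused.shape)
```

With 2048 image channels and two 512-dimensional vectors, the fused map in this sketch has 3072 channels.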
[0104] S4. After dividing the predicted density map into blocks, calculate the persistent homology features to filter the topological features; fuse the block-based topological features to generate a spatial attention weight map.
[0105] In step S4, the predicted density map is divided into 8×8 blocks:
[0106] $D = \bigcup_{k=1}^{K} D_k$  Formula (8)
[0107] where $D$ is the predicted density map, a single-channel map of size $H \times W$ obtained from network prediction or annotation, in which each pixel value can be regarded as the sugarcane tail density (or an intensity value) at that location; $D_k$ is the $k$-th sub-block; $K$ is the number of blocks. Dividing the density map into 8×8 blocks reduces the computational complexity and adapts the method to high-resolution input, while still allowing the connectivity characteristics of different regions, such as densely packed versus sparse regions, to be identified. After the division, the topology of each sub-block (number of connected components, their lifetimes, and so on) is calculated, which reduces the amount of computation and lets the model attend more finely to local distribution differences.
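The block division can be sketched with a reshape, assuming that "8×8 blocks" means non-overlapping sub-blocks of 8 x 8 pixels and that the map dimensions are multiples of 8. The helper name `split_into_blocks` is hypothetical.

```python
import numpy as np

def split_into_blocks(density, block=8):
    """Split an H x W map into K non-overlapping block x block sub-blocks,
    ordered row by row. Returns an array of shape (K, block, block)."""
    h, w = density.shape
    if h % block or w % block:
        raise ValueError("map dimensions must be multiples of the block size")
    return (density.reshape(h // block, block, w // block, block)
                   .swapaxes(1, 2)
                   .reshape(-1, block, block))

density = np.arange(16 * 24, dtype=np.float64).reshape(16, 24)
blocks = split_into_blocks(density)
print(blocks.shape)  # K = (16 / 8) * (24 / 8) sub-blocks
```

Each sub-block can then be processed independently, which is what keeps the per-block topology computation small.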
[0108] For each block, persistent homology features are calculated, and long-lived connected components are selected while short-lived noise is suppressed:
[0109] $PD_k = \{ (b_i, d_i) \}$  Formula (9)
[0110] where $PD_k$ is the persistence diagram of sub-block $D_k$; $(b_i, d_i)$ are the birth and death of a connected component, each point corresponding to one Betti-0 connected component that is born at threshold $b_i$ and disappears at threshold $d_i$. Since sugarcane tails are mainly distributed in connected, clump-like regions, and ring or void structures are rarely seen, this embodiment selects features based on Betti-0, computing only connected components and ignoring rings and voids. Ignoring Betti-1 and Betti-2 reduces computational redundancy while directly reflecting the physical distribution characteristics of sugarcane tails.
[0111] In practice, the pixel values of sub-block $D_k$ are used as thresholds, swept from low to high. Observing at which threshold each independent connected component appears, and at which threshold it merges with another component or disappears, yields a set of $(b_i, d_i)$ pairs.
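The threshold sweep described above can be sketched with a union-find structure. The following is a minimal illustration of Betti-0 persistence on a small 2-D array, not the implementation used in this embodiment. It assumes a low-to-high (sublevel) sweep, 4-connectivity, the usual elder rule (when two components merge, the one born later dies), and it reports the oldest component with an infinite death value. Zero-lifetime pairs are dropped. The function name `persistence_h0` is hypothetical.

```python
import numpy as np

def persistence_h0(block):
    """Return (birth, death) pairs of connected components of a 2-D array
    under a low-to-high threshold sweep with 4-connectivity."""
    h, w = block.shape
    flat = block.ravel()
    order = np.argsort(flat, kind="stable")
    parent = [-1] * (h * w)   # -1 marks a pixel not yet below the threshold
    birth = {}                # pixel index -> threshold at which it was added

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for idx in order:
        idx = int(idx)
        value = float(flat[idx])
        parent[idx] = idx
        birth[idx] = value
        r, c = divmod(idx, w)
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= rr < h and 0 <= cc < w):
                continue
            n = rr * w + cc
            if parent[n] == -1:
                continue
            a, b = find(idx), find(n)
            if a == b:
                continue
            if birth[a] < birth[b]:
                a, b = b, a                # make `a` the younger component
            if birth[a] < value:
                pairs.append((birth[a], value))  # younger component dies here
            parent[a] = b
    root = find(int(order[0]))
    pairs.append((birth[root], float("inf")))   # oldest component never dies
    return pairs

strip = np.array([[0.0, 3.0, 1.0, 4.0, 2.0]])
pairs = persistence_h0(strip)
print(pairs)
```

To track density peaks rather than valleys, the same function could be applied to the negated block; which convention the embodiment uses beyond "low to high" is not specified here.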
[0112] The lifetimes $l_i$ of all connected components of sub-block $D_k$ are summed with weights to obtain the topological feature vector $v_k$ of sub-block $D_k$:
[0113] $v_k = \sum_{i=1}^{n} \phi(l_i) \, l_i, \quad l_i = d_i - b_i$  Formula (10)
[0114] $\phi(l_i) = \exp\left( -\dfrac{(l_i - \mu)^2}{2\sigma^2} \right)$  Formula (11)
[0115] where $v_k$ is the topological feature vector; $l_i$ is the lifetime of a connected component of the sub-block; $n$ is the dimension of the persistent homology vector; $\phi$ is a Gaussian kernel weighting function; $\mu$ is the lifetime mean; $\sigma$ is the standard deviation, used to control the smoothness of the weighting function.
[0116] When $l_i \approx \mu$ and $\phi \approx 1$, the component is a long-lifetime connected component, indicating a stable sugarcane tail cluster; when $l_i$ and $\mu$ differ greatly, $\phi \approx 0$ or $\phi$ is very small, which suppresses short-lifetime noise such as metallic reflection.
[0117] The Gaussian kernel weighting function multiplies the lifetime $l_i$ of each connected component by a weight $\phi(l_i)$ that depends on its difference from $\mu$, so that components whose lifetime is close to $\mu$ are emphasized and components whose lifetime is far from $\mu$ are down-weighted. $\mu$ is the lifetime mean, that is, the average lifetime of the main sugarcane tail clusters in normal scenarios, and can be estimated from historical data or statistical priors. $\sigma$ is the standard deviation, used to control the smoothness of the weighting function: if $\sigma$ is large, the tolerance on lifetime is more lenient; if $\sigma$ is small, only components with lifetimes very close to $\mu$ receive high weights.
[0118] This embodiment uses persistent homology feature extraction for two purposes. Noise suppression: isolated false high-density points, such as metallic reflections, are identified as short-lived connected components, and their influence is suppressed by the low weights ($\phi \approx 0$) assigned by the Gaussian kernel. Distribution plausibility constraint: long-lived connected components correspond to stable sugarcane tail clumps, ensuring that the density map matches the actual packing pattern, that is, clumps rather than scattered points.
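The Gaussian lifetime weighting can be illustrated with a short Python sketch. It assumes the per-block feature is the weighted sum of finite lifetimes described above; the values of `mu` and `sigma` and the example (birth, death) pairs are hypothetical, and components with an infinite death value are skipped as an additional assumption.

```python
import math

def gaussian_weight(lifetime, mu, sigma):
    """Weight near 1 when the lifetime is close to mu, near 0 when far from it."""
    return math.exp(-((lifetime - mu) ** 2) / (2.0 * sigma ** 2))

def topological_feature(pairs, mu, sigma):
    """Weighted sum of the finite lifetimes l_i = d_i - b_i of one sub-block."""
    total = 0.0
    for birth, death in pairs:
        if math.isinf(death):
            continue  # assumption: the never-dying component is ignored
        lifetime = death - birth
        total += gaussian_weight(lifetime, mu, sigma) * lifetime
    return total

# One long-lived component (a stable clump) and one short-lived one (a glint).
example_pairs = [(0.1, 0.9), (0.4, 0.45)]
v_k = topological_feature(example_pairs, mu=0.8, sigma=0.1)
print(v_k)
```

With these hypothetical settings the short-lived component contributes almost nothing, so the feature is dominated by the long-lived one.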
[0119] In step S4, attention weights are calculated for the selected topological features, and a spatial attention weight map is obtained by weighting the features through element-wise multiplication.
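[0120] $\alpha = \mathrm{Sigmoid}(W_a v + b_a)$  Formula (12)
[0121] $f_{\text{att}} = \alpha \odot f_{\text{fused}}$  Formula (13)
[0122] where $\alpha$ is the final attention map; Sigmoid is the Sigmoid function, used to restrict the attention value at each pixel location to the range [0, 1]; $v$ is the concatenation of the topological feature vectors $v_k$ of all blocks; $W_a$ is a learnable weight tensor; $b_a$ is a learnable bias term with the same dimension as $\alpha$; $f_{\text{att}}$ is the output spatial attention weight map; $f_{\text{fused}}$ is the feature map obtained from the previous stage of multimodal fusion, where $C$ is its number of channels.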
[0123] Here $v$ is obtained by concatenating the topological feature vectors of all blocks in sequence; $v_k$ is the persistent homology feature vector of the $k$-th sub-block ($K$ sub-blocks in total), so the concatenated vector contains the topological information of all sub-blocks. $W_a$ maps the topological features of all sub-blocks to the image coordinate system; it learns which pixels are associated with which topological features (which block, which lifetime) and how much weight each should receive. $b_a$ adds a learnable offset to the attention distribution, so that the model can flexibly raise or lower the overall attention level. $\alpha$ is used to perform pixel-level filtering when subsequently fused with the image and text features. $f_{\text{fused}}$ is the feature map obtained from the previous stage of multimodal fusion, with $C$ channels. $\odot$ denotes element-wise multiplication, with $\alpha$ broadcast to all $C$ channels, that is, the features of every channel are weighted by the same attention coefficient. $f_{\text{att}}$ is the weighted feature map output: the model gives higher attention to key regions of the density map and suppresses the weight of noisy or unimportant regions, reducing their impact on subsequent predictions.
[0124] Region-adaptive focusing: Long-lifecycle components (high-density sugarcane tail clumps) are weighted with a Gaussian kernel (ϕ≈1) to obtain high weights, and the model prioritizes salient targets. Short-lifecycle components (noise or isolated points) are suppressed (ϕ≈0) to reduce false detections.
[0125] Cross-modal collaboration: Text prompts such as "dense region" and the persistent homology (PH) features jointly guide the attention distribution. The text encoder outputs semantic features, such as density, which are combined with the PH feature vector $v_k$ to jointly optimize the weight allocation of $W_a$. For example, when the text prompt is "high density on the right", the PH feature weight of the right block is automatically increased.
[0126] This embodiment enables the model to have dynamic perception capabilities by calculating attention weights. When the illumination changes, the PH features provide anti-interference capabilities through topological stability (long-lived components), complementing image features to correct prediction biases. Moreover, interpretability is improved, and the visualization of attention weights can verify the synergistic effect between PH features and text prompts, such as the spatial consistency between high-weight regions and dense sugarcane tail prompts.
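The attention computation of step S4 can be sketched as a sigmoid of a linear map of the concatenated block features, broadcast over the channels of the fused feature map. The tiny shapes, the hand-set values of `W_a` and `b_a`, and the function name `topological_attention` are hypothetical; in a trained model these parameters would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def topological_attention(v, W_a, b_a, f_fused):
    """Map block-level topological features to a per-pixel weight in [0, 1]
    and apply the same weight to every channel of f_fused."""
    h, w, _ = f_fused.shape
    alpha = sigmoid(W_a @ v + b_a).reshape(h, w, 1)
    return alpha, alpha * f_fused  # broadcasting over the channel axis

# Toy setup: 4 blocks, a 2 x 2 feature map with 3 channels.
v = np.array([0.8, 0.0, 0.0, 0.8])   # blocks 0 and 3 hold long-lived components
W_a = 10.0 * np.eye(4)               # hand-set weights, one pixel per block
b_a = np.full(4, -4.0)               # hand-set bias, same length as alpha
f_fused = np.ones((2, 2, 3))

alpha, f_att = topological_attention(v, W_a, b_a, f_fused)
print(alpha[..., 0])
```

In this toy setup, pixels tied to blocks with long-lived components receive weights close to 1, and the others are pushed toward 0.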
[0127] S5. The attention weight map is progressively upsampled to the original resolution through transposed convolution to generate an initial density map. A maximum stacking density constraint is applied to the density value of each pixel in the initial density map to obtain a constrained density map. After filtering the constrained density map, a filtered density map is obtained.
[0128] In step S5, the attention weight map is progressively upsampled to the original resolution to obtain the initial density map:
[0129]
[0130] $D_{\text{init}} = \mathrm{ConvTranspose}(f_{\text{att}};\ k,\ s,\ p)$  Formula (14)
[0131] where $D_{\text{init}}$ is the initial density map; ConvTranspose is the transposed convolution operation; $f_{\text{att}}$ is the spatial attention weight map; $k$ is the kernel size; $s = 2$ is the stride; $p$ is the padding applied during the transposed convolution to ensure that the output size matches the target, that is, the resolution is restored to that of the original image.
[0132] Transposed convolution restores the low-resolution feature map to the original image size, which allows pixel-level density prediction. Compared with bilinear or nearest-neighbor interpolation, transposed convolution has learnable kernels that can extract and combine high-level features while upsampling, giving a better fit to the target distribution. $D_{\text{init}}$ is the unclipped, unconstrained raw predicted density map.
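The effect of progressive upsampling on resolution can be checked with the standard output-size relation for a transposed convolution without output padding or dilation: out = (in - 1) * stride - 2 * padding + kernel. In the sketch below only the stride of 2 comes from the description; the kernel size of 4, the padding of 1, the starting size of 16, and the number of stages are hypothetical values chosen so that each stage exactly doubles the resolution.

```python
def conv_transpose_output_size(size, kernel, stride=2, padding=0):
    """Spatial output size of a transposed convolution
    (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel

def progressive_sizes(size, stages, kernel=4, stride=2, padding=1):
    """Sizes after each of `stages` successive transposed convolutions."""
    sizes = [size]
    for _ in range(stages):
        size = conv_transpose_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

sizes = progressive_sizes(16, 5)
print(sizes)
```

With these hypothetical settings, five stages take a 16-pixel side length back to 512, which would correspond to undoing a 32-fold downsampling.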
[0133] Apply a maximum packing density constraint to the density value of each pixel in the initial density map:
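[0134] $D_{\text{clip}} = \min(D_{\text{init}}, \rho_{\max})$  Formula (15)
[0135] where $D_{\text{clip}}$ is the constrained density map; $\rho_{\max}$ is defined as the maximum physical packing density; the clip operation is applied to all pixel values, and any value exceeding the limit is pulled back to $\rho_{\max}$.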
[0136] In sugarcane tail conveying, there is a prior physical limit on the number of sugarcane tails that a unit pixel can carry, because in real-world scenarios the tails cannot be stacked indefinitely. Once a prediction exceeds this limit, it is treated as oversaturated and is truncated. Keeping predictions within the physical limit improves adaptability to real conditions and reduces false predictions.
[0137] The constraint also suppresses extreme predictions and prevents the network from generating unreasonably large local values.
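The maximum packing density constraint amounts to a per-pixel clamp. A minimal sketch follows, assuming a lower bound of 0 (densities are non-negative) and a hypothetical limit of 1.5 tails per pixel; neither value is given in the text.

```python
def clip_density(density, rho_max):
    """Clamp every pixel of a 2-D density map to the range [0, rho_max]."""
    return [[min(max(value, 0.0), rho_max) for value in row] for row in density]


# Hypothetical initial density map; 1.7 and 3.4 exceed the assumed limit.
initial = [[0.2, 1.7, -0.1],
           [0.9, 3.4, 0.0]]
constrained = clip_density(initial, rho_max=1.5)
```

Values already inside the range are left unchanged, so the constraint only affects oversaturated or negative predictions.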
[0138] Gaussian filtering is applied to the constrained density map along the horizontal movement direction of the conveyor belt to obtain the filtered density map.
[0139] S6. Sum the filtered density map pixel by pixel to output the total number of sugarcane tails in the current frame.
[0140] In this embodiment, the value at each pixel (x, y) of the filtered density map is the sugarcane tail density at that location. The generated map can be used directly for visualization, such as heat maps, or integrated further to calculate the total count.
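The filtering and counting steps can be sketched as a one-dimensional Gaussian smoothing of each image row followed by a pixel-wise sum. This assumes the conveyor moves along the image x axis; the standard deviation of 1.0 and the radius of 2 are illustrative values, not taken from the text.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian weights over offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def filter_along_rows(density, kernel):
    """Smooth each row with the kernel (zero padding at the borders)."""
    radius = len(kernel) // 2
    width = len(density[0])
    filtered = []
    for row in density:
        new_row = []
        for x in range(width):
            acc = 0.0
            for j, weight in enumerate(kernel):
                src = x + j - radius
                if 0 <= src < width:
                    acc += weight * row[src]
            new_row.append(acc)
        filtered.append(new_row)
    return filtered


def count_tails(density):
    """Sum the density map pixel by pixel to get the frame count."""
    return sum(sum(row) for row in density)


# Hypothetical constrained density map holding 1.5 tails in total.
constrained = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
]
filtered = filter_along_rows(constrained, gaussian_kernel_1d(sigma=1.0, radius=2))
total = count_tails(filtered)
```

Because the kernel is normalized, smoothing spreads each peak along the row without changing the total, apart from mass lost at the image borders under zero padding.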
[0141] S7. Collect multiple industrial site images, label the center point of each sugarcane tail in the images, and generate density map labels using Gaussian kernels; associate scene description text and camera coordinate parameters with the site images to construct a multimodal dataset.
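Density map labels of this kind are commonly built by placing one Gaussian per annotated center point. A minimal sketch follows, assuming each Gaussian is normalized over the image so that the label sums to the number of annotated tails; the image size, the center coordinates and the standard deviation of 2.0 are hypothetical.

```python
import math


def density_label(height, width, centers, sigma=2.0):
    """Ground-truth density map: one normalized 2-D Gaussian per center (x, y)."""
    label = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        weights = [
            [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(sum(row) for row in weights)
        for y in range(height):
            for x in range(width):
                # Each tail contributes exactly 1 to the sum of the label.
                label[y][x] += weights[y][x] / total
    return label


centers = [(5, 5), (12, 8), (13, 9)]  # hypothetical annotated center points
label = density_label(16, 20, centers)
label_count = sum(sum(row) for row in label)
```

Normalizing each Gaussian inside the image keeps the label count equal to the annotation count even for tails near the border.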
[0142] S8. Construct an initial counting model through steps S2-S6, and train the model using a multimodal dataset to obtain an optimized counting model.
[0143] In step S8, the initial counting model is trained sequentially through basic training, topology fine-tuning, and joint optimization.
[0144] Basic training is performed by minimizing the pixel-level error of the density map and constraining the maximum density value;
[0145] Topology fine-tuning is performed by freezing the image encoder weights and introducing a persistent homology loss to optimize the density map topology during training.
[0146] Joint optimization is performed by minimizing the pixel-level error of the density map, constraining the maximum density value, introducing the persistent homology loss to optimize the density map topology, and performing cross-modal alignment. All parameters are unfrozen, and training continues until convergence to obtain an optimized counting model.
[0147] The loss function in this embodiment includes:
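[0148] Total loss function $L$:
[0149] $$L = L_{\text{MSE}} + \lambda_{1} L_{\text{PH}} + \lambda_{2} L_{\text{phys}} + \lambda_{3} L_{\text{align}}$$ Formula (16), where the four terms are the MSE loss, the persistent homology loss, the physical constraint loss and the cross-modal contrastive loss defined below, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ are weighting coefficients.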
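[0150] MSE loss $L_{\text{MSE}}$:
[0151] $$L_{\text{MSE}} = \frac{1}{N}\sum_{(x,y)} \left(D_{\text{pred}}(x,y) - D_{\text{gt}}(x,y)\right)^{2}$$ Formula (17)
[0152] where $D_{\text{pred}}$ is the predicted density map; $N$ is the number of pixels; and $D_{\text{gt}}$ is the ground-truth density map generated by blurring the point annotations with a Gaussian kernel;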
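[0153] Persistent homology loss $L_{\text{PH}}$:
[0154] $$L_{\text{PH}} = \sum_{k=1}^{K} W_{1}\left(\mathrm{PD}\left(D_{\text{pred}}^{k}\right),\ \mathrm{PD}\left(D_{\text{gt}}^{k}\right)\right)$$ Formula (18)
[0155] where $W_{1}$ is the 1-Wasserstein distance, used to measure the difference between two persistence diagrams $\mathrm{PD}(\cdot)$; the calculation is block-based: the predicted and ground-truth density maps are divided into $K$ blocks $D^{k}$, and each block is computed independently;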
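[0156] Physical constraint loss $L_{\text{phys}}$:
[0157] $$L_{\text{phys}} = \sum_{(x,y)} \max\left(0,\ D_{\text{pred}}(x,y) - \rho_{\max}\right)$$ Formula (19)
[0158] where $D_{\text{pred}}$ is the predicted density map and $\rho_{\max}$ is the maximum density; the penalty term is a linear penalty applied to predicted values that exceed the maximum density.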
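[0159] Cross-modal contrastive loss $L_{\text{align}}$: $$L_{\text{align}} = -\log \frac{\exp\left(\mathrm{sim}(z, t^{+})/\tau\right)}{\exp\left(\mathrm{sim}(z, t^{+})/\tau\right) + \sum_{t^{-}} \exp\left(\mathrm{sim}(z, t^{-})/\tau\right)}$$
[0160] where $z$ is the image feature vector (output of the CLIP image encoder) and $\mathrm{sim}(\cdot,\cdot)$ is the similarity between two feature vectors;
[0161] $t^{+}$ and $t^{-}$ are the features of the positive and negative example text prompts;
[0162] $\tau$ is the temperature coefficient, used to adjust the similarity distribution.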
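The pixel-level, physical-constraint and cross-modal terms can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's code: the physical penalty is summed rather than averaged, the contrastive term uses cosine similarity in the common InfoNCE form, and the temperature of 0.07 and all numeric inputs are hypothetical. The persistent homology term is left out because it needs an optimal-transport matching between diagrams.

```python
import math


def mse_loss(pred, target):
    """Mean squared pixel error between predicted and ground-truth maps."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / n


def physical_penalty(pred, rho_max):
    """Linear penalty on the amount by which each pixel exceeds rho_max."""
    return sum(max(0.0, p - rho_max) for row in pred for p in row)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_vec, positive_text, negative_texts, tau=0.07):
    """InfoNCE-style loss: pull the image toward the positive prompt."""
    pos = math.exp(cosine(image_vec, positive_text) / tau)
    neg = sum(math.exp(cosine(image_vec, t) / tau) for t in negative_texts)
    return -math.log(pos / (pos + neg))


pred = [[0.5, 2.0], [1.0, 0.0]]
target = [[0.5, 1.0], [1.0, 0.0]]
mse = mse_loss(pred, target)
penalty = physical_penalty(pred, rho_max=1.5)

image_vec = [1.0, 0.0, 0.0]
matching = [0.9, 0.1, 0.0]
unrelated = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = contrastive_loss(image_vec, matching, unrelated)
misaligned = contrastive_loss(image_vec, unrelated[0], [matching, unrelated[1]])
```

The contrastive term is close to zero when the image feature matches the positive prompt and grows when a negative prompt is the better match.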
[0163] In basic training:
[0164] Active losses: the MSE loss and the physical constraint loss;
[0165] Optimizer: AdamW, with a set learning rate and weight decay;
[0166] Objective: to initially fit the density distribution and satisfy the physical constraints.
[0167] In topology fine-tuning:
[0168] Active losses: the persistent homology loss is introduced;
[0169] Frozen parameters: the image encoder (ResNet-50) weights are fixed;
[0170] Learning rate: reduced to 5 × 10^(-5);
[0171] Objective: to optimize the topology of the density map.
[0172] In joint optimization:
[0173] Active losses: all losses;
[0174] Unfrozen parameters: all parameters are trainable;
[0175] Learning rate: further reduced to 1 × 10^(-5);
[0176] Objective: to balance local accuracy with global consistency.
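The three training stages can be summarized as a small schedule table. The sketch below is one possible encoding, not the embodiment's configuration: the stage and loss names are hypothetical, keeping the earlier losses active during topology fine-tuning is an assumption, leaving the encoder unfrozen in basic training is an assumption, and the basic-training learning rate is left as `None` because its value is not available here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    active_losses: Tuple[str, ...]
    freeze_image_encoder: bool
    learning_rate: Optional[float]  # None: value not available in the text


SCHEDULE = (
    Stage("basic_training", ("mse", "physical"), False, None),
    # Assumption: earlier losses stay active when the topology loss is added.
    Stage("topology_fine_tuning",
          ("mse", "physical", "persistent_homology"), True, 5e-5),
    Stage("joint_optimization",
          ("mse", "physical", "persistent_homology", "cross_modal"), False, 1e-5),
)


def is_trainable(stage, parameter_group):
    """Whether a parameter group is updated in the given stage."""
    return not (stage.freeze_image_encoder and parameter_group == "image_encoder")


encoder_trainable = [is_trainable(s, "image_encoder") for s in SCHEDULE]
```

Encoding the schedule as data keeps the per-stage choice of losses, frozen parameters and learning rate in one place.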
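[0177] The inference process in this embodiment is as follows:
[0178] Input processing:
[0179] Image normalization: $I_{\text{norm}} = (I - \mu)/\sigma$, where $\mu$ and $\sigma$ are the per-channel mean and standard deviation of the dataset;
[0180] Text encoding: $t = \mathrm{CLIP}_{\text{text}}(\text{Prompt})$.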
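[0181] The prediction result is as follows:
[0182] $$N_{\text{tails}} = \sum_{(x,y)} D_{\text{filt}}(x, y)$$ where $D_{\text{filt}}$ is the filtered density map and $N_{\text{tails}}$ is the total number of sugarcane tails in the current frame.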
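The input normalization used at inference time can be sketched per pixel as follows. The channel means and standard deviations below are hypothetical placeholders; the embodiment derives them from its own dataset.

```python
def normalize_pixel(rgb, mean, std):
    """Normalize one RGB pixel channel by channel: (value - mean) / std."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


# Hypothetical dataset statistics on the 0..255 pixel scale.
MEAN = (120.0, 115.0, 100.0)
STD = (60.0, 58.0, 55.0)

normalized = normalize_pixel((180, 115, 45), MEAN, STD)
```

A pixel equal to the dataset mean maps to zero in every channel, and one standard deviation above or below the mean maps to +1 or −1.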
[0183] This embodiment uses a high-definition industrial camera, such as a Hikvision camera, mounted directly above the conveyor belt trough, to capture RGB images at a resolution of 1920×1080 at a rate of 30 frames per second. The computing unit uses the NVIDIA Jetson AGX Xavier embedded platform, integrating image preprocessing, model inference, and post-processing modules.
Claims
1. A real-time sugarcane tail counting method based on multimodal dynamic guidance and topology, characterized in that it includes the following steps:
S1. Capture the sugarcane tails on the conveyor belt with a camera to obtain an original image of the sugarcane tails;
S2. Perform multimodal encoding on the original image to obtain a preprocessed image with text encoding and coordinate encoding;
S3. Extract multi-scale features from the preprocessed image to obtain image features, and concatenate the image features with the text encoding and the coordinate encoding along the channel dimension to obtain a predicted density map;
S4. Divide the predicted density map into blocks and calculate persistent homology features to filter topological features; the block-wise topological features are then fused to generate a spatial attention weight map;
S5. Progressively upsample the attention weight map to the original resolution by transposed convolution to generate an initial density map; apply a maximum stacking density constraint to the density value of each pixel in the initial density map to obtain a constrained density map; filter the constrained density map to obtain a filtered density map;
S6. Sum the filtered density map pixel by pixel to output the total number of sugarcane tails in the current frame;
S7. Collect multiple industrial site images, label the center point of each sugarcane tail in the images, and generate density map labels using a Gaussian kernel; associate the scene description text and the camera coordinate parameters with the images to construct a multimodal dataset;
S8. Construct an initial counting model through steps S2-S6, and train the model using the multimodal dataset to obtain an optimized counting model.
2. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 1, characterized in that: in step S1, the original image includes an RGB image, text prompts, and physical parameters, wherein the RGB image has size H×W×3, where H is the height of the image and W is the width of the image; the text prompts are scene descriptions used to provide semantic prior information; and the physical parameters include slot size, conveyor belt speed, and maximum packing density.
3. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 1, characterized in that: In step S2, the original image is normalized before multimodal encoding. The multimodal encoding includes text encoding and coordinate encoding. The text encoding inputs a scene description based on the position of the original image, and converts the scene description into a 512-dimensional semantic vector through a CLIP text encoder. The coordinate encoding converts the pixel coordinates of the original image into normalized relative coordinates based on the installation position of the camera, and maps the relative coordinates into 512-dimensional positional encoding through a fully connected layer.
4. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 3, characterized in that: in step S2, the original image is normalized according to the mean and standard deviation of the dataset:
$$I_{\text{norm}} = \frac{I - \mu}{\sigma}$$ Formula (1)
where $I$ is the original image, with pixel values in the range [0, 255]; $\mu$ is the mean of all images in the dataset over the three RGB channels; $\sigma$ is the standard deviation of all images in the dataset over the three RGB channels; and $I_{\text{norm}}$ is the normalized original image;
the text encoding is performed as follows:
$$t = \mathrm{CLIP}_{\text{text}}(\text{Prompt})$$ Formula (2)
where Prompt is a natural language prompt; $\mathrm{CLIP}_{\text{text}}$ is a pre-trained CLIP text encoder; and $t$ is a 512-dimensional semantic vector that encodes the semantic information of the text;
the coordinate encoding is performed as follows:
$$p(x, y) = \left(\frac{x}{W},\ \frac{y}{H}\right)$$ Formula (3)
where $p(x, y)$ is the normalized coordinate mapping; $(x, y)$ are the input coordinates of the camera; $H$ is the height of the RGB image of the original image; and $W$ is the width of the RGB image of the original image.
5. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 4, characterized in that: in step S3, the preprocessed image is input into a ResNet-50 network to extract image features:
$$f_{\text{img}} = \mathrm{ResNet50}(I_{\text{norm}}) \in \mathbb{R}^{h \times w \times 2048}$$ Formula (4)
where $f_{\text{img}}$ is the image feature map; $h \times w$ is the output resolution, and the number of channels is 2048;
the text encoding is passed through the text encoder to obtain text features:
$$f_{\text{text}} = \mathrm{CLIP}_{\text{text}}(\text{Prompt}) \in \mathbb{R}^{512}$$ Formula (5)
where $f_{\text{text}}$ is the text feature vector obtained by encoding the natural language prompt;
the coordinate encoding generates coordinate features through coordinate projection:
$$f_{\text{coord}} = W_{c}\, p(x, y) + b_{c}$$ Formula (6)
where $W_{c}$ is a learnable weight matrix; $b_{c}$ is a learnable bias vector; and $f_{\text{coord}}$ is the output 512-dimensional coordinate embedding vector, which represents the spatial information of a location in the image;
the image features, text features, and coordinate features are concatenated along the channel dimension to obtain the predicted density map:
$$f_{\text{fused}} = \mathrm{Concat}\left(f_{\text{img}},\ f_{\text{text}},\ f_{\text{coord}}\right)$$ Formula (7)
where $f_{\text{fused}}$ is the feature map after multimodal fusion.
6. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 1, characterized in that: in step S4, the predicted density map is divided into 8×8 blocks:
$$D = \bigcup_{k=1}^{K} D_{k}$$ Formula (8)
where $D$ is the predicted density map and $K$ is the number of blocks;
for each block, persistent homology features are calculated, and long-lived connected components are selected while short-lived noise is suppressed:
$$\mathrm{PD}_{k} = \left\{(b_{i}, d_{i})\right\}$$ Formula (9)
where $\mathrm{PD}_{k}$ is the persistence diagram of sub-block $D_{k}$; $(b_{i}, d_{i})$ are the birth and death of a connected component, each point corresponding to a Betti-0 connected component that is born at threshold $b_{i}$ and disappears at threshold $d_{i}$;
the lifetimes $l_{i}$ of all connected components of sub-block $D_{k}$ are combined by weighted summation to obtain the topological feature vector $v_{k}$ of sub-block $D_{k}$:
$$v_{k} = \sum_{i} \phi(l_{i})\, l_{i} \in \mathbb{R}^{d}$$ Formula (10)
$$\phi(l_{i}) = \exp\left(-\frac{(l_{i} - \mu)^{2}}{2\sigma^{2}}\right)$$ Formula (11)
where $v_{k}$ is the topological feature vector; $l_{i} = d_{i} - b_{i}$ is the lifetime of a connected component of the sub-block; $d$ is the dimension of the persistent homology vector; $\phi$ is a Gaussian kernel weighting function; $\mu$ is the lifetime mean; and $\sigma$ is the standard deviation, used to control the smoothness of the weighting function; when $l_{i} \approx \mu$, $\phi \approx 1$, which represents a long-lived connected component and indicates a stable sugarcane tail cluster; when $l_{i}$ differs greatly from $\mu$, $\phi \approx 0$ or is very small, which suppresses short-lived noise such as metallic reflection.
7. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 6, characterized in that: in step S4, attention weights are calculated for the selected topological features, and a spatial attention weight map is obtained by weighting the features through element-wise multiplication:
$$\alpha = \sigma\left(W_{a}\, v + b_{a}\right)$$ Formula (12)
$$A = \alpha \odot f_{\text{fused}}$$ Formula (13)
where $\alpha$ is the final attention map; $v$ denotes the selected topological features; $\sigma$ is the Sigmoid function, used to restrict the attention value at each pixel location to the range [0, 1]; $W_{a}$ is a learnable weight tensor; $b_{a}$ is a learnable bias term with the same dimension as $\alpha$; $A$ is the output spatial attention weight map; and $f_{\text{fused}} \in \mathbb{R}^{h \times w \times C}$ is the feature map obtained from the preceding multimodal fusion stage, where $C$ is the number of channels.
8. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 7, characterized in that: in step S5, the attention weight map is progressively upsampled to the original resolution to obtain the initial density map:
$$D_{\text{init}} = \mathrm{ConvTranspose}\left(A;\ k,\ s,\ p\right)$$ Formula (14)
where $D_{\text{init}}$ is the initial density map; ConvTranspose is the transposed convolution operation; $A$ is the spatial attention weight map; $k$ is the kernel size; $s = 2$ is the stride; and $p$ is the padding applied during the transposed convolution;
a maximum packing density constraint is applied to the density value of each pixel in the initial density map:
$$D_{\text{clip}}(x, y) = \operatorname{clip}\left(D_{\text{init}}(x, y),\ 0,\ \rho_{\max}\right)$$ Formula (15)
where $D_{\text{clip}}$ is the constrained density map; $\rho_{\max}$ is the defined maximum physical packing density; and clip(·) is applied to every pixel value, pulling any value that exceeds the limit back to $\rho_{\max}$;
the constrained density map is subjected to Gaussian filtering along the horizontal movement direction of the conveyor belt to obtain a filtered density map.
9. The method for real-time counting of sugarcane tails based on multimodal dynamic guidance and topology according to claim 1, characterized in that: in step S8, the initial counting model is trained sequentially through basic training, topology fine-tuning, and joint optimization; the basic training is performed by minimizing the pixel-level error of the density map and constraining the maximum density value; the topology fine-tuning is performed by freezing the image encoder weights and introducing a persistent homology loss to optimize the density map topology during training; the joint optimization is performed by minimizing the pixel-level error of the density map, constraining the maximum density value, introducing the persistent homology loss to optimize the density map topology, and performing cross-modal alignment, with all parameters unfrozen, and training continues until convergence to obtain an optimized counting model.