Chemical laboratory equipment identification method and device, computer device and storage medium
By combining feature extraction, enhancement, and fusion modules, the robustness and autonomous adaptability of robot vision systems in chemical laboratories are addressed, enabling high-precision identification of chemical laboratory equipment and improving the level of automation and intelligence in laboratories.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANCHANG YANNUO TECH CO LTD
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing robot vision systems lack sufficient robustness and autonomous adaptability in chemical laboratory environments, making it difficult to effectively identify complex and diverse instruments, experimental devices, and chemical reagents. Furthermore, they lack high-order environmental understanding capabilities, resulting in low levels of automation.
A chemical laboratory equipment identification method is proposed. By combining a feature extraction module, a backbone network, a feature enhancement module, a feature fusion network, and an RTDETRDecoder module, and utilizing GSLS, AIFI, BIFPN, and RepC3 modules for multi-scale feature enhancement and fusion, a high-precision and robust identification method for chemical laboratory equipment is achieved.
It achieves high-precision and robust equipment identification in chemical laboratory environments that face challenges such as varying equipment sizes, mutual occlusion, complex backgrounds, and high detail requirements, thereby improving the efficiency and intelligence level of chemical laboratory automation.
Figure CN122244568A_ABST
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence technology and relates to a method, apparatus, computer equipment, and storage medium for identifying chemical laboratory equipment.
Background Technology
[0002] In the field of chemical experiment automation, autonomous robot operation depends heavily on the environmental information provided by visual perception systems. General-purpose computer vision, the cornerstone of robot environmental perception and task planning, has matured; in particular, large-scale pre-training methods represented by visual foundation models have significantly improved model generalization and adaptability to multi-scenario tasks. Nevertheless, robots still face a series of deep-seated perceptual and cognitive obstacles in the highly structured and constrained professional environment of chemical laboratories. These obstacles include the precise identification of complex and diverse instruments, experimental setups, and chemical reagents, spatial state perception, and understanding of operational logic. Currently, a systematic and highly reliable overall technical solution is lacking. The high complexity and structure of the chemical laboratory environment necessitate intelligent robot systems with advanced environmental understanding capabilities. The key to realizing such systems lies in integrating multimodal visual information with intelligent analysis models to achieve precise perception and real-time dynamic understanding of experimental setups, material properties, and reaction processes. Such a breakthrough would drive the automation of chemical laboratories, freeing researchers from highly repetitive, precision-demanding operations and thereby improving experimental efficiency and the reproducibility of results. Its more fundamental significance lies in promoting the intelligent transformation of chemical research: as a new research method, it would reshape the path of scientific discovery and usher in a next-generation research paradigm characterized by autonomous experimentation.
[0003] Currently, robot vision and imaging technologies have been widely applied and integrated into various industrial processes, including dimensional measurement, quality inspection, target recognition, autonomous navigation, and automated assembly. However, the intelligence demonstrated by existing systems in areas such as overall environmental perception and deep semantic understanding still lags significantly behind human cognitive levels. The fundamental reason for this limitation lies in the fact that several underlying scientific problems and key technological bottlenecks in robot vision systems have not yet been overcome. The most critical constraint is that, when facing open, highly dynamic, and complex scenarios, the system lacks sufficiently robust scene understanding capabilities and autonomous adaptive decision-making mechanisms.
Summary of the Invention
[0004] To address the problems existing in the above-mentioned traditional methods, the present invention proposes a method, apparatus, computer equipment, and storage medium for identifying chemical laboratory equipment.
[0005] To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions:
[0006] In one aspect, a method for identifying chemical laboratory equipment is provided, including the following steps: Images of the chemical laboratory experiment process are acquired and preprocessed to obtain the input image.
[0007] The input image is fed into the feature extraction module, where features are extracted through convolution and max pooling operations to obtain shallow features.
[0008] Shallow features are input into the backbone network for feature extraction to obtain multi-scale deep features.
[0009] Multi-scale deep features are input into the feature enhancement module. The GSLS module and AIFI module are used to enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features. The GSLS module is a parallel dual-branch feature enhancer used to model the long-range dependency of feature maps through the global spatial attention mechanism to understand the spatial layout relationship between chemical laboratory equipment, and to enhance the local detailed features of chemical laboratory equipment through the local spatial attention mechanism.
[0010] Multi-scale enhanced features are input into the feature fusion network, and feature fusion is performed using a multi-layer cascaded BIFPN module. In the fusion path, the RepC3 module is used to further refine the features to obtain multi-scale fused features.
[0011] The multi-scale fusion features are input into the RTDETRDecoder module to obtain the chemical laboratory equipment recognition results.
[0012] In another aspect, a chemical laboratory equipment identification device is also provided, comprising: The image preprocessing unit is used to acquire images of the chemical laboratory experiment process and preprocess them to obtain the input image.
[0013] The shallow feature extraction unit is used to input the input image into the feature extraction module, and perform feature extraction through convolution and max pooling operations to obtain shallow features.
[0014] The multi-scale deep feature extraction unit is used to input shallow features into the backbone network for feature extraction to obtain multi-scale deep features.
[0015] The multi-scale feature enhancement unit is used to input multi-scale deep features into the feature enhancement module. The GSLS module and AIFI module enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features. The GSLS module is a parallel dual-branch feature enhancer used to model the long-range dependency of feature maps through the global spatial attention mechanism to understand the spatial layout relationship between chemical laboratory equipment, and to enhance the local detailed features of chemical laboratory equipment through the local spatial attention mechanism.
[0016] The feature fusion unit is used to input multi-scale enhanced features into the feature fusion network. It uses a multi-layer cascaded BIFPN module for feature fusion. In the fusion path, the RepC3 module is used to further refine the features to obtain multi-scale fused features.
[0017] The chemical laboratory equipment identification unit is used to input multi-scale fused features into the RTDETRDecoder module to obtain the chemical laboratory equipment identification results.
[0018] In another aspect, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of any of the above-described chemical laboratory equipment identification methods.
[0019] Furthermore, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of any of the above-described chemical laboratory equipment identification methods.
[0020] One of the above technical solutions has the following advantages and beneficial effects: In the aforementioned chemical laboratory equipment identification method, apparatus, computer equipment, and storage medium, an input image is input into a feature extraction module, and the obtained shallow features are input into a backbone network to obtain multi-scale deep features; the multi-scale deep features are input into a feature enhancement module, where a GSLS module and an AIFI module enhance the feature representation from local-global and self-attention perspectives respectively; the obtained multi-scale enhanced features are input into a feature fusion network, where a multi-layer cascaded BIFPN module performs feature fusion and a RepC3 module further refines the features in the fusion path; and the obtained multi-scale fused features are input into an RTDETRDecoder module to obtain the chemical laboratory equipment identification result. This method can address the challenges of varying equipment scales, mutual occlusion, complex backgrounds, and high detail requirements in chemical laboratory environments, achieving high-precision and robust equipment identification.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in the embodiments of this application or the conventional technology, the drawings used in the description of the embodiments or the conventional technology will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a flowchart of a chemical laboratory equipment identification method in one embodiment;
Figure 2 is a schematic diagram of a chemical laboratory equipment identification model in one embodiment;
Figure 3 is a schematic diagram of the GSLS module in one embodiment;
Figure 4 shows the training process curves on the ChemEq25 dataset in one embodiment, where Figure 4(a) to Figure 4(c) show the curves of generalized intersection-over-union (GIoU) loss, classification loss, and L1 loss on the training set of the ChemEq25 dataset, Figure 4(d) to Figure 4(f) show the curves of GIoU loss, classification loss, and L1 loss on the validation set of the ChemEq25 dataset, and Figure 4(g) to Figure 4(j) show the precision, recall, mAP50(B), and mAP50-95(B) metrics on the ChemEq25 dataset;
Figure 5 shows the training process curves on the annotated chemical apparatus image dataset in one embodiment, where Figure 5(a) to Figure 5(c) show the curves of GIoU loss, classification loss, and L1 loss on the training set of the annotated chemical apparatus image dataset, Figure 5(d) to Figure 5(f) show the curves of GIoU loss, classification loss, and L1 loss on the validation set of the annotated chemical apparatus image dataset, and Figure 5(g) to Figure 5(j) show the validation precision, recall, mAP50(B), and mAP50-95(B) metrics on the annotated chemical apparatus image dataset;
Figure 6 is a schematic diagram of the visualization results on the annotated chemical apparatus image dataset in one embodiment;
Figure 7 is heatmap (I) of the annotated chemical apparatus image dataset in one embodiment;
Figure 8 is heatmap (II) of the annotated chemical apparatus image dataset in one embodiment;
Figure 9 is heatmap (III) of the annotated chemical apparatus image dataset in one embodiment;
Figure 10 is a schematic diagram of the visualization results on the ChemEq25 dataset in one embodiment;
Figure 11 is heatmap (I) of the visualization results on the ChemEq25 dataset in one embodiment;
Figure 12 is heatmap (II) of the visualization results on the ChemEq25 dataset in one embodiment;
Figure 13 is heatmap (III) of the visualization results on the ChemEq25 dataset in one embodiment.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0024] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
[0025] It should be noted that, in this document, the reference to "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The presentation of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art will understand that the embodiments described herein can be combined with other embodiments. The term "and / or" as used herein refers to any combination of one or more of the associated listed items, and all possible combinations, including such combinations.
[0026] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0027] In one embodiment, as shown in Figure 1, a method for identifying chemical laboratory equipment is provided, which may include the following processing steps 1 to 6:
Step 1: Acquire images of the chemical laboratory experiment process and preprocess them to obtain the input image.
[0028] Step 2: Input the input image into the feature extraction module, and perform feature extraction through convolution and max pooling operations to obtain shallow features.
[0029] Specifically, the input image is processed through a series of convolutional modules (each convolutional module contains a convolutional layer, batch normalization, and ReLU activation function) to perform primary feature extraction and nonlinear transformation, quickly capturing basic edge and texture information. Then, spatial downsampling is performed through the MaxPool2d layer to expand the receptive field while compressing the data dimension.
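The downsampling at the end of the feature extraction module can be illustrated with a minimal sketch in plain Python. The 2×2 kernel and stride of 2 are assumptions made for illustration (the description does not state these values), and the convolution and batch normalization that precede the activation are omitted for brevity.

```python
def relu(fmap):
    """Element-wise ReLU on a 2-D feature map stored as a list of rows."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2d(fmap, kernel=2, stride=2):
    """Max pooling without padding; kernel and stride are assumed values."""
    height, width = len(fmap), len(fmap[0])
    pooled = []
    for top in range(0, height - kernel + 1, stride):
        row = []
        for left in range(0, width - kernel + 1, stride):
            window = [fmap[top + dy][left + dx]
                      for dy in range(kernel) for dx in range(kernel)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy 4x4 single-channel feature map standing in for a convolution output.
feature = [[1.0, -2.0, 3.0, 0.5],
           [0.0, 4.0, -1.0, 2.0],
           [-3.0, 1.5, 2.5, -0.5],
           [0.2, 0.1, -4.0, 6.0]]

pooled = max_pool2d(relu(feature))
print(pooled)  # [[4.0, 3.0], [1.5, 6.0]]
```

Each output value summarizes a 2×2 neighbourhood, so the map shrinks from 4×4 to 2×2 while every later layer sees a wider region of the original image.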
[0030] Step 3: Input the shallow features into the backbone network for feature extraction to obtain multi-scale deep features.
[0031] Specifically, the backbone of the network consists of multiple stacked BasicBlock residual modules, which alleviate the gradient vanishing problem of deep networks through shortcut connections, enabling the network to stably learn more abstract and discriminative semantic features, such as the overall shape and category information of experimental instruments.
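The role of the shortcut connection can be sketched as follows. This is a simplified illustration, not the patent's implementation: `residual_branch` is a hypothetical stand-in for the stacked convolution and normalization layers inside a BasicBlock, and features are plain lists of numbers.

```python
def basic_block(x, residual_branch):
    """Return ReLU(residual_branch(x) + x); the '+ x' term is the shortcut."""
    fx = residual_branch(x)
    return [max(0.0, f + s) for f, s in zip(fx, x)]


def zero_branch(x):
    """A branch that has learned nothing yet: contributes all zeros."""
    return [0.0 for _ in x]


def scale_branch(x):
    """A toy learned transformation (hypothetical): scales the input by 0.5."""
    return [0.5 * v for v in x]


features = [1.0, -2.0, 3.0]
identity_out = basic_block(features, zero_branch)   # shortcut alone carries x
refined_out = basic_block(features, scale_branch)   # branch adds a correction
print(identity_out, refined_out)
```

Even when the residual branch outputs zeros, the input still passes through the shortcut, which is why stacking such blocks keeps gradients flowing in deep networks.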
[0032] Step 4: Input the multi-scale deep features into the feature enhancement module. Use the GSLS module and AIFI module to enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features. The GSLS module is a parallel dual-branch feature enhancer used to model the long-range dependency of feature maps through the global spatial attention mechanism, understand the spatial layout relationship between chemical laboratory equipment, and enhance the local detail features of chemical laboratory equipment through the local spatial attention mechanism.
[0033] Specifically, the GSLS module replaces traditional convolutional operations in the feature enhancement module. This module is a parallel dual-branch attention mechanism whose core function is to enhance the network's ability to represent complex visual patterns by synergistically utilizing global contextual information and local detailed features. The global branch of the GSLS module models the long-range dependencies of feature maps through the attention mechanism, understanding the spatial layout relationships between devices (such as recognizing a distillation apparatus consisting of a heating platform, iron stand, and flask); its local branch enhances local details (such as the scale on the beaker and the shape of the switch) through convolutional operations, thereby synergistically improving the network's ability to perceive complex laboratory scenes.
[0034] The deep features are then fed into the AIFI module (an efficient self-attention Transformer encoder). This module can model global pixel relationships, overcome the locality limitation of convolution operations, and effectively capture the contextual relationships between scattered or occluded targets (for example, even if a pipette is partially occluded, it can still be correctly inferred from its positional relationship with reagent bottles and test tube racks).
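The global pixel-to-pixel modelling attributed to the AIFI module can be illustrated with a minimal single-head self-attention sketch. It assumes identity query, key, and value projections and omits positional encoding, the feed-forward layer, and normalization, so it shows only the attention step rather than a complete encoder layer.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def flatten(fmap):
    """Turn a C x H x W feature map into H*W tokens of length C."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][i][j] for c in range(channels)]
            for i in range(height) for j in range(width)]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (assumption)."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * tok[c] for w, tok in zip(weights, tokens))
                         for c in range(dim)])
    return attended


# Two channels, one row, two columns: each position becomes one token.
tokens = flatten([[[1.0, 0.0]], [[0.0, 1.0]]])
attended = self_attention(tokens)
print(attended)
```

Every output token is a weighted mixture of all positions, which is how a partially occluded object can borrow evidence from distant but related regions.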
[0035] Step 5: Input the multi-scale enhanced features into the feature fusion network, and use a multi-layer cascaded BIFPN module for feature fusion. In the fusion path, the RepC3 module is used to further refine the features to obtain multi-scale fused features.
[0036] Specifically, a Bidirectional Feature Pyramid Network (BIFPN) is used to replace the traditional simple concat operation for multi-scale feature fusion. Through learnable weighted fusion and dense bidirectional connections, it adaptively integrates features from various levels, from high-resolution details to high semantic information, enabling the network to accurately identify targets with huge scale differences simultaneously.
[0037] A Bidirectional Feature Pyramid Network (BIFPN) structure is introduced to replace the traditional concat operation, aiming to build a powerful multi-scale feature fusion backbone to address the core challenges of diverse device scales, complex environments, and critical details in this scenario. This structure first receives the salient features extracted by the attention modules and then applies its dense bidirectional connection topology. Along the top-down path, deeper features are upsampled and fused with shallower features; along the bottom-up path, shallow detail features are downsampled and fused with deeper features. Learnable weighted feature fusion is performed at each fusion node, achieving sufficient information exchange and enhancement between features of different scales. Finally, the multi-scale feature maps, deeply fused and refined by BIFPN, are fed in parallel into subsequent convolutional modules for further feature transformation, providing more discriminative and robust feature representations for the final target classification and localization detection head.
[0038] BIFPN, through its unique bidirectional (top-down and bottom-up) cross-layer connections, efficiently fuses features of varying depths extracted from the backbone network—shallow features contain rich spatial details, while deep features contain advanced semantic information. This fusion mechanism enables the network to adaptively integrate multi-scale information, achieving accurate identification of chemical equipment with vastly different sizes (from miniature test tubes to large reaction vessels) in a single inference. Furthermore, BIFPN's context-enhanced perception capabilities improve the model's robustness in the face of common laboratory interferences such as equipment occlusion, glassware reflections, and complex backgrounds. Moreover, by combining fine local details with overall semantic understanding, this structure also helps distinguish instruments with similar shapes but different functions or specifications (such as graduated cylinders with different scales), achieving more refined identification. In summary, BIFPN provides a comprehensive, accurate, and robust feature representation foundation for this network, making it a key technology for improving the performance of automated visual perception systems in chemical laboratories.
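The learnable weighted fusion at a single BIFPN node might look like the sketch below. The description does not specify how the weights are normalized; this example assumes the "fast normalized fusion" commonly used with BiFPN (clip each weight at zero, then divide by the sum plus a small epsilon). Feature maps are flattened to 1-D lists and nearest-neighbour upsampling is assumed.

```python
def upsample_nearest(feature, factor=2):
    """Nearest-neighbour upsampling of a 1-D feature (assumed method)."""
    return [v for v in feature for _ in range(factor)]


def weighted_fusion(features, weights, eps=1e-4):
    """Fuse same-sized features with non-negative, normalized weights."""
    clipped = [max(0.0, w) for w in weights]        # keep weights >= 0
    total = sum(clipped) + eps                      # eps avoids divide-by-zero
    normalized = [w / total for w in clipped]
    length = len(features[0])
    return [sum(w * f[i] for w, f in zip(normalized, features))
            for i in range(length)]


deep = [2.0, 4.0]                  # low resolution, semantically rich
shallow = [1.0, 1.0, 3.0, 3.0]     # high resolution, detail rich

# Top-down step: upsample the deep feature, then fuse with the shallow one.
fused = weighted_fusion([upsample_nearest(deep), shallow], weights=[1.0, 1.0])
print(fused)
```

In training, the entries of `weights` would be learnable parameters, letting each node decide how much to trust each input scale.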
[0039] Step 6: Input the multi-scale fusion features into the RTDETRDecoder module to obtain the chemical laboratory equipment recognition results.
[0040] Specifically, the refined multi-scale features are fed into the Transformer-based RTDETRDecoder module. This decoder starts with a set of learnable query vectors and predicts the target bounding boxes and categories in parallel through interaction with the encoded features. It abandons the anchor box design in traditional detectors and directly performs end-to-end target set prediction, making it particularly suitable for handling complex situations where targets are densely arranged and overlapping in laboratory settings.
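Because each query yields one prediction directly, turning decoder outputs into detections needs no anchor matching. The sketch below illustrates this under several assumptions not stated in the description: per-class scores come from a sigmoid over logits, boxes are normalized (cx, cy, w, h) tuples, and a fixed score threshold selects the final detections. The function name `decode_queries` is hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_queries(logits, boxes, img_w, img_h, score_thr=0.5):
    """Convert per-query outputs into (class, score, xyxy box) tuples.

    logits: one list of class logits per query.
    boxes:  one normalized (cx, cy, w, h) tuple per query (assumed format).
    """
    detections = []
    for class_logits, (cx, cy, w, h) in zip(logits, boxes):
        scores = [sigmoid(v) for v in class_logits]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < score_thr:
            continue  # this query predicts "no object"
        box = ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
               (cx + w / 2) * img_w, (cy + h / 2) * img_h)
        detections.append((best, scores[best], box))
    return detections


# Two queries, two classes; only the first query is confident.
query_logits = [[2.0, -1.0], [-3.0, -2.0]]
query_boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
detections = decode_queries(query_logits, query_boxes, img_w=100, img_h=200)
print(detections)
```

Each query is handled independently, so overlapping objects can each be claimed by a different query without a separate suppression step.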
[0041] A chemical laboratory equipment recognition model consists of a feature extraction module, a backbone network, a feature enhancement module, a feature fusion module, and an RTDETRDecoder module. This model achieves robust feature extraction through convolutional and BasicBlock residual modules, enhances feature representation from local-global and self-attention perspectives using GSLS and AIFI modules respectively, achieves efficient multi-scale fusion and refinement using BIFPN and RepC3 modules, and finally completes accurate end-to-end detection using the RTDETRDecoder module. The structure of the chemical laboratory equipment recognition model is shown in Figure 2. This network integrates advanced image recognition and target detection algorithms, enabling real-time and accurate detection of various chemical instruments and equipment in real-world experimental scenarios.
[0042] The aforementioned chemical laboratory equipment identification method includes: inputting an input image into a feature extraction module, inputting the obtained shallow features into a backbone network to obtain multi-scale deep features; inputting the multi-scale deep features into a feature enhancement module, using a GSLS module and an AIFI module to enhance feature representation from local-global and self-attention perspectives respectively; inputting the obtained multi-scale enhanced features into a feature fusion network, using a multi-layer cascaded BIFPN module for feature fusion; in the fusion path, using a RepC3 module to further refine features; and inputting the obtained multi-scale fused features into an RTDETRDecoder module to obtain the chemical laboratory equipment identification result. This method can address the challenges of varying equipment scales, mutual occlusion, complex backgrounds, and high detail requirements in chemical laboratory environments, achieving high-precision and robust equipment identification.
[0043] In one embodiment, the feature extraction module in step 2 includes a first convolutional module, a second convolutional module, a third convolutional module, and a max pooling layer connected in sequence; the convolutional module includes a convolutional layer, a batch normalization layer, and a ReLU function.
[0044] In one embodiment, the backbone network includes four stacked BasicBlock residual modules; step 3 includes: inputting shallow features into the first BasicBlock residual module in the backbone network to obtain first-scale deep features; inputting the first-scale deep features into the second BasicBlock residual module in the backbone network to obtain second-scale deep features; inputting the second-scale deep features into the third BasicBlock residual module in the backbone network to obtain third-scale deep features; and inputting the third-scale deep features into the fourth BasicBlock residual module in the backbone network to obtain fourth-scale deep features.
[0045] In one embodiment, the feature enhancement module includes a first convolutional layer, a first GSLS module, a second GSLS module, a second convolutional layer, and an AIFI module; step 4 includes: inputting the first-scale deep features into the first convolutional layer of the feature enhancement module to obtain the first-scale enhanced features; inputting the second-scale deep features into the first GSLS module of the feature enhancement module to obtain the second-scale enhanced features; inputting the third-scale deep features into the second GSLS module of the feature enhancement module to obtain the third-scale enhanced features; and processing the fourth-scale deep features through the second convolutional layer and the AIFI module to obtain the fourth-scale enhanced features.
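The routing of the four scales through the backbone and the feature enhancement module can be traced with placeholder layers. In this sketch each "layer" simply wraps its input in a label, so the printed strings show which modules each scale passes through; the helper `tag` and the layer names are hypothetical and carry no real computation.

```python
def tag(name):
    """Build a placeholder layer that records its name around its input."""
    return lambda x: f"{name}({x})"


def run_backbone(shallow, blocks):
    """Chain the residual blocks and keep the output of every stage."""
    deep_features = []
    current = shallow
    for block in blocks:
        current = block(current)
        deep_features.append(current)
    return deep_features


def enhance(deep_features, conv1, gsls1, gsls2, conv2, aifi):
    """Route each scale to its enhancement path as described above."""
    scale1, scale2, scale3, scale4 = deep_features
    return [conv1(scale1),          # first scale: convolution only
            gsls1(scale2),          # second scale: first GSLS module
            gsls2(scale3),          # third scale: second GSLS module
            aifi(conv2(scale4))]    # fourth scale: convolution, then AIFI


deep = run_backbone("x", [tag(f"block{i}") for i in range(1, 5)])
enhanced = enhance(deep, tag("conv1"), tag("gsls1"), tag("gsls2"),
                   tag("conv2"), tag("aifi"))
for item in enhanced:
    print(item)
```

The trace makes explicit that only the deepest scale goes through self-attention, while the two middle scales receive the local-global GSLS enhancement.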
[0046] In one embodiment, as shown in Figure 3, the GSLS module includes a global spatial attention mechanism, a local spatial attention mechanism, and a point convolutional layer. Inputting the second-scale deep features into the first GSLS module of the feature enhancement module to obtain second-scale enhanced features includes: dividing the second-scale deep features along the channel dimension to obtain first-branch features and second-branch features; inputting the first-branch features into the global spatial attention mechanism to obtain context features, wherein the global spatial attention mechanism performs point convolution, transposition, and Softmax activation on the first-branch features to obtain intermediate features, multiplies and fuses the first-branch features with the intermediate features, applies point convolution, layer normalization, and ReLU activation to the product, and then performs point convolution on the activation result and adds it to the first-branch features to obtain the context features; inputting the second-branch features into the local spatial attention mechanism to obtain local detail features; and concatenating the context features and local detail features and processing them through the point convolutional layer to obtain the second-scale enhanced features.
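The global spatial attention steps described above can be sketched in plain Python. Several details are assumptions, since the description gives no tensor shapes: the first point convolution is taken to reduce the channels to a single attention logit per position, the "multiply and fuse" step is taken to be a weighted sum over positions, and layer normalization is applied without learnable scale or shift. The feature is stored as C rows of N = H×W flattened positions.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, vector):
    """Point convolution on a single vector: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def layer_norm(vector, eps=1e-5):
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


def global_spatial_attention(x, w_key, w1, w2):
    """x: C x N feature; w_key: C weights; w1: C' x C; w2: C x C'."""
    positions = len(x[0])
    # Point convolution to one logit per position, then softmax over positions.
    logits = [sum(w_key[c] * x[c][p] for c in range(len(x)))
              for p in range(positions)]
    attention = softmax(logits)
    # Multiply-fuse: attention-weighted sum of each channel over all positions.
    context = [sum(row[p] * attention[p] for p in range(positions)) for row in x]
    # Point conv -> layer norm -> ReLU -> point conv.
    hidden = [max(0.0, h) for h in layer_norm(matvec(w1, context))]
    delta = matvec(w2, hidden)
    # Residual add: broadcast the per-channel context onto every position.
    return [[row[p] + delta[c] for p in range(positions)]
            for c, row in enumerate(x)]


x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]          # C = 2, N = 3
identity = [[1.0, 0.0], [0.0, 1.0]]
out_zero = global_spatial_attention(x, [0.5, -0.5], identity, [[0.0, 0.0], [0.0, 0.0]])
out_ctx = global_spatial_attention(x, [0.5, -0.5], identity, identity)
print(out_ctx)
```

With a zeroed final convolution the branch returns its input unchanged, which shows that the global context enters purely as an additive, position-independent correction per channel.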
[0047] Specifically, the GSLS module integrates global spatial attention with local detail extraction. The global branch captures long-range dependencies through matrix transformation and Softmax normalization to understand the layout relationships between devices; the local branch enhances key details (such as beaker markings and interface shapes) through depthwise separable convolution. This design goes beyond traditional single attention or convolution operations, achieving collaborative perception of both the "macro layout" and "micro features" of a laboratory scene.
[0048] The operation process of the GSLS module is as follows: First, the input feature map F_i is evenly divided along the channel dimension into two parts, which are fed into a global spatial attention branch and a local spatial attention branch, respectively, for parallel processing. The global spatial attention branch compresses channels through 1×1 convolutions, reconstructs spatial relationships through matrix transposition, and generates a global attention map through softmax normalization. Finally, it restores the channels through another 1×1 convolution, thereby capturing the overall long-range dependencies of the feature map. The local spatial attention branch focuses on extracting detailed features and spatial structure within the neighborhood using 1×1 convolutions, 3×3 depthwise separable convolutions, and a sigmoid activation function. The output features from the two branches are concatenated and then fused across channels and adjusted for dimensionality through 1×1 convolutions, ultimately outputting the enhanced features F_i'. In the task of identifying chemical experimental equipment, this module plays a crucial role: its global attention mechanism can understand the overall layout of the laboratory scene and the spatial relationships between equipment (such as identifying the structure of an entire distillation apparatus), helping the model locate targets in complex backgrounds; while its local attention mechanism can enhance the response of key instrument details (such as the scale of a flask, the shape of an interface, or the fine structure of an electrode), improving the ability to distinguish similar equipment.
This complementary fusion of global and local information enables the model to grasp both the semantics of the macroscopic scene and capture microscopic discriminative features, thereby significantly improving the accuracy and robustness of identifying chemical experimental equipment (especially those with complex structures, key local features, and potential mutual occlusion).
[0049] In one embodiment, the local spatial attention mechanism includes a first point convolutional layer, a depthwise separable convolutional layer, a second point convolutional layer, and a sigmoid function. Inputting the second branch feature into the local spatial attention mechanism to obtain local detail features includes: inputting the second branch feature into the first point convolutional layer to obtain a first intermediate feature; inputting the first intermediate feature into the depthwise separable convolutional layer to obtain a second intermediate feature; adding and fusing the second branch feature and the second intermediate feature and then inputting the result into the second point convolutional layer, activating the result through the sigmoid function to obtain a third intermediate feature; and multiplying the second branch feature and the third intermediate feature point by point and then adding and fusing them with the second branch feature to obtain the local detail features.
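The local spatial attention sequence above can be sketched as follows. This is an illustrative simplification: the depthwise separable convolution is reduced to its 3×3 depthwise part with zero padding, channel counts are kept equal throughout, and all weights are supplied by the caller rather than learned. Features are stored as C × H × W nested lists.

```python
import math


def pointwise(x, weight):
    """1x1 convolution: mix channels at every position; weight is C_out x C_in."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(weight[o][i] * x[i][r][c] for i in range(len(x)))
              for c in range(width)] for r in range(height)]
            for o in range(len(weight))]


def depthwise3x3(x, kernels):
    """3x3 depthwise convolution with zero padding, one kernel per channel."""
    out = []
    for plane, k in zip(x, kernels):
        height, width = len(plane), len(plane[0])
        out.append([[sum(k[dr + 1][dc + 1] * plane[r + dr][c + dc]
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if 0 <= r + dr < height and 0 <= c + dc < width)
                     for c in range(width)] for r in range(height)])
    return out


def combine(a, b, op):
    """Apply a binary op element-wise to two C x H x W features."""
    return [[[op(u, v) for u, v in zip(ra, rb)] for ra, rb in zip(pa, pb)]
            for pa, pb in zip(a, b)]


def local_spatial_attention(x, w1, dw_kernels, w2):
    first = pointwise(x, w1)                          # first point convolution
    second = depthwise3x3(first, dw_kernels)          # depthwise convolution
    summed = combine(x, second, lambda u, v: u + v)   # add-fuse with the input
    gate = [[[1.0 / (1.0 + math.exp(-v)) for v in row] for row in plane]
            for plane in pointwise(summed, w2)]       # second point conv + sigmoid
    gated = combine(x, gate, lambda u, v: u * v)      # point-by-point multiply
    return combine(gated, x, lambda u, v: u + v)      # residual add


x = [[[1.0, 2.0], [3.0, 4.0]]]                        # C = 1, H = W = 2
center_only = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]
local_out = local_spatial_attention(x, [[1.0]], center_only, [[0.0]])
print(local_out)  # [[[1.5, 3.0], [4.5, 6.0]]]
```

With the second point convolution zeroed, the gate is 0.5 everywhere and the output is exactly 1.5 times the input; trained weights would instead raise the gate at detail-rich positions.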
[0050] In one embodiment, the feature fusion network includes: four BIFPN modules, two upsampling layers, five RepC3 modules, one GSLS module, a third convolutional layer, a fourth convolutional layer, and a fifth convolutional layer; step 5 includes: processing the fourth-scale enhanced feature through the third convolutional layer, the GSLS module, and the first upsampling layer to obtain a first upsampling feature; inputting the first upsampling feature and the third-scale enhanced feature into the first BIFPN module to obtain a first intermediate fusion feature; inputting the first intermediate fusion feature into the first RepC3 module to obtain a fourth intermediate feature; processing the fourth intermediate feature through the second upsampling layer to obtain a second upsampling feature; inputting the second upsampling feature and the second-scale enhanced feature into the second BIFPN module to obtain a second intermediate fusion feature; processing the second intermediate fusion feature through the second RepC3 module and inputting the result, together with the first-scale and second-scale enhanced features, into the third BIFPN module to obtain a third intermediate fusion feature; inputting the third intermediate fusion feature into the third RepC3 module to obtain the first-scale fusion feature; processing the first-scale fusion feature through the fourth convolutional layer and inputting the result, together with the third-scale enhanced feature and the fourth intermediate feature, into the fourth BIFPN module to obtain a fourth intermediate fusion feature; inputting the fourth intermediate fusion feature into the fourth RepC3 module to obtain the second-scale fusion feature; processing the second-scale fusion feature through the fifth convolutional layer and inputting the result into the fifth BIFPN module to obtain a fifth intermediate fusion feature; and inputting the fifth intermediate fusion feature into the fifth RepC3 module to obtain the third-scale fusion feature.
[0051] Specifically, the feature fusion network employs a multi-layer cascaded BIFPN (Bidirectional Feature Pyramid Network) for feature fusion. Through bidirectional paths (top-down and bottom-up) combined with learnable weights, it adaptively fuses features from different levels of the backbone network (high-resolution detail and high-level semantic information), constructing a multi-scale feature pyramid rich in both detail and semantics. This enables the network to accurately detect targets of vastly different scales, from tiny test tubes to large fume hoods.
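The learnable-weight fusion performed at a single BIFPN node can be illustrated with the "fast normalized fusion" rule commonly used in BiFPN-style networks. The text does not state which normalization this embodiment uses, so the rule itself, the function name `fast_normalized_fusion`, and the `eps` value are assumptions; the inputs are assumed to have already been resized to a common shape and are flattened to plain lists for brevity.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with non-negative, normalized weights.

    features: list of flattened feature maps (lists of equal length).
    weights:  one learnable scalar per input; ReLU keeps each one >= 0.
    """
    w = [max(0.0, wi) for wi in weights]          # ReLU on the raw weights
    total = sum(w) + eps                          # eps avoids division by zero
    n = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total
            for k in range(n)]
```

With equal weights the node approximately averages its inputs, and an input whose weight has been driven below zero during training is simply ignored, which is how the network can learn how much each scale contributes.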
[0052] In the fusion path, the RepC3 module (composed of reparameterizable convolutions) is used to further refine features. It has a multi-branch structure during training to enrich the gradient flow, and can be merged into a single path during inference, thereby enhancing feature representation without increasing inference time.
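The idea of merging a multi-branch training structure into a single inference path can be shown with a deliberately small example. The sketch below is a single-channel, one-dimensional toy and is not the RepC3 module itself: it omits batch normalization and activations, and the names `conv1d_same`, `multi_branch`, and `merge_kernels` are hypothetical. It relies only on the linearity of convolution: a 1-tap branch is equivalent to a 3-tap kernel whose only non-zero entry is the centre tap.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1D convolution (cross-correlation) with an odd-length kernel."""
    r = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t, k in enumerate(kernel):
            j = i + t - r
            if 0 <= j < len(x):
                acc += k * x[j]
        out.append(acc)
    return out

def multi_branch(x, k3, k1):
    """Training-time form: a 3-tap branch and a 1-tap branch, summed."""
    a, b = conv1d_same(x, k3), conv1d_same(x, k1)
    return [p + q for p, q in zip(a, b)]

def merge_kernels(k3, k1):
    """Inference-time form: fold the 1-tap kernel into the centre tap."""
    merged = list(k3)
    merged[1] += k1[0]
    return merged
```

Running `conv1d_same(x, merge_kernels(k3, k1))` yields the same output as `multi_branch(x, k3, k1)` while performing one convolution instead of two, which is why the extra branches add no inference cost.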
[0053] In Figure 2, P3 represents the first-scale fusion feature, P4 represents the second-scale fusion feature, and P5 represents the third-scale fusion feature.
[0054] In a validation embodiment, the performance of the proposed method was verified using the ChemEq25 dataset and an annotated chemical apparatus image dataset. The ChemEq25 dataset focuses on real-time detection of chemical experimental equipment and contains a total of 4,599 high-quality images. All images have been meticulously annotated by professional annotators to ensure label accuracy. The dataset was constructed with full consideration of the complexity of real-world applications, covering shooting environments with multiple perspectives, lighting conditions, and diverse backgrounds, aiming to improve the generalization ability and robustness of models trained on this data in real-world scenarios. Furthermore, for ease of research use, the dataset has undergone unified image resizing and format standardization, and has been pre-divided into training, validation, and test sets, providing a reliable and reproducible benchmark evaluation platform for research in the field of chemical apparatus identification.
[0055] The labeled chemical apparatus image dataset was designed specifically for object detection tasks. Its 5,078 images were extracted from video frames of experimental procedures recorded with smartphones. All samples have undergone standardized annotation, covering the location and category information of the experimenter's hands and of six common chemical apparatus types: conical beakers, eggplant flasks, Erlenmeyer flasks, pipettes, reagent bottles, and separatory funnels. The dataset has been uniformly converted to JPG format and provides three resolutions: 1920×1080, 1280×720, and 960×540. It also comes with a standardized split into training, validation, and test sets, providing high-quality, directly usable benchmark data for developing and evaluating object detection algorithms in chemical experimental scenarios.
[0056] (1) Loss function
L1 loss (also known as mean absolute error, MAE) is a commonly used error metric in regression tasks. It calculates the average of the absolute differences between predicted and true values. Compared to L2 loss (mean squared error, MSE), L1 loss is more robust to outliers because it does not amplify large errors by squaring them, thus reducing the dominant influence of outliers on the overall training process. Furthermore, L1 loss has a gradient of constant magnitude (±1), which ensures that the parameter update direction during optimization is consistent with the error sign, contributing to sparser solutions during iteration. From an interpretability perspective, L1 loss has the same units as the original data, intuitively reflecting the average absolute magnitude of the error and facilitating analysis and debugging. In visual tasks such as image restoration and denoising, L1 loss helps maintain pixel-level numerical accuracy and avoids excessive weighting of a few extreme errors, thereby achieving a more stable and balanced restoration effect overall.
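The difference in outlier sensitivity between the two losses can be seen in a small numeric example. The function names `l1_loss` and `l2_loss` and the sample values are illustrative only and do not come from the experiments in this document.

```python
def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Three perfect predictions and one outlier with an error of 10.
pred = [1.0, 2.0, 3.0, 14.0]
target = [1.0, 2.0, 3.0, 4.0]
print(l1_loss(pred, target))  # 2.5
print(l2_loss(pred, target))  # 25.0
```

The single outlier contributes linearly to the L1 loss but quadratically to the L2 loss, so under L2 it dominates the total far more strongly.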
[0057] Generalized Intersection over Union (GIoU) loss is a loss function used to optimize bounding box regression in object detection. Compared to standard IoU loss, its core improvement lies in introducing the area of the smallest enclosing rectangle (the smallest rectangle that encloses both the predicted and ground truth boxes) as a normalization factor. Building on IoU, GIoU additionally accounts for the relative position and shape differences between the predicted and ground truth boxes, with a value range of (-1, 1]: GIoU = 1 when the two boxes overlap perfectly, and GIoU approaches -1 when the two boxes do not overlap at all and are infinitely far apart. This design allows GIoU to provide an effective gradient direction and magnitude even when the two boxes do not overlap, thus continuing to drive the optimization and alleviating the vanishing-gradient problem of standard IoU loss in this scenario. Therefore, GIoU loss not only improves the accuracy of bounding box localization but also significantly accelerates model convergence by providing more stable and informative gradients, ultimately improving the regression robustness and performance of the detection model in complex scenarios.
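The standard GIoU definition, GIoU = IoU - (C - U) / C, where U is the union area and C is the area of the smallest enclosing rectangle, can be written out directly. In the sketch below the `(x1, y1, x2, y2)` corner format and the function names `giou` and `giou_loss` are assumptions chosen for illustration, and both boxes are assumed to have non-zero area.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest rectangle enclosing both boxes.
    c_area = ((max(a[2], b[2]) - min(a[0], b[0])) *
              (max(a[3], b[3]) - min(a[1], b[1])))
    return inter / union - (c_area - union) / c_area

def giou_loss(a, b):
    """Loss is 0 for identical boxes and approaches 2 for distant boxes."""
    return 1.0 - giou(a, b)
```

For two unit boxes separated horizontally, for example `(0, 0, 1, 1)` and `(9, 0, 10, 1)`, plain IoU is 0 regardless of the distance, whereas GIoU is -0.8 and keeps decreasing as the boxes move apart, which is the source of the useful gradient for non-overlapping boxes.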
[0058] In the training of object detection models, the classification loss (CLS) is the core optimization objective driving the model to accurately identify object categories. It provides the model with a clear learning signal by measuring the difference between the model's predicted class probability distribution and the true labels. Faced with class imbalance and a mixture of easy and difficult samples in the training data, a well-designed classification loss can dynamically evaluate the learning status of samples, automatically reducing the weight given to "easy samples" that the model has already mastered while focusing optimization on difficult-to-distinguish "hard samples" or key categories. This adaptive sample weighting mechanism prompts the model to learn more discriminative features, thereby significantly enhancing the model's classification robustness in complex scenarios while improving overall average accuracy.
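The text does not name the specific classification loss used. One widely used loss with the down-weighting behavior described above is the focal loss, sketched below purely as an illustration; the binary formulation and the default values `gamma=2.0` and `alpha=0.25` are assumptions taken from common practice, not from this document.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, in (0, 1).
    y: ground-truth label, 1 or 0.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

For a positive sample, a confident prediction of 0.9 is scaled by (1 - 0.9)^2 = 0.01, while a poor prediction of 0.1 is scaled by 0.81, so hard samples dominate the gradient. Setting `gamma=0.0` and `alpha=1.0` recovers the ordinary cross-entropy for positive samples.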
[0059] In the multi-task learning framework of the object detection model, classification loss, generalized intersection over union (GIoU) loss, and L1 loss are collaboratively optimized to drive the model to achieve accurate end-to-end detection. Classification loss focuses on learning the semantic category of the target, ensuring the model "recognizes accurately"; GIoU loss optimizes the spatial overlap and relative position between the predicted and ground truth boxes, enabling the model to "locate accurately"; and L1 loss typically operates on fine-grained regression of bounding box coordinates and calibration of target presence confidence, further improving the numerical accuracy and reliability of localization. These three loss functions each perform their respective functions while also constraining each other, forming effective gradient synergy during joint training. This promotes a balance among the three key sub-tasks of recognition, localization, and confidence assessment, ultimately yielding a robust detection system with strong generalization capabilities.
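A joint objective of this kind is usually written as a weighted sum of the three terms. The form below is a sketch under that assumption: the weighting coefficients \lambda_{cls}, \lambda_{L1}, and \lambda_{GIoU} are hypothetical symbols, and the text gives no values for them. Here b denotes a ground truth box and \hat{b} the matched predicted box.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}
  + \lambda_{\text{L1}}\,\lVert b - \hat{b} \rVert_{1}
  + \lambda_{\text{GIoU}}\,\bigl(1 - \mathrm{GIoU}(b, \hat{b})\bigr)
```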
[0060] (2) Evaluation methods
This embodiment systematically evaluates the model's performance, employing multiple benchmark datasets to ensure the comprehensiveness and comparability of the results. The evaluation is conducted under publicly available and reproducible experimental conditions to verify the model's generalization ability on unseen test sets. The model's effectiveness is measured using a comprehensive evaluation system. Regarding detection accuracy, precision, recall, and mean average precision at different intersection over union (IoU) thresholds (mAP50 and mAP50-95) are calculated; these metrics comprehensively reflect the model's accuracy in localization and classification. Regarding computational efficiency, the model's floating-point operation count (GFLOPs) and real-time inference speed (FPS) are also measured to quantify its computational complexity and deployment feasibility. This evaluation framework balances accuracy and efficiency, aiming to provide multi-dimensional performance data for the model's practical application.
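The accuracy metrics named above can be computed as in the following sketch. It assumes that detections have already been matched to ground truth boxes at a chosen IoU threshold and uses all-point interpolation for average precision, which is one common convention; evaluation toolkits differ in the exact interpolation, and the text does not state which one was used. The function names are illustrative. mAP50 is the mean of the per-class AP at an IoU threshold of 0.5, and mAP50-95 additionally averages over thresholds from 0.50 to 0.95 in steps of 0.05.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(hits, num_gt):
    """Area under the precision-recall curve for one class.

    hits:   one bool per detection, sorted by descending confidence;
            True if the detection matched a ground truth box.
    num_gt: number of ground truth boxes of this class.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in hits:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)
```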
[0061] (3) Experimental results
According to the detailed experimental results on the ChemEq25 dataset shown in Table 1, the model demonstrates comprehensive and outstanding detection performance. Its superiority is primarily reflected in its extremely high recognition accuracy and reliability: the model achieves high precision and high recall for all 25 categories of chemical instruments. More than half of these categories, such as pipettes, wash bottles, porcelain mortars, and precision electronic balances, have precision and recall exceeding 0.99, achieving near-complete recognition coverage and extremely few false alarms. The model performs particularly well on the core metric of mAP50, with all categories exceeding 0.93. Several instruments, including Buchner funnels, glass rods, and volumetric flasks, achieve mAP50 values above 0.99, indicating that nearly all of the model's predicted bounding boxes match the actual targets at an IoU threshold of 0.5. Furthermore, even when facing subcategories with similar shapes that are difficult to differentiate (such as single-necked, two-necked, and three-necked round-bottom flasks), the model maintains excellent mAP50 values ranging from 0.974 to 0.983, demonstrating its powerful fine-grained recognition capabilities. On the more stringent mAP50-95 metric, which comprehensively evaluates performance under different intersection over union (IoU) thresholds, the model consistently maintains a value between 0.635 and 0.826, demonstrating its strong adaptability and robustness to targets of various scales, proportions, and degrees of partial occlusion. In summary, this model achieves extremely high-precision identification and localization of multi-category, multi-morphological, and multi-scale chemical experimental equipment on the ChemEq25 dataset. Its comprehensive performance metrics fully validate the model's effectiveness and advancement, providing a solid and reliable technical foundation for intelligent perception of laboratory equipment in complex real-world scenarios.
[0062] Table 1. Experimental results using the ChemEq25 dataset.
[0063] As the training curves in Figure 4 show, the model's experimental results on the ChemEq25 dataset demonstrate comprehensive and excellent performance. Looking at the loss function curves, the L1 loss, GIoU loss, and classification loss on both the training and validation sets decrease rapidly with each training iteration and eventually stabilize at a low level. This indicates that the model achieves effective and stable optimization in the three core tasks of bounding box regression, classification recognition, and fine-grained coordinate regression, without significant overfitting or oscillations. Regarding key detection metrics, the precision and recall on the training set rise rapidly to a high level and remain stable. The corresponding curves on the validation set show a highly consistent convergence trend and numerical level, indicating that the model not only learns sufficiently from the training data but also exhibits strong generalization ability and recognition consistency on unseen data. In particular, the mAP50 and mAP50-95 metrics, representing comprehensive detection performance, rapidly approach and stabilize at extremely high levels on both the training and validation sets. This confirms that the model can achieve high-precision localization and reliable recognition of various chemical experimental instruments, regardless of their size, shape, or whether they are partially occluded. The overall training curve is smooth and converges rapidly, and all metrics are highly consistent between the training and validation sets, which fully demonstrates the rationality of the model structure design and the effectiveness of the optimization strategy. Ultimately, the model achieves excellent and robust detection performance on the ChemEq25 dataset.
[0064] Comprehensive evaluation on the labeled chemical apparatus image dataset demonstrates that the proposed model exhibits excellent recognition performance and good generalization ability. In challenging real-world experimental scenarios, the model effectively detected all seven target classes (the operator's hands and six common types of chemical laboratory equipment). The model's recognition accuracy for separatory funnels is particularly outstanding, reaching 0.899, reflecting extremely high discrimination accuracy for this type of device with significant structural features. It also achieved a high recall of 0.855 and an mAP50 value of 0.874 for reagent bottles, proving the model's high coverage and precise localization ability for such targets. For the various shapes of conical beakers, conical flasks, and eggplant-shaped flasks, the model's accuracy (0.75-0.859) and mAP50 (0.774-0.887) are both robustly high, indicating its ability to reliably distinguish glassware of different geometries. Of particular note is the model's detection accuracy for the operator's hand, which reaches 0.85, with an mAP50 of 0.797. This provides a crucial technical foundation for achieving human-computer interaction and operational safety monitoring. Across all categories, the model achieved a range of 0.57 to 0.803 on the more stringent comprehensive evaluation metric mAP50-95, reflecting its stable localization and classification quality across different intersection over union (IoU) thresholds. Overall, the model demonstrates balanced and powerful detection capabilities for multi-category and multi-morphological targets in complex laboratory scenarios, providing reliable technical support for the automated perception and analysis of chemical experimental processes. Experimental results for the labeled chemical apparatus image dataset are shown in Table 2.
[0065] Table 2 Experimental Results of the Annotated Chemical Apparatus Image Dataset
[0066] The curves in Figure 5 illustrate the model's behavior during training and validation, demonstrating comprehensive and excellent performance on the dataset. Regarding model optimization, the generalized intersection over union (GIoU) loss, classification loss, and L1 loss on both the training and validation sets decreased consistently and steadily throughout the training process, eventually converging to low levels. This trend clearly indicates that the model has learned effectively and thoroughly in the three core tasks of bounding box localization, object classification, and coordinate regression, with a smooth optimization process without significant fluctuations or divergence. In terms of key performance indicators, the model exhibited rapid learning from the early stages of training, with precision, recall, and mean average precision (mAP50 and mAP50-95) all climbing rapidly to high levels. Importantly, these performance indicators on the validation set maintained a convergence trend highly consistent with that of the training set, demonstrating not only outstanding numerical performance but also stability throughout the entire training cycle. This indicates that the model did not merely memorize training samples but learned highly generalizable feature representations and discrimination rules. In summary, the synergistic, continuous optimization and convergence of all evaluation metrics jointly demonstrate the effectiveness of the model architecture and training strategy, enabling it to achieve accurate, reliable, and highly generalizable detection performance on this dataset.
[0067] According to the training results in Table 3 for the ChemEq25 dataset and the labeled chemical apparatus image dataset, the model demonstrates excellent and balanced overall performance across two different chemical experimental scenario datasets. On the ChemEq25 dataset, the model achieves a precision of 97.13%, a recall of 97.32%, and an mAP50 of 97.62%. This demonstrates that the model possesses very strong recognition and localization capabilities in well-defined and standardized laboratory scenarios, enabling it to complete chemical instrument detection tasks with extremely high reliability. The model also achieved remarkable results on the more challenging real-world scenario dataset, the labeled chemical apparatus image dataset: while maintaining a robust mAP50 of 81.03% and mAP50-95 of 66.76%, its inference speed reached 108.7 FPS, and its parameter count is only 17.69M. This result highlights the model's ability to maintain excellent detection efficiency and lightweight operation even in complex real-world environments (such as those with occlusion or complex backgrounds), demonstrating its ability to balance accuracy, speed, and resource consumption in practical deployments. Overall, the model not only achieves top-tier recognition accuracy under ideal conditions but also enables efficient, lightweight, and reliable operation in complex real-world scenarios, laying a solid technical foundation for its widespread deployment in various practical chemical laboratory automation applications.
[0068] Table 3. Training results using the ChemEq25 dataset and the labeled chemical apparatus image dataset.
[0069] (4) Visualization of the labeled chemical apparatus image dataset
First, this method demonstrates extremely reliable performance in the accurate identification and localization of multiple targets in complex real-world scenarios. For example, as shown in Figure 6, in a typical laboratory benchtop scenario, the model can simultaneously and accurately detect multiple transparent or translucent chemical vessels (such as conical flasks, beakers, and reagent bottles) and label them with high confidence (e.g., 0.9). This demonstrates that the model not only possesses fine-grained recognition capabilities to distinguish similar categories of glassware, but also achieves stable detection with high recall and high precision even in the presence of complex backgrounds and potential reflective interference.
[0070] Secondly, to further demonstrate the effectiveness of the model, the trained model was tested using a dedicated heatmap visualization method, which better showcases the model's recognition performance. Figures 7 to 9 vividly demonstrate this advantage: the model can overlay its extracted deep visual features (such as attention heatmaps) onto the original RGB image or infrared thermal image in the form of a pseudo-color heatmap. For example, in the heatmap of Figure 9, the shapes of the operator's hand and tools are clearly outlined and combined with temperature distribution information; in Figure 8, the heatmap focuses precisely on the liquid level in the graduated cylinder and the key instrument area. This capability intuitively reveals the basis for the model's decisions, making it not just a "black box" but an interpretable analytical tool that can effectively link physical operations, equipment status, and the model's focus.
[0071] Furthermore, the model enables collaborative perception of key operational elements and equipment status. Together, Figure 7 and Figure 9 demonstrate that the model can simultaneously process and understand dynamic operations (such as a hand holding a tool to apply paint) and static devices in the scene.
[0072] Finally, all these capabilities are integrated into a highly efficient and lightweight architecture. The model maintains extremely high processing speed and low latency when processing a single image containing multiple targets and complex backgrounds, while having a very small number of parameters. This ensures that its superior perception performance can be seamlessly integrated into practical automated experimental platforms or real-time monitoring systems, meeting the needs of high-throughput, resource-constrained edge deployments.
[0073] In summary, the series of visualization results collectively confirm that the model has successfully constructed a high-precision, interpretable, context-aware, and highly deployable visual perception system for chemical experiments. It can not only accurately locate and identify various instruments, but also analyze operational processes and equipment status in depth, presenting its "thinking" process through intuitive visualization. This provides a powerful and reliable technological foundation for automated operation, process monitoring, and scientific research in intelligent laboratories.
[0074] (5) Visualization results of the ChemEq25 dataset
A comprehensive analysis of the four visualization results demonstrates the method's superior multi-dimensional perception and understanding capabilities on this dataset. Firstly, this method exhibits accurate and reliable target recognition and localization in complex real-world scenarios. As shown in Figure 10, for transparent conical flasks and pipettes, the model not only accurately detects their shape but also provides confidence scores as high as 0.8 and 0.9, demonstrating its stable and high-precision recognition capability on challenging targets (such as transparent and reflective objects). In the model's visualization analysis, the core role of the heatmap is to intuitively reveal the distribution of the model's "attention" or "decision focus" in the image. It maps the model's internal feature responses or the activation values of specific layers onto the original image using color, clearly showing which regions contribute most to the model's prediction results. This allows us to "see" whether, during recognition or detection, the model focuses on a key component of the target (such as a specific structure of an instrument) or is distracted by the background or other irrelevant areas, greatly enhancing the transparency and interpretability of the model's decision-making process. As shown in the heatmaps of Figure 11 and Figure 12, the model can clearly focus its attention on key areas (the highlighted feature bands in the image) and present its decision-making basis in an intuitive visual form. This not only verifies the rationality of the model's focus but also makes it an understandable and analyzable tool. Furthermore, the model possesses excellent adaptability to complex environments and contextual awareness. As shown in Figure 13, in a typical laboratory benchtop scene with low light and a cluttered background, the model effectively handles multiple stacked or adjacent objects and accurately maps feature responses onto the target, demonstrating strong robustness to occlusion and complex backgrounds. This visual evidence collectively demonstrates that the model successfully integrates high-precision detection, strong scene adaptability, and deep interpretability. Its attention mechanism effectively focuses on semantically key regions, thus achieving reliable and transparent intelligent visual perception in realistic and dynamic chemical experimental environments.
[0075] It should be understood that, although the steps in Figure 1 above are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise explicitly stated in this document, there is no strict order in which these steps are executed; they can be performed in other orders. Furthermore, at least some of the steps in Figure 1 above may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0076] In one embodiment, a chemical laboratory equipment identification device is also provided, comprising the following units. The image preprocessing unit is used to acquire images of chemical laboratory experiments and preprocess them to obtain the input image.
[0077] The shallow feature extraction unit is used to input the input image into the feature extraction module, and perform feature extraction through convolution and max pooling operations to obtain shallow features.
[0078] The multi-scale deep feature extraction unit is used to input shallow features into the backbone network for feature extraction to obtain multi-scale deep features.
[0079] The multi-scale feature enhancement unit is used to input multi-scale deep features into the feature enhancement module. The GSLS module and AIFI module enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features. The GSLS module is a parallel dual-branch feature enhancer used to model the long-range dependency of feature maps through the global spatial attention mechanism to understand the spatial layout relationship between chemical laboratory equipment, and to enhance the local detailed features of chemical laboratory equipment through the local spatial attention mechanism.
[0080] The feature fusion unit is used to input multi-scale enhanced features into the feature fusion network. It uses a multi-layer cascaded BIFPN module for feature fusion. In the fusion path, the RepC3 module is used to further refine the features to obtain multi-scale fused features.
[0081] The chemical laboratory equipment identification unit is used to input multi-scale fused features into the RTDETRDecoder module to obtain the chemical laboratory equipment identification results.
[0082] In one embodiment, the feature extraction module in the shallow feature extraction unit includes a first convolutional module, a second convolutional module, a third convolutional module, and a max pooling layer connected in sequence; the convolutional module includes a convolutional layer, a batch normalization layer, and a ReLU function.
[0083] In one embodiment, the backbone network includes four stacked BasicBlock residual modules; a multi-scale deep feature extraction unit is further configured to input shallow features into the first BasicBlock residual module in the backbone network to obtain first-scale deep features; input the first-scale deep features into the second BasicBlock residual module in the backbone network to obtain second-scale deep features; input the second-scale deep features into the third BasicBlock residual module in the backbone network to obtain third-scale deep features; and input the third-scale deep features into the fourth BasicBlock residual module in the backbone network to obtain fourth-scale deep features.
[0084] In one embodiment, the feature enhancement module includes a first convolutional layer, a first GSLS module, a second GSLS module, a second convolutional layer, and an AIFI module; the multi-scale feature enhancement unit is further configured to input first-scale deep features into the first convolutional layer of the feature enhancement module to obtain first-scale enhanced features; input second-scale deep features into the first GSLS module of the feature enhancement module to obtain second-scale enhanced features; input third-scale deep features into the second GSLS module of the feature enhancement module to obtain third-scale enhanced features; and process fourth-scale deep features through the second convolutional layer and the AIFI module to obtain fourth-scale enhanced features.
[0085] In one embodiment, the GSLS module includes a global spatial attention mechanism, a local spatial attention mechanism, and a point convolutional layer; a multi-scale feature enhancement unit is further used to divide the second-scale deep features into channels to obtain first-branch features and second-branch features; the first-branch features are input into the global spatial attention mechanism to obtain context features; the global spatial attention mechanism is used to perform point convolution, transpose, and Softmax activation on the first-branch features to obtain intermediate features; the first-branch features and intermediate features are multiplied and fused, then subjected to point convolution, layer normalization, and ReLU activation, and the activation result is added to the first-branch features after point convolution to obtain context features; the second-branch features are input into the local spatial attention mechanism to obtain local detail features; the context features and local detail features are concatenated and processed by the point convolutional layer to obtain second-scale enhanced features.
[0086] In one embodiment, the local spatial attention mechanism includes a first point convolutional layer, a depthwise separable convolutional layer, a second point convolutional layer, and a sigmoid function; the multi-scale feature enhancement unit is further configured to input the second branch feature into the first point convolutional layer to obtain a first intermediate feature; input the first intermediate feature into the depthwise separable convolutional layer to obtain a second intermediate feature; add and fuse the second branch feature and the second intermediate feature and input the result into the second point convolutional layer, activate the result through the sigmoid function to obtain a third intermediate feature; multiply the second branch feature and the third intermediate feature point by point and add and fuse them with the second branch feature to obtain local detail features.
[0087] In one embodiment, the feature fusion network includes: four BIFPN modules, two upsampling layers, five RepC3 modules, one GSLS module, a third convolutional layer, a fourth convolutional layer, and a fifth convolutional layer; the feature fusion unit is further configured to process the fourth scale enhancement feature through the third convolutional layer, the GSLS module, and the first upsampling layer to obtain a first upsampling feature; input the first upsampling feature and the third scale enhancement feature into the first BIFPN module to obtain a first intermediate fusion feature; input the first intermediate fusion feature into the first RepC3 module to obtain a fourth intermediate feature; process the fourth intermediate feature through the second upsampling layer to obtain a second upsampling feature; input the second upsampling feature and the second scale enhancement feature into the second BIFPN module to obtain a second intermediate fusion feature; process the second intermediate fusion feature through the second RepC3 module, and input the processing result, the first scale enhancement feature, and the second scale enhancement feature into the third BIFPN module to obtain a third intermediate fusion feature; input the third intermediate fusion feature into the third RepC3 module to obtain a first scale fusion feature; process the first scale fusion feature through the fourth convolutional layer, and input the processing result, the third scale enhancement feature, and the fourth intermediate feature into the fourth BIFPN module to obtain a fourth intermediate fusion feature; input the fourth intermediate fusion feature into the fourth RepC3 module to obtain a second scale fusion feature; process the second scale fusion feature through the fifth convolutional layer and input the result into the fifth BIFPN module to obtain a fifth intermediate fusion feature; and input the fifth intermediate fusion feature into the fifth RepC3 module to obtain a third scale fusion feature.
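The fusion path described above can be written down as a small wiring table, which makes it easier to see which features each module consumes. The sketch below uses only the Python standard library and encodes the connections exactly as the paragraph lists them; the node names (`E1` to `E4` for the four scale enhancement features, `up1`, `mid1`, `int4`, `F1` to `F3`, and so on) and the helper functions are hypothetical labels introduced for illustration only. No tensor shapes, strides, or channel counts are modeled, because the text does not give them.

```python
# Each entry maps an output name to (operation, inputs), in the order given in the text.
FUSION_PATH = {
    "up1":   ("conv3 -> GSLS -> upsample1", ["E4"]),   # first upsampling feature
    "mid1":  ("BIFPN1", ["up1", "E3"]),                # first intermediate fusion feature
    "int4":  ("RepC3_1", ["mid1"]),                    # fourth intermediate feature
    "up2":   ("upsample2", ["int4"]),                  # second upsampling feature
    "mid2":  ("BIFPN2", ["up2", "E2"]),                # second intermediate fusion feature
    "rep2":  ("RepC3_2", ["mid2"]),
    "mid3":  ("BIFPN3", ["rep2", "E1", "E2"]),         # third intermediate fusion feature
    "F1":    ("RepC3_3", ["mid3"]),                    # first scale fusion feature
    "down1": ("conv4", ["F1"]),
    "mid4":  ("BIFPN4", ["down1", "E3", "int4"]),      # fourth intermediate fusion feature
    "F2":    ("RepC3_4", ["mid4"]),                    # second scale fusion feature
    "down2": ("conv5", ["F2"]),
    "mid5":  ("BIFPN5", ["down2"]),                    # the text names no other input here
    "F3":    ("RepC3_5", ["mid5"]),                    # third scale fusion feature
}
ENHANCED_INPUTS = ["E1", "E2", "E3", "E4"]

def check_order(path, inputs):
    """Return node names in listed order; raise if a node uses a not-yet-defined input."""
    defined = set(inputs)
    order = []
    for name, (_, deps) in path.items():
        missing = [d for d in deps if d not in defined]
        if missing:
            raise ValueError(f"{name} uses undefined inputs: {missing}")
        defined.add(name)
        order.append(name)
    return order

def consumers(path, feature):
    """List the nodes that take `feature` as a direct input."""
    return [name for name, (_, deps) in path.items() if feature in deps]
```

For example, `consumers(FUSION_PATH, "int4")` shows that the fourth intermediate feature feeds both the second upsampling layer and the fourth BIFPN module, which is the cross-connection between the top-down and bottom-up halves of the path.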
[0088] It is understood that, for a detailed explanation of the chemical laboratory equipment identification device, reference may be made to the corresponding explanations of the embodiments of the chemical laboratory equipment identification method above, which are not repeated here. Each module in the aforementioned chemical laboratory equipment identification device can be implemented entirely or partially through software, hardware, or a combination thereof. Each module can be embedded in, or be independent of, a device with data processing capabilities in hardware form, or be stored in software form in the memory of that device, so that the processor can call and execute the operations corresponding to each module. The aforementioned device can be, but is not limited to, any of the various types of data processing computer devices already existing in the art.
[0089] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0090] It is understood that, in addition to the memory and processor mentioned above, the computer device described above also includes other hardware and software components not listed in this specification. The specific components can be determined according to the model of the image processing computer used in each application scenario, and are not listed or described in detail in this specification.
[0091] In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps in the above method embodiments.
[0092] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
[0093] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0094] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of protection of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and all such modifications and improvements fall within the scope of protection of this application.
Claims
1. A method for identifying chemical laboratory equipment, characterized in that it includes the following steps: images of chemical laboratory experiments are acquired and preprocessed to obtain the input image; the input image is fed into the feature extraction module, where features are extracted through convolution and max pooling operations to obtain shallow features; the shallow features are input into the backbone network for feature extraction to obtain multi-scale deep features; the multi-scale deep features are input into the feature enhancement module, and the GSLS module and the AIFI module are used to enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features; the GSLS module is a parallel dual-branch feature enhancer, which models the long-range dependency of the feature map through the global spatial attention mechanism to understand the spatial layout relationship between chemical laboratory equipment, and enhances the local detail features of chemical laboratory equipment through the local spatial attention mechanism; the multi-scale enhanced features are input into the feature fusion network, feature fusion is performed using multi-layer cascaded BIFPN modules, and, in the fusion path, the RepC3 module is used to further refine the features to obtain multi-scale fused features; the multi-scale fused features are input into the RTDETRDecoder module to obtain the chemical laboratory equipment identification results.
2. The chemical laboratory equipment identification method according to claim 1, characterized in that the feature extraction module includes a first convolutional module, a second convolutional module, a third convolutional module, and a max pooling layer connected in sequence; each convolutional module includes a convolutional layer, a batch normalization layer, and a ReLU function.
3. The chemical laboratory equipment identification method according to claim 1, characterized in that the backbone network comprises four stacked BasicBlock residual modules; inputting the shallow features into the backbone network for feature extraction to obtain multi-scale deep features includes: the shallow features are input into the first BasicBlock residual module in the backbone network to obtain the first-scale deep features; the first-scale deep features are input into the second BasicBlock residual module in the backbone network to obtain the second-scale deep features; the second-scale deep features are input into the third BasicBlock residual module in the backbone network to obtain the third-scale deep features; the third-scale deep features are input into the fourth BasicBlock residual module in the backbone network to obtain the fourth-scale deep features.
4. The chemical laboratory equipment identification method according to claim 1, characterized in that the feature enhancement module includes a first convolutional layer, a first GSLS module, a second GSLS module, a second convolutional layer, and an AIFI module; inputting the multi-scale deep features into the feature enhancement module and using the GSLS module and the AIFI module to enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features includes: the first-scale deep features are input into the first convolutional layer of the feature enhancement module to obtain the first-scale enhanced features; the second-scale deep features are input into the first GSLS module of the feature enhancement module to obtain the second-scale enhanced features; the third-scale deep features are input into the second GSLS module of the feature enhancement module to obtain the third-scale enhanced features; the fourth-scale deep features are processed by the second convolutional layer and the AIFI module to obtain the fourth-scale enhanced features.
5. The chemical laboratory equipment identification method according to claim 1, characterized in that the GSLS module includes a global spatial attention mechanism, a local spatial attention mechanism, and a point convolutional layer; inputting the second-scale deep features into the first GSLS module of the feature enhancement module to obtain the second-scale enhanced features includes: the second-scale deep features are split along the channel dimension to obtain the first branch feature and the second branch feature; the first branch feature is input into the global spatial attention mechanism to obtain context features, wherein the global spatial attention mechanism performs point convolution, transposition, and Softmax function activation on the first branch feature to obtain an intermediate feature, multiplies and fuses the first branch feature with the intermediate feature, performs point convolution, layer normalization, and ReLU function activation on the result, and then applies point convolution to the activation result and adds the output to the first branch feature to obtain the context features; the second branch feature is input into the local spatial attention mechanism to obtain local detail features; the context features and the local detail features are concatenated and then processed through the point convolutional layer to obtain the second-scale enhanced features.
6. The chemical laboratory equipment identification method according to claim 5, characterized in that the local spatial attention mechanism includes a first point convolutional layer, a depthwise separable convolutional layer, a second point convolutional layer, and a sigmoid function; inputting the second branch feature into the local spatial attention mechanism to obtain local detail features includes: the second branch feature is input into the first point convolutional layer to obtain the first intermediate feature; the first intermediate feature is input into the depthwise separable convolutional layer to obtain the second intermediate feature; the second branch feature and the second intermediate feature are added and fused and then input into the second point convolutional layer, and the result is activated by the sigmoid function to obtain the third intermediate feature; the second branch feature and the third intermediate feature are multiplied point by point and then added to the second branch feature to obtain the local detail features.
7. The chemical laboratory equipment identification method according to claim 1, characterized in that the feature fusion network includes: four BIFPN modules, two upsampling layers, five RepC3 modules, one GSLS module, a third convolutional layer, a fourth convolutional layer, and a fifth convolutional layer; inputting the multi-scale enhanced features into the feature fusion network, performing feature fusion using multi-layer cascaded BIFPN modules, and further refining the features with the RepC3 module in the fusion path to obtain multi-scale fused features includes: the fourth scale enhancement feature is processed through the third convolutional layer, the GSLS module, and the first upsampling layer to obtain the first upsampling feature; the first upsampling feature and the third scale enhancement feature are input into the first BIFPN module to obtain the first intermediate fusion feature; the first intermediate fusion feature is input into the first RepC3 module to obtain the fourth intermediate feature; the fourth intermediate feature is processed by the second upsampling layer to obtain the second upsampling feature; the second upsampling feature and the second scale enhancement feature are input into the second BIFPN module to obtain the second intermediate fusion feature; the second intermediate fusion feature is processed by the second RepC3 module, and the processing result, the first scale enhancement feature, and the second scale enhancement feature are input into the third BIFPN module to obtain the third intermediate fusion feature; the third intermediate fusion feature is input into the third RepC3 module to obtain the first scale fusion feature; the first scale fusion feature is processed by the fourth convolutional layer, and the processing result, the third scale enhancement feature, and the fourth intermediate feature are input into the fourth BIFPN module to obtain the fourth intermediate fusion feature; the fourth intermediate fusion feature is input into the fourth RepC3 module to obtain the second scale fusion feature; the second scale fusion feature is processed by the fifth convolutional layer and then input into the fifth BIFPN module to obtain the fifth intermediate fusion feature; the fifth intermediate fusion feature is input into the fifth RepC3 module to obtain the third scale fusion feature.
8. A chemical laboratory equipment identification device, characterized in that it includes: an image preprocessing unit, used to acquire images of chemical laboratory experiments and preprocess them to obtain the input image; a shallow feature extraction unit, used to input the input image into the feature extraction module and extract features through convolution and max pooling operations to obtain shallow features; a multi-scale deep feature extraction unit, used to input the shallow features into the backbone network for feature extraction to obtain multi-scale deep features; a multi-scale feature enhancement unit, used to input the multi-scale deep features into the feature enhancement module, wherein the GSLS module and the AIFI module are used to enhance the feature representation from the local-global and self-attention perspectives, respectively, to obtain multi-scale enhanced features, and the GSLS module is a parallel dual-branch feature enhancer that models the long-range dependency of the feature map through the global spatial attention mechanism to understand the spatial layout relationship between chemical laboratory equipment, and enhances the local detail features of chemical laboratory equipment through the local spatial attention mechanism; a feature fusion unit, used to input the multi-scale enhanced features into the feature fusion network, perform feature fusion using multi-layer cascaded BIFPN modules, and, in the fusion path, further refine the features with the RepC3 module to obtain multi-scale fused features; and a chemical laboratory equipment identification unit, used to input the multi-scale fused features into the RTDETRDecoder module to obtain the chemical laboratory equipment identification results.
9. A computer device, comprising a memory and a processor, characterized in that the memory stores a computer program, and when the processor executes the computer program, it implements the steps of the chemical laboratory equipment identification method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the chemical laboratory equipment identification method according to any one of claims 1 to 7.