Method, device and equipment for matching lifting appliance and container based on deep learning
By using an end-to-end multi-task network based on deep learning, efficient, accurate, and real-time alignment of spreaders and containers is achieved, solving the problems of low efficiency, high cost, low accuracy, and the influence of lighting in existing technologies, and making it suitable for different lighting conditions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA RAILWAY ENG MASCH RES & DESIGN INST CO LTD
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-26
AI Technical Summary
Existing visual alignment technologies for spreaders and containers suffer from problems such as low efficiency, high cost, low accuracy, insufficient real-time performance, and susceptibility to lighting conditions in detecting the three-dimensional coordinates of keyholes.
An end-to-end multi-task network based on deep learning is adopted. Video streams are acquired through a monocular camera to perform keyhole detection, segmentation, vertical depth estimation, and keyhole tracking. The two-dimensional real-time coordinates and vertical depth of the keyhole center are obtained and converted to three-dimensional coordinates in the spreader coordinate system. The rotation angle is calculated in combination with the container dimensions to achieve alignment.
It improves the efficiency and accuracy of detection of the alignment between the spreader and the container, reduces costs, has good real-time performance and robustness, and is adaptable to different lighting conditions.
Figure CN122289375A_ABST (abstract drawing; image not reproduced)
Abstract
Description
Technical Field
[0001] This application relates to the field of mechanical engineering, and specifically to a deep learning-based method, apparatus, and equipment for aligning a spreader and a container.
Background Technology
[0002] Currently, existing research on visual alignment technology between spreaders and containers mainly falls into two categories: one based on traditional visual algorithms and the other based on deep learning methods.
[0003] In related technologies, alignment methods based on traditional vision algorithms mainly use traditional visual detection to locate the keyhole and its center point. In "Container Keyhole Tracking and Center Localization Based on Video Stream," the acquired image is segmented in the HSV color space, then a sliding window combined with an SVM classifier identifies the keyhole position in the image and saves the keyhole image. The Canny operator performs edge detection on the keyhole image to obtain the keyhole center point position, and the MOSSE algorithm tracks the keyhole center position.
[0004] Furthermore, with the development of deep learning algorithms in image detection, numerous studies have applied deep learning methods to the alignment of spreader and container. One such study, "Research on Visual Alignment Technology of Spreader and Container Based on Deep Learning," proposes a two-step method to determine the center of the keyhole pixels from the container image. First, an object detection network is used to determine the keyhole bounding box; then, a salient object segmentation network is used to determine the center of the keyhole pixels; finally, binocular ranging is used to determine the corresponding three-dimensional coordinates of the keyhole pixel center, completing the automatic alignment of the spreader and container. Another study, "Fully Automatic Container Grabbing and Positioning System Based on Deep Learning," proposes installing four cameras at the four corners of the spreader and using a YOLOv7-OBB network to detect the rotated container keyholes. The coordinates of the keyhole center point are obtained based on the detection bounding box. The deviation between the container center position and the spreader position is calculated using the four keyhole center positions, thereby controlling the alignment of the spreader and container.
[0005] However, the technical solution of "Container Keyhole Tracking and Center Localization Based on Video Stream" relies on traditional vision algorithms for the key steps of keyhole detection, keyhole center point detection, and keyhole tracking. It is therefore time-consuming, has low detection efficiency, and its recognition results are easily affected by lighting conditions. The technical solution of "Research on Visual Alignment Technology of Spreader and Container Based on Deep Learning" locates container keyholes more reliably, but its two-step method requires two deep learning networks, which is time-consuming. In addition, the spreader may swing during alignment, and the method does not track the keyhole position in real time, so its real-time performance is insufficient. The method also uses a binocular camera to obtain the three-dimensional coordinates of the keyhole center, which is costly and hinders widespread use. The technical solution of "Fully Automatic Container Grabbing and Positioning System Based on Deep Learning" can align the spreader and container in the XY plane at a fixed height, but it lacks Z-axis data and cannot calculate the three-dimensional coordinates of the keyhole center in the spreader coordinate system in real time. While the spreader is descending, its Z-axis position is not known in real time, so the moment at which the subsequent container-locking step should be executed cannot be indicated accurately. More importantly, the method obtains the keyhole center position from the detection box alone and therefore depends heavily on the accuracy of that box. When the detection box is offset due to camera parallax, the alignment becomes inaccurate.
Summary of the Invention
[0006] This application provides a deep learning-based method, apparatus, and equipment for aligning spreaders and containers. It addresses the problems of existing approaches, which either cannot detect the three-dimensional coordinates of the keyholes during alignment of the spreader and container, or suffer from low efficiency, high cost, low accuracy, insufficient real-time performance, and detection results that are easily affected by lighting conditions.
[0007] In a first aspect, embodiments of this application provide a deep learning-based alignment method for spreader and container, comprising the following steps: when the spreader is lowered close to the container, capturing video streams with an even number of monocular cameras installed on the spreader, extracting frames, and performing data preprocessing to obtain real-time image data; inputting the real-time image data into the keyhole detection model, whose multi-task network simultaneously performs the keyhole detection, keyhole segmentation, vertical depth estimation, and keyhole tracking tasks, thereby obtaining the two-dimensional real-time coordinates of each keyhole center in pixel coordinates and its vertical depth; sequentially transforming the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system; and calculating the three-dimensional coordinate difference between the container's center point and the spreader's center point, obtaining the rotation angle from the container's known dimensions, and controlling the spreader to align squarely with the container until the Z-axis coordinate difference reaches the set range.
[0008] In conjunction with the first aspect, in one embodiment, the alignment method further includes pre-establishing a keyhole detection model, wherein establishing the keyhole detection model includes: collecting videos of hoisting operations for containers of different sizes and colors under different weather and lighting conditions; extracting frames from the container hoisting operation videos and labeling each frame with annotation software to obtain, for each frame, the coordinates of the keyhole bounding box and the coordinates of the polygon that segments the keyhole, thus forming a training dataset; constructing a keyhole detection model with a multi-task network, the multi-task network including a deep learning encoder that receives the preprocessed image and extracts and fuses features, together with a keyhole detection decoder, a keyhole segmentation decoder, a tracking task decoder, and a vertical depth estimation decoder fed by that encoder; and inputting the training dataset into the keyhole detection model for training while observing the loss curve, the training being complete once the loss curve stabilizes.
[0009] In conjunction with the first aspect, in one implementation, after the keyhole detection model is trained, the method further includes: converting the keyhole detection model into an ONNX format model; simplifying the model with ONNXSIM to remove its redundant parts; using TensorRT to generate an INT8 binary engine from the simplified ONNX model; and using the generated engine to identify container keyholes.
[0010] In conjunction with the first aspect, in one embodiment, the alignment method further includes calibrating each monocular camera, the calibration comprising: lifting the spreader on which the monocular cameras are to be mounted to a fixed height and recording the height as $h$; placing a checkerboard calibration plate under the spreader, then moving and rotating the plate so that each monocular camera captures image data of it, until more than 10 images of the checkerboard calibration plate have been obtained from each camera; calibrating each camera to obtain its intrinsic parameter matrix $K$ and its distortion coefficient array; and mounting the even number of monocular cameras onto the spreader, placing a checkerboard calibration plate below each monocular camera, parallel to the spreader, taking photos, and, in combination with the height $h$, calculating for each monocular camera the rotation matrix $R$ and translation vector $t$ between the camera coordinate system and the spreader coordinate system.
[0011] In conjunction with the first aspect, in one embodiment, the number of monocular cameras is two; the intrinsic parameter matrix is $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ and the distortion coefficient array is $(k_1, k_2, p_1, p_2, k_3)$, which is used to correct image distortion, where $f_x$, $f_y$ are the known focal lengths of the monocular camera and $c_x$, $c_y$ are its principal point coordinates. A projection equation [equation not reproduced in the source text] is established to calculate, for each monocular camera, the rotation matrix $R$ and translation vector $t$ between the camera coordinate system and the spreader coordinate system, where $(u, v)$ are the measured pixel coordinates in the acquired image, $(X, Y, -Z_0)$ are the coordinates of each point on the checkerboard calibration board in the spreader coordinate system, and $Z_0$ is the vertical depth of the checkerboard calibration board in the camera coordinate system.
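The projection equation itself did not survive in this text. As a hedged reference only, the standard pinhole relation that a calibration step of this kind usually relies on can be written as below. The symbols $P_c$ and $P_s$ (a board point in camera and spreader coordinates) and the convention that $(R, t)$ maps camera coordinates to spreader coordinates are assumptions made here, not statements from the source.

```latex
% Assumed convention: P_s = R P_c + t (camera -> spreader)
\begin{aligned}
Z_0 \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} &= K \, P_c,
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \\
P_c &= R^{-1} \left( P_s - t \right),
\qquad
P_s = \begin{bmatrix} X \\ Y \\ -Z_0 \end{bmatrix}.
\end{aligned}
```

With several checkerboard corners whose $(u, v)$ and $(X, Y, -Z_0)$ are both known, these relations give enough constraints to solve for $R$ and $t$ under the stated convention.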
[0012] In conjunction with the first aspect, in one implementation, the step of sequentially transforming the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system includes: converting the pixel coordinates $(u, v)$ to ideal (undistorted) pixel coordinates $(u', v')$ [conversion formulas not reproduced in the source text], where $r$ is the distance from the pixel to the principal point of the image, $r = \sqrt{(u - c_x)^2 + (v - c_y)^2}$; $\Delta x$ and $\Delta y$ are the offsets caused by tangential distortion, calculated from $p_1$ and $p_2$; and $(c_x, c_y)$ are the coordinates of the principal point; converting the ideal pixel coordinates to camera coordinate system coordinates $(X_c, Y_c, Z_c)$, with $X_c = (u' - c_x) Z_c / f_x$ and $Y_c = (v' - c_y) Z_c / f_y$, where the depth $Z_c$ is obtained from the vertical depth estimation task of the keyhole detection model; and, from the coordinates $(X_c, Y_c, Z_c)$ of the keyhole center in the camera coordinate system, calculating its coordinates $(X_s, Y_s, Z_s)$ in the spreader coordinate system using the rotation matrix $R$ and the translation vector $t$: $[X_s, Y_s, Z_s]^T = R \, [X_c, Y_c, Z_c]^T + t$.
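The two coordinate transforms in this paragraph can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the pixel coordinates have already been undistorted, that the rigid transform has the form P_s = R * P_c + t, and the function names `pixel_to_camera` and `camera_to_spreader` are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_camera(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project an undistorted pixel (u, v) with known depth.

    Uses the pinhole relations X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy.
    The depth is assumed to come from the model's vertical depth estimate.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_spreader(p_cam: Sequence[float],
                       rotation: Sequence[Sequence[float]],
                       translation: Sequence[float]) -> Vec3:
    """Apply the assumed rigid transform P_s = R * P_c + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


if __name__ == "__main__":
    # Illustrative numbers only: 1000 px focal length, 1280x720 image center.
    p_cam = pixel_to_camera(740.0, 360.0, 2.0, 1000.0, 1000.0, 640.0, 360.0)
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    print(camera_to_spreader(p_cam, identity, (0.5, 0.0, -0.1)))
```

In practice `rotation` and `translation` would be the per-camera values obtained in the calibration step, so each camera's keyhole centers end up in the same spreader coordinate system.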
[0013] In conjunction with the first aspect, in one embodiment, the multi-task network of the keyhole detection model includes a backbone network and four branch networks. The backbone network extracts shared features from the image, and the four branch networks reuse these features to implement keyhole detection, keyhole segmentation, vertical depth estimation, and keyhole tracking, respectively.
[0014] In conjunction with the first aspect, in one embodiment, calculating the three-dimensional coordinate difference between the container's center point and the spreader's center point, obtaining the rotation angle based on the container's known dimensions, and controlling the spreader to align squarely with the container includes: from the center coordinates $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ of two diagonally opposite keyholes on the container, calculating the position coordinates $(x_c, y_c, z_c)$ of the container's center point in the spreader coordinate system; if $|x_1 - x_2|$ is greater than the known width $M$ of the container, calculating the rotation angle $\theta$ from the distance of $(x_c, y_c)$ from the origin of the spreader coordinate system and the difference between $|x_1 - x_2|$ and $M$; and, according to the rotation angle $\theta$ and the distance of $(x_c, y_c)$ from the origin of the spreader coordinate system, controlling the spreader to align with the container.
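The exact rotation-angle formula is not recoverable from this text, so the sketch below swaps in a simple geometric alternative: it compares the direction of the measured diagonal between the two keyhole centers with the direction that diagonal would have if the container were square to the spreader. The function names, the assumption that the width lies along X and the length along Y, and the assumption that the first keyhole is the one at the smaller X and Y corner are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def container_center(k1: Sequence[float], k2: Sequence[float]) -> Tuple[float, ...]:
    """Midpoint of two diagonally opposite keyhole centers (spreader coordinates)."""
    return tuple((a + b) / 2.0 for a, b in zip(k1, k2))


def rotation_angle(k1: Sequence[float], k2: Sequence[float],
                   width: float, length: float) -> float:
    """Yaw (radians) of the container relative to the spreader axes.

    Assumed technique: angle of the measured diagonal minus the angle of the
    nominal diagonal (width along X, length along Y) of an aligned container.
    """
    measured = math.atan2(k2[1] - k1[1], k2[0] - k1[0])
    nominal = math.atan2(length, width)
    diff = measured - nominal
    # Wrap the result into (-pi, pi].
    return math.atan2(math.sin(diff), math.cos(diff))


if __name__ == "__main__":
    # Illustrative keyhole centers for a container rotated slightly about Z.
    k1, k2 = (-0.57, -6.26, -1.5), (1.17, 5.86, -1.5)
    print(container_center(k1, k2))
    print(math.degrees(rotation_angle(k1, k2, width=2.4, length=12.0)))
```

The horizontal components of the returned center give the XY offset from the spreader origin, and the angle gives the yaw correction; both would feed the spreader controller.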
[0015] In a second aspect, embodiments of this application provide a deep learning-based alignment apparatus for a spreader and a container, comprising: a video stream acquisition module, used to acquire video streams from an even number of monocular cameras installed on the spreader when the spreader is lowered close to the container, extract frames, and perform data preprocessing to obtain real-time image data; a keyhole center localization module, used to input the real-time image data into the keyhole detection model, whose multi-task network simultaneously performs the keyhole detection task, keyhole segmentation task, vertical depth estimation task, and keyhole tracking task to obtain the two-dimensional real-time coordinates of each keyhole center in pixel coordinates and its vertical depth; a coordinate transformation module, used to transform the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system in sequence, so as to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system; and an alignment module, used to calculate the three-dimensional coordinate difference between the center point of the container and the center point of the spreader, obtain the rotation angle from the known dimensions of the container, and control the spreader to align with the container until the Z-axis coordinate difference reaches the set range.
[0016] In a third aspect, embodiments of this application provide deep learning-based alignment equipment for a spreader and a container. The equipment includes a processor, a memory, and a deep learning-based spreader and container alignment program that is stored in the memory and executable by the processor. When executed by the processor, the program implements the steps of the deep learning-based alignment method for spreader and container described above.
[0017] The beneficial effects of the technical solutions provided in this application include the following. To address the low detection efficiency of existing algorithms during spreader and container alignment, the keyhole detection model uses an end-to-end multi-task network based on deep learning. Real-time image data from the monocular cameras is fed into the keyhole detection model, which simultaneously performs keyhole detection (locating the keyhole as a target with a bounding box) and keyhole segmentation (segmenting the detected keyhole and identifying its center). The data passes through the deep learning network only once, so recognition is more efficient than with the two-step method in "Research on Visual Alignment Technology of Spreader and Container Based on Deep Learning." Furthermore, this application uses monocular cameras, which cost less than a binocular camera. To address the low accuracy, the keyhole detection model simultaneously calculates the center point of the detection box from the keyhole detection task and the center point of the segmented keyhole region from the keyhole segmentation task. When the coordinate error between the two center points is within a set range, the median of the two center point coordinates (that is, their midpoint) is taken as the keyhole center point coordinate. Compared with the method in "Fully Automatic Container Grabbing and Positioning System Based on Deep Learning", which relies solely on the center point of the detection box, the keyhole center point coordinate obtained in this application is more accurate.
To address the insufficient real-time performance, the multi-task network of the keyhole detection model also includes a keyhole tracking task that tracks the identified container keyholes in real time, so that the obtained keyhole center point coordinates are always the current coordinates in the spreader coordinate system. Furthermore, because the method is based on deep learning detection, it is robust and adapts to different lighting conditions such as day and night. To address 3D coordinate acquisition, the multi-task network includes a vertical depth estimation branch, which estimates the distance from the container keyhole to the spreader and thereby yields the 3D coordinates of the keyhole center point. This application thus addresses the low efficiency, low accuracy, insufficient real-time performance, and sensitivity to lighting conditions that affect the detection of the keyhole position and of the three-dimensional coordinates of the keyhole center during alignment of the spreader and container.
Attached Figure Description
[0018] Figure 1 is a flowchart of the deep learning-based alignment method for spreader and container proposed in this application;
Figure 2 is a schematic diagram of an embodiment in which monocular cameras are mounted on the spreader according to this application;
Figure 3 is a schematic diagram of the internal channel of the keyhole detection model in an embodiment of this application;
Figure 4 is a schematic diagram of the camera calibration process in an embodiment of this application;
Figure 5 is a schematic diagram of the training process of the keyhole detection model according to an embodiment of this application;
Figure 6 is a schematic diagram of the internal layers of the keyhole detection model in an embodiment of this application;
Figure 7 is a schematic diagram of the rotation angle between the spreader and the container in the alignment method of an embodiment of this application;
Figure 8 is a schematic diagram of the modules of the alignment apparatus according to an embodiment of this application;
Figure 9 is a schematic diagram of the hardware structure of the deep learning-based alignment equipment for spreaders and containers involved in the embodiments of this application.
Detailed Implementation
[0019] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present application.
[0020] First, some of the technical terms used in this application will be explained to help those skilled in the art understand this application.
[0021] ONNX, short for Open Neural Network Exchange, is an open neural network model representation format designed to achieve interoperability between different deep learning frameworks. It allows developers to export models from one framework and then import and run them in another, solving the challenge of migrating deep learning models across different platforms.
[0022] ONNX Simplifier, also known as ONNXSIM, is a tool specifically designed to optimize and simplify ONNX models. It streamlines the ONNX model structure through techniques such as eliminating redundant operations, merging equivalent operations, and constant folding, resulting in a smaller model and faster inference while maintaining the model's prediction accuracy. This is particularly important for edge device deployments.
[0023] TensorRT is a high-performance deep learning inference optimizer and runtime library developed by NVIDIA. It is deeply optimized for NVIDIA GPUs, significantly improving the inference speed and energy efficiency of deep learning models through techniques such as layer fusion, precision calibration, automatic kernel tuning, and dynamic tensors. TensorRT is particularly suitable for real-time AI applications requiring low latency and high throughput.
[0024] In the TensorRT Inference Engine context, "engine" refers to an executable inference model optimized by TensorRT. It is a binary file highly optimized for a specific GPU architecture, transformed from the original deep learning model through TensorRT's optimization process (including layer fusion, precision selection, kernel selection, etc.).
[0025] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.
[0026] In a first aspect, embodiments of this application provide a deep learning-based alignment method for spreaders and containers.
[0027] In one embodiment, referring to Figure 1, which is a flowchart of the first embodiment of the deep learning-based alignment method for spreader and container according to this application, the method includes: as the spreader is lowered close to the container, an even number of monocular cameras mounted on the spreader capture video streams, frames are extracted, and data preprocessing is performed to obtain real-time image data. Specifically, the even number of cameras is typically two or four, and because the cameras are monocular, costs are greatly reduced, which facilitates widespread application.
[0028] Real-time image data is input into the keyhole detection model. The keyhole detection model performs keyhole detection, keyhole segmentation, vertical depth estimation, and keyhole tracking tasks simultaneously through its multi-task network. This process obtains the real-time two-dimensional coordinates of each keyhole center in pixel coordinates and the vertical depth of each keyhole center.
[0029] The two-dimensional real-time coordinates and vertical depth of each keyhole center are sequentially transformed to the camera coordinate system and the spreader coordinate system to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system. Specifically, the pixel coordinates of the keyhole centers are first transformed into the camera coordinate system, and then from the camera coordinate system into the spreader coordinate system. Both the camera coordinate system and the spreader coordinate system are three-dimensional. The origin of the camera coordinate system is the center of the lens, and the origin of the spreader coordinate system is the center of the upper surface of the spreader's rectangular frame.
[0030] The three-dimensional coordinate difference between the container's center point and the spreader's center point is calculated. Combined with the container's known dimensions, the rotation angle is obtained, and the spreader is controlled to be directly aligned with the container until the Z-axis coordinate difference reaches a set range. Calculating the coordinate difference between the container's center point and the spreader's center point refers to calculating the X-axis coordinate difference and using the container's known X-axis dimensions to calculate the rotation angle; or calculating the Y-axis coordinate difference and using the container's known Y-axis dimensions to calculate the rotation angle.
[0031] Specifically, keyhole detection refers to directly detecting the keyhole as a target. In practice, this means identifying the coordinates of two points that define a rectangular box around the keyhole, which is the mainstream object detection approach today. Keyhole segmentation first identifies the rectangular box containing the keyhole, then divides the region inside the box into a set of coordinate points outlining the keyhole, and identifies the center of the keyhole from the segmentation result. Vertical depth estimation estimates the vertical depth of a keyhole, based on the fact that a keyhole of known dimensions appears at different sizes in the image at different depths. Keyhole tracking refers to tracking the identified keyholes in real time even when the spreader is swaying relative to the container.
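The size-to-depth relationship mentioned above can be illustrated with the pinhole similar-triangles relation. This is only the geometric intuition, not the learned depth decoder described in this application; the function name and the idea of passing a known physical keyhole width are assumptions for illustration.

```python
def depth_from_apparent_size(focal_px: float,
                             real_width_m: float,
                             pixel_width: float) -> float:
    """Estimate depth from apparent size with the pinhole relation.

    Z = f * W / w, where f is the focal length in pixels, W the known
    physical width in meters, and w the width measured in the image in pixels.
    """
    if pixel_width <= 0:
        raise ValueError("pixel_width must be positive")
    return focal_px * real_width_m / pixel_width


if __name__ == "__main__":
    # Illustrative numbers: the same object appears half as wide at twice the depth.
    print(depth_from_apparent_size(1000.0, 0.1, 50.0))
    print(depth_from_apparent_size(1000.0, 0.1, 25.0))
```

A learned decoder can exploit the same cue while also being tolerant of partial occlusion and viewing angle, which a closed-form estimate like this one is not.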
[0032] Specifically, the X and Y axes of the 3D coordinate system form a horizontal plane, while the Z axis points vertically upwards. The spreader is a rectangular frame whose two sides can be extended in length. During pre-installation, the spreader is adjusted to match the dimensions of the container, so the dimensions of both the container and the spreader are known beforehand. Furthermore, the X and Y axes of the pixel coordinate system, camera coordinate system, and spreader coordinate system point in the same respective directions, while the Z axes of the camera coordinate system and the spreader coordinate system both point vertically upwards.
[0033] When the spreader is directly opposite the container and the Z-coordinate difference between the center point of the container and the center point of the spreader reaches the set range, it means that the two are aligned and have made contact in the Z direction. At this point, the subsequent process of locking the container can be carried out.
[0034] This application presents a deep learning-based method for aligning spreaders and containers. To address the low detection efficiency of existing algorithms during spreader and container alignment, the keyhole detection model uses an end-to-end multi-task network based on deep learning. Real-time image data from the monocular cameras is fed into the keyhole detection model, which simultaneously performs keyhole detection (locating the keyhole as a target with a bounding box) and keyhole segmentation (segmenting the detected keyhole and identifying its center). The data passes through the deep learning network only once, so recognition is more efficient than with the two-step method in "Research on Visual Alignment Technology of Spreader and Container Based on Deep Learning." Furthermore, this application uses monocular cameras, which cost less than a binocular camera. To address the low accuracy, the keyhole detection model simultaneously calculates the center point of the detection box from the keyhole detection task and the center point of the segmented keyhole region from the keyhole segmentation task. When the coordinate error between the two center points is within a set range, the median of the two center point coordinates (that is, their midpoint) is taken as the keyhole center point coordinate. Compared with the method in "Fully Automatic Container Grabbing and Positioning System Based on Deep Learning", which relies solely on the center point of the detection box, the keyhole center point coordinate obtained in this application is more accurate.
To address the insufficient real-time performance, the multi-task network of the keyhole detection model also includes a keyhole tracking task that tracks the identified container keyholes in real time, so that the obtained keyhole center point coordinates are always the current coordinates in the spreader coordinate system. Furthermore, because the method is based on deep learning detection, it is robust and adapts to different lighting conditions such as day and night. To address 3D coordinate acquisition, the multi-task network includes a vertical depth estimation branch, which estimates the distance from the container keyhole to the spreader and thereby yields the 3D coordinates of the keyhole center point. This application thus addresses the low efficiency, low accuracy, insufficient real-time performance, and sensitivity to lighting conditions that affect the detection of the keyhole position and of the three-dimensional coordinates of the keyhole center during alignment of the spreader and container.
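The center-fusion rule described above (detection-box center and segmentation center combined when they agree) can be sketched as follows. This is a minimal illustration under stated assumptions: the segmentation center is taken as the area-weighted centroid of the polygon, the tolerance `max_error_px` is a hypothetical parameter, and returning `None` when the two centers disagree is a choice made here because the text does not say what happens in that case.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def box_center(x1: float, y1: float, x2: float, y2: float) -> Point:
    """Center of a detection box given two diagonal corners."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def polygon_centroid(points: Sequence[Point]) -> Point:
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon")
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def fuse_centers(box_c: Point, seg_c: Point, max_error_px: float) -> Optional[Point]:
    """Midpoint of the two centers if they agree within max_error_px, else None."""
    if math.dist(box_c, seg_c) <= max_error_px:
        return ((box_c[0] + seg_c[0]) / 2.0, (box_c[1] + seg_c[1]) / 2.0)
    return None  # assumption: the caller decides how to handle disagreement


if __name__ == "__main__":
    polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
    print(fuse_centers(box_center(0.0, 0.0, 4.0, 2.0), polygon_centroid(polygon), 3.0))
```

Using both centers as a cross-check is what lets the method detect a shifted detection box instead of silently trusting it.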
[0035] Further, as shown in Figure 3 and Figure 5, in one embodiment the alignment method also includes pre-establishing a keyhole detection model, which includes the following. Data collection: videos of hoisting operations for containers of different sizes and colors are collected under different weather and lighting conditions. Data annotation: frames are extracted from the container hoisting operation videos and each frame is annotated using annotation software, yielding for each frame the coordinates of the keyhole bounding box and the coordinates of the polygon that segments the keyhole, which together form a training dataset. Model construction: a keyhole detection model with a multi-task network is constructed. The multi-task network includes a deep learning encoder that receives the preprocessed images and extracts and fuses features, together with a keyhole detection decoder, a keyhole segmentation decoder, a tracking task decoder, and a vertical depth estimation decoder fed by that encoder. Specifically, the multi-task keyhole detection model is built on the existing YOLOv11 network.
[0036] To train the model, input the training dataset into the keyhole detection model and observe the convergence of the loss curve. When the loss curve stabilizes, the keyhole detection model training is complete.
[0037] Specifically, the data acquisition step involves obtaining video and image data of the keyholes on the containers to create a dataset for training the keyhole recognition model. The data acquisition process is relatively simple: during the operation of the spreader, two monocular cameras mounted on it save the video data. By extracting keyframes from the video, the operation video is converted into images. It is important to note that during data acquisition, videos of container lifting operations of different sizes and colors should be collected under different weather and lighting conditions. The more diverse the dataset, the better the performance and robustness of the subsequently trained keyhole recognition model.
[0038] Data annotation identifies the targets in the images for model training. During data annotation, frames are extracted from the video data obtained in the data acquisition step. Annotation software is used to mark the keyhole frame positions, using rectangles to define the keyhole locations and polygons to segment the keyholes. For each image, this yields the coordinates of the keyhole frame (represented by the two diagonal corner points of the rectangle) and the coordinates of the polygon that segments the keyhole frame (represented by the coordinates of the many points along the polygon). Each extracted frame therefore corresponds to a text file containing these coordinates, and all image and text files together constitute the training dataset. During training, the model gradually adjusts its parameters based on the labeled dataset, and its recognition accuracy gradually increases. In general, the richness and accuracy of the dataset affect the final recognition performance of the model: the more diverse the acquired data and the more accurate the annotation, the better the training result. Both data acquisition and data annotation are therefore indispensable steps.
[0039] During model training, the labeled dataset is fed into the model for training, and the convergence of the loss curve is observed. When the loss curve tends to stabilize, it indicates that the model has converged and the model training is complete.
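The stopping criterion described above (the loss curve tending to stabilize) is judged by observation in this application. As a minimal sketch of how such a check could be automated, the following compares the mean loss over two consecutive windows of epochs. The function name `loss_has_stabilized`, the window length, and the tolerance are illustrative assumptions, not values given in this application.

```python
def loss_has_stabilized(losses, window=5, tolerance=1e-3):
    """Return True when the loss curve has flattened.

    The mean of the last `window` values is compared with the mean of the
    `window` values before them; the curve counts as stable when the
    relative change is below `tolerance`. Both parameters are hypothetical.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    if previous == 0:
        return recent == 0
    return abs(previous - recent) / abs(previous) < tolerance
```

In practice such a check would only supplement, not replace, inspection of the loss curve and validation on held-out frames.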
[0040] Specifically, when building the dedicated keyhole detection model, data acquisition, data annotation, and model training form a continuous iterative process. As the types of containers and operating conditions in the dataset gradually become richer, the accuracy and robustness of the optimal weights obtained after training with the expanded dataset will continue to improve.
[0041] The keyhole detection model in this application is an improvement on the Yolov11 model. Multiple tasks, including detection, segmentation, tracking, and vertical depth estimation, are added to its task side. Video stream data is collected by an even number of monocular cameras. After preprocessing, the data is sent to a deep learning encoder to extract and fuse features. The extracted features are then sent to different task decoders, thus simultaneously completing keyhole detection, keyhole segmentation, keyhole tracking, and keyhole vertical depth estimation. This greatly improves the efficiency of data processing and the accuracy of keyhole recognition.
[0042] The deep learning-based alignment method for spreaders and containers in this application establishes a dedicated keyhole detection model that combines the detection, segmentation, tracking, and vertical depth estimation tasks. Because the keyhole detection model processes these tasks in parallel, it locates keyholes both efficiently and accurately.
[0043] Furthermore, in one embodiment, after the keyhole detection model is trained, the method further includes: converting the keyhole detection model into an ONNX format model; using onnxsim to trim the redundant parts of the model; using TensorRT to generate an INT8 binary engine from the corresponding ONNX model; and using the generated engine to identify container keyholes.
[0044] The deep learning-based alignment method for spreader and container proposed in this application is comprehensive. However, since the weights of the trained model are often large and rely on the deep learning framework and software environment, the application cost of the model is high. When the model is to be applied to the production environment, good hardware conditions are required to meet the requirements of the model operation. Based on this, the deep learning-based alignment method for spreader and container proposed in this application optimizes and deploys the keyhole detection model after training, and performs accurate quantization of the model to reduce the resource consumption of the model operation, so that the trained model weights can be deployed to the production environment at the lowest cost.
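To illustrate why INT8 quantization reduces resource consumption, the sketch below maps floating-point values to 8-bit integers with a single scale factor and then restores them. This is a generic symmetric quantization example under stated assumptions: it is not TensorRT's calibration procedure, and the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each value is stored in one byte instead of four, at the price of a rounding error of at most half the scale per value, which is why quantization is normally validated against the accuracy of the original model.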
[0045] As shown in Figure 4, in one embodiment, the alignment method further includes calibrating each monocular camera. The calibration steps include: lifting the spreader equipped with the monocular cameras to a fixed height and recording the height data h (specifically, h is measured manually or with the aid of tools, such as laser measurement); placing the checkerboard calibration plate under the spreader and moving and rotating it so that each monocular camera collects image data of the plate, obtaining more than 10 images of the checkerboard calibration plate from each camera; calibrating each camera to obtain its intrinsic parameter matrix K and distortion coefficient array; and, with the even number of monocular cameras mounted on the spreader, placing the checkerboard calibration plate directly below the cameras and parallel to the spreader, capturing images, and combining them with h to obtain, for each monocular camera, the rotation matrix R and translation vector t between the camera coordinate system and the spreader coordinate system.
[0046] Preferably, the checkerboard calibration plate is moved and rotated to ensure that the calibration plate is located at the upper left corner, middle position, and lower right corner of each camera image, and the calibration plate is rotated up and down and tilted to ensure that each camera captures more than 10 images with the calibration plate.
[0047] Specifically, the Zhang Zhengyou calibration method in OpenCV was used to calibrate the two cameras.
[0048] The deep learning-based alignment method for spreader and container proposed in this application works by first obtaining pixel coordinates from each image captured by a monocular camera, then converting the pixel coordinates into the camera coordinate system, and finally transforming the camera coordinates into the spreader coordinate system. The intrinsic parameter matrix K and the distortion coefficient array serve to transform pixel coordinates into the camera coordinate system of the corresponding monocular camera, while the rotation matrix R and translation vector t serve to transform the camera coordinate system into the spreader coordinate system.
[0049] Furthermore, in one embodiment, the number of monocular cameras is two (see Figure 2, a top view of the spreader). The intrinsic parameter matrix is $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ and the distortion coefficient array is $(k_1, k_2, p_1, p_2, k_3)$; K and the distortion coefficient array are combined to correct image distortion. Here $f_x$, $f_y$ represent the known focal length of the monocular camera, and $c_x$, $c_y$ are the principal point coordinates of the monocular camera. During calibration, distortion correction is performed on the images captured by the camera using the camera intrinsic parameter matrix K and the distortion coefficient array, resulting in more accurate image information.
[0050] Preferably, the Zhang Zhengyou calibration method in OpenCV is used to calibrate the two cameras. Here $c_x$, $c_y$ can also be understood as half of the horizontal resolution and half of the vertical resolution; $k_1$, $k_2$, $k_3$ are the radial distortion coefficients, and $p_1$, $p_2$ are the tangential distortion coefficients.
[0051] Calculation equations are established and solved to obtain, for each monocular camera, the rotation matrix R and translation vector t between the camera coordinate system and the spreader coordinate system. In these equations, $u_i$, $v_i$ are the pixel coordinates measured in the acquired image; $(X_i, Y_i, -Z_0)$ are the coordinates of each point on the checkerboard calibration board in the spreader coordinate system; $Z_0$ is the vertical depth coordinate of the checkerboard calibration board in the camera coordinate system; and i indexes the rows and columns of the board.
[0052] The deep learning-based alignment method for spreader and container proposed in this application obtains the final coordinate system transformation parameters (the camera intrinsic parameter matrix K, and the extrinsic rotation matrix R and translation vector t) by calibrating each monocular camera. The coordinates output by the subsequent keyhole detection model can therefore be transformed, using the extrinsic rotation matrix R and translation vector t, into the spreader coordinate system, giving the real-world coordinates of the keyhole captured by the monocular camera. This allows the spreader to be accurately controlled to move to the position of the keyhole. The installation and calibration of the cameras is therefore an indispensable step in improving the alignment accuracy between the spreader and the container in this method.
[0053] Further, in one embodiment, sequentially transforming the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system, to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system, includes: converting the pixel coordinates $(u, v)$ into ideal pixel coordinates $(u', v')$ by correcting the radial distortion, which depends on $r$, and the tangential distortion, where $r = \sqrt{(u - c_x)^2 + (v - c_y)^2}$ is the distance from the pixel to the principal point of the image, $\delta_u$ and $\delta_v$ are the offsets caused by tangential distortion, calculated from the tangential distortion coefficients $p_1$, $p_2$, and $(c_x, c_y)$ are the coordinates of the principal point; converting the ideal pixel coordinates into camera coordinate system coordinates $(X_c, Y_c, Z_c)$, with $X_c = (u' - c_x) Z_c / f_x$ and $Y_c = (v' - c_y) Z_c / f_y$, where $f_x$, $f_y$ are the focal lengths in the intrinsic parameter matrix K and the depth $Z_c$ is obtained from the vertical depth estimation task of the keyhole detection model; and, from the coordinates $(X_c, Y_c, Z_c)$ of the keyhole center in the camera coordinate system and the rotation matrix R and translation vector t obtained through calibration, calculating the coordinates $(X_s, Y_s, Z_s)$ of the keyhole center in the spreader coordinate system as $(X_s, Y_s, Z_s)^T = R\,(X_c, Y_c, Z_c)^T + t$.
[0054] The deep learning-based alignment method for spreader and container proposed in this application obtains unique coordinate system transformation parameters through calibration and then uses them directly in practical applications. The keyhole detection model yields high-precision real-time two-dimensional coordinates $(u, v)$ of the keyhole center in pixel coordinates. The pixel coordinates $(u, v)$ are first converted to ideal pixel coordinates $(u', v')$; the ideal pixel coordinates are then converted into camera coordinate system coordinates $(X_c, Y_c, Z_c)$; finally, the coordinates $(X_s, Y_s, Z_s)$ of the keyhole center in the spreader coordinate system are calculated from the camera coordinates $(X_c, Y_c, Z_c)$ using the rotation matrix R and the translation vector t. This chain of transformations yields high-precision real-time three-dimensional coordinates of the keyhole center.
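The three-stage conversion described above (pixel coordinates, ideal pixel coordinates, camera coordinates, spreader coordinates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas of this application: distortion is removed by iteratively inverting the commonly used radial and tangential model with coefficients ordered (k1, k2, p1, p2, k3), the transform from camera to spreader coordinates is taken as a rotation followed by a translation, and all function names are hypothetical.

```python
def distort_normalized(x, y, dist):
    """Apply the radial/tangential model to ideal normalized coordinates."""
    k1, k2, p1, p2, k3 = dist
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)  # tangential offset in x
    dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y  # tangential offset in y
    return x * radial + dx, y * radial + dy


def undistort_pixel(u, v, intrinsics, dist, iterations=20):
    """Pixel (u, v) -> ideal pixel (u', v') by fixed-point inversion of the model."""
    fx, fy, cx, cy = intrinsics
    xd, yd = (u - cx) / fx, (v - cy) / fy  # observed (distorted) normalized point
    x, y = xd, yd                          # initial guess for the ideal point
    for _ in range(iterations):
        ex, ey = distort_normalized(x, y, dist)
        x, y = x + (xd - ex), y + (yd - ey)  # correct the guess by the residual
    return x * fx + cx, y * fy + cy


def pixel_to_camera(u_ideal, v_ideal, depth, intrinsics):
    """Back-project an ideal pixel at a known depth into camera coordinates."""
    fx, fy, cx, cy = intrinsics
    return ((u_ideal - cx) * depth / fx, (v_ideal - cy) * depth / fy, depth)


def camera_to_spreader(p_cam, R, t):
    """Rigid transform of a camera-frame point: R * p + t."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))
```

The fixed-point loop converges quickly when the distortion is mild, which is the usual case for a calibrated industrial camera; strongly distorted lenses may need more iterations or a different solver.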
[0055] In one embodiment, as shown in Figure 6, the multi-task network of the keyhole detection model includes a backbone network and four branch networks. The backbone network extracts reusable image features, and the four branch networks respectively implement keyhole detection, keyhole segmentation, vertical depth estimation, and keyhole tracking based on these reused image features.
[0056] Specifically, the modules shown in Figure 6 are explained below: Downsampling: a method of processing the original image in deep learning. A value is taken at every other pixel of the original image, resulting in four independent feature layers. These four feature layers are then concatenated and stacked along the channel dimension. This halves the height and width of the input image and expands the channels without losing information.
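The downsampling step described above can be sketched on a single-channel image stored as nested lists. The function name `space_to_depth` is an assumption made for this illustration; a real network would apply the same slicing to multi-channel tensors.

```python
def space_to_depth(image):
    """Split an H x W image (H and W even) into four H/2 x W/2 layers.

    Each layer takes every other pixel, starting from one of the four
    possible (row, column) offsets, so no pixel value is discarded.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    layers = []
    for row_offset in (0, 1):
        for col_offset in (0, 1):
            layers.append([row[col_offset::2] for row in image[row_offset::2]])
    return layers  # four layers, to be stacked along the channel dimension
```

For a 4 x 4 input, the result is four 2 x 2 layers that together contain all 16 original values, which is why the operation loses no information.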
[0057] Convolutional layer: The input image is convolved by a convolution kernel to obtain a new feature map.
[0058] Parallel convolution: The input feature map is divided into two parts. One part is convolutionally processed, and the other part is directly passed. Finally, the two results are concatenated and fused to obtain a new output feature map.
[0059] Pooling layers: Input features are sequentially passed through multiple max pooling layers of different sizes, and the outputs are concatenated and fused through convolution to capture multi-scale contextual information. By fusing spatial features of different granularities, the receptive field of the neural network is enhanced.
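As a rough illustration of the pooling module, the sketch below passes a single-channel feature map through successive stride-1 max-pooling layers and keeps the output of every stage for concatenation. The kernel sizes, the function names, and the omission of the fusing convolution are simplifying assumptions for this example.

```python
def max_pool_same(feature, kernel):
    """Stride-1 max pooling that keeps the input size (windows are clipped at the border)."""
    height, width = len(feature), len(feature[0])
    half = kernel // 2
    pooled = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [feature[i][j]
                      for i in range(max(0, r - half), min(height, r + half + 1))
                      for j in range(max(0, c - half), min(width, c + half + 1))]
            row.append(max(window))
        pooled.append(row)
    return pooled


def multi_scale_pool(feature, kernels=(3, 5)):
    """Apply the pooling layers in sequence and collect every stage as a channel."""
    outputs = [feature]
    current = feature
    for kernel in kernels:
        current = max_pool_same(current, kernel)
        outputs.append(current)
    return outputs  # same height and width, ready to be stacked along channels
```

Because every stage keeps the spatial size, the stacked outputs describe the same location at increasingly large neighbourhoods, which is the multi-scale context referred to above.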
[0060] Concatenation layer: Feature maps with the same height and width are stacked along the channel dimension to obtain a new feature map.
[0061] Upsampling: By using a bilinear interpolation algorithm, the height and width of the feature map are increased. After upsampling, the size is expanded to twice that of the input feature map.
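A minimal sketch of 2x bilinear upsampling on a single-channel feature map follows. The sampling convention (pixel centres at half-integer positions, with clamping at the border) and the function name `upsample_bilinear_2x` are assumptions; deep learning frameworks offer more than one convention, and this application does not state which is used.

```python
def upsample_bilinear_2x(feature):
    """Double the height and width of a 2D map by bilinear interpolation."""
    height, width = len(feature), len(feature[0])
    out = []
    for r in range(2 * height):
        # Source row under the half-pixel-centre convention, clamped to the map.
        src_r = min(max((r + 0.5) / 2.0 - 0.5, 0.0), height - 1.0)
        r0 = int(src_r)
        r1 = min(r0 + 1, height - 1)
        wr = src_r - r0
        row = []
        for c in range(2 * width):
            src_c = min(max((c + 0.5) / 2.0 - 0.5, 0.0), width - 1.0)
            c0 = int(src_c)
            c1 = min(c0 + 1, width - 1)
            wc = src_c - c0
            # Interpolate along columns on the two nearest rows, then along rows.
            top = feature[r0][c0] * (1 - wc) + feature[r0][c1] * wc
            bottom = feature[r1][c0] * (1 - wc) + feature[r1][c1] * wc
            row.append(top * (1 - wr) + bottom * wr)
        out.append(row)
    return out
```

For example, the 1 x 2 map `[[0.0, 4.0]]` becomes a 2 x 4 map whose rows are `[0.0, 1.0, 3.0, 4.0]`.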
[0062] Feature extraction: By using convolutional kernels of different sizes to manipulate the feature map, the features are dynamically adjusted to enhance the model's perception of target details and improve detection accuracy in complex scenes.
[0063] It's worth noting that commonly used deep learning networks are primarily designed for a single task. For example, object detection networks are mainly used to detect objects in images, while segmentation networks are mainly used for semantic segmentation. These single-task networks perform well in detection, but in actual production, to complete a specific visual task, the results of object detection, segmentation, and tracking are often needed simultaneously. Running multiple single-task networks at the same time incurs high computational overhead, slow speed, poor real-time performance, and high hardware requirements, leading to increased costs.
[0064] The keyhole detection model proposed in this application uses a multi-task network that performs multiple tasks simultaneously. Because keyhole target detection, keyhole segmentation, keyhole tracking, and keyhole vertical depth estimation must all be performed on the camera data at the same time, a multi-task network is the more suitable choice. Since these tasks target the same object, they share a common backbone network, and the features extracted by the backbone network can be reused, achieving good results in all of the tasks simultaneously. Compared with existing deep learning-based keyhole alignment methods, this application uses a single network to complete these tasks simultaneously, which effectively reduces resource consumption, provides good real-time performance, and lowers hardware requirements, thus reducing production costs.
[0065] As shown in Figure 7, in one embodiment, calculating the three-dimensional coordinate difference between the container's center point and the spreader's center point, obtaining the rotation angle based on the container's known dimensions, and controlling the spreader to be directly opposite the container includes: from the coordinates $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ of the centers of the two diagonally opposite keyholes on the container, calculating the position coordinates $(x_m, y_m, z_m)$ of the container's center point in the spreader coordinate system; if the absolute coordinate difference between the two keyhole centers along the width direction is greater than the known width M of the container, calculating the rotation angle $\theta$ from the distance of $(x_m, y_m)$ from the origin of the spreader coordinate system and from the difference between that absolute coordinate difference and M; and, according to the rotation angle $\theta$ and the distance of $(x_m, y_m)$ from the origin of the spreader coordinate system, controlling the alignment of the spreader with the container.
[0066] Furthermore, whether lowering stops is controlled according to the magnitude of the Z-axis coordinate difference: when the Z-axis coordinate difference reaches the set range, the lowering process terminates.
[0067] Specifically, the width of a container is generally fixed at M=2.438m.
[0068] Specifically, once the rotation angle $\theta$ is known, the spreader is controlled to rotate in the opposite direction, that is, in the direction that reduces the angle.
[0069] Specifically, in Figure 7 the container in the middle is tilted, while the spreader is horizontal.
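The geometry of this alignment step can be illustrated with a short sketch. The exact angle formula of this application is not reproduced in the text, so the code below uses a different, purely geometric estimate: it compares the direction of the measured diagonal between the two keyhole centers with the diagonal direction of an aligned container. The function names, the axis convention (container length along X, width along Y when aligned), and the use of the container length as an input are all assumptions.

```python
import math


def container_center(p1, p2):
    """Midpoint of two diagonally opposite keyhole centers (x, y, z)."""
    return tuple((a + b) / 2.0 for a, b in zip(p1, p2))


def rotation_angle(p1, p2, length, width):
    """Estimated rotation (radians) of the container about the vertical axis.

    Assumes p1 -> p2 runs from the (-X, -Y) corner to the (+X, +Y) corner of
    a container whose length lies along X and width along Y when aligned.
    """
    measured = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    nominal = math.atan2(width, length)  # diagonal direction when aligned
    return measured - nominal
```

With the width M = 2.438 m stated above and an illustrative length of 6.058 m, two keyhole centers at (-3.029, -1.219, -2.0) and (3.029, 1.219, -2.0) give a center of (0.0, 0.0, -2.0) and a rotation angle of zero; the spreader would then be rotated opposite to any non-zero angle and translated by the horizontal offset of the center.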
[0070] This application utilizes camera installation and calibration to effectively acquire the camera's intrinsic and extrinsic parameters, providing accurate data support for subsequent conversion between the camera coordinate system and the spreader coordinate system, thus ensuring the accuracy of the method. Through the data acquisition and annotation steps of this application, a dataset specifically for container keyhole recognition can be obtained. The keyhole recognition model trained using this dataset can reduce false recognitions and improve the model's recognition accuracy.
[0071] The deep learning-based keyhole detection model constructed in this application uses a state-of-the-art deep learning network as its base model. Other branch tasks are decoupled from the base model and constructed as task heads, resulting in a dedicated model for container keyhole recognition. This model shares a backbone network and features, enabling the acquisition of various information such as keyhole position detection, keyhole segmentation, keyhole center point, and keyhole distance. Therefore, compared to other deep learning-based alignment methods, this invention offers better accuracy. Furthermore, this invention only requires one deep learning network pass to obtain the result, thus simplifying the process and reducing complexity.
[0072] In this application, the trained keyhole recognition model is trimmed and quantized in the model deployment step and accelerated with TensorRT for real-time inference, which lowers the hardware requirements and makes the model more suitable for deployment in actual production environments. Therefore, the method of this invention has better real-time performance and higher detection efficiency.
[0073] The coordinate system transformation step in this application transforms the recognition result from the image pixel coordinate system to the lifting device coordinate system, obtaining the actual distance deviation between the center of the keyhole and the lifting device, providing accurate data for lifting device control, and has good practicality.
[0074] The method in this application uses a monocular camera, which is less expensive than other methods using binocular cameras. Furthermore, compared to other methods using four cameras, it reduces the number of cameras by two, thus lowering the requirements for the edge computing platform. Therefore, overall, the method of this invention is less expensive and more conducive to widespread adoption compared to other methods.
[0075] As shown in Figure 8, in a second aspect, embodiments of this application also provide a deep learning-based alignment device for a spreader and a container. The alignment device includes a video stream acquisition module, a keyhole center positioning module, a coordinate transformation module, and an alignment module.
[0076] The video stream acquisition module is used to acquire video streams, extract frames, and perform data preprocessing by using an even number of monocular cameras installed on the spreader when it is lowered close to the container, in order to obtain real-time image data.
[0077] The keyhole center localization module is used to input real-time image data into the keyhole detection model. Through the multi-task network of the keyhole detection model, the keyhole detection task, keyhole segmentation task, vertical depth estimation task and keyhole tracking task are performed simultaneously to obtain the two-dimensional real-time coordinates and vertical depth of each keyhole center in pixel coordinates.
[0078] The coordinate transformation module is used to convert the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system in sequence, so as to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system.
[0079] The alignment module is used to calculate the three-dimensional coordinate difference between the center point of the container and the center point of the spreader. It combines the known dimensions of the container to obtain the rotation angle and controls the spreader to be aligned with the container until the Z-axis coordinate difference reaches the set range.
[0080] The functions of each module in the aforementioned deep learning-based alignment device for spreaders and containers correspond to the steps in the aforementioned deep learning-based alignment method for spreaders and containers. Their functions and implementation processes will not be described in detail here.
[0081] In a third aspect, embodiments of this application provide a deep learning-based alignment device for spreaders and containers. This device can be a personal computer (PC), a laptop computer, a server, or another device with data processing capabilities.
[0082] Referring to Figure 9, Figure 9 is a schematic diagram of the hardware structure of the deep learning-based alignment device for spreader and container involved in the embodiments of this application. In the embodiments of this application, the deep learning-based alignment device for spreader and container may include a processor, a memory, a communication interface, and a communication bus.
[0083] The communication bus can be of any type and is used to interconnect the processor, memory, and communication interface.
[0084] The communication interface includes input / output (I / O) interfaces, physical interfaces, and logical interfaces. These interfaces enable interconnection of internal components within the deep learning-based spreader-to-container alignment device, as well as interfaces for interconnection between the deep learning-based spreader-to-container alignment device and other devices (such as other computing devices or user equipment). Physical interfaces can be Ethernet interfaces, fiber optic interfaces, ATM interfaces, etc.; user equipment can be displays, keyboards, etc.
[0085] Memory can be various types of storage media, such as random access memory (RAM), read-only memory (ROM), non-volatile RAM (NVRAM), flash memory, optical storage, hard disk, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), etc.
[0086] The processor can be a general-purpose processor, which can call a deep learning-based alignment program for the spreader and container stored in memory and execute the deep learning-based alignment method for the spreader and container provided in the embodiments of this application. For example, the general-purpose processor can be a central processing unit (CPU). The method executed when the deep learning-based alignment program for the spreader and container is called can be referred to in the various embodiments of the deep learning-based alignment method for the spreader and container of this application, and will not be repeated here.
[0087] Those skilled in the art will understand that the hardware structure shown in Figure 9 does not constitute a limitation of this application; the device may include more or fewer components than shown, combine certain components, or have a different component arrangement.
[0088] It should be noted that the sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0089] The terms "comprising" and "having," and any variations thereof, in the specification, claims, and accompanying drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to such process, method, product, or apparatus. The terms "first," "second," and "third," etc., are used to distinguish different objects, etc., and do not indicate a sequence, nor do they limit "first," "second," and "third" to different types.
[0090] In the description of the embodiments of this application, terms such as "exemplary," "for example," or "for instance" are used to indicate examples, illustrations, or explanations. Any embodiment or design described as "exemplary," "for example," or "for instance" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of terms such as "exemplary," "for example," or "for instance" is intended to present the relevant concepts in a concrete manner.
[0091] In the description of the embodiments of this application, unless otherwise stated, " / " means "or". For example, A / B can mean A or B. The "and / or" in the text is merely a description of the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of this application, "multiple" means two or more.
[0092] In some processes described in the embodiments of this application, multiple operations or steps are included in a specific order. However, it should be understood that these operations or steps may not be executed in the order they appear in the embodiments of this application, or they may be executed in parallel. The sequence number of the operation is only used to distinguish different operations, and the sequence number itself does not represent any execution order. In addition, these processes may include more or fewer operations, and these operations or steps may be executed sequentially or in parallel, and these operations or steps may be combined.
[0093] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device to execute the methods described in the various embodiments of this application.
[0094] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A deep learning-based alignment method for spreader and container, characterized in that, Includes the following steps: When the spreader is lowered close to the container, an even number of monocular cameras installed on the spreader capture video streams, extract frames and perform data preprocessing to obtain real-time image data; Real-time image data is input into the keyhole detection model. The keyhole detection model performs keyhole detection, keyhole segmentation, vertical depth estimation, and keyhole tracking tasks simultaneously through its multi-task network, thereby obtaining the two-dimensional real-time coordinates and vertical depth of each keyhole center in pixel coordinates. The two-dimensional real-time coordinates and vertical depth of each keyhole center are sequentially transformed to the camera coordinate system and the lifting device coordinate system to obtain the three-dimensional coordinates of each keyhole center in the lifting device coordinate system. Calculate the three-dimensional coordinate difference between the container's center point and the spreader's center point, and obtain the rotation angle based on the container's known dimensions. Control the spreader to be directly aligned with the container until the Z-axis coordinate difference reaches the set range.
2. The deep learning-based alignment method for spreader and container as described in claim 1, characterized in that the alignment method further includes pre-establishing a keyhole detection model, which includes: collecting videos of hoisting operations on containers of different sizes and colors under different weather and lighting conditions; extracting frames from the container hoisting operation videos and labeling each frame using annotation software to obtain, for each frame, the coordinates of the corresponding keyhole frame and the coordinates of the polygons that segment the keyhole frame, thus forming a training dataset; constructing a keyhole detection model with a multi-task network, wherein the multi-task model includes a deep learning encoder that receives the pre-processed images and extracts and fuses features, and a keyhole detection decoder, a keyhole segmentation decoder, a tracking task decoder, and a vertical depth estimation decoder corresponding to the deep learning encoder; and inputting the training dataset into the keyhole detection model for training and observing the convergence of the loss curve until the loss curve tends to stabilize, at which point the keyhole detection model training is complete.
3. The deep learning-based alignment method for spreader and container as described in claim 2, characterized in that, After the keyhole detection model is trained, it also includes: Convert the keyhole detection model into an ONNX format model; Use ONNXSIM to trim the redundant parts of the model; Use TensorRT to generate an INT8 type binary engine from the corresponding ONNX model; Use the generated engine to identify container lock holes.
4. The deep learning-based alignment method for spreader and container as described in claim 1, characterized in that the alignment method further includes calibrating each monocular camera, the calibration step comprising: lifting the spreader equipped with the monocular cameras to a fixed height and recording the height data h; placing the checkerboard calibration plate under the spreader and moving and rotating it so that each monocular camera collects image data of the plate, obtaining more than 10 images of the checkerboard calibration plate from each camera; calibrating each camera to obtain its intrinsic parameter matrix K and distortion coefficient array; and, with the even number of monocular cameras mounted on the spreader, placing a checkerboard calibration plate below each monocular camera and parallel to the spreader, taking photos, and combining them with h to calculate, for each monocular camera, the rotation matrix R and translation vector t between the camera coordinate system and the spreader coordinate system.
5. The deep learning-based alignment method for spreader and container as described in claim 4, characterized in that the number of monocular cameras is two, the intrinsic parameter matrix is $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ and the distortion coefficient array is $(k_1, k_2, p_1, p_2, k_3)$, which are combined for correcting image distortion; wherein $f_x$, $f_y$ represent the known focal length of the monocular camera, and $c_x$, $c_y$ are the principal point coordinates of the monocular camera; calculation equations are established and solved to obtain, for each monocular camera, the rotation matrix R and translation vector t between the camera coordinate system and the spreader coordinate system; wherein $u_i$, $v_i$ are the pixel coordinates measured in the acquired image, $(X_i, Y_i, -Z_0)$ are the coordinates of each point on the checkerboard calibration board in the spreader coordinate system, and $Z_0$ is the vertical depth coordinate of the checkerboard calibration board in the camera coordinate system.
6. The deep learning-based alignment method for spreader and container as described in claim 5, characterized in that the step of sequentially transforming the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system includes: converting the pixel coordinates of the keyhole center to ideal pixel coordinates by correcting the image distortion, wherein the correction depends on the distance from the pixel to the principal point of the image and on the offsets caused by tangential distortion, which are calculated from the distortion coefficients; converting the ideal pixel coordinates to coordinates in the camera coordinate system using the principal point coordinates, wherein the depth is obtained from the vertical depth estimation task of the keyhole detection model; and, based on the obtained coordinates of the keyhole center in the camera coordinate system, calculating the coordinates of the keyhole center in the spreader coordinate system using the rotation matrix R and the translation vector t.
7. The deep learning-based alignment method for spreader and container as described in claim 1, characterized in that the multi-task network of the keyhole detection model includes a backbone network and four branch networks; the backbone network extracts shared image features, and the four branch networks reuse these features to respectively implement keyhole detection, keyhole segmentation, vertical depth estimation, and keyhole tracking.
8. The deep learning-based alignment method for spreader and container as described in claim 1, characterized in that the step of calculating the three-dimensional coordinate difference between the container's center point and the spreader's center point, obtaining the rotation angle in combination with the container's known dimensions, and controlling the spreader to be aligned with the container includes: calculating, from the coordinates of the centers of two diagonally opposite keyholes on the container, the position coordinates of the container's center point in the spreader coordinate system; if the absolute difference between the corresponding coordinates of the two keyhole centers is greater than the known width M of the container, calculating the rotation angle from the distance between the planar coordinates of the container's center point and the origin of the spreader coordinate system, and from the difference between that absolute coordinate difference and M; and controlling the alignment of the spreader with the container according to the rotation angle and the distance between the planar coordinates of the container's center point and the origin of the spreader coordinate system.
9. A deep learning-based alignment device for a spreader and a container, characterized by comprising: a video stream acquisition module, used to acquire video streams through an even number of monocular cameras installed on the spreader when the spreader is lowered close to the container, and to extract frames and perform data preprocessing to obtain real-time image data; a keyhole center localization module, used to input the real-time image data into a keyhole detection model and, through the multi-task network of the keyhole detection model, simultaneously perform the keyhole detection task, keyhole segmentation task, vertical depth estimation task, and keyhole tracking task to obtain the two-dimensional real-time coordinates and vertical depth of each keyhole center in pixel coordinates; a coordinate transformation module, used to sequentially transform the two-dimensional real-time coordinates and vertical depth of each keyhole center to the camera coordinate system and the spreader coordinate system to obtain the three-dimensional coordinates of each keyhole center in the spreader coordinate system; and an alignment module, used to calculate the three-dimensional coordinate difference between the center point of the container and the center point of the spreader, obtain the rotation angle in combination with the known dimensions of the container, and control the spreader to align with the container until the Z-axis coordinate difference falls within the set range.
10. Deep learning-based alignment equipment for a spreader and a container, characterized in that the equipment includes a processor, a memory, and a deep learning-based spreader and container alignment program stored in the memory and executable by the processor, wherein the program, when executed by the processor, implements the steps of the deep learning-based spreader and container alignment method as described in any one of claims 1 to 8.
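The coordinate chain recited in claims 6 and 8 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim formulas did not survive in this text, so the sketch assumes the standard pinhole camera model with a five-coefficient distortion array ordered (k1, k2, p1, p2, k3), inverts the distortion by fixed-point iteration instead of whatever correction formula the claims specify, and assumes that R and t map camera coordinates into the spreader coordinate system. All function and variable names are hypothetical.

```python
import math


def _distortion_terms(x, y, dist):
    """Radial factor and tangential offsets at a normalized image point."""
    k1, k2, p1, p2, k3 = dist  # assumed coefficient order, not from the source
    r2 = x * x + y * y  # squared distance to the principal point
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    dx = 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    dy = p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return radial, dx, dy


def distort_pixel(x, y, intr, dist):
    """Forward model: ideal normalized point -> distorted pixel (test helper)."""
    fx, fy, cx, cy = intr
    radial, dx, dy = _distortion_terms(x, y, dist)
    return fx * (x * radial + dx) + cx, fy * (y * radial + dy) + cy


def undistort_pixel(u, v, intr, dist, iters=10):
    """Distorted pixel -> ideal pixel, by fixed-point inversion of the model."""
    fx, fy, cx, cy = intr
    xd, yd = (u - cx) / fx, (v - cy) / fy
    x, y = xd, yd  # initial guess: no distortion
    for _ in range(iters):
        radial, dx, dy = _distortion_terms(x, y, dist)
        x, y = (xd - dx) / radial, (yd - dy) / radial
    return fx * x + cx, fy * y + cy


def back_project(u, v, depth, intr):
    """Ideal pixel plus estimated vertical depth -> camera coordinates."""
    fx, fy, cx, cy = intr
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def to_spreader(p_cam, R, t):
    """Apply p_spreader = R @ p_cam + t (direction of R, t is an assumption)."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def container_center(keyhole_a, keyhole_b):
    """Midpoint of two diagonally opposite keyhole centers."""
    return tuple((a + b) / 2 for a, b in zip(keyhole_a, keyhole_b))


def planar_offset(center):
    """Horizontal distance of the container center from the spreader origin."""
    return math.hypot(center[0], center[1])
```

With R set to the identity and t to zero, `to_spreader` returns the camera coordinates unchanged, which is a convenient sanity check before substituting calibrated values. The rotation angle of claim 8 is deliberately left out, because its formula cannot be recovered from the text.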