A method for excavator three-dimensional pose estimation based on domain randomization to generate a labeled dataset
By constructing a parametric excavator model and a virtual simulation environment, generating a multi-dimensional randomized dataset and training a two-stage neural network, the high cost and domain gap problems of excavator 3D pose estimation are solved, achieving high-precision, robust, cross-scene adaptive pose estimation.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: CCCC SOUTH CHINA TRANSPORTATION CONSTR CO LTD
- Filing Date: 2026-02-06
- Publication Date: 2026-06-12
AI Technical Summary
Existing methods for estimating the 3D pose of excavators are costly, depend on datasets that are difficult to obtain, generalize poorly, and suffer from domain gap problems, which makes them hard to adapt to complex construction site environments.
By constructing a parametric 3D model of an excavator, building a virtual simulation environment, generating a multi-dimensional randomized parameter library, training the model using a two-stage deep neural network, and combining the perspective n-point algorithm for pose estimation, the cost of data acquisition is reduced and the model accuracy and cross-scene adaptability are improved.
It enables low-cost, large-scale labeled dataset generation, improves the accuracy and robustness of pose estimation, adapts to different weather conditions, scenarios and device models, reduces hardware costs and improves stability.
Description
Technical Field
[0001] This invention relates to the interdisciplinary field of computer vision and intelligent and automated control of engineering machinery, specifically to a method for generating synthetic training data using domain randomization technology to achieve high-precision three-dimensional pose estimation of excavators in complex scenarios.
Background Technology
[0002] With the rise of concepts such as smart construction sites and unmanned construction, the estimation of the three-dimensional pose of excavators, including the position of the machine body, the posture of the bucket, and the joint angles, has become one of the core technologies for realizing autonomous operation, remote control, high-precision motion tracking, and human-machine collaborative operation.
[0003] Existing methods for 3D pose estimation of excavators are mainly divided into two categories: one is the sensor fusion-based method, which relies on high-precision sensors such as inertial measurement units (IMU), lidar (LiDAR), and GPS to calculate pose by fusing data from multiple sensors; the other is the vision-based method, which uses monocular / binocular cameras to acquire images and achieve pose estimation through traditional computer vision algorithms or deep learning models.
[0004] Existing technologies have many key problems: Sensor fusion methods are costly, with the purchase, installation, and maintenance of high-precision sensors accounting for 30%-50% of the total equipment cost. Furthermore, they are susceptible to vibration and dust interference in complex construction environments, resulting in insufficient stability. Additionally, they perform poorly in scenarios where satellite signals are blocked (such as tunnels and canyons).
[0005] Visual methods rely on large datasets of real-world scenes with precise annotations. However, excavator operating scenes are complex, with varying lighting, dust occlusion, and diverse terrain. Collecting real-world datasets is difficult, and annotation is time-consuming and requires specialized equipment and knowledge, so the cost of data acquisition is extremely high.
[0006] The real-world datasets lack data diversity and have significant limitations in domain adaptability. The trained models struggle to adapt to different weather conditions, work scenarios, and excavator models, exhibiting poor generalization ability and a sharp drop in estimation accuracy in non-trained scenarios.
[0007] Some studies have attempted to generate highly realistic synthetic data using computer graphics techniques, but this requires the construction of detailed 3D models and realistic rendering environments, resulting in complex processes and high computational costs. Models trained on such data are also prone to overfitting to the specific rendering style of the synthetic data, which leaves a domain gap.
[0008] While domain randomization can alleviate the domain gap problem, there is currently no mature solution for its effective application in the three-dimensional pose estimation of engineering machinery such as excavators, which have complex articulated structures and variable working postures.
Summary of the Invention
[0009] To address the problems of high cost, difficulty in acquiring datasets, weak generalization ability, and domain gaps in existing technologies, this invention provides a method for estimating the 3D pose of excavators based on domain randomization to generate labeled datasets. By generating diverse labeled datasets through virtual simulation, the method reduces data acquisition costs while improving the accuracy, robustness, and cross-scene adaptability of the pose estimation model.
[0010] To achieve the above objectives, this invention provides a method for estimating the 3D pose of an excavator based on a domain randomization-generated labeled dataset, comprising the following steps:
S1: Construct a parametric 3D model of the excavator and define its kinematic chain and joint degrees of freedom;
S2: Build a virtual simulation environment, configure a multi-dimensional domain randomization parameter library, batch generate synthetic training images and corresponding automatic 3D pose labels, and form a domain randomization training dataset;
S3: Using the domain randomized training dataset, train and optimize a two-stage deep neural network model, which is configured to detect key components of the excavator from the input image and regress the 2D key points, 3D center point depth, rotation matrix and joint angles of the excavator;
S4: Input the real excavator image into the trained deep neural network model, obtain the relevant predicted values, combine the camera intrinsic parameters for calculation and post-processing, and obtain the complete pose of the excavator in three-dimensional space.
[0011] Furthermore, the parameterized three-dimensional model of the excavator in step S1 refers to decomposing the excavator's motion into rigid body transformations and rotational joint movements of the chassis, slewing platform, boom, stick, and bucket, thereby restoring the structural parameters and joint movement range of the real machine model.
[0012] Furthermore, the virtual simulation environment in step S2 is built based on Unreal Engine 5, Unity, or Blender simulation engine, and includes a 3D model of an excavator and a base of operational scenarios covering construction sites, farmland, mines, cities, and forests.
[0013] Furthermore, the multi-dimensional domain randomization parameter library in step S2 includes at least one of target parameters, environmental parameters, acquisition parameters, and attitude parameters. The target parameters include the excavator material, color, texture, and wear level; the environmental parameters include light type, intensity, color temperature, atmospheric environment, and background; the acquisition parameters include camera intrinsic parameters, extrinsic parameters, image noise, and resolution; and the attitude parameters include the random angles of each joint of the excavator.
[0014] Furthermore, the automatically generated 3D pose label in step S2 includes: the 2D projected coordinates of the vertices of the excavator's 3D bounding box, the 3D coordinates (X,Y,Z) of the excavator chassis center in the camera coordinate system, the rotation matrix R, and the rotation angle θ of each joint.
[0015] Furthermore, the deep neural network model with a two-stage architecture in step S3 includes: the first stage: detection and localization of the excavator body and key components based on the improved YOLOv8 model; the second stage: a multi-task learning architecture, including a branch for regressing 2D key points, a branch for regressing the depth of the 3D center point, and a branch for regressing the angles of each joint.
[0016] Furthermore, in step S3, the model training adopts a hybrid loss function that combines MSE loss with cosine loss, and combines transfer learning with fine-tuning on a small amount of labeled real-scene data. The training data is mainly the domain randomized synthetic data generated in step S2.
[0017] Furthermore, the calculation in step S4 specifically involves using the perspective n-point algorithm, combined with the predicted 2D key points, the corresponding 3D model points, and the predicted depth prior, to solve the three-dimensional translation and rotation of the excavator chassis center relative to the camera coordinate system; post-processing includes smoothing filtering and outlier removal.
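[0018] The calculation process of the perspective n-point algorithm in step S4 satisfies the following quantitative relationships:
(1) The camera imaging model follows the perspective projection formula, which maps the camera coordinates $(X_c, Y_c, Z_c)$ of a 3D point to image pixel coordinates $(u, v)$:
$$u = f_x \frac{X_c}{Z_c} + c_x, \qquad v = f_y \frac{Y_c}{Z_c} + c_y$$
In the formula, $f_x$ and $f_y$ are the focal lengths (in pixels) in the camera's intrinsic parameters, and $(c_x, c_y)$ are the coordinates (in pixels) of the principal point of the image in the camera's intrinsic parameters. In addition, $n \geq 4$, where $n$ is the number of vertices in the 3D bounding box of the excavator.
(2) The local coordinates $P_m = (X_m, Y_m, Z_m)^T$ of the excavator 3D model and the camera coordinates $P_c = (X_c, Y_c, Z_c)^T$ satisfy the rigid body transformation relation:
$$P_c = R\,P_m + t$$
In the formula, $R$ is the 3×3 rotation matrix of the excavator model's local coordinate system relative to the camera coordinate system, and $t = (t_x, t_y, t_z)^T$ is the three-dimensional translation vector of the excavator chassis center in the camera coordinate system. The rotation matrix is represented with Euler angles in ZYX order, corresponding to the yaw angle $\alpha$, pitch angle $\beta$, and roll angle $\gamma$. Its expanded form is:
$$R = R_z(\alpha)\,R_y(\beta)\,R_x(\gamma) = \begin{bmatrix} \cos\alpha\cos\beta & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma \\ \sin\alpha\cos\beta & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma \\ -\sin\beta & \cos\beta\sin\gamma & \cos\beta\cos\gamma \end{bmatrix}$$
During the optimization process, the rotation matrix is represented by quaternions to avoid Euler angle singularity problems, and the quaternions are normalized.
[0019] (3) Using the 3D center point depth prediction value $Z_0$ output by the model as the depth prior, determine the Z component of the translation vector: $t_z = Z_0$.
(4) Solve for $R$ and $(t_x, t_y)$ by least squares optimization. The objective function minimizes the sum of squared residuals between the predicted 2D keypoints $(u_k, v_k)$ and the projected 2D coordinates $(\hat{u}_k, \hat{v}_k)$:
$$\min_{R,\,t_x,\,t_y} \sum_{k=1}^{n} \left[ (u_k - \hat{u}_k)^2 + (v_k - \hat{v}_k)^2 \right]$$
In the formula, the projected coordinates of the $k$-th vertex with local coordinates $(X_{m,k}, Y_{m,k}, Z_{m,k})$ are defined as:
$$\hat{u}_k = f_x \frac{R_{11}X_{m,k} + R_{12}Y_{m,k} + R_{13}Z_{m,k} + t_x}{R_{31}X_{m,k} + R_{32}Y_{m,k} + R_{33}Z_{m,k} + t_z} + c_x, \qquad \hat{v}_k = f_y \frac{R_{21}X_{m,k} + R_{22}Y_{m,k} + R_{23}Z_{m,k} + t_y}{R_{31}X_{m,k} + R_{32}Y_{m,k} + R_{33}Z_{m,k} + t_z} + c_y$$
where $R_{ij}$ is the element in the i-th row and j-th column of the rotation matrix $R$. The Levenberg-Marquardt algorithm is used to solve this nonlinear optimization problem, yielding the converged rotation matrix $R$ and translation vector $t$.
[0020] (5) The engineering implementation of the perspective n-point algorithm uses OpenCV library functions. The input parameters include the model local coordinate set of the excavator 3D bounding box vertices, the predicted 2D key point coordinate set, and the camera intrinsic parameter matrix:
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
The pose parameters are initialized using the solvePnP function, and the depth prior $Z_0$ is passed to the solvePnPRefineLM function for constraint optimization, outputting rotation and translation vectors.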
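The ZYX Euler convention and the normalized quaternion representation described above can be illustrated with a short sketch. It assumes NumPy, angles in radians, and a (w, x, y, z) quaternion ordering; the function names are illustrative and do not come from the patent.

```python
import numpy as np

def euler_zyx_to_matrix(yaw, pitch, roll):
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def euler_zyx_to_quaternion(yaw, pitch, roll):
    """Unit quaternion (w, x, y, z) for the same ZYX rotation."""
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    return np.array([
        cr * cp * cy + sr * sp * sy,
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
    ])

def quaternion_to_matrix(q):
    """Rotation matrix from a quaternion; q is normalized before use."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
```

Normalizing inside `quaternion_to_matrix` means an optimizer can update the four quaternion components freely and still obtain a valid rotation at every step.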
[0021] The present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method described above.
[0022] In another aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described above.
[0023] The beneficial effects of this invention are: Low data acquisition cost: No on-site collection and manual annotation are required. Large-scale labeled datasets are generated in batches through virtual simulation and domain randomization technology, avoiding the cumbersome process of professional annotation and reducing data acquisition costs by more than 70%.
[0024] Strong generalization ability: The dataset randomizes target, environment, acquisition, and posture parameters across different weather conditions, operating scenarios, and machine features. This forces the model to learn the geometric structure and kinematic essence of the excavator, effectively narrowing the domain gap and giving outstanding cross-scene adaptability.
[0025] High estimation accuracy: Synthetic data labels are automatically generated, eliminating manual annotation errors. Combining a two-stage neural network and a multi-task learning architecture with the precise calculation of the perspective n-point algorithm, translation and rotation errors are reduced, while joint angle errors are also reduced, improving the pose estimation accuracy in complex environments.
[0026] Flexible and convenient deployment: The model is lightweight and can be deployed on edge computing devices with an inference speed of ≥30fps to meet the needs of real-time operations; it does not rely on expensive sensors such as GPS and LiDAR, reducing hardware deployment costs, and is not affected by environmental interference such as satellite signal blockage, vibration and dust, resulting in stronger stability.
[0027] Wide range of applications: It provides reliable technical support for excavator autonomous operation, remote control, high-precision motion tracking and human-machine collaborative operation, and is suitable for various intelligent application scenarios of construction machinery such as smart construction sites, mining, and farmland operations.
Detailed Implementation
[0028] Examples are used to illustrate the present invention, but are not intended to limit the scope of the invention.
[0029] It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0030] For the sake of brevity, the embodiments only schematically illustrate the parts relevant to the present invention, and they do not represent the actual structure of the product. Furthermore, in this document, "a" can mean not only "only one," but also "more than one."
[0031] It should also be further understood that the term "and/or" as used in this application specification and the appended claims means any combination of one or more of the relevant listed items and all possible combinations, and includes such combinations.
[0032] In the description of this application, the terms "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0033] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, specific implementations of the present invention are described below. Obviously, the embodiments described below are merely some embodiments of the present invention, and those skilled in the art can obtain other implementation methods without creative effort.
[0034] The technical solution of this embodiment includes four core modules: parametric 3D model construction, domain randomized labeled dataset generation, pose estimation model training and optimization, and real-scene pose estimation, as detailed below:
1. Parametric Excavator 3D Model Construction
Obtain the excavator's CAD model or create a precise 3D mesh model through 3D scanning. Alternatively, a digital twin model can be used to reproduce the structural parameters of the actual machine (body length, bucket size, etc.). Then, parameterize the joints of this model, decomposing the excavator's motion into rigid body transformations and rotational joint movements of components such as the chassis, swing platform, boom, stick, and bucket. Define its kinematic chain and joint degrees of freedom, and specify the rotation axis and range of motion of each joint.
[0035] 2. Generation of Domain Randomized Labeled Datasets
Virtual simulation environment construction: Based on simulation engines such as Unreal Engine 5, Unity, or Blender, a virtual environment is built that includes the aforementioned parametric excavator 3D model and a set of base work scenes. The base scenes cover typical operating sites such as construction sites, farmland, mines, cities, and forests, as well as the surrounding terrain.
[0036] Multi-dimensional domain randomization configuration: Set up a library of randomizable parameters, covering four core parameter categories:
Target parameters: the surface material (metal, plastic, rubber, etc.), color (yellow, blue, gray, etc.), texture, and wear level (no wear, light wear, heavy wear) of the excavator model.
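As a rough illustration of the kinematic chain described above, the following sketch computes the bucket-tip position from the swing, boom, stick, and bucket angles. The link lengths, joint limits, and angle conventions are hypothetical placeholders, not values from the patent; a real model would take them from the CAD data.

```python
import math

# Hypothetical link lengths in metres (placeholders, not patent values).
LINKS = {"boom": 5.7, "stick": 2.9, "bucket": 1.5}
# Hypothetical joint limits in degrees (min, max).
LIMITS = {
    "swing": (-180.0, 180.0),
    "boom": (-30.0, 60.0),
    "stick": (-150.0, 0.0),
    "bucket": (-120.0, 30.0),
}

def bucket_tip(swing, boom, stick, bucket):
    """Bucket-tip position (x, y, z) in the chassis frame, angles in degrees.

    Assumed convention: each arm angle is measured relative to its parent
    link in the vertical arm plane, and swing rotates that plane about
    the vertical axis of the slewing platform.
    """
    angles = {"swing": swing, "boom": boom, "stick": stick, "bucket": bucket}
    for name, value in angles.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} angle {value} outside [{low}, {high}]")
    reach = height = pitch = 0.0
    for name in ("boom", "stick", "bucket"):
        pitch += math.radians(angles[name])   # accumulate relative angles
        reach += LINKS[name] * math.cos(pitch)
        height += LINKS[name] * math.sin(pitch)
    yaw = math.radians(swing)
    return (reach * math.cos(yaw), reach * math.sin(yaw), height)
```

Checking the limits inside the function keeps every generated pose within the range of motion the model defines, which is the same constraint the attitude randomization relies on.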
[0037] Environmental parameters: lighting type (sunlight, lamplight), light intensity (1000-10000 lux), color temperature (3000-6500K), atmospheric environment (sunny, cloudy, light rain, dust), background texture (soil, gravel, vegetation, etc.). Randomly selected panoramic images or video frames of the real environment can be used as the rendering background. At the same time, simple 3D geometric shapes such as mounds of earth and pits can be randomly placed in the scene to simulate the working environment.
[0038] Acquisition parameters: camera intrinsic parameters (focal length 50-120mm, etc.), extrinsic parameters (shooting distance, horizontal and pitch angles, shooting angle range 0°-90°), image noise (Gaussian noise, salt and pepper noise), resolution (1920×1080, 1280×720, etc.).
[0039] Attitude parameters: Within the physical constraints, the angles of each joint of the excavator are randomly generated to form various working postures.
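The four parameter categories above can be sketched as a single sampling function. The value ranges for light intensity, color temperature, focal length, viewing angles, and resolution follow the description; the shooting distance range, the joint limits, and all dictionary keys are assumptions made for illustration.

```python
import random

MATERIALS = ["metal", "plastic", "rubber"]
COLORS = ["yellow", "blue", "gray"]
WEAR_LEVELS = ["none", "light", "heavy"]
LIGHT_TYPES = ["sunlight", "lamplight"]
ATMOSPHERES = ["sunny", "cloudy", "light rain", "dust"]
BACKGROUNDS = ["soil", "gravel", "vegetation"]
NOISE_TYPES = ["gaussian", "salt_and_pepper"]
RESOLUTIONS = [(1920, 1080), (1280, 720)]
# Hypothetical joint limits in degrees; the text only requires physical feasibility.
JOINT_LIMITS = {
    "swing": (-180.0, 180.0),
    "boom": (-30.0, 60.0),
    "stick": (-150.0, 0.0),
    "bucket": (-120.0, 30.0),
}

def sample_config(rng):
    """Draw one randomized rendering configuration from the parameter library."""
    return {
        "target": {
            "material": rng.choice(MATERIALS),
            "color": rng.choice(COLORS),
            "wear": rng.choice(WEAR_LEVELS),
        },
        "environment": {
            "light_type": rng.choice(LIGHT_TYPES),
            "light_lux": rng.uniform(1000.0, 10000.0),
            "color_temp_k": rng.uniform(3000.0, 6500.0),
            "atmosphere": rng.choice(ATMOSPHERES),
            "background": rng.choice(BACKGROUNDS),
        },
        "acquisition": {
            "focal_length_mm": rng.uniform(50.0, 120.0),
            "azimuth_deg": rng.uniform(0.0, 90.0),
            "pitch_deg": rng.uniform(0.0, 90.0),
            "distance_m": rng.uniform(5.0, 30.0),  # assumed range, not in the text
            "noise": rng.choice(NOISE_TYPES),
            "resolution": rng.choice(RESOLUTIONS),
        },
        "attitude": {
            name: rng.uniform(low, high) for name, (low, high) in JOINT_LIMITS.items()
        },
    }
```

Passing a seeded `random.Random` instance makes each batch reproducible, so a rendered image can be regenerated from its seed when a label needs to be checked.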
[0040] Automatic Data Labeling Generation: Through the simulation engine's API interface, the virtual camera is controlled to batch acquire excavator images under different randomization configurations. Simultaneously, corresponding precise 3D pose labels are automatically extracted, generating a labeled dataset with a one-to-one correspondence between image and pose ground truth. Pose labels include the 2D projected coordinates of the eight vertices of the excavator's 3D bounding box, the 3D coordinates (X, Y, Z) of the excavator chassis center in the camera coordinate system, the rotation matrix R, the global yaw angle of the slewing platform, and the joint angles θ of the boom, stick, bucket, etc., relative to their parent components. The dataset size can be expanded to 100,000-1,000,000 images as needed.
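How a pose label might be assembled from the simulator's ground truth can be sketched as follows. The sketch assumes NumPy, a pinhole camera without lens distortion, and a bounding box centered on the chassis center; `box_vertices` and `make_label` are illustrative names, not simulation engine APIs.

```python
import numpy as np

def box_vertices(length, width, height):
    """Eight corners of a box centred on the model origin (local coordinates)."""
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    return signs * np.array([length, width, height]) / 2.0

def make_label(model_pts, R, t, K, joint_angles):
    """Build one pose label from ground-truth pose (R, t) and intrinsics K."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    cam = model_pts @ R.T + t                  # rigid transform into the camera frame
    if np.any(cam[:, 2] <= 0):
        raise ValueError("all vertices must lie in front of the camera")
    u = K[0, 0] * cam[:, 0] / cam[:, 2] + K[0, 2]
    v = K[1, 1] * cam[:, 1] / cam[:, 2] + K[1, 2]
    return {
        "keypoints_2d": np.stack([u, v], axis=1),  # shape (8, 2), pixels
        "center_xyz": t,                           # chassis center in camera frame
        "rotation": R,
        "joint_angles": dict(joint_angles),
    }
```

Because every quantity comes directly from the simulator state, the label is exact up to floating-point precision, which is the property the description relies on when it says manual annotation errors are eliminated.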
[0041] 3. Design and Training of Pose Estimation Model
Model Architecture: A two-stage architecture combining object detection and multi-task pose regression is adopted. The first stage uses an improved YOLOv8 model to detect and locate the excavator body and key components (bucket, boom, stick). The second stage uses a deep neural network with a multi-task learning architecture, which can be combined with the Transformer-PoseNet network. It takes the detected component regions as input and includes three core branches:
2D keypoint detection branch: used to regress the two-dimensional coordinates of the eight vertices of the 3D bounding box in the image.
[0042] 3D center point depth estimation branch: used to estimate the distance (Z coordinate) from the center point of the excavator chassis to the camera.
[0043] Joint angle estimation branch: used to regress the rotation angle of each joint.
[0044] Model training strategy: The labeled dataset generated by domain randomization is split into a training set, a validation set, and a test set. The training data may consist entirely of synthetic data and need not contain any manually labeled real image data.
[0045] A hybrid loss function is used, which combines MSE loss (translation parameters, 2D key points, depth values) and cosine loss (rotation parameters, joint angles). The total loss is the weighted sum of the losses of each component.
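The weighted hybrid loss can be sketched in NumPy as follows. The patent does not give the exact form of the cosine loss, so this sketch assumes 1 − cos(Δθ) for joint angles and one minus the absolute cosine similarity for a quaternion rotation parameter; the default weights and dictionary keys are also assumptions.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all elements."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def angular_cosine_loss(pred, target):
    """Mean of 1 - cos(pred - target) for angles in radians (wrap-around safe)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(1.0 - np.cos(diff)))

def quaternion_cosine_loss(pred, target):
    """1 - |cosine similarity|; the absolute value treats q and -q as equal."""
    p = np.asarray(pred, dtype=float)
    t = np.asarray(target, dtype=float)
    cos_sim = abs(p @ t) / (np.linalg.norm(p) * np.linalg.norm(t))
    return float(1.0 - cos_sim)

def hybrid_loss(pred, target, weights=None):
    """Weighted sum of MSE terms and cosine terms; returns (total, per-term dict)."""
    w = {"translation": 1.0, "keypoints": 1.0, "depth": 1.0, "rotation": 1.0, "joints": 1.0}
    w.update(weights or {})
    terms = {
        "translation": mse(pred["translation"], target["translation"]),
        "keypoints": mse(pred["keypoints"], target["keypoints"]),
        "depth": mse(pred["depth"], target["depth"]),
        "rotation": quaternion_cosine_loss(pred["rotation"], target["rotation"]),
        "joints": angular_cosine_loss(pred["joints"], target["joints"]),
    }
    total = sum(w[name] * value for name, value in terms.items())
    return total, terms
```

In practice the keypoint term is measured in pixels and the cosine terms lie in [0, 2], so the weights usually need tuning to keep any single term from dominating the gradient.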
[0046] The optimizer used is Adam or AdamW, the initial learning rate is set to 0.0001, the learning rate is adjusted using a cosine annealing strategy, the training batch size is 32, and the number of training epochs is 200.
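The learning-rate schedule implied by these settings can be written out directly. The sketch below assumes a single annealing cycle over all 200 epochs and a minimum learning rate of zero, neither of which the text specifies.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=200, lr_init=1e-4, lr_min=0.0):
    """Cosine annealing: lr_min + 0.5 * (lr_init - lr_min) * (1 + cos(pi * epoch / total_epochs))."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at 0.0001, passes through half that value at epoch 100, and decays smoothly towards `lr_min` at epoch 200.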
[0047] By introducing transfer learning, pre-trained ImageNet weights are used as initialization parameters for the backbone network (such as ResNet-50), accelerating model convergence.
[0048] Model optimization: Enhance feature extraction of key components through attention mechanism, suppress overfitting by using Dropout regularization, and fine-tune with a small amount of labeled data from real scene to narrow the domain gap between virtual and real scene.
[0049] 4. Pose estimation process in a real-world scenario
Images of excavators in actual working scenarios are captured using industrial or ordinary cameras, and the images are preprocessed (denoising, normalization, etc.).
[0050] The preprocessed image is input into the trained pose estimation model. First, key components are detected by the improved YOLOv8 model. Then, the multi-task regression network outputs 2D key point prediction values, 3D center point depth prediction value Z, joint angle prediction values, and rotation matrix related parameters.
[0051] Using camera intrinsic parameters (obtainable through calibration), predicted 2D keypoints, and depth value Z, the complete 3D coordinates of the excavator chassis center in the camera coordinate system are calculated using the PnP (perspective n-point) algorithm.
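[0052] The calculation process of the perspective n-point algorithm satisfies the following quantitative relationships:
(1) The camera imaging model follows the perspective projection formula, which maps the camera coordinates $(X_c, Y_c, Z_c)$ of a 3D point to image pixel coordinates $(u, v)$:
$$u = f_x \frac{X_c}{Z_c} + c_x, \qquad v = f_y \frac{Y_c}{Z_c} + c_y$$
In the formula, $f_x$ and $f_y$ are the focal lengths (in pixels) in the camera's intrinsic parameters, and $(c_x, c_y)$ are the coordinates (in pixels) of the principal point of the image in the camera's intrinsic parameters. In addition, $n \geq 4$, where $n$ is the number of vertices in the 3D bounding box of the excavator.
(2) The local coordinates $P_m = (X_m, Y_m, Z_m)^T$ of the excavator 3D model and the camera coordinates $P_c = (X_c, Y_c, Z_c)^T$ satisfy the rigid body transformation relation:
$$P_c = R\,P_m + t$$
In the formula, $R$ is the 3×3 rotation matrix of the excavator model's local coordinate system relative to the camera coordinate system, and $t = (t_x, t_y, t_z)^T$ is the three-dimensional translation vector of the excavator chassis center in the camera coordinate system. The rotation matrix is represented with Euler angles in ZYX order, corresponding to the yaw angle $\alpha$, pitch angle $\beta$, and roll angle $\gamma$. Its expanded form is:
$$R = R_z(\alpha)\,R_y(\beta)\,R_x(\gamma) = \begin{bmatrix} \cos\alpha\cos\beta & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma \\ \sin\alpha\cos\beta & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma \\ -\sin\beta & \cos\beta\sin\gamma & \cos\beta\cos\gamma \end{bmatrix}$$
During the optimization process, the rotation matrix is represented by quaternions to avoid Euler angle singularity problems, and the quaternions are normalized.
[0053] (3) Using the 3D center point depth prediction value $Z_0$ output by the model as the depth prior, determine the Z component of the translation vector: $t_z = Z_0$.
(4) Solve for $R$ and $(t_x, t_y)$ by least squares optimization. The objective function minimizes the sum of squared residuals between the predicted 2D keypoints $(u_k, v_k)$ and the projected 2D coordinates $(\hat{u}_k, \hat{v}_k)$:
$$\min_{R,\,t_x,\,t_y} \sum_{k=1}^{n} \left[ (u_k - \hat{u}_k)^2 + (v_k - \hat{v}_k)^2 \right]$$
In the formula, the projected coordinates of the $k$-th vertex with local coordinates $(X_{m,k}, Y_{m,k}, Z_{m,k})$ are defined as:
$$\hat{u}_k = f_x \frac{R_{11}X_{m,k} + R_{12}Y_{m,k} + R_{13}Z_{m,k} + t_x}{R_{31}X_{m,k} + R_{32}Y_{m,k} + R_{33}Z_{m,k} + t_z} + c_x, \qquad \hat{v}_k = f_y \frac{R_{21}X_{m,k} + R_{22}Y_{m,k} + R_{23}Z_{m,k} + t_y}{R_{31}X_{m,k} + R_{32}Y_{m,k} + R_{33}Z_{m,k} + t_z} + c_y$$
where $R_{ij}$ is the element in the i-th row and j-th column of the rotation matrix $R$. The Levenberg-Marquardt algorithm is used to solve this nonlinear optimization problem, yielding the converged rotation matrix $R$ and translation vector $t$.
[0054] (5) The engineering implementation of the perspective n-point algorithm uses OpenCV library functions. The input parameters include the model local coordinate set of the excavator 3D bounding box vertices, the predicted 2D key point coordinate set, and the camera intrinsic parameter matrix:
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
The pose parameters are initialized using the solvePnP function, and the depth prior $Z_0$ is passed to the solvePnPRefineLM function for constraint optimization, outputting rotation and translation vectors.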
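OpenCV's `solvePnPRefineLM` takes an initial rotation and translation vector rather than an explicit depth constraint, so how the prior Z0 enters the refinement is left open by the text. One plausible reading, sketched below, holds the Z component of the translation fixed at Z0 and refines only the rotation and the X and Y translation with a hand-written Levenberg-Marquardt loop. The sketch uses NumPy only, swaps in a rotation vector (Rodrigues form) for the quaternion mentioned above, and uses a finite-difference Jacobian; all function names are illustrative.

```python
import numpy as np

def rodrigues(rvec):
    """Rotation matrix from an axis-angle rotation vector."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    Kx = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)

def project(params, z0, model_pts, K):
    """Project model points; params = (rx, ry, rz, tx, ty), with t_z fixed to z0."""
    R = rodrigues(params[:3])
    t = np.array([params[3], params[4], z0])
    cam = model_pts @ R.T + t
    u = K[0, 0] * cam[:, 0] / cam[:, 2] + K[0, 2]
    v = K[1, 1] * cam[:, 1] / cam[:, 2] + K[1, 2]
    return np.stack([u, v], axis=1)

def refine_pose_lm(params0, z0, model_pts, K, keypoints, iters=100, lam=1e-3):
    """Levenberg-Marquardt on the reprojection error with the depth prior held fixed."""
    def residuals(p):
        return (project(p, z0, model_pts, K) - keypoints).ravel()

    p = np.asarray(params0, dtype=float).copy()
    r = residuals(p)
    cost = r @ r
    eps = 1e-6
    for _ in range(iters):
        J = np.empty((r.size, p.size))
        for j in range(p.size):            # forward-difference Jacobian
            dp = np.zeros_like(p)
            dp[j] = eps
            J[:, j] = (residuals(p + dp) - r) / eps
        A = J.T @ J
        damped = A + lam * np.diag(np.diag(A)) + 1e-12 * np.eye(p.size)
        step = np.linalg.solve(damped, -J.T @ r)
        r_new = residuals(p + step)
        cost_new = r_new @ r_new
        if cost_new < cost:                # accept the step and relax the damping
            p, r, cost = p + step, r_new, cost_new
            lam = max(lam / 10.0, 1e-9)
            if np.linalg.norm(step) < 1e-10:
                break
        else:                              # reject the step and increase the damping
            lam *= 10.0
    return rodrigues(p[:3]), np.array([p[3], p[4], z0])
```

In a deployment the starting values would normally come from `cv2.solvePnP`, whose returned rotation vector uses the same axis-angle form as `rodrigues` here. Fixing t_z outright trusts the depth branch completely; a softer alternative would add a weighted residual on (t_z − Z0) instead.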
[0055] The output pose parameters are post-processed, such as by using smoothing filters and outlier removal. Combined with the predicted joint angles and rotation matrices, the complete pose information of the excavator in three-dimensional space is finally output.
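The post-processing step is described only as smoothing filtering and outlier removal, so the following is one possible realization under stated assumptions: a sliding-window median test (Hampel-style) for outliers, followed by an exponential moving average. The window size, threshold, and smoothing factor are illustrative defaults, not patent values.

```python
from statistics import median

def remove_outliers(values, window=5, k=3.0):
    """Replace samples far from the local median with that median."""
    cleaned = list(values)
    half = window // 2
    for i in range(len(values)):
        win = values[max(0, i - half): i + half + 1]
        med = median(win)
        mad = median([abs(v - med) for v in win])   # median absolute deviation
        # 1.4826 scales the MAD to a standard deviation for Gaussian noise.
        if abs(values[i] - med) > k * 1.4826 * mad + 1e-9:
            cleaned[i] = med
    return cleaned

def smooth(values, alpha=0.3):
    """Exponential moving average; a larger alpha follows the input more closely."""
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1.0 - alpha) * out[-1])
    return out
```

A per-frame depth sequence such as `[10.0, 10.1, 10.0, 25.0, 10.2]` would first pass through `remove_outliers` and then through `smooth`. Angle sequences that can wrap around ±180° need unwrapping before either filter is applied.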
[0056] The embodiments described above are some, but not all, embodiments of the present invention. The detailed description of the embodiments of the present invention is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
Claims
1. A method for estimating the 3D pose of an excavator based on a labeled dataset generated by domain randomization, characterized in that it includes the following steps:
S1: Construct a parametric 3D model of the excavator and define its kinematic chain and joint degrees of freedom;
S2: Build a virtual simulation environment, configure a multi-dimensional domain randomization parameter library, batch generate synthetic training images and corresponding automatic 3D pose labels, and form a domain randomization training dataset;
S3: Using the domain randomized training dataset, train and optimize a two-stage deep neural network model, which is configured to detect key components of the excavator from the input image and regress the 2D key points, 3D center point depth, rotation matrix and joint angles of the excavator;
S4: Input the real excavator image into the trained deep neural network model, obtain the relevant predicted values, combine the camera intrinsic parameters for calculation and post-processing, and obtain the complete pose of the excavator in three-dimensional space.
2. The method according to claim 1, characterized in that the parameterized 3D model of the excavator in step S1 refers to decomposing the excavator's motion into rigid body transformations and rotational joint movements of the chassis, slewing platform, boom, stick, and bucket, thereby restoring the structural parameters and joint movement range of the real machine model.
3. The method according to claim 1, characterized in that the virtual simulation environment in step S2 is built based on Unreal Engine 5, Unity or Blender simulation engine, and includes a 3D model of an excavator and a base of operational scenarios covering construction sites, farmland, mines, cities and forests.
4. The method according to claim 1, characterized in that the multi-dimensional domain randomization parameter library in step S2 includes at least one of target parameters, environmental parameters, acquisition parameters, and attitude parameters. The target parameters include the excavator material, color, texture, and wear level. The environmental parameters include the light type, intensity, color temperature, atmospheric environment, and background. The acquisition parameters include camera intrinsic parameters, extrinsic parameters, image noise, and resolution. The attitude parameters include the random angles of each joint of the excavator.
5. The method according to claim 1, characterized in that the automatic 3D pose labels generated in step S2 include: the 2D projected coordinates of the vertices of the excavator's 3D bounding box, the 3D coordinates (X,Y,Z) of the excavator chassis center in the camera coordinate system, the rotation matrix R, and the rotation angle θ of each joint.
6. The method according to claim 1, characterized in that the deep neural network model with a two-stage architecture in step S3 includes: the first stage: detection and localization of the excavator body and key components based on the improved YOLOv8 model; the second stage: a multi-task learning architecture, including a branch for regressing 2D key points, a branch for regressing the depth of the 3D center point, and a branch for regressing the joint angles.
7. The method according to claim 1 or 6, characterized in that the model training in step S3 uses a hybrid loss function combining MSE loss and cosine loss, and combines transfer learning with fine-tuning on a small amount of labeled real-scene data. The training data is mainly the domain randomized synthetic data generated in step S2.
8. The method according to claim 1, characterized in that the calculation in step S4 specifically involves using the perspective n-point algorithm, combined with the predicted 2D key points, the corresponding 3D model points, and the predicted depth prior, to solve the three-dimensional translation and rotation of the excavator chassis center relative to the camera coordinate system; post-processing includes smoothing filtering and outlier removal.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 8.