A domain-randomization-based simulation training method and system for bicycle control strategies
By randomizing dynamic parameters and states in a bicycle simulation environment during training, and combining a multi-objective reward function with a proximal policy optimization algorithm, the method addresses the problem of transferring bicycle control strategies from simulation to reality, enabling efficient and low-cost controller development and deployment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Filing Date
- 2026-03-03
- Publication Date
- 2026-06-19
AI Technical Summary
In the prior art, there is a large technical gap in transferring bicycle control strategies from the simulation environment to the real physical world. Traditional simulation training does not improve the robustness or generalization ability of controllers, and relying on physical prototypes is costly, inefficient, and risky.
A simulation training method for bicycle control strategies based on domain randomization is adopted. A parameterized digital twin model is constructed in the physics simulation engine; the dynamic parameters, initial state, and environment are randomized; and a robust neural network controller is obtained through large-scale parallel training with a multi-objective reward function and a proximal policy optimization algorithm.
It achieves zero-shot transfer from simulation to reality, improves the robustness and generalization ability of the controller, reduces development costs and risks, and establishes a new paradigm of efficient offline training and online deployment.
Figure CN122242211A_ABST (abstract drawing)
Description
Technical Field
[0001] This invention relates to the field of bicycle simulation and training technology, specifically to a bicycle control strategy simulation training method and system based on domain randomization.
Background Technology
[0002] A bicycle is a typical dynamic system subject to nonholonomic constraints. Its center of gravity is relatively high and, unlike a four-wheeled vehicle, it lacks a stable supporting chassis, so it is inherently unstable and cannot maintain balance on its own. Active intervention from an external controller is therefore required to adjust the handlebars, wheels, and other actuated parts, thereby achieving self-balancing and steering control of the bicycle.
[0003] Currently, the self-balancing and steering control schemes in existing technologies can be analyzed mainly from two aspects: hardware configuration and balance control algorithm.
[0004] According to the hardware configuration, the self-balancing control technology of bicycles can be divided into two categories: (1) the angular momentum conservation balance control scheme based on the flywheel and (2) the flywheel-less balance control scheme based on the handlebar steering.
[0005] Among them, the flywheel-based angular momentum conservation balance control scheme has a simple and intuitive control principle, but it suffers from inherent problems such as bulky structure, high energy consumption, and high drive cost, making it difficult to be widely used. On the other hand, the handlebar steering scheme has a simple structure and low cost, making it a promising development direction. However, the strong coupling between balance and steering control in this scheme poses a significant challenge to the control algorithm.
[0006] There are significant technical gaps and fundamental flaws in deploying control strategies from the development environment to the real physical world. First, the "Sim-to-Real Gap" causes controllers to fail when transferred. Second, traditional simulation methods cannot effectively improve the robustness and generalization ability of controllers: traditional simulation training is usually carried out in an environment with a single, fixed set of physical parameters, which is essentially no different from development based on a single model and does not teach the controller to cope with change. Third, the development model that relies on physical prototypes is costly, inefficient, and risky: because transfer from simulation to reality is extremely difficult, traditional development processes have to rely on physical prototypes for a large number of field tests and parameter adjustments.
Summary of the Invention
[0007] The purpose of this invention is to overcome the above-mentioned technical deficiencies and provide a simulation training method and system for bicycle control strategies based on domain randomization, thereby solving the technical problem that there is a huge technical gap and fundamental defects in the process of deploying control strategies from the development environment to the real physical world in the prior art.
[0008] To achieve the above-mentioned technical objectives, in a first aspect, the present invention provides a simulation training method for bicycle control strategies based on domain randomization, comprising the following steps:
[0009] In the physics simulation engine, a parametric digital twin model of the bicycle is built, and all physical parameters that have a significant impact on the bicycle's dynamics are defined as programmable variables.
[0010] At the start of each new training round, all physical parameters of each digital twin model are independently and randomly sampled to randomize the dynamic parameters, and initial state randomization and environment and task randomization are performed for each digital twin model.
[0011] A multi-objective reward function is constructed to guide the learning of the neural network controller. The neural network controller is trained over multiple rounds, in large-scale parallel, using a proximal policy optimization algorithm to obtain the trained neural network controller.
[0012] The trained neural network controller is used to perform simulation verification of the bicycle control strategy under complex working conditions.
[0013] Compared with the prior art, the beneficial effects of the present invention include:
[0014] 1. It bridges the "simulation-reality gap," achieving "zero-shot" transfer from simulation to reality. By learning extensively from massive amounts of randomized dynamic parameters, initial states, and external environments during training, the controller is forced to master universal control laws independent of any specific model.
[0015] 2. It solves the problems of controller robustness and environment generalization.
[0016] Because the controller has undergone parameter disturbances, state disturbances, and environmental changes far exceeding the range of real-world variations during the training phase, it possesses the ability to adapt to uncertainty. Whether it's load changes, parameter drift caused by component aging, or traveling from a smooth highway to a bumpy gravel road, the controller can maintain robust performance, fundamentally solving the problem of traditional controllers failing due to changes in operating conditions.
[0017] 3. A new, efficient, and low-cost development paradigm of "offline training - online deployment" has been established.
[0018] This invention places the computationally intensive controller training process entirely in a virtual simulation environment. By utilizing massively parallel computing, it can complete the debugging work that would take weeks to complete using traditional methods in just tens of minutes.
[0019] According to some embodiments of the present invention, constructing a parametric digital twin model of a bicycle in a physics simulation engine includes the following steps:
[0020] The digital twin model is structurally decomposed and modeled into four core rigid body components: frame, handlebar and fork assembly, front wheel and rear wheel;
[0021] Joint definition and control: precise motion joints are defined between the rigid body components.
[0022] According to some embodiments of the present invention, the motion joints include:
[0023] The handlebar joint, which connects the frame and the handlebar assembly, is a rotatable joint that uses a position control mode and directly receives the target steering angle command output by the neural network controller.
[0024] The rear wheel joint, which connects the frame to the rear wheel, is a rotatable joint that uses a speed control mode. It receives the target speed command output by the neural network controller and serves as the drive source for the entire vehicle.
[0025] Front wheel joint: Connects the handlebar assembly to the front wheel. It is a rotatable joint, set to a free-rotation mode without power, simulating the driven characteristics of a real bicycle front wheel.
[0026] According to some embodiments of the present invention, the physical parameters include: frame mass, handlebar mass, fork assembly mass, front wheel mass and rear wheel mass, center of gravity position, moment of inertia, static / dynamic friction coefficients of the front and rear tires with the ground, and key geometric dimensions of the frame: wheelbase, front fork inclination angle, and trail.
[0027] According to some embodiments of the present invention, in order to accurately reproduce in simulation the driving effect of a real-world motor, a proportional-derivative (PD) controller is configured in the physics simulation engine for the handlebar joint and the rear wheel joint of the simulation model. For the handlebar joint, which uses position control, the goal of PD parameter tuning is to respond to the target angle command quickly and accurately. The stiffness parameter P of the handlebar joint determines how quickly the joint responds to a command and how strongly it corrects errors, while the damping parameter D of the handlebar joint suppresses overshoot and oscillation.
[0028] For rear wheel joints employing speed control, the goal of PD parameter tuning is to stably and smoothly track the target speed command. The stiffness parameter P of the rear wheel joint primarily affects its ability to correct tracking speed errors, while the damping parameter D of the rear wheel joint is used to ensure the smoothness of speed changes and prevent speed jitter.
[0029] For the unpowered front wheel joint, in order to simulate its free rotation characteristics, the stiffness parameter P and damping parameter D of the PD controller are both set to zero.
[0030] According to some embodiments of the present invention, initial state randomization and environment and task randomization are performed on each of the digital twin models, including the following steps:
[0031] Initial state randomization: In order to improve the controller's recovery capability under various unexpected conditions, the initial state of the vehicle is randomized, including: randomly setting the initial roll angle and roll rate within a small range, randomly setting the initial linear velocity, and randomly setting the initial position and orientation on the ground.
[0032] Environment and task randomization: To ensure the controller can adapt to diverse external environments and commands, this includes:
[0033] Terrain randomization: Allow bicycles to be trained on various virtual terrains such as flat ground, slopes of different gradients, simulated gravel roads, steps of random height, and rough grid roads to enhance their adaptability to changes in road surface.
[0034] Command randomization: In each round, the target linear velocity and target steering commands are randomly varied and issued to the neural network controller. The magnitude and frequency of the commands are also randomly generated within a reasonable range to ensure that the controller can flexibly track various complex trajectories.
[0035] According to some embodiments of the present invention, the multi-objective reward function includes:
[0036] The survival reward subfunction is used to keep the bicycle upright and stable;
[0037] The task tracking reward subfunction is used to follow speed and steering commands;
[0038] The motion smoothness penalty subfunction is used to encourage energy-efficient and smooth motions.
[0039] According to some embodiments of the present invention, the simulation verification of the trained neural network controller for bicycle control strategy under complex working conditions includes the following steps:
[0040] Zero-shot generalization test: The trained neural network controller is deployed on multiple new bicycle models with physical parameters set outside the training randomization range but still physically reasonable, to test whether the neural network controller can still complete the balance and line-following tasks.
[0041] Strong disturbance resistance test: During stable cycling in a simulated environment, strong external disturbances that were not explicitly present during training are applied, including: simulating continuous crosswinds, placing sudden obstacles or potholes on the path, and applying a thrust to the bicycle instantly. The dynamic robustness is tested by evaluating the speed and ability of the neural network controller to recover balance from the instability state.
[0042] Extreme mission execution test: On a rugged and complex test terrain, a series of rapidly changing combined commands are issued to the controller to test the comprehensive performance under strong coupling of multiple targets and dynamic limits.
[0043] In a second aspect, the present invention provides a simulation training system for bicycle control strategies based on domain randomization, comprising:
[0044] The model building module constructs a parametric digital twin model of the bicycle in the physics simulation engine, defining all physical parameters that have a significant impact on bicycle dynamics as programmable variables.
[0045] The domain randomization module performs an independent random sampling of all the physical parameters of each digital twin model at the beginning of each new training round to randomize the dynamic parameters, and performs initial state randomization and environment and task randomization for each digital twin model.
[0046] The controller learning module constructs a multi-objective reward function to guide the neural network controller to learn, and uses a proximal policy optimization algorithm to perform large-scale parallel multi-round training on the neural network controller to obtain the trained neural network controller.
[0047] The verification and testing module performs simulation verification of the bicycle control strategy under complex working conditions on the trained neural network controller.
[0048] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Description of the Drawings
[0049] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments taken in conjunction with the following drawings, wherein the abstract drawings are to be completely consistent with one of the drawings in the specification:
[0050] Figure 1 is a flowchart of a simulation training method for bicycle control strategies based on domain randomization, provided in one embodiment of the present invention;
[0051] Figure 2 is a physical simulation modeling diagram of a bicycle provided in one embodiment of the present invention;
[0052] Figure 3 is a domain randomization parameter graph provided in one embodiment of the present invention;
[0053] Figure 4 is a simulation test diagram of self-balancing and steering control on flat terrain provided in one embodiment of the present invention;
[0054] Figure 5 is a simulation test diagram of self-balancing and steering control in complex terrain provided in one embodiment of the present invention.
Detailed Implementation
[0055] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0056] It should be noted that although functional modules are divided in the system diagram and the logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the system or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, and the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0057] Refer to Figures 1 to 5. Figure 1 is a flowchart of a simulation training method for bicycle control strategies based on domain randomization; Figure 2 is a physical simulation modeling diagram of a bicycle; Figure 3 is a domain randomization parameter graph; Figure 4 is a simulation test diagram of self-balancing and steering control on flat terrain; and Figure 5 is a simulation test diagram of self-balancing and steering control in complex terrain, each provided in one embodiment of the present invention.
[0058] In one embodiment, the simulation training method for bicycle control strategy based on domain randomization includes the following steps: constructing a parameterized digital twin model of the bicycle in a physical simulation engine, defining all physical parameters that have a significant impact on bicycle dynamics as programmable variables; at the beginning of each new training round, performing an independent random sampling of all physical parameters of each digital twin model to randomize the dynamic parameters, and performing initial state randomization and environment and task randomization on each digital twin model; constructing a multi-objective reward function to guide the neural network controller to learn, and using a proximal policy optimization algorithm to perform large-scale parallel multi-round training on the neural network controller to obtain a trained neural network controller; and performing simulation verification of the trained neural network controller under complex operating conditions.
[0059] Overall process (as shown in Figure 1):
[0060] This invention proposes a complete simulation training method and system based on domain randomization. It aims to systematically solve the problem of migrating control strategies from simulation to reality by constructing a more diverse and challenging virtual training environment than the real world, thereby endowing the controller with robustness and generalization ability. The specific implementation principle and process are as follows:
[0061] (1) Digital twin physical modeling and simulation based on the bicycle configuration (as shown in Figure 2)
[0062] This method first constructs a parametric digital twin model of a bicycle in a high-performance physics simulation engine (such as NVIDIA Isaac Sim, MuJoCo, ORCA, etc.). It is not only a geometric and physical reproduction of a real bicycle, but also a "template" that can be used for subsequent large-scale randomized training.
[0063] Structural decomposition and modeling: The bicycle is decomposed into four core rigid body components: frame, handlebar and fork assembly, front wheel, and rear wheel.
[0064] Joint definition and control: Define precise motion joints between rigid bodies, including:
[0065] a) Handlebar joint: Connecting the frame and handlebar assembly, it is a rotatable joint. It uses a position control mode, directly receiving the target steering angle command output from the neural network controller;
[0066] b) Rear wheel joint: Connecting the frame to the rear wheel, it is a rotatable joint that uses a speed control mode. It receives the target speed command output by the neural network and serves as the drive source for the entire vehicle.
[0067] c) Front wheel joint: Connects the handlebar assembly to the front wheel. It is also a rotatable joint, but it is set to a free-rotation mode without power to simulate the driven characteristics of a real bicycle front wheel.
[0068] Parameterization of physical parameters: This is the key step in this process. Unlike traditional simulations that use fixed parameters, this invention defines all physical properties that have a significant impact on dynamics as programmable variables rather than fixed values. These include, but are not limited to: the mass, center of mass, and moment of inertia of components such as the frame and wheels; the static / dynamic friction coefficients between the front and rear tires and the ground; and key geometric dimensions of the frame such as wheelbase, front fork inclination angle, and trail.
[0069] Furthermore, special attention needs to be paid to the tuning of the motion joint drive and PD control parameters: In order to accurately simulate the driving effect of a motor in the real world, this invention configures a proportional-derivative (PD) controller for the active joints in the simulation model. The tuning of the PD controller parameters is a key step in achieving high-fidelity simulation, and its goal is to make the dynamic response characteristics of the simulated joint match the physical characteristics of the real motor as closely as possible.
[0070] a) Handlebar joint (position control): For handlebar joints with position control, the goal of PD parameter tuning is to respond to target angle commands quickly, accurately and without overshoot.
[0071] The stiffness parameter P determines how quickly the joint responds to a command and how strongly it corrects errors, while the damping parameter D suppresses overshoot and oscillation. The values of P and D are adjusted through experiments or system identification, based on the maximum torque and rated speed of the selected real steering motor, so that the rise time, overshoot, and other indicators of the simulated joint in a step response test match the performance of the real motor when driving a load.
[0072] b) Rear wheel joint (speed control): For a rear wheel joint that uses speed control, the goal of its PD parameter tuning is to stably and smoothly track the target speed command.
[0073] The stiffness parameter P here mainly affects its ability to correct tracking speed errors, while the damping parameter D ensures the smoothness of speed changes and prevents speed jitter. Its parameter settings reference the torque-speed characteristic curve of a real hub motor to ensure that the acceleration and deceleration dynamics of the simulated joint remain consistent with those of a real motor under different loads.
[0074] c) Front wheel joint (free rotation): For the unpowered front wheel joint, in order to simulate its free rotation characteristics, the stiffness parameter P and damping parameter D of its PD controller are both set to zero. This makes it rotate only under the control of physical laws and handlebar steering in the simulation, thus accurately reproducing the driven characteristics of the real front wheel.
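The three joint configurations above can be illustrated with a minimal sketch of a generic PD drive. It assumes the textbook form u = P·e + D·de/dt, where e is the angle error for the position-controlled handlebar joint and the speed error for the speed-controlled rear wheel joint. The class name `PDJoint` and all gain values are hypothetical, and a particular physics engine may define its joint drive differently.

```python
from dataclasses import dataclass


@dataclass
class PDJoint:
    """Generic PD drive; names and gains are illustrative assumptions."""
    stiffness: float          # P gain
    damping: float            # D gain
    prev_error: float = 0.0   # error from the previous control step

    def torque(self, target: float, measured: float, dt: float) -> float:
        """Return P * e + D * de/dt for the tracked quantity (angle or speed)."""
        error = target - measured
        error_rate = (error - self.prev_error) / dt
        self.prev_error = error
        return self.stiffness * error + self.damping * error_rate


# Hypothetical gains, not values from the source.
handlebar = PDJoint(stiffness=40.0, damping=2.0)   # position control
rear_wheel = PDJoint(stiffness=5.0, damping=0.1)   # speed control
front_wheel = PDJoint(stiffness=0.0, damping=0.0)  # free rotation
```

With both gains at zero, `front_wheel.torque(...)` returns zero for any target, so that joint is moved only by contact and by the rest of the mechanism, which is the free-rotation behavior described for the front wheel.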
[0075] By constructing such a fully parameterized digital twin template, this method builds a simulation system capable of generating an unlimited number of "variant" bicycles, laying the foundation for subsequently addressing the "simulation-reality gap" problem through domain randomization. The physical model of the bicycle is shown in Figure 2.
[0076] (2) Design of domain randomization strategy based on dynamic parameters
[0077] Domain randomization is the core technical means of this invention. Its purpose is not to infinitely approximate reality, but to create a possibility space far richer than reality, forcing the controller to learn more fundamental control laws independent of any specific physical parameters. This invention designs a comprehensive randomization strategy covering multiple levels, as shown in Figure 3:
[0078] Dynamics Randomization: This is fundamental to improving controller robustness. At the start of each new training episode, the system independently samples all the physical parameters (mass, center of mass, friction coefficient, etc.) defined above from a pre-defined reasonable range. This means that in any two consecutive training episodes the neural network faces what are effectively two different bicycles, forcing it to learn a general strategy that remains stable under various physical characteristics.
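A minimal sketch of per-episode dynamics randomization, assuming independent uniform sampling. The source does not state the distribution or the ranges, so the parameter names, the numeric ranges, and the helpers `sample_dynamics` and `reset_all` are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical sampling ranges (SI units); the source does not give values.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "frame_mass": (8.0, 20.0),
    "front_wheel_mass": (1.0, 3.0),
    "rear_wheel_mass": (1.5, 4.0),
    "tire_friction": (0.4, 1.2),
    "wheelbase": (0.95, 1.15),
    "trail": (0.04, 0.09),
}


def sample_dynamics(rng: random.Random) -> Dict[str, float]:
    """Draw every parameter independently and uniformly from its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def reset_all(num_envs: int, seed: int = 0) -> List[Dict[str, float]]:
    """At episode start, give each parallel environment its own bicycle."""
    rng = random.Random(seed)
    return [sample_dynamics(rng) for _ in range(num_envs)]
```

Calling `reset_all(4)` yields four parameter sets, one per environment, each of which would be written into the corresponding digital twin before the episode begins.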
[0079] Initial State Randomization: To improve the controller's recovery capability under various unexpected conditions, the system randomizes the vehicle's initial state, including: randomly setting the initial roll angle and roll rate within a small range, randomly setting the initial linear velocity, and randomly setting the initial position and orientation on the ground.
[0080] Environment & Task Randomization: To ensure the controller can adapt to diverse external environments and instructions, this strategy also includes:
[0081] a) Terrain randomization: Train bicycles on various virtual terrains such as flat ground, slopes of different gradients, simulated gravel roads, steps of random height, and rough grid roads to enhance their adaptability to changes in road surface.
[0082] b) Command randomization: In each round, the controller is given randomly varying target linear velocity and target turning commands. The magnitude and frequency of the commands are also randomly generated within a certain range to ensure that the controller can flexibly track various complex trajectories.
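Initial state randomization and command randomization can be sketched in the same way. This is an illustration only: the ranges, units, and function names below are assumptions, and terrain randomization is omitted because it depends on the simulation engine used.

```python
import math
import random
from typing import Dict, List, Tuple


def sample_initial_state(rng: random.Random) -> Dict[str, float]:
    """Randomize the starting pose and motion (all ranges are hypothetical)."""
    return {
        "roll": rng.uniform(-0.1, 0.1),         # rad, small initial lean
        "roll_rate": rng.uniform(-0.2, 0.2),    # rad/s
        "speed": rng.uniform(0.5, 3.0),         # m/s
        "x": rng.uniform(-5.0, 5.0),            # m
        "y": rng.uniform(-5.0, 5.0),            # m
        "yaw": rng.uniform(-math.pi, math.pi),  # rad
    }


def sample_command_schedule(
    rng: random.Random, episode_len_s: float
) -> List[Tuple[float, float, float]]:
    """Return (start_time, target_speed, target_steer) segments.

    Both the command values and the time each one is held are random,
    which randomizes command magnitude and switching frequency.
    """
    schedule = []
    t = 0.0
    while t < episode_len_s:
        schedule.append((t, rng.uniform(0.5, 4.0), rng.uniform(-0.4, 0.4)))
        t += rng.uniform(1.0, 5.0)  # hold time before the next command
    return schedule
```

A 20-second episode would therefore contain between 4 and 20 command segments, depending on the sampled hold times.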
[0083] (3) Domain randomized reinforcement learning simulation training for robust control
[0084] In this step, we utilize the simulation system and randomization strategy built in the first two steps to efficiently train the neural network controller.
[0085] Training Framework and Algorithm: The Proximal Policy Optimization (PPO) algorithm is used for training, which demonstrates stability and efficiency in handling continuous control tasks. The neural network controller receives vectors describing the bicycle's state as input, such as lean angle, angular velocity, current speed, and task commands, and outputs control commands for the handlebar motor and rear wheel motor.
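For reference, the clipped surrogate objective usually associated with PPO is given below. The source does not state which PPO variant or hyperparameters it uses, so the symbols are the standard ones rather than values from this invention: here s_t would be the bicycle state vector, a_t the handlebar and rear wheel commands, Â_t an advantage estimate, and ε the clipping range.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The clipping keeps each policy update close to the policy that collected the data, which is the stability property the paragraph above refers to.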
[0086] Reward Function Guidance: Learning is guided by a multi-objective reward function that combines survival rewards (staying upright), task tracking rewards (following speed and steering commands), and motion smoothness penalties (encouraging energy-efficient and smooth actions), enabling the controller to evolve towards high performance and high robustness.
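One way such a reward could be assembled is sketched below. The source names the three terms but not their form, so the weights, the roll threshold `MAX_ROLL`, and the exponential tracking shape are assumptions.

```python
import math
from typing import Sequence

MAX_ROLL = 0.8  # rad; hypothetical lean angle beyond which the bicycle counts as fallen


def reward(
    roll: float,
    speed: float,
    steer: float,
    target_speed: float,
    target_steer: float,
    action: Sequence[float],
    prev_action: Sequence[float],
) -> float:
    """Weighted sum of survival, tracking, and smoothness terms (weights assumed)."""
    w_alive, w_track, w_smooth = 1.0, 2.0, 0.05

    # Survival: constant bonus for every step the bicycle stays upright.
    alive = 1.0 if abs(roll) < MAX_ROLL else 0.0

    # Task tracking: 1.0 at zero error, decaying as the error grows.
    track = 0.5 * (
        math.exp(-((speed - target_speed) ** 2))
        + math.exp(-((steer - target_steer) ** 2))
    )

    # Smoothness: penalize large changes between consecutive actions.
    change = sum((a - b) ** 2 for a, b in zip(action, prev_action))

    return w_alive * alive + w_track * track - w_smooth * change
```

The relative weights set the trade-off between staying upright and following commands, so in practice they would need tuning for the task at hand.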
[0087] Massive parallel training: This is key to achieving efficient training. This system supports running tens of thousands of independent, parallel simulation environments simultaneously on GPUs. At any given time, tens of thousands of agents are exploring and learning in their respective randomized "worlds." This mode delivers enormous data throughput, reaching hundreds of thousands of interactions per second, enabling training processes that would normally take weeks or even months to converge within 60 minutes, significantly accelerating the algorithm's iteration cycle.
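As a rough consistency check on these figures, take the lower end of "hundreds of thousands of interactions per second" as an assumed 10^5 per second. A 60-minute run then collects on the order of:

```latex
10^{5}\ \tfrac{\text{interactions}}{\text{s}} \times 3600\ \text{s}
  = 3.6 \times 10^{8}\ \text{interactions}
```

This is an order-of-magnitude estimate under that assumption, not a measured result from the source.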
[0088] (4) Simulation verification of bicycle control strategy under complex working conditions
[0089] After training, in order to verify the robustness and generalization ability of the control strategy trained by this invention, a series of simulation verification tests is designed:
[0090] Zero-Shot Generalization Test: A trained single controller is deployed on multiple "unfamiliar" bicycle models with physical parameters intentionally set outside the training randomization range but still physically reasonable. The test assesses whether it can still perform balance and line-following tasks on these never-before-seen vehicles, a key metric for measuring "sim-to-real" transfer capabilities.
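A sketch of how test parameters might be drawn outside the training range while staying physically reasonable. The function name `sample_outside` and the example bounds are hypothetical; the source does not specify how the unfamiliar models are generated.

```python
import random
from typing import Tuple


def sample_outside(
    rng: random.Random,
    train: Tuple[float, float],
    physical: Tuple[float, float],
) -> float:
    """Sample inside the physical bounds but outside the training range.

    Assumes physical[0] < train[0] < train[1] < physical[1].
    """
    low_band = (physical[0], train[0])   # below the training range
    high_band = (train[1], physical[1])  # above the training range
    # Pick a band with probability proportional to its width.
    low_width = low_band[1] - low_band[0]
    high_width = high_band[1] - high_band[0]
    if rng.random() < low_width / (low_width + high_width):
        return rng.uniform(*low_band)
    return rng.uniform(*high_band)
```

For example, if frame mass were trained on an assumed 8 to 20 kg, calling `sample_outside(rng, (8.0, 20.0), (5.0, 30.0))` would return masses in 5 to 8 kg or 20 to 30 kg only.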
[0091] Strong Disturbance Rejection Test: During stable cycling, a strong external disturbance not explicitly encountered during training is applied, such as simulating continuous crosswinds, placing sudden obstacles or potholes on the path, or applying a sudden thrust to the bike. The dynamic robustness of the control strategy is tested by evaluating its speed and ability to recover balance from an unstable state.
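The "speed and ability to recover balance" can be quantified from a logged roll-angle trace. The following is a minimal sketch of one possible metric; the function name, the tolerance band, and the fall limit are assumptions rather than the criteria used in the source.

```python
from typing import Optional, Sequence


def recovery_time(
    roll_trace: Sequence[float],
    dt: float,
    push_step: int,
    tol: float = 0.05,
    fall_limit: float = 1.0,
) -> Optional[float]:
    """Seconds from the push until |roll| stays below tol for the rest of the trace.

    Returns None if the bicycle falls (|roll| > fall_limit) or never settles.
    Thresholds are hypothetical and in radians.
    """
    after = roll_trace[push_step:]
    if any(abs(r) > fall_limit for r in after):
        return None
    # Index of the last sample still outside the tolerance band.
    last_bad = -1
    for i, r in enumerate(after):
        if abs(r) >= tol:
            last_bad = i
    if last_bad == len(after) - 1:
        return None  # still outside the band at the end of the trace
    return (last_bad + 1) * dt
```

A shorter recovery time after the same push indicates a more robust controller, and `None` marks a failed recovery.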
[0092] Complex Task Test: On rugged and complex test terrain, the controller is given a series of rapidly changing combined commands, such as "high-speed continuous S-shaped turns" or "acceleration and steering immediately after emergency braking", to test its comprehensive performance under strong coupling of multiple targets and dynamic limits.
[0093] Through the above series of simulation verifications, it can be proved that the training method and system proposed in this invention can effectively bridge the gap between simulation and reality, and successfully train a highly robust and generalizable control strategy that can be directly deployed on a physical prototype without secondary optimization.
[0094] The effect after improvement:
[0095] By employing the domain randomization-based simulation training method and system proposed in this patent, the following fundamental benefits have been achieved in the development and performance of bicycle control strategies:
[0096] 1. It bridges the "simulation-reality gap" and enables "zero-shot" transfer and deployment from simulation to reality.
[0097] The core effect of this invention is that it addresses the "Sim-to-Real Gap" problem in existing technologies, where controller performance degrades or even fails when migrating from a simulation environment to the real physical world. By learning extensively from massive amounts of randomized dynamic parameters, initial states, and external environments during training, the controller is forced to master universal control laws independent of any specific model. This makes the "real world" merely one of countless random instances that the trained controller has already "seen" in simulation. Therefore, the controller can be deployed directly and effectively in a "zero-shot" manner, without any secondary development or parameter fine-tuning for physical prototypes, greatly improving the algorithm's practicality and reliability.
[0098] 2. The controller achieves good robustness and adaptability to dynamic changes.
[0099] Through the "domain randomization" training of this invention, the controller acquires not only static robustness but also dynamic adaptive capability. It can implicitly perceive changes in system dynamics in real time during operation, such as load changes caused by rider replacement, parameter drift due to component aging, and changes in road friction coefficient, and dynamically adjust its control strategy to achieve a new optimal trade-off between the two strongly coupled objectives of maintaining balance and task tracking. This effect solves the fundamental deficiency of traditional controllers that use fixed parameters and cannot adapt to changing operating conditions, enabling a single control strategy to operate robustly in dynamic real-world environments.
[0100] 3. A new development paradigm of "offline training - online deployment" has been established, which is efficient, low-cost, and risk-free.
[0101] This invention provides a complete, large-scale parallel simulation training system, completely revolutionizing the traditional development model that relies on repeated trial and error using physical prototypes. The entire process of developing, training, and fully validating control strategies can be automated in a virtual environment, reducing the on-site debugging cycle from weeks or even months to tens of minutes. This new paradigm not only completely eliminates the risk of physical prototype damage and high hardware costs, but also systematically solves the most difficult parameter tuning and robustness design problems in controller development through a data-driven approach, greatly reducing the R&D threshold and cycle time for high-performance motion control systems.
[0102] Constructing a parametric digital twin model of the bicycle in a physics simulation engine includes the following steps: structural decomposition and modeling, in which the digital twin model is decomposed into four core rigid body components (the frame, the handlebar and fork assembly, the front wheel, and the rear wheel); and joint definition and control, in which precise motion joints are defined between the rigid body components.
[0103] The frame is the main structure of the bicycle. The handlebar controls the bicycle's steering and is the primary component for executing steering commands. The front fork assembly connects the handlebar to the front wheel. The front wheel is the steered wheel and is unpowered, while the rear wheel is the driving wheel and executes speed commands. Decomposing the digital twin model into these four core rigid body components (the frame, the handlebar and front fork assembly, the front wheel, and the rear wheel) allows the bicycle to be controlled accurately in the simulation environment. This makes it easier to simulate the bicycle agent under a wide range of conditions, so that the neural network controller learns to control the bicycle across as many different environments and parameters as possible, which in turn facilitates the development of a high-performance neural network controller.
[0104] Furthermore, the motion joints include: the handlebar joint, which connects the frame and the handlebar assembly, is a rotatable joint that uses a position control mode, and directly receives the target steering angle command output by the neural network controller; the rear wheel joint, which connects the frame and the rear wheel, is a rotatable joint that uses a speed control mode, receives the target speed command output by the neural network controller, and serves as the drive source for the entire vehicle; and the front wheel joint, which connects the handlebar assembly and the front wheel and is a rotatable joint set to an unpowered free-rotation mode, simulating the driven characteristics of a real bicycle front wheel.
[0105] The physical parameters include: the frame mass, handlebar mass, fork assembly mass, front wheel mass, and rear wheel mass; the center of gravity position; the moment of inertia; the static and dynamic friction coefficients between the front and rear tires and the ground; and the key geometric dimensions of the frame, namely the wheelbase, fork inclination angle, and trail.
[0106] In the real world, numerous manufacturers produce countless bicycle models, and different models inevitably have different physical parameters. These parameters include the frame mass, handlebar mass, fork assembly mass, front and rear wheel masses, center of gravity position, moment of inertia, static and dynamic friction coefficients between the front and rear tires and the ground, and key frame geometric dimensions such as the wheelbase, fork inclination angle, and trail. These physical parameters determine the performance characteristics of a bicycle and how it must be controlled. A conventional neural network controller trained on a limited training set inevitably has difficulty controlling a bicycle whose parameters differ from those seen in training. In the simulation training method of this invention, the physical parameters are randomized within a reasonable range over many training rounds, so that the neural network controller encounters a sufficiently large variety of simulated bicycles. After training, the neural network controller can therefore control a previously unseen bicycle and achieve better control performance.
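The per-round parameter sampling described above can be illustrated with a minimal Python sketch. The parameter names, nominal values, and the plus or minus 20% uniform range are assumptions chosen for illustration; the specification does not give numeric ranges or a sampling distribution.

```python
import random

# Hypothetical nominal values; none of these numbers come from the patent.
NOMINAL = {
    "frame_mass_kg": 12.0,
    "handlebar_fork_mass_kg": 3.0,
    "front_wheel_mass_kg": 1.5,
    "rear_wheel_mass_kg": 2.0,
    "tire_friction": 0.8,
    "wheelbase_m": 1.05,
    "trail_m": 0.06,
}
SPREAD = 0.2  # assumed range: nominal * [1 - SPREAD, 1 + SPREAD]


def sample_parameters(rng):
    """Draw every physical parameter independently and uniformly."""
    return {
        name: rng.uniform(value * (1 - SPREAD), value * (1 + SPREAD))
        for name, value in NOMINAL.items()
    }


def sample_batch(num_envs, seed=0):
    """Return one independent parameter set per parallel digital twin."""
    rng = random.Random(seed)
    return [sample_parameters(rng) for _ in range(num_envs)]
```

Calling `sample_batch(num_envs)` at the start of each training round would give every parallel environment its own simulated bicycle; the seed argument only makes the sketch reproducible.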
[0107] Furthermore, in order to accurately reproduce in simulation the driving behavior of real-world motors, proportional-derivative (PD) controllers are configured in the physics simulation engine for the handlebar joint and the rear wheel joint of the simulation model. For the handlebar joint, which uses position control, the goal of PD parameter tuning is to respond to the target angle command quickly and accurately. The stiffness parameter P of the handlebar joint determines how quickly the joint responds to a command and how strongly it corrects errors, while the damping parameter D of the handlebar joint is used to suppress overshoot and oscillation.
[0108] For the rear wheel joint using speed control, the goal of PD parameter tuning is to stably and smoothly track the target speed command. The stiffness parameter P of the rear wheel joint mainly affects its ability to correct tracking speed errors, while the damping parameter D of the rear wheel joint is used to ensure the smoothness of the speed change process and prevent speed jitter. For the unpowered front wheel joint, in order to simulate its free rotation characteristics, the stiffness parameter P and damping parameter D of the PD controller are both set to zero.
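As an illustration of the roles of the stiffness and damping parameters, the following Python sketch implements a generic PD joint drive and applies it to a position-controlled handlebar joint and to the unpowered front wheel joint. The gain values, the joint inertia, and the `settle` helper are hypothetical and are not taken from the specification; the rear wheel drive is omitted because the text does not state its exact control law.

```python
from dataclasses import dataclass


@dataclass
class JointDrive:
    stiffness: float  # P gain, acts on the position error
    damping: float    # D gain, acts on the velocity error

    def torque(self, q, qd, q_target=0.0, qd_target=0.0):
        """Generic PD law: tau = P * (q_target - q) + D * (qd_target - qd)."""
        return (self.stiffness * (q_target - q)
                + self.damping * (qd_target - qd))


# Hypothetical gains for illustration only.
handlebar = JointDrive(stiffness=40.0, damping=2.0)   # position control
front_wheel = JointDrive(stiffness=0.0, damping=0.0)  # free rotation


def settle(drive, target, inertia=0.1, dt=0.001, steps=2000):
    """Integrate a single joint with semi-implicit Euler; return final angle."""
    q, qd = 0.0, 0.0
    for _ in range(steps):
        tau = drive.torque(q, qd, q_target=target)
        qd += tau / inertia * dt  # update velocity from the drive torque
        q += qd * dt              # then update the angle
    return q
```

With both gains set to zero the drive never produces torque, which is how the free-rotating front wheel is represented here. Raising the damping of the handlebar drive in this sketch reduces overshoot, at the cost of a slower response.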
[0109] Furthermore, initial state randomization and environment and task randomization are performed on each digital twin model, including the following steps. Initial state randomization: to improve the controller's ability to recover from various unexpected conditions, the initial state of the vehicle is randomized, including randomly setting the initial roll angle and roll rate within a small range, randomly setting the initial linear velocity, and randomly setting the initial position and orientation of the vehicle on the ground. Environment and task randomization: to ensure that the controller can adapt to diverse external environments and commands, this step includes terrain randomization and command randomization. Terrain randomization: the simulated bicycle is trained on various virtual terrains, such as flat ground, slopes of different gradients, simulated gravel roads, steps of random height, and rough grid roads, to enhance its adaptability to changes in road surface. Command randomization: in each round, randomly varying target linear velocity and target steering commands are issued to the neural network controller, and the amplitude and frequency of these commands are also randomly generated within a reasonable range, to ensure that the controller can flexibly track various complex trajectories.
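A minimal Python sketch of initial state randomization and command randomization follows. All numeric ranges, the dictionary keys, and the piecewise-constant command schedule are assumptions made for illustration; the specification only requires that the values be random within a reasonable range. Terrain randomization is not shown because it depends on the chosen simulation engine.

```python
import math
import random


def sample_initial_state(rng):
    """Randomize the starting pose and motion of one simulated bicycle."""
    return {
        "roll_rad": rng.uniform(-0.1, 0.1),         # small initial lean
        "roll_rate_rad_s": rng.uniform(-0.2, 0.2),  # small initial lean rate
        "speed_m_s": rng.uniform(1.0, 4.0),         # initial linear velocity
        "x_m": rng.uniform(-5.0, 5.0),              # position on the ground
        "y_m": rng.uniform(-5.0, 5.0),
        "yaw_rad": rng.uniform(-math.pi, math.pi),  # heading
    }


def sample_command_schedule(rng, episode_s=20.0):
    """Piecewise-constant commands as (start_time, speed_cmd, steer_cmd).

    The hold time of each command is random, so both the amplitude and
    the frequency of command changes vary from episode to episode.
    """
    t, schedule = 0.0, []
    while t < episode_s:
        speed_cmd = rng.uniform(1.0, 4.0)   # target linear velocity, m/s
        steer_cmd = rng.uniform(-0.4, 0.4)  # target steering angle, rad
        schedule.append((t, speed_cmd, steer_cmd))
        t += rng.uniform(1.0, 5.0)          # random hold time, s
    return schedule
```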
[0110] Through the "domain randomization" training of this invention, the controller acquires not only static robustness but also dynamic adaptive capability. It can implicitly perceive changes in system dynamics in real time during operation, such as load changes caused by rider changes, parameter drift caused by component aging, and changes in road friction coefficient, and dynamically adjust its control strategy to achieve a new optimal trade-off between the two strongly coupled objectives of maintaining balance and tracking tasks.
[0111] The multi-objective reward function includes: a survival reward sub-function for maintaining balance; a task tracking reward sub-function for following speed and steering commands; and a motion smoothness penalty sub-function for encouraging energy-efficient and smooth movements. Training the bicycle's control strategy against this multi-objective reward function drives the simulated bicycle agent to stay balanced for as long as possible while maximizing the total multi-objective reward, so that the controller evolves toward high performance and high robustness.
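One possible shape of such a reward is sketched below in Python. The weights, the exponential tracking term, the squared action-difference penalty, and the roll limit used to detect a fall are all assumptions; the specification names the three sub-functions but does not give their formulas.

```python
import math


def reward(roll, speed, speed_cmd, steer, steer_cmd, action, prev_action,
           roll_limit=0.6, w_alive=1.0, w_track=1.0, w_smooth=0.05):
    """Per-step reward: survival + task tracking - smoothness penalty."""
    if abs(roll) > roll_limit:
        return 0.0  # assumed convention: a fallen bicycle earns nothing

    # Survival term: constant bonus for every step the bicycle stays upright.
    alive = w_alive

    # Tracking term: equals w_track at zero error and decays as error grows.
    track_err = (speed - speed_cmd) ** 2 + (steer - steer_cmd) ** 2
    tracking = w_track * math.exp(-track_err)

    # Smoothness term: penalize large changes between consecutive actions.
    smooth_penalty = w_smooth * sum(
        (a - b) ** 2 for a, b in zip(action, prev_action)
    )
    return alive + tracking - smooth_penalty
```

The relative weights set the trade-off between staying upright, following commands, and moving smoothly, and would need to be tuned for any real training run.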
[0112] Simulation verification of the bicycle control strategy under complex operating conditions is performed on the trained neural network controller, including the following steps. Zero-shot generalization capability test: the trained neural network controller is deployed on multiple new bicycle models whose physical parameters are set outside the training randomization range but remain physically reasonable, to test whether the neural network controller can still complete the balance and tracking tasks. Strong disturbance resistance test: while the bicycle is riding stably in the simulation environment, strong external disturbances that were not explicitly present during training are applied, including simulated continuous crosswinds, obstacles or potholes that suddenly appear on the path, and an instantaneous thrust applied to the bicycle; the speed and ability of the neural network controller to recover balance from an unstable state are evaluated to verify dynamic robustness. Extreme task execution test: on rugged and complex test terrain, continuously and rapidly changing combined commands are issued to the controller to verify its comprehensive performance under strong multi-objective coupling and at the dynamic limits.
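For the zero-shot generalization test, test bicycles need parameters that lie outside the training range while remaining plausible. The Python sketch below shows one way to draw such a value. The helper name `sample_outside` and the 20% and 40% bounds are hypothetical, and treating the wider band as "physically reasonable" is an assumption.

```python
import random


def sample_outside(rng, nominal, train_spread=0.2, test_spread=0.4):
    """Draw a value whose relative deviation from nominal lies between
    train_spread and test_spread, on either side of the nominal value."""
    frac = rng.uniform(train_spread, test_spread)  # deviation magnitude
    sign = rng.choice((-1.0, 1.0))                 # lighter or heavier
    return nominal * (1.0 + sign * frac)
```

For example, with a nominal frame mass of 12 kg and a training range of 9.6 kg to 14.4 kg, this helper would return masses between 7.2 kg and 9.6 kg or between 14.4 kg and 16.8 kg.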
[0113] In one embodiment, the bicycle control strategy simulation training system based on domain randomization includes: a model building module, which constructs a parameterized digital twin model of the bicycle in a physics simulation engine, defining all physical parameters that have a significant impact on bicycle dynamics as programmable variables; a domain randomization module, which performs an independent random sampling of all physical parameters of each digital twin model at the beginning of each new training round to randomize the dynamic parameters, and performs initial state randomization and environment and task randomization on each digital twin model; a controller learning module, which constructs a multi-objective reward function to guide the neural network controller to learn, and uses a proximal policy optimization algorithm to perform large-scale parallel multi-round training on the neural network controller to obtain a trained neural network controller; and a verification and testing module, which performs simulation verification of the bicycle control strategy under complex working conditions on the trained neural network controller.
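The controller learning module relies on proximal policy optimization, for which the specification gives no hyperparameters. As a reminder of what that algorithm optimizes, the following Python sketch computes the standard clipped surrogate objective over a batch of samples. The clipping value of 0.2 is a commonly used default and an assumption here, and the sketch omits the value loss and entropy terms that full implementations usually add.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of PPO (to be maximized).

    ratios:     new-policy / old-policy action probability per sample
    advantages: advantage estimate per sample
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Keep the probability ratio within [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        # Take the more pessimistic of the raw and clipped terms.
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

The clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is one reason the algorithm is widely used for large-scale parallel training.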
[0114] The above is a detailed description of the preferred embodiments of the present invention. However, the present invention is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention. All such equivalent modifications or substitutions are included within the scope defined by the claims of the present invention.
[0115] The specific embodiments of the present invention described above do not constitute a limitation on the scope of protection of the present invention. Any other corresponding changes and modifications made in accordance with the technical concept of the present invention should be included within the scope of protection of the claims of the present invention.
Claims
1. A domain randomization based simulation training method for bicycle control strategy, characterized in that it includes the following steps: in a physics simulation engine, a parametric digital twin model of the bicycle is built, and all physical parameters that have a significant impact on the bicycle's dynamics are defined as programmable variables; at the start of each new training round, all physical parameters of each digital twin model are independently and randomly sampled to randomize the dynamic parameters, and initial state randomization and environment and task randomization are performed on each digital twin model; a multi-objective reward function is constructed to guide the learning of a neural network controller; the neural network controller is trained in large-scale parallel multi-round training using a proximal policy optimization algorithm to obtain a trained neural network controller; and the trained neural network controller is subjected to simulation verification of the bicycle control strategy under complex working conditions.
2. The simulation training method for bicycle control strategy based on domain randomization according to claim 1, characterized in that building the parametric digital twin model of the bicycle in the physics simulation engine includes the following steps: structural decomposition and modeling, in which the digital twin model is decomposed into four core rigid body components: the frame, the handlebar and fork assembly, the front wheel, and the rear wheel; and joint definition and control, in which precise motion joints are defined between the rigid body components.
3. The simulation training method for bicycle control strategy based on domain randomization according to claim 2, characterized in that the motion joints include: a handlebar joint, which connects the frame and the handlebar assembly, is a rotatable joint that uses a position control mode, and directly receives the target steering angle command output by the neural network controller; a rear wheel joint, which connects the frame and the rear wheel, is a rotatable joint that uses a speed control mode, receives the target speed command output by the neural network controller, and serves as the drive source for the entire vehicle; and a front wheel joint, which connects the handlebar assembly and the front wheel and is a rotatable joint set to an unpowered free-rotation mode, simulating the driven characteristics of a real bicycle front wheel.
4. The simulation training method for bicycle control strategy based on domain randomization according to claim 3, characterized in that the physical parameters include: the frame mass, handlebar mass, fork assembly mass, front wheel mass, and rear wheel mass; the center of gravity position; the moment of inertia; the static and dynamic friction coefficients between the front and rear tires and the ground; and the key geometric dimensions of the frame, namely the wheelbase, fork inclination angle, and trail.
5. The simulation training method for bicycle control strategy based on domain randomization according to claim 4, characterized in that, In order to accurately simulate the driving effect of motors in the real world, proportional-derivative (PD) controllers are configured for the handlebar joints and rear wheel joints in the simulation model in the physics simulation engine. For handlebar joints that use position control, the goal of PD parameter tuning is to respond to the target angle command quickly and accurately. The stiffness parameter P of the handlebar joint determines the speed of joint response to the command and the error correction force, while the damping parameter D of the handlebar joint is used to suppress overshoot and oscillation. For a rear wheel joint with speed control, the goal of PD parameter tuning is to stably and smoothly track the target speed command. The stiffness parameter P of the rear wheel joint mainly affects its ability to correct tracking speed error, while the damping parameter D of the rear wheel joint is used to ensure the smoothness of the speed change process and prevent speed jitter. For the unpowered front wheel joint, in order to simulate its free rotation characteristics, the stiffness parameter P and damping parameter D of the PD controller are both set to zero.
6. The simulation training method for bicycle control strategy based on domain randomization according to claim 1, characterized in that, For each of the digital twin models, initial state randomization and environment and task randomization are performed, including the following steps: Initial state randomization: In order to improve the controller's recovery capability under various unexpected conditions, the initial state of the vehicle is randomized, including: randomly setting the initial roll angle and roll rate within a small range, randomly setting the initial linear velocity, and randomly setting the initial position and orientation on the ground. Environment and task randomization: To ensure the controller can adapt to diverse external environments and commands, including: Terrain randomization: Allow bicycles to be trained on various virtual terrains such as flat ground, slopes of different gradients, simulated gravel roads, steps of random height, and rough grid roads to enhance their adaptability to changes in road surface. Command randomization: In each round, the target linear velocity and target steering commands are randomly varied and issued to the neural network controller. The magnitude and frequency of the commands are also randomly generated within a reasonable range to ensure that the controller can flexibly track various complex trajectories.
7. The simulation training method for bicycle control strategy based on domain randomization according to claim 1, characterized in that the multi-objective reward function includes: a survival reward sub-function, used to keep the bicycle balanced and prevent it from falling; a task tracking reward sub-function, used to follow speed and steering commands; and a motion smoothness penalty sub-function, used to encourage energy-efficient and smooth motions.
8. The simulation training method for bicycle control strategy based on domain randomization according to claim 1, characterized in that subjecting the trained neural network controller to simulation verification of the bicycle control strategy under complex operating conditions includes the following steps: a zero-shot generalization test, in which the trained neural network controller is deployed on multiple new bicycle models whose physical parameters are set outside the training randomization range but remain physically reasonable, to test whether the neural network controller can still complete the balance and tracking tasks; a strong disturbance resistance test, in which, during stable riding in the simulation environment, strong external disturbances that were not explicitly present during training are applied, including simulated continuous crosswinds, obstacles or potholes that suddenly appear on the path, and an instantaneous thrust applied to the bicycle, and dynamic robustness is verified by evaluating the speed and ability of the neural network controller to recover balance from an unstable state; and an extreme task execution test, in which, on rugged and complex test terrain, continuously and rapidly changing combined commands are issued to the controller to verify its comprehensive performance under strong multi-objective coupling and at the dynamic limits.
9. A simulation training system for bicycle control strategies based on domain randomization, characterized in that it includes: a model building module, which constructs a parametric digital twin model of the bicycle in a physics simulation engine and defines all physical parameters that have a significant impact on bicycle dynamics as programmable variables; a domain randomization module, which independently and randomly samples all physical parameters of each digital twin model at the beginning of each new training round to randomize the dynamic parameters, and performs initial state randomization and environment and task randomization on each digital twin model; a controller learning module, which constructs a multi-objective reward function to guide the learning of a neural network controller, and uses a proximal policy optimization algorithm to perform large-scale parallel multi-round training on the neural network controller to obtain a trained neural network controller; and a verification and testing module, which performs simulation verification of the bicycle control strategy under complex working conditions on the trained neural network controller.