Environmental perception method, device, and medium for a self-balancing scooter used in field operations
A visual reference image is generated from synchronously acquired multi-view images by a scene illumination correction model, and a physical property mapping network then outputs terrain mechanical features from that image. This addresses perception instability under changing illumination and unreliable cost assessment in field operations, and achieves stable environmental perception and optimized path planning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- YONGKANG TANGSHENG IND & TRADE CO LTD
- Filing Date
- 2026-03-03
- Publication Date
- 2026-06-19
AI Technical Summary
Existing environmental perception methods for self-balancing scooters are unstable under changing lighting conditions and produce unreliable cost assessments during field operations. Perception results are highly sensitive to instantaneous lighting, which causes representation drift and unstable traversable boundaries, and there is no pixel-level quantitative representation of terrain mechanical properties such as rolling resistance and slip risk. This in turn undermines the consistency and interpretability of path and control commands.
A synchronous multi-view image acquisition and scene illumination correction model is adopted. By decoupling the visual attributes and instantaneous illumination parameters of the multi-view image group, a visual reference image is generated. Combined with a pre-trained physical property mapping network, the terrain mechanical features are regressed pixel by pixel to output a dual-channel physical property map. The optimal path is planned based on the principle of minimum cost.
It achieves stable extraction of visual features under complex outdoor lighting conditions, improves the robustness and reliability of perception, enables refined assessment of terrain passability and energy consumption, and optimizes the driving efficiency and safety of mobile platforms.
Figure CN122244281A_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to a method, device and storage medium for environmental perception of a self-balancing scooter used in field operations.
Background Technology
[0002] In recent years, with the widespread application of mobile robot technology in complex environments such as field exploration, agricultural monitoring, and disaster relief, significant progress has been made in environmental perception methods for small mobile platforms such as self-balancing scooters. The core of these methods lies in using onboard visual sensors to understand unstructured terrain in real time, enabling autonomous navigation and obstacle avoidance. Early research primarily relied on monocular or stereo vision systems for obstacle detection and traversable area segmentation. Recent technological trends have shifted towards utilizing multi-view visual geometric information to construct 3D representations of scenes and combining this with deep learning models to enhance semantic understanding of complex terrains such as grasslands, sand, and gravel, thereby providing richer environmental context information for path planning.
[0003] Existing methods have some shortcomings. Illumination conditions in outdoor scenes are highly time-varying and localized. Shadows, backlighting, glare, and reflections can cause the same surface to appear differently at different times, making the perception results highly sensitive to instantaneous illumination, which in turn leads to representation drift and instability of traversable boundaries. In addition, cost assessments often remain at the level of appearance texture, geometric undulations, or coarse-grained categories, lacking pixel-level quantitative representation of terrain mechanical properties such as rolling resistance and slip risk. This results in insufficient coupling between cost maps and actual energy consumption / safety risks, and limited consistency and interpretability of paths and control commands.
Summary of the Invention
[0004] In view of the aforementioned existing problems, the present invention is proposed.
[0005] Therefore, the present invention provides an environmental perception method for self-balancing scooters used in field operations, to address perception instability under changing lighting and unreliable cost assessment.
[0006] To solve the above technical problems, the present invention provides the following technical solution. In a first aspect, the present invention provides an environmental perception method for a self-balancing scooter used in field operations, comprising: acquiring a synchronous multi-view image sequence of the current field environment; performing image preprocessing on the synchronous multi-view image sequence to generate a multi-view image group; constructing a scene illumination correction model; inputting the multi-view image group into the scene illumination correction model and generating a visual reference image by decoupling the visual attributes and instantaneous illumination parameters of the multi-view image group; inputting the visual reference image into a pre-trained physical property mapping network and regressing terrain mechanical features pixel by pixel through a deep convolutional structure to output a dual-channel physical property map, wherein the dual-channel physical property map includes the rolling resistance coefficient and slip probability at each corresponding position; calculating cost values from the dual-channel physical property map through a path planning algorithm and planning the optimal path based on the principle of minimum cost; decomposing the optimal path into continuous motion primitives; and querying a preset motion primitive library to generate a motion control command sequence.
[0007] As a preferred embodiment of the environmental perception method for self-balancing scooters used in field operations according to the present invention, the specific steps for generating the multi-view image group are as follows: raw images of the field environment are captured at the same moment according to a unified trigger signal, and a timestamp and camera number are added to each raw image; each raw image is denoised using a Gaussian filtering algorithm; and all denoised images are combined to generate the multi-view image group.
[0008] As a preferred embodiment of the environmental perception method for self-balancing scooters used in field operations according to the present invention, the specific steps for constructing the scene illumination correction model are as follows: a neural scene encoder is constructed based on a neural radiance field architecture; an illumination parameter estimator is constructed based on the principle of spherical harmonic illumination; a visual attribute decoupler is constructed based on the inversion calculation of the reflection equation; an image renderer is constructed based on a differentiable rendering mechanism; and the scene illumination correction model is constructed from the neural scene encoder, the illumination parameter estimator, the visual attribute decoupler, and the image renderer.
[0009] As a preferred embodiment of the environmental perception method for a self-balancing scooter used in field operations according to the present invention, the step of inputting the multi-view image group into the scene illumination correction model and generating a visual reference image by decoupling the visual attributes and instantaneous illumination parameters of the multi-view image group includes the following specific steps: the multi-view image group is input into the scene illumination correction model, the neural scene encoder extracts geometric appearance features, and the illumination parameter estimator parses the instantaneous illumination parameters; the visual attribute decoupler uses the instantaneous illumination parameters to decouple the geometric appearance features and outputs the visual attributes; and the image renderer renders from the visual attributes and preset standard illumination parameters to generate the visual reference image.
[0010] As a preferred embodiment of the environmental perception method for a self-balancing scooter used in field operations according to the present invention, the steps of inputting the visual reference image into the pre-trained physical property mapping network, regressing terrain mechanical features pixel by pixel through a deep convolutional structure, and outputting the dual-channel physical property map are as follows: the visual reference image is resized and its pixel values are normalized to generate a normalized image; visual features related to terrain mechanical properties are extracted from the normalized image layer by layer through the convolutional layers of the physical property mapping network to generate a high-dimensional mechanical feature map; pixel-by-pixel regression is performed on the high-dimensional mechanical feature map to output the terrain mechanical feature values at each pixel location; and numerical range constraints and spatial alignment corrections are applied to the terrain mechanical feature values to generate the dual-channel physical property map.
[0011] As a preferred embodiment of the environmental perception method for a self-balancing scooter used in field operations according to the present invention, the terrain mechanical feature values include the initial predicted values of the rolling resistance coefficient and the slip probability.
[0012] As a preferred embodiment of the environmental perception method for a self-balancing scooter used in field operations according to the present invention, the specific steps of calculating cost values from the dual-channel physical property map using a path planning algorithm and planning the optimal path based on the principle of minimum cost are as follows: the rolling resistance coefficient and slip probability at each location in the dual-channel physical property map are input into a predefined cost calculation function to generate a cost map; the path start and end points are set on the cost map, and the pixel sequence with the minimum cumulative cost connecting the start and end points is found to generate the optimal path.
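The cost-map and minimum-cost search described above can be illustrated with a minimal sketch. The source names neither the cost calculation function nor the search algorithm, so the weighted sum in `cost_map`, the weights `w_roll` and `w_slip`, and the use of Dijkstra's algorithm on a 4-connected pixel grid are all illustrative assumptions rather than the patented method.

```python
import heapq

def cost_map(rolling, slip, w_roll=1.0, w_slip=5.0):
    """Combine both channels into a per-pixel cost (assumed weighted sum)."""
    return [[w_roll * r + w_slip * s for r, s in zip(r_row, s_row)]
            for r_row, s_row in zip(rolling, slip)]

def min_cost_path(cost, start, goal):
    """Dijkstra on a 4-connected grid; entering a pixel costs cost[row][col]."""
    rows, cols = len(cost), len(cost[0])
    best = {start: cost[start[0]][start[1]]}
    parent = {}
    heap = [(best[start], start)]
    while heap:
        total, node = heapq.heappop(heap)
        if node == goal:
            break
        if total > best[node]:
            continue  # stale queue entry, a cheaper route was already found
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                cand = total + cost[nr][nc]
                if cand < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = cand
                    parent[(nr, nc)] = node
                    heapq.heappush(heap, (cand, (nr, nc)))
    # Walk the parent links back from the goal to recover the pixel sequence.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]

# Hypothetical 3x3 map with one high-resistance, slippery pixel in the center.
rolling = [[0.1, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.1]]
slip = [[0.0, 0.0, 0.0], [0.0, 0.8, 0.0], [0.0, 0.0, 0.0]]
path, total = min_cost_path(cost_map(rolling, slip), (1, 0), (1, 2))
```

In this example the planner routes around the center pixel, because five low-cost pixels are cheaper in total than the direct route through the slippery one.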
[0013] As a preferred embodiment of the environmental perception method for a self-balancing scooter used in field operations according to the present invention, the specific steps of decomposing the optimal path into continuous motion primitives and querying a preset motion primitive library to generate a motion control command sequence are as follows: the pixel sequence of the optimal path is subjected to kinematic analysis and trajectory smoothing, and is then converted into a continuous motion primitive sequence; the preset motion primitive library is queried, each motion primitive in the sequence is mapped to its corresponding motion control command parameters, and these are integrated into a motion control command sequence.
[0014] In a second aspect, the present invention provides a computer device including a memory and a processor, wherein the memory stores a computer program, wherein the computer program, when executed by the processor, implements any step of the environmental perception method for a self-balancing scooter for field operations as described in the first aspect of the present invention.
[0015] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements any step of the environmental perception method for a self-balancing scooter for field operations as described in the first aspect of the present invention.
[0016] The beneficial effects of this invention are as follows: By constructing a scene illumination correction model, a visual reference image is generated by inputting a group of multi-view images, overcoming the representation drift problem caused by drastic changes in outdoor lighting, achieving stable extraction of the essential visual features of the scene, and improving the robustness and reliability of perception; by performing pixel-by-pixel regression on the visual reference image through a pre-trained physical property mapping network, a dual-channel physical property map containing rolling resistance coefficient and slip probability is output, enabling path planning to perform refined cost-value assessment and optimal path generation based on physical indicators such as terrain passability, energy consumption and slip risk, thereby optimizing the driving efficiency and energy consumption of the mobile platform while ensuring navigation safety.
Attached Figure Description
[0017] To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the following description of the embodiments are briefly introduced below. The drawings described below illustrate only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
[0018] Figure 1 is a flowchart of the environmental perception method for a self-balancing scooter used in field operations.
[0019] Figure 2 is a flowchart for generating a visual reference image.
[0020] Figure 3 is a flowchart for generating a dual-channel physical property map.
[0021] Figure 4 is a flowchart for generating a motion control command sequence.
Detailed Implementation
[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
[0023] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.
[0024] Furthermore, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor does it refer to separate or alternative embodiments that are mutually exclusive with other embodiments.
[0025] Referring to Figures 1-4, one embodiment of the present invention provides an environmental perception method for a self-balancing scooter used in field operations, comprising the following steps:
S1. Acquire a synchronous multi-view image sequence of the current field environment, and perform image preprocessing on the synchronous multi-view image sequence to generate a multi-view image group.
[0026] S1.1 Multiple cameras acquire raw images of the field environment at the same time according to a unified trigger signal, and add a timestamp and camera number to each raw image.
[0027] It should be noted that the trigger signal is simultaneously sent to a group of cameras deployed on the self-balancing scooter via a parallel bus; all cameras start upon receiving the trigger signal, ensuring that a set of raw images of the outdoor environment are captured at the same time; a high-precision rubidium atomic clock is used as the reference clock source to add a Coordinated Universal Time (UTC) timestamp accurate to the microsecond level to each successfully captured raw image; the camera's device identifier is embedded as metadata in the raw image file header as a camera number; each camera number corresponds to a captured raw image, and the timestamp of each raw image records the shooting time, ensuring the temporal and spatial consistency of all raw image data.
[0028] S1.2. Denoise each original image using a Gaussian filtering algorithm, and integrate all the denoised original images to generate a multi-view image group.
[0029] It should be noted that the original image is decoded to obtain an array of 8-bit unsigned integer RGB values for each pixel. A floating-point array of the same size as the original image is created, and each integer RGB value divided by 255 is used as the corresponding floating-point value; the floating-point values of all pixels are filled into the array to form a floating-point pixel matrix. A sliding-window convolution is then performed on the floating-point pixel matrix using a Gaussian convolution kernel: for each pixel, a weighted average is calculated over all pixels in its neighborhood, which has the same size as the Gaussian convolution kernel, and the floating-point value of the center pixel is replaced with that weighted average. Applying this Gaussian convolution to the entire floating-point pixel matrix smooths all pixels and yields a convolution-smoothed image, which completes the noise reduction process.
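A minimal single-channel sketch of this denoising step follows. The kernel size, the sigma value, and the border policy (reusing the nearest valid pixel) are not stated in the source and are assumptions here; a real pipeline would apply the same operation to each RGB channel.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized size x size Gaussian kernel (size must be odd)."""
    half = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]

def to_float(channel_u8):
    """Map 8-bit values to floats in [0, 1] by dividing by 255."""
    return [[v / 255.0 for v in row] for row in channel_u8]

def gaussian_smooth(channel, kernel):
    """Replace each pixel with the kernel-weighted average of its neighborhood.

    Border pixels reuse the nearest valid pixel (an assumed edge policy).
    """
    rows, cols = len(channel), len(channel[0])
    half = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr = min(max(r + dy, 0), rows - 1)
                    cc = min(max(c + dx, 0), cols - 1)
                    acc += kernel[dy + half][dx + half] * channel[rr][cc]
            out[r][c] = acc
    return out

# Hypothetical 5x5 channel: a flat patch, and a single bright pixel.
kernel = gaussian_kernel(3, 1.0)
flat = gaussian_smooth(to_float([[128] * 5 for _ in range(5)]), kernel)
spike_u8 = [[0] * 5 for _ in range(5)]
spike_u8[2][2] = 255
spike = gaussian_smooth(to_float(spike_u8), kernel)
```

A flat patch passes through unchanged because the kernel weights sum to one, while the isolated bright pixel is spread over its neighbors.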
[0030] An empty data structure is created as a container for the multi-view image group; all denoised convolutionally smoothed images and their associated information, including timestamps and camera numbers, are read; all convolutionally smoothed images are grouped according to their timestamps, and images with the same timestamp are grouped into the same set to represent environmental information collected from different perspectives at the same time; for convolutionally smoothed images within the same timestamp set, they are sorted in ascending order according to their camera numbers to form an image list with a fixed viewpoint order; the image list grouped by timestamp and sorted by camera number is then encapsulated into a multi-view image group.
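The grouping and ordering just described can be sketched as follows. The `Frame` record and its field names are hypothetical; the source only requires that each smoothed image carry a timestamp and a camera number.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp_us: int   # UTC timestamp in microseconds
    camera_id: int      # camera number taken from the image metadata
    pixels: object      # smoothed image data (opaque in this sketch)

def build_multi_view_groups(frames):
    """Group frames by timestamp, then sort each group by camera number."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame.timestamp_us].append(frame)
    return {ts: sorted(group, key=lambda f: f.camera_id)
            for ts, group in sorted(groups.items())}

# Hypothetical capture: three cameras at t=100, two cameras at t=200.
frames = [Frame(200, 1, None), Frame(100, 2, None), Frame(100, 0, None),
          Frame(200, 0, None), Frame(100, 1, None)]
groups = build_multi_view_groups(frames)
```

Sorting by camera number gives every timestamp the same viewpoint order, which later stages can rely on when indexing views.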
[0031] S2. Construct a scene lighting correction model. Input the multi-view image group into the scene lighting correction model. By decoupling the visual attributes and instantaneous lighting parameters of the multi-view image group, a visual reference image is generated.
[0032] S2.1. Construct a neural scene encoder based on the neural radiance field architecture, construct a lighting parameter estimator based on the principle of spherical harmonic lighting, construct a visual attribute decoupler based on the reflection equation inversion calculation, and construct an image renderer based on the differentiable rendering mechanism.
[0033] It should be noted that the neural radiance field architecture represents the lighting and visual attributes of a 3D scene by encoding the input spatial coordinates and viewing direction, processing them through a multilayer perceptron network, and combining the result with a specific output structure (such as a dual-branch output). It comprises a high-frequency positional encoding structure, a multilayer perceptron network structure, a skip connection structure, and a dual-branch output structure, and is used as the structure of the neural scene encoder. In the high-frequency positional encoding structure, the frequency sequence and the order of the corresponding sine terms, cosine terms, and their concatenation are fixed to form a high-dimensional encoding transformation for multi-frequency representation. In the multilayer perceptron network structure, multiple fully connected layers are configured in sequence, with nonlinear activation functions between adjacent fully connected layers, to form a continuously stacked deep representation backbone. In the skip connection structure, the skip connection insertion layer number is specified in advance and the channel concatenation rule is set, so that the high-dimensional encoding transformation is concatenated with the output feature vector of the L-th fully connected layer and then fed into the subsequent fully connected layer, where L is the fully connected layer number corresponding to the skip connection insertion layer number. In the dual-branch output structure, the deep representation backbone is split into a volume density branch and a view-dependent radiance branch, each configured with its own fully connected layer and output dimension. The ReLU activation function is configured for the volume density branch to keep the output non-negative, and the Sigmoid activation function is configured for the view-dependent radiance branch to constrain the output range. The high-frequency positional encoding structure, multilayer perceptron network structure, skip connection structure, volume density branch, and view-dependent radiance branch are encapsulated into a single component to complete the construction of the neural scene encoder.
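The high-frequency encoding can be illustrated with a short sketch. The source only says that the frequency sequence and the sine/cosine ordering are fixed; the frequencies 2^k * pi and the "all sines, then all cosines" ordering used here follow a common neural radiance field convention and are assumptions, as is the parameter name `num_freqs`.

```python
import math

def positional_encoding(coords, num_freqs=4):
    """Encode coordinates with sin/cos at frequencies 2^k * pi, k = 0..num_freqs-1.

    Per frequency, all sine terms come first, then all cosine terms
    (an assumed but fixed concatenation order).
    """
    encoded = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        encoded.extend(math.sin(freq * x) for x in coords)
        encoded.extend(math.cos(freq * x) for x in coords)
    return encoded

# A 3D point maps to 3 coordinates * 2 functions * 4 frequencies = 24 values.
features = positional_encoding([0.1, 0.2, 0.3], num_freqs=4)
```

Under this layout, the skip connection described above would concatenate `features` with the output vector of the L-th fully connected layer before the next layer.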
[0034] Based on the principle of spherical harmonic illumination, the order of the spherical harmonic function is defined, and the set and index order of the spherical harmonic basis functions are fixed accordingly. The set of spherical harmonic coefficients corresponding to the set of spherical harmonic basis functions is also defined, forming a parameterized representation of the instantaneous illumination parameters. A fully connected neural network is constructed whose input dimension matches the dimension of the scene representation information output by the neural scene encoder and whose output dimension equals the number of spherical harmonic coefficients corresponding to the order of the spherical harmonic function. Several hidden layers using the rectified linear unit (ReLU) activation function are configured within the fully connected neural network, and a linear activation function is used in the output layer to regress the set of spherical harmonic coefficients directly, thus completing the construction of the illumination parameter estimator.
[0035] Reflection equation inversion calculation refers to solving in reverse the physical reflection relationship that describes the interaction between light and the surface of an object. Its core is to decompose the apparent color of the observed image into intrinsic visual attributes that are independent of illumination. Based on reflection equation inversion calculation, a convolutional computation graph with an encoder-decoder structure is defined. The encoder part is configured with several downsampling convolutional layers to form a progressively compressed feature extraction path, while the decoder part is configured with several upsampling transposed convolutional layers to form a progressively restored feature reconstruction path. Skip connections are configured between the encoder and decoder parts to fuse multi-level features. The scene representation information output by the neural scene encoder is organized into feature map form according to pixel position. The number of input channels of the convolutional computation graph is set to match the channel dimension of the feature map, and the number of output channels is set to the number of visual attributes to be decoupled. The visual attributes include at least diffuse albedo and surface normal vectors, and the rectified linear unit (ReLU) activation function is selected for the convolutional computation graph. At the output of the convolutional computation graph, a sigmoid activation function is configured for diffuse albedo to constrain its range, and normalization is configured for surface normal vectors to maintain unit length, thus completing the construction of the visual attribute decoupler.
[0036] Based on a differentiable rendering mechanism, a volume rendering integration process is defined and solidified into a callable computational graph structure. The volume rendering integration process includes a virtual camera ray emission structure, a ray sampling point sequence generation structure, a local lighting calculation structure, and a differentiable numerical integration accumulation structure. The virtual camera is determined by a preset set of virtual camera parameters, which includes at least the virtual camera position, virtual camera orientation, and imaging plane parameters. The organization of the sampling point sequence is fixed in the ray sampling point sequence generation structure. The calculation method of the combination of diffuse albedo, surface normal vector, and preset standard lighting parameters is fixed in the local lighting calculation structure. The color contribution accumulation rule and the transparency contribution accumulation rule are fixed in the differentiable numerical integration accumulation structure. The transparency contribution is determined by the volume density branch output of the neural scene encoder and participates in the volume rendering integration. Each structure of the volume rendering integration process is encapsulated into an image renderer, thus completing the construction of the image renderer.
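The accumulation along one camera ray can be sketched as below. The source fixes a color accumulation rule and a transparency accumulation rule without stating them; the quadrature used here (alpha from density and sample spacing, multiplied by the accumulated transmittance) is the form commonly used in neural radiance field rendering and is an assumption. The per-sample colors are taken as already computed by the local lighting calculation from albedo, normal, and standard lighting.

```python
import math

def composite_ray(densities, colors, deltas):
    """Accumulate color along one ray.

    alpha_i = 1 - exp(-density_i * delta_i)
    T_i     = prod_{j < i} (1 - alpha_j)   (transmittance before sample i)
    pixel   = sum_i T_i * alpha_i * color_i
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

red, green = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
# Empty space: nothing is accumulated and the ray stays fully transparent.
empty_pixel, empty_t = composite_ray([0.0, 0.0], [red, green], [1.0, 1.0])
# An opaque first sample hides everything behind it.
solid_pixel, solid_t = composite_ray([1e6, 1.0], [red, green], [1.0, 1.0])
```

Every operation in the loop is differentiable with respect to the densities and colors, which is what allows gradients to flow back through the renderer during training.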
[0037] S2.2 Construct a scene lighting correction model using a neural scene encoder, a lighting parameter estimator, a visual attribute decoupler, and an image renderer.
[0038] It should be noted that the output of the neural scene encoder is connected to the input of the lighting parameter estimator, and the output of the neural scene encoder is also connected in parallel to the first input of the visual attribute decoupler; the output of the lighting parameter estimator is connected to the second input of the visual attribute decoupler; the output of the visual attribute decoupler is finally connected to the first input of the image renderer, and the second input of the image renderer is connected to the preset standard lighting parameters; by integrating the four components into a unified whole with sequential and branching data flows, the scene lighting correction model is constructed.
[0039] During the training phase of the scene illumination correction model, a scene sample list is established, and a scene number is assigned to each typical outdoor surface scene. Pose point numbers are assigned to the observation position and orientation of the self-balancing scooter in each typical outdoor surface scene to fix the viewing angle conditions. Preset standard illumination parameters are determined, and the corresponding standard illumination configuration is fixed as a parameter set that can be written directly to the second input of the image renderer. The installation relationship of the multiple cameras is fixed and numbered, and a camera parameter set bound to the camera numbers is generated to ensure the reproducibility of multi-view acquisition. For each "scene number - pose point number" pair, a synchronous multi-view image sequence is acquired under the standard illumination configuration corresponding to the preset standard illumination parameters, and the sequence is grouped by camera number to form a visual reference image group. With the scene number and pose point number unchanged, the illumination is switched to multiple instantaneous illumination states, and a synchronous multi-view image sequence is collected in each state to form a multi-view image group. According to scene number, pose point number, and camera number, each multi-view image group captured under an instantaneous illumination state is paired with the corresponding visual reference image group and bound as a training sample. The identifier of the preset standard illumination parameters and the identifier of the camera parameter set are also recorded in each training sample. The training samples are screened for sharpness, exposure saturation, missing viewpoints, and synchronization consistency, and unqualified samples are removed to obtain the training sample set.
[0040] The learnable parameters of the neural scene encoder, illumination parameter estimator, visual attribute decoupler, and image renderer are initialized, and the learning rate and parameter update rules are set. The Xavier method can be used for parameter initialization, with the learning rate set to 0.001 and the batch size set to 32, and the Adam optimizer is used for parameter updates. Training samples are read from the training sample set in batches. The multi-view image group of each training sample is input into the scene illumination correction model to complete the forward calculation and obtain the instantaneous illumination parameters, the visual attributes, and the generated visual reference image. The visual reference image group bound to the same training sample is read as the supervision target, and the visual reference consistency error is calculated; this error measures the pixel-level difference between the generated visual reference image and the corresponding image in the visual reference image group. Gradient backpropagation is performed on the visual reference consistency error to obtain gradients for the parameters of the neural scene encoder, lighting parameter estimator, visual attribute decoupler, and image renderer, and a parameter update is then performed according to the parameter update rules. This process of batch reading, forward computation, error calculation, gradient backpropagation, and parameter update is repeated until a preset maximum number of iterations is reached. The learnable parameters at that point are fixed as the scene lighting correction model parameters that map multi-view image groups to the visual reference image, completing the training of the scene lighting correction model. The visual reference consistency error is calculated as

$$E = \frac{1}{V\,|\Omega|} \sum_{v=1}^{V} \sum_{p \in \Omega} \left\lVert \hat{I}_v(p) - I_v(p) \right\rVert$$

where $E$ is the visual reference consistency error; $V$ is the number of viewpoints contained in a training sample, i.e., the number of viewpoints / cameras in a multi-view image group; $v$ is the view index, with values ranging from 1 to $V$; $\Omega$ is the set of pixel positions in an image that are involved in the error calculation; $|\Omega|$ is the number of elements in $\Omega$, i.e., the total number of pixel positions involved in the calculation; $p$ is the pixel position index; $\hat{I}_v(p)$ is the pixel value vector of the generated visual reference image at view $v$ and pixel position $p$; and $I_v(p)$ is the pixel value vector of the supervising visual reference image at view $v$ and pixel position $p$. The symbols and the expression are reconstructed from these definitions, since the original notation is not preserved in this text, and $\lVert\cdot\rVert$ denotes the vector norm of the per-pixel difference, whose specific type is likewise not preserved.
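The error described above is an average of per-pixel differences over all views and all pixel positions. The sketch below computes such an average under one loud assumption: it uses the Euclidean distance between pixel value vectors, which the text does not confirm. The nested-list layout `[view][pixel] -> (r, g, b)` and the function name are also illustrative.

```python
import math

def reference_consistency_error(rendered, target):
    """Mean per-pixel Euclidean distance over all views and pixel positions.

    rendered, target: [view][pixel] -> (r, g, b), with identical shapes.
    The Euclidean norm is an assumption; the source does not name the norm.
    """
    total, count = 0.0, 0
    for view_r, view_t in zip(rendered, target):
        for px_r, px_t in zip(view_r, view_t):
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(px_r, px_t)))
            count += 1
    return total / count

# Two views with two pixels each; only one pixel differs, by the vector (3, 4, 0).
target = [[(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)], [(1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]]
rendered = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [(1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]]
error = reference_consistency_error(rendered, target)
```

Here the single mismatched pixel contributes a distance of 5, which is averaged over the four pixel positions across both views.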
[0041] Standard lighting parameters are defined as a set of spherical harmonic coefficients stored separately for each color channel, and the storage order of the spherical harmonic coefficients is strictly consistent with the arrangement order of the spherical harmonic basis functions. Standard uniform ambient light is selected as the standard lighting configuration: the spherical harmonic coefficients retain only the lowest-order component, which represents the overall uniform brightness, while all other components, which describe directional variation, are set to zero, so that the standard uniform ambient light carries no bias toward any specific direction. The spherical harmonic coefficients are written into the standard lighting parameter storage item and the corresponding parameter identifier is generated to complete the preset of the standard lighting parameters. Physically, the scene lighting is adjusted through controlled diffuse lighting to a state where the intensity is consistent in all directions and there is no principal light direction. With the camera parameter set and acquisition position fixed, a diffuse reflection calibration board is imaged to check that the image brightness shows no directional gradient; the lighting configuration at that point is taken as the standard uniform ambient light.
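The effect of keeping only the lowest-order coefficient can be checked numerically. The sketch below assumes a degree-2 expansion (9 coefficients per color channel) and one common real-valued spherical harmonic convention; the source states neither the order nor the basis convention, and the function names are hypothetical.

```python
SH_DEGREE = 2                       # assumed maximum degree
SH_COUNT = (SH_DEGREE + 1) ** 2     # 9 coefficients per color channel

def sh_basis(direction):
    """Real spherical harmonic basis up to degree 2 for a unit vector (x, y, z)."""
    x, y, z = direction
    return [
        0.282095,                                   # l = 0 (constant term)
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # l = 1
        1.092548 * x * y, 1.092548 * y * z,         # l = 2
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ]

def uniform_ambient_coeffs(brightness):
    """Per-channel coefficients with only the lowest-order term non-zero."""
    return [[brightness] + [0.0] * (SH_COUNT - 1) for _ in range(3)]

def evaluate_lighting(coeffs, direction):
    """Evaluate each color channel's lighting for the given direction."""
    basis = sh_basis(direction)
    return [sum(c * b for c, b in zip(channel, basis)) for channel in coeffs]

standard = uniform_ambient_coeffs(1.0)
up = evaluate_lighting(standard, (0.0, 0.0, 1.0))
side = evaluate_lighting(standard, (1.0, 0.0, 0.0))
```

Because every directional coefficient is zero, the evaluated lighting is identical for any direction, which matches the requirement that the standard uniform ambient light has no principal light direction.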
[0042] The maximum number of iterations is defined from the number of complete traversals of the training sample set, as follows: the training sample set is divided into a training subset and a validation subset such that every scene category is covered; the number of training samples used for a single parameter update is preset as the batch size, and the total number of training samples in the training subset is divided by the batch size and rounded up to obtain the number of batches per traversal; with the visual reference consistency error on the validation subset as the criterion, several complete traversals are performed and the validation error is recorded after each traversal; the traversal number corresponding to the minimum validation error is located; and the cumulative number of parameter updates up to that traversal number is used as the maximum number of iterations.
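The rule above reduces to a short calculation, sketched below under the assumption that the validation errors are available as a list ordered by traversal; the function name `max_iterations` is hypothetical.

```python
import math
from typing import Sequence


def max_iterations(n_train: int, batch_size: int,
                   val_errors: Sequence[float]) -> int:
    """Parameter updates accumulated up to the traversal with lowest validation error."""
    batches_per_traversal = math.ceil(n_train / batch_size)
    # Traversals are numbered from 1; the first minimum wins on ties.
    best_traversal = min(range(len(val_errors)), key=val_errors.__getitem__) + 1
    return batches_per_traversal * best_traversal
```

For example, with 1000 training samples, a batch size of 32, and the lowest validation error after the third traversal, the maximum number of iterations is 32 × 3 = 96.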
[0043] S2.3 Input the multi-view image group into the scene lighting correction model, the neural scene encoder extracts geometric appearance features, and the lighting parameter estimator parses instantaneous lighting parameters.
[0044] It should be noted that the images corresponding to the same moment are selected from the multi-view image group according to their timestamps and sorted by viewpoint according to camera number, forming a multi-view image set with ordered viewpoints. The camera parameter set bound to each camera number is read; based on it, each pixel position in the multi-view image set is converted into a pixel ray, and a sequence of sampling points is generated along each pixel ray. The pixel ray sampling points of each viewpoint are input into the scene illumination correction model, and the neural scene encoder performs the encoding calculation on the multi-view image set and outputs the scene representation information. The scene representation information includes a volume density array and a view-dependent radiation array, both organized by viewpoint index, pixel position index, and sampling point index; together they form the geometric appearance features.
[0045] After receiving the geometric appearance features, the illumination parameter estimator averages the volume density array over the sampling point index dimension and the pixel position dimension to obtain a density statistics vector, and averages the view-dependent radiation array over the same two dimensions to obtain a radiation statistics vector. The two vectors are concatenated to form a scene description vector. The scene description vector is passed through several fully connected layers inside the illumination parameter estimator, each followed by a rectified linear (ReLU) activation function. The last fully connected layer uses a linear activation and outputs a vector whose length equals the number of spherical harmonic coefficients. The output vector is rearranged according to the index order of the spherical harmonic basis functions and organized by color channel to form a set of spherical harmonic coefficients, which constitutes the instantaneous illumination parameters.
[0046] The expressions for the encoding calculation are $\mathcal{S} = \{(\sigma_{v,p,k}, \mathbf{c}_{v,p,k})\}$ and $(\sigma_{v,p,k}, \mathbf{c}_{v,p,k}) = F_{\Theta}(\gamma(\mathbf{x}_{v,p,k}), \mathbf{d}_{v,p})$; where $\mathcal{S}$ denotes the scene representation information; $\{\cdot\}$ denotes a set, and $(\sigma_{v,p,k}, \mathbf{c}_{v,p,k})$ denotes one element of the set; $\sigma_{v,p,k}$ denotes the volume density (a scalar) output at the $k$-th sampling point of pixel position $p$ in view $v$; $\mathbf{c}_{v,p,k}$ denotes the view-dependent radiation output vector at the same sampling point; $k$ denotes the sampling point index, ranging from 1 to $K$; $K$ denotes the number of sampling points on each pixel ray; $F$ denotes the function mapping of the neural scene encoder; $\Theta$ denotes the parameter set of the neural scene encoder; $\gamma$ denotes the high-frequency position coding function; $\mathbf{x}_{v,p,k}$ denotes the spatial position vector of the $k$-th sampling point of pixel position $p$ in view $v$; and $\mathbf{d}_{v,p}$ denotes the observation direction vector at pixel position $p$ of view $v$.
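The high-frequency position coding function is not spelled out in this paragraph. One common choice, assumed here purely for illustration, is the sinusoidal encoding used in neural radiance field models, in which each coordinate is expanded into sine and cosine terms at doubling frequencies. The function name and the number of frequencies are hypothetical.

```python
import math
from typing import List, Sequence


def positional_encoding(x: Sequence[float], num_freqs: int = 4) -> List[float]:
    """Expand each coordinate into sin/cos pairs at frequencies 2^j * pi.

    This sinusoidal form is an assumption; the embodiment only states that
    a high-frequency position coding function is applied.
    """
    encoded = []
    for coord in x:
        for j in range(num_freqs):
            angle = (2.0 ** j) * math.pi * coord
            encoded.append(math.sin(angle))
            encoded.append(math.cos(angle))
    return encoded
```

A three-dimensional sampling point position with four frequencies yields a 24-element vector, which would then be passed to the neural scene encoder together with the observation direction.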
[0047] S2.4 The visual attribute decoupler uses instantaneous lighting parameters to decouple geometric appearance features and outputs visual attributes; the image renderer renders based on visual attributes and preset standard lighting parameters to generate a visual reference image.
[0048] It should be noted that the volume density array, the view-dependent radiation array, and the spherical harmonic coefficient set are aligned by view index, pixel position index, and sampling point index and then input into the visual attribute decoupler. The visual attribute decoupler uses the instantaneous illumination parameters represented by the spherical harmonic coefficient set to decompose the view-dependent radiation array and outputs visual attributes that are decoupled from illumination changes. The visual attributes include at least the diffuse albedo and the surface normal vector. The diffuse albedo is output as a pixel-level attribute within a bounded range, and the surface normal vector is output as a pixel-level attribute of unit length. The diffuse albedo and surface normal vector are organized in the same index order as the volume density array to form the visual attribute result.
[0049] The visual attribute result and the volume density array are supplied to the first input of the image renderer as the rendering basis. The preset standard lighting parameters are supplied to the second input of the image renderer as the standard lighting configuration. For the pixel rays of each viewpoint, the image renderer performs volume rendering and accumulation along the sampling point index. The color contribution of each sampling point is calculated from the diffuse albedo, the surface normal vector, and the preset standard lighting parameters, and the transparency contribution of each sampling point is calculated from the volume density array. The pixel value under standard lighting at each pixel position is generated according to the cumulative synthesis rule of color contribution and transparency contribution, forming the visual reference image corresponding to the viewpoint index.
[0050] The expression for calculating the color contribution of each sampling point is $\mathbf{c}^{\mathrm{std}}_{v,p,k} = \boldsymbol{\rho}_{v,p,k}\, E(\mathbf{n}_{v,p,k}; L^{\mathrm{std}})$, with $E(\mathbf{n}_{v,p,k}; L^{\mathrm{std}}) = \sum_{l}\sum_{m} L^{\mathrm{std}}_{l,m}\, Y_{l,m}(\mathbf{n}_{v,p,k})$; where $\mathbf{c}^{\mathrm{std}}_{v,p,k}$ denotes the color contribution at the $k$-th sampling point of pixel position $p$ in view $v$; the superscript $\mathrm{std}$ indicates the lighting condition under the preset standard lighting parameters; $\boldsymbol{\rho}_{v,p,k}$ denotes the diffuse albedo vector at that sampling point; $\mathbf{n}_{v,p,k}$ denotes the surface normal vector at that sampling point; $L^{\mathrm{std}}$ denotes the set of spherical harmonic coefficients corresponding to the preset standard lighting parameters; $E(\mathbf{n}_{v,p,k}; L^{\mathrm{std}})$ denotes the illumination intensity term calculated from the set of spherical harmonic coefficients at the surface normal vector; $Y_{l,m}$ denotes the spherical harmonic basis function; $l$ denotes the order index of the spherical harmonic function; and $m$ denotes the sequence number index within order $l$.
[0051] The expression for calculating the transparency contribution of each sampling point is $\alpha_{v,p,k} = 1 - \exp(-\sigma_{v,p,k}\,\delta_{v,p,k})$; where $\alpha_{v,p,k}$ denotes the transparency contribution at the $k$-th sampling point of pixel position $p$ in view $v$; $\sigma_{v,p,k}$ denotes the volume density at that sampling point; and $\delta_{v,p,k}$ denotes the sampling interval step size at that sampling point.
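The rendering of one pixel ray under standard lighting can be sketched as follows. The sketch makes several assumptions that are not stated in this embodiment: spherical harmonics are truncated at first order (four coefficients per color channel, ordered Y00, Y1-1, Y10, Y11), the transparency contribution takes the exponential form customary in volume rendering, and the cumulative synthesis rule is front-to-back compositing. All function names are hypothetical.

```python
import math
from typing import List, Sequence

SH_C0 = 0.282095   # constant value of the order-0 real basis function
SH_C1 = 0.488603   # scale of the three order-1 real basis functions


def sh_intensity(normal: Sequence[float], coeffs: Sequence[float]) -> float:
    """Illumination intensity at a unit normal for one color channel."""
    x, y, z = normal
    basis = (SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x)
    return sum(c * b for c, b in zip(coeffs, basis))


def render_pixel(densities: Sequence[float], steps: Sequence[float],
                 albedos: Sequence[Sequence[float]],
                 normals: Sequence[Sequence[float]],
                 std_coeffs: Sequence[Sequence[float]]) -> List[float]:
    """Composite the sampling points of one pixel ray, nearest sample first."""
    pixel = [0.0] * len(std_coeffs)
    transmittance = 1.0   # fraction of light not yet absorbed along the ray
    for sigma, delta, rho, n in zip(densities, steps, albedos, normals):
        alpha = 1.0 - math.exp(-sigma * delta)   # transparency contribution
        for ch, coeffs in enumerate(std_coeffs):
            color = rho[ch] * sh_intensity(n, coeffs)   # color contribution
            pixel[ch] += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel
```

With a single fully opaque sample, the pixel value equals the albedo multiplied by the standard illumination intensity; with zero density along the whole ray, the pixel value is zero.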
[0052] It should also be noted that existing technologies mostly mitigate the impact of strong light and shadow on images through global or local illumination compensation such as brightness equalization and color correction. However, these technologies struggle to achieve a consistent and directly comparable scene appearance across different times and viewing angles, and still easily lead to fluctuations in the identification of passable areas and the judgment of risk boundaries with instantaneous changes in illumination. This solution generates a visual reference image through a scene illumination correction model, enabling the same terrain surface to be uniformly mapped to a stable and consistent visual representation across multiple times and viewing angles. This reduces the interference of illumination changes on environmental perception results, improves the reliability and robustness of subsequent terrain property estimation and path planning, and thus solves the problems of easy drift and difficulty in alignment of environmental perception under complex outdoor lighting conditions.
[0053] S3. Input the visual reference image into the pre-trained property mapping network, and regress the terrain mechanical features pixel by pixel through a deep convolutional structure to output a dual-channel property map.
[0054] S3.1. Perform size adjustment and pixel value normalization on the visual reference image to generate a normalized image.
[0055] It should be noted that the current resolution and channel format of the visual reference image are acquired, the target size required by the input of the material property mapping network is determined, and the visual reference image is resampled to the target size while the spatial correspondence of each pixel position is maintained. Specifically, the width and height of the visual reference image are read and the target width and target height required by the material property mapping network are determined. Each pixel position of the target grid is traversed, and a pixel-center-aligned coordinate mapping rule maps it to a continuous coordinate position in the visual reference image; the horizontal mapping ratio is the ratio of the width of the visual reference image to the target width, and the vertical mapping ratio is the ratio of the height of the visual reference image to the target height. At each continuous coordinate position, the four adjacent integer pixel positions in the visual reference image (top left, top right, bottom left, and bottom right) are located and their pixel values in each color channel are read. The four pixel values are weighted and summed according to the distance weights from the continuous coordinate position to the four integer pixel positions, giving the resampling result for the current pixel position. When the continuous coordinate position falls outside the boundary of the visual reference image, it is clamped to the boundary according to the boundary truncation rule. The resampling results of all pixel positions are combined to obtain the resized visual reference image. The pixel values of each color channel of the resized visual reference image are then read pixel by pixel, converted to floating-point values, and divided by the theoretical maximum pixel value of the visual reference image (usually 255), so that the pixel values of all color channels are mapped to the range 0 to 1, resulting in the normalized image.
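A minimal sketch of the resampling and normalization described above is given below for a single color channel stored as a list of rows; a multi-channel image would apply the same steps per channel. The function name `resize_and_normalize` is hypothetical.

```python
from typing import List, Sequence


def resize_and_normalize(image: Sequence[Sequence[float]], target_w: int,
                         target_h: int, max_value: float = 255.0) -> List[List[float]]:
    """Bilinear resize with pixel-center alignment, then scale to [0, 1]."""
    src_h, src_w = len(image), len(image[0])
    sx, sy = src_w / target_w, src_h / target_h   # mapping ratios
    out = []
    for row in range(target_h):
        # Pixel-center mapping, clamped to the image boundary.
        y = min(max((row + 0.5) * sy - 0.5, 0.0), src_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, src_h - 1)
        wy = y - y0
        out_row = []
        for col in range(target_w):
            x = min(max((col + 0.5) * sx - 0.5, 0.0), src_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, src_w - 1)
            wx = x - x0
            # Weighted sum of the four neighbouring pixels.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            out_row.append((top * (1 - wy) + bottom * wy) / max_value)
        out.append(out_row)
    return out
```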
[0056] S3.2. Visual features related to topographic mechanical properties in the normalized image are extracted step by step through the convolutional layer group of the material property mapping network to generate a high-dimensional mechanical feature map. Pixel-by-pixel regression calculation is performed on the high-dimensional mechanical feature map to output the topographic mechanical feature value at each pixel position.
[0057] It should be noted that the material property mapping network is pre-trained as follows. Under various typical field surface scenarios, reference synchronous multi-view image sequences are collected and preprocessed to generate reference multi-view image sets. The reference multi-view image sets are input into the scene illumination correction model for visual attribute decoupling and instantaneous illumination parameter rendering to generate reference visual baseline images. Each reference visual baseline image is resized and its pixel values are normalized to obtain a reference normalized image. For each reference normalized image, the reference terrain mechanical feature values corresponding to each pixel position are constructed and organized as a dual-channel reference map, with channel one holding the reference value of the rolling resistance coefficient and channel two holding the reference value of the slip probability. Each reference normalized image and its dual-channel reference map are aligned pixel by pixel and bound as a training sample, and all training samples are combined to obtain the training sample set. The learnable parameters of the convolutional layer group and the pixel-wise regression output layer of the material property mapping network are initialized; reference normalized images are read from the training sample set in batches and input into the material property mapping network to obtain the predicted values of the rolling resistance coefficient and the slip probability at each pixel position; the difference between the predicted values and the dual-channel reference map bound to the same training sample is measured at the same pixel position to obtain the training error; gradient backpropagation is performed on the training error and the parameters of the material property mapping network are updated according to the parameter update rule; batch reading, forward computation, error calculation, gradient backpropagation, and parameter update are executed cyclically; training is terminated when the training error is less than the convergence threshold; and the learnable parameters of the material property mapping network at that point are fixed as its pre-training parameters, completing the pre-training of the material property mapping network.
[0058] The difference metric refers to calculating the pixel error by using the squared error of the rolling resistance coefficient and the binary cross-entropy of the slip probability. The total error of the current pixel is obtained by summing the two pixel errors. The total error of all pixel positions in the current training sample is averaged to obtain the sample error. The average of the sample errors of all training samples is calculated to obtain the training error.
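The difference metric above can be sketched as follows. Clipping the predicted slip probability away from 0 and 1 before taking logarithms is an assumption added for numerical safety, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

# (predicted rolling resistance, predicted slip probability,
#  reference rolling resistance, reference slip probability)
PixelRecord = Tuple[float, float, float, float]


def pixel_error(r_pred: float, s_pred: float, r_ref: float, s_ref: float,
                eps: float = 1e-7) -> float:
    """Squared error on rolling resistance plus binary cross-entropy on slip."""
    p = min(max(s_pred, eps), 1.0 - eps)   # assumed clipping, avoids log(0)
    bce = -(s_ref * math.log(p) + (1.0 - s_ref) * math.log(1.0 - p))
    return (r_pred - r_ref) ** 2 + bce


def training_error(samples: Sequence[Sequence[PixelRecord]]) -> float:
    """Average over pixels within each sample, then average over samples."""
    sample_errors = [sum(pixel_error(*px) for px in s) / len(s) for s in samples]
    return sum(sample_errors) / len(sample_errors)
```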
[0059] Divide the training sample set into training subsets and validation subsets while ensuring consistent coverage of various scenarios; perform several rounds of complete traversal training on the property mapping network using parameter update rules, and calculate the training error on the training subset and the validation error on the validation subset at the end of each traversal, forming an error sequence sorted by traversal round; locate the traversal round corresponding to the minimum value of the validation error in the error sequence, and record the training error value corresponding to the traversal round as the convergence threshold.
[0060] The normalized image is input into a pre-trained material property mapping network. The normalized image enters the first convolutional layer of the material property mapping network to perform convolutional sliding calculation to complete weighted convergence in the local neighborhood and output the first layer feature map. The first layer feature map is then transformed nonlinearly and used as the input of the next convolutional layer to continue performing convolutional sliding calculation to output the second layer feature map. The convolutional layer group extracts and fuses the feature maps step by step, so that the output feature map of each level corresponds to the normalized image in spatial position and at the same time gradually improves the expressive power of the feature channels, resulting in a high-dimensional mechanical feature map for characterizing the mechanical properties of the terrain.
[0061] The high-dimensional mechanical feature map is input into the pixel-by-pixel regression calculation part. A linear mapping of the channel direction is performed on the high-dimensional mechanical feature map at each pixel position to output two pixel-by-pixel regression result channels. The first channel is used as the initial predicted value of the rolling resistance coefficient, and the second channel is used as the initial predicted value of the slip probability. The output is the topographic mechanical feature value organized by pixel position.
[0062] S3.3 Perform numerical range constraints and spatial alignment corrections on the numerical values of topographic mechanical features to generate dual-channel physical property maps.
[0063] It should be noted that numerical range constraints are applied to the initial predicted values of the rolling resistance coefficient and the slip probability. The initial predicted values of the rolling resistance coefficient that are less than zero are set to zero to ensure that the rolling resistance coefficient meets the non-negativity constraint. The initial predicted values of the slip probability that are less than zero are set to zero and the initial predicted values of the slip probability that are greater than one are set to one to ensure that the slip probability meets the range constraint of 0 to 1.
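The numerical range constraints amount to element-wise clamping of the two channels, as in the following sketch; the function name `constrain` is hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = Sequence[Sequence[float]]


def constrain(rolling: Grid, slip: Grid) -> Tuple[List[List[float]], List[List[float]]]:
    """Clamp rolling resistance to >= 0 and slip probability to [0, 1]."""
    rolling_c = [[max(v, 0.0) for v in row] for row in rolling]
    slip_c = [[min(max(v, 0.0), 1.0) for v in row] for row in slip]
    return rolling_c, slip_c
```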
[0064] After completing the numerical range constraints, when the pixel grid size of the rolling resistance coefficient and slip probability is inconsistent with the pixel grid size of the visual reference image, spatial alignment correction is performed on the rolling resistance coefficient and slip probability. Specifically, the target width and target height determined by the input requirements of the property mapping network are reused, and the rolling resistance coefficient and slip probability are resampled from the current pixel grid to the pixel grid corresponding to the target width and target height, so that the rolling resistance coefficient channel and the slip probability channel correspond to the same pixel position. During the resampling process, out-of-bounds positions are processed according to the boundary truncation rule to ensure that the pixel position is valid. The two channel results after completing the numerical range constraints and spatial alignment correction are merged and solidified into a dual-channel property map according to the pixel position.
[0065] It should also be noted that existing methods typically use convolutional feature extraction to perform pixel-by-pixel regression of ground properties to output their distribution. However, in outdoor scenes, these methods are easily affected by differences in lighting and scale, leading to issues such as numerical out-of-bounds errors, unstable pixel-level correspondences between the two types of properties, and difficulty in reusing results. This approach extracts features related to terrain mechanics at each level and outputs the rolling resistance coefficient and slip probability simultaneously at each pixel, generating a dual-channel property map. This results in more stable property representation, more consistent positional correspondences, and more reliable subsequent cost calculations and path planning, thus solving the problems of unstable pixel-by-pixel property output and difficulty in alignment.
[0066] S4. Calculate the cost value of the dual-channel property map using a path planning algorithm, and plan the optimal path based on the principle of minimum cost; decompose the optimal path into continuous motion primitives, and generate a motion control command sequence by querying a preset motion primitive library.
[0067] S4.1 Input the rolling resistance coefficient and slip probability of each position in the dual-channel physical property map into the predefined cost calculation function to generate a cost map.
[0068] It should be noted that each pixel position of the dual-channel physical property map is traversed point by point according to the pixel position index. At each pixel position, the rolling resistance coefficient and the slip probability are read together and substituted into the cost value calculation function to obtain the cost value of the current pixel position. When the cost value is less than zero, it is set to zero to avoid negative costs. The cost value calculated for each pixel position is then written at the same pixel position index as in the dual-channel physical property map, forming a cost value map of the same size as the dual-channel physical property map.
[0069] The cost calculation function is expressed as $C(p) = f(\mu(p), s(p))$; where $C(p)$ denotes the cost value at pixel position $p$; $\mu(p)$ denotes the rolling resistance coefficient at pixel position $p$; and $s(p)$ denotes the slip probability at pixel position $p$.
[0070] S4.2 Set the path start point and path end point on the cost value map, find the pixel sequence with the minimum cumulative cost value connecting the path start point and path end point, and generate the optimal path.
[0071] It should be noted that the pixel position index representing the current position of the self-balancing scooter in the cost value map is used as the path start point, and the pixel position index of the locally reachable target in the cost value map is used as the path end point. Specifically, a target candidate pixel band is determined at the far boundary of the cost value map, and the cost values in the target candidate pixel band are read point by point. The pixel position with the smallest cost value is selected as the path end point. When multiple candidate pixel positions have the same cost value, the pixel position closer to the horizontal center of the cost value map is selected as the path end point.
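The end point selection can be sketched as below. The sketch assumes that the far boundary corresponds to the first rows of the cost value map and that the band height is a tunable parameter; both are illustrative assumptions, as is the function name.

```python
from typing import Sequence, Tuple


def select_goal(cost_map: Sequence[Sequence[float]],
                band_rows: int = 1) -> Tuple[int, int]:
    """Pick the cheapest pixel in the far band; prefer the horizontal center on ties."""
    width = len(cost_map[0])
    center = (width - 1) / 2.0
    candidates = [(cost_map[r][c], abs(c - center), r, c)
                  for r in range(band_rows) for c in range(width)]
    # Tuples compare by cost first, then by distance to the center column;
    # any remaining tie falls to the smaller row and column index.
    _, _, row, col = min(candidates)
    return row, col
```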
[0072] Starting from the path's origin, the cumulative cost value of the path's origin is initialized to the cost value of the path's origin pixel position, and the cumulative cost values of the remaining pixel positions are initialized to infinity. A predecessor pixel record table is established to record the previous hop pixel position of each pixel position. Starting from the path's origin, the pixel positions adjacent to the current pixel position are expanded sequentially, and out-of-bounds pixel positions are eliminated. The adjacent pixel positions include the four neighboring pixel positions: above, below, left, and right. The cumulative cost value of the current pixel position is summed with the cost values of the adjacent pixel positions to obtain the candidate cumulative cost value of each adjacent pixel position. When the candidate cumulative cost value is less than the already recorded cumulative cost value of the adjacent pixel position, the cumulative cost value of the adjacent pixel position is updated to the candidate cumulative cost value, and the predecessor pixel of the adjacent pixel position is updated to the current pixel position. In each round of expansion, the pixel position with the smallest cumulative cost value is always selected as the new current pixel position, and the adjacent expansion and update are continued until the path's end point is selected as the current pixel position. From the path's end point, the predecessor pixel record table is used to backtrack back to the path's origin to obtain the pixel sequence connecting the path's origin and end point, and the pixel sequence is reversed to generate the optimal path.
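The expansion procedure above is a shortest-path search of the Dijkstra type on a four-connected grid. A minimal sketch follows; the use of a binary heap to select the pixel position with the smallest cumulative cost value is an implementation choice of this sketch, and the function name `plan_path` is hypothetical. Because the cost values are non-negative, the first time the end point is selected its cumulative cost is minimal.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]   # (row index, column index)


def plan_path(cost_map: Sequence[Sequence[float]], start: Cell, goal: Cell) -> List[Cell]:
    """Return the minimum cumulative cost pixel sequence from start to goal."""
    rows, cols = len(cost_map), len(cost_map[0])
    best: Dict[Cell, float] = {start: cost_map[start[0]][start[1]]}
    pred: Dict[Cell, Cell] = {}   # predecessor pixel record table
    heap = [(best[start], start)]
    done = set()
    while heap:
        acc, cur = heapq.heappop(heap)
        if cur in done:
            continue   # stale heap entry
        done.add(cur)
        if cur == goal:
            break
        r, c = cur
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:   # drop out-of-bounds pixels
                cand = acc + cost_map[nr][nc]
                if cand < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = cand
                    pred[(nr, nc)] = cur
                    heapq.heappush(heap, (cand, (nr, nc)))
    if goal not in done:
        return []   # goal unreachable
    path = [goal]
    while path[-1] != start:   # backtrack through the predecessor table
        path.append(pred[path[-1]])
    path.reverse()
    return path
```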
[0073] S4.3 Perform kinematic analysis and trajectory smoothing on the pixel sequence of the optimal path, and convert it into a continuous motion primitive sequence.
[0074] It should be noted that each pixel position in the pixel sequence of the optimal path consists of a row index and a column index. Adjacent two points are taken out in the order of the pixel sequence to form adjacent pixel position pairs. For each adjacent pixel position pair, the column index difference and the row index difference are calculated to obtain the displacement vector of the adjacent step. Based on the displacement vector, the corresponding travel direction sequence is obtained. The travel direction is limited to the four-neighborhood direction.
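Deriving the travel direction sequence from the pixel sequence can be sketched as follows. The direction labels and the assumption that a decreasing row index means "up" in the map are illustrative choices of this sketch.

```python
from typing import List, Sequence, Tuple

# (row difference, column difference) -> four-neighbourhood direction label
DIRECTIONS = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}


def travel_directions(path: Sequence[Tuple[int, int]]) -> List[str]:
    """Convert adjacent pixel position pairs into a travel direction sequence."""
    directions = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        step = (r1 - r0, c1 - c0)   # displacement vector of the adjacent step
        if step not in DIRECTIONS:
            raise ValueError(f"non-adjacent step {step}")
        directions.append(DIRECTIONS[step])
    return directions
```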
[0075] To eliminate single-step jitter, trajectory smoothing is performed on the travel direction sequence. Specifically, a neighborhood direction consistency determination is performed over several consecutive travel directions to obtain a smoothing direction. Isolated pixel steps that do not conform to the smoothing direction are replaced with pixel steps that do, generating a smoothed pixel sequence while the path start point and path end point are kept unchanged. When an isolated pixel step is replaced, the next pixel position is re-determined from the previous pixel position according to the smoothing direction, ensuring that consecutive pixel positions remain adjacent and do not cross the boundary. If the replacement would result in non-adjacency or in crossing the boundary, the original pixel step is kept unchanged.
[0076] The smoothed pixel sequence is subjected to kinematic analysis and converted into a continuous motion primitive sequence. Specifically, the smoothed pixel sequence is divided into several continuous segments according to the smoothing direction. The continuous segments with the smoothing direction unchanged are merged into linear motion primitives, and the continuous segments with the smoothing direction changed but maintaining the same turning trend are merged into turning motion primitives. The number of pixel steps corresponding to each motion primitive is recorded as the primitive length, and the turning direction corresponding to the change in direction is recorded as the primitive attribute. The motion primitives are arranged in the order of the pixel sequence of the optimal path to form a continuous motion primitive sequence.
[0077] Neighborhood direction consistency determination means that, for each step in the pixel sequence, the occurrences of each travel direction within a window of several consecutive steps are counted, and the direction with the most occurrences is taken as the smoothing direction of the current step. When several directions tie for the most occurrences, the direction consistent with the previous smoothing direction is selected first; if no previous smoothing direction exists, the direction consistent with the original travel direction is selected. The operation is repeated step by step until the entire pixel sequence is processed.
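A single pass of the neighborhood direction consistency determination can be sketched as a windowed majority vote. The symmetric window centered on the current step, the window half-width, and the final fallback to the first tied direction are assumptions of this sketch; the function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def smooth_directions(directions: Sequence[str], half_window: int = 1) -> List[str]:
    """Majority vote over a window of original directions around each step."""
    smoothed: List[str] = []
    for i, original in enumerate(directions):
        window = directions[max(0, i - half_window): i + half_window + 1]
        counts = Counter(window)
        top = max(counts.values())
        tied = [d for d in counts if counts[d] == top]
        if len(tied) == 1:
            choice = tied[0]
        elif smoothed and smoothed[-1] in tied:
            choice = smoothed[-1]   # prefer the previous smoothing direction
        elif original in tied:
            choice = original       # otherwise keep the original direction
        else:
            choice = tied[0]        # assumed fallback, not stated in the text
        smoothed.append(choice)
    return smoothed
```

For instance, the isolated "right" step in the sequence up, up, right, up, up is voted out, whereas a sustained change from "up" to "right" is preserved.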
[0078] S4.4 Query the preset motion primitive library, map each motion primitive in the motion primitive sequence to the corresponding motion control instruction parameter, and integrate them into a motion control instruction sequence.
[0079] It should be noted that the motion primitive library establishes an entry set based on primitive type, turning direction, and primitive length range as an index, and fixes a set of motion control command parameters within each entry. The motion control command parameter set includes speed-type command values describing straight-line or turning motions and execution steps describing the execution length. It is divided into a set of straight-line motion primitive entries and a set of turning motion primitive entries according to primitive type. Within the turning motion primitive entry set, subsets are established for left turns and right turns, and corresponding entries are established within each set according to several primitive length ranges. In a controllable field, the same control cycle and the same execution step length measurement method are fixed, and parameters for straight-line and turning motions are tuned and verified separately to ensure that the motion control command parameter set corresponding to each entry can stably generate a motion pattern consistent with the primitive type, and that the execution steps and primitive length ranges have a consistent coverage relationship. The entry index and the corresponding motion control command parameter set are fixed to form the motion primitive library.
[0080] The continuous motion primitive sequence is read and the motion primitive records are extracted one by one in order. From each motion primitive record, the primitive type field, primitive length field, and turning direction field are read in turn. The primitive type field and turning direction field are used as the first-level index conditions of the motion primitive library to locate the corresponding entry set. Within that entry set, the primitive length field is used as the second-level index condition to match a primitive length range and select a unique motion primitive library entry. The motion control command parameters are extracted from the selected entry, and the fixed execution steps in those parameters are read as the execution steps of the current motion primitive, completing the parameter instantiation. When the primitive length field falls near the boundary between two adjacent primitive length ranges, the entry of the closer range is selected according to the nearest principle. The instantiated motion control command parameters are appended to the motion control command sequence cache in the order in which the motion primitives appear in the continuous motion primitive sequence, and the speed-type command values of each set of motion control command parameters are kept consistent with those of the next set. The process of reading motion primitives, querying the motion primitive library, instantiating parameters, and appending to the sequence is repeated until the continuous motion primitive sequence has been processed, generating the motion control command sequence.
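The two-level lookup can be illustrated with a small table-driven sketch. Every entry, field name, length range, and numeric value below is hypothetical and chosen only to show the indexing scheme; the actual library contents are tuned and verified in a controllable field as described in this embodiment.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, Optional[str]]                     # (primitive type, turning direction)
Entry = Tuple[Tuple[int, int], Dict[str, float]]    # ((min length, max length), parameters)

# Hypothetical library: first-level index is the key, second-level is the length range.
LIBRARY: Dict[Key, List[Entry]] = {
    ("straight", None): [((1, 5), {"speed": 0.3, "steps": 5}),
                         ((6, 20), {"speed": 0.5, "steps": 20})],
    ("turn", "left"): [((1, 10), {"speed": 0.2, "yaw_rate": 0.5, "steps": 10})],
    ("turn", "right"): [((1, 10), {"speed": 0.2, "yaw_rate": -0.5, "steps": 10})],
}


def lookup(primitive: dict) -> Dict[str, float]:
    """Select the entry whose length range contains, or is nearest to, the length."""
    entries = LIBRARY[(primitive["type"], primitive.get("turn"))]
    length = primitive["length"]

    def distance(entry: Entry) -> int:
        (lo, hi), _ = entry
        return 0 if lo <= length <= hi else min(abs(length - lo), abs(length - hi))

    return dict(min(entries, key=distance)[1])   # copy so the library stays fixed


def build_command_sequence(primitives: List[dict]) -> List[Dict[str, float]]:
    """Instantiate command parameters for each primitive, in sequence order."""
    return [lookup(p) for p in primitives]
```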
[0081] This embodiment also provides a computer device applicable to the environmental perception method for a self-balancing scooter used in field operations, comprising: a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the environmental perception method for a self-balancing scooter used in field operations as proposed in the above embodiment.
[0082] The computer device can be a terminal, comprising a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.
[0083] This embodiment also provides a storage medium storing a computer program that, when executed by a processor, implements the environmental perception method for a self-balancing scooter used in field operations as proposed in the above embodiments. The storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0084] In summary, this invention overcomes the representation drift problem caused by drastic changes in ambient light in the wild by constructing a scene illumination correction model and inputting a group of multi-view images to generate a visual reference image. This achieves stable extraction of the essential visual features of the scene and improves the robustness and reliability of perception. Furthermore, by performing pixel-by-pixel regression on the visual reference image through a pre-trained physical property mapping network, a dual-channel physical property map containing rolling resistance coefficient and slip probability is output. This enables path planning to perform refined cost-benefit assessment and optimal path generation based on physical indicators such as terrain passability, energy consumption, and slip risk. Thus, while ensuring navigation safety, the invention optimizes the driving efficiency and energy consumption of the mobile platform.
[0085] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.
Claims
1. An environmental perception method for a self-balancing scooter used in field operations, characterized in that it comprises: collecting a synchronous multi-view image sequence of the current field environment, and performing image preprocessing on the synchronous multi-view image sequence to generate a multi-view image group; constructing a scene illumination correction model, inputting the multi-view image group into the scene illumination correction model, and generating a visual reference image by decoupling the visual attributes and instantaneous illumination parameters of the multi-view image group; inputting the visual reference image into a pre-trained physical property mapping network, and regressing terrain mechanical features pixel by pixel through a deep convolutional structure to output a dual-channel physical property map, wherein the dual-channel physical property map includes the rolling resistance coefficient and slip probability at each corresponding position; calculating cost values for the dual-channel physical property map using a path planning algorithm, and planning the optimal path based on the principle of minimum cost; and deconstructing the optimal path into continuous motion primitives, and generating a motion control command sequence by querying a preset motion primitive library.
2. The environmental perception method for a self-balancing scooter used in field operations as described in claim 1, characterized in that: The specific steps for generating the multi-view image group are as follows: Raw images of the field environment are collected simultaneously according to a unified trigger signal, and a timestamp and camera number are added to each raw image; Each raw image is denoised using a Gaussian filtering algorithm; All denoised raw images are then combined to generate the multi-view image group.
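As an illustration of the preprocessing step recited in claim 2, the following is a minimal sketch only. The names Frame, gaussian_denoise and build_image_group, the grayscale list-of-lists image layout, the border handling, and the sigma, radius and tolerance values are all assumptions made for the example; the claims do not specify them.

```python
import math
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of intensities (assumed layout)


@dataclass
class Frame:
    camera_id: int    # camera number attached at capture time
    timestamp: float  # time of the shared trigger signal, in seconds
    pixels: Image


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Return a normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(img: Image, kernel: List[float]) -> Image:
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)  # clamp to the image
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def _transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gaussian_denoise(img: Image, sigma: float = 1.0, radius: int = 2) -> Image:
    """Separable Gaussian filter: blur along rows, then along columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = _blur_rows(img, kernel)
    return _transpose(_blur_rows(_transpose(horizontal), kernel))


def build_image_group(frames: List[Frame], sigma: float = 1.0,
                      tolerance: float = 1e-3) -> List[Frame]:
    """Check that all frames share one trigger time, denoise, and group them."""
    t0 = frames[0].timestamp
    if any(abs(f.timestamp - t0) > tolerance for f in frames):
        raise ValueError("frames do not belong to the same trigger signal")
    ordered = sorted(frames, key=lambda f: f.camera_id)
    return [Frame(f.camera_id, f.timestamp, gaussian_denoise(f.pixels, sigma))
            for f in ordered]
```

Rejecting frames whose timestamps differ by more than the tolerance is one simple way to enforce the "same time" condition; a practical system might instead rely on hardware synchronization.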
3. The environmental perception method for a self-balancing scooter used in field operations as described in claim 2, characterized in that: The specific steps for constructing the scene illumination correction model are as follows: A neural scene encoder is constructed based on a neural radiance field architecture, and an illumination parameter estimator is constructed based on the principle of spherical harmonic illumination; A visual attribute decoupler is constructed based on the inversion calculation of the reflection equation, and an image renderer is constructed based on a differentiable rendering mechanism; The scene illumination correction model is constructed from the neural scene encoder, the illumination parameter estimator, the visual attribute decoupler, and the image renderer.
4. The environmental perception method for a self-balancing scooter used in field operations as described in claim 3, characterized in that: The specific steps for inputting the multi-view image group into the scene illumination correction model and generating the visual reference image by decoupling the visual attributes and instantaneous illumination parameters of the multi-view image group are as follows: The multi-view image group is input into the scene illumination correction model, the neural scene encoder extracts geometric appearance features, and the illumination parameter estimator parses the instantaneous illumination parameters; The visual attribute decoupler uses the instantaneous illumination parameters to decouple the geometric appearance features and outputs the visual attributes; The image renderer renders based on the visual attributes and preset standard illumination parameters to generate the visual reference image.
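The decouple-then-render idea in claim 4 can be illustrated with a deliberately simplified sketch. It replaces the learned neural scene encoder and illumination parameter estimator with a closed-form Lambertian model, which is an assumption of this example and not the claimed implementation: the observed intensity is taken to be albedo times shading, shading is a first-order spherical harmonic expansion in the surface normal, and the four coefficients are assumed to already absorb the basis constants. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sh_shading(normal: Vec3, coeffs: Sequence[float]) -> float:
    """First-order spherical harmonic shading: a constant term plus three
    terms linear in the unit normal (basis constants folded into coeffs)."""
    nx, ny, nz = normal
    c0, c1, c2, c3 = coeffs
    return c0 + c1 * nx + c2 * ny + c3 * nz


def decouple_albedo(observed: float, normal: Vec3,
                    instant_coeffs: Sequence[float], eps: float = 1e-6) -> float:
    """Invert the reflection model: divide out the instantaneous shading.
    Shading is floored at eps to avoid division by zero."""
    shading = max(sh_shading(normal, instant_coeffs), eps)
    return observed / shading


def render_reference(albedo: float, normal: Vec3,
                     standard_coeffs: Sequence[float]) -> float:
    """Re-render the lighting-independent albedo under the preset standard lighting."""
    return albedo * max(sh_shading(normal, standard_coeffs), 0.0)


def relight(pixels: List[float], normals: List[Vec3],
            instant_coeffs: Sequence[float],
            standard_coeffs: Sequence[float]) -> List[float]:
    """Map observed intensities to reference intensities, pixel by pixel."""
    return [render_reference(decouple_albedo(p, n, instant_coeffs), n, standard_coeffs)
            for p, n in zip(pixels, normals)]
```

Under these assumptions, the same surface observed in bright and in dim light maps to the same reference intensity, which is the stability property the visual reference image is meant to provide.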
5. The environmental perception method for a self-balancing scooter used in field operations as described in claim 4, characterized in that: The specific steps for inputting the visual reference image into the pre-trained physical property mapping network, regressing terrain mechanical features pixel by pixel through a deep convolutional structure, and outputting the dual-channel physical property map are as follows: The visual reference image is resized and its pixel values are normalized to generate a normalized image; Visual features related to terrain mechanical properties in the normalized image are extracted step by step through the convolutional layers of the physical property mapping network to generate a high-dimensional mechanical feature map; Pixel-by-pixel regression is performed on the high-dimensional mechanical feature map to output the terrain mechanical feature values for each pixel location; Numerical range constraints and spatial alignment corrections are applied to the terrain mechanical feature values to generate the dual-channel physical property map.
6. The environmental perception method for a self-balancing scooter used in field operations as described in claim 5, characterized in that: The terrain mechanical feature values include the initial predicted values of the rolling resistance coefficient and the slip probability.
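The post-processing of the initial predicted values described in claims 5 and 6 might look like the sketch below. The clamping bounds (rolling resistance coefficient in [0, rr_max], slip probability in [0, 1]) and the use of nearest-neighbour resampling as the "spatial alignment correction" are assumptions chosen for illustration; the claims do not state which constraints or alignment method are used.

```python
from typing import List, Tuple

Channel = List[List[float]]


def clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))


def constrain(raw_rr: Channel, raw_slip: Channel,
              rr_max: float = 1.0) -> Tuple[Channel, Channel]:
    """Apply numerical range constraints to the initial predictions."""
    rr = [[clamp(v, 0.0, rr_max) for v in row] for row in raw_rr]
    slip = [[clamp(v, 0.0, 1.0) for v in row] for row in raw_slip]
    return rr, slip


def resize_nearest(channel: Channel, out_h: int, out_w: int) -> Channel:
    """Resample a channel to out_h x out_w by nearest-neighbour lookup, so the
    map lines up with the resolution of the original image."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]
```

The two constrained and aligned channels together would form the dual-channel physical property map consumed by the path planner.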
7. The environmental perception method for a self-balancing scooter used in field operations as described in claim 1, characterized in that: The specific steps for calculating cost values for the dual-channel physical property map using a path planning algorithm and planning the optimal path based on the principle of minimum cost are as follows: The rolling resistance coefficient and slip probability at each location in the dual-channel physical property map are input into a predefined cost calculation function to generate a cost map; The path start and end points are set on the cost map, the pixel sequence with the minimum cumulative cost connecting the path start and end points is found, and the optimal path is generated.
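The two steps of claim 7 can be sketched as follows. The claim does not name the cost function or the search algorithm, so this example assumes a weighted linear cost (the weights w_rr and w_slip and the constant base term are hypothetical) and uses Dijkstra's algorithm on a 4-connected pixel grid, where the cost of a path is the sum of the costs of the cells it visits.

```python
import heapq
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column)


def cost_map(rr: List[List[float]], slip: List[List[float]],
             w_rr: float = 1.0, w_slip: float = 5.0,
             base: float = 0.1) -> List[List[float]]:
    """Assumed cost function: base + w_rr * rolling resistance + w_slip * slip probability."""
    return [[base + w_rr * r + w_slip * s for r, s in zip(rr_row, slip_row)]
            for rr_row, slip_row in zip(rr, slip)]


def min_cost_path(costs: List[List[float]], start: Cell,
                  goal: Cell) -> Tuple[List[Cell], float]:
    """Dijkstra search for the pixel sequence with minimum cumulative cost."""
    h, w = len(costs), len(costs[0])
    best: Dict[Cell, float] = {start: costs[start[0]][start[1]]}
    parent: Dict[Cell, Cell] = {}
    heap = [(best[start], start)]
    while heap:
        c, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if c > best[cell]:
            continue  # stale heap entry, a cheaper route was already found
        y, x = cell
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w:
                nc = c + costs[ny][nx]
                if nc < best.get((ny, nx), float("inf")):
                    best[(ny, nx)] = nc
                    parent[(ny, nx)] = cell
                    heapq.heappush(heap, (nc, (ny, nx)))
    if goal not in best:
        raise ValueError("goal is not reachable from start")
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]
```

With a strictly positive base term every cell has positive cost, which Dijkstra's algorithm requires; a heavily weighted slip probability makes the planner detour around slippery cells even when the detour is longer.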
8. The environmental perception method for a self-balancing scooter used in field operations as described in claim 7, characterized in that: The specific steps for deconstructing the optimal path into continuous motion primitives and generating the motion control command sequence by querying the preset motion primitive library are as follows: The pixel sequence of the optimal path is subjected to kinematic analysis and trajectory smoothing, and then converted into a continuous motion primitive sequence; The preset motion primitive library is queried, each motion primitive in the motion primitive sequence is mapped to its corresponding motion control command parameters, and these parameters are integrated into the motion control command sequence.
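A minimal sketch of the primitive lookup in claim 8 is given below. The contents of PRIMITIVE_LIBRARY (primitive names, velocities and durations) are hypothetical placeholders, the path is assumed to be 4-connected in image coordinates with rows increasing downward, and the kinematic analysis and trajectory smoothing recited in the claim are omitted for brevity.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column), rows increase downward

# Hypothetical preset library: primitive name -> command parameters.
PRIMITIVE_LIBRARY: Dict[str, Dict[str, float]] = {
    "forward": {"v": 0.5, "omega": 0.0, "duration": 1.0},
    "turn_left": {"v": 0.0, "omega": 0.8, "duration": 2.0},
    "turn_right": {"v": 0.0, "omega": -0.8, "duration": 2.0},
}


def path_to_primitives(path: List[Cell]) -> List[str]:
    """Convert a pixel sequence into forward and turn primitives."""
    primitives: List[str] = []
    heading = None
    for (y0, x0), (y1, x1) in zip(path, path[1:]):
        step = (y1 - y0, x1 - x0)
        if heading is not None and step != heading:
            # 2-D cross product of heading and step in (x, y) components;
            # with rows increasing downward, a positive value is a right turn.
            cross = heading[1] * step[0] - heading[0] * step[1]
            if cross == 0:
                raise ValueError("path reverses direction")
            primitives.append("turn_right" if cross > 0 else "turn_left")
        primitives.append("forward")
        heading = step
    return primitives


def primitives_to_commands(primitives: List[str]) -> List[Dict[str, float]]:
    """Query the library and assemble the motion control command sequence."""
    return [PRIMITIVE_LIBRARY[name] for name in primitives]
```

For example, a path that runs east and then south yields a forward primitive, a right turn, and another forward primitive, each of which is then replaced by its command parameters from the library.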
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that: When the processor executes the computer program, it implements the steps of the environmental perception method for a self-balancing scooter used in field operations as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that: When the computer program is executed by a processor, it implements the steps of the environmental perception method for a self-balancing scooter used in field operations as described in any one of claims 1 to 8.